linux-mm.kvack.org archive mirror
* [RFC PATCH 0/6] proactive kcompactd
@ 2017-07-27 16:06 Vlastimil Babka
  2017-07-27 16:06 ` [PATCH 1/6] mm, kswapd: refactor kswapd_try_to_sleep() Vlastimil Babka
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Vlastimil Babka @ 2017-07-27 16:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner, Andrea Arcangeli, Rik van Riel, Vlastimil Babka

As we discussed at the last LSF/MM [1], the goal here is to shift more
compaction work to kcompactd, which currently just makes a single high-order
page available and then goes to sleep. The last patch, evolved from the initial
RFC [2], does this by recording, for each order > 0, how many allocations would
potentially have been able to skip direct compaction if the memory wasn't
fragmented. Kcompactd then tries to compact for as long as it takes to make
that many allocations satisfiable. This approach avoids any hooks in allocator
fast paths. There are more details to this; see the last patch.
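
Roughly, the recording boils down to the following check in the allocation
slowpath (a simplified excerpt of the last patch's kcompactd_inc_free_target();
mark is the allocation's watermark and pgdat the zone's node - see the last
patch for the real code and how the counters are consumed by kcompactd):

	/*
	 * The high-order watermark check failed, but an order-0 check for
	 * the same amount of memory passes: the problem is fragmentation,
	 * not lack of memory, so record the missed high-order allocation.
	 */
	if (!zone_watermark_ok(zone, order, mark, classzone_idx, alloc_flags) &&
	     zone_watermark_ok(zone, 0, mark + (1UL << order) - 1,
			       classzone_idx, alloc_flags))
		atomic_inc(&pgdat->compact_free_target[order]);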

The first 4 patches fix some corner cases where kcompactd wasn't properly woken
up in my basic testing; they could be reviewed and merged immediately if found
OK. Patch 5 terminates compaction (direct or kcompactd) faster when free memory
has been consumed in parallel. IIRC, something similar was already proposed by
Joonsoo.

First I did some basic testing with the workload described in patches 2-4,
where memory is fragmented by allocating a large file and then punching a hole
in every other page. A test doing GFP_NOWAIT allocations with short sleeps then
fails an allocation, which wakes up kcompactd so that the next allocation
succeeds, then another one fails, waking up kcompactd again, etc. After the
series, the number of consecutive successes gradually grows as kcompactd
increases its target, and then drops as all free memory is depleted.
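
For reference, the fragmentation step can be reproduced with a minimal
userspace program along these lines (the file path and size are illustrative,
and the GFP_NOWAIT allocation loop itself is presumably done from a kernel
test module, so it is not shown):

	/* Fill memory with a large file on shmem/tmpfs, then punch a hole
	 * in every even page so every odd page stays allocated and free
	 * memory ends up fragmented into order-0 pages. */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		long page = sysconf(_SC_PAGESIZE);
		off_t size = 1L << 30;	/* adjust to roughly the available RAM */
		int fd = open("/dev/shm/frag", O_CREAT | O_RDWR, 0600);
		off_t off;

		if (fd < 0 || fallocate(fd, 0, 0, size))
			return 1;
		for (off = 0; off < size; off += 2 * page)
			fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				  off, page);
		return 0;
	}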

Then I did some more measurements with mmtests stress-highalloc (3 iterations)
configured for allocating order-4 pages with GFP_NOWAIT, to make it rely on
kcompactd completely. The baseline kernel is 4.12.3 plus "mm, page_alloc:
fallback to smallest page when not stealing whole pageblock"

                               4.12.3                4.12.3                4.12.3                4.12.3
                                 base                patch4                patch5                patch6
Success 1 Min         71.00 (  0.00%)       71.00 (  0.00%)       71.00 (  0.00%)       74.00 ( -4.23%)
Success 1 Mean        72.33 (  0.00%)       72.33 (  0.00%)       72.33 (  0.00%)       75.00 ( -3.69%)
Success 1 Max         73.00 (  0.00%)       74.00 ( -1.37%)       74.00 ( -1.37%)       76.00 ( -4.11%)
Success 2 Min         78.00 (  0.00%)       74.00 (  5.13%)       76.00 (  2.56%)       80.00 ( -2.56%)
Success 2 Mean        80.00 (  0.00%)       77.33 (  3.33%)       79.33 (  0.83%)       81.67 ( -2.08%)
Success 2 Max         81.00 (  0.00%)       81.00 (  0.00%)       82.00 ( -1.23%)       84.00 ( -3.70%)
Success 3 Min         88.00 (  0.00%)       88.00 (  0.00%)       91.00 ( -3.41%)       90.00 ( -2.27%)
Success 3 Mean        88.33 (  0.00%)       88.67 ( -0.38%)       91.33 ( -3.40%)       90.67 ( -2.64%)
Success 3 Max         89.00 (  0.00%)       90.00 ( -1.12%)       92.00 ( -3.37%)       91.00 ( -2.25%)

Success rates didn't change much; they are already quite high for order-4
GFP_NOWAIT allocations.

                                    4.12.3       4.12.3     4.12.3      4.12.3
                                      base      patch4      patch5      patch6
Kcompactd wakeups                    15705       16312       15335       20234
Compaction stalls                      155         130         135         130
Compaction success                      86          69          79          82
Compaction failures                     69          61          56          48
Page migrate success                925279      945304      954274     1363974
Page migrate failure                 77284       76466       82188       15060
Compaction pages isolated          1947918     1987482     2008541     2748177
Compaction migrate scanned      1501322768  1590902016  1601288469    85829846
Compaction free scanned          509199325   526560956   522041706   104149027
Compaction cost                      11507       12156       12238        2069

Not much happens until patch 6, which results in more kcompactd wakeups, but
surprisingly much lower scanning activity and improved migration stats.

Same test, but order-9 (again GFP_NOWAIT)

                               4.12.3                4.12.3                4.12.3                4.12.3
                                 base                patch4                patch5                patch6
Success 1 Min         57.00 (  0.00%)       56.00 (  1.75%)       54.00 (  5.26%)       56.00 (  1.75%)
Success 1 Mean        59.00 (  0.00%)       59.33 ( -0.56%)       56.00 (  5.08%)       58.00 (  1.69%)
Success 1 Max         60.00 (  0.00%)       63.00 ( -5.00%)       58.00 (  3.33%)       60.00 (  0.00%)
Success 2 Min         66.00 (  0.00%)       66.00 (  0.00%)       67.00 ( -1.52%)       65.00 (  1.52%)
Success 2 Mean        66.33 (  0.00%)       67.00 ( -1.01%)       67.00 ( -1.01%)       66.33 (  0.00%)
Success 2 Max         67.00 (  0.00%)       68.00 ( -1.49%)       67.00 (  0.00%)       68.00 ( -1.49%)
Success 3 Min         53.00 (  0.00%)       56.00 ( -5.66%)       51.00 (  3.77%)       57.00 ( -7.55%)
Success 3 Mean        56.00 (  0.00%)       57.00 ( -1.79%)       54.33 (  2.98%)       57.33 ( -2.38%)
Success 3 Max         58.00 (  0.00%)       59.00 ( -1.72%)       58.00 (  0.00%)       58.00 (  0.00%)

                                    4.12.3       4.12.3     4.12.3      4.12.3
                                      base      patch4      patch5      patch6
Kcompactd wakeups                      992        1676        1749        1661
Compaction stalls                      134         139         151          91
Compaction success                      93          83         103          53
Compaction failures                     41          55          48          37
Page migrate success                885733      904325      849397      869434
Page migrate failure                  8261       12819       12299       10288
Compaction pages isolated          1779692     1822833     1713638     1749977
Compaction migrate scanned        95755848    87494396    96276153    18487127
Compaction migrate prescanned            0           0           0           0
Compaction free scanned           33409748    38040646    34997109    15738289
Compaction free direct alloc             0           0           0           0
Compaction free dir. all. miss           0           0           0           0
Compaction cost                       1623        1585        1588        1065

Order-9 allocations are more likely to trigger the corner cases fixed by
patches 2-4, and thus we see increased kcompactd wakeups with patch 4.
Patch 6 again significantly decreases the number of pages scanned. It's not
yet clear why. An optimistic explanation would be that creating more free
high-order pages at once is more efficient than repeatedly creating a single
page and always uselessly rescanning part of the zone, but it needs more
investigation. I will also redo the test with gfp flags that allow direct
compaction, and see whether the series does shift the direct compaction effort
into kcompactd as expected. Meanwhile I would like some feedback on whether
this is going in the right direction or not...

[1] https://lwn.net/Articles/717656/
[2] https://marc.info/?l=linux-mm&m=148898500006034

Vlastimil Babka (6):
  mm, kswapd: refactor kswapd_try_to_sleep()
  mm, kswapd: don't reset kswapd_order prematurely
  mm, kswapd: reset kswapd's order to 0 when it fails to reclaim enough
  mm, kswapd: wake up kcompactd when kswapd had too many failures
  mm, compaction: stop when number of free pages goes below watermark
  mm: make kcompactd more proactive

 include/linux/compaction.h |   6 ++
 include/linux/mmzone.h     |   3 +
 mm/compaction.c            | 226 ++++++++++++++++++++++++++++++++++++++++++++-
 mm/page_alloc.c            |  13 +++
 mm/vmscan.c                | 149 +++++++++++++++++-------------
 5 files changed, 329 insertions(+), 68 deletions(-)

-- 
2.13.3

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/6] mm, kswapd: refactor kswapd_try_to_sleep()
  2017-07-27 16:06 [RFC PATCH 0/6] proactive kcompactd Vlastimil Babka
@ 2017-07-27 16:06 ` Vlastimil Babka
  2017-07-28  9:38   ` Mel Gorman
  2017-07-27 16:06 ` [PATCH 2/6] mm, kswapd: don't reset kswapd_order prematurely Vlastimil Babka
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2017-07-27 16:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner, Andrea Arcangeli, Rik van Riel, Vlastimil Babka

The code of kswapd_try_to_sleep() is unnecessarily hard to follow. Also, we
needlessly call prepare_kswapd_sleep() twice if the first call fails.
Restructure the code so that each non-success case is accounted and returns
immediately.

This patch should not introduce any functional change, except in one case:
when the first prepare_kswapd_sleep() would have returned false and the second
would have returned true (because somebody else has freed memory in the
meantime), kswapd would sleep before this patch and now it won't. This has
likely been an accidental property of the implementation, and it is extremely
rare in practice anyway.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 88 ++++++++++++++++++++++++++++++-------------------------------
 1 file changed, 44 insertions(+), 44 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8ad39bbc79e6..9b6dfa67131e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3385,65 +3385,65 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 	 * eligible zone balanced that it's also unlikely that compaction will
 	 * succeed.
 	 */
-	if (prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
-		/*
-		 * Compaction records what page blocks it recently failed to
-		 * isolate pages from and skips them in the future scanning.
-		 * When kswapd is going to sleep, it is reasonable to assume
-		 * that pages and compaction may succeed so reset the cache.
-		 */
-		reset_isolation_suitable(pgdat);
+	if (!prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
+		count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
+		goto out;
+	}
 
-		/*
-		 * We have freed the memory, now we should compact it to make
-		 * allocation of the requested order possible.
-		 */
-		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
+	/*
+	 * Compaction records what page blocks it recently failed to isolate
+	 * pages from and skips them in the future scanning.  When kswapd is
+	 * going to sleep, it is reasonable to assume that pages and compaction
+	 * may succeed so reset the cache.
+	 */
+	reset_isolation_suitable(pgdat);
+
+	/*
+	 * We have freed the memory, now we should compact it to make
+	 * allocation of the requested order possible.
+	 */
+	wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
 
-		remaining = schedule_timeout(HZ/10);
+	remaining = schedule_timeout(HZ/10);
 
+	/* After a short sleep, check if it was a premature sleep. */
+	if (remaining) {
 		/*
 		 * If woken prematurely then reset kswapd_classzone_idx and
 		 * order. The values will either be from a wakeup request or
 		 * the previous request that slept prematurely.
 		 */
-		if (remaining) {
-			pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
-			pgdat->kswapd_order = max(pgdat->kswapd_order, reclaim_order);
-		}
+		pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
+		pgdat->kswapd_order = max(pgdat->kswapd_order, reclaim_order);
+
+		count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
+		goto out;
+	}
 
-		finish_wait(&pgdat->kswapd_wait, &wait);
-		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+	/* If not, then go fully to sleep until explicitly woken up. */
+	finish_wait(&pgdat->kswapd_wait, &wait);
+	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+	if (!prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
+		count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
+		goto out;
 	}
 
+	trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+
 	/*
-	 * After a short sleep, check if it was a premature sleep. If not, then
-	 * go fully to sleep until explicitly woken up.
+	 * vmstat counters are not perfectly accurate and the estimated value
+	 * for counters such as NR_FREE_PAGES can deviate from the true value by
+	 * nr_online_cpus * threshold. To avoid the zone watermarks being
+	 * breached while under pressure, we reduce the per-cpu vmstat threshold
+	 * while kswapd is awake and restore them before going back to sleep.
 	 */
-	if (!remaining &&
-	    prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
-		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+	set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
 
-		/*
-		 * vmstat counters are not perfectly accurate and the estimated
-		 * value for counters such as NR_FREE_PAGES can deviate from the
-		 * true value by nr_online_cpus * threshold. To avoid the zone
-		 * watermarks being breached while under pressure, we reduce the
-		 * per-cpu vmstat threshold while kswapd is awake and restore
-		 * them before going back to sleep.
-		 */
-		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
-
-		if (!kthread_should_stop())
-			schedule();
+	if (!kthread_should_stop())
+		schedule();
 
-		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
-	} else {
-		if (remaining)
-			count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
-		else
-			count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
-	}
+	set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
+out:
 	finish_wait(&pgdat->kswapd_wait, &wait);
 }
 
-- 
2.13.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/6] mm, kswapd: don't reset kswapd_order prematurely
  2017-07-27 16:06 [RFC PATCH 0/6] proactive kcompactd Vlastimil Babka
  2017-07-27 16:06 ` [PATCH 1/6] mm, kswapd: refactor kswapd_try_to_sleep() Vlastimil Babka
@ 2017-07-27 16:06 ` Vlastimil Babka
  2017-07-28 10:16   ` Mel Gorman
  2017-07-27 16:06 ` [PATCH 3/6] mm, kswapd: reset kswapd's order to 0 when it fails to reclaim enough Vlastimil Babka
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2017-07-27 16:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner, Andrea Arcangeli, Rik van Riel, Vlastimil Babka

This patch deals with a corner case found when testing kcompactd with a very
simple testcase that first fragments memory (by creating a large shmem file and
then punching a hole in every even page) and then uses artificial order-9
GFP_NOWAIT allocations in a loop. This is freshly after a virtme-run boot in
KVM, with no other activity.

What happens is that kswapd always reclaims too little to get over
compact_gap() in kswapd_shrink_node(), so it doesn't set sc->order to 0 and
thus "goto kswapd_try_sleep" in kswapd() doesn't happen. In the next iteration
of the kswapd() loop, alloc_order and reclaim_order are read again from
pgdat->kswapd_order, which the previous iteration has reset to 0, and there was
no other kswapd wakeup in the meantime (the workload inserts short sleeps
between allocations). With the working order 0, the node appears balanced and
wakeup_kcompactd() does nothing.

This part is fixed by setting the alloc/reclaim order to the maximum of the
value used for balancing in the previous iteration and the order of any newly
arrived kswapd wakeup. This mirrors what we do for classzone_idx already.

The next problem comes when kswapd_try_to_sleep() fails to sleep, because the
node is not balanced for order-9. Then it again reads pgdat->kswapd_order and
classzone_idx, which have been reset in the previous iteration. Then it has
nothing to balance and goes to sleep with order-0 for the balance check and the
kcompactd wakeup. Arguably it should continue with the original order and
classzone_idx, for which the node is still not balanced. This patch makes
kswapd_try_to_sleep() indicate whether it has been successful with a full
sleep; only then are kswapd_order and classzone_idx read freshly and reset.
Otherwise, we again take the maximum of the current value and any wakeup
attempts that arrived in the meantime. This has been partially done for the
case of premature wakeup in kswapd_try_to_sleep(), so we can now remove that
code.

These changes might potentially make kswapd loop uselessly for a high-order
wakeup. If it has enough to reclaim to overcome the compact gap,
kswapd_shrink_node() will reset the order to 0 and defer to kcompactd. If it
has nothing to reclaim, pgdat->kswapd_failures will eventually exceed
MAX_RECLAIM_RETRIES and send kswapd to sleep. This is what ultimately happens
in the test scenario above. The remaining possible case is that kswapd
repeatedly reclaims more than 0 but less than compact_gap() worth of pages. In
that case it should arguably also defer to kcompactd, and right now it doesn't.
This is handled in the next patch.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 51 +++++++++++++++++++++++++++++----------------------
 1 file changed, 29 insertions(+), 22 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9b6dfa67131e..ae897a85e7f3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3367,14 +3367,19 @@ static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
 	return max(pgdat->kswapd_classzone_idx, classzone_idx);
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
+/*
+ * Return true if kswapd fully slept because pgdat was balanced and there was
+ * no premature wakeup.
+ */
+static bool kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
 				unsigned int classzone_idx)
 {
 	long remaining = 0;
 	DEFINE_WAIT(wait);
+	bool ret = false;
 
 	if (freezing(current) || kthread_should_stop())
-		return;
+		return false;
 
 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
@@ -3408,14 +3413,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 
 	/* After a short sleep, check if it was a premature sleep. */
 	if (remaining) {
-		/*
-		 * If woken prematurely then reset kswapd_classzone_idx and
-		 * order. The values will either be from a wakeup request or
-		 * the previous request that slept prematurely.
-		 */
-		pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
-		pgdat->kswapd_order = max(pgdat->kswapd_order, reclaim_order);
-
 		count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
 		goto out;
 	}
@@ -3429,6 +3426,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 	}
 
 	trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+	ret = true;
 
 	/*
 	 * vmstat counters are not perfectly accurate and the estimated value
@@ -3442,9 +3440,9 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 	if (!kthread_should_stop())
 		schedule();
 
-	set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
 out:
 	finish_wait(&pgdat->kswapd_wait, &wait);
+	return ret;
 }
 
 /*
@@ -3462,7 +3460,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
  */
 static int kswapd(void *p)
 {
-	unsigned int alloc_order, reclaim_order;
+	int alloc_order, reclaim_order;
 	unsigned int classzone_idx = MAX_NR_ZONES - 1;
 	pg_data_t *pgdat = (pg_data_t*)p;
 	struct task_struct *tsk = current;
@@ -3493,23 +3491,32 @@ static int kswapd(void *p)
 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
-	pgdat->kswapd_order = 0;
+	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
 	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	for ( ; ; ) {
 		bool ret;
 
-		alloc_order = reclaim_order = pgdat->kswapd_order;
+		alloc_order = reclaim_order = max(alloc_order, pgdat->kswapd_order);
 		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
 
 kswapd_try_sleep:
-		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
-					classzone_idx);
-
-		/* Read the new order and classzone_idx */
-		alloc_order = reclaim_order = pgdat->kswapd_order;
-		classzone_idx = kswapd_classzone_idx(pgdat, 0);
-		pgdat->kswapd_order = 0;
-		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
+		if (kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
+							classzone_idx)) {
+
+			/* Read the new order and classzone_idx */
+			alloc_order = reclaim_order = pgdat->kswapd_order;
+			classzone_idx = kswapd_classzone_idx(pgdat, 0);
+			pgdat->kswapd_order = 0;
+			pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
+		} else {
+			/*
+			 * We failed to sleep, so continue on the current order
+			 * and classzone_idx, unless they increased.
+			 */
+			alloc_order = max(alloc_order, pgdat->kswapd_order);
+			reclaim_order = max(reclaim_order, pgdat->kswapd_order) ;
+			classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
+		}
 
 		ret = try_to_freeze();
 		if (kthread_should_stop())
-- 
2.13.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/6] mm, kswapd: reset kswapd's order to 0 when it fails to reclaim enough
  2017-07-27 16:06 [RFC PATCH 0/6] proactive kcompactd Vlastimil Babka
  2017-07-27 16:06 ` [PATCH 1/6] mm, kswapd: refactor kswapd_try_to_sleep() Vlastimil Babka
  2017-07-27 16:06 ` [PATCH 2/6] mm, kswapd: don't reset kswapd_order prematurely Vlastimil Babka
@ 2017-07-27 16:06 ` Vlastimil Babka
  2017-07-27 16:06 ` [PATCH 4/6] mm, kswapd: wake up kcompactd when kswapd had too many failures Vlastimil Babka
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Vlastimil Babka @ 2017-07-27 16:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner, Andrea Arcangeli, Rik van Riel, Vlastimil Babka

For high-order allocations, kswapd will either manage to create the free page
by reclaim itself, or reclaim just enough to let compaction proceed, set its
order to 0 (so that watermark checks don't look for high-order pages anymore)
and go to sleep while waking up kcompactd.

This doesn't work as expected in the case where kswapd cannot reclaim
compact_gap() worth of pages (nor balance the node by itself) even at the
highest priority. Then it neither goes to sleep nor wakes up kcompactd. This
patch fixes this corner case by setting sc.order to 0 in that situation.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ae897a85e7f3..a3f914c88dea 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3340,6 +3340,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	if (!sc.nr_reclaimed)
 		pgdat->kswapd_failures++;
 
+	/*
+	 * Even at highest priority, we could not reclaim enough to balance
+	 * the zone or reclaim over compact_gap() (see kswapd_shrink_node())
+	 * so we better give up now and wake up kcompactd instead.
+	 */
+	if (sc.order > 0 && sc.priority == 0)
+		sc.order = 0;
+
 out:
 	snapshot_refaults(NULL, pgdat);
 	/*
-- 
2.13.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/6] mm, kswapd: wake up kcompactd when kswapd had too many failures
  2017-07-27 16:06 [RFC PATCH 0/6] proactive kcompactd Vlastimil Babka
                   ` (2 preceding siblings ...)
  2017-07-27 16:06 ` [PATCH 3/6] mm, kswapd: reset kswapd's order to 0 when it fails to reclaim enough Vlastimil Babka
@ 2017-07-27 16:06 ` Vlastimil Babka
  2017-07-28 10:41   ` Mel Gorman
  2017-07-27 16:07 ` [RFC PATCH 5/6] mm, compaction: stop when number of free pages goes below watermark Vlastimil Babka
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2017-07-27 16:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner, Andrea Arcangeli, Rik van Riel, Vlastimil Babka

This patch deals with a corner case found when testing kcompactd with a very
simple testcase that first fragments memory (by creating a large shmem file and
then punching a hole in every even page) and then uses artificial order-9
GFP_NOWAIT allocations in a loop. This is freshly after a virtme-run boot in
KVM, with no other activity.

What happens is that after a few kswapd runs, there are no more reclaimable
pages, and high-order pages can only be created by compaction. Because kswapd
can't reclaim anything, pgdat->kswapd_failures increases up to
MAX_RECLAIM_RETRIES and kswapd is no longer woken up. Thus kcompactd is also
not woken up. After this patch, we will try to wake up kcompactd immediately
instead of kswapd.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a3f914c88dea..18ad0cd0c0f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3578,9 +3578,15 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 
-	/* Hopeless node, leave it to direct reclaim */
-	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+	/*
+	 * Hopeless node, leave it to direct reclaim. For high-order
+	 * allocations, try to wake up kcompactd instead.
+	 */
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES) {
+		if (order)
+			wakeup_kcompactd(pgdat, order, classzone_idx);
 		return;
+	}
 
 	if (pgdat_balanced(pgdat, order, classzone_idx))
 		return;
-- 
2.13.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH 5/6] mm, compaction: stop when number of free pages goes below watermark
  2017-07-27 16:06 [RFC PATCH 0/6] proactive kcompactd Vlastimil Babka
                   ` (3 preceding siblings ...)
  2017-07-27 16:06 ` [PATCH 4/6] mm, kswapd: wake up kcompactd when kswapd had too many failures Vlastimil Babka
@ 2017-07-27 16:07 ` Vlastimil Babka
  2017-07-28 10:43   ` Mel Gorman
  2017-07-27 16:07 ` [RFC PATCH 6/6] mm: make kcompactd more proactive Vlastimil Babka
  2017-08-09 20:58 ` [RFC PATCH 0/6] proactive kcompactd David Rientjes
  6 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2017-07-27 16:07 UTC (permalink / raw)
  To: linux-mm
  Cc: Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner, Andrea Arcangeli, Rik van Riel, Vlastimil Babka

When isolating free pages as miration targets in __isolate_free_page(),
compaction respects the min watermark. Although it checks that there's enough
free pages above the watermark in __compaction_suitable() before starting to
compact, parallel allocation may result in their depletion. Compaction will
detect this only after needlessly scanning many pages for migration,
potentially wasting CPU time.

After this patch, we check if we are still above the watermark in
__compact_finished(). For kcompactd, we check the low watermark instead of min
watermark, because that's the point when kswapd is woken up and it's better to
let kswapd finish freeing memory before doing kcompactd work.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/compaction.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index 613c59e928cb..6647359dc8e3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1291,6 +1291,7 @@ static enum compact_result __compact_finished(struct zone *zone,
 {
 	unsigned int order;
 	const int migratetype = cc->migratetype;
+	unsigned long watermark;
 
 	if (cc->contended || fatal_signal_pending(current))
 		return COMPACT_CONTENDED;
@@ -1374,6 +1375,23 @@ static enum compact_result __compact_finished(struct zone *zone,
 		}
 	}
 
+	/*
+	 * It's possible that the number of free pages has dropped below
+	 * watermark during our compaction, and __isolate_free_page() would fail.
+	 * In that case, let's stop now and not waste time searching for migrate
+	 * pages.
+	 * For direct compaction, the check is close to the one in
+	 * __isolate_free_page().  For kcompactd, we use the low watermark,
+	 * because that's the point when kswapd gets woken up, so it's better
+	 * for kcompactd to let kswapd free memory first.
+	 */
+	if (cc->direct_compaction)
+		watermark = min_wmark_pages(zone);
+	else
+		watermark = low_wmark_pages(zone);
+	if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+		return COMPACT_PARTIAL_SKIPPED;
+
 	return COMPACT_NO_SUITABLE_PAGE;
 }
 
-- 
2.13.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH 6/6] mm: make kcompactd more proactive
  2017-07-27 16:06 [RFC PATCH 0/6] proactive kcompactd Vlastimil Babka
                   ` (4 preceding siblings ...)
  2017-07-27 16:07 ` [RFC PATCH 5/6] mm, compaction: stop when number of free pages goes below watermark Vlastimil Babka
@ 2017-07-27 16:07 ` Vlastimil Babka
  2017-07-28 10:58   ` Mel Gorman
  2017-08-09 20:58 ` [RFC PATCH 0/6] proactive kcompactd David Rientjes
  6 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2017-07-27 16:07 UTC (permalink / raw)
  To: linux-mm
  Cc: Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner, Andrea Arcangeli, Rik van Riel, Vlastimil Babka

Kcompactd activity is currently tied to kswapd - it is woken up when kswapd
goes to sleep, and compacts to make a single high-order page available, of the
order that was used to wake up kswapd. This leaves the rest of the free pages
fragmented and results in direct compaction when the demand for fresh
high-order pages is higher than a single page per kswapd cycle.

Another extreme would be to let kcompactd compact the whole zone, the same way
as manual compaction from the /proc interface does. This would be wasteful if
the resulting high-order pages were not needed, but just split back to base
pages for allocations.

This patch aims to adjust the kcompactd effort through the observed demand for
high-order pages. This is done by hooking into alloc_pages_slowpath() and
counting (for each order > 0) allocation attempts that would pass the order-0
watermarks, but don't have the high-order page available. This demand is
(currently) recorded per node and then redistributed to the zones in each node
according to their relative sizes.

The redistribution considers the currently recorded failed attempts together
with the value used in the previous kcompactd cycle. If there were any recorded
failed attempts for the current cycle, it means the previous kcompactd activity
was insufficient, so the two values are added up. If there were zero failed
attempts, it means either that the previous amount of activity was optimal, or
that the demand decreased. We cannot tell which without also recording
successful attempts, which would add overhead to allocator fast paths, so we
use an exponential moving average to decay the kcompactd target in that case.
In any case, the target is capped at a high watermark worth of base pages,
since that's kswapd's target when balancing.
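
Expressed as a simplified excerpt of the per-order update done in
kcompactd_adjust_free_targets() below (high_wmark is the sum of the node's
high watermarks):

	missed = atomic_xchg(&pgdat->compact_free_target[order], 0);
	if (missed)	/* previous cycle's effort was insufficient, add up */
		target = pgdat->compact_free_target_ema[order] + missed;
	else		/* no missed allocations, decay with EMA, coefficient 0.5 */
		target = DIV_ROUND_UP(pgdat->compact_free_target_ema[order], 2);

	/* cap at high watermark worth of base pages */
	if ((target << order) > high_wmark)
		target = high_wmark >> order;
	pgdat->compact_free_target_ema[order] = target;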

Kcompactd then uses a different termination criterion than direct compaction.
It checks whether, for each order, the recorded number of attempted allocations
would fit within the free pages of that order, or could be covered by splitting
free pages of higher orders, assuming there would be no allocations of other
orders. This should make the kcompactd effort reflect the high-order demand.
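
A standalone sketch of that check (simplified from kcompactd_finished() below,
ignoring the cap by the zone's total free pages): walking the free lists from
the highest order down, each free page counts as two pages of the next lower
order, and compaction can stop once the accumulated pages cover the recorded
target at every order.

	bool targets_met = true;
	unsigned long avail = 0;

	for (order = MAX_ORDER - 1; order > 0; order--) {
		avail += zone->free_area[order].nr_free;
		if (avail < zone->compact_free_target[order]) {
			targets_met = false;
			break;
		}
		avail <<= 1;	/* each page splits into two of order - 1 */
	}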

In the worst case, the demand is so high that kcompactd will in fact compact
the whole zone and would have to be run with higher frequency than kswapd to
make a larger difference. That possibility can be explored later.
---
 include/linux/compaction.h |   6 ++
 include/linux/mmzone.h     |   3 +
 mm/compaction.c            | 222 ++++++++++++++++++++++++++++++++++++++++++---
 mm/page_alloc.c            |  13 +++
 4 files changed, 233 insertions(+), 11 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 0d8415820fc3..b342a80bde17 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -176,6 +176,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 extern int kcompactd_run(int nid);
 extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
+extern void kcompactd_inc_free_target(gfp_t gfp_mask, unsigned int order,
+				int alloc_flags, struct alloc_context *ac);
 
 #else
 static inline void reset_isolation_suitable(pg_data_t *pgdat)
@@ -224,6 +226,10 @@ static inline void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_i
 {
 }
 
+static inline void kcompactd_inc_free_target(gfp_t gfp_mask, unsigned int order,
+				int alloc_flags, struct alloc_context *ac)
+{
+}
 #endif /* CONFIG_COMPACTION */
 
 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ef6a13b7bd3e..73d1a569bad2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -484,6 +484,7 @@ struct zone {
 	unsigned int		compact_considered;
 	unsigned int		compact_defer_shift;
 	int			compact_order_failed;
+	unsigned int		compact_free_target[MAX_ORDER];
 #endif
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
@@ -643,6 +644,8 @@ typedef struct pglist_data {
 	enum zone_type kcompactd_classzone_idx;
 	wait_queue_head_t kcompactd_wait;
 	struct task_struct *kcompactd;
+	atomic_t compact_free_target[MAX_ORDER];
+	unsigned int compact_free_target_ema[MAX_ORDER];
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	/* Lock serializing the migrate rate limiting window */
diff --git a/mm/compaction.c b/mm/compaction.c
index 6647359dc8e3..6843cf74bfaa 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -21,6 +21,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/page_owner.h>
+#include <linux/cpuset.h>
 #include "internal.h"
 
 #ifdef CONFIG_COMPACTION
@@ -1286,6 +1287,56 @@ static inline bool is_via_compact_memory(int order)
 	return order == -1;
 }
 
+static enum compact_result kcompactd_finished(struct zone *zone)
+{
+	unsigned int order;
+	unsigned long sum_nr_free = 0;
+	bool success = true;
+	unsigned long watermark;
+	unsigned long zone_nr_free;
+
+	zone_nr_free = zone_page_state(zone, NR_FREE_PAGES);
+
+	for (order = MAX_ORDER - 1; order > 0; order--) {
+		unsigned long nr_free;
+		unsigned long target;
+
+		nr_free = zone->free_area[order].nr_free;
+		sum_nr_free += nr_free;
+
+		/*
+		 * If we can't achieve the target via compacting the existing
+		 * free pages, no point in continuing compaction.
+		 */
+		target = zone->compact_free_target[order];
+		if (sum_nr_free < min(target, zone_nr_free >> order)) {
+			success = false;
+			break;
+		}
+
+		/*
+		 * Each free page of current order can fit two pages of the
+		 * next lower order
+		 */
+		sum_nr_free <<= 1UL;
+	}
+
+	if (success)
+		return COMPACT_SUCCESS;
+
+	/*
+	 * If number of pages dropped below low watermark, kswapd will be woken
+	 * up, so it's better for kcompactd to give up for now.
+	 */
+	watermark = low_wmark_pages(zone);
+	if (!__zone_watermark_ok(zone, 0, watermark, zone_idx(zone), 0,
+								zone_nr_free))
+		return COMPACT_PARTIAL_SKIPPED;
+
+	return COMPACT_CONTINUE;
+
+}
+
 static enum compact_result __compact_finished(struct zone *zone,
 						struct compact_control *cc)
 {
@@ -1330,6 +1381,13 @@ static enum compact_result __compact_finished(struct zone *zone,
 			return COMPACT_CONTINUE;
 	}
 
+	/*
+	 * Compaction that's neither direct nor is_via_compact_memory() has to
+	 * be from kcompactd, which has different criteria.
+	 */
+	if (!cc->direct_compaction)
+		return kcompactd_finished(zone);
+
 	/* Direct compactor: Is a suitable page free? */
 	for (order = cc->order; order < MAX_ORDER; order++) {
 		struct free_area *area = &zone->free_area[order];
@@ -1381,14 +1439,9 @@ static enum compact_result __compact_finished(struct zone *zone,
 	 * In that case, let's stop now and not waste time searching for migrate
 	 * pages.
 	 * For direct compaction, the check is close to the one in
-	 * __isolate_free_page().  For kcompactd, we use the low watermark,
-	 * because that's the point when kswapd gets woken up, so it's better
-	 * for kcompactd to let kswapd free memory first.
+	 * __isolate_free_page().
 	 */
-	if (cc->direct_compaction)
-		watermark = min_wmark_pages(zone);
-	else
-		watermark = low_wmark_pages(zone);
+	watermark = min_wmark_pages(zone);
 	if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
 		return COMPACT_PARTIAL_SKIPPED;
 
@@ -1918,7 +1971,7 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 	struct zone *zone;
 	enum zone_type classzone_idx = pgdat->kcompactd_classzone_idx;
 
-	for (zoneid = 0; zoneid <= classzone_idx; zoneid++) {
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
 		zone = &pgdat->node_zones[zoneid];
 
 		if (!populated_zone(zone))
@@ -1927,11 +1980,155 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 		if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
 					classzone_idx) == COMPACT_CONTINUE)
 			return true;
+
+		if (kcompactd_finished(zone) == COMPACT_CONTINUE)
+			return true;
 	}
 
 	return false;
 }
 
+void kcompactd_inc_free_target(gfp_t gfp_mask, unsigned int order,
+				int alloc_flags, struct alloc_context *ac)
+{
+	struct zone *zone;
+	struct zoneref *zref;
+
+	// FIXME: spread over nodes instead of increasing all?
+	for_each_zone_zonelist_nodemask(zone, zref, ac->zonelist,
+					ac->high_zoneidx, ac->nodemask) {
+		unsigned long mark;
+		int nid = zone_to_nid(zone);
+		int zoneidx;
+		bool zone_not_highest = false;
+
+		/*
+		 * A kludge to avoid incrementing for the same node twice or
+		 * more, regardless of zonelist being in zone or node order.
+		 * This is to avoid allocating a nodemask on stack to mark
+		 * visited nodes.
+		 */
+		for (zoneidx = zonelist_zone_idx(zref) + 1;
+						zoneidx <= ac->high_zoneidx;
+						zoneidx++) {
+			struct zone *z = &zone->zone_pgdat->node_zones[zoneidx];
+
+			if (populated_zone(z)) {
+				zone_not_highest = true;
+				break;
+			}
+		}
+
+		if (zone_not_highest)
+			continue;
+
+		if (cpusets_enabled() &&
+				(alloc_flags & ALLOC_CPUSET) &&
+				!cpuset_zone_allowed(zone, gfp_mask))
+			continue;
+
+		/* The high-order allocation should succeed on this node */
+		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
+		if (zone_watermark_ok(zone, order, mark,
+				       ac_classzone_idx(ac), alloc_flags))
+			continue;
+
+		/*
+		 * High-order allocation wouldn't succeed. If order-0
+		 * allocations of same total size would pass the watermarks,
+		 * we know it's due to fragmentation, and kcompactd trying
+		 * harder could help.
+		 */
+		mark += (1UL << order) - 1;
+		if (zone_watermark_ok(zone, 0, mark, ac_classzone_idx(ac),
+								alloc_flags)) {
+			/*
+			 * TODO: consider prioritizing based on gfp_mask, e.g.
+			 * THP faults are opportunistic and should not result
+			 * in perpetual kcompactd activity. Allocation attempts
+			 * without easy fallback should be more important.
+			 */
+			atomic_inc(&NODE_DATA(nid)->compact_free_target[order]);
+		}
+	}
+}
+
+static void kcompactd_adjust_free_targets(pg_data_t *pgdat)
+{
+	unsigned long managed_pages = 0;
+	unsigned long high_wmark = 0;
+	int zoneid, order;
+	struct zone *zone;
+
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+		zone = &pgdat->node_zones[zoneid];
+
+		if (!populated_zone(zone))
+			continue;
+
+		managed_pages += zone->managed_pages;
+		high_wmark += high_wmark_pages(zone);
+	}
+
+	if (!managed_pages)
+		return;
+
+	for (order = 1; order < MAX_ORDER; order++) {
+		unsigned int target;
+
+		target = atomic_xchg(&pgdat->compact_free_target[order], 0);
+
+		/*
+		 * If the target is non-zero, it means we could have done more
+		 * in the previous run, so add it to the previous run's target.
+		 * Otherwise start decaying the target.
+		 */
+		if (target)
+			target += pgdat->compact_free_target_ema[order];
+		else
+			/* Exponential moving average, coefficient 0.5 */
+			target = DIV_ROUND_UP(target
+				+ pgdat->compact_free_target_ema[order], 2);
+
+
+		/*
+		 * Limit the target by high wmark worth of pages, otherwise
+		 * kcompactd can't achieve it anyway.
+		 */
+		if ((target << order) > high_wmark)
+			target = high_wmark >> order;
+
+		pgdat->compact_free_target_ema[order] = target;
+
+		if (!target)
+			continue;
+
+		/* Distribute the target among zones */
+		for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+
+			unsigned long zone_target = target;
+
+			zone = &pgdat->node_zones[zoneid];
+
+			if (!populated_zone(zone))
+				continue;
+
+			/* For a single zone on node, take a shortcut */
+			if (managed_pages == zone->managed_pages) {
+				zone->compact_free_target[order] = zone_target;
+				continue;
+			}
+
+			/* Take proportion of zone's page to whole node */
+			zone_target *= zone->managed_pages;
+			/* Round up for remainder of at least 1/2 */
+			zone_target = DIV_ROUND_UP_ULL(zone_target, managed_pages);
+
+			zone->compact_free_target[order] = zone_target;
+		}
+	}
+}
+
 static void kcompactd_do_work(pg_data_t *pgdat)
 {
 	/*
@@ -1954,7 +2151,9 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 							cc.classzone_idx);
 	count_compact_event(KCOMPACTD_WAKE);
 
-	for (zoneid = 0; zoneid <= cc.classzone_idx; zoneid++) {
+	kcompactd_adjust_free_targets(pgdat);
+
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
 		int status;
 
 		zone = &pgdat->node_zones[zoneid];
@@ -1964,8 +2163,9 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 		if (compaction_deferred(zone, cc.order))
 			continue;
 
-		if (compaction_suitable(zone, cc.order, 0, zoneid) !=
+		if ((compaction_suitable(zone, cc.order, 0, zoneid) !=
 							COMPACT_CONTINUE)
+			&& kcompactd_finished(zone) != COMPACT_CONTINUE)
 			continue;
 
 		cc.nr_freepages = 0;
@@ -1982,7 +2182,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 
 		if (status == COMPACT_SUCCESS) {
 			compaction_defer_reset(zone, cc.order, false);
-		} else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
+		} else if (status == COMPACT_COMPLETE) {
 			/*
 			 * We use sync migration mode here, so we defer like
 			 * sync direct compaction does.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index faed38d52721..82483ce9a202 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3780,6 +3780,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;
 
 	/*
+	 * If it looks like increased kcompactd effort could have spared
+	 * us from direct compaction (or allocation failure if we cannot
+	 * compact), increase kcompactd's target.
+	 */
+	if (order > 0)
+		kcompactd_inc_free_target(gfp_mask, order, alloc_flags, ac);
+
+	/*
 	 * For costly allocations, try direct compaction first, as it's likely
 	 * that we have enough base pages and don't need to reclaim. For non-
 	 * movable high-order allocations, do that as well, as compaction will
@@ -6038,6 +6046,7 @@ static unsigned long __paginginit calc_memmap_size(unsigned long spanned_pages,
  */
 static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 {
+	int i;
 	enum zone_type j;
 	int nid = pgdat->node_id;
 	int ret;
@@ -6057,6 +6066,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 #ifdef CONFIG_COMPACTION
 	init_waitqueue_head(&pgdat->kcompactd_wait);
+	for (i = 0; i < MAX_ORDER; i++) {
+		atomic_set(&pgdat->compact_free_target[i], 0);
+		pgdat->compact_free_target_ema[i] = 0;
+	}
 #endif
 	pgdat_page_ext_init(pgdat);
 	spin_lock_init(&pgdat->lru_lock);
-- 
2.13.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/6] mm, kswapd: refactor kswapd_try_to_sleep()
  2017-07-27 16:06 ` [PATCH 1/6] mm, kswapd: refactor kswapd_try_to_sleep() Vlastimil Babka
@ 2017-07-28  9:38   ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2017-07-28  9:38 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Joonsoo Kim, David Rientjes, Michal Hocko,
	Johannes Weiner, Andrea Arcangeli, Rik van Riel

On Thu, Jul 27, 2017 at 06:06:56PM +0200, Vlastimil Babka wrote:
> The code of kswapd_try_to_sleep() is unnecessarily hard to follow. Also we
> needlessly call prepare_kswapd_sleep() twice, if the first one fails.
> Restructure the code so that each non-success case is accounted and returns
> immediately.
> 
> This patch should not introduce any functional change, except when the first
> prepare_kswapd_sleep() would have returned false, and then the second would be
> true (because somebody else has freed memory), kswapd would sleep before this
> patch and now it won't. This has likely been an accidental property of the
> implementation, and extremely rare to happen in practice anyway.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] mm, kswapd: don't reset kswapd_order prematurely
  2017-07-27 16:06 ` [PATCH 2/6] mm, kswapd: don't reset kswapd_order prematurely Vlastimil Babka
@ 2017-07-28 10:16   ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2017-07-28 10:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Joonsoo Kim, David Rientjes, Michal Hocko,
	Johannes Weiner, Andrea Arcangeli, Rik van Riel

On Thu, Jul 27, 2017 at 06:06:57PM +0200, Vlastimil Babka wrote:
> This patch deals with a corner case found when testing kcompactd with a very
> simple testcase that first fragments memory (by creating a large shmem file and
> then punching hole in every even page) and then uses artificial order-9
> GFP_NOWAIT allocations in a loop. This is freshly after virtme-run boot in KVM
> and no other activity.
> 
> What happens is that kswapd always reclaims too little to get over
> compact_gap() in kswapd_shrink_node(), so it doesn't set sc->order to 0, thus
> "goto kswapd_try_sleep" in kswapd() doesn't happen. In the next iteration of
> kswapd() loop, alloc_order and reclaim_order is read again from
> pgdat->kswapd_order, which the previous iteration has reset to 0 and there was
> no other kswapd wakeup meanwhile (the workload inserts short sleeps between
> allocations). With the working order 0, node appears balanced and
> wakeup_kcompactd() does nothing.
> 

The risk with a change like this is that there is an introduction of
kswapd-stuck-at-100%-cpu reclaiming for high order pages. Consider for
example this part

> -static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
> +/*
> + * Return true if kswapd fully slept because pgdat was balanced and there was
> + * no premature wakeup.
> + */
> +static bool kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
>  				unsigned int classzone_idx)
>  {
>  	long remaining = 0;
>  	DEFINE_WAIT(wait);
> +	bool ret = false;
>  
>  	if (freezing(current) || kthread_should_stop())
> -		return;
> +		return false;
>  
>  	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>  

...

> @@ -3493,23 +3491,32 @@ static int kswapd(void *p)
>  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>  	set_freezable();
>  
> -	pgdat->kswapd_order = 0;
> +	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
>  	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>  	for ( ; ; ) {
>  		bool ret;
>  
> -		alloc_order = reclaim_order = pgdat->kswapd_order;
> +		alloc_order = reclaim_order = max(alloc_order, pgdat->kswapd_order);
>  		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
>  
>  kswapd_try_sleep:
> -		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> -					classzone_idx);
> -
> -		/* Read the new order and classzone_idx */
> -		alloc_order = reclaim_order = pgdat->kswapd_order;
> -		classzone_idx = kswapd_classzone_idx(pgdat, 0);
> -		pgdat->kswapd_order = 0;
> -		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> +		if (kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> +							classzone_idx)) {
> +
> +			/* Read the new order and classzone_idx */
> +			alloc_order = reclaim_order = pgdat->kswapd_order;
> +			classzone_idx = kswapd_classzone_idx(pgdat, 0);
> +			pgdat->kswapd_order = 0;
> +			pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> +		} else {
> +			/*
> +			 * We failed to sleep, so continue on the current order
> +			 * and classzone_idx, unless they increased.
> +			 */
> +			alloc_order = max(alloc_order, pgdat->kswapd_order);
> +			reclaim_order = max(reclaim_order, pgdat->kswapd_order) ;
> +			classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
> +		}
>  
>  		ret = try_to_freeze();
>  		if (kthread_should_stop())

kswapd_try_to_sleep returns true only if it fully slept. Now, consider
a case where kswapd is woken for order-9, fails, and there are streaming
allocators that are keeping kswapd awake between the low/high watermark.
Even though all subsequent wakeups are potentially for order-0, the
false branch above keeps kswapd at order-9.

You should be very wary of keeping kswapd awake for high-order allocations;
somehow defer the work to kcompactd or push it into direct reclaim instead.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/6] mm, kswapd: wake up kcompactd when kswapd had too many failures
  2017-07-27 16:06 ` [PATCH 4/6] mm, kswapd: wake up kcompactd when kswapd had too many failures Vlastimil Babka
@ 2017-07-28 10:41   ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2017-07-28 10:41 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Joonsoo Kim, David Rientjes, Michal Hocko,
	Johannes Weiner, Andrea Arcangeli, Rik van Riel

On Thu, Jul 27, 2017 at 06:06:59PM +0200, Vlastimil Babka wrote:
> This patch deals with a corner case found when testing kcompactd with a very
> simple testcase that first fragments memory (by creating a large shmem file and
> then punching hole in every even page) and then uses artificial order-9
> GFP_NOWAIT allocations in a loop. This is freshly after virtme-run boot in KVM
> and no other activity.
> 
> What happens is that after few kswapd runs, there are no more reclaimable
> pages, and high-order pages can only be created by compaction. Because kswapd
> can't reclaim anything, pgdat->kswapd_failures increases up to
> MAX_RECLAIM_RETRIES and kswapd is no longer woken up. Thus kcompactd is also
> not woken up. After this patch, we will try to wake up kcompactd immediately
> instead of kswapd.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

If kswapd cannot make any progress then it's possible that kcompactd
won't be able to move the pages either. However, an exception is
anonymous pages without swap configured, so

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 5/6] mm, compaction: stop when number of free pages goes below watermark
  2017-07-27 16:07 ` [RFC PATCH 5/6] mm, compaction: stop when number of free pages goes below watermark Vlastimil Babka
@ 2017-07-28 10:43   ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2017-07-28 10:43 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Joonsoo Kim, David Rientjes, Michal Hocko,
	Johannes Weiner, Andrea Arcangeli, Rik van Riel

On Thu, Jul 27, 2017 at 06:07:00PM +0200, Vlastimil Babka wrote:
> When isolating free pages as miration targets in __isolate_free_page(),

s/miration/migration/

> compaction respects the min watermark. Although it checks that there's enough
> free pages above the watermark in __compaction_suitable() before starting to
> compact, parallel allocation may result in their depletion. Compaction will
> detect this only after needlessly scanning many pages for migration,
> potentially wasting CPU time.
> 
> After this patch, we check if we are still above the watermark in
> __compact_finished(). For kcompactd, we check the low watermark instead of min
> watermark, because that's the point when kswapd is woken up and it's better to
> let kswapd finish freeing memory before doing kcompactd work.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Otherwise I cannot see a problem. Some compaction opportunities might be
"missed" but they're ones that potentially cause increased direct
reclaim or kswapd reclaim activity.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 6/6] mm: make kcompactd more proactive
  2017-07-27 16:07 ` [RFC PATCH 6/6] mm: make kcompactd more proactive Vlastimil Babka
@ 2017-07-28 10:58   ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2017-07-28 10:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Joonsoo Kim, David Rientjes, Michal Hocko,
	Johannes Weiner, Andrea Arcangeli, Rik van Riel

On Thu, Jul 27, 2017 at 06:07:01PM +0200, Vlastimil Babka wrote:
> Kcompactd activity is currently tied to kswapd - it is woken up when kswapd
> goes to sleep, and compacts to make a single high-order page available, of the
> order that was used to wake up kswapd. This leaves the rest of free pages
> fragmented and results in direct compaction when the demand for fresh
> high-order pages is higher than a single page per kswapd cycle.
> 
> Another extreme would be to let kcompactd compact whole zone the same way as
> manual compaction from /proc interface. This would be wasteful if the resulting
> high-order pages would be not needed, but just split back to base pages for
> allocations.
> 
> This patch aims to adjust the kcompactd effort through observed demand for
> high-order pages. This is done by hooking into alloc_pages_slowpath() and
> counting (per each order > 0) allocation attempts that would pass the order-0
> watermarks, but don't have the high-order page available. This demand is
> (currently) recorded per node and then redistributed per zones in each node
> according to their relative sizes.
> 
> The redistribution considers the current recorded failed attempts together with
> the value used in the previous kcompactd cycle. If there were any recorded
> failed attempts for the current cycle, it means the previous kcompactd activity
> was insufficient, so the two values are added up. If there were zero failed
> attempts it means either the previous amount of activity was optimum, or that
> the demand decreased. We cannot know that without recording also successful
> attempts, which would add overhead to allocator fast paths, so we use
> exponential moving average to decay the kcompactd target in such case.
> In any case, the target is capped to high watermark worth of base pages, since
> that's the kswapd's target when balancing.
> 
> Kcompactd then uses a different termination criteria than direct compaction.
> It checks whether for each order, the recorded number of attempted allocations
> would fit within the free pages of that order of with possible splitting of
> higher orders, assuming there would be no allocations of other orders. This
> should make kcompactd effort reflect the high-order demand.
> 
> In the worst case, the demand is so high that kcompactd will in fact compact
> the whole zone and would have to be run with higher frequency than kswapd to
> make a larger difference. That possibility can be explored later.

Very broadly speaking, I can't see a problem with the direction you are
taking. Misc comments are

o kcompactd_inc_free_target is a bit excessive without data backing it
  up. It's overkill to go through every allowed node incrementing counters
  in the page allocator slow path. It's not even necessarily a good idea
  because it's hard to reason what impact that has on how the attempts get
  decayed and what impact it can have on remote nodes.  At a first
  cut, I would have thought incrementing the preferred zone only would be
  reasonable. If there are concerns about small high zones then increment
  every zone in the local node and do not bother with the cpuset checks. Overall, don't
  worry about the remote nodes unless there is strong evidence it's needed.

o Similarly, it's not clear how much benefit there is to spreading
  targets across zones and the complexity involved. I would suggest
  keeping kcompactd_inc_free_target as simple as possible for as long as
  possible. While it's called from the page allocator slowpath for high-order
  allocations only, we shouldn't pay costs there unless we have to.

o The atomics seem a little overkill considering that this is just a
  heuristic hint. If lost updates happen, it's not that big a deal and
  at worst, there is a spurious compaction run just as the counters hit
  0. That corner case is marginal compared to the atomic overheads. Just
  watch for going negative due to the races which is a minor fix.
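
As a concrete illustration of the target bookkeeping described in the quoted
changelog (grow the target while failed attempts keep arriving, decay it
exponentially once they stop, cap it at the high watermark), here is a
compilable userspace sketch. The function name, the 1/2 decay factor and the
sample numbers are assumptions for illustration only:

#include <stdio.h>

static unsigned long update_kcompactd_target(unsigned long prev_target,
					     unsigned long new_failures,
					     unsigned long wmark_high)
{
	unsigned long target;

	if (new_failures)
		/* previous effort was not enough: add the new demand */
		target = prev_target + new_failures;
	else
		/* no failures observed: decay with an assumed factor of 1/2 */
		target = prev_target / 2;

	return target > wmark_high ? wmark_high : target;
}

int main(void)
{
	unsigned long failures[] = { 100, 40, 0, 0, 250, 0 };
	unsigned long target = 0;
	int i;

	for (i = 0; i < 6; i++) {
		target = update_kcompactd_target(target, failures[i], 512);
		printf("cycle %d: target %lu\n", i, target);
	}
	return 0;
}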

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] proactive kcompactd
  2017-07-27 16:06 [RFC PATCH 0/6] proactive kcompactd Vlastimil Babka
                   ` (5 preceding siblings ...)
  2017-07-27 16:07 ` [RFC PATCH 6/6] mm: make kcompactd more proactive Vlastimil Babka
@ 2017-08-09 20:58 ` David Rientjes
  2017-08-21 14:10   ` Johannes Weiner
  6 siblings, 1 reply; 21+ messages in thread
From: David Rientjes @ 2017-08-09 20:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Joonsoo Kim, Mel Gorman, Michal Hocko, Johannes Weiner,
	Andrea Arcangeli, Rik van Riel

On Thu, 27 Jul 2017, Vlastimil Babka wrote:

> As we discussed at last LSF/MM [1], the goal here is to shift more compaction
> work to kcompactd, which currently just makes a single high-order page
> available and then goes to sleep. The last patch, evolved from the initial RFC
> [2] does this by recording for each order > 0 how many allocations would have
> potentially be able to skip direct compaction, if the memory wasn't fragmented.
> Kcompactd then tries to compact as long as it takes to make that many
> allocations satisfiable. This approach avoids any hooks in allocator fast
> paths. There are more details to this, see the last patch.
> 

I think I would have liked to have seen "less proactive" :)

Kcompactd currently has the problem that it is MIGRATE_SYNC_LIGHT so it 
continues until it can defragment memory.  On a host with 128GB of memory 
and 100GB of it sitting in a hugetlb pool, we constantly get kcompactd 
wakeups for order-2 memory allocation.  The stats are pretty bad:

compact_migrate_scanned 2931254031294 
compact_free_scanned    102707804816705 
compact_isolated        1309145254 

0.0012% of memory scanned is ever actually isolated.  We constantly see 
very high cpu for compaction_alloc() because kcompactd is almost always 
running in the background and iterating most memory completely needlessly 
(define needless as 0.0012% of memory scanned being isolated).

vm.extfrag_threshold isn't a solution to the problem because it sees 
free memory in the remaining 28GB and isolates/migrates even if order-2 
memory will not become available, so it would need to be set at >850 
to prevent compaction.  If memory is freed from the 
hugetlb pool we would need to adjust the threshold at runtime.  (Why is 
kcompactd setting ignore_skip_hint, again?)

I think we need to look at making kcompactd do less work on each wakeup, 
perhaps by not forcing full scans of memory with MIGRATE_SYNC_LIGHT and by 
deferring compaction for longer if most scanning is completely pointless.
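
One way to make the "defer compaction for longer when most scanning is
pointless" idea concrete is the sketch below: widen the deferral window
whenever the isolated/scanned ratio stays tiny. The threshold, the names and
the ppm arithmetic are illustrative assumptions, not existing kernel
behaviour:

#include <stdio.h>

#define EFFICIENCY_THRESHOLD_PPM 100	/* 0.01%, an assumed cutoff */
#define MAX_DEFER_SHIFT 6

static unsigned int next_defer_shift(unsigned int defer_shift,
				     unsigned long long scanned,
				     unsigned long long isolated)
{
	unsigned long long ppm;

	if (!scanned)
		return defer_shift;

	ppm = isolated * 1000000ULL / scanned;
	if (ppm < EFFICIENCY_THRESHOLD_PPM && defer_shift < MAX_DEFER_SHIFT)
		defer_shift++;		/* scanning was pointless: defer longer */
	else if (ppm >= EFFICIENCY_THRESHOLD_PPM && defer_shift > 0)
		defer_shift--;		/* compaction is productive again */

	return defer_shift;
}

int main(void)
{
	/* the stats above: roughly 0.0012% of scanned pages were isolated */
	unsigned long long scanned = 102707804816705ULL + 2931254031294ULL;
	unsigned long long isolated = 1309145254ULL;

	printf("defer_shift 0 -> %u\n", next_defer_shift(0, scanned, isolated));
	return 0;
}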


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] proactive kcompactd
  2017-08-09 20:58 ` [RFC PATCH 0/6] proactive kcompactd David Rientjes
@ 2017-08-21 14:10   ` Johannes Weiner
  2017-08-21 21:40     ` Rik van Riel
                       ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Johannes Weiner @ 2017-08-21 14:10 UTC (permalink / raw)
  To: David Rientjes
  Cc: Vlastimil Babka, linux-mm, Joonsoo Kim, Mel Gorman, Michal Hocko,
	Andrea Arcangeli, Rik van Riel

On Wed, Aug 09, 2017 at 01:58:42PM -0700, David Rientjes wrote:
> On Thu, 27 Jul 2017, Vlastimil Babka wrote:
> 
> > As we discussed at last LSF/MM [1], the goal here is to shift more compaction
> > work to kcompactd, which currently just makes a single high-order page
> > available and then goes to sleep. The last patch, evolved from the initial RFC
> > [2] does this by recording for each order > 0 how many allocations would have
> > potentially be able to skip direct compaction, if the memory wasn't fragmented.
> > Kcompactd then tries to compact as long as it takes to make that many
> > allocations satisfiable. This approach avoids any hooks in allocator fast
> > paths. There are more details to this, see the last patch.
> > 
> 
> I think I would have liked to have seen "less proactive" :)
> 
> Kcompactd currently has the problem that it is MIGRATE_SYNC_LIGHT so it 
> continues until it can defragment memory.  On a host with 128GB of memory 
> and 100GB of it sitting in a hugetlb pool, we constantly get kcompactd 
> wakeups for order-2 memory allocation.  The stats are pretty bad:
> 
> compact_migrate_scanned 2931254031294 
> compact_free_scanned    102707804816705 
> compact_isolated        1309145254 
> 
> 0.0012% of memory scanned is ever actually isolated.  We constantly see 
> very high cpu for compaction_alloc() because kcompactd is almost always 
> running in the background and iterating most memory completely needlessly 
> (define needless as 0.0012% of memory scanned being isolated).

The free page scanner will inevitably wade through mostly used memory,
but 0.0012% is lower than what systems usually have free. I'm guessing
this is because of concurrent allocation & free cycles racing with the
scanner? There could also be an issue with how we do partial scans.

Anyway, we've also noticed scalability issues with the current scanner
on 128G and 256G machines. Even with a better efficiency - finding the
1% of free memory, that's still a ton of linear search space.

I've been toying around with the below patch. It adds a free page
bitmap, allowing the free scanner to quickly skip over the vast areas
of used memory. I don't have good data on skip-efficiency at higher
uptimes and the resulting fragmentation yet. The overhead added to the
page allocator is concerning, but I cannot think of a better way to
make the search more efficient. What do you guys think?

---

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] proactive kcompactd
  2017-08-21 14:10   ` Johannes Weiner
@ 2017-08-21 21:40     ` Rik van Riel
  2017-08-22 20:57     ` David Rientjes
  2017-08-23  5:36     ` Joonsoo Kim
  2 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2017-08-21 21:40 UTC (permalink / raw)
  To: Johannes Weiner, David Rientjes
  Cc: Vlastimil Babka, linux-mm, Joonsoo Kim, Mel Gorman, Michal Hocko,
	Andrea Arcangeli

On Mon, 2017-08-21 at 10:10 -0400, Johannes Weiner wrote:
> 
> I've been toying around with the below patch. It adds a free page
> bitmap, allowing the free scanner to quickly skip over the vast areas
> of used memory. I don't have good data on skip-efficiency at higher
> uptimes and the resulting fragmentation yet. The overhead added to
> the
> page allocator is concerning, but I cannot think of a better way to
> make the search more efficient. What do you guys think?

Michael Tsirkin and I have been thinking about using a bitmap
to allow KVM guests to tell the host which pages are free (and
could be discarded by the host).

Having multiple users for the bitmap makes having one much more
compelling...


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] proactive kcompactd
  2017-08-21 14:10   ` Johannes Weiner
  2017-08-21 21:40     ` Rik van Riel
@ 2017-08-22 20:57     ` David Rientjes
  2017-08-23  5:36     ` Joonsoo Kim
  2 siblings, 0 replies; 21+ messages in thread
From: David Rientjes @ 2017-08-22 20:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vlastimil Babka, linux-mm, Joonsoo Kim, Mel Gorman, Michal Hocko,
	Andrea Arcangeli, Rik van Riel

On Mon, 21 Aug 2017, Johannes Weiner wrote:

> > I think I would have liked to have seen "less proactive" :)
> > 
> > Kcompactd currently has the problem that it is MIGRATE_SYNC_LIGHT so it 
> > continues until it can defragment memory.  On a host with 128GB of memory 
> > and 100GB of it sitting in a hugetlb pool, we constantly get kcompactd 
> > wakeups for order-2 memory allocation.  The stats are pretty bad:
> > 
> > compact_migrate_scanned 2931254031294 
> > compact_free_scanned    102707804816705 
> > compact_isolated        1309145254 
> > 
> > 0.0012% of memory scanned is ever actually isolated.  We constantly see 
> > very high cpu for compaction_alloc() because kcompactd is almost always 
> > running in the background and iterating most memory completely needlessly 
> > (define needless as 0.0012% of memory scanned being isolated).
> 
> The free page scanner will inevitably wade through mostly used memory,
> but 0.0012% is lower than what systems usually have free. I'm guessing
> this is because of concurrent allocation & free cycles racing with the
> scanner? There could also be an issue with how we do partial scans.
> 

More than 90% of this system's memory is in the hugetlbfs pool so the 
freeing scanner needlessly scans over it.  Because kcompactd does 
MIGRATE_SYNC_LIGHT compaction, it doesn't stop iterating until the 
allocation is successful at pgdat->kcompactd_max_order or the migration 
and freeing scanners meet.  This is normally all memory.

Because of MIGRATE_SYNC_LIGHT, kcompactd does respect deferred compaction 
and will avoid doing compaction at all for the next 
1 << COMPACT_MAX_DEFER_SHIFT wakeups, but while the rest of userspace not 
mapping hugetlbfs memory tries to fault thp, this happens almost nonstop 
at 100% of cpu.
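
For reference, the deferral mentioned above is an exponential backoff; the
compilable model below captures its shape, with the struct and function names
being simplified stand-ins rather than the kernel's exact code:

#include <stdbool.h>
#include <stdio.h>

#define COMPACT_MAX_DEFER_SHIFT 6

struct zone_defer {
	unsigned int considered;
	unsigned int defer_shift;
};

/* called when compaction fails: widen the backoff window, up to 1 << 6 */
static void defer_compaction(struct zone_defer *z)
{
	z->considered = 0;
	if (++z->defer_shift > COMPACT_MAX_DEFER_SHIFT)
		z->defer_shift = COMPACT_MAX_DEFER_SHIFT;
}

/* called on each wakeup: true means "skip compaction this time" */
static bool compaction_deferred(struct zone_defer *z)
{
	unsigned int limit = 1U << z->defer_shift;

	if (++z->considered >= limit) {
		z->considered = limit;
		return false;
	}
	return true;
}

int main(void)
{
	struct zone_defer z = { 0, 0 };
	int i;

	defer_compaction(&z);
	for (i = 0; i < 4; i++)
		printf("wakeup %d: deferred=%d\n", i, compaction_deferred(&z));
	return 0;
}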

Although this might not be a typical configuration, it can easily be used 
to demonstrate how inefficiently kcompactd behaves under load when a small 
amount of memory is free or cannot be isolated because it's pinned.  
vm.extfrag_threshold isn't an adequate solution.

> Anyway, we've also noticed scalability issues with the current scanner
> on 128G and 256G machines. Even with a better efficiency - finding the
> 1% of free memory, that's still a ton of linear search space.
> 

Agreed.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] proactive kcompactd
  2017-08-21 14:10   ` Johannes Weiner
  2017-08-21 21:40     ` Rik van Riel
  2017-08-22 20:57     ` David Rientjes
@ 2017-08-23  5:36     ` Joonsoo Kim
  2017-08-23  8:12       ` Vlastimil Babka
  2 siblings, 1 reply; 21+ messages in thread
From: Joonsoo Kim @ 2017-08-23  5:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Vlastimil Babka, linux-mm, Mel Gorman,
	Michal Hocko, Andrea Arcangeli, Rik van Riel

On Mon, Aug 21, 2017 at 10:10:14AM -0400, Johannes Weiner wrote:
> On Wed, Aug 09, 2017 at 01:58:42PM -0700, David Rientjes wrote:
> > On Thu, 27 Jul 2017, Vlastimil Babka wrote:
> > 
> > > As we discussed at last LSF/MM [1], the goal here is to shift more compaction
> > > work to kcompactd, which currently just makes a single high-order page
> > > available and then goes to sleep. The last patch, evolved from the initial RFC
> > > [2] does this by recording for each order > 0 how many allocations would have
> > > potentially be able to skip direct compaction, if the memory wasn't fragmented.
> > > Kcompactd then tries to compact as long as it takes to make that many
> > > allocations satisfiable. This approach avoids any hooks in allocator fast
> > > paths. There are more details to this, see the last patch.
> > > 
> > 
> > I think I would have liked to have seen "less proactive" :)
> > 
> > Kcompactd currently has the problem that it is MIGRATE_SYNC_LIGHT so it 
> > continues until it can defragment memory.  On a host with 128GB of memory 
> > and 100GB of it sitting in a hugetlb pool, we constantly get kcompactd 
> > wakeups for order-2 memory allocation.  The stats are pretty bad:
> > 
> > compact_migrate_scanned 2931254031294 
> > compact_free_scanned    102707804816705 
> > compact_isolated        1309145254 
> > 
> > 0.0012% of memory scanned is ever actually isolated.  We constantly see 
> > very high cpu for compaction_alloc() because kcompactd is almost always 
> > running in the background and iterating most memory completely needlessly 
> > (define needless as 0.0012% of memory scanned being isolated).
> 
> The free page scanner will inevitably wade through mostly used memory,
> but 0.0012% is lower than what systems usually have free. I'm guessing
> this is because of concurrent allocation & free cycles racing with the
> scanner? There could also be an issue with how we do partial scans.
> 
> Anyway, we've also noticed scalability issues with the current scanner
> on 128G and 256G machines. Even with a better efficiency - finding the
> 1% of free memory, that's still a ton of linear search space.
> 
> I've been toying around with the below patch. It adds a free page
> bitmap, allowing the free scanner to quickly skip over the vast areas
> of used memory. I don't have good data on skip-efficiency at higher
> uptimes and the resulting fragmentation yet. The overhead added to the
> page allocator is concerning, but I cannot think of a better way to
> make the search more efficient. What do you guys think?

Hello, Johannes.

I think that the best solution is for compaction to not do a linear
scan at all. Vlastimil has already suggested that idea.

mm, compaction: direct freepage allocation for async direct
compaction

lkml.kernel.org/r/<1459414236-9219-5-git-send-email-vbabka@suse.cz>

It uses the buddy allocator to get a freepage so there is no linear
scan. It would completely remove the scalability issue.

Unfortunately, he applied this idea only to async compaction since
changing the other compaction modes would probably cause long term
fragmentation. And I disagreed with that idea at that time since
different compaction logic for different compaction modes would make
the system more unpredictable.

I doubt long term fragmentation is a real issue in practice. We lose
too much to prevent long term fragmentation. I think it's time to fix
up the real issue (yours and David's) by giving up on the solution for
long term fragmentation.

If someone doesn't agree with the above solution, your approach looks like
the second best to me, though there is something to optimize.

I think that we don't need to track the pageblock's freepage state
precisely. Compaction is a far rarer event than page allocation, so
compaction can tolerate false positives.

So, my suggestion is:

1) Use 1 bit for the pageblock. Reusing PB_migrate_skip looks the best
to me.
2) Mark PB_migrate_skip only in free path and only when needed.
Unmark it in compaction if freepage scan fails in that pageblock.
In compaction, skip the pageblock if PB_migrate_skip is set. It means
that there is no freepage in the pageblock.

Following is some code about my suggestion.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 90b1996..c292ad2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -798,12 +798,17 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 static inline void __free_one_page(struct page *page,
                unsigned long pfn,
                struct zone *zone, unsigned int order,
-               int migratetype)
+               int pageblock_flag)
 {
        unsigned long combined_pfn;
        unsigned long uninitialized_var(buddy_pfn);
        struct page *buddy;
        unsigned int max_order;
+       int migratetype = pageblock_flag & MT_MASK;
+       int need_set_skip = !(pageblock_flag & SKIP_MASK);
+
+       if (unlikely(need_set_skip))
+               set_pageblock_skip(page);
 
        max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
 
@@ -1155,7 +1160,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 static void free_one_page(struct zone *zone,
                                struct page *page, unsigned long pfn,
                                unsigned int order,
-                               int migratetype)
+                               int pageblock_flag)
 {
        spin_lock(&zone->lock);
        if (unlikely(has_isolate_pageblock(zone) ||
@@ -1248,10 +1253,10 @@ static void __free_pages_ok(struct page *page, unsigned int order)
        if (!free_pages_prepare(page, order, true))
                return;
 
-       migratetype = get_pfnblock_migratetype(page, pfn);
+       pageblock_flag = get_pfnblock_flag(page, pfn);
        local_irq_save(flags);
        __count_vm_events(PGFREE, 1 << order);
-       free_one_page(page_zone(page), page, pfn, order, migratetype);
+       free_one_page(page_zone(page), page, pfn, order, pageblock_flag);
        local_irq_restore(flags);
 }

We already access the pageblock flags for the migratetype, so reusing them
would reduce cache-line overhead. And updating the bit only happens when the
first freepage in the pageblock is freed. We don't need to modify the
allocation path since we don't track the freepage state precisely. I guess
that this solution has almost no overhead in the allocation/free path.

If an allocation happens after a free, compaction would see a false positive
and would scan the pageblock uselessly. But, as mentioned above, compaction
is a far rarer event, so doing more work in compaction while reducing the
overhead on the allocation/free path seems better to me.

Johannes, what do you think about it?

Thanks.

> 
> ---
> 
> From 115c76ee34c4c133e527b8b5358a8baed09d5bfb Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Fri, 16 Jun 2017 12:26:01 -0400
> Subject: [PATCH] mm: fast free bitmap for compaction free scanner
> 
> XXX: memory hotplug does "bootmem registering" for usemap
> XXX: evaluate page allocator performance impact of bitmap
> XXX: evaluate skip efficiency after some uptime
> 
> On Facebook machines, we routinely observe kcompactd running at
> 80-100% of CPU, spending most cycles in isolate_freepages_block(). The
> allocations that trigger this are order-3 requests coming in from the
> network stack at a rate of hundreds per second. In 4.6, the order-2
> kernel stack allocations on each fork also heavily contributed to that
> load; luckily we can use vmap stacks in later kernels. Still, there is
> something to be said about the scalability of the compaction free page
> scanner when we're looking at systems with hundreds of gigs of memory.
> 
> The compaction code scans movable pages and free pages from opposite
> ends of the PFN range. By packing used pages into one end of RAM, it
> frees up contiguous blocks at the other end. However, free pages
> usually don't make up more than 1-2% of a system's memory - that's a
> small needle for a linear search through the haystack. Looking at page
> structs one-by-one to find these pages is a serious bottleneck.
> 
> Our workaround in the Facebook fleet has been to bump min_free_kbytes
> to several gigabytes, just to make the needle bigger. But in the
> long-term that's not a very satisfying answer to the problem.
> 
> This patch sets up a bitmap of free pages that the page allocator
> maintains and the compaction free scanner can consult to quickly skip
> over the majority of page blocks that have no free pages left in them.
> 
> A 24h production A/B test in our fleet showed a 62.67% reduction in
> cycles spent in isolate_freepages_block(). The load on the machines
> isn't exactly the same, but the patched kernel actually finishes more
> jobs/minute and puts, when adding up compact_free_scanned and the new
> compact_free_skipped, much more pressure on the compaction subsystem.
> 
> One bit per 4k page means the bitmap consumes 0.02% of total memory.
> 
> Not-yet-signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/mmzone.h        | 12 ++++++--
>  include/linux/vm_event_item.h |  3 +-
>  mm/compaction.c               |  9 +++++-
>  mm/page_alloc.c               | 71 +++++++++++++++++++++++++++++++++++++++----
>  mm/sparse.c                   | 64 +++++++++++++++++++++++++++++++++++---
>  mm/vmstat.c                   |  1 +
>  6 files changed, 145 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index ef6a13b7bd3e..55c663f3da69 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -374,9 +374,12 @@ struct zone {
>  
>  #ifndef CONFIG_SPARSEMEM
>  	/*
> -	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
> -	 * In SPARSEMEM, this map is stored in struct mem_section
> +	 * Allocation bitmap and flags for a pageblock_nr_pages
> +	 * block. See pageblock-flags.h.
> +	 *
> +	 * In SPARSEMEM, this map is stored in struct mem_section
>  	 */
> +	unsigned long		*pageblock_freemap;
>  	unsigned long		*pageblock_flags;
>  #endif /* CONFIG_SPARSEMEM */
>  
> @@ -768,6 +771,7 @@ bool zone_watermark_ok(struct zone *z, unsigned int order,
>  		unsigned int alloc_flags);
>  bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
>  		unsigned long mark, int classzone_idx);
> +int test_page_freemap(struct page *page, unsigned int nr_pages);
>  enum memmap_context {
>  	MEMMAP_EARLY,
>  	MEMMAP_HOTPLUG,
> @@ -1096,7 +1100,8 @@ struct mem_section {
>  	 */
>  	unsigned long section_mem_map;
>  
> -	/* See declaration of similar field in struct zone */
> +	/* See declaration of similar fields in struct zone */
> +	unsigned long *pageblock_freemap;
>  	unsigned long *pageblock_flags;
>  #ifdef CONFIG_PAGE_EXTENSION
>  	/*
> @@ -1104,6 +1109,7 @@ struct mem_section {
>  	 * section. (see page_ext.h about this.)
>  	 */
>  	struct page_ext *page_ext;
> +#else
>  	unsigned long pad;
>  #endif
>  	/*
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index d84ae90ccd5c..6d6371df551b 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -52,7 +52,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
>  #endif
>  #ifdef CONFIG_COMPACTION
> -		COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
> +		COMPACTMIGRATE_SCANNED,
> +		COMPACTFREE_SKIPPED, COMPACTFREE_SCANNED,
>  		COMPACTISOLATED,
>  		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
>  		KCOMPACTD_WAKE,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 613c59e928cb..1da4e557eaca 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -420,6 +420,13 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>  
>  	cursor = pfn_to_page(blockpfn);
>  
> +	/* Usually, most memory is used. Skip full blocks quickly */
> +	if (!strict && !test_page_freemap(cursor, end_pfn - blockpfn)) {
> +		count_compact_events(COMPACTFREE_SKIPPED, end_pfn - blockpfn);
> +		blockpfn = end_pfn;
> +		goto skip_full;
> +	}
> +
>  	/* Isolate free pages. */
>  	for (; blockpfn < end_pfn; blockpfn++, cursor++) {
>  		int isolated;
> @@ -525,7 +532,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>  	 */
>  	if (unlikely(blockpfn > end_pfn))
>  		blockpfn = end_pfn;
> -
> +skip_full:
>  	trace_mm_compaction_isolate_freepages(*start_pfn, blockpfn,
>  					nr_scanned, total_isolated);
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2302f250d6b1..5076c982d06a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -364,6 +364,54 @@ static inline bool update_defer_init(pg_data_t *pgdat,
>  }
>  #endif
>  
> +#ifdef CONFIG_SPARSEMEM
> +static void load_freemap(struct page *page,
> +			 unsigned long **bits, unsigned int *idx)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +
> +	*bits = __pfn_to_section(pfn)->pageblock_freemap;
> +	*idx = pfn & (PAGES_PER_SECTION - 1);
> +}
> +#else
> +static void load_freemap(struct page *page,
> +			 unsigned long **bits, unsigned int *idx)
> +{
> +	unsigned long pfn = page_to_pfn(page);
> +	struct zone *zone = page_zone(page);
> +
> +	*bits = zone->pageblock_freemap;
> +	*idx = pfn - zone->zone_start_pfn;
> +}
> +#endif /* CONFIG_SPARSEMEM */
> +
> +static void set_page_freemap(struct page *page, int order)
> +{
> +	unsigned long *bits;
> +	unsigned int idx;
> +
> +	load_freemap(page, &bits, &idx);
> +	bitmap_set(bits, idx, 1 << order);
> +}
> +
> +static void clear_page_freemap(struct page *page, int order)
> +{
> +	unsigned long *bits;
> +	unsigned int idx;
> +
> +	load_freemap(page, &bits, &idx);
> +	bitmap_clear(bits, idx, 1 << order);
> +}
> +
> +int test_page_freemap(struct page *page, unsigned int nr_pages)
> +{
> +	unsigned long *bits;
> +	unsigned int idx;
> +
> +	load_freemap(page, &bits, &idx);
> +	return !bitmap_empty(bits + idx, nr_pages);
> +}
> +
>  /* Return a pointer to the bitmap storing bits affecting a block of pages */
>  static inline unsigned long *get_pageblock_bitmap(struct page *page,
>  							unsigned long pfn)
> @@ -718,12 +766,14 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
>  
>  static inline void set_page_order(struct page *page, unsigned int order)
>  {
> +	set_page_freemap(page, order);
>  	set_page_private(page, order);
>  	__SetPageBuddy(page);
>  }
>  
>  static inline void rmv_page_order(struct page *page)
>  {
> +	clear_page_freemap(page, page_private(page));
>  	__ClearPageBuddy(page);
>  	set_page_private(page, 0);
>  }
> @@ -5906,14 +5956,16 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
>   * round what is now in bits to nearest long in bits, then return it in
>   * bytes.
>   */
> -static unsigned long __init usemap_size(unsigned long zone_start_pfn, unsigned long zonesize)
> +static unsigned long __init map_size(unsigned long zone_start_pfn,
> +				     unsigned long zonesize,
> +				     unsigned int bits)
>  {
>  	unsigned long usemapsize;
>  
>  	zonesize += zone_start_pfn & (pageblock_nr_pages-1);
>  	usemapsize = roundup(zonesize, pageblock_nr_pages);
>  	usemapsize = usemapsize >> pageblock_order;
> -	usemapsize *= NR_PAGEBLOCK_BITS;
> +	usemapsize *= bits;
>  	usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long));
>  
>  	return usemapsize / 8;
> @@ -5924,12 +5976,19 @@ static void __init setup_usemap(struct pglist_data *pgdat,
>  				unsigned long zone_start_pfn,
>  				unsigned long zonesize)
>  {
> -	unsigned long usemapsize = usemap_size(zone_start_pfn, zonesize);
> +	unsigned long size;
> +
> +	zone->pageblock_freemap = NULL;
>  	zone->pageblock_flags = NULL;
> -	if (usemapsize)
> +
> +	size = map_size(zone_start_pfn, zonesize, 1);
> +	if (size)
> +		zone->pageblock_freemap =
> +			memblock_virt_alloc_node_nopanic(size, pgdat->node_id);
> +	size = map_size(zone_start_pfn, zonesize, NR_PAGEBLOCK_BITS);
> +	if (size)
>  		zone->pageblock_flags =
> -			memblock_virt_alloc_node_nopanic(usemapsize,
> -							 pgdat->node_id);
> +			memblock_virt_alloc_node_nopanic(size, pgdat->node_id);
>  }
>  #else
>  static inline void setup_usemap(struct pglist_data *pgdat, struct zone *zone,
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 6903c8fc3085..f295b012cac9 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -233,7 +233,7 @@ struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pn
>  
>  static int __meminit sparse_init_one_section(struct mem_section *ms,
>  		unsigned long pnum, struct page *mem_map,
> -		unsigned long *pageblock_bitmap)
> +		unsigned long *pageblock_freemap, unsigned long *pageblock_flags)
>  {
>  	if (!present_section(ms))
>  		return -EINVAL;
> @@ -241,17 +241,27 @@ static int __meminit sparse_init_one_section(struct mem_section *ms,
>  	ms->section_mem_map &= ~SECTION_MAP_MASK;
>  	ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
>  							SECTION_HAS_MEM_MAP;
> - 	ms->pageblock_flags = pageblock_bitmap;
> +	ms->pageblock_freemap = pageblock_freemap;
> +	ms->pageblock_flags = pageblock_flags;
>  
>  	return 1;
>  }
>  
> +unsigned long freemap_size(void)
> +{
> +	return BITS_TO_LONGS(PAGES_PER_SECTION) * sizeof(unsigned long);
> +}
> +
>  unsigned long usemap_size(void)
>  {
>  	return BITS_TO_LONGS(SECTION_BLOCKFLAGS_BITS) * sizeof(unsigned long);
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> +static unsigned long *__kmalloc_section_freemap(void)
> +{
> +	return kmalloc(freemap_size(), GFP_KERNEL);
> +}
>  static unsigned long *__kmalloc_section_usemap(void)
>  {
>  	return kmalloc(usemap_size(), GFP_KERNEL);
> @@ -338,6 +348,32 @@ static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
>  }
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>  
> +static void __init sparse_early_freemaps_alloc_node(void *data,
> +				 unsigned long pnum_begin,
> +				 unsigned long pnum_end,
> +				 unsigned long freemap_count, int nodeid)
> +{
> +	void *freemap;
> +	unsigned long pnum;
> +	unsigned long **freemap_map = (unsigned long **)data;
> +	int size = freemap_size();
> +
> +	freemap = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nodeid),
> +							size * freemap_count);
> +	if (!freemap) {
> +		pr_warn("%s: allocation failed\n", __func__);
> +		return;
> +	}
> +
> +	for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
> +		if (!present_section_nr(pnum))
> +			continue;
> +		freemap_map[pnum] = freemap;
> +		freemap += size;
> +		check_usemap_section_nr(nodeid, freemap_map[pnum]);
> +	}
> +}
> +
>  static void __init sparse_early_usemaps_alloc_node(void *data,
>  				 unsigned long pnum_begin,
>  				 unsigned long pnum_end,
> @@ -520,6 +556,8 @@ void __init sparse_init(void)
>  {
>  	unsigned long pnum;
>  	struct page *map;
> +	unsigned long *freemap;
> +	unsigned long **freemap_map;
>  	unsigned long *usemap;
>  	unsigned long **usemap_map;
>  	int size;
> @@ -546,6 +584,12 @@ void __init sparse_init(void)
>  	 * sparse_early_mem_map_alloc, so allocate usemap_map at first.
>  	 */
>  	size = sizeof(unsigned long *) * NR_MEM_SECTIONS;
> +	freemap_map = memblock_virt_alloc(size, 0);
> +	if (!freemap_map)
> +		panic("can not allocate freemap_map\n");
> +	alloc_usemap_and_memmap(sparse_early_freemaps_alloc_node,
> +							(void *)freemap_map);
> +
>  	usemap_map = memblock_virt_alloc(size, 0);
>  	if (!usemap_map)
>  		panic("can not allocate usemap_map\n");
> @@ -565,6 +609,10 @@ void __init sparse_init(void)
>  		if (!present_section_nr(pnum))
>  			continue;
>  
> +		freemap = freemap_map[pnum];
> +		if (!freemap)
> +			continue;
> +
>  		usemap = usemap_map[pnum];
>  		if (!usemap)
>  			continue;
> @@ -578,7 +626,7 @@ void __init sparse_init(void)
>  			continue;
>  
>  		sparse_init_one_section(__nr_to_section(pnum), pnum, map,
> -								usemap);
> +					freemap, usemap);
>  	}
>  
>  	vmemmap_populate_print_last();
> @@ -692,6 +740,7 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn)
>  	struct pglist_data *pgdat = zone->zone_pgdat;
>  	struct mem_section *ms;
>  	struct page *memmap;
> +	unsigned long *freemap;
>  	unsigned long *usemap;
>  	unsigned long flags;
>  	int ret;
> @@ -706,8 +755,14 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn)
>  	memmap = kmalloc_section_memmap(section_nr, pgdat->node_id);
>  	if (!memmap)
>  		return -ENOMEM;
> +	freemap = __kmalloc_section_freemap();
> +	if (!freemap) {
> +		__kfree_section_memmap(memmap);
> +		return -ENOMEM;
> +	}
>  	usemap = __kmalloc_section_usemap();
>  	if (!usemap) {
> +		kfree(freemap);
>  		__kfree_section_memmap(memmap);
>  		return -ENOMEM;
>  	}
> @@ -724,12 +779,13 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn)
>  
>  	ms->section_mem_map |= SECTION_MARKED_PRESENT;
>  
> -	ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
> +	ret = sparse_init_one_section(ms, section_nr, memmap, freemap, usemap);
>  
>  out:
>  	pgdat_resize_unlock(pgdat, &flags);
>  	if (ret <= 0) {
>  		kfree(usemap);
> +		kfree(freemap);
>  		__kfree_section_memmap(memmap);
>  	}
>  	return ret;
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 76f73670200a..e10a8213a562 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1032,6 +1032,7 @@ const char * const vmstat_text[] = {
>  #endif
>  #ifdef CONFIG_COMPACTION
>  	"compact_migrate_scanned",
> +	"compact_free_skipped",
>  	"compact_free_scanned",
>  	"compact_isolated",
>  	"compact_stall",
> -- 
> 2.13.3
> 


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] proactive kcompactd
  2017-08-23  5:36     ` Joonsoo Kim
@ 2017-08-23  8:12       ` Vlastimil Babka
  2017-08-24  6:24         ` Joonsoo Kim
  0 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2017-08-23  8:12 UTC (permalink / raw)
  To: Joonsoo Kim, Johannes Weiner
  Cc: David Rientjes, linux-mm, Mel Gorman, Michal Hocko,
	Andrea Arcangeli, Rik van Riel

On 08/23/2017 07:36 AM, Joonsoo Kim wrote:
> On Mon, Aug 21, 2017 at 10:10:14AM -0400, Johannes Weiner wrote:
>> On Wed, Aug 09, 2017 at 01:58:42PM -0700, David Rientjes wrote:
>>> On Thu, 27 Jul 2017, Vlastimil Babka wrote:
>>>
>>>> As we discussed at last LSF/MM [1], the goal here is to shift more compaction
>>>> work to kcompactd, which currently just makes a single high-order page
>>>> available and then goes to sleep. The last patch, evolved from the initial RFC
>>>> [2] does this by recording for each order > 0 how many allocations would have
>>>> potentially be able to skip direct compaction, if the memory wasn't fragmented.
>>>> Kcompactd then tries to compact as long as it takes to make that many
>>>> allocations satisfiable. This approach avoids any hooks in allocator fast
>>>> paths. There are more details to this, see the last patch.
>>>>
>>>
>>> I think I would have liked to have seen "less proactive" :)
>>>
>>> Kcompactd currently has the problem that it is MIGRATE_SYNC_LIGHT so it 
>>> continues until it can defragment memory.  On a host with 128GB of memory 
>>> and 100GB of it sitting in a hugetlb pool, we constantly get kcompactd 
>>> wakeups for order-2 memory allocation.  The stats are pretty bad:
>>>
>>> compact_migrate_scanned 2931254031294 
>>> compact_free_scanned    102707804816705 
>>> compact_isolated        1309145254 
>>>
>>> 0.0012% of memory scanned is ever actually isolated.  We constantly see 
>>> very high cpu for compaction_alloc() because kcompactd is almost always 
>>> running in the background and iterating most memory completely needlessly 
>>> (define needless as 0.0012% of memory scanned being isolated).
>>
>> The free page scanner will inevitably wade through mostly used memory,
>> but 0.0012% is lower than what systems usually have free. I'm guessing
>> this is because of concurrent allocation & free cycles racing with the
>> scanner? There could also be an issue with how we do partial scans.
>>
>> Anyway, we've also noticed scalability issues with the current scanner
>> on 128G and 256G machines. Even with a better efficiency - finding the
>> 1% of free memory, that's still a ton of linear search space.
>>
>> I've been toying around with the below patch. It adds a free page
>> bitmap, allowing the free scanner to quickly skip over the vast areas
>> of used memory. I don't have good data on skip-efficiency at higher
>> uptimes and the resulting fragmentation yet. The overhead added to the
>> page allocator is concerning, but I cannot think of a better way to
>> make the search more efficient. What do you guys think?
> 
> Hello, Johannes.
> 
> I think that the best solution is that the compaction doesn't do linear
> scan completely. Vlastimil already have suggested that idea.

I was going to bring this up here, thanks :)

> mm, compaction: direct freepage allocation for async direct
> compaction
> 
> lkml.kernel.org/r/<1459414236-9219-5-git-send-email-vbabka@suse.cz>
> 
> It uses the buddy allocator to get a freepage so there is no linear
> scan. It would completely remove scalability issue.

Another big advantage is that migration scanner would get to see the
whole zone, and not be biased towards the first 1/3 until it meets the
free scanner. And another advantage is that we wouldn't be splitting
free pages needlessly.

> Unfortunately, he applied this idea only to async compaction since
> changing the other compaction mode will probably cause long term
> fragmentation. And, I disagreed with that idea at that time since
> different compaction logic for different compaction mode would make
> the system more unpredicatable.
> 
> I doubt long term fragmentation is a real issue in practice. We loses
> too much things to prevent long term fragmentation. I think that it's
> the time to fix up the real issue (yours and David's) by giving up the
> solution for long term fragmentation.

I'm now also more convinced that this direction should be pursued, and
wanted to get to it after the proactive kcompactd part. My biggest
concern is that freelists can give us the pages from the same block that
we (or somebody else) is trying to compact (migrate away). Isolating
(i.e. MIGRATE_ISOLATE) the block first would work, but the overhead of
the isolation could be significant. But I have some alternative ideas
that could be tried.

> If someone doesn't agree with above solution, your approach looks the
> second best to me. Though, there is something to optimize.
> 
> I think that we don't need to be precise to track the pageblock's
> freepage state. Compaction is a far rare event compared to page
> allocation so compaction could be tolerate with false positive.
> 
> So, my suggestion is:
> 
> 1) Use 1 bit for the pageblock. Reusing PB_migrate_skip looks the best
> to me.

Wouldn't the reusing cripple the original use for the migration scanner?

> 2) Mark PB_migrate_skip only in free path and only when needed.
> Unmark it in compaction if freepage scan fails in that pageblock.
> In compaction, skip the pageblock if PB_migrate_skip is set. It means
> that there is no freepage in the pageblock.
> 
> Following is some code about my suggestion.

Otherwise it sounds like it could work until the direct allocation
approach is fully developed (or turns out to be infeasible).

Thanks.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 90b1996..c292ad2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -798,12 +798,17 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
>  static inline void __free_one_page(struct page *page,
>                 unsigned long pfn,
>                 struct zone *zone, unsigned int order,
> -               int migratetype)
> +               int pageblock_flag)
>  {
>         unsigned long combined_pfn;
>         unsigned long uninitialized_var(buddy_pfn);
>         struct page *buddy;
>         unsigned int max_order;
> +       int migratetype = pageblock_flag & MT_MASK;
> +       int need_set_skip = !(pageblock_flag & SKIP_MASK);
> +
> +       if (unlikely(need_set_skip))
> +               set_pageblock_skip(page);
>  
>         max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
>  
> @@ -1155,7 +1160,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  static void free_one_page(struct zone *zone,
>                                 struct page *page, unsigned long pfn,
>                                 unsigned int order,
> -                               int migratetype)
> +                               int pageblock_flag)
>  {
>         spin_lock(&zone->lock);
>         if (unlikely(has_isolate_pageblock(zone) ||
> @@ -1248,10 +1253,10 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>         if (!free_pages_prepare(page, order, true))
>                 return;
>  
> -       migratetype = get_pfnblock_migratetype(page, pfn);
> +       pageblock_flage = get_pfnblock_flag(page, pfn);
>         local_irq_save(flags);
>         __count_vm_events(PGFREE, 1 << order);
> -       free_one_page(page_zone(page), page, pfn, order, migratetype);
> +       free_one_page(page_zone(page), page, pfn, order, pageblock_flag);
>         local_irq_restore(flags);
>  }
> 
> We already access the pageblock flag for migratetype. Reusing it would
> reduce cache-line overhead. And, updating bit only happens when first
> freepage in the pageblock is freed. We don't need to modify allocation
> path since we don't track the freepage state precisly. I guess that
> this solution has almost no overhead in allocation/free path.
> 
> If allocation happens after free, compaction would see false-positive
> so it would scan the pageblock uselessly. But, as mentioned above,
> compaction is a far rare event so doing more thing in the compaction
> with reducing the overhead on allocation/free path seems better to me.
> 
> Johannes, what do you think about it?
> 
> Thanks.
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] proactive kcompactd
  2017-08-23  8:12       ` Vlastimil Babka
@ 2017-08-24  6:24         ` Joonsoo Kim
  2017-08-24 11:30           ` Vlastimil Babka
  0 siblings, 1 reply; 21+ messages in thread
From: Joonsoo Kim @ 2017-08-24  6:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Johannes Weiner, David Rientjes, linux-mm, Mel Gorman,
	Michal Hocko, Andrea Arcangeli, Rik van Riel

On Wed, Aug 23, 2017 at 10:12:14AM +0200, Vlastimil Babka wrote:
> On 08/23/2017 07:36 AM, Joonsoo Kim wrote:
> > On Mon, Aug 21, 2017 at 10:10:14AM -0400, Johannes Weiner wrote:
> >> On Wed, Aug 09, 2017 at 01:58:42PM -0700, David Rientjes wrote:
> >>> On Thu, 27 Jul 2017, Vlastimil Babka wrote:
> >>>
> >>>> As we discussed at last LSF/MM [1], the goal here is to shift more compaction
> >>>> work to kcompactd, which currently just makes a single high-order page
> >>>> available and then goes to sleep. The last patch, evolved from the initial RFC
> >>>> [2] does this by recording for each order > 0 how many allocations would have
> >>>> potentially be able to skip direct compaction, if the memory wasn't fragmented.
> >>>> Kcompactd then tries to compact as long as it takes to make that many
> >>>> allocations satisfiable. This approach avoids any hooks in allocator fast
> >>>> paths. There are more details to this, see the last patch.
> >>>>
> >>>
> >>> I think I would have liked to have seen "less proactive" :)
> >>>
> >>> Kcompactd currently has the problem that it is MIGRATE_SYNC_LIGHT so it 
> >>> continues until it can defragment memory.  On a host with 128GB of memory 
> >>> and 100GB of it sitting in a hugetlb pool, we constantly get kcompactd 
> >>> wakeups for order-2 memory allocation.  The stats are pretty bad:
> >>>
> >>> compact_migrate_scanned 2931254031294 
> >>> compact_free_scanned    102707804816705 
> >>> compact_isolated        1309145254 
> >>>
> >>> 0.0012% of memory scanned is ever actually isolated.  We constantly see 
> >>> very high cpu for compaction_alloc() because kcompactd is almost always 
> >>> running in the background and iterating most memory completely needlessly 
> >>> (define needless as 0.0012% of memory scanned being isolated).
> >>
> >> The free page scanner will inevitably wade through mostly used memory,
> >> but 0.0012% is lower than what systems usually have free. I'm guessing
> >> this is because of concurrent allocation & free cycles racing with the
> >> scanner? There could also be an issue with how we do partial scans.
> >>
> >> Anyway, we've also noticed scalability issues with the current scanner
> >> on 128G and 256G machines. Even with a better efficiency - finding the
> >> 1% of free memory, that's still a ton of linear search space.
> >>
> >> I've been toying around with the below patch. It adds a free page
> >> bitmap, allowing the free scanner to quickly skip over the vast areas
> >> of used memory. I don't have good data on skip-efficiency at higher
> >> uptimes and the resulting fragmentation yet. The overhead added to the
> >> page allocator is concerning, but I cannot think of a better way to
> >> make the search more efficient. What do you guys think?
> > 
> > Hello, Johannes.
> > 
> > I think that the best solution is that the compaction doesn't do linear
> > scan completely. Vlastimil already have suggested that idea.
> 
> I was going to bring this up here, thanks :)
> 
> > mm, compaction: direct freepage allocation for async direct
> > compaction
> > 
> > lkml.kernel.org/r/<1459414236-9219-5-git-send-email-vbabka@suse.cz>
> > 
> > It uses the buddy allocator to get a freepage so there is no linear
> > scan. It would completely remove scalability issue.
> 
> Another big advantage is that migration scanner would get to see the
> whole zone, and not be biased towards the first 1/3 until it meets the
> free scanner. And another advantage is that we wouldn't be splitting
> free pages needlessly.
> 
> > Unfortunately, he applied this idea only to async compaction since
> > changing the other compaction mode will probably cause long term
> > fragmentation. And, I disagreed with that idea at that time since
> > different compaction logic for different compaction mode would make
> > the system more unpredicatable.
> > 
> > I doubt long term fragmentation is a real issue in practice. We loses
> > too much things to prevent long term fragmentation. I think that it's
> > the time to fix up the real issue (yours and David's) by giving up the
> > solution for long term fragmentation.
> 
> I'm now also more convinced that this direction should be pursued, and
> wanted to get to it after the proactive kcompactd part. My biggest
> concern is that freelists can give us the pages from the same block that
> we (or somebody else) is trying to compact (migrate away). Isolating
> (i.e. MIGRATE_ISOLATE) the block first would work, but the overhead of
> the isolation could be significant. But I have some alternative ideas
> that could be tried.
> 
> > If someone doesn't agree with above solution, your approach looks the
> > second best to me. Though, there is something to optimize.
> > 
> > I think that we don't need to be precise to track the pageblock's
> > freepage state. Compaction is a far rare event compared to page
> > allocation so compaction could be tolerate with false positive.
> > 
> > So, my suggestion is:
> > 
> > 1) Use 1 bit for the pageblock. Reusing PB_migrate_skip looks the best
> > to me.
> 
> Wouldn't the reusing cripple the original use for the migration scanner?

I think that there is no serious problem. A problem happens only if we set
PB_migrate_skip wrongly. Consider the following two cases that set
PB_migrate_skip.

1) The migration scanner finds that all pages in the pageblock are pinned.
-> set skip -> it is cleared after one of the pages is freed. No
problem.

There is a possibility that a temporarily pinned page is unpinned and we
miss this pageblock, but that would be a minor case.

2) The migration scanner finds that all pages in the pageblock are free.
-> set skip -> we can miss the pageblock for a long time.

We need to fix case 2) in order to reuse PB_migrate_skip. I guess that
just counting the number of freepages in isolate_migratepages_block()
and using that count to decide not to set PB_migrate_skip will work.

> 
> > 2) Mark PB_migrate_skip only in free path and only when needed.
> > Unmark it in compaction if freepage scan fails in that pageblock.
> > In compaction, skip the pageblock if PB_migrate_skip is set. It means
> > that there is no freepage in the pageblock.
> > 
> > Following is some code about my suggestion.
> 
> Otherwise is sounds like it could work until the direct allocation
> approach is fully developed (or turns out to be infeasible).

Agreed.

Thanks.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] proactive kcompactd
  2017-08-24  6:24         ` Joonsoo Kim
@ 2017-08-24 11:30           ` Vlastimil Babka
  2017-08-24 23:51             ` Joonsoo Kim
  0 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2017-08-24 11:30 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Johannes Weiner, David Rientjes, linux-mm, Mel Gorman,
	Michal Hocko, Andrea Arcangeli, Rik van Riel

On 08/24/2017 08:24 AM, Joonsoo Kim wrote:
>>
>>> If someone doesn't agree with above solution, your approach looks the
>>> second best to me. Though, there is something to optimize.
>>>
>>> I think that we don't need to be precise to track the pageblock's
>>> freepage state. Compaction is a far rare event compared to page
>>> allocation so compaction could be tolerate with false positive.
>>>
>>> So, my suggestion is:
>>>
>>> 1) Use 1 bit for the pageblock. Reusing PB_migrate_skip looks the best
>>> to me.
>>
>> Wouldn't the reusing cripple the original use for the migration scanner?
> 
> I think that there is no serious problem. Problem happens if we set
> PB_migrate_skip wrongly. Consider following two cases that set
> PB_migrate_skip.
> 
> 1) migration scanner find that whole pages in the pageblock is pinned.
> -> set skip -> it is cleared after one of the page is freed. No
> problem.
> 
> There is a possibility that temporary pinned page is unpinned and we
> miss this pageblock but it would be minor case.
> 
> 2) migration scanner find that whole pages in the pageblock are free.
> -> set skip -> we can miss the pageblock for a long time.

On second thought, this is probably not an issue. If a whole pageblock is
free, then there's most likely no reason for compaction to be running.
It's also not likely that the migration scanner would see a pageblock that
the free scanner has processed previously, which is why we already use a
single bit for both scanners.

But I realized your code seems wrong. You want to set the skip bit when a
page is freed, although for the free scanner that means a page has
become available, so we would actually want to *clear* the bit in that
case. That could indeed be much more accurate for kcompactd (which runs
after kswapd reclaim) than its current ignore_skip_hint usage.

> We need to fix 2) case in order to reuse PB_migrate_skip. I guess that
> just counting the number of freepage in isolate_migratepages_block()
> and considering it to not set PB_migrate_skip will work.
> 
>>
>>> 2) Mark PB_migrate_skip only in free path and only when needed.
>>> Unmark it in compaction if freepage scan fails in that pageblock.
>>> In compaction, skip the pageblock if PB_migrate_skip is set. It means
>>> that there is no freepage in the pageblock.
>>>
>>> Following is some code about my suggestion.
>>
>> Otherwise is sounds like it could work until the direct allocation
>> approach is fully developed (or turns out to be infeasible).
> 
> Agreed.
> 
> Thanks.
> 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] proactive kcompactd
  2017-08-24 11:30           ` Vlastimil Babka
@ 2017-08-24 23:51             ` Joonsoo Kim
  0 siblings, 0 replies; 21+ messages in thread
From: Joonsoo Kim @ 2017-08-24 23:51 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Johannes Weiner, David Rientjes, linux-mm, Mel Gorman,
	Michal Hocko, Andrea Arcangeli, Rik van Riel

On Thu, Aug 24, 2017 at 01:30:24PM +0200, Vlastimil Babka wrote:
> On 08/24/2017 08:24 AM, Joonsoo Kim wrote:
> >>
> >>> If someone doesn't agree with above solution, your approach looks the
> >>> second best to me. Though, there is something to optimize.
> >>>
> >>> I think that we don't need to be precise to track the pageblock's
> >>> freepage state. Compaction is a far rare event compared to page
> >>> allocation so compaction could be tolerate with false positive.
> >>>
> >>> So, my suggestion is:
> >>>
> >>> 1) Use 1 bit for the pageblock. Reusing PB_migrate_skip looks the best
> >>> to me.
> >>
> >> Wouldn't the reusing cripple the original use for the migration scanner?
> > 
> > I think that there is no serious problem. Problem happens if we set
> > PB_migrate_skip wrongly. Consider following two cases that set
> > PB_migrate_skip.
> > 
> > 1) migration scanner find that whole pages in the pageblock is pinned.
> > -> set skip -> it is cleared after one of the page is freed. No
> > problem.
> > 
> > There is a possibility that temporary pinned page is unpinned and we
> > miss this pageblock but it would be minor case.
> > 
> > 2) migration scanner find that whole pages in the pageblock are free.
> > -> set skip -> we can miss the pageblock for a long time.
> 
> On second thought, this is probably not an issue. If whole pageblock is
> free, then there's most likely no reason for compaction to be running.
> It's also not likely that migrate scanner would see a pageblock that the
> free scanner has processed previously, which is why we already use
> single bit for both scanners.

Think about the case where the migration scanner sees a pageblock where
all pages are free and sets the skip bit. Some time later, those pages
could be allocated and not freed for a long time. Compaction cannot notice
that the pageblock now has migratable pages and skips it for a long time.
It would also be a minor case, but I think that considering it is the
safer way.

> But I realized your code seems wrong. You want to set skip bit when a
> page is freed, although for the free scanner that means a page has
> become available so we would actually want to *clear* the bit in that
> case. That could be indeed much more accurate for kcompactd (which runs
> after kswapd reclaim) than its ignore_skip_hint usage

Oops... I also realized my code is wrong. My intention was to clear the
skip bit when freeing the page. :)
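
A compilable toy model of that corrected scheme, where a per-pageblock "no
free pages here" bit is cleared whenever a page in the block is freed and
set by the free scanner when a scan of the block comes up empty; all names
and sizes are illustrative assumptions:

#include <stdbool.h>
#include <stdio.h>

#define NR_PAGEBLOCKS 8

static bool pb_skip[NR_PAGEBLOCKS];	/* true: assume no free pages */

static void on_page_freed(unsigned int block)
{
	/* a page became free, so the block is worth scanning again */
	pb_skip[block] = false;
}

static void on_free_scan_failed(unsigned int block)
{
	/* the free scanner found nothing: skip this block next time */
	pb_skip[block] = true;
}

static bool should_scan_block(unsigned int block)
{
	return !pb_skip[block];
}

int main(void)
{
	on_free_scan_failed(3);
	printf("scan block 3? %d\n", should_scan_block(3));	/* 0 */
	on_page_freed(3);
	printf("scan block 3? %d\n", should_scan_block(3));	/* 1 */
	return 0;
}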

Thanks.


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-08-24 23:50 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-27 16:06 [RFC PATCH 0/6] proactive kcompactd Vlastimil Babka
2017-07-27 16:06 ` [PATCH 1/6] mm, kswapd: refactor kswapd_try_to_sleep() Vlastimil Babka
2017-07-28  9:38   ` Mel Gorman
2017-07-27 16:06 ` [PATCH 2/6] mm, kswapd: don't reset kswapd_order prematurely Vlastimil Babka
2017-07-28 10:16   ` Mel Gorman
2017-07-27 16:06 ` [PATCH 3/6] mm, kswapd: reset kswapd's order to 0 when it fails to reclaim enough Vlastimil Babka
2017-07-27 16:06 ` [PATCH 4/6] mm, kswapd: wake up kcompactd when kswapd had too many failures Vlastimil Babka
2017-07-28 10:41   ` Mel Gorman
2017-07-27 16:07 ` [RFC PATCH 5/6] mm, compaction: stop when number of free pages goes below watermark Vlastimil Babka
2017-07-28 10:43   ` Mel Gorman
2017-07-27 16:07 ` [RFC PATCH 6/6] mm: make kcompactd more proactive Vlastimil Babka
2017-07-28 10:58   ` Mel Gorman
2017-08-09 20:58 ` [RFC PATCH 0/6] proactive kcompactd David Rientjes
2017-08-21 14:10   ` Johannes Weiner
2017-08-21 21:40     ` Rik van Riel
2017-08-22 20:57     ` David Rientjes
2017-08-23  5:36     ` Joonsoo Kim
2017-08-23  8:12       ` Vlastimil Babka
2017-08-24  6:24         ` Joonsoo Kim
2017-08-24 11:30           ` Vlastimil Babka
2017-08-24 23:51             ` Joonsoo Kim
