[PATCH v2 0/5] introduce kcompactd and stop compacting in kswapd

* [PATCH v2 0/5] introduce kcompactd and stop compacting in kswapd
@ 2016-02-08 13:38 Vlastimil Babka
  2016-02-08 13:38 ` [PATCH v2 1/5] mm, kswapd: remove bogus check of balance_classzone_idx Vlastimil Babka
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Vlastimil Babka @ 2016-02-08 13:38 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: linux-kernel, Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel,
	Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner, Vlastimil Babka

The previous RFC is here [1]. It didn't have a cover letter, so the description
and results are in the individual patches.

Changes since v1:
- do only sync compaction in kcompactd (Mel)
- only compact zones up to classzone_idx (Mel)
- move wakeup_kcompactd() call from patch 2 to patch 4 (Mel)
- Patch 3 is separate from Patch 2 for review purposes, although I would just
  fold it in the end (Mel)
- Patch 5 is new
- retested on 4.5-rc1 with 5 repeats, which removed some counter-intuitive
  results and added more confidence

[1] https://lkml.org/lkml/2016/1/26/558

Vlastimil Babka (5):
  mm, kswapd: remove bogus check of balance_classzone_idx
  mm, compaction: introduce kcompactd
  mm, memory hotplug: small cleanup in online_pages()
  mm, kswapd: replace kswapd compaction with waking up kcompactd
  mm, compaction: adapt isolation_suitable flushing to kcompactd

 include/linux/compaction.h        |  16 +++
 include/linux/mmzone.h            |   6 +
 include/linux/vm_event_item.h     |   1 +
 include/trace/events/compaction.h |  55 +++++++++
 mm/compaction.c                   | 230 +++++++++++++++++++++++++++++++++++++-
 mm/internal.h                     |   1 +
 mm/memory_hotplug.c               |  15 ++-
 mm/page_alloc.c                   |   3 +
 mm/vmscan.c                       | 147 ++++++++----------------
 mm/vmstat.c                       |   1 +
 10 files changed, 366 insertions(+), 109 deletions(-)

-- 
2.7.0

* [PATCH v2 1/5] mm, kswapd: remove bogus check of balance_classzone_idx
  2016-02-08 13:38 [PATCH v2 0/5] introduce kcompactd and stop compacting in kswapd Vlastimil Babka
@ 2016-02-08 13:38 ` Vlastimil Babka
  2016-02-08 13:38 ` [PATCH v2 2/5] mm, compaction: introduce kcompactd Vlastimil Babka
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2016-02-08 13:38 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: linux-kernel, Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel,
	Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner, Vlastimil Babka

During work on kcompactd integration I have spotted a confusing check of
balanced_classzone_idx, which I believe is bogus.

The balanced_classzone_idx is filled by balance_pgdat() as the highest zone
it attempted to balance. This was introduced by commit dc83edd941f4 ("mm:
kswapd: use the classzone idx that kswapd was using for
sleeping_prematurely()"). The intention is that (as expressed in today's
function names), the value used for kswapd_shrink_zone() calls in
balance_pgdat() is the same as for the decisions in kswapd_try_to_sleep().
An unwanted side-effect of that commit was breaking the checks in kswapd()
whether there was another kswapd_wakeup with a tighter (=lower) classzone_idx.
Commits 215ddd6664ce ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully") and d2ebd0f6b895 ("kswapd: avoid unnecessary
rebalance after an unsuccessful balancing") tried to fix that, but apparently
introduced a bogus check that this patch removes.

Consider zone indexes X < Y < Z, where:
- Z is the value used for the first kswapd wakeup.
- Y is returned as balanced_classzone_idx, which means zones with index higher
  than Y (including Z) were found to be unreclaimable.
- X is the value used for the second kswapd wakeup

The new wakeup with value X means that kswapd is now supposed to balance all
zones with index <= X harder. But instead, due to Y < Z, it will go to sleep
and won't read the new value X. This is subtly wrong.
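
The X/Y/Z scenario can be replayed with a small userspace model. This is only
a sketch under stated assumptions: the function names and the IDX_* constants
are hypothetical, both orders are fixed at 0 so that only the classzone_idx
logic is exercised, and the final "don't sleep" test is my reading of the code
that follows the check in kswapd(), not something quoted from this patch.

```c
#include <stdbool.h>

/* Hypothetical zone indexes for illustration only: X < Y < Z. */
enum { IDX_X = 0, IDX_Y = 1, IDX_Z = 2 };

/* The check before this patch: also compares classzone indexes. */
static bool read_pending_old(int balanced_idx, int new_idx,
			     int balanced_order, int new_order)
{
	return balanced_idx >= new_idx && balanced_order == new_order;
}

/* The check after this patch: only the orders are compared. */
static bool read_pending_new(int balanced_order, int new_order)
{
	return balanced_order == new_order;
}

/*
 * Model one pass of the kswapd loop after a balancing attempt that was
 * started for classzone Z and reported Y back. pending_idx stands for the
 * value left in pgdat by the second wakeup. Returns true if kswapd keeps
 * working on the tighter request, false if it would go to sleep.
 */
static bool kswapd_keeps_going(bool patched, int pending_idx)
{
	int classzone_idx = IDX_Z;          /* used for the first wakeup */
	int new_classzone_idx = IDX_Z;
	int balanced_classzone_idx = IDX_Y; /* reported by balancing */
	int order = 0, new_order = 0, balanced_order = 0;
	bool read;

	read = patched ?
		read_pending_new(balanced_order, new_order) :
		read_pending_old(balanced_classzone_idx, new_classzone_idx,
				 balanced_order, new_order);
	if (read)
		new_classzone_idx = pending_idx;

	/* Assumed "don't sleep" test: a larger order or a tighter zone. */
	return order < new_order || classzone_idx > new_classzone_idx;
}
```

With the old check, kswapd_keeps_going(false, IDX_X) is false because Y >= Z
fails and the pending value X is never read. With the new check,
kswapd_keeps_going(true, IDX_X) is true.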

The effect of this patch is that kswapd will react better in some situations,
where e.g. the first wakeup is for ZONE_DMA32, the second is for ZONE_DMA, and
ZONE_NORMAL is unreclaimable. Before this patch, kswapd would go to sleep
instead of reclaiming ZONE_DMA harder. I expect these situations are very rare,
and more value is in better maintainability due to the removal of a confusing
and bogus check.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 18b3767136f4..c67df4831565 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3451,8 +3451,7 @@ static int kswapd(void *p)
 		 * new request of a similar or harder type will succeed soon
 		 * so consider going to sleep on the basis we reclaimed at
 		 */
-		if (balanced_classzone_idx >= new_classzone_idx &&
-					balanced_order == new_order) {
+		if (balanced_order == new_order) {
 			new_order = pgdat->kswapd_max_order;
 			new_classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order =  0;
-- 
2.7.0

* [PATCH v2 2/5] mm, compaction: introduce kcompactd
  2016-02-08 13:38 [PATCH v2 0/5] introduce kcompactd and stop compacting in kswapd Vlastimil Babka
  2016-02-08 13:38 ` [PATCH v2 1/5] mm, kswapd: remove bogus check of balance_classzone_idx Vlastimil Babka
@ 2016-02-08 13:38 ` Vlastimil Babka
  2016-03-02  6:09   ` Joonsoo Kim
  2016-02-08 13:38 ` [PATCH v2 3/5] mm, memory hotplug: small cleanup in online_pages() Vlastimil Babka
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2016-02-08 13:38 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: linux-kernel, Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel,
	Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner, Vlastimil Babka

Memory compaction can be currently performed in several contexts:

- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP page
  fault attempts
- khugepaged trying to collapse a hugepage
- manually from /proc

The purpose of compaction is two-fold. The obvious purpose is to satisfy a
(pending or future) high-order allocation, and is easy to evaluate. The other
purpose is to keep overall memory fragmentation low and help the
anti-fragmentation mechanism. The success wrt the latter purpose is more
difficult to evaluate though.

The current situation wrt the purposes has a few drawbacks:

- compaction is invoked only when a high-order page or hugepage is not
  available (or manually). This might be too late for the purposes of keeping
  memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would be
  better if compaction was performed asynchronously to keep fragmentation low,
  before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP page
  faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in some
  scenarios. It could also end up compacting for a high-order allocation
  request when it should be reclaiming memory for a later order-0 request.

To improve the situation, we should be able to benefit from an equivalent of
kswapd, but for compaction - i.e. a background thread which responds to
fragmentation and the need for high-order allocations (including hugepages)
somewhat proactively.

One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let kswapd
handle reclaim, as order-0 allocations are often more critical than high-order
ones.

Another possibility is to extend khugepaged, but this kthread is a single
instance and tied to THP configs.

This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new tunables.
The lifecycle mimics kswapd kthreads, including the memory hotplug hooks.

For compaction, kcompactd uses the standard compaction_suitable() and
compact_finished() criteria and the deferred compaction functionality. Unlike
direct compaction, it uses only sync compaction, as there's no allocation
latency to minimize.

This patch doesn't yet add a call to wakeup_kcompactd(). The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in patch 4, together with the description of what's wrong with the
old approach.

Waking up of the kcompactd threads is also tied to kswapd activity and follows
these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from the
  slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so don't
  invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
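
How repeated wakeups accumulate into a single pending request can be sketched
in userspace as follows. The struct node_request type and the request_*()
helpers are hypothetical stand-ins for the kcompactd fields of pg_data_t and
for the merging done in wakeup_kcompactd(); the waitqueue and suitability
checks are left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kcompactd fields of pg_data_t. */
struct node_request {
	int max_order;     /* 0 means no work requested */
	int classzone_idx; /* highest zone index worth compacting */
	int nr_zones;
};

/* Start with no pending work and the least strict classzone_idx. */
static void request_init(struct node_request *req, int nr_zones)
{
	req->max_order = 0;
	req->classzone_idx = nr_zones - 1;
	req->nr_zones = nr_zones;
}

/* Merge a wakeup: keep the largest order and the lowest classzone_idx. */
static void request_merge(struct node_request *req, int order,
			  int classzone_idx)
{
	if (!order)
		return; /* order-0 needs no compaction */

	if (req->max_order < order)
		req->max_order = order;

	if (req->classzone_idx > classzone_idx)
		req->classzone_idx = classzone_idx;
}

/* The thread has something to do only if a non-zero order is pending. */
static bool work_requested(const struct node_request *req)
{
	return req->max_order > 0;
}
```

For example, merging (order 3, idx 2) and then (order 9, idx 1) on a node with
three zones leaves max_order at 9 and classzone_idx at 1, so a single pass of
the thread serves the hardest request seen so far.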

Possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are not
available (currently not done due to __GFP_NO_KSWAPD) or when a fragmentation
event (i.e. __rmqueue_fallback()) occurs. It's also possible to perform
periodic compaction with kcompactd.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/compaction.h        |  16 +++
 include/linux/mmzone.h            |   6 ++
 include/linux/vm_event_item.h     |   1 +
 include/trace/events/compaction.h |  55 ++++++++++
 mm/compaction.c                   | 220 ++++++++++++++++++++++++++++++++++++++
 mm/memory_hotplug.c               |   9 +-
 mm/page_alloc.c                   |   3 +
 mm/vmstat.c                       |   1 +
 8 files changed, 309 insertions(+), 2 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4cd4ddf64cc7..1367c0564d42 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -52,6 +52,10 @@ extern void compaction_defer_reset(struct zone *zone, int order,
 				bool alloc_success);
 extern bool compaction_restarting(struct zone *zone, int order);
 
+extern int kcompactd_run(int nid);
+extern void kcompactd_stop(int nid);
+extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
+
 #else
 static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
 			unsigned int order, int alloc_flags,
@@ -84,6 +88,18 @@ static inline bool compaction_deferred(struct zone *zone, int order)
 	return true;
 }
 
+static int kcompactd_run(int nid)
+{
+	return 0;
+}
+static void kcompactd_stop(int nid)
+{
+}
+
+static void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
+{
+}
+
 #endif /* CONFIG_COMPACTION */
 
 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 48cb6f0c6083..c8dfb14105c7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -670,6 +670,12 @@ typedef struct pglist_data {
 					   mem_hotplug_begin/end() */
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
+#ifdef CONFIG_COMPACTION
+	int kcompactd_max_order;
+	enum zone_type kcompactd_classzone_idx;
+	wait_queue_head_t kcompactd_wait;
+	struct task_struct *kcompactd;
+#endif
 #ifdef CONFIG_NUMA_BALANCING
 	/* Lock serializing the migrate rate limiting window */
 	spinlock_t numabalancing_migrate_lock;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 67c1dbd19c6d..58ecc056ee45 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -53,6 +53,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
 		COMPACTISOLATED,
 		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
+		KCOMPACTD_WAKE,
 #endif
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index 111e5666e5eb..e215bf68f521 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -350,6 +350,61 @@ DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_defer_reset,
 );
 #endif
 
+TRACE_EVENT(mm_compaction_kcompactd_sleep,
+
+	TP_PROTO(int nid),
+
+	TP_ARGS(nid),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+	),
+
+	TP_printk("nid=%d", __entry->nid)
+);
+
+DECLARE_EVENT_CLASS(kcompactd_wake_template,
+
+	TP_PROTO(int nid, int order, enum zone_type classzone_idx),
+
+	TP_ARGS(nid, order, classzone_idx),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, order)
+		__field(enum zone_type, classzone_idx)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->order = order;
+		__entry->classzone_idx = classzone_idx;
+	),
+
+	TP_printk("nid=%d order=%d classzone_idx=%-8s",
+		__entry->nid,
+		__entry->order,
+		__print_symbolic(__entry->classzone_idx, ZONE_TYPE))
+);
+
+DEFINE_EVENT(kcompactd_wake_template, mm_compaction_wakeup_kcompactd,
+
+	TP_PROTO(int nid, int order, enum zone_type classzone_idx),
+
+	TP_ARGS(nid, order, classzone_idx)
+);
+
+DEFINE_EVENT(kcompactd_wake_template, mm_compaction_kcompactd_wake,
+
+	TP_PROTO(int nid, int order, enum zone_type classzone_idx),
+
+	TP_ARGS(nid, order, classzone_idx)
+);
+
 #endif /* _TRACE_COMPACTION_H */
 
 /* This part must be outside protection */
diff --git a/mm/compaction.c b/mm/compaction.c
index 93f71d968098..c03715ba65c7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -17,6 +17,9 @@
 #include <linux/balloon_compaction.h>
 #include <linux/page-isolation.h>
 #include <linux/kasan.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+#include <linux/module.h>
 #include "internal.h"
 
 #ifdef CONFIG_COMPACTION
@@ -1736,4 +1739,221 @@ void compaction_unregister_node(struct node *node)
 }
 #endif /* CONFIG_SYSFS && CONFIG_NUMA */
 
+static inline bool kcompactd_work_requested(pg_data_t *pgdat)
+{
+	return pgdat->kcompactd_max_order > 0;
+}
+
+static bool kcompactd_node_suitable(pg_data_t *pgdat)
+{
+	int zoneid;
+	struct zone *zone;
+	enum zone_type classzone_idx = pgdat->kcompactd_classzone_idx;
+
+	for (zoneid = 0; zoneid < classzone_idx; zoneid++) {
+		zone = &pgdat->node_zones[zoneid];
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
+					classzone_idx) == COMPACT_CONTINUE)
+			return true;
+	}
+
+	return false;
+}
+
+static void kcompactd_do_work(pg_data_t *pgdat)
+{
+	/*
+	 * With no special task, compact all zones so that a page of requested
+	 * order is allocatable.
+	 */
+	int zoneid;
+	struct zone *zone;
+	struct compact_control cc = {
+		.order = pgdat->kcompactd_max_order,
+		.classzone_idx = pgdat->kcompactd_classzone_idx,
+		.mode = MIGRATE_SYNC_LIGHT,
+		.ignore_skip_hint = true,
+
+	};
+	bool success = false;
+
+	trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
+							cc.classzone_idx);
+	count_vm_event(KCOMPACTD_WAKE);
+
+	for (zoneid = 0; zoneid < cc.classzone_idx; zoneid++) {
+		int status;
+
+		zone = &pgdat->node_zones[zoneid];
+		if (!populated_zone(zone))
+			continue;
+
+		if (compaction_deferred(zone, cc.order))
+			continue;
+
+		if (compaction_suitable(zone, cc.order, 0, zoneid) !=
+							COMPACT_CONTINUE)
+			continue;
+
+		cc.nr_freepages = 0;
+		cc.nr_migratepages = 0;
+		cc.zone = zone;
+		INIT_LIST_HEAD(&cc.freepages);
+		INIT_LIST_HEAD(&cc.migratepages);
+
+		status = compact_zone(zone, &cc);
+
+		if (zone_watermark_ok(zone, cc.order, low_wmark_pages(zone),
+						cc.classzone_idx, 0)) {
+			success = true;
+			compaction_defer_reset(zone, cc.order, false);
+		} else if (cc.mode != MIGRATE_ASYNC &&
+						status == COMPACT_COMPLETE) {
+			defer_compaction(zone, cc.order);
+		}
+
+		VM_BUG_ON(!list_empty(&cc.freepages));
+		VM_BUG_ON(!list_empty(&cc.migratepages));
+	}
+
+	/*
+	 * Regardless of success, we are done until woken up next. But remember
+	 * the requested order/classzone_idx in case it was higher/tighter than
+	 * our current ones
+	 */
+	if (pgdat->kcompactd_max_order <= cc.order)
+		pgdat->kcompactd_max_order = 0;
+	if (pgdat->classzone_idx >= cc.classzone_idx)
+		pgdat->classzone_idx = pgdat->nr_zones - 1;
+}
+
+void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
+{
+	if (!order)
+		return;
+
+	if (pgdat->kcompactd_max_order < order)
+		pgdat->kcompactd_max_order = order;
+
+	if (pgdat->kcompactd_classzone_idx > classzone_idx)
+		pgdat->kcompactd_classzone_idx = classzone_idx;
+
+	if (!waitqueue_active(&pgdat->kcompactd_wait))
+		return;
+
+	if (!kcompactd_node_suitable(pgdat))
+		return;
+
+	trace_mm_compaction_wakeup_kcompactd(pgdat->node_id, order,
+							classzone_idx);
+	wake_up_interruptible(&pgdat->kcompactd_wait);
+}
+
+/*
+ * The background compaction daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kcompactd(void *p)
+{
+	pg_data_t *pgdat = (pg_data_t*)p;
+	struct task_struct *tsk = current;
+
+	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+
+	if (!cpumask_empty(cpumask))
+		set_cpus_allowed_ptr(tsk, cpumask);
+
+	set_freezable();
+
+	pgdat->kcompactd_max_order = 0;
+	pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
+
+	while (!kthread_should_stop()) {
+		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
+		wait_event_freezable(pgdat->kcompactd_wait,
+				kcompactd_work_requested(pgdat));
+
+		kcompactd_do_work(pgdat);
+	}
+
+	return 0;
+}
+
+/*
+ * This kcompactd start function will be called by init and node-hot-add.
+ * On node-hot-add, kcompactd will moved to proper cpus if cpus are hot-added.
+ */
+int kcompactd_run(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+	int ret = 0;
+
+	if (pgdat->kcompactd)
+		return 0;
+
+	pgdat->kcompactd = kthread_run(kcompactd, pgdat, "kcompactd%d", nid);
+	if (IS_ERR(pgdat->kcompactd)) {
+		pr_err("Failed to start kcompactd on node %d\n", nid);
+		ret = PTR_ERR(pgdat->kcompactd);
+		pgdat->kcompactd = NULL;
+	}
+	return ret;
+}
+
+/*
+ * Called by memory hotplug when all memory in a node is offlined. Caller must
+ * hold mem_hotplug_begin/end().
+ */
+void kcompactd_stop(int nid)
+{
+	struct task_struct *kcompactd = NODE_DATA(nid)->kcompactd;
+
+	if (kcompactd) {
+		kthread_stop(kcompactd);
+		NODE_DATA(nid)->kcompactd = NULL;
+	}
+}
+
+/*
+ * It's optimal to keep kcompactd on the same CPUs as their memory, but
+ * not required for correctness. So if the last cpu in a node goes
+ * away, we get changed to run anywhere: as the first one comes back,
+ * restore their cpu bindings.
+ */
+static int cpu_callback(struct notifier_block *nfb, unsigned long action,
+			void *hcpu)
+{
+	int nid;
+
+	if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) {
+		for_each_node_state(nid, N_MEMORY) {
+			pg_data_t *pgdat = NODE_DATA(nid);
+			const struct cpumask *mask;
+
+			mask = cpumask_of_node(pgdat->node_id);
+
+			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
+				/* One of our CPUs online: restore mask */
+				set_cpus_allowed_ptr(pgdat->kcompactd, mask);
+		}
+	}
+	return NOTIFY_OK;
+}
+
+static int __init kcompactd_init(void)
+{
+	int nid;
+
+	for_each_node_state(nid, N_MEMORY)
+		kcompactd_run(nid);
+	hotcpu_notifier(cpu_callback, 0);
+	return 0;
+}
+
+module_init(kcompactd_init)
+
 #endif /* CONFIG_COMPACTION */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 46b46a9dcf81..7aa7697fedd1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -33,6 +33,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memblock.h>
 #include <linux/bootmem.h>
+#include <linux/compaction.h>
 
 #include <asm/tlbflush.h>
 
@@ -1132,8 +1133,10 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 
 	init_per_zone_wmark_min();
 
-	if (onlined_pages)
+	if (onlined_pages) {
 		kswapd_run(zone_to_nid(zone));
+		kcompactd_run(nid);
+	}
 
 	vm_total_pages = nr_free_pagecache_pages();
 
@@ -1907,8 +1910,10 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		zone_pcp_update(zone);
 
 	node_states_clear_node(node, &arg);
-	if (arg.status_change_nid >= 0)
+	if (arg.status_change_nid >= 0) {
 		kswapd_stop(node);
+		kcompactd_stop(node);
+	}
 
 	vm_total_pages = nr_free_pagecache_pages();
 	writeback_set_ratelimit();
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4ca4ead6ab05..d8ada4ab70c1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5484,6 +5484,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 #endif
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
+#ifdef CONFIG_COMPACTION
+	init_waitqueue_head(&pgdat->kcompactd_wait);
+#endif
 	pgdat_page_ext_init(pgdat);
 
 	for (j = 0; j < MAX_NR_ZONES; j++) {
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 69ce64f7b8d7..c9571294f61c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -826,6 +826,7 @@ const char * const vmstat_text[] = {
 	"compact_stall",
 	"compact_fail",
 	"compact_success",
+	"compact_kcompatd_wake",
 #endif
 
 #ifdef CONFIG_HUGETLB_PAGE
-- 
2.7.0

* [PATCH v2 3/5] mm, memory hotplug: small cleanup in online_pages()
  2016-02-08 13:38 [PATCH v2 0/5] introduce kcompactd and stop compacting in kswapd Vlastimil Babka
  2016-02-08 13:38 ` [PATCH v2 1/5] mm, kswapd: remove bogus check of balance_classzone_idx Vlastimil Babka
  2016-02-08 13:38 ` [PATCH v2 2/5] mm, compaction: introduce kcompactd Vlastimil Babka
@ 2016-02-08 13:38 ` Vlastimil Babka
  2016-02-08 13:38 ` [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd Vlastimil Babka
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2016-02-08 13:38 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: linux-kernel, Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel,
	Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner, Vlastimil Babka

We can reuse the nid we've determined instead of repeated pfn_to_nid() usages.
Also zone_to_nid() should be a bit cheaper in general than pfn_to_nid().

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/memory_hotplug.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7aa7697fedd1..66064008a489 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1082,7 +1082,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	arg.nr_pages = nr_pages;
 	node_states_check_changes_online(nr_pages, zone, &arg);
 
-	nid = pfn_to_nid(pfn);
+	nid = zone_to_nid(zone);
 
 	ret = memory_notify(MEM_GOING_ONLINE, &arg);
 	ret = notifier_to_errno(ret);
@@ -1122,7 +1122,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
 	if (onlined_pages) {
-		node_states_set_node(zone_to_nid(zone), &arg);
+		node_states_set_node(nid, &arg);
 		if (need_zonelists_rebuild)
 			build_all_zonelists(NULL, NULL);
 		else
@@ -1134,7 +1134,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	init_per_zone_wmark_min();
 
 	if (onlined_pages) {
-		kswapd_run(zone_to_nid(zone));
+		kswapd_run(nid);
 		kcompactd_run(nid);
 	}
 
-- 
2.7.0

* [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-02-08 13:38 [PATCH v2 0/5] introduce kcompactd and stop compacting in kswapd Vlastimil Babka
                   ` (2 preceding siblings ...)
  2016-02-08 13:38 ` [PATCH v2 3/5] mm, memory hotplug: small cleanup in online_pages() Vlastimil Babka
@ 2016-02-08 13:38 ` Vlastimil Babka
  2016-02-08 22:58   ` Andrew Morton
                     ` (4 more replies)
  2016-02-08 13:38 ` [PATCH v2 5/5] mm, compaction: adapt isolation_suitable flushing to kcompactd Vlastimil Babka
  2016-03-09 15:52 ` [PATCH v2 0/5] introduce kcompactd and stop compacting in kswapd Michal Hocko
  5 siblings, 5 replies; 28+ messages in thread
From: Vlastimil Babka @ 2016-02-08 13:38 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: linux-kernel, Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel,
	Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner, Vlastimil Babka

Similarly to direct reclaim/compaction, kswapd attempts to combine reclaim and
compaction to make a memory allocation of a given order available. The
details differ from direct reclaim e.g. in having high watermark as a goal.
The code involved in kswapd's reclaim/compaction decisions has evolved to be
quite complex. Testing reveals that it doesn't actually work in at least one
scenario, and closer inspection suggests that it could be greatly simplified
without compromising on the goal (make high-order page available) or efficiency
(don't reclaim too much). The simplification relies on doing all compaction in
kcompactd, which is simply woken up when high watermarks are reached by
kswapd's reclaim.

The scenario where kswapd compaction doesn't work was found with mmtests test
stress-highalloc configured to attempt order-9 allocations without direct
reclaim, just waking up kswapd. There was no compaction attempt from kswapd
during the whole test. Some added instrumentation shows what happens:

- balance_pgdat() sets end_zone to Normal, as it's not balanced
- reclaim is attempted on DMA zone, which sets nr_attempted to 99, but it
  cannot reclaim anything, so sc.nr_reclaimed is 0
- for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so it
  merely checks if high watermarks were reached for base pages. This is true,
  so no reclaim is attempted. For DMA, testorder=0 wasn't used, as
  compaction_suitable() returned COMPACT_SKIPPED
- even though the pgdat_needs_compaction flag wasn't set to false, no
  compaction happens due to the condition sc.nr_reclaimed > nr_attempted
  being false (as 0 < 99)
- priority is decremented because nr_reclaimed is 0, and this repeats until
  priority reaches 0; pgdat_balanced() is false as only the small zone DMA
  appears balanced (curiously in that check, watermark appears OK and
  compaction_suitable() returns COMPACT_PARTIAL, because a lower classzone_idx
  is used there)

Now, even if it was decided that reclaim shouldn't be attempted on the DMA
zone, the scenario would be the same, as (sc.nr_reclaimed=0 > nr_attempted=0)
is also false. The condition really should use >= as the comment suggests.
Then there is a mismatch in the check for setting pgdat_needs_compaction to
false using low watermark, while the rest uses high watermark, and who knows
what other subtlety. Hopefully this demonstrates that this is unsustainable.
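
The off-by-one nature of that condition is easy to see in isolation. In this
sketch the two helper names are hypothetical; they only contrast the strict
comparison described above with the >= form that the comment suggests.

```c
#include <stdbool.h>

/* Condition as described above: strictly greater than. */
static bool enough_reclaimed_gt(unsigned long nr_reclaimed,
				unsigned long nr_attempted)
{
	return nr_reclaimed > nr_attempted;
}

/* Condition as the comment suggests it should be: greater or equal. */
static bool enough_reclaimed_ge(unsigned long nr_reclaimed,
				unsigned long nr_attempted)
{
	return nr_reclaimed >= nr_attempted;
}
```

With nr_reclaimed=0 and nr_attempted=99 both forms are false, which matches the
observed scenario. With nr_reclaimed=0 and nr_attempted=0 (no reclaim attempted
at all), only the >= form would let compaction proceed.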

Luckily we can simplify this a lot. The reclaim/compaction decisions make
sense for the direct reclaim scenario, but in kswapd, our primary goal is to
reach the high watermark in order-0 pages. Afterwards we can attempt compaction
just once. Unlike direct reclaim, we don't reclaim extra pages (over the high
watermark); the current code already disallows it for good reasons.

After this patch, we simply wake up kcompactd to process the pgdat, after we
have either succeeded or failed to reach the high watermarks in kswapd, which
goes to sleep. We pass kswapd's order and classzone_idx, so kcompactd can apply
the same criteria to determine which zones are worth compacting. Note that we
use the classzone_idx from wakeup_kswapd(), not balanced_classzone_idx which
can include higher zones that kswapd tried to balance too, but didn't consider
them in pgdat_balanced().

Since kswapd now cannot create high-order pages itself, we need to adjust how
it determines the zones to be balanced. The key element here is adding a
"highorder" parameter to zone_balanced, which, when set to false, makes it
consider only order-0 watermark instead of the desired higher order (this was
done previously by kswapd_shrink_zone(), but not elsewhere).  This false is
passed for example in pgdat_balanced(). Importantly, wakeup_kswapd() uses true
to make sure kswapd and thus kcompactd are woken up for a high-order allocation
failure.
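
The effect of the "highorder" parameter can be illustrated with a simplified
userspace model. Everything here is an assumption made for illustration: the
struct zone_model type, MODEL_MAX_ORDER, and the watermark check itself, which
only tests the total free pages against the high watermark and, for order > 0,
whether any free block of at least that order exists. It is not the kernel's
zone_watermark_ok() and not the code from this patch.

```c
#include <stdbool.h>

#define MODEL_MAX_ORDER 11 /* assumption: large enough to cover order-9 */

/* Hypothetical, heavily simplified view of a zone. */
struct zone_model {
	unsigned long free_pages;               /* total free base pages */
	unsigned long high_wmark;               /* high watermark, base pages */
	unsigned long nr_free[MODEL_MAX_ORDER]; /* free blocks per order */
};

/* Simplified watermark check; see the assumptions in the text above. */
static bool watermark_ok(const struct zone_model *z, int order)
{
	int o;

	if (z->free_pages <= z->high_wmark)
		return false;
	if (order == 0)
		return true;

	/* A high-order request needs a free block of that order or larger. */
	for (o = order; o < MODEL_MAX_ORDER; o++)
		if (z->nr_free[o])
			return true;
	return false;
}

/* With highorder == false, only the order-0 watermark is considered. */
static bool zone_balanced_model(const struct zone_model *z, bool highorder,
				int order)
{
	if (!highorder)
		order = 0;
	return watermark_ok(z, order);
}
```

A zone with plenty of free base pages but no free order-9 block is reported as
balanced when highorder is false (so kswapd can stop reclaiming) and as
unbalanced when highorder is true (so a wakeup still happens for the high-order
failure).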

For testing, I used stress-highalloc configured to do order-9 allocations with
GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just on kswapd/kcompactd
reclaim/compaction (the interfering kernel builds in phases 1 and 2 work as
usual):

stress-highalloc
                              4.5-rc1               4.5-rc1
                               3-test                4-test
Success 1 Min          1.00 (  0.00%)        3.00 (-200.00%)
Success 1 Mean         1.40 (  0.00%)        4.00 (-185.71%)
Success 1 Max          2.00 (  0.00%)        6.00 (-200.00%)
Success 2 Min          1.00 (  0.00%)        3.00 (-200.00%)
Success 2 Mean         1.80 (  0.00%)        4.20 (-133.33%)
Success 2 Max          3.00 (  0.00%)        6.00 (-100.00%)
Success 3 Min         34.00 (  0.00%)       63.00 (-85.29%)
Success 3 Mean        41.80 (  0.00%)       64.60 (-54.55%)
Success 3 Max         53.00 (  0.00%)       67.00 (-26.42%)

             4.5-rc1     4.5-rc1
              3-test      4-test
User         3166.67     3088.82
System       1153.37     1142.01
Elapsed      1768.53     1780.91

                                  4.5-rc1     4.5-rc1
                                   3-test      4-test
Minor Faults                    106940795   106582816
Major Faults                          829         813
Swap Ins                              482         311
Swap Outs                            6278        5598
Allocation stalls                     128         184
DMA allocs                            145          32
DMA32 allocs                     74646161    74843238
Normal allocs                    26090955    25886668
Movable allocs                          0           0
Direct pages scanned                32938       31429
Kswapd pages scanned              2183166     2185293
Kswapd pages reclaimed            2152359     2134389
Direct pages reclaimed              32735       31234
Kswapd efficiency                     98%         97%
Kswapd velocity                  1243.877    1228.666
Direct efficiency                     99%         99%
Direct velocity                    18.767      17.671
Percentage direct scans                1%          1%
Zone normal velocity              299.981     291.409
Zone dma32 velocity               962.522     954.928
Zone dma velocity                   0.142       0.000
Page writes by reclaim           6278.800    5598.600
Page writes file                        0           0
Page writes anon                     6278        5598
Page reclaim immediate                 93          96
Sector Reads                      4357114     4307161
Sector Writes                    11053628    11053091
Page rescued immediate                  0           0
Slabs scanned                     1592829     1555770
Direct inode steals                  1557        2025
Kswapd inode steals                 46056       45418
Kswapd skipped wait                     0           0
THP fault alloc                       579         614
THP collapse alloc                    304         324
THP splits                              0           0
THP fault fallback                    793         730
THP collapse fail                      11          14
Compaction stalls                    1013         959
Compaction success                     92          69
Compaction failures                   920         890
Page migrate success               238457      662054
Page migrate failure                23021       32846
Compaction pages isolated          504695     1370326
Compaction migrate scanned         661390     7025772
Compaction free scanned          13476658    73302642
Compaction cost                       262         762

After this patch we see improvements in allocation success rate (especially for
phase 3) along with increased compaction activity. The compaction stalls
(direct compaction) in the interfering kernel builds (probably THP's) also
decreased somewhat due to kcompactd activity, yet THP alloc successes improved
a bit.
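
As a sanity check on the derived rows in the table above, here is a minimal
sketch of how the percentage rows can be recomputed from the raw counters. It
assumes the report simply truncates the ratio to an integer, which is
consistent with the figures shown but is not confirmed by the report itself.
The helper names are made up for illustration.

```c
#include <assert.h>

/*
 * Truncated integer percentage of part out of whole.
 * Assumption: the report rounds down; this matches the rows above.
 */
static unsigned int pct(unsigned long long part, unsigned long long whole)
{
	assert(whole != 0);
	return (unsigned int)(part * 100 / whole);
}

/* "Kswapd efficiency" / "Direct efficiency": pages reclaimed per page scanned. */
static unsigned int reclaim_efficiency(unsigned long long reclaimed,
				       unsigned long long scanned)
{
	return pct(reclaimed, scanned);
}

/* "Percentage direct scans": share of all scanning done by direct reclaim. */
static unsigned int direct_scan_share(unsigned long long direct_scanned,
				      unsigned long long kswapd_scanned)
{
	return pct(direct_scanned, direct_scanned + kswapd_scanned);
}
```

For example, reclaim_efficiency(2152359, 2183166) gives 98 and
reclaim_efficiency(2134389, 2185293) gives 97, matching the two "Kswapd
efficiency" columns.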

We can also configure stress-highalloc to perform both direct
reclaim/compaction and wakeup kswapd/kcompactd, by using
GFP_KERNEL|__GFP_HIGH|__GFP_COMP:

stress-highalloc
                              4.5-rc1               4.5-rc1
                              3-test2               4-test2
Success 1 Min          4.00 (  0.00%)        6.00 (-50.00%)
Success 1 Mean         8.00 (  0.00%)        8.40 ( -5.00%)
Success 1 Max         12.00 (  0.00%)       13.00 ( -8.33%)
Success 2 Min          4.00 (  0.00%)        6.00 (-50.00%)
Success 2 Mean         8.20 (  0.00%)        8.60 ( -4.88%)
Success 2 Max         13.00 (  0.00%)       12.00 (  7.69%)
Success 3 Min         75.00 (  0.00%)       75.00 (  0.00%)
Success 3 Mean        75.60 (  0.00%)       75.60 (  0.00%)
Success 3 Max         77.00 (  0.00%)       76.00 (  1.30%)

             4.5-rc1     4.5-rc1
             3-test2     4-test2
User         3344.73     3258.62
System       1194.24     1177.92
Elapsed      1838.04     1837.02

                                  4.5-rc1     4.5-rc1
                                  3-test2     4-test2
Minor Faults                    111269736   109392253
Major Faults                          806         755
Swap Ins                              671         155
Swap Outs                            5390        5790
Allocation stalls                    4610        4562
DMA allocs                            250          34
DMA32 allocs                     78091501    76901680
Normal allocs                    27004414    26587089
Movable allocs                          0           0
Direct pages scanned               125146      108854
Kswapd pages scanned              2119757     2131589
Kswapd pages reclaimed            2073183     2090937
Direct pages reclaimed             124909      108699
Kswapd efficiency                     97%         98%
Kswapd velocity                  1161.027    1160.870
Direct efficiency                     99%         99%
Direct velocity                    68.545      59.283
Percentage direct scans                5%          4%
Zone normal velocity              296.678     294.389
Zone dma32 velocity               932.841     925.764
Zone dma velocity                   0.053       0.000
Page writes by reclaim           5392.000    5790.600
Page writes file                        1           0
Page writes anon                     5390        5790
Page reclaim immediate                104         218
Sector Reads                      4350232     4376989
Sector Writes                    11126496    11102113
Page rescued immediate                  0           0
Slabs scanned                     1705294     1692486
Direct inode steals                  8700       16266
Kswapd inode steals                 36352       28364
Kswapd skipped wait                     0           0
THP fault alloc                       599         567
THP collapse alloc                    323         326
THP splits                              0           0
THP fault fallback                    806         805
THP collapse fail                      17          18
Compaction stalls                    2457        2070
Compaction success                    906         527
Compaction failures                  1551        1543
Page migrate success              2031423     2423657
Page migrate failure                32845       28790
Compaction pages isolated         4129761     4916017
Compaction migrate scanned       11996712    19370264
Compaction free scanned         214970969   360662356
Compaction cost                      2271        2745

Here, this patch doesn't change the success rate as direct compaction already
tries what it can. There is, however, a significant reduction in direct
compaction stalls (2457 -> 2070), made up almost entirely of the successful
stalls (906 -> 527). This means the offload to kcompactd is working as
expected, and direct compaction is reduced either due to detecting contention,
or compaction deferred by kcompactd. In the previous version of this patchset
there was some apparent reduction of success rate, but the changes in this
version (such as using sync compaction only), new baseline kernel, and/or
averaging results from 5 executions (my bet), made this go away.
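
The breakdown of the stall reduction can be checked directly from the
"Compaction stalls", "Compaction success" and "Compaction failures" rows. The
following is only an illustrative sketch; the struct and function names are
hypothetical and the inputs are the numbers from the table above.

```c
/* Direct compaction counters as reported in the table above. */
struct stall_stats {
	long stalls;	/* "Compaction stalls" */
	long success;	/* "Compaction success" */
	long failures;	/* "Compaction failures" */
};

/* Per-counter drop going from the baseline run to the patched run. */
static struct stall_stats stall_drop(struct stall_stats before,
				     struct stall_stats after)
{
	struct stall_stats drop = {
		.stalls   = before.stalls - after.stalls,
		.success  = before.success - after.success,
		.failures = before.failures - after.failures,
	};

	return drop;
}
```

With 3-test2 = {2457, 906, 1551} and 4-test2 = {2070, 527, 1543}, the drop is
387 stalls, of which 379 are successes and 8 are failures. In both runs the
stalls equal success plus failures.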

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 146 ++++++++++++++++++++----------------------------------------
 1 file changed, 48 insertions(+), 98 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c67df4831565..b8478a737ef5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2951,18 +2951,23 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order,
-			  unsigned long balance_gap, int classzone_idx)
+static bool zone_balanced(struct zone *zone, int order, bool highorder,
+			unsigned long balance_gap, int classzone_idx)
 {
-	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
-				    balance_gap, classzone_idx))
-		return false;
+	unsigned long mark = high_wmark_pages(zone) + balance_gap;
 
-	if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
-				order, 0, classzone_idx) == COMPACT_SKIPPED)
-		return false;
+	/*
+	 * When checking from pgdat_balanced(), kswapd should stop and sleep
+	 * when it reaches the high order-0 watermark and let kcompactd take
+	 * over. Other callers such as wakeup_kswapd() want to determine the
+	 * true high-order watermark.
+	 */
+	if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {
+		mark += (1UL << order);
+		order = 0;
+	}
 
-	return true;
+	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
 }
 
 /*
@@ -3012,7 +3017,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 			continue;
 		}
 
-		if (zone_balanced(zone, order, 0, i))
+		if (zone_balanced(zone, order, false, 0, i))
 			balanced_pages += zone->managed_pages;
 		else if (!order)
 			return false;
@@ -3066,8 +3071,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
  */
 static bool kswapd_shrink_zone(struct zone *zone,
 			       int classzone_idx,
-			       struct scan_control *sc,
-			       unsigned long *nr_attempted)
+			       struct scan_control *sc)
 {
 	int testorder = sc->order;
 	unsigned long balance_gap;
@@ -3077,17 +3081,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
 
 	/*
-	 * Kswapd reclaims only single pages with compaction enabled. Trying
-	 * too hard to reclaim until contiguous free pages have become
-	 * available can hurt performance by evicting too much useful data
-	 * from memory. Do not reclaim more than needed for compaction.
-	 */
-	if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
-			compaction_suitable(zone, sc->order, 0, classzone_idx)
-							!= COMPACT_SKIPPED)
-		testorder = 0;
-
-	/*
 	 * We put equal pressure on every zone, unless one zone has way too
 	 * many pages free already. The "too many pages" is defined as the
 	 * high wmark plus a "gap" where the gap is either the low
@@ -3101,15 +3094,12 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * reclaim is necessary
 	 */
 	lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
-	if (!lowmem_pressure && zone_balanced(zone, testorder,
+	if (!lowmem_pressure && zone_balanced(zone, testorder, false,
 						balance_gap, classzone_idx))
 		return true;
 
 	shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
 
-	/* Account for the number of pages attempted to reclaim */
-	*nr_attempted += sc->nr_to_reclaim;
-
 	clear_bit(ZONE_WRITEBACK, &zone->flags);
 
 	/*
@@ -3119,7 +3109,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * waits.
 	 */
 	if (zone_reclaimable(zone) &&
-	    zone_balanced(zone, testorder, 0, classzone_idx)) {
+	    zone_balanced(zone, testorder, false, 0, classzone_idx)) {
 		clear_bit(ZONE_CONGESTED, &zone->flags);
 		clear_bit(ZONE_DIRTY, &zone->flags);
 	}
@@ -3131,7 +3121,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at high_wmark_pages(zone).
  *
- * Returns the final order kswapd was reclaiming at
+ * Returns the highest zone idx kswapd was reclaiming at
  *
  * There is special handling here for zones which are full of pinned pages.
  * This can happen if the pages are all mlocked, or if they are all used by
@@ -3148,8 +3138,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
  * interoperates with the page allocator fallback scheme to ensure that aging
  * of pages is balanced across the zones.
  */
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
-							int *classzone_idx)
+static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
@@ -3166,9 +3155,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	count_vm_event(PAGEOUTRUN);
 
 	do {
-		unsigned long nr_attempted = 0;
 		bool raise_priority = true;
-		bool pgdat_needs_compaction = (order > 0);
 
 		sc.nr_reclaimed = 0;
 
@@ -3203,7 +3190,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				break;
 			}
 
-			if (!zone_balanced(zone, order, 0, 0)) {
+			if (!zone_balanced(zone, order, true, 0, 0)) {
 				end_zone = i;
 				break;
 			} else {
@@ -3219,24 +3206,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		if (i < 0)
 			goto out;
 
-		for (i = 0; i <= end_zone; i++) {
-			struct zone *zone = pgdat->node_zones + i;
-
-			if (!populated_zone(zone))
-				continue;
-
-			/*
-			 * If any zone is currently balanced then kswapd will
-			 * not call compaction as it is expected that the
-			 * necessary pages are already available.
-			 */
-			if (pgdat_needs_compaction &&
-					zone_watermark_ok(zone, order,
-						low_wmark_pages(zone),
-						*classzone_idx, 0))
-				pgdat_needs_compaction = false;
-		}
-
 		/*
 		 * If we're getting trouble reclaiming, start doing writepage
 		 * even in laptop mode.
@@ -3280,8 +3249,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			 * that that high watermark would be met at 100%
 			 * efficiency.
 			 */
-			if (kswapd_shrink_zone(zone, end_zone,
-					       &sc, &nr_attempted))
+			if (kswapd_shrink_zone(zone, end_zone, &sc))
 				raise_priority = false;
 		}
 
@@ -3294,49 +3262,29 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				pfmemalloc_watermark_ok(pgdat))
 			wake_up_all(&pgdat->pfmemalloc_wait);
 
-		/*
-		 * Fragmentation may mean that the system cannot be rebalanced
-		 * for high-order allocations in all zones. If twice the
-		 * allocation size has been reclaimed and the zones are still
-		 * not balanced then recheck the watermarks at order-0 to
-		 * prevent kswapd reclaiming excessively. Assume that a
-		 * process requested a high-order can direct reclaim/compact.
-		 */
-		if (order && sc.nr_reclaimed >= 2UL << order)
-			order = sc.order = 0;
-
 		/* Check if kswapd should be suspending */
 		if (try_to_freeze() || kthread_should_stop())
 			break;
 
 		/*
-		 * Compact if necessary and kswapd is reclaiming at least the
-		 * high watermark number of pages as requsted
-		 */
-		if (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted)
-			compact_pgdat(pgdat, order);
-
-		/*
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
 		if (raise_priority || !sc.nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1 &&
-		 !pgdat_balanced(pgdat, order, *classzone_idx));
+			!pgdat_balanced(pgdat, order, classzone_idx));
 
 out:
 	/*
-	 * Return the order we were reclaiming at so prepare_kswapd_sleep()
-	 * makes a decision on the order we were last reclaiming at. However,
-	 * if another caller entered the allocator slow path while kswapd
-	 * was awake, order will remain at the higher level
+	 * Return the highest zone idx we were reclaiming at so
+	 * prepare_kswapd_sleep() makes the same decisions as here.
 	 */
-	*classzone_idx = end_zone;
-	return order;
+	return end_zone;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
+static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
+				int classzone_idx, int balanced_classzone_idx)
 {
 	long remaining = 0;
 	DEFINE_WAIT(wait);
@@ -3347,7 +3295,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
 	/* Try to sleep for a short interval */
-	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, order, remaining,
+						balanced_classzone_idx)) {
 		remaining = schedule_timeout(HZ/10);
 		finish_wait(&pgdat->kswapd_wait, &wait);
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
@@ -3357,7 +3306,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	 * After a short sleep, check if it was a premature sleep. If not, then
 	 * go fully to sleep until explicitly woken up.
 	 */
-	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, order, remaining,
+						balanced_classzone_idx)) {
 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
 		/*
@@ -3378,6 +3328,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		 */
 		reset_isolation_suitable(pgdat);
 
+		/*
+		 * We have freed the memory, now we should compact it to make
+		 * allocation of the requested order possible.
+		 */
+		wakeup_kcompactd(pgdat, order, classzone_idx);
+
 		if (!kthread_should_stop())
 			schedule();
 
@@ -3407,7 +3363,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 static int kswapd(void *p)
 {
 	unsigned long order, new_order;
-	unsigned balanced_order;
 	int classzone_idx, new_classzone_idx;
 	int balanced_classzone_idx;
 	pg_data_t *pgdat = (pg_data_t*)p;
@@ -3440,23 +3395,19 @@ static int kswapd(void *p)
 	set_freezable();
 
 	order = new_order = 0;
-	balanced_order = 0;
 	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
 	balanced_classzone_idx = classzone_idx;
 	for ( ; ; ) {
 		bool ret;
 
 		/*
-		 * If the last balance_pgdat was unsuccessful it's unlikely a
-		 * new request of a similar or harder type will succeed soon
-		 * so consider going to sleep on the basis we reclaimed at
+		 * While we were reclaiming, there might have been another
+		 * wakeup, so check the values.
 		 */
-		if (balanced_order == new_order) {
-			new_order = pgdat->kswapd_max_order;
-			new_classzone_idx = pgdat->classzone_idx;
-			pgdat->kswapd_max_order =  0;
-			pgdat->classzone_idx = pgdat->nr_zones - 1;
-		}
+		new_order = pgdat->kswapd_max_order;
+		new_classzone_idx = pgdat->classzone_idx;
+		pgdat->kswapd_max_order =  0;
+		pgdat->classzone_idx = pgdat->nr_zones - 1;
 
 		if (order < new_order || classzone_idx > new_classzone_idx) {
 			/*
@@ -3466,7 +3417,7 @@ static int kswapd(void *p)
 			order = new_order;
 			classzone_idx = new_classzone_idx;
 		} else {
-			kswapd_try_to_sleep(pgdat, balanced_order,
+			kswapd_try_to_sleep(pgdat, order, classzone_idx,
 						balanced_classzone_idx);
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
@@ -3486,9 +3437,8 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			balanced_classzone_idx = classzone_idx;
-			balanced_order = balance_pgdat(pgdat, order,
-						&balanced_classzone_idx);
+			balanced_classzone_idx = balance_pgdat(pgdat, order,
+								classzone_idx);
 		}
 	}
 
@@ -3518,7 +3468,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	}
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, 0, 0))
+	if (zone_balanced(zone, order, true, 0, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-- 
2.7.0
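
For readers following the zone_balanced() hunk above, here is a userspace toy
model of the new "highorder" parameter. It is a sketch under stated
assumptions, not the kernel code: wm_zone and wm_ok() are hypothetical
stand-ins for struct zone and zone_watermark_ok_safe(), lowmem reserves and
classzone_idx are ignored, and CONFIG_COMPACTION is assumed to be enabled.

```c
#include <stdbool.h>

#define WM_MAX_ORDER 11

/* Hypothetical, heavily simplified stand-in for struct zone. */
struct wm_zone {
	unsigned long free_pages;		/* total free base pages */
	unsigned long high_wmark;		/* high watermark in pages */
	unsigned long nr_free[WM_MAX_ORDER];	/* free blocks per order */
};

/*
 * Simplified stand-in for zone_watermark_ok_safe(): enough free pages
 * above the mark and, for order > 0, at least one free block that is
 * large enough.
 */
static bool wm_ok(const struct wm_zone *z, unsigned int order,
		  unsigned long mark)
{
	unsigned int o;

	if (z->free_pages <= mark)
		return false;
	if (!order)
		return true;
	for (o = order; o < WM_MAX_ORDER; o++)
		if (z->nr_free[o])
			return true;
	return false;
}

/* Mirrors the control flow of the patched zone_balanced(). */
static bool wm_zone_balanced(const struct wm_zone *z, unsigned int order,
			     bool highorder, unsigned long balance_gap)
{
	unsigned long mark = z->high_wmark + balance_gap;

	/* kswapd only aims for order-0 pages plus room for one block. */
	if (!highorder) {
		mark += 1UL << order;
		order = 0;
	}

	return wm_ok(z, order, mark);
}
```

In this model a fragmented zone with plenty of order-0 pages looks balanced to
the pgdat_balanced() style check (highorder false), so kswapd can stop
reclaiming, while the wakeup_kswapd() style check (highorder true) still
reports it as unbalanced for an order-9 request.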

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 5/5] mm, compaction: adapt isolation_suitable flushing to kcompactd
  2016-02-08 13:38 [PATCH v2 0/5] introduce kcompactd and stop compacting in kswapd Vlastimil Babka
                   ` (3 preceding siblings ...)
  2016-02-08 13:38 ` [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd Vlastimil Babka
@ 2016-02-08 13:38 ` Vlastimil Babka
  2016-03-01 14:44   ` Vlastimil Babka
  2016-03-09 15:52 ` [PATCH v2 0/5] introduce kcompactd and stop compacting in kswapd Michal Hocko
  5 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2016-02-08 13:38 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: linux-kernel, Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel,
	Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner, Vlastimil Babka

Compaction maintains a pageblock_skip bitmap to record pageblocks where
isolation recently failed. This bitmap can be reset in three ways:

1) direct compaction is restarting after going through the full deferred cycle

2) kswapd goes to sleep, and some other direct compaction has previously
   finished scanning the whole zone and set zone->compact_blockskip_flush.
   Note that a successful direct compaction clears this flag.

3) compaction was invoked manually via trigger in /proc

Case 2) is somewhat fuzzy to begin with, but after introducing kcompactd we
should update it. Both the check for direct compaction in 1) and the setting of
the flush flag in 2) use current_is_kswapd(), which doesn't work for kcompactd.
Thus, this patch adds bool direct_compaction to compact_control to use in 2).
For case 1) we remove the check completely - unlike the former kswapd
compaction, kcompactd does use the deferred compaction functionality, so
flushing tied to restarting from deferred compaction makes sense here.
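
The resulting rules for cases 1) and 2) can be summarized with a small
userspace model. This is only a sketch: the toy_* types and functions are
hypothetical stand-ins for struct zone, struct compact_control and the
kernel's reset paths, and a single bool represents the whole pageblock_skip
bitmap.

```c
#include <stdbool.h>

/* Toy stand-ins for struct zone and struct compact_control. */
struct toy_zone {
	bool compact_blockskip_flush;	/* a flush was requested */
	bool skip_bits_set;		/* some pageblock_skip bits are set */
};

struct toy_cc {
	bool direct_compaction;		/* false from kcompactd or /proc */
};

/* Models resetting the skip bitmap: forget all skip hints. */
static void toy_reset_skip_bits(struct toy_zone *zone)
{
	zone->skip_bits_set = false;
	zone->compact_blockskip_flush = false;
}

/* Case 1: restarting after a full deferred cycle resets for any caller. */
static void toy_compact_start(struct toy_zone *zone, bool restarting)
{
	if (restarting)
		toy_reset_skip_bits(zone);
}

/*
 * Case 2, first half: after scanning the whole zone, only direct
 * compaction requests a flush.
 */
static void toy_compact_complete(struct toy_zone *zone,
				 const struct toy_cc *cc)
{
	if (cc->direct_compaction)
		zone->compact_blockskip_flush = true;
}

/* Case 2, second half: kswapd honours the request when going to sleep. */
static void toy_kswapd_sleep(struct toy_zone *zone)
{
	if (zone->compact_blockskip_flush)
		toy_reset_skip_bits(zone);
}
```

In this model a full scan by kcompactd never requests a flush on its own,
while a restart from deferred compaction resets the bits no matter which
context runs it.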

Note that when kswapd goes to sleep, kcompactd is woken up, so it will see the
flushed pageblock_skip bits. This is different from when the former kswapd
compaction observed the bits and I believe it makes more sense. Kcompactd can
afford to be more thorough than a direct compaction trying to limit allocation
latency, or kswapd whose primary goal is to reclaim.

To sum up, after this patch, the pageblock_skip flushing intuitively makes
more sense for kcompactd. Practically, the differences are minimal.

Stress-highalloc with order-9 allocations without direct reclaim/compaction:

stress-highalloc
                              4.5-rc1               4.5-rc1
                               4-test                5-test
Success 1 Min          3.00 (  0.00%)        5.00 (-66.67%)
Success 1 Mean         4.00 (  0.00%)        6.20 (-55.00%)
Success 1 Max          6.00 (  0.00%)        7.00 (-16.67%)
Success 2 Min          3.00 (  0.00%)        5.00 (-66.67%)
Success 2 Mean         4.20 (  0.00%)        6.40 (-52.38%)
Success 2 Max          6.00 (  0.00%)        7.00 (-16.67%)
Success 3 Min         63.00 (  0.00%)       62.00 (  1.59%)
Success 3 Mean        64.60 (  0.00%)       63.80 (  1.24%)
Success 3 Max         67.00 (  0.00%)       65.00 (  2.99%)

             4.5-rc1     4.5-rc1
              4-test      5-test
User         3088.82     3181.09
System       1142.01     1158.25
Elapsed      1780.91     1799.37

                                  4.5-rc1     4.5-rc1
                                   4-test      5-test
Minor Faults                    106582816   107907437
Major Faults                          813         734
Swap Ins                              311         235
Swap Outs                            5598        5485
Allocation stalls                     184         207
DMA allocs                             32          31
DMA32 allocs                     74843238    75757965
Normal allocs                    25886668    26130990
Movable allocs                          0           0
Direct pages scanned                31429       32797
Kswapd pages scanned              2185293     2202613
Kswapd pages reclaimed            2134389     2143524
Direct pages reclaimed              31234       32545
Kswapd efficiency                     97%         97%
Kswapd velocity                  1228.666    1218.536
Direct efficiency                     99%         99%
Direct velocity                    17.671      18.144
Percentage direct scans                1%          1%
Zone normal velocity              291.409     286.309
Zone dma32 velocity               954.928     950.371
Zone dma velocity                   0.000       0.000
Page writes by reclaim           5598.600    5485.600
Page writes file                        0           0
Page writes anon                     5598        5485
Page reclaim immediate                 96          60
Sector Reads                      4307161     4293509
Sector Writes                    11053091    11072127
Page rescued immediate                  0           0
Slabs scanned                     1555770     1549506
Direct inode steals                  2025        7018
Kswapd inode steals                 45418       40265
Kswapd skipped wait                     0           0
THP fault alloc                       614         612
THP collapse alloc                    324         316
THP splits                              0           0
THP fault fallback                    730         778
THP collapse fail                      14          16
Compaction stalls                     959        1007
Compaction success                     69          67
Compaction failures                   890         939
Page migrate success               662054      721374
Page migrate failure                32846       23469
Compaction pages isolated         1370326     1479924
Compaction migrate scanned        7025772     8812554
Compaction free scanned          73302642    84327916
Compaction cost                       762         838

With direct reclaim/compaction:

stress-highalloc
                              4.5-rc1               4.5-rc1
                              4-test2               5-test2
Success 1 Min          6.00 (  0.00%)        9.00 (-50.00%)
Success 1 Mean         8.40 (  0.00%)       10.00 (-19.05%)
Success 1 Max         13.00 (  0.00%)       11.00 ( 15.38%)
Success 2 Min          6.00 (  0.00%)        9.00 (-50.00%)
Success 2 Mean         8.60 (  0.00%)       10.00 (-16.28%)
Success 2 Max         12.00 (  0.00%)       11.00 (  8.33%)
Success 3 Min         75.00 (  0.00%)       74.00 (  1.33%)
Success 3 Mean        75.60 (  0.00%)       75.20 (  0.53%)
Success 3 Max         76.00 (  0.00%)       76.00 (  0.00%)

             4.5-rc1     4.5-rc1
             4-test2     5-test2
User         3258.62     3246.04
System       1177.92     1172.29
Elapsed      1837.02     1836.76

                                  4.5-rc1     4.5-rc1
                                  4-test2     5-test2
Minor Faults                    109392253   109773220
Major Faults                          755         864
Swap Ins                              155         262
Swap Outs                            5790        5871
Allocation stalls                    4562        4540
DMA allocs                             34          39
DMA32 allocs                     76901680    77122082
Normal allocs                    26587089    26748274
Movable allocs                          0           0
Direct pages scanned               108854      120966
Kswapd pages scanned              2131589     2135012
Kswapd pages reclaimed            2090937     2108388
Direct pages reclaimed             108699      120577
Kswapd efficiency                     98%         98%
Kswapd velocity                  1160.870    1170.537
Direct efficiency                     99%         99%
Direct velocity                    59.283      66.321
Percentage direct scans                4%          5%
Zone normal velocity              294.389     293.821
Zone dma32 velocity               925.764     943.036
Zone dma velocity                   0.000       0.000
Page writes by reclaim           5790.600    5871.200
Page writes file                        0           0
Page writes anon                     5790        5871
Page reclaim immediate                218         225
Sector Reads                      4376989     4428264
Sector Writes                    11102113    11110668
Page rescued immediate                  0           0
Slabs scanned                     1692486     1709123
Direct inode steals                 16266        6898
Kswapd inode steals                 28364       38351
Kswapd skipped wait                     0           0
THP fault alloc                       567         652
THP collapse alloc                    326         354
THP splits                              0           0
THP fault fallback                    805         793
THP collapse fail                      18          16
Compaction stalls                    2070        2025
Compaction success                    527         518
Compaction failures                  1543        1507
Page migrate success              2423657     2360608
Page migrate failure                28790       40852
Compaction pages isolated         4916017     4802025
Compaction migrate scanned       19370264    21750613
Compaction free scanned         360662356   344372001
Compaction cost                      2745        2694

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/compaction.c | 10 +++++-----
 mm/internal.h   |  1 +
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index c03715ba65c7..67bb651c56b1 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1191,11 +1191,11 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
 
 		/*
 		 * Mark that the PG_migrate_skip information should be cleared
-		 * by kswapd when it goes to sleep. kswapd does not set the
+		 * by kswapd when it goes to sleep. kcompactd does not set the
 		 * flag itself as the decision to be clear should be directly
 		 * based on an allocation request.
 		 */
-		if (!current_is_kswapd())
+		if (cc->direct_compaction)
 			zone->compact_blockskip_flush = true;
 
 		return COMPACT_COMPLETE;
@@ -1338,10 +1338,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 
 	/*
 	 * Clear pageblock skip if there were failures recently and compaction
-	 * is about to be retried after being deferred. kswapd does not do
-	 * this reset as it'll reset the cached information when going to sleep.
+	 * is about to be retried after being deferred.
 	 */
-	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
+	if (compaction_restarting(zone, cc->order))
 		__reset_isolation_suitable(zone);
 
 	/*
@@ -1477,6 +1476,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
 		.mode = mode,
 		.alloc_flags = alloc_flags,
 		.classzone_idx = classzone_idx,
+		.direct_compaction = true,
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
diff --git a/mm/internal.h b/mm/internal.h
index 17ae0b52534b..013a786fa37f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -181,6 +181,7 @@ struct compact_control {
 	unsigned long last_migrated_pfn;/* Not yet flushed page being freed */
 	enum migrate_mode mode;		/* Async or sync migration mode */
 	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
+	bool direct_compaction;		/* False from kcompactd or /proc/... */
 	int order;			/* order a direct compactor needs */
 	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
 	const int alloc_flags;		/* alloc flags of a direct compactor */
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-02-08 13:38 ` [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd Vlastimil Babka
@ 2016-02-08 22:58   ` Andrew Morton
  2016-02-09 10:53     ` Vlastimil Babka
  2016-02-09 10:21   ` Vlastimil Babka
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2016-02-08 22:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Kirill A. Shutemov,
	Rik van Riel, Joonsoo Kim, Mel Gorman, David Rientjes,
	Michal Hocko, Johannes Weiner

On Mon,  8 Feb 2016 14:38:10 +0100 Vlastimil Babka <vbabka@suse.cz> wrote:

> Similarly to direct reclaim/compaction, kswapd attempts to combine reclaim and
> compaction to attempt making memory allocation of given order available. The
> details differ from direct reclaim e.g. in having high watermark as a goal.
> The code involved in kswapd's reclaim/compaction decisions has evolved to be
> quite complex. Testing reveals that it doesn't actually work in at least one
> scenario, and closer inspection suggests that it could be greatly simplified
> without compromising on the goal (make high-order page available) or efficiency
> (don't reclaim too much). The simplification relies on doing all compaction in
> kcompactd, which is simply woken up when high watermarks are reached by
> kswapd's reclaim.
> 
> The scenario where kswapd compaction doesn't work was found with mmtests test
> stress-highalloc configured to attempt order-9 allocations without direct
> reclaim, just waking up kswapd. There was no compaction attempt from kswapd
> during the whole test. Some added instrumentation shows what happens:
> 
> - balance_pgdat() sets end_zone to Normal, as it's not balanced
> - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but it
>   cannot reclaim anything, so sc.nr_reclaimed is 0
> - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so it
>   merely checks if high watermarks were reached for base pages. This is true,
>   so no reclaim is attempted. For DMA, testorder=0 wasn't used, as
>   compaction_suitable() returned COMPACT_SKIPPED
> - even though the pgdat_needs_compaction flag wasn't set to false, no
>   compaction happens due to the condition sc.nr_reclaimed > nr_attempted
>   being false (as 0 < 99)
> - priority-- due to nr_reclaimed being 0, repeat until priority reaches 0.
>   pgdat_balanced() is false as only the small zone DMA appears balanced
>   (curiously in that check, watermark appears OK and compaction_suitable()
>   returns COMPACT_PARTIAL, because a lower classzone_idx is used there)
> 
> Now, even if it was decided that reclaim shouldn't be attempted on the DMA
> zone, the scenario would be the same, as (sc.nr_reclaimed=0 > nr_attempted=0)
> is also false. The condition really should use >= as the comment suggests.
> Then there is a mismatch in the check for setting pgdat_needs_compaction to
> false using low watermark, while the rest uses high watermark, and who knows
> what other subtlety. Hopefully this demonstrates that this is unsustainable.
> 
> Luckily we can simplify this a lot. The reclaim/compaction decisions make
> sense for direct reclaim scenario, but in kswapd, our primary goal is to reach
> high watermark in order-0 pages. Afterwards we can attempt compaction just
> once. Unlike direct reclaim, we don't reclaim extra pages (over the high
> watermark), the current code already disallows it for good reasons.
> 
> After this patch, we simply wake up kcompactd to process the pgdat, after we
> have either succeeded or failed to reach the high watermarks in kswapd, which
> goes to sleep. We pass kswapd's order and classzone_idx, so kcompactd can apply
> the same criteria to determine which zones are worth compacting. Note that we
> use the classzone_idx from wakeup_kswapd(), not balanced_classzone_idx which
> can include higher zones that kswapd tried to balance too, but didn't consider
> them in pgdat_balanced().
> 
> Since kswapd now cannot create high-order pages itself, we need to adjust how
> it determines the zones to be balanced. The key element here is adding a
> "highorder" parameter to zone_balanced, which, when set to false, makes it
> consider only order-0 watermark instead of the desired higher order (this was
> done previously by kswapd_shrink_zone(), but not elsewhere).  This false is
> passed for example in pgdat_balanced(). Importantly, wakeup_kswapd() uses true
> to make sure kswapd and thus kcompactd are woken up for a high-order allocation
> failure.
> 
> For testing, I used stress-highalloc configured to do order-9 allocations with
> GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just on kswapd/kcompactd
> reclaim/compaction (the interfering kernel builds in phases 1 and 2 work as
> usual):
> 
> stress-highalloc
>                               4.5-rc1               4.5-rc1
>                                3-test                4-test

What are "3-test" and "4-test"?  I'm assuming (hoping) they mean
"before and after this patchset", but the nomenclature is odd.

> Success 1 Min          1.00 (  0.00%)        3.00 (-200.00%)
> Success 1 Mean         1.40 (  0.00%)        4.00 (-185.71%)
> Success 1 Max          2.00 (  0.00%)        6.00 (-200.00%)
> Success 2 Min          1.00 (  0.00%)        3.00 (-200.00%)
> Success 2 Mean         1.80 (  0.00%)        4.20 (-133.33%)
> Success 2 Max          3.00 (  0.00%)        6.00 (-100.00%)
> Success 3 Min         34.00 (  0.00%)       63.00 (-85.29%)
> Success 3 Mean        41.80 (  0.00%)       64.60 (-54.55%)
> Success 3 Max         53.00 (  0.00%)       67.00 (-26.42%)
> 
>              4.5-rc1     4.5-rc1
>               3-test      4-test
> User         3166.67     3088.82
> System       1153.37     1142.01
> Elapsed      1768.53     1780.91
>
>                                   4.5-rc1     4.5-rc1
>                                    3-test      4-test
> Minor Faults                    106940795   106582816
> Major Faults                          829         813
> Swap Ins                              482         311
> Swap Outs                            6278        5598
> Allocation stalls                     128         184
> DMA allocs                            145          32
> DMA32 allocs                     74646161    74843238
> Normal allocs                    26090955    25886668
> Movable allocs                          0           0
> Direct pages scanned                32938       31429
> Kswapd pages scanned              2183166     2185293
> Kswapd pages reclaimed            2152359     2134389
> Direct pages reclaimed              32735       31234
> Kswapd efficiency                     98%         97%
> Kswapd velocity                  1243.877    1228.666
> Direct efficiency                     99%         99%
> Direct velocity                    18.767      17.671

What do "efficiency" and "velocity" refer to here?

> Percentage direct scans                1%          1%
> Zone normal velocity              299.981     291.409
> Zone dma32 velocity               962.522     954.928
> Zone dma velocity                   0.142       0.000
> Page writes by reclaim           6278.800    5598.600
> Page writes file                        0           0
> Page writes anon                     6278        5598
> Page reclaim immediate                 93          96
> Sector Reads                      4357114     4307161
> Sector Writes                    11053628    11053091
> Page rescued immediate                  0           0
> Slabs scanned                     1592829     1555770
> Direct inode steals                  1557        2025
> Kswapd inode steals                 46056       45418
> Kswapd skipped wait                     0           0
> THP fault alloc                       579         614
> THP collapse alloc                    304         324
> THP splits                              0           0
> THP fault fallback                    793         730
> THP collapse fail                      11          14
> Compaction stalls                    1013         959
> Compaction success                     92          69
> Compaction failures                   920         890
> Page migrate success               238457      662054
> Page migrate failure                23021       32846
> Compaction pages isolated          504695     1370326
> Compaction migrate scanned         661390     7025772
> Compaction free scanned          13476658    73302642
> Compaction cost                       262         762
> 
> After this patch we see improvements in allocation success rate (especially for
> phase 3) along with increased compaction activity. The compaction stalls
> (direct compaction) in the interfering kernel builds (probably THP's) also
> decreased somewhat to kcompactd activity, yet THP alloc successes improved a
> bit.
> 
> We can also configure stress-highalloc to perform both direct
> reclaim/compaction and wakeup kswapd/kcompactd, by using
> GFP_KERNEL|__GFP_HIGH|__GFP_COMP:
> 
> stress-highalloc
>                               4.5-rc1               4.5-rc1
>                               3-test2               4-test2
> Success 1 Min          4.00 (  0.00%)        6.00 (-50.00%)
> Success 1 Mean         8.00 (  0.00%)        8.40 ( -5.00%)
> Success 1 Max         12.00 (  0.00%)       13.00 ( -8.33%)
> Success 2 Min          4.00 (  0.00%)        6.00 (-50.00%)
> Success 2 Mean         8.20 (  0.00%)        8.60 ( -4.88%)
> Success 2 Max         13.00 (  0.00%)       12.00 (  7.69%)
> Success 3 Min         75.00 (  0.00%)       75.00 (  0.00%)
> Success 3 Mean        75.60 (  0.00%)       75.60 (  0.00%)
> Success 3 Max         77.00 (  0.00%)       76.00 (  1.30%)
> 
>              4.5-rc1     4.5-rc1
>              3-test2     4-test2
> User         3344.73     3258.62
> System       1194.24     1177.92
> Elapsed      1838.04     1837.02

Elapsed time increased in both test runs.  But you later say "There's
however significant reduction in direct compaction stalls, made
entirely of the successful stalls".  This seems inconsistent - less
stalls should mean less time stuck in D state.

>                                   4.5-rc1     4.5-rc1
>                                   3-test2     4-test2
> Minor Faults                    111269736   109392253
> Major Faults                          806         755
> Swap Ins                              671         155
> Swap Outs                            5390        5790
> Allocation stalls                    4610        4562
> DMA allocs                            250          34
> DMA32 allocs                     78091501    76901680
> Normal allocs                    27004414    26587089
> Movable allocs                          0           0
> Direct pages scanned               125146      108854
> Kswapd pages scanned              2119757     2131589
> Kswapd pages reclaimed            2073183     2090937
> Direct pages reclaimed             124909      108699
> Kswapd efficiency                     97%         98%
> Kswapd velocity                  1161.027    1160.870
> Direct efficiency                     99%         99%
> Direct velocity                    68.545      59.283
> Percentage direct scans                5%          4%
> Zone normal velocity              296.678     294.389
> Zone dma32 velocity               932.841     925.764
> Zone dma velocity                   0.053       0.000
> Page writes by reclaim           5392.000    5790.600
> Page writes file                        1           0
> Page writes anon                     5390        5790
> Page reclaim immediate                104         218
> Sector Reads                      4350232     4376989
> Sector Writes                    11126496    11102113
> Page rescued immediate                  0           0
> Slabs scanned                     1705294     1692486
> Direct inode steals                  8700       16266
> Kswapd inode steals                 36352       28364
> Kswapd skipped wait                     0           0
> THP fault alloc                       599         567
> THP collapse alloc                    323         326
> THP splits                              0           0
> THP fault fallback                    806         805
> THP collapse fail                      17          18
> Compaction stalls                    2457        2070
> Compaction success                    906         527
> Compaction failures                  1551        1543
> Page migrate success              2031423     2423657
> Page migrate failure                32845       28790
> Compaction pages isolated         4129761     4916017
> Compaction migrate scanned       11996712    19370264
> Compaction free scanned         214970969   360662356
> Compaction cost                      2271        2745
> 
> Here, this patch doesn't change the success rate as direct compaction already
> tries what it can. There's however significant reduction in direct compaction
> stalls, made entirely of the successful stalls. This means the offload to
> kcompactd is working as expected, and direct compaction is reduced either due
> to detecting contention, or compaction deferred by kcompactd. In the previous
> version of this patchset there was some apparent reduction of success rate,
> but the changes in this version (such as using sync compaction only), new
> baseline kernel, and/or averaging results from 5 executions (my bet), made this
> go away.
> 

A general thought: are we being as nice as possible to small systems in
this patchset?  Does a small single-node machine which doesn't even use
hugepages really need the additional overhead and bloat which we're
adding?  A system which either doesn't use networking at all or uses
NICs which never request more than an order-1 page?

Maybe the answer there is "turn off compaction".  If so, I wonder if
we've done all we can to tell the builders of such systems that this is
what we think they should do.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-02-08 13:38 ` [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd Vlastimil Babka
  2016-02-08 22:58   ` Andrew Morton
@ 2016-02-09 10:21   ` Vlastimil Babka
  2016-03-01 14:14   ` Vlastimil Babka
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2016-02-09 10:21 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: linux-kernel, Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel,
	Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner

On 02/08/2016 02:38 PM, Vlastimil Babka wrote:
> Similarly to direct reclaim/compaction, kswapd attempts to combine reclaim and
> compaction to attempt making memory allocation of given order available. The
> details differ from direct reclaim e.g. in having high watermark as a goal.
> The code involved in kswapd's reclaim/compaction decisions has evolved to be
> quite complex. Testing reveals that it doesn't actually work in at least one
> scenario, and closer inspection suggests that it could be greatly simplified
> without compromising on the goal (make high-order page available) or efficiency
> (don't reclaim too much). The simplification relies on doing all compaction in
> kcompactd, which is simply woken up when high watermarks are reached by
> kswapd's reclaim.
> 
> The scenario where kswapd compaction doesn't work was found with mmtests test
> stress-highalloc configured to attempt order-9 allocations without direct
> reclaim, just waking up kswapd. There was no compaction attempt from kswapd
> during the whole test. Some added instrumentation shows what happens:
> 
> - balance_pgdat() sets end_zone to Normal, as it's not balanced
> - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but it
>   cannot reclaim anything, so sc.nr_reclaimed is 0
> - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so it
>   merely checks if high watermarks were reached for base pages. This is true,
>   so no reclaim is attempted. For DMA, testorder=0 wasn't used, as
>   compaction_suitable() returned COMPACT_SKIPPED
> - even though the pgdat_needs_compaction flag wasn't set to false, no
>   compaction happens due to the condition sc.nr_reclaimed > nr_attempted
>   being false (as 0 < 99)
> - priority-- due to nr_reclaimed being 0, repeat until priority reaches 0
>   pgdat_balanced() is false as only the small zone DMA appears balanced
>   (curiously in that check, watermark appears OK and compaction_suitable()
>   returns COMPACT_PARTIAL, because a lower classzone_idx is used there)
> 
> Now, even if it was decided that reclaim shouldn't be attempted on the DMA
> zone, the scenario would be the same, as (sc.nr_reclaimed=0 > nr_attempted=0)
> is also false. The condition really should use >= as the comment suggests.
> Then there is a mismatch in the check for setting pgdat_needs_compaction to
> false using low watermark, while the rest uses high watermark, and who knows
> what other subtlety. Hopefully this demonstrates that this is unsustainable.
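
To make the two cases above concrete, here is a minimal Python sketch of just
the comparison. The function names are hypothetical and only model the quoted
condition; this is not the kernel code itself.

```python
def old_condition(nr_reclaimed, nr_attempted):
    # Models the existing check, which uses a strict greater-than.
    return nr_reclaimed > nr_attempted


def suggested_condition(nr_reclaimed, nr_attempted):
    # Models the check with >=, as the code comment suggests it should be.
    return nr_reclaimed >= nr_attempted


# (0, 99): reclaim was attempted on DMA but reclaimed nothing.
# (0, 0):  reclaim was not attempted at all.
for nr_reclaimed, nr_attempted in [(0, 99), (0, 0)]:
    print(nr_reclaimed, nr_attempted,
          old_condition(nr_reclaimed, nr_attempted),
          suggested_condition(nr_reclaimed, nr_attempted))
```

With the strict comparison, neither case lets compaction run. With >=, only
the (0, 0) case does, so the fruitless DMA reclaim attempt would still block
compaction.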
> 
> Luckily we can simplify this a lot. The reclaim/compaction decisions make
> sense for direct reclaim scenario, but in kswapd, our primary goal is to reach
> high watermark in order-0 pages. Afterwards we can attempt compaction just
> once. Unlike direct reclaim, we don't reclaim extra pages (over the high
> watermark), the current code already disallows it for good reasons.
> 
> After this patch, we simply wake up kcompactd to process the pgdat, after we
> have either succeeded or failed to reach the high watermarks in kswapd, which
> goes to sleep. We pass kswapd's order and classzone_idx, so kcompactd can apply
> the same criteria to determine which zones are worth compacting. Note that we
> use the classzone_idx from wakeup_kswapd(), not balanced_classzone_idx which
> can include higher zones that kswapd tried to balance too, but didn't consider
> them in pgdat_balanced().
> 
> Since kswapd now cannot create high-order pages itself, we need to adjust how
> it determines the zones to be balanced. The key element here is adding a
> "highorder" parameter to zone_balanced, which, when set to false, makes it
> consider only order-0 watermark instead of the desired higher order (this was
> done previously by kswapd_shrink_zone(), but not elsewhere).  This false is
> passed for example in pgdat_balanced(). Importantly, wakeup_kswapd() uses true
> to make sure kswapd and thus kcompactd are woken up for a high-order allocation
> failure.
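
As a rough illustration of what the "highorder" parameter changes, here is a
simplified Python model. It assumes a watermark check that passes for order-0
when the total free pages exceed the mark, and for higher orders additionally
requires at least one free block of the requested order or larger. The
free_blocks list and both function bodies are assumptions made for the sketch,
not the kernel implementation.

```python
def watermark_ok(free_blocks, mark, order):
    # free_blocks[o] is the number of free blocks of order o (hypothetical model).
    free_pages = sum(count << o for o, count in enumerate(free_blocks))
    if free_pages <= mark:
        return False
    if order == 0:
        return True
    # A high-order request also needs a free block that is large enough.
    return any(count > 0 for count in free_blocks[order:])


def zone_balanced(free_blocks, high_wmark, order, highorder):
    # With highorder=False only the order-0 watermark is considered.
    return watermark_ok(free_blocks, high_wmark, order if highorder else 0)


# A fragmented zone: plenty of free base pages, no free high-order blocks.
fragmented = [2000] + [0] * 10
print(zone_balanced(fragmented, 1000, 9, highorder=False))
print(zone_balanced(fragmented, 1000, 9, highorder=True))
```

In this model the fragmented zone counts as balanced for kswapd's order-0
purposes, while the high-order check used for the wakeup decision still fails.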
> 
> For testing, I used stress-highalloc configured to do order-9 allocations with
> GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just on kswapd/kcompactd
> reclaim/compaction (the interfering kernel builds in phases 1 and 2 work as
> usual):
> 
> stress-highalloc
>                               4.5-rc1               4.5-rc1
>                                3-test                4-test
> Success 1 Min          1.00 (  0.00%)        3.00 (-200.00%)
> Success 1 Mean         1.40 (  0.00%)        4.00 (-185.71%)
> Success 1 Max          2.00 (  0.00%)        6.00 (-200.00%)
> Success 2 Min          1.00 (  0.00%)        3.00 (-200.00%)
> Success 2 Mean         1.80 (  0.00%)        4.20 (-133.33%)
> Success 2 Max          3.00 (  0.00%)        6.00 (-100.00%)
> Success 3 Min         34.00 (  0.00%)       63.00 (-85.29%)
> Success 3 Mean        41.80 (  0.00%)       64.60 (-54.55%)
> Success 3 Max         53.00 (  0.00%)       67.00 (-26.42%)
> 
>              4.5-rc1     4.5-rc1
>               3-test      4-test
> User         3166.67     3088.82
> System       1153.37     1142.01
> Elapsed      1768.53     1780.91
> 
>                                   4.5-rc1     4.5-rc1
>                                    3-test      4-test
> Minor Faults                    106940795   106582816
> Major Faults                          829         813
> Swap Ins                              482         311
> Swap Outs                            6278        5598
> Allocation stalls                     128         184
> DMA allocs                            145          32
> DMA32 allocs                     74646161    74843238
> Normal allocs                    26090955    25886668
> Movable allocs                          0           0
> Direct pages scanned                32938       31429
> Kswapd pages scanned              2183166     2185293
> Kswapd pages reclaimed            2152359     2134389
> Direct pages reclaimed              32735       31234
> Kswapd efficiency                     98%         97%
> Kswapd velocity                  1243.877    1228.666
> Direct efficiency                     99%         99%
> Direct velocity                    18.767      17.671
> Percentage direct scans                1%          1%
> Zone normal velocity              299.981     291.409
> Zone dma32 velocity               962.522     954.928
> Zone dma velocity                   0.142       0.000
> Page writes by reclaim           6278.800    5598.600
> Page writes file                        0           0
> Page writes anon                     6278        5598
> Page reclaim immediate                 93          96
> Sector Reads                      4357114     4307161
> Sector Writes                    11053628    11053091
> Page rescued immediate                  0           0
> Slabs scanned                     1592829     1555770
> Direct inode steals                  1557        2025
> Kswapd inode steals                 46056       45418
> Kswapd skipped wait                     0           0
> THP fault alloc                       579         614
> THP collapse alloc                    304         324
> THP splits                              0           0
> THP fault fallback                    793         730
> THP collapse fail                      11          14
> Compaction stalls                    1013         959
> Compaction success                     92          69
> Compaction failures                   920         890
> Page migrate success               238457      662054
> Page migrate failure                23021       32846
> Compaction pages isolated          504695     1370326
> Compaction migrate scanned         661390     7025772
> Compaction free scanned          13476658    73302642
> Compaction cost                       262         762

Also (after adjusting mmtests' ftrace monitor):

Time kswapd awake               2547781     2269241
Time kcompactd awake                  0      119253
Time direct compacting           939937      557649
Time kswapd compacting                0           0
Time kcompactd compacting             0      119099

The decrease in overall time spent compacting doesn't match the increased
compaction stats. I suspect the tasks get rescheduled and the ftrace
monitor doesn't see that, so this is wall time, not CPU time. But I guess
direct compactors care about overall latency anyway; whether they are busy
compacting or waiting for a CPU doesn't matter...
It's also interesting how much time kswapd spent just going through all
the priorities and failing to even try compacting, over and over...
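
The mismatch can be spelled out with the numbers from the tables above. This
is only a sketch of the arithmetic in Python; the dictionary keys are labels
chosen for the example, and the time values are taken as-is from the monitor
output without assuming a unit.

```python
# "Time ... compacting" rows for 3-test (before) and 4-test (after).
before = {"direct": 939937, "kswapd": 0, "kcompactd": 0}
after = {"direct": 557649, "kswapd": 0, "kcompactd": 119099}

total_before = sum(before.values())
total_after = sum(after.values())
time_change = (total_after - total_before) / total_before

# "Compaction migrate scanned" row from the same pair of runs.
migrate_scanned_ratio = 7025772 / 661390

print(f"total compacting time: {total_before} -> {total_after} "
      f"({time_change:+.1%})")
print(f"migrate scanned grew {migrate_scanned_ratio:.1f}x")
```

So the summed compacting time goes down while the scanning counters grow
roughly tenfold, which is the inconsistency described above.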

> After this patch we see improvements in allocation success rate (especially for
> phase 3) along with increased compaction activity. The compaction stalls
> (direct compaction) in the interfering kernel builds (probably THP's) also
> decreased somewhat to kcompactd activity, yet THP alloc successes improved a

                     ^ thanks to

> bit.

> We can also configure stress-highalloc to perform both direct
> reclaim/compaction and wakeup kswapd/kcompactd, by using
> GFP_KERNEL|__GFP_HIGH|__GFP_COMP:
> 
> stress-highalloc
>                               4.5-rc1               4.5-rc1
>                               3-test2               4-test2
> Success 1 Min          4.00 (  0.00%)        6.00 (-50.00%)
> Success 1 Mean         8.00 (  0.00%)        8.40 ( -5.00%)
> Success 1 Max         12.00 (  0.00%)       13.00 ( -8.33%)
> Success 2 Min          4.00 (  0.00%)        6.00 (-50.00%)
> Success 2 Mean         8.20 (  0.00%)        8.60 ( -4.88%)
> Success 2 Max         13.00 (  0.00%)       12.00 (  7.69%)
> Success 3 Min         75.00 (  0.00%)       75.00 (  0.00%)
> Success 3 Mean        75.60 (  0.00%)       75.60 (  0.00%)
> Success 3 Max         77.00 (  0.00%)       76.00 (  1.30%)
> 
>              4.5-rc1     4.5-rc1
>              3-test2     4-test2
> User         3344.73     3258.62
> System       1194.24     1177.92
> Elapsed      1838.04     1837.02
> 
>                                   4.5-rc1     4.5-rc1
>                                   3-test2     4-test2
> Minor Faults                    111269736   109392253
> Major Faults                          806         755
> Swap Ins                              671         155
> Swap Outs                            5390        5790
> Allocation stalls                    4610        4562
> DMA allocs                            250          34
> DMA32 allocs                     78091501    76901680
> Normal allocs                    27004414    26587089
> Movable allocs                          0           0
> Direct pages scanned               125146      108854
> Kswapd pages scanned              2119757     2131589
> Kswapd pages reclaimed            2073183     2090937
> Direct pages reclaimed             124909      108699
> Kswapd efficiency                     97%         98%
> Kswapd velocity                  1161.027    1160.870
> Direct efficiency                     99%         99%
> Direct velocity                    68.545      59.283
> Percentage direct scans                5%          4%
> Zone normal velocity              296.678     294.389
> Zone dma32 velocity               932.841     925.764
> Zone dma velocity                   0.053       0.000
> Page writes by reclaim           5392.000    5790.600
> Page writes file                        1           0
> Page writes anon                     5390        5790
> Page reclaim immediate                104         218
> Sector Reads                      4350232     4376989
> Sector Writes                    11126496    11102113
> Page rescued immediate                  0           0
> Slabs scanned                     1705294     1692486
> Direct inode steals                  8700       16266
> Kswapd inode steals                 36352       28364
> Kswapd skipped wait                     0           0
> THP fault alloc                       599         567
> THP collapse alloc                    323         326
> THP splits                              0           0
> THP fault fallback                    806         805
> THP collapse fail                      17          18
> Compaction stalls                    2457        2070
> Compaction success                    906         527
> Compaction failures                  1551        1543
> Page migrate success              2031423     2423657
> Page migrate failure                32845       28790
> Compaction pages isolated         4129761     4916017
> Compaction migrate scanned       11996712    19370264
> Compaction free scanned         214970969   360662356
> Compaction cost                      2271        2745

Time kswapd awake               2532984     2326824
Time kcompactd awake                  0      257916
Time direct compacting           864839      735130
Time kswapd compacting                0           0
Time kcompactd compacting             0      257585

> 
> Here, this patch doesn't change the success rate as direct compaction already
> tries what it can. There's however significant reduction in direct compaction
> stalls, made entirely of the successful stalls. This means the offload to
> kcompactd is working as expected, and direct compaction is reduced either due
> to detecting contention, or compaction deferred by kcompactd. In the previous
> version of this patchset there was some apparent reduction of success rate,
> but the changes in this version (such as using sync compaction only), new
> baseline kernel, and/or averaging results from 5 executions (my bet), made this
> go away.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-02-08 22:58   ` Andrew Morton
@ 2016-02-09 10:53     ` Vlastimil Babka
  0 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2016-02-09 10:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Kirill A. Shutemov,
	Rik van Riel, Joonsoo Kim, Mel Gorman, David Rientjes,
	Michal Hocko, Johannes Weiner

On 02/08/2016 11:58 PM, Andrew Morton wrote:
>>
>> For testing, I used stress-highalloc configured to do order-9 allocations with
>> GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just on kswapd/kcompactd
>> reclaim/compaction (the interfering kernel builds in phases 1 and 2 work as
>> usual):
>>
>> stress-highalloc
>>                               4.5-rc1               4.5-rc1
>>                                3-test                4-test
> 
> What are "3-test" and "4-test"?  I'm assuming (hoping) they mean
> "before and after this patchset", but the nomenclature is odd.

3 and 4 are the numbers of the patches in the series. "test" is the config's
name, which I should have renamed to "nodirect" or something.

>> Success 1 Min          1.00 (  0.00%)        3.00 (-200.00%)
>> Success 1 Mean         1.40 (  0.00%)        4.00 (-185.71%)
>> Success 1 Max          2.00 (  0.00%)        6.00 (-200.00%)
>> Success 2 Min          1.00 (  0.00%)        3.00 (-200.00%)
>> Success 2 Mean         1.80 (  0.00%)        4.20 (-133.33%)
>> Success 2 Max          3.00 (  0.00%)        6.00 (-100.00%)
>> Success 3 Min         34.00 (  0.00%)       63.00 (-85.29%)
>> Success 3 Mean        41.80 (  0.00%)       64.60 (-54.55%)
>> Success 3 Max         53.00 (  0.00%)       67.00 (-26.42%)
>>
>>              4.5-rc1     4.5-rc1
>>               3-test      4-test
>> User         3166.67     3088.82
>> System       1153.37     1142.01
>> Elapsed      1768.53     1780.91
>>
>>                                   4.5-rc1     4.5-rc1
>>                                    3-test      4-test
>> Minor Faults                    106940795   106582816
>> Major Faults                          829         813
>> Swap Ins                              482         311
>> Swap Outs                            6278        5598
>> Allocation stalls                     128         184
>> DMA allocs                            145          32
>> DMA32 allocs                     74646161    74843238
>> Normal allocs                    26090955    25886668
>> Movable allocs                          0           0
>> Direct pages scanned                32938       31429
>> Kswapd pages scanned              2183166     2185293
>> Kswapd pages reclaimed            2152359     2134389
>> Direct pages reclaimed              32735       31234
>> Kswapd efficiency                     98%         97%
>> Kswapd velocity                  1243.877    1228.666
>> Direct efficiency                     99%         99%
>> Direct velocity                    18.767      17.671
> 
> What do "efficiency" and "velocity" refer to here?

Velocity is scanned pages per second, efficiency is the ratio of
reclaimed pages to scanned pages.
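
As a small sketch of those two definitions in Python (the function names are
made up for the example, and truncating the percentage is an assumption that
happens to reproduce the table rows, not something taken from mmtests itself):

```python
def efficiency_percent(pages_reclaimed, pages_scanned):
    # Ratio of reclaimed to scanned pages, as a truncated integer percentage.
    return 100 * pages_reclaimed // pages_scanned


def velocity(pages_scanned, seconds):
    # Scanned pages per second; which duration mmtests divides by is not
    # stated here, so the caller has to supply it.
    return pages_scanned / seconds


# Kswapd rows from the first table: pages reclaimed, pages scanned.
print(efficiency_percent(2152359, 2183166))  # 3-test
print(efficiency_percent(2134389, 2185293))  # 4-test
# Direct rows from the same table.
print(efficiency_percent(32735, 32938))      # 3-test
print(efficiency_percent(31234, 31429))      # 4-test
```

The velocity rows can't be reproduced exactly from the "Elapsed" line, so the
duration used for them is presumably measured differently.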

> 
>> Percentage direct scans                1%          1%
>> Zone normal velocity              299.981     291.409
>> Zone dma32 velocity               962.522     954.928
>> Zone dma velocity                   0.142       0.000
>> Page writes by reclaim           6278.800    5598.600
>> Page writes file                        0           0
>> Page writes anon                     6278        5598
>> Page reclaim immediate                 93          96
>> Sector Reads                      4357114     4307161
>> Sector Writes                    11053628    11053091
>> Page rescued immediate                  0           0
>> Slabs scanned                     1592829     1555770
>> Direct inode steals                  1557        2025
>> Kswapd inode steals                 46056       45418
>> Kswapd skipped wait                     0           0
>> THP fault alloc                       579         614
>> THP collapse alloc                    304         324
>> THP splits                              0           0
>> THP fault fallback                    793         730
>> THP collapse fail                      11          14
>> Compaction stalls                    1013         959
>> Compaction success                     92          69
>> Compaction failures                   920         890
>> Page migrate success               238457      662054
>> Page migrate failure                23021       32846
>> Compaction pages isolated          504695     1370326
>> Compaction migrate scanned         661390     7025772
>> Compaction free scanned          13476658    73302642
>> Compaction cost                       262         762
>>
>> After this patch we see improvements in allocation success rate (especially for
>> phase 3) along with increased compaction activity. The compaction stalls
>> (direct compaction) in the interfering kernel builds (probably THP's) also
>> decreased somewhat to kcompactd activity, yet THP alloc successes improved a
>> bit.
>>
>> We can also configure stress-highalloc to perform both direct
>> reclaim/compaction and wakeup kswapd/kcompactd, by using
>> GFP_KERNEL|__GFP_HIGH|__GFP_COMP:
>>
>> stress-highalloc
>>                               4.5-rc1               4.5-rc1
>>                               3-test2               4-test2
>> Success 1 Min          4.00 (  0.00%)        6.00 (-50.00%)
>> Success 1 Mean         8.00 (  0.00%)        8.40 ( -5.00%)
>> Success 1 Max         12.00 (  0.00%)       13.00 ( -8.33%)
>> Success 2 Min          4.00 (  0.00%)        6.00 (-50.00%)
>> Success 2 Mean         8.20 (  0.00%)        8.60 ( -4.88%)
>> Success 2 Max         13.00 (  0.00%)       12.00 (  7.69%)
>> Success 3 Min         75.00 (  0.00%)       75.00 (  0.00%)
>> Success 3 Mean        75.60 (  0.00%)       75.60 (  0.00%)
>> Success 3 Max         77.00 (  0.00%)       76.00 (  1.30%)
>>
>>              4.5-rc1     4.5-rc1
>>              3-test2     4-test2
>> User         3344.73     3258.62
>> System       1194.24     1177.92
>> Elapsed      1838.04     1837.02
> 
> Elapsed time increased in both test runs.

Yeah, elapsed and user time aren't so useful for this benchmark, because
the background interference is unpredictable. They're just there to quickly
spot major unexpected differences. System time is somewhat more
useful, and that didn't increase.

> But you later say "There's
> however significant reduction in direct compaction stalls, made
> entirely of the successful stalls".  This seems inconsistent - less
> stalls should mean less time stuck in D state.

In /proc/vmstat terms, compact_stall is when the allocating process goes
to direct compaction, so it doesn't necessarily mean D states.

I've replied to the original patch with some more detailed time data
based on tracepoints, which shows that (wall) time spent in direct
compaction did indeed decrease.

[...]

>> Here, this patch doesn't change the success rate as direct compaction already
>> tries what it can. There's however significant reduction in direct compaction
>> stalls, made entirely of the successful stalls. This means the offload to
>> kcompactd is working as expected, and direct compaction is reduced either due
>> to detecting contention, or compaction deferred by kcompactd. In the previous
>> version of this patchset there was some apparent reduction of success rate,
>> but the changes in this version (such as using sync compaction only), new
>> baseline kernel, and/or averaging results from 5 executions (my bet), made this
>> go away.
>>
> 
> A general thought: are we being as nice as possible to small systems in
> this patchset?  Does a small single-node machine which doesn't even use
> hugepages really need the additional overhead and bloat which we're
> adding?  A system which either doesn't use networking at all or uses
> NICs which never request more than an order-1 page?

Hmm, aren't even kernel stacks larger than order-1 nowadays? Maybe not
on some 32bit arm...

> Maybe the answer there is "turn off compaction".  If so, I wonder if
> we've done all we can to tell the builders of such systems that this is
> what we think they should do.

Frankly, I wouldn't recommend that to anyone, since lumpy reclaim is
gone. But I admit I've never built such a system. I hope that kcompactd
doesn't add that much bloat compared to the rest of the compaction
infrastructure: it's one thread and some extra variables in struct zone,
which come in fixed low numbers.



* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-02-08 13:38 ` [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd Vlastimil Babka
  2016-02-08 22:58   ` Andrew Morton
  2016-02-09 10:21   ` Vlastimil Babka
@ 2016-03-01 14:14   ` Vlastimil Babka
  2016-03-02  6:33   ` Joonsoo Kim
  2016-03-02 12:27   ` Vlastimil Babka
  4 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2016-03-01 14:14 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: linux-kernel, Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel,
	Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner

Hi Andrew,

here's an updated changelog for the patch in mmotm

http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-replace-kswapd-compaction-with-waking-up-kcompactd.patch

to reflect your earlier questions and my replies. I've named the result columns 
better, dropped stats that were not relevant, and included the ftrace-based times.

----8<----

Similarly to direct reclaim/compaction, kswapd attempts to combine reclaim and
compaction to make a memory allocation of the given order available. The
details differ from direct reclaim, e.g. in having the high watermark as a goal.
The code involved in kswapd's reclaim/compaction decisions has evolved to be
quite complex. Testing reveals that it doesn't actually work in at least one
scenario, and closer inspection suggests that it could be greatly simplified
without compromising on the goal (make high-order page available) or efficiency
(don't reclaim too much). The simplification relies on doing all compaction in
kcompactd, which is simply woken up when high watermarks are reached by
kswapd's reclaim.

The scenario where kswapd compaction doesn't work was found with mmtests test
stress-highalloc configured to attempt order-9 allocations without direct
reclaim, just waking up kswapd. There was no compaction attempt from kswapd
during the whole test. Some added instrumentation shows what happens:

- balance_pgdat() sets end_zone to Normal, as it's not balanced
- reclaim is attempted on DMA zone, which sets nr_attempted to 99, but it
   cannot reclaim anything, so sc.nr_reclaimed is 0
- for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so it
   merely checks if high watermarks were reached for base pages. This is true,
   so no reclaim is attempted. For DMA, testorder=0 wasn't used, as
   compaction_suitable() returned COMPACT_SKIPPED
- even though the pgdat_needs_compaction flag wasn't set to false, no
   compaction happens due to the condition sc.nr_reclaimed > nr_attempted
   being false (as 0 < 99)
- priority-- due to nr_reclaimed being 0, repeat until priority reaches 0;
   pgdat_balanced() is false as only the small zone DMA appears balanced
   (curiously in that check, watermark appears OK and compaction_suitable()
   returns COMPACT_PARTIAL, because a lower classzone_idx is used there)

Now, even if it was decided that reclaim shouldn't be attempted on the DMA
zone, the scenario would be the same, as (sc.nr_reclaimed=0 > nr_attempted=0)
is also false. The condition really should use >= as the comment suggests.
Then there is a mismatch in the check for setting pgdat_needs_compaction to
false using low watermark, while the rest uses high watermark, and who knows
what other subtlety. Hopefully this demonstrates that this is unsustainable.
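
The two cases above can be checked with a small standalone sketch. The
function names are hypothetical and only model the comparison discussed here,
not the actual balance_pgdat() code:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the kswapd check described above: compaction is only
 * considered when strictly more pages were reclaimed than attempted.
 */
bool toy_enough_reclaimed_strict(unsigned long nr_reclaimed,
				 unsigned long nr_attempted)
{
	return nr_reclaimed > nr_attempted;
}

/* The same check using >=, as the changelog says the comment suggests. */
bool toy_enough_reclaimed_inclusive(unsigned long nr_reclaimed,
				    unsigned long nr_attempted)
{
	return nr_reclaimed >= nr_attempted;
}
```

With the instrumented values (0 reclaimed, 99 attempted) both variants are
false. With no reclaim attempted at all (0 and 0), only the >= variant is
true, so switching the operator would help that case but not the one where
the DMA zone inflates nr_attempted.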

Luckily we can simplify this a lot. The reclaim/compaction decisions make
sense for direct reclaim scenario, but in kswapd, our primary goal is to reach
high watermark in order-0 pages. Afterwards we can attempt compaction just
once. Unlike direct reclaim, we don't reclaim extra pages (over the high
watermark); the current code already disallows it for good reasons.

After this patch, we simply wake up kcompactd to process the pgdat, after we
have either succeeded or failed to reach the high watermarks in kswapd, which
goes to sleep. We pass kswapd's order and classzone_idx, so kcompactd can apply
the same criteria to determine which zones are worth compacting. Note that we
use the classzone_idx from wakeup_kswapd(), not balanced_classzone_idx which
can include higher zones that kswapd tried to balance too, but didn't consider
them in pgdat_balanced().

Since kswapd now cannot create high-order pages itself, we need to adjust how
it determines the zones to be balanced. The key element here is adding a
"highorder" parameter to zone_balanced, which, when set to false, makes it
consider only order-0 watermark instead of the desired higher order (this was
done previously by kswapd_shrink_zone(), but not elsewhere).  This false is
passed for example in pgdat_balanced(). Importantly, wakeup_kswapd() uses true
to make sure kswapd and thus kcompactd are woken up for a high-order allocation
failure.
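
A minimal sketch of the "highorder" idea, on a toy zone model. Everything
named toy_* is an assumption made for illustration, and the watermark check
is heavily simplified compared to the kernel's:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_ORDER 11

/* Toy stand-in for struct zone: free blocks per order plus one watermark. */
struct toy_zone {
	unsigned long nr_free[TOY_MAX_ORDER];	/* free blocks of each order */
	unsigned long high_wmark;		/* in base pages */
};

/*
 * Simplified watermark check: enough free base pages overall and, for
 * order > 0, at least one free block of the requested order or larger.
 */
bool toy_watermark_ok(const struct toy_zone *z, int order, unsigned long mark)
{
	unsigned long free_pages = 0;
	int o;

	assert(order >= 0 && order < TOY_MAX_ORDER);

	for (o = 0; o < TOY_MAX_ORDER; o++)
		free_pages += z->nr_free[o] << o;

	if (free_pages <= mark)
		return false;
	if (order == 0)
		return true;

	for (o = order; o < TOY_MAX_ORDER; o++)
		if (z->nr_free[o])
			return true;
	return false;
}

/* With highorder == false, only the order-0 watermark is considered. */
bool toy_zone_balanced(const struct toy_zone *z, bool highorder, int order)
{
	if (!highorder)
		order = 0;
	return toy_watermark_ok(z, order, z->high_wmark);
}
```

In this model a zone with plenty of order-0 pages but no order-9 block counts
as balanced for kswapd's own purposes (highorder false), yet a check with
highorder true still fails, which is what keeps the wakeup path active for
high-order allocation failures.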

For testing, I used stress-highalloc configured to do order-9 allocations with
GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just on kswapd/kcompactd
reclaim/compaction (the interfering kernel builds in phases 1 and 2 work as
usual):

stress-highalloc
                        4.5-rc1+before          4.5-rc1+after
                             -nodirect              -nodirect
Success 1 Min          1.00 (  0.00%)        3.00 (-200.00%)
Success 1 Mean         1.40 (  0.00%)        4.00 (-185.71%)
Success 1 Max          2.00 (  0.00%)        6.00 (-200.00%)
Success 2 Min          1.00 (  0.00%)        3.00 (-200.00%)
Success 2 Mean         1.80 (  0.00%)        4.20 (-133.33%)
Success 2 Max          3.00 (  0.00%)        6.00 (-100.00%)
Success 3 Min         34.00 (  0.00%)       63.00 (-85.29%)
Success 3 Mean        41.80 (  0.00%)       64.60 (-54.55%)
Success 3 Max         53.00 (  0.00%)       67.00 (-26.42%)

User                          3166.67       3088.82
System                        1153.37       1142.01
Elapsed                       1768.53       1780.91

                            4.5-rc1+before  4.5-rc1+after
                                 -nodirect   -nodirect
Direct pages scanned                32938       31429
Kswapd pages scanned              2183166     2185293
Kswapd pages reclaimed            2152359     2134389
Direct pages reclaimed              32735       31234
Percentage direct scans                1%          1%
THP fault alloc                       579         614
THP collapse alloc                    304         324
THP splits                              0           0
THP fault fallback                    793         730
THP collapse fail                      11          14
Compaction stalls                    1013         959
Compaction success                     92          69
Compaction failures                   920         890
Page migrate success               238457      662054
Page migrate failure                23021       32846
Compaction pages isolated          504695     1370326
Compaction migrate scanned         661390     7025772
Compaction free scanned          13476658    73302642
Compaction cost                       262         762

After this patch we see improvements in allocation success rate (especially for
phase 3) along with increased compaction activity. The compaction stalls
(direct compaction) in the interfering kernel builds (probably THP's) also
decreased somewhat thanks to kcompactd activity, yet THP alloc successes
improved a bit.
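
A note on reading the percentage column: judging from the numbers in the
table, it appears to be computed relative to the baseline as
(baseline - value) / baseline, so a negative percentage means a higher
success rate. The sketch below only reproduces that arithmetic; the function
name is hypothetical and this is not mmtests code:

```c
#include <assert.h>

/*
 * Relative difference against the baseline, in percent. Assumed formula,
 * inferred from the table values; negative means "value" is larger.
 */
double toy_delta_percent(double baseline, double value)
{
	assert(baseline != 0.0);
	return (baseline - value) / baseline * 100.0;
}
```

For example, the "Success 3 Mean" row (41.80 before, 64.60 after) comes out
at about -54.55 under this formula, matching the table.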

Note that elapsed and user time aren't so useful for this benchmark, because
the background interference is unpredictable. They're just there to quickly
spot some major unexpected differences. System time is somewhat more useful,
and that didn't increase.

Also (after adjusting mmtests' ftrace monitor):

Time kswapd awake               2547781     2269241
Time kcompactd awake                  0      119253
Time direct compacting           939937      557649
Time kswapd compacting                0           0
Time kcompactd compacting             0      119099

The decrease of overall time spent compacting appears to not match the increased
compaction stats. I suspect the tasks get rescheduled and since the ftrace
monitor doesn't see that, the reported time is wall time, not CPU time. But
arguably direct compactors care about overall latency anyway, whether busy
compacting or waiting for CPU doesn't matter. And that latency seems to have
almost halved.
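
To put numbers on the ftrace figures above, here is a small sketch that sums
the per-context compaction times and computes the relative reduction. The
struct and function names are made up for illustration, and the time unit is
whatever the ftrace monitor reports (it is not stated here):

```c
#include <assert.h>

/* The three "Time ... compacting" rows for one kernel. */
struct toy_compact_times {
	unsigned long direct;
	unsigned long kswapd;
	unsigned long kcompactd;
};

/* Total time spent compacting across all contexts. */
unsigned long toy_total_compacting(const struct toy_compact_times *t)
{
	return t->direct + t->kswapd + t->kcompactd;
}

/* Relative reduction from "before" to "after", in percent. */
double toy_reduction_percent(unsigned long before, unsigned long after)
{
	assert(before > 0);
	return ((double)before - (double)after) / (double)before * 100.0;
}
```

Fed with the values above (939937 before; 557649 direct plus 119099 kcompactd
after), this gives a drop of roughly 41% in direct compaction time and
roughly 28% in total compaction time.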

It's also interesting how much time kswapd spent awake just going through all
the priorities and failing to even try compacting, over and over.

We can also configure stress-highalloc to perform both direct
reclaim/compaction and wakeup kswapd/kcompactd, by using
GFP_KERNEL|__GFP_HIGH|__GFP_COMP:

stress-highalloc
                        4.5-rc1+before         4.5-rc1+after
                               -direct               -direct
Success 1 Min          4.00 (  0.00%)        6.00 (-50.00%)
Success 1 Mean         8.00 (  0.00%)        8.40 ( -5.00%)
Success 1 Max         12.00 (  0.00%)       13.00 ( -8.33%)
Success 2 Min          4.00 (  0.00%)        6.00 (-50.00%)
Success 2 Mean         8.20 (  0.00%)        8.60 ( -4.88%)
Success 2 Max         13.00 (  0.00%)       12.00 (  7.69%)
Success 3 Min         75.00 (  0.00%)       75.00 (  0.00%)
Success 3 Mean        75.60 (  0.00%)       75.60 (  0.00%)
Success 3 Max         77.00 (  0.00%)       76.00 (  1.30%)

User                          3344.73       3258.62
System                        1194.24       1177.92
Elapsed                       1838.04       1837.02

                            4.5-rc1+before  4.5-rc1+after
                                   -direct     -direct
Direct pages scanned               125146      108854
Kswapd pages scanned              2119757     2131589
Kswapd pages reclaimed            2073183     2090937
Direct pages reclaimed             124909      108699
Percentage direct scans                5%          4%
THP fault alloc                       599         567
THP collapse alloc                    323         326
THP splits                              0           0
THP fault fallback                    806         805
THP collapse fail                      17          18
Compaction stalls                    2457        2070
Compaction success                    906         527
Compaction failures                  1551        1543
Page migrate success              2031423     2423657
Page migrate failure                32845       28790
Compaction pages isolated         4129761     4916017
Compaction migrate scanned       11996712    19370264
Compaction free scanned         214970969   360662356
Compaction cost                      2271        2745

In this scenario, this patch doesn't change the overall success rate as direct
compaction already tries all it can. There's however significant reduction in
direct compaction stalls (that is, the number of allocations that went into
direct compaction).  The number of successes (i.e. direct compaction stalls
that ended up with successful allocation) is reduced by the same number. This
means the offload to kcompactd is working as expected, and direct compaction is
reduced either due to detecting contention, or compaction deferred by
kcompactd. In the previous version of this patchset there was some apparent
reduction of success rate, but the changes in this version (such as using sync
compaction only), new baseline kernel, and/or averaging results from 5
executions (my bet), made this go away.

Ftrace-based stats seem to roughly agree:

Time kswapd awake               2532984     2326824
Time kcompactd awake                  0      257916
Time direct compacting           864839      735130
Time kswapd compacting                0           0
Time kcompactd compacting             0      257585

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>


* Re: [PATCH v2 5/5] mm, compaction: adapt isolation_suitable flushing to kcompactd
  2016-02-08 13:38 ` [PATCH v2 5/5] mm, compaction: adapt isolation_suitable flushing to kcompactd Vlastimil Babka
@ 2016-03-01 14:44   ` Vlastimil Babka
  0 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2016-03-01 14:44 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: linux-kernel, Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel,
	Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner

For consistency with the previous patch's updated changelog, here's a similar
update for this one:
http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-adapt-isolation_suitable-flushing-to-kcompactd.patch

----8<----

Compaction maintains a pageblock_skip bitmap to record pageblocks where
isolation recently failed. This bitmap can be reset in three ways:

1) direct compaction is restarting after going through the full deferred cycle

2) kswapd goes to sleep, and some other direct compaction has previously
    finished scanning the whole zone and set zone->compact_blockskip_flush.
    Note that a successful direct compaction clears this flag.

3) compaction was invoked manually via trigger in /proc

The case 2) is somewhat fuzzy to begin with, but after introducing kcompactd we
should update it. Both the check for direct compaction in 1) and the setting
of the flush flag in 2) use current_is_kswapd(), which doesn't work for
kcompactd. Thus, this patch adds bool direct_compaction to compact_control to
use in 2). For the case 1) we remove the check completely - unlike the former
kswapd compaction, kcompactd does use the deferred compaction functionality, so
flushing tied to restarting from deferred compaction makes sense here.
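
The resulting rules can be sketched on a toy model. All toy_* names are
assumptions for illustration; only direct_compaction and
compact_blockskip_flush are names taken from the description above, and the
real bitmap is reduced to a single bool:

```c
#include <assert.h>
#include <stdbool.h>

struct toy_zone {
	bool compact_blockskip_flush;	/* flush requested for kswapd sleep */
	bool skip_bits_set;		/* stands in for the pageblock_skip bitmap */
};

struct toy_compact_control {
	bool direct_compaction;		/* false for kcompactd */
};

/* Case 2): a whole-zone scan finished; only direct compaction asks for a flush. */
void toy_scan_complete(struct toy_zone *zone,
		       const struct toy_compact_control *cc)
{
	assert(zone && cc);
	if (cc->direct_compaction)
		zone->compact_blockskip_flush = true;
}

/* Case 1): restarting after a full deferred cycle flushes for any caller. */
void toy_compaction_start(struct toy_zone *zone, bool restarting)
{
	if (restarting)
		zone->skip_bits_set = false;
}

/* kswapd going to sleep honours a pending flush request. */
void toy_kswapd_sleep(struct toy_zone *zone)
{
	if (zone->compact_blockskip_flush) {
		zone->skip_bits_set = false;
		zone->compact_blockskip_flush = false;
	}
}
```

In this model a kcompactd scan that completes never requests a flush by
itself, while a restart from deferred compaction flushes regardless of
whether the caller is direct compaction or kcompactd.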

Note that when kswapd goes to sleep, kcompactd is woken up, so it will see the
flushed pageblock_skip bits. This is different from when the former kswapd
compaction observed the bits and I believe it makes more sense. Kcompactd can
afford to be more thorough than a direct compaction trying to limit allocation
latency, or kswapd whose primary goal is to reclaim.

To sum up, after this patch, the pageblock_skip flushing makes intuitively
more sense for kcompactd. Practically, the differences are minimal.
Stress-highalloc with order-9 allocations without direct reclaim/compaction:

stress-highalloc
                        4.5-rc1+before         4.5-rc1+after
                             -nodirect             -nodirect
Success 1 Min          3.00 (  0.00%)        5.00 (-66.67%)
Success 1 Mean         4.00 (  0.00%)        6.20 (-55.00%)
Success 1 Max          6.00 (  0.00%)        7.00 (-16.67%)
Success 2 Min          3.00 (  0.00%)        5.00 (-66.67%)
Success 2 Mean         4.20 (  0.00%)        6.40 (-52.38%)
Success 2 Max          6.00 (  0.00%)        7.00 (-16.67%)
Success 3 Min         63.00 (  0.00%)       62.00 (  1.59%)
Success 3 Mean        64.60 (  0.00%)       63.80 (  1.24%)
Success 3 Max         67.00 (  0.00%)       65.00 (  2.99%)

User                          3088.82       3181.09
System                        1142.01       1158.25
Elapsed                       1780.91       1799.37

                            4.5-rc1+before  4.5-rc1+after
                                 -nodirect   -nodirect
Direct pages scanned                31429       32797
Kswapd pages scanned              2185293     2202613
Kswapd pages reclaimed            2134389     2143524
Direct pages reclaimed              31234       32545
Percentage direct scans                1%          1%
THP fault alloc                       614         612
THP collapse alloc                    324         316
THP splits                              0           0
THP fault fallback                    730         778
THP collapse fail                      14          16
Compaction stalls                     959        1007
Compaction success                     69          67
Compaction failures                   890         939
Page migrate success               662054      721374
Page migrate failure                32846       23469
Compaction pages isolated         1370326     1479924
Compaction migrate scanned        7025772     8812554
Compaction free scanned          73302642    84327916
Compaction cost                       762         838

With direct reclaim/compaction:

stress-highalloc
                        4.5-rc1+before         4.5-rc1+after
                               -direct               -direct
Success 1 Min          6.00 (  0.00%)        9.00 (-50.00%)
Success 1 Mean         8.40 (  0.00%)       10.00 (-19.05%)
Success 1 Max         13.00 (  0.00%)       11.00 ( 15.38%)
Success 2 Min          6.00 (  0.00%)        9.00 (-50.00%)
Success 2 Mean         8.60 (  0.00%)       10.00 (-16.28%)
Success 2 Max         12.00 (  0.00%)       11.00 (  8.33%)
Success 3 Min         75.00 (  0.00%)       74.00 (  1.33%)
Success 3 Mean        75.60 (  0.00%)       75.20 (  0.53%)
Success 3 Max         76.00 (  0.00%)       76.00 (  0.00%)

User                          3258.62       3246.04
System                        1177.92       1172.29
Elapsed                       1837.02       1836.76

                            4.5-rc1+before  4.5-rc1+after
                                   -direct     -direct
Direct pages scanned               108854      120966
Kswapd pages scanned              2131589     2135012
Kswapd pages reclaimed            2090937     2108388
Direct pages reclaimed             108699      120577
Percentage direct scans                4%          5%
THP fault alloc                       567         652
THP collapse alloc                    326         354
THP splits                              0           0
THP fault fallback                    805         793
THP collapse fail                      18          16
Compaction stalls                    2070        2025
Compaction success                    527         518
Compaction failures                  1543        1507
Page migrate success              2423657     2360608
Page migrate failure                28790       40852
Compaction pages isolated         4916017     4802025
Compaction migrate scanned       19370264    21750613
Compaction free scanned         360662356   344372001
Compaction cost                      2745        2694

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>


* Re: [PATCH v2 2/5] mm, compaction: introduce kcompactd
  2016-02-08 13:38 ` [PATCH v2 2/5] mm, compaction: introduce kcompactd Vlastimil Babka
@ 2016-03-02  6:09   ` Joonsoo Kim
  2016-03-02 12:25     ` Vlastimil Babka
  0 siblings, 1 reply; 28+ messages in thread
From: Joonsoo Kim @ 2016-03-02  6:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, David Rientjes,
	Michal Hocko, Johannes Weiner

On Mon, Feb 08, 2016 at 02:38:08PM +0100, Vlastimil Babka wrote:
> Memory compaction can be currently performed in several contexts:
> 
> - kswapd balancing a zone after a high-order allocation failure
> - direct compaction to satisfy a high-order allocation, including THP page
>   fault attempts
> - khugepaged trying to collapse a hugepage
> - manually from /proc
> 
> The purpose of compaction is two-fold. The obvious purpose is to satisfy a
> (pending or future) high-order allocation, and is easy to evaluate. The other
> purpose is to keep overall memory fragmentation low and help the
> anti-fragmentation mechanism. The success wrt the latter purpose is more
> difficult to evaluate though.
> 
> The current situation wrt the purposes has a few drawbacks:
> 
> - compaction is invoked only when a high-order page or hugepage is not
>   available (or manually). This might be too late for the purposes of keeping
>   memory fragmentation low.
> - direct compaction increases latency of allocations. Again, it would be
>   better if compaction was performed asynchronously to keep fragmentation low,
>   before the allocation itself comes.
> - (a special case of the previous) the cost of compaction during THP page
>   faults can easily offset the benefits of THP.
> - kswapd compaction appears to be complex, fragile and not working in some
>   scenarios. It could also end up compacting for a high-order allocation
>   request when it should be reclaiming memory for a later order-0 request.
> 
> To improve the situation, we should be able to benefit from an equivalent of
> kswapd, but for compaction - i.e. a background thread which responds to
> fragmentation and the need for high-order allocations (including hugepages)
> somewhat proactively.
> 
> One possibility is to extend the responsibilities of kswapd, which could
> however complicate its design too much. It should be better to let kswapd
> handle reclaim, as order-0 allocations are often more critical than high-order
> ones.
> 
> Another possibility is to extend khugepaged, but this kthread is a single
> instance and tied to THP configs.
> 
> This patch goes with the option of a new set of per-node kthreads called
> kcompactd, and lays the foundations, without introducing any new tunables.
> The lifecycle mimics kswapd kthreads, including the memory hotplug hooks.
> 
> For compaction, kcompactd uses the standard compaction_suitable() and
> compact_finished() criteria and the deferred compaction functionality. Unlike
> direct compaction, it uses only sync compaction, as there's no allocation
> latency to minimize.
> 
> This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
> compact/reclaim loop for high-order pages will be replaced by waking up
> kcompactd in the next patch with the description of what's wrong with the old
> approach.
> 
> Waking up of the kcompactd threads is also tied to kswapd activity and follows
> these rules:
> - we don't want to affect any fastpaths, so wake up kcompactd only from the
>   slowpath, as it's done for kswapd
> - if kswapd is doing reclaim, it's more important than compaction, so don't
>   invoke kcompactd until kswapd goes to sleep
> - the target order used for kswapd is passed to kcompactd
> 
> Possible future uses for kcompactd include the ability to wake up
> kcompactd on demand in special situations, such as when hugepages are not
> available (currently not done due to __GFP_NO_KSWAPD) or when a fragmentation
> event (i.e. __rmqueue_fallback()) occurs. It's also possible to perform
> periodic compaction with kcompactd.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/compaction.h        |  16 +++
>  include/linux/mmzone.h            |   6 ++
>  include/linux/vm_event_item.h     |   1 +
>  include/trace/events/compaction.h |  55 ++++++++++
>  mm/compaction.c                   | 220 ++++++++++++++++++++++++++++++++++++++
>  mm/memory_hotplug.c               |   9 +-
>  mm/page_alloc.c                   |   3 +
>  mm/vmstat.c                       |   1 +
>  8 files changed, 309 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 4cd4ddf64cc7..1367c0564d42 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -52,6 +52,10 @@ extern void compaction_defer_reset(struct zone *zone, int order,
>  				bool alloc_success);
>  extern bool compaction_restarting(struct zone *zone, int order);
>  
> +extern int kcompactd_run(int nid);
> +extern void kcompactd_stop(int nid);
> +extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
> +
>  #else
>  static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
>  			unsigned int order, int alloc_flags,
> @@ -84,6 +88,18 @@ static inline bool compaction_deferred(struct zone *zone, int order)
>  	return true;
>  }
>  
> +static int kcompactd_run(int nid)
> +{
> +	return 0;
> +}
> +static void kcompactd_stop(int nid)
> +{
> +}
> +
> +static void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
> +{
> +}
> +
>  #endif /* CONFIG_COMPACTION */
>  
>  #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 48cb6f0c6083..c8dfb14105c7 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -670,6 +670,12 @@ typedef struct pglist_data {
>  					   mem_hotplug_begin/end() */
>  	int kswapd_max_order;
>  	enum zone_type classzone_idx;
> +#ifdef CONFIG_COMPACTION
> +	int kcompactd_max_order;
> +	enum zone_type kcompactd_classzone_idx;
> +	wait_queue_head_t kcompactd_wait;
> +	struct task_struct *kcompactd;
> +#endif
>  #ifdef CONFIG_NUMA_BALANCING
>  	/* Lock serializing the migrate rate limiting window */
>  	spinlock_t numabalancing_migrate_lock;
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 67c1dbd19c6d..58ecc056ee45 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -53,6 +53,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
>  		COMPACTISOLATED,
>  		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
> +		KCOMPACTD_WAKE,
>  #endif
>  #ifdef CONFIG_HUGETLB_PAGE
>  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
> index 111e5666e5eb..e215bf68f521 100644
> --- a/include/trace/events/compaction.h
> +++ b/include/trace/events/compaction.h
> @@ -350,6 +350,61 @@ DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_defer_reset,
>  );
>  #endif
>  
> +TRACE_EVENT(mm_compaction_kcompactd_sleep,
> +
> +	TP_PROTO(int nid),
> +
> +	TP_ARGS(nid),
> +
> +	TP_STRUCT__entry(
> +		__field(int, nid)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->nid = nid;
> +	),
> +
> +	TP_printk("nid=%d", __entry->nid)
> +);
> +
> +DECLARE_EVENT_CLASS(kcompactd_wake_template,
> +
> +	TP_PROTO(int nid, int order, enum zone_type classzone_idx),
> +
> +	TP_ARGS(nid, order, classzone_idx),
> +
> +	TP_STRUCT__entry(
> +		__field(int, nid)
> +		__field(int, order)
> +		__field(enum zone_type, classzone_idx)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->nid = nid;
> +		__entry->order = order;
> +		__entry->classzone_idx = classzone_idx;
> +	),
> +
> +	TP_printk("nid=%d order=%d classzone_idx=%-8s",
> +		__entry->nid,
> +		__entry->order,
> +		__print_symbolic(__entry->classzone_idx, ZONE_TYPE))
> +);
> +
> +DEFINE_EVENT(kcompactd_wake_template, mm_compaction_wakeup_kcompactd,
> +
> +	TP_PROTO(int nid, int order, enum zone_type classzone_idx),
> +
> +	TP_ARGS(nid, order, classzone_idx)
> +);
> +
> +DEFINE_EVENT(kcompactd_wake_template, mm_compaction_kcompactd_wake,
> +
> +	TP_PROTO(int nid, int order, enum zone_type classzone_idx),
> +
> +	TP_ARGS(nid, order, classzone_idx)
> +);
> +
>  #endif /* _TRACE_COMPACTION_H */
>  
>  /* This part must be outside protection */
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 93f71d968098..c03715ba65c7 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -17,6 +17,9 @@
>  #include <linux/balloon_compaction.h>
>  #include <linux/page-isolation.h>
>  #include <linux/kasan.h>
> +#include <linux/kthread.h>
> +#include <linux/freezer.h>
> +#include <linux/module.h>
>  #include "internal.h"
>  
>  #ifdef CONFIG_COMPACTION
> @@ -1736,4 +1739,221 @@ void compaction_unregister_node(struct node *node)
>  }
>  #endif /* CONFIG_SYSFS && CONFIG_NUMA */
>  
> +static inline bool kcompactd_work_requested(pg_data_t *pgdat)
> +{
> +	return pgdat->kcompactd_max_order > 0;
> +}
> +
> +static bool kcompactd_node_suitable(pg_data_t *pgdat)
> +{
> +	int zoneid;
> +	struct zone *zone;
> +	enum zone_type classzone_idx = pgdat->kcompactd_classzone_idx;
> +
> +	for (zoneid = 0; zoneid < classzone_idx; zoneid++) {
> +		zone = &pgdat->node_zones[zoneid];
> +
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
> +					classzone_idx) == COMPACT_CONTINUE)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static void kcompactd_do_work(pg_data_t *pgdat)
> +{
> +	/*
> +	 * With no special task, compact all zones so that a page of requested
> +	 * order is allocatable.
> +	 */
> +	int zoneid;
> +	struct zone *zone;
> +	struct compact_control cc = {
> +		.order = pgdat->kcompactd_max_order,
> +		.classzone_idx = pgdat->kcompactd_classzone_idx,
> +		.mode = MIGRATE_SYNC_LIGHT,
> +		.ignore_skip_hint = true,
> +
> +	};
> +	bool success = false;
> +
> +	trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
> +							cc.classzone_idx);
> +	count_vm_event(KCOMPACTD_WAKE);
> +
> +	for (zoneid = 0; zoneid < cc.classzone_idx; zoneid++) {
> +		int status;
> +
> +		zone = &pgdat->node_zones[zoneid];
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		if (compaction_deferred(zone, cc.order))
> +			continue;
> +
> +		if (compaction_suitable(zone, cc.order, 0, zoneid) !=
> +							COMPACT_CONTINUE)
> +			continue;
> +
> +		cc.nr_freepages = 0;
> +		cc.nr_migratepages = 0;
> +		cc.zone = zone;
> +		INIT_LIST_HEAD(&cc.freepages);
> +		INIT_LIST_HEAD(&cc.migratepages);
> +
> +		status = compact_zone(zone, &cc);
> +
> +		if (zone_watermark_ok(zone, cc.order, low_wmark_pages(zone),
> +						cc.classzone_idx, 0)) {
> +			success = true;
> +			compaction_defer_reset(zone, cc.order, false);
> +		} else if (cc.mode != MIGRATE_ASYNC &&
> +						status == COMPACT_COMPLETE) {
> +			defer_compaction(zone, cc.order);
> +		}

We already set mode to MIGRATE_SYNC_LIGHT, so this cc.mode check looks weird.
It would be better to change it and add a comment explaining that we can
safely call defer_compaction() here.
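
A sketch of how the per-zone decision could read without the mode test. This
is a toy model with hypothetical names, not a proposed replacement for the
hunk above:

```c
#include <assert.h>
#include <stdbool.h>

/* What to do with the zone's deferred-compaction state after one pass. */
enum toy_defer_action {
	TOY_DEFER_RESET,	/* allocation would succeed: reset deferral */
	TOY_DEFER,		/* whole zone scanned without success: defer */
	TOY_DEFER_NONE,		/* pass ended early: leave state alone */
};

enum toy_defer_action toy_kcompactd_outcome(bool watermark_ok,
					    bool scan_complete)
{
	if (watermark_ok)
		return TOY_DEFER_RESET;

	/*
	 * kcompactd always compacts in sync-light mode, so a complete
	 * scan that still failed can be deferred without testing for
	 * async mode first.
	 */
	if (scan_complete)
		return TOY_DEFER;

	return TOY_DEFER_NONE;
}
```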

> +
> +		VM_BUG_ON(!list_empty(&cc.freepages));
> +		VM_BUG_ON(!list_empty(&cc.migratepages));
> +	}
> +
> +	/*
> +	 * Regardless of success, we are done until woken up next. But remember
> +	 * the requested order/classzone_idx in case it was higher/tighter than
> +	 * our current ones
> +	 */
> +	if (pgdat->kcompactd_max_order <= cc.order)
> +		pgdat->kcompactd_max_order = 0;
> +	if (pgdat->classzone_idx >= cc.classzone_idx)
> +		pgdat->classzone_idx = pgdat->nr_zones - 1;
> +}

Maybe you intended to update kcompactd_classzone_idx here.
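
For illustration, the request bookkeeping with the kcompactd_* fields used
consistently might look like the sketch below. It runs on a toy struct in
userspace; the toy_* names are hypothetical, and only the field names are
taken from the patch:

```c
#include <assert.h>

/* Toy stand-in for the kcompactd-related fields of pg_data_t. */
struct toy_pgdat {
	int nr_zones;
	int kcompactd_max_order;
	int kcompactd_classzone_idx;
};

/* Record a wakeup request: keep the highest order and the lowest classzone. */
void toy_wakeup_request(struct toy_pgdat *pgdat, int order, int classzone_idx)
{
	assert(pgdat && order >= 0);

	if (!order)
		return;
	if (pgdat->kcompactd_max_order < order)
		pgdat->kcompactd_max_order = order;
	if (pgdat->kcompactd_classzone_idx > classzone_idx)
		pgdat->kcompactd_classzone_idx = classzone_idx;
}

/*
 * Work for (done_order, done_classzone_idx) has finished. Reset the
 * request unless a higher order or tighter classzone arrived meanwhile.
 */
void toy_work_done(struct toy_pgdat *pgdat, int done_order,
		   int done_classzone_idx)
{
	if (pgdat->kcompactd_max_order <= done_order)
		pgdat->kcompactd_max_order = 0;
	if (pgdat->kcompactd_classzone_idx >= done_classzone_idx)
		pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
}
```

A request that arrives while kcompactd is busy with a lower order survives
toy_work_done(), so the thread would find work pending again on its next
loop iteration.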

> +
> +void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
> +{
> +	if (!order)
> +		return;
> +
> +	if (pgdat->kcompactd_max_order < order)
> +		pgdat->kcompactd_max_order = order;
> +
> +	if (pgdat->kcompactd_classzone_idx > classzone_idx)
> +		pgdat->kcompactd_classzone_idx = classzone_idx;
> +
> +	if (!waitqueue_active(&pgdat->kcompactd_wait))
> +		return;
> +
> +	if (!kcompactd_node_suitable(pgdat))
> +		return;
> +
> +	trace_mm_compaction_wakeup_kcompactd(pgdat->node_id, order,
> +							classzone_idx);
> +	wake_up_interruptible(&pgdat->kcompactd_wait);
> +}
> +
> +/*
> + * The background compaction daemon, started as a kernel thread
> + * from the init process.
> + */
> +static int kcompactd(void *p)
> +{
> +	pg_data_t *pgdat = (pg_data_t*)p;
> +	struct task_struct *tsk = current;
> +
> +	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> +
> +	if (!cpumask_empty(cpumask))
> +		set_cpus_allowed_ptr(tsk, cpumask);
> +
> +	set_freezable();
> +
> +	pgdat->kcompactd_max_order = 0;
> +	pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
> +
> +	while (!kthread_should_stop()) {
> +		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
> +		wait_event_freezable(pgdat->kcompactd_wait,
> +				kcompactd_work_requested(pgdat));
> +
> +		kcompactd_do_work(pgdat);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * This kcompactd start function will be called by init and node-hot-add.
> + * On node-hot-add, kcompactd will be moved to proper cpus if cpus are hot-added.
> + */
> +int kcompactd_run(int nid)
> +{
> +	pg_data_t *pgdat = NODE_DATA(nid);
> +	int ret = 0;
> +
> +	if (pgdat->kcompactd)
> +		return 0;
> +
> +	pgdat->kcompactd = kthread_run(kcompactd, pgdat, "kcompactd%d", nid);
> +	if (IS_ERR(pgdat->kcompactd)) {
> +		pr_err("Failed to start kcompactd on node %d\n", nid);
> +		ret = PTR_ERR(pgdat->kcompactd);
> +		pgdat->kcompactd = NULL;
> +	}
> +	return ret;
> +}
> +
> +/*
> + * Called by memory hotplug when all memory in a node is offlined. Caller must
> + * hold mem_hotplug_begin/end().
> + */
> +void kcompactd_stop(int nid)
> +{
> +	struct task_struct *kcompactd = NODE_DATA(nid)->kcompactd;
> +
> +	if (kcompactd) {
> +		kthread_stop(kcompactd);
> +		NODE_DATA(nid)->kcompactd = NULL;
> +	}
> +}
> +
> +/*
> + * It's optimal to keep kcompactd on the same CPUs as their memory, but
> + * not required for correctness. So if the last cpu in a node goes
> + * away, we get changed to run anywhere: as the first one comes back,
> + * restore their cpu bindings.
> + */
> +static int cpu_callback(struct notifier_block *nfb, unsigned long action,
> +			void *hcpu)
> +{
> +	int nid;
> +
> +	if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN) {
> +		for_each_node_state(nid, N_MEMORY) {
> +			pg_data_t *pgdat = NODE_DATA(nid);
> +			const struct cpumask *mask;
> +
> +			mask = cpumask_of_node(pgdat->node_id);
> +
> +			if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
> +				/* One of our CPUs online: restore mask */
> +				set_cpus_allowed_ptr(pgdat->kcompactd, mask);
> +		}
> +	}
> +	return NOTIFY_OK;
> +}
> +
> +static int __init kcompactd_init(void)
> +{
> +	int nid;
> +
> +	for_each_node_state(nid, N_MEMORY)
> +		kcompactd_run(nid);
> +	hotcpu_notifier(cpu_callback, 0);
> +	return 0;
> +}
> +
> +module_init(kcompactd_init)
> +
>  #endif /* CONFIG_COMPACTION */
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 46b46a9dcf81..7aa7697fedd1 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -33,6 +33,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/memblock.h>
>  #include <linux/bootmem.h>
> +#include <linux/compaction.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -1132,8 +1133,10 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
>  
>  	init_per_zone_wmark_min();
>  
> -	if (onlined_pages)
> +	if (onlined_pages) {
>  		kswapd_run(zone_to_nid(zone));
> +		kcompactd_run(nid);
> +	}
>  
>  	vm_total_pages = nr_free_pagecache_pages();
>  
> @@ -1907,8 +1910,10 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  		zone_pcp_update(zone);
>  
>  	node_states_clear_node(node, &arg);
> -	if (arg.status_change_nid >= 0)
> +	if (arg.status_change_nid >= 0) {
>  		kswapd_stop(node);
> +		kcompactd_stop(node);
> +	}
>  
>  	vm_total_pages = nr_free_pagecache_pages();
>  	writeback_set_ratelimit();
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4ca4ead6ab05..d8ada4ab70c1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5484,6 +5484,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
>  #endif
>  	init_waitqueue_head(&pgdat->kswapd_wait);
>  	init_waitqueue_head(&pgdat->pfmemalloc_wait);
> +#ifdef CONFIG_COMPACTION
> +	init_waitqueue_head(&pgdat->kcompactd_wait);
> +#endif
>  	pgdat_page_ext_init(pgdat);
>  
>  	for (j = 0; j < MAX_NR_ZONES; j++) {
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 69ce64f7b8d7..c9571294f61c 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -826,6 +826,7 @@ const char * const vmstat_text[] = {
>  	"compact_stall",
>  	"compact_fail",
>  	"compact_success",
> +	"compact_kcompactd_wake",
>  #endif
>  
>  #ifdef CONFIG_HUGETLB_PAGE
> -- 
> 2.7.0
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-02-08 13:38 ` [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd Vlastimil Babka
                     ` (2 preceding siblings ...)
  2016-03-01 14:14   ` Vlastimil Babka
@ 2016-03-02  6:33   ` Joonsoo Kim
  2016-03-02 10:04     ` Vlastimil Babka
  2016-03-02 12:27   ` Vlastimil Babka
  4 siblings, 1 reply; 28+ messages in thread
From: Joonsoo Kim @ 2016-03-02  6:33 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, David Rientjes,
	Michal Hocko, Johannes Weiner

On Mon, Feb 08, 2016 at 02:38:10PM +0100, Vlastimil Babka wrote:
> Similarly to direct reclaim/compaction, kswapd attempts to combine reclaim and
> compaction to attempt making memory allocation of given order available. The
> details differ from direct reclaim e.g. in having high watermark as a goal.
> The code involved in kswapd's reclaim/compaction decisions has evolved to be
> quite complex. Testing reveals that it doesn't actually work in at least one
> scenario, and closer inspection suggests that it could be greatly simplified
> without compromising on the goal (make high-order page available) or efficiency
> (don't reclaim too much). The simplification relies on doing all compaction in
> kcompactd, which is simply woken up when high watermarks are reached by
> kswapd's reclaim.
> 
> The scenario where kswapd compaction doesn't work was found with mmtests test
> stress-highalloc configured to attempt order-9 allocations without direct
> reclaim, just waking up kswapd. There was no compaction attempt from kswapd
> during the whole test. Some added instrumentation shows what happens:
> 
> - balance_pgdat() sets end_zone to Normal, as it's not balanced
> - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but it
>   cannot reclaim anything, so sc.nr_reclaimed is 0
> - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so it
>   merely checks if high watermarks were reached for base pages. This is true,
>   so no reclaim is attempted. For DMA, testorder=0 wasn't used, as
>   compaction_suitable() returned COMPACT_SKIPPED
> - even though the pgdat_needs_compaction flag wasn't set to false, no
>   compaction happens due to the condition sc.nr_reclaimed > nr_attempted
>   being false (as 0 < 99)
> - priority-- due to nr_reclaimed being 0, repeat until priority reaches 0
>   pgdat_balanced() is false as only the small zone DMA appears balanced
>   (curiously in that check, watermark appears OK and compaction_suitable()
>   returns COMPACT_PARTIAL, because a lower classzone_idx is used there)
> 
> Now, even if it was decided that reclaim shouldn't be attempted on the DMA
> zone, the scenario would be the same, as (sc.nr_reclaimed=0 > nr_attempted=0)
> is also false. The condition really should use >= as the comment suggests.
> Then there is a mismatch in the check for setting pgdat_needs_compaction to
> false using low watermark, while the rest uses high watermark, and who knows
> what other subtlety. Hopefully this demonstrates that this is unsustainable.
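
The two cases described above (nr_attempted of 99 and of 0, both with
nr_reclaimed of 0) can be checked with a standalone sketch. The function names
are invented for this example; only the comparison itself comes from the
description, and the real balance_pgdat() does far more than this.

```c
#include <stdbool.h>

/* Pre-patch condition as described: strict comparison. */
static bool kswapd_would_compact(bool pgdat_needs_compaction,
				 unsigned long nr_reclaimed,
				 unsigned long nr_attempted)
{
	return pgdat_needs_compaction && nr_reclaimed > nr_attempted;
}

/* Variant using >=, which the description says the comment suggests. */
static bool kswapd_would_compact_ge(bool pgdat_needs_compaction,
				    unsigned long nr_reclaimed,
				    unsigned long nr_attempted)
{
	return pgdat_needs_compaction && nr_reclaimed >= nr_attempted;
}
```

With the strict comparison neither (0, 99) nor (0, 0) triggers compaction;
with >= only the (0, 0) case does.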
> 
> Luckily we can simplify this a lot. The reclaim/compaction decisions make
> sense for direct reclaim scenario, but in kswapd, our primary goal is to reach
> high watermark in order-0 pages. Afterwards we can attempt compaction just
> once. Unlike direct reclaim, we don't reclaim extra pages (over the high
> watermark), the current code already disallows it for good reasons.
> 
> After this patch, we simply wake up kcompactd to process the pgdat, after we
> have either succeeded or failed to reach the high watermarks in kswapd, which
> goes to sleep. We pass kswapd's order and classzone_idx, so kcompactd can apply
> the same criteria to determine which zones are worth compacting. Note that we
> use the classzone_idx from wakeup_kswapd(), not balanced_classzone_idx which
> can include higher zones that kswapd tried to balance too, but didn't consider
> them in pgdat_balanced().
> 
> Since kswapd now cannot create high-order pages itself, we need to adjust how
> it determines the zones to be balanced. The key element here is adding a
> "highorder" parameter to zone_balanced, which, when set to false, makes it
> consider only order-0 watermark instead of the desired higher order (this was
> done previously by kswapd_shrink_zone(), but not elsewhere).  This false is
> passed for example in pgdat_balanced(). Importantly, wakeup_kswapd() uses true
> to make sure kswapd and thus kcompactd are woken up for a high-order allocation
> failure.
> 
> For testing, I used stress-highalloc configured to do order-9 allocations with
> GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just on kswapd/kcompactd
> reclaim/compaction (the interfering kernel builds in phases 1 and 2 work as
> usual):
> 
> stress-highalloc
>                               4.5-rc1               4.5-rc1
>                                3-test                4-test
> Success 1 Min          1.00 (  0.00%)        3.00 (-200.00%)
> Success 1 Mean         1.40 (  0.00%)        4.00 (-185.71%)
> Success 1 Max          2.00 (  0.00%)        6.00 (-200.00%)
> Success 2 Min          1.00 (  0.00%)        3.00 (-200.00%)
> Success 2 Mean         1.80 (  0.00%)        4.20 (-133.33%)
> Success 2 Max          3.00 (  0.00%)        6.00 (-100.00%)
> Success 3 Min         34.00 (  0.00%)       63.00 (-85.29%)
> Success 3 Mean        41.80 (  0.00%)       64.60 (-54.55%)
> Success 3 Max         53.00 (  0.00%)       67.00 (-26.42%)
> 
>              4.5-rc1     4.5-rc1
>               3-test      4-test
> User         3166.67     3088.82
> System       1153.37     1142.01
> Elapsed      1768.53     1780.91
> 
>                                   4.5-rc1     4.5-rc1
>                                    3-test      4-test
> Minor Faults                    106940795   106582816
> Major Faults                          829         813
> Swap Ins                              482         311
> Swap Outs                            6278        5598
> Allocation stalls                     128         184
> DMA allocs                            145          32
> DMA32 allocs                     74646161    74843238
> Normal allocs                    26090955    25886668
> Movable allocs                          0           0
> Direct pages scanned                32938       31429
> Kswapd pages scanned              2183166     2185293
> Kswapd pages reclaimed            2152359     2134389
> Direct pages reclaimed              32735       31234
> Kswapd efficiency                     98%         97%
> Kswapd velocity                  1243.877    1228.666
> Direct efficiency                     99%         99%
> Direct velocity                    18.767      17.671
> Percentage direct scans                1%          1%
> Zone normal velocity              299.981     291.409
> Zone dma32 velocity               962.522     954.928
> Zone dma velocity                   0.142       0.000
> Page writes by reclaim           6278.800    5598.600
> Page writes file                        0           0
> Page writes anon                     6278        5598
> Page reclaim immediate                 93          96
> Sector Reads                      4357114     4307161
> Sector Writes                    11053628    11053091
> Page rescued immediate                  0           0
> Slabs scanned                     1592829     1555770
> Direct inode steals                  1557        2025
> Kswapd inode steals                 46056       45418
> Kswapd skipped wait                     0           0
> THP fault alloc                       579         614
> THP collapse alloc                    304         324
> THP splits                              0           0
> THP fault fallback                    793         730
> THP collapse fail                      11          14
> Compaction stalls                    1013         959
> Compaction success                     92          69
> Compaction failures                   920         890
> Page migrate success               238457      662054
> Page migrate failure                23021       32846
> Compaction pages isolated          504695     1370326
> Compaction migrate scanned         661390     7025772
> Compaction free scanned          13476658    73302642
> Compaction cost                       262         762
> 
> After this patch we see improvements in allocation success rate (especially for
> phase 3) along with increased compaction activity. The compaction stalls
> (direct compaction) in the interfering kernel builds (probably THP's) also
> decreased somewhat due to kcompactd activity, yet THP alloc successes improved a
> bit.

Why did you run the test with THP enabled? THP interferes with the result of
the main test, so it would be better not to enable it.

Also, the increased compaction activity with this patch (about 10 times for
migrate scanned) may be due to resetting the skip block information. Wouldn't
it be better to disable that in this patch, so that it behaves as similarly as
possible to what kswapd does, and re-enable it in the next patch? If something
goes wrong, it can simply be reverted.

It looks like this isn't even mentioned in the description.
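
For reference, the ratios behind the "10 times" figure can be recomputed from
the quoted table (3-test vs. 4-test columns). The helper name below is chosen
just for this illustration.

```c
/* Ratio of a counter after the patch (4-test) to before it (3-test). */
static double activity_ratio(unsigned long before, unsigned long after)
{
	return (double)after / (double)before;
}
```

activity_ratio(661390, 7025772) gives roughly 10.6 for "Compaction migrate
scanned", activity_ratio(13476658, 73302642) roughly 5.4 for "Compaction free
scanned", and activity_ratio(238457, 662054) roughly 2.8 for "Page migrate
success".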

> 
> We can also configure stress-highalloc to perform both direct
> reclaim/compaction and wakeup kswapd/kcompactd, by using
> GFP_KERNEL|__GFP_HIGH|__GFP_COMP:
> 
> stress-highalloc
>                               4.5-rc1               4.5-rc1
>                               3-test2               4-test2
> Success 1 Min          4.00 (  0.00%)        6.00 (-50.00%)
> Success 1 Mean         8.00 (  0.00%)        8.40 ( -5.00%)
> Success 1 Max         12.00 (  0.00%)       13.00 ( -8.33%)
> Success 2 Min          4.00 (  0.00%)        6.00 (-50.00%)
> Success 2 Mean         8.20 (  0.00%)        8.60 ( -4.88%)
> Success 2 Max         13.00 (  0.00%)       12.00 (  7.69%)
> Success 3 Min         75.00 (  0.00%)       75.00 (  0.00%)
> Success 3 Mean        75.60 (  0.00%)       75.60 (  0.00%)
> Success 3 Max         77.00 (  0.00%)       76.00 (  1.30%)
> 
>              4.5-rc1     4.5-rc1
>              3-test2     4-test2
> User         3344.73     3258.62
> System       1194.24     1177.92
> Elapsed      1838.04     1837.02
> 
>                                   4.5-rc1     4.5-rc1
>                                   3-test2     4-test2
> Minor Faults                    111269736   109392253
> Major Faults                          806         755
> Swap Ins                              671         155
> Swap Outs                            5390        5790
> Allocation stalls                    4610        4562
> DMA allocs                            250          34
> DMA32 allocs                     78091501    76901680
> Normal allocs                    27004414    26587089
> Movable allocs                          0           0
> Direct pages scanned               125146      108854
> Kswapd pages scanned              2119757     2131589
> Kswapd pages reclaimed            2073183     2090937
> Direct pages reclaimed             124909      108699
> Kswapd efficiency                     97%         98%
> Kswapd velocity                  1161.027    1160.870
> Direct efficiency                     99%         99%
> Direct velocity                    68.545      59.283
> Percentage direct scans                5%          4%
> Zone normal velocity              296.678     294.389
> Zone dma32 velocity               932.841     925.764
> Zone dma velocity                   0.053       0.000
> Page writes by reclaim           5392.000    5790.600
> Page writes file                        1           0
> Page writes anon                     5390        5790
> Page reclaim immediate                104         218
> Sector Reads                      4350232     4376989
> Sector Writes                    11126496    11102113
> Page rescued immediate                  0           0
> Slabs scanned                     1705294     1692486
> Direct inode steals                  8700       16266
> Kswapd inode steals                 36352       28364
> Kswapd skipped wait                     0           0
> THP fault alloc                       599         567
> THP collapse alloc                    323         326
> THP splits                              0           0
> THP fault fallback                    806         805
> THP collapse fail                      17          18
> Compaction stalls                    2457        2070
> Compaction success                    906         527
> Compaction failures                  1551        1543
> Page migrate success              2031423     2423657
> Page migrate failure                32845       28790
> Compaction pages isolated         4129761     4916017
> Compaction migrate scanned       11996712    19370264
> Compaction free scanned         214970969   360662356
> Compaction cost                      2271        2745
> 
> Here, this patch doesn't change the success rate as direct compaction already
> tries what it can. There's however significant reduction in direct compaction
> stalls, made entirely of the successful stalls. This means the offload to
> kcompactd is working as expected, and direct compaction is reduced either due
> to detecting contention, or compaction deferred by kcompactd. In the previous
> version of this patchset there was some apparent reduction of success rate,
> but the changes in this version (such as using sync compaction only), new
> baseline kernel, and/or averaging results from 5 executions (my bet), made this
> go away.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/vmscan.c | 146 ++++++++++++++++++++----------------------------------------
>  1 file changed, 48 insertions(+), 98 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c67df4831565..b8478a737ef5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2951,18 +2951,23 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
>  	} while (memcg);
>  }
>  
> -static bool zone_balanced(struct zone *zone, int order,
> -			  unsigned long balance_gap, int classzone_idx)
> +static bool zone_balanced(struct zone *zone, int order, bool highorder,
> +			unsigned long balance_gap, int classzone_idx)
>  {
> -	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
> -				    balance_gap, classzone_idx))
> -		return false;
> +	unsigned long mark = high_wmark_pages(zone) + balance_gap;
>  
> -	if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
> -				order, 0, classzone_idx) == COMPACT_SKIPPED)
> -		return false;
> +	/*
> +	 * When checking from pgdat_balanced(), kswapd should stop and sleep
> +	 * when it reaches the high order-0 watermark and let kcompactd take
> +	 * over. Other callers such as wakeup_kswapd() want to determine the
> +	 * true high-order watermark.
> +	 */
> +	if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {
> +		mark += (1UL << order);
> +		order = 0;
> +	}
>  
> -	return true;
> +	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
>  }
>  
>  /*
> @@ -3012,7 +3017,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
>  			continue;
>  		}
>  
> -		if (zone_balanced(zone, order, 0, i))
> +		if (zone_balanced(zone, order, false, 0, i))
>  			balanced_pages += zone->managed_pages;
>  		else if (!order)
>  			return false;
> @@ -3066,8 +3071,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>   */
>  static bool kswapd_shrink_zone(struct zone *zone,
>  			       int classzone_idx,
> -			       struct scan_control *sc,
> -			       unsigned long *nr_attempted)
> +			       struct scan_control *sc)
>  {
>  	int testorder = sc->order;

You can remove testorder completely.

>  	unsigned long balance_gap;
> @@ -3077,17 +3081,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
>  	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
>  
>  	/*
> -	 * Kswapd reclaims only single pages with compaction enabled. Trying
> -	 * too hard to reclaim until contiguous free pages have become
> -	 * available can hurt performance by evicting too much useful data
> -	 * from memory. Do not reclaim more than needed for compaction.
> -	 */
> -	if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
> -			compaction_suitable(zone, sc->order, 0, classzone_idx)
> -							!= COMPACT_SKIPPED)
> -		testorder = 0;
> -
> -	/*
>  	 * We put equal pressure on every zone, unless one zone has way too
>  	 * many pages free already. The "too many pages" is defined as the
>  	 * high wmark plus a "gap" where the gap is either the low
> @@ -3101,15 +3094,12 @@ static bool kswapd_shrink_zone(struct zone *zone,
>  	 * reclaim is necessary
>  	 */
>  	lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
> -	if (!lowmem_pressure && zone_balanced(zone, testorder,
> +	if (!lowmem_pressure && zone_balanced(zone, testorder, false,
>  						balance_gap, classzone_idx))
>  		return true;
>  
>  	shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
>  
> -	/* Account for the number of pages attempted to reclaim */
> -	*nr_attempted += sc->nr_to_reclaim;
> -
>  	clear_bit(ZONE_WRITEBACK, &zone->flags);
>  
>  	/*
> @@ -3119,7 +3109,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
>  	 * waits.
>  	 */
>  	if (zone_reclaimable(zone) &&
> -	    zone_balanced(zone, testorder, 0, classzone_idx)) {
> +	    zone_balanced(zone, testorder, false, 0, classzone_idx)) {
>  		clear_bit(ZONE_CONGESTED, &zone->flags);
>  		clear_bit(ZONE_DIRTY, &zone->flags);
>  	}
> @@ -3131,7 +3121,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
>   * For kswapd, balance_pgdat() will work across all this node's zones until
>   * they are all at high_wmark_pages(zone).
>   *
> - * Returns the final order kswapd was reclaiming at
> + * Returns the highest zone idx kswapd was reclaiming at
>   *
>   * There is special handling here for zones which are full of pinned pages.
>   * This can happen if the pages are all mlocked, or if they are all used by
> @@ -3148,8 +3138,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
>   * interoperates with the page allocator fallback scheme to ensure that aging
>   * of pages is balanced across the zones.
>   */
> -static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> -							int *classzone_idx)
> +static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  {
>  	int i;
>  	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
> @@ -3166,9 +3155,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  	count_vm_event(PAGEOUTRUN);
>  
>  	do {
> -		unsigned long nr_attempted = 0;
>  		bool raise_priority = true;
> -		bool pgdat_needs_compaction = (order > 0);
>  
>  		sc.nr_reclaimed = 0;
>  
> @@ -3203,7 +3190,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  				break;
>  			}
>  
> -			if (!zone_balanced(zone, order, 0, 0)) {
> +			if (!zone_balanced(zone, order, true, 0, 0)) {

Should we use highorder = true here? We end up skipping reclaim in
kswapd_shrink_zone() anyway when zone_balanced(,,false,,) is true.
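
A toy model of the concern, assuming CONFIG_COMPACTION is enabled. The struct,
its fields and both function names are hypothetical; the real
zone_watermark_ok_safe() also considers lowmem reserves and the per-order free
lists, which are collapsed here into a single flag.

```c
#include <stdbool.h>

/* Hypothetical zone description: just enough state for the example. */
struct zone_model {
	unsigned long free_pages;
	unsigned long high_wmark;
	bool has_free_block_of_order;	/* stands in for the free-list check */
};

/* Very rough stand-in for zone_watermark_ok_safe(). */
static bool watermark_ok_model(const struct zone_model *z, int order,
			       unsigned long mark)
{
	if (z->free_pages <= mark)
		return false;
	return order == 0 || z->has_free_block_of_order;
}

/* Same shape as the patched zone_balanced(), minus classzone_idx. */
static bool zone_balanced_model(const struct zone_model *z, int order,
				bool highorder, unsigned long balance_gap)
{
	unsigned long mark = z->high_wmark + balance_gap;

	if (!highorder) {
		/* Only require enough order-0 pages to cover the request. */
		mark += 1UL << order;
		order = 0;
	}
	return watermark_ok_model(z, order, mark);
}
```

For a fragmented zone with plenty of free order-0 pages, the highorder = true
check reports "not balanced" (so the zone is picked as end_zone), while the
highorder = false check in kswapd_shrink_zone() reports "balanced" and no
reclaim happens.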

Thanks.

>  				end_zone = i;
>  				break;
>  			} else {
> @@ -3219,24 +3206,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  		if (i < 0)
>  			goto out;
>  
> -		for (i = 0; i <= end_zone; i++) {
> -			struct zone *zone = pgdat->node_zones + i;
> -
> -			if (!populated_zone(zone))
> -				continue;
> -
> -			/*
> -			 * If any zone is currently balanced then kswapd will
> -			 * not call compaction as it is expected that the
> -			 * necessary pages are already available.
> -			 */
> -			if (pgdat_needs_compaction &&
> -					zone_watermark_ok(zone, order,
> -						low_wmark_pages(zone),
> -						*classzone_idx, 0))
> -				pgdat_needs_compaction = false;
> -		}
> -
>  		/*
>  		 * If we're getting trouble reclaiming, start doing writepage
>  		 * even in laptop mode.
> @@ -3280,8 +3249,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  			 * that that high watermark would be met at 100%
>  			 * efficiency.
>  			 */
> -			if (kswapd_shrink_zone(zone, end_zone,
> -					       &sc, &nr_attempted))
> +			if (kswapd_shrink_zone(zone, end_zone, &sc))
>  				raise_priority = false;
>  		}
>  
> @@ -3294,49 +3262,29 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  				pfmemalloc_watermark_ok(pgdat))
>  			wake_up_all(&pgdat->pfmemalloc_wait);
>  
> -		/*
> -		 * Fragmentation may mean that the system cannot be rebalanced
> -		 * for high-order allocations in all zones. If twice the
> -		 * allocation size has been reclaimed and the zones are still
> -		 * not balanced then recheck the watermarks at order-0 to
> -		 * prevent kswapd reclaiming excessively. Assume that a
> -		 * process requested a high-order can direct reclaim/compact.
> -		 */
> -		if (order && sc.nr_reclaimed >= 2UL << order)
> -			order = sc.order = 0;
> -
>  		/* Check if kswapd should be suspending */
>  		if (try_to_freeze() || kthread_should_stop())
>  			break;
>  
>  		/*
> -		 * Compact if necessary and kswapd is reclaiming at least the
> -		 * high watermark number of pages as requsted
> -		 */
> -		if (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted)
> -			compact_pgdat(pgdat, order);
> -
> -		/*
>  		 * Raise priority if scanning rate is too low or there was no
>  		 * progress in reclaiming pages
>  		 */
>  		if (raise_priority || !sc.nr_reclaimed)
>  			sc.priority--;
>  	} while (sc.priority >= 1 &&
> -		 !pgdat_balanced(pgdat, order, *classzone_idx));
> +			!pgdat_balanced(pgdat, order, classzone_idx));
>  
>  out:
>  	/*
> -	 * Return the order we were reclaiming at so prepare_kswapd_sleep()
> -	 * makes a decision on the order we were last reclaiming at. However,
> -	 * if another caller entered the allocator slow path while kswapd
> -	 * was awake, order will remain at the higher level
> +	 * Return the highest zone idx we were reclaiming at so
> +	 * prepare_kswapd_sleep() makes the same decisions as here.
>  	 */
> -	*classzone_idx = end_zone;
> -	return order;
> +	return end_zone;
>  }
>  
> -static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> +static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
> +				int classzone_idx, int balanced_classzone_idx)
>  {
>  	long remaining = 0;
>  	DEFINE_WAIT(wait);
> @@ -3347,7 +3295,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>  
>  	/* Try to sleep for a short interval */
> -	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
> +	if (prepare_kswapd_sleep(pgdat, order, remaining,
> +						balanced_classzone_idx)) {
>  		remaining = schedule_timeout(HZ/10);
>  		finish_wait(&pgdat->kswapd_wait, &wait);
>  		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> @@ -3357,7 +3306,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  	 * After a short sleep, check if it was a premature sleep. If not, then
>  	 * go fully to sleep until explicitly woken up.
>  	 */
> -	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
> +	if (prepare_kswapd_sleep(pgdat, order, remaining,
> +						balanced_classzone_idx)) {
>  		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
>  
>  		/*
> @@ -3378,6 +3328,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  		 */
>  		reset_isolation_suitable(pgdat);
>  
> +		/*
> +		 * We have freed the memory, now we should compact it to make
> +		 * allocation of the requested order possible.
> +		 */
> +		wakeup_kcompactd(pgdat, order, classzone_idx);
> +
>  		if (!kthread_should_stop())
>  			schedule();
>  
> @@ -3407,7 +3363,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  static int kswapd(void *p)
>  {
>  	unsigned long order, new_order;
> -	unsigned balanced_order;
>  	int classzone_idx, new_classzone_idx;
>  	int balanced_classzone_idx;
>  	pg_data_t *pgdat = (pg_data_t*)p;
> @@ -3440,23 +3395,19 @@ static int kswapd(void *p)
>  	set_freezable();
>  
>  	order = new_order = 0;
> -	balanced_order = 0;
>  	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
>  	balanced_classzone_idx = classzone_idx;
>  	for ( ; ; ) {
>  		bool ret;
>  
>  		/*
> -		 * If the last balance_pgdat was unsuccessful it's unlikely a
> -		 * new request of a similar or harder type will succeed soon
> -		 * so consider going to sleep on the basis we reclaimed at
> +		 * While we were reclaiming, there might have been another
> +		 * wakeup, so check the values.
>  		 */
> -		if (balanced_order == new_order) {
> -			new_order = pgdat->kswapd_max_order;
> -			new_classzone_idx = pgdat->classzone_idx;
> -			pgdat->kswapd_max_order =  0;
> -			pgdat->classzone_idx = pgdat->nr_zones - 1;
> -		}
> +		new_order = pgdat->kswapd_max_order;
> +		new_classzone_idx = pgdat->classzone_idx;
> +		pgdat->kswapd_max_order =  0;
> +		pgdat->classzone_idx = pgdat->nr_zones - 1;
>  
>  		if (order < new_order || classzone_idx > new_classzone_idx) {
>  			/*
> @@ -3466,7 +3417,7 @@ static int kswapd(void *p)
>  			order = new_order;
>  			classzone_idx = new_classzone_idx;
>  		} else {
> -			kswapd_try_to_sleep(pgdat, balanced_order,
> +			kswapd_try_to_sleep(pgdat, order, classzone_idx,
>  						balanced_classzone_idx);
>  			order = pgdat->kswapd_max_order;
>  			classzone_idx = pgdat->classzone_idx;
> @@ -3486,9 +3437,8 @@ static int kswapd(void *p)
>  		 */
>  		if (!ret) {
>  			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> -			balanced_classzone_idx = classzone_idx;
> -			balanced_order = balance_pgdat(pgdat, order,
> -						&balanced_classzone_idx);
> +			balanced_classzone_idx = balance_pgdat(pgdat, order,
> +								classzone_idx);
>  		}
>  	}
>  
> @@ -3518,7 +3468,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>  	}
>  	if (!waitqueue_active(&pgdat->kswapd_wait))
>  		return;
> -	if (zone_balanced(zone, order, 0, 0))
> +	if (zone_balanced(zone, order, true, 0, 0))
>  		return;
>  
>  	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
> -- 
> 2.7.0
> 


* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-03-02  6:33   ` Joonsoo Kim
@ 2016-03-02 10:04     ` Vlastimil Babka
  2016-03-02 13:57       ` Joonsoo Kim
  0 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2016-03-02 10:04 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: linux-mm, Andrew Morton, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, David Rientjes,
	Michal Hocko, Johannes Weiner

On 03/02/2016 07:33 AM, Joonsoo Kim wrote:
>>
>>                                    4.5-rc1     4.5-rc1
>>                                     3-test      4-test
>> Minor Faults                    106940795   106582816
>> Major Faults                          829         813
>> Swap Ins                              482         311
>> Swap Outs                            6278        5598
>> Allocation stalls                     128         184
>> DMA allocs                            145          32
>> DMA32 allocs                     74646161    74843238
>> Normal allocs                    26090955    25886668
>> Movable allocs                          0           0
>> Direct pages scanned                32938       31429
>> Kswapd pages scanned              2183166     2185293
>> Kswapd pages reclaimed            2152359     2134389
>> Direct pages reclaimed              32735       31234
>> Kswapd efficiency                     98%         97%
>> Kswapd velocity                  1243.877    1228.666
>> Direct efficiency                     99%         99%
>> Direct velocity                    18.767      17.671
>> Percentage direct scans                1%          1%
>> Zone normal velocity              299.981     291.409
>> Zone dma32 velocity               962.522     954.928
>> Zone dma velocity                   0.142       0.000
>> Page writes by reclaim           6278.800    5598.600
>> Page writes file                        0           0
>> Page writes anon                     6278        5598
>> Page reclaim immediate                 93          96
>> Sector Reads                      4357114     4307161
>> Sector Writes                    11053628    11053091
>> Page rescued immediate                  0           0
>> Slabs scanned                     1592829     1555770
>> Direct inode steals                  1557        2025
>> Kswapd inode steals                 46056       45418
>> Kswapd skipped wait                     0           0
>> THP fault alloc                       579         614
>> THP collapse alloc                    304         324
>> THP splits                              0           0
>> THP fault fallback                    793         730
>> THP collapse fail                      11          14
>> Compaction stalls                    1013         959
>> Compaction success                     92          69
>> Compaction failures                   920         890
>> Page migrate success               238457      662054
>> Page migrate failure                23021       32846
>> Compaction pages isolated          504695     1370326
>> Compaction migrate scanned         661390     7025772
>> Compaction free scanned          13476658    73302642
>> Compaction cost                       262         762
>>
>> After this patch we see improvements in allocation success rate (especially for
>> phase 3) along with increased compaction activity. The compaction stalls
>> (direct compaction) in the interfering kernel builds (probably THP's) also
>> decreased somewhat to kcompactd activity, yet THP alloc successes improved a
>> bit.
>
> Why did you run the test with THP? THP interferes with the result of the
> main test, so it would be better not to enable it.

Hmm I've always left it enabled. It makes for a more realistic 
interference and would also show unintended regressions in that closely 
related area.

> And, this patch increased compaction activity (10 times for migrate scanned),
> maybe due to resetting the skip block information.

Note that kswapd compaction activity was completely non-existent for 
reasons outlined in the changelog.

> Isn't it better to disable it for this patch, so that it works as similarly
> as possible to what kswapd does, and re-enable it in the next patch? If
> something goes bad, it can simply be reverted.
>
> It looks like it is not even mentioned in the description.

Yeah skip block information is discussed in the next patch, which 
mentions that it's being reset and why. I think it makes more sense, as 
when kswapd reclaims from low watermark to high, potentially many 
pageblocks have new free pages and the skip bits are obsolete. Next, 
kcompactd is a separate thread, so it doesn't stall allocations (or kswapd 
reclaim) by its activity.
Personally I hope that one day we can get rid of the skip bits 
completely. They can make the stats look apparently nicer, but I think 
their effect is nearly random.
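
To illustrate why skip bits can go stale after reclaim, here is a minimal
userspace sketch. It is not kernel code: struct toy_pageblock and the toy_*
helpers are hypothetical names made up for this illustration, and the model
assumes a scanner that caches a "skip" hint on any block where it found
nothing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pageblock; not a kernel structure. */
struct toy_pageblock {
	unsigned int free_pages;	/* free pages currently in the block */
	bool skip;			/* cached "nothing to isolate here" hint */
};

/*
 * Toy free scanner: counts the free pages it can see and marks blocks
 * where it found nothing, so that later scans skip them.
 */
static unsigned int toy_scan_free(struct toy_pageblock *pb, size_t n)
{
	unsigned int found = 0;

	for (size_t i = 0; i < n; i++) {
		if (pb[i].skip)
			continue;	/* trust the cached hint */
		if (pb[i].free_pages == 0)
			pb[i].skip = true;
		found += pb[i].free_pages;
	}
	return found;
}

/* Drop all cached hints, as a flush after reclaim would. */
static void toy_reset_skip(struct toy_pageblock *pb, size_t n)
{
	for (size_t i = 0; i < n; i++)
		pb[i].skip = false;
}
```

In this model, pages freed by reclaim in a block that was already marked stay
invisible to the scanner until the hints are reset.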

>> @@ -3066,8 +3071,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
>>    */
>>   static bool kswapd_shrink_zone(struct zone *zone,
>>   			       int classzone_idx,
>> -			       struct scan_control *sc,
>> -			       unsigned long *nr_attempted)
>> +			       struct scan_control *sc)
>>   {
>>   	int testorder = sc->order;
>
> You can remove testorder completely.

Hm right, thanks.

>> -static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>> -							int *classzone_idx)
>> +static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>>   {
>>   	int i;
>>   	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
>> @@ -3166,9 +3155,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>>   	count_vm_event(PAGEOUTRUN);
>>
>>   	do {
>> -		unsigned long nr_attempted = 0;
>>   		bool raise_priority = true;
>> -		bool pgdat_needs_compaction = (order > 0);
>>
>>   		sc.nr_reclaimed = 0;
>>
>> @@ -3203,7 +3190,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>>   				break;
>>   			}
>>
>> -			if (!zone_balanced(zone, order, 0, 0)) {
>> +			if (!zone_balanced(zone, order, true, 0, 0)) {
>
> Should we use highorder = true? We eventually skip reclaim in
> kswapd_shrink_zone() when zone_balanced(,,false,,) is true.

Hmm right. I probably thought that the value of end_zone -> 
balanced_classzone_idx would be important when waking kcompactd, but 
it's not used, so it's causing just some wasted CPU cycles.

Thanks for the reviews!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/5] mm, compaction: introduce kcompactd
  2016-03-02  6:09   ` Joonsoo Kim
@ 2016-03-02 12:25     ` Vlastimil Babka
  0 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2016-03-02 12:25 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Kirill A. Shutemov,
	Rik van Riel, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner

>> +		if (zone_watermark_ok(zone, cc.order, low_wmark_pages(zone),
>> +						cc.classzone_idx, 0)) {
>> +			success = true;
>> +			compaction_defer_reset(zone, cc.order, false);
>> +		} else if (cc.mode != MIGRATE_ASYNC &&
>> +						status == COMPACT_COMPLETE) {
>> +			defer_compaction(zone, cc.order);
>> +		}
> 
> We already set mode to MIGRATE_SYNC_LIGHT so this cc.mode check looks weird.
> It would be better to change it and add some comment that we can
> safely call defer_compaction() here.

Right.
 
>> +
>> +		VM_BUG_ON(!list_empty(&cc.freepages));
>> +		VM_BUG_ON(!list_empty(&cc.migratepages));
>> +	}
>> +
>> +	/*
>> +	 * Regardless of success, we are done until woken up next. But remember
>> +	 * the requested order/classzone_idx in case it was higher/tighter than
>> +	 * our current ones
>> +	 */
>> +	if (pgdat->kcompactd_max_order <= cc.order)
>> +		pgdat->kcompactd_max_order = 0;
>> +	if (pgdat->classzone_idx >= cc.classzone_idx)
>> +		pgdat->classzone_idx = pgdat->nr_zones - 1;
>> +}
> 
> Maybe you intended to update kcompactd_classzone_idx.

Oops, true. Thanks for the review!

Here's a fixlet.

----8<----
From: Vlastimil Babka <vbabka@suse.cz>
Date: Wed, 2 Mar 2016 10:15:22 +0100
Subject: mm-compaction-introduce-kcompactd-fix-3

Remove extraneous check for sync compaction before deferring.
Correctly adjust kcompactd's classzone_idx instead of kswapd's.

Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/compaction.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index c03715ba65c7..9a605c3d4177 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1811,8 +1811,11 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 						cc.classzone_idx, 0)) {
 			success = true;
 			compaction_defer_reset(zone, cc.order, false);
-		} else if (cc.mode != MIGRATE_ASYNC &&
-						status == COMPACT_COMPLETE) {
+		} else if (status == COMPACT_COMPLETE) {
+			/*
+			 * We use sync migration mode here, so we defer like
+			 * sync direct compaction does.
+			 */
 			defer_compaction(zone, cc.order);
 		}
 
@@ -1827,8 +1830,8 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 	 */
 	if (pgdat->kcompactd_max_order <= cc.order)
 		pgdat->kcompactd_max_order = 0;
-	if (pgdat->classzone_idx >= cc.classzone_idx)
-		pgdat->classzone_idx = pgdat->nr_zones - 1;
+	if (pgdat->kcompactd_classzone_idx >= cc.classzone_idx)
+		pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
 }
 
 void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
-- 
2.7.2
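
The second hunk of the fixlet can be sketched in userspace as follows. This is
only a toy model: struct toy_pgdat and toy_kcompactd_finish() are hypothetical
names, and the struct carries just the four pgdat fields named in the thread.
It shows that the bookkeeping at the end of the kcompactd work should touch
kcompactd's own fields and leave kswapd's classzone_idx alone.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the pg_data_t fields discussed above. */
struct toy_pgdat {
	int nr_zones;
	int kswapd_max_order;		/* kswapd's pending request */
	int classzone_idx;		/* kswapd's pending request */
	int kcompactd_max_order;	/* kcompactd's pending request */
	int kcompactd_classzone_idx;	/* kcompactd's pending request */
};

/*
 * Modelled on the tail of kcompactd_do_work() after the fixlet: clear the
 * pending request, unless a higher order or a tighter (lower) classzone_idx
 * was requested while the work was running.
 */
static void toy_kcompactd_finish(struct toy_pgdat *pgdat, int done_order,
				 int done_classzone_idx)
{
	if (pgdat->kcompactd_max_order <= done_order)
		pgdat->kcompactd_max_order = 0;
	if (pgdat->kcompactd_classzone_idx >= done_classzone_idx)
		pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
}
```

With the pre-fix code, the second check would instead have overwritten
classzone_idx, which belongs to kswapd.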

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-02-08 13:38 ` [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd Vlastimil Babka
                     ` (3 preceding siblings ...)
  2016-03-02  6:33   ` Joonsoo Kim
@ 2016-03-02 12:27   ` Vlastimil Babka
  4 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2016-03-02 12:27 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: linux-kernel, Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel,
	Joonsoo Kim, Mel Gorman, David Rientjes, Michal Hocko,
	Johannes Weiner

A fixlet for things Joonsoo caught.

----8<----
From: Vlastimil Babka <vbabka@suse.cz>
Date: Wed, 2 Mar 2016 11:50:16 +0100
Subject: mm-kswapd-replace-kswapd-compaction-with-waking-up-kcompactd-fix

Change zone_balanced() check in balance_pgdat() to consider only base pages
as that's the same as what kswapd_shrink_zone() later does. So a zone that would
be selected due to lack of high-order pages would not be reclaimed anyway,
and would just cause extra CPU churn.

Also remove unnecessary variable testorder in kswapd_shrink_zone().

Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b8478a737ef5..23bc7e643ad8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3073,7 +3073,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
 			       int classzone_idx,
 			       struct scan_control *sc)
 {
-	int testorder = sc->order;
 	unsigned long balance_gap;
 	bool lowmem_pressure;
 
@@ -3094,7 +3093,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * reclaim is necessary
 	 */
 	lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
-	if (!lowmem_pressure && zone_balanced(zone, testorder, false,
+	if (!lowmem_pressure && zone_balanced(zone, sc->order, false,
 						balance_gap, classzone_idx))
 		return true;
 
@@ -3109,7 +3108,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * waits.
 	 */
 	if (zone_reclaimable(zone) &&
-	    zone_balanced(zone, testorder, false, 0, classzone_idx)) {
+	    zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
 		clear_bit(ZONE_CONGESTED, &zone->flags);
 		clear_bit(ZONE_DIRTY, &zone->flags);
 	}
@@ -3190,7 +3189,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				break;
 			}
 
-			if (!zone_balanced(zone, order, true, 0, 0)) {
+			if (!zone_balanced(zone, order, false, 0, 0)) {
 				end_zone = i;
 				break;
 			} else {
-- 
2.7.2
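
The mismatch this fixlet removes can be shown with a small userspace model.
Everything here is an assumption made for illustration: struct toy_zone, its
fields and the toy_* helpers are hypothetical, and the real watermark checks
are considerably more involved. The only property modelled is that
highorder == false reduces the check to base pages.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified zone state. */
struct toy_zone {
	unsigned long free_pages;	/* total free base pages */
	unsigned long high_wmark;	/* high watermark in base pages */
	int largest_free_order;		/* largest order with a free block */
};

/* With highorder == false only the order-0 watermark is checked. */
static bool toy_zone_balanced(const struct toy_zone *z, int order,
			      bool highorder)
{
	if (!highorder)
		order = 0;
	return z->free_pages >= z->high_wmark &&
	       z->largest_free_order >= order;
}

/*
 * Scan from the highest zone down and return the first unbalanced one,
 * in the way balance_pgdat() picks end_zone. Returns -1 if all zones are
 * balanced (a simplification of this sketch).
 */
static int toy_pick_end_zone(const struct toy_zone *zones, int nr_zones,
			     int order, bool highorder)
{
	for (int i = nr_zones - 1; i >= 0; i--)
		if (!toy_zone_balanced(&zones[i], order, highorder))
			return i;
	return -1;
}
```

A zone with plenty of free base pages but no free high-order block is picked
when highorder is true, although a later check with highorder == false would
consider it balanced and skip reclaim there.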

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-03-02 10:04     ` Vlastimil Babka
@ 2016-03-02 13:57       ` Joonsoo Kim
  2016-03-02 14:09         ` Vlastimil Babka
  0 siblings, 1 reply; 28+ messages in thread
From: Joonsoo Kim @ 2016-03-02 13:57 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joonsoo Kim, Linux Memory Management List, Andrew Morton, LKML,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	David Rientjes, Michal Hocko, Johannes Weiner

2016-03-02 19:04 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
> On 03/02/2016 07:33 AM, Joonsoo Kim wrote:
>>>
>>>
>>>                                    4.5-rc1     4.5-rc1
>>>                                     3-test      4-test
>>> Minor Faults                    106940795   106582816
>>> Major Faults                          829         813
>>> Swap Ins                              482         311
>>> Swap Outs                            6278        5598
>>> Allocation stalls                     128         184
>>> DMA allocs                            145          32
>>> DMA32 allocs                     74646161    74843238
>>> Normal allocs                    26090955    25886668
>>> Movable allocs                          0           0
>>> Direct pages scanned                32938       31429
>>> Kswapd pages scanned              2183166     2185293
>>> Kswapd pages reclaimed            2152359     2134389
>>> Direct pages reclaimed              32735       31234
>>> Kswapd efficiency                     98%         97%
>>> Kswapd velocity                  1243.877    1228.666
>>> Direct efficiency                     99%         99%
>>> Direct velocity                    18.767      17.671
>>> Percentage direct scans                1%          1%
>>> Zone normal velocity              299.981     291.409
>>> Zone dma32 velocity               962.522     954.928
>>> Zone dma velocity                   0.142       0.000
>>> Page writes by reclaim           6278.800    5598.600
>>> Page writes file                        0           0
>>> Page writes anon                     6278        5598
>>> Page reclaim immediate                 93          96
>>> Sector Reads                      4357114     4307161
>>> Sector Writes                    11053628    11053091
>>> Page rescued immediate                  0           0
>>> Slabs scanned                     1592829     1555770
>>> Direct inode steals                  1557        2025
>>> Kswapd inode steals                 46056       45418
>>> Kswapd skipped wait                     0           0
>>> THP fault alloc                       579         614
>>> THP collapse alloc                    304         324
>>> THP splits                              0           0
>>> THP fault fallback                    793         730
>>> THP collapse fail                      11          14
>>> Compaction stalls                    1013         959
>>> Compaction success                     92          69
>>> Compaction failures                   920         890
>>> Page migrate success               238457      662054
>>> Page migrate failure                23021       32846
>>> Compaction pages isolated          504695     1370326
>>> Compaction migrate scanned         661390     7025772
>>> Compaction free scanned          13476658    73302642
>>> Compaction cost                       262         762
>>>
>>> After this patch we see improvements in allocation success rate
>>> (especially for
>>> phase 3) along with increased compaction activity. The compaction stalls
>>> (direct compaction) in the interfering kernel builds (probably THP's)
>>> also
>>> decreased somewhat to kcompactd activity, yet THP alloc successes
>>> improved a
>>> bit.
>>
>>
>> Why you did the test with THP? THP interferes result of main test so
>> it would be better not to enable it.
>
>
> Hmm I've always left it enabled. It makes for a more realistic interference
> and would also show unintended regressions in that closely related area.

But it makes review hard, because complex analysis is needed to
understand the result.

The following is an example.

"The compaction stalls
(direct compaction) in the interfering kernel builds (probably THP's) also
decreased somewhat to kcompactd activity, yet THP alloc successes improved a
bit."

So, why do we need this comment to understand the effect of this patch? If you
had run the test without THP, it would not be necessary.

>> And, this patch increased compaction activity (10 times for migrate
>> scanned)
>> may be due to resetting skip block information.
>
>
> Note that kswapd compaction activity was completely non-existent for reasons
> outlined in the changelog.
>> Isn't is better to disable it
>> for this patch to work as similar as possible that kswapd does and
>> re-enable it
>> on next patch? If something goes bad, it can simply be reverted.
>>
>> Look like it is even not mentioned in the description.
>
>
> Yeah skip block information is discussed in the next patch, which mentions
> that it's being reset and why. I think it makes more sense, as when kswapd

Yes, I know.
What I'd like to say here is that you need to take care of current_is_kswapd()
in this patch. This patch unintentionally changes the background compaction
thread's behaviour so that it restarts compaction every 64 trials, because
current_is_kswapd() called by kcompactd returns false, so kcompactd is treated
as direct reclaim. The results of patch 4 and patch 5 would be the same.

Thanks.
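
The "every 64 trials" behaviour can be modelled in userspace as below. This is
a sketch under assumptions, not the kernel implementation: the toy_* names and
struct toy_defer_zone are hypothetical, the maximum back-off of 1 << 6 attempts
is assumed from the "64 trials" mentioned above, and the order bookkeeping of
the real deferral logic is left out.

```c
#include <stdbool.h>

#define TOY_MAX_DEFER_SHIFT 6	/* assumed maximum back-off: 1 << 6 == 64 */

/* Hypothetical per-zone deferral state. */
struct toy_defer_zone {
	unsigned int considered;	/* attempts seen since last deferral */
	unsigned int defer_shift;	/* log2 of the attempts to skip */
};

/* A compaction run failed: back off exponentially, up to 64 attempts. */
static void toy_defer_compaction(struct toy_defer_zone *z)
{
	z->considered = 0;
	if (z->defer_shift < TOY_MAX_DEFER_SHIFT)
		z->defer_shift++;
}

/* Returns true if this attempt should be skipped. */
static bool toy_compaction_deferred(struct toy_defer_zone *z)
{
	unsigned int limit = 1U << z->defer_shift;

	if (++z->considered > limit)
		z->considered = limit;
	return z->considered < limit;
}

/* True once the maximum back-off has fully elapsed. */
static bool toy_compaction_restarting(const struct toy_defer_zone *z)
{
	return z->defer_shift == TOY_MAX_DEFER_SHIFT &&
	       z->considered >= (1U << z->defer_shift);
}

/*
 * The skip-bit reset on restart is guarded by "not kswapd". A thread that
 * is not kswapd, such as kcompactd, therefore takes the reset path.
 */
static bool toy_should_reset_skip_bits(const struct toy_defer_zone *z,
				       bool is_kswapd)
{
	return toy_compaction_restarting(z) && !is_kswapd;
}
```

In this model, after repeated failures 63 attempts are skipped and the 64th
one restarts compaction; a caller for which the kswapd check is false also
resets the skip bits at that point.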

> reclaims from low watermark to high, potentially many pageblocks have new
> free pages and the skip bits are obsolete. Next, kcompactd is separate
> thread, so it doesn't stall allocations (or kswapd reclaim) by its activity.
> Personally I hope that one day we can get rid of the skip bits completely.
> They can make the stats look apparently nicer, but I think their effect is
> nearly random.
>
>>> @@ -3066,8 +3071,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat,
>>> int order, long remaining,
>>>    */
>>>   static bool kswapd_shrink_zone(struct zone *zone,
>>>                                int classzone_idx,
>>> -                              struct scan_control *sc,
>>> -                              unsigned long *nr_attempted)
>>> +                              struct scan_control *sc)
>>>   {
>>>         int testorder = sc->order;
>>
>>
>> You can remove testorder completely.
>
>
> Hm right, thanks.
>
>>> -static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>>> -                                                       int
>>> *classzone_idx)
>>> +static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>>>   {
>>>         int i;
>>>         int end_zone = 0;       /* Inclusive.  0 = ZONE_DMA */
>>> @@ -3166,9 +3155,7 @@ static unsigned long balance_pgdat(pg_data_t
>>> *pgdat, int order,
>>>         count_vm_event(PAGEOUTRUN);
>>>
>>>         do {
>>> -               unsigned long nr_attempted = 0;
>>>                 bool raise_priority = true;
>>> -               bool pgdat_needs_compaction = (order > 0);
>>>
>>>                 sc.nr_reclaimed = 0;
>>>
>>> @@ -3203,7 +3190,7 @@ static unsigned long balance_pgdat(pg_data_t
>>> *pgdat, int order,
>>>                                 break;
>>>                         }
>>>
>>> -                       if (!zone_balanced(zone, order, 0, 0)) {
>>> +                       if (!zone_balanced(zone, order, true, 0, 0)) {
>>
>>
>> Should we use highorder = true? We eventually skip to reclaim in the
>> kswapd_shrink_zone() when zone_balanced(,,false,,) is true.
>
>
> Hmm right. I probably thought that the value of end_zone ->
> balanced_classzone_idx would be important when waking kcompactd, but it's
> not used, so it's causing just some wasted CPU cycles.
>
> Thanks for the reviews!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-03-02 13:57       ` Joonsoo Kim
@ 2016-03-02 14:09         ` Vlastimil Babka
  2016-03-02 14:22           ` Joonsoo Kim
  0 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2016-03-02 14:09 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Linux Memory Management List, Andrew Morton, LKML,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	David Rientjes, Michal Hocko, Johannes Weiner

On 03/02/2016 02:57 PM, Joonsoo Kim wrote:
> 2016-03-02 19:04 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
>> On 03/02/2016 07:33 AM, Joonsoo Kim wrote:
>>>
>>>
>>> Why you did the test with THP? THP interferes result of main test so
>>> it would be better not to enable it.
>>
>>
>> Hmm I've always left it enabled. It makes for a more realistic interference
>> and would also show unintended regressions in that closely related area.
>
> But, it makes review hard because complex analysis is needed to
> understand the result.
>
> Following is the example.
>
> "The compaction stalls
> (direct compaction) in the interfering kernel builds (probably THP's) also
> decreased somewhat to kcompactd activity, yet THP alloc successes improved a
> bit."
>
> So, why do we need this comment to understand effect of this patch? If you did
> a test without THP, it would not be necessary.

I see. Next time I'll do a run with THP disabled.

>>> And, this patch increased compaction activity (10 times for migrate
>>> scanned)
>>> may be due to resetting skip block information.
>>
>>
>> Note that kswapd compaction activity was completely non-existent for reasons
>> outlined in the changelog.
>>> Isn't is better to disable it
>>> for this patch to work as similar as possible that kswapd does and
>>> re-enable it
>>> on next patch? If something goes bad, it can simply be reverted.
>>>
>>> Look like it is even not mentioned in the description.
>>
>>
>> Yeah skip block information is discussed in the next patch, which mentions
>> that it's being reset and why. I think it makes more sense, as when kswapd
>
> Yes, I know.
> What I'd like to say here is that you need to care current_is_kswapd() in
> this patch. This patch unintentionally change the back ground compaction thread
> behaviour to restart compaction by every 64 trials because calling
> current_is_kswapd()
> by kcompactd would return false and is treated as direct reclaim.

Oh, you mean this path to reset the skip bits. I see. But if skip bits 
are already reset by kswapd when waking kcompactd, then the effect of 
another (rare) reset in kcompactd itself will be minimal?

> Result of patch 4
> and patch 5 would be same.

It's certainly possible to fold patch 5 into 4. I posted them separately 
mainly to make review more feasible. But the differences in results are 
already quite small.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-03-02 14:09         ` Vlastimil Babka
@ 2016-03-02 14:22           ` Joonsoo Kim
  2016-03-02 14:40             ` Vlastimil Babka
  0 siblings, 1 reply; 28+ messages in thread
From: Joonsoo Kim @ 2016-03-02 14:22 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joonsoo Kim, Linux Memory Management List, Andrew Morton, LKML,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	David Rientjes, Michal Hocko, Johannes Weiner

2016-03-02 23:09 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
> On 03/02/2016 02:57 PM, Joonsoo Kim wrote:
>>
>> 2016-03-02 19:04 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
>>>
>>> On 03/02/2016 07:33 AM, Joonsoo Kim wrote:
>>>>
>>>>
>>>>
>>>> Why you did the test with THP? THP interferes result of main test so
>>>> it would be better not to enable it.
>>>
>>>
>>>
>>> Hmm I've always left it enabled. It makes for a more realistic
>>> interference
>>> and would also show unintended regressions in that closely related area.
>>
>>
>> But, it makes review hard because complex analysis is needed to
>> understand the result.
>>
>> Following is the example.
>>
>> "The compaction stalls
>> (direct compaction) in the interfering kernel builds (probably THP's) also
>> decreased somewhat to kcompactd activity, yet THP alloc successes improved
>> a
>> bit."
>>
>> So, why do we need this comment to understand effect of this patch? If you
>> did
>> a test without THP, it would not be necessary.
>
>
> I see. Next time I'll do a run with THP disabled.
>
>>>> And, this patch increased compaction activity (10 times for migrate
>>>> scanned)
>>>> may be due to resetting skip block information.
>>>
>>>
>>>
>>> Note that kswapd compaction activity was completely non-existent for
>>> reasons
>>> outlined in the changelog.
>>>>
>>>> Isn't is better to disable it
>>>> for this patch to work as similar as possible that kswapd does and
>>>> re-enable it
>>>> on next patch? If something goes bad, it can simply be reverted.
>>>>
>>>> Look like it is even not mentioned in the description.
>>>
>>>
>>>
>>> Yeah skip block information is discussed in the next patch, which
>>> mentions
>>> that it's being reset and why. I think it makes more sense, as when
>>> kswapd
>>
>>
>> Yes, I know.
>> What I'd like to say here is that you need to care current_is_kswapd() in
>> this patch. This patch unintentionally change the back ground compaction
>> thread
>> behaviour to restart compaction by every 64 trials because calling
>> curret_is_kswapd()
>
>> by kcompactd would return false and is treated as direct reclaim.
>
> Oh, you mean this path to reset the skip bits. I see. But if skip bits are
> already reset by kswapd when waking kcompactd, then effect of another (rare)
> reset in kcompactd itself will be minimal?

If you handle current_is_kswapd() properly in this patch (properly meaning a
change to something like current_is_kcompactd()), the reset in kswapd would not
happen, because compact_blockskip_flush would not be set by kcompactd.

In this case, patch 5 would have its own meaning, so it could not be folded.

Thanks.

>> Result of patch 4
>> and patch 5 would be same.
>
>
> It's certainly possible to fold patch 5 into 4. I posted them separately
> mainly to make review more feasible. But the differences in results are
> already quite small.
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-03-02 14:22           ` Joonsoo Kim
@ 2016-03-02 14:40             ` Vlastimil Babka
  2016-03-02 14:59               ` Joonsoo Kim
  0 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2016-03-02 14:40 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Linux Memory Management List, Andrew Morton, LKML,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	David Rientjes, Michal Hocko, Johannes Weiner

On 03/02/2016 03:22 PM, Joonsoo Kim wrote:
> 2016-03-02 23:09 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
>> On 03/02/2016 02:57 PM, Joonsoo Kim wrote:
>>>
>>>
>>> Yes, I know.
>>> What I'd like to say here is that you need to care current_is_kswapd() in
>>> this patch. This patch unintentionally change the back ground compaction
>>> thread
>>> behaviour to restart compaction by every 64 trials because calling
>>> curret_is_kswapd()
>>
>>> by kcompactd would return false and is treated as direct reclaim.
>>
>> Oh, you mean this path to reset the skip bits. I see. But if skip bits are
>> already reset by kswapd when waking kcompactd, then effect of another (rare)
>> reset in kcompactd itself will be minimal?
> 
> If you care current_is_kswapd() in this patch properly (properly means change
> like "current_is_kcompactd()), reset in kswapd would not
> happen because, compact_blockskip_flush would not be set by kcompactd.
> 
> In this case, patch 5 would have it's own meaning so cannot be folded.

So I understand that patch 5 would be just about this?

-	if (compaction_restarting(zone, cc->order) && !current_is_kcompactd())
+	if (compaction_restarting(zone, cc->order))
 		__reset_isolation_suitable(zone);

I'm more inclined to fold it in that case. 

> Thanks.
> 
>>> Result of patch 4
>>> and patch 5 would be same.
>>
>>
>> It's certainly possible to fold patch 5 into 4. I posted them separately
>> mainly to make review more feasible. But the differences in results are
>> already quite small.
>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-03-02 14:40             ` Vlastimil Babka
@ 2016-03-02 14:59               ` Joonsoo Kim
  2016-03-02 15:22                 ` Vlastimil Babka
  0 siblings, 1 reply; 28+ messages in thread
From: Joonsoo Kim @ 2016-03-02 14:59 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joonsoo Kim, Linux Memory Management List, Andrew Morton, LKML,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	David Rientjes, Michal Hocko, Johannes Weiner

2016-03-02 23:40 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
> On 03/02/2016 03:22 PM, Joonsoo Kim wrote:
>> 2016-03-02 23:09 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
>>> On 03/02/2016 02:57 PM, Joonsoo Kim wrote:
>>>>
>>>>
>>>> Yes, I know.
>>>> What I'd like to say here is that you need to care current_is_kswapd() in
>>>> this patch. This patch unintentionally change the back ground compaction
>>>> thread
>>>> behaviour to restart compaction by every 64 trials because calling
>>>> curret_is_kswapd()
>>>
>>>> by kcompactd would return false and is treated as direct reclaim.
>>>
>>> Oh, you mean this path to reset the skip bits. I see. But if skip bits are
>>> already reset by kswapd when waking kcompactd, then effect of another (rare)
>>> reset in kcompactd itself will be minimal?
>>
>> If you care current_is_kswapd() in this patch properly (properly means change
>> like "current_is_kcompactd()), reset in kswapd would not
>> happen because, compact_blockskip_flush would not be set by kcompactd.
>>
>> In this case, patch 5 would have it's own meaning so cannot be folded.
>
> So I understand that patch 5 would be just about this?
>
> -       if (compaction_restarting(zone, cc->order) && !current_is_kcompactd())
> +       if (compaction_restarting(zone, cc->order))
>                 __reset_isolation_suitable(zone);

Yeah, you understand correctly. :)

> I'm more inclined to fold it in that case.

The patch would be simple, but I guess it would cause some difference
in the test results. But I'm okay with folding.

Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-03-02 14:59               ` Joonsoo Kim
@ 2016-03-02 15:22                 ` Vlastimil Babka
  2016-03-04 23:25                   ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2016-03-02 15:22 UTC (permalink / raw)
  To: Joonsoo Kim, Andrew Morton
  Cc: Joonsoo Kim, Linux Memory Management List, LKML,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	David Rientjes, Michal Hocko, Johannes Weiner

On 03/02/2016 03:59 PM, Joonsoo Kim wrote:
> 2016-03-02 23:40 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
>> On 03/02/2016 03:22 PM, Joonsoo Kim wrote:
>>
>> So I understand that patch 5 would be just about this?
>>
>> -       if (compaction_restarting(zone, cc->order) && !current_is_kcompactd())
>> +       if (compaction_restarting(zone, cc->order))
>>                  __reset_isolation_suitable(zone);
>
> Yeah, you understand correctly. :)
>
>> I'm more inclined to fold it in that case.
>
> The patch would be simple, but I guess it would cause some difference in
> the test results. Still, I'm okay with folding.

Thanks. Andrew, should I now send a patch folding patches 4/5 and 5/5, with
all the accumulated fixlets (including those I sent earlier today) and a
combined changelog, or do you want to apply the new fixlets separately
first and let them sit for a week or so? In any case, sorry for the churn.

> Thanks.
>

* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-03-02 15:22                 ` Vlastimil Babka
@ 2016-03-04 23:25                   ` Andrew Morton
  2016-03-07  9:45                     ` Vlastimil Babka
                                       ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Andrew Morton @ 2016-03-04 23:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joonsoo Kim, Joonsoo Kim, Linux Memory Management List, LKML,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	David Rientjes, Michal Hocko, Johannes Weiner

On Wed, 2 Mar 2016 16:22:43 +0100 Vlastimil Babka <vbabka@suse.cz> wrote:

> On 03/02/2016 03:59 PM, Joonsoo Kim wrote:
> > 2016-03-02 23:40 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
> >> On 03/02/2016 03:22 PM, Joonsoo Kim wrote:
> >>
> >> So I understand that patch 5 would be just about this?
> >>
> >> -       if (compaction_restarting(zone, cc->order) && !current_is_kcompactd())
> >> +       if (compaction_restarting(zone, cc->order))
> >>                  __reset_isolation_suitable(zone);
> >
> > Yeah, you understand correctly. :)
> >
> >> I'm more inclined to fold it in that case.
> >
> > The patch would be simple, but I guess it would cause some difference in
> > the test results. Still, I'm okay with folding.
> 
> Thanks. Andrew, should I now send a patch folding patches 4/5 and 5/5, with
> all the accumulated fixlets (including those I sent earlier today) and a
> combined changelog, or do you want to apply the new fixlets separately
> first and let them sit for a week or so? In any case, sorry for the churn.

Did I get everything?

http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-remove-bogus-check-of-balance_classzone_idx.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix-2.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix-3.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-memory-hotplug-small-cleanup-in-online_pages.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-replace-kswapd-compaction-with-waking-up-kcompactd.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-replace-kswapd-compaction-with-waking-up-kcompactd-fix.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-adapt-isolation_suitable-flushing-to-kcompactd.patch

* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-03-04 23:25                   ` Andrew Morton
@ 2016-03-07  9:45                     ` Vlastimil Babka
  2016-03-09 13:47                     ` Vlastimil Babka
  2016-03-09 13:50                     ` Vlastimil Babka
  2 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2016-03-07  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joonsoo Kim, Joonsoo Kim, Linux Memory Management List, LKML,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	David Rientjes, Michal Hocko, Johannes Weiner

On 03/05/2016 12:25 AM, Andrew Morton wrote:
> On Wed, 2 Mar 2016 16:22:43 +0100 Vlastimil Babka <vbabka@suse.cz> wrote:
>
>> On 03/02/2016 03:59 PM, Joonsoo Kim wrote:
>>> 2016-03-02 23:40 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
>>>> On 03/02/2016 03:22 PM, Joonsoo Kim wrote:
>>>>
>>>> So I understand that patch 5 would be just about this?
>>>>
>>>> -       if (compaction_restarting(zone, cc->order) && !current_is_kcompactd())
>>>> +       if (compaction_restarting(zone, cc->order))
>>>>                   __reset_isolation_suitable(zone);
>>>
>>> Yeah, you understand correctly. :)
>>>
>>>> I'm more inclined to fold it in that case.
>>>
>>> The patch would be simple, but I guess it would cause some difference in
>>> the test results. Still, I'm okay with folding.
>>
>> Thanks. Andrew, should I now send a patch folding patches 4/5 and 5/5, with
>> all the accumulated fixlets (including those I sent earlier today) and a
>> combined changelog, or do you want to apply the new fixlets separately
>> first and let them sit for a week or so? In any case, sorry for the churn.
>
> Did I get everything?

Yeah, thanks! I'll send the squash in a few days (also with a vmstat typo
fix that Hugh pointed out in another thread).

> http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-remove-bogus-check-of-balance_classzone_idx.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix-2.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix-3.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-memory-hotplug-small-cleanup-in-online_pages.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-replace-kswapd-compaction-with-waking-up-kcompactd.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-replace-kswapd-compaction-with-waking-up-kcompactd-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-adapt-isolation_suitable-flushing-to-kcompactd.patch
>

* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-03-04 23:25                   ` Andrew Morton
  2016-03-07  9:45                     ` Vlastimil Babka
@ 2016-03-09 13:47                     ` Vlastimil Babka
  2016-03-09 13:50                     ` Vlastimil Babka
  2 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2016-03-09 13:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joonsoo Kim, Joonsoo Kim, Linux Memory Management List, LKML,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	David Rientjes, Michal Hocko, Johannes Weiner

On 03/05/2016 12:25 AM, Andrew Morton wrote:
>> Thanks. Andrew, should I now send a patch folding patches 4/5 and 5/5, with
>> all the accumulated fixlets (including those I sent earlier today) and a
>> combined changelog, or do you want to apply the new fixlets separately
>> first and let them sit for a week or so? In any case, sorry for the churn.
> 
> Did I get everything?
> 
> http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-remove-bogus-check-of-balance_classzone_idx.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix-2.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix-3.patch

Please add the one below here.

> http://ozlabs.org/~akpm/mmots/broken-out/mm-memory-hotplug-small-cleanup-in-online_pages.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-replace-kswapd-compaction-with-waking-up-kcompactd.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-replace-kswapd-compaction-with-waking-up-kcompactd-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-adapt-isolation_suitable-flushing-to-kcompactd.patch
 
----8<----
From 0977a031f891ef6f675a64c53b797d92d839f11c Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Thu, 3 Mar 2016 11:35:23 +0100
Subject: [PATCH 1/3] mm-compaction-introduce-kcompactd-fix-4

Fix typo in /proc/vmstat for kcompactd wakeups. Per Hugh's suggestion,
rename the item to compact_daemon_wake.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmstat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index c9571294f61c..f80066248c94 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -826,7 +826,7 @@ const char * const vmstat_text[] = {
 	"compact_stall",
 	"compact_fail",
 	"compact_success",
-	"compact_kcompatd_wake",
+	"compact_daemon_wake",
 #endif
 
 #ifdef CONFIG_HUGETLB_PAGE
-- 
2.7.2

* Re: [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd
  2016-03-04 23:25                   ` Andrew Morton
  2016-03-07  9:45                     ` Vlastimil Babka
  2016-03-09 13:47                     ` Vlastimil Babka
@ 2016-03-09 13:50                     ` Vlastimil Babka
  2 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2016-03-09 13:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joonsoo Kim, Joonsoo Kim, Linux Memory Management List, LKML,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	David Rientjes, Michal Hocko, Johannes Weiner

On 03/05/2016 12:25 AM, Andrew Morton wrote:
> On Wed, 2 Mar 2016 16:22:43 +0100 Vlastimil Babka <vbabka@suse.cz> wrote:
> 
>> On 03/02/2016 03:59 PM, Joonsoo Kim wrote:
>>> 2016-03-02 23:40 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
>>>> On 03/02/2016 03:22 PM, Joonsoo Kim wrote:
>>>>
>>>> So I understand that patch 5 would be just about this?
>>>>
>>>> -       if (compaction_restarting(zone, cc->order) && !current_is_kcompactd())
>>>> +       if (compaction_restarting(zone, cc->order))
>>>>                  __reset_isolation_suitable(zone);
>>>
>>> Yeah, you understand correctly. :)
>>>
>>>> I'm more inclined to fold it in that case.
>>>
>>> The patch would be simple, but I guess it would cause some difference in
>>> the test results. Still, I'm okay with folding.
>>
>> Thanks. Andrew, should I now send a patch folding patches 4/5 and 5/5, with
>> all the accumulated fixlets (including those I sent earlier today) and a
>> combined changelog, or do you want to apply the new fixlets separately
>> first and let them sit for a week or so? In any case, sorry for the churn.
> 
> Did I get everything?
> 
> http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-remove-bogus-check-of-balance_classzone_idx.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix-2.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-introduce-kcompactd-fix-3.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-memory-hotplug-small-cleanup-in-online_pages.patch

Please replace the following three:

> http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-replace-kswapd-compaction-with-waking-up-kcompactd.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-kswapd-replace-kswapd-compaction-with-waking-up-kcompactd-fix.patch
> http://ozlabs.org/~akpm/mmots/broken-out/mm-compaction-adapt-isolation_suitable-flushing-to-kcompactd.patch

With the squashed one below (had to mangle the changelog nontrivially).
This is after discussion with Joonsoo. It was perhaps better separately for
review, but functionality-wise the first patch leaves things somewhat
weird without the third patch.

----8<----
From c829909527ecd33eb869c96bcd287bade2b32100 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Wed, 9 Mar 2016 12:45:24 +0100
Subject: [PATCH 3/3] mm, kswapd: replace kswapd compaction with waking up
 kcompactd

Similarly to direct reclaim/compaction, kswapd attempts to combine reclaim
and compaction to make a memory allocation of the given order possible.
The details differ from direct reclaim, e.g. in having the high watermark
as a goal.  The code involved in kswapd's reclaim/compaction decisions has
evolved to be quite complex.  Testing reveals that it doesn't actually
work in at least one scenario, and closer inspection suggests that it
could be greatly simplified without compromising on the goal (make a
high-order page available) or efficiency (don't reclaim too much).  The
simplification relies on doing all compaction in kcompactd, which is
simply woken up when high watermarks are reached by kswapd's reclaim.

The scenario where kswapd compaction doesn't work was found with mmtests
test stress-highalloc configured to attempt order-9 allocations without
direct reclaim, just waking up kswapd.  There was no compaction attempt
from kswapd during the whole test.  Some added instrumentation shows what
happens:

- balance_pgdat() sets end_zone to Normal, as it's not balanced
- reclaim is attempted on DMA zone, which sets nr_attempted to 99, but it
   cannot reclaim anything, so sc.nr_reclaimed is 0
- for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so it
   merely checks if high watermarks were reached for base pages. This is true,
   so no reclaim is attempted. For DMA, testorder=0 wasn't used, as
   compaction_suitable() returned COMPACT_SKIPPED
- even though the pgdat_needs_compaction flag wasn't set to false, no
   compaction happens due to the condition sc.nr_reclaimed > nr_attempted
   being false (as 0 < 99)
- priority-- due to nr_reclaimed being 0, repeat until priority reaches 0.
   pgdat_balanced() is false as only the small zone DMA appears balanced
   (curiously in that check, watermark appears OK and compaction_suitable()
   returns COMPACT_PARTIAL, because a lower classzone_idx is used there)

Now, even if it was decided that reclaim shouldn't be attempted on the DMA
zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
nr_attempted=0) is also false.  The condition really should use >= as the
comment suggests.  Then there is a mismatch in the check for setting
pgdat_needs_compaction to false using low watermark, while the rest uses
high watermark, and who knows what other subtlety.  Hopefully this
demonstrates that this is unsustainable.
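
As an illustration only (this is not code from the patch), the comparison
problem described above can be reproduced in a few lines of userspace C.
The function names old_should_compact() and ge_should_compact() are
hypothetical; they merely mirror the removed condition in balance_pgdat()
and the ">=" variant that the old comment implied.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the condition removed from balance_pgdat():
 * compaction runs only if strictly more pages were reclaimed than
 * were attempted.
 */
bool old_should_compact(bool pgdat_needs_compaction,
			unsigned long nr_reclaimed,
			unsigned long nr_attempted)
{
	return pgdat_needs_compaction && nr_reclaimed > nr_attempted;
}

/*
 * Hypothetical variant using >=, matching the "at least" wording of
 * the old comment.
 */
bool ge_should_compact(bool pgdat_needs_compaction,
		       unsigned long nr_reclaimed,
		       unsigned long nr_attempted)
{
	return pgdat_needs_compaction && nr_reclaimed >= nr_attempted;
}
```

With the values from the scenario, old_should_compact(true, 0, 99) and
old_should_compact(true, 0, 0) are both false.  Only the ">=" variant lets
the (0, 0) case through, and the (0, 99) case stays false either way.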

Luckily we can simplify this a lot.  The reclaim/compaction decisions make
sense for direct reclaim scenario, but in kswapd, our primary goal is to
reach high watermark in order-0 pages.  Afterwards we can attempt
compaction just once.  Unlike direct reclaim, we don't reclaim extra pages
(over the high watermark); the current code already disallows it for good
reasons.

After this patch, we simply wake up kcompactd to process the pgdat, after
we have either succeeded or failed to reach the high watermarks in kswapd,
which goes to sleep.  We pass kswapd's order and classzone_idx, so
kcompactd can apply the same criteria to determine which zones are worth
compacting.  Note that we use the classzone_idx from wakeup_kswapd(), not
balanced_classzone_idx which can include higher zones that kswapd tried to
balance too, but didn't consider them in pgdat_balanced().

Since kswapd now cannot create high-order pages itself, we need to adjust
how it determines the zones to be balanced.  The key element here is
adding a "highorder" parameter to zone_balanced, which, when set to false,
makes it consider only order-0 watermark instead of the desired higher
order (this was done previously by kswapd_shrink_zone(), but not
elsewhere).  This false is passed for example in pgdat_balanced().
Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
kcompactd are woken up for a high-order allocation failure.
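
The effect of the "highorder" parameter can be sketched with a small
userspace model.  This is an illustration under stated assumptions, not
the kernel code: struct model_zone, model_watermark_ok() and
model_zone_balanced() are hypothetical names, the watermark check is a
heavy simplification of zone_watermark_ok_safe() (no lowmem reserves, no
migratetypes), and CONFIG_COMPACTION is assumed to be enabled.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_ORDER 11	/* hypothetical number of free-list orders */

/* Hypothetical, simplified zone: free block counts and a high watermark. */
struct model_zone {
	unsigned long nr_free[MODEL_MAX_ORDER];	/* free blocks per order */
	unsigned long high_wmark;		/* in base pages */
};

/* Total free memory in base pages. */
unsigned long model_free_pages(const struct model_zone *z)
{
	unsigned long total = 0;

	for (int o = 0; o < MODEL_MAX_ORDER; o++)
		total += z->nr_free[o] << o;
	return total;
}

/*
 * Simplified watermark check: enough free base pages overall and, for
 * order > 0, at least one free block of the requested order or higher.
 */
bool model_watermark_ok(const struct model_zone *z, int order,
			unsigned long mark)
{
	if (model_free_pages(z) <= mark)
		return false;
	if (order == 0)
		return true;
	for (int o = order; o < MODEL_MAX_ORDER; o++)
		if (z->nr_free[o])
			return true;
	return false;
}

/* Mirrors the shape of the patched zone_balanced(). */
bool model_zone_balanced(const struct model_zone *z, int order,
			 bool highorder, unsigned long balance_gap)
{
	unsigned long mark = z->high_wmark + balance_gap;

	assert(order >= 0 && order < MODEL_MAX_ORDER);

	/* kswapd only needs order-0 pages, plus room for one block. */
	if (!highorder) {
		mark += 1UL << order;
		order = 0;
	}
	return model_watermark_ok(z, order, mark);
}
```

For a model zone with 2000 free order-0 pages, a high watermark of 1000 and
order 9, the call with highorder=false reports balanced (2000 > 1000 + 512),
while highorder=true does not, because no free block of order 9 or higher
exists.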

The last thing is to decide what to do with pageblock_skip bitmap handling.
Compaction maintains a pageblock_skip bitmap to record pageblocks where
isolation recently failed.  This bitmap can be reset in three ways:

1) direct compaction is restarting after going through the full deferred cycle

2) kswapd goes to sleep, and some other direct compaction has previously
    finished scanning the whole zone and set zone->compact_blockskip_flush.
    Note that a successful direct compaction clears this flag.

3) compaction was invoked manually via trigger in /proc

The case 2) is somewhat fuzzy to begin with, but after introducing
kcompactd we should update it.  The check for direct compaction in 1), and
to set the flush flag in 2) use current_is_kswapd(), which doesn't work
for kcompactd.  Thus, this patch adds bool direct_compaction to
compact_control to use in 2).  For the case 1) we remove the check
completely - unlike the former kswapd compaction, kcompactd does use the
deferred compaction functionality, so flushing tied to restarting from
deferred compaction makes sense here.
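
A minimal userspace sketch of the resulting rules for cases 1) and 2),
for illustration only.  All names prefixed with model_ are hypothetical;
the "restarting" argument stands in for the result of
compaction_restarting(), and the counter merely records how often the
modelled __reset_isolation_suitable() would run.

```c
#include <stdbool.h>

/* Hypothetical per-zone state relevant to skip-bit flushing. */
struct model_zone_state {
	bool compact_blockskip_flush;	/* flush requested for kswapd sleep */
	unsigned int skip_resets;	/* modelled skip-bitmap resets */
};

/* Hypothetical subset of compact_control. */
struct model_cc {
	bool direct_compaction;		/* false for kcompactd or /proc */
};

/* Modelled reset: clears the bitmap (counted) and the pending flag. */
void model_reset_skip_bits(struct model_zone_state *z)
{
	z->compact_blockskip_flush = false;
	z->skip_resets++;
}

/* Case 2), first half: a full scan completed without success. */
void model_scan_complete(struct model_zone_state *z,
			 const struct model_cc *cc)
{
	/* Only direct compaction requests a flush; kcompactd does not. */
	if (cc->direct_compaction)
		z->compact_blockskip_flush = true;
}

/* Case 2), second half: kswapd goes to sleep and honours the request. */
void model_kswapd_sleep(struct model_zone_state *z)
{
	if (z->compact_blockskip_flush)
		model_reset_skip_bits(z);
}

/* Case 1): restart after a full deferred cycle, for any caller. */
void model_compact_start(struct model_zone_state *z, bool restarting)
{
	if (restarting)
		model_reset_skip_bits(z);
}
```

In this model a full scan by kcompactd leaves the flush flag untouched, so
a later kswapd sleep resets nothing.  A full scan by a direct compactor
sets the flag, and the next kswapd sleep performs exactly one reset.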

Note that when kswapd goes to sleep, kcompactd is woken up, so it will see
the flushed pageblock_skip bits.  This is different from when the former
kswapd compaction observed the bits and I believe it makes more sense.
Kcompactd can afford to be more thorough than a direct compaction trying
to limit allocation latency, or kswapd whose primary goal is to reclaim.

For testing, I used stress-highalloc configured to do order-9 allocations
with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just on
kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
phases 1 and 2 work as usual):

stress-highalloc
                        4.5-rc1+before          4.5-rc1+after
                             -nodirect              -nodirect
Success 1 Min          1.00 (  0.00%)         5.00 (-66.67%)
Success 1 Mean         1.40 (  0.00%)         6.20 (-55.00%)
Success 1 Max          2.00 (  0.00%)         7.00 (-16.67%)
Success 2 Min          1.00 (  0.00%)         5.00 (-66.67%)
Success 2 Mean         1.80 (  0.00%)         6.40 (-52.38%)
Success 2 Max          3.00 (  0.00%)         7.00 (-16.67%)
Success 3 Min         34.00 (  0.00%)        62.00 (  1.59%)
Success 3 Mean        41.80 (  0.00%)        63.80 (  1.24%)
Success 3 Max         53.00 (  0.00%)        65.00 (  2.99%)

User                          3166.67        3181.09
System                        1153.37        1158.25
Elapsed                       1768.53        1799.37

                            4.5-rc1+before   4.5-rc1+after
                                 -nodirect    -nodirect
Direct pages scanned                32938        32797
Kswapd pages scanned              2183166      2202613
Kswapd pages reclaimed            2152359      2143524
Direct pages reclaimed              32735        32545
Percentage direct scans                1%           1%
THP fault alloc                       579          612
THP collapse alloc                    304          316
THP splits                              0            0
THP fault fallback                    793          778
THP collapse fail                      11           16
Compaction stalls                    1013         1007
Compaction success                     92           67
Compaction failures                   920          939
Page migrate success               238457       721374
Page migrate failure                23021        23469
Compaction pages isolated          504695      1479924
Compaction migrate scanned         661390      8812554
Compaction free scanned          13476658     84327916
Compaction cost                       262          838

After this patch we see improvements in allocation success rate
(especially for phase 3) along with increased compaction activity.  The
compaction stalls (direct compaction) in the interfering kernel builds
(probably THP's) also decreased somewhat thanks to kcompactd activity, yet
THP alloc successes improved a bit.

Note that elapsed and user time aren't so useful for this benchmark,
because the background interference is unpredictable.  They are just there
to quickly spot major unexpected differences.  System time is somewhat
more useful, and it didn't increase noticeably.

Also (after adjusting mmtests' ftrace monitor):

Time kswapd awake               2547781     2269241
Time kcompactd awake                  0      119253
Time direct compacting           939937      557649
Time kswapd compacting                0           0
Time kcompactd compacting             0      119099

The decrease in overall time spent compacting appears not to match the
increased compaction stats.  I suspect the tasks get rescheduled and since
the ftrace monitor doesn't see that, the reported time is wall time, not
CPU time.  But arguably direct compactors care about overall latency
anyway; whether they are busy compacting or waiting for CPU doesn't matter.
And that latency seems to have almost halved.
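
For readers who want to check the figures, here is a small helper (a
hypothetical name, not part of mmtests) that computes the relative
reduction between two of the ftrace-derived values above:

```c
#include <assert.h>

/* Percentage by which `after` is lower than `before` (negative if higher). */
double pct_reduction(double before, double after)
{
	assert(before > 0.0);
	return 100.0 * (before - after) / before;
}
```

Applied to "Time direct compacting" (939937 before, 557649 after) this
gives a reduction of a bit over 40%.  Summing direct and kcompactd
compacting time after the patch (557649 + 119099 = 676748) and comparing
with 939937 gives a reduction of about 28% in total compacting time.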

It's also interesting how much time kswapd spent awake just going through
all the priorities and failing to even try compacting, over and over.

We can also configure stress-highalloc to both perform direct
reclaim/compaction and wake up kswapd/kcompactd, by using
GFP_KERNEL|__GFP_HIGH|__GFP_COMP:

stress-highalloc
                        4.5-rc1+before         4.5-rc1+after
                               -direct               -direct
Success 1 Min          4.00 (  0.00%)        9.00 (-50.00%)
Success 1 Mean         8.00 (  0.00%)       10.00 (-19.05%)
Success 1 Max         12.00 (  0.00%)       11.00 ( 15.38%)
Success 2 Min          4.00 (  0.00%)        9.00 (-50.00%)
Success 2 Mean         8.20 (  0.00%)       10.00 (-16.28%)
Success 2 Max         13.00 (  0.00%)       11.00 (  8.33%)
Success 3 Min         75.00 (  0.00%)       74.00 (  1.33%)
Success 3 Mean        75.60 (  0.00%)       75.20 (  0.53%)
Success 3 Max         77.00 (  0.00%)       76.00 (  0.00%)

User                          3344.73       3246.04
System                        1194.24       1172.29
Elapsed                       1838.04       1836.76

                            4.5-rc1+before  4.5-rc1+after
                                   -direct     -direct
Direct pages scanned               125146      120966
Kswapd pages scanned              2119757     2135012
Kswapd pages reclaimed            2073183     2108388
Direct pages reclaimed             124909      120577
Percentage direct scans                5%          5%
THP fault alloc                       599         652
THP collapse alloc                    323         354
THP splits                              0           0
THP fault fallback                    806         793
THP collapse fail                      17          16
Compaction stalls                    2457        2025
Compaction success                    906         518
Compaction failures                  1551        1507
Page migrate success              2031423     2360608
Page migrate failure                32845       40852
Compaction pages isolated         4129761     4802025
Compaction migrate scanned       11996712    21750613
Compaction free scanned         214970969   344372001
Compaction cost                      2271        2694

In this scenario, this patch doesn't change the overall success rate as
direct compaction already tries all it can.  There is, however, a
significant reduction in direct compaction stalls (that is, the number of
allocations that went into direct compaction).  The number of successes
(i.e.  direct compaction stalls that ended up with successful allocation)
is reduced by roughly the same number.  This means the offload to
kcompactd is working as expected, and direct compaction is reduced either
due to detecting contention, or compaction deferred by kcompactd.  In the
previous version of this patchset there was some apparent reduction of
success rate, but the changes in this version (such as using sync
compaction only), new baseline kernel, and/or averaging results from 5
executions (my bet), made this go away.
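
The counter changes behind this paragraph can be checked with a trivial
helper (hypothetical, for illustration only), using the values from the
-direct table above:

```c
/* Signed change of a counter between the "before" and "after" kernels. */
long counter_delta(unsigned long before, unsigned long after)
{
	return (long)after - (long)before;
}
```

Compaction stalls change by counter_delta(2457, 2025) = -432, successes by
counter_delta(906, 518) = -388 and failures by counter_delta(1551, 1507) =
-44.  In both columns stalls equal successes plus failures, so the drop in
stalls is almost entirely accounted for by the drop in successes.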

Ftrace-based stats seem to roughly agree:

Time kswapd awake               2532984     2326824
Time kcompactd awake                  0      257916
Time direct compacting           864839      735130
Time kswapd compacting                0           0
Time kcompactd compacting             0      257585

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 mm/compaction.c |  10 ++--
 mm/internal.h   |   1 +
 mm/vmscan.c     | 147 ++++++++++++++++++--------------------------------------
 3 files changed, 54 insertions(+), 104 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 5b2bfbaa821a..ccf97b02b85f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1191,11 +1191,11 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
 
 		/*
 		 * Mark that the PG_migrate_skip information should be cleared
-		 * by kswapd when it goes to sleep. kswapd does not set the
+		 * by kswapd when it goes to sleep. kcompactd does not set the
 		 * flag itself as the decision to be clear should be directly
 		 * based on an allocation request.
 		 */
-		if (!current_is_kswapd())
+		if (cc->direct_compaction)
 			zone->compact_blockskip_flush = true;
 
 		return COMPACT_COMPLETE;
@@ -1338,10 +1338,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 
 	/*
 	 * Clear pageblock skip if there were failures recently and compaction
-	 * is about to be retried after being deferred. kswapd does not do
-	 * this reset as it'll reset the cached information when going to sleep.
+	 * is about to be retried after being deferred.
 	 */
-	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
+	if (compaction_restarting(zone, cc->order))
 		__reset_isolation_suitable(zone);
 
 	/*
@@ -1477,6 +1476,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
 		.mode = mode,
 		.alloc_flags = alloc_flags,
 		.classzone_idx = classzone_idx,
+		.direct_compaction = true,
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
diff --git a/mm/internal.h b/mm/internal.h
index 17ae0b52534b..013a786fa37f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -181,6 +181,7 @@ struct compact_control {
 	unsigned long last_migrated_pfn;/* Not yet flushed page being freed */
 	enum migrate_mode mode;		/* Async or sync migration mode */
 	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
+	bool direct_compaction;		/* False from kcompactd or /proc/... */
 	int order;			/* order a direct compactor needs */
 	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
 	const int alloc_flags;		/* alloc flags of a direct compactor */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c67df4831565..23bc7e643ad8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2951,18 +2951,23 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order,
-			  unsigned long balance_gap, int classzone_idx)
+static bool zone_balanced(struct zone *zone, int order, bool highorder,
+			unsigned long balance_gap, int classzone_idx)
 {
-	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
-				    balance_gap, classzone_idx))
-		return false;
+	unsigned long mark = high_wmark_pages(zone) + balance_gap;
 
-	if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
-				order, 0, classzone_idx) == COMPACT_SKIPPED)
-		return false;
+	/*
+	 * When checking from pgdat_balanced(), kswapd should stop and sleep
+	 * when it reaches the high order-0 watermark and let kcompactd take
+	 * over. Other callers such as wakeup_kswapd() want to determine the
+	 * true high-order watermark.
+	 */
+	if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {
+		mark += (1UL << order);
+		order = 0;
+	}
 
-	return true;
+	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
 }
 
 /*
@@ -3012,7 +3017,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 			continue;
 		}
 
-		if (zone_balanced(zone, order, 0, i))
+		if (zone_balanced(zone, order, false, 0, i))
 			balanced_pages += zone->managed_pages;
 		else if (!order)
 			return false;
@@ -3066,10 +3071,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
  */
 static bool kswapd_shrink_zone(struct zone *zone,
 			       int classzone_idx,
-			       struct scan_control *sc,
-			       unsigned long *nr_attempted)
+			       struct scan_control *sc)
 {
-	int testorder = sc->order;
 	unsigned long balance_gap;
 	bool lowmem_pressure;
 
@@ -3077,17 +3080,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
 
 	/*
-	 * Kswapd reclaims only single pages with compaction enabled. Trying
-	 * too hard to reclaim until contiguous free pages have become
-	 * available can hurt performance by evicting too much useful data
-	 * from memory. Do not reclaim more than needed for compaction.
-	 */
-	if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
-			compaction_suitable(zone, sc->order, 0, classzone_idx)
-							!= COMPACT_SKIPPED)
-		testorder = 0;
-
-	/*
 	 * We put equal pressure on every zone, unless one zone has way too
 	 * many pages free already. The "too many pages" is defined as the
 	 * high wmark plus a "gap" where the gap is either the low
@@ -3101,15 +3093,12 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * reclaim is necessary
 	 */
 	lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
-	if (!lowmem_pressure && zone_balanced(zone, testorder,
+	if (!lowmem_pressure && zone_balanced(zone, sc->order, false,
 						balance_gap, classzone_idx))
 		return true;
 
 	shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
 
-	/* Account for the number of pages attempted to reclaim */
-	*nr_attempted += sc->nr_to_reclaim;
-
 	clear_bit(ZONE_WRITEBACK, &zone->flags);
 
 	/*
@@ -3119,7 +3108,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * waits.
 	 */
 	if (zone_reclaimable(zone) &&
-	    zone_balanced(zone, testorder, 0, classzone_idx)) {
+	    zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
 		clear_bit(ZONE_CONGESTED, &zone->flags);
 		clear_bit(ZONE_DIRTY, &zone->flags);
 	}
@@ -3131,7 +3120,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at high_wmark_pages(zone).
  *
- * Returns the final order kswapd was reclaiming at
+ * Returns the highest zone idx kswapd was reclaiming at
  *
  * There is special handling here for zones which are full of pinned pages.
  * This can happen if the pages are all mlocked, or if they are all used by
@@ -3148,8 +3137,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
  * interoperates with the page allocator fallback scheme to ensure that aging
  * of pages is balanced across the zones.
  */
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
-							int *classzone_idx)
+static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
@@ -3166,9 +3154,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	count_vm_event(PAGEOUTRUN);
 
 	do {
-		unsigned long nr_attempted = 0;
 		bool raise_priority = true;
-		bool pgdat_needs_compaction = (order > 0);
 
 		sc.nr_reclaimed = 0;
 
@@ -3203,7 +3189,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				break;
 			}
 
-			if (!zone_balanced(zone, order, 0, 0)) {
+			if (!zone_balanced(zone, order, false, 0, 0)) {
 				end_zone = i;
 				break;
 			} else {
@@ -3219,24 +3205,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		if (i < 0)
 			goto out;
 
-		for (i = 0; i <= end_zone; i++) {
-			struct zone *zone = pgdat->node_zones + i;
-
-			if (!populated_zone(zone))
-				continue;
-
-			/*
-			 * If any zone is currently balanced then kswapd will
-			 * not call compaction as it is expected that the
-			 * necessary pages are already available.
-			 */
-			if (pgdat_needs_compaction &&
-					zone_watermark_ok(zone, order,
-						low_wmark_pages(zone),
-						*classzone_idx, 0))
-				pgdat_needs_compaction = false;
-		}
-
 		/*
 		 * If we're getting trouble reclaiming, start doing writepage
 		 * even in laptop mode.
@@ -3280,8 +3248,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			 * that that high watermark would be met at 100%
 			 * efficiency.
 			 */
-			if (kswapd_shrink_zone(zone, end_zone,
-					       &sc, &nr_attempted))
+			if (kswapd_shrink_zone(zone, end_zone, &sc))
 				raise_priority = false;
 		}
 
@@ -3294,49 +3261,29 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				pfmemalloc_watermark_ok(pgdat))
 			wake_up_all(&pgdat->pfmemalloc_wait);
 
-		/*
-		 * Fragmentation may mean that the system cannot be rebalanced
-		 * for high-order allocations in all zones. If twice the
-		 * allocation size has been reclaimed and the zones are still
-		 * not balanced then recheck the watermarks at order-0 to
-		 * prevent kswapd reclaiming excessively. Assume that a
-		 * process requested a high-order can direct reclaim/compact.
-		 */
-		if (order && sc.nr_reclaimed >= 2UL << order)
-			order = sc.order = 0;
-
 		/* Check if kswapd should be suspending */
 		if (try_to_freeze() || kthread_should_stop())
 			break;
 
 		/*
-		 * Compact if necessary and kswapd is reclaiming at least the
-		 * high watermark number of pages as requsted
-		 */
-		if (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted)
-			compact_pgdat(pgdat, order);
-
-		/*
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
 		if (raise_priority || !sc.nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1 &&
-		 !pgdat_balanced(pgdat, order, *classzone_idx));
+			!pgdat_balanced(pgdat, order, classzone_idx));
 
 out:
 	/*
-	 * Return the order we were reclaiming at so prepare_kswapd_sleep()
-	 * makes a decision on the order we were last reclaiming at. However,
-	 * if another caller entered the allocator slow path while kswapd
-	 * was awake, order will remain at the higher level
+	 * Return the highest zone idx we were reclaiming at so
+	 * prepare_kswapd_sleep() makes the same decisions as here.
 	 */
-	*classzone_idx = end_zone;
-	return order;
+	return end_zone;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
+static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
+				int classzone_idx, int balanced_classzone_idx)
 {
 	long remaining = 0;
 	DEFINE_WAIT(wait);
@@ -3347,7 +3294,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
 	/* Try to sleep for a short interval */
-	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, order, remaining,
+						balanced_classzone_idx)) {
 		remaining = schedule_timeout(HZ/10);
 		finish_wait(&pgdat->kswapd_wait, &wait);
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
@@ -3357,7 +3305,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	 * After a short sleep, check if it was a premature sleep. If not, then
 	 * go fully to sleep until explicitly woken up.
 	 */
-	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, order, remaining,
+						balanced_classzone_idx)) {
 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
 		/*
@@ -3378,6 +3327,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		 */
 		reset_isolation_suitable(pgdat);
 
+		/*
+		 * We have freed the memory, now we should compact it to make
+		 * allocation of the requested order possible.
+		 */
+		wakeup_kcompactd(pgdat, order, classzone_idx);
+
 		if (!kthread_should_stop())
 			schedule();
 
@@ -3407,7 +3362,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 static int kswapd(void *p)
 {
 	unsigned long order, new_order;
-	unsigned balanced_order;
 	int classzone_idx, new_classzone_idx;
 	int balanced_classzone_idx;
 	pg_data_t *pgdat = (pg_data_t*)p;
@@ -3440,23 +3394,19 @@ static int kswapd(void *p)
 	set_freezable();
 
 	order = new_order = 0;
-	balanced_order = 0;
 	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
 	balanced_classzone_idx = classzone_idx;
 	for ( ; ; ) {
 		bool ret;
 
 		/*
-		 * If the last balance_pgdat was unsuccessful it's unlikely a
-		 * new request of a similar or harder type will succeed soon
-		 * so consider going to sleep on the basis we reclaimed at
+		 * While we were reclaiming, there might have been another
+		 * wakeup, so check the values.
 		 */
-		if (balanced_order == new_order) {
-			new_order = pgdat->kswapd_max_order;
-			new_classzone_idx = pgdat->classzone_idx;
-			pgdat->kswapd_max_order =  0;
-			pgdat->classzone_idx = pgdat->nr_zones - 1;
-		}
+		new_order = pgdat->kswapd_max_order;
+		new_classzone_idx = pgdat->classzone_idx;
+		pgdat->kswapd_max_order =  0;
+		pgdat->classzone_idx = pgdat->nr_zones - 1;
 
 		if (order < new_order || classzone_idx > new_classzone_idx) {
 			/*
@@ -3466,7 +3416,7 @@ static int kswapd(void *p)
 			order = new_order;
 			classzone_idx = new_classzone_idx;
 		} else {
-			kswapd_try_to_sleep(pgdat, balanced_order,
+			kswapd_try_to_sleep(pgdat, order, classzone_idx,
 						balanced_classzone_idx);
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
@@ -3486,9 +3436,8 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			balanced_classzone_idx = classzone_idx;
-			balanced_order = balance_pgdat(pgdat, order,
-						&balanced_classzone_idx);
+			balanced_classzone_idx = balance_pgdat(pgdat, order,
+								classzone_idx);
 		}
 	}
 
@@ -3518,7 +3467,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	}
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, 0, 0))
+	if (zone_balanced(zone, order, true, 0, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-- 
2.7.2
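
Editorial sketch of the request handling that the patched kswapd() loop performs on every pass: it fetches and resets the pending wakeup values, then switches to the pending request if that request is harder (higher order or lower classzone index). This is a user-space toy model, not kernel code. `struct node_model`, `struct request`, `fetch_pending()`, `is_harder()` and `kswapd_next_request()` are hypothetical names introduced only for illustration; the real fields live in `pg_data_t`.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-node fields touched by the patch. */
struct node_model {
	int nr_zones;
	int kswapd_max_order;	/* highest order requested while kswapd was busy */
	int classzone_idx;	/* lowest classzone index requested meanwhile */
};

/* Hypothetical pairing of the two values kswapd tracks per request. */
struct request {
	int order;
	int classzone_idx;
};

/* Fetch and reset the pending wakeup, as the patched loop does every pass. */
static struct request fetch_pending(struct node_model *node)
{
	struct request req = { node->kswapd_max_order, node->classzone_idx };

	node->kswapd_max_order = 0;
	node->classzone_idx = node->nr_zones - 1;
	return req;
}

/* A request is harder if it asks for a higher order or a lower zone index. */
static bool is_harder(struct request cur, struct request pending)
{
	return cur.order < pending.order ||
	       cur.classzone_idx > pending.classzone_idx;
}

/*
 * Returns true if kswapd should keep reclaiming for a harder request,
 * false if it may try to sleep (the point where the patch wakes kcompactd).
 */
static bool kswapd_next_request(struct node_model *node, struct request *cur)
{
	struct request pending = fetch_pending(node);

	if (is_harder(*cur, pending)) {
		*cur = pending;
		return true;
	}
	return false;
}
```

In this model, a false return corresponds to the branch that calls kswapd_try_to_sleep(), which after this patch is where wakeup_kcompactd() hands the high-order work over.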


* Re: [PATCH v2 0/5] introduce kcompactd and stop compacting in kswapd
  2016-02-08 13:38 [PATCH v2 0/5] introduce kcompactd and stop compacting in kswapd Vlastimil Babka
                   ` (4 preceding siblings ...)
  2016-02-08 13:38 ` [PATCH v2 5/5] mm, compaction: adapt isolation_suitable flushing to kcompactd Vlastimil Babka
@ 2016-03-09 15:52 ` Michal Hocko
  2016-03-10  8:38   ` Vlastimil Babka
  5 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2016-03-09 15:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Joonsoo Kim, Mel Gorman,
	David Rientjes, Johannes Weiner

On Mon 08-02-16 14:38:06, Vlastimil Babka wrote:
> The previous RFC is here [1]. It didn't have a cover letter, so the description
> and results are in the individual patches.

FWIW I think this is a step in the right direction. I would give my
Acked-by to all patches but I wasn't able to find time for a deep review
and my lack of knowledge of compaction details doesn't help much. I do
agree that conflating kswapd with compaction didn't really work out well
and fixing this would just make the code more complex and would be more
prone to new bugs. In future we might want to invent something similar
to watermarks and set an expected level of high order pages prepared for
the allocation (e.g. have at least X MB of memory in order-9+). kcompactd
then could try as hard as possible to provide them. Does that sound at
least doable?

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 0/5] introduce kcompactd and stop compacting in kswapd
  2016-03-09 15:52 ` [PATCH v2 0/5] introduce kcompactd and stop compacting in kswapd Michal Hocko
@ 2016-03-10  8:38   ` Vlastimil Babka
  0 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2016-03-10  8:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, linux-kernel, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Joonsoo Kim, Mel Gorman,
	David Rientjes, Johannes Weiner

On 03/09/2016 04:52 PM, Michal Hocko wrote:
> On Mon 08-02-16 14:38:06, Vlastimil Babka wrote:
>> The previous RFC is here [1]. It didn't have a cover letter, so the description
>> and results are in the individual patches.
>
> FWIW I think this is a step in the right direction. I would give my

Thanks!

> Acked-by to all patches but I wasn't able to find time for a deep review
> and my lack of knowledge of compaction details doesn't help much. I do
> agree that conflating kswapd with compaction didn't really work out well
> and fixing this would just make the code more complex and would be more
> prone to new bugs.

Yeah, it seems that direct reclaim/compaction is complex enough already...

> In future we might want to invent something similar
> to watermarks and set an expected level of high order pages prepared for
> the allocation (e.g. have at least X MB of memory in order-9+). kcompactd
> then could try as hard as possible to provide them. Does that sound at
> least doable?

Sure, that was/is part of the plan. But I was trimming the series for 
initial merge over the past year to arrive at a starting point where 
reaching consensus is easier.

> Thanks!
>
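
Editorial sketch of the "watermark for high-order pages" idea discussed above: given per-order counts of free blocks, sum the memory held in blocks of at least a given order and compare it with a target. This is only an illustration under stated assumptions, not anything from the series. `MODEL_MAX_ORDER`, `MODEL_PAGE_SHIFT`, `free_bytes_at_or_above()` and `high_order_target_met()` are hypothetical names, and 4 KiB pages with orders 0 to 10 are assumed.

```c
#include <stdbool.h>

#define MODEL_MAX_ORDER		11	/* assumed: orders 0..10 */
#define MODEL_PAGE_SHIFT	12	/* assumed: 4 KiB base pages */

/* Bytes of free memory held in free blocks of order >= min_order. */
static unsigned long long
free_bytes_at_or_above(const unsigned long nr_free[MODEL_MAX_ORDER],
		       unsigned int min_order)
{
	unsigned long long bytes = 0;
	unsigned int order;

	for (order = min_order; order < MODEL_MAX_ORDER; order++) {
		/* Each free block of this order spans 2^order base pages. */
		unsigned long long pages =
			(unsigned long long)nr_free[order] << order;

		bytes += pages << MODEL_PAGE_SHIFT;
	}
	return bytes;
}

/* True if the hypothetical high-order target is already satisfied. */
static bool high_order_target_met(const unsigned long nr_free[MODEL_MAX_ORDER],
				  unsigned int min_order,
				  unsigned long long target_bytes)
{
	return free_bytes_at_or_above(nr_free, min_order) >= target_bytes;
}
```

A background compactor could, in principle, keep working while such a check returns false; how aggressively it should do so is exactly the open question in the exchange above.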


Thread overview: 28+ messages
2016-02-08 13:38 [PATCH v2 0/5] introduce kcompactd and stop compacting in kswapd Vlastimil Babka
2016-02-08 13:38 ` [PATCH v2 1/5] mm, kswapd: remove bogus check of balance_classzone_idx Vlastimil Babka
2016-02-08 13:38 ` [PATCH v2 2/5] mm, compaction: introduce kcompactd Vlastimil Babka
2016-03-02  6:09   ` Joonsoo Kim
2016-03-02 12:25     ` Vlastimil Babka
2016-02-08 13:38 ` [PATCH v2 3/5] mm, memory hotplug: small cleanup in online_pages() Vlastimil Babka
2016-02-08 13:38 ` [PATCH v2 4/5] mm, kswapd: replace kswapd compaction with waking up kcompactd Vlastimil Babka
2016-02-08 22:58   ` Andrew Morton
2016-02-09 10:53     ` Vlastimil Babka
2016-02-09 10:21   ` Vlastimil Babka
2016-03-01 14:14   ` Vlastimil Babka
2016-03-02  6:33   ` Joonsoo Kim
2016-03-02 10:04     ` Vlastimil Babka
2016-03-02 13:57       ` Joonsoo Kim
2016-03-02 14:09         ` Vlastimil Babka
2016-03-02 14:22           ` Joonsoo Kim
2016-03-02 14:40             ` Vlastimil Babka
2016-03-02 14:59               ` Joonsoo Kim
2016-03-02 15:22                 ` Vlastimil Babka
2016-03-04 23:25                   ` Andrew Morton
2016-03-07  9:45                     ` Vlastimil Babka
2016-03-09 13:47                     ` Vlastimil Babka
2016-03-09 13:50                     ` Vlastimil Babka
2016-03-02 12:27   ` Vlastimil Babka
2016-02-08 13:38 ` [PATCH v2 5/5] mm, compaction: adapt isolation_suitable flushing to kcompactd Vlastimil Babka
2016-03-01 14:44   ` Vlastimil Babka
2016-03-09 15:52 ` [PATCH v2 0/5] introduce kcompactd and stop compacting in kswapd Michal Hocko
2016-03-10  8:38   ` Vlastimil Babka
