Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH] mm: Proactive compaction
@ 2019-11-15 22:21 Nitin Gupta
  2019-11-29 13:55 ` Vlastimil Babka
  0 siblings, 1 reply; 3+ messages in thread
From: Nitin Gupta @ 2019-11-15 22:21 UTC (permalink / raw)
  To: Mel Gorman, Michal Hocko
  Cc: Nitin Gupta, Andrew Morton, Vlastimil Babka, Yu Zhao,
	Mike Kravetz, Matthew Wilcox, linux-kernel, linux-mm

For some applications we need to allocate almost all memory as
hugepages. However, on a running system, higher order allocations can
fail if the memory is fragmented. Linux kernel currently does on-demand
compaction as we request more hugepages but this style of compaction
incurs very high latency. Experiments with one-time full memory
compaction (followed by hugepage allocations) shows that kernel is able
to restore a highly fragmented memory state to a fairly compacted memory
state within <1 sec for a 32G system. Such data suggests that a more
proactive compaction can help us allocate a large fraction of memory as
hugepages keeping allocation latencies low.

For a more proactive compaction, the approach taken here is to define
per page-node tunable called ‘hpage_compaction_effort’ which dictates
bounds for external fragmentation for HPAGE_PMD_ORDER pages which
kcompactd should try to maintain.

The tunable is exposed through sysfs:
  /sys/kernel/mm/compaction/node-n/hpage_compaction_effort

The value of this tunable is used to determine low and high thresholds
for external fragmentation wrt HPAGE_PMD_ORDER order.

Note that previous version of this patch [1] was found to introduce too
many tunables (per-order, extfrag_{low, high}) but this one reduces them
to just (per-node, hpage_compaction_effort). Also, the new tunable is an
opaque value instead of asking for specific bounds of “external
fragmentation” which would have been difficult to estimate. The internal
interpretation of this opaque value allows for future fine-tuning.

Currently, we use a simple translation from this tunable to [low, high]
extfrag thresholds (low=100-hpage_compaction_effort, high=low+10%). To
periodically check per-node extfrag status, we reuse per-node kcompactd
threads which are woken up every few milliseconds to check the same. If
any zone on its corresponding node has extfrag above the high threshold
for the HPAGE_PMD_ORDER order, the thread starts compaction in
background till all zones are below the low extfrag level for this
order. By default. By default, the tunable is set to 0 (=> low=100%,
high=100%).

This patch is largely based on ideas from Michal Hocko posted here:
https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/

* Performance data

System: x64_64, 32G RAM, 12-cores.

I made a small driver that allocates as many hugepages as possible and
measures allocation latency:

The driver first tries to allocate hugepage using GFP_TRANSHUGE_LIGHT
and if that fails, tries to allocate with `GFP_TRANSHUGE |
__GFP_RETRY_MAYFAIL`. The drives stops when both methods fail for a
hugepage allocation.

Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap. The workload is mainly anonymous
userspace pages which are easy to move around. I intentionally avoided
unmovable pages in this test to see how much latency we incur just by
hitting the slow path for most allocations.

(all latency values are in microseconds)

- With vanilla kernel 5.4.0-rc5:

percentile latency
---------- -------
         5       7
        10       7
        25       8
        30       8
        40       8
        50       8
        60       9
        75     215
        80     222
        90     323
        95     429

Total 2M hugepages allocated = 1829 (3.5G worth of hugepages out of 25G
total free => 14% of free memory could be allocated as hugepages)

- Now with kernel 5.4.0-rc5 + this patch:
(hpage_compaction_effort = 60)

percentile latency
---------- -------
         5       3
        10       3
        25       4
        30       4
        40       4
        50       4
        60       5
        75       6
        80       9
        90     370
        95     652

Total 2M hugepages allocated = 11120 (21.7G worth of hugepages out of
25G total free => 86% of free memory could be allocated as hugepages)

Above workload produces a memory state which is easy to compact.
However, if memory is filled with unmovable pages, pro-active compaction
should essentially back off. To test this aspect, I ran a mix of this
workload (thanks to Matthew Wilcox for suggesting these):

- dentry_thrash: it opens /tmp/missing.x for x in [1, 1000000] where
first 10000 files actually exist.
- pagecache_thrash: it opens a 128G file (on a 32G system) and then
reads at random offsets.

With this mix of workload, system quickly reaches 90-100% fragmentation
wrt order-9. Trace of compaction events shows that we keep hitting
compaction_deferred event, as expected.

After terminating dentry_thrash and dropping denty caches, the system
could proceed with compaction according to set value of
hpage_compaction_effort (60).

[1] https://patchwork.kernel.org/patch/11098289/

Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
---
 include/linux/compaction.h |  10 ++
 mm/compaction.c            | 193 ++++++++++++++++++++++++++++++-------
 mm/vmstat.c                |  12 +++
 3 files changed, 178 insertions(+), 37 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4b898cdbdf05..7bf78b74b46a 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -60,6 +60,15 @@ enum compact_result {
 
 struct alloc_context; /* in mm/internal.h */
 
+// "node-%d"
+#define COMPACTION_STATE_NAME_LEN 16
+// Per-node compaction state
+struct compaction_state {
+	int node_id;
+	unsigned int hpage_compaction_effort;
+	char name[COMPACTION_STATE_NAME_LEN];
+};
+
 /*
  * Number of free order-0 pages that should be available above given watermark
  * to make sure compaction has reasonable chance of not running out of free
@@ -90,6 +99,7 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 extern int sysctl_extfrag_threshold;
 extern int sysctl_compact_unevictable_allowed;
 
+extern int extfrag_for_order(struct zone *zone, unsigned int order);
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 		unsigned int order, unsigned int alloc_flags,
diff --git a/mm/compaction.c b/mm/compaction.c
index 672d3c78c6ab..2b933e9baa3e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -25,6 +25,10 @@
 #include <linux/psi.h>
 #include "internal.h"
 
+#ifdef CONFIG_COMPACTION
+static struct compaction_state compaction_states[MAX_NUMNODES];
+#endif
+
 #ifdef CONFIG_COMPACTION
 static inline void count_compact_event(enum vm_event_item item)
 {
@@ -1846,6 +1850,31 @@ static inline bool is_via_compact_memory(int order)
 	return order == -1;
 }
 
+static unsigned int extfrag_hpage_wmark(struct zone *zone, bool low)
+{
+	int node_id, wmark, wmark_low;
+
+	node_id = zone->zone_pgdat->node_id;
+	wmark_low = 100 - compaction_states[node_id].hpage_compaction_effort;
+	wmark = low ? wmark_low : min(wmark_low + 10, 100);
+
+	return extfrag_for_order(zone, HUGETLB_PAGE_ORDER) > wmark;
+}
+
+static bool node_hpage_should_compact(pg_data_t *pgdat)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		if (extfrag_hpage_wmark(zone, false) &&
+			compaction_suitable(zone, HUGETLB_PAGE_ORDER,
+				0, zone_idx(zone)) == COMPACT_CONTINUE) {
+			return true;
+		}
+	}
+	return false;
+}
+
 static enum compact_result __compact_finished(struct compact_control *cc)
 {
 	unsigned int order;
@@ -1875,6 +1904,9 @@ static enum compact_result __compact_finished(struct compact_control *cc)
 	if (is_via_compact_memory(cc->order))
 		return COMPACT_CONTINUE;
 
+	if (extfrag_hpage_wmark(cc->zone, true))
+		return COMPACT_CONTINUE;
+
 	/*
 	 * Always finish scanning a pageblock to reduce the possibility of
 	 * fallbacks in the future. This is particularly important when
@@ -1965,15 +1997,6 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
 	if (is_via_compact_memory(order))
 		return COMPACT_CONTINUE;
 
-	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
-	/*
-	 * If watermarks for high-order allocation are already met, there
-	 * should be no need for compaction at all.
-	 */
-	if (zone_watermark_ok(zone, order, watermark, classzone_idx,
-								alloc_flags))
-		return COMPACT_SUCCESS;
-
 	/*
 	 * Watermarks for order-0 must be met for compaction to be able to
 	 * isolate free pages for migration targets. This means that the
@@ -2003,31 +2026,9 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
 					int classzone_idx)
 {
 	enum compact_result ret;
-	int fragindex;
 
 	ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
 				    zone_page_state(zone, NR_FREE_PAGES));
-	/*
-	 * fragmentation index determines if allocation failures are due to
-	 * low memory or external fragmentation
-	 *
-	 * index of -1000 would imply allocations might succeed depending on
-	 * watermarks, but we already failed the high-order watermark check
-	 * index towards 0 implies failure is due to lack of memory
-	 * index towards 1000 implies failure is due to fragmentation
-	 *
-	 * Only compact if a failure would be due to fragmentation. Also
-	 * ignore fragindex for non-costly orders where the alternative to
-	 * a successful reclaim/compaction is OOM. Fragindex and the
-	 * vm.extfrag_threshold sysctl is meant as a heuristic to prevent
-	 * excessive compaction for costly orders, but it should not be at the
-	 * expense of system stability.
-	 */
-	if (ret == COMPACT_CONTINUE && (order > PAGE_ALLOC_COSTLY_ORDER)) {
-		fragindex = fragmentation_index(zone, order);
-		if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
-			ret = COMPACT_NOT_SUITABLE_ZONE;
-	}
 
 	trace_mm_compaction_suitable(zone, order, ret);
 	if (ret == COMPACT_NOT_SUITABLE_ZONE)
@@ -2419,7 +2420,6 @@ static void compact_node(int nid)
 		.gfp_mask = GFP_KERNEL,
 	};
 
-
 	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
 
 		zone = &pgdat->node_zones[zoneid];
@@ -2492,8 +2492,122 @@ void compaction_unregister_node(struct node *node)
 }
 #endif /* CONFIG_SYSFS && CONFIG_NUMA */
 
+#ifdef CONFIG_SYSFS
+
+#define COMPACTION_ATTR_RO(_name) \
+	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
+
+#define COMPACTION_ATTR(_name) \
+	static struct kobj_attribute _name##_attr = \
+		__ATTR(_name, 0644, _name##_show, _name##_store)
+
+static struct kobject *compaction_kobj;
+static struct kobject *compaction_kobjs[MAX_NUMNODES];
+
+static struct compaction_state *kobj_to_compaction_state(struct kobject *kobj)
+{
+	int node;
+
+	for_each_online_node(node) {
+		if (compaction_kobjs[node] == kobj)
+			return &compaction_states[node];
+	}
+
+	return NULL;
+}
+
+static ssize_t hpage_compaction_effort_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	int err;
+	unsigned long input;
+	struct compaction_state *c = kobj_to_compaction_state(kobj);
+
+	err = kstrtoul(buf, 10, &input);
+	if (err)
+		return err;
+	if (input > 100)
+		return -EINVAL;
+
+	c->hpage_compaction_effort = input;
+	return count;
+}
+
+static ssize_t hpage_compaction_effort_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct compaction_state *c = kobj_to_compaction_state(kobj);
+
+	return sprintf(buf, "%u\n", c->hpage_compaction_effort);
+}
+
+COMPACTION_ATTR(hpage_compaction_effort);
+
+static struct attribute *compaction_attrs[] = {
+	&hpage_compaction_effort_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group compaction_attr_group = {
+	.attrs = compaction_attrs,
+};
+
+static int compaction_sysfs_add_node(struct compaction_state *c,
+	struct kobject *parent, struct kobject **compaction_kobjs,
+	const struct attribute_group *compaction_attr_group)
+{
+	int retval;
+
+	compaction_kobjs[c->node_id] =
+			kobject_create_and_add(c->name, parent);
+	if (!compaction_kobjs[c->node_id])
+		return -ENOMEM;
+
+	retval = sysfs_create_group(compaction_kobjs[c->node_id],
+				compaction_attr_group);
+	if (retval)
+		kobject_put(compaction_kobjs[c->node_id]);
+
+	return retval;
+}
+
+static void __init compaction_sysfs_init(void)
+{
+	struct compaction_state *c;
+	int err, node;
+
+	compaction_kobj = kobject_create_and_add("compaction", mm_kobj);
+	if (!compaction_kobj)
+		return;
+
+	for_each_online_node(node) {
+		c = &compaction_states[node];
+		err = compaction_sysfs_add_node(c, compaction_kobj,
+					compaction_kobjs,
+					&compaction_attr_group);
+		if (err)
+			pr_err("compaction: Unable to add state %s", c->name);
+	}
+}
+
+static void __init compaction_init(void)
+{
+	int node;
+
+	for_each_online_node(node) {
+		struct compaction_state *c = &compaction_states[node];
+
+		c->node_id = node;
+		c->hpage_compaction_effort = 0;
+		snprintf(c->name, COMPACTION_STATE_NAME_LEN, "node-%d", node);
+	}
+}
+#endif
+
 static inline bool kcompactd_work_requested(pg_data_t *pgdat)
 {
+	if (node_hpage_should_compact(pgdat) && !pgdat->kcompactd_max_order)
+		pgdat->kcompactd_max_order = 1;
 	return pgdat->kcompactd_max_order > 0 || kthread_should_stop();
 }
 
@@ -2529,7 +2643,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 		.order = pgdat->kcompactd_max_order,
 		.search_order = pgdat->kcompactd_max_order,
 		.classzone_idx = pgdat->kcompactd_classzone_idx,
-		.mode = MIGRATE_SYNC_LIGHT,
+		.mode = MIGRATE_SYNC,
 		.ignore_skip_hint = false,
 		.gfp_mask = GFP_KERNEL,
 	};
@@ -2556,7 +2670,6 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 
 		cc.zone = zone;
 		status = compact_zone(&cc, NULL);
-
 		if (status == COMPACT_SUCCESS) {
 			compaction_defer_reset(zone, cc.order, false);
 		} else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
@@ -2641,11 +2754,14 @@ static int kcompactd(void *p)
 	pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
 
 	while (!kthread_should_stop()) {
-		unsigned long pflags;
+		unsigned long ret, pflags;
 
 		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
-		wait_event_freezable(pgdat->kcompactd_wait,
-				kcompactd_work_requested(pgdat));
+		ret = wait_event_freezable_timeout(pgdat->kcompactd_wait,
+				kcompactd_work_requested(pgdat),
+				msecs_to_jiffies(500));
+		if (!ret)
+			continue;
 
 		psi_memstall_enter(&pflags);
 		kcompactd_do_work(pgdat);
@@ -2726,6 +2842,9 @@ static int __init kcompactd_init(void)
 		return ret;
 	}
 
+	compaction_init();
+	compaction_sysfs_init();
+
 	for_each_node_state(nid, N_MEMORY)
 		kcompactd_run(nid);
 	return 0;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a8222041bd44..0230a45c548a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1074,6 +1074,18 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *in
 	return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
 }
 
+int extfrag_for_order(struct zone *zone, unsigned int order)
+{
+	struct contig_page_info info;
+
+	fill_contig_page_info(zone, order, &info);
+	if (info.free_pages == 0)
+		return 0;
+
+	return (info.free_pages - (info.free_blocks_suitable << order)) * 100
+							/ info.free_pages;
+}
+
 /* Same as __fragmentation index but allocs contig_page_info on stack */
 int fragmentation_index(struct zone *zone, unsigned int order)
 {
-- 
2.17.1



^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] mm: Proactive compaction
  2019-11-15 22:21 [PATCH] mm: Proactive compaction Nitin Gupta
@ 2019-11-29 13:55 ` Vlastimil Babka
  2019-12-03 20:37   ` Nitin Gupta
  0 siblings, 1 reply; 3+ messages in thread
From: Vlastimil Babka @ 2019-11-29 13:55 UTC (permalink / raw)
  To: Nitin Gupta, Mel Gorman, Michal Hocko
  Cc: Andrew Morton, Yu Zhao, Mike Kravetz, Matthew Wilcox,
	linux-kernel, linux-mm

On 11/15/19 11:21 PM, Nitin Gupta wrote:
> For some applications we need to allocate almost all memory as
> hugepages. However, on a running system, higher order allocations can
> fail if the memory is fragmented. Linux kernel currently does on-demand
> compaction as we request more hugepages but this style of compaction
> incurs very high latency. Experiments with one-time full memory
> compaction (followed by hugepage allocations) shows that kernel is able
> to restore a highly fragmented memory state to a fairly compacted memory
> state within <1 sec for a 32G system. Such data suggests that a more
> proactive compaction can help us allocate a large fraction of memory as
> hugepages keeping allocation latencies low.
> 
> For a more proactive compaction, the approach taken here is to define
> per page-node tunable called ‘hpage_compaction_effort’ which dictates
> bounds for external fragmentation for HPAGE_PMD_ORDER pages which
> kcompactd should try to maintain.
> 
> The tunable is exposed through sysfs:
>   /sys/kernel/mm/compaction/node-n/hpage_compaction_effort
> 
> The value of this tunable is used to determine low and high thresholds
> for external fragmentation wrt HPAGE_PMD_ORDER order.

Could we instead start with a non-tunable value that would be linked to
 to e.g. the number of THP allocations between kcompactd cycles?
Anything we expose will inevitably get set to stone, I'm afraid, so I
would introduce it only as a last resort.

> Note that previous version of this patch [1] was found to introduce too
> many tunables (per-order, extfrag_{low, high}) but this one reduces them
> to just (per-node, hpage_compaction_effort). Also, the new tunable is an
> opaque value instead of asking for specific bounds of “external
> fragmentation” which would have been difficult to estimate. The internal
> interpretation of this opaque value allows for future fine-tuning.
> 
> Currently, we use a simple translation from this tunable to [low, high]
> extfrag thresholds (low=100-hpage_compaction_effort, high=low+10%). To
> periodically check per-node extfrag status, we reuse per-node kcompactd
> threads which are woken up every few milliseconds to check the same. If
> any zone on its corresponding node has extfrag above the high threshold
> for the HPAGE_PMD_ORDER order, the thread starts compaction in
> background till all zones are below the low extfrag level for this
> order. By default. By default, the tunable is set to 0 (=> low=100%,
> high=100%).
> 
> This patch is largely based on ideas from Michal Hocko posted here:
> https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> 
> * Performance data
> 
> System: x64_64, 32G RAM, 12-cores.
> 
> I made a small driver that allocates as many hugepages as possible and
> measures allocation latency:
> 
> The driver first tries to allocate hugepage using GFP_TRANSHUGE_LIGHT
> and if that fails, tries to allocate with `GFP_TRANSHUGE |
> __GFP_RETRY_MAYFAIL`. The drives stops when both methods fail for a
> hugepage allocation.
> 
> Before starting the driver, the system was fragmented from a userspace
> program that allocates all memory and then for each 2M aligned section,
> frees 3/4 of base pages using munmap. The workload is mainly anonymous
> userspace pages which are easy to move around. I intentionally avoided
> unmovable pages in this test to see how much latency we incur just by
> hitting the slow path for most allocations.
> 
> (all latency values are in microseconds)
> 
> - With vanilla kernel 5.4.0-rc5:
> 
> percentile latency
> ---------- -------
>          5       7
>         10       7
>         25       8
>         30       8
>         40       8
>         50       8
>         60       9
>         75     215
>         80     222
>         90     323
>         95     429
> 
> Total 2M hugepages allocated = 1829 (3.5G worth of hugepages out of 25G
> total free => 14% of free memory could be allocated as hugepages)
> 
> - Now with kernel 5.4.0-rc5 + this patch:
> (hpage_compaction_effort = 60)
> 
> percentile latency
> ---------- -------
>          5       3
>         10       3
>         25       4
>         30       4
>         40       4
>         50       4
>         60       5
>         75       6
>         80       9
>         90     370
>         95     652
> 
> Total 2M hugepages allocated = 11120 (21.7G worth of hugepages out of
> 25G total free => 86% of free memory could be allocated as hugepages)

I wonder about the 14->86% improvement. As you say, this kind of
fragmentation is easy to compact. Why wouldn't GFP_TRANSHUGE |
__GFP_RETRY_MAYFAIL attempts succeed?

Thanks,
Vlastimil

> Above workload produces a memory state which is easy to compact.
> However, if memory is filled with unmovable pages, pro-active compaction
> should essentially back off. To test this aspect, I ran a mix of this
> workload (thanks to Matthew Wilcox for suggesting these):
> 
> - dentry_thrash: it opens /tmp/missing.x for x in [1, 1000000] where
> first 10000 files actually exist.
> - pagecache_thrash: it opens a 128G file (on a 32G system) and then
> reads at random offsets.
> 
> With this mix of workload, system quickly reaches 90-100% fragmentation
> wrt order-9. Trace of compaction events shows that we keep hitting
> compaction_deferred event, as expected.
> 
> After terminating dentry_thrash and dropping denty caches, the system
> could proceed with compaction according to set value of
> hpage_compaction_effort (60).
> 
> [1] https://patchwork.kernel.org/patch/11098289/
> 
> Signed-off-by: Nitin Gupta <nigupta@nvidia.com>


^ permalink raw reply	[flat|nested] 3+ messages in thread

* RE: [PATCH] mm: Proactive compaction
  2019-11-29 13:55 ` Vlastimil Babka
@ 2019-12-03 20:37   ` Nitin Gupta
  0 siblings, 0 replies; 3+ messages in thread
From: Nitin Gupta @ 2019-12-03 20:37 UTC (permalink / raw)
  To: Vlastimil Babka, Mel Gorman, Michal Hocko
  Cc: Andrew Morton, Yu Zhao, Mike Kravetz, Matthew Wilcox,
	linux-kernel, linux-mm, khalid.aziz

> -----Original Message-----
> From: owner-linux-mm@kvack.org <owner-linux-mm@kvack.org> On Behalf
> Of Vlastimil Babka
> Sent: Friday, November 29, 2019 5:55 AM
> To: Nitin Gupta <nigupta@nvidia.com>; Mel Gorman
> <mgorman@techsingularity.net>; Michal Hocko <mhocko@suse.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>; Yu Zhao
> <yuzhao@google.com>; Mike Kravetz <mike.kravetz@oracle.com>; Matthew
> Wilcox <willy@infradead.org>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org
> Subject: Re: [PATCH] mm: Proactive compaction
> 
> On 11/15/19 11:21 PM, Nitin Gupta wrote:
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) shows that kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages keeping allocation latencies low.
> >
> > For a more proactive compaction, the approach taken here is to define
> > per page-node tunable called ‘hpage_compaction_effort’ which dictates
> > bounds for external fragmentation for HPAGE_PMD_ORDER pages which
> > kcompactd should try to maintain.
> >
> > The tunable is exposed through sysfs:
> >   /sys/kernel/mm/compaction/node-n/hpage_compaction_effort
> >
> > The value of this tunable is used to determine low and high thresholds
> > for external fragmentation wrt HPAGE_PMD_ORDER order.
> 
> Could we instead start with a non-tunable value that would be linked to  to
> e.g. the number of THP allocations between kcompactd cycles?
> Anything we expose will inevitably get set to stone, I'm afraid, so I would
> introduce it only as a last resort.
>

There have been attempts in the past to do proactive compaction without
adding any new tunables. For instance, see this patch from Khalid (CC'ed):

https://lkml.org/lkml/2019/8/12/1302

This patch collects allocation and fragmentation data points to predict
memory exhaustion or fragmentation and triggers reclaim or compaction
proactively. The main concern in the discussion for this patch was that such
data collection and predictive analysis can be done from userspace together
with wmark adjustment (already exposed), so no kernel changes are
potentially necessary.

However, in the same discussion, Michal pointed out a lack of similar
interface (like wmarks for kswapd) missing to guide kcompactd:

https://lkml.org/lkml/2019/8/13/730

My patch is addressing the need for that missing userspace tunable. A userspace
daemon  can measure fragmentation trends and adjust this tunable to let
kernel know amount of effort to put into compaction to avoid hitting direct
compact path.

 
> > Note that previous version of this patch [1] was found to introduce
> > too many tunables (per-order, extfrag_{low, high}) but this one
> > reduces them to just (per-node, hpage_compaction_effort). Also, the
> > new tunable is an opaque value instead of asking for specific bounds
> > of “external fragmentation” which would have been difficult to
> > estimate. The internal interpretation of this opaque value allows for future
> fine-tuning.
> >
> > Currently, we use a simple translation from this tunable to [low,
> > high] extfrag thresholds (low=100-hpage_compaction_effort,
> > high=low+10%). To periodically check per-node extfrag status, we reuse
> > per-node kcompactd threads which are woken up every few milliseconds
> > to check the same. If any zone on its corresponding node has extfrag
> > above the high threshold for the HPAGE_PMD_ORDER order, the thread
> > starts compaction in background till all zones are below the low
> > extfrag level for this order. By default. By default, the tunable is
> > set to 0 (=> low=100%, high=100%).
> >
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-
> mm/20161230131412.GI13301@dhcp22.suse.cz
> > /
> >
> > * Performance data
> >
> > System: x64_64, 32G RAM, 12-cores.
> >
> > I made a small driver that allocates as many hugepages as possible and
> > measures allocation latency:
> >
> > The driver first tries to allocate hugepage using GFP_TRANSHUGE_LIGHT
> > and if that fails, tries to allocate with `GFP_TRANSHUGE |
> > __GFP_RETRY_MAYFAIL`. The drives stops when both methods fail for a
> > hugepage allocation.
> >
> > Before starting the driver, the system was fragmented from a userspace
> > program that allocates all memory and then for each 2M aligned
> > section, frees 3/4 of base pages using munmap. The workload is mainly
> > anonymous userspace pages which are easy to move around. I
> > intentionally avoided unmovable pages in this test to see how much
> > latency we incur just by hitting the slow path for most allocations.
> >
> > (all latency values are in microseconds)
> >
> > - With vanilla kernel 5.4.0-rc5:
> >
> > percentile latency
> > ---------- -------
> >          5       7
> >         10       7
> >         25       8
> >         30       8
> >         40       8
> >         50       8
> >         60       9
> >         75     215
> >         80     222
> >         90     323
> >         95     429
> >
> > Total 2M hugepages allocated = 1829 (3.5G worth of hugepages out of
> > 25G total free => 14% of free memory could be allocated as hugepages)
> >
> > - Now with kernel 5.4.0-rc5 + this patch:
> > (hpage_compaction_effort = 60)
> >
> > percentile latency
> > ---------- -------
> >          5       3
> >         10       3
> >         25       4
> >         30       4
> >         40       4
> >         50       4
> >         60       5
> >         75       6
> >         80       9
> >         90     370
> >         95     652
> >
> > Total 2M hugepages allocated = 11120 (21.7G worth of hugepages out of
> > 25G total free => 86% of free memory could be allocated as hugepages)
> 
> I wonder about the 14->86% improvement. As you say, this kind of
> fragmentation is easy to compact. Why wouldn't GFP_TRANSHUGE |
> __GFP_RETRY_MAYFAIL attempts succeed?
> 

I'm not too sure at this point. With kernel 5.3.0 I could allocate 80-90% of
memory as hugepages under similar conditions though with very high
latencies (as I reported here: https://patchwork.kernel.org/patch/11098289/)

With kernel 5.4.0-x I observed this significant drop. Perhaps
GFP_TRANSHUGE flag was changed between these versions. I will post
future numbers with base flags to avoid such surprises in future.

Thanks,
Nitin




> > Above workload produces a memory state which is easy to compact.
> > However, if memory is filled with unmovable pages, pro-active
> > compaction should essentially back off. To test this aspect, I ran a
> > mix of this workload (thanks to Matthew Wilcox for suggesting these):
> >
> > - dentry_thrash: it opens /tmp/missing.x for x in [1, 1000000] where
> > first 10000 files actually exist.
> > - pagecache_thrash: it opens a 128G file (on a 32G system) and then
> > reads at random offsets.
> >
> > With this mix of workload, system quickly reaches 90-100%
> > fragmentation wrt order-9. Trace of compaction events shows that we
> > keep hitting compaction_deferred event, as expected.
> >
> > After terminating dentry_thrash and dropping denty caches, the system
> > could proceed with compaction according to set value of
> > hpage_compaction_effort (60).
> >
> > [1] https://patchwork.kernel.org/patch/11098289/
> >
> > Signed-off-by: Nitin Gupta <nigupta@nvidia.com>


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, back to index

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-15 22:21 [PATCH] mm: Proactive compaction Nitin Gupta
2019-11-29 13:55 ` Vlastimil Babka
2019-12-03 20:37   ` Nitin Gupta

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git