* [rfc] mm, thp: allow khugepaged to periodically compact memory synchronously
@ 2015-07-15  2:19 ` David Rientjes
  0 siblings, 0 replies; 6+ messages in thread
From: David Rientjes @ 2015-07-15  2:19 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Andrea Arcangeli, Rik van Riel, Kirill A. Shutemov, Mel Gorman,
	linux-kernel, linux-mm

We have seen a large benefit in the amount of hugepages that can be
allocated at fault and by khugepaged when memory is periodically
compacted in the background.

We trigger synchronous memory compaction over all memory every 15 minutes
to keep fragmentation low and to offset the lightweight compaction that
is done at page fault to keep latency low.
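
For reference, the manual trigger we use for this today is roughly the
following (a minimal sketch; it assumes CONFIG_COMPACTION, which
provides the /proc/sys/vm/compact_memory sysctl that compacts all zones
on all nodes when written):

	#include <fcntl.h>
	#include <unistd.h>

	/* Periodically trigger synchronous compaction of all memory. */
	int main(void)
	{
		for (;;) {
			int fd = open("/proc/sys/vm/compact_memory", O_WRONLY);

			if (fd >= 0) {
				write(fd, "1", 1);
				close(fd);
			}
			sleep(15 * 60);		/* every 15 minutes */
		}
	}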

compact_sleep_millisecs controls how often khugepaged will compact all
memory.  Once this interval has expired, one node is synchronously
compacted on each scan_sleep_millisecs wakeup until every node has been
compacted.  Then, khugepaged will restart the process
compact_sleep_millisecs later.

This defaults to 0, which means no memory compaction is done.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 RFC: this is for initial comment on whether it's appropriate to do this
 in khugepaged.  We already do the background compaction for the
 benefit of thp, but others may feel like this belongs in a new per-node
 kcompactd thread as proposed by Vlastimil.

 Regardless, it appears there is a substantial need for periodic memory
 compaction in the background to reduce the latency of thp page faults
 and still have a reasonable chance of having the allocation succeed.

 We could also speed up this process in the case of alloc_sleep_millisecs
 timeout since allocation recently failed for khugepaged.

 Documentation/vm/transhuge.txt | 10 +++++++
 mm/huge_memory.c               | 65 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 75 insertions(+)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -170,6 +170,16 @@ A lower value leads to gain less thp performance. Value of
 max_ptes_none can waste cpu time very little, you can
 ignore it.
 
+/sys/kernel/mm/transparent_hugepage/khugepaged/compact_sleep_millisecs
+
+controls how often khugepaged will utilize memory compaction to defragment
+memory.  This makes it easier to allocate hugepages both at page fault and
+by khugepaged since this compaction can be synchronous.
+
+This only occurs if scan_sleep_millisecs is configured.  Once
+compact_sleep_millisecs has expired, one node is compacted per
+scan_sleep_millisecs wakeup until all memory has been compacted.
+
 == Boot parameter ==
 
 You can change the sysfs boot time defaults of Transparent Hugepage
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
 #include <linux/pagemap.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
+#include <linux/compaction.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -65,6 +66,16 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
  */
 static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;
 
+/*
+ * Khugepaged may run memory compaction over all memory at regular
+ * intervals.  It round-robins through all nodes with memory, compacting
+ * one node per scan_sleep_millisecs wakeup when triggered.  The interval
+ * may be set with compact_sleep_millisecs, which is disabled by default.
+ */
+static unsigned long khugepaged_compact_sleep_millisecs __read_mostly;
+static unsigned long khugepaged_compact_jiffies;
+static int next_compact_node = MAX_NUMNODES;
+
 static int khugepaged(void *none);
 static int khugepaged_slab_init(void);
 static void khugepaged_slab_exit(void);
@@ -463,6 +474,34 @@ static struct kobj_attribute alloc_sleep_millisecs_attr =
 	__ATTR(alloc_sleep_millisecs, 0644, alloc_sleep_millisecs_show,
 	       alloc_sleep_millisecs_store);
 
+static ssize_t compact_sleep_millisecs_show(struct kobject *kobj,
+					    struct kobj_attribute *attr,
+					    char *buf)
+{
+	return sprintf(buf, "%lu\n", khugepaged_compact_sleep_millisecs);
+}
+
+static ssize_t compact_sleep_millisecs_store(struct kobject *kobj,
+					     struct kobj_attribute *attr,
+					     const char *buf, size_t count)
+{
+	unsigned long msecs;
+	int err;
+
+	err = kstrtoul(buf, 10, &msecs);
+	if (err)
+		return -EINVAL;
+
+	khugepaged_compact_sleep_millisecs = msecs;
+	/* writing 0 disables periodic compaction */
+	khugepaged_compact_jiffies = msecs ?
+			jiffies + msecs_to_jiffies(msecs) : 0;
+	wake_up_interruptible(&khugepaged_wait);
+
+	return count;
+}
+static struct kobj_attribute compact_sleep_millisecs_attr =
+	__ATTR(compact_sleep_millisecs, 0644, compact_sleep_millisecs_show,
+	       compact_sleep_millisecs_store);
+
 static ssize_t pages_to_scan_show(struct kobject *kobj,
 				  struct kobj_attribute *attr,
 				  char *buf)
@@ -564,6 +603,7 @@ static struct attribute *khugepaged_attr[] = {
 	&full_scans_attr.attr,
 	&scan_sleep_millisecs_attr.attr,
 	&alloc_sleep_millisecs_attr.attr,
+	&compact_sleep_millisecs_attr.attr,
 	NULL,
 };
 
@@ -652,6 +692,10 @@ static int __init hugepage_init(void)
 		return 0;
 	}
 
+	if (khugepaged_compact_sleep_millisecs)
+		khugepaged_compact_jiffies = jiffies +
+			msecs_to_jiffies(khugepaged_compact_sleep_millisecs);
+
 	err = start_stop_khugepaged();
 	if (err)
 		goto err_khugepaged;
@@ -2834,6 +2878,26 @@ static void khugepaged_wait_work(void)
 		wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
 }
 
+static void khugepaged_compact_memory(void)
+{
+	if (!khugepaged_compact_jiffies ||
+	    time_before(jiffies, khugepaged_compact_jiffies))
+		return;
+
+	get_online_mems();
+	if (next_compact_node == MAX_NUMNODES)
+		next_compact_node = first_node(node_states[N_MEMORY]);
+
+	compact_pgdat(NODE_DATA(next_compact_node), -1);
+
+	next_compact_node = next_node(next_compact_node, node_states[N_MEMORY]);
+	put_online_mems();
+
+	if (next_compact_node == MAX_NUMNODES)
+		khugepaged_compact_jiffies = jiffies +
+			msecs_to_jiffies(khugepaged_compact_sleep_millisecs);
+}
+
 static int khugepaged(void *none)
 {
 	struct mm_slot *mm_slot;
@@ -2842,6 +2906,7 @@ static int khugepaged(void *none)
 	set_user_nice(current, MAX_NICE);
 
 	while (!kthread_should_stop()) {
+		khugepaged_compact_memory();
 		khugepaged_do_scan();
 		khugepaged_wait_work();
 	}
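
With the patch applied, the same policy is configured once through sysfs
rather than driven from userspace.  For example, to request a full
compaction cycle every 15 minutes (a sketch assuming the tunable above;
the resulting activity is visible in the compact_* counters in
/proc/vmstat):

	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>

	#define TUNABLE \
		"/sys/kernel/mm/transparent_hugepage/khugepaged/compact_sleep_millisecs"

	int main(void)
	{
		const char *msecs = "900000";	/* 15 minutes */
		int fd = open(TUNABLE, O_WRONLY);

		if (fd < 0)
			return 1;
		write(fd, msecs, strlen(msecs));
		close(fd);
		return 0;
	}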


* Re: [rfc] mm, thp: allow khugepaged to periodically compact memory synchronously
  2015-07-15  2:19 ` David Rientjes
@ 2015-07-21  9:36   ` Vlastimil Babka
  -1 siblings, 0 replies; 6+ messages in thread
From: Vlastimil Babka @ 2015-07-21  9:36 UTC (permalink / raw)
  To: David Rientjes, Andrew Morton
  Cc: Andrea Arcangeli, Rik van Riel, Kirill A. Shutemov, Mel Gorman,
	linux-kernel, linux-mm

On 07/15/2015 04:19 AM, David Rientjes wrote:
> We have seen a large benefit in the amount of hugepages that can be
> allocated at fault

That's understandable...

> and by khugepaged when memory is periodically
> compacted in the background.

... but for khugepaged it's surprising. Doesn't khugepaged (unlike page 
faults) attempt the same sync compaction as your manual triggers?

> We trigger synchronous memory compaction over all memory every 15 minutes
> to keep fragmentation low and to offset the lightweight compaction that
> is done at page fault to keep latency low.

I'm surprised that 15 minutes is frequent enough to make a difference. 
I'd expect it very much depends on the memory size and workload though.

> compact_sleep_millisecs controls how often khugepaged will compact all
> memory.  Once this interval has expired, one node is synchronously
> compacted on each scan_sleep_millisecs wakeup until every node has been
> compacted.  Then, khugepaged will restart the process
> compact_sleep_millisecs later.
>
> This defaults to 0, which means no memory compaction is done.

Being another tunable that defaults to 0, it means that most people 
won't use it at all, or their distro will provide some other value. We 
should really strive to make it self-tuning based on e.g. memory 
fragmentation statistics. But I know that my kcompactd proposal also 
wasn't quite there yet...
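
For illustration, the raw input for such self-tuning is already
exported: /proc/buddyinfo lists free pages per order, from which one can
estimate how much free memory is unusable for order-9 (THP-sized)
allocations.  A rough userspace sketch of that metric (it assumes the
usual MAX_ORDER of 11; the kernel exports a similar metric at
/sys/kernel/debug/extfrag/unusable_index when debugfs is enabled):

	#include <stdio.h>

	/*
	 * For each zone in /proc/buddyinfo, estimate the fraction of free
	 * memory that cannot satisfy an order-9 (THP-sized) allocation.
	 * Values near 1.0 suggest compaction would help.
	 */
	int main(void)
	{
		char node[16], zone[16];
		FILE *f = fopen("/proc/buddyinfo", "r");

		while (f && fscanf(f, " Node %15[^,], zone %15s", node, zone) == 2) {
			unsigned long total = 0, usable = 0, nr;
			int order;

			for (order = 0; order <= 10 &&
					fscanf(f, "%lu", &nr) == 1; order++) {
				total += nr << order;
				if (order >= 9)
					usable += nr << order;
			}
			printf("node %s zone %-8s unusable: %.2f\n", node,
			       zone, total ? 1.0 - (double)usable / total : 0.0);
		}
		return 0;
	}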

>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>   RFC: this is for initial comment on whether it's appropriate to do this
>   in khugepaged.  We already do to the background compaction for the
>   benefit of thp, but others may feel like this belongs in a new per-node
>   kcompactd thread as proposed by Vlastimil.
>
>   Regardless, it appears there is a substantial need for periodic memory
>   compaction in the background to reduce the latency of thp page faults
>   and still have a reasonable chance of having the allocation succeed.
>
>   We could also speed up this process in the case of alloc_sleep_millisecs
>   timeout since allocation recently failed for khugepaged.
>
>   Documentation/vm/transhuge.txt | 10 +++++++
>   mm/huge_memory.c               | 65 ++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 75 insertions(+)
>
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -170,6 +170,16 @@ A lower value leads to gain less thp performance. Value of
>   max_ptes_none can waste cpu time very little, you can
>   ignore it.
>
> +/sys/kernel/mm/transparent_hugepage/khugepaged/compact_sleep_millisecs
> +
> +controls how often khugepaged will utilize memory compaction to defragment
> +memory.  This makes it easier to allocate hugepages both at page fault and
> +by khugepaged since this compaction can be synchronous.
> +
> +This only occurs if scan_sleep_millisecs is configured.  Once
> +compact_sleep_millisecs has expired, one node is compacted per
> +scan_sleep_millisecs wakeup until all memory has been compacted.
> +
>   == Boot parameter ==
>
>   You can change the sysfs boot time defaults of Transparent Hugepage
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -23,6 +23,7 @@
>   #include <linux/pagemap.h>
>   #include <linux/migrate.h>
>   #include <linux/hashtable.h>
> +#include <linux/compaction.h>
>
>   #include <asm/tlb.h>
>   #include <asm/pgalloc.h>
> @@ -65,6 +66,16 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
>    */
>   static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;
>
> +/*
> + * Khugepaged may run memory compaction over all memory at regular
> + * intervals.  It round-robins through all nodes with memory, compacting
> + * one node per scan_sleep_millisecs wakeup when triggered.  The interval
> + * may be set with compact_sleep_millisecs, which is disabled by default.
> + */
> +static unsigned long khugepaged_compact_sleep_millisecs __read_mostly;
> +static unsigned long khugepaged_compact_jiffies;
> +static int next_compact_node = MAX_NUMNODES;
> +
>   static int khugepaged(void *none);
>   static int khugepaged_slab_init(void);
>   static void khugepaged_slab_exit(void);
> @@ -463,6 +474,34 @@ static struct kobj_attribute alloc_sleep_millisecs_attr =
>   	__ATTR(alloc_sleep_millisecs, 0644, alloc_sleep_millisecs_show,
>   	       alloc_sleep_millisecs_store);
>
> +static ssize_t compact_sleep_millisecs_show(struct kobject *kobj,
> +					    struct kobj_attribute *attr,
> +					    char *buf)
> +{
> +	return sprintf(buf, "%lu\n", khugepaged_compact_sleep_millisecs);
> +}
> +
> +static ssize_t compact_sleep_millisecs_store(struct kobject *kobj,
> +					     struct kobj_attribute *attr,
> +					     const char *buf, size_t count)
> +{
> +	unsigned long msecs;
> +	int err;
> +
> +	err = kstrtoul(buf, 10, &msecs);
> +	if (err)
> +		return -EINVAL;
> +
> +	khugepaged_compact_sleep_millisecs = msecs;
> +	/* writing 0 disables periodic compaction */
> +	khugepaged_compact_jiffies = msecs ?
> +			jiffies + msecs_to_jiffies(msecs) : 0;
> +	wake_up_interruptible(&khugepaged_wait);
> +
> +	return count;
> +}
> +static struct kobj_attribute compact_sleep_millisecs_attr =
> +	__ATTR(compact_sleep_millisecs, 0644, compact_sleep_millisecs_show,
> +	       compact_sleep_millisecs_store);
> +
>   static ssize_t pages_to_scan_show(struct kobject *kobj,
>   				  struct kobj_attribute *attr,
>   				  char *buf)
> @@ -564,6 +603,7 @@ static struct attribute *khugepaged_attr[] = {
>   	&full_scans_attr.attr,
>   	&scan_sleep_millisecs_attr.attr,
>   	&alloc_sleep_millisecs_attr.attr,
> +	&compact_sleep_millisecs_attr.attr,
>   	NULL,
>   };
>
> @@ -652,6 +692,10 @@ static int __init hugepage_init(void)
>   		return 0;
>   	}
>
> +	if (khugepaged_compact_sleep_millisecs)
> +		khugepaged_compact_jiffies = jiffies +
> +			msecs_to_jiffies(khugepaged_compact_sleep_millisecs);
> +
>   	err = start_stop_khugepaged();
>   	if (err)
>   		goto err_khugepaged;
> @@ -2834,6 +2878,26 @@ static void khugepaged_wait_work(void)
>   		wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
>   }
>
> +static void khugepaged_compact_memory(void)
> +{
> +	if (!khugepaged_compact_jiffies ||
> +	    time_before(jiffies, khugepaged_compact_jiffies))
> +		return;
> +
> +	get_online_mems();
> +	if (next_compact_node == MAX_NUMNODES)
> +		next_compact_node = first_node(node_states[N_MEMORY]);
> +
> +	compact_pgdat(NODE_DATA(next_compact_node), -1);
> +
> +	next_compact_node = next_node(next_compact_node, node_states[N_MEMORY]);
> +	put_online_mems();
> +
> +	if (next_compact_node == MAX_NUMNODES)
> +		khugepaged_compact_jiffies = jiffies +
> +			msecs_to_jiffies(khugepaged_compact_sleep_millisecs);
> +}
> +
>   static int khugepaged(void *none)
>   {
>   	struct mm_slot *mm_slot;
> @@ -2842,6 +2906,7 @@ static int khugepaged(void *none)
>   	set_user_nice(current, MAX_NICE);
>
>   	while (!kthread_should_stop()) {
> +		khugepaged_compact_memory();
>   		khugepaged_do_scan();
>   		khugepaged_wait_work();
>   	}
>



* Re: [rfc] mm, thp: allow khugepaged to periodically compact memory synchronously
  2015-07-21  9:36   ` Vlastimil Babka
@ 2015-07-21 23:19     ` David Rientjes
  -1 siblings, 0 replies; 6+ messages in thread
From: David Rientjes @ 2015-07-21 23:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Andrea Arcangeli, Rik van Riel,
	Kirill A. Shutemov, Mel Gorman, linux-kernel, linux-mm

On Tue, 21 Jul 2015, Vlastimil Babka wrote:

> On 07/15/2015 04:19 AM, David Rientjes wrote:
> > We have seen a large benefit in the amount of hugepages that can be
> > allocated at fault
> 
> That's understandable...
> 
> > and by khugepaged when memory is periodically
> > compacted in the background.
> 
> ... but for khugepaged it's surprising. Doesn't khugepaged (unlike page
> faults) attempt the same sync compaction as your manual triggers?
> 

Not exactly: this compaction is over all memory rather than terminating 
as soon as a free pageblock is found.  It keeps more pageblocks free, so 
hugepages are easily allocated both at fault and by khugepaged without 
either having to do its own compaction.  The largest benefit, obviously, 
is to the page fault path, however.

> > We trigger synchronous memory compaction over all memory every 15 minutes
> > to keep fragmentation low and to offset the lightweight compaction that
> > is done at page fault to keep latency low.
> 
> I'm surprised that 15 minutes is frequent enough to make a difference. I'd
> expect it very much depends on the memory size and workload though.
> 

This is over all machines running all workloads, and it's directly 
related to how abort-happy we have made memory compaction in the page 
fault path, which aborts when locks are contended or need_resched() 
triggers.  Sometimes we see memory compaction doing very little work in 
the fault path as a result, and this patch becomes the only real source 
of memory compaction over all memory; it just isn't triggered anywhere 
else.

We make it a tunable here because some users will want to speed it up, 
just as they do scan_sleep_millisecs or alloc_sleep_millisecs, and, yes, 
the right value will depend on your particular workload and config.

> > compact_sleep_millisecs controls how often khugepaged will compact all
> > memory.  Once this interval has expired, one node is synchronously
> > compacted on each scan_sleep_millisecs wakeup until every node has been
> > compacted.  Then, khugepaged will restart the process
> > compact_sleep_millisecs later.
> > 
> > This defaults to 0, which means no memory compaction is done.
> 
> Being another tunable and defaulting to 0 it means that most people won't use
> it at all, or their distro will provide some other value. We should really
> strive to make it self-tuning based on e.g. memory fragmentation statistics.
> But I know that my kcompactd proposal also wasn't quite there yet...
> 

Agreed on a better default.  I proposed the rfc this way because we need 
this functionality now, and it works well with our 15m period, so we have 
no problem immediately tuning this to 15m.

I would imagine that the default should be a multiple of the 
alloc_sleep_millisecs default of 60000 and should consider the size of 
the largest node.

That said, I'm not sure about self-tuning of the period itself.  The 
premise is that this compaction keeps the memory in a relatively 
unfragmented state such that thp does not need to compact in the fault 
path and we have a much higher likelihood of being able to allocate since 
nothing else may actually trigger memory compaction besides this.  It 
seems like that should be done on a period defined by the user; hence, 
this is 
for "periodic memory compaction".

However, I agree with your comment in the kcompactd thread about the 
benefits of a different type, "background memory compaction", that can be 
kicked off when a high-order allocation fails, for instance, or based on 
a heuristic that looks at memory fragmentation statistics.  I think the 
two are quite distinct.
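
To sketch that distinction (illustrative only: the hook placement and
the kcompactd_wait waitqueue are hypothetical, not existing kernel
interfaces), the background variant would be driven by allocation
failure rather than by a timer, something like:

	/*
	 * Hypothetical sketch: wake a background compaction thread when a
	 * costly-order allocation has just failed, instead of compacting
	 * on a user-defined period.
	 */
	static void wake_background_compaction(int order)
	{
		/* only costly orders (> PAGE_ALLOC_COSTLY_ORDER) benefit */
		if (order <= PAGE_ALLOC_COSTLY_ORDER)
			return;
		wake_up_interruptible(&kcompactd_wait);
	}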

