linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch] mm, thp: add new background defrag option
@ 2017-01-04 23:41 David Rientjes
  2017-01-05 10:13 ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: David Rientjes @ 2017-01-04 23:41 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Michal Hocko, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, linux-kernel, linux-mm

There is no thp defrag option that currently allows MADV_HUGEPAGE regions 
to do direct compaction and reclaim while all other thp allocations simply 
trigger kswapd and kcompactd in the background and fail immediately.

The "defer" setting simply triggers background reclaim and compaction for 
all regions, regardless of MADV_HUGEPAGE, which makes it unusable for our 
userspace where MADV_HUGEPAGE is being used to indicate the application is 
willing to wait for work for thp memory to be available.

The "madvise" setting will do direct compaction and reclaim for these
MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the 
background for anybody else.

For reasonable usage, there needs to be a mesh between the two options.  
This patch introduces a fifth mode, "background", that will do direct 
reclaim and compaction for MADV_HUGEPAGE regions and trigger background 
reclaim and compaction for everybody else so that hugepages may be 
available in the near future.

A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE 
regions as part of the "defer" mode, making it a very powerful setting and 
avoids breaking userspace, was offered: 
http://marc.info/?t=148236612700003.  This additional mode is a 
compromise.

This patch also cleans up the helper function for storing to "enabled" 
and "defrag" since the former supports three modes while the latter 
supports five and triple_flag_store() was getting unnecessarily messy.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 I don't understand Mel's suggestion of "defer-fault" as option naming.

 No better idea for the naming of "background", although it is admittedly 
 not the best.  Better suggestions are welcome.

 Documentation/vm/transhuge.txt |   8 ++-
 include/linux/huge_mm.h        |   1 +
 mm/huge_memory.c               | 146 +++++++++++++++++++++--------------------
 3 files changed, 82 insertions(+), 73 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -110,6 +110,7 @@ MADV_HUGEPAGE region.
 
 echo always >/sys/kernel/mm/transparent_hugepage/defrag
 echo defer >/sys/kernel/mm/transparent_hugepage/defrag
+echo background >/sys/kernel/mm/transparent_hugepage/defrag
 echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
 echo never >/sys/kernel/mm/transparent_hugepage/defrag
 
@@ -119,8 +120,13 @@ allocate a THP immediately. This may be desirable for virtual machines
 that benefit heavily from THP use and are willing to delay the VM start
 to utilise them.
 
+"background" will enter direct reclaim and compaction like "always", but
+only for regions that have used madvise(MADV_HUGEPAGE); all other regions
+will wake kswapd in the background to reclaim pages and wake kcompactd to
+compact memory so that THP is available in the near future.
+
 "defer" means that an application will wake kswapd in the background
-to reclaim pages and wake kcompact to compact memory so that THP is
+to reclaim pages and wake kcompactd to compact memory so that THP is
 available in the near future. It's the responsibility of khugepaged
 to then install the THP pages later.
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -33,6 +33,7 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_MADV_OR_KSWAPD_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
 	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -142,42 +142,6 @@ static struct shrinker huge_zero_page_shrinker = {
 };
 
 #ifdef CONFIG_SYSFS
-
-static ssize_t triple_flag_store(struct kobject *kobj,
-				 struct kobj_attribute *attr,
-				 const char *buf, size_t count,
-				 enum transparent_hugepage_flag enabled,
-				 enum transparent_hugepage_flag deferred,
-				 enum transparent_hugepage_flag req_madv)
-{
-	if (!memcmp("defer", buf,
-		    min(sizeof("defer")-1, count))) {
-		if (enabled == deferred)
-			return -EINVAL;
-		clear_bit(enabled, &transparent_hugepage_flags);
-		clear_bit(req_madv, &transparent_hugepage_flags);
-		set_bit(deferred, &transparent_hugepage_flags);
-	} else if (!memcmp("always", buf,
-		    min(sizeof("always")-1, count))) {
-		clear_bit(deferred, &transparent_hugepage_flags);
-		clear_bit(req_madv, &transparent_hugepage_flags);
-		set_bit(enabled, &transparent_hugepage_flags);
-	} else if (!memcmp("madvise", buf,
-			   min(sizeof("madvise")-1, count))) {
-		clear_bit(enabled, &transparent_hugepage_flags);
-		clear_bit(deferred, &transparent_hugepage_flags);
-		set_bit(req_madv, &transparent_hugepage_flags);
-	} else if (!memcmp("never", buf,
-			   min(sizeof("never")-1, count))) {
-		clear_bit(enabled, &transparent_hugepage_flags);
-		clear_bit(req_madv, &transparent_hugepage_flags);
-		clear_bit(deferred, &transparent_hugepage_flags);
-	} else
-		return -EINVAL;
-
-	return count;
-}
-
 static ssize_t enabled_show(struct kobject *kobj,
 			    struct kobj_attribute *attr, char *buf)
 {
@@ -193,19 +157,28 @@ static ssize_t enabled_store(struct kobject *kobj,
 			     struct kobj_attribute *attr,
 			     const char *buf, size_t count)
 {
-	ssize_t ret;
+	ssize_t ret = count;
 
-	ret = triple_flag_store(kobj, attr, buf, count,
-				TRANSPARENT_HUGEPAGE_FLAG,
-				TRANSPARENT_HUGEPAGE_FLAG,
-				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+	if (!memcmp("always", buf,
+		    min(sizeof("always")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("madvise", buf,
+			   min(sizeof("madvise")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("never", buf,
+			   min(sizeof("never")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+	} else
+		ret = -EINVAL;
 
 	if (ret > 0) {
 		int err = start_stop_khugepaged();
 		if (err)
 			ret = err;
 	}
-
 	return ret;
 }
 static struct kobj_attribute enabled_attr =
@@ -241,32 +214,58 @@ ssize_t single_hugepage_flag_store(struct kobject *kobj,
 	return count;
 }
 
-/*
- * Currently defrag only disables __GFP_NOWAIT for allocation. A blind
- * __GFP_REPEAT is too aggressive, it's never worth swapping tons of
- * memory just to allocate one more hugepage.
- */
 static ssize_t defrag_show(struct kobject *kobj,
 			   struct kobj_attribute *attr, char *buf)
 {
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return sprintf(buf, "[always] defer madvise never\n");
+		return sprintf(buf, "[always] defer background madvise never\n");
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return sprintf(buf, "always [defer] madvise never\n");
-	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return sprintf(buf, "always defer [madvise] never\n");
-	else
-		return sprintf(buf, "always defer madvise [never]\n");
-
+		return sprintf(buf, "always [defer] background madvise never\n");
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_MADV_OR_KSWAPD_FLAG, &transparent_hugepage_flags))
+		return sprintf(buf, "always defer [background] madvise never\n");
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
+		return sprintf(buf, "always defer background [madvise] never\n");
+	return sprintf(buf, "always defer background madvise [never]\n");
 }
+
 static ssize_t defrag_store(struct kobject *kobj,
 			    struct kobj_attribute *attr,
 			    const char *buf, size_t count)
 {
-	return triple_flag_store(kobj, attr, buf, count,
-				 TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
-				 TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
-				 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+	if (!memcmp("always", buf,
+		    min(sizeof("always")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_MADV_OR_KSWAPD_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("defer", buf,
+		    min(sizeof("defer")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_MADV_OR_KSWAPD_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("background", buf,
+		    min(sizeof("background")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_MADV_OR_KSWAPD_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("madvise", buf,
+			   min(sizeof("madvise")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_MADV_OR_KSWAPD_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("never", buf,
+			   min(sizeof("never")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_MADV_OR_KSWAPD_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
+	} else
+		return -EINVAL;
+
+	return count;
 }
 static struct kobj_attribute defrag_attr =
 	__ATTR(defrag, 0644, defrag_show, defrag_store);
@@ -612,25 +611,28 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 }
 
 /*
- * If THP defrag is set to always then directly reclaim/compact as necessary
- * If set to defer then do only background reclaim/compact and defer to khugepaged
- * If set to madvise and the VMA is flagged then directly reclaim/compact
- * When direct reclaim/compact is allowed, don't retry except for flagged VMA's
+ * always: directly stall for all thp allocations
+ * defer: wake kswapd and fail if not immediately available
+ * background: wake kswapd and directly stall for MADV_HUGEPAGE, otherwise fail
+ *             if not immediately available
+ * madvise: directly stall for MADV_HUGEPAGE, otherwise fail if not immediately
+ *          available
+ * never: never stall for any thp allocation
  */
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
-	bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
 
-	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
-				&transparent_hugepage_flags) && vma_madvised)
-		return GFP_TRANSHUGE;
-	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
-						&transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
-	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
-						&transparent_hugepage_flags))
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
-
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_MADV_OR_KSWAPD_FLAG, &transparent_hugepage_flags))
+		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
+							     __GFP_KSWAPD_RECLAIM);
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
+		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
+							     0);
 	return GFP_TRANSHUGE_LIGHT;
 }
 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm, thp: add new background defrag option
  2017-01-04 23:41 [patch] mm, thp: add new background defrag option David Rientjes
@ 2017-01-05 10:13 ` Mel Gorman
  2017-01-05 10:33   ` Michal Hocko
  2017-01-05 13:58   ` Vlastimil Babka
  0 siblings, 2 replies; 21+ messages in thread
From: Mel Gorman @ 2017-01-05 10:13 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, linux-kernel, linux-mm

On Wed, Jan 04, 2017 at 03:41:59PM -0800, David Rientjes wrote:
> There is no thp defrag option that currently allows MADV_HUGEPAGE regions 
> to do direct compaction and reclaim while all other thp allocations simply 
> trigger kswapd and kcompactd in the background and fail immediately.
> 
> The "defer" setting simply triggers background reclaim and compaction for 
> all regions, regardless of MADV_HUGEPAGE, which makes it unusable for our 
> userspace where MADV_HUGEPAGE is being used to indicate the application is 
> willing to wait for work for thp memory to be available.
> 
> The "madvise" setting will do direct compaction and reclaim for these
> MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the 
> background for anybody else.
> 
> For reasonable usage, there needs to be a mesh between the two options.  
> This patch introduces a fifth mode, "background", that will do direct 
> reclaim and compaction for MADV_HUGEPAGE regions and trigger background 
> reclaim and compaction for everybody else so that hugepages may be 
> available in the near future.
> 
> A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE 
> regions as part of the "defer" mode, making it a very powerful setting and 
> avoids breaking userspace, was offered: 
> http://marc.info/?t=148236612700003.  This additional mode is a 
> compromise.
> 
> This patch also cleans up the helper function for storing to "enabled" 
> and "defrag" since the former supports three modes while the latter 
> supports five and triple_flag_store() was getting unnecessarily messy.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  I don't understand Mel's suggestion of "defer-fault" as option naming.
> 

defer-fault was intended to reflect "defer faults but not anything else"
with the only sensible alternative being madvise requests. While not a
major fan of the background name, I don't have a better suggestion either
other than defer-fault.

There are likely to be objections based on how this should be specified
and investigating alternative proposals such as fine-grained control of
how background compaction should be done but I hadn't proposed them and
hadn't intended to work on such patches. This patch appears to give the
semantics you want and I said I would ack such a configuration option so;

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm, thp: add new background defrag option
  2017-01-05 10:13 ` Mel Gorman
@ 2017-01-05 10:33   ` Michal Hocko
  2017-01-05 13:58   ` Vlastimil Babka
  1 sibling, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2017-01-05 10:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Andrew Morton, Jonathan Corbet,
	Kirill A. Shutemov, Vlastimil Babka, linux-kernel, linux-mm

On Thu 05-01-17 10:13:30, Mel Gorman wrote:
> On Wed, Jan 04, 2017 at 03:41:59PM -0800, David Rientjes wrote:
> > There is no thp defrag option that currently allows MADV_HUGEPAGE regions 
> > to do direct compaction and reclaim while all other thp allocations simply 
> > trigger kswapd and kcompactd in the background and fail immediately.
> > 
> > The "defer" setting simply triggers background reclaim and compaction for 
> > all regions, regardless of MADV_HUGEPAGE, which makes it unusable for our 
> > userspace where MADV_HUGEPAGE is being used to indicate the application is 
> > willing to wait for work for thp memory to be available.
> > 
> > The "madvise" setting will do direct compaction and reclaim for these
> > MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the 
> > background for anybody else.
> > 
> > For reasonable usage, there needs to be a mesh between the two options.  
> > This patch introduces a fifth mode, "background", that will do direct 
> > reclaim and compaction for MADV_HUGEPAGE regions and trigger background 
> > reclaim and compaction for everybody else so that hugepages may be 
> > available in the near future.
> > 
> > A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE 
> > regions as part of the "defer" mode, making it a very powerful setting and 
> > avoids breaking userspace, was offered: 
> > http://marc.info/?t=148236612700003.  This additional mode is a 
> > compromise.
> > 
> > This patch also cleans up the helper function for storing to "enabled" 
> > and "defrag" since the former supports three modes while the latter 
> > supports five and triple_flag_store() was getting unnecessarily messy.
> > 
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > ---
> >  I don't understand Mel's suggestion of "defer-fault" as option naming.
> > 
> 
> defer-fault was intended to reflect "defer faults but not anything else"
> with the only sensible alternative being madvise requests. While not a
> major fan of the background name, I don't have a better suggestion either
> other than defer-fault.
> 
> There are likely to be objections based on how this should be specified
> and investigating alternative proposals such as fine-grained control of
> how background compaction should be done but I hadn't proposed them and
> hadn't intended to work on such patches. This patch appears to give the
> semantics you want and I said I would ack such a configuration option so;

Yes, I would really like to see that we have exhausted all the proposed
options before we go with a new tunable value. I personally do not have
strong objection to this patch as long as all other options are
considered not viable. The naming is really confusing because defer and
background sonds just too similar and background suggests that no
expensive operation will happen in the direct (fault) context. From that
POV, Mel's defer-fault was more clear to me. I would even like to see
madvise in the name. Something like madvise-with-defer?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm, thp: add new background defrag option
  2017-01-05 10:13 ` Mel Gorman
  2017-01-05 10:33   ` Michal Hocko
@ 2017-01-05 13:58   ` Vlastimil Babka
  2017-01-05 15:50     ` Michal Hocko
  2017-01-05 22:54     ` David Rientjes
  1 sibling, 2 replies; 21+ messages in thread
From: Vlastimil Babka @ 2017-01-05 13:58 UTC (permalink / raw)
  To: Mel Gorman, David Rientjes
  Cc: Andrew Morton, Michal Hocko, Jonathan Corbet, Kirill A. Shutemov,
	linux-kernel, linux-mm

On 01/05/2017 11:13 AM, Mel Gorman wrote:
> On Wed, Jan 04, 2017 at 03:41:59PM -0800, David Rientjes wrote:
>> There is no thp defrag option that currently allows MADV_HUGEPAGE regions 
>> to do direct compaction and reclaim while all other thp allocations simply 
>> trigger kswapd and kcompactd in the background and fail immediately.
>>
>> The "defer" setting simply triggers background reclaim and compaction for 
>> all regions, regardless of MADV_HUGEPAGE, which makes it unusable for our 
>> userspace where MADV_HUGEPAGE is being used to indicate the application is 
>> willing to wait for work for thp memory to be available.
>>
>> The "madvise" setting will do direct compaction and reclaim for these
>> MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the 
>> background for anybody else.
>>
>> For reasonable usage, there needs to be a mesh between the two options.  
>> This patch introduces a fifth mode, "background", that will do direct 
>> reclaim and compaction for MADV_HUGEPAGE regions and trigger background 
>> reclaim and compaction for everybody else so that hugepages may be 
>> available in the near future.
>>
>> A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE 
>> regions as part of the "defer" mode, making it a very powerful setting and 
>> avoids breaking userspace, was offered: 
>> http://marc.info/?t=148236612700003.  This additional mode is a 
>> compromise.
>>
>> This patch also cleans up the helper function for storing to "enabled" 
>> and "defrag" since the former supports three modes while the latter 
>> supports five and triple_flag_store() was getting unnecessarily messy.
>>
>> Signed-off-by: David Rientjes <rientjes@google.com>
>> ---
>>  I don't understand Mel's suggestion of "defer-fault" as option naming.
>>
> 
> defer-fault was intended to reflect "defer faults but not anything else"
> with the only sensible alternative being madvise requests. While not a

Hmm that's probably why it's hard to understand, because "madvise
request" is just setting a vma flag, and the THP allocation (and defrag)
still happens at fault.

I'm not a fan of either name, so I've tried to implement my own
suggestion. Turns out it was easier than expected, as there's no kernel
boot option for "defer", just for "enabled", so that particular worry
was unfounded.

And personally I think that it's less confusing when one can enable defer
and madvise together (and not any other combination), than having to dig
up the difference between "defer" and "background".

I have only tested the sysfs manipulation, not actual THP, but seems to me
that alloc_hugepage_direct_gfpmask() already happens to process the flags
in a way that it works as expected.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 10eedbf14421..cc5ae86169a8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -150,7 +150,16 @@ static ssize_t triple_flag_store(struct kobject *kobj,
 				 enum transparent_hugepage_flag deferred,
 				 enum transparent_hugepage_flag req_madv)
 {
-	if (!memcmp("defer", buf,
+	if (!memcmp("defer madvise", buf,
+			min(sizeof("defer madvise")-1, count))
+	    || !memcmp("madvise defer", buf,
+			min(sizeof("madvise defer")-1, count))) {
+		if (enabled == deferred)
+			return -EINVAL;
+		clear_bit(enabled, &transparent_hugepage_flags);
+		set_bit(req_madv, &transparent_hugepage_flags);
+		set_bit(deferred, &transparent_hugepage_flags);
+	} else if (!memcmp("defer", buf,
 		    min(sizeof("defer")-1, count))) {
 		if (enabled == deferred)
 			return -EINVAL;
@@ -251,9 +260,12 @@ static ssize_t defrag_show(struct kobject *kobj,
 {
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
 		return sprintf(buf, "[always] defer madvise never\n");
-	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return sprintf(buf, "always [defer] madvise never\n");
-	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags)) {
+		if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
+			return sprintf(buf, "always [defer] [madvise] never\n");
+		else
+			return sprintf(buf, "always [defer] madvise never\n");
+	} else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
 		return sprintf(buf, "always defer [madvise] never\n");
 	else
 		return sprintf(buf, "always defer madvise [never]\n");

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [patch] mm, thp: add new background defrag option
  2017-01-05 13:58   ` Vlastimil Babka
@ 2017-01-05 15:50     ` Michal Hocko
  2017-01-05 22:54     ` David Rientjes
  1 sibling, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2017-01-05 15:50 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mel Gorman, David Rientjes, Andrew Morton, Jonathan Corbet,
	Kirill A. Shutemov, linux-kernel, linux-mm

On Thu 05-01-17 14:58:47, Vlastimil Babka wrote:
[...]
> I'm not a fan of either name, so I've tried to implement my own
> suggestion. Turns out it was easier than expected, as there's no kernel
> boot option for "defer", just for "enabled", so that particular worry
> was unfounded.
> 
> And personally I think that it's less confusing when one can enable defer
> and madvise together (and not any other combination), than having to dig
> up the difference between "defer" and "background".
> 
> I have only tested the sysfs manipulation, not actual THP, but seems to me
> that alloc_hugepage_direct_gfpmask() already happens to process the flags
> in a way that it works as expected.

IMHO this looks indeed much simpler implementation wise, more consistent
from the semantic point of view and less confusing from the usage POV.
 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 10eedbf14421..cc5ae86169a8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -150,7 +150,16 @@ static ssize_t triple_flag_store(struct kobject *kobj,
>  				 enum transparent_hugepage_flag deferred,
>  				 enum transparent_hugepage_flag req_madv)
>  {
> -	if (!memcmp("defer", buf,
> +	if (!memcmp("defer madvise", buf,
> +			min(sizeof("defer madvise")-1, count))
> +	    || !memcmp("madvise defer", buf,
> +			min(sizeof("madvise defer")-1, count))) {
> +		if (enabled == deferred)
> +			return -EINVAL;
> +		clear_bit(enabled, &transparent_hugepage_flags);
> +		set_bit(req_madv, &transparent_hugepage_flags);
> +		set_bit(deferred, &transparent_hugepage_flags);
> +	} else if (!memcmp("defer", buf,
>  		    min(sizeof("defer")-1, count))) {
>  		if (enabled == deferred)
>  			return -EINVAL;
> @@ -251,9 +260,12 @@ static ssize_t defrag_show(struct kobject *kobj,
>  {
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
>  		return sprintf(buf, "[always] defer madvise never\n");
> -	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> -		return sprintf(buf, "always [defer] madvise never\n");
> -	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> +	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags)) {
> +		if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> +			return sprintf(buf, "always [defer] [madvise] never\n");
> +		else
> +			return sprintf(buf, "always [defer] madvise never\n");
> +	} else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>  		return sprintf(buf, "always defer [madvise] never\n");
>  	else
>  		return sprintf(buf, "always defer madvise [never]\n");

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm, thp: add new background defrag option
  2017-01-05 13:58   ` Vlastimil Babka
  2017-01-05 15:50     ` Michal Hocko
@ 2017-01-05 22:54     ` David Rientjes
  2017-01-06  8:41       ` Vlastimil Babka
  1 sibling, 1 reply; 21+ messages in thread
From: David Rientjes @ 2017-01-05 22:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mel Gorman, Andrew Morton, Michal Hocko, Jonathan Corbet,
	Kirill A. Shutemov, linux-kernel, linux-mm

On Thu, 5 Jan 2017, Vlastimil Babka wrote:

> Hmm that's probably why it's hard to understand, because "madvise
> request" is just setting a vma flag, and the THP allocation (and defrag)
> still happens at fault.
> 
> I'm not a fan of either name, so I've tried to implement my own
> suggestion. Turns out it was easier than expected, as there's no kernel
> boot option for "defer", just for "enabled", so that particular worry
> was unfounded.
> 
> And personally I think that it's less confusing when one can enable defer
> and madvise together (and not any other combination), than having to dig
> up the difference between "defer" and "background".
> 

I think allowing only two options to be combined amongst four available 
solo options is going to be confusing and then even more difficult for the 
user to understand what happens when they are combined.  Thus, I think 
these options should only have one settable mode as they have always done.

The kernel implementation takes less of a priority to userspace 
simplicitly, imo, and my patch actually cleans up much of the existing 
code and ends up adding fewer lines that yours.  I consider it an 
improvement in itself.  I don't see the benefit of allowing combined 
options.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm, thp: add new background defrag option
  2017-01-05 22:54     ` David Rientjes
@ 2017-01-06  8:41       ` Vlastimil Babka
  2017-01-06 14:01         ` Michal Hocko
  2017-01-06 22:20         ` David Rientjes
  0 siblings, 2 replies; 21+ messages in thread
From: Vlastimil Babka @ 2017-01-06  8:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, Andrew Morton, Michal Hocko, Jonathan Corbet,
	Kirill A. Shutemov, linux-kernel, linux-mm

On 01/05/2017 11:54 PM, David Rientjes wrote:
> On Thu, 5 Jan 2017, Vlastimil Babka wrote:
> 
>> Hmm that's probably why it's hard to understand, because "madvise
>> request" is just setting a vma flag, and the THP allocation (and defrag)
>> still happens at fault.
>>
>> I'm not a fan of either name, so I've tried to implement my own
>> suggestion. Turns out it was easier than expected, as there's no kernel
>> boot option for "defer", just for "enabled", so that particular worry
>> was unfounded.
>>
>> And personally I think that it's less confusing when one can enable defer
>> and madvise together (and not any other combination), than having to dig
>> up the difference between "defer" and "background".
>>
> 
> I think allowing only two options to be combined amongst four available 
> solo options is going to be confusing and then even more difficult for the 
> user to understand what happens when they are combined.  Thus, I think 

Well, the other options are named "always" and "never", so I wouldn't
think so confusing that they can't be combined with anything else.
Deciding between "defer" and "background" is however confusing, and also
doesn't indicate that the difference is related to madvise.

> these options should only have one settable mode as they have always done.
> 
> The kernel implementation takes less of a priority to userspace 
> simplicitly, imo, and my patch actually cleans up much of the existing 
> code and ends up adding fewer lines that yours.  I consider it an 
> improvement in itself.  I don't see the benefit of allowing combined 
> options.

I don't like bikesheding, but as this is about user-space API, more care
should be taken than for implementation details that can change. Even
though realistically there will be in 99% of cases only two groups of
users setting this
- experts like you who know what they are doing, and confusing names
won't prevent them from making the right choice
- people who will blindly copy/paste from the future cargo-cult websites
(if they ever get updated from the enabled="never" recommendations), who
likely won't stop and think about the other options.

Well, so we'll probably disagree, maybe others can add their opinions.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm, thp: add new background defrag option
  2017-01-06  8:41       ` Vlastimil Babka
@ 2017-01-06 14:01         ` Michal Hocko
  2017-01-06 22:20         ` David Rientjes
  1 sibling, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2017-01-06 14:01 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, Mel Gorman, Andrew Morton, Jonathan Corbet,
	Kirill A. Shutemov, linux-kernel, linux-mm

On Fri 06-01-17 09:41:57, Vlastimil Babka wrote:
> On 01/05/2017 11:54 PM, David Rientjes wrote:
> > On Thu, 5 Jan 2017, Vlastimil Babka wrote:
> > 
> >> Hmm that's probably why it's hard to understand, because "madvise
> >> request" is just setting a vma flag, and the THP allocation (and defrag)
> >> still happens at fault.
> >>
> >> I'm not a fan of either name, so I've tried to implement my own
> >> suggestion. Turns out it was easier than expected, as there's no kernel
> >> boot option for "defer", just for "enabled", so that particular worry
> >> was unfounded.
> >>
> >> And personally I think that it's less confusing when one can enable defer
> >> and madvise together (and not any other combination), than having to dig
> >> up the difference between "defer" and "background".
> >>
> > 
> > I think allowing only two options to be combined amongst four available 
> > solo options is going to be confusing and then even more difficult for the 
> > user to understand what happens when they are combined.  Thus, I think 
> 
> Well, the other options are named "always" and "never", so I wouldn't
> think so confusing that they can't be combined with anything else.
> Deciding between "defer" and "background" is however confusing, and also
> doesn't indicate that the difference is related to madvise.

fully agreed. Calling a mode which does _direct_ compaction as
_background_ is not only confusng but just plain wrong.

> > these options should only have one settable mode as they have always done.
> > 
> > The kernel implementation takes less of a priority to userspace 
> > simplicitly, imo, and my patch actually cleans up much of the existing 
> > code and ends up adding fewer lines that yours.  I consider it an 
> > improvement in itself.  I don't see the benefit of allowing combined 
> > options.
> 
> I don't like bikesheding, but as this is about user-space API, more care
> should be taken than for implementation details that can change. Even
> though realistically there will be in 99% of cases only two groups of
> users setting this
> - experts like you who know what they are doing, and confusing names
> won't prevent them from making the right choice
> - people who will blindly copy/paste from the future cargo-cult websites
> (if they ever get updated from the enabled="never" recommendations), who
> likely won't stop and think about the other options.

agreed!
 
> Well, so we'll probably disagree, maybe others can add their opinions.

To me the combined option sounds better than a new one with confusing
name. Maybe we can come up with a better name that would reflect the
functionality better, though. There is some minor risk that some
userspace doesn't cope with two options being selected but there has
never been any guarantee about a single option being selected in the
first place. Is anybody actually parsing this file to make further
decisions? I would expect it is write mostly thing.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm, thp: add new background defrag option
  2017-01-06  8:41       ` Vlastimil Babka
  2017-01-06 14:01         ` Michal Hocko
@ 2017-01-06 22:20         ` David Rientjes
  2017-01-09 10:04           ` Vlastimil Babka
  1 sibling, 1 reply; 21+ messages in thread
From: David Rientjes @ 2017-01-06 22:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mel Gorman, Andrew Morton, Michal Hocko, Jonathan Corbet,
	Kirill A. Shutemov, linux-kernel, linux-mm

On Fri, 6 Jan 2017, Vlastimil Babka wrote:

> Deciding between "defer" and "background" is however confusing, and also
> doesn't indicate that the difference is related to madvise.
> 

Any suggestions for a better name for "background" are more than welcome.  

> > The kernel implementation takes less of a priority to userspace 
> > simplicitly, imo, and my patch actually cleans up much of the existing 
> > code and ends up adding fewer lines that yours.  I consider it an 
> > improvement in itself.  I don't see the benefit of allowing combined 
> > options.
> 
> I don't like bikesheding, but as this is about user-space API, more care
> should be taken than for implementation details that can change. Even
> though realistically there will be in 99% of cases only two groups of
> users setting this
> - experts like you who know what they are doing, and confusing names
> won't prevent them from making the right choice
> - people who will blindly copy/paste from the future cargo-cult websites
> (if they ever get updated from the enabled="never" recommendations), who
> likely won't stop and think about the other options.
> 

I think the far majority will go with a third option: simply use the 
kernel default and be unaware of other settings or consider it to be the 
most likely choice solely because it is the kernel default.

I think the kernel default could easily be changed to "background" after 
this and nobody would actually notice, but I don't have a strong 
preference for that.  I think users who notice large thp_fault_fallback 
and want to get the true "transparent" nature of hugepages will 
investigate defragmentation behavior and see "background" is exactly what 
they want.  Indeed, I think that the new "background" mode meshes well 
with the expectation of "transparent" hugepages.  I don't foresee any 
usecase, present or future, for "defer" so I'll simply ignore it.

So whether it's better to do echo background or echo "madvise defer" is 
not important to me, I simply imagine that the combination will be more 
difficult to describe to users.  It would break our userspace to currently 
tests for "[madvise]" and reports that state as strictly madvise to our 
mission control, but I can work around that; not sure if others would 
encounter the same issue (would "[defer madvise]" or "[defer] [madvise]" 
break fewer userspaces?).

I'd leave it to Andrew to decide whether sysfs files should accept 
multiple modes or not.  If you are to propose a patch to do so, I'd 
encourage you to do the same cleanup of triple_flag_store() that I did and 
make the gfp mask construction more straight-forward.  If you'd like to 
suggest a different name for "background", I'd be happy to change that if 
it's more descriptive.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm, thp: add new background defrag option
  2017-01-06 22:20         ` David Rientjes
@ 2017-01-09 10:04           ` Vlastimil Babka
  2017-01-09 12:06             ` Vlastimil Babka
  2017-01-10  2:19             ` David Rientjes
  0 siblings, 2 replies; 21+ messages in thread
From: Vlastimil Babka @ 2017-01-09 10:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, Andrew Morton, Michal Hocko, Jonathan Corbet,
	Kirill A. Shutemov, linux-kernel, linux-mm

On 01/06/2017 11:20 PM, David Rientjes wrote:
> On Fri, 6 Jan 2017, Vlastimil Babka wrote:
> 
>> Deciding between "defer" and "background" is however confusing, and also
>> doesn't indicate that the difference is related to madvise.
>>
> 
> Any suggestions for a better name for "background" are more than welcome.  

Why not just "madvise+defer"?

>> I don't like bikesheding, but as this is about user-space API, more care
>> should be taken than for implementation details that can change. Even
>> though realistically there will be in 99% of cases only two groups of
>> users setting this
>> - experts like you who know what they are doing, and confusing names
>> won't prevent them from making the right choice
>> - people who will blindly copy/paste from the future cargo-cult websites
>> (if they ever get updated from the enabled="never" recommendations), who
>> likely won't stop and think about the other options.
>>
> 
> I think the far majority will go with a third option: simply use the 
> kernel default and be unaware of other settings or consider it to be the 
> most likely choice solely because it is the kernel default.

Sure, my prediction was only about "users setting this" :) Agreed that
those will be a small minority of all users.

[...]

> So whether it's better to do echo background or echo "madvise defer" is 
> not important to me, I simply imagine that the combination will be more 
> difficult to describe to users.  It would break our userspace to currently 
> tests for "[madvise]" and reports that state as strictly madvise to our 
> mission control, but I can work around that; not sure if others would 
> encounter the same issue (would "[defer madvise]" or "[defer] [madvise]" 
> break fewer userspaces?).

OK, well I'm reluctant to break existing userspace knowingly over such
silliness. Also apparently sysfs files in general should accept only one
value, so I'm not going to push my approach.

> I'd leave it to Andrew to decide whether sysfs files should accept 
> multiple modes or not.  If you are to propose a patch to do so, I'd 
> encourage you to do the same cleanup of triple_flag_store() that I did and 
> make the gfp mask construction more straight-forward.  If you'd like to 
> suggest a different name for "background", I'd be happy to change that if 
> it's more descriptive.

Suggestion is above. I however think your cleanup isn't really needed,
we can simply keep the existing 3 internal flags, and "madvise+defer"
would enable two of them, like in my patch. Nothing says that internally
each option should correspond to exactly one flag.

Vlastimil

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm, thp: add new background defrag option
  2017-01-09 10:04           ` Vlastimil Babka
@ 2017-01-09 12:06             ` Vlastimil Babka
  2017-01-10  2:19             ` David Rientjes
  1 sibling, 0 replies; 21+ messages in thread
From: Vlastimil Babka @ 2017-01-09 12:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, Andrew Morton, Michal Hocko, Jonathan Corbet,
	Kirill A. Shutemov, linux-kernel, linux-mm

On 01/09/2017 11:04 AM, Vlastimil Babka wrote:
> On 01/06/2017 11:20 PM, David Rientjes wrote:
>> I'd leave it to Andrew to decide whether sysfs files should accept 
>> multiple modes or not.  If you are to propose a patch to do so, I'd 
>> encourage you to do the same cleanup of triple_flag_store() that I did and 
>> make the gfp mask construction more straight-forward.  If you'd like to 
>> suggest a different name for "background", I'd be happy to change that if 
>> it's more descriptive.
> 
> Suggestion is above. I however think your cleanup isn't really needed,
> we can simply keep the existing 3 internal flags, and "madvise+defer"
> would enable two of them, like in my patch. Nothing says that internally
> each option should correspond to exactly one flag.

Forgot to add that if/when you repost this, please CC linux-api and
summarize what else was considered in the changelog. Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm, thp: add new background defrag option
  2017-01-09 10:04           ` Vlastimil Babka
  2017-01-09 12:06             ` Vlastimil Babka
@ 2017-01-10  2:19             ` David Rientjes
  2017-01-10  3:38               ` Hugh Dickins
                                 ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: David Rientjes @ 2017-01-10  2:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mel Gorman, Andrew Morton, Michal Hocko, Jonathan Corbet,
	Kirill A. Shutemov, linux-kernel, linux-mm

On Mon, 9 Jan 2017, Vlastimil Babka wrote:

> > Any suggestions for a better name for "background" are more than welcome.  
> 
> Why not just "madvise+defer"?
> 

Seeing no other activity regarding this issue (omg!), I'll wait a day or 
so to see if there are any objections to "madvise+defer" or suggestions 
that may be better and repost.

Thanks!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm, thp: add new background defrag option
  2017-01-10  2:19             ` David Rientjes
@ 2017-01-10  3:38               ` Hugh Dickins
  2017-01-10  8:44                 ` Vlastimil Babka
  2017-01-10 13:01               ` Michal Hocko
  2017-01-11  0:15               ` [patch v2] mm, thp: add new defer+madvise " David Rientjes
  2 siblings, 1 reply; 21+ messages in thread
From: Hugh Dickins @ 2017-01-10  3:38 UTC (permalink / raw)
  To: David Rientjes
  Cc: Vlastimil Babka, Mel Gorman, Andrew Morton, Michal Hocko,
	Jonathan Corbet, Kirill A. Shutemov, linux-kernel, linux-mm

On Mon, 9 Jan 2017, David Rientjes wrote:
> On Mon, 9 Jan 2017, Vlastimil Babka wrote:
> 
> > > Any suggestions for a better name for "background" are more than welcome.  
> > 
> > Why not just "madvise+defer"?
> > 
> 
> Seeing no other activity regarding this issue (omg!), I'll wait a day or 
> so to see if there are any objections to "madvise+defer" or suggestions 
> that may be better and repost.

I get very confused by the /sys/kernel/mm/transparent_hugepage/defrag
versus enabled flags, and this may be a terrible, even more confusing,
idea: but I've been surprised and sad to see defrag with a "defer"
option, but poor enabled without one; and it has crossed my mind that
perhaps the peculiar "madvise+defer" syntax in defrag might rather be
handled by "madvise" in defrag with "defer" in enabled?  Or something
like that: 4 x 4 possibilities instead of 5 x 3.

Please be gentle with me,
Hugh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm, thp: add new background defrag option
  2017-01-10  3:38               ` Hugh Dickins
@ 2017-01-10  8:44                 ` Vlastimil Babka
  2017-01-10 23:52                   ` David Rientjes
  0 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2017-01-10  8:44 UTC (permalink / raw)
  To: Hugh Dickins, David Rientjes
  Cc: Mel Gorman, Andrew Morton, Michal Hocko, Jonathan Corbet,
	Kirill A. Shutemov, linux-kernel, linux-mm

On 01/10/2017 04:38 AM, Hugh Dickins wrote:
> On Mon, 9 Jan 2017, David Rientjes wrote:
>> On Mon, 9 Jan 2017, Vlastimil Babka wrote:
>>
>>>> Any suggestions for a better name for "background" are more than welcome.  
>>>
>>> Why not just "madvise+defer"?
>>>
>>
>> Seeing no other activity regarding this issue (omg!), I'll wait a day or 
>> so to see if there are any objections to "madvise+defer" or suggestions 
>> that may be better and repost.
> 
> I get very confused by the /sys/kernel/mm/transparent_hugepage/defrag
> versus enabled flags, and this may be a terrible, even more confusing,
> idea: but I've been surprised and sad to see defrag with a "defer"
> option, but poor enabled without one; and it has crossed my mind that
> perhaps the peculiar "madvise+defer" syntax in defrag might rather be
> handled by "madvise" in defrag with "defer" in enabled?  Or something
> like that: 4 x 4 possibilities instead of 5 x 3.

But would all the possibilities make sense? For example, if I saw
"defer" in enabled, my first expectation would be that it would only use
khugepaged, and no THP page faults at all - possibly including madvised
regions.

If we really wanted really to cover the whole configuration space, we
would have files called "enable", "defrag", "enable-madvise",
"defrag-madvise" and each with possible values "yes", "no", "defer",
where "defer" for enable* files would mean to skip THP page fault
completely and defer to khugepaged, and "defer" for defrag* files would
mean wake up kswapd/kcompactd and skip direct reclaim/compaction.

But, too late for that :)

> 
> Please be gentle with me,
> Hugh
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm, thp: add new background defrag option
  2017-01-10  2:19             ` David Rientjes
  2017-01-10  3:38               ` Hugh Dickins
@ 2017-01-10 13:01               ` Michal Hocko
  2017-01-11  0:15               ` [patch v2] mm, thp: add new defer+madvise " David Rientjes
  2 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2017-01-10 13:01 UTC (permalink / raw)
  To: David Rientjes
  Cc: Vlastimil Babka, Mel Gorman, Andrew Morton, Jonathan Corbet,
	Kirill A. Shutemov, linux-kernel, linux-mm

On Mon 09-01-17 18:19:56, David Rientjes wrote:
> On Mon, 9 Jan 2017, Vlastimil Babka wrote:
> 
> > > Any suggestions for a better name for "background" are more than welcome.  
> > 
> > Why not just "madvise+defer"?
> > 
> 
> Seeing no other activity regarding this issue (omg!), I'll wait a day or 
> so to see if there are any objections to "madvise+defer" or suggestions 
> that may be better and repost.

madvise+defer is much better than background. So if the combined (flag
like approach) is too risky then I am OK with that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm, thp: add new background defrag option
  2017-01-10  8:44                 ` Vlastimil Babka
@ 2017-01-10 23:52                   ` David Rientjes
  0 siblings, 0 replies; 21+ messages in thread
From: David Rientjes @ 2017-01-10 23:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Mel Gorman, Andrew Morton, Michal Hocko,
	Jonathan Corbet, Kirill A. Shutemov, linux-kernel, linux-mm

On Tue, 10 Jan 2017, Vlastimil Babka wrote:

> > I get very confused by the /sys/kernel/mm/transparent_hugepage/defrag
> > versus enabled flags, and this may be a terrible, even more confusing,
> > idea: but I've been surprised and sad to see defrag with a "defer"
> > option, but poor enabled without one; and it has crossed my mind that
> > perhaps the peculiar "madvise+defer" syntax in defrag might rather be
> > handled by "madvise" in defrag with "defer" in enabled?  Or something
> > like that: 4 x 4 possibilities instead of 5 x 3.
> 
> But would all the possibilities make sense? For example, if I saw
> "defer" in enabled, my first expectation would be that it would only use
> khugepaged, and no THP page faults at all - possibly including madvised
> regions.
> 

And this is why I've tried to minimize the config requirements and respect 
userspace decisions to do MADV_HUGEPAGE, MADV_NOHUGEPAGE, or set/clear 
PR_SET_THP_DISABLE because all these system-wide options combined with 
userspace syscalls truly seems unmaintainable and waay too confusing to 
correctly describe.  Owell, I am fine with introducing 
yet-another-defrag-mode if it lets us move in a direction that supports 
our usecase.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch v2] mm, thp: add new defer+madvise defrag option
  2017-01-10  2:19             ` David Rientjes
  2017-01-10  3:38               ` Hugh Dickins
  2017-01-10 13:01               ` Michal Hocko
@ 2017-01-11  0:15               ` David Rientjes
  2017-01-11  7:35                 ` Vlastimil Babka
                                   ` (2 more replies)
  2 siblings, 3 replies; 21+ messages in thread
From: David Rientjes @ 2017-01-11  0:15 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Mel Gorman, Michal Hocko, Jonathan Corbet, Kirill A. Shutemov,
	linux-kernel, linux-mm

There is no thp defrag option that currently allows MADV_HUGEPAGE regions 
to do direct compaction and reclaim while all other thp allocations simply 
trigger kswapd and kcompactd in the background and fail immediately.

The "defer" setting simply triggers background reclaim and compaction for 
all regions, regardless of MADV_HUGEPAGE, which makes it unusable for our 
userspace where MADV_HUGEPAGE is being used to indicate the application is 
willing to wait for work for thp memory to be available.

The "madvise" setting will do direct compaction and reclaim for these
MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the 
background for anybody else.

For reasonable usage, there needs to be a mesh between the two options.  
This patch introduces a fifth mode, "defer+madvise", that will do direct 
reclaim and compaction for MADV_HUGEPAGE regions and trigger background 
reclaim and compaction for everybody else so that hugepages may be 
available in the near future.

A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE 
regions as part of the "defer" mode, making it a very powerful setting and 
avoids breaking userspace, was offered: 
http://marc.info/?t=148236612700003.  This additional mode is a 
compromise.

A second proposal to allow both "defer" and "madvise" to be selected at
the same time was also offered: http://marc.info/?t=148357345300001.
This is possible, but there was a concern that it might break existing
userspaces the parse the output of the defrag mode, so the fifth option
was introduced instead.

This patch also cleans up the helper function for storing to "enabled" 
and "defrag" since the former supports three modes while the latter 
supports five and triple_flag_store() was getting unnecessarily messy.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 v2: uses new naming suggested by Vlastimil
     (defer+madvise order looks better in
      "... defer defer+madvise madvise ...")

 v1 was acked by Mel, and it probably could have been preserved but it was
 removed in case there is an issue with the name change.

 Documentation/vm/transhuge.txt |   8 ++-
 include/linux/huge_mm.h        |   1 +
 mm/huge_memory.c               | 146 +++++++++++++++++++++--------------------
 3 files changed, 82 insertions(+), 73 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -110,6 +110,7 @@ MADV_HUGEPAGE region.
 
 echo always >/sys/kernel/mm/transparent_hugepage/defrag
 echo defer >/sys/kernel/mm/transparent_hugepage/defrag
+echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
 echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
 echo never >/sys/kernel/mm/transparent_hugepage/defrag
 
@@ -120,10 +121,15 @@ that benefit heavily from THP use and are willing to delay the VM start
 to utilise them.
 
 "defer" means that an application will wake kswapd in the background
-to reclaim pages and wake kcompact to compact memory so that THP is
+to reclaim pages and wake kcompactd to compact memory so that THP is
 available in the near future. It's the responsibility of khugepaged
 to then install the THP pages later.
 
+"defer+madvise" will enter direct reclaim and compaction like "always", but
+only for regions that have used madvise(MADV_HUGEPAGE); all other regions
+will wake kswapd in the background to reclaim pages and wake kcompactd to
+compact memory so that THP is available in the near future.
+
 "madvise" will enter direct reclaim like "always" but only for regions
 that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -33,6 +33,7 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
 	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -142,42 +142,6 @@ static struct shrinker huge_zero_page_shrinker = {
 };
 
 #ifdef CONFIG_SYSFS
-
-static ssize_t triple_flag_store(struct kobject *kobj,
-				 struct kobj_attribute *attr,
-				 const char *buf, size_t count,
-				 enum transparent_hugepage_flag enabled,
-				 enum transparent_hugepage_flag deferred,
-				 enum transparent_hugepage_flag req_madv)
-{
-	if (!memcmp("defer", buf,
-		    min(sizeof("defer")-1, count))) {
-		if (enabled == deferred)
-			return -EINVAL;
-		clear_bit(enabled, &transparent_hugepage_flags);
-		clear_bit(req_madv, &transparent_hugepage_flags);
-		set_bit(deferred, &transparent_hugepage_flags);
-	} else if (!memcmp("always", buf,
-		    min(sizeof("always")-1, count))) {
-		clear_bit(deferred, &transparent_hugepage_flags);
-		clear_bit(req_madv, &transparent_hugepage_flags);
-		set_bit(enabled, &transparent_hugepage_flags);
-	} else if (!memcmp("madvise", buf,
-			   min(sizeof("madvise")-1, count))) {
-		clear_bit(enabled, &transparent_hugepage_flags);
-		clear_bit(deferred, &transparent_hugepage_flags);
-		set_bit(req_madv, &transparent_hugepage_flags);
-	} else if (!memcmp("never", buf,
-			   min(sizeof("never")-1, count))) {
-		clear_bit(enabled, &transparent_hugepage_flags);
-		clear_bit(req_madv, &transparent_hugepage_flags);
-		clear_bit(deferred, &transparent_hugepage_flags);
-	} else
-		return -EINVAL;
-
-	return count;
-}
-
 static ssize_t enabled_show(struct kobject *kobj,
 			    struct kobj_attribute *attr, char *buf)
 {
@@ -193,19 +157,28 @@ static ssize_t enabled_store(struct kobject *kobj,
 			     struct kobj_attribute *attr,
 			     const char *buf, size_t count)
 {
-	ssize_t ret;
+	ssize_t ret = count;
 
-	ret = triple_flag_store(kobj, attr, buf, count,
-				TRANSPARENT_HUGEPAGE_FLAG,
-				TRANSPARENT_HUGEPAGE_FLAG,
-				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+	if (!memcmp("always", buf,
+		    min(sizeof("always")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("madvise", buf,
+			   min(sizeof("madvise")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("never", buf,
+			   min(sizeof("never")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+	} else
+		ret = -EINVAL;
 
 	if (ret > 0) {
 		int err = start_stop_khugepaged();
 		if (err)
 			ret = err;
 	}
-
 	return ret;
 }
 static struct kobj_attribute enabled_attr =
@@ -241,32 +214,58 @@ ssize_t single_hugepage_flag_store(struct kobject *kobj,
 	return count;
 }
 
-/*
- * Currently defrag only disables __GFP_NOWAIT for allocation. A blind
- * __GFP_REPEAT is too aggressive, it's never worth swapping tons of
- * memory just to allocate one more hugepage.
- */
 static ssize_t defrag_show(struct kobject *kobj,
 			   struct kobj_attribute *attr, char *buf)
 {
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return sprintf(buf, "[always] defer madvise never\n");
+		return sprintf(buf, "[always] defer defer+madvise madvise never\n");
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return sprintf(buf, "always [defer] madvise never\n");
-	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return sprintf(buf, "always defer [madvise] never\n");
-	else
-		return sprintf(buf, "always defer madvise [never]\n");
-
+		return sprintf(buf, "always [defer] defer+madvise madvise never\n");
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
+		return sprintf(buf, "always defer [defer+madvise] madvise never\n");
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
+		return sprintf(buf, "always defer defer+madvise [madvise] never\n");
+	return sprintf(buf, "always defer defer+madvise madvise [never]\n");
 }
+
 static ssize_t defrag_store(struct kobject *kobj,
 			    struct kobj_attribute *attr,
 			    const char *buf, size_t count)
 {
-	return triple_flag_store(kobj, attr, buf, count,
-				 TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
-				 TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
-				 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+	if (!memcmp("always", buf,
+		    min(sizeof("always")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("defer", buf,
+		    min(sizeof("defer")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("defer+madvise", buf,
+		    min(sizeof("defer+madvise")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("madvise", buf,
+			   min(sizeof("madvise")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("never", buf,
+			   min(sizeof("never")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
+	} else
+		return -EINVAL;
+
+	return count;
 }
 static struct kobj_attribute defrag_attr =
 	__ATTR(defrag, 0644, defrag_show, defrag_store);
@@ -612,25 +611,28 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 }
 
 /*
- * If THP defrag is set to always then directly reclaim/compact as necessary
- * If set to defer then do only background reclaim/compact and defer to khugepaged
- * If set to madvise and the VMA is flagged then directly reclaim/compact
- * When direct reclaim/compact is allowed, don't retry except for flagged VMA's
+ * always: directly stall for all thp allocations
+ * defer: wake kswapd and fail if not immediately available
+ * defer+madvise: wake kswapd and directly stall for MADV_HUGEPAGE, otherwise
+ *		  fail if not immediately available
+ * madvise: directly stall for MADV_HUGEPAGE, otherwise fail if not immediately
+ *	    available
+ * never: never stall for any thp allocation
  */
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
-	bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
 
-	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
-				&transparent_hugepage_flags) && vma_madvised)
-		return GFP_TRANSHUGE;
-	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
-						&transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
-	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
-						&transparent_hugepage_flags))
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
-
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
+		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
+							     __GFP_KSWAPD_RECLAIM);
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
+		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
+							     0);
 	return GFP_TRANSHUGE_LIGHT;
 }
 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch v2] mm, thp: add new defer+madvise defrag option
  2017-01-11  0:15               ` [patch v2] mm, thp: add new defer+madvise " David Rientjes
@ 2017-01-11  7:35                 ` Vlastimil Babka
  2017-01-12  8:01                   ` Michal Hocko
  2017-01-11  8:56                 ` Mel Gorman
  2017-01-12  0:16                 ` Andrew Morton
  2 siblings, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2017-01-11  7:35 UTC (permalink / raw)
  To: David Rientjes, Andrew Morton
  Cc: Mel Gorman, Michal Hocko, Jonathan Corbet, Kirill A. Shutemov,
	linux-kernel, linux-mm, Linux API

[+CC linux-api]

On 01/11/2017 01:15 AM, David Rientjes wrote:
> There is no thp defrag option that currently allows MADV_HUGEPAGE regions 
> to do direct compaction and reclaim while all other thp allocations simply 
> trigger kswapd and kcompactd in the background and fail immediately.
> 
> The "defer" setting simply triggers background reclaim and compaction for 
> all regions, regardless of MADV_HUGEPAGE, which makes it unusable for our 
> userspace where MADV_HUGEPAGE is being used to indicate the application is 
> willing to wait for work for thp memory to be available.
> 
> The "madvise" setting will do direct compaction and reclaim for these
> MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the 
> background for anybody else.
> 
> For reasonable usage, there needs to be a mesh between the two options.  
> This patch introduces a fifth mode, "defer+madvise", that will do direct 
> reclaim and compaction for MADV_HUGEPAGE regions and trigger background 
> reclaim and compaction for everybody else so that hugepages may be 
> available in the near future.
> 
> A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE 
> regions as part of the "defer" mode, making it a very powerful setting and 
> avoids breaking userspace, was offered: 
> http://marc.info/?t=148236612700003.  This additional mode is a 
> compromise.
> 
> A second proposal to allow both "defer" and "madvise" to be selected at
> the same time was also offered: http://marc.info/?t=148357345300001.
> This is possible, but there was a concern that it might break existing
> userspaces the parse the output of the defrag mode, so the fifth option
> was introduced instead.
> 
> This patch also cleans up the helper function for storing to "enabled" 
> and "defrag" since the former supports three modes while the latter 
> supports five and triple_flag_store() was getting unnecessarily messy.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

alloc_hugepage_direct_gfpmask() would have been IMHO simpler if a new
internal flag wasn't added, and combination of two existing for defer
and madvise used, but whatever, I won't nak the patch over that.

> ---
>  v2: uses new naming suggested by Vlastimil
>      (defer+madvise order looks better in
>       "... defer defer+madvise madvise ...")

OK.

>  v1 was acked by Mel, and it probably could have been preserved but it was
>  removed in case there is an issue with the name change.
> 
>  Documentation/vm/transhuge.txt |   8 ++-
>  include/linux/huge_mm.h        |   1 +
>  mm/huge_memory.c               | 146 +++++++++++++++++++++--------------------
>  3 files changed, 82 insertions(+), 73 deletions(-)
> 
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -110,6 +110,7 @@ MADV_HUGEPAGE region.
>  
>  echo always >/sys/kernel/mm/transparent_hugepage/defrag
>  echo defer >/sys/kernel/mm/transparent_hugepage/defrag
> +echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
>  echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
>  echo never >/sys/kernel/mm/transparent_hugepage/defrag
>  
> @@ -120,10 +121,15 @@ that benefit heavily from THP use and are willing to delay the VM start
>  to utilise them.
>  
>  "defer" means that an application will wake kswapd in the background
> -to reclaim pages and wake kcompact to compact memory so that THP is
> +to reclaim pages and wake kcompactd to compact memory so that THP is
>  available in the near future. It's the responsibility of khugepaged
>  to then install the THP pages later.
>  
> +"defer+madvise" will enter direct reclaim and compaction like "always", but
> +only for regions that have used madvise(MADV_HUGEPAGE); all other regions
> +will wake kswapd in the background to reclaim pages and wake kcompactd to
> +compact memory so that THP is available in the near future.
> +
>  "madvise" will enter direct reclaim like "always" but only for regions
>  that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.
>  
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -33,6 +33,7 @@ enum transparent_hugepage_flag {
>  	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>  	TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
>  	TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> +	TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
>  	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
>  	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
>  	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -142,42 +142,6 @@ static struct shrinker huge_zero_page_shrinker = {
>  };
>  
>  #ifdef CONFIG_SYSFS
> -
> -static ssize_t triple_flag_store(struct kobject *kobj,
> -				 struct kobj_attribute *attr,
> -				 const char *buf, size_t count,
> -				 enum transparent_hugepage_flag enabled,
> -				 enum transparent_hugepage_flag deferred,
> -				 enum transparent_hugepage_flag req_madv)
> -{
> -	if (!memcmp("defer", buf,
> -		    min(sizeof("defer")-1, count))) {
> -		if (enabled == deferred)
> -			return -EINVAL;
> -		clear_bit(enabled, &transparent_hugepage_flags);
> -		clear_bit(req_madv, &transparent_hugepage_flags);
> -		set_bit(deferred, &transparent_hugepage_flags);
> -	} else if (!memcmp("always", buf,
> -		    min(sizeof("always")-1, count))) {
> -		clear_bit(deferred, &transparent_hugepage_flags);
> -		clear_bit(req_madv, &transparent_hugepage_flags);
> -		set_bit(enabled, &transparent_hugepage_flags);
> -	} else if (!memcmp("madvise", buf,
> -			   min(sizeof("madvise")-1, count))) {
> -		clear_bit(enabled, &transparent_hugepage_flags);
> -		clear_bit(deferred, &transparent_hugepage_flags);
> -		set_bit(req_madv, &transparent_hugepage_flags);
> -	} else if (!memcmp("never", buf,
> -			   min(sizeof("never")-1, count))) {
> -		clear_bit(enabled, &transparent_hugepage_flags);
> -		clear_bit(req_madv, &transparent_hugepage_flags);
> -		clear_bit(deferred, &transparent_hugepage_flags);
> -	} else
> -		return -EINVAL;
> -
> -	return count;
> -}
> -
>  static ssize_t enabled_show(struct kobject *kobj,
>  			    struct kobj_attribute *attr, char *buf)
>  {
> @@ -193,19 +157,28 @@ static ssize_t enabled_store(struct kobject *kobj,
>  			     struct kobj_attribute *attr,
>  			     const char *buf, size_t count)
>  {
> -	ssize_t ret;
> +	ssize_t ret = count;
>  
> -	ret = triple_flag_store(kobj, attr, buf, count,
> -				TRANSPARENT_HUGEPAGE_FLAG,
> -				TRANSPARENT_HUGEPAGE_FLAG,
> -				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> +	if (!memcmp("always", buf,
> +		    min(sizeof("always")-1, count))) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
> +		set_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
> +	} else if (!memcmp("madvise", buf,
> +			   min(sizeof("madvise")-1, count))) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
> +		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
> +	} else if (!memcmp("never", buf,
> +			   min(sizeof("never")-1, count))) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
> +	} else
> +		ret = -EINVAL;
>  
>  	if (ret > 0) {
>  		int err = start_stop_khugepaged();
>  		if (err)
>  			ret = err;
>  	}
> -
>  	return ret;
>  }
>  static struct kobj_attribute enabled_attr =
> @@ -241,32 +214,58 @@ ssize_t single_hugepage_flag_store(struct kobject *kobj,
>  	return count;
>  }
>  
> -/*
> - * Currently defrag only disables __GFP_NOWAIT for allocation. A blind
> - * __GFP_REPEAT is too aggressive, it's never worth swapping tons of
> - * memory just to allocate one more hugepage.
> - */
>  static ssize_t defrag_show(struct kobject *kobj,
>  			   struct kobj_attribute *attr, char *buf)
>  {
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		return sprintf(buf, "[always] defer madvise never\n");
> +		return sprintf(buf, "[always] defer defer+madvise madvise never\n");
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> -		return sprintf(buf, "always [defer] madvise never\n");
> -	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> -		return sprintf(buf, "always defer [madvise] never\n");
> -	else
> -		return sprintf(buf, "always defer madvise [never]\n");
> -
> +		return sprintf(buf, "always [defer] defer+madvise madvise never\n");
> +	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
> +		return sprintf(buf, "always defer [defer+madvise] madvise never\n");
> +	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> +		return sprintf(buf, "always defer defer+madvise [madvise] never\n");
> +	return sprintf(buf, "always defer defer+madvise madvise [never]\n");
>  }
> +
>  static ssize_t defrag_store(struct kobject *kobj,
>  			    struct kobj_attribute *attr,
>  			    const char *buf, size_t count)
>  {
> -	return triple_flag_store(kobj, attr, buf, count,
> -				 TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
> -				 TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> -				 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> +	if (!memcmp("always", buf,
> +		    min(sizeof("always")-1, count))) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
> +		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
> +	} else if (!memcmp("defer", buf,
> +		    min(sizeof("defer")-1, count))) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
> +		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
> +	} else if (!memcmp("defer+madvise", buf,
> +		    min(sizeof("defer+madvise")-1, count))) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
> +		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
> +	} else if (!memcmp("madvise", buf,
> +			   min(sizeof("madvise")-1, count))) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
> +		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
> +	} else if (!memcmp("never", buf,
> +			   min(sizeof("never")-1, count))) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
> +	} else
> +		return -EINVAL;
> +
> +	return count;
>  }
>  static struct kobj_attribute defrag_attr =
>  	__ATTR(defrag, 0644, defrag_show, defrag_store);
> @@ -612,25 +611,28 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
>  }
>  
>  /*
> - * If THP defrag is set to always then directly reclaim/compact as necessary
> - * If set to defer then do only background reclaim/compact and defer to khugepaged
> - * If set to madvise and the VMA is flagged then directly reclaim/compact
> - * When direct reclaim/compact is allowed, don't retry except for flagged VMA's
> + * always: directly stall for all thp allocations
> + * defer: wake kswapd and fail if not immediately available
> + * defer+madvise: wake kswapd and directly stall for MADV_HUGEPAGE, otherwise
> + *		  fail if not immediately available
> + * madvise: directly stall for MADV_HUGEPAGE, otherwise fail if not immediately
> + *	    available
> + * never: never stall for any thp allocation
>   */
>  static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>  {
> -	bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>  
> -	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> -				&transparent_hugepage_flags) && vma_madvised)
> -		return GFP_TRANSHUGE;
> -	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> -						&transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> -	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
> -						&transparent_hugepage_flags))
> +	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> -
> +	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> +	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
> +		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> +							     __GFP_KSWAPD_RECLAIM);
> +	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> +		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> +							     0);
>  	return GFP_TRANSHUGE_LIGHT;
>  }
>  
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch v2] mm, thp: add new defer+madvise defrag option
  2017-01-11  0:15               ` [patch v2] mm, thp: add new defer+madvise " David Rientjes
  2017-01-11  7:35                 ` Vlastimil Babka
@ 2017-01-11  8:56                 ` Mel Gorman
  2017-01-12  0:16                 ` Andrew Morton
  2 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2017-01-11  8:56 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Vlastimil Babka, Michal Hocko, Jonathan Corbet,
	Kirill A. Shutemov, linux-kernel, linux-mm

On Tue, Jan 10, 2017 at 04:15:27PM -0800, David Rientjes wrote:
> There is no thp defrag option that currently allows MADV_HUGEPAGE regions 
> to do direct compaction and reclaim while all other thp allocations simply 
> trigger kswapd and kcompactd in the background and fail immediately.
> 
> The "defer" setting simply triggers background reclaim and compaction for 
> all regions, regardless of MADV_HUGEPAGE, which makes it unusable for our 
> userspace where MADV_HUGEPAGE is being used to indicate the application is 
> willing to wait for work for thp memory to be available.
> 
> The "madvise" setting will do direct compaction and reclaim for these
> MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the 
> background for anybody else.
> 
> For reasonable usage, there needs to be a mesh between the two options.  
> This patch introduces a fifth mode, "defer+madvise", that will do direct 
> reclaim and compaction for MADV_HUGEPAGE regions and trigger background 
> reclaim and compaction for everybody else so that hugepages may be 
> available in the near future.
> 
> A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE 
> regions as part of the "defer" mode, making it a very powerful setting and 
> avoids breaking userspace, was offered: 
> http://marc.info/?t=148236612700003.  This additional mode is a 
> compromise.
> 
> A second proposal to allow both "defer" and "madvise" to be selected at
> the same time was also offered: http://marc.info/?t=148357345300001.
> This is possible, but there was a concern that it might break existing
> userspaces the parse the output of the defrag mode, so the fifth option
> was introduced instead.
> 
> This patch also cleans up the helper function for storing to "enabled" 
> and "defrag" since the former supports three modes while the latter 
> supports five and triple_flag_store() was getting unnecessarily messy.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  v2: uses new naming suggested by Vlastimil
>      (defer+madvise order looks better in
>       "... defer defer+madvise madvise ...")
> 
>  v1 was acked by Mel, and it probably could have been preserved but it was
>  removed in case there is an issue with the name change.
> 

There isn't

Acked-by: Mel Gorman <mgorman@techsingularity.net>

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch v2] mm, thp: add new defer+madvise defrag option
  2017-01-11  0:15               ` [patch v2] mm, thp: add new defer+madvise " David Rientjes
  2017-01-11  7:35                 ` Vlastimil Babka
  2017-01-11  8:56                 ` Mel Gorman
@ 2017-01-12  0:16                 ` Andrew Morton
  2 siblings, 0 replies; 21+ messages in thread
From: Andrew Morton @ 2017-01-12  0:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Vlastimil Babka, Mel Gorman, Michal Hocko, Jonathan Corbet,
	Kirill A. Shutemov, linux-kernel, linux-mm

On Tue, 10 Jan 2017 16:15:27 -0800 (PST) David Rientjes <rientjes@google.com> wrote:

> There is no thp defrag option that currently allows MADV_HUGEPAGE regions 
> to do direct compaction and reclaim while all other thp allocations simply 
> trigger kswapd and kcompactd in the background and fail immediately.
> 
> The "defer" setting simply triggers background reclaim and compaction for 
> all regions, regardless of MADV_HUGEPAGE, which makes it unusable for our 
> userspace where MADV_HUGEPAGE is being used to indicate the application is 
> willing to wait for work for thp memory to be available.
> 
> The "madvise" setting will do direct compaction and reclaim for these
> MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the 
> background for anybody else.
> 
> For reasonable usage, there needs to be a mesh between the two options.  
> This patch introduces a fifth mode, "defer+madvise", that will do direct 
> reclaim and compaction for MADV_HUGEPAGE regions and trigger background 
> reclaim and compaction for everybody else so that hugepages may be 
> available in the near future.
> 
> A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE 
> regions as part of the "defer" mode, making it a very powerful setting and 
> avoids breaking userspace, was offered: 
> http://marc.info/?t=148236612700003.  This additional mode is a 
> compromise.
> 
> A second proposal to allow both "defer" and "madvise" to be selected at
> the same time was also offered: http://marc.info/?t=148357345300001.
> This is possible, but there was a concern that it might break existing
> userspaces the parse the output of the defrag mode, so the fifth option
> was introduced instead.
> 
> This patch also cleans up the helper function for storing to "enabled" 
> and "defrag" since the former supports three modes while the latter 
> supports five and triple_flag_store() was getting unnecessarily messy.
> 
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -110,6 +110,7 @@ MADV_HUGEPAGE region.
>  
>  echo always >/sys/kernel/mm/transparent_hugepage/defrag
>  echo defer >/sys/kernel/mm/transparent_hugepage/defrag
> +echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
>  echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
>  echo never >/sys/kernel/mm/transparent_hugepage/defrag
>  
> @@ -120,10 +121,15 @@ that benefit heavily from THP use and are willing to delay the VM start
>  to utilise them.
>  
>  "defer" means that an application will wake kswapd in the background
> -to reclaim pages and wake kcompact to compact memory so that THP is
> +to reclaim pages and wake kcompactd to compact memory so that THP is
>  available in the near future. It's the responsibility of khugepaged
>  to then install the THP pages later.
>  
> +"defer+madvise" will enter direct reclaim and compaction like "always", but
> +only for regions that have used madvise(MADV_HUGEPAGE); all other regions
> +will wake kswapd in the background to reclaim pages and wake kcompactd to
> +compact memory so that THP is available in the near future.
> +

It would be helpful if this text were to tell the reader why they may
choose to use this option: runtime effects, advantages, when-to-use,
when-not-to-use, etc.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch v2] mm, thp: add new defer+madvise defrag option
  2017-01-11  7:35                 ` Vlastimil Babka
@ 2017-01-12  8:01                   ` Michal Hocko
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2017-01-12  8:01 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, Andrew Morton, Mel Gorman, Jonathan Corbet,
	Kirill A. Shutemov, linux-kernel, linux-mm, Linux API

On Wed 11-01-17 08:35:27, Vlastimil Babka wrote:
> [+CC linux-api]
> 
> On 01/11/2017 01:15 AM, David Rientjes wrote:
> > There is no thp defrag option that currently allows MADV_HUGEPAGE regions 
> > to do direct compaction and reclaim while all other thp allocations simply 
> > trigger kswapd and kcompactd in the background and fail immediately.
> > 
> > The "defer" setting simply triggers background reclaim and compaction for 
> > all regions, regardless of MADV_HUGEPAGE, which makes it unusable for our 
> > userspace where MADV_HUGEPAGE is being used to indicate the application is 
> > willing to wait for work for thp memory to be available.
> > 
> > The "madvise" setting will do direct compaction and reclaim for these
> > MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the 
> > background for anybody else.
> > 
> > For reasonable usage, there needs to be a mesh between the two options.  
> > This patch introduces a fifth mode, "defer+madvise", that will do direct 
> > reclaim and compaction for MADV_HUGEPAGE regions and trigger background 
> > reclaim and compaction for everybody else so that hugepages may be 
> > available in the near future.
> > 
> > A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE 
> > regions as part of the "defer" mode, making it a very powerful setting and 
> > avoids breaking userspace, was offered: 
> > http://marc.info/?t=148236612700003.  This additional mode is a 
> > compromise.
> > 
> > A second proposal to allow both "defer" and "madvise" to be selected at
> > the same time was also offered: http://marc.info/?t=148357345300001.
> > This is possible, but there was a concern that it might break existing
> > userspaces the parse the output of the defrag mode, so the fifth option
> > was introduced instead.
> > 
> > This patch also cleans up the helper function for storing to "enabled" 
> > and "defrag" since the former supports three modes while the latter 
> > supports five and triple_flag_store() was getting unnecessarily messy.
> > 
> > Signed-off-by: David Rientjes <rientjes@google.com>
> 
> alloc_hugepage_direct_gfpmask() would have been IMHO simpler if a new
> internal flag wasn't added, and combination of two existing for defer
> and madvise used,

I agree with Vlastimil here. The patch can do without touching anything
outside of the sysfs handling.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-01-12  8:01 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-04 23:41 [patch] mm, thp: add new background defrag option David Rientjes
2017-01-05 10:13 ` Mel Gorman
2017-01-05 10:33   ` Michal Hocko
2017-01-05 13:58   ` Vlastimil Babka
2017-01-05 15:50     ` Michal Hocko
2017-01-05 22:54     ` David Rientjes
2017-01-06  8:41       ` Vlastimil Babka
2017-01-06 14:01         ` Michal Hocko
2017-01-06 22:20         ` David Rientjes
2017-01-09 10:04           ` Vlastimil Babka
2017-01-09 12:06             ` Vlastimil Babka
2017-01-10  2:19             ` David Rientjes
2017-01-10  3:38               ` Hugh Dickins
2017-01-10  8:44                 ` Vlastimil Babka
2017-01-10 23:52                   ` David Rientjes
2017-01-10 13:01               ` Michal Hocko
2017-01-11  0:15               ` [patch v2] mm, thp: add new defer+madvise " David Rientjes
2017-01-11  7:35                 ` Vlastimil Babka
2017-01-12  8:01                   ` Michal Hocko
2017-01-11  8:56                 ` Mel Gorman
2017-01-12  0:16                 ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).