linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
@ 2016-12-22  0:21 David Rientjes
  2016-12-22  8:31 ` Kirill A. Shutemov
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: David Rientjes @ 2016-12-22  0:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Kirill A. Shutemov, Vlastimil Babka, Mel Gorman,
	linux-kernel, linux-mm

Currently, when defrag is set to "madvise", thp allocations will direct
reclaim.  However, when defrag is set to "defer", all thp allocations do
not attempt reclaim regardless of MADV_HUGEPAGE.

This patch always directly reclaims for MADV_HUGEPAGE regions when defrag
is not set to "never."  The idea is that MADV_HUGEPAGE regions really
want to be backed by hugepages and are willing to endure the latency at
fault as it was the default behavior prior to commit 444eb2a449ef ("mm:
thp: set THP defrag by default to madvise and add a stall-free defrag
option").

In this form, "defer" is a stronger, more heavyweight version of
"madvise".

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/vm/transhuge.txt |  7 +++++--
 mm/huge_memory.c               | 10 ++++++----
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -121,8 +121,11 @@ to utilise them.
 
 "defer" means that an application will wake kswapd in the background
 to reclaim pages and wake kcompact to compact memory so that THP is
-available in the near future. It's the responsibility of khugepaged
-to then install the THP pages later.
+available in the near future, unless it is for a region where
+madvise(MADV_HUGEPAGE) has been used, in which case direct reclaim will be
+used. Kcompactd will attempt to make hugepages available for allocation in
+the near future and khugepaged will try to collapse existing memory into
+hugepages later.
 
 "madvise" will enter direct reclaim like "always" but only for regions
 that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -619,15 +619,17 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
  */
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
-	bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
 				&transparent_hugepage_flags) && vma_madvised)
 		return GFP_TRANSHUGE;
 	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
-						&transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
-	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
+						&transparent_hugepage_flags)) {
+		return GFP_TRANSHUGE_LIGHT |
+		       (vma_madvised ? __GFP_DIRECT_RECLAIM :
+				       __GFP_KSWAPD_RECLAIM);
+	} else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
 						&transparent_hugepage_flags))
 		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-22  0:21 [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred David Rientjes
@ 2016-12-22  8:31 ` Kirill A. Shutemov
  2016-12-22 10:00 ` Michal Hocko
  2017-01-02  8:38 ` Vlastimil Babka
  2 siblings, 0 replies; 29+ messages in thread
From: Kirill A. Shutemov @ 2016-12-22  8:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm

On Wed, Dec 21, 2016 at 04:21:54PM -0800, David Rientjes wrote:
> Currently, when defrag is set to "madvise", thp allocations will direct
> reclaim.  However, when defrag is set to "defer", all thp allocations do
> not attempt reclaim regardless of MADV_HUGEPAGE.
> 
> This patch always directly reclaims for MADV_HUGEPAGE regions when defrag
> is not set to "never."  The idea is that MADV_HUGEPAGE regions really
> want to be backed by hugepages and are willing to endure the latency at
> fault as it was the default behavior prior to commit 444eb2a449ef ("mm:
> thp: set THP defrag by default to madvise and add a stall-free defrag
> option").
> 
> In this form, "defer" is a stronger, more heavyweight version of
> "madvise".
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

Makes senses to me.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-22  0:21 [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred David Rientjes
  2016-12-22  8:31 ` Kirill A. Shutemov
@ 2016-12-22 10:00 ` Michal Hocko
  2016-12-22 21:05   ` David Rientjes
  2017-01-02  8:38 ` Vlastimil Babka
  2 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-12-22 10:00 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm

On Wed 21-12-16 16:21:54, David Rientjes wrote:
> Currently, when defrag is set to "madvise", thp allocations will direct
> reclaim.  However, when defrag is set to "defer", all thp allocations do
> not attempt reclaim regardless of MADV_HUGEPAGE.
> 
> This patch always directly reclaims for MADV_HUGEPAGE regions when defrag
> is not set to "never."  The idea is that MADV_HUGEPAGE regions really
> want to be backed by hugepages and are willing to endure the latency at
> fault as it was the default behavior prior to commit 444eb2a449ef ("mm:
> thp: set THP defrag by default to madvise and add a stall-free defrag
> option").

AFAIR "defer" is implemented exactly as intended. To offer a never-stall
but allow to form THP in the background option. The patch description
doesn't explain why this is not good anymore. Could you give us more
details about the motivation and why "madvise" doesn't work for
you? This is a user visible change so the reason should better be really
documented and strong.

I am worried that after this patch we really lose a lightweight option
which guarantees _no_ THP specific stalls during the page fault which
was the biggest pain in the past.

> In this form, "defer" is a stronger, more heavyweight version of
> "madvise".
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/vm/transhuge.txt |  7 +++++--
>  mm/huge_memory.c               | 10 ++++++----
>  2 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -121,8 +121,11 @@ to utilise them.
>  
>  "defer" means that an application will wake kswapd in the background
>  to reclaim pages and wake kcompact to compact memory so that THP is
> -available in the near future. It's the responsibility of khugepaged
> -to then install the THP pages later.
> +available in the near future, unless it is for a region where
> +madvise(MADV_HUGEPAGE) has been used, in which case direct reclaim will be
> +used. Kcompactd will attempt to make hugepages available for allocation in
> +the near future and khugepaged will try to collapse existing memory into
> +hugepages later.
>  
>  "madvise" will enter direct reclaim like "always" but only for regions
>  that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -619,15 +619,17 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
>   */
>  static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>  {
> -	bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>  
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
>  				&transparent_hugepage_flags) && vma_madvised)
>  		return GFP_TRANSHUGE;
>  	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> -						&transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> -	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
> +						&transparent_hugepage_flags)) {
> +		return GFP_TRANSHUGE_LIGHT |
> +		       (vma_madvised ? __GFP_DIRECT_RECLAIM :
> +				       __GFP_KSWAPD_RECLAIM);
> +	} else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
>  						&transparent_hugepage_flags))
>  		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
>  
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-22 10:00 ` Michal Hocko
@ 2016-12-22 21:05   ` David Rientjes
  2016-12-23  8:51     ` Michal Hocko
  2016-12-30 12:36     ` Mel Gorman
  0 siblings, 2 replies; 29+ messages in thread
From: David Rientjes @ 2016-12-22 21:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm

On Thu, 22 Dec 2016, Michal Hocko wrote:

> > Currently, when defrag is set to "madvise", thp allocations will direct
> > reclaim.  However, when defrag is set to "defer", all thp allocations do
> > not attempt reclaim regardless of MADV_HUGEPAGE.
> > 
> > This patch always directly reclaims for MADV_HUGEPAGE regions when defrag
> > is not set to "never."  The idea is that MADV_HUGEPAGE regions really
> > want to be backed by hugepages and are willing to endure the latency at
> > fault as it was the default behavior prior to commit 444eb2a449ef ("mm:
> > thp: set THP defrag by default to madvise and add a stall-free defrag
> > option").
> 
> AFAIR "defer" is implemented exactly as intended. To offer a never-stall
> but allow to form THP in the background option. The patch description
> doesn't explain why this is not good anymore. Could you give us more
> details about the motivation and why "madvise" doesn't work for
> you? This is a user visible change so the reason should better be really
> documented and strong.
> 

The offering of defer breaks backwards compatibility with previous 
settings of defrag=madvise, where we could set madvise(MADV_HUGEPAGE) on 
.text segment remap and try to force thp backing if available but not 
directly reclaim for non VM_HUGEPAGE vmas.  This was very advantageous.  
We prefer that to stay unchanged and allow kcompactd compaction to be 
triggered in background by everybody else as opposed to direct reclaim.  
We do not have that ability without this patch.

Without this patch, we will be forced to offer multiple sysfs tunables to 
define (1) direct vs background compact, (2) madvise behavior, (3) always, 
(4) never and we cannot have 2^4 settings for "defrag" alone.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-22 21:05   ` David Rientjes
@ 2016-12-23  8:51     ` Michal Hocko
  2016-12-23 10:01       ` David Rientjes
  2016-12-30 12:36     ` Mel Gorman
  1 sibling, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-12-23  8:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm

On Thu 22-12-16 13:05:27, David Rientjes wrote:
> On Thu, 22 Dec 2016, Michal Hocko wrote:
> 
> > > Currently, when defrag is set to "madvise", thp allocations will direct
> > > reclaim.  However, when defrag is set to "defer", all thp allocations do
> > > not attempt reclaim regardless of MADV_HUGEPAGE.
> > > 
> > > This patch always directly reclaims for MADV_HUGEPAGE regions when defrag
> > > is not set to "never."  The idea is that MADV_HUGEPAGE regions really
> > > want to be backed by hugepages and are willing to endure the latency at
> > > fault as it was the default behavior prior to commit 444eb2a449ef ("mm:
> > > thp: set THP defrag by default to madvise and add a stall-free defrag
> > > option").
> > 
> > AFAIR "defer" is implemented exactly as intended. To offer a never-stall
> > but allow to form THP in the background option. The patch description
> > doesn't explain why this is not good anymore. Could you give us more
> > details about the motivation and why "madvise" doesn't work for
> > you? This is a user visible change so the reason should better be really
> > documented and strong.
> > 
> 
> The offering of defer breaks backwards compatibility with previous 
> settings of defrag=madvise, where we could set madvise(MADV_HUGEPAGE) on 
> .text segment remap and try to force thp backing if available but not 
> directly reclaim for non VM_HUGEPAGE vmas.

I do not understand the backwards compatibility issue part here. Maybe I
am missing something but the semantic of defrag=madvise hasn't changed
and a new flag can hardly break backward compatibility.

> This was very advantageous.  
> We prefer that to stay unchanged and allow kcompactd compaction to be 
> triggered in background by everybody else as opposed to direct reclaim.  
> We do not have that ability without this patch.

So why don't you use defrag=madvise?

> Without this patch, we will be forced to offer multiple sysfs tunables to 
> define (1) direct vs background compact, (2) madvise behavior, (3) always, 
> (4) never and we cannot have 2^4 settings for "defrag" alone.

I disagree. I think the current set of defrag values should be
sufficient. We can completely disable direct reclaim, enable it only for
opt-in, enable for all and never allow to stall. The advantage of this
set of values is that they have _clear_ semantic and behave
consistently. If you change defer to "almost never stall except when
MADV_HUGEPAGE" then the semantic is less clear. Admin might have a good
reason to never allow stalls - especially when he doesn't have a control
over the code he is running. Your patch would break this usecase.

If we want to provide a better background THP availability we should
focus more on kcompactd and the way how it is invoked. Currently we only
wake it up during the page allocation path. Long term we want to make it
more watermak based I believe. Similar to kswapd. Vlastimil is already
playing with this idea. I would prefer such a long term plan more than
tweaking THP configuration.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-23  8:51     ` Michal Hocko
@ 2016-12-23 10:01       ` David Rientjes
  2016-12-23 11:18         ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2016-12-23 10:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm

On Fri, 23 Dec 2016, Michal Hocko wrote:

> > The offering of defer breaks backwards compatibility with previous 
> > settings of defrag=madvise, where we could set madvise(MADV_HUGEPAGE) on 
> > .text segment remap and try to force thp backing if available but not 
> > directly reclaim for non VM_HUGEPAGE vmas.
> 
> I do not understand the backwards compatibility issue part here. Maybe I
> am missing something but the semantic of defrag=madvise hasn't changed
> and a new flag can hardly break backward compatibility.
> 

We have no way to compact memory for users who are not using 
MADV_HUGEPAGE, which is some customers, others require MADV_HUGEPAGE for 
.text segment remap while loading their binary, without defrag=always or 
defrag=defer.  The problem is that we want to demand direct compact for 
MADV_HUGEPAGE: they _really_ want hugepages, it's the point of the 
madvise.  We have no setting, without this patch, to ask for background 
compaction for everybody so that their fault does not have long latency 
and for some customers to demand compaction.  It's a userspace decision, 
not a kernel decision, and we have lost that ability.

> > This was very advantageous.  
> > We prefer that to stay unchanged and allow kcompactd compaction to be 
> > triggered in background by everybody else as opposed to direct reclaim.  
> > We do not have that ability without this patch.
> 
> So why don't you use defrag=madvise?
> 

Um, wtf?  Prior to the patch, we used defrag=always because we do not have 
low latency option; everybody was forced into it.  Now that we do have 
the option, we wish to use deferred compaction so that we have opportunity 
to fault hugepages in near future.  We also have userspace apps, and 
others have database apps, which want hugepages and are ok with any 
latency.  This should not be a difficult point to understand.  Allow the 
user to define if they are willing to accept latency with MADV_HUGEPAGE.

> I disagree. I think the current set of defrag values should be
> sufficient. We can completely disable direct reclaim, enable it only for
> opt-in, enable for all and never allow to stall. The advantage of this
> set of values is that they have _clear_ semantic and behave
> consistently. If you change defer to "almost never stall except when
> MADV_HUGEPAGE" then the semantic is less clear. Admin might have a good
> reason to never allow stalls - especially when he doesn't have a control
> over the code he is running. Your patch would break this usecase.
> 

?????? Why does the admin care if a user's page fault wants to reclaim to 
get high order memory?  The user incurs the penalty for MADV_HUGEPAGE, it 
always has.  Lol.

This objection is nonsensical.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-23 10:01       ` David Rientjes
@ 2016-12-23 11:18         ` Michal Hocko
  2016-12-23 22:46           ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-12-23 11:18 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm

On Fri 23-12-16 02:01:33, David Rientjes wrote:
> On Fri, 23 Dec 2016, Michal Hocko wrote:
> 
> > > The offering of defer breaks backwards compatibility with previous 
> > > settings of defrag=madvise, where we could set madvise(MADV_HUGEPAGE) on 
> > > .text segment remap and try to force thp backing if available but not 
> > > directly reclaim for non VM_HUGEPAGE vmas.
> > 
> > I do not understand the backwards compatibility issue part here. Maybe I
> > am missing something but the semantic of defrag=madvise hasn't changed
> > and a new flag can hardly break backward compatibility.
> > 
> 
> We have no way to compact memory for users who are not using 
> MADV_HUGEPAGE,

yes we have. it is defrag=always. If you do not want direct compaction
and the resulting allocation stalls then you have to rely on kcompactd
which is something we should work longterm.

> which is some customers, others require MADV_HUGEPAGE for 
> .text segment remap while loading their binary, without defrag=always or 
> defrag=defer.  The problem is that we want to demand direct compact for 
> MADV_HUGEPAGE: they _really_ want hugepages, it's the point of the 
> madvise.

and that is the point of defrag=madvise to give them this direct
compaction.

> We have no setting, without this patch, to ask for background 
> compaction for everybody so that their fault does not have long latency 
> and for some customers to demand compaction.

that is true and what I am trying to say is that we should aim to give
this background compaction for everybody via kcompactd because there are
more users than THP who might benefit from low latency high order pages
availability. We shouldn't tweak the defer option for that purpose.

> It's a userspace decision, not a kernel decision, and we have lost
> that ability.

I must be missing something but which setting did allow this before?
 
> > > This was very advantageous.  
> > > We prefer that to stay unchanged and allow kcompactd compaction to be 
> > > triggered in background by everybody else as opposed to direct reclaim.  
> > > We do not have that ability without this patch.
> > 
> > So why don't you use defrag=madvise?
> > 
> 
> Um, wtf?  Prior to the patch, we used defrag=always because we do not have 
> low latency option; everybody was forced into it.  Now that we do have 
> the option, we wish to use deferred compaction so that we have opportunity 
> to fault hugepages in near future. We also have userspace apps, and 
> others have database apps, which want hugepages and are ok with any 
> latency.  This should not be a difficult point to understand.  Allow the 
> user to define if they are willing to accept latency with MADV_HUGEPAGE.
> 
> > I disagree. I think the current set of defrag values should be
> > sufficient. We can completely disable direct reclaim, enable it only for
> > opt-in, enable for all and never allow to stall. The advantage of this
> > set of values is that they have _clear_ semantic and behave
> > consistently. If you change defer to "almost never stall except when
> > MADV_HUGEPAGE" then the semantic is less clear. Admin might have a good
> > reason to never allow stalls - especially when he doesn't have a control
> > over the code he is running. Your patch would break this usecase.
> > 
> 
> ?????? Why does the admin care if a user's page fault wants to reclaim to 
> get high order memory?

Because the whole point of the defrag knob is to allow _administrator_
control how much we try to fault in THP. And the primary motivation were
latencies. The whole point of introducing defer option was to _never_
stall in the page fault while it still allows to kick the background
compaction. If you really want to tweak any option then madvise would be
more appropriate IMHO because the semantic would be still clear. Use
direct compaction for MADV_HUGEPAGE vmas and kick in kswapd/kcompactd
for others.

That being said, I understand your usecase and agree that it is useful
to allow background compaction to allow smooth THP deferred allocations
while madvise users can tolerate stalls from the direct compaction. I
just disagree with tweaking defer option to allow for that because I
_believe_ that we should accomplish this in a more generic way and allow
kcompactd to be more configurable and proactive. We definitely need
background compaction for other users than THP. So please try to think
about it some more before sending more wtf replies...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-23 11:18         ` Michal Hocko
@ 2016-12-23 22:46           ` David Rientjes
  2016-12-26  9:02             ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2016-12-23 22:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm

On Fri, 23 Dec 2016, Michal Hocko wrote:

> > We have no way to compact memory for users who are not using 
> > MADV_HUGEPAGE,
> 
> yes we have. it is defrag=always. If you do not want direct compaction
> and the resulting allocation stalls then you have to rely on kcompactd
> which is something we should work longterm.
> 

No, the point of madvise(MADV_HUGEPAGE) is for applications to tell the 
kernel that they really want hugepages.  Really.  Everybody else either 
never did direct compaction or did a substantially watered down version of 
it.  Now, we have a situation where you can either do direct compaction 
for MADV_HUGEPAGE and nothing for anybody else, or direct compaction for 
everybody.  In our usecase, we want everybody to kick off background 
compaction because order=9 gfp_mask & __GFP_KSWAPD_RECLAIM is the only 
thing that is going to trigger background compaction but are unable to do 
so without still incurring lengthy pagefaults for non MADV_HUGEPAGE users.

> > which is some customers, others require MADV_HUGEPAGE for 
> > .text segment remap while loading their binary, without defrag=always or 
> > defrag=defer.  The problem is that we want to demand direct compact for 
> > MADV_HUGEPAGE: they _really_ want hugepages, it's the point of the 
> > madvise.
> 
> and that is the point of defrag=madvise to give them this direct
> compaction.
> 

Do you see the problem by first suggesting defrag=always at the top of 
your reply and then defrag=madvise now?  We cannot set both at once, it's 
the entire problem with the tristate and now quadstate setting.  We want a 
combination: EVERYBODY kicks off background compaction and applications 
that really want hugepages and are fine with incuring lengthy page fault, 
such as those (for the third time) remapping .text segment and doing 
madvise(MADV_HUGEPAGE) before fault, can use the madvise.

> > We have no setting, without this patch, to ask for background 
> > compaction for everybody so that their fault does not have long latency 
> > and for some customers to demand compaction.
> 
> that is true and what I am trying to say is that we should aim to give
> this background compaction for everybody via kcompactd because there are
> more users than THP who might benefit from low latency high order pages
> availability. 

My patch does that, we _defer_ for everybody unless you're using 
madvise(MADV_HUGEPAGE) and really want hugepages.  Forget defrag=never 
exists, it's not important in the discussion.  Forget defrag=always exists 
because all apps, like batch jobs, don't want lengthy pagefaults.  We have 
two options remaining:

 - defrag=defer: everybody kicks off background compaction, _nobody_ does
   direct compaction

 - defrag=madvise: madvise(MADV_HUGEPAGE) does direct compaction,
   everybody else does nothing

The point you're missing is that we _want_ defrag=defer.  We really do.  
We don't want to stall in the page allocator to get thp, but we want to 
try to make it available in the short term.  However, apps that do 
madvise(MADV_HUGEPAGE), like remapping your .text segment and wanting your 
text backed by hugepages and incurring the expense up front, or a 
database, or a vm, _want_ hugepages now and don't care about lengthy page 
faults.

The point is that I HAVE NO SETTING to get that behavior and 
defrag=madvise is _not_ a solution because it requires the presence of an 
app that is doing madvise(MADV_HUGEPAGE) AND faulting memory to get any 
order=9 compaction.

> > ?????? Why does the admin care if a user's page fault wants to reclaim to 
> > get high order memory?
> 
> Because the whole point of the defrag knob is to allow _administrator_
> control how much we try to fault in THP. And the primary motivation were
> latencies. The whole point of introducing defer option was to _never_
> stall in the page fault while it still allows to kick the background
> compaction. If you really want to tweak any option then madvise would be
> more appropriate IMHO because the semantic would be still clear. Use
> direct compaction for MADV_HUGEPAGE vmas and kick in kswapd/kcompactd
> for others.
> 

You want defrag=madvise to start doing background compaction for 
everybody, which was never done before for existing users of 
defrag=madvise?  That might be possible, I don't really care, I just think 
it's riskier because there are existing users of defrag=madvise who are 
opting in to new behavior because of the kernel change.  This patch 
changes defrag=defer because it's the new option and people setting the 
mode know what they are getting.

I disagree with your description of what the defrag setting is intended 
for.  The setting of thp defrag is to optimize for apps that truly want 
transparent behavior, i.e. they aren't doing madvise(MADV_HUGEPAGE).  Are 
they willing to incur lengthy pagefaults for thp when not doing any 
madvise(2)?  defrag=defer should not mean that users of 
madvise(MADV_HUGEPAGE) that have clearly specified their intent should not 
be allowed to try compacting memory themselves because they have indicated 
they are fine with such an expense by doing the madvise(2).

This is obviously fine for Kirill, and I have users who remap their .text 
segment and do madvise(MADV_DONTNEED) because they really want hugepages 
when they are exec'd, so I'd kindly ask you to consider the real-world use 
cases that require background compaction to make hugepages available for 
everybody but allow apps to opt-in to take the expense of compaction on 
themselves rather than your own theory of what users want.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-23 22:46           ` David Rientjes
@ 2016-12-26  9:02             ` Michal Hocko
  2016-12-27  0:53               ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-12-26  9:02 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm

On Fri 23-12-16 14:46:43, David Rientjes wrote:
[...]
> You want defrag=madvise to start doing background compaction for 
> everybody, which was never done before for existing users of 
> defrag=madvise?  That might be possible, I don't really care, I just think 
> it's riskier because there are existing users of defrag=madvise who are 
> opting in to new behavior because of the kernel change.  This patch 
> changes defrag=defer because it's the new option and people setting the 
> mode know what they are getting.

But my primary argument is that if you tweak "defer" value behavior
then you lose the only "stall free yet allow background compaction"
option. That option is really important. You seem to think that it
is the application which is under the control. And I am not all that
surprised because you are under control of the whole userspace in your
deployments. But there are others where the administrator is not under
the control of what application asks for yet he is responsible for the
overal "experience" if you will. Long stalls during the page faults are
often seen as bugs and users might not really care whether the
application writer really wanted THP or not...

[...]

> This is obviously fine for Kirill, and I have users who remap their .text 
> segment and do madvise(MADV_DONTNEED) because they really want hugepages 
> when they are exec'd, so I'd kindly ask you to consider the real-world use 
> cases that require background compaction to make hugepages available for 
> everybody but allow apps to opt-in to take the expense of compaction on 
> themselves rather than your own theory of what users want.

I definitely _agree_ that this is a very important usecase! I am just
trying to think long term and a more sophisticated background compaction
is something that we definitely lack and _want_ longterm. There are more
high order users than THP. I believe we really want to teach kcompactd
to maintain configurable amount of highorder pages.

If there is really a need for an immediate solution^Wworkaround then I
think that tweaking the madvise option should be reasonably safe. Admins
are really prepared for stalls because they are explicitly opting in for
madvise behavior and they will get a background compaction on top. This
is a new behavior but I do not see how it would be harmful. If an
excessive compaction is a problem then THP can be reduced to madvise
only vmas.

But, I really _do_ care about having a stall free option which is not a
complete disable of the background compaction for THP.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f3c2040edbb1..3679c47faef4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -622,8 +622,8 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 	bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
-				&transparent_hugepage_flags) && vma_madvised)
-		return GFP_TRANSHUGE;
+				&transparent_hugepage_flags))
+		return (vma_madvise) ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
 	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
 						&transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-26  9:02             ` Michal Hocko
@ 2016-12-27  0:53               ` David Rientjes
  2016-12-27  2:32                 ` Kirill A. Shutemov
  2016-12-27  9:41                 ` Michal Hocko
  0 siblings, 2 replies; 29+ messages in thread
From: David Rientjes @ 2016-12-27  0:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm

On Mon, 26 Dec 2016, Michal Hocko wrote:

> But my primary argument is that if you tweak "defer" value behavior
> then you lose the only "stall free yet allow background compaction"
> option. That option is really important.

Important to who?

What regresses if we kick a background kthread to compact memory for 
order-9 pageblocks?

Why don't we allow userspace to clear __GFP_KSWAPD_RECLAIM if we don't 
want background reclaim for allocations?

> You seem to think that it
> is the application which is under the control. And I am not all that
> surprised because you are under control of the whole userspace in your
> deployments.

I have no control over the userspace that runs on my "deployments," I 
caution you to not make any inferences.

> But there are others where the administrator is not under
> the control of what application asks for yet he is responsible for the
> overal "experience" if you will.

The administrator is in charge of an "experience" and wants to avoid 
background compaction for thp allocations but not background reclaim for 
any other allocation?  (Why am I even replying to this?)  If the admin is 
concerned about anybody doing compaction, they can set defrag to "never".  
They have had this ability since thp was introduced.

> Long stalls during the page faults are
> often seen as bugs and users might not really care whether the
> application writer really wanted THP or not...
> 

There are no long stalls during page faults introduced by this patch, we 
are waking up a kthread to do the work.

> I definitely _agree_ that this is a very important usecase! I am just
> trying to think long term and a more sophisticated background compaction
> is something that we definitely lack and _want_ longterm. There are more
> high order users than THP. I believe we really want to teach kcompactd
> to maintain configurable amount of highorder pages.
> 

We are addressing thp defrag here, not any other use for background 
compaction for other high-order allocations.  I'd prefer that we stay on 
topic, please.  This is only about setting thp defrag to "defer" and if it 
is possible to kick background compaction and defer direct compaction.  We 
need this patch, Kirill has acked it, and I simply have no more time to 
talk in circles.

> If there is really a need for an immediate solution^Wworkaround then I
> think that tweaking the madvise option should be reasonably safe. Admins
> are really prepared for stalls because they are explicitly opting in for
> madvise behavior and they will get a background compaction on top. This
> is a new behavior but I do not see how it would be harmful. If an
> excessive compaction is a problem then THP can be reduced to madvise
> only vmas.
> 
> But, I really _do_ care about having a stall free option which is not a
> complete disable of the background compaction for THP.
> 

This is completely wrong.  Before the "defer" option has been introduced, 
we had "madvise" and should maintain its behavior as much as possible so 
there are no surprises.  We don't change behavior for a tunable out from 
under existing users because you think you know better.  With the new 
"defer" option, we can make this a stronger variant of "madvise", which 
Kirill acked, so that existing users of MADV_HUGEPAGE have no change in 
behavior and we can configure whether we do direct or background 
compaction for everybody else.  If people don't want background 
compaction, they can set defrag to "madvise".  If they want it, they can 
set it to "defer".  It's very simple.

That said, I simply don't have the time to continue in circular arguments 
and would respectfully ask Andrew to apply this acked patch.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-27  0:53               ` David Rientjes
@ 2016-12-27  2:32                 ` Kirill A. Shutemov
  2016-12-27  9:41                 ` Michal Hocko
  1 sibling, 0 replies; 29+ messages in thread
From: Kirill A. Shutemov @ 2016-12-27  2:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm

On Mon, Dec 26, 2016 at 04:53:39PM -0800, David Rientjes wrote:
> > If there is really a need for an immediate solution^Wworkaround then I
> > think that tweaking the madvise option should be reasonably safe. Admins
> > are really prepared for stalls because they are explicitly opting in for
> > madvise behavior and they will get a background compaction on top. This
> > is a new behavior but I do not see how it would be harmful. If an
> > excessive compaction is a problem then THP can be reduced to madvise
> > only vmas.
> > 
> > But, I really _do_ care about having a stall free option which is not a
> > complete disable of the background compaction for THP.
> > 
> 
> This is completely wrong.  Before the "defer" option has been introduced, 
> we had "madvise" and should maintain its behavior as much as possible so 
> there are no surprises.  We don't change behavior for a tunable out from 
> under existing users because you think you know better.  With the new 
> "defer" option, we can make this a stronger variant of "madvise", which 
> Kirill acked, so that existing users of MADV_HUGEPAGE have no change in 
> behavior and we can configure whether we do direct or background 
> compaction for everybody else.  If people don't want background 
> compaction, they can set defrag to "madvise".  If they want it, they can 
> set it to "defer".  It's very simple.
> 
> That said, I simply don't have the time to continue in circular arguments 
> and would respectfully ask Andrew to apply this acked patch.

+1.

I don't see a point to make "defer" weaker than "madvise". MADV_HUGEPAGE
is a way for an application to say that it's okay with paying price for
huge page allocation.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-27  0:53               ` David Rientjes
  2016-12-27  2:32                 ` Kirill A. Shutemov
@ 2016-12-27  9:41                 ` Michal Hocko
  2016-12-27 21:36                   ` David Rientjes
  1 sibling, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-12-27  9:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm

On Mon 26-12-16 16:53:39, David Rientjes wrote:
> On Mon, 26 Dec 2016, Michal Hocko wrote:
> 
> > But my primary argument is that if you tweak "defer" value behavior
> > then you lose the only "stall free yet allow background compaction"
> > option. That option is really important.
> 
> Important to who?

To all users who want to have THP without stalls experience. This was
the whole point of 444eb2a449ef ("mm: thp: set THP defrag by default to
madvise and add a stall-free defrag option").

> What regresses if we kick a background kthread to compact memory for 
> order-9 pageblocks?

I am not worried about this part. I am worried about the direct
compaction part.

> Why don't we allow userspace to clear __GFP_KSWAPD_RECLAIM if we don't 
> want background reclaim for allocations?
> 
> > You seem to think that it
> > is the application which is under the control. And I am not all that
> > surprised because you are under control of the whole userspace in your
> > deployments.
> 
> I have no control over the userspace that runs on my "deployments," I 
> caution you to not make any inferences.

the usecase you have described suggested otherwise. The way how you are
using madvise sounds pretty much intentional to me. This is quite a
different thing than running an application which uses madivise because
it _thinks_ it is a good idea and you are left with that decision and
cannot do anything about that.

> > But there are others where the administrator is not under
> > the control of what application asks for yet he is responsible for the
> > overal "experience" if you will.
> 
> The administrator is in charge of an "experience" and wants to avoid 
> background compaction for thp allocations but not background reclaim for 
> any other allocation?

I do not understand why you are mentioning the background
reclaim/compaction again. All I am talking about here is the _direct_
compaction and the way to prevent from it for _all_ THP requests
regardless of the madvise status because that is not under the admin
control.

> (Why am I even replying to this?)  If the admin is 
> concerned about anybody doing compaction, they can set defrag to "never".  
> They have had this ability since thp was introduced.
>
> > Long stalls during the page faults are
> > often seen as bugs and users might not really care whether the
> > application writer really wanted THP or not...
> > 
> 
> There are no long stalls during page faults introduced by this patch, we 
> are waking up a kthread to do the work.

Yes there _are_. All madvised vmas can stall now which was not the case
before. This is breaking the semantic of the defer option as it was
introduced and intended (which should be pretty clear from its name).
 
> > I definitely _agree_ that this is a very important usecase! I am just
> > trying to think long term and a more sophisticated background compaction
> > is something that we definitely lack and _want_ longterm. There are more
> > high order users than THP. I believe we really want to teach kcompactd
> > to maintain configurable amount of highorder pages.
> > 
> 
> We are addressing thp defrag here, not any other use for background 
> compaction for other high-order allocations.  I'd prefer that we stay on 
> topic, please.  This is only about setting thp defrag to "defer" and if it 
> is possible to kick background compaction and defer direct compaction.  We 
> need this patch, Kirill has acked it, and I simply have no more time to 
> talk in circles.

You seem to completely ignore the review feedback and given arguments
which is really sad...

> > If there is really a need for an immediate solution^Wworkaround then I
> > think that tweaking the madvise option should be reasonably safe. Admins
> > are really prepared for stalls because they are explicitly opting in for
> > madvise behavior and they will get a background compaction on top. This
> > is a new behavior but I do not see how it would be harmful. If an
> > excessive compaction is a problem then THP can be reduced to madvise
> > only vmas.
> > 
> > But, I really _do_ care about having a stall free option which is not a
> > complete disable of the background compaction for THP.
> > 
> 
> This is completely wrong.  Before the "defer" option has been introduced, 
> we had "madvise" and should maintain its behavior as much as possible so 
> there are no surprises.  We don't change behavior for a tunable out from 
> under existing users because you think you know better.  With the new 
> "defer" option, we can make this a stronger variant of "madvise", which 

I do not see why "defer" would be any different in that regards. The
defer option is there for 3 releases already. It's not an rc thing...
I fail to see why adding a background behavior to one existing knob is
a problem while adding a _directly_ visible one to another is OK. This
just doesn't make any sense to me.

> Kirill acked, so that existing users of MADV_HUGEPAGE have no change in 
> behavior and we can configure whether we do direct or background 
> compaction for everybody else.  If people don't want background 
> compaction, they can set defrag to "madvise".  If they want it, they can 
> set it to "defer".  It's very simple.
>
>
> That said, I simply don't have the time to continue in circular arguments 
> and would respectfully ask Andrew to apply this acked patch.

for reasons mentioned already
Nacked-by: Michal Hocko <mhocko@suse.com>
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-27  9:41                 ` Michal Hocko
@ 2016-12-27 21:36                   ` David Rientjes
  2016-12-28  8:48                     ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2016-12-27 21:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm

On Tue, 27 Dec 2016, Michal Hocko wrote:

> > Important to who?
> 
> To all users who want to have THP without stalls experience. This was
> the whole point of 444eb2a449ef ("mm: thp: set THP defrag by default to
> madvise and add a stall-free defrag option").
> 

THEY DO NOT STALL.  If the application is not using 
madvise(MADV_HUGEPAGE), all we do is kick kcompactd.  Nothing else.  We 
don't need any kernel tunable for an admin to override an application 
doing madvise(MADV_HUGEPAGE) when it wants hugepages and is perfectly 
happy stalling for them.

> > > You seem to think that it
> > > is the application which is under the control. And I am not all that
> > > surprised because you are under control of the whole userspace in your
> > > deployments.
> > 
> > I have no control over the userspace that runs on my "deployments," I 
> > caution you to not make any inferences.
> 
> the usecase you have described suggested otherwise. The way how you are
> using madvise sounds pretty much intentional to me. This is quite a
> different thing than running an application which uses madivise because
> it _thinks_ it is a good idea and you are left with that decision and
> cannot do anything about that.
> 

I literally cannot believe I am reading this on lkml.  I am legitimately 
stunned by this.  You are saying that the admin thinks it knows better 
than the application writer and that its madvise(MADV_HUGEPAGE) wasn't 
actually intentional?  We don't introduce tunables so that admins can 
control the intentional behavior of an application, unless that behavior 
is a security concern.  The application has specified it wants to wait for 
hugepages and has backwards compatibility with defrag=madvise settings 
since thp was introduced, which introduced MADV_HUGEPAGE.  It's the entire 
point of MADV_HUGEPAGE existing.

> > > Long stalls during the page faults are
> > > often seen as bugs and users might not really care whether the
> > > application writer really wanted THP or not...
> > > 
> > 
> > There are no long stalls during page faults introduced by this patch, we 
> > are waking up a kthread to do the work.
> 
> Yes there _are_. All madvised vmas can stall now which was not the case
> before. This is breaking the semantic of the defer option as it was
> introduced and intended (which should be pretty clear from its name).
>  

All madvised VMAs stall now because THEY WANT TO STALL.  It is 
unbelievable that you would claim otherwise or think that you know better 
than the application writer about their application.

> > We are addressing thp defrag here, not any other use for background 
> > compaction for other high-order allocations.  I'd prefer that we stay on 
> > topic, please.  This is only about setting thp defrag to "defer" and if it 
> > is possible to kick background compaction and defer direct compaction.  We 
> > need this patch, Kirill has acked it, and I simply have no more time to 
> > talk in circles.
> 
> You seem to completely ignore the review feedback and given arguments
> which is really sad...
> 

I am perfectly satisfied with Kirill's review feedback because (1) it 
makes sense and (2) it supports allowing users who do MADV_HUGEPAGE to 
actually try to get hugepages, which is the point of the madvise.

> > That said, I simply don't have the time to continue in circular arguments 
> > and would respectfully ask Andrew to apply this acked patch.
> 
> for reasons mentioned already
> Nacked-by: Michal Hocko <mhocko@suse.com>

I hope I'm not being unrealistically optimistic in assuming that this will 
be the end of this thread.  The patch should be merged.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-27 21:36                   ` David Rientjes
@ 2016-12-28  8:48                     ` Michal Hocko
  2016-12-28 21:33                       ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-12-28  8:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm

On Tue 27-12-16 13:36:54, David Rientjes wrote:
[...]
> All madvised VMAs stall now because THEY WANT TO STALL.  It is 
> unbelievable that you would claim otherwise or think that you know better 
> than the application writer about their application.

I do care more about _users_ and their _experience_ than what
application _writers_ think is the best. This is the whole point
of giving the defrag tunable. madvise(MADV_HUGEPAGE) is just a hint to
the system that using transparent hugepages is _preferable_, not
mandatory. We have an option to allow stalls for those vmas to increase
the allocation success rate. We also have tunable to completely ignore
it. And we should also have an option to not stall.

Your interpretation of what is the expected stalling behavior of
MADV_HUGEPAGE doesn't match what the man page says:

: MADV_HUGEPAGE (since Linux 2.6.38)
: Enable Transparent Huge Pages (THP) for pages in the range specified
: by addr and length.  Currently, Transparent Huge Pages work only with
: private anonymous pages (see mmap(2)).  The kernel will regularly
: scan the areas marked as huge page candidates to replace them with
: huge pages.  The kernel will also allocate huge pages directly
: when the region is naturally aligned to the huge page size (see
: posix_memalign(2)).
: 
: This feature is primarily aimed at applications that use large mappings
: of data and access large regions of that memory at a time (e.g.,
: virtualization systems such as QEMU).  It can very easily waste memory
: (e.g., a 2MB mapping that only ever accesses 1 byte will result in 2MB
: of wired memory instead of one 4KB page).  See the Linux kernel source
: file Documentation/vm/transhuge.txt for more details.
: 
: The MADV_HUGEPAGE and MADV_NOHUGEPAGE operations are available only if
: the kernel was configured with CONFIG_TRANSPARENT_HUGEPAGE.

while it clearly contradicts what "defer" option is documented to
provide in  Documentation/vm/transhuge.txt:

: "defer" means that an application will wake kswapd in the background
: to reclaim pages and wake kcompact to compact memory so that THP is
: available in the near future. It's the responsibility of khugepaged
: to then install the THP pages later.

I am not going to reply to other your claims because I would just repeat
myself.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-28  8:48                     ` Michal Hocko
@ 2016-12-28 21:33                       ` David Rientjes
  2016-12-29  8:24                         ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2016-12-28 21:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm

On Wed, 28 Dec 2016, Michal Hocko wrote:

> I do care more about _users_ and their _experience_ than what
> application _writers_ think is the best. This is the whole point
> of giving the defrag tunable. madvise(MADV_HUGEPAGE) is just a hint to
> the system that using transparent hugepages is _preferable_, not
> mandatory. We have an option to allow stalls for those vmas to increase
> the allocation success rate. We also have tunable to completely ignore
> it. And we should also have an option to not stall.
> 

The application developer who uses madvise(MADV_HUGEPAGE) is doing so for 
a reason.

We lack the ability to defragment in the background for all users who 
don't want to block while allowing madvise(MADV_HUGEPAGE) users to block, 
as the changelog for this patch clearly indicates.

Thanks.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-28 21:33                       ` David Rientjes
@ 2016-12-29  8:24                         ` Michal Hocko
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2016-12-29  8:24 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, Mel Gorman, linux-kernel, linux-mm

On Wed 28-12-16 13:33:49, David Rientjes wrote:
> On Wed, 28 Dec 2016, Michal Hocko wrote:
> 
> > I do care more about _users_ and their _experience_ than what
> > application _writers_ think is the best. This is the whole point
> > of giving the defrag tunable. madvise(MADV_HUGEPAGE) is just a hint to
> > the system that using transparent hugepages is _preferable_, not
> > mandatory. We have an option to allow stalls for those vmas to increase
> > the allocation success rate. We also have tunable to completely ignore
> > it. And we should also have an option to not stall.
> > 
> 
> The application developer who uses madvise(MADV_HUGEPAGE) is doing so for 
> a reason.

and nobody questions that... But the application developer can hardly
forsee the environment where the application runs. And what might
look as a reasonable cost/benefit balance in one setup can turn out
completely wrong in a different one - just consider the fragmentation
which is the primary contributor to stalls. It is hardly predictable
and vary between different workloads/setups a lot. While we have a way
(policty if you will) to tell that madvise should be honored as much
as possible (defrag=madvise) we do not have a way to tell that even
madvised vmas are not worth stalling over because the benefit would not
offset the cost.

> We lack the ability to defragment in the background for all users who 
> don't want to block while allowing madvise(MADV_HUGEPAGE) users to block, 
> as the changelog for this patch clearly indicates.

And I agree that this is something to be addressed. I just disagree that
this patch is the way how to achieve that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-22 21:05   ` David Rientjes
  2016-12-23  8:51     ` Michal Hocko
@ 2016-12-30 12:36     ` Mel Gorman
  2016-12-30 12:56       ` Michal Hocko
  2016-12-30 22:30       ` David Rientjes
  1 sibling, 2 replies; 29+ messages in thread
From: Mel Gorman @ 2016-12-30 12:36 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, linux-kernel, linux-mm

On Thu, Dec 22, 2016 at 01:05:27PM -0800, David Rientjes wrote:
> On Thu, 22 Dec 2016, Michal Hocko wrote:
> 
> > > Currently, when defrag is set to "madvise", thp allocations will direct
> > > reclaim.  However, when defrag is set to "defer", all thp allocations do
> > > not attempt reclaim regardless of MADV_HUGEPAGE.
> > > 
> > > This patch always directly reclaims for MADV_HUGEPAGE regions when defrag
> > > is not set to "never."  The idea is that MADV_HUGEPAGE regions really
> > > want to be backed by hugepages and are willing to endure the latency at
> > > fault as it was the default behavior prior to commit 444eb2a449ef ("mm:
> > > thp: set THP defrag by default to madvise and add a stall-free defrag
> > > option").
> > 
> > AFAIR "defer" is implemented exactly as intended. To offer a never-stall
> > but allow to form THP in the background option. The patch description
> > doesn't explain why this is not good anymore. Could you give us more
> > details about the motivation and why "madvise" doesn't work for
> > you? This is a user visible change so the reason should better be really
> > documented and strong.
> > 
> 

I ended up not reading this whole thread in detail because it went back
and forth a lot.

Michal is correct in that my intent for defer was to have "never stall"
as the default behaviour.  This was because of the number of severe stalls
users experienced that lead to recommendations in tuning guides to always
disable THP. I'd also seen multiple instances in bug reports for stalls
where it was suggested that THP be disabled even when it could not have
been a factor. It would be preferred to keep the default behaviour to
avoid reintroducing such bugs.

That said;

> The offering of defer breaks backwards compatibility with previous 
> settings of defrag=madvise, where we could set madvise(MADV_HUGEPAGE) on 
> .text segment remap and try to force thp backing if available but not 
> directly reclaim for non VM_HUGEPAGE vmas.  This was very advantageous.  
> We prefer that to stay unchanged and allow kcompactd compaction to be 
> triggered in background by everybody else as opposed to direct reclaim.  
> We do not have that ability without this patch.
> 

I accept the reasoning that applications that use MADV_HUGEPAGE really
want huge pages and may be willing to incur a large stall to get them.
It's impossible for the kernel to know in all cases which behaviour is
desirable so something is needed.

> Without this patch, we will be forced to offer multiple sysfs tunables to 
> define (1) direct vs background compact, (2) madvise behavior, (3) always, 
> (4) never and we cannot have 2^4 settings for "defrag" alone.

In itself, I don't see this as a bad thing. I won't nak the patch as-is
although I consider it unfortunate and worry that we'll see bugs again about
slow start times for qemu (the most common application I'm aware of that
uses MADV_HUGEPAGE). If that happens, we'd be forced to have a workaround
like a systemtap script that intercepted MADV_HUGEPAGE and stripped it
or LD_PRELOAD if a specific application can be controlled.

I'll neither ack nor nak this patch. However, I would much prefer an
additional option be added to sysfs called defer-fault that would avoid
all fault-based stalls but still potentially stall for MADV_HUGEPAGE. I
would also prefer that the default option is "defer" for both MADV_HUGEPAGE
and faults.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-30 12:36     ` Mel Gorman
@ 2016-12-30 12:56       ` Michal Hocko
  2016-12-30 14:08         ` Mel Gorman
  2016-12-30 22:30       ` David Rientjes
  1 sibling, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-12-30 12:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Andrew Morton, Jonathan Corbet,
	Kirill A. Shutemov, Vlastimil Babka, linux-kernel, linux-mm

On Fri 30-12-16 12:36:20, Mel Gorman wrote:
[...]
> I'll neither ack nor nak this patch. However, I would much prefer an
> additional option be added to sysfs called defer-fault that would avoid
> all fault-based stalls but still potentially stall for MADV_HUGEPAGE.

Would you consider changing the semantic of defer=madvise to invoke
KSWAPD for !madvised vmas as acceptable. It would be a change in
semantic but I am wondering what would be a risk and potential
regression space.

Also I am planning to send a pro-active compaction based on a
"watermark" as an LSF/MM topic proposal. I suspect that no additional
thp specific tunable will be needed if we have a proper compaction
watermark tunable.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-30 12:56       ` Michal Hocko
@ 2016-12-30 14:08         ` Mel Gorman
  0 siblings, 0 replies; 29+ messages in thread
From: Mel Gorman @ 2016-12-30 14:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Andrew Morton, Jonathan Corbet,
	Kirill A. Shutemov, Vlastimil Babka, linux-kernel, linux-mm

On Fri, Dec 30, 2016 at 01:56:16PM +0100, Michal Hocko wrote:
> On Fri 30-12-16 12:36:20, Mel Gorman wrote:
> [...]
> > I'll neither ack nor nak this patch. However, I would much prefer an
> > additional option be added to sysfs called defer-fault that would avoid
> > all fault-based stalls but still potentially stall for MADV_HUGEPAGE.
> 
> Would you consider changing the semantic of defer=madvise to invoke
> KSWAPD for !madvised vmas as acceptable. It would be a change in
> semantic but I am wondering what would be a risk and potential
> regression space.
> 

I'd worry a little, but not a lot. The concern would be that kswapd waking
up would reclaim pages and cause major faults that would have remained
resident with the current semantics.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-30 12:36     ` Mel Gorman
  2016-12-30 12:56       ` Michal Hocko
@ 2016-12-30 22:30       ` David Rientjes
  2017-01-03 10:37         ` Mel Gorman
  1 sibling, 1 reply; 29+ messages in thread
From: David Rientjes @ 2016-12-30 22:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, linux-kernel, linux-mm

On Fri, 30 Dec 2016, Mel Gorman wrote:

> Michal is correct in that my intent for defer was to have "never stall"
> as the default behaviour.  This was because of the number of severe stalls
> users experienced that lead to recommendations in tuning guides to always
> disable THP. I'd also seen multiple instances in bug reports for stalls
> where it was suggested that THP be disabled even when it could not have
> been a factor. It would be preferred to keep the default behaviour to
> avoid reintroducing such bugs.
> 

I sympathize with that, I've dealt with a number of issues that we have 
encountered where thp defrag was either at fault or wasn't, and there were 
also suggestions to set defrag to "madvise" to rule it out and that 
impacted other users.

I'm curious if you could show examples where there were severe stalls 
being encountered by applications that did madvise(MADV_HUGEPAGE) and 
users were forced to set madvise to "never".  That is, after all, the only 
topic for consideration in this thread: the direct impact to users of 
madvise(MADV_HUGEPAGE).  If an application does it, I believe that's a 
demand for work to be done at allocation time to try to get hugepages.  
They can certainly provide an application-level option to not do the 
MADV_HUGEPAGE.  Qemu is no different, you can add options to do 
madvise(MADV_HUGEPAGE) or not, and you can also do it after fault.

The problem with the current option set is that we don't have the ability 
to trigger background compaction for everybody, which only very minimally 
impacts their page fault latency since it just wakes up kcompactd, and 
allow MADV_HUGEPAGE users to accept that up-front cost by doing direct 
compaction.  My usecase, remapping .text segment and faulting thp memory 
at startup, demands that ability.  Setting defrag=madvise gets that 
behavior, but nobody else triggers background compaction when thp memory 
fails and we _want_ that behavior so work is being done to defrag.  
Setting defrag=defer makes MADV_HUGEPAGE a no-op for page fault, and I 
argue that's the wrong behavior.

> I'll neither ack nor nak this patch. However, I would much prefer an
> additional option be added to sysfs called defer-fault that would avoid
> all fault-based stalls but still potentially stall for MADV_HUGEPAGE. I
> would also prefer that the default option is "defer" for both MADV_HUGEPAGE
> and faults.
> 

If you want a fifth option added to sysfs for thp defrag, that's fine, we 
can easily do that.  I'm slightly concerned with more and more options 
added that we will eventually approach the 2^4 option count that I 
mentioned earlier and nobody will know what to select.  I'm fine with the 
kernel default remaining as "madvise," we will just set it to whatever 
gets us "direct for madvise, background for everybody else" behavior as we 
were planning on using "defer."

We can either do

 (1) merge this patch and allow madvise(MADV_HUGEPAGE) users to always try
     to get hugepages, potentially adding options to qemu to suppress 
     their MADV_HUGEPAGE if users have complained (would even fix the 
     issue on 2.6 kernels) or do it after majority has been faulted, or

 (2) add a fifth defrag option to do this suggested behavior and maintain
     that option forever.

I'd obviously prefer the former since I consider MADV_HUGEPAGE and not 
willing to stall as a userspace issue that can _trivially_ be worked 
around in userspace, but in the interest of moving forward on this we can 
do the latter if you'd prefer.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-22  0:21 [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred David Rientjes
  2016-12-22  8:31 ` Kirill A. Shutemov
  2016-12-22 10:00 ` Michal Hocko
@ 2017-01-02  8:38 ` Vlastimil Babka
  2017-01-03 22:44   ` David Rientjes
  2 siblings, 1 reply; 29+ messages in thread
From: Vlastimil Babka @ 2017-01-02  8:38 UTC (permalink / raw)
  To: David Rientjes, Andrew Morton
  Cc: Jonathan Corbet, Kirill A. Shutemov, Mel Gorman, linux-kernel, linux-mm

On 12/22/2016 01:21 AM, David Rientjes wrote:
> Currently, when defrag is set to "madvise", thp allocations will direct
> reclaim.  However, when defrag is set to "defer", all thp allocations do
> not attempt reclaim regardless of MADV_HUGEPAGE.
> 
> This patch always directly reclaims for MADV_HUGEPAGE regions when defrag
> is not set to "never."  The idea is that MADV_HUGEPAGE regions really
> want to be backed by hugepages and are willing to endure the latency at
> fault as it was the default behavior prior to commit 444eb2a449ef ("mm:
> thp: set THP defrag by default to madvise and add a stall-free defrag
> option").
> 
> In this form, "defer" is a stronger, more heavyweight version of
> "madvise".
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

I'm late to the thread (I did read it fully though), so instead of
multiple responses, I'll just list my observations here:

- "defer", e.g. background kswapd+compaction is not a silver bullet, it
will also affect the system. Mel already mentioned extra reclaim.
Compaction also has CPU costs, just hides the accounting to a kernel
thread so it's not visible as latency. It also increases zone/node
lru_lock and lock pressure.

For the same reasons, admin might want to limit direct compaction for
THP, even for madvise() apps. It's also likely that "defer" might have
lower system overhead than "madvise", as with "defer",
reclaim/compaction is done by one per-node thread at a time, but there
might be multiple madvise() threads. So there might be sense in not
allowing madvise() apps to do direct reclaim/compaction on "defer".

- for overriding specific apps such as QEMU (including their madvise()
usage, AFAICS), we have PR_SET_THP_DISABLE prctl(), so no need to
LD_PRELOAD stuff IMO.

- I have wondered about exactly the issue here when Mel proposed the
defer option [1]. Mel responded that it doesn't seem needed at that
point. Now it seems it is. Too bad you didn't raise it then, but to be
fair you were not CC'd.

So would something like this be possible?

> echo "defer madvise" > /sys/kernel/mm/transparent_hugepage/defrag
> cat /sys/kernel/mm/transparent_hugepage/defrag
always [defer] [madvise] never

I'm not sure about the analogous kernel boot option though, I guess
those can't use spaces, so maybe comma-separated?

If that's not acceptable, then I would probably rather be for changing
"madvise" to include "defer", than the other way around. When we augment
kcompactd to be more proactive, it might easily be that it will
effectively act as "defer", even when defrag=none is set, anyway.

[1] http://marc.info/?l=linux-mm&m=145683613929750&w=2

> ---
>  Documentation/vm/transhuge.txt |  7 +++++--
>  mm/huge_memory.c               | 10 ++++++----
>  2 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -121,8 +121,11 @@ to utilise them.
>  
>  "defer" means that an application will wake kswapd in the background
>  to reclaim pages and wake kcompact to compact memory so that THP is
> -available in the near future. It's the responsibility of khugepaged
> -to then install the THP pages later.
> +available in the near future, unless it is for a region where
> +madvise(MADV_HUGEPAGE) has been used, in which case direct reclaim will be
> +used. Kcompactd will attempt to make hugepages available for allocation in
> +the near future and khugepaged will try to collapse existing memory into
> +hugepages later.
>  
>  "madvise" will enter direct reclaim like "always" but only for regions
>  that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -619,15 +619,17 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
>   */
>  static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>  {
> -	bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>  
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
>  				&transparent_hugepage_flags) && vma_madvised)
>  		return GFP_TRANSHUGE;
>  	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> -						&transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> -	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
> +						&transparent_hugepage_flags)) {
> +		return GFP_TRANSHUGE_LIGHT |
> +		       (vma_madvised ? __GFP_DIRECT_RECLAIM :
> +				       __GFP_KSWAPD_RECLAIM);
> +	} else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
>  						&transparent_hugepage_flags))
>  		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
>  
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2016-12-30 22:30       ` David Rientjes
@ 2017-01-03 10:37         ` Mel Gorman
  2017-01-03 21:57           ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: Mel Gorman @ 2017-01-03 10:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, linux-kernel, linux-mm

On Fri, Dec 30, 2016 at 02:30:32PM -0800, David Rientjes wrote:
> On Fri, 30 Dec 2016, Mel Gorman wrote:
> 
> > Michal is correct in that my intent for defer was to have "never stall"
> > as the default behaviour.  This was because of the number of severe stalls
> > users experienced that lead to recommendations in tuning guides to always
> > disable THP. I'd also seen multiple instances in bug reports for stalls
> > where it was suggested that THP be disabled even when it could not have
> > been a factor. It would be preferred to keep the default behaviour to
> > avoid reintroducing such bugs.
> > 
> 
> I sympathize with that, I've dealt with a number of issues that we have 
> encountered where thp defrag was either at fault or wasn't, and there were 
> also suggestions to set defrag to "madvise" to rule it out and that 
> impacted other users.
> 
> I'm curious if you could show examples where there were severe stalls 
> being encountered by applications that did madvise(MADV_HUGEPAGE)

I do not have a bug report that is specific to MADV_HUGEPAGE. Until very
recently they would have been masked by THP fault overhead in general.
The current defer logic isn't in the field long enough to generate bugs
that are detailed enough to catch something like this.

> and 
> users were forced to set madvise to "never". 

In the bugs I've dealt with, the switch was between "always" and "never". I
haven't seen a bug specific to "madvise".

> That is, after all, the only 
> topic for consideration in this thread: the direct impact to users of 
> madvise(MADV_HUGEPAGE).  If an application does it, I believe that's a 
> demand for work to be done at allocation time to try to get hugepages.  
> They can certainly provide an application-level option to not do the 
> MADV_HUGEPAGE.  Qemu is no different, you can add options to do 
> madvise(MADV_HUGEPAGE) or not, and you can also do it after fault.
> 

True, it's possible that this is minor hence why I didn't want to outright
Nak the patch.

> The problem with the current option set is that we don't have the ability 
> to trigger background compaction for everybody, which only very minimally 
> impacts their page fault latency since it just wakes up kcompactd, and 
> allow MADV_HUGEPAGE users to accept that up-front cost by doing direct 
> compaction.  My usecase, remapping .text segment and faulting thp memory 
> at startup, demands that ability.  Setting defrag=madvise gets that 
> behavior, but nobody else triggers background compaction when thp memory 
> fails and we _want_ that behavior so work is being done to defrag.  
> Setting defrag=defer makes MADV_HUGEPAGE a no-op for page fault, and I 
> argue that's the wrong behavior.
> 

Again, I accept your reasoning and I don't have direct evidence that it'll be
a problem. In an emergency, it could also be worked around using LD_PRELOAD
or a systemtap script until a kernel fix could be applied. Unfortunately it
could also be years before a patch like this would hit enough users for me
to spot the problem in the field. That's not enough to Nak the patch but
it was enough to suggest an alternative that would side-step the problem
ever occurring.

> > I'll neither ack nor nak this patch. However, I would much prefer an
> > additional option be added to sysfs called defer-fault that would avoid
> > all fault-based stalls but still potentially stall for MADV_HUGEPAGE. I
> > would also prefer that the default option is "defer" for both MADV_HUGEPAGE
> > and faults.
> > 
> 
> If you want a fifth option added to sysfs for thp defrag, that's fine, we 
> can easily do that.  I'm slightly concerned with more and more options 
> added that we will eventually approach the 2^4 option count that I 
> mentioned earlier and nobody will know what to select.  I'm fine with the 
> kernel default remaining as "madvise,"

I find it hard to believe this one *can* explode. There are a limited
number of user-triggable actions that can trigger stalls.

> we will just set it to whatever 
> gets us "direct for madvise, background for everybody else" behavior as we 
> were planning on using "defer."
> 
> We can either do
> 
>  (1) merge this patch and allow madvise(MADV_HUGEPAGE) users to always try
>      to get hugepages, potentially adding options to qemu to suppress 
>      their MADV_HUGEPAGE if users have complained (would even fix the 
>      issue on 2.6 kernels) or do it after majority has been faulted, or
> 
>  (2) add a fifth defrag option to do this suggested behavior and maintain
>      that option forever.
> 
> I'd obviously prefer the former since I consider MADV_HUGEPAGE and not 
> willing to stall as a userspace issue that can _trivially_ be worked 
> around in userspace, but in the interest of moving forward on this we can 
> do the latter if you'd prefer.

The latter is preferred because it prevents any possibility of encountering
this in the field and being unable to workaround it with LD_PRELOAD or
systemtap hackery but I won't nak the former either on the grounds I have
no data it's a problem and it could be a year or more before I have an
example. If it's encountered, we'll be back at introducing another sysfs
option.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2017-01-03 10:37         ` Mel Gorman
@ 2017-01-03 21:57           ` David Rientjes
  2017-01-04 10:12             ` Mel Gorman
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2017-01-03 21:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, linux-kernel, linux-mm

On Tue, 3 Jan 2017, Mel Gorman wrote:

> > I sympathize with that, I've dealt with a number of issues that we have 
> > encountered where thp defrag was either at fault or wasn't, and there were 
> > also suggestions to set defrag to "madvise" to rule it out and that 
> > impacted other users.
> > 
> > I'm curious if you could show examples where there were severe stalls 
> > being encountered by applications that did madvise(MADV_HUGEPAGE)
> 
> I do not have a bug report that is specific to MADV_HUGEPAGE. Until very
> recently they would have been masked by THP fault overhead in general.

I parse this, the masking of thp fault overhead in general, as an 
indication that the qemu user was using defrag set to "always" rather than 
the new kernel default of "madvise".

I wholeheartedly agree that we don't want defrag to be set to "always" be 
default, but that's not really a huge concern: we can easily set it to 
anything else by initscripts.

Qemu, when they added the MADV_HUGEPAGE, obviously wanted to try to 
allocate hugepages at fault using the available means when defrag was set 
to "madvise": https://patchwork.ozlabs.org/patch/177695

So now qemu notices no difference that the kernel default has changed, but 
you later reference qemu in your email about bugs concerning "slow start 
times."  It's puzzling unless you're offering a defrag setting of "defer" 
to workaround this potential bug report, which affects the whole machine 
and now qemu users have _no_ option to try to get thp at fault because the 
admin thinks he knows better, essentially making MADV_HUGEPAGE a no-op 
with no alternative provided.  That's specifically what I'm arguing 
against.

Qemu can be fixed, and I'll do it myself if necessary, when allocating a 
new RAMBlock or translation buffer to suppress the MADV_HUGEPAGE if 
configured.  It's a very trivial change, and I can do that if you'll 
kindly point me to the initial bug report so I can propose it to the 
appropriate user.

As Vlastimil also correctly brings up, there is already a 
prctl(PR_SET_THP_DISABLE) option available to prevent hugepages at fault 
and simply requires you to fork the process in the correct context to 
inherit the vma setting, see commit 1e1836e84f87.

> The current defer logic isn't in the field long enough to generate bugs
> that are detailed enough to catch something like this.
> 

Let us consider this email as a generating a bug that we, the users of 
MADV_HUGEPAGE that are using the madvise(2) correctly and add flags to 
suppress it when desired correctly, have no option to allow background 
compaction for everybody when we cannot allocate thp immediately but also 
allow users of our library to accept the cost of direct compaction at 
fault because they really want their .text segment remapped and backed by 
hugepages.

> > The problem with the current option set is that we don't have the ability 
> > to trigger background compaction for everybody, which only very minimally 
> > impacts their page fault latency since it just wakes up kcompactd, and 
> > allow MADV_HUGEPAGE users to accept that up-front cost by doing direct 
> > compaction.  My usecase, remapping .text segment and faulting thp memory 
> > at startup, demands that ability.  Setting defrag=madvise gets that 
> > behavior, but nobody else triggers background compaction when thp memory 
> > fails and we _want_ that behavior so work is being done to defrag.  
> > Setting defrag=defer makes MADV_HUGEPAGE a no-op for page fault, and I 
> > argue that's the wrong behavior.
> > 
> 
> Again, I accept your reasoning and I don't have direct evidence that it'll be
> a problem. In an emergency, it could also be worked around using LD_PRELOAD
> or a systemtap script until a kernel fix could be applied. Unfortunately it
> could also be years before a patch like this would hit enough users for me
> to spot the problem in the field. That's not enough to Nak the patch but
> it was enough to suggest an alternative that would side-step the problem
> ever occurring.
> 

Or simply forking the application after doing prctl(PR_SET_THP_DISABLE)?  
What exactly are you working around with a LD_PRELOAD that isn't addressed 
by this?

Btw, is there a qemu bug filed that makes doing the MADV_HUGEPAGE 
configurable?  I don't find it at https://bugs.launchpad.net/qemu.

> > If you want a fifth option added to sysfs for thp defrag, that's fine, we 
> > can easily do that.  I'm slightly concerned with more and more options 
> > added that we will eventually approach the 2^4 option count that I 
> > mentioned earlier and nobody will know what to select.  I'm fine with the 
> > kernel default remaining as "madvise,"
> 
> I find it hard to believe this one *can* explode. There are a limited
> number of user-triggable actions that can trigger stalls.
> 

I'm confused as to whether you support the addition of a fifth option that 
users will have to learn what they want, or whether you are open to 
changing the behavior of "defer" to actually respect userspace madvise(2)?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2017-01-02  8:38 ` Vlastimil Babka
@ 2017-01-03 22:44   ` David Rientjes
  2017-01-04  8:32     ` Vlastimil Babka
  0 siblings, 1 reply; 29+ messages in thread
From: David Rientjes @ 2017-01-03 22:44 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Mel Gorman,
	linux-kernel, linux-mm

On Mon, 2 Jan 2017, Vlastimil Babka wrote:

> I'm late to the thread (I did read it fully though), so instead of
> multiple responses, I'll just list my observations here:
> 
> - "defer", e.g. background kswapd+compaction is not a silver bullet, it
> will also affect the system. Mel already mentioned extra reclaim.
> Compaction also has CPU costs, just hides the accounting to a kernel
> thread so it's not visible as latency. It also increases zone/node
> lru_lock and lock pressure.
> 
> For the same reasons, admin might want to limit direct compaction for
> THP, even for madvise() apps. It's also likely that "defer" might have
> lower system overhead than "madvise", as with "defer",
> reclaim/compaction is done by one per-node thread at a time, but there
> might be multiple madvise() threads. So there might be sense in not
> allowing madvise() apps to do direct reclaim/compaction on "defer".
> 

Hmm, is there a significant benefit to setting "defer" rather than "never" 
if you can rely on khugepaged to trigger compaction when it tries to 
allocate.  I suppose if there is nothing to collapse that this won't do 
compaction, but is this not intended for users who always want to defer 
when not immediately available?

"Defer" in it's current setting is useless, in my opinion, other than 
providing it as a simple workaround to users when their applications are 
doing MADV_HUGEPAGE without allowing them to configure it.  We would love 
to use "defer" if it didn't completely break MADV_HUGEPAGE, though.

> - for overriding specific apps such as QEMU (including their madvise()
> usage, AFAICS), we have PR_SET_THP_DISABLE prctl(), so no need to
> LD_PRELOAD stuff IMO.
> 

Very good point, and I think it's also worthwhile to allow users to 
suppress the MADV_HUGEPAGE when allocating a translation buffer in qemu if 
they choose to do so; it's a very trivial patch to qemu to allow this to 
be configurable.  I haven't proposed it because I don't personally have a 
need for it, and haven't been pointed to anyone who has a need for it.

> - I have wondered about exactly the issue here when Mel proposed the
> defer option [1]. Mel responded that it doesn't seem needed at that
> point. Now it seems it is. Too bad you didn't raise it then, but to be
> fair you were not CC'd.
> 

My understanding is that the defer option is available to users who cannot 
modify their binary to suppress an madvise(MADV_HUGEPAGE) and are unaware 
that PR_SET_THP_DISABLE exists.  The prctl was added specifically when you 
cannot control your binary.

> So would something like this be possible?
> 
> > echo "defer madvise" > /sys/kernel/mm/transparent_hugepage/defrag
> > cat /sys/kernel/mm/transparent_hugepage/defrag
> always [defer] [madvise] never
> 
> I'm not sure about the analogous kernel boot option though, I guess
> those can't use spaces, so maybe comma-separated?
> 
> If that's not acceptable, then I would probably rather be for changing
> "madvise" to include "defer", than the other way around. When we augment
> kcompactd to be more proactive, it might easily be that it will
> effectively act as "defer", even when defrag=none is set, anyway.
> 

The concern I have with changing the behavior of "madvise" is that it 
changes long standing behavior that people have correctly implemented 
userspace applications with.  I suggest doing this only with "defer" since 
it's an option that is new, nobody appears to be deploying with, and makes 
it much more powerful.  I think we could make the kernel default as 
"defer" later as well and not break userspace that has been setting 
"madvise" ever since the 2.6 kernel.

My position is this: userspace that does MADV_HUGEPAGES knows what it's 
doing.  Let it stall if it wants to stall.  If users don't want it to be 
done, allow them to configure it.  If a binary has forced you into using 
it, use the prctl.  Otherwise, I think "defer" doing background compaction 
for everybody and direct compaction for users who really want hugepages is 
appropriate and is precisely what I need.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2017-01-03 22:44   ` David Rientjes
@ 2017-01-04  8:32     ` Vlastimil Babka
  2017-01-04  9:46       ` Michal Hocko
  2017-01-04 22:04       ` David Rientjes
  0 siblings, 2 replies; 29+ messages in thread
From: Vlastimil Babka @ 2017-01-04  8:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Mel Gorman,
	linux-kernel, linux-mm

On 01/03/2017 11:44 PM, David Rientjes wrote:
> On Mon, 2 Jan 2017, Vlastimil Babka wrote:
> 
>> I'm late to the thread (I did read it fully though), so instead of
>> multiple responses, I'll just list my observations here:
>>
>> - "defer", e.g. background kswapd+compaction is not a silver bullet, it
>> will also affect the system. Mel already mentioned extra reclaim.
>> Compaction also has CPU costs, just hides the accounting to a kernel
>> thread so it's not visible as latency. It also increases zone/node
>> lru_lock and lock pressure.
>>
>> For the same reasons, admin might want to limit direct compaction for
>> THP, even for madvise() apps. It's also likely that "defer" might have
>> lower system overhead than "madvise", as with "defer",
>> reclaim/compaction is done by one per-node thread at a time, but there
>> might be multiple madvise() threads. So there might be sense in not
>> allowing madvise() apps to do direct reclaim/compaction on "defer".
>>
> 
> Hmm, is there a significant benefit to setting "defer" rather than "never" 
> if you can rely on khugepaged to trigger compaction when it tries to 
> allocate.  I suppose if there is nothing to collapse that this won't do 
> compaction, but is this not intended for users who always want to defer 
> when not immediately available?

I guess two things
- khugepaged is quite sleepy and will not respond to demand quickly, so
it won't compact that much than kcompactd triggered by "defer"
- thus with "defer" it's more likely that although some THP faults will
fail, others in near future will succeed and benefit from THP
immediately. Again, khugepaged is much slower. But it may recover
long-running processes that were unlucky in the initial faults, so it's
not useless.

> "Defer" in it's current setting is useless, in my opinion, other than 
> providing it as a simple workaround to users when their applications are 
> doing MADV_HUGEPAGE without allowing them to configure it.

I don't think the primary motivation for "defer" was to restrict
MADV_HUGEPAGE apps, but rather to prevent latency to the majority of
apps oblivious to THP when the default was "always". On the other hand,
setting "madvise" would make performance needlessly worse in some
scenarios, so "defer" is a compromise that tries to provide THP's but
without the latency, and still much more timely than khugepaged.

But that's just my POV, Mel probably has/had also the MADV_HUGEPAGE
restriction in mind. I'd expect that the "you have to disable THP"
cargo-cult originated around apps (databases?) that did not use
MADV_HUGEPAGE, though.

> We would love 
> to use "defer" if it didn't completely break MADV_HUGEPAGE, though.

Right.

>> - for overriding specific apps such as QEMU (including their madvise()
>> usage, AFAICS), we have PR_SET_THP_DISABLE prctl(), so no need to
>> LD_PRELOAD stuff IMO.
>>
> 
> Very good point, and I think it's also worthwhile to allow users to 
> suppress the MADV_HUGEPAGE when allocating a translation buffer in qemu if 
> they choose to do so; it's a very trivial patch to qemu to allow this to 
> be configurable.  I haven't proposed it because I don't personally have a 
> need for it, and haven't been pointed to anyone who has a need for it.
> 
>> - I have wondered about exactly the issue here when Mel proposed the
>> defer option [1]. Mel responded that it doesn't seem needed at that
>> point. Now it seems it is. Too bad you didn't raise it then, but to be
>> fair you were not CC'd.
>>
> 
> My understanding is that the defer option is available to users who cannot 
> modify their binary to suppress an madvise(MADV_HUGEPAGE) and are unaware 
> that PR_SET_THP_DISABLE exists.  The prctl was added specifically when you 
> cannot control your binary.

Yeah, it's easier than LD_PRELOAD, but still not system-wide transparent.

>> So would something like this be possible?
>>
>>> echo "defer madvise" > /sys/kernel/mm/transparent_hugepage/defrag
>>> cat /sys/kernel/mm/transparent_hugepage/defrag
>> always [defer] [madvise] never
>>
>> I'm not sure about the analogous kernel boot option though, I guess
>> those can't use spaces, so maybe comma-separated?

No opinion on the above? I think it could be somewhat more elegant than
a fifth-option that Mel said he would prefer, and deliver the same
flexibility.

>> If that's not acceptable, then I would probably rather be for changing
>> "madvise" to include "defer", than the other way around. When we augment
>> kcompactd to be more proactive, it might easily be that it will
>> effectively act as "defer", even when defrag=none is set, anyway.
>>
> 
> The concern I have with changing the behavior of "madvise" is that it 
> changes long standing behavior that people have correctly implemented 
> userspace applications with.  I suggest doing this only with "defer" since 
> it's an option that is new, nobody appears to be deploying with, and makes 
> it much more powerful.  I think we could make the kernel default as 
> "defer" later as well and not break userspace that has been setting 
> "madvise" ever since the 2.6 kernel.
> 
> My position is this: userspace that does MADV_HUGEPAGES knows what it's 
> doing.  Let it stall if it wants to stall.  If users don't want it to be 
> done, allow them to configure it.  If a binary has forced you into using 
> it, use the prctl.  Otherwise, I think "defer" doing background compaction 
> for everybody and direct compaction for users who really want hugepages is 
> appropriate and is precisely what I need.

I'm not completely against this. But we haven't really ruled out the
most flexible option yet...

> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2017-01-04  8:32     ` Vlastimil Babka
@ 2017-01-04  9:46       ` Michal Hocko
  2017-01-04 22:04       ` David Rientjes
  1 sibling, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2017-01-04  9:46 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, Andrew Morton, Jonathan Corbet,
	Kirill A. Shutemov, Mel Gorman, linux-kernel, linux-mm

On Wed 04-01-17 09:32:55, Vlastimil Babka wrote:
> On 01/03/2017 11:44 PM, David Rientjes wrote:
> > On Mon, 2 Jan 2017, Vlastimil Babka wrote:
[...]
> >>> echo "defer madvise" > /sys/kernel/mm/transparent_hugepage/defrag
> >>> cat /sys/kernel/mm/transparent_hugepage/defrag
> >> always [defer] [madvise] never
> >>
> >> I'm not sure about the analogous kernel boot option though, I guess
> >> those can't use spaces, so maybe comma-separated?
> 
> No opinion on the above? I think it could be somewhat more elegant than
> a fifth-option that Mel said he would prefer, and deliver the same
> flexibility.

I am not sure we have considered the kcompactd watermark option
throughly as well. In case the relation is not clear because I admit
that the propsal was scattered in more emails. So let me summarize it
here.

Let's add a system configuration whih would control the pro-active
background compaction which would
	- wake up kcompactd pro-actively even when there is no immediate
	  memory pressure - based on the timeout
	- keep compacting as long as the requested order is under the
	  configured watermark and the compaction makes further
	  progress.

Admin can set up this tunable to reflect demand for the THP in the
particular workload. Now how it would play with the THP specific defrag
options?
	- never - THP allocations will be tried without any feedback to
	  kcopactd - no stalls in the page fault path
	- defer -  THP allocations will be tried and kcompactd woken up
	  outside of its wmark setting to catch with the workload - no
	  stalls in the page fault path
	- madvise - do the direct compaction for madvised VMAs and rely
	  on kcompactd watermarks setting to do the background
	  compaction
	- always - do the direct compaction for all VMAs

We won't have to add or modify any new THP specific option and we will
have a generic user independent tunable to tell that the system should
try to generate high order pages which is something that is demand for.
Such a solution would be more flexible as well because the configuration
could reflect the demand much better.

Is there any reason, except for not being implemented yet, that would
make it inappropriate for the described usecase?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2017-01-03 21:57           ` David Rientjes
@ 2017-01-04 10:12             ` Mel Gorman
  2017-01-04 21:53               ` David Rientjes
  0 siblings, 1 reply; 29+ messages in thread
From: Mel Gorman @ 2017-01-04 10:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, linux-kernel, linux-mm

On Tue, Jan 03, 2017 at 01:57:33PM -0800, David Rientjes wrote:
> On Tue, 3 Jan 2017, Mel Gorman wrote:
> 
> > > I sympathize with that, I've dealt with a number of issues that we have 
> > > encountered where thp defrag was either at fault or wasn't, and there were 
> > > also suggestions to set defrag to "madvise" to rule it out and that 
> > > impacted other users.
> > > 
> > > I'm curious if you could show examples where there were severe stalls 
> > > being encountered by applications that did madvise(MADV_HUGEPAGE)
> > 
> > I do not have a bug report that is specific to MADV_HUGEPAGE. Until very
> > recently they would have been masked by THP fault overhead in general.
> 
> I parse this, the masking of thp fault overhead in general, as an 
> indication that the qemu user was using defrag set to "always" rather than 
> the new kernel default of "madvise".
> 

There is a slight disconnect. The bug reports I'm aware of predate the
introduction of "defer" and the current "madvise" semantics for defrag. The
current semantics have not had enough time in the field to generate
reports. I expect lag before users are aware of "defer" due to the number
of recommendations out there about blindly disabling THP.  This because
the majority of users I deal with are not running mainline kernels.

> <SNIP>
>
> Qemu can be fixed, and I'll do it myself if necessary, when allocating a 
> new RAMBlock or translation buffer to suppress the MADV_HUGEPAGE if 
> configured.  It's a very trivial change, and I can do that if you'll 
> kindly point me to the initial bug report so I can propose it to the 
> appropriate user.
> 

I don't have a QEMU-related bug to point to. Even if I did, it would be
an enterprise distribution bug that isn't public. My expectation is that
if I get a bug, that it's going to be QEMU-related but I'm guessing.

> As Vlastimil also correctly brings up, there is already a 
> prctl(PR_SET_THP_DISABLE) option available to prevent hugepages at fault 
> and simply requires you to fork the process in the correct context to 
> inherit the vma setting, see commit 1e1836e84f87.
> 

That disables everything and functionally similar to disabling THP.

> > The current defer logic isn't in the field long enough to generate bugs
> > that are detailed enough to catch something like this.
> > 
> 
> Let us consider this email as a generating a bug that we, the users of 
> MADV_HUGEPAGE that are using the madvise(2) correctly and add flags to 
> suppress it when desired correctly, have no option to allow background 
> compaction for everybody when we cannot allocate thp immediately but also 
> allow users of our library to accept the cost of direct compaction at 
> fault because they really want their .text segment remapped and backed by 
> hugepages.
> 

Again, I accept that. A new option for the semantics you want instead
of adjusting the existing "defer" semantics is preferred to give the
"more heavyweight version of madvise" you're looking for. This is simply
because it's easier to resolve in the field when users are not always that
knowledgable and are typically reluctant to modify application launching
but usually open to adjusting kernel tunables.

> > > The problem with the current option set is that we don't have the ability 
> > > to trigger background compaction for everybody, which only very minimally 
> > > impacts their page fault latency since it just wakes up kcompactd, and 
> > > allow MADV_HUGEPAGE users to accept that up-front cost by doing direct 
> > > compaction.  My usecase, remapping .text segment and faulting thp memory 
> > > at startup, demands that ability.  Setting defrag=madvise gets that 
> > > behavior, but nobody else triggers background compaction when thp memory 
> > > fails and we _want_ that behavior so work is being done to defrag.  
> > > Setting defrag=defer makes MADV_HUGEPAGE a no-op for page fault, and I 
> > > argue that's the wrong behavior.
> > > 
> > 
> > Again, I accept your reasoning and I don't have direct evidence that it'll be
> > a problem. In an emergency, it could also be worked around using LD_PRELOAD
> > or a systemtap script until a kernel fix could be applied. Unfortunately it
> > could also be years before a patch like this would hit enough users for me
> > to spot the problem in the field. That's not enough to Nak the patch but
> > it was enough to suggest an alternative that would side-step the problem
> > ever occurring.
> > 
> 
> Or simply forking the application after doing prctl(PR_SET_THP_DISABLE)?  
> What exactly are you working around with a LD_PRELOAD that isn't addressed 
> by this?
> 

LD_PRELOAD can mask the MADV flag for get the existing "defer"
semantics. Disabling THP entirely is something else.

> Btw, is there a qemu bug filed that makes doing the MADV_HUGEPAGE 
> configurable?  I don't find it at https://bugs.launchpad.net/qemu.
> 

Not that I'm aware of but I didn't go looking either.

> > > If you want a fifth option added to sysfs for thp defrag, that's fine, we 
> > > can easily do that.  I'm slightly concerned with more and more options 
> > > added that we will eventually approach the 2^4 option count that I 
> > > mentioned earlier and nobody will know what to select.  I'm fine with the 
> > > kernel default remaining as "madvise,"
> > 
> > I find it hard to believe this one *can* explode. There are a limited
> > number of user-triggable actions that can trigger stalls.
> > 
> 
> I'm confused as to whether you support the addition of a fifth option that 
> users will have to learn what they want, or whether you are open to 
> changing the behavior of "defer" to actually respect userspace madvise(2)?

I would prefer the fifth option and have users select the option if they
need it and preserve the existing semantics of defer. However, as I've
stated before, as I'm not actually aware of problems with madvise and the
current semantics so I cannot nak the patch you have. It'll simply require
to have some solution in mind if a bug is encountered. When THP was first
introduced, it was months before users started reporting bugs as most users
I deal with are not running mainline kernels. There was similar lag when
automatic NUMA balancing was introduced. I expect a similar lag for the
existing "madvise" default and some education about using "defer" instead
of disabling THP entirely when stalls are encountered. It's the hazard of
dealing with users that lag behind mainline and it's not a unique problem.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2017-01-04 10:12             ` Mel Gorman
@ 2017-01-04 21:53               ` David Rientjes
  0 siblings, 0 replies; 29+ messages in thread
From: David Rientjes @ 2017-01-04 21:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, Andrew Morton, Jonathan Corbet, Kirill A. Shutemov,
	Vlastimil Babka, linux-kernel, linux-mm

On Wed, 4 Jan 2017, Mel Gorman wrote:

> There is a slight disconnect. The bug reports I'm aware of predate the
> introduction of "defer" and the current "madvise" semantics for defrag. The
> current semantics have not had enough time in the field to generate
> reports. I expect lag before users are aware of "defer" due to the number
> of recommendations out there about blindly disabling THP.  This because
> the majority of users I deal with are not running mainline kernels.
> 

I find it sad that we need to have five options for thp defrag and that 
people now need to research options that don't break their userspace and 
affect all applications on the system, especially when one of those 
options now appears to have a hypothetical usecase, but in the interest of 
making forward progress in this area for support that we truly need, I'll 
reluctantly propose it as a patch.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred
  2017-01-04  8:32     ` Vlastimil Babka
  2017-01-04  9:46       ` Michal Hocko
@ 2017-01-04 22:04       ` David Rientjes
  1 sibling, 0 replies; 29+ messages in thread
From: David Rientjes @ 2017-01-04 22:04 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Jonathan Corbet, Kirill A. Shutemov, Mel Gorman,
	linux-kernel, linux-mm

On Wed, 4 Jan 2017, Vlastimil Babka wrote:

> > Hmm, is there a significant benefit to setting "defer" rather than "never" 
> > if you can rely on khugepaged to trigger compaction when it tries to 
> > allocate.  I suppose if there is nothing to collapse that this won't do 
> > compaction, but is this not intended for users who always want to defer 
> > when not immediately available?
> 
> I guess two things
> - khugepaged is quite sleepy and will not respond to demand quickly, so
> it won't compact that much than kcompactd triggered by "defer"

That's configurable, so if a user sets defrag to never, they also have the 
ability to make khugepaged more aggressive in the background to complement 
that decision.

> I don't think the primary motivation for "defer" was to restrict
> MADV_HUGEPAGE apps, but rather to prevent latency to the majority of
> apps oblivious to THP when the default was "always". On the other hand,
> setting "madvise" would make performance needlessly worse in some
> scenarios, so "defer" is a compromise that tries to provide THP's but
> without the latency, and still much more timely than khugepaged.
> 

It's disappointing we need to have an option that exists solely to 
suppress a userspace MADV_HUGEPAGE and not actually fix the userspace to 
not do the MADV_HUGEPAGE in the first place by making it configurable.  
That is backwards compatible and doesn't require a new kernel version.  
This never gets answered in the thread, however, and I offered to make the 
very trivial patch to qemu to do that for the translation buffer but 
nobody who uses qemu is even asking for this.  It's baffling.

> >> So would something like this be possible?
> >>
> >>> echo "defer madvise" > /sys/kernel/mm/transparent_hugepage/defrag
> >>> cat /sys/kernel/mm/transparent_hugepage/defrag
> >> always [defer] [madvise] never
> >>
> >> I'm not sure about the analogous kernel boot option though, I guess
> >> those can't use spaces, so maybe comma-separated?
> 
> No opinion on the above? I think it could be somewhat more elegant than
> a fifth-option that Mel said he would prefer, and deliver the same
> flexibility.
> 

I think this would work, but I'm concerned about two things: (1) the 
kernel command line format as you pointed out earlier, (2) allowing two 
options to be combined but not other options (always + never), so it takes 
even more explaining to do to say what you can actually formulate and 
what the results of that combining is.  The tristate, quadstate, and now 
quint-state options for thp were never extendable, but now this appears to 
be the most desired option.  We can await the bug reports of users who say 
their MADV_HUGEPAGE is a no-op, though, and tell them their admin needs to 
switch away from "defer" if anybody actually ever uses that setting.

I think you, me, and Kirill are mostly on the same page with respect to 
this, but I can't argue against hypothetical usecases and how we need to 
wait years for "defer" to be available to see if any bug reports are 
generated to make a decision in this area, so my final proposal in this 
matter will be the reluctant fifth option and if it doesn't work I'll just 
carry this for ourselves (we have no use for "defer" without this patch).

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2017-01-04 22:05 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-22  0:21 [patch] mm, thp: always direct reclaim for MADV_HUGEPAGE even when deferred David Rientjes
2016-12-22  8:31 ` Kirill A. Shutemov
2016-12-22 10:00 ` Michal Hocko
2016-12-22 21:05   ` David Rientjes
2016-12-23  8:51     ` Michal Hocko
2016-12-23 10:01       ` David Rientjes
2016-12-23 11:18         ` Michal Hocko
2016-12-23 22:46           ` David Rientjes
2016-12-26  9:02             ` Michal Hocko
2016-12-27  0:53               ` David Rientjes
2016-12-27  2:32                 ` Kirill A. Shutemov
2016-12-27  9:41                 ` Michal Hocko
2016-12-27 21:36                   ` David Rientjes
2016-12-28  8:48                     ` Michal Hocko
2016-12-28 21:33                       ` David Rientjes
2016-12-29  8:24                         ` Michal Hocko
2016-12-30 12:36     ` Mel Gorman
2016-12-30 12:56       ` Michal Hocko
2016-12-30 14:08         ` Mel Gorman
2016-12-30 22:30       ` David Rientjes
2017-01-03 10:37         ` Mel Gorman
2017-01-03 21:57           ` David Rientjes
2017-01-04 10:12             ` Mel Gorman
2017-01-04 21:53               ` David Rientjes
2017-01-02  8:38 ` Vlastimil Babka
2017-01-03 22:44   ` David Rientjes
2017-01-04  8:32     ` Vlastimil Babka
2017-01-04  9:46       ` Michal Hocko
2017-01-04 22:04       ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).