linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm: make PR_SET_THP_DISABLE immediately active
@ 2017-06-02 15:03 Mike Rapoport
       [not found] ` <1496415802-30944-1-git-send-email-rppt-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Rapoport @ 2017-06-02 15:03 UTC (permalink / raw)
  To: Linux API
  Cc: Michal Hocko, Vlastimil Babka, Andrea Arcangeli, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Pavel Emelyanov, linux-mm,
	lkml, Michal Hocko, Mike Rapoport

From: Michal Hocko <mhocko@suse.com>

PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
existing mapping because it only updated mm->def_flags which is a template
for new mappings. The mappings created after prctl(PR_SET_THP_DISABLE) have
VM_NOHUGEPAGE flag set.  This can be quite surprising for all those
applications which do not do prctl(); fork() & exec() and want to control
their own THP behavior.

Another usecase when the immediate semantic of the prctl might be useful is
a combination of pre- and post-copy migration of containers with CRIU.  In
this case CRIU populates a part of a memory region with data that was saved
during the pre-copy stage. Afterwards, the region is registered with
userfaultfd and CRIU expects to get page faults for the parts of the region
that were not yet populated. However, khugepaged collapses the pages and
the expected page faults do not occur.

In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
temporary mechanism for enabling/disabling THP process wide.

Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is
tested when decision whether to use huge pages is taken either during page
fault of at the time of THP collapse.

It should be noted, that the new implementation makes PR_SET_THP_DISABLE
master override to any per-VMA setting, which was not the case previously.

Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
---
 include/linux/huge_mm.h        | 1 +
 include/linux/khugepaged.h     | 3 ++-
 include/linux/sched/coredump.h | 5 ++++-
 kernel/sys.c                   | 6 +++---
 mm/khugepaged.c                | 3 ++-
 mm/shmem.c                     | 8 +++++---
 6 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a3762d4..9da053c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -92,6 +92,7 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 	   (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&			\
 	   ((__vma)->vm_flags & VM_HUGEPAGE))) &&			\
 	 !((__vma)->vm_flags & VM_NOHUGEPAGE) &&			\
+	 !test_bit(MMF_DISABLE_THP, &(__vma)->vm_mm->flags) &&		\
 	 !is_vma_temporary_stack(__vma))
 #define transparent_hugepage_use_zero_page()				\
 	(transparent_hugepage_flags &					\
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 5d9a400..f0d7335 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -48,7 +48,8 @@ static inline int khugepaged_enter(struct vm_area_struct *vma,
 	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
 		if ((khugepaged_always() ||
 		     (khugepaged_req_madv() && (vm_flags & VM_HUGEPAGE))) &&
-		    !(vm_flags & VM_NOHUGEPAGE))
+		    !(vm_flags & VM_NOHUGEPAGE) &&
+		    !test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
 			if (__khugepaged_enter(vma->vm_mm))
 				return -ENOMEM;
 	return 0;
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 69eedce..98ae0d0 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -68,7 +68,10 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
 #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
 #define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
+#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
+#define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
 
-#define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
+#define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
+				 MMF_DISABLE_THP_MASK)
 
 #endif /* _LINUX_SCHED_COREDUMP_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 8a94b4e..e48f063 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2266,7 +2266,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_GET_THP_DISABLE:
 		if (arg2 || arg3 || arg4 || arg5)
 			return -EINVAL;
-		error = !!(me->mm->def_flags & VM_NOHUGEPAGE);
+		error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
 		break;
 	case PR_SET_THP_DISABLE:
 		if (arg3 || arg4 || arg5)
@@ -2274,9 +2274,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		if (down_write_killable(&me->mm->mmap_sem))
 			return -EINTR;
 		if (arg2)
-			me->mm->def_flags |= VM_NOHUGEPAGE;
+			set_bit(MMF_DISABLE_THP, &me->mm->flags);
 		else
-			me->mm->def_flags &= ~VM_NOHUGEPAGE;
+			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
 		up_write(&me->mm->mmap_sem);
 		break;
 	case PR_MPX_ENABLE_MANAGEMENT:
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 32c66e7..b444b9b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -824,7 +824,8 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
 static bool hugepage_vma_check(struct vm_area_struct *vma)
 {
 	if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
-	    (vma->vm_flags & VM_NOHUGEPAGE))
+	    (vma->vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
 		return false;
 	if (shmem_file(vma->vm_file)) {
 		if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
diff --git a/mm/shmem.c b/mm/shmem.c
index e67d6ba..33b908b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1977,10 +1977,12 @@ static int shmem_fault(struct vm_fault *vmf)
 	}
 
 	sgp = SGP_CACHE;
-	if (vma->vm_flags & VM_HUGEPAGE)
-		sgp = SGP_HUGE;
-	else if (vma->vm_flags & VM_NOHUGEPAGE)
+
+	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
 		sgp = SGP_NOHUGE;
+	else if (vma->vm_flags & VM_HUGEPAGE)
+		sgp = SGP_HUGE;
 
 	error = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
 				  gfp, vma, vmf, &ret);
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active
       [not found] ` <1496415802-30944-1-git-send-email-rppt-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2017-06-02 19:50   ` Andrew Morton
  2017-06-02 20:31     ` Vlastimil Babka
  2017-06-03 10:40     ` Mike Rapoprt
  0 siblings, 2 replies; 11+ messages in thread
From: Andrew Morton @ 2017-06-02 19:50 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Linux API, Michal Hocko, Vlastimil Babka, Andrea Arcangeli,
	Arnd Bergmann, Kirill A. Shutemov, Pavel Emelyanov, linux-mm,
	lkml, Michal Hocko

On Fri,  2 Jun 2017 18:03:22 +0300 "Mike Rapoport" <rppt-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:

> PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
> existing mapping because it only updated mm->def_flags which is a template
> for new mappings. The mappings created after prctl(PR_SET_THP_DISABLE) have
> VM_NOHUGEPAGE flag set.  This can be quite surprising for all those
> applications which do not do prctl(); fork() & exec() and want to control
> their own THP behavior.
> 
> Another usecase when the immediate semantic of the prctl might be useful is
> a combination of pre- and post-copy migration of containers with CRIU.  In
> this case CRIU populates a part of a memory region with data that was saved
> during the pre-copy stage. Afterwards, the region is registered with
> userfaultfd and CRIU expects to get page faults for the parts of the region
> that were not yet populated. However, khugepaged collapses the pages and
> the expected page faults do not occur.
> 
> In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
> temporary mechanism for enabling/disabling THP process wide.
> 
> Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is
> tested when decision whether to use huge pages is taken either during page
> fault of at the time of THP collapse.
> 
> It should be noted, that the new implementation makes PR_SET_THP_DISABLE
> master override to any per-VMA setting, which was not the case previously.
>
> Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")

"Fixes" is a bit strong.  I'd say "alters".  And significantly altering
the runtime behaviour of a three-year-old interface is rather a worry,
no?

Perhaps we should be adding new prctl modes to select this new
behaviour and leave the existing PR_SET_THP_DISABLE behaviour as-is?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active
  2017-06-02 19:50   ` Andrew Morton
@ 2017-06-02 20:31     ` Vlastimil Babka
       [not found]       ` <8a810c81-6a72-2af0-a450-6f03c71d8cca-AlSwsSmVLrQ@public.gmane.org>
  2017-06-03 10:40     ` Mike Rapoprt
  1 sibling, 1 reply; 11+ messages in thread
From: Vlastimil Babka @ 2017-06-02 20:31 UTC (permalink / raw)
  To: Andrew Morton, Mike Rapoport
  Cc: Linux API, Michal Hocko, Andrea Arcangeli, Arnd Bergmann,
	Kirill A. Shutemov, Pavel Emelyanov, linux-mm, lkml,
	Michal Hocko

On 06/02/2017 09:50 PM, Andrew Morton wrote:
> On Fri,  2 Jun 2017 18:03:22 +0300 "Mike Rapoport" <rppt@linux.vnet.ibm.com> wrote:
> 
>> PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
>> existing mapping because it only updated mm->def_flags which is a template
>> for new mappings. The mappings created after prctl(PR_SET_THP_DISABLE) have
>> VM_NOHUGEPAGE flag set.  This can be quite surprising for all those
>> applications which do not do prctl(); fork() & exec() and want to control
>> their own THP behavior.
>>
>> Another usecase when the immediate semantic of the prctl might be useful is
>> a combination of pre- and post-copy migration of containers with CRIU.  In
>> this case CRIU populates a part of a memory region with data that was saved
>> during the pre-copy stage. Afterwards, the region is registered with
>> userfaultfd and CRIU expects to get page faults for the parts of the region
>> that were not yet populated. However, khugepaged collapses the pages and
>> the expected page faults do not occur.
>>
>> In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
>> temporary mechanism for enabling/disabling THP process wide.
>>
>> Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is
>> tested when decision whether to use huge pages is taken either during page
>> fault of at the time of THP collapse.
>>
>> It should be noted, that the new implementation makes PR_SET_THP_DISABLE
>> master override to any per-VMA setting, which was not the case previously.
>>
>> Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
> 
> "Fixes" is a bit strong.  I'd say "alters".  And significantly altering
> the runtime behaviour of a three-year-old interface is rather a worry,
> no?
> 
> Perhaps we should be adding new prctl modes to select this new
> behaviour and leave the existing PR_SET_THP_DISABLE behaviour as-is?

I think we can reasonably assume that most users of the prctl do just
the fork() & exec() thing, so they will be unaffected. And as usual, if
somebody does complain in the end, we revert and try the other way?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active
       [not found]       ` <8a810c81-6a72-2af0-a450-6f03c71d8cca-AlSwsSmVLrQ@public.gmane.org>
@ 2017-06-02 20:40         ` Andrew Morton
       [not found]           ` <20170602134038.13728cb77678ae1a7d7128a4-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2017-06-02 20:40 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mike Rapoport, Linux API, Michal Hocko, Andrea Arcangeli,
	Arnd Bergmann, Kirill A. Shutemov, Pavel Emelyanov, linux-mm,
	lkml, Michal Hocko

On Fri, 2 Jun 2017 22:31:47 +0200 Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org> wrote:

> On 06/02/2017 09:50 PM, Andrew Morton wrote:
> > On Fri,  2 Jun 2017 18:03:22 +0300 "Mike Rapoport" <rppt-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> > 
> >> PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
> >> existing mapping because it only updated mm->def_flags which is a template
> >> for new mappings. The mappings created after prctl(PR_SET_THP_DISABLE) have
> >> VM_NOHUGEPAGE flag set.  This can be quite surprising for all those
> >> applications which do not do prctl(); fork() & exec() and want to control
> >> their own THP behavior.
> >>
> >> Another usecase when the immediate semantic of the prctl might be useful is
> >> a combination of pre- and post-copy migration of containers with CRIU.  In
> >> this case CRIU populates a part of a memory region with data that was saved
> >> during the pre-copy stage. Afterwards, the region is registered with
> >> userfaultfd and CRIU expects to get page faults for the parts of the region
> >> that were not yet populated. However, khugepaged collapses the pages and
> >> the expected page faults do not occur.
> >>
> >> In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
> >> temporary mechanism for enabling/disabling THP process wide.
> >>
> >> Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is
> >> tested when decision whether to use huge pages is taken either during page
> >> fault of at the time of THP collapse.
> >>
> >> It should be noted, that the new implementation makes PR_SET_THP_DISABLE
> >> master override to any per-VMA setting, which was not the case previously.
> >>
> >> Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
> > 
> > "Fixes" is a bit strong.  I'd say "alters".  And significantly altering
> > the runtime behaviour of a three-year-old interface is rather a worry,
> > no?
> > 
> > Perhaps we should be adding new prctl modes to select this new
> > behaviour and leave the existing PR_SET_THP_DISABLE behaviour as-is?
> 
> I think we can reasonably assume that most users of the prctl do just
> the fork() & exec() thing, so they will be unaffected.

That sounds optimistic.  Perhaps people are using the current behaviour
to set on particular mapping to MMF_DISABLE_THP, with

	prctl(PR_SET_THP_DISABLE)
	mmap()
	prctl(PR_CLR_THP_DISABLE)

?

Seems a reasonable thing to do.  But who knows - people do all sorts of
inventive things.

> And as usual, if
> somebody does complain in the end, we revert and try the other way?

But by then it's too late - the new behaviour will be out in the field.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active
       [not found]           ` <20170602134038.13728cb77678ae1a7d7128a4-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2017-06-02 20:55             ` Vlastimil Babka
  2017-06-02 21:10               ` Andrew Morton
       [not found]               ` <f9e8a159-7a25-6813-f909-11c4ae58adf3-AlSwsSmVLrQ@public.gmane.org>
  2017-06-03  7:40             ` Michal Hocko
  1 sibling, 2 replies; 11+ messages in thread
From: Vlastimil Babka @ 2017-06-02 20:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Rapoport, Linux API, Michal Hocko, Andrea Arcangeli,
	Arnd Bergmann, Kirill A. Shutemov, Pavel Emelyanov, linux-mm,
	lkml, Michal Hocko

On 06/02/2017 10:40 PM, Andrew Morton wrote:
> On Fri, 2 Jun 2017 22:31:47 +0200 Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org> wrote:
>>> Perhaps we should be adding new prctl modes to select this new
>>> behaviour and leave the existing PR_SET_THP_DISABLE behaviour as-is?
>>
>> I think we can reasonably assume that most users of the prctl do just
>> the fork() & exec() thing, so they will be unaffected.
> 
> That sounds optimistic.  Perhaps people are using the current behaviour
> to set on particular mapping to MMF_DISABLE_THP, with
> 
> 	prctl(PR_SET_THP_DISABLE)
> 	mmap()
> 	prctl(PR_CLR_THP_DISABLE)
> 
> ?
> 
> Seems a reasonable thing to do.

Using madvise(MADV_NOHUGEPAGE) seems reasonabler to me, with the same
effect. And it's older (2.6.38).

> But who knows - people do all sorts of
> inventive things.

Yeah :( but we can hope they don't even know that the prctl currently
behaves they way it does - man page doesn't suggest it would, and most
of us in this thread found it surprising.

>> And as usual, if
>> somebody does complain in the end, we revert and try the other way?
> 
> But by then it's too late - the new behaviour will be out in the field.

Revert in stable then?
But I don't think this patch should go to stable. I understand right
that CRIU will switch to the UFFDIO_COPY approach and doesn't need the
prctl change/new madvise anymore?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active
  2017-06-02 20:55             ` Vlastimil Babka
@ 2017-06-02 21:10               ` Andrew Morton
  2017-06-05  9:27                 ` Vlastimil Babka
       [not found]               ` <f9e8a159-7a25-6813-f909-11c4ae58adf3-AlSwsSmVLrQ@public.gmane.org>
  1 sibling, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2017-06-02 21:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mike Rapoport, Linux API, Michal Hocko, Andrea Arcangeli,
	Arnd Bergmann, Kirill A. Shutemov, Pavel Emelyanov, linux-mm,
	lkml, Michal Hocko

On Fri, 2 Jun 2017 22:55:12 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> On 06/02/2017 10:40 PM, Andrew Morton wrote:
> > On Fri, 2 Jun 2017 22:31:47 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> >>> Perhaps we should be adding new prctl modes to select this new
> >>> behaviour and leave the existing PR_SET_THP_DISABLE behaviour as-is?
> >>
> >> I think we can reasonably assume that most users of the prctl do just
> >> the fork() & exec() thing, so they will be unaffected.
> > 
> > That sounds optimistic.  Perhaps people are using the current behaviour
> > to set on particular mapping to MMF_DISABLE_THP, with
> > 
> > 	prctl(PR_SET_THP_DISABLE)
> > 	mmap()
> > 	prctl(PR_CLR_THP_DISABLE)
> > 
> > ?
> > 
> > Seems a reasonable thing to do.
> 
> Using madvise(MADV_NOHUGEPAGE) seems reasonabler to me, with the same
> effect. And it's older (2.6.38).
> 
> > But who knows - people do all sorts of
> > inventive things.
> 
> Yeah :( but we can hope they don't even know that the prctl currently
> behaves they way it does - man page doesn't suggest it would, and most
> of us in this thread found it surprising.

Well.  There might be such people and sometimes we do make people
unhappy.  it partly depends on how traumatic it would be to leave the
current behaviour as-is.  Have you evaluated such a patch?


> >> And as usual, if
> >> somebody does complain in the end, we revert and try the other way?
> > 
> > But by then it's too late - the new behaviour will be out in the field.
> 
> Revert in stable then?
> But I don't think this patch should go to stable. I understand right
> that CRIU will switch to the UFFDIO_COPY approach and doesn't need the
> prctl change/new madvise anymore?

What I mean is that the new behaviour will go out in 4.12 and it may
be many months before we find out that we broke someone.  By then, we
can't go back because others may be assuming the new behaviour.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active
       [not found]           ` <20170602134038.13728cb77678ae1a7d7128a4-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2017-06-02 20:55             ` Vlastimil Babka
@ 2017-06-03  7:40             ` Michal Hocko
  1 sibling, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2017-06-03  7:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Mike Rapoport, Linux API, Andrea Arcangeli,
	Arnd Bergmann, Kirill A. Shutemov, Pavel Emelyanov, linux-mm,
	lkml

On Fri 02-06-17 13:40:38, Andrew Morton wrote:
> On Fri, 2 Jun 2017 22:31:47 +0200 Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org> wrote:
> 
> > On 06/02/2017 09:50 PM, Andrew Morton wrote:
> > > On Fri,  2 Jun 2017 18:03:22 +0300 "Mike Rapoport" <rppt-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> > > 
> > >> PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
> > >> existing mapping because it only updated mm->def_flags which is a template
> > >> for new mappings. The mappings created after prctl(PR_SET_THP_DISABLE) have
> > >> VM_NOHUGEPAGE flag set.  This can be quite surprising for all those
> > >> applications which do not do prctl(); fork() & exec() and want to control
> > >> their own THP behavior.
> > >>
> > >> Another usecase when the immediate semantic of the prctl might be useful is
> > >> a combination of pre- and post-copy migration of containers with CRIU.  In
> > >> this case CRIU populates a part of a memory region with data that was saved
> > >> during the pre-copy stage. Afterwards, the region is registered with
> > >> userfaultfd and CRIU expects to get page faults for the parts of the region
> > >> that were not yet populated. However, khugepaged collapses the pages and
> > >> the expected page faults do not occur.
> > >>
> > >> In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
> > >> temporary mechanism for enabling/disabling THP process wide.
> > >>
> > >> Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is
> > >> tested when decision whether to use huge pages is taken either during page
> > >> fault of at the time of THP collapse.
> > >>
> > >> It should be noted, that the new implementation makes PR_SET_THP_DISABLE
> > >> master override to any per-VMA setting, which was not the case previously.
> > >>
> > >> Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
> > > 
> > > "Fixes" is a bit strong.  I'd say "alters".  And significantly altering
> > > the runtime behaviour of a three-year-old interface is rather a worry,
> > > no?
> > > 
> > > Perhaps we should be adding new prctl modes to select this new
> > > behaviour and leave the existing PR_SET_THP_DISABLE behaviour as-is?
> > 
> > I think we can reasonably assume that most users of the prctl do just
> > the fork() & exec() thing, so they will be unaffected.
> 
> That sounds optimistic.  Perhaps people are using the current behaviour
> to set on particular mapping to MMF_DISABLE_THP, with
> 
> 	prctl(PR_SET_THP_DISABLE)
> 	mmap()
> 	prctl(PR_CLR_THP_DISABLE)
> 
> ?
> 
> Seems a reasonable thing to do.

Is it? The documentation is not very specific but it is clear about the
scope being thread (I would argue process would be more approapriate
but whatever) "Set the state of the "THP disable" flag for the calling
thread." So the above seems like an incorrect usage to me.

> But who knows - people do all sorts of inventive things.

well yes.

> > And as usual, if
> > somebody does complain in the end, we revert and try the other way?
> 
> But by then it's too late - the new behaviour will be out in the field.

Well, the interface is currently broken for anything other than prctl
& exec. And those will work properly even with the patch. So I am not
really sure whether keeping the current status quo is reasonable.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active
       [not found]               ` <f9e8a159-7a25-6813-f909-11c4ae58adf3-AlSwsSmVLrQ@public.gmane.org>
@ 2017-06-03 10:34                 ` Mike Rapoprt
       [not found]                   ` <CAAB5A6A-D7A1-4C06-9A07-D7EF56278EE5-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Rapoprt @ 2017-06-03 10:34 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton
  Cc: Linux API, Michal Hocko, Andrea Arcangeli, Arnd Bergmann,
	Kirill A. Shutemov, Pavel Emelyanov, linux-mm, lkml,
	Michal Hocko



On June 2, 2017 11:55:12 PM GMT+03:00, Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org> wrote:
>On 06/02/2017 10:40 PM, Andrew Morton wrote:
>> On Fri, 2 Jun 2017 22:31:47 +0200 Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org>
>wrote:
>>>> Perhaps we should be adding new prctl modes to select this new
>>>> behaviour and leave the existing PR_SET_THP_DISABLE behaviour
>as-is?
>>>
>>> I think we can reasonably assume that most users of the prctl do
>just
>>> the fork() & exec() thing, so they will be unaffected.
>> 
>> That sounds optimistic.  Perhaps people are using the current
>behaviour
>> to set on particular mapping to MMF_DISABLE_THP, with
>> 
>> 	prctl(PR_SET_THP_DISABLE)
>> 	mmap()
>> 	prctl(PR_CLR_THP_DISABLE)
>> 
>> ?
>> 
>> Seems a reasonable thing to do.
>
>Using madvise(MADV_NOHUGEPAGE) seems reasonabler to me, with the same
>effect. And it's older (2.6.38).
>
>> But who knows - people do all sorts of
>> inventive things.
>
>Yeah :( but we can hope they don't even know that the prctl currently
>behaves they way it does - man page doesn't suggest it would, and most
>of us in this thread found it surprising.
>
>>> And as usual, if
>>> somebody does complain in the end, we revert and try the other way?
>> 
>> But by then it's too late - the new behaviour will be out in the
>field.
>
>Revert in stable then?
>But I don't think this patch should go to stable. I understand right
>that CRIU will switch to the UFFDIO_COPY approach and doesn't need the
>prctl change/new madvise anymore?

Yes, we are going to use UFFDIO_COPY. We still might want to have control over THP in the future without changing per-VMA flags, though.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active
  2017-06-02 19:50   ` Andrew Morton
  2017-06-02 20:31     ` Vlastimil Babka
@ 2017-06-03 10:40     ` Mike Rapoprt
  1 sibling, 0 replies; 11+ messages in thread
From: Mike Rapoprt @ 2017-06-03 10:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux API, Michal Hocko, Vlastimil Babka, Andrea Arcangeli,
	Arnd Bergmann, Kirill A. Shutemov, Pavel Emelyanov, linux-mm,
	lkml, Michal Hocko



On June 2, 2017 10:50:59 PM GMT+03:00, Andrew Morton <akpm@linux-foundation.org> wrote:
>On Fri,  2 Jun 2017 18:03:22 +0300 "Mike Rapoport"
><rppt@linux.vnet.ibm.com> wrote:
>
>> PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect
>any
>> existing mapping because it only updated mm->def_flags which is a
>template
>> for new mappings. The mappings created after
>prctl(PR_SET_THP_DISABLE) have
>> VM_NOHUGEPAGE flag set.  This can be quite surprising for all those
>> applications which do not do prctl(); fork() & exec() and want to
>control
>> their own THP behavior.
>> 
>> Another usecase when the immediate semantic of the prctl might be
>useful is
>> a combination of pre- and post-copy migration of containers with
>CRIU.  In
>> this case CRIU populates a part of a memory region with data that was
>saved
>> during the pre-copy stage. Afterwards, the region is registered with
>> userfaultfd and CRIU expects to get page faults for the parts of the
>region
>> that were not yet populated. However, khugepaged collapses the pages
>and
>> the expected page faults do not occur.
>> 
>> In more general case, the prctl(PR_SET_THP_DISABLE) could be used as
>a
>> temporary mechanism for enabling/disabling THP process wide.
>> 
>> Implementation wise, a new MMF_DISABLE_THP flag is added. This flag
>is
>> tested when decision whether to use huge pages is taken either during
>page
>> fault of at the time of THP collapse.
>> 
>> It should be noted, that the new implementation makes
>PR_SET_THP_DISABLE
>> master override to any per-VMA setting, which was not the case
>previously.
>>
>> Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and
>PRCTL_THP_DISABLE")
>
>"Fixes" is a bit strong.  I'd say "alters".  And significantly altering
>the runtime behaviour of a three-year-old interface is rather a worry,
>no?

Well, there are people that consider current behavior as bug :)
One can argue we alter the implementation​details and users should not rely on that...

>Perhaps we should be adding new prctl modes to select this new
>behaviour and leave the existing PR_SET_THP_DISABLE behaviour as-is?



-- 
Sincerely yours,
Mike.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active
       [not found]                   ` <CAAB5A6A-D7A1-4C06-9A07-D7EF56278EE5-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2017-06-05  7:05                     ` Mike Rapoport
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Rapoport @ 2017-06-05  7:05 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton
  Cc: Linux API, Michal Hocko, Andrea Arcangeli, Arnd Bergmann,
	Kirill A. Shutemov, Pavel Emelyanov, linux-mm, lkml,
	Michal Hocko

On Sat, Jun 03, 2017 at 01:34:52PM +0300, Mike Rapoprt wrote:
> 
> 
> On June 2, 2017 11:55:12 PM GMT+03:00, Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org> wrote:
> >On 06/02/2017 10:40 PM, Andrew Morton wrote:
> >> On Fri, 2 Jun 2017 22:31:47 +0200 Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org>
> >wrote:
> >>>> Perhaps we should be adding new prctl modes to select this new
> >>>> behaviour and leave the existing PR_SET_THP_DISABLE behaviour
> >as-is?
> >>>
> >>> I think we can reasonably assume that most users of the prctl do
> >just
> >>> the fork() & exec() thing, so they will be unaffected.
> >> 
> >> That sounds optimistic.  Perhaps people are using the current
> >behaviour
> >> to set on particular mapping to MMF_DISABLE_THP, with
> >> 
> >> 	prctl(PR_SET_THP_DISABLE)
> >> 	mmap()
> >> 	prctl(PR_CLR_THP_DISABLE)
> >> 
> >> ?
> >> 
> >> Seems a reasonable thing to do.
> >
> >Using madvise(MADV_NOHUGEPAGE) seems reasonabler to me, with the same
> >effect. And it's older (2.6.38).
> >
> >> But who knows - people do all sorts of
> >> inventive things.
> >
> >Yeah :( but we can hope they don't even know that the prctl currently
> >behaves they way it does - man page doesn't suggest it would, and most
> >of us in this thread found it surprising.
> >
> >>> And as usual, if
> >>> somebody does complain in the end, we revert and try the other way?
> >> 
> >> But by then it's too late - the new behaviour will be out in the
> >field.
> >
> >Revert in stable then?
> >But I don't think this patch should go to stable. I understand right
> >that CRIU will switch to the UFFDIO_COPY approach and doesn't need the
> >prctl change/new madvise anymore?
> 
> Yes, we are going to use UFFDIO_COPY. We still might want to have control
> over THP in the future without changing per-VMA flags, though.

Unfortunately, I was over optimistic about ability of CRIU to use
UFFDIO_COPY for pre-copy part :(
I was too concentrated on the simplified flow and overlooked some important
details. After I've spent some time trying to actually implement usage of
UFFDIO_COPY, I realized that registering memory with userfault at that
point of the restore flow quite contradicts CRIU architecture :(

That said, we would really want to have the interface this patch proposes.
 
-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm: make PR_SET_THP_DISABLE immediately active
  2017-06-02 21:10               ` Andrew Morton
@ 2017-06-05  9:27                 ` Vlastimil Babka
  0 siblings, 0 replies; 11+ messages in thread
From: Vlastimil Babka @ 2017-06-05  9:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Rapoport, Linux API, Michal Hocko, Andrea Arcangeli,
	Arnd Bergmann, Kirill A. Shutemov, Pavel Emelyanov, linux-mm,
	lkml, Michal Hocko

On 06/02/2017 11:10 PM, Andrew Morton wrote:
> On Fri, 2 Jun 2017 22:55:12 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> 
>> On 06/02/2017 10:40 PM, Andrew Morton wrote:
>>> On Fri, 2 Jun 2017 22:31:47 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
>>>>> Perhaps we should be adding new prctl modes to select this new
>>>>> behaviour and leave the existing PR_SET_THP_DISABLE behaviour as-is?
>>>>
>>>> I think we can reasonably assume that most users of the prctl do just
>>>> the fork() & exec() thing, so they will be unaffected.
>>>
>>> That sounds optimistic.  Perhaps people are using the current behaviour
>>> to set on particular mapping to MMF_DISABLE_THP, with
>>>
>>> 	prctl(PR_SET_THP_DISABLE)
>>> 	mmap()
>>> 	prctl(PR_CLR_THP_DISABLE)
>>>
>>> ?
>>>
>>> Seems a reasonable thing to do.
>>
>> Using madvise(MADV_NOHUGEPAGE) seems reasonabler to me, with the same
>> effect. And it's older (2.6.38).
>>
>>> But who knows - people do all sorts of
>>> inventive things.
>>
>> Yeah :( but we can hope they don't even know that the prctl currently
>> behaves they way it does - man page doesn't suggest it would, and most
>> of us in this thread found it surprising.
> 
> Well.  There might be such people and sometimes we do make people
> unhappy.  it partly depends on how traumatic it would be to leave the
> current behaviour as-is.  Have you evaluated such a patch?

You mean introducing a new prctl instead of changing the existing one? I
can evaluate that as being ugly :)
Well, maybe we could use arg3, because currently we have:
        case PR_SET_THP_DISABLE:
                if (arg3 || arg4 || arg5)
                        return -EINVAL;

We could make non-zero arg3 (or specific value of arg3) set the new
"immediate" behavior. This would also take care of the discovery of
kernels that support the fixed/altered behavior, without having to check
uname etc - just check if we got -EINVAL.

I'm just not sure how to implement PR_GET_THP_DISABLE properly in such
scenario. Or what happens when somebody calls SET with arg3==0 and then
arg3==1 (or vice versa). But we would have to think about it even when
we introduced a newly named option. Reminds me of the MLOCK_ONFAULT
discussions...

>>>> And as usual, if
>>>> somebody does complain in the end, we revert and try the other way?
>>>
>>> But by then it's too late - the new behaviour will be out in the field.
>>
>> Revert in stable then?
>> But I don't think this patch should go to stable. I understand right
>> that CRIU will switch to the UFFDIO_COPY approach and doesn't need the
>> prctl change/new madvise anymore?
> 
> What I mean is that the new behaviour will go out in 4.12 and it may
> be many months before we find out that we broke someone.  By then, we
> can't go back because others may be assuming the new behaviour.
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2017-06-05  9:27 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-02 15:03 [PATCH] mm: make PR_SET_THP_DISABLE immediately active Mike Rapoport
     [not found] ` <1496415802-30944-1-git-send-email-rppt-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2017-06-02 19:50   ` Andrew Morton
2017-06-02 20:31     ` Vlastimil Babka
     [not found]       ` <8a810c81-6a72-2af0-a450-6f03c71d8cca-AlSwsSmVLrQ@public.gmane.org>
2017-06-02 20:40         ` Andrew Morton
     [not found]           ` <20170602134038.13728cb77678ae1a7d7128a4-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2017-06-02 20:55             ` Vlastimil Babka
2017-06-02 21:10               ` Andrew Morton
2017-06-05  9:27                 ` Vlastimil Babka
     [not found]               ` <f9e8a159-7a25-6813-f909-11c4ae58adf3-AlSwsSmVLrQ@public.gmane.org>
2017-06-03 10:34                 ` Mike Rapoprt
     [not found]                   ` <CAAB5A6A-D7A1-4C06-9A07-D7EF56278EE5-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2017-06-05  7:05                     ` Mike Rapoport
2017-06-03  7:40             ` Michal Hocko
2017-06-03 10:40     ` Mike Rapoprt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).