All of lore.kernel.org
 help / color / mirror / Atom feed
* + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
@ 2014-02-25 23:53 akpm
       [not found] ` <530D9F50.1080400@de.ibm.com>
  0 siblings, 1 reply; 19+ messages in thread
From: akpm @ 2014-02-25 23:53 UTC (permalink / raw)
  To: mm-commits, viro, schwidefsky, rientjes, riel, peterz, pbonzini,
	oleg, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes,
	gerald.schaefer, ebiederm, borntraeger, aarcange, athorlton

Subject: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
To: athorlton@sgi.com,aarcange@redhat.com,borntraeger@de.ibm.com,ebiederm@xmission.com,gerald.schaefer@de.ibm.com,hannes@cmpxchg.org,heiko.carstens@de.ibm.com,kirill.shutemov@linux.intel.com,mgorman@suse.de,mingo@kernel.org,oleg@redhat.com,pbonzini@redhat.com,peterz@infradead.org,riel@redhat.com,rientjes@google.com,schwidefsky@de.ibm.com,viro@zeniv.linux.org.uk
From: akpm@linux-foundation.org
Date: Tue, 25 Feb 2014 15:53:13 -0800


The patch titled
     Subject: mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags"
has been added to the -mm tree.  Its filename is
     mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Alex Thorlton <athorlton@sgi.com>
Subject: mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags"

The main motivation behind this patch is to provide a way to disable THP
for jobs where the code cannot be modified, and using a malloc hook with
madvise is not an option (i.e.  statically allocated data).  This patch
allows us to do just that, without affecting other jobs running on the
system.

We need to do this sort of thing for jobs where THP hurts performance, due
to the possibility of increased remote memory accesses that can be created
by situations such as the following:

When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will be
handed out, and the THP will be stuck on whatever node the chunk was
originally referenced from.  If many remote nodes need to do work on that
same chunk, they'll be making remote accesses.

With THP disabled, 4K pages can be handed out to separate nodes as
they're needed, greatly reducing the amount of remote accesses to
memory.

This patch is based on some of my work combined with some
suggestions/patches given by Oleg Nesterov.  The main goal here is to add
a prctl switch to allow us to disable to THP on a per mm_struct basis.


Here's a bit of test data with the new patch in place...

First with the flag unset:

# perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g                  
Setting thp_disabled for this task...
thp_disable: 0
Set thp_disabled state to 0
Process pid = 18027

                                                                                                                     PF/
                                MAX        MIN                                  TOTCPU/      TOT_PF/   TOT_PF/     WSEC/
TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU       CPU     WALL_SEC   SYS_SEC       CPU   NODES
 512      1.120      0.060      0.000    0.110      0.110     0.000    28571428864 -9223372036854775808  55803572      23

 Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

  273719072.841402 task-clock                #  641.026 CPUs utilized           [100.00%]
         1,008,986 context-switches          #    0.000 M/sec                   [100.00%]
             7,717 CPU-migrations            #    0.000 M/sec                   [100.00%]
         1,698,932 page-faults               #    0.000 M/sec
355,222,544,890,379 cycles                    #    1.298 GHz                     [100.00%]
536,445,412,234,588 stalled-cycles-frontend   #  151.02% frontend cycles idle    [100.00%]
409,110,531,310,223 stalled-cycles-backend    #  115.17% backend  cycles idle    [100.00%]
148,286,797,266,411 instructions              #    0.42  insns per cycle
                                             #    3.62  stalled cycles per insn [100.00%]
27,061,793,159,503 branches                  #   98.867 M/sec                   [100.00%]
     1,188,655,196 branch-misses             #    0.00% of all branches

     427.001706337 seconds time elapsed

Now with the flag set:

# perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
Setting thp_disabled for this task...
thp_disable: 1
Set thp_disabled state to 1
Process pid = 144957

                                                                                                                     PF/
                                MAX        MIN                                  TOTCPU/      TOT_PF/   TOT_PF/     WSEC/
TYPE:               CPUS       WALL       WALL        SYS     USER     TOTCPU       CPU     WALL_SEC   SYS_SEC       CPU   NODES
 512      0.620      0.260      0.250    0.320      0.570     0.001    51612901376 128000000000 100806448      23

 Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

  138789390.540183 task-clock                #  641.959 CPUs utilized           [100.00%]
           534,205 context-switches          #    0.000 M/sec                   [100.00%]
             4,595 CPU-migrations            #    0.000 M/sec                   [100.00%]
        63,133,119 page-faults               #    0.000 M/sec
147,977,747,269,768 cycles                    #    1.066 GHz                     [100.00%]
200,524,196,493,108 stalled-cycles-frontend   #  135.51% frontend cycles idle    [100.00%]
105,175,163,716,388 stalled-cycles-backend    #   71.07% backend  cycles idle    [100.00%]
180,916,213,503,160 instructions              #    1.22  insns per cycle
                                             #    1.11  stalled cycles per insn [100.00%]
26,999,511,005,868 branches                  #  194.536 M/sec                   [100.00%]
       714,066,351 branch-misses             #    0.00% of all branches

     216.196778807 seconds time elapsed

As with previous versions of the patch, We're getting about a 2x
performance increase here.  Here's a link to the test case I used, along
with the little wrapper to activate the flag:

http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz




This patch (of 3):

Revert 8e72033f2a489b6c98 and add in code to fix up any issues caused by
the revert.

The revert is necessary because hugepage_madvise would return -EINVAL when
VM_NOHUGEPAGE is set, which will break subsequent chunks of this patch
set.

Here's a snip of an e-mail from Gerald detailing the original purpose of
this code, and providing justification for the revert:

<snip>
The intent of 8e72033f2a48 was to guard against any future programming
errors that may result in an madvice(MADV_HUGEPAGE) on guest mappings,
which would crash the kernel.

Martin suggested adding the bit to arch/s390/mm/pgtable.c, if 8e72033f2a48
was to be reverted, because that check will also prevent a kernel crash
in the case described above, it will now send a SIGSEGV instead.

This would now also allow to do the madvise on other parts, if needed,
so it is a more flexible approach. One could also say that it would have
been better to do it this way right from the beginning...
</snip>

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/s390/mm/pgtable.c |    3 +++
 mm/huge_memory.c       |    4 ----
 2 files changed, 3 insertions(+), 4 deletions(-)

diff -puN arch/s390/mm/pgtable.c~mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags arch/s390/mm/pgtable.c
--- a/arch/s390/mm/pgtable.c~mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags
+++ a/arch/s390/mm/pgtable.c
@@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned
 	if (!pmd_present(*pmd) &&
 	    __pte_alloc(mm, vma, pmd, vmaddr))
 		return -ENOMEM;
+	/* large pmds cannot yet be handled */
+	if (pmd_large(*pmd))
+		return -EFAULT;
 	/* pmd now points to a valid segment table entry. */
 	rmap = kmalloc(sizeof(*rmap), GFP_KERNEL|__GFP_REPEAT);
 	if (!rmap)
diff -puN mm/huge_memory.c~mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags mm/huge_memory.c
--- a/mm/huge_memory.c~mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags
+++ a/mm/huge_memory.c
@@ -1891,8 +1891,6 @@ out:
 int hugepage_madvise(struct vm_area_struct *vma,
 		     unsigned long *vm_flags, int advice)
 {
-	struct mm_struct *mm = vma->vm_mm;

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
       [not found] ` <530D9F50.1080400@de.ibm.com>
@ 2014-02-26 14:50   ` Oleg Nesterov
  2014-02-26 15:06     ` Christian Borntraeger
  0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2014-02-26 14:50 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: akpm, linux-kernel, viro, schwidefsky, rientjes, riel, peterz,
	pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens,
	hannes, gerald.schaefer, ebiederm, aarcange, athorlton

On 02/26, Christian Borntraeger wrote:
>
> On 26/02/14 00:53, akpm@linux-foundation.org wrote:
> > Subject: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
> > To: athorlton@sgi.com,aarcange@redhat.com,borntraeger@de.ibm.com,ebiederm@xmission.com,gerald.schaefer@de.ibm.com,hannes@cmpxchg.org,heiko.carstens@de.ibm.com,kirill.shutemov@linux.intel.com,mgorman@suse.de,mingo@kernel.org,oleg@redhat.com,pbonzini@redhat.com,peterz@infradead.org,riel@redhat.com,rientjes@google.com,schwidefsky@de.ibm.com,viro@zeniv.linux.org.uk
> > From: akpm@linux-foundation.org
> > Date: Tue, 25 Feb 2014 15:53:13 -0800
> >
> >
> > The patch titled
> >      Subject: mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags"
> > has been added to the -mm tree.  Its filename is
> >      mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch
> >
> > This patch should soon appear at
> >     http://ozlabs.org/~akpm/mmots/broken-out/mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch
> > and later at
> >     http://ozlabs.org/~akpm/mmotm/broken-out/mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch
>
>
> NAK.
>
> Since 2012 qemu does call "qemu_madvise(new_block->host, size, QEMU_MADV_HUGEPAGE);" for all kvm pages.
> (commit ad0b5321f1f797274603ebbe20108b0750baee94 Call MADV_HUGEPAGE for guest RAM allocations) so this
> breaks any recent kvm guest on s390.

Well, I can't really discuss the changes in arch/s390.

But perhaps qemu can be changed to avoid MADV_HUGEPAGE on s390 ?
Otherwise I'd suggest the change below.

Oleg.


--- x/mm/huge_memory.c
+++ x/mm/huge_memory.c
@@ -1968,8 +1968,6 @@ out:
 int hugepage_madvise(struct vm_area_struct *vma,
 		     unsigned long *vm_flags, int advice)
 {
-	struct mm_struct *mm = vma->vm_mm;
-
 	switch (advice) {
 	case MADV_HUGEPAGE:
 		/*
@@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_stru
 		 */
 		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
 			return -EINVAL;
-		if (mm->def_flags & VM_NOHUGEPAGE)
+
+/*
+ * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
+ * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
+ */
+#ifdef CONFIG_S390
+		if (vma->vm_mm->def_flags & VM_NOHUGEPAGE)
 			return -EINVAL;
+#endif
+
 		*vm_flags &= ~VM_NOHUGEPAGE;
 		*vm_flags |= VM_HUGEPAGE;
 		/*


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-26 14:50   ` Oleg Nesterov
@ 2014-02-26 15:06     ` Christian Borntraeger
  2014-02-26 15:22       ` Kirill A. Shutemov
  2014-02-26 15:31       ` Oleg Nesterov
  0 siblings, 2 replies; 19+ messages in thread
From: Christian Borntraeger @ 2014-02-26 15:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: akpm, linux-kernel, viro, schwidefsky, rientjes, riel, peterz,
	pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens,
	hannes, gerald.schaefer, ebiederm, aarcange, athorlton

On 26/02/14 15:50, Oleg Nesterov wrote:
[...]

>> NAK.
>>
>> Since 2012 qemu does call "qemu_madvise(new_block->host, size, QEMU_MADV_HUGEPAGE);" for all kvm pages.
>> (commit ad0b5321f1f797274603ebbe20108b0750baee94 Call MADV_HUGEPAGE for guest RAM allocations) so this
>> breaks any recent kvm guest on s390.
> 
> Well, I can't really discuss the changes in arch/s390.
> 
> But perhaps qemu can be changed to avoid MADV_HUGEPAGE on s390 ?
> Otherwise I'd suggest the change below.
> 
> Oleg.
> 
> 
> --- x/mm/huge_memory.c
> +++ x/mm/huge_memory.c
> @@ -1968,8 +1968,6 @@ out:
>  int hugepage_madvise(struct vm_area_struct *vma,
>  		     unsigned long *vm_flags, int advice)
>  {
> -	struct mm_struct *mm = vma->vm_mm;
> -
>  	switch (advice) {
>  	case MADV_HUGEPAGE:
>  		/*
> @@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_stru
>  		 */
>  		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
>  			return -EINVAL;
> -		if (mm->def_flags & VM_NOHUGEPAGE)
> +
> +/*
> + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> + */
> +#ifdef CONFIG_S390
> +		if (vma->vm_mm->def_flags & VM_NOHUGEPAGE)
>  			return -EINVAL;
> +#endif
> +

Ifdefs are ugly but might be the only way of not breaking existing userspace.
If we come up with a solution for THP in KVM host processes on s390, we can
then remove that wart. We could even limit that hack to KVM only processes
to retain Alex' prctl capability by checking mm_has_pgste (defined in
arch/s390/include/asm/pgtable.h should be included via linux/mm.h)

> +
> +/*
> + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> + */
> +#ifdef CONFIG_S390
> +		if ((vma->vm_mm->def_flags & VM_NOHUGEPAGE) && mm_has_pgste(vma->vm_mm))
>  			return -EINVAL;
> +#endif
> +



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-26 15:06     ` Christian Borntraeger
@ 2014-02-26 15:22       ` Kirill A. Shutemov
  2014-02-26 15:31       ` Oleg Nesterov
  1 sibling, 0 replies; 19+ messages in thread
From: Kirill A. Shutemov @ 2014-02-26 15:22 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Oleg Nesterov, akpm, linux-kernel, viro, schwidefsky, rientjes,
	riel, peterz, pbonzini, mingo, mgorman, kirill.shutemov,
	heiko.carstens, hannes, gerald.schaefer, ebiederm, aarcange,
	athorlton

Christian Borntraeger wrote:
> On 26/02/14 15:50, Oleg Nesterov wrote:
> [...]
> 
> >> NAK.
> >>
> >> Since 2012 qemu does call "qemu_madvise(new_block->host, size, QEMU_MADV_HUGEPAGE);" for all kvm pages.
> >> (commit ad0b5321f1f797274603ebbe20108b0750baee94 Call MADV_HUGEPAGE for guest RAM allocations) so this
> >> breaks any recent kvm guest on s390.
> > 
> > Well, I can't really discuss the changes in arch/s390.
> > 
> > But perhaps qemu can be changed to avoid MADV_HUGEPAGE on s390 ?
> > Otherwise I'd suggest the change below.
> > 
> > Oleg.
> > 
> > 
> > --- x/mm/huge_memory.c
> > +++ x/mm/huge_memory.c
> > @@ -1968,8 +1968,6 @@ out:
> >  int hugepage_madvise(struct vm_area_struct *vma,
> >  		     unsigned long *vm_flags, int advice)
> >  {
> > -	struct mm_struct *mm = vma->vm_mm;
> > -
> >  	switch (advice) {
> >  	case MADV_HUGEPAGE:
> >  		/*
> > @@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_stru
> >  		 */
> >  		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> >  			return -EINVAL;
> > -		if (mm->def_flags & VM_NOHUGEPAGE)
> > +
> > +/*
> > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> > + */
> > +#ifdef CONFIG_S390
> > +		if (vma->vm_mm->def_flags & VM_NOHUGEPAGE)
> >  			return -EINVAL;
> > +#endif
> > +
> 
> Ifdefs are ugly

IS_ENABLED(CONFIG_S390) should help with this.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-26 15:06     ` Christian Borntraeger
  2014-02-26 15:22       ` Kirill A. Shutemov
@ 2014-02-26 15:31       ` Oleg Nesterov
  2014-02-26 16:55         ` Gerald Schaefer
  2014-02-26 16:57         ` Peter Zijlstra
  1 sibling, 2 replies; 19+ messages in thread
From: Oleg Nesterov @ 2014-02-26 15:31 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: akpm, linux-kernel, viro, schwidefsky, rientjes, riel, peterz,
	pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens,
	hannes, gerald.schaefer, ebiederm, aarcange, athorlton

On 02/26, Christian Borntraeger wrote:
>
> On 26/02/14 15:50, Oleg Nesterov wrote:
> >
> > But perhaps qemu can be changed to avoid MADV_HUGEPAGE on s390 ?
> > Otherwise I'd suggest the change below.
> >
> > Oleg.
> >
> >
> > --- x/mm/huge_memory.c
> > +++ x/mm/huge_memory.c
> > @@ -1968,8 +1968,6 @@ out:
> >  int hugepage_madvise(struct vm_area_struct *vma,
> >  		     unsigned long *vm_flags, int advice)
> >  {
> > -	struct mm_struct *mm = vma->vm_mm;
> > -
> >  	switch (advice) {
> >  	case MADV_HUGEPAGE:
> >  		/*
> > @@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_stru
> >  		 */
> >  		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> >  			return -EINVAL;
> > -		if (mm->def_flags & VM_NOHUGEPAGE)
> > +
> > +/*
> > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> > + */
> > +#ifdef CONFIG_S390
> > +		if (vma->vm_mm->def_flags & VM_NOHUGEPAGE)
> >  			return -EINVAL;
> > +#endif
> > +
>
> Ifdefs are ugly but might be the only way of not breaking existing userspace.

Yes, agreed. but note that this code is already s390-specific, nobody
else puts VM_NOHUGEPAGE into ->def_flags (until this series of course).

> If we come up with a solution for THP in KVM host processes on s390, we can
> then remove that wart. We could even limit that hack to KVM only processes
> to retain Alex' prctl capability by checking mm_has_pgste

Yes, yes, I looked at mm->context.has_pgste too ;) But I wasn't sure
we can rely on it. Thanks.

> > +/*
> > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> > + */
> > +#ifdef CONFIG_S390
> > +		if ((vma->vm_mm->def_flags & VM_NOHUGEPAGE) && mm_has_pgste(vma->vm_mm))
> >  			return -EINVAL;

Hmm... but why do we need to check VM_NOHUGEPAGE then? Can't the
change below work?

Oleg.

--- x/arch/s390/mm/pgtable.c
+++ x/arch/s390/mm/pgtable.c
@@ -1084,7 +1084,6 @@ static inline void thp_split_mm(struct m
 		vma->vm_flags &= ~VM_HUGEPAGE;
 		vma->vm_flags |= VM_NOHUGEPAGE;
 	}
-	mm->def_flags |= VM_NOHUGEPAGE;
 }
 #else
 static inline void thp_split_mm(struct mm_struct *mm)
--- x/mm/huge_memory.c
+++ x/mm/huge_memory.c
@@ -1968,8 +1968,6 @@ out:
 int hugepage_madvise(struct vm_area_struct *vma,
 		     unsigned long *vm_flags, int advice)
 {
-	struct mm_struct *mm = vma->vm_mm;
-
 	switch (advice) {
 	case MADV_HUGEPAGE:
 		/*
@@ -1977,8 +1975,12 @@ int hugepage_madvise(struct vm_area_stru
 		 */
 		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
 			return -EINVAL;
-		if (mm->def_flags & VM_NOHUGEPAGE)
+
+#ifdef CONFIG_S390
+		if (mm_has_pgste(vma->vm_mm))
 			return -EINVAL;
+#endif
+
 		*vm_flags &= ~VM_NOHUGEPAGE;
 		*vm_flags |= VM_HUGEPAGE;
 		/*


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-26 15:31       ` Oleg Nesterov
@ 2014-02-26 16:55         ` Gerald Schaefer
  2014-02-26 16:57         ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Gerald Schaefer @ 2014-02-26 16:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Borntraeger, akpm, linux-kernel, viro, schwidefsky,
	rientjes, riel, peterz, pbonzini, mingo, mgorman,
	kirill.shutemov, heiko.carstens, hannes, ebiederm, aarcange,
	athorlton

On Wed, 26 Feb 2014 16:31:44 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> On 02/26, Christian Borntraeger wrote:
> >
> > On 26/02/14 15:50, Oleg Nesterov wrote:
> > >
> > > But perhaps qemu can be changed to avoid MADV_HUGEPAGE on s390 ?
> > > Otherwise I'd suggest the change below.
> > >
> > > Oleg.
> > >
> > >
> > > --- x/mm/huge_memory.c
> > > +++ x/mm/huge_memory.c
> > > @@ -1968,8 +1968,6 @@ out:
> > >  int hugepage_madvise(struct vm_area_struct *vma,
> > >  		     unsigned long *vm_flags, int advice)
> > >  {
> > > -	struct mm_struct *mm = vma->vm_mm;
> > > -
> > >  	switch (advice) {
> > >  	case MADV_HUGEPAGE:
> > >  		/*
> > > @@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_stru
> > >  		 */
> > >  		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> > >  			return -EINVAL;
> > > -		if (mm->def_flags & VM_NOHUGEPAGE)
> > > +
> > > +/*
> > > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> > > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> > > + */
> > > +#ifdef CONFIG_S390
> > > +		if (vma->vm_mm->def_flags & VM_NOHUGEPAGE)
> > >  			return -EINVAL;
> > > +#endif
> > > +
> >
> > Ifdefs are ugly but might be the only way of not breaking existing userspace.
> 
> Yes, agreed. but note that this code is already s390-specific, nobody
> else puts VM_NOHUGEPAGE into ->def_flags (until this series of course).
> 
> > If we come up with a solution for THP in KVM host processes on s390, we can
> > then remove that wart. We could even limit that hack to KVM only processes
> > to retain Alex' prctl capability by checking mm_has_pgste
> 
> Yes, yes, I looked at mm->context.has_pgste too ;) But I wasn't sure
> we can rely on it. Thanks.
> 
> > > +/*
> > > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> > > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> > > + */
> > > +#ifdef CONFIG_S390
> > > +		if ((vma->vm_mm->def_flags & VM_NOHUGEPAGE) && mm_has_pgste(vma->vm_mm))
> > >  			return -EINVAL;
> 
> Hmm... but why do we need to check VM_NOHUGEPAGE then? Can't the
> change below work?

Yes, good point, that should do the trick.

> 
> Oleg.
> 
> --- x/arch/s390/mm/pgtable.c
> +++ x/arch/s390/mm/pgtable.c
> @@ -1084,7 +1084,6 @@ static inline void thp_split_mm(struct m
>  		vma->vm_flags &= ~VM_HUGEPAGE;
>  		vma->vm_flags |= VM_NOHUGEPAGE;
>  	}
> -	mm->def_flags |= VM_NOHUGEPAGE;
>  }
>  #else
>  static inline void thp_split_mm(struct mm_struct *mm)
> --- x/mm/huge_memory.c
> +++ x/mm/huge_memory.c
> @@ -1968,8 +1968,6 @@ out:
>  int hugepage_madvise(struct vm_area_struct *vma,
>  		     unsigned long *vm_flags, int advice)
>  {
> -	struct mm_struct *mm = vma->vm_mm;
> -
>  	switch (advice) {
>  	case MADV_HUGEPAGE:
>  		/*
> @@ -1977,8 +1975,12 @@ int hugepage_madvise(struct vm_area_stru
>  		 */
>  		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
>  			return -EINVAL;
> -		if (mm->def_flags & VM_NOHUGEPAGE)
> +
> +#ifdef CONFIG_S390
> +		if (mm_has_pgste(vma->vm_mm))
>  			return -EINVAL;
> +#endif
> +
>  		*vm_flags &= ~VM_NOHUGEPAGE;
>  		*vm_flags |= VM_HUGEPAGE;
>  		/*
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-26 15:31       ` Oleg Nesterov
  2014-02-26 16:55         ` Gerald Schaefer
@ 2014-02-26 16:57         ` Peter Zijlstra
  2014-02-26 17:22           ` Alex Thorlton
  2014-02-26 18:08           ` Oleg Nesterov
  1 sibling, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-02-26 16:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Borntraeger, akpm, linux-kernel, viro, schwidefsky,
	rientjes, riel, pbonzini, mingo, mgorman, kirill.shutemov,
	heiko.carstens, hannes, gerald.schaefer, ebiederm, aarcange,
	athorlton

On Wed, Feb 26, 2014 at 04:31:44PM +0100, Oleg Nesterov wrote:
> --- x/arch/s390/mm/pgtable.c
> +++ x/arch/s390/mm/pgtable.c
> @@ -1084,7 +1084,6 @@ static inline void thp_split_mm(struct m
>  		vma->vm_flags &= ~VM_HUGEPAGE;
>  		vma->vm_flags |= VM_NOHUGEPAGE;
>  	}
> -	mm->def_flags |= VM_NOHUGEPAGE;
>  }
>  #else
>  static inline void thp_split_mm(struct mm_struct *mm)
> --- x/mm/huge_memory.c
> +++ x/mm/huge_memory.c
> @@ -1968,8 +1968,6 @@ out:
>  int hugepage_madvise(struct vm_area_struct *vma,
>  		     unsigned long *vm_flags, int advice)
>  {
> -	struct mm_struct *mm = vma->vm_mm;
> -
>  	switch (advice) {
>  	case MADV_HUGEPAGE:
>  		/*
> @@ -1977,8 +1975,12 @@ int hugepage_madvise(struct vm_area_stru
>  		 */
>  		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
>  			return -EINVAL;
> -		if (mm->def_flags & VM_NOHUGEPAGE)
> +
> +#ifdef CONFIG_S390

Do we want a comment here, explaining why s390 is special again?

> +		if (mm_has_pgste(vma->vm_mm))
>  			return -EINVAL;
> +#endif
> +
>  		*vm_flags &= ~VM_NOHUGEPAGE;
>  		*vm_flags |= VM_HUGEPAGE;
>  		/*
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-26 16:57         ` Peter Zijlstra
@ 2014-02-26 17:22           ` Alex Thorlton
  2014-02-26 18:06             ` Oleg Nesterov
  2014-02-26 18:08           ` Oleg Nesterov
  1 sibling, 1 reply; 19+ messages in thread
From: Alex Thorlton @ 2014-02-26 17:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Christian Borntraeger, akpm, linux-kernel, viro,
	schwidefsky, rientjes, riel, pbonzini, mingo, mgorman,
	kirill.shutemov, heiko.carstens, hannes, gerald.schaefer,
	ebiederm, aarcange

On Wed, Feb 26, 2014 at 05:57:59PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 26, 2014 at 04:31:44PM +0100, Oleg Nesterov wrote:
> Do we want a comment here, explaining why s390 is special again?

Here's what I've got, with everybody's suggestions spun together:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf..7f01491 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1968,8 +1968,6 @@ out:
 int hugepage_madvise(struct vm_area_struct *vma,
                     unsigned long *vm_flags, int advice)
 {
-       struct mm_struct *mm = vma->vm_mm;
-
        switch (advice) {
        case MADV_HUGEPAGE:
                /*
@@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_struct *vma,
                 */
                if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
                        return -EINVAL;
-               if (mm->def_flags & VM_NOHUGEPAGE)
+
+/*
+ * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
+ * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
+ */
+#ifdef CONFIG_S390
+               if (mm_has_pgste(vma->vm_mm))
                        return -EINVAL;
+#endif
+
                *vm_flags &= ~VM_NOHUGEPAGE;
                *vm_flags |= VM_HUGEPAGE;
                /*

I've compiled and tested and it works ok on my machines (I don't have an
s390 to test on though :).  Is everybody okay with this solution?

BTW, Kirill, I looked at using IS_ENABLED to clean up the ifdef, but it
won't work here, since mm_has_pgste isn't defined unless CONFIG_S390
is defined.  i.e.: This line:

if (IS_ENABLED(CONFIG_S390) && mm_has_pgste(vma->vm_mm))

won't compile.

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-26 17:22           ` Alex Thorlton
@ 2014-02-26 18:06             ` Oleg Nesterov
  2014-02-26 19:05               ` Gerald Schaefer
                                 ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Oleg Nesterov @ 2014-02-26 18:06 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: Peter Zijlstra, Christian Borntraeger, akpm, linux-kernel, viro,
	schwidefsky, rientjes, riel, pbonzini, mingo, mgorman,
	kirill.shutemov, heiko.carstens, hannes, gerald.schaefer,
	ebiederm, aarcange

On 02/26, Alex Thorlton wrote:
>
> + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> + */
> +#ifdef CONFIG_S390
> +               if (mm_has_pgste(vma->vm_mm))
>                         return -EINVAL;
> +#endif

The comment is not really right...

And personally I think that

	@@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
		if (!pmd_present(*pmd) &&
		    __pte_alloc(mm, vma, pmd, vmaddr))
			return -ENOMEM;
	+       /* large pmds cannot yet be handled */
	+       if (pmd_large(*pmd))
	+               return -EFAULT;

change still makes sense, so that we can simply revert this s390-
specific hack in hugepage_madvise().

I'd suggest the patch below on top of your changes, but I won't argue.

It would be nice to also change thp_split_mm() to not not play with
mm->def_flags, but I am not sure if we can do this.

Oleg.
---

Subject: [PATCH] s390: make sure MADV_HUGEPAGE fails after s390_enable_sie()

As Christian pointed out, the recent 'Revert "thp: make MADV_HUGEPAGE
check for mm->def_flags"' breaks qemu, it does QEMU_MADV_HUGEPAGE for
all kvm pages but this doesn't work after s390_enable_sie/thp_split_mm.

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a4310a5..0e08d92 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1970,11 +1970,22 @@ int hugepage_madvise(struct vm_area_struct *vma,
 {
 	switch (advice) {
 	case MADV_HUGEPAGE:
+#ifdef CONFIG_S390
+		/*
+		 * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu
+		 * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages
+		 * and expects it must fail on s390. Avoid a possible SIGSEGV
+		 * until qemu is changed.
+		 */
+		if (mm_has_pgste(vma->vm_mm))
+			return -EINVAL;
+#endif
 		/*
 		 * Be somewhat over-protective like KSM for now!
 		 */
 		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
 			return -EINVAL;
+
 		*vm_flags &= ~VM_NOHUGEPAGE;
 		*vm_flags |= VM_HUGEPAGE;
 		/*


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-26 16:57         ` Peter Zijlstra
  2014-02-26 17:22           ` Alex Thorlton
@ 2014-02-26 18:08           ` Oleg Nesterov
  1 sibling, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2014-02-26 18:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christian Borntraeger, akpm, linux-kernel, viro, schwidefsky,
	rientjes, riel, pbonzini, mingo, mgorman, kirill.shutemov,
	heiko.carstens, hannes, gerald.schaefer, ebiederm, aarcange,
	athorlton

On 02/26, Peter Zijlstra wrote:
>
> On Wed, Feb 26, 2014 at 04:31:44PM +0100, Oleg Nesterov wrote:
> >  		/*
> > @@ -1977,8 +1975,12 @@ int hugepage_madvise(struct vm_area_stru
> >  		 */
> >  		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> >  			return -EINVAL;
> > -		if (mm->def_flags & VM_NOHUGEPAGE)
> > +
> > +#ifdef CONFIG_S390
>
> Do we want a comment here, explaining why s390 is special again?

Yes, yes, sure. Especially because this is (hopefully) the temporary
hack.

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-26 18:06             ` Oleg Nesterov
@ 2014-02-26 19:05               ` Gerald Schaefer
  2014-02-27 16:45                 ` Oleg Nesterov
  2014-02-26 19:27               ` Christian Borntraeger
  2014-02-26 20:41               ` Paolo Bonzini
  2 siblings, 1 reply; 19+ messages in thread
From: Gerald Schaefer @ 2014-02-26 19:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alex Thorlton, Peter Zijlstra, Christian Borntraeger, akpm,
	linux-kernel, viro, schwidefsky, rientjes, riel, pbonzini, mingo,
	mgorman, kirill.shutemov, heiko.carstens, hannes, ebiederm,
	aarcange

On Wed, 26 Feb 2014 19:06:03 +0100
Oleg Nesterov <oleg@redhat.com> wrote:

> On 02/26, Alex Thorlton wrote:
> >
> > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> > + */
> > +#ifdef CONFIG_S390
> > +               if (mm_has_pgste(vma->vm_mm))
> >                         return -EINVAL;
> > +#endif
> 
> The comment is not really right...
> 
> And personally I think that
> 
> 	@@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
> 		if (!pmd_present(*pmd) &&
> 		    __pte_alloc(mm, vma, pmd, vmaddr))
> 			return -ENOMEM;
> 	+       /* large pmds cannot yet be handled */
> 	+       if (pmd_large(*pmd))
> 	+               return -EFAULT;
> 
> change still makes sense, so that we can simply revert this s390-
> specific hack in hugepage_madvise().

Yes, agreed.

> 
> I'd suggest the patch below on top of your changes, but I won't argue.
> 
> It would be nice to also change thp_split_mm() to not not play with
> mm->def_flags, but I am not sure if we can do this.

Hmm, I'm also wondering about this. Basically, we only need VM_NOHUGEPAGE
in vma->vm_flags, which is done for all existing vmas in thp_split_mm().
But if there should be new vmas created afterwards, it would still be
necessary to also have VM_NOHUGEPAGE in mm->def_flags, because the
vm_flags for new vmas will be set via OR of mm->def_flags, e.g. in
do_brk() and do_mmap_pgoff().

I guess the question is if new vmas can be created for the qemu/kvm host
process?

Anyway, this would then have to be a separate patch, to keep the
"revertability" of this hack.

> 
> Oleg.
> ---
> 
> Subject: [PATCH] s390: make sure MADV_HUGEPAGE fails after s390_enable_sie()
> 
> As Christian pointed out, the recent 'Revert "thp: make MADV_HUGEPAGE
> check for mm->def_flags"' breaks qemu, it does QEMU_MADV_HUGEPAGE for
> all kvm pages but this doesn't work after s390_enable_sie/thp_split_mm.
> 
> Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a4310a5..0e08d92 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1970,11 +1970,22 @@ int hugepage_madvise(struct vm_area_struct *vma,
>  {
>  	switch (advice) {
>  	case MADV_HUGEPAGE:
> +#ifdef CONFIG_S390
> +		/*
> +		 * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu
> +		 * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages
> +		 * and expects it must fail on s390. Avoid a possible SIGSEGV
> +		 * until qemu is changed.
> +		 */
> +		if (mm_has_pgste(vma->vm_mm))
> +			return -EINVAL;
> +#endif
>  		/*
>  		 * Be somewhat over-protective like KSM for now!
>  		 */
>  		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
>  			return -EINVAL;
> +
>  		*vm_flags &= ~VM_NOHUGEPAGE;
>  		*vm_flags |= VM_HUGEPAGE;
>  		/*
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-26 18:06             ` Oleg Nesterov
  2014-02-26 19:05               ` Gerald Schaefer
@ 2014-02-26 19:27               ` Christian Borntraeger
  2014-02-26 19:39                 ` Alex Thorlton
  2014-02-26 20:41               ` Paolo Bonzini
  2 siblings, 1 reply; 19+ messages in thread
From: Christian Borntraeger @ 2014-02-26 19:27 UTC (permalink / raw)
  To: Oleg Nesterov, Alex Thorlton
  Cc: Peter Zijlstra, akpm, linux-kernel, viro, schwidefsky, rientjes,
	riel, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens,
	hannes, gerald.schaefer, ebiederm, aarcange

On 26/02/14 19:06, Oleg Nesterov wrote:
> On 02/26, Alex Thorlton wrote:
>>
>> + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
>> + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
>> + */
>> +#ifdef CONFIG_S390
>> +               if (mm_has_pgste(vma->vm_mm))
>>                         return -EINVAL;
>> +#endif
> 
> The comment is not really right...
> 
> And personally I think that
> 
> 	@@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
> 		if (!pmd_present(*pmd) &&
> 		    __pte_alloc(mm, vma, pmd, vmaddr))
> 			return -ENOMEM;
> 	+       /* large pmds cannot yet be handled */
> 	+       if (pmd_large(*pmd))
> 	+               return -EFAULT;
> 
> change still makes sense, so that we can simply revert this s390-
> specific hack in hugepage_madvise().

Yes, it still makes sense to cover existing THPs here.


> I'd suggest the patch below on top of your changes, but I won't argue.
> 
> It would be nice to also change thp_split_mm() to not not play with
> mm->def_flags, but I am not sure if we can do this.
> 
> Oleg.
> ---
> 
> Subject: [PATCH] s390: make sure MADV_HUGEPAGE fails after s390_enable_sie()
> 
> As Christian pointed out, the recent 'Revert "thp: make MADV_HUGEPAGE
> check for mm->def_flags"' breaks qemu, it does QEMU_MADV_HUGEPAGE for
> all kvm pages but this doesn't work after s390_enable_sie/thp_split_mm.
> 
> Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a4310a5..0e08d92 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1970,11 +1970,22 @@ int hugepage_madvise(struct vm_area_struct *vma,
>  {
>  	switch (advice) {
>  	case MADV_HUGEPAGE:
> +#ifdef CONFIG_S390
> +		/*
> +		 * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu
> +		 * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages
> +		 * and expects it must fail on s390. Avoid a possible SIGSEGV
> +		 * until qemu is changed.

I prefer:
		 * until kvm/s390 can handle large pages in the host.

Otherwise qemu has to be changed again, if we get THP working for kvm.

> +		 */
> +		if (mm_has_pgste(vma->vm_mm))
> +			return -EINVAL;
> +#endif
>  		/*
>  		 * Be somewhat over-protective like KSM for now!
>  		 */
>  		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
>  			return -EINVAL;
> +

Unrelated white space?
>  		*vm_flags &= ~VM_NOHUGEPAGE;
>  		*vm_flags |= VM_HUGEPAGE;
>  		/*
> 



With the comment and white space change:

Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>

Thanks for the quick patch


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-26 19:27               ` Christian Borntraeger
@ 2014-02-26 19:39                 ` Alex Thorlton
  2014-02-26 23:24                   ` Andrew Morton
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Thorlton @ 2014-02-26 19:39 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Oleg Nesterov, Peter Zijlstra, akpm, linux-kernel, viro,
	schwidefsky, rientjes, riel, pbonzini, mingo, mgorman,
	kirill.shutemov, heiko.carstens, hannes, gerald.schaefer,
	ebiederm, aarcange

On Wed, Feb 26, 2014 at 08:27:36PM +0100, Christian Borntraeger wrote:
> On 26/02/14 19:06, Oleg Nesterov wrote:
> > On 02/26, Alex Thorlton wrote:
> >>
> >> + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because
> >> + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie().
> >> + */
> >> +#ifdef CONFIG_S390
> >> +               if (mm_has_pgste(vma->vm_mm))
> >>                         return -EINVAL;
> >> +#endif
> > 
> > The comment is not really right...
> > 
> > And personally I think that
> > 
> > 	@@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment,
> > 		if (!pmd_present(*pmd) &&
> > 		    __pte_alloc(mm, vma, pmd, vmaddr))
> > 			return -ENOMEM;
> > 	+       /* large pmds cannot yet be handled */
> > 	+       if (pmd_large(*pmd))
> > 	+               return -EFAULT;
> > 
> > change still makes sense, so that we can simply revert this s390-
> > specific hack in hugepage_madvise().
> 
> Yes, it still makes sense to cover existing THPs here.

Yes, it does.  I just snipped the chunk of the original patch that I
actually changed in my last e-mail.  I didn't intend to actually remove
the above portion in the final patch - sorry for the confusion!

> 
> 
> > I'd suggest the patch below on top of your changes, but I won't argue.

I like this approach, and your updated comment as well.  This should
probably go in as [PATCH 2/4] in my series.  Do I need to spin a v5
with this new patch included in the series, or will Andrew pull this in?

> > 
> > It would be nice to also change thp_split_mm() to not not play with
> > mm->def_flags, but I am not sure if we can do this.
> > 
> > Oleg.
> > ---
> > 
> > Subject: [PATCH] s390: make sure MADV_HUGEPAGE fails after s390_enable_sie()
> > 
> > As Christian pointed out, the recent 'Revert "thp: make MADV_HUGEPAGE
> > check for mm->def_flags"' breaks qemu, it does QEMU_MADV_HUGEPAGE for
> > all kvm pages but this doesn't work after s390_enable_sie/thp_split_mm.
> > 
> > Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
> > Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index a4310a5..0e08d92 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1970,11 +1970,22 @@ int hugepage_madvise(struct vm_area_struct *vma,
> >  {
> >  	switch (advice) {
> >  	case MADV_HUGEPAGE:
> > +#ifdef CONFIG_S390
> > +		/*
> > +		 * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu
> > +		 * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages
> > +		 * and expects it must fail on s390. Avoid a possible SIGSEGV
> > +		 * until qemu is changed.
> 
> I prefer:
> 		 * until kvm/s390 can handle large pages in the host.
> 
> Otherwise qemu has to be changed again, if we get THP working for kvm.
> 
> > +		 */
> > +		if (mm_has_pgste(vma->vm_mm))
> > +			return -EINVAL;
> > +#endif
> >  		/*
> >  		 * Be somewhat over-protective like KSM for now!
> >  		 */
> >  		if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> >  			return -EINVAL;
> > +
> 
> Unrelated white space?
> >  		*vm_flags &= ~VM_NOHUGEPAGE;
> >  		*vm_flags |= VM_HUGEPAGE;
> >  		/*
> > 
> 
> 
> 
> With the comment and white space change:
> 
> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
> 
> Thanks for the quick patch

Yes, thank you all for the quick responses, and for looking over these
patches!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-26 18:06             ` Oleg Nesterov
  2014-02-26 19:05               ` Gerald Schaefer
  2014-02-26 19:27               ` Christian Borntraeger
@ 2014-02-26 20:41               ` Paolo Bonzini
  2014-02-27 16:34                 ` Oleg Nesterov
  2 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2014-02-26 20:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alex Thorlton, Peter Zijlstra, Christian Borntraeger, akpm,
	linux-kernel, viro, schwidefsky, rientjes, riel, mingo, mgorman,
	kirill shutemov, heiko carstens, hannes, gerald schaefer,
	ebiederm, aarcange


> +#ifdef CONFIG_S390
> +		/*
> +		 * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu
> +		 * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages
> +		 * and expects it must fail on s390. Avoid a possible SIGSEGV
> +		 * until qemu is changed.
> +		 */
> +		if (mm_has_pgste(vma->vm_mm))
> +			return -EINVAL;
> +#endif

The comment is not quite true.  QEMU doesn't expect that the madvise fails.
It simply expects that the madvise doesn't cause SIGSEGVs or later
malfunctioning, because (quoting tha man page) madvise "does not influence
the semantics of the application".

There's nothing to fix in QEMU, in fact I believe this should return 0
rather than -EINVAL.  It's perfectly legal for the kernel to ignore an madvise,
and this is what should happen here.

Paolo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-26 19:39                 ` Alex Thorlton
@ 2014-02-26 23:24                   ` Andrew Morton
  2014-02-27  0:01                     ` Alex Thorlton
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2014-02-26 23:24 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: Christian Borntraeger, Oleg Nesterov, Peter Zijlstra,
	linux-kernel, viro, schwidefsky, rientjes, riel, pbonzini, mingo,
	mgorman, kirill.shutemov, heiko.carstens, hannes,
	gerald.schaefer, ebiederm, aarcange

On Wed, 26 Feb 2014 13:39:17 -0600 Alex Thorlton <athorlton@sgi.com> wrote:

> > 
> > 
> > > I'd suggest the patch below on top of your changes, but I won't argue.
> 
> I like this approach, and your updated comment as well.  This should
> probably go in as [PATCH 2/4] in my series.  Do I need to spin a v5
> with this new patch included in the series, or will Andrew pull this in?

Paolo's comments on Oleg's patch remain unaddressed, so please take a
look at that and send out the final version?

An incremental patch would be preferred, please.  I'll later fold that
into mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch to
avoid any bisection holes.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-26 23:24                   ` Andrew Morton
@ 2014-02-27  0:01                     ` Alex Thorlton
  2014-02-27 17:26                       ` Alex Thorlton
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Thorlton @ 2014-02-27  0:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Borntraeger, Oleg Nesterov, Peter Zijlstra,
	linux-kernel, viro, schwidefsky, rientjes, riel, pbonzini, mingo,
	mgorman, kirill.shutemov, heiko.carstens, hannes,
	gerald.schaefer, ebiederm, aarcange

On Wed, Feb 26, 2014 at 03:24:34PM -0800, Andrew Morton wrote:
> On Wed, 26 Feb 2014 13:39:17 -0600 Alex Thorlton <athorlton@sgi.com> wrote:
> 
> > > 
> > > 
> > > > I'd suggest the patch below on top of your changes, but I won't argue.
> > 
> > I like this approach, and your updated comment as well.  This should
> > probably go in as [PATCH 2/4] in my series.  Do I need to spin a v5
> > with this new patch included in the series, or will Andrew pull this in?
> 
> Paolo's comments on Oleg's patch remain unaddressed, so please take a
> look at that and send out the final version?

Got it.  I'll get that to you tonight/tomorrow morning.

> An incremental patch would be preferred, please.  I'll later fold that
> into mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch to
> avoid any bisection holes.

Understood.  I'll keep them separate.

Thanks, Andrew!

- Alex

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-26 20:41               ` Paolo Bonzini
@ 2014-02-27 16:34                 ` Oleg Nesterov
  0 siblings, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2014-02-27 16:34 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Alex Thorlton, Peter Zijlstra, Christian Borntraeger, akpm,
	linux-kernel, viro, schwidefsky, rientjes, riel, mingo, mgorman,
	kirill shutemov, heiko carstens, hannes, gerald schaefer,
	ebiederm, aarcange

On 02/26, Paolo Bonzini wrote:
>
> > +#ifdef CONFIG_S390
> > +		/*
> > +		 * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu
> > +		 * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages
> > +		 * and expects it must fail on s390. Avoid a possible SIGSEGV
> > +		 * until qemu is changed.
> > +		 */
> > +		if (mm_has_pgste(vma->vm_mm))
> > +			return -EINVAL;
> > +#endif
>
> The comment is not quite true.  QEMU doesn't expect that the madvise fails.

Yes, sorry. I didn't mean "it expects -EINVAL".

> It simply expects that the madvise doesn't cause SIGSEGVs or later
> malfunctioning, because (quoting tha man page) madvise "does not influence
> the semantics of the application".

Yes, I understand. But currently this means "MADV_HUGEPAGE should not
actually work", this is what I tried to say.

> There's nothing to fix in QEMU,

I was going to argue, but this is probably true too.

In short: I agree with any comment ;)

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-26 19:05               ` Gerald Schaefer
@ 2014-02-27 16:45                 ` Oleg Nesterov
  0 siblings, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2014-02-27 16:45 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Alex Thorlton, Peter Zijlstra, Christian Borntraeger, akpm,
	linux-kernel, viro, schwidefsky, rientjes, riel, pbonzini, mingo,
	mgorman, kirill.shutemov, heiko.carstens, hannes, ebiederm,
	aarcange

On 02/26, Gerald Schaefer wrote:
>
> On Wed, 26 Feb 2014 19:06:03 +0100
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > It would be nice to also change thp_split_mm() to not not play with
> > mm->def_flags, but I am not sure if we can do this.
>
> Hmm, I'm also wondering about this. Basically, we only need VM_NOHUGEPAGE
> in vma->vm_flags, which is done for all existing vmas in thp_split_mm().
> But if there should be new vmas created afterwards, it would still be
> necessary to also have VM_NOHUGEPAGE in mm->def_flags, because the
> vm_flags for new vmas will be set via OR of mm->def_flags, e.g. in
> do_brk() and do_mmap_pgoff().

Yes, exactly, this was my concern.

And while I know nothing about s390, it seems to me that huge pages should
be forbidden for any vma if ->has_pgste was set.

> Anyway, this would then have to be a separate patch, to keep the
> "revertability" of this hack.

Agreed. Thanks!

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
  2014-02-27  0:01                     ` Alex Thorlton
@ 2014-02-27 17:26                       ` Alex Thorlton
  0 siblings, 0 replies; 19+ messages in thread
From: Alex Thorlton @ 2014-02-27 17:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Borntraeger, Oleg Nesterov, Peter Zijlstra,
	linux-kernel, viro, schwidefsky, rientjes, riel, pbonzini, mingo,
	mgorman, kirill.shutemov, heiko.carstens, hannes,
	gerald.schaefer, ebiederm, aarcange

On Wed, Feb 26, 2014 at 06:01:53PM -0600, Alex Thorlton wrote:
> On Wed, Feb 26, 2014 at 03:24:34PM -0800, Andrew Morton wrote:
> > On Wed, 26 Feb 2014 13:39:17 -0600 Alex Thorlton <athorlton@sgi.com> wrote:
> > 
> > > > 
> > > > 
> > > > > I'd suggest the patch below on top of your changes, but I won't argue.
> > > 
> > > I like this approach, and your updated comment as well.  This should
> > > probably go in as [PATCH 2/4] in my series.  Do I need to spin a v5
> > > with this new patch included in the series, or will Andrew pull this in?
> > 
> > Paolo's comments on Oleg's patch remain unaddressed, so please take a
> > look at that and send out the final version?
> 
> Got it.  I'll get that to you tonight/tomorrow morning.

Just kicked out another version of the patch that should cover all
comments/suggestions.  Let me know if you need anything else from me!

- Alex

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2014-02-27 17:26 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-25 23:53 + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree akpm
     [not found] ` <530D9F50.1080400@de.ibm.com>
2014-02-26 14:50   ` Oleg Nesterov
2014-02-26 15:06     ` Christian Borntraeger
2014-02-26 15:22       ` Kirill A. Shutemov
2014-02-26 15:31       ` Oleg Nesterov
2014-02-26 16:55         ` Gerald Schaefer
2014-02-26 16:57         ` Peter Zijlstra
2014-02-26 17:22           ` Alex Thorlton
2014-02-26 18:06             ` Oleg Nesterov
2014-02-26 19:05               ` Gerald Schaefer
2014-02-27 16:45                 ` Oleg Nesterov
2014-02-26 19:27               ` Christian Borntraeger
2014-02-26 19:39                 ` Alex Thorlton
2014-02-26 23:24                   ` Andrew Morton
2014-02-27  0:01                     ` Alex Thorlton
2014-02-27 17:26                       ` Alex Thorlton
2014-02-26 20:41               ` Paolo Bonzini
2014-02-27 16:34                 ` Oleg Nesterov
2014-02-26 18:08           ` Oleg Nesterov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.