All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
@ 2019-05-14 14:29     ` Mike Rapoport
  2019-05-16 16:25         ` Andrei Vagin
                         ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Mike Rapoport @ 2019-05-14 14:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, linux-kernel, Mike Rapoport

When get_user_pages*() is called with pages = NULL, the processing of
VM_FAULT_RETRY terminates early without actually retrying to fault-in all
the pages.

If the pages in the requested range belong to a VMA that has userfaultfd
registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
has populated the page, but for the gup pre-fault case there's no actual
retry and the caller will get no pages although they are present.

This issue was uncovered when running post-copy memory restore in CRIU
after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
copy_fpstate_to_sigframe() fails").

After this change, the copying of FPU state to the sigframe switched from
copy_to_user() variants which caused a real page fault to get_user_pages()
with pages parameter set to NULL.

In post-copy mode of CRIU, the destination memory is managed with
userfaultfd and lack of the retry for pre-fault case in get_user_pages()
causes a crash of the restored process.

Making the pre-fault behavior of get_user_pages() the same as the "normal"
one fixes the issue.

Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 mm/gup.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 91819b8..c32ae5a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -936,10 +936,6 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 			BUG_ON(ret >= nr_pages);
 		}
 
-		if (!pages)
-			/* If it's a prefault don't insist harder */
-			return ret;
-
 		if (ret > 0) {
 			nr_pages -= ret;
 			pages_done += ret;
@@ -955,8 +951,12 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 				pages_done = ret;
 			break;
 		}
-		/* VM_FAULT_RETRY triggered, so seek to the faulting offset */
-		pages += ret;
+		/*
+		 * VM_FAULT_RETRY triggered, so seek to the faulting offset.
+		 * For the prefault case (!pages) we only update counts.
+		 */
+		if (likely(pages))
+			pages += ret;
 		start += ret << PAGE_SHIFT;
 
 		/*
@@ -979,7 +979,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 		pages_done++;
 		if (!nr_pages)
 			break;
-		pages++;
+		if (likely(pages))
+			pages++;
 		start += PAGE_SIZE;
 	}
 	if (lock_dropped && *locked) {
-- 
2.7.4


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
  2019-05-14 14:29     ` [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults Mike Rapoport
@ 2019-05-16 16:25         ` Andrei Vagin
  2019-05-21 15:53       ` Mike Rapoport
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Andrei Vagin @ 2019-05-16 16:25 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm, LKML

On Tue, May 14, 2019 at 7:32 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> When get_user_pages*() is called with pages = NULL, the processing of
> VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> the pages.
>
> If the pages in the requested range belong to a VMA that has userfaultfd
> registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> has populated the page, but for the gup pre-fault case there's no actual
> retry and the caller will get no pages although they are present.
>
> This issue was uncovered when running post-copy memory restore in CRIU
> after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> copy_fpstate_to_sigframe() fails").
>
> After this change, the copying of FPU state to the sigframe switched from
> copy_to_user() variants which caused a real page fault to get_user_pages()
> with pages parameter set to NULL.
>
> In post-copy mode of CRIU, the destination memory is managed with
> userfaultfd and lack of the retry for pre-fault case in get_user_pages()
> causes a crash of the restored process.
>
> Making the pre-fault behavior of get_user_pages() the same as the "normal"
> one fixes the issue.
>

Tested-by: Andrei Vagin <avagin@gmail.com>

https://travis-ci.org/avagin/linux/builds/533184940

> Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  mm/gup.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 91819b8..c32ae5a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -936,10 +936,6 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>                         BUG_ON(ret >= nr_pages);
>                 }
>
> -               if (!pages)
> -                       /* If it's a prefault don't insist harder */
> -                       return ret;
> -
>                 if (ret > 0) {
>                         nr_pages -= ret;
>                         pages_done += ret;
> @@ -955,8 +951,12 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>                                 pages_done = ret;
>                         break;
>                 }
> -               /* VM_FAULT_RETRY triggered, so seek to the faulting offset */
> -               pages += ret;
> +               /*
> +                * VM_FAULT_RETRY triggered, so seek to the faulting offset.
> +                * For the prefault case (!pages) we only update counts.
> +                */
> +               if (likely(pages))
> +                       pages += ret;
>                 start += ret << PAGE_SHIFT;
>
>                 /*
> @@ -979,7 +979,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>                 pages_done++;
>                 if (!nr_pages)
>                         break;
> -               pages++;
> +               if (likely(pages))
> +                       pages++;
>                 start += PAGE_SIZE;
>         }
>         if (lock_dropped && *locked) {
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
@ 2019-05-16 16:25         ` Andrei Vagin
  0 siblings, 0 replies; 27+ messages in thread
From: Andrei Vagin @ 2019-05-16 16:25 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: Andrew Morton, Andrea Arcangeli, linux-mm, LKML

On Tue, May 14, 2019 at 7:32 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> When get_user_pages*() is called with pages = NULL, the processing of
> VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> the pages.
>
> If the pages in the requested range belong to a VMA that has userfaultfd
> registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> has populated the page, but for the gup pre-fault case there's no actual
> retry and the caller will get no pages although they are present.
>
> This issue was uncovered when running post-copy memory restore in CRIU
> after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> copy_fpstate_to_sigframe() fails").
>
> After this change, the copying of FPU state to the sigframe switched from
> copy_to_user() variants which caused a real page fault to get_user_pages()
> with pages parameter set to NULL.
>
> In post-copy mode of CRIU, the destination memory is managed with
> userfaultfd and lack of the retry for pre-fault case in get_user_pages()
> causes a crash of the restored process.
>
> Making the pre-fault behavior of get_user_pages() the same as the "normal"
> one fixes the issue.
>

Tested-by: Andrei Vagin <avagin@gmail.com>

https://travis-ci.org/avagin/linux/builds/533184940

> Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  mm/gup.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 91819b8..c32ae5a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -936,10 +936,6 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>                         BUG_ON(ret >= nr_pages);
>                 }
>
> -               if (!pages)
> -                       /* If it's a prefault don't insist harder */
> -                       return ret;
> -
>                 if (ret > 0) {
>                         nr_pages -= ret;
>                         pages_done += ret;
> @@ -955,8 +951,12 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>                                 pages_done = ret;
>                         break;
>                 }
> -               /* VM_FAULT_RETRY triggered, so seek to the faulting offset */
> -               pages += ret;
> +               /*
> +                * VM_FAULT_RETRY triggered, so seek to the faulting offset.
> +                * For the prefault case (!pages) we only update counts.
> +                */
> +               if (likely(pages))
> +                       pages += ret;
>                 start += ret << PAGE_SHIFT;
>
>                 /*
> @@ -979,7 +979,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>                 pages_done++;
>                 if (!nr_pages)
>                         break;
> -               pages++;
> +               if (likely(pages))
> +                       pages++;
>                 start += PAGE_SIZE;
>         }
>         if (lock_dropped && *locked) {
> --
> 2.7.4
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
  2019-05-14 14:29     ` [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults Mike Rapoport
  2019-05-16 16:25         ` Andrei Vagin
@ 2019-05-21 15:53       ` Mike Rapoport
  2019-05-22 19:21       ` Andrew Morton
  2019-06-06 17:25       ` [tip:x86/urgent] x86/fpu: Use fault_in_pages_writeable() for pre-faulting tip-bot for Hugh Dickins
  3 siblings, 0 replies; 27+ messages in thread
From: Mike Rapoport @ 2019-05-21 15:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-mm, linux-kernel

Hi,

Any comments on this?

On Tue, May 14, 2019 at 05:29:55PM +0300, Mike Rapoport wrote:
> When get_user_pages*() is called with pages = NULL, the processing of
> VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> the pages.
> 
> If the pages in the requested range belong to a VMA that has userfaultfd
> registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> has populated the page, but for the gup pre-fault case there's no actual
> retry and the caller will get no pages although they are present.
> 
> This issue was uncovered when running post-copy memory restore in CRIU
> after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> copy_fpstate_to_sigframe() fails").
> 
> After this change, the copying of FPU state to the sigframe switched from
> copy_to_user() variants which caused a real page fault to get_user_pages()
> with pages parameter set to NULL.
> 
> In post-copy mode of CRIU, the destination memory is managed with
> userfaultfd and lack of the retry for pre-fault case in get_user_pages()
> causes a crash of the restored process.
> 
> Making the pre-fault behavior of get_user_pages() the same as the "normal"
> one fixes the issue.
> 
> Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  mm/gup.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 91819b8..c32ae5a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -936,10 +936,6 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>  			BUG_ON(ret >= nr_pages);
>  		}
>  
> -		if (!pages)
> -			/* If it's a prefault don't insist harder */
> -			return ret;
> -
>  		if (ret > 0) {
>  			nr_pages -= ret;
>  			pages_done += ret;
> @@ -955,8 +951,12 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>  				pages_done = ret;
>  			break;
>  		}
> -		/* VM_FAULT_RETRY triggered, so seek to the faulting offset */
> -		pages += ret;
> +		/*
> +		 * VM_FAULT_RETRY triggered, so seek to the faulting offset.
> +		 * For the prefault case (!pages) we only update counts.
> +		 */
> +		if (likely(pages))
> +			pages += ret;
>  		start += ret << PAGE_SHIFT;
>  
>  		/*
> @@ -979,7 +979,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>  		pages_done++;
>  		if (!nr_pages)
>  			break;
> -		pages++;
> +		if (likely(pages))
> +			pages++;
>  		start += PAGE_SIZE;
>  	}
>  	if (lock_dropped && *locked) {
> -- 
> 2.7.4
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
  2019-05-14 14:29     ` [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults Mike Rapoport
  2019-05-16 16:25         ` Andrei Vagin
  2019-05-21 15:53       ` Mike Rapoport
@ 2019-05-22 19:21       ` Andrew Morton
  2019-05-22 19:43         ` Sebastian Andrzej Siewior
  2019-05-22 20:38         ` Mike Rapoport
  2019-06-06 17:25       ` [tip:x86/urgent] x86/fpu: Use fault_in_pages_writeable() for pre-faulting tip-bot for Hugh Dickins
  3 siblings, 2 replies; 27+ messages in thread
From: Andrew Morton @ 2019-05-22 19:21 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrea Arcangeli, linux-mm, linux-kernel,
	Sebastian Andrzej Siewior, Borislav Petkov

On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote:

> When get_user_pages*() is called with pages = NULL, the processing of
> VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> the pages.
> 
> If the pages in the requested range belong to a VMA that has userfaultfd
> registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> has populated the page, but for the gup pre-fault case there's no actual
> retry and the caller will get no pages although they are present.
> 
> This issue was uncovered when running post-copy memory restore in CRIU
> after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> copy_fpstate_to_sigframe() fails").
> 
> After this change, the copying of FPU state to the sigframe switched from
> copy_to_user() variants which caused a real page fault to get_user_pages()
> with pages parameter set to NULL.

You're saying that argument buf_fx in copy_fpstate_to_sigframe() is NULL?

If so was that expected by the (now cc'ed) developers of
d9c9ce34ed5c8923 ("x86/fpu: Fault-in user stack if
copy_fpstate_to_sigframe() fails")?

It seems rather odd.  copy_fpregs_to_sigframe() doesn't look like it's
expecting a NULL argument.

Also, I wonder if copy_fpstate_to_sigframe() would be better using
fault_in_pages_writeable() rather than get_user_pages_unlocked().  That
seems like it operates at a more suitable level and I guess it will fix
this issue also.

> In post-copy mode of CRIU, the destination memory is managed with
> userfaultfd and lack of the retry for pre-fault case in get_user_pages()
> causes a crash of the restored process.
> 
> Making the pre-fault behavior of get_user_pages() the same as the "normal"
> one fixes the issue.

Should this be backported into -stable trees?

> Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
  2019-05-22 19:21       ` Andrew Morton
@ 2019-05-22 19:43         ` Sebastian Andrzej Siewior
  2019-05-24 22:22             ` Hugh Dickins
  2019-05-22 20:38         ` Mike Rapoport
  1 sibling, 1 reply; 27+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-05-22 19:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Rapoport, Andrea Arcangeli, linux-mm, linux-kernel, Borislav Petkov

On 2019-05-22 12:21:13 [-0700], Andrew Morton wrote:
> On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote:
> 
> > When get_user_pages*() is called with pages = NULL, the processing of
> > VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> > the pages.
> > 
> > If the pages in the requested range belong to a VMA that has userfaultfd
> > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> > has populated the page, but for the gup pre-fault case there's no actual
> > retry and the caller will get no pages although they are present.
> > 
> > This issue was uncovered when running post-copy memory restore in CRIU
> > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> > copy_fpstate_to_sigframe() fails").
> > 
> > After this change, the copying of FPU state to the sigframe switched from
> > copy_to_user() variants which caused a real page fault to get_user_pages()
> > with pages parameter set to NULL.
> 
> You're saying that argument buf_fx in copy_fpstate_to_sigframe() is NULL?

buf_fx is user stack pointer and it should not be NULL.

> If so was that expected by the (now cc'ed) developers of
> d9c9ce34ed5c8923 ("x86/fpu: Fault-in user stack if
> copy_fpstate_to_sigframe() fails")?
> 
> It seems rather odd.  copy_fpregs_to_sigframe() doesn't look like it's
> expecting a NULL argument.

exactly, this is not expected.

> Also, I wonder if copy_fpstate_to_sigframe() would be better using
> fault_in_pages_writeable() rather than get_user_pages_unlocked().  That
> seems like it operates at a more suitable level and I guess it will fix
> this issue also.

It looks, like fault_in_pages_writeable() would work. If this is the
recommendation from the MM department than I can switch to that.

> > In post-copy mode of CRIU, the destination memory is managed with
> > userfaultfd and lack of the retry for pre-fault case in get_user_pages()
> > causes a crash of the restored process.
> > 
> > Making the pre-fault behavior of get_user_pages() the same as the "normal"
> > one fixes the issue.
> 
> Should this be backported into -stable trees?

Sebastian

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
  2019-05-22 19:21       ` Andrew Morton
  2019-05-22 19:43         ` Sebastian Andrzej Siewior
@ 2019-05-22 20:38         ` Mike Rapoport
  2019-05-22 21:18           ` Andrew Morton
  1 sibling, 1 reply; 27+ messages in thread
From: Mike Rapoport @ 2019-05-22 20:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, linux-mm, linux-kernel,
	Sebastian Andrzej Siewior, Borislav Petkov,
	Dr. David Alan Gilbert, kvm

(added kvm)

On Wed, May 22, 2019 at 12:21:13PM -0700, Andrew Morton wrote:
> On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote:
> 
> > When get_user_pages*() is called with pages = NULL, the processing of
> > VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> > the pages.
> > 
> > If the pages in the requested range belong to a VMA that has userfaultfd
> > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> > has populated the page, but for the gup pre-fault case there's no actual
> > retry and the caller will get no pages although they are present.
> > 
> > This issue was uncovered when running post-copy memory restore in CRIU
> > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> > copy_fpstate_to_sigframe() fails").
> > 
> > After this change, the copying of FPU state to the sigframe switched from
> > copy_to_user() variants which caused a real page fault to get_user_pages()
> > with pages parameter set to NULL.
> 
> You're saying that argument buf_fx in copy_fpstate_to_sigframe() is NULL?

Apparently I haven't explained well. The 'pages' parameter in the call to
get_user_pages_unlocked() is NULL.
 
> If so was that expected by the (now cc'ed) developers of
> d9c9ce34ed5c8923 ("x86/fpu: Fault-in user stack if
> copy_fpstate_to_sigframe() fails")?
> 
> It seems rather odd.  copy_fpregs_to_sigframe() doesn't look like it's
> expecting a NULL argument.
> 
> Also, I wonder if copy_fpstate_to_sigframe() would be better using
> fault_in_pages_writeable() rather than get_user_pages_unlocked().  That
> seems like it operates at a more suitable level and I guess it will fix
> this issue also.

If I understand correctly, one of the points of d9c9ce34ed5c8923 ("x86/fpu:
Fault-in user stack if copy_fpstate_to_sigframe() fails") was to to avoid
page faults, hence the use of get_user_pages().

With fault_in_pages_writeable() there might be a page fault, unless I've
completely mistaken.

Unrelated to copy_fpstate_to_sigframe(), the issue could happen if any call
to get_user_pages() with pages parameter set to NULL tries to access
userfaultfd-managed memory. Currently, there are 4 in tree users:

arch/x86/kernel/fpu/signal.c:198:8-31:  -> gup with !pages
arch/x86/mm/mpx.c:423:11-25:  -> gup with !pages
virt/kvm/async_pf.c:90:1-22:  -> gup with !pages
virt/kvm/kvm_main.c:1437:6-20:  -> gup with !pages

I don't know if anybody is using mpx with uffd and anyway mpx seems to go
away.

As for KVM, I think that post-copy live migration of L2 guest might trigger
that as well. Not sure though, I'm not really familiar with KVM code.
 
> > In post-copy mode of CRIU, the destination memory is managed with
> > userfaultfd and lack of the retry for pre-fault case in get_user_pages()
> > causes a crash of the restored process.
> > 
> > Making the pre-fault behavior of get_user_pages() the same as the "normal"
> > one fixes the issue.
> 
> Should this be backported into -stable trees?

I think that it depends on whether KVM affected by this or not.

> > Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> 
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
  2019-05-22 20:38         ` Mike Rapoport
@ 2019-05-22 21:18           ` Andrew Morton
  2019-06-22 17:51             ` Andrea Arcangeli
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2019-05-22 21:18 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrea Arcangeli, linux-mm, linux-kernel,
	Sebastian Andrzej Siewior, Borislav Petkov,
	Dr. David Alan Gilbert, kvm

On Wed, 22 May 2019 23:38:29 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote:

> (added kvm)
> 
> On Wed, May 22, 2019 at 12:21:13PM -0700, Andrew Morton wrote:
> > On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote:
> > 
> > > When get_user_pages*() is called with pages = NULL, the processing of
> > > VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> > > the pages.
> > > 
> > > If the pages in the requested range belong to a VMA that has userfaultfd
> > > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> > > has populated the page, but for the gup pre-fault case there's no actual
> > > retry and the caller will get no pages although they are present.
> > > 
> > > This issue was uncovered when running post-copy memory restore in CRIU
> > > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> > > copy_fpstate_to_sigframe() fails").
> > > 
> > > After this change, the copying of FPU state to the sigframe switched from
> > > copy_to_user() variants which caused a real page fault to get_user_pages()
> > > with pages parameter set to NULL.
> > 
> > You're saying that argument buf_fx in copy_fpstate_to_sigframe() is NULL?
> 
> Apparently I haven't explained well. The 'pages' parameter in the call to
> get_user_pages_unlocked() is NULL.

Doh.

> > If so was that expected by the (now cc'ed) developers of
> > d9c9ce34ed5c8923 ("x86/fpu: Fault-in user stack if
> > copy_fpstate_to_sigframe() fails")?
> > 
> > It seems rather odd.  copy_fpregs_to_sigframe() doesn't look like it's
> > expecting a NULL argument.
> > 
> > Also, I wonder if copy_fpstate_to_sigframe() would be better using
> > fault_in_pages_writeable() rather than get_user_pages_unlocked().  That
> > seems like it operates at a more suitable level and I guess it will fix
> > this issue also.
> 
> If I understand correctly, one of the points of d9c9ce34ed5c8923 ("x86/fpu:
> Fault-in user stack if copy_fpstate_to_sigframe() fails") was to to avoid
> page faults, hence the use of get_user_pages().
> 
> With fault_in_pages_writeable() there might be a page fault, unless I've
> completely mistaken.
> 
> Unrelated to copy_fpstate_to_sigframe(), the issue could happen if any call
> to get_user_pages() with pages parameter set to NULL tries to access
> userfaultfd-managed memory. Currently, there are 4 in tree users:
> 
> arch/x86/kernel/fpu/signal.c:198:8-31:  -> gup with !pages
> arch/x86/mm/mpx.c:423:11-25:  -> gup with !pages
> virt/kvm/async_pf.c:90:1-22:  -> gup with !pages
> virt/kvm/kvm_main.c:1437:6-20:  -> gup with !pages

OK.

> I don't know if anybody is using mpx with uffd and anyway mpx seems to go
> away.
> 
> As for KVM, I think that post-copy live migration of L2 guest might trigger
> that as well. Not sure though, I'm not really familiar with KVM code.
>  
> > > In post-copy mode of CRIU, the destination memory is managed with
> > > userfaultfd and lack of the retry for pre-fault case in get_user_pages()
> > > causes a crash of the restored process.
> > > 
> > > Making the pre-fault behavior of get_user_pages() the same as the "normal"
> > > one fixes the issue.
> > 
> > Should this be backported into -stable trees?
> 
> I think that it depends on whether KVM affected by this or not.
> 

How do we determine this?

I guess it doesn't matter much.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
  2019-05-22 19:43         ` Sebastian Andrzej Siewior
@ 2019-05-24 22:22             ` Hugh Dickins
  0 siblings, 0 replies; 27+ messages in thread
From: Hugh Dickins @ 2019-05-24 22:22 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Andrew Morton, Mike Rapoport, Andrea Arcangeli, linux-mm,
	linux-kernel, Borislav Petkov, Pavel Machek

On Wed, 22 May 2019, Sebastian Andrzej Siewior wrote:
> On 2019-05-22 12:21:13 [-0700], Andrew Morton wrote:
> > On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote:
> > 
> > > When get_user_pages*() is called with pages = NULL, the processing of
> > > VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> > > the pages.
> > > 
> > > If the pages in the requested range belong to a VMA that has userfaultfd
> > > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> > > has populated the page, but for the gup pre-fault case there's no actual
> > > retry and the caller will get no pages although they are present.
> > > 
> > > This issue was uncovered when running post-copy memory restore in CRIU
> > > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> > > copy_fpstate_to_sigframe() fails").

I've been getting unexplained segmentation violations, and "make" giving
up early, when running kernel builds under swapping memory pressure: no
CRIU involved.

Bisected last night to that same x86/fpu commit, not itself guilty, but
suffering from the odd behavior of get_user_pages_unlocked() giving up
too early.

(I wondered at first if copy_fpstate_to_sigframe() ought to retry if
non-negative ret < nr_pages, but no, that would be wrong: a present page
followed by an invalid area would repeatedly return 1 for nr_pages 2.)

Cc'ing Pavel, who's been having segfault trouble in emacs: maybe same?

> > > 
> > > After this change, the copying of FPU state to the sigframe switched from
> > > copy_to_user() variants which caused a real page fault to get_user_pages()
> > > with pages parameter set to NULL.
...
> > Also, I wonder if copy_fpstate_to_sigframe() would be better using
> > fault_in_pages_writeable() rather than get_user_pages_unlocked().  That
> > seems like it operates at a more suitable level and I guess it will fix
> > this issue also.
> 
> It looks, like fault_in_pages_writeable() would work. If this is the
> recommendation from the MM department than I can switch to that.

I've now run a couple of hours of load successfully with Mike's patch
to GUP, no problem; but whatever the merits of that patch in general,
I agree with Andrew that fault_in_pages_writeable() seems altogether
more appropriate for copy_fpstate_to_sigframe(), and have now run a
couple of hours of load successfully with this instead (rewrite to taste):

--- 5.2-rc1/arch/x86/kernel/fpu/signal.c
+++ linux/arch/x86/kernel/fpu/signal.c
@@ -3,6 +3,7 @@
  * FPU signal frame handling routines.
  */
 
+#include <linux/pagemap.h>
 #include <linux/compat.h>
 #include <linux/cpu.h>
 
@@ -189,15 +190,7 @@ retry:
 	fpregs_unlock();
 
 	if (ret) {
-		int aligned_size;
-		int nr_pages;
-
-		aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size;
-		nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE);
-
-		ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages,
-					      NULL, FOLL_WRITE);
-		if (ret == nr_pages)
+		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
 			goto retry;
 		return -EFAULT;
 	}

(I did wonder whether there needs to be an access_ok() check on buf_fx;
but if so, then I think it would already have been needed before the
earlier copy_fpregs_to_sigframe(); but I didn't get deep enough into
that to be sure, nor into whether access_ok() check on buf covers buf_fx.)

Hugh

> 
> > > In post-copy mode of CRIU, the destination memory is managed with
> > > userfaultfd and lack of the retry for pre-fault case in get_user_pages()
> > > causes a crash of the restored process.
> > > 
> > > Making the pre-fault behavior of get_user_pages() the same as the "normal"
> > > one fixes the issue.
> > 
> > Should this be backported into -stable trees?
> 
> Sebastian

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
@ 2019-05-24 22:22             ` Hugh Dickins
  0 siblings, 0 replies; 27+ messages in thread
From: Hugh Dickins @ 2019-05-24 22:22 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Andrew Morton, Mike Rapoport, Andrea Arcangeli, linux-mm,
	linux-kernel, Borislav Petkov, Pavel Machek

On Wed, 22 May 2019, Sebastian Andrzej Siewior wrote:
> On 2019-05-22 12:21:13 [-0700], Andrew Morton wrote:
> > On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote:
> > 
> > > When get_user_pages*() is called with pages = NULL, the processing of
> > > VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> > > the pages.
> > > 
> > > If the pages in the requested range belong to a VMA that has userfaultfd
> > > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> > > has populated the page, but for the gup pre-fault case there's no actual
> > > retry and the caller will get no pages although they are present.
> > > 
> > > This issue was uncovered when running post-copy memory restore in CRIU
> > > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> > > copy_fpstate_to_sigframe() fails").

I've been getting unexplained segmentation violations, and "make" giving
up early, when running kernel builds under swapping memory pressure: no
CRIU involved.

Bisected last night to that same x86/fpu commit, not itself guilty, but
suffering from the odd behavior of get_user_pages_unlocked() giving up
too early.

(I wondered at first if copy_fpstate_to_sigframe() ought to retry if
non-negative ret < nr_pages, but no, that would be wrong: a present page
followed by an invalid area would repeatedly return 1 for nr_pages 2.)

Cc'ing Pavel, who's been having segfault trouble in emacs: maybe same?

> > > 
> > > After this change, the copying of FPU state to the sigframe switched from
> > > copy_to_user() variants which caused a real page fault to get_user_pages()
> > > with pages parameter set to NULL.
...
> > Also, I wonder if copy_fpstate_to_sigframe() would be better using
> > fault_in_pages_writeable() rather than get_user_pages_unlocked().  That
> > seems like it operates at a more suitable level and I guess it will fix
> > this issue also.
> 
> It looks, like fault_in_pages_writeable() would work. If this is the
> recommendation from the MM department than I can switch to that.

I've now run a couple of hours of load successfully with Mike's patch
to GUP, no problem; but whatever the merits of that patch in general,
I agree with Andrew that fault_in_pages_writeable() seems altogether
more appropriate for copy_fpstate_to_sigframe(), and have now run a
couple of hours of load successfully with this instead (rewrite to taste):

--- 5.2-rc1/arch/x86/kernel/fpu/signal.c
+++ linux/arch/x86/kernel/fpu/signal.c
@@ -3,6 +3,7 @@
  * FPU signal frame handling routines.
  */
 
+#include <linux/pagemap.h>
 #include <linux/compat.h>
 #include <linux/cpu.h>
 
@@ -189,15 +190,7 @@ retry:
 	fpregs_unlock();
 
 	if (ret) {
-		int aligned_size;
-		int nr_pages;
-
-		aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size;
-		nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE);
-
-		ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages,
-					      NULL, FOLL_WRITE);
-		if (ret == nr_pages)
+		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
 			goto retry;
 		return -EFAULT;
 	}

(I did wonder whether there needs to be an access_ok() check on buf_fx;
but if so, then I think it would already have been needed before the
earlier copy_fpregs_to_sigframe(); but I didn't get deep enough into
that to be sure, nor into whether access_ok() check on buf covers buf_fx.)

Hugh

> 
> > > In post-copy mode of CRIU, the destination memory is managed with
> > > userfaultfd and lack of the retry for pre-fault case in get_user_pages()
> > > causes a crash of the restored process.
> > > 
> > > Making the pre-fault behavior of get_user_pages() the same as the "normal"
> > > one fixes the issue.
> > 
> > Should this be backported into -stable trees?
> 
> Sebastian


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
  2019-05-24 22:22             ` Hugh Dickins
  (?)
@ 2019-05-25  8:45             ` Sebastian Andrzej Siewior
  2019-05-25 18:09                 ` Hugh Dickins
  -1 siblings, 1 reply; 27+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-05-25  8:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Rapoport, Andrea Arcangeli, linux-mm,
	linux-kernel, Borislav Petkov, Pavel Machek

On 2019-05-24 15:22:51 [-0700], Hugh Dickins wrote:
> I've now run a couple of hours of load successfully with Mike's patch
> to GUP, no problem; but whatever the merits of that patch in general,
> I agree with Andrew that fault_in_pages_writeable() seems altogether
> more appropriate for copy_fpstate_to_sigframe(), and have now run a
> couple of hours of load successfully with this instead (rewrite to taste):

so this patch instead of Mike's GUP patch fixes the issue you observed?
Is this just a taste question or limitation of the function in general?

I'm asking because it has been suggested and is used in MPX code (in the
signal path but .mmap) and I'm not aware of any limitation. But as I
wrote earlier to akpm, if the MM folks suggest to use this instead I am
happy to switch.

> --- 5.2-rc1/arch/x86/kernel/fpu/signal.c
> +++ linux/arch/x86/kernel/fpu/signal.c
> @@ -3,6 +3,7 @@
>   * FPU signal frame handling routines.
>   */
>  
> +#include <linux/pagemap.h>
>  #include <linux/compat.h>
>  #include <linux/cpu.h>
>  
> @@ -189,15 +190,7 @@ retry:
>  	fpregs_unlock();
>  
>  	if (ret) {
> -		int aligned_size;
> -		int nr_pages;
> -
> -		aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size;
> -		nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE);
> -
> -		ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages,
> -					      NULL, FOLL_WRITE);
> -		if (ret == nr_pages)
> +		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
>  			goto retry;
>  		return -EFAULT;
>  	}
> 
> (I did wonder whether there needs to be an access_ok() check on buf_fx;
> but if so, then I think it would already have been needed before the
> earlier copy_fpregs_to_sigframe(); but I didn't get deep enough into
> that to be sure, nor into whether access_ok() check on buf covers buf_fx.)

There is an access_ok() at the begin of copy_fpregs_to_sigframe(). The
memory is allocated from user's stack and there is (later) an
access_ok() for the whole region (which can be more than the memory used
by the FPU code).

> Hugh

Sebastian

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
  2019-05-25  8:45             ` Sebastian Andrzej Siewior
@ 2019-05-25 18:09                 ` Hugh Dickins
  0 siblings, 0 replies; 27+ messages in thread
From: Hugh Dickins @ 2019-05-25 18:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Hugh Dickins, Andrew Morton, Mike Rapoport, Andrea Arcangeli,
	linux-mm, linux-kernel, Borislav Petkov, Pavel Machek,
	Dave Hansen

On Sat, 25 May 2019, Sebastian Andrzej Siewior wrote:
> On 2019-05-24 15:22:51 [-0700], Hugh Dickins wrote:
> > I've now run a couple of hours of load successfully with Mike's patch
> > to GUP, no problem; but whatever the merits of that patch in general,
> > I agree with Andrew that fault_in_pages_writeable() seems altogether
> > more appropriate for copy_fpstate_to_sigframe(), and have now run a
> > couple of hours of load successfully with this instead (rewrite to taste):
> 
> so this patch instead of Mike's GUP patch fixes the issue you observed?

Yes.

> Is this just a taste question or limitation of the function in general?

I'd say it's just a taste question. Though the the fact that your
usage showed up a bug in the get_user_pages_unlocked() implementation,
demanding a fix, does indicate that it's a more fragile and complex
route, better avoided if there's a good simple alternative. If it were
not already on your slowpath, I'd also argue fault_in_pages_writeable()
is a more efficient way to do it.

> 
> I'm asking because it has been suggested and is used in MPX code (in the
> signal path but .mmap) and I'm not aware of any limitation. But as I
> wrote earlier to akpm, if the MM folks suggest to use this instead I am
> happy to switch.

I know nothing of MPX, beyond that Dave Hansen has posted patches to
remove that support entirely, so I'm surprised arch/x86/mm/mpx.c is
still in the tree. But peering at it now, it looks as if it's using
get_user_pages() while holding mmap_sem, whereas you (sensibly enough)
used get_user_pages_unlocked() to handle the mmap_sem for you -
the trouble with that is that since it knows it's in control of
mmap_sem, it feels free to drop it internally, and that takes it
down the path of the premature return when pages NULL that Mike is
fixing. MPX's get_user_pages() is not free to go that way.

> 
> > --- 5.2-rc1/arch/x86/kernel/fpu/signal.c
> > +++ linux/arch/x86/kernel/fpu/signal.c
> > @@ -3,6 +3,7 @@
> >   * FPU signal frame handling routines.
> >   */
> >  
> > +#include <linux/pagemap.h>
> >  #include <linux/compat.h>
> >  #include <linux/cpu.h>
> >  
> > @@ -189,15 +190,7 @@ retry:
> >  	fpregs_unlock();
> >  
> >  	if (ret) {
> > -		int aligned_size;
> > -		int nr_pages;
> > -
> > -		aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size;
> > -		nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE);
> > -
> > -		ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages,
> > -					      NULL, FOLL_WRITE);
> > -		if (ret == nr_pages)
> > +		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
> >  			goto retry;
> >  		return -EFAULT;
> >  	}
> > 
> > (I did wonder whether there needs to be an access_ok() check on buf_fx;
> > but if so, then I think it would already have been needed before the
> > earlier copy_fpregs_to_sigframe(); but I didn't get deep enough into
> > that to be sure, nor into whether access_ok() check on buf covers buf_fx.)
> 
> There is an access_ok() at the begin of copy_fpregs_to_sigframe(). The
> memory is allocated from user's stack and there is (later) an
> access_ok() for the whole region (which can be more than the memory used
> by the FPU code).

Yes, but remember I know nothing of this FPU signal code, so I cannot
tell whether an access_ok(buf, size) is good enough to cover the range
of an access_ok(buf_fx, fpu_user_xstate_size).

Your "(later)" worries me a little - I hope you're not writing first
and checking the limits later; but what you're doing may be perfectly
correct, I'm just too far from understanding the details to say; but
raised the matter because (I think) get_user_pages_unlocked() would
entail an access_ok() check where fault_in_pages_writable() would not.

Hugh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
@ 2019-05-25 18:09                 ` Hugh Dickins
  0 siblings, 0 replies; 27+ messages in thread
From: Hugh Dickins @ 2019-05-25 18:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Hugh Dickins, Andrew Morton, Mike Rapoport, Andrea Arcangeli,
	linux-mm, linux-kernel, Borislav Petkov, Pavel Machek,
	Dave Hansen

On Sat, 25 May 2019, Sebastian Andrzej Siewior wrote:
> On 2019-05-24 15:22:51 [-0700], Hugh Dickins wrote:
> > I've now run a couple of hours of load successfully with Mike's patch
> > to GUP, no problem; but whatever the merits of that patch in general,
> > I agree with Andrew that fault_in_pages_writeable() seems altogether
> > more appropriate for copy_fpstate_to_sigframe(), and have now run a
> > couple of hours of load successfully with this instead (rewrite to taste):
> 
> so this patch instead of Mike's GUP patch fixes the issue you observed?

Yes.

> Is this just a taste question or limitation of the function in general?

I'd say it's just a taste question. Though the the fact that your
usage showed up a bug in the get_user_pages_unlocked() implementation,
demanding a fix, does indicate that it's a more fragile and complex
route, better avoided if there's a good simple alternative. If it were
not already on your slowpath, I'd also argue fault_in_pages_writeable()
is a more efficient way to do it.

> 
> I'm asking because it has been suggested and is used in MPX code (in the
> signal path but .mmap) and I'm not aware of any limitation. But as I
> wrote earlier to akpm, if the MM folks suggest to use this instead I am
> happy to switch.

I know nothing of MPX, beyond that Dave Hansen has posted patches to
remove that support entirely, so I'm surprised arch/x86/mm/mpx.c is
still in the tree. But peering at it now, it looks as if it's using
get_user_pages() while holding mmap_sem, whereas you (sensibly enough)
used get_user_pages_unlocked() to handle the mmap_sem for you -
the trouble with that is that since it knows it's in control of
mmap_sem, it feels free to drop it internally, and that takes it
down the path of the premature return when pages NULL that Mike is
fixing. MPX's get_user_pages() is not free to go that way.

> 
> > --- 5.2-rc1/arch/x86/kernel/fpu/signal.c
> > +++ linux/arch/x86/kernel/fpu/signal.c
> > @@ -3,6 +3,7 @@
> >   * FPU signal frame handling routines.
> >   */
> >  
> > +#include <linux/pagemap.h>
> >  #include <linux/compat.h>
> >  #include <linux/cpu.h>
> >  
> > @@ -189,15 +190,7 @@ retry:
> >  	fpregs_unlock();
> >  
> >  	if (ret) {
> > -		int aligned_size;
> > -		int nr_pages;
> > -
> > -		aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size;
> > -		nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE);
> > -
> > -		ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages,
> > -					      NULL, FOLL_WRITE);
> > -		if (ret == nr_pages)
> > +		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
> >  			goto retry;
> >  		return -EFAULT;
> >  	}
> > 
> > (I did wonder whether there needs to be an access_ok() check on buf_fx;
> > but if so, then I think it would already have been needed before the
> > earlier copy_fpregs_to_sigframe(); but I didn't get deep enough into
> > that to be sure, nor into whether access_ok() check on buf covers buf_fx.)
> 
> There is an access_ok() at the begin of copy_fpregs_to_sigframe(). The
> memory is allocated from user's stack and there is (later) an
> access_ok() for the whole region (which can be more than the memory used
> by the FPU code).

Yes, but remember I know nothing of this FPU signal code, so I cannot
tell whether an access_ok(buf, size) is good enough to cover the range
of an access_ok(buf_fx, fpu_user_xstate_size).

Your "(later)" worries me a little - I hope you're not writing first
and checking the limits later; but what you're doing may be perfectly
correct, I'm just too far from understanding the details to say; but
raised the matter because (I think) get_user_pages_unlocked() would
entail an access_ok() check where fault_in_pages_writable() would not.

Hugh


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
  2019-05-24 22:22             ` Hugh Dickins
  (?)
  (?)
@ 2019-05-26 17:25             ` Pavel Machek
  -1 siblings, 0 replies; 27+ messages in thread
From: Pavel Machek @ 2019-05-26 17:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Sebastian Andrzej Siewior, Andrew Morton, Mike Rapoport,
	Andrea Arcangeli, linux-mm, linux-kernel, Borislav Petkov

On Fri 2019-05-24 15:22:51, Hugh Dickins wrote:
> On Wed, 22 May 2019, Sebastian Andrzej Siewior wrote:
> > On 2019-05-22 12:21:13 [-0700], Andrew Morton wrote:
> > > On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote:
> > > 
> > > > When get_user_pages*() is called with pages = NULL, the processing of
> > > > VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> > > > the pages.
> > > > 
> > > > If the pages in the requested range belong to a VMA that has userfaultfd
> > > > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> > > > has populated the page, but for the gup pre-fault case there's no actual
> > > > retry and the caller will get no pages although they are present.
> > > > 
> > > > This issue was uncovered when running post-copy memory restore in CRIU
> > > > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> > > > copy_fpstate_to_sigframe() fails").
> 
> I've been getting unexplained segmentation violations, and "make" giving
> up early, when running kernel builds under swapping memory pressure: no
> CRIU involved.
> 
> Bisected last night to that same x86/fpu commit, not itself guilty, but
> suffering from the odd behavior of get_user_pages_unlocked() giving up
> too early.
> 
> (I wondered at first if copy_fpstate_to_sigframe() ought to retry if
> non-negative ret < nr_pages, but no, that would be wrong: a present page
> followed by an invalid area would repeatedly return 1 for nr_pages 2.)
> 
> Cc'ing Pavel, who's been having segfault trouble in emacs: maybe same?

The emacs segfault was always during process exit. This sounds different...

I don't see problems with make.

But its true that at least one of affected machines uses swap heavily.

Best regards,
								Pavel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH] x86/fpu: Use fault_in_pages_writeable() for pre-faulting
@ 2019-05-26 17:33 Sebastian Andrzej Siewior
  2019-05-26 17:35 ` Sebastian Andrzej Siewior
  2019-05-29  4:18 ` Andrew Morton
  0 siblings, 2 replies; 27+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-05-26 17:33 UTC (permalink / raw)
  To: Hugh Dickins, x86
  Cc: Andrew Morton, Mike Rapoport, Andrea Arcangeli, linux-mm,
	linux-kernel, Borislav Petkov, Pavel Machek, Dave Hansen

From: Hugh Dickins <hughd@google.com>

Since commit

   d9c9ce34ed5c8 ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")

we use get_user_pages_unlocked() to pre-faulting user's memory if a
write generates a page fault while the handler is disabled.
This works in general and uncovered a bug as reported by Mike Rapoport.

It has been pointed out that this function may be fragile and a
simple pre-fault as in fault_in_pages_writeable() would be a better
solution. Better as in taste and simplicity: That write (as performed by
the alternative function) performs exactly the same faulting of memory
that we had before. This was suggested by Hugh Dickins and Andrew
Morton.

Use fault_in_pages_writeable() for pre-faulting of user's stack.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Link: https://lkml.kernel.org/r/alpine.LSU.2.11.1905251033230.1112@eggly.anvils
[bigeasy: patch description]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 arch/x86/kernel/fpu/signal.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 5a8d118bc423e..060d6188b4533 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -5,6 +5,7 @@
 
 #include <linux/compat.h>
 #include <linux/cpu.h>
+#include <linux/pagemap.h>
 
 #include <asm/fpu/internal.h>
 #include <asm/fpu/signal.h>
@@ -189,15 +190,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
 	fpregs_unlock();
 
 	if (ret) {
-		int aligned_size;
-		int nr_pages;
-
-		aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size;
-		nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE);
-
-		ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages,
-					      NULL, FOLL_WRITE);
-		if (ret == nr_pages)
+		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
 			goto retry;
 		return -EFAULT;
 	}
-- 
2.20.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86/fpu: Use fault_in_pages_writeable() for pre-faulting
  2019-05-26 17:33 [PATCH] x86/fpu: Use fault_in_pages_writeable() for pre-faulting Sebastian Andrzej Siewior
@ 2019-05-26 17:35 ` Sebastian Andrzej Siewior
  2019-05-26 19:25     ` Hugh Dickins
  2019-05-29  4:18 ` Andrew Morton
  1 sibling, 1 reply; 27+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-05-26 17:35 UTC (permalink / raw)
  To: Hugh Dickins, x86
  Cc: Andrew Morton, Mike Rapoport, Andrea Arcangeli, linux-mm,
	linux-kernel, Borislav Petkov, Pavel Machek, Dave Hansen

On 2019-05-26 19:33:25 [+0200], To Hugh Dickins wrote:
> From: Hugh Dickins <hughd@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>

Hugh, I took your patch, slapped a signed-off-by line. Please say that
you are fine with it (or object otherwise).

Sebastian

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86/fpu: Use fault_in_pages_writeable() for pre-faulting
  2019-05-26 17:35 ` Sebastian Andrzej Siewior
@ 2019-05-26 19:25     ` Hugh Dickins
  0 siblings, 0 replies; 27+ messages in thread
From: Hugh Dickins @ 2019-05-26 19:25 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Hugh Dickins, x86, Andrew Morton, Mike Rapoport,
	Andrea Arcangeli, linux-mm, linux-kernel, Borislav Petkov,
	Pavel Machek, Dave Hansen

[-- Attachment #1: Type: TEXT/PLAIN, Size: 709 bytes --]

On Sun, 26 May 2019, Sebastian Andrzej Siewior wrote:
> On 2019-05-26 19:33:25 [+0200], To Hugh Dickins wrote:
> From: Hugh Dickins <hughd@google.com>
> …
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> 
> Hugh, I took your patch, slapped a signed-off-by line. Please say that
> you are fine with it (or object otherwise).

I'm fine with it, thanks Sebastian. Sorry if I wasted your time by not
giving it my sign-off in the first place, but I was not comfortable to
dabble there without your sign-off too - which it now has. (And thought
you might already have your own version anyway: just provided mine as
illustration, so that we could be sure of exactly what I'd been testing.)

Hugh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86/fpu: Use fault_in_pages_writeable() for pre-faulting
@ 2019-05-26 19:25     ` Hugh Dickins
  0 siblings, 0 replies; 27+ messages in thread
From: Hugh Dickins @ 2019-05-26 19:25 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Hugh Dickins, x86, Andrew Morton, Mike Rapoport,
	Andrea Arcangeli, linux-mm, linux-kernel, Borislav Petkov,
	Pavel Machek, Dave Hansen

[-- Attachment #1: Type: TEXT/PLAIN, Size: 709 bytes --]

On Sun, 26 May 2019, Sebastian Andrzej Siewior wrote:
> On 2019-05-26 19:33:25 [+0200], To Hugh Dickins wrote:
> From: Hugh Dickins <hughd@google.com>
> …
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> 
> Hugh, I took your patch, slapped a signed-off-by line. Please say that
> you are fine with it (or object otherwise).

I'm fine with it, thanks Sebastian. Sorry if I wasted your time by not
giving it my sign-off in the first place, but I was not comfortable to
dabble there without your sign-off too - which it now has. (And thought
you might already have your own version anyway: just provided mine as
illustration, so that we could be sure of exactly what I'd been testing.)

Hugh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
  2019-05-25 18:09                 ` Hugh Dickins
  (?)
@ 2019-05-26 19:36                 ` Sebastian Andrzej Siewior
  2019-05-26 20:17                     ` Hugh Dickins
  -1 siblings, 1 reply; 27+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-05-26 19:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mike Rapoport, Andrea Arcangeli, linux-mm,
	linux-kernel, Borislav Petkov, Pavel Machek, Dave Hansen

On 2019-05-25 11:09:15 [-0700], Hugh Dickins wrote:
> On Sat, 25 May 2019, Sebastian Andrzej Siewior wrote:
> > On 2019-05-24 15:22:51 [-0700], Hugh Dickins wrote:
> > > I've now run a couple of hours of load successfully with Mike's patch
> > > to GUP, no problem; but whatever the merits of that patch in general,
> > > I agree with Andrew that fault_in_pages_writeable() seems altogether
> > > more appropriate for copy_fpstate_to_sigframe(), and have now run a
> > > couple of hours of load successfully with this instead (rewrite to taste):
> > 
> > so this patch instead of Mike's GUP patch fixes the issue you observed?
> 
> Yes.
> 
> > Is this just a taste question or limitation of the function in general?
> 
> I'd say it's just a taste question. Though the the fact that your
> usage showed up a bug in the get_user_pages_unlocked() implementation,
> demanding a fix, does indicate that it's a more fragile and complex
> route, better avoided if there's a good simple alternative. If it were
> not already on your slowpath, I'd also argue fault_in_pages_writeable()
> is a more efficient way to do it.

Okay. The GUP functions are not properly documented for my taste. There
is no indication whether or not the mm_sem has to be acquired prior
invoking it. Following the call chain of get_user_pages() I ended up in
__get_user_pages_locked() `locked = NULL' indicated that mm_sem is no
acquired and then I saw this:
|                 if (!locked)
|                         /* VM_FAULT_RETRY couldn't trigger, bypass */
|                         return ret;

kind of suggesting that it is okay to invoke it without holding the
mm_sem prefault. It passed a few tests and then
	https://lkml.kernel.org/r/1556657902.6132.13.camel@lca.pw

happened. After that, I switched to the locked variant and the problem
disappeared (also I noticed that MPX code is invoked within ->mmap()).

> > I'm asking because it has been suggested and is used in MPX code (in the
> > signal path but .mmap) and I'm not aware of any limitation. But as I
> > wrote earlier to akpm, if the MM folks suggest to use this instead I am
> > happy to switch.
> 
> I know nothing of MPX, beyond that Dave Hansen has posted patches to
> remove that support entirely, so I'm surprised arch/x86/mm/mpx.c is
> still in the tree.
I need to poke at that. I has been removed but then KVM folks complained
that they kind of depend on that if it has been exposed to the guest. We
need to fade it out slowly…

>                    But peering at it now, it looks as if it's using
> get_user_pages() while holding mmap_sem, whereas you (sensibly enough)
> used get_user_pages_unlocked() to handle the mmap_sem for you -
> the trouble with that is that since it knows it's in control of
> mmap_sem, it feels free to drop it internally, and that takes it
> down the path of the premature return when pages NULL that Mike is
> fixing. MPX's get_user_pages() is not free to go that way.
oki.

> > > --- 5.2-rc1/arch/x86/kernel/fpu/signal.c
> > > +++ linux/arch/x86/kernel/fpu/signal.c
> > > @@ -3,6 +3,7 @@
> > >   * FPU signal frame handling routines.
> > >   */
> > >  
> > > +#include <linux/pagemap.h>
> > >  #include <linux/compat.h>
> > >  #include <linux/cpu.h>
> > >  
> > > @@ -189,15 +190,7 @@ retry:
> > >  	fpregs_unlock();
> > >  
> > >  	if (ret) {
> > > -		int aligned_size;
> > > -		int nr_pages;
> > > -
> > > -		aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size;
> > > -		nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE);
> > > -
> > > -		ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages,
> > > -					      NULL, FOLL_WRITE);
> > > -		if (ret == nr_pages)
> > > +		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
> > >  			goto retry;
> > >  		return -EFAULT;
> > >  	}
> > > 
> > > (I did wonder whether there needs to be an access_ok() check on buf_fx;
> > > but if so, then I think it would already have been needed before the
> > > earlier copy_fpregs_to_sigframe(); but I didn't get deep enough into
> > > that to be sure, nor into whether access_ok() check on buf covers buf_fx.)
> > 
> > There is an access_ok() at the begin of copy_fpregs_to_sigframe(). The
> > memory is allocated from user's stack and there is (later) an
> > access_ok() for the whole region (which can be more than the memory used
> > by the FPU code).
> 
> Yes, but remember I know nothing of this FPU signal code, so I cannot
> tell whether an access_ok(buf, size) is good enough to cover the range
> of an access_ok(buf_fx, fpu_user_xstate_size).

yes, because size >= fpu_user_xstate_size

> Your "(later)" worries me a little - I hope you're not writing first
> and checking the limits later; but what you're doing may be perfectly
> correct, I'm just too far from understanding the details to say; but
> raised the matter because (I think) get_user_pages_unlocked() would
> entail an access_ok() check where fault_in_pages_writable() would not.

no, we first check the range and then write. It is later checked again
after the size has been extended.

> Hugh

Sebastian

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
  2019-05-26 19:36                 ` Sebastian Andrzej Siewior
@ 2019-05-26 20:17                     ` Hugh Dickins
  0 siblings, 0 replies; 27+ messages in thread
From: Hugh Dickins @ 2019-05-26 20:17 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Hugh Dickins, Andrew Morton, Mike Rapoport, Andrea Arcangeli,
	linux-mm, linux-kernel, Borislav Petkov, Pavel Machek,
	Dave Hansen

On Sun, 26 May 2019, Sebastian Andrzej Siewior wrote:
> 
> Okay. The GUP functions are not properly documented for my taste. There
> is no indication whether or not the mm_sem has to be acquired prior
> invoking it. Following the call chain of get_user_pages() I ended up in
> __get_user_pages_locked() `locked = NULL' indicated that mm_sem is no
> acquired and then I saw this:
> |                 if (!locked)
> |                         /* VM_FAULT_RETRY couldn't trigger, bypass */
> |                         return ret;
> 
> kind of suggesting that it is okay to invoke it without holding the
> mm_sem prefault. It passed a few tests and then
> 	https://lkml.kernel.org/r/1556657902.6132.13.camel@lca.pw
> 
> happened. After that, I switched to the locked variant and the problem
> disappeared (also I noticed that MPX code is invoked within ->mmap()).

Ah, I wasn't following what happened here while it was in linux-next.

I certainly agree that all the variants are very confusing, and the
matter of mmap_sem particularly so. Because it used to be a simple
rule that you needed to hold mmap_sem to call get_user_pages(); but
that can be so contended that get_user_pages_fast(), and VM_FAULT_RETRY,
optimizations came in, then interfaces like get_user_pages_locked() and
get_user_pages_unlocked().

I think when you say that you "switched to the locked variant", you're
writing of when you switched to using get_user_pages_unlocked(), which
takes and releases the mmap_sem itself: yes, using get_user_pages()
without holding mmap_sem would have been open to serious errors.

The documentation in comments above get_user_pages_locked() and
get_user_pages_unlocked() looks rather good to me, actually.

But this is all good reason to use the less challenging
fault_in_pages_writable() instead. (And also saves all those GUP
developers, who from time to time have to search through all users
of GUP in one form or another, to make this or that improvement,
from having to visit arch/x86/kernel/fpu/signal.c: they will
quietly thank you.)

Hugh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
@ 2019-05-26 20:17                     ` Hugh Dickins
  0 siblings, 0 replies; 27+ messages in thread
From: Hugh Dickins @ 2019-05-26 20:17 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Hugh Dickins, Andrew Morton, Mike Rapoport, Andrea Arcangeli,
	linux-mm, linux-kernel, Borislav Petkov, Pavel Machek,
	Dave Hansen

On Sun, 26 May 2019, Sebastian Andrzej Siewior wrote:
> 
> Okay. The GUP functions are not properly documented for my taste. There
> is no indication whether or not the mm_sem has to be acquired prior
> invoking it. Following the call chain of get_user_pages() I ended up in
> __get_user_pages_locked() `locked = NULL' indicated that mm_sem is no
> acquired and then I saw this:
> |                 if (!locked)
> |                         /* VM_FAULT_RETRY couldn't trigger, bypass */
> |                         return ret;
> 
> kind of suggesting that it is okay to invoke it without holding the
> mm_sem prefault. It passed a few tests and then
> 	https://lkml.kernel.org/r/1556657902.6132.13.camel@lca.pw
> 
> happened. After that, I switched to the locked variant and the problem
> disappeared (also I noticed that MPX code is invoked within ->mmap()).

Ah, I wasn't following what happened here while it was in linux-next.

I certainly agree that all the variants are very confusing, and the
matter of mmap_sem particularly so. Because it used to be a simple
rule that you needed to hold mmap_sem to call get_user_pages(); but
that can be so contended that get_user_pages_fast(), and VM_FAULT_RETRY,
optimizations came in, then interfaces like get_user_pages_locked() and
get_user_pages_unlocked().

I think when you say that you "switched to the locked variant", you're
writing of when you switched to using get_user_pages_unlocked(), which
takes and releases the mmap_sem itself: yes, using get_user_pages()
without holding mmap_sem would have been open to serious errors.

The documentation in comments above get_user_pages_locked() and
get_user_pages_unlocked() looks rather good to me, actually.

But this is all good reason to use the less challenging
fault_in_pages_writable() instead. (And also saves all those GUP
developers, who from time to time have to search through all users
of GUP in one form or another, to make this or that improvement,
from having to visit arch/x86/kernel/fpu/signal.c: they will
quietly thank you.)

Hugh


^ permalink raw reply	[flat|nested] 27+ messages in thread

* My emacs problem -- was Re: [PATCH] x86/fpu: Use fault_in_pages_writeable() for pre-faulting
  2019-05-26 19:25     ` Hugh Dickins
  (?)
@ 2019-05-28 11:54     ` Pavel Machek
  -1 siblings, 0 replies; 27+ messages in thread
From: Pavel Machek @ 2019-05-28 11:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Sebastian Andrzej Siewior, x86, Andrew Morton, Mike Rapoport,
	Andrea Arcangeli, linux-mm, linux-kernel, Borislav Petkov,
	Dave Hansen

[-- Attachment #1: Type: text/plain, Size: 5158 bytes --]

Hi!

On Sun 2019-05-26 12:25:27, Hugh Dickins wrote:
> On Sun, 26 May 2019, Sebastian Andrzej Siewior wrote:
> > On 2019-05-26 19:33:25 [+0200], To Hugh Dickins wrote:
> > From: Hugh Dickins <hughd@google.com>
> > …
> > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > 
> > Hugh, I took your patch, slapped a signed-off-by line. Please say that
> > you are fine with it (or object otherwise).
> 
> I'm fine with it, thanks Sebastian. Sorry if I wasted your time by not
> giving it my sign-off in the first place, but I was not comfortable to
> dabble there without your sign-off too - which it now has. (And thought
> you might already have your own version anyway: just provided mine as
> illustration, so that we could be sure of exactly what I'd been testing.)

I applied Hugh's patch on top of -rc2, but still get emacs problems:

But this time I'm not sure if it is same emacs problem or different
emacs problem....

X protocol error: BadValue (integer parameter out of range for
operation) on protocol request 139
When compiled with GTK, Emacs cannot recover from X disconnects.
This is a GTK bug: https://bugzilla.gnome.org/show_bug.cgi?id=85715
For details, see etc/PROBLEMS.

(emacs:8175): GLib-WARNING **: g_main_context_prepare() called
recursively from within a source's check() or prepare() member.

(emacs:8175): GLib-WARNING **: g_main_context_check() called
recursively from within a source's check() or prepare() member.
Fatal error 6: Aborted
Backtrace:
emacs[0x8138719]
emacs[0x8120446]
emacs[0x813875c]
emacs[0x80f54c0]
emacs[0x80f6f3f]
emacs[0x80f6fab]
/usr/lib/i386-linux-gnu/libX11.so.6(_XError+0x11a)[0xf6ea1b3a]
/usr/lib/i386-linux-gnu/libX11.so.6(+0x39b5b)[0xf6e9eb5b]
/usr/lib/i386-linux-gnu/libX11.so.6(+0x39c26)[0xf6e9ec26]
/usr/lib/i386-linux-gnu/libX11.so.6(_XEventsQueued+0x6e)[0xf6e9f4be]
/usr/lib/i386-linux-gnu/libX11.so.6(XPending+0x62)[0xf6e90752]
/usr/lib/i386-linux-gnu/libgdk-3.so.0(+0x48073)[0xf7566073]
/lib/i386-linux-gnu/libglib-2.0.so.0(g_main_context_prepare+0x17b)[0xf70244fb]
/lib/i386-linux-gnu/libglib-2.0.so.0(+0x46f74)[0xf7024f74]
/lib/i386-linux-gnu/libglib-2.0.so.0(g_main_context_pending+0x34)[0xf7025144]
/usr/lib/i386-linux-gnu/libgtk-3.so.0(gtk_events_pending+0x1f)[0xf77c9a8f]
emacs[0x80f55a9]
emacs[0x812714f]
emacs[0x8126a95]
emacs[0x8172db9]
emacs[0x8192bd7]
emacs[0x819312d]
emacs[0x8125634]
emacs[0x8125c6d]
emacs[0x812725b]
emacs[0x8129eaa]
emacs[0x81c7c90]
emacs[0x8127815]
emacs[0x812ada3]
emacs[0x812bdad]
emacs[0x812d838]
emacs[0x818b76c]
emacs[0x8120890]
emacs[0x818b66b]
emacs[0x8124b84]
emacs[0x8124e3f]
emacs[0x8059cb0]
/lib/i386-linux-gnu/i686/cmov/libc.so.6(__libc_start_main+0xf3)[0xf61a7a63]
emacs[0x805a76f]
Aborted (core dumped)

Best regards,
									Pavel


commit 018c9da72adf920efd0ba250fcf433b836d3cfbc
Author: Hugh Dickins <hughd@google.com>
Date:   Sun May 26 19:33:25 2019 +0200

    x86/fpu: Use fault_in_pages_writeable() for pre-faulting
    
    Since commit
    
       d9c9ce34ed5c8 ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
    
    we use get_user_pages_unlocked() to pre-faulting user's memory if a
    write generates a page fault while the handler is disabled.
    This works in general and uncovered a bug as reported by Mike Rapoport.
    
    It has been pointed out that this function may be fragile and a
    simple pre-fault as in fault_in_pages_writeable() would be a better
    solution. Better as in taste and simplicity: That write (as performed by
    the alternative function) performs exactly the same faulting of memory
    that we had before. This was suggested by Hugh Dickins and Andrew
    Morton.
    
    Use fault_in_pages_writeable() for pre-faulting of user's stack.
    
    Suggested-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Link: https://lkml.kernel.org/r/alpine.LSU.2.11.1905251033230.1112@eggly.anvils
    [bigeasy: patch description]
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 5a8d118..060d618 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -5,6 +5,7 @@
 
 #include <linux/compat.h>
 #include <linux/cpu.h>
+#include <linux/pagemap.h>
 
 #include <asm/fpu/internal.h>
 #include <asm/fpu/signal.h>
@@ -189,15 +190,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
 	fpregs_unlock();
 
 	if (ret) {
-		int aligned_size;
-		int nr_pages;
-
-		aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size;
-		nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE);
-
-		ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages,
-					      NULL, FOLL_WRITE);
-		if (ret == nr_pages)
+		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
 			goto retry;
 		return -EFAULT;
 	}



-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86/fpu: Use fault_in_pages_writeable() for pre-faulting
  2019-05-26 17:33 [PATCH] x86/fpu: Use fault_in_pages_writeable() for pre-faulting Sebastian Andrzej Siewior
  2019-05-26 17:35 ` Sebastian Andrzej Siewior
@ 2019-05-29  4:18 ` Andrew Morton
  2019-05-29  7:25   ` [PATCH v2] " Sebastian Andrzej Siewior
  1 sibling, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2019-05-29  4:18 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Hugh Dickins, x86, Mike Rapoport, Andrea Arcangeli, linux-mm,
	linux-kernel, Borislav Petkov, Pavel Machek, Dave Hansen

On Sun, 26 May 2019 19:33:25 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> From: Hugh Dickins <hughd@google.com>
> 
> Since commit
> 
>    d9c9ce34ed5c8 ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")

Please add this as a

Fixes: d9c9ce34ed5c8 ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")

line so that anyone who backports d9c9ce34ed5c8 has a chance of finding
this patch also.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v2] x86/fpu: Use fault_in_pages_writeable() for pre-faulting
  2019-05-29  4:18 ` Andrew Morton
@ 2019-05-29  7:25   ` Sebastian Andrzej Siewior
  2019-05-14 14:29     ` [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults Mike Rapoport
  2019-05-29 21:29     ` [PATCH v2] " Chris Wilson
  0 siblings, 2 replies; 27+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-05-29  7:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, x86, Mike Rapoport, Andrea Arcangeli, linux-mm,
	linux-kernel, Borislav Petkov, Pavel Machek, Dave Hansen

From: Hugh Dickins <hughd@google.com>

Since commit

   d9c9ce34ed5c8 ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")

we use get_user_pages_unlocked() to pre-faulting user's memory if a
write generates a pagefault while the handler is disabled.
This works in general and uncovered a bug as reported by Mike Rapoport.
It has been pointed out that this function may be fragile and a
simple pre-fault as in fault_in_pages_writeable() would be a better
solution. Better as in taste and simplicity: That write (as performed by
the alternative function) performs exactly the same faulting of memory
that we had before. This was suggested by Hugh Dickins and Andrew
Morton.

Use fault_in_pages_writeable() for pre-faulting of user's stack.

Fixes: d9c9ce34ed5c8 ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
[bigeasy: patch description]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
v1…v2: Added a Fixes tag.

 arch/x86/kernel/fpu/signal.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 5a8d118bc423e..060d6188b4533 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -5,6 +5,7 @@
 
 #include <linux/compat.h>
 #include <linux/cpu.h>
+#include <linux/pagemap.h>
 
 #include <asm/fpu/internal.h>
 #include <asm/fpu/signal.h>
@@ -189,15 +190,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
 	fpregs_unlock();
 
 	if (ret) {
-		int aligned_size;
-		int nr_pages;
-
-		aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size;
-		nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE);
-
-		ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages,
-					      NULL, FOLL_WRITE);
-		if (ret == nr_pages)
+		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
 			goto retry;
 		return -EFAULT;
 	}
-- 
2.20.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] x86/fpu: Use fault_in_pages_writeable() for pre-faulting
  2019-05-29  7:25   ` [PATCH v2] " Sebastian Andrzej Siewior
  2019-05-14 14:29     ` [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults Mike Rapoport
@ 2019-05-29 21:29     ` Chris Wilson
  1 sibling, 0 replies; 27+ messages in thread
From: Chris Wilson @ 2019-05-29 21:29 UTC (permalink / raw)
  To: Andrew Morton, Sebastian Andrzej Siewior
  Cc: Hugh Dickins, x86, Mike Rapoport, Andrea Arcangeli, linux-mm,
	linux-kernel, Borislav Petkov, Pavel Machek, Dave Hansen

Quoting Sebastian Andrzej Siewior (2019-05-29 08:25:40)
> From: Hugh Dickins <hughd@google.com>
> 
> Since commit
> 
>    d9c9ce34ed5c8 ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
> 
> we use get_user_pages_unlocked() to pre-faulting user's memory if a
> write generates a pagefault while the handler is disabled.
> This works in general and uncovered a bug as reported by Mike Rapoport.
> It has been pointed out that this function may be fragile and a
> simple pre-fault as in fault_in_pages_writeable() would be a better
> solution. Better as in taste and simplicity: That write (as performed by
> the alternative function) performs exactly the same faulting of memory
> that we had before. This was suggested by Hugh Dickins and Andrew
> Morton.
> 
> Use fault_in_pages_writeable() for pre-faulting of user's stack.
> 
> Fixes: d9c9ce34ed5c8 ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> [bigeasy: patch description]
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

I am able to reliably hit the bug here by putting the system under
mempressure, and afterwards processes would die as the exit. This patch
also greatly reduces cycletest latencies while under that mempressure,
~320ms -> ~16ms (on a bxt while also spinning on i915.ko).

Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [tip:x86/urgent] x86/fpu: Use fault_in_pages_writeable() for pre-faulting
  2019-05-14 14:29     ` [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults Mike Rapoport
                         ` (2 preceding siblings ...)
  2019-05-22 19:21       ` Andrew Morton
@ 2019-06-06 17:25       ` tip-bot for Hugh Dickins
  3 siblings, 0 replies; 27+ messages in thread
From: tip-bot for Hugh Dickins @ 2019-06-06 17:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: akpm, pavel, jannh, hughd, dave.hansen, x86, mingo, rppt, tglx,
	linux-mm, bp, mingo, chris, linux-kernel, riel, hpa, bigeasy,
	aarcange

Commit-ID:  b81ff1013eb8eef2934ca7e8cf53d553c1029e84
Gitweb:     https://git.kernel.org/tip/b81ff1013eb8eef2934ca7e8cf53d553c1029e84
Author:     Hugh Dickins <hughd@google.com>
AuthorDate: Wed, 29 May 2019 09:25:40 +0200
Committer:  Borislav Petkov <bp@suse.de>
CommitDate: Thu, 6 Jun 2019 19:15:17 +0200

x86/fpu: Use fault_in_pages_writeable() for pre-faulting

Since commit

   d9c9ce34ed5c8 ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")

get_user_pages_unlocked() pre-faults user's memory if a write generates
a page fault while the handler is disabled.

This works in general and uncovered a bug as reported by Mike
Rapoport¹. It has been pointed out that this function may be fragile
and a simple pre-fault as in fault_in_pages_writeable() would be a
better solution. Better as in taste and simplicity: that write (as
performed by the alternative function) performs exactly the same
faulting of memory as before. This was suggested by Hugh Dickins and
Andrew Morton.

Use fault_in_pages_writeable() for pre-faulting user's stack.

  [ bigeasy: Write commit message. ]
  [ bp: Massage some. ]

¹ https://lkml.kernel.org/r/1557844195-18882-1-git-send-email-rppt@linux.ibm.com

Fixes: d9c9ce34ed5c8 ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: linux-mm <linux-mm@kvack.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190529072540.g46j4kfeae37a3iu@linutronix.de
Link: https://lkml.kernel.org/r/1557844195-18882-1-git-send-email-rppt@linux.ibm.com
---
 arch/x86/kernel/fpu/signal.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 5a8d118bc423..060d6188b453 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -5,6 +5,7 @@
 
 #include <linux/compat.h>
 #include <linux/cpu.h>
+#include <linux/pagemap.h>
 
 #include <asm/fpu/internal.h>
 #include <asm/fpu/signal.h>
@@ -189,15 +190,7 @@ retry:
 	fpregs_unlock();
 
 	if (ret) {
-		int aligned_size;
-		int nr_pages;
-
-		aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size;
-		nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE);
-
-		ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages,
-					      NULL, FOLL_WRITE);
-		if (ret == nr_pages)
+		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
 			goto retry;
 		return -EFAULT;
 	}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
  2019-05-22 21:18           ` Andrew Morton
@ 2019-06-22 17:51             ` Andrea Arcangeli
  0 siblings, 0 replies; 27+ messages in thread
From: Andrea Arcangeli @ 2019-06-22 17:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Rapoport, linux-mm, linux-kernel, Sebastian Andrzej Siewior,
	Borislav Petkov, Dr. David Alan Gilbert, kvm

Hello everyone,

On Wed, May 22, 2019 at 02:18:03PM -0700, Andrew Morton wrote:
> > arch/x86/kernel/fpu/signal.c:198:8-31:  -> gup with !pages

This simply had not to return -EFAULT if ret < nr_pages.. but ret >= 0.

Instead it did:

               if (ret == nr_pages)
                       goto retry;
               return -EFAULT;

That was the bug and the correct code would have been:

    	    ret = get_user_pages_unlocked(pages=NULL)
	    if (ret < 0)
	       return -EFAULT;
	    goto retry;

This eventually should have worked fine but it was less efficient
because it's still acting in a full prefault mode and it just tells
GUP that pages = NULL and so all it is trying to do is to issue the
blocking I/O after the mmap_sem has been released already.

Overall the solution applied in commit
b81ff1013eb8eef2934ca7e8cf53d553c1029e84 looks nicer.

Alternatively it could have used down_read(); get_user_pages(); which
prevents get_user_pages to drop the mmap_sem and break the loop if
some blocking I/O had to be executed outside mmap_sem. But that would
have the side effect of breaking userfaultfd (uffd requires
gup_locked/unlocked and FAULT_FLAG_ALLOW_RETRY to be set in the fault
flags).

Eventually we need to allow VM_FAULT_RETRY to be returned even if
FOLL_TRIED is set, so in theory get_user_pages_unlocked(pages=NULL) in
a loop must eventually stop returning VM_FAULT_RETRY. FOLL_TRIED could
still disambiguate if VM_FAULT_RETRY should or should not be returned
so that it is only returned only if it cannnot be avoided
(i.e. userfaultfd case).

With gup_unlocked(pages=NULL) however all we are interested about is
to execute the blocking I/O and we don't care to map anything in the
pagetables. A later page fault has to happen anyway for sure because
pages was == NULL, it just needs to be a fast one.

> > arch/x86/mm/mpx.c:423:11-25:  -> gup with !pages

Note that get_user_pages is never affected by whatever change after
the below, !locked check in gup_locked:

		if (!locked)
			/* VM_FAULT_RETRY couldn't trigger, bypass */
			return ret;

The bypass means when locked is NULL, there is a 1:1 bypass from
__get_user_pages<->get_user_pages and the VM_FAULT_RETRY dance never
runs.

get_user_pages in fact can't support userfaultfd, which makes ptrace
and core dump and the hwpoison non blocking in VM_FAULT_RETRY.

All places that must support userfaultfd must use
get_user_pages_unlocked/locked or somehow end up with
FAULT_FLAG_ALLOW_RETRY set in the fault flags.

> > virt/kvm/async_pf.c:90:1-22:  -> gup with !pages

Didn't this get slowed down with the commit
df17277b2a85c00f5710e33ce238ba4114687a28?

I mean it was a feature not a bug to skip that additional
__get_user_pages(FOLL_TRIED).

> > virt/kvm/kvm_main.c:1437:6-20:  -> gup with !pages

Like for mpx.c get_user_pages is agnostic to all these gup_locked
changes because it sets locked = NULL, it couldn't break the loop
early because it couldn't return VM_FAULT_RETRY.

> 
> OK.

Commit df17277b2a85c00f5710e33ce238ba4114687a28 is now applied.

So I think the effect it has is to make async_pf.c slower and we
didn't solve anything.

There are two __get_user_pages:

1)		ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
				       vmas, locked);


		if (called by get_user_pages)
		    return ret; /* bypass the whole VM_FAULT_RETRY logic */


		*locked = 1;
		lock_dropped = true;
		down_read(&mm->mmap_sem);
2)		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
				       pages, NULL, NULL);


The problem introduced is that 2) is getting executed with pages==NULL
but there's no point to ever run 2) with pages = NULL.

async_pf especially uses nr_pages == 1, so it couldn't get any more
optimal than it already was.

Before df17277b2a85c00f5710e33ce238ba4114687a28 we broke the loop as
soon as the first __get_user_pages returned VM_FAULT_RETRY.

We can argue if we shouldn't have broken the loop and we should have
kept executing only the first __get_user_pages (marked "1)" above) for
the whole range, but nr_pages == 1 is common and in such case there's
no difference between the two behaviors.

The prefetch callers with nr_pages == 1, didn't even check the retval
at all:

	down_read(&mm->mmap_sem);
	get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE, NULL, NULL,
			&locked);                            ^^^^ pages NULL

	// retval ignored

It should probably check for retval < 0... but the fault will be
retried for good later still with get_user_pages_unlocked() but with
pages != NULL, so it'll find out later if it's a segfault.

Now if we change the code to skip the second __get_user_pages it's not
clear if we can return nr_pages because we may still not have faulted
in the whole range in the pagetables. I guess we could still return
nr_pages even if we scanned the whole range with only the first of the
two __get_user_pages. However if you had mmu notifier registered in
the range nr_pages would have stronger semantics if you would execute
2) too, but then without pages array not-NULL such stronger semantics
cannot be taken advantage of anyway, because you don't know where
those pages are and you can't map them in a secondary MMU even if you
execute the line 2).

I personally preferred the older code which should at least in theory
run faster, it just required documentation that if "pages == NULL"
we'll break the loop early because it has to be a "prefetch" attempt
and it must be retried until nr_pages == ret.

Either that or add a "continue" to skip the second __get_user_pages in
line 2) above and then returning nr_pages to indicate VM_FAULT_RETRY
may very well have been returned on all page offsets in the virtual
range. That will behave the same for async_pf.c because nr_pages == 1.

When VM_FAULT_RETRY is returned all I/O should be complete (no matter
if network or disk with userfaultfd or just pagecache readpage on
filebacked kernel faults) and only a minor fault is required to obtain
the page. But it is better to defer that second minor fault to the
point where "pages" already become != NULL, so we end up calling
__get_user_pages 2 times instead of 3 times.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2019-06-22 17:51 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-26 17:33 [PATCH] x86/fpu: Use fault_in_pages_writeable() for pre-faulting Sebastian Andrzej Siewior
2019-05-26 17:35 ` Sebastian Andrzej Siewior
2019-05-26 19:25   ` Hugh Dickins
2019-05-26 19:25     ` Hugh Dickins
2019-05-28 11:54     ` My emacs problem -- was " Pavel Machek
2019-05-29  4:18 ` Andrew Morton
2019-05-29  7:25   ` [PATCH v2] " Sebastian Andrzej Siewior
2019-05-14 14:29     ` [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults Mike Rapoport
2019-05-16 16:25       ` Andrei Vagin
2019-05-16 16:25         ` Andrei Vagin
2019-05-21 15:53       ` Mike Rapoport
2019-05-22 19:21       ` Andrew Morton
2019-05-22 19:43         ` Sebastian Andrzej Siewior
2019-05-24 22:22           ` Hugh Dickins
2019-05-24 22:22             ` Hugh Dickins
2019-05-25  8:45             ` Sebastian Andrzej Siewior
2019-05-25 18:09               ` Hugh Dickins
2019-05-25 18:09                 ` Hugh Dickins
2019-05-26 19:36                 ` Sebastian Andrzej Siewior
2019-05-26 20:17                   ` Hugh Dickins
2019-05-26 20:17                     ` Hugh Dickins
2019-05-26 17:25             ` Pavel Machek
2019-05-22 20:38         ` Mike Rapoport
2019-05-22 21:18           ` Andrew Morton
2019-06-22 17:51             ` Andrea Arcangeli
2019-06-06 17:25       ` [tip:x86/urgent] x86/fpu: Use fault_in_pages_writeable() for pre-faulting tip-bot for Hugh Dickins
2019-05-29 21:29     ` [PATCH v2] " Chris Wilson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.