mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
diff mbox series

Message ID 1557844195-18882-1-git-send-email-rppt@linux.ibm.com
State Accepted
Commit df17277b2a85c00f5710e33ce238ba4114687a28
Headers show
Series
  • mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
Related show

Commit Message

Mike Rapoport May 14, 2019, 2:29 p.m. UTC
When get_user_pages*() is called with pages = NULL, the processing of
VM_FAULT_RETRY terminates early without actually retrying to fault-in all
the pages.

If the pages in the requested range belong to a VMA that has userfaultfd
registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
has populated the page, but for the gup pre-fault case there's no actual
retry and the caller will get no pages although they are present.

This issue was uncovered when running post-copy memory restore in CRIU
after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
copy_fpstate_to_sigframe() fails").

After this change, the copying of FPU state to the sigframe switched from
copy_to_user() variants which caused a real page fault to get_user_pages()
with pages parameter set to NULL.

In post-copy mode of CRIU, the destination memory is managed with
userfaultfd and lack of the retry for pre-fault case in get_user_pages()
causes a crash of the restored process.

Making the pre-fault behavior of get_user_pages() the same as the "normal"
one fixes the issue.

Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 mm/gup.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

Comments

Andrei Vagin May 16, 2019, 4:25 p.m. UTC | #1
On Tue, May 14, 2019 at 7:32 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> When get_user_pages*() is called with pages = NULL, the processing of
> VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> the pages.
>
> If the pages in the requested range belong to a VMA that has userfaultfd
> registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> has populated the page, but for the gup pre-fault case there's no actual
> retry and the caller will get no pages although they are present.
>
> This issue was uncovered when running post-copy memory restore in CRIU
> after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> copy_fpstate_to_sigframe() fails").
>
> After this change, the copying of FPU state to the sigframe switched from
> copy_to_user() variants which caused a real page fault to get_user_pages()
> with pages parameter set to NULL.
>
> In post-copy mode of CRIU, the destination memory is managed with
> userfaultfd and lack of the retry for pre-fault case in get_user_pages()
> causes a crash of the restored process.
>
> Making the pre-fault behavior of get_user_pages() the same as the "normal"
> one fixes the issue.
>

Tested-by: Andrei Vagin <avagin@gmail.com>

https://travis-ci.org/avagin/linux/builds/533184940

> Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  mm/gup.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 91819b8..c32ae5a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -936,10 +936,6 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>                         BUG_ON(ret >= nr_pages);
>                 }
>
> -               if (!pages)
> -                       /* If it's a prefault don't insist harder */
> -                       return ret;
> -
>                 if (ret > 0) {
>                         nr_pages -= ret;
>                         pages_done += ret;
> @@ -955,8 +951,12 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>                                 pages_done = ret;
>                         break;
>                 }
> -               /* VM_FAULT_RETRY triggered, so seek to the faulting offset */
> -               pages += ret;
> +               /*
> +                * VM_FAULT_RETRY triggered, so seek to the faulting offset.
> +                * For the prefault case (!pages) we only update counts.
> +                */
> +               if (likely(pages))
> +                       pages += ret;
>                 start += ret << PAGE_SHIFT;
>
>                 /*
> @@ -979,7 +979,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>                 pages_done++;
>                 if (!nr_pages)
>                         break;
> -               pages++;
> +               if (likely(pages))
> +                       pages++;
>                 start += PAGE_SIZE;
>         }
>         if (lock_dropped && *locked) {
> --
> 2.7.4
>
Mike Rapoport May 21, 2019, 3:53 p.m. UTC | #2
Hi,

Any comments on this?

On Tue, May 14, 2019 at 05:29:55PM +0300, Mike Rapoport wrote:
> When get_user_pages*() is called with pages = NULL, the processing of
> VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> the pages.
> 
> If the pages in the requested range belong to a VMA that has userfaultfd
> registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> has populated the page, but for the gup pre-fault case there's no actual
> retry and the caller will get no pages although they are present.
> 
> This issue was uncovered when running post-copy memory restore in CRIU
> after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> copy_fpstate_to_sigframe() fails").
> 
> After this change, the copying of FPU state to the sigframe switched from
> copy_to_user() variants which caused a real page fault to get_user_pages()
> with pages parameter set to NULL.
> 
> In post-copy mode of CRIU, the destination memory is managed with
> userfaultfd and lack of the retry for pre-fault case in get_user_pages()
> causes a crash of the restored process.
> 
> Making the pre-fault behavior of get_user_pages() the same as the "normal"
> one fixes the issue.
> 
> Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  mm/gup.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 91819b8..c32ae5a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -936,10 +936,6 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>  			BUG_ON(ret >= nr_pages);
>  		}
>  
> -		if (!pages)
> -			/* If it's a prefault don't insist harder */
> -			return ret;
> -
>  		if (ret > 0) {
>  			nr_pages -= ret;
>  			pages_done += ret;
> @@ -955,8 +951,12 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>  				pages_done = ret;
>  			break;
>  		}
> -		/* VM_FAULT_RETRY triggered, so seek to the faulting offset */
> -		pages += ret;
> +		/*
> +		 * VM_FAULT_RETRY triggered, so seek to the faulting offset.
> +		 * For the prefault case (!pages) we only update counts.
> +		 */
> +		if (likely(pages))
> +			pages += ret;
>  		start += ret << PAGE_SHIFT;
>  
>  		/*
> @@ -979,7 +979,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>  		pages_done++;
>  		if (!nr_pages)
>  			break;
> -		pages++;
> +		if (likely(pages))
> +			pages++;
>  		start += PAGE_SIZE;
>  	}
>  	if (lock_dropped && *locked) {
> -- 
> 2.7.4
>
Andrew Morton May 22, 2019, 7:21 p.m. UTC | #3
On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote:

> When get_user_pages*() is called with pages = NULL, the processing of
> VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> the pages.
> 
> If the pages in the requested range belong to a VMA that has userfaultfd
> registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> has populated the page, but for the gup pre-fault case there's no actual
> retry and the caller will get no pages although they are present.
> 
> This issue was uncovered when running post-copy memory restore in CRIU
> after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> copy_fpstate_to_sigframe() fails").
> 
> After this change, the copying of FPU state to the sigframe switched from
> copy_to_user() variants which caused a real page fault to get_user_pages()
> with pages parameter set to NULL.

You're saying that argument buf_fx in copy_fpstate_to_sigframe() is NULL?

If so was that expected by the (now cc'ed) developers of
d9c9ce34ed5c8923 ("x86/fpu: Fault-in user stack if
copy_fpstate_to_sigframe() fails")?

It seems rather odd.  copy_fpregs_to_sigframe() doesn't look like it's
expecting a NULL argument.

Also, I wonder if copy_fpstate_to_sigframe() would be better using
fault_in_pages_writeable() rather than get_user_pages_unlocked().  That
seems like it operates at a more suitable level and I guess it will fix
this issue also.

> In post-copy mode of CRIU, the destination memory is managed with
> userfaultfd and lack of the retry for pre-fault case in get_user_pages()
> causes a crash of the restored process.
> 
> Making the pre-fault behavior of get_user_pages() the same as the "normal"
> one fixes the issue.

Should this be backported into -stable trees?

> Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Sebastian Andrzej Siewior May 22, 2019, 7:43 p.m. UTC | #4
On 2019-05-22 12:21:13 [-0700], Andrew Morton wrote:
> On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote:
> 
> > When get_user_pages*() is called with pages = NULL, the processing of
> > VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> > the pages.
> > 
> > If the pages in the requested range belong to a VMA that has userfaultfd
> > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> > has populated the page, but for the gup pre-fault case there's no actual
> > retry and the caller will get no pages although they are present.
> > 
> > This issue was uncovered when running post-copy memory restore in CRIU
> > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> > copy_fpstate_to_sigframe() fails").
> > 
> > After this change, the copying of FPU state to the sigframe switched from
> > copy_to_user() variants which caused a real page fault to get_user_pages()
> > with pages parameter set to NULL.
> 
> You're saying that argument buf_fx in copy_fpstate_to_sigframe() is NULL?

buf_fx is user stack pointer and it should not be NULL.

> If so was that expected by the (now cc'ed) developers of
> d9c9ce34ed5c8923 ("x86/fpu: Fault-in user stack if
> copy_fpstate_to_sigframe() fails")?
> 
> It seems rather odd.  copy_fpregs_to_sigframe() doesn't look like it's
> expecting a NULL argument.

exactly, this is not expected.

> Also, I wonder if copy_fpstate_to_sigframe() would be better using
> fault_in_pages_writeable() rather than get_user_pages_unlocked().  That
> seems like it operates at a more suitable level and I guess it will fix
> this issue also.

It looks, like fault_in_pages_writeable() would work. If this is the
recommendation from the MM department than I can switch to that.

> > In post-copy mode of CRIU, the destination memory is managed with
> > userfaultfd and lack of the retry for pre-fault case in get_user_pages()
> > causes a crash of the restored process.
> > 
> > Making the pre-fault behavior of get_user_pages() the same as the "normal"
> > one fixes the issue.
> 
> Should this be backported into -stable trees?

Sebastian
Mike Rapoport May 22, 2019, 8:38 p.m. UTC | #5
(added kvm)

On Wed, May 22, 2019 at 12:21:13PM -0700, Andrew Morton wrote:
> On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote:
> 
> > When get_user_pages*() is called with pages = NULL, the processing of
> > VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> > the pages.
> > 
> > If the pages in the requested range belong to a VMA that has userfaultfd
> > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> > has populated the page, but for the gup pre-fault case there's no actual
> > retry and the caller will get no pages although they are present.
> > 
> > This issue was uncovered when running post-copy memory restore in CRIU
> > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> > copy_fpstate_to_sigframe() fails").
> > 
> > After this change, the copying of FPU state to the sigframe switched from
> > copy_to_user() variants which caused a real page fault to get_user_pages()
> > with pages parameter set to NULL.
> 
> You're saying that argument buf_fx in copy_fpstate_to_sigframe() is NULL?

Apparently I haven't explained well. The 'pages' parameter in the call to
get_user_pages_unlocked() is NULL.
 
> If so was that expected by the (now cc'ed) developers of
> d9c9ce34ed5c8923 ("x86/fpu: Fault-in user stack if
> copy_fpstate_to_sigframe() fails")?
> 
> It seems rather odd.  copy_fpregs_to_sigframe() doesn't look like it's
> expecting a NULL argument.
> 
> Also, I wonder if copy_fpstate_to_sigframe() would be better using
> fault_in_pages_writeable() rather than get_user_pages_unlocked().  That
> seems like it operates at a more suitable level and I guess it will fix
> this issue also.

If I understand correctly, one of the points of d9c9ce34ed5c8923 ("x86/fpu:
Fault-in user stack if copy_fpstate_to_sigframe() fails") was to to avoid
page faults, hence the use of get_user_pages().

With fault_in_pages_writeable() there might be a page fault, unless I've
completely mistaken.

Unrelated to copy_fpstate_to_sigframe(), the issue could happen if any call
to get_user_pages() with pages parameter set to NULL tries to access
userfaultfd-managed memory. Currently, there are 4 in tree users:

arch/x86/kernel/fpu/signal.c:198:8-31:  -> gup with !pages
arch/x86/mm/mpx.c:423:11-25:  -> gup with !pages
virt/kvm/async_pf.c:90:1-22:  -> gup with !pages
virt/kvm/kvm_main.c:1437:6-20:  -> gup with !pages

I don't know if anybody is using mpx with uffd and anyway mpx seems to go
away.

As for KVM, I think that post-copy live migration of L2 guest might trigger
that as well. Not sure though, I'm not really familiar with KVM code.
 
> > In post-copy mode of CRIU, the destination memory is managed with
> > userfaultfd and lack of the retry for pre-fault case in get_user_pages()
> > causes a crash of the restored process.
> > 
> > Making the pre-fault behavior of get_user_pages() the same as the "normal"
> > one fixes the issue.
> 
> Should this be backported into -stable trees?

I think that it depends on whether KVM affected by this or not.

> > Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> 
>
Andrew Morton May 22, 2019, 9:18 p.m. UTC | #6
On Wed, 22 May 2019 23:38:29 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote:

> (added kvm)
> 
> On Wed, May 22, 2019 at 12:21:13PM -0700, Andrew Morton wrote:
> > On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote:
> > 
> > > When get_user_pages*() is called with pages = NULL, the processing of
> > > VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> > > the pages.
> > > 
> > > If the pages in the requested range belong to a VMA that has userfaultfd
> > > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> > > has populated the page, but for the gup pre-fault case there's no actual
> > > retry and the caller will get no pages although they are present.
> > > 
> > > This issue was uncovered when running post-copy memory restore in CRIU
> > > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> > > copy_fpstate_to_sigframe() fails").
> > > 
> > > After this change, the copying of FPU state to the sigframe switched from
> > > copy_to_user() variants which caused a real page fault to get_user_pages()
> > > with pages parameter set to NULL.
> > 
> > You're saying that argument buf_fx in copy_fpstate_to_sigframe() is NULL?
> 
> Apparently I haven't explained well. The 'pages' parameter in the call to
> get_user_pages_unlocked() is NULL.

Doh.

> > If so was that expected by the (now cc'ed) developers of
> > d9c9ce34ed5c8923 ("x86/fpu: Fault-in user stack if
> > copy_fpstate_to_sigframe() fails")?
> > 
> > It seems rather odd.  copy_fpregs_to_sigframe() doesn't look like it's
> > expecting a NULL argument.
> > 
> > Also, I wonder if copy_fpstate_to_sigframe() would be better using
> > fault_in_pages_writeable() rather than get_user_pages_unlocked().  That
> > seems like it operates at a more suitable level and I guess it will fix
> > this issue also.
> 
> If I understand correctly, one of the points of d9c9ce34ed5c8923 ("x86/fpu:
> Fault-in user stack if copy_fpstate_to_sigframe() fails") was to to avoid
> page faults, hence the use of get_user_pages().
> 
> With fault_in_pages_writeable() there might be a page fault, unless I've
> completely mistaken.
> 
> Unrelated to copy_fpstate_to_sigframe(), the issue could happen if any call
> to get_user_pages() with pages parameter set to NULL tries to access
> userfaultfd-managed memory. Currently, there are 4 in tree users:
> 
> arch/x86/kernel/fpu/signal.c:198:8-31:  -> gup with !pages
> arch/x86/mm/mpx.c:423:11-25:  -> gup with !pages
> virt/kvm/async_pf.c:90:1-22:  -> gup with !pages
> virt/kvm/kvm_main.c:1437:6-20:  -> gup with !pages

OK.

> I don't know if anybody is using mpx with uffd and anyway mpx seems to go
> away.
> 
> As for KVM, I think that post-copy live migration of L2 guest might trigger
> that as well. Not sure though, I'm not really familiar with KVM code.
>  
> > > In post-copy mode of CRIU, the destination memory is managed with
> > > userfaultfd and lack of the retry for pre-fault case in get_user_pages()
> > > causes a crash of the restored process.
> > > 
> > > Making the pre-fault behavior of get_user_pages() the same as the "normal"
> > > one fixes the issue.
> > 
> > Should this be backported into -stable trees?
> 
> I think that it depends on whether KVM affected by this or not.
> 

How do we determine this?

I guess it doesn't matter much.
Hugh Dickins May 24, 2019, 10:22 p.m. UTC | #7
On Wed, 22 May 2019, Sebastian Andrzej Siewior wrote:
> On 2019-05-22 12:21:13 [-0700], Andrew Morton wrote:
> > On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote:
> > 
> > > When get_user_pages*() is called with pages = NULL, the processing of
> > > VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> > > the pages.
> > > 
> > > If the pages in the requested range belong to a VMA that has userfaultfd
> > > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> > > has populated the page, but for the gup pre-fault case there's no actual
> > > retry and the caller will get no pages although they are present.
> > > 
> > > This issue was uncovered when running post-copy memory restore in CRIU
> > > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> > > copy_fpstate_to_sigframe() fails").

I've been getting unexplained segmentation violations, and "make" giving
up early, when running kernel builds under swapping memory pressure: no
CRIU involved.

Bisected last night to that same x86/fpu commit, not itself guilty, but
suffering from the odd behavior of get_user_pages_unlocked() giving up
too early.

(I wondered at first if copy_fpstate_to_sigframe() ought to retry if
non-negative ret < nr_pages, but no, that would be wrong: a present page
followed by an invalid area would repeatedly return 1 for nr_pages 2.)

Cc'ing Pavel, who's been having segfault trouble in emacs: maybe same?

> > > 
> > > After this change, the copying of FPU state to the sigframe switched from
> > > copy_to_user() variants which caused a real page fault to get_user_pages()
> > > with pages parameter set to NULL.
...
> > Also, I wonder if copy_fpstate_to_sigframe() would be better using
> > fault_in_pages_writeable() rather than get_user_pages_unlocked().  That
> > seems like it operates at a more suitable level and I guess it will fix
> > this issue also.
> 
> It looks, like fault_in_pages_writeable() would work. If this is the
> recommendation from the MM department than I can switch to that.

I've now run a couple of hours of load successfully with Mike's patch
to GUP, no problem; but whatever the merits of that patch in general,
I agree with Andrew that fault_in_pages_writeable() seems altogether
more appropriate for copy_fpstate_to_sigframe(), and have now run a
couple of hours of load successfully with this instead (rewrite to taste):

--- 5.2-rc1/arch/x86/kernel/fpu/signal.c
+++ linux/arch/x86/kernel/fpu/signal.c
@@ -3,6 +3,7 @@
  * FPU signal frame handling routines.
  */
 
+#include <linux/pagemap.h>
 #include <linux/compat.h>
 #include <linux/cpu.h>
 
@@ -189,15 +190,7 @@ retry:
 	fpregs_unlock();
 
 	if (ret) {
-		int aligned_size;
-		int nr_pages;
-
-		aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size;
-		nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE);
-
-		ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages,
-					      NULL, FOLL_WRITE);
-		if (ret == nr_pages)
+		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
 			goto retry;
 		return -EFAULT;
 	}

(I did wonder whether there needs to be an access_ok() check on buf_fx;
but if so, then I think it would already have been needed before the
earlier copy_fpregs_to_sigframe(); but I didn't get deep enough into
that to be sure, nor into whether access_ok() check on buf covers buf_fx.)

Hugh

> 
> > > In post-copy mode of CRIU, the destination memory is managed with
> > > userfaultfd and lack of the retry for pre-fault case in get_user_pages()
> > > causes a crash of the restored process.
> > > 
> > > Making the pre-fault behavior of get_user_pages() the same as the "normal"
> > > one fixes the issue.
> > 
> > Should this be backported into -stable trees?
> 
> Sebastian
Sebastian Andrzej Siewior May 25, 2019, 8:45 a.m. UTC | #8
On 2019-05-24 15:22:51 [-0700], Hugh Dickins wrote:
> I've now run a couple of hours of load successfully with Mike's patch
> to GUP, no problem; but whatever the merits of that patch in general,
> I agree with Andrew that fault_in_pages_writeable() seems altogether
> more appropriate for copy_fpstate_to_sigframe(), and have now run a
> couple of hours of load successfully with this instead (rewrite to taste):

so this patch instead of Mike's GUP patch fixes the issue you observed?
Is this just a taste question or limitation of the function in general?

I'm asking because it has been suggested and is used in MPX code (in the
signal path but .mmap) and I'm not aware of any limitation. But as I
wrote earlier to akpm, if the MM folks suggest to use this instead I am
happy to switch.

> --- 5.2-rc1/arch/x86/kernel/fpu/signal.c
> +++ linux/arch/x86/kernel/fpu/signal.c
> @@ -3,6 +3,7 @@
>   * FPU signal frame handling routines.
>   */
>  
> +#include <linux/pagemap.h>
>  #include <linux/compat.h>
>  #include <linux/cpu.h>
>  
> @@ -189,15 +190,7 @@ retry:
>  	fpregs_unlock();
>  
>  	if (ret) {
> -		int aligned_size;
> -		int nr_pages;
> -
> -		aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size;
> -		nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE);
> -
> -		ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages,
> -					      NULL, FOLL_WRITE);
> -		if (ret == nr_pages)
> +		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
>  			goto retry;
>  		return -EFAULT;
>  	}
> 
> (I did wonder whether there needs to be an access_ok() check on buf_fx;
> but if so, then I think it would already have been needed before the
> earlier copy_fpregs_to_sigframe(); but I didn't get deep enough into
> that to be sure, nor into whether access_ok() check on buf covers buf_fx.)

There is an access_ok() at the begin of copy_fpregs_to_sigframe(). The
memory is allocated from user's stack and there is (later) an
access_ok() for the whole region (which can be more than the memory used
by the FPU code).

> Hugh

Sebastian
Hugh Dickins May 25, 2019, 6:09 p.m. UTC | #9
On Sat, 25 May 2019, Sebastian Andrzej Siewior wrote:
> On 2019-05-24 15:22:51 [-0700], Hugh Dickins wrote:
> > I've now run a couple of hours of load successfully with Mike's patch
> > to GUP, no problem; but whatever the merits of that patch in general,
> > I agree with Andrew that fault_in_pages_writeable() seems altogether
> > more appropriate for copy_fpstate_to_sigframe(), and have now run a
> > couple of hours of load successfully with this instead (rewrite to taste):
> 
> so this patch instead of Mike's GUP patch fixes the issue you observed?

Yes.

> Is this just a taste question or limitation of the function in general?

I'd say it's just a taste question. Though the the fact that your
usage showed up a bug in the get_user_pages_unlocked() implementation,
demanding a fix, does indicate that it's a more fragile and complex
route, better avoided if there's a good simple alternative. If it were
not already on your slowpath, I'd also argue fault_in_pages_writeable()
is a more efficient way to do it.

> 
> I'm asking because it has been suggested and is used in MPX code (in the
> signal path but .mmap) and I'm not aware of any limitation. But as I
> wrote earlier to akpm, if the MM folks suggest to use this instead I am
> happy to switch.

I know nothing of MPX, beyond that Dave Hansen has posted patches to
remove that support entirely, so I'm surprised arch/x86/mm/mpx.c is
still in the tree. But peering at it now, it looks as if it's using
get_user_pages() while holding mmap_sem, whereas you (sensibly enough)
used get_user_pages_unlocked() to handle the mmap_sem for you -
the trouble with that is that since it knows it's in control of
mmap_sem, it feels free to drop it internally, and that takes it
down the path of the premature return when pages NULL that Mike is
fixing. MPX's get_user_pages() is not free to go that way.

> 
> > --- 5.2-rc1/arch/x86/kernel/fpu/signal.c
> > +++ linux/arch/x86/kernel/fpu/signal.c
> > @@ -3,6 +3,7 @@
> >   * FPU signal frame handling routines.
> >   */
> >  
> > +#include <linux/pagemap.h>
> >  #include <linux/compat.h>
> >  #include <linux/cpu.h>
> >  
> > @@ -189,15 +190,7 @@ retry:
> >  	fpregs_unlock();
> >  
> >  	if (ret) {
> > -		int aligned_size;
> > -		int nr_pages;
> > -
> > -		aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size;
> > -		nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE);
> > -
> > -		ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages,
> > -					      NULL, FOLL_WRITE);
> > -		if (ret == nr_pages)
> > +		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
> >  			goto retry;
> >  		return -EFAULT;
> >  	}
> > 
> > (I did wonder whether there needs to be an access_ok() check on buf_fx;
> > but if so, then I think it would already have been needed before the
> > earlier copy_fpregs_to_sigframe(); but I didn't get deep enough into
> > that to be sure, nor into whether access_ok() check on buf covers buf_fx.)
> 
> There is an access_ok() at the begin of copy_fpregs_to_sigframe(). The
> memory is allocated from user's stack and there is (later) an
> access_ok() for the whole region (which can be more than the memory used
> by the FPU code).

Yes, but remember I know nothing of this FPU signal code, so I cannot
tell whether an access_ok(buf, size) is good enough to cover the range
of an access_ok(buf_fx, fpu_user_xstate_size).

Your "(later)" worries me a little - I hope you're not writing first
and checking the limits later; but what you're doing may be perfectly
correct, I'm just too far from understanding the details to say; but
raised the matter because (I think) get_user_pages_unlocked() would
entail an access_ok() check where fault_in_pages_writable() would not.

Hugh
Pavel Machek May 26, 2019, 5:25 p.m. UTC | #10
On Fri 2019-05-24 15:22:51, Hugh Dickins wrote:
> On Wed, 22 May 2019, Sebastian Andrzej Siewior wrote:
> > On 2019-05-22 12:21:13 [-0700], Andrew Morton wrote:
> > > On Tue, 14 May 2019 17:29:55 +0300 Mike Rapoport <rppt@linux.ibm.com> wrote:
> > > 
> > > > When get_user_pages*() is called with pages = NULL, the processing of
> > > > VM_FAULT_RETRY terminates early without actually retrying to fault-in all
> > > > the pages.
> > > > 
> > > > If the pages in the requested range belong to a VMA that has userfaultfd
> > > > registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
> > > > has populated the page, but for the gup pre-fault case there's no actual
> > > > retry and the caller will get no pages although they are present.
> > > > 
> > > > This issue was uncovered when running post-copy memory restore in CRIU
> > > > after commit d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
> > > > copy_fpstate_to_sigframe() fails").
> 
> I've been getting unexplained segmentation violations, and "make" giving
> up early, when running kernel builds under swapping memory pressure: no
> CRIU involved.
> 
> Bisected last night to that same x86/fpu commit, not itself guilty, but
> suffering from the odd behavior of get_user_pages_unlocked() giving up
> too early.
> 
> (I wondered at first if copy_fpstate_to_sigframe() ought to retry if
> non-negative ret < nr_pages, but no, that would be wrong: a present page
> followed by an invalid area would repeatedly return 1 for nr_pages 2.)
> 
> Cc'ing Pavel, who's been having segfault trouble in emacs: maybe same?

The emacs segfault was always during process exit. This sounds different...

I don't see problems with make.

But its true that at least one of affected machines uses swap heavily.

Best regards,
								Pavel
Sebastian Andrzej Siewior May 26, 2019, 7:36 p.m. UTC | #11
On 2019-05-25 11:09:15 [-0700], Hugh Dickins wrote:
> On Sat, 25 May 2019, Sebastian Andrzej Siewior wrote:
> > On 2019-05-24 15:22:51 [-0700], Hugh Dickins wrote:
> > > I've now run a couple of hours of load successfully with Mike's patch
> > > to GUP, no problem; but whatever the merits of that patch in general,
> > > I agree with Andrew that fault_in_pages_writeable() seems altogether
> > > more appropriate for copy_fpstate_to_sigframe(), and have now run a
> > > couple of hours of load successfully with this instead (rewrite to taste):
> > 
> > so this patch instead of Mike's GUP patch fixes the issue you observed?
> 
> Yes.
> 
> > Is this just a taste question or limitation of the function in general?
> 
> I'd say it's just a taste question. Though the the fact that your
> usage showed up a bug in the get_user_pages_unlocked() implementation,
> demanding a fix, does indicate that it's a more fragile and complex
> route, better avoided if there's a good simple alternative. If it were
> not already on your slowpath, I'd also argue fault_in_pages_writeable()
> is a more efficient way to do it.

Okay. The GUP functions are not properly documented for my taste. There
is no indication whether or not the mm_sem has to be acquired prior
invoking it. Following the call chain of get_user_pages() I ended up in
__get_user_pages_locked() `locked = NULL' indicated that mm_sem is no
acquired and then I saw this:
|                 if (!locked)
|                         /* VM_FAULT_RETRY couldn't trigger, bypass */
|                         return ret;

kind of suggesting that it is okay to invoke it without holding the
mm_sem prefault. It passed a few tests and then
	https://lkml.kernel.org/r/1556657902.6132.13.camel@lca.pw

happened. After that, I switched to the locked variant and the problem
disappeared (also I noticed that MPX code is invoked within ->mmap()).

> > I'm asking because it has been suggested and is used in MPX code (in the
> > signal path but .mmap) and I'm not aware of any limitation. But as I
> > wrote earlier to akpm, if the MM folks suggest to use this instead I am
> > happy to switch.
> 
> I know nothing of MPX, beyond that Dave Hansen has posted patches to
> remove that support entirely, so I'm surprised arch/x86/mm/mpx.c is
> still in the tree.
I need to poke at that. I has been removed but then KVM folks complained
that they kind of depend on that if it has been exposed to the guest. We
need to fade it out slowly…

>                    But peering at it now, it looks as if it's using
> get_user_pages() while holding mmap_sem, whereas you (sensibly enough)
> used get_user_pages_unlocked() to handle the mmap_sem for you -
> the trouble with that is that since it knows it's in control of
> mmap_sem, it feels free to drop it internally, and that takes it
> down the path of the premature return when pages NULL that Mike is
> fixing. MPX's get_user_pages() is not free to go that way.
oki.

> > > --- 5.2-rc1/arch/x86/kernel/fpu/signal.c
> > > +++ linux/arch/x86/kernel/fpu/signal.c
> > > @@ -3,6 +3,7 @@
> > >   * FPU signal frame handling routines.
> > >   */
> > >  
> > > +#include <linux/pagemap.h>
> > >  #include <linux/compat.h>
> > >  #include <linux/cpu.h>
> > >  
> > > @@ -189,15 +190,7 @@ retry:
> > >  	fpregs_unlock();
> > >  
> > >  	if (ret) {
> > > -		int aligned_size;
> > > -		int nr_pages;
> > > -
> > > -		aligned_size = offset_in_page(buf_fx) + fpu_user_xstate_size;
> > > -		nr_pages = DIV_ROUND_UP(aligned_size, PAGE_SIZE);
> > > -
> > > -		ret = get_user_pages_unlocked((unsigned long)buf_fx, nr_pages,
> > > -					      NULL, FOLL_WRITE);
> > > -		if (ret == nr_pages)
> > > +		if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
> > >  			goto retry;
> > >  		return -EFAULT;
> > >  	}
> > > 
> > > (I did wonder whether there needs to be an access_ok() check on buf_fx;
> > > but if so, then I think it would already have been needed before the
> > > earlier copy_fpregs_to_sigframe(); but I didn't get deep enough into
> > > that to be sure, nor into whether access_ok() check on buf covers buf_fx.)
> > 
> > There is an access_ok() at the begin of copy_fpregs_to_sigframe(). The
> > memory is allocated from user's stack and there is (later) an
> > access_ok() for the whole region (which can be more than the memory used
> > by the FPU code).
> 
> Yes, but remember I know nothing of this FPU signal code, so I cannot
> tell whether an access_ok(buf, size) is good enough to cover the range
> of an access_ok(buf_fx, fpu_user_xstate_size).

yes, because size >= fpu_user_xstate_size

> Your "(later)" worries me a little - I hope you're not writing first
> and checking the limits later; but what you're doing may be perfectly
> correct, I'm just too far from understanding the details to say; but
> raised the matter because (I think) get_user_pages_unlocked() would
> entail an access_ok() check where fault_in_pages_writable() would not.

no, we first check the range and then write. It is later checked again
after the size has been extended.

> Hugh

Sebastian
Hugh Dickins May 26, 2019, 8:17 p.m. UTC | #12
On Sun, 26 May 2019, Sebastian Andrzej Siewior wrote:
> 
> Okay. The GUP functions are not properly documented for my taste. There
> is no indication whether or not the mm_sem has to be acquired prior
> invoking it. Following the call chain of get_user_pages() I ended up in
> __get_user_pages_locked() `locked = NULL' indicated that mm_sem is no
> acquired and then I saw this:
> |                 if (!locked)
> |                         /* VM_FAULT_RETRY couldn't trigger, bypass */
> |                         return ret;
> 
> kind of suggesting that it is okay to invoke it without holding the
> mm_sem prefault. It passed a few tests and then
> 	https://lkml.kernel.org/r/1556657902.6132.13.camel@lca.pw
> 
> happened. After that, I switched to the locked variant and the problem
> disappeared (also I noticed that MPX code is invoked within ->mmap()).

Ah, I wasn't following what happened here while it was in linux-next.

I certainly agree that all the variants are very confusing, and the
matter of mmap_sem particularly so. Because it used to be a simple
rule that you needed to hold mmap_sem to call get_user_pages(); but
that can be so contended that get_user_pages_fast(), and VM_FAULT_RETRY,
optimizations came in, then interfaces like get_user_pages_locked() and
get_user_pages_unlocked().

I think when you say that you "switched to the locked variant", you're
writing of when you switched to using get_user_pages_unlocked(), which
takes and releases the mmap_sem itself: yes, using get_user_pages()
without holding mmap_sem would have been open to serious errors.

The documentation in comments above get_user_pages_locked() and
get_user_pages_unlocked() looks rather good to me, actually.

But this is all good reason to use the less challenging
fault_in_pages_writable() instead. (And also saves all those GUP
developers, who from time to time have to search through all users
of GUP in one form or another, to make this or that improvement,
from having to visit arch/x86/kernel/fpu/signal.c: they will
quietly thank you.)

Hugh
Andrea Arcangeli June 22, 2019, 5:51 p.m. UTC | #13
Hello everyone,

On Wed, May 22, 2019 at 02:18:03PM -0700, Andrew Morton wrote:
> > arch/x86/kernel/fpu/signal.c:198:8-31:  -> gup with !pages

This simply had not to return -EFAULT if ret < nr_pages.. but ret >= 0.

Instead it did:

               if (ret == nr_pages)
                       goto retry;
               return -EFAULT;

That was the bug and the correct code would have been:

    	    ret = get_user_pages_unlocked(pages=NULL)
	    if (ret < 0)
	       return -EFAULT;
	    goto retry;

This eventually should have worked fine but it was less efficient
because it's still acting in a full prefault mode and it just tells
GUP that pages = NULL and so all it is trying to do is to issue the
blocking I/O after the mmap_sem has been released already.

Overall the solution applied in commit
b81ff1013eb8eef2934ca7e8cf53d553c1029e84 looks nicer.

Alternatively it could have used down_read(); get_user_pages(); which
prevents get_user_pages to drop the mmap_sem and break the loop if
some blocking I/O had to be executed outside mmap_sem. But that would
have the side effect of breaking userfaultfd (uffd requires
gup_locked/unlocked and FAULT_FLAG_ALLOW_RETRY to be set in the fault
flags).

Eventually we need to allow VM_FAULT_RETRY to be returned even if
FOLL_TRIED is set, so in theory get_user_pages_unlocked(pages=NULL) in
a loop must eventually stop returning VM_FAULT_RETRY. FOLL_TRIED could
still disambiguate if VM_FAULT_RETRY should or should not be returned
so that it is only returned only if it cannnot be avoided
(i.e. userfaultfd case).

With gup_unlocked(pages=NULL) however all we are interested about is
to execute the blocking I/O and we don't care to map anything in the
pagetables. A later page fault has to happen anyway for sure because
pages was == NULL, it just needs to be a fast one.

> > arch/x86/mm/mpx.c:423:11-25:  -> gup with !pages

Note that get_user_pages is never affected by whatever change after
the below, !locked check in gup_locked:

		if (!locked)
			/* VM_FAULT_RETRY couldn't trigger, bypass */
			return ret;

The bypass means when locked is NULL, there is a 1:1 bypass from
__get_user_pages<->get_user_pages and the VM_FAULT_RETRY dance never
runs.

get_user_pages in fact can't support userfaultfd, which makes ptrace
and core dump and the hwpoison non blocking in VM_FAULT_RETRY.

All places that must support userfaultfd must use
get_user_pages_unlocked/locked or somehow end up with
FAULT_FLAG_ALLOW_RETRY set in the fault flags.

> > virt/kvm/async_pf.c:90:1-22:  -> gup with !pages

Didn't this get slowed down with the commit
df17277b2a85c00f5710e33ce238ba4114687a28?

I mean it was a feature not a bug to skip that additional
__get_user_pages(FOLL_TRIED).

> > virt/kvm/kvm_main.c:1437:6-20:  -> gup with !pages

Like for mpx.c get_user_pages is agnostic to all these gup_locked
changes because it sets locked = NULL, it couldn't break the loop
early because it couldn't return VM_FAULT_RETRY.

> 
> OK.

Commit df17277b2a85c00f5710e33ce238ba4114687a28 is now applied.

So I think the effect it has is to make async_pf.c slower and we
didn't solve anything.

There are two __get_user_pages:

1)		ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
				       vmas, locked);


		if (called by get_user_pages)
		    return ret; /* bypass the whole VM_FAULT_RETRY logic */


		*locked = 1;
		lock_dropped = true;
		down_read(&mm->mmap_sem);
2)		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
				       pages, NULL, NULL);


The problem introduced is that 2) is getting executed with pages==NULL
but there's no point to ever run 2) with pages = NULL.

async_pf especially uses nr_pages == 1, so it couldn't get any more
optimal than it already was.

Before df17277b2a85c00f5710e33ce238ba4114687a28 we broke the loop as
soon as the first __get_user_pages returned VM_FAULT_RETRY.

We can argue if we shouldn't have broken the loop and we should have
kept executing only the first __get_user_pages (marked "1)" above) for
the whole range, but nr_pages == 1 is common and in such case there's
no difference between the two behaviors.

The prefetch callers with nr_pages == 1, didn't even check the retval
at all:

	down_read(&mm->mmap_sem);
	get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE, NULL, NULL,
			&locked);                            ^^^^ pages NULL

	// retval ignored

It should probably check for retval < 0... but the fault will be
retried for good later still with get_user_pages_unlocked() but with
pages != NULL, so it'll find out later if it's a segfault.

Now if we change the code to skip the second __get_user_pages it's not
clear if we can return nr_pages because we may still not have faulted
in the whole range in the pagetables. I guess we could still return
nr_pages even if we scanned the whole range with only the first of the
two __get_user_pages. However if you had mmu notifier registered in
the range nr_pages would have stronger semantics if you would execute
2) too, but then without pages array not-NULL such stronger semantics
cannot be taken advantage of anyway, because you don't know where
those pages are and you can't map them in a secondary MMU even if you
execute the line 2).

I personally preferred the older code which should at least in theory
run faster, it just required documentation that if "pages == NULL"
we'll break the loop early because it has to be a "prefetch" attempt
and it must be retried until nr_pages == ret.

Either that or add a "continue" to skip the second __get_user_pages in
line 2) above and then returning nr_pages to indicate VM_FAULT_RETRY
may very well have been returned on all page offsets in the virtual
range. That will behave the same for async_pf.c because nr_pages == 1.

When VM_FAULT_RETRY is returned all I/O should be complete (no matter
if network or disk with userfaultfd or just pagecache readpage on
filebacked kernel faults) and only a minor fault is required to obtain
the page. But it is better to defer that second minor fault to the
point where "pages" already become != NULL, so we end up calling
__get_user_pages 2 times instead of 3 times.

Thanks,
Andrea

Patch
diff mbox series

diff --git a/mm/gup.c b/mm/gup.c
index 91819b8..c32ae5a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -936,10 +936,6 @@  static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 			BUG_ON(ret >= nr_pages);
 		}
 
-		if (!pages)
-			/* If it's a prefault don't insist harder */
-			return ret;
-
 		if (ret > 0) {
 			nr_pages -= ret;
 			pages_done += ret;
@@ -955,8 +951,12 @@  static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 				pages_done = ret;
 			break;
 		}
-		/* VM_FAULT_RETRY triggered, so seek to the faulting offset */
-		pages += ret;
+		/*
+		 * VM_FAULT_RETRY triggered, so seek to the faulting offset.
+		 * For the prefault case (!pages) we only update counts.
+		 */
+		if (likely(pages))
+			pages += ret;
 		start += ret << PAGE_SHIFT;
 
 		/*
@@ -979,7 +979,8 @@  static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 		pages_done++;
 		if (!nr_pages)
 			break;
-		pages++;
+		if (likely(pages))
+			pages++;
 		start += PAGE_SIZE;
 	}
 	if (lock_dropped && *locked) {