All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Kravetz <mike.kravetz@oracle.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>,
	'Andrew Morton' <akpm@linux-foundation.org>,
	linux-mm@kvack.org,
	"'Dr. David Alan Gilbert'" <dgilbert@redhat.com>,
	'Shaohua Li' <shli@fb.com>,
	'Pavel Emelyanov' <xemul@parallels.com>,
	'Mike Rapoport' <rppt@linux.vnet.ibm.com>
Subject: Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
Date: Wed, 16 Nov 2016 10:53:39 -0800	[thread overview]
Message-ID: <8ee2c6db-7ee4-285f-4c68-75fd6e799c0d@oracle.com> (raw)
In-Reply-To: <20161116182809.GC26185@redhat.com>

On 11/16/2016 10:28 AM, Andrea Arcangeli wrote:
> Hello Mike,
> 
> On Tue, Nov 08, 2016 at 01:06:06PM -0800, Mike Kravetz wrote:
>> -- 
>> Mike Kravetz
>>
>> From: Mike Kravetz <mike.kravetz@oracle.com>
>>
>> userfaultfd: hugetlbfs: fix __mcopy_atomic_hugetlb retry/error processing
>>
>> The new routine copy_huge_page_from_user() uses kmap_atomic() to map
>> PAGE_SIZE pages.  However, this prevents page faults in the subsequent
>> call to copy_from_user().  This is OK in the case where the routine
>> is copied with mmap_sema held.  However, in another case we want to
>> allow page faults.  So, add a new argument allow_pagefault to indicate
>> if the routine should allow page faults.
>>
>> A patch (mm/hugetlb: fix huge page reservation leak in private mapping
>> error paths) was recently submitted and is being added to -mm tree.  It
>> addresses the issue huge page reservations when a huge page is allocated,
>> and free'ed before being instantiated in an address space.  This would
>> typically happen in error paths.  The routine __mcopy_atomic_hugetlb has
>> such an error path, so it will need to call restore_reserve_on_error()
>> before free'ing the huge page.  restore_reserve_on_error is currently
>> only visible in mm/hugetlb.c.  So, add it to a header file so that it
>> can be used in mm/userfaultfd.c.  Another option would be to move
>> __mcopy_atomic_hugetlb into mm/hugetlb.c
> 
> It would have been better to split this in two patches.
> 
>> @@ -302,8 +302,10 @@ static __always_inline ssize_t
>> __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>>  out_unlock:
>>  	up_read(&dst_mm->mmap_sem);
>>  out:
>> -	if (page)
>> +	if (page) {
>> +		restore_reserve_on_error(h, dst_vma, dst_addr, page);
>>  		put_page(page);
>> +	}
>>  	BUG_ON(copied < 0);
> 
> If the revalidation fails dst_vma could even be NULL.
> 
> We get there with page not NULL only if something in the revalidation
> fails effectively... I'll have to drop the above change as the fix
> will hurt more than the vma reservation not being restored. Didn't
> think too much about it, but there was no obvious way to restore the
> reservation of a vma, after we drop the mmap_sem. However if we don't
> drop the mmap_sem, we'd recurse into it, and it'll deadlock in current
> implementation if a down_write is already pending somewhere else. In
> this specific case fairness is not an issue, but it's not checking
> it's the same thread taking it again, so it's doesn't allow to recurse
> (checking it's the same thread would make it slower).

Thanks for reviewing this Andrea.  You are correct, we can not call
restore_reserve_on_error without mmap_sem held.

I was running some tests with error injection to exercise the error
path and noticed the reservation leaks as the system eventually ran
out of huge pages.  I need to think about it some more, but we may
want to at least do something like the following before put_page (with
a BIG comment):

	if (unlikely(PagePrivate(page)))
		ClearPagePrivate(page);

That would at least keep the global reservation count from increasing.
Let me look into that.

> I also fixed the gup support for userfaultfd, could you review it?

I will take a look, and 'may' have a test that can be modified for this.

-- 
Mike Kravetz

> Beware, untested... will test it shortly with qemu postcopy live
> migration with hugetlbfs instead of THP (that currently gracefully
> complains about FAULT_FLAG_ALLOW_RETRY missing, KVM ioctl returns
> badaddr and DEBUG_VM=y clearly showed the stack trace of where
> FAULT_FLAG_ALLOW_RETRY was missing).
> 
> I think this enhancement is needed by Oracle too, so that you don't
> get an error from I/O syscalls, and you instead get an userfault.
> 
> We need to update the selftest to trigger userfaults not only with the
> CPU but with O_DIRECT too.
> 
> Note, the FOLL_NOWAIT is needed to offload the userfaults to async
> page faults. KVM tries an async fault first (FOLL_NOWAIT, nonblocking
> = NULL), if that fails it offload a blocking (*nonblocking = 1) fault
> through async page fault kernel thread while guest scheduler schedule
> away the blocked process. So the userfaults behave like SSD swapins
> from disk hitting on a single guest thread and not the whole host vcpu
> thread. Clearly hugetlbfs cannot ever block for I/O, FOLL_NOWAIT is
> only useful to avoid blocking in the vcpu thread in
> handle_userfault().
> 
> From ff1ce62ee0acb14ed71621ba99f01f008a5d212d Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Wed, 16 Nov 2016 18:34:20 +0100
> Subject: [PATCH 1/1] userfaultfd: hugetlbfs: gup: support VM_FAULT_RETRY
> 
> Add support for VM_FAULT_RETRY to follow_hugetlb_page() so that
> get_user_pages_unlocked/locked and "nonblocking/FOLL_NOWAIT" features
> will work on hugetlbfs. This is required for fully functional
> userfaultfd non-present support on hugetlbfs.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/hugetlb.h |  5 +++--
>  mm/gup.c                |  2 +-
>  mm/hugetlb.c            | 48 ++++++++++++++++++++++++++++++++++++++++--------
>  3 files changed, 44 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index bf02b7e..542416d 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -65,7 +65,8 @@ int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int,
>  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
>  long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
>  			 struct page **, struct vm_area_struct **,
> -			 unsigned long *, unsigned long *, long, unsigned int);
> +			 unsigned long *, unsigned long *, long, unsigned int,
> +			 int *);
>  void unmap_hugepage_range(struct vm_area_struct *,
>  			  unsigned long, unsigned long, struct page *);
>  void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> @@ -138,7 +139,7 @@ static inline unsigned long hugetlb_total_pages(void)
>  	return 0;
>  }
>  
> -#define follow_hugetlb_page(m,v,p,vs,a,b,i,w)	({ BUG(); 0; })
> +#define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n)	({ BUG(); 0; })
>  #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
>  #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
>  static inline void hugetlb_report_meminfo(struct seq_file *m)
> diff --git a/mm/gup.c b/mm/gup.c
> index ec4f827..36e88a9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -572,7 +572,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			if (is_vm_hugetlb_page(vma)) {
>  				i = follow_hugetlb_page(mm, vma, pages, vmas,
>  						&start, &nr_pages, i,
> -						gup_flags);
> +						gup_flags, nonblocking);
>  				continue;
>  			}
>  		}
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9ce8ecb..022750d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4039,7 +4039,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			 struct page **pages, struct vm_area_struct **vmas,
>  			 unsigned long *position, unsigned long *nr_pages,
> -			 long i, unsigned int flags)
> +			 long i, unsigned int flags, int *nonblocking)
>  {
>  	unsigned long pfn_offset;
>  	unsigned long vaddr = *position;
> @@ -4102,16 +4102,43 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		    ((flags & FOLL_WRITE) &&
>  		      !huge_pte_write(huge_ptep_get(pte)))) {
>  			int ret;
> +			unsigned int fault_flags = 0;
>  
>  			if (pte)
>  				spin_unlock(ptl);
> -			ret = hugetlb_fault(mm, vma, vaddr,
> -				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
> -			if (!(ret & VM_FAULT_ERROR))
> -				continue;
> -
> -			remainder = 0;
> -			break;
> +			if (flags & FOLL_WRITE)
> +				fault_flags |= FAULT_FLAG_WRITE;
> +			if (nonblocking)
> +				fault_flags |= FAULT_FLAG_ALLOW_RETRY;
> +			if (flags & FOLL_NOWAIT)
> +				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
> +					FAULT_FLAG_RETRY_NOWAIT;
> +			if (flags & FOLL_TRIED) {
> +				VM_WARN_ON_ONCE(fault_flags &
> +						FAULT_FLAG_ALLOW_RETRY);
> +				fault_flags |= FAULT_FLAG_TRIED;
> +			}
> +			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
> +			if (ret & VM_FAULT_ERROR) {
> +				remainder = 0;
> +				break;
> +			}
> +			if (ret & VM_FAULT_RETRY) {
> +				if (nonblocking)
> +					*nonblocking = 0;
> +				*nr_pages = 0;
> +				/*
> +				 * VM_FAULT_RETRY must not return an
> +				 * error, it will return zero
> +				 * instead.
> +				 *
> +				 * No need to update "position" as the
> +				 * caller will not check it after
> +				 * *nr_pages is set to 0.
> +				 */
> +				return i;
> +			}
> +			continue;
>  		}
>  
>  		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
> @@ -4140,6 +4167,11 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		spin_unlock(ptl);
>  	}
>  	*nr_pages = remainder;
> +	/*
> +	 * setting position is actually required only if remainder is
> +	 * not zero but it's faster not to add a "if (remainder)"
> +	 * branch.
> +	 */
>  	*position = vaddr;
>  
>  	return i ? i : -EFAULT;
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2016-11-16 18:53 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 01/33] userfaultfd: document _IOR/_IOW Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 02/33] userfaultfd: correct comment about UFFD_FEATURE_PAGEFAULT_FLAG_WP Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 03/33] userfaultfd: convert BUG() to WARN_ON_ONCE() Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 04/33] userfaultfd: use vma_is_anonymous Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 05/33] userfaultfd: non-cooperative: Split the find_userfault() routine Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 06/33] userfaultfd: non-cooperative: Add ability to report non-PF events from uffd descriptor Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 07/33] userfaultfd: non-cooperative: report all available features to userland Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 08/33] userfaultfd: non-cooperative: Add fork() event Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 09/33] userfaultfd: non-cooperative: Add fork() event, build warning fix Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 10/33] userfaultfd: non-cooperative: dup_userfaultfd: use mm_count instead of mm_users Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 11/33] userfaultfd: non-cooperative: Add mremap() event Andrea Arcangeli
2016-11-03  7:41   ` Hillf Danton
2016-11-03 17:52     ` Mike Rapoport
2016-11-04 15:40     ` Mike Rapoport
2016-11-02 19:33 ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request Andrea Arcangeli
2016-11-03  8:01   ` Hillf Danton
2016-11-03 17:24     ` Mike Rapoport
2016-11-04 16:40       ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED requestg Andrea Arcangeli
2016-11-04 15:42     ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request Mike Rapoport
2016-11-02 19:33 ` [PATCH 13/33] userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 14/33] userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for " Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY Andrea Arcangeli
2016-11-03 10:15   ` Hillf Danton
2016-11-03 17:33     ` Mike Kravetz
2016-11-03 19:14       ` Mike Kravetz
2016-11-04  6:43         ` Hillf Danton
2016-11-04 19:36         ` Andrea Arcangeli
2016-11-04 20:34           ` Mike Kravetz
2016-11-08 21:06           ` Mike Kravetz
2016-11-16 18:28             ` Andrea Arcangeli
2016-11-16 18:53               ` Mike Kravetz [this message]
2016-11-17 15:40                 ` Andrea Arcangeli
2016-11-17 19:26                   ` Mike Kravetz
2016-11-18  0:05                     ` Andrea Arcangeli
2016-11-18  5:52                       ` Mike Kravetz
2016-11-22  1:16                         ` Mike Kravetz
2016-11-23  6:38                           ` Hillf Danton
2016-12-15 19:02                             ` Andrea Arcangeli
2016-12-16  3:54                               ` Hillf Danton
2016-11-17 19:41               ` Mike Kravetz
2016-11-04 16:35       ` Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 16/33] userfaultfd: hugetlbfs: add userfaultfd hugetlb hook Andrea Arcangeli
2016-11-04  7:02   ` Hillf Danton
2016-11-02 19:33 ` [PATCH 17/33] userfaultfd: hugetlbfs: allow registration of ranges containing huge pages Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 18/33] userfaultfd: hugetlbfs: add userfaultfd_hugetlb test Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 19/33] userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 20/33] userfaultfd: introduce vma_can_userfault Andrea Arcangeli
2016-11-04  7:39   ` Hillf Danton
2016-11-02 19:33 ` [PATCH 21/33] userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 22/33] userfaultfd: shmem: introduce vma_is_shmem Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 23/33] userfaultfd: shmem: add tlbflush.h header for microblaze Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 24/33] userfaultfd: shmem: use shmem_mcopy_atomic_pte for shared memory Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 25/33] userfaultfd: shmem: add userfaultfd hook for shared memory faults Andrea Arcangeli
2016-11-04  8:59   ` Hillf Danton
2016-11-04 14:53     ` Mike Rapoport
2016-11-04 15:44     ` Mike Rapoport
2016-11-04 16:56       ` Andrea Arcangeli
2016-11-18  0:37       ` Andrea Arcangeli
2016-11-20 12:10         ` Mike Rapoport
2016-11-02 19:33 ` [PATCH 26/33] userfaultfd: shmem: allow registration of shared memory ranges Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 27/33] userfaultfd: shmem: add userfaultfd_shmem test Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 28/33] userfaultfd: shmem: lock the page before adding it to pagecache Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 29/33] userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 30/33] userfaultfd: non-cooperative: selftest: introduce userfaultfd_open Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 31/33] userfaultfd: non-cooperative: selftest: add ufd parameter to copy_page Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 32/33] userfaultfd: non-cooperative: selftest: add test for FORK, MADVDONTNEED and REMAP events Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 33/33] mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock Andrea Arcangeli
2016-11-02 20:07 ` [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8ee2c6db-7ee4-285f-4c68-75fd6e799c0d@oracle.com \
    --to=mike.kravetz@oracle.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=dgilbert@redhat.com \
    --cc=hillf.zj@alibaba-inc.com \
    --cc=linux-mm@kvack.org \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=shli@fb.com \
    --cc=xemul@parallels.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.