From: Peter Xu <peterx@redhat.com>
To: Alistair Popple <apopple@nvidia.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Mike Kravetz <mike.kravetz@oracle.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Jason Gunthorpe <jgg@ziepe.ca>, Hugh Dickins <hughd@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Jerome Glisse <jglisse@redhat.com>,
	Nadav Amit <nadav.amit@gmail.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>
Subject: Re: [PATCH v3 11/27] shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed
Date: Mon, 21 Jun 2021 20:40:37 -0400
Message-ID: <YNExhdKPfIb9QlDe@t490s>
In-Reply-To: <2098802.ffhGrX9TH4@nvdebian>

On Mon, Jun 21, 2021 at 06:41:17PM +1000, Alistair Popple wrote:
> On Friday, 28 May 2021 6:22:14 AM AEST Peter Xu wrote:
> > File-backed memory is prone to being unmapped at any time.  It means all
> > information in the pte will be dropped, including the uffd-wp flag.
> > 
> > Since the uffd-wp info cannot be stored in page cache or swap cache, persist
> > this wr-protect information by installing the special uffd-wp marker pte when
> > we're going to unmap a uffd wr-protected pte.  When the pte is accessed again,
> > we will know it's previously wr-protected by recognizing the special pte.
> > 
> > Meanwhile add a new flag ZAP_FLAG_DROP_FILE_UFFD_WP when we don't want to
> > persist such an information.  For example, when destroying the whole vma, or
> > punching a hole in a shmem file.  For the latter, we can only drop the uffd-wp
> > bit when holding the page lock.  It means the unmap_mapping_range() in
> > shmem_fallocate() still requires to zap without ZAP_FLAG_DROP_FILE_UFFD_WP
> > because that's still racy with the page faults.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/linux/mm.h        | 11 ++++++++++
> >  include/linux/mm_inline.h | 43 +++++++++++++++++++++++++++++++++++++++
> >  mm/memory.c               | 42 +++++++++++++++++++++++++++++++++++++-
> >  mm/rmap.c                 |  8 ++++++++
> >  mm/truncate.c             |  8 +++++++-
> >  5 files changed, 110 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index b1fb2826e29c..5989fc7ed00d 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1725,6 +1725,8 @@ extern void user_shm_unlock(size_t, struct user_struct *);
> >  #define  ZAP_FLAG_CHECK_MAPPING             BIT(0)
> >  /* Whether to skip zapping swap entries */
> >  #define  ZAP_FLAG_SKIP_SWAP                 BIT(1)
> > +/* Whether to completely drop uffd-wp entries for file-backed memory */
> > +#define  ZAP_FLAG_DROP_FILE_UFFD_WP         BIT(2)
> >  
> >  /*
> >   * Parameter block passed down to zap_pte_range in exceptional cases.
> > @@ -1757,6 +1759,15 @@ zap_skip_swap(struct zap_details *details)
> >  	return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
> >  }
> >  
> > +static inline bool
> > +zap_drop_file_uffd_wp(struct zap_details *details)
> > +{
> > +	if (!details)
> > +		return false;
> > +
> > +	return details->zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP;
> > +}
> 
> Is this a good default having to explicitly specify that you don't want
> special pte's left in place?

I deliberately made preserving the marker the default, so that we never drop
the bit without being aware of it: losing the uffd-wp bit anywhere can directly
cause data corruption in userspace.
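
To make the default concrete, here is a minimal userspace sketch of the helper
quoted above.  The struct is a stand-in for the kernel's struct zap_details,
BIT() is redefined locally, and zap_default_demo() is a hypothetical name used
only for illustration; only the NULL-means-keep behaviour is the point:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BIT(n)                      (1UL << (n))
#define ZAP_FLAG_CHECK_MAPPING      BIT(0)
#define ZAP_FLAG_SKIP_SWAP          BIT(1)
#define ZAP_FLAG_DROP_FILE_UFFD_WP  BIT(2)

/* Userspace stand-in for the kernel structure; only zap_flags is modelled. */
struct zap_details {
	unsigned long zap_flags;
};

/* Same logic as the helper in the patch: no details means "do not drop". */
static inline bool zap_drop_file_uffd_wp(const struct zap_details *details)
{
	if (!details)
		return false;

	return details->zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP;
}

/* Hypothetical demo: a caller has to opt in before the marker is dropped. */
void zap_default_demo(void)
{
	struct zap_details keep = { .zap_flags = ZAP_FLAG_SKIP_SWAP };
	struct zap_details drop = { .zap_flags = ZAP_FLAG_DROP_FILE_UFFD_WP };

	assert(!zap_drop_file_uffd_wp(NULL));  /* e.g. callers passing NULL */
	assert(!zap_drop_file_uffd_wp(&keep)); /* unrelated flags keep it too */
	assert(zap_drop_file_uffd_wp(&drop));  /* explicit opt-in drops it */
}
```

So a caller that forgets about uffd-wp errs on the side of leaving a marker
behind rather than silently losing the protection.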

> For example the OOM killer seems to call unmap_page_range() with details ==
> NULL (although in practice only for anonymous vmas so it won't actually cause
> an issue). Similarly in madvise for MADV_DONTNEED, although arguably I
> suppose that is the correct thing to do there?

I must confess I'm not familiar with the oom code.  It looks to me like a fast
path to free pages that have a better chance of being reclaimed; even in
exit_mmap() we do this first, before unmap_vmas().  If it is only a fast path,
keeping the marker there still looks like the right thing to do, not to mention
that if it only runs on anonymous vmas then the marker logic is skipped anyway.

Basically I followed this rule: the bit should never be cleared unless (1) the
user manually clears it using UFFDIO_WRITEPROTECT, or (2) the whole region is
being unmapped.  There can be special cases, e.g. when unregistering a vma that
has VM_UFFD_WP, but that's rare, and we also have code to take care of those
lazily (e.g., in do_swap_pte we'll turn such a uffd-wp special pte back into a
none pte if we get a fault and find the vma is not registered with uffd-wp at
all).  Otherwise I never clear the bit.
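
As a rough userspace model of that rule (not kernel code): the event names and
uffd_wp_bit_survives() below are hypothetical and only encode my reading of the
paragraph above, including the lazy cleanup on fault:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical events that can hit a uffd-wp protected file-backed pte. */
enum wp_event {
	EV_UFFDIO_WRITEPROTECT_CLEAR,	/* user clears the protection */
	EV_UNMAP_WHOLE_REGION,		/* the whole region goes away */
	EV_OTHER_ZAP,			/* any other unmap of the pte */
	EV_FAULT,			/* page fault hitting the pte */
};

/*
 * Returns true if the uffd-wp information is still recorded after the event.
 * vma_registered_wp models "the vma is registered with uffd-wp".
 */
bool uffd_wp_bit_survives(enum wp_event ev, bool vma_registered_wp)
{
	switch (ev) {
	case EV_UFFDIO_WRITEPROTECT_CLEAR:	/* rule (1) */
	case EV_UNMAP_WHOLE_REGION:		/* rule (2) */
		return false;
	case EV_FAULT:
		/* Lazy cleanup: a stale marker is dropped on fault. */
		return vma_registered_wp;
	case EV_OTHER_ZAP:
		/* The marker pte keeps the information alive. */
		return true;
	}

	assert(0 && "unhandled event");
	return true;
}
```

Everything outside the two explicit cases and the lazy cleanup keeps the bit,
which is the "otherwise I never clear the bit" part.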

> 
> >  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> >  			     pte_t pte);
> >  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > index 355ea1ee32bd..c29a6ef3a642 100644
> > --- a/include/linux/mm_inline.h
> > +++ b/include/linux/mm_inline.h
> > @@ -4,6 +4,8 @@
> >  
> >  #include <linux/huge_mm.h>
> >  #include <linux/swap.h>
> > +#include <linux/userfaultfd_k.h>
> > +#include <linux/swapops.h>
> >  
> >  /**
> >   * page_is_file_lru - should the page be on a file LRU or anon LRU?
> > @@ -104,4 +106,45 @@ static __always_inline void del_page_from_lru_list(struct page *page,
> >  	update_lru_size(lruvec, page_lru(page), page_zonenum(page),
> >  			-thp_nr_pages(page));
> >  }
> > +
> > +/*
> > + * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
> > + * replace a none pte.  NOTE!  This should only be called when *pte is already
> > + * cleared so we will never accidentally replace something valuable.  Meanwhile
> > + * none pte also means we are not demoting the pte so if tlb flushed then we
> > + * don't need to do it again; otherwise if tlb flush is postponed then it's
> > + * even better.
> > + *
> > + * Must be called with pgtable lock held.
> > + */
> > +static inline void
> > +pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
> > +			      pte_t *pte, pte_t pteval)
> > +{
> > +#ifdef CONFIG_USERFAULTFD
> > +	bool arm_uffd_pte = false;
> > +
> > +	/* The current status of the pte should be "cleared" before calling */
> > +	WARN_ON_ONCE(!pte_none(*pte));
> > +
> > +	if (vma_is_anonymous(vma))
> > +		return;
> > +
> > +	/* A uffd-wp wr-protected normal pte */
> > +	if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
> > +		arm_uffd_pte = true;
> > +
> > +	/*
> > +	 * A uffd-wp wr-protected swap pte.  Note: this should even work for
> > +	 * pte_swp_uffd_wp_special() too.
> > +	 */
> 
> I'm probably missing something but when can we actually have this case and why
> would we want to leave a special pte behind? From what I can tell this is
> called from try_to_unmap_one() where this won't be true or from zap_pte_range()
> when not skipping swap pages.

Yes, this is a good question.

Initially I wrote this function to cover all forms of the uffd-wp bit, i.e.
both swap and present ptes; imho that's pretty safe.  However, for !anonymous
vmas we don't keep a swap entry inside the pte even if the page is swapped out,
as the entry lives in the shmem page cache instead.  The only missing piece
seems to be the device private entries, as you also spotted below.
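
For reference, the decision the function makes can be modelled in userspace
like this.  struct model_pte and should_arm_uffd_wp_marker() are hypothetical
names, and the struct is a flattened view of the old pte value rather than the
kernel's pte_t; it is only a sketch of the checks in the hunk:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flattened view of the pte value that was just cleared. */
struct model_pte {
	bool present;	/* models pte_present() */
	bool swap;	/* models is_swap_pte() */
	bool uffd_wp;	/* models pte_uffd_wp() / pte_swp_uffd_wp() */
};

/* Returns true if the cleared pte should be replaced by a uffd-wp marker. */
bool should_arm_uffd_wp_marker(bool vma_is_anonymous, struct model_pte old)
{
	/* A pte value cannot be both present and a swap pte. */
	assert(!(old.present && old.swap));

	/* Anonymous memory keeps its uffd-wp bit in the swap entry itself. */
	if (vma_is_anonymous)
		return false;

	/* A uffd-wp wr-protected normal pte. */
	if (old.present && old.uffd_wp)
		return true;

	/* A uffd-wp wr-protected swap pte (the branch discussed here). */
	if (old.swap && old.uffd_wp)
		return true;

	return false;
}
```

In this model the swap branch is the one that may never trigger for
file-backed vmas today, which is what the question is about.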

> 
> > +	if (unlikely(is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)))
> > +		arm_uffd_pte = true;
> > +
> > +	if (unlikely(arm_uffd_pte))
> > +		set_pte_at(vma->vm_mm, addr, pte,
> > +			   pte_swp_mkuffd_wp_special(vma));
> > +#endif
> > +}
> > +
> >  #endif
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 319552efc782..3453b8ae5f4f 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -73,6 +73,7 @@
> >  #include <linux/perf_event.h>
> >  #include <linux/ptrace.h>
> >  #include <linux/vmalloc.h>
> > +#include <linux/mm_inline.h>
> >  
> >  #include <trace/events/kmem.h>
> >  
> > @@ -1298,6 +1299,21 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> >  	return ret;
> >  }
> >  
> > +/*
> > + * This function makes sure that we'll replace the none pte with an uffd-wp
> > + * swap special pte marker when necessary. Must be with the pgtable lock held.
> > + */
> > +static inline void
> > +zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> > +			      unsigned long addr, pte_t *pte,
> > +			      struct zap_details *details, pte_t pteval)
> > +{
> > +	if (zap_drop_file_uffd_wp(details))
> > +		return;
> > +
> > +	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> > +}
> > +
> >  static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >  				struct vm_area_struct *vma, pmd_t *pmd,
> >  				unsigned long addr, unsigned long end,
> > @@ -1335,6 +1351,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >  			ptent = ptep_get_and_clear_full(mm, addr, pte,
> >  							tlb->fullmm);
> >  			tlb_remove_tlb_entry(tlb, pte, addr);
> > +			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> > +						      ptent);
> >  			if (unlikely(!page))
> >  				continue;
> >  
> > @@ -1359,6 +1377,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >  			continue;
> >  		}
> >  
> > +		/*
> > +		 * If this is a special uffd-wp marker pte... Drop it only if
> > +		 * enforced to do so.
> > +		 */
> > +		if (unlikely(is_swap_special_pte(ptent))) {
> > +			WARN_ON_ONCE(!pte_swp_uffd_wp_special(ptent));
> 
> Why the WARN_ON and not just test pte_swp_uffd_wp_special() directly?
> 
> > +			/*
> > +			 * If this is a common unmap of ptes, keep this as is.
> > +			 * Drop it only if this is a whole-vma destruction.
> > +			 */
> > +			if (zap_drop_file_uffd_wp(details))
> > +				ptep_get_and_clear_full(mm, addr, pte,
> > +							tlb->fullmm);
> > +			continue;
> > +		}
> > +
> >  		entry = pte_to_swp_entry(ptent);
> >  		if (is_device_private_entry(entry) ||
> >  		    is_device_exclusive_entry(entry)) {
> > @@ -1373,6 +1407,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >  				page_remove_rmap(page, false);
> >  
> >  			put_page(page);
> > +			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> > +						      ptent);
> 
> Device entries only support anonymous vmas at present so should we drop this?
> I guess I'm also a little confused by this because I'm not sure in what
> scenarios you would want to zap swap entries but leave special swap ptes behind
> (see also my earlier question above as well).

If that's the case, then indeed this may not be needed, and I can use a
WARN_ON_ONCE here instead, just in case the facts change.  E.g., would it be
possible one day to have !anonymous support for device private entries?
Frankly I have no solid idea of how device private is used, so some more
context would be nice too; you know this much better than I do, so maybe it's
a good chance for me to learn more about it. :)

> 
> >  			continue;
> >  		}
> >  
> > @@ -1390,6 +1426,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >  		if (unlikely(!free_swap_and_cache(entry)))
> >  			print_bad_pte(vma, addr, ptent, NULL);
> >  		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> > +		zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> >  	} while (pte++, addr += PAGE_SIZE, addr != end);
> >  
> >  	add_mm_rss_vec(mm, rss);
> > @@ -1589,12 +1626,15 @@ void unmap_vmas(struct mmu_gather *tlb,
> >  		unsigned long end_addr)
> >  {
> >  	struct mmu_notifier_range range;
> > +	struct zap_details details = {
> > +		.zap_flags = ZAP_FLAG_DROP_FILE_UFFD_WP,
> > +	};
> >  
> >  	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
> >  				start_addr, end_addr);
> >  	mmu_notifier_invalidate_range_start(&range);
> >  	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> > -		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> > +		unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
> >  	mmu_notifier_invalidate_range_end(&range);
> >  }
> >  
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0419c9a1a280..a94d9aed9d95 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -72,6 +72,7 @@
> >  #include <linux/page_idle.h>
> >  #include <linux/memremap.h>
> >  #include <linux/userfaultfd_k.h>
> > +#include <linux/mm_inline.h>
> >  
> >  #include <asm/tlbflush.h>
> >  
> > @@ -1509,6 +1510,13 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			pteval = ptep_clear_flush(vma, address, pvmw.pte);
> >  		}
> >  
> > +		/*
> > +		 * Now the pte is cleared.  If this is uffd-wp armed pte, we
> > +		 * may want to replace a none pte with a marker pte if it's
> > +		 * file-backed, so we don't lose the tracking information.
> > +		 */
> > +		pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
> 
> From what I can tell we don't need to do this in try_to_migrate_one() (assuming
> that goes in) as well because the existing uffd wp code already deals with
> copying the pte bits over to the migration entries. Is that correct?

I agree try_to_migrate_one() shouldn't need it.  But I'm not sure about
try_to_unmap_one(): I think we rely on it there to make shmem work when a page
gets swapped out.

Thanks,

-- 
Peter Xu

