From: Suren Baghdasaryan <surenb@google.com>
To: Jann Horn <jannh@google.com>
Cc: akpm@linux-foundation.org, viro@zeniv.linux.org.uk,
	brauner@kernel.org, shuah@kernel.org, aarcange@redhat.com,
	lokeshgidra@google.com, peterx@redhat.com, david@redhat.com,
	hughd@google.com, mhocko@suse.com, axelrasmussen@google.com,
	rppt@kernel.org, willy@infradead.org, Liam.Howlett@oracle.com,
	zhangpeng362@huawei.com, bgeffon@google.com,
	kaleshsingh@google.com, ngeoffray@google.com, jdduke@google.com,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	kernel-team@android.com
Subject: Re: [PATCH 2/3] userfaultfd: UFFDIO_REMAP uABI
Date: Tue, 19 Sep 2023 18:49:23 -0700
Message-ID: <CAJuCfpGEPPmEW6GqCjQXB5g04PA=BwhNgLVooM+DcroQj8RjyA@mail.gmail.com>
In-Reply-To: <CAG48ez212+UjQMB94vKvyV4YAEgg=jBhdzg_1b4BRe6=SO09fA@mail.gmail.com>

On Tue, Sep 19, 2023 at 4:51 PM Jann Horn <jannh@google.com> wrote:
>
> On Wed, Sep 20, 2023 at 1:08 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > On Thu, Sep 14, 2023 at 7:28 PM Jann Horn <jannh@google.com> wrote:
> > > On Thu, Sep 14, 2023 at 5:26 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > From: Andrea Arcangeli <aarcange@redhat.com>
> > > >
> > > > This implements the uABI of UFFDIO_REMAP.
> > > >
> > > > Notably one mode bitflag is also forwarded (and in turn known) by the
> > > > lowlevel remap_pages method.
> [...]
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> [...]
> > > > +int remap_pages_huge_pmd(struct mm_struct *dst_mm,
> > > > +                        struct mm_struct *src_mm,
> > > > +                        pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > +                        pmd_t dst_pmdval,
> > > > +                        struct vm_area_struct *dst_vma,
> > > > +                        struct vm_area_struct *src_vma,
> > > > +                        unsigned long dst_addr,
> > > > +                        unsigned long src_addr)
> > > > +{
> > > > +       pmd_t _dst_pmd, src_pmdval;
> > > > +       struct page *src_page;
> > > > +       struct anon_vma *src_anon_vma, *dst_anon_vma;
> > > > +       spinlock_t *src_ptl, *dst_ptl;
> > > > +       pgtable_t pgtable;
> > > > +       struct mmu_notifier_range range;
> > > > +
> > > > +       src_pmdval = *src_pmd;
> > > > +       src_ptl = pmd_lockptr(src_mm, src_pmd);
> > > > +
> > > > +       BUG_ON(!pmd_trans_huge(src_pmdval));
> > > > +       BUG_ON(!pmd_none(dst_pmdval));
> > >
> > > Why can we assert that pmd_none(dst_pmdval) is true here? Can we not
> > > have concurrent faults (or userfaultfd operations) populating that
> > > PMD?
> >
> > IIUC dst_pmdval is a copy of the value from dst_pmd, so that local
> > copy should not change even if some concurrent operation changes
> > dst_pmd. We can assert that it's pmd_none because we checked for that
> > before calling remap_pages_huge_pmd. Later on we check if dst_pmd
> > changed from under us (see pmd_same(*dst_pmd, dst_pmdval) check) and
> > retry if that happened.
>
> Oh, right, I don't know what I was thinking when I typed that.
>
> But now I wonder about the check directly above that: What does this
> code do for swap PMDs? It looks like that might splat on the
> BUG_ON(!pmd_trans_huge(src_pmdval)). All we've checked on the path to
> here is that the virtual memory area is aligned, that the destination
> PMD is empty, and that pmd_trans_huge_lock() succeeded; but
> pmd_trans_huge_lock() explicitly permits swap PMDs (which is the
> swapped-out version of transhuge PMDs):
>
> static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
>                 struct vm_area_struct *vma)
> {
>         if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
>                 return __pmd_trans_huge_lock(pmd, vma);
>         else
>                 return NULL;
> }

Yeah... Ok, I think I'm missing a check for pmd_trans_huge(*src_pmd)
after we lock it with pmd_trans_huge_lock(src_pmd, src_vma). With that
check in place we can also remove the above BUG_ON(). Would that
address your concern?
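
Something like this at the call site (completely untested sketch, just
to illustrate what I mean; the surrounding fallback/error handling is
elided and variable names are taken from this patch):

	ptl = pmd_trans_huge_lock(src_pmd, src_vma);
	if (ptl) {
		/*
		 * pmd_trans_huge_lock() also succeeds for swap (and
		 * devmap) PMDs, so re-check under the lock that this
		 * really is a present transhuge PMD before calling
		 * remap_pages_huge_pmd().
		 */
		if (!pmd_trans_huge(*src_pmd)) {
			spin_unlock(ptl);
			/* fall back to the PTE path or retry */
		} else {
			err = remap_pages_huge_pmd(dst_mm, src_mm,
						   dst_pmd, src_pmd,
						   dst_pmdval,
						   dst_vma, src_vma,
						   dst_addr, src_addr);
		}
	}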

>
> > >
> > > > +       BUG_ON(!spin_is_locked(src_ptl));
> > > > +       mmap_assert_locked(src_mm);
> > > > +       mmap_assert_locked(dst_mm);
> > > > +       BUG_ON(src_addr & ~HPAGE_PMD_MASK);
> > > > +       BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
> > > > +
> > > > +       src_page = pmd_page(src_pmdval);
> > > > +       BUG_ON(!PageHead(src_page));
> > > > +       BUG_ON(!PageAnon(src_page));
> > > > +       if (unlikely(page_mapcount(src_page) != 1)) {
> > > > +               spin_unlock(src_ptl);
> > > > +               return -EBUSY;
> > > > +       }
> > > > +
> > > > +       get_page(src_page);
> > > > +       spin_unlock(src_ptl);
> > > > +
> > > > +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, src_mm, src_addr,
> > > > +                               src_addr + HPAGE_PMD_SIZE);
> > > > +       mmu_notifier_invalidate_range_start(&range);
> > > > +
> > > > +       /* block all concurrent rmap walks */
> > > > +       lock_page(src_page);
> > > > +
> > > > +       /*
> > > > +        * split_huge_page walks the anon_vma chain without the page
> > > > +        * lock. Serialize against it with the anon_vma lock, the page
> > > > +        * lock is not enough.
> > > > +        */
> > > > +       src_anon_vma = folio_get_anon_vma(page_folio(src_page));
> > > > +       if (!src_anon_vma) {
> > > > +               unlock_page(src_page);
> > > > +               put_page(src_page);
> > > > +               mmu_notifier_invalidate_range_end(&range);
> > > > +               return -EAGAIN;
> > > > +       }
> > > > +       anon_vma_lock_write(src_anon_vma);
> > > > +
> > > > +       dst_ptl = pmd_lockptr(dst_mm, dst_pmd);
> > > > +       double_pt_lock(src_ptl, dst_ptl);
> > > > +       if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
> > > > +                    !pmd_same(*dst_pmd, dst_pmdval) ||
> > > > +                    page_mapcount(src_page) != 1)) {
> > > > +               double_pt_unlock(src_ptl, dst_ptl);
> > > > +               anon_vma_unlock_write(src_anon_vma);
> > > > +               put_anon_vma(src_anon_vma);
> > > > +               unlock_page(src_page);
> > > > +               put_page(src_page);
> > > > +               mmu_notifier_invalidate_range_end(&range);
> > > > +               return -EAGAIN;
> > > > +       }
> > > > +
> > > > +       BUG_ON(!PageHead(src_page));
> > > > +       BUG_ON(!PageAnon(src_page));
> > > > +       /* the PT lock is enough to keep the page pinned now */
> > > > +       put_page(src_page);
> > > > +
> > > > +       dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
> > > > +       WRITE_ONCE(src_page->mapping, (struct address_space *) dst_anon_vma);
> > > > +       WRITE_ONCE(src_page->index, linear_page_index(dst_vma, dst_addr));
> > > > +
> > > > +       if (!pmd_same(pmdp_huge_clear_flush(src_vma, src_addr, src_pmd),
> > > > +                     src_pmdval))
> > > > +               BUG_ON(1);
> > >
> > > I'm not sure we can assert that the PMDs are exactly equal; the CPU
> > > might have changed the A/D bits under us?
> >
> > Yes. I wonder if I can simply remove the BUG_ON here like this:
> >
> > src_pmdval = pmdp_huge_clear_flush(src_vma, src_addr, src_pmd);
> >
> > Technically we don't use src_pmdval after this but for the possible
> > future use that would keep things correct. If A/D bits changed from
> > under us we will still copy correct values into dst_pmd.
>
> And when we set up the dst_pmd, we always mark it as dirty and
> accessed... so I guess that's fine.

Ack.

>
> > > > +       _dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
> > > > +       _dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
> > > > +       set_pmd_at(dst_mm, dst_addr, dst_pmd, _dst_pmd);
> > > > +
> > > > +       pgtable = pgtable_trans_huge_withdraw(src_mm, src_pmd);
> > > > +       pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> > >
> > > Are we allowed to move page tables between mm_structs on all
> > > architectures? The first example I found that looks a bit dodgy,
> > > looking through various architectures' pte_alloc_one(), is s390's
> > > page_table_alloc() which looks like page tables are tied to per-MM
> > > lists sometimes.
> > > If that's not allowed, we might have to allocate a new deposit table
> > > and free the old one or something like that.
> >
> > Hmm. Yeah, looks like in the case of !CONFIG_PGSTE the table can be
> > linked to mm->context.pgtable_list, so can't be moved to another mm. I
> > guess I'll have to keep a pgtable allocated, ready to be deposited and
> > free the old one. Maybe it's worth having an arch-specific function
> > indicating whether moving a pgtable between MMs is supported? Or do it
> > separately as an optimization. WDYT?
>
> Hm, dunno. I guess you could have architectures opt in with some
> config flag similar to how flags like
> ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH are wired up - define it in
> init/Kconfig, select it in the architectures that support it, and then
> gate the fast version on that with #ifdef?

Yeah, that sounds good to me. I can implement the unoptimized common
path first and then introduce this optimization in a follow-up patch.
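
Roughly what I have in mind (the config symbol below is just a
placeholder I'm making up, not an existing one):

	# init/Kconfig
	config ARCH_SUPPORTS_PGTABLE_MOVE
		bool
		help
		  Page tables are not linked to per-MM state, so a
		  deposited page table can be moved to a different mm.

	# e.g. arch/x86/Kconfig
		select ARCH_SUPPORTS_PGTABLE_MOVE

and then in remap_pages_huge_pmd():

	#ifdef CONFIG_ARCH_SUPPORTS_PGTABLE_MOVE
		pgtable = pgtable_trans_huge_withdraw(src_mm, src_pmd);
		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
	#else
		/* deposit a freshly allocated table, free the src one */
	#endif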

>
> > > > +       if (dst_mm != src_mm) {
> > > > +               add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> > > > +               add_mm_counter(src_mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> > > > +       }
> > > > +       double_pt_unlock(src_ptl, dst_ptl);
> > > > +
> > > > +       anon_vma_unlock_write(src_anon_vma);
> > > > +       put_anon_vma(src_anon_vma);
> > > > +
> > > > +       /* unblock rmap walks */
> > > > +       unlock_page(src_page);
> > > > +
> > > > +       mmu_notifier_invalidate_range_end(&range);
> > > > +       return 0;
> > > > +}
> > > > +#endif /* CONFIG_USERFAULTFD */
> > > > +
> > > >  /*
> > > >   * Returns page table lock pointer if a given pmd maps a thp, NULL otherwise.
> > > >   *
> > > [...]
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index 96d9eae5c7cc..0cca60dfa8f8 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > [...]
> > > > +ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > > +                   unsigned long dst_start, unsigned long src_start,
> > > > +                   unsigned long len, __u64 mode)
> > > > +{
> > > [...]
> > > > +
> > > > +       if (pgprot_val(src_vma->vm_page_prot) !=
> > > > +           pgprot_val(dst_vma->vm_page_prot))
> > > > +               goto out;
> > >
> > > Does this check intentionally allow moving pages from a
> > > PROT_READ|PROT_WRITE anonymous private VMA into a PROT_READ anonymous
> > > private VMA (on architectures like x86 and arm64 where CoW memory has
> > > the same protection flags as read-only memory), but forbid moving them
> > > from a PROT_READ|PROT_EXEC VMA into a PROT_READ VMA? I think this
> > > check needs at least a comment to explain what's going on here.
> >
> > The check is simply to ensure the VMAs have the same access
> > permissions to prevent page copies that should have different
> > permissions. The fact that x86 and arm64 have the same protection bits
> > for R/O and COW memory is a "side-effect" IMHO. I'm not sure what
> > comment would be good here but I'm open to suggestions.
>
> I'm not sure if you can do a meaningful security check on the
> ->vm_page_prot. I also don't think it matters for anonymous VMAs.
> I guess if you want to keep this check but make this behavior more
> consistent, you could put another check in front of this that rejects
> VMAs where vm_flags like VM_READ, VM_WRITE, VM_SHARED or VM_EXEC are
> different?

Ok, I'll post that in the next version and we can decide if that's enough.
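
Something like this in front of the vm_page_prot comparison (untested,
reusing the same "goto out" error path):

	/*
	 * Make the check independent of how the arch encodes R/O vs.
	 * CoW in vm_page_prot: require the same access flags on both
	 * VMAs.
	 */
	if ((src_vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)) !=
	    (dst_vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)))
		goto out;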

>
> [...]
> > > > +       /*
> > > > +        * Ensure the dst_vma has a anon_vma or this page
> > > > +        * would get a NULL anon_vma when moved in the
> > > > +        * dst_vma.
> > > > +        */
> > > > +       err = -ENOMEM;
> > > > +       if (unlikely(anon_vma_prepare(dst_vma)))
> > > > +               goto out;
> > > > +
> > > > +       for (src_addr = src_start, dst_addr = dst_start;
> > > > +            src_addr < src_start + len;) {
> > > > +               spinlock_t *ptl;
> > > > +               pmd_t dst_pmdval;
> > > > +
> > > > +               BUG_ON(dst_addr >= dst_start + len);
> > > > +               src_pmd = mm_find_pmd(src_mm, src_addr);
> > >
> > > (this would blow up pretty badly if we could have transparent huge PUD
> > > in the region but I think that's limited to file VMAs so it's fine as
> > > it currently is)
> >
> > Should I add a comment here as a warning if in the future we decide to
> > implement support for file-backed pages?
>
> Hm, yeah, I guess that might be a good idea.

Ack.
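
Maybe something along these lines above the mm_find_pmd() call:

	/*
	 * No PUD-level THP handling here: remap_pages() only accepts
	 * private anonymous VMAs, and transparent huge PUDs are
	 * currently limited to file/DAX mappings. This needs to be
	 * revisited if file-backed support is ever added.
	 */
	src_pmd = mm_find_pmd(src_mm, src_addr);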

Thanks for the feedback!
