All of lore.kernel.org
 help / color / mirror / Atom feed
From: Yang Shi <shy828301@gmail.com>
To: David Hildenbrand <david@redhat.com>
Cc: linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Hugh Dickins <hughd@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	David Rientjes <rientjes@google.com>,
	Shakeel Butt <shakeelb@google.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	Michal Hocko <mhocko@kernel.org>, Nadav Amit <namit@vmware.com>,
	Rik van Riel <riel@surriel.com>, Roman Gushchin <guro@fb.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Peter Xu <peterx@redhat.com>, Donald Dutile <ddutile@redhat.com>,
	Christoph Hellwig <hch@lst.de>, Oleg Nesterov <oleg@redhat.com>,
	Jan Kara <jack@suse.cz>, Liang Zhang <zhangliang5@huawei.com>,
	Pedro Gomes <pedrodemargomes@gmail.com>,
	Oded Gabbay <oded.gabbay@gmail.com>,
	linux-mm@kvack.org
Subject: Re: [PATCH v2 11/15] mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
Date: Fri, 18 Mar 2022 13:29:18 -0700	[thread overview]
Message-ID: <CAHbLzko_VjR6_rx+i8Qn9cKU3mvLC2A0t92ZtYqmE7QD5PH8pg@mail.gmail.com> (raw)
In-Reply-To: <2b280ac6-9d39-58c5-b255-f39b1dac607b@redhat.com>

On Thu, Mar 17, 2022 at 2:06 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 16.03.22 22:23, Yang Shi wrote:
> > On Tue, Mar 15, 2022 at 3:52 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
> >> exclusive, and use that information to make GUP pins reliable and stay
> >> consistent with the page mapped into the page table even if the
> >> page table entry gets write-protected.
> >>
> >> With that information at hand, we can extend our COW logic to always
> >> reuse anonymous pages that are exclusive. For anonymous pages that
> >> might be shared, the existing logic applies.
> >>
> >> As already documented, PG_anon_exclusive is usually only expressive in
> >> combination with a page table entry. Especially PTE vs. PMD-mapped
> >> anonymous pages require more thought, some examples: due to mremap() we
> >> can easily have a single compound page PTE-mapped into multiple page tables
> >> exclusively in a single process -- multiple page table locks apply.
> >> Further, due to MADV_WIPEONFORK we might not necessarily write-protect
> >> all PTEs, and only some subpages might be pinned. Long story short: once
> >> PTE-mapped, we have to track information about exclusivity per sub-page,
> >> but until then, we can just track it for the compound page in the head
> >> page and not having to update a whole bunch of subpages all of the time
> >> for a simple PMD mapping of a THP.
> >>
> >> For simplicity, this commit mostly talks about "anonymous pages", while
> >> it's for THP actually "the part of an anonymous folio referenced via
> >> a page table entry".
> >>
> >> To not spill PG_anon_exclusive code all over the mm code-base, we let
> >> the anon rmap code to handle all PG_anon_exclusive logic it can easily
> >> handle.
> >>
> >> If a writable, present page table entry points at an anonymous (sub)page,
> >> that (sub)page must be PG_anon_exclusive. If GUP wants to take a reliably
> >> pin (FOLL_PIN) on an anonymous page references via a present
> >> page table entry, it must only pin if PG_anon_exclusive is set for the
> >> mapped (sub)page.
> >>
> >> This commit doesn't adjust GUP, so this is only implicitly handled for
> >> FOLL_WRITE, follow-up commits will teach GUP to also respect it for
> >> FOLL_PIN without !FOLL_WRITE, to make all GUP pins of anonymous pages
> >> fully reliable.
> >>
> >> Whenever an anonymous page is to be shared (fork(), KSM), or when
> >> temporarily unmapping an anonymous page (swap, migration), the relevant
> >> PG_anon_exclusive bit has to be cleared to mark the anonymous page
> >> possibly shared. Clearing will fail if there are GUP pins on the page:
> >> * For fork(), this means having to copy the page and not being able to
> >>   share it. fork() protects against concurrent GUP using the PT lock and
> >>   the src_mm->write_protect_seq.
> >> * For KSM, this means sharing will fail. For swap this means, unmapping
> >>   will fail, For migration this means, migration will fail early. All
> >>   three cases protect against concurrent GUP using the PT lock and a
> >>   proper clear/invalidate+flush of the relevant page table entry.
> >>
> >> This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
> >> pinned page gets mapped R/O and the successive write fault ends up
> >> replacing the page instead of reusing it. It improves the situation for
> >> O_DIRECT/vmsplice/... that still use FOLL_GET instead of FOLL_PIN,
> >> if fork() is *not* involved, however swapout and fork() are still
> >> problematic. Properly using FOLL_PIN instead of FOLL_GET for these
> >> GUP users will fix the issue for them.
> >>
> >> I. Details about basic handling
> >>
> >> I.1. Fresh anonymous pages
> >>
> >> page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
> >> given page exclusive via __page_set_anon_rmap(exclusive=1). As that is
> >> the mechanism fresh anonymous pages come into life (besides migration
> >> code where we copy the page->mapping), all fresh anonymous pages will
> >> start out as exclusive.
> >>
> >> I.2. COW reuse handling of anonymous pages
> >>
> >> When a COW handler stumbles over a (sub)page that's marked exclusive, it
> >> simply reuses it. Otherwise, the handler tries harder under page lock to
> >> detect if the (sub)page is exclusive and can be reused. If exclusive,
> >> page_move_anon_rmap() will mark the given (sub)page exclusive.
> >>
> >> Note that hugetlb code does not yet check for PageAnonExclusive(), as it
> >> still uses the old COW logic that is prone to the COW security issue
> >> because hugetlb code cannot really tolerate unnecessary/wrong COW as
> >> huge pages are a scarce resource.
> >>
> >> I.3. Migration handling
> >>
> >> try_to_migrate() has to try marking an exclusive anonymous page shared
> >> via page_try_share_anon_rmap(). If it fails because there are GUP pins
> >> on the page, unmap fails. migrate_vma_collect_pmd() and
> >> __split_huge_pmd_locked() are handled similarly.
> >>
> >> Writable migration entries implicitly point at shared anonymous pages.
> >> For readable migration entries that information is stored via a new
> >> "readable-exclusive" migration entry, specific to anonymous pages.
> >>
> >> When restoring a migration entry in remove_migration_pte(), information
> >> about exlusivity is detected via the migration entry type, and
> >> RMAP_EXCLUSIVE is set accordingly for
> >> page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that
> >> information.
> >>
> >> I.4. Swapout handling
> >>
> >> try_to_unmap() has to try marking the mapped page possibly shared via
> >> page_try_share_anon_rmap(). If it fails because there are GUP pins on the
> >> page, unmap fails. For now, information about exclusivity is lost. In the
> >> future, we might want to remember that information in the swap entry in
> >> some cases, however, it requires more thought, care, and a way to store
> >> that information in swap entries.
> >>
> >> I.5. Swapin handling
> >>
> >> do_swap_page() will never stumble over exclusive anonymous pages in the
> >> swap cache, as try_to_migrate() prohibits that. do_swap_page() always has
> >> to detect manually if an anonymous page is exclusive and has to set
> >> RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.
> >>
> >> I.6. THP handling
> >>
> >> __split_huge_pmd_locked() has to move the information about exclusivity
> >> from the PMD to the PTEs.
> >>
> >> a) In case we have a readable-exclusive PMD migration entry, simply insert
> >> readable-exclusive PTE migration entries.
> >>
> >> b) In case we have a present PMD entry and we don't want to freeze
> >> ("convert to migration entries"), simply forward PG_anon_exclusive to
> >> all sub-pages, no need to temporarily clear the bit.
> >>
> >> c) In case we have a present PMD entry and want to freeze, handle it
> >> similar to try_to_migrate(): try marking the page shared first. In case
> >> we fail, we ignore the "freeze" instruction and simply split ordinarily.
> >> try_to_migrate() will properly fail because the THP is still mapped via
> >> PTEs.
>
> Hi,
>
> thanks for the review!
>
> >
> > How come will try_to_migrate() fail? The afterward pvmw will find
> > those PTEs then convert them to migration entries anyway IIUC.
> >
>
> It will run into that code:
>
> >> @@ -1903,6 +1938,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
> >>                                 page_vma_mapped_walk_done(&pvmw);
> >>                                 break;
> >>                         }
> >> +                       VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) &&
> >> +                                      !anon_exclusive, page);
> >> +                       if (anon_exclusive &&
> >> +                           page_try_share_anon_rmap(subpage)) {
> >> +                               set_pte_at(mm, address, pvmw.pte, pteval);
> >> +                               ret = false;
> >> +                               page_vma_mapped_walk_done(&pvmw);
> >> +                               break;
> >> +                       }
>
> and similarly fail the page_try_share_anon_rmap(), at which point
> try_to_migrate() stops and the caller will still observe a
> "page_mapped() == true".

Thanks, I missed that. Yes, the page will still be mapped. This should
trigger the VM_WARN_ON_ONCE in unmap_page(), if this change will make
this happen more often, we may consider removing that warning even
though it is "once" since seeing a mapped page may become a normal
case (once DIO is switched to FOLL_PIN, it may be more often). Anyway
we don't have to remove it right now.

>
> --
> Thanks,
>
> David / dhildenb
>

  reply	other threads:[~2022-03-18 20:29 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-15 10:47 [PATCH v2 00/15] mm: COW fixes part 2: reliable GUP pins of anonymous pages David Hildenbrand
2022-03-15 10:47 ` [PATCH v2 01/15] mm/rmap: fix missing swap_free() in try_to_unmap() after arch_unmap_one() failed David Hildenbrand
2022-03-29 13:59   ` David Hildenbrand
2022-03-29 20:42     ` Khalid Aziz
2022-03-29 20:55       ` David Hildenbrand
2022-03-30 17:04         ` Khalid Aziz
2022-03-31 13:46           ` David Hildenbrand
2022-03-15 10:47 ` [PATCH v2 02/15] mm/hugetlb: take src_mm->write_protect_seq in copy_hugetlb_page_range() David Hildenbrand
2022-03-15 10:47 ` [PATCH v2 03/15] mm/memory: slightly simplify copy_present_pte() David Hildenbrand
2022-03-15 10:47 ` [PATCH v2 04/15] mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap() David Hildenbrand
2022-03-16 20:02   ` Yang Shi
2022-03-17  9:00     ` David Hildenbrand
2022-03-15 10:47 ` [PATCH v2 05/15] mm/rmap: convert RMAP flags to a proper distinct rmap_t type David Hildenbrand
2022-03-15 10:47 ` [PATCH v2 06/15] mm/rmap: remove do_page_add_anon_rmap() David Hildenbrand
2022-03-15 10:47 ` [PATCH v2 07/15] mm/rmap: pass rmap flags to hugepage_add_anon_rmap() David Hildenbrand
2022-03-15 10:47 ` [PATCH v2 08/15] mm/rmap: drop "compound" parameter from page_add_new_anon_rmap() David Hildenbrand
2022-03-15 10:47 ` [PATCH v2 09/15] mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() page exclusively David Hildenbrand
2022-03-15 10:47 ` [PATCH v2 10/15] mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages David Hildenbrand
2022-03-15 10:47 ` [PATCH v2 11/15] mm: remember exclusively mapped anonymous pages with PG_anon_exclusive David Hildenbrand
2022-03-16 21:23   ` Yang Shi
2022-03-17  9:06     ` David Hildenbrand
2022-03-18 20:29       ` Yang Shi [this message]
2022-03-19 10:21         ` David Hildenbrand
2022-03-19 10:50           ` David Hildenbrand
2022-03-21 20:56             ` Yang Shi
2022-03-22  9:41               ` David Hildenbrand
2022-03-21 20:51           ` Yang Shi
2022-03-15 10:47 ` [PATCH v2 12/15] mm/gup: disallow follow_page(FOLL_PIN) David Hildenbrand
2022-03-15 10:47 ` [PATCH v2 13/15] mm: support GUP-triggered unsharing of anonymous pages David Hildenbrand
2022-03-18 23:30   ` Jason Gunthorpe
2022-03-21 16:15     ` David Hildenbrand
2022-03-21 16:18       ` Jason Gunthorpe
2022-03-21 16:24         ` David Hildenbrand
2022-03-15 10:47 ` [PATCH v2 14/15] mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page David Hildenbrand
2022-03-15 10:47 ` [PATCH v2 15/15] mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning David Hildenbrand
2022-03-18 23:35   ` Jason Gunthorpe
2022-03-19 10:22     ` David Hildenbrand
2022-03-18 23:36 ` [PATCH v2 00/15] mm: COW fixes part 2: reliable GUP pins of anonymous pages Jason Gunthorpe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAHbLzko_VjR6_rx+i8Qn9cKU3mvLC2A0t92ZtYqmE7QD5PH8pg@mail.gmail.com \
    --to=shy828301@gmail.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=ddutile@redhat.com \
    --cc=guro@fb.com \
    --cc=hch@lst.de \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=mike.kravetz@oracle.com \
    --cc=namit@vmware.com \
    --cc=oded.gabbay@gmail.com \
    --cc=oleg@redhat.com \
    --cc=pedrodemargomes@gmail.com \
    --cc=peterx@redhat.com \
    --cc=riel@surriel.com \
    --cc=rientjes@google.com \
    --cc=rppt@linux.ibm.com \
    --cc=shakeelb@google.com \
    --cc=torvalds@linux-foundation.org \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=zhangliang5@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.