From: Peter Xu <peterx@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: peterx@redhat.com, Jason Gunthorpe <jgg@ziepe.ca>,
	John Hubbard <jhubbard@nvidia.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Christoph Hellwig <hch@lst.de>, Yang Shi <shy828301@gmail.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Kirill Shutemov <kirill@shutemov.name>,
	Hugh Dickins <hughd@google.com>, Jann Horn <jannh@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>, Jan Kara <jack@suse.cz>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Leon Romanovsky <leonro@nvidia.com>
Subject: [PATCH v2 4/4] mm/thp: Split huge pmds/puds if they're pinned when fork()
Date: Fri, 25 Sep 2020 18:26:00 -0400
Message-ID: <20200925222600.6832-5-peterx@redhat.com>
In-Reply-To: <20200925222600.6832-1-peterx@redhat.com>

Pinned pages shouldn't be write-protected when fork() happens, because a
follow-up copy-on-write on these pages could cause the pinned pages to be
replaced by random newly allocated pages.

For huge PMDs, we split the huge pmd if pinning is detected, so that future
handling will be done at the PTE level (with our latest changes, each of the
small pages will be copied).  We can achieve this by letting copy_huge_pmd()
return -EAGAIN for pinned pages, so that we'll fall through in copy_pmd_range()
and finally land in the next copy_pte_range() call.
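
As an illustration only, the fall-through described above can be modeled in
userspace roughly as follows.  This is a toy sketch, not kernel code: every
toy_* name is made up for the example, the three booleans merely stand in for
is_cow_mapping(), mm->has_pinned and page_maybe_dma_pinned(), and the -ENOMEM
path of the real copy_pmd_range() is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical toy model of one PMD-sized mapping in the parent. */
enum toy_level { TOY_HUGE_PMD, TOY_PTE_TABLE };
enum toy_result { TOY_SHARED_HUGE, TOY_COPIED_SMALL };

struct toy_pmd {
	enum toy_level level;	/* still huge, or already split */
	bool cow_mapping;	/* stands in for is_cow_mapping() */
	bool mm_has_pinned;	/* stands in for mm->has_pinned */
	bool maybe_pinned;	/* stands in for page_maybe_dma_pinned() */
};

/* Models copy_huge_pmd(): split and ask the caller to retry if pinned. */
static int toy_copy_huge_pmd(struct toy_pmd *src)
{
	if (src->cow_mapping && src->mm_has_pinned && src->maybe_pinned) {
		src->level = TOY_PTE_TABLE;	/* models __split_huge_pmd() */
		return -EAGAIN;
	}
	return 0;	/* huge page is write-protected and shared */
}

/* Models the per-entry step of copy_pmd_range(). */
static enum toy_result toy_copy_pmd(struct toy_pmd *src)
{
	if (src->level == TOY_HUGE_PMD) {
		if (!toy_copy_huge_pmd(src))
			return TOY_SHARED_HUGE;
		/* -EAGAIN: the entry must have been split; fall through */
		assert(src->level == TOY_PTE_TABLE);
	}
	/* models copy_pte_range(), where each small page is copied */
	return TOY_COPIED_SMALL;
}
```

Calling toy_copy_pmd() on an entry with all three flags set ends in the
small-page path and leaves the entry split; clearing any one flag keeps the
huge mapping shared.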

Huge PUDs are even more special: so far they do not support anonymous pages.
But they can be handled the same way as huge PMDs, even though splitting a huge
PUD means erasing the PUD entries.  That guarantees that follow-up page faults
will remap the same pages in either the parent or the child later.

This might not be the most efficient way, but it should be easy and clean
enough.  It should be fine, since we're tackling a very rare case, just to make
sure that userspace programs that pinned some THPs will still work even without
MADV_DONTFORK after they fork().
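
For reference, the MADV_DONTFORK approach that well-behaved userspace is
expected to use looks roughly like the sketch below.  The helper name is made
up for this example, and the actual pinning step (RDMA registration, for
instance) is only marked by a comment; the sketch just shows that a region
marked with madvise(MADV_DONTFORK) is not mapped in the child at all, so
there is nothing to write-protect or copy for it at fork() time.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the child could not access the
 * MADV_DONTFORK region (it died with SIGSEGV), 0 if it could, and -1 on
 * setup failure.
 */
static int dontfork_region_hidden_from_child(size_t len)
{
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int status;
	pid_t pid;

	if (buf == MAP_FAILED)
		return -1;
	memset(buf, 0x5a, len);		/* fault the pages in */

	/* Assumption: a real user would pin this buffer for DMA here. */
	if (madvise(buf, len, MADV_DONTFORK)) {
		munmap(buf, len);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		munmap(buf, len);
		return -1;
	}
	if (pid == 0) {
		/* Child: the region was not inherited, so this read faults. */
		char c = *(volatile char *)buf;
		(void)c;
		_exit(0);
	}

	if (waitpid(pid, &status, 0) != pid) {
		munmap(buf, len);
		return -1;
	}
	munmap(buf, len);
	return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

The parent keeps its mapping (and any pins on it) untouched.  This patch only
matters for programs that skip the madvise() call and fork() anyway.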

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/huge_memory.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index faadc449cca5..da397779a6d4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1074,6 +1074,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	src_page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+
+	/*
+	 * If this page is a potentially pinned page, split and retry the fault
+	 * with smaller page size.  Normally this should not happen because the
+	 * userspace should use MADV_DONTFORK upon pinned regions.  This is a
+	 * best effort that the pinned pages won't be replaced by another
+	 * random page during the coming copy-on-write.
+	 */
+	if (unlikely(is_cow_mapping(vma->vm_flags) &&
+		     atomic_read(&src_mm->has_pinned) &&
+		     page_maybe_dma_pinned(src_page))) {
+		pte_free(dst_mm, pgtable);
+		spin_unlock(src_ptl);
+		spin_unlock(dst_ptl);
+		__split_huge_pmd(vma, src_pmd, addr, false, NULL);
+		return -EAGAIN;
+	}
+
 	get_page(src_page);
 	page_dup_rmap(src_page, true);
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -1177,6 +1195,16 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		/* No huge zero pud yet */
 	}
 
+	/* Please refer to comments in copy_huge_pmd() */
+	if (unlikely(is_cow_mapping(vma->vm_flags) &&
+		     atomic_read(&src_mm->has_pinned) &&
+		     page_maybe_dma_pinned(pud_page(pud)))) {
+		spin_unlock(src_ptl);
+		spin_unlock(dst_ptl);
+		__split_huge_pud(vma, src_pud, addr);
+		return -EAGAIN;
+	}
+
 	pudp_set_wrprotect(src_mm, addr, src_pud);
 	pud = pud_mkold(pud_wrprotect(pud));
 	set_pud_at(dst_mm, addr, dst_pud, pud);
-- 
2.26.2


Thread overview: 11+ messages
2020-09-25 22:25 [PATCH v2 0/4] mm: Break COW for pinned pages during fork() Peter Xu
2020-09-25 22:25 ` [PATCH v2 1/4] mm: Introduce mm_struct.has_pinned Peter Xu
2020-09-25 22:25 ` [PATCH v2 2/4] mm/fork: Pass new vma pointer into copy_page_range() Peter Xu
2020-09-30 13:30   ` Kirill A. Shutemov
2020-09-30 17:05     ` Peter Xu
2020-09-25 22:25 ` [PATCH v2 3/4] mm: Do early cow for pinned pages during fork() for ptes Peter Xu
2020-09-26 23:23   ` Jason Gunthorpe
2020-09-27  0:04     ` Linus Torvalds
2020-09-25 22:26 ` [PATCH v2 4/4] mm/thp: Split huge pmds/puds if they're pinned when fork() Peter Xu
2020-09-27 19:35 ` [PATCH v2 0/4] mm: Break COW for pinned pages during fork() Linus Torvalds
2020-09-29 11:02   ` Leon Romanovsky
