From: Linus Torvalds <torvalds@linux-foundation.org>
To: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>,
	John Hubbard <jhubbard@nvidia.com>, Linux-MM <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>, Michal Hocko <mhocko@suse.com>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Kirill Shutemov <kirill@shutemov.name>,
	Hugh Dickins <hughd@google.com>, Christoph Hellwig <hch@lst.de>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Oleg Nesterov <oleg@redhat.com>,
	Leon Romanovsky <leonro@nvidia.com>, Jann Horn <jannh@google.com>
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned
Date: Fri, 25 Sep 2020 14:06:59 -0700	[thread overview]
Message-ID: <CAHk-=whDSH_MRMt80JaSwoquzt=1nQ-0n3w0aVngoWPAc10BCw@mail.gmail.com> (raw)
In-Reply-To: <CAHk-=wgz5SXKA6-uZ_BimOP1C7pHJag0ndz=tnJDAZS_Z+FrGQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1083 bytes --]

On Fri, Sep 25, 2020 at 12:56 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And honestly, since this is all getting fairly late in the rc, and it
> took longer than I thought, I think we should do the GFP_ATOMIC
> approach for now - not great, but since it only triggers for this case
> that really should never happen anyway, I think it's probably the best
> thing for 5.9, and we can improve on things later.

I'm not super-happy with this patch, but I'm throwing it out anyway, in case

 (a) somebody can test it - I don't have any test cases

 (b) somebody can find issues and improve on it

but it's the simplest patch I can come up with for the small-page case.

I have *NOT* tested it. I have tried to think about it, and there are
more lines of comments than there are lines of code, but that only
means that if I didn't think about some case, it's neither in the
comments nor in the code.

I'm happy to take Peter's series too, this is more of an alternative
simplified version to keep the discussion going.

Hmm? What did I miss?

                     Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 5633 bytes --]

 mm/memory.c | 128 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 122 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f3eb55975902..49ceddd91db4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -773,7 +773,115 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	return 0;
 }
 
-static inline void
+/*
+ * Copy a single small page for fork().
+ *
+ * We have already marked it read-only in the parent if
+ * it's a COW page, and the pte passed in has also been
+ * marked read-only. So the normal thing to do is to
+ * simply increase the page count and the page mapping
+ * count, and the rss, and use the pte as-is. Done.
+ *
+ * However, there is one situation where we can't just
+ * rely on the COW behavior - if the page has been pinned
+ * for DMA in the parent, we can't just give a reference
+ * to it to the child, and say "whoever writes to it will
+ * force a COW". No, the pinned page needs to remain
+ * with the parent, and we need to give the child a copy.
+ *
+ * NOTE! This should never happen. Good pinning users
+ * will either not fork, or will mark the area they pinned
+ * as MADV_DONTFORK so that this situation never comes up.
+ * But if you don't do that...
+ *
+ * Note that if a small page has been pinned, we know the
+ * mapcount for that page should be 1, since the pinning
+ * will have done the COW at that point. So together with
+ * the elevated refcount, we have very solid heuristics
+ * for "is this page something we need to worry about"
+ */
+static int copy_normal_page(struct vm_area_struct *vma, unsigned long addr,
+		struct mm_struct *src_mm, struct mm_struct *dst_mm,
+		pte_t *src_pte, pte_t *dst_pte,
+		struct page *src_page, int *rss)
+{
+	struct page *dst_page;
+
+	if (likely(!page_maybe_dma_pinned(src_page)))
+		goto reuse_page;
+
+	if (!is_cow_mapping(vma->vm_flags))
+		goto reuse_page;
+
+	if (__page_mapcount(src_page) != 1)
+		goto reuse_page;
+
+	if (!vma->anon_vma || !pte_dirty(*src_pte))
+		goto reuse_page;
+
+	/*
+	 * We have now checked that the page count implies that
+	 * it's pinned, and that it's mapped only in this process,
+	 * and that it's dirty and we have an anon_vma (so it's
+	 * an actual write pin, not some read-only one).
+	 *
+	 * That means we have to treat it specially. Nasty.
+	 */
+
+	/*
+	 * Note the wrong 'vma' - source rather than destination.
+	 * It's only used for policy, which is the same.
+	 *
+	 * The bigger issue is that we're holding the ptl lock,
+	 * so this needs to be a non-sleeping allocation.
+	 */
+	dst_page = alloc_page_vma(GFP_ATOMIC | __GFP_HIGH | __GFP_NOWARN, vma, addr);
+	if (!dst_page)
+		return -ENOMEM;
+
+	if (mem_cgroup_charge(dst_page, dst_mm, GFP_ATOMIC)) {
+		put_page(dst_page);
+		return -ENOMEM;
+	}
+	cgroup_throttle_swaprate(dst_page, GFP_ATOMIC);
+	__SetPageUptodate(dst_page);
+
+	copy_user_highpage(dst_page, src_page, addr, vma);
+	*dst_pte = mk_pte(dst_page, vma->vm_page_prot);
+
+	/*
+	 * NOTE! This uses the wrong vma again, but the only things
+	 * that matter are the vma flags and anon_vma, which are
+	 * the same for source and destination.
+	 */
+	page_add_new_anon_rmap(dst_page, vma, addr, false);
+	lru_cache_add_inactive_or_unevictable(dst_page, vma);
+	rss[mm_counter(dst_page)]++;
+
+	/*
+	 * Final note: make the source writable again. The fact that
+	 * it was unwritable means that we didn't race with any new
+	 * PIN events using fast-GUP, and we've held on to the page
+	 * table lock the whole time so it's safe to just make it
+	 * writable again here.
+	 *
+	 * We might race with hardware walkers, but the dirty bit
+	 * was already set, so no fear of losing a race with a hw
+	 * walker that sets that.
+	 */
+	if (vma->vm_flags & VM_WRITE)
+		*src_pte = pte_mkwrite(*src_pte);
+
+	return 0;
+
+reuse_page:
+	get_page(src_page);
+	page_dup_rmap(src_page, false);
+	rss[mm_counter(src_page)]++;
+	return 0;
+}
+
+static inline int
 copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
 		unsigned long addr, int *rss)
@@ -809,12 +917,15 @@ copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
-		get_page(page);
-		page_dup_rmap(page, false);
-		rss[mm_counter(page)]++;
+		int error;
+
+		error = copy_normal_page(vma, addr, src_mm, dst_mm, src_pte, &pte, page, rss);
+		if (error)
+			return error;
 	}
 
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+	return 0;
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -824,7 +935,7 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
-	int progress = 0;
+	int progress = 0, error = 0;
 	int rss[NR_MM_COUNTERS];
 	swp_entry_t entry = (swp_entry_t){0};
 
@@ -865,8 +976,10 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			progress += 8;
 			continue;
 		}
-		copy_present_pte(dst_mm, src_mm, dst_pte, src_pte,
+		error = copy_present_pte(dst_mm, src_mm, dst_pte, src_pte,
 				 vma, addr, rss);
+		if (error)
+			break;
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
@@ -877,6 +990,9 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pte_unmap_unlock(orig_dst_pte, dst_ptl);
 	cond_resched();
 
+	if (error)
+		return error;
+
 	if (entry.val) {
 		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
 			return -ENOMEM;

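For context, the patch comment points at MADV_DONTFORK as the way well-behaved
pinning users avoid this early-copy path entirely. Below is a minimal userspace
sketch of that pattern; it is illustrative only and not part of the patch, and
the buffer size and the driver-registration step are placeholders:

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1 << 20;	/* placeholder: 1 MiB buffer to be pinned for DMA */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	/* ... hand 'buf' to the DMA-capable driver here, which pins the pages ... */

	/* Keep the pinned region out of any child: fork() will not map it there */
	if (madvise(buf, len, MADV_DONTFORK))
		return 1;

	/* A later fork() never gives the child these pages, so no COW-vs-DMA races */
	return 0;
}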