From: Peter Xu
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: peterx@redhat.com, Linus Torvalds, Michal Hocko, Kirill Shutemov,
    Jann Horn, Oleg Nesterov, Kirill Tkhai, Hugh Dickins, Leon Romanovsky,
    Jan Kara, John Hubbard, Christoph Hellwig, Andrew Morton,
    Jason Gunthorpe, Andrea Arcangeli
Subject: [PATCH 4/5] mm: Do early cow for pinned pages during fork() for ptes
Date: Mon, 21 Sep 2020 17:20:28 -0400
Message-Id: <20200921212028.25184-1-peterx@redhat.com>
In-Reply-To: <20200921211744.24758-1-peterx@redhat.com>
References: <20200921211744.24758-1-peterx@redhat.com>

This patch is greatly inspired by the discussions on the list from Linus,
Jason Gunthorpe and others [1].

It allows copy_pte_range() to do early cow if the pages were pinned on the
source mm.  Currently we don't have an accurate way to know whether a page
is pinned or not; the only thing we have is page_maybe_dma_pinned().
However, that's good enough for now, especially with the newly added
mm->has_pinned flag to make sure we won't affect processes that never
pinned any pages.

It would be easier if we could do a GFP_KERNEL allocation within
copy_one_pte().  Unfortunately we can't, because the page table locks are
held for both the parent and the child processes, so the page copy needs
to be done outside copy_one_pte().

The new COPY_MM_BREAK_COW return code is introduced for this:
copy_one_pte() returns it when it finds a pte that may need an early break
of cow.  page_duplicate() then handles the page copy in copy_pte_range(),
after the locks have been released.  The slightly tricky part is that
page_duplicate() fills copy_mm_data with the newly copied page, and we
then need to re-install the pte with the page table locks held again.
That's done in pte_install_copied_page().
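
In outline, the break-cow dance looks as below (a simplified sketch for
illustration only, not the patch itself; locking, rss accounting and most
of the error handling are elided, and the "..." stand for the argument
lists; the real code is in the diff further down):

again:
	/* with both page table locks held */
	if (data.cow_new_page && pte_install_copied_page(..., &data))
		/* copied page armed into the child; move to the next pte */;
	switch (copy_one_pte(..., &data)) {
	case COPY_MM_BREAK_COW:
		/* drop the locks, then copy the pinned page outside them */
		ret = page_duplicate(src_mm, vma, addr, &data);
		if (ret)
			return ret;	/* e.g. -ENOMEM */
		goto again;		/* retake the locks, retry this pte */
	...
	}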
The whole procedure looks quite similar to wp_page_copy(), but it's
simpler because we know the page is special (pinned) and we know we don't
need TLB flushes, since no one is referencing the new mm yet.

Though we still have to be very careful to keep track of the two pages
(the old source page and the newly allocated page) across all the lock
taking/releasing, and to make sure neither of them gets lost.

[1] https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/

Suggested-by: Linus Torvalds
Signed-off-by: Peter Xu
---
 mm/memory.c | 174 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 167 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1530bb1070f4..8f3521be80ca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -691,12 +691,72 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 #define COPY_MM_DONE		0
 #define COPY_MM_SWAP_CONT	1
+#define COPY_MM_BREAK_COW	2
 
 struct copy_mm_data {
 	/* COPY_MM_SWAP_CONT */
 	swp_entry_t entry;
+	/* COPY_MM_BREAK_COW */
+	struct {
+		struct page *cow_old_page; /* Released by page_duplicate() */
+		struct page *cow_new_page; /* Released by page_release_cow() */
+		pte_t cow_oldpte;
+	};
 };
 
+static inline void page_release_cow(struct copy_mm_data *data)
+{
+	/* The old page should only be released in page_duplicate() */
+	WARN_ON_ONCE(data->cow_old_page);
+
+	if (data->cow_new_page) {
+		put_page(data->cow_new_page);
+		data->cow_new_page = NULL;
+	}
+}
+
+/*
+ * Duplicate the page for this PTE.  Returns zero if page copied (so we
+ * need to retry on the same PTE again to arm the copied page very soon),
+ * or negative if error happened.  In all cases, the old page will be
+ * properly released.
+ */
+static int page_duplicate(struct mm_struct *src_mm, struct vm_area_struct *vma,
+			  unsigned long address, struct copy_mm_data *data)
+{
+	struct page *new_page = NULL;
+	int ret;
+
+	/* This should have been set in copy_one_pte() when we reach here */
+	WARN_ON_ONCE(!data->cow_old_page);
+
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+	if (!new_page) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	copy_user_highpage(new_page, data->cow_old_page, address, vma);
+	ret = mem_cgroup_charge(new_page, src_mm, GFP_KERNEL);
+	if (ret) {
+		put_page(new_page);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	cgroup_throttle_swaprate(new_page, GFP_KERNEL);
+	__SetPageUptodate(new_page);
+
+	/* So far so good; arm the new page for the next attempt */
+	data->cow_new_page = new_page;
+
+out:
+	/* Always release the old page */
+	put_page(data->cow_old_page);
+	data->cow_old_page = NULL;
+
+	return ret;
+}
+
 /*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
@@ -711,6 +771,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	bool wp;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -789,10 +850,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * If it's a COW mapping, write protect it both
 	 * in the parent and the child
 	 */
-	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
-		pte = pte_wrprotect(pte);
-	}
+	wp = is_cow_mapping(vm_flags) && pte_write(pte);
 
 	/*
 	 * If it's a shared mapping, mark it clean in
@@ -813,15 +871,80 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
 		get_page(page);
+
+		/*
+		 * If the page is pinned in the source mm, do early cow right
+		 * now so that the pinned page won't be replaced by another
+		 * random page without being noticed after the fork().
+		 *
+		 * Note: there can be some very rare cases in which we'll do
+		 * unnecessary cow here, because page_maybe_dma_pinned() is
+		 * sometimes bogus, and the has_pinned flag is currently
+		 * aggressive too.  However this should be good enough for us
+		 * for now, as long as we covered all the pinned pages.  We
+		 * can make this better in the future by providing an
+		 * accurate accounting for pinned pages.
+		 *
+		 * Because we'll need to release the locks before doing cow,
+		 * pass this work to the upper layer.
+		 */
+		if (READ_ONCE(src_mm->has_pinned) && wp &&
+		    page_maybe_dma_pinned(page)) {
+			/* We've got the page already; we're safe */
+			data->cow_old_page = page;
+			data->cow_oldpte = *src_pte;
+			return COPY_MM_BREAK_COW;
+		}
+
 		page_dup_rmap(page, false);
 		rss[mm_counter(page)]++;
 	}
 
+	if (wp) {
+		ptep_set_wrprotect(src_mm, addr, src_pte);
+		pte = pte_wrprotect(pte);
+	}
+
 out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
 	return COPY_MM_DONE;
 }
 
+/*
+ * Install the pte with the copied page stored in `data'.  Returns true
+ * when installation completes, or false when the src pte has changed.
+ */
+static int pte_install_copied_page(struct mm_struct *dst_mm,
+				   struct vm_area_struct *new,
+				   pte_t *src_pte, pte_t *dst_pte,
+				   unsigned long addr, int *rss,
+				   struct copy_mm_data *data)
+{
+	struct page *new_page = data->cow_new_page;
+	pte_t entry;
+
+	if (!pte_same(*src_pte, data->cow_oldpte)) {
+		/* PTE has changed under us.  Release the page and retry */
+		page_release_cow(data);
+		return false;
+	}
+
+	entry = mk_pte(new_page, new->vm_page_prot);
+	entry = pte_sw_mkyoung(entry);
+	entry = maybe_mkwrite(pte_mkdirty(entry), new);
+	page_add_new_anon_rmap(new_page, new, addr, false);
+	set_pte_at(dst_mm, addr, dst_pte, entry);
+	rss[mm_counter(new_page)]++;
+
+	/*
+	 * Manually clear the new page pointer since we've moved ownership
+	 * to the newly armed PTE.
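+	 * Otherwise a later page_release_cow() (e.g. on the -ENOMEM path of
+	 * pte_alloc_map_lock()) would put_page() the reference we have just
+	 * handed over to the child's mapping.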
+	 */
+	data->cow_new_page = NULL;
+
+	return true;
+}
+
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			  pmd_t *dst_pmd, pmd_t *src_pmd,
 			  struct vm_area_struct *vma, struct vm_area_struct *new,
@@ -830,16 +953,23 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
-	int progress, copy_ret = COPY_MM_DONE;
+	int progress, ret, copy_ret = COPY_MM_DONE;
 	int rss[NR_MM_COUNTERS];
 	struct copy_mm_data data;
 
 again:
+	/* We don't reset this for COPY_MM_BREAK_COW */
+	memset(&data, 0, sizeof(data));
+
+again_break_cow:
 	init_rss_vec(rss);
 
 	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
-	if (!dst_pte)
+	if (!dst_pte) {
+		/* Guarantee that the new page is released if there is one */
+		page_release_cow(&data);
 		return -ENOMEM;
+	}
 	src_pte = pte_offset_map(src_pmd, addr);
 	src_ptl = pte_lockptr(src_mm, src_pmd);
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -859,6 +989,25 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
 				break;
 		}
+
+		if (unlikely(data.cow_new_page)) {
+			/*
+			 * If cow_new_page is set, we must be at the 2nd round
+			 * of a previous COPY_MM_BREAK_COW.  Try to arm the
+			 * new page now.  Note that in all cases
+			 * page_release_cow() will properly release the
+			 * objects in copy_mm_data.
+			 */
+			WARN_ON_ONCE(copy_ret != COPY_MM_BREAK_COW);
+			if (pte_install_copied_page(dst_mm, new, src_pte,
+						    dst_pte, addr, rss,
+						    &data)) {
+				/* We installed the pte successfully; move on */
+				progress++;
+				continue;
+			}
+			/* PTE changed.  Retry this pte (falls through) */
+		}
+
 		if (pte_none(*src_pte)) {
 			progress++;
 			continue;
@@ -882,8 +1031,19 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		if (add_swap_count_continuation(data.entry, GFP_KERNEL) < 0)
 			return -ENOMEM;
 		break;
-	default:
+	case COPY_MM_BREAK_COW:
+		/* Do accounting onto parent mm directly */
+		ret = page_duplicate(src_mm, vma, addr, &data);
+		if (ret)
+			return ret;
+		goto again_break_cow;
+	case COPY_MM_DONE:
+		/* This means we're all good. */
 		break;
+	default:
+		/* This should mean copy_ret < 0.  Time to fail this fork().. */
+		WARN_ON_ONCE(copy_ret >= 0);
+		return copy_ret;
 	}
 
 	if (addr != end)
-- 
2.26.2