From: Zi Yan
To: linux-mm@kvack.org
Cc: "Kirill A . Shutemov", Roman Gushchin, Rik van Riel, Matthew Wilcox,
	Shakeel Butt, Yang Shi, Jason Gunthorpe, Mike Kravetz, Michal Hocko,
	David Hildenbrand, William Kucharski, Andrea Arcangeli, John Hubbard,
	David Nellans, linux-kernel@vger.kernel.org
Subject: [RFC PATCH v2 01/30] mm/pagewalk: use READ_ONCE when reading the PUD entry unlocked
Date: Mon, 28 Sep 2020 13:53:59 -0400
Message-Id: <20200928175428.4110504-2-zi.yan@sent.com>
In-Reply-To: <20200928175428.4110504-1-zi.yan@sent.com>
References: <20200928175428.4110504-1-zi.yan@sent.com>
Reply-To: Zi Yan

From: Jason Gunthorpe

The pagewalker runs while only holding the mmap_sem for read. The pud
can be set asynchronously, while also holding the mmap_sem for read,
e.g. from:

  handle_mm_fault()
    __handle_mm_fault()
      create_huge_pmd()
        dev_dax_huge_fault()
          __dev_dax_pud_fault()
            vmf_insert_pfn_pud()
              insert_pfn_pud()
                pud_lock()
                set_pud_at()

At least x86 sets the PUD using WRITE_ONCE(), so an unlocked read of
unstable data should be paired with READ_ONCE().

For the pagewalker to work locklessly the PUD must work similarly to
the PMD: once the PUD entry becomes a pointer to a PMD, it must be
stable, and safe to pass to pmd_offset().

Passing the value from READ_ONCE into the callbacks prevents the
callers from seeing inconsistencies after they re-read, such as seeing
pud_none(). If a callback does obtain the pud_lock then it should
trigger ACTION_AGAIN if a data race caused the original value to
change.

Use the same pattern as gup_pmd_range() and pass in the address of the
local READ_ONCE stack variable to pmd_offset() to avoid reading it
again.
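For readers following the locking argument above, here is a minimal
userspace sketch of the snapshot-then-revalidate idiom this patch
applies. It is not kernel code: the names entry, entry_lock,
entry_update, entry_callback and walk_once are invented for
illustration, and C11 atomics stand in for WRITE_ONCE()/READ_ONCE().
The walker takes a single unlocked snapshot of the shared entry, all
later decisions use that snapshot, and a callback that later takes the
lock compares the live value against the snapshot and requests a retry
(the analogue of ACTION_AGAIN) if they differ:

/* Userspace sketch only -- all names invented; C11 atomics stand in
 * for the kernel's WRITE_ONCE()/READ_ONCE(). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic uint64_t entry;		/* the shared "PUD entry" */
static pthread_mutex_t entry_lock = PTHREAD_MUTEX_INITIALIZER;

enum action { ACTION_CONTINUE, ACTION_AGAIN };

/* A writer publishes a new value under the lock, much as
 * set_pud_at() runs under pud_lock(). */
static void entry_update(uint64_t val)
{
	pthread_mutex_lock(&entry_lock);
	atomic_store_explicit(&entry, val, memory_order_relaxed);
	pthread_mutex_unlock(&entry_lock);
}

/* A callback that takes the lock must revalidate the unlocked
 * snapshot it was handed, and request a retry if it went stale. */
static enum action entry_callback(uint64_t snapshot)
{
	enum action act = ACTION_CONTINUE;

	pthread_mutex_lock(&entry_lock);
	if (atomic_load_explicit(&entry, memory_order_relaxed) != snapshot)
		act = ACTION_AGAIN;	/* raced with a concurrent update */
	pthread_mutex_unlock(&entry_lock);
	return act;
}

static void walk_once(void)
{
	uint64_t snapshot;

again:
	/* One unlocked read; every later decision uses this snapshot,
	 * never a fresh dereference that could observe a new value. */
	snapshot = atomic_load_explicit(&entry, memory_order_relaxed);
	if (entry_callback(snapshot) == ACTION_AGAIN)
		goto again;
	printf("walked entry value %llu\n", (unsigned long long)snapshot);
}

int main(void)
{
	entry_update(42);
	walk_once();
	return 0;
}

In the patch itself, walk_pud_range() plays the role of walk_once():
it snapshots *pudp with READ_ONCE(), hands both the value and the
pointer to ->pud_entry(), and the hmm callback does the locked
revalidation with memcmp() before trusting the snapshot.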
Signed-off-by: Jason Gunthorpe
---
 include/linux/pagewalk.h   |  2 +-
 mm/hmm.c                   | 16 +++++++---------
 mm/mapping_dirty_helpers.c |  6 ++----
 mm/pagewalk.c              | 28 ++++++++++++++++------------
 mm/ptdump.c                |  3 +--
 5 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index b1cb6b753abb..6caf28aadafb 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -39,7 +39,7 @@ struct mm_walk_ops {
 			 unsigned long next, struct mm_walk *walk);
 	int (*p4d_entry)(p4d_t *p4d, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
-	int (*pud_entry)(pud_t *pud, unsigned long addr,
+	int (*pud_entry)(pud_t pud, pud_t *pudp, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
 	int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
diff --git a/mm/hmm.c b/mm/hmm.c
index 943cb2ba4442..419e9e50fd51 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -402,28 +402,26 @@ static inline unsigned long pud_to_hmm_pfn_flags(struct hmm_range *range,
 		hmm_pfn_flags_order(PUD_SHIFT - PAGE_SHIFT);
 }
 
-static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
-		struct mm_walk *walk)
+static int hmm_vma_walk_pud(pud_t pud, pud_t *pudp, unsigned long start,
+		unsigned long end, struct mm_walk *walk)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
 	unsigned long addr = start;
-	pud_t pud;
 	int ret = 0;
 	spinlock_t *ptl = pud_trans_huge_lock(pudp, walk->vma);
 
 	if (!ptl)
 		return 0;
+	if (memcmp(pudp, &pud, sizeof(pud)) != 0) {
+		walk->action = ACTION_AGAIN;
+		spin_unlock(ptl);
+		return 0;
+	}
 
 	/* Normally we don't want to split the huge page */
 	walk->action = ACTION_CONTINUE;
 
-	pud = READ_ONCE(*pudp);
-	if (pud_none(pud)) {
-		spin_unlock(ptl);
-		return hmm_vma_walk_hole(start, end, -1, walk);
-	}
-
 	if (pud_huge(pud) && pud_devmap(pud)) {
 		unsigned long i, npages, pfn;
 		unsigned int required_fault;
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index 2c7d03675903..9fc46ebef497 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -150,11 +150,9 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
  * causes dirty info loss. The pagefault handler should do
  * that if needed.
  */
-static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
-			      struct mm_walk *walk)
+static int wp_clean_pud_entry(pud_t pudval, pud_t *pudp, unsigned long addr,
+			      unsigned long end, struct mm_walk *walk)
 {
-	pud_t pudval = READ_ONCE(*pud);
-
 	if (!pud_trans_unstable(&pudval))
 		return 0;
 
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..15d1e423b4a3 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	return err;
 }
 
-static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+static int walk_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
 	pmd_t *pmd;
@@ -67,7 +67,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	int err = 0;
 	int depth = real_depth(3);
 
-	pmd = pmd_offset(pud, addr);
+	pmd = pmd_offset(&pud, addr);
 	do {
 again:
 		next = pmd_addr_end(addr, end);
@@ -119,17 +119,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
-	pud_t *pud;
+	pud_t *pudp;
+	pud_t pud;
 	unsigned long next;
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
 	int depth = real_depth(2);
 
-	pud = pud_offset(p4d, addr);
+	pudp = pud_offset(p4d, addr);
 	do {
 again:
+		pud = READ_ONCE(*pudp);
 		next = pud_addr_end(addr, end);
-		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
+		if (pud_none(pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
@@ -140,27 +142,29 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 		walk->action = ACTION_SUBTREE;
 
 		if (ops->pud_entry)
-			err = ops->pud_entry(pud, addr, next, walk);
+			err = ops->pud_entry(pud, pudp, addr, next, walk);
 		if (err)
 			break;
 
 		if (walk->action == ACTION_AGAIN)
 			goto again;
 
-		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
+		if ((!walk->vma && (pud_leaf(pud) || !pud_present(pud))) ||
 		    walk->action == ACTION_CONTINUE ||
 		    !(ops->pmd_entry || ops->pte_entry))
 			continue;
 
-		if (walk->vma)
-			split_huge_pud(walk->vma, pud, addr);
-		if (pud_none(*pud))
-			goto again;
+		if (walk->vma) {
+			split_huge_pud(walk->vma, pudp, addr);
+			pud = READ_ONCE(*pudp);
+			if (pud_none(pud))
+				goto again;
+		}
 
 		err = walk_pmd_range(pud, addr, next, walk);
 		if (err)
 			break;
-	} while (pud++, addr = next, addr != end);
+	} while (pudp++, addr = next, addr != end);
 
 	return err;
 }
diff --git a/mm/ptdump.c b/mm/ptdump.c
index ba88ec43ff21..2055b940408e 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -65,11 +65,10 @@ static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
 	return 0;
 }
 
-static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
+static int ptdump_pud_entry(pud_t val, pud_t *pudp, unsigned long addr,
 			    unsigned long next, struct mm_walk *walk)
 {
 	struct ptdump_state *st = walk->private;
-	pud_t val = READ_ONCE(*pud);
 
 #if CONFIG_PGTABLE_LEVELS > 2 && defined(CONFIG_KASAN)
 	if (pud_page(val) == virt_to_page(lm_alias(kasan_early_shadow_pmd)))
-- 
2.28.0