From: Yang Shi
Date: Wed, 23 Sep 2020 09:07:49 -0700
Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()
To: Peter Xu
Cc: Jason Gunthorpe, Linux MM, Linux Kernel Mailing List, Linus Torvalds,
    Michal Hocko, Kirill Shutemov, Jann Horn, Oleg Nesterov, Kirill Tkhai,
    Hugh Dickins, Leon Romanovsky, Jan Kara, John Hubbard, Christoph Hellwig,
    Andrew Morton, Andrea Arcangeli
In-Reply-To: <20200923152409.GC59978@xz-x1>
References: <20200921211744.24758-1-peterx@redhat.com>
 <20200921212031.25233-1-peterx@redhat.com>
 <20200922120505.GH8409@ziepe.ca>
 <20200923152409.GC59978@xz-x1>

On Wed, Sep 23, 2020 at 8:26 AM Peter Xu wrote:
>
> On Tue, Sep 22, 2020 at 09:05:05AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 21, 2020 at 05:20:31PM -0400, Peter Xu wrote:
> > > Pinned pages shouldn't be write-protected when fork() happens, because
> > > follow-up copy-on-write on these pages could cause the pinned pages to
> > > be replaced by random newly allocated pages.
> > >
> > > For huge PMDs, we split the huge pmd if pinning is detected, so that
> > > future handling will be done at the PTE level (with our latest changes,
> > > each of the small pages will be copied). We can achieve this by letting
> > > copy_huge_pmd() return -EAGAIN for pinned pages, so that we fall through
> > > in copy_pmd_range() and finally land in the next copy_pte_range() call.
> > >
> > > Huge PUDs are even more special - so far they do not support anonymous
> > > pages. But they can be handled the same way as huge PMDs, even though
> > > splitting a huge PUD means erasing the PUD entry. That guarantees that
> > > follow-up fault-ins will remap the same pages in either parent or child
> > > later.
> > >
> > > This might not be the most efficient way, but it should be easy and
> > > clean enough. It should be fine, since we're tackling a very rare case,
> > > just to make sure userspaces that pinned some THPs will still work even
> > > without MADV_DONTFORK after they fork().
> > >
> > > Signed-off-by: Peter Xu
> > > ---
> > >  mm/huge_memory.c | 26 ++++++++++++++++++++++++++
> > >  1 file changed, 26 insertions(+)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 7ff29cc3d55c..c40aac0ad87e 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >
> > >  	src_page = pmd_page(pmd);
> > >  	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> > > +
> > > +	/*
> > > +	 * If this page is a potentially pinned page, split and retry the
> > > +	 * fault with a smaller page size. Normally this should not happen
> > > +	 * because userspace should use MADV_DONTFORK upon pinned regions.
> > > +	 * This is a best-effort attempt to ensure that the pinned pages
> > > +	 * won't be replaced by another random page during the coming
> > > +	 * copy-on-write.
> > > +	 */
> > > +	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > > +		     page_maybe_dma_pinned(src_page))) {
> > > +		pte_free(dst_mm, pgtable);
> > > +		spin_unlock(src_ptl);
> > > +		spin_unlock(dst_ptl);
> > > +		__split_huge_pmd(vma, src_pmd, addr, false, NULL);
> > > +		return -EAGAIN;
> > > +	}
> >
> > Not sure why, but the PMD stuff here is not calling is_cow_mapping()
> > before doing the write-protect. Seems like it might be an existing
> > bug?
>
> IMHO it's not a bug, because splitting a huge pmd should always be safe.
>
> One thing I can think of that might be special here is when the pmd is
> anonymously mapped but also shared (a shared tmpfs THP, I think?); then
> here we'll also mark it as write-protected even if we don't need to (or
> maybe we need it for some reason..). But again I think it's safe anyway -
> when a page fault happens, wp_huge_pmd() should split it into smaller
> pages unconditionally. I just don't know whether that's the ideal way for
> the shared case. Andrea should definitely know better (since this has been
> there since the first day of THP).

For a tmpfs mapping, the pmd split just clears the pmd entry without
reinstalling ptes (whereas for an anonymous mapping it would reinstall the
ptes). It looks like this patch intends to copy at the pte level by
splitting the pmd, but I'm afraid that may not work for tmpfs mappings.

> >
> > In any event, the has_pinned logic shouldn't be used without also
> > checking is_cow_mapping(), so it should be added to that test. Same
> > remarks for the PUD.
>
> I think the case mentioned above is also the special case here where we
> don't check is_cow_mapping(). The major difference is whether we split the
> page right now or postpone it until the next write in each mm. But I
> think, yes, maybe I'd better still keep the is_cow_mapping() check to be
> explicit.
>
> Thanks,
>
> --
> Peter Xu
>
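
[For illustration: a minimal sketch of what the pinned-page check in
copy_huge_pmd() might look like with the is_cow_mapping() gate suggested
above folded in. This is not the final form of the patch, only a sketch
built from the quoted hunk; the locals (vma, src_mm, src_page, dst_mm,
pgtable, src_ptl, dst_ptl, src_pmd, addr) are assumed to be the ones in
the quoted diff.]

	/*
	 * Sketch only: restrict the split to COW mappings, as suggested
	 * above, so a shared (e.g. tmpfs) huge pmd never takes this path
	 * during fork().
	 */
	if (unlikely(is_cow_mapping(vma->vm_flags) &&
		     READ_ONCE(src_mm->has_pinned) &&
		     page_maybe_dma_pinned(src_page))) {
		/*
		 * Same recovery path as in the quoted hunk: drop the
		 * preallocated pgtable, release both page-table locks,
		 * split the huge pmd, and return -EAGAIN so that
		 * copy_pmd_range() retries and the copy is redone at the
		 * pte level.
		 */
		pte_free(dst_mm, pgtable);
		spin_unlock(src_ptl);
		spin_unlock(dst_ptl);
		__split_huge_pmd(vma, src_pmd, addr, false, NULL);
		return -EAGAIN;
	}

With the check gated on is_cow_mapping(), a shared file-backed pmd would
not be split during fork() at all, which would also sidestep the tmpfs
concern raised above (that splitting a file-backed pmd only clears the
entry without reinstalling ptes).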