From: Linus Torvalds
Date: Wed, 30 Jun 2021 11:03:25 -0700
Subject: Re: [patch 128/192] mm: improve mprotect(R|W) efficiency on pages referenced once
To: Peter Xu
Cc: Andrew Morton, Andrea Arcangeli, Evgeniy Stepanov, kostyak@google.com, Linux-MM, mm-commits@vger.kernel.org, Peter Collingbourne

On Wed, Jun 30, 2021 at 9:42 AM Peter Xu wrote:
>
> Yes I still fully agree it's very un-obvious. So far the best thing I can come
> up with is something like below (patch attached too but not yet tested). I
> moved VM_WRITE out so hopefully it'll be very clear; then I also rearranged the
> checks so the final outcome looks like below:
>
> static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
>                                   unsigned long cp_flags)
> {
>         /*
>          * It is unclear whether this optimization can be done safely for NUMA
>          * pages.
>          */
>         if (cp_flags & MM_CP_PROT_NUMA)
>                 return false;

Please just put that VM_WRITE test first. It's the one that really
*matters*. There's no "it's unclear if" about that part. Just handle the
obvious and important check first.

Yeah, yeah, they both return false, so order doesn't matter from a
semantic standpoint, but from a clarity standpoint just do the clear and
unambiguous and security-relevant test first.

The rest of the tests are implementation details, the VM_WRITE test is
fundamental behavior. It's the one that made me worry about this patch
in the first place.

>         /*
>          * Don't do this optimization for clean pages as we need to be notified
>          * of the transition from clean to dirty.
>          */
>         if (!pte_dirty(pte))
>                 return false;
>
>         /* Same for softdirty. */
>         if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
>                 return false;
>
>         /*
>          * For userfaultfd the user program needs to monitor write faults so we
>          * can't do this optimization.
>          */
>         if (pte_uffd_wp(pte))
>                 return false;

So all of these are a bit special.

Why? Because if I look at the actual page fault path, these are not the
tests there.

I'd really like to have some obvious situation where we keep this "make
it writable" in sync with what would actually happen on a write fault
when it's not writable. And it's not at all obvious to me for these
cases.

The do_wp_page() code doesn't even use pte_uffd_wp(). It uses
userfaultfd_pte_wp(vma, pte), and I don't even know why. Yes, I can see
the code (it additionally tests the VM_UFFD_WP flag in the vma), but a
number of other paths then only do that pte_uffd_wp() test.
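Roughly, and this is only a sketch reconstructed from that description
rather than a verbatim copy of the real helper in the userfaultfd
header, the distinction between the two checks comes down to:

        /*
         * Sketch only, reconstructed from the description above; the
         * exact spelling in the tree may differ.  The pte bit alone says
         * "this pte carries the uffd-wp marker", while the wrapper that
         * do_wp_page() uses additionally requires that the vma actually
         * has userfaultfd write-protection registered on it.
         */
        static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
                                              pte_t pte)
        {
                return (vma->vm_flags & VM_UFFD_WP) && pte_uffd_wp(pte);
        }

The proposed may_avoid_write_fault() above only looks at the pte bit,
not the vma flag.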
I get the feeling that we really should try to match what the
do_wp_page() path does, though.

Which brings up another issue: the do_wp_page() path treats PageKsm()
pages differently. And it locks the page before double-checking the page
count. Why does mprotect() not need to do the same thing?

I think this has come up before, and "change_protection()" can get
called with the mmap_sem held just for reading - see userfaultfd - so it
has all the same issues as a page fault does, afaik.

>         /*
>          * MM_CP_DIRTY_ACCT indicates that we can always make the page writable
>          * regardless of the number of references.  Time to set the write bit.
>          */
>         if (cp_flags & MM_CP_DIRTY_ACCT)
>                 return true;
>
>         /*
>          * Otherwise it means !MM_CP_DIRTY_ACCT.  We can only apply write bit
>          * early if it's anonymous page and we exclusively own it.
>          */
>         if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1))
>                 return true;
>
>         /* Don't play any trick */
>         return false;
> }
>
> The logic should be the same as before, it's just that we'll do an extra check
> on VM_WRITE for MM_CP_DIRTY_ACCT but assuming it's ok.

See above. I don't think the logic before was all that clear either.

The one case that is clear is that if it's a shared mapping, and
MM_CP_DIRTY_ACCT is set, and it was already dirty (and softdirty), then
it's ok.

That's the old code. I don't like how the old code was written (because
I think that MM_CP_DIRTY_ACCT bit was too subtle), but I think the old
code was at least correct.

The new code, it just worries me. It adds all those new cases for when
we can make the page writable early - that's the whole point of the
patch, after all - but my point here is that it's not at all obvious
that those new cases are actually correct.

MAYBE it's all correct. I'm not saying it's wrong. I'm just saying it's
not _obvious_ that it's correct.

What about that page_count() test, for example: it has a comment, it
looks obvious, but it's very different from what do_wp_page() does. So
what happens if we have a page-out at the same time that turns that page
into a swap cache page, and increments the page count? What about that
race? Do we end up with a writable page that is shared with a swap cache
entry? Is that ok? Why isn't it ok in the page fault case?

See why this patch worries me so much?

             Linus
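The worry about the unlocked page_count() test can be made concrete with
a hypothetical sketch. This is illustration only, with a made-up helper
name, and not what do_wp_page() literally does: it just shows the "take
the page lock, then re-check" pattern used on the write-fault side
before reusing an anonymous page, in contrast to the bare
page_count() == 1 test quoted above.

        /*
         * Hypothetical illustration only: a simplified form of the
         * "lock the page, then decide" pattern on the write-fault side.
         * The real do_wp_page() path also holds the pte lock and makes
         * the final call via reuse_swap_page().
         */
        static bool can_reuse_anon_page_locked(struct page *page)
        {
                /* If we cannot get the page lock, take no shortcut. */
                if (!trylock_page(page))
                        return false;

                /*
                 * Re-check under the lock: concurrent page-out may have
                 * added the page to the swap cache, which takes an extra
                 * reference - exactly the race being asked about above.
                 */
                if (PageSwapCache(page) || page_count(page) != 1) {
                        unlock_page(page);
                        return false;
                }

                unlock_page(page);
                return true;
        }

Whether the unlocked test in the mprotect path can lose against that
same page-out path is exactly the open question raised in this mail.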