References: <20210110004435.26382-1-aarcange@redhat.com>
From: Linus Torvalds
Date: Sun, 10 Jan 2021 11:30:57 -0800
Subject: Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
To: Andrea Arcangeli
Cc: Andrew Morton, Linux-MM, Linux Kernel Mailing List, Yu Zhao,
    Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz,
    Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra,
    Hugh Dickins, "Kirill A. Shutemov", Matthew Wilcox, Oleg Nesterov,
    Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky,
    Jason Gunthorpe, Jan Kara, Kirill Tkhai, Nadav Amit, Jens Axboe

On Sat, Jan 9, 2021 at 7:51 PM Linus Torvalds wrote:
>
> COW is about "I'm about to write to this page, and that means I need
> an _exclusive_ page so that I don't write to a page that somebody else
> is using".

So this kind of fundamentally explains why I hate the games we used to
play wrt page_mapcount(): they were fundamentally fragile.

I _much_ prefer just having the rule that we use page_count(), which
the above simple and straightforward single sentence explains 100%.

This gets back to the fact that especially considering how we've had
subtle bugs here (the "wrong-way COW" issue has existed since
literally the first GUP ever, so it goes back decades), I want the
core VM rules to be things that can be explained basically from
simple "first principles".

And the reason I argue for the current direction that I'm pushing, is
exactly that the above is a very simple "first principle" for why COW
exists.

If the rule for COW is simply "I will always COW if it's not clear
that I'm the exclusive user", then COW itself is very simple to think
about.

The other rule I want to stress is that COW is common, and that
argues against the model we used to have of "let's lock the page to
make sure that everything else is stable". That model was garbage
anyway, since page locking doesn't even guarantee any stability wrt
exclusive use in the first place (ie GUP being another example), but
it's why I truly detested the old model that depended so much on the
page lock to serialize things.

So if you start off with the rule that "I will always COW unless I
can trivially see I'm the only owner", then I think we have really
made for a really clear and unambiguous rule.

And remember: COW is only an issue for private mappings. So pretty
much BY DEFINITION, doing a COW is always safe for all normal
circumstances.

Now, this is where it does get subtle: that "all normal
circumstances" part.

The one special case is a cache-coherent GUP. It's arguable whether
"pinned" should matter or not, and it would obviously be better if
"pinned" simply didn't matter at all (and the only issue with any
long-term pinning would simply be about resource counting).

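
[Illustration, not part of the original mail.] The rule "I will always
COW unless I can trivially see I'm the only owner" can be modelled in a
few lines of user-space C. This is only a toy sketch: `struct toy_page`,
`toy_can_reuse()` and `toy_write_fault()` are hypothetical names, not
kernel APIs, and the single `refcount` field merely stands in for what
page_count() reports.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; this is not the kernel's struct page. */
struct toy_page {
    int refcount;   /* one reference per user: page table mapping,
                       GUP user, page cache, whatever */
};

/* Reuse in place only when the faulting mapping trivially holds the
 * only reference. Anything else counts as "not clearly exclusive". */
static bool toy_can_reuse(const struct toy_page *page)
{
    return page->refcount == 1;
}

/* Model of a write fault on a write-protected private page.
 * Returns true if the fault had to copy (COW), false if it reused. */
static bool toy_write_fault(struct toy_page *page)
{
    assert(page->refcount >= 1);    /* the faulting mapping holds a ref */

    if (toy_can_reuse(page))
        return false;               /* exclusive: write in place */

    page->refcount--;               /* drop our ref on the shared page */
    return true;                    /* caller continues with a fresh copy */
}
```

Note that the decision needs no page lock and no mapcount: one integer
comparison is the whole policy, which is the point of the argument above.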
The current approach I'm advocating is "coherency means that it must
have been writable", and then the way to solve the whole "Oh, it's
shared with something else" is to simply never accept making it
read-only, because BY DEFINITION it's not _really_ read-only (because
we know we've created that other alias of the virtual address that is
*not* controlled by the page table protection bits).

Notice how this is all both conceptually fairly simple (ie I can
explain the rules in plain English without really making any complex
argument) and it is arguably internally fairly self-consistent (ie
the whole notion of "oh, there's another thing that has write access
to that page but doesn't go through the page table, so trying to make
it read-only in the page tables is a nonsensical operation").

Are the end results wrt something like soft-dirty a bit odd? Not
really. If you do soft-dirty, such a GUP-shared page would simply
always show up as dirty. That's still consistent with the rules. If
somebody else may be writing to it because of GUP, that page really
*isn't* clean, and us marking it read-only would be just lying about
things.

I'm admittedly not very happy about mprotect() itself, though. It's
actually ok to do the mprotect(PROT_READ) and turn the page
read-only: that will also disable COW itself (because a page fault
will now be a SIGSEGV, not a COW).

But if you then make it writable again with mprotect(PROT_WRITE), you
*have* lost the WP bit, and you'll COW on a write, and lose the
coherency.

Now, I'm willing to just say: "if you do page pinning, and then do
mprotect(PROT_READ), and then do mprotect(PROT_WRITE) and then write
to the page, you really do get to keep both broken pieces". IOW, I'm
perfectly happy to just say you get what you deserve.

But I'd also be perfectly happy to make the whole "I'm the exclusive
user" logic a bit more extensive.

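
[Illustration, not part of the original mail.] A toy model of the two
cases above: refusing to write-protect a page that a GUP user may still
write to, and the mprotect(PROT_READ) / mprotect(PROT_WRITE) round trip
that loses the writable bit anyway. All names here are hypothetical, and
the `maybe_pinned` field is only an assumed stand-in for however the
kernel detects a GUP user.

```c
#include <stdbool.h>

/* Hypothetical toy types; real PTEs and struct page look different. */
struct toy_page {
    int  refcount;      /* all users, including GUP references */
    bool maybe_pinned;  /* assumed: "a GUP user may write to this page" */
};

struct toy_pte {
    struct toy_page *page;
    bool writable;      /* models the RW bit in the page table entry */
};

/* Write-protect for tracking purposes (soft-dirty style).
 * Returns false and leaves the PTE writable when something outside the
 * page tables may write to the page: it is not really read-only. */
static bool toy_wrprotect_for_tracking(struct toy_pte *pte)
{
    if (pte->writable && pte->page->maybe_pinned)
        return false;
    pte->writable = false;
    return true;
}

/* mprotect(PROT_READ) then mprotect(PROT_WRITE): the mapping allows
 * writes again, but the RW bit in the PTE has been lost on the way. */
static void toy_mprotect_read_then_write(struct toy_pte *pte)
{
    pte->writable = false;
}

/* True if a write through this PTE would fault and COW, which breaks
 * coherency with the GUP user still looking at the old page. */
static bool toy_write_would_cow(const struct toy_pte *pte)
{
    return !pte->writable && pte->page->refcount != 1;
}
```

In this model a pinned page stays writable under tracking, so it would
always be reported as dirty, while the mprotect round trip is the one
path that still ends in a COW.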
Right now it's basically _purely_ page_count(), and the other part of
"I'm the exclusive owner" is that the RW bit in the page table is
simply not clear. That makes things really easy for COW: it just
won't happen in the first place if you broke the "obviously
exclusive" rule with GUP.

But we _could_ do something slightly smarter. But "page_mapcount()"
is not that "slightly smarter" thing, because we already know it's
broken wrt lots of other uses (GUP, page cache, whatever).

Just having a bit in the page flags for "I already made this
exclusive, and fork() will keep it so" is I feel the best option. In
a way, "page is writable" right now _is_ that bit. By definition, if
you have a writable page in an anonymous mapping, that is an
exclusive user.

But because "writable" has these interactions with other operations,
it would be better if it was a harder bit than that "maybe_pinned()",
though. It would be lovely if a regular non-pinning write-GUP just
always set it, for example.

"maybe_pinned()" is good enough for the fork() case, which is the one
that matters for long-term pinning. But it's admittedly not perfect.

               Linus
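
[Illustration, not part of the original mail.] One way to picture the
"I already made this exclusive, and fork() will keep it so" bit is the
toy model below. It is a sketch under loud assumptions: the `exclusive`
field, the function names and the copy-at-fork policy are all
hypothetical and only one possible reading of the idea, not an existing
kernel interface.

```c
#include <stdbool.h>

enum toy_fork_action { TOY_FORK_SHARE, TOY_FORK_COPY };

/* Hypothetical toy page with an explicit "exclusive" flag. */
struct toy_page {
    int  refcount;   /* all users, including GUP references */
    bool exclusive;  /* assumed flag: owner already made this exclusive */
};

/* Record exclusivity once it is trivially visible (single reference).
 * Returns false if the caller has to COW first. */
static bool toy_make_exclusive(struct toy_page *page)
{
    if (page->exclusive)
        return true;
    if (page->refcount != 1)
        return false;
    page->exclusive = true;
    return true;
}

/* Write fault: the flag is "harder" than the writable bit, so extra
 * references taken later (e.g. by GUP) do not force a copy. */
static bool toy_write_needs_copy(const struct toy_page *page)
{
    if (page->exclusive)
        return false;
    return page->refcount != 1;
}

/* fork(): keep an exclusive page exclusive to the parent by giving the
 * child its own copy; otherwise share the page and take a reference. */
static enum toy_fork_action toy_fork_one(struct toy_page *page)
{
    if (page->exclusive)
        return TOY_FORK_COPY;
    page->refcount++;
    return TOY_FORK_SHARE;
}
```

Unlike the RW bit in the page table, nothing like mprotect() or
soft-dirty tracking touches the flag in this model, which is what makes
it a more robust marker of exclusivity.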