From: Linus Torvalds
Subject: Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
Date: Mon, 27 Jul 2020 13:56:41 -0700
To: Catalin Marinas, linux-arch
Cc: Yang Shi, Andrew Morton, Johannes Weiner, Hillf Danton, Hugh Dickins, Josef Bacik, "Kirill A. Shutemov", Linux-MM, mm-commits@vger.kernel.org, Will Deacon, Matthew Wilcox, Yu Xu
In-Reply-To: <20200727184239.GA21230@gaia>

[ The arch list is going to be missing some of the emails in this
thread, but they are all on lore:

    https://lore.kernel.org/linux-mm/20200727184239.GA21230@gaia/

and I think the context is probably sufficient even without that. ]

On Mon, Jul 27, 2020 at 11:42 AM Catalin Marinas wrote:
>
> At least on arm64 (and arm32), old ptes are not cached in the TLB, so
> there is no need to flush if the only action was to make the pte young
> from old. However, this may not be the same on other architectures.
>
> Also not sure about races with making a pte old then young again, some
> CPU could get confused.

Hmm. I'd love to extend the interface (at the same time we fix the
bogus traditional default) to show what the old pte state was, but
the whole point is that we don't even know. It is, by definition,
gone.

We got a fault for something that is no longer the case, and we
didn't modify anything in the page tables. And so we know that the
TLB didn't - at the time of the fault - match what we now see in the
page tables.

So we don't even know if we "made the pte young from old". Somebody
*else* did that part.

Maybe two CPUs both hit the HW page table walk at roughly the same
time, both saw an old entry and triggered a sw fault, one CPU then
won the race to the page table spinlock, and marked it young.

And then the other CPU comes along, says "ok, nothing seems to have
changed, it's a spurious fault as far as I can tell, now what?"

It *could* be that "old->young" transition. But it *could* also have
been that the other CPU did a write access and turned things writable
(in addition to turning it young). We don't even know.
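For concreteness, the path where that second CPU ends up is the tail
of handle_pte_fault() in mm/memory.c. Very roughly, and with the
patch's proposed write-only flush in place, it looks something like
this (a simplified sketch of the 2020-era code, with the pte mapping
and error paths elided):

        /*
         * Simplified sketch of the tail of handle_pte_fault(),
         * with the proposed FAULT_FLAG_WRITE test in place.
         */
        spin_lock(vmf->ptl);
        entry = vmf->orig_pte;
        if (unlikely(!pte_same(*vmf->pte, entry)))
                goto unlock;    /* somebody changed the pte under us */
        if (vmf->flags & FAULT_FLAG_WRITE) {
                if (!pte_write(entry))
                        return do_wp_page(vmf);         /* break COW */
                entry = pte_mkdirty(entry);
        }
        entry = pte_mkyoung(entry);
        if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
                                  vmf->flags & FAULT_FLAG_WRITE)) {
                update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
        } else {
                /*
                 * The pte already matched what we wanted to set: the
                 * fault is spurious. Only a write fault can loop
                 * forever on a stale read-only TLB entry, so the
                 * proposal is to only flush for those.
                 */
                if (vmf->flags & FAULT_FLAG_WRITE)
                        flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
        }
unlock:
        pte_unmap_unlock(vmf->pte, vmf->ptl);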
So if arm doesn't cache old ptes in the TLB, then I guess for ARM,
the "only try to flush for write faults" is fine, because the only
possible stale bit in the TLB would be the dirty bit.

And _maybe_ that ends up being true on other architectures too. But
it does sound very very dodgy.

Maybe we should just pass in the fault information we have (ie just
give flush_tlb_fix_spurious_fault() the whole vmf pointer), and then
the architecture can make its own decision based on that.

So if the architecture can say "the only case that might be cached is
a non-dirty old PTE that I need to flush now because it's a write
fault, and not flushing it would possibly cause an endless loop",
then that test for

        if (vmf->flags & FAULT_FLAG_WRITE)

is the right thing.

NOTE! The vmf does have a field called "orig_pte", with the comment
"Value of PTE at the time of fault" associated with it.

That comment is bogus. We don't _really_ know what the original pte
was, and that "orig_pte" is just the one we loaded fairly early,
before we took the page table lock. We've made decisions based on
that value, but we've also already checked that after taking the page
table lock, the pte still matches.

So that vmf structure may be less useful than you'd think. The only
really useful information in there is likely just the address and the
fault flags.

Even the vma is by definition not really useful. The vma we looked up
may not be the same vma that the original hardware fault happened
with. If we took a fault, and some other CPU got around to do an
mmap() before we got the mmap semaphore in the fault handler, we'll
have the *new* vma, but the spurious fault might have come from
another source entirely.

But again - any *serious* page table updates we should have
synchronized against, and the other CPU will have done the TLB
shootdown etc. So we shouldn't need to do anything. The only thing
that matters is the trivial bits that _may_ have been changed without
bothering with a cross-CPU TLB flush. So it's likely only
dirty/accessed bits.

But I really do have this strong memory of us at least at some point
deciding that we can avoid it for some other "this operation only
ever _adds_ permissions, doesn't take them away" case. I can't find
that code, though, so it might be either early-onset Alzheimer's or
some historical footnote that just isn't true any longer.

That said, I *can* find places where we delay TLB flushes a _lot_, so
another CPU may be modifying the page tables and the flush happens
much much later.

For example: look at fork(). We'll mark the source page table as
being read-only for COW purposes, but we'll delay the actual TLB
flush until long, long after we did so (though we'll do so with the
mmap lock held for writing, to protect against the stack growing).

So it's not even like the page table lock really synchronizes the
page table changes with the faults and the TLB flush. The mmap lock
held for writing may do that too.

So there's a fairly large window for these "spurious" faults, where
the fault may have happened much earlier, and things have changed a
*lot* by the time we actually got all our locks and saw "hey, I see
nothing to change in the page tables, the fault is spurious".
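For reference, that fork()-time write-protect is this part of
copy_one_pte() (again a simplified sketch from the 2020-era
mm/memory.c). Note that there is no TLB flush here at all - it is
deferred until much later, under the mmap lock held for writing:

        /*
         * If it's a COW mapping, write-protect the pte in both the
         * parent and the child. No flush happens here, so the parent
         * may keep a stale writable entry in its TLB until the
         * deferred flush actually runs.
         */
        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
                ptep_set_wrprotect(src_mm, addr, src_pte);
                pte = pte_wrprotect(pte);
        }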
              Linus