From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=ZIdU=BE=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-4.1 required=3.0 tests=BAYES_00,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,
	SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 55F8BC433E0
	for <linux-mm@archiver.kernel.org>; Sat, 25 Jul 2020 01:30:07 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id D4D702067D
	for <linux-mm@archiver.kernel.org>; Sat, 25 Jul 2020 01:30:06 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (1024-bit key) header.d=linux-foundation.org header.i=@linux-foundation.org header.b="H5SwgPX0"
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org D4D702067D
Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=linux-foundation.org
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id 211DD6B0029; Fri, 24 Jul 2020 21:30:06 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 1C1CD6B002A; Fri, 24 Jul 2020 21:30:06 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 0B30B8D0003; Fri, 24 Jul 2020 21:30:06 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0133.hostedemail.com [216.40.44.133])
	by kanga.kvack.org (Postfix) with ESMTP id E62276B0029
	for <linux-mm@kvack.org>; Fri, 24 Jul 2020 21:30:05 -0400 (EDT)
Received: from smtpin10.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay05.hostedemail.com (Postfix) with ESMTP id 757C1183F6A3E
	for <linux-mm@kvack.org>; Sat, 25 Jul 2020 01:30:05 +0000 (UTC)
X-FDA: 77074867170.10.print46_57059ff26f4c
Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251])
	by smtpin10.hostedemail.com (Postfix) with ESMTP id 4C3531EA18
	for <linux-mm@kvack.org>; Sat, 25 Jul 2020 01:30:05 +0000 (UTC)
X-HE-Tag: print46_57059ff26f4c
X-Filterd-Recvd-Size: 8441
Received: from mail-lj1-f195.google.com (mail-lj1-f195.google.com [209.85.208.195])
	by imf08.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Sat, 25 Jul 2020 01:30:04 +0000 (UTC)
Received: by mail-lj1-f195.google.com with SMTP id q4so11783897lji.2
        for <linux-mm@kvack.org>; Fri, 24 Jul 2020 18:30:04 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=linux-foundation.org; s=google;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=6l2WvcMeYLr3pYum5PXSP+vp+wmUJVXAXszIOrHfsfg=;
        b=H5SwgPX0DwUBH4JmV1C+JWDb/+yqk/1yG97REfYqHRKv5H0flTmtBx4PcDW22azzfe
         Muk3lPRyiqn9VHAU2hearYBJFxFZ1enL2OUiff/DhP1KWcjSK5e9Pkb6gSiUlj6LAcKl
         C49+48SxpgwzrfCmdUowloJpN9y6syhb0fn+M=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=6l2WvcMeYLr3pYum5PXSP+vp+wmUJVXAXszIOrHfsfg=;
        b=mXjTeZll3dr0haPWSDPiBL7VpypO+R2V3q6jogDLFZB01HVltVImJEulFWiRfBuC2/
         etz8rfrQzYd3Fm0gJVkhwUXg+kFd3YRR8mQc1T7KwjwxdAtTqSAUCfHQ737SPu9tBNsd
         V+6aHHLewZgbCa4xAfrUBgzNb1SFJEwlZA0ZSZAFPJ1ihK4kERnugIz4PQDWp+5iEMk+
         1XVDxDid/c2QdvnpBcxS+aeuil4J/Hj5lyQdvEHqM31Iq/MJgCukD41HveblsMhNmG4C
         ZPiBIXu44XkiXLtxZApZuya4SOl+FC4eKEEOPCX3LWGvOo+ryMmdpPAqC6hWsmdmSApR
         /T7w==
X-Gm-Message-State: AOAM533G4MIKD9ICu48ZUMRGQ/6je2H4klil30S2DbWN58ybk9ehtk7i
	UPPgOJp+0zNksJ4p1LrjwOTqP9CnGrM=
X-Google-Smtp-Source: ABdhPJy9PthiTS516OU3dWrGFVGerEJCo2DSvg39+d2b2x8YbkNYOCTYNmEHzRY11RivLZYSpR55aQ==
X-Received: by 2002:a2e:880e:: with SMTP id x14mr5204603ljh.218.1595640602128;
        Fri, 24 Jul 2020 18:30:02 -0700 (PDT)
Received: from mail-lf1-f52.google.com (mail-lf1-f52.google.com. [209.85.167.52])
        by smtp.gmail.com with ESMTPSA id z8sm760205lfd.58.2020.07.24.18.30.00
        for <linux-mm@kvack.org>
        (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
        Fri, 24 Jul 2020 18:30:00 -0700 (PDT)
Received: by mail-lf1-f52.google.com with SMTP id j22so241973lfm.2
        for <linux-mm@kvack.org>; Fri, 24 Jul 2020 18:30:00 -0700 (PDT)
X-Received: by 2002:ac2:521a:: with SMTP id a26mr6201015lfl.192.1595640599701;
 Fri, 24 Jul 2020 18:29:59 -0700 (PDT)
MIME-Version: 1.0
References: <20200723211432.b31831a0df3bc2cbdae31b40@linux-foundation.org>
 <20200724041508.QlTbrHnfh%akpm@linux-foundation.org> <CAHk-=wguPA=pDskR-eMMjwR5LDEaMXrqbmDbrKr0u=wV1LE4rg@mail.gmail.com>
 <CAHk-=wh4kmU5FdT=Yy7N9wA=se=ALbrquCrOkjCMhiQnOBLvDA@mail.gmail.com> <7de20d4a-f86c-8e1f-b238-65f02b560325@linux.alibaba.com>
In-Reply-To: <7de20d4a-f86c-8e1f-b238-65f02b560325@linux.alibaba.com>
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 24 Jul 2020 18:29:43 -0700
X-Gmail-Original-Message-ID: <CAHk-=wi8ztx7E3fmU3anHSUzxC9g+ac7kdBOop9UWdvdG-jFOg@mail.gmail.com>
Message-ID: <CAHk-=wi8ztx7E3fmU3anHSUzxC9g+ac7kdBOop9UWdvdG-jFOg@mail.gmail.com>
Subject: Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for
 retried page fault
To: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, Catalin Marinas <catalin.marinas@arm.com>, 
	Johannes Weiner <hannes@cmpxchg.org>, Hillf Danton <hdanton@sina.com>, Hugh Dickins <hughd@google.com>, 
	Josef Bacik <josef@toxicpanda.com>, 
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>, Linux-MM <linux-mm@kvack.org>, 
	mm-commits@vger.kernel.org, Will Deacon <will.deacon@arm.com>, 
	Matthew Wilcox <willy@infradead.org>, xuyu@linux.alibaba.com
Content-Type: text/plain; charset="UTF-8"
X-Rspamd-Queue-Id: 4C3531EA18
X-Spamd-Result: default: False [0.00 / 100.00]
X-Rspamd-Server: rspam04
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Fri, Jul 24, 2020 at 5:37 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
> A follow-up question about your comment in the previous email "The
> notion of "this is a retry, so let's do nothing" is fundamentally
> wrong.", do you mean it is not safe?

I mean it fails my "smell test".

The patch didn't just avoid the TLB flush, it avoided all the other
"mark it dirty and young" things too. And that made me go "why would
RETRY be different in this regard"?

It sounds unsafe, because it basically means that a retry does
something else than the initial page fault handling would do.

See what worries me and makes me go "that's not safe"?

> Or since we have pte_same check, we
> should just rely on it to skip unnecessary TLB flush?

Right. That makes me much happier, because if the retry flag is only
used to avoid a TLB flush (when the pte's are identical, of course),
then I feel that the retry path is _logically_ all the same. The page
tables end up looking exactly the same, and the only difference is
whether we do that TLB invalidate for a spurious fault.

And that, in turn, makes me feel it is safe, because even if it turns
out that "yes, we keep getting a spurious fault because we have some
stale TLB entries", then checking the RETRY bit is fine: we'll do a
full page fault next time around without the retry bit set.

So that's why I feel that your patch is sketchy and unsafe, but I
don't worry about testing the RETRY bit in that "clear spurious TLB
entries" case.

See?

> > Can somebody flesh out the comment about the
> > "spurious_protection_fault()" thing? Because something like this I
> > wouldn't mind, but I'd like that comment to explain the
> > FAULT_FLAG_WRITE part too.
>
> I'm not quite familiar with other architectures, my wild guess is
> FAULT_FLAG_WRITE is a cheap way to tell us if this is a .text page or
> not.

Yes. However, I'm not seeing why a text page would be so special.

IOW, if it's ok to skip the TLB flush fo ra text page, then why isn't
it ok to skip for a normal page?

My suspicion is that we have stale TLB entries for potentially
multiple different reasons:

 - software optimizations, where we decide "skip the TLB flush,
because it's expensive and it is likely to never matter".

   I have a _memory_ of us doing this when we have a pure "loosening"
of the protections (IOW, make something writable that wasn't writable
before), but I can't actually find the code. I'm thinking things like
the wp_page_reuse() case.

 - temporarily stale TLB entries because we've _just_updated them on
another CPU, but it hasn't gotten to the actual TLB flush yet.

   By the time we actually get to this point, we'll have serialized
with the page table lock, but the *fault* happened when the CPU saw
the original stale TLB entry, so we took the fault with what is now a
stale TLB entry.

 - actual software bugs where we've not flushed the TLB properly.

Anyway, the _reason_ for that "flush_tlb_fix_spurious_fault()" is that
some architectures don't flush their TLB on a fault.

So if you don't flush the TLB when talking a page fault, and you may
have these stale TLB entries around, you'll just keep faulting until
enough other system event happens that just ends up flushing the TLB
sufficiently.

On an otherwise idle system, that "keep faulting until enough other
system event happens" might be effectively forever.

For any architecture that guarantees that a page fault will always
flush the old TLB entry for this kind of situation, that
flush_tlb_fix_spurious_fault() thing can be a no-op.

So that's why on x86, we just do

  #define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)

and have no issues.

Note that it does *not* need to do any cross-CPU flushing or anything
like that. So it's actually wrong (I think) to have that default
fallback for

   #define flush_tlb_fix_spurious_fault(vma, address)
flush_tlb_page(vma, address)

because flush_tlb_page() is the serious "do cross CPU etc".

Does the arm64 flush_tlb_page() perhaps do the whole expensive
cross-CPU thing rather than the much cheaper "just local invalidate"
version?

The "random letter combination" thing that ARM documentation uses for
these things is really confusing, but I think the "is" in "vale1is"
means that it's broadcast to all "inner sharable" - ie CPU cores.

I get the feeling that on arm64, flush_tlb_fix_spurious_fault() should
either be a no-op, or it should perhaps be a non-broadcasting version
of the TLB invalidates, and use just "vale1"

Catalin? Will?

                Linus