Subject: Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
From: Yang Shi <yang.shi@linux.alibaba.com>
Date: Fri, 24 Jul 2020 17:36:57 -0700
To: Linus Torvalds, Andrew Morton
Cc: Catalin Marinas, Johannes Weiner, Hillf Danton, Hugh Dickins,
 Josef Bacik, "Kirill A. Shutemov", Linux-MM, mm-commits@vger.kernel.org,
 Will Deacon, Matthew Wilcox, xuyu@linux.alibaba.com
References: <20200723211432.b31831a0df3bc2cbdae31b40@linux-foundation.org>
 <20200724041508.QlTbrHnfh%akpm@linux-foundation.org>
Message-ID: <7de20d4a-f86c-8e1f-b238-65f02b560325@linux.alibaba.com>

Shutemov" , Linux-MM , mm-commits@vger.kernel.org, Will Deacon , Matthew Wilcox , xuyu@linux.alibaba.com References: <20200723211432.b31831a0df3bc2cbdae31b40@linux-foundation.org> <20200724041508.QlTbrHnfh%akpm@linux-foundation.org> From: Yang Shi Message-ID: <7de20d4a-f86c-8e1f-b238-65f02b560325@linux.alibaba.com> Date: Fri, 24 Jul 2020 17:36:57 -0700 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Thunderbird/52.7.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US X-Rspamd-Queue-Id: B359C1801A82A X-Spamd-Result: default: False [0.00 / 100.00] X-Rspamd-Server: rspam01 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 7/24/20 1:22 PM, Linus Torvalds wrote: > On Fri, Jul 24, 2020 at 12:27 PM Linus Torvalds > wrote: >> It *may* make sense to say "ok, don't bother flushing the TLB if this >> is a retry, because we already did that originally". MAYBE. > That sounds wrong to me too. > > Maybe a *BIG* and understandable comment about why the > FAULT_FLAG_TRIED case shouldn't actually do anything at all. > > But it does smell like the real issue is that the "else" case for > ptep_set_access_flags() is simply wrong. > > Or maybe something else. But this thing from the changelog really > raises my hackles: > > "But it seems not necessary to modify those > bits again for #3 since they should be already set by the first page= fault > try" > > since we just verified that we know _exactly_ what the pte is: > > if (unlikely(!pte_same(*vmf->pte, entry))) { > update_mmu_tlb(vmf->vma, vmf->address, vmf->pte); > goto unlock; > } > > so there is no "should be already set" case. We have 100% information > about what the current state is. > > And if we don't make any changes, then that's exactly when > ptep_set_access_flags() returns zero. > > So the real question is "why do we need the > flush_tlb_fix_spurious_fault() thing". > > We could say that we never need it at all for FAULT_FLAG_RETRY. That > makes a lot of sense to me. Thanks a lot for looking into this. It does make sense to me. A follow-up question about your comment in the previous email "The=20 notion of "this is a retry, so let's do nothing" is fundamentally=20 wrong.", do you mean it is not safe? Or since we have pte_same check, we=20 should just rely on it to skip unnecessary TLB flush? > > So a patch that does something like the appended (intentionally > whitespace-damaged) seems sensible. > > But note the XYZ in that commit. When do we actually have stale TLB > entries? Do we actually do the lazy "avoid TLB flushing when loosening > the rules" anywhere? > > I think that "when it's a write fault" is actually bogus. I could > imagine that code pages could get the same issue. So the > "FAULT_FLAG_RETRY" part of the check makes perfect sense to me, but > the legacy "FAULT_FLAG_WRITE" case I'd actually want to document more. > > On x86, we never care about lazy faults. Page faulting will always > update the TLB. > > On other architectures, I can see spurious faults happening either due > to lazy reasons, or simply because another core is modifying the page > table right now (ie the concurrent fault thing), but hasn't actually > flushed yet. > > Can somebody flesh out the comment about the > "spurious_protection_fault()" thing? 
As for the FAULT_FLAG_WRITE part: I'm not too familiar with other
architectures; my wild guess is that FAULT_FLAG_WRITE is a cheap way to
tell whether this is a .text page fault or not. The original commit which
introduced the check says as much:

commit 1a44e149084d772a1bcf4cdbdde8a013a8a1cfde
Author: Andrea Arcangeli
Date:   Sat Oct 29 18:16:48 2005 -0700

    [PATCH] .text page fault SMP scalability optimization

    We had a problem on ppc64 where with more than 4 threads a large system
    wouldn't scale well while faulting in the .text (most of the time was
    spent in the kernel despite it was an userland compute intensive app).
    The reason is the useless overwrite of the same pte from all cpu.

    I fixed it this way (verified on an older kernel but the forward port
    is almost identical).  This will benefit all archs not just ppc64.

    Signed-off-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

>
>                 Linus
>
> ---
> diff --git a/mm/memory.c b/mm/memory.c
> index 3ecad55103ad..9994c98d88c3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4163,6 +4163,26 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
>  	return VM_FAULT_FALLBACK;
>  }
>
> +/*
> + * If ptep_set_access_flags() returns zero, that means that
> + * it made no changes. Why did we get a fault?
> + *
> + * It might be a spurious protection fault because we at
> + * some point lazily didn't flush a TLB when we only loosened
> + * the protection rules. But it might also be because a
> + * concurrent fault on another CPU had already marked things
> + * young, and our young/dirty changes didn't change anything.
> + *
> + * The lazy TLB optimization only happens when we make things
> + * writable. See XYZ.
> + */
> +static inline bool spurious_protection_fault(unsigned int flags)
> +{
> +	if (flags & FAULT_FLAG_RETRY)
> +		return false;
> +	return flags & FAULT_FLAG_WRITE;
> +}
> +
>  /*
>   * These routines also need to handle stuff like marking pages dirty
>   * and/or accessed for architectures that don't do it in hardware (most
> @@ -4247,15 +4267,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
>  				vmf->flags & FAULT_FLAG_WRITE)) {
>  		update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
> -	} else {
> -		/*
> -		 * This is needed only for protection faults but the arch code
> -		 * is not yet telling us if this is a protection fault or not.
> -		 * This still avoids useless tlb flushes for .text page faults
> -		 * with threads.
> -		 */
> -		if (vmf->flags & FAULT_FLAG_WRITE)
> -			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
> +	} else if (spurious_protection_fault(vmf->flags)) {
> +		flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
>  	}
>  unlock:
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
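And just to make the draft helper's intended behavior concrete, a minimal
stand-alone sketch (userspace C, with the flag bits stubbed to illustrative
values; the real definitions live in include/linux/mm.h, and FAULT_FLAG_RETRY
here presumably corresponds to FAULT_FLAG_TRIED in the current tree):

#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-ins, not the kernel's actual flag values. */
#define FAULT_FLAG_WRITE 0x01
#define FAULT_FLAG_RETRY 0x02

/* The helper from the draft patch above. */
static bool spurious_protection_fault(unsigned int flags)
{
	/* Per the discussion above: a retried fault never needs the
	 * spurious-fault TLB flush. */
	if (flags & FAULT_FLAG_RETRY)
		return false;
	/* Otherwise keep the legacy behavior from commit 1a44e149:
	 * only write faults are treated as possibly spurious, which
	 * is what lets threaded .text faults skip the flush. */
	return flags & FAULT_FLAG_WRITE;
}

int main(void)
{
	/* First-try write fault: may be spurious, so flush (prints 1). */
	printf("%d\n", spurious_protection_fault(FAULT_FLAG_WRITE));
	/* Retried write fault: no flush needed (prints 0). */
	printf("%d\n", spurious_protection_fault(FAULT_FLAG_WRITE | FAULT_FLAG_RETRY));
	/* Read/exec fault, e.g. .text: never flushed (prints 0). */
	printf("%d\n", spurious_protection_fault(0));
	return 0;
}

So only a first-try write fault still reaches flush_tlb_fix_spurious_fault();
every other combination skips it.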