Subject: Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
From: Yang Shi <yang.shi@linux.alibaba.com>
Date: Fri, 24 Jul 2020 17:36:57 -0700
To: Linus Torvalds, Andrew Morton
Cc: Catalin Marinas, Johannes Weiner, Hillf Danton, Hugh Dickins,
 Josef Bacik, "Kirill A. Shutemov", Linux-MM, mm-commits@vger.kernel.org,
 Will Deacon, Matthew Wilcox, xuyu@linux.alibaba.com
References: <20200723211432.b31831a0df3bc2cbdae31b40@linux-foundation.org>
 <20200724041508.QlTbrHnfh%akpm@linux-foundation.org>
Message-ID: <7de20d4a-f86c-8e1f-b238-65f02b560325@linux.alibaba.com>

Shutemov" , Linux-MM , mm-commits@vger.kernel.org, Will Deacon , Matthew Wilcox , xuyu@linux.alibaba.com References: <20200723211432.b31831a0df3bc2cbdae31b40@linux-foundation.org> <20200724041508.QlTbrHnfh%akpm@linux-foundation.org> From: Yang Shi Message-ID: <7de20d4a-f86c-8e1f-b238-65f02b560325@linux.alibaba.com> Date: Fri, 24 Jul 2020 17:36:57 -0700 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Thunderbird/52.7.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US X-Rspamd-Queue-Id: B359C1801A82A X-Spamd-Result: default: False [0.00 / 100.00] X-Rspamd-Server: rspam01 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 7/24/20 1:22 PM, Linus Torvalds wrote: > On Fri, Jul 24, 2020 at 12:27 PM Linus Torvalds > wrote: >> It *may* make sense to say "ok, don't bother flushing the TLB if this >> is a retry, because we already did that originally". MAYBE. > That sounds wrong to me too. > > Maybe a *BIG* and understandable comment about why the > FAULT_FLAG_TRIED case shouldn't actually do anything at all. > > But it does smell like the real issue is that the "else" case for > ptep_set_access_flags() is simply wrong. > > Or maybe something else. But this thing from the changelog really > raises my hackles: > > "But it seems not necessary to modify those > bits again for #3 since they should be already set by the first page= fault > try" > > since we just verified that we know _exactly_ what the pte is: > > if (unlikely(!pte_same(*vmf->pte, entry))) { > update_mmu_tlb(vmf->vma, vmf->address, vmf->pte); > goto unlock; > } > > so there is no "should be already set" case. We have 100% information > about what the current state is. > > And if we don't make any changes, then that's exactly when > ptep_set_access_flags() returns zero. > > So the real question is "why do we need the > flush_tlb_fix_spurious_fault() thing". > > We could say that we never need it at all for FAULT_FLAG_RETRY. That > makes a lot of sense to me. Thanks a lot for looking into this. It does make sense to me. A follow-up question about your comment in the previous email "The=20 notion of "this is a retry, so let's do nothing" is fundamentally=20 wrong.", do you mean it is not safe? Or since we have pte_same check, we=20 should just rely on it to skip unnecessary TLB flush? > > So a patch that does something like the appended (intentionally > whitespace-damaged) seems sensible. > > But note the XYZ in that commit. When do we actually have stale TLB > entries? Do we actually do the lazy "avoid TLB flushing when loosening > the rules" anywhere? > > I think that "when it's a write fault" is actually bogus. I could > imagine that code pages could get the same issue. So the > "FAULT_FLAG_RETRY" part of the check makes perfect sense to me, but > the legacy "FAULT_FLAG_WRITE" case I'd actually want to document more. > > On x86, we never care about lazy faults. Page faulting will always > update the TLB. > > On other architectures, I can see spurious faults happening either due > to lazy reasons, or simply because another core is modifying the page > table right now (ie the concurrent fault thing), but hasn't actually > flushed yet. > > Can somebody flesh out the comment about the > "spurious_protection_fault()" thing? 
As for the FAULT_FLAG_WRITE part: I'm not too familiar with other
architectures; my wild guess is that FAULT_FLAG_WRITE is a cheap way to
tell whether this is a .text page fault or not. The original commit which
introduced the check says as much:

commit 1a44e149084d772a1bcf4cdbdde8a013a8a1cfde
Author: Andrea Arcangeli
Date:   Sat Oct 29 18:16:48 2005 -0700

    [PATCH] .text page fault SMP scalability optimization

    We had a problem on ppc64 where with more than 4 threads a large system
    wouldn't scale well while faulting in the .text (most of the time was
    spent in the kernel despite it was an userland compute intensive app).
    The reason is the useless overwrite of the same pte from all cpu.

    I fixed it this way (verified on an older kernel but the forward port
    is almost identical).  This will benefit all archs not just ppc64.

    Signed-off-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

>
>                 Linus
>
> ---
> diff --git a/mm/memory.c b/mm/memory.c
> index 3ecad55103ad..9994c98d88c3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4163,6 +4163,26 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
>  	return VM_FAULT_FALLBACK;
>  }
>
> +/*
> + * If ptep_set_access_flags() returns zero, that means that
> + * it made no changes. Why did we get a fault?
> + *
> + * It might be a spurious protection fault because we at
> + * some point lazily didn't flush a TLB when we only loosened
> + * the protection rules. But it might also be because a
> + * concurrent fault on another CPU had already marked things
> + * young, and our young/dirty changes didn't change anything.
> + *
> + * The lazy TLB optimization only happens when we make things
> + * writable. See XYZ.
> + */
> +static inline bool spurious_protection_fault(unsigned int flags)
> +{
> +	if (flags & FAULT_FLAG_RETRY)
> +		return false;
> +	return flags & FAULT_FLAG_WRITE;
> +}
> +
>  /*
>   * These routines also need to handle stuff like marking pages dirty
>   * and/or accessed for architectures that don't do it in hardware (most
> @@ -4247,15 +4267,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
>  				vmf->flags & FAULT_FLAG_WRITE)) {
>  		update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
> -	} else {
> -		/*
> -		 * This is needed only for protection faults but the arch code
> -		 * is not yet telling us if this is a protection fault or not.
> -		 * This still avoids useless tlb flushes for .text page faults
> -		 * with threads.
> -		 */
> -		if (vmf->flags & FAULT_FLAG_WRITE)
> -			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
> +	} else if (spurious_protection_fault(vmf->flags)) {
> +		flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
>  	}
>  unlock:
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
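And just to make the draft helper's intended behavior concrete, a minimal
stand-alone sketch (userspace C, with the flag bits stubbed to illustrative
values; the real definitions live in include/linux/mm.h, and FAULT_FLAG_RETRY
here presumably corresponds to FAULT_FLAG_TRIED in the current tree):

#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-ins, not the kernel's actual flag values. */
#define FAULT_FLAG_WRITE 0x01
#define FAULT_FLAG_RETRY 0x02

/* The helper from the draft patch above. */
static bool spurious_protection_fault(unsigned int flags)
{
	/* Per the discussion above: a retried fault never needs the
	 * spurious-fault TLB flush. */
	if (flags & FAULT_FLAG_RETRY)
		return false;
	/* Otherwise keep the legacy behavior from commit 1a44e149:
	 * only write faults are treated as possibly spurious, which
	 * is what lets threaded .text faults skip the flush. */
	return flags & FAULT_FLAG_WRITE;
}

int main(void)
{
	/* First-try write fault: may be spurious, so flush (prints 1). */
	printf("%d\n", spurious_protection_fault(FAULT_FLAG_WRITE));
	/* Retried write fault: no flush needed (prints 0). */
	printf("%d\n", spurious_protection_fault(FAULT_FLAG_WRITE | FAULT_FLAG_RETRY));
	/* Read/exec fault, e.g. .text: never flushed (prints 0). */
	printf("%d\n", spurious_protection_fault(0));
	return 0;
}

So only a first-try write fault still reaches flush_tlb_fix_spurious_fault();
every other combination skips it.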