Subject: Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
To: Linus Torvalds, Andrew Morton
Cc: Catalin Marinas, Johannes Weiner, Hillf Danton, Hugh Dickins, Josef Bacik, "Kirill A. Shutemov", Linux-MM, mm-commits@vger.kernel.org, Will Deacon, Matthew Wilcox, yang.shi@linux.alibaba.com
References: <20200723211432.b31831a0df3bc2cbdae31b40@linux-foundation.org> <20200724041508.QlTbrHnfh%akpm@linux-foundation.org>
From: Yu Xu <xuyu@linux.alibaba.com>
Message-ID: <0323de82-cfbd-8506-fa9c-a702703dd654@linux.alibaba.com>
Date: Mon, 27 Jul 2020 15:31:16 +0800

On 7/25/20 4:22 AM, Linus Torvalds wrote:
> On Fri, Jul 24, 2020 at 12:27 PM Linus Torvalds wrote:
>>
>> It *may* make sense to say "ok, don't bother flushing the TLB if this
>> is a retry, because we already did that originally". MAYBE.
>
> That sounds wrong to me too.
>
> Maybe a *BIG* and understandable comment about why the
> FAULT_FLAG_TRIED case shouldn't actually do anything at all.
>
> But it does smell like the real issue is that the "else" case for
> ptep_set_access_flags() is simply wrong.
>
> Or maybe something else. But this thing from the changelog really
> raises my hackles:
>
>   "But it seems not necessary to modify those
>   bits again for #3 since they should be already set by the first page fault
>   try"
>
> since we just verified that we know _exactly_ what the pte is:
>
>         if (unlikely(!pte_same(*vmf->pte, entry))) {
>                 update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
>                 goto unlock;
>         }
>
> so there is no "should be already set" case. We have 100% information
> about what the current state is.
>
> And if we don't make any changes, then that's exactly when
> ptep_set_access_flags() returns zero.
>
> So the real question is "why do we need the
> flush_tlb_fix_spurious_fault() thing".

Yes, some debugging information shows that the TLB flush does happen in
flush_tlb_fix_spurious_fault(), i.e., in the "else" case for
ptep_set_access_flags().

>
> We could say that we never need it at all for FAULT_FLAG_RETRY. That
> makes a lot of sense to me.
>
> So a patch that does something like the appended (intentionally
> whitespace-damaged) seems sensible.

I tested your patch on our aarch64 box, with 128 online CPUs. The
testcase is page_fault3 from the will-it-scale suite
(https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c),
with the test parameter being the number of processes or threads to run.

test        vanilla          patched
parameter   (89b15332af7c)   (Linus's patch)
1p            829299           772028 ( 94.44 %)
1t            998007           925583 ( 91.89 %)
32p         18916718         18712348 ( 98.86 %)
32t          2020918          1687744 ( 69.43 %)
64p         18965168         18982148 (100.05 %)
64t          1415404          1649234 ( 72.42 %)
96p         18949438         18866126 ( 99.68 %)
96t          1622876          1448309 ( 73.08 %)
128p        18926813         18423990 ( 97.53 %)
128t         1643109                0 (  0.00 %)

There are two points to sum up.
1) The performance of page_fault3_process is restored, while the
   performance of page_fault3_thread is about 80% of vanilla, except in
   the 128-thread case.

2) In the 128-thread case, the test worker threads seem to get stuck,
   making no progress in the mmap-write-munmap iterations until some
   time later. The test result is 0 because only the first 16 samples
   are counted, and they are all 0.

This situation is easy to reproduce with a large number of threads (not
necessarily 128), and the stack of one stuck thread is shown below.

[<0>] __switch_to+0xdc/0x150
[<0>] wb_wait_for_completion+0x84/0xb0
[<0>] __writeback_inodes_sb_nr+0x9c/0xe8
[<0>] try_to_writeback_inodes_sb+0x6c/0x88
[<0>] ext4_nonda_switch+0x90/0x98 [ext4]
[<0>] ext4_page_mkwrite+0x248/0x4c0 [ext4]
[<0>] do_page_mkwrite+0x4c/0x100
[<0>] do_fault+0x2ac/0x3e0
[<0>] handle_pte_fault+0xb4/0x258
[<0>] __handle_mm_fault+0x1d8/0x3a8
[<0>] handle_mm_fault+0x104/0x1d0
[<0>] do_page_fault+0x16c/0x490
[<0>] do_translation_fault+0x60/0x68
[<0>] do_mem_abort+0x58/0x100
[<0>] el0_da+0x24/0x28
[<0>] 0xffffffffffffffff

It seems quite normal, right? And I've run out of ideas.

Thanks,
Yu

>
> But note the XYZ in that commit. When do we actually have stale TLB
> entries? Do we actually do the lazy "avoid TLB flushing when loosening
> the rules" anywhere?
>
> I think that "when it's a write fault" is actually bogus. I could
> imagine that code pages could get the same issue. So the
> "FAULT_FLAG_RETRY" part of the check makes perfect sense to me, but
> the legacy "FAULT_FLAG_WRITE" case I'd actually want to document more.
>
> On x86, we never care about lazy faults. Page faulting will always
> update the TLB.
>
> On other architectures, I can see spurious faults happening either due
> to lazy reasons, or simply because another core is modifying the page
> table right now (ie the concurrent fault thing), but hasn't actually
> flushed yet.
>
> Can somebody flesh out the comment about the
> "spurious_protection_fault()" thing? Because something like this I
> wouldn't mind, but I'd like that comment to explain the
> FAULT_FLAG_WRITE part too.
>
>             Linus
>
> ---
> diff --git a/mm/memory.c b/mm/memory.c
> index 3ecad55103ad..9994c98d88c3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4163,6 +4163,26 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
>  	return VM_FAULT_FALLBACK;
>  }
>
> +/*
> + * If ptep_set_access_flags() returns zero, that means that
> + * it made no changes. Why did we get a fault?
> + *
> + * It might be a spurious protection fault because we at
> + * some point lazily didn't flush a TLB when we only loosened
> + * the protection rules. But it might also be because a
> + * concurrent fault on another CPU had already marked things
> + * young, and our young/dirty changes didn't change anything.
> + *
> + * The lazy TLB optimization only happens when we make things
> + * writable. See XYZ.
> + */
> +static inline bool spurious_protection_fault(unsigned int flags)
> +{
> +	if (flags & FAULT_FLAG_RETRY)
> +		return false;
> +	return flags & FAULT_FLAG_WRITE;
> +}
> +
>  /*
>   * These routines also need to handle stuff like marking pages dirty
>   * and/or accessed for architectures that don't do it in hardware (most
> @@ -4247,15 +4267,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
>  				vmf->flags & FAULT_FLAG_WRITE)) {
>  		update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
> -	} else {
> -		/*
> -		 * This is needed only for protection faults but the arch code
> -		 * is not yet telling us if this is a protection fault or not.
> -		 * This still avoids useless tlb flushes for .text page faults
> -		 * with threads.
> -		 */
> -		if (vmf->flags & FAULT_FLAG_WRITE)
> -			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
> +	} else if (spurious_protection_fault(vmf->flags)) {
> +		flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
>  	}
>  unlock:
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);