Subject: Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
From: Yu Xu <xuyu@linux.alibaba.com>
To: Catalin Marinas
Cc: Linus Torvalds, Andrew Morton, Johannes Weiner, Hillf Danton,
 Hugh Dickins, Josef Bacik, "Kirill A. Shutemov", Linux-MM,
 mm-commits@vger.kernel.org, Will Deacon, Matthew Wilcox,
 yang.shi@linux.alibaba.com
Date: Tue, 28 Jul 2020 14:41:34 +0800
In-Reply-To: <39560818-463f-da3a-fc9e-3a4a0a082f61@linux.alibaba.com>
References: <20200723211432.b31831a0df3bc2cbdae31b40@linux-foundation.org> <20200724041508.QlTbrHnfh%akpm@linux-foundation.org> <0323de82-cfbd-8506-fa9c-a702703dd654@linux.alibaba.com> <20200727110512.GB25400@gaia> <39560818-463f-da3a-fc9e-3a4a0a082f61@linux.alibaba.com>

On 7/28/20 1:12 AM, Yu Xu wrote:
> On 7/27/20 7:05 PM, Catalin Marinas wrote:
>> On Mon, Jul 27, 2020 at 03:31:16PM +0800, Yu Xu
wrote:
>>> On 7/25/20 4:22 AM, Linus Torvalds wrote:
>>>> On Fri, Jul 24, 2020 at 12:27 PM Linus Torvalds wrote:
>>>>>
>>>>> It *may* make sense to say "ok, don't bother flushing the TLB if this
>>>>> is a retry, because we already did that originally". MAYBE.
>> [...]
>>>> We could say that we never need it at all for FAULT_FLAG_RETRY. That
>>>> makes a lot of sense to me.
>>>>
>>>> So a patch that does something like the appended (intentionally
>>>> whitespace-damaged) seems sensible.
>>>
>>> I tested your patch on our aarch64 box, with 128 online CPUs.
>> [...]
>>> There are two points to sum up.
>>>
>>> 1) The performance of page_fault3_process is restored, while the
>>> performance of page_fault3_thread is about ~80% of the vanilla,
>>> except in the case of 128 threads.
>>>
>>> 2) In the case of 128 threads, test worker threads seem to get stuck,
>>> making no progress in the iterations of mmap-write-munmap until some
>>> time later.  The test result is 0 because only the first 16 samples
>>> are counted, and they are all 0.  This situation is easy to reproduce
>>> with a large number of threads (not necessarily 128), and the stack
>>> of one stuck thread is shown below.
>>>
>>> [<0>] __switch_to+0xdc/0x150
>>> [<0>] wb_wait_for_completion+0x84/0xb0
>>> [<0>] __writeback_inodes_sb_nr+0x9c/0xe8
>>> [<0>] try_to_writeback_inodes_sb+0x6c/0x88
>>> [<0>] ext4_nonda_switch+0x90/0x98 [ext4]
>>> [<0>] ext4_page_mkwrite+0x248/0x4c0 [ext4]
>>> [<0>] do_page_mkwrite+0x4c/0x100
>>> [<0>] do_fault+0x2ac/0x3e0
>>> [<0>] handle_pte_fault+0xb4/0x258
>>> [<0>] __handle_mm_fault+0x1d8/0x3a8
>>> [<0>] handle_mm_fault+0x104/0x1d0
>>> [<0>] do_page_fault+0x16c/0x490
>>> [<0>] do_translation_fault+0x60/0x68
>>> [<0>] do_mem_abort+0x58/0x100
>>> [<0>] el0_da+0x24/0x28
>>> [<0>] 0xffffffffffffffff
>>>
>>> It seems quite normal, right? And I've run out of ideas.
>>
>> If threads get stuck here, it could be a stale TLB entry that's not
>> flushed with Linus' patch. Since that's a write fault, I think it hits
>> the FAULT_FLAG_TRIED case.
>
> There must be some changes in my test box, because I find that even the
> vanilla kernel (89b15332af7c^) gets a result of 0 in the 128t testcase.
> And I just directly used the historical test data as the baseline.  I
> will dig into this then.

Hi all,

I reset the test box and re-ran the whole test; the result this time
makes more sense.

Test    89b15332a^  Linus    Catalin  Yang
1p      100         90.10    79.20    86.19 %
1t      100         89.56    88.74    92.21 %
32p     100         98.22    97.36    98.91 %
32t     100         75.45    76.06    75.75 %
64p     100         99.97    100.01   99.97 %
64t     100         70.44    74.53    61.75 %
96p     100         99.95    99.91    100.00 %
96t     100         67.95    72.56    63.88 %
128p    100         99.92    99.93    100.12 %
128t    100         73.23    73.85    73.16 %

Sorry for the previously confusing test data. The performance drop in
thread mode is now the remaining issue.

Thanks
Yu

>
> And do we still need to be concerned about the ~20% performance drop in
> thread mode?
>
>>
>> Could you give my patch here a try as an alternative:
>>
>> https://lore.kernel.org/linux-mm/20200725155841.GA14490@gaia/
>
> I ran the same test on the same aarch64 box, with your patch; the result
> is as follows.
>
> test          vanilla kernel      patched kernel
> parameter     (89b15332af7c^)     (Catalin's patch)
> 1p            829299              787676     (96.36 %)
> 1t            998007              789284     (78.36 %)
> 32p           18916718            17921100   (94.68 %)
> 32t           2020918             1644146    (67.64 %)
> 64p           18965168            18983580   (100.0 %)
> 64t           1415404             1093750    (48.03 %)
> 96p           18949438            18963921   (100.1 %)
> 96t           1622876             1262878    (63.72 %)
> 128p          18926813            1680146    (8.89 %)
> 128t          1643109             0          (0.00 %)  # ignore this temporarily
>
> Thanks
> Yu
>
>>
>> It leaves the spurious flush in place but only local (though note that
>> in a guest under KVM, all local TLBIs are upgraded to inner-shareable,
>> so you'd not get the performance benefit).
>>
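As context for Catalin's last remark above: his alternative keeps the spurious flush but restricts it to the local CPU, and under KVM that local TLBI is upgraded to inner-shareable anyway, so a guest sees no benefit. A tiny model of that effective-scope point (the enum and helper are hypothetical, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

enum tlb_flush_scope {
	FLUSH_NONE,      /* Linus' approach: skip entirely on retried faults */
	FLUSH_LOCAL,     /* Catalin's approach: flush the local CPU only */
	FLUSH_BROADCAST, /* original behaviour: inner-shareable, all CPUs */
};

/* Hypothetical helper: what scope does a requested flush actually
 * have? Under KVM, the hypervisor upgrades local TLBIs to
 * inner-shareable, so "local" costs the same as a broadcast. */
static enum tlb_flush_scope effective_scope(enum tlb_flush_scope requested,
					    bool kvm_guest)
{
	if (requested == FLUSH_LOCAL && kvm_guest)
		return FLUSH_BROADCAST;	/* upgraded by the hypervisor */
	return requested;
}
```

This is why the two patches can perform differently in a guest: skipping the flush avoids the cost everywhere, while localizing it only helps on bare metal.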