Subject: Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
From: Yu Xu <xuyu@linux.alibaba.com>
To: Catalin Marinas
Cc: Linus Torvalds, Andrew Morton, Johannes Weiner, Hillf Danton,
 Hugh Dickins, Josef Bacik, "Kirill A. Shutemov", Linux-MM,
 mm-commits@vger.kernel.org, Will Deacon, Matthew Wilcox,
 yang.shi@linux.alibaba.com
Date: Tue, 28 Jul 2020 14:41:34 +0800
In-Reply-To: <39560818-463f-da3a-fc9e-3a4a0a082f61@linux.alibaba.com>
References: <20200723211432.b31831a0df3bc2cbdae31b40@linux-foundation.org> <20200724041508.QlTbrHnfh%akpm@linux-foundation.org> <0323de82-cfbd-8506-fa9c-a702703dd654@linux.alibaba.com> <20200727110512.GB25400@gaia> <39560818-463f-da3a-fc9e-3a4a0a082f61@linux.alibaba.com>

On 7/28/20 1:12 AM, Yu Xu wrote:
> On 7/27/20 7:05 PM, Catalin Marinas wrote:
>> On Mon, Jul 27, 2020 at 03:31:16PM +0800, Yu Xu
wrote:
>>> On 7/25/20 4:22 AM, Linus Torvalds wrote:
>>>> On Fri, Jul 24, 2020 at 12:27 PM Linus Torvalds wrote:
>>>>>
>>>>> It *may* make sense to say "ok, don't bother flushing the TLB if this
>>>>> is a retry, because we already did that originally". MAYBE.
>> [...]
>>>> We could say that we never need it at all for FAULT_FLAG_RETRY. That
>>>> makes a lot of sense to me.
>>>>
>>>> So a patch that does something like the appended (intentionally
>>>> whitespace-damaged) seems sensible.
>>>
>>> I tested your patch on our aarch64 box, with 128 online CPUs.
>> [...]
>>> There are two points to sum up.
>>>
>>> 1) The performance of page_fault3_process is restored, while the
>>> performance of page_fault3_thread is about ~80% of the vanilla,
>>> except in the case of 128 threads.
>>>
>>> 2) In the case of 128 threads, test worker threads seem to get stuck,
>>> making no progress in the iterations of mmap-write-munmap until some
>>> time later.  The test result is 0 because only the first 16 samples
>>> are counted, and they are all 0.  This situation is easy to reproduce
>>> with a large number of threads (not necessarily 128), and the stack
>>> of one stuck thread is shown below.
>>>
>>> [<0>] __switch_to+0xdc/0x150
>>> [<0>] wb_wait_for_completion+0x84/0xb0
>>> [<0>] __writeback_inodes_sb_nr+0x9c/0xe8
>>> [<0>] try_to_writeback_inodes_sb+0x6c/0x88
>>> [<0>] ext4_nonda_switch+0x90/0x98 [ext4]
>>> [<0>] ext4_page_mkwrite+0x248/0x4c0 [ext4]
>>> [<0>] do_page_mkwrite+0x4c/0x100
>>> [<0>] do_fault+0x2ac/0x3e0
>>> [<0>] handle_pte_fault+0xb4/0x258
>>> [<0>] __handle_mm_fault+0x1d8/0x3a8
>>> [<0>] handle_mm_fault+0x104/0x1d0
>>> [<0>] do_page_fault+0x16c/0x490
>>> [<0>] do_translation_fault+0x60/0x68
>>> [<0>] do_mem_abort+0x58/0x100
>>> [<0>] el0_da+0x24/0x28
>>> [<0>] 0xffffffffffffffff
>>>
>>> It seems quite normal, right? And I've run out of ideas.
>>
>> If threads get stuck here, it could be a stale TLB entry that's not
>> flushed with Linus' patch. Since that's a write fault, I think it hits
>> the FAULT_FLAG_TRIED case.
>
> There must be some changes in my test box, because I find that even the
> vanilla kernel (89b15332af7c^) gets a result of 0 in the 128t testcase.
> And I just directly used the historical test data as the baseline.  I
> will dig into this then.

Hi all,

I reset the test box and re-ran the whole test; the result this time
makes more sense.

Test    89b15332a^  Linus    Catalin  Yang
1p      100         90.10    79.20    86.19 %
1t      100         89.56    88.74    92.21 %
32p     100         98.22    97.36    98.91 %
32t     100         75.45    76.06    75.75 %
64p     100         99.97    100.01   99.97 %
64t     100         70.44    74.53    61.75 %
96p     100         99.95    99.91    100.00 %
96t     100         67.95    72.56    63.88 %
128p    100         99.92    99.93    100.12 %
128t    100         73.23    73.85    73.16 %

Sorry for the previously confusing test data. The performance drop in
thread mode is now the remaining issue.

Thanks
Yu

>
> And do we still need to be concerned about the ~20% performance drop in
> thread mode?
>
>>
>> Could you give my patch here a try as an alternative:
>>
>> https://lore.kernel.org/linux-mm/20200725155841.GA14490@gaia/
>
> I ran the same test on the same aarch64 box, with your patch; the result
> is as follows.
>
> test          vanilla kernel      patched kernel
> parameter     (89b15332af7c^)     (Catalin's patch)
> 1p            829299              787676     (96.36 %)
> 1t            998007              789284     (78.36 %)
> 32p           18916718            17921100   (94.68 %)
> 32t           2020918             1644146    (67.64 %)
> 64p           18965168            18983580   (100.0 %)
> 64t           1415404             1093750    (48.03 %)
> 96p           18949438            18963921   (100.1 %)
> 96t           1622876             1262878    (63.72 %)
> 128p          18926813            1680146    (8.89 %)
> 128t          1643109             0          (0.00 %)  # ignore this temporarily
>
> Thanks
> Yu
>
>>
>> It leaves the spurious flush in place but only local (though note that
>> in a guest under KVM, all local TLBIs are upgraded to inner-shareable,
>> so you'd not get the performance benefit).
>>
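As context for Catalin's last remark above: his alternative keeps the spurious flush but restricts it to the local CPU, and under KVM that local TLBI is upgraded to inner-shareable anyway, so a guest sees no benefit. A tiny model of that effective-scope point (the enum and helper are hypothetical, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

enum tlb_flush_scope {
	FLUSH_NONE,      /* Linus' approach: skip entirely on retried faults */
	FLUSH_LOCAL,     /* Catalin's approach: flush the local CPU only */
	FLUSH_BROADCAST, /* original behaviour: inner-shareable, all CPUs */
};

/* Hypothetical helper: what scope does a requested flush actually
 * have? Under KVM, the hypervisor upgrades local TLBIs to
 * inner-shareable, so "local" costs the same as a broadcast. */
static enum tlb_flush_scope effective_scope(enum tlb_flush_scope requested,
					    bool kvm_guest)
{
	if (requested == FLUSH_LOCAL && kvm_guest)
		return FLUSH_BROADCAST;	/* upgraded by the hypervisor */
	return requested;
}
```

This is why the two patches can perform differently in a guest: skipping the flush avoids the cost everywhere, while localizing it only helps on bare metal.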