Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault

From: Yang Shi <yang.shi@linux.alibaba.com>
To: Yu Xu <xuyu@linux.alibaba.com>,
	Catalin Marinas <catalin.marinas@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Hillf Danton <hdanton@sina.com>, Hugh Dickins <hughd@google.com>,
	Josef Bacik <josef@toxicpanda.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Linux-MM <linux-mm@kvack.org>,
	mm-commits@vger.kernel.org, Will Deacon <will.deacon@arm.com>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
Date: Mon, 27 Jul 2020 11:04:01 -0700
Message-ID: <eb1f5cb4-7c3d-df42-f4aa-804e12df45e2@linux.alibaba.com>
In-Reply-To: <39560818-463f-da3a-fc9e-3a4a0a082f61@linux.alibaba.com>

On 7/27/20 10:12 AM, Yu Xu wrote:
> On 7/27/20 7:05 PM, Catalin Marinas wrote:
>> On Mon, Jul 27, 2020 at 03:31:16PM +0800, Yu Xu wrote:
>>> On 7/25/20 4:22 AM, Linus Torvalds wrote:
>>>> On Fri, Jul 24, 2020 at 12:27 PM Linus Torvalds
>>>> <torvalds@linux-foundation.org> wrote:
>>>>>
>>>>> It *may* make sense to say "ok, don't bother flushing the TLB if this
>>>>> is a retry, because we already did that originally". MAYBE.
>> [...]
>>>> We could say that we never need it at all for FAULT_FLAG_RETRY. That
>>>> makes a lot of sense to me.
>>>>
>>>> So a patch that does something like the appended (intentionally
>>>> whitespace-damaged) seems sensible.
>>>
>>> I tested your patch on our aarch64 box, with 128 online CPUs.
>> [...]
>>> There are two points to sum up.
>>>
>>> 1) the performance of page_fault3_process is restored, while the
>>> performance of page_fault3_thread is about 80% of the vanilla, except
>>> in the case of 128 threads.
>>>
>>> 2) in the case of 128 threads, test worker threads seem to get stuck,
>>> making no progress in the iterations of mmap-write-munmap until some
>>> time later.  The test result is 0 because only the first 16 samples
>>> are counted, and they are all 0.  This situation is easy to reproduce
>>> with a large number of threads (not necessarily 128), and the stack of
>>> one stuck thread is shown below.
>>>
>>> [<0>] __switch_to+0xdc/0x150
>>> [<0>] wb_wait_for_completion+0x84/0xb0
>>> [<0>] __writeback_inodes_sb_nr+0x9c/0xe8
>>> [<0>] try_to_writeback_inodes_sb+0x6c/0x88
>>> [<0>] ext4_nonda_switch+0x90/0x98 [ext4]
>>> [<0>] ext4_page_mkwrite+0x248/0x4c0 [ext4]
>>> [<0>] do_page_mkwrite+0x4c/0x100
>>> [<0>] do_fault+0x2ac/0x3e0
>>> [<0>] handle_pte_fault+0xb4/0x258
>>> [<0>] __handle_mm_fault+0x1d8/0x3a8
>>> [<0>] handle_mm_fault+0x104/0x1d0
>>> [<0>] do_page_fault+0x16c/0x490
>>> [<0>] do_translation_fault+0x60/0x68
>>> [<0>] do_mem_abort+0x58/0x100
>>> [<0>] el0_da+0x24/0x28
>>> [<0>] 0xffffffffffffffff
>>>
>>> It seems quite normal, right? I've run out of ideas.
>>
>> If threads get stuck here, it could be a stale TLB entry that's not
>> flushed with Linus' patch. Since that's a write fault, I think it hits
>> the FAULT_FLAG_TRIED case.
>
> There must be some changes in my test box, because I find that even the
> vanilla kernel (89b15332af7c^) gets a result of 0 in the 128t testcase.
> And I just directly used the historical test data as the baseline.  I
> will dig into this then.

Thanks for doing the test.

>
> And do we still need to be concerned about the ~20% performance drop in
> thread mode?

I guess there might be more resource contention in thread mode, e.g. on
the page table lock, so the results might not be very stable. And retried
page faults may exacerbate such contention. Anyway, we got the process
mode back to normal and improved the thread mode a lot.

>
>>
>> Could you give my patch here a try as an alternative:
>>
>> https://lore.kernel.org/linux-mm/20200725155841.GA14490@gaia/
>
> I ran the same test on the same aarch64 box, with your patch, the result
> is as follows.
>
> test          vanilla kernel      patched kernel
> parameter     (89b15332af7c^)     (Catalin's patch)
> 1p            829299              787676    (96.36 %)
> 1t            998007              789284    (78.36 %)
> 32p           18916718            17921100  (94.68 %)
> 32t           2020918             1644146   (67.64 %)
> 64p           18965168            18983580  (100.0 %)
> 64t           1415404             1093750   (48.03 %)
> 96p           18949438            18963921  (100.1 %)
> 96t           1622876             1262878   (63.72 %)
> 128p          18926813            1680146   (8.89  %)
> 128t          1643109             0 (0.00 % ) # ignore this temporarily

It looks like Linus's patch has better data. That seems sane to me, since
Catalin's patch still needs to flush the TLB in the shared domain.
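
To make the idea of skipping the flush on a retried fault concrete, here
is a rough user-space model of the decision as I understand it from the
discussion above. It is only a sketch and not the actual mm/memory.c
hunk: the flag values, the enum and the helper names are all made up for
illustration. The assumption is that the spurious-fault flush is only
considered when the access flag update did not change the PTE, and that a
retried fault skips it entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; the values are assumptions, not the kernel's. */
#define SKETCH_FAULT_FLAG_WRITE 0x01u
#define SKETCH_FAULT_FLAG_TRIED 0x20u

enum sketch_tlb_action {
	SKETCH_TLB_NONE,           /* nothing to flush */
	SKETCH_TLB_FLUSH_SPURIOUS  /* would issue the spurious-fault flush */
};

/*
 * Hypothetical helper: decide what to do after the access flag update.
 * pte_changed models "the access flag update actually modified the PTE".
 */
static enum sketch_tlb_action
sketch_access_flag_action(unsigned int flags, bool pte_changed)
{
	if (pte_changed)
		return SKETCH_TLB_NONE;		/* PTE was updated, no spurious flush */
	if (flags & SKETCH_FAULT_FLAG_TRIED)
		return SKETCH_TLB_NONE;		/* retried fault: skip the flush */
	if (flags & SKETCH_FAULT_FLAG_WRITE)
		return SKETCH_TLB_FLUSH_SPURIOUS; /* possibly stale TLB entry */
	return SKETCH_TLB_NONE;
}

/* Small self-check covering the cases discussed in this thread. */
static void sketch_self_check(void)
{
	/* First-time write fault with an unchanged PTE still flushes. */
	assert(sketch_access_flag_action(SKETCH_FAULT_FLAG_WRITE, false) ==
	       SKETCH_TLB_FLUSH_SPURIOUS);
	/* The same fault on retry does not. */
	assert(sketch_access_flag_action(SKETCH_FAULT_FLAG_WRITE |
					 SKETCH_FAULT_FLAG_TRIED, false) ==
	       SKETCH_TLB_NONE);
}
```

In this model a retried write fault never reaches the flush at all, which
would be consistent with the process-mode numbers recovering, whereas a
variant that keeps the flush still pays for it on every retry.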

>
> Thanks
> Yu
>
>>
>> It leaves the spurious flush in place but only local (though note that
>> in a guest under KVM, all local TLBIs are upgraded to inner-shareable,
>> so you'd not get the performance benefit).
>>