Re: [PATCH RESEND v4] arm64: mm: fix linear mem mapping access performance degradation

From: "guanghui.fgh" <guanghuifeng@linux.alibaba.com>
To: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mike Rapoport <rppt@kernel.org>, Will Deacon <will@kernel.org>,
	Ard Biesheuvel <ardb@kernel.org>,
	baolin.wang@linux.alibaba.com, akpm@linux-foundation.org,
	david@redhat.com, jianyong.wu@arm.com, james.morse@arm.com,
	quic_qiancai@quicinc.com, christophe.leroy@csgroup.eu,
	jonathan@marek.ca, mark.rutland@arm.com,
	thunder.leizhen@huawei.com, anshuman.khandual@arm.com,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, geert+renesas@glider.be,
	linux-mm@kvack.org, yaohongbo@linux.alibaba.com,
	alikernel-developer@linux.alibaba.com
Subject: Re: [PATCH RESEND v4] arm64: mm: fix linear mem mapping access performance degradation
Date: Fri, 8 Jul 2022 20:28:18 +0800
Message-ID: <f0dd1176-39f1-0a08-d0e2-627dd437aa5f@linux.alibaba.com>
In-Reply-To: <YsWtCLIG2qKETqmq@arm.com>

Thanks.

On 2022/7/6 23:40, Catalin Marinas wrote:
> On Wed, Jul 06, 2022 at 11:18:22PM +0800, guanghui.fgh wrote:
>> On 2022/7/6 21:54, Mike Rapoport wrote:
>>> One thing I can think of is to only remap the crash kernel memory if it is
>>> a part of an allocation that exactly fits into one or more PUDs.
>>>
>>> Say, in reserve_crashkernel() we try the memblock_phys_alloc() with
>>> PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation
>>> succeeds, we remap the entire area that now contains only memory allocated
>>> in reserve_crashkernel() and free the extra memory after remapping is done.
>>> If the large allocation fails, we fall back to the original size and
>>> alignment and don't allow unmapping crash kernel memory in
>>> arch_kexec_protect_crashkres().
>>
>> There is a new method.
>> I think we should use patch v3 (similar, but with some changes added):
>>
>> 1. Walk the crashkernel block/section page table
>> (keeping the original block/section mapping valid),
>> rebuild the PTE-level page mapping for the crashkernel memory, and
>> rebuild the left and right margin memory (which is in the same
>> block/section mapping but outside the crashkernel memory) with
>> block/section mappings.
>>
>> 2. 'Replace' the original block/section mapping with the newly built
>> mapping, iteratively.
>>
>> With this method, all the memory mappings stay valid all the time.
> 
> As I already commented on one of your previous patches, this is not
> allowed by the architecture. If FEAT_BBM is implemented (ARMv8.4 I
> think), the worst that can happen is a TLB conflict abort and the
> handler should invalidate the TLBs and restart the faulting instruction,
> assuming the handler won't try to access the same conflicting virtual
> address. Prior to FEAT_BBM, that's not possible as the architecture does
> not describe a precise behaviour of conflicting TLB entries (you might
> as well get the TLB output of multiple entries being or'ed together).
> 

The CPU can generate a TLB conflict abort if it detects that the address
being looked up in the TLB hits multiple entries.

(1). I think that when merging small pages into a block/section mapping,
there may be a TLB conflict if BBM (break-before-make) is not followed.

Namely:
a. Map a 4KB page (address X).
   Touch that page, in order to get the translation cached in the TLB.

b. Modify the translation tables,
   replacing the mapping for address X with a 2MB mapping. DO NOT
   INVALIDATE the TLB.

c. Touch "X + 4KB".
   This will/should miss in the TLB, causing a new walk that returns the
   2MB mapping.

d. Touch X.
   Assuming they've not been evicted, you'll hit on both the 4KB and the
   2MB mapping, as both cover address X.

This is a TLB conflict.
(link: 
https://community.arm.com/support-forums/f/dev-platforms-forum/13583/tlb-conflict-abort)
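
Steps a to d above can be replayed on a toy model. This is only a sketch
under my own assumptions: a single, fully associative TLB with no eviction
and no speculative walks. All names (toy_tlb, toy_access, ...) are
hypothetical and are not kernel or architecture interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TLB_SLOTS 8

/* One cached translation covering [va_base, va_base + size). */
struct toy_tlb_entry {
	bool valid;
	uint64_t va_base;
	uint64_t size;
};

struct toy_tlb {
	struct toy_tlb_entry slot[TOY_TLB_SLOTS];
};

/* Count valid entries covering va; more than one models a TLB conflict. */
int toy_tlb_hits(const struct toy_tlb *tlb, uint64_t va)
{
	int hits = 0;

	for (size_t i = 0; i < TOY_TLB_SLOTS; i++) {
		const struct toy_tlb_entry *e = &tlb->slot[i];

		if (e->valid && va >= e->va_base && va - e->va_base < e->size)
			hits++;
	}
	return hits;
}

/*
 * Model one access to va while the page table currently maps it with
 * map_size granularity. On a miss, cache a new entry in a free slot;
 * nothing is ever invalidated. Returns how many entries match va
 * after the access.
 */
int toy_access(struct toy_tlb *tlb, uint64_t va, uint64_t map_size)
{
	/* map_size must be a power of two (4KB, 2MB, 1GB, ...). */
	assert(map_size != 0 && (map_size & (map_size - 1)) == 0);

	if (toy_tlb_hits(tlb, va) == 0) {
		for (size_t i = 0; i < TOY_TLB_SLOTS; i++) {
			struct toy_tlb_entry *e = &tlb->slot[i];

			if (!e->valid) {
				e->valid = true;
				e->va_base = va & ~(map_size - 1);
				e->size = map_size;
				break;
			}
		}
	}
	return toy_tlb_hits(tlb, va);
}
```

For a 2MB-aligned X, calling toy_access(&tlb, X, 4096), then
toy_access(&tlb, X + 4096, 2097152), then toy_access(&tlb, X, 2097152)
follows steps a, c and d; the last call reports two matching entries.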

(2). But when splitting a large block/section mapping into a smaller
granularity, there may be no TLB conflict.

Namely:
a. Rebuild the PTE-level mapping without any change to the original page
   table (the relation between virtual and physical addresses stays the
   same).

b. Modify the 1G mapping in memory to use the new PTE-level mapping,
   without a TLB flush.

c. When the CPU accesses the 1G memory (anywhere):
   If the 1G TLB entry is already cached in the TLB, every access to the
   1G memory succeeds (no new TLB entry is loaded, so no conflict).

   If the 1G TLB entry has been evicted, the CPU walks the page table in
   memory. Whether it sees the old (1G) or the new (4K) page table in
   memory, every access to the 1G memory succeeds (a new TLB entry is
   loaded, no conflict).

d. Afterwards, we flush the TLB and force the CPU to use the new page
   table (no conflict).

It seems that there are never two TLB entries for the same virtual address
in the TLB when splitting a large block/section mapping.
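
To give a sense of scale for step a in (2), here is a rough calculation of
how many new tables a full PTE-level remap needs. It assumes the 4K granule
with 512 entries per table, and that the whole block is remapped with 4K
pages (an upper bound, since the margins could stay block-mapped). The
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE    4096ULL	/* 4K granule (assumed) */
#define TOY_PTRS_PER_TBL 512ULL		/* 8-byte entries per 4K table */

struct toy_split_cost {
	uint64_t pmd_tables;	/* new level-2 tables */
	uint64_t pte_tables;	/* new level-3 tables */
	uint64_t table_bytes;	/* memory used by all new tables */
};

/* Tables needed to remap one 2MB or 1GB block entirely with 4K PTEs. */
struct toy_split_cost toy_pte_split_cost(uint64_t block_bytes)
{
	const uint64_t pmd_span = TOY_PAGE_SIZE * TOY_PTRS_PER_TBL; /* 2MB */
	const uint64_t pud_span = pmd_span * TOY_PTRS_PER_TBL;      /* 1GB */
	struct toy_split_cost c;

	/* Only the two block sizes of the 4K granule are modelled. */
	assert(block_bytes == pmd_span || block_bytes == pud_span);

	/* One PTE table per 2MB of the block. */
	c.pte_tables = block_bytes / pmd_span;
	/* A 1GB block additionally needs one PMD table to hold them. */
	c.pmd_tables = (block_bytes == pud_span) ? 1 : 0;
	c.table_bytes = (c.pte_tables + c.pmd_tables) * TOY_PAGE_SIZE;
	return c;
}
```

Under these assumptions, one 1G block needs 1 PMD table plus 512 PTE
tables, i.e. 513 pages (a little over 2MB) of new page tables.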

(3). At the same time, I think we can use another way.
As the system linear mapping is built with init_pg_dir, we can also
reuse init_pg_dir later to split the block/section mapping.
As init_pg_dir contains all the kernel text/data mappings, we can comply
with the BBM requirement.

a. Rebuild the new PTE-level mapping without any change to the old
   mapping (the CPU can't walk the new page mapping; it's isolated).

b. Switch to init_pg_dir.

c. Clear the old 1G block mapping and flush the TLB.

d. Modify the linear mapping to use the new PTE-level page mapping while
   running on init_pg_dir (TLB BBM).

e. Switch back to swapper_pg_dir.
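
The ordering that steps c and d rely on (break, flush, then make) can be
sketched as a small state machine that refuses any out-of-order step. This
only illustrates the ordering; the names are hypothetical and it does not
touch real page tables or call any kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states of one block entry during the update. */
enum toy_bbm_state {
	TOY_OLD_VALID,	/* old 1G block entry is live */
	TOY_BROKEN,	/* entry cleared, stale TLB entries may remain */
	TOY_FLUSHED,	/* entry cleared and TLB invalidated */
	TOY_NEW_VALID,	/* new PTE-level table is live */
};

/* Move from 'from' to 'to'; refuse the step if the state is not 'from'. */
bool toy_bbm_step(enum toy_bbm_state *s, enum toy_bbm_state from,
		  enum toy_bbm_state to)
{
	assert(s != NULL);

	if (*s != from)
		return false;
	*s = to;
	return true;
}

/* Step c, first half: clear the old block entry ("break"). */
bool toy_bbm_break(enum toy_bbm_state *s)
{
	return toy_bbm_step(s, TOY_OLD_VALID, TOY_BROKEN);
}

/* Step c, second half: invalidate the TLB for the range. */
bool toy_bbm_flush(enum toy_bbm_state *s)
{
	return toy_bbm_step(s, TOY_BROKEN, TOY_FLUSHED);
}

/* Step d: install the new PTE-level table ("make"). */
bool toy_bbm_make(enum toy_bbm_state *s)
{
	return toy_bbm_step(s, TOY_FLUSHED, TOY_NEW_VALID);
}
```

Starting from TOY_OLD_VALID, toy_bbm_make() fails until toy_bbm_break()
and toy_bbm_flush() have both succeeded, which is the property (2) above
skips and (3) tries to keep.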

Could you give me some advice?

Thanks.