Re: [PATCH] arm64: mm: fix linear mapping mem access performace degradation

From: "guanghui.fgh" <guanghuifeng@linux.alibaba.com>
To: "Leizhen (ThunderTown)" <thunder.leizhen@huawei.com>,
	Mike Rapoport <rppt@kernel.org>
Cc: baolin.wang@linux.alibaba.com, catalin.marinas@arm.com,
	will@kernel.org, akpm@linux-foundation.org, david@redhat.com,
	jianyong.wu@arm.com, james.morse@arm.com,
	quic_qiancai@quicinc.com, christophe.leroy@csgroup.eu,
	jonathan@marek.ca, mark.rutland@arm.com,
	anshuman.khandual@arm.com, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, geert+renesas@glider.be,
	ardb@kernel.org, linux-mm@kvack.org, bhe@redhat.com,
	Yao Hongbo <yaohongbo@linux.alibaba.com>
Subject: Re: [PATCH] arm64: mm: fix linear mapping mem access performace degradation
Date: Tue, 28 Jun 2022 15:52:48 +0800
Message-ID: <2e76694e-9ead-fc05-c8ad-01646ff02151@linux.alibaba.com>
In-Reply-To: <54f13945-fa35-247d-ca33-182931fd05ff@linux.alibaba.com>

On 2022/6/28 11:06, guanghui.fgh wrote:
> Thanks.
> 
> On 2022/6/28 9:34, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2022/6/27 20:25, guanghui.fgh wrote:
>>> Thanks.
>>>
>>> On 2022/6/27 20:06, Leizhen (ThunderTown) wrote:
>>>>
>>>>
>>>> On 2022/6/27 18:46, guanghui.fgh wrote:
>>>>>
>>>>>
>>>>> On 2022/6/27 17:49, Mike Rapoport wrote:
>>>>>> Please don't post HTML.
>>>>>>
>>>>>> On Mon, Jun 27, 2022 at 05:24:10PM +0800, guanghui.fgh wrote:
>>>>>>> Thanks.
>>>>>>>
>>>>>>> On 2022/6/27 14:34, Mike Rapoport wrote:
>>>>>>>
>>>>>>>        On Sun, Jun 26, 2022 at 07:10:15PM +0800, Guanghui Feng wrote:
>>>>>>>
>>>>>>>            arm64 can build 2M/1G block/section mappings. When using the
>>>>>>>            DMA/DMA32 zone (crashkernel enabled, rodata full disabled,
>>>>>>>            kfence disabled), the mem_map will use non-block/section
>>>>>>>            mappings (because crashkernel requires shrinking the region
>>>>>>>            at page granularity). But this degrades performance when
>>>>>>>            doing large contiguous memory accesses in the kernel
>>>>>>>            (memcpy/memmove, etc.).
>>>>>>>
>>>>>>>            There are many changes and discussions:
>>>>>>>            commit 031495635b46
>>>>>>>            commit 1a8e1cef7603
>>>>>>>            commit 8424ecdde7df
>>>>>>>            commit 0a30c53573b0
>>>>>>>            commit 2687275a5843
>>>>>>>
>>>>>>>        Please include a one-line summary of each commit. (See section
>>>>>>>        "Describe your changes" in
>>>>>>>        Documentation/process/submitting-patches.rst)
>>>>>>>
>>>>>>> OK, I will add a one-line summary for each commit in the git commit message.
>>>>>>>
>>>>>>>            This patch changes mem_map to use block/section mappings
>>>>>>>            with crashkernel. First, do block/section mapping (normally
>>>>>>>            2M or 1G) for all available memory at mem_map and reserve
>>>>>>>            the crashkernel memory. Then walk the page table to split
>>>>>>>            block/section mappings into non-block/section mappings
>>>>>>>            (normally 4K) [[[only]]] for the crashkernel memory.
>>>>>>>
>>>>>>>        This already happens when ZONE_DMA/ZONE_DMA32 are disabled.
>>>>>>>        Please explain why it is OK to change the way the memory is
>>>>>>>        mapped with ZONE_DMA/ZONE_DMA32 enabled.
>>>>>>>
>>>>>>> In short:
>>>>>>>
>>>>>>> 1. Build all available memory with block/section mappings (normally
>>>>>>>    1G/2M) without inspecting crashkernel.
>>>>>>> 2. Reserve the crashkernel memory the same way as before.
>>>>>>> 3. Change only the crashkernel memory mapping to normal mappings
>>>>>>>    (normally 4K).
>>>>>>>
>>>>>>> With this method, as many block/section mappings as possible are kept.
>>>>>>
>>>>>> This does not answer the question why changing the way the memory is
>>>>>> mapped when there is ZONE_DMA/DMA32 and crashkernel won't cause a
>>>>>> regression.
>>>>>>
>>>>> 1. Quoted from the comment in arch/arm64/mm/init.c:
>>>>>
>>>>> "Memory reservation for crash kernel either done early or deferred
>>>>> depending on DMA memory zones configs (ZONE_DMA) --
>>>>>
>>>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
>>>>> here instead of max_zone_phys().  This lets early reservation of
>>>>> crash kernel memory which has a dependency on arm64_dma_phys_limit.
>>>>> Reserving memory early for crash kernel allows linear creation of block
>>>>> mappings (greater than page-granularity) for all the memory bank ranges.
>>>>> In this scheme a comparatively quicker boot is observed.
>>>>>
>>>>> If ZONE_DMA configs are defined, crash kernel memory reservation
>>>>> is delayed until DMA zone memory range size initialization performed in
>>>>> zone_sizes_init().  The defer is necessary to steer clear of DMA zone
>>>>> memory range to avoid overlap allocation.  So crash kernel memory
>>>>> boundaries are not known when mapping all bank memory ranges, which
>>>>> otherwise means not possible to exclude crash kernel range from
>>>>> creating block mappings so page-granularity mappings are created
>>>>> for the entire memory range."
>>>>>
>>>>> Namely, the init order is: memblock init ---> linear memory mapping (4K
>>>>> mapping for crashkernel, which requires page-granularity
>>>>> changes) ---> zone DMA limit ---> reserve crashkernel.
>>>>> So when ZONE_DMA is enabled and crashkernel is used, the memory is
>>>>> mapped with 4K mappings.
>>>>>
>>>>> 2. As mentioned above, when the linear memory simply uses 4K mappings,
>>>>> the dTLB miss rate is high (which degrades performance).
>>>>> This patch uses block/section mappings as far as possible, which
>>>>> improves performance.
>>>>>
>>>>> 3. This patch reserves crashkernel the same way as before (same ZONE_DMA
>>>>> & crashkernel reservation order), and only changes the linear memory
>>>>> mapping to block/section mappings.
>>>>> .
>>>>>
>>>>
>>>> I think Mike Rapoport is probably asking whether you've taken things
>>>> such as BBM (break-before-make) into account. For example, in the
>>>> following code we should prepare the next-level pgtable first, then
>>>> change the 2M block mapping to 4K page mappings, and flush the TLB at
>>>> the end.
>>>>> +static void init_crashkernel_pmd(pud_t *pudp, unsigned long addr,
>>>> +                 unsigned long end, phys_addr_t phys,
>>>> +                 pgprot_t prot,
>>>> +                 phys_addr_t (*pgtable_alloc)(int), int flags)
>>>> +{
>>>> +    phys_addr_t map_offset;
>>>> +    unsigned long next;
>>>> +    pmd_t *pmdp;
>>>> +    pmdval_t pmdval;
>>>> +
>>>> +    pmdp = pmd_offset(pudp, addr);
>>>> +    do {
>>>> +        next = pmd_addr_end(addr, end);
>>>> +        if (!pmd_none(*pmdp) && pmd_sect(*pmdp)) {
>>>> +            phys_addr_t pte_phys = pgtable_alloc(PAGE_SHIFT);
>>>> +            pmd_clear(pmdp);
>>>> +            pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN;
>>>> +            if (flags & NO_EXEC_MAPPINGS)
>>>> +                pmdval |= PMD_TABLE_PXN;
>>>> +            __pmd_populate(pmdp, pte_phys, pmdval);
>>>> +            flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>>>>
>>>> The pgtable is empty now. However, memory other than crashkernel may
>>>> be being accessed.
>>> 1. When reserving crashkernel and remapping the linear memory mapping,
>>> only the boot CPU is running. No other CPU/thread is running at
>>> the same time.
>>
>> So, put this in the code comment?
> OK.
>>
>> If scalability is considered and unpredictable changes occur in the
>> future (for example, other modules also need this mapping function),
>> it would be better to deal with the BBM now and make this public.
> OK, could you give me some advice?
>>
>>
>>>
>>> 2. When clearing the block/section mapping, I flush the TLB with
>>> flush_tlb_kernel_range. Afterwards the 4K mappings are rebuilt (I think
>>> no further TLB flush is needed).
>>
>>
>>>
>>>>
>>>> +
>>>> +            map_offset = addr - (addr & PMD_MASK);
>>>> +            if (map_offset)
>>>> +                alloc_init_cont_pte(pmdp, addr & PMD_MASK, addr,
>>>> +                        phys - map_offset, prot,
>>>> +                        pgtable_alloc, flags);
>>>> +
>>>> +            if (next < (addr & PMD_MASK) + PMD_SIZE)
>>>> +                alloc_init_cont_pte(pmdp, next, (addr & PUD_MASK) +
>>>> +                        PUD_SIZE, next - addr + phys,
>>>> +                        prot, pgtable_alloc, flags);
>>
>> Here and in alloc_crashkernel_pud(), the raw flags should be used. They
>> may not contain (NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS).
> Yes. The memory outside crashkernel should use block/section mappings as
> far as possible, including the LeftMargin and RightMargin.
> But I tested this on a HiSilicon Kunpeng 920-6426 and got a
> performance degradation (without the NO_BLOCK_MAPPINGS/NO_CONT_MAPPINGS
> flags for the left/right margin).
> It's strange; could you give some advice? Maybe it works well on other arm
> platforms, just not on the HiSilicon Kunpeng 920-6426.
The non-crashkernel memory should be split [[[ without ]]] the
NO_BLOCK_MAPPINGS/NO_CONT_MAPPINGS flags.
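
For reference, the left/right margin arithmetic discussed above can be
sketched in user space. This is only an illustration, assuming a 4K granule
(so one PMD-level block covers 2M); split_pmd(), struct range, struct
pmd_split and the SKETCH_* macros are hypothetical names and are not part of
the patch or the kernel:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4K granule, so one PMD-level block covers 2M. */
#define SKETCH_PMD_SHIFT 21
#define SKETCH_PMD_SIZE  (UINT64_C(1) << SKETCH_PMD_SHIFT)
#define SKETCH_PMD_MASK  (~(SKETCH_PMD_SIZE - 1))

/* Half-open address range [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/* The three pieces of one 2M block that contains crashkernel memory. */
struct pmd_split {
	struct range left;   /* non-crashkernel memory before addr */
	struct range crash;  /* crashkernel part, needs page granularity */
	struct range right;  /* non-crashkernel memory after end */
};

/*
 * Split the 2M block containing addr into left margin, crashkernel part
 * and right margin. If [addr, end) extends past this block, the crashkernel
 * part is clamped to the block end and the right margin is empty.
 */
static struct pmd_split split_pmd(uint64_t addr, uint64_t end)
{
	uint64_t blk_start = addr & SKETCH_PMD_MASK;
	uint64_t blk_end = blk_start + SKETCH_PMD_SIZE;
	struct pmd_split s;

	assert(addr < end);
	if (end > blk_end)
		end = blk_end;

	s.left.start = blk_start;
	s.left.end = addr;
	s.crash.start = addr;
	s.crash.end = end;
	s.right.start = end;
	s.right.end = blk_end;
	return s;
}
```

For example, split_pmd(0x40280000, 0x40300000) gives a left margin of
[0x40200000, 0x40280000) and a right margin of [0x40300000, 0x40400000);
only the middle piece has to be mapped at page granularity.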

I tested this on another arm platform [[[ a non-HiSilicon arm platform ]]]
and it also shows a large performance improvement.

Could you help me check the difference between the HiSilicon Kunpeng
920-6426 and other arm platforms regarding TLB support for block/section
mappings?

>>
>>>> +        }
>>>> +        alloc_crashkernel_cont_pte(pmdp, addr, next, phys, prot,
>>>> +                       pgtable_alloc, flags);
>>>> +        phys += next - addr;
>>>> +    } while (pmdp++, addr = next, addr != end);
>>>> +}
>>>>
>>> .
>>>
>>
