* [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation @ 2022-07-02 15:57 ` Guanghui Feng 0 siblings, 0 replies; 79+ messages in thread From: Guanghui Feng @ 2022-07-02 15:57 UTC (permalink / raw) To: baolin.wang, catalin.marinas, will, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo Cc: alikernel-developer The arm64 kernel can build 2M/1G block/section mappings. When using a DMA/DMA32 zone (crashkernel enabled, rodata full disabled, kfence disabled), the mem_map will use non-block/section mappings (because crashkernel requires shrinking the region at page granularity). But this degrades performance when doing large contiguous memory accesses in the kernel (memcpy/memmove, etc.). There have been many changes and discussions: commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for platforms with no DMA memory zones") commit 0a30c53573b0 ("arm64: mm: Move reserve_crashkernel() into mem_init()") commit 2687275a5843 ("arm64: Force NO_BLOCK_MAPPINGS if crashkernel reservation is required") This patch changes mem_map to use block/section mappings together with crashkernel. First, build block/section mappings (normally 2M or 1G) for all available memory at mem_map time, then reserve the crashkernel memory. After that, walk the pagetable to split the block/section mappings into non-block/section mappings (normally 4K) [[[only]]] for the crashkernel memory. So the linear memory mapping uses block/section mappings as much as possible. This reduces CPU dTLB misses conspicuously and accelerates memory access, giving about a 10-20% performance improvement. I have tested it with pft (Page Fault Test) and fio, and obtained a great performance improvement.
For fio test: 1.prepare ramdisk modprobe -r brd modprobe brd rd_nr=1 rd_size=67108864 dmsetup remove_all wipefs -a --force /dev/ram0 mkfs -t ext4 -E lazy_itable_init=0,lazy_journal_init=0 -q -F /dev/ram0 mkdir -p /fs/ram0 mount -t ext4 /dev/ram0 /fs/ram0 2.prepare fio parameters in x.fio file: [global] bs=4k ioengine=psync iodepth=128 size=32G direct=1 invalidate=1 group_reporting thread=1 rw=read directory=/fs/ram0 numjobs=1 [task_0] cpus_allowed=16 stonewall=1 3.run testcase: perf stat -e dTLB-load-misses fio x.fio 4.comparison ------------------------ without patch with patch fio READ aggrb=1493.2MB/s aggrb=1775.3MB/s dTLB-load-misses 1,818,320,693 438,729,774 time elapsed(s) 70.500326434 62.877316408 user(s) 15.926332000 15.684721000 sys(s) 54.211939000 47.046165000 5.conclusion Using this patch will reduce dTLB misses and improve performance greatly. Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com> --- arch/arm64/include/asm/mmu.h | 1 + arch/arm64/mm/init.c | 8 +- arch/arm64/mm/mmu.c | 176 +++++++++++++++++++++++++++++++------------ 3 files changed, 132 insertions(+), 53 deletions(-) diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h index 48f8466..1a46b81 100644 --- a/arch/arm64/include/asm/mmu.h +++ b/arch/arm64/include/asm/mmu.h @@ -63,6 +63,7 @@ static inline bool arm64_kernel_unmapped_at_el0(void) extern void arm64_memblock_init(void); extern void paging_init(void); extern void bootmem_init(void); +extern void map_crashkernel(void); extern void __iomem *early_io_map(phys_addr_t phys, unsigned long virt); extern void init_mem_pgprot(void); extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys, diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index 339ee84..241d27e 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -190,6 +190,7 @@ static void __init reserve_crashkernel(void) crashk_res.start = crash_base; crashk_res.end = crash_base + crash_size - 1; insert_resource(&iomem_resource,
&crashk_res); + map_crashkernel(); } /* @@ -388,10 +389,6 @@ void __init arm64_memblock_init(void) } early_init_fdt_scan_reserved_mem(); - - if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) - reserve_crashkernel(); - high_memory = __va(memblock_end_of_DRAM() - 1) + 1; } @@ -438,8 +435,7 @@ void __init bootmem_init(void) * request_standard_resources() depends on crashkernel's memory being * reserved, so do it here. */ - if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32)) - reserve_crashkernel(); + reserve_crashkernel(); memblock_dump_all(); } diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 626ec32..76a4ff0 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -65,6 +65,10 @@ static DEFINE_SPINLOCK(swapper_pgdir_lock); static DEFINE_MUTEX(fixmap_lock); +static void unmap_hotplug_range(unsigned long addr, unsigned long end, + bool free_mapped, struct vmem_altmap *altmap, + pgprot_t prot, + phys_addr_t (*pgtable_alloc)(int), int flags); void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd) { @@ -483,20 +487,49 @@ void __init mark_linear_text_alias_ro(void) PAGE_KERNEL_RO); } -static bool crash_mem_map __initdata; +#ifdef CONFIG_KEXEC_CORE +static phys_addr_t __init early_crashkernel_pgtable_alloc(int shift) +{ + phys_addr_t phys; + void *ptr; + + phys = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0, + MEMBLOCK_ALLOC_NOLEAKTRACE); + if (!phys) + panic("Failed to allocate page table page\n"); + + ptr = (void *)__phys_to_virt(phys); + memset(ptr, 0, PAGE_SIZE); + return phys; +} -static int __init enable_crash_mem_map(char *arg) +void __init map_crashkernel(void) { - /* - * Proper parameter parsing is done by reserve_crashkernel(). We only - * need to know if the linear map has to avoid block mappings so that - * the crashkernel reservations can be unmapped later. 
- */ - crash_mem_map = true; + phys_addr_t start, end, size; - return 0; + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) + return; + + if (!crashk_res.end) + return; + + start = crashk_res.start & PAGE_MASK; + end = PAGE_ALIGN(crashk_res.end); + size = end - start; + + unmap_hotplug_range(__phys_to_virt(start), __phys_to_virt(end), false, + NULL, PAGE_KERNEL, early_crashkernel_pgtable_alloc, + NO_EXEC_MAPPINGS); + __create_pgd_mapping(swapper_pg_dir, crashk_res.start, + __phys_to_virt(crashk_res.start), + size, PAGE_KERNEL, + early_crashkernel_pgtable_alloc, + NO_EXEC_MAPPINGS | NO_BLOCK_MAPPINGS | + NO_CONT_MAPPINGS); } -early_param("crashkernel", enable_crash_mem_map); +#else +void __init map_crashkernel(void) {} +#endif static void __init map_mem(pgd_t *pgdp) { @@ -527,17 +560,6 @@ static void __init map_mem(pgd_t *pgdp) */ memblock_mark_nomap(kernel_start, kernel_end - kernel_start); -#ifdef CONFIG_KEXEC_CORE - if (crash_mem_map) { - if (IS_ENABLED(CONFIG_ZONE_DMA) || - IS_ENABLED(CONFIG_ZONE_DMA32)) - flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; - else if (crashk_res.end) - memblock_mark_nomap(crashk_res.start, - resource_size(&crashk_res)); - } -#endif - /* map all the memory banks */ for_each_mem_range(i, &start, &end) { if (start >= end) continue; @@ -570,19 +592,6 @@ * in page granularity and put back unused memory to buddy system * through /sys/kernel/kexec_crash_size interface.
*/ -#ifdef CONFIG_KEXEC_CORE - if (crash_mem_map && - !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) { - if (crashk_res.end) { - __map_memblock(pgdp, crashk_res.start, - crashk_res.end + 1, - PAGE_KERNEL, - NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS); - memblock_clear_nomap(crashk_res.start, - resource_size(&crashk_res)); - } - } -#endif } void mark_rodata_ro(void) @@ -827,7 +836,6 @@ int kern_addr_valid(unsigned long addr) return pfn_valid(pte_pfn(pte)); } -#ifdef CONFIG_MEMORY_HOTPLUG static void free_hotplug_page_range(struct page *page, size_t size, struct vmem_altmap *altmap) { @@ -863,9 +871,25 @@ static bool pgtable_range_aligned(unsigned long start, unsigned long end, return true; } +static void pte_clear_cont(pte_t *ptep) +{ + int i = 0; + pte_t pte = READ_ONCE(*ptep); + if (pte_none(pte) || !pte_cont(pte)) + return; + ptep -= ((u64)ptep / sizeof(pte_t)) & + ((1 << CONFIG_ARM64_CONT_PTE_SHIFT) - 1); + do { + pte = pte_mknoncont(READ_ONCE(*ptep)); + set_pte(ptep, pte); + } while (++ptep, ++i < CONT_PTES); +} + static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, unsigned long end, bool free_mapped, - struct vmem_altmap *altmap) + struct vmem_altmap *altmap, pgprot_t prot, + phys_addr_t (*pgtable_alloc)(int), + int flags) { pte_t *ptep, pte; @@ -876,6 +900,8 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, continue; WARN_ON(!pte_present(pte)); + + pte_clear_cont(ptep); pte_clear(&init_mm, addr, ptep); flush_tlb_kernel_range(addr, addr + PAGE_SIZE); if (free_mapped) @@ -884,9 +910,26 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, } while (addr += PAGE_SIZE, addr < end); } +static void pmd_clear_cont(pmd_t *pmdp) +{ + int i = 0; + pmd_t pmd = READ_ONCE(*pmdp); + if (pmd_none(pmd) || !pmd_sect(pmd) || !pmd_cont(pmd)) + return; + pmdp -= ((u64)pmdp / sizeof(pmd_t)) & + ((1 << CONFIG_ARM64_CONT_PMD_SHIFT) - 1); + do { + pmd = READ_ONCE(*pmdp); + pmd = pte_pmd(pte_mknoncont(pmd_pte(pmd))); 
+ set_pmd(pmdp, pmd); + } while (++pmdp, ++i < CONT_PMDS); +} + static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr, unsigned long end, bool free_mapped, - struct vmem_altmap *altmap) + struct vmem_altmap *altmap, pgprot_t prot, + phys_addr_t (*pgtable_alloc)(int), + int flags) { unsigned long next; pmd_t *pmdp, pmd; @@ -900,6 +943,8 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr, WARN_ON(!pmd_present(pmd)); if (pmd_sect(pmd)) { + //clear CONT flags + pmd_clear_cont(pmdp); pmd_clear(pmdp); /* @@ -907,19 +952,36 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr, * range is mapped with a single block entry. */ flush_tlb_kernel_range(addr, addr + PAGE_SIZE); + + if (addr & ~PMD_MASK) + alloc_init_cont_pte(pmdp, addr & PMD_MASK, + addr, __virt_to_phys(addr & + PMD_MASK), prot, + pgtable_alloc, flags); + + if (next & ~PMD_MASK) + alloc_init_cont_pte(pmdp, next, ALIGN(next, + PMD_SIZE), + __virt_to_phys(next), + prot, pgtable_alloc, + flags); + if (free_mapped) free_hotplug_page_range(pmd_page(pmd), PMD_SIZE, altmap); continue; } WARN_ON(!pmd_table(pmd)); - unmap_hotplug_pte_range(pmdp, addr, next, free_mapped, altmap); + unmap_hotplug_pte_range(pmdp, addr, next, free_mapped, altmap, + prot, pgtable_alloc, flags); } while (addr = next, addr < end); } static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr, unsigned long end, bool free_mapped, - struct vmem_altmap *altmap) + struct vmem_altmap *altmap, pgprot_t prot, + phys_addr_t (*pgtable_alloc)(int), + int flags) { unsigned long next; pud_t *pudp, pud; @@ -940,19 +1002,36 @@ static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr, * range is mapped with a single block entry. 
*/ flush_tlb_kernel_range(addr, addr + PAGE_SIZE); + + if (addr & (~PUD_MASK)) + alloc_init_cont_pmd(pudp, addr & PUD_MASK, + addr, __virt_to_phys(addr & + PUD_MASK), prot, + pgtable_alloc, flags); + + if (next & (~PUD_MASK)) + alloc_init_cont_pmd(pudp, next, + ALIGN(next, PUD_SIZE), + __virt_to_phys(next), + prot, pgtable_alloc, + flags); + if (free_mapped) free_hotplug_page_range(pud_page(pud), PUD_SIZE, altmap); continue; } WARN_ON(!pud_table(pud)); - unmap_hotplug_pmd_range(pudp, addr, next, free_mapped, altmap); + unmap_hotplug_pmd_range(pudp, addr, next, free_mapped, altmap, + prot, pgtable_alloc, flags); } while (addr = next, addr < end); } static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr, unsigned long end, bool free_mapped, - struct vmem_altmap *altmap) + struct vmem_altmap *altmap, pgprot_t prot, + phys_addr_t (*pgtable_alloc)(int), + int flags) { unsigned long next; p4d_t *p4dp, p4d; @@ -965,12 +1044,15 @@ static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr, continue; WARN_ON(!p4d_present(p4d)); - unmap_hotplug_pud_range(p4dp, addr, next, free_mapped, altmap); + unmap_hotplug_pud_range(p4dp, addr, next, free_mapped, altmap, + prot, pgtable_alloc, flags); } while (addr = next, addr < end); } static void unmap_hotplug_range(unsigned long addr, unsigned long end, - bool free_mapped, struct vmem_altmap *altmap) + bool free_mapped, struct vmem_altmap *altmap, + pgprot_t prot, + phys_addr_t (*pgtable_alloc)(int), int flags) { unsigned long next; pgd_t *pgdp, pgd; @@ -991,7 +1073,8 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end, continue; WARN_ON(!pgd_present(pgd)); - unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped, altmap); + unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped, altmap, + prot, pgtable_alloc, flags); } while (addr = next, addr < end); } @@ -1148,7 +1231,6 @@ static void free_empty_tables(unsigned long addr, unsigned long end, free_empty_p4d_table(pgdp, addr, next, floor, 
ceiling); } while (addr = next, addr < end); } -#endif #if !ARM64_KERNEL_USES_PMD_MAPS int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, @@ -1210,7 +1292,7 @@ void vmemmap_free(unsigned long start, unsigned long end, { WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END)); - unmap_hotplug_range(start, end, true, altmap); + unmap_hotplug_range(start, end, true, altmap, __pgprot(0), NULL, 0); free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END); } #endif /* CONFIG_MEMORY_HOTPLUG */ @@ -1474,7 +1556,7 @@ static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size) WARN_ON(pgdir != init_mm.pgd); WARN_ON((start < PAGE_OFFSET) || (end > PAGE_END)); - unmap_hotplug_range(start, end, false, NULL); + unmap_hotplug_range(start, end, false, NULL, __pgprot(0), NULL, 0); free_empty_tables(start, end, PAGE_OFFSET, PAGE_END); } -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-02 15:57 ` Guanghui Feng @ 2022-07-04 10:35 ` Will Deacon -1 siblings, 0 replies; 79+ messages in thread From: Will Deacon @ 2022-07-04 10:35 UTC (permalink / raw) To: Guanghui Feng Cc: baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo, alikernel-developer On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote: > The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone > (enable crashkernel, disable rodata full, disable kfence), the mem_map will > use non block/section mapping(for crashkernel requires to shrink the region > in page granularity). But it will degrade performance when doing larging > continuous mem access in kernel(memcpy/memmove, etc). Hmm. It seems a bit silly to me that we take special care to unmap the crashkernel from the linear map even when can_set_direct_map() is false, as we won't be protecting the main kernel at all! Why don't we just leave the crashkernel mapped if !can_set_direct_map() and then this problem just goes away? Will ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-04 10:35 ` Will Deacon @ 2022-07-04 10:58 ` guanghui.fgh -1 siblings, 0 replies; 79+ messages in thread From: guanghui.fgh @ 2022-07-04 10:58 UTC (permalink / raw) To: Will Deacon Cc: baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo, alikernel-developer On 2022/7/4 18:35, Will Deacon wrote: > On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote: >> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone >> (enable crashkernel, disable rodata full, disable kfence), the mem_map will >> use non block/section mapping(for crashkernel requires to shrink the region >> in page granularity). But it will degrade performance when doing larging >> continuous mem access in kernel(memcpy/memmove, etc). > > Hmm. It seems a bit silly to me that we take special care to unmap the > crashkernel from the linear map even when can_set_direct_map() is false, as > we won't be protecting the main kernel at all! > > Why don't we just leave the crashkernel mapped if !can_set_direct_map() > and then this problem just goes away? > > Will This question was asked last week. 1. Quoting arch/arm64/mm/init.c: "Memory reservation for crash kernel either done early or deferred depending on DMA memory zones configs (ZONE_DMA) -- In absence of ZONE_DMA configs arm64_dma_phys_limit initialized here instead of max_zone_phys(). This lets early reservation of crash kernel memory which has a dependency on arm64_dma_phys_limit. Reserving memory early for crash kernel allows linear creation of block mappings (greater than page-granularity) for all the memory bank ranges. In this scheme a comparatively quicker boot is observed.
If ZONE_DMA configs are defined, crash kernel memory reservation is delayed until DMA zone memory range size initialization performed in zone_sizes_init(). The defer is necessary to steer clear of DMA zone memory range to avoid overlap allocation. [[[ So crash kernel memory boundaries are not known when mapping all bank memory ranges, which otherwise means not possible to exclude crash kernel range from creating block mappings so page-granularity mappings are created for the entire memory range. ]]]" Namely, the init order: memblock init--->linear mem mapping(4k mapping for crashkernel, requirinig page-granularity changing))--->zone dma limit--->reserve crashkernel. So when enable ZONE DMA and using crashkernel, the mem mapping using 4k mapping. 2.As mentioned above, when linear mem use 4k mapping simply, there is high dtlb miss(degrade performance). This patch use block/section mapping as far as possible with performance improvement. 3.This patch reserve crashkernel as same as the history(ZONE DMA & crashkernel reserving order), and only change the linear mem mapping to block/section mapping. Init order: memblock init--->linear mem mapping(block/section mapping for linear mem mapping))--->zone dma limit--->reserve crashkernel--->[[[only]]] rebuild 4k pagesize mapping for crashkernel mem With this method, there will use block/section mapping as far as possible. ^ permalink raw reply [flat|nested] 79+ messages in thread
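The final rebuild step in that init order only touches the block mappings that overlap the reserved range. The arithmetic can be sketched in user space as follows; the 2M section size assumes 4K base pages, and the helper name is illustrative, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_SIZE (2ULL << 20)          /* 2 MiB blocks with 4K pages */
#define SECTION_MASK (~(SECTION_SIZE - 1))

/* Widen the reserved crashkernel range [start, end) to the enclosing
 * 2M section boundaries. Block entries inside this span are rebuilt at
 * 4K page granularity; everything outside keeps block/section mappings. */
static void split_span(uint64_t start, uint64_t end,
		       uint64_t *split_start, uint64_t *split_end)
{
	*split_start = start & SECTION_MASK;
	*split_end   = (end + SECTION_SIZE - 1) & SECTION_MASK;
}
```

For example, a reservation at 0x60001000..0x68001000 would require splitting the block mappings covering 0x60000000..0x68200000 only.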
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation
  2022-07-04 10:58 ` guanghui.fgh
@ 2022-07-04 11:14 ` Will Deacon
  -1 siblings, 0 replies; 79+ messages in thread
From: Will Deacon @ 2022-07-04 11:14 UTC (permalink / raw)
To: guanghui.fgh
Cc: baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo, alikernel-developer

On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote:
> On 2022/7/4 18:35, Will Deacon wrote:
> > Hmm. It seems a bit silly to me that we take special care to unmap the
> > crashkernel from the linear map even when can_set_direct_map() is false,
> > as we won't be protecting the main kernel at all!
> >
> > Why don't we just leave the crashkernel mapped if !can_set_direct_map()
> > and then this problem just goes away?
>
> This question was asked last week.

Sorry, I didn't spot that. Please could you link me to the conversation, as
I'm still unable to find it in my inbox?

> 1. Quoting the comment in arch/arm64/mm/init.c:
> [...]
> Namely, the init order is: memblock init ---> linear memory mapping (4K
> mappings when a crashkernel is requested, since page-granularity changes
> are required later) ---> ZONE_DMA limit ---> reserve crashkernel. So with
> ZONE_DMA enabled and a crashkernel in use, the linear map uses 4K mappings.

Yes, I understand that is how things work today, but I'm saying that we may
as well leave the crashkernel mapped (at block granularity) if
!can_set_direct_map(), and then I think your patch becomes a lot simpler.

Will

^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation
  2022-07-04 11:14 ` Will Deacon
@ 2022-07-04 12:05 ` guanghui.fgh
  -1 siblings, 0 replies; 79+ messages in thread
From: guanghui.fgh @ 2022-07-04 12:05 UTC (permalink / raw)
To: Will Deacon
Cc: baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo, alikernel-developer

On 2022/7/4 19:14, Will Deacon wrote:
> Sorry, I didn't spot that. Please could you link me to the conversation,
> as I'm still unable to find it in my inbox?

Please see this link:
https://lore.kernel.org/linux-arm-kernel/075b0a8e-cb7e-70f6-b45a-54cd31886794@linux.alibaba.com/T/

> Yes, I understand that is how things work today, but I'm saying that we
> may as well leave the crashkernel mapped (at block granularity) if
> !can_set_direct_map(), and then I think your patch becomes a lot simpler.

But page-granularity mappings are necessary for the crash kernel memory
range, for shrinking its size via the /sys/kernel/kexec_crash_size
interface (quoted from arch/arm64/mm/init.c). So this patch splits the
block/section mappings into 4K page-granularity mappings for the
crashkernel memory.

Thanks.

^ permalink raw reply [flat|nested] 79+ messages in thread
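The shrink path referenced here frees the tail of the reservation back to the page allocator page by page, which is why the linear map must cover that tail at page granularity. A minimal user-space model of the arithmetic, assuming page-aligned sizes (the kernel-side work happens in crash_shrink_memory()/crash_free_reserved_phys_range(); the helper below is illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* Model of shrinking via /sys/kernel/kexec_crash_size: writing a smaller
 * value releases the tail [new_size, old_size) of the reservation one 4K
 * page at a time. Returns the number of pages that would be freed.
 * Assumes both sizes are page-aligned. */
static uint64_t pages_freed_by_shrink(uint64_t old_size, uint64_t new_size)
{
	if (new_size >= old_size)
		return 0; /* growing is not supported; no-op here */
	return (old_size - new_size) / PAGE_SIZE;
}
```

Shrinking a 512M reservation to 256M, for instance, hands 65536 individual 4K pages back to the allocator, each of which needs its own page-granularity linear-map entry.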
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation
  2022-07-04 12:05 ` guanghui.fgh
@ 2022-07-04 13:15 ` Will Deacon
  -1 siblings, 0 replies; 79+ messages in thread
From: Will Deacon @ 2022-07-04 13:15 UTC (permalink / raw)
To: guanghui.fgh
Cc: baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo, alikernel-developer

On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote:
> Please see this link:
> https://lore.kernel.org/linux-arm-kernel/075b0a8e-cb7e-70f6-b45a-54cd31886794@linux.alibaba.com/T/

Sorry, but I read through the thread and I still can't find where the
possibility of leaving the crashkernel mapped was discussed.

> But page-granularity mappings are necessary for the crash kernel memory
> range, for shrinking its size via the /sys/kernel/kexec_crash_size
> interface (quoted from arch/arm64/mm/init.c). So this patch splits the
> block/section mappings into 4K page-granularity mappings for the
> crashkernel memory.

Why? I don't see why the mapping granularity is relevant at all if we
always leave the whole thing mapped.

Will

^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation
  2022-07-04 13:15 ` Will Deacon
@ 2022-07-04 13:41 ` guanghui.fgh
  -1 siblings, 0 replies; 79+ messages in thread
From: guanghui.fgh @ 2022-07-04 13:41 UTC (permalink / raw)
To: Will Deacon
Cc: baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo, alikernel-developer

Thanks.

On 2022/7/4 21:15, Will Deacon wrote:
> Why? I don't see why the mapping granularity is relevant at all if we
> always leave the whole thing mapped.

I found commit 06a7f711246b081afc21fff859f1003f1f2a0fbc, which added
/sys/kernel/kexec_crash_size for changing the crashkernel memory size:
"Implement shrinking the reserved memory for crash kernel, if it is more
than enough."

Maybe we could use block/section mappings for the crashkernel memory, and
split only the affected part of the crashkernel block/section mappings
when shrinking (on a write to /sys/kernel/kexec_crash_size, handled by
crash_shrink_memory()/crash_free_reserved_phys_range()). Then there would
be no need to split all of the crashkernel block/section mappings at boot
time.

Thanks.

^ permalink raw reply [flat|nested] 79+ messages in thread
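The lazy alternative floated in this message can be sketched as follows. This is a hedged user-space illustration, not kernel code: it assumes the reservation starts 2M-aligned and only identifies which single 2M block would straddle the new end of the reservation after a shrink, which is the only one that would need splitting.

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_SIZE (2ULL << 20)          /* 2 MiB blocks with 4K pages */
#define SECTION_MASK (~(SECTION_SIZE - 1))

/* Lazy-split sketch: keep the crashkernel in block mappings at boot; on a
 * shrink, only the 2M block containing the new end of the reservation
 * needs to be rebuilt at page granularity. Returns the base address of
 * that block, or 0 when the new end is already 2M-aligned and no split
 * is needed. Illustrative helper, not a kernel function. */
static uint64_t block_to_split(uint64_t start, uint64_t new_size)
{
	uint64_t new_end = start + new_size;

	if ((new_end & ~SECTION_MASK) == 0)
		return 0; /* shrink lands exactly on a block boundary */
	return new_end & SECTION_MASK;
}
```

Shrinking to a 2M multiple would then require no splitting at all, and an unaligned shrink would require splitting exactly one block.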
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-04 13:15 ` Will Deacon @ 2022-07-04 14:11 ` guanghui.fgh -1 siblings, 0 replies; 79+ messages in thread From: guanghui.fgh @ 2022-07-04 14:11 UTC (permalink / raw) To: Will Deacon Cc: baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo, alikernel-developer 在 2022/7/4 21:15, Will Deacon 写道: > On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote: >> >> >> 在 2022/7/4 19:14, Will Deacon 写道: >>> On Mon, Jul 04, 2022 at 06:58:20PM +0800, guanghui.fgh wrote: >>>> >>>> >>>> 在 2022/7/4 18:35, Will Deacon 写道: >>>>> On Sat, Jul 02, 2022 at 11:57:53PM +0800, Guanghui Feng wrote: >>>>>> The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone >>>>>> (enable crashkernel, disable rodata full, disable kfence), the mem_map will >>>>>> use non block/section mapping(for crashkernel requires to shrink the region >>>>>> in page granularity). But it will degrade performance when doing larging >>>>>> continuous mem access in kernel(memcpy/memmove, etc). >>>>> >>>>> Hmm. It seems a bit silly to me that we take special care to unmap the >>>>> crashkernel from the linear map even when can_set_direct_map() is false, as >>>>> we won't be protecting the main kernel at all! >>>>> >>>>> Why don't we just leave the crashkernel mapped if !can_set_direct_map() >>>>> and then this problem just goes away? >>>> >>>> This question had been asked lask week. >>> >>> Sorry, I didn't spot that. Please could you link me to the conversation, as >>> I'm still unable to find it in my inbox? 
>> >> Please access this link: >> https://lore.kernel.org/linux-arm-kernel/075b0a8e-cb7e-70f6-b45a-54cd31886794@linux.alibaba.com/T/ > > Sorry, but I read through the thread and I still can't find where the > possibility of leaving the crashkernel mapped was discussed. > >>>> 1.Quoted messages from arch/arm64/mm/init.c >>>> >>>> "Memory reservation for crash kernel either done early or deferred >>>> depending on DMA memory zones configs (ZONE_DMA) -- >>>> >>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized >>>> here instead of max_zone_phys(). This lets early reservation of >>>> crash kernel memory which has a dependency on arm64_dma_phys_limit. >>>> Reserving memory early for crash kernel allows linear creation of block >>>> mappings (greater than page-granularity) for all the memory bank rangs. >>>> In this scheme a comparatively quicker boot is observed. >>>> >>>> If ZONE_DMA configs are defined, crash kernel memory reservation >>>> is delayed until DMA zone memory range size initialization performed in >>>> zone_sizes_init(). The defer is necessary to steer clear of DMA zone >>>> memory range to avoid overlap allocation. >>>> >>>> [[[ >>>> So crash kernel memory boundaries are not known when mapping all bank memory >>>> ranges, which otherwise means not possible to exclude crash kernel range >>>> from creating block mappings so page-granularity mappings are created for >>>> the entire memory range. >>>> ]]]" >>>> >>>> Namely, the init order: memblock init--->linear mem mapping(4k mapping for >>>> crashkernel, requirinig page-granularity changing))--->zone dma >>>> limit--->reserve crashkernel. >>>> So when enable ZONE DMA and using crashkernel, the mem mapping using 4k >>>> mapping. >>> >>> Yes, I understand that is how things work today but I'm saying that we may >>> as well leave the crashkernel mapped (at block granularity) if >>> !can_set_direct_map() and then I think your patch becomes a lot simpler. 
>> >> But page-granularity mappings are necessary for the crash kernel memory range >> so that its size can be shrunk via the /sys/kernel/kexec_crash_size interface (quoted from >> arch/arm64/mm/init.c). >> So this patch splits the block/section mapping into 4k page-granularity mappings for >> the crashkernel memory. > > Why? I don't see why the mapping granularity is relevant at all if we > always leave the whole thing mapped. > > Will There is another reason. When crashkernel loading finishes, do_kexec_load uses arch_kexec_protect_crashkres to invalidate all the page-table entries for the crashkernel memory (protecting the crashkernel memory from access): arch_kexec_protect_crashkres ---> set_memory_valid ---> ... ---> apply_to_pmd_range. In apply_to_pmd_range there is a check: BUG_ON(pud_huge(*pud)). So if the crashkernel uses block/section mappings, this walk will trigger an error. Namely, we need to use non-block/section mappings for the crashkernel memory before this protection step. Thanks. ^ permalink raw reply [flat|nested] 79+ messages in thread
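The failure mode described above can be modelled in userspace. The sketch below is not the kernel's apply_to_pmd_range(); it is an invented miniature showing the shape of the problem: a walker that applies a function per page entry has no page level to descend into when it meets a block (huge) entry, so it must refuse — which is what BUG_ON(pud_huge(*pud)) enforces — until the block is split into page entries.

```c
#include <assert.h>

#define ENTRIES 8

enum kind { ENTRY_PAGE, ENTRY_BLOCK };

struct pte { enum kind kind; int valid; };

/*
 * Model of an apply_to_page_range()-style walk: apply fn to each leaf
 * page entry. A block entry has no page table beneath it, so the walk
 * bails out, standing in for the kernel's BUG_ON on huge entries.
 */
static int apply_to_pages(struct pte *tbl, void (*fn)(struct pte *))
{
    for (int i = 0; i < ENTRIES; i++) {
        if (tbl[i].kind == ENTRY_BLOCK)
            return -1;            /* kernel would BUG_ON() here */
        fn(&tbl[i]);
    }
    return 0;
}

/* The per-page operation: invalidate the entry (a set_memory_valid(..., 0)
 * stand-in used by the protect-crashkernel path). */
static void set_invalid(struct pte *p)
{
    p->valid = 0;
}

/* Splitting replaces the block with page-granularity entries, after
 * which the per-page walk can succeed. */
static void split_block(struct pte *p)
{
    p->kind = ENTRY_PAGE;
}
```

In other words, as long as the crashkernel range may be fed to a per-page walker, it has to be page-mapped first; the alternative discussed in this thread is to skip the walk entirely and leave the block mapping alone.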
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-04 14:11 ` guanghui.fgh @ 2022-07-04 14:23 ` Will Deacon -1 siblings, 0 replies; 79+ messages in thread From: Will Deacon @ 2022-07-04 14:23 UTC (permalink / raw) To: guanghui.fgh Cc: baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo, alikernel-developer On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: > 在 2022/7/4 21:15, Will Deacon 写道: > > On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote: > > > > > 1.Quoted messages from arch/arm64/mm/init.c > > > > > > > > > > "Memory reservation for crash kernel either done early or deferred > > > > > depending on DMA memory zones configs (ZONE_DMA) -- > > > > > > > > > > In absence of ZONE_DMA configs arm64_dma_phys_limit initialized > > > > > here instead of max_zone_phys(). This lets early reservation of > > > > > crash kernel memory which has a dependency on arm64_dma_phys_limit. > > > > > Reserving memory early for crash kernel allows linear creation of block > > > > > mappings (greater than page-granularity) for all the memory bank rangs. > > > > > In this scheme a comparatively quicker boot is observed. > > > > > > > > > > If ZONE_DMA configs are defined, crash kernel memory reservation > > > > > is delayed until DMA zone memory range size initialization performed in > > > > > zone_sizes_init(). The defer is necessary to steer clear of DMA zone > > > > > memory range to avoid overlap allocation. > > > > > > > > > > [[[ > > > > > So crash kernel memory boundaries are not known when mapping all bank memory > > > > > ranges, which otherwise means not possible to exclude crash kernel range > > > > > from creating block mappings so page-granularity mappings are created for > > > > > the entire memory range. 
> > > > > ]]]" > > > > > > > > > > Namely, the init order: memblock init--->linear mem mapping(4k mapping for > > > > > crashkernel, requirinig page-granularity changing))--->zone dma > > > > > limit--->reserve crashkernel. > > > > > So when enable ZONE DMA and using crashkernel, the mem mapping using 4k > > > > > mapping. > > > > > > > > Yes, I understand that is how things work today but I'm saying that we may > > > > as well leave the crashkernel mapped (at block granularity) if > > > > !can_set_direct_map() and then I think your patch becomes a lot simpler. > > > > > > But Page-granularity mapppings are necessary for crash kernel memory range > > > for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from > > > arch/arm64/mm/init.c). > > > So this patch split block/section mapping to 4k page-granularity mapping for > > > crashkernel mem. > > > > Why? I don't see why the mapping granularity is relevant at all if we > > always leave the whole thing mapped. > > > There is another reason. > > When loading crashkernel finish, the do_kexec_load will use > arch_kexec_protect_crashkres to invalid all the pagetable for crashkernel > mem(protect crashkernel mem from access). > > arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range > > In the apply_to_pmd_range, there is a judement: BUG_ON(pud_huge(*pud)). And > if the crashkernel use block/section mapping, there will be some error. > > Namely, it's need to use non block/section mapping for crashkernel mem > before shringking. Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do that if we're leaving the thing mapped, no? Will ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-04 14:23 ` Will Deacon @ 2022-07-04 14:34 ` guanghui.fgh -1 siblings, 0 replies; 79+ messages in thread From: guanghui.fgh @ 2022-07-04 14:34 UTC (permalink / raw) To: Will Deacon Cc: baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo, alikernel-developer Thanks. 在 2022/7/4 22:23, Will Deacon 写道: > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: >> 在 2022/7/4 21:15, Will Deacon 写道: >>> On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote: >>>>>> 1.Quoted messages from arch/arm64/mm/init.c >>>>>> >>>>>> "Memory reservation for crash kernel either done early or deferred >>>>>> depending on DMA memory zones configs (ZONE_DMA) -- >>>>>> >>>>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized >>>>>> here instead of max_zone_phys(). This lets early reservation of >>>>>> crash kernel memory which has a dependency on arm64_dma_phys_limit. >>>>>> Reserving memory early for crash kernel allows linear creation of block >>>>>> mappings (greater than page-granularity) for all the memory bank rangs. >>>>>> In this scheme a comparatively quicker boot is observed. >>>>>> >>>>>> If ZONE_DMA configs are defined, crash kernel memory reservation >>>>>> is delayed until DMA zone memory range size initialization performed in >>>>>> zone_sizes_init(). The defer is necessary to steer clear of DMA zone >>>>>> memory range to avoid overlap allocation. >>>>>> >>>>>> [[[ >>>>>> So crash kernel memory boundaries are not known when mapping all bank memory >>>>>> ranges, which otherwise means not possible to exclude crash kernel range >>>>>> from creating block mappings so page-granularity mappings are created for >>>>>> the entire memory range. 
>>>>>> ]]]" >>>>>> >>>>>> Namely, the init order: memblock init ---> linear mem mapping (4k mapping for >>>>>> crashkernel, requiring page-granularity changes) ---> zone dma >>>>>> limit ---> reserve crashkernel. >>>>>> So when ZONE_DMA is enabled and a crashkernel is used, the mem mapping uses 4k >>>>>> mapping. >>>>> >>>>> Yes, I understand that is how things work today but I'm saying that we may >>>>> as well leave the crashkernel mapped (at block granularity) if >>>>> !can_set_direct_map() and then I think your patch becomes a lot simpler. >>>> >>>> But page-granularity mappings are necessary for the crash kernel memory range >>>> so that its size can be shrunk via the /sys/kernel/kexec_crash_size interface (quoted from >>>> arch/arm64/mm/init.c). >>>> So this patch splits the block/section mapping into 4k page-granularity mappings for >>>> the crashkernel memory. >>> >>> Why? I don't see why the mapping granularity is relevant at all if we >>> always leave the whole thing mapped. >>> >> There is another reason. >> >> When crashkernel loading finishes, do_kexec_load uses >> arch_kexec_protect_crashkres to invalidate all the page-table entries for the >> crashkernel memory (protecting the crashkernel memory from access). >> >> arch_kexec_protect_crashkres ---> set_memory_valid ---> ... ---> apply_to_pmd_range >> >> In apply_to_pmd_range there is a check: BUG_ON(pud_huge(*pud)). So >> if the crashkernel uses block/section mappings, this will trigger an error. >> >> Namely, we need to use non-block/section mappings for the crashkernel memory >> before this protection step. > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do > that if we're leaving the thing mapped, no? > > Will I think we should keep using arch_kexec_[un]protect_crashkres for the crashkernel memory: once the crashkernel page-table entries are invalidated, there is no chance of reading or writing the crashkernel memory by mistake. If we don't use arch_kexec_[un]protect_crashkres to invalidate the crashkernel page-table entries, stray writes to that memory could cause the crashkernel to fail to boot, or the vmcore to be saved incorrectly. Could we change arch_kexec_[un]protect_crashkres to support block/section mappings? (But then we would still need to remap at page granularity when shrinking.) Thanks. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-04 14:34 ` guanghui.fgh @ 2022-07-04 16:38 ` Will Deacon -1 siblings, 0 replies; 79+ messages in thread From: Will Deacon @ 2022-07-04 16:38 UTC (permalink / raw) To: guanghui.fgh Cc: baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo, alikernel-developer On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote: > Thanks. > > 在 2022/7/4 22:23, Will Deacon 写道: > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: > > > 在 2022/7/4 21:15, Will Deacon 写道: > > > > On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote: > > > > > > > 1.Quoted messages from arch/arm64/mm/init.c > > > > > > > > > > > > > > "Memory reservation for crash kernel either done early or deferred > > > > > > > depending on DMA memory zones configs (ZONE_DMA) -- > > > > > > > > > > > > > > In absence of ZONE_DMA configs arm64_dma_phys_limit initialized > > > > > > > here instead of max_zone_phys(). This lets early reservation of > > > > > > > crash kernel memory which has a dependency on arm64_dma_phys_limit. > > > > > > > Reserving memory early for crash kernel allows linear creation of block > > > > > > > mappings (greater than page-granularity) for all the memory bank rangs. > > > > > > > In this scheme a comparatively quicker boot is observed. > > > > > > > > > > > > > > If ZONE_DMA configs are defined, crash kernel memory reservation > > > > > > > is delayed until DMA zone memory range size initialization performed in > > > > > > > zone_sizes_init(). The defer is necessary to steer clear of DMA zone > > > > > > > memory range to avoid overlap allocation. 
> > > > > > > > > > > > > > [[[ > > > > > > > So crash kernel memory boundaries are not known when mapping all bank memory > > > > > > > ranges, which otherwise means not possible to exclude crash kernel range > > > > > > > from creating block mappings so page-granularity mappings are created for > > > > > > > the entire memory range. > > > > > > > ]]]" > > > > > > > > > > > > > > Namely, the init order: memblock init--->linear mem mapping(4k mapping for > > > > > > > crashkernel, requirinig page-granularity changing))--->zone dma > > > > > > > limit--->reserve crashkernel. > > > > > > > So when enable ZONE DMA and using crashkernel, the mem mapping using 4k > > > > > > > mapping. > > > > > > > > > > > > Yes, I understand that is how things work today but I'm saying that we may > > > > > > as well leave the crashkernel mapped (at block granularity) if > > > > > > !can_set_direct_map() and then I think your patch becomes a lot simpler. > > > > > > > > > > But Page-granularity mapppings are necessary for crash kernel memory range > > > > > for shrinking its size via /sys/kernel/kexec_crash_size interfac(Quoted from > > > > > arch/arm64/mm/init.c). > > > > > So this patch split block/section mapping to 4k page-granularity mapping for > > > > > crashkernel mem. > > > > > > > > Why? I don't see why the mapping granularity is relevant at all if we > > > > always leave the whole thing mapped. > > > > > > > There is another reason. > > > > > > When loading crashkernel finish, the do_kexec_load will use > > > arch_kexec_protect_crashkres to invalid all the pagetable for crashkernel > > > mem(protect crashkernel mem from access). > > > > > > arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range > > > > > > In the apply_to_pmd_range, there is a judement: BUG_ON(pud_huge(*pud)). And > > > if the crashkernel use block/section mapping, there will be some error. 
> > > > > > Namely, it's need to use non block/section mapping for crashkernel mem > > > before shringking. > > > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do > > that if we're leaving the thing mapped, no? > > > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem. > > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr > the crashkernel mem by mistake. > > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem > pagetable, there maybe some write operations to these mem by mistake which > may cause crashkernel boot error and vmcore saving error. I don't really buy this line of reasoning. The entire main kernel is writable, so why do we care about protecting the crashkernel so much? The _code_ to launch the crash kernel is writable! If you care about preventing writes to memory which should not be writable, then you should use rodata=full. Will ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-04 16:38 ` Will Deacon @ 2022-07-04 17:09 ` Ard Biesheuvel -1 siblings, 0 replies; 79+ messages in thread From: Ard Biesheuvel @ 2022-07-04 17:09 UTC (permalink / raw) To: Will Deacon Cc: guanghui.fgh, baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, linux-mm, yaohongbo, alikernel-developer On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote: > > On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote: > > Thanks. > > > > 在 2022/7/4 22:23, Will Deacon 写道: > > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: ... > > > > Namely, it's need to use non block/section mapping for crashkernel mem > > > > before shringking. > > > > > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do > > > that if we're leaving the thing mapped, no? > > > > > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem. > > > > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr > > the crashkernel mem by mistake. > > > > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem > > pagetable, there maybe some write operations to these mem by mistake which > > may cause crashkernel boot error and vmcore saving error. > > I don't really buy this line of reasoning. The entire main kernel is > writable, so why do we care about protecting the crashkernel so much? The > _code_ to launch the crash kernel is writable! If you care about preventing > writes to memory which should not be writable, then you should use > rodata=full. > This is not entirely true - the core kernel text and rodata are remapped r/o in the linear map, whereas all module code and rodata are left writable when rodata != full. 
But the conclusion is the same, imo: if you can't be bothered to protect a good chunk of the code and rodata that the kernel relies on, why should the crashkernel be treated any differently? ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-04 17:09 ` Ard Biesheuvel (?) @ 2022-07-05 8:35 ` Baoquan He -1 siblings, 0 replies; 79+ messages in thread From: Baoquan He @ 2022-07-05 8:35 UTC (permalink / raw) To: Ard Biesheuvel Cc: Will Deacon, kexec, guanghui.fgh, baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, linux-mm, yaohongbo, alikernel-developer On 07/04/22 at 07:09pm, Ard Biesheuvel wrote: > On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote: > > > > On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote: > > > Thanks. > > > > > > 在 2022/7/4 22:23, Will Deacon 写道: > > > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: > ... > > > > > Namely, it's need to use non block/section mapping for crashkernel mem > > > > > before shringking. > > > > > > > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do > > > > that if we're leaving the thing mapped, no? > > > > > > > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem. > > > > > > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr > > > the crashkernel mem by mistake. > > > > > > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem > > > pagetable, there maybe some write operations to these mem by mistake which > > > may cause crashkernel boot error and vmcore saving error. > > > > I don't really buy this line of reasoning. The entire main kernel is > > writable, so why do we care about protecting the crashkernel so much? The > > _code_ to launch the crash kernel is writable! If you care about preventing > > writes to memory which should not be writable, then you should use > > rodata=full. 
> > > > This is not entirely true - the core kernel text and rodata are > remapped r/o in the linear map, whereas all module code and rodata are > left writable when rodata != full. > > But the conclusion is the same, imo: if you can't be bothered to > protect a good chunk of the code and rodata that the kernel relies on, > why should the crashkernel be treated any differently? Kernel text and rodata being remapped r/o in the linear map while module code and rodata are left writable is a different matter from the crashkernel region being mapped r/o. If it were technically feasible to remap module code and rodata r/o, and stomping on those regions would corrupt the entire system, we should do that too. However, kdump is a system error diagnosing mechanism which is very important and helpful on servers and in some application scenarios, e.g. cloud. Stomping on the crashkernel region makes it useless. I am not against removing arch_kexec_[un]protect_crashkres on arm64. It is a trade-off: protecting the crashkernel region causes severe performance degradation, which is always felt since we usually don't specify rodata=full or enable kfence. Dropping the protection of the crashkernel region improves performance very much, but wrong code may stomp on the crashkernel region and break kdump. That might happen one time in a million; once it happens, it's a nightmare for kernel devs. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-04 17:09 ` Ard Biesheuvel @ 2022-07-05 9:52 ` Will Deacon -1 siblings, 0 replies; 79+ messages in thread From: Will Deacon @ 2022-07-05 9:52 UTC (permalink / raw) To: Ard Biesheuvel Cc: guanghui.fgh, baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, linux-mm, yaohongbo, alikernel-developer On Mon, Jul 04, 2022 at 07:09:23PM +0200, Ard Biesheuvel wrote: > On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote: > > > > On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote: > > > Thanks. > > > > > > 在 2022/7/4 22:23, Will Deacon 写道: > > > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: > ... > > > > > Namely, it's need to use non block/section mapping for crashkernel mem > > > > > before shringking. > > > > > > > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do > > > > that if we're leaving the thing mapped, no? > > > > > > > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem. > > > > > > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr > > > the crashkernel mem by mistake. > > > > > > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem > > > pagetable, there maybe some write operations to these mem by mistake which > > > may cause crashkernel boot error and vmcore saving error. > > > > I don't really buy this line of reasoning. The entire main kernel is > > writable, so why do we care about protecting the crashkernel so much? The > > _code_ to launch the crash kernel is writable! If you care about preventing > > writes to memory which should not be writable, then you should use > > rodata=full. 
> > > > This is not entirely true - the core kernel text and rodata are > remapped r/o in the linear map, whereas all module code and rodata are > left writable when rodata != full. Yes, sorry, you're quite right. The kernel text is only writable if rodata=off. But I still think it makes sense to protect the crashkernel only if rodata=full (which is the default on arm64) as this allows us to rely on page mappings and I think fits well with what we do for modules. > But the conclusion is the same, imo: if you can't be bothered to > protect a good chunk of the code and rodata that the kernel relies on, > why should the crashkernel be treated any differently? Thanks. Will ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-05 9:52 ` Will Deacon @ 2022-07-05 12:07 ` guanghui.fgh -1 siblings, 0 replies; 79+ messages in thread From: guanghui.fgh @ 2022-07-05 12:07 UTC (permalink / raw) To: Will Deacon, Ard Biesheuvel Cc: baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, linux-mm, yaohongbo, alikernel-developer 在 2022/7/5 17:52, Will Deacon 写道: > On Mon, Jul 04, 2022 at 07:09:23PM +0200, Ard Biesheuvel wrote: >> On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote: >>> >>> On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote: >>>> Thanks. >>>> >>>> 在 2022/7/4 22:23, Will Deacon 写道: >>>>> On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: >> ... >>>>>> Namely, it's need to use non block/section mapping for crashkernel mem >>>>>> before shringking. >>>>> >>>>> Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do >>>>> that if we're leaving the thing mapped, no? >>>>> >>>> I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem. >>>> >>>> Because when invalid crashkernel mem pagetable, there is no chance to rd/wr >>>> the crashkernel mem by mistake. >>>> >>>> If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem >>>> pagetable, there maybe some write operations to these mem by mistake which >>>> may cause crashkernel boot error and vmcore saving error. >>> >>> I don't really buy this line of reasoning. The entire main kernel is >>> writable, so why do we care about protecting the crashkernel so much? The >>> _code_ to launch the crash kernel is writable! If you care about preventing >>> writes to memory which should not be writable, then you should use >>> rodata=full. 
>>> >> >> This is not entirely true - the core kernel text and rodata are >> remapped r/o in the linear map, whereas all module code and rodata are >> left writable when rodata != full. > > Yes, sorry, you're quite right. The kernel text is only writable if > rodata=off. > > But I still think it makes sense to protect the crashkernel only if > rodata=full (which is the default on arm64) as this allows us to rely > on page mappings and I think fits well with what we do for modules. > >> But the conclusion is the same, imo: if you can't be bothered to >> protect a good chunk of the code and rodata that the kernel relies on, >> why should the crashkernel be treated any differently? > > Thanks. > > Will Thanks. 1. rodata=full harms performance and has been disabled in-house. 2. When using crashkernel with rodata != full, the kernel still falls back to non-block/section mappings, which causes a high d-TLB miss rate and degrades performance greatly. This patch fixes that so block/section mappings are used as far as possible: bool can_set_direct_map(void) { return rodata_full || debug_pagealloc_enabled(); } map_mem: if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; 3. When rodata=full is disabled, the crashkernel still needs protection (keep using arch_kexec_[un]protect_crashkres). I think crashkernel shouldn't depend on rodata=full (maybe other architectures don't support rodata=full yet). Thanks. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-05 12:07 ` guanghui.fgh @ 2022-07-05 12:11 ` Will Deacon -1 siblings, 0 replies; 79+ messages in thread From: Will Deacon @ 2022-07-05 12:11 UTC (permalink / raw) To: guanghui.fgh Cc: Ard Biesheuvel, baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, linux-mm, yaohongbo, alikernel-developer On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote: > > > 在 2022/7/5 17:52, Will Deacon 写道: > > On Mon, Jul 04, 2022 at 07:09:23PM +0200, Ard Biesheuvel wrote: > > > On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote: > > > > > > > > On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote: > > > > > Thanks. > > > > > > > > > > 在 2022/7/4 22:23, Will Deacon 写道: > > > > > > On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: > > > ... > > > > > > > Namely, it's need to use non block/section mapping for crashkernel mem > > > > > > > before shringking. > > > > > > > > > > > > Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do > > > > > > that if we're leaving the thing mapped, no? > > > > > > > > > > > I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem. > > > > > > > > > > Because when invalid crashkernel mem pagetable, there is no chance to rd/wr > > > > > the crashkernel mem by mistake. > > > > > > > > > > If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem > > > > > pagetable, there maybe some write operations to these mem by mistake which > > > > > may cause crashkernel boot error and vmcore saving error. > > > > > > > > I don't really buy this line of reasoning. The entire main kernel is > > > > writable, so why do we care about protecting the crashkernel so much? The > > > > _code_ to launch the crash kernel is writable! 
If you care about preventing > > > > writes to memory which should not be writable, then you should use > > > > rodata=full. > > > > > > > > > > This is not entirely true - the core kernel text and rodata are > > > remapped r/o in the linear map, whereas all module code and rodata are > > > left writable when rodata != full. > > > > Yes, sorry, you're quite right. The kernel text is only writable if > > rodata=off. > > > > But I still think it makes sense to protect the crashkernel only if > > rodata=full (which is the default on arm64) as this allows us to rely > > on page mappings and I think fits well with what we do for modules. > > > > > But the conclusion is the same, imo: if you can't be bothered to > > > protect a good chunk of the code and rodata that the kernel relies on, > > > why should the crashkernel be treated any differently? > > > > Thanks. > > > > Will > Thanks. > > 1.The rodata full is harm to the performance and has been disabled in-house. > > 2.When using crashkernel with rodata non full, the kernel also will use non > block/section mapping which cause high d-TLB miss and degrade performance > greatly. > This patch fix it to use block/section mapping as far as possible. > > bool can_set_direct_map(void) > { > return rodata_full || debug_pagealloc_enabled(); > } > > map_mem: > if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; > > 3.When rodata full is disabled, crashkernel also need protect(keep > arch_kexec_[un]protect_crashkres using). > I think crashkernel should't depend on radata full(Maybe other architecture > don't support radata full now). I think this is going round in circles :/ As a first step, can we please leave the crashkernel mapped unless rodata=full? It should be a much simpler patch to write, review and maintain and it gives you the performance you want when crashkernel is being used. Will ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-05 12:11 ` Will Deacon @ 2022-07-05 12:27 ` guanghui.fgh -1 siblings, 0 replies; 79+ messages in thread From: guanghui.fgh @ 2022-07-05 12:27 UTC (permalink / raw) To: Will Deacon Cc: Ard Biesheuvel, baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, linux-mm, yaohongbo, alikernel-developer On 2022/7/5 20:11, Will Deacon wrote: > On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote: >> >> >> On 2022/7/5 17:52, Will Deacon wrote: >>> On Mon, Jul 04, 2022 at 07:09:23PM +0200, Ard Biesheuvel wrote: >>>> On Mon, 4 Jul 2022 at 18:38, Will Deacon <will@kernel.org> wrote: >>>>> >>>>> On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote: >>>>>> Thanks. >>>>>> >>>>>> On 2022/7/4 22:23, Will Deacon wrote: >>>>>>> On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: >>>> ... >>>>>>>> Namely, it's need to use non block/section mapping for crashkernel mem >>>>>>>> before shringking. >>>>>>> >>>>>>> Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do >>>>>>> that if we're leaving the thing mapped, no? >>>>>>> >>>>>> I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem. >>>>>> >>>>>> Because when invalid crashkernel mem pagetable, there is no chance to rd/wr >>>>>> the crashkernel mem by mistake. >>>>>> >>>>>> If we don't use arch_kexec_[un]protect_crashkres to invalid crashkernel mem >>>>>> pagetable, there maybe some write operations to these mem by mistake which >>>>>> may cause crashkernel boot error and vmcore saving error. >>>>> >>>>> I don't really buy this line of reasoning. The entire main kernel is >>>>> writable, so why do we care about protecting the crashkernel so much? The >>>>> _code_ to launch the crash kernel is writable!
If you care about preventing >>>>> writes to memory which should not be writable, then you should use >>>>> rodata=full. >>>>> >>>> >>>> This is not entirely true - the core kernel text and rodata are >>>> remapped r/o in the linear map, whereas all module code and rodata are >>>> left writable when rodata != full. >>> >>> Yes, sorry, you're quite right. The kernel text is only writable if >>> rodata=off. >>> >>> But I still think it makes sense to protect the crashkernel only if >>> rodata=full (which is the default on arm64) as this allows us to rely >>> on page mappings and I think fits well with what we do for modules. >>> >>>> But the conclusion is the same, imo: if you can't be bothered to >>>> protect a good chunk of the code and rodata that the kernel relies on, >>>> why should the crashkernel be treated any differently? >>> >>> Thanks. >>> >>> Will >> Thanks. >> >> 1.The rodata full is harm to the performance and has been disabled in-house. >> >> 2.When using crashkernel with rodata non full, the kernel also will use non >> block/section mapping which cause high d-TLB miss and degrade performance >> greatly. >> This patch fix it to use block/section mapping as far as possible. >> >> bool can_set_direct_map(void) >> { >> return rodata_full || debug_pagealloc_enabled(); >> } >> >> map_mem: >> if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) >> flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; >> >> 3.When rodata full is disabled, crashkernel also need protect(keep >> arch_kexec_[un]protect_crashkres using). >> I think crashkernel should't depend on radata full(Maybe other architecture >> don't support radata full now). > > I think this is going round in circles :/ > > As a first step, can we please leave the crashkernel mapped unless > rodata=full? It should be a much simpler patch to write, review and maintain > and it gives you the performance you want when crashkernel is being used. > > Will Thanks. There is a circle. 
1. When rodata is not full, calling arch_kexec_[un]protect_crashkres() currently triggers errors (BUG_ON(pud_huge(*pud))). So the crashkernel memory still needs non-block/section mappings. 2. In other words, maybe we should change arch_kexec_[un]protect_crashkres to support block/section mappings, which would let the crashkernel keep block/section mappings. But when we shrink the crashkernel memory, we also need to split some block/section mappings (part of the memory stays with the crashkernel, the rest goes back to the normal kernel). As a result, building the crashkernel memory with non-block/section mappings may be the appropriate choice (as this patch does). ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-05 12:11 ` Will Deacon @ 2022-07-05 12:56 ` Mike Rapoport -1 siblings, 0 replies; 79+ messages in thread From: Mike Rapoport @ 2022-07-05 12:56 UTC (permalink / raw) To: Will Deacon Cc: guanghui.fgh, Ard Biesheuvel, baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer On Tue, Jul 05, 2022 at 01:11:16PM +0100, Will Deacon wrote: > On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote: > > > > 1.The rodata full is harm to the performance and has been disabled in-house. > > > > 2.When using crashkernel with rodata non full, the kernel also will use non > > block/section mapping which cause high d-TLB miss and degrade performance > > greatly. > > This patch fix it to use block/section mapping as far as possible. > > > > bool can_set_direct_map(void) > > { > > return rodata_full || debug_pagealloc_enabled(); > > } > > > > map_mem: > > if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > > flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; > > > > 3.When rodata full is disabled, crashkernel also need protect(keep > > arch_kexec_[un]protect_crashkres using). > > I think crashkernel should't depend on radata full(Maybe other architecture > > don't support radata full now). > > I think this is going round in circles :/ > > As a first step, can we please leave the crashkernel mapped unless > rodata=full? It should be a much simpler patch to write, review and maintain > and it gives you the performance you want when crashkernel is being used. Since we are talking about large systems, what do you think about letting them set CRASH_ALIGN to PUD_SIZE, then unmap(crashkernel); __create_pgd_mapping(crashkernel, NO_BLOCK_MAPPINGS); should be enough to make crash kernel mapped with base pages. 
> Will -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-05 12:56 ` Mike Rapoport @ 2022-07-05 13:17 ` guanghui.fgh -1 siblings, 0 replies; 79+ messages in thread From: guanghui.fgh @ 2022-07-05 13:17 UTC (permalink / raw) To: Mike Rapoport, Will Deacon Cc: Ard Biesheuvel, baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer On 2022/7/5 20:56, Mike Rapoport wrote: > On Tue, Jul 05, 2022 at 01:11:16PM +0100, Will Deacon wrote: >> On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote: >>> >>> 1.The rodata full is harm to the performance and has been disabled in-house. >>> >>> 2.When using crashkernel with rodata non full, the kernel also will use non >>> block/section mapping which cause high d-TLB miss and degrade performance >>> greatly. >>> This patch fix it to use block/section mapping as far as possible. >>> >>> bool can_set_direct_map(void) >>> { >>> return rodata_full || debug_pagealloc_enabled(); >>> } >>> >>> map_mem: >>> if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) >>> flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; >>> >>> 3.When rodata full is disabled, crashkernel also need protect(keep >>> arch_kexec_[un]protect_crashkres using). >>> I think crashkernel should't depend on radata full(Maybe other architecture >>> don't support radata full now). >> >> I think this is going round in circles :/ >> >> As a first step, can we please leave the crashkernel mapped unless >> rodata=full? It should be a much simpler patch to write, review and maintain >> and it gives you the performance you want when crashkernel is being used.
> > Since we are talking about large systems, what do you think about letting > them set CRASH_ALIGN to PUD_SIZE, then > > unmap(crashkernel); > __create_pgd_mapping(crashkernel, NO_BLOCK_MAPPINGS); > > should be enough to make crash kernel mapped with base pages. > >> Will > Thanks. 1. When the kernel boots with a crashkernel, the crashkernel parameter format is: 0M-2G:0M,2G-256G:256M,256G-1024G:320M,1024G-:384M which self-adapts to different systems. 2. As mentioned above, the crashkernel memory size may be less than PUD_SIZE (not a multiple of PUD_SIZE). So maybe we still need to use some non-block/section mappings. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-05 12:11 ` Will Deacon @ 2022-07-05 15:02 ` Mike Rapoport -1 siblings, 0 replies; 79+ messages in thread From: Mike Rapoport @ 2022-07-05 15:02 UTC (permalink / raw) To: Will Deacon Cc: guanghui.fgh, Ard Biesheuvel, baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer On Tue, Jul 05, 2022 at 01:11:16PM +0100, Will Deacon wrote: > On Tue, Jul 05, 2022 at 08:07:07PM +0800, guanghui.fgh wrote: > > > > 3.When rodata full is disabled, crashkernel also need protect(keep > > arch_kexec_[un]protect_crashkres using). > > I think crashkernel should't depend on radata full(Maybe other architecture > > don't support radata full now). > > I think this is going round in circles :/ > > As a first step, can we please leave the crashkernel mapped unless > rodata=full? It should be a much simpler patch to write, review and maintain > and it gives you the performance you want when crashkernel is being used. As it seems I failed to communicate my thoughts about reusing the existing unmap_hotplug_range() to remap the crash kernel, let's try a more formal approach ;-) This is what I came up with and it does not look too complex. There are a couple of extra #ifdefs that can be removed if we toss some code around in a preparation patch. From 5adbfcbe370da0f09cd917e73aaac7ba8c6b45df Mon Sep 17 00:00:00 2001 From: Mike Rapoport <rppt@linux.ibm.com> Date: Sat, 2 Jul 2022 23:57:53 +0800 Subject: [PATCH] arm64/mm: remap crash kernel with base pages even if rodata_full disabled For server systems it is important to protect crash kernel memory for post-mortem analysis. In order to protect this memory it should be mapped at PTE level. 
When CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled, usage of crash kernel essentially forces mapping of the entire linear map with base pages even if rodata_full is not set (commit 2687275a5843 ("arm64: Force NO_BLOCK_MAPPINGS if crashkernel reservation is required")) and this causes performance degradation. To reduce the performance degradation, postpone reservation of the crash kernel memory to bootmem_init() regardless of CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 and enable remapping of the crash kernel memory at PTE level. Co-developed-by: Guanghui Feng <guanghuifeng@linux.alibaba.com> Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> --- arch/arm64/include/asm/mmu.h | 1 + arch/arm64/mm/init.c | 8 +--- arch/arm64/mm/mmu.c | 91 +++++++++++++++++++----------------- 3 files changed, 52 insertions(+), 48 deletions(-) diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h index 48f8466a4be9..f4eb2f61dd0d 100644 --- a/arch/arm64/include/asm/mmu.h +++ b/arch/arm64/include/asm/mmu.h @@ -71,6 +71,7 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys, extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot); extern void mark_linear_text_alias_ro(void); extern bool kaslr_requires_kpti(void); +extern void remap_crashkernel(void); #define INIT_MM_CONTEXT(name) \ .pgd = init_pg_dir, diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index 339ee84e5a61..51f8329931f8 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -190,6 +190,7 @@ static void __init reserve_crashkernel(void) crashk_res.start = crash_base; crashk_res.end = crash_base + crash_size - 1; insert_resource(&iomem_resource, &crashk_res); + remap_crashkernel(); } /* @@ -388,10 +389,6 @@ void __init arm64_memblock_init(void) } early_init_fdt_scan_reserved_mem(); - - if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) - reserve_crashkernel(); - high_memory = 
__va(memblock_end_of_DRAM() - 1) + 1; } @@ -438,8 +435,7 @@ void __init bootmem_init(void) * request_standard_resources() depends on crashkernel's memory being * reserved, so do it here. */ - if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32)) - reserve_crashkernel(); + reserve_crashkernel(); memblock_dump_all(); } diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 626ec32873c6..e0b5769bfc9f 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -483,21 +483,6 @@ void __init mark_linear_text_alias_ro(void) PAGE_KERNEL_RO); } -static bool crash_mem_map __initdata; - -static int __init enable_crash_mem_map(char *arg) -{ - /* - * Proper parameter parsing is done by reserve_crashkernel(). We only - * need to know if the linear map has to avoid block mappings so that - * the crashkernel reservations can be unmapped later. - */ - crash_mem_map = true; - - return 0; -} -early_param("crashkernel", enable_crash_mem_map); - static void __init map_mem(pgd_t *pgdp) { static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN); @@ -527,17 +512,6 @@ static void __init map_mem(pgd_t *pgdp) */ memblock_mark_nomap(kernel_start, kernel_end - kernel_start); -#ifdef CONFIG_KEXEC_CORE - if (crash_mem_map) { - if (IS_ENABLED(CONFIG_ZONE_DMA) || - IS_ENABLED(CONFIG_ZONE_DMA32)) - flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; - else if (crashk_res.end) - memblock_mark_nomap(crashk_res.start, - resource_size(&crashk_res)); - } -#endif - /* map all the memory banks */ for_each_mem_range(i, &start, &end) { if (start >= end) @@ -570,19 +544,6 @@ static void __init map_mem(pgd_t *pgdp) * in page granularity and put back unused memory to buddy system * through /sys/kernel/kexec_crash_size interface. 
*/ -#ifdef CONFIG_KEXEC_CORE - if (crash_mem_map && - !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) { - if (crashk_res.end) { - __map_memblock(pgdp, crashk_res.start, - crashk_res.end + 1, - PAGE_KERNEL, - NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS); - memblock_clear_nomap(crashk_res.start, - resource_size(&crashk_res)); - } - } -#endif } void mark_rodata_ro(void) @@ -827,7 +788,7 @@ int kern_addr_valid(unsigned long addr) return pfn_valid(pte_pfn(pte)); } -#ifdef CONFIG_MEMORY_HOTPLUG +#if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_KEXEC_CORE) static void free_hotplug_page_range(struct page *page, size_t size, struct vmem_altmap *altmap) { @@ -839,6 +800,7 @@ static void free_hotplug_page_range(struct page *page, size_t size, } } +#ifdef CONFIG_MEMORY_HOTPLUG static void free_hotplug_pgtable_page(struct page *page) { free_hotplug_page_range(page, PAGE_SIZE, NULL); @@ -862,6 +824,7 @@ static bool pgtable_range_aligned(unsigned long start, unsigned long end, return false; return true; } +#endif /* CONFIG_MEMORY_HOTPLUG */ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, unsigned long end, bool free_mapped, @@ -994,7 +957,9 @@ static void unmap_hotplug_range(unsigned long addr, unsigned long end, unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped, altmap); } while (addr = next, addr < end); } +#endif /* CONFIG_MEMORY_HOTPLUG || CONFIG_KEXEC_CORE */ +#ifdef CONFIG_MEMORY_HOTPLUG static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling) @@ -1148,7 +1113,7 @@ static void free_empty_tables(unsigned long addr, unsigned long end, free_empty_p4d_table(pgdp, addr, next, floor, ceiling); } while (addr = next, addr < end); } -#endif +#endif /* CONFIG_MEMORY_HOTPLUG */ #if !ARM64_KERNEL_USES_PMD_MAPS int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, @@ -1213,7 +1178,7 @@ void vmemmap_free(unsigned long start, unsigned long end, 
unmap_hotplug_range(start, end, true, altmap); free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END); } -#endif /* CONFIG_MEMORY_HOTPLUG */ +#endif /* CONFIG_MEMORY_HOTPLUG || CONFIG_KEXEC_CORE */ static inline pud_t *fixmap_pud(unsigned long addr) { @@ -1677,3 +1642,45 @@ static int __init prevent_bootmem_remove_init(void) } early_initcall(prevent_bootmem_remove_init); #endif + +void __init remap_crashkernel(void) +{ +#ifdef CONFIG_KEXEC_CORE + phys_addr_t start, end, size; + phys_addr_t aligned_start, aligned_end; + + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) + return; + + if (!crashk_res.end) + return; + + start = crashk_res.start & PAGE_MASK; + end = PAGE_ALIGN(crashk_res.end); + + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); + aligned_end = ALIGN(end, PUD_SIZE); + + /* Clear PUDs containing crash kernel memory */ + unmap_hotplug_range(__phys_to_virt(aligned_start), + __phys_to_virt(aligned_end), false, NULL); + + /* map area from PUD start to start of crash kernel with large pages */ + size = start - aligned_start; + __create_pgd_mapping(swapper_pg_dir, aligned_start, + __phys_to_virt(aligned_start), + size, PAGE_KERNEL, early_pgtable_alloc, 0); + + /* map crash kernel memory with base pages */ + size = end - start; + __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start), + size, PAGE_KERNEL, early_pgtable_alloc, + NO_EXEC_MAPPINGS | NO_BLOCK_MAPPINGS | + NO_CONT_MAPPINGS); + + /* map area from end of crash kernel to PUD end with large pages */ + size = aligned_end - end; + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), + size, PAGE_KERNEL, early_pgtable_alloc, 0); +#endif +} -- 2.35.3 > Will -- Sincerely yours, Mike. ^ permalink raw reply related [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-05 15:02 ` Mike Rapoport @ 2022-07-05 15:34 ` Catalin Marinas -1 siblings, 0 replies; 79+ messages in thread From: Catalin Marinas @ 2022-07-05 15:34 UTC (permalink / raw) To: Mike Rapoport Cc: Will Deacon, guanghui.fgh, Ard Biesheuvel, baolin.wang, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: > +void __init remap_crashkernel(void) > +{ > +#ifdef CONFIG_KEXEC_CORE > + phys_addr_t start, end, size; > + phys_addr_t aligned_start, aligned_end; > + > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > + return; > + > + if (!crashk_res.end) > + return; > + > + start = crashk_res.start & PAGE_MASK; > + end = PAGE_ALIGN(crashk_res.end); > + > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); > + aligned_end = ALIGN(end, PUD_SIZE); > + > + /* Clear PUDs containing crash kernel memory */ > + unmap_hotplug_range(__phys_to_virt(aligned_start), > + __phys_to_virt(aligned_end), false, NULL); What I don't understand is what happens if there's valid kernel data between aligned_start and crashk_res.start (or the other end of the range). -- Catalin ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-05 15:34 ` Catalin Marinas @ 2022-07-05 15:57 ` Mike Rapoport -1 siblings, 0 replies; 79+ messages in thread From: Mike Rapoport @ 2022-07-05 15:57 UTC (permalink / raw) To: Catalin Marinas Cc: Will Deacon, guanghui.fgh, Ard Biesheuvel, baolin.wang, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: > > +void __init remap_crashkernel(void) > > +{ > > +#ifdef CONFIG_KEXEC_CORE > > + phys_addr_t start, end, size; > > + phys_addr_t aligned_start, aligned_end; > > + > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > > + return; > > + > > + if (!crashk_res.end) > > + return; > > + > > + start = crashk_res.start & PAGE_MASK; > > + end = PAGE_ALIGN(crashk_res.end); > > + > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); > > + aligned_end = ALIGN(end, PUD_SIZE); > > + > > + /* Clear PUDs containing crash kernel memory */ > > + unmap_hotplug_range(__phys_to_virt(aligned_start), > > + __phys_to_virt(aligned_end), false, NULL); > > What I don't understand is what happens if there's valid kernel data > between aligned_start and crashk_res.start (or the other end of the > range). 
Data shouldn't go anywhere :) There is + /* map area from PUD start to start of crash kernel with large pages */ + size = start - aligned_start; + __create_pgd_mapping(swapper_pg_dir, aligned_start, + __phys_to_virt(aligned_start), + size, PAGE_KERNEL, early_pgtable_alloc, 0); and + /* map area from end of crash kernel to PUD end with large pages */ + size = aligned_end - end; + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), + size, PAGE_KERNEL, early_pgtable_alloc, 0); after the unmap, so after we tear down a part of a linear map we immediately recreate it, just with a different page size. This all happens before SMP, so there is no concurrency at that point. > -- > Catalin -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-05 15:57 ` Mike Rapoport @ 2022-07-05 17:05 ` Catalin Marinas -1 siblings, 0 replies; 79+ messages in thread From: Catalin Marinas @ 2022-07-05 17:05 UTC (permalink / raw) To: Mike Rapoport Cc: Will Deacon, guanghui.fgh, Ard Biesheuvel, baolin.wang, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: > > > +void __init remap_crashkernel(void) > > > +{ > > > +#ifdef CONFIG_KEXEC_CORE > > > + phys_addr_t start, end, size; > > > + phys_addr_t aligned_start, aligned_end; > > > + > > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > > > + return; > > > + > > > + if (!crashk_res.end) > > > + return; > > > + > > > + start = crashk_res.start & PAGE_MASK; > > > + end = PAGE_ALIGN(crashk_res.end); > > > + > > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); > > > + aligned_end = ALIGN(end, PUD_SIZE); > > > + > > > + /* Clear PUDs containing crash kernel memory */ > > > + unmap_hotplug_range(__phys_to_virt(aligned_start), > > > + __phys_to_virt(aligned_end), false, NULL); > > > > What I don't understand is what happens if there's valid kernel data > > between aligned_start and crashk_res.start (or the other end of the > > range). 
> > Data shouldn't go anywhere :) > > There is > > + /* map area from PUD start to start of crash kernel with large pages */ > + size = start - aligned_start; > + __create_pgd_mapping(swapper_pg_dir, aligned_start, > + __phys_to_virt(aligned_start), > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > and > > + /* map area from end of crash kernel to PUD end with large pages */ > + size = aligned_end - end; > + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > after the unmap, so after we tear down a part of a linear map we > immediately recreate it, just with a different page size. > > This all happens before SMP, so there is no concurrency at that point. That brief period of unmap worries me. The kernel text, data and stack are all in the vmalloc space but any other (memblock) allocation to this point may be in the unmapped range before and after the crashkernel reservation. The interrupts are off, so I think the only allocation and potential access that may go in this range is the page table itself. But it looks fragile to me. -- Catalin ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-05 17:05 ` Catalin Marinas @ 2022-07-05 20:45 ` Mike Rapoport -1 siblings, 0 replies; 79+ messages in thread From: Mike Rapoport @ 2022-07-05 20:45 UTC (permalink / raw) To: Catalin Marinas Cc: Will Deacon, guanghui.fgh, Ard Biesheuvel, baolin.wang, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote: > On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: > > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: > > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: > > > > +void __init remap_crashkernel(void) > > > > +{ > > > > +#ifdef CONFIG_KEXEC_CORE > > > > + phys_addr_t start, end, size; > > > > + phys_addr_t aligned_start, aligned_end; > > > > + > > > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > > > > + return; > > > > + > > > > + if (!crashk_res.end) > > > > + return; > > > > + > > > > + start = crashk_res.start & PAGE_MASK; > > > > + end = PAGE_ALIGN(crashk_res.end); > > > > + > > > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); > > > > + aligned_end = ALIGN(end, PUD_SIZE); > > > > + > > > > + /* Clear PUDs containing crash kernel memory */ > > > > + unmap_hotplug_range(__phys_to_virt(aligned_start), > > > > + __phys_to_virt(aligned_end), false, NULL); > > > > > > What I don't understand is what happens if there's valid kernel data > > > between aligned_start and crashk_res.start (or the other end of the > > > range). 
> > Data shouldn't go anywhere :) > > There is > > + /* map area from PUD start to start of crash kernel with large pages */ > + size = start - aligned_start; > + __create_pgd_mapping(swapper_pg_dir, aligned_start, > + __phys_to_virt(aligned_start), > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > and > > + /* map area from end of crash kernel to PUD end with large pages */ > + size = aligned_end - end; > + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > after the unmap, so after we tear down a part of a linear map we > immediately recreate it, just with a different page size. > > This all happens before SMP, so there is no concurrency at that point. > > That brief period of unmap worries me. The kernel text, data and stack > are all in the vmalloc space but any other (memblock) allocation to this > point may be in the unmapped range before and after the crashkernel > reservation. The interrupts are off, so I think the only allocation and > potential access that may go in this range is the page table itself. But > it looks fragile to me. I agree there are chances there will be an allocation from the unmapped range. We can make sure this won't happen, though. We can cap the memblock allocations with memblock_set_current_limit(aligned_end) or memblock_reserve(aligned_start, aligned_end) until the mappings are restored. > -- > Catalin -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-05 20:45 ` Mike Rapoport @ 2022-07-06 2:49 ` guanghui.fgh -1 siblings, 0 replies; 79+ messages in thread From: guanghui.fgh @ 2022-07-06 2:49 UTC (permalink / raw) To: Mike Rapoport, Catalin Marinas Cc: Will Deacon, Ard Biesheuvel, baolin.wang, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer Thanks. On 2022/7/6 4:45, Mike Rapoport wrote: > On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote: >> On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: >>> On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: >>>> On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: >>>>> +void __init remap_crashkernel(void) >>>>> +{ >>>>> +#ifdef CONFIG_KEXEC_CORE >>>>> + phys_addr_t start, end, size; >>>>> + phys_addr_t aligned_start, aligned_end; >>>>> + >>>>> + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) >>>>> + return; >>>>> + >>>>> + if (!crashk_res.end) >>>>> + return; >>>>> + >>>>> + start = crashk_res.start & PAGE_MASK; >>>>> + end = PAGE_ALIGN(crashk_res.end); >>>>> + >>>>> + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); >>>>> + aligned_end = ALIGN(end, PUD_SIZE); >>>>> + >>>>> + /* Clear PUDs containing crash kernel memory */ >>>>> + unmap_hotplug_range(__phys_to_virt(aligned_start), >>>>> + __phys_to_virt(aligned_end), false, NULL); >>>> >>>> What I don't understand is what happens if there's valid kernel data >>>> between aligned_start and crashk_res.start (or the other end of the >>>> range). 
>>> >>> Data shouldn't go anywhere :) >>> >>> There is >>> >>> + /* map area from PUD start to start of crash kernel with large pages */ >>> + size = start - aligned_start; >>> + __create_pgd_mapping(swapper_pg_dir, aligned_start, >>> + __phys_to_virt(aligned_start), >>> + size, PAGE_KERNEL, early_pgtable_alloc, 0); >>> >>> and >>> >>> + /* map area from end of crash kernel to PUD end with large pages */ >>> + size = aligned_end - end; >>> + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), >>> + size, PAGE_KERNEL, early_pgtable_alloc, 0); >>> >>> after the unmap, so after we tear down a part of a linear map we >>> immediately recreate it, just with a different page size. >>> >>> This all happens before SMP, so there is no concurrency at that point. >> >> That brief period of unmap worries me. The kernel text, data and stack >> are all in the vmalloc space but any other (memblock) allocation to this >> point may be in the unmapped range before and after the crashkernel >> reservation. The interrupts are off, so I think the only allocation and >> potential access that may go in this range is the page table itself. But >> it looks fragile to me. > > I agree there are chances there will be an allocation from the unmapped > range. > > We can make sure this won't happen, though. We can cap the memblock > allocations with memblock_set_current_limit(aligned_end) or > memblock_reserve(algined_start, aligned_end) until the mappings are > restored. > >> -- >> Catalin > I think there is no need to worry about vmalloc mem. 1. As mentioned above, when reserving the crashkernel and remapping the linear mapping, only the boot CPU is running; no other CPU/thread runs at the same time. 2. Although vmalloc may allocate memory from the unmapped area, we rebuild the mapping with PTE-level (page-granularity) entries that keep each virtual address mapped to the same physical address (and no other CPU/thread accesses vmalloc memory in the meantime). As a result, this has no effect on vmalloc memory. 
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-06 2:49 ` guanghui.fgh @ 2022-07-06 7:43 ` Catalin Marinas -1 siblings, 0 replies; 79+ messages in thread From: Catalin Marinas @ 2022-07-06 7:43 UTC (permalink / raw) To: guanghui.fgh Cc: Mike Rapoport, Will Deacon, Ard Biesheuvel, baolin.wang, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer On Wed, Jul 06, 2022 at 10:49:43AM +0800, guanghui.fgh wrote: > 在 2022/7/6 4:45, Mike Rapoport 写道: > > On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote: > > > On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: > > > > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: > > > > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: > > > > > > +void __init remap_crashkernel(void) > > > > > > +{ > > > > > > +#ifdef CONFIG_KEXEC_CORE > > > > > > + phys_addr_t start, end, size; > > > > > > + phys_addr_t aligned_start, aligned_end; > > > > > > + > > > > > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > > > > > > + return; > > > > > > + > > > > > > + if (!crashk_res.end) > > > > > > + return; > > > > > > + > > > > > > + start = crashk_res.start & PAGE_MASK; > > > > > > + end = PAGE_ALIGN(crashk_res.end); > > > > > > + > > > > > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); > > > > > > + aligned_end = ALIGN(end, PUD_SIZE); > > > > > > + > > > > > > + /* Clear PUDs containing crash kernel memory */ > > > > > > + unmap_hotplug_range(__phys_to_virt(aligned_start), > > > > > > + __phys_to_virt(aligned_end), false, NULL); > > > > > > > > > > What I don't understand is what happens if there's valid kernel data > > > > > between aligned_start and crashk_res.start (or the other end of the > > > > > range). 
> > > > > > > > Data shouldn't go anywhere :) > > > > > > > > There is > > > > > > > > + /* map area from PUD start to start of crash kernel with large pages */ > > > > + size = start - aligned_start; > > > > + __create_pgd_mapping(swapper_pg_dir, aligned_start, > > > > + __phys_to_virt(aligned_start), > > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > > > > > > > and > > > > > > > > + /* map area from end of crash kernel to PUD end with large pages */ > > > > + size = aligned_end - end; > > > > + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), > > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > > > > > > > after the unmap, so after we tear down a part of a linear map we > > > > immediately recreate it, just with a different page size. > > > > > > > > This all happens before SMP, so there is no concurrency at that point. > > > > > > That brief period of unmap worries me. The kernel text, data and stack > > > are all in the vmalloc space but any other (memblock) allocation to this > > > point may be in the unmapped range before and after the crashkernel > > > reservation. The interrupts are off, so I think the only allocation and > > > potential access that may go in this range is the page table itself. But > > > it looks fragile to me. > > > > I agree there are chances there will be an allocation from the unmapped > > range. > > > > We can make sure this won't happen, though. We can cap the memblock > > allocations with memblock_set_current_limit(aligned_end) or > > memblock_reserve(algined_start, aligned_end) until the mappings are > > restored. > > I think there is no need to worry about vmalloc mem. That's not what I'm worried about. It's about memblock allocations that are accessed through the linear map. -- Catalin ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-05 20:45 ` Mike Rapoport @ 2022-07-06 10:04 ` Catalin Marinas -1 siblings, 0 replies; 79+ messages in thread From: Catalin Marinas @ 2022-07-06 10:04 UTC (permalink / raw) To: Mike Rapoport Cc: Will Deacon, guanghui.fgh, Ard Biesheuvel, baolin.wang, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote: > On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote: > > On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: > > > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: > > > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: > > > > > +void __init remap_crashkernel(void) > > > > > +{ > > > > > +#ifdef CONFIG_KEXEC_CORE > > > > > + phys_addr_t start, end, size; > > > > > + phys_addr_t aligned_start, aligned_end; > > > > > + > > > > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > > > > > + return; > > > > > + > > > > > + if (!crashk_res.end) > > > > > + return; > > > > > + > > > > > + start = crashk_res.start & PAGE_MASK; > > > > > + end = PAGE_ALIGN(crashk_res.end); > > > > > + > > > > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); > > > > > + aligned_end = ALIGN(end, PUD_SIZE); > > > > > + > > > > > + /* Clear PUDs containing crash kernel memory */ > > > > > + unmap_hotplug_range(__phys_to_virt(aligned_start), > > > > > + __phys_to_virt(aligned_end), false, NULL); > > > > > > > > What I don't understand is what happens if there's valid kernel data > > > > between aligned_start and crashk_res.start (or the other end of the > > > > range). 
> > > > > > Data shouldn't go anywhere :) > > > > > > There is > > > > > > + /* map area from PUD start to start of crash kernel with large pages */ > > > + size = start - aligned_start; > > > + __create_pgd_mapping(swapper_pg_dir, aligned_start, > > > + __phys_to_virt(aligned_start), > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > > > > > and > > > > > > + /* map area from end of crash kernel to PUD end with large pages */ > > > + size = aligned_end - end; > > > + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > > > > > after the unmap, so after we tear down a part of a linear map we > > > immediately recreate it, just with a different page size. > > > > > > This all happens before SMP, so there is no concurrency at that point. > > > > That brief period of unmap worries me. The kernel text, data and stack > > are all in the vmalloc space but any other (memblock) allocation to this > > point may be in the unmapped range before and after the crashkernel > > reservation. The interrupts are off, so I think the only allocation and > > potential access that may go in this range is the page table itself. But > > it looks fragile to me. > > I agree there are chances there will be an allocation from the unmapped > range. > > We can make sure this won't happen, though. We can cap the memblock > allocations with memblock_set_current_limit(aligned_end) or > memblock_reserve(algined_start, aligned_end) until the mappings are > restored. We can reserve the region just before unmapping to avoid new allocations for the page tables but we can't do much about pages already allocated prior to calling remap_crashkernel(). -- Catalin ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-06 10:04 ` Catalin Marinas @ 2022-07-06 13:54 ` Mike Rapoport -1 siblings, 0 replies; 79+ messages in thread From: Mike Rapoport @ 2022-07-06 13:54 UTC (permalink / raw) To: Catalin Marinas Cc: Will Deacon, guanghui.fgh, Ard Biesheuvel, baolin.wang, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer On Wed, Jul 06, 2022 at 11:04:24AM +0100, Catalin Marinas wrote: > On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote: > > On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote: > > > On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: > > > > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: > > > > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: > > > > > > +void __init remap_crashkernel(void) > > > > > > +{ > > > > > > +#ifdef CONFIG_KEXEC_CORE > > > > > > + phys_addr_t start, end, size; > > > > > > + phys_addr_t aligned_start, aligned_end; > > > > > > + > > > > > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > > > > > > + return; > > > > > > + > > > > > > + if (!crashk_res.end) > > > > > > + return; > > > > > > + > > > > > > + start = crashk_res.start & PAGE_MASK; > > > > > > + end = PAGE_ALIGN(crashk_res.end); > > > > > > + > > > > > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); > > > > > > + aligned_end = ALIGN(end, PUD_SIZE); > > > > > > + > > > > > > + /* Clear PUDs containing crash kernel memory */ > > > > > > + unmap_hotplug_range(__phys_to_virt(aligned_start), > > > > > > + __phys_to_virt(aligned_end), false, NULL); > > > > > > > > > > What I don't understand is what happens if there's valid kernel data > > > > > between aligned_start and crashk_res.start (or the other end of the > > > > > range). 
> > > > > > > > Data shouldn't go anywhere :) > > > > > > > > There is > > > > + /* map area from PUD start to start of crash kernel with large pages */ > > > > + size = start - aligned_start; > > > > + __create_pgd_mapping(swapper_pg_dir, aligned_start, > > > > + __phys_to_virt(aligned_start), > > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > > > > > > > and > > > > > > > > + /* map area from end of crash kernel to PUD end with large pages */ > > > > + size = aligned_end - end; > > > > + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), > > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > > > > > > > after the unmap, so after we tear down a part of a linear map we > > > > immediately recreate it, just with a different page size. > > > > > > > > This all happens before SMP, so there is no concurrency at that point. > > > > > > That brief period of unmap worries me. The kernel text, data and stack > > > are all in the vmalloc space but any other (memblock) allocation to this > > > point may be in the unmapped range before and after the crashkernel > > > reservation. The interrupts are off, so I think the only allocation and > > > potential access that may go in this range is the page table itself. But > > > it looks fragile to me. > > > > I agree there are chances there will be an allocation from the unmapped > > range. > > > > We can make sure this won't happen, though. We can cap the memblock > > allocations with memblock_set_current_limit(aligned_end) or > > memblock_reserve(algined_start, aligned_end) until the mappings are > > restored. > > We can reserve the region just before unmapping to avoid new allocations > for the page tables but we can't do much about pages already allocated > prior to calling remap_crashkernel(). Right, this was bothering me too after I re-read your previous email. 
One thing I can think of is to only remap the crash kernel memory if it is a part of an allocation that exactly fits into one or more PUDs. Say, in reserve_crashkernel() we try the memblock_phys_alloc() with PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation succeeds, we remap the entire area that now contains only memory allocated in reserve_crashkernel() and free the extra memory after remapping is done. If the large allocation fails, we fall back to the original size and alignment and don't allow unmapping crash kernel memory in arch_kexec_protect_crashkres(). > -- > Catalin -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation @ 2022-07-06 13:54 ` Mike Rapoport 0 siblings, 0 replies; 79+ messages in thread From: Mike Rapoport @ 2022-07-06 13:54 UTC (permalink / raw) To: Catalin Marinas Cc: Will Deacon, guanghui.fgh, Ard Biesheuvel, baolin.wang, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer On Wed, Jul 06, 2022 at 11:04:24AM +0100, Catalin Marinas wrote: > On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote: > > On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote: > > > On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: > > > > On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: > > > > > On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: > > > > > > +void __init remap_crashkernel(void) > > > > > > +{ > > > > > > +#ifdef CONFIG_KEXEC_CORE > > > > > > + phys_addr_t start, end, size; > > > > > > + phys_addr_t aligned_start, aligned_end; > > > > > > + > > > > > > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > > > > > > + return; > > > > > > + > > > > > > + if (!crashk_res.end) > > > > > > + return; > > > > > > + > > > > > > + start = crashk_res.start & PAGE_MASK; > > > > > > + end = PAGE_ALIGN(crashk_res.end); > > > > > > + > > > > > > + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); > > > > > > + aligned_end = ALIGN(end, PUD_SIZE); > > > > > > + > > > > > > + /* Clear PUDs containing crash kernel memory */ > > > > > > + unmap_hotplug_range(__phys_to_virt(aligned_start), > > > > > > + __phys_to_virt(aligned_end), false, NULL); > > > > > > > > > > What I don't understand is what happens if there's valid kernel data > > > > > between aligned_start and crashk_res.start (or the other end of the > > > > > range). 
> > > > > > > > Data shouldn't go anywhere :) > > > > > > > > There is > > > > > > > > + /* map area from PUD start to start of crash kernel with large pages */ > > > > + size = start - aligned_start; > > > > + __create_pgd_mapping(swapper_pg_dir, aligned_start, > > > > + __phys_to_virt(aligned_start), > > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > > > > > > > and > > > > > > > > + /* map area from end of crash kernel to PUD end with large pages */ > > > > + size = aligned_end - end; > > > > + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), > > > > + size, PAGE_KERNEL, early_pgtable_alloc, 0); > > > > > > > > after the unmap, so after we tear down a part of a linear map we > > > > immediately recreate it, just with a different page size. > > > > > > > > This all happens before SMP, so there is no concurrency at that point. > > > > > > That brief period of unmap worries me. The kernel text, data and stack > > > are all in the vmalloc space but any other (memblock) allocation to this > > > point may be in the unmapped range before and after the crashkernel > > > reservation. The interrupts are off, so I think the only allocation and > > > potential access that may go in this range is the page table itself. But > > > it looks fragile to me. > > > > I agree there are chances there will be an allocation from the unmapped > > range. > > > > We can make sure this won't happen, though. We can cap the memblock > > allocations with memblock_set_current_limit(aligned_end) or > > memblock_reserve(algined_start, aligned_end) until the mappings are > > restored. > > We can reserve the region just before unmapping to avoid new allocations > for the page tables but we can't do much about pages already allocated > prior to calling remap_crashkernel(). Right, this was bothering me too after I re-read you previous email. 
One thing I can think of is to only remap the crash kernel memory if it is a part of an allocation that exactly fits into one or more PUDs. Say, in reserve_crashkernel() we try the memblock_phys_alloc() with PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation succeeds, we remap the entire area that now contains only memory allocated in reserve_crashkernel() and free the extra memory after remapping is done. If the large allocation fails, we fall back to the original size and alignment and don't allow unmapping crash kernel memory in arch_kexec_protect_crashkres(). > -- > Catalin -- Sincerely yours, Mike. _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-06 13:54 ` Mike Rapoport @ 2022-07-06 15:18 ` guanghui.fgh -1 siblings, 0 replies; 79+ messages in thread From: guanghui.fgh @ 2022-07-06 15:18 UTC (permalink / raw) To: Mike Rapoport, Catalin Marinas Cc: Will Deacon, Ard Biesheuvel, baolin.wang, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer Thanks. 在 2022/7/6 21:54, Mike Rapoport 写道: > On Wed, Jul 06, 2022 at 11:04:24AM +0100, Catalin Marinas wrote: >> On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote: >>> On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote: >>>> On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: >>>>> On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: >>>>>> On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: >>>>>>> +void __init remap_crashkernel(void) >>>>>>> +{ >>>>>>> +#ifdef CONFIG_KEXEC_CORE >>>>>>> + phys_addr_t start, end, size; >>>>>>> + phys_addr_t aligned_start, aligned_end; >>>>>>> + >>>>>>> + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) >>>>>>> + return; >>>>>>> + >>>>>>> + if (!crashk_res.end) >>>>>>> + return; >>>>>>> + >>>>>>> + start = crashk_res.start & PAGE_MASK; >>>>>>> + end = PAGE_ALIGN(crashk_res.end); >>>>>>> + >>>>>>> + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); >>>>>>> + aligned_end = ALIGN(end, PUD_SIZE); >>>>>>> + >>>>>>> + /* Clear PUDs containing crash kernel memory */ >>>>>>> + unmap_hotplug_range(__phys_to_virt(aligned_start), >>>>>>> + __phys_to_virt(aligned_end), false, NULL); >>>>>> >>>>>> What I don't understand is what happens if there's valid kernel data >>>>>> between aligned_start and crashk_res.start (or the other end of the >>>>>> range). 
>>>>> >>>>> Data shouldn't go anywhere :) >>>>> >>>>> There is >>>>> >>>>> + /* map area from PUD start to start of crash kernel with large pages */ >>>>> + size = start - aligned_start; >>>>> + __create_pgd_mapping(swapper_pg_dir, aligned_start, >>>>> + __phys_to_virt(aligned_start), >>>>> + size, PAGE_KERNEL, early_pgtable_alloc, 0); >>>>> >>>>> and >>>>> >>>>> + /* map area from end of crash kernel to PUD end with large pages */ >>>>> + size = aligned_end - end; >>>>> + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), >>>>> + size, PAGE_KERNEL, early_pgtable_alloc, 0); >>>>> >>>>> after the unmap, so after we tear down a part of a linear map we >>>>> immediately recreate it, just with a different page size. >>>>> >>>>> This all happens before SMP, so there is no concurrency at that point. >>>> >>>> That brief period of unmap worries me. The kernel text, data and stack >>>> are all in the vmalloc space but any other (memblock) allocation to this >>>> point may be in the unmapped range before and after the crashkernel >>>> reservation. The interrupts are off, so I think the only allocation and >>>> potential access that may go in this range is the page table itself. But >>>> it looks fragile to me. >>> >>> I agree there are chances there will be an allocation from the unmapped >>> range. >>> >>> We can make sure this won't happen, though. We can cap the memblock >>> allocations with memblock_set_current_limit(aligned_end) or >>> memblock_reserve(algined_start, aligned_end) until the mappings are >>> restored. >> >> We can reserve the region just before unmapping to avoid new allocations >> for the page tables but we can't do much about pages already allocated >> prior to calling remap_crashkernel(). > > Right, this was bothering me too after I re-read you previous email. > > One thing I can think of is to only remap the crash kernel memory if it is > a part of an allocation that exactly fits into one ore more PUDs. 
> > Say, in reserve_crashkernel() we try the memblock_phys_alloc() with > PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation > succeeds, we remap the entire area that now contains only memory allocated > in reserve_crashkernel() and free the extra memory after remapping is done. > If the large allocation fails, we fall back to the original size and > alignment and don't allow unmapping crash kernel memory in > arch_kexec_protect_crashkres(). > >> -- >> Catalin > Thanks. There is a new method. I think we should use the patch v3 (similar, but some changes are needed): 1. We can walk the crashkernel block/section pagetable, [[[keeping the origin block/section mapping valid]]]: rebuild the pte-level page mapping for the crashkernel mem, and rebuild the left & right margin mem (which is in the same block/section mapping but outside the crashkernel mem) with block/section mapping. 2. 'Replace' the origin block/section mapping with the newly built mapping iteratively. With this method, all the mem mappings stay valid all the time. 3. The patch v3 link: https://lore.kernel.org/linux-mm/6dc308db-3685-4df5-506a-71f9e3794ec8@linux.alibaba.com/T/ (needs some changes) ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-06 15:18 ` guanghui.fgh @ 2022-07-06 15:30 ` guanghui.fgh -1 siblings, 0 replies; 79+ messages in thread From: guanghui.fgh @ 2022-07-06 15:30 UTC (permalink / raw) To: Mike Rapoport, Catalin Marinas Cc: Will Deacon, Ard Biesheuvel, baolin.wang, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer 在 2022/7/6 23:18, guanghui.fgh 写道: > Thanks. > > 在 2022/7/6 21:54, Mike Rapoport 写道: >> On Wed, Jul 06, 2022 at 11:04:24AM +0100, Catalin Marinas wrote: >>> On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote: >>>> On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote: >>>>> On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote: >>>>>> On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote: >>>>>>> On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote: >>>>>>>> +void __init remap_crashkernel(void) >>>>>>>> +{ >>>>>>>> +#ifdef CONFIG_KEXEC_CORE >>>>>>>> + phys_addr_t start, end, size; >>>>>>>> + phys_addr_t aligned_start, aligned_end; >>>>>>>> + >>>>>>>> + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) >>>>>>>> + return; >>>>>>>> + >>>>>>>> + if (!crashk_res.end) >>>>>>>> + return; >>>>>>>> + >>>>>>>> + start = crashk_res.start & PAGE_MASK; >>>>>>>> + end = PAGE_ALIGN(crashk_res.end); >>>>>>>> + >>>>>>>> + aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE); >>>>>>>> + aligned_end = ALIGN(end, PUD_SIZE); >>>>>>>> + >>>>>>>> + /* Clear PUDs containing crash kernel memory */ >>>>>>>> + unmap_hotplug_range(__phys_to_virt(aligned_start), >>>>>>>> + __phys_to_virt(aligned_end), false, NULL); >>>>>>> >>>>>>> What I don't understand is what happens if there's valid kernel data >>>>>>> between aligned_start and crashk_res.start (or the other end of the >>>>>>> range). 
>>>>>> >>>>>> Data shouldn't go anywhere :) >>>>>> >>>>>> There is >>>>>> >>>>>> + /* map area from PUD start to start of crash kernel with >>>>>> large pages */ >>>>>> + size = start - aligned_start; >>>>>> + __create_pgd_mapping(swapper_pg_dir, aligned_start, >>>>>> + __phys_to_virt(aligned_start), >>>>>> + size, PAGE_KERNEL, early_pgtable_alloc, 0); >>>>>> >>>>>> and >>>>>> >>>>>> + /* map area from end of crash kernel to PUD end with large >>>>>> pages */ >>>>>> + size = aligned_end - end; >>>>>> + __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end), >>>>>> + size, PAGE_KERNEL, early_pgtable_alloc, 0); >>>>>> >>>>>> after the unmap, so after we tear down a part of a linear map we >>>>>> immediately recreate it, just with a different page size. >>>>>> >>>>>> This all happens before SMP, so there is no concurrency at that >>>>>> point. >>>>> >>>>> That brief period of unmap worries me. The kernel text, data and stack >>>>> are all in the vmalloc space but any other (memblock) allocation to >>>>> this >>>>> point may be in the unmapped range before and after the crashkernel >>>>> reservation. The interrupts are off, so I think the only allocation >>>>> and >>>>> potential access that may go in this range is the page table >>>>> itself. But >>>>> it looks fragile to me. >>>> >>>> I agree there are chances there will be an allocation from the unmapped >>>> range. >>>> >>>> We can make sure this won't happen, though. We can cap the memblock >>>> allocations with memblock_set_current_limit(aligned_end) or >>>> memblock_reserve(algined_start, aligned_end) until the mappings are >>>> restored. >>> >>> We can reserve the region just before unmapping to avoid new allocations >>> for the page tables but we can't do much about pages already allocated >>> prior to calling remap_crashkernel(). >> >> Right, this was bothering me too after I re-read you previous email. 
>> >> One thing I can think of is to only remap the crash kernel memory if >> it is >> a part of an allocation that exactly fits into one ore more PUDs. >> >> Say, in reserve_crashkernel() we try the memblock_phys_alloc() with >> PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation >> succeeds, we remap the entire area that now contains only memory >> allocated >> in reserve_crashkernel() and free the extra memory after remapping is >> done. >> If the large allocation fails, we fall back to the original size and >> alignment and don't allow unmapping crash kernel memory in >> arch_kexec_protect_crashkres(). >>> -- >>> Catalin >> > Thanks. > > There is a new method. > I think we should use the patch v3(similar but need add some changes) > > 1.We can walk crashkernle block/section pagetable, > [[[(keep the origin block/section mapping valid]]] > rebuild the pte level page mapping for the crashkernel mem > rebuild left & right margin mem(which is in same block/section mapping > but out of crashkernel mem) with block/section mapping > > 2.'replace' the origin block/section mapping by new builded mapping > iterately > > With this method, all the mem mapping keep valid all the time. > > 3.the patch v3 link: > https://lore.kernel.org/linux-mm/6dc308db-3685-4df5-506a-71f9e3794ec8@linux.alibaba.com/T/ > > (Need some changes) Namely: when rebuilding the page mapping for the crashkernel mem, there is no change to the origin mapping. When the new mapping is ready, we replace the old mapping with the newly built mapping. With this method, all mem mappings stay valid all the time. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-06 15:18 ` guanghui.fgh @ 2022-07-06 15:40 ` Catalin Marinas -1 siblings, 0 replies; 79+ messages in thread From: Catalin Marinas @ 2022-07-06 15:40 UTC (permalink / raw) To: guanghui.fgh Cc: Mike Rapoport, Will Deacon, Ard Biesheuvel, baolin.wang, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer On Wed, Jul 06, 2022 at 11:18:22PM +0800, guanghui.fgh wrote: > 在 2022/7/6 21:54, Mike Rapoport 写道: > > One thing I can think of is to only remap the crash kernel memory if it is > > a part of an allocation that exactly fits into one ore more PUDs. > > > > Say, in reserve_crashkernel() we try the memblock_phys_alloc() with > > PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation > > succeeds, we remap the entire area that now contains only memory allocated > > in reserve_crashkernel() and free the extra memory after remapping is done. > > If the large allocation fails, we fall back to the original size and > > alignment and don't allow unmapping crash kernel memory in > > arch_kexec_protect_crashkres(). > > There is a new method. > I think we should use the patch v3(similar but need add some changes) > > 1.We can walk crashkernle block/section pagetable, > [[[(keep the origin block/section mapping valid]]] > rebuild the pte level page mapping for the crashkernel mem > rebuild left & right margin mem(which is in same block/section mapping but > out of crashkernel mem) with block/section mapping > > 2.'replace' the origin block/section mapping by new builded mapping > iterately > > With this method, all the mem mapping keep valid all the time. As I already commented on one of your previous patches, this is not allowed by the architecture. 
If FEAT_BBM is implemented (ARMv8.4 I think), the worst that can happen is a TLB conflict abort and the handler should invalidate the TLBs and restart the faulting instruction, assuming the handler won't try to access the same conflicting virtual address. Prior to FEAT_BBM, that's not possible as the architecture does not describe a precise behaviour of conflicting TLB entries (you might as well get the TLB output of multiple entries being or'ed together). -- Catalin ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-06 15:40 ` Catalin Marinas @ 2022-07-07 17:02 ` guanghui.fgh -1 siblings, 0 replies; 79+ messages in thread From: guanghui.fgh @ 2022-07-07 17:02 UTC (permalink / raw) To: Catalin Marinas Cc: Mike Rapoport, Will Deacon, Ard Biesheuvel, baolin.wang, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer Thanks. 在 2022/7/6 23:40, Catalin Marinas 写道: > On Wed, Jul 06, 2022 at 11:18:22PM +0800, guanghui.fgh wrote: >> 在 2022/7/6 21:54, Mike Rapoport 写道: >>> One thing I can think of is to only remap the crash kernel memory if it is >>> a part of an allocation that exactly fits into one ore more PUDs. >>> >>> Say, in reserve_crashkernel() we try the memblock_phys_alloc() with >>> PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation >>> succeeds, we remap the entire area that now contains only memory allocated >>> in reserve_crashkernel() and free the extra memory after remapping is done. >>> If the large allocation fails, we fall back to the original size and >>> alignment and don't allow unmapping crash kernel memory in >>> arch_kexec_protect_crashkres(). >> >> There is a new method. >> I think we should use the patch v3(similar but need add some changes) >> >> 1.We can walk crashkernle block/section pagetable, >> [[[(keep the origin block/section mapping valid]]] >> rebuild the pte level page mapping for the crashkernel mem >> rebuild left & right margin mem(which is in same block/section mapping but >> out of crashkernel mem) with block/section mapping >> >> 2.'replace' the origin block/section mapping by new builded mapping >> iterately >> >> With this method, all the mem mapping keep valid all the time. 
> > As I already commented on one of your previous patches, this is not > allowed by the architecture. If FEAT_BBM is implemented (ARMv8.4 I > think), the worst that can happen is a TLB conflict abort and the > handler should invalidate the TLBs and restart the faulting instruction, > assuming the handler won't try to access the same conflicting virtual > address. Prior to FEAT_BBM, that's not possible as the architecture does > not describe a precise behaviour of conflicting TLB entries (you might > as well get the TLB output of multiple entries being or'ed together). > I think there is another way to handle it. 1. We can rebuild the crashkernel mem mapping first, but [[[without changing the origin linear mapping]]]. 2. Afterward, we can reuse idmap_pg_dir and switch to it; we use idmap_pg_dir to change the linear mapping in a way that complies with TLB break-before-make (BBM). ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH RESEND v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-06 15:40 ` Catalin Marinas @ 2022-07-08 12:28 ` guanghui.fgh -1 siblings, 0 replies; 79+ messages in thread From: guanghui.fgh @ 2022-07-08 12:28 UTC (permalink / raw) To: Catalin Marinas Cc: Mike Rapoport, Will Deacon, Ard Biesheuvel, baolin.wang, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, linux-mm, yaohongbo, alikernel-developer Thanks. On 2022/7/6 23:40, Catalin Marinas wrote: > On Wed, Jul 06, 2022 at 11:18:22PM +0800, guanghui.fgh wrote: >> On 2022/7/6 21:54, Mike Rapoport wrote: >>> One thing I can think of is to only remap the crash kernel memory if it is >>> a part of an allocation that exactly fits into one or more PUDs. >>> >>> Say, in reserve_crashkernel() we try the memblock_phys_alloc() with >>> PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation >>> succeeds, we remap the entire area that now contains only memory allocated >>> in reserve_crashkernel() and free the extra memory after remapping is done. >>> If the large allocation fails, we fall back to the original size and >>> alignment and don't allow unmapping crash kernel memory in >>> arch_kexec_protect_crashkres(). >> >> There is a new method. >> I think we should use the patch v3 (similar, but some changes need to be added). >> >> 1.We can walk the crashkernel block/section pagetable, >> [[[keeping the origin block/section mapping valid]]], >> rebuild the pte level page mapping for the crashkernel mem, and >> rebuild the left & right margin mem (which is in the same block/section mapping but >> outside the crashkernel mem) with block/section mapping >> >> 2.'replace' the origin block/section mapping with the newly built mapping, >> iteratively >> >> With this method, all the mem mappings keep valid all the time. 
> > As I already commented on one of your previous patches, this is not > allowed by the architecture. If FEAT_BBM is implemented (ARMv8.4 I > think), the worst that can happen is a TLB conflict abort and the > handler should invalidate the TLBs and restart the faulting instruction, > assuming the handler won't try to access the same conflicting virtual > address. Prior to FEAT_BBM, that's not possible as the architecture does > not describe a precise behaviour of conflicting TLB entries (you might > as well get the TLB output of multiple entries being or'ed together). > The cpu can generate a TLB conflict abort if it detects that the address being looked up in the TLB hits multiple entries. (1).I think when gathering small pages into a block/section mapping, there may be a TLB conflict if BBM is not complied with. Namely: a.Map a 4KB page (address X). Touch that page, in order to get the translation cached in the TLB. b.Modify the translation tables, replacing the mapping for address X with a 2MB mapping - DO NOT INVALIDATE the TLB. c.Touch "X + 4KB". This will/should miss in the TLB, causing a new walk returning the 2MB mapping. d.Touch X. Assuming they've not been evicted, you'll hit both the 4KB and the 2MB mapping - as both cover address X. There is a TLB conflict. (link: https://community.arm.com/support-forums/f/dev-platforms-forum/13583/tlb-conflict-abort) (2).But when splitting a large block/section mapping to smaller granularity, there may be no TLB conflict. 
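The conflict in scenario (1) comes down to "more than one TLB entry covers the same virtual address". A minimal counting model makes the a-d sequence concrete (illustrative only; a real TLB also tags entries by ASID, translation level, and so on):

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K 0x1000ULL
#define SZ_2M 0x200000ULL

/* Each cached entry translates [base, base + size). More than one hit
 * for a VA is the situation that may raise a TLB conflict abort. */
struct tlb_entry { uint64_t base, size; int valid; };

static int tlb_hits(const struct tlb_entry *tlb, int n, uint64_t va)
{
    int hits = 0;
    for (int i = 0; i < n; i++)
        if (tlb[i].valid && va >= tlb[i].base &&
            va - tlb[i].base < tlb[i].size)
            hits++;
    return hits;
}
```

With a 4KB entry cached for X (step a) and a 2MB entry cached after the remap (step c), a lookup of X hits twice (step d), while X + 4KB hits only the 2MB entry.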
Namely: a.rebuild the pte level mapping without any change to the origin pagetable (the relation between virtual address and physical address stays the same) b.modify the 1G mapping to use the new pte level mapping in [[[mem]]] without tlb flush c.When the cpu accesses the 1G mem (anywhere): If the 1G tlb entry is already cached in the tlb, all the 1G mem accesses succeed (without any tlb load, no conflict). If the 1G tlb entry has been evicted, the walker will access the pagetable in mem (whether the cpu "catches" the old (1G) or new (4k) mapped pagetable in mem, all the 1G mem accesses succeed) (load new tlb entry, no conflict). d.Afterward, we flush the tlb and force the cpu to use the new pagetable (no conflict). It seems that there are never two tlb entries for the same virtual address in the tlb cache when splitting a large block/section mapping. (3).At the same time, I think we can use another way. As the system linear mapping is built with init_pg_dir, we can also reuse init_pg_dir to split the block/section mapping at that point. As init_pg_dir contains all kernel text/data mappings, we can comply with the BBM requirement: a.rebuild the new pte level mapping without any change to the old mapping (the cpu can't walk the new page mapping; it's isolated) b.change to use init_pg_dir c.clear the old 1G block mapping and flush tlb d.modify the linear mapping to use the new pte level page mapping with init_pg_dir (TLB BBM) e.switch to swapper_pg_dir Could you give me some advice? Thanks. ^ permalink raw reply [flat|nested] 79+ messages in thread
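Splitting leaves three pieces of the enclosing block: a left margin, the crashkernel body (page granular), and a right margin, with the margins remappable at block granularity as discussed above. The margin arithmetic can be sketched standalone; `left_margin`/`right_margin` are hypothetical helpers assuming 2M blocks on a 4K granule:

```c
#include <assert.h>
#include <stdint.h>

#define BLK_SIZE 0x200000ULL        /* 2M block, 4K granule assumed */
#define BLK_MASK (~(BLK_SIZE - 1))

/* Bytes between the start of the enclosing 2M block and 'start'. */
static uint64_t left_margin(uint64_t start)
{
    return start - (start & BLK_MASK);
}

/* Bytes between 'end' and the end of its enclosing 2M block
 * (zero when 'end' is already block-aligned). */
static uint64_t right_margin(uint64_t end)
{
    return (BLK_SIZE - (end & (BLK_SIZE - 1))) & (BLK_SIZE - 1);
}
```

Only the crashkernel body between the margins needs pte-level entries; nonzero margins are the ranges rebuilt with block/section (or contiguous) mappings.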
* [PATCH v5] arm64: mm: fix linear mem mapping access performance degradation 2022-07-08 12:28 ` guanghui.fgh @ 2022-07-10 13:44 ` Guanghui Feng -1 siblings, 0 replies; 79+ messages in thread From: Guanghui Feng @ 2022-07-10 13:44 UTC (permalink / raw) To: baolin.wang, catalin.marinas, will, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo Cc: alikernel-developer arm64 can build 2M/1G block/section mappings. When using the DMA/DMA32 zones (crashkernel enabled, rodata full disabled, kfence disabled), the mem_map will use non-block/section mappings (because crashkernel requires shrinking the region at page granularity). But this degrades performance when doing large contiguous mem accesses in the kernel (memcpy/memmove, etc). There are many changes and discussions: commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for platforms with no DMA memory zones") commit 0a30c53573b0 ("arm64: mm: Move reserve_crashkernel() into mem_init()") commit 2687275a5843 ("arm64: Force NO_BLOCK_MAPPINGS if crashkernel reservation is required") This patch changes mem_map to use block/section mappings with crashkernel. Firstly, do block/section mapping (normally 2M or 1G) for all avail mem at mem_map and reserve crashkernel memory. Then walk the pagetable to split block/section mappings into non-block/section mappings (normally 4K) [[[only]]] for the crashkernel mem. So the linear mem mapping uses block/section mappings as much as possible. This reduces cpu dTLB misses conspicuously and accelerates mem access, about a 10-20% performance improvement. I have tested it with pft (Page Fault Test) and fio and obtained great performance improvement. 
For fio test: 1.prepare ramdisk modprobe -r brd modprobe brd rd_nr=1 rd_size=67108864 dmsetup remove_all wipefs -a --force /dev/ram0 mkfs -t ext4 -E lazy_itable_init=0,lazy_journal_init=0 -q -F /dev/ram0 mkdir -p /fs/ram0 mount -t ext4 /dev/ram0 /fs/ram0 2.prepare fio parameters in x.fio file: [global] bs=4k ioengine=psync iodepth=128 size=32G direct=1 invalidate=1 group_reporting thread=1 rw=read directory=/fs/ram0 numjobs=1 [task_0] cpus_allowed=16 stonewall=1 3.run testcase: perf stat -e dTLB-load-misses fio x.fio 4.contrast ------------------------ without patch with patch fio READ aggrb=1493.2MB/s aggrb=1775.3MB/s dTLB-load-misses 1,818,320,693 438,729,774 time elapsed(s) 70.500326434 62.877316408 user(s) 15.926332000 15.684721000 sys(s) 54.211939000 47.046165000 5.conclusion Using this patch will reduce dTLB misses and improve performance greatly. Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com> --- arch/arm64/include/asm/mmu.h | 1 + arch/arm64/include/asm/mmu_context.h | 28 ++++ arch/arm64/mm/init.c | 8 +- arch/arm64/mm/mmu.c | 305 +++++++++++++++++++++++++---------- arch/arm64/mm/proc.S | 19 +++ 5 files changed, 267 insertions(+), 94 deletions(-) diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h index 48f8466..1a46b81 100644 --- a/arch/arm64/include/asm/mmu.h +++ b/arch/arm64/include/asm/mmu.h @@ -63,6 +63,7 @@ static inline bool arm64_kernel_unmapped_at_el0(void) extern void arm64_memblock_init(void); extern void paging_init(void); extern void bootmem_init(void); +extern void map_crashkernel(void); extern void __iomem *early_io_map(phys_addr_t phys, unsigned long virt); extern void init_mem_pgprot(void); extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys, diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h index 6770667..10dffd4 100644 --- a/arch/arm64/include/asm/mmu_context.h +++ b/arch/arm64/include/asm/mmu_context.h @@ -139,6 +139,34 @@ static inline void 
cpu_install_ttbr0(phys_addr_t ttbr0, unsigned long t0sz) isb(); } +static inline void __nocfi cpu_replace_ttbr1_with_phys(phys_addr_t ttbr1) +{ + typedef void (ttbr_replace_func)(phys_addr_t); + extern ttbr_replace_func idmap_cpu_replace_ttbr1_with_flush_tlb; + ttbr_replace_func *replace_phys; + + /* phys_to_ttbr() zeros lower 2 bits of ttbr with 52-bit PA */ + ttbr1 = phys_to_ttbr(ttbr1); + + if (system_supports_cnp()) { + /* + * cpu_replace_ttbr1() is used when there's a boot CPU + * up (i.e. cpufeature framework is not up yet) and + * latter only when we enable CNP via cpufeature's + * enable() callback. + * Also we rely on the cpu_hwcap bit being set before + * calling the enable() function. + */ + ttbr1 |= TTBR_CNP_BIT; + } + + replace_phys = (void *)__pa_symbol(function_nocfi(idmap_cpu_replace_ttbr1_with_flush_tlb)); + + cpu_install_idmap(); + replace_phys(ttbr1); + cpu_uninstall_idmap(); +} + /* * Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD, * avoiding the possibility of conflicting TLB entries being allocated. diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index 339ee84..241d27e 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -190,6 +190,7 @@ static void __init reserve_crashkernel(void) crashk_res.start = crash_base; crashk_res.end = crash_base + crash_size - 1; insert_resource(&iomem_resource, &crashk_res); + map_crashkernel(); } /* @@ -388,10 +389,6 @@ void __init arm64_memblock_init(void) } early_init_fdt_scan_reserved_mem(); - - if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) - reserve_crashkernel(); - high_memory = __va(memblock_end_of_DRAM() - 1) + 1; } @@ -438,8 +435,7 @@ void __init bootmem_init(void) * request_standard_resources() depends on crashkernel's memory being * reserved, so do it here. 
*/ - if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32)) - reserve_crashkernel(); + reserve_crashkernel(); memblock_dump_all(); } diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 626ec32..3e30c82 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -42,6 +42,7 @@ #define NO_BLOCK_MAPPINGS BIT(0) #define NO_CONT_MAPPINGS BIT(1) #define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */ +#define NO_SEC_REMAPPINGS BIT(3) /* rebuild with non block/sec mapping*/ u64 idmap_t0sz = TCR_T0SZ(VA_BITS_MIN); u64 idmap_ptrs_per_pgd = PTRS_PER_PGD; @@ -155,8 +156,22 @@ static bool pgattr_change_is_safe(u64 old, u64 new) return ((old ^ new) & ~mask) == 0; } +static void pte_clear_cont(pte_t *ptep) +{ + int i = 0; + pte_t pte = READ_ONCE(*ptep); + if (pte_none(pte) || !pte_cont(pte)) + return; + ptep -= ((u64)ptep / sizeof(pte_t)) & + ((1 << CONFIG_ARM64_CONT_PTE_SHIFT) - 1); + do { + pte = pte_mknoncont(READ_ONCE(*ptep)); + set_pte(ptep, pte); + } while (++ptep, ++i < CONT_PTES); +} + static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end, - phys_addr_t phys, pgprot_t prot) + phys_addr_t phys, pgprot_t prot, int flags) { pte_t *ptep; @@ -164,6 +179,9 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end, do { pte_t old_pte = READ_ONCE(*ptep); + if (flags & NO_SEC_REMAPPINGS) + pte_clear_cont(ptep); + set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot)); /* @@ -179,6 +197,22 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end, pte_clear_fixmap(); } +static inline void allock_pte_page(pmd_t *pmdp, + phys_addr_t (*pgtable_alloc)(int), + int flags) +{ + pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN; + phys_addr_t pte_phys; + if (!pmd_none(READ_ONCE(*pmdp))) + return; + + if (flags & NO_EXEC_MAPPINGS) + pmdval |= PMD_TABLE_PXN; + BUG_ON(!pgtable_alloc); + pte_phys = pgtable_alloc(PAGE_SHIFT); + __pmd_populate(pmdp, pte_phys, pmdval); +} + static void alloc_init_cont_pte(pmd_t 
*pmdp, unsigned long addr, unsigned long end, phys_addr_t phys, pgprot_t prot, @@ -189,17 +223,8 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, pmd_t pmd = READ_ONCE(*pmdp); BUG_ON(pmd_sect(pmd)); - if (pmd_none(pmd)) { - pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN; - phys_addr_t pte_phys; - - if (flags & NO_EXEC_MAPPINGS) - pmdval |= PMD_TABLE_PXN; - BUG_ON(!pgtable_alloc); - pte_phys = pgtable_alloc(PAGE_SHIFT); - __pmd_populate(pmdp, pte_phys, pmdval); - pmd = READ_ONCE(*pmdp); - } + allock_pte_page(pmdp, pgtable_alloc, flags); + pmd = READ_ONCE(*pmdp); BUG_ON(pmd_bad(pmd)); do { @@ -208,16 +233,71 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, next = pte_cont_addr_end(addr, end); /* use a contiguous mapping if the range is suitably aligned */ - if ((((addr | next | phys) & ~CONT_PTE_MASK) == 0) && - (flags & NO_CONT_MAPPINGS) == 0) + if (!(flags & NO_SEC_REMAPPINGS) && + (((addr | next | phys) & ~CONT_PTE_MASK) == 0) && + (flags & NO_CONT_MAPPINGS) == 0) __prot = __pgprot(pgprot_val(prot) | PTE_CONT); - init_pte(pmdp, addr, next, phys, __prot); + init_pte(pmdp, addr, next, phys, __prot, flags); phys += next - addr; } while (addr = next, addr != end); } +static void pmd_clear_cont(pmd_t *pmdp) +{ + int i = 0; + pmd_t pmd = READ_ONCE(*pmdp); + if (pmd_none(pmd) || !pmd_sect(pmd) || !pmd_cont(pmd)) + return; + pmdp -= ((u64)pmdp / sizeof(pmd_t)) & + ((1 << CONFIG_ARM64_CONT_PMD_SHIFT) - 1); + do { + pmd = READ_ONCE(*pmdp); + pmd = pte_pmd(pte_mknoncont(pmd_pte(pmd))); + set_pmd(pmdp, pmd); + } while (++pmdp, ++i < CONT_PMDS); +} + +static void init_pmd_remap(pud_t *pudp, unsigned long addr, unsigned long end, + phys_addr_t phys, pgprot_t prot, + phys_addr_t (*pgtable_alloc)(int), int flags) +{ + unsigned long next; + pmd_t *pmdp; + phys_addr_t map_offset; + + pmdp = pmd_set_fixmap_offset(pudp, addr); + do { + next = pmd_addr_end(addr, end); + + if (!pmd_none(*pmdp) && pmd_sect(*pmdp)) { + pmd_clear_cont(pmdp); + 
pmd_clear(pmdp); + allock_pte_page(pmdp, pgtable_alloc, flags); + + map_offset = addr - (addr & PMD_MASK); + if (map_offset) + alloc_init_cont_pte(pmdp, addr & PMD_MASK, addr, + phys - map_offset, prot, + pgtable_alloc, + flags & (~NO_SEC_REMAPPINGS)); + + if (next < (addr & PMD_MASK) + PMD_SIZE) + alloc_init_cont_pte(pmdp, next, + (addr & PUD_MASK) + PUD_SIZE, + next - addr + phys, + prot, pgtable_alloc, + flags & (~NO_SEC_REMAPPINGS)); + } + alloc_init_cont_pte(pmdp, addr, next, phys, prot, + pgtable_alloc, flags); + phys += next - addr; + } while (pmdp++, addr = next, addr != end); + + pmd_clear_fixmap(); +} + static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end, phys_addr_t phys, pgprot_t prot, phys_addr_t (*pgtable_alloc)(int), int flags) @@ -255,6 +335,22 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end, pmd_clear_fixmap(); } +static inline void alloc_pmd_page(pud_t *pudp, phys_addr_t (*pgtable_alloc)(int), int flags) +{ + + pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN; + phys_addr_t pmd_phys; + + if (!pud_none(READ_ONCE(*pudp))) + return; + + if (flags & NO_EXEC_MAPPINGS) + pudval |= PUD_TABLE_PXN; + BUG_ON(!pgtable_alloc); + pmd_phys = pgtable_alloc(PMD_SHIFT); + __pud_populate(pudp, pmd_phys, pudval); +} + static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, unsigned long end, phys_addr_t phys, pgprot_t prot, @@ -267,17 +363,8 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, * Check for initial section mappings in the pgd/pud. 
*/ BUG_ON(pud_sect(pud)); - if (pud_none(pud)) { - pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN; - phys_addr_t pmd_phys; - - if (flags & NO_EXEC_MAPPINGS) - pudval |= PUD_TABLE_PXN; - BUG_ON(!pgtable_alloc); - pmd_phys = pgtable_alloc(PMD_SHIFT); - __pud_populate(pudp, pmd_phys, pudval); - pud = READ_ONCE(*pudp); - } + alloc_pmd_page(pudp, pgtable_alloc, flags); + pud = READ_ONCE(*pudp); BUG_ON(pud_bad(pud)); do { @@ -286,16 +373,80 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, next = pmd_cont_addr_end(addr, end); /* use a contiguous mapping if the range is suitably aligned */ - if ((((addr | next | phys) & ~CONT_PMD_MASK) == 0) && + if (!(flags & NO_SEC_REMAPPINGS) && + (((addr | next | phys) & ~CONT_PMD_MASK) == 0) && (flags & NO_CONT_MAPPINGS) == 0) __prot = __pgprot(pgprot_val(prot) | PTE_CONT); - init_pmd(pudp, addr, next, phys, __prot, pgtable_alloc, flags); + if (flags & NO_SEC_REMAPPINGS) + init_pmd_remap(pudp, addr, next, phys, __prot, + pgtable_alloc, flags); + else + init_pmd(pudp, addr, next, phys, __prot, + pgtable_alloc, flags); phys += next - addr; } while (addr = next, addr != end); } +static void init_pud_remap(pud_t *pudp, unsigned long addr, unsigned long next, + phys_addr_t phys, pgprot_t prot, + phys_addr_t (*pgtable_alloc)(int), + int flags) +{ + phys_addr_t map_offset; + + if (!pud_none(*pudp) && pud_sect(*pudp)) { + pud_clear(pudp); + alloc_pmd_page(pudp, pgtable_alloc, flags); + + map_offset = addr - (addr & PUD_MASK); + if (map_offset) + alloc_init_cont_pmd(pudp, addr & PUD_MASK, + addr, phys - map_offset, + prot, pgtable_alloc, + flags & (~NO_SEC_REMAPPINGS)); + + if (next < (addr & PUD_MASK) + PUD_SIZE) + alloc_init_cont_pmd(pudp, next, + (addr & PUD_MASK) + PUD_SIZE, + next - addr + phys, + prot, pgtable_alloc, + flags & (~NO_SEC_REMAPPINGS)); + } + alloc_init_cont_pmd(pudp, addr, next, phys, prot, + pgtable_alloc, flags); +} + +static void init_pud(pud_t *pudp, unsigned long addr, unsigned long next, + 
phys_addr_t phys, pgprot_t prot, + phys_addr_t (*pgtable_alloc)(int), + int flags) +{ + pud_t old_pud = READ_ONCE(*pudp); + /* + * For 4K granule only, attempt to put down a 1GB block + */ + if (pud_sect_supported() && + ((addr | next | phys) & ~PUD_MASK) == 0 && + (flags & NO_BLOCK_MAPPINGS) == 0) { + pud_set_huge(pudp, phys, prot); + + /* + * After the PUD entry has been populated once, we + * only allow updates to the permission attributes. + */ + BUG_ON(!pgattr_change_is_safe(pud_val(old_pud), + READ_ONCE(pud_val(*pudp)))); + } else { + alloc_init_cont_pmd(pudp, addr, next, phys, prot, + pgtable_alloc, flags); + + BUG_ON(pud_val(old_pud) != 0 && + pud_val(old_pud) != READ_ONCE(pud_val(*pudp))); + } +} + static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end, phys_addr_t phys, pgprot_t prot, phys_addr_t (*pgtable_alloc)(int), @@ -325,37 +476,22 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end, */ if (system_state != SYSTEM_BOOTING) mutex_lock(&fixmap_lock); + pudp = pud_set_fixmap_offset(p4dp, addr); do { - pud_t old_pud = READ_ONCE(*pudp); - next = pud_addr_end(addr, end); - /* - * For 4K granule only, attempt to put down a 1GB block - */ - if (pud_sect_supported() && - ((addr | next | phys) & ~PUD_MASK) == 0 && - (flags & NO_BLOCK_MAPPINGS) == 0) { - pud_set_huge(pudp, phys, prot); - - /* - * After the PUD entry has been populated once, we - * only allow updates to the permission attributes. 
- */ - BUG_ON(!pgattr_change_is_safe(pud_val(old_pud), - READ_ONCE(pud_val(*pudp)))); - } else { - alloc_init_cont_pmd(pudp, addr, next, phys, prot, - pgtable_alloc, flags); - - BUG_ON(pud_val(old_pud) != 0 && - pud_val(old_pud) != READ_ONCE(pud_val(*pudp))); - } + if (flags & NO_SEC_REMAPPINGS) + init_pud_remap(pudp, addr, next, phys, prot, + pgtable_alloc, flags); + else + init_pud(pudp, addr, next, phys, prot, pgtable_alloc, + flags); phys += next - addr; } while (pudp++, addr = next, addr != end); pud_clear_fixmap(); + if (system_state != SYSTEM_BOOTING) mutex_unlock(&fixmap_lock); } @@ -483,20 +619,37 @@ void __init mark_linear_text_alias_ro(void) PAGE_KERNEL_RO); } -static bool crash_mem_map __initdata; - -static int __init enable_crash_mem_map(char *arg) +#ifdef CONFIG_KEXEC_CORE +void __init map_crashkernel(void) { - /* - * Proper parameter parsing is done by reserve_crashkernel(). We only - * need to know if the linear map has to avoid block mappings so that - * the crashkernel reservations can be unmapped later. 
- */ - crash_mem_map = true; + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) + return; - return 0; + if (!crashk_res.end) + return; + + cpu_replace_ttbr1_with_phys(__pa_symbol(init_pg_dir)); + init_mm.pgd = init_pg_dir; + + __create_pgd_mapping(swapper_pg_dir, crashk_res.start, + __phys_to_virt(crashk_res.start), + crashk_res.end + 1 - crashk_res.start, PAGE_KERNEL, + early_pgtable_alloc, + NO_EXEC_MAPPINGS | NO_SEC_REMAPPINGS); + + cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); + init_mm.pgd = swapper_pg_dir; + + memblock_phys_free(__pa_symbol(init_pg_dir), + __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); } -early_param("crashkernel", enable_crash_mem_map); +#else +void __init map_crashkernel(void) +{ + memblock_phys_free(__pa_symbol(init_pg_dir), + __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); +} +#endif static void __init map_mem(pgd_t *pgdp) { @@ -527,17 +680,6 @@ static void __init map_mem(pgd_t *pgdp) */ memblock_mark_nomap(kernel_start, kernel_end - kernel_start); -#ifdef CONFIG_KEXEC_CORE - if (crash_mem_map) { - if (IS_ENABLED(CONFIG_ZONE_DMA) || - IS_ENABLED(CONFIG_ZONE_DMA32)) - flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; - else if (crashk_res.end) - memblock_mark_nomap(crashk_res.start, - resource_size(&crashk_res)); - } -#endif - /* map all the memory banks */ for_each_mem_range(i, &start, &end) { if (start >= end) @@ -570,19 +712,6 @@ static void __init map_mem(pgd_t *pgdp) * in page granularity and put back unused memory to buddy system * through /sys/kernel/kexec_crash_size interface. 
*/ -#ifdef CONFIG_KEXEC_CORE - if (crash_mem_map && - !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) { - if (crashk_res.end) { - __map_memblock(pgdp, crashk_res.start, - crashk_res.end + 1, - PAGE_KERNEL, - NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS); - memblock_clear_nomap(crashk_res.start, - resource_size(&crashk_res)); - } - } -#endif } void mark_rodata_ro(void) @@ -774,8 +903,8 @@ void __init paging_init(void) cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); init_mm.pgd = swapper_pg_dir; - memblock_phys_free(__pa_symbol(init_pg_dir), - __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); + //memblock_phys_free(__pa_symbol(init_pg_dir), + // __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); memblock_allow_resize(); } diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S index 50bbed9..161bae6 100644 --- a/arch/arm64/mm/proc.S +++ b/arch/arm64/mm/proc.S @@ -178,6 +178,25 @@ SYM_FUNC_END(cpu_do_resume) isb .endm +SYM_FUNC_START(idmap_cpu_replace_ttbr1_with_flush_tlb) + save_and_disable_daif flags=x2 + + __idmap_cpu_set_reserved_ttbr1 x1, x3 + + offset_ttbr1 x0, x3 + msr ttbr1_el1, x0 + isb + + restore_daif x2 + + dsb nshst + tlbi vmalle1 + dsb nsh + isb + + ret +SYM_FUNC_END(idmap_cpu_replace_ttbr1_with_flush_tlb) + /* * void idmap_cpu_replace_ttbr1(phys_addr_t ttbr1) * -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 79+ messages in thread
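One detail worth unpacking from the patch's pte_clear_cont()/pmd_clear_cont(): before clearing PTE_CONT, the entry pointer is rewound to the first entry of its contiguous group via `ptep -= ((u64)ptep / sizeof(pte_t)) & ((1 << CONFIG_ARM64_CONT_PTE_SHIFT) - 1)`, because the contiguous hint must be dropped from the whole group at once. The same index arithmetic as a standalone sketch, assuming a 4K granule where a contiguous group is 16 ptes:

```c
#include <assert.h>

#define CONT_PTE_SHIFT 4            /* 4K granule: 2^4 = 16 ptes/group */
#define CONT_PTES (1 << CONT_PTE_SHIFT)

/* Rewind a pte index to the first entry of its contiguous group; the
 * userspace equivalent of the pointer arithmetic in pte_clear_cont(). */
static int cont_group_start(int pte_index)
{
    return pte_index - (pte_index & (CONT_PTES - 1));
}
```

From the group start, the loop then walks CONT_PTES entries, rewriting each one without the contiguous bit.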
*pmdp, unsigned long addr, unsigned long end, phys_addr_t phys, pgprot_t prot, @@ -189,17 +223,8 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, pmd_t pmd = READ_ONCE(*pmdp); BUG_ON(pmd_sect(pmd)); - if (pmd_none(pmd)) { - pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN; - phys_addr_t pte_phys; - - if (flags & NO_EXEC_MAPPINGS) - pmdval |= PMD_TABLE_PXN; - BUG_ON(!pgtable_alloc); - pte_phys = pgtable_alloc(PAGE_SHIFT); - __pmd_populate(pmdp, pte_phys, pmdval); - pmd = READ_ONCE(*pmdp); - } + allock_pte_page(pmdp, pgtable_alloc, flags); + pmd = READ_ONCE(*pmdp); BUG_ON(pmd_bad(pmd)); do { @@ -208,16 +233,71 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, next = pte_cont_addr_end(addr, end); /* use a contiguous mapping if the range is suitably aligned */ - if ((((addr | next | phys) & ~CONT_PTE_MASK) == 0) && - (flags & NO_CONT_MAPPINGS) == 0) + if (!(flags & NO_SEC_REMAPPINGS) && + (((addr | next | phys) & ~CONT_PTE_MASK) == 0) && + (flags & NO_CONT_MAPPINGS) == 0) __prot = __pgprot(pgprot_val(prot) | PTE_CONT); - init_pte(pmdp, addr, next, phys, __prot); + init_pte(pmdp, addr, next, phys, __prot, flags); phys += next - addr; } while (addr = next, addr != end); } +static void pmd_clear_cont(pmd_t *pmdp) +{ + int i = 0; + pmd_t pmd = READ_ONCE(*pmdp); + if (pmd_none(pmd) || !pmd_sect(pmd) || !pmd_cont(pmd)) + return; + pmdp -= ((u64)pmdp / sizeof(pmd_t)) & + ((1 << CONFIG_ARM64_CONT_PMD_SHIFT) - 1); + do { + pmd = READ_ONCE(*pmdp); + pmd = pte_pmd(pte_mknoncont(pmd_pte(pmd))); + set_pmd(pmdp, pmd); + } while (++pmdp, ++i < CONT_PMDS); +} + +static void init_pmd_remap(pud_t *pudp, unsigned long addr, unsigned long end, + phys_addr_t phys, pgprot_t prot, + phys_addr_t (*pgtable_alloc)(int), int flags) +{ + unsigned long next; + pmd_t *pmdp; + phys_addr_t map_offset; + + pmdp = pmd_set_fixmap_offset(pudp, addr); + do { + next = pmd_addr_end(addr, end); + + if (!pmd_none(*pmdp) && pmd_sect(*pmdp)) { + pmd_clear_cont(pmdp); + 
pmd_clear(pmdp); + allock_pte_page(pmdp, pgtable_alloc, flags); + + map_offset = addr - (addr & PMD_MASK); + if (map_offset) + alloc_init_cont_pte(pmdp, addr & PMD_MASK, addr, + phys - map_offset, prot, + pgtable_alloc, + flags & (~NO_SEC_REMAPPINGS)); + + if (next < (addr & PMD_MASK) + PMD_SIZE) + alloc_init_cont_pte(pmdp, next, + (addr & PUD_MASK) + PUD_SIZE, + next - addr + phys, + prot, pgtable_alloc, + flags & (~NO_SEC_REMAPPINGS)); + } + alloc_init_cont_pte(pmdp, addr, next, phys, prot, + pgtable_alloc, flags); + phys += next - addr; + } while (pmdp++, addr = next, addr != end); + + pmd_clear_fixmap(); +} + static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end, phys_addr_t phys, pgprot_t prot, phys_addr_t (*pgtable_alloc)(int), int flags) @@ -255,6 +335,22 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end, pmd_clear_fixmap(); } +static inline void alloc_pmd_page(pud_t *pudp, phys_addr_t (*pgtable_alloc)(int), int flags) +{ + + pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN; + phys_addr_t pmd_phys; + + if (!pud_none(READ_ONCE(*pudp))) + return; + + if (flags & NO_EXEC_MAPPINGS) + pudval |= PUD_TABLE_PXN; + BUG_ON(!pgtable_alloc); + pmd_phys = pgtable_alloc(PMD_SHIFT); + __pud_populate(pudp, pmd_phys, pudval); +} + static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, unsigned long end, phys_addr_t phys, pgprot_t prot, @@ -267,17 +363,8 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, * Check for initial section mappings in the pgd/pud. 
*/ BUG_ON(pud_sect(pud)); - if (pud_none(pud)) { - pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN; - phys_addr_t pmd_phys; - - if (flags & NO_EXEC_MAPPINGS) - pudval |= PUD_TABLE_PXN; - BUG_ON(!pgtable_alloc); - pmd_phys = pgtable_alloc(PMD_SHIFT); - __pud_populate(pudp, pmd_phys, pudval); - pud = READ_ONCE(*pudp); - } + alloc_pmd_page(pudp, pgtable_alloc, flags); + pud = READ_ONCE(*pudp); BUG_ON(pud_bad(pud)); do { @@ -286,16 +373,80 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, next = pmd_cont_addr_end(addr, end); /* use a contiguous mapping if the range is suitably aligned */ - if ((((addr | next | phys) & ~CONT_PMD_MASK) == 0) && + if (!(flags & NO_SEC_REMAPPINGS) && + (((addr | next | phys) & ~CONT_PMD_MASK) == 0) && (flags & NO_CONT_MAPPINGS) == 0) __prot = __pgprot(pgprot_val(prot) | PTE_CONT); - init_pmd(pudp, addr, next, phys, __prot, pgtable_alloc, flags); + if (flags & NO_SEC_REMAPPINGS) + init_pmd_remap(pudp, addr, next, phys, __prot, + pgtable_alloc, flags); + else + init_pmd(pudp, addr, next, phys, __prot, + pgtable_alloc, flags); phys += next - addr; } while (addr = next, addr != end); } +static void init_pud_remap(pud_t *pudp, unsigned long addr, unsigned long next, + phys_addr_t phys, pgprot_t prot, + phys_addr_t (*pgtable_alloc)(int), + int flags) +{ + phys_addr_t map_offset; + + if (!pud_none(*pudp) && pud_sect(*pudp)) { + pud_clear(pudp); + alloc_pmd_page(pudp, pgtable_alloc, flags); + + map_offset = addr - (addr & PUD_MASK); + if (map_offset) + alloc_init_cont_pmd(pudp, addr & PUD_MASK, + addr, phys - map_offset, + prot, pgtable_alloc, + flags & (~NO_SEC_REMAPPINGS)); + + if (next < (addr & PUD_MASK) + PUD_SIZE) + alloc_init_cont_pmd(pudp, next, + (addr & PUD_MASK) + PUD_SIZE, + next - addr + phys, + prot, pgtable_alloc, + flags & (~NO_SEC_REMAPPINGS)); + } + alloc_init_cont_pmd(pudp, addr, next, phys, prot, + pgtable_alloc, flags); +} + +static void init_pud(pud_t *pudp, unsigned long addr, unsigned long next, + 
phys_addr_t phys, pgprot_t prot, + phys_addr_t (*pgtable_alloc)(int), + int flags) +{ + pud_t old_pud = READ_ONCE(*pudp); + /* + * For 4K granule only, attempt to put down a 1GB block + */ + if (pud_sect_supported() && + ((addr | next | phys) & ~PUD_MASK) == 0 && + (flags & NO_BLOCK_MAPPINGS) == 0) { + pud_set_huge(pudp, phys, prot); + + /* + * After the PUD entry has been populated once, we + * only allow updates to the permission attributes. + */ + BUG_ON(!pgattr_change_is_safe(pud_val(old_pud), + READ_ONCE(pud_val(*pudp)))); + } else { + alloc_init_cont_pmd(pudp, addr, next, phys, prot, + pgtable_alloc, flags); + + BUG_ON(pud_val(old_pud) != 0 && + pud_val(old_pud) != READ_ONCE(pud_val(*pudp))); + } +} + static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end, phys_addr_t phys, pgprot_t prot, phys_addr_t (*pgtable_alloc)(int), @@ -325,37 +476,22 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end, */ if (system_state != SYSTEM_BOOTING) mutex_lock(&fixmap_lock); + pudp = pud_set_fixmap_offset(p4dp, addr); do { - pud_t old_pud = READ_ONCE(*pudp); - next = pud_addr_end(addr, end); - /* - * For 4K granule only, attempt to put down a 1GB block - */ - if (pud_sect_supported() && - ((addr | next | phys) & ~PUD_MASK) == 0 && - (flags & NO_BLOCK_MAPPINGS) == 0) { - pud_set_huge(pudp, phys, prot); - - /* - * After the PUD entry has been populated once, we - * only allow updates to the permission attributes. 
- */ - BUG_ON(!pgattr_change_is_safe(pud_val(old_pud), - READ_ONCE(pud_val(*pudp)))); - } else { - alloc_init_cont_pmd(pudp, addr, next, phys, prot, - pgtable_alloc, flags); - - BUG_ON(pud_val(old_pud) != 0 && - pud_val(old_pud) != READ_ONCE(pud_val(*pudp))); - } + if (flags & NO_SEC_REMAPPINGS) + init_pud_remap(pudp, addr, next, phys, prot, + pgtable_alloc, flags); + else + init_pud(pudp, addr, next, phys, prot, pgtable_alloc, + flags); phys += next - addr; } while (pudp++, addr = next, addr != end); pud_clear_fixmap(); + if (system_state != SYSTEM_BOOTING) mutex_unlock(&fixmap_lock); } @@ -483,20 +619,37 @@ void __init mark_linear_text_alias_ro(void) PAGE_KERNEL_RO); } -static bool crash_mem_map __initdata; - -static int __init enable_crash_mem_map(char *arg) +#ifdef CONFIG_KEXEC_CORE +void __init map_crashkernel(void) { - /* - * Proper parameter parsing is done by reserve_crashkernel(). We only - * need to know if the linear map has to avoid block mappings so that - * the crashkernel reservations can be unmapped later. 
- */ - crash_mem_map = true; + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) + return; - return 0; + if (!crashk_res.end) + return; + + cpu_replace_ttbr1_with_phys(__pa_symbol(init_pg_dir)); + init_mm.pgd = init_pg_dir; + + __create_pgd_mapping(swapper_pg_dir, crashk_res.start, + __phys_to_virt(crashk_res.start), + crashk_res.end + 1 - crashk_res.start, PAGE_KERNEL, + early_pgtable_alloc, + NO_EXEC_MAPPINGS | NO_SEC_REMAPPINGS); + + cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); + init_mm.pgd = swapper_pg_dir; + + memblock_phys_free(__pa_symbol(init_pg_dir), + __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); } -early_param("crashkernel", enable_crash_mem_map); +#else +void __init map_crashkernel(void) +{ + memblock_phys_free(__pa_symbol(init_pg_dir), + __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); +} +#endif static void __init map_mem(pgd_t *pgdp) { @@ -527,17 +680,6 @@ static void __init map_mem(pgd_t *pgdp) */ memblock_mark_nomap(kernel_start, kernel_end - kernel_start); -#ifdef CONFIG_KEXEC_CORE - if (crash_mem_map) { - if (IS_ENABLED(CONFIG_ZONE_DMA) || - IS_ENABLED(CONFIG_ZONE_DMA32)) - flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; - else if (crashk_res.end) - memblock_mark_nomap(crashk_res.start, - resource_size(&crashk_res)); - } -#endif - /* map all the memory banks */ for_each_mem_range(i, &start, &end) { if (start >= end) @@ -570,19 +712,6 @@ static void __init map_mem(pgd_t *pgdp) * in page granularity and put back unused memory to buddy system * through /sys/kernel/kexec_crash_size interface. 
*/ -#ifdef CONFIG_KEXEC_CORE - if (crash_mem_map && - !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) { - if (crashk_res.end) { - __map_memblock(pgdp, crashk_res.start, - crashk_res.end + 1, - PAGE_KERNEL, - NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS); - memblock_clear_nomap(crashk_res.start, - resource_size(&crashk_res)); - } - } -#endif } void mark_rodata_ro(void) @@ -774,8 +903,8 @@ void __init paging_init(void) cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); init_mm.pgd = swapper_pg_dir; - memblock_phys_free(__pa_symbol(init_pg_dir), - __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); + //memblock_phys_free(__pa_symbol(init_pg_dir), + // __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); memblock_allow_resize(); } diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S index 50bbed9..161bae6 100644 --- a/arch/arm64/mm/proc.S +++ b/arch/arm64/mm/proc.S @@ -178,6 +178,25 @@ SYM_FUNC_END(cpu_do_resume) isb .endm +SYM_FUNC_START(idmap_cpu_replace_ttbr1_with_flush_tlb) + save_and_disable_daif flags=x2 + + __idmap_cpu_set_reserved_ttbr1 x1, x3 + + offset_ttbr1 x0, x3 + msr ttbr1_el1, x0 + isb + + restore_daif x2 + + dsb nshst + tlbi vmalle1 + dsb nsh + isb + + ret +SYM_FUNC_END(idmap_cpu_replace_ttbr1_with_flush_tlb) + /* * void idmap_cpu_replace_ttbr1(phys_addr_t ttbr1) * -- 1.8.3.1 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 79+ messages in thread
* Re: [PATCH v5] arm64: mm: fix linear mem mapping access performance degradation 2022-07-10 13:44 ` Guanghui Feng @ 2022-07-10 14:32 ` guanghui.fgh 0 siblings, 0 replies; 79+ messages in thread From: guanghui.fgh @ 2022-07-10 14:32 UTC (permalink / raw) To: baolin.wang, catalin.marinas, will, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo Cc: alikernel-developer Hi, Mike Rapoport, Catalin Marinas, I have rewritten the patch to handle the TLB conflict and the circular dependency. Namely, (1) When rebuilding the crashkernel mem mapping, the pgd in use is swapper_pg_dir. (2) First, I change the swapper_pg_dir pgd to idmap_pg_dir. As idmap_pg_dir uses ttbr0_el1, there is no TLB conflict in ttbr1_el1. Afterward I flush all the TLB (all swapper_pg_dir TLB entries are flushed). Later, I change the idmap_pg_dir pgd to init_pg_dir and flush the TLB again. Therefore, there is no TLB conflict when rebuilding with init_pg_dir (no TLB entries remain for swapper_pg_dir). (3) When switched to init_pg_dir, I can safely modify the swapper_pg_dir pagetable without any [[[circular dependency]]] or TLB conflict. Namely, when unmapping or modifying any swapper_pg_dir pagetable, I will use the fixmap to access the mem. I have tested it and it works well. Could you give me some advice? Thanks. On 2022/7/10 21:44, Guanghui Feng wrote: > The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone > (enable crashkernel, disable rodata full, disable kfence), the mem_map will > use non block/section mapping(for crashkernel requires to shrink the region > in page granularity). But it will degrade performance when doing larging > continuous mem access in kernel(memcpy/memmove, etc). 
> > There are many changes and discussions: > commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for > platforms with no DMA memory zones") > commit 0a30c53573b0 ("arm64: mm: Move reserve_crashkernel() into > mem_init()") > commit 2687275a5843 ("arm64: Force NO_BLOCK_MAPPINGS if crashkernel > reservation is required") > > This patch changes mem_map to use block/section mapping with crashkernel. > Firstly, do block/section mapping(normally 2M or 1G) for all avail mem at > mem_map, reserve crashkernel memory. And then walking pagetable to split > block/section mapping to non block/section mapping(normally 4K) [[[only]]] > for crashkernel mem. So the linear mem mapping use block/section mapping > as more as possible. We will reduce the cpu dTLB miss conspicuously, and > accelerate mem access about 10-20% performance improvement. > > I have tested it with pft(Page Fault Test) and fio, obtained great > performace improvement. > > For fio test: > 1.prepare ramdisk > modprobe -r brd > modprobe brd rd_nr=1 rd_size=67108864 > dmsetup remove_all > wipefs -a --force /dev/ram0 > mkfs -t ext4 -E lazy_itable_init=0,lazy_journal_init=0 -q -F /dev/ram0 > mkdir -p /fs/ram0 > mount -t ext4 /dev/ram0 /fs/ram0 > > 2.prepare fio paremeter in x.fio file: > [global] > bs=4k > ioengine=psync > iodepth=128 > size=32G > direct=1 > invalidate=1 > group_reporting > thread=1 > rw=read > directory=/fs/ram0 > numjobs=1 > > [task_0] > cpus_allowed=16 > stonewall=1 > > 3.run testcase: > perf stat -e dTLB-load-misses fio x.fio > > 4.contrast > ------------------------ > without patch with patch > fio READ aggrb=1493.2MB/s aggrb=1775.3MB/s > dTLB-load-misses 1,818,320,693 438,729,774 > time elapsed(s) 70.500326434 62.877316408 > user(s) 15.926332000 15.684721000 > sys(s) 54.211939000 47.046165000 > > 5.conclusion > Using this patch will reduce dTLB misses and improve performace greatly. 
> > Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com> > --- > arch/arm64/include/asm/mmu.h | 1 + > arch/arm64/include/asm/mmu_context.h | 28 ++++ > arch/arm64/mm/init.c | 8 +- > arch/arm64/mm/mmu.c | 305 +++++++++++++++++++++++++---------- > arch/arm64/mm/proc.S | 19 +++ > 5 files changed, 267 insertions(+), 94 deletions(-) > > diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h > index 48f8466..1a46b81 100644 > --- a/arch/arm64/include/asm/mmu.h > +++ b/arch/arm64/include/asm/mmu.h > @@ -63,6 +63,7 @@ static inline bool arm64_kernel_unmapped_at_el0(void) > extern void arm64_memblock_init(void); > extern void paging_init(void); > extern void bootmem_init(void); > +extern void map_crashkernel(void); > extern void __iomem *early_io_map(phys_addr_t phys, unsigned long virt); > extern void init_mem_pgprot(void); > extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys, > diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h > index 6770667..10dffd4 100644 > --- a/arch/arm64/include/asm/mmu_context.h > +++ b/arch/arm64/include/asm/mmu_context.h > @@ -139,6 +139,34 @@ static inline void cpu_install_ttbr0(phys_addr_t ttbr0, unsigned long t0sz) > isb(); > } > > +static inline void __nocfi cpu_replace_ttbr1_with_phys(phys_addr_t ttbr1) > +{ > + typedef void (ttbr_replace_func)(phys_addr_t); > + extern ttbr_replace_func idmap_cpu_replace_ttbr1_with_flush_tlb; > + ttbr_replace_func *replace_phys; > + > + /* phys_to_ttbr() zeros lower 2 bits of ttbr with 52-bit PA */ > + ttbr1 = phys_to_ttbr(ttbr1); > + > + if (system_supports_cnp()) { > + /* > + * cpu_replace_ttbr1() is used when there's a boot CPU > + * up (i.e. cpufeature framework is not up yet) and > + * latter only when we enable CNP via cpufeature's > + * enable() callback. > + * Also we rely on the cpu_hwcap bit being set before > + * calling the enable() function. 
> + */ > + ttbr1 |= TTBR_CNP_BIT; > + } > + > + replace_phys = (void *)__pa_symbol(function_nocfi(idmap_cpu_replace_ttbr1_with_flush_tlb)); > + > + cpu_install_idmap(); > + replace_phys(ttbr1); > + cpu_uninstall_idmap(); cpu_install_idmap will switch to idmap_pg_dir, which uses [[[ttbr0_el1]]]. swapper_pg_dir uses [[[ttbr1_el1]]], so there is no TLB conflict with idmap_pg_dir (a different address space range). When switching to idmap_pg_dir, there will also be a full TLB flush. Afterward, idmap_cpu_replace_ttbr1_with_flush_tlb runs under the idmap_pg_dir pgd and changes the pgd to the one given by the ttbr1 parameter. So there is no TLB conflict when switching pgds. Namely, I can change the ttbr1 pgd safely without any TLB conflict. > +} > + > /* > * Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD, > * avoiding the possibility of conflicting TLB entries being allocated. > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c > index 339ee84..241d27e 100644 > --- a/arch/arm64/mm/init.c > +++ b/arch/arm64/mm/init.c > @@ -190,6 +190,7 @@ static void __init reserve_crashkernel(void) > crashk_res.start = crash_base; > crashk_res.end = crash_base + crash_size - 1; > insert_resource(&iomem_resource, &crashk_res); > + map_crashkernel(); > } > > /* > @@ -388,10 +389,6 @@ void __init arm64_memblock_init(void) > } > > early_init_fdt_scan_reserved_mem(); > - > - if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) > - reserve_crashkernel(); > - > high_memory = __va(memblock_end_of_DRAM() - 1) + 1; > } > > @@ -438,8 +435,7 @@ void __init bootmem_init(void) > * request_standard_resources() depends on crashkernel's memory being > * reserved, so do it here. 
> */ > - if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32)) > - reserve_crashkernel(); > + reserve_crashkernel(); > > memblock_dump_all(); > } > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c > index 626ec32..3e30c82 100644 > --- a/arch/arm64/mm/mmu.c > +++ b/arch/arm64/mm/mmu.c > @@ -42,6 +42,7 @@ > #define NO_BLOCK_MAPPINGS BIT(0) > #define NO_CONT_MAPPINGS BIT(1) > #define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */ > +#define NO_SEC_REMAPPINGS BIT(3) /* rebuild with non block/sec mapping*/ > > u64 idmap_t0sz = TCR_T0SZ(VA_BITS_MIN); > u64 idmap_ptrs_per_pgd = PTRS_PER_PGD; > @@ -155,8 +156,22 @@ static bool pgattr_change_is_safe(u64 old, u64 new) > return ((old ^ new) & ~mask) == 0; > } > > +static void pte_clear_cont(pte_t *ptep) > +{ > + int i = 0; > + pte_t pte = READ_ONCE(*ptep); > + if (pte_none(pte) || !pte_cont(pte)) > + return; > + ptep -= ((u64)ptep / sizeof(pte_t)) & > + ((1 << CONFIG_ARM64_CONT_PTE_SHIFT) - 1); > + do { > + pte = pte_mknoncont(READ_ONCE(*ptep)); > + set_pte(ptep, pte); > + } while (++ptep, ++i < CONT_PTES); > +} > + > static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end, > - phys_addr_t phys, pgprot_t prot) > + phys_addr_t phys, pgprot_t prot, int flags) > { > pte_t *ptep; > > @@ -164,6 +179,9 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end, > do { > pte_t old_pte = READ_ONCE(*ptep); > > + if (flags & NO_SEC_REMAPPINGS) > + pte_clear_cont(ptep); > + > set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot)); > > /* > @@ -179,6 +197,22 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end, > pte_clear_fixmap(); > } > > +static inline void allock_pte_page(pmd_t *pmdp, > + phys_addr_t (*pgtable_alloc)(int), > + int flags) > +{ > + pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN; > + phys_addr_t pte_phys; > + if (!pmd_none(READ_ONCE(*pmdp))) > + return; > + > + if (flags & NO_EXEC_MAPPINGS) > + pmdval |= PMD_TABLE_PXN; > + 
BUG_ON(!pgtable_alloc); > + pte_phys = pgtable_alloc(PAGE_SHIFT); > + __pmd_populate(pmdp, pte_phys, pmdval); > +} > + > static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, > unsigned long end, phys_addr_t phys, > pgprot_t prot, > @@ -189,17 +223,8 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, > pmd_t pmd = READ_ONCE(*pmdp); > > BUG_ON(pmd_sect(pmd)); > - if (pmd_none(pmd)) { > - pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN; > - phys_addr_t pte_phys; > - > - if (flags & NO_EXEC_MAPPINGS) > - pmdval |= PMD_TABLE_PXN; > - BUG_ON(!pgtable_alloc); > - pte_phys = pgtable_alloc(PAGE_SHIFT); > - __pmd_populate(pmdp, pte_phys, pmdval); > - pmd = READ_ONCE(*pmdp); > - } > + allock_pte_page(pmdp, pgtable_alloc, flags); > + pmd = READ_ONCE(*pmdp); > BUG_ON(pmd_bad(pmd)); > > do { > @@ -208,16 +233,71 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, > next = pte_cont_addr_end(addr, end); > > /* use a contiguous mapping if the range is suitably aligned */ > - if ((((addr | next | phys) & ~CONT_PTE_MASK) == 0) && > - (flags & NO_CONT_MAPPINGS) == 0) > + if (!(flags & NO_SEC_REMAPPINGS) && > + (((addr | next | phys) & ~CONT_PTE_MASK) == 0) && > + (flags & NO_CONT_MAPPINGS) == 0) > __prot = __pgprot(pgprot_val(prot) | PTE_CONT); > > - init_pte(pmdp, addr, next, phys, __prot); > + init_pte(pmdp, addr, next, phys, __prot, flags); > > phys += next - addr; > } while (addr = next, addr != end); > } > > +static void pmd_clear_cont(pmd_t *pmdp) > +{ > + int i = 0; > + pmd_t pmd = READ_ONCE(*pmdp); > + if (pmd_none(pmd) || !pmd_sect(pmd) || !pmd_cont(pmd)) > + return; > + pmdp -= ((u64)pmdp / sizeof(pmd_t)) & > + ((1 << CONFIG_ARM64_CONT_PMD_SHIFT) - 1); > + do { > + pmd = READ_ONCE(*pmdp); > + pmd = pte_pmd(pte_mknoncont(pmd_pte(pmd))); > + set_pmd(pmdp, pmd); > + } while (++pmdp, ++i < CONT_PMDS); > +} > + > +static void init_pmd_remap(pud_t *pudp, unsigned long addr, unsigned long end, > + phys_addr_t phys, pgprot_t prot, > + 
phys_addr_t (*pgtable_alloc)(int), int flags) > +{ > + unsigned long next; > + pmd_t *pmdp; > + phys_addr_t map_offset; > + > + pmdp = pmd_set_fixmap_offset(pudp, addr); > + do { > + next = pmd_addr_end(addr, end); > + > + if (!pmd_none(*pmdp) && pmd_sect(*pmdp)) { > + pmd_clear_cont(pmdp); > + pmd_clear(pmdp); > + allock_pte_page(pmdp, pgtable_alloc, flags); > + > + map_offset = addr - (addr & PMD_MASK); > + if (map_offset) > + alloc_init_cont_pte(pmdp, addr & PMD_MASK, addr, > + phys - map_offset, prot, > + pgtable_alloc, > + flags & (~NO_SEC_REMAPPINGS)); > + > + if (next < (addr & PMD_MASK) + PMD_SIZE) > + alloc_init_cont_pte(pmdp, next, > + (addr & PUD_MASK) + PUD_SIZE, > + next - addr + phys, > + prot, pgtable_alloc, > + flags & (~NO_SEC_REMAPPINGS)); > + } > + alloc_init_cont_pte(pmdp, addr, next, phys, prot, > + pgtable_alloc, flags); > + phys += next - addr; > + } while (pmdp++, addr = next, addr != end); > + > + pmd_clear_fixmap(); > +} > + > static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end, > phys_addr_t phys, pgprot_t prot, > phys_addr_t (*pgtable_alloc)(int), int flags) > @@ -255,6 +335,22 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end, > pmd_clear_fixmap(); > } > > +static inline void alloc_pmd_page(pud_t *pudp, phys_addr_t (*pgtable_alloc)(int), int flags) > +{ > + > + pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN; > + phys_addr_t pmd_phys; > + > + if (!pud_none(READ_ONCE(*pudp))) > + return; > + > + if (flags & NO_EXEC_MAPPINGS) > + pudval |= PUD_TABLE_PXN; > + BUG_ON(!pgtable_alloc); > + pmd_phys = pgtable_alloc(PMD_SHIFT); > + __pud_populate(pudp, pmd_phys, pudval); > +} > + > static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, > unsigned long end, phys_addr_t phys, > pgprot_t prot, > @@ -267,17 +363,8 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, > * Check for initial section mappings in the pgd/pud. 
> */ > BUG_ON(pud_sect(pud)); > - if (pud_none(pud)) { > - pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN; > - phys_addr_t pmd_phys; > - > - if (flags & NO_EXEC_MAPPINGS) > - pudval |= PUD_TABLE_PXN; > - BUG_ON(!pgtable_alloc); > - pmd_phys = pgtable_alloc(PMD_SHIFT); > - __pud_populate(pudp, pmd_phys, pudval); > - pud = READ_ONCE(*pudp); > - } > + alloc_pmd_page(pudp, pgtable_alloc, flags); > + pud = READ_ONCE(*pudp); > BUG_ON(pud_bad(pud)); > > do { > @@ -286,16 +373,80 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, > next = pmd_cont_addr_end(addr, end); > > /* use a contiguous mapping if the range is suitably aligned */ > - if ((((addr | next | phys) & ~CONT_PMD_MASK) == 0) && > + if (!(flags & NO_SEC_REMAPPINGS) && > + (((addr | next | phys) & ~CONT_PMD_MASK) == 0) && > (flags & NO_CONT_MAPPINGS) == 0) > __prot = __pgprot(pgprot_val(prot) | PTE_CONT); > > - init_pmd(pudp, addr, next, phys, __prot, pgtable_alloc, flags); > + if (flags & NO_SEC_REMAPPINGS) > + init_pmd_remap(pudp, addr, next, phys, __prot, > + pgtable_alloc, flags); > + else > + init_pmd(pudp, addr, next, phys, __prot, > + pgtable_alloc, flags); > > phys += next - addr; > } while (addr = next, addr != end); > } > > +static void init_pud_remap(pud_t *pudp, unsigned long addr, unsigned long next, > + phys_addr_t phys, pgprot_t prot, > + phys_addr_t (*pgtable_alloc)(int), > + int flags) > +{ > + phys_addr_t map_offset; > + > + if (!pud_none(*pudp) && pud_sect(*pudp)) { > + pud_clear(pudp); > + alloc_pmd_page(pudp, pgtable_alloc, flags); > + > + map_offset = addr - (addr & PUD_MASK); > + if (map_offset) > + alloc_init_cont_pmd(pudp, addr & PUD_MASK, > + addr, phys - map_offset, > + prot, pgtable_alloc, > + flags & (~NO_SEC_REMAPPINGS)); > + > + if (next < (addr & PUD_MASK) + PUD_SIZE) > + alloc_init_cont_pmd(pudp, next, > + (addr & PUD_MASK) + PUD_SIZE, > + next - addr + phys, > + prot, pgtable_alloc, > + flags & (~NO_SEC_REMAPPINGS)); > + } > + alloc_init_cont_pmd(pudp, addr, 
next, phys, prot, > + pgtable_alloc, flags); > +} > + > +static void init_pud(pud_t *pudp, unsigned long addr, unsigned long next, > + phys_addr_t phys, pgprot_t prot, > + phys_addr_t (*pgtable_alloc)(int), > + int flags) > +{ > + pud_t old_pud = READ_ONCE(*pudp); > + /* > + * For 4K granule only, attempt to put down a 1GB block > + */ > + if (pud_sect_supported() && > + ((addr | next | phys) & ~PUD_MASK) == 0 && > + (flags & NO_BLOCK_MAPPINGS) == 0) { > + pud_set_huge(pudp, phys, prot); > + > + /* > + * After the PUD entry has been populated once, we > + * only allow updates to the permission attributes. > + */ > + BUG_ON(!pgattr_change_is_safe(pud_val(old_pud), > + READ_ONCE(pud_val(*pudp)))); > + } else { > + alloc_init_cont_pmd(pudp, addr, next, phys, prot, > + pgtable_alloc, flags); > + > + BUG_ON(pud_val(old_pud) != 0 && > + pud_val(old_pud) != READ_ONCE(pud_val(*pudp))); > + } > +} > + > static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end, > phys_addr_t phys, pgprot_t prot, > phys_addr_t (*pgtable_alloc)(int), > @@ -325,37 +476,22 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end, > */ > if (system_state != SYSTEM_BOOTING) > mutex_lock(&fixmap_lock); > + > pudp = pud_set_fixmap_offset(p4dp, addr); > do { > - pud_t old_pud = READ_ONCE(*pudp); > - > next = pud_addr_end(addr, end); > > - /* > - * For 4K granule only, attempt to put down a 1GB block > - */ > - if (pud_sect_supported() && > - ((addr | next | phys) & ~PUD_MASK) == 0 && > - (flags & NO_BLOCK_MAPPINGS) == 0) { > - pud_set_huge(pudp, phys, prot); > - > - /* > - * After the PUD entry has been populated once, we > - * only allow updates to the permission attributes. 
> - */ > - BUG_ON(!pgattr_change_is_safe(pud_val(old_pud), > - READ_ONCE(pud_val(*pudp)))); > - } else { > - alloc_init_cont_pmd(pudp, addr, next, phys, prot, > - pgtable_alloc, flags); > - > - BUG_ON(pud_val(old_pud) != 0 && > - pud_val(old_pud) != READ_ONCE(pud_val(*pudp))); > - } > + if (flags & NO_SEC_REMAPPINGS) > + init_pud_remap(pudp, addr, next, phys, prot, > + pgtable_alloc, flags); > + else > + init_pud(pudp, addr, next, phys, prot, pgtable_alloc, > + flags); > phys += next - addr; > } while (pudp++, addr = next, addr != end); > > pud_clear_fixmap(); > + > if (system_state != SYSTEM_BOOTING) > mutex_unlock(&fixmap_lock); > } > @@ -483,20 +619,37 @@ void __init mark_linear_text_alias_ro(void) > PAGE_KERNEL_RO); > } > > -static bool crash_mem_map __initdata; > - > -static int __init enable_crash_mem_map(char *arg) > +#ifdef CONFIG_KEXEC_CORE > +void __init map_crashkernel(void) > { > - /* > - * Proper parameter parsing is done by reserve_crashkernel(). We only > - * need to know if the linear map has to avoid block mappings so that > - * the crashkernel reservations can be unmapped later. > - */ > - crash_mem_map = true; > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > + return; > > - return 0; > + if (!crashk_res.end) > + return; > + > + cpu_replace_ttbr1_with_phys(__pa_symbol(init_pg_dir)); > + init_mm.pgd = init_pg_dir; > + > + __create_pgd_mapping(swapper_pg_dir, crashk_res.start, > + __phys_to_virt(crashk_res.start), > + crashk_res.end + 1 - crashk_res.start, PAGE_KERNEL, > + early_pgtable_alloc, > + NO_EXEC_MAPPINGS | NO_SEC_REMAPPINGS); > + > + cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); > + init_mm.pgd = swapper_pg_dir; > + > + memblock_phys_free(__pa_symbol(init_pg_dir), > + __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); > } Because unmap_hotplug_range works with the swapper_pg_dir pgd, many changes would be needed to make it work with the init_pg_dir pgd. 
So I use init_pg_dir together with __create_pgd_mapping to walk and modify the swapper_pg_dir page tables directly. When that is finished, the init_pg_dir memblock is freed. > -early_param("crashkernel", enable_crash_mem_map); > +#else > +void __init map_crashkernel(void) > +{ > + memblock_phys_free(__pa_symbol(init_pg_dir), > + __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); > +} > +#endif > > static void __init map_mem(pgd_t *pgdp) > { > @@ -527,17 +680,6 @@ static void __init map_mem(pgd_t *pgdp) > */ > memblock_mark_nomap(kernel_start, kernel_end - kernel_start); > > -#ifdef CONFIG_KEXEC_CORE > - if (crash_mem_map) { > - if (IS_ENABLED(CONFIG_ZONE_DMA) || > - IS_ENABLED(CONFIG_ZONE_DMA32)) > - flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; > - else if (crashk_res.end) > - memblock_mark_nomap(crashk_res.start, > - resource_size(&crashk_res)); > - } > -#endif > - > /* map all the memory banks */ > for_each_mem_range(i, &start, &end) { > if (start >= end) > @@ -570,19 +712,6 @@ static void __init map_mem(pgd_t *pgdp) > * in page granularity and put back unused memory to buddy system > * through /sys/kernel/kexec_crash_size interface.
> */ > -#ifdef CONFIG_KEXEC_CORE > - if (crash_mem_map && > - !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) { > - if (crashk_res.end) { > - __map_memblock(pgdp, crashk_res.start, > - crashk_res.end + 1, > - PAGE_KERNEL, > - NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS); > - memblock_clear_nomap(crashk_res.start, > - resource_size(&crashk_res)); > - } > - } > -#endif > } > > void mark_rodata_ro(void) > @@ -774,8 +903,8 @@ void __init paging_init(void) > cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); > init_mm.pgd = swapper_pg_dir; > > - memblock_phys_free(__pa_symbol(init_pg_dir), > - __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); > + //memblock_phys_free(__pa_symbol(init_pg_dir), > + // __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); > > memblock_allow_resize(); > } > diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S > index 50bbed9..161bae6 100644 > --- a/arch/arm64/mm/proc.S > +++ b/arch/arm64/mm/proc.S > @@ -178,6 +178,25 @@ SYM_FUNC_END(cpu_do_resume) > isb > .endm > > +SYM_FUNC_START(idmap_cpu_replace_ttbr1_with_flush_tlb) > + save_and_disable_daif flags=x2 > + > + __idmap_cpu_set_reserved_ttbr1 x1, x3 > + > + offset_ttbr1 x0, x3 > + msr ttbr1_el1, x0 > + isb > + > + restore_daif x2 > + > + dsb nshst > + tlbi vmalle1 > + dsb nsh > + isb > + > + ret > +SYM_FUNC_END(idmap_cpu_replace_ttbr1_with_flush_tlb) > + This switches ttbr1 to the pgd passed in x0 and flushes the whole TLB. > /* > * void idmap_cpu_replace_ttbr1(phys_addr_t ttbr1) > * ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v5] arm64: mm: fix linear mem mapping access performance degradation @ 2022-07-10 14:32 ` guanghui.fgh 0 siblings, 0 replies; 79+ messages in thread From: guanghui.fgh @ 2022-07-10 14:32 UTC (permalink / raw) To: baolin.wang, catalin.marinas, will, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo Cc: alikernel-developer Hi Mike Rapoport, Catalin Marinas, I have rewritten the patch to handle the TLB conflict and the circular dependency. Namely: (1) When rebuilding the crashkernel mapping, swapper_pg_dir is the pgd in use. (2) First, I switch the pgd from swapper_pg_dir to idmap_pg_dir. As idmap_pg_dir uses ttbr0_el1, there is no TLB conflict in ttbr1_el1. I then flush the whole TLB (so all cached swapper_pg_dir entries are gone), switch from idmap_pg_dir to init_pg_dir, and flush the TLB again. Therefore there is no TLB conflict while rebuilding under init_pg_dir (no TLB entries remain for swapper_pg_dir). (3) Once switched to init_pg_dir, I can safely modify the swapper_pg_dir page tables without any [[[circular dependency]]] or TLB conflict; when unmapping or modifying any swapper_pg_dir page table, I use the fixmap to access the memory. I have tested it and it works well. Could you give me some advice? Thanks. On 2022/7/10 21:44, Guanghui Feng wrote: > The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone > (enable crashkernel, disable rodata full, disable kfence), the mem_map will > use non block/section mapping(for crashkernel requires to shrink the region > in page granularity). But it will degrade performance when doing larging > continuous mem access in kernel(memcpy/memmove, etc). 
> > There are many changes and discussions: > commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for > platforms with no DMA memory zones") > commit 0a30c53573b0 ("arm64: mm: Move reserve_crashkernel() into > mem_init()") > commit 2687275a5843 ("arm64: Force NO_BLOCK_MAPPINGS if crashkernel > reservation is required") > > This patch changes mem_map to use block/section mapping with crashkernel. > Firstly, do block/section mapping(normally 2M or 1G) for all avail mem at > mem_map, reserve crashkernel memory. And then walking pagetable to split > block/section mapping to non block/section mapping(normally 4K) [[[only]]] > for crashkernel mem. So the linear mem mapping use block/section mapping > as more as possible. We will reduce the cpu dTLB miss conspicuously, and > accelerate mem access about 10-20% performance improvement. > > I have tested it with pft(Page Fault Test) and fio, obtained great > performace improvement. > > For fio test: > 1.prepare ramdisk > modprobe -r brd > modprobe brd rd_nr=1 rd_size=67108864 > dmsetup remove_all > wipefs -a --force /dev/ram0 > mkfs -t ext4 -E lazy_itable_init=0,lazy_journal_init=0 -q -F /dev/ram0 > mkdir -p /fs/ram0 > mount -t ext4 /dev/ram0 /fs/ram0 > > 2.prepare fio paremeter in x.fio file: > [global] > bs=4k > ioengine=psync > iodepth=128 > size=32G > direct=1 > invalidate=1 > group_reporting > thread=1 > rw=read > directory=/fs/ram0 > numjobs=1 > > [task_0] > cpus_allowed=16 > stonewall=1 > > 3.run testcase: > perf stat -e dTLB-load-misses fio x.fio > > 4.contrast > ------------------------ > without patch with patch > fio READ aggrb=1493.2MB/s aggrb=1775.3MB/s > dTLB-load-misses 1,818,320,693 438,729,774 > time elapsed(s) 70.500326434 62.877316408 > user(s) 15.926332000 15.684721000 > sys(s) 54.211939000 47.046165000 > > 5.conclusion > Using this patch will reduce dTLB misses and improve performace greatly. 
> > Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com> > --- > arch/arm64/include/asm/mmu.h | 1 + > arch/arm64/include/asm/mmu_context.h | 28 ++++ > arch/arm64/mm/init.c | 8 +- > arch/arm64/mm/mmu.c | 305 +++++++++++++++++++++++++---------- > arch/arm64/mm/proc.S | 19 +++ > 5 files changed, 267 insertions(+), 94 deletions(-) > > diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h > index 48f8466..1a46b81 100644 > --- a/arch/arm64/include/asm/mmu.h > +++ b/arch/arm64/include/asm/mmu.h > @@ -63,6 +63,7 @@ static inline bool arm64_kernel_unmapped_at_el0(void) > extern void arm64_memblock_init(void); > extern void paging_init(void); > extern void bootmem_init(void); > +extern void map_crashkernel(void); > extern void __iomem *early_io_map(phys_addr_t phys, unsigned long virt); > extern void init_mem_pgprot(void); > extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys, > diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h > index 6770667..10dffd4 100644 > --- a/arch/arm64/include/asm/mmu_context.h > +++ b/arch/arm64/include/asm/mmu_context.h > @@ -139,6 +139,34 @@ static inline void cpu_install_ttbr0(phys_addr_t ttbr0, unsigned long t0sz) > isb(); > } > > +static inline void __nocfi cpu_replace_ttbr1_with_phys(phys_addr_t ttbr1) > +{ > + typedef void (ttbr_replace_func)(phys_addr_t); > + extern ttbr_replace_func idmap_cpu_replace_ttbr1_with_flush_tlb; > + ttbr_replace_func *replace_phys; > + > + /* phys_to_ttbr() zeros lower 2 bits of ttbr with 52-bit PA */ > + ttbr1 = phys_to_ttbr(ttbr1); > + > + if (system_supports_cnp()) { > + /* > + * cpu_replace_ttbr1() is used when there's a boot CPU > + * up (i.e. cpufeature framework is not up yet) and > + * latter only when we enable CNP via cpufeature's > + * enable() callback. > + * Also we rely on the cpu_hwcap bit being set before > + * calling the enable() function. 
> + */ > + ttbr1 |= TTBR_CNP_BIT; > + } > + > + replace_phys = (void *)__pa_symbol(function_nocfi(idmap_cpu_replace_ttbr1_with_flush_tlb)); > + > + cpu_install_idmap(); > + replace_phys(ttbr1); > + cpu_uninstall_idmap(); cpu_install_idmap() switches to idmap_pg_dir, which is installed via [[[ttbr0_el1]]]. swapper_pg_dir uses [[[ttbr1_el1]]], so there is no TLB conflict with idmap_pg_dir (a different address-space range). Switching to idmap_pg_dir also flushes the whole TLB. Afterwards, idmap_cpu_replace_ttbr1_with_flush_tlb runs under the idmap_pg_dir pgd and switches ttbr1 to the pgd given by the parameter. So there is no TLB conflict when switching pgds; the ttbr1 pgd can be changed safely. > +} > + > /* > * Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD, > * avoiding the possibility of conflicting TLB entries being allocated. > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c > index 339ee84..241d27e 100644 > --- a/arch/arm64/mm/init.c > +++ b/arch/arm64/mm/init.c > @@ -190,6 +190,7 @@ static void __init reserve_crashkernel(void) > crashk_res.start = crash_base; > crashk_res.end = crash_base + crash_size - 1; > insert_resource(&iomem_resource, &crashk_res); > + map_crashkernel(); > } > > /* > @@ -388,10 +389,6 @@ void __init arm64_memblock_init(void) > } > > early_init_fdt_scan_reserved_mem(); > - > - if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) > - reserve_crashkernel(); > - > high_memory = __va(memblock_end_of_DRAM() - 1) + 1; > } > > @@ -438,8 +435,7 @@ void __init bootmem_init(void) > * request_standard_resources() depends on crashkernel's memory being > * reserved, so do it here. 
> */ > - if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32)) > - reserve_crashkernel(); > + reserve_crashkernel(); > > memblock_dump_all(); > } > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c > index 626ec32..3e30c82 100644 > --- a/arch/arm64/mm/mmu.c > +++ b/arch/arm64/mm/mmu.c > @@ -42,6 +42,7 @@ > #define NO_BLOCK_MAPPINGS BIT(0) > #define NO_CONT_MAPPINGS BIT(1) > #define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */ > +#define NO_SEC_REMAPPINGS BIT(3) /* rebuild with non block/sec mapping*/ > > u64 idmap_t0sz = TCR_T0SZ(VA_BITS_MIN); > u64 idmap_ptrs_per_pgd = PTRS_PER_PGD; > @@ -155,8 +156,22 @@ static bool pgattr_change_is_safe(u64 old, u64 new) > return ((old ^ new) & ~mask) == 0; > } > > +static void pte_clear_cont(pte_t *ptep) > +{ > + int i = 0; > + pte_t pte = READ_ONCE(*ptep); > + if (pte_none(pte) || !pte_cont(pte)) > + return; > + ptep -= ((u64)ptep / sizeof(pte_t)) & > + ((1 << CONFIG_ARM64_CONT_PTE_SHIFT) - 1); > + do { > + pte = pte_mknoncont(READ_ONCE(*ptep)); > + set_pte(ptep, pte); > + } while (++ptep, ++i < CONT_PTES); > +} > + > static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end, > - phys_addr_t phys, pgprot_t prot) > + phys_addr_t phys, pgprot_t prot, int flags) > { > pte_t *ptep; > > @@ -164,6 +179,9 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end, > do { > pte_t old_pte = READ_ONCE(*ptep); > > + if (flags & NO_SEC_REMAPPINGS) > + pte_clear_cont(ptep); > + > set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot)); > > /* > @@ -179,6 +197,22 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end, > pte_clear_fixmap(); > } > > +static inline void allock_pte_page(pmd_t *pmdp, > + phys_addr_t (*pgtable_alloc)(int), > + int flags) > +{ > + pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN; > + phys_addr_t pte_phys; > + if (!pmd_none(READ_ONCE(*pmdp))) > + return; > + > + if (flags & NO_EXEC_MAPPINGS) > + pmdval |= PMD_TABLE_PXN; > + 
BUG_ON(!pgtable_alloc); > + pte_phys = pgtable_alloc(PAGE_SHIFT); > + __pmd_populate(pmdp, pte_phys, pmdval); > +} > + > static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, > unsigned long end, phys_addr_t phys, > pgprot_t prot, > @@ -189,17 +223,8 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, > pmd_t pmd = READ_ONCE(*pmdp); > > BUG_ON(pmd_sect(pmd)); > - if (pmd_none(pmd)) { > - pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN; > - phys_addr_t pte_phys; > - > - if (flags & NO_EXEC_MAPPINGS) > - pmdval |= PMD_TABLE_PXN; > - BUG_ON(!pgtable_alloc); > - pte_phys = pgtable_alloc(PAGE_SHIFT); > - __pmd_populate(pmdp, pte_phys, pmdval); > - pmd = READ_ONCE(*pmdp); > - } > + allock_pte_page(pmdp, pgtable_alloc, flags); > + pmd = READ_ONCE(*pmdp); > BUG_ON(pmd_bad(pmd)); > > do { > @@ -208,16 +233,71 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, > next = pte_cont_addr_end(addr, end); > > /* use a contiguous mapping if the range is suitably aligned */ > - if ((((addr | next | phys) & ~CONT_PTE_MASK) == 0) && > - (flags & NO_CONT_MAPPINGS) == 0) > + if (!(flags & NO_SEC_REMAPPINGS) && > + (((addr | next | phys) & ~CONT_PTE_MASK) == 0) && > + (flags & NO_CONT_MAPPINGS) == 0) > __prot = __pgprot(pgprot_val(prot) | PTE_CONT); > > - init_pte(pmdp, addr, next, phys, __prot); > + init_pte(pmdp, addr, next, phys, __prot, flags); > > phys += next - addr; > } while (addr = next, addr != end); > } > > +static void pmd_clear_cont(pmd_t *pmdp) > +{ > + int i = 0; > + pmd_t pmd = READ_ONCE(*pmdp); > + if (pmd_none(pmd) || !pmd_sect(pmd) || !pmd_cont(pmd)) > + return; > + pmdp -= ((u64)pmdp / sizeof(pmd_t)) & > + ((1 << CONFIG_ARM64_CONT_PMD_SHIFT) - 1); > + do { > + pmd = READ_ONCE(*pmdp); > + pmd = pte_pmd(pte_mknoncont(pmd_pte(pmd))); > + set_pmd(pmdp, pmd); > + } while (++pmdp, ++i < CONT_PMDS); > +} > + > +static void init_pmd_remap(pud_t *pudp, unsigned long addr, unsigned long end, > + phys_addr_t phys, pgprot_t prot, > + 
phys_addr_t (*pgtable_alloc)(int), int flags) > +{ > + unsigned long next; > + pmd_t *pmdp; > + phys_addr_t map_offset; > + > + pmdp = pmd_set_fixmap_offset(pudp, addr); > + do { > + next = pmd_addr_end(addr, end); > + > + if (!pmd_none(*pmdp) && pmd_sect(*pmdp)) { > + pmd_clear_cont(pmdp); > + pmd_clear(pmdp); > + allock_pte_page(pmdp, pgtable_alloc, flags); > + > + map_offset = addr - (addr & PMD_MASK); > + if (map_offset) > + alloc_init_cont_pte(pmdp, addr & PMD_MASK, addr, > + phys - map_offset, prot, > + pgtable_alloc, > + flags & (~NO_SEC_REMAPPINGS)); > + > + if (next < (addr & PMD_MASK) + PMD_SIZE) > + alloc_init_cont_pte(pmdp, next, > + (addr & PUD_MASK) + PUD_SIZE, > + next - addr + phys, > + prot, pgtable_alloc, > + flags & (~NO_SEC_REMAPPINGS)); > + } > + alloc_init_cont_pte(pmdp, addr, next, phys, prot, > + pgtable_alloc, flags); > + phys += next - addr; > + } while (pmdp++, addr = next, addr != end); > + > + pmd_clear_fixmap(); > +} > + > static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end, > phys_addr_t phys, pgprot_t prot, > phys_addr_t (*pgtable_alloc)(int), int flags) > @@ -255,6 +335,22 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end, > pmd_clear_fixmap(); > } > > +static inline void alloc_pmd_page(pud_t *pudp, phys_addr_t (*pgtable_alloc)(int), int flags) > +{ > + > + pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN; > + phys_addr_t pmd_phys; > + > + if (!pud_none(READ_ONCE(*pudp))) > + return; > + > + if (flags & NO_EXEC_MAPPINGS) > + pudval |= PUD_TABLE_PXN; > + BUG_ON(!pgtable_alloc); > + pmd_phys = pgtable_alloc(PMD_SHIFT); > + __pud_populate(pudp, pmd_phys, pudval); > +} > + > static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, > unsigned long end, phys_addr_t phys, > pgprot_t prot, > @@ -267,17 +363,8 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, > * Check for initial section mappings in the pgd/pud. 
> */ > BUG_ON(pud_sect(pud)); > - if (pud_none(pud)) { > - pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN; > - phys_addr_t pmd_phys; > - > - if (flags & NO_EXEC_MAPPINGS) > - pudval |= PUD_TABLE_PXN; > - BUG_ON(!pgtable_alloc); > - pmd_phys = pgtable_alloc(PMD_SHIFT); > - __pud_populate(pudp, pmd_phys, pudval); > - pud = READ_ONCE(*pudp); > - } > + alloc_pmd_page(pudp, pgtable_alloc, flags); > + pud = READ_ONCE(*pudp); > BUG_ON(pud_bad(pud)); > > do { > @@ -286,16 +373,80 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, > next = pmd_cont_addr_end(addr, end); > > /* use a contiguous mapping if the range is suitably aligned */ > - if ((((addr | next | phys) & ~CONT_PMD_MASK) == 0) && > + if (!(flags & NO_SEC_REMAPPINGS) && > + (((addr | next | phys) & ~CONT_PMD_MASK) == 0) && > (flags & NO_CONT_MAPPINGS) == 0) > __prot = __pgprot(pgprot_val(prot) | PTE_CONT); > > - init_pmd(pudp, addr, next, phys, __prot, pgtable_alloc, flags); > + if (flags & NO_SEC_REMAPPINGS) > + init_pmd_remap(pudp, addr, next, phys, __prot, > + pgtable_alloc, flags); > + else > + init_pmd(pudp, addr, next, phys, __prot, > + pgtable_alloc, flags); > > phys += next - addr; > } while (addr = next, addr != end); > } > > +static void init_pud_remap(pud_t *pudp, unsigned long addr, unsigned long next, > + phys_addr_t phys, pgprot_t prot, > + phys_addr_t (*pgtable_alloc)(int), > + int flags) > +{ > + phys_addr_t map_offset; > + > + if (!pud_none(*pudp) && pud_sect(*pudp)) { > + pud_clear(pudp); > + alloc_pmd_page(pudp, pgtable_alloc, flags); > + > + map_offset = addr - (addr & PUD_MASK); > + if (map_offset) > + alloc_init_cont_pmd(pudp, addr & PUD_MASK, > + addr, phys - map_offset, > + prot, pgtable_alloc, > + flags & (~NO_SEC_REMAPPINGS)); > + > + if (next < (addr & PUD_MASK) + PUD_SIZE) > + alloc_init_cont_pmd(pudp, next, > + (addr & PUD_MASK) + PUD_SIZE, > + next - addr + phys, > + prot, pgtable_alloc, > + flags & (~NO_SEC_REMAPPINGS)); > + } > + alloc_init_cont_pmd(pudp, addr, 
next, phys, prot, > + pgtable_alloc, flags); > +} > + > +static void init_pud(pud_t *pudp, unsigned long addr, unsigned long next, > + phys_addr_t phys, pgprot_t prot, > + phys_addr_t (*pgtable_alloc)(int), > + int flags) > +{ > + pud_t old_pud = READ_ONCE(*pudp); > + /* > + * For 4K granule only, attempt to put down a 1GB block > + */ > + if (pud_sect_supported() && > + ((addr | next | phys) & ~PUD_MASK) == 0 && > + (flags & NO_BLOCK_MAPPINGS) == 0) { > + pud_set_huge(pudp, phys, prot); > + > + /* > + * After the PUD entry has been populated once, we > + * only allow updates to the permission attributes. > + */ > + BUG_ON(!pgattr_change_is_safe(pud_val(old_pud), > + READ_ONCE(pud_val(*pudp)))); > + } else { > + alloc_init_cont_pmd(pudp, addr, next, phys, prot, > + pgtable_alloc, flags); > + > + BUG_ON(pud_val(old_pud) != 0 && > + pud_val(old_pud) != READ_ONCE(pud_val(*pudp))); > + } > +} > + > static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end, > phys_addr_t phys, pgprot_t prot, > phys_addr_t (*pgtable_alloc)(int), > @@ -325,37 +476,22 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end, > */ > if (system_state != SYSTEM_BOOTING) > mutex_lock(&fixmap_lock); > + > pudp = pud_set_fixmap_offset(p4dp, addr); > do { > - pud_t old_pud = READ_ONCE(*pudp); > - > next = pud_addr_end(addr, end); > > - /* > - * For 4K granule only, attempt to put down a 1GB block > - */ > - if (pud_sect_supported() && > - ((addr | next | phys) & ~PUD_MASK) == 0 && > - (flags & NO_BLOCK_MAPPINGS) == 0) { > - pud_set_huge(pudp, phys, prot); > - > - /* > - * After the PUD entry has been populated once, we > - * only allow updates to the permission attributes. 
> - */ > - BUG_ON(!pgattr_change_is_safe(pud_val(old_pud), > - READ_ONCE(pud_val(*pudp)))); > - } else { > - alloc_init_cont_pmd(pudp, addr, next, phys, prot, > - pgtable_alloc, flags); > - > - BUG_ON(pud_val(old_pud) != 0 && > - pud_val(old_pud) != READ_ONCE(pud_val(*pudp))); > - } > + if (flags & NO_SEC_REMAPPINGS) > + init_pud_remap(pudp, addr, next, phys, prot, > + pgtable_alloc, flags); > + else > + init_pud(pudp, addr, next, phys, prot, pgtable_alloc, > + flags); > phys += next - addr; > } while (pudp++, addr = next, addr != end); > > pud_clear_fixmap(); > + > if (system_state != SYSTEM_BOOTING) > mutex_unlock(&fixmap_lock); > } > @@ -483,20 +619,37 @@ void __init mark_linear_text_alias_ro(void) > PAGE_KERNEL_RO); > } > > -static bool crash_mem_map __initdata; > - > -static int __init enable_crash_mem_map(char *arg) > +#ifdef CONFIG_KEXEC_CORE > +void __init map_crashkernel(void) > { > - /* > - * Proper parameter parsing is done by reserve_crashkernel(). We only > - * need to know if the linear map has to avoid block mappings so that > - * the crashkernel reservations can be unmapped later. > - */ > - crash_mem_map = true; > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > + return; > > - return 0; > + if (!crashk_res.end) > + return; > + > + cpu_replace_ttbr1_with_phys(__pa_symbol(init_pg_dir)); > + init_mm.pgd = init_pg_dir; > + > + __create_pgd_mapping(swapper_pg_dir, crashk_res.start, > + __phys_to_virt(crashk_res.start), > + crashk_res.end + 1 - crashk_res.start, PAGE_KERNEL, > + early_pgtable_alloc, > + NO_EXEC_MAPPINGS | NO_SEC_REMAPPINGS); > + > + cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); > + init_mm.pgd = swapper_pg_dir; > + > + memblock_phys_free(__pa_symbol(init_pg_dir), > + __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); > } Because unmap_hotplug_range() works with the swapper_pg_dir pgd, many changes would be needed for it to work with the init_pg_dir pgd.
So I use init_pg_dir together with __create_pgd_mapping to walk and modify the swapper_pg_dir page tables directly. When that is finished, the init_pg_dir memblock is freed. > -early_param("crashkernel", enable_crash_mem_map); > +#else > +void __init map_crashkernel(void) > +{ > + memblock_phys_free(__pa_symbol(init_pg_dir), > + __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); > +} > +#endif > > static void __init map_mem(pgd_t *pgdp) > { > @@ -527,17 +680,6 @@ static void __init map_mem(pgd_t *pgdp) > */ > memblock_mark_nomap(kernel_start, kernel_end - kernel_start); > > -#ifdef CONFIG_KEXEC_CORE > - if (crash_mem_map) { > - if (IS_ENABLED(CONFIG_ZONE_DMA) || > - IS_ENABLED(CONFIG_ZONE_DMA32)) > - flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; > - else if (crashk_res.end) > - memblock_mark_nomap(crashk_res.start, > - resource_size(&crashk_res)); > - } > -#endif > - > /* map all the memory banks */ > for_each_mem_range(i, &start, &end) { > if (start >= end) > @@ -570,19 +712,6 @@ static void __init map_mem(pgd_t *pgdp) > * in page granularity and put back unused memory to buddy system > * through /sys/kernel/kexec_crash_size interface. 
> */ > -#ifdef CONFIG_KEXEC_CORE > - if (crash_mem_map && > - !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) { > - if (crashk_res.end) { > - __map_memblock(pgdp, crashk_res.start, > - crashk_res.end + 1, > - PAGE_KERNEL, > - NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS); > - memblock_clear_nomap(crashk_res.start, > - resource_size(&crashk_res)); > - } > - } > -#endif > } > > void mark_rodata_ro(void) > @@ -774,8 +903,8 @@ void __init paging_init(void) > cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); > init_mm.pgd = swapper_pg_dir; > > - memblock_phys_free(__pa_symbol(init_pg_dir), > - __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); > + //memblock_phys_free(__pa_symbol(init_pg_dir), > + // __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); > > memblock_allow_resize(); > } > diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S > index 50bbed9..161bae6 100644 > --- a/arch/arm64/mm/proc.S > +++ b/arch/arm64/mm/proc.S > @@ -178,6 +178,25 @@ SYM_FUNC_END(cpu_do_resume) > isb > .endm > > +SYM_FUNC_START(idmap_cpu_replace_ttbr1_with_flush_tlb) > + save_and_disable_daif flags=x2 > + > + __idmap_cpu_set_reserved_ttbr1 x1, x3 > + > + offset_ttbr1 x0, x3 > + msr ttbr1_el1, x0 > + isb > + > + restore_daif x2 > + > + dsb nshst > + tlbi vmalle1 > + dsb nsh > + isb > + > + ret > +SYM_FUNC_END(idmap_cpu_replace_ttbr1_with_flush_tlb) > + This switches ttbr1 to the pgd passed in x0 and flushes the whole TLB. > /* > * void idmap_cpu_replace_ttbr1(phys_addr_t ttbr1) > * _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v5] arm64: mm: fix linear mem mapping access performance degradation 2022-07-10 13:44 ` Guanghui Feng @ 2022-07-10 15:33 ` guanghui.fgh -1 siblings, 0 replies; 79+ messages in thread From: guanghui.fgh @ 2022-07-10 15:33 UTC (permalink / raw) To: baolin.wang, catalin.marinas, will, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo Cc: alikernel-developer In short, this patch works as follows: 1. Before the crashkernel mapping is rebuilt, swapper_pg_dir is the pgd in [[[ttbr1]]]. 2. Change [[[ttbr0]]] to use the idmap_pg_dir pgd. 3. [[[idmap_cpu_replace_ttbr1_with_flush_tlb]]] is mapped [[[only]]] by the idmap_pg_dir mapping in [[[ttbr0]]]. 4. [[[idmap_cpu_replace_ttbr1_with_flush_tlb]]] flushes the whole TLB and switches [[[ttbr1]]] to the init_pg_dir pgd (then flushes the whole TLB again). There is no TLB conflict with swapper_pg_dir, and no TLB entries remain for it. 5. Working under the init_pg_dir pgd, access the swapper_pg_dir page tables through the fixmap and modify the crashkernel mapping in swapper_pg_dir without any TLB conflict or flush. 6. When the work is finished, switch the ttbr1 pgd back to the original swapper_pg_dir with cpu_replace_ttbr1() (similar to the above). On 2022/7/10 21:44, Guanghui Feng wrote: > The arm64 can build 2M/1G block/sectiion mapping. When using DMA/DMA32 zone > (enable crashkernel, disable rodata full, disable kfence), the mem_map will > use non block/section mapping(for crashkernel requires to shrink the region > in page granularity). But it will degrade performance when doing larging > continuous mem access in kernel(memcpy/memmove, etc). 
> > There are many changes and discussions: > commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for > platforms with no DMA memory zones") > commit 0a30c53573b0 ("arm64: mm: Move reserve_crashkernel() into > mem_init()") > commit 2687275a5843 ("arm64: Force NO_BLOCK_MAPPINGS if crashkernel > reservation is required") > > This patch changes mem_map to use block/section mapping with crashkernel. > Firstly, do block/section mapping(normally 2M or 1G) for all avail mem at > mem_map, reserve crashkernel memory. And then walking pagetable to split > block/section mapping to non block/section mapping(normally 4K) [[[only]]] > for crashkernel mem. So the linear mem mapping use block/section mapping > as more as possible. We will reduce the cpu dTLB miss conspicuously, and > accelerate mem access about 10-20% performance improvement. > > I have tested it with pft(Page Fault Test) and fio, obtained great > performace improvement. > > For fio test: > 1.prepare ramdisk > modprobe -r brd > modprobe brd rd_nr=1 rd_size=67108864 > dmsetup remove_all > wipefs -a --force /dev/ram0 > mkfs -t ext4 -E lazy_itable_init=0,lazy_journal_init=0 -q -F /dev/ram0 > mkdir -p /fs/ram0 > mount -t ext4 /dev/ram0 /fs/ram0 > > 2.prepare fio paremeter in x.fio file: > [global] > bs=4k > ioengine=psync > iodepth=128 > size=32G > direct=1 > invalidate=1 > group_reporting > thread=1 > rw=read > directory=/fs/ram0 > numjobs=1 > > [task_0] > cpus_allowed=16 > stonewall=1 > > 3.run testcase: > perf stat -e dTLB-load-misses fio x.fio > > 4.contrast > ------------------------ > without patch with patch > fio READ aggrb=1493.2MB/s aggrb=1775.3MB/s > dTLB-load-misses 1,818,320,693 438,729,774 > time elapsed(s) 70.500326434 62.877316408 > user(s) 15.926332000 15.684721000 > sys(s) 54.211939000 47.046165000 > > 5.conclusion > Using this patch will reduce dTLB misses and improve performace greatly. 
> > Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com> > --- > arch/arm64/include/asm/mmu.h | 1 + > arch/arm64/include/asm/mmu_context.h | 28 ++++ > arch/arm64/mm/init.c | 8 +- > arch/arm64/mm/mmu.c | 305 +++++++++++++++++++++++++---------- > arch/arm64/mm/proc.S | 19 +++ > 5 files changed, 267 insertions(+), 94 deletions(-) > > diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h > index 48f8466..1a46b81 100644 > --- a/arch/arm64/include/asm/mmu.h > +++ b/arch/arm64/include/asm/mmu.h > @@ -63,6 +63,7 @@ static inline bool arm64_kernel_unmapped_at_el0(void) > extern void arm64_memblock_init(void); > extern void paging_init(void); > extern void bootmem_init(void); > +extern void map_crashkernel(void); > extern void __iomem *early_io_map(phys_addr_t phys, unsigned long virt); > extern void init_mem_pgprot(void); > extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys, > diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h > index 6770667..10dffd4 100644 > --- a/arch/arm64/include/asm/mmu_context.h > +++ b/arch/arm64/include/asm/mmu_context.h > @@ -139,6 +139,34 @@ static inline void cpu_install_ttbr0(phys_addr_t ttbr0, unsigned long t0sz) > isb(); > } > > +static inline void __nocfi cpu_replace_ttbr1_with_phys(phys_addr_t ttbr1) > +{ > + typedef void (ttbr_replace_func)(phys_addr_t); > + extern ttbr_replace_func idmap_cpu_replace_ttbr1_with_flush_tlb; > + ttbr_replace_func *replace_phys; > + > + /* phys_to_ttbr() zeros lower 2 bits of ttbr with 52-bit PA */ > + ttbr1 = phys_to_ttbr(ttbr1); > + > + if (system_supports_cnp()) { > + /* > + * cpu_replace_ttbr1() is used when there's a boot CPU > + * up (i.e. cpufeature framework is not up yet) and > + * latter only when we enable CNP via cpufeature's > + * enable() callback. > + * Also we rely on the cpu_hwcap bit being set before > + * calling the enable() function. 
> + */ > + ttbr1 |= TTBR_CNP_BIT; > + } > + > + replace_phys = (void *)__pa_symbol(function_nocfi(idmap_cpu_replace_ttbr1_with_flush_tlb)); > + > + cpu_install_idmap(); > + replace_phys(ttbr1); > + cpu_uninstall_idmap(); > +} > + > /* > * Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD, > * avoiding the possibility of conflicting TLB entries being allocated. > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c > index 339ee84..241d27e 100644 > --- a/arch/arm64/mm/init.c > +++ b/arch/arm64/mm/init.c > @@ -190,6 +190,7 @@ static void __init reserve_crashkernel(void) > crashk_res.start = crash_base; > crashk_res.end = crash_base + crash_size - 1; > insert_resource(&iomem_resource, &crashk_res); > + map_crashkernel(); > } > > /* > @@ -388,10 +389,6 @@ void __init arm64_memblock_init(void) > } > > early_init_fdt_scan_reserved_mem(); > - > - if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) > - reserve_crashkernel(); > - > high_memory = __va(memblock_end_of_DRAM() - 1) + 1; > } > > @@ -438,8 +435,7 @@ void __init bootmem_init(void) > * request_standard_resources() depends on crashkernel's memory being > * reserved, so do it here. 
> */ > - if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32)) > - reserve_crashkernel(); > + reserve_crashkernel(); > > memblock_dump_all(); > } > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c > index 626ec32..3e30c82 100644 > --- a/arch/arm64/mm/mmu.c > +++ b/arch/arm64/mm/mmu.c > @@ -42,6 +42,7 @@ > #define NO_BLOCK_MAPPINGS BIT(0) > #define NO_CONT_MAPPINGS BIT(1) > #define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */ > +#define NO_SEC_REMAPPINGS BIT(3) /* rebuild with non block/sec mapping*/ > > u64 idmap_t0sz = TCR_T0SZ(VA_BITS_MIN); > u64 idmap_ptrs_per_pgd = PTRS_PER_PGD; > @@ -155,8 +156,22 @@ static bool pgattr_change_is_safe(u64 old, u64 new) > return ((old ^ new) & ~mask) == 0; > } > > +static void pte_clear_cont(pte_t *ptep) > +{ > + int i = 0; > + pte_t pte = READ_ONCE(*ptep); > + if (pte_none(pte) || !pte_cont(pte)) > + return; > + ptep -= ((u64)ptep / sizeof(pte_t)) & > + ((1 << CONFIG_ARM64_CONT_PTE_SHIFT) - 1); > + do { > + pte = pte_mknoncont(READ_ONCE(*ptep)); > + set_pte(ptep, pte); > + } while (++ptep, ++i < CONT_PTES); > +} > + > static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end, > - phys_addr_t phys, pgprot_t prot) > + phys_addr_t phys, pgprot_t prot, int flags) > { > pte_t *ptep; > > @@ -164,6 +179,9 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end, > do { > pte_t old_pte = READ_ONCE(*ptep); > > + if (flags & NO_SEC_REMAPPINGS) > + pte_clear_cont(ptep); > + > set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot)); > > /* > @@ -179,6 +197,22 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end, > pte_clear_fixmap(); > } > > +static inline void allock_pte_page(pmd_t *pmdp, > + phys_addr_t (*pgtable_alloc)(int), > + int flags) > +{ > + pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN; > + phys_addr_t pte_phys; > + if (!pmd_none(READ_ONCE(*pmdp))) > + return; > + > + if (flags & NO_EXEC_MAPPINGS) > + pmdval |= PMD_TABLE_PXN; > + 
BUG_ON(!pgtable_alloc); > + pte_phys = pgtable_alloc(PAGE_SHIFT); > + __pmd_populate(pmdp, pte_phys, pmdval); > +} > + > static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, > unsigned long end, phys_addr_t phys, > pgprot_t prot, > @@ -189,17 +223,8 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, > pmd_t pmd = READ_ONCE(*pmdp); > > BUG_ON(pmd_sect(pmd)); > - if (pmd_none(pmd)) { > - pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN; > - phys_addr_t pte_phys; > - > - if (flags & NO_EXEC_MAPPINGS) > - pmdval |= PMD_TABLE_PXN; > - BUG_ON(!pgtable_alloc); > - pte_phys = pgtable_alloc(PAGE_SHIFT); > - __pmd_populate(pmdp, pte_phys, pmdval); > - pmd = READ_ONCE(*pmdp); > - } > + allock_pte_page(pmdp, pgtable_alloc, flags); > + pmd = READ_ONCE(*pmdp); > BUG_ON(pmd_bad(pmd)); > > do { > @@ -208,16 +233,71 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, > next = pte_cont_addr_end(addr, end); > > /* use a contiguous mapping if the range is suitably aligned */ > - if ((((addr | next | phys) & ~CONT_PTE_MASK) == 0) && > - (flags & NO_CONT_MAPPINGS) == 0) > + if (!(flags & NO_SEC_REMAPPINGS) && > + (((addr | next | phys) & ~CONT_PTE_MASK) == 0) && > + (flags & NO_CONT_MAPPINGS) == 0) > __prot = __pgprot(pgprot_val(prot) | PTE_CONT); > > - init_pte(pmdp, addr, next, phys, __prot); > + init_pte(pmdp, addr, next, phys, __prot, flags); > > phys += next - addr; > } while (addr = next, addr != end); > } > > +static void pmd_clear_cont(pmd_t *pmdp) > +{ > + int i = 0; > + pmd_t pmd = READ_ONCE(*pmdp); > + if (pmd_none(pmd) || !pmd_sect(pmd) || !pmd_cont(pmd)) > + return; > + pmdp -= ((u64)pmdp / sizeof(pmd_t)) & > + ((1 << CONFIG_ARM64_CONT_PMD_SHIFT) - 1); > + do { > + pmd = READ_ONCE(*pmdp); > + pmd = pte_pmd(pte_mknoncont(pmd_pte(pmd))); > + set_pmd(pmdp, pmd); > + } while (++pmdp, ++i < CONT_PMDS); > +} > + > +static void init_pmd_remap(pud_t *pudp, unsigned long addr, unsigned long end, > + phys_addr_t phys, pgprot_t prot, > + 
phys_addr_t (*pgtable_alloc)(int), int flags) > +{ > + unsigned long next; > + pmd_t *pmdp; > + phys_addr_t map_offset; > + > + pmdp = pmd_set_fixmap_offset(pudp, addr); > + do { > + next = pmd_addr_end(addr, end); > + > + if (!pmd_none(*pmdp) && pmd_sect(*pmdp)) { > + pmd_clear_cont(pmdp); > + pmd_clear(pmdp); > + allock_pte_page(pmdp, pgtable_alloc, flags); > + > + map_offset = addr - (addr & PMD_MASK); > + if (map_offset) > + alloc_init_cont_pte(pmdp, addr & PMD_MASK, addr, > + phys - map_offset, prot, > + pgtable_alloc, > + flags & (~NO_SEC_REMAPPINGS)); > + > + if (next < (addr & PMD_MASK) + PMD_SIZE) > + alloc_init_cont_pte(pmdp, next, > + (addr & PUD_MASK) + PUD_SIZE, > + next - addr + phys, > + prot, pgtable_alloc, > + flags & (~NO_SEC_REMAPPINGS)); > + } > + alloc_init_cont_pte(pmdp, addr, next, phys, prot, > + pgtable_alloc, flags); > + phys += next - addr; > + } while (pmdp++, addr = next, addr != end); > + > + pmd_clear_fixmap(); > +} > + > static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end, > phys_addr_t phys, pgprot_t prot, > phys_addr_t (*pgtable_alloc)(int), int flags) > @@ -255,6 +335,22 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end, > pmd_clear_fixmap(); > } > > +static inline void alloc_pmd_page(pud_t *pudp, phys_addr_t (*pgtable_alloc)(int), int flags) > +{ > + > + pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN; > + phys_addr_t pmd_phys; > + > + if (!pud_none(READ_ONCE(*pudp))) > + return; > + > + if (flags & NO_EXEC_MAPPINGS) > + pudval |= PUD_TABLE_PXN; > + BUG_ON(!pgtable_alloc); > + pmd_phys = pgtable_alloc(PMD_SHIFT); > + __pud_populate(pudp, pmd_phys, pudval); > +} > + > static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, > unsigned long end, phys_addr_t phys, > pgprot_t prot, > @@ -267,17 +363,8 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, > * Check for initial section mappings in the pgd/pud. 
> */ > BUG_ON(pud_sect(pud)); > - if (pud_none(pud)) { > - pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN; > - phys_addr_t pmd_phys; > - > - if (flags & NO_EXEC_MAPPINGS) > - pudval |= PUD_TABLE_PXN; > - BUG_ON(!pgtable_alloc); > - pmd_phys = pgtable_alloc(PMD_SHIFT); > - __pud_populate(pudp, pmd_phys, pudval); > - pud = READ_ONCE(*pudp); > - } > + alloc_pmd_page(pudp, pgtable_alloc, flags); > + pud = READ_ONCE(*pudp); > BUG_ON(pud_bad(pud)); > > do { > @@ -286,16 +373,80 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, > next = pmd_cont_addr_end(addr, end); > > /* use a contiguous mapping if the range is suitably aligned */ > - if ((((addr | next | phys) & ~CONT_PMD_MASK) == 0) && > + if (!(flags & NO_SEC_REMAPPINGS) && > + (((addr | next | phys) & ~CONT_PMD_MASK) == 0) && > (flags & NO_CONT_MAPPINGS) == 0) > __prot = __pgprot(pgprot_val(prot) | PTE_CONT); > > - init_pmd(pudp, addr, next, phys, __prot, pgtable_alloc, flags); > + if (flags & NO_SEC_REMAPPINGS) > + init_pmd_remap(pudp, addr, next, phys, __prot, > + pgtable_alloc, flags); > + else > + init_pmd(pudp, addr, next, phys, __prot, > + pgtable_alloc, flags); > > phys += next - addr; > } while (addr = next, addr != end); > } > > +static void init_pud_remap(pud_t *pudp, unsigned long addr, unsigned long next, > + phys_addr_t phys, pgprot_t prot, > + phys_addr_t (*pgtable_alloc)(int), > + int flags) > +{ > + phys_addr_t map_offset; > + > + if (!pud_none(*pudp) && pud_sect(*pudp)) { > + pud_clear(pudp); > + alloc_pmd_page(pudp, pgtable_alloc, flags); > + > + map_offset = addr - (addr & PUD_MASK); > + if (map_offset) > + alloc_init_cont_pmd(pudp, addr & PUD_MASK, > + addr, phys - map_offset, > + prot, pgtable_alloc, > + flags & (~NO_SEC_REMAPPINGS)); > + > + if (next < (addr & PUD_MASK) + PUD_SIZE) > + alloc_init_cont_pmd(pudp, next, > + (addr & PUD_MASK) + PUD_SIZE, > + next - addr + phys, > + prot, pgtable_alloc, > + flags & (~NO_SEC_REMAPPINGS)); > + } > + alloc_init_cont_pmd(pudp, addr, 
next, phys, prot, > + pgtable_alloc, flags); > +} > + > +static void init_pud(pud_t *pudp, unsigned long addr, unsigned long next, > + phys_addr_t phys, pgprot_t prot, > + phys_addr_t (*pgtable_alloc)(int), > + int flags) > +{ > + pud_t old_pud = READ_ONCE(*pudp); > + /* > + * For 4K granule only, attempt to put down a 1GB block > + */ > + if (pud_sect_supported() && > + ((addr | next | phys) & ~PUD_MASK) == 0 && > + (flags & NO_BLOCK_MAPPINGS) == 0) { > + pud_set_huge(pudp, phys, prot); > + > + /* > + * After the PUD entry has been populated once, we > + * only allow updates to the permission attributes. > + */ > + BUG_ON(!pgattr_change_is_safe(pud_val(old_pud), > + READ_ONCE(pud_val(*pudp)))); > + } else { > + alloc_init_cont_pmd(pudp, addr, next, phys, prot, > + pgtable_alloc, flags); > + > + BUG_ON(pud_val(old_pud) != 0 && > + pud_val(old_pud) != READ_ONCE(pud_val(*pudp))); > + } > +} > + > static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end, > phys_addr_t phys, pgprot_t prot, > phys_addr_t (*pgtable_alloc)(int), > @@ -325,37 +476,22 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end, > */ > if (system_state != SYSTEM_BOOTING) > mutex_lock(&fixmap_lock); > + > pudp = pud_set_fixmap_offset(p4dp, addr); > do { > - pud_t old_pud = READ_ONCE(*pudp); > - > next = pud_addr_end(addr, end); > > - /* > - * For 4K granule only, attempt to put down a 1GB block > - */ > - if (pud_sect_supported() && > - ((addr | next | phys) & ~PUD_MASK) == 0 && > - (flags & NO_BLOCK_MAPPINGS) == 0) { > - pud_set_huge(pudp, phys, prot); > - > - /* > - * After the PUD entry has been populated once, we > - * only allow updates to the permission attributes. 
> - */ > - BUG_ON(!pgattr_change_is_safe(pud_val(old_pud), > - READ_ONCE(pud_val(*pudp)))); > - } else { > - alloc_init_cont_pmd(pudp, addr, next, phys, prot, > - pgtable_alloc, flags); > - > - BUG_ON(pud_val(old_pud) != 0 && > - pud_val(old_pud) != READ_ONCE(pud_val(*pudp))); > - } > + if (flags & NO_SEC_REMAPPINGS) > + init_pud_remap(pudp, addr, next, phys, prot, > + pgtable_alloc, flags); > + else > + init_pud(pudp, addr, next, phys, prot, pgtable_alloc, > + flags); > phys += next - addr; > } while (pudp++, addr = next, addr != end); > > pud_clear_fixmap(); > + > if (system_state != SYSTEM_BOOTING) > mutex_unlock(&fixmap_lock); > } > @@ -483,20 +619,37 @@ void __init mark_linear_text_alias_ro(void) > PAGE_KERNEL_RO); > } > > -static bool crash_mem_map __initdata; > - > -static int __init enable_crash_mem_map(char *arg) > +#ifdef CONFIG_KEXEC_CORE > +void __init map_crashkernel(void) > { > - /* > - * Proper parameter parsing is done by reserve_crashkernel(). We only > - * need to know if the linear map has to avoid block mappings so that > - * the crashkernel reservations can be unmapped later. 
> - */ > - crash_mem_map = true; > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > + return; > > - return 0; > + if (!crashk_res.end) > + return; > + > + cpu_replace_ttbr1_with_phys(__pa_symbol(init_pg_dir)); > + init_mm.pgd = init_pg_dir; > + > + __create_pgd_mapping(swapper_pg_dir, crashk_res.start, > + __phys_to_virt(crashk_res.start), > + crashk_res.end + 1 - crashk_res.start, PAGE_KERNEL, > + early_pgtable_alloc, > + NO_EXEC_MAPPINGS | NO_SEC_REMAPPINGS); > + > + cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); > + init_mm.pgd = swapper_pg_dir; > + > + memblock_phys_free(__pa_symbol(init_pg_dir), > + __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); > } > -early_param("crashkernel", enable_crash_mem_map); > +#else > +void __init map_crashkernel(void) > +{ > + memblock_phys_free(__pa_symbol(init_pg_dir), > + __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); > +} > +#endif > > static void __init map_mem(pgd_t *pgdp) > { > @@ -527,17 +680,6 @@ static void __init map_mem(pgd_t *pgdp) > */ > memblock_mark_nomap(kernel_start, kernel_end - kernel_start); > > -#ifdef CONFIG_KEXEC_CORE > - if (crash_mem_map) { > - if (IS_ENABLED(CONFIG_ZONE_DMA) || > - IS_ENABLED(CONFIG_ZONE_DMA32)) > - flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; > - else if (crashk_res.end) > - memblock_mark_nomap(crashk_res.start, > - resource_size(&crashk_res)); > - } > -#endif > - > /* map all the memory banks */ > for_each_mem_range(i, &start, &end) { > if (start >= end) > @@ -570,19 +712,6 @@ static void __init map_mem(pgd_t *pgdp) > * in page granularity and put back unused memory to buddy system > * through /sys/kernel/kexec_crash_size interface. 
> */ > -#ifdef CONFIG_KEXEC_CORE > - if (crash_mem_map && > - !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) { > - if (crashk_res.end) { > - __map_memblock(pgdp, crashk_res.start, > - crashk_res.end + 1, > - PAGE_KERNEL, > - NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS); > - memblock_clear_nomap(crashk_res.start, > - resource_size(&crashk_res)); > - } > - } > -#endif > } > > void mark_rodata_ro(void) > @@ -774,8 +903,8 @@ void __init paging_init(void) > cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); > init_mm.pgd = swapper_pg_dir; > > - memblock_phys_free(__pa_symbol(init_pg_dir), > - __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); > + //memblock_phys_free(__pa_symbol(init_pg_dir), > + // __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); > > memblock_allow_resize(); > } > diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S > index 50bbed9..161bae6 100644 > --- a/arch/arm64/mm/proc.S > +++ b/arch/arm64/mm/proc.S > @@ -178,6 +178,25 @@ SYM_FUNC_END(cpu_do_resume) > isb > .endm > > +SYM_FUNC_START(idmap_cpu_replace_ttbr1_with_flush_tlb) > + save_and_disable_daif flags=x2 > + > + __idmap_cpu_set_reserved_ttbr1 x1, x3 > + > + offset_ttbr1 x0, x3 > + msr ttbr1_el1, x0 > + isb > + > + restore_daif x2 > + > + dsb nshst > + tlbi vmalle1 > + dsb nsh > + isb > + > + ret > +SYM_FUNC_END(idmap_cpu_replace_ttbr1_with_flush_tlb) > + > /* > * void idmap_cpu_replace_ttbr1(phys_addr_t ttbr1) > * ^ permalink raw reply [flat|nested] 79+ messages in thread
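The pte_clear_cont()/pmd_clear_cont() helpers in the patch above step a table-entry pointer back to the first slot of its contiguous group before clearing PTE_CONT, via `ptep -= ((u64)ptep / sizeof(pte_t)) & (CONT_PTES - 1);` (pointer arithmetic, so the subtraction is in entries, not bytes). A minimal userspace sketch of that arithmetic, modelled on raw addresses with explicit scaling — the CONT_PTES value of 16 assumes a 4K granule, and the function name is illustrative, not from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 4K granule: 16 contiguous PTEs share one CONT-marked mapping (assumed). */
#define CONT_PTES 16

/*
 * Model of the step-back in pte_clear_cont(): dividing the slot address by
 * the entry size gives its index in the table, and the low bits of that
 * index say how far into a contiguous group the slot sits. Subtracting
 * that many entries lands on the group's first slot.
 */
static uint64_t cont_group_start(uint64_t ptep, size_t entry_size)
{
	return ptep - ((ptep / entry_size) & (CONT_PTES - 1)) * entry_size;
}
```

With 8-byte entries and a page-aligned table, any slot inside the first 16-entry group resolves to the table base, and slot 16 starts the next group.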
* Re: [PATCH v5] arm64: mm: fix linear mem mapping access performance degradation @ 2022-07-10 15:33 ` guanghui.fgh 0 siblings, 0 replies; 79+ messages in thread From: guanghui.fgh @ 2022-07-10 15:33 UTC (permalink / raw) To: baolin.wang, catalin.marinas, will, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo Cc: alikernel-developer In short, this patch works as follows: 1.Before doing the work of rebuilding the crashkernel mem mapping, the pgd in [[[ttbr1]]] is swapper_pg_dir 2.Change [[[ttbr0]]] to use the idmap_pg_dir pgd 3.[[[idmap_cpu_replace_ttbr1_with_flush_tlb]]] is mapped [[[only]]] by the idmap_pg_dir mapping in [[[ttbr0]]] 4.[[[idmap_cpu_replace_ttbr1_with_flush_tlb]]] will flush the whole tlb and switch [[[ttbr1]]] to use the init_pg_dir pgd (and flush the tlb again). There is no tlb conflict with swapper_pg_dir. There are no tlb entries cached for swapper_pg_dir. 5.Working with the init_pg_dir pgd, access the swapper_pg_dir pagetable through the fixmap, and modify the crashkernel mapping in swapper_pg_dir without any tlb conflict or flush. 6.When finishing the work, switch the ttbr1 pgd back to the original swapper_pg_dir with the cpu_replace_ttbr1 function (similar to the above). On 2022/7/10 21:44, Guanghui Feng wrote: > The arm64 kernel can build 2M/1G block/section mappings. When using a DMA/DMA32 zone > (crashkernel enabled, rodata full disabled, kfence disabled), the mem_map will > use non block/section mappings (because the crashkernel requires shrinking the region > in page granularity). But it will degrade performance when doing large > continuous mem access in the kernel (memcpy/memmove, etc).
> > There are many changes and discussions: > commit 031495635b46 ("arm64: Do not defer reserve_crashkernel() for > platforms with no DMA memory zones") > commit 0a30c53573b0 ("arm64: mm: Move reserve_crashkernel() into > mem_init()") > commit 2687275a5843 ("arm64: Force NO_BLOCK_MAPPINGS if crashkernel > reservation is required") > > This patch changes mem_map to use block/section mappings with crashkernel. > Firstly, do block/section mapping (normally 2M or 1G) for all avail mem at > mem_map and reserve the crashkernel memory. And then walk the pagetable to split > the block/section mapping into non block/section mappings (normally 4K) [[[only]]] > for the crashkernel mem. So the linear mem mapping uses block/section mappings > as much as possible. We will reduce cpu dTLB misses conspicuously, and > accelerate mem access, with about a 10-20% performance improvement. > > I have tested it with pft(Page Fault Test) and fio, and obtained a great > performance improvement. > > For the fio test: > 1.prepare ramdisk > modprobe -r brd > modprobe brd rd_nr=1 rd_size=67108864 > dmsetup remove_all > wipefs -a --force /dev/ram0 > mkfs -t ext4 -E lazy_itable_init=0,lazy_journal_init=0 -q -F /dev/ram0 > mkdir -p /fs/ram0 > mount -t ext4 /dev/ram0 /fs/ram0 > > 2.prepare fio parameters in x.fio file: > [global] > bs=4k > ioengine=psync > iodepth=128 > size=32G > direct=1 > invalidate=1 > group_reporting > thread=1 > rw=read > directory=/fs/ram0 > numjobs=1 > > [task_0] > cpus_allowed=16 > stonewall=1 > > 3.run testcase: > perf stat -e dTLB-load-misses fio x.fio > > 4.contrast > ------------------------ > without patch with patch > fio READ aggrb=1493.2MB/s aggrb=1775.3MB/s > dTLB-load-misses 1,818,320,693 438,729,774 > time elapsed(s) 70.500326434 62.877316408 > user(s) 15.926332000 15.684721000 > sys(s) 54.211939000 47.046165000 > > 5.conclusion > Using this patch will reduce dTLB misses and improve performance greatly.
> > Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com> > --- > arch/arm64/include/asm/mmu.h | 1 + > arch/arm64/include/asm/mmu_context.h | 28 ++++ > arch/arm64/mm/init.c | 8 +- > arch/arm64/mm/mmu.c | 305 +++++++++++++++++++++++++---------- > arch/arm64/mm/proc.S | 19 +++ > 5 files changed, 267 insertions(+), 94 deletions(-) > > diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h > index 48f8466..1a46b81 100644 > --- a/arch/arm64/include/asm/mmu.h > +++ b/arch/arm64/include/asm/mmu.h > @@ -63,6 +63,7 @@ static inline bool arm64_kernel_unmapped_at_el0(void) > extern void arm64_memblock_init(void); > extern void paging_init(void); > extern void bootmem_init(void); > +extern void map_crashkernel(void); > extern void __iomem *early_io_map(phys_addr_t phys, unsigned long virt); > extern void init_mem_pgprot(void); > extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys, > diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h > index 6770667..10dffd4 100644 > --- a/arch/arm64/include/asm/mmu_context.h > +++ b/arch/arm64/include/asm/mmu_context.h > @@ -139,6 +139,34 @@ static inline void cpu_install_ttbr0(phys_addr_t ttbr0, unsigned long t0sz) > isb(); > } > > +static inline void __nocfi cpu_replace_ttbr1_with_phys(phys_addr_t ttbr1) > +{ > + typedef void (ttbr_replace_func)(phys_addr_t); > + extern ttbr_replace_func idmap_cpu_replace_ttbr1_with_flush_tlb; > + ttbr_replace_func *replace_phys; > + > + /* phys_to_ttbr() zeros lower 2 bits of ttbr with 52-bit PA */ > + ttbr1 = phys_to_ttbr(ttbr1); > + > + if (system_supports_cnp()) { > + /* > + * cpu_replace_ttbr1() is used when there's a boot CPU > + * up (i.e. cpufeature framework is not up yet) and > + * latter only when we enable CNP via cpufeature's > + * enable() callback. > + * Also we rely on the cpu_hwcap bit being set before > + * calling the enable() function. 
> + */ > + ttbr1 |= TTBR_CNP_BIT; > + } > + > + replace_phys = (void *)__pa_symbol(function_nocfi(idmap_cpu_replace_ttbr1_with_flush_tlb)); > + > + cpu_install_idmap(); > + replace_phys(ttbr1); > + cpu_uninstall_idmap(); > +} > + > /* > * Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD, > * avoiding the possibility of conflicting TLB entries being allocated. > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c > index 339ee84..241d27e 100644 > --- a/arch/arm64/mm/init.c > +++ b/arch/arm64/mm/init.c > @@ -190,6 +190,7 @@ static void __init reserve_crashkernel(void) > crashk_res.start = crash_base; > crashk_res.end = crash_base + crash_size - 1; > insert_resource(&iomem_resource, &crashk_res); > + map_crashkernel(); > } > > /* > @@ -388,10 +389,6 @@ void __init arm64_memblock_init(void) > } > > early_init_fdt_scan_reserved_mem(); > - > - if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) > - reserve_crashkernel(); > - > high_memory = __va(memblock_end_of_DRAM() - 1) + 1; > } > > @@ -438,8 +435,7 @@ void __init bootmem_init(void) > * request_standard_resources() depends on crashkernel's memory being > * reserved, so do it here. 
> */ > - if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32)) > - reserve_crashkernel(); > + reserve_crashkernel(); > > memblock_dump_all(); > } > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c > index 626ec32..3e30c82 100644 > --- a/arch/arm64/mm/mmu.c > +++ b/arch/arm64/mm/mmu.c > @@ -42,6 +42,7 @@ > #define NO_BLOCK_MAPPINGS BIT(0) > #define NO_CONT_MAPPINGS BIT(1) > #define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */ > +#define NO_SEC_REMAPPINGS BIT(3) /* rebuild with non block/sec mapping*/ > > u64 idmap_t0sz = TCR_T0SZ(VA_BITS_MIN); > u64 idmap_ptrs_per_pgd = PTRS_PER_PGD; > @@ -155,8 +156,22 @@ static bool pgattr_change_is_safe(u64 old, u64 new) > return ((old ^ new) & ~mask) == 0; > } > > +static void pte_clear_cont(pte_t *ptep) > +{ > + int i = 0; > + pte_t pte = READ_ONCE(*ptep); > + if (pte_none(pte) || !pte_cont(pte)) > + return; > + ptep -= ((u64)ptep / sizeof(pte_t)) & > + ((1 << CONFIG_ARM64_CONT_PTE_SHIFT) - 1); > + do { > + pte = pte_mknoncont(READ_ONCE(*ptep)); > + set_pte(ptep, pte); > + } while (++ptep, ++i < CONT_PTES); > +} > + > static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end, > - phys_addr_t phys, pgprot_t prot) > + phys_addr_t phys, pgprot_t prot, int flags) > { > pte_t *ptep; > > @@ -164,6 +179,9 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end, > do { > pte_t old_pte = READ_ONCE(*ptep); > > + if (flags & NO_SEC_REMAPPINGS) > + pte_clear_cont(ptep); > + > set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot)); > > /* > @@ -179,6 +197,22 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end, > pte_clear_fixmap(); > } > > +static inline void allock_pte_page(pmd_t *pmdp, > + phys_addr_t (*pgtable_alloc)(int), > + int flags) > +{ > + pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN; > + phys_addr_t pte_phys; > + if (!pmd_none(READ_ONCE(*pmdp))) > + return; > + > + if (flags & NO_EXEC_MAPPINGS) > + pmdval |= PMD_TABLE_PXN; > + 
BUG_ON(!pgtable_alloc); > + pte_phys = pgtable_alloc(PAGE_SHIFT); > + __pmd_populate(pmdp, pte_phys, pmdval); > +} > + > static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, > unsigned long end, phys_addr_t phys, > pgprot_t prot, > @@ -189,17 +223,8 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, > pmd_t pmd = READ_ONCE(*pmdp); > > BUG_ON(pmd_sect(pmd)); > - if (pmd_none(pmd)) { > - pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN; > - phys_addr_t pte_phys; > - > - if (flags & NO_EXEC_MAPPINGS) > - pmdval |= PMD_TABLE_PXN; > - BUG_ON(!pgtable_alloc); > - pte_phys = pgtable_alloc(PAGE_SHIFT); > - __pmd_populate(pmdp, pte_phys, pmdval); > - pmd = READ_ONCE(*pmdp); > - } > + allock_pte_page(pmdp, pgtable_alloc, flags); > + pmd = READ_ONCE(*pmdp); > BUG_ON(pmd_bad(pmd)); > > do { > @@ -208,16 +233,71 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr, > next = pte_cont_addr_end(addr, end); > > /* use a contiguous mapping if the range is suitably aligned */ > - if ((((addr | next | phys) & ~CONT_PTE_MASK) == 0) && > - (flags & NO_CONT_MAPPINGS) == 0) > + if (!(flags & NO_SEC_REMAPPINGS) && > + (((addr | next | phys) & ~CONT_PTE_MASK) == 0) && > + (flags & NO_CONT_MAPPINGS) == 0) > __prot = __pgprot(pgprot_val(prot) | PTE_CONT); > > - init_pte(pmdp, addr, next, phys, __prot); > + init_pte(pmdp, addr, next, phys, __prot, flags); > > phys += next - addr; > } while (addr = next, addr != end); > } > > +static void pmd_clear_cont(pmd_t *pmdp) > +{ > + int i = 0; > + pmd_t pmd = READ_ONCE(*pmdp); > + if (pmd_none(pmd) || !pmd_sect(pmd) || !pmd_cont(pmd)) > + return; > + pmdp -= ((u64)pmdp / sizeof(pmd_t)) & > + ((1 << CONFIG_ARM64_CONT_PMD_SHIFT) - 1); > + do { > + pmd = READ_ONCE(*pmdp); > + pmd = pte_pmd(pte_mknoncont(pmd_pte(pmd))); > + set_pmd(pmdp, pmd); > + } while (++pmdp, ++i < CONT_PMDS); > +} > + > +static void init_pmd_remap(pud_t *pudp, unsigned long addr, unsigned long end, > + phys_addr_t phys, pgprot_t prot, > + 
phys_addr_t (*pgtable_alloc)(int), int flags) > +{ > + unsigned long next; > + pmd_t *pmdp; > + phys_addr_t map_offset; > + > + pmdp = pmd_set_fixmap_offset(pudp, addr); > + do { > + next = pmd_addr_end(addr, end); > + > + if (!pmd_none(*pmdp) && pmd_sect(*pmdp)) { > + pmd_clear_cont(pmdp); > + pmd_clear(pmdp); > + allock_pte_page(pmdp, pgtable_alloc, flags); > + > + map_offset = addr - (addr & PMD_MASK); > + if (map_offset) > + alloc_init_cont_pte(pmdp, addr & PMD_MASK, addr, > + phys - map_offset, prot, > + pgtable_alloc, > + flags & (~NO_SEC_REMAPPINGS)); > + > + if (next < (addr & PMD_MASK) + PMD_SIZE) > + alloc_init_cont_pte(pmdp, next, > + (addr & PUD_MASK) + PUD_SIZE, > + next - addr + phys, > + prot, pgtable_alloc, > + flags & (~NO_SEC_REMAPPINGS)); > + } > + alloc_init_cont_pte(pmdp, addr, next, phys, prot, > + pgtable_alloc, flags); > + phys += next - addr; > + } while (pmdp++, addr = next, addr != end); > + > + pmd_clear_fixmap(); > +} > + > static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end, > phys_addr_t phys, pgprot_t prot, > phys_addr_t (*pgtable_alloc)(int), int flags) > @@ -255,6 +335,22 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end, > pmd_clear_fixmap(); > } > > +static inline void alloc_pmd_page(pud_t *pudp, phys_addr_t (*pgtable_alloc)(int), int flags) > +{ > + > + pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN; > + phys_addr_t pmd_phys; > + > + if (!pud_none(READ_ONCE(*pudp))) > + return; > + > + if (flags & NO_EXEC_MAPPINGS) > + pudval |= PUD_TABLE_PXN; > + BUG_ON(!pgtable_alloc); > + pmd_phys = pgtable_alloc(PMD_SHIFT); > + __pud_populate(pudp, pmd_phys, pudval); > +} > + > static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, > unsigned long end, phys_addr_t phys, > pgprot_t prot, > @@ -267,17 +363,8 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, > * Check for initial section mappings in the pgd/pud. 
> */ > BUG_ON(pud_sect(pud)); > - if (pud_none(pud)) { > - pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN; > - phys_addr_t pmd_phys; > - > - if (flags & NO_EXEC_MAPPINGS) > - pudval |= PUD_TABLE_PXN; > - BUG_ON(!pgtable_alloc); > - pmd_phys = pgtable_alloc(PMD_SHIFT); > - __pud_populate(pudp, pmd_phys, pudval); > - pud = READ_ONCE(*pudp); > - } > + alloc_pmd_page(pudp, pgtable_alloc, flags); > + pud = READ_ONCE(*pudp); > BUG_ON(pud_bad(pud)); > > do { > @@ -286,16 +373,80 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr, > next = pmd_cont_addr_end(addr, end); > > /* use a contiguous mapping if the range is suitably aligned */ > - if ((((addr | next | phys) & ~CONT_PMD_MASK) == 0) && > + if (!(flags & NO_SEC_REMAPPINGS) && > + (((addr | next | phys) & ~CONT_PMD_MASK) == 0) && > (flags & NO_CONT_MAPPINGS) == 0) > __prot = __pgprot(pgprot_val(prot) | PTE_CONT); > > - init_pmd(pudp, addr, next, phys, __prot, pgtable_alloc, flags); > + if (flags & NO_SEC_REMAPPINGS) > + init_pmd_remap(pudp, addr, next, phys, __prot, > + pgtable_alloc, flags); > + else > + init_pmd(pudp, addr, next, phys, __prot, > + pgtable_alloc, flags); > > phys += next - addr; > } while (addr = next, addr != end); > } > > +static void init_pud_remap(pud_t *pudp, unsigned long addr, unsigned long next, > + phys_addr_t phys, pgprot_t prot, > + phys_addr_t (*pgtable_alloc)(int), > + int flags) > +{ > + phys_addr_t map_offset; > + > + if (!pud_none(*pudp) && pud_sect(*pudp)) { > + pud_clear(pudp); > + alloc_pmd_page(pudp, pgtable_alloc, flags); > + > + map_offset = addr - (addr & PUD_MASK); > + if (map_offset) > + alloc_init_cont_pmd(pudp, addr & PUD_MASK, > + addr, phys - map_offset, > + prot, pgtable_alloc, > + flags & (~NO_SEC_REMAPPINGS)); > + > + if (next < (addr & PUD_MASK) + PUD_SIZE) > + alloc_init_cont_pmd(pudp, next, > + (addr & PUD_MASK) + PUD_SIZE, > + next - addr + phys, > + prot, pgtable_alloc, > + flags & (~NO_SEC_REMAPPINGS)); > + } > + alloc_init_cont_pmd(pudp, addr, 
next, phys, prot, > + pgtable_alloc, flags); > +} > + > +static void init_pud(pud_t *pudp, unsigned long addr, unsigned long next, > + phys_addr_t phys, pgprot_t prot, > + phys_addr_t (*pgtable_alloc)(int), > + int flags) > +{ > + pud_t old_pud = READ_ONCE(*pudp); > + /* > + * For 4K granule only, attempt to put down a 1GB block > + */ > + if (pud_sect_supported() && > + ((addr | next | phys) & ~PUD_MASK) == 0 && > + (flags & NO_BLOCK_MAPPINGS) == 0) { > + pud_set_huge(pudp, phys, prot); > + > + /* > + * After the PUD entry has been populated once, we > + * only allow updates to the permission attributes. > + */ > + BUG_ON(!pgattr_change_is_safe(pud_val(old_pud), > + READ_ONCE(pud_val(*pudp)))); > + } else { > + alloc_init_cont_pmd(pudp, addr, next, phys, prot, > + pgtable_alloc, flags); > + > + BUG_ON(pud_val(old_pud) != 0 && > + pud_val(old_pud) != READ_ONCE(pud_val(*pudp))); > + } > +} > + > static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end, > phys_addr_t phys, pgprot_t prot, > phys_addr_t (*pgtable_alloc)(int), > @@ -325,37 +476,22 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end, > */ > if (system_state != SYSTEM_BOOTING) > mutex_lock(&fixmap_lock); > + > pudp = pud_set_fixmap_offset(p4dp, addr); > do { > - pud_t old_pud = READ_ONCE(*pudp); > - > next = pud_addr_end(addr, end); > > - /* > - * For 4K granule only, attempt to put down a 1GB block > - */ > - if (pud_sect_supported() && > - ((addr | next | phys) & ~PUD_MASK) == 0 && > - (flags & NO_BLOCK_MAPPINGS) == 0) { > - pud_set_huge(pudp, phys, prot); > - > - /* > - * After the PUD entry has been populated once, we > - * only allow updates to the permission attributes. 
> - */ > - BUG_ON(!pgattr_change_is_safe(pud_val(old_pud), > - READ_ONCE(pud_val(*pudp)))); > - } else { > - alloc_init_cont_pmd(pudp, addr, next, phys, prot, > - pgtable_alloc, flags); > - > - BUG_ON(pud_val(old_pud) != 0 && > - pud_val(old_pud) != READ_ONCE(pud_val(*pudp))); > - } > + if (flags & NO_SEC_REMAPPINGS) > + init_pud_remap(pudp, addr, next, phys, prot, > + pgtable_alloc, flags); > + else > + init_pud(pudp, addr, next, phys, prot, pgtable_alloc, > + flags); > phys += next - addr; > } while (pudp++, addr = next, addr != end); > > pud_clear_fixmap(); > + > if (system_state != SYSTEM_BOOTING) > mutex_unlock(&fixmap_lock); > } > @@ -483,20 +619,37 @@ void __init mark_linear_text_alias_ro(void) > PAGE_KERNEL_RO); > } > > -static bool crash_mem_map __initdata; > - > -static int __init enable_crash_mem_map(char *arg) > +#ifdef CONFIG_KEXEC_CORE > +void __init map_crashkernel(void) > { > - /* > - * Proper parameter parsing is done by reserve_crashkernel(). We only > - * need to know if the linear map has to avoid block mappings so that > - * the crashkernel reservations can be unmapped later. 
> - */ > - crash_mem_map = true; > + if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE)) > + return; > > - return 0; > + if (!crashk_res.end) > + return; > + > + cpu_replace_ttbr1_with_phys(__pa_symbol(init_pg_dir)); > + init_mm.pgd = init_pg_dir; > + > + __create_pgd_mapping(swapper_pg_dir, crashk_res.start, > + __phys_to_virt(crashk_res.start), > + crashk_res.end + 1 - crashk_res.start, PAGE_KERNEL, > + early_pgtable_alloc, > + NO_EXEC_MAPPINGS | NO_SEC_REMAPPINGS); > + > + cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); > + init_mm.pgd = swapper_pg_dir; > + > + memblock_phys_free(__pa_symbol(init_pg_dir), > + __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); > } > -early_param("crashkernel", enable_crash_mem_map); > +#else > +void __init map_crashkernel(void) > +{ > + memblock_phys_free(__pa_symbol(init_pg_dir), > + __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); > +} > +#endif > > static void __init map_mem(pgd_t *pgdp) > { > @@ -527,17 +680,6 @@ static void __init map_mem(pgd_t *pgdp) > */ > memblock_mark_nomap(kernel_start, kernel_end - kernel_start); > > -#ifdef CONFIG_KEXEC_CORE > - if (crash_mem_map) { > - if (IS_ENABLED(CONFIG_ZONE_DMA) || > - IS_ENABLED(CONFIG_ZONE_DMA32)) > - flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; > - else if (crashk_res.end) > - memblock_mark_nomap(crashk_res.start, > - resource_size(&crashk_res)); > - } > -#endif > - > /* map all the memory banks */ > for_each_mem_range(i, &start, &end) { > if (start >= end) > @@ -570,19 +712,6 @@ static void __init map_mem(pgd_t *pgdp) > * in page granularity and put back unused memory to buddy system > * through /sys/kernel/kexec_crash_size interface. 
> */ > -#ifdef CONFIG_KEXEC_CORE > - if (crash_mem_map && > - !IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) { > - if (crashk_res.end) { > - __map_memblock(pgdp, crashk_res.start, > - crashk_res.end + 1, > - PAGE_KERNEL, > - NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS); > - memblock_clear_nomap(crashk_res.start, > - resource_size(&crashk_res)); > - } > - } > -#endif > } > > void mark_rodata_ro(void) > @@ -774,8 +903,8 @@ void __init paging_init(void) > cpu_replace_ttbr1(lm_alias(swapper_pg_dir)); > init_mm.pgd = swapper_pg_dir; > > - memblock_phys_free(__pa_symbol(init_pg_dir), > - __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); > + //memblock_phys_free(__pa_symbol(init_pg_dir), > + // __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir)); > > memblock_allow_resize(); > } > diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S > index 50bbed9..161bae6 100644 > --- a/arch/arm64/mm/proc.S > +++ b/arch/arm64/mm/proc.S > @@ -178,6 +178,25 @@ SYM_FUNC_END(cpu_do_resume) > isb > .endm > > +SYM_FUNC_START(idmap_cpu_replace_ttbr1_with_flush_tlb) > + save_and_disable_daif flags=x2 > + > + __idmap_cpu_set_reserved_ttbr1 x1, x3 > + > + offset_ttbr1 x0, x3 > + msr ttbr1_el1, x0 > + isb > + > + restore_daif x2 > + > + dsb nshst > + tlbi vmalle1 > + dsb nsh > + isb > + > + ret > +SYM_FUNC_END(idmap_cpu_replace_ttbr1_with_flush_tlb) > + > /* > * void idmap_cpu_replace_ttbr1(phys_addr_t ttbr1) > * _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 79+ messages in thread
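The splitting logic in init_pud_remap()/init_pmd_remap() above clears one block entry, installs a table, and then re-maps the two pieces of the block that fall outside the crashkernel range with block/cont mappings still permitted (the `map_offset` head and the `next`-to-block-end tail). A rough userspace model of that range arithmetic — PUD_SHIFT of 30 (1G blocks) assumes a 4K granule, and the helper names are illustrative, not from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* 4K granule, 1G PUD blocks (assumed configuration). */
#define PUD_SHIFT 30
#define PUD_SIZE  (1ULL << PUD_SHIFT)
#define PUD_MASK  (~(PUD_SIZE - 1))

/* Head of the split block outside [addr, next): [block base, addr).
 * This length is what the patch calls "map_offset". */
static uint64_t split_head_len(uint64_t addr)
{
	return addr - (addr & PUD_MASK);
}

/* Tail of the split block outside [addr, next): [next, block base + PUD_SIZE). */
static uint64_t split_tail_len(uint64_t addr, uint64_t next)
{
	uint64_t block_end = (addr & PUD_MASK) + PUD_SIZE;

	return next < block_end ? block_end - next : 0;
}
```

For a 2M crashkernel range starting 2M into a 1G block, the head is 2M and the tail is the remaining 1G - 4M; both lengths are zero only when the range covers the whole block, in which case no re-mapping of neighbours is needed.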
* Re: [PATCH v5] arm64: mm: fix linear mem mapping access performance degradation 2022-07-10 15:33 ` guanghui.fgh @ 2022-07-18 13:10 ` Will Deacon -1 siblings, 0 replies; 79+ messages in thread From: Will Deacon @ 2022-07-18 13:10 UTC (permalink / raw) To: guanghui.fgh Cc: baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo, alikernel-developer On Sun, Jul 10, 2022 at 11:33:02PM +0800, guanghui.fgh wrote: > In short, this patch works as follows: > > 1. Before rebuilding the crashkernel mem mapping, the pgd in [[[ttbr1]]] is swapper_pg_dir > > 2. Change [[[ttbr0]]] to use the idmap_pg_dir pgd > > 3. [[[idmap_cpu_replace_ttbr1_with_flush_tlb]]] is mapped [[[only]]] through the idmap_pg_dir mapping in [[[ttbr0]]] > > 4. [[[idmap_cpu_replace_ttbr1_with_flush_tlb]]] flushes the whole tlb and switches [[[ttbr1]]] to the init_pg_dir pgd (then flushes the tlb again). > There is no tlb conflict with swapper_pg_dir. > There are no tlb entries cached for swapper_pg_dir. > > 5. Working with the init_pg_dir pgd, access the swapper_pg_dir pagetable through the fixmap, and modify the crashkernel mapping in swapper_pg_dir without any tlb conflict or flush. > > 6. When the work is finished, switch the ttbr1 pgd back to the original swapper_pg_dir with the cpu_replace_ttbr1 function (similar to the above). I do not think that this complexity is justified. As I have stated on numerous occasions already, I would prefer that we leave the crashkernel mapped when rodata is not "full". That fixes your performance issue and matches what we do for module code, so I do not see a security argument against it. I do not plan to merge this patch as-is. Thanks, Will ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v5] arm64: mm: fix linear mem mapping access performance degradation 2022-07-18 13:10 ` Will Deacon @ 2022-07-25 6:46 ` Mike Rapoport -1 siblings, 0 replies; 79+ messages in thread From: Mike Rapoport @ 2022-07-25 6:46 UTC (permalink / raw) To: Will Deacon Cc: guanghui.fgh, baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, geert+renesas, ardb, linux-mm, yaohongbo, alikernel-developer On Mon, Jul 18, 2022 at 02:10:06PM +0100, Will Deacon wrote: > On Sun, Jul 10, 2022 at 11:33:02PM +0800, guanghui.fgh wrote: > > In short, this path work: > > > > 1.Before doing work for rebuiling crashkernel mem, the pgd is swapper_pg_dir > > in [[[ttbr1]]] > > > > 2.Change the [[[ttbr0]]]to use idmap_pg_dir pgd > > > > 3.The [[[idmap_cpu_replace_ttbr1_with_flush_tlb]]] are mapped [[[only]]] > > with idmap_pg_dir mapping in [[[ttbr0]]] > > > > 4.The [[[idmap_cpu_replace_ttbr1_with_flush_tlb]]] will flush tlb all, > > switch [[[ttbr1]]] to use init_pg_dir pgd(and flush tlb all again). > > There is no tlb conflict to swapper_pg_dir. > > There is no tlb cache for swapper_pg_dir. > > > > 5.Woring with init_pg_dir pgd to access swapper_pg_dir pagetable with fix > > mapping. And modify crashkernel mapping in the swapper_pg_dir without any > > tlb conflict and flush. > > > > 6.When finishing the work, switch ttbr1 pgd to the origin swapper_pg_dir > > with cpu_replace_ttbr1 function(similar to the above). > > I do not think that this complexity is justified. As I have stated on > numerous occasions already, I would prefer that we leave the crashkernel > mapped when rodata is not "full". That fixes your performance issue and > matches what we do for module code, so I do not see a security argument > against it. The protection of the crash kernel is not about security. 
This is about the ability to do a post mortem with kdump and such when things go awry. I think it's possible to have both large pages in the linear map and the protection of the crash kernel in a much simpler way than this patch does; I'll try to send an RFC Really Soon. > I do not plan to merge this patch as-is. > > Thanks, > > Will -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v4] arm64: mm: fix linear mem mapping access performance degradation 2022-07-04 16:38 ` Will Deacon @ 2022-07-05 2:44 ` guanghui.fgh -1 siblings, 0 replies; 79+ messages in thread From: guanghui.fgh @ 2022-07-05 2:44 UTC (permalink / raw) To: Will Deacon Cc: baolin.wang, catalin.marinas, akpm, david, jianyong.wu, james.morse, quic_qiancai, christophe.leroy, jonathan, mark.rutland, thunder.leizhen, anshuman.khandual, linux-arm-kernel, linux-kernel, rppt, geert+renesas, ardb, linux-mm, yaohongbo, alikernel-developer On 2022/7/5 0:38, Will Deacon wrote: > On Mon, Jul 04, 2022 at 10:34:07PM +0800, guanghui.fgh wrote: >> Thanks. >> >> On 2022/7/4 22:23, Will Deacon wrote: >>> On Mon, Jul 04, 2022 at 10:11:27PM +0800, guanghui.fgh wrote: >>>> On 2022/7/4 21:15, Will Deacon wrote: >>>>> On Mon, Jul 04, 2022 at 08:05:59PM +0800, guanghui.fgh wrote: >>>>>>>> 1.Quoted messages from arch/arm64/mm/init.c >>>>>>>> >>>>>>>> "Memory reservation for crash kernel either done early or deferred >>>>>>>> depending on DMA memory zones configs (ZONE_DMA) -- >>>>>>>> >>>>>>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized >>>>>>>> here instead of max_zone_phys(). This lets early reservation of >>>>>>>> crash kernel memory which has a dependency on arm64_dma_phys_limit. >>>>>>>> Reserving memory early for crash kernel allows linear creation of block >>>>>>>> mappings (greater than page-granularity) for all the memory bank rangs. >>>>>>>> In this scheme a comparatively quicker boot is observed. >>>>>>>> >>>>>>>> If ZONE_DMA configs are defined, crash kernel memory reservation >>>>>>>> is delayed until DMA zone memory range size initialization performed in >>>>>>>> zone_sizes_init(). The defer is necessary to steer clear of DMA zone >>>>>>>> memory range to avoid overlap allocation.
>>>>>>>> >>>>>>>> [[[ >>>>>>>> So crash kernel memory boundaries are not known when mapping all bank memory >>>>>>>> ranges, which otherwise means not possible to exclude crash kernel range >>>>>>>> from creating block mappings so page-granularity mappings are created for >>>>>>>> the entire memory range. >>>>>>>> ]]]" >>>>>>>> >>>>>>>> Namely, the init order is: memblock init ---> linear mem mapping (4k mapping for >>>>>>>> crashkernel, requiring page-granularity changes) ---> zone dma >>>>>>>> limit ---> reserve crashkernel. >>>>>>>> So with ZONE_DMA enabled and a crashkernel in use, the mem mapping uses 4k >>>>>>>> mappings. >>>>>>> >>>>>>> Yes, I understand that is how things work today but I'm saying that we may >>>>>>> as well leave the crashkernel mapped (at block granularity) if >>>>>>> !can_set_direct_map() and then I think your patch becomes a lot simpler. >>>>>> >>>>>> But page-granularity mappings are necessary for the crash kernel memory range >>>>>> for shrinking its size via the /sys/kernel/kexec_crash_size interface (quoted from >>>>>> arch/arm64/mm/init.c). >>>>>> So this patch splits the block/section mapping into 4k page-granularity mappings for >>>>>> crashkernel mem. >>>>> >>>>> Why? I don't see why the mapping granularity is relevant at all if we >>>>> always leave the whole thing mapped. >>>>> >>>> There is another reason. >>>> >>>> When loading the crashkernel finishes, do_kexec_load will use >>>> arch_kexec_protect_crashkres to invalidate all the pagetable entries for crashkernel >>>> mem (protecting crashkernel mem from access). >>>> >>>> arch_kexec_protect_crashkres--->set_memory_valid--->...--->apply_to_pmd_range >>>> >>>> In apply_to_pmd_range, there is a check: BUG_ON(pud_huge(*pud)). And >>>> if the crashkernel uses block/section mappings, this will trigger an error. >>>> >>>> Namely, it's necessary to use non block/section mappings for crashkernel mem >>>> before shrinking.
>>> >>> Well, yes, but we can change arch_kexec_[un]protect_crashkres() not to do >>> that if we're leaving the thing mapped, no? >>> >> I think we should use arch_kexec_[un]protect_crashkres for crashkernel mem. >> >> Because when the crashkernel mem pagetable is invalidated, there is no chance of rd/wr >> of the crashkernel mem by mistake. >> >> If we don't use arch_kexec_[un]protect_crashkres to invalidate the crashkernel mem >> pagetable, there may be some write operations to this mem by mistake, which >> may cause crashkernel boot errors and vmcore saving errors. > > I don't really buy this line of reasoning. The entire main kernel is > writable, so why do we care about protecting the crashkernel so much? The > _code_ to launch the crash kernel is writable! If you care about preventing > writes to memory which should not be writable, then you should use > rodata=full. > > Will Thanks. 1. I think the normal kernel mem can be writeable or readonly. 2. But the normal kernel shouldn't access (read/write) crashkernel mem after the crashkernel has been loaded into it (regardless of the normal kernel's write/read access protection). So invalidating the pagetable for crashkernel mem is needed. With this method, it can guarantee the usability of the crashkernel when an emergency occurs and a vmcore is generated. ^ permalink raw reply [flat|nested] 79+ messages in thread