linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] arm64/mm: Enable color zero pages
@ 2020-09-16  3:25 Gavin Shan
  2020-09-16  3:25 ` [PATCH 1/2] arm64/mm: Introduce zero PGD table Gavin Shan
  2020-09-16  3:25 ` [PATCH 2/2] arm64/mm: Enable color zero pages Gavin Shan
  0 siblings, 2 replies; 12+ messages in thread
From: Gavin Shan @ 2020-09-16  3:25 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: mark.rutland, anshuman.khandual, catalin.marinas, will,
	linux-kernel, shan.gavin

The color zero page feature isn't enabled on arm64, meaning all
read-only (anonymous) VM areas are backed by the same zero page. This
puts pressure on the L1 (data) cache when reading data from them. This
series enables color zero pages.

PATCH[1/2] decouples the zero PGD table from the zero page
PATCH[2/2] allocates the needed zero pages according to the L1 cache size
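
As a rough illustration of the idea (using the names introduced by the
patches below), ZERO_PAGE() selects one of a small pool of contiguous
zero pages based on the virtual address, so reads through differently
colored virtual pages land in different L1 cache sets:

    /*
     * Sketch only: empty_zero_page points at a contiguous run of zeroed
     * pages sized after the L1 cache, and zero_page_mask picks the page
     * whose cache color matches the given virtual address.
     */
    #define ZERO_PAGE(vaddr)				\
	(virt_to_page((void *)(empty_zero_page +	\
	(((unsigned long)(vaddr)) & zero_page_mask))))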

Gavin Shan (2):
  arm64/mm: Introduce zero PGD table
  arm64/mm: Enable color zero pages

 arch/arm64/include/asm/cache.h       | 22 +++++++++++++++++
 arch/arm64/include/asm/mmu_context.h |  6 ++---
 arch/arm64/include/asm/pgtable.h     | 11 +++++++--
 arch/arm64/kernel/cacheinfo.c        | 34 +++++++++++++++++++++++++++
 arch/arm64/kernel/setup.c            |  2 +-
 arch/arm64/kernel/vmlinux.lds.S      |  4 ++++
 arch/arm64/mm/init.c                 | 35 ++++++++++++++++++++++++++++
 arch/arm64/mm/mmu.c                  |  7 ------
 arch/arm64/mm/proc.S                 |  2 +-
 9 files changed, 109 insertions(+), 14 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] arm64/mm: Introduce zero PGD table
  2020-09-16  3:25 [PATCH 0/2] arm64/mm: Enable color zero pages Gavin Shan
@ 2020-09-16  3:25 ` Gavin Shan
  2020-09-16  3:25 ` [PATCH 2/2] arm64/mm: Enable color zero pages Gavin Shan
  1 sibling, 0 replies; 12+ messages in thread
From: Gavin Shan @ 2020-09-16  3:25 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: mark.rutland, anshuman.khandual, catalin.marinas, will,
	linux-kernel, shan.gavin

The zero PGD table is what TTBR0_EL1/TTBR1_EL1 point at when they are
set to a reserved (empty) translation table, for example while they are
being switched. Currently the zero page is used for this. Since the
zero page(s) will be allocated dynamically when the color zero page
feature is enabled in the subsequent patch, the zero page(s) aren't
usable during the early boot stage.

This introduces a dedicated zero PGD table (zero_pg_dir), decoupled
from the zero page(s).

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/include/asm/mmu_context.h | 6 +++---
 arch/arm64/include/asm/pgtable.h     | 2 ++
 arch/arm64/kernel/setup.c            | 2 +-
 arch/arm64/kernel/vmlinux.lds.S      | 4 ++++
 arch/arm64/mm/proc.S                 | 2 +-
 5 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index f2d7537d6f83..6dbc5726fd56 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -36,11 +36,11 @@ static inline void contextidr_thread_switch(struct task_struct *next)
 }
 
 /*
- * Set TTBR0 to empty_zero_page. No translations will be possible via TTBR0.
+ * Set TTBR0 to zero_pg_dir. No translations will be possible via TTBR0.
  */
 static inline void cpu_set_reserved_ttbr0(void)
 {
-	unsigned long ttbr = phys_to_ttbr(__pa_symbol(empty_zero_page));
+	unsigned long ttbr = phys_to_ttbr(__pa_symbol(zero_pg_dir));
 
 	write_sysreg(ttbr, ttbr0_el1);
 	isb();
@@ -189,7 +189,7 @@ static inline void update_saved_ttbr0(struct task_struct *tsk,
 		return;
 
 	if (mm == &init_mm)
-		ttbr = __pa_symbol(empty_zero_page);
+		ttbr = __pa_symbol(zero_pg_dir);
 	else
 		ttbr = virt_to_phys(mm->pgd) | ASID(mm) << 48;
 
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index d5d3fbe73953..6953498f4d40 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -474,6 +474,8 @@ static inline bool pud_table(pud_t pud) { return true; }
 				 PUD_TYPE_TABLE)
 #endif
 
+extern pgd_t zero_pg_dir[PTRS_PER_PGD];
+extern pgd_t zero_pg_end[];
 extern pgd_t init_pg_dir[PTRS_PER_PGD];
 extern pgd_t init_pg_end[];
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 53acbeca4f57..7e83eaed641e 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -366,7 +366,7 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
 	 * faults in case uaccess_enable() is inadvertently called by the init
 	 * thread.
 	 */
-	init_task.thread_info.ttbr0 = __pa_symbol(empty_zero_page);
+	init_task.thread_info.ttbr0 = __pa_symbol(zero_pg_dir);
 #endif
 
 	if (boot_args[1] || boot_args[2] || boot_args[3]) {
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 7cba7623fcec..3d3c155d10a4 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -137,6 +137,10 @@ SECTIONS
 	/* everything from this point to __init_begin will be marked RO NX */
 	RO_DATA(PAGE_SIZE)
 
+	zero_pg_dir = .;
+	. += PAGE_SIZE;
+	zero_pg_end = .;
+
 	idmap_pg_dir = .;
 	. += IDMAP_DIR_SIZE;
 	idmap_pg_end = .;
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 796e47a571e6..90b135c366b3 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -163,7 +163,7 @@ SYM_FUNC_END(cpu_do_resume)
 	.pushsection ".idmap.text", "awx"
 
 .macro	__idmap_cpu_set_reserved_ttbr1, tmp1, tmp2
-	adrp	\tmp1, empty_zero_page
+	adrp	\tmp1, zero_pg_dir
 	phys_to_ttbr \tmp2, \tmp1
 	offset_ttbr1 \tmp2, \tmp1
 	msr	ttbr1_el1, \tmp2
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/2] arm64/mm: Enable color zero pages
  2020-09-16  3:25 [PATCH 0/2] arm64/mm: Enable color zero pages Gavin Shan
  2020-09-16  3:25 ` [PATCH 1/2] arm64/mm: Introduce zero PGD table Gavin Shan
@ 2020-09-16  3:25 ` Gavin Shan
  2020-09-16  8:28   ` Will Deacon
  2020-09-18 12:10   ` kernel test robot
  1 sibling, 2 replies; 12+ messages in thread
From: Gavin Shan @ 2020-09-16  3:25 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: mark.rutland, anshuman.khandual, catalin.marinas, will,
	linux-kernel, shan.gavin

This enables color zero pages by allocating contiguous page frames
for them. The number of pages is determined by the L1 dCache (or
iCache) size, which is probed from the hardware.

   * Add cache_total_size() to return the L1 dCache (or iCache) size

   * Implement setup_zero_pages(), which is called after the page
     allocator begins to work, to allocate the contiguous pages
     needed by the color zero pages.

   * Rework ZERO_PAGE() and define __HAVE_COLOR_ZERO_PAGE.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/include/asm/cache.h   | 22 ++++++++++++++++++++
 arch/arm64/include/asm/pgtable.h |  9 ++++++--
 arch/arm64/kernel/cacheinfo.c    | 34 +++++++++++++++++++++++++++++++
 arch/arm64/mm/init.c             | 35 ++++++++++++++++++++++++++++++++
 arch/arm64/mm/mmu.c              |  7 -------
 5 files changed, 98 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index a4d1b5f771f6..420e9dde2c51 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -39,6 +39,27 @@
 #define CLIDR_LOC(clidr)	(((clidr) >> CLIDR_LOC_SHIFT) & 0x7)
 #define CLIDR_LOUIS(clidr)	(((clidr) >> CLIDR_LOUIS_SHIFT) & 0x7)
 
+#define CSSELR_TND_SHIFT	4
+#define CSSELR_TND_MASK		(UL(1) << CSSELR_TND_SHIFT)
+#define CSSELR_LEVEL_SHIFT	1
+#define CSSELR_LEVEL_MASK	(UL(7) << CSSELR_LEVEL_SHIFT)
+#define CSSELR_IND_SHIFT	0
+#define CSSERL_IND_MASK		(UL(1) << CSSELR_IND_SHIFT)
+
+#define CCSIDR_64_LS_SHIFT	0
+#define CCSIDR_64_LS_MASK	(UL(7) << CCSIDR_64_LS_SHIFT)
+#define CCSIDR_64_ASSOC_SHIFT	3
+#define CCSIDR_64_ASSOC_MASK	(UL(0x1FFFFF) << CCSIDR_64_ASSOC_SHIFT)
+#define CCSIDR_64_SET_SHIFT	32
+#define CCSIDR_64_SET_MASK	(UL(0xFFFFFF) << CCSIDR_64_SET_SHIFT)
+
+#define CCSIDR_32_LS_SHIFT	0
+#define CCSIDR_32_LS_MASK	(UL(7) << CCSIDR_32_LS_SHIFT)
+#define CCSIDR_32_ASSOC_SHIFT	3
+#define CCSIDR_32_ASSOC_MASK	(UL(0x3FF) << CCSIDR_32_ASSOC_SHIFT)
+#define CCSIDR_32_SET_SHIFT	13
+#define CCSIDR_32_SET_MASK	(UL(0x7FFF) << CCSIDR_32_SET_SHIFT)
+
 /*
  * Memory returned by kmalloc() may be used for DMA, so we must make
  * sure that all such allocations are cache aligned. Otherwise,
@@ -89,6 +110,7 @@ static inline int cache_line_size_of_cpu(void)
 }
 
 int cache_line_size(void);
+int cache_total_size(void);
 
 /*
  * Read the effective value of CTR_EL0.
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 6953498f4d40..5cb5f8bb090d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -54,8 +54,13 @@ extern void __pgd_error(const char *file, int line, unsigned long val);
  * ZERO_PAGE is a global shared page that is always zero: used
  * for zero-mapped memory areas etc..
  */
-extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
-#define ZERO_PAGE(vaddr)	phys_to_page(__pa_symbol(empty_zero_page))
+extern unsigned long empty_zero_page;
+extern unsigned long zero_page_mask;
+
+#define __HAVE_COLOR_ZERO_PAGE
+#define ZERO_PAGE(vaddr)				\
+	(virt_to_page((void *)(empty_zero_page +	\
+	(((unsigned long)(vaddr)) & zero_page_mask))))
 
 #define pte_ERROR(pte)		__pte_error(__FILE__, __LINE__, pte_val(pte))
 
diff --git a/arch/arm64/kernel/cacheinfo.c b/arch/arm64/kernel/cacheinfo.c
index 7fa6828bb488..d3b9ab757014 100644
--- a/arch/arm64/kernel/cacheinfo.c
+++ b/arch/arm64/kernel/cacheinfo.c
@@ -43,6 +43,40 @@ static void ci_leaf_init(struct cacheinfo *this_leaf,
 	this_leaf->type = type;
 }
 
+int cache_total_size(void)
+{
+	unsigned int ctype, size;
+	unsigned long val;
+	bool ccidx = false;
+
+	/* Check first level cache is supported */
+	ctype = get_cache_type(1);
+	if (ctype == CACHE_TYPE_NOCACHE)
+		return 0;
+
+	/* ARMv8.3-CCIDX is supported or not */
+	val = read_sanitised_ftr_reg(SYS_ID_MMFR4_EL1);
+	ccidx = !!(val & (UL(0xF) << ID_AA64MMFR2_CCIDX_SHIFT));
+
+	/* Retrieve the information and calculate the total size */
+	val = FIELD_PREP(CSSELR_LEVEL_MASK, 0) |
+	      FIELD_PREP(CSSERL_IND_MASK, 0);
+	write_sysreg(val, csselr_el1);
+
+	val = read_sysreg(ccsidr_el1);
+	if (ccidx) {
+		size = (1 << FIELD_GET(CCSIDR_64_LS_MASK, val)) *
+		       (FIELD_GET(CCSIDR_64_ASSOC_MASK, val) + 1) *
+		       (FIELD_GET(CCSIDR_64_SET_MASK, val) + 1);
+	} else {
+		size = (1 << FIELD_GET(CCSIDR_32_LS_MASK, val)) *
+		       (FIELD_GET(CCSIDR_32_ASSOC_MASK, val) + 1) *
+		       (FIELD_GET(CCSIDR_32_SET_MASK, val) + 1);
+	}
+
+	return size;
+}
+
 static int __init_cache_level(unsigned int cpu)
 {
 	unsigned int ctype, level, leaves, fw_level;
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 481d22c32a2e..ca6b3cddafb7 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -69,6 +69,11 @@ EXPORT_SYMBOL(vmemmap);
 phys_addr_t arm64_dma_phys_limit __ro_after_init;
 static phys_addr_t arm64_dma32_phys_limit __ro_after_init;
 
+unsigned long empty_zero_page;
+EXPORT_SYMBOL(empty_zero_page);
+unsigned long zero_page_mask;
+EXPORT_SYMBOL(zero_page_mask);
+
 #ifdef CONFIG_KEXEC_CORE
 /*
  * reserve_crashkernel() - reserves memory for crash kernel
@@ -507,6 +512,35 @@ static void __init free_unused_memmap(void)
 }
 #endif	/* !CONFIG_SPARSEMEM_VMEMMAP */
 
+static void __init setup_zero_pages(void)
+{
+	struct page *page;
+	int order, size, i;
+
+	size = cache_total_size();
+	order = size > 0 ? get_order(PAGE_ALIGN(size)) : 0;
+	order = min(order, MAX_ORDER - 1);
+
+	do {
+		empty_zero_page = __get_free_pages(GFP_KERNEL | __GFP_ZERO,
+						   order);
+		if (empty_zero_page)
+			break;
+	} while (--order >= 0);
+
+	if (!empty_zero_page)
+		panic("%s: out of memory\n", __func__);
+
+	page = virt_to_page((void *) empty_zero_page);
+	split_page(page, order);
+	for (i = 1 << order; i > 0; i--) {
+		mark_page_reserved(page);
+		page++;
+	}
+
+	zero_page_mask = ((PAGE_SIZE << order) - 1) & PAGE_MASK;
+}
+
 /*
  * mem_init() marks the free areas in the mem_map and tells us how much memory
  * is free.  This is done after various parts of the system have claimed their
@@ -527,6 +561,7 @@ void __init mem_init(void)
 #endif
 	/* this will put all unused low memory onto the freelists */
 	memblock_free_all();
+	setup_zero_pages();
 
 	mem_init_print_info(NULL);
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 75df62fea1b6..736939ab3b4f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -49,13 +49,6 @@ EXPORT_SYMBOL(vabits_actual);
 u64 kimage_voffset __ro_after_init;
 EXPORT_SYMBOL(kimage_voffset);
 
-/*
- * Empty_zero_page is a special page that is used for zero-initialized data
- * and COW.
- */
-unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)] __page_aligned_bss;
-EXPORT_SYMBOL(empty_zero_page);
-
 static pte_t bm_pte[PTRS_PER_PTE] __page_aligned_bss;
 static pmd_t bm_pmd[PTRS_PER_PMD] __page_aligned_bss __maybe_unused;
 static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss __maybe_unused;
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] arm64/mm: Enable color zero pages
  2020-09-16  3:25 ` [PATCH 2/2] arm64/mm: Enable color zero pages Gavin Shan
@ 2020-09-16  8:28   ` Will Deacon
  2020-09-16 10:46     ` Robin Murphy
  2020-09-17  3:35     ` Gavin Shan
  2020-09-18 12:10   ` kernel test robot
  1 sibling, 2 replies; 12+ messages in thread
From: Will Deacon @ 2020-09-16  8:28 UTC (permalink / raw)
  To: Gavin Shan
  Cc: linux-arm-kernel, mark.rutland, anshuman.khandual,
	catalin.marinas, linux-kernel, shan.gavin

On Wed, Sep 16, 2020 at 01:25:23PM +1000, Gavin Shan wrote:
> This enables color zero pages by allocating contigous page frames
> for it. The number of pages for this is determined by L1 dCache
> (or iCache) size, which is probbed from the hardware.
> 
>    * Add cache_total_size() to return L1 dCache (or iCache) size
> 
>    * Implement setup_zero_pages(), which is called after the page
>      allocator begins to work, to allocate the contigous pages
>      needed by color zero page.
> 
>    * Reworked ZERO_PAGE() and define __HAVE_COLOR_ZERO_PAGE.
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  arch/arm64/include/asm/cache.h   | 22 ++++++++++++++++++++
>  arch/arm64/include/asm/pgtable.h |  9 ++++++--
>  arch/arm64/kernel/cacheinfo.c    | 34 +++++++++++++++++++++++++++++++
>  arch/arm64/mm/init.c             | 35 ++++++++++++++++++++++++++++++++
>  arch/arm64/mm/mmu.c              |  7 -------
>  5 files changed, 98 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> index a4d1b5f771f6..420e9dde2c51 100644
> --- a/arch/arm64/include/asm/cache.h
> +++ b/arch/arm64/include/asm/cache.h
> @@ -39,6 +39,27 @@
>  #define CLIDR_LOC(clidr)	(((clidr) >> CLIDR_LOC_SHIFT) & 0x7)
>  #define CLIDR_LOUIS(clidr)	(((clidr) >> CLIDR_LOUIS_SHIFT) & 0x7)
>  
> +#define CSSELR_TND_SHIFT	4
> +#define CSSELR_TND_MASK		(UL(1) << CSSELR_TND_SHIFT)
> +#define CSSELR_LEVEL_SHIFT	1
> +#define CSSELR_LEVEL_MASK	(UL(7) << CSSELR_LEVEL_SHIFT)
> +#define CSSELR_IND_SHIFT	0
> +#define CSSERL_IND_MASK		(UL(1) << CSSELR_IND_SHIFT)
> +
> +#define CCSIDR_64_LS_SHIFT	0
> +#define CCSIDR_64_LS_MASK	(UL(7) << CCSIDR_64_LS_SHIFT)
> +#define CCSIDR_64_ASSOC_SHIFT	3
> +#define CCSIDR_64_ASSOC_MASK	(UL(0x1FFFFF) << CCSIDR_64_ASSOC_SHIFT)
> +#define CCSIDR_64_SET_SHIFT	32
> +#define CCSIDR_64_SET_MASK	(UL(0xFFFFFF) << CCSIDR_64_SET_SHIFT)
> +
> +#define CCSIDR_32_LS_SHIFT	0
> +#define CCSIDR_32_LS_MASK	(UL(7) << CCSIDR_32_LS_SHIFT)
> +#define CCSIDR_32_ASSOC_SHIFT	3
> +#define CCSIDR_32_ASSOC_MASK	(UL(0x3FF) << CCSIDR_32_ASSOC_SHIFT)
> +#define CCSIDR_32_SET_SHIFT	13
> +#define CCSIDR_32_SET_MASK	(UL(0x7FFF) << CCSIDR_32_SET_SHIFT)

I don't think we should be inferring cache structure from these register
values. The Arm ARM helpfully says:

  | You cannot make any inference about the actual sizes of caches based
  | on these parameters.

so we need to take the topology information from elsewhere.

But before we get into that, can you justify why we need to do this at all,
please? Do you have data to show the benefit of adding this complexity?

Cheers,

Will

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] arm64/mm: Enable color zero pages
  2020-09-16  8:28   ` Will Deacon
@ 2020-09-16 10:46     ` Robin Murphy
  2020-09-17  4:36       ` Gavin Shan
  2020-09-17  3:35     ` Gavin Shan
  1 sibling, 1 reply; 12+ messages in thread
From: Robin Murphy @ 2020-09-16 10:46 UTC (permalink / raw)
  To: Will Deacon, Gavin Shan
  Cc: mark.rutland, anshuman.khandual, catalin.marinas, linux-kernel,
	shan.gavin, linux-arm-kernel

On 2020-09-16 09:28, Will Deacon wrote:
> On Wed, Sep 16, 2020 at 01:25:23PM +1000, Gavin Shan wrote:
>> This enables color zero pages by allocating contigous page frames
>> for it. The number of pages for this is determined by L1 dCache
>> (or iCache) size, which is probbed from the hardware.
>>
>>     * Add cache_total_size() to return L1 dCache (or iCache) size
>>
>>     * Implement setup_zero_pages(), which is called after the page
>>       allocator begins to work, to allocate the contigous pages
>>       needed by color zero page.
>>
>>     * Reworked ZERO_PAGE() and define __HAVE_COLOR_ZERO_PAGE.
>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   arch/arm64/include/asm/cache.h   | 22 ++++++++++++++++++++
>>   arch/arm64/include/asm/pgtable.h |  9 ++++++--
>>   arch/arm64/kernel/cacheinfo.c    | 34 +++++++++++++++++++++++++++++++
>>   arch/arm64/mm/init.c             | 35 ++++++++++++++++++++++++++++++++
>>   arch/arm64/mm/mmu.c              |  7 -------
>>   5 files changed, 98 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>> index a4d1b5f771f6..420e9dde2c51 100644
>> --- a/arch/arm64/include/asm/cache.h
>> +++ b/arch/arm64/include/asm/cache.h
>> @@ -39,6 +39,27 @@
>>   #define CLIDR_LOC(clidr)	(((clidr) >> CLIDR_LOC_SHIFT) & 0x7)
>>   #define CLIDR_LOUIS(clidr)	(((clidr) >> CLIDR_LOUIS_SHIFT) & 0x7)
>>   
>> +#define CSSELR_TND_SHIFT	4
>> +#define CSSELR_TND_MASK		(UL(1) << CSSELR_TND_SHIFT)
>> +#define CSSELR_LEVEL_SHIFT	1
>> +#define CSSELR_LEVEL_MASK	(UL(7) << CSSELR_LEVEL_SHIFT)
>> +#define CSSELR_IND_SHIFT	0
>> +#define CSSERL_IND_MASK		(UL(1) << CSSELR_IND_SHIFT)
>> +
>> +#define CCSIDR_64_LS_SHIFT	0
>> +#define CCSIDR_64_LS_MASK	(UL(7) << CCSIDR_64_LS_SHIFT)
>> +#define CCSIDR_64_ASSOC_SHIFT	3
>> +#define CCSIDR_64_ASSOC_MASK	(UL(0x1FFFFF) << CCSIDR_64_ASSOC_SHIFT)
>> +#define CCSIDR_64_SET_SHIFT	32
>> +#define CCSIDR_64_SET_MASK	(UL(0xFFFFFF) << CCSIDR_64_SET_SHIFT)
>> +
>> +#define CCSIDR_32_LS_SHIFT	0
>> +#define CCSIDR_32_LS_MASK	(UL(7) << CCSIDR_32_LS_SHIFT)
>> +#define CCSIDR_32_ASSOC_SHIFT	3
>> +#define CCSIDR_32_ASSOC_MASK	(UL(0x3FF) << CCSIDR_32_ASSOC_SHIFT)
>> +#define CCSIDR_32_SET_SHIFT	13
>> +#define CCSIDR_32_SET_MASK	(UL(0x7FFF) << CCSIDR_32_SET_SHIFT)
> 
> I don't think we should be inferring cache structure from these register
> values. The Arm ARM helpfully says:
> 
>    | You cannot make any inference about the actual sizes of caches based
>    | on these parameters.
> 
> so we need to take the topology information from elsewhere.

Yes, these represent parameters for the low-level cache maintenance by 
set/way instructions, and nothing more. There are definitely cases where 
they do not reflect the underlying cache structure (commit 793acf870ea3 
is an obvious first call).

Robin.

> But before we get into that, can you justify why we need to do this at all,
> please? Do you have data to show the benefit of adding this complexity?
> 
> Cheers,
> 
> Will
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] arm64/mm: Enable color zero pages
  2020-09-16  8:28   ` Will Deacon
  2020-09-16 10:46     ` Robin Murphy
@ 2020-09-17  3:35     ` Gavin Shan
  2020-09-17 10:22       ` Robin Murphy
  1 sibling, 1 reply; 12+ messages in thread
From: Gavin Shan @ 2020-09-17  3:35 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-kernel, mark.rutland, anshuman.khandual,
	catalin.marinas, linux-kernel, shan.gavin

Hi Will,

On 9/16/20 6:28 PM, Will Deacon wrote:
> On Wed, Sep 16, 2020 at 01:25:23PM +1000, Gavin Shan wrote:
>> This enables color zero pages by allocating contigous page frames
>> for it. The number of pages for this is determined by L1 dCache
>> (or iCache) size, which is probbed from the hardware.
>>
>>     * Add cache_total_size() to return L1 dCache (or iCache) size
>>
>>     * Implement setup_zero_pages(), which is called after the page
>>       allocator begins to work, to allocate the contigous pages
>>       needed by color zero page.
>>
>>     * Reworked ZERO_PAGE() and define __HAVE_COLOR_ZERO_PAGE.
>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   arch/arm64/include/asm/cache.h   | 22 ++++++++++++++++++++
>>   arch/arm64/include/asm/pgtable.h |  9 ++++++--
>>   arch/arm64/kernel/cacheinfo.c    | 34 +++++++++++++++++++++++++++++++
>>   arch/arm64/mm/init.c             | 35 ++++++++++++++++++++++++++++++++
>>   arch/arm64/mm/mmu.c              |  7 -------
>>   5 files changed, 98 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>> index a4d1b5f771f6..420e9dde2c51 100644
>> --- a/arch/arm64/include/asm/cache.h
>> +++ b/arch/arm64/include/asm/cache.h
>> @@ -39,6 +39,27 @@
>>   #define CLIDR_LOC(clidr)	(((clidr) >> CLIDR_LOC_SHIFT) & 0x7)
>>   #define CLIDR_LOUIS(clidr)	(((clidr) >> CLIDR_LOUIS_SHIFT) & 0x7)
>>   
>> +#define CSSELR_TND_SHIFT	4
>> +#define CSSELR_TND_MASK		(UL(1) << CSSELR_TND_SHIFT)
>> +#define CSSELR_LEVEL_SHIFT	1
>> +#define CSSELR_LEVEL_MASK	(UL(7) << CSSELR_LEVEL_SHIFT)
>> +#define CSSELR_IND_SHIFT	0
>> +#define CSSERL_IND_MASK		(UL(1) << CSSELR_IND_SHIFT)
>> +
>> +#define CCSIDR_64_LS_SHIFT	0
>> +#define CCSIDR_64_LS_MASK	(UL(7) << CCSIDR_64_LS_SHIFT)
>> +#define CCSIDR_64_ASSOC_SHIFT	3
>> +#define CCSIDR_64_ASSOC_MASK	(UL(0x1FFFFF) << CCSIDR_64_ASSOC_SHIFT)
>> +#define CCSIDR_64_SET_SHIFT	32
>> +#define CCSIDR_64_SET_MASK	(UL(0xFFFFFF) << CCSIDR_64_SET_SHIFT)
>> +
>> +#define CCSIDR_32_LS_SHIFT	0
>> +#define CCSIDR_32_LS_MASK	(UL(7) << CCSIDR_32_LS_SHIFT)
>> +#define CCSIDR_32_ASSOC_SHIFT	3
>> +#define CCSIDR_32_ASSOC_MASK	(UL(0x3FF) << CCSIDR_32_ASSOC_SHIFT)
>> +#define CCSIDR_32_SET_SHIFT	13
>> +#define CCSIDR_32_SET_MASK	(UL(0x7FFF) << CCSIDR_32_SET_SHIFT)
> 
> I don't think we should be inferring cache structure from these register
> values. The Arm ARM helpfully says:
> 
>    | You cannot make any inference about the actual sizes of caches based
>    | on these parameters.
> 
> so we need to take the topology information from elsewhere.
> 

Yeah, I also noticed the statement in the spec. However, the L1 cache size
figured out from the above registers matches what "lscpu" reports on the
machine where I did my tests. Note "lscpu" depends on sysfs entries whose
information is retrieved from the ACPI (PPTT) table. The number of cache
levels is partially retrieved from a system register (clidr_el1).

It's doable to retrieve the L1 cache size from the ACPI (PPTT) table. I'll
change that in v2 if this enablement is really needed. More clarification
is provided below.

> But before we get into that, can you justify why we need to do this at all,
> please? Do you have data to show the benefit of adding this complexity?
> 

Initially, I found it's a missing feature that has already been enabled
on mips/s390. Currently, all read-only anonymous VMAs are backed by the
same zero page. That means all reads from these VMAs are cached by the
same cache set, though still across multiple ways if supported. So it
would be nice to have multiple zero pages backing these read-only
anonymous VMAs, so that reads from them can be cached by multiple sets
(and multiple ways, if supported). That is beneficial to overall
performance.

Unfortunately, I didn't find a machine where the cache way size (cache
size / number of ways) is larger than the page size. So I ran one
experiment as an indication of how L1 data cache misses affect overall
performance:

     L1 data cache size:           32KB
     L1 data cache line size:      64
     Number of L1 data cache set:  64
     Number of L1 data cache ways: 8
     ----------------------------------------------------------------------
                   size = (cache_line_size) * (num_of_sets) * (num_of_ways)

     Kernel configuration:
            VA_BITS:               48
            PAGE_SIZE:             4KB
            PMD HugeTLB Page Size: 2MB

     Experiment:
            I have a program (sketched below, after the results) to do
            the following things, measuring the elapsed time and
            L1-dcache-load-misses with perf:

            (1) Allocate (mmap) a PMD HugeTLB page, which is 2MB.
            (2) Read the mmap'd region in steps of the page size (4KB),
                8 or 9 times. Note 8 is the number of data cache ways.
            (3) Repeat (2) 1000000 times.
     
     Result:
            (a) when we have 8 for the steps in (2):
                37,103      L1-dcache-load-misses
                0.217522515 seconds time elapsed
                0.217564000 seconds user
                0.000000000 seconds sys
            (b) when we have 9 for the steps in (2):
                4,687,932   L1-dcache-load-misses            (126 times)
                0.248132105 seconds time elapsed             (+14.2%)
                0.248267000 seconds user
                0.000000000 seconds sys
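
     A rough sketch of the test program (an approximation; the exact
     sizes, flags and timing harness of what I actually ran may differ):

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/mman.h>

        #define MAP_SIZE (2UL << 20)    /* one PMD hugetlb page (2MB)  */
        #define STEP     4096UL         /* page size                   */
        #define STEPS    8              /* or 9, to exceed the 8 ways  */
        #define LOOPS    1000000UL

        int main(void)
        {
                volatile char *buf;
                char *map;
                unsigned long i, j, sum = 0;

                map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                if (map == MAP_FAILED)
                        return 1;
                buf = map;

                /*
                 * Each read is one page apart, so with a 4KB way size
                 * they all index the same L1 set; 9 reads exceed the
                 * 8 available ways and start missing.
                 */
                for (i = 0; i < LOOPS; i++)
                        for (j = 0; j < STEPS; j++)
                                sum += buf[j * STEP];

                printf("%lu\n", sum);
                return 0;
        }

     The binary is run under "perf stat -e L1-dcache-load-misses", once
     with STEPS=8 and once with STEPS=9.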

Please let me know if it's worth a v2 that retrieves the cache size
from the ACPI (PPTT) table. The cost is allocating multiple zero pages,
and the worst case is falling back to one zero page, as before :)

Cheers,
Gavin


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] arm64/mm: Enable color zero pages
  2020-09-16 10:46     ` Robin Murphy
@ 2020-09-17  4:36       ` Gavin Shan
  0 siblings, 0 replies; 12+ messages in thread
From: Gavin Shan @ 2020-09-17  4:36 UTC (permalink / raw)
  To: Robin Murphy, Will Deacon
  Cc: mark.rutland, anshuman.khandual, catalin.marinas, linux-kernel,
	shan.gavin, linux-arm-kernel

Hi Robin,

On 9/16/20 8:46 PM, Robin Murphy wrote:
> On 2020-09-16 09:28, Will Deacon wrote:
>> On Wed, Sep 16, 2020 at 01:25:23PM +1000, Gavin Shan wrote:
>>> This enables color zero pages by allocating contigous page frames
>>> for it. The number of pages for this is determined by L1 dCache
>>> (or iCache) size, which is probbed from the hardware.
>>>
>>>     * Add cache_total_size() to return L1 dCache (or iCache) size
>>>
>>>     * Implement setup_zero_pages(), which is called after the page
>>>       allocator begins to work, to allocate the contigous pages
>>>       needed by color zero page.
>>>
>>>     * Reworked ZERO_PAGE() and define __HAVE_COLOR_ZERO_PAGE.
>>>
>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>> ---
>>>   arch/arm64/include/asm/cache.h   | 22 ++++++++++++++++++++
>>>   arch/arm64/include/asm/pgtable.h |  9 ++++++--
>>>   arch/arm64/kernel/cacheinfo.c    | 34 +++++++++++++++++++++++++++++++
>>>   arch/arm64/mm/init.c             | 35 ++++++++++++++++++++++++++++++++
>>>   arch/arm64/mm/mmu.c              |  7 -------
>>>   5 files changed, 98 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>>> index a4d1b5f771f6..420e9dde2c51 100644
>>> --- a/arch/arm64/include/asm/cache.h
>>> +++ b/arch/arm64/include/asm/cache.h
>>> @@ -39,6 +39,27 @@
>>>   #define CLIDR_LOC(clidr)    (((clidr) >> CLIDR_LOC_SHIFT) & 0x7)
>>>   #define CLIDR_LOUIS(clidr)    (((clidr) >> CLIDR_LOUIS_SHIFT) & 0x7)
>>> +#define CSSELR_TND_SHIFT    4
>>> +#define CSSELR_TND_MASK        (UL(1) << CSSELR_TND_SHIFT)
>>> +#define CSSELR_LEVEL_SHIFT    1
>>> +#define CSSELR_LEVEL_MASK    (UL(7) << CSSELR_LEVEL_SHIFT)
>>> +#define CSSELR_IND_SHIFT    0
>>> +#define CSSERL_IND_MASK        (UL(1) << CSSELR_IND_SHIFT)
>>> +
>>> +#define CCSIDR_64_LS_SHIFT    0
>>> +#define CCSIDR_64_LS_MASK    (UL(7) << CCSIDR_64_LS_SHIFT)
>>> +#define CCSIDR_64_ASSOC_SHIFT    3
>>> +#define CCSIDR_64_ASSOC_MASK    (UL(0x1FFFFF) << CCSIDR_64_ASSOC_SHIFT)
>>> +#define CCSIDR_64_SET_SHIFT    32
>>> +#define CCSIDR_64_SET_MASK    (UL(0xFFFFFF) << CCSIDR_64_SET_SHIFT)
>>> +
>>> +#define CCSIDR_32_LS_SHIFT    0
>>> +#define CCSIDR_32_LS_MASK    (UL(7) << CCSIDR_32_LS_SHIFT)
>>> +#define CCSIDR_32_ASSOC_SHIFT    3
>>> +#define CCSIDR_32_ASSOC_MASK    (UL(0x3FF) << CCSIDR_32_ASSOC_SHIFT)
>>> +#define CCSIDR_32_SET_SHIFT    13
>>> +#define CCSIDR_32_SET_MASK    (UL(0x7FFF) << CCSIDR_32_SET_SHIFT)
>>
>> I don't think we should be inferring cache structure from these register
>> values. The Arm ARM helpfully says:
>>
>>    | You cannot make any inference about the actual sizes of caches based
>>    | on these parameters.
>>
>> so we need to take the topology information from elsewhere.
> 
> Yes, these represent parameters for the low-level cache maintenance by set/way instructions, and nothing more. There are definitely cases where they do not reflect the underlying cache structure (commit 793acf870ea3 is an obvious first call).
> 

Thanks for confirming. Yeah, the system registers aren't reliable
since the cache ways/sets may be hard-coded to 1/1. As I suggested in
the other thread, the ACPI (PPTT) table would be the correct place to
get this kind of information.
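
For reference, a minimal sketch of what v2 might do instead: pull the
L1 data cache size from the generic cacheinfo framework (populated from
the ACPI PPTT, or from the DT) rather than from CCSIDR_EL1. This is
only an assumption of the approach and relies on cacheinfo having been
set up for the CPU already:

    #include <linux/cacheinfo.h>

    static unsigned int l1_dcache_size(unsigned int cpu)
    {
            struct cpu_cacheinfo *cci = get_cpu_cacheinfo(cpu);
            unsigned int i;

            if (!cci || !cci->info_list)
                    return 0;

            for (i = 0; i < cci->num_leaves; i++) {
                    struct cacheinfo *leaf = &cci->info_list[i];

                    if (leaf->level == 1 &&
                        (leaf->type == CACHE_TYPE_DATA ||
                         leaf->type == CACHE_TYPE_UNIFIED))
                            return leaf->size;
            }

            return 0;
    }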

[...]

Cheers,
Gavin


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] arm64/mm: Enable color zero pages
  2020-09-17  3:35     ` Gavin Shan
@ 2020-09-17 10:22       ` Robin Murphy
  2020-09-21  2:56         ` Gavin Shan
  0 siblings, 1 reply; 12+ messages in thread
From: Robin Murphy @ 2020-09-17 10:22 UTC (permalink / raw)
  To: Gavin Shan, Will Deacon
  Cc: mark.rutland, anshuman.khandual, catalin.marinas, linux-kernel,
	shan.gavin, linux-arm-kernel

On 2020-09-17 04:35, Gavin Shan wrote:
> Hi Will,
> 
> On 9/16/20 6:28 PM, Will Deacon wrote:
>> On Wed, Sep 16, 2020 at 01:25:23PM +1000, Gavin Shan wrote:
>>> This enables color zero pages by allocating contigous page frames
>>> for it. The number of pages for this is determined by L1 dCache
>>> (or iCache) size, which is probbed from the hardware.
>>>
>>>     * Add cache_total_size() to return L1 dCache (or iCache) size
>>>
>>>     * Implement setup_zero_pages(), which is called after the page
>>>       allocator begins to work, to allocate the contigous pages
>>>       needed by color zero page.
>>>
>>>     * Reworked ZERO_PAGE() and define __HAVE_COLOR_ZERO_PAGE.
>>>
>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>> ---
>>>   arch/arm64/include/asm/cache.h   | 22 ++++++++++++++++++++
>>>   arch/arm64/include/asm/pgtable.h |  9 ++++++--
>>>   arch/arm64/kernel/cacheinfo.c    | 34 +++++++++++++++++++++++++++++++
>>>   arch/arm64/mm/init.c             | 35 ++++++++++++++++++++++++++++++++
>>>   arch/arm64/mm/mmu.c              |  7 -------
>>>   5 files changed, 98 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/cache.h 
>>> b/arch/arm64/include/asm/cache.h
>>> index a4d1b5f771f6..420e9dde2c51 100644
>>> --- a/arch/arm64/include/asm/cache.h
>>> +++ b/arch/arm64/include/asm/cache.h
>>> @@ -39,6 +39,27 @@
>>>   #define CLIDR_LOC(clidr)    (((clidr) >> CLIDR_LOC_SHIFT) & 0x7)
>>>   #define CLIDR_LOUIS(clidr)    (((clidr) >> CLIDR_LOUIS_SHIFT) & 0x7)
>>> +#define CSSELR_TND_SHIFT    4
>>> +#define CSSELR_TND_MASK        (UL(1) << CSSELR_TND_SHIFT)
>>> +#define CSSELR_LEVEL_SHIFT    1
>>> +#define CSSELR_LEVEL_MASK    (UL(7) << CSSELR_LEVEL_SHIFT)
>>> +#define CSSELR_IND_SHIFT    0
>>> +#define CSSERL_IND_MASK        (UL(1) << CSSELR_IND_SHIFT)
>>> +
>>> +#define CCSIDR_64_LS_SHIFT    0
>>> +#define CCSIDR_64_LS_MASK    (UL(7) << CCSIDR_64_LS_SHIFT)
>>> +#define CCSIDR_64_ASSOC_SHIFT    3
>>> +#define CCSIDR_64_ASSOC_MASK    (UL(0x1FFFFF) << CCSIDR_64_ASSOC_SHIFT)
>>> +#define CCSIDR_64_SET_SHIFT    32
>>> +#define CCSIDR_64_SET_MASK    (UL(0xFFFFFF) << CCSIDR_64_SET_SHIFT)
>>> +
>>> +#define CCSIDR_32_LS_SHIFT    0
>>> +#define CCSIDR_32_LS_MASK    (UL(7) << CCSIDR_32_LS_SHIFT)
>>> +#define CCSIDR_32_ASSOC_SHIFT    3
>>> +#define CCSIDR_32_ASSOC_MASK    (UL(0x3FF) << CCSIDR_32_ASSOC_SHIFT)
>>> +#define CCSIDR_32_SET_SHIFT    13
>>> +#define CCSIDR_32_SET_MASK    (UL(0x7FFF) << CCSIDR_32_SET_SHIFT)
>>
>> I don't think we should be inferring cache structure from these register
>> values. The Arm ARM helpfully says:
>>
>>    | You cannot make any inference about the actual sizes of caches based
>>    | on these parameters.
>>
>> so we need to take the topology information from elsewhere.
>>
> 
> Yeah, I also noticed the statement in the spec. However, the L1 cache size
> figured out from above registers are matching with "lscpu" on the machine
> where I did my tests. Note "lscpu" depends on sysfs entries whose 
> information
> is retrieved from ACPI (PPTT) table. The number of cache levels are 
> partially
> retrieved from system register (clidr_el1).
> 
> It's doable to retrieve the L1 cache size from ACPI (PPTT) table. I'll
> change accordingly in v2 if this enablement is really needed. More clarify
> is provided below.
> 
>> But before we get into that, can you justify why we need to do this at 
>> all,
>> please? Do you have data to show the benefit of adding this complexity?
>>
> 
> Initially, I found it's the missed feature which has been enabled on
> mips/s390. Currently, all read-only anonymous VMAs are backed up by
> same zero page. It means all reads to these VMAs are cached by same
> set of cache, but still multiple ways if supported. So it would be
> nice to have multiple zero pages to back up these read-only anonymous
> VMAs, so that the reads on them can be cached by multiple sets (multiple
> ways still if supported). It's overall beneficial to the performance.

Is this a concern for true PIPT caches, or is it really just working 
around a pathological case for alias-detecting VIPT caches?

> Unfortunately, I didn't find a machine where the size of cache set is
> larger than page size. So I had one experiment as indication how L1
> data cache miss affects the overall performance:
> 
>      L1 data cache size:           32KB
>      L1 data cache line size:      64
>      Number of L1 data cache set:  64
>      Number of L1 data cache ways: 8
>      ----------------------------------------------------------------------
>                    size = (cache_line_size) * (num_of_sets) * (num_of_ways)
> 
>      Kernel configuration:
>             VA_BITS:               48
>             PAGE_SIZE:             4KB
>             PMD HugeTLB Page Size: 2MB
> 
>      Experiment:
>             I have a program to do the following things and check the
>             consumed time and L1-data-cache-misses by perf.
> 
>             (1) Allocate (mmap) a PMD HugeTLB Page, which is 2MB.
>             (2) Read on the mmap'd region in step of page size (4KB)
>                 for 8 or 9 times. Note 8 is the number of data cache
>                 ways.
>             (3) Repeat (2) for 1000000 times.
>      Result:
>             (a) when we have 8 for the steps in (2):
>                 37,103      L1-dcache-load-misses
>                 0.217522515 seconds time elapsed
>                 0.217564000 seconds user
>                 0.000000000 seconds sys
>             (b) when we have 9 for the steps in (2):
>                 4,687,932   L1-dcache-load-misses            (126 times)
>                 0.248132105 seconds time elapsed             (+14.2%)
>                 0.248267000 seconds user
>                 0.000000000 seconds sys

I have a vague feeling this may have come up before, but are there 
real-world applications that have a performance bottleneck on reading 
from uninitialised memory? As far as synthetic benchmarks go, I'm sure 
we could equally come up with one that shows a regression due to real 
data being pushed out of the cache by all those extra zeros ;)

Robin.

> Please let me know if it's worthy for a v2, to retrieve the cache size
> from ACPI (PPTT) table. The cost is to allocate multiple zero pages and
> the worst case is fail back to one zero page, as before :)
> 
> Cheers,
> Gavin
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] arm64/mm: Enable color zero pages
  2020-09-16  3:25 ` [PATCH 2/2] arm64/mm: Enable color zero pages Gavin Shan
  2020-09-16  8:28   ` Will Deacon
@ 2020-09-18 12:10   ` kernel test robot
  1 sibling, 0 replies; 12+ messages in thread
From: kernel test robot @ 2020-09-18 12:10 UTC (permalink / raw)
  To: Gavin Shan, linux-arm-kernel
  Cc: kbuild-all, clang-built-linux, mark.rutland, anshuman.khandual,
	catalin.marinas, will, linux-kernel, shan.gavin

[-- Attachment #1: Type: text/plain, Size: 3908 bytes --]

Hi Gavin,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on arm64/for-next/core]
[also build test ERROR on soc/for-next arm/for-next kvmarm/next v5.9-rc5 next-20200917]
[cannot apply to xlnx/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Gavin-Shan/arm64-mm-Enable-color-zero-pages/20200916-112710
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
config: arm64-randconfig-r002-20200917 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 0ff28fa6a75617d61b1aeea77463d6a1684c3c89)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm64 cross compiling tool for clang build
        # apt-get install binutils-aarch64-linux-gnu
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> arch/arm64/kernel/cacheinfo.c:62:8: error: implicit declaration of function 'FIELD_PREP' [-Werror,-Wimplicit-function-declaration]
           val = FIELD_PREP(CSSELR_LEVEL_MASK, 0) |
                 ^
>> arch/arm64/kernel/cacheinfo.c:68:16: error: implicit declaration of function 'FIELD_GET' [-Werror,-Wimplicit-function-declaration]
                   size = (1 << FIELD_GET(CCSIDR_64_LS_MASK, val)) *
                                ^
   arch/arm64/kernel/cacheinfo.c:68:16: note: did you mean 'FIELD_PREP'?
   arch/arm64/kernel/cacheinfo.c:62:8: note: 'FIELD_PREP' declared here
           val = FIELD_PREP(CSSELR_LEVEL_MASK, 0) |
                 ^
   arch/arm64/kernel/cacheinfo.c:72:16: error: implicit declaration of function 'FIELD_GET' [-Werror,-Wimplicit-function-declaration]
                   size = (1 << FIELD_GET(CCSIDR_32_LS_MASK, val)) *
                                ^
   3 errors generated.
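
FIELD_PREP()/FIELD_GET() are defined in <linux/bitfield.h>, so the
likely fix (an assumption, not part of the posted patch) is to add the
missing include at the top of arch/arm64/kernel/cacheinfo.c:

    #include <linux/bitfield.h>    /* FIELD_PREP(), FIELD_GET() */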

# https://github.com/0day-ci/linux/commit/ab02baff46e4889747f134626f5cd3566fd90582
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Gavin-Shan/arm64-mm-Enable-color-zero-pages/20200916-112710
git checkout ab02baff46e4889747f134626f5cd3566fd90582
vim +/FIELD_PREP +62 arch/arm64/kernel/cacheinfo.c

    45	
    46	int cache_total_size(void)
    47	{
    48		unsigned int ctype, size;
    49		unsigned long val;
    50		bool ccidx = false;
    51	
    52		/* Check first level cache is supported */
    53		ctype = get_cache_type(1);
    54		if (ctype == CACHE_TYPE_NOCACHE)
    55			return 0;
    56	
    57		/* ARMv8.3-CCIDX is supported or not */
    58		val = read_sanitised_ftr_reg(SYS_ID_MMFR4_EL1);
    59		ccidx = !!(val & (UL(0xF) << ID_AA64MMFR2_CCIDX_SHIFT));
    60	
    61		/* Retrieve the information and calculate the total size */
  > 62		val = FIELD_PREP(CSSELR_LEVEL_MASK, 0) |
    63		      FIELD_PREP(CSSERL_IND_MASK, 0);
    64		write_sysreg(val, csselr_el1);
    65	
    66		val = read_sysreg(ccsidr_el1);
    67		if (ccidx) {
  > 68			size = (1 << FIELD_GET(CCSIDR_64_LS_MASK, val)) *
    69			       (FIELD_GET(CCSIDR_64_ASSOC_MASK, val) + 1) *
    70			       (FIELD_GET(CCSIDR_64_SET_MASK, val) + 1);
    71		} else {
    72			size = (1 << FIELD_GET(CCSIDR_32_LS_MASK, val)) *
    73			       (FIELD_GET(CCSIDR_32_ASSOC_MASK, val) + 1) *
    74			       (FIELD_GET(CCSIDR_32_SET_MASK, val) + 1);
    75		}
    76	
    77		return size;
    78	}
    79	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 37827 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] arm64/mm: Enable color zero pages
  2020-09-17 10:22       ` Robin Murphy
@ 2020-09-21  2:56         ` Gavin Shan
  2020-09-21 12:40           ` Anshuman Khandual
  0 siblings, 1 reply; 12+ messages in thread
From: Gavin Shan @ 2020-09-21  2:56 UTC (permalink / raw)
  To: Robin Murphy, Will Deacon
  Cc: mark.rutland, anshuman.khandual, catalin.marinas, linux-kernel,
	shan.gavin, linux-arm-kernel

Hi Robin,

On 9/17/20 8:22 PM, Robin Murphy wrote:
> On 2020-09-17 04:35, Gavin Shan wrote:
>> On 9/16/20 6:28 PM, Will Deacon wrote:
>>> On Wed, Sep 16, 2020 at 01:25:23PM +1000, Gavin Shan wrote:
>>>> This enables color zero pages by allocating contigous page frames
>>>> for it. The number of pages for this is determined by L1 dCache
>>>> (or iCache) size, which is probbed from the hardware.
>>>>
>>>>     * Add cache_total_size() to return L1 dCache (or iCache) size
>>>>
>>>>     * Implement setup_zero_pages(), which is called after the page
>>>>       allocator begins to work, to allocate the contigous pages
>>>>       needed by color zero page.
>>>>
>>>>     * Reworked ZERO_PAGE() and define __HAVE_COLOR_ZERO_PAGE.
>>>>
>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>> ---
>>>>   arch/arm64/include/asm/cache.h   | 22 ++++++++++++++++++++
>>>>   arch/arm64/include/asm/pgtable.h |  9 ++++++--
>>>>   arch/arm64/kernel/cacheinfo.c    | 34 +++++++++++++++++++++++++++++++
>>>>   arch/arm64/mm/init.c             | 35 ++++++++++++++++++++++++++++++++
>>>>   arch/arm64/mm/mmu.c              |  7 -------
>>>>   5 files changed, 98 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>>>> index a4d1b5f771f6..420e9dde2c51 100644
>>>> --- a/arch/arm64/include/asm/cache.h
>>>> +++ b/arch/arm64/include/asm/cache.h
>>>> @@ -39,6 +39,27 @@
>>>>   #define CLIDR_LOC(clidr)    (((clidr) >> CLIDR_LOC_SHIFT) & 0x7)
>>>>   #define CLIDR_LOUIS(clidr)    (((clidr) >> CLIDR_LOUIS_SHIFT) & 0x7)
>>>> +#define CSSELR_TND_SHIFT    4
>>>> +#define CSSELR_TND_MASK        (UL(1) << CSSELR_TND_SHIFT)
>>>> +#define CSSELR_LEVEL_SHIFT    1
>>>> +#define CSSELR_LEVEL_MASK    (UL(7) << CSSELR_LEVEL_SHIFT)
>>>> +#define CSSELR_IND_SHIFT    0
>>>> +#define CSSERL_IND_MASK        (UL(1) << CSSELR_IND_SHIFT)
>>>> +
>>>> +#define CCSIDR_64_LS_SHIFT    0
>>>> +#define CCSIDR_64_LS_MASK    (UL(7) << CCSIDR_64_LS_SHIFT)
>>>> +#define CCSIDR_64_ASSOC_SHIFT    3
>>>> +#define CCSIDR_64_ASSOC_MASK    (UL(0x1FFFFF) << CCSIDR_64_ASSOC_SHIFT)
>>>> +#define CCSIDR_64_SET_SHIFT    32
>>>> +#define CCSIDR_64_SET_MASK    (UL(0xFFFFFF) << CCSIDR_64_SET_SHIFT)
>>>> +
>>>> +#define CCSIDR_32_LS_SHIFT    0
>>>> +#define CCSIDR_32_LS_MASK    (UL(7) << CCSIDR_32_LS_SHIFT)
>>>> +#define CCSIDR_32_ASSOC_SHIFT    3
>>>> +#define CCSIDR_32_ASSOC_MASK    (UL(0x3FF) << CCSIDR_32_ASSOC_SHIFT)
>>>> +#define CCSIDR_32_SET_SHIFT    13
>>>> +#define CCSIDR_32_SET_MASK    (UL(0x7FFF) << CCSIDR_32_SET_SHIFT)
>>>
>>> I don't think we should be inferring cache structure from these register
>>> values. The Arm ARM helpfully says:
>>>
>>>    | You cannot make any inference about the actual sizes of caches based
>>>    | on these parameters.
>>>
>>> so we need to take the topology information from elsewhere.
>>>
>>
>> Yeah, I also noticed the statement in the spec. However, the L1 cache size
>> figured out from above registers are matching with "lscpu" on the machine
>> where I did my tests. Note "lscpu" depends on sysfs entries whose information
>> is retrieved from ACPI (PPTT) table. The number of cache levels are partially
>> retrieved from system register (clidr_el1).
>>
>> It's doable to retrieve the L1 cache size from ACPI (PPTT) table. I'll
>> change accordingly in v2 if this enablement is really needed. More clarify
>> is provided below.
>>
>>> But before we get into that, can you justify why we need to do this at all,
>>> please? Do you have data to show the benefit of adding this complexity?
>>>
>>
>> Initially, I found it's the missed feature which has been enabled on
>> mips/s390. Currently, all read-only anonymous VMAs are backed up by
>> same zero page. It means all reads to these VMAs are cached by same
>> set of cache, but still multiple ways if supported. So it would be
>> nice to have multiple zero pages to back up these read-only anonymous
>> VMAs, so that the reads on them can be cached by multiple sets (multiple
>> ways still if supported). It's overall beneficial to the performance.
> 
> Is this a concern for true PIPT caches, or is it really just working around a pathological case for alias-detecting VIPT caches?
> 

I think it's definitely a concern for true PIPT caches. However, I'm not
sure about VIPT caches because I failed to understand how they work
from the ARMv8-A spec. If I'm correct, the index of a VIPT cache line is
still determined by the physical address, and the number of sets is
another limitation? For example, if two virtual addresses (v1) and (v2)
are translated to the same physical address (p1), there is still one
cache line (from a particular set) for them. If so, this should help
in terms of performance.

However, I'm not sure I understood VIPT caches correctly because of one
statement in the ARMv8-A spec, quoted below. It suggests (v1) and (v2)
can be backed by two different cache lines from one particular set.

----

The only architecturally-guaranteed way to invalidate all aliases of
a PA from a VIPT instruction cache is to invalidate the entire instruction
cache.

>> Unfortunately, I didn't find a machine where the size of cache set is
>> larger than page size. So I had one experiment as indication how L1
>> data cache miss affects the overall performance:
>>
>>      L1 data cache size:           32KB
>>      L1 data cache line size:      64
>>      Number of L1 data cache set:  64
>>      Number of L1 data cache ways: 8
>>      ----------------------------------------------------------------------
>>                    size = (cache_line_size) * (num_of_sets) * (num_of_ways)
>>
>>      Kernel configuration:
>>             VA_BITS:               48
>>             PAGE_SIZE:             4KB
>>             PMD HugeTLB Page Size: 2MB
>>
>>      Experiment:
>>             I have a program to do the following things and check the
>>             consumed time and L1-data-cache-misses by perf.
>>
>>             (1) Allocate (mmap) a PMD HugeTLB Page, which is 2MB.
>>             (2) Read on the mmap'd region in step of page size (4KB)
>>                 for 8 or 9 times. Note 8 is the number of data cache
>>                 ways.
>>             (3) Repeat (2) for 1000000 times.
>>      Result:
>>             (a) when we have 8 for the steps in (2):
>>                 37,103      L1-dcache-load-misses
>>                 0.217522515 seconds time elapsed
>>                 0.217564000 seconds user
>>                 0.000000000 seconds sys
>>             (b) when we have 9 for the steps in (2):
>>                 4,687,932   L1-dcache-load-misses            (126 times)
>>                 0.248132105 seconds time elapsed             (+14.2%)
>>                 0.248267000 seconds user
>>                 0.000000000 seconds sys
> 
> I have a vague feeling this may have come up before, but are there real-world applications that have a performance bottleneck on reading from uninitialised memory? As far as synthetic benchmarks go, I'm sure we could equally come up with one that shows a regression due to real data being pushed out of the cache by all those extra zeros ;)

[...]

Ok. If this was proposed before, is the link to that patchset still
available? :)

When I searched for "my_zero_pfn" in the upstream kernel, I found that
DAX uses zero pages to fill the holes in a file in dax_load_hole(), and
mmap() on /proc/kcore can use the zero page as well.

Yes, it's correct that the data cache has more chance of being flushed
by these extra zeros. However, that's not a reason to compromise the
performance of reading the zero page and depress it intentionally. The
cache lines won't be used if there is no read activity on the zero
page :)

Cheers,
Gavin

  


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] arm64/mm: Enable color zero pages
  2020-09-21  2:56         ` Gavin Shan
@ 2020-09-21 12:40           ` Anshuman Khandual
  2020-09-22 12:39             ` Gavin Shan
  0 siblings, 1 reply; 12+ messages in thread
From: Anshuman Khandual @ 2020-09-21 12:40 UTC (permalink / raw)
  To: Gavin Shan, Robin Murphy, Will Deacon
  Cc: mark.rutland, catalin.marinas, linux-kernel, shan.gavin,
	linux-arm-kernel



On 09/21/2020 08:26 AM, Gavin Shan wrote:
> Hi Robin,
> 
> On 9/17/20 8:22 PM, Robin Murphy wrote:
>> On 2020-09-17 04:35, Gavin Shan wrote:
>>> On 9/16/20 6:28 PM, Will Deacon wrote:
>>>> On Wed, Sep 16, 2020 at 01:25:23PM +1000, Gavin Shan wrote:
>>>>> This enables color zero pages by allocating contigous page frames
>>>>> for it. The number of pages for this is determined by L1 dCache
>>>>> (or iCache) size, which is probbed from the hardware.
>>>>>
>>>>>     * Add cache_total_size() to return L1 dCache (or iCache) size
>>>>>
>>>>>     * Implement setup_zero_pages(), which is called after the page
>>>>>       allocator begins to work, to allocate the contigous pages
>>>>>       needed by color zero page.
>>>>>
>>>>>     * Reworked ZERO_PAGE() and define __HAVE_COLOR_ZERO_PAGE.
>>>>>
>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>>> ---
>>>>>   arch/arm64/include/asm/cache.h   | 22 ++++++++++++++++++++
>>>>>   arch/arm64/include/asm/pgtable.h |  9 ++++++--
>>>>>   arch/arm64/kernel/cacheinfo.c    | 34 +++++++++++++++++++++++++++++++
>>>>>   arch/arm64/mm/init.c             | 35 ++++++++++++++++++++++++++++++++
>>>>>   arch/arm64/mm/mmu.c              |  7 -------
>>>>>   5 files changed, 98 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>>>>> index a4d1b5f771f6..420e9dde2c51 100644
>>>>> --- a/arch/arm64/include/asm/cache.h
>>>>> +++ b/arch/arm64/include/asm/cache.h
>>>>> @@ -39,6 +39,27 @@
>>>>>   #define CLIDR_LOC(clidr)    (((clidr) >> CLIDR_LOC_SHIFT) & 0x7)
>>>>>   #define CLIDR_LOUIS(clidr)    (((clidr) >> CLIDR_LOUIS_SHIFT) & 0x7)
>>>>> +#define CSSELR_TND_SHIFT    4
>>>>> +#define CSSELR_TND_MASK        (UL(1) << CSSELR_TND_SHIFT)
>>>>> +#define CSSELR_LEVEL_SHIFT    1
>>>>> +#define CSSELR_LEVEL_MASK    (UL(7) << CSSELR_LEVEL_SHIFT)
>>>>> +#define CSSELR_IND_SHIFT    0
>>>>> +#define CSSERL_IND_MASK        (UL(1) << CSSELR_IND_SHIFT)
>>>>> +
>>>>> +#define CCSIDR_64_LS_SHIFT    0
>>>>> +#define CCSIDR_64_LS_MASK    (UL(7) << CCSIDR_64_LS_SHIFT)
>>>>> +#define CCSIDR_64_ASSOC_SHIFT    3
>>>>> +#define CCSIDR_64_ASSOC_MASK    (UL(0x1FFFFF) << CCSIDR_64_ASSOC_SHIFT)
>>>>> +#define CCSIDR_64_SET_SHIFT    32
>>>>> +#define CCSIDR_64_SET_MASK    (UL(0xFFFFFF) << CCSIDR_64_SET_SHIFT)
>>>>> +
>>>>> +#define CCSIDR_32_LS_SHIFT    0
>>>>> +#define CCSIDR_32_LS_MASK    (UL(7) << CCSIDR_32_LS_SHIFT)
>>>>> +#define CCSIDR_32_ASSOC_SHIFT    3
>>>>> +#define CCSIDR_32_ASSOC_MASK    (UL(0x3FF) << CCSIDR_32_ASSOC_SHIFT)
>>>>> +#define CCSIDR_32_SET_SHIFT    13
>>>>> +#define CCSIDR_32_SET_MASK    (UL(0x7FFF) << CCSIDR_32_SET_SHIFT)
>>>>
>>>> I don't think we should be inferring cache structure from these register
>>>> values. The Arm ARM helpfully says:
>>>>
>>>>    | You cannot make any inference about the actual sizes of caches based
>>>>    | on these parameters.
>>>>
>>>> so we need to take the topology information from elsewhere.
>>>>
>>>
>>> Yeah, I also noticed the statement in the spec. However, the L1 cache size
>>> figured out from above registers are matching with "lscpu" on the machine
>>> where I did my tests. Note "lscpu" depends on sysfs entries whose information
>>> is retrieved from ACPI (PPTT) table. The number of cache levels are partially
>>> retrieved from system register (clidr_el1).
>>>
>>> It's doable to retrieve the L1 cache size from ACPI (PPTT) table. I'll
>>> change accordingly in v2 if this enablement is really needed. More clarify
>>> is provided below.
>>>
>>>> But before we get into that, can you justify why we need to do this at all,
>>>> please? Do you have data to show the benefit of adding this complexity?
>>>>
>>>
>>> Initially, I found it's the missed feature which has been enabled on
>>> mips/s390. Currently, all read-only anonymous VMAs are backed up by
>>> same zero page. It means all reads to these VMAs are cached by same
>>> set of cache, but still multiple ways if supported. So it would be
>>> nice to have multiple zero pages to back up these read-only anonymous
>>> VMAs, so that the reads on them can be cached by multiple sets (multiple
>>> ways still if supported). It's overall beneficial to the performance.
>>
>> Is this a concern for true PIPT caches, or is it really just working around a pathological case for alias-detecting VIPT caches?
>>
> 
> I think it's definitely a concern for PIPT caches. However, I'm not
> sure about VIPT caches because I failed to understand how it works
> from ARM8-A spec. If I'm correct, the index of VIPT cache line is
> still determined by the physical address and the number of sets is
> another limitation? For example, two virtual addresses (v1) and (v2)
> are translated to same physical address (p1), there is still one
> cache line (from particular set) for them. If so, this should helps
> in terms of performance.
> 
> However, I'm not sure I understood VIPT caches correctly because there
> is one statement in the ARM8-A spec as below. It seems (v1) and (v2)
> can be backed by two different cache line from one particular set.
> 
> ----
> 
> The only architecturally-guaranteed way to invalidate all aliases of
> a PA from a VIPT instruction cache is to invalidate the entire instruction
> cache.
> 
>>> Unfortunately, I didn't find a machine where the size of cache set is
>>> larger than page size. So I had one experiment as indication how L1
>>> data cache miss affects the overall performance:
>>>
>>>      L1 data cache size:           32KB
>>>      L1 data cache line size:      64
>>>      Number of L1 data cache set:  64
>>>      Number of L1 data cache ways: 8
>>>      ----------------------------------------------------------------------
>>>                    size = (cache_line_size) * (num_of_sets) * (num_of_ways)
>>>
>>>      Kernel configuration:
>>>             VA_BITS:               48
>>>             PAGE_SIZE:             4KB
>>>             PMD HugeTLB Page Size: 2MB
>>>
>>>      Experiment:
>>>             I have a program to do the following things and check the
>>>             consumed time and L1-data-cache-misses by perf.
>>>
>>>             (1) Allocate (mmap) a PMD HugeTLB Page, which is 2MB.
>>>             (2) Read on the mmap'd region in step of page size (4KB)
>>>                 for 8 or 9 times. Note 8 is the number of data cache
>>>                 ways.
>>>             (3) Repeat (2) for 1000000 times.
>>>      Result:
>>>             (a) when we have 8 for the steps in (2):
>>>                 37,103      L1-dcache-load-misses
>>>                 0.217522515 seconds time elapsed
>>>                 0.217564000 seconds user
>>>                 0.000000000 seconds sys
>>>             (b) when we have 9 for the steps in (2):
>>>                 4,687,932   L1-dcache-load-misses            (126 times)
>>>                 0.248132105 seconds time elapsed             (+14.2%)
>>>                 0.248267000 seconds user
>>>                 0.000000000 seconds sys
>>
>> I have a vague feeling this may have come up before, but are there real-world applications that have a performance bottleneck on reading from uninitialised memory? As far as synthetic benchmarks go, I'm sure we could equally come up with one that shows a regression due to real data being pushed out of the cache by all those extra zeros ;)
> 
> [...]
> 
> Ok. If this was proposed before, is the link to that patchset still
> available? :)
> 
> When I searched for "my_zero_pfn" in the upstream kernel, I found that
> DAX uses the zero page to fill holes in a file, in dax_load_hole().
> mmap() on /proc/kcore can use the zero page as well.

But how often will those mapped areas actually be read afterwards, either
in DAX or /proc/kcore? It seems like the minimal adoption of this feature
so far across platforms (i.e. s390 and mips) might have to do with how
frequently real-world workloads read from mapped areas backed by zero
pages.

> 
> Yes, it's correct that real data has a higher chance of being evicted
> from the data cache by these extra zero pages. However, that's not a
> reason to intentionally compromise the performance of reads from the
> zero page. The cache lines won't be occupied at all if there are no
> reads from the zero pages :)
> 
> Cheers,
> Gavin
> 
>  
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] arm64/mm: Enable color zero pages
  2020-09-21 12:40           ` Anshuman Khandual
@ 2020-09-22 12:39             ` Gavin Shan
  0 siblings, 0 replies; 12+ messages in thread
From: Gavin Shan @ 2020-09-22 12:39 UTC (permalink / raw)
  To: Anshuman Khandual, Robin Murphy, Will Deacon
  Cc: mark.rutland, catalin.marinas, linux-kernel, shan.gavin,
	linux-arm-kernel

Hi Anshuman,

On 9/21/20 10:40 PM, Anshuman Khandual wrote:
> On 09/21/2020 08:26 AM, Gavin Shan wrote:
>> On 9/17/20 8:22 PM, Robin Murphy wrote:
>>> On 2020-09-17 04:35, Gavin Shan wrote:
>>>> On 9/16/20 6:28 PM, Will Deacon wrote:
>>>>> On Wed, Sep 16, 2020 at 01:25:23PM +1000, Gavin Shan wrote:
>>>>>> This enables color zero pages by allocating contiguous page frames
>>>>>> for them. The number of pages is determined by the L1 dCache
>>>>>> (or iCache) size, which is probed from the hardware.
>>>>>>
>>>>>>      * Add cache_total_size() to return the L1 dCache (or iCache) size
>>>>>>
>>>>>>      * Implement setup_zero_pages(), which is called after the page
>>>>>>        allocator begins to work, to allocate the contiguous pages
>>>>>>        needed by the color zero pages.
>>>>>>
>>>>>>      * Rework ZERO_PAGE() and define __HAVE_COLOR_ZERO_PAGE.
>>>>>>
>>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>>>> ---
>>>>>>    arch/arm64/include/asm/cache.h   | 22 ++++++++++++++++++++
>>>>>>    arch/arm64/include/asm/pgtable.h |  9 ++++++--
>>>>>>    arch/arm64/kernel/cacheinfo.c    | 34 +++++++++++++++++++++++++++++++
>>>>>>    arch/arm64/mm/init.c             | 35 ++++++++++++++++++++++++++++++++
>>>>>>    arch/arm64/mm/mmu.c              |  7 -------
>>>>>>    5 files changed, 98 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>>>>>> index a4d1b5f771f6..420e9dde2c51 100644
>>>>>> --- a/arch/arm64/include/asm/cache.h
>>>>>> +++ b/arch/arm64/include/asm/cache.h
>>>>>> @@ -39,6 +39,27 @@
>>>>>>    #define CLIDR_LOC(clidr)    (((clidr) >> CLIDR_LOC_SHIFT) & 0x7)
>>>>>>    #define CLIDR_LOUIS(clidr)    (((clidr) >> CLIDR_LOUIS_SHIFT) & 0x7)
>>>>>> +#define CSSELR_TND_SHIFT    4
>>>>>> +#define CSSELR_TND_MASK        (UL(1) << CSSELR_TND_SHIFT)
>>>>>> +#define CSSELR_LEVEL_SHIFT    1
>>>>>> +#define CSSELR_LEVEL_MASK    (UL(7) << CSSELR_LEVEL_SHIFT)
>>>>>> +#define CSSELR_IND_SHIFT    0
>>>>>> +#define CSSELR_IND_MASK        (UL(1) << CSSELR_IND_SHIFT)
>>>>>> +
>>>>>> +#define CCSIDR_64_LS_SHIFT    0
>>>>>> +#define CCSIDR_64_LS_MASK    (UL(7) << CCSIDR_64_LS_SHIFT)
>>>>>> +#define CCSIDR_64_ASSOC_SHIFT    3
>>>>>> +#define CCSIDR_64_ASSOC_MASK    (UL(0x1FFFFF) << CCSIDR_64_ASSOC_SHIFT)
>>>>>> +#define CCSIDR_64_SET_SHIFT    32
>>>>>> +#define CCSIDR_64_SET_MASK    (UL(0xFFFFFF) << CCSIDR_64_SET_SHIFT)
>>>>>> +
>>>>>> +#define CCSIDR_32_LS_SHIFT    0
>>>>>> +#define CCSIDR_32_LS_MASK    (UL(7) << CCSIDR_32_LS_SHIFT)
>>>>>> +#define CCSIDR_32_ASSOC_SHIFT    3
>>>>>> +#define CCSIDR_32_ASSOC_MASK    (UL(0x3FF) << CCSIDR_32_ASSOC_SHIFT)
>>>>>> +#define CCSIDR_32_SET_SHIFT    13
>>>>>> +#define CCSIDR_32_SET_MASK    (UL(0x7FFF) << CCSIDR_32_SET_SHIFT)
>>>>>

[...]
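
For context, since the function itself is trimmed above: one way these
fields can be combined into a per-level cache size, as a sketch only (it
uses the non-CCIDX 32-bit CCSIDR_EL1 layout and is not the patch's actual
cache_total_size()):

    /*
     * Select the level 1 data/unified cache in CSSELR_EL1, then decode
     * CCSIDR_EL1: LineSize is log2(words per line) - 2, and the
     * associativity/set fields hold "value minus one".
     */
    static u64 l1_dcache_size_sketch(void)
    {
            u64 ccsidr;

            write_sysreg(0, csselr_el1);
            isb();
            ccsidr = read_sysreg(ccsidr_el1);

            return (1UL << (((ccsidr & CCSIDR_32_LS_MASK) >> CCSIDR_32_LS_SHIFT) + 4)) *
                   (((ccsidr & CCSIDR_32_ASSOC_MASK) >> CCSIDR_32_ASSOC_SHIFT) + 1) *
                   (((ccsidr & CCSIDR_32_SET_MASK) >> CCSIDR_32_SET_SHIFT) + 1);
    }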

>> Ok. If this was proposed before, is the link to that patchset still
>> available? :)
>>
>> When I searched for "my_zero_pfn" in the upstream kernel, I found that
>> DAX uses the zero page to fill holes in a file, in dax_load_hole().
>> mmap() on /proc/kcore can use the zero page as well.
> 
> But how often will those mapped areas actually be read afterwards, either
> in DAX or /proc/kcore? It seems like the minimal adoption of this feature
> so far across platforms (i.e. s390 and mips) might have to do with how
> frequently real-world workloads read from mapped areas backed by zero
> pages.
> 

I don't think /proc/kcore is used frequently in the real world, though it
still depends on the workload. DAX is supported by multiple filesystems
(ext2/ext4/xfs). Taking ext4 as an example, extents (blocks) that have
been allocated but not yet written with data have their reads served from
the zero page. I guess this is intended to avoid exposing data written by
a previous user, for safety reasons. That would be a common case and
depends heavily on read performance on the zero page. Besides, holes
(blocks) in ext4 files are also backed by zero pages, and that happens
frequently.

However, I'm not a filesystem guy. I checked the code and understood it
as above, but I might be completely wrong here.
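
The anonymous case, at least, is easy to see from userspace. A minimal
illustration (this is the generic read-fault-to-zero-page behaviour, not
the DAX path):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 1UL << 20;
            volatile char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            long sum = 0;

            if (p == MAP_FAILED)
                    return 1;

            /* Read-only faults: each untouched page maps to a shared zero page */
            for (size_t off = 0; off < len; off += 4096)
                    sum += p[off];

            printf("%ld\n", sum);   /* always 0 */
            return 0;
    }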

Yeah, I couldn't work out from the corresponding commit logs why this
feature was enabled on s390/mips; nothing helpful is provided there. I
guess some specific s390/MIPS CPUs have a large L1 cache capacity, with
one way spanning multiple pages. In that case, multiple (color) zero
pages can avoid cache line collisions when reading these pages. However,
I'm not sure about arm64. On the CPU where I ran my experiment, the L1
dCache and iCache are 32KB, 8-way, 64-set, meaning one way spans only 4KB.
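
To make that arithmetic explicit (illustrative only): the number of zero
page colors that can possibly help is the span of one cache way divided
by the page size, which is exactly 1 on the CPU above:

    #include <stdio.h>

    int main(void)
    {
            unsigned long line = 64, sets = 64, page = 4096; /* the CPU above */
            unsigned long way_span = line * sets;            /* 4KB */
            unsigned long colors = way_span > page ? way_span / page : 1;

            printf("colors worth allocating: %lu\n", colors); /* 1 here */
            return 0;
    }

A CPU whose L1 way spans more than one page (for example 64KB, 4-way,
64-byte lines, giving a 16KB way) is where multiple colors would start
to matter.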

This feature has a low cost to enable, since only a few extra pages are
needed, and it isn't harmful as far as I can see.

[...]

Thanks,
Gavin


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-09-22 12:39 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-16  3:25 [PATCH 0/2] arm64/mm: Enable color zero pages Gavin Shan
2020-09-16  3:25 ` [PATCH 1/2] arm64/mm: Introduce zero PGD table Gavin Shan
2020-09-16  3:25 ` [PATCH 2/2] arm64/mm: Enable color zero pages Gavin Shan
2020-09-16  8:28   ` Will Deacon
2020-09-16 10:46     ` Robin Murphy
2020-09-17  4:36       ` Gavin Shan
2020-09-17  3:35     ` Gavin Shan
2020-09-17 10:22       ` Robin Murphy
2020-09-21  2:56         ` Gavin Shan
2020-09-21 12:40           ` Anshuman Khandual
2020-09-22 12:39             ` Gavin Shan
2020-09-18 12:10   ` kernel test robot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).