linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/2] arm64/mm: Enable color zero pages
@ 2020-09-28  7:22 Gavin Shan
  2020-09-28  7:22 ` [PATCH v3 1/2] arm64/mm: Introduce zero PGD table Gavin Shan
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Gavin Shan @ 2020-09-28  7:22 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-kernel, mark.rutland, anshuman.khandual, robin.murphy,
	catalin.marinas, will, shan.gavin

The color zero page feature isn't enabled on arm64, meaning all
read-only (anonymous) VM areas are backed by the same zero page.
Reading from them puts pressure on the data cache. In the extreme
case, one particular data cache set can come under high pressure
and thrash. This series enables color zero pages to resolve the
issue.

PATCH[1/2] decouples the zero PGD table from the zero page
PATCH[2/2] allocates the needed zero pages according to the L1 cache size

Testing
=======
[1] The experiment reveals how heavily (L1) data cache misses impact
    the overall application's performance. The machine where the test
    is carried out has the following L1 data cache topology, and the
    host kernel has the following configuration.

    The test case allocates contiguous page frames through HugeTLBfs
    and reads 4 bytes of data from the same offset (0x0) of each of
    the N contiguous page frames, where N is 8 or 9 respectively in
    the two test cases below. This is repeated one million times.

    Note that 8 is the number of L1 data cache ways. The experiment
    is designed to cause L1 data cache thrashing on one particular
    set. A sketch of the test program is included after the results.

    Host:      CONFIG_ARM64_PAGE_SHIFT=12
               DEFAULT_HUGE_PAGE_SIZE=2MB
    L1 dcache: cache-line-size=64
               number-of-sets=64
               number-of-ways=8

                            N=8           N=9
    ------------------------------------------------------------------
    cache-misses:           43,429        9,038,460
    L1-dcache-load-misses:  43,429        9,038,460
    seconds time elapsed:   0.299206372   0.722253140   (2.41 times)
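
    As referenced above, here is a minimal sketch of such a test
    program. It is an illustration only, not the exact program used;
    the 2MB huge page mapping and the default N are assumptions.

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/mman.h>

        #define PAGE_SIZE  4096UL
        #define HPAGE_SIZE (2 * 1024 * 1024UL)  /* DEFAULT_HUGE_PAGE_SIZE */

        int main(int argc, char **argv)
        {
            /* N = 8 matches the L1 dcache ways; N = 9 overflows one set */
            unsigned long i, j, n = (argc > 1) ? strtoul(argv[1], NULL, 0) : 8;
            volatile int sum = 0;
            char *buf;

            /* one 2MB huge page provides physically contiguous 4KB frames */
            buf = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
            }
            buf[0] = 0;     /* fault the huge page in */

            /* read 4 bytes at offset 0x0 of N contiguous frames, 1M times */
            for (i = 0; i < 1000000UL; i++)
                for (j = 0; j < n; j++)
                    sum += *(volatile int *)(buf + j * PAGE_SIZE);

            munmap(buf, HPAGE_SIZE);
            return sum & 0xff;
        }

    The numbers above would come from something like
    "perf stat -e cache-misses,L1-dcache-load-misses ./thrash 8" (and 9);
    the program name is hypothetical.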

[2] The experiment should have been carried out on a machine where
    the capacity of one particular L1 data cache way is larger than
    4KB. However, I'm unable to find such a machine, so I have to
    evaluate the performance impact caused by L2 data cache thrashing
    instead. The experiment is carried out on a machine that has the
    following L1/L2 data cache topology. The host kernel configuration
    is the same as in [1].

    The corresponding test program allocates contiguous page frames
    through HugeTLBfs and builds a VMA backed by zero pages. The
    contiguous pages are read sequentially, 8 times, at a fixed
    offset (0) in steps of 32KB. After that, the VMA backed by zero
    pages is read sequentially once, in steps of 4KB. This is
    repeated 8 million times.

    Note that 32KB is the capacity of one L2 data cache way and 8 is
    the number of L2 data cache ways. This experiment is designed to
    cause L2 data cache thrashing on one particular set. A sketch of
    this test program likewise follows the results.

    L1 dcache:  <same as [1]>
    L2 dcache:  cache-line-size=64
                number-of-sets=512
                number-of-ways=8

    -----------------------------------------------------------------------
                            w/o patch        with patch
    -----------------------------------------------------------------------
    cache-references:       1,427,213,737    1,421,394,472
    cache-misses:              35,804,552       42,636,698
    L1-dcache-load-misses:     35,804,552       42,636,698
    seconds time elapsed:   2.602511671      2.098198172      (+19.3%)
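
    Likewise, here is a minimal sketch of the second test program,
    under the same caveats; the 8-page size of the zero-page backed
    VMA is an assumption, as the size isn't stated above.

        #define _GNU_SOURCE
        #include <sys/mman.h>

        #define PAGE_SIZE     4096UL
        #define WAY_SIZE      (32 * 1024UL)     /* one L2 dcache way */
        #define HPAGE_SIZE    (2 * 1024 * 1024UL)
        #define ZERO_VMA_SIZE (8 * PAGE_SIZE)   /* assumed VMA size */

        int main(void)
        {
            unsigned long i, j;
            volatile int sum = 0;
            char *huge, *zero;

            huge = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            /* read faults on a read-only anonymous VMA map the zero page(s) */
            zero = mmap(NULL, ZERO_VMA_SIZE, PROT_READ,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (huge == MAP_FAILED || zero == MAP_FAILED)
                return 1;
            huge[0] = 0;    /* fault the huge page in */

            for (i = 0; i < 8000000UL; i++) {
                /* fill all 8 ways of one L2 set: 8 reads in 32KB steps */
                for (j = 0; j < 8; j++)
                    sum += *(volatile int *)(huge + j * WAY_SIZE);
                /* then read the zero-page backed VMA once in 4KB steps */
                for (j = 0; j < ZERO_VMA_SIZE / PAGE_SIZE; j++)
                    sum += *(volatile int *)(zero + j * PAGE_SIZE);
            }

            return sum & 0xff;
        }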

Changes since v2:

   * Rebased to the latest upstream kernel (5.9-rc6)       (Gavin)
   * Improved the commit logs                              (Gavin)
   * Provided performance data in the cover letter         (Catalin)


Gavin Shan (2):
  arm64/mm: Introduce zero PGD table
  arm64/mm: Enable color zero pages

 arch/arm64/include/asm/cache.h       |  3 ++
 arch/arm64/include/asm/mmu_context.h |  6 +--
 arch/arm64/include/asm/pgtable.h     | 11 ++++-
 arch/arm64/kernel/cacheinfo.c        | 67 ++++++++++++++++++++++++++++
 arch/arm64/kernel/setup.c            |  2 +-
 arch/arm64/kernel/vmlinux.lds.S      |  4 ++
 arch/arm64/mm/init.c                 | 37 +++++++++++++++
 arch/arm64/mm/mmu.c                  |  7 ---
 arch/arm64/mm/proc.S                 |  2 +-
 drivers/base/cacheinfo.c             |  3 +-
 include/linux/cacheinfo.h            |  6 +++
 11 files changed, 132 insertions(+), 16 deletions(-)

-- 
2.23.0



* [PATCH v3 1/2] arm64/mm: Introduce zero PGD table
  2020-09-28  7:22 [PATCH v3 0/2] arm64/mm: Enable color zero pages Gavin Shan
@ 2020-09-28  7:22 ` Gavin Shan
  2020-09-28  7:22 ` [PATCH v3 2/2] arm64/mm: Enable color zero pages Gavin Shan
  2020-09-28 15:22 ` [PATCH v3 0/2] " Catalin Marinas
  2 siblings, 0 replies; 5+ messages in thread
From: Gavin Shan @ 2020-09-28  7:22 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-kernel, mark.rutland, anshuman.khandual, robin.murphy,
	catalin.marinas, will, shan.gavin

The zero PGD table is loaded when TTBR0_EL1 or TTBR1_EL1 is switched
to its reserved state, and it is currently exactly the zero page. As
the zero page(s) will be allocated dynamically once the colored zero
page feature is enabled in the subsequent patch, the zero page(s)
aren't usable during the early boot stage.

This introduces a dedicated zero PGD table (zero_pg_dir), decoupled
from the zero page(s).

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/include/asm/mmu_context.h | 6 +++---
 arch/arm64/include/asm/pgtable.h     | 2 ++
 arch/arm64/kernel/setup.c            | 2 +-
 arch/arm64/kernel/vmlinux.lds.S      | 4 ++++
 arch/arm64/mm/proc.S                 | 2 +-
 5 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index f2d7537d6f83..6dbc5726fd56 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -36,11 +36,11 @@ static inline void contextidr_thread_switch(struct task_struct *next)
 }
 
 /*
- * Set TTBR0 to empty_zero_page. No translations will be possible via TTBR0.
+ * Set TTBR0 to zero_pg_dir. No translations will be possible via TTBR0.
  */
 static inline void cpu_set_reserved_ttbr0(void)
 {
-	unsigned long ttbr = phys_to_ttbr(__pa_symbol(empty_zero_page));
+	unsigned long ttbr = phys_to_ttbr(__pa_symbol(zero_pg_dir));
 
 	write_sysreg(ttbr, ttbr0_el1);
 	isb();
@@ -189,7 +189,7 @@ static inline void update_saved_ttbr0(struct task_struct *tsk,
 		return;
 
 	if (mm == &init_mm)
-		ttbr = __pa_symbol(empty_zero_page);
+		ttbr = __pa_symbol(zero_pg_dir);
 	else
 		ttbr = virt_to_phys(mm->pgd) | ASID(mm) << 48;
 
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index d5d3fbe73953..6953498f4d40 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -474,6 +474,8 @@ static inline bool pud_table(pud_t pud) { return true; }
 				 PUD_TYPE_TABLE)
 #endif
 
+extern pgd_t zero_pg_dir[PTRS_PER_PGD];
+extern pgd_t zero_pg_end[];
 extern pgd_t init_pg_dir[PTRS_PER_PGD];
 extern pgd_t init_pg_end[];
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 53acbeca4f57..7e83eaed641e 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -366,7 +366,7 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
 	 * faults in case uaccess_enable() is inadvertently called by the init
 	 * thread.
 	 */
-	init_task.thread_info.ttbr0 = __pa_symbol(empty_zero_page);
+	init_task.thread_info.ttbr0 = __pa_symbol(zero_pg_dir);
 #endif
 
 	if (boot_args[1] || boot_args[2] || boot_args[3]) {
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 7cba7623fcec..3d3c155d10a4 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -137,6 +137,10 @@ SECTIONS
 	/* everything from this point to __init_begin will be marked RO NX */
 	RO_DATA(PAGE_SIZE)
 
+	zero_pg_dir = .;
+	. += PAGE_SIZE;
+	zero_pg_end = .;
+
 	idmap_pg_dir = .;
 	. += IDMAP_DIR_SIZE;
 	idmap_pg_end = .;
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 796e47a571e6..90b135c366b3 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -163,7 +163,7 @@ SYM_FUNC_END(cpu_do_resume)
 	.pushsection ".idmap.text", "awx"
 
 .macro	__idmap_cpu_set_reserved_ttbr1, tmp1, tmp2
-	adrp	\tmp1, empty_zero_page
+	adrp	\tmp1, zero_pg_dir
 	phys_to_ttbr \tmp2, \tmp1
 	offset_ttbr1 \tmp2, \tmp1
 	msr	ttbr1_el1, \tmp2
-- 
2.23.0



* [PATCH v3 2/2] arm64/mm: Enable color zero pages
  2020-09-28  7:22 [PATCH v3 0/2] arm64/mm: Enable color zero pages Gavin Shan
  2020-09-28  7:22 ` [PATCH v3 1/2] arm64/mm: Introduce zero PGD table Gavin Shan
@ 2020-09-28  7:22 ` Gavin Shan
  2020-09-28 15:22 ` [PATCH v3 0/2] " Catalin Marinas
  2 siblings, 0 replies; 5+ messages in thread
From: Gavin Shan @ 2020-09-28  7:22 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-kernel, mark.rutland, anshuman.khandual, robin.murphy,
	catalin.marinas, will, shan.gavin

This enables color zero pages by allocating contiguous page frames
for them. The number of pages is determined by the L1 dCache size
(cache_get_info() can also query the iCache), which is probed from
the hardware.

   * Export cache_setup_of_node() so that the cache topology can
     be parsed from the device tree.

   * Add cache_get_info() so that the L1 dCache size can be
     retrieved.

   * Implement setup_zero_pages(), called after the page allocator
     starts working, to allocate the contiguous pages needed by the
     color zero pages. With this, the read load on the zero pages
     is distributed across different L1/L2/L3 dCache sets, improving
     overall performance and avoiding thrashing of any one particular
     dCache set.

   * Rework ZERO_PAGE() and define __HAVE_COLOR_ZERO_PAGE.
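
   As an illustrative example (assuming 4KB pages and a 32KB L1
   dCache, i.e. the topology from the cover letter; not part of the
   patch itself), the reworked pieces fit together as follows:

      order            = get_order(32KB) = 3, i.e. 8 zero pages
      zero_page_mask   = ((PAGE_SIZE << 3) - 1) & PAGE_MASK = 0x7000
      ZERO_PAGE(vaddr) = zero page number ((vaddr >> PAGE_SHIFT) & 0x7)

   so consecutive virtual pages of a zero-page backed VMA resolve to
   different physical zero pages, distributing reads across the sets
   of any cache level whose per-way capacity exceeds the page size.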

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 arch/arm64/include/asm/cache.h   |  3 ++
 arch/arm64/include/asm/pgtable.h |  9 ++++-
 arch/arm64/kernel/cacheinfo.c    | 67 ++++++++++++++++++++++++++++++++
 arch/arm64/mm/init.c             | 37 ++++++++++++++++++
 arch/arm64/mm/mmu.c              |  7 ----
 drivers/base/cacheinfo.c         |  3 +-
 include/linux/cacheinfo.h        |  6 +++
 7 files changed, 121 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index a4d1b5f771f6..a42dbcc6b484 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -89,6 +89,9 @@ static inline int cache_line_size_of_cpu(void)
 }
 
 int cache_line_size(void);
+unsigned int cache_get_info(unsigned int level, unsigned int type,
+			    unsigned int *sets, unsigned int *ways,
+			    unsigned int *cl_size);
 
 /*
  * Read the effective value of CTR_EL0.
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 6953498f4d40..5cb5f8bb090d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -54,8 +54,13 @@ extern void __pgd_error(const char *file, int line, unsigned long val);
  * ZERO_PAGE is a global shared page that is always zero: used
  * for zero-mapped memory areas etc..
  */
-extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
-#define ZERO_PAGE(vaddr)	phys_to_page(__pa_symbol(empty_zero_page))
+extern unsigned long empty_zero_page;
+extern unsigned long zero_page_mask;
+
+#define __HAVE_COLOR_ZERO_PAGE
+#define ZERO_PAGE(vaddr)				\
+	(virt_to_page((void *)(empty_zero_page +	\
+	(((unsigned long)(vaddr)) & zero_page_mask))))
 
 #define pte_ERROR(pte)		__pte_error(__FILE__, __LINE__, pte_val(pte))
 
diff --git a/arch/arm64/kernel/cacheinfo.c b/arch/arm64/kernel/cacheinfo.c
index 7fa6828bb488..c13b8897323f 100644
--- a/arch/arm64/kernel/cacheinfo.c
+++ b/arch/arm64/kernel/cacheinfo.c
@@ -43,6 +43,73 @@ static void ci_leaf_init(struct cacheinfo *this_leaf,
 	this_leaf->type = type;
 }
 
+unsigned int cache_get_info(unsigned int level, unsigned int type,
+			    unsigned int *sets, unsigned int *ways,
+			    unsigned int *cl_size)
+{
+	int ret, i, cpu = smp_processor_id();
+	enum cache_type t;
+	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
+	struct cacheinfo ci, *p = NULL;
+
+	/* Sanity check */
+	if (type != CACHE_TYPE_INST && type != CACHE_TYPE_DATA)
+		return 0;
+
+	/* Fetch the cache information if it has been populated */
+	if (this_cpu_ci->num_leaves) {
+		for (i = 0; i < this_cpu_ci->num_leaves; i++) {
+			p = &this_cpu_ci->info_list[i];
+			if (p->level == level &&
+			    (p->type == type || p->type == CACHE_TYPE_UNIFIED))
+				break;
+		}
+
+		ret = (i < this_cpu_ci->num_leaves) ? 0 : -ENOENT;
+		goto out;
+	}
+
+	/*
+	 * The cache information isn't populated yet, we have to
+	 * retrieve it from ACPI or device tree.
+	 */
+	t = get_cache_type(level);
+	if (t == CACHE_TYPE_NOCACHE ||
+	    (t != CACHE_TYPE_SEPARATE && t != type)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	p = &ci;
+	p->type = type;
+	p->level = level;
+	this_cpu_ci->info_list = p;
+	this_cpu_ci->num_levels = 1;
+	this_cpu_ci->num_leaves = 1;
+	if (!acpi_disabled)
+		ret = cache_setup_acpi(cpu);
+	else if (of_have_populated_dt())
+		ret = cache_setup_of_node(cpu);
+	else
+		ret = -EPERM;
+
+	memset(this_cpu_ci, 0, sizeof(*this_cpu_ci));
+
+out:
+	if (!ret) {
+		if (sets)
+			*sets = p->number_of_sets;
+		if (ways)
+			*ways = p->ways_of_associativity;
+		if (cl_size)
+			*cl_size = p->coherency_line_size;
+
+		return p->size;
+	}
+
+	return 0;
+}
+
 static int __init_cache_level(unsigned int cpu)
 {
 	unsigned int ctype, level, leaves, fw_level;
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 481d22c32a2e..330a9f610f28 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -18,6 +18,7 @@
 #include <linux/gfp.h>
 #include <linux/memblock.h>
 #include <linux/sort.h>
+#include <linux/cacheinfo.h>
 #include <linux/of.h>
 #include <linux/of_fdt.h>
 #include <linux/dma-direct.h>
@@ -69,6 +70,11 @@ EXPORT_SYMBOL(vmemmap);
 phys_addr_t arm64_dma_phys_limit __ro_after_init;
 static phys_addr_t arm64_dma32_phys_limit __ro_after_init;
 
+unsigned long empty_zero_page;
+EXPORT_SYMBOL(empty_zero_page);
+unsigned long zero_page_mask;
+EXPORT_SYMBOL(zero_page_mask);
+
 #ifdef CONFIG_KEXEC_CORE
 /*
  * reserve_crashkernel() - reserves memory for crash kernel
@@ -507,6 +513,36 @@ static void __init free_unused_memmap(void)
 }
 #endif	/* !CONFIG_SPARSEMEM_VMEMMAP */
 
+static void __init setup_zero_pages(void)
+{
+	struct page *page;
+	unsigned int size;
+	int order, i;
+
+	size = cache_get_info(1, CACHE_TYPE_DATA, NULL, NULL, NULL);
+	order = size > 0 ? get_order(PAGE_ALIGN(size)) : 0;
+	order = min(order, MAX_ORDER - 1);
+
+	do {
+		empty_zero_page = __get_free_pages(GFP_KERNEL | __GFP_ZERO,
+						   order);
+		if (empty_zero_page)
+			break;
+	} while (--order >= 0);
+
+	if (!empty_zero_page)
+		panic("%s: out of memory\n", __func__);
+
+	page = virt_to_page((void *) empty_zero_page);
+	split_page(page, order);
+	for (i = 1 << order; i > 0; i--) {
+		mark_page_reserved(page);
+		page++;
+	}
+
+	zero_page_mask = ((PAGE_SIZE << order) - 1) & PAGE_MASK;
+}
+
 /*
  * mem_init() marks the free areas in the mem_map and tells us how much memory
  * is free.  This is done after various parts of the system have claimed their
@@ -527,6 +563,7 @@ void __init mem_init(void)
 #endif
 	/* this will put all unused low memory onto the freelists */
 	memblock_free_all();
+	setup_zero_pages();
 
 	mem_init_print_info(NULL);
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 75df62fea1b6..736939ab3b4f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -49,13 +49,6 @@ EXPORT_SYMBOL(vabits_actual);
 u64 kimage_voffset __ro_after_init;
 EXPORT_SYMBOL(kimage_voffset);
 
-/*
- * Empty_zero_page is a special page that is used for zero-initialized data
- * and COW.
- */
-unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)] __page_aligned_bss;
-EXPORT_SYMBOL(empty_zero_page);
-
 static pte_t bm_pte[PTRS_PER_PTE] __page_aligned_bss;
 static pmd_t bm_pmd[PTRS_PER_PMD] __page_aligned_bss __maybe_unused;
 static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss __maybe_unused;
diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index 8d553c92cd32..f0dc66fc24f1 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -153,7 +153,7 @@ static void cache_of_set_props(struct cacheinfo *this_leaf,
 	cache_associativity(this_leaf);
 }
 
-static int cache_setup_of_node(unsigned int cpu)
+int cache_setup_of_node(unsigned int cpu)
 {
 	struct device_node *np;
 	struct cacheinfo *this_leaf;
@@ -195,7 +195,6 @@ static int cache_setup_of_node(unsigned int cpu)
 	return 0;
 }
 #else
-static inline int cache_setup_of_node(unsigned int cpu) { return 0; }
 static inline bool cache_leaves_are_shared(struct cacheinfo *this_leaf,
 					   struct cacheinfo *sib_leaf)
 {
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index 46b92cd61d0c..f13d625d3e76 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -100,6 +100,12 @@ struct cpu_cacheinfo *get_cpu_cacheinfo(unsigned int cpu);
 int init_cache_level(unsigned int cpu);
 int populate_cache_leaves(unsigned int cpu);
 int cache_setup_acpi(unsigned int cpu);
+#ifdef CONFIG_OF
+int cache_setup_of_node(unsigned int cpu);
+#else
+static inline int cache_setup_of_node(unsigned int cpu) { return 0; }
+#endif
+
 #ifndef CONFIG_ACPI_PPTT
 /*
  * acpi_find_last_cache_level is only called on ACPI enabled
-- 
2.23.0



* Re: [PATCH v3 0/2] arm64/mm: Enable color zero pages
  2020-09-28  7:22 [PATCH v3 0/2] arm64/mm: Enable color zero pages Gavin Shan
  2020-09-28  7:22 ` [PATCH v3 1/2] arm64/mm: Introduce zero PGD table Gavin Shan
  2020-09-28  7:22 ` [PATCH v3 2/2] arm64/mm: Enable color zero pages Gavin Shan
@ 2020-09-28 15:22 ` Catalin Marinas
  2020-09-29  5:39   ` Gavin Shan
  2 siblings, 1 reply; 5+ messages in thread
From: Catalin Marinas @ 2020-09-28 15:22 UTC (permalink / raw)
  To: Gavin Shan
  Cc: linux-arm-kernel, linux-kernel, mark.rutland, anshuman.khandual,
	robin.murphy, will, shan.gavin

Hi Gavin,

On Mon, Sep 28, 2020 at 05:22:54PM +1000, Gavin Shan wrote:
> Testing
> =======
> [1] The experiment reveals how heavily (L1) data cache misses impact
>     the overall application's performance. The machine where the test
>     is carried out has the following L1 data cache topology, and the
>     host kernel has the following configuration.
> 
>     The test case allocates contiguous page frames through HugeTLBfs
>     and reads 4 bytes of data from the same offset (0x0) of each of
>     the N contiguous page frames, where N is 8 or 9 respectively in
>     the two test cases below. This is repeated one million times.
> 
>     Note that 8 is the number of L1 data cache ways. The experiment
>     is designed to cause L1 data cache thrashing on one particular
>     set. A sketch of the test program is included after the results.
> 
>     Host:      CONFIG_ARM64_PAGE_SHIFT=12
>                DEFAULT_HUGE_PAGE_SIZE=2MB
>     L1 dcache: cache-line-size=64
>                number-of-sets=64
>                number-of-ways=8
> 
>                             N=8           N=9
>     ------------------------------------------------------------------
>     cache-misses:           43,429        9,038,460
>     L1-dcache-load-misses:  43,429        9,038,460
>     seconds time elapsed:   0.299206372   0.722253140   (2.41 times)
> 
> [2] The experiment should have been carried out on a machine where
>     the capacity of one particular L1 data cache way is larger than
>     4KB. However, I'm unable to find such a machine, so I have to
>     evaluate the performance impact caused by L2 data cache thrashing
>     instead. The experiment is carried out on a machine that has the
>     following L1/L2 data cache topology. The host kernel configuration
>     is the same as in [1].
> 
>     The corresponding test program allocates contiguous page frames
>     through HugeTLBfs and builds a VMA backed by zero pages. The
>     contiguous pages are read sequentially, 8 times, at a fixed
>     offset (0) in steps of 32KB. After that, the VMA backed by zero
>     pages is read sequentially once, in steps of 4KB. This is
>     repeated 8 million times.
> 
>     Note that 32KB is the capacity of one L2 data cache way and 8 is
>     the number of L2 data cache ways. This experiment is designed to
>     cause L2 data cache thrashing on one particular set. A sketch of
>     this test program likewise follows the results.
> 
>     L1 dcache:  <same as [1]>
>     L2 dcache:  cache-line-size=64
>                 number-of-sets=512
>                 number-of-ways=8
> 
>     -----------------------------------------------------------------------
>                             w/o patch        with patch
>     -----------------------------------------------------------------------
>     cache-references:       1,427,213,737    1,421,394,472
>     cache-misses:              35,804,552       42,636,698
>     L1-dcache-load-misses:     35,804,552       42,636,698
>     seconds time elapsed:   2.602511671      2.098198172      (+19.3%)

No one is denying a performance improvement in a very specific case,
but what's missing here is an explanation of how these artificial
benchmarks relate to real-world applications.

-- 
Catalin


* Re: [PATCH v3 0/2] arm64/mm: Enable color zero pages
  2020-09-28 15:22 ` [PATCH v3 0/2] " Catalin Marinas
@ 2020-09-29  5:39   ` Gavin Shan
  0 siblings, 0 replies; 5+ messages in thread
From: Gavin Shan @ 2020-09-29  5:39 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: linux-arm-kernel, linux-kernel, mark.rutland, anshuman.khandual,
	robin.murphy, will, shan.gavin

Hi Catalin,

On 9/29/20 1:22 AM, Catalin Marinas wrote:
> On Mon, Sep 28, 2020 at 05:22:54PM +1000, Gavin Shan wrote:
>> Testing
>> =======
>> [1] The experiment reveals how heavily (L1) data cache misses impact
>>      the overall application's performance. The machine where the test
>>      is carried out has the following L1 data cache topology, and the
>>      host kernel has the following configuration.
>>
>>      The test case allocates contiguous page frames through HugeTLBfs
>>      and reads 4 bytes of data from the same offset (0x0) of each of
>>      the N contiguous page frames, where N is 8 or 9 respectively in
>>      the two test cases below. This is repeated one million times.
>>
>>      Note that 8 is the number of L1 data cache ways. The experiment
>>      is designed to cause L1 data cache thrashing on one particular
>>      set. A sketch of the test program is included after the results.
>>
>>      Host:      CONFIG_ARM64_PAGE_SHIFT=12
>>                 DEFAULT_HUGE_PAGE_SIZE=2MB
>>      L1 dcache: cache-line-size=64
>>                 number-of-sets=64
>>                 number-of-ways=8
>>
>>                              N=8           N=9
>>      ------------------------------------------------------------------
>>      cache-misses:           43,429        9,038,460
>>      L1-dcache-load-misses:  43,429        9,038,460
>>      seconds time elapsed:   0.299206372   0.722253140   (2.41 times)
>>
>> [2] The experiment should have been carried out on a machine where
>>      the capacity of one particular L1 data cache way is larger than
>>      4KB. However, I'm unable to find such a machine, so I have to
>>      evaluate the performance impact caused by L2 data cache thrashing
>>      instead. The experiment is carried out on a machine that has the
>>      following L1/L2 data cache topology. The host kernel configuration
>>      is the same as in [1].
>>
>>      The corresponding test program allocates contiguous page frames
>>      through HugeTLBfs and builds a VMA backed by zero pages. The
>>      contiguous pages are read sequentially, 8 times, at a fixed
>>      offset (0) in steps of 32KB. After that, the VMA backed by zero
>>      pages is read sequentially once, in steps of 4KB. This is
>>      repeated 8 million times.
>>
>>      Note that 32KB is the capacity of one L2 data cache way and 8 is
>>      the number of L2 data cache ways. This experiment is designed to
>>      cause L2 data cache thrashing on one particular set. A sketch of
>>      this test program likewise follows the results.
>>
>>      L1 dcache:  <same as [1]>
>>      L2 dcache:  cache-line-size=64
>>                  number-of-sets=512
>>                  number-of-ways=8
>>
>>      -----------------------------------------------------------------------
>>                              w/o patch        with patch
>>      -----------------------------------------------------------------------
>>      cache-references:       1,427,213,737    1,421,394,472
>>      cache-misses:              35,804,552       42,636,698
>>      L1-dcache-load-misses:     35,804,552       42,636,698
>>      seconds time elapsed:   2.602511671      2.098198172      (+19.3%)
> 
> No one is denying a performance improvement in a very specific case,
> but what's missing here is an explanation of how these artificial
> benchmarks relate to real-world applications.
> 

Thanks for your comments. It depends on how often the zero page(s)
are read. The idea is to distribute reads of the zero page(s) across
multiple cache sets. Otherwise, the cache sets corresponding to the
zero page(s) carry more load and are prone to thrashing, depending
on the workload pattern.
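
To make that concrete with the L2 topology from [2] in the cover
letter (an illustration, not a claim about every machine):

    way capacity = number-of-sets * cache-line-size = 512 * 64B = 32KB
    colors       = way capacity / page size         = 32KB / 4KB = 8

A single 4KB zero page always lands in the same 64 of the 512 L2
sets, while 8 colored zero pages spread the read load across all
512 sets.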

As discussed on v1, there are two use cases in the kernel code:
(1) /proc/vmcore and (2) DAX. (1) is only valid on x86, where
non-RAM-resident pages are mapped and backed by zero page(s). For
(2), I was expecting to set up xfs with DAX on RBD (ram block
device). Unfortunately, DAX support for RBD was removed two years
ago, so I'm unable to enable xfs with DAX there, and DAX is only
supported on limited hardware that I don't have around.

    # mknod /dev/ramdisk b 1 20
    # mkfs.xfs /dev/ramdisk
    # mkdir -p /tmp/ramdisk
    # mount -txfs -odax /dev/ramdisk /tmp/ramdisk
    # dmesg | tail -n 4
    [ 3721.848830] brd: module loaded
    [ 3772.015934] XFS (ram20): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
    [ 3772.023423] XFS (ram20): DAX unsupported by block device. Turning off DAX.
    [ 3772.030285] XFS (ram20): DAX and reflink cannot be used together!

The feature just needs a couple of extra pages, so memory consumption
wouldn't be a concern. However, the caching behaviour when reading
the zero page(s) does change, because the cache lines holding the
zero pages are distributed across sets; the impact depends on how
frequently the zero page(s) are accessed. Also, I tried building
the kernel image and no performance change was detected:

    command:           make -j 80 clean; time make -j 80
                       (executed 3 times)
    without the patch: 3m29.084s 3m29.265s 3m30.806s
    with the patch:    3m28.954s 3m29.819s 3m30.180s

Cheers,
Gavin



end of thread, other threads:[~2020-09-29  5:40 UTC | newest]

Thread overview: 5+ messages
2020-09-28  7:22 [PATCH v3 0/2] arm64/mm: Enable color zero pages Gavin Shan
2020-09-28  7:22 ` [PATCH v3 1/2] arm64/mm: Introduce zero PGD table Gavin Shan
2020-09-28  7:22 ` [PATCH v3 2/2] arm64/mm: Enable color zero pages Gavin Shan
2020-09-28 15:22 ` [PATCH v3 0/2] " Catalin Marinas
2020-09-29  5:39   ` Gavin Shan
