* [v3 0/9] parallelized "struct page" zeroing
@ 2017-05-05 17:03 Pavel Tatashin
  2017-05-05 17:03 ` [v3 1/9] sparc64: simplify vmemmap_populate Pavel Tatashin
                   ` (9 more replies)
  0 siblings, 10 replies; 46+ messages in thread
From: Pavel Tatashin @ 2017-05-05 17:03 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

Changelog:
	v2 - v3
	- Addressed David's comments about one change per patch:
		* Split changes to platforms into 4 patches
		* Made "do not zero vmemmap_buf" a separate patch
	v1 - v2
	- Per request, added s390 to deferred "struct page" zeroing
	- Collected performance data on x86 which proves the importance of
	  keeping memset() as a prefetch (see below).

When the deferred struct page initialization feature is enabled, we get a
performance gain from initializing vmemmap in parallel after other CPUs
are started. However, we still zero the memory for vmemmap using one boot
CPU. This patch set fixes that memset-zeroing limitation by deferring it
as well.

Performance gain on SPARC with 32T:
base:	https://hastebin.com/ozanelatat.go
fix:	https://hastebin.com/utonawukof.go

As you can see, without the fix it takes 97.89s to boot;
with the fix it takes 46.91s.

Performance gain on x86 with 1T:
base:	https://hastebin.com/uvifasohon.pas
fix:	https://hastebin.com/anodiqaguj.pas

On Intel we save 10.66s/T while on SPARC we save 1.59s/T. Intel has twice
as many pages per terabyte (its base page size is half of SPARC's), and
also fewer nodes than SPARC (SPARC: 32 nodes vs. Intel: 8 nodes).

It takes one thread 11.25s to zero vmemmap on Intel for 1T, so it should
take an additional 11.25 / 8 = 1.4s (this machine has 8 nodes) per node to
initialize the memory, but it takes only an additional 0.456s per node,
which means that on Intel we also benefit from doing the memset() and the
initialization of all other fields in one place.
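
As a sanity check on these numbers (assuming the usual 64-byte struct page
and 4K base pages on x86): 1T / 4K = 256M struct pages, and 256M * 64
bytes = 16G of vmemmap, so the 11.25s above corresponds to roughly
1.4 GB/s of single-threaded zeroing bandwidth.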

Pavel Tatashin (9):
  sparc64: simplify vmemmap_populate
  mm: defining memblock_virt_alloc_try_nid_raw
  mm: add "zero" argument to vmemmap allocators
  mm: do not zero vmemmap_buf
  mm: zero struct pages during initialization
  sparc64: teach sparc not to zero struct pages memory
  x86: teach x86 not to zero struct pages memory
  powerpc: teach platforms not to zero struct pages memory
  s390: teach platforms not to zero struct pages memory

 arch/powerpc/mm/init_64.c |    4 +-
 arch/s390/mm/vmem.c       |    5 ++-
 arch/sparc/mm/init_64.c   |   26 +++++++----------------
 arch/x86/mm/init_64.c     |    3 +-
 include/linux/bootmem.h   |    3 ++
 include/linux/mm.h        |   15 +++++++++++--
 mm/memblock.c             |   46 ++++++++++++++++++++++++++++++++++++------
 mm/page_alloc.c           |    3 ++
 mm/sparse-vmemmap.c       |   48 +++++++++++++++++++++++++++++---------------
 9 files changed, 103 insertions(+), 50 deletions(-)


* [v3 1/9] sparc64: simplify vmemmap_populate
  2017-05-05 17:03 [v3 0/9] parallelized "struct page" zeroing Pavel Tatashin
@ 2017-05-05 17:03 ` Pavel Tatashin
  2017-05-05 17:03 ` [v3 2/9] mm: defining memblock_virt_alloc_try_nid_raw Pavel Tatashin
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Pavel Tatashin @ 2017-05-05 17:03 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

Remove duplicated code by using the common functions
vmemmap_pgd_populate() and vmemmap_pud_populate().

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 arch/sparc/mm/init_64.c |   23 ++++++-----------------
 1 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 0cda653..14cc1fc 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2530,30 +2530,19 @@ int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
 	vstart = vstart & PMD_MASK;
 	vend = ALIGN(vend, PMD_SIZE);
 	for (; vstart < vend; vstart += PMD_SIZE) {
-		pgd_t *pgd = pgd_offset_k(vstart);
+		pgd_t *pgd = vmemmap_pgd_populate(vstart, node);
 		unsigned long pte;
 		pud_t *pud;
 		pmd_t *pmd;
 
-		if (pgd_none(*pgd)) {
-			pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
+		if (!pgd)
+			return -ENOMEM;
 
-			if (!new)
-				return -ENOMEM;
-			pgd_populate(&init_mm, pgd, new);
-		}
-
-		pud = pud_offset(pgd, vstart);
-		if (pud_none(*pud)) {
-			pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
-
-			if (!new)
-				return -ENOMEM;
-			pud_populate(&init_mm, pud, new);
-		}
+		pud = vmemmap_pud_populate(pgd, vstart, node);
+		if (!pud)
+			return -ENOMEM;
 
 		pmd = pmd_offset(pud, vstart);
-
 		pte = pmd_val(*pmd);
 		if (!(pte & _PAGE_VALID)) {
 			void *block = vmemmap_alloc_block(PMD_SIZE, node);
-- 
1.7.1


* [v3 2/9] mm: defining memblock_virt_alloc_try_nid_raw
  2017-05-05 17:03 [v3 0/9] parallelized "struct page" zeroing Pavel Tatashin
  2017-05-05 17:03 ` [v3 1/9] sparc64: simplify vmemmap_populate Pavel Tatashin
@ 2017-05-05 17:03 ` Pavel Tatashin
  2017-05-05 17:03 ` [v3 3/9] mm: add "zero" argument to vmemmap allocators Pavel Tatashin
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Pavel Tatashin @ 2017-05-05 17:03 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

A new variant of the memblock_virt_alloc_*() allocators that:
- Does not zero the allocated memory
- Does not panic if the request cannot be satisfied
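
A minimal usage sketch (the call below mirrors the new prototype; the
surrounding caller and error handling are illustrative only):

	void *ptr = memblock_virt_alloc_try_nid_raw(size, PMD_SIZE,
			__pa(MAX_DMA_ADDRESS), BOOTMEM_ALLOC_ACCESSIBLE, nid);
	if (!ptr)		/* no panic: the caller handles failure */
		return -ENOMEM;
	/* memory is NOT zeroed: the caller must initialize every byte */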

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 include/linux/bootmem.h |    3 +++
 mm/memblock.c           |   46 +++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index 962164d..4e0f08b 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@ extern int reserve_bootmem_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE		(~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void *memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
+				      phys_addr_t min_addr,
+				      phys_addr_t max_addr, int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
 		phys_addr_t align, phys_addr_t min_addr,
 		phys_addr_t max_addr, int nid);
diff --git a/mm/memblock.c b/mm/memblock.c
index 696f06d..7fdc555 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1271,7 +1271,7 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
 static void * __init memblock_virt_alloc_internal(
 				phys_addr_t size, phys_addr_t align,
 				phys_addr_t min_addr, phys_addr_t max_addr,
-				int nid)
+				int nid, bool zero)
 {
 	phys_addr_t alloc;
 	void *ptr;
@@ -1322,7 +1322,8 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
 	return NULL;
 done:
 	ptr = phys_to_virt(alloc);
-	memset(ptr, 0, size);
+	if (zero)
+		memset(ptr, 0, size);
 
 	/*
 	 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1336,6 +1337,37 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
 }
 
 /**
+ * memblock_virt_alloc_try_nid_raw - allocate boot memory block without zeroing
+ * memory and without panicking
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ *	  is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ *	      is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ *	      allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ *
+ * Public function, provides additional debug information (including caller
+ * info), if enabled. Does not zero allocated memory, does not panic if request
+ * cannot be satisfied.
+ *
+ * RETURNS:
+ * Virtual address of allocated memory block on success, NULL on failure.
+ */
+void * __init memblock_virt_alloc_try_nid_raw(
+			phys_addr_t size, phys_addr_t align,
+			phys_addr_t min_addr, phys_addr_t max_addr,
+			int nid)
+{
+	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
+		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
+		     (u64)max_addr, (void *)_RET_IP_);
+	return memblock_virt_alloc_internal(size, align,
+					   min_addr, max_addr, nid, false);
+}
+
+/**
  * memblock_virt_alloc_try_nid_nopanic - allocate boot memory block
  * @size: size of memory block to be allocated in bytes
  * @align: alignment of the region and block's size
@@ -1346,8 +1378,8 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
  *	      allocate only from memory limited by memblock.current_limit value
  * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
  *
- * Public version of _memblock_virt_alloc_try_nid_nopanic() which provides
- * additional debug information (including caller info), if enabled.
+ * Public function, provides additional debug information (including caller
+ * info), if enabled. This function zeroes the allocated memory.
  *
  * RETURNS:
  * Virtual address of allocated memory block on success, NULL on failure.
@@ -1361,7 +1393,7 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
 		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
 		     (u64)max_addr, (void *)_RET_IP_);
 	return memblock_virt_alloc_internal(size, align, min_addr,
-					     max_addr, nid);
+					     max_addr, nid, true);
 }
 
 /**
@@ -1375,7 +1407,7 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
  *	      allocate only from memory limited by memblock.current_limit value
  * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
  *
- * Public panicking version of _memblock_virt_alloc_try_nid_nopanic()
+ * Public panicking version of memblock_virt_alloc_try_nid_nopanic()
  * which provides debug information (including caller info), if enabled,
  * and panics if the request can not be satisfied.
  *
@@ -1393,7 +1425,7 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
 		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
 		     (u64)max_addr, (void *)_RET_IP_);
 	ptr = memblock_virt_alloc_internal(size, align,
-					   min_addr, max_addr, nid);
+					   min_addr, max_addr, nid, true);
 	if (ptr)
 		return ptr;
 
-- 
1.7.1


* [v3 3/9] mm: add "zero" argument to vmemmap allocators
  2017-05-05 17:03 [v3 0/9] parallelized "struct page" zeroing Pavel Tatashin
  2017-05-05 17:03 ` [v3 1/9] sparc64: simplify vmemmap_populate Pavel Tatashin
  2017-05-05 17:03 ` [v3 2/9] mm: defining memblock_virt_alloc_try_nid_raw Pavel Tatashin
@ 2017-05-05 17:03 ` Pavel Tatashin
  2017-05-13 19:17   ` kbuild test robot
  2017-05-05 17:03 ` [v3 4/9] mm: do not zero vmemmap_buf Pavel Tatashin
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Pavel Tatashin @ 2017-05-05 17:03 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

Allow clients to request non-zeroed memory from the vmemmap allocator.
The following two public functions gain a new boolean argument called
zero:

__vmemmap_alloc_block_buf()
vmemmap_alloc_block()

When zero is true, memory allocated by the memblock allocator is zeroed
(the current behavior); when it is false, the memory is not zeroed.

This change allows for optimizations where the client knows when it is
better to zero the memory: perhaps later, once other CPUs are started, or
perhaps never, because the client is going to set every byte of the
allocated memory anyway.
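
For example (a hedged sketch; the call sites are illustrative, the
function is the one changed below):

	/* current behavior: the allocation is zeroed for the caller */
	p = vmemmap_alloc_block(PAGE_SIZE, node, true);

	/* deferred zeroing: the caller initializes every byte itself */
	p = vmemmap_alloc_block(PAGE_SIZE, node, false);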

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 arch/powerpc/mm/init_64.c |    4 +-
 arch/s390/mm/vmem.c       |    5 ++-
 arch/sparc/mm/init_64.c   |    3 +-
 arch/x86/mm/init_64.c     |    3 +-
 include/linux/mm.h        |    6 ++--
 mm/sparse-vmemmap.c       |   48 +++++++++++++++++++++++++++++---------------
 6 files changed, 43 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index c22f207..d42c6b3 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -133,7 +133,7 @@ static int __meminit vmemmap_populated(unsigned long start, int page_size)
 
 	/* allocate a page when required and hand out chunks */
 	if (!num_left) {
-		next = vmemmap_alloc_block(PAGE_SIZE, node);
+		next = vmemmap_alloc_block(PAGE_SIZE, node, true);
 		if (unlikely(!next)) {
 			WARN_ON(1);
 			return NULL;
@@ -181,7 +181,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 		if (vmemmap_populated(start, page_size))
 			continue;
 
-		p = vmemmap_alloc_block(page_size, node);
+		p = vmemmap_alloc_block(page_size, node, true);
 		if (!p)
 			return -ENOMEM;
 
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 60d3899..9c75214 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -251,7 +251,8 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 			if (MACHINE_HAS_EDAT1) {
 				void *new_page;
 
-				new_page = vmemmap_alloc_block(PMD_SIZE, node);
+				new_page = vmemmap_alloc_block(PMD_SIZE, node,
+							       true);
 				if (!new_page)
 					goto out;
 				pmd_val(*pm_dir) = __pa(new_page) | sgt_prot;
@@ -271,7 +272,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 		if (pte_none(*pt_dir)) {
 			void *new_page;
 
-			new_page = vmemmap_alloc_block(PAGE_SIZE, node);
+			new_page = vmemmap_alloc_block(PAGE_SIZE, node, true);
 			if (!new_page)
 				goto out;
 			pte_val(*pt_dir) = __pa(new_page) | pgt_prot;
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 14cc1fc..c72d070 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2545,7 +2545,8 @@ int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
 		pmd = pmd_offset(pud, vstart);
 		pte = pmd_val(*pmd);
 		if (!(pte & _PAGE_VALID)) {
-			void *block = vmemmap_alloc_block(PMD_SIZE, node);
+			void *block = vmemmap_alloc_block(PMD_SIZE, node,
+							  true);
 
 			if (!block)
 				return -ENOMEM;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 745e5e1..839e5d4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1275,7 +1275,8 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 		if (pmd_none(*pmd)) {
 			void *p;
 
-			p = __vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
+			p = __vmemmap_alloc_block_buf(PMD_SIZE, node, altmap,
+						      true);
 			if (p) {
 				pte_t entry;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5d22e69..4375015 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2402,13 +2402,13 @@ void sparse_mem_maps_populate_node(struct page **map_map,
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node);
-void *vmemmap_alloc_block(unsigned long size, int node);
+void *vmemmap_alloc_block(unsigned long size, int node, bool zero);
 struct vmem_altmap;
 void *__vmemmap_alloc_block_buf(unsigned long size, int node,
-		struct vmem_altmap *altmap);
+		struct vmem_altmap *altmap, bool zero);
 static inline void *vmemmap_alloc_block_buf(unsigned long size, int node)
 {
-	return __vmemmap_alloc_block_buf(size, node, NULL);
+	return __vmemmap_alloc_block_buf(size, node, NULL, true);
 }
 
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index a56c398..5d255b0 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -39,16 +39,27 @@
 static void * __ref __earlyonly_bootmem_alloc(int node,
 				unsigned long size,
 				unsigned long align,
-				unsigned long goal)
+				unsigned long goal,
+				bool zero)
 {
-	return memblock_virt_alloc_try_nid(size, align, goal,
-					    BOOTMEM_ALLOC_ACCESSIBLE, node);
+	void *mem = memblock_virt_alloc_try_nid_raw(size, align, goal,
+						    BOOTMEM_ALLOC_ACCESSIBLE,
+						    node);
+	if (!mem) {
+		panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d from=0x%lx\n",
+		      __func__, size, align, node, goal);
+		return NULL;
+	}
+
+	if (zero)
+		memset(mem, 0, size);
+	return mem;
 }
 
 static void *vmemmap_buf;
 static void *vmemmap_buf_end;
 
-void * __meminit vmemmap_alloc_block(unsigned long size, int node)
+void * __meminit vmemmap_alloc_block(unsigned long size, int node, bool zero)
 {
 	/* If the main allocator is up use that, fallback to bootmem. */
 	if (slab_is_available()) {
@@ -67,24 +78,27 @@
 		return NULL;
 	} else
 		return __earlyonly_bootmem_alloc(node, size, size,
-				__pa(MAX_DMA_ADDRESS));
+				__pa(MAX_DMA_ADDRESS), zero);
 }
 
 /* need to make sure size is all the same during early stage */
-static void * __meminit alloc_block_buf(unsigned long size, int node)
+static void * __meminit alloc_block_buf(unsigned long size, int node, bool zero)
 {
 	void *ptr;
 
 	if (!vmemmap_buf)
-		return vmemmap_alloc_block(size, node);
+		return vmemmap_alloc_block(size, node, zero);
 
 	/* take the from buf */
 	ptr = (void *)ALIGN((unsigned long)vmemmap_buf, size);
 	if (ptr + size > vmemmap_buf_end)
-		return vmemmap_alloc_block(size, node);
+		return vmemmap_alloc_block(size, node, zero);
 
 	vmemmap_buf = ptr + size;
 
+	if (zero)
+		memset(ptr, 0, size);
+
 	return ptr;
 }
 
@@ -152,11 +166,11 @@ static unsigned long __meminit vmem_altmap_alloc(struct vmem_altmap *altmap,
 
 /* need to make sure size is all the same during early stage */
 void * __meminit __vmemmap_alloc_block_buf(unsigned long size, int node,
-		struct vmem_altmap *altmap)
+		struct vmem_altmap *altmap, bool zero)
 {
 	if (altmap)
 		return altmap_alloc_block_buf(size, altmap);
-	return alloc_block_buf(size, node);
+	return alloc_block_buf(size, node, zero);
 }
 
 void __meminit vmemmap_verify(pte_t *pte, int node,
@@ -175,7 +189,7 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte)) {
 		pte_t entry;
-		void *p = alloc_block_buf(PAGE_SIZE, node);
+		void *p = alloc_block_buf(PAGE_SIZE, node, true);
 		if (!p)
 			return NULL;
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
@@ -188,7 +202,7 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block(PAGE_SIZE, node, true);
 		if (!p)
 			return NULL;
 		pmd_populate_kernel(&init_mm, pmd, p);
@@ -200,7 +214,7 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 {
 	pud_t *pud = pud_offset(p4d, addr);
 	if (pud_none(*pud)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block(PAGE_SIZE, node, true);
 		if (!p)
 			return NULL;
 		pud_populate(&init_mm, pud, p);
@@ -212,7 +226,7 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 {
 	p4d_t *p4d = p4d_offset(pgd, addr);
 	if (p4d_none(*p4d)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block(PAGE_SIZE, node, true);
 		if (!p)
 			return NULL;
 		p4d_populate(&init_mm, p4d, p);
@@ -224,7 +238,7 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 {
 	pgd_t *pgd = pgd_offset_k(addr);
 	if (pgd_none(*pgd)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block(PAGE_SIZE, node, true);
 		if (!p)
 			return NULL;
 		pgd_populate(&init_mm, pgd, p);
@@ -290,8 +304,8 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
 	void *vmemmap_buf_start;
 
 	size = ALIGN(size, PMD_SIZE);
-	vmemmap_buf_start = __earlyonly_bootmem_alloc(nodeid, size * map_count,
-			 PMD_SIZE, __pa(MAX_DMA_ADDRESS));
+	vmemmap_buf_start = __earlyonly_bootmem_alloc(nodeid, size
+			* map_count, PMD_SIZE, __pa(MAX_DMA_ADDRESS), true);
 
 	if (vmemmap_buf_start) {
 		vmemmap_buf = vmemmap_buf_start;
-- 
1.7.1


* [v3 4/9] mm: do not zero vmemmap_buf
  2017-05-05 17:03 [v3 0/9] parallelized "struct page" zeroing Pavel Tatashin
                   ` (2 preceding siblings ...)
  2017-05-05 17:03 ` [v3 3/9] mm: add "zero" argument to vmemmap allocators Pavel Tatashin
@ 2017-05-05 17:03 ` Pavel Tatashin
  2017-05-05 17:03 ` [v3 5/9] mm: zero struct pages during initialization Pavel Tatashin
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Pavel Tatashin @ 2017-05-05 17:03 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

alloc_block_buf() either uses an external allocator, by calling
vmemmap_alloc_block(), or, when available, uses the pre-allocated
vmemmap_buf. In either case, alloc_block_buf() knows when to zero memory
based on the "zero" argument, so there is no need to zero vmemmap_buf
beforehand. Let clients of alloc_block_buf() decide whether zeroing is
needed.
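
For reference, a condensed (non-authoritative) view of the resulting
allocation path, pieced together from the code as of the previous patch:

	static void * __meminit alloc_block_buf(unsigned long size, int node,
						bool zero)
	{
		void *ptr;

		if (!vmemmap_buf)	/* no pre-allocated buffer */
			return vmemmap_alloc_block(size, node, zero);

		ptr = (void *)ALIGN((unsigned long)vmemmap_buf, size);
		if (ptr + size > vmemmap_buf_end)	/* buffer exhausted */
			return vmemmap_alloc_block(size, node, zero);

		vmemmap_buf = ptr + size;
		if (zero)		/* zeroing happens here, per chunk... */
			memset(ptr, 0, size);

		return ptr;	/* ...so vmemmap_buf need not be pre-zeroed */
	}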

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 mm/sparse-vmemmap.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 5d255b0..1e9508b 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -305,7 +305,7 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
 
 	size = ALIGN(size, PMD_SIZE);
 	vmemmap_buf_start = __earlyonly_bootmem_alloc(nodeid, size
-			* map_count, PMD_SIZE, __pa(MAX_DMA_ADDRESS), true);
+			* map_count, PMD_SIZE, __pa(MAX_DMA_ADDRESS), false);
 
 	if (vmemmap_buf_start) {
 		vmemmap_buf = vmemmap_buf_start;
-- 
1.7.1


* [v3 5/9] mm: zero struct pages during initialization
  2017-05-05 17:03 [v3 0/9] parallelized "struct page" zeroing Pavel Tatashin
                   ` (3 preceding siblings ...)
  2017-05-05 17:03 ` [v3 4/9] mm: do not zero vmemmap_buf Pavel Tatashin
@ 2017-05-05 17:03 ` Pavel Tatashin
  2017-05-05 17:03 ` [v3 6/9] sparc64: teach sparc not to zero struct pages memory Pavel Tatashin
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Pavel Tatashin @ 2017-05-05 17:03 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

When deferred struct page initialization is enabled, do not expect that
the memory that was allocated for struct pages was zeroed by the
allocator. Zero it when "struct page"s are initialized.

Also, a boolean define, VMEMMAP_ZERO, is provided to tell platforms
whether they should zero memory up front or can defer it.
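
With this define, an architecture's call site stays the same regardless of
the config; the sketch below matches what the later patches in this series
do:

	/*
	 * Zeroed up front, unless deferred struct page init will zero it
	 * later in __init_single_page().
	 */
	block = vmemmap_alloc_block(PMD_SIZE, node, VMEMMAP_ZERO);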

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 include/linux/mm.h |    9 +++++++++
 mm/page_alloc.c    |    3 +++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4375015..1c481fc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2419,6 +2419,15 @@ int vmemmap_populate_basepages(unsigned long start, unsigned long end,
 #ifdef CONFIG_MEMORY_HOTPLUG
 void vmemmap_free(unsigned long start, unsigned long end);
 #endif
+/*
+ * Don't zero "struct page"s during early boot; zero them only when they
+ * are initialized in parallel.
+ */
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+#define VMEMMAP_ZERO	false
+#else
+#define VMEMMAP_ZERO	true
+#endif
 void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
 				  unsigned long size);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c25de4..e736c6a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1159,6 +1159,9 @@ static void free_one_page(struct zone *zone,
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid)
 {
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+	memset(page, 0, sizeof(struct page));
+#endif
 	set_page_links(page, zone, nid, pfn);
 	init_page_count(page);
 	page_mapcount_reset(page);
-- 
1.7.1


* [v3 6/9] sparc64: teach sparc not to zero struct pages memory
  2017-05-05 17:03 [v3 0/9] parallelized "struct page" zeroing Pavel Tatashin
                   ` (4 preceding siblings ...)
  2017-05-05 17:03 ` [v3 5/9] mm: zero struct pages during initialization Pavel Tatashin
@ 2017-05-05 17:03 ` Pavel Tatashin
  2017-05-05 17:03 ` [v3 7/9] x86: teach x86 " Pavel Tatashin
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Pavel Tatashin @ 2017-05-05 17:03 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

If we are using the deferred struct page initialization feature, most
"struct page"s are initialized after other CPUs are started, and hence we
benefit from doing this job in parallel. However, we still zero all the
memory that is allocated for "struct page"s using the boot CPU.  This
patch solves that problem by deferring the zeroing of "struct page"s
until they are initialized, on SPARC.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 arch/sparc/mm/init_64.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index c72d070..dae040c 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2546,7 +2546,7 @@ int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
 		pte = pmd_val(*pmd);
 		if (!(pte & _PAGE_VALID)) {
 			void *block = vmemmap_alloc_block(PMD_SIZE, node,
-							  true);
+							  VMEMMAP_ZERO);
 
 			if (!block)
 				return -ENOMEM;
-- 
1.7.1


* [v3 7/9] x86: teach x86 not to zero struct pages memory
  2017-05-05 17:03 [v3 0/9] parallelized "struct page" zeroing Pavel Tatashin
                   ` (5 preceding siblings ...)
  2017-05-05 17:03 ` [v3 6/9] sparc64: teach sparc not to zero struct pages memory Pavel Tatashin
@ 2017-05-05 17:03 ` Pavel Tatashin
  2017-05-05 17:03 ` [v3 8/9] powerpc: teach platforms " Pavel Tatashin
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Pavel Tatashin @ 2017-05-05 17:03 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

If we are using the deferred struct page initialization feature, most
"struct page"s are initialized after other CPUs are started, and hence we
benefit from doing this job in parallel. However, we still zero all the
memory that is allocated for "struct page"s using the boot CPU.  This
patch solves that problem by deferring the zeroing of "struct page"s
until they are initialized, on x86 platforms.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 arch/x86/mm/init_64.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 839e5d4..332a21e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1276,7 +1276,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 			void *p;
 
 			p = __vmemmap_alloc_block_buf(PMD_SIZE, node, altmap,
-						      true);
+						      VMEMMAP_ZERO);
 			if (p) {
 				pte_t entry;
 
-- 
1.7.1


* [v3 8/9] powerpc: teach platforms not to zero struct pages memory
  2017-05-05 17:03 [v3 0/9] parallelized "struct page" zeroing Pavel Tatashin
                   ` (6 preceding siblings ...)
  2017-05-05 17:03 ` [v3 7/9] x86: teach x86 " Pavel Tatashin
@ 2017-05-05 17:03 ` Pavel Tatashin
  2017-05-05 17:03 ` [v3 9/9] s390: " Pavel Tatashin
  2017-05-09 18:12 ` [v3 0/9] parallelized "struct page" zeroing Michal Hocko
  9 siblings, 0 replies; 46+ messages in thread
From: Pavel Tatashin @ 2017-05-05 17:03 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

If we are using the deferred struct page initialization feature, most
"struct page"s are initialized after other CPUs are started, and hence we
benefit from doing this job in parallel. However, we still zero all the
memory that is allocated for "struct page"s using the boot CPU.  This
patch solves that problem by deferring the zeroing of "struct page"s
until they are initialized, on PowerPC platforms.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 arch/powerpc/mm/init_64.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index d42c6b3..c381bd7 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -181,7 +181,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 		if (vmemmap_populated(start, page_size))
 			continue;
 
-		p = vmemmap_alloc_block(page_size, node, true);
+		p = vmemmap_alloc_block(page_size, node, VMEMMAP_ZERO);
 		if (!p)
 			return -ENOMEM;
 
-- 
1.7.1


* [v3 9/9] s390: teach platforms not to zero struct pages memory
  2017-05-05 17:03 [v3 0/9] parallelized "struct page" zeroing Pavel Tatashin
                   ` (7 preceding siblings ...)
  2017-05-05 17:03 ` [v3 8/9] powerpc: teach platforms " Pavel Tatashin
@ 2017-05-05 17:03 ` Pavel Tatashin
  2017-05-08 11:36   ` Heiko Carstens
  2017-05-09 18:12 ` [v3 0/9] parallelized "struct page" zeroing Michal Hocko
  9 siblings, 1 reply; 46+ messages in thread
From: Pavel Tatashin @ 2017-05-05 17:03 UTC (permalink / raw)
  To: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

If we are using the deferred struct page initialization feature, most
"struct page"s are initialized after other CPUs are started, and hence we
benefit from doing this job in parallel. However, we still zero all the
memory that is allocated for "struct page"s using the boot CPU.  This
patch solves that problem by deferring the zeroing of "struct page"s
until they are initialized, on s390 platforms.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 arch/s390/mm/vmem.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 9c75214..ffe9ba1 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -252,7 +252,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 				void *new_page;
 
 				new_page = vmemmap_alloc_block(PMD_SIZE, node,
-							       true);
+							       VMEMMAP_ZERO);
 				if (!new_page)
 					goto out;
 				pmd_val(*pm_dir) = __pa(new_page) | sgt_prot;
-- 
1.7.1


* Re: [v3 9/9] s390: teach platforms not to zero struct pages memory
  2017-05-05 17:03 ` [v3 9/9] s390: " Pavel Tatashin
@ 2017-05-08 11:36   ` Heiko Carstens
  2017-05-15 18:24     ` Pasha Tatashin
  0 siblings, 1 reply; 46+ messages in thread
From: Heiko Carstens @ 2017-05-08 11:36 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, davem

On Fri, May 05, 2017 at 01:03:16PM -0400, Pavel Tatashin wrote:
> If we are using the deferred struct page initialization feature, most
> "struct page"s are initialized after other CPUs are started, and hence we
> benefit from doing this job in parallel. However, we still zero all the
> memory that is allocated for "struct page"s using the boot CPU.  This
> patch solves that problem by deferring the zeroing of "struct page"s
> until they are initialized, on s390 platforms.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
> ---
>  arch/s390/mm/vmem.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
> index 9c75214..ffe9ba1 100644
> --- a/arch/s390/mm/vmem.c
> +++ b/arch/s390/mm/vmem.c
> @@ -252,7 +252,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
>  				void *new_page;
>  
>  				new_page = vmemmap_alloc_block(PMD_SIZE, node,
> -							       true);
> +							       VMEMMAP_ZERO);
>  				if (!new_page)
>  					goto out;
>  				pmd_val(*pm_dir) = __pa(new_page) | sgt_prot;

If you add the hunk below then this is

Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>

diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index ffe9ba1aec8b..bf88a8b9c24d 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -272,7 +272,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 		if (pte_none(*pt_dir)) {
 			void *new_page;
 
-			new_page = vmemmap_alloc_block(PAGE_SIZE, node, true);
+			new_page = vmemmap_alloc_block(PAGE_SIZE, node, VMEMMAP_ZERO);
 			if (!new_page)
 				goto out;
 			pte_val(*pt_dir) = __pa(new_page) | pgt_prot;


* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-05 17:03 [v3 0/9] parallelized "struct page" zeroing Pavel Tatashin
                   ` (8 preceding siblings ...)
  2017-05-05 17:03 ` [v3 9/9] s390: " Pavel Tatashin
@ 2017-05-09 18:12 ` Michal Hocko
  2017-05-09 18:54   ` Pasha Tatashin
  2017-05-15 18:12   ` Pasha Tatashin
  9 siblings, 2 replies; 46+ messages in thread
From: Michal Hocko @ 2017-05-09 18:12 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

On Fri 05-05-17 13:03:07, Pavel Tatashin wrote:
> Changelog:
> 	v2 - v3
> 	- Addressed David's comments about one change per patch:
> 		* Split changes to platforms into 4 patches
> 		* Made "do not zero vmemmap_buf" a separate patch
> 	v1 - v2
> 	- Per request, added s390 to deferred "struct page" zeroing
> 	- Collected performance data on x86 which proves the importance of
> 	  keeping memset() as a prefetch (see below).
> 
> When the deferred struct page initialization feature is enabled, we get a
> performance gain from initializing vmemmap in parallel after other CPUs
> are started. However, we still zero the memory for vmemmap using one boot
> CPU. This patch set fixes that memset-zeroing limitation by deferring it
> as well.

I like the idea of postponing the zeroing from the allocation to the
init time. To be honest the improvement looks much larger than I would
expect (Btw. this should be a part of the changelog rather than an
outside link).

The implementation just looks larger than what I would expect. E.g. do
we really need to add a zero argument to a large part of the memblock
API? Wouldn't it be easier to simply export memblock_virt_alloc_internal
(or its tiny wrapper memblock_virt_alloc_core) and move the zeroing
outside to its 2 callers? A completely untested, scratch version is at
the end of the email.

Also it seems that this is not 100% correct either as it only cares
about VMEMMAP while DEFERRED_STRUCT_PAGE_INIT might be enabled also for
SPARSEMEM. This would suggest that we would zero out pages twice,
right?

A similar concern would go to the memory hotplug patch which will
fall back to the slab/page allocator IIRC. On the other hand
__init_single_page is shared with the hotplug code so again we would
initialize 2 times.

So I suspect more changes are needed. I will have a closer look tomorrow.

>  arch/powerpc/mm/init_64.c |    4 +-
>  arch/s390/mm/vmem.c       |    5 ++-
>  arch/sparc/mm/init_64.c   |   26 +++++++----------------
>  arch/x86/mm/init_64.c     |    3 +-
>  include/linux/bootmem.h   |    3 ++
>  include/linux/mm.h        |   15 +++++++++++--
>  mm/memblock.c             |   46 ++++++++++++++++++++++++++++++++++++------
>  mm/page_alloc.c           |    3 ++
>  mm/sparse-vmemmap.c       |   48 +++++++++++++++++++++++++++++---------------
>  9 files changed, 103 insertions(+), 50 deletions(-)


The bootmem API change mentioned above.

 include/linux/bootmem.h |  3 +++
 mm/memblock.c           | 41 ++++++++++++++++++++++++++---------------
 mm/sparse-vmemmap.c     |  2 +-
 3 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index 962164d36506..c9a08463d9a8 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE		(~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void * memblock_virt_alloc_core(phys_addr_t size, phys_addr_t align,
+				phys_addr_t min_addr, phys_addr_t max_addr,
+				int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
 		phys_addr_t align, phys_addr_t min_addr,
 		phys_addr_t max_addr, int nid);
diff --git a/mm/memblock.c b/mm/memblock.c
index b049c9b2dba8..eab7da94f873 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1271,8 +1271,7 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
  *
  * The memory block is aligned on SMP_CACHE_BYTES if @align == 0.
  *
- * The phys address of allocated boot memory block is converted to virtual and
- * allocated memory is reset to 0.
+ * The allocated memory has to be zeroed out explicitly by the caller.
  *
  * In addition, function sets the min_count to 0 using kmemleak_alloc for
  * allocated boot memory block, so that it is never reported as leaks.
@@ -1280,15 +1279,18 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
  * RETURNS:
  * Virtual address of allocated memory block on success, NULL on failure.
  */
-static void * __init memblock_virt_alloc_internal(
+static inline void * __init memblock_virt_alloc_internal(
 				phys_addr_t size, phys_addr_t align,
 				phys_addr_t min_addr, phys_addr_t max_addr,
-				int nid)
+				int nid, void *caller)
 {
 	phys_addr_t alloc;
 	void *ptr;
 	ulong flags = choose_memblock_flags();
 
+	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
+		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
+		     (u64)max_addr, caller);
 	if (WARN_ONCE(nid == MAX_NUMNODES, "Usage of MAX_NUMNODES is deprecated. Use NUMA_NO_NODE instead\n"))
 		nid = NUMA_NO_NODE;
 
@@ -1334,7 +1336,6 @@ static void * __init memblock_virt_alloc_internal(
 	return NULL;
 done:
 	ptr = phys_to_virt(alloc);
-	memset(ptr, 0, size);
 
 	/*
 	 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1347,6 +1348,14 @@ static void * __init memblock_virt_alloc_internal(
 	return ptr;
 }
 
+void * __init memblock_virt_alloc_core(phys_addr_t size, phys_addr_t align,
+				phys_addr_t min_addr, phys_addr_t max_addr,
+				int nid)
+{
+	return memblock_virt_alloc_internal(size, align, min_addr, max_addr, nid,
+			(void *)_RET_IP_);
+}
+
 /**
  * memblock_virt_alloc_try_nid_nopanic - allocate boot memory block
  * @size: size of memory block to be allocated in bytes
@@ -1369,11 +1378,14 @@ void * __init memblock_virt_alloc_try_nid_nopanic(
 				phys_addr_t min_addr, phys_addr_t max_addr,
 				int nid)
 {
-	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
-		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
-		     (u64)max_addr, (void *)_RET_IP_);
-	return memblock_virt_alloc_internal(size, align, min_addr,
-					     max_addr, nid);
+	void *ptr;
+
+	ptr = memblock_virt_alloc_internal(size, align, min_addr,
+					     max_addr, nid, (void *)_RET_IP_);
+	if (ptr)
+		memset(ptr, 0, size);
+
+	return ptr;
 }
 
 /**
@@ -1401,13 +1413,12 @@ void * __init memblock_virt_alloc_try_nid(
 {
 	void *ptr;
 
-	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
-		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
-		     (u64)max_addr, (void *)_RET_IP_);
 	ptr = memblock_virt_alloc_internal(size, align,
-					   min_addr, max_addr, nid);
-	if (ptr)
+					   min_addr, max_addr, nid, (void *)_RET_IP_);
+	if (ptr) {
+		memset(ptr, 0, size);
 		return ptr;
+	}
 
 	panic("%s: Failed to allocate %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx\n",
 	      __func__, (u64)size, (u64)align, nid, (u64)min_addr,
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index a56c3989f773..4e060f0f9fe5 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
 				unsigned long align,
 				unsigned long goal)
 {
-	return memblock_virt_alloc_try_nid(size, align, goal,
+	return memblock_virt_alloc_core(size, align, goal,
 					    BOOTMEM_ALLOC_ACCESSIBLE, node);
 }
 
-- 
Michal Hocko
SUSE Labs


* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-09 18:12 ` [v3 0/9] parallelized "struct page" zeroing Michal Hocko
@ 2017-05-09 18:54   ` Pasha Tatashin
  2017-05-10  7:24     ` Michal Hocko
  2017-05-15 18:12   ` Pasha Tatashin
  1 sibling, 1 reply; 46+ messages in thread
From: Pasha Tatashin @ 2017-05-09 18:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

Hi Michal,

> I like the idea of postponing the zeroing from the allocation to the
> init time. To be honest the improvement looks much larger than I would
> expect (Btw. this should be a part of the changelog rather than an
> outside link).

The improvements are larger because this time was never measured, as 
Linux does not have early boot time stamps. I added them for x86 and 
SPARC to measure the performance. I am pushing those changes through 
separate patchsets.

> 
> The implementation just looks larger than what I would expect. E.g. do
> we really need to add a zero argument to a large part of the memblock
> API? Wouldn't it be easier to simply export memblock_virt_alloc_internal
> (or its tiny wrapper memblock_virt_alloc_core) and move the zeroing
> outside to its 2 callers? A completely untested, scratch version is at
> the end of the email.

I am OK with this change. But I do not really see a difference between:

memblock_virt_alloc_raw()
and
memblock_virt_alloc_core()

In both cases we use memblock_virt_alloc_internal(), but the only 
difference is that in my case we tell memblock_virt_alloc_internal() to 
zero the pages if needed, and in your case the other two callers are 
zeroing it. I like moving memblock_dbg() inside 
memblock_virt_alloc_internal()

> 
> Also it seems that this is not 100% correct either as it only cares
> about VMEMMAP while DEFERRED_STRUCT_PAGE_INIT might be enabled also for
> SPARSEMEM. This would suggest that we would zero out pages twice,
> right?

Thank you, I will check this combination before sending out the next patch.

> 
> A similar concern would go to the memory hotplug patch which will
> fall back to the slab/page allocator IIRC. On the other hand
> __init_single_page is shared with the hotplug code so again we would
> initialize 2 times.

Correct: when memory is hotplugged, to gain the benefit of this fix, and 
also not to regress by double-zeroing "struct page"s, we should not zero 
it out. However, I do not really have the means to test it.

> 
> So I suspect more changes are needed. I will have a closer look tomorrow.

Thank you for reviewing this work. I will wait for your comments before 
sending out updated patches.

Pasha


* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-09 18:54   ` Pasha Tatashin
@ 2017-05-10  7:24     ` Michal Hocko
  2017-05-10 13:42       ` Pasha Tatashin
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2017-05-10  7:24 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

On Tue 09-05-17 14:54:50, Pasha Tatashin wrote:
[...]
> >The implementation just looks larger than what I would expect. E.g. do
> >we really need to add a zero argument to a large part of the memblock
> >API? Wouldn't it be easier to simply export memblock_virt_alloc_internal
> >(or its tiny wrapper memblock_virt_alloc_core) and move the zeroing
> >outside to its 2 callers? A completely untested, scratch version is at
> >the end of the email.
> 
> I am OK with this change. But I do not really see a difference between:
> 
> memblock_virt_alloc_raw()
> and
> memblock_virt_alloc_core()
> 
> In both cases we use memblock_virt_alloc_internal(), but the only difference
> is that in my case we tell memblock_virt_alloc_internal() to zero the pages
> if needed, and in your case the other two callers are zeroing it. I like
> moving memblock_dbg() inside memblock_virt_alloc_internal()

Well, I didn't object to this particular part. I was mostly concerned
about
http://lkml.kernel.org/r/1494003796-748672-4-git-send-email-pasha.tatashin@oracle.com
and the "zero" argument for other functions. I guess we can do without
that. I _think_ that we should simply _always_ initialize the page at the
__init_single_page time rather than during the allocation. That would
require dropping __GFP_ZERO for non-memblock allocations. Or do you
think we could regress for single threaded initialization?
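
A sketch of what dropping __GFP_ZERO could look like in
vmemmap_alloc_block()'s slab path (the exact gfp mask below is an
assumption, not a tested change):

	if (slab_is_available()) {
		struct page *page;

		/* no __GFP_ZERO: __init_single_page() zeroes the page later */
		page = alloc_pages_node(node, GFP_KERNEL | __GFP_REPEAT,
					get_order(size));
		if (page)
			return page_address(page);
		return NULL;
	}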

> >Also it seems that this is not 100% correct either as it only cares
> >about VMEMMAP while DEFERRED_STRUCT_PAGE_INIT might be enabled also for
> >SPARSEMEM. This would suggest that we would zero out pages twice,
> >right?
> 
> Thank you, I will check this combination before sending out the next patch.
> 
> >
> >A similar concern would go to the memory hotplug patch which will
> >fall back to the slab/page allocator IIRC. On the other hand
> >__init_single_page is shared with the hotplug code so again we would
> >initialize 2 times.
> 
> Correct: when memory is hotplugged, to gain the benefit of this fix, and
> also not to regress by double-zeroing "struct page"s, we should not zero
> it out. However, I do not really have the means to test it.

It should be pretty easy to test with kvm, but I can help with testing
on the real HW as well.

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-10  7:24     ` Michal Hocko
@ 2017-05-10 13:42       ` Pasha Tatashin
  2017-05-10 14:57         ` Michal Hocko
  0 siblings, 1 reply; 46+ messages in thread
From: Pasha Tatashin @ 2017-05-10 13:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

> 
> Well, I didn't object to this particular part. I was mostly concerned
> about
> http://lkml.kernel.org/r/1494003796-748672-4-git-send-email-pasha.tatashin@oracle.com
> and the "zero" argument for other functions. I guess we can do without
> that. I _think_ that we should simply _always_ initialize the page at the
> __init_single_page time rather than during the allocation. That would
> require dropping __GFP_ZERO for non-memblock allocations. Or do you
> think we could regress for single threaded initialization?
> 

Hi Michal,

That's exactly right: I am worried that we will regress, when there is no 
parallelized initialization of "struct page"s, if we unconditionally do 
memset() in __init_single_page(). The overhead of calling memset() on 
smaller chunks (64 bytes) may cause the regression, which is why I opted 
to zero this metadata only in the parallelized case. This way, we are 
guaranteed to see great improvements from this change without having 
regressions on platforms and builds that do not support parallelized 
initialization of "struct page"s.

However, on some chips, such as the latest SPARCs, it is beneficial to 
have memset() right inside __init_single_page() even in the 
single-threaded case, because it can act as a prefetch on chips with 
optimized block initializing store instructions.

Pasha


* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-10 13:42       ` Pasha Tatashin
@ 2017-05-10 14:57         ` Michal Hocko
  2017-05-10 15:01           ` Pasha Tatashin
  2017-05-10 15:19           ` David Miller
  0 siblings, 2 replies; 46+ messages in thread
From: Michal Hocko @ 2017-05-10 14:57 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

On Wed 10-05-17 09:42:22, Pasha Tatashin wrote:
> >
> >Well, I didn't object to this particular part. I was mostly concerned
> >about
> >http://lkml.kernel.org/r/1494003796-748672-4-git-send-email-pasha.tatashin@oracle.com
> >and the "zero" argument for other functions. I guess we can do without
> >that. I _think_ that we should simply _always_ initialize the page at the
> >__init_single_page time rather than during the allocation. That would
> >require dropping __GFP_ZERO for non-memblock allocations. Or do you
> >think we could regress for single threaded initialization?
> >
> 
> Hi Michal,
> 
> That's exactly right: I am worried that we will regress, when there is no
> parallelized initialization of "struct page"s, if we unconditionally do
> memset() in __init_single_page(). The overhead of calling memset() on
> smaller chunks (64 bytes) may cause the regression, which is why I opted
> to zero this metadata only in the parallelized case. This way, we are
> guaranteed to see great improvements from this change without having
> regressions on platforms and builds that do not support parallelized
> initialization of "struct page"s.

Have you measured that? I do not think it would be super hard to
measure. I would be quite surprised if this added much if anything at
all as the whole struct page should be in the cache line already. We do
set reference count and other struct members. Almost nobody should be
looking at our page at this time and stealing the cache line. On the
other hand a large memcpy will basically wipe everything away from the
cpu cache. Or am I missing something?

-- 
Michal Hocko
SUSE Labs


* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-10 14:57         ` Michal Hocko
@ 2017-05-10 15:01           ` Pasha Tatashin
  2017-05-10 15:20             ` David Miller
  2017-05-11 20:47             ` Pasha Tatashin
  2017-05-10 15:19           ` David Miller
  1 sibling, 2 replies; 46+ messages in thread
From: Pasha Tatashin @ 2017-05-10 15:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem



On 05/10/2017 10:57 AM, Michal Hocko wrote:
> On Wed 10-05-17 09:42:22, Pasha Tatashin wrote:
>>>
>>> Well, I didn't object to this particular part. I was mostly concerned
>>> about
>>> http://lkml.kernel.org/r/1494003796-748672-4-git-send-email-pasha.tatashin@oracle.com
>>> and the "zero" argument for other functions. I guess we can do without
>>> that. I _think_ that we should simply _always_ initialize the page at the
>>> __init_single_page time rather than during the allocation. That would
>>> require dropping __GFP_ZERO for non-memblock allocations. Or do you
>>> think we could regress for single threaded initialization?
>>>
>>
>> Hi Michal,
>>
>> That's exactly right: I am worried that we will regress, when there is no
>> parallelized initialization of "struct page"s, if we unconditionally do
>> memset() in __init_single_page(). The overhead of calling memset() on
>> smaller chunks (64 bytes) may cause the regression, which is why I opted
>> to zero this metadata only in the parallelized case. This way, we are
>> guaranteed to see great improvements from this change without having
>> regressions on platforms and builds that do not support parallelized
>> initialization of "struct page"s.
> 
> Have you measured that? I do not think it would be super hard to
> measure. I would be quite surprised if this added much if anything at
> all as the whole struct page should be in the cache line already. We do
> set reference count and other struct members. Almost nobody should be
> looking at our page at this time and stealing the cache line. On the
> other hand a large memcpy will basically wipe everything away from the
> cpu cache. Or am I missing something?
> 

Perhaps you are right, and I will measure on x86. But I suspect the hit 
can become unacceptable on some platforms: there is an overhead of 
calling a function, even if it is leaf-optimized, and there is overhead 
in memset() to check the alignment of size and address, the type of 
setting (zeroing vs. non-zeroing), etc., that adds up quickly.

Pasha


* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-10 14:57         ` Michal Hocko
  2017-05-10 15:01           ` Pasha Tatashin
@ 2017-05-10 15:19           ` David Miller
  2017-05-10 17:17             ` Matthew Wilcox
  2017-05-11  8:05             ` Michal Hocko
  1 sibling, 2 replies; 46+ messages in thread
From: David Miller @ 2017-05-10 15:19 UTC (permalink / raw)
  To: mhocko
  Cc: pasha.tatashin, linux-kernel, sparclinux, linux-mm, linuxppc-dev,
	linux-s390, borntraeger, heiko.carstens

From: Michal Hocko <mhocko@kernel.org>
Date: Wed, 10 May 2017 16:57:26 +0200

> Have you measured that? I do not think it would be super hard to
> measure. I would be quite surprised if this added much if anything at
> all as the whole struct page should be in the cache line already. We do
> set reference count and other struct members. Almost nobody should be
> looking at our page at this time and stealing the cache line. On the
> other hand a large memcpy will basically wipe everything away from the
> cpu cache. Or am I missing something?

I guess it might be clearer if you understand what the block
initializing stores do on sparc64.  There are no memory accesses at
all.

The cpu just zeros out the cache line, that's it.

No L3 cache line is allocated.  So this "wipe everything" behavior
will not happen in the L3.


* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-10 15:01           ` Pasha Tatashin
@ 2017-05-10 15:20             ` David Miller
  2017-05-11 20:47             ` Pasha Tatashin
  1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2017-05-10 15:20 UTC (permalink / raw)
  To: pasha.tatashin
  Cc: mhocko, linux-kernel, sparclinux, linux-mm, linuxppc-dev,
	linux-s390, borntraeger, heiko.carstens

From: Pasha Tatashin <pasha.tatashin@oracle.com>
Date: Wed, 10 May 2017 11:01:40 -0400

> Perhaps you are right, and I will measure on x86. But I suspect the hit
> can become unacceptable on some platforms: there is an overhead of
> calling a function, even if it is leaf-optimized, and there is overhead
> in memset() to check the alignment of size and address, the type of
> setting (zeroing vs. non-zeroing), etc., that adds up quickly.

Another source of overhead on the sparc64 side is that we must do
memory barriers around the block initializing stores.  So batching
calls to memset() amortizes that as well.


* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-10 15:19           ` David Miller
@ 2017-05-10 17:17             ` Matthew Wilcox
  2017-05-10 18:00               ` David Miller
  2017-05-11  8:05             ` Michal Hocko
  1 sibling, 1 reply; 46+ messages in thread
From: Matthew Wilcox @ 2017-05-10 17:17 UTC (permalink / raw)
  To: David Miller
  Cc: mhocko, pasha.tatashin, linux-kernel, sparclinux, linux-mm,
	linuxppc-dev, linux-s390, borntraeger, heiko.carstens

On Wed, May 10, 2017 at 11:19:43AM -0400, David Miller wrote:
> From: Michal Hocko <mhocko@kernel.org>
> Date: Wed, 10 May 2017 16:57:26 +0200
> 
> > Have you measured that? I do not think it would be super hard to
> > measure. I would be quite surprised if this added much if anything at
> > all as the whole struct page should be in the cache line already. We do
> > set reference count and other struct members. Almost nobody should be
> > looking at our page at this time and stealing the cache line. On the
> > other hand a large memcpy will basically wipe everything away from the
> > cpu cache. Or am I missing something?
> 
> I guess it might be clearer if you understand what the block
> initializing stores do on sparc64.  There are no memory accesses at
> all.
> 
> The cpu just zeros out the cache line, that's it.
> 
> No L3 cache line is allocated.  So this "wipe everything" behavior
> will not happen in the L3.

There's either something wrong with your explanation or my reading
skills :-)

"There are no memory accesses"
"No L3 cache line is allocated"

You can have one or the other ... either the CPU sends a cacheline-sized
write of zeroes to memory without allocating an L3 cache line (maybe
using the store buffer?), or the CPU allocates an L3 cache line and sets
its contents to zeroes, probably putting it in the last way of the set
so it's the first thing to be evicted if not touched.

Or there's some magic in the memory bus protocol where the CPU gets to
tell the DRAM "hey, clear these cache lines".  Although that's also a
memory access of sorts ...

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-10 17:17             ` Matthew Wilcox
@ 2017-05-10 18:00               ` David Miller
  2017-05-10 21:11                 ` Matthew Wilcox
  0 siblings, 1 reply; 46+ messages in thread
From: David Miller @ 2017-05-10 18:00 UTC (permalink / raw)
  To: willy
  Cc: mhocko, pasha.tatashin, linux-kernel, sparclinux, linux-mm,
	linuxppc-dev, linux-s390, borntraeger, heiko.carstens

From: Matthew Wilcox <willy@infradead.org>
Date: Wed, 10 May 2017 10:17:03 -0700

> On Wed, May 10, 2017 at 11:19:43AM -0400, David Miller wrote:
>> From: Michal Hocko <mhocko@kernel.org>
>> Date: Wed, 10 May 2017 16:57:26 +0200
>> 
>> > Have you measured that? I do not think it would be super hard to
>> > measure. I would be quite surprised if this added much if anything at
>> > all as the whole struct page should be in the cache line already. We do
>> > set reference count and other struct members. Almost nobody should be
>> > looking at our page at this time and stealing the cache line. On the
>> > other hand a large memcpy will basically wipe everything away from the
>> > cpu cache. Or am I missing something?
>> 
>> I guess it might be clearer if you understand what the block
>> initializing stores do on sparc64.  There are no memory accesses at
>> all.
>> 
>> The cpu just zeros out the cache line, that's it.
>> 
>> No L3 cache line is allocated.  So this "wipe everything" behavior
>> will not happen in the L3.
> 
> There's either something wrong with your explanation or my reading
> skills :-)
> 
> "There are no memory accesses"
> "No L3 cache line is allocated"
> 
> You can have one or the other ... either the CPU sends a cacheline-sized
> write of zeroes to memory without allocating an L3 cache line (maybe
> using the store buffer?), or the CPU allocates an L3 cache line and sets
> its contents to zeroes, probably putting it in the last way of the set
> so it's the first thing to be evicted if not touched.

There is no conflict in what I said.

Only an L2 cache line is allocated and cleared.  L3 is left alone.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-10 18:00               ` David Miller
@ 2017-05-10 21:11                 ` Matthew Wilcox
  0 siblings, 0 replies; 46+ messages in thread
From: Matthew Wilcox @ 2017-05-10 21:11 UTC (permalink / raw)
  To: David Miller
  Cc: mhocko, pasha.tatashin, linux-kernel, sparclinux, linux-mm,
	linuxppc-dev, linux-s390, borntraeger, heiko.carstens

On Wed, May 10, 2017 at 02:00:26PM -0400, David Miller wrote:
> From: Matthew Wilcox <willy@infradead.org>
> Date: Wed, 10 May 2017 10:17:03 -0700
> > On Wed, May 10, 2017 at 11:19:43AM -0400, David Miller wrote:
> >> I guess it might be clearer if you understand what the block
> >> initializing stores do on sparc64.  There are no memory accesses at
> >> all.
> >> 
> >> The cpu just zeros out the cache line, that's it.
> >> 
> >> No L3 cache line is allocated.  So this "wipe everything" behavior
> >> will not happen in the L3.
> > 
> > There's either something wrong with your explanation or my reading
> > skills :-)
> > 
> > "There are no memory accesses"
> > "No L3 cache line is allocated"
> > 
> > You can have one or the other ... either the CPU sends a cacheline-sized
> > write of zeroes to memory without allocating an L3 cache line (maybe
> > using the store buffer?), or the CPU allocates an L3 cache line and sets
> > its contents to zeroes, probably putting it in the last way of the set
> > so it's the first thing to be evicted if not touched.
> 
> There is no conflict in what I said.
> 
> Only an L2 cache line is allocated and cleared.  L3 is left alone.

I thought SPARC had inclusive caches.  So allocating an L2 cacheline
would necessitate allocating an L3 cacheline.  Or is this an exception
to the normal order of things?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-10 15:19           ` David Miller
  2017-05-10 17:17             ` Matthew Wilcox
@ 2017-05-11  8:05             ` Michal Hocko
  2017-05-11 14:35               ` David Miller
  1 sibling, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2017-05-11  8:05 UTC (permalink / raw)
  To: David Miller
  Cc: pasha.tatashin, linux-kernel, sparclinux, linux-mm, linuxppc-dev,
	linux-s390, borntraeger, heiko.carstens

On Wed 10-05-17 11:19:43, David S. Miller wrote:
> From: Michal Hocko <mhocko@kernel.org>
> Date: Wed, 10 May 2017 16:57:26 +0200
> 
> > Have you measured that? I do not think it would be super hard to
> > measure. I would be quite surprised if this added much if anything at
> > all as the whole struct page should be in the cache line already. We do
> > set reference count and other struct members. Almost nobody should be
> > looking at our page at this time and stealing the cache line. On the
> > other hand a large memcpy will basically wipe everything away from the
> > cpu cache. Or am I missing something?
> 
> I guess it might be clearer if you understand what the block
> initializing stores do on sparc64.  There are no memory accesses at
> all.
> 
> The cpu just zeros out the cache line, that's it.
> 
> No L3 cache line is allocated.  So this "wipe everything" behavior
> will not happen in the L3.

OK, good to know. My understanding of sparc64 is close to zero.

Anyway, do you agree that doing the struct page initialization along
with other writes to it shouldn't add a measurable overhead comparing
to pre-zeroing of larger block of struct pages?  We already have an
exclusive cache line and doing one 64B write along with few other stores
should be basically the same.
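
To make that concrete, a minimal sketch of the idea (the helper calls 
are abbreviated from the __init_single_page() that appears later in 
this thread; the memset placement is the part under discussion):

static void __meminit __init_single_page(struct page *page,
		unsigned long pfn, unsigned long zone, int nid)
{
	/* zero the 64B object while its cache line is exclusive */
	memset(page, 0, sizeof(struct page));
	set_page_links(page, zone, nid, pfn);
	init_page_count(page);
	page_mapcount_reset(page);
	INIT_LIST_HEAD(&page->lru);
}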
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-11  8:05             ` Michal Hocko
@ 2017-05-11 14:35               ` David Miller
  0 siblings, 0 replies; 46+ messages in thread
From: David Miller @ 2017-05-11 14:35 UTC (permalink / raw)
  To: mhocko
  Cc: pasha.tatashin, linux-kernel, sparclinux, linux-mm, linuxppc-dev,
	linux-s390, borntraeger, heiko.carstens

From: Michal Hocko <mhocko@kernel.org>
Date: Thu, 11 May 2017 10:05:38 +0200

> Anyway, do you agree that doing the struct page initialization along
> with other writes to it shouldn't add a measurable overhead comparing
> to pre-zeroing of larger block of struct pages?  We already have an
> exclusive cache line and doing one 64B write along with few other stores
> should be basically the same.

Yes, it should be reasonably cheap.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-10 15:01           ` Pasha Tatashin
  2017-05-10 15:20             ` David Miller
@ 2017-05-11 20:47             ` Pasha Tatashin
  2017-05-11 20:59               ` Pasha Tatashin
  2017-05-12 16:56               ` David Miller
  1 sibling, 2 replies; 46+ messages in thread
From: Pasha Tatashin @ 2017-05-11 20:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

>>
>> Have you measured that? I do not think it would be super hard to
>> measure. I would be quite surprised if this added much if anything at
>> all as the whole struct page should be in the cache line already. We do
>> set reference count and other struct members. Almost nobody should be
>> looking at our page at this time and stealing the cache line. On the
>> other hand a large memcpy will basically wipe everything away from the
>> cpu cache. Or am I missing something?
>>

Here is data for single thread (deferred struct page init is disabled):

Intel CPU E7-8895 v3 @ 2.60GHz  1T memory
-----------------------------------------
time to memset "struct pages in memblock: 11.28s
time to init "struct pag"es:               4.90s

Moving memset into __init_single_page()
time to init and memset "struct page"es:   8.39s

SPARC M6 @ 3600 MHz  1T memory
-----------------------------------------
time to memset "struct pages in memblock:  1.60s
time to init "struct pag"es:               3.37s

Moving memset into __init_single_page()
time to init and memset "struct page"es:  12.99s


So, moving memset() into __init_single_page() benefits Intel. I am 
actually surprised that memset() is so slow on Intel when it is called 
from memblock. But it hurts SPARC; I guess the membars at the end of 
memset() kill the performance.

Also, when looking at these values, remember that Intel has twice as 
many "struct page"s for the same amount of memory.

Pasha

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-11 20:47             ` Pasha Tatashin
@ 2017-05-11 20:59               ` Pasha Tatashin
  2017-05-12 16:57                 ` David Miller
  2017-05-12 16:56               ` David Miller
  1 sibling, 1 reply; 46+ messages in thread
From: Pasha Tatashin @ 2017-05-11 20:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

We should either keep memset() only for deferred struct pages, as I 
have in my patches.

Another option is to add a new function struct_page_clear() which would 
default to memset() and to something else on platforms that decide to 
optimize it.

On SPARC it would call STBIs, and we would do one membar call after all 
"struct pages" are initialized.

I think what I sent out already is a cleaner and better solution, 
because I am not sure what kind of performance we would see on other 
chips.

On 05/11/2017 04:47 PM, Pasha Tatashin wrote:
>>>
>>> Have you measured that? I do not think it would be super hard to
>>> measure. I would be quite surprised if this added much if anything at
>>> all as the whole struct page should be in the cache line already. We do
>>> set reference count and other struct members. Almost nobody should be
>>> looking at our page at this time and stealing the cache line. On the
>>> other hand a large memcpy will basically wipe everything away from the
>>> cpu cache. Or am I missing something?
>>>
> 
> Here is data for single thread (deferred struct page init is disabled):
> 
> Intel CPU E7-8895 v3 @ 2.60GHz  1T memory
> -----------------------------------------
> time to memset "struct page"s in memblock: 11.28s
> time to init "struct page"s:                4.90s
> 
> Moving memset into __init_single_page()
> time to init and memset "struct page"s:     8.39s
> 
> SPARC M6 @ 3600 MHz  1T memory
> -----------------------------------------
> time to memset "struct page"s in memblock:  1.60s
> time to init "struct page"s:                3.37s
> 
> Moving memset into __init_single_page()
> time to init and memset "struct page"s:    12.99s
> 
> 
> So, moving memset() into __init_single_page() benefits Intel. I am 
> actually surprised that memset() is so slow on Intel when it is called 
> from memblock. But it hurts SPARC; I guess the membars at the end of 
> memset() kill the performance.
> 
> Also, when looking at these values, remember that Intel has twice as 
> many "struct page"s for the same amount of memory.
> 
> Pasha

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-11 20:47             ` Pasha Tatashin
  2017-05-11 20:59               ` Pasha Tatashin
@ 2017-05-12 16:56               ` David Miller
  1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2017-05-12 16:56 UTC (permalink / raw)
  To: pasha.tatashin
  Cc: mhocko, linux-kernel, sparclinux, linux-mm, linuxppc-dev,
	linux-s390, borntraeger, heiko.carstens

From: Pasha Tatashin <pasha.tatashin@oracle.com>
Date: Thu, 11 May 2017 16:47:05 -0400

> So, moving memset() into __init_single_page() benefits Intel. I am
> actually surprised why memset() is so slow on intel when it is called
> from memblock. But, hurts SPARC, I guess these membars at the end of
> memset() kills the performance.

Perhaps an x86 expert can chime in, but it might be the case that past
a certain size, the microcode for the enhanced stosb uses non-temporal
stores or something like that.

As for sparc64, yes we can get really killed by the transactional cost
of memset because of the membars.

But I wonder, for a single page struct, if we even use the special
stores and thus eat the membar cost.  struct page is only 64 bytes,
and the cutoff in the Niagara4 bzero implementation is "64 + (64 - 8)"
so indeed the initializing stores will not even be used.

So sparc64 will only use the initializing stores and do the membars if
at least two page structs are cleared at a time.
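
To spell out the arithmetic, a sketch of the dispatch described above 
(the helper names are made up; the 120-byte cutoff is the one quoted 
from the Niagara4 bzero implementation):

#include <string.h>

/* Hypothetical stand-ins for the two store strategies. */
static void bzero_plain(void *d, unsigned long n)
{
	memset(d, 0, n);	/* plain stx/st stores, no membar */
}

static void bzero_block_init(void *d, unsigned long n)
{
	memset(d, 0, n);	/* STBI + trailing membar on real sparc64 */
}

static void bzero_sketch(void *dst, unsigned long len)
{
	if (len >= 64 + (64 - 8))	/* >= 120 bytes */
		bzero_block_init(dst, len);
	else
		bzero_plain(dst, len);
}

A single 64-byte struct page falls under the cutoff, while two adjacent 
ones (128 bytes) are enough to take the block-initializing path.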

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-11 20:59               ` Pasha Tatashin
@ 2017-05-12 16:57                 ` David Miller
  2017-05-12 17:24                   ` Pasha Tatashin
  0 siblings, 1 reply; 46+ messages in thread
From: David Miller @ 2017-05-12 16:57 UTC (permalink / raw)
  To: pasha.tatashin
  Cc: mhocko, linux-kernel, sparclinux, linux-mm, linuxppc-dev,
	linux-s390, borntraeger, heiko.carstens

From: Pasha Tatashin <pasha.tatashin@oracle.com>
Date: Thu, 11 May 2017 16:59:33 -0400

> We should either keep memset() only for deferred struct pages as what
> I have in my patches.
> 
> Another option is to add a new function struct_page_clear() which
> would default to memset() and to something else on platforms that
> decide to optimize it.
> 
> On SPARC it would call STBIs, and we would do one membar call after
> all "struct pages" are initialized.

No membars will be performed for a single individual page struct 
clear; the cutoff to use the STBI is larger than that.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-12 16:57                 ` David Miller
@ 2017-05-12 17:24                   ` Pasha Tatashin
  2017-05-12 17:37                     ` David Miller
  0 siblings, 1 reply; 46+ messages in thread
From: Pasha Tatashin @ 2017-05-12 17:24 UTC (permalink / raw)
  To: David Miller
  Cc: mhocko, linux-kernel, sparclinux, linux-mm, linuxppc-dev,
	linux-s390, borntraeger, heiko.carstens



On 05/12/2017 12:57 PM, David Miller wrote:
> From: Pasha Tatashin <pasha.tatashin@oracle.com>
> Date: Thu, 11 May 2017 16:59:33 -0400
> 
>> We should either keep memset() only for deferred struct pages as what
>> I have in my patches.
>>
>> Another option is to add a new function struct_page_clear() which
>> would default to memset() and to something else on platforms that
>> decide to optimize it.
>>
>> On SPARC it would call STBIs, and we would do one membar call after
>> all "struct pages" are initialized.
> 
> No membars will be performed for a single individual page struct clear;
> the cutoff to use the STBI is larger than that.
> 

Right now it is larger, but what I suggested is to add a new optimized 
routine just for this case, which would do STBI for 64 bytes but without 
a membar (doing the membar at the end of memmap_init_zone() and 
deferred_init_memmap()).

#define struct_page_clear(page)                                 \
         __asm__ __volatile__(                                   \
         "stxa   %%g0, [%0]%2\n"                                 \
         "stxa   %%xg0, [%0 + %1]%2\n"                           \
         : /* No output */                                       \
         : "r" (page), "r" (0x20), "i"(ASI_BLK_INIT_QUAD_LDD_P))

And insert it into __init_single_page() instead of memset()

The final result is 4.01s/T, which is even faster than the current 4.97s/T.



Pasha

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-12 17:24                   ` Pasha Tatashin
@ 2017-05-12 17:37                     ` David Miller
  2017-05-16 23:50                       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 46+ messages in thread
From: David Miller @ 2017-05-12 17:37 UTC (permalink / raw)
  To: pasha.tatashin
  Cc: mhocko, linux-kernel, sparclinux, linux-mm, linuxppc-dev,
	linux-s390, borntraeger, heiko.carstens

From: Pasha Tatashin <pasha.tatashin@oracle.com>
Date: Fri, 12 May 2017 13:24:52 -0400

> Right now it is larger, but what I suggested is to add a new optimized
> routine just for this case, which would do STBI for 64 bytes but
> without a membar (doing the membar at the end of memmap_init_zone() and
> deferred_init_memmap()).
> 
> #define struct_page_clear(page)                                 \
>         __asm__ __volatile__(                                   \
>         "stxa   %%g0, [%0]%2\n"                                 \
>         "stxa   %%xg0, [%0 + %1]%2\n"                           \
>         : /* No output */                                       \
>         : "r" (page), "r" (0x20), "i"(ASI_BLK_INIT_QUAD_LDD_P))
> 
> And insert it into __init_single_page() instead of memset()
> 
> The final result is 4.01s/T, which is even faster than the current
> 4.97s/T.

Ok, indeed, that would work.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 3/9] mm: add "zero" argument to vmemmap allocators
  2017-05-05 17:03 ` [v3 3/9] mm: add "zero" argument to vmemmap allocators Pavel Tatashin
@ 2017-05-13 19:17   ` kbuild test robot
  0 siblings, 0 replies; 46+ messages in thread
From: kbuild test robot @ 2017-05-13 19:17 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: kbuild-all, linux-kernel, sparclinux, linux-mm, linuxppc-dev,
	linux-s390, borntraeger, heiko.carstens, davem

[-- Attachment #1: Type: text/plain, Size: 1793 bytes --]

Hi Pavel,

[auto build test ERROR on mmotm/master]
[also build test ERROR on next-20170512]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Pavel-Tatashin/parallelized-struct-page-zeroing/20170507-061412
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: ia64-gensparse_defconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 6.2.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=ia64 

All error/warnings (new ones prefixed by >>):

   mm/sparse-vmemmap.c: In function '__earlyonly_bootmem_alloc':
>> mm/sparse-vmemmap.c:45:14: error: implicit declaration of function 'memblock_virt_alloc_try_nid_raw' [-Werror=implicit-function-declaration]
     void *mem = memblock_virt_alloc_try_nid_raw(size, align, goal,
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> mm/sparse-vmemmap.c:45:14: warning: initialization makes pointer from integer without a cast [-Wint-conversion]
   cc1: some warnings being treated as errors

vim +/memblock_virt_alloc_try_nid_raw +45 mm/sparse-vmemmap.c

    39	static void * __ref __earlyonly_bootmem_alloc(int node,
    40					unsigned long size,
    41					unsigned long align,
    42					unsigned long goal,
    43					bool zero)
    44	{
  > 45		void *mem = memblock_virt_alloc_try_nid_raw(size, align, goal,
    46							    BOOTMEM_ALLOC_ACCESSIBLE,
    47							    node);
    48		if (!mem) {

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 17766 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-09 18:12 ` [v3 0/9] parallelized "struct page" zeroing Michal Hocko
  2017-05-09 18:54   ` Pasha Tatashin
@ 2017-05-15 18:12   ` Pasha Tatashin
  2017-05-15 19:38     ` Michal Hocko
  1 sibling, 1 reply; 46+ messages in thread
From: Pasha Tatashin @ 2017-05-15 18:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

Hi Michal,

After looking at your suggested memblock_virt_alloc_core() change again, 
I decided to keep what I have. I do not want to inline 
memblock_virt_alloc_internal(), because it is not a performance critical 
path, and by inlining it we will unnecessarily increase the text size on 
all platforms.

Also, because it will be very hard to make sure that no platform 
regresses by making memset() the default in memblock_virt_alloc_core() 
(as I already showed last week, at least sun4v SPARC64 will require 
special changes for this to work), I decided to make it available only 
for the "deferred struct page init" case, as is already in the patch.

I am working on testing to make sure we do not need to double zero in 
the two cases that you found: sparsemem, and mem hotplug. Please let me 
know if you have any more comments, or if I can send new patches out 
when they are ready.

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 9/9] s390: teach platforms not to zero struct pages memory
  2017-05-08 11:36   ` Heiko Carstens
@ 2017-05-15 18:24     ` Pasha Tatashin
  2017-05-15 23:17       ` Heiko Carstens
  0 siblings, 1 reply; 46+ messages in thread
From: Pasha Tatashin @ 2017-05-15 18:24 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, davem

Hi Heiko,

Thank you for looking at this patch. I am worried about making the 
proposed change because, as I understand it, in this case we allocate 
memory not for "struct page"s but for the table that holds them. So we 
would change the current behavior, where this table is allocated 
zeroed, to one where it is not.
Pasha

> 
> If you add the hunk below then this is
> 
> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
> 
> diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
> index ffe9ba1aec8b..bf88a8b9c24d 100644
> --- a/arch/s390/mm/vmem.c
> +++ b/arch/s390/mm/vmem.c
> @@ -272,7 +272,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
>   		if (pte_none(*pt_dir)) {
>   			void *new_page;
>   
> -			new_page = vmemmap_alloc_block(PAGE_SIZE, node, true);
> +			new_page = vmemmap_alloc_block(PAGE_SIZE, node, VMEMMAP_ZERO);
>   			if (!new_page)
>   				goto out;
>   			pte_val(*pt_dir) = __pa(new_page) | pgt_prot;
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-15 18:12   ` Pasha Tatashin
@ 2017-05-15 19:38     ` Michal Hocko
  2017-05-15 20:44       ` Pasha Tatashin
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2017-05-15 19:38 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

On Mon 15-05-17 14:12:10, Pasha Tatashin wrote:
> Hi Michal,
> 
> After looking at your suggested memblock_virt_alloc_core() change again, I
> decided to keep what I have. I do not want to inline
> memblock_virt_alloc_internal(), because it is not a performance critical
> path, and by inlining it we will unnecessarily increase the text size on all
> platforms.

I do not insist but I would really _prefer_ if the bool zero argument
didn't proliferate all over the memblock API.
 
> Also, because it will be very hard to make sure that no platform regresses
> by making memset() the default in memblock_virt_alloc_core() (as I already
> showed last week, at least sun4v SPARC64 will require special changes for
> this to work), I decided to make it available only for the "deferred
> struct page init" case, as is already in the patch.

I do not think this is the right approach. Your measurements just show
that sparc could have a more optimized memset for small sizes. If you
keep the same memset only for the parallel initialization then you
just hide this fact. I wouldn't worry about other architectures. All
sane architectures should simply work reasonably well when touching a
single or only few cache lines at the same time. If some arches really
suffer from small memsets then the initialization should be driven by a
specific ARCH_WANT_LARGE_PAGEBLOCK_INIT rather than making this depend
on DEFERRED_INIT. Or if you are too worried then make it opt-in and make
it depend on ARCH_WANT_PER_PAGE_INIT and make it enabled for x86 and
sparc after memset optimization.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-15 19:38     ` Michal Hocko
@ 2017-05-15 20:44       ` Pasha Tatashin
  2017-05-16  8:36         ` Michal Hocko
  0 siblings, 1 reply; 46+ messages in thread
From: Pasha Tatashin @ 2017-05-15 20:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem



On 05/15/2017 03:38 PM, Michal Hocko wrote:
> On Mon 15-05-17 14:12:10, Pasha Tatashin wrote:
>> Hi Michal,
>>
>> After looking at your suggested memblock_virt_alloc_core() change again, I
>> decided to keep what I have. I do not want to inline
>> memblock_virt_alloc_internal(), because it is not a performance critical
>> path, and by inlining it we will unnecessarily increase the text size on all
>> platforms.
> 
> I do not insist but I would really _prefer_ if the bool zero argument
> didn't proliferate all over the memblock API.

Sure, I will remove the zero boolean argument from 
memblock_virt_alloc_internal(), and do the memset() calls in the callers.

>   
>> Also, because it will be very hard to make sure that no platform regresses
>> by making memset() default in _memblock_virt_alloc_core() (as I already
>> showed last week at least sun4v SPARC64 will require special changes in
>> order for this to work), I decided to make it available only for "deferred
>> struct page init" case. As, what is already in the patch.
> 
> I do not think this is the right approach. Your measurements just show
> that sparc could have a more optimized memset for small sizes. If you
> keep the same memset only for the parallel initialization then you
> just hide this fact. I wouldn't worry about other architectures. All
> sane architectures should simply work reasonably well when touching a
> single or only few cache lines at the same time. If some arches really
> suffer from small memsets then the initialization should be driven by a
> specific ARCH_WANT_LARGE_PAGEBLOCK_INIT rather than making this depend
> on DEFERRED_INIT. Or if you are too worried then make it opt-in and make
> it depend on ARCH_WANT_PER_PAGE_INIT and make it enabled for x86 and
> sparc after memset optimization.

OK, I will think about this.

I do not really like adding new configs because they tend to clutter 
the code. This is why I wanted to rely on an already existing config 
that I know benefits all platforms that use it. Eventually, 
CONFIG_DEFERRED_STRUCT_PAGE_INIT is going to become the default 
everywhere, as there should be no drawback to using it even on small 
machines.

Pasha

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 9/9] s390: teach platforms not to zero struct pages memory
  2017-05-15 18:24     ` Pasha Tatashin
@ 2017-05-15 23:17       ` Heiko Carstens
  2017-05-16  0:33         ` Pasha Tatashin
  0 siblings, 1 reply; 46+ messages in thread
From: Heiko Carstens @ 2017-05-15 23:17 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, davem

Hello Pasha,

> Thank you for looking at this patch. I am worried about making the
> proposed change because, as I understand it, in this case we allocate
> memory not for "struct page"s but for the table that holds them. So we
> would change the current behavior, where this table is allocated
> zeroed, to one where it is not.

The page table, if needed, is allocated and populated a couple of lines
above. See the vmem_pte_alloc() call. So my request to include the hunk
below is still valid ;)

> >If you add the hunk below then this is
> >
> >Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
> >
> >diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
> >index ffe9ba1aec8b..bf88a8b9c24d 100644
> >--- a/arch/s390/mm/vmem.c
> >+++ b/arch/s390/mm/vmem.c
> >@@ -272,7 +272,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
> >  		if (pte_none(*pt_dir)) {
> >  			void *new_page;
> >-			new_page = vmemmap_alloc_block(PAGE_SIZE, node, true);
> >+			new_page = vmemmap_alloc_block(PAGE_SIZE, node, VMEMMAP_ZERO);
> >  			if (!new_page)
> >  				goto out;
> >  			pte_val(*pt_dir) = __pa(new_page) | pgt_prot;
> >
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 9/9] s390: teach platforms not to zero struct pages memory
  2017-05-15 23:17       ` Heiko Carstens
@ 2017-05-16  0:33         ` Pasha Tatashin
  0 siblings, 0 replies; 46+ messages in thread
From: Pasha Tatashin @ 2017-05-16  0:33 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, davem

Ah OK, I will include the change.

Thank you,
Pasha

On 05/15/2017 07:17 PM, Heiko Carstens wrote:
> Hello Pasha,
> 
>> Thank you for looking at this patch. I am worried about making the
>> proposed change because, as I understand it, in this case we allocate
>> memory not for "struct page"s but for the table that holds them. So we
>> would change the current behavior, where this table is allocated
>> zeroed, to one where it is not.
> 
> The page table, if needed, is allocated and populated a couple of lines
> above. See the vmem_pte_alloc() call. So my request to include the hunk
> below is still valid ;)
> 
>>> If you add the hunk below then this is
>>>
>>> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
>>>
>>> diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
>>> index ffe9ba1aec8b..bf88a8b9c24d 100644
>>> --- a/arch/s390/mm/vmem.c
>>> +++ b/arch/s390/mm/vmem.c
>>> @@ -272,7 +272,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
>>>   		if (pte_none(*pt_dir)) {
>>>   			void *new_page;
>>> -			new_page = vmemmap_alloc_block(PAGE_SIZE, node, true);
>>> +			new_page = vmemmap_alloc_block(PAGE_SIZE, node, VMEMMAP_ZERO);
>>>   			if (!new_page)
>>>   				goto out;
>>>   			pte_val(*pt_dir) = __pa(new_page) | pgt_prot;
>>>
>>
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-15 20:44       ` Pasha Tatashin
@ 2017-05-16  8:36         ` Michal Hocko
  2017-05-26 16:45           ` Pasha Tatashin
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2017-05-16  8:36 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

On Mon 15-05-17 16:44:26, Pasha Tatashin wrote:
> On 05/15/2017 03:38 PM, Michal Hocko wrote:
> >I do not think this is the right approach. Your measurements just show
> >that sparc could have a more optimized memset for small sizes. If you
> >keep the same memset only for the parallel initialization then you
> >just hide this fact. I wouldn't worry about other architectures. All
> >sane architectures should simply work reasonably well when touching a
> >single or only few cache lines at the same time. If some arches really
> >suffer from small memsets then the initialization should be driven by a
> >specific ARCH_WANT_LARGE_PAGEBLOCK_INIT rather than making this depend
> >on DEFERRED_INIT. Or if you are too worried then make it opt-in and make
> >it depend on ARCH_WANT_PER_PAGE_INIT and make it enabled for x86 and
> >sparc after memset optimization.
> 
> OK, I will think about this.
> 
> I do not really like adding new configs because they tend to clutter the
> code. This is why,

Yes I hate adding new (arch) config options as well. And I still believe
we do not need any here either...

> I wanted to rely on an already existing config that I know benefits all
> platforms that use it.

I wouldn't be so sure about this. If any other platform has similar
issues with small memsets as sparc does, then the overhead is just
papered over by parallel initialization.

> Eventually,
> CONFIG_DEFERRED_STRUCT_PAGE_INIT is going to become the default
> everywhere, as there should be no drawback to using it even on small
> machines.

Maybe and I would highly appreciate that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-12 17:37                     ` David Miller
@ 2017-05-16 23:50                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 46+ messages in thread
From: Benjamin Herrenschmidt @ 2017-05-16 23:50 UTC (permalink / raw)
  To: David Miller, pasha.tatashin
  Cc: linux-s390, borntraeger, heiko.carstens, linux-kernel, mhocko,
	linux-mm, sparclinux, linuxppc-dev

On Fri, 2017-05-12 at 13:37 -0400, David Miller wrote:
> > Right now it is larger, but what I suggested is to add a new optimized
> > routine just for this case, which would do STBI for 64 bytes but
> > without a membar (doing the membar at the end of memmap_init_zone() and
> > deferred_init_memmap()).
> > 
> > #define struct_page_clear(page)                                 \
> >          __asm__ __volatile__(                                   \
> >          "stxa   %%g0, [%0]%2\n"                                 \
> >          "stxa   %%xg0, [%0 + %1]%2\n"                           \
> >          : /* No output */                                       \
> >          : "r" (page), "r" (0x20), "i"(ASI_BLK_INIT_QUAD_LDD_P))
> > 
> > And insert it into __init_single_page() instead of memset()
> > 
> > The final result is 4.01s/T, which is even faster than the current
> > 4.97s/T.
> 
> Ok, indeed, that would work.

On ppc64, that might not. We have a dcbz instruction that clears an
entire cache line at once. That's what we use for memsets and page
clearing. However, 64 bytes is half a cache line on modern processors,
so we can't use it with that semantic and would have to fall back to
the slower stores.
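
For illustration, a sketch of the constraint (assuming a 128-byte 
cache line; the wrapper name is made up):

/* dcbz zeroes the entire 128-byte line containing the address, so a
 * 64-byte struct page cannot be cleared this way without also
 * clobbering its neighbour. */
static inline void clear_cacheline_128(void *addr)
{
	__asm__ __volatile__("dcbz 0,%0" : : "r" (addr) : "memory");
}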

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-16  8:36         ` Michal Hocko
@ 2017-05-26 16:45           ` Pasha Tatashin
  2017-05-29 11:53             ` Michal Hocko
  0 siblings, 1 reply; 46+ messages in thread
From: Pasha Tatashin @ 2017-05-26 16:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

Hi Michal,

I have considered your proposals:

1. Making memset(0) unconditional inside __init_single_page() is not 
going to work because it slows down SPARC and ppc64. On SPARC even the 
STBI optimization that I proposed earlier won't work, because after 
consulting with other engineers I was told that stores (without loads!) 
after STBI without a membar are unsafe.

2. Adding ARCH_WANT_LARGE_PAGEBLOCK_INIT is not going to solve the 
problem, because while an arch might want a large memset(), it still 
wants to get the benefit of parallelized struct page initialization.

3. Another approach that I have considered is moving memset() above 
__init_single_page() and doing it in larger chunks. However, this 
solution is also not going to work, because inside the loops there are 
cases where "struct page"s are skipped, so every single page has to be 
checked: the early_pfn_valid(pfn), early_pfn_in_nid(), and 
mirrored_kernelcore cases.
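
For reference, a simplified sketch of the loop shape that makes a bulk 
memset() awkward (condensed from memmap_init_zone(); hotplug and 
mirrored_kernelcore details omitted):

static void __meminit memmap_init_sketch(unsigned long start_pfn,
		unsigned long end_pfn, unsigned long zone, int nid)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (!early_pfn_valid(pfn))
			continue;	/* hole in the memory map */
		if (!early_pfn_in_nid(pfn, nid))
			continue;	/* pfn belongs to another node */
		__init_single_page(pfn_to_page(pfn), pfn, zone, nid);
	}
}

Because any pfn can be skipped, there is no guaranteed contiguous run 
of struct pages to hand to one large memset().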

> I wouldn't be so sure about this. If any other platform has similar
> issues with small memsets as sparc does, then the overhead is just
> papered over by parallel initialization.

That is true, and that is fine, because the improvement from 
parallelization is an order of magnitude larger than the trade-off of 
slower single-thread performance. Remember, this happens during boot 
and memory hotplug only; it is not something that will eat up computing 
resources during runtime.

So, at the moment I cannot really find a better solution compared to 
what I have proposed: do memset() inside __init_single_page() only when 
deferred initialization is enabled.

Pasha

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-26 16:45           ` Pasha Tatashin
@ 2017-05-29 11:53             ` Michal Hocko
  2017-05-30 17:16               ` Pasha Tatashin
  0 siblings, 1 reply; 46+ messages in thread
From: Michal Hocko @ 2017-05-29 11:53 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

On Fri 26-05-17 12:45:55, Pasha Tatashin wrote:
> Hi Michal,
> 
> I have considered your proposals:
> 
> 1. Making memset(0) unconditional inside __init_single_page() is not going
> to work because it slows down SPARC and ppc64. On SPARC even the STBI
> optimization that I proposed earlier won't work, because after
> consulting with other engineers I was told that stores (without loads!)
> after STBI without a membar are unsafe.

Could you be more specific? E.g. how are other stores done in
__init_single_page safe then? I am sorry to be dense here but how does
the full 64B store differ from other stores done in the same function.

[...]
> So, at the moment I cannot really find a better solution compared to what I
> have proposed: do memset() inside __init_single_page() only when deferred
> initialization is enabled.

As I've already said I am not going to block your approach I was just
hoping for something that doesn't depend on the deferred initialization.
Especially when the struct page is a small objects and it makes sense to
initialize it completely at a single page. Writing to a single cache
line should simply not add memory traffic for exclusive cache line and
struct pages are very likely to exclusive at that stage.

If that doesn't fly then so be it, but I have to confess I still do not
understand why that is not the case.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-29 11:53             ` Michal Hocko
@ 2017-05-30 17:16               ` Pasha Tatashin
  2017-05-31 16:31                 ` Michal Hocko
  0 siblings, 1 reply; 46+ messages in thread
From: Pasha Tatashin @ 2017-05-30 17:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

> Could you be more specific? E.g. how are other stores done in
> __init_single_page safe then? I am sorry to be dense here but how does
> the full 64B store differ from other stores done in the same function.

Hi Michal,

It is safe to do regular 8-byte and smaller stores (stx, st, sth, stb) 
without a membar, but they are slower than STBI, which requires a 
membar before the memory can be accessed again. So when on SPARC we 
zero a larger span of memory, it is faster to use STBI and do one 
membar at the end. This is why, for a single thread, it is faster to 
zero multiple pages of memory and then initialize only the fields that 
are needed in "struct page". I believe the same is true for ppc64, as 
it clears the whole 128-byte cache line at a time with larger memsets.

Pasha

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-30 17:16               ` Pasha Tatashin
@ 2017-05-31 16:31                 ` Michal Hocko
  2017-05-31 16:51                   ` David Miller
  2017-06-01  3:35                   ` Pasha Tatashin
  0 siblings, 2 replies; 46+ messages in thread
From: Michal Hocko @ 2017-05-31 16:31 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

On Tue 30-05-17 13:16:50, Pasha Tatashin wrote:
> >Could you be more specific? E.g. how are other stores done in
> >__init_single_page safe then? I am sorry to be dense here but how does
> >the full 64B store differ from other stores done in the same function.
> 
> Hi Michal,
> 
> It is safe to do regular 8-byte and smaller stores (stx, st, sth, stb)
> without a membar, but they are slower than STBI, which requires a membar
> before the memory can be accessed again.

OK, so why can't we make zero_struct_page do 8x 8B stores, while other
arches would do memset? You said it would be slower, but would that be
measurable? I am sorry to be so persistent here, but I would be much
happier if this didn't depend on the deferred initialization. If this is
absolutely a no-go then I can live with that of course.
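
A minimal sketch of what that could look like (zero_struct_page() is 
the name suggested above; the 64-byte size and 8-byte alignment of 
struct page are assumed here):

static inline void zero_struct_page(struct page *page)
{
#ifdef CONFIG_SPARC64
	/* eight plain 8-byte stores: no STBI, hence no membar */
	unsigned long *p = (unsigned long *)page;

	p[0] = 0; p[1] = 0; p[2] = 0; p[3] = 0;
	p[4] = 0; p[5] = 0; p[6] = 0; p[7] = 0;
#else
	memset(page, 0, sizeof(struct page));
#endif
}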

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-31 16:31                 ` Michal Hocko
@ 2017-05-31 16:51                   ` David Miller
  2017-06-01  3:35                   ` Pasha Tatashin
  1 sibling, 0 replies; 46+ messages in thread
From: David Miller @ 2017-05-31 16:51 UTC (permalink / raw)
  To: mhocko
  Cc: pasha.tatashin, linux-kernel, sparclinux, linux-mm, linuxppc-dev,
	linux-s390, borntraeger, heiko.carstens

From: Michal Hocko <mhocko@kernel.org>
Date: Wed, 31 May 2017 18:31:31 +0200

> On Tue 30-05-17 13:16:50, Pasha Tatashin wrote:
>> >Could you be more specific? E.g. how are other stores done in
>> >__init_single_page safe then? I am sorry to be dense here but how does
>> >the full 64B store differ from other stores done in the same function.
>> 
>> Hi Michal,
>> 
>> It is safe to do regular 8-byte and smaller stores (stx, st, sth, stb)
>> without a membar, but they are slower than STBI, which requires a membar
>> before the memory can be accessed again.
> 
> OK, so why can't we make zero_struct_page do 8x 8B stores, while other
> arches would do memset? You said it would be slower, but would that be
> measurable? I am sorry to be so persistent here, but I would be much
> happier if this didn't depend on the deferred initialization. If this is
> absolutely a no-go then I can live with that of course.

It is measurable.  That's the impetus for this work in the first place.

When we do the memory barrier, the whole store buffer flushes, because
the memory barrier is done with a dependency on the next load or store
operation, one of which the caller is going to do immediately.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-05-31 16:31                 ` Michal Hocko
  2017-05-31 16:51                   ` David Miller
@ 2017-06-01  3:35                   ` Pasha Tatashin
  2017-06-01  8:46                     ` Michal Hocko
  1 sibling, 1 reply; 46+ messages in thread
From: Pasha Tatashin @ 2017-06-01  3:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

> OK, so why can't we make zero_struct_page do 8x 8B stores, while other
> arches would do memset? You said it would be slower, but would that be
> measurable? I am sorry to be so persistent here, but I would be much
> happier if this didn't depend on the deferred initialization. If this is
> absolutely a no-go then I can live with that of course.

Hi Michal,

This is actually a very good idea. I just did some measurements, and it 
looks like performance is very good.

Here is data from a SPARC-M7 with 3312G memory, single-thread performance:

Current:
memset() in memblock allocator takes: 8.83s
__init_single_page() takes: 8.63s

Option 1:
memset() in __init_single_page() takes: 61.09s (as we discussed, this 
is because of the membar overhead; memset should really be optimized to 
do STBI only when the size is one page or bigger).

Option 2:

8 stores (stx) in __init_single_page(): 8.525s!

So, even for single-thread performance, we can double the 
initialization speed of "struct page" on SPARC by removing memset() 
from memblock and using 8 stx in __init_single_page(). It appears we 
never miss L1 in __init_single_page() after the initial 8 stx.

I will update patches with memset() on other platforms, and stx on SPARC.

My experimental code looks like this:

static void __meminit __init_single_page(struct page *page,
		unsigned long pfn, unsigned long zone, int nid)
{
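         /* Zero all 64 bytes of struct page with eight plain 8-byte
          * stores; unlike block-initializing stores, these need no
          * membar. */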
         __asm__ __volatile__(
         "stx    %%g0, [%0 + 0x00]\n"
         "stx    %%g0, [%0 + 0x08]\n"
         "stx    %%g0, [%0 + 0x10]\n"
         "stx    %%g0, [%0 + 0x18]\n"
         "stx    %%g0, [%0 + 0x20]\n"
         "stx    %%g0, [%0 + 0x28]\n"
         "stx    %%g0, [%0 + 0x30]\n"
         "stx    %%g0, [%0 + 0x38]\n"
         : /* no outputs */
         : "r" (page)
         : "memory");
         set_page_links(page, zone, nid, pfn);
         init_page_count(page);
         page_mapcount_reset(page);
         page_cpupid_reset_last(page);

         INIT_LIST_HEAD(&page->lru);
#ifdef WANT_PAGE_VIRTUAL
         /* The shift won't overflow because ZONE_NORMAL is below 4G. */
         if (!is_highmem_idx(zone))
                 set_page_address(page, __va(pfn << PAGE_SHIFT));
#endif
}

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [v3 0/9] parallelized "struct page" zeroing
  2017-06-01  3:35                   ` Pasha Tatashin
@ 2017-06-01  8:46                     ` Michal Hocko
  0 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-06-01  8:46 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kernel, sparclinux, linux-mm, linuxppc-dev, linux-s390,
	borntraeger, heiko.carstens, davem

On Wed 31-05-17 23:35:48, Pasha Tatashin wrote:
> >OK, so why cannot we make zero_struct_page 8x 8B stores, other arches
> >would do memset. You said it would be slower but would that be
> >measurable? I am sorry to be so persistent here but I would be really
> >happier if this didn't depend on the deferred initialization. If this is
> >absolutely a no-go then I can live with that of course.
> 
> Hi Michal,
> 
> This is actually a very good idea. I just did some measurements, and it
> looks like performance is very good.
> 
> Here is data from SPARC-M7 with 3312G memory with single thread performance:
> 
> Current:
> memset() in memblock allocator takes: 8.83s
> __init_single_page() takes: 8.63s
> 
> Option 1:
> memset() in __init_single_page() takes: 61.09s (as we discussed, this is
> because of the membar overhead; memset should really be optimized to do
> STBI only when the size is one page or bigger).
> 
> Option 2:
> 
> 8 stores (stx) in __init_single_page(): 8.525s!
> 
> So, even for single-thread performance, we can double the initialization
> speed of "struct page" on SPARC by removing memset() from memblock and
> using 8 stx in __init_single_page(). It appears we never miss L1 in
> __init_single_page() after the initial 8 stx.

OK, that is good to hear, and it actually matches my understanding that
writes to a single cache line should not add an overhead.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread


Thread overview: 46+ messages
2017-05-05 17:03 [v3 0/9] parallelized "struct page" zeroing Pavel Tatashin
2017-05-05 17:03 ` [v3 1/9] sparc64: simplify vmemmap_populate Pavel Tatashin
2017-05-05 17:03 ` [v3 2/9] mm: defining memblock_virt_alloc_try_nid_raw Pavel Tatashin
2017-05-05 17:03 ` [v3 3/9] mm: add "zero" argument to vmemmap allocators Pavel Tatashin
2017-05-13 19:17   ` kbuild test robot
2017-05-05 17:03 ` [v3 4/9] mm: do not zero vmemmap_buf Pavel Tatashin
2017-05-05 17:03 ` [v3 5/9] mm: zero struct pages during initialization Pavel Tatashin
2017-05-05 17:03 ` [v3 6/9] sparc64: teach sparc not to zero struct pages memory Pavel Tatashin
2017-05-05 17:03 ` [v3 7/9] x86: teach x86 " Pavel Tatashin
2017-05-05 17:03 ` [v3 8/9] powerpc: teach platforms " Pavel Tatashin
2017-05-05 17:03 ` [v3 9/9] s390: " Pavel Tatashin
2017-05-08 11:36   ` Heiko Carstens
2017-05-15 18:24     ` Pasha Tatashin
2017-05-15 23:17       ` Heiko Carstens
2017-05-16  0:33         ` Pasha Tatashin
2017-05-09 18:12 ` [v3 0/9] parallelized "struct page" zeroing Michal Hocko
2017-05-09 18:54   ` Pasha Tatashin
2017-05-10  7:24     ` Michal Hocko
2017-05-10 13:42       ` Pasha Tatashin
2017-05-10 14:57         ` Michal Hocko
2017-05-10 15:01           ` Pasha Tatashin
2017-05-10 15:20             ` David Miller
2017-05-11 20:47             ` Pasha Tatashin
2017-05-11 20:59               ` Pasha Tatashin
2017-05-12 16:57                 ` David Miller
2017-05-12 17:24                   ` Pasha Tatashin
2017-05-12 17:37                     ` David Miller
2017-05-16 23:50                       ` Benjamin Herrenschmidt
2017-05-12 16:56               ` David Miller
2017-05-10 15:19           ` David Miller
2017-05-10 17:17             ` Matthew Wilcox
2017-05-10 18:00               ` David Miller
2017-05-10 21:11                 ` Matthew Wilcox
2017-05-11  8:05             ` Michal Hocko
2017-05-11 14:35               ` David Miller
2017-05-15 18:12   ` Pasha Tatashin
2017-05-15 19:38     ` Michal Hocko
2017-05-15 20:44       ` Pasha Tatashin
2017-05-16  8:36         ` Michal Hocko
2017-05-26 16:45           ` Pasha Tatashin
2017-05-29 11:53             ` Michal Hocko
2017-05-30 17:16               ` Pasha Tatashin
2017-05-31 16:31                 ` Michal Hocko
2017-05-31 16:51                   ` David Miller
2017-06-01  3:35                   ` Pasha Tatashin
2017-06-01  8:46                     ` Michal Hocko
