linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [mm PATCH v5 0/7] Deferred page init improvements
@ 2018-11-05 21:19 Alexander Duyck
  2018-11-05 21:19 ` [mm PATCH v5 1/7] mm: Use mm_zero_struct_page from SPARC on all 64b architectures Alexander Duyck
                   ` (8 more replies)
  0 siblings, 9 replies; 34+ messages in thread
From: Alexander Duyck @ 2018-11-05 21:19 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: sparclinux, linux-kernel, linux-nvdimm, davem, pavel.tatashin,
	mhocko, mingo, kirill.shutemov, dan.j.williams, dave.jiang,
	alexander.h.duyck, rppt, willy, vbabka, khalid.aziz, ldufour,
	mgorman, yi.z.zhang, alexander.h.duyck

This patchset is essentially a refactor of the page initialization logic
that is meant to provide for better code reuse while providing a
significant improvement in deferred page initialization performance.

In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
memory per node I have seen the following. In the case of regular memory
initialization the deferred init time was decreased from 3.75s to 1.06s on
average. For the persistent memory the initialization time dropped from
24.17s to 19.12s on average. This amounts to a 253% improvement for the
deferred memory initialization performance, and a 26% improvement in the
persistent memory initialization performance.

I have called out the improvement observed with each patch.

v1->v2:
    Fixed build issue on PowerPC due to page struct size being 56
    Added new patch that removed __SetPageReserved call for hotplug
v2->v3:
    Rebased on latest linux-next
    Removed patch that had removed __SetPageReserved call from init
    Added patch that folded __SetPageReserved into set_page_links
    Tweaked __init_pageblock to use start_pfn to get section_nr instead of pfn
v3->v4:
    Updated patch description and comments for mm_zero_struct_page patch
        Replaced "default" with "case 64"
        Removed #ifndef mm_zero_struct_page
    Fixed typo in comment that ommited "_from" in kerneldoc for iterator
    Added Reviewed-by for patches reviewed by Pavel
    Added Acked-by from Michal Hocko
    Added deferred init times for patches that affect init performance
    Swapped patches 5 & 6, pulled some code/comments from 4 into 5
v4->v5:
    Updated Acks/Reviewed-by
    Rebased on latest linux-next
    Split core bits of zone iterator patch from MAX_ORDER_NR_PAGES init

---

Alexander Duyck (7):
      mm: Use mm_zero_struct_page from SPARC on all 64b architectures
      mm: Drop meminit_pfn_in_nid as it is redundant
      mm: Implement new zone specific memblock iterator
      mm: Initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections
      mm: Move hot-plug specific memory init into separate functions and optimize
      mm: Add reserved flag setting to set_page_links
      mm: Use common iterator for deferred_init_pages and deferred_free_pages


 arch/sparc/include/asm/pgtable_64.h |   30 --
 include/linux/memblock.h            |   38 ++
 include/linux/mm.h                  |   50 +++
 mm/memblock.c                       |   63 ++++
 mm/page_alloc.c                     |  567 +++++++++++++++++++++--------------
 5 files changed, 492 insertions(+), 256 deletions(-)

--

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [mm PATCH v5 1/7] mm: Use mm_zero_struct_page from SPARC on all 64b architectures
  2018-11-05 21:19 [mm PATCH v5 0/7] Deferred page init improvements Alexander Duyck
@ 2018-11-05 21:19 ` Alexander Duyck
  2018-11-05 21:19 ` [mm PATCH v5 2/7] mm: Drop meminit_pfn_in_nid as it is redundant Alexander Duyck
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Alexander Duyck @ 2018-11-05 21:19 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: sparclinux, linux-kernel, linux-nvdimm, davem, pavel.tatashin,
	mhocko, mingo, kirill.shutemov, dan.j.williams, dave.jiang,
	alexander.h.duyck, rppt, willy, vbabka, khalid.aziz, ldufour,
	mgorman, yi.z.zhang, alexander.h.duyck

This change makes it so that we use the same approach that was already in
use on Sparc on all the archtectures that support a 64b long.

This is mostly motivated by the fact that 7 to 10 store/move instructions
are likely always going to be faster than having to call into a function
that is not specialized for handling page init.

An added advantage to doing it this way is that the compiler can get away
with combining writes in the __init_single_page call. As a result the
memset call will be reduced to only about 4 write operations, or at least
that is what I am seeing with GCC 6.2 as the flags, LRU poitners, and
count/mapcount seem to be cancelling out at least 4 of the 8 assignments on
my system.

One change I had to make to the function was to reduce the minimum page
size to 56 to support some powerpc64 configurations.

This change should introduce no change on SPARC since it already had this
code. In the case of x86_64 I saw a reduction from 3.75s to 2.80s when
initializing 384GB of RAM per node. Pavel Tatashin tested on a system with
Broadcom's Stingray CPU and 48GB of RAM and found that __init_single_page()
takes 19.30ns / 64-byte struct page before this patch and with this patch
it takes 17.33ns / 64-byte struct page. Mike Rapoport ran a similar test on
a OpenPower (S812LC 8348-21C) with Power8 processor and 128GB or RAM. His
results per 64-byte struct page were 4.68ns before, and 4.59ns after this
patch.

Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 arch/sparc/include/asm/pgtable_64.h |   30 --------------------------
 include/linux/mm.h                  |   41 ++++++++++++++++++++++++++++++++---
 2 files changed, 38 insertions(+), 33 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 1393a8ac596b..22500c3be7a9 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -231,36 +231,6 @@ extern unsigned long _PAGE_ALL_SZ_BITS;
 extern struct page *mem_map_zero;
 #define ZERO_PAGE(vaddr)	(mem_map_zero)
 
-/* This macro must be updated when the size of struct page grows above 80
- * or reduces below 64.
- * The idea that compiler optimizes out switch() statement, and only
- * leaves clrx instructions
- */
-#define	mm_zero_struct_page(pp) do {					\
-	unsigned long *_pp = (void *)(pp);				\
-									\
-	 /* Check that struct page is either 64, 72, or 80 bytes */	\
-	BUILD_BUG_ON(sizeof(struct page) & 7);				\
-	BUILD_BUG_ON(sizeof(struct page) < 64);				\
-	BUILD_BUG_ON(sizeof(struct page) > 80);				\
-									\
-	switch (sizeof(struct page)) {					\
-	case 80:							\
-		_pp[9] = 0;	/* fallthrough */			\
-	case 72:							\
-		_pp[8] = 0;	/* fallthrough */			\
-	default:							\
-		_pp[7] = 0;						\
-		_pp[6] = 0;						\
-		_pp[5] = 0;						\
-		_pp[4] = 0;						\
-		_pp[3] = 0;						\
-		_pp[2] = 0;						\
-		_pp[1] = 0;						\
-		_pp[0] = 0;						\
-	}								\
-} while (0)
-
 /* PFNs are real physical page numbers.  However, mem_map only begins to record
  * per-page information starting at pfn_base.  This is to handle systems where
  * the first physical page in the machine is at some huge physical address,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..288c407c08fc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -98,10 +98,45 @@ extern int mmap_rnd_compat_bits __read_mostly;
 
 /*
  * On some architectures it is expensive to call memset() for small sizes.
- * Those architectures should provide their own implementation of "struct page"
- * zeroing by defining this macro in <asm/pgtable.h>.
+ * If an architecture decides to implement their own version of
+ * mm_zero_struct_page they should wrap the defines below in a #ifndef and
+ * define their own version of this macro in <asm/pgtable.h>
  */
-#ifndef mm_zero_struct_page
+#if BITS_PER_LONG == 64
+/* This function must be updated when the size of struct page grows above 80
+ * or reduces below 56. The idea that compiler optimizes out switch()
+ * statement, and only leaves move/store instructions. Also the compiler can
+ * combine write statments if they are both assignments and can be reordered,
+ * this can result in several of the writes here being dropped.
+ */
+#define	mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
+static inline void __mm_zero_struct_page(struct page *page)
+{
+	unsigned long *_pp = (void *)page;
+
+	 /* Check that struct page is either 56, 64, 72, or 80 bytes */
+	BUILD_BUG_ON(sizeof(struct page) & 7);
+	BUILD_BUG_ON(sizeof(struct page) < 56);
+	BUILD_BUG_ON(sizeof(struct page) > 80);
+
+	switch (sizeof(struct page)) {
+	case 80:
+		_pp[9] = 0;	/* fallthrough */
+	case 72:
+		_pp[8] = 0;	/* fallthrough */
+	case 64:
+		_pp[7] = 0;	/* fallthrough */
+	case 56:
+		_pp[6] = 0;
+		_pp[5] = 0;
+		_pp[4] = 0;
+		_pp[3] = 0;
+		_pp[2] = 0;
+		_pp[1] = 0;
+		_pp[0] = 0;
+	}
+}
+#else
 #define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
 #endif
 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [mm PATCH v5 2/7] mm: Drop meminit_pfn_in_nid as it is redundant
  2018-11-05 21:19 [mm PATCH v5 0/7] Deferred page init improvements Alexander Duyck
  2018-11-05 21:19 ` [mm PATCH v5 1/7] mm: Use mm_zero_struct_page from SPARC on all 64b architectures Alexander Duyck
@ 2018-11-05 21:19 ` Alexander Duyck
  2018-11-05 21:19 ` [mm PATCH v5 3/7] mm: Implement new zone specific memblock iterator Alexander Duyck
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Alexander Duyck @ 2018-11-05 21:19 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: sparclinux, linux-kernel, linux-nvdimm, davem, pavel.tatashin,
	mhocko, mingo, kirill.shutemov, dan.j.williams, dave.jiang,
	alexander.h.duyck, rppt, willy, vbabka, khalid.aziz, ldufour,
	mgorman, yi.z.zhang, alexander.h.duyck

As best as I can tell the meminit_pfn_in_nid call is completely redundant.
The deferred memory initialization is already making use of
for_each_free_mem_range which in turn will call into __next_mem_range which
will only return a memory range if it matches the node ID provided assuming
it is not NUMA_NO_NODE.

I am operating on the assumption that there are no zones or pgdata_t
structures that have a NUMA node of NUMA_NO_NODE associated with them. If
that is the case then __next_mem_range will never return a memory range
that doesn't match the zone's node ID and as such the check is redundant.

So one piece I would like to verify on this is if this works for ia64.
Technically it was using a different approach to get the node ID, but it
seems to have the node ID also encoded into the memblock. So I am
assuming this is okay, but would like to get confirmation on that.

On my x86_64 test system with 384GB of memory per node I saw a reduction in
initialization time from 2.80s to 1.85s as a result of this patch.

Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 mm/page_alloc.c |   51 ++++++++++++++-------------------------------------
 1 file changed, 14 insertions(+), 37 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ae31839874b8..be1197c120a8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1301,36 +1301,22 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
 #endif
 
 #ifdef CONFIG_NODES_SPAN_OTHER_NODES
-static inline bool __meminit __maybe_unused
-meminit_pfn_in_nid(unsigned long pfn, int node,
-		   struct mminit_pfnnid_cache *state)
+/* Only safe to use early in boot when initialisation is single-threaded */
+static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
 {
 	int nid;
 
-	nid = __early_pfn_to_nid(pfn, state);
+	nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
 	if (nid >= 0 && nid != node)
 		return false;
 	return true;
 }
 
-/* Only safe to use early in boot when initialisation is single-threaded */
-static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
-{
-	return meminit_pfn_in_nid(pfn, node, &early_pfnnid_cache);
-}
-
 #else
-
 static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
 {
 	return true;
 }
-static inline bool __meminit  __maybe_unused
-meminit_pfn_in_nid(unsigned long pfn, int node,
-		   struct mminit_pfnnid_cache *state)
-{
-	return true;
-}
 #endif
 
 
@@ -1459,21 +1445,13 @@ static inline void __init pgdat_init_report_one_done(void)
  *
  * Then, we check if a current large page is valid by only checking the validity
  * of the head pfn.
- *
- * Finally, meminit_pfn_in_nid is checked on systems where pfns can interleave
- * within a node: a pfn is between start and end of a node, but does not belong
- * to this memory node.
  */
-static inline bool __init
-deferred_pfn_valid(int nid, unsigned long pfn,
-		   struct mminit_pfnnid_cache *nid_init_state)
+static inline bool __init deferred_pfn_valid(unsigned long pfn)
 {
 	if (!pfn_valid_within(pfn))
 		return false;
 	if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
 		return false;
-	if (!meminit_pfn_in_nid(pfn, nid, nid_init_state))
-		return false;
 	return true;
 }
 
@@ -1481,15 +1459,14 @@ deferred_pfn_valid(int nid, unsigned long pfn,
  * Free pages to buddy allocator. Try to free aligned pages in
  * pageblock_nr_pages sizes.
  */
-static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
+static void __init deferred_free_pages(unsigned long pfn,
 				       unsigned long end_pfn)
 {
-	struct mminit_pfnnid_cache nid_init_state = { };
 	unsigned long nr_pgmask = pageblock_nr_pages - 1;
 	unsigned long nr_free = 0;
 
 	for (; pfn < end_pfn; pfn++) {
-		if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
+		if (!deferred_pfn_valid(pfn)) {
 			deferred_free_range(pfn - nr_free, nr_free);
 			nr_free = 0;
 		} else if (!(pfn & nr_pgmask)) {
@@ -1509,17 +1486,18 @@ static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
  * by performing it only once every pageblock_nr_pages.
  * Return number of pages initialized.
  */
-static unsigned long  __init deferred_init_pages(int nid, int zid,
+static unsigned long  __init deferred_init_pages(struct zone *zone,
 						 unsigned long pfn,
 						 unsigned long end_pfn)
 {
-	struct mminit_pfnnid_cache nid_init_state = { };
 	unsigned long nr_pgmask = pageblock_nr_pages - 1;
+	int nid = zone_to_nid(zone);
 	unsigned long nr_pages = 0;
+	int zid = zone_idx(zone);
 	struct page *page = NULL;
 
 	for (; pfn < end_pfn; pfn++) {
-		if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
+		if (!deferred_pfn_valid(pfn)) {
 			page = NULL;
 			continue;
 		} else if (!page || !(pfn & nr_pgmask)) {
@@ -1582,12 +1560,12 @@ static int __init deferred_init_memmap(void *data)
 	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
 		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
 		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
-		nr_pages += deferred_init_pages(nid, zid, spfn, epfn);
+		nr_pages += deferred_init_pages(zone, spfn, epfn);
 	}
 	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
 		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
 		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
-		deferred_free_pages(nid, zid, spfn, epfn);
+		deferred_free_pages(spfn, epfn);
 	}
 	pgdat_resize_unlock(pgdat, &flags);
 
@@ -1626,7 +1604,6 @@ static DEFINE_STATIC_KEY_TRUE(deferred_pages);
 static noinline bool __init
 deferred_grow_zone(struct zone *zone, unsigned int order)
 {
-	int zid = zone_idx(zone);
 	int nid = zone_to_nid(zone);
 	pg_data_t *pgdat = NODE_DATA(nid);
 	unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
@@ -1676,7 +1653,7 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
 		while (spfn < epfn && nr_pages < nr_pages_needed) {
 			t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
 			first_deferred_pfn = min(t, epfn);
-			nr_pages += deferred_init_pages(nid, zid, spfn,
+			nr_pages += deferred_init_pages(zone, spfn,
 							first_deferred_pfn);
 			spfn = first_deferred_pfn;
 		}
@@ -1688,7 +1665,7 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
 	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
 		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
 		epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
-		deferred_free_pages(nid, zid, spfn, epfn);
+		deferred_free_pages(spfn, epfn);
 
 		if (first_deferred_pfn == epfn)
 			break;


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [mm PATCH v5 3/7] mm: Implement new zone specific memblock iterator
  2018-11-05 21:19 [mm PATCH v5 0/7] Deferred page init improvements Alexander Duyck
  2018-11-05 21:19 ` [mm PATCH v5 1/7] mm: Use mm_zero_struct_page from SPARC on all 64b architectures Alexander Duyck
  2018-11-05 21:19 ` [mm PATCH v5 2/7] mm: Drop meminit_pfn_in_nid as it is redundant Alexander Duyck
@ 2018-11-05 21:19 ` Alexander Duyck
  2018-11-09 23:26   ` Pavel Tatashin
  2018-11-05 21:19 ` [mm PATCH v5 4/7] mm: Initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections Alexander Duyck
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2018-11-05 21:19 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: sparclinux, linux-kernel, linux-nvdimm, davem, pavel.tatashin,
	mhocko, mingo, kirill.shutemov, dan.j.williams, dave.jiang,
	alexander.h.duyck, rppt, willy, vbabka, khalid.aziz, ldufour,
	mgorman, yi.z.zhang, alexander.h.duyck

This patch introduces a new iterator for_each_free_mem_pfn_range_in_zone.

This iterator will take care of making sure a given memory range provided
is in fact contained within a zone. It takes are of all the bounds checking
we were doing in deferred_grow_zone, and deferred_init_memmap. In addition
it should help to speed up the search a bit by iterating until the end of a
range is greater than the start of the zone pfn range, and will exit
completely if the start is beyond the end of the zone.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/memblock.h |   22 ++++++++++++++++
 mm/memblock.c            |   63 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c          |   31 +++++++++--------------
 3 files changed, 97 insertions(+), 19 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index aee299a6aa76..413623dc96a3 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -248,6 +248,28 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
 	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
+				  unsigned long *out_spfn,
+				  unsigned long *out_epfn);
+/**
+ * for_each_free_mem_range_in_zone - iterate through zone specific free
+ * memblock areas
+ * @i: u64 used as loop variable
+ * @zone: zone in which all of the memory blocks reside
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over free (memory && !reserved) areas of memblock in a specific
+ * zone. Available as soon as memblock is initialized.
+ */
+#define for_each_free_mem_pfn_range_in_zone(i, zone, p_start, p_end)	\
+	for (i = 0,							\
+	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end);	\
+	     i != (u64)ULLONG_MAX;					\
+	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+
 /**
  * for_each_free_mem_range - iterate through free memblock areas
  * @i: u64 used as loop variable
diff --git a/mm/memblock.c b/mm/memblock.c
index 7df468c8ebc8..f1d1fbfd1ae7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1239,6 +1239,69 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
 	return 0;
 }
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+/**
+ * __next_mem_pfn_range_in_zone - iterator for for_each_*_range_in_zone()
+ *
+ * @idx: pointer to u64 loop variable
+ * @zone: zone in which all of the memory blocks reside
+ * @out_start: ptr to ulong for start pfn of the range, can be %NULL
+ * @out_end: ptr to ulong for end pfn of the range, can be %NULL
+ *
+ * This function is meant to be a zone/pfn specific wrapper for the
+ * for_each_mem_range type iterators. Specifically they are used in the
+ * deferred memory init routines and as such we were duplicating much of
+ * this logic throughout the code. So instead of having it in multiple
+ * locations it seemed like it would make more sense to centralize this to
+ * one new iterator that does everything they need.
+ */
+void __init_memblock
+__next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
+			     unsigned long *out_spfn, unsigned long *out_epfn)
+{
+	int zone_nid = zone_to_nid(zone);
+	phys_addr_t spa, epa;
+	int nid;
+
+	__next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
+			 &memblock.memory, &memblock.reserved,
+			 &spa, &epa, &nid);
+
+	while (*idx != ULLONG_MAX) {
+		unsigned long epfn = PFN_DOWN(epa);
+		unsigned long spfn = PFN_UP(spa);
+
+		/*
+		 * Verify the end is at least past the start of the zone and
+		 * that we have at least one PFN to initialize.
+		 */
+		if (zone->zone_start_pfn < epfn && spfn < epfn) {
+			/* if we went too far just stop searching */
+			if (zone_end_pfn(zone) <= spfn)
+				break;
+
+			if (out_spfn)
+				*out_spfn = max(zone->zone_start_pfn, spfn);
+			if (out_epfn)
+				*out_epfn = min(zone_end_pfn(zone), epfn);
+
+			return;
+		}
+
+		__next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
+				 &memblock.memory, &memblock.reserved,
+				 &spa, &epa, &nid);
+	}
+
+	/* signal end of iteration */
+	*idx = ULLONG_MAX;
+	if (out_spfn)
+		*out_spfn = ULONG_MAX;
+	if (out_epfn)
+		*out_epfn = 0;
+}
+
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
 static phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
 					phys_addr_t align, phys_addr_t start,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index be1197c120a8..5cfd3ebe10d1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1516,11 +1516,9 @@ static unsigned long  __init deferred_init_pages(struct zone *zone,
 static int __init deferred_init_memmap(void *data)
 {
 	pg_data_t *pgdat = data;
-	int nid = pgdat->node_id;
 	unsigned long start = jiffies;
 	unsigned long nr_pages = 0;
 	unsigned long spfn, epfn, first_init_pfn, flags;
-	phys_addr_t spa, epa;
 	int zid;
 	struct zone *zone;
 	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
@@ -1557,14 +1555,12 @@ static int __init deferred_init_memmap(void *data)
 	 * freeing pages we can access pages that are ahead (computing buddy
 	 * page in __free_one_page()).
 	 */
-	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
-		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
-		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
+	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
+		spfn = max_t(unsigned long, first_init_pfn, spfn);
 		nr_pages += deferred_init_pages(zone, spfn, epfn);
 	}
-	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
-		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
-		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
+	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
+		spfn = max_t(unsigned long, first_init_pfn, spfn);
 		deferred_free_pages(spfn, epfn);
 	}
 	pgdat_resize_unlock(pgdat, &flags);
@@ -1572,8 +1568,8 @@ static int __init deferred_init_memmap(void *data)
 	/* Sanity check that the next zone really is unpopulated */
 	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
 
-	pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_pages,
-					jiffies_to_msecs(jiffies - start));
+	pr_info("node %d initialised, %lu pages in %ums\n",
+		pgdat->node_id,	nr_pages, jiffies_to_msecs(jiffies - start));
 
 	pgdat_init_report_one_done();
 	return 0;
@@ -1604,13 +1600,11 @@ static DEFINE_STATIC_KEY_TRUE(deferred_pages);
 static noinline bool __init
 deferred_grow_zone(struct zone *zone, unsigned int order)
 {
-	int nid = zone_to_nid(zone);
-	pg_data_t *pgdat = NODE_DATA(nid);
 	unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
+	pg_data_t *pgdat = zone->zone_pgdat;
 	unsigned long nr_pages = 0;
 	unsigned long first_init_pfn, spfn, epfn, t, flags;
 	unsigned long first_deferred_pfn = pgdat->first_deferred_pfn;
-	phys_addr_t spa, epa;
 	u64 i;
 
 	/* Only the last zone may have deferred pages */
@@ -1646,9 +1640,8 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
 		return false;
 	}
 
-	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
-		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
-		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
+	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
+		spfn = max_t(unsigned long, first_init_pfn, spfn);
 
 		while (spfn < epfn && nr_pages < nr_pages_needed) {
 			t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
@@ -1662,9 +1655,9 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
 			break;
 	}
 
-	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
-		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
-		epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
+	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
+		spfn = max_t(unsigned long, first_init_pfn, spfn);
+		epfn = min_t(unsigned long, first_deferred_pfn, epfn);
 		deferred_free_pages(spfn, epfn);
 
 		if (first_deferred_pfn == epfn)


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [mm PATCH v5 4/7] mm: Initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections
  2018-11-05 21:19 [mm PATCH v5 0/7] Deferred page init improvements Alexander Duyck
                   ` (2 preceding siblings ...)
  2018-11-05 21:19 ` [mm PATCH v5 3/7] mm: Implement new zone specific memblock iterator Alexander Duyck
@ 2018-11-05 21:19 ` Alexander Duyck
  2018-11-10  1:02   ` Pavel Tatashin
  2018-11-05 21:19 ` [mm PATCH v5 5/7] mm: Move hot-plug specific memory init into separate functions and optimize Alexander Duyck
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2018-11-05 21:19 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: sparclinux, linux-kernel, linux-nvdimm, davem, pavel.tatashin,
	mhocko, mingo, kirill.shutemov, dan.j.williams, dave.jiang,
	alexander.h.duyck, rppt, willy, vbabka, khalid.aziz, ldufour,
	mgorman, yi.z.zhang, alexander.h.duyck

This patch adds yet another iterator, for_each_free_mem_range_in_zone_from.
It and then uses it to support initializing and freeing pages in groups no
larger than MAX_ORDER_NR_PAGES. By doing this we can greatly improve the
cache locality of the pages while we do several loops over them in the init
and freeing process.

We are able to tighten the loops as a result since we only really need the
checks for first_init_pfn in our first iteration and after that we can
assume that all future values will be greater than this. So I have added a
function called deferred_init_mem_pfn_range_in_zone that primes the
iterators and if it fails we can just exit.

On my x86_64 test system with 384GB of memory per node I saw a reduction in
initialization time from 1.85s to 1.38s as a result of this patch.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/memblock.h |   16 +++++
 mm/page_alloc.c          |  162 ++++++++++++++++++++++++++++++++++------------
 2 files changed, 134 insertions(+), 44 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 413623dc96a3..5ba52a7878a0 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -268,6 +268,22 @@ void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
 	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end);	\
 	     i != (u64)ULLONG_MAX;					\
 	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
+
+/**
+ * for_each_free_mem_range_in_zone_from - iterate through zone specific
+ * free memblock areas from a given point
+ * @i: u64 used as loop variable
+ * @zone: zone in which all of the memory blocks reside
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over free (memory && !reserved) areas of memblock in a specific
+ * zone, continuing from current position. Available as soon as memblock is
+ * initialized.
+ */
+#define for_each_free_mem_pfn_range_in_zone_from(i, zone, p_start, p_end) \
+	for (; i != (u64)ULLONG_MAX;					  \
+	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
 /**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5cfd3ebe10d1..3466a01ed90a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1512,16 +1512,102 @@ static unsigned long  __init deferred_init_pages(struct zone *zone,
 	return (nr_pages);
 }
 
+/*
+ * This function is meant to pre-load the iterator for the zone init.
+ * Specifically it walks through the ranges until we are caught up to the
+ * first_init_pfn value and exits there. If we never encounter the value we
+ * return false indicating there are no valid ranges left.
+ */
+static bool __init
+deferred_init_mem_pfn_range_in_zone(u64 *i, struct zone *zone,
+				    unsigned long *spfn, unsigned long *epfn,
+				    unsigned long first_init_pfn)
+{
+	u64 j;
+
+	/*
+	 * Start out by walking through the ranges in this zone that have
+	 * already been initialized. We don't need to do anything with them
+	 * so we just need to flush them out of the system.
+	 */
+	for_each_free_mem_pfn_range_in_zone(j, zone, spfn, epfn) {
+		if (*epfn <= first_init_pfn)
+			continue;
+		if (*spfn < first_init_pfn)
+			*spfn = first_init_pfn;
+		*i = j;
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Initialize and free pages. We do it in two loops: first we initialize
+ * struct page, than free to buddy allocator, because while we are
+ * freeing pages we can access pages that are ahead (computing buddy
+ * page in __free_one_page()).
+ *
+ * In order to try and keep some memory in the cache we have the loop
+ * broken along max page order boundaries. This way we will not cause
+ * any issues with the buddy page computation.
+ */
+static unsigned long __init
+deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
+		       unsigned long *end_pfn)
+{
+	unsigned long mo_pfn = ALIGN(*start_pfn + 1, MAX_ORDER_NR_PAGES);
+	unsigned long spfn = *start_pfn, epfn = *end_pfn;
+	unsigned long nr_pages = 0;
+	u64 j = *i;
+
+	/* First we loop through and initialize the page values */
+	for_each_free_mem_pfn_range_in_zone_from(j, zone, &spfn, &epfn) {
+		unsigned long t;
+
+		if (mo_pfn <= spfn)
+			break;
+
+		t = min(mo_pfn, epfn);
+		nr_pages += deferred_init_pages(zone, spfn, t);
+
+		if (mo_pfn <= epfn)
+			break;
+	}
+
+	/* Reset values and now loop through freeing pages as needed */
+	j = *i;
+
+	for_each_free_mem_pfn_range_in_zone_from(j, zone, start_pfn, end_pfn) {
+		unsigned long t;
+
+		if (mo_pfn <= *start_pfn)
+			break;
+
+		t = min(mo_pfn, *end_pfn);
+		deferred_free_pages(*start_pfn, t);
+		*start_pfn = t;
+
+		if (mo_pfn < *end_pfn)
+			break;
+	}
+
+	/* Store our current values to be reused on the next iteration */
+	*i = j;
+
+	return nr_pages;
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
 	pg_data_t *pgdat = data;
+	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+	unsigned long spfn = 0, epfn = 0, nr_pages = 0;
+	unsigned long first_init_pfn, flags;
 	unsigned long start = jiffies;
-	unsigned long nr_pages = 0;
-	unsigned long spfn, epfn, first_init_pfn, flags;
-	int zid;
 	struct zone *zone;
-	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+	int zid;
 	u64 i;
 
 	/* Bind memory initialisation thread to a local node if possible */
@@ -1547,22 +1633,23 @@ static int __init deferred_init_memmap(void *data)
 		if (first_init_pfn < zone_end_pfn(zone))
 			break;
 	}
-	first_init_pfn = max(zone->zone_start_pfn, first_init_pfn);
+
+	/* If the zone is empty somebody else may have cleared out the zone */
+	if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
+						 first_init_pfn)) {
+		pgdat_resize_unlock(pgdat, &flags);
+		pgdat_init_report_one_done();
+		return 0;
+	}
 
 	/*
-	 * Initialize and free pages. We do it in two loops: first we initialize
-	 * struct page, than free to buddy allocator, because while we are
-	 * freeing pages we can access pages that are ahead (computing buddy
-	 * page in __free_one_page()).
+	 * Initialize and free pages in MAX_ORDER sized increments so
+	 * that we can avoid introducing any issues with the buddy
+	 * allocator.
 	 */
-	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
-		spfn = max_t(unsigned long, first_init_pfn, spfn);
-		nr_pages += deferred_init_pages(zone, spfn, epfn);
-	}
-	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
-		spfn = max_t(unsigned long, first_init_pfn, spfn);
-		deferred_free_pages(spfn, epfn);
-	}
+	while (spfn < epfn)
+		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+
 	pgdat_resize_unlock(pgdat, &flags);
 
 	/* Sanity check that the next zone really is unpopulated */
@@ -1602,9 +1689,9 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
 {
 	unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
 	pg_data_t *pgdat = zone->zone_pgdat;
-	unsigned long nr_pages = 0;
-	unsigned long first_init_pfn, spfn, epfn, t, flags;
 	unsigned long first_deferred_pfn = pgdat->first_deferred_pfn;
+	unsigned long spfn, epfn, flags;
+	unsigned long nr_pages = 0;
 	u64 i;
 
 	/* Only the last zone may have deferred pages */
@@ -1633,36 +1720,23 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
 		return true;
 	}
 
-	first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
-
-	if (first_init_pfn >= pgdat_end_pfn(pgdat)) {
+	/* If the zone is empty somebody else may have cleared out the zone */
+	if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
+						 first_deferred_pfn)) {
 		pgdat_resize_unlock(pgdat, &flags);
-		return false;
+		return true;
 	}
 
-	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
-		spfn = max_t(unsigned long, first_init_pfn, spfn);
-
-		while (spfn < epfn && nr_pages < nr_pages_needed) {
-			t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
-			first_deferred_pfn = min(t, epfn);
-			nr_pages += deferred_init_pages(zone, spfn,
-							first_deferred_pfn);
-			spfn = first_deferred_pfn;
-		}
-
-		if (nr_pages >= nr_pages_needed)
-			break;
+	/*
+	 * Initialize and free pages in MAX_ORDER sized increments so
+	 * that we can avoid introducing any issues with the buddy
+	 * allocator.
+	 */
+	while (spfn < epfn && nr_pages < nr_pages_needed) {
+		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+		first_deferred_pfn = spfn;
 	}
 
-	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
-		spfn = max_t(unsigned long, first_init_pfn, spfn);
-		epfn = min_t(unsigned long, first_deferred_pfn, epfn);
-		deferred_free_pages(spfn, epfn);
-
-		if (first_deferred_pfn == epfn)
-			break;
-	}
 	pgdat->first_deferred_pfn = first_deferred_pfn;
 	pgdat_resize_unlock(pgdat, &flags);
 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [mm PATCH v5 5/7] mm: Move hot-plug specific memory init into separate functions and optimize
  2018-11-05 21:19 [mm PATCH v5 0/7] Deferred page init improvements Alexander Duyck
                   ` (3 preceding siblings ...)
  2018-11-05 21:19 ` [mm PATCH v5 4/7] mm: Initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections Alexander Duyck
@ 2018-11-05 21:19 ` Alexander Duyck
  2018-11-10  2:07   ` Pavel Tatashin
  2018-11-05 21:19 ` [mm PATCH v5 6/7] mm: Add reserved flag setting to set_page_links Alexander Duyck
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2018-11-05 21:19 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: sparclinux, linux-kernel, linux-nvdimm, davem, pavel.tatashin,
	mhocko, mingo, kirill.shutemov, dan.j.williams, dave.jiang,
	alexander.h.duyck, rppt, willy, vbabka, khalid.aziz, ldufour,
	mgorman, yi.z.zhang, alexander.h.duyck

This patch is going through and combining the bits in memmap_init_zone and
memmap_init_zone_device that are related to hotplug into a single function
called __memmap_init_hotplug.

I also took the opportunity to integrate __init_single_page's functionality
into this function. In doing so I can get rid of some of the redundancy
such as the LRU pointers versus the pgmap.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 mm/page_alloc.c |  214 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 144 insertions(+), 70 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3466a01ed90a..dbe00c1a0e23 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1192,6 +1192,92 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 #endif
 }
 
+static void __meminit __init_pageblock(unsigned long start_pfn,
+				       unsigned long nr_pages,
+				       unsigned long zone, int nid,
+				       struct dev_pagemap *pgmap)
+{
+	unsigned long nr_pgmask = pageblock_nr_pages - 1;
+	struct page *start_page = pfn_to_page(start_pfn);
+	unsigned long pfn = start_pfn + nr_pages - 1;
+#ifdef WANT_PAGE_VIRTUAL
+	bool is_highmem = is_highmem_idx(zone);
+#endif
+	struct page *page;
+
+	/*
+	 * Enforce the following requirements:
+	 * size > 0
+	 * size < pageblock_nr_pages
+	 * start_pfn -> pfn does not cross pageblock_nr_pages boundary
+	 */
+	VM_BUG_ON(((start_pfn ^ pfn) | (nr_pages - 1)) > nr_pgmask);
+
+	/*
+	 * Work from highest page to lowest, this way we will still be
+	 * warm in the cache when we call set_pageblock_migratetype
+	 * below.
+	 *
+	 * The loop is based around the page pointer as the main index
+	 * instead of the pfn because pfn is not used inside the loop if
+	 * the section number is not in page flags and WANT_PAGE_VIRTUAL
+	 * is not defined.
+	 */
+	for (page = start_page + nr_pages; page-- != start_page; pfn--) {
+		mm_zero_struct_page(page);
+
+		/*
+		 * We use the start_pfn instead of pfn in the set_page_links
+		 * call because of the fact that the pfn number is used to
+		 * get the section_nr and this function should not be
+		 * spanning more than a single section.
+		 */
+		set_page_links(page, zone, nid, start_pfn);
+		init_page_count(page);
+		page_mapcount_reset(page);
+		page_cpupid_reset_last(page);
+
+		/*
+		 * We can use the non-atomic __set_bit operation for setting
+		 * the flag as we are still initializing the pages.
+		 */
+		__SetPageReserved(page);
+
+		/*
+		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
+		 * pointer and hmm_data.  It is a bug if a ZONE_DEVICE
+		 * page is ever freed or placed on a driver-private list.
+		 */
+		page->pgmap = pgmap;
+		if (!pgmap)
+			INIT_LIST_HEAD(&page->lru);
+
+#ifdef WANT_PAGE_VIRTUAL
+		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
+		if (!is_highmem)
+			set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+	}
+
+	/*
+	 * Mark the block movable so that blocks are reserved for
+	 * movable at startup. This will force kernel allocations
+	 * to reserve their blocks rather than leaking throughout
+	 * the address space during boot when many long-lived
+	 * kernel allocations are made.
+	 *
+	 * bitmap is created for zone's valid pfn range. but memmap
+	 * can be created for invalid pages (for alignment)
+	 * check here not to call set_pageblock_migratetype() against
+	 * pfn out of zone.
+	 *
+	 * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
+	 * because this is done early in sparse_add_one_section
+	 */
+	if (!(start_pfn & nr_pgmask))
+		set_pageblock_migratetype(start_page, MIGRATE_MOVABLE);
+}
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 static void __meminit init_reserved_page(unsigned long pfn)
 {
@@ -5513,6 +5599,25 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
 	return false;
 }
 
+static void __meminit __memmap_init_hotplug(unsigned long size, int nid,
+					    unsigned long zone,
+					    unsigned long start_pfn,
+					    struct dev_pagemap *pgmap)
+{
+	unsigned long pfn = start_pfn + size;
+
+	while (pfn != start_pfn) {
+		unsigned long stride = pfn;
+
+		pfn = max(ALIGN_DOWN(pfn - 1, pageblock_nr_pages), start_pfn);
+		stride -= pfn;
+
+		__init_pageblock(pfn, stride, zone, nid, pgmap);
+
+		cond_resched();
+	}
+}
+
 /*
  * Initially all pages are reserved - free ones are freed
  * up by memblock_free_all() once the early boot process is
@@ -5523,49 +5628,59 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		struct vmem_altmap *altmap)
 {
 	unsigned long pfn, end_pfn = start_pfn + size;
-	struct page *page;
 
 	if (highest_memmap_pfn < end_pfn - 1)
 		highest_memmap_pfn = end_pfn - 1;
 
+	if (context == MEMMAP_HOTPLUG) {
 #ifdef CONFIG_ZONE_DEVICE
-	/*
-	 * Honor reservation requested by the driver for this ZONE_DEVICE
-	 * memory. We limit the total number of pages to initialize to just
-	 * those that might contain the memory mapping. We will defer the
-	 * ZONE_DEVICE page initialization until after we have released
-	 * the hotplug lock.
-	 */
-	if (zone == ZONE_DEVICE) {
-		if (!altmap)
-			return;
+		/*
+		 * Honor reservation requested by the driver for this
+		 * ZONE_DEVICE memory. We limit the total number of pages to
+		 * initialize to just those that might contain the memory
+		 * mapping. We will defer the ZONE_DEVICE page initialization
+		 * until after we have released the hotplug lock.
+		 */
+		if (zone == ZONE_DEVICE) {
+			if (!altmap)
+				return;
+
+			if (start_pfn == altmap->base_pfn)
+				start_pfn += altmap->reserve;
+			end_pfn = altmap->base_pfn +
+				  vmem_altmap_offset(altmap);
+		}
+#endif
+		/*
+		 * For these ZONE_DEVICE pages we don't need to record the
+		 * pgmap as they should represent only those pages used to
+		 * store the memory map. The actual ZONE_DEVICE pages will
+		 * be initialized later.
+		 */
+		__memmap_init_hotplug(end_pfn - start_pfn, nid, zone,
+				      start_pfn, NULL);
 
-		if (start_pfn == altmap->base_pfn)
-			start_pfn += altmap->reserve;
-		end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
+		return;
 	}
-#endif
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+		struct page *page;
+
 		/*
 		 * There can be holes in boot-time mem_map[]s handed to this
 		 * function.  They do not exist on hotplugged memory.
 		 */
-		if (context == MEMMAP_EARLY) {
-			if (!early_pfn_valid(pfn))
-				continue;
-			if (!early_pfn_in_nid(pfn, nid))
-				continue;
-			if (overlap_memmap_init(zone, &pfn))
-				continue;
-			if (defer_init(nid, pfn, end_pfn))
-				break;
-		}
+		if (!early_pfn_valid(pfn))
+			continue;
+		if (!early_pfn_in_nid(pfn, nid))
+			continue;
+		if (overlap_memmap_init(zone, &pfn))
+			continue;
+		if (defer_init(nid, pfn, end_pfn))
+			break;
 
 		page = pfn_to_page(pfn);
 		__init_single_page(page, pfn, zone, nid);
-		if (context == MEMMAP_HOTPLUG)
-			__SetPageReserved(page);
 
 		/*
 		 * Mark the block movable so that blocks are reserved for
@@ -5592,7 +5707,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long size,
 				   struct dev_pagemap *pgmap)
 {
-	unsigned long pfn, end_pfn = start_pfn + size;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
@@ -5608,53 +5722,13 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	 */
 	if (pgmap->altmap_valid) {
 		struct vmem_altmap *altmap = &pgmap->altmap;
+		unsigned long end_pfn = start_pfn + size;
 
 		start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
 		size = end_pfn - start_pfn;
 	}
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
-		struct page *page = pfn_to_page(pfn);
-
-		__init_single_page(page, pfn, zone_idx, nid);
-
-		/*
-		 * Mark page reserved as it will need to wait for onlining
-		 * phase for it to be fully associated with a zone.
-		 *
-		 * We can use the non-atomic __set_bit operation for setting
-		 * the flag as we are still initializing the pages.
-		 */
-		__SetPageReserved(page);
-
-		/*
-		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
-		 * pointer and hmm_data.  It is a bug if a ZONE_DEVICE
-		 * page is ever freed or placed on a driver-private list.
-		 */
-		page->pgmap = pgmap;
-		page->hmm_data = 0;
-
-		/*
-		 * Mark the block movable so that blocks are reserved for
-		 * movable at startup. This will force kernel allocations
-		 * to reserve their blocks rather than leaking throughout
-		 * the address space during boot when many long-lived
-		 * kernel allocations are made.
-		 *
-		 * bitmap is created for zone's valid pfn range. but memmap
-		 * can be created for invalid pages (for alignment)
-		 * check here not to call set_pageblock_migratetype() against
-		 * pfn out of zone.
-		 *
-		 * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
-		 * because this is done early in sparse_add_one_section
-		 */
-		if (!(pfn & (pageblock_nr_pages - 1))) {
-			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-			cond_resched();
-		}
-	}
+	__memmap_init_hotplug(size, nid, zone_idx, start_pfn, pgmap);
 
 	pr_info("%s initialised, %lu pages in %ums\n", dev_name(pgmap->dev),
 		size, jiffies_to_msecs(jiffies - start));


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [mm PATCH v5 6/7] mm: Add reserved flag setting to set_page_links
  2018-11-05 21:19 [mm PATCH v5 0/7] Deferred page init improvements Alexander Duyck
                   ` (4 preceding siblings ...)
  2018-11-05 21:19 ` [mm PATCH v5 5/7] mm: Move hot-plug specific memory init into separate functions and optimize Alexander Duyck
@ 2018-11-05 21:19 ` Alexander Duyck
  2018-11-10  2:11   ` Pavel Tatashin
  2018-11-05 21:20 ` [mm PATCH v5 7/7] mm: Use common iterator for deferred_init_pages and deferred_free_pages Alexander Duyck
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2018-11-05 21:19 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: sparclinux, linux-kernel, linux-nvdimm, davem, pavel.tatashin,
	mhocko, mingo, kirill.shutemov, dan.j.williams, dave.jiang,
	alexander.h.duyck, rppt, willy, vbabka, khalid.aziz, ldufour,
	mgorman, yi.z.zhang, alexander.h.duyck

This patch modifies the set_page_links function to include the setting of
the reserved flag via a simple AND and OR operation. The motivation for
this is the fact that the existing __set_bit call still seems to have
effects on performance as replacing the call with the AND and OR can reduce
initialization time.

Looking over the assembly code before and after the change the main
difference between the two is that the reserved bit is stored in a value
that is generated outside of the main initialization loop and is then
written with the other flags field values in one write to the page->flags
value. Previously the generated value was written and then then a btsq
instruction was issued.

On my x86_64 test system with 3TB of persistent memory per node I saw the
persistent memory initialization time on average drop from 23.49s to
19.12s per node.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/mm.h |    9 ++++++++-
 mm/page_alloc.c    |   29 +++++++++++++++++++----------
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 288c407c08fc..de6535a98e45 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1171,11 +1171,18 @@ static inline void set_page_node(struct page *page, unsigned long node)
 	page->flags |= (node & NODES_MASK) << NODES_PGSHIFT;
 }
 
+static inline void set_page_reserved(struct page *page, bool reserved)
+{
+	page->flags &= ~(1ul << PG_reserved);
+	page->flags |= (unsigned long)(!!reserved) << PG_reserved;
+}
+
 static inline void set_page_links(struct page *page, enum zone_type zone,
-	unsigned long node, unsigned long pfn)
+	unsigned long node, unsigned long pfn, bool reserved)
 {
 	set_page_zone(page, zone);
 	set_page_node(page, node);
+	set_page_reserved(page, reserved);
 #ifdef SECTION_IN_PAGE_FLAGS
 	set_page_section(page, pfn_to_section_nr(pfn));
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dbe00c1a0e23..9eb993a9be99 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1179,7 +1179,7 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid)
 {
 	mm_zero_struct_page(page);
-	set_page_links(page, zone, nid, pfn);
+	set_page_links(page, zone, nid, pfn, false);
 	init_page_count(page);
 	page_mapcount_reset(page);
 	page_cpupid_reset_last(page);
@@ -1195,7 +1195,8 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 static void __meminit __init_pageblock(unsigned long start_pfn,
 				       unsigned long nr_pages,
 				       unsigned long zone, int nid,
-				       struct dev_pagemap *pgmap)
+				       struct dev_pagemap *pgmap,
+				       bool is_reserved)
 {
 	unsigned long nr_pgmask = pageblock_nr_pages - 1;
 	struct page *start_page = pfn_to_page(start_pfn);
@@ -1231,18 +1232,15 @@ static void __meminit __init_pageblock(unsigned long start_pfn,
 		 * call because of the fact that the pfn number is used to
 		 * get the section_nr and this function should not be
 		 * spanning more than a single section.
+		 *
+		 * We can use a non-atomic operation for setting the
+		 * PG_reserved flag as we are still initializing the pages.
 		 */
-		set_page_links(page, zone, nid, start_pfn);
+		set_page_links(page, zone, nid, start_pfn, is_reserved);
 		init_page_count(page);
 		page_mapcount_reset(page);
 		page_cpupid_reset_last(page);
 
-		/*
-		 * We can use the non-atomic __set_bit operation for setting
-		 * the flag as we are still initializing the pages.
-		 */
-		__SetPageReserved(page);
-
 		/*
 		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
 		 * pointer and hmm_data.  It is a bug if a ZONE_DEVICE
@@ -5612,7 +5610,18 @@ static void __meminit __memmap_init_hotplug(unsigned long size, int nid,
 		pfn = max(ALIGN_DOWN(pfn - 1, pageblock_nr_pages), start_pfn);
 		stride -= pfn;
 
-		__init_pageblock(pfn, stride, zone, nid, pgmap);
+		/*
+		 * The last argument of __init_pageblock is a boolean
+		 * value indicating if the page will be marked as reserved.
+		 *
+		 * Mark page reserved as it will need to wait for onlining
+		 * phase for it to be fully associated with a zone.
+		 *
+		 * Under certain circumstances ZONE_DEVICE pages may not
+		 * need to be marked as reserved, however there is still
+		 * code that is depending on this being set for now.
+		 */
+		__init_pageblock(pfn, stride, zone, nid, pgmap, true);
 
 		cond_resched();
 	}


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [mm PATCH v5 7/7] mm: Use common iterator for deferred_init_pages and deferred_free_pages
  2018-11-05 21:19 [mm PATCH v5 0/7] Deferred page init improvements Alexander Duyck
                   ` (5 preceding siblings ...)
  2018-11-05 21:19 ` [mm PATCH v5 6/7] mm: Add reserved flag setting to set_page_links Alexander Duyck
@ 2018-11-05 21:20 ` Alexander Duyck
  2018-11-10  4:13   ` Pavel Tatashin
  2018-11-09 21:15 ` [mm PATCH v5 0/7] Deferred page init improvements Pavel Tatashin
  2018-11-14 15:07 ` Michal Hocko
  8 siblings, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2018-11-05 21:20 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: sparclinux, linux-kernel, linux-nvdimm, davem, pavel.tatashin,
	mhocko, mingo, kirill.shutemov, dan.j.williams, dave.jiang,
	alexander.h.duyck, rppt, willy, vbabka, khalid.aziz, ldufour,
	mgorman, yi.z.zhang, alexander.h.duyck

This patch creates a common iterator to be used by both deferred_init_pages
and deferred_free_pages. By doing this we can cut down a bit on code
overhead as they will likely both be inlined into the same function anyway.

This new approach allows deferred_init_pages to make use of
__init_pageblock. By doing this we can cut down on the code size by sharing
code between both the hotplug and deferred memory init code paths.

An additional benefit to this approach is that we improve in cache locality
of the memory init as we can focus on the memory areas related to
identifying if a given PFN is valid and keep that warm in the cache until
we transition to a region of a different type. So we will stream through a
chunk of valid blocks before we turn to initializing page structs.

On my x86_64 test system with 384GB of memory per node I saw a reduction in
initialization time from 1.38s to 1.06s as a result of this patch.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 mm/page_alloc.c |  134 +++++++++++++++++++++++++++----------------------------
 1 file changed, 65 insertions(+), 69 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9eb993a9be99..521b94eb02a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1484,32 +1484,6 @@ void clear_zone_contiguous(struct zone *zone)
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-static void __init deferred_free_range(unsigned long pfn,
-				       unsigned long nr_pages)
-{
-	struct page *page;
-	unsigned long i;
-
-	if (!nr_pages)
-		return;
-
-	page = pfn_to_page(pfn);
-
-	/* Free a large naturally-aligned chunk if possible */
-	if (nr_pages == pageblock_nr_pages &&
-	    (pfn & (pageblock_nr_pages - 1)) == 0) {
-		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-		__free_pages_core(page, pageblock_order);
-		return;
-	}
-
-	for (i = 0; i < nr_pages; i++, page++, pfn++) {
-		if ((pfn & (pageblock_nr_pages - 1)) == 0)
-			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-		__free_pages_core(page, 0);
-	}
-}
-
 /* Completion tracking for deferred_init_memmap() threads */
 static atomic_t pgdat_init_n_undone __initdata;
 static __initdata DECLARE_COMPLETION(pgdat_init_all_done_comp);
@@ -1521,48 +1495,77 @@ static inline void __init pgdat_init_report_one_done(void)
 }
 
 /*
- * Returns true if page needs to be initialized or freed to buddy allocator.
+ * Returns count if page range needs to be initialized or freed
  *
- * First we check if pfn is valid on architectures where it is possible to have
- * holes within pageblock_nr_pages. On systems where it is not possible, this
- * function is optimized out.
+ * First, we check if a current large page is valid by only checking the
+ * validity of the head pfn.
  *
- * Then, we check if a current large page is valid by only checking the validity
- * of the head pfn.
+ * Then we check if the contiguous pfns are valid on architectures where it
+ * is possible to have holes within pageblock_nr_pages. On systems where it
+ * is not possible, this function is optimized out.
  */
-static inline bool __init deferred_pfn_valid(unsigned long pfn)
+static unsigned long __next_pfn_valid_range(unsigned long *i,
+					    unsigned long end_pfn)
 {
-	if (!pfn_valid_within(pfn))
-		return false;
-	if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
-		return false;
-	return true;
+	unsigned long pfn = *i;
+	unsigned long count;
+
+	while (pfn < end_pfn) {
+		unsigned long t = ALIGN(pfn + 1, pageblock_nr_pages);
+		unsigned long pageblock_pfn = min(t, end_pfn);
+
+#ifndef CONFIG_HOLES_IN_ZONE
+		count = pageblock_pfn - pfn;
+		pfn = pageblock_pfn;
+		if (!pfn_valid(pfn))
+			continue;
+#else
+		for (count = 0; pfn < pageblock_pfn; pfn++) {
+			if (pfn_valid_within(pfn)) {
+				count++;
+				continue;
+			}
+
+			if (count)
+				break;
+		}
+
+		if (!count)
+			continue;
+#endif
+		*i = pfn;
+		return count;
+	}
+
+	return 0;
 }
 
+#define for_each_deferred_pfn_valid_range(i, start_pfn, end_pfn, pfn, count) \
+	for (i = (start_pfn),						     \
+	     count = __next_pfn_valid_range(&i, (end_pfn));		     \
+	     count && ({ pfn = i - count; 1; });			     \
+	     count = __next_pfn_valid_range(&i, (end_pfn)))
 /*
  * Free pages to buddy allocator. Try to free aligned pages in
  * pageblock_nr_pages sizes.
  */
-static void __init deferred_free_pages(unsigned long pfn,
+static void __init deferred_free_pages(unsigned long start_pfn,
 				       unsigned long end_pfn)
 {
-	unsigned long nr_pgmask = pageblock_nr_pages - 1;
-	unsigned long nr_free = 0;
-
-	for (; pfn < end_pfn; pfn++) {
-		if (!deferred_pfn_valid(pfn)) {
-			deferred_free_range(pfn - nr_free, nr_free);
-			nr_free = 0;
-		} else if (!(pfn & nr_pgmask)) {
-			deferred_free_range(pfn - nr_free, nr_free);
-			nr_free = 1;
-			touch_nmi_watchdog();
+	unsigned long i, pfn, count;
+
+	for_each_deferred_pfn_valid_range(i, start_pfn, end_pfn, pfn, count) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (count == pageblock_nr_pages) {
+			__free_pages_core(page, pageblock_order);
 		} else {
-			nr_free++;
+			while (count--)
+				__free_pages_core(page++, 0);
 		}
+
+		touch_nmi_watchdog();
 	}
-	/* Free the last block of pages to allocator */
-	deferred_free_range(pfn - nr_free, nr_free);
 }
 
 /*
@@ -1571,29 +1574,22 @@ static void __init deferred_free_pages(unsigned long pfn,
  * Return number of pages initialized.
  */
 static unsigned long  __init deferred_init_pages(struct zone *zone,
-						 unsigned long pfn,
+						 unsigned long start_pfn,
 						 unsigned long end_pfn)
 {
-	unsigned long nr_pgmask = pageblock_nr_pages - 1;
+	unsigned long i, pfn, count;
 	int nid = zone_to_nid(zone);
 	unsigned long nr_pages = 0;
 	int zid = zone_idx(zone);
-	struct page *page = NULL;
 
-	for (; pfn < end_pfn; pfn++) {
-		if (!deferred_pfn_valid(pfn)) {
-			page = NULL;
-			continue;
-		} else if (!page || !(pfn & nr_pgmask)) {
-			page = pfn_to_page(pfn);
-			touch_nmi_watchdog();
-		} else {
-			page++;
-		}
-		__init_single_page(page, pfn, zid, nid);
-		nr_pages++;
+	for_each_deferred_pfn_valid_range(i, start_pfn, end_pfn, pfn, count) {
+		nr_pages += count;
+		__init_pageblock(pfn, count, zid, nid, NULL, false);
+
+		touch_nmi_watchdog();
 	}
-	return (nr_pages);
+
+	return nr_pages;
 }
 
 /*


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-05 21:19 [mm PATCH v5 0/7] Deferred page init improvements Alexander Duyck
                   ` (6 preceding siblings ...)
  2018-11-05 21:20 ` [mm PATCH v5 7/7] mm: Use common iterator for deferred_init_pages and deferred_free_pages Alexander Duyck
@ 2018-11-09 21:15 ` Pavel Tatashin
  2018-11-09 23:14   ` Alexander Duyck
  2018-11-14 15:07 ` Michal Hocko
  8 siblings, 1 reply; 34+ messages in thread
From: Pavel Tatashin @ 2018-11-09 21:15 UTC (permalink / raw)
  To: Alexander Duyck, daniel.m.jordan
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mhocko, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang

On 18-11-05 13:19:25, Alexander Duyck wrote:
> This patchset is essentially a refactor of the page initialization logic
> that is meant to provide for better code reuse while providing a
> significant improvement in deferred page initialization performance.
> 
> In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> memory per node I have seen the following. In the case of regular memory
> initialization the deferred init time was decreased from 3.75s to 1.06s on
> average. For the persistent memory the initialization time dropped from
> 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> deferred memory initialization performance, and a 26% improvement in the
> persistent memory initialization performance.

Hi Alex,

Please try to run your persistent memory init experiment with Daniel's
patches:

https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/

The performance should improve by much more than 26%.

Overall, your works looks good, but it needs to be considered how easy it will be
to merge with ktask. I will try to complete the review today.

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-09 21:15 ` [mm PATCH v5 0/7] Deferred page init improvements Pavel Tatashin
@ 2018-11-09 23:14   ` Alexander Duyck
  2018-11-10  0:00     ` Pavel Tatashin
  0 siblings, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2018-11-09 23:14 UTC (permalink / raw)
  To: Pavel Tatashin, daniel.m.jordan
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mhocko, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang

On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote:
> On 18-11-05 13:19:25, Alexander Duyck wrote:
> > This patchset is essentially a refactor of the page initialization logic
> > that is meant to provide for better code reuse while providing a
> > significant improvement in deferred page initialization performance.
> > 
> > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> > memory per node I have seen the following. In the case of regular memory
> > initialization the deferred init time was decreased from 3.75s to 1.06s on
> > average. For the persistent memory the initialization time dropped from
> > 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> > deferred memory initialization performance, and a 26% improvement in the
> > persistent memory initialization performance.
> 
> Hi Alex,
> 
> Please try to run your persistent memory init experiment with Daniel's
> patches:
> 
> https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/

I've taken a quick look at it. It seems like a bit of a brute force way
to try and speed things up. I would be worried about it potentially
introducing performance issues if the number of CPUs thrown at it end
up exceeding the maximum throughput of the memory.

The data provided with patch 11 seems to point to issues such as that.
In the case of the E7-8895 example cited it is increasing the numbers
of CPUs used from memory initialization from 8 to 72, a 9x increase in
the number of CPUs but it is yeilding only a 3.88x speedup.

> The performance should improve by much more than 26%.

The 26% improvement, or speedup of 1.26x using the ktask approach, was
for persistent memory, not deferred memory init. The ktask patch
doesn't do anything for persistent memory since it is takes the hot-
plug path and isn't handled via the deferred memory init.

I had increased deferred memory init to about 3.53x the original speed
(3.75s to 1.06s) on the system which I was testing. I do agree the two
patches should be able to synergistically boost each other though as
this patch set was meant to make the init much more cache friendly so
as a result it should scale better as you add additional cores. I know
I had done some playing around with fake numa to split up a single node
into 8 logical nodes and I had seen a similar speedup of about 3.85x
with my test memory initializing in about 275ms.

> Overall, your works looks good, but it needs to be considered how easy it will be
> to merge with ktask. I will try to complete the review today.
> 
> Thank you,
> Pasha

Looking over the patches they are still in the RFC stage and the data
is in need of updates since it is referencing 4.15-rc kernels as its
baseline. If anything I really think the ktask patch 11 would be easier
to rebase around my patch set then the other way around. Also, this
series is in Andrew's mmots as of a few days ago, so I think it will be
in the next mmotm that comes out.

The integration with the ktask code should be pretty straight forward.
If anything I think my code would probably make it easier since it gets
rid of the need to do all this in two passes. The only new limitation
it would add is that you would probably want to split up the work along
either max order or section aligned boundaries. What it would
essentially do is make things so that each of the ktask threads would
probably look more like deferred_grow_zone which after my patch set is
actually a fairly simple function.

Thanks.

- Alex





^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 3/7] mm: Implement new zone specific memblock iterator
  2018-11-05 21:19 ` [mm PATCH v5 3/7] mm: Implement new zone specific memblock iterator Alexander Duyck
@ 2018-11-09 23:26   ` Pavel Tatashin
  2018-11-09 23:58     ` Alexander Duyck
  0 siblings, 1 reply; 34+ messages in thread
From: Pavel Tatashin @ 2018-11-09 23:26 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mhocko, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang

> +/**
> + * for_each_free_mem_range_in_zone - iterate through zone specific free
> + * memblock areas
> + * @i: u64 used as loop variable
> + * @zone: zone in which all of the memory blocks reside
> + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> + *
> + * Walks over free (memory && !reserved) areas of memblock in a specific
> + * zone. Available as soon as memblock is initialized.
> + */
> +#define for_each_free_mem_pfn_range_in_zone(i, zone, p_start, p_end)	\
> +	for (i = 0,							\
> +	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end);	\
> +	     i != (u64)ULLONG_MAX;					\
> +	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
> +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */

Use U64_MAX instead of ULLONG_MAX, and avoid u64 cast. I know other
places in this file use UULONG_MAX with cast, but I think U64_MAX is
better.

> +
>  /**
>   * for_each_free_mem_range - iterate through free memblock areas
>   * @i: u64 used as loop variable
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 7df468c8ebc8..f1d1fbfd1ae7 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1239,6 +1239,69 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
>  	return 0;
>  }
>  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +/**
> + * __next_mem_pfn_range_in_zone - iterator for for_each_*_range_in_zone()
> + *
> + * @idx: pointer to u64 loop variable
> + * @zone: zone in which all of the memory blocks reside
> + * @out_start: ptr to ulong for start pfn of the range, can be %NULL
> + * @out_end: ptr to ulong for end pfn of the range, can be %NULL
> + *
> + * This function is meant to be a zone/pfn specific wrapper for the
> + * for_each_mem_range type iterators. Specifically they are used in the
> + * deferred memory init routines and as such we were duplicating much of
> + * this logic throughout the code. So instead of having it in multiple
> + * locations it seemed like it would make more sense to centralize this to
> + * one new iterator that does everything they need.
> + */
> +void __init_memblock
> +__next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> +			     unsigned long *out_spfn, unsigned long *out_epfn)
> +{
> +	int zone_nid = zone_to_nid(zone);
> +	phys_addr_t spa, epa;
> +	int nid;
> +
> +	__next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
> +			 &memblock.memory, &memblock.reserved,
> +			 &spa, &epa, &nid);
> +
> +	while (*idx != ULLONG_MAX) {

Ditto, use U64_MAX

> +		unsigned long epfn = PFN_DOWN(epa);
> +		unsigned long spfn = PFN_UP(spa);
> +
> +		/*
> +		 * Verify the end is at least past the start of the zone and
> +		 * that we have at least one PFN to initialize.
> +		 */
> +		if (zone->zone_start_pfn < epfn && spfn < epfn) {
> +			/* if we went too far just stop searching */
> +			if (zone_end_pfn(zone) <= spfn)
> +				break;

Set *idx = U64_MAX here, then break. This way after we are outside this
while loop idx is always equals to U64_MAX.

> +
> +			if (out_spfn)
> +				*out_spfn = max(zone->zone_start_pfn, spfn);
> +			if (out_epfn)
> +				*out_epfn = min(zone_end_pfn(zone), epfn);

Don't we need to verify after adjustment that out_spfn != out_epfn, so
there is at least one PFN to initialize?

The rest looks good. Once the above is fixed:

Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 3/7] mm: Implement new zone specific memblock iterator
  2018-11-09 23:26   ` Pavel Tatashin
@ 2018-11-09 23:58     ` Alexander Duyck
  2018-11-10  0:11       ` Pavel Tatashin
  0 siblings, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2018-11-09 23:58 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mhocko, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang

On Fri, 2018-11-09 at 18:26 -0500, Pavel Tatashin wrote:
> > +/**
> > + * for_each_free_mem_range_in_zone - iterate through zone specific free
> > + * memblock areas
> > + * @i: u64 used as loop variable
> > + * @zone: zone in which all of the memory blocks reside
> > + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> > + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> > + *
> > + * Walks over free (memory && !reserved) areas of memblock in a specific
> > + * zone. Available as soon as memblock is initialized.
> > + */
> > +#define for_each_free_mem_pfn_range_in_zone(i, zone, p_start, p_end)	\
> > +	for (i = 0,							\
> > +	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end);	\
> > +	     i != (u64)ULLONG_MAX;					\
> > +	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
> > +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
> 
> Use U64_MAX instead of ULLONG_MAX, and avoid u64 cast. I know other
> places in this file use UULONG_MAX with cast, but I think U64_MAX is
> better.

Okay, maybe I will submit a follow up that just cleans all of these up.

> > +
> >  /**
> >   * for_each_free_mem_range - iterate through free memblock areas
> >   * @i: u64 used as loop variable
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 7df468c8ebc8..f1d1fbfd1ae7 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -1239,6 +1239,69 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
> >  	return 0;
> >  }
> >  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> > +/**
> > + * __next_mem_pfn_range_in_zone - iterator for for_each_*_range_in_zone()
> > + *
> > + * @idx: pointer to u64 loop variable
> > + * @zone: zone in which all of the memory blocks reside
> > + * @out_start: ptr to ulong for start pfn of the range, can be %NULL
> > + * @out_end: ptr to ulong for end pfn of the range, can be %NULL
> > + *
> > + * This function is meant to be a zone/pfn specific wrapper for the
> > + * for_each_mem_range type iterators. Specifically they are used in the
> > + * deferred memory init routines and as such we were duplicating much of
> > + * this logic throughout the code. So instead of having it in multiple
> > + * locations it seemed like it would make more sense to centralize this to
> > + * one new iterator that does everything they need.
> > + */
> > +void __init_memblock
> > +__next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> > +			     unsigned long *out_spfn, unsigned long *out_epfn)
> > +{
> > +	int zone_nid = zone_to_nid(zone);
> > +	phys_addr_t spa, epa;
> > +	int nid;
> > +
> > +	__next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
> > +			 &memblock.memory, &memblock.reserved,
> > +			 &spa, &epa, &nid);
> > +
> > +	while (*idx != ULLONG_MAX) {
> 
> Ditto, use U64_MAX

Same here.

> > +		unsigned long epfn = PFN_DOWN(epa);
> > +		unsigned long spfn = PFN_UP(spa);
> > +
> > +		/*
> > +		 * Verify the end is at least past the start of the zone and
> > +		 * that we have at least one PFN to initialize.
> > +		 */
> > +		if (zone->zone_start_pfn < epfn && spfn < epfn) {
> > +			/* if we went too far just stop searching */
> > +			if (zone_end_pfn(zone) <= spfn)
> > +				break;
> 
> Set *idx = U64_MAX here, then break. This way after we are outside this
> while loop idx is always equals to U64_MAX.

Actually I think what you are asking for is the logic that is outside
of the while loop we are breaking out of. So if you check at the end of
the function there is the bit of code with the comment "signal end of
iteration" where I end up setting *idx to ULLONG_MAX, *out_spfn to
ULONG_MAX, and *out_epfn to 0.

The general idea I had with the function is that you could use either
the index or spfn < epfn checks to determine if you keep going or not.

> 
> > +
> > +			if (out_spfn)
> > +				*out_spfn = max(zone->zone_start_pfn, spfn);
> > +			if (out_epfn)
> > +				*out_epfn = min(zone_end_pfn(zone), epfn);
> 
> Don't we need to verify after adjustment that out_spfn != out_epfn, so
> there is at least one PFN to initialize?

We have a few checks that I believe prevent that. Before we get to this
point we have verified the following:
	zone->zone_start < epfn
	spfn < epfn

The other check that should be helping to prevent that is the break
statement above that is forcing us to exit if spfn is somehow already
past the end of the zone, that essentially maps out:
	spfn < zone_end_pfn(zone)

So the only check we don't have is:
	zone->zone_start < zone_end_pfn(zone)

If I am not mistaken that is supposed to be a given is it not? I would
assume we don't have any zones that are completely empty or inverted
that would be called here do we?

> The rest looks good. Once the above is fixed:
> 
> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> 
> Thank you,
> Pasha

Thanks for the feedback.

- Alex


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-09 23:14   ` Alexander Duyck
@ 2018-11-10  0:00     ` Pavel Tatashin
  2018-11-10  0:46       ` Alexander Duyck
  2018-11-12 16:25       ` Daniel Jordan
  0 siblings, 2 replies; 34+ messages in thread
From: Pavel Tatashin @ 2018-11-10  0:00 UTC (permalink / raw)
  To: Alexander Duyck, daniel.m.jordan
  Cc: daniel.m.jordan, akpm, linux-mm, sparclinux, linux-kernel,
	linux-nvdimm, davem, pavel.tatashin, mhocko, mingo,
	kirill.shutemov, dan.j.williams, dave.jiang, rppt, willy, vbabka,
	khalid.aziz, ldufour, mgorman, yi.z.zhang

On 18-11-09 15:14:35, Alexander Duyck wrote:
> On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote:
> > On 18-11-05 13:19:25, Alexander Duyck wrote:
> > > This patchset is essentially a refactor of the page initialization logic
> > > that is meant to provide for better code reuse while providing a
> > > significant improvement in deferred page initialization performance.
> > > 
> > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> > > memory per node I have seen the following. In the case of regular memory
> > > initialization the deferred init time was decreased from 3.75s to 1.06s on
> > > average. For the persistent memory the initialization time dropped from
> > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> > > deferred memory initialization performance, and a 26% improvement in the
> > > persistent memory initialization performance.
> > 
> > Hi Alex,
> > 
> > Please try to run your persistent memory init experiment with Daniel's
> > patches:
> > 
> > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/
> 
> I've taken a quick look at it. It seems like a bit of a brute force way
> to try and speed things up. I would be worried about it potentially

There is a limit to max number of threads that ktasks start. The memory
throughput is *much* higher than what one CPU can maxout in a node, so
there is no reason to leave the other CPUs sit idle during boot when
they can help to initialize.

> introducing performance issues if the number of CPUs thrown at it end
> up exceeding the maximum throughput of the memory.
> 
> The data provided with patch 11 seems to point to issues such as that.
> In the case of the E7-8895 example cited it is increasing the numbers
> of CPUs used from memory initialization from 8 to 72, a 9x increase in
> the number of CPUs but it is yeilding only a 3.88x speedup.

Yes, but in both cases we are far from maxing out the memory throughput.
The 3.88x is indeed low, and I do not know what slows it down.

Daniel,

Could you please check why multi-threading efficiency is so low here?

I bet, there is some atomic operation introduces a contention within a
node. It should be possible to resolve.

> 
> > The performance should improve by much more than 26%.
> 
> The 26% improvement, or speedup of 1.26x using the ktask approach, was
> for persistent memory, not deferred memory init. The ktask patch
> doesn't do anything for persistent memory since it is takes the hot-
> plug path and isn't handled via the deferred memory init.

Ah, I thought in your experiment persistent memory takes deferred init
path. So, what exactly in your patches make this 1.26x speedup?

> 
> I had increased deferred memory init to about 3.53x the original speed
> (3.75s to 1.06s) on the system which I was testing. I do agree the two
> patches should be able to synergistically boost each other though as
> this patch set was meant to make the init much more cache friendly so
> as a result it should scale better as you add additional cores. I know
> I had done some playing around with fake numa to split up a single node
> into 8 logical nodes and I had seen a similar speedup of about 3.85x
> with my test memory initializing in about 275ms.
> 
> > Overall, your works looks good, but it needs to be considered how easy it will be
> > to merge with ktask. I will try to complete the review today.
> > 
> > Thank you,
> > Pasha
> 
> Looking over the patches they are still in the RFC stage and the data
> is in need of updates since it is referencing 4.15-rc kernels as its
> baseline. If anything I really think the ktask patch 11 would be easier
> to rebase around my patch set then the other way around. Also, this
> series is in Andrew's mmots as of a few days ago, so I think it will be
> in the next mmotm that comes out.

I do not disagree, I think these two patch series should complement each
other. But, if your changes make it impossible for ktask, I would strongly argue
against it, as the potential improvements with ktasks are much higher.
But, so far I do not see anything, so I think they can work together. I
am still reviewing your work.

> 
> The integration with the ktask code should be pretty straight forward.
> If anything I think my code would probably make it easier since it gets
> rid of the need to do all this in two passes. The only new limitation
> it would add is that you would probably want to split up the work along
> either max order or section aligned boundaries. What it would

Which is totally OK, it should make ktasks scale even better.

> essentially do is make things so that each of the ktask threads would
> probably look more like deferred_grow_zone which after my patch set is
> actually a fairly simple function.
> 
> Thanks.

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 3/7] mm: Implement new zone specific memblock iterator
  2018-11-09 23:58     ` Alexander Duyck
@ 2018-11-10  0:11       ` Pavel Tatashin
  0 siblings, 0 replies; 34+ messages in thread
From: Pavel Tatashin @ 2018-11-10  0:11 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mhocko, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang

> > > +		unsigned long epfn = PFN_DOWN(epa);
> > > +		unsigned long spfn = PFN_UP(spa);
> > > +
> > > +		/*
> > > +		 * Verify the end is at least past the start of the zone and
> > > +		 * that we have at least one PFN to initialize.
> > > +		 */
> > > +		if (zone->zone_start_pfn < epfn && spfn < epfn) {
> > > +			/* if we went too far just stop searching */
> > > +			if (zone_end_pfn(zone) <= spfn)
> > > +				break;
> > 
> > Set *idx = U64_MAX here, then break. This way after we are outside this
> > while loop idx is always equals to U64_MAX.
> 
> Actually I think what you are asking for is the logic that is outside
> of the while loop we are breaking out of. So if you check at the end of
> the function there is the bit of code with the comment "signal end of
> iteration" where I end up setting *idx to ULLONG_MAX, *out_spfn to
> ULONG_MAX, and *out_epfn to 0.
> 
> The general idea I had with the function is that you could use either
> the index or spfn < epfn checks to determine if you keep going or not.

Yes, I meant to remove that *idx = U64_MAX after the loop, it is
confusing to have a loop:

while (*idx != U64_MAX) {
	...
}

*idx = U64_MAX;


So, it is better to set idx to U643_MAX inside the loop before the
break.

> 
> > 
> > > +
> > > +			if (out_spfn)
> > > +				*out_spfn = max(zone->zone_start_pfn, spfn);
> > > +			if (out_epfn)
> > > +				*out_epfn = min(zone_end_pfn(zone), epfn);
> > 
> > Don't we need to verify after adjustment that out_spfn != out_epfn, so
> > there is at least one PFN to initialize?
> 
> We have a few checks that I believe prevent that. Before we get to this
> point we have verified the following:
> 	zone->zone_start < epfn
> 	spfn < epfn
> 
> The other check that should be helping to prevent that is the break
> statement above that is forcing us to exit if spfn is somehow already
> past the end of the zone, that essentially maps out:
> 	spfn < zone_end_pfn(zone)
> 
> So the only check we don't have is:
> 	zone->zone_start < zone_end_pfn(zone)
> 
> If I am not mistaken that is supposed to be a given is it not? I would
> assume we don't have any zones that are completely empty or inverted
> that would be called here do we?


if (zone_end_pfn(zone) <= spfn) won't break

Equal sign in <= here takes care of the case I was thinking. Yes, logic looks good.

Thank you
Pasha

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-10  0:00     ` Pavel Tatashin
@ 2018-11-10  0:46       ` Alexander Duyck
  2018-11-10  1:16         ` Pavel Tatashin
  2018-11-12 16:25       ` Daniel Jordan
  1 sibling, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2018-11-10  0:46 UTC (permalink / raw)
  To: Pavel Tatashin, daniel.m.jordan
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mhocko, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang

On Fri, 2018-11-09 at 19:00 -0500, Pavel Tatashin wrote:
> On 18-11-09 15:14:35, Alexander Duyck wrote:
> > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote:
> > > On 18-11-05 13:19:25, Alexander Duyck wrote:
> > > > This patchset is essentially a refactor of the page initialization logic
> > > > that is meant to provide for better code reuse while providing a
> > > > significant improvement in deferred page initialization performance.
> > > > 
> > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> > > > memory per node I have seen the following. In the case of regular memory
> > > > initialization the deferred init time was decreased from 3.75s to 1.06s on
> > > > average. For the persistent memory the initialization time dropped from
> > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> > > > deferred memory initialization performance, and a 26% improvement in the
> > > > persistent memory initialization performance.
> > > 
> > > Hi Alex,
> > > 
> > > Please try to run your persistent memory init experiment with Daniel's
> > > patches:
> > > 
> > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/
> > 
> > I've taken a quick look at it. It seems like a bit of a brute force way
> > to try and speed things up. I would be worried about it potentially
> 
> There is a limit to max number of threads that ktasks start. The memory
> throughput is *much* higher than what one CPU can maxout in a node, so
> there is no reason to leave the other CPUs sit idle during boot when
> they can help to initialize.

Right, but right now that limit can still be pretty big when it is
something like 25% of all the CPUs on a 288 CPU system.

One issue is the way the code was ends up essentially blowing out the
cache over and over again. Doing things in two passes made it really
expensive as you took one cache miss to initialize it, and another to
free it. I think getting rid of that is one of the biggest gains with
my patch set.

> > introducing performance issues if the number of CPUs thrown at it end
> > up exceeding the maximum throughput of the memory.
> > 
> > The data provided with patch 11 seems to point to issues such as that.
> > In the case of the E7-8895 example cited it is increasing the numbers
> > of CPUs used from memory initialization from 8 to 72, a 9x increase in
> > the number of CPUs but it is yeilding only a 3.88x speedup.
> 
> Yes, but in both cases we are far from maxing out the memory throughput.
> The 3.88x is indeed low, and I do not know what slows it down.
> 
> Daniel,
> 
> Could you please check why multi-threading efficiency is so low here?
> 
> I bet, there is some atomic operation introduces a contention within a
> node. It should be possible to resolve.

Depending on what kernel this was based on it probably was something
like SetPageReserved triggering pipeline stalls, or maybe it is freeing
the pages themselves since I would imagine that has a pretty big
overhead an may end up serializing some things.

> > 
> > > The performance should improve by much more than 26%.
> > 
> > The 26% improvement, or speedup of 1.26x using the ktask approach, was
> > for persistent memory, not deferred memory init. The ktask patch
> > doesn't do anything for persistent memory since it is takes the hot-
> > plug path and isn't handled via the deferred memory init.
> 
> Ah, I thought in your experiment persistent memory takes deferred init
> path. So, what exactly in your patches make this 1.26x speedup?

Basically it is the folding of the reserved bit into the setting of the
rest of the values written into the page flags. With that change we are
able to tighten the memory initialization loop to the point where it
almost looks like a memset case as it will put together all of the
memory values outside of the loop and then just loop through assigning
the values for each page.

> > 
> > I had increased deferred memory init to about 3.53x the original speed
> > (3.75s to 1.06s) on the system which I was testing. I do agree the two
> > patches should be able to synergistically boost each other though as
> > this patch set was meant to make the init much more cache friendly so
> > as a result it should scale better as you add additional cores. I know
> > I had done some playing around with fake numa to split up a single node
> > into 8 logical nodes and I had seen a similar speedup of about 3.85x
> > with my test memory initializing in about 275ms.

It is also possible that the the 3.8x is an upper limit for something
on the system in terms of scaling. I just realized that the 3.85x I saw
with an 8 way split over a 4 socket system isn't too different from the
3.88x that was recorded for a 9 way split over an 8 socket system.

> > > Overall, your works looks good, but it needs to be considered how easy it will be
> > > to merge with ktask. I will try to complete the review today.
> > > 
> > > Thank you,
> > > Pasha
> > 
> > Looking over the patches they are still in the RFC stage and the data
> > is in need of updates since it is referencing 4.15-rc kernels as its
> > baseline. If anything I really think the ktask patch 11 would be easier
> > to rebase around my patch set then the other way around. Also, this
> > series is in Andrew's mmots as of a few days ago, so I think it will be
> > in the next mmotm that comes out.
> 
> I do not disagree, I think these two patch series should complement each
> other. But, if your changes make it impossible for ktask, I would strongly argue
> against it, as the potential improvements with ktasks are much higher.
> But, so far I do not see anything, so I think they can work together. I
> am still reviewing your work.

I am assuming patch 11 from the ktask set is the only one that really
touches the deferred memory init. I'm thinking that the actual patch
should be smaller if it is rebased off of this patch set. I didn't see
anything that should conflict with my patch set, and as I said the only
concern would be making certain to split the zone correctly to prevent
the free thread from hopping across the MAX_ORDER boundaries.

> > 
> > The integration with the ktask code should be pretty straight forward.
> > If anything I think my code would probably make it easier since it gets
> > rid of the need to do all this in two passes. The only new limitation
> > it would add is that you would probably want to split up the work along
> > either max order or section aligned boundaries. What it would
> 
> Which is totally OK, it should make ktasks scale even better.

Right.

> > essentially do is make things so that each of the ktask threads would
> > probably look more like deferred_grow_zone which after my patch set is
> > actually a fairly simple function.
> > 
> > Thanks.
> 
> Thank you,
> Pasha




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 4/7] mm: Initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections
  2018-11-05 21:19 ` [mm PATCH v5 4/7] mm: Initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections Alexander Duyck
@ 2018-11-10  1:02   ` Pavel Tatashin
  2018-11-19 18:53     ` Alexander Duyck
  0 siblings, 1 reply; 34+ messages in thread
From: Pavel Tatashin @ 2018-11-10  1:02 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mhocko, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang

On 18-11-05 13:19:45, Alexander Duyck wrote:
>  	}
> -	first_init_pfn = max(zone->zone_start_pfn, first_init_pfn);
> +
> +	/* If the zone is empty somebody else may have cleared out the zone */
> +	if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
> +						 first_init_pfn)) {
> +		pgdat_resize_unlock(pgdat, &flags);
> +		pgdat_init_report_one_done();
> +		return 0;

It would make more sense to add goto to the end of this function and report
that in Node X had 0 pages initialized. Otherwise, it will be confusing
why some nodes are not initialized during boot. It is simply because
they do not have deferred pages left to be initialized.


> +	}
>  
>  	/*
> -	 * Initialize and free pages. We do it in two loops: first we initialize
> -	 * struct page, than free to buddy allocator, because while we are
> -	 * freeing pages we can access pages that are ahead (computing buddy
> -	 * page in __free_one_page()).
> +	 * Initialize and free pages in MAX_ORDER sized increments so
> +	 * that we can avoid introducing any issues with the buddy
> +	 * allocator.
>  	 */
> -	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
> -		spfn = max_t(unsigned long, first_init_pfn, spfn);
> -		nr_pages += deferred_init_pages(zone, spfn, epfn);
> -	}
> -	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
> -		spfn = max_t(unsigned long, first_init_pfn, spfn);
> -		deferred_free_pages(spfn, epfn);
> -	}
> +	while (spfn < epfn)
> +		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +
>  	pgdat_resize_unlock(pgdat, &flags);
>  
>  	/* Sanity check that the next zone really is unpopulated */
> @@ -1602,9 +1689,9 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
>  {

How about order = max(order, max_order)?

>  	unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);


> -	first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
> -
> -	if (first_init_pfn >= pgdat_end_pfn(pgdat)) {
> +	/* If the zone is empty somebody else may have cleared out the zone */
> +	if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
> +						 first_deferred_pfn)) {
>  		pgdat_resize_unlock(pgdat, &flags);
> -		return false;
> +		return true;

I am OK to change to true here, please also set
pgdat->first_deferred_pfn to ULONG_MAX to indicate that there are no
more deferred pages in this node.


Overall, I like this patch, makes things a lot easier, assuming the
above is addressed:

Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-10  0:46       ` Alexander Duyck
@ 2018-11-10  1:16         ` Pavel Tatashin
  2018-11-12 19:10           ` Alexander Duyck
  0 siblings, 1 reply; 34+ messages in thread
From: Pavel Tatashin @ 2018-11-10  1:16 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: daniel.m.jordan, akpm, linux-mm, sparclinux, linux-kernel,
	linux-nvdimm, davem, pavel.tatashin, mhocko, mingo,
	kirill.shutemov, dan.j.williams, dave.jiang, rppt, willy, vbabka,
	khalid.aziz, ldufour, mgorman, yi.z.zhang

On 18-11-09 16:46:02, Alexander Duyck wrote:
> On Fri, 2018-11-09 at 19:00 -0500, Pavel Tatashin wrote:
> > On 18-11-09 15:14:35, Alexander Duyck wrote:
> > > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote:
> > > > On 18-11-05 13:19:25, Alexander Duyck wrote:
> > > > > This patchset is essentially a refactor of the page initialization logic
> > > > > that is meant to provide for better code reuse while providing a
> > > > > significant improvement in deferred page initialization performance.
> > > > > 
> > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> > > > > memory per node I have seen the following. In the case of regular memory
> > > > > initialization the deferred init time was decreased from 3.75s to 1.06s on
> > > > > average. For the persistent memory the initialization time dropped from
> > > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> > > > > deferred memory initialization performance, and a 26% improvement in the
> > > > > persistent memory initialization performance.
> > > > 
> > > > Hi Alex,
> > > > 
> > > > Please try to run your persistent memory init experiment with Daniel's
> > > > patches:
> > > > 
> > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/
> > > 
> > > I've taken a quick look at it. It seems like a bit of a brute force way
> > > to try and speed things up. I would be worried about it potentially
> > 
> > There is a limit to max number of threads that ktasks start. The memory
> > throughput is *much* higher than what one CPU can maxout in a node, so
> > there is no reason to leave the other CPUs sit idle during boot when
> > they can help to initialize.
> 
> Right, but right now that limit can still be pretty big when it is
> something like 25% of all the CPUs on a 288 CPU system.

It is still OK. About 9 threads per node.

That machine has 1T of memory, which means 8 nodes need to initialize 2G
of memory each. With 46G/s throughout it should take 0.043s Which is 10
times higher than what Daniel sees with 0.325s, so there is still room
to saturate the memory throughput.

Now, if the multi-threadding efficiency is good, it should take
1.261s / 9 threads =  0.14s

> 
> One issue is the way the code was ends up essentially blowing out the
> cache over and over again. Doing things in two passes made it really
> expensive as you took one cache miss to initialize it, and another to
> free it. I think getting rid of that is one of the biggest gains with
> my patch set.

I am not arguing that your patches make sense, all I am saying that
ktasks improve time order of magnitude better on machines with large
amount of memory.

Pasha

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 5/7] mm: Move hot-plug specific memory init into separate functions and optimize
  2018-11-05 21:19 ` [mm PATCH v5 5/7] mm: Move hot-plug specific memory init into separate functions and optimize Alexander Duyck
@ 2018-11-10  2:07   ` Pavel Tatashin
  0 siblings, 0 replies; 34+ messages in thread
From: Pavel Tatashin @ 2018-11-10  2:07 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mhocko, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang

On 18-11-05 13:19:50, Alexander Duyck wrote:
> This patch is going through and combining the bits in memmap_init_zone and
> memmap_init_zone_device that are related to hotplug into a single function
> called __memmap_init_hotplug.
> 
> I also took the opportunity to integrate __init_single_page's functionality
> into this function. In doing so I can get rid of some of the redundancy
> such as the LRU pointers versus the pgmap.

Please don't do it, __init_single_page() is hard function to optimize,
do not copy its code. Instead could you you split __init_single_page()
in two parts, something like this:

static inline init_single_page_nolru(struct page *page, unsigned long pfn,
                                       unsigned long zone, int nid) {
        mm_zero_struct_page(page);
        set_page_links(page, zone, nid, pfn);
        init_page_count(page);
        page_mapcount_reset(page);
        page_cpupid_reset_last(page);
#ifdef WANT_PAGE_VIRTUAL
        /* The shift won't overflow because ZONE_NORMAL is below 4G. */
        if (!is_highmem_idx(zone))
                set_page_address(page, __va(pfn << PAGE_SHIFT));
#endif
}


static void __meminit init_single_page(struct page *page, unsigned long pfn, 
                                unsigned long zone, int nid) 
{
        init_single_page_nolru(page, pfn, zone, nid);
        INIT_LIST_HEAD(&page->lru);
}

And call init_single_page_nolru() from __init_pageblock() ? Also, remove
WANT_PAGE_VIRTUAL optimization, I do not think it worse it.

The rest looks very good, please do the above change.

Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>

> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> ---

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 6/7] mm: Add reserved flag setting to set_page_links
  2018-11-05 21:19 ` [mm PATCH v5 6/7] mm: Add reserved flag setting to set_page_links Alexander Duyck
@ 2018-11-10  2:11   ` Pavel Tatashin
  0 siblings, 0 replies; 34+ messages in thread
From: Pavel Tatashin @ 2018-11-10  2:11 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mhocko, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang

On 18-11-05 13:19:56, Alexander Duyck wrote:
> This patch modifies the set_page_links function to include the setting of
> the reserved flag via a simple AND and OR operation. The motivation for
> this is the fact that the existing __set_bit call still seems to have
> effects on performance as replacing the call with the AND and OR can reduce
> initialization time.
> 
> Looking over the assembly code before and after the change the main
> difference between the two is that the reserved bit is stored in a value
> that is generated outside of the main initialization loop and is then
> written with the other flags field values in one write to the page->flags
> value. Previously the generated value was written and then then a btsq
> instruction was issued.
> 
> On my x86_64 test system with 3TB of persistent memory per node I saw the
> persistent memory initialization time on average drop from 23.49s to
> 19.12s per node.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 7/7] mm: Use common iterator for deferred_init_pages and deferred_free_pages
  2018-11-05 21:20 ` [mm PATCH v5 7/7] mm: Use common iterator for deferred_init_pages and deferred_free_pages Alexander Duyck
@ 2018-11-10  4:13   ` Pavel Tatashin
  2018-11-12 15:12     ` Alexander Duyck
  0 siblings, 1 reply; 34+ messages in thread
From: Pavel Tatashin @ 2018-11-10  4:13 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mhocko, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang

On 18-11-05 13:20:01, Alexander Duyck wrote:
> +static unsigned long __next_pfn_valid_range(unsigned long *i,
> +					    unsigned long end_pfn)
>  {
> -	if (!pfn_valid_within(pfn))
> -		return false;
> -	if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
> -		return false;
> -	return true;
> +	unsigned long pfn = *i;
> +	unsigned long count;
> +
> +	while (pfn < end_pfn) {
> +		unsigned long t = ALIGN(pfn + 1, pageblock_nr_pages);
> +		unsigned long pageblock_pfn = min(t, end_pfn);
> +
> +#ifndef CONFIG_HOLES_IN_ZONE
> +		count = pageblock_pfn - pfn;
> +		pfn = pageblock_pfn;
> +		if (!pfn_valid(pfn))
> +			continue;
> +#else
> +		for (count = 0; pfn < pageblock_pfn; pfn++) {
> +			if (pfn_valid_within(pfn)) {
> +				count++;
> +				continue;
> +			}
> +
> +			if (count)
> +				break;
> +		}
> +
> +		if (!count)
> +			continue;
> +#endif
> +		*i = pfn;
> +		return count;
> +	}
> +
> +	return 0;
>  }
>  
> +#define for_each_deferred_pfn_valid_range(i, start_pfn, end_pfn, pfn, count) \
> +	for (i = (start_pfn),						     \
> +	     count = __next_pfn_valid_range(&i, (end_pfn));		     \
> +	     count && ({ pfn = i - count; 1; });			     \
> +	     count = __next_pfn_valid_range(&i, (end_pfn)))

Can this be improved somehow? It took me a while to understand this
piece of code. i is actually end of block, and not an index by PFN, ({pfn = i - count; 1;}) is
simply hard to parse. Why can't we make __next_pfn_valid_range() to
return both end and a start of a block?

The rest is good:

Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 7/7] mm: Use common iterator for deferred_init_pages and deferred_free_pages
  2018-11-10  4:13   ` Pavel Tatashin
@ 2018-11-12 15:12     ` Alexander Duyck
  0 siblings, 0 replies; 34+ messages in thread
From: Alexander Duyck @ 2018-11-12 15:12 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mhocko, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang

On 11/9/2018 8:13 PM, Pavel Tatashin wrote:
> On 18-11-05 13:20:01, Alexander Duyck wrote:
>> +static unsigned long __next_pfn_valid_range(unsigned long *i,
>> +					    unsigned long end_pfn)
>>   {
>> -	if (!pfn_valid_within(pfn))
>> -		return false;
>> -	if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
>> -		return false;
>> -	return true;
>> +	unsigned long pfn = *i;
>> +	unsigned long count;
>> +
>> +	while (pfn < end_pfn) {
>> +		unsigned long t = ALIGN(pfn + 1, pageblock_nr_pages);
>> +		unsigned long pageblock_pfn = min(t, end_pfn);
>> +
>> +#ifndef CONFIG_HOLES_IN_ZONE
>> +		count = pageblock_pfn - pfn;
>> +		pfn = pageblock_pfn;
>> +		if (!pfn_valid(pfn))
>> +			continue;
>> +#else
>> +		for (count = 0; pfn < pageblock_pfn; pfn++) {
>> +			if (pfn_valid_within(pfn)) {
>> +				count++;
>> +				continue;
>> +			}
>> +
>> +			if (count)
>> +				break;
>> +		}
>> +
>> +		if (!count)
>> +			continue;
>> +#endif
>> +		*i = pfn;
>> +		return count;
>> +	}
>> +
>> +	return 0;
>>   }
>>   
>> +#define for_each_deferred_pfn_valid_range(i, start_pfn, end_pfn, pfn, count) \
>> +	for (i = (start_pfn),						     \
>> +	     count = __next_pfn_valid_range(&i, (end_pfn));		     \
>> +	     count && ({ pfn = i - count; 1; });			     \
>> +	     count = __next_pfn_valid_range(&i, (end_pfn)))
> 
> Can this be improved somehow? It took me a while to understand this
> piece of code. i is actually end of block, and not an index by PFN, ({pfn = i - count; 1;}) is
> simply hard to parse. Why can't we make __next_pfn_valid_range() to
> return both end and a start of a block?

One thing I could do is flip the direction and work from the end to the 
start. If I did that then 'i' and 'pfn' would be the same value and I 
wouldn't have to do the subtraction. If that works for you I could 
probably do that and it may actually be more efficient.

Otherwise I could probably pass pfn as a reference, and compute it in 
the case where count is non-zero.

> The rest is good:
> 
> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> 
> Thank you,
> Pasha
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-10  0:00     ` Pavel Tatashin
  2018-11-10  0:46       ` Alexander Duyck
@ 2018-11-12 16:25       ` Daniel Jordan
  1 sibling, 0 replies; 34+ messages in thread
From: Daniel Jordan @ 2018-11-12 16:25 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Alexander Duyck, daniel.m.jordan, akpm, linux-mm, sparclinux,
	linux-kernel, linux-nvdimm, davem, pavel.tatashin, mhocko, mingo,
	kirill.shutemov, dan.j.williams, dave.jiang, rppt, willy, vbabka,
	khalid.aziz, ldufour, mgorman, yi.z.zhang

On Fri, Nov 09, 2018 at 07:00:06PM -0500, Pavel Tatashin wrote:
> On 18-11-09 15:14:35, Alexander Duyck wrote:
> > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote:
> > > On 18-11-05 13:19:25, Alexander Duyck wrote:
> > > > This patchset is essentially a refactor of the page initialization logic
> > > > that is meant to provide for better code reuse while providing a
> > > > significant improvement in deferred page initialization performance.
> > > > 
> > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> > > > memory per node I have seen the following. In the case of regular memory
> > > > initialization the deferred init time was decreased from 3.75s to 1.06s on
> > > > average. For the persistent memory the initialization time dropped from
> > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> > > > deferred memory initialization performance, and a 26% improvement in the
> > > > persistent memory initialization performance.
> > > 
> > > Hi Alex,
> > > 
> > > Please try to run your persistent memory init experiment with Daniel's
> > > patches:
> > > 
> > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/
> > 
> > I've taken a quick look at it. It seems like a bit of a brute force way
> > to try and speed things up. I would be worried about it potentially
> 
> There is a limit to max number of threads that ktasks start. The memory
> throughput is *much* higher than what one CPU can maxout in a node, so
> there is no reason to leave the other CPUs sit idle during boot when
> they can help to initialize.
> 
> > introducing performance issues if the number of CPUs thrown at it end
> > up exceeding the maximum throughput of the memory.
> > 
> > The data provided with patch 11 seems to point to issues such as that.
> > In the case of the E7-8895 example cited it is increasing the numbers
> > of CPUs used from memory initialization from 8 to 72, a 9x increase in
> > the number of CPUs but it is yeilding only a 3.88x speedup.
> 
> Yes, but in both cases we are far from maxing out the memory throughput.
> The 3.88x is indeed low, and I do not know what slows it down.
> 
> Daniel,
> 
> Could you please check why multi-threading efficiency is so low here?

I'll hop on the machine after Plumbers.

> I bet, there is some atomic operation introduces a contention within a
> node. It should be possible to resolve.

We'll see, in any case I'm curious to see what the multithreading does with
Alex's patches, especially since we won't do two passes through the memory
anymore.

Not seeing anything in Alex's work right off that would preclude
multithreading.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-10  1:16         ` Pavel Tatashin
@ 2018-11-12 19:10           ` Alexander Duyck
  2018-11-12 20:37             ` Pavel Tatashin
  0 siblings, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2018-11-12 19:10 UTC (permalink / raw)
  To: pasha.tatashin, Andrew Morton
  Cc: alexander.h.duyck, Daniel Jordan, linux-mm, sparclinux, LKML,
	linux-nvdimm, David Miller, pavel.tatashin, Michal Hocko,
	Ingo Molnar, Kirill A. Shutemov, dan.j.williams, dave.jiang,
	rppt, Matthew Wilcox, Vlastimil Babka, khalid.aziz, ldufour,
	Mel Gorman, yi.z.zhang

On Fri, Nov 9, 2018 at 5:17 PM Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On 18-11-09 16:46:02, Alexander Duyck wrote:
> > On Fri, 2018-11-09 at 19:00 -0500, Pavel Tatashin wrote:
> > > On 18-11-09 15:14:35, Alexander Duyck wrote:
> > > > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote:
> > > > > On 18-11-05 13:19:25, Alexander Duyck wrote:
> > > > > > This patchset is essentially a refactor of the page initialization logic
> > > > > > that is meant to provide for better code reuse while providing a
> > > > > > significant improvement in deferred page initialization performance.
> > > > > >
> > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> > > > > > memory per node I have seen the following. In the case of regular memory
> > > > > > initialization the deferred init time was decreased from 3.75s to 1.06s on
> > > > > > average. For the persistent memory the initialization time dropped from
> > > > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> > > > > > deferred memory initialization performance, and a 26% improvement in the
> > > > > > persistent memory initialization performance.
> > > > >
> > > > > Hi Alex,
> > > > >
> > > > > Please try to run your persistent memory init experiment with Daniel's
> > > > > patches:
> > > > >
> > > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/
> > > >
> > > > I've taken a quick look at it. It seems like a bit of a brute force way
> > > > to try and speed things up. I would be worried about it potentially
> > >
> > > There is a limit to max number of threads that ktasks start. The memory
> > > throughput is *much* higher than what one CPU can maxout in a node, so
> > > there is no reason to leave the other CPUs sit idle during boot when
> > > they can help to initialize.
> >
> > Right, but right now that limit can still be pretty big when it is
> > something like 25% of all the CPUs on a 288 CPU system.
>
> It is still OK. About 9 threads per node.
>
> That machine has 1T of memory, which means 8 nodes need to initialize 2G
> of memory each. With 46G/s throughout it should take 0.043s Which is 10
> times higher than what Daniel sees with 0.325s, so there is still room
> to saturate the memory throughput.
>
> Now, if the multi-threadding efficiency is good, it should take
> 1.261s / 9 threads =  0.14s

The two patch sets combined would hopefully do better then that,
although I don't know what the clock speed is on the CPUs in use.

The system I have been testing with has 1.5TB spread over 4 nodes. So
we have effectively 3 times as much to initialize and with the
"numa=fake=8U" I was seeing it only take .275s to initialize the
nodes.

> >
> > One issue is the way the code was ends up essentially blowing out the
> > cache over and over again. Doing things in two passes made it really
> > expensive as you took one cache miss to initialize it, and another to
> > free it. I think getting rid of that is one of the biggest gains with
> > my patch set.
>
> I am not arguing that your patches make sense, all I am saying that
> ktasks improve time order of magnitude better on machines with large
> amount of memory.

The point I was trying to make is that it doesn't. You say it is an
order of magnitude better but it is essentially 3.5x vs 3.8x and to
achieve the 3.8x you are using a ton of system resources. My approach
is meant to do more with less, while this approach will throw a
quarter of the system at  page initialization.

An added advantage to my approach is that it speeds up things
regardless of the number of cores used, whereas the scaling approach
requires that there be more cores available to use. So for example on
some of the new AMD Zen stuff I am not sure the benefit would be all
that great since if I am not mistaken each tile is only 8 processors
so at most you are only doubling the processing power applied to the
initialization. In such a case it is likely that my approach would
fare much better then this approach since I don't require additional
cores to achieve the same results.

Anyway there are tradeoffs we have to take into account.

I will go over the changes you suggested after Plumbers. I just need
to figure out if I am doing incremental changes, or if Andrew wants me
to just resubmit the whole set. I can probably deal with these changes
either way since most of them are pretty small.

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-12 19:10           ` Alexander Duyck
@ 2018-11-12 20:37             ` Pavel Tatashin
  0 siblings, 0 replies; 34+ messages in thread
From: Pavel Tatashin @ 2018-11-12 20:37 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, alexander.h.duyck, Daniel Jordan, linux-mm,
	sparclinux, LKML, linux-nvdimm, David Miller, pavel.tatashin,
	Michal Hocko, Ingo Molnar, Kirill A. Shutemov, dan.j.williams,
	dave.jiang, rppt, Matthew Wilcox, Vlastimil Babka, khalid.aziz,
	ldufour, Mel Gorman, yi.z.zhang

On 18-11-12 11:10:35, Alexander Duyck wrote:
> 
> The point I was trying to make is that it doesn't. You say it is an
> order of magnitude better but it is essentially 3.5x vs 3.8x and to
> achieve the 3.8x you are using a ton of system resources. My approach
> is meant to do more with less, while this approach will throw a
> quarter of the system at  page initialization.

3.8x is a bug, that is going to be fixed before ktasks are accepted. The
final results will be close to time/nthreads.
Using more resources to initialize pages is fine, because other CPUs are
idling during this time in boot.

Lets wait for what Daniel finds out after Linux Plumber. And we can
continue this discussion in ktask thread.

> 
> An added advantage to my approach is that it speeds up things
> regardless of the number of cores used, whereas the scaling approach

Yes, I agree, I like your approach. It is clean, simplifies, and
improves the performance. I have tested it on both ARM and x86, and
verified the performance improvements. So:

Tested-by: Pavel Tatashin <pasha.tatashin@soleen.com>


> requires that there be more cores available to use. So for example on
> some of the new AMD Zen stuff I am not sure the benefit would be all
> that great since if I am not mistaken each tile is only 8 processors
> so at most you are only doubling the processing power applied to the
> initialization. In such a case it is likely that my approach would
> fare much better then this approach since I don't require additional
> cores to achieve the same results.
> 
> Anyway there are tradeoffs we have to take into account.
> 
> I will go over the changes you suggested after Plumbers. I just need
> to figure out if I am doing incremental changes, or if Andrew wants me
> to just resubmit the whole set. I can probably deal with these changes
> either way since most of them are pretty small.

Send the full series again, Andrew is very good at taking only
incremental  changes once a new version is posted of something
that is already in mm-tree.

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-05 21:19 [mm PATCH v5 0/7] Deferred page init improvements Alexander Duyck
                   ` (7 preceding siblings ...)
  2018-11-09 21:15 ` [mm PATCH v5 0/7] Deferred page init improvements Pavel Tatashin
@ 2018-11-14 15:07 ` Michal Hocko
  2018-11-14 19:12   ` Pavel Tatashin
  2018-11-15  0:50   ` Alexander Duyck
  8 siblings, 2 replies; 34+ messages in thread
From: Michal Hocko @ 2018-11-14 15:07 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang

On Mon 05-11-18 13:19:25, Alexander Duyck wrote:
> This patchset is essentially a refactor of the page initialization logic
> that is meant to provide for better code reuse while providing a
> significant improvement in deferred page initialization performance.
> 
> In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> memory per node I have seen the following. In the case of regular memory
> initialization the deferred init time was decreased from 3.75s to 1.06s on
> average. For the persistent memory the initialization time dropped from
> 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> deferred memory initialization performance, and a 26% improvement in the
> persistent memory initialization performance.
> 
> I have called out the improvement observed with each patch.

I have only glanced through the code (there is a lot of the code to look
at here). And I do not like the code duplication and the way how you
make the hotplug special. There shouldn't be any real reason for that
IMHO (e.g. why do we init pfn-at-a-time in early init while we do
pageblock-at-a-time for hotplug). I might be wrong here and the code
reuse might be really hard to achieve though.

I am also not impressed by new iterators because this api is quite
complex already. But this is mostly a detail.

Thing I do not like is that you keep microptimizing PageReserved part
while there shouldn't be anything fundamental about it. We should just
remove it rather than make the code more complex. I fell more and more
guilty to add there actually.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-14 15:07 ` Michal Hocko
@ 2018-11-14 19:12   ` Pavel Tatashin
  2018-11-14 21:35     ` Michal Hocko
  2018-11-15  0:50   ` Alexander Duyck
  1 sibling, 1 reply; 34+ messages in thread
From: Pavel Tatashin @ 2018-11-14 19:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexander Duyck, akpm, linux-mm, sparclinux, linux-kernel,
	linux-nvdimm, davem, pavel.tatashin, mingo, kirill.shutemov,
	dan.j.williams, dave.jiang, rppt, willy, vbabka, khalid.aziz,
	ldufour, mgorman, yi.z.zhang

On 18-11-14 16:07:42, Michal Hocko wrote:
> On Mon 05-11-18 13:19:25, Alexander Duyck wrote:
> > This patchset is essentially a refactor of the page initialization logic
> > that is meant to provide for better code reuse while providing a
> > significant improvement in deferred page initialization performance.
> > 
> > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> > memory per node I have seen the following. In the case of regular memory
> > initialization the deferred init time was decreased from 3.75s to 1.06s on
> > average. For the persistent memory the initialization time dropped from
> > 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> > deferred memory initialization performance, and a 26% improvement in the
> > persistent memory initialization performance.
> > 
> > I have called out the improvement observed with each patch.
> 
> I have only glanced through the code (there is a lot of the code to look
> at here). And I do not like the code duplication and the way how you
> make the hotplug special. There shouldn't be any real reason for that
> IMHO (e.g. why do we init pfn-at-a-time in early init while we do
> pageblock-at-a-time for hotplug). I might be wrong here and the code
> reuse might be really hard to achieve though.

I do not like having __init_single_page() to be done differently for
hotplug. I think, if that is fixed, there is almost no more code
duplication, the rest looks alright to me.

> 
> I am also not impressed by new iterators because this api is quite
> complex already. But this is mostly a detail.

I have reviewed all the patches in this series, and at first was also
worried about the new iterators. But, after diving deeper, they actually
make sense, and new memblock iterators look alright to me. The only
iterator, that I would like to see improved is
for_each_deferred_pfn_valid_range(), because it is very hard to
understand it.

This series is an impressive performance improvement, and I have
confirmed it on both arm, and x86. It is also should be compatible with
ktasks.

> 
> Thing I do not like is that you keep microptimizing PageReserved part
> while there shouldn't be anything fundamental about it. We should just

Agree.

> remove it rather than make the code more complex. I fell more and more
> guilty to add there actually.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-14 19:12   ` Pavel Tatashin
@ 2018-11-14 21:35     ` Michal Hocko
  0 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2018-11-14 21:35 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Alexander Duyck, akpm, linux-mm, sparclinux, linux-kernel,
	linux-nvdimm, davem, pavel.tatashin, mingo, kirill.shutemov,
	dan.j.williams, dave.jiang, rppt, willy, vbabka, khalid.aziz,
	ldufour, mgorman, yi.z.zhang

On Wed 14-11-18 19:12:53, Pavel Tatashin wrote:
> On 18-11-14 16:07:42, Michal Hocko wrote:
> > On Mon 05-11-18 13:19:25, Alexander Duyck wrote:
> > > This patchset is essentially a refactor of the page initialization logic
> > > that is meant to provide for better code reuse while providing a
> > > significant improvement in deferred page initialization performance.
> > > 
> > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> > > memory per node I have seen the following. In the case of regular memory
> > > initialization the deferred init time was decreased from 3.75s to 1.06s on
> > > average. For the persistent memory the initialization time dropped from
> > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> > > deferred memory initialization performance, and a 26% improvement in the
> > > persistent memory initialization performance.
> > > 
> > > I have called out the improvement observed with each patch.
> > 
> > I have only glanced through the code (there is a lot of the code to look
> > at here). And I do not like the code duplication and the way how you
> > make the hotplug special. There shouldn't be any real reason for that
> > IMHO (e.g. why do we init pfn-at-a-time in early init while we do
> > pageblock-at-a-time for hotplug). I might be wrong here and the code
> > reuse might be really hard to achieve though.
> 
> I do not like having __init_single_page() to be done differently for
> hotplug. I think, if that is fixed, there is almost no more code
> duplication, the rest looks alright to me.

There is still that zone device special casing which I have brought up
previously but I reckon this is out of scope of this series.

> > I am also not impressed by new iterators because this api is quite
> > complex already. But this is mostly a detail.
> 
> I have reviewed all the patches in this series, and at first was also
> worried about the new iterators. But, after diving deeper, they actually
> make sense, and new memblock iterators look alright to me. The only
> iterator, that I would like to see improved is
> for_each_deferred_pfn_valid_range(), because it is very hard to
> understand it.
> 
> This series is an impressive performance improvement, and I have
> confirmed it on both arm, and x86. It is also should be compatible with
> ktasks.

I see the performance argument. I also do see the maintenance burden
argument. Recent bootmem removal has shown how big and hard to follow
the whole API is. And this should be considered. I am not saying the
current series is a nogo though. Maybe changelogs should be more clear
to spell out advantages to do a per-pageblock initialization that brings
a lot of new code in. As I've said I have basically glanced through
the comulative diff and changelogs to get an impression so it is not
entirely clear to me.  Especially when the per block init does per pfn
within the block init anyway.

That being said, I belive changelogs should be much more specific and
I hope to see this to be a much more modest in the added code. If that
is not possible, then fair enough, but I would like to see it tried.
And please let's not build on top of cargocult and rather get rid of
pointless stuff (e.g. PageReserved) rather than go around it.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-14 15:07 ` Michal Hocko
  2018-11-14 19:12   ` Pavel Tatashin
@ 2018-11-15  0:50   ` Alexander Duyck
  2018-11-15  1:55     ` Mike Rapoport
  2018-11-15  8:10     ` Michal Hocko
  1 sibling, 2 replies; 34+ messages in thread
From: Alexander Duyck @ 2018-11-15  0:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang



On 11/14/2018 7:07 AM, Michal Hocko wrote:
> On Mon 05-11-18 13:19:25, Alexander Duyck wrote:
>> This patchset is essentially a refactor of the page initialization logic
>> that is meant to provide for better code reuse while providing a
>> significant improvement in deferred page initialization performance.
>>
>> In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
>> memory per node I have seen the following. In the case of regular memory
>> initialization the deferred init time was decreased from 3.75s to 1.06s on
>> average. For the persistent memory the initialization time dropped from
>> 24.17s to 19.12s on average. This amounts to a 253% improvement for the
>> deferred memory initialization performance, and a 26% improvement in the
>> persistent memory initialization performance.
>>
>> I have called out the improvement observed with each patch.
> 
> I have only glanced through the code (there is a lot of the code to look
> at here). And I do not like the code duplication and the way how you
> make the hotplug special. There shouldn't be any real reason for that
> IMHO (e.g. why do we init pfn-at-a-time in early init while we do
> pageblock-at-a-time for hotplug). I might be wrong here and the code
> reuse might be really hard to achieve though.

Actually it isn't so much that hotplug is special. The issue is more 
that the non-hotplug case is special in that you have to perform a 
number of extra checks for things that just aren't necessary for the 
hotplug case.

If anything I would probably need a new iterator that would be able to 
take into account all the checks for the non-hotplug case and then 
provide ranges of PFNs to initialize.

> I am also not impressed by new iterators because this api is quite
> complex already. But this is mostly a detail.

Yeah, the iterators were mostly an attempt at hiding some of the 
complexity. Being able to break a loop down to just an iterator provding 
the start of the range and the number of elements to initialize is 
pretty easy to visualize, or at least I thought so.

> Thing I do not like is that you keep microptimizing PageReserved part
> while there shouldn't be anything fundamental about it. We should just
> remove it rather than make the code more complex. I fell more and more
> guilty to add there actually.

I plan to remove it, but don't think I can get to it in this patch set.

I was planning to submit one more iteration of this patch set early next 
week, and then start focusing more on the removal of the PageReserved 
bit for hotplug. I figure it is probably going to be a full patch set 
onto itself and as you pointed out at the start of this email there is 
already enough code to review without adding that.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-15  0:50   ` Alexander Duyck
@ 2018-11-15  1:55     ` Mike Rapoport
  2018-11-15 19:09       ` Mike Rapoport
  2018-11-15  8:10     ` Michal Hocko
  1 sibling, 1 reply; 34+ messages in thread
From: Mike Rapoport @ 2018-11-15  1:55 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michal Hocko, akpm, linux-mm, sparclinux, linux-kernel,
	linux-nvdimm, davem, pavel.tatashin, mingo, kirill.shutemov,
	dan.j.williams, dave.jiang, rppt, willy, vbabka, khalid.aziz,
	ldufour, mgorman, yi.z.zhang

On Wed, Nov 14, 2018 at 04:50:23PM -0800, Alexander Duyck wrote:
> 
> 
> On 11/14/2018 7:07 AM, Michal Hocko wrote:
> >On Mon 05-11-18 13:19:25, Alexander Duyck wrote:
> >>This patchset is essentially a refactor of the page initialization logic
> >>that is meant to provide for better code reuse while providing a
> >>significant improvement in deferred page initialization performance.
> >>
> >>In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> >>memory per node I have seen the following. In the case of regular memory
> >>initialization the deferred init time was decreased from 3.75s to 1.06s on
> >>average. For the persistent memory the initialization time dropped from
> >>24.17s to 19.12s on average. This amounts to a 253% improvement for the
> >>deferred memory initialization performance, and a 26% improvement in the
> >>persistent memory initialization performance.
> >>
> >>I have called out the improvement observed with each patch.
> >
> >I have only glanced through the code (there is a lot of the code to look
> >at here). And I do not like the code duplication and the way how you
> >make the hotplug special. There shouldn't be any real reason for that
> >IMHO (e.g. why do we init pfn-at-a-time in early init while we do
> >pageblock-at-a-time for hotplug). I might be wrong here and the code
> >reuse might be really hard to achieve though.
> 
> Actually it isn't so much that hotplug is special. The issue is more that
> the non-hotplug case is special in that you have to perform a number of
> extra checks for things that just aren't necessary for the hotplug case.
> 
> If anything I would probably need a new iterator that would be able to take
> into account all the checks for the non-hotplug case and then provide ranges
> of PFNs to initialize.
> 
> >I am also not impressed by new iterators because this api is quite
> >complex already. But this is mostly a detail.
> 
> Yeah, the iterators were mostly an attempt at hiding some of the complexity.
> Being able to break a loop down to just an iterator provding the start of
> the range and the number of elements to initialize is pretty easy to
> visualize, or at least I thought so.

Just recently we had a discussion about overlapping for_each_mem_range()
and for_each_mem_pfn_range(), but unfortunately it appears that no mailing
list was cc'ed by the original patch author :(
In short, there was a spelling fix in one of them and Michal pointed out
that their functionality overlaps.

I have no objection for for_each_free_mem_pfn_range_in_zone() and
__next_mem_pfn_range_in_zone(), but probably we should consider unifying
the older iterators before we introduce a new one? 
 
> >Thing I do not like is that you keep microptimizing PageReserved part
> >while there shouldn't be anything fundamental about it. We should just
> >remove it rather than make the code more complex. I fell more and more
> >guilty to add there actually.
> 
> I plan to remove it, but don't think I can get to it in this patch set.
> 
> I was planning to submit one more iteration of this patch set early next
> week, and then start focusing more on the removal of the PageReserved bit
> for hotplug. I figure it is probably going to be a full patch set onto
> itself and as you pointed out at the start of this email there is already
> enough code to review without adding that.
> 
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-15  0:50   ` Alexander Duyck
  2018-11-15  1:55     ` Mike Rapoport
@ 2018-11-15  8:10     ` Michal Hocko
  2018-11-15 16:02       ` Alexander Duyck
  1 sibling, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2018-11-15  8:10 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang

On Wed 14-11-18 16:50:23, Alexander Duyck wrote:
> 
> 
> On 11/14/2018 7:07 AM, Michal Hocko wrote:
> > On Mon 05-11-18 13:19:25, Alexander Duyck wrote:
> > > This patchset is essentially a refactor of the page initialization logic
> > > that is meant to provide for better code reuse while providing a
> > > significant improvement in deferred page initialization performance.
> > > 
> > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> > > memory per node I have seen the following. In the case of regular memory
> > > initialization the deferred init time was decreased from 3.75s to 1.06s on
> > > average. For the persistent memory the initialization time dropped from
> > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> > > deferred memory initialization performance, and a 26% improvement in the
> > > persistent memory initialization performance.
> > > 
> > > I have called out the improvement observed with each patch.
> > 
> > I have only glanced through the code (there is a lot of the code to look
> > at here). And I do not like the code duplication and the way how you
> > make the hotplug special. There shouldn't be any real reason for that
> > IMHO (e.g. why do we init pfn-at-a-time in early init while we do
> > pageblock-at-a-time for hotplug). I might be wrong here and the code
> > reuse might be really hard to achieve though.
> 
> Actually it isn't so much that hotplug is special. The issue is more that
> the non-hotplug case is special in that you have to perform a number of
> extra checks for things that just aren't necessary for the hotplug case.

Can we hide those behind a helper (potentially with a jump label if
necessary) and still share a large part of the code? Also this code is
quite old and maybe we are overzealous with the early checks. Do we
really need them. Why should be early boot memory any different from the
hotplug. The only exception I can see should really be deferred
initialization check.

> If anything I would probably need a new iterator that would be able to take
> into account all the checks for the non-hotplug case and then provide ranges
> of PFNs to initialize.
> 
> > I am also not impressed by new iterators because this api is quite
> > complex already. But this is mostly a detail.
> 
> Yeah, the iterators were mostly an attempt at hiding some of the complexity.
> Being able to break a loop down to just an iterator provding the start of
> the range and the number of elements to initialize is pretty easy to
> visualize, or at least I thought so.

I am not against hiding the complexity. I am mostly concerned that we
have too many of those iterators. Maybe we can reuse existing ones in
some way. If that is not really possible or it would make even more mess
then fair enough and go with new ones.

> > Thing I do not like is that you keep microptimizing PageReserved part
> > while there shouldn't be anything fundamental about it. We should just
> > remove it rather than make the code more complex. I fell more and more
> > guilty to add there actually.
> 
> I plan to remove it, but don't think I can get to it in this patch set.

What I am trying to argue is that we should simply drop the
__SetPageReserved as an independent patch prior to this whole series.
As I've mentioned earlier, I have added this just to be sure and part of
that was that __add_section has set the reserved bit. This is no longer
the case since d0dc12e86b31 ("mm/memory_hotplug: optimize memory
hotplug").

Nobody should really depend on that because struct pages are in
undefined state after __add_pages and they should get fully initialized
after move_pfn_range_to_zone.

If you really insist on setting the reserved bit then it really has to
happen much sooner than it is right now. So I do not really see any
point in doing so. Sure there are some pfn walkers that really need to
do pfn_to_online_page because pfn_valid is not sufficient but that is
largely independent on any optimization work in this area.

I am sorry if I haven't been clear about that before. Does it make more
sense to you now?

P.S.
There is always that tempting thing to follow the existing code and
tweak it for a new purpose. This approach, however, adds more and more
complex code on top of something that might be wrong or stale already.
I have seen that in MM code countless times and I have contributed to
that myself. I am sorry to push back on this so hard but this code is
a mess and any changes to make it more optimal should really make sure
the foundations are solid before. Been there done that, not a huge fun
but that is the price for having basically unmaintained piece of code
that random usecases stop by and do what they need without ever
following up later.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-15  8:10     ` Michal Hocko
@ 2018-11-15 16:02       ` Alexander Duyck
  2018-11-15 16:40         ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2018-11-15 16:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang

On 11/15/2018 12:10 AM, Michal Hocko wrote:
> On Wed 14-11-18 16:50:23, Alexander Duyck wrote:
>>
>>
>> On 11/14/2018 7:07 AM, Michal Hocko wrote:
>>> On Mon 05-11-18 13:19:25, Alexander Duyck wrote:
>>>> This patchset is essentially a refactor of the page initialization logic
>>>> that is meant to provide for better code reuse while providing a
>>>> significant improvement in deferred page initialization performance.
>>>>
>>>> In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
>>>> memory per node I have seen the following. In the case of regular memory
>>>> initialization the deferred init time was decreased from 3.75s to 1.06s on
>>>> average. For the persistent memory the initialization time dropped from
>>>> 24.17s to 19.12s on average. This amounts to a 253% improvement for the
>>>> deferred memory initialization performance, and a 26% improvement in the
>>>> persistent memory initialization performance.
>>>>
>>>> I have called out the improvement observed with each patch.
>>>
>>> I have only glanced through the code (there is a lot of the code to look
>>> at here). And I do not like the code duplication and the way how you
>>> make the hotplug special. There shouldn't be any real reason for that
>>> IMHO (e.g. why do we init pfn-at-a-time in early init while we do
>>> pageblock-at-a-time for hotplug). I might be wrong here and the code
>>> reuse might be really hard to achieve though.
>>
>> Actually it isn't so much that hotplug is special. The issue is more that
>> the non-hotplug case is special in that you have to perform a number of
>> extra checks for things that just aren't necessary for the hotplug case.
> 
> Can we hide those behind a helper (potentially with a jump label if
> necessary) and still share a large part of the code? Also this code is
> quite old and maybe we are overzealous with the early checks. Do we
> really need them. Why should be early boot memory any different from the
> hotplug. The only exception I can see should really be deferred
> initialization check.
> 
>> If anything I would probably need a new iterator that would be able to take
>> into account all the checks for the non-hotplug case and then provide ranges
>> of PFNs to initialize.
>>
>>> I am also not impressed by new iterators because this api is quite
>>> complex already. But this is mostly a detail.
>>
>> Yeah, the iterators were mostly an attempt at hiding some of the complexity.
>> Being able to break a loop down to just an iterator provding the start of
>> the range and the number of elements to initialize is pretty easy to
>> visualize, or at least I thought so.
> 
> I am not against hiding the complexity. I am mostly concerned that we
> have too many of those iterators. Maybe we can reuse existing ones in
> some way. If that is not really possible or it would make even more mess
> then fair enough and go with new ones.
> 
>>> Thing I do not like is that you keep microptimizing PageReserved part
>>> while there shouldn't be anything fundamental about it. We should just
>>> remove it rather than make the code more complex. I fell more and more
>>> guilty to add there actually.
>>
>> I plan to remove it, but don't think I can get to it in this patch set.
> 
> What I am trying to argue is that we should simply drop the
> __SetPageReserved as an independent patch prior to this whole series.
> As I've mentioned earlier, I have added this just to be sure and part of
> that was that __add_section has set the reserved bit. This is no longer
> the case since d0dc12e86b31 ("mm/memory_hotplug: optimize memory
> hotplug").
> 
> Nobody should really depend on that because struct pages are in
> undefined state after __add_pages and they should get fully initialized
> after move_pfn_range_to_zone.
> 
> If you really insist on setting the reserved bit then it really has to
> happen much sooner than it is right now. So I do not really see any
> point in doing so. Sure there are some pfn walkers that really need to
> do pfn_to_online_page because pfn_valid is not sufficient but that is
> largely independent on any optimization work in this area.
> 
> I am sorry if I haven't been clear about that before. Does it make more
> sense to you now?

I get what you are saying I just don't agree with the ordering. I have 
had these patches in the works for a couple months now. You are 
essentially telling me to defer these in favor of taking care of the 
reserved bit first.

The only spot where I think we disagree is that I would prefer to get 
these in first, and then focus on the reserved bit. I'm not saying we 
shouldn't do the work, but I would rather take care of the immediate 
issue, and then "clean house" as it were. I've done that sort of thing 
in the past where I start deferring patches and then by the end of 
things I have a 60 patch set I am trying to push because one fix gets 
ahead of another and another.

My big concern is that dropping the reserved bit is going to be much 
more error prone than the work I have done in this patch set since it is 
much more likely that somebody somewhere has made a false reliance on it 
being set. If you have any tests you usually run for memory hotplug it 
would be useful if you could point me in that direction. Then I can move 
forward with the upcoming patch set with a bit more confidence.

> P.S.
> There is always that tempting thing to follow the existing code and
> tweak it for a new purpose. This approach, however, adds more and more
> complex code on top of something that might be wrong or stale already.
> I have seen that in MM code countless times and I have contributed to
> that myself. I am sorry to push back on this so hard but this code is
> a mess and any changes to make it more optimal should really make sure
> the foundations are solid before. Been there done that, not a huge fun
> but that is the price for having basically unmaintained piece of code
> that random usecases stop by and do what they need without ever
> following up later.

That is what I am doing. That is why I found and dropped the check for 
the NUMA not in the deferred init. I am pushing back on code where it 
makes sense to do so and determine what actually is adding value. My 
concern is more that I am worried that trying to make things "perfect" 
might be getting in the way of "good". Kernel development has always 
been an incremental process. My preference would be to lock down what we 
have before I go diving in and cleaning up other bits of the memory init.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-15 16:02       ` Alexander Duyck
@ 2018-11-15 16:40         ` Michal Hocko
  0 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2018-11-15 16:40 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang

On Thu 15-11-18 08:02:33, Alexander Duyck wrote:
> On 11/15/2018 12:10 AM, Michal Hocko wrote:
> > On Wed 14-11-18 16:50:23, Alexander Duyck wrote:
[...]
> > > I plan to remove it, but don't think I can get to it in this patch set.
> > 
> > What I am trying to argue is that we should simply drop the
> > __SetPageReserved as an independent patch prior to this whole series.
> > As I've mentioned earlier, I have added this just to be sure and part of
> > that was that __add_section has set the reserved bit. This is no longer
> > the case since d0dc12e86b31 ("mm/memory_hotplug: optimize memory
> > hotplug").
> > 
> > Nobody should really depend on that because struct pages are in
> > undefined state after __add_pages and they should get fully initialized
> > after move_pfn_range_to_zone.
> > 
> > If you really insist on setting the reserved bit then it really has to
> > happen much sooner than it is right now. So I do not really see any
> > point in doing so. Sure there are some pfn walkers that really need to
> > do pfn_to_online_page because pfn_valid is not sufficient but that is
> > largely independent on any optimization work in this area.
> > 
> > I am sorry if I haven't been clear about that before. Does it make more
> > sense to you now?
> 
> I get what you are saying I just don't agree with the ordering. I have had
> these patches in the works for a couple months now. You are essentially
> telling me to defer these in favor of taking care of the reserved bit first.

General development is to fix correctness and/or cleanup stale code
first and optimize on top. I really do not see why this should be
any different. Especially when page reserved bit is a part of your
optimization work.

There is some review feedback to address from this version so you can
add a patch to remove the reserved bit to the series - feel free to
reuse my explanation as the justification. I really do not think you
have to change other call sites because they would be broken in the same
way as when the bit is set (late).

> The only spot where I think we disagree is that I would prefer to get these
> in first, and then focus on the reserved bit. I'm not saying we shouldn't do
> the work, but I would rather take care of the immediate issue, and then
> "clean house" as it were. I've done that sort of thing in the past where I
> start deferring patches and then by the end of things I have a 60 patch set
> I am trying to push because one fix gets ahead of another and another.

This is nothing exceptional and it happened to me as well.

> My big concern is that dropping the reserved bit is going to be much more
> error prone than the work I have done in this patch set since it is much
> more likely that somebody somewhere has made a false reliance on it being
> set. If you have any tests you usually run for memory hotplug it would be
> useful if you could point me in that direction. Then I can move forward with
> the upcoming patch set with a bit more confidence.

There is nothing except for exercising hotplug and do random stuff to
it. We simply do not know about all the pfn walkers so we have to count
on a reasonable model to justify changes. I hope I have clarified why
the reserved bit doesn't act the purpose it has been added for.

I understand that you want to push your optimization work ASAP but I _do_
care about longterm maintainability much more than having performance
gain _right_ now.

That being said, I am not nacking your series so if others really think
that I am asking too much I can live with that. I was overruled the last
time when the first optimization pile went in. I still hope we can get
the result cleaned up as promised, though.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 0/7] Deferred page init improvements
  2018-11-15  1:55     ` Mike Rapoport
@ 2018-11-15 19:09       ` Mike Rapoport
  0 siblings, 0 replies; 34+ messages in thread
From: Mike Rapoport @ 2018-11-15 19:09 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michal Hocko, akpm, linux-mm, sparclinux, linux-kernel,
	linux-nvdimm, davem, pavel.tatashin, mingo, kirill.shutemov,
	dan.j.williams, dave.jiang, rppt, willy, vbabka, khalid.aziz,
	ldufour, mgorman, yi.z.zhang

On Wed, Nov 14, 2018 at 05:55:12PM -0800, Mike Rapoport wrote:
> On Wed, Nov 14, 2018 at 04:50:23PM -0800, Alexander Duyck wrote:
> > 
> > 
> > On 11/14/2018 7:07 AM, Michal Hocko wrote:
> > >On Mon 05-11-18 13:19:25, Alexander Duyck wrote:
> > >>This patchset is essentially a refactor of the page initialization logic
> > >>that is meant to provide for better code reuse while providing a
> > >>significant improvement in deferred page initialization performance.
> > >>
> > >>In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> > >>memory per node I have seen the following. In the case of regular memory
> > >>initialization the deferred init time was decreased from 3.75s to 1.06s on
> > >>average. For the persistent memory the initialization time dropped from
> > >>24.17s to 19.12s on average. This amounts to a 253% improvement for the
> > >>deferred memory initialization performance, and a 26% improvement in the
> > >>persistent memory initialization performance.
> > >>
> > >>I have called out the improvement observed with each patch.
> > >
> > >I have only glanced through the code (there is a lot of the code to look
> > >at here). And I do not like the code duplication and the way how you
> > >make the hotplug special. There shouldn't be any real reason for that
> > >IMHO (e.g. why do we init pfn-at-a-time in early init while we do
> > >pageblock-at-a-time for hotplug). I might be wrong here and the code
> > >reuse might be really hard to achieve though.
> > 
> > Actually it isn't so much that hotplug is special. The issue is more that
> > the non-hotplug case is special in that you have to perform a number of
> > extra checks for things that just aren't necessary for the hotplug case.
> > 
> > If anything I would probably need a new iterator that would be able to take
> > into account all the checks for the non-hotplug case and then provide ranges
> > of PFNs to initialize.
> > 
> > >I am also not impressed by new iterators because this api is quite
> > >complex already. But this is mostly a detail.
> > 
> > Yeah, the iterators were mostly an attempt at hiding some of the complexity.
> > Being able to break a loop down to just an iterator provding the start of
> > the range and the number of elements to initialize is pretty easy to
> > visualize, or at least I thought so.
> 
> Just recently we had a discussion about overlapping for_each_mem_range()
> and for_each_mem_pfn_range(), but unfortunately it appears that no mailing
> list was cc'ed by the original patch author :(
> In short, there was a spelling fix in one of them and Michal pointed out
> that their functionality overlaps.
> 
> I have no objection for for_each_free_mem_pfn_range_in_zone() and
> __next_mem_pfn_range_in_zone(), but probably we should consider unifying
> the older iterators before we introduce a new one? 

Another thing I realized only now is that
for_each_free_mem_pfn_range_in_zone() can be used only relatively late in
the memblock life-span because zones are initialized far later than
setup_arch() in many cases.

At the very least this should be documented.
 
> > >Thing I do not like is that you keep microptimizing PageReserved part
> > >while there shouldn't be anything fundamental about it. We should just
> > >remove it rather than make the code more complex. I fell more and more
> > >guilty to add there actually.
> > 
> > I plan to remove it, but don't think I can get to it in this patch set.
> > 
> > I was planning to submit one more iteration of this patch set early next
> > week, and then start focusing more on the removal of the PageReserved bit
> > for hotplug. I figure it is probably going to be a full patch set onto
> > itself and as you pointed out at the start of this email there is already
> > enough code to review without adding that.
> > 
> > 
> 
> -- 
> Sincerely yours,
> Mike.

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [mm PATCH v5 4/7] mm: Initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections
  2018-11-10  1:02   ` Pavel Tatashin
@ 2018-11-19 18:53     ` Alexander Duyck
  0 siblings, 0 replies; 34+ messages in thread
From: Alexander Duyck @ 2018-11-19 18:53 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: akpm, linux-mm, sparclinux, linux-kernel, linux-nvdimm, davem,
	pavel.tatashin, mhocko, mingo, kirill.shutemov, dan.j.williams,
	dave.jiang, rppt, willy, vbabka, khalid.aziz, ldufour, mgorman,
	yi.z.zhang

On Fri, 2018-11-09 at 20:02 -0500, Pavel Tatashin wrote:
> On 18-11-05 13:19:45, Alexander Duyck wrote:
> >  	}
> > -	first_init_pfn = max(zone->zone_start_pfn, first_init_pfn);
> > +
> > +	/* If the zone is empty somebody else may have cleared out the zone */
> > +	if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
> > +						 first_init_pfn)) {
> > +		pgdat_resize_unlock(pgdat, &flags);
> > +		pgdat_init_report_one_done();
> > +		return 0;
> 
> It would make more sense to add goto to the end of this function and report
> that in Node X had 0 pages initialized. Otherwise, it will be confusing
> why some nodes are not initialized during boot. It is simply because
> they do not have deferred pages left to be initialized.

This part makes sense and will be implemented for v6.

> 
> > +	}
> >  
> >  	/*
> > -	 * Initialize and free pages. We do it in two loops: first we initialize
> > -	 * struct page, than free to buddy allocator, because while we are
> > -	 * freeing pages we can access pages that are ahead (computing buddy
> > -	 * page in __free_one_page()).
> > +	 * Initialize and free pages in MAX_ORDER sized increments so
> > +	 * that we can avoid introducing any issues with the buddy
> > +	 * allocator.
> >  	 */
> > -	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
> > -		spfn = max_t(unsigned long, first_init_pfn, spfn);
> > -		nr_pages += deferred_init_pages(zone, spfn, epfn);
> > -	}
> > -	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
> > -		spfn = max_t(unsigned long, first_init_pfn, spfn);
> > -		deferred_free_pages(spfn, epfn);
> > -	}
> > +	while (spfn < epfn)
> > +		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > +
> >  	pgdat_resize_unlock(pgdat, &flags);
> >  
> >  	/* Sanity check that the next zone really is unpopulated */
> > @@ -1602,9 +1689,9 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
> >  {
> 
> How about order = max(order, max_order)?

I'm not sure what you are getting at here? The "order" variable passed
is only really used to get the nr_pages_needed, and that value is
rounded up to PAGES_PER_SECTION which should always be equal to or
greater than MAX_ORDER pages.

If order is greater than MAX_ORDER we would want to process some number
of MAX_ORDER sized blocks until we get to the size specified by order.
If we process it all as one large block it will perform worse as we
risk pushing page structs out of the cache.

> >  	unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
> 
> 
> > -	first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
> > -
> > -	if (first_init_pfn >= pgdat_end_pfn(pgdat)) {
> > +	/* If the zone is empty somebody else may have cleared out the zone */
> > +	if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
> > +						 first_deferred_pfn)) {
> >  		pgdat_resize_unlock(pgdat, &flags);
> > -		return false;
> > +		return true;
> 
> I am OK to change to true here, please also set
> pgdat->first_deferred_pfn to ULONG_MAX to indicate that there are no
> more deferred pages in this node.

Done for v6 of the patch set.

> 
> Overall, I like this patch, makes things a lot easier, assuming the
> above is addressed:
> 
> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> 
> Thank you,
> Pasha


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2018-11-19 18:54 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-05 21:19 [mm PATCH v5 0/7] Deferred page init improvements Alexander Duyck
2018-11-05 21:19 ` [mm PATCH v5 1/7] mm: Use mm_zero_struct_page from SPARC on all 64b architectures Alexander Duyck
2018-11-05 21:19 ` [mm PATCH v5 2/7] mm: Drop meminit_pfn_in_nid as it is redundant Alexander Duyck
2018-11-05 21:19 ` [mm PATCH v5 3/7] mm: Implement new zone specific memblock iterator Alexander Duyck
2018-11-09 23:26   ` Pavel Tatashin
2018-11-09 23:58     ` Alexander Duyck
2018-11-10  0:11       ` Pavel Tatashin
2018-11-05 21:19 ` [mm PATCH v5 4/7] mm: Initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections Alexander Duyck
2018-11-10  1:02   ` Pavel Tatashin
2018-11-19 18:53     ` Alexander Duyck
2018-11-05 21:19 ` [mm PATCH v5 5/7] mm: Move hot-plug specific memory init into separate functions and optimize Alexander Duyck
2018-11-10  2:07   ` Pavel Tatashin
2018-11-05 21:19 ` [mm PATCH v5 6/7] mm: Add reserved flag setting to set_page_links Alexander Duyck
2018-11-10  2:11   ` Pavel Tatashin
2018-11-05 21:20 ` [mm PATCH v5 7/7] mm: Use common iterator for deferred_init_pages and deferred_free_pages Alexander Duyck
2018-11-10  4:13   ` Pavel Tatashin
2018-11-12 15:12     ` Alexander Duyck
2018-11-09 21:15 ` [mm PATCH v5 0/7] Deferred page init improvements Pavel Tatashin
2018-11-09 23:14   ` Alexander Duyck
2018-11-10  0:00     ` Pavel Tatashin
2018-11-10  0:46       ` Alexander Duyck
2018-11-10  1:16         ` Pavel Tatashin
2018-11-12 19:10           ` Alexander Duyck
2018-11-12 20:37             ` Pavel Tatashin
2018-11-12 16:25       ` Daniel Jordan
2018-11-14 15:07 ` Michal Hocko
2018-11-14 19:12   ` Pavel Tatashin
2018-11-14 21:35     ` Michal Hocko
2018-11-15  0:50   ` Alexander Duyck
2018-11-15  1:55     ` Mike Rapoport
2018-11-15 19:09       ` Mike Rapoport
2018-11-15  8:10     ` Michal Hocko
2018-11-15 16:02       ` Alexander Duyck
2018-11-15 16:40         ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).