* [mm PATCH v4 0/6] Deferred page init improvements
@ 2018-10-17 23:54 ` Alexander Duyck
  0 siblings, 0 replies; 28+ messages in thread
From: Alexander Duyck @ 2018-10-17 23:54 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: pavel.tatashin, mhocko, dave.jiang, alexander.h.duyck,
	linux-kernel, willy, davem, yi.z.zhang, khalid.aziz, rppt,
	vbabka, sparclinux, dan.j.williams, ldufour, mgorman, mingo,
	kirill.shutemov

This patchset is essentially a refactor of the page initialization logic
that is meant to improve code reuse while providing a significant
improvement in deferred page initialization performance.

In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
memory per node I have seen the following. In the case of regular memory
initialization the deferred init time was decreased from 3.75s to 1.06s on
average. For the persistent memory the initialization time dropped from
24.17s to 19.12s on average. This amounts to a 253% improvement in
deferred memory initialization performance and a 26% improvement in
persistent memory initialization performance.

I have called out the improvement observed with each patch.

v1->v2:
    Fixed build issue on PowerPC due to page struct size being 56
    Added new patch that removed __SetPageReserved call for hotplug
v2->v3:
    Rebased on latest linux-next
    Removed patch that had removed __SetPageReserved call from init
    Added patch that folded __SetPageReserved into set_page_links
    Tweaked __init_pageblock to use start_pfn to get section_nr instead of pfn
v3->v4:
    Updated patch description and comments for mm_zero_struct_page patch
        Replaced "default" with "case 64"
        Removed #ifndef mm_zero_struct_page
    Fixed typo in comment that omitted "_from" in kerneldoc for iterator
    Added Reviewed-by for patches reviewed by Pavel
    Added Acked-by from Michal Hocko
    Added deferred init times for patches that affect init performance
    Swapped patches 5 & 6, pulled some code/comments from 4 into 5
        Did this as reserved bit wasn't used in deferred memory init

---

Alexander Duyck (6):
      mm: Use mm_zero_struct_page from SPARC on all 64b architectures
      mm: Drop meminit_pfn_in_nid as it is redundant
      mm: Use memblock/zone specific iterator for handling deferred page init
      mm: Move hot-plug specific memory init into separate functions and optimize
      mm: Add reserved flag setting to set_page_links
      mm: Use common iterator for deferred_init_pages and deferred_free_pages


 arch/sparc/include/asm/pgtable_64.h |   30 --
 include/linux/memblock.h            |   58 ++++
 include/linux/mm.h                  |   50 +++
 mm/memblock.c                       |   63 ++++
 mm/page_alloc.c                     |  569 +++++++++++++++++++++--------------
 5 files changed, 513 insertions(+), 257 deletions(-)

--

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [mm PATCH v4 1/6] mm: Use mm_zero_struct_page from SPARC on all 64b architectures
  2018-10-17 23:54 ` Alexander Duyck
@ 2018-10-17 23:54   ` Alexander Duyck
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Duyck @ 2018-10-17 23:54 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: pavel.tatashin, mhocko, dave.jiang, alexander.h.duyck,
	linux-kernel, willy, davem, yi.z.zhang, khalid.aziz, rppt,
	vbabka, sparclinux, dan.j.williams, ldufour, mgorman, mingo,
	kirill.shutemov

This change makes it so that we use the same approach that was already in
use on SPARC on all architectures that support a 64b long.

This is mostly motivated by the fact that 7 to 10 store/move instructions
are likely always going to be faster than having to call into a function
that is not specialized for handling page init.

An added advantage to doing it this way is that the compiler can get away
with combining writes in the __init_single_page call. As a result, the
memset call will be reduced to only about 4 write operations, or at least
that is what I am seeing with GCC 6.2, as the flags, LRU pointers, and
count/mapcount seem to be cancelling out at least 4 of the 8 assignments on
my system.
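
For reference, a rough sketch of the call site (paraphrased from
__init_single_page() in mm/page_alloc.c as of this series; simplified and
not part of this patch) shows why those stores can be combined away, since
several fields are unconditionally rewritten right after the zeroing:

	mm_zero_struct_page(page);		/* 7-8 inline zero stores */
	set_page_links(page, zone, nid, pfn);	/* rewrites page->flags */
	init_page_count(page);			/* rewrites _refcount */
	page_mapcount_reset(page);		/* rewrites _mapcount */
	INIT_LIST_HEAD(&page->lru);		/* rewrites both lru pointers */

With everything inlined the compiler can see that the zero stores to those
words are dead and drop them.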

One change I had to make to the function was to reduce the minimum struct
page size to 56 bytes in order to support some powerpc64 configurations.

This change should introduce no change on SPARC since it already had this
code. In the case of x86_64 I saw a reduction from 3.75s to 2.80s when
initializing 384GB of RAM per node. Pavel Tatashin tested on a system with
Broadcom's Stingray CPU and 48GB of RAM and found that __init_single_page()
takes 19.30ns / 64-byte struct page before this patch and with this patch
it takes 17.33ns / 64-byte struct page. Mike Rapoport ran a similar test on
an OpenPower (S812LC 8348-21C) with a Power8 processor and 128GB of RAM. His
results per 64-byte struct page were 4.68ns before, and 4.59ns after this
patch.
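
As a usage note on the updated comment in include/linux/mm.h (a hypothetical
sketch, not part of this patch): an architecture that wants its own zeroing
routine would define the macro in its <asm/pgtable.h> and wrap the generic
block in an #ifndef again, along these lines (my_arch_zero_struct_page is a
made-up name):

	/* hypothetical override in <asm/pgtable.h> */
	#define mm_zero_struct_page(pp)	my_arch_zero_struct_page(pp)

	/* include/linux/mm.h would then guard the generic version with */
	#ifndef mm_zero_struct_page
	#if BITS_PER_LONG == 64
	/* ... inline zeroing added by this patch ... */
	#endif /* BITS_PER_LONG == 64 */
	#endif /* mm_zero_struct_page */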

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 arch/sparc/include/asm/pgtable_64.h |   30 --------------------------
 include/linux/mm.h                  |   41 ++++++++++++++++++++++++++++++++---
 2 files changed, 38 insertions(+), 33 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 1393a8ac596b..22500c3be7a9 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -231,36 +231,6 @@
 extern struct page *mem_map_zero;
 #define ZERO_PAGE(vaddr)	(mem_map_zero)
 
-/* This macro must be updated when the size of struct page grows above 80
- * or reduces below 64.
- * The idea that compiler optimizes out switch() statement, and only
- * leaves clrx instructions
- */
-#define	mm_zero_struct_page(pp) do {					\
-	unsigned long *_pp = (void *)(pp);				\
-									\
-	 /* Check that struct page is either 64, 72, or 80 bytes */	\
-	BUILD_BUG_ON(sizeof(struct page) & 7);				\
-	BUILD_BUG_ON(sizeof(struct page) < 64);				\
-	BUILD_BUG_ON(sizeof(struct page) > 80);				\
-									\
-	switch (sizeof(struct page)) {					\
-	case 80:							\
-		_pp[9] = 0;	/* fallthrough */			\
-	case 72:							\
-		_pp[8] = 0;	/* fallthrough */			\
-	default:							\
-		_pp[7] = 0;						\
-		_pp[6] = 0;						\
-		_pp[5] = 0;						\
-		_pp[4] = 0;						\
-		_pp[3] = 0;						\
-		_pp[2] = 0;						\
-		_pp[1] = 0;						\
-		_pp[0] = 0;						\
-	}								\
-} while (0)
-
 /* PFNs are real physical page numbers.  However, mem_map only begins to record
  * per-page information starting at pfn_base.  This is to handle systems where
  * the first physical page in the machine is at some huge physical address,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fcf9cc9d535f..6e2c9631af05 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -98,10 +98,45 @@ static inline void set_max_mapnr(unsigned long limit) { }
 
 /*
  * On some architectures it is expensive to call memset() for small sizes.
- * Those architectures should provide their own implementation of "struct page"
- * zeroing by defining this macro in <asm/pgtable.h>.
+ * If an architecture decides to implement their own version of
+ * mm_zero_struct_page they should wrap the defines below in a #ifndef and
+ * define their own version of this macro in <asm/pgtable.h>
  */
-#ifndef mm_zero_struct_page
+#if BITS_PER_LONG == 64
+/* This function must be updated when the size of struct page grows above 80
+ * or reduces below 56. The idea that compiler optimizes out switch()
+ * statement, and only leaves move/store instructions. Also the compiler can
+ * combine write statments if they are both assignments and can be reordered,
+ * this can result in several of the writes here being dropped.
+ */
+#define	mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
+static inline void __mm_zero_struct_page(struct page *page)
+{
+	unsigned long *_pp = (void *)page;
+
+	 /* Check that struct page is either 56, 64, 72, or 80 bytes */
+	BUILD_BUG_ON(sizeof(struct page) & 7);
+	BUILD_BUG_ON(sizeof(struct page) < 56);
+	BUILD_BUG_ON(sizeof(struct page) > 80);
+
+	switch (sizeof(struct page)) {
+	case 80:
+		_pp[9] = 0;	/* fallthrough */
+	case 72:
+		_pp[8] = 0;	/* fallthrough */
+	case 64:
+		_pp[7] = 0;	/* fallthrough */
+	case 56:
+		_pp[6] = 0;
+		_pp[5] = 0;
+		_pp[4] = 0;
+		_pp[3] = 0;
+		_pp[2] = 0;
+		_pp[1] = 0;
+		_pp[0] = 0;
+	}
+}
+#else
 #define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
 #endif
 


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [mm PATCH v4 2/6] mm: Drop meminit_pfn_in_nid as it is redundant
  2018-10-17 23:54 ` Alexander Duyck
@ 2018-10-17 23:54   ` Alexander Duyck
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Duyck @ 2018-10-17 23:54 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: pavel.tatashin, mhocko, dave.jiang, alexander.h.duyck,
	linux-kernel, willy, davem, yi.z.zhang, khalid.aziz, rppt,
	vbabka, sparclinux, dan.j.williams, ldufour, mgorman, mingo,
	kirill.shutemov

As best as I can tell, the meminit_pfn_in_nid call is completely redundant.
The deferred memory initialization is already making use of
for_each_free_mem_range, which in turn calls into __next_mem_range, which
will only return a memory range if it matches the node ID provided, assuming
that node ID is not NUMA_NO_NODE.

I am operating on the assumption that there are no zones or pg_data_t
structures that have a NUMA node of NUMA_NO_NODE associated with them. If
that is the case, then __next_mem_range will never return a memory range
that doesn't match the zone's node ID, and as such the check is redundant.
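
For reference, the node filtering this relies on is the check inside
__next_mem_range() (paraphrased from mm/memblock.c of this era, shown only
for context):

	for (; idx_a < type_a->cnt; idx_a++) {
		struct memblock_region *m = &type_a->regions[idx_a];
		int m_nid = memblock_get_region_node(m);

		/* skip regions associated with a different node */
		if (nid != NUMA_NO_NODE && nid != m_nid)
			continue;
		...
	}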

So one piece I would like to verify on this is whether this works for ia64.
Technically it was using a different approach to get the node ID, but it
seems to have the node ID also encoded into the memblock. So I am
assuming this is okay, but would like to get confirmation on that.

On my x86_64 test system with 384GB of memory per node I saw a reduction in
initialization time from 2.80s to 1.85s as a result of this patch.

Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 mm/page_alloc.c |   50 ++++++++++++++------------------------------------
 1 file changed, 14 insertions(+), 36 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4bd858d1c3ba..a766a15fad81 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1301,36 +1301,22 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
 #endif
 
 #ifdef CONFIG_NODES_SPAN_OTHER_NODES
-static inline bool __meminit __maybe_unused
-meminit_pfn_in_nid(unsigned long pfn, int node,
-		   struct mminit_pfnnid_cache *state)
+/* Only safe to use early in boot when initialisation is single-threaded */
+static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
 {
 	int nid;
 
-	nid = __early_pfn_to_nid(pfn, state);
+	nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
 	if (nid >= 0 && nid != node)
 		return false;
 	return true;
 }
 
-/* Only safe to use early in boot when initialisation is single-threaded */
-static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
-{
-	return meminit_pfn_in_nid(pfn, node, &early_pfnnid_cache);
-}
-
 #else
-
 static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
 {
 	return true;
 }
-static inline bool __meminit  __maybe_unused
-meminit_pfn_in_nid(unsigned long pfn, int node,
-		   struct mminit_pfnnid_cache *state)
-{
-	return true;
-}
 #endif
 
 
@@ -1459,21 +1445,13 @@ static inline void __init pgdat_init_report_one_done(void)
  *
  * Then, we check if a current large page is valid by only checking the validity
  * of the head pfn.
- *
- * Finally, meminit_pfn_in_nid is checked on systems where pfns can interleave
- * within a node: a pfn is between start and end of a node, but does not belong
- * to this memory node.
  */
-static inline bool __init
-deferred_pfn_valid(int nid, unsigned long pfn,
-		   struct mminit_pfnnid_cache *nid_init_state)
+static inline bool __init deferred_pfn_valid(unsigned long pfn)
 {
 	if (!pfn_valid_within(pfn))
 		return false;
 	if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
 		return false;
-	if (!meminit_pfn_in_nid(pfn, nid, nid_init_state))
-		return false;
 	return true;
 }
 
@@ -1481,15 +1459,14 @@ static inline void __init pgdat_init_report_one_done(void)
  * Free pages to buddy allocator. Try to free aligned pages in
  * pageblock_nr_pages sizes.
  */
-static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
+static void __init deferred_free_pages(unsigned long pfn,
 				       unsigned long end_pfn)
 {
-	struct mminit_pfnnid_cache nid_init_state = { };
 	unsigned long nr_pgmask = pageblock_nr_pages - 1;
 	unsigned long nr_free = 0;
 
 	for (; pfn < end_pfn; pfn++) {
-		if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
+		if (!deferred_pfn_valid(pfn)) {
 			deferred_free_range(pfn - nr_free, nr_free);
 			nr_free = 0;
 		} else if (!(pfn & nr_pgmask)) {
@@ -1509,17 +1486,18 @@ static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
  * by performing it only once every pageblock_nr_pages.
  * Return number of pages initialized.
  */
-static unsigned long  __init deferred_init_pages(int nid, int zid,
+static unsigned long  __init deferred_init_pages(struct zone *zone,
 						 unsigned long pfn,
 						 unsigned long end_pfn)
 {
-	struct mminit_pfnnid_cache nid_init_state = { };
 	unsigned long nr_pgmask = pageblock_nr_pages - 1;
+	int nid = zone_to_nid(zone);
 	unsigned long nr_pages = 0;
+	int zid = zone_idx(zone);
 	struct page *page = NULL;
 
 	for (; pfn < end_pfn; pfn++) {
-		if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
+		if (!deferred_pfn_valid(pfn)) {
 			page = NULL;
 			continue;
 		} else if (!page || !(pfn & nr_pgmask)) {
@@ -1582,12 +1560,12 @@ static int __init deferred_init_memmap(void *data)
 	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
 		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
 		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
-		nr_pages += deferred_init_pages(nid, zid, spfn, epfn);
+		nr_pages += deferred_init_pages(zone, spfn, epfn);
 	}
 	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
 		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
 		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
-		deferred_free_pages(nid, zid, spfn, epfn);
+		deferred_free_pages(spfn, epfn);
 	}
 	pgdat_resize_unlock(pgdat, &flags);
 
@@ -1676,7 +1654,7 @@ static int __init deferred_init_memmap(void *data)
 		while (spfn < epfn && nr_pages < nr_pages_needed) {
 			t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
 			first_deferred_pfn = min(t, epfn);
-			nr_pages += deferred_init_pages(nid, zid, spfn,
+			nr_pages += deferred_init_pages(zone, spfn,
 							first_deferred_pfn);
 			spfn = first_deferred_pfn;
 		}
@@ -1688,7 +1666,7 @@ static int __init deferred_init_memmap(void *data)
 	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
 		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
 		epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
-		deferred_free_pages(nid, zid, spfn, epfn);
+		deferred_free_pages(spfn, epfn);
 
 		if (first_deferred_pfn == epfn)
 			break;


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [mm PATCH v4 3/6] mm: Use memblock/zone specific iterator for handling deferred page init
  2018-10-17 23:54 ` Alexander Duyck
@ 2018-10-17 23:54   ` Alexander Duyck
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Duyck @ 2018-10-17 23:54 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: pavel.tatashin, mhocko, dave.jiang, alexander.h.duyck,
	linux-kernel, willy, davem, yi.z.zhang, khalid.aziz, rppt,
	vbabka, sparclinux, dan.j.williams, ldufour, mgorman, mingo,
	kirill.shutemov

This patch introduces a new iterator for_each_free_mem_pfn_range_in_zone.

This iterator will take care of making sure a given memory range provided
is in fact contained within a zone. It takes care of all the bounds checking
we were doing in deferred_grow_zone and deferred_init_memmap. In addition,
it should help to speed up the search a bit by iterating until the end of a
range is greater than the start of the zone pfn range, and will exit
completely if the start is beyond the end of the zone.
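
In practice the usage ends up looking roughly like this (a sketch of the
intended pattern; the actual definitions are in the diff below):

	unsigned long spfn, epfn;
	u64 i;

	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
		/* [spfn, epfn) is free memory already clipped to the zone */
	}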

This patch adds yet another iterator,
for_each_free_mem_pfn_range_in_zone_from, and then uses it to support
initializing and freeing pages in groups no larger than MAX_ORDER_NR_PAGES.
By doing this we can greatly improve the cache locality of the pages while
we do several loops over them in the init and freeing process.

We are able to tighten the loops as a result, since we only really need the
checks for first_init_pfn in our first iteration and after that we can
assume that all future values will be greater than this. So I have added a
function called deferred_init_mem_pfn_range_in_zone that primes the
iterators, and if it fails we can just exit.
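
Put together, the call pattern in deferred_init_memmap() becomes roughly the
following (a condensed sketch of what the diff below does):

	unsigned long spfn = 0, epfn = 0, nr_pages = 0;
	u64 i;

	/* Prime i/spfn/epfn; bail out if no range intersects the zone. */
	if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
						 first_init_pfn))
		return 0;

	/* Init and free in MAX_ORDER sized chunks to keep pages cache-hot. */
	while (spfn < epfn)
		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);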

On my x86_64 test system with 384GB of memory per node I saw a reduction in
initialization time from 1.85s to 1.38s as a result of this patch.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/memblock.h |   58 +++++++++++++++
 mm/memblock.c            |   63 ++++++++++++++++
 mm/page_alloc.c          |  176 ++++++++++++++++++++++++++++++++--------------
 3 files changed, 242 insertions(+), 55 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index aee299a6aa76..2ddd1bafdd03 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -178,6 +178,25 @@ void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
 			      p_start, p_end, p_nid))
 
 /**
+ * for_each_mem_range_from - iterate through memblock areas from type_a and not
+ * included in type_b. Or just type_a if type_b is NULL.
+ * @i: u64 used as loop variable
+ * @type_a: ptr to memblock_type to iterate
+ * @type_b: ptr to memblock_type which excludes from the iteration
+ * @nid: node selector, %NUMA_NO_NODE for all nodes
+ * @flags: pick from blocks based on memory attributes
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ * @p_nid: ptr to int for nid of the range, can be %NULL
+ */
+#define for_each_mem_range_from(i, type_a, type_b, nid, flags,		\
+			   p_start, p_end, p_nid)			\
+	for (i = 0, __next_mem_range(&i, nid, flags, type_a, type_b,	\
+				     p_start, p_end, p_nid);		\
+	     i != (u64)ULLONG_MAX;					\
+	     __next_mem_range(&i, nid, flags, type_a, type_b,		\
+			      p_start, p_end, p_nid))
+/**
  * for_each_mem_range_rev - reverse iterate through memblock areas from
  * type_a and not included in type_b. Or just type_a if type_b is NULL.
  * @i: u64 used as loop variable
@@ -248,6 +267,45 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
 	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
+				  unsigned long *out_spfn,
+				  unsigned long *out_epfn);
+/**
+ * for_each_free_mem_range_in_zone - iterate through zone specific free
+ * memblock areas
+ * @i: u64 used as loop variable
+ * @zone: zone in which all of the memory blocks reside
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over free (memory && !reserved) areas of memblock in a specific
+ * zone. Available as soon as memblock is initialized.
+ */
+#define for_each_free_mem_pfn_range_in_zone(i, zone, p_start, p_end)	\
+	for (i = 0,							\
+	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end);	\
+	     i != (u64)ULLONG_MAX;					\
+	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
+
+/**
+ * for_each_free_mem_range_in_zone_from - iterate through zone specific
+ * free memblock areas from a given point
+ * @i: u64 used as loop variable
+ * @zone: zone in which all of the memory blocks reside
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over free (memory && !reserved) areas of memblock in a specific
+ * zone, continuing from current position. Available as soon as memblock is
+ * initialized.
+ */
+#define for_each_free_mem_pfn_range_in_zone_from(i, zone, p_start, p_end) \
+	for (; i != (u64)ULLONG_MAX;					  \
+	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
+
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+
 /**
  * for_each_free_mem_range - iterate through free memblock areas
  * @i: u64 used as loop variable
diff --git a/mm/memblock.c b/mm/memblock.c
index f2ef3915a356..ab3545e356b7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1239,6 +1239,69 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
 	return 0;
 }
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+/**
+ * __next_mem_pfn_range_in_zone - iterator for for_each_*_range_in_zone()
+ *
+ * @idx: pointer to u64 loop variable
+ * @zone: zone in which all of the memory blocks reside
+ * @out_start: ptr to ulong for start pfn of the range, can be %NULL
+ * @out_end: ptr to ulong for end pfn of the range, can be %NULL
+ *
+ * This function is meant to be a zone/pfn specific wrapper for the
+ * for_each_mem_range type iterators. Specifically they are used in the
+ * deferred memory init routines and as such we were duplicating much of
+ * this logic throughout the code. So instead of having it in multiple
+ * locations it seemed like it would make more sense to centralize this to
+ * one new iterator that does everything they need.
+ */
+void __init_memblock
+__next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
+			     unsigned long *out_spfn, unsigned long *out_epfn)
+{
+	int zone_nid = zone_to_nid(zone);
+	phys_addr_t spa, epa;
+	int nid;
+
+	__next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
+			 &memblock.memory, &memblock.reserved,
+			 &spa, &epa, &nid);
+
+	while (*idx != ULLONG_MAX) {
+		unsigned long epfn = PFN_DOWN(epa);
+		unsigned long spfn = PFN_UP(spa);
+
+		/*
+		 * Verify the end is at least past the start of the zone and
+		 * that we have at least one PFN to initialize.
+		 */
+		if (zone->zone_start_pfn < epfn && spfn < epfn) {
+			/* if we went too far just stop searching */
+			if (zone_end_pfn(zone) <= spfn)
+				break;
+
+			if (out_spfn)
+				*out_spfn = max(zone->zone_start_pfn, spfn);
+			if (out_epfn)
+				*out_epfn = min(zone_end_pfn(zone), epfn);
+
+			return;
+		}
+
+		__next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
+				 &memblock.memory, &memblock.reserved,
+				 &spa, &epa, &nid);
+	}
+
+	/* signal end of iteration */
+	*idx = ULLONG_MAX;
+	if (out_spfn)
+		*out_spfn = ULONG_MAX;
+	if (out_epfn)
+		*out_epfn = 0;
+}
+
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
 #ifdef CONFIG_HAVE_MEMBLOCK_PFN_VALID
 unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a766a15fad81..20e9eb35d75d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1512,19 +1512,103 @@ static unsigned long  __init deferred_init_pages(struct zone *zone,
 	return (nr_pages);
 }
 
+/*
+ * This function is meant to pre-load the iterator for the zone init.
+ * Specifically it walks through the ranges until we are caught up to the
+ * first_init_pfn value and exits there. If we never encounter the value we
+ * return false indicating there are no valid ranges left.
+ */
+static bool __init
+deferred_init_mem_pfn_range_in_zone(u64 *i, struct zone *zone,
+				    unsigned long *spfn, unsigned long *epfn,
+				    unsigned long first_init_pfn)
+{
+	u64 j;
+
+	/*
+	 * Start out by walking through the ranges in this zone that have
+	 * already been initialized. We don't need to do anything with them
+	 * so we just need to flush them out of the system.
+	 */
+	for_each_free_mem_pfn_range_in_zone(j, zone, spfn, epfn) {
+		if (*epfn <= first_init_pfn)
+			continue;
+		if (*spfn < first_init_pfn)
+			*spfn = first_init_pfn;
+		*i = j;
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Initialize and free pages. We do it in two loops: first we initialize
+ * struct page, than free to buddy allocator, because while we are
+ * freeing pages we can access pages that are ahead (computing buddy
+ * page in __free_one_page()).
+ *
+ * In order to try and keep some memory in the cache we have the loop
+ * broken along max page order boundaries. This way we will not cause
+ * any issues with the buddy page computation.
+ */
+static unsigned long __init
+deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
+		       unsigned long *end_pfn)
+{
+	unsigned long mo_pfn = ALIGN(*start_pfn + 1, MAX_ORDER_NR_PAGES);
+	unsigned long spfn = *start_pfn, epfn = *end_pfn;
+	unsigned long nr_pages = 0;
+	u64 j = *i;
+
+	/* First we loop through and initialize the page values */
+	for_each_free_mem_pfn_range_in_zone_from(j, zone, &spfn, &epfn) {
+		unsigned long t;
+
+		if (mo_pfn <= spfn)
+			break;
+
+		t = min(mo_pfn, epfn);
+		nr_pages += deferred_init_pages(zone, spfn, t);
+
+		if (mo_pfn <= epfn)
+			break;
+	}
+
+	/* Reset values and now loop through freeing pages as needed */
+	j = *i;
+
+	for_each_free_mem_pfn_range_in_zone_from(j, zone, start_pfn, end_pfn) {
+		unsigned long t;
+
+		if (mo_pfn <= *start_pfn)
+			break;
+
+		t = min(mo_pfn, *end_pfn);
+		deferred_free_pages(*start_pfn, t);
+		*start_pfn = t;
+
+		if (mo_pfn < *end_pfn)
+			break;
+	}
+
+	/* Store our current values to be reused on the next iteration */
+	*i = j;
+
+	return nr_pages;
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
 	pg_data_t *pgdat = data;
-	int nid = pgdat->node_id;
+	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+	unsigned long spfn = 0, epfn = 0, nr_pages = 0;
+	unsigned long first_init_pfn, flags;
 	unsigned long start = jiffies;
-	unsigned long nr_pages = 0;
-	unsigned long spfn, epfn, first_init_pfn, flags;
-	phys_addr_t spa, epa;
-	int zid;
 	struct zone *zone;
-	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
 	u64 i;
+	int zid;
 
 	/* Bind memory initialisation thread to a local node if possible */
 	if (!cpumask_empty(cpumask))
@@ -1549,31 +1633,30 @@ static int __init deferred_init_memmap(void *data)
 		if (first_init_pfn < zone_end_pfn(zone))
 			break;
 	}
-	first_init_pfn = max(zone->zone_start_pfn, first_init_pfn);
+
+	/* If the zone is empty somebody else may have cleared out the zone */
+	if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
+						 first_init_pfn)) {
+		pgdat_resize_unlock(pgdat, &flags);
+		pgdat_init_report_one_done();
+		return 0;
+	}
 
 	/*
-	 * Initialize and free pages. We do it in two loops: first we initialize
-	 * struct page, than free to buddy allocator, because while we are
-	 * freeing pages we can access pages that are ahead (computing buddy
-	 * page in __free_one_page()).
+	 * Initialize and free pages in MAX_ORDER sized increments so
+	 * that we can avoid introducing any issues with the buddy
+	 * allocator.
 	 */
-	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
-		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
-		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
-		nr_pages += deferred_init_pages(zone, spfn, epfn);
-	}
-	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
-		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
-		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
-		deferred_free_pages(spfn, epfn);
-	}
+	while (spfn < epfn)
+		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+
 	pgdat_resize_unlock(pgdat, &flags);
 
 	/* Sanity check that the next zone really is unpopulated */
 	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
 
-	pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_pages,
-					jiffies_to_msecs(jiffies - start));
+	pr_info("node %d initialised, %lu pages in %ums\n",
+		pgdat->node_id,	nr_pages, jiffies_to_msecs(jiffies - start));
 
 	pgdat_init_report_one_done();
 	return 0;
@@ -1604,14 +1687,11 @@ static int __init deferred_init_memmap(void *data)
 static noinline bool __init
 deferred_grow_zone(struct zone *zone, unsigned int order)
 {
-	int zid = zone_idx(zone);
-	int nid = zone_to_nid(zone);
-	pg_data_t *pgdat = NODE_DATA(nid);
 	unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
-	unsigned long nr_pages = 0;
-	unsigned long first_init_pfn, spfn, epfn, t, flags;
+	pg_data_t *pgdat = zone->zone_pgdat;
 	unsigned long first_deferred_pfn = pgdat->first_deferred_pfn;
-	phys_addr_t spa, epa;
+	unsigned long spfn, epfn, flags;
+	unsigned long nr_pages = 0;
 	u64 i;
 
 	/* Only the last zone may have deferred pages */
@@ -1640,37 +1720,23 @@ static int __init deferred_init_memmap(void *data)
 		return true;
 	}
 
-	first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
-
-	if (first_init_pfn >= pgdat_end_pfn(pgdat)) {
+	/* If the zone is empty somebody else may have cleared out the zone */
+	if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
+						 first_deferred_pfn)) {
 		pgdat_resize_unlock(pgdat, &flags);
-		return false;
+		return true;
 	}
 
-	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
-		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
-		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
-
-		while (spfn < epfn && nr_pages < nr_pages_needed) {
-			t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
-			first_deferred_pfn = min(t, epfn);
-			nr_pages += deferred_init_pages(zone, spfn,
-							first_deferred_pfn);
-			spfn = first_deferred_pfn;
-		}
-
-		if (nr_pages >= nr_pages_needed)
-			break;
+	/*
+	 * Initialize and free pages in MAX_ORDER sized increments so
+	 * that we can avoid introducing any issues with the buddy
+	 * allocator.
+	 */
+	while (spfn < epfn && nr_pages < nr_pages_needed) {
+		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+		first_deferred_pfn = spfn;
 	}
 
-	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
-		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
-		epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
-		deferred_free_pages(spfn, epfn);
-
-		if (first_deferred_pfn == epfn)
-			break;
-	}
 	pgdat->first_deferred_pfn = first_deferred_pfn;
 	pgdat_resize_unlock(pgdat, &flags);
 


^ permalink raw reply related	[flat|nested] 28+ messages in thread

-		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
-		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
-
-		while (spfn < epfn && nr_pages < nr_pages_needed) {
-			t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
-			first_deferred_pfn = min(t, epfn);
-			nr_pages += deferred_init_pages(zone, spfn,
-							first_deferred_pfn);
-			spfn = first_deferred_pfn;
-		}
-
-		if (nr_pages >= nr_pages_needed)
-			break;
+	/*
+	 * Initialize and free pages in MAX_ORDER sized increments so
+	 * that we can avoid introducing any issues with the buddy
+	 * allocator.
+	 */
+	while (spfn < epfn && nr_pages < nr_pages_needed) {
+		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+		first_deferred_pfn = spfn;
 	}
 
-	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
-		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
-		epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
-		deferred_free_pages(spfn, epfn);
-
-		if (first_deferred_pfn == epfn)
-			break;
-	}
 	pgdat->first_deferred_pfn = first_deferred_pfn;
 	pgdat_resize_unlock(pgdat, &flags);
 

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [mm PATCH v4 4/6] mm: Move hot-plug specific memory init into separate functions and optimize
  2018-10-17 23:54 ` Alexander Duyck
@ 2018-10-17 23:54   ` Alexander Duyck
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Duyck @ 2018-10-17 23:54 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: pavel.tatashin, mhocko, dave.jiang, alexander.h.duyck,
	linux-kernel, willy, davem, yi.z.zhang, khalid.aziz, rppt,
	vbabka, sparclinux, dan.j.williams, ldufour, mgorman, mingo,
	kirill.shutemov

This patch combines the bits in memmap_init_zone and memmap_init_zone_device
that are related to hotplug into a single function called
__memmap_init_hotplug.

I also took the opportunity to integrate __init_single_page's functionality
into this function. In doing so I can get rid of some of the redundancy, such
as the separate handling of the LRU pointers versus the pgmap pointer.
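
For illustration, here is a small self-contained userspace sketch of the
chunking pattern the new helper uses. The pageblock size and the demo_*
names below are assumptions for the demo, not the kernel definitions:

#include <stdio.h>

#define DEMO_PAGEBLOCK_NR_PAGES	512UL	/* assumed: 2MB pageblocks, 4K pages */
#define DEMO_ALIGN_DOWN(x, a)	((x) & ~((a) - 1))
#define DEMO_MAX(a, b)		((a) > (b) ? (a) : (b))

/*
 * Walk [start_pfn, start_pfn + size) from the end, one pageblock-aligned
 * chunk at a time, so no chunk ever crosses a pageblock boundary. This is
 * the same shape as the loop in __memmap_init_hotplug in the diff below.
 */
static void demo_walk_blocks(unsigned long start_pfn, unsigned long size)
{
	unsigned long pfn = start_pfn + size;

	while (pfn != start_pfn) {
		unsigned long stride = pfn;

		pfn = DEMO_MAX(DEMO_ALIGN_DOWN(pfn - 1, DEMO_PAGEBLOCK_NR_PAGES),
			       start_pfn);
		stride -= pfn;

		printf("init chunk: pfn %lu, %lu pages\n", pfn, stride);
	}
}

int main(void)
{
	demo_walk_blocks(1000, 1300);	/* deliberately unaligned start/end */
	return 0;
}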

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 mm/page_alloc.c |  216 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 145 insertions(+), 71 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20e9eb35d75d..a0b81e0bef03 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1192,6 +1192,92 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 #endif
 }
 
+static void __meminit __init_pageblock(unsigned long start_pfn,
+				       unsigned long nr_pages,
+				       unsigned long zone, int nid,
+				       struct dev_pagemap *pgmap)
+{
+	unsigned long nr_pgmask = pageblock_nr_pages - 1;
+	struct page *start_page = pfn_to_page(start_pfn);
+	unsigned long pfn = start_pfn + nr_pages - 1;
+#ifdef WANT_PAGE_VIRTUAL
+	bool is_highmem = is_highmem_idx(zone);
+#endif
+	struct page *page;
+
+	/*
+	 * Enforce the following requirements:
+	 * size > 0
+	 * size < pageblock_nr_pages
+	 * start_pfn -> pfn does not cross pageblock_nr_pages boundary
+	 */
+	VM_BUG_ON(((start_pfn ^ pfn) | (nr_pages - 1)) > nr_pgmask);
+
+	/*
+	 * Work from highest page to lowest, this way we will still be
+	 * warm in the cache when we call set_pageblock_migratetype
+	 * below.
+	 *
+	 * The loop is based around the page pointer as the main index
+	 * instead of the pfn because pfn is not used inside the loop if
+	 * the section number is not in page flags and WANT_PAGE_VIRTUAL
+	 * is not defined.
+	 */
+	for (page = start_page + nr_pages; page-- != start_page; pfn--) {
+		mm_zero_struct_page(page);
+
+		/*
+		 * We use the start_pfn instead of pfn in the set_page_links
+		 * call because of the fact that the pfn number is used to
+		 * get the section_nr and this function should not be
+		 * spanning more than a single section.
+		 */
+		set_page_links(page, zone, nid, start_pfn);
+		init_page_count(page);
+		page_mapcount_reset(page);
+		page_cpupid_reset_last(page);
+
+		/*
+		 * We can use the non-atomic __set_bit operation for setting
+		 * the flag as we are still initializing the pages.
+		 */
+		__SetPageReserved(page);
+
+		/*
+		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
+		 * pointer and hmm_data.  It is a bug if a ZONE_DEVICE
+		 * page is ever freed or placed on a driver-private list.
+		 */
+		page->pgmap = pgmap;
+		if (!pgmap)
+			INIT_LIST_HEAD(&page->lru);
+
+#ifdef WANT_PAGE_VIRTUAL
+		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
+		if (!is_highmem)
+			set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+	}
+
+	/*
+	 * Mark the block movable so that blocks are reserved for
+	 * movable at startup. This will force kernel allocations
+	 * to reserve their blocks rather than leaking throughout
+	 * the address space during boot when many long-lived
+	 * kernel allocations are made.
+	 *
+	 * bitmap is created for zone's valid pfn range. but memmap
+	 * can be created for invalid pages (for alignment)
+	 * check here not to call set_pageblock_migratetype() against
+	 * pfn out of zone.
+	 *
+	 * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
+	 * because this is done early in sparse_add_one_section
+	 */
+	if (!(start_pfn & nr_pgmask))
+		set_pageblock_migratetype(start_page, MIGRATE_MOVABLE);
+}
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 static void __meminit init_reserved_page(unsigned long pfn)
 {
@@ -5513,6 +5599,25 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
 	return false;
 }
 
+static void __meminit __memmap_init_hotplug(unsigned long size, int nid,
+					    unsigned long zone,
+					    unsigned long start_pfn,
+					    struct dev_pagemap *pgmap)
+{
+	unsigned long pfn = start_pfn + size;
+
+	while (pfn != start_pfn) {
+		unsigned long stride = pfn;
+
+		pfn = max(ALIGN_DOWN(pfn - 1, pageblock_nr_pages), start_pfn);
+		stride -= pfn;
+
+		__init_pageblock(pfn, stride, zone, nid, pgmap);
+
+		cond_resched();
+	}
+}
+
 /*
  * Initially all pages are reserved - free ones are freed
  * up by memblock_free_all() once the early boot process is
@@ -5523,51 +5628,61 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		struct vmem_altmap *altmap)
 {
 	unsigned long pfn, end_pfn = start_pfn + size;
-	struct page *page;
 
 	if (highest_memmap_pfn < end_pfn - 1)
 		highest_memmap_pfn = end_pfn - 1;
 
+	if (context == MEMMAP_HOTPLUG) {
 #ifdef CONFIG_ZONE_DEVICE
-	/*
-	 * Honor reservation requested by the driver for this ZONE_DEVICE
-	 * memory. We limit the total number of pages to initialize to just
-	 * those that might contain the memory mapping. We will defer the
-	 * ZONE_DEVICE page initialization until after we have released
-	 * the hotplug lock.
-	 */
-	if (zone == ZONE_DEVICE) {
-		if (!altmap)
-			return;
+		/*
+		 * Honor reservation requested by the driver for this
+		 * ZONE_DEVICE memory. We limit the total number of pages to
+		 * initialize to just those that might contain the memory
+		 * mapping. We will defer the ZONE_DEVICE page initialization
+		 * until after we have released the hotplug lock.
+		 */
+		if (zone == ZONE_DEVICE) {
+			if (!altmap)
+				return;
+
+			if (start_pfn == altmap->base_pfn)
+				start_pfn += altmap->reserve;
+			end_pfn = altmap->base_pfn +
+				  vmem_altmap_offset(altmap);
+		}
+#endif
+		/*
+		 * For these ZONE_DEVICE pages we don't need to record the
+		 * pgmap as they should represent only those pages used to
+		 * store the memory map. The actual ZONE_DEVICE pages will
+		 * be initialized later.
+		 */
+		__memmap_init_hotplug(end_pfn - start_pfn, nid, zone,
+				      start_pfn, NULL);
 
-		if (start_pfn == altmap->base_pfn)
-			start_pfn += altmap->reserve;
-		end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
+		return;
 	}
-#endif
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+		struct page *page;
+
 		/*
 		 * There can be holes in boot-time mem_map[]s handed to this
 		 * function.  They do not exist on hotplugged memory.
 		 */
-		if (context == MEMMAP_EARLY) {
-			if (!early_pfn_valid(pfn)) {
-				pfn = next_valid_pfn(pfn) - 1;
-				continue;
-			}
-			if (!early_pfn_in_nid(pfn, nid))
-				continue;
-			if (overlap_memmap_init(zone, &pfn))
-				continue;
-			if (defer_init(nid, pfn, end_pfn))
-				break;
+		if (!early_pfn_valid(pfn)) {
+			pfn = next_valid_pfn(pfn) - 1;
+			continue;
 		}
+		if (!early_pfn_in_nid(pfn, nid))
+			continue;
+		if (overlap_memmap_init(zone, &pfn))
+			continue;
+		if (defer_init(nid, pfn, end_pfn))
+			break;
 
 		page = pfn_to_page(pfn);
 		__init_single_page(page, pfn, zone, nid);
-		if (context == MEMMAP_HOTPLUG)
-			__SetPageReserved(page);
 
 		/*
 		 * Mark the block movable so that blocks are reserved for
@@ -5594,7 +5709,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long size,
 				   struct dev_pagemap *pgmap)
 {
-	unsigned long pfn, end_pfn = start_pfn + size;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
@@ -5610,53 +5724,13 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	 */
 	if (pgmap->altmap_valid) {
 		struct vmem_altmap *altmap = &pgmap->altmap;
+		unsigned long end_pfn = start_pfn + size;
 
 		start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
 		size = end_pfn - start_pfn;
 	}
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
-		struct page *page = pfn_to_page(pfn);
-
-		__init_single_page(page, pfn, zone_idx, nid);
-
-		/*
-		 * Mark page reserved as it will need to wait for onlining
-		 * phase for it to be fully associated with a zone.
-		 *
-		 * We can use the non-atomic __set_bit operation for setting
-		 * the flag as we are still initializing the pages.
-		 */
-		__SetPageReserved(page);
-
-		/*
-		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
-		 * pointer and hmm_data.  It is a bug if a ZONE_DEVICE
-		 * page is ever freed or placed on a driver-private list.
-		 */
-		page->pgmap = pgmap;
-		page->hmm_data = 0;
-
-		/*
-		 * Mark the block movable so that blocks are reserved for
-		 * movable at startup. This will force kernel allocations
-		 * to reserve their blocks rather than leaking throughout
-		 * the address space during boot when many long-lived
-		 * kernel allocations are made.
-		 *
-		 * bitmap is created for zone's valid pfn range. but memmap
-		 * can be created for invalid pages (for alignment)
-		 * check here not to call set_pageblock_migratetype() against
-		 * pfn out of zone.
-		 *
-		 * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
-		 * because this is done early in sparse_add_one_section
-		 */
-		if (!(pfn & (pageblock_nr_pages - 1))) {
-			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-			cond_resched();
-		}
-	}
+	__memmap_init_hotplug(size, nid, zone_idx, start_pfn, pgmap);
 
 	pr_info("%s initialised, %lu pages in %ums\n", dev_name(pgmap->dev),
 		size, jiffies_to_msecs(jiffies - start));


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [mm PATCH v4 5/6] mm: Add reserved flag setting to set_page_links
  2018-10-17 23:54 ` Alexander Duyck
@ 2018-10-17 23:54   ` Alexander Duyck
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Duyck @ 2018-10-17 23:54 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: pavel.tatashin, mhocko, dave.jiang, alexander.h.duyck,
	linux-kernel, willy, davem, yi.z.zhang, khalid.aziz, rppt,
	vbabka, sparclinux, dan.j.williams, ldufour, mgorman, mingo,
	kirill.shutemov

This patch modifies the set_page_links function to include the setting of
the reserved flag via a simple AND and OR operation. The motivation for
this is that the existing __set_bit call still appears to cost some
performance, since replacing it with the AND and OR reduces initialization
time.

Looking over the assembly code before and after the change, the main
difference between the two is that the reserved bit is stored in a value
that is generated outside of the main initialization loop and is then
written with the other flags field values in one write to the page->flags
value. Previously the generated value was written and then a btsq
instruction was issued.

On my x86_64 test system with 3TB of persistent memory per node I saw the
persistent memory initialization time on average drop from 23.49s to
19.12s per node.
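
As a rough userspace sketch of the difference (the bit position and struct
layout below are assumptions for the demo, not the kernel's definitions):

#include <stdio.h>

#define DEMO_PG_RESERVED 10UL	/* assumed bit position, illustration only */

struct demo_page {
	unsigned long flags;
};

/* Stand-in for the old separate __SetPageReserved() step. */
static void demo_set_reserved_bit(struct demo_page *page)
{
	page->flags |= 1UL << DEMO_PG_RESERVED;
}

/* New approach: clear then OR in the requested value, so the bit can be
 * folded into the same store that writes the zone/node/section fields. */
static void demo_set_page_reserved(struct demo_page *page, int reserved)
{
	page->flags &= ~(1UL << DEMO_PG_RESERVED);
	page->flags |= (unsigned long)!!reserved << DEMO_PG_RESERVED;
}

int main(void)
{
	struct demo_page page = { .flags = 0 };

	demo_set_page_reserved(&page, 1);
	printf("after set:   %#lx\n", page.flags);
	demo_set_page_reserved(&page, 0);
	printf("after clear: %#lx\n", page.flags);
	demo_set_reserved_bit(&page);
	printf("after bts:   %#lx\n", page.flags);
	return 0;
}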

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/mm.h |    9 ++++++++-
 mm/page_alloc.c    |   29 +++++++++++++++++++----------
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6e2c9631af05..14d06d7d2986 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1171,11 +1171,18 @@ static inline void set_page_node(struct page *page, unsigned long node)
 	page->flags |= (node & NODES_MASK) << NODES_PGSHIFT;
 }
 
+static inline void set_page_reserved(struct page *page, bool reserved)
+{
+	page->flags &= ~(1ul << PG_reserved);
+	page->flags |= (unsigned long)(!!reserved) << PG_reserved;
+}
+
 static inline void set_page_links(struct page *page, enum zone_type zone,
-	unsigned long node, unsigned long pfn)
+	unsigned long node, unsigned long pfn, bool reserved)
 {
 	set_page_zone(page, zone);
 	set_page_node(page, node);
+	set_page_reserved(page, reserved);
 #ifdef SECTION_IN_PAGE_FLAGS
 	set_page_section(page, pfn_to_section_nr(pfn));
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a0b81e0bef03..e7fee7a5f8a3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1179,7 +1179,7 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid)
 {
 	mm_zero_struct_page(page);
-	set_page_links(page, zone, nid, pfn);
+	set_page_links(page, zone, nid, pfn, false);
 	init_page_count(page);
 	page_mapcount_reset(page);
 	page_cpupid_reset_last(page);
@@ -1195,7 +1195,8 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 static void __meminit __init_pageblock(unsigned long start_pfn,
 				       unsigned long nr_pages,
 				       unsigned long zone, int nid,
-				       struct dev_pagemap *pgmap)
+				       struct dev_pagemap *pgmap,
+				       bool is_reserved)
 {
 	unsigned long nr_pgmask = pageblock_nr_pages - 1;
 	struct page *start_page = pfn_to_page(start_pfn);
@@ -1231,19 +1232,16 @@ static void __meminit __init_pageblock(unsigned long start_pfn,
 		 * call because of the fact that the pfn number is used to
 		 * get the section_nr and this function should not be
 		 * spanning more than a single section.
+		 *
+		 * We can use a non-atomic operation for setting the
+		 * PG_reserved flag as we are still initializing the pages.
 		 */
-		set_page_links(page, zone, nid, start_pfn);
+		set_page_links(page, zone, nid, start_pfn, is_reserved);
 		init_page_count(page);
 		page_mapcount_reset(page);
 		page_cpupid_reset_last(page);
 
 		/*
-		 * We can use the non-atomic __set_bit operation for setting
-		 * the flag as we are still initializing the pages.
-		 */
-		__SetPageReserved(page);
-
-		/*
 		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
 		 * pointer and hmm_data.  It is a bug if a ZONE_DEVICE
 		 * page is ever freed or placed on a driver-private list.
@@ -5612,7 +5610,18 @@ static void __meminit __memmap_init_hotplug(unsigned long size, int nid,
 		pfn = max(ALIGN_DOWN(pfn - 1, pageblock_nr_pages), start_pfn);
 		stride -= pfn;
 
-		__init_pageblock(pfn, stride, zone, nid, pgmap);
+		/*
+		 * The last argument of __init_pageblock is a boolean
+		 * value indicating if the page will be marked as reserved.
+		 *
+		 * Mark page reserved as it will need to wait for onlining
+		 * phase for it to be fully associated with a zone.
+		 *
+		 * Under certain circumstances ZONE_DEVICE pages may not
+		 * need to be marked as reserved, however there is still
+		 * code that is depending on this being set for now.
+		 */
+		__init_pageblock(pfn, stride, zone, nid, pgmap, true);
 
 		cond_resched();
 	}


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [mm PATCH v4 6/6] mm: Use common iterator for deferred_init_pages and deferred_free_pages
  2018-10-17 23:54 ` Alexander Duyck
@ 2018-10-17 23:54   ` Alexander Duyck
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Duyck @ 2018-10-17 23:54 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: pavel.tatashin, mhocko, dave.jiang, alexander.h.duyck,
	linux-kernel, willy, davem, yi.z.zhang, khalid.aziz, rppt,
	vbabka, sparclinux, dan.j.williams, ldufour, mgorman, mingo,
	kirill.shutemov

This patch creates a common iterator to be used by both deferred_init_pages
and deferred_free_pages. By doing this we can cut down a bit on code
overhead as they will likely both be inlined into the same function anyway.

This new approach allows deferred_init_pages to make use of
__init_pageblock. By doing this we can cut down on the code size by sharing
code between both the hotplug and deferred memory init code paths.

An additional benefit to this approach is that we improve the cache locality
of the memory init, as we can focus on the memory areas related to
identifying whether a given PFN is valid and keep that data warm in the cache
until we transition to a region of a different type. So we will stream through
a chunk of valid blocks before we turn to initializing page structs.

On my x86_64 test system with 384GB of memory per node I saw a reduction in
initialization time from 1.38s to 1.06s as a result of this patch.
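
For reference, a self-contained userspace model of the iterator pattern; the
block size and the stubbed validity check are assumptions for the demo (the
real code uses pageblock_nr_pages and pfn_valid()):

#include <stdio.h>

#define DEMO_BLOCK_NR_PAGES	8UL	/* tiny block size, demo only */
#define DEMO_ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))
#define DEMO_MIN(a, b)		((a) < (b) ? (a) : (b))

/* Stand-in for pfn_valid(): pretend pfns 16..23 are a hole in the map. */
static int demo_pfn_valid(unsigned long pfn)
{
	return pfn < 16 || pfn >= 24;
}

/*
 * Advance *i past the next run of valid pfns below end_pfn and return the
 * run length, or 0 once the range is exhausted. Only one pfn per block is
 * tested here; the real code has per-configuration variants.
 */
static unsigned long demo_next_valid_range(unsigned long *i, unsigned long end_pfn)
{
	unsigned long pfn = *i;

	while (pfn < end_pfn) {
		unsigned long block_end = DEMO_MIN(DEMO_ALIGN(pfn + 1, DEMO_BLOCK_NR_PAGES),
						   end_pfn);
		unsigned long count = block_end - pfn;

		pfn = block_end;
		if (!demo_pfn_valid(block_end - count))
			continue;

		*i = pfn;
		return count;
	}

	return 0;
}

#define demo_for_each_valid_range(i, start, end, pfn, count)		\
	for (i = (start), count = demo_next_valid_range(&i, (end));	\
	     count && (pfn = i - count, 1);				\
	     count = demo_next_valid_range(&i, (end)))

int main(void)
{
	unsigned long i, pfn, count;

	demo_for_each_valid_range(i, 0, 40, pfn, count)
		printf("valid run: pfn %lu, %lu pages\n", pfn, count);

	return 0;
}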

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 mm/page_alloc.c |  134 +++++++++++++++++++++++++++----------------------------
 1 file changed, 65 insertions(+), 69 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e7fee7a5f8a3..f47d02e42cf7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1484,32 +1484,6 @@ void clear_zone_contiguous(struct zone *zone)
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-static void __init deferred_free_range(unsigned long pfn,
-				       unsigned long nr_pages)
-{
-	struct page *page;
-	unsigned long i;
-
-	if (!nr_pages)
-		return;
-
-	page = pfn_to_page(pfn);
-
-	/* Free a large naturally-aligned chunk if possible */
-	if (nr_pages == pageblock_nr_pages &&
-	    (pfn & (pageblock_nr_pages - 1)) == 0) {
-		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-		__free_pages_core(page, pageblock_order);
-		return;
-	}
-
-	for (i = 0; i < nr_pages; i++, page++, pfn++) {
-		if ((pfn & (pageblock_nr_pages - 1)) == 0)
-			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-		__free_pages_core(page, 0);
-	}
-}
-
 /* Completion tracking for deferred_init_memmap() threads */
 static atomic_t pgdat_init_n_undone __initdata;
 static __initdata DECLARE_COMPLETION(pgdat_init_all_done_comp);
@@ -1521,48 +1495,77 @@ static inline void __init pgdat_init_report_one_done(void)
 }
 
 /*
- * Returns true if page needs to be initialized or freed to buddy allocator.
+ * Returns count if page range needs to be initialized or freed
  *
- * First we check if pfn is valid on architectures where it is possible to have
- * holes within pageblock_nr_pages. On systems where it is not possible, this
- * function is optimized out.
+ * First, we check if a current large page is valid by only checking the
+ * validity of the head pfn.
  *
- * Then, we check if a current large page is valid by only checking the validity
- * of the head pfn.
+ * Then we check if the contiguous pfns are valid on architectures where it
+ * is possible to have holes within pageblock_nr_pages. On systems where it
+ * is not possible, this function is optimized out.
  */
-static inline bool __init deferred_pfn_valid(unsigned long pfn)
+static unsigned long __next_pfn_valid_range(unsigned long *i,
+					    unsigned long end_pfn)
 {
-	if (!pfn_valid_within(pfn))
-		return false;
-	if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
-		return false;
-	return true;
+	unsigned long pfn = *i;
+	unsigned long count;
+
+	while (pfn < end_pfn) {
+		unsigned long t = ALIGN(pfn + 1, pageblock_nr_pages);
+		unsigned long pageblock_pfn = min(t, end_pfn);
+
+#ifndef CONFIG_HOLES_IN_ZONE
+		count = pageblock_pfn - pfn;
+		pfn = pageblock_pfn;
+		if (!pfn_valid(pfn))
+			continue;
+#else
+		for (count = 0; pfn < pageblock_pfn; pfn++) {
+			if (pfn_valid_within(pfn)) {
+				count++;
+				continue;
+			}
+
+			if (count)
+				break;
+		}
+
+		if (!count)
+			continue;
+#endif
+		*i = pfn;
+		return count;
+	}
+
+	return 0;
 }
 
+#define for_each_deferred_pfn_valid_range(i, start_pfn, end_pfn, pfn, count) \
+	for (i = (start_pfn),						     \
+	     count = __next_pfn_valid_range(&i, (end_pfn));		     \
+	     count && ({ pfn = i - count; 1; });			     \
+	     count = __next_pfn_valid_range(&i, (end_pfn)))
 /*
  * Free pages to buddy allocator. Try to free aligned pages in
  * pageblock_nr_pages sizes.
  */
-static void __init deferred_free_pages(unsigned long pfn,
+static void __init deferred_free_pages(unsigned long start_pfn,
 				       unsigned long end_pfn)
 {
-	unsigned long nr_pgmask = pageblock_nr_pages - 1;
-	unsigned long nr_free = 0;
-
-	for (; pfn < end_pfn; pfn++) {
-		if (!deferred_pfn_valid(pfn)) {
-			deferred_free_range(pfn - nr_free, nr_free);
-			nr_free = 0;
-		} else if (!(pfn & nr_pgmask)) {
-			deferred_free_range(pfn - nr_free, nr_free);
-			nr_free = 1;
-			touch_nmi_watchdog();
+	unsigned long i, pfn, count;
+
+	for_each_deferred_pfn_valid_range(i, start_pfn, end_pfn, pfn, count) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (count == pageblock_nr_pages) {
+			__free_pages_core(page, pageblock_order);
 		} else {
-			nr_free++;
+			while (count--)
+				__free_pages_core(page++, 0);
 		}
+
+		touch_nmi_watchdog();
 	}
-	/* Free the last block of pages to allocator */
-	deferred_free_range(pfn - nr_free, nr_free);
 }
 
 /*
@@ -1571,29 +1574,22 @@ static void __init deferred_free_pages(unsigned long pfn,
  * Return number of pages initialized.
  */
 static unsigned long  __init deferred_init_pages(struct zone *zone,
-						 unsigned long pfn,
+						 unsigned long start_pfn,
 						 unsigned long end_pfn)
 {
-	unsigned long nr_pgmask = pageblock_nr_pages - 1;
+	unsigned long i, pfn, count;
 	int nid = zone_to_nid(zone);
 	unsigned long nr_pages = 0;
 	int zid = zone_idx(zone);
-	struct page *page = NULL;
 
-	for (; pfn < end_pfn; pfn++) {
-		if (!deferred_pfn_valid(pfn)) {
-			page = NULL;
-			continue;
-		} else if (!page || !(pfn & nr_pgmask)) {
-			page = pfn_to_page(pfn);
-			touch_nmi_watchdog();
-		} else {
-			page++;
-		}
-		__init_single_page(page, pfn, zid, nid);
-		nr_pages++;
+	for_each_deferred_pfn_valid_range(i, start_pfn, end_pfn, pfn, count) {
+		nr_pages += count;
+		__init_pageblock(pfn, count, zid, nid, NULL, false);
+
+		touch_nmi_watchdog();
 	}
-	return (nr_pages);
+
+	return nr_pages;
 }
 
 /*


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [mm PATCH v4 1/6] mm: Use mm_zero_struct_page from SPARC on all 64b architectures
  2018-10-17 23:54   ` Alexander Duyck
@ 2018-10-18 18:29     ` Pavel Tatashin
  -1 siblings, 0 replies; 28+ messages in thread
From: Pavel Tatashin @ 2018-10-18 18:29 UTC (permalink / raw)
  To: Alexander Duyck, linux-mm, akpm
  Cc: pavel.tatashin, mhocko, dave.jiang, linux-kernel, willy, davem,
	yi.z.zhang, khalid.aziz, rppt, vbabka, sparclinux,
	dan.j.williams, ldufour, mgorman, mingo, kirill.shutemov



On 10/17/18 7:54 PM, Alexander Duyck wrote:
> This change makes it so that we use the same approach that was already in
> use on Sparc on all the architectures that support a 64b long.
> 
> This is mostly motivated by the fact that 7 to 10 store/move instructions
> are likely always going to be faster than having to call into a function
> that is not specialized for handling page init.
> 
> An added advantage to doing it this way is that the compiler can get away
> with combining writes in the __init_single_page call. As a result the
> memset call will be reduced to only about 4 write operations, or at least
> that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
> count/mapcount seem to be cancelling out at least 4 of the 8 assignments on
> my system.
> 
> One change I had to make to the function was to reduce the minimum page
> size to 56 to support some powerpc64 configurations.
> 
> This change should introduce no change on SPARC since it already had this
> code. In the case of x86_64 I saw a reduction from 3.75s to 2.80s when
> initializing 384GB of RAM per node. Pavel Tatashin tested on a system with
> Broadcom's Stingray CPU and 48GB of RAM and found that __init_single_page()
> takes 19.30ns / 64-byte struct page before this patch and with this patch
> it takes 17.33ns / 64-byte struct page. Mike Rapoport ran a similar test on
> an OpenPower (S812LC 8348-21C) with Power8 processor and 128GB of RAM. His
> results per 64-byte struct page were 4.68ns before, and 4.59ns after this
> patch.
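
(For reference, a standalone userspace sketch of the switch-based zeroing
pattern discussed above; the 80-byte demo_page struct is an assumption for
the demo, not the kernel's struct page.)

#include <stdio.h>
#include <string.h>

/* Assumed 80-byte stand-in for struct page (ten 8-byte words). */
struct demo_page {
	unsigned long words[10];
};

/*
 * Zero the struct with plain stores so the compiler can merge them with
 * the assignments that follow, instead of calling out to memset().
 */
static inline void demo_zero_struct_page(struct demo_page *page)
{
	unsigned long *_pp = (unsigned long *)page;

	switch (sizeof(struct demo_page)) {
	case 80:
		_pp[9] = 0;	/* fallthrough */
	case 72:
		_pp[8] = 0;	/* fallthrough */
	case 64:
		_pp[7] = 0;	/* fallthrough */
	case 56:
		_pp[6] = 0;
		_pp[5] = 0;
		_pp[4] = 0;
		_pp[3] = 0;
		_pp[2] = 0;
		_pp[1] = 0;
		_pp[0] = 0;
	}
}

int main(void)
{
	struct demo_page page;

	memset(&page, 0xff, sizeof(page));
	demo_zero_struct_page(&page);
	printf("words[0] = %lu, words[9] = %lu\n", page.words[0], page.words[9]);
	return 0;
}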

Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>

> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> ---
>  arch/sparc/include/asm/pgtable_64.h |   30 --------------------------
>  include/linux/mm.h                  |   41 ++++++++++++++++++++++++++++++++---
>  2 files changed, 38 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> index 1393a8ac596b..22500c3be7a9 100644
> --- a/arch/sparc/include/asm/pgtable_64.h
> +++ b/arch/sparc/include/asm/pgtable_64.h
> @@ -231,36 +231,6 @@
>  extern struct page *mem_map_zero;
>  #define ZERO_PAGE(vaddr)	(mem_map_zero)
>  
> -/* This macro must be updated when the size of struct page grows above 80
> - * or reduces below 64.
> - * The idea that compiler optimizes out switch() statement, and only
> - * leaves clrx instructions
> - */
> -#define	mm_zero_struct_page(pp) do {					\
> -	unsigned long *_pp = (void *)(pp);				\
> -									\
> -	 /* Check that struct page is either 64, 72, or 80 bytes */	\
> -	BUILD_BUG_ON(sizeof(struct page) & 7);				\
> -	BUILD_BUG_ON(sizeof(struct page) < 64);				\
> -	BUILD_BUG_ON(sizeof(struct page) > 80);				\
> -									\
> -	switch (sizeof(struct page)) {					\
> -	case 80:							\
> -		_pp[9] = 0;	/* fallthrough */			\
> -	case 72:							\
> -		_pp[8] = 0;	/* fallthrough */			\
> -	default:							\
> -		_pp[7] = 0;						\
> -		_pp[6] = 0;						\
> -		_pp[5] = 0;						\
> -		_pp[4] = 0;						\
> -		_pp[3] = 0;						\
> -		_pp[2] = 0;						\
> -		_pp[1] = 0;						\
> -		_pp[0] = 0;						\
> -	}								\
> -} while (0)
> -
>  /* PFNs are real physical page numbers.  However, mem_map only begins to record
>   * per-page information starting at pfn_base.  This is to handle systems where
>   * the first physical page in the machine is at some huge physical address,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fcf9cc9d535f..6e2c9631af05 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -98,10 +98,45 @@ static inline void set_max_mapnr(unsigned long limit) { }
>  
>  /*
>   * On some architectures it is expensive to call memset() for small sizes.
> - * Those architectures should provide their own implementation of "struct page"
> - * zeroing by defining this macro in <asm/pgtable.h>.
> + * If an architecture decides to implement their own version of
> + * mm_zero_struct_page they should wrap the defines below in a #ifndef and
> + * define their own version of this macro in <asm/pgtable.h>
>   */
> -#ifndef mm_zero_struct_page
> +#if BITS_PER_LONG == 64
> +/* This function must be updated when the size of struct page grows above 80
> + * or reduces below 56. The idea that compiler optimizes out switch()
> + * statement, and only leaves move/store instructions. Also the compiler can
> + * combine write statements if they are both assignments and can be reordered,
> + * this can result in several of the writes here being dropped.
> + */
> +#define	mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
> +static inline void __mm_zero_struct_page(struct page *page)
> +{
> +	unsigned long *_pp = (void *)page;
> +
> +	 /* Check that struct page is either 56, 64, 72, or 80 bytes */
> +	BUILD_BUG_ON(sizeof(struct page) & 7);
> +	BUILD_BUG_ON(sizeof(struct page) < 56);
> +	BUILD_BUG_ON(sizeof(struct page) > 80);
> +
> +	switch (sizeof(struct page)) {
> +	case 80:
> +		_pp[9] = 0;	/* fallthrough */
> +	case 72:
> +		_pp[8] = 0;	/* fallthrough */
> +	case 64:
> +		_pp[7] = 0;	/* fallthrough */
> +	case 56:
> +		_pp[6] = 0;
> +		_pp[5] = 0;
> +		_pp[4] = 0;
> +		_pp[3] = 0;
> +		_pp[2] = 0;
> +		_pp[1] = 0;
> +		_pp[0] = 0;
> +	}
> +}
> +#else
>  #define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
>  #endif
>  
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [mm PATCH v4 1/6] mm: Use mm_zero_struct_page from SPARC on all 64b architectures
@ 2018-10-18 18:29     ` Pavel Tatashin
  0 siblings, 0 replies; 28+ messages in thread
From: Pavel Tatashin @ 2018-10-18 18:29 UTC (permalink / raw)
  To: Alexander Duyck, linux-mm, akpm
  Cc: pavel.tatashin, mhocko, dave.jiang, linux-kernel, willy, davem,
	yi.z.zhang, khalid.aziz, rppt, vbabka, sparclinux,
	dan.j.williams, ldufour, mgorman, mingo, kirill.shutemov



On 10/17/18 7:54 PM, Alexander Duyck wrote:
> This change makes it so that we use the same approach that was already in
> use on Sparc on all the archtectures that support a 64b long.
> 
> This is mostly motivated by the fact that 7 to 10 store/move instructions
> are likely always going to be faster than having to call into a function
> that is not specialized for handling page init.
> 
> An added advantage to doing it this way is that the compiler can get away
> with combining writes in the __init_single_page call. As a result the
> memset call will be reduced to only about 4 write operations, or at least
> that is what I am seeing with GCC 6.2 as the flags, LRU poitners, and
> count/mapcount seem to be cancelling out at least 4 of the 8 assignments on
> my system.
> 
> One change I had to make to the function was to reduce the minimum page
> size to 56 to support some powerpc64 configurations.
> 
> This change should introduce no change on SPARC since it already had this
> code. In the case of x86_64 I saw a reduction from 3.75s to 2.80s when
> initializing 384GB of RAM per node. Pavel Tatashin tested on a system with
> Broadcom's Stingray CPU and 48GB of RAM and found that __init_single_page()
> takes 19.30ns / 64-byte struct page before this patch and with this patch
> it takes 17.33ns / 64-byte struct page. Mike Rapoport ran a similar test on
> a OpenPower (S812LC 8348-21C) with Power8 processor and 128GB or RAM. His
> results per 64-byte struct page were 4.68ns before, and 4.59ns after this
> patch.

Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>

> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> ---
>  arch/sparc/include/asm/pgtable_64.h |   30 --------------------------
>  include/linux/mm.h                  |   41 ++++++++++++++++++++++++++++++++---
>  2 files changed, 38 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> index 1393a8ac596b..22500c3be7a9 100644
> --- a/arch/sparc/include/asm/pgtable_64.h
> +++ b/arch/sparc/include/asm/pgtable_64.h
> @@ -231,36 +231,6 @@
>  extern struct page *mem_map_zero;
>  #define ZERO_PAGE(vaddr)	(mem_map_zero)
>  
> -/* This macro must be updated when the size of struct page grows above 80
> - * or reduces below 64.
> - * The idea that compiler optimizes out switch() statement, and only
> - * leaves clrx instructions
> - */
> -#define	mm_zero_struct_page(pp) do {					\
> -	unsigned long *_pp = (void *)(pp);				\
> -									\
> -	 /* Check that struct page is either 64, 72, or 80 bytes */	\
> -	BUILD_BUG_ON(sizeof(struct page) & 7);				\
> -	BUILD_BUG_ON(sizeof(struct page) < 64);				\
> -	BUILD_BUG_ON(sizeof(struct page) > 80);				\
> -									\
> -	switch (sizeof(struct page)) {					\
> -	case 80:							\
> -		_pp[9] = 0;	/* fallthrough */			\
> -	case 72:							\
> -		_pp[8] = 0;	/* fallthrough */			\
> -	default:							\
> -		_pp[7] = 0;						\
> -		_pp[6] = 0;						\
> -		_pp[5] = 0;						\
> -		_pp[4] = 0;						\
> -		_pp[3] = 0;						\
> -		_pp[2] = 0;						\
> -		_pp[1] = 0;						\
> -		_pp[0] = 0;						\
> -	}								\
> -} while (0)
> -
>  /* PFNs are real physical page numbers.  However, mem_map only begins to record
>   * per-page information starting at pfn_base.  This is to handle systems where
>   * the first physical page in the machine is at some huge physical address,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fcf9cc9d535f..6e2c9631af05 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -98,10 +98,45 @@ static inline void set_max_mapnr(unsigned long limit) { }
>  
>  /*
>   * On some architectures it is expensive to call memset() for small sizes.
> - * Those architectures should provide their own implementation of "struct page"
> - * zeroing by defining this macro in <asm/pgtable.h>.
> + * If an architecture decides to implement their own version of
> + * mm_zero_struct_page they should wrap the defines below in a #ifndef and
> + * define their own version of this macro in <asm/pgtable.h>
>   */
> -#ifndef mm_zero_struct_page
> +#if BITS_PER_LONG == 64
> +/* This function must be updated when the size of struct page grows above 80
> + * or reduces below 56. The idea that compiler optimizes out switch()
> + * statement, and only leaves move/store instructions. Also the compiler can
> + * combine write statments if they are both assignments and can be reordered,
> + * this can result in several of the writes here being dropped.
> + */
> +#define	mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
> +static inline void __mm_zero_struct_page(struct page *page)
> +{
> +	unsigned long *_pp = (void *)page;
> +
> +	 /* Check that struct page is either 56, 64, 72, or 80 bytes */
> +	BUILD_BUG_ON(sizeof(struct page) & 7);
> +	BUILD_BUG_ON(sizeof(struct page) < 56);
> +	BUILD_BUG_ON(sizeof(struct page) > 80);
> +
> +	switch (sizeof(struct page)) {
> +	case 80:
> +		_pp[9] = 0;	/* fallthrough */
> +	case 72:
> +		_pp[8] = 0;	/* fallthrough */
> +	case 64:
> +		_pp[7] = 0;	/* fallthrough */
> +	case 56:
> +		_pp[6] = 0;
> +		_pp[5] = 0;
> +		_pp[4] = 0;
> +		_pp[3] = 0;
> +		_pp[2] = 0;
> +		_pp[1] = 0;
> +		_pp[0] = 0;
> +	}
> +}
> +#else
>  #define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
>  #endif
>  
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread
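
The write-combining effect described in the commit message (several of the
eight zero stores being cancelled out by the assignments made right after them
in __init_single_page) can be pictured with a heavily simplified, hypothetical
caller. Everything below is a sketch under that assumption: the demo_page
struct and its "flags"/"lru" fields are placeholders (64 bytes on an LP64
target), not the real struct page layout.

#include <stdio.h>

/* hypothetical 64-byte stand-in; the fields are placeholders only */
struct demo_page {
	unsigned long flags;
	unsigned long lru_next;
	unsigned long lru_prev;
	unsigned long rest[5];
};

static inline void demo_zero_page(struct demo_page *page)
{
	unsigned long *_pp = (unsigned long *)page;
	unsigned int i;

	/* open-coded zeroing that stays visible to the optimizer */
	for (i = 0; i < sizeof(*page) / sizeof(*_pp); i++)
		_pp[i] = 0;
}

/* simplified analog of __init_single_page(): three of the words just
 * zeroed are overwritten immediately, so those zero stores are dead and
 * the compiler is free to drop them
 */
static void demo_init_single_page(struct demo_page *page, unsigned long flags)
{
	demo_zero_page(page);
	page->flags = flags;
	page->lru_next = (unsigned long)&page->lru_next;
	page->lru_prev = (unsigned long)&page->lru_next;
}

int main(void)
{
	struct demo_page page;

	demo_init_single_page(&page, 0x400UL);
	printf("flags=%lx lru_next=%lx\n", page.flags, page.lru_next);
	return 0;
}

Whether all of the dead stores are actually removed depends on the compiler
and flags, which is why the commit message reports what GCC 6.2 produced
rather than a guaranteed count.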

* Re: [mm PATCH v4 1/6] mm: Use mm_zero_struct_page from SPARC on all 64b architectures
  2018-10-17 23:54   ` Alexander Duyck
@ 2018-10-29 20:12     ` Michal Hocko
  -1 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2018-10-29 20:12 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: linux-mm, akpm, pavel.tatashin, dave.jiang, linux-kernel, willy,
	davem, yi.z.zhang, khalid.aziz, rppt, vbabka, sparclinux,
	dan.j.williams, ldufour, mgorman, mingo, kirill.shutemov

On Wed 17-10-18 16:54:08, Alexander Duyck wrote:
> This change makes it so that we use the same approach that was already in
> use on Sparc on all the architectures that support a 64b long.
> 
> This is mostly motivated by the fact that 7 to 10 store/move instructions
> are likely always going to be faster than having to call into a function
> that is not specialized for handling page init.
> 
> An added advantage to doing it this way is that the compiler can get away
> with combining writes in the __init_single_page call. As a result the
> memset call will be reduced to only about 4 write operations, or at least
> that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
> count/mapcount seem to be cancelling out at least 4 of the 8 assignments on
> my system.
> 
> One change I had to make to the function was to reduce the minimum struct
> page size to 56 bytes to support some powerpc64 configurations.
> 
> This change should introduce no change on SPARC since it already had this
> code. In the case of x86_64 I saw a reduction from 3.75s to 2.80s when
> initializing 384GB of RAM per node. Pavel Tatashin tested on a system with
> Broadcom's Stingray CPU and 48GB of RAM and found that __init_single_page()
> takes 19.30ns / 64-byte struct page before this patch and with this patch
> it takes 17.33ns / 64-byte struct page. Mike Rapoport ran a similar test on
> an OpenPower (S812LC 8348-21C) with a Power8 processor and 128GB of RAM. His
> results per 64-byte struct page were 4.68ns before, and 4.59ns after this
> patch.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

I thought I had sent my ack already but obviously haven't.

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks for the updated version. I will try to get to the rest of the
series soon.

> ---
>  arch/sparc/include/asm/pgtable_64.h |   30 --------------------------
>  include/linux/mm.h                  |   41 ++++++++++++++++++++++++++++++++---
>  2 files changed, 38 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> index 1393a8ac596b..22500c3be7a9 100644
> --- a/arch/sparc/include/asm/pgtable_64.h
> +++ b/arch/sparc/include/asm/pgtable_64.h
> @@ -231,36 +231,6 @@
>  extern struct page *mem_map_zero;
>  #define ZERO_PAGE(vaddr)	(mem_map_zero)
>  
> -/* This macro must be updated when the size of struct page grows above 80
> - * or reduces below 64.
> - * The idea that compiler optimizes out switch() statement, and only
> - * leaves clrx instructions
> - */
> -#define	mm_zero_struct_page(pp) do {					\
> -	unsigned long *_pp = (void *)(pp);				\
> -									\
> -	 /* Check that struct page is either 64, 72, or 80 bytes */	\
> -	BUILD_BUG_ON(sizeof(struct page) & 7);				\
> -	BUILD_BUG_ON(sizeof(struct page) < 64);				\
> -	BUILD_BUG_ON(sizeof(struct page) > 80);				\
> -									\
> -	switch (sizeof(struct page)) {					\
> -	case 80:							\
> -		_pp[9] = 0;	/* fallthrough */			\
> -	case 72:							\
> -		_pp[8] = 0;	/* fallthrough */			\
> -	default:							\
> -		_pp[7] = 0;						\
> -		_pp[6] = 0;						\
> -		_pp[5] = 0;						\
> -		_pp[4] = 0;						\
> -		_pp[3] = 0;						\
> -		_pp[2] = 0;						\
> -		_pp[1] = 0;						\
> -		_pp[0] = 0;						\
> -	}								\
> -} while (0)
> -
>  /* PFNs are real physical page numbers.  However, mem_map only begins to record
>   * per-page information starting at pfn_base.  This is to handle systems where
>   * the first physical page in the machine is at some huge physical address,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fcf9cc9d535f..6e2c9631af05 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -98,10 +98,45 @@ static inline void set_max_mapnr(unsigned long limit) { }
>  
>  /*
>   * On some architectures it is expensive to call memset() for small sizes.
> - * Those architectures should provide their own implementation of "struct page"
> - * zeroing by defining this macro in <asm/pgtable.h>.
> + * If an architecture decides to implement their own version of
> + * mm_zero_struct_page they should wrap the defines below in a #ifndef and
> + * define their own version of this macro in <asm/pgtable.h>
>   */
> -#ifndef mm_zero_struct_page
> +#if BITS_PER_LONG == 64
> +/* This function must be updated when the size of struct page grows above 80
> + * or reduces below 56. The idea that compiler optimizes out switch()
> + * statement, and only leaves move/store instructions. Also the compiler can
> + * combine write statments if they are both assignments and can be reordered,
> + * this can result in several of the writes here being dropped.
> + */
> +#define	mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
> +static inline void __mm_zero_struct_page(struct page *page)
> +{
> +	unsigned long *_pp = (void *)page;
> +
> +	 /* Check that struct page is either 56, 64, 72, or 80 bytes */
> +	BUILD_BUG_ON(sizeof(struct page) & 7);
> +	BUILD_BUG_ON(sizeof(struct page) < 56);
> +	BUILD_BUG_ON(sizeof(struct page) > 80);
> +
> +	switch (sizeof(struct page)) {
> +	case 80:
> +		_pp[9] = 0;	/* fallthrough */
> +	case 72:
> +		_pp[8] = 0;	/* fallthrough */
> +	case 64:
> +		_pp[7] = 0;	/* fallthrough */
> +	case 56:
> +		_pp[6] = 0;
> +		_pp[5] = 0;
> +		_pp[4] = 0;
> +		_pp[3] = 0;
> +		_pp[2] = 0;
> +		_pp[1] = 0;
> +		_pp[0] = 0;
> +	}
> +}
> +#else
>  #define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
>  #endif
>  
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [mm PATCH v4 1/6] mm: Use mm_zero_struct_page from SPARC on all 64b architectures
@ 2018-10-29 20:12     ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2018-10-29 20:12 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: linux-mm, akpm, pavel.tatashin, dave.jiang, linux-kernel, willy,
	davem, yi.z.zhang, khalid.aziz, rppt, vbabka, sparclinux,
	dan.j.williams, ldufour, mgorman, mingo, kirill.shutemov

On Wed 17-10-18 16:54:08, Alexander Duyck wrote:
> This change makes it so that we use the same approach that was already in
> use on Sparc on all the architectures that support a 64b long.
> 
> This is mostly motivated by the fact that 7 to 10 store/move instructions
> are likely always going to be faster than having to call into a function
> that is not specialized for handling page init.
> 
> An added advantage to doing it this way is that the compiler can get away
> with combining writes in the __init_single_page call. As a result the
> memset call will be reduced to only about 4 write operations, or at least
> that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
> count/mapcount seem to be cancelling out at least 4 of the 8 assignments on
> my system.
> 
> One change I had to make to the function was to reduce the minimum struct
> page size to 56 bytes to support some powerpc64 configurations.
> 
> This change should introduce no change on SPARC since it already had this
> code. In the case of x86_64 I saw a reduction from 3.75s to 2.80s when
> initializing 384GB of RAM per node. Pavel Tatashin tested on a system with
> Broadcom's Stingray CPU and 48GB of RAM and found that __init_single_page()
> takes 19.30ns / 64-byte struct page before this patch and with this patch
> it takes 17.33ns / 64-byte struct page. Mike Rapoport ran a similar test on
> an OpenPower (S812LC 8348-21C) with a Power8 processor and 128GB of RAM. His
> results per 64-byte struct page were 4.68ns before, and 4.59ns after this
> patch.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

I thought I had sent my ack already but obviously haven't.

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks for the updated version. I will try to get to the rest of the
series soon.

> ---
>  arch/sparc/include/asm/pgtable_64.h |   30 --------------------------
>  include/linux/mm.h                  |   41 ++++++++++++++++++++++++++++++++---
>  2 files changed, 38 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> index 1393a8ac596b..22500c3be7a9 100644
> --- a/arch/sparc/include/asm/pgtable_64.h
> +++ b/arch/sparc/include/asm/pgtable_64.h
> @@ -231,36 +231,6 @@
>  extern struct page *mem_map_zero;
>  #define ZERO_PAGE(vaddr)	(mem_map_zero)
>  
> -/* This macro must be updated when the size of struct page grows above 80
> - * or reduces below 64.
> - * The idea that compiler optimizes out switch() statement, and only
> - * leaves clrx instructions
> - */
> -#define	mm_zero_struct_page(pp) do {					\
> -	unsigned long *_pp = (void *)(pp);				\
> -									\
> -	 /* Check that struct page is either 64, 72, or 80 bytes */	\
> -	BUILD_BUG_ON(sizeof(struct page) & 7);				\
> -	BUILD_BUG_ON(sizeof(struct page) < 64);				\
> -	BUILD_BUG_ON(sizeof(struct page) > 80);				\
> -									\
> -	switch (sizeof(struct page)) {					\
> -	case 80:							\
> -		_pp[9] = 0;	/* fallthrough */			\
> -	case 72:							\
> -		_pp[8] = 0;	/* fallthrough */			\
> -	default:							\
> -		_pp[7] = 0;						\
> -		_pp[6] = 0;						\
> -		_pp[5] = 0;						\
> -		_pp[4] = 0;						\
> -		_pp[3] = 0;						\
> -		_pp[2] = 0;						\
> -		_pp[1] = 0;						\
> -		_pp[0] = 0;						\
> -	}								\
> -} while (0)
> -
>  /* PFNs are real physical page numbers.  However, mem_map only begins to record
>   * per-page information starting at pfn_base.  This is to handle systems where
>   * the first physical page in the machine is at some huge physical address,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fcf9cc9d535f..6e2c9631af05 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -98,10 +98,45 @@ static inline void set_max_mapnr(unsigned long limit) { }
>  
>  /*
>   * On some architectures it is expensive to call memset() for small sizes.
> - * Those architectures should provide their own implementation of "struct page"
> - * zeroing by defining this macro in <asm/pgtable.h>.
> + * If an architecture decides to implement their own version of
> + * mm_zero_struct_page they should wrap the defines below in a #ifndef and
> + * define their own version of this macro in <asm/pgtable.h>
>   */
> -#ifndef mm_zero_struct_page
> +#if BITS_PER_LONG == 64
> +/* This function must be updated when the size of struct page grows above 80
> + * or reduces below 56. The idea that compiler optimizes out switch()
> + * statement, and only leaves move/store instructions. Also the compiler can
> + * combine write statments if they are both assignments and can be reordered,
> + * this can result in several of the writes here being dropped.
> + */
> +#define	mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
> +static inline void __mm_zero_struct_page(struct page *page)
> +{
> +	unsigned long *_pp = (void *)page;
> +
> +	 /* Check that struct page is either 56, 64, 72, or 80 bytes */
> +	BUILD_BUG_ON(sizeof(struct page) & 7);
> +	BUILD_BUG_ON(sizeof(struct page) < 56);
> +	BUILD_BUG_ON(sizeof(struct page) > 80);
> +
> +	switch (sizeof(struct page)) {
> +	case 80:
> +		_pp[9] = 0;	/* fallthrough */
> +	case 72:
> +		_pp[8] = 0;	/* fallthrough */
> +	case 64:
> +		_pp[7] = 0;	/* fallthrough */
> +	case 56:
> +		_pp[6] = 0;
> +		_pp[5] = 0;
> +		_pp[4] = 0;
> +		_pp[3] = 0;
> +		_pp[2] = 0;
> +		_pp[1] = 0;
> +		_pp[0] = 0;
> +	}
> +}
> +#else
>  #define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
>  #endif
>  
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [mm PATCH v4 3/6] mm: Use memblock/zone specific iterator for handling deferred page init
  2018-10-17 23:54   ` Alexander Duyck
@ 2018-10-31 15:40     ` Pasha Tatashin
  -1 siblings, 0 replies; 28+ messages in thread
From: Pasha Tatashin @ 2018-10-31 15:40 UTC (permalink / raw)
  To: Alexander Duyck, linux-mm, akpm
  Cc: Pasha Tatashin, mhocko, dave.jiang, linux-kernel, willy, davem,
	yi.z.zhang, khalid.aziz, rppt, vbabka, sparclinux,
	dan.j.williams, ldufour, mgorman, mingo, kirill.shutemov



On 10/17/18 7:54 PM, Alexander Duyck wrote:
> This patch introduces a new iterator for_each_free_mem_pfn_range_in_zone.
> 
> This iterator will take care of making sure a given memory range provided
> is in fact contained within a zone. It takes care of all the bounds checking
> we were doing in deferred_grow_zone, and deferred_init_memmap. In addition
> it should help to speed up the search a bit by iterating until the end of a
> range is greater than the start of the zone pfn range, and will exit
> completely if the start is beyond the end of the zone.
> 
> This patch adds yet another iterator called
> for_each_free_mem_pfn_range_in_zone_from and then uses it to support
> initializing and freeing pages in groups no larger than MAX_ORDER_NR_PAGES.
> By doing this we can greatly improve the cache locality of the pages while
> we do several loops over them in the init and freeing process.
> 
> We are able to tighten the loops as a result since we only really need the
> checks for first_init_pfn in our first iteration and after that we can
> assume that all future values will be greater than this. So I have added a
> function called deferred_init_mem_pfn_range_in_zone that primes the
> iterators and if it fails we can just exit.
> 
> On my x86_64 test system with 384GB of memory per node I saw a reduction in
> initialization time from 1.85s to 1.38s as a result of this patch.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Hi Alex,

Could you please split this patch into two parts:

1. Add deferred_init_maxorder()
2. Add memblock iterator?

This would allow better bisection in case of problems. Changing two
loops into deferred_init_maxorder(), while a good idea, is still
non-trivial and might lead to bugs.

Thank you,
Pavel

> ---
>  include/linux/memblock.h |   58 +++++++++++++++
>  mm/memblock.c            |   63 ++++++++++++++++
>  mm/page_alloc.c          |  176 ++++++++++++++++++++++++++++++++--------------
>  3 files changed, 242 insertions(+), 55 deletions(-)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index aee299a6aa76..2ddd1bafdd03 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -178,6 +178,25 @@ void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
>  			      p_start, p_end, p_nid))
>  
>  /**
> + * for_each_mem_range_from - iterate through memblock areas from type_a and not
> + * included in type_b. Or just type_a if type_b is NULL.
> + * @i: u64 used as loop variable
> + * @type_a: ptr to memblock_type to iterate
> + * @type_b: ptr to memblock_type which excludes from the iteration
> + * @nid: node selector, %NUMA_NO_NODE for all nodes
> + * @flags: pick from blocks based on memory attributes
> + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> + * @p_nid: ptr to int for nid of the range, can be %NULL
> + */
> +#define for_each_mem_range_from(i, type_a, type_b, nid, flags,		\
> +			   p_start, p_end, p_nid)			\
> +	for (i = 0, __next_mem_range(&i, nid, flags, type_a, type_b,	\
> +				     p_start, p_end, p_nid);		\
> +	     i != (u64)ULLONG_MAX;					\
> +	     __next_mem_range(&i, nid, flags, type_a, type_b,		\
> +			      p_start, p_end, p_nid))
> +/**
>   * for_each_mem_range_rev - reverse iterate through memblock areas from
>   * type_a and not included in type_b. Or just type_a if type_b is NULL.
>   * @i: u64 used as loop variable
> @@ -248,6 +267,45 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
>  	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
>  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> +				  unsigned long *out_spfn,
> +				  unsigned long *out_epfn);
> +/**
> + * for_each_free_mem_range_in_zone - iterate through zone specific free
> + * memblock areas
> + * @i: u64 used as loop variable
> + * @zone: zone in which all of the memory blocks reside
> + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> + *
> + * Walks over free (memory && !reserved) areas of memblock in a specific
> + * zone. Available as soon as memblock is initialized.
> + */
> +#define for_each_free_mem_pfn_range_in_zone(i, zone, p_start, p_end)	\
> +	for (i = 0,							\
> +	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end);	\
> +	     i != (u64)ULLONG_MAX;					\
> +	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
> +
> +/**
> + * for_each_free_mem_range_in_zone_from - iterate through zone specific
> + * free memblock areas from a given point
> + * @i: u64 used as loop variable
> + * @zone: zone in which all of the memory blocks reside
> + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> + *
> + * Walks over free (memory && !reserved) areas of memblock in a specific
> + * zone, continuing from current position. Available as soon as memblock is
> + * initialized.
> + */
> +#define for_each_free_mem_pfn_range_in_zone_from(i, zone, p_start, p_end) \
> +	for (; i != (u64)ULLONG_MAX;					  \
> +	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
> +
> +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
> +
>  /**
>   * for_each_free_mem_range - iterate through free memblock areas
>   * @i: u64 used as loop variable
> diff --git a/mm/memblock.c b/mm/memblock.c
> index f2ef3915a356..ab3545e356b7 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1239,6 +1239,69 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
>  	return 0;
>  }
>  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +/**
> + * __next_mem_pfn_range_in_zone - iterator for for_each_*_range_in_zone()
> + *
> + * @idx: pointer to u64 loop variable
> + * @zone: zone in which all of the memory blocks reside
> + * @out_start: ptr to ulong for start pfn of the range, can be %NULL
> + * @out_end: ptr to ulong for end pfn of the range, can be %NULL
> + *
> + * This function is meant to be a zone/pfn specific wrapper for the
> + * for_each_mem_range type iterators. Specifically they are used in the
> + * deferred memory init routines and as such we were duplicating much of
> + * this logic throughout the code. So instead of having it in multiple
> + * locations it seemed like it would make more sense to centralize this to
> + * one new iterator that does everything they need.
> + */
> +void __init_memblock
> +__next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> +			     unsigned long *out_spfn, unsigned long *out_epfn)
> +{
> +	int zone_nid = zone_to_nid(zone);
> +	phys_addr_t spa, epa;
> +	int nid;
> +
> +	__next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
> +			 &memblock.memory, &memblock.reserved,
> +			 &spa, &epa, &nid);
> +
> +	while (*idx != ULLONG_MAX) {
> +		unsigned long epfn = PFN_DOWN(epa);
> +		unsigned long spfn = PFN_UP(spa);
> +
> +		/*
> +		 * Verify the end is at least past the start of the zone and
> +		 * that we have at least one PFN to initialize.
> +		 */
> +		if (zone->zone_start_pfn < epfn && spfn < epfn) {
> +			/* if we went too far just stop searching */
> +			if (zone_end_pfn(zone) <= spfn)
> +				break;
> +
> +			if (out_spfn)
> +				*out_spfn = max(zone->zone_start_pfn, spfn);
> +			if (out_epfn)
> +				*out_epfn = min(zone_end_pfn(zone), epfn);
> +
> +			return;
> +		}
> +
> +		__next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
> +				 &memblock.memory, &memblock.reserved,
> +				 &spa, &epa, &nid);
> +	}
> +
> +	/* signal end of iteration */
> +	*idx = ULLONG_MAX;
> +	if (out_spfn)
> +		*out_spfn = ULONG_MAX;
> +	if (out_epfn)
> +		*out_epfn = 0;
> +}
> +
> +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
>  
>  #ifdef CONFIG_HAVE_MEMBLOCK_PFN_VALID
>  unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a766a15fad81..20e9eb35d75d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1512,19 +1512,103 @@ static unsigned long  __init deferred_init_pages(struct zone *zone,
>  	return (nr_pages);
>  }
>  
> +/*
> + * This function is meant to pre-load the iterator for the zone init.
> + * Specifically it walks through the ranges until we are caught up to the
> + * first_init_pfn value and exits there. If we never encounter the value we
> + * return false indicating there are no valid ranges left.
> + */
> +static bool __init
> +deferred_init_mem_pfn_range_in_zone(u64 *i, struct zone *zone,
> +				    unsigned long *spfn, unsigned long *epfn,
> +				    unsigned long first_init_pfn)
> +{
> +	u64 j;
> +
> +	/*
> +	 * Start out by walking through the ranges in this zone that have
> +	 * already been initialized. We don't need to do anything with them
> +	 * so we just need to flush them out of the system.
> +	 */
> +	for_each_free_mem_pfn_range_in_zone(j, zone, spfn, epfn) {
> +		if (*epfn <= first_init_pfn)
> +			continue;
> +		if (*spfn < first_init_pfn)
> +			*spfn = first_init_pfn;
> +		*i = j;
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
> +/*
> + * Initialize and free pages. We do it in two loops: first we initialize
> + * struct page, than free to buddy allocator, because while we are
> + * freeing pages we can access pages that are ahead (computing buddy
> + * page in __free_one_page()).
> + *
> + * In order to try and keep some memory in the cache we have the loop
> + * broken along max page order boundaries. This way we will not cause
> + * any issues with the buddy page computation.
> + */
> +static unsigned long __init
> +deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
> +		       unsigned long *end_pfn)
> +{
> +	unsigned long mo_pfn = ALIGN(*start_pfn + 1, MAX_ORDER_NR_PAGES);
> +	unsigned long spfn = *start_pfn, epfn = *end_pfn;
> +	unsigned long nr_pages = 0;
> +	u64 j = *i;
> +
> +	/* First we loop through and initialize the page values */
> +	for_each_free_mem_pfn_range_in_zone_from(j, zone, &spfn, &epfn) {
> +		unsigned long t;
> +
> +		if (mo_pfn <= spfn)
> +			break;
> +
> +		t = min(mo_pfn, epfn);
> +		nr_pages += deferred_init_pages(zone, spfn, t);
> +
> +		if (mo_pfn <= epfn)
> +			break;
> +	}
> +
> +	/* Reset values and now loop through freeing pages as needed */
> +	j = *i;
> +
> +	for_each_free_mem_pfn_range_in_zone_from(j, zone, start_pfn, end_pfn) {
> +		unsigned long t;
> +
> +		if (mo_pfn <= *start_pfn)
> +			break;
> +
> +		t = min(mo_pfn, *end_pfn);
> +		deferred_free_pages(*start_pfn, t);
> +		*start_pfn = t;
> +
> +		if (mo_pfn < *end_pfn)
> +			break;
> +	}
> +
> +	/* Store our current values to be reused on the next iteration */
> +	*i = j;
> +
> +	return nr_pages;
> +}
> +
>  /* Initialise remaining memory on a node */
>  static int __init deferred_init_memmap(void *data)
>  {
>  	pg_data_t *pgdat = data;
> -	int nid = pgdat->node_id;
> +	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> +	unsigned long spfn = 0, epfn = 0, nr_pages = 0;
> +	unsigned long first_init_pfn, flags;
>  	unsigned long start = jiffies;
> -	unsigned long nr_pages = 0;
> -	unsigned long spfn, epfn, first_init_pfn, flags;
> -	phys_addr_t spa, epa;
> -	int zid;
>  	struct zone *zone;
> -	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>  	u64 i;
> +	int zid;
>  
>  	/* Bind memory initialisation thread to a local node if possible */
>  	if (!cpumask_empty(cpumask))
> @@ -1549,31 +1633,30 @@ static int __init deferred_init_memmap(void *data)
>  		if (first_init_pfn < zone_end_pfn(zone))
>  			break;
>  	}
> -	first_init_pfn = max(zone->zone_start_pfn, first_init_pfn);
> +
> +	/* If the zone is empty somebody else may have cleared out the zone */
> +	if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
> +						 first_init_pfn)) {
> +		pgdat_resize_unlock(pgdat, &flags);
> +		pgdat_init_report_one_done();
> +		return 0;
> +	}
>  
>  	/*
> -	 * Initialize and free pages. We do it in two loops: first we initialize
> -	 * struct page, than free to buddy allocator, because while we are
> -	 * freeing pages we can access pages that are ahead (computing buddy
> -	 * page in __free_one_page()).
> +	 * Initialize and free pages in MAX_ORDER sized increments so
> +	 * that we can avoid introducing any issues with the buddy
> +	 * allocator.
>  	 */
> -	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> -		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> -		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
> -		nr_pages += deferred_init_pages(zone, spfn, epfn);
> -	}
> -	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> -		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> -		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
> -		deferred_free_pages(spfn, epfn);
> -	}
> +	while (spfn < epfn)
> +		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +
>  	pgdat_resize_unlock(pgdat, &flags);
>  
>  	/* Sanity check that the next zone really is unpopulated */
>  	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
>  
> -	pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_pages,
> -					jiffies_to_msecs(jiffies - start));
> +	pr_info("node %d initialised, %lu pages in %ums\n",
> +		pgdat->node_id,	nr_pages, jiffies_to_msecs(jiffies - start));
>  
>  	pgdat_init_report_one_done();
>  	return 0;
> @@ -1604,14 +1687,11 @@ static int __init deferred_init_memmap(void *data)
>  static noinline bool __init
>  deferred_grow_zone(struct zone *zone, unsigned int order)
>  {
> -	int zid = zone_idx(zone);
> -	int nid = zone_to_nid(zone);
> -	pg_data_t *pgdat = NODE_DATA(nid);
>  	unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
> -	unsigned long nr_pages = 0;
> -	unsigned long first_init_pfn, spfn, epfn, t, flags;
> +	pg_data_t *pgdat = zone->zone_pgdat;
>  	unsigned long first_deferred_pfn = pgdat->first_deferred_pfn;
> -	phys_addr_t spa, epa;
> +	unsigned long spfn, epfn, flags;
> +	unsigned long nr_pages = 0;
>  	u64 i;
>  
>  	/* Only the last zone may have deferred pages */
> @@ -1640,37 +1720,23 @@ static int __init deferred_init_memmap(void *data)
>  		return true;
>  	}
>  
> -	first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
> -
> -	if (first_init_pfn >= pgdat_end_pfn(pgdat)) {
> +	/* If the zone is empty somebody else may have cleared out the zone */
> +	if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
> +						 first_deferred_pfn)) {
>  		pgdat_resize_unlock(pgdat, &flags);
> -		return false;
> +		return true;
>  	}
>  
> -	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> -		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> -		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
> -
> -		while (spfn < epfn && nr_pages < nr_pages_needed) {
> -			t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
> -			first_deferred_pfn = min(t, epfn);
> -			nr_pages += deferred_init_pages(zone, spfn,
> -							first_deferred_pfn);
> -			spfn = first_deferred_pfn;
> -		}
> -
> -		if (nr_pages >= nr_pages_needed)
> -			break;
> +	/*
> +	 * Initialize and free pages in MAX_ORDER sized increments so
> +	 * that we can avoid introducing any issues with the buddy
> +	 * allocator.
> +	 */
> +	while (spfn < epfn && nr_pages < nr_pages_needed) {
> +		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +		first_deferred_pfn = spfn;
>  	}
>  
> -	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> -		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> -		epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
> -		deferred_free_pages(spfn, epfn);
> -
> -		if (first_deferred_pfn == epfn)
> -			break;
> -	}
>  	pgdat->first_deferred_pfn = first_deferred_pfn;
>  	pgdat_resize_unlock(pgdat, &flags);
>  
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread
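
The MAX_ORDER chunking that deferred_init_maxorder() relies on comes down to
one ALIGN() computation per step. The stand-alone sketch below walks a sample
pfn range and prints the chunk boundaries; the 1024 value is only an assumed
stand-in for MAX_ORDER_NR_PAGES and the range endpoints are invented for the
example.

#include <stdio.h>

#define MO_NR_PAGES	1024UL			/* assumed MAX_ORDER_NR_PAGES */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	/* invented free range inside one zone */
	unsigned long spfn = 1000, epfn = 5000;

	while (spfn < epfn) {
		/* same step as deferred_init_maxorder(): advance to the next
		 * max-order boundary, but never past the end of the range
		 */
		unsigned long mo_pfn = ALIGN_UP(spfn + 1, MO_NR_PAGES);
		unsigned long chunk_end = mo_pfn < epfn ? mo_pfn : epfn;

		printf("init + free pfns [%lu, %lu)\n", spfn, chunk_end);
		spfn = chunk_end;
	}
	return 0;
}

Each printed interval corresponds to one init pass immediately followed by one
free pass over the same struct pages, which is what keeps them warm in the
cache between the two loops.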

* Re: [mm PATCH v4 3/6] mm: Use memblock/zone specific iterator for handling deferred page init
  2018-10-31 15:40     ` Pasha Tatashin
@ 2018-10-31 16:05       ` Alexander Duyck
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Duyck @ 2018-10-31 16:05 UTC (permalink / raw)
  To: Pasha Tatashin, linux-mm, akpm
  Cc: mhocko, dave.jiang, linux-kernel, willy, davem, yi.z.zhang,
	khalid.aziz, rppt, vbabka, sparclinux, dan.j.williams, ldufour,
	mgorman, mingo, kirill.shutemov

On Wed, 2018-10-31 at 15:40 +0000, Pasha Tatashin wrote:
> 
> On 10/17/18 7:54 PM, Alexander Duyck wrote:
> > This patch introduces a new iterator for_each_free_mem_pfn_range_in_zone.
> > 
> > This iterator will take care of making sure a given memory range provided
> > is in fact contained within a zone. It takes care of all the bounds checking
> > we were doing in deferred_grow_zone, and deferred_init_memmap. In addition
> > it should help to speed up the search a bit by iterating until the end of a
> > range is greater than the start of the zone pfn range, and will exit
> > completely if the start is beyond the end of the zone.
> > 
> > This patch adds yet another iterator called
> > for_each_free_mem_pfn_range_in_zone_from and then uses it to support
> > initializing and freeing pages in groups no larger than MAX_ORDER_NR_PAGES.
> > By doing this we can greatly improve the cache locality of the pages while
> > we do several loops over them in the init and freeing process.
> > 
> > We are able to tighten the loops as a result since we only really need the
> > checks for first_init_pfn in our first iteration and after that we can
> > assume that all future values will be greater than this. So I have added a
> > function called deferred_init_mem_pfn_range_in_zone that primes the
> > iterators and if it fails we can just exit.
> > 
> > On my x86_64 test system with 384GB of memory per node I saw a reduction in
> > initialization time from 1.85s to 1.38s as a result of this patch.
> > 
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
> Hi Alex,
> 
> Could you please split this patch into two parts:
> 
> 1. Add deferred_init_maxorder()
> 2. Add memblock iterator?
> 
> This would allow better bisection in case of problems. Changing two
> loops into deferred_init_maxorder(), while a good idea, is still
> non-trivial and might lead to bugs.
> 
> Thank you,
> Pavel

I can do that, but I will need to flip the order. I will add the new
iterator first and then deferred_init_maxorder. Otherwise the
intermediate step ends up being too much throw-away code.
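
For reference, the end state in deferred_init_memmap() is roughly the
following (condensed from the page_alloc.c hunk in this patch; locking
and the completion reporting are elided):

	unsigned long spfn = 0, epfn = 0, nr_pages = 0;
	u64 i;

	/* Prime the iterator; if the zone is already handled, bail out. */
	if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
						 first_init_pfn))
		return 0;

	/*
	 * Walk the remaining free ranges in MAX_ORDER sized increments,
	 * initializing struct pages and then freeing each chunk to the
	 * buddy allocator.
	 */
	while (spfn < epfn)
		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);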

- Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [mm PATCH v4 3/6] mm: Use memblock/zone specific iterator for handling deferred page init
  2018-10-31 16:05       ` Alexander Duyck
@ 2018-10-31 16:06         ` Pasha Tatashin
  -1 siblings, 0 replies; 28+ messages in thread
From: Pasha Tatashin @ 2018-10-31 16:06 UTC (permalink / raw)
  To: Alexander Duyck, Pasha Tatashin, linux-mm, akpm
  Cc: mhocko, dave.jiang, linux-kernel, willy, davem, yi.z.zhang,
	khalid.aziz, rppt, vbabka, sparclinux, dan.j.williams, ldufour,
	mgorman, mingo, kirill.shutemov



On 10/31/18 12:05 PM, Alexander Duyck wrote:
> On Wed, 2018-10-31 at 15:40 +0000, Pasha Tatashin wrote:
>>
>> On 10/17/18 7:54 PM, Alexander Duyck wrote:
>>> This patch introduces a new iterator for_each_free_mem_pfn_range_in_zone.
>>>
>>> This iterator will take care of making sure a given memory range provided
>>> is in fact contained within a zone. It takes care of all the bounds checking
>>> we were doing in deferred_grow_zone, and deferred_init_memmap. In addition
>>> it should help to speed up the search a bit by iterating until the end of a
>>> range is greater than the start of the zone pfn range, and will exit
>>> completely if the start is beyond the end of the zone.
>>>
>>> This patch adds yet another iterator called
>>> for_each_free_mem_range_in_zone_from and then uses it to support
>>> initializing and freeing pages in groups no larger than MAX_ORDER_NR_PAGES.
>>> By doing this we can greatly improve the cache locality of the pages while
>>> we do several loops over them in the init and freeing process.
>>>
>>> We are able to tighten the loops as a result since we only really need the
>>> checks for first_init_pfn in our first iteration and after that we can
>>> assume that all future values will be greater than this. So I have added a
>>> function called deferred_init_mem_pfn_range_in_zone that primes the
>>> iterators and if it fails we can just exit.
>>>
>>> On my x86_64 test system with 384GB of memory per node I saw a reduction in
>>> initialization time from 1.85s to 1.38s as a result of this patch.
>>>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>
>> Hi Alex,
>>
>> Could you please split this patch into two parts:
>>
>> 1. Add deferred_init_maxorder()
>> 2. Add memblock iterator?
>>
>> This would allow better bisection in case of problems. Changing two
>> loops into deferred_init_maxorder(), while a good idea, is still
>> non-trivial and might lead to bugs.
>>
>> Thank you,
>> Pavel
> 
> I can do that, but I will need to flip the order. I will add the new
> iterator first and then deferred_init_maxorder. Otherwise the
> intermediate step ends up being too much throw-away code.

That sounds good.

Thank you,
Pavel

> 
> - Alex
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [mm PATCH v4 3/6] mm: Use memblock/zone specific iterator for handling deferred page init
  2018-10-31 15:40     ` Pasha Tatashin
@ 2018-11-01  6:17       ` Mike Rapoport
  -1 siblings, 0 replies; 28+ messages in thread
From: Mike Rapoport @ 2018-11-01  6:17 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Alexander Duyck, linux-mm, akpm, mhocko, dave.jiang,
	linux-kernel, willy, davem, yi.z.zhang, khalid.aziz, rppt,
	vbabka, sparclinux, dan.j.williams, ldufour, mgorman, mingo,
	kirill.shutemov

On Wed, Oct 31, 2018 at 03:40:02PM +0000, Pasha Tatashin wrote:
> 
> 
> On 10/17/18 7:54 PM, Alexander Duyck wrote:
> > This patch introduces a new iterator for_each_free_mem_pfn_range_in_zone.
> > 
> > This iterator will take care of making sure a given memory range provided
> > is in fact contained within a zone. It takes care of all the bounds checking
> > we were doing in deferred_grow_zone, and deferred_init_memmap. In addition
> > it should help to speed up the search a bit by iterating until the end of a
> > range is greater than the start of the zone pfn range, and will exit
> > completely if the start is beyond the end of the zone.
> > 
> > This patch adds yet another iterator called
> > for_each_free_mem_range_in_zone_from and then uses it to support
> > initializing and freeing pages in groups no larger than MAX_ORDER_NR_PAGES.
> > By doing this we can greatly improve the cache locality of the pages while
> > we do several loops over them in the init and freeing process.
> > 
> > We are able to tighten the loops as a result since we only really need the
> > checks for first_init_pfn in our first iteration and after that we can
> > assume that all future values will be greater than this. So I have added a
> > function called deferred_init_mem_pfn_range_in_zone that primes the
> > iterators and if it fails we can just exit.
> > 
> > On my x86_64 test system with 384GB of memory per node I saw a reduction in
> > initialization time from 1.85s to 1.38s as a result of this patch.
> > 
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
 
[ ... ] 

> > ---
> >  include/linux/memblock.h |   58 +++++++++++++++
> >  mm/memblock.c            |   63 ++++++++++++++++
> >  mm/page_alloc.c          |  176 ++++++++++++++++++++++++++++++++--------------
> >  3 files changed, 242 insertions(+), 55 deletions(-)
> > 
> > diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> > index aee299a6aa76..2ddd1bafdd03 100644
> > --- a/include/linux/memblock.h
> > +++ b/include/linux/memblock.h
> > @@ -178,6 +178,25 @@ void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
> >  			      p_start, p_end, p_nid))
> >  
> >  /**
> > + * for_each_mem_range_from - iterate through memblock areas from type_a and not
> > + * included in type_b. Or just type_a if type_b is NULL.
> > + * @i: u64 used as loop variable
> > + * @type_a: ptr to memblock_type to iterate
> > + * @type_b: ptr to memblock_type which excludes from the iteration
> > + * @nid: node selector, %NUMA_NO_NODE for all nodes
> > + * @flags: pick from blocks based on memory attributes
> > + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> > + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> > + * @p_nid: ptr to int for nid of the range, can be %NULL
> > + */
> > +#define for_each_mem_range_from(i, type_a, type_b, nid, flags,		\
> > +			   p_start, p_end, p_nid)			\
> > +	for (i = 0, __next_mem_range(&i, nid, flags, type_a, type_b,	\
> > +				     p_start, p_end, p_nid);		\
> > +	     i != (u64)ULLONG_MAX;					\
> > +	     __next_mem_range(&i, nid, flags, type_a, type_b,		\
> > +			      p_start, p_end, p_nid))
> > +/**
> >   * for_each_mem_range_rev - reverse iterate through memblock areas from
> >   * type_a and not included in type_b. Or just type_a if type_b is NULL.
> >   * @i: u64 used as loop variable
> > @@ -248,6 +267,45 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
> >  	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
> >  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> >  
> > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT

Sorry for jumping late, but I've noticed this only now.
Do the new iterators have to be restricted by
CONFIG_DEFERRED_STRUCT_PAGE_INIT?

> > +void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> > +				  unsigned long *out_spfn,
> > +				  unsigned long *out_epfn);
> > +/**
> > + * for_each_free_mem_range_in_zone - iterate through zone specific free
> > + * memblock areas
> > + * @i: u64 used as loop variable
> > + * @zone: zone in which all of the memory blocks reside
> > + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> > + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> > + *
> > + * Walks over free (memory && !reserved) areas of memblock in a specific
> > + * zone. Available as soon as memblock is initialized.
> > + */
> > +#define for_each_free_mem_pfn_range_in_zone(i, zone, p_start, p_end)	\
> > +	for (i = 0,							\
> > +	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end);	\
> > +	     i != (u64)ULLONG_MAX;					\
> > +	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
> > +
> > +/**
> > + * for_each_free_mem_range_in_zone_from - iterate through zone specific
> > + * free memblock areas from a given point
> > + * @i: u64 used as loop variable
> > + * @zone: zone in which all of the memory blocks reside
> > + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> > + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> > + *
> > + * Walks over free (memory && !reserved) areas of memblock in a specific
> > + * zone, continuing from current position. Available as soon as memblock is
> > + * initialized.
> > + */
> > +#define for_each_free_mem_pfn_range_in_zone_from(i, zone, p_start, p_end) \
> > +	for (; i != (u64)ULLONG_MAX;					  \
> > +	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
> > +
> > +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
> > +
> >  /**
> >   * for_each_free_mem_range - iterate through free memblock areas
> >   * @i: u64 used as loop variable
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index f2ef3915a356..ab3545e356b7 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -1239,6 +1239,69 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
> >  	return 0;
> >  }
> >  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> > +/**
> > + * __next_mem_pfn_range_in_zone - iterator for for_each_*_range_in_zone()
> > + *
> > + * @idx: pointer to u64 loop variable
> > + * @zone: zone in which all of the memory blocks reside
> > + * @out_start: ptr to ulong for start pfn of the range, can be %NULL
> > + * @out_end: ptr to ulong for end pfn of the range, can be %NULL
> > + *
> > + * This function is meant to be a zone/pfn specific wrapper for the
> > + * for_each_mem_range type iterators. Specifically they are used in the
> > + * deferred memory init routines and as such we were duplicating much of
> > + * this logic throughout the code. So instead of having it in multiple
> > + * locations it seemed like it would make more sense to centralize this to
> > + * one new iterator that does everything they need.
> > + */
> > +void __init_memblock
> > +__next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> > +			     unsigned long *out_spfn, unsigned long *out_epfn)
> > +{
> > +	int zone_nid = zone_to_nid(zone);
> > +	phys_addr_t spa, epa;
> > +	int nid;
> > +
> > +	__next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
> > +			 &memblock.memory, &memblock.reserved,
> > +			 &spa, &epa, &nid);
> > +
> > +	while (*idx != ULLONG_MAX) {
> > +		unsigned long epfn = PFN_DOWN(epa);
> > +		unsigned long spfn = PFN_UP(spa);
> > +
> > +		/*
> > +		 * Verify the end is at least past the start of the zone and
> > +		 * that we have at least one PFN to initialize.
> > +		 */
> > +		if (zone->zone_start_pfn < epfn && spfn < epfn) {
> > +			/* if we went too far just stop searching */
> > +			if (zone_end_pfn(zone) <= spfn)
> > +				break;
> > +
> > +			if (out_spfn)
> > +				*out_spfn = max(zone->zone_start_pfn, spfn);
> > +			if (out_epfn)
> > +				*out_epfn = min(zone_end_pfn(zone), epfn);
> > +
> > +			return;
> > +		}
> > +
> > +		__next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
> > +				 &memblock.memory, &memblock.reserved,
> > +				 &spa, &epa, &nid);
> > +	}
> > +
> > +	/* signal end of iteration */
> > +	*idx = ULLONG_MAX;
> > +	if (out_spfn)
> > +		*out_spfn = ULONG_MAX;
> > +	if (out_epfn)
> > +		*out_epfn = 0;
> > +}
> > +
> > +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
> >  
> >  #ifdef CONFIG_HAVE_MEMBLOCK_PFN_VALID
> >  unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index a766a15fad81..20e9eb35d75d 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1512,19 +1512,103 @@ static unsigned long  __init deferred_init_pages(struct zone *zone,
> >  	return (nr_pages);
> >  }
> >  
> > +/*
> > + * This function is meant to pre-load the iterator for the zone init.
> > + * Specifically it walks through the ranges until we are caught up to the
> > + * first_init_pfn value and exits there. If we never encounter the value we
> > + * return false indicating there are no valid ranges left.
> > + */
> > +static bool __init
> > +deferred_init_mem_pfn_range_in_zone(u64 *i, struct zone *zone,
> > +				    unsigned long *spfn, unsigned long *epfn,
> > +				    unsigned long first_init_pfn)
> > +{
> > +	u64 j;
> > +
> > +	/*
> > +	 * Start out by walking through the ranges in this zone that have
> > +	 * already been initialized. We don't need to do anything with them
> > +	 * so we just need to flush them out of the system.
> > +	 */
> > +	for_each_free_mem_pfn_range_in_zone(j, zone, spfn, epfn) {
> > +		if (*epfn <= first_init_pfn)
> > +			continue;
> > +		if (*spfn < first_init_pfn)
> > +			*spfn = first_init_pfn;
> > +		*i = j;
> > +		return true;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> > +/*
> > + * Initialize and free pages. We do it in two loops: first we initialize
> > + * struct page, than free to buddy allocator, because while we are
> > + * freeing pages we can access pages that are ahead (computing buddy
> > + * page in __free_one_page()).
> > + *
> > + * In order to try and keep some memory in the cache we have the loop
> > + * broken along max page order boundaries. This way we will not cause
> > + * any issues with the buddy page computation.
> > + */
> > +static unsigned long __init
> > +deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
> > +		       unsigned long *end_pfn)
> > +{
> > +	unsigned long mo_pfn = ALIGN(*start_pfn + 1, MAX_ORDER_NR_PAGES);
> > +	unsigned long spfn = *start_pfn, epfn = *end_pfn;
> > +	unsigned long nr_pages = 0;
> > +	u64 j = *i;
> > +
> > +	/* First we loop through and initialize the page values */
> > +	for_each_free_mem_pfn_range_in_zone_from(j, zone, &spfn, &epfn) {
> > +		unsigned long t;
> > +
> > +		if (mo_pfn <= spfn)
> > +			break;
> > +
> > +		t = min(mo_pfn, epfn);
> > +		nr_pages += deferred_init_pages(zone, spfn, t);
> > +
> > +		if (mo_pfn <= epfn)
> > +			break;
> > +	}
> > +
> > +	/* Reset values and now loop through freeing pages as needed */
> > +	j = *i;
> > +
> > +	for_each_free_mem_pfn_range_in_zone_from(j, zone, start_pfn, end_pfn) {
> > +		unsigned long t;
> > +
> > +		if (mo_pfn <= *start_pfn)
> > +			break;
> > +
> > +		t = min(mo_pfn, *end_pfn);
> > +		deferred_free_pages(*start_pfn, t);
> > +		*start_pfn = t;
> > +
> > +		if (mo_pfn < *end_pfn)
> > +			break;
> > +	}
> > +
> > +	/* Store our current values to be reused on the next iteration */
> > +	*i = j;
> > +
> > +	return nr_pages;
> > +}
> > +
> >  /* Initialise remaining memory on a node */
> >  static int __init deferred_init_memmap(void *data)
> >  {
> >  	pg_data_t *pgdat = data;
> > -	int nid = pgdat->node_id;
> > +	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> > +	unsigned long spfn = 0, epfn = 0, nr_pages = 0;
> > +	unsigned long first_init_pfn, flags;
> >  	unsigned long start = jiffies;
> > -	unsigned long nr_pages = 0;
> > -	unsigned long spfn, epfn, first_init_pfn, flags;
> > -	phys_addr_t spa, epa;
> > -	int zid;
> >  	struct zone *zone;
> > -	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> >  	u64 i;
> > +	int zid;
> >  
> >  	/* Bind memory initialisation thread to a local node if possible */
> >  	if (!cpumask_empty(cpumask))
> > @@ -1549,31 +1633,30 @@ static int __init deferred_init_memmap(void *data)
> >  		if (first_init_pfn < zone_end_pfn(zone))
> >  			break;
> >  	}
> > -	first_init_pfn = max(zone->zone_start_pfn, first_init_pfn);
> > +
> > +	/* If the zone is empty somebody else may have cleared out the zone */
> > +	if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
> > +						 first_init_pfn)) {
> > +		pgdat_resize_unlock(pgdat, &flags);
> > +		pgdat_init_report_one_done();
> > +		return 0;
> > +	}
> >  
> >  	/*
> > -	 * Initialize and free pages. We do it in two loops: first we initialize
> > -	 * struct page, than free to buddy allocator, because while we are
> > -	 * freeing pages we can access pages that are ahead (computing buddy
> > -	 * page in __free_one_page()).
> > +	 * Initialize and free pages in MAX_ORDER sized increments so
> > +	 * that we can avoid introducing any issues with the buddy
> > +	 * allocator.
> >  	 */
> > -	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> > -		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> > -		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
> > -		nr_pages += deferred_init_pages(zone, spfn, epfn);
> > -	}
> > -	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> > -		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> > -		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
> > -		deferred_free_pages(spfn, epfn);
> > -	}
> > +	while (spfn < epfn)
> > +		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > +
> >  	pgdat_resize_unlock(pgdat, &flags);
> >  
> >  	/* Sanity check that the next zone really is unpopulated */
> >  	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
> >  
> > -	pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_pages,
> > -					jiffies_to_msecs(jiffies - start));
> > +	pr_info("node %d initialised, %lu pages in %ums\n",
> > +		pgdat->node_id,	nr_pages, jiffies_to_msecs(jiffies - start));
> >  
> >  	pgdat_init_report_one_done();
> >  	return 0;
> > @@ -1604,14 +1687,11 @@ static int __init deferred_init_memmap(void *data)
> >  static noinline bool __init
> >  deferred_grow_zone(struct zone *zone, unsigned int order)
> >  {
> > -	int zid = zone_idx(zone);
> > -	int nid = zone_to_nid(zone);
> > -	pg_data_t *pgdat = NODE_DATA(nid);
> >  	unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
> > -	unsigned long nr_pages = 0;
> > -	unsigned long first_init_pfn, spfn, epfn, t, flags;
> > +	pg_data_t *pgdat = zone->zone_pgdat;
> >  	unsigned long first_deferred_pfn = pgdat->first_deferred_pfn;
> > -	phys_addr_t spa, epa;
> > +	unsigned long spfn, epfn, flags;
> > +	unsigned long nr_pages = 0;
> >  	u64 i;
> >  
> >  	/* Only the last zone may have deferred pages */
> > @@ -1640,37 +1720,23 @@ static int __init deferred_init_memmap(void *data)
> >  		return true;
> >  	}
> >  
> > -	first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
> > -
> > -	if (first_init_pfn >= pgdat_end_pfn(pgdat)) {
> > +	/* If the zone is empty somebody else may have cleared out the zone */
> > +	if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
> > +						 first_deferred_pfn)) {
> >  		pgdat_resize_unlock(pgdat, &flags);
> > -		return false;
> > +		return true;
> >  	}
> >  
> > -	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> > -		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> > -		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
> > -
> > -		while (spfn < epfn && nr_pages < nr_pages_needed) {
> > -			t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
> > -			first_deferred_pfn = min(t, epfn);
> > -			nr_pages += deferred_init_pages(zone, spfn,
> > -							first_deferred_pfn);
> > -			spfn = first_deferred_pfn;
> > -		}
> > -
> > -		if (nr_pages >= nr_pages_needed)
> > -			break;
> > +	/*
> > +	 * Initialize and free pages in MAX_ORDER sized increments so
> > +	 * that we can avoid introducing any issues with the buddy
> > +	 * allocator.
> > +	 */
> > +	while (spfn < epfn && nr_pages < nr_pages_needed) {
> > +		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > +		first_deferred_pfn = spfn;
> >  	}
> >  
> > -	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> > -		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> > -		epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
> > -		deferred_free_pages(spfn, epfn);
> > -
> > -		if (first_deferred_pfn == epfn)
> > -			break;
> > -	}
> >  	pgdat->first_deferred_pfn = first_deferred_pfn;
> >  	pgdat_resize_unlock(pgdat, &flags);
> >  
> > 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [mm PATCH v4 3/6] mm: Use memblock/zone specific iterator for handling deferred page init
  2018-11-01  6:17       ` Mike Rapoport
@ 2018-11-01 15:10         ` Alexander Duyck
  -1 siblings, 0 replies; 28+ messages in thread
From: Alexander Duyck @ 2018-11-01 15:10 UTC (permalink / raw)
  To: Mike Rapoport, Pasha Tatashin
  Cc: linux-mm, akpm, mhocko, dave.jiang, linux-kernel, willy, davem,
	yi.z.zhang, khalid.aziz, rppt, vbabka, sparclinux,
	dan.j.williams, ldufour, mgorman, mingo, kirill.shutemov

On Thu, 2018-11-01 at 08:17 +0200, Mike Rapoport wrote:
> On Wed, Oct 31, 2018 at 03:40:02PM +0000, Pasha Tatashin wrote:
> > 
> > 
> > On 10/17/18 7:54 PM, Alexander Duyck wrote:
> > > This patch introduces a new iterator for_each_free_mem_pfn_range_in_zone.
> > > 
> > > This iterator will take care of making sure a given memory range provided
> > > is in fact contained within a zone. It takes care of all the bounds checking
> > > we were doing in deferred_grow_zone, and deferred_init_memmap. In addition
> > > it should help to speed up the search a bit by iterating until the end of a
> > > range is greater than the start of the zone pfn range, and will exit
> > > completely if the start is beyond the end of the zone.
> > > 
> > > This patch adds yet another iterator called
> > > for_each_free_mem_range_in_zone_from and then uses it to support
> > > initializing and freeing pages in groups no larger than MAX_ORDER_NR_PAGES.
> > > By doing this we can greatly improve the cache locality of the pages while
> > > we do several loops over them in the init and freeing process.
> > > 
> > > We are able to tighten the loops as a result since we only really need the
> > > checks for first_init_pfn in our first iteration and after that we can
> > > assume that all future values will be greater than this. So I have added a
> > > function called deferred_init_mem_pfn_range_in_zone that primes the
> > > iterators and if it fails we can just exit.
> > > 
> > > On my x86_64 test system with 384GB of memory per node I saw a reduction in
> > > initialization time from 1.85s to 1.38s as a result of this patch.
> > > 
> > > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
>  
> [ ... ] 
> 
> > > ---
> > >  include/linux/memblock.h |   58 +++++++++++++++
> > >  mm/memblock.c            |   63 ++++++++++++++++
> > >  mm/page_alloc.c          |  176 ++++++++++++++++++++++++++++++++--------------
> > >  3 files changed, 242 insertions(+), 55 deletions(-)
> > > 
> > > diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> > > index aee299a6aa76..2ddd1bafdd03 100644
> > > --- a/include/linux/memblock.h
> > > +++ b/include/linux/memblock.h
> > > @@ -178,6 +178,25 @@ void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
> > >  			      p_start, p_end, p_nid))
> > >  
> > >  /**
> > > + * for_each_mem_range_from - iterate through memblock areas from type_a and not
> > > + * included in type_b. Or just type_a if type_b is NULL.
> > > + * @i: u64 used as loop variable
> > > + * @type_a: ptr to memblock_type to iterate
> > > + * @type_b: ptr to memblock_type which excludes from the iteration
> > > + * @nid: node selector, %NUMA_NO_NODE for all nodes
> > > + * @flags: pick from blocks based on memory attributes
> > > + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> > > + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> > > + * @p_nid: ptr to int for nid of the range, can be %NULL
> > > + */
> > > +#define for_each_mem_range_from(i, type_a, type_b, nid, flags,		\
> > > +			   p_start, p_end, p_nid)			\
> > > +	for (i = 0, __next_mem_range(&i, nid, flags, type_a, type_b,	\
> > > +				     p_start, p_end, p_nid);		\
> > > +	     i != (u64)ULLONG_MAX;					\
> > > +	     __next_mem_range(&i, nid, flags, type_a, type_b,		\
> > > +			      p_start, p_end, p_nid))
> > > +/**
> > >   * for_each_mem_range_rev - reverse iterate through memblock areas from
> > >   * type_a and not included in type_b. Or just type_a if type_b is NULL.
> > >   * @i: u64 used as loop variable
> > > @@ -248,6 +267,45 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
> > >  	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
> > >  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> > >  
> > > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> 
> Sorry for jumping late, but I've noticed this only now.
> Do the new iterators have to be restricted by
> CONFIG_DEFERRED_STRUCT_PAGE_INIT?

They don't have to be. I just wrapped them since I figured it is better
to just strip the code if it isn't going to be used rather than leave
it floating around taking up space.
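
For reference, the wrapping in question is just this pattern in
include/linux/memblock.h (trimmed from the patch); dropping the #ifdef
would simply expose the declaration and the macros to every config, at
the cost of carrying code that nothing outside deferred init uses today:

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
				  unsigned long *out_spfn,
				  unsigned long *out_epfn);

/* walk free (memory && !reserved) pfn ranges that intersect @zone */
#define for_each_free_mem_pfn_range_in_zone(i, zone, p_start, p_end)	\
	for (i = 0,							\
	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end);	\
	     i != (u64)ULLONG_MAX;					\
	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */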

Thanks.

- Alex




^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2018-11-01 15:11 UTC | newest]

Thread overview: 28+ messages
-- links below jump to the message on this page --
2018-10-17 23:54 [mm PATCH v4 0/6] Deferred page init improvements Alexander Duyck
2018-10-17 23:54 ` Alexander Duyck
2018-10-17 23:54 ` [mm PATCH v4 1/6] mm: Use mm_zero_struct_page from SPARC on all 64b architectures Alexander Duyck
2018-10-17 23:54   ` Alexander Duyck
2018-10-18 18:29   ` Pavel Tatashin
2018-10-18 18:29     ` Pavel Tatashin
2018-10-29 20:12   ` Michal Hocko
2018-10-29 20:12     ` Michal Hocko
2018-10-17 23:54 ` [mm PATCH v4 2/6] mm: Drop meminit_pfn_in_nid as it is redundant Alexander Duyck
2018-10-17 23:54   ` Alexander Duyck
2018-10-17 23:54 ` [mm PATCH v4 3/6] mm: Use memblock/zone specific iterator for handling deferred page init Alexander Duyck
2018-10-17 23:54   ` Alexander Duyck
2018-10-31 15:40   ` Pasha Tatashin
2018-10-31 15:40     ` Pasha Tatashin
2018-10-31 16:05     ` Alexander Duyck
2018-10-31 16:05       ` Alexander Duyck
2018-10-31 16:06       ` Pasha Tatashin
2018-10-31 16:06         ` Pasha Tatashin
2018-11-01  6:17     ` Mike Rapoport
2018-11-01  6:17       ` Mike Rapoport
2018-11-01 15:10       ` Alexander Duyck
2018-11-01 15:10         ` Alexander Duyck
2018-10-17 23:54 ` [mm PATCH v4 4/6] mm: Move hot-plug specific memory init into separate functions and optimize Alexander Duyck
2018-10-17 23:54   ` Alexander Duyck
2018-10-17 23:54 ` [mm PATCH v4 5/6] mm: Add reserved flag setting to set_page_links Alexander Duyck
2018-10-17 23:54   ` Alexander Duyck
2018-10-17 23:54 ` [mm PATCH v4 6/6] mm: Use common iterator for deferred_init_pages and deferred_free_pages Alexander Duyck
2018-10-17 23:54   ` Alexander Duyck
