nvdimm.lists.linux.dev archive mirror
* [PATCH 0/4] Address issues slowing persistent memory initialization
@ 2018-09-10 23:43 Alexander Duyck
  2018-09-10 23:43 ` [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning Alexander Duyck
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Alexander Duyck @ 2018-09-10 23:43 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-nvdimm
  Cc: pavel.tatashin, mhocko, dave.hansen, jglisse, akpm, mingo,
	kirill.shutemov

This patch set is meant to be a v3 to my earlier patch set "Address issues
slowing memory init"[1]. However I have added 2 additional patches to
address issues seen in which NVDIMM memory was slow to initialize
especially on systems with multiple NUMA nodes.

Since v2 of the patch set I have replaced the config option to work around
the page init poisoning with a kernel parameter. I also updated one comment
based on input from Michal.

The third patch in this set is new and is meant to address the need to
defer some page initialization to outside of the hot-plug lock. It is
loosely based on the original patch set by Dan Williams to perform
asynchronous page init for ZONE_DEVICE pages[2]. However, it is based
more around the deferred page init model where memory init is deferred to a
fixed point, which in this case is to just outside of the hot-plug lock.

The fourth patch allows nvdimm init to be more node specific where
possible. I basically just copy/pasted the approach used in
pci_call_probe to allow for us to get the initialization code on the node
as close to the memory as possible. Doing so allows us to save considerably
on init time.

[1]: https://lkml.org/lkml/2018/9/5/924
[2]: https://lkml.org/lkml/2018/7/16/828

---

Alexander Duyck (4):
      mm: Provide kernel parameter to allow disabling page init poisoning
      mm: Create non-atomic version of SetPageReserved for init use
      mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
      nvdimm: Trigger the device probe on a cpu local to the device


 Documentation/admin-guide/kernel-parameters.txt |    8 ++
 drivers/nvdimm/bus.c                            |   45 ++++++++++
 include/linux/mm.h                              |    2 
 include/linux/page-flags.h                      |    9 ++
 kernel/memremap.c                               |   24 ++---
 mm/debug.c                                      |   16 +++
 mm/hmm.c                                        |   12 ++-
 mm/memblock.c                                   |    5 -
 mm/page_alloc.c                                 |  106 ++++++++++++++++++++++-
 mm/sparse.c                                     |    4 -
 10 files changed, 200 insertions(+), 31 deletions(-)

--
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning
  2018-09-10 23:43 [PATCH 0/4] Address issues slowing persistent memory initialization Alexander Duyck
@ 2018-09-10 23:43 ` Alexander Duyck
  2018-09-11  0:35   ` Alexander Duyck
                     ` (3 more replies)
  2018-09-10 23:43 ` [PATCH 2/4] mm: Create non-atomic version of SetPageReserved for init use Alexander Duyck
                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 31+ messages in thread
From: Alexander Duyck @ 2018-09-10 23:43 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-nvdimm
  Cc: pavel.tatashin, mhocko, dave.hansen, jglisse, akpm, mingo,
	kirill.shutemov

From: Alexander Duyck <alexander.h.duyck@intel.com>

On systems with a large amount of memory it can take a significant amount
of time to initialize all of the page structs with the PAGE_POISON_PATTERN
value. I have seen it take over 2 minutes to initialize a system with
over 12GB of RAM.

In order to work around the issue I had to disable CONFIG_DEBUG_VM and then
the boot time returned to something much more reasonable as the
arch_add_memory call completed in milliseconds versus seconds. However in
doing that I had to disable all of the other VM debugging on the system.

In order to work around a kernel that might have CONFIG_DEBUG_VM enabled on
a system that has a large amount of memory I have added a new kernel
parameter named "page_init_poison" that can be set to "off" in order to
disable it.
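
For readers outside the kernel tree, the gating described above can be sketched in user space. This is only an illustration, not the patch itself: the sketch_* names and the 0xff poison byte are assumptions made for the example, and the parser accepts just a handful of spellings rather than everything strtobool() takes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the kernel's poison pattern; the value is illustrative only. */
#define SKETCH_POISON_BYTE 0xff

/* Poisoning stays enabled unless the parameter turns it off. */
static bool sketch_poisoning = true;

/* Parse "on"/"off"/"1"/"0". Returns 0 on success, -1 on bad input. */
int sketch_parse_param(const char *buf)
{
	if (!buf)
		return -1;
	if (!strcmp(buf, "off") || !strcmp(buf, "0")) {
		sketch_poisoning = false;
		return 0;
	}
	if (!strcmp(buf, "on") || !strcmp(buf, "1")) {
		sketch_poisoning = true;
		return 0;
	}
	return -1;
}

/* Fill the region with the poison byte only when poisoning is enabled. */
void sketch_init_poison(void *mem, size_t size)
{
	if (sketch_poisoning)
		memset(mem, SKETCH_POISON_BYTE, size);
}
```

With the flag cleared, sketch_init_poison() reduces to a single branch, so the per-page memset cost disappears while the rest of the debug checks stay compiled in.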

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 Documentation/admin-guide/kernel-parameters.txt |    8 ++++++++
 include/linux/page-flags.h                      |    8 ++++++++
 mm/debug.c                                      |   16 ++++++++++++++++
 mm/memblock.c                                   |    5 ++---
 mm/sparse.c                                     |    4 +---
 5 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 64a3bf54b974..7b21e0b9c394 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3047,6 +3047,14 @@
 			off: turn off poisoning (default)
 			on: turn on poisoning
 
+	page_init_poison=	[KNL] Boot-time parameter changing the
+			state of poisoning of page structures during early
+			boot. Used to verify page metadata is not accessed
+			prior to initialization. Available with
+			CONFIG_DEBUG_VM=y.
+			off: turn off poisoning
+			on: turn on poisoning (default)
+
 	panic=		[KNL] Kernel behaviour on panic: delay <timeout>
 			timeout > 0: seconds before rebooting
 			timeout = 0: wait forever
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74bee8cecf4c..d00216cf00f8 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -162,6 +162,14 @@ static inline int PagePoisoned(const struct page *page)
 	return page->flags == PAGE_POISON_PATTERN;
 }
 
+#ifdef CONFIG_DEBUG_VM
+void page_init_poison(struct page *page, size_t size);
+#else
+static inline void page_init_poison(struct page *page, size_t size)
+{
+}
+#endif
+
 /*
  * Page flags policies wrt compound pages
  *
diff --git a/mm/debug.c b/mm/debug.c
index 38c926520c97..c5420422c0b5 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -175,4 +175,20 @@ void dump_mm(const struct mm_struct *mm)
 	);
 }
 
+static bool page_init_poisoning __read_mostly = true;
+
+static int __init page_init_poison_param(char *buf)
+{
+	if (!buf)
+		return -EINVAL;
+	return strtobool(buf, &page_init_poisoning);
+}
+early_param("page_init_poison", page_init_poison_param);
+
+void page_init_poison(struct page *page, size_t size)
+{
+	if (page_init_poisoning)
+		memset(page, PAGE_POISON_PATTERN, size);
+}
+EXPORT_SYMBOL_GPL(page_init_poison);
 #endif		/* CONFIG_DEBUG_VM */
diff --git a/mm/memblock.c b/mm/memblock.c
index 237944479d25..a85315083b5a 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1444,10 +1444,9 @@ void * __init memblock_virt_alloc_try_nid_raw(
 
 	ptr = memblock_virt_alloc_internal(size, align,
 					   min_addr, max_addr, nid);
-#ifdef CONFIG_DEBUG_VM
 	if (ptr && size > 0)
-		memset(ptr, PAGE_POISON_PATTERN, size);
-#endif
+		page_init_poison(ptr, size);
+
 	return ptr;
 }
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 10b07eea9a6e..67ad061f7fb8 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -696,13 +696,11 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat,
 		goto out;
 	}
 
-#ifdef CONFIG_DEBUG_VM
 	/*
 	 * Poison uninitialized struct pages in order to catch invalid flags
 	 * combinations.
 	 */
-	memset(memmap, PAGE_POISON_PATTERN, sizeof(struct page) * PAGES_PER_SECTION);
-#endif
+	page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
 
 	section_mark_present(ms);
 	sparse_init_one_section(ms, section_nr, memmap, usemap);


* [PATCH 2/4] mm: Create non-atomic version of SetPageReserved for init use
  2018-09-10 23:43 [PATCH 0/4] Address issues slowing persistent memory initialization Alexander Duyck
  2018-09-10 23:43 ` [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning Alexander Duyck
@ 2018-09-10 23:43 ` Alexander Duyck
  2018-09-12 13:28   ` Pasha Tatashin
  2018-09-10 23:43 ` [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap Alexander Duyck
  2018-09-10 23:44 ` [PATCH 4/4] nvdimm: Trigger the device probe on a cpu local to the device Alexander Duyck
  3 siblings, 1 reply; 31+ messages in thread
From: Alexander Duyck @ 2018-09-10 23:43 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-nvdimm
  Cc: pavel.tatashin, mhocko, dave.hansen, jglisse, akpm, mingo,
	kirill.shutemov

From: Alexander Duyck <alexander.h.duyck@intel.com>

It doesn't make much sense to use the atomic SetPageReserved at init time
when we are using memset to clear the memory and manipulating the page
flags via simple "&=" and "|=" operations in __init_single_page.

This patch adds a non-atomic version __SetPageReserved that can be used
during page init and shows about a 10% improvement in initialization times
on the systems I have available for testing. On those systems I saw
initialization times drop from around 35 seconds to around 32 seconds to
initialize a 3TB block of persistent memory.
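
The difference between the two flavors can be sketched in user space with C11 atomics. This is a rough analogy only: struct sketch_page, the bit position, and the function names are assumptions for illustration and do not match the kernel's page-flag macros.

```c
#include <stdatomic.h>

#define SKETCH_PG_RESERVED (1UL << 0)	/* hypothetical bit position */

struct sketch_page {
	atomic_ulong flags;
};

/* Atomic read-modify-write: safe when other threads may touch flags. */
void sketch_set_reserved(struct sketch_page *page)
{
	atomic_fetch_or(&page->flags, SKETCH_PG_RESERVED);
}

/*
 * Separate load and store: cheaper, but only valid while no other
 * thread can see the page, as is the case during init.
 */
void sketch_set_reserved_nonatomic(struct sketch_page *page)
{
	unsigned long f = atomic_load_explicit(&page->flags,
					       memory_order_relaxed);

	atomic_store_explicit(&page->flags, f | SKETCH_PG_RESERVED,
			      memory_order_relaxed);
}

/* Report whether the reserved bit is currently set. */
int sketch_is_reserved(struct sketch_page *page)
{
	return (atomic_load(&page->flags) & SKETCH_PG_RESERVED) != 0;
}
```

Both setters leave the flag word in the same state; the non-atomic one simply avoids the locked read-modify-write instruction on every page.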

I tried adding a bit of documentation based on commit <f1dd2cd13c4> ("mm,
memory_hotplug: do not associate hotadded memory to zones until online").

Ideally the reserved flag should be set earlier since there is a brief
window where the page is initialized via __init_single_page and we have
not set the PG_reserved flag. I'm leaving that for a future patch set as
that will require a more significant refactor.

Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 include/linux/page-flags.h |    1 +
 mm/page_alloc.c            |   17 +++++++++++++++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index d00216cf00f8..1b1f8e0378ae 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -300,6 +300,7 @@ static inline void page_init_poison(struct page *page, size_t size)
 
 PAGEFLAG(Reserved, reserved, PF_NO_COMPOUND)
 	__CLEARPAGEFLAG(Reserved, reserved, PF_NO_COMPOUND)
+	__SETPAGEFLAG(Reserved, reserved, PF_NO_COMPOUND)
 PAGEFLAG(SwapBacked, swapbacked, PF_NO_TAIL)
 	__CLEARPAGEFLAG(SwapBacked, swapbacked, PF_NO_TAIL)
 	__SETPAGEFLAG(SwapBacked, swapbacked, PF_NO_TAIL)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 89d2a2ab3fe6..a9b095a72fd9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1231,7 +1231,12 @@ void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
 			/* Avoid false-positive PageTail() */
 			INIT_LIST_HEAD(&page->lru);
 
-			SetPageReserved(page);
+			/*
+			 * no need for atomic set_bit because the struct
+			 * page is not visible yet so nobody should
+			 * access it yet.
+			 */
+			__SetPageReserved(page);
 		}
 	}
 }
@@ -5517,8 +5522,16 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 not_early:
 		page = pfn_to_page(pfn);
 		__init_single_page(page, pfn, zone, nid);
+
+		/*
+		 * Mark page reserved as it will need to wait for onlining
+		 * phase for it to be fully associated with a zone.
+		 *
+		 * We can use the non-atomic __set_bit operation for setting
+		 * the flag as we are still initializing the pages.
+		 */
 		if (context == MEMMAP_HOTPLUG)
-			SetPageReserved(page);
+			__SetPageReserved(page);
 
 		/*
 		 * Mark the block movable so that blocks are reserved for


* [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
  2018-09-10 23:43 [PATCH 0/4] Address issues slowing persistent memory initialization Alexander Duyck
  2018-09-10 23:43 ` [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning Alexander Duyck
  2018-09-10 23:43 ` [PATCH 2/4] mm: Create non-atomic version of SetPageReserved for init use Alexander Duyck
@ 2018-09-10 23:43 ` Alexander Duyck
  2018-09-11  7:49   ` kbuild test robot
                     ` (3 more replies)
  2018-09-10 23:44 ` [PATCH 4/4] nvdimm: Trigger the device probe on a cpu local to the device Alexander Duyck
  3 siblings, 4 replies; 31+ messages in thread
From: Alexander Duyck @ 2018-09-10 23:43 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-nvdimm
  Cc: pavel.tatashin, mhocko, dave.hansen, jglisse, akpm, mingo,
	kirill.shutemov

From: Alexander Duyck <alexander.h.duyck@intel.com>

The ZONE_DEVICE pages were being initialized in two locations. One was with
the memory_hotplug lock held and another was outside of that lock. The
problem with this is that it was nearly doubling the memory initialization
time. Instead of doing this twice, once while holding a global lock and
once without, I am opting to defer the initialization to the one outside of
the lock. This allows us to avoid serializing the overhead for memory init
and we can instead focus on per-node init times.
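
The shape of that change can be sketched in user space as follows. It is a simplified model under stated assumptions: a pthread mutex stands in for the memory hotplug lock, an int array stands in for the struct pages, and every sketch_* name is hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

struct sketch_range {
	int *pages;		/* stand-in for the struct page array */
	size_t nr;		/* number of entries in pages */
	int registered;		/* set once the range has been published */
};

/* Stand-in for the global memory hotplug lock. */
static pthread_mutex_t sketch_hotplug_lock = PTHREAD_MUTEX_INITIALIZER;

/* Short critical section: publish the range, do not walk every page. */
void sketch_add_memory(struct sketch_range *r)
{
	pthread_mutex_lock(&sketch_hotplug_lock);
	r->registered = 1;
	pthread_mutex_unlock(&sketch_hotplug_lock);
}

/* Per-page work, done exactly once, after the lock has been dropped. */
void sketch_init_pages(struct sketch_range *r, int owner)
{
	for (size_t i = 0; i < r->nr; i++)
		r->pages[i] = owner;
}

/* Registration under the lock first, then the deferred bulk init. */
void sketch_map_device(struct sketch_range *r, int owner)
{
	sketch_add_memory(r);
	sketch_init_pages(r, owner);
}
```

Because the loop in sketch_init_pages() runs without the global lock, two callers working on different ranges would no longer serialize on each other for the expensive part.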

One issue I encountered is that devm_memremap_pages and
hmm_devmem_pages_create were initializing only the pgmap field the same
way. One wasn't initializing hmm_data, and the other was initializing it to
a poison value. Since this is something that is exposed to the driver in
the case of hmm I am opting for a third option and just initializing
hmm_data to 0 since this is going to be exposed to unknown third party
drivers.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 include/linux/mm.h |    2 +
 kernel/memremap.c  |   24 +++++---------
 mm/hmm.c           |   12 ++++---
 mm/page_alloc.c    |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 105 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a61ebe8ad4ca..47b440bb3050 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -848,6 +848,8 @@ static inline bool is_zone_device_page(const struct page *page)
 {
 	return page_zonenum(page) == ZONE_DEVICE;
 }
+extern void memmap_init_zone_device(struct zone *, unsigned long,
+				    unsigned long, struct dev_pagemap *);
 #else
 static inline bool is_zone_device_page(const struct page *page)
 {
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 5b8600d39931..d0c32e473f82 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -175,10 +175,10 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	struct vmem_altmap *altmap = pgmap->altmap_valid ?
 			&pgmap->altmap : NULL;
 	struct resource *res = &pgmap->res;
-	unsigned long pfn, pgoff, order;
+	struct dev_pagemap *conflict_pgmap;
 	pgprot_t pgprot = PAGE_KERNEL;
+	unsigned long pgoff, order;
 	int error, nid, is_ram;
-	struct dev_pagemap *conflict_pgmap;
 
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
@@ -256,19 +256,13 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	if (error)
 		goto err_add_memory;
 
-	for_each_device_pfn(pfn, pgmap) {
-		struct page *page = pfn_to_page(pfn);
-
-		/*
-		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
-		 * pointer.  It is a bug if a ZONE_DEVICE page is ever
-		 * freed or placed on a driver-private list.  Seed the
-		 * storage with LIST_POISON* values.
-		 */
-		list_del(&page->lru);
-		page->pgmap = pgmap;
-		percpu_ref_get(pgmap->ref);
-	}
+	/*
+	 * Initialization of the pages has been deferred until now in order
+	 * to allow us to do the work while not holding the hotplug lock.
+	 */
+	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
+				align_start >> PAGE_SHIFT,
+				align_size >> PAGE_SHIFT, pgmap);
 
 	devm_add_action(dev, devm_memremap_pages_release, pgmap);
 
diff --git a/mm/hmm.c b/mm/hmm.c
index c968e49f7a0c..774d684fa2b4 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -1024,7 +1024,6 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	resource_size_t key, align_start, align_size, align_end;
 	struct device *device = devmem->device;
 	int ret, nid, is_ram;
-	unsigned long pfn;
 
 	align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
 	align_size = ALIGN(devmem->resource->start +
@@ -1109,11 +1108,14 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 				align_size >> PAGE_SHIFT, NULL);
 	mem_hotplug_done();
 
-	for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
-		struct page *page = pfn_to_page(pfn);
+	/*
+	 * Initialization of the pages has been deferred until now in order
+	 * to allow us to do the work while not holding the hotplug lock.
+	 */
+	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
+				align_start >> PAGE_SHIFT,
+				align_size >> PAGE_SHIFT, &devmem->pagemap);
 
-		page->pgmap = &devmem->pagemap;
-	}
 	return 0;
 
 error_add_memory:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a9b095a72fd9..81a3fd942c45 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5454,6 +5454,83 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
 #endif
 }
 
+#ifdef CONFIG_ZONE_DEVICE
+void __ref memmap_init_zone_device(struct zone *zone, unsigned long pfn,
+				   unsigned long size,
+				   struct dev_pagemap *pgmap)
+{
+	struct pglist_data *pgdat = zone->zone_pgdat;
+	unsigned long zone_idx = zone_idx(zone);
+	unsigned long end_pfn = pfn + size;
+	unsigned long start = jiffies;
+	int nid = pgdat->node_id;
+	unsigned long nr_pages;
+
+	if (WARN_ON_ONCE(!pgmap || !is_dev_zone(zone)))
+		return;
+
+	/*
+	 * The call to memmap_init_zone should have already taken care
+	 * of the pages reserved for the memmap, so we can just jump to
+	 * the end of that region and start processing the device pages.
+	 */
+	if (pgmap->altmap_valid) {
+		struct vmem_altmap *altmap = &pgmap->altmap;
+
+		pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
+	}
+
+	/* Record the number of pages we are about to initialize */
+	nr_pages = end_pfn - pfn;
+
+	for (; pfn < end_pfn; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		__init_single_page(page, pfn, zone_idx, nid);
+
+		/*
+		 * Mark page reserved as it will need to wait for onlining
+		 * phase for it to be fully associated with a zone.
+		 *
+		 * We can use the non-atomic __set_bit operation for setting
+		 * the flag as we are still initializing the pages.
+		 */
+		__SetPageReserved(page);
+
+		/*
+		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
+		 * pointer and hmm_data.  It is a bug if a ZONE_DEVICE
+		 * page is ever freed or placed on a driver-private list.
+		 */
+		page->pgmap = pgmap;
+		page->hmm_data = 0;
+
+		/*
+		 * Mark the block movable so that blocks are reserved for
+		 * movable at startup. This will force kernel allocations
+		 * to reserve their blocks rather than leaking throughout
+		 * the address space during boot when many long-lived
+		 * kernel allocations are made.
+		 *
+		 * bitmap is created for zone's valid pfn range. but memmap
+		 * can be created for invalid pages (for alignment)
+		 * check here not to call set_pageblock_migratetype() against
+		 * pfn out of zone.
+		 *
+		 * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
+		 * because this is done early in sparse_add_one_section
+		 */
+		if (!(pfn & (pageblock_nr_pages - 1))) {
+			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+			cond_resched();
+		}
+	}
+
+	pr_info("%s initialised, %lu pages in %ums\n", dev_name(pgmap->dev),
+		nr_pages, jiffies_to_msecs(jiffies - start));
+}
+
+#endif
 /*
  * Initially all pages are reserved - free ones are freed
  * up by free_all_bootmem() once the early boot process is
@@ -5477,10 +5554,18 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 
 	/*
 	 * Honor reservation requested by the driver for this ZONE_DEVICE
-	 * memory
+	 * memory. We limit the total number of pages to initialize to just
+	 * those that might contain the memory mapping. We will defer the
+	 * ZONE_DEVICE page initialization until after we have released
+	 * the hotplug lock.
 	 */
-	if (altmap && start_pfn == altmap->base_pfn)
+	if (altmap && start_pfn == altmap->base_pfn) {
 		start_pfn += altmap->reserve;
+		end_pfn = altmap->base_pfn +
+			  vmem_altmap_offset(altmap);
+	} else if (zone == ZONE_DEVICE) {
+		end_pfn = start_pfn;
+	}
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		/*


* [PATCH 4/4] nvdimm: Trigger the device probe on a cpu local to the device
  2018-09-10 23:43 [PATCH 0/4] Address issues slowing persistent memory initialization Alexander Duyck
                   ` (2 preceding siblings ...)
  2018-09-10 23:43 ` [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap Alexander Duyck
@ 2018-09-10 23:44 ` Alexander Duyck
  2018-09-11  0:37   ` Alexander Duyck
                     ` (2 more replies)
  3 siblings, 3 replies; 31+ messages in thread
From: Alexander Duyck @ 2018-09-10 23:44 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-nvdimm
  Cc: pavel.tatashin, mhocko, dave.hansen, jglisse, akpm, mingo,
	kirill.shutemov

From: Alexander Duyck <alexander.h.duyck@intel.com>

This patch is based off of the pci_call_probe function used to initialize
PCI devices. The general idea here is to move the probe call to a location
that is local to the memory being initialized. By doing this we can shave
significant time off of the total time needed for initialization.

With this patch applied I see a significant reduction in overall init time
as without it the init varied between 23 and 37 seconds to initialize a 3GB
node. With this patch applied the variance is only between 23 and 26
seconds to initialize each node.

I hope to refine this further in the future by combining this logic into
the async_schedule_domain code that is already in use. By doing that it
would likely make this functionality redundant.
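
The CPU selection step can be sketched in user space like this. It is an illustration only: each cpumask is modeled as a single 64-bit word, the node_online() check is folded into the range test, and the sketch_* names and SKETCH_NR_CPUS are assumptions for the example.

```c
#include <stdint.h>

#define SKETCH_NR_CPUS 64	/* hypothetical; each mask is one 64-bit word */

/*
 * Return the lowest-numbered CPU that belongs to the node and is online,
 * or SKETCH_NR_CPUS when the node is invalid or has no online CPU. The
 * caller is expected to fall back to running on the current CPU then.
 */
unsigned int sketch_pick_local_cpu(int node, int nr_nodes,
				   const uint64_t *node_cpumask,
				   uint64_t online_mask)
{
	uint64_t candidates;
	unsigned int cpu;

	if (node < 0 || node >= nr_nodes)
		return SKETCH_NR_CPUS;

	candidates = node_cpumask[node] & online_mask;
	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		if (candidates & (UINT64_C(1) << cpu))
			return cpu;

	return SKETCH_NR_CPUS;
}
```

Returning an out-of-range value for "no local CPU" mirrors the nr_cpu_ids convention in the patch, so the caller needs only one comparison to choose between the remote call and the direct call.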

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/nvdimm/bus.c |   45 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 8aae6dcc839f..5b73953176b1 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -27,6 +27,7 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/nd.h>
+#include <linux/cpu.h>
 #include "nd-core.h"
 #include "nd.h"
 #include "pfn.h"
@@ -90,6 +91,48 @@ static void nvdimm_bus_probe_end(struct nvdimm_bus *nvdimm_bus)
 	nvdimm_bus_unlock(&nvdimm_bus->dev);
 }
 
+struct nvdimm_drv_dev {
+	struct nd_device_driver *nd_drv;
+	struct device *dev;
+};
+
+static long __nvdimm_call_probe(void *_nddd)
+{
+	struct nvdimm_drv_dev *nddd = _nddd;
+	struct nd_device_driver *nd_drv = nddd->nd_drv;
+
+	return nd_drv->probe(nddd->dev);
+}
+
+static int nvdimm_call_probe(struct nd_device_driver *nd_drv,
+			     struct device *dev)
+{
+	struct nvdimm_drv_dev nddd = { nd_drv, dev };
+	int rc, node, cpu;
+
+	/*
+	 * Execute driver initialization on node where the device is
+	 * attached.  This way the driver will be able to access local
+	 * memory instead of having to initialize memory across nodes.
+	 */
+	node = dev_to_node(dev);
+
+	cpu_hotplug_disable();
+
+	if (node < 0 || node >= MAX_NUMNODES || !node_online(node))
+		cpu = nr_cpu_ids;
+	else
+		cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
+
+	if (cpu < nr_cpu_ids)
+		rc = work_on_cpu(cpu, __nvdimm_call_probe, &nddd);
+	else
+		rc = __nvdimm_call_probe(&nddd);
+
+	cpu_hotplug_enable();
+	return rc;
+}
+
 static int nvdimm_bus_probe(struct device *dev)
 {
 	struct nd_device_driver *nd_drv = to_nd_device_driver(dev->driver);
@@ -104,7 +147,7 @@ static int nvdimm_bus_probe(struct device *dev)
 			dev->driver->name, dev_name(dev));
 
 	nvdimm_bus_probe_start(nvdimm_bus);
-	rc = nd_drv->probe(dev);
+	rc = nvdimm_call_probe(nd_drv, dev);
 	if (rc == 0)
 		nd_region_probe_success(nvdimm_bus, dev);
 	else


* Re: [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning
  2018-09-10 23:43 ` [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning Alexander Duyck
@ 2018-09-11  0:35   ` Alexander Duyck
  2018-09-11 16:50   ` Dan Williams
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2018-09-11  0:35 UTC (permalink / raw)
  To: linux-mm, LKML, linux-nvdimm
  Cc: pavel.tatashin, Michal Hocko, Dave Hansen, jglisse,
	Andrew Morton, Ingo Molnar, Kirill A. Shutemov

On Mon, Sep 10, 2018 at 4:43 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> From: Alexander Duyck <alexander.h.duyck@intel.com>
>
> On systems with a large amount of memory it can take a significant amount
> of time to initialize all of the page structs with the PAGE_POISON_PATTERN
> value. I have seen it take over 2 minutes to initialize a system with
> over 12GB of RAM.

Minor typo. I meant 12TB here, not 12GB.

- Alex

* Re: [PATCH 4/4] nvdimm: Trigger the device probe on a cpu local to the device
  2018-09-10 23:44 ` [PATCH 4/4] nvdimm: Trigger the device probe on a cpu local to the device Alexander Duyck
@ 2018-09-11  0:37   ` Alexander Duyck
  2018-09-12  5:48   ` Dan Williams
  2018-09-12 13:44   ` Pasha Tatashin
  2 siblings, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2018-09-11  0:37 UTC (permalink / raw)
  To: linux-mm, LKML, linux-nvdimm
  Cc: pavel.tatashin, Michal Hocko, Dave Hansen, jglisse,
	Andrew Morton, Ingo Molnar, Kirill A. Shutemov

On Mon, Sep 10, 2018 at 4:44 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> From: Alexander Duyck <alexander.h.duyck@intel.com>
>
> This patch is based off of the pci_call_probe function used to initialize
> PCI devices. The general idea here is to move the probe call to a location
> that is local to the memory being initialized. By doing this we can shave
> significant time off of the total time needed for initialization.
>
> With this patch applied I see a significant reduction in overall init time
> as without it the init varied between 23 and 37 seconds to initialize a 3GB
> node. With this patch applied the variance is only between 23 and 26
> seconds to initialize each node.

Same mistake here as in patch 1. It is 3TB, not 3GB. I will fix for
the next version.

- Alex

* Re: [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
  2018-09-10 23:43 ` [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap Alexander Duyck
@ 2018-09-11  7:49   ` kbuild test robot
  2018-09-11  7:54   ` kbuild test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 31+ messages in thread
From: kbuild test robot @ 2018-09-11  7:49 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: pavel.tatashin, mhocko, kirill.shutemov, linux-nvdimm,
	dave.hansen, linux-kernel, linux-mm, jglisse, kbuild-all, akpm,
	mingo

Hi Alexander,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.19-rc3]
[cannot apply to next-20180910]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Alexander-Duyck/Address-issues-slowing-persistent-memory-initialization/20180911-144536
config: x86_64-randconfig-x009-201836 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/asm-generic/bug.h:5:0,
                    from arch/x86/include/asm/bug.h:83,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/mm.h:9,
                    from mm/page_alloc.c:18:
   mm/page_alloc.c: In function 'memmap_init_zone':
   mm/page_alloc.c:5566:21: error: 'ZONE_DEVICE' undeclared (first use in this function); did you mean 'ZONE_MOVABLE'?
     } else if (zone == ZONE_DEVICE) {
                        ^
   include/linux/compiler.h:58:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> mm/page_alloc.c:5566:9: note: in expansion of macro 'if'
     } else if (zone == ZONE_DEVICE) {
            ^~
   mm/page_alloc.c:5566:21: note: each undeclared identifier is reported only once for each function it appears in
     } else if (zone == ZONE_DEVICE) {
                        ^
   include/linux/compiler.h:58:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> mm/page_alloc.c:5566:9: note: in expansion of macro 'if'
     } else if (zone == ZONE_DEVICE) {
            ^~

vim +/if +5566 mm/page_alloc.c

  5551	
  5552		if (highest_memmap_pfn < end_pfn - 1)
  5553			highest_memmap_pfn = end_pfn - 1;
  5554	
  5555		/*
  5556		 * Honor reservation requested by the driver for this ZONE_DEVICE
  5557		 * memory. We limit the total number of pages to initialize to just
  5558		 * those that might contain the memory mapping. We will defer the
  5559		 * ZONE_DEVICE page initialization until after we have released
  5560		 * the hotplug lock.
  5561		 */
  5562		if (altmap && start_pfn == altmap->base_pfn) {
  5563			start_pfn += altmap->reserve;
  5564			end_pfn = altmap->base_pfn +
  5565				  vmem_altmap_offset(altmap);
> 5566		} else if (zone == ZONE_DEVICE) {
  5567			end_pfn = start_pfn;
  5568		}
  5569	
  5570		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
  5571			/*
  5572			 * There can be holes in boot-time mem_map[]s handed to this
  5573			 * function.  They do not exist on hotplugged memory.
  5574			 */
  5575			if (context != MEMMAP_EARLY)
  5576				goto not_early;
  5577	
  5578			if (!early_pfn_valid(pfn))
  5579				continue;
  5580			if (!early_pfn_in_nid(pfn, nid))
  5581				continue;
  5582			if (!update_defer_init(pgdat, pfn, end_pfn, &nr_initialised))
  5583				break;
  5584	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

* Re: [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
  2018-09-10 23:43 ` [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap Alexander Duyck
  2018-09-11  7:49   ` kbuild test robot
@ 2018-09-11  7:54   ` kbuild test robot
  2018-09-11 22:35   ` Dan Williams
  2018-09-12 13:59   ` Pasha Tatashin
  3 siblings, 0 replies; 31+ messages in thread
From: kbuild test robot @ 2018-09-11  7:54 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: pavel.tatashin, mhocko, kirill.shutemov, linux-nvdimm,
	dave.hansen, linux-kernel, linux-mm, jglisse, kbuild-all, akpm,
	mingo

Hi Alexander,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.19-rc3]
[cannot apply to next-20180910]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Alexander-Duyck/Address-issues-slowing-persistent-memory-initialization/20180911-144536
config: openrisc-or1ksim_defconfig (attached as .config)
compiler: or1k-linux-gcc (GCC) 6.0.0 20160327 (experimental)
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=openrisc 

All errors (new ones prefixed by >>):

   mm/page_alloc.c: In function 'memmap_init_zone':
>> mm/page_alloc.c:5566:21: error: 'ZONE_DEVICE' undeclared (first use in this function)
     } else if (zone == ZONE_DEVICE) {
                        ^~~~~~~~~~~
   mm/page_alloc.c:5566:21: note: each undeclared identifier is reported only once for each function it appears in

vim +/ZONE_DEVICE +5566 mm/page_alloc.c

  5551	
  5552		if (highest_memmap_pfn < end_pfn - 1)
  5553			highest_memmap_pfn = end_pfn - 1;
  5554	
  5555		/*
  5556		 * Honor reservation requested by the driver for this ZONE_DEVICE
  5557		 * memory. We limit the total number of pages to initialize to just
  5558		 * those that might contain the memory mapping. We will defer the
  5559		 * ZONE_DEVICE page initialization until after we have released
  5560		 * the hotplug lock.
  5561		 */
  5562		if (altmap && start_pfn == altmap->base_pfn) {
  5563			start_pfn += altmap->reserve;
  5564			end_pfn = altmap->base_pfn +
  5565				  vmem_altmap_offset(altmap);
> 5566		} else if (zone == ZONE_DEVICE) {
  5567			end_pfn = start_pfn;
  5568		}
  5569	
  5570		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
  5571			/*
  5572			 * There can be holes in boot-time mem_map[]s handed to this
  5573			 * function.  They do not exist on hotplugged memory.
  5574			 */
  5575			if (context != MEMMAP_EARLY)
  5576				goto not_early;
  5577	
  5578			if (!early_pfn_valid(pfn))
  5579				continue;
  5580			if (!early_pfn_in_nid(pfn, nid))
  5581				continue;
  5582			if (!update_defer_init(pgdat, pfn, end_pfn, &nr_initialised))
  5583				break;
  5584	
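
For context, a minimal standalone sketch of why this configuration fails to build. In the kernel, ZONE_DEVICE is an enum zone_type member that is only declared under CONFIG_ZONE_DEVICE, so an unguarded comparison cannot compile when the option is off, as in the config used here. The reduced enum and the helper names zone_is_device_sketch() and clamp_end_pfn_sketch() below are hypothetical and only illustrate one way to keep the #ifdef in a single place; this is not the fix that was applied to the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Uncomment to model a kernel built with CONFIG_ZONE_DEVICE=y. */
/* #define CONFIG_ZONE_DEVICE 1 */

/* Reduced stand-in for the kernel's enum zone_type (assumption). */
enum zone_type_sketch {
	SKETCH_ZONE_NORMAL,
	SKETCH_ZONE_MOVABLE,
#ifdef CONFIG_ZONE_DEVICE
	SKETCH_ZONE_DEVICE,	/* only declared when the option is set */
#endif
	SKETCH_MAX_NR_ZONES
};

/*
 * Hypothetical helper: callers test for the device zone without
 * naming an enumerator that may not exist in this configuration.
 */
static inline bool zone_is_device_sketch(enum zone_type_sketch zone)
{
#ifdef CONFIG_ZONE_DEVICE
	return zone == SKETCH_ZONE_DEVICE;
#else
	(void)zone;
	return false;	/* no device zone in this configuration */
#endif
}

/*
 * Models the "else if" branch at line 5566: for the device zone the
 * range is collapsed so the init loop below it runs zero times.
 */
static unsigned long clamp_end_pfn_sketch(enum zone_type_sketch zone,
					  unsigned long start_pfn,
					  unsigned long end_pfn)
{
	assert(start_pfn <= end_pfn);
	return zone_is_device_sketch(zone) ? start_pfn : end_pfn;
}
```

With the option left undefined, as above, the helper always returns false and the pfn range is passed through unchanged.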


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning
  2018-09-10 23:43 ` [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning Alexander Duyck
  2018-09-11  0:35   ` Alexander Duyck
@ 2018-09-11 16:50   ` Dan Williams
  2018-09-11 20:01     ` Alexander Duyck
  2018-09-12 13:24   ` Pasha Tatashin
  2018-09-12 14:10   ` Michal Hocko
  3 siblings, 1 reply; 31+ messages in thread
From: Dan Williams @ 2018-09-11 16:50 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: pavel.tatashin, Michal Hocko, linux-nvdimm, Dave Hansen,
	Linux Kernel Mailing List, Linux MM, Jérôme Glisse,
	Andrew Morton, Ingo Molnar, Kirill A. Shutemov

On Mon, Sep 10, 2018 at 4:43 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
>
> On systems with a large amount of memory it can take a significant amount
> of time to initialize all of the page structs with the PAGE_POISON_PATTERN
> value. I have seen it take over 2 minutes to initialize a system with
> over 12GB of RAM.
>
> In order to work around the issue I had to disable CONFIG_DEBUG_VM and then
> the boot time returned to something much more reasonable as the
> arch_add_memory call completed in milliseconds versus seconds. However in
> doing that I had to disable all of the other VM debugging on the system.
>
> In order to work around a kernel that might have CONFIG_DEBUG_VM enabled on
> a system that has a large amount of memory I have added a new kernel
> parameter named "page_init_poison" that can be set to "off" in order to
> disable it.

In anticipation of potentially more DEBUG_VM options wanting runtime
control I'd propose creating a new "vm_debug=" option for this modeled
after "slub_debug=" along with a CONFIG_DEBUG_VM_ON to turn on all
options.

That way there is more differentiation for debug cases like this that
have significant performance impact when enabled.

CONFIG_DEBUG_VM leaves optional debug capabilities disabled by default
unless CONFIG_DEBUG_VM_ON is also set.
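
A rough userspace sketch of what a slub_debug-style "vm_debug=" parser could look like under the opt-in reading above. The flag letter 'P' for page init poisoning, the "-" spelling for "everything off", the VM_DEBUG_* names, and the function vm_debug_parse_sketch() are all hypothetical; nothing here comes from an actual patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; a real option would have one per debug feature. */
#define VM_DEBUG_PAGE_INIT_POISON	(1u << 0)
#define VM_DEBUG_ALL			VM_DEBUG_PAGE_INIT_POISON

/*
 * Parse the value of a hypothetical "vm_debug=" option.
 *   NULL or ""  -> enable every known option
 *   "-"         -> disable every option
 *   letters     -> enable only the listed options ('P' = page init poison)
 * Returns 0 on success, -1 on an unknown letter (flags left untouched).
 */
static int vm_debug_parse_sketch(const char *str, unsigned int *flags)
{
	unsigned int parsed = 0;

	assert(flags != NULL);

	if (str == NULL || *str == '\0') {
		*flags = VM_DEBUG_ALL;
		return 0;
	}
	if (str[0] == '-' && str[1] == '\0') {
		*flags = 0;
		return 0;
	}

	for (; *str != '\0'; str++) {
		switch (*str) {
		case 'P':
		case 'p':
			parsed |= VM_DEBUG_PAGE_INIT_POISON;
			break;
		default:
			return -1;	/* unknown option letter */
		}
	}

	*flags = parsed;
	return 0;
}
```

In this model the expensive checks stay off unless they are named on the command line, which is the differentiation being argued for.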

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning
  2018-09-11 16:50   ` Dan Williams
@ 2018-09-11 20:01     ` Alexander Duyck
  2018-09-11 20:24       ` Dan Williams
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander Duyck @ 2018-09-11 20:01 UTC (permalink / raw)
  To: dan.j.williams
  Cc: pavel.tatashin, Michal Hocko, linux-nvdimm, Dave Hansen, LKML,
	linux-mm, jglisse, Andrew Morton, Ingo Molnar,
	Kirill A. Shutemov

On Tue, Sep 11, 2018 at 9:50 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Mon, Sep 10, 2018 at 4:43 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> > From: Alexander Duyck <alexander.h.duyck@intel.com>
> >
> > On systems with a large amount of memory it can take a significant amount
> > of time to initialize all of the page structs with the PAGE_POISON_PATTERN
> > value. I have seen it take over 2 minutes to initialize a system with
> > over 12GB of RAM.
> >
> > In order to work around the issue I had to disable CONFIG_DEBUG_VM and then
> > the boot time returned to something much more reasonable as the
> > arch_add_memory call completed in milliseconds versus seconds. However in
> > doing that I had to disable all of the other VM debugging on the system.
> >
> > In order to work around a kernel that might have CONFIG_DEBUG_VM enabled on
> > a system that has a large amount of memory I have added a new kernel
> > parameter named "page_init_poison" that can be set to "off" in order to
> > disable it.
>
> In anticipation of potentially more DEBUG_VM options wanting runtime
> control I'd propose creating a new "vm_debug=" option for this modeled
> after "slub_debug=" along with a CONFIG_DEBUG_VM_ON to turn on all
> options.
>
> That way there is more differentiation for debug cases like this that
> have significant performance impact when enabled.
>
> CONFIG_DEBUG_VM leaves optional debug capabilities disabled by default
> unless CONFIG_DEBUG_VM_ON is also set.

Based on earlier discussions I would assume that CONFIG_DEBUG_VM would
imply CONFIG_DEBUG_VM_ON anyway since we don't want most of these
disabled by default.

In my mind we should be looking at a selective "vm_debug_disable="
instead of something that would be turning on features.

- Alex

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning
  2018-09-11 20:01     ` Alexander Duyck
@ 2018-09-11 20:24       ` Dan Williams
  0 siblings, 0 replies; 31+ messages in thread
From: Dan Williams @ 2018-09-11 20:24 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: pavel.tatashin, Michal Hocko, linux-nvdimm, Dave Hansen, LKML,
	linux-mm, Jérôme Glisse, Andrew Morton, Ingo Molnar,
	Kirill A. Shutemov

On Tue, Sep 11, 2018 at 1:01 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Tue, Sep 11, 2018 at 9:50 AM Dan Williams <dan.j.williams@intel.com> wrote:
>>
>> On Mon, Sep 10, 2018 at 4:43 PM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>> > From: Alexander Duyck <alexander.h.duyck@intel.com>
>> >
>> > On systems with a large amount of memory it can take a significant amount
>> > of time to initialize all of the page structs with the PAGE_POISON_PATTERN
>> > value. I have seen it take over 2 minutes to initialize a system with
>> > over 12GB of RAM.
>> >
>> > In order to work around the issue I had to disable CONFIG_DEBUG_VM and then
>> > the boot time returned to something much more reasonable as the
>> > arch_add_memory call completed in milliseconds versus seconds. However in
>> > doing that I had to disable all of the other VM debugging on the system.
>> >
>> > In order to work around a kernel that might have CONFIG_DEBUG_VM enabled on
>> > a system that has a large amount of memory I have added a new kernel
>> > parameter named "page_init_poison" that can be set to "off" in order to
>> > disable it.
>>
>> In anticipation of potentially more DEBUG_VM options wanting runtime
>> control I'd propose creating a new "vm_debug=" option for this modeled
>> after "slub_debug=" along with a CONFIG_DEBUG_VM_ON to turn on all
>> options.
>>
>> That way there is more differentiation for debug cases like this that
>> have significant performance impact when enabled.
>>
>> CONFIG_DEBUG_VM leaves optional debug capabilities disabled by default
>> unless CONFIG_DEBUG_VM_ON is also set.
>
> Based on earlier discussions I would assume that CONFIG_DEBUG_VM would
> imply CONFIG_DEBUG_VM_ON anyway since we don't want most of these
> disabled by default.
>
> In my mind we should be looking at a selective "vm_debug_disable="
> instead of something that would be turning on features.

Sorry, I missed those earlier discussions, so I won't push too hard if
this has been hashed out before. My motivation for opt-in is the fact that
at least one known distribution kernel, Fedora, is shipping with
CONFIG_DEBUG_VM=y. They also ship with CONFIG_SLUB, but not
SLUB_DEBUG_ON. If we are going to piecemeal enable some debug options
to be runtime controlled I think we should go further and start
clarifying the cheap vs the expensive checks, making the expensive
checks opt-in in the same spirit as SLUB_DEBUG.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
  2018-09-10 23:43 ` [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap Alexander Duyck
  2018-09-11  7:49   ` kbuild test robot
  2018-09-11  7:54   ` kbuild test robot
@ 2018-09-11 22:35   ` Dan Williams
  2018-09-12  0:51     ` Alexander Duyck
  2018-09-12 13:59   ` Pasha Tatashin
  3 siblings, 1 reply; 31+ messages in thread
From: Dan Williams @ 2018-09-11 22:35 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: pavel.tatashin, Michal Hocko, linux-nvdimm, Dave Hansen,
	Linux Kernel Mailing List, Linux MM, Jérôme Glisse,
	Andrew Morton, Ingo Molnar, Kirill A. Shutemov

On Mon, Sep 10, 2018 at 4:43 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> From: Alexander Duyck <alexander.h.duyck@intel.com>
>
> The ZONE_DEVICE pages were being initialized in two locations. One was with
> the memory_hotplug lock held and another was outside of that lock. The
> problem with this is that it was nearly doubling the memory initialization
> time. Instead of doing this twice, once while holding a global lock and
> once without, I am opting to defer the initialization to the one outside of
> the lock. This allows us to avoid serializing the overhead for memory init
> and we can instead focus on per-node init times.
>
> One issue I encountered is that devm_memremap_pages and
> hmm_devmmem_pages_create were initializing only the pgmap field the same
> way. One wasn't initializing hmm_data, and the other was initializing it to
> a poison value. Since this is something that is exposed to the driver in
> the case of hmm I am opting for a third option and just initializing
> hmm_data to 0 since this is going to be exposed to unknown third party
> drivers.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>  include/linux/mm.h |    2 +
>  kernel/memremap.c  |   24 +++++---------
>  mm/hmm.c           |   12 ++++---
>  mm/page_alloc.c    |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++-

Hmm, why mm/page_alloc.c and not kernel/memremap.c for this new
helper? I think that would address the kbuild reports and keep all
the devm_memremap_pages / ZONE_DEVICE special casing centralized. I
also think it makes sense to move memremap.c to mm/ rather than
kernel/, especially since commit 5981690ddb8f "memremap: split
devm_memremap_pages() and memremap() infrastructure". Arguably, that
commit should have gone ahead with the directory move.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
  2018-09-11 22:35   ` Dan Williams
@ 2018-09-12  0:51     ` Alexander Duyck
  2018-09-12  0:59       ` Dan Williams
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander Duyck @ 2018-09-12  0:51 UTC (permalink / raw)
  To: dan.j.williams
  Cc: pavel.tatashin, Michal Hocko, linux-nvdimm, Dave Hansen, LKML,
	linux-mm, jglisse, Andrew Morton, Ingo Molnar,
	Kirill A. Shutemov

On Tue, Sep 11, 2018 at 3:35 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Mon, Sep 10, 2018 at 4:43 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > From: Alexander Duyck <alexander.h.duyck@intel.com>
> >
> > The ZONE_DEVICE pages were being initialized in two locations. One was with
> > the memory_hotplug lock held and another was outside of that lock. The
> > problem with this is that it was nearly doubling the memory initialization
> > time. Instead of doing this twice, once while holding a global lock and
> > once without, I am opting to defer the initialization to the one outside of
> > the lock. This allows us to avoid serializing the overhead for memory init
> > and we can instead focus on per-node init times.
> >
> > One issue I encountered is that devm_memremap_pages and
> > hmm_devmmem_pages_create were initializing only the pgmap field the same
> > way. One wasn't initializing hmm_data, and the other was initializing it to
> > a poison value. Since this is something that is exposed to the driver in
> > the case of hmm I am opting for a third option and just initializing
> > hmm_data to 0 since this is going to be exposed to unknown third party
> > drivers.
> >
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> > ---
> >  include/linux/mm.h |    2 +
> >  kernel/memremap.c  |   24 +++++---------
> >  mm/hmm.c           |   12 ++++---
> >  mm/page_alloc.c    |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>
> Hmm, why mm/page_alloc.c and not kernel/memremap.c for this new
> helper? I think that would address the kbuild reports and keep all
> the devm_memremap_pages / ZONE_DEVICE special casing centralized. I
> also think it makes sense to move memremap.c to mm/ rather than
> kernel/, especially since commit 5981690ddb8f "memremap: split
> devm_memremap_pages() and memremap() infrastructure". Arguably, that
> commit should have gone ahead with the directory move.

The issue ends up being the fact that I would then have to start
exporting infrastructure such as __init_single_page from page_alloc. I
have some follow-up patches I am working on that will generate some
other shared functions that can be used by both memmap_init_zone and
memmap_init_zone_device, as well as pulling in some of the code from
the deferred memory init.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
  2018-09-12  0:51     ` Alexander Duyck
@ 2018-09-12  0:59       ` Dan Williams
  0 siblings, 0 replies; 31+ messages in thread
From: Dan Williams @ 2018-09-12  0:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: pavel.tatashin, Michal Hocko, linux-nvdimm, Dave Hansen, LKML,
	linux-mm, Jérôme Glisse, Andrew Morton, Ingo Molnar,
	Kirill A. Shutemov

On Tue, Sep 11, 2018 at 5:51 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Tue, Sep 11, 2018 at 3:35 PM Dan Williams <dan.j.williams@intel.com> wrote:
>>
>> On Mon, Sep 10, 2018 at 4:43 PM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>> >
>> > From: Alexander Duyck <alexander.h.duyck@intel.com>
>> >
>> > The ZONE_DEVICE pages were being initialized in two locations. One was with
>> > the memory_hotplug lock held and another was outside of that lock. The
>> > problem with this is that it was nearly doubling the memory initialization
>> > time. Instead of doing this twice, once while holding a global lock and
>> > once without, I am opting to defer the initialization to the one outside of
>> > the lock. This allows us to avoid serializing the overhead for memory init
>> > and we can instead focus on per-node init times.
>> >
>> > One issue I encountered is that devm_memremap_pages and
>> > hmm_devmmem_pages_create were initializing only the pgmap field the same
>> > way. One wasn't initializing hmm_data, and the other was initializing it to
>> > a poison value. Since this is something that is exposed to the driver in
>> > the case of hmm I am opting for a third option and just initializing
>> > hmm_data to 0 since this is going to be exposed to unknown third party
>> > drivers.
>> >
>> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> > ---
>> >  include/linux/mm.h |    2 +
>> >  kernel/memremap.c  |   24 +++++---------
>> >  mm/hmm.c           |   12 ++++---
>> >  mm/page_alloc.c    |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>>
>> Hmm, why mm/page_alloc.c and not kernel/memremap.c for this new
>> helper? I think that would address the kbuild reports and keep all
>> the devm_memremap_pages / ZONE_DEVICE special casing centralized. I
>> also think it makes sense to move memremap.c to mm/ rather than
>> kernel/, especially since commit 5981690ddb8f "memremap: split
>> devm_memremap_pages() and memremap() infrastructure". Arguably, that
>> commit should have gone ahead with the directory move.
>
> The issue ends up being the fact that I would then have to start
> exporting infrastructure such as __init_single_page from page_alloc. I
> have some follow-up patches I am working on that will generate some
> other shared functions that can be used by both memmap_init_zone and
> memmap_init_zone_device, as well as pulling in some of the code from
> the deferred memory init.

You wouldn't need to export it, just make it public to mm/ in
mm/internal.h, or a similar local header. With kernel/memremap.c moved
to mm/memremap.c this becomes even easier and better scoped for the
shared symbols.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] nvdimm: Trigger the device probe on a cpu local to the device
  2018-09-10 23:44 ` [PATCH 4/4] nvdimm: Trigger the device probe on a cpu local to the device Alexander Duyck
  2018-09-11  0:37   ` Alexander Duyck
@ 2018-09-12  5:48   ` Dan Williams
  2018-09-12 13:44   ` Pasha Tatashin
  2 siblings, 0 replies; 31+ messages in thread
From: Dan Williams @ 2018-09-12  5:48 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: pavel.tatashin, Michal Hocko, linux-nvdimm, Dave Hansen,
	Linux Kernel Mailing List, Linux MM, Jérôme Glisse,
	Andrew Morton, Ingo Molnar, Kirill A. Shutemov

On Mon, Sep 10, 2018 at 4:44 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
>
> This patch is based off of the pci_call_probe function used to initialize
> PCI devices. The general idea here is to move the probe call to a location
> that is local to the memory being initialized. By doing this we can shave
> significant time off of the total time needed for initialization.
>
> With this patch applied I see a significant reduction in overall init time
> as without it the init varied between 23 and 37 seconds to initialize a 3GB
> node. With this patch applied the variance is only between 23 and 26
> seconds to initialize each node.
>
> I hope to refine this further in the future by combining this logic into
> the async_schedule_domain code that is already in use. By doing that it
> would likely make this functionality redundant.

Yeah, it is a bit sad that we schedule an async thread only to move it
back somewhere else.

Could we trivially achieve the same with an
async_schedule_domain_on_cpu() variant? It seems we can and the
workqueue core will "Do the right thing".

I now notice that async uses the system_unbound_wq and work_on_cpu()
uses the system_wq.  I don't think we want long running nvdimm work on
system_wq.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning
  2018-09-10 23:43 ` [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning Alexander Duyck
  2018-09-11  0:35   ` Alexander Duyck
  2018-09-11 16:50   ` Dan Williams
@ 2018-09-12 13:24   ` Pasha Tatashin
  2018-09-12 14:10   ` Michal Hocko
  3 siblings, 0 replies; 31+ messages in thread
From: Pasha Tatashin @ 2018-09-12 13:24 UTC (permalink / raw)
  To: Alexander Duyck, linux-mm, linux-kernel, linux-nvdimm
  Cc: mhocko, dave.jiang, mingo, dave.hansen, jglisse, akpm, logang,
	dan.j.williams, kirill.shutemov



On 9/10/18 7:43 PM, Alexander Duyck wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> On systems with a large amount of memory it can take a significant amount
> of time to initialize all of the page structs with the PAGE_POISON_PATTERN
> value. I have seen it take over 2 minutes to initialize a system with
> over 12GB of RAM.
> 
> In order to work around the issue I had to disable CONFIG_DEBUG_VM and then
> the boot time returned to something much more reasonable as the
> arch_add_memory call completed in milliseconds versus seconds. However in
> doing that I had to disable all of the other VM debugging on the system.
> 
> In order to work around a kernel that might have CONFIG_DEBUG_VM enabled on
> a system that has a large amount of memory I have added a new kernel
> parameter named "page_init_poison" that can be set to "off" in order to
> disable it.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>

Thank you,
Pavel

> ---
>  Documentation/admin-guide/kernel-parameters.txt |    8 ++++++++
>  include/linux/page-flags.h                      |    8 ++++++++
>  mm/debug.c                                      |   16 ++++++++++++++++
>  mm/memblock.c                                   |    5 ++---
>  mm/sparse.c                                     |    4 +---
>  5 files changed, 35 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 64a3bf54b974..7b21e0b9c394 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3047,6 +3047,14 @@
>  			off: turn off poisoning (default)
>  			on: turn on poisoning
>  
> +	page_init_poison=	[KNL] Boot-time parameter changing the
> +			state of poisoning of page structures during early
> +			boot. Used to verify page metadata is not accessed
> +			prior to initialization. Available with
> +			CONFIG_DEBUG_VM=y.
> +			off: turn off poisoning
> +			on: turn on poisoning (default)
> +
>  	panic=		[KNL] Kernel behaviour on panic: delay <timeout>
>  			timeout > 0: seconds before rebooting
>  			timeout = 0: wait forever
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 74bee8cecf4c..d00216cf00f8 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -162,6 +162,14 @@ static inline int PagePoisoned(const struct page *page)
>  	return page->flags == PAGE_POISON_PATTERN;
>  }
>  
> +#ifdef CONFIG_DEBUG_VM
> +void page_init_poison(struct page *page, size_t size);
> +#else
> +static inline void page_init_poison(struct page *page, size_t size)
> +{
> +}
> +#endif
> +
>  /*
>   * Page flags policies wrt compound pages
>   *
> diff --git a/mm/debug.c b/mm/debug.c
> index 38c926520c97..c5420422c0b5 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -175,4 +175,20 @@ void dump_mm(const struct mm_struct *mm)
>  	);
>  }
>  
> +static bool page_init_poisoning __read_mostly = true;
> +
> +static int __init page_init_poison_param(char *buf)
> +{
> +	if (!buf)
> +		return -EINVAL;
> +	return strtobool(buf, &page_init_poisoning);
> +}
> +early_param("page_init_poison", page_init_poison_param);
> +
> +void page_init_poison(struct page *page, size_t size)
> +{
> +	if (page_init_poisoning)
> +		memset(page, PAGE_POISON_PATTERN, size);
> +}
> +EXPORT_SYMBOL_GPL(page_init_poison);
>  #endif		/* CONFIG_DEBUG_VM */
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 237944479d25..a85315083b5a 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1444,10 +1444,9 @@ void * __init memblock_virt_alloc_try_nid_raw(
>  
>  	ptr = memblock_virt_alloc_internal(size, align,
>  					   min_addr, max_addr, nid);
> -#ifdef CONFIG_DEBUG_VM
>  	if (ptr && size > 0)
> -		memset(ptr, PAGE_POISON_PATTERN, size);
> -#endif
> +		page_init_poison(ptr, size);
> +
>  	return ptr;
>  }
>  
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 10b07eea9a6e..67ad061f7fb8 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -696,13 +696,11 @@ int __meminit sparse_add_one_section(struct pglist_data *pgdat,
>  		goto out;
>  	}
>  
> -#ifdef CONFIG_DEBUG_VM
>  	/*
>  	 * Poison uninitialized struct pages in order to catch invalid flags
>  	 * combinations.
>  	 */
> -	memset(memmap, PAGE_POISON_PATTERN, sizeof(struct page) * PAGES_PER_SECTION);
> -#endif
> +	page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
>  
>  	section_mark_present(ms);
>  	sparse_init_one_section(ms, section_nr, memmap, usemap);
> 
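
For readers who want to try the gating logic outside the kernel, here is a small userspace model of the quoted mm/debug.c hunk. It is a sketch under assumptions: the parser accepts only "on" and "off" (the kernel's strtobool() accepts more spellings), the fill byte 0xff is assumed for the example, and every name ending in _sketch is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_POISON_BYTE 0xff	/* assumed fill byte for this example */

/* Poisoning stays on by default, matching the quoted documentation text. */
static bool page_init_poisoning_sketch = true;

/* Stand-in for the early_param handler: returns 0 on success, -1 on bad input. */
static int page_init_poison_param_sketch(const char *buf)
{
	if (buf == NULL)
		return -1;
	if (strcmp(buf, "on") == 0) {
		page_init_poisoning_sketch = true;
		return 0;
	}
	if (strcmp(buf, "off") == 0) {
		page_init_poisoning_sketch = false;
		return 0;
	}
	return -1;
}

/* Fill the region with the poison byte only when poisoning is enabled. */
static void page_init_poison_sketch(void *mem, size_t size)
{
	assert(mem != NULL);
	if (page_init_poisoning_sketch)
		memset(mem, SKETCH_POISON_BYTE, size);
}
```

The cost being avoided is the memset itself: with the flag off, the call reduces to a single branch per region.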

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/4] mm: Create non-atomic version of SetPageReserved for init use
  2018-09-10 23:43 ` [PATCH 2/4] mm: Create non-atomic version of SetPageReserved for init use Alexander Duyck
@ 2018-09-12 13:28   ` Pasha Tatashin
  0 siblings, 0 replies; 31+ messages in thread
From: Pasha Tatashin @ 2018-09-12 13:28 UTC (permalink / raw)
  To: Alexander Duyck, linux-mm, linux-kernel, linux-nvdimm
  Cc: mhocko, dave.jiang, mingo, dave.hansen, jglisse, akpm, logang,
	dan.j.williams, kirill.shutemov


On 9/10/18 7:43 PM, Alexander Duyck wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> It doesn't make much sense to use the atomic SetPageReserved at init time
> when we are using memset to clear the memory and manipulating the page
> flags via simple "&=" and "|=" operations in __init_single_page.
> 
> This patch adds a non-atomic version __SetPageReserved that can be used
> during page init and shows about a 10% improvement in initialization times
> on the systems I have available for testing. On those systems I saw
> initialization times drop from around 35 seconds to around 32 seconds to
> initialize a 3TB block of persistent memory.
> 
> I tried adding a bit of documentation based on commit <f1dd2cd13c4> ("mm,
> memory_hotplug: do not associate hotadded memory to zones until online").
> 
> Ideally the reserved flag should be set earlier since there is a brief
> window where the page is initialized via __init_single_page and we have
> not set the PG_Reserved flag. I'm leaving that for a future patch set as
> that will require a more significant refactor.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>

Thank you,
Pavel
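
As a standalone illustration of the trade-off in the quoted description, the C11 sketch below contrasts an atomic read-modify-write with a plain load, OR, and store on a flags word. The struct, the bit position, and both function names are stand-ins chosen for this example and are not the kernel's definitions. The non-atomic form is only safe while nothing else can touch the page, which is the situation during init.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_PG_RESERVED (1ul << 0)	/* illustrative bit, not the kernel's */

struct page_sketch {
	atomic_ulong flags;
};

/* Atomic variant: one indivisible read-modify-write on the flags word. */
static void set_reserved_atomic_sketch(struct page_sketch *page)
{
	assert(page != NULL);
	atomic_fetch_or(&page->flags, SKETCH_PG_RESERVED);
}

/*
 * Non-atomic variant: a separate relaxed load and relaxed store, which
 * models a plain "flags |= bit". A concurrent writer could be lost
 * between the two steps, so this is for single-owner init paths only.
 */
static void set_reserved_nonatomic_sketch(struct page_sketch *page)
{
	unsigned long f;

	assert(page != NULL);
	f = atomic_load_explicit(&page->flags, memory_order_relaxed);
	atomic_store_explicit(&page->flags, f | SKETCH_PG_RESERVED,
			      memory_order_relaxed);
}
```

Both variants leave the same value in flags; the difference is only in the cost and in what happens if another CPU updates the word at the same time.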

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] nvdimm: Trigger the device probe on a cpu local to the device
  2018-09-10 23:44 ` [PATCH 4/4] nvdimm: Trigger the device probe on a cpu local to the device Alexander Duyck
  2018-09-11  0:37   ` Alexander Duyck
  2018-09-12  5:48   ` Dan Williams
@ 2018-09-12 13:44   ` Pasha Tatashin
  2 siblings, 0 replies; 31+ messages in thread
From: Pasha Tatashin @ 2018-09-12 13:44 UTC (permalink / raw)
  To: Alexander Duyck, linux-mm, linux-kernel, linux-nvdimm
  Cc: mhocko, dave.jiang, mingo, dave.hansen, jglisse, akpm, logang,
	dan.j.williams, kirill.shutemov



On 9/10/18 7:44 PM, Alexander Duyck wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> This patch is based off of the pci_call_probe function used to initialize
> PCI devices. The general idea here is to move the probe call to a location
> that is local to the memory being initialized. By doing this we can shave
> significant time off of the total time needed for initialization.
> 
> With this patch applied I see a significant reduction in overall init time
> as without it the init varied between 23 and 37 seconds to initialize a 3GB
> node. With this patch applied the variance is only between 23 and 26
> seconds to initialize each node.
> 
> I hope to refine this further in the future by combining this logic into
> the async_schedule_domain code that is already in use. By doing that it
> would likely make this functionality redundant.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

Looks good to me. The previous fast runs were because we were
getting lucky and executing in the right latency groups, right? Now we
bound the execution time to be always fast.

Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>

Thank you,
Pavel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
  2018-09-10 23:43 ` [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap Alexander Duyck
                     ` (2 preceding siblings ...)
  2018-09-11 22:35   ` Dan Williams
@ 2018-09-12 13:59   ` Pasha Tatashin
  2018-09-12 15:48     ` Alexander Duyck
  3 siblings, 1 reply; 31+ messages in thread
From: Pasha Tatashin @ 2018-09-12 13:59 UTC (permalink / raw)
  To: Alexander Duyck, linux-mm, linux-kernel, linux-nvdimm
  Cc: mhocko, dave.jiang, mingo, dave.hansen, jglisse, akpm, logang,
	dan.j.williams, kirill.shutemov

Hi Alex,

Please rebase on linux-next; memmap_init_zone() has been updated there
compared to mainline. You might even find a way to unify some parts of
memmap_init_zone() and memmap_init_zone_device(), as memmap_init_zone() is a
lot simpler now.

I think __init_single_page() should stay local to page_alloc.c to keep
the inlining optimization.

I will review this patch once you send an updated version.

Thank you,
Pavel

On 9/10/18 7:43 PM, Alexander Duyck wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> The ZONE_DEVICE pages were being initialized in two locations. One was with
> the memory_hotplug lock held and another was outside of that lock. The
> problem with this is that it was nearly doubling the memory initialization
> time. Instead of doing this twice, once while holding a global lock and
> once without, I am opting to defer the initialization to the one outside of
> the lock. This allows us to avoid serializing the overhead for memory init
> and we can instead focus on per-node init times.
> 
> One issue I encountered is that devm_memremap_pages and
> hmm_devmem_pages_create were initializing only the pgmap field the same
> way. One wasn't initializing hmm_data, and the other was initializing it to
> a poison value. Since this is something that is exposed to the driver in
> the case of hmm I am opting for a third option and just initializing
> hmm_data to 0 since this is going to be exposed to unknown third party
> drivers.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>  include/linux/mm.h |    2 +
>  kernel/memremap.c  |   24 +++++---------
>  mm/hmm.c           |   12 ++++---
>  mm/page_alloc.c    |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 105 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a61ebe8ad4ca..47b440bb3050 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -848,6 +848,8 @@ static inline bool is_zone_device_page(const struct page *page)
>  {
>  	return page_zonenum(page) == ZONE_DEVICE;
>  }
> +extern void memmap_init_zone_device(struct zone *, unsigned long,
> +				    unsigned long, struct dev_pagemap *);
>  #else
>  static inline bool is_zone_device_page(const struct page *page)
>  {
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 5b8600d39931..d0c32e473f82 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -175,10 +175,10 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>  	struct vmem_altmap *altmap = pgmap->altmap_valid ?
>  			&pgmap->altmap : NULL;
>  	struct resource *res = &pgmap->res;
> -	unsigned long pfn, pgoff, order;
> +	struct dev_pagemap *conflict_pgmap;
>  	pgprot_t pgprot = PAGE_KERNEL;
> +	unsigned long pgoff, order;
>  	int error, nid, is_ram;
> -	struct dev_pagemap *conflict_pgmap;
>  
>  	align_start = res->start & ~(SECTION_SIZE - 1);
>  	align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
> @@ -256,19 +256,13 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>  	if (error)
>  		goto err_add_memory;
>  
> -	for_each_device_pfn(pfn, pgmap) {
> -		struct page *page = pfn_to_page(pfn);
> -
> -		/*
> -		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
> -		 * pointer.  It is a bug if a ZONE_DEVICE page is ever
> -		 * freed or placed on a driver-private list.  Seed the
> -		 * storage with LIST_POISON* values.
> -		 */
> -		list_del(&page->lru);
> -		page->pgmap = pgmap;
> -		percpu_ref_get(pgmap->ref);
> -	}
> +	/*
> +	 * Initialization of the pages has been deferred until now in order
> +	 * to allow us to do the work while not holding the hotplug lock.
> +	 */
> +	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> +				align_start >> PAGE_SHIFT,
> +				align_size >> PAGE_SHIFT, pgmap);
>  
>  	devm_add_action(dev, devm_memremap_pages_release, pgmap);
>  
> diff --git a/mm/hmm.c b/mm/hmm.c
> index c968e49f7a0c..774d684fa2b4 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -1024,7 +1024,6 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>  	resource_size_t key, align_start, align_size, align_end;
>  	struct device *device = devmem->device;
>  	int ret, nid, is_ram;
> -	unsigned long pfn;
>  
>  	align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
>  	align_size = ALIGN(devmem->resource->start +
> @@ -1109,11 +1108,14 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>  				align_size >> PAGE_SHIFT, NULL);
>  	mem_hotplug_done();
>  
> -	for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
> -		struct page *page = pfn_to_page(pfn);
> +	/*
> +	 * Initialization of the pages has been deferred until now in order
> +	 * to allow us to do the work while not holding the hotplug lock.
> +	 */
> +	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> +				align_start >> PAGE_SHIFT,
> +				align_size >> PAGE_SHIFT, &devmem->pagemap);
>  
> -		page->pgmap = &devmem->pagemap;
> -	}
>  	return 0;
>  
>  error_add_memory:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a9b095a72fd9..81a3fd942c45 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5454,6 +5454,83 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
>  #endif
>  }
>  
> +#ifdef CONFIG_ZONE_DEVICE
> +void __ref memmap_init_zone_device(struct zone *zone, unsigned long pfn,
> +				   unsigned long size,
> +				   struct dev_pagemap *pgmap)
> +{
> +	struct pglist_data *pgdat = zone->zone_pgdat;
> +	unsigned long zone_idx = zone_idx(zone);
> +	unsigned long end_pfn = pfn + size;
> +	unsigned long start = jiffies;
> +	int nid = pgdat->node_id;
> +	unsigned long nr_pages;
> +
> +	if (WARN_ON_ONCE(!pgmap || !is_dev_zone(zone)))
> +		return;
> +
> +	/*
> +	 * The call to memmap_init_zone should have already taken care
> +	 * of the pages reserved for the memmap, so we can just jump to
> +	 * the end of that region and start processing the device pages.
> +	 */
> +	if (pgmap->altmap_valid) {
> +		struct vmem_altmap *altmap = &pgmap->altmap;
> +
> +		pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
> +	}
> +
> +	/* Record the number of pages we are about to initialize */
> +	nr_pages = end_pfn - pfn;
> +
> +	for (; pfn < end_pfn; pfn++) {
> +		struct page *page = pfn_to_page(pfn);
> +
> +		__init_single_page(page, pfn, zone_idx, nid);
> +
> +		/*
> +		 * Mark page reserved as it will need to wait for onlining
> +		 * phase for it to be fully associated with a zone.
> +		 *
> +		 * We can use the non-atomic __set_bit operation for setting
> +		 * the flag as we are still initializing the pages.
> +		 */
> +		__SetPageReserved(page);
> +
> +		/*
> +		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
> +		 * pointer and hmm_data.  It is a bug if a ZONE_DEVICE
> +		 * page is ever freed or placed on a driver-private list.
> +		 */
> +		page->pgmap = pgmap;
> +		page->hmm_data = 0;
> +
> +		/*
> +		 * Mark the block movable so that blocks are reserved for
> +		 * movable at startup. This will force kernel allocations
> +		 * to reserve their blocks rather than leaking throughout
> +		 * the address space during boot when many long-lived
> +		 * kernel allocations are made.
> +		 *
> +		 * bitmap is created for zone's valid pfn range. but memmap
> +		 * can be created for invalid pages (for alignment)
> +		 * check here not to call set_pageblock_migratetype() against
> +		 * pfn out of zone.
> +		 *
> +		 * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
> +		 * because this is done early in sparse_add_one_section
> +		 */
> +		if (!(pfn & (pageblock_nr_pages - 1))) {
> +			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> +			cond_resched();
> +		}
> +	}
> +
> +	pr_info("%s initialised, %lu pages in %ums\n", dev_name(pgmap->dev),
> +		nr_pages, jiffies_to_msecs(jiffies - start));
> +}
> +
> +#endif
>  /*
>   * Initially all pages are reserved - free ones are freed
>   * up by free_all_bootmem() once the early boot process is
> @@ -5477,10 +5554,18 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>  
>  	/*
>  	 * Honor reservation requested by the driver for this ZONE_DEVICE
> -	 * memory
> +	 * memory. We limit the total number of pages to initialize to just
> +	 * those that might contain the memory mapping. We will defer the
> +	 * ZONE_DEVICE page initialization until after we have released
> +	 * the hotplug lock.
>  	 */
> -	if (altmap && start_pfn == altmap->base_pfn)
> +	if (altmap && start_pfn == altmap->base_pfn) {
>  		start_pfn += altmap->reserve;
> +		end_pfn = altmap->base_pfn +
> +			  vmem_altmap_offset(altmap);
> +	} else if (zone == ZONE_DEVICE) {
> +		end_pfn = start_pfn;
> +	}
>  
>  	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>  		/*
> 
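
For anyone following the pfn arithmetic in the quoted patch, here is a
minimal userspace sketch of how the range appears to be split between the
work done under the hotplug lock (memmap_init_zone) and the work deferred
to memmap_init_zone_device. The struct and function names below are
hypothetical stand-ins, not kernel API, and the sketch assumes that
vmem_altmap_offset() returns reserve + free and that the zone is always
ZONE_DEVICE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields of struct vmem_altmap used here. */
struct altmap_sketch {
	unsigned long base_pfn;	/* first pfn of the device range */
	unsigned long reserve;	/* pfns reserved by the driver */
	unsigned long free;	/* pfns set aside to hold the memmap */
};

struct pfn_range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

/* Assumption: mirrors vmem_altmap_offset(), i.e. reserve + free. */
static unsigned long altmap_offset(const struct altmap_sketch *altmap)
{
	return altmap->reserve + altmap->free;
}

/*
 * Range initialized while the hotplug lock is held. With an altmap that
 * starts at start_pfn, only the pfns backing the memmap are covered;
 * otherwise the range is empty and everything is deferred.
 */
static struct pfn_range locked_init_range(unsigned long start_pfn,
					  const struct altmap_sketch *altmap)
{
	struct pfn_range r = { start_pfn, start_pfn };

	if (altmap && start_pfn == altmap->base_pfn) {
		r.start = start_pfn + altmap->reserve;
		r.end = altmap->base_pfn + altmap_offset(altmap);
	}
	return r;
}

/*
 * Range initialized after the lock is dropped: everything from the end
 * of the altmap region (or from start_pfn if there is no altmap) up to
 * the end of the device range.
 */
static struct pfn_range deferred_init_range(unsigned long start_pfn,
					    unsigned long nr_pages,
					    const struct altmap_sketch *altmap)
{
	struct pfn_range r = { start_pfn, start_pfn + nr_pages };

	if (altmap)
		r.start = altmap->base_pfn + altmap_offset(altmap);
	return r;
}
```

With base_pfn = 0x1000, reserve = 2 and free = 14, the locked pass would
cover [0x1002, 0x1010) and the deferred pass would pick up at 0x1010, so
the two passes do not overlap.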


* Re: [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning
  2018-09-10 23:43 ` [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning Alexander Duyck
                     ` (2 preceding siblings ...)
  2018-09-12 13:24   ` Pasha Tatashin
@ 2018-09-12 14:10   ` Michal Hocko
  2018-09-12 14:49     ` Alexander Duyck
  3 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2018-09-12 14:10 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: pavel.tatashin, linux-nvdimm, dave.hansen, linux-kernel,
	linux-mm, jglisse, kirill.shutemov, akpm, mingo

On Mon 10-09-18 16:43:41, Alexander Duyck wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> On systems with a large amount of memory it can take a significant amount
> of time to initialize all of the page structs with the PAGE_POISON_PATTERN
> value. I have seen it take over 2 minutes to initialize a system with
> over 12GB of RAM.
> 
> In order to work around the issue I had to disable CONFIG_DEBUG_VM and then
> the boot time returned to something much more reasonable as the
> arch_add_memory call completed in milliseconds versus seconds. However in
> doing that I had to disable all of the other VM debugging on the system.
> 
> In order to work around a kernel that might have CONFIG_DEBUG_VM enabled on
> a system that has a large amount of memory I have added a new kernel
> parameter named "page_init_poison" that can be set to "off" in order to
> disable it.

I am still not convinced that this all is worth the additional code. It
is much better than a new config option for sure. If we really want this
though then I suggest that the parameter handler should note the
disabled state (when CONFIG_DEBUG_VM is on) to the kernel log. I would
also make it explicit who might want to do that in the parameter
description.

> +	page_init_poison=	[KNL] Boot-time parameter changing the
> +			state of poisoning of page structures during early
> +			boot. Used to verify page metadata is not accessed
> +			prior to initialization. Available with
> +			CONFIG_DEBUG_VM=y.
> +			off: turn off poisoning
> +			on: turn on poisoning (default)
> +

what about the following wording or something along those lines

Boot-time parameter to control struct page poisoning which is a
debugging feature to catch uninitialized struct page access. This option
is available only for CONFIG_DEBUG_VM=y and it affects boot time
(especially on large systems). If there are no poisoning bugs reported
on the particular system and workload it should be safe to disable it to
speed up the boot time.
-- 
Michal Hocko
SUSE Labs
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning
  2018-09-12 14:10   ` Michal Hocko
@ 2018-09-12 14:49     ` Alexander Duyck
  2018-09-12 15:23       ` Dave Hansen
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander Duyck @ 2018-09-12 14:49 UTC (permalink / raw)
  To: mhocko
  Cc: pavel.tatashin, linux-nvdimm, Dave Hansen, LKML, linux-mm,
	jglisse, Kirill A. Shutemov, Andrew Morton, Ingo Molnar

On Wed, Sep 12, 2018 at 7:10 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 10-09-18 16:43:41, Alexander Duyck wrote:
> > From: Alexander Duyck <alexander.h.duyck@intel.com>
> >
> > On systems with a large amount of memory it can take a significant amount
> > of time to initialize all of the page structs with the PAGE_POISON_PATTERN
> > value. I have seen it take over 2 minutes to initialize a system with
> > over 12GB of RAM.
> >
> > In order to work around the issue I had to disable CONFIG_DEBUG_VM and then
> > the boot time returned to something much more reasonable as the
> > arch_add_memory call completed in milliseconds versus seconds. However in
> > doing that I had to disable all of the other VM debugging on the system.
> >
> > In order to work around a kernel that might have CONFIG_DEBUG_VM enabled on
> > a system that has a large amount of memory I have added a new kernel
> > parameter named "page_init_poison" that can be set to "off" in order to
> > disable it.
>
> I am still not convinced that this all is worth the additional code. It
> is much better than a new config option for sure. If we really want this
> though then I suggest that the parameter handler should note the
> disabled state (when CONFIG_DEBUG_VM is on) to the kernel log. I would
> also make it explicit who might want to do that in the parameter
> description.

Anything specific in terms of the kernel log message we are looking
for? I'll probably just go with "Page struct poisoning disabled by
kernel command line option 'page_init_poison'" or something along
those lines.
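
A rough userspace sketch of what such a handler could look like is below.
This is only an illustration under stated assumptions: the function and
variable names are hypothetical, and fprintf() stands in for the kernel
log. In the kernel the handler would presumably be registered with
early_param() and report through pr_info() or pr_warn() instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag; poisoning stays on unless explicitly disabled. */
static bool page_init_poisoning = true;

/*
 * Parse the value given to "page_init_poison=". Returns 0 when the value
 * is recognized and -1 otherwise. When poisoning is turned off the state
 * change is noted in the log, as suggested in the review.
 */
static int parse_page_init_poison(const char *arg)
{
	if (!arg)
		return -1;

	if (strcmp(arg, "on") == 0) {
		page_init_poisoning = true;
		return 0;
	}

	if (strcmp(arg, "off") == 0) {
		page_init_poisoning = false;
		/* Stand-in for a kernel log message. */
		fprintf(stderr,
			"Page struct poisoning disabled by kernel command line option 'page_init_poison'\n");
		return 0;
	}

	/* Unrecognized value: leave the current state untouched. */
	return -1;
}
```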

> > +     page_init_poison=       [KNL] Boot-time parameter changing the
> > +                     state of poisoning of page structures during early
> > +                     boot. Used to verify page metadata is not accessed
> > +                     prior to initialization. Available with
> > +                     CONFIG_DEBUG_VM=y.
> > +                     off: turn off poisoning
> > +                     on: turn on poisoning (default)
> > +
>
> what about the following wording or something along those lines
>
> Boot-time parameter to control struct page poisoning which is a
> debugging feature to catch uninitialized struct page access. This option
> is available only for CONFIG_DEBUG_VM=y and it affects boot time
> (especially on large systems). If there are no poisoning bugs reported
> on the particular system and workload it should be safe to disable it to
> speed up the boot time.

That works for me. I will update it for the next release.

Thanks.

- Alex


* Re: [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning
  2018-09-12 14:49     ` Alexander Duyck
@ 2018-09-12 15:23       ` Dave Hansen
  2018-09-12 16:36         ` Alexander Duyck
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2018-09-12 15:23 UTC (permalink / raw)
  To: Alexander Duyck, mhocko
  Cc: pavel.tatashin, linux-nvdimm, LKML, linux-mm, jglisse,
	Andrew Morton, Ingo Molnar, Kirill A. Shutemov

On 09/12/2018 07:49 AM, Alexander Duyck wrote:
>>> +     page_init_poison=       [KNL] Boot-time parameter changing the
>>> +                     state of poisoning of page structures during early
>>> +                     boot. Used to verify page metadata is not accessed
>>> +                     prior to initialization. Available with
>>> +                     CONFIG_DEBUG_VM=y.
>>> +                     off: turn off poisoning
>>> +                     on: turn on poisoning (default)
>>> +
>> what about the following wording or something along those lines
>>
>> Boot-time parameter to control struct page poisoning which is a
>> debugging feature to catch uninitialized struct page access. This option
>> is available only for CONFIG_DEBUG_VM=y and it affects boot time
>> (especially on large systems). If there are no poisoning bugs reported
>> on the particular system and workload it should be safe to disable it to
>> speed up the boot time.
> That works for me. I will update it for the next release.

FWIW, I rather liked Dan's idea of wrapping this under
vm_debug=<something>.  We've got a zoo of boot options and it's really
hard to _remember_ what does what.  For this case, we're creating one
that's only available under a specific debug option and I think it makes
total sense to name the boot option accordingly.

For now, I think it makes total sense to do vm_debug=all/off.  If, in
the future, we get more options, we can do things like slab does and do
vm_debug=P (for Page poison) for this feature specifically.

	vm_debug =	[KNL] Available with CONFIG_DEBUG_VM=y.
			May slow down boot speed, especially on larger-
			memory systems when enabled.
			off: turn off all runtime VM debug features
			all: turn on all debug features (default)
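
To make the idea concrete, here is a minimal userspace sketch of how a
vm_debug= value might be parsed. Everything in it is an assumption for
illustration: the flag names and the parse function are hypothetical, and
the per-feature letter form ("P") is the possible future extension
mentioned above, not something proposed for now.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical feature bits; only page poisoning exists in this sketch. */
#define VM_DEBUG_PAGE_POISON	(1u << 0)
#define VM_DEBUG_ALL		VM_DEBUG_PAGE_POISON

/*
 * Parse the value given to "vm_debug=". Accepts "all", "off", or a string
 * of feature letters (currently just 'P' for page poison). On success the
 * resulting mask is stored in *mask and true is returned. On failure
 * *mask is left unchanged and false is returned.
 */
static bool parse_vm_debug(const char *arg, unsigned int *mask)
{
	unsigned int m = 0;

	if (!arg || !mask || *arg == '\0')
		return false;

	if (strcmp(arg, "all") == 0) {
		*mask = VM_DEBUG_ALL;
		return true;
	}

	if (strcmp(arg, "off") == 0) {
		*mask = 0;
		return true;
	}

	for (; *arg; arg++) {
		switch (*arg) {
		case 'P':
			m |= VM_DEBUG_PAGE_POISON;
			break;
		default:
			/* Unknown feature letter: reject the whole value. */
			return false;
		}
	}

	*mask = m;
	return true;
}
```

Keeping "all" and "off" as whole-word values leaves the single-letter
namespace free for later features.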


* Re: [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
  2018-09-12 13:59   ` Pasha Tatashin
@ 2018-09-12 15:48     ` Alexander Duyck
  2018-09-12 15:54       ` Pasha Tatashin
  2018-09-12 16:50       ` Dan Williams
  0 siblings, 2 replies; 31+ messages in thread
From: Alexander Duyck @ 2018-09-12 15:48 UTC (permalink / raw)
  To: Pavel.Tatashin
  Cc: Michal Hocko, linux-nvdimm, Dave Hansen, LKML, linux-mm, jglisse,
	Kirill A. Shutemov, Andrew Morton, Ingo Molnar

On Wed, Sep 12, 2018 at 6:59 AM Pasha Tatashin
<Pavel.Tatashin@microsoft.com> wrote:
>
> Hi Alex,

Hi Pavel,

> Please re-base on linux-next; memmap_init_zone() has been updated there
> compared to mainline. You might even find a way to unify some parts of
> memmap_init_zone and memmap_init_zone_device as memmap_init_zone() is a
> lot simpler now.

This patch applied to the linux-next tree with only a little bit of
fuzz. It looks like it is mostly due to some code you had added above
the function as well. I have updated this patch so that it will apply
to both linux and linux-next by just moving the new function to
underneath memmap_init_zone instead of above it.

> I think __init_single_page() should stay local to page_alloc.c to keep
> the inlining optimization.

I agree. In addition it will make pulling common init together into
one space easier. I would rather not have us create an opportunity for
things to further diverge by making it available for anybody to use.

> I will review this patch once you send an updated version.

Other than moving the new function below memmap_init_zone instead of
above it, there isn't much else that needs to change, at least for this
patch. I have some follow-up patches planned that will be targeted at
linux-next. Those, I think, will focus more on what you have in mind in
terms of combining this new function

> Thank you,
> Pavel

Thanks,
- Alex

> On 9/10/18 7:43 PM, Alexander Duyck wrote:
> > From: Alexander Duyck <alexander.h.duyck@intel.com>
> >
> > The ZONE_DEVICE pages were being initialized in two locations. One was with
> > the memory_hotplug lock held and another was outside of that lock. The
> > problem with this is that it was nearly doubling the memory initialization
> > time. Instead of doing this twice, once while holding a global lock and
> > once without, I am opting to defer the initialization to the one outside of
> > the lock. This allows us to avoid serializing the overhead for memory init
> > and we can instead focus on per-node init times.
> >
> > One issue I encountered is that devm_memremap_pages and
> > hmm_devmem_pages_create were initializing only the pgmap field the same
> > way. One wasn't initializing hmm_data, and the other was initializing it to
> > a poison value. Since this is something that is exposed to the driver in
> > the case of hmm I am opting for a third option and just initializing
> > hmm_data to 0 since this is going to be exposed to unknown third party
> > drivers.
> >
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> > ---
> >  include/linux/mm.h |    2 +
> >  kernel/memremap.c  |   24 +++++---------
> >  mm/hmm.c           |   12 ++++---
> >  mm/page_alloc.c    |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  4 files changed, 105 insertions(+), 22 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index a61ebe8ad4ca..47b440bb3050 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -848,6 +848,8 @@ static inline bool is_zone_device_page(const struct page *page)
> >  {
> >       return page_zonenum(page) == ZONE_DEVICE;
> >  }
> > +extern void memmap_init_zone_device(struct zone *, unsigned long,
> > +                                 unsigned long, struct dev_pagemap *);
> >  #else
> >  static inline bool is_zone_device_page(const struct page *page)
> >  {
> > diff --git a/kernel/memremap.c b/kernel/memremap.c
> > index 5b8600d39931..d0c32e473f82 100644
> > --- a/kernel/memremap.c
> > +++ b/kernel/memremap.c
> > @@ -175,10 +175,10 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
> >       struct vmem_altmap *altmap = pgmap->altmap_valid ?
> >                       &pgmap->altmap : NULL;
> >       struct resource *res = &pgmap->res;
> > -     unsigned long pfn, pgoff, order;
> > +     struct dev_pagemap *conflict_pgmap;
> >       pgprot_t pgprot = PAGE_KERNEL;
> > +     unsigned long pgoff, order;
> >       int error, nid, is_ram;
> > -     struct dev_pagemap *conflict_pgmap;
> >
> >       align_start = res->start & ~(SECTION_SIZE - 1);
> >       align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
> > @@ -256,19 +256,13 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
> >       if (error)
> >               goto err_add_memory;
> >
> > -     for_each_device_pfn(pfn, pgmap) {
> > -             struct page *page = pfn_to_page(pfn);
> > -
> > -             /*
> > -              * ZONE_DEVICE pages union ->lru with a ->pgmap back
> > -              * pointer.  It is a bug if a ZONE_DEVICE page is ever
> > -              * freed or placed on a driver-private list.  Seed the
> > -              * storage with LIST_POISON* values.
> > -              */
> > -             list_del(&page->lru);
> > -             page->pgmap = pgmap;
> > -             percpu_ref_get(pgmap->ref);
> > -     }
> > +     /*
> > +      * Initialization of the pages has been deferred until now in order
> > +      * to allow us to do the work while not holding the hotplug lock.
> > +      */
> > +     memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> > +                             align_start >> PAGE_SHIFT,
> > +                             align_size >> PAGE_SHIFT, pgmap);
> >
> >       devm_add_action(dev, devm_memremap_pages_release, pgmap);
> >
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index c968e49f7a0c..774d684fa2b4 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -1024,7 +1024,6 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
> >       resource_size_t key, align_start, align_size, align_end;
> >       struct device *device = devmem->device;
> >       int ret, nid, is_ram;
> > -     unsigned long pfn;
> >
> >       align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
> >       align_size = ALIGN(devmem->resource->start +
> > @@ -1109,11 +1108,14 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
> >                               align_size >> PAGE_SHIFT, NULL);
> >       mem_hotplug_done();
> >
> > -     for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
> > -             struct page *page = pfn_to_page(pfn);
> > +     /*
> > +      * Initialization of the pages has been deferred until now in order
> > +      * to allow us to do the work while not holding the hotplug lock.
> > +      */
> > +     memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> > +                             align_start >> PAGE_SHIFT,
> > +                             align_size >> PAGE_SHIFT, &devmem->pagemap);
> >
> > -             page->pgmap = &devmem->pagemap;
> > -     }
> >       return 0;
> >
> >  error_add_memory:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index a9b095a72fd9..81a3fd942c45 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5454,6 +5454,83 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
> >  #endif
> >  }
> >
> > +#ifdef CONFIG_ZONE_DEVICE
> > +void __ref memmap_init_zone_device(struct zone *zone, unsigned long pfn,
> > +                                unsigned long size,
> > +                                struct dev_pagemap *pgmap)
> > +{
> > +     struct pglist_data *pgdat = zone->zone_pgdat;
> > +     unsigned long zone_idx = zone_idx(zone);
> > +     unsigned long end_pfn = pfn + size;
> > +     unsigned long start = jiffies;
> > +     int nid = pgdat->node_id;
> > +     unsigned long nr_pages;
> > +
> > +     if (WARN_ON_ONCE(!pgmap || !is_dev_zone(zone)))
> > +             return;
> > +
> > +     /*
> > +      * The call to memmap_init_zone should have already taken care
> > +      * of the pages reserved for the memmap, so we can just jump to
> > +      * the end of that region and start processing the device pages.
> > +      */
> > +     if (pgmap->altmap_valid) {
> > +             struct vmem_altmap *altmap = &pgmap->altmap;
> > +
> > +             pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
> > +     }
> > +
> > +     /* Record the number of pages we are about to initialize */
> > +     nr_pages = end_pfn - pfn;
> > +
> > +     for (; pfn < end_pfn; pfn++) {
> > +             struct page *page = pfn_to_page(pfn);
> > +
> > +             __init_single_page(page, pfn, zone_idx, nid);
> > +
> > +             /*
> > +              * Mark page reserved as it will need to wait for onlining
> > +              * phase for it to be fully associated with a zone.
> > +              *
> > +              * We can use the non-atomic __set_bit operation for setting
> > +              * the flag as we are still initializing the pages.
> > +              */
> > +             __SetPageReserved(page);
> > +
> > +             /*
> > +              * ZONE_DEVICE pages union ->lru with a ->pgmap back
> > +              * pointer and hmm_data.  It is a bug if a ZONE_DEVICE
> > +              * page is ever freed or placed on a driver-private list.
> > +              */
> > +             page->pgmap = pgmap;
> > +             page->hmm_data = 0;
> > +
> > +             /*
> > +              * Mark the block movable so that blocks are reserved for
> > +              * movable at startup. This will force kernel allocations
> > +              * to reserve their blocks rather than leaking throughout
> > +              * the address space during boot when many long-lived
> > +              * kernel allocations are made.
> > +              *
> > +              * bitmap is created for zone's valid pfn range. but memmap
> > +              * can be created for invalid pages (for alignment)
> > +              * check here not to call set_pageblock_migratetype() against
> > +              * pfn out of zone.
> > +              *
> > +              * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
> > +              * because this is done early in sparse_add_one_section
> > +              */
> > +             if (!(pfn & (pageblock_nr_pages - 1))) {
> > +                     set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> > +                     cond_resched();
> > +             }
> > +     }
> > +
> > +     pr_info("%s initialised, %lu pages in %ums\n", dev_name(pgmap->dev),
> > +             nr_pages, jiffies_to_msecs(jiffies - start));
> > +}
> > +
> > +#endif
> >  /*
> >   * Initially all pages are reserved - free ones are freed
> >   * up by free_all_bootmem() once the early boot process is
> > @@ -5477,10 +5554,18 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> >
> >       /*
> >        * Honor reservation requested by the driver for this ZONE_DEVICE
> > -      * memory
> > +      * memory. We limit the total number of pages to initialize to just
> > +      * those that might contain the memory mapping. We will defer the
> > +      * ZONE_DEVICE page initialization until after we have released
> > +      * the hotplug lock.
> >        */
> > -     if (altmap && start_pfn == altmap->base_pfn)
> > +     if (altmap && start_pfn == altmap->base_pfn) {
> >               start_pfn += altmap->reserve;
> > +             end_pfn = altmap->base_pfn +
> > +                       vmem_altmap_offset(altmap);
> > +     } else if (zone == ZONE_DEVICE) {
> > +             end_pfn = start_pfn;
> > +     }
> >
> >       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> >               /*
> >


* Re: [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
  2018-09-12 15:48     ` Alexander Duyck
@ 2018-09-12 15:54       ` Pasha Tatashin
  2018-09-12 16:44         ` Alexander Duyck
  2018-09-12 16:50       ` Dan Williams
  1 sibling, 1 reply; 31+ messages in thread
From: Pasha Tatashin @ 2018-09-12 15:54 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: linux-mm, LKML, linux-nvdimm, Michal Hocko, dave.jiang,
	Ingo Molnar, Dave Hansen, jglisse, Andrew Morton, logang,
	dan.j.williams, Kirill A. Shutemov



On 9/12/18 11:48 AM, Alexander Duyck wrote:
> On Wed, Sep 12, 2018 at 6:59 AM Pasha Tatashin
> <Pavel.Tatashin@microsoft.com> wrote:
>>
>> Hi Alex,
> 
> Hi Pavel,
> 
>> Please re-base on linux-next; memmap_init_zone() has been updated there
>> compared to mainline. You might even find a way to unify some parts of
>> memmap_init_zone and memmap_init_zone_device as memmap_init_zone() is a
>> lot simpler now.
> 
> This patch applied to the linux-next tree with only a little bit of
> fuzz. It looks like it is mostly due to some code you had added above
> the function as well. I have updated this patch so that it will apply
> to both linux and linux-next by just moving the new function to
> underneath memmap_init_zone instead of above it.
> 
>> I think __init_single_page() should stay local to page_alloc.c to keep
>> the inlining optimization.
> 
> I agree. In addition it will make pulling common init together into
> one space easier. I would rather not have us create an opportunity for
> things to further diverge by making it available for anybody to use.
> 
>> I will review this patch once you send an updated version.
> 
> Other than moving the new function below memmap_init_zone instead of
> above it, there isn't much else that needs to change, at least for this
> patch. I have some follow-up patches planned that will be targeted at
> linux-next. Those, I think, will focus more on what you have in mind in
> terms of combining this new function

Hi Alex,

I'd like to see the combining be part of the same series. Maybe this
patch can be pulled from this series and merged with your upcoming
patch series?

Thank you,
Pavel

> 
>> Thank you,
>> Pavel
> 
> Thanks,
> - Alex
> 
>> On 9/10/18 7:43 PM, Alexander Duyck wrote:
>>> From: Alexander Duyck <alexander.h.duyck@intel.com>
>>>
>>> The ZONE_DEVICE pages were being initialized in two locations. One was with
>>> the memory_hotplug lock held and another was outside of that lock. The
>>> problem with this is that it was nearly doubling the memory initialization
>>> time. Instead of doing this twice, once while holding a global lock and
>>> once without, I am opting to defer the initialization to the one outside of
>>> the lock. This allows us to avoid serializing the overhead for memory init
>>> and we can instead focus on per-node init times.
>>>
>>> One issue I encountered is that devm_memremap_pages and
>>> hmm_devmmem_pages_create were initializing only the pgmap field the same
>>> way. One wasn't initializing hmm_data, and the other was initializing it to
>>> a poison value. Since this is something that is exposed to the driver in
>>> the case of hmm I am opting for a third option and just initializing
>>> hmm_data to 0 since this is going to be exposed to unknown third party
>>> drivers.
>>>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>> ---
>>>  include/linux/mm.h |    2 +
>>>  kernel/memremap.c  |   24 +++++---------
>>>  mm/hmm.c           |   12 ++++---
>>>  mm/page_alloc.c    |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>>>  4 files changed, 105 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index a61ebe8ad4ca..47b440bb3050 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -848,6 +848,8 @@ static inline bool is_zone_device_page(const struct page *page)
>>>  {
>>>       return page_zonenum(page) == ZONE_DEVICE;
>>>  }
>>> +extern void memmap_init_zone_device(struct zone *, unsigned long,
>>> +                                 unsigned long, struct dev_pagemap *);
>>>  #else
>>>  static inline bool is_zone_device_page(const struct page *page)
>>>  {
>>> diff --git a/kernel/memremap.c b/kernel/memremap.c
>>> index 5b8600d39931..d0c32e473f82 100644
>>> --- a/kernel/memremap.c
>>> +++ b/kernel/memremap.c
>>> @@ -175,10 +175,10 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>>>       struct vmem_altmap *altmap = pgmap->altmap_valid ?
>>>                       &pgmap->altmap : NULL;
>>>       struct resource *res = &pgmap->res;
>>> -     unsigned long pfn, pgoff, order;
>>> +     struct dev_pagemap *conflict_pgmap;
>>>       pgprot_t pgprot = PAGE_KERNEL;
>>> +     unsigned long pgoff, order;
>>>       int error, nid, is_ram;
>>> -     struct dev_pagemap *conflict_pgmap;
>>>
>>>       align_start = res->start & ~(SECTION_SIZE - 1);
>>>       align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
>>> @@ -256,19 +256,13 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>>>       if (error)
>>>               goto err_add_memory;
>>>
>>> -     for_each_device_pfn(pfn, pgmap) {
>>> -             struct page *page = pfn_to_page(pfn);
>>> -
>>> -             /*
>>> -              * ZONE_DEVICE pages union ->lru with a ->pgmap back
>>> -              * pointer.  It is a bug if a ZONE_DEVICE page is ever
>>> -              * freed or placed on a driver-private list.  Seed the
>>> -              * storage with LIST_POISON* values.
>>> -              */
>>> -             list_del(&page->lru);
>>> -             page->pgmap = pgmap;
>>> -             percpu_ref_get(pgmap->ref);
>>> -     }
>>> +     /*
>>> +      * Initialization of the pages has been deferred until now in order
>>> +      * to allow us to do the work while not holding the hotplug lock.
>>> +      */
>>> +     memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>>> +                             align_start >> PAGE_SHIFT,
>>> +                             align_size >> PAGE_SHIFT, pgmap);
>>>
>>>       devm_add_action(dev, devm_memremap_pages_release, pgmap);
>>>
>>> diff --git a/mm/hmm.c b/mm/hmm.c
>>> index c968e49f7a0c..774d684fa2b4 100644
>>> --- a/mm/hmm.c
>>> +++ b/mm/hmm.c
>>> @@ -1024,7 +1024,6 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>>>       resource_size_t key, align_start, align_size, align_end;
>>>       struct device *device = devmem->device;
>>>       int ret, nid, is_ram;
>>> -     unsigned long pfn;
>>>
>>>       align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
>>>       align_size = ALIGN(devmem->resource->start +
>>> @@ -1109,11 +1108,14 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>>>                               align_size >> PAGE_SHIFT, NULL);
>>>       mem_hotplug_done();
>>>
>>> -     for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
>>> -             struct page *page = pfn_to_page(pfn);
>>> +     /*
>>> +      * Initialization of the pages has been deferred until now in order
>>> +      * to allow us to do the work while not holding the hotplug lock.
>>> +      */
>>> +     memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>>> +                             align_start >> PAGE_SHIFT,
>>> +                             align_size >> PAGE_SHIFT, &devmem->pagemap);
>>>
>>> -             page->pgmap = &devmem->pagemap;
>>> -     }
>>>       return 0;
>>>
>>>  error_add_memory:
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index a9b095a72fd9..81a3fd942c45 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -5454,6 +5454,83 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
>>>  #endif
>>>  }
>>>
>>> +#ifdef CONFIG_ZONE_DEVICE
>>> +void __ref memmap_init_zone_device(struct zone *zone, unsigned long pfn,
>>> +                                unsigned long size,
>>> +                                struct dev_pagemap *pgmap)
>>> +{
>>> +     struct pglist_data *pgdat = zone->zone_pgdat;
>>> +     unsigned long zone_idx = zone_idx(zone);
>>> +     unsigned long end_pfn = pfn + size;
>>> +     unsigned long start = jiffies;
>>> +     int nid = pgdat->node_id;
>>> +     unsigned long nr_pages;
>>> +
>>> +     if (WARN_ON_ONCE(!pgmap || !is_dev_zone(zone)))
>>> +             return;
>>> +
>>> +     /*
>>> +      * The call to memmap_init_zone should have already taken care
>>> +      * of the pages reserved for the memmap, so we can just jump to
>>> +      * the end of that region and start processing the device pages.
>>> +      */
>>> +     if (pgmap->altmap_valid) {
>>> +             struct vmem_altmap *altmap = &pgmap->altmap;
>>> +
>>> +             pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
>>> +     }
>>> +
>>> +     /* Record the number of pages we are about to initialize */
>>> +     nr_pages = end_pfn - pfn;
>>> +
>>> +     for (; pfn < end_pfn; pfn++) {
>>> +             struct page *page = pfn_to_page(pfn);
>>> +
>>> +             __init_single_page(page, pfn, zone_idx, nid);
>>> +
>>> +             /*
>>> +              * Mark page reserved as it will need to wait for onlining
>>> +              * phase for it to be fully associated with a zone.
>>> +              *
>>> +              * We can use the non-atomic __set_bit operation for setting
>>> +              * the flag as we are still initializing the pages.
>>> +              */
>>> +             __SetPageReserved(page);
>>> +
>>> +             /*
>>> +              * ZONE_DEVICE pages union ->lru with a ->pgmap back
>>> +              * pointer and hmm_data.  It is a bug if a ZONE_DEVICE
>>> +              * page is ever freed or placed on a driver-private list.
>>> +              */
>>> +             page->pgmap = pgmap;
>>> +             page->hmm_data = 0;
>>> +
>>> +             /*
>>> +              * Mark the block movable so that blocks are reserved for
>>> +              * movable at startup. This will force kernel allocations
>>> +              * to reserve their blocks rather than leaking throughout
>>> +              * the address space during boot when many long-lived
>>> +              * kernel allocations are made.
>>> +              *
>>> +              * bitmap is created for zone's valid pfn range. but memmap
>>> +              * can be created for invalid pages (for alignment)
>>> +              * check here not to call set_pageblock_migratetype() against
>>> +              * pfn out of zone.
>>> +              *
>>> +              * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
>>> +              * because this is done early in sparse_add_one_section
>>> +              */
>>> +             if (!(pfn & (pageblock_nr_pages - 1))) {
>>> +                     set_pageblock_migratetype(page, MIGRATE_MOVABLE);
>>> +                     cond_resched();
>>> +             }
>>> +     }
>>> +
>>> +     pr_info("%s initialised, %lu pages in %ums\n", dev_name(pgmap->dev),
>>> +             nr_pages, jiffies_to_msecs(jiffies - start));
>>> +}
>>> +
>>> +#endif
>>>  /*
>>>   * Initially all pages are reserved - free ones are freed
>>>   * up by free_all_bootmem() once the early boot process is
>>> @@ -5477,10 +5554,18 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>>
>>>       /*
>>>        * Honor reservation requested by the driver for this ZONE_DEVICE
>>> -      * memory
>>> +      * memory. We limit the total number of pages to initialize to just
>>> +      * those that might contain the memory mapping. We will defer the
>>> +      * ZONE_DEVICE page initialization until after we have released
>>> +      * the hotplug lock.
>>>        */
>>> -     if (altmap && start_pfn == altmap->base_pfn)
>>> +     if (altmap && start_pfn == altmap->base_pfn) {
>>>               start_pfn += altmap->reserve;
>>> +             end_pfn = altmap->base_pfn +
>>> +                       vmem_altmap_offset(altmap);
>>> +     } else if (zone == ZONE_DEVICE) {
>>> +             end_pfn = start_pfn;
>>> +     }
>>>
>>>       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>>>               /*
>>>
> 
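For readers following the pfn arithmetic in the quoted patch, here is a
minimal userspace sketch (not kernel code) of how the patch appears to
split the work: memmap_init_zone() only touches the pages that may hold
the memmap while the hotplug lock is held, and memmap_init_zone_device()
handles the remaining device pages afterwards. The names altmap_model,
altmap_offset, locked_range, deferred_range and count_pageblocks are
hypothetical, and altmap_offset() assumes that vmem_altmap_offset()
returns reserve plus free.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct vmem_altmap. */
struct altmap_model {
	unsigned long base_pfn;	/* first pfn of the device range */
	unsigned long reserve;	/* pfns reserved by the driver */
	unsigned long free;	/* pfns set aside to hold the memmap */
};

struct pfn_range {
	unsigned long start;
	unsigned long end;	/* exclusive */
};

/* Assumed model of vmem_altmap_offset(): pfns ahead of the device pages. */
static unsigned long altmap_offset(const struct altmap_model *a)
{
	return a->reserve + a->free;
}

/* Range initialized while the hotplug lock is held (memmap pages only). */
static struct pfn_range locked_range(unsigned long start_pfn,
				     const struct altmap_model *a)
{
	struct pfn_range r = { start_pfn, start_pfn };	/* empty by default */

	if (a && start_pfn == a->base_pfn) {
		r.start = start_pfn + a->reserve;
		r.end = a->base_pfn + altmap_offset(a);
	}
	return r;
}

/* Range initialized after the lock is dropped (the device pages). */
static struct pfn_range deferred_range(unsigned long start_pfn,
				       unsigned long nr_pages,
				       const struct altmap_model *a)
{
	struct pfn_range r = { start_pfn, start_pfn + nr_pages };

	if (a)
		r.start = a->base_pfn + altmap_offset(a);
	return r;
}

/* Count pageblock-aligned pfns in a range; block_pages is a power of 2. */
static unsigned long count_pageblocks(struct pfn_range r,
				      unsigned long block_pages)
{
	unsigned long pfn, n = 0;

	for (pfn = r.start; pfn < r.end; pfn++)
		if (!(pfn & (block_pages - 1)))
			n++;
	return n;
}
```

In this model the locked range ends exactly where the deferred range
begins, so no pfn is initialized twice, and count_pageblocks() mirrors
the alignment test that guards set_pageblock_migratetype() and
cond_resched() in the patch.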


* Re: [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning
  2018-09-12 15:23       ` Dave Hansen
@ 2018-09-12 16:36         ` Alexander Duyck
  2018-09-12 16:43           ` Dave Hansen
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander Duyck @ 2018-09-12 16:36 UTC (permalink / raw)
  To: Dave Hansen, mhocko, pavel.tatashin, dan.j.williams
  Cc: linux-mm, LKML, linux-nvdimm, dave.jiang, Ingo Molnar, jglisse,
	Andrew Morton, logang, Kirill A. Shutemov

On Wed, Sep 12, 2018 at 8:25 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 09/12/2018 07:49 AM, Alexander Duyck wrote:
> >>> +     page_init_poison=       [KNL] Boot-time parameter changing the
> >>> +                     state of poisoning of page structures during early
> >>> +                     boot. Used to verify page metadata is not accessed
> >>> +                     prior to initialization. Available with
> >>> +                     CONFIG_DEBUG_VM=y.
> >>> +                     off: turn off poisoning
> >>> +                     on: turn on poisoning (default)
> >>> +
> >> what about the following wording or something along those lines
> >>
> >> Boot-time parameter to control struct page poisoning which is a
> >> debugging feature to catch uninitialized struct page access. This option
> >> is available only for CONFIG_DEBUG_VM=y and it affects boot time
> >> (especially on large systems). If there are no poisoning bugs reported
> >> on the particular system and workload it should be safe to disable it to
> >> speed up the boot time.
> > That works for me. I will update it for the next release.
>
> FWIW, I rather liked Dan's idea of wrapping this under
> vm_debug=<something>.  We've got a zoo of boot options and it's really
> hard to _remember_ what does what.  For this case, we're creating one
> that's only available under a specific debug option and I think it makes
> total sense to name the boot option accordingly.
>
> For now, I think it makes total sense to do vm_debug=all/off.  If, in
> the future, we get more options, we can do things like slab does and do
> vm_debug=P (for Page poison) for this feature specifically.
>
>         vm_debug =      [KNL] Available with CONFIG_DEBUG_VM=y.
>                         May slow down boot speed, especially on larger-
>                         memory systems when enabled.
>                         off: turn off all runtime VM debug features
>                         all: turn on all debug features (default)

This would introduce a significant amount of code change if we do it
as a parameter that has control over everything.

I would be open to something like "vm_debug_disables=" where we could
then pass individual values like 'P' for disabling page poisoning.
However doing this as a generic interface that could disable
everything now would be messy. I could then also update the print
message so that it lists what is disabled, and what was left enabled.
Then as we need to disable things in the future we could add
additional letters for individual features. I just don't want us
preemptively adding control flags for features that may never need to
be toggled.

I would want to hear from Michal on this before I get too deep into it
as he seemed to be of the opinion that we were already doing too much
code for this and it seems like this is starting to veer off in that
direction.

- Alex
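
To make the per-letter idea concrete, here is a minimal userspace sketch
of how a "vm_debug_disables=" style string could be turned into a mask
of disabled features. Nothing here exists in the kernel: the parameter
was only proposed in this thread, and parse_vm_debug_disables and
VM_DEBUG_DISABLE_PAGE_POISON are hypothetical names. Only 'P' (page
poisoning) is recognized, since that is the only feature discussed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical feature bit; only page poisoning came up in the thread. */
#define VM_DEBUG_DISABLE_PAGE_POISON	(1u << 0)

/*
 * Parse a string such as "P" into a mask of features to disable.
 * Returns false on NULL or on an unknown letter, leaving *mask untouched.
 */
static bool parse_vm_debug_disables(const char *arg, unsigned int *mask)
{
	unsigned int m = 0;

	if (!arg)
		return false;

	for (; *arg; arg++) {
		switch (*arg) {
		case 'P':
			m |= VM_DEBUG_DISABLE_PAGE_POISON;
			break;
		default:
			/* Reject unknown letters rather than guess. */
			return false;
		}
	}

	*mask = m;
	return true;
}
```

Adding a feature later would mean one new bit and one new case, which
is the incremental growth described above, without control flags for
features that never need toggling.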


* Re: [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning
  2018-09-12 16:36         ` Alexander Duyck
@ 2018-09-12 16:43           ` Dave Hansen
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Hansen @ 2018-09-12 16:43 UTC (permalink / raw)
  To: Alexander Duyck, mhocko, pavel.tatashin, dan.j.williams
  Cc: linux-nvdimm, LKML, linux-mm, jglisse, Andrew Morton,
	Ingo Molnar, Kirill A. Shutemov

On 09/12/2018 09:36 AM, Alexander Duyck wrote:
>>         vm_debug =      [KNL] Available with CONFIG_DEBUG_VM=y.
>>                         May slow down boot speed, especially on larger-
>>                         memory systems when enabled.
>>                         off: turn off all runtime VM debug features
>>                         all: turn on all debug features (default)
> This would introduce a significant amount of code change if we do it
> as a parameter that has control over everything.

Sure, but don't do that now.  Just put page poisoning under it now.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


* Re: [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
  2018-09-12 15:54       ` Pasha Tatashin
@ 2018-09-12 16:44         ` Alexander Duyck
  0 siblings, 0 replies; 31+ messages in thread
From: Alexander Duyck @ 2018-09-12 16:44 UTC (permalink / raw)
  To: Pavel.Tatashin
  Cc: Michal Hocko, linux-nvdimm, Dave Hansen, LKML, linux-mm, jglisse,
	Kirill A. Shutemov, Andrew Morton, Ingo Molnar

On Wed, Sep 12, 2018 at 8:54 AM Pasha Tatashin
<Pavel.Tatashin@microsoft.com> wrote:
>
>
>
> On 9/12/18 11:48 AM, Alexander Duyck wrote:
> > On Wed, Sep 12, 2018 at 6:59 AM Pasha Tatashin
> > <Pavel.Tatashin@microsoft.com> wrote:
> >>
> >> Hi Alex,
> >
> > Hi Pavel,
> >
> >> Please re-base on linux-next,  memmap_init_zone() has been updated there
> >> compared to mainline. You might even find a way to unify some parts of
> >> memmap_init_zone and memmap_init_zone_device as memmap_init_zone() is a
> >> lot simpler now.
> >
> > This patch applied to the linux-next tree with only a little bit of
> > fuzz. It looks like it is mostly due to some code you had added above
> > the function as well. I have updated this patch so that it will apply
> > to both linux and linux-next by just moving the new function to
> > underneath memmap_init_zone instead of above it.
> >
> >> I think __init_single_page() should stay local to page_alloc.c to keep
> >> the inlining optimization.
> >
> > I agree. In addition it will make pulling common init together into
> > one space easier. I would rather not have us create an opportunity for
> > things to further diverge by making it available for anybody to use.
> >
> >> I will review this patch once you send an updated version.
> >
> > Other than moving the new function from being added above versus below
> > there isn't much else that needs to change, at least for this patch. I
> > have some follow-up patches I am planning that will be targeted for
> > linux-next. Those I think will focus more on what you have in mind in
> > terms of combining this new function
>
> Hi Alex,
>
> I'd like to see the combining be part of the same series. Maybe this
> patch can be pulled from this series and merged with your upcoming
> patch series?
>
> Thank you,
> Pavel

The problem is the issue is somewhat time sensitive, and the patches I
put out in this set needed to be easily backported. That is one of the
reasons this patch set is as conservative as it is.

I was hoping to make 4.20 with this patch set at the latest. My
follow-up patches are more what I would consider 4.21 material, as
they are something we will probably want to give some testing time,
and I figure there will end up being a few revisions. I would probably
have them ready for review in another week or so.

Thanks.

- Alex

* Re: [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
  2018-09-12 15:48     ` Alexander Duyck
  2018-09-12 15:54       ` Pasha Tatashin
@ 2018-09-12 16:50       ` Dan Williams
  2018-09-12 17:46         ` Pasha Tatashin
  1 sibling, 1 reply; 31+ messages in thread
From: Dan Williams @ 2018-09-12 16:50 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Pavel.Tatashin, Michal Hocko, linux-nvdimm, Dave Hansen, LKML,
	linux-mm, Jérôme Glisse, Andrew Morton, Ingo Molnar,
	Kirill A. Shutemov

On Wed, Sep 12, 2018 at 8:48 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Sep 12, 2018 at 6:59 AM Pasha Tatashin
> <Pavel.Tatashin@microsoft.com> wrote:
>>
>> Hi Alex,
>
> Hi Pavel,
>
>> Please re-base on linux-next,  memmap_init_zone() has been updated there
>> compared to mainline. You might even find a way to unify some parts of
>> memmap_init_zone and memmap_init_zone_device as memmap_init_zone() is a
>> lot simpler now.
>
> This patch applied to the linux-next tree with only a little bit of
> fuzz. It looks like it is mostly due to some code you had added above
> the function as well. I have updated this patch so that it will apply
> to both linux and linux-next by just moving the new function to
> underneath memmap_init_zone instead of above it.
>
>> I think __init_single_page() should stay local to page_alloc.c to keep
>> the inlining optimization.
>
> I agree. In addition it will make pulling common init together into
> one space easier. I would rather not have us create an opportunity for
> things to further diverge by making it available for anybody to use.

I'll buy the inline argument for keeping the new routine in
page_alloc.c, but I otherwise do not see the divergence danger or
"making __init_single_page() available for anybody" given the
declaration is limited in scope to a mm/ local header file.

* Re: [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
  2018-09-12 16:50       ` Dan Williams
@ 2018-09-12 17:46         ` Pasha Tatashin
  2018-09-12 19:11           ` Dan Williams
  0 siblings, 1 reply; 31+ messages in thread
From: Pasha Tatashin @ 2018-09-12 17:46 UTC (permalink / raw)
  To: Dan Williams, Alexander Duyck
  Cc: linux-mm, LKML, linux-nvdimm, Michal Hocko, Dave Jiang,
	Ingo Molnar, Dave Hansen, Jérôme Glisse, Andrew Morton,
	Logan Gunthorpe, Kirill A. Shutemov



On 9/12/18 12:50 PM, Dan Williams wrote:
> On Wed, Sep 12, 2018 at 8:48 AM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Wed, Sep 12, 2018 at 6:59 AM Pasha Tatashin
>> <Pavel.Tatashin@microsoft.com> wrote:
>>>
>>> Hi Alex,
>>
>> Hi Pavel,
>>
>>> Please re-base on linux-next,  memmap_init_zone() has been updated there
>>> compared to mainline. You might even find a way to unify some parts of
>>> memmap_init_zone and memmap_init_zone_device as memmap_init_zone() is a
>>> lot simpler now.
>>
>> This patch applied to the linux-next tree with only a little bit of
>> fuzz. It looks like it is mostly due to some code you had added above
>> the function as well. I have updated this patch so that it will apply
>> to both linux and linux-next by just moving the new function to
>> underneath memmap_init_zone instead of above it.
>>
>>> I think __init_single_page() should stay local to page_alloc.c to keep
>>> the inlining optimization.
>>
>> I agree. In addition it will make pulling common init together into
>> one space easier. I would rather not have us create an opportunity for
>> things to further diverge by making it available for anybody to use.
> 
> I'll buy the inline argument for keeping the new routine in
> page_alloc.c, but I otherwise do not see the divergence danger or
> "making __init_single_page() available for anybody" given the
> declaration is limited in scope to a mm/ local header file.
> 

Hi Dan,

It is much harder for the compiler to decide that a function can be
inlined once it is non-static. Of course, we could simply move this
function to a header file and declare it inline to begin with.

But __init_single_page() is still so performance sensitive that I'd
like to reduce the number of callers of this function and keep it in a
.c file.

Thank you,
Pavel
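
As a rough illustration of the layout being argued for, here is a
minimal sketch in plain C, not taken from page_alloc.c. The per-page
helper is static and lives in the same .c file as its only caller, so
the compiler can see every call site and is free to inline it into the
loop. The names mini_page, init_one and init_range are hypothetical.

```c
#include <stddef.h>

/* Hypothetical miniature of struct page, for illustration only. */
struct mini_page {
	unsigned long flags;
	int refcount;
};

/*
 * Static helper: not visible outside this translation unit, so all
 * callers are known when this file is compiled.
 */
static inline void init_one(struct mini_page *page, unsigned long zone,
			    int nid)
{
	page->flags = (zone << 8) | (unsigned long)nid;
	page->refcount = 1;
}

/* The only externally visible entry point; the helper stays private. */
void init_range(struct mini_page *pages, size_t nr, unsigned long zone,
		int nid)
{
	size_t i;

	for (i = 0; i < nr; i++)
		init_one(&pages[i], zone, nid);
}
```

Other files would call init_range() on a whole span of pages rather
than calling the per-page helper directly, which keeps the number of
callers of the hot helper small.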


* Re: [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
  2018-09-12 17:46         ` Pasha Tatashin
@ 2018-09-12 19:11           ` Dan Williams
  0 siblings, 0 replies; 31+ messages in thread
From: Dan Williams @ 2018-09-12 19:11 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Michal Hocko, linux-nvdimm, Dave Hansen, LKML, linux-mm,
	Jérôme Glisse, Andrew Morton, Ingo Molnar,
	Kirill A. Shutemov

On Wed, Sep 12, 2018 at 10:46 AM, Pasha Tatashin
<Pavel.Tatashin@microsoft.com> wrote:
>
>
> On 9/12/18 12:50 PM, Dan Williams wrote:
>> On Wed, Sep 12, 2018 at 8:48 AM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>>> On Wed, Sep 12, 2018 at 6:59 AM Pasha Tatashin
>>> <Pavel.Tatashin@microsoft.com> wrote:
>>>>
>>>> Hi Alex,
>>>
>>> Hi Pavel,
>>>
>>>> Please re-base on linux-next,  memmap_init_zone() has been updated there
>>>> compared to mainline. You might even find a way to unify some parts of
>>>> memmap_init_zone and memmap_init_zone_device as memmap_init_zone() is a
>>>> lot simpler now.
>>>
>>> This patch applied to the linux-next tree with only a little bit of
>>> fuzz. It looks like it is mostly due to some code you had added above
>>> the function as well. I have updated this patch so that it will apply
>>> to both linux and linux-next by just moving the new function to
>>> underneath memmap_init_zone instead of above it.
>>>
>>>> I think __init_single_page() should stay local to page_alloc.c to keep
>>>> the inlining optimization.
>>>
>>> I agree. In addition it will make pulling common init together into
>>> one space easier. I would rather not have us create an opportunity for
>>> things to further diverge by making it available for anybody to use.
>>
>> I'll buy the inline argument for keeping the new routine in
>> page_alloc.c, but I otherwise do not see the divergence danger or
>> "making __init_single_page() available for anybody" given the
>> declaration is limited in scope to a mm/ local header file.
>>
>
> Hi Dan,
>
> It is much harder for the compiler to decide that a function can be
> inlined once it is non-static. Of course, we could simply move this
> function to a header file and declare it inline to begin with.
>
> But __init_single_page() is still so performance sensitive that I'd
> like to reduce the number of callers of this function and keep it in a
> .c file.

Yes, agree, inline considerations win the day. I was just objecting to
the "make it available for anybody" assertion.

Thread overview: 31+ messages
2018-09-10 23:43 [PATCH 0/4] Address issues slowing persistent memory initialization Alexander Duyck
2018-09-10 23:43 ` [PATCH 1/4] mm: Provide kernel parameter to allow disabling page init poisoning Alexander Duyck
2018-09-11  0:35   ` Alexander Duyck
2018-09-11 16:50   ` Dan Williams
2018-09-11 20:01     ` Alexander Duyck
2018-09-11 20:24       ` Dan Williams
2018-09-12 13:24   ` Pasha Tatashin
2018-09-12 14:10   ` Michal Hocko
2018-09-12 14:49     ` Alexander Duyck
2018-09-12 15:23       ` Dave Hansen
2018-09-12 16:36         ` Alexander Duyck
2018-09-12 16:43           ` Dave Hansen
2018-09-10 23:43 ` [PATCH 2/4] mm: Create non-atomic version of SetPageReserved for init use Alexander Duyck
2018-09-12 13:28   ` Pasha Tatashin
2018-09-10 23:43 ` [PATCH 3/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap Alexander Duyck
2018-09-11  7:49   ` kbuild test robot
2018-09-11  7:54   ` kbuild test robot
2018-09-11 22:35   ` Dan Williams
2018-09-12  0:51     ` Alexander Duyck
2018-09-12  0:59       ` Dan Williams
2018-09-12 13:59   ` Pasha Tatashin
2018-09-12 15:48     ` Alexander Duyck
2018-09-12 15:54       ` Pasha Tatashin
2018-09-12 16:44         ` Alexander Duyck
2018-09-12 16:50       ` Dan Williams
2018-09-12 17:46         ` Pasha Tatashin
2018-09-12 19:11           ` Dan Williams
2018-09-10 23:44 ` [PATCH 4/4] nvdimm: Trigger the device probe on a cpu local to the device Alexander Duyck
2018-09-11  0:37   ` Alexander Duyck
2018-09-12  5:48   ` Dan Williams
2018-09-12 13:44   ` Pasha Tatashin
