linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v10 00/13] mm: Sub-section memory hotplug support
@ 2019-06-19  5:51 Dan Williams
  2019-06-19  5:51 ` [PATCH v10 01/13] mm/sparsemem: Introduce struct mem_section_usage Dan Williams
                   ` (14 more replies)
  0 siblings, 15 replies; 29+ messages in thread
From: Dan Williams @ 2019-06-19  5:51 UTC (permalink / raw)
  To: akpm
  Cc: David Hildenbrand, Jérôme Glisse, Mike Rapoport,
	Jane Chu, Pavel Tatashin, Jonathan Corbet, Qian Cai,
	Logan Gunthorpe, Toshi Kani, Oscar Salvador, Jeff Moyer,
	Michal Hocko, Vlastimil Babka, stable, Wei Yang, linux-mm,
	linux-nvdimm, linux-kernel

Changes since v9 [1]:
- Fix multiple issues related to the fact that pfn_valid() has
  traditionally returned true for any pfn in an 'early' (onlined at
  boot) section regardless of whether that pfn represented 'System RAM'.
  Teach pfn_valid() to maintain its traditional behavior in the presence
  of subsections. Specifically, subsection precision for pfn_valid() is
  only considered for non-early / hot-plugged sections. (Qian)

- Related to the first item introduce a SECTION_IS_EARLY
  (->section_mem_map flag) to remove the existing hacks for determining
  an early section by looking at whether the usemap was allocated from the
  slab.

- Kill off the EEXIST hackery in __add_pages(). It breaks
  (arch_add_memory() false-positive) the detection of subsection
  collisions reported by section_activate(). It is also obviated by
  David's recent reworks to move the 'System RAM' request_region() earlier
  in the add_memory() sequence().

- Switch to an arch-independent / static subsection-size of 2MB.
  Otherwise, a per-arch subsection-size is a roadblock on the path to
  persistent memory namespace compatibility across archs. (Jeff)

- Update the changelog for "libnvdimm/pfn: Fix fsdax-mode namespace
  info-block zero-fields" to clarify that the "Cc: stable" is only there
  as safety measure for a distro that decides to backport "libnvdimm/pfn:
  Stop padding pmem namespaces to section alignment", otherwise there is
  no known bug exposure in older kernels. (Andrew)
  
- Drop some redundant subsection checks (Oscar)

- Collect some reviewed-bys

[1]: https://lore.kernel.org/lkml/155977186863.2443951.9036044808311959913.stgit@dwillia2-desk3.amr.corp.intel.com/

---

The memory hotplug section is an arbitrary / convenient unit for memory
hotplug. 'Section-size' units have bled into the user interface
('memblock' sysfs) and can not be changed without breaking existing
userspace. The section-size constraint, while mostly benign for typical
memory hotplug, has and continues to wreak havoc with 'device-memory'
use cases, persistent memory (pmem) in particular. Recall that pmem uses
devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a
'struct page' memmap for pmem. However, it does not use the 'bottom
half' of memory hotplug, i.e. never marks pmem pages online and never
exposes the userspace memblock interface for pmem. This leaves an
opening to redress the section-size constraint.

To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory(). Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack is broken when the platform
changes this physical memory alignment of pmem from one boot to the
next. Device failure (intermittent or permanent) and physical
reconfiguration are events that can cause the platform firmware to
change the physical placement of pmem on a subsequent boot, and device
failure is an everyday event in a data-center.

It turns out that sections are only a hard requirement of the
user-facing interface for memory hotplug and with a bit more
infrastructure sub-section arch_add_memory() support can be added for
kernel internal usages like devm_memremap_pages(). Here is an analysis
of the current design assumptions in the current code and how they are
addressed in the new implementation:

Current design assumptions:

- Sections that describe boot memory (early sections) are never
  unplugged / removed.

- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a
  valid_section() check

- __add_pages() and helper routines assume all operations occur in
  PAGES_PER_SECTION units.

- The memblock sysfs interface only comprehends full sections

New design assumptions:

- Sections are instrumented with a sub-section bitmask to track (on x86)
  individual 2MB sub-divisions of a 128MB section.

- Partially populated early sections can be extended with additional
  sub-sections, and those sub-sections can be removed with
  arch_remove_memory(). With this in place we no longer lose usable memory
  capacity to padding.

- pfn_valid() is updated to look deeper than valid_section() to also check the
  active-sub-section mask. This indication is in the same cacheline as
  the valid_section() so the performance impact is expected to be
  negligible. So far the lkp robot has not reported any regressions.

- Outside of the core vmemmap population routines which are replaced,
  other helper routines like shrink_{zone,pgdat}_span() are updated to
  handle the smaller granularity. Core memory hotplug routines that deal
  with online memory are not touched.

- The existing memblock sysfs user api guarantees / assumptions are
  not touched since this capability is limited to !online
  !memblock-sysfs-accessible sections.

Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them. The current
implementation relied on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt. Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem
ranges with other pmem ranges by default [3]. In short,
devm_memremap_pages() has pushed the venerable section-size constraint
past the breaking point, and the simplicity of section-aligned
arch_add_memory() is no longer tenable.

These patches are exposed to the kbuild robot on a subsection-v10 branch
[4], and a preview of the unit test for this functionality is available
on the 'subsection-pending' branch of ndctl [5].

[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[3]: https://github.com/pmem/ndctl/issues/76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
[5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c

---

Dan Williams (13):
      mm/sparsemem: Introduce struct mem_section_usage
      mm/sparsemem: Introduce a SECTION_IS_EARLY flag
      mm/sparsemem: Add helpers track active portions of a section at boot
      mm/hotplug: Prepare shrink_{zone,pgdat}_span for sub-section removal
      mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap()
      mm/hotplug: Kill is_dev_zone() usage in __remove_pages()
      mm: Kill is_dev_zone() helper
      mm/sparsemem: Prepare for sub-section ranges
      mm/sparsemem: Support sub-section hotplug
      mm: Document ZONE_DEVICE memory-model implications
      mm/devm_memremap_pages: Enable sub-section remap
      libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
      libnvdimm/pfn: Stop padding pmem namespaces to section alignment


 Documentation/vm/memory-model.rst |   39 ++++
 arch/x86/mm/init_64.c             |    4 
 drivers/nvdimm/dax_devs.c         |    2 
 drivers/nvdimm/pfn.h              |   15 --
 drivers/nvdimm/pfn_devs.c         |   95 +++-------
 include/linux/memory_hotplug.h    |    7 -
 include/linux/mm.h                |    4 
 include/linux/mmzone.h            |   84 +++++++--
 kernel/memremap.c                 |   61 +++----
 mm/memory_hotplug.c               |  173 +++++++++----------
 mm/page_alloc.c                   |   16 +-
 mm/sparse-vmemmap.c               |   21 ++
 mm/sparse.c                       |  335 ++++++++++++++++++++++++-------------
 13 files changed, 494 insertions(+), 362 deletions(-)


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v10 01/13] mm/sparsemem: Introduce struct mem_section_usage
  2019-06-19  5:51 [PATCH v10 00/13] mm: Sub-section memory hotplug support Dan Williams
@ 2019-06-19  5:51 ` Dan Williams
  2019-06-19  5:51 ` [PATCH v10 02/13] mm/sparsemem: Introduce a SECTION_IS_EARLY flag Dan Williams
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2019-06-19  5:51 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Vlastimil Babka, Logan Gunthorpe, Pavel Tatashin,
	Oscar Salvador, Wei Yang, linux-mm, linux-nvdimm, linux-kernel

Towards enabling memory hotplug to track partial population of a
section, introduce 'struct mem_section_usage'.

A pointer to a 'struct mem_section_usage' instance replaces the existing
pointer to a 'pageblock_flags' bitmap. Effectively it adds one more
'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to
house a new 'subsection_map' bitmap.  The new bitmap enables the memory
hot{plug,remove} implementation to act on incremental sub-divisions of a
section.

SUBSECTION_SHIFT is defined as global constant instead of
per-architecture value like SECTION_SIZE_BITS in order to allow
cross-arch compatibility of subsection users. Specifically a common
subsection size allows for the possibility that persistent memory
namespace configurations be made compatible across architectures.

The primary motivation for this functionality is to support platforms
that mix "System RAM" and "Persistent Memory" within a single section,
or multiple PMEM ranges with different mapping lifetimes within a single
section. The section restriction for hotplug has caused an ongoing saga
of hacks and bugs for devm_memremap_pages() users.

Beyond the fixups to teach existing paths how to retrieve the 'usemap'
from a section, and updates to usemap allocation path, there are no
expected behavior changes.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mmzone.h |   28 +++++++++++++++--
 mm/memory_hotplug.c    |   18 ++++++-----
 mm/page_alloc.c        |    2 +
 mm/sparse.c            |   81 ++++++++++++++++++++++++------------------------
 4 files changed, 76 insertions(+), 53 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 427b79c39b3c..179680c94262 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1161,6 +1161,24 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec)
 #define SECTION_ALIGN_UP(pfn)	(((pfn) + PAGES_PER_SECTION - 1) & PAGE_SECTION_MASK)
 #define SECTION_ALIGN_DOWN(pfn)	((pfn) & PAGE_SECTION_MASK)
 
+#define SUBSECTION_SHIFT 21
+
+#define PFN_SUBSECTION_SHIFT (SUBSECTION_SHIFT - PAGE_SHIFT)
+#define PAGES_PER_SUBSECTION (1UL << PFN_SUBSECTION_SHIFT)
+#define PAGE_SUBSECTION_MASK (~(PAGES_PER_SUBSECTION-1))
+
+#if SUBSECTION_SHIFT > SECTION_SIZE_BITS
+#error Subsection size exceeds section size
+#else
+#define SUBSECTIONS_PER_SECTION (1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))
+#endif
+
+struct mem_section_usage {
+	DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
+	/* See declaration of similar field in struct zone */
+	unsigned long pageblock_flags[0];
+};
+
 struct page;
 struct page_ext;
 struct mem_section {
@@ -1178,8 +1196,7 @@ struct mem_section {
 	 */
 	unsigned long section_mem_map;
 
-	/* See declaration of similar field in struct zone */
-	unsigned long *pageblock_flags;
+	struct mem_section_usage *usage;
 #ifdef CONFIG_PAGE_EXTENSION
 	/*
 	 * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
@@ -1210,6 +1227,11 @@ extern struct mem_section **mem_section;
 extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
 #endif
 
+static inline unsigned long *section_to_usemap(struct mem_section *ms)
+{
+	return ms->usage->pageblock_flags;
+}
+
 static inline struct mem_section *__nr_to_section(unsigned long nr)
 {
 #ifdef CONFIG_SPARSEMEM_EXTREME
@@ -1221,7 +1243,7 @@ static inline struct mem_section *__nr_to_section(unsigned long nr)
 	return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
 }
 extern int __section_nr(struct mem_section* ms);
-extern unsigned long usemap_size(void);
+extern size_t mem_section_usage_size(void);
 
 /*
  * We use the lower bits of the mem_map pointer to store
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a88c5f334e5a..7b963c2d3a0d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -166,9 +166,10 @@ void put_page_bootmem(struct page *page)
 #ifndef CONFIG_SPARSEMEM_VMEMMAP
 static void register_page_bootmem_info_section(unsigned long start_pfn)
 {
-	unsigned long *usemap, mapsize, section_nr, i;
+	unsigned long mapsize, section_nr, i;
 	struct mem_section *ms;
 	struct page *page, *memmap;
+	struct mem_section_usage *usage;
 
 	section_nr = pfn_to_section_nr(start_pfn);
 	ms = __nr_to_section(section_nr);
@@ -188,10 +189,10 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
 	for (i = 0; i < mapsize; i++, page++)
 		get_page_bootmem(section_nr, page, SECTION_INFO);
 
-	usemap = ms->pageblock_flags;
-	page = virt_to_page(usemap);
+	usage = ms->usage;
+	page = virt_to_page(usage);
 
-	mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
+	mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
 
 	for (i = 0; i < mapsize; i++, page++)
 		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
@@ -200,9 +201,10 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
 #else /* CONFIG_SPARSEMEM_VMEMMAP */
 static void register_page_bootmem_info_section(unsigned long start_pfn)
 {
-	unsigned long *usemap, mapsize, section_nr, i;
+	unsigned long mapsize, section_nr, i;
 	struct mem_section *ms;
 	struct page *page, *memmap;
+	struct mem_section_usage *usage;
 
 	section_nr = pfn_to_section_nr(start_pfn);
 	ms = __nr_to_section(section_nr);
@@ -211,10 +213,10 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
 
 	register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
 
-	usemap = ms->pageblock_flags;
-	page = virt_to_page(usemap);
+	usage = ms->usage;
+	page = virt_to_page(usage);
 
-	mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
+	mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
 
 	for (i = 0; i < mapsize; i++, page++)
 		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f4651a09948c..8cc091e87200 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -403,7 +403,7 @@ static inline unsigned long *get_pageblock_bitmap(struct page *page,
 							unsigned long pfn)
 {
 #ifdef CONFIG_SPARSEMEM
-	return __pfn_to_section(pfn)->pageblock_flags;
+	return section_to_usemap(__pfn_to_section(pfn));
 #else
 	return page_zone(page)->pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
diff --git a/mm/sparse.c b/mm/sparse.c
index 1552c855d62a..71da15cc7432 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -288,33 +288,31 @@ struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pn
 
 static void __meminit sparse_init_one_section(struct mem_section *ms,
 		unsigned long pnum, struct page *mem_map,
-		unsigned long *pageblock_bitmap)
+		struct mem_section_usage *usage)
 {
 	ms->section_mem_map &= ~SECTION_MAP_MASK;
 	ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
 							SECTION_HAS_MEM_MAP;
- 	ms->pageblock_flags = pageblock_bitmap;
+	ms->usage = usage;
 }
 
-unsigned long usemap_size(void)
+static unsigned long usemap_size(void)
 {
 	return BITS_TO_LONGS(SECTION_BLOCKFLAGS_BITS) * sizeof(unsigned long);
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
-static unsigned long *__kmalloc_section_usemap(void)
+size_t mem_section_usage_size(void)
 {
-	return kmalloc(usemap_size(), GFP_KERNEL);
+	return sizeof(struct mem_section_usage) + usemap_size();
 }
-#endif /* CONFIG_MEMORY_HOTPLUG */
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-static unsigned long * __init
+static struct mem_section_usage * __init
 sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
 					 unsigned long size)
 {
+	struct mem_section_usage *usage;
 	unsigned long goal, limit;
-	unsigned long *p;
 	int nid;
 	/*
 	 * A page may contain usemaps for other sections preventing the
@@ -330,15 +328,16 @@ sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
 	limit = goal + (1UL << PA_SECTION_SHIFT);
 	nid = early_pfn_to_nid(goal >> PAGE_SHIFT);
 again:
-	p = memblock_alloc_try_nid(size, SMP_CACHE_BYTES, goal, limit, nid);
-	if (!p && limit) {
+	usage = memblock_alloc_try_nid(size, SMP_CACHE_BYTES, goal, limit, nid);
+	if (!usage && limit) {
 		limit = 0;
 		goto again;
 	}
-	return p;
+	return usage;
 }
 
-static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
+static void __init check_usemap_section_nr(int nid,
+		struct mem_section_usage *usage)
 {
 	unsigned long usemap_snr, pgdat_snr;
 	static unsigned long old_usemap_snr;
@@ -352,7 +351,7 @@ static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
 		old_pgdat_snr = NR_MEM_SECTIONS;
 	}
 
-	usemap_snr = pfn_to_section_nr(__pa(usemap) >> PAGE_SHIFT);
+	usemap_snr = pfn_to_section_nr(__pa(usage) >> PAGE_SHIFT);
 	pgdat_snr = pfn_to_section_nr(__pa(pgdat) >> PAGE_SHIFT);
 	if (usemap_snr == pgdat_snr)
 		return;
@@ -380,14 +379,15 @@ static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
 		usemap_snr, pgdat_snr, nid);
 }
 #else
-static unsigned long * __init
+static struct mem_section_usage * __init
 sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
 					 unsigned long size)
 {
 	return memblock_alloc_node(size, SMP_CACHE_BYTES, pgdat->node_id);
 }
 
-static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
+static void __init check_usemap_section_nr(int nid,
+		struct mem_section_usage *usage)
 {
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
@@ -474,14 +474,13 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 				   unsigned long pnum_end,
 				   unsigned long map_count)
 {
-	unsigned long pnum, usemap_longs, *usemap;
+	struct mem_section_usage *usage;
+	unsigned long pnum;
 	struct page *map;
 
-	usemap_longs = BITS_TO_LONGS(SECTION_BLOCKFLAGS_BITS);
-	usemap = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nid),
-							  usemap_size() *
-							  map_count);
-	if (!usemap) {
+	usage = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nid),
+			mem_section_usage_size() * map_count);
+	if (!usage) {
 		pr_err("%s: node[%d] usemap allocation failed", __func__, nid);
 		goto failed;
 	}
@@ -497,9 +496,9 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 			pnum_begin = pnum;
 			goto failed;
 		}
-		check_usemap_section_nr(nid, usemap);
-		sparse_init_one_section(__nr_to_section(pnum), pnum, map, usemap);
-		usemap += usemap_longs;
+		check_usemap_section_nr(nid, usage);
+		sparse_init_one_section(__nr_to_section(pnum), pnum, map, usage);
+		usage = (void *) usage + mem_section_usage_size();
 	}
 	sparse_buffer_fini();
 	return;
@@ -697,9 +696,9 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 				     struct vmem_altmap *altmap)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
+	struct mem_section_usage *usage;
 	struct mem_section *ms;
 	struct page *memmap;
-	unsigned long *usemap;
 	int ret;
 
 	/*
@@ -713,8 +712,8 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 	memmap = kmalloc_section_memmap(section_nr, nid, altmap);
 	if (!memmap)
 		return -ENOMEM;
-	usemap = __kmalloc_section_usemap();
-	if (!usemap) {
+	usage = kzalloc(mem_section_usage_size(), GFP_KERNEL);
+	if (!usage) {
 		__kfree_section_memmap(memmap, altmap);
 		return -ENOMEM;
 	}
@@ -732,11 +731,11 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 	page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
 
 	section_mark_present(ms);
-	sparse_init_one_section(ms, section_nr, memmap, usemap);
+	sparse_init_one_section(ms, section_nr, memmap, usage);
 
 out:
 	if (ret < 0) {
-		kfree(usemap);
+		kfree(usage);
 		__kfree_section_memmap(memmap, altmap);
 	}
 	return ret;
@@ -772,20 +771,20 @@ static inline void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
 }
 #endif
 
-static void free_section_usemap(struct page *memmap, unsigned long *usemap,
-		struct vmem_altmap *altmap)
+static void free_section_usage(struct page *memmap,
+		struct mem_section_usage *usage, struct vmem_altmap *altmap)
 {
-	struct page *usemap_page;
+	struct page *usage_page;
 
-	if (!usemap)
+	if (!usage)
 		return;
 
-	usemap_page = virt_to_page(usemap);
+	usage_page = virt_to_page(usage);
 	/*
 	 * Check to see if allocation came from hot-plug-add
 	 */
-	if (PageSlab(usemap_page) || PageCompound(usemap_page)) {
-		kfree(usemap);
+	if (PageSlab(usage_page) || PageCompound(usage_page)) {
+		kfree(usage);
 		if (memmap)
 			__kfree_section_memmap(memmap, altmap);
 		return;
@@ -804,18 +803,18 @@ void sparse_remove_one_section(struct mem_section *ms, unsigned long map_offset,
 			       struct vmem_altmap *altmap)
 {
 	struct page *memmap = NULL;
-	unsigned long *usemap = NULL;
+	struct mem_section_usage *usage = NULL;
 
 	if (ms->section_mem_map) {
-		usemap = ms->pageblock_flags;
+		usage = ms->usage;
 		memmap = sparse_decode_mem_map(ms->section_mem_map,
 						__section_nr(ms));
 		ms->section_mem_map = 0;
-		ms->pageblock_flags = NULL;
+		ms->usage = NULL;
 	}
 
 	clear_hwpoisoned_pages(memmap + map_offset,
 			PAGES_PER_SECTION - map_offset);
-	free_section_usemap(memmap, usemap, altmap);
+	free_section_usage(memmap, usage, altmap);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v10 02/13] mm/sparsemem: Introduce a SECTION_IS_EARLY flag
  2019-06-19  5:51 [PATCH v10 00/13] mm: Sub-section memory hotplug support Dan Williams
  2019-06-19  5:51 ` [PATCH v10 01/13] mm/sparsemem: Introduce struct mem_section_usage Dan Williams
@ 2019-06-19  5:51 ` Dan Williams
  2019-06-24 17:54   ` Oscar Salvador
  2019-06-19  5:51 ` [PATCH v10 03/13] mm/sparsemem: Add helpers track active portions of a section at boot Dan Williams
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2019-06-19  5:51 UTC (permalink / raw)
  To: akpm
  Cc: Qian Cai, Michal Hocko, Logan Gunthorpe, David Hildenbrand,
	Oscar Salvador, Pavel Tatashin, linux-mm, linux-nvdimm,
	linux-kernel

In preparation for sub-section hotplug, track whether a given section
was created during early memory initialization, or later via memory
hotplug.  This distinction is needed to maintain the coarse expectation
that pfn_valid() returns true for any pfn within a given section even if
that section has pages that are reserved from the page allocator.

For example one of the of goals of subsection hotplug is to support
cases where the system physical memory layout collides System RAM and
PMEM within a section. Several pfn_valid() users expect to just check if
a section is valid, but they are not careful to check if the given pfn
is within a "System RAM" boundary and instead expect pgdat information
to further validate the pfn.

Rather than unwind those paths to make their pfn_valid() queries more
precise a follow on patch uses the SECTION_IS_EARLY flag to maintain the
traditional expectation that pfn_valid() returns true for all early
sections.

Link: https://lore.kernel.org/lkml/1560366952-10660-1-git-send-email-cai@lca.pw/
Reported-by: Qian Cai <cai@lca.pw>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mmzone.h |    8 +++++++-
 mm/sparse.c            |   20 +++++++++-----------
 2 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 179680c94262..d081c9a1d25d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1261,7 +1261,8 @@ extern size_t mem_section_usage_size(void);
 #define	SECTION_MARKED_PRESENT	(1UL<<0)
 #define SECTION_HAS_MEM_MAP	(1UL<<1)
 #define SECTION_IS_ONLINE	(1UL<<2)
-#define SECTION_MAP_LAST_BIT	(1UL<<3)
+#define SECTION_IS_EARLY	(1UL<<3)
+#define SECTION_MAP_LAST_BIT	(1UL<<4)
 #define SECTION_MAP_MASK	(~(SECTION_MAP_LAST_BIT-1))
 #define SECTION_NID_SHIFT	3
 
@@ -1287,6 +1288,11 @@ static inline int valid_section(struct mem_section *section)
 	return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP));
 }
 
+static inline int early_section(struct mem_section *section)
+{
+	return (section && (section->section_mem_map & SECTION_IS_EARLY));
+}
+
 static inline int valid_section_nr(unsigned long nr)
 {
 	return valid_section(__nr_to_section(nr));
diff --git a/mm/sparse.c b/mm/sparse.c
index 71da15cc7432..2031a0694f35 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -288,11 +288,11 @@ struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pn
 
 static void __meminit sparse_init_one_section(struct mem_section *ms,
 		unsigned long pnum, struct page *mem_map,
-		struct mem_section_usage *usage)
+		struct mem_section_usage *usage, unsigned long flags)
 {
 	ms->section_mem_map &= ~SECTION_MAP_MASK;
-	ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
-							SECTION_HAS_MEM_MAP;
+	ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum)
+		| SECTION_HAS_MEM_MAP | flags;
 	ms->usage = usage;
 }
 
@@ -497,7 +497,8 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 			goto failed;
 		}
 		check_usemap_section_nr(nid, usage);
-		sparse_init_one_section(__nr_to_section(pnum), pnum, map, usage);
+		sparse_init_one_section(__nr_to_section(pnum), pnum, map, usage,
+				SECTION_IS_EARLY);
 		usage = (void *) usage + mem_section_usage_size();
 	}
 	sparse_buffer_fini();
@@ -731,7 +732,7 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 	page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
 
 	section_mark_present(ms);
-	sparse_init_one_section(ms, section_nr, memmap, usage);
+	sparse_init_one_section(ms, section_nr, memmap, usage, 0);
 
 out:
 	if (ret < 0) {
@@ -771,19 +772,16 @@ static inline void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
 }
 #endif
 
-static void free_section_usage(struct page *memmap,
+static void free_section_usage(struct mem_section *ms, struct page *memmap,
 		struct mem_section_usage *usage, struct vmem_altmap *altmap)
 {
-	struct page *usage_page;
-
 	if (!usage)
 		return;
 
-	usage_page = virt_to_page(usage);
 	/*
 	 * Check to see if allocation came from hot-plug-add
 	 */
-	if (PageSlab(usage_page) || PageCompound(usage_page)) {
+	if (!early_section(ms)) {
 		kfree(usage);
 		if (memmap)
 			__kfree_section_memmap(memmap, altmap);
@@ -815,6 +813,6 @@ void sparse_remove_one_section(struct mem_section *ms, unsigned long map_offset,
 
 	clear_hwpoisoned_pages(memmap + map_offset,
 			PAGES_PER_SECTION - map_offset);
-	free_section_usage(memmap, usage, altmap);
+	free_section_usage(ms, memmap, usage, altmap);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v10 03/13] mm/sparsemem: Add helpers track active portions of a section at boot
  2019-06-19  5:51 [PATCH v10 00/13] mm: Sub-section memory hotplug support Dan Williams
  2019-06-19  5:51 ` [PATCH v10 01/13] mm/sparsemem: Introduce struct mem_section_usage Dan Williams
  2019-06-19  5:51 ` [PATCH v10 02/13] mm/sparsemem: Introduce a SECTION_IS_EARLY flag Dan Williams
@ 2019-06-19  5:51 ` Dan Williams
  2019-06-24 17:57   ` Oscar Salvador
  2019-06-19  5:51 ` [PATCH v10 04/13] mm/hotplug: Prepare shrink_{zone, pgdat}_span for sub-section removal Dan Williams
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2019-06-19  5:51 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Vlastimil Babka, Logan Gunthorpe, Oscar Salvador,
	Pavel Tatashin, Qian Cai, Jane Chu, linux-mm, linux-nvdimm,
	linux-kernel

Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
sub-section active bitmask, each bit representing a PMD_SIZE span of the
architecture's memory hotplug section size.

The implications of a partially populated section is that pfn_valid()
needs to go beyond a valid_section() check and either determine that the
section is an "early section", or read the sub-section active ranges
from the bitmask. The expectation is that the bitmask (subsection_map)
fits in the same cacheline as the valid_section() / early_section()
data, so the incremental performance overhead to pfn_valid() should be
negligible.

The rationale for using early_section() to short-ciruit the
subsection_map check is that there are legacy code paths that use
pfn_valid() at section granularity before validating the pfn against
pgdat data. So, the early_section() check allows those traditional
assumptions to persist while also permitting subsection_map to tell the
truth for purposes of populating the unused portions of early sections
with PMEM and other ZONE_DEVICE mappings.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Reported-by: Qian Cai <cai@lca.pw>
Tested-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mmzone.h |   33 ++++++++++++++++++++++++++++++++-
 mm/page_alloc.c        |   10 ++++++++--
 mm/sparse.c            |   35 +++++++++++++++++++++++++++++++++++
 3 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d081c9a1d25d..c4e8843e283c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1179,6 +1179,8 @@ struct mem_section_usage {
 	unsigned long pageblock_flags[0];
 };
 
+void subsection_map_init(unsigned long pfn, unsigned long nr_pages);
+
 struct page;
 struct page_ext;
 struct mem_section {
@@ -1322,12 +1324,40 @@ static inline struct mem_section *__pfn_to_section(unsigned long pfn)
 
 extern int __highest_present_section_nr;
 
+static inline int subsection_map_index(unsigned long pfn)
+{
+	return (pfn & ~(PAGE_SECTION_MASK)) / PAGES_PER_SUBSECTION;
+}
+
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
+{
+	int idx = subsection_map_index(pfn);
+
+	return test_bit(idx, ms->usage->subsection_map);
+}
+#else
+static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
+{
+	return 1;
+}
+#endif
+
 #ifndef CONFIG_HAVE_ARCH_PFN_VALID
 static inline int pfn_valid(unsigned long pfn)
 {
+	struct mem_section *ms;
+
 	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
 		return 0;
-	return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
+	ms = __nr_to_section(pfn_to_section_nr(pfn));
+	if (!valid_section(ms))
+		return 0;
+	/*
+	 * Traditionally early sections always returned pfn_valid() for
+	 * the entire section-sized span.
+	 */
+	return early_section(ms) || pfn_section_valid(ms, pfn);
 }
 #endif
 
@@ -1359,6 +1389,7 @@ void sparse_init(void);
 #define sparse_init()	do {} while (0)
 #define sparse_index_init(_sec, _nid)  do {} while (0)
 #define pfn_present pfn_valid
+#define subsection_map_init(_pfn, _nr_pages) do {} while (0)
 #endif /* CONFIG_SPARSEMEM */
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8cc091e87200..8e7215fb6976 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7306,12 +7306,18 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 			       (u64)zone_movable_pfn[i] << PAGE_SHIFT);
 	}
 
-	/* Print out the early node map */
+	/*
+	 * Print out the early node map, and initialize the
+	 * subsection-map relative to active online memory ranges to
+	 * enable future "sub-section" extensions of the memory map.
+	 */
 	pr_info("Early memory node ranges\n");
-	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
 		pr_info("  node %3d: [mem %#018Lx-%#018Lx]\n", nid,
 			(u64)start_pfn << PAGE_SHIFT,
 			((u64)end_pfn << PAGE_SHIFT) - 1);
+		subsection_map_init(start_pfn, end_pfn - start_pfn);
+	}
 
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
diff --git a/mm/sparse.c b/mm/sparse.c
index 2031a0694f35..e9fec3c2f7ec 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -210,6 +210,41 @@ static inline unsigned long first_present_section_nr(void)
 	return next_present_section_nr(-1);
 }
 
+void subsection_mask_set(unsigned long *map, unsigned long pfn,
+		unsigned long nr_pages)
+{
+	int idx = subsection_map_index(pfn);
+	int end = subsection_map_index(pfn + nr_pages - 1);
+
+	bitmap_set(map, idx, end - idx + 1);
+}
+
+void __init subsection_map_init(unsigned long pfn, unsigned long nr_pages)
+{
+	int end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
+	int i, start_sec = pfn_to_section_nr(pfn);
+
+	if (!nr_pages)
+		return;
+
+	for (i = start_sec; i <= end_sec; i++) {
+		struct mem_section *ms;
+		unsigned long pfns;
+
+		pfns = min(nr_pages, PAGES_PER_SECTION
+				- (pfn & ~PAGE_SECTION_MASK));
+		ms = __nr_to_section(i);
+		subsection_mask_set(ms->usage->subsection_map, pfn, pfns);
+
+		pr_debug("%s: sec: %d pfns: %ld set(%d, %d)\n", __func__, i,
+				pfns, subsection_map_index(pfn),
+				subsection_map_index(pfn + pfns - 1));
+
+		pfn += pfns;
+		nr_pages -= pfns;
+	}
+}
+
 /* Record a memory area against a node. */
 void __init memory_present(int nid, unsigned long start, unsigned long end)
 {


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v10 04/13] mm/hotplug: Prepare shrink_{zone, pgdat}_span for sub-section removal
  2019-06-19  5:51 [PATCH v10 00/13] mm: Sub-section memory hotplug support Dan Williams
                   ` (2 preceding siblings ...)
  2019-06-19  5:51 ` [PATCH v10 03/13] mm/sparsemem: Add helpers track active portions of a section at boot Dan Williams
@ 2019-06-19  5:51 ` Dan Williams
  2019-06-19  5:52 ` [PATCH v10 05/13] mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap() Dan Williams
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2019-06-19  5:51 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Vlastimil Babka, Logan Gunthorpe, Pavel Tatashin,
	Oscar Salvador, linux-mm, linux-nvdimm, linux-kernel

Sub-section hotplug support reduces the unit of operation of hotplug
from section-sized-units (PAGES_PER_SECTION) to sub-section-sized units
(PAGES_PER_SUBSECTION). Teach shrink_{zone,pgdat}_span() to consider
PAGES_PER_SUBSECTION boundaries as the points where pfn_valid(), not
valid_section(), can toggle.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/memory_hotplug.c |   29 ++++++++---------------------
 1 file changed, 8 insertions(+), 21 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7b963c2d3a0d..647859a1d119 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -318,12 +318,8 @@ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
 				     unsigned long start_pfn,
 				     unsigned long end_pfn)
 {
-	struct mem_section *ms;
-
-	for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
-		ms = __pfn_to_section(start_pfn);
-
-		if (unlikely(!valid_section(ms)))
+	for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SUBSECTION) {
+		if (unlikely(!pfn_valid(start_pfn)))
 			continue;
 
 		if (unlikely(pfn_to_nid(start_pfn) != nid))
@@ -343,15 +339,12 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
 				    unsigned long start_pfn,
 				    unsigned long end_pfn)
 {
-	struct mem_section *ms;
 	unsigned long pfn;
 
 	/* pfn is the end pfn of a memory section. */
 	pfn = end_pfn - 1;
-	for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
-		ms = __pfn_to_section(pfn);
-
-		if (unlikely(!valid_section(ms)))
+	for (; pfn >= start_pfn; pfn -= PAGES_PER_SUBSECTION) {
+		if (unlikely(!pfn_valid(pfn)))
 			continue;
 
 		if (unlikely(pfn_to_nid(pfn) != nid))
@@ -373,7 +366,6 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 	unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
 	unsigned long zone_end_pfn = z;
 	unsigned long pfn;
-	struct mem_section *ms;
 	int nid = zone_to_nid(zone);
 
 	zone_span_writelock(zone);
@@ -410,10 +402,8 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 	 * it check the zone has only hole or not.
 	 */
 	pfn = zone_start_pfn;
-	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
-		ms = __pfn_to_section(pfn);
-
-		if (unlikely(!valid_section(ms)))
+	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUBSECTION) {
+		if (unlikely(!pfn_valid(pfn)))
 			continue;
 
 		if (page_zone(pfn_to_page(pfn)) != zone)
@@ -441,7 +431,6 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
 	unsigned long p = pgdat_end_pfn(pgdat); /* pgdat_end_pfn namespace clash */
 	unsigned long pgdat_end_pfn = p;
 	unsigned long pfn;
-	struct mem_section *ms;
 	int nid = pgdat->node_id;
 
 	if (pgdat_start_pfn == start_pfn) {
@@ -478,10 +467,8 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
 	 * has only hole or not.
 	 */
 	pfn = pgdat_start_pfn;
-	for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SECTION) {
-		ms = __pfn_to_section(pfn);
-
-		if (unlikely(!valid_section(ms)))
+	for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SUBSECTION) {
+		if (unlikely(!pfn_valid(pfn)))
 			continue;
 
 		if (pfn_to_nid(pfn) != nid)


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v10 05/13] mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap()
  2019-06-19  5:51 [PATCH v10 00/13] mm: Sub-section memory hotplug support Dan Williams
                   ` (3 preceding siblings ...)
  2019-06-19  5:51 ` [PATCH v10 04/13] mm/hotplug: Prepare shrink_{zone, pgdat}_span for sub-section removal Dan Williams
@ 2019-06-19  5:52 ` Dan Williams
  2019-06-24 18:00   ` Oscar Salvador
  2019-06-19  5:52 ` [PATCH v10 06/13] mm/hotplug: Kill is_dev_zone() usage in __remove_pages() Dan Williams
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2019-06-19  5:52 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, David Hildenbrand, Logan Gunthorpe, Oscar Salvador,
	Pavel Tatashin, linux-mm, linux-nvdimm, linux-kernel

Allow sub-section sized ranges to be added to the memmap.
populate_section_memmap() takes an explict pfn range rather than
assuming a full section, and those parameters are plumbed all the way
through to vmmemap_populate(). There should be no sub-section usage in
current deployments. New warnings are added to clarify which memmap
allocation paths are sub-section capable.

Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/mm/init_64.c |    4 +++-
 include/linux/mm.h    |    4 ++--
 mm/sparse-vmemmap.c   |   21 ++++++++++++++-------
 mm/sparse.c           |   50 ++++++++++++++++++++++++++-----------------------
 4 files changed, 46 insertions(+), 33 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8335ac6e1112..688fb0687e55 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1520,7 +1520,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 {
 	int err;
 
-	if (boot_cpu_has(X86_FEATURE_PSE))
+	if (end - start < PAGES_PER_SECTION * sizeof(struct page))
+		err = vmemmap_populate_basepages(start, end, node);
+	else if (boot_cpu_has(X86_FEATURE_PSE))
 		err = vmemmap_populate_hugepages(start, end, node, altmap);
 	else if (altmap) {
 		pr_err_once("%s: no cpu support for altmap allocations\n",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c6ae9eba645d..f7616518124e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2752,8 +2752,8 @@ const char * arch_vma_name(struct vm_area_struct *vma);
 void print_vma_addr(char *prefix, unsigned long rip);
 
 void *sparse_buffer_alloc(unsigned long size);
-struct page *sparse_mem_map_populate(unsigned long pnum, int nid,
-		struct vmem_altmap *altmap);
+struct page * __populate_section_memmap(unsigned long pfn,
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 7fec05796796..200aef686722 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -245,19 +245,26 @@ int __meminit vmemmap_populate_basepages(unsigned long start,
 	return 0;
 }
 
-struct page * __meminit sparse_mem_map_populate(unsigned long pnum, int nid,
-		struct vmem_altmap *altmap)
+struct page * __meminit __populate_section_memmap(unsigned long pfn,
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
 {
 	unsigned long start;
 	unsigned long end;
-	struct page *map;
 
-	map = pfn_to_page(pnum * PAGES_PER_SECTION);
-	start = (unsigned long)map;
-	end = (unsigned long)(map + PAGES_PER_SECTION);
+	/*
+	 * The minimum granularity of memmap extensions is
+	 * PAGES_PER_SUBSECTION as allocations are tracked in the
+	 * 'subsection_map' bitmap of the section.
+	 */
+	end = ALIGN(pfn + nr_pages, PAGES_PER_SUBSECTION);
+	pfn &= PAGE_SUBSECTION_MASK;
+	nr_pages = end - pfn;
+
+	start = (unsigned long) pfn_to_page(pfn);
+	end = start + nr_pages * sizeof(struct page);
 
 	if (vmemmap_populate(start, end, nid, altmap))
 		return NULL;
 
-	return map;
+	return pfn_to_page(pfn);
 }
diff --git a/mm/sparse.c b/mm/sparse.c
index e9fec3c2f7ec..49f0c03d15a3 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -439,8 +439,8 @@ static unsigned long __init section_map_size(void)
 	return PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION);
 }
 
-struct page __init *sparse_mem_map_populate(unsigned long pnum, int nid,
-		struct vmem_altmap *altmap)
+struct page __init *__populate_section_memmap(unsigned long pfn,
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
 {
 	unsigned long size = section_map_size();
 	struct page *map = sparse_buffer_alloc(size);
@@ -521,10 +521,13 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 	}
 	sparse_buffer_init(map_count * section_map_size(), nid);
 	for_each_present_section_nr(pnum_begin, pnum) {
+		unsigned long pfn = section_nr_to_pfn(pnum);
+
 		if (pnum >= pnum_end)
 			break;
 
-		map = sparse_mem_map_populate(pnum, nid, NULL);
+		map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
+				nid, NULL);
 		if (!map) {
 			pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
 			       __func__, nid);
@@ -625,17 +628,17 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 #endif
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
-static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
-		struct vmem_altmap *altmap)
+static struct page *populate_section_memmap(unsigned long pfn,
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
 {
-	/* This will make the necessary allocations eventually. */
-	return sparse_mem_map_populate(pnum, nid, altmap);
+	return __populate_section_memmap(pfn, nr_pages, nid, altmap);
 }
-static void __kfree_section_memmap(struct page *memmap,
+
+static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
 		struct vmem_altmap *altmap)
 {
-	unsigned long start = (unsigned long)memmap;
-	unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
+	unsigned long start = (unsigned long) pfn_to_page(pfn);
+	unsigned long end = start + nr_pages * sizeof(struct page);
 
 	vmemmap_free(start, end, altmap);
 }
@@ -647,7 +650,8 @@ static void free_map_bootmem(struct page *memmap)
 	vmemmap_free(start, end, NULL);
 }
 #else
-static struct page *__kmalloc_section_memmap(void)
+struct page *populate_section_memmap(unsigned long pfn,
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
 {
 	struct page *page, *ret;
 	unsigned long memmap_size = sizeof(struct page) * PAGES_PER_SECTION;
@@ -668,15 +672,11 @@ static struct page *__kmalloc_section_memmap(void)
 	return ret;
 }
 
-static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
+static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
 		struct vmem_altmap *altmap)
 {
-	return __kmalloc_section_memmap();
-}
+	struct page *memmap = pfn_to_page(pfn);
 
-static void __kfree_section_memmap(struct page *memmap,
-		struct vmem_altmap *altmap)
-{
 	if (is_vmalloc_addr(memmap))
 		vfree(memmap);
 	else
@@ -745,12 +745,13 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 	if (ret < 0 && ret != -EEXIST)
 		return ret;
 	ret = 0;
-	memmap = kmalloc_section_memmap(section_nr, nid, altmap);
+	memmap = populate_section_memmap(start_pfn, PAGES_PER_SECTION, nid,
+			altmap);
 	if (!memmap)
 		return -ENOMEM;
 	usage = kzalloc(mem_section_usage_size(), GFP_KERNEL);
 	if (!usage) {
-		__kfree_section_memmap(memmap, altmap);
+		depopulate_section_memmap(start_pfn, PAGES_PER_SECTION, altmap);
 		return -ENOMEM;
 	}
 
@@ -772,7 +773,7 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 out:
 	if (ret < 0) {
 		kfree(usage);
-		__kfree_section_memmap(memmap, altmap);
+		depopulate_section_memmap(start_pfn, PAGES_PER_SECTION, altmap);
 	}
 	return ret;
 }
@@ -808,7 +809,8 @@ static inline void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
 #endif
 
 static void free_section_usage(struct mem_section *ms, struct page *memmap,
-		struct mem_section_usage *usage, struct vmem_altmap *altmap)
+		struct mem_section_usage *usage, unsigned long pfn,
+		unsigned long nr_pages, struct vmem_altmap *altmap)
 {
 	if (!usage)
 		return;
@@ -819,7 +821,7 @@ static void free_section_usage(struct mem_section *ms, struct page *memmap,
 	if (!early_section(ms)) {
 		kfree(usage);
 		if (memmap)
-			__kfree_section_memmap(memmap, altmap);
+			depopulate_section_memmap(pfn, nr_pages, altmap);
 		return;
 	}
 
@@ -848,6 +850,8 @@ void sparse_remove_one_section(struct mem_section *ms, unsigned long map_offset,
 
 	clear_hwpoisoned_pages(memmap + map_offset,
 			PAGES_PER_SECTION - map_offset);
-	free_section_usage(ms, memmap, usage, altmap);
+	free_section_usage(ms, memmap, usage,
+			section_nr_to_pfn(__section_nr(ms)),
+			PAGES_PER_SECTION, altmap);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v10 06/13] mm/hotplug: Kill is_dev_zone() usage in __remove_pages()
  2019-06-19  5:51 [PATCH v10 00/13] mm: Sub-section memory hotplug support Dan Williams
                   ` (4 preceding siblings ...)
  2019-06-19  5:52 ` [PATCH v10 05/13] mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap() Dan Williams
@ 2019-06-19  5:52 ` Dan Williams
  2019-06-19  5:52 ` [PATCH v10 07/13] mm: Kill is_dev_zone() helper Dan Williams
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2019-06-19  5:52 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Logan Gunthorpe, Pavel Tatashin, David Hildenbrand,
	Oscar Salvador, linux-mm, linux-nvdimm, linux-kernel

The zone type check was a leftover from the cleanup that plumbed altmap
through the memory hotplug path, i.e. commit da024512a1fa "mm: pass the
vmem_altmap to arch_remove_memory and __remove_pages".

Cc: Michal Hocko <mhocko@suse.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/memory_hotplug.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 647859a1d119..4b882c57781a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -535,11 +535,8 @@ void __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
 	unsigned long map_offset = 0;
 	int sections_to_remove;
 
-	/* In the ZONE_DEVICE case device driver owns the memory region */
-	if (is_dev_zone(zone)) {
-		if (altmap)
-			map_offset = vmem_altmap_offset(altmap);
-	}
+	if (altmap)
+		map_offset = vmem_altmap_offset(altmap);
 
 	clear_zone_contiguous(zone);
 


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v10 07/13] mm: Kill is_dev_zone() helper
  2019-06-19  5:51 [PATCH v10 00/13] mm: Sub-section memory hotplug support Dan Williams
                   ` (5 preceding siblings ...)
  2019-06-19  5:52 ` [PATCH v10 06/13] mm/hotplug: Kill is_dev_zone() usage in __remove_pages() Dan Williams
@ 2019-06-19  5:52 ` Dan Williams
  2019-06-19  5:52 ` [PATCH v10 08/13] mm/sparsemem: Prepare for sub-section ranges Dan Williams
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2019-06-19  5:52 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Logan Gunthorpe, David Hildenbrand, Oscar Salvador,
	Pavel Tatashin, Wei Yang, linux-mm, linux-nvdimm, linux-kernel

Given there are no more usages of is_dev_zone() outside of 'ifdef
CONFIG_ZONE_DEVICE' protection, kill off the compilation helper.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mmzone.h |   12 ------------
 mm/page_alloc.c        |    2 +-
 2 files changed, 1 insertion(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c4e8843e283c..e976faf57292 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -855,18 +855,6 @@ static inline int local_memory_node(int node_id) { return node_id; };
  */
 #define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
 
-#ifdef CONFIG_ZONE_DEVICE
-static inline bool is_dev_zone(const struct zone *zone)
-{
-	return zone_idx(zone) == ZONE_DEVICE;
-}
-#else
-static inline bool is_dev_zone(const struct zone *zone)
-{
-	return false;
-}
-#endif
-
 /*
  * Returns true if a zone has pages managed by the buddy allocator.
  * All the reclaim decisions have to use this function rather than
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8e7215fb6976..12b2afd3a529 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5881,7 +5881,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
 
-	if (WARN_ON_ONCE(!pgmap || !is_dev_zone(zone)))
+	if (WARN_ON_ONCE(!pgmap || zone_idx(zone) != ZONE_DEVICE))
 		return;
 
 	/*


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v10 08/13] mm/sparsemem: Prepare for sub-section ranges
  2019-06-19  5:51 [PATCH v10 00/13] mm: Sub-section memory hotplug support Dan Williams
                   ` (6 preceding siblings ...)
  2019-06-19  5:52 ` [PATCH v10 07/13] mm: Kill is_dev_zone() helper Dan Williams
@ 2019-06-19  5:52 ` Dan Williams
  2019-06-20 10:31   ` David Hildenbrand
  2019-06-24 18:05   ` Oscar Salvador
  2019-06-19  5:52 ` [PATCH v10 09/13] mm/sparsemem: Support sub-section hotplug Dan Williams
                   ` (6 subsequent siblings)
  14 siblings, 2 replies; 29+ messages in thread
From: Dan Williams @ 2019-06-19  5:52 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Vlastimil Babka, Logan Gunthorpe, Oscar Salvador,
	Pavel Tatashin, linux-mm, linux-nvdimm, linux-kernel

Prepare the memory hot-{add,remove} paths for handling sub-section
ranges by plumbing the starting page frame and number of pages being
handled through arch_{add,remove}_memory() to
sparse_{add,remove}_one_section().

This is simply plumbing, small cleanups, and some identifier renames. No
intended functional changes.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/memory_hotplug.h |    5 +-
 mm/memory_hotplug.c            |  114 +++++++++++++++++++++++++---------------
 mm/sparse.c                    |   16 ++----
 3 files changed, 81 insertions(+), 54 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 79e0add6a597..3ab0282b4fe5 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -348,9 +348,10 @@ extern int add_memory_resource(int nid, struct resource *resource);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
 extern bool is_memblock_offlined(struct memory_block *mem);
-extern int sparse_add_one_section(int nid, unsigned long start_pfn,
-				  struct vmem_altmap *altmap);
+extern int sparse_add_section(int nid, unsigned long pfn,
+		unsigned long nr_pages, struct vmem_altmap *altmap);
 extern void sparse_remove_one_section(struct mem_section *ms,
+		unsigned long pfn, unsigned long nr_pages,
 		unsigned long map_offset, struct vmem_altmap *altmap);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
 					  unsigned long pnum);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4b882c57781a..399bf78bccc5 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -252,51 +252,84 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
 }
 #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
 
-static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
-				   struct vmem_altmap *altmap)
+static int __meminit __add_section(int nid, unsigned long pfn,
+		unsigned long nr_pages,	struct vmem_altmap *altmap)
 {
 	int ret;
 
-	if (pfn_valid(phys_start_pfn))
+	if (pfn_valid(pfn))
 		return -EEXIST;
 
-	ret = sparse_add_one_section(nid, phys_start_pfn, altmap);
+	ret = sparse_add_section(nid, pfn, nr_pages, altmap);
 	return ret < 0 ? ret : 0;
 }
 
+static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
+		const char *reason)
+{
+	/*
+	 * Disallow all operations smaller than a sub-section and only
+	 * allow operations smaller than a section for
+	 * SPARSEMEM_VMEMMAP. Note that check_hotplug_memory_range()
+	 * enforces a larger memory_block_size_bytes() granularity for
+	 * memory that will be marked online, so this check should only
+	 * fire for direct arch_{add,remove}_memory() users outside of
+	 * add_memory_resource().
+	 */
+	unsigned long min_align;
+
+	if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP))
+		min_align = PAGES_PER_SUBSECTION;
+	else
+		min_align = PAGES_PER_SECTION;
+	if (!IS_ALIGNED(pfn, min_align)
+			|| !IS_ALIGNED(nr_pages, min_align)) {
+		WARN(1, "Misaligned __%s_pages start: %#lx end: #%lx\n",
+				reason, pfn, pfn + nr_pages - 1);
+		return -EINVAL;
+	}
+	return 0;
+}
+
 /*
  * Reasonably generic function for adding memory.  It is
  * expected that archs that support memory hotplug will
  * call this function after deciding the zone to which to
  * add the new pages.
  */
-int __ref __add_pages(int nid, unsigned long phys_start_pfn,
-		unsigned long nr_pages, struct mhp_restrictions *restrictions)
+int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
+		struct mhp_restrictions *restrictions)
 {
 	unsigned long i;
-	int err = 0;
-	int start_sec, end_sec;
+	int start_sec, end_sec, err;
 	struct vmem_altmap *altmap = restrictions->altmap;
 
-	/* during initialize mem_map, align hot-added range to section */
-	start_sec = pfn_to_section_nr(phys_start_pfn);
-	end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
-
 	if (altmap) {
 		/*
 		 * Validate altmap is within bounds of the total request
 		 */
-		if (altmap->base_pfn != phys_start_pfn
+		if (altmap->base_pfn != pfn
 				|| vmem_altmap_offset(altmap) > nr_pages) {
 			pr_warn_once("memory add fail, invalid altmap\n");
-			err = -EINVAL;
-			goto out;
+			return -EINVAL;
 		}
 		altmap->alloc = 0;
 	}
 
+	err = check_pfn_span(pfn, nr_pages, "add");
+	if (err)
+		return err;
+
+	start_sec = pfn_to_section_nr(pfn);
+	end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
 	for (i = start_sec; i <= end_sec; i++) {
-		err = __add_section(nid, section_nr_to_pfn(i), altmap);
+		unsigned long pfns;
+
+		pfns = min(nr_pages, PAGES_PER_SECTION
+				- (pfn & ~PAGE_SECTION_MASK));
+		err = __add_section(nid, pfn, pfns, altmap);
+		pfn += pfns;
+		nr_pages -= pfns;
 
 		/*
 		 * EEXIST is finally dealt with by ioresource collision
@@ -309,7 +342,6 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 		cond_resched();
 	}
 	vmemmap_populate_print_last();
-out:
 	return err;
 }
 
@@ -487,10 +519,10 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
 	pgdat->node_spanned_pages = 0;
 }
 
-static void __remove_zone(struct zone *zone, unsigned long start_pfn)
+static void __remove_zone(struct zone *zone, unsigned long start_pfn,
+		unsigned long nr_pages)
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
-	int nr_pages = PAGES_PER_SECTION;
 	unsigned long flags;
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
@@ -499,27 +531,23 @@ static void __remove_zone(struct zone *zone, unsigned long start_pfn)
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 }
 
-static void __remove_section(struct zone *zone, struct mem_section *ms,
-			     unsigned long map_offset,
-			     struct vmem_altmap *altmap)
+static void __remove_section(struct zone *zone, unsigned long pfn,
+		unsigned long nr_pages, unsigned long map_offset,
+		struct vmem_altmap *altmap)
 {
-	unsigned long start_pfn;
-	int scn_nr;
+	struct mem_section *ms = __nr_to_section(pfn_to_section_nr(pfn));
 
 	if (WARN_ON_ONCE(!valid_section(ms)))
 		return;
 
-	scn_nr = __section_nr(ms);
-	start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
-	__remove_zone(zone, start_pfn);
-
-	sparse_remove_one_section(ms, map_offset, altmap);
+	__remove_zone(zone, pfn, nr_pages);
+	sparse_remove_one_section(ms, pfn, nr_pages, map_offset, altmap);
 }
 
 /**
  * __remove_pages() - remove sections of pages from a zone
  * @zone: zone from which pages need to be removed
- * @phys_start_pfn: starting pageframe (must be aligned to start of a section)
+ * @pfn: starting pageframe (must be aligned to start of a section)
  * @nr_pages: number of pages to remove (must be multiple of section size)
  * @altmap: alternative device page map or %NULL if default memmap is used
  *
@@ -528,31 +556,31 @@ static void __remove_section(struct zone *zone, struct mem_section *ms,
  * sure that pages are marked reserved and zones are adjust properly by
  * calling offline_pages().
  */
-void __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
+void __remove_pages(struct zone *zone, unsigned long pfn,
 		    unsigned long nr_pages, struct vmem_altmap *altmap)
 {
-	unsigned long i;
 	unsigned long map_offset = 0;
-	int sections_to_remove;
+	int i, start_sec, end_sec;
 
 	if (altmap)
 		map_offset = vmem_altmap_offset(altmap);
 
 	clear_zone_contiguous(zone);
 
-	/*
-	 * We can only remove entire sections
-	 */
-	BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK);
-	BUG_ON(nr_pages % PAGES_PER_SECTION);
+	if (check_pfn_span(pfn, nr_pages, "remove"))
+		return;
 
-	sections_to_remove = nr_pages / PAGES_PER_SECTION;
-	for (i = 0; i < sections_to_remove; i++) {
-		unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;
+	start_sec = pfn_to_section_nr(pfn);
+	end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
+	for (i = start_sec; i <= end_sec; i++) {
+		unsigned long pfns;
 
 		cond_resched();
-		__remove_section(zone, __pfn_to_section(pfn), map_offset,
-				 altmap);
+		pfns = min(nr_pages, PAGES_PER_SECTION
+				- (pfn & ~PAGE_SECTION_MASK));
+		__remove_section(zone, pfn, pfns, map_offset, altmap);
+		pfn += pfns;
+		nr_pages -= pfns;
 		map_offset = 0;
 	}
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 49f0c03d15a3..ad47e25c8f94 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -728,8 +728,8 @@ static void free_map_bootmem(struct page *memmap)
  * * -EEXIST	- Section has been present.
  * * -ENOMEM	- Out of memory.
  */
-int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
-				     struct vmem_altmap *altmap)
+int __meminit sparse_add_section(int nid, unsigned long start_pfn,
+		unsigned long nr_pages, struct vmem_altmap *altmap)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
 	struct mem_section_usage *usage;
@@ -834,8 +834,9 @@ static void free_section_usage(struct mem_section *ms, struct page *memmap,
 		free_map_bootmem(memmap);
 }
 
-void sparse_remove_one_section(struct mem_section *ms, unsigned long map_offset,
-			       struct vmem_altmap *altmap)
+void sparse_remove_one_section(struct mem_section *ms, unsigned long pfn,
+		unsigned long nr_pages, unsigned long map_offset,
+		struct vmem_altmap *altmap)
 {
 	struct page *memmap = NULL;
 	struct mem_section_usage *usage = NULL;
@@ -848,10 +849,7 @@ void sparse_remove_one_section(struct mem_section *ms, unsigned long map_offset,
 		ms->usage = NULL;
 	}
 
-	clear_hwpoisoned_pages(memmap + map_offset,
-			PAGES_PER_SECTION - map_offset);
-	free_section_usage(ms, memmap, usage,
-			section_nr_to_pfn(__section_nr(ms)),
-			PAGES_PER_SECTION, altmap);
+	clear_hwpoisoned_pages(memmap + map_offset, nr_pages - map_offset);
+	free_section_usage(ms, memmap, usage, pfn, nr_pages, altmap);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v10 09/13] mm/sparsemem: Support sub-section hotplug
  2019-06-19  5:51 [PATCH v10 00/13] mm: Sub-section memory hotplug support Dan Williams
                   ` (7 preceding siblings ...)
  2019-06-19  5:52 ` [PATCH v10 08/13] mm/sparsemem: Prepare for sub-section ranges Dan Williams
@ 2019-06-19  5:52 ` Dan Williams
  2019-06-24 20:45   ` Oscar Salvador
  2019-06-19  5:52 ` [PATCH v10 10/13] mm: Document ZONE_DEVICE memory-model implications Dan Williams
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2019-06-19  5:52 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Vlastimil Babka, Logan Gunthorpe, Oscar Salvador,
	Pavel Tatashin, linux-mm, linux-nvdimm, linux-kernel

The libnvdimm sub-system has suffered a series of hacks and broken
workarounds for the memory-hotplug implementation's awkward
section-aligned (128MB) granularity. For example the following backtrace
is emitted when attempting arch_add_memory() with physical address
ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
within a given section:

    # cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
    100000000-1ffffffff : System RAM
    200000000-303ffffff : Persistent Memory (legacy)
    304000000-43fffffff : System RAM
    440000000-23ffffffff : Persistent Memory
    2400000000-43bfffffff : Persistent Memory
      2400000000-43bfffffff : namespace2.0

    WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
    [..]
    RIP: 0010:add_pages+0x5c/0x60
    [..]
    Call Trace:
     devm_memremap_pages+0x460/0x6e0
     pmem_attach_disk+0x29e/0x680 [nd_pmem]
     ? nd_dax_probe+0xfc/0x120 [libnvdimm]
     nvdimm_bus_probe+0x66/0x160 [libnvdimm]

It was discovered that the problem goes beyond RAM vs PMEM collisions as
some platform produce PMEM vs PMEM collisions within a given section.
The libnvdimm workaround for that case revealed that the libnvdimm
section-alignment-padding implementation has been broken for a long
while. A fix for that long-standing breakage introduces as many problems
as it solves as it would require a backward-incompatible change to the
namespace metadata interpretation. Instead of that dubious route [1],
address the root problem in the memory-hotplug implementation.

Note that EEXIST is no longer treated as success as that is how
sparse_add_section() reports subsection collisions, it was also obviated
by recent changes to perform the request_region() for 'System RAM'
before arch_add_memory() in the add_memory() sequence.

[1]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/memory_hotplug.h |    2 
 mm/memory_hotplug.c            |   27 +----
 mm/page_alloc.c                |    2 
 mm/sparse.c                    |  205 ++++++++++++++++++++++++++--------------
 4 files changed, 140 insertions(+), 96 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 3ab0282b4fe5..0b8a5e5ef2da 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -350,7 +350,7 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern int sparse_add_section(int nid, unsigned long pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
-extern void sparse_remove_one_section(struct mem_section *ms,
+extern void sparse_remove_section(struct mem_section *ms,
 		unsigned long pfn, unsigned long nr_pages,
 		unsigned long map_offset, struct vmem_altmap *altmap);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 399bf78bccc5..4e8e65954f31 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -252,18 +252,6 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
 }
 #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
 
-static int __meminit __add_section(int nid, unsigned long pfn,
-		unsigned long nr_pages,	struct vmem_altmap *altmap)
-{
-	int ret;
-
-	if (pfn_valid(pfn))
-		return -EEXIST;
-
-	ret = sparse_add_section(nid, pfn, nr_pages, altmap);
-	return ret < 0 ? ret : 0;
-}
-
 static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
 		const char *reason)
 {
@@ -327,18 +315,11 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 
 		pfns = min(nr_pages, PAGES_PER_SECTION
 				- (pfn & ~PAGE_SECTION_MASK));
-		err = __add_section(nid, pfn, pfns, altmap);
+		err = sparse_add_section(nid, pfn, pfns, altmap);
+		if (err)
+			break;
 		pfn += pfns;
 		nr_pages -= pfns;
-
-		/*
-		 * EEXIST is finally dealt with by ioresource collision
-		 * check. see add_memory() => register_memory_resource()
-		 * Warning will be printed if there is collision.
-		 */
-		if (err && (err != -EEXIST))
-			break;
-		err = 0;
 		cond_resched();
 	}
 	vmemmap_populate_print_last();
@@ -541,7 +522,7 @@ static void __remove_section(struct zone *zone, unsigned long pfn,
 		return;
 
 	__remove_zone(zone, pfn, nr_pages);
-	sparse_remove_one_section(ms, pfn, nr_pages, map_offset, altmap);
+	sparse_remove_section(ms, pfn, nr_pages, map_offset, altmap);
 }
 
 /**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 12b2afd3a529..5b3266d63521 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5931,7 +5931,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		 * pfn out of zone.
 		 *
 		 * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
-		 * because this is done early in sparse_add_one_section
+		 * because this is done early in section_activate()
 		 */
 		if (!(pfn & (pageblock_nr_pages - 1))) {
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
diff --git a/mm/sparse.c b/mm/sparse.c
index ad47e25c8f94..b77ca21a27a4 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -83,8 +83,15 @@ static int __meminit sparse_index_init(unsigned long section_nr, int nid)
 	unsigned long root = SECTION_NR_TO_ROOT(section_nr);
 	struct mem_section *section;
 
+	/*
+	 * An existing section is possible in the sub-section hotplug
+	 * case. First hot-add instantiates, follow-on hot-add reuses
+	 * the existing section.
+	 *
+	 * The mem_hotplug_lock resolves the apparent race below.
+	 */
 	if (mem_section[root])
-		return -EEXIST;
+		return 0;
 
 	section = sparse_index_alloc(nid);
 	if (!section)
@@ -715,10 +722,119 @@ static void free_map_bootmem(struct page *memmap)
 }
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
+static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
+		struct vmem_altmap *altmap)
+{
+	DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
+	DECLARE_BITMAP(tmp, SUBSECTIONS_PER_SECTION) = { 0 };
+	struct mem_section *ms = __pfn_to_section(pfn);
+	struct page *memmap = NULL;
+	unsigned long *subsection_map = ms->usage
+		? &ms->usage->subsection_map[0] : NULL;
+
+	subsection_mask_set(map, pfn, nr_pages);
+	if (subsection_map)
+		bitmap_and(tmp, map, subsection_map, SUBSECTIONS_PER_SECTION);
+
+	if (WARN(!subsection_map || !bitmap_equal(tmp, map, SUBSECTIONS_PER_SECTION),
+				"section already deactivated (%#lx + %ld)\n",
+				pfn, nr_pages))
+		return;
+
+	/*
+	 * There are 3 cases to handle across two configurations
+	 * (SPARSEMEM_VMEMMAP={y,n}):
+	 *
+	 * 1/ deactivation of a partial hot-added section (only possible
+	 * in the SPARSEMEM_VMEMMAP=y case).
+	 *    a/ section was present at memory init
+	 *    b/ section was hot-added post memory init
+	 * 2/ deactivation of a complete hot-added section
+	 * 3/ deactivation of a complete section from memory init
+	 *
+	 * For 1/, when subsection_map does not empty we will not be
+	 * freeing the usage map, but still need to free the vmemmap
+	 * range.
+	 *
+	 * For 2/ and 3/ the SPARSEMEM_VMEMMAP={y,n} cases are unified
+	 */
+	bitmap_xor(subsection_map, map, subsection_map, SUBSECTIONS_PER_SECTION);
+	if (bitmap_empty(subsection_map, SUBSECTIONS_PER_SECTION)) {
+		unsigned long section_nr = pfn_to_section_nr(pfn);
+
+		if (!early_section(ms)) {
+			kfree(ms->usage);
+			ms->usage = NULL;
+		}
+		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
+		ms->section_mem_map = sparse_encode_mem_map(NULL, section_nr);
+	}
+
+	if (early_section(ms) && memmap)
+		free_map_bootmem(memmap);
+	else
+		depopulate_section_memmap(pfn, nr_pages, altmap);
+}
+
+static struct page * __meminit section_activate(int nid, unsigned long pfn,
+		unsigned long nr_pages, struct vmem_altmap *altmap)
+{
+	DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
+	struct mem_section *ms = __pfn_to_section(pfn);
+	struct mem_section_usage *usage = NULL;
+	unsigned long *subsection_map;
+	struct page *memmap;
+	int rc = 0;
+
+	subsection_mask_set(map, pfn, nr_pages);
+
+	if (!ms->usage) {
+		usage = kzalloc(mem_section_usage_size(), GFP_KERNEL);
+		if (!usage)
+			return ERR_PTR(-ENOMEM);
+		ms->usage = usage;
+	}
+	subsection_map = &ms->usage->subsection_map[0];
+
+	if (bitmap_empty(map, SUBSECTIONS_PER_SECTION))
+		rc = -EINVAL;
+	else if (bitmap_intersects(map, subsection_map, SUBSECTIONS_PER_SECTION))
+		rc = -EEXIST;
+	else
+		bitmap_or(subsection_map, map, subsection_map,
+				SUBSECTIONS_PER_SECTION);
+
+	if (rc) {
+		if (usage)
+			ms->usage = NULL;
+		kfree(usage);
+		return ERR_PTR(rc);
+	}
+
+	/*
+	 * The early init code does not consider partially populated
+	 * initial sections, it simply assumes that memory will never be
+	 * referenced.  If we hot-add memory into such a section then we
+	 * do not need to populate the memmap and can simply reuse what
+	 * is already there.
+	 */
+	if (nr_pages < PAGES_PER_SECTION && early_section(ms))
+		return pfn_to_page(pfn);
+
+	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap);
+	if (!memmap) {
+		section_deactivate(pfn, nr_pages, altmap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	return memmap;
+}
+
 /**
- * sparse_add_one_section - add a memory section
+ * sparse_add_section - add a memory section, or populate an existing one
  * @nid: The node to add section on
  * @start_pfn: start pfn of the memory range
+ * @nr_pages: number of pfns to add in the section
  * @altmap: device page map
  *
  * This is only intended for hotplug.
@@ -732,50 +848,33 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
-	struct mem_section_usage *usage;
 	struct mem_section *ms;
 	struct page *memmap;
 	int ret;
 
-	/*
-	 * no locking for this, because it does its own
-	 * plus, it does a kmalloc
-	 */
 	ret = sparse_index_init(section_nr, nid);
-	if (ret < 0 && ret != -EEXIST)
+	if (ret < 0)
 		return ret;
-	ret = 0;
-	memmap = populate_section_memmap(start_pfn, PAGES_PER_SECTION, nid,
-			altmap);
-	if (!memmap)
-		return -ENOMEM;
-	usage = kzalloc(mem_section_usage_size(), GFP_KERNEL);
-	if (!usage) {
-		depopulate_section_memmap(start_pfn, PAGES_PER_SECTION, altmap);
-		return -ENOMEM;
-	}
 
-	ms = __pfn_to_section(start_pfn);
-	if (ms->section_mem_map & SECTION_MARKED_PRESENT) {
-		ret = -EEXIST;
-		goto out;
-	}
+	memmap = section_activate(nid, start_pfn, nr_pages, altmap);
+	if (IS_ERR(memmap))
+		return PTR_ERR(memmap);
 
 	/*
 	 * Poison uninitialized struct pages in order to catch invalid flags
 	 * combinations.
 	 */
-	page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
+	page_init_poison(pfn_to_page(start_pfn), sizeof(struct page) * nr_pages);
 
+	ms = __pfn_to_section(start_pfn);
 	section_mark_present(ms);
-	sparse_init_one_section(ms, section_nr, memmap, usage, 0);
 
-out:
-	if (ret < 0) {
-		kfree(usage);
-		depopulate_section_memmap(start_pfn, PAGES_PER_SECTION, altmap);
-	}
-	return ret;
+	/* Align memmap to section boundary in the subsection case */
+	if (section_nr_to_pfn(section_nr) != start_pfn)
+		memmap = pfn_to_kaddr(section_nr_to_pfn(section_nr));
+	sparse_init_one_section(ms, section_nr, memmap, ms->usage, 0);
+
+	return 0;
 }
 
 #ifdef CONFIG_MEMORY_FAILURE
@@ -808,48 +907,12 @@ static inline void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
 }
 #endif
 
-static void free_section_usage(struct mem_section *ms, struct page *memmap,
-		struct mem_section_usage *usage, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
-{
-	if (!usage)
-		return;
-
-	/*
-	 * Check to see if allocation came from hot-plug-add
-	 */
-	if (!early_section(ms)) {
-		kfree(usage);
-		if (memmap)
-			depopulate_section_memmap(pfn, nr_pages, altmap);
-		return;
-	}
-
-	/*
-	 * The usemap came from bootmem. This is packed with other usemaps
-	 * on the section which has pgdat at boot time. Just keep it as is now.
-	 */
-
-	if (memmap)
-		free_map_bootmem(memmap);
-}
-
-void sparse_remove_one_section(struct mem_section *ms, unsigned long pfn,
+void sparse_remove_section(struct mem_section *ms, unsigned long pfn,
 		unsigned long nr_pages, unsigned long map_offset,
 		struct vmem_altmap *altmap)
 {
-	struct page *memmap = NULL;
-	struct mem_section_usage *usage = NULL;
-
-	if (ms->section_mem_map) {
-		usage = ms->usage;
-		memmap = sparse_decode_mem_map(ms->section_mem_map,
-						__section_nr(ms));
-		ms->section_mem_map = 0;
-		ms->usage = NULL;
-	}
-
-	clear_hwpoisoned_pages(memmap + map_offset, nr_pages - map_offset);
-	free_section_usage(ms, memmap, usage, pfn, nr_pages, altmap);
+	clear_hwpoisoned_pages(pfn_to_page(pfn) + map_offset,
+			nr_pages - map_offset);
+	section_deactivate(pfn, nr_pages, altmap);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v10 10/13] mm: Document ZONE_DEVICE memory-model implications
  2019-06-19  5:51 [PATCH v10 00/13] mm: Sub-section memory hotplug support Dan Williams
                   ` (8 preceding siblings ...)
  2019-06-19  5:52 ` [PATCH v10 09/13] mm/sparsemem: Support sub-section hotplug Dan Williams
@ 2019-06-19  5:52 ` Dan Williams
  2019-06-20 12:30   ` Mike Rapoport
  2019-06-19  5:52 ` [PATCH v10 11/13] mm/devm_memremap_pages: Enable sub-section remap Dan Williams
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2019-06-19  5:52 UTC (permalink / raw)
  To: akpm; +Cc: Jonathan Corbet, Mike Rapoport, linux-mm, linux-nvdimm, linux-kernel

Explain the general mechanisms of 'ZONE_DEVICE' pages and list the users
of 'devm_memremap_pages()'.

Cc: Jonathan Corbet <corbet@lwn.net>
Reported-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/vm/memory-model.rst |   39 +++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/Documentation/vm/memory-model.rst b/Documentation/vm/memory-model.rst
index 382f72ace1fc..e0af47e02e78 100644
--- a/Documentation/vm/memory-model.rst
+++ b/Documentation/vm/memory-model.rst
@@ -181,3 +181,42 @@ that is eventually passed to vmemmap_populate() through a long chain
 of function calls. The vmemmap_populate() implementation may use the
 `vmem_altmap` along with :c:func:`altmap_alloc_block_buf` helper to
 allocate memory map on the persistent memory device.
+
+ZONE_DEVICE
+===========
+The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
+`struct page` `mem_map` services for device driver identified physical
+address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
+that the page objects for these address ranges are never marked online,
+and that a reference must be taken against the device, not just the page
+to keep the memory pinned for active use. `ZONE_DEVICE`, via
+:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
+turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
+:c:func:`get_user_pages` service for the given range of pfns. Since the
+page reference count never drops below 1 the page is never tracked as
+free memory and the page's `struct list_head lru` space is repurposed
+for back referencing to the host device / driver that mapped the memory.
+
+While `SPARSEMEM` presents memory as a collection of sections,
+optionally collected into memory blocks, `ZONE_DEVICE` users have a need
+for smaller granularity of populating the `mem_map`. Given that
+`ZONE_DEVICE` memory is never marked online it is subsequently never
+subject to its memory ranges being exposed through the sysfs memory
+hotplug api on memory block boundaries. The implementation relies on
+this lack of user-api constraint to allow sub-section sized memory
+ranges to be specified to :c:func:`arch_add_memory`, the top-half of
+memory hotplug. Sub-section support allows for `PMD_SIZE` as the minimum
+alignment granularity for :c:func:`devm_memremap_pages`.
+
+The users of `ZONE_DEVICE` are:
+* pmem: Map platform persistent memory to be used as a direct-I/O target
+  via DAX mappings.
+
+* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
+  event callbacks to allow a device-driver to coordinate memory management
+  events related to device-memory, typically GPU memory. See
+  Documentation/vm/hmm.rst.
+
+* p2pdma: Create `struct page` objects to allow peer devices in a
+  PCI/-E topology to coordinate direct-DMA operations between themselves,
+  i.e. bypass host memory.


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v10 11/13] mm/devm_memremap_pages: Enable sub-section remap
  2019-06-19  5:51 [PATCH v10 00/13] mm: Sub-section memory hotplug support Dan Williams
                   ` (9 preceding siblings ...)
  2019-06-19  5:52 ` [PATCH v10 10/13] mm: Document ZONE_DEVICE memory-model implications Dan Williams
@ 2019-06-19  5:52 ` Dan Williams
  2019-06-19  5:52 ` [PATCH v10 12/13] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields Dan Williams
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2019-06-19  5:52 UTC (permalink / raw)
  To: akpm
  Cc: Michal Hocko, Toshi Kani, Jérôme Glisse,
	Logan Gunthorpe, Oscar Salvador, Pavel Tatashin, linux-mm,
	linux-nvdimm, linux-kernel

Teach devm_memremap_pages() about the new sub-section capabilities of
arch_{add,remove}_memory(). Effectively, just replace all usage of
align_start, align_end, and align_size with res->start, res->end, and
resource_size(res). The existing sanity check will still make sure that
the two separate remap attempts do not collide within a sub-section (2MB
on x86).

Cc: Michal Hocko <mhocko@suse.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 kernel/memremap.c |   61 +++++++++++++++++++++--------------------------------
 1 file changed, 24 insertions(+), 37 deletions(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 57980ed4e571..a0e5f6b91b04 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -58,7 +58,7 @@ static unsigned long pfn_first(struct dev_pagemap *pgmap)
 	struct vmem_altmap *altmap = &pgmap->altmap;
 	unsigned long pfn;
 
-	pfn = res->start >> PAGE_SHIFT;
+	pfn = PHYS_PFN(res->start);
 	if (pgmap->altmap_valid)
 		pfn += vmem_altmap_offset(altmap);
 	return pfn;
@@ -86,7 +86,6 @@ static void devm_memremap_pages_release(void *data)
 	struct dev_pagemap *pgmap = data;
 	struct device *dev = pgmap->dev;
 	struct resource *res = &pgmap->res;
-	resource_size_t align_start, align_size;
 	unsigned long pfn;
 	int nid;
 
@@ -96,25 +95,21 @@ static void devm_memremap_pages_release(void *data)
 	pgmap->cleanup(pgmap->ref);
 
 	/* pages are dead and unused, undo the arch mapping */
-	align_start = res->start & ~(PA_SECTION_SIZE - 1);
-	align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
-		- align_start;
-
-	nid = page_to_nid(pfn_to_page(align_start >> PAGE_SHIFT));
+	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
 
 	mem_hotplug_begin();
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-		pfn = align_start >> PAGE_SHIFT;
+		pfn = PHYS_PFN(res->start);
 		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
-				align_size >> PAGE_SHIFT, NULL);
+				PHYS_PFN(resource_size(res)), NULL);
 	} else {
-		arch_remove_memory(nid, align_start, align_size,
+		arch_remove_memory(nid, res->start, resource_size(res),
 				pgmap->altmap_valid ? &pgmap->altmap : NULL);
-		kasan_remove_zero_shadow(__va(align_start), align_size);
+		kasan_remove_zero_shadow(__va(res->start), resource_size(res));
 	}
 	mem_hotplug_done();
 
-	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
+	untrack_pfn(NULL, PHYS_PFN(res->start), resource_size(res));
 	pgmap_array_delete(res);
 	dev_WARN_ONCE(dev, pgmap->altmap.alloc,
 		      "%s: failed to free all reserved pages\n", __func__);
@@ -141,16 +136,13 @@ static void devm_memremap_pages_release(void *data)
  */
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 {
-	resource_size_t align_start, align_size, align_end;
-	struct vmem_altmap *altmap = pgmap->altmap_valid ?
-			&pgmap->altmap : NULL;
 	struct resource *res = &pgmap->res;
 	struct dev_pagemap *conflict_pgmap;
 	struct mhp_restrictions restrictions = {
 		/*
 		 * We do not want any optional features only our own memmap
 		*/
-		.altmap = altmap,
+		.altmap = pgmap->altmap_valid ? &pgmap->altmap : NULL,
 	};
 	pgprot_t pgprot = PAGE_KERNEL;
 	int error, nid, is_ram;
@@ -160,12 +152,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 		return ERR_PTR(-EINVAL);
 	}
 
-	align_start = res->start & ~(PA_SECTION_SIZE - 1);
-	align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
-		- align_start;
-	align_end = align_start + align_size - 1;
-
-	conflict_pgmap = get_dev_pagemap(PHYS_PFN(align_start), NULL);
+	conflict_pgmap = get_dev_pagemap(PHYS_PFN(res->start), NULL);
 	if (conflict_pgmap) {
 		dev_WARN(dev, "Conflicting mapping in same section\n");
 		put_dev_pagemap(conflict_pgmap);
@@ -173,7 +160,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 		goto err_array;
 	}
 
-	conflict_pgmap = get_dev_pagemap(PHYS_PFN(align_end), NULL);
+	conflict_pgmap = get_dev_pagemap(PHYS_PFN(res->end), NULL);
 	if (conflict_pgmap) {
 		dev_WARN(dev, "Conflicting mapping in same section\n");
 		put_dev_pagemap(conflict_pgmap);
@@ -181,7 +168,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 		goto err_array;
 	}
 
-	is_ram = region_intersects(align_start, align_size,
+	is_ram = region_intersects(res->start, resource_size(res),
 		IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
 
 	if (is_ram != REGION_DISJOINT) {
@@ -202,8 +189,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	if (nid < 0)
 		nid = numa_mem_id();
 
-	error = track_pfn_remap(NULL, &pgprot, PHYS_PFN(align_start), 0,
-			align_size);
+	error = track_pfn_remap(NULL, &pgprot, PHYS_PFN(res->start), 0,
+			resource_size(res));
 	if (error)
 		goto err_pfn_remap;
 
@@ -221,25 +208,25 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	 * arch_add_memory().
 	 */
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-		error = add_pages(nid, align_start >> PAGE_SHIFT,
-				align_size >> PAGE_SHIFT, &restrictions);
+		error = add_pages(nid, PHYS_PFN(res->start),
+				PHYS_PFN(resource_size(res)), &restrictions);
 	} else {
-		error = kasan_add_zero_shadow(__va(align_start), align_size);
+		error = kasan_add_zero_shadow(__va(res->start), resource_size(res));
 		if (error) {
 			mem_hotplug_done();
 			goto err_kasan;
 		}
 
-		error = arch_add_memory(nid, align_start, align_size,
-					&restrictions);
+		error = arch_add_memory(nid, res->start, resource_size(res),
+				&restrictions);
 	}
 
 	if (!error) {
 		struct zone *zone;
 
 		zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
-		move_pfn_range_to_zone(zone, align_start >> PAGE_SHIFT,
-				align_size >> PAGE_SHIFT, altmap);
+		move_pfn_range_to_zone(zone, PHYS_PFN(res->start),
+				PHYS_PFN(resource_size(res)), restrictions.altmap);
 	}
 
 	mem_hotplug_done();
@@ -251,8 +238,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	 * to allow us to do the work while not holding the hotplug lock.
 	 */
 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
-				align_start >> PAGE_SHIFT,
-				align_size >> PAGE_SHIFT, pgmap);
+				PHYS_PFN(res->start),
+				PHYS_PFN(resource_size(res)), pgmap);
 	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap) - pfn_first(pgmap));
 
 	error = devm_add_action_or_reset(dev, devm_memremap_pages_release,
@@ -263,9 +250,9 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	return __va(res->start);
 
  err_add_memory:
-	kasan_remove_zero_shadow(__va(align_start), align_size);
+	kasan_remove_zero_shadow(__va(res->start), resource_size(res));
  err_kasan:
-	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
+	untrack_pfn(NULL, PHYS_PFN(res->start), resource_size(res));
  err_pfn_remap:
 	pgmap_array_delete(res);
  err_array:


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v10 12/13] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
  2019-06-19  5:51 [PATCH v10 00/13] mm: Sub-section memory hotplug support Dan Williams
                   ` (10 preceding siblings ...)
  2019-06-19  5:52 ` [PATCH v10 11/13] mm/devm_memremap_pages: Enable sub-section remap Dan Williams
@ 2019-06-19  5:52 ` Dan Williams
  2019-06-19 16:30   ` Aneesh Kumar K.V
  2019-06-19  5:52 ` [PATCH v10 13/13] libnvdimm/pfn: Stop padding pmem namespaces to section alignment Dan Williams
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2019-06-19  5:52 UTC (permalink / raw)
  To: akpm; +Cc: stable, linux-mm, linux-nvdimm, linux-kernel

At namespace creation time there is the potential for the "expected to
be zero" fields of a 'pfn' info-block to be filled with indeterminate
data. While the kernel buffer is zeroed on allocation it is immediately
overwritten by nd_pfn_validate() filling it with the current contents of
the on-media info-block location. For fields like, 'flags' and the
'padding' it potentially means that future implementations can not rely
on those fields being zero.

In preparation to stop using the 'start_pad' and 'end_trunc' fields for
section alignment, arrange for fields that are not explicitly
initialized to be guaranteed zero. Bump the minor version to indicate it
is safe to assume the 'padding' and 'flags' are zero. Otherwise, this
corruption is expected to benign since all other critical fields are
explicitly initialized.

Note The cc: stable is about spreading this new policy to as many
kernels as possible not fixing an issue in those kernels. It is not
until the change titled "libnvdimm/pfn: Stop padding pmem namespaces to
section alignment" where this improper initialization becomes a problem.
So if someone decides to backport "libnvdimm/pfn: Stop padding pmem
namespaces to section alignment" (which is not tagged for stable), make
sure this pre-requisite is flagged.

Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/dax_devs.c |    2 +-
 drivers/nvdimm/pfn.h      |    1 +
 drivers/nvdimm/pfn_devs.c |   18 +++++++++++++++---
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
index 49fc18ee0565..6d22b0f83b3b 100644
--- a/drivers/nvdimm/dax_devs.c
+++ b/drivers/nvdimm/dax_devs.c
@@ -118,7 +118,7 @@ int nd_dax_probe(struct device *dev, struct nd_namespace_common *ndns)
 	nvdimm_bus_unlock(&ndns->dev);
 	if (!dax_dev)
 		return -ENOMEM;
-	pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
+	pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
 	nd_pfn->pfn_sb = pfn_sb;
 	rc = nd_pfn_validate(nd_pfn, DAX_SIG);
 	dev_dbg(dev, "dax: %s\n", rc == 0 ? dev_name(dax_dev) : "<none>");
diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
index f58b849e455b..dfb2bcda8f5a 100644
--- a/drivers/nvdimm/pfn.h
+++ b/drivers/nvdimm/pfn.h
@@ -28,6 +28,7 @@ struct nd_pfn_sb {
 	__le32 end_trunc;
 	/* minor-version-2 record the base alignment of the mapping */
 	__le32 align;
+	/* minor-version-3 guarantee the padding and flags are zero */
 	u8 padding[4000];
 	__le64 checksum;
 };
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 0f81fc56bbfd..4977424693b0 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -412,6 +412,15 @@ static int nd_pfn_clear_memmap_errors(struct nd_pfn *nd_pfn)
 	return 0;
 }
 
+/**
+ * nd_pfn_validate - read and validate info-block
+ * @nd_pfn: fsdax namespace runtime state / properties
+ * @sig: 'devdax' or 'fsdax' signature
+ *
+ * Upon return the info-block buffer contents (->pfn_sb) are
+ * indeterminate when validation fails, and a coherent info-block
+ * otherwise.
+ */
 int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
 {
 	u64 checksum, offset;
@@ -557,7 +566,7 @@ int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns)
 	nvdimm_bus_unlock(&ndns->dev);
 	if (!pfn_dev)
 		return -ENOMEM;
-	pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
+	pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
 	nd_pfn = to_nd_pfn(pfn_dev);
 	nd_pfn->pfn_sb = pfn_sb;
 	rc = nd_pfn_validate(nd_pfn, PFN_SIG);
@@ -694,7 +703,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
 	u64 checksum;
 	int rc;
 
-	pfn_sb = devm_kzalloc(&nd_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL);
+	pfn_sb = devm_kmalloc(&nd_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL);
 	if (!pfn_sb)
 		return -ENOMEM;
 
@@ -703,11 +712,14 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
 		sig = DAX_SIG;
 	else
 		sig = PFN_SIG;
+
 	rc = nd_pfn_validate(nd_pfn, sig);
 	if (rc != -ENODEV)
 		return rc;
 
 	/* no info block, do init */;
+	memset(pfn_sb, 0, sizeof(*pfn_sb));
+
 	nd_region = to_nd_region(nd_pfn->dev.parent);
 	if (nd_region->ro) {
 		dev_info(&nd_pfn->dev,
@@ -760,7 +772,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
 	memcpy(pfn_sb->uuid, nd_pfn->uuid, 16);
 	memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(&ndns->dev), 16);
 	pfn_sb->version_major = cpu_to_le16(1);
-	pfn_sb->version_minor = cpu_to_le16(2);
+	pfn_sb->version_minor = cpu_to_le16(3);
 	pfn_sb->start_pad = cpu_to_le32(start_pad);
 	pfn_sb->end_trunc = cpu_to_le32(end_trunc);
 	pfn_sb->align = cpu_to_le32(nd_pfn->align);


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v10 13/13] libnvdimm/pfn: Stop padding pmem namespaces to section alignment
  2019-06-19  5:51 [PATCH v10 00/13] mm: Sub-section memory hotplug support Dan Williams
                   ` (11 preceding siblings ...)
  2019-06-19  5:52 ` [PATCH v10 12/13] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields Dan Williams
@ 2019-06-19  5:52 ` Dan Williams
  2019-06-20 12:30 ` [PATCH v10 00/13] mm: Sub-section memory hotplug support Aneesh Kumar K.V
  2019-06-20 17:00 ` Oscar Salvador
  14 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2019-06-19  5:52 UTC (permalink / raw)
  To: akpm; +Cc: Jeff Moyer, linux-mm, linux-nvdimm, linux-kernel

Now that the mm core supports section-unaligned hotplug of ZONE_DEVICE
memory, we no longer need to add padding at pfn/dax device creation
time. The kernel will still honor padding established by older kernels.

Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/pfn.h      |   14 --------
 drivers/nvdimm/pfn_devs.c |   77 ++++++++-------------------------------------
 include/linux/mmzone.h    |    3 ++
 3 files changed, 16 insertions(+), 78 deletions(-)

diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
index dfb2bcda8f5a..7381673b7b70 100644
--- a/drivers/nvdimm/pfn.h
+++ b/drivers/nvdimm/pfn.h
@@ -33,18 +33,4 @@ struct nd_pfn_sb {
 	__le64 checksum;
 };
 
-#ifdef CONFIG_SPARSEMEM
-#define PFN_SECTION_ALIGN_DOWN(x) SECTION_ALIGN_DOWN(x)
-#define PFN_SECTION_ALIGN_UP(x) SECTION_ALIGN_UP(x)
-#else
-/*
- * In this case ZONE_DEVICE=n and we will disable 'pfn' device support,
- * but we still want pmem to compile.
- */
-#define PFN_SECTION_ALIGN_DOWN(x) (x)
-#define PFN_SECTION_ALIGN_UP(x) (x)
-#endif
-
-#define PHYS_SECTION_ALIGN_DOWN(x) PFN_PHYS(PFN_SECTION_ALIGN_DOWN(PHYS_PFN(x)))
-#define PHYS_SECTION_ALIGN_UP(x) PFN_PHYS(PFN_SECTION_ALIGN_UP(PHYS_PFN(x)))
 #endif /* __NVDIMM_PFN_H */
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 4977424693b0..2537aa338bd0 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -587,14 +587,14 @@ static u32 info_block_reserve(void)
 }
 
 /*
- * We hotplug memory at section granularity, pad the reserved area from
- * the previous section base to the namespace base address.
+ * We hotplug memory at sub-section granularity, pad the reserved area
+ * from the previous section base to the namespace base address.
  */
 static unsigned long init_altmap_base(resource_size_t base)
 {
 	unsigned long base_pfn = PHYS_PFN(base);
 
-	return PFN_SECTION_ALIGN_DOWN(base_pfn);
+	return SUBSECTION_ALIGN_DOWN(base_pfn);
 }
 
 static unsigned long init_altmap_reserve(resource_size_t base)
@@ -602,7 +602,7 @@ static unsigned long init_altmap_reserve(resource_size_t base)
 	unsigned long reserve = info_block_reserve() >> PAGE_SHIFT;
 	unsigned long base_pfn = PHYS_PFN(base);
 
-	reserve += base_pfn - PFN_SECTION_ALIGN_DOWN(base_pfn);
+	reserve += base_pfn - SUBSECTION_ALIGN_DOWN(base_pfn);
 	return reserve;
 }
 
@@ -633,8 +633,7 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
 		nd_pfn->npfns = le64_to_cpu(pfn_sb->npfns);
 		pgmap->altmap_valid = false;
 	} else if (nd_pfn->mode == PFN_MODE_PMEM) {
-		nd_pfn->npfns = PFN_SECTION_ALIGN_UP((resource_size(res)
-					- offset) / PAGE_SIZE);
+		nd_pfn->npfns = PHYS_PFN((resource_size(res) - offset));
 		if (le64_to_cpu(nd_pfn->pfn_sb->npfns) > nd_pfn->npfns)
 			dev_info(&nd_pfn->dev,
 					"number of pfns truncated from %lld to %ld\n",
@@ -650,54 +649,14 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
 	return 0;
 }
 
-static u64 phys_pmem_align_down(struct nd_pfn *nd_pfn, u64 phys)
-{
-	return min_t(u64, PHYS_SECTION_ALIGN_DOWN(phys),
-			ALIGN_DOWN(phys, nd_pfn->align));
-}
-
-/*
- * Check if pmem collides with 'System RAM', or other regions when
- * section aligned.  Trim it accordingly.
- */
-static void trim_pfn_device(struct nd_pfn *nd_pfn, u32 *start_pad, u32 *end_trunc)
-{
-	struct nd_namespace_common *ndns = nd_pfn->ndns;
-	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
-	struct nd_region *nd_region = to_nd_region(nd_pfn->dev.parent);
-	const resource_size_t start = nsio->res.start;
-	const resource_size_t end = start + resource_size(&nsio->res);
-	resource_size_t adjust, size;
-
-	*start_pad = 0;
-	*end_trunc = 0;
-
-	adjust = start - PHYS_SECTION_ALIGN_DOWN(start);
-	size = resource_size(&nsio->res) + adjust;
-	if (region_intersects(start - adjust, size, IORESOURCE_SYSTEM_RAM,
-				IORES_DESC_NONE) == REGION_MIXED
-			|| nd_region_conflict(nd_region, start - adjust, size))
-		*start_pad = PHYS_SECTION_ALIGN_UP(start) - start;
-
-	/* Now check that end of the range does not collide. */
-	adjust = PHYS_SECTION_ALIGN_UP(end) - end;
-	size = resource_size(&nsio->res) + adjust;
-	if (region_intersects(start, size, IORESOURCE_SYSTEM_RAM,
-				IORES_DESC_NONE) == REGION_MIXED
-			|| !IS_ALIGNED(end, nd_pfn->align)
-			|| nd_region_conflict(nd_region, start, size))
-		*end_trunc = end - phys_pmem_align_down(nd_pfn, end);
-}
-
 static int nd_pfn_init(struct nd_pfn *nd_pfn)
 {
 	struct nd_namespace_common *ndns = nd_pfn->ndns;
 	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
-	u32 start_pad, end_trunc, reserve = info_block_reserve();
 	resource_size_t start, size;
 	struct nd_region *nd_region;
+	unsigned long npfns, align;
 	struct nd_pfn_sb *pfn_sb;
-	unsigned long npfns;
 	phys_addr_t offset;
 	const char *sig;
 	u64 checksum;
@@ -728,43 +687,35 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
 		return -ENXIO;
 	}
 
-	memset(pfn_sb, 0, sizeof(*pfn_sb));
-
-	trim_pfn_device(nd_pfn, &start_pad, &end_trunc);
-	if (start_pad + end_trunc)
-		dev_info(&nd_pfn->dev, "%s alignment collision, truncate %d bytes\n",
-				dev_name(&ndns->dev), start_pad + end_trunc);
-
 	/*
 	 * Note, we use 64 here for the standard size of struct page,
 	 * debugging options may cause it to be larger in which case the
 	 * implementation will limit the pfns advertised through
 	 * ->direct_access() to those that are included in the memmap.
 	 */
-	start = nsio->res.start + start_pad;
+	start = nsio->res.start;
 	size = resource_size(&nsio->res);
-	npfns = PFN_SECTION_ALIGN_UP((size - start_pad - end_trunc - reserve)
-			/ PAGE_SIZE);
+	npfns = PHYS_PFN(size - SZ_8K);
+	align = max(nd_pfn->align, (1UL << SUBSECTION_SHIFT));
 	if (nd_pfn->mode == PFN_MODE_PMEM) {
 		/*
 		 * The altmap should be padded out to the block size used
 		 * when populating the vmemmap. This *should* be equal to
 		 * PMD_SIZE for most architectures.
 		 */
-		offset = ALIGN(start + reserve + 64 * npfns,
-				max(nd_pfn->align, PMD_SIZE)) - start;
+		offset = ALIGN(start + SZ_8K + 64 * npfns, align) - start;
 	} else if (nd_pfn->mode == PFN_MODE_RAM)
-		offset = ALIGN(start + reserve, nd_pfn->align) - start;
+		offset = ALIGN(start + SZ_8K, align) - start;
 	else
 		return -ENXIO;
 
-	if (offset + start_pad + end_trunc >= size) {
+	if (offset >= size) {
 		dev_err(&nd_pfn->dev, "%s unable to satisfy requested alignment\n",
 				dev_name(&ndns->dev));
 		return -ENXIO;
 	}
 
-	npfns = (size - offset - start_pad - end_trunc) / SZ_4K;
+	npfns = PHYS_PFN(size - offset);
 	pfn_sb->mode = cpu_to_le32(nd_pfn->mode);
 	pfn_sb->dataoff = cpu_to_le64(offset);
 	pfn_sb->npfns = cpu_to_le64(npfns);
@@ -773,8 +724,6 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
 	memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(&ndns->dev), 16);
 	pfn_sb->version_major = cpu_to_le16(1);
 	pfn_sb->version_minor = cpu_to_le16(3);
-	pfn_sb->start_pad = cpu_to_le32(start_pad);
-	pfn_sb->end_trunc = cpu_to_le32(end_trunc);
 	pfn_sb->align = cpu_to_le32(nd_pfn->align);
 	checksum = nd_sb_checksum((struct nd_gen_sb *) pfn_sb);
 	pfn_sb->checksum = cpu_to_le64(checksum);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e976faf57292..350a24e48a1b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1161,6 +1161,9 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec)
 #define SUBSECTIONS_PER_SECTION (1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))
 #endif
 
+#define SUBSECTION_ALIGN_UP(pfn) ALIGN((pfn), PAGES_PER_SUBSECTION)
+#define SUBSECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SUBSECTION_MASK)
+
 struct mem_section_usage {
 	DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
 	/* See declaration of similar field in struct zone */


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 12/13] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
  2019-06-19  5:52 ` [PATCH v10 12/13] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields Dan Williams
@ 2019-06-19 16:30   ` Aneesh Kumar K.V
  2019-06-19 17:06     ` Dan Williams
  0 siblings, 1 reply; 29+ messages in thread
From: Aneesh Kumar K.V @ 2019-06-19 16:30 UTC (permalink / raw)
  To: Dan Williams, akpm; +Cc: linux-mm, linux-kernel, stable, linux-nvdimm

Dan Williams <dan.j.williams@intel.com> writes:

> At namespace creation time there is the potential for the "expected to
> be zero" fields of a 'pfn' info-block to be filled with indeterminate
> data. While the kernel buffer is zeroed on allocation it is immediately
> overwritten by nd_pfn_validate() filling it with the current contents of
> the on-media info-block location. For fields like, 'flags' and the
> 'padding' it potentially means that future implementations can not rely
> on those fields being zero.
>
> In preparation to stop using the 'start_pad' and 'end_trunc' fields for
> section alignment, arrange for fields that are not explicitly
> initialized to be guaranteed zero. Bump the minor version to indicate it
> is safe to assume the 'padding' and 'flags' are zero. Otherwise, this
> corruption is expected to benign since all other critical fields are
> explicitly initialized.
>
> Note The cc: stable is about spreading this new policy to as many
> kernels as possible not fixing an issue in those kernels. It is not
> until the change titled "libnvdimm/pfn: Stop padding pmem namespaces to
> section alignment" where this improper initialization becomes a problem.
> So if someone decides to backport "libnvdimm/pfn: Stop padding pmem
> namespaces to section alignment" (which is not tagged for stable), make
> sure this pre-requisite is flagged.

Don't we need a change like below in this patch?

modified   drivers/nvdimm/pfn_devs.c
@@ -452,10 +452,11 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
 	if (memcmp(pfn_sb->parent_uuid, parent_uuid, 16) != 0)
 		return -ENODEV;
 
-	if (__le16_to_cpu(pfn_sb->version_minor) < 1) {
-		pfn_sb->start_pad = 0;
-		pfn_sb->end_trunc = 0;
-	}
+	if ((__le16_to_cpu(pfn_sb->version_minor) < 1) ||
+	    (__le16_to_cpu(pfn_sb->version_minor) >= 3)) {
+			pfn_sb->start_pad = 0;
+			pfn_sb->end_trunc = 0;
+		}


IIUC we want to force the start_pad and end_truc to zero if the pfn_sb
minor version number >= 3. So once we have this patch backported and
older kernel finds a pfn_sb with minor version 3, it will ignore the
start_pad read from the nvdimm and overwrite that with zero here.
This patch doesn't enforce that right? After the next patch we can have
values other than 0 in pfn_sb->start_pad?


>
> Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/nvdimm/dax_devs.c |    2 +-
>  drivers/nvdimm/pfn.h      |    1 +
>  drivers/nvdimm/pfn_devs.c |   18 +++++++++++++++---
>  3 files changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
> index 49fc18ee0565..6d22b0f83b3b 100644
> --- a/drivers/nvdimm/dax_devs.c
> +++ b/drivers/nvdimm/dax_devs.c
> @@ -118,7 +118,7 @@ int nd_dax_probe(struct device *dev, struct nd_namespace_common *ndns)
>  	nvdimm_bus_unlock(&ndns->dev);
>  	if (!dax_dev)
>  		return -ENOMEM;
> -	pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
> +	pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
>  	nd_pfn->pfn_sb = pfn_sb;
>  	rc = nd_pfn_validate(nd_pfn, DAX_SIG);
>  	dev_dbg(dev, "dax: %s\n", rc == 0 ? dev_name(dax_dev) : "<none>");
> diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
> index f58b849e455b..dfb2bcda8f5a 100644
> --- a/drivers/nvdimm/pfn.h
> +++ b/drivers/nvdimm/pfn.h
> @@ -28,6 +28,7 @@ struct nd_pfn_sb {
>  	__le32 end_trunc;
>  	/* minor-version-2 record the base alignment of the mapping */
>  	__le32 align;
> +	/* minor-version-3 guarantee the padding and flags are zero */
>  	u8 padding[4000];
>  	__le64 checksum;
>  };
> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> index 0f81fc56bbfd..4977424693b0 100644
> --- a/drivers/nvdimm/pfn_devs.c
> +++ b/drivers/nvdimm/pfn_devs.c
> @@ -412,6 +412,15 @@ static int nd_pfn_clear_memmap_errors(struct nd_pfn *nd_pfn)
>  	return 0;
>  }
>  
> +/**
> + * nd_pfn_validate - read and validate info-block
> + * @nd_pfn: fsdax namespace runtime state / properties
> + * @sig: 'devdax' or 'fsdax' signature
> + *
> + * Upon return the info-block buffer contents (->pfn_sb) are
> + * indeterminate when validation fails, and a coherent info-block
> + * otherwise.
> + */
>  int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
>  {
>  	u64 checksum, offset;
> @@ -557,7 +566,7 @@ int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns)
>  	nvdimm_bus_unlock(&ndns->dev);
>  	if (!pfn_dev)
>  		return -ENOMEM;
> -	pfn_sb = devm_kzalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
> +	pfn_sb = devm_kmalloc(dev, sizeof(*pfn_sb), GFP_KERNEL);
>  	nd_pfn = to_nd_pfn(pfn_dev);
>  	nd_pfn->pfn_sb = pfn_sb;
>  	rc = nd_pfn_validate(nd_pfn, PFN_SIG);
> @@ -694,7 +703,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>  	u64 checksum;
>  	int rc;
>  
> -	pfn_sb = devm_kzalloc(&nd_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL);
> +	pfn_sb = devm_kmalloc(&nd_pfn->dev, sizeof(*pfn_sb), GFP_KERNEL);
>  	if (!pfn_sb)
>  		return -ENOMEM;
>  
> @@ -703,11 +712,14 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>  		sig = DAX_SIG;
>  	else
>  		sig = PFN_SIG;
> +
>  	rc = nd_pfn_validate(nd_pfn, sig);
>  	if (rc != -ENODEV)
>  		return rc;
>  
>  	/* no info block, do init */;
> +	memset(pfn_sb, 0, sizeof(*pfn_sb));
> +
>  	nd_region = to_nd_region(nd_pfn->dev.parent);
>  	if (nd_region->ro) {
>  		dev_info(&nd_pfn->dev,
> @@ -760,7 +772,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>  	memcpy(pfn_sb->uuid, nd_pfn->uuid, 16);
>  	memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(&ndns->dev), 16);
>  	pfn_sb->version_major = cpu_to_le16(1);
> -	pfn_sb->version_minor = cpu_to_le16(2);
> +	pfn_sb->version_minor = cpu_to_le16(3);
>  	pfn_sb->start_pad = cpu_to_le32(start_pad);
>  	pfn_sb->end_trunc = cpu_to_le32(end_trunc);
>  	pfn_sb->align = cpu_to_le32(nd_pfn->align);
>
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 12/13] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
  2019-06-19 16:30   ` Aneesh Kumar K.V
@ 2019-06-19 17:06     ` Dan Williams
  0 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2019-06-19 17:06 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Linux MM, Linux Kernel Mailing List, stable, linux-nvdimm

On Wed, Jun 19, 2019 at 9:30 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > At namespace creation time there is the potential for the "expected to
> > be zero" fields of a 'pfn' info-block to be filled with indeterminate
> > data. While the kernel buffer is zeroed on allocation it is immediately
> > overwritten by nd_pfn_validate() filling it with the current contents of
> > the on-media info-block location. For fields like, 'flags' and the
> > 'padding' it potentially means that future implementations can not rely
> > on those fields being zero.
> >
> > In preparation to stop using the 'start_pad' and 'end_trunc' fields for
> > section alignment, arrange for fields that are not explicitly
> > initialized to be guaranteed zero. Bump the minor version to indicate it
> > is safe to assume the 'padding' and 'flags' are zero. Otherwise, this
> > corruption is expected to benign since all other critical fields are
> > explicitly initialized.
> >
> > Note The cc: stable is about spreading this new policy to as many
> > kernels as possible not fixing an issue in those kernels. It is not
> > until the change titled "libnvdimm/pfn: Stop padding pmem namespaces to
> > section alignment" where this improper initialization becomes a problem.
> > So if someone decides to backport "libnvdimm/pfn: Stop padding pmem
> > namespaces to section alignment" (which is not tagged for stable), make
> > sure this pre-requisite is flagged.
>
> Don't we need a change like below in this patch?
>
> modified   drivers/nvdimm/pfn_devs.c
> @@ -452,10 +452,11 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
>         if (memcmp(pfn_sb->parent_uuid, parent_uuid, 16) != 0)
>                 return -ENODEV;
>
> -       if (__le16_to_cpu(pfn_sb->version_minor) < 1) {
> -               pfn_sb->start_pad = 0;
> -               pfn_sb->end_trunc = 0;
> -       }
> +       if ((__le16_to_cpu(pfn_sb->version_minor) < 1) ||
> +           (__le16_to_cpu(pfn_sb->version_minor) >= 3)) {
> +                       pfn_sb->start_pad = 0;
> +                       pfn_sb->end_trunc = 0;
> +               }

No, this kills off start_pad and end_trunc permanently.

> IIUC we want to force the start_pad and end_truc to zero if the pfn_sb
> minor version number >= 3. So once we have this patch backported and
> older kernel finds a pfn_sb with minor version 3, it will ignore the
> start_pad read from the nvdimm and overwrite that with zero here.
> This patch doesn't enforce that right? After the next patch we can have
> values other than 0 in pfn_sb->start_pad?

The reason for the version bump is for the kernel to safely assume
that uninitialized fields default to zero, but it's otherwise a nop
when the implementation is explicitly initializing every field by
default.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 08/13] mm/sparsemem: Prepare for sub-section ranges
  2019-06-19  5:52 ` [PATCH v10 08/13] mm/sparsemem: Prepare for sub-section ranges Dan Williams
@ 2019-06-20 10:31   ` David Hildenbrand
  2019-06-20 16:19     ` Dan Williams
  2019-06-24 18:05   ` Oscar Salvador
  1 sibling, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2019-06-20 10:31 UTC (permalink / raw)
  To: Dan Williams, akpm
  Cc: Michal Hocko, Vlastimil Babka, Logan Gunthorpe, Oscar Salvador,
	Pavel Tatashin, linux-mm, linux-nvdimm, linux-kernel

On 19.06.19 07:52, Dan Williams wrote:
> Prepare the memory hot-{add,remove} paths for handling sub-section
> ranges by plumbing the starting page frame and number of pages being
> handled through arch_{add,remove}_memory() to
> sparse_{add,remove}_one_section().
> 
> This is simply plumbing, small cleanups, and some identifier renames. No
> intended functional changes.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/memory_hotplug.h |    5 +-
>  mm/memory_hotplug.c            |  114 +++++++++++++++++++++++++---------------
>  mm/sparse.c                    |   16 ++----
>  3 files changed, 81 insertions(+), 54 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 79e0add6a597..3ab0282b4fe5 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -348,9 +348,10 @@ extern int add_memory_resource(int nid, struct resource *resource);
>  extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>  		unsigned long nr_pages, struct vmem_altmap *altmap);
>  extern bool is_memblock_offlined(struct memory_block *mem);
> -extern int sparse_add_one_section(int nid, unsigned long start_pfn,
> -				  struct vmem_altmap *altmap);
> +extern int sparse_add_section(int nid, unsigned long pfn,
> +		unsigned long nr_pages, struct vmem_altmap *altmap);
>  extern void sparse_remove_one_section(struct mem_section *ms,
> +		unsigned long pfn, unsigned long nr_pages,
>  		unsigned long map_offset, struct vmem_altmap *altmap);
>  extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
>  					  unsigned long pnum);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 4b882c57781a..399bf78bccc5 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -252,51 +252,84 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
>  }
>  #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
>  
> -static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
> -				   struct vmem_altmap *altmap)
> +static int __meminit __add_section(int nid, unsigned long pfn,
> +		unsigned long nr_pages,	struct vmem_altmap *altmap)
>  {
>  	int ret;
>  
> -	if (pfn_valid(phys_start_pfn))
> +	if (pfn_valid(pfn))
>  		return -EEXIST;
>  
> -	ret = sparse_add_one_section(nid, phys_start_pfn, altmap);
> +	ret = sparse_add_section(nid, pfn, nr_pages, altmap);
>  	return ret < 0 ? ret : 0;
>  }
>  
> +static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
> +		const char *reason)
> +{
> +	/*
> +	 * Disallow all operations smaller than a sub-section and only
> +	 * allow operations smaller than a section for
> +	 * SPARSEMEM_VMEMMAP. Note that check_hotplug_memory_range()
> +	 * enforces a larger memory_block_size_bytes() granularity for
> +	 * memory that will be marked online, so this check should only
> +	 * fire for direct arch_{add,remove}_memory() users outside of
> +	 * add_memory_resource().
> +	 */
> +	unsigned long min_align;
> +
> +	if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP))
> +		min_align = PAGES_PER_SUBSECTION;
> +	else
> +		min_align = PAGES_PER_SECTION;
> +	if (!IS_ALIGNED(pfn, min_align)
> +			|| !IS_ALIGNED(nr_pages, min_align)) {
> +		WARN(1, "Misaligned __%s_pages start: %#lx end: #%lx\n",
> +				reason, pfn, pfn + nr_pages - 1);
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
>  /*
>   * Reasonably generic function for adding memory.  It is
>   * expected that archs that support memory hotplug will
>   * call this function after deciding the zone to which to
>   * add the new pages.
>   */
> -int __ref __add_pages(int nid, unsigned long phys_start_pfn,
> -		unsigned long nr_pages, struct mhp_restrictions *restrictions)
> +int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
> +		struct mhp_restrictions *restrictions)
>  {
>  	unsigned long i;
> -	int err = 0;
> -	int start_sec, end_sec;
> +	int start_sec, end_sec, err;
>  	struct vmem_altmap *altmap = restrictions->altmap;
>  
> -	/* during initialize mem_map, align hot-added range to section */
> -	start_sec = pfn_to_section_nr(phys_start_pfn);
> -	end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
> -
>  	if (altmap) {
>  		/*
>  		 * Validate altmap is within bounds of the total request
>  		 */
> -		if (altmap->base_pfn != phys_start_pfn
> +		if (altmap->base_pfn != pfn
>  				|| vmem_altmap_offset(altmap) > nr_pages) {
>  			pr_warn_once("memory add fail, invalid altmap\n");
> -			err = -EINVAL;
> -			goto out;
> +			return -EINVAL;
>  		}
>  		altmap->alloc = 0;
>  	}
>  
> +	err = check_pfn_span(pfn, nr_pages, "add");
> +	if (err)
> +		return err;
> +
> +	start_sec = pfn_to_section_nr(pfn);
> +	end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
>  	for (i = start_sec; i <= end_sec; i++) {
> -		err = __add_section(nid, section_nr_to_pfn(i), altmap);
> +		unsigned long pfns;
> +
> +		pfns = min(nr_pages, PAGES_PER_SECTION
> +				- (pfn & ~PAGE_SECTION_MASK));
> +		err = __add_section(nid, pfn, pfns, altmap);
> +		pfn += pfns;
> +		nr_pages -= pfns;
>  
>  		/*
>  		 * EEXIST is finally dealt with by ioresource collision
> @@ -309,7 +342,6 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
>  		cond_resched();
>  	}
>  	vmemmap_populate_print_last();
> -out:
>  	return err;
>  }
>  
> @@ -487,10 +519,10 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
>  	pgdat->node_spanned_pages = 0;
>  }
>  
> -static void __remove_zone(struct zone *zone, unsigned long start_pfn)
> +static void __remove_zone(struct zone *zone, unsigned long start_pfn,
> +		unsigned long nr_pages)
>  {
>  	struct pglist_data *pgdat = zone->zone_pgdat;
> -	int nr_pages = PAGES_PER_SECTION;
>  	unsigned long flags;
>  
>  	pgdat_resize_lock(zone->zone_pgdat, &flags);
> @@ -499,27 +531,23 @@ static void __remove_zone(struct zone *zone, unsigned long start_pfn)
>  	pgdat_resize_unlock(zone->zone_pgdat, &flags);
>  }
>  
> -static void __remove_section(struct zone *zone, struct mem_section *ms,
> -			     unsigned long map_offset,
> -			     struct vmem_altmap *altmap)
> +static void __remove_section(struct zone *zone, unsigned long pfn,
> +		unsigned long nr_pages, unsigned long map_offset,
> +		struct vmem_altmap *altmap)
>  {
> -	unsigned long start_pfn;
> -	int scn_nr;
> +	struct mem_section *ms = __nr_to_section(pfn_to_section_nr(pfn));
>  
>  	if (WARN_ON_ONCE(!valid_section(ms)))
>  		return;
>  
> -	scn_nr = __section_nr(ms);
> -	start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
> -	__remove_zone(zone, start_pfn);
> -
> -	sparse_remove_one_section(ms, map_offset, altmap);
> +	__remove_zone(zone, pfn, nr_pages);
> +	sparse_remove_one_section(ms, pfn, nr_pages, map_offset, altmap);
>  }
>  
>  /**
>   * __remove_pages() - remove sections of pages from a zone
>   * @zone: zone from which pages need to be removed
> - * @phys_start_pfn: starting pageframe (must be aligned to start of a section)
> + * @pfn: starting pageframe (must be aligned to start of a section)
>   * @nr_pages: number of pages to remove (must be multiple of section size)
>   * @altmap: alternative device page map or %NULL if default memmap is used
>   *
> @@ -528,31 +556,31 @@ static void __remove_section(struct zone *zone, struct mem_section *ms,
>   * sure that pages are marked reserved and zones are adjust properly by
>   * calling offline_pages().
>   */
> -void __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
> +void __remove_pages(struct zone *zone, unsigned long pfn,
>  		    unsigned long nr_pages, struct vmem_altmap *altmap)
>  {
> -	unsigned long i;
>  	unsigned long map_offset = 0;
> -	int sections_to_remove;
> +	int i, start_sec, end_sec;

As mentioned in v9, use "unsigned long" for start_sec and end_sec please.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 10/13] mm: Document ZONE_DEVICE memory-model implications
  2019-06-19  5:52 ` [PATCH v10 10/13] mm: Document ZONE_DEVICE memory-model implications Dan Williams
@ 2019-06-20 12:30   ` Mike Rapoport
  0 siblings, 0 replies; 29+ messages in thread
From: Mike Rapoport @ 2019-06-20 12:30 UTC (permalink / raw)
  To: Dan Williams; +Cc: akpm, Jonathan Corbet, linux-mm, linux-nvdimm, linux-kernel

On Tue, Jun 18, 2019 at 10:52:29PM -0700, Dan Williams wrote:
> Explain the general mechanisms of 'ZONE_DEVICE' pages and list the users
> of 'devm_memremap_pages()'.
> 
> Cc: Jonathan Corbet <corbet@lwn.net>
> Reported-by: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

With one nit below

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  Documentation/vm/memory-model.rst |   39 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/Documentation/vm/memory-model.rst b/Documentation/vm/memory-model.rst
> index 382f72ace1fc..e0af47e02e78 100644
> --- a/Documentation/vm/memory-model.rst
> +++ b/Documentation/vm/memory-model.rst
> @@ -181,3 +181,42 @@ that is eventually passed to vmemmap_populate() through a long chain
>  of function calls. The vmemmap_populate() implementation may use the
>  `vmem_altmap` along with :c:func:`altmap_alloc_block_buf` helper to
>  allocate memory map on the persistent memory device.
> +
> +ZONE_DEVICE
> +===========
> +The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
> +`struct page` `mem_map` services for device driver identified physical
> +address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
> +that the page objects for these address ranges are never marked online,
> +and that a reference must be taken against the device, not just the page
> +to keep the memory pinned for active use. `ZONE_DEVICE`, via
> +:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
> +turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
> +:c:func:`get_user_pages` service for the given range of pfns. Since the
> +page reference count never drops below 1 the page is never tracked as
> +free memory and the page's `struct list_head lru` space is repurposed
> +for back referencing to the host device / driver that mapped the memory.
> +
> +While `SPARSEMEM` presents memory as a collection of sections,
> +optionally collected into memory blocks, `ZONE_DEVICE` users have a need
> +for smaller granularity of populating the `mem_map`. Given that
> +`ZONE_DEVICE` memory is never marked online it is subsequently never
> +subject to its memory ranges being exposed through the sysfs memory
> +hotplug api on memory block boundaries. The implementation relies on
> +this lack of user-api constraint to allow sub-section sized memory
> +ranges to be specified to :c:func:`arch_add_memory`, the top-half of
> +memory hotplug. Sub-section support allows for `PMD_SIZE` as the minimum
> +alignment granularity for :c:func:`devm_memremap_pages`.
> +
> +The users of `ZONE_DEVICE` are:

Sphinx wants an empty line here:
/home/rapoport/git/linux-docs/Documentation/vm/memory-model.rst:213: ERROR:
Unexpected indentation.

> +* pmem: Map platform persistent memory to be used as a direct-I/O target
> +  via DAX mappings.
> +
> +* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
> +  event callbacks to allow a device-driver to coordinate memory management
> +  events related to device-memory, typically GPU memory. See
> +  Documentation/vm/hmm.rst.
> +
> +* p2pdma: Create `struct page` objects to allow peer devices in a
> +  PCI/-E topology to coordinate direct-DMA operations between themselves,
> +  i.e. bypass host memory.
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 00/13] mm: Sub-section memory hotplug support
  2019-06-19  5:51 [PATCH v10 00/13] mm: Sub-section memory hotplug support Dan Williams
                   ` (12 preceding siblings ...)
  2019-06-19  5:52 ` [PATCH v10 13/13] libnvdimm/pfn: Stop padding pmem namespaces to section alignment Dan Williams
@ 2019-06-20 12:30 ` Aneesh Kumar K.V
  2019-06-20 16:30   ` Dan Williams
  2019-06-20 17:00 ` Oscar Salvador
  14 siblings, 1 reply; 29+ messages in thread
From: Aneesh Kumar K.V @ 2019-06-20 12:30 UTC (permalink / raw)
  To: Dan Williams, akpm
  Cc: David Hildenbrand, Jérôme Glisse, Mike Rapoport,
	Jane Chu, Pavel Tatashin, Jonathan Corbet, Qian Cai,
	Logan Gunthorpe, Toshi Kani, Oscar Salvador, Jeff Moyer,
	Michal Hocko, Vlastimil Babka, stable, Wei Yang, linux-mm,
	linux-nvdimm, linux-kernel

Dan Williams <dan.j.williams@intel.com> writes:

> Changes since v9 [1]:
> - Fix multiple issues related to the fact that pfn_valid() has
>   traditionally returned true for any pfn in an 'early' (onlined at
>   boot) section regardless of whether that pfn represented 'System RAM'.
>   Teach pfn_valid() to maintain its traditional behavior in the presence
>   of subsections. Specifically, subsection precision for pfn_valid() is
>   only considered for non-early / hot-plugged sections. (Qian)
>
> - Related to the first item introduce a SECTION_IS_EARLY
>   (->section_mem_map flag) to remove the existing hacks for determining
>   an early section by looking at whether the usemap was allocated from the
>   slab.
>
> - Kill off the EEXIST hackery in __add_pages(). It breaks
>   (arch_add_memory() false-positive) the detection of subsection
>   collisions reported by section_activate(). It is also obviated by
>   David's recent reworks to move the 'System RAM' request_region() earlier
>   in the add_memory() sequence().
>
> - Switch to an arch-independent / static subsection-size of 2MB.
>   Otherwise, a per-arch subsection-size is a roadblock on the path to
>   persistent memory namespace compatibility across archs. (Jeff)
>
> - Update the changelog for "libnvdimm/pfn: Fix fsdax-mode namespace
>   info-block zero-fields" to clarify that the "Cc: stable" is only there
>   as safety measure for a distro that decides to backport "libnvdimm/pfn:
>   Stop padding pmem namespaces to section alignment", otherwise there is
>   no known bug exposure in older kernels. (Andrew)
>   
> - Drop some redundant subsection checks (Oscar)
>
> - Collect some reviewed-bys
>
> [1]: https://lore.kernel.org/lkml/155977186863.2443951.9036044808311959913.stgit@dwillia2-desk3.amr.corp.intel.com/


You can add Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
for ppc64.

BTW even after this series we have the kernel crash mentioned in the
below email on reconfigure. 

https://lore.kernel.org/linux-mm/20190514025354.9108-1-aneesh.kumar@linux.ibm.com

I guess we need to conclude how the reserve space struct page should be
initialized ?

>
> ---
>
> The memory hotplug section is an arbitrary / convenient unit for memory
> hotplug. 'Section-size' units have bled into the user interface
> ('memblock' sysfs) and can not be changed without breaking existing
> userspace. The section-size constraint, while mostly benign for typical
> memory hotplug, has and continues to wreak havoc with 'device-memory'
> use cases, persistent memory (pmem) in particular. Recall that pmem uses
> devm_memremap_pages(), and subsequently arch_add_memory(), to allocate a
> 'struct page' memmap for pmem. However, it does not use the 'bottom
> half' of memory hotplug, i.e. never marks pmem pages online and never
> exposes the userspace memblock interface for pmem. This leaves an
> opening to redress the section-size constraint.
>
> To date, the libnvdimm subsystem has attempted to inject padding to
> satisfy the internal constraints of arch_add_memory(). Beyond
> complicating the code, leading to bugs [2], wasting memory, and limiting
> configuration flexibility, the padding hack is broken when the platform
> changes this physical memory alignment of pmem from one boot to the
> next. Device failure (intermittent or permanent) and physical
> reconfiguration are events that can cause the platform firmware to
> change the physical placement of pmem on a subsequent boot, and device
> failure is an everyday event in a data-center.
>
> It turns out that sections are only a hard requirement of the
> user-facing interface for memory hotplug and with a bit more
> infrastructure sub-section arch_add_memory() support can be added for
> kernel internal usages like devm_memremap_pages(). Here is an analysis
> of the current design assumptions in the current code and how they are
> addressed in the new implementation:
>
> Current design assumptions:
>
> - Sections that describe boot memory (early sections) are never
>   unplugged / removed.
>
> - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a
>   valid_section() check
>
> - __add_pages() and helper routines assume all operations occur in
>   PAGES_PER_SECTION units.
>
> - The memblock sysfs interface only comprehends full sections
>
> New design assumptions:
>
> - Sections are instrumented with a sub-section bitmask to track (on x86)
>   individual 2MB sub-divisions of a 128MB section.
>
> - Partially populated early sections can be extended with additional
>   sub-sections, and those sub-sections can be removed with
>   arch_remove_memory(). With this in place we no longer lose usable memory
>   capacity to padding.
>
> - pfn_valid() is updated to look deeper than valid_section() to also check the
>   active-sub-section mask. This indication is in the same cacheline as
>   the valid_section() so the performance impact is expected to be
>   negligible. So far the lkp robot has not reported any regressions.
>
> - Outside of the core vmemmap population routines which are replaced,
>   other helper routines like shrink_{zone,pgdat}_span() are updated to
>   handle the smaller granularity. Core memory hotplug routines that deal
>   with online memory are not touched.
>
> - The existing memblock sysfs user api guarantees / assumptions are
>   not touched since this capability is limited to !online
>   !memblock-sysfs-accessible sections.
>
> Meanwhile the issue reports continue to roll in from users that do not
> understand when and how the 128MB constraint will bite them. The current
> implementation relied on being able to support at least one misaligned
> namespace, but that immediately falls over on any moderately complex
> namespace creation attempt. Beyond the initial problem of 'System RAM'
> colliding with pmem, and the unsolvable problem of physical alignment
> changes, Linux is now being exposed to platforms that collide pmem
> ranges with other pmem ranges by default [3]. In short,
> devm_memremap_pages() has pushed the venerable section-size constraint
> past the breaking point, and the simplicity of section-aligned
> arch_add_memory() is no longer tenable.
>
> These patches are exposed to the kbuild robot on a subsection-v10 branch
> [4], and a preview of the unit test for this functionality is available
> on the 'subsection-pending' branch of ndctl [5].
>
> [2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
> [3]: https://github.com/pmem/ndctl/issues/76
> [4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
> [5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c
>
> ---
>
> Dan Williams (13):
>       mm/sparsemem: Introduce struct mem_section_usage
>       mm/sparsemem: Introduce a SECTION_IS_EARLY flag
>       mm/sparsemem: Add helpers track active portions of a section at boot
>       mm/hotplug: Prepare shrink_{zone,pgdat}_span for sub-section removal
>       mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap()
>       mm/hotplug: Kill is_dev_zone() usage in __remove_pages()
>       mm: Kill is_dev_zone() helper
>       mm/sparsemem: Prepare for sub-section ranges
>       mm/sparsemem: Support sub-section hotplug
>       mm: Document ZONE_DEVICE memory-model implications
>       mm/devm_memremap_pages: Enable sub-section remap
>       libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
>       libnvdimm/pfn: Stop padding pmem namespaces to section alignment
>
>
>  Documentation/vm/memory-model.rst |   39 ++++
>  arch/x86/mm/init_64.c             |    4 
>  drivers/nvdimm/dax_devs.c         |    2 
>  drivers/nvdimm/pfn.h              |   15 --
>  drivers/nvdimm/pfn_devs.c         |   95 +++-------
>  include/linux/memory_hotplug.h    |    7 -
>  include/linux/mm.h                |    4 
>  include/linux/mmzone.h            |   84 +++++++--
>  kernel/memremap.c                 |   61 +++----
>  mm/memory_hotplug.c               |  173 +++++++++----------
>  mm/page_alloc.c                   |   16 +-
>  mm/sparse-vmemmap.c               |   21 ++
>  mm/sparse.c                       |  335 ++++++++++++++++++++++++-------------
>  13 files changed, 494 insertions(+), 362 deletions(-)


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 08/13] mm/sparsemem: Prepare for sub-section ranges
  2019-06-20 10:31   ` David Hildenbrand
@ 2019-06-20 16:19     ` Dan Williams
  2019-06-20 16:35       ` David Hildenbrand
  0 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2019-06-20 16:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Logan Gunthorpe,
	Oscar Salvador, Pavel Tatashin, Linux MM, linux-nvdimm,
	Linux Kernel Mailing List

On Thu, Jun 20, 2019 at 3:31 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.06.19 07:52, Dan Williams wrote:
> > Prepare the memory hot-{add,remove} paths for handling sub-section
> > ranges by plumbing the starting page frame and number of pages being
> > handled through arch_{add,remove}_memory() to
> > sparse_{add,remove}_one_section().
> >
> > This is simply plumbing, small cleanups, and some identifier renames. No
> > intended functional changes.
> >
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Logan Gunthorpe <logang@deltatee.com>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  include/linux/memory_hotplug.h |    5 +-
> >  mm/memory_hotplug.c            |  114 +++++++++++++++++++++++++---------------
> >  mm/sparse.c                    |   16 ++----
> >  3 files changed, 81 insertions(+), 54 deletions(-)
[..]
> > @@ -528,31 +556,31 @@ static void __remove_section(struct zone *zone, struct mem_section *ms,
> >   * sure that pages are marked reserved and zones are adjust properly by
> >   * calling offline_pages().
> >   */
> > -void __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
> > +void __remove_pages(struct zone *zone, unsigned long pfn,
> >                   unsigned long nr_pages, struct vmem_altmap *altmap)
> >  {
> > -     unsigned long i;
> >       unsigned long map_offset = 0;
> > -     int sections_to_remove;
> > +     int i, start_sec, end_sec;
>
> As mentioned in v9, use "unsigned long" for start_sec and end_sec please.

Honestly I saw you and Andrew going back and forth about "unsigned
long i" that I thought this would be handled by a follow on patchset
when that debate settled.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 00/13] mm: Sub-section memory hotplug support
  2019-06-20 12:30 ` [PATCH v10 00/13] mm: Sub-section memory hotplug support Aneesh Kumar K.V
@ 2019-06-20 16:30   ` Dan Williams
  0 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2019-06-20 16:30 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, David Hildenbrand, Jérôme Glisse,
	Mike Rapoport, Jane Chu, Pavel Tatashin, Jonathan Corbet,
	Qian Cai, Logan Gunthorpe, Toshi Kani, Oscar Salvador,
	Jeff Moyer, Michal Hocko, Vlastimil Babka, stable, Wei Yang,
	Linux MM, linux-nvdimm, Linux Kernel Mailing List

On Thu, Jun 20, 2019 at 5:31 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > Changes since v9 [1]:
> > - Fix multiple issues related to the fact that pfn_valid() has
> >   traditionally returned true for any pfn in an 'early' (onlined at
> >   boot) section regardless of whether that pfn represented 'System RAM'.
> >   Teach pfn_valid() to maintain its traditional behavior in the presence
> >   of subsections. Specifically, subsection precision for pfn_valid() is
> >   only considered for non-early / hot-plugged sections. (Qian)
> >
> > - Related to the first item introduce a SECTION_IS_EARLY
> >   (->section_mem_map flag) to remove the existing hacks for determining
> >   an early section by looking at whether the usemap was allocated from the
> >   slab.
> >
> > - Kill off the EEXIST hackery in __add_pages(). It breaks
> >   (arch_add_memory() false-positive) the detection of subsection
> >   collisions reported by section_activate(). It is also obviated by
> >   David's recent reworks to move the 'System RAM' request_region() earlier
> >   in the add_memory() sequence().
> >
> > - Switch to an arch-independent / static subsection-size of 2MB.
> >   Otherwise, a per-arch subsection-size is a roadblock on the path to
> >   persistent memory namespace compatibility across archs. (Jeff)
> >
> > - Update the changelog for "libnvdimm/pfn: Fix fsdax-mode namespace
> >   info-block zero-fields" to clarify that the "Cc: stable" is only there
> >   as safety measure for a distro that decides to backport "libnvdimm/pfn:
> >   Stop padding pmem namespaces to section alignment", otherwise there is
> >   no known bug exposure in older kernels. (Andrew)
> >
> > - Drop some redundant subsection checks (Oscar)
> >
> > - Collect some reviewed-bys
> >
> > [1]: https://lore.kernel.org/lkml/155977186863.2443951.9036044808311959913.stgit@dwillia2-desk3.amr.corp.intel.com/
>
>
> You can add Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> for ppc64.

Thank you!

> BTW even after this series we have the kernel crash mentioned in the
> below email on reconfigure.
>
> https://lore.kernel.org/linux-mm/20190514025354.9108-1-aneesh.kumar@linux.ibm.com
>
> I guess we need to conclude how the reserve space struct page should be
> initialized ?

Yes, that issue is independent of the subsection changes. I'll take a
closer look.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 08/13] mm/sparsemem: Prepare for sub-section ranges
  2019-06-20 16:19     ` Dan Williams
@ 2019-06-20 16:35       ` David Hildenbrand
  2019-06-20 16:56         ` Dan Williams
  0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2019-06-20 16:35 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Logan Gunthorpe,
	Oscar Salvador, Pavel Tatashin, Linux MM, linux-nvdimm,
	Linux Kernel Mailing List

On 20.06.19 18:19, Dan Williams wrote:
> On Thu, Jun 20, 2019 at 3:31 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 19.06.19 07:52, Dan Williams wrote:
>>> Prepare the memory hot-{add,remove} paths for handling sub-section
>>> ranges by plumbing the starting page frame and number of pages being
>>> handled through arch_{add,remove}_memory() to
>>> sparse_{add,remove}_one_section().
>>>
>>> This is simply plumbing, small cleanups, and some identifier renames. No
>>> intended functional changes.
>>>
>>> Cc: Michal Hocko <mhocko@suse.com>
>>> Cc: Vlastimil Babka <vbabka@suse.cz>
>>> Cc: Logan Gunthorpe <logang@deltatee.com>
>>> Cc: Oscar Salvador <osalvador@suse.de>
>>> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
>>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>>> ---
>>>  include/linux/memory_hotplug.h |    5 +-
>>>  mm/memory_hotplug.c            |  114 +++++++++++++++++++++++++---------------
>>>  mm/sparse.c                    |   16 ++----
>>>  3 files changed, 81 insertions(+), 54 deletions(-)
> [..]
>>> @@ -528,31 +556,31 @@ static void __remove_section(struct zone *zone, struct mem_section *ms,
>>>   * sure that pages are marked reserved and zones are adjust properly by
>>>   * calling offline_pages().
>>>   */
>>> -void __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
>>> +void __remove_pages(struct zone *zone, unsigned long pfn,
>>>                   unsigned long nr_pages, struct vmem_altmap *altmap)
>>>  {
>>> -     unsigned long i;
>>>       unsigned long map_offset = 0;
>>> -     int sections_to_remove;
>>> +     int i, start_sec, end_sec;
>>
>> As mentioned in v9, use "unsigned long" for start_sec and end_sec please.
> 
> Honestly I saw you and Andrew going back and forth about "unsigned
> long i" that I thought this would be handled by a follow on patchset
> when that debate settled.
> 

I'll send a fixup then, once this patch set is final - hoping I won't
forget about it (that's why I asked about using these types in the first
place).

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 08/13] mm/sparsemem: Prepare for sub-section ranges
  2019-06-20 16:35       ` David Hildenbrand
@ 2019-06-20 16:56         ` Dan Williams
  0 siblings, 0 replies; 29+ messages in thread
From: Dan Williams @ 2019-06-20 16:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, Vlastimil Babka, Logan Gunthorpe,
	Oscar Salvador, Pavel Tatashin, Linux MM, linux-nvdimm,
	Linux Kernel Mailing List

On Thu, Jun 20, 2019 at 9:37 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 20.06.19 18:19, Dan Williams wrote:
> > On Thu, Jun 20, 2019 at 3:31 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 19.06.19 07:52, Dan Williams wrote:
> >>> Prepare the memory hot-{add,remove} paths for handling sub-section
> >>> ranges by plumbing the starting page frame and number of pages being
> >>> handled through arch_{add,remove}_memory() to
> >>> sparse_{add,remove}_one_section().
> >>>
> >>> This is simply plumbing, small cleanups, and some identifier renames. No
> >>> intended functional changes.
> >>>
> >>> Cc: Michal Hocko <mhocko@suse.com>
> >>> Cc: Vlastimil Babka <vbabka@suse.cz>
> >>> Cc: Logan Gunthorpe <logang@deltatee.com>
> >>> Cc: Oscar Salvador <osalvador@suse.de>
> >>> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> >>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >>> ---
> >>>  include/linux/memory_hotplug.h |    5 +-
> >>>  mm/memory_hotplug.c            |  114 +++++++++++++++++++++++++---------------
> >>>  mm/sparse.c                    |   16 ++----
> >>>  3 files changed, 81 insertions(+), 54 deletions(-)
> > [..]
> >>> @@ -528,31 +556,31 @@ static void __remove_section(struct zone *zone, struct mem_section *ms,
> >>>   * sure that pages are marked reserved and zones are adjust properly by
> >>>   * calling offline_pages().
> >>>   */
> >>> -void __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
> >>> +void __remove_pages(struct zone *zone, unsigned long pfn,
> >>>                   unsigned long nr_pages, struct vmem_altmap *altmap)
> >>>  {
> >>> -     unsigned long i;
> >>>       unsigned long map_offset = 0;
> >>> -     int sections_to_remove;
> >>> +     int i, start_sec, end_sec;
> >>
> >> As mentioned in v9, use "unsigned long" for start_sec and end_sec please.
> >
> > Honestly I saw you and Andrew going back and forth about "unsigned
> > long i" that I thought this would be handled by a follow on patchset
> > when that debate settled.
> >
>
> I'll send a fixup then, once this patch set is final - hoping I won't
> forget about it (that's why I asked about using these types in the first
> place).

It's in Andrew's tree now, I'll send an incremental patch.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 00/13] mm: Sub-section memory hotplug support
  2019-06-19  5:51 [PATCH v10 00/13] mm: Sub-section memory hotplug support Dan Williams
                   ` (13 preceding siblings ...)
  2019-06-20 12:30 ` [PATCH v10 00/13] mm: Sub-section memory hotplug support Aneesh Kumar K.V
@ 2019-06-20 17:00 ` Oscar Salvador
  14 siblings, 0 replies; 29+ messages in thread
From: Oscar Salvador @ 2019-06-20 17:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, David Hildenbrand, Jérôme Glisse, Mike Rapoport,
	Jane Chu, Pavel Tatashin, Jonathan Corbet, Qian Cai,
	Logan Gunthorpe, Toshi Kani, Jeff Moyer, Michal Hocko,
	Vlastimil Babka, stable, Wei Yang, linux-mm, linux-nvdimm,
	linux-kernel

On Tue, Jun 18, 2019 at 10:51:33PM -0700, Dan Williams wrote:
> Changes since v9 [1]:
> - Fix multiple issues related to the fact that pfn_valid() has
>   traditionally returned true for any pfn in an 'early' (onlined at
>   boot) section regardless of whether that pfn represented 'System RAM'.
>   Teach pfn_valid() to maintain its traditional behavior in the presence
>   of subsections. Specifically, subsection precision for pfn_valid() is
>   only considered for non-early / hot-plugged sections. (Qian)
> 
> - Related to the first item introduce a SECTION_IS_EARLY
>   (->section_mem_map flag) to remove the existing hacks for determining
>   an early section by looking at whether the usemap was allocated from the
>   slab.
> 
> - Kill off the EEXIST hackery in __add_pages(). It breaks
>   (arch_add_memory() false-positive) the detection of subsection
>   collisions reported by section_activate(). It is also obviated by
>   David's recent reworks to move the 'System RAM' request_region() earlier
>   in the add_memory() sequence().
> 
> - Switch to an arch-independent / static subsection-size of 2MB.
>   Otherwise, a per-arch subsection-size is a roadblock on the path to
>   persistent memory namespace compatibility across archs. (Jeff)
> 
> - Update the changelog for "libnvdimm/pfn: Fix fsdax-mode namespace
>   info-block zero-fields" to clarify that the "Cc: stable" is only there
>   as safety measure for a distro that decides to backport "libnvdimm/pfn:
>   Stop padding pmem namespaces to section alignment", otherwise there is
>   no known bug exposure in older kernels. (Andrew)
>   
> - Drop some redundant subsection checks (Oscar)
> 
> - Collect some reviewed-bys
> 
> [1]: https://lore.kernel.org/lkml/155977186863.2443951.9036044808311959913.stgit@dwillia2-desk3.amr.corp.intel.com/

Hi Dan,

I am planning to give it a final review later tomorrow.
Now that this work is settled, I took the chance to dust off and push my
vmemmap-hotplug, and I am working on that right now.
But I would definetely come back to this tomorrow.

Thanks for the work

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 02/13] mm/sparsemem: Introduce a SECTION_IS_EARLY flag
  2019-06-19  5:51 ` [PATCH v10 02/13] mm/sparsemem: Introduce a SECTION_IS_EARLY flag Dan Williams
@ 2019-06-24 17:54   ` Oscar Salvador
  0 siblings, 0 replies; 29+ messages in thread
From: Oscar Salvador @ 2019-06-24 17:54 UTC (permalink / raw)
  To: Dan Williams, akpm
  Cc: Qian Cai, Michal Hocko, Logan Gunthorpe, David Hildenbrand,
	Pavel Tatashin, linux-mm, linux-nvdimm, linux-kernel

On Tue, 2019-06-18 at 22:51 -0700, Dan Williams wrote:
> In preparation for sub-section hotplug, track whether a given section
> was created during early memory initialization, or later via memory
> hotplug.  This distinction is needed to maintain the coarse
> expectation
> that pfn_valid() returns true for any pfn within a given section even
> if
> that section has pages that are reserved from the page allocator.
> 
> For example one of the of goals of subsection hotplug is to support
> cases where the system physical memory layout collides System RAM and
> PMEM within a section. Several pfn_valid() users expect to just check
> if
> a section is valid, but they are not careful to check if the given
> pfn
> is within a "System RAM" boundary and instead expect pgdat
> information
> to further validate the pfn.
> 
> Rather than unwind those paths to make their pfn_valid() queries more
> precise a follow on patch uses the SECTION_IS_EARLY flag to maintain
> the
> traditional expectation that pfn_valid() returns true for all early
> sections.
> 
> Link: https://lore.kernel.org/lkml/1560366952-10660-1-git-send-email-
> cai@lca.pw/
> Reported-by: Qian Cai <cai@lca.pw>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

[...]
> @@ -731,7 +732,7 @@ int __meminit sparse_add_one_section(int nid,
> unsigned long start_pfn,
>  	page_init_poison(memmap, sizeof(struct page) *
> PAGES_PER_SECTION);
>  
>  	section_mark_present(ms);
> -	sparse_init_one_section(ms, section_nr, memmap, usage);
> +	sparse_init_one_section(ms, section_nr, memmap, usage, 0);

I think this is an improvment, and I really like the idea of leveraring
a new section's flag for this, but I have mixed feelings about the way
to mark a section as an early one.
IMHO, I think that a new "section_mark_early" function would be better
than passing a new flag parameter to sparse_init_one_section().

But I do not feel strong on this:

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 03/13] mm/sparsemem: Add helpers track active portions of a section at boot
  2019-06-19  5:51 ` [PATCH v10 03/13] mm/sparsemem: Add helpers track active portions of a section at boot Dan Williams
@ 2019-06-24 17:57   ` Oscar Salvador
  0 siblings, 0 replies; 29+ messages in thread
From: Oscar Salvador @ 2019-06-24 17:57 UTC (permalink / raw)
  To: Dan Williams, akpm
  Cc: Michal Hocko, Vlastimil Babka, Logan Gunthorpe, Pavel Tatashin,
	Qian Cai, Jane Chu, linux-mm, linux-nvdimm, linux-kernel

On Tue, 2019-06-18 at 22:51 -0700, Dan Williams wrote:
> Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
> sub-section active bitmask, each bit representing a PMD_SIZE span of
> the
> architecture's memory hotplug section size.
> 
> The implications of a partially populated section is that pfn_valid()
> needs to go beyond a valid_section() check and either determine that
> the
> section is an "early section", or read the sub-section active ranges
> from the bitmask. The expectation is that the bitmask
> (subsection_map)
> fits in the same cacheline as the valid_section() / early_section()
> data, so the incremental performance overhead to pfn_valid() should
> be
> negligible.
> 
> The rationale for using early_section() to short-ciruit the
> subsection_map check is that there are legacy code paths that use
> pfn_valid() at section granularity before validating the pfn against
> pgdat data. So, the early_section() check allows those traditional
> assumptions to persist while also permitting subsection_map to tell
> the
> truth for purposes of populating the unused portions of early
> sections
> with PMEM and other ZONE_DEVICE mappings.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> Reported-by: Qian Cai <cai@lca.pw>
> Tested-by: Jane Chu <jane.chu@oracle.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 05/13] mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap()
  2019-06-19  5:52 ` [PATCH v10 05/13] mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap() Dan Williams
@ 2019-06-24 18:00   ` Oscar Salvador
  0 siblings, 0 replies; 29+ messages in thread
From: Oscar Salvador @ 2019-06-24 18:00 UTC (permalink / raw)
  To: Dan Williams, akpm
  Cc: Michal Hocko, David Hildenbrand, Logan Gunthorpe, Pavel Tatashin,
	linux-mm, linux-nvdimm, linux-kernel

On Tue, 2019-06-18 at 22:52 -0700, Dan Williams wrote:
> Allow sub-section sized ranges to be added to the memmap.
> populate_section_memmap() takes an explict pfn range rather than
> assuming a full section, and those parameters are plumbed all the way
> through to vmmemap_populate(). There should be no sub-section usage
> in
> current deployments. New warnings are added to clarify which memmap
> allocation paths are sub-section capable.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 08/13] mm/sparsemem: Prepare for sub-section ranges
  2019-06-19  5:52 ` [PATCH v10 08/13] mm/sparsemem: Prepare for sub-section ranges Dan Williams
  2019-06-20 10:31   ` David Hildenbrand
@ 2019-06-24 18:05   ` Oscar Salvador
  1 sibling, 0 replies; 29+ messages in thread
From: Oscar Salvador @ 2019-06-24 18:05 UTC (permalink / raw)
  To: Dan Williams, akpm
  Cc: Michal Hocko, Vlastimil Babka, Logan Gunthorpe, Pavel Tatashin,
	linux-mm, linux-nvdimm, linux-kernel

On Tue, 2019-06-18 at 22:52 -0700, Dan Williams wrote:
> Prepare the memory hot-{add,remove} paths for handling sub-section
> ranges by plumbing the starting page frame and number of pages being
> handled through arch_{add,remove}_memory() to
> sparse_{add,remove}_one_section().
> 
> This is simply plumbing, small cleanups, and some identifier renames.
> No
> intended functional changes.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

I already gave my Reviewed-by in the previous version:

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v10 09/13] mm/sparsemem: Support sub-section hotplug
  2019-06-19  5:52 ` [PATCH v10 09/13] mm/sparsemem: Support sub-section hotplug Dan Williams
@ 2019-06-24 20:45   ` Oscar Salvador
  0 siblings, 0 replies; 29+ messages in thread
From: Oscar Salvador @ 2019-06-24 20:45 UTC (permalink / raw)
  To: Dan Williams, akpm
  Cc: Michal Hocko, Vlastimil Babka, Logan Gunthorpe, Pavel Tatashin,
	linux-mm, linux-nvdimm, linux-kernel

On Tue, 2019-06-18 at 22:52 -0700, Dan Williams wrote:
> The libnvdimm sub-system has suffered a series of hacks and broken
> workarounds for the memory-hotplug implementation's awkward
> section-aligned (128MB) granularity. For example the following
> backtrace
> is emitted when attempting arch_add_memory() with physical address
> ranges that intersect 'System RAM' (RAM) with 'Persistent Memory'
> (PMEM)
> within a given section:
> 
>     # cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
>     100000000-1ffffffff : System RAM
>     200000000-303ffffff : Persistent Memory (legacy)
>     304000000-43fffffff : System RAM
>     440000000-23ffffffff : Persistent Memory
>     2400000000-43bfffffff : Persistent Memory
>       2400000000-43bfffffff : namespace2.0
> 
>     WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850
> add_pages+0x5c/0x60
>     [..]
>     RIP: 0010:add_pages+0x5c/0x60
>     [..]
>     Call Trace:
>      devm_memremap_pages+0x460/0x6e0
>      pmem_attach_disk+0x29e/0x680 [nd_pmem]
>      ? nd_dax_probe+0xfc/0x120 [libnvdimm]
>      nvdimm_bus_probe+0x66/0x160 [libnvdimm]
> 
> It was discovered that the problem goes beyond RAM vs PMEM collisions
> as
> some platform produce PMEM vs PMEM collisions within a given section.
> The libnvdimm workaround for that case revealed that the libnvdimm
> section-alignment-padding implementation has been broken for a long
> while. A fix for that long-standing breakage introduces as many
> problems
> as it solves as it would require a backward-incompatible change to
> the
> namespace metadata interpretation. Instead of that dubious route [1],
> address the root problem in the memory-hotplug implementation.
> 
> Note that EEXIST is no longer treated as success as that is how
> sparse_add_section() reports subsection collisions, it was also
> obviated
> by recent changes to perform the request_region() for 'System RAM'
> before arch_add_memory() in the add_memory() sequence.
> 
> [1]: https://lore.kernel.org/r/155000671719.348031.234736316014111923
> 7.stgit@dwillia2-desk3.amr.corp.intel.com
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2019-06-24 20:45 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-19  5:51 [PATCH v10 00/13] mm: Sub-section memory hotplug support Dan Williams
2019-06-19  5:51 ` [PATCH v10 01/13] mm/sparsemem: Introduce struct mem_section_usage Dan Williams
2019-06-19  5:51 ` [PATCH v10 02/13] mm/sparsemem: Introduce a SECTION_IS_EARLY flag Dan Williams
2019-06-24 17:54   ` Oscar Salvador
2019-06-19  5:51 ` [PATCH v10 03/13] mm/sparsemem: Add helpers track active portions of a section at boot Dan Williams
2019-06-24 17:57   ` Oscar Salvador
2019-06-19  5:51 ` [PATCH v10 04/13] mm/hotplug: Prepare shrink_{zone, pgdat}_span for sub-section removal Dan Williams
2019-06-19  5:52 ` [PATCH v10 05/13] mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap() Dan Williams
2019-06-24 18:00   ` Oscar Salvador
2019-06-19  5:52 ` [PATCH v10 06/13] mm/hotplug: Kill is_dev_zone() usage in __remove_pages() Dan Williams
2019-06-19  5:52 ` [PATCH v10 07/13] mm: Kill is_dev_zone() helper Dan Williams
2019-06-19  5:52 ` [PATCH v10 08/13] mm/sparsemem: Prepare for sub-section ranges Dan Williams
2019-06-20 10:31   ` David Hildenbrand
2019-06-20 16:19     ` Dan Williams
2019-06-20 16:35       ` David Hildenbrand
2019-06-20 16:56         ` Dan Williams
2019-06-24 18:05   ` Oscar Salvador
2019-06-19  5:52 ` [PATCH v10 09/13] mm/sparsemem: Support sub-section hotplug Dan Williams
2019-06-24 20:45   ` Oscar Salvador
2019-06-19  5:52 ` [PATCH v10 10/13] mm: Document ZONE_DEVICE memory-model implications Dan Williams
2019-06-20 12:30   ` Mike Rapoport
2019-06-19  5:52 ` [PATCH v10 11/13] mm/devm_memremap_pages: Enable sub-section remap Dan Williams
2019-06-19  5:52 ` [PATCH v10 12/13] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields Dan Williams
2019-06-19 16:30   ` Aneesh Kumar K.V
2019-06-19 17:06     ` Dan Williams
2019-06-19  5:52 ` [PATCH v10 13/13] libnvdimm/pfn: Stop padding pmem namespaces to section alignment Dan Williams
2019-06-20 12:30 ` [PATCH v10 00/13] mm: Sub-section memory hotplug support Aneesh Kumar K.V
2019-06-20 16:30   ` Dan Williams
2019-06-20 17:00 ` Oscar Salvador

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).