Linux-mm Archive on lore.kernel.org
* [PATCH v2 0/5] Do not touch pages in hot-remove path
@ 2018-11-27 16:20 Oscar Salvador
  2018-11-27 16:20 ` [PATCH v2 1/5] mm, memory_hotplug: Add nid parameter to arch_remove_memory Oscar Salvador
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Oscar Salvador @ 2018-11-27 16:20 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, dan.j.williams, pavel.tatashin, jglisse,
	Jonathan.Cameron, rafael, david, linux-mm, Oscar Salvador

From: Oscar Salvador <osalvador@suse.com>

This patchset is based on Dan's HMM/devm refactoring work [1].

----

This patchset aims for two things:

 1) A better separation between the offline and hot-remove stages
 2) Fixing bugs where we can access uninitialized pages
    during hot-remove operations [2] [3].

This is achieved by moving all page/zone handling to the offline
stage, so we do not need to access pages when hot-removing memory.

[1] https://patchwork.kernel.org/cover/10691415/
[2] https://patchwork.kernel.org/patch/10547445/
[3] https://www.spinics.net/lists/linux-mm/msg161316.html

Oscar Salvador (5):
  mm, memory_hotplug: Add nid parameter to arch_remove_memory
  kernel, resource: Check for IORESOURCE_SYSRAM in
    release_mem_region_adjustable
  mm, memory_hotplug: Move zone/pages handling to offline stage
  mm, memory-hotplug: Rework unregister_mem_sect_under_nodes
  mm, memory_hotplug: Refactor shrink_zone/pgdat_span

 arch/ia64/mm/init.c            |   2 +-
 arch/powerpc/mm/mem.c          |  14 +--
 arch/s390/mm/init.c            |   2 +-
 arch/sh/mm/init.c              |   6 +-
 arch/x86/mm/init_32.c          |   5 +-
 arch/x86/mm/init_64.c          |  11 +-
 drivers/base/memory.c          |   9 +-
 drivers/base/node.c            |  39 +------
 include/linux/memory.h         |   2 +-
 include/linux/memory_hotplug.h |  12 +-
 include/linux/node.h           |   9 +-
 kernel/memremap.c              |  19 ++-
 kernel/resource.c              |  15 +++
 mm/memory_hotplug.c            | 254 +++++++++++++++++++----------------------
 mm/sparse.c                    |   4 +-
 15 files changed, 182 insertions(+), 221 deletions(-)

-- 
2.13.6

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2 1/5] mm, memory_hotplug: Add nid parameter to arch_remove_memory
  2018-11-27 16:20 [PATCH v2 0/5] Do not touch pages in hot-remove path Oscar Salvador
@ 2018-11-27 16:20 ` Oscar Salvador
  2018-11-27 16:20 ` [PATCH v2 2/5] kernel, resource: Check for IORESOURCE_SYSRAM in release_mem_region_adjustable Oscar Salvador
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2018-11-27 16:20 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, dan.j.williams, pavel.tatashin, jglisse,
	Jonathan.Cameron, rafael, david, linux-mm, Oscar Salvador

From: Oscar Salvador <osalvador@suse.com>

This patch is only a preparation for the follow-up patches.
The idea of passing the nid is that it will allow us to get rid
of the zone parameter afterwards.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 arch/ia64/mm/init.c            | 2 +-
 arch/powerpc/mm/mem.c          | 3 ++-
 arch/s390/mm/init.c            | 2 +-
 arch/sh/mm/init.c              | 2 +-
 arch/x86/mm/init_32.c          | 2 +-
 arch/x86/mm/init_64.c          | 3 ++-
 include/linux/memory_hotplug.h | 4 ++--
 kernel/memremap.c              | 5 ++++-
 mm/memory_hotplug.c            | 2 +-
 9 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index d5e12ff1d73c..904fe55e10fc 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -661,7 +661,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 0a64fffabee1..40feb262080e 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -139,7 +139,8 @@ int __meminit arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int __meminit arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+int __meminit arch_remove_memory(int nid, u64 start, u64 size,
+					struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 50388190b393..3e82f66d5c61 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -242,7 +242,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	/*
 	 * There is no hardware or firmware interface which could trigger a
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index c8c13c777162..a8e5c0e00fca 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -443,7 +443,7 @@ EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 49ecf5ecf6d3..85c94f9a87f8 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -860,7 +860,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5fab264948c2..449958da97a4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1147,7 +1147,8 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
 	remove_pagetable(start, end, true, NULL);
 }
 
-int __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
+int __ref arch_remove_memory(int nid, u64 start, u64 size,
+				struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 84e9ae205930..3aedcd7929cd 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -107,8 +107,8 @@ static inline bool movable_node_is_enabled(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-extern int arch_remove_memory(u64 start, u64 size,
-		struct vmem_altmap *altmap);
+extern int arch_remove_memory(int nid, u64 start, u64 size,
+				struct vmem_altmap *altmap);
 extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 	unsigned long nr_pages, struct vmem_altmap *altmap);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 3eef989ef035..0d5603d76c37 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -87,6 +87,7 @@ static void devm_memremap_pages_release(void *data)
 	struct resource *res = &pgmap->res;
 	resource_size_t align_start, align_size;
 	unsigned long pfn;
+	int nid;
 
 	pgmap->kill(pgmap->ref);
 	for_each_device_pfn(pfn, pgmap)
@@ -97,13 +98,15 @@ static void devm_memremap_pages_release(void *data)
 	align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
 		- align_start;
 
+	nid = page_to_nid(pfn_to_page(align_start >> PAGE_SHIFT));
+
 	mem_hotplug_begin();
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
 		pfn = align_start >> PAGE_SHIFT;
 		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
 				align_size >> PAGE_SHIFT, NULL);
 	} else {
-		arch_remove_memory(align_start, align_size,
+		arch_remove_memory(nid, align_start, align_size,
 				pgmap->altmap_valid ? &pgmap->altmap : NULL);
 		kasan_remove_zero_shadow(__va(align_start), align_size);
 	}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7b4317ae8318..849bcc55c5f1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1858,7 +1858,7 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
 	memblock_free(start, size);
 	memblock_remove(start, size);
 
-	arch_remove_memory(start, size, NULL);
+	arch_remove_memory(nid, start, size, NULL);
 
 	try_offline_node(nid);
 
-- 
2.13.6


* [PATCH v2 2/5] kernel, resource: Check for IORESOURCE_SYSRAM in release_mem_region_adjustable
  2018-11-27 16:20 [PATCH v2 0/5] Do not touch pages in hot-remove path Oscar Salvador
  2018-11-27 16:20 ` [PATCH v2 1/5] mm, memory_hotplug: Add nid parameter to arch_remove_memory Oscar Salvador
@ 2018-11-27 16:20 ` Oscar Salvador
  2018-11-27 16:20 ` [PATCH v2 3/5] mm, memory_hotplug: Move zone/pages handling to offline stage Oscar Salvador
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2018-11-27 16:20 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, dan.j.williams, pavel.tatashin, jglisse,
	Jonathan.Cameron, rafael, david, linux-mm, Oscar Salvador

This is a preparation for the next patch.

Currently, we only call release_mem_region_adjustable() in __remove_pages()
if the zone is not ZONE_DEVICE, because resources that belong to
HMM/devm are released by devm_release_mem_region instead.

Since we do not want to touch any zone/page state during the removal
of the memory (only during the offlining), we do not want to check for
the zone here.
So we need another way to tell release_mem_region_adjustable() not to
release the resource in case it belongs to HMM/devm.

HMM/devm acquires/releases a resource through
devm_request_mem_region/devm_release_mem_region.

These resources have the flag IORESOURCE_MEM, while resources acquired by
hot-add memory path (register_memory_resource()) contain
IORESOURCE_SYSTEM_RAM.

So, we can check for this flag in release_mem_region_adjustable(): if
the resource does not contain it, we know that we are dealing with
an HMM/devm resource, and we can back off.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 kernel/resource.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/resource.c b/kernel/resource.c
index 0aaa95005a2e..e20cea416f1f 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1266,6 +1266,21 @@ int release_mem_region_adjustable(struct resource *parent,
 			continue;
 		}
 
+		/*
+		 * All memory regions added from memory-hotplug path have the
+		 * flag IORESOURCE_SYSTEM_RAM. If the resource does not have
+		 * this flag, we know that we are dealing with a resource coming
+		 * from HMM/devm. HMM/devm use another mechanism to add/release
+		 * a resource. This goes via devm_request_mem_region and
+		 * devm_release_mem_region.
+		 * HMM/devm take care to release their resources when they want,
+		 * so if we are dealing with them, let us just back off here.
+		 */
+		if (!(res->flags & IORESOURCE_SYSRAM)) {
+			ret = 0;
+			break;
+		}
+
 		if (!(res->flags & IORESOURCE_MEM))
 			break;
 
-- 
2.13.6


* [PATCH v2 3/5] mm, memory_hotplug: Move zone/pages handling to offline stage
  2018-11-27 16:20 [PATCH v2 0/5] Do not touch pages in hot-remove path Oscar Salvador
  2018-11-27 16:20 ` [PATCH v2 1/5] mm, memory_hotplug: Add nid parameter to arch_remove_memory Oscar Salvador
  2018-11-27 16:20 ` [PATCH v2 2/5] kernel, resource: Check for IORESOURCE_SYSRAM in release_mem_region_adjustable Oscar Salvador
@ 2018-11-27 16:20 ` Oscar Salvador
  2018-11-28  7:52   ` Mike Rapoport
  2018-11-28 14:15   ` osalvador
  2018-11-27 16:20 ` [PATCH v2 4/5] mm, memory-hotplug: Rework unregister_mem_sect_under_nodes Oscar Salvador
  2018-11-27 16:20 ` [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span Oscar Salvador
  4 siblings, 2 replies; 26+ messages in thread
From: Oscar Salvador @ 2018-11-27 16:20 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, dan.j.williams, pavel.tatashin, jglisse,
	Jonathan.Cameron, rafael, david, linux-mm, Oscar Salvador

From: Oscar Salvador <osalvador@suse.com>

The current implementation accesses pages during the hot-remove
stage in order to get the zone this memory range is linked to.
We use that zone a) to check whether it is ZONE_DEVICE and
b) to shrink the zone's spanned pages.

Accessing pages during this stage is problematic, as we might be
accessing pages that were not initialized if we did not get to
online the memory before removing it.

The only reason to check for ZONE_DEVICE in __remove_pages
is to bypass the call to release_mem_region_adjustable(),
since these regions are removed with devm_release_mem_region.

With patch#2, this is no longer a problem so we can safely
call release_mem_region_adjustable().
release_mem_region_adjustable() will spot that the region
we are trying to remove was acquired by means of
devm_request_mem_region, and will back off safely.

This allows us to remove all zone-related operations from
hot-remove stage.

Because of this, the zone's spanned pages are now shrunk during
the offlining stage, in shrink_zone().
It would have been great to also decrease the node's spanned pages
there, but we need them in try_offline_node().
So we still decrease the node's spanned pages in the hot-remove
stage.

The only particularity is that
find_smallest_section_pfn()/find_biggest_section_pfn(), when called
from shrink_zone_span(), now check for online sections instead of
valid sections.
To make this work with the devm/HMM code, that path now calls
online_mem_sections() when adding memory and offline_mem_sections()
when releasing it.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/powerpc/mm/mem.c          | 11 +----
 arch/sh/mm/init.c              |  4 +-
 arch/x86/mm/init_32.c          |  3 +-
 arch/x86/mm/init_64.c          |  8 +---
 include/linux/memory_hotplug.h |  8 ++--
 kernel/memremap.c              | 14 +++++--
 mm/memory_hotplug.c            | 95 ++++++++++++++++++++++++------------------
 mm/sparse.c                    |  4 +-
 8 files changed, 76 insertions(+), 71 deletions(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 40feb262080e..b3c9ee5c4f78 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -144,18 +144,9 @@ int __meminit arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct page *page;
 	int ret;
 
-	/*
-	 * If we have an altmap then we need to skip over any reserved PFNs
-	 * when querying the zone.
-	 */
-	page = pfn_to_page(start_pfn);
-	if (altmap)
-		page += vmem_altmap_offset(altmap);
-
-	ret = __remove_pages(page_zone(page), start_pfn, nr_pages, altmap);
+	ret = remove_sections(nid, start_pfn, nr_pages, altmap);
 	if (ret)
 		return ret;
 
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index a8e5c0e00fca..1a483a008872 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -447,11 +447,9 @@ int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 	int ret;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	ret = __remove_pages(zone, start_pfn, nr_pages, altmap);
+	ret = remove_sections(nid, start_pfn, nr_pages, altmap);
 	if (unlikely(ret))
 		pr_warn("%s: Failed, __remove_pages() == %d\n", __func__,
 			ret);
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 85c94f9a87f8..0b8c7b0033d2 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -866,8 +866,7 @@ int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	return __remove_pages(zone, start_pfn, nr_pages, altmap);
+	return remove_sections(nid, start_pfn, nr_pages, altmap);
 }
 #endif
 #endif
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 449958da97a4..f80d98381a97 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1152,15 +1152,9 @@ int __ref arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct page *page = pfn_to_page(start_pfn);
-	struct zone *zone;
 	int ret;
 
-	/* With altmap the first mapped page is offset from @start */
-	if (altmap)
-		page += vmem_altmap_offset(altmap);
-	zone = page_zone(page);
-	ret = __remove_pages(zone, start_pfn, nr_pages, altmap);
+	ret = remove_sections(nid, start_pfn, nr_pages, altmap);
 	WARN_ON_ONCE(ret);
 	kernel_physical_mapping_remove(start, start + size);
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 3aedcd7929cd..0a882d8e32c6 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -109,8 +109,10 @@ static inline bool movable_node_is_enabled(void)
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern int arch_remove_memory(int nid, u64 start, u64 size,
 				struct vmem_altmap *altmap);
-extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
-	unsigned long nr_pages, struct vmem_altmap *altmap);
+extern int remove_sections(int nid, unsigned long start_pfn,
+			unsigned long nr_pages, struct vmem_altmap *altmap);
+extern void shrink_zone(struct zone *zone, unsigned long start_pfn,
+			unsigned long end_pfn, unsigned long offlined_pages);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 /* reasonably generic interface to expand the physical pages */
@@ -335,7 +337,7 @@ extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern int sparse_add_one_section(struct pglist_data *pgdat,
 		unsigned long start_pfn, struct vmem_altmap *altmap);
-extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
+extern void sparse_remove_one_section(int nid, struct mem_section *ms,
 		unsigned long map_offset, struct vmem_altmap *altmap);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
 					  unsigned long pnum);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 0d5603d76c37..66cbf334203b 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -87,6 +87,7 @@ static void devm_memremap_pages_release(void *data)
 	struct resource *res = &pgmap->res;
 	resource_size_t align_start, align_size;
 	unsigned long pfn;
+	unsigned long nr_pages;
 	int nid;
 
 	pgmap->kill(pgmap->ref);
@@ -101,10 +102,14 @@ static void devm_memremap_pages_release(void *data)
 	nid = page_to_nid(pfn_to_page(align_start >> PAGE_SHIFT));
 
 	mem_hotplug_begin();
+
+	pfn = align_start >> PAGE_SHIFT;
+	nr_pages = align_size >> PAGE_SHIFT;
+	offline_mem_sections(pfn, pfn + nr_pages);
+	shrink_zone(page_zone(pfn_to_page(pfn)), pfn, pfn + nr_pages, nr_pages);
+
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-		pfn = align_start >> PAGE_SHIFT;
-		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
-				align_size >> PAGE_SHIFT, NULL);
+		remove_sections(nid, pfn, nr_pages, NULL);
 	} else {
 		arch_remove_memory(nid, align_start, align_size,
 				pgmap->altmap_valid ? &pgmap->altmap : NULL);
@@ -224,7 +229,10 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 
 	if (!error) {
 		struct zone *zone;
+		unsigned long pfn = align_start >> PAGE_SHIFT;
+		unsigned long nr_pages = align_size >> PAGE_SHIFT;
 
+		online_mem_sections(pfn, pfn + nr_pages);
 		zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
 		move_pfn_range_to_zone(zone, align_start >> PAGE_SHIFT,
 				align_size >> PAGE_SHIFT, altmap);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 849bcc55c5f1..4fe42ccb0be4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -314,6 +314,17 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
+static bool is_section_ok(struct mem_section *ms, bool zone)
+{
+	/*
+	 * We cannot shrink pgdat's spanned pages here because we use them
+	 * in try_offline_node() to check whether all sections were removed.
+	 */
+	if (zone)
+		return online_section(ms);
+	else
+		return valid_section(ms);
+}
 /* find the smallest valid pfn in the range [start_pfn, end_pfn) */
 static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
 				     unsigned long start_pfn,
@@ -324,7 +335,7 @@ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
 	for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
 		ms = __pfn_to_section(start_pfn);
 
-		if (unlikely(!valid_section(ms)))
+		if (!is_section_ok(ms, !!zone))
 			continue;
 
 		if (unlikely(pfn_to_nid(start_pfn) != nid))
@@ -352,7 +363,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
 	for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
 		ms = __pfn_to_section(pfn);
 
-		if (unlikely(!valid_section(ms)))
+		if (!is_section_ok(ms, !!zone))
 			continue;
 
 		if (unlikely(pfn_to_nid(pfn) != nid))
@@ -414,7 +425,7 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
 		ms = __pfn_to_section(pfn);
 
-		if (unlikely(!valid_section(ms)))
+		if (unlikely(!online_section(ms)))
 			continue;
 
 		if (page_zone(pfn_to_page(pfn)) != zone)
@@ -501,23 +512,33 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
 	pgdat->node_spanned_pages = 0;
 }
 
-static void __remove_zone(struct zone *zone, unsigned long start_pfn)
+void shrink_zone(struct zone *zone, unsigned long start_pfn,
+		unsigned long end_pfn, unsigned long offlined_pages)
 {
-	struct pglist_data *pgdat = zone->zone_pgdat;
 	int nr_pages = PAGES_PER_SECTION;
+	unsigned long pfn;
+
+	clear_zone_contiguous(zone);
+	for (pfn = start_pfn; pfn < end_pfn; pfn += nr_pages)
+		shrink_zone_span(zone, pfn, pfn + nr_pages);
+	set_zone_contiguous(zone);
+}
+
+static void shrink_pgdat(int nid, unsigned long sect_nr)
+{
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	int nr_pages = PAGES_PER_SECTION;
+	unsigned long pfn = section_nr_to_pfn((unsigned long)sect_nr);
 	unsigned long flags;
 
-	pgdat_resize_lock(zone->zone_pgdat, &flags);
-	shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
-	shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
-	pgdat_resize_unlock(zone->zone_pgdat, &flags);
+	pgdat_resize_lock(pgdat, &flags);
+	shrink_pgdat_span(pgdat, pfn, pfn + nr_pages);
+	pgdat_resize_unlock(pgdat, &flags);
 }
 
-static int __remove_section(struct zone *zone, struct mem_section *ms,
+static int __remove_section(int nid, struct mem_section *ms,
 		unsigned long map_offset, struct vmem_altmap *altmap)
 {
-	unsigned long start_pfn;
-	int scn_nr;
 	int ret = -EINVAL;
 
 	if (!valid_section(ms))
@@ -527,17 +548,15 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
 	if (ret)
 		return ret;
 
-	scn_nr = __section_nr(ms);
-	start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
-	__remove_zone(zone, start_pfn);
+	shrink_pgdat(nid, __section_nr(ms));
 
-	sparse_remove_one_section(zone, ms, map_offset, altmap);
+	sparse_remove_one_section(nid, ms, map_offset, altmap);
 	return 0;
 }
 
 /**
- * __remove_pages() - remove sections of pages from a zone
- * @zone: zone from which pages need to be removed
+ * remove_sections() - remove sections of pages from a node
+ * @nid: node from which the pages are removed
  * @phys_start_pfn: starting pageframe (must be aligned to start of a section)
  * @nr_pages: number of pages to remove (must be multiple of section size)
  * @altmap: alternative device page map or %NULL if default memmap is used
@@ -547,35 +566,28 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
  * sure that pages are marked reserved and zones are adjust properly by
  * calling offline_pages().
  */
-int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
+int remove_sections(int nid, unsigned long phys_start_pfn,
 		 unsigned long nr_pages, struct vmem_altmap *altmap)
 {
 	unsigned long i;
 	unsigned long map_offset = 0;
 	int sections_to_remove, ret = 0;
+	resource_size_t start, size;
 
-	/* In the ZONE_DEVICE case device driver owns the memory region */
-	if (is_dev_zone(zone)) {
-		if (altmap)
-			map_offset = vmem_altmap_offset(altmap);
-	} else {
-		resource_size_t start, size;
-
-		start = phys_start_pfn << PAGE_SHIFT;
-		size = nr_pages * PAGE_SIZE;
+	start = phys_start_pfn << PAGE_SHIFT;
+	size = nr_pages * PAGE_SIZE;
 
-		ret = release_mem_region_adjustable(&iomem_resource, start,
-					size);
-		if (ret) {
-			resource_size_t endres = start + size - 1;
+	if (altmap)
+		map_offset = vmem_altmap_offset(altmap);
 
-			pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
-					&start, &endres, ret);
-		}
+	ret = release_mem_region_adjustable(&iomem_resource, start,
+								size);
+	if (ret) {
+		resource_size_t endres = start + size - 1;
+		pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
+						&start, &endres, ret);
 	}
 
-	clear_zone_contiguous(zone);
-
 	/*
 	 * We can only remove entire sections
 	 */
@@ -587,15 +599,13 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
 		unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;
 
 		cond_resched();
-		ret = __remove_section(zone, __pfn_to_section(pfn), map_offset,
-				altmap);
+		ret = __remove_section(nid, __pfn_to_section(pfn), map_offset,
+									altmap);
 		map_offset = 0;
 		if (ret)
 			break;
 	}
 
-	set_zone_contiguous(zone);
-
 	return ret;
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
@@ -1652,11 +1662,14 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	/* reset pagetype flags and makes migrate type to be MOVABLE */
 	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
 	/* removal success */
+
+	/* Shrink zone's managed,spanned and zone/pgdat's present pages */
 	adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
 	zone->present_pages -= offlined_pages;
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
 	zone->zone_pgdat->node_present_pages -= offlined_pages;
+	shrink_zone(zone, valid_start, valid_end, offlined_pages);
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
 	init_per_zone_wmark_min();
diff --git a/mm/sparse.c b/mm/sparse.c
index 691544a2814c..01aa42102f8b 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -790,12 +790,12 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap,
 		free_map_bootmem(memmap);
 }
 
-void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
+void sparse_remove_one_section(int nid, struct mem_section *ms,
 		unsigned long map_offset, struct vmem_altmap *altmap)
 {
 	struct page *memmap = NULL;
 	unsigned long *usemap = NULL, flags;
-	struct pglist_data *pgdat = zone->zone_pgdat;
+	struct pglist_data *pgdat = NODE_DATA(nid);
 
 	pgdat_resize_lock(pgdat, &flags);
 	if (ms->section_mem_map) {
-- 
2.13.6


* [PATCH v2 4/5] mm, memory-hotplug: Rework unregister_mem_sect_under_nodes
  2018-11-27 16:20 [PATCH v2 0/5] Do not touch pages in hot-remove path Oscar Salvador
                   ` (2 preceding siblings ...)
  2018-11-27 16:20 ` [PATCH v2 3/5] mm, memory_hotplug: Move zone/pages handling to offline stage Oscar Salvador
@ 2018-11-27 16:20 ` Oscar Salvador
  2019-03-24  6:48   ` Anshuman Khandual
  2018-11-27 16:20 ` [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span Oscar Salvador
  4 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2018-11-27 16:20 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, dan.j.williams, pavel.tatashin, jglisse,
	Jonathan.Cameron, rafael, david, linux-mm, Oscar Salvador

From: Oscar Salvador <osalvador@suse.com>

This tries to address another issue related to accessing
uninitialized pages.

Jonathan reported a problem [1] where we can access stale pages
when we hot-remove memory without onlining it first.

This time it is in unregister_mem_sect_under_nodes().
This function tries to get the nid from the pfn and then
tries to remove the symlink between mem_blk <-> nid and vice versa.

Since we already know the nid in remove_memory(), we can pass
it down the chain to unregister_mem_sect_under_nodes.
There we can just remove the symlinks without the need
to look into the pages.

This also allows us to clean up unregister_mem_sect_under_nodes.

[1] https://www.spinics.net/lists/linux-mm/msg161316.html

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/base/memory.c  |  9 ++++-----
 drivers/base/node.c    | 39 ++++++---------------------------------
 include/linux/memory.h |  2 +-
 include/linux/node.h   |  9 ++++-----
 mm/memory_hotplug.c    |  2 +-
 5 files changed, 16 insertions(+), 45 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 0e5985682642..3d8c65d84bea 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -744,8 +744,7 @@ unregister_memory(struct memory_block *memory)
 	device_unregister(&memory->dev);
 }
 
-static int remove_memory_section(unsigned long node_id,
-			       struct mem_section *section, int phys_device)
+static int remove_memory_section(unsigned long nid, struct mem_section *section)
 {
 	struct memory_block *mem;
 
@@ -759,7 +758,7 @@ static int remove_memory_section(unsigned long node_id,
 	if (!mem)
 		goto out_unlock;
 
-	unregister_mem_sect_under_nodes(mem, __section_nr(section));
+	unregister_mem_sect_under_nodes(nid, mem);
 
 	mem->section_count--;
 	if (mem->section_count == 0)
@@ -772,12 +771,12 @@ static int remove_memory_section(unsigned long node_id,
 	return 0;
 }
 
-int unregister_memory_section(struct mem_section *section)
+int unregister_memory_section(int nid, struct mem_section *section)
 {
 	if (!present_section(section))
 		return -EINVAL;
 
-	return remove_memory_section(0, section, 0);
+	return remove_memory_section(nid, section);
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 86d6cd92ce3d..0858f7f3c7cd 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -453,40 +453,13 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, void *arg)
 	return 0;
 }
 
-/* unregister memory section under all nodes that it spans */
-int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
-				    unsigned long phys_index)
+/* Remove symlink between node <-> mem_blk */
+void unregister_mem_sect_under_nodes(int nid, struct memory_block *mem_blk)
 {
-	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
-	unsigned long pfn, sect_start_pfn, sect_end_pfn;
-
-	if (!mem_blk) {
-		NODEMASK_FREE(unlinked_nodes);
-		return -EFAULT;
-	}
-	if (!unlinked_nodes)
-		return -ENOMEM;
-	nodes_clear(*unlinked_nodes);
-
-	sect_start_pfn = section_nr_to_pfn(phys_index);
-	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
-	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
-		int nid;
-
-		nid = get_nid_for_pfn(pfn);
-		if (nid < 0)
-			continue;
-		if (!node_online(nid))
-			continue;
-		if (node_test_and_set(nid, *unlinked_nodes))
-			continue;
-		sysfs_remove_link(&node_devices[nid]->dev.kobj,
-			 kobject_name(&mem_blk->dev.kobj));
-		sysfs_remove_link(&mem_blk->dev.kobj,
-			 kobject_name(&node_devices[nid]->dev.kobj));
-	}
-	NODEMASK_FREE(unlinked_nodes);
-	return 0;
+	sysfs_remove_link(&node_devices[nid]->dev.kobj,
+			kobject_name(&mem_blk->dev.kobj));
+	sysfs_remove_link(&mem_blk->dev.kobj,
+			kobject_name(&node_devices[nid]->dev.kobj));
 }
 
 int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn)
diff --git a/include/linux/memory.h b/include/linux/memory.h
index a6ddefc60517..d75ec88ca09d 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -113,7 +113,7 @@ extern int register_memory_isolate_notifier(struct notifier_block *nb);
 extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
 int hotplug_memory_register(int nid, struct mem_section *section);
 #ifdef CONFIG_MEMORY_HOTREMOVE
-extern int unregister_memory_section(struct mem_section *);
+extern int unregister_memory_section(int nid, struct mem_section *);
 #endif
 extern int memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);
diff --git a/include/linux/node.h b/include/linux/node.h
index 257bb3d6d014..488c1333bb06 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -72,8 +72,8 @@ extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 						void *arg);
-extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
-					   unsigned long phys_index);
+extern void unregister_mem_sect_under_nodes(int nid,
+			struct memory_block *mem_blk);
 
 #ifdef CONFIG_HUGETLBFS
 extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
@@ -105,10 +105,9 @@ static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
 {
 	return 0;
 }
-static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
-						  unsigned long phys_index)
+static inline void unregister_mem_sect_under_nodes(int nid,
+				struct memory_block *mem_blk)
 {
-	return 0;
 }
 
 static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4fe42ccb0be4..49b91907e19e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -544,7 +544,7 @@ static int __remove_section(int nid, struct mem_section *ms,
 	if (!valid_section(ms))
 		return ret;
 
-	ret = unregister_memory_section(ms);
+	ret = unregister_memory_section(nid, ms);
 	if (ret)
 		return ret;
 
-- 
2.13.6

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span
  2018-11-27 16:20 [PATCH v2 0/5] Do not touch pages in hot-remove path Oscar Salvador
                   ` (3 preceding siblings ...)
  2018-11-27 16:20 ` [PATCH v2 4/5] mm, memory-hotplug: Rework unregister_mem_sect_under_nodes Oscar Salvador
@ 2018-11-27 16:20 ` Oscar Salvador
  2018-11-28  6:50   ` Michal Hocko
  4 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2018-11-27 16:20 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, dan.j.williams, pavel.tatashin, jglisse,
	Jonathan.Cameron, rafael, david, linux-mm, Oscar Salvador,
	Oscar Salvador

From: Oscar Salvador <osalvador@suse.com>

shrink_zone_span and shrink_pgdat_span look a bit weird.

They both have a loop at the end to check whether the zone
or pgdat contains only holes, in case the section to be removed
was neither the first one nor the last one.

Both loops look quite similar, so we can simplify them a bit.
We do that by creating a function (has_only_holes), which basically
calls find_smallest_section_pfn() with the full range.
If nothing is found, there are no sections left there.

To be honest, I am not really sure we even need to go through this
check in case we are removing a middle section, because from what I can see,
we will always have a first/last section.

While at it, we could also simplify both find_smallest_section_pfn()
and find_biggest_section_pfn() functions and move the common code
to a helper.

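The refactor can be pictured with a small userspace sketch (all names, sizes and
the validity bitmap here are stand-ins, not the kernel's real mem_section
machinery): the repeated per-pfn checks collapse into one predicate, and
has_only_holes() is just a scan over the full range:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace sketch of the consolidation described above -- not the
 * kernel code itself. PAGES_PER_SECTION, the validity bitmap and the
 * pfn ranges are stand-ins for the real mem_section machinery. */

#define PAGES_PER_SECTION 32768UL
#define NR_SECTIONS 8

static bool section_valid[NR_SECTIONS];

/* One predicate replaces the repeated per-pfn checks in both scan loops. */
static bool is_valid_section(unsigned long pfn)
{
	return section_valid[pfn / PAGES_PER_SECTION];
}

/* Find the smallest valid pfn in [start_pfn, end_pfn); 0 means "none". */
static unsigned long find_smallest_section_pfn(unsigned long start_pfn,
					       unsigned long end_pfn)
{
	for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION)
		if (is_valid_section(start_pfn))
			return start_pfn;
	return 0;
}

/* "Only holes" is simply "a scan over the full range finds nothing". */
static bool has_only_holes(unsigned long start_pfn, unsigned long end_pfn)
{
	return !find_smallest_section_pfn(start_pfn, end_pfn);
}
```

Note that, as in the kernel code being refactored, a return value of 0 doubles
as "not found".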
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory_hotplug.c | 163 +++++++++++++++++++++-------------------------------
 1 file changed, 64 insertions(+), 99 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 49b91907e19e..0e3f89423153 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -325,28 +325,29 @@ static bool is_section_ok(struct mem_section *ms, bool zone)
 	else
 		return valid_section(ms);
 }
+
+static bool is_valid_section(struct zone *zone, int nid, unsigned long pfn)
+{
+	struct mem_section *ms = __pfn_to_section(pfn);
+
+	if (unlikely(!is_section_ok(ms, !!zone)))
+		return false;
+	if (unlikely(pfn_to_nid(pfn) != nid))
+		return false;
+	if (zone && zone != page_zone(pfn_to_page(pfn)))
+		return false;
+
+	return true;
+}
+
 /* find the smallest valid pfn in the range [start_pfn, end_pfn) */
 static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
 				     unsigned long start_pfn,
 				     unsigned long end_pfn)
 {
-	struct mem_section *ms;
-
-	for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
-		ms = __pfn_to_section(start_pfn);
-
-		if (!is_section_ok(ms, !!zone))
-			continue;
-
-		if (unlikely(pfn_to_nid(start_pfn) != nid))
-			continue;
-
-		if (zone && zone != page_zone(pfn_to_page(start_pfn)))
-			continue;
-
-		return start_pfn;
-	}
-
+	for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION)
+		if (is_valid_section(zone, nid, start_pfn))
+			return start_pfn;
 	return 0;
 }
 
@@ -355,29 +356,26 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
 				    unsigned long start_pfn,
 				    unsigned long end_pfn)
 {
-	struct mem_section *ms;
 	unsigned long pfn;
 
 	/* pfn is the end pfn of a memory section. */
 	pfn = end_pfn - 1;
-	for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
-		ms = __pfn_to_section(pfn);
-
-		if (!is_section_ok(ms, !!zone))
-			continue;
-
-		if (unlikely(pfn_to_nid(pfn) != nid))
-			continue;
-
-		if (zone && zone != page_zone(pfn_to_page(pfn)))
-			continue;
-
-		return pfn;
-	}
-
+	for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION)
+		if (is_valid_section(zone, nid, pfn))
+			return pfn;
 	return 0;
 }
 
+static bool has_only_holes(int nid, struct zone *zone,
+				unsigned long start_pfn,
+				unsigned long end_pfn)
+{
+	/*
+	 * Let us check if the range has only holes
+	 */
+	return !find_smallest_section_pfn(nid, zone, start_pfn, end_pfn);
+}
+
 static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 			     unsigned long end_pfn)
 {
@@ -385,7 +383,6 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 	unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
 	unsigned long zone_end_pfn = z;
 	unsigned long pfn;
-	struct mem_section *ms;
 	int nid = zone_to_nid(zone);
 
 	zone_span_writelock(zone);
@@ -397,11 +394,12 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 		 * for shrinking zone.
 		 */
 		pfn = find_smallest_section_pfn(nid, zone, end_pfn,
-						zone_end_pfn);
-		if (pfn) {
-			zone->zone_start_pfn = pfn;
-			zone->spanned_pages = zone_end_pfn - pfn;
-		}
+							zone_end_pfn);
+		if (!pfn)
+			goto no_sections;
+
+		zone->zone_start_pfn = pfn;
+		zone->spanned_pages = zone_end_pfn - pfn;
 	} else if (zone_end_pfn == end_pfn) {
 		/*
 		 * If the section is biggest section in the zone, it need
@@ -410,39 +408,23 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 		 * shrinking zone.
 		 */
 		pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
-					       start_pfn);
-		if (pfn)
-			zone->spanned_pages = pfn - zone_start_pfn + 1;
-	}
-
-	/*
-	 * The section is not biggest or smallest mem_section in the zone, it
-	 * only creates a hole in the zone. So in this case, we need not
-	 * change the zone. But perhaps, the zone has only hole data. Thus
-	 * it check the zone has only hole or not.
-	 */
-	pfn = zone_start_pfn;
-	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
-		ms = __pfn_to_section(pfn);
-
-		if (unlikely(!online_section(ms)))
-			continue;
+								start_pfn);
+		if (!pfn)
+			goto no_sections;
 
-		if (page_zone(pfn_to_page(pfn)) != zone)
-			continue;
-
-		 /* If the section is current section, it continues the loop */
-		if (start_pfn == pfn)
-			continue;
-
-		/* If we find valid section, we have nothing to do */
-		zone_span_writeunlock(zone);
-		return;
+		zone->spanned_pages = pfn - zone_start_pfn + 1;
+	} else {
+		if (has_only_holes(nid, zone, zone_start_pfn, zone_end_pfn))
+			goto no_sections;
 	}
 
+	goto out;
+
+no_sections:
 	/* The zone has no valid section */
 	zone->zone_start_pfn = 0;
 	zone->spanned_pages = 0;
+out:
 	zone_span_writeunlock(zone);
 }
 
@@ -453,7 +435,6 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
 	unsigned long p = pgdat_end_pfn(pgdat); /* pgdat_end_pfn namespace clash */
 	unsigned long pgdat_end_pfn = p;
 	unsigned long pfn;
-	struct mem_section *ms;
 	int nid = pgdat->node_id;
 
 	if (pgdat_start_pfn == start_pfn) {
@@ -465,10 +446,11 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
 		 */
 		pfn = find_smallest_section_pfn(nid, NULL, end_pfn,
 						pgdat_end_pfn);
-		if (pfn) {
-			pgdat->node_start_pfn = pfn;
-			pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
-		}
+		if (!pfn)
+			goto no_sections;
+
+		pgdat->node_start_pfn = pfn;
+		pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
 	} else if (pgdat_end_pfn == end_pfn) {
 		/*
 		 * If the section is biggest section in the pgdat, it need
@@ -478,35 +460,18 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
 		 */
 		pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn,
 					       start_pfn);
-		if (pfn)
-			pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
-	}
-
-	/*
-	 * If the section is not biggest or smallest mem_section in the pgdat,
-	 * it only creates a hole in the pgdat. So in this case, we need not
-	 * change the pgdat.
-	 * But perhaps, the pgdat has only hole data. Thus it check the pgdat
-	 * has only hole or not.
-	 */
-	pfn = pgdat_start_pfn;
-	for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SECTION) {
-		ms = __pfn_to_section(pfn);
-
-		if (unlikely(!valid_section(ms)))
-			continue;
-
-		if (pfn_to_nid(pfn) != nid)
-			continue;
+		if (!pfn)
+			goto no_sections;
 
-		 /* If the section is current section, it continues the loop */
-		if (start_pfn == pfn)
-			continue;
-
-		/* If we find valid section, we have nothing to do */
-		return;
+		pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
+	} else {
+		if (has_only_holes(nid, NULL, pgdat_start_pfn, pgdat_end_pfn))
+			goto no_sections;
 	}
 
+	return;
+
+no_sections:
 	/* The pgdat has no valid section */
 	pgdat->node_start_pfn = 0;
 	pgdat->node_spanned_pages = 0;
@@ -540,6 +505,7 @@ static int __remove_section(int nid, struct mem_section *ms,
 		unsigned long map_offset, struct vmem_altmap *altmap)
 {
 	int ret = -EINVAL;
+	int sect_nr = __section_nr(ms);
 
 	if (!valid_section(ms))
 		return ret;
@@ -548,9 +514,8 @@ static int __remove_section(int nid, struct mem_section *ms,
 	if (ret)
 		return ret;
 
-	shrink_pgdat(nid, __section_nr(ms));
-
 	sparse_remove_one_section(nid, ms, map_offset, altmap);
+	shrink_pgdat(nid, (unsigned long)sect_nr);
 	return 0;
 }
 
-- 
2.13.6

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span
  2018-11-27 16:20 ` [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span Oscar Salvador
@ 2018-11-28  6:50   ` Michal Hocko
  2018-11-28  7:07     ` Oscar Salvador
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2018-11-28  6:50 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, dan.j.williams, pavel.tatashin, jglisse, Jonathan.Cameron,
	rafael, david, linux-mm, Oscar Salvador

On Tue 27-11-18 17:20:05, Oscar Salvador wrote:
> From: Oscar Salvador <osalvador@suse.com>
> 
> shrink_zone_span and shrink_pgdat_span look a bit weird.
> 
> They both have a loop at the end to check whether the zone
> or pgdat contains only holes, in case the section to be removed
> was neither the first one nor the last one.
> 
> Both loops look quite similar, so we can simplify them a bit.
> We do that by creating a function (has_only_holes), which basically
> calls find_smallest_section_pfn() with the full range.
> If nothing is found, there are no sections left there.
> 
> To be honest, I am not really sure we even need to go through this
> check in case we are removing a middle section, because from what I can see,
> we will always have a first/last section.
> 
> While at it, we could also simplify both find_smallest_section_pfn()
> and find_biggest_section_pfn() functions and move the common code
> to a helper.

I didn't get to read through this whole series but one thing that is on
my todo list for a long time is to remove all this stuff. I do not think
we really want to simplify it when there shouldn't be any real reason to
have it around at all. Why do we need to shrink zone/node at all?

Now that we can override and assign memory to both normal and movable
zones I think we should be good to remove shrinking.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span
  2018-11-28  6:50   ` Michal Hocko
@ 2018-11-28  7:07     ` Oscar Salvador
  2018-11-28 10:03       ` David Hildenbrand
  2018-11-28 10:14       ` Michal Hocko
  0 siblings, 2 replies; 26+ messages in thread
From: Oscar Salvador @ 2018-11-28  7:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, dan.j.williams, pavel.tatashin, jglisse, Jonathan.Cameron,
	rafael, david, linux-mm

On Wed, 2018-11-28 at 07:50 +0100, Michal Hocko wrote:
> 
> I didn't get to read through this whole series but one thing that is
> on
> my todo list for a long time is to remove all this stuff. I do not
> think
> we really want to simplify it when there shouldn't be any real reason
> to
> have it around at all. Why do we need to shrink zone/node at all?
> 
> Now that we can override and assign memory to both normal and movable
> zones I think we should be good to remove shrinking.

I feel like I am missing a piece of obvious information here.
Right now, we shrink zone/node to decrease spanned pages.
I thought this was done for consistency, and in case of the node, in
try_offline_node we use the spanned pages to go through all sections
to check whether the node can be removed or not.

From your comment, I understand that we do not really care about
spanned pages. Why?
Could you please expand on that?

And if we remove it, would not this give to a user "bad"/confusing
information when looking at /proc/zoneinfo?


Thanks
-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/5] mm, memory_hotplug: Move zone/pages handling to offline stage
  2018-11-27 16:20 ` [PATCH v2 3/5] mm, memory_hotplug: Move zone/pages handling to offline stage Oscar Salvador
@ 2018-11-28  7:52   ` Mike Rapoport
  2018-11-28 14:25     ` osalvador
  2018-11-28 14:15   ` osalvador
  1 sibling, 1 reply; 26+ messages in thread
From: Mike Rapoport @ 2018-11-28  7:52 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, mhocko, dan.j.williams, pavel.tatashin, jglisse,
	Jonathan.Cameron, rafael, david, linux-mm, Oscar Salvador

On Tue, Nov 27, 2018 at 05:20:03PM +0100, Oscar Salvador wrote:
> From: Oscar Salvador <osalvador@suse.com>
> 
> The current implementation accesses pages during the hot-remove
> stage in order to get the zone linked to this memory range.
> We use that zone a) to check whether the zone is ZONE_DEVICE and
> b) to shrink the zone's spanned pages.
> 
> Accessing pages during this stage is problematic, as we might be
> accessing pages that were not initialized if we did not get to
> online the memory before removing it.
> 
> The only reason to check for ZONE_DEVICE in __remove_pages
> is to bypass the call to release_mem_region_adjustable(),
> since these regions are removed with devm_release_mem_region.
> 
> With patch#2, this is no longer a problem so we can safely
> call release_mem_region_adjustable().
> release_mem_region_adjustable() will spot that the region
> we are trying to remove was acquired by means of
> devm_request_mem_region, and will back off safely.
> 
> This allows us to remove all zone-related operations from
> hot-remove stage.
> 
> Because of this, the zone's spanned pages are shrunk during
> the offlining stage, in shrink_zone_pgdat().
> It would have been great to also decrease the spanned pages
> for the node there, but we need them in try_offline_node().
> So we still decrease spanned pages for the node in the hot-remove
> stage.
> 
> The only particularity is that
> find_smallest_section_pfn/find_biggest_section_pfn, when called from
> shrink_zone_span, will now check for online sections
> instead of valid sections.
> To make this work with devm/HMM code, we need to call offline_mem_sections
> and online_mem_sections in that code path when we are adding memory.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  arch/powerpc/mm/mem.c          | 11 +----
>  arch/sh/mm/init.c              |  4 +-
>  arch/x86/mm/init_32.c          |  3 +-
>  arch/x86/mm/init_64.c          |  8 +---
>  include/linux/memory_hotplug.h |  8 ++--
>  kernel/memremap.c              | 14 +++++--
>  mm/memory_hotplug.c            | 95 ++++++++++++++++++++++++------------------
>  mm/sparse.c                    |  4 +-
>  8 files changed, 76 insertions(+), 71 deletions(-)
 
[ ... ]
 
>  /**
> - * __remove_pages() - remove sections of pages from a zone
> - * @zone: zone from which pages need to be removed
> + * __remove_pages() - remove sections of pages from a nid
> + * @nid: nid from which pages belong to

Nit: the description sounds a bit awkward.
Why not keep the original one with s/zone/node/?

>   * @phys_start_pfn: starting pageframe (must be aligned to start of a section)
>   * @nr_pages: number of pages to remove (must be multiple of section size)
>   * @altmap: alternative device page map or %NULL if default memmap is used
> @@ -547,35 +566,28 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
>   * sure that pages are marked reserved and zones are adjust properly by
>   * calling offline_pages().
>   */

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span
  2018-11-28  7:07     ` Oscar Salvador
@ 2018-11-28 10:03       ` David Hildenbrand
  2018-11-28 10:14       ` Michal Hocko
  1 sibling, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2018-11-28 10:03 UTC (permalink / raw)
  To: Oscar Salvador, Michal Hocko
  Cc: akpm, dan.j.williams, pavel.tatashin, jglisse, Jonathan.Cameron,
	rafael, linux-mm

On 28.11.18 08:07, Oscar Salvador wrote:
> On Wed, 2018-11-28 at 07:50 +0100, Michal Hocko wrote:
>>
>> I didn't get to read through this whole series but one thing that is
>> on
>> my todo list for a long time is to remove all this stuff. I do not
>> think
>> we really want to simplify it when there shouldn't be any real reason
>> to
>> have it around at all. Why do we need to shrink zone/node at all?
>>
>> Now that we can override and assign memory to both normal and movable
>> zones I think we should be good to remove shrinking.
> 
> I feel like I am missing a piece of obvious information here.
> Right now, we shrink zone/node to decrease spanned pages.
> I thought this was done for consistency, and in case of the node, in
> try_offline_node we use the spanned pages to go through all sections
> to check whether the node can be removed or not.
> 

I am also not sure if that can be done. Anyhow, simplifying first and
getting rid of it later is, in my opinion, also good enough. One step at a time :)

> From your comment, I understand that we do not really care about
> spanned pages. Why?
> Could you please expand on that?
> 
> And if we remove it, would not this give to a user "bad"/confusing
> information when looking at /proc/zoneinfo?
> 
> 
> Thanks
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span
  2018-11-28  7:07     ` Oscar Salvador
  2018-11-28 10:03       ` David Hildenbrand
@ 2018-11-28 10:14       ` Michal Hocko
  2018-11-28 11:00         ` osalvador
  1 sibling, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2018-11-28 10:14 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, dan.j.williams, pavel.tatashin, jglisse, Jonathan.Cameron,
	rafael, david, linux-mm

On Wed 28-11-18 08:07:46, Oscar Salvador wrote:
> On Wed, 2018-11-28 at 07:50 +0100, Michal Hocko wrote:
> > 
> > I didn't get to read through this whole series but one thing that is
> > on
> > my todo list for a long time is to remove all this stuff. I do not
> > think
> > we really want to simplify it when there shouldn't be any real reason
> > to
> > have it around at all. Why do we need to shrink zone/node at all?
> > 
> > Now that we can override and assign memory to both normal and movable
> > zones I think we should be good to remove shrinking.
> 
> I feel like I am missing a piece of obvious information here.
> Right now, we shrink zone/node to decrease spanned pages.
> I thought this was done for consistency, and in case of the node, in
> try_offline_node we use the spanned pages to go through all sections
> to check whether the node can be removed or not.
> 
> From your comment, I understand that we do not really care about
> spanned pages. Why?
> Could you please expand on that?

OK, so what is the difference between memory hot-removing a range within
a zone and on the zone boundary? There should be none, yet spanned pages
do get updated only when we do the latter, IIRC? So spanned pages is not
really all that valuable information. It just tells us
zone_end - zone_start. Also, what are the semantics of
spanned_pages for interleaving zones?

> And if we remove it, would not this give to a user "bad"/confusing
> information when looking at /proc/zoneinfo?

Who does use spanned pages for anything really important? It is managed
pages that people do care about.

Maybe there is something that makes this harder than I anticipate but I
have a strong feeling that this all complication should simply go.
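For context, the distinction drawn here can be sketched in userspace
(illustrative code with made-up numbers, not real kernel state): spanned_pages
is only the pfn distance across the zone, holes included, while managed_pages
counts what the buddy allocator actually hands out:

```c
#include <assert.h>

/* Stand-in for the accounting fields of struct zone; the invariants,
 * not the field values, are the point here. */
struct zone_counters {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages; /* zone_end_pfn - zone_start_pfn, holes included */
	unsigned long present_pages; /* pages that physically exist */
	unsigned long managed_pages; /* present pages handed to the buddy allocator */
};

static unsigned long zone_end_pfn(const struct zone_counters *z)
{
	return z->zone_start_pfn + z->spanned_pages;
}
```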
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span
  2018-11-28 10:14       ` Michal Hocko
@ 2018-11-28 11:00         ` osalvador
  2018-11-28 12:31           ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: osalvador @ 2018-11-28 11:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, dan.j.williams, pavel.tatashin, jglisse, Jonathan.Cameron,
	rafael, david, linux-mm, owner-linux-mm


> OK, so what is the difference between memory hot-removing a range within
> a zone and on the zone boundary? There should be none, yet spanned pages
> do get updated only when we do the latter, IIRC? So spanned pages is not
> really all that valuable information. It just tells us
> zone_end - zone_start. Also, what are the semantics of
> spanned_pages for interleaving zones?

Ok, I think I start getting your point.
Yes, spanned_pages are only touched in case we remove the first or the
last section of a memory range.

So your point is to get rid of shrink_zone_span() and shrink_node_span(),
and not touch spanned_pages at all? (only when the zone is gone or the
node goes offline?)

The only thing I am worried about is that by doing that, the system
will account spanned_pages incorrectly.
So, if we remove pages on a zone boundary, neither zone_start_pfn nor
spanned_pages will change.
I did not check yet, but could it be that somewhere we use a zone/node's
spanned_pages information to compute something?

I mean, do not get me wrong, getting rid of all the shrink stuff would
be great, it would remove a __lot__ of code and some complexity, but I
am not sure it is totally safe.

>> And if we remove it, would not this give to a user "bad"/confusing
>> information when looking at /proc/zoneinfo?
> 
> Who does use spanned pages for anything really important? It is managed
> pages that people do care about.

Fair enough.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span
  2018-11-28 11:00         ` osalvador
@ 2018-11-28 12:31           ` Michal Hocko
  2018-11-28 12:51             ` osalvador
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2018-11-28 12:31 UTC (permalink / raw)
  To: osalvador
  Cc: akpm, dan.j.williams, pavel.tatashin, jglisse, Jonathan.Cameron,
	rafael, david, linux-mm, owner-linux-mm

On Wed 28-11-18 12:00:35, osalvador@suse.de wrote:
> 
> > OK, so what is the difference between memory hot-removing a range within
> > a zone and on the zone boundary? There should be none, yet spanned pages
> > do get updated only when we do the latter, IIRC? So spanned pages is not
> > really all that valuable information. It just tells us
> > zone_end - zone_start. Also, what are the semantics of
> > spanned_pages for interleaving zones?
> 
> Ok, I think I start getting your point.
> Yes, spanned_pages are only touched in case we remove the first or the last
> section of a memory range.
> 
> So your point is to get rid of shrink_zone_span() and shrink_node_span(),
> and not touch spanned_pages at all? (only when the zone is gone or the
> node goes offline?)

yep. Or when we extend a zone/node via hotplug.

> The only thing I am worried about is that by doing that, the system
> will account spanned_pages incorrectly.

As long as end_pfn - start_pfn matches then I do not see what would be
incorrect.

> So, if we remove pages on zone-boundary, neither zone_start_pfn nor
> spanned_pages will change.
> I did not check yet, but could it be that somewhere we use zone/node's
> spanned_pages
> information to compute something?

That is an obvious homework to do when posting such a patch ;)

> I mean, do not get me wrong, getting rid of all shrink stuff would be great,
> it will remove a __lot__ of code and some complexity, but I am not sure if
> it is totally safe.

Yes it is a lot of code and I do not see any strong justification for
it. In the past the zone boundary was really important because it
defined the memory zone for the new memory to hotplug. For quite some
time we have had a much more flexible semantic and you can online memory
to normal/movable zones as you like. So I _believe_ the need for the
shrink code is gone. Maybe I am missing something of course. All I want
to say is that it would be _so_ great to get rid of it rather than put
lipstick on a pig...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span
  2018-11-28 12:31           ` Michal Hocko
@ 2018-11-28 12:51             ` osalvador
  2018-11-28 13:08               ` Michal Hocko
  2018-11-28 13:09               ` osalvador
  0 siblings, 2 replies; 26+ messages in thread
From: osalvador @ 2018-11-28 12:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, dan.j.williams, pavel.tatashin, jglisse, Jonathan.Cameron,
	rafael, david, linux-mm, owner-linux-mm

> yep. Or when we extend a zone/node via hotplug.
> 
>> The only thing I am worried about is that by doing that, the system
>> will account spanned_pages incorrectly.
> 
> As long as end_pfn - start_pfn matches then I do not see what would be
> incorrect.

If by end_pfn - start_pfn you mean zone_end_pfn - zone_start_pfn,
then we would still need to change zone_start_pfn when removing
the first section, and adjust spanned_pages in case we remove the
last section, would we not?

Let us say we have a zone with 3 sections:

zone_start_pfn = 0
zone_end_pfn = 98304

If we hot-remove the last section, zone_end_pfn should be adjusted
to 65536.
Otherwise zone_end_pfn - zone_start_pfn will give us more.

The same goes when we hot-remove the first section.
Of course, we should not care when removing a section which is
neither the first one nor the last one.

Having said that, I will check the uses we have for spanned_pages.
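The boundary arithmetic in this example can be sketched in userspace (a
simplified half-open [start_pfn, end_pfn) model, not the kernel's actual
shrink code, which additionally has to skip holes when picking the new edge):

```c
#include <assert.h>

#define PAGES_PER_SECTION 32768UL

/* Stand-in for the span fields of struct zone. */
struct zone_span {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
};

/* Remove [start_pfn, end_pfn) from the zone's span. Only removals that
 * touch a zone edge change the accounting; a middle removal just leaves
 * a hole, which is exactly the asymmetry questioned in this thread. */
static void shrink_span(struct zone_span *z, unsigned long start_pfn,
			unsigned long end_pfn)
{
	unsigned long zone_end = z->zone_start_pfn + z->spanned_pages;

	if (start_pfn == z->zone_start_pfn) {
		/* First section(s) removed: the zone now starts at end_pfn. */
		z->zone_start_pfn = end_pfn;
		z->spanned_pages = zone_end - end_pfn;
	} else if (end_pfn == zone_end) {
		/* Last section(s) removed: the zone now ends at start_pfn. */
		z->spanned_pages = start_pfn - z->zone_start_pfn;
	}
	/* Middle removal: start and span are deliberately left untouched. */
}
```

With the three-section zone above, removing the last section [65536, 98304)
leaves spanned_pages at 65536, matching the numbers in the mail.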

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span
  2018-11-28 12:51             ` osalvador
@ 2018-11-28 13:08               ` Michal Hocko
  2018-11-28 13:18                 ` osalvador
  2018-11-28 13:09               ` osalvador
  1 sibling, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2018-11-28 13:08 UTC (permalink / raw)
  To: osalvador
  Cc: akpm, dan.j.williams, pavel.tatashin, jglisse, Jonathan.Cameron,
	rafael, david, linux-mm, owner-linux-mm

On Wed 28-11-18 13:51:42, osalvador@suse.de wrote:
> > yep. Or when we extend a zone/node via hotplug.
> > 
> > > The only thing I am worried about is that by doing that, the system
> > > will account spanned_pages incorrectly.
> > 
> > As long as end_pfn - start_pfn matches then I do not see what would be
> > incorrect.
> 
> If by end_pfn - start_pfn you mean zone_end_pfn - zone_start_pfn,
> then we would still need to change zone_start_pfn when removing
> the first section, and adjust spanned_pages in case we remove the
> last section, would we not?

Why? Again, how is removing the last/first section of the zone any
different from any other section?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span
  2018-11-28 12:51             ` osalvador
  2018-11-28 13:08               ` Michal Hocko
@ 2018-11-28 13:09               ` osalvador
  1 sibling, 0 replies; 26+ messages in thread
From: osalvador @ 2018-11-28 13:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, dan.j.williams, pavel.tatashin, jglisse, Jonathan.Cameron,
	rafael, david, linux-mm, owner-linux-mm

On 2018-11-28 13:51, osalvador@suse.de wrote:
>> yep. Or when we extend a zone/node via hotplug.
>> 
>>> The only thing I am worried about is that by doing that, the system
>>> will account spanned_pages incorrectly.
>> 
>> As long as end_pfn - start_pfn matches then I do not see what would be
>> incorrect.

Or maybe I misunderstood you, and instead of having this shrink code
you would like to re-use resize_zone/pgdat_range to adjust
end_pfn and start_pfn when offlining the first or last sections.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span
  2018-11-28 13:08               ` Michal Hocko
@ 2018-11-28 13:18                 ` osalvador
  2018-11-28 15:50                   ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: osalvador @ 2018-11-28 13:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, dan.j.williams, pavel.tatashin, jglisse, Jonathan.Cameron,
	rafael, david, linux-mm, owner-linux-mm

On 2018-11-28 14:08, Michal Hocko wrote:
> On Wed 28-11-18 13:51:42, osalvador@suse.de wrote:
>> > yep. Or when we extend a zone/node via hotplug.
>> >
>> > > The only thing I am worried about is that by doing that, the system
>> > > will account spanned_pages incorrectly.
>> >
>> > As long as end_pfn - start_pfn matches then I do not see what would be
>> > incorrect.
>> 
>> If by end_pfn - start_pfn you mean zone_end_pfn - zone_start_pfn,
>> then we would still need to change zone_start_pfn when removing
>> the first section, and adjust spanned_pages in case we remove the
>> last section, would we not?
> 
> Why? Again, how is removing the last/first section of the zone any
> different from any other section?

Because removing the last/first section changes the zone's boundary.
A zone whose first section was removed will no longer start
at zone_start_pfn.

A quick glance shows that, for example, compact_zone() relies on
zone_start_pfn to determine where the zone starts.
Now, if you remove the first section and zone_start_pfn does not get
adjusted, you will get a wrong start.

Maybe that is fine, I am not sure.
Sorry for looping here, but I am finding it difficult to grasp.
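
To put the concern in concrete terms, here is a minimal user-space sketch
(simplified types and names, not the kernel's actual struct zone or code) of
the boundary adjustment being discussed: removing the first or last section
moves the span boundary, while removing a middle section only creates a hole:

```c
#include <assert.h>

#define PAGES_PER_SECTION 32768UL

/* Simplified stand-in for struct zone's span fields. */
struct zone_span {
	unsigned long start_pfn;
	unsigned long spanned_pages;
};

static unsigned long span_end_pfn(const struct zone_span *z)
{
	return z->start_pfn + z->spanned_pages;
}

/*
 * Remove one section [pfn, pfn + PAGES_PER_SECTION) from the span.
 * The boundary is adjusted only when the section is first or last;
 * removing a middle section leaves the span untouched (it just
 * creates a hole), which is the behaviour under discussion.
 */
static void remove_section_span(struct zone_span *z, unsigned long pfn)
{
	if (pfn == z->start_pfn) {
		z->start_pfn += PAGES_PER_SECTION;
		z->spanned_pages -= PAGES_PER_SECTION;
	} else if (pfn + PAGES_PER_SECTION == span_end_pfn(z)) {
		z->spanned_pages -= PAGES_PER_SECTION;
	}
	/* Middle section: span unchanged, hole created. */
}
```

If this adjustment is skipped for the first section, zone_start_pfn keeps
pointing at pages that are no longer there, which is exactly the stale-start
worry above.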

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/5] mm, memory_hotplug: Move zone/pages handling to offline stage
  2018-11-27 16:20 ` [PATCH v2 3/5] mm, memory_hotplug: Move zone/pages handling to offline stage Oscar Salvador
  2018-11-28  7:52   ` Mike Rapoport
@ 2018-11-28 14:15   ` osalvador
  1 sibling, 0 replies; 26+ messages in thread
From: osalvador @ 2018-11-28 14:15 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, dan.j.williams, pavel.tatashin, jglisse,
	Jonathan.Cameron, rafael, david, linux-mm, Oscar Salvador

On 2018-11-27 17:20, Oscar Salvador wrote:
> From: Oscar Salvador <osalvador@suse.com>
> 
> The current implementation accesses pages during hot-remove
> stage in order to get the zone linked to this memory-range.
> We use that zone for a) check if the zone is ZONE_DEVICE and
> b) to shrink the zone's spanned pages.
> 
> Accessing pages during this stage is problematic, as we might be
> accessing pages that were not initialized if we did not get to
> online the memory before removing it.
> 
> The only reason to check for ZONE_DEVICE in __remove_pages
> is to bypass the call to release_mem_region_adjustable(),
> since these regions are removed with devm_release_mem_region.
> 
> With patch#2, this is no longer a problem so we can safely
> call release_mem_region_adjustable().
> release_mem_region_adjustable() will spot that the region
> we are trying to remove was acquired by means of
> devm_request_mem_region, and will back off safely.
> 
> This allows us to remove all zone-related operations from
> hot-remove stage.
> 
> Because of this, zone's spanned pages are shrunk during
> the offlining stage in shrink_zone_pgdat().
> It would have been great to also decrease the spanned pages
> for the node there, but we need them in try_offline_node().
> So we still decrease spanned pages for the node in the hot-remove
> stage.
> 
> The only particularity is that now
> find_smallest_section_pfn/find_biggest_section_pfn, when called from
> shrink_zone_span, will check for online sections instead of
> valid sections.
> To make this work with devm/HMM code, we need to call
> offline_mem_sections and online_mem_sections in that code path
> when we are adding memory.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

I did not really like the idea of having to online/offline sections from
DEVM code, so I think this should be better:

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 66cbf334203b..dfdb11f58cd1 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -105,7 +105,6 @@ static void devm_memremap_pages_release(void *data)

  	pfn = align_start >> PAGE_SHIFT;
  	nr_pages = align_size >> PAGE_SHIFT;
-	offline_mem_sections(pfn, pfn + nr_pages);
  	shrink_zone(page_zone(pfn_to_page(pfn)), pfn, pfn + nr_pages, nr_pages);

  	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
@@ -229,10 +228,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)

  	if (!error) {
  		struct zone *zone;
-		unsigned long pfn = align_start >> PAGE_SHIFT;
-		unsigned long nr_pages = align_size >> PAGE_SHIFT;

-		online_mem_sections(pfn, pfn + nr_pages);
  		zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
  		move_pfn_range_to_zone(zone, align_start >> PAGE_SHIFT,
  				align_size >> PAGE_SHIFT, altmap);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4fe42ccb0be4..653d2bc9affe 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -314,13 +314,17 @@ int __ref __add_pages(int nid, unsigned long phys_start_pfn,
  }

  #ifdef CONFIG_MEMORY_HOTREMOVE
-static bool is_section_ok(struct mem_section *ms, bool zone)
+static bool is_section_ok(struct mem_section *ms, struct zone *z)
  {
  	/*
-	 * We cannot shrink pgdat's spanned because we use them
-	 * in try_offline_node to check if all sections were removed.
+	 * In case we are shrinking pgdat's pages or the zone is
+	 * ZONE_DEVICE, we check for valid sections instead.
+	 * We cannot shrink pgdat's spanned pages until hot-remove
+	 * operation because we use them in try_offline_node to check
+	 * if all sections were removed.
+	 * ZONE_DEVICE's sections do not get onlined either.
  	 */
-	if (zone)
+	if (z && !is_dev_zone(z))
  		return online_section(ms);
  	else
  		return valid_section(ms);
@@ -335,7 +339,7 @@ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
  	for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
  		ms = __pfn_to_section(start_pfn);

-		if (!is_section_ok(ms, !!zone))
+		if (!is_section_ok(ms, zone))
  			continue;

  		if (unlikely(pfn_to_nid(start_pfn) != nid))
@@ -425,7 +429,7 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
  	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
  		ms = __pfn_to_section(pfn);

-		if (unlikely(!online_section(ms)))
+		if (unlikely(!is_section_ok(ms, zone)))
  			continue;

  		if (page_zone(pfn_to_page(pfn)) != zone)
@@ -517,11 +521,24 @@ void shrink_zone(struct zone *zone, unsigned long start_pfn,
  {
  	int nr_pages = PAGES_PER_SECTION;
  	unsigned long pfn;
+	unsigned long flags;
+	struct pglist_data *pgdat = zone->zone_pgdat;
+
+	pgdat_resize_lock(pgdat, &flags);
+	/*
+	 * Handling for ZONE_DEVICE does not account
+	 * present pages.
+	 */
+	if (!is_dev_zone(zone))
+		pgdat->node_present_pages -= offlined_pages;
+

  	clear_zone_contiguous(zone);
  	for (pfn = start_pfn; pfn < end_pfn; pfn += nr_pages)
  		shrink_zone_span(zone, pfn, pfn + nr_pages);
  	set_zone_contiguous(zone);
+
+	pgdat_resize_unlock(pgdat, &flags);
  }

  static void shrink_pgdat(int nid, unsigned long sect_nr)
@@ -555,8 +572,8 @@ static int __remove_section(int nid, struct mem_section *ms,
  }

  /**
- * __remove_pages() - remove sections of pages from a nid
- * @nid: nid from which pages belong to
+ * remove_sections() - remove sections of pages from a nid
+ * @nid: node from which pages need to be removed to
   * @phys_start_pfn: starting pageframe (must be aligned to start of a section)
   * @nr_pages: number of pages to remove (must be multiple of section size)
   * @altmap: alternative device page map or %NULL if default memmap is used
@@ -1581,7 +1598,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
  	unsigned long pfn, nr_pages;
  	long offlined_pages;
  	int ret, node;
-	unsigned long flags;
  	unsigned long valid_start, valid_end;
  	struct zone *zone;
  	struct memory_notify arg;
@@ -1663,14 +1679,12 @@ static int __ref __offline_pages(unsigned long start_pfn,
  	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
  	/* removal success */

-	/* Shrink zone's managed,spanned and zone/pgdat's present pages */
+	/* Shrink zone's managed and present pages */
  	adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
  	zone->present_pages -= offlined_pages;

-	pgdat_resize_lock(zone->zone_pgdat, &flags);
-	zone->zone_pgdat->node_present_pages -= offlined_pages;
+	/* Shrink zone's spanned pages and node's present pages */
  	shrink_zone(zone, valid_start, valid_end, offlined_pages);
-	pgdat_resize_unlock(zone->zone_pgdat, &flags);

  	init_per_zone_wmark_min();

Although there is an ongoing discussion about getting rid of the shrink
code. If that happens, this will be a lot simpler.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/5] mm, memory_hotplug: Move zone/pages handling to offline stage
  2018-11-28  7:52   ` Mike Rapoport
@ 2018-11-28 14:25     ` osalvador
  0 siblings, 0 replies; 26+ messages in thread
From: osalvador @ 2018-11-28 14:25 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: akpm, mhocko, dan.j.williams, pavel.tatashin, jglisse,
	Jonathan.Cameron, rafael, david, linux-mm, Oscar Salvador

>>  /**
>> - * __remove_pages() - remove sections of pages from a zone
>> - * @zone: zone from which pages need to be removed
>> + * __remove_pages() - remove sections of pages from a nid
>> + * @nid: nid from which pages belong to
> 
> Nit: the description sounds a bit awkward.
> Why not keep the original one with s/zone/node/?

Yes, much better.

thanks

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span
  2018-11-28 13:18                 ` osalvador
@ 2018-11-28 15:50                   ` Michal Hocko
  2018-11-28 16:02                     ` osalvador
  2018-11-29  9:29                     ` osalvador
  0 siblings, 2 replies; 26+ messages in thread
From: Michal Hocko @ 2018-11-28 15:50 UTC (permalink / raw)
  To: osalvador
  Cc: akpm, dan.j.williams, pavel.tatashin, jglisse, Jonathan.Cameron,
	rafael, david, linux-mm, owner-linux-mm

On Wed 28-11-18 14:18:43, osalvador@suse.de wrote:
> On 2018-11-28 14:08, Michal Hocko wrote:
> > On Wed 28-11-18 13:51:42, osalvador@suse.de wrote:
> > > > yep. Or when we extend a zone/node via hotplug.
> > > >
> > > > > The only thing I am worried about is that by doing that, the system
> > > > > will account spanned_pages incorrectly.
> > > >
> > > > As long as end_pfn - start_pfn matches then I do not see what would be
> > > > incorrect.
> > > 
> > > If by end_pfn - start_pfn you mean zone_end_pfn - zone_start_pfn,
> > > then we would still need to change zone_start_pfn when removing
> > > the first section, and adjust spanned_pages in case we remove the last
> > > section,
> > > would not we?
> > 
> > Why? Again, how is removing the last/first section of the zone any
> > different from any other section?
> 
> Because removing the last/first section changes the zone's boundary.
> A zone whose first section was removed will no longer start
> at zone_start_pfn.
> 
> A quick glance shows that, for example, compact_zone() relies on
> zone_start_pfn to determine where the zone starts.
> Now, if you remove the first section and zone_start_pfn does not get
> adjusted, you will get a wrong start.
> 
> Maybe that is fine, I am not sure.
> Sorry for looping here, but I am finding it difficult to grasp.

OK, so let me try again. What is the difference for a pfn walker between
starting at an offline pfn and starting from any other offlined section
within a zone boundary? I believe there is none, because the pfn walker
needs to skip over offline pfns anyway, whether they sit at a zone
boundary or not.
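
The argument can be sketched with a toy, user-space model (simplified names,
not the kernel's code): a walker that checks something like
pfn_to_online_page() reaches the same result whether the offline pfns sit at
the zone boundary or in the middle of the span:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_PFNS 16

/* Toy stand-in for the per-pfn online state a real walker would
 * query via pfn_to_online_page(). */
static bool toy_online[TOY_NR_PFNS];

/* Walk [start, end) the way a careful pfn walker must:
 * skip any pfn that is not online. */
static int count_online(unsigned long start, unsigned long end)
{
	int n = 0;

	for (unsigned long pfn = start; pfn < end; pfn++)
		if (toy_online[pfn])	/* pfn_to_online_page() != NULL */
			n++;
	return n;
}
```

Whether the walk starts at a stale zone boundary (pointing at offline pfns)
or at the first online pfn, the skipping makes the outcome identical, which
is the point being made above.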

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span
  2018-11-28 15:50                   ` Michal Hocko
@ 2018-11-28 16:02                     ` osalvador
  2018-11-29  9:29                     ` osalvador
  1 sibling, 0 replies; 26+ messages in thread
From: osalvador @ 2018-11-28 16:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, dan.j.williams, pavel.tatashin, jglisse, Jonathan.Cameron,
	rafael, david, linux-mm, owner-linux-mm

>> Maybe that is fine, I am not sure.
>> Sorry for looping here, but I am finding it difficult to grasp.
> 
> OK, so let me try again. What is the difference for a pfn walker between
> starting at an offline pfn and starting from any other offlined section
> within a zone boundary? I believe there is none, because the pfn walker
> needs to skip over offline pfns anyway, whether they sit at a zone
> boundary or not.

If the pfn walker in question skips over "invalid" (not online) pfns,
then we are fine, I guess.
But we need to make sure that this is the case, and that we do not have
someone relying on zone_start_pfn and trusting it blindly.

I will go through the code and check all cases to be sure that this is
not the case.
If it turns out to be safe, then I am fine with getting rid of the shrink code.

Thanks for the explanations ;-)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span
  2018-11-28 15:50                   ` Michal Hocko
  2018-11-28 16:02                     ` osalvador
@ 2018-11-29  9:29                     ` osalvador
  1 sibling, 0 replies; 26+ messages in thread
From: osalvador @ 2018-11-29  9:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, dan.j.williams, pavel.tatashin, jglisse, Jonathan.Cameron,
	rafael, david, linux-mm, owner-linux-mm

> OK, so let me try again. What is the difference for a pfn walker between
> starting at an offline pfn and starting from any other offlined section
> within a zone boundary? I believe there is none, because the pfn walker
> needs to skip over offline pfns anyway, whether they sit at a zone
> boundary or not.

I checked most of the users of zone_start_pfn:

* split_huge_pages_set:
   - It uses pfn_valid().
     I guess this is fine, as it will check whether the section still has a map.

* __reset_isolation_suitable():
   - Safe, as it uses pfn_to_online_page().

* isolate_freepages_range():
* isolate_migratepages_range():
* isolate_migratepages():
   - They use pageblock_pfn_to_page().
     If !zone->contiguous, it will use __pageblock_pfn_to_page()->pfn_to_online_page().
     If zone->contiguous is true, it will use pageblock_pfn_to_page()->pfn_to_page(),
     which is bad because it will not skip over offlined pfns.

* count_highmem_pages:
* count_data_pages:
* copy_data_pages:
   - page_is_saveable()->pfn_valid().
     I guess this is fine, as it will check whether the section still has a map.


So, leaving out the isolate_* functions, it seems that we should be safe.
The isolate_* functions depend on !zone->contiguous to call
__pageblock_pfn_to_page()->pfn_to_online_page().
So whenever we remove a section in a zone, we should clear zone->contiguous.
But this really calls for some deep checking so that we do not shoot
ourselves in the foot.

What I can do for now is to drop this patch from the patchset and re-send
v3 without it.
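
The hazard with the contiguous fast path can be shown with a toy, user-space
model (the names toy_* are made up for illustration; this is not the kernel's
actual implementation of pageblock_pfn_to_page()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PFNS 8

/* Per-pfn online state, the thing pfn_to_online_page() consults. */
static bool toy_online[TOY_NR_PFNS];

/* Slow path: returns NULL for an offline pfn, like pfn_to_online_page(). */
static const bool *toy_pfn_to_online_page(unsigned long pfn)
{
	return toy_online[pfn] ? &toy_online[pfn] : NULL;
}

/*
 * Toy model of the fast path: when zone->contiguous is set, the online
 * check is skipped entirely and the caller gets a "page" back no matter
 * what, mirroring the pfn_to_page() shortcut described above.
 */
static const bool *toy_pageblock_pfn_to_page(unsigned long pfn,
					     bool zone_contiguous)
{
	if (zone_contiguous)
		return &toy_online[pfn];	/* no online check at all */
	return toy_pfn_to_online_page(pfn);	/* skips offline pfns */
}
```

This is why clearing zone->contiguous after removing a section matters: with
the flag still set, the isolate_* callers would be handed pages for pfns that
are no longer online.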

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 4/5] mm, memory-hotplug: Rework unregister_mem_sect_under_nodes
  2018-11-27 16:20 ` [PATCH v2 4/5] mm, memory-hotplug: Rework unregister_mem_sect_under_nodes Oscar Salvador
@ 2019-03-24  6:48   ` Anshuman Khandual
  2019-03-25  7:40     ` Oscar Salvador
  0 siblings, 1 reply; 26+ messages in thread
From: Anshuman Khandual @ 2019-03-24  6:48 UTC (permalink / raw)
  To: Oscar Salvador, akpm
  Cc: mhocko, dan.j.williams, pavel.tatashin, jglisse,
	Jonathan.Cameron, rafael, david, linux-mm, Oscar Salvador



On 11/27/2018 09:50 PM, Oscar Salvador wrote:
> From: Oscar Salvador <osalvador@suse.com>
> 
> This tries to address another issue about accessing
> uninitialized pages.
> 
> Jonathan reported a problem [1] where we can access stale pages
> in case we hot-remove memory without onlining it first.
> 
> This time it is in unregister_mem_sect_under_nodes.
> This function tries to get the nid from the pfn and then
> tries to remove the symlink between mem_blk <-> nid and vice versa.
> 
> Since we already know the nid in remove_memory(), we can pass
> it down the chain to unregister_mem_sect_under_nodes.
> There we can just remove the symlinks without the need
> to look into the pages.
> 
> This also allows us to cleanup unregister_mem_sect_under_nodes.
> 
> [1] https://www.spinics.net/lists/linux-mm/msg161316.html
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>  drivers/base/memory.c  |  9 ++++-----
>  drivers/base/node.c    | 39 ++++++---------------------------------
>  include/linux/memory.h |  2 +-
>  include/linux/node.h   |  9 ++++-----
>  mm/memory_hotplug.c    |  2 +-
>  5 files changed, 16 insertions(+), 45 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 0e5985682642..3d8c65d84bea 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -744,8 +744,7 @@ unregister_memory(struct memory_block *memory)
>  	device_unregister(&memory->dev);
>  }
>  
> -static int remove_memory_section(unsigned long node_id,
> -			       struct mem_section *section, int phys_device)
> +static int remove_memory_section(unsigned long nid, struct mem_section *section)
>  {
>  	struct memory_block *mem;
>  
> @@ -759,7 +758,7 @@ static int remove_memory_section(unsigned long node_id,
>  	if (!mem)
>  		goto out_unlock;
>  
> -	unregister_mem_sect_under_nodes(mem, __section_nr(section));
> +	unregister_mem_sect_under_nodes(nid, mem);
>  
>  	mem->section_count--;
>  	if (mem->section_count == 0)
> @@ -772,12 +771,12 @@ static int remove_memory_section(unsigned long node_id,
>  	return 0;
>  }
>  
> -int unregister_memory_section(struct mem_section *section)
> +int unregister_memory_section(int nid, struct mem_section *section)
>  {
>  	if (!present_section(section))
>  		return -EINVAL;
>  
> -	return remove_memory_section(0, section, 0);
> +	return remove_memory_section(nid, section);
>  }
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>  
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 86d6cd92ce3d..0858f7f3c7cd 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -453,40 +453,13 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, void *arg)
>  	return 0;
>  }
>  
> -/* unregister memory section under all nodes that it spans */
> -int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
> -				    unsigned long phys_index)
> +/* Remove symlink between node <-> mem_blk */
> +void unregister_mem_sect_under_nodes(int nid, struct memory_block *mem_blk)
>  {
> -	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
> -	unsigned long pfn, sect_start_pfn, sect_end_pfn;
> -
> -	if (!mem_blk) {
> -		NODEMASK_FREE(unlinked_nodes);
> -		return -EFAULT;
> -	}
> -	if (!unlinked_nodes)
> -		return -ENOMEM;
> -	nodes_clear(*unlinked_nodes);
> -
> -	sect_start_pfn = section_nr_to_pfn(phys_index);
> -	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
> -	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
> -		int nid;
> -
> -		nid = get_nid_for_pfn(pfn);
> -		if (nid < 0)
> -			continue;
> -		if (!node_online(nid))
> -			continue;
> -		if (node_test_and_set(nid, *unlinked_nodes))
> -			continue;
> -		sysfs_remove_link(&node_devices[nid]->dev.kobj,
> -			 kobject_name(&mem_blk->dev.kobj));
> -		sysfs_remove_link(&mem_blk->dev.kobj,
> -			 kobject_name(&node_devices[nid]->dev.kobj));
> -	}
> -	NODEMASK_FREE(unlinked_nodes);
> -	return 0;
> +	sysfs_remove_link(&node_devices[nid]->dev.kobj,
> +			kobject_name(&mem_blk->dev.kobj));
> +	sysfs_remove_link(&mem_blk->dev.kobj,
> +			kobject_name(&node_devices[nid]->dev.kobj));

Hello Oscar,

Passing down the node ID till unregister_mem_sect_under_nodes() solves the problem of
querying struct page for the nid, but the current code assumes that the pfn range for
any given memory section can have different node IDs. Hence it scans over the
section and tries to remove all possible node <---> memory block sysfs links.

I am just wondering: is that assumption even correct? Can we really have a memory
section which belongs to different nodes? Is that even possible?

- Anshuman


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 4/5] mm, memory-hotplug: Rework unregister_mem_sect_under_nodes
  2019-03-24  6:48   ` Anshuman Khandual
@ 2019-03-25  7:40     ` Oscar Salvador
  2019-03-25  8:04       ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2019-03-25  7:40 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: akpm, mhocko, dan.j.williams, pavel.tatashin, jglisse,
	Jonathan.Cameron, rafael, david, linux-mm, Oscar Salvador

On Sun, Mar 24, 2019 at 12:18:26PM +0530, Anshuman Khandual wrote:
> Hello Oscar,

Hi Anshuman,

> Passing down the node ID till unregister_mem_sect_under_nodes() solves the problem of
> querying struct page for the nid, but the current code assumes that the pfn range for
> any given memory section can have different node IDs. Hence it scans over the
> section and tries to remove all possible node <---> memory block sysfs links.
> 
> I am just wondering: is that assumption even correct? Can we really have a memory
> section which belongs to different nodes? Is that even possible?

Yes, the current code assumes that, but looking at how we init sections at boot
stage, it seems like a 1:1 map to me.

E.g., in memory_present() we encode the nid in the section's section_mem_map
field, to use it later on in sparse_init() to get the node we should allocate
the data structures from.

And in memory_present() itself, in case we do not use page's flags field,
we end up using the section_to_node_table[] table, which is clearly a 1:1 map.

So, I might be wrong here, but I think that we do not really have nodes mixed
in a section.
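
A toy, user-space illustration of that 1:1 observation (simplified toy_*
names, not the kernel's actual sparsemem code): a per-section node table can
only ever record one nid per section, so even if a hardware range straddled a
section, the last registration would win:

```c
#include <assert.h>

#define TOY_NR_SECTIONS 8

/* Toy stand-in for sparsemem's section_to_node_table[]:
 * exactly one nid slot per section. */
static int toy_section_to_node[TOY_NR_SECTIONS];

/* Toy memory_present(): record the nid for every section in the range. */
static void toy_memory_present(int nid, unsigned long start_sec,
			       unsigned long end_sec)
{
	for (unsigned long s = start_sec; s < end_sec; s++)
		toy_section_to_node[s] = nid;	/* later writers overwrite */
}

/* Toy lookup: the only nid the table can report for a section. */
static int toy_section_nr_to_nid(unsigned long section_nr)
{
	return toy_section_to_node[section_nr];
}
```

So the table enforces a per-section nid even when the underlying memory map
does not, which is one way the "1:1" impression can arise at boot.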

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 4/5] mm, memory-hotplug: Rework unregister_mem_sect_under_nodes
  2019-03-25  7:40     ` Oscar Salvador
@ 2019-03-25  8:04       ` Michal Hocko
  2019-03-25  8:14         ` Oscar Salvador
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2019-03-25  8:04 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Anshuman Khandual, akpm, dan.j.williams, pavel.tatashin, jglisse,
	Jonathan.Cameron, rafael, david, linux-mm, Oscar Salvador

On Mon 25-03-19 08:40:32, Oscar Salvador wrote:
> On Sun, Mar 24, 2019 at 12:18:26PM +0530, Anshuman Khandual wrote:
> > Hello Oscar,
> 
> Hi Anshuman,
> 
> > Passing down the node ID till unregister_mem_sect_under_nodes() solves the problem of
> > querying struct page for the nid, but the current code assumes that the pfn range for
> > any given memory section can have different node IDs. Hence it scans over the
> > section and tries to remove all possible node <---> memory block sysfs links.
> > 
> > I am just wondering: is that assumption even correct? Can we really have a memory
> > section which belongs to different nodes? Is that even possible?
> 
> Yes, the current code assumes that, but looking at how we init sections at boot
> stage, it seems like a 1:1 map to me.
> 
> E.g., in memory_present() we encode the nid in the section's section_mem_map
> field, to use it later on in sparse_init() to get the node we should allocate
> the data structures from.
> 
> And in memory_present() itself, in case we do not use page's flags field,
> we end up using the section_to_node_table[] table, which is clearly a 1:1 map.
> 
> So, I might be wrong here, but I think that we do not really have nodes mixed
> in a section.

No, unfortunately two nodes might share the same section indeed. Have a
look at 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 4/5] mm, memory-hotplug: Rework unregister_mem_sect_under_nodes
  2019-03-25  8:04       ` Michal Hocko
@ 2019-03-25  8:14         ` Oscar Salvador
  0 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2019-03-25  8:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Anshuman Khandual, akpm, dan.j.williams, pavel.tatashin, jglisse,
	Jonathan.Cameron, rafael, david, linux-mm, Oscar Salvador

On Mon, Mar 25, 2019 at 09:04:53AM +0100, Michal Hocko wrote:
> No, unfortunately two nodes might share the same section indeed. Have a
> look at 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a

Heh, I see, I guess one cannot assume anything when it comes to HW.
Well, at least we know that for real now.

I always thought that this was merely a kind of hardcoded assumption that we
just made in an attempt to avoid breaking things.

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, back to index

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-27 16:20 [PATCH v2 0/5] Do not touch pages in hot-remove path Oscar Salvador
2018-11-27 16:20 ` [PATCH v2 1/5] mm, memory_hotplug: Add nid parameter to arch_remove_memory Oscar Salvador
2018-11-27 16:20 ` [PATCH v2 2/5] kernel, resource: Check for IORESOURCE_SYSRAM in release_mem_region_adjustable Oscar Salvador
2018-11-27 16:20 ` [PATCH v2 3/5] mm, memory_hotplug: Move zone/pages handling to offline stage Oscar Salvador
2018-11-28  7:52   ` Mike Rapoport
2018-11-28 14:25     ` osalvador
2018-11-28 14:15   ` osalvador
2018-11-27 16:20 ` [PATCH v2 4/5] mm, memory-hotplug: Rework unregister_mem_sect_under_nodes Oscar Salvador
2019-03-24  6:48   ` Anshuman Khandual
2019-03-25  7:40     ` Oscar Salvador
2019-03-25  8:04       ` Michal Hocko
2019-03-25  8:14         ` Oscar Salvador
2018-11-27 16:20 ` [PATCH v2 5/5] mm, memory_hotplug: Refactor shrink_zone/pgdat_span Oscar Salvador
2018-11-28  6:50   ` Michal Hocko
2018-11-28  7:07     ` Oscar Salvador
2018-11-28 10:03       ` David Hildenbrand
2018-11-28 10:14       ` Michal Hocko
2018-11-28 11:00         ` osalvador
2018-11-28 12:31           ` Michal Hocko
2018-11-28 12:51             ` osalvador
2018-11-28 13:08               ` Michal Hocko
2018-11-28 13:18                 ` osalvador
2018-11-28 15:50                   ` Michal Hocko
2018-11-28 16:02                     ` osalvador
2018-11-29  9:29                     ` osalvador
2018-11-28 13:09               ` osalvador
