* [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory
@ 2019-08-29  7:00 David Hildenbrand
  2019-08-29  7:00 ` [PATCH v3 01/11] mm/memremap: Get rid of memmap_init_zone_device() David Hildenbrand
                   ` (11 more replies)
  0 siblings, 12 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29  7:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Aneesh Kumar K . V, Andrew Morton,
	Dan Williams, Michal Hocko, Alexander Duyck, Alexander Potapenko,
	Andrey Konovalov, Andy Lutomirski, Anshuman Khandual, Arun KS,
	Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
	Christian Borntraeger, Christophe Leroy, Dave Airlie,
	Dave Hansen, Fenghua Yu, Gerald Schaefer, Greg Kroah-Hartman,
	Halil Pasic, Heiko Carstens, H. Peter Anvin, Ingo Molnar,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Jun Yao,
	Kirill A. Shutemov, Logan Gunthorpe, Mark Rutland,
	Masahiro Yamada, Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Mike Rapoport, Mike Rapoport,
	Oscar Salvador, Oscar Salvador, Paul Mackerras, Pavel Tatashin,
	Pavel Tatashin, Peter Zijlstra, Qian Cai, Rich Felker,
	Robin Murphy, Souptick Joarder, Stephen Rothwell, Steve Capper,
	Thomas Gleixner, Tom Lendacky, Tony Luck, Vasily Gorbik,
	Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon, Yang Shi,
	Yoshinori Sato, Yu Zhao

This is the successor of "[PATCH v2 0/6] mm/memory_hotplug: Consider all
zones when removing memory". I decided to go one step further and finally
factor out the shrinking of zones from the memory removal code. Zones are
now fixed up when offlining memory, when onlining memory fails, and before
removing ZONE_DEVICE memory.

I guess only patch #1 is debatable - it is certainly not a blocker for
the other stuff in this series; the current approach of initializing the
memmap outside of the lock just doesn't seem to be 100% safe. I am not
quite sure if we have a real performance issue here - let's hear what
people think.

--------------------------------------------------------------------------

Example:

:/# cat /proc/zoneinfo
Node 1, zone  Movable
        spanned  0
        present  0
        managed  0
:/# echo "online_movable" > /sys/devices/system/memory/memory41/state 
:/# echo "online_movable" > /sys/devices/system/memory/memory43/state
:/# cat /proc/zoneinfo
Node 1, zone  Movable
        spanned  98304
        present  65536
        managed  65536
:/# echo 0 > /sys/devices/system/memory/memory43/online
:/# cat /proc/zoneinfo
Node 1, zone  Movable
        spanned  32768
        present  32768
        managed  32768
:/# echo 0 > /sys/devices/system/memory/memory41/online
:/# cat /proc/zoneinfo
Node 1, zone  Movable
        spanned  0
        present  0
        managed  0
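
(Quick sanity check of these numbers, assuming the default 128 MiB
memory block size on x86-64 with 4 KiB pages, i.e. 32768 pages per
memory block: memory41 and memory43 are not adjacent, so after onlining
both to ZONE_MOVABLE we get

  spanned = 3 blocks * 32768 pages = 98304   (the gap between the two
                                               blocks is covered by
                                               "spanned" only)
  present = managed = 2 blocks * 32768 pages = 65536

and offlining one of the two blocks shrinks the zone back down to a
single block.)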

--------------------------------------------------------------------------

This series fixes various issues:
1. Memory removal can currently crash the system in case the first
memory block was never onlined.

2. Zone shrinking code can crash the system when trying to look at
uninitialized memmaps.

3. Removing memory with different/multiple/no zones for the affected memory
blocks does not properly shrink the zones. It really only works correctly
when all memory blocks were onlined to the same zone.

4. In case memory onlining fails, the zones are not fixed up again.

--------------------------------------------------------------------------

For !ZONE_DEVICE memory, zones are now fixed up when offlining memory. This
now works very reliably.

For ZONE_DEVICE memory, the zone is fixed up before removing memory. I
haven't tested it, but we should no longer be able to crash the system.
BUT there is a fundamental issue remaining that has to be sorted out next:
how to detect which memmaps of ZONE_DEVICE memory are valid. The current
fix I implemented is ugly and has to go.

For !ZONE_DEVICE memory we can look at the SECTION_IS_ONLINE flag to
decide whether the memmap was initialized. We can't easily use the same for
ZONE_DEVICE memory, especially because we have subsection hotplug there.
While we have "present" masks for subsections ("memory was added") we don't
have something similar for "online/active". This could probably be one
thing to look into in the future: Use SECTION_IS_ONLINE also for
ZONE_DEVICE memory and remember in a subsection bitmap which subsections
are actually online/active.
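
For !ZONE_DEVICE memory, "can we trust this memmap?" essentially boils
down to pfn_to_online_page(). Just to illustrate the direction hinted at
above - note that this is only a sketch, not part of this series: the
subsection_online_map field and the pfn_subsection_online() helper are
made up, the rest mirrors the existing struct mem_section_usage:

	/* What we can do today for !ZONE_DEVICE memory. */
	static bool memmap_was_initialized(unsigned long pfn)
	{
		return pfn_to_online_page(pfn) != NULL;
	}

	/*
	 * Hypothetical extension for ZONE_DEVICE memory: track which
	 * subsections are online/active in a second bitmap, next to the
	 * existing subsection presence map.
	 */
	struct mem_section_usage {
		DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
		/* made up, does not exist today: */
		DECLARE_BITMAP(subsection_online_map, SUBSECTIONS_PER_SECTION);
		unsigned long pageblock_flags[0];
	};

	static bool pfn_subsection_online(unsigned long pfn)
	{
		struct mem_section *ms = __pfn_to_section(pfn);
		unsigned long idx = (pfn & ~PAGE_SECTION_MASK) /
				    PAGES_PER_SUBSECTION;

		return test_bit(idx, ms->usage->subsection_online_map);
	}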

I'll leave that exercise to the ZONE_DEVICE folks. From a memory block
onlining/offlining point of view things should be clean now.

While we could still get false positives for ZONE_DEVICE memory when
trying to shrink zones, we should no longer crash - which improves
the situation considerably.

Fact: set_zone_contiguous() is even more broken (false positives) for
ZONE_DEVICE memory: it will never set a zone contiguous because we are
missing the exact same functionality - a way to detect whether a memmap
was initialized, so we can trust the zone values.
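
(For reference, paraphrasing the current upstream logic rather than
anything in this series: set_zone_contiguous() walks the zone pageblock
by pageblock and gives up as soon as __pageblock_pfn_to_page() fails,
and that check essentially boils down to

	if (!pfn_to_online_page(block_start_pfn))
		return NULL;	/* -> zone->contiguous is never set */

which can never succeed for ZONE_DEVICE memory, as its sections are
never marked online.)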

--------------------------------------------------------------------------

A bunch of preparations and cleanups are included. I only tested on x86
with DIMMs so far.

Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>

David Hildenbrand (11):
  mm/memremap: Get rid of memmap_init_zone_device()
  mm/memory_hotplug: Simplify shrink_pgdat_span()
  mm/memory_hotplug: We always have a zone in
    find_(smallest|biggest)_section_pfn
  mm/memory_hotplug: Drop local variables in shrink_zone_span()
  mm/memory_hotplug: Optimize zone shrinking code when checking for
    holes
  mm/memory_hotplug: Fix crashes in shrink_zone_span()
  mm/memory_hotplug: Exit early in __remove_pages() on BUGs
  mm: Exit early in set_zone_contiguous() if already contiguous
  mm/memory_hotplug: Remove pages from a zone before removing memory
  mm/memory_hotplug: Remove zone parameter from __remove_pages()
  mm/memory_hotplug: Cleanup __remove_pages()

 arch/arm64/mm/mmu.c            |   4 +-
 arch/ia64/mm/init.c            |   4 +-
 arch/powerpc/mm/mem.c          |   3 +-
 arch/s390/mm/init.c            |   4 +-
 arch/sh/mm/init.c              |   4 +-
 arch/x86/mm/init_32.c          |   4 +-
 arch/x86/mm/init_64.c          |   4 +-
 include/linux/memory_hotplug.h |   9 +-
 include/linux/mm.h             |   4 +-
 mm/memory_hotplug.c            | 215 +++++++++++++++------------------
 mm/memremap.c                  |  19 +--
 mm/page_alloc.c                |  45 +++----
 12 files changed, 136 insertions(+), 183 deletions(-)

-- 
2.21.0




* [PATCH v3 01/11] mm/memremap: Get rid of memmap_init_zone_device()
  2019-08-29  7:00 [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
@ 2019-08-29  7:00 ` David Hildenbrand
  2019-08-29 16:39   ` Alexander Duyck
  2019-08-29  7:00 ` [PATCH v3 02/11] mm/memory_hotplug: Simplify shrink_pgdat_span() David Hildenbrand
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29  7:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang, Qian Cai,
	Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Stephen Rothwell, Dave Airlie,
	Andrey Konovalov, Alexander Duyck, Ira Weiny, John Hubbard,
	Arun KS, Souptick Joarder, Robin Murphy, Yang Shi,
	Jason Gunthorpe, Logan Gunthorpe, Vlastimil Babka, Mel Gorman,
	Mike Rapoport, Alexander Potapenko

As far as I can see, the original split wanted to speed up a duplicate
initialization. We now only initialize once - and we really should
initialize under the lock, when resizing the zones.

As soon as we drop the lock, memory unplug might already be running and
trying to shrink the zone again. In case the memmap was not properly
initialized, the zone/node shrinking code might get false negatives when
searching for the new zone/node boundaries - bad. We could suddenly no
longer span the memory we just added.
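
Roughly (illustrative only), the problematic interleaving with the old
split would be:

	CPU A: memremap_pages()             CPU B: memory unplug
	------------------------------      ------------------------------
	mem_hotplug_begin()
	move_pfn_range_to_zone()
	  -> zone/node span grown
	mem_hotplug_done()
	                                    mem_hotplug_begin()
	                                    shrink_zone_span()
	                                      reads the still-uninitialized
	                                      memmap of the new range and
	                                      shrinks the zone past it
	memmap_init_zone_device()
	  (deferred init runs too late)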

Also, once we want to fix set_zone_contiguous(zone) inside
move_pfn_range_to_zone() to actually work with ZONE_DEVICE (instead of
always immediately stopping and never setting zone->contiguous), we have
to have the whole memmap initialized at that point. (Not sure if we want
that in the future - just a note.)

Let's just keep things simple and initialize the memmap when resizing
the zones under the lock.

If this turns out to be a real performance issue, we will have to look
into alternatives.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/memory_hotplug.h |  2 +-
 include/linux/mm.h             |  4 +---
 mm/memory_hotplug.c            |  4 ++--
 mm/memremap.c                  |  9 +-------
 mm/page_alloc.c                | 42 ++++++++++++----------------------
 5 files changed, 20 insertions(+), 41 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index f46ea71b4ffd..235530cdface 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -344,7 +344,7 @@ extern int __add_memory(int nid, u64 start, u64 size);
 extern int add_memory(int nid, u64 start, u64 size);
 extern int add_memory_resource(int nid, struct resource *resource);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap);
+		unsigned long nr_pages, struct dev_pagemap *pgmap);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern int sparse_add_section(int nid, unsigned long pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad6766a08f9b..2bd445c4d3b4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -962,8 +962,6 @@ static inline bool is_zone_device_page(const struct page *page)
 {
 	return page_zonenum(page) == ZONE_DEVICE;
 }
-extern void memmap_init_zone_device(struct zone *, unsigned long,
-				    unsigned long, struct dev_pagemap *);
 #else
 static inline bool is_zone_device_page(const struct page *page)
 {
@@ -2243,7 +2241,7 @@ static inline void zero_resv_unavail(void) {}
 
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
-		enum memmap_context, struct vmem_altmap *);
+		enum memmap_context, struct dev_pagemap *);
 extern void setup_per_zone_wmarks(void);
 extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 49f7bf91c25a..35a395d195c6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -719,7 +719,7 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
  * call, all affected pages are PG_reserved.
  */
 void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct dev_pagemap *pgmap)
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	int nid = pgdat->node_id;
@@ -744,7 +744,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 	 * are reserved so nobody should be touching them so we should be safe
 	 */
 	memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn,
-			MEMMAP_HOTPLUG, altmap);
+			 MEMMAP_HOTPLUG, pgmap);
 
 	set_zone_contiguous(zone);
 }
diff --git a/mm/memremap.c b/mm/memremap.c
index f6c17339cd0d..9ee23374e6da 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -308,20 +308,13 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 
 		zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
 		move_pfn_range_to_zone(zone, PHYS_PFN(res->start),
-				PHYS_PFN(resource_size(res)), restrictions.altmap);
+				PHYS_PFN(resource_size(res)), pgmap);
 	}
 
 	mem_hotplug_done();
 	if (error)
 		goto err_add_memory;
 
-	/*
-	 * Initialization of the pages has been deferred until now in order
-	 * to allow us to do the work while not holding the hotplug lock.
-	 */
-	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
-				PHYS_PFN(res->start),
-				PHYS_PFN(resource_size(res)), pgmap);
 	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap) - pfn_first(pgmap));
 	return __va(res->start);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5d62f1c2851..44038665fe8e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5845,7 +5845,7 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
  */
 void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		unsigned long start_pfn, enum memmap_context context,
-		struct vmem_altmap *altmap)
+		struct dev_pagemap *pgmap)
 {
 	unsigned long pfn, end_pfn = start_pfn + size;
 	struct page *page;
@@ -5853,24 +5853,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 	if (highest_memmap_pfn < end_pfn - 1)
 		highest_memmap_pfn = end_pfn - 1;
 
-#ifdef CONFIG_ZONE_DEVICE
-	/*
-	 * Honor reservation requested by the driver for this ZONE_DEVICE
-	 * memory. We limit the total number of pages to initialize to just
-	 * those that might contain the memory mapping. We will defer the
-	 * ZONE_DEVICE page initialization until after we have released
-	 * the hotplug lock.
-	 */
-	if (zone == ZONE_DEVICE) {
-		if (!altmap)
-			return;
-
-		if (start_pfn == altmap->base_pfn)
-			start_pfn += altmap->reserve;
-		end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
-	}
-#endif
-
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		/*
 		 * There can be holes in boot-time mem_map[]s handed to this
@@ -5892,6 +5874,20 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		if (context == MEMMAP_HOTPLUG)
 			__SetPageReserved(page);
 
+#ifdef CONFIG_ZONE_DEVICE
+		if (zone == ZONE_DEVICE) {
+			WARN_ON_ONCE(!pgmap);
+			/*
+			 * ZONE_DEVICE pages union ->lru with a ->pgmap back
+			 * pointer and zone_device_data. It is a bug if a
+			 * ZONE_DEVICE page is ever freed or placed on a driver
+			 * private list.
+			 */
+			page->pgmap = pgmap;
+			page->zone_device_data = NULL;
+		}
+#endif
+
 		/*
 		 * Mark the block movable so that blocks are reserved for
 		 * movable at startup. This will force kernel allocations
@@ -5951,14 +5947,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		 */
 		__SetPageReserved(page);
 
-		/*
-		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
-		 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
-		 * ever freed or placed on a driver-private list.
-		 */
-		page->pgmap = pgmap;
-		page->zone_device_data = NULL;
-
 		/*
 		 * Mark the block movable so that blocks are reserved for
 		 * movable at startup. This will force kernel allocations
-- 
2.21.0




* [PATCH v3 02/11] mm/memory_hotplug: Simplify shrink_pgdat_span()
  2019-08-29  7:00 [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
  2019-08-29  7:00 ` [PATCH v3 01/11] mm/memremap: Get rid of memmap_init_zone_device() David Hildenbrand
@ 2019-08-29  7:00 ` David Hildenbrand
  2019-08-29  7:00 ` [PATCH v3 03/11] mm/memory_hotplug: We always have a zone in find_(smallest|biggest)_section_pfn David Hildenbrand
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29  7:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang

Let's update it based on all the zones, which is much easier. No need to
iterate over all subsections again.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 72 ++++++++++-----------------------------------
 1 file changed, 15 insertions(+), 57 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 35a395d195c6..70cd6c59c168 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -436,67 +436,25 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 	zone_span_writeunlock(zone);
 }
 
-static void shrink_pgdat_span(struct pglist_data *pgdat,
-			      unsigned long start_pfn, unsigned long end_pfn)
+static void update_pgdat_span(struct pglist_data *pgdat)
 {
-	unsigned long pgdat_start_pfn = pgdat->node_start_pfn;
-	unsigned long p = pgdat_end_pfn(pgdat); /* pgdat_end_pfn namespace clash */
-	unsigned long pgdat_end_pfn = p;
-	unsigned long pfn;
-	int nid = pgdat->node_id;
-
-	if (pgdat_start_pfn == start_pfn) {
-		/*
-		 * If the section is smallest section in the pgdat, it need
-		 * shrink pgdat->node_start_pfn and pgdat->node_spanned_pages.
-		 * In this case, we find second smallest valid mem_section
-		 * for shrinking zone.
-		 */
-		pfn = find_smallest_section_pfn(nid, NULL, end_pfn,
-						pgdat_end_pfn);
-		if (pfn) {
-			pgdat->node_start_pfn = pfn;
-			pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
-		}
-	} else if (pgdat_end_pfn == end_pfn) {
-		/*
-		 * If the section is biggest section in the pgdat, it need
-		 * shrink pgdat->node_spanned_pages.
-		 * In this case, we find second biggest valid mem_section for
-		 * shrinking zone.
-		 */
-		pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn,
-					       start_pfn);
-		if (pfn)
-			pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
-	}
-
-	/*
-	 * If the section is not biggest or smallest mem_section in the pgdat,
-	 * it only creates a hole in the pgdat. So in this case, we need not
-	 * change the pgdat.
-	 * But perhaps, the pgdat has only hole data. Thus it check the pgdat
-	 * has only hole or not.
-	 */
-	pfn = pgdat_start_pfn;
-	for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SUBSECTION) {
-		if (unlikely(!pfn_valid(pfn)))
-			continue;
-
-		if (pfn_to_nid(pfn) != nid)
-			continue;
+	unsigned long node_start_pfn = 0, node_end_pfn = 0;
+	struct zone *zone;
 
-		/* Skip range to be removed */
-		if (pfn >= start_pfn && pfn < end_pfn)
-			continue;
+	for (zone = pgdat->node_zones;
+	     zone < pgdat->node_zones + MAX_NR_ZONES; zone++) {
+		unsigned long zone_end_pfn = zone->zone_start_pfn +
+					     zone->spanned_pages;
 
-		/* If we find valid section, we have nothing to do */
-		return;
+		/* No need to lock the zones, they can't change. */
+		if (zone_end_pfn > node_end_pfn)
+			node_end_pfn = zone_end_pfn;
+		if (zone->zone_start_pfn < node_start_pfn)
+			node_start_pfn = zone->zone_start_pfn;
 	}
 
-	/* The pgdat has no valid section */
-	pgdat->node_start_pfn = 0;
-	pgdat->node_spanned_pages = 0;
+	pgdat->node_start_pfn = node_start_pfn;
+	pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
 }
 
 static void __remove_zone(struct zone *zone, unsigned long start_pfn,
@@ -507,7 +465,7 @@ static void __remove_zone(struct zone *zone, unsigned long start_pfn,
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
 	shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
-	shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
+	update_pgdat_span(pgdat);
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 }
 
-- 
2.21.0




* [PATCH v3 03/11] mm/memory_hotplug: We always have a zone in find_(smallest|biggest)_section_pfn
  2019-08-29  7:00 [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
  2019-08-29  7:00 ` [PATCH v3 01/11] mm/memremap: Get rid of memmap_init_zone_device() David Hildenbrand
  2019-08-29  7:00 ` [PATCH v3 02/11] mm/memory_hotplug: Simplify shrink_pgdat_span() David Hildenbrand
@ 2019-08-29  7:00 ` David Hildenbrand
  2019-08-29  7:00 ` [PATCH v3 04/11] mm/memory_hotplug: Drop local variables in shrink_zone_span() David Hildenbrand
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29  7:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang

With shrink_pgdat_span() out of the way, we now always have a valid
zone.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 70cd6c59c168..0a21f6f99753 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -337,7 +337,7 @@ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
 		if (unlikely(pfn_to_nid(start_pfn) != nid))
 			continue;
 
-		if (zone && zone != page_zone(pfn_to_page(start_pfn)))
+		if (zone != page_zone(pfn_to_page(start_pfn)))
 			continue;
 
 		return start_pfn;
@@ -362,7 +362,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
 		if (unlikely(pfn_to_nid(pfn) != nid))
 			continue;
 
-		if (zone && zone != page_zone(pfn_to_page(pfn)))
+		if (zone != page_zone(pfn_to_page(pfn)))
 			continue;
 
 		return pfn;
-- 
2.21.0




* [PATCH v3 04/11] mm/memory_hotplug: Drop local variables in shrink_zone_span()
  2019-08-29  7:00 [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
                   ` (2 preceding siblings ...)
  2019-08-29  7:00 ` [PATCH v3 03/11] mm/memory_hotplug: We always have a zone in find_(smallest|biggest)_section_pfn David Hildenbrand
@ 2019-08-29  7:00 ` David Hildenbrand
  2019-08-29  7:00 ` [PATCH v3 05/11] mm/memory_hotplug: Optimize zone shrinking code when checking for holes David Hildenbrand
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29  7:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang

Especially when checking for "all holes", we can avoid rechecking already
processed pieces (that we are removing) - use the updated zone data
instead (possibly already zero).

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0a21f6f99753..d3c34bbeb36d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -374,14 +374,11 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
 static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 			     unsigned long end_pfn)
 {
-	unsigned long zone_start_pfn = zone->zone_start_pfn;
-	unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
-	unsigned long zone_end_pfn = z;
 	unsigned long pfn;
 	int nid = zone_to_nid(zone);
 
 	zone_span_writelock(zone);
-	if (zone_start_pfn == start_pfn) {
+	if (zone->zone_start_pfn == start_pfn) {
 		/*
 		 * If the section is smallest section in the zone, it need
 		 * shrink zone->zone_start_pfn and zone->zone_spanned_pages.
@@ -389,22 +386,29 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 		 * for shrinking zone.
 		 */
 		pfn = find_smallest_section_pfn(nid, zone, end_pfn,
-						zone_end_pfn);
+						zone_end_pfn(zone));
 		if (pfn) {
+			zone->spanned_pages = zone_end_pfn(zone) - pfn;
 			zone->zone_start_pfn = pfn;
-			zone->spanned_pages = zone_end_pfn - pfn;
+		} else {
+			zone->zone_start_pfn = 0;
+			zone->spanned_pages = 0;
 		}
-	} else if (zone_end_pfn == end_pfn) {
+	} else if (zone_end_pfn(zone) == end_pfn) {
 		/*
 		 * If the section is biggest section in the zone, it need
 		 * shrink zone->spanned_pages.
 		 * In this case, we find second biggest valid mem_section for
 		 * shrinking zone.
 		 */
-		pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
+		pfn = find_biggest_section_pfn(nid, zone, zone->zone_start_pfn,
 					       start_pfn);
 		if (pfn)
-			zone->spanned_pages = pfn - zone_start_pfn + 1;
+			zone->spanned_pages = pfn - zone->zone_start_pfn + 1;
+		else {
+			zone->zone_start_pfn = 0;
+			zone->spanned_pages = 0;
+		}
 	}
 
 	/*
@@ -413,8 +417,8 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 	 * change the zone. But perhaps, the zone has only hole data. Thus
 	 * it check the zone has only hole or not.
 	 */
-	pfn = zone_start_pfn;
-	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUBSECTION) {
+	for (pfn = zone->zone_start_pfn;
+	     pfn < zone_end_pfn(zone); pfn += PAGES_PER_SUBSECTION) {
 		if (unlikely(!pfn_valid(pfn)))
 			continue;
 
-- 
2.21.0




* [PATCH v3 05/11] mm/memory_hotplug: Optimize zone shrinking code when checking for holes
  2019-08-29  7:00 [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
                   ` (3 preceding siblings ...)
  2019-08-29  7:00 ` [PATCH v3 04/11] mm/memory_hotplug: Drop local variables in shrink_zone_span() David Hildenbrand
@ 2019-08-29  7:00 ` David Hildenbrand
  2019-08-29  7:00 ` [PATCH v3 06/11] mm/memory_hotplug: Fix crashes in shrink_zone_span() David Hildenbrand
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29  7:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang

... and clarify why this is needed at all right now. It all boils down
to false positives. We will try to remove the false positives for
!ZONE_DEVICE memory soon; however, for ZONE_DEVICE memory we won't be
able to easily get rid of them.

Don't only detect "all holes" but try to shrink using the existing
functions we have.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 45 +++++++++++++++++++++++----------------------
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d3c34bbeb36d..663853bf97ed 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -411,32 +411,33 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 		}
 	}
 
-	/*
-	 * The section is not biggest or smallest mem_section in the zone, it
-	 * only creates a hole in the zone. So in this case, we need not
-	 * change the zone. But perhaps, the zone has only hole data. Thus
-	 * it check the zone has only hole or not.
-	 */
-	for (pfn = zone->zone_start_pfn;
-	     pfn < zone_end_pfn(zone); pfn += PAGES_PER_SUBSECTION) {
-		if (unlikely(!pfn_valid(pfn)))
-			continue;
-
-		if (page_zone(pfn_to_page(pfn)) != zone)
-			continue;
-
-		/* Skip range to be removed */
-		if (pfn >= start_pfn && pfn < end_pfn)
-			continue;
-
-		/* If we find valid section, we have nothing to do */
+	if (!zone->spanned_pages) {
 		zone_span_writeunlock(zone);
 		return;
 	}
 
-	/* The zone has no valid section */
-	zone->zone_start_pfn = 0;
-	zone->spanned_pages = 0;
+	/*
+	 * Due to false positives in previous shrink attempts, it can happen
+	 * that we can shrink the zones further (possibly to zero). Once we
+	 * can reliably detect which PFNs actually belong to a zone
+	 * (especially for ZONE_DEVICE memory where we don't have online
+	 * sections), this can go.
+	 */
+	pfn = find_smallest_section_pfn(nid, zone, zone->zone_start_pfn,
+					zone_end_pfn(zone));
+	if (pfn) {
+		zone->spanned_pages = zone_end_pfn(zone) - pfn;
+		zone->zone_start_pfn = pfn;
+
+		pfn = find_biggest_section_pfn(nid, zone, zone->zone_start_pfn,
+					       zone_end_pfn(zone));
+		if (pfn)
+			zone->spanned_pages = pfn - zone->zone_start_pfn + 1;
+	}
+	if (!pfn) {
+		zone->zone_start_pfn = 0;
+		zone->spanned_pages = 0;
+	}
 	zone_span_writeunlock(zone);
 }
 
-- 
2.21.0




* [PATCH v3 06/11] mm/memory_hotplug: Fix crashes in shrink_zone_span()
  2019-08-29  7:00 [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
                   ` (4 preceding siblings ...)
  2019-08-29  7:00 ` [PATCH v3 05/11] mm/memory_hotplug: Optimize zone shrinking code when checking for holes David Hildenbrand
@ 2019-08-29  7:00 ` David Hildenbrand
  2019-08-29  7:00 ` [PATCH v3 07/11] mm/memory_hotplug: Exit early in __remove_pages() on BUGs David Hildenbrand
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29  7:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang,
	Aneesh Kumar K . V

We can currently crash in shrink_zone_span() in case we access an
uninitialized memmap (via page_to_nid()). The root issue is that we cannot
always identify which memmap was actually initialized.

Let's improve the situation by looking only at online PFNs for
!ZONE_DEVICE memory. This is now very reliable - similar to
set_zone_contiguous(). (Side note: set_zone_contiguous() will never
succeed on ZONE_DEVICE memory right now as we have no online PFNs ...).

For ZONE_DEVICE memory, make sure we don't crash by special-casing
poisoned pages and always checking that the NID has a sane value. We
might still read garbage and get false positives, but it certainly
improves the situation.

Note: Especially subsections make it very hard to detect which parts of
a ZONE_DEVICE memmap were actually initialized - otherwise we could just
have reused SECTION_IS_ONLINE. This needs more thought.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 663853bf97ed..65b3fdf7f838 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -334,6 +334,17 @@ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
 		if (unlikely(!pfn_valid(start_pfn)))
 			continue;
 
+		/*
+		 * TODO: There is no way we can identify whether the memmap
+		 * of ZONE_DEVICE memory was initialized. We might get
+		 * false positives when reading garbage.
+		 */
+		if (zone_idx(zone) == ZONE_DEVICE) {
+			if (PagePoisoned(pfn_to_page(start_pfn)))
+				continue;
+		} else if (!pfn_to_online_page(start_pfn))
+			continue;
+
 		if (unlikely(pfn_to_nid(start_pfn) != nid))
 			continue;
 
@@ -359,6 +370,17 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
 		if (unlikely(!pfn_valid(pfn)))
 			continue;
 
+		/*
+		 * TODO: There is no way we can identify whether the memmap
+		 * of ZONE_DEVICE memory was initialized. We might get
+		 * false positives when reading garbage.
+		 */
+		if (zone_idx(zone) == ZONE_DEVICE) {
+			if (PagePoisoned(pfn_to_page(pfn)))
+				continue;
+		} else if (!pfn_to_online_page(pfn))
+			continue;
+
 		if (unlikely(pfn_to_nid(pfn) != nid))
 			continue;
 
-- 
2.21.0




* [PATCH v3 07/11] mm/memory_hotplug: Exit early in __remove_pages() on BUGs
  2019-08-29  7:00 [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
                   ` (5 preceding siblings ...)
  2019-08-29  7:00 ` [PATCH v3 06/11] mm/memory_hotplug: Fix crashes in shrink_zone_span() David Hildenbrand
@ 2019-08-29  7:00 ` David Hildenbrand
  2019-08-29  7:00 ` [PATCH v3 08/11] mm: Exit early in set_zone_contiguous() if already contiguous David Hildenbrand
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29  7:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang

The error path should never happen in practice (unless bringing up a new
device driver, or on BUGs). However, it's clearer to not touch anything
in case we are going to return right away. Move the check/return.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 65b3fdf7f838..56eabd22cbae 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -527,13 +527,13 @@ void __remove_pages(struct zone *zone, unsigned long pfn,
 	unsigned long map_offset = 0;
 	unsigned long nr, start_sec, end_sec;
 
+	if (check_pfn_span(pfn, nr_pages, "remove"))
+		return;
+
 	map_offset = vmem_altmap_offset(altmap);
 
 	clear_zone_contiguous(zone);
 
-	if (check_pfn_span(pfn, nr_pages, "remove"))
-		return;
-
 	start_sec = pfn_to_section_nr(pfn);
 	end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
 	for (nr = start_sec; nr <= end_sec; nr++) {
-- 
2.21.0




* [PATCH v3 08/11] mm: Exit early in set_zone_contiguous() if already contiguous
  2019-08-29  7:00 [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
                   ` (6 preceding siblings ...)
  2019-08-29  7:00 ` [PATCH v3 07/11] mm/memory_hotplug: Exit early in __remove_pages() on BUGs David Hildenbrand
@ 2019-08-29  7:00 ` David Hildenbrand
  2019-08-29  7:00 ` [PATCH v3 09/11] mm/memory_hotplug: Remove pages from a zone before removing memory David Hildenbrand
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29  7:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Michal Hocko,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Mel Gorman,
	Mike Rapoport, Dan Williams, Alexander Duyck

No need to recompute in case the zone is already marked contiguous.
We will soon exploit this on the memory removal path, where we will only
clear zone->contiguous on zones that intersect with the memory to be
removed.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/page_alloc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 44038665fe8e..a9935dc19f5b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1546,6 +1546,9 @@ void set_zone_contiguous(struct zone *zone)
 	unsigned long block_start_pfn = zone->zone_start_pfn;
 	unsigned long block_end_pfn;
 
+	if (zone->contiguous)
+		return;
+
 	block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
 	for (; block_start_pfn < zone_end_pfn(zone);
 			block_start_pfn = block_end_pfn,
-- 
2.21.0




* [PATCH v3 09/11] mm/memory_hotplug: Remove pages from a zone before removing memory
  2019-08-29  7:00 [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
                   ` (7 preceding siblings ...)
  2019-08-29  7:00 ` [PATCH v3 08/11] mm: Exit early in set_zone_contiguous() if already contiguous David Hildenbrand
@ 2019-08-29  7:00 ` David Hildenbrand
  2019-08-29  7:00 ` [PATCH v3 10/11] mm/memory_hotplug: Remove zone parameter from __remove_pages() David Hildenbrand
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29  7:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang, Qian Cai,
	Jason Gunthorpe, Logan Gunthorpe, Ira Weiny

In case of !ZONE_DEVICE memory, remove the memory from its zone when
offlining and when onlining failed (there we only have a single zone). Do
the same with ZONE_DEVICE memory before removing the memory. Introduce
remove_pfn_range_from_zone().

This fixes a whole bunch of BUGs we have in our code right now when
removing memory in the following cases:
- Single memory blocks that fall into no zone (never onlined)
- Single memory blocks that fall into multiple zones (offlined+re-onlined)
- Multiple memory blocks that fall into different zones
Right now, the zones don't get updated properly in these cases, and we
can trigger kernel BUGs when removing memory that was never onlined:

:/# [   23.912993] BUG: unable to handle page fault for address: 000000000000353d
[   23.914219] #PF: supervisor write access in kernel mode
[   23.915199] #PF: error_code(0x0002) - not-present page
[   23.916160] PGD 0 P4D 0
[   23.916627] Oops: 0002 [#1] SMP PTI
[   23.917256] CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
[   23.918900] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
[   23.921194] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   23.922249] RIP: 0010:clear_zone_contiguous+0x5/0x10
[   23.923173] Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
[   23.926876] RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
[   23.927928] RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
[   23.929458] RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
[   23.930899] RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
[   23.932362] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
[   23.933603] R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
[   23.934913] FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
[   23.936294] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   23.937481] CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
[   23.938687] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   23.939889] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   23.941168] Call Trace:
[   23.941580]  __remove_pages+0x4b/0x640
[   23.942303]  ? mark_held_locks+0x49/0x70
[   23.943149]  arch_remove_memory+0x63/0x8d
[   23.943921]  try_remove_memory+0xdb/0x130
[   23.944766]  ? walk_memory_blocks+0x7f/0x9e
[   23.945616]  __remove_memory+0xa/0x11
[   23.946274]  acpi_memory_device_remove+0x70/0x100
[   23.947308]  acpi_bus_trim+0x55/0x90
[   23.947914]  acpi_device_hotplug+0x227/0x3a0
[   23.948714]  acpi_hotplug_work_fn+0x1a/0x30
[   23.949433]  process_one_work+0x221/0x550
[   23.950190]  worker_thread+0x50/0x3b0
[   23.950993]  kthread+0x105/0x140
[   23.951644]  ? process_one_work+0x550/0x550
[   23.952508]  ? kthread_park+0x80/0x80
[   23.953367]  ret_from_fork+0x3a/0x50
[   23.954025] Modules linked in:
[   23.954613] CR2: 000000000000353d
[   23.955248] ---[ end trace 93d982b1fb3e1a69 ]---

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/memory_hotplug.h |  3 +++
 mm/memory_hotplug.c            | 16 +++++++++-------
 mm/memremap.c                  |  7 ++++---
 3 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 235530cdface..f27559f11b64 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -345,6 +345,9 @@ extern int add_memory(int nid, u64 start, u64 size);
 extern int add_memory_resource(int nid, struct resource *resource);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct dev_pagemap *pgmap);
+extern void remove_pfn_range_from_zone(struct zone *zone,
+				       unsigned long start_pfn,
+				       unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern int sparse_add_section(int nid, unsigned long pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 56eabd22cbae..75859a57ecda 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -484,16 +484,21 @@ static void update_pgdat_span(struct pglist_data *pgdat)
 	pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
 }
 
-static void __remove_zone(struct zone *zone, unsigned long start_pfn,
-		unsigned long nr_pages)
+void __ref remove_pfn_range_from_zone(struct zone *zone,
+				      unsigned long start_pfn,
+				      unsigned long nr_pages)
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	unsigned long flags;
 
+	clear_zone_contiguous(zone);
+
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
 	shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
 	update_pgdat_span(pgdat);
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
+
+	set_zone_contiguous(zone);
 }
 
 static void __remove_section(struct zone *zone, unsigned long pfn,
@@ -505,7 +510,6 @@ static void __remove_section(struct zone *zone, unsigned long pfn,
 	if (WARN_ON_ONCE(!valid_section(ms)))
 		return;
 
-	__remove_zone(zone, pfn, nr_pages);
 	sparse_remove_section(ms, pfn, nr_pages, map_offset, altmap);
 }
 
@@ -532,8 +536,6 @@ void __remove_pages(struct zone *zone, unsigned long pfn,
 
 	map_offset = vmem_altmap_offset(altmap);
 
-	clear_zone_contiguous(zone);
-
 	start_sec = pfn_to_section_nr(pfn);
 	end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
 	for (nr = start_sec; nr <= end_sec; nr++) {
@@ -547,8 +549,6 @@ void __remove_pages(struct zone *zone, unsigned long pfn,
 		nr_pages -= pfns;
 		map_offset = 0;
 	}
-
-	set_zone_contiguous(zone);
 }
 
 int set_online_page_callback(online_page_callback_t callback)
@@ -875,6 +875,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	pr_debug("online_pages [mem %#010llx-%#010llx] failed\n",
 		 (unsigned long long) pfn << PAGE_SHIFT,
 		 (((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
+	remove_pfn_range_from_zone(zone, pfn, nr_pages);
 	memory_notify(MEM_CANCEL_ONLINE, &arg);
 	mem_hotplug_done();
 	return ret;
@@ -1586,6 +1587,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	/* removal success */
+	remove_pfn_range_from_zone(zone, start_pfn, nr_pages);
 	adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
 	zone->present_pages -= offlined_pages;
 
diff --git a/mm/memremap.c b/mm/memremap.c
index 9ee23374e6da..7c662643a0aa 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -125,7 +125,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
 void memunmap_pages(struct dev_pagemap *pgmap)
 {
 	struct resource *res = &pgmap->res;
-	unsigned long pfn;
+	unsigned long pfn = PHYS_PFN(res->start);
 	int nid;
 
 	dev_pagemap_kill(pgmap);
@@ -134,11 +134,12 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 	dev_pagemap_cleanup(pgmap);
 
 	/* pages are dead and unused, undo the arch mapping */
-	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
+	nid = page_to_nid(pfn_to_page(pfn));
 
 	mem_hotplug_begin();
+	remove_pfn_range_from_zone(page_zone(pfn_to_page(pfn)), pfn,
+				   PHYS_PFN(resource_size(res)));
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-		pfn = PHYS_PFN(res->start);
 		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
 				 PHYS_PFN(resource_size(res)), NULL);
 	} else {
-- 
2.21.0




* [PATCH v3 10/11] mm/memory_hotplug: Remove zone parameter from __remove_pages()
  2019-08-29  7:00 [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
                   ` (8 preceding siblings ...)
  2019-08-29  7:00 ` [PATCH v3 09/11] mm/memory_hotplug: Remove pages from a zone before removing memory David Hildenbrand
@ 2019-08-29  7:00 ` David Hildenbrand
  2019-08-29  7:00 ` [PATCH v3 11/11] mm/memory_hotplug: Cleanup __remove_pages() David Hildenbrand
  2019-08-29  8:23 ` [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory Michal Hocko
  11 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29  7:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Catalin Marinas, Will Deacon,
	Tony Luck, Fenghua Yu, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Yoshinori Sato, Rich Felker, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Andrew Morton,
	Mark Rutland, Steve Capper, Mike Rapoport, Anshuman Khandual,
	Yu Zhao, Jun Yao, Robin Murphy, Michal Hocko, Oscar Salvador,
	Matthew Wilcox (Oracle),
	Christophe Leroy, Aneesh Kumar K.V, Pavel Tatashin,
	Gerald Schaefer, Halil Pasic, Tom Lendacky, Greg Kroah-Hartman,
	Masahiro Yamada, Dan Williams, Wei Yang, Qian Cai,
	Jason Gunthorpe, Logan Gunthorpe, Ira Weiny, linux-arm-kernel,
	linux-ia64, linuxppc-dev, linux-s390, linux-sh

No longer in use, let's drop it. Also drop it from __remove_section().

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Jun Yao <yaojun8558363@gmail.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-ia64@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/arm64/mm/mmu.c            |  4 +---
 arch/ia64/mm/init.c            |  4 +---
 arch/powerpc/mm/mem.c          |  3 +--
 arch/s390/mm/init.c            |  4 +---
 arch/sh/mm/init.c              |  4 +---
 arch/x86/mm/init_32.c          |  4 +---
 arch/x86/mm/init_64.c          |  4 +---
 include/linux/memory_hotplug.h |  4 ++--
 mm/memory_hotplug.c            | 13 ++++++-------
 mm/memremap.c                  |  3 +--
 10 files changed, 16 insertions(+), 31 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 60c929f3683b..d10247fab0fd 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1069,7 +1069,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 
 	/*
 	 * FIXME: Cleanup page tables (also in arch_add_memory() in case
@@ -1078,7 +1077,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 	 * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
 	 * unlocked yet.
 	 */
-	zone = page_zone(pfn_to_page(start_pfn));
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 }
 #endif
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index bf9df2625bc8..a6dd80a2c939 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -689,9 +689,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 }
 #endif
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 69f99128a8d6..f3a5e397b911 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -130,10 +130,9 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap);
 	int ret;
 
-	__remove_pages(page_zone(page), start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 
 	/* Remove htab bolted mappings for this section of memory */
 	start = (unsigned long)__va(start);
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 20340a03ad90..6f13eb66e375 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -296,10 +296,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 	vmem_remove_mapping(start, size);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index dfdbaa50946e..d1b1ff2be17a 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -434,9 +434,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 4068abb9427f..9d036be27aaa 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -865,10 +865,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 }
 #endif
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a6b5c653727b..b8541d77452c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1212,10 +1212,8 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap);
-	struct zone *zone = page_zone(page);
 
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 	kernel_physical_mapping_remove(start, start + size);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index f27559f11b64..85040ba5180c 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -125,8 +125,8 @@ static inline bool movable_node_is_enabled(void)
 
 extern void arch_remove_memory(int nid, u64 start, u64 size,
 			       struct vmem_altmap *altmap);
-extern void __remove_pages(struct zone *zone, unsigned long start_pfn,
-			   unsigned long nr_pages, struct vmem_altmap *altmap);
+extern void __remove_pages(unsigned long start_pfn, unsigned long nr_pages,
+			   struct vmem_altmap *altmap);
 
 /* reasonably generic interface to expand the physical pages */
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 75859a57ecda..fe29c637c0a8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -501,9 +501,9 @@ void __ref remove_pfn_range_from_zone(struct zone *zone,
 	set_zone_contiguous(zone);
 }
 
-static void __remove_section(struct zone *zone, unsigned long pfn,
-		unsigned long nr_pages, unsigned long map_offset,
-		struct vmem_altmap *altmap)
+static void __remove_section(unsigned long pfn, unsigned long nr_pages,
+			     unsigned long map_offset,
+			     struct vmem_altmap *altmap)
 {
 	struct mem_section *ms = __nr_to_section(pfn_to_section_nr(pfn));
 
@@ -515,7 +515,6 @@ static void __remove_section(struct zone *zone, unsigned long pfn,
 
 /**
  * __remove_pages() - remove sections of pages from a zone
- * @zone: zone from which pages need to be removed
  * @pfn: starting pageframe (must be aligned to start of a section)
  * @nr_pages: number of pages to remove (must be multiple of section size)
  * @altmap: alternative device page map or %NULL if default memmap is used
@@ -525,8 +524,8 @@ static void __remove_section(struct zone *zone, unsigned long pfn,
  * sure that pages are marked reserved and zones are adjust properly by
  * calling offline_pages().
  */
-void __remove_pages(struct zone *zone, unsigned long pfn,
-		    unsigned long nr_pages, struct vmem_altmap *altmap)
+void __remove_pages(unsigned long pfn, unsigned long nr_pages,
+		    struct vmem_altmap *altmap)
 {
 	unsigned long map_offset = 0;
 	unsigned long nr, start_sec, end_sec;
@@ -544,7 +543,7 @@ void __remove_pages(struct zone *zone, unsigned long pfn,
 		cond_resched();
 		pfns = min(nr_pages, PAGES_PER_SECTION
 				- (pfn & ~PAGE_SECTION_MASK));
-		__remove_section(zone, pfn, pfns, map_offset, altmap);
+		__remove_section(pfn, pfns, map_offset, altmap);
 		pfn += pfns;
 		nr_pages -= pfns;
 		map_offset = 0;
diff --git a/mm/memremap.c b/mm/memremap.c
index 7c662643a0aa..99e38cfcec95 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -140,8 +140,7 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 	remove_pfn_range_from_zone(page_zone(pfn_to_page(pfn)), pfn,
 				   PHYS_PFN(resource_size(res)));
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
-				 PHYS_PFN(resource_size(res)), NULL);
+		__remove_pages(pfn, PHYS_PFN(resource_size(res)), NULL);
 	} else {
 		arch_remove_memory(nid, res->start, resource_size(res),
 				pgmap_altmap(pgmap));
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 11/11] mm/memory_hotplug: Cleanup __remove_pages()
  2019-08-29  7:00 [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
                   ` (9 preceding siblings ...)
  2019-08-29  7:00 ` [PATCH v3 10/11] mm/memory_hotplug: Remove zone parameter from __remove_pages() David Hildenbrand
@ 2019-08-29  7:00 ` David Hildenbrand
  2019-08-29  8:23 ` [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory Michal Hocko
  11 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29  7:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang

Let's drop the basically unused section stuff and simplify.

Also, let's use a shorter variant to calculate the number of pages to
the next section boundary.
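
For reference, that boundary calculation can be sanity-checked in isolation.
The following stand-alone user-space sketch is not part of the patch;
PAGES_PER_SECTION is set to an example value (the x86-64 default) rather than
taken from any particular config:

#include <stdio.h>

#define PAGES_PER_SECTION	(1UL << 15)	/* example value only */
#define PAGE_SECTION_MASK	(~(PAGES_PER_SECTION - 1))

int main(void)
{
	unsigned long pfn = 0x8123;	/* somewhere inside a section */
	unsigned long end_pfn = 0x20000;

	/* -(pfn | PAGE_SECTION_MASK) == PAGES_PER_SECTION - (pfn & ~PAGE_SECTION_MASK) */
	unsigned long to_boundary = -(pfn | PAGE_SECTION_MASK);
	unsigned long cur_nr_pages = end_pfn - pfn < to_boundary ?
				     end_pfn - pfn : to_boundary;

	printf("pages to next section boundary: %lu\n", to_boundary);	/* 32477 */
	printf("pages handled this iteration:   %lu\n", cur_nr_pages);	/* 32477 */
	return 0;
}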

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index fe29c637c0a8..da56cb57a8aa 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -527,25 +527,20 @@ static void __remove_section(unsigned long pfn, unsigned long nr_pages,
 void __remove_pages(unsigned long pfn, unsigned long nr_pages,
 		    struct vmem_altmap *altmap)
 {
+	const unsigned long end_pfn = pfn + nr_pages;
+	unsigned long cur_nr_pages;
 	unsigned long map_offset = 0;
-	unsigned long nr, start_sec, end_sec;
 
 	if (check_pfn_span(pfn, nr_pages, "remove"))
 		return;
 
 	map_offset = vmem_altmap_offset(altmap);
 
-	start_sec = pfn_to_section_nr(pfn);
-	end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
-	for (nr = start_sec; nr <= end_sec; nr++) {
-		unsigned long pfns;
-
+	for (; pfn < end_pfn; pfn += cur_nr_pages) {
 		cond_resched();
-		pfns = min(nr_pages, PAGES_PER_SECTION
-				- (pfn & ~PAGE_SECTION_MASK));
-		__remove_section(pfn, pfns, map_offset, altmap);
-		pfn += pfns;
-		nr_pages -= pfns;
+		/* Select all remaining pages up to the next section boundary */
+		cur_nr_pages = min(end_pfn - pfn, -(pfn | PAGE_SECTION_MASK));
+		__remove_section(pfn, cur_nr_pages, map_offset, altmap);
 		map_offset = 0;
 	}
 }
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory
  2019-08-29  7:00 [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
                   ` (10 preceding siblings ...)
  2019-08-29  7:00 ` [PATCH v3 11/11] mm/memory_hotplug: Cleanup __remove_pages() David Hildenbrand
@ 2019-08-29  8:23 ` Michal Hocko
  2019-08-29 11:33   ` David Hildenbrand
  11 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2019-08-29  8:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Andrew Morton,
	Dan Williams, Alexander Duyck, Alexander Potapenko,
	Andrey Konovalov, Andy Lutomirski, Anshuman Khandual, Arun KS,
	Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
	Christian Borntraeger, Christophe Leroy, Dave Airlie,
	Dave Hansen, Fenghua Yu, Gerald Schaefer, Greg Kroah-Hartman,
	Halil Pasic, Heiko Carstens, H. Peter Anvin, Ingo Molnar,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Jun Yao,
	Kirill A. Shutemov, Logan Gunthorpe, Mark Rutland,
	Masahiro Yamada, Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Mike Rapoport, Mike Rapoport,
	Oscar Salvador, Oscar Salvador, Paul Mackerras, Pavel Tatashin,
	Pavel Tatashin, Peter Zijlstra, Qian Cai, Rich Felker,
	Robin Murphy, Souptick Joarder, Stephen Rothwell, Steve Capper,
	Thomas Gleixner, Tom Lendacky, Tony Luck, Vasily Gorbik,
	Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon, Yang Shi,
	Yoshinori Sato, Yu Zhao

On Thu 29-08-19 09:00:08, David Hildenbrand wrote:
> This is the successor of "[PATCH v2 0/6] mm/memory_hotplug: Consider all
> zones when removing memory". I decided to go one step further and finally
> factor out the shrinking of zones from memory removal code. Zones are now
> fixed up when offlining memory/onlining of memory fails/before removing
> ZONE_DEVICE memory.

I was about to say Yay! but then reading...

> Example:
> 
> :/# cat /proc/zoneinfo
> Node 1, zone  Movable
>         spanned  0
>         present  0
>         managed  0
> :/# echo "online_movable" > /sys/devices/system/memory/memory41/state 
> :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
> :/# cat /proc/zoneinfo
> Node 1, zone  Movable
>         spanned  98304
>         present  65536
>         managed  65536
> :/# echo 0 > /sys/devices/system/memory/memory43/online
> :/# cat /proc/zoneinfo
> Node 1, zone  Movable
>         spanned  32768
>         present  32768
>         managed  32768
> :/# echo 0 > /sys/devices/system/memory/memory41/online
> :/# cat /proc/zoneinfo
> Node 1, zone  Movable
>         spanned  0
>         present  0
>         managed  0

... this made me realize that you are trying to fix it instead. Could
you explain why do we want to do that? Why don't we simply remove all
that crap? Why do we even care about zone boundaries when offlining or
removing memory? Zone shrinking was mostly necessary with the previous
onlining semantic when the zone type could be only changed on the
boundary or unassociated memory. We can interleave memory zones now
arbitrarily.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory
  2019-08-29  8:23 ` [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory Michal Hocko
@ 2019-08-29 11:33   ` David Hildenbrand
  2019-08-29 11:43     ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29 11:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Andrew Morton,
	Dan Williams, Alexander Duyck, Alexander Potapenko,
	Andrey Konovalov, Andy Lutomirski, Anshuman Khandual, Arun KS,
	Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
	Christian Borntraeger, Christophe Leroy, Dave Airlie,
	Dave Hansen, Fenghua Yu, Gerald Schaefer, Greg Kroah-Hartman,
	Halil Pasic, Heiko Carstens, H. Peter Anvin, Ingo Molnar,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Jun Yao,
	Kirill A. Shutemov, Logan Gunthorpe, Mark Rutland,
	Masahiro Yamada, Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Mike Rapoport, Mike Rapoport,
	Oscar Salvador, Oscar Salvador, Paul Mackerras, Pavel Tatashin,
	Pavel Tatashin, Peter Zijlstra, Qian Cai, Rich Felker,
	Robin Murphy, Souptick Joarder, Stephen Rothwell, Steve Capper,
	Thomas Gleixner, Tom Lendacky, Tony Luck, Vasily Gorbik,
	Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon, Yang Shi,
	Yoshinori Sato, Yu Zhao

On 29.08.19 10:23, Michal Hocko wrote:
> On Thu 29-08-19 09:00:08, David Hildenbrand wrote:
>> This is the successor of "[PATCH v2 0/6] mm/memory_hotplug: Consider all
>> zones when removing memory". I decided to go one step further and finally
>> factor out the shrinking of zones from memory removal code. Zones are now
>> fixed up when offlining memory/onlining of memory fails/before removing
>> ZONE_DEVICE memory.
> 
> I was about to say Yay! but then reading...

Almost ;)

> 
>> Example:
>>
>> :/# cat /proc/zoneinfo
>> Node 1, zone  Movable
>>         spanned  0
>>         present  0
>>         managed  0
>> :/# echo "online_movable" > /sys/devices/system/memory/memory41/state 
>> :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
>> :/# cat /proc/zoneinfo
>> Node 1, zone  Movable
>>         spanned  98304
>>         present  65536
>>         managed  65536
>> :/# echo 0 > /sys/devices/system/memory/memory43/online
>> :/# cat /proc/zoneinfo
>> Node 1, zone  Movable
>>         spanned  32768
>>         present  32768
>>         managed  32768
>> :/# echo 0 > /sys/devices/system/memory/memory41/online
>> :/# cat /proc/zoneinfo
>> Node 1, zone  Movable
>>         spanned  0
>>         present  0
>>         managed  0
> 
> ... this made me realize that you are trying to fix it instead. Could
> you explain why do we want to do that? Why don't we simply remove all
> that crap? Why do we even care about zone boundaries when offlining or
> removing memory? Zone shrinking was mostly necessary with the previous
> onlining semantic when the zone type could be only changed on the
> boundary or unassociated memory. We can interleave memory zones now
> arbitrarily.

Last time I asked whether we can just drop all that nasty
zone->contiguous handling I was being told that it does have a
significant performance impact and is here to stay. The boundaries are a
key component to detect whether a zone is contiguous.

So yes, while we allow interleaved memory zones, having contiguous zones
is beneficial for performance. That's why also memory onlining code will
try to online memory as default to the zone that will keep/make zones
contiguous.
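
To illustrate, a minimal user-space sketch of that idea - a zone counts as
contiguous only if every pageblock between its boundaries actually belongs to
it, so the boundaries directly bound the walk. None of the names below are the
kernel's helpers, and the hole is made up:

#include <stdbool.h>
#include <stdio.h>

#define TOY_PAGEBLOCK_NR_PAGES	512UL	/* example pageblock size */

/* pretend pfns 0x18000..0x1ffff are a hole or belong to another zone */
static bool toy_pfn_in_zone(unsigned long pfn)
{
	return pfn < 0x18000 || pfn >= 0x20000;
}

static bool toy_zone_contiguous(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn += TOY_PAGEBLOCK_NR_PAGES)
		if (!toy_pfn_in_zone(pfn))
			return false;
	return true;
}

int main(void)
{
	/* narrow span: contiguous; wide span: the hole falls inside it */
	printf("[0x10000, 0x18000): %d\n", toy_zone_contiguous(0x10000, 0x18000));
	printf("[0x10000, 0x28000): %d\n", toy_zone_contiguous(0x10000, 0x28000));
	return 0;
}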

Anyhow, I think with this series most of the zone shrinking code becomes
"digestible". Except minor issues with ZONE_DEVICE - which is acceptable.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory
  2019-08-29 11:33   ` David Hildenbrand
@ 2019-08-29 11:43     ` David Hildenbrand
  2019-08-29 12:08       ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29 11:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Andrew Morton,
	Dan Williams, Alexander Duyck, Alexander Potapenko,
	Andrey Konovalov, Andy Lutomirski, Anshuman Khandual, Arun KS,
	Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
	Christian Borntraeger, Christophe Leroy, Dave Airlie,
	Dave Hansen, Fenghua Yu, Gerald Schaefer, Greg Kroah-Hartman,
	Halil Pasic, Heiko Carstens, H. Peter Anvin, Ingo Molnar,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Jun Yao,
	Kirill A. Shutemov, Logan Gunthorpe, Mark Rutland,
	Masahiro Yamada, Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Mike Rapoport, Mike Rapoport,
	Oscar Salvador, Oscar Salvador, Paul Mackerras, Pavel Tatashin,
	Pavel Tatashin, Peter Zijlstra, Qian Cai, Rich Felker,
	Robin Murphy, Souptick Joarder, Stephen Rothwell, Steve Capper,
	Thomas Gleixner, Tom Lendacky, Tony Luck, Vasily Gorbik,
	Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon, Yang Shi,
	Yoshinori Sato, Yu Zhao

On 29.08.19 13:33, David Hildenbrand wrote:
> On 29.08.19 10:23, Michal Hocko wrote:
>> On Thu 29-08-19 09:00:08, David Hildenbrand wrote:
>>> This is the successor of "[PATCH v2 0/6] mm/memory_hotplug: Consider all
>>> zones when removing memory". I decided to go one step further and finally
>>> factor out the shrinking of zones from memory removal code. Zones are now
>>> fixed up when offlining memory/onlining of memory fails/before removing
>>> ZONE_DEVICE memory.
>>
>> I was about to say Yay! but then reading...
> 
> Almost ;)
> 
>>
>>> Example:
>>>
>>> :/# cat /proc/zoneinfo
>>> Node 1, zone  Movable
>>>         spanned  0
>>>         present  0
>>>         managed  0
>>> :/# echo "online_movable" > /sys/devices/system/memory/memory41/state 
>>> :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
>>> :/# cat /proc/zoneinfo
>>> Node 1, zone  Movable
>>>         spanned  98304
>>>         present  65536
>>>         managed  65536
>>> :/# echo 0 > /sys/devices/system/memory/memory43/online
>>> :/# cat /proc/zoneinfo
>>> Node 1, zone  Movable
>>>         spanned  32768
>>>         present  32768
>>>         managed  32768
>>> :/# echo 0 > /sys/devices/system/memory/memory41/online
>>> :/# cat /proc/zoneinfo
>>> Node 1, zone  Movable
>>>         spanned  0
>>>         present  0
>>>         managed  0
>>
>> ... this made me realize that you are trying to fix it instead. Could
>> you explain why do we want to do that? Why don't we simply remove all
>> that crap? Why do we even care about zone boundaries when offlining or
>> removing memory? Zone shrinking was mostly necessary with the previous
>> onlining semantic when the zone type could be only changed on the
>> boundary or unassociated memory. We can interleave memory zones now
>> arbitrarily.
> 
> Last time I asked whether we can just drop all that nasty
> zone->contiguous handling I was being told that it does have a
> significant performance impact and is here to stay. The boundaries are a
> key component to detect whether a zone is contiguous.
> 
> So yes, while we allow interleaved memory zones, having contiguous zones
> is beneficial for performance. That's why also memory onlining code will
> try to online memory as default to the zone that will keep/make zones
> contiguous.
> 
> Anyhow, I think with this series most of the zone shrinking code becomes
> "digestible". Except minor issues with ZONE_DEVICE - which is acceptable.
> 

Also, there are plenty of other users of
node_spanned_pages/zone_spanned_pages etc.. I don't think this can go -
not that easy :)

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory
  2019-08-29 11:43     ` David Hildenbrand
@ 2019-08-29 12:08       ` David Hildenbrand
  2019-08-29 12:15         ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29 12:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Andrew Morton,
	Dan Williams, Alexander Duyck, Alexander Potapenko,
	Andrey Konovalov, Andy Lutomirski, Anshuman Khandual, Arun KS,
	Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
	Christian Borntraeger, Christophe Leroy, Dave Airlie,
	Dave Hansen, Fenghua Yu, Gerald Schaefer, Greg Kroah-Hartman,
	Halil Pasic, Heiko Carstens, H. Peter Anvin, Ingo Molnar,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Jun Yao,
	Kirill A. Shutemov, Logan Gunthorpe, Mark Rutland,
	Masahiro Yamada, Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Mike Rapoport, Mike Rapoport,
	Oscar Salvador, Oscar Salvador, Paul Mackerras, Pavel Tatashin,
	Pavel Tatashin, Peter Zijlstra, Qian Cai, Rich Felker,
	Robin Murphy, Souptick Joarder, Stephen Rothwell, Steve Capper,
	Thomas Gleixner, Tom Lendacky, Tony Luck, Vasily Gorbik,
	Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon, Yang Shi,
	Yoshinori Sato, Yu Zhao

On 29.08.19 13:43, David Hildenbrand wrote:
> On 29.08.19 13:33, David Hildenbrand wrote:
>> On 29.08.19 10:23, Michal Hocko wrote:
>>> On Thu 29-08-19 09:00:08, David Hildenbrand wrote:
>>>> This is the successor of "[PATCH v2 0/6] mm/memory_hotplug: Consider all
>>>> zones when removing memory". I decided to go one step further and finally
>>>> factor out the shrinking of zones from memory removal code. Zones are now
>>>> fixed up when offlining memory/onlining of memory fails/before removing
>>>> ZONE_DEVICE memory.
>>>
>>> I was about to say Yay! but then reading...
>>
>> Almost ;)
>>
>>>
>>>> Example:
>>>>
>>>> :/# cat /proc/zoneinfo
>>>> Node 1, zone  Movable
>>>>         spanned  0
>>>>         present  0
>>>>         managed  0
>>>> :/# echo "online_movable" > /sys/devices/system/memory/memory41/state 
>>>> :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
>>>> :/# cat /proc/zoneinfo
>>>> Node 1, zone  Movable
>>>>         spanned  98304
>>>>         present  65536
>>>>         managed  65536
>>>> :/# echo 0 > /sys/devices/system/memory/memory43/online
>>>> :/# cat /proc/zoneinfo
>>>> Node 1, zone  Movable
>>>>         spanned  32768
>>>>         present  32768
>>>>         managed  32768
>>>> :/# echo 0 > /sys/devices/system/memory/memory41/online
>>>> :/# cat /proc/zoneinfo
>>>> Node 1, zone  Movable
>>>>         spanned  0
>>>>         present  0
>>>>         managed  0
>>>
>>> ... this made me realize that you are trying to fix it instead. Could
>>> you explain why do we want to do that? Why don't we simply remove all
>>> that crap? Why do we even care about zone boundaries when offlining or
>>> removing memory? Zone shrinking was mostly necessary with the previous
>>> onlining semantic when the zone type could be only changed on the
>>> boundary or unassociated memory. We can interleave memory zones now
>>> arbitrarily.
>>
>> Last time I asked whether we can just drop all that nasty
>> zone->contiguous handling I was being told that it does have a
>> significant performance impact and is here to stay. The boundaries are a
>> key component to detect whether a zone is contiguous.
>>
>> So yes, while we allow interleaved memory zones, having contiguous zones
>> is beneficial for performance. That's why also memory onlining code will
>> try to online memory as default to the zone that will keep/make zones
>> contiguous.
>>
>> Anyhow, I think with this series most of the zone shrinking code becomes
>> "digestible". Except minor issues with ZONE_DEVICE - which is acceptable.
>>
> 
> Also, there are plenty of other users of
> node_spanned_pages/zone_spanned_pages etc.. I don't think this can go -
> not that easy :)
> 

... re-reading, your suggestion is to drop the zone _shrinking_ code
only, sorry :) That makes more sense.

This would mean that once a zone was !contiguous, it will always remain
like that. Also, even empty zones after unplug would not result in
zone_empty() == true.

I can see that some users of *_spanned_pages make certain assumptions
based on the size (snapshot, oom killer, ...), but that would already be
wrong in case the zone is very sparse.
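
To make that concrete, a tiny user-space sketch - the numbers mirror the
zoneinfo example from the cover letter, and none of the variables below are
the kernel's actual structures:

#include <stdio.h>

int main(void)
{
	/* two memory blocks onlined with a gap between them, as in the example */
	unsigned long zone_start_pfn = 0;
	unsigned long zone_end_pfn   = 98304;	/* the span covers the gap, too */
	unsigned long present_pages  = 65536;	/* pages actually backed by memory */

	unsigned long spanned_pages = zone_end_pfn - zone_start_pfn;

	/* anything sized by "spanned" overestimates by 32768 pages here */
	printf("spanned %lu vs. present %lu\n", spanned_pages, present_pages);
	return 0;
}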

I'll prepare something, then we can discuss.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory
  2019-08-29 12:08       ` David Hildenbrand
@ 2019-08-29 12:15         ` Michal Hocko
  2019-08-29 12:29           ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2019-08-29 12:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Andrew Morton,
	Dan Williams, Alexander Duyck, Alexander Potapenko,
	Andrey Konovalov, Andy Lutomirski, Anshuman Khandual, Arun KS,
	Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
	Christian Borntraeger, Christophe Leroy, Dave Airlie,
	Dave Hansen, Fenghua Yu, Gerald Schaefer, Greg Kroah-Hartman,
	Halil Pasic, Heiko Carstens, H. Peter Anvin, Ingo Molnar,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Jun Yao,
	Kirill A. Shutemov, Logan Gunthorpe, Mark Rutland,
	Masahiro Yamada, Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Mike Rapoport, Mike Rapoport,
	Oscar Salvador, Oscar Salvador, Paul Mackerras, Pavel Tatashin,
	Pavel Tatashin, Peter Zijlstra, Qian Cai, Rich Felker,
	Robin Murphy, Souptick Joarder, Stephen Rothwell, Steve Capper,
	Thomas Gleixner, Tom Lendacky, Tony Luck, Vasily Gorbik,
	Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon, Yang Shi,
	Yoshinori Sato, Yu Zhao

On Thu 29-08-19 14:08:48, David Hildenbrand wrote:
> On 29.08.19 13:43, David Hildenbrand wrote:
> > On 29.08.19 13:33, David Hildenbrand wrote:
> >> On 29.08.19 10:23, Michal Hocko wrote:
> >>> On Thu 29-08-19 09:00:08, David Hildenbrand wrote:
> >>>> This is the successor of "[PATCH v2 0/6] mm/memory_hotplug: Consider all
> >>>> zones when removing memory". I decided to go one step further and finally
> >>>> factor out the shrinking of zones from memory removal code. Zones are now
> >>>> fixed up when offlining memory/onlining of memory fails/before removing
> >>>> ZONE_DEVICE memory.
> >>>
> >>> I was about to say Yay! but then reading...
> >>
> >> Almost ;)
> >>
> >>>
> >>>> Example:
> >>>>
> >>>> :/# cat /proc/zoneinfo
> >>>> Node 1, zone  Movable
> >>>>         spanned  0
> >>>>         present  0
> >>>>         managed  0
> >>>> :/# echo "online_movable" > /sys/devices/system/memory/memory41/state 
> >>>> :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
> >>>> :/# cat /proc/zoneinfo
> >>>> Node 1, zone  Movable
> >>>>         spanned  98304
> >>>>         present  65536
> >>>>         managed  65536
> >>>> :/# echo 0 > /sys/devices/system/memory/memory43/online
> >>>> :/# cat /proc/zoneinfo
> >>>> Node 1, zone  Movable
> >>>>         spanned  32768
> >>>>         present  32768
> >>>>         managed  32768
> >>>> :/# echo 0 > /sys/devices/system/memory/memory41/online
> >>>> :/# cat /proc/zoneinfo
> >>>> Node 1, zone  Movable
> >>>>         spanned  0
> >>>>         present  0
> >>>>         managed  0
> >>>
> >>> ... this made me realize that you are trying to fix it instead. Could
> >>> you explain why do we want to do that? Why don't we simply remove all
> >>> that crap? Why do we even care about zone boundaries when offlining or
> >>> removing memory? Zone shrinking was mostly necessary with the previous
> >>> onlining semantic when the zone type could be only changed on the
> >>> boundary or unassociated memory. We can interleave memory zones now
> >>> arbitrarily.
> >>
> >> Last time I asked whether we can just drop all that nasty
> >> zone->contiguous handling I was being told that it does have a
> >> significant performance impact and is here to stay. The boundaries are a
> >> key component to detect whether a zone is contiguous.
> >>
> >> So yes, while we allow interleaved memory zones, having contiguous zones
> >> is beneficial for performance. That's why also memory onlining code will
> >> try to online memory as default to the zone that will keep/make zones
> >> contiguous.
> >>
> >> Anyhow, I think with this series most of the zone shrinking code becomes
> >> "digestible". Except minor issues with ZONE_DEVICE - which is acceptable.
> >>
> > 
> > Also, there are plenty of other users of
> > node_spanned_pages/zone_spanned_pages etc.. I don't think this can go -
> > not that easy :)
> > 
> 
> ... re-reading, your suggestion is to drop the zone _shrinking_ code
> only, sorry :) That makes more sense.
> 
> This would mean that once a zone was !contiguous, it will always remain
> like that. Also, even empty zones after unplug would not result in
> zone_empty() == true.

exactly. We only need to care about not declaring zone !contiguous when
offlining from ends but that should be trivial.

> I can see that some users of *_spanned_pages make certain assumptions
> based on the size (snapshot, oom killer, ...), but that would already be
> wrong in case the zone is very sparse.

at least oom killer usage is certainly wrong. I will have a look.

> I'll prepare something, then we can discuss.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory
  2019-08-29 12:15         ` Michal Hocko
@ 2019-08-29 12:29           ` David Hildenbrand
  2019-08-29 15:19             ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29 12:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Andrew Morton,
	Dan Williams, Alexander Duyck, Alexander Potapenko,
	Andrey Konovalov, Andy Lutomirski, Anshuman Khandual, Arun KS,
	Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
	Christian Borntraeger, Christophe Leroy, Dave Airlie,
	Dave Hansen, Fenghua Yu, Gerald Schaefer, Greg Kroah-Hartman,
	Halil Pasic, Heiko Carstens, H. Peter Anvin, Ingo Molnar,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Jun Yao,
	Kirill A. Shutemov, Logan Gunthorpe, Mark Rutland,
	Masahiro Yamada, Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Mike Rapoport, Mike Rapoport,
	Oscar Salvador, Oscar Salvador, Paul Mackerras, Pavel Tatashin,
	Pavel Tatashin, Peter Zijlstra, Qian Cai, Rich Felker,
	Robin Murphy, Souptick Joarder, Stephen Rothwell, Steve Capper,
	Thomas Gleixner, Tom Lendacky, Tony Luck, Vasily Gorbik,
	Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon, Yang Shi,
	Yoshinori Sato, Yu Zhao

On 29.08.19 14:15, Michal Hocko wrote:
> On Thu 29-08-19 14:08:48, David Hildenbrand wrote:
>> On 29.08.19 13:43, David Hildenbrand wrote:
>>> On 29.08.19 13:33, David Hildenbrand wrote:
>>>> On 29.08.19 10:23, Michal Hocko wrote:
>>>>> On Thu 29-08-19 09:00:08, David Hildenbrand wrote:
>>>>>> This is the successor of "[PATCH v2 0/6] mm/memory_hotplug: Consider all
>>>>>> zones when removing memory". I decided to go one step further and finally
>>>>>> factor out the shrinking of zones from memory removal code. Zones are now
>>>>>> fixed up when offlining memory/onlining of memory fails/before removing
>>>>>> ZONE_DEVICE memory.
>>>>>
>>>>> I was about to say Yay! but then reading...
>>>>
>>>> Almost ;)
>>>>
>>>>>
>>>>>> Example:
>>>>>>
>>>>>> :/# cat /proc/zoneinfo
>>>>>> Node 1, zone  Movable
>>>>>>         spanned  0
>>>>>>         present  0
>>>>>>         managed  0
>>>>>> :/# echo "online_movable" > /sys/devices/system/memory/memory41/state 
>>>>>> :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
>>>>>> :/# cat /proc/zoneinfo
>>>>>> Node 1, zone  Movable
>>>>>>         spanned  98304
>>>>>>         present  65536
>>>>>>         managed  65536
>>>>>> :/# echo 0 > /sys/devices/system/memory/memory43/online
>>>>>> :/# cat /proc/zoneinfo
>>>>>> Node 1, zone  Movable
>>>>>>         spanned  32768
>>>>>>         present  32768
>>>>>>         managed  32768
>>>>>> :/# echo 0 > /sys/devices/system/memory/memory41/online
>>>>>> :/# cat /proc/zoneinfo
>>>>>> Node 1, zone  Movable
>>>>>>         spanned  0
>>>>>>         present  0
>>>>>>         managed  0
>>>>>
>>>>> ... this made me realize that you are trying to fix it instead. Could
>>>>> you explain why do we want to do that? Why don't we simply remove all
>>>>> that crap? Why do we even care about zone boundaries when offlining or
>>>>> removing memory? Zone shrinking was mostly necessary with the previous
>>>>> onlining semantic when the zone type could be only changed on the
>>>>> boundary or unassociated memory. We can interleave memory zones now
>>>>> arbitrarily.
>>>>
>>>> Last time I asked whether we can just drop all that nasty
>>>> zone->contiguous handling I was being told that it does have a
>>>> significant performance impact and is here to stay. The boundaries are a
>>>> key component to detect whether a zone is contiguous.
>>>>
>>>> So yes, while we allow interleaved memory zones, having contiguous zones
>>>> is beneficial for performance. That's why also memory onlining code will
>>>> try to online memory as default to the zone that will keep/make zones
>>>> contiguous.
>>>>
>>>> Anyhow, I think with this series most of the zone shrinking code becomes
>>>> "digestible". Except minor issues with ZONE_DEVICE - which is acceptable.
>>>>
>>>
>>> Also, there are plenty of other users of
>>> node_spanned_pages/zone_spanned_pages etc.. I don't think this can go -
>>> not that easy :)
>>>
>>
>> ... re-reading, your suggestion is to drop the zone _shrinking_ code
>> only, sorry :) That makes more sense.
>>
>> This would mean that once a zone was !contiguous, it will always remain
>> like that. Also, even empty zones after unplug would not result in
>> zone_empty() == true.
> 
> > exactly. We only need to care about not declaring zone !contiguous when
> offlining from ends but that should be trivial.

That won't help a lot (offlining a DIMM will offline first to last
memory block, so unlikely we can keep the zone !contiguous). However, we
could limit zone shrinking to offlining code only (easy) and not perform
it at all for ZONE_DEVICE memory. That would simplify things *a lot*.

What's your take? Remove it completely or do it only for !ZONE_DEVICE
memory when offlining/onlining fails?

I think I would prefer to try to shrink for !ZONE_DEVICE memory, then we
can at least try to keep contiguous set and reset in case it's possible.

> 
>> I can see that some users of *_spanned_pages make certain assumptions
>> based on the size (snapshot, oom killer, ...), but that would already be
>> wrong in case the zone is very sparse.
> 
> at least oom killer usage is certainly wrong. I will have a look.
> 
>> I'll prepare something, then we can discuss.
> 
> Thanks!
> 


-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory
  2019-08-29 12:29           ` David Hildenbrand
@ 2019-08-29 15:19             ` Michal Hocko
  2019-08-29 15:28               ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2019-08-29 15:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Andrew Morton,
	Dan Williams, Alexander Duyck, Alexander Potapenko,
	Andrey Konovalov, Andy Lutomirski, Anshuman Khandual, Arun KS,
	Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
	Christian Borntraeger, Christophe Leroy, Dave Airlie,
	Dave Hansen, Fenghua Yu, Gerald Schaefer, Greg Kroah-Hartman,
	Halil Pasic, Heiko Carstens, H. Peter Anvin, Ingo Molnar,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Jun Yao,
	Kirill A. Shutemov, Logan Gunthorpe, Mark Rutland,
	Masahiro Yamada, Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Mike Rapoport, Mike Rapoport,
	Oscar Salvador, Oscar Salvador, Paul Mackerras, Pavel Tatashin,
	Pavel Tatashin, Peter Zijlstra, Qian Cai, Rich Felker,
	Robin Murphy, Souptick Joarder, Stephen Rothwell, Steve Capper,
	Thomas Gleixner, Tom Lendacky, Tony Luck, Vasily Gorbik,
	Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon, Yang Shi,
	Yoshinori Sato, Yu Zhao

On Thu 29-08-19 14:29:22, David Hildenbrand wrote:
> On 29.08.19 14:15, Michal Hocko wrote:
> > On Thu 29-08-19 14:08:48, David Hildenbrand wrote:
> >> On 29.08.19 13:43, David Hildenbrand wrote:
> >>> On 29.08.19 13:33, David Hildenbrand wrote:
> >>>> On 29.08.19 10:23, Michal Hocko wrote:
> >>>>> On Thu 29-08-19 09:00:08, David Hildenbrand wrote:
> >>>>>> This is the successor of "[PATCH v2 0/6] mm/memory_hotplug: Consider all
> >>>>>> zones when removing memory". I decided to go one step further and finally
> >>>>>> factor out the shrinking of zones from memory removal code. Zones are now
> >>>>>> fixed up when offlining memory/onlining of memory fails/before removing
> >>>>>> ZONE_DEVICE memory.
> >>>>>
> >>>>> I was about to say Yay! but then reading...
> >>>>
> >>>> Almost ;)
> >>>>
> >>>>>
> >>>>>> Example:
> >>>>>>
> >>>>>> :/# cat /proc/zoneinfo
> >>>>>> Node 1, zone  Movable
> >>>>>>         spanned  0
> >>>>>>         present  0
> >>>>>>         managed  0
> >>>>>> :/# echo "online_movable" > /sys/devices/system/memory/memory41/state 
> >>>>>> :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
> >>>>>> :/# cat /proc/zoneinfo
> >>>>>> Node 1, zone  Movable
> >>>>>>         spanned  98304
> >>>>>>         present  65536
> >>>>>>         managed  65536
> >>>>>> :/# echo 0 > /sys/devices/system/memory/memory43/online
> >>>>>> :/# cat /proc/zoneinfo
> >>>>>> Node 1, zone  Movable
> >>>>>>         spanned  32768
> >>>>>>         present  32768
> >>>>>>         managed  32768
> >>>>>> :/# echo 0 > /sys/devices/system/memory/memory41/online
> >>>>>> :/# cat /proc/zoneinfo
> >>>>>> Node 1, zone  Movable
> >>>>>>         spanned  0
> >>>>>>         present  0
> >>>>>>         managed  0
> >>>>>
> >>>>> ... this made me realize that you are trying to fix it instead. Could
> >>>>> you explain why do we want to do that? Why don't we simply remove all
> >>>>> that crap? Why do we even care about zone boundaries when offlining or
> >>>>> removing memory? Zone shrinking was mostly necessary with the previous
> >>>>> onlining semantic when the zone type could be only changed on the
> >>>>> boundary or unassociated memory. We can interleave memory zones now
> >>>>> arbitrarily.
> >>>>
> >>>> Last time I asked whether we can just drop all that nasty
> >>>> zone->contiguous handling I was being told that it does have a
> >>>> significant performance impact and is here to stay. The boundaries are a
> >>>> key component to detect whether a zone is contiguous.
> >>>>
> >>>> So yes, while we allow interleaved memory zones, having contiguous zones
> >>>> is beneficial for performance. That's why also memory onlining code will
> >>>> try to online memory as default to the zone that will keep/make zones
> >>>> contiguous.
> >>>>
> >>>> Anyhow, I think with this series most of the zone shrinking code becomes
> >>>> "digestible". Except minor issues with ZONE_DEVICE - which is acceptable.
> >>>>
> >>>
> >>> Also, there are plenty of other users of
> >>> node_spanned_pages/zone_spanned_pages etc.. I don't think this can go -
> >>> not that easy :)
> >>>
> >>
> >> ... re-reading, your suggestion is to drop the zone _shrinking_ code
> >> only, sorry :) That makes more sense.
> >>
> >> This would mean that once a zone was !contiguous, it will always remain
> >> like that. Also, even empty zones after unplug would not result in
> >> zone_empty() == true.
> > 
> > exactly. We only need to care about not declaring zone !contiguous when
> > offlining from ends but that should be trivial.
> 
> That won't help a lot (offlining a DIMM will offline first to last
> memory block, so unlikely we can keep the zone !contiguous). However, we
> could limit zone shrinking to offlining code only (easy) and not perform
> it at all for ZONE_DEVICE memory. That would simplify things *a lot*.
> 
> What's your take? Remove it completely or do it only for !ZONE_DEVICE
> memory when offlining/onlining fails?
> 
> I think I would prefer to try to shrink for !ZONE_DEVICE memory, then we
> can at least try to keep contiguous set and reset in case it's possible.

I would remove that code altogether if that is possible and doesn't
introduce any side effects I am not aware of right now. All the existing
code has to deal with holes already so I do not see any reason why it
cannot do the same with holes at both ends.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory
  2019-08-29 15:19             ` Michal Hocko
@ 2019-08-29 15:28               ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29 15:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Andrew Morton,
	Dan Williams, Alexander Duyck, Alexander Potapenko,
	Andrey Konovalov, Andy Lutomirski, Anshuman Khandual, Arun KS,
	Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
	Christian Borntraeger, Christophe Leroy, Dave Airlie,
	Dave Hansen, Fenghua Yu, Gerald Schaefer, Greg Kroah-Hartman,
	Halil Pasic, Heiko Carstens, H. Peter Anvin, Ingo Molnar,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Jun Yao,
	Kirill A. Shutemov, Logan Gunthorpe, Mark Rutland,
	Masahiro Yamada, Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Mike Rapoport, Mike Rapoport,
	Oscar Salvador, Oscar Salvador, Paul Mackerras, Pavel Tatashin,
	Pavel Tatashin, Peter Zijlstra, Qian Cai, Rich Felker,
	Robin Murphy, Souptick Joarder, Stephen Rothwell, Steve Capper,
	Thomas Gleixner, Tom Lendacky, Tony Luck, Vasily Gorbik,
	Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon, Yang Shi,
	Yoshinori Sato, Yu Zhao

On 29.08.19 17:19, Michal Hocko wrote:
> On Thu 29-08-19 14:29:22, David Hildenbrand wrote:
>> On 29.08.19 14:15, Michal Hocko wrote:
>>> On Thu 29-08-19 14:08:48, David Hildenbrand wrote:
>>>> On 29.08.19 13:43, David Hildenbrand wrote:
>>>>> On 29.08.19 13:33, David Hildenbrand wrote:
>>>>>> On 29.08.19 10:23, Michal Hocko wrote:
>>>>>>> On Thu 29-08-19 09:00:08, David Hildenbrand wrote:
>>>>>>>> This is the successor of "[PATCH v2 0/6] mm/memory_hotplug: Consider all
>>>>>>>> zones when removing memory". I decided to go one step further and finally
>>>>>>>> factor out the shrinking of zones from memory removal code. Zones are now
>>>>>>>> fixed up when offlining memory/onlining of memory fails/before removing
>>>>>>>> ZONE_DEVICE memory.
>>>>>>>
>>>>>>> I was about to say Yay! but then reading...
>>>>>>
>>>>>> Almost ;)
>>>>>>
>>>>>>>
>>>>>>>> Example:
>>>>>>>>
>>>>>>>> :/# cat /proc/zoneinfo
>>>>>>>> Node 1, zone  Movable
>>>>>>>>         spanned  0
>>>>>>>>         present  0
>>>>>>>>         managed  0
>>>>>>>> :/# echo "online_movable" > /sys/devices/system/memory/memory41/state 
>>>>>>>> :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
>>>>>>>> :/# cat /proc/zoneinfo
>>>>>>>> Node 1, zone  Movable
>>>>>>>>         spanned  98304
>>>>>>>>         present  65536
>>>>>>>>         managed  65536
>>>>>>>> :/# echo 0 > /sys/devices/system/memory/memory43/online
>>>>>>>> :/# cat /proc/zoneinfo
>>>>>>>> Node 1, zone  Movable
>>>>>>>>         spanned  32768
>>>>>>>>         present  32768
>>>>>>>>         managed  32768
>>>>>>>> :/# echo 0 > /sys/devices/system/memory/memory41/online
>>>>>>>> :/# cat /proc/zoneinfo
>>>>>>>> Node 1, zone  Movable
>>>>>>>>         spanned  0
>>>>>>>>         present  0
>>>>>>>>         managed  0
>>>>>>>
>>>>>>> ... this made me realize that you are trying to fix it instead. Could
>>>>>>> you explain why do we want to do that? Why don't we simply remove all
>>>>>>> that crap? Why do we even care about zone boundaries when offlining or
>>>>>>> removing memory? Zone shrinking was mostly necessary with the previous
>>>>>>> onlining semantic when the zone type could be only changed on the
>>>>>>> boundary or unassociated memory. We can interleave memory zones now
>>>>>>> arbitrarily.
>>>>>>
>>>>>> Last time I asked whether we can just drop all that nasty
>>>>>> zone->contiguous handling I was being told that it does have a
>>>>>> significant performance impact and is here to stay. The boundaries are a
>>>>>> key component to detect whether a zone is contiguous.
>>>>>>
>>>>>> So yes, while we allow interleaved memory zones, having contiguous zones
>>>>>> is beneficial for performance. That's why also memory onlining code will
>>>>>> try to online memory as default to the zone that will keep/make zones
>>>>>> contiguous.
>>>>>>
>>>>>> Anyhow, I think with this series most of the zone shrinking code becomes
>>>>>> "digestible". Except minor issues with ZONE_DEVICE - which is acceptable.
>>>>>>
>>>>>
>>>>> Also, there are plenty of other users of
>>>>> node_spanned_pages/zone_spanned_pages etc.. I don't think this can go -
>>>>> not that easy :)
>>>>>
>>>>
>>>> ... re-reading, your suggestion is to drop the zone _shrinking_ code
>>>> only, sorry :) That makes more sense.
>>>>
>>>> This would mean that once a zone was !contiguous, it will always remain
>>>> like that. Also, even empty zones after unplug would not result in
>>>> zone_empty() == true.
>>>
>>> exactly. We only need to care about not declaring zone !contiguous when
>>> offlining from ends but that should be trivial.
>>
>> That won't help a lot (offlining a DIMM will offline first to last
>> memory block, so unlikely we can keep the zone !contiguous). However, we
>> could limit zone shrinking to offlining code only (easy) and not perform
>> it at all for ZONE_DEVICE memory. That would simplify things *a lot*.
>>
>> What's your take? Remove it completely or do it only for !ZONE_DEVICE
>> memory when offlining/onlining fails?
>>
>> I think I would prefer to try to shrink for !ZONE_DEVICE memory, then we
>> can at least try to keep contiguous set and reset in case it's possible.
> 
> I would remove that code altogether if that is possible and doesn't
> introduce any side effects I am not aware of right now. All the existing
> code has to deal with holes already so I do not see any reason why it
> cannot do the same with holes at both ends.

I'll share a version that keeps shrinking the zones when
offlining/removing memory - so we can eventually set zone->contiguous
again (and keep it set on memory unplug) -  and drops the shrinking part
for ZONE_DEVICE memory, because there it really seems to be useless and
broken right now.

The series I am preparing right now reduces the shrinking code to a bare
minimum, which looks much better.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 01/11] mm/memremap: Get rid of memmap_init_zone_device()
  2019-08-29  7:00 ` [PATCH v3 01/11] mm/memremap: Get rid of memmap_init_zone_device() David Hildenbrand
@ 2019-08-29 16:39   ` Alexander Duyck
  2019-08-29 16:55     ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Alexander Duyck @ 2019-08-29 16:39 UTC (permalink / raw)
  To: David Hildenbrand, Dan Williams
  Cc: LKML, linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Pavel Tatashin, Wei Yang, Qian Cai, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Stephen Rothwell, Dave Airlie,
	Andrey Konovalov, Alexander Duyck, Ira Weiny, John Hubbard,
	Arun KS, Souptick Joarder, Robin Murphy, Yang Shi,
	Jason Gunthorpe, Logan Gunthorpe, Vlastimil Babka, Mel Gorman,
	Mike Rapoport, Alexander Potapenko

On Thu, Aug 29, 2019 at 12:00 AM David Hildenbrand <david@redhat.com> wrote:
>
> As far as I can see, the original split wanted to speed up a duplicate
> initialization. We now only initialize once - and we really should
> initialize under the lock, when resizing the zones.

What do you mean by duplicate initialization? Basically the issue was
that we can have systems with massive memory footprints and I was just
trying to get the initialization time under a minute. The compromise I
made was to split the initialization so that we only initialized the
pages in the altmap and defer the rest so that they can be initialized
in parallel.

What this patch does is serialize the initialization so it will likely
take 2 to 4 minutes or more to initialize memory on a system where I
had brought the init time under about 30 seconds.
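
Roughly, the split looks like the following stand-alone sketch (plain C,
invented names, and the altmap handling simplified to just the reserved
portion) - only the memmap-backing part is touched under the lock, the bulk
is deferred:

#include <stdio.h>

struct toy_altmap {
	unsigned long base_pfn;
	unsigned long reserve;	/* memmap pages carried by the device itself */
};

static void init_pages(unsigned long start, unsigned long end, const char *phase)
{
	printf("%s: init pfns [0x%lx, 0x%lx)\n", phase, start, end);
}

int main(void)
{
	unsigned long start_pfn = 0x100000, nr_pages = 0x40000;
	struct toy_altmap altmap = { .base_pfn = 0x100000, .reserve = 0x80 };

	/* under the hotplug lock: only the part backing the memmap itself */
	init_pages(start_pfn, altmap.base_pfn + altmap.reserve, "locked");

	/* after dropping the lock: the bulk, potentially per node in parallel */
	init_pages(altmap.base_pfn + altmap.reserve, start_pfn + nr_pages, "deferred");
	return 0;
}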

> As soon as we drop the lock we might have memory unplug running, trying
> to shrink the zone again. In case the memmap was not properly initialized,
> the zone/node shrinking code might get false negatives when search for
> the new zone/node boundaries - bad. We suddenly could no longer span the
> memory we just added.

The issue as I see it is that we can have systems with multiple nodes
and each node can populate a fairly significant amount of data. By
pushing this all back under the hotplug lock you are serializing the
initialization for each node resulting in a 4x or 8x increase in
initialization time on some of these bigger systems.

> Also, once we want to fix set_zone_contiguous(zone) inside
> move_pfn_range_to_zone() to actually work with ZONE_DEVICE (instead of
> always immediately stopping and never setting zone->contiguous) we have
> to have the whole memmap initialized at that point. (not sure if we
> want that in the future, just a note)
>
> Let's just keep things simple and initialize the memmap when resizing
> the zones under the lock.
>
> If this is a real performance issue, we have to watch out for
> alternatives.

My concern is that this is going to become a performance issue, but I
don't have access to the hardware at the moment to test how much of
one. I'll have to check to see if I can get access to a system to test
this patch set.

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Wei Yang <richard.weiyang@gmail.com>
> Cc: Qian Cai <cai@lca.pw>
> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Andrey Konovalov <andreyknvl@google.com>
> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Arun KS <arunks@codeaurora.org>
> Cc: Souptick Joarder <jrdr.linux@gmail.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Cc: Alexander Potapenko <glider@google.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/memory_hotplug.h |  2 +-
>  include/linux/mm.h             |  4 +---
>  mm/memory_hotplug.c            |  4 ++--
>  mm/memremap.c                  |  9 +-------
>  mm/page_alloc.c                | 42 ++++++++++++----------------------
>  5 files changed, 20 insertions(+), 41 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index f46ea71b4ffd..235530cdface 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -344,7 +344,7 @@ extern int __add_memory(int nid, u64 start, u64 size);
>  extern int add_memory(int nid, u64 start, u64 size);
>  extern int add_memory_resource(int nid, struct resource *resource);
>  extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> -               unsigned long nr_pages, struct vmem_altmap *altmap);
> +               unsigned long nr_pages, struct dev_pagemap *pgmap);
>  extern bool is_memblock_offlined(struct memory_block *mem);
>  extern int sparse_add_section(int nid, unsigned long pfn,
>                 unsigned long nr_pages, struct vmem_altmap *altmap);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ad6766a08f9b..2bd445c4d3b4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -962,8 +962,6 @@ static inline bool is_zone_device_page(const struct page *page)
>  {
>         return page_zonenum(page) == ZONE_DEVICE;
>  }
> -extern void memmap_init_zone_device(struct zone *, unsigned long,
> -                                   unsigned long, struct dev_pagemap *);
>  #else
>  static inline bool is_zone_device_page(const struct page *page)
>  {
> @@ -2243,7 +2241,7 @@ static inline void zero_resv_unavail(void) {}
>
>  extern void set_dma_reserve(unsigned long new_dma_reserve);
>  extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
> -               enum memmap_context, struct vmem_altmap *);
> +               enum memmap_context, struct dev_pagemap *);
>  extern void setup_per_zone_wmarks(void);
>  extern int __meminit init_per_zone_wmark_min(void);
>  extern void mem_init(void);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 49f7bf91c25a..35a395d195c6 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -719,7 +719,7 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
>   * call, all affected pages are PG_reserved.
>   */
>  void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> -               unsigned long nr_pages, struct vmem_altmap *altmap)
> +               unsigned long nr_pages, struct dev_pagemap *pgmap)
>  {
>         struct pglist_data *pgdat = zone->zone_pgdat;
>         int nid = pgdat->node_id;
> @@ -744,7 +744,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>          * are reserved so nobody should be touching them so we should be safe
>          */
>         memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn,
> -                       MEMMAP_HOTPLUG, altmap);
> +                        MEMMAP_HOTPLUG, pgmap);
>
>         set_zone_contiguous(zone);
>  }
> diff --git a/mm/memremap.c b/mm/memremap.c
> index f6c17339cd0d..9ee23374e6da 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -308,20 +308,13 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>
>                 zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
>                 move_pfn_range_to_zone(zone, PHYS_PFN(res->start),
> -                               PHYS_PFN(resource_size(res)), restrictions.altmap);
> +                               PHYS_PFN(resource_size(res)), pgmap);
>         }
>
>         mem_hotplug_done();
>         if (error)
>                 goto err_add_memory;
>
> -       /*
> -        * Initialization of the pages has been deferred until now in order
> -        * to allow us to do the work while not holding the hotplug lock.
> -        */
> -       memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> -                               PHYS_PFN(res->start),
> -                               PHYS_PFN(resource_size(res)), pgmap);
>         percpu_ref_get_many(pgmap->ref, pfn_end(pgmap) - pfn_first(pgmap));
>         return __va(res->start);
>

So if you are moving this all under the lock then this is going to
serialize initialization and it is going to be quite expensive on big
systems. I was only ever able to get the init time down to something
like ~20s with the optimized function. Since that has been torn apart
and you are back to doing things with memmap_init_zone we are probably
looking at more like 25-30s for each node, and on a 4 node system we
are looking at 2 minutes or so which may lead to issues if people are
mounting this.

Instead of forcing this all to be done under the hotplug lock is there
some way we could do this under the zone span_seqlock instead to
achieve the same result?

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c5d62f1c2851..44038665fe8e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5845,7 +5845,7 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
>   */
>  void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>                 unsigned long start_pfn, enum memmap_context context,
> -               struct vmem_altmap *altmap)
> +               struct dev_pagemap *pgmap)
>  {
>         unsigned long pfn, end_pfn = start_pfn + size;
>         struct page *page;
> @@ -5853,24 +5853,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>         if (highest_memmap_pfn < end_pfn - 1)
>                 highest_memmap_pfn = end_pfn - 1;
>
> -#ifdef CONFIG_ZONE_DEVICE
> -       /*
> -        * Honor reservation requested by the driver for this ZONE_DEVICE
> -        * memory. We limit the total number of pages to initialize to just
> -        * those that might contain the memory mapping. We will defer the
> -        * ZONE_DEVICE page initialization until after we have released
> -        * the hotplug lock.
> -        */
> -       if (zone == ZONE_DEVICE) {
> -               if (!altmap)
> -                       return;
> -
> -               if (start_pfn == altmap->base_pfn)
> -                       start_pfn += altmap->reserve;
> -               end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
> -       }
> -#endif
> -
>         for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>                 /*
>                  * There can be holes in boot-time mem_map[]s handed to this
> @@ -5892,6 +5874,20 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>                 if (context == MEMMAP_HOTPLUG)
>                         __SetPageReserved(page);
>
> +#ifdef CONFIG_ZONE_DEVICE
> +               if (zone == ZONE_DEVICE) {
> +                       WARN_ON_ONCE(!pgmap);
> +                       /*
> +                        * ZONE_DEVICE pages union ->lru with a ->pgmap back
> +                        * pointer and zone_device_data. It is a bug if a
> +                        * ZONE_DEVICE page is ever freed or placed on a driver
> +                        * private list.
> +                        */
> +                       page->pgmap = pgmap;
> +                       page->zone_device_data = NULL;
> +               }
> +#endif
> +

So I am not sure this is right. Part of the reason for the split is
that the pages that were a part of the altmap had an LRU setup, not
the pgmap/zone_device_data setup. This is changing that.

Also, I am more a fan of just testing pgmap and if it is present then
assign page->pgmap and reset zone_device_data. Then you can avoid the
test for zone on every iteration and the WARN_ON_ONCE check, or at
least you could move it outside the loop so we don't incur the cost
with each page. Are there situations where you would have pgmap but
not a ZONE_DEVICE page?
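
Something along these lines, as a hedged user-space sketch rather than the
actual kernel loop - all types and names below are invented; the point is
only that the pgmap pointer is tested directly and the sanity check moves out
of the per-page loop:

#include <assert.h>
#include <stddef.h>

struct toy_pagemap { int dummy; };

struct toy_page {
	struct toy_pagemap *pgmap;
	void *zone_device_data;
	int reserved;
};

enum { TOY_ZONE_NORMAL, TOY_ZONE_DEVICE };

static void toy_memmap_init(struct toy_page *pages, size_t nr, int zone,
			    struct toy_pagemap *pgmap)
{
	/* one sanity check, outside the per-page loop */
	assert(!pgmap || zone == TOY_ZONE_DEVICE);

	for (size_t i = 0; i < nr; i++) {
		pages[i].reserved = 1;
		if (pgmap) {		/* no per-page zone comparison needed */
			pages[i].pgmap = pgmap;
			pages[i].zone_device_data = NULL;
		}
	}
}

int main(void)
{
	struct toy_page pages[4] = { 0 };
	struct toy_pagemap pm = { 0 };

	toy_memmap_init(pages, 4, TOY_ZONE_DEVICE, &pm);
	return 0;
}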

>                 /*
>                  * Mark the block movable so that blocks are reserved for
>                  * movable at startup. This will force kernel allocations
> @@ -5951,14 +5947,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
>                  */
>                 __SetPageReserved(page);
>
> -               /*
> -                * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
> -                * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
> -                * ever freed or placed on a driver-private list.
> -                */
> -               page->pgmap = pgmap;
> -               page->zone_device_data = NULL;
> -
>                 /*
>                  * Mark the block movable so that blocks are reserved for
>                  * movable at startup. This will force kernel allocations

Shouldn't you be removing this function instead of just gutting it?
I'm kind of surprised you aren't getting warnings about unused code
since you also pulled the declaration for it from the header.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 01/11] mm/memremap: Get rid of memmap_init_zone_device()
  2019-08-29 16:39   ` Alexander Duyck
@ 2019-08-29 16:55     ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-08-29 16:55 UTC (permalink / raw)
  To: Alexander Duyck, Dan Williams
  Cc: LKML, linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Pavel Tatashin, Wei Yang, Qian Cai, Matthew Wilcox (Oracle),
	Kirill A. Shutemov, Stephen Rothwell, Dave Airlie,
	Andrey Konovalov, Alexander Duyck, Ira Weiny, John Hubbard,
	Arun KS, Souptick Joarder, Robin Murphy, Yang Shi,
	Jason Gunthorpe, Logan Gunthorpe, Vlastimil Babka, Mel Gorman,
	Mike Rapoport, Alexander Potapenko

On 29.08.19 18:39, Alexander Duyck wrote:
> On Thu, Aug 29, 2019 at 12:00 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> As far as I can see, the original split wanted to speed up a duplicate
>> initialization. We now only initialize once - and we really should
>> initialize under the lock, when resizing the zones.
> 
> What do you mean by duplicate initialization? Basically the issue was
> that we can have systems with massive memory footprints and I was just
> trying to get the initialization time under a minute. The compromise I
> made was to split the initialization so that we only initialized the
> pages in the altmap and deferred the rest so that they could be
> initialized in parallel.
> 
> What this patch does is serialize the initialization so it will likely
> take 2 to 4 minutes or more to initialize memory on a system where I
> had brought the init time under about 30 seconds.
> 
>> As soon as we drop the lock we might have memory unplug running, trying
>> to shrink the zone again. In case the memmap was not properly initialized,
>> the zone/node shrinking code might get false negatives when searching for
>> the new zone/node boundaries - bad. We suddenly could no longer span the
>> memory we just added.
> 
> The issue as I see it is that we can have systems with multiple nodes
> and each node can populate a fairly significant amount of data. By
> pushing this all back under the hotplug lock you are serializing the
> initialization for each node resulting in a 4x or 8x increase in
> initialization time on some of these bigger systems.
> 
>> Also, once we want to fix set_zone_contiguous(zone) inside
>> move_pfn_range_to_zone() to actually work with ZONE_DEVICE (instead of
>> always immediately stopping and never setting zone->contiguous), we
>> have to have the whole memmap initialized at that point. (Not sure if
>> we want that in the future, just a note.)
>>
>> Let's just keep things simple and initialize the memmap when resizing
>> the zones under the lock.
>>
>> If this is a real performance issue, we will have to look for
>> alternatives.
> 
> My concern is that this is going to become a performance issue, but I
> don't have access to the hardware at the moment to test how much of
> one. I'll have to check to see if I can get access to a system to test
> this patch set.
> 

Thanks for having a look - I already dropped this patch. We will
instead stop shrinking ZONE_DEVICE zones, so assume this patch is gone.

[...]

> So if you are moving this all under the lock then this is going to
> serialize initialization, and it is going to be quite expensive on big
> systems. I was only ever able to get the init time down to something
> like ~20s with the optimized function. Since that has been torn apart
> and you are back to doing things with memmap_init_zone, we are probably
> looking at more like 25-30s for each node, and on a 4-node system we
> are looking at 2 minutes or so, which may lead to issues if people are
> mounting this.
> 
> Instead of forcing this all to be done under the hotplug lock is there
> some way we could do this under the zone span_seqlock instead to
> achieve the same result?

I guess the right approach really is, as Michal suggests, to not shrink
at all (at least for ZONE_DEVICE) :)
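
(For reference, the seqlock Alexander mentions is zone->span_seqlock,
which only guards the zone span itself; the existing writer/reader
pattern looks roughly like the sketch below, simplified from
move_pfn_range_to_zone() and page_outside_zone_boundaries() - just to
illustrate what it protects, not a proposal.)

	unsigned long start, end;
	unsigned int seq;

	/* Writer side, when growing the spanned range: */
	zone_span_writelock(zone);
	resize_zone_range(zone, start_pfn, nr_pages);
	zone_span_writeunlock(zone);

	/* Reader side, retried if a resize races with the lookup: */
	do {
		seq = zone_span_seqbegin(zone);
		start = zone->zone_start_pfn;
		end = zone_end_pfn(zone);
	} while (zone_span_seqretry(zone, seq));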

> 
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index c5d62f1c2851..44038665fe8e 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5845,7 +5845,7 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
>>   */
>>  void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>                 unsigned long start_pfn, enum memmap_context context,
>> -               struct vmem_altmap *altmap)
>> +               struct dev_pagemap *pgmap)
>>  {
>>         unsigned long pfn, end_pfn = start_pfn + size;
>>         struct page *page;
>> @@ -5853,24 +5853,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>         if (highest_memmap_pfn < end_pfn - 1)
>>                 highest_memmap_pfn = end_pfn - 1;
>>
>> -#ifdef CONFIG_ZONE_DEVICE
>> -       /*
>> -        * Honor reservation requested by the driver for this ZONE_DEVICE
>> -        * memory. We limit the total number of pages to initialize to just
>> -        * those that might contain the memory mapping. We will defer the
>> -        * ZONE_DEVICE page initialization until after we have released
>> -        * the hotplug lock.
>> -        */
>> -       if (zone == ZONE_DEVICE) {
>> -               if (!altmap)
>> -                       return;
>> -
>> -               if (start_pfn == altmap->base_pfn)
>> -                       start_pfn += altmap->reserve;
>> -               end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
>> -       }
>> -#endif
>> -
>>         for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>>                 /*
>>                  * There can be holes in boot-time mem_map[]s handed to this
>> @@ -5892,6 +5874,20 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>                 if (context == MEMMAP_HOTPLUG)
>>                         __SetPageReserved(page);
>>
>> +#ifdef CONFIG_ZONE_DEVICE
>> +               if (zone == ZONE_DEVICE) {
>> +                       WARN_ON_ONCE(!pgmap);
>> +                       /*
>> +                        * ZONE_DEVICE pages union ->lru with a ->pgmap back
>> +                        * pointer and zone_device_data. It is a bug if a
>> +                        * ZONE_DEVICE page is ever freed or placed on a driver
>> +                        * private list.
>> +                        */
>> +                       page->pgmap = pgmap;
>> +                       page->zone_device_data = NULL;
>> +               }
>> +#endif
>> +
> 
> So I am not sure this is right. Part of the reason for the split is
> that the pages that were part of the altmap had an LRU setup, not the
> pgmap/zone_device_data setup. This is changing that.

Yeah, you might be right, we would have to handle the altmap part
separately.
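
Something like the following is what I would have in mind, as a
fragment of the loop body - completely untested, and assuming we can
still get at the altmap via pgmap_altmap() at that point - so that the
pages backing the memmap itself keep the ordinary setup:

#ifdef CONFIG_ZONE_DEVICE
		if (zone == ZONE_DEVICE && pgmap) {
			struct vmem_altmap *altmap = pgmap_altmap(pgmap);

			/*
			 * Pages reserved for the memmap itself (the altmap)
			 * keep the regular LRU initialization; only the
			 * remaining device pages get the pgmap back pointer.
			 */
			if (!altmap || pfn >= altmap->base_pfn +
					       vmem_altmap_offset(altmap)) {
				page->pgmap = pgmap;
				page->zone_device_data = NULL;
			}
		}
#endif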

> 
> Also, I am more a fan of just testing pgmap and, if it is present,
> assigning page->pgmap and resetting zone_device_data. Then you can
> avoid the test for zone on every iteration and the WARN_ON_ONCE check,
> or at least move the check outside the loop so we don't incur the cost
> for each page. Are there situations where you would have pgmap but not
> a ZONE_DEVICE page?

Don't think so. But I am definitely not a ZONE_DEVICE expert.

> 
>>                 /*
>>                  * Mark the block movable so that blocks are reserved for
>>                  * movable at startup. This will force kernel allocations
>> @@ -5951,14 +5947,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>                  */
>>                 __SetPageReserved(page);
>>
>> -               /*
>> -                * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
>> -                * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
>> -                * ever freed or placed on a driver-private list.
>> -                */
>> -               page->pgmap = pgmap;
>> -               page->zone_device_data = NULL;
>> -
>>                 /*
>>                  * Mark the block movable so that blocks are reserved for
>>                  * movable at startup. This will force kernel allocations
> 
> Shouldn't you be removing this function instead of just gutting it?
> I'm kind of surprised you aren't getting warnings about unused code
> since you also pulled the declaration for it from the header.
> 

Don't ask me what went wrong here, I guess I messed this up while rebasing.

-- 

Thanks,

David / dhildenb



end of thread, other threads:[~2019-08-29 16:55 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-29  7:00 [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
2019-08-29  7:00 ` [PATCH v3 01/11] mm/memremap: Get rid of memmap_init_zone_device() David Hildenbrand
2019-08-29 16:39   ` Alexander Duyck
2019-08-29 16:55     ` David Hildenbrand
2019-08-29  7:00 ` [PATCH v3 02/11] mm/memory_hotplug: Simplify shrink_pgdat_span() David Hildenbrand
2019-08-29  7:00 ` [PATCH v3 03/11] mm/memory_hotplug: We always have a zone in find_(smallest|biggest)_section_pfn David Hildenbrand
2019-08-29  7:00 ` [PATCH v3 04/11] mm/memory_hotplug: Drop local variables in shrink_zone_span() David Hildenbrand
2019-08-29  7:00 ` [PATCH v3 05/11] mm/memory_hotplug: Optimize zone shrinking code when checking for holes David Hildenbrand
2019-08-29  7:00 ` [PATCH v3 06/11] mm/memory_hotplug: Fix crashes in shrink_zone_span() David Hildenbrand
2019-08-29  7:00 ` [PATCH v3 07/11] mm/memory_hotplug: Exit early in __remove_pages() on BUGs David Hildenbrand
2019-08-29  7:00 ` [PATCH v3 08/11] mm: Exit early in set_zone_contiguous() if already contiguous David Hildenbrand
2019-08-29  7:00 ` [PATCH v3 09/11] mm/memory_hotplug: Remove pages from a zone before removing memory David Hildenbrand
2019-08-29  7:00 ` [PATCH v3 10/11] mm/memory_hotplug: Remove zone parameter from __remove_pages() David Hildenbrand
2019-08-29  7:00 ` [PATCH v3 11/11] mm/memory_hotplug: Cleanup __remove_pages() David Hildenbrand
2019-08-29  8:23 ` [PATCH v3 00/11] mm/memory_hotplug: Shrink zones before removing memory Michal Hocko
2019-08-29 11:33   ` David Hildenbrand
2019-08-29 11:43     ` David Hildenbrand
2019-08-29 12:08       ` David Hildenbrand
2019-08-29 12:15         ` Michal Hocko
2019-08-29 12:29           ` David Hildenbrand
2019-08-29 15:19             ` Michal Hocko
2019-08-29 15:28               ` David Hildenbrand
