linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 00/10] mm/memory_hotplug: Shrink zones before removing memory
@ 2019-10-01 14:40 David Hildenbrand
  2019-10-01 14:40 ` [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone David Hildenbrand
                   ` (9 more replies)
  0 siblings, 10 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-10-01 14:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, David Hildenbrand, Aneesh Kumar K . V, Andrew Morton,
	Dan Williams, Michal Hocko, Alexander Duyck, Alexander Potapenko,
	Andy Lutomirski, Anshuman Khandual, Benjamin Herrenschmidt,
	Borislav Petkov, Catalin Marinas, Christian Borntraeger,
	Christophe Leroy, Dave Hansen, Fenghua Yu, Gerald Schaefer,
	Greg Kroah-Hartman, Halil Pasic, Heiko Carstens, H. Peter Anvin,
	Ingo Molnar, Ira Weiny, Jason Gunthorpe, Jun Yao,
	Logan Gunthorpe, Mark Rutland, Masahiro Yamada,
	Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Mike Rapoport, Oscar Salvador,
	Pankaj Gupta, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Rich Felker, Robin Murphy,
	Steve Capper, Thomas Gleixner, Tom Lendacky, Tony Luck,
	Vasily Gorbik, Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon,
	Yoshinori Sato, Yu Zhao

This series fixes the access of uninitialized memmaps when shrinking
zones/nodes and when removing memory. Also, it contains all fixes for
crashes that can be triggered when removing certain namespace using
memunmap_pages() - ZONE_DEVICE, reported by Aneesh.

We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
more involved (we don't have SECTION_IS_ONLINE as an indicator), and
shrinking is only of limited use (set_zone_contiguous() cannot detect
the ZONE_DEVICE as contiguous).

We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount of
code to a minimum. Shrinking is especially necessary to keep
zone->contiguous set where possible, especially, on memory unplug of
DIMMs at zone boundaries.

--------------------------------------------------------------------------

Zones are now properly shrunk when offlining memory blocks or when
onlining failed. This allows to properly shrink zones on memory unplug
even if the separate memory blocks of a DIMM were onlined to different
zones or re-onlined to a different zone after offlining.

Example:

:/# cat /proc/zoneinfo
Node 1, zone  Movable
        spanned  0
        present  0
        managed  0
:/# echo "online_movable" > /sys/devices/system/memory/memory41/state
:/# echo "online_movable" > /sys/devices/system/memory/memory43/state
:/# cat /proc/zoneinfo
Node 1, zone  Movable
        spanned  98304
        present  65536
        managed  65536
:/# echo 0 > /sys/devices/system/memory/memory43/online
:/# cat /proc/zoneinfo
Node 1, zone  Movable
        spanned  32768
        present  32768
        managed  32768
:/# echo 0 > /sys/devices/system/memory/memory41/online
:/# cat /proc/zoneinfo
Node 1, zone  Movable
        spanned  0
        present  0
        managed  0

--------------------------------------------------------------------------

I tested this with DIMMs on x86. It sounded like Aneesh tested the
ZONE_DEVICE part :)

v4 -> v5:
- "mm/memory_hotplug: Don't access uninitialized memmaps in shrink_zone_span()"
-- Add more details why ZONE_DEVICE is special
- Include two patches from Aneesh
-- "mm/memunmap: Use the correct start and end pfn when removing pages
    from zone"
-- "mm/memmap_init: Update variable name in memmap_init_zone"

v3 -> v4:
- Drop "mm/memremap: Get rid of memmap_init_zone_device()"
-- As Alexander noticed, it was messy either way
- Drop "mm/memory_hotplug: Exit early in __remove_pages() on BUGs"
- Drop "mm: Exit early in set_zone_contiguous() if already contiguous"
- Drop "mm/memory_hotplug: Optimize zone shrinking code when checking for
  holes"
- Merged "mm/memory_hotplug: Remove pages from a zone before removing
  memory" and "mm/memory_hotplug: Remove zone parameter from
  __remove_pages()" into "mm/memory_hotplug: Shrink zones when offlining
  memory"
- Added "mm/memory_hotplug: Poison memmap in remove_pfn_range_from_zone()"
- Stop shrinking ZONE_DEVICE
- Reshuffle patches, moving all fixes to the front. Add Fixes: tags.
- Change subject/description of various patches
- Minor changes (too many to mention)

Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>

Aneesh Kumar K.V (2):
  mm/memunmap: Use the correct start and end pfn when removing pages
    from zone
  mm/memmap_init: Update variable name in memmap_init_zone

David Hildenbrand (8):
  mm/memory_hotplug: Don't access uninitialized memmaps in
    shrink_pgdat_span()
  mm/memory_hotplug: Don't access uninitialized memmaps in
    shrink_zone_span()
  mm/memory_hotplug: Shrink zones when offlining memory
  mm/memory_hotplug: Poison memmap in remove_pfn_range_from_zone()
  mm/memory_hotplug: We always have a zone in
    find_(smallest|biggest)_section_pfn
  mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()
  mm/memory_hotplug: Drop local variables in shrink_zone_span()
  mm/memory_hotplug: Cleanup __remove_pages()

 arch/arm64/mm/mmu.c            |   4 +-
 arch/ia64/mm/init.c            |   4 +-
 arch/powerpc/mm/mem.c          |   3 +-
 arch/s390/mm/init.c            |   4 +-
 arch/sh/mm/init.c              |   4 +-
 arch/x86/mm/init_32.c          |   4 +-
 arch/x86/mm/init_64.c          |   4 +-
 include/linux/memory_hotplug.h |   7 +-
 mm/memory_hotplug.c            | 184 +++++++++++----------------------
 mm/memremap.c                  |  14 ++-
 mm/page_alloc.c                |   8 +-
 11 files changed, 88 insertions(+), 152 deletions(-)

-- 
2.21.0



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone
  2019-10-01 14:40 [PATCH v5 00/10] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
@ 2019-10-01 14:40 ` David Hildenbrand
  2019-10-01 14:57   ` David Hildenbrand
  2019-10-01 14:40 ` [PATCH v5 02/10] mm/memmap_init: Update variable name in memmap_init_zone David Hildenbrand
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2019-10-01 14:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, Aneesh Kumar K.V, Dan Williams, Andrew Morton,
	Jason Gunthorpe, Logan Gunthorpe, Ira Weiny, Pankaj Gupta,
	David Hildenbrand

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>

With altmap, all the resource pfns are not initialized. While initializing
pfn, altmap reserve space is skipped. Hence when removing pfn from zone
skip pfns that were never initialized.

Update memunmap_pages to calculate start and end pfn based on altmap
values. This fixes a kernel crash that is observed when destroying
a namespace.

[   81.356173] kernel BUG at include/linux/mm.h:1107!
cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
    pc: c0000000004b9728: memunmap_pages+0x238/0x340
    lr: c0000000004b9724: memunmap_pages+0x234/0x340
...
    pid   = 3669, comm = ndctl
kernel BUG at include/linux/mm.h:1107!
[c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
[c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
[c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
[c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
[c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
[c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
[c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
[c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
[c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
[c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
[c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[ move all pfn-realted declarations into a single line ]
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memremap.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 557e53c6fb46..026788b2ac69 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
 void memunmap_pages(struct dev_pagemap *pgmap)
 {
 	struct resource *res = &pgmap->res;
-	unsigned long pfn;
+	unsigned long pfn, nr_pages, start_pfn, end_pfn;
 	int nid;
 
 	dev_pagemap_kill(pgmap);
@@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 		put_page(pfn_to_page(pfn));
 	dev_pagemap_cleanup(pgmap);
 
+	start_pfn = pfn_first(pgmap);
+	end_pfn = pfn_end(pgmap);
+	nr_pages = end_pfn - start_pfn;
+
 	/* pages are dead and unused, undo the arch mapping */
-	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
+	nid = page_to_nid(pfn_to_page(start_pfn));
 
 	mem_hotplug_begin();
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-		pfn = PHYS_PFN(res->start);
-		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
-				 PHYS_PFN(resource_size(res)), NULL);
+		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
+			       nr_pages, NULL);
 	} else {
 		arch_remove_memory(nid, res->start, resource_size(res),
 				pgmap_altmap(pgmap));
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 02/10] mm/memmap_init: Update variable name in memmap_init_zone
  2019-10-01 14:40 [PATCH v5 00/10] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
  2019-10-01 14:40 ` [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone David Hildenbrand
@ 2019-10-01 14:40 ` David Hildenbrand
  2019-10-01 14:40 ` [PATCH v5 03/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_pgdat_span() David Hildenbrand
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-10-01 14:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, Aneesh Kumar K.V, Andrew Morton, Michal Hocko,
	Vlastimil Babka, Oscar Salvador, Mel Gorman, Mike Rapoport,
	Dan Williams, Alexander Duyck, Pavel Tatashin,
	Alexander Potapenko, Pankaj Gupta, David Hildenbrand

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>

The third argument is actually number of pages. Changes the variable name
from size to nr_pages to indicate this better.

No functional change in this patch.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Alexander Potapenko <glider@google.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/page_alloc.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 15c2050c629b..b0b2d5464000 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5936,10 +5936,10 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 #ifdef CONFIG_ZONE_DEVICE
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
-				   unsigned long size,
+				   unsigned long nr_pages,
 				   struct dev_pagemap *pgmap)
 {
-	unsigned long pfn, end_pfn = start_pfn + size;
+	unsigned long pfn, end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
 	unsigned long zone_idx = zone_idx(zone);
@@ -5956,7 +5956,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	 */
 	if (altmap) {
 		start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
-		size = end_pfn - start_pfn;
+		nr_pages = end_pfn - start_pfn;
 	}
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
@@ -6003,7 +6003,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-		size, jiffies_to_msecs(jiffies - start));
+		nr_pages, jiffies_to_msecs(jiffies - start));
 }
 
 #endif
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 03/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_pgdat_span()
  2019-10-01 14:40 [PATCH v5 00/10] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
  2019-10-01 14:40 ` [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone David Hildenbrand
  2019-10-01 14:40 ` [PATCH v5 02/10] mm/memmap_init: Update variable name in memmap_init_zone David Hildenbrand
@ 2019-10-01 14:40 ` David Hildenbrand
  2019-10-01 14:40 ` [PATCH v5 04/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_zone_span() David Hildenbrand
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-10-01 14:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang,
	Aneesh Kumar K . V

We might use the nid of memmaps that were never initialized. For
example, if the memmap was poisoned, we will crash the kernel in
pfn_to_nid() right now. Let's use the calculated boundaries of the separate
zones instead. This now also avoids having to iterate over a whole bunch of
subsections again, after shrinking one zone.

Before commit d0dc12e86b31 ("mm/memory_hotplug: optimize memory
hotplug"), the memmap was initialized to 0 and the node was set to the
right value. After that commit, the node might be garbage.

We'll have to fix shrink_zone_span() next.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 72 ++++++++++-----------------------------------
 1 file changed, 15 insertions(+), 57 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 680b4b3e57d9..86b4dc18e831 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -436,67 +436,25 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 	zone_span_writeunlock(zone);
 }
 
-static void shrink_pgdat_span(struct pglist_data *pgdat,
-			      unsigned long start_pfn, unsigned long end_pfn)
+static void update_pgdat_span(struct pglist_data *pgdat)
 {
-	unsigned long pgdat_start_pfn = pgdat->node_start_pfn;
-	unsigned long p = pgdat_end_pfn(pgdat); /* pgdat_end_pfn namespace clash */
-	unsigned long pgdat_end_pfn = p;
-	unsigned long pfn;
-	int nid = pgdat->node_id;
-
-	if (pgdat_start_pfn == start_pfn) {
-		/*
-		 * If the section is smallest section in the pgdat, it need
-		 * shrink pgdat->node_start_pfn and pgdat->node_spanned_pages.
-		 * In this case, we find second smallest valid mem_section
-		 * for shrinking zone.
-		 */
-		pfn = find_smallest_section_pfn(nid, NULL, end_pfn,
-						pgdat_end_pfn);
-		if (pfn) {
-			pgdat->node_start_pfn = pfn;
-			pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
-		}
-	} else if (pgdat_end_pfn == end_pfn) {
-		/*
-		 * If the section is biggest section in the pgdat, it need
-		 * shrink pgdat->node_spanned_pages.
-		 * In this case, we find second biggest valid mem_section for
-		 * shrinking zone.
-		 */
-		pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn,
-					       start_pfn);
-		if (pfn)
-			pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
-	}
-
-	/*
-	 * If the section is not biggest or smallest mem_section in the pgdat,
-	 * it only creates a hole in the pgdat. So in this case, we need not
-	 * change the pgdat.
-	 * But perhaps, the pgdat has only hole data. Thus it check the pgdat
-	 * has only hole or not.
-	 */
-	pfn = pgdat_start_pfn;
-	for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SUBSECTION) {
-		if (unlikely(!pfn_valid(pfn)))
-			continue;
-
-		if (pfn_to_nid(pfn) != nid)
-			continue;
+	unsigned long node_start_pfn = 0, node_end_pfn = 0;
+	struct zone *zone;
 
-		/* Skip range to be removed */
-		if (pfn >= start_pfn && pfn < end_pfn)
-			continue;
+	for (zone = pgdat->node_zones;
+	     zone < pgdat->node_zones + MAX_NR_ZONES; zone++) {
+		unsigned long zone_end_pfn = zone->zone_start_pfn +
+					     zone->spanned_pages;
 
-		/* If we find valid section, we have nothing to do */
-		return;
+		/* No need to lock the zones, they can't change. */
+		if (zone_end_pfn > node_end_pfn)
+			node_end_pfn = zone_end_pfn;
+		if (zone->zone_start_pfn < node_start_pfn)
+			node_start_pfn = zone->zone_start_pfn;
 	}
 
-	/* The pgdat has no valid section */
-	pgdat->node_start_pfn = 0;
-	pgdat->node_spanned_pages = 0;
+	pgdat->node_start_pfn = node_start_pfn;
+	pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
 }
 
 static void __remove_zone(struct zone *zone, unsigned long start_pfn,
@@ -507,7 +465,7 @@ static void __remove_zone(struct zone *zone, unsigned long start_pfn,
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
 	shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
-	shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
+	update_pgdat_span(pgdat);
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 }
 
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 04/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_zone_span()
  2019-10-01 14:40 [PATCH v5 00/10] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
                   ` (2 preceding siblings ...)
  2019-10-01 14:40 ` [PATCH v5 03/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_pgdat_span() David Hildenbrand
@ 2019-10-01 14:40 ` David Hildenbrand
  2019-10-01 23:47   ` kbuild test robot
  2019-10-02  0:06   ` kbuild test robot
  2019-10-01 14:40 ` [PATCH v5 05/10] mm/memory_hotplug: Shrink zones when offlining memory David Hildenbrand
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-10-01 14:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Aneesh Kumar K . V

Let's limit shrinking to !ZONE_DEVICE so we can fix the current code. We
should never try to touch the memmap of offline sections where we could
have uninitialized memmaps and could trigger BUGs when calling
page_to_nid() on poisoned pages.

There is no reliable way to distinguish an uninitialized memmap from an
initialized memmap that belongs to ZONE_DEVICE, as we don't have
anything like SECTION_IS_ONLINE we can use similar to
pfn_to_online_section() for !ZONE_DEVICE memory. E.g.,
set_zone_contiguous() similarly relies on pfn_to_online_section() and
will therefore never set a ZONE_DEVICE zone consecutive. Stopping to
shrink the ZONE_DEVICE therefore results in no observable changes,
besides /proc/zoneinfo indicating different boundaries - something we
can totally live with.

Before commit d0dc12e86b31 ("mm/memory_hotplug: optimize memory
hotplug"), the memmap was initialized with 0 and the node with the
right value. So the zone might be wrong but not garbage. After that
commit, both the zone and the node will be garbage when touching
uninitialized memmaps.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 86b4dc18e831..afed8331332b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -331,7 +331,7 @@ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
 				     unsigned long end_pfn)
 {
 	for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SUBSECTION) {
-		if (unlikely(!pfn_valid(start_pfn)))
+		if (unlikely(!pfn_to_online_page(start_pfn)))
 			continue;
 
 		if (unlikely(pfn_to_nid(start_pfn) != nid))
@@ -356,7 +356,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
 	/* pfn is the end pfn of a memory section. */
 	pfn = end_pfn - 1;
 	for (; pfn >= start_pfn; pfn -= PAGES_PER_SUBSECTION) {
-		if (unlikely(!pfn_valid(pfn)))
+		if (unlikely(!pfn_to_online_page(pfn)))
 			continue;
 
 		if (unlikely(pfn_to_nid(pfn) != nid))
@@ -415,7 +415,7 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 	 */
 	pfn = zone_start_pfn;
 	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUBSECTION) {
-		if (unlikely(!pfn_valid(pfn)))
+		if (unlikely(!pfn_to_online_page(pfn)))
 			continue;
 
 		if (page_zone(pfn_to_page(pfn)) != zone)
@@ -463,6 +463,14 @@ static void __remove_zone(struct zone *zone, unsigned long start_pfn,
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	unsigned long flags;
 
+	/*
+	 * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
+	 * we will not try to shrink the zones - which is okay as
+	 * set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
+	 */
+	if (zone_idx(zone) == ZONE_DEVICE)
+		return;
+
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
 	shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
 	update_pgdat_span(pgdat);
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 05/10] mm/memory_hotplug: Shrink zones when offlining memory
  2019-10-01 14:40 [PATCH v5 00/10] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
                   ` (3 preceding siblings ...)
  2019-10-01 14:40 ` [PATCH v5 04/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_zone_span() David Hildenbrand
@ 2019-10-01 14:40 ` David Hildenbrand
  2019-10-01 14:40 ` [PATCH v5 06/10] mm/memory_hotplug: Poison memmap in remove_pfn_range_from_zone() David Hildenbrand
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-10-01 14:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, David Hildenbrand, Catalin Marinas, Will Deacon,
	Tony Luck, Fenghua Yu, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Yoshinori Sato, Rich Felker, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Andrew Morton,
	Mark Rutland, Steve Capper, Mike Rapoport, Anshuman Khandual,
	Yu Zhao, Jun Yao, Robin Murphy, Michal Hocko, Oscar Salvador,
	Matthew Wilcox (Oracle),
	Christophe Leroy, Aneesh Kumar K.V, Pavel Tatashin,
	Gerald Schaefer, Halil Pasic, Tom Lendacky, Greg Kroah-Hartman,
	Masahiro Yamada, Dan Williams, Wei Yang, Qian Cai,
	Jason Gunthorpe, Logan Gunthorpe, Ira Weiny

We currently try to shrink a single zone when removing memory. We use the
zone of the first page of the memory we are removing. If that memmap was
never initialized (e.g., memory was never onlined), we will read garbage
and can trigger kernel BUGs (due to a stale pointer):

:/# [   23.912993] BUG: unable to handle page fault for address: 000000000000353d
[   23.914219] #PF: supervisor write access in kernel mode
[   23.915199] #PF: error_code(0x0002) - not-present page
[   23.916160] PGD 0 P4D 0
[   23.916627] Oops: 0002 [#1] SMP PTI
[   23.917256] CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
[   23.918900] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
[   23.921194] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   23.922249] RIP: 0010:clear_zone_contiguous+0x5/0x10
[   23.923173] Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
[   23.926876] RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
[   23.927928] RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
[   23.929458] RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
[   23.930899] RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
[   23.932362] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
[   23.933603] R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
[   23.934913] FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
[   23.936294] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   23.937481] CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
[   23.938687] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   23.939889] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   23.941168] Call Trace:
[   23.941580]  __remove_pages+0x4b/0x640
[   23.942303]  ? mark_held_locks+0x49/0x70
[   23.943149]  arch_remove_memory+0x63/0x8d
[   23.943921]  try_remove_memory+0xdb/0x130
[   23.944766]  ? walk_memory_blocks+0x7f/0x9e
[   23.945616]  __remove_memory+0xa/0x11
[   23.946274]  acpi_memory_device_remove+0x70/0x100
[   23.947308]  acpi_bus_trim+0x55/0x90
[   23.947914]  acpi_device_hotplug+0x227/0x3a0
[   23.948714]  acpi_hotplug_work_fn+0x1a/0x30
[   23.949433]  process_one_work+0x221/0x550
[   23.950190]  worker_thread+0x50/0x3b0
[   23.950993]  kthread+0x105/0x140
[   23.951644]  ? process_one_work+0x550/0x550
[   23.952508]  ? kthread_park+0x80/0x80
[   23.953367]  ret_from_fork+0x3a/0x50
[   23.954025] Modules linked in:
[   23.954613] CR2: 000000000000353d
[   23.955248] ---[ end trace 93d982b1fb3e1a69 ]---

Instead, shrink the zones when offlining memory or when onlining failed.
Introduce and use remove_pfn_range_from_zone(() for that. We now properly
shrink the zones, even if we have DIMMs whereby
- Some memory blocks fall into no zone (never onlined)
- Some memory blocks fall into multiple zones (offlined+re-onlined)
- Multiple memory blocks that fall into different zones

Drop the zone parameter (with a potential dubious value) from
__remove_pages() and __remove_section().

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Jun Yao <yaojun8558363@gmail.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-ia64@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/arm64/mm/mmu.c            |  4 +---
 arch/ia64/mm/init.c            |  4 +---
 arch/powerpc/mm/mem.c          |  3 +--
 arch/s390/mm/init.c            |  4 +---
 arch/sh/mm/init.c              |  4 +---
 arch/x86/mm/init_32.c          |  4 +---
 arch/x86/mm/init_64.c          |  4 +---
 include/linux/memory_hotplug.h |  7 +++++--
 mm/memory_hotplug.c            | 31 ++++++++++++++++---------------
 mm/memremap.c                  |  3 +--
 10 files changed, 29 insertions(+), 39 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 60c929f3683b..d10247fab0fd 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1069,7 +1069,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 
 	/*
 	 * FIXME: Cleanup page tables (also in arch_add_memory() in case
@@ -1078,7 +1077,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 	 * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
 	 * unlocked yet.
 	 */
-	zone = page_zone(pfn_to_page(start_pfn));
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 }
 #endif
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index bf9df2625bc8..a6dd80a2c939 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -689,9 +689,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 }
 #endif
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index be941d382c8d..97e5922cb52e 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -130,10 +130,9 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap);
 	int ret;
 
-	__remove_pages(page_zone(page), start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 
 	/* Remove htab bolted mappings for this section of memory */
 	start = (unsigned long)__va(start);
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index a124f19f7b3c..c1d96e588152 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -291,10 +291,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 	vmem_remove_mapping(start, size);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index dfdbaa50946e..d1b1ff2be17a 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -434,9 +434,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 930edeb41ec3..0a74407ef92e 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -865,10 +865,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 }
 #endif
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a6b5c653727b..b8541d77452c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1212,10 +1212,8 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap);
-	struct zone *zone = page_zone(page);
 
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(start_pfn, nr_pages, altmap);
 	kernel_physical_mapping_remove(start, start + size);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index bc477e98a310..517b70943732 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -126,8 +126,8 @@ static inline bool movable_node_is_enabled(void)
 
 extern void arch_remove_memory(int nid, u64 start, u64 size,
 			       struct vmem_altmap *altmap);
-extern void __remove_pages(struct zone *zone, unsigned long start_pfn,
-			   unsigned long nr_pages, struct vmem_altmap *altmap);
+extern void __remove_pages(unsigned long start_pfn, unsigned long nr_pages,
+			   struct vmem_altmap *altmap);
 
 /* reasonably generic interface to expand the physical pages */
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
@@ -346,6 +346,9 @@ extern int add_memory(int nid, u64 start, u64 size);
 extern int add_memory_resource(int nid, struct resource *resource);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
+extern void remove_pfn_range_from_zone(struct zone *zone,
+				       unsigned long start_pfn,
+				       unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern int sparse_add_section(int nid, unsigned long pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index afed8331332b..cef909ebd807 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -457,8 +457,9 @@ static void update_pgdat_span(struct pglist_data *pgdat)
 	pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
 }
 
-static void __remove_zone(struct zone *zone, unsigned long start_pfn,
-		unsigned long nr_pages)
+void __ref remove_pfn_range_from_zone(struct zone *zone,
+				      unsigned long start_pfn,
+				      unsigned long nr_pages)
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	unsigned long flags;
@@ -471,28 +472,30 @@ static void __remove_zone(struct zone *zone, unsigned long start_pfn,
 	if (zone_idx(zone) == ZONE_DEVICE)
 		return;
 
+	clear_zone_contiguous(zone);
+
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
 	shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
 	update_pgdat_span(pgdat);
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
+
+	set_zone_contiguous(zone);
 }
 
-static void __remove_section(struct zone *zone, unsigned long pfn,
-		unsigned long nr_pages, unsigned long map_offset,
-		struct vmem_altmap *altmap)
+static void __remove_section(unsigned long pfn, unsigned long nr_pages,
+			     unsigned long map_offset,
+			     struct vmem_altmap *altmap)
 {
 	struct mem_section *ms = __nr_to_section(pfn_to_section_nr(pfn));
 
 	if (WARN_ON_ONCE(!valid_section(ms)))
 		return;
 
-	__remove_zone(zone, pfn, nr_pages);
 	sparse_remove_section(ms, pfn, nr_pages, map_offset, altmap);
 }
 
 /**
- * __remove_pages() - remove sections of pages from a zone
- * @zone: zone from which pages need to be removed
+ * __remove_pages() - remove sections of pages
  * @pfn: starting pageframe (must be aligned to start of a section)
  * @nr_pages: number of pages to remove (must be multiple of section size)
  * @altmap: alternative device page map or %NULL if default memmap is used
@@ -502,16 +505,14 @@ static void __remove_section(struct zone *zone, unsigned long pfn,
  * sure that pages are marked reserved and zones are adjust properly by
  * calling offline_pages().
  */
-void __remove_pages(struct zone *zone, unsigned long pfn,
-		    unsigned long nr_pages, struct vmem_altmap *altmap)
+void __remove_pages(unsigned long pfn, unsigned long nr_pages,
+		    struct vmem_altmap *altmap)
 {
 	unsigned long map_offset = 0;
 	unsigned long nr, start_sec, end_sec;
 
 	map_offset = vmem_altmap_offset(altmap);
 
-	clear_zone_contiguous(zone);
-
 	if (check_pfn_span(pfn, nr_pages, "remove"))
 		return;
 
@@ -523,13 +524,11 @@ void __remove_pages(struct zone *zone, unsigned long pfn,
 		cond_resched();
 		pfns = min(nr_pages, PAGES_PER_SECTION
 				- (pfn & ~PAGE_SECTION_MASK));
-		__remove_section(zone, pfn, pfns, map_offset, altmap);
+		__remove_section(pfn, pfns, map_offset, altmap);
 		pfn += pfns;
 		nr_pages -= pfns;
 		map_offset = 0;
 	}
-
-	set_zone_contiguous(zone);
 }
 
 int set_online_page_callback(online_page_callback_t callback)
@@ -857,6 +856,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 		 (unsigned long long) pfn << PAGE_SHIFT,
 		 (((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
 	memory_notify(MEM_CANCEL_ONLINE, &arg);
+	remove_pfn_range_from_zone(zone, pfn, nr_pages);
 	mem_hotplug_done();
 	return ret;
 }
@@ -1603,6 +1603,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	writeback_set_ratelimit();
 
 	memory_notify(MEM_OFFLINE, &arg);
+	remove_pfn_range_from_zone(zone, start_pfn, nr_pages);
 	mem_hotplug_done();
 	return 0;
 
diff --git a/mm/memremap.c b/mm/memremap.c
index 026788b2ac69..734afeaad811 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -140,8 +140,7 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 
 	mem_hotplug_begin();
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
-			       nr_pages, NULL);
+		__remove_pages(start_pfn, nr_pages, NULL);
 	} else {
 		arch_remove_memory(nid, res->start, resource_size(res),
 				pgmap_altmap(pgmap));
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 06/10] mm/memory_hotplug: Poison memmap in remove_pfn_range_from_zone()
  2019-10-01 14:40 [PATCH v5 00/10] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
                   ` (4 preceding siblings ...)
  2019-10-01 14:40 ` [PATCH v5 05/10] mm/memory_hotplug: Shrink zones when offlining memory David Hildenbrand
@ 2019-10-01 14:40 ` David Hildenbrand
  2019-10-01 14:40 ` [PATCH v5 07/10] mm/memory_hotplug: We always have a zone in find_(smallest|biggest)_section_pfn David Hildenbrand
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-10-01 14:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams

Let's poison the pages similar to when adding new memory in
sparse_add_section(). Also call remove_pfn_range_from_zone() from
memunmap_pages(), so we can poison the memmap from there as well.

While at it, calculate the pfn in memunmap_pages() only once.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 3 +++
 mm/memremap.c       | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index cef909ebd807..640309236a58 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -464,6 +464,9 @@ void __ref remove_pfn_range_from_zone(struct zone *zone,
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	unsigned long flags;
 
+	/* Poison struct pages because they are now uninitialized again. */
+	page_init_poison(pfn_to_page(start_pfn), sizeof(struct page) * nr_pages);
+
 	/*
 	 * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
 	 * we will not try to shrink the zones - which is okay as
diff --git a/mm/memremap.c b/mm/memremap.c
index 734afeaad811..371939f92b69 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -139,6 +139,8 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 	nid = page_to_nid(pfn_to_page(start_pfn));
 
 	mem_hotplug_begin();
+	remove_pfn_range_from_zone(page_zone(pfn_to_page(start_pfn)),
+				   start_pfn, nr_pages);
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
 		__remove_pages(start_pfn, nr_pages, NULL);
 	} else {
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 07/10] mm/memory_hotplug: We always have a zone in find_(smallest|biggest)_section_pfn
  2019-10-01 14:40 [PATCH v5 00/10] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
                   ` (5 preceding siblings ...)
  2019-10-01 14:40 ` [PATCH v5 06/10] mm/memory_hotplug: Poison memmap in remove_pfn_range_from_zone() David Hildenbrand
@ 2019-10-01 14:40 ` David Hildenbrand
  2019-10-01 14:40 ` [PATCH v5 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span() David Hildenbrand
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-10-01 14:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang

With shrink_pgdat_span() out of the way, we now always have a valid
zone.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 640309236a58..273833612774 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -337,7 +337,7 @@ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
 		if (unlikely(pfn_to_nid(start_pfn) != nid))
 			continue;
 
-		if (zone && zone != page_zone(pfn_to_page(start_pfn)))
+		if (zone != page_zone(pfn_to_page(start_pfn)))
 			continue;
 
 		return start_pfn;
@@ -362,7 +362,7 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
 		if (unlikely(pfn_to_nid(pfn) != nid))
 			continue;
 
-		if (zone && zone != page_zone(pfn_to_page(pfn)))
+		if (zone != page_zone(pfn_to_page(pfn)))
 			continue;
 
 		return pfn;
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()
  2019-10-01 14:40 [PATCH v5 00/10] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
                   ` (6 preceding siblings ...)
  2019-10-01 14:40 ` [PATCH v5 07/10] mm/memory_hotplug: We always have a zone in find_(smallest|biggest)_section_pfn David Hildenbrand
@ 2019-10-01 14:40 ` David Hildenbrand
  2019-10-01 14:40 ` [PATCH v5 09/10] mm/memory_hotplug: Drop local variables " David Hildenbrand
  2019-10-01 14:40 ` [PATCH v5 10/10] mm/memory_hotplug: Cleanup __remove_pages() David Hildenbrand
  9 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-10-01 14:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang

If we have holes, the holes will automatically get detected and removed
once we remove the next bigger/smaller section. The extra checks can
go.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 34 +++++++---------------------------
 1 file changed, 7 insertions(+), 27 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 273833612774..22b3623768c8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 		if (pfn) {
 			zone->zone_start_pfn = pfn;
 			zone->spanned_pages = zone_end_pfn - pfn;
+		} else {
+			zone->zone_start_pfn = 0;
+			zone->spanned_pages = 0;
 		}
 	} else if (zone_end_pfn == end_pfn) {
 		/*
@@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 					       start_pfn);
 		if (pfn)
 			zone->spanned_pages = pfn - zone_start_pfn + 1;
+		else {
+			zone->zone_start_pfn = 0;
+			zone->spanned_pages = 0;
+		}
 	}
-
-	/*
-	 * The section is not biggest or smallest mem_section in the zone, it
-	 * only creates a hole in the zone. So in this case, we need not
-	 * change the zone. But perhaps, the zone has only hole data. Thus
-	 * it check the zone has only hole or not.
-	 */
-	pfn = zone_start_pfn;
-	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUBSECTION) {
-		if (unlikely(!pfn_to_online_page(pfn)))
-			continue;
-
-		if (page_zone(pfn_to_page(pfn)) != zone)
-			continue;
-
-		/* Skip range to be removed */
-		if (pfn >= start_pfn && pfn < end_pfn)
-			continue;
-
-		/* If we find valid section, we have nothing to do */
-		zone_span_writeunlock(zone);
-		return;
-	}
-
-	/* The zone has no valid section */
-	zone->zone_start_pfn = 0;
-	zone->spanned_pages = 0;
 	zone_span_writeunlock(zone);
 }
 
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 09/10] mm/memory_hotplug: Drop local variables in shrink_zone_span()
  2019-10-01 14:40 [PATCH v5 00/10] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
                   ` (7 preceding siblings ...)
  2019-10-01 14:40 ` [PATCH v5 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span() David Hildenbrand
@ 2019-10-01 14:40 ` David Hildenbrand
  2019-10-01 14:40 ` [PATCH v5 10/10] mm/memory_hotplug: Cleanup __remove_pages() David Hildenbrand
  9 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-10-01 14:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang

Get rid of the unnecessary local variables.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 22b3623768c8..ffb514e3b090 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -374,14 +374,11 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
 static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 			     unsigned long end_pfn)
 {
-	unsigned long zone_start_pfn = zone->zone_start_pfn;
-	unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
-	unsigned long zone_end_pfn = z;
 	unsigned long pfn;
 	int nid = zone_to_nid(zone);
 
 	zone_span_writelock(zone);
-	if (zone_start_pfn == start_pfn) {
+	if (zone->zone_start_pfn == start_pfn) {
 		/*
 		 * If the section is smallest section in the zone, it need
 		 * shrink zone->zone_start_pfn and zone->zone_spanned_pages.
@@ -389,25 +386,25 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 		 * for shrinking zone.
 		 */
 		pfn = find_smallest_section_pfn(nid, zone, end_pfn,
-						zone_end_pfn);
+						zone_end_pfn(zone));
 		if (pfn) {
+			zone->spanned_pages = zone_end_pfn(zone) - pfn;
 			zone->zone_start_pfn = pfn;
-			zone->spanned_pages = zone_end_pfn - pfn;
 		} else {
 			zone->zone_start_pfn = 0;
 			zone->spanned_pages = 0;
 		}
-	} else if (zone_end_pfn == end_pfn) {
+	} else if (zone_end_pfn(zone) == end_pfn) {
 		/*
 		 * If the section is biggest section in the zone, it need
 		 * shrink zone->spanned_pages.
 		 * In this case, we find second biggest valid mem_section for
 		 * shrinking zone.
 		 */
-		pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
+		pfn = find_biggest_section_pfn(nid, zone, zone->zone_start_pfn,
 					       start_pfn);
 		if (pfn)
-			zone->spanned_pages = pfn - zone_start_pfn + 1;
+			zone->spanned_pages = pfn - zone->zone_start_pfn + 1;
 		else {
 			zone->zone_start_pfn = 0;
 			zone->spanned_pages = 0;
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 10/10] mm/memory_hotplug: Cleanup __remove_pages()
  2019-10-01 14:40 [PATCH v5 00/10] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
                   ` (8 preceding siblings ...)
  2019-10-01 14:40 ` [PATCH v5 09/10] mm/memory_hotplug: Drop local variables " David Hildenbrand
@ 2019-10-01 14:40 ` David Hildenbrand
  9 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-10-01 14:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang

Let's drop the basically unused section stuff and simplify.

Also, let's use a shorter variant to calculate the number of pages to
the next section boundary.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ffb514e3b090..0fa99e5a657e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -488,25 +488,20 @@ static void __remove_section(unsigned long pfn, unsigned long nr_pages,
 void __remove_pages(unsigned long pfn, unsigned long nr_pages,
 		    struct vmem_altmap *altmap)
 {
+	const unsigned long end_pfn = pfn + nr_pages;
+	unsigned long cur_nr_pages;
 	unsigned long map_offset = 0;
-	unsigned long nr, start_sec, end_sec;
 
 	map_offset = vmem_altmap_offset(altmap);
 
 	if (check_pfn_span(pfn, nr_pages, "remove"))
 		return;
 
-	start_sec = pfn_to_section_nr(pfn);
-	end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
-	for (nr = start_sec; nr <= end_sec; nr++) {
-		unsigned long pfns;
-
+	for (; pfn < end_pfn; pfn += cur_nr_pages) {
 		cond_resched();
-		pfns = min(nr_pages, PAGES_PER_SECTION
-				- (pfn & ~PAGE_SECTION_MASK));
-		__remove_section(pfn, pfns, map_offset, altmap);
-		pfn += pfns;
-		nr_pages -= pfns;
+		/* Select all remaining pages up to the next section boundary */
+		cur_nr_pages = min(end_pfn - pfn, -(pfn | PAGE_SECTION_MASK));
+		__remove_section(pfn, cur_nr_pages, map_offset, altmap);
 		map_offset = 0;
 	}
 }
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone
  2019-10-01 14:40 ` [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone David Hildenbrand
@ 2019-10-01 14:57   ` David Hildenbrand
  2019-10-01 15:03     ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2019-10-01 14:57 UTC (permalink / raw)
  To: linux-kernel, Dan Williams
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, Aneesh Kumar K.V, Andrew Morton, Jason Gunthorpe,
	Logan Gunthorpe, Ira Weiny, Pankaj Gupta

On 01.10.19 16:40, David Hildenbrand wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
> 
> With altmap, all the resource pfns are not initialized. While initializing
> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
> skip pfns that were never initialized.
> 
> Update memunmap_pages to calculate start and end pfn based on altmap
> values. This fixes a kernel crash that is observed when destroying
> a namespace.
> 
> [   81.356173] kernel BUG at include/linux/mm.h:1107!
> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>     pc: c0000000004b9728: memunmap_pages+0x238/0x340
>     lr: c0000000004b9724: memunmap_pages+0x234/0x340
> ...
>     pid   = 3669, comm = ndctl
> kernel BUG at include/linux/mm.h:1107!
> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
> 
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> [ move all pfn-realted declarations into a single line ]
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memremap.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 557e53c6fb46..026788b2ac69 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>  void memunmap_pages(struct dev_pagemap *pgmap)
>  {
>  	struct resource *res = &pgmap->res;
> -	unsigned long pfn;
> +	unsigned long pfn, nr_pages, start_pfn, end_pfn;
>  	int nid;
>  
>  	dev_pagemap_kill(pgmap);
> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>  		put_page(pfn_to_page(pfn));
>  	dev_pagemap_cleanup(pgmap);
>  
> +	start_pfn = pfn_first(pgmap);
> +	end_pfn = pfn_end(pgmap);
> +	nr_pages = end_pfn - start_pfn;
> +
>  	/* pages are dead and unused, undo the arch mapping */
> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
> +	nid = page_to_nid(pfn_to_page(start_pfn));
>  
>  	mem_hotplug_begin();
>  	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> -		pfn = PHYS_PFN(res->start);
> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
> -				 PHYS_PFN(resource_size(res)), NULL);
> +		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
> +			       nr_pages, NULL);
>  	} else {
>  		arch_remove_memory(nid, res->start, resource_size(res),
>  				pgmap_altmap(pgmap));
> 

Aneesh, I was wondering why the use of "res->start" is correct (and we
shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
could review.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone
  2019-10-01 14:57   ` David Hildenbrand
@ 2019-10-01 15:03     ` David Hildenbrand
  2019-10-03 16:48       ` Aneesh Kumar K.V
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2019-10-01 15:03 UTC (permalink / raw)
  To: linux-kernel, Dan Williams
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, Aneesh Kumar K.V, Andrew Morton, Jason Gunthorpe,
	Logan Gunthorpe, Ira Weiny, Pankaj Gupta

On 01.10.19 16:57, David Hildenbrand wrote:
> On 01.10.19 16:40, David Hildenbrand wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>
>> With altmap, all the resource pfns are not initialized. While initializing
>> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
>> skip pfns that were never initialized.
>>
>> Update memunmap_pages to calculate start and end pfn based on altmap
>> values. This fixes a kernel crash that is observed when destroying
>> a namespace.
>>
>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>     pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>     lr: c0000000004b9724: memunmap_pages+0x234/0x340
>> ...
>>     pid   = 3669, comm = ndctl
>> kernel BUG at include/linux/mm.h:1107!
>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>> Cc: Logan Gunthorpe <logang@deltatee.com>
>> Cc: Ira Weiny <ira.weiny@intel.com>
>> Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> [ move all pfn-realted declarations into a single line ]
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  mm/memremap.c | 13 ++++++++-----
>>  1 file changed, 8 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 557e53c6fb46..026788b2ac69 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>  void memunmap_pages(struct dev_pagemap *pgmap)
>>  {
>>  	struct resource *res = &pgmap->res;
>> -	unsigned long pfn;
>> +	unsigned long pfn, nr_pages, start_pfn, end_pfn;
>>  	int nid;
>>  
>>  	dev_pagemap_kill(pgmap);
>> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>  		put_page(pfn_to_page(pfn));
>>  	dev_pagemap_cleanup(pgmap);
>>  
>> +	start_pfn = pfn_first(pgmap);
>> +	end_pfn = pfn_end(pgmap);
>> +	nr_pages = end_pfn - start_pfn;
>> +
>>  	/* pages are dead and unused, undo the arch mapping */
>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>> +	nid = page_to_nid(pfn_to_page(start_pfn));
>>  
>>  	mem_hotplug_begin();
>>  	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>> -		pfn = PHYS_PFN(res->start);
>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>> -				 PHYS_PFN(resource_size(res)), NULL);
>> +		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
>> +			       nr_pages, NULL);
>>  	} else {
>>  		arch_remove_memory(nid, res->start, resource_size(res),
>>  				pgmap_altmap(pgmap));
>>
> 
> Aneesh, I was wondering why the use of "res->start" is correct (and we
> shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
> could review.
> 

To be more precise, I wonder if it should actually be

__remove_pages(page_zone(pfn_to_page(start_pfn)), res->start,
               resource_size(res))

IOW, keep calling __remove_pages() with the same parameters but read
nid/zone from the offset one.

Hope some memunmap_pages() expert can clarify.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 04/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_zone_span()
  2019-10-01 14:40 ` [PATCH v5 04/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_zone_span() David Hildenbrand
@ 2019-10-01 23:47   ` kbuild test robot
  2019-10-02  0:06   ` kbuild test robot
  1 sibling, 0 replies; 22+ messages in thread
From: kbuild test robot @ 2019-10-01 23:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kbuild-all, linux-kernel, linux-mm, linux-arm-kernel, linux-ia64,
	linuxppc-dev, linux-s390, linux-sh, David Hildenbrand,
	Andrew Morton, Oscar Salvador, Michal Hocko, Pavel Tatashin,
	Dan Williams, Aneesh Kumar K . V

[-- Attachment #1: Type: text/plain, Size: 1962 bytes --]

Hi David,

I love your patch! Yet something to improve:

[auto build test ERROR on mmotm/master]

url:    https://github.com/0day-ci/linux/commits/David-Hildenbrand/mm-memory_hotplug-Shrink-zones-before-removing-memory/20191002-054310
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-g004-201939 (attached as .config)
compiler: gcc-7 (Debian 7.4.0-13) 7.4.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/memory_hotplug.c: In function '__remove_zone':
>> mm/memory_hotplug.c:471:24: error: 'ZONE_DEVICE' undeclared (first use in this function); did you mean 'ZONE_MOVABLE'?
     if (zone_idx(zone) == ZONE_DEVICE)
                           ^~~~~~~~~~~
                           ZONE_MOVABLE
   mm/memory_hotplug.c:471:24: note: each undeclared identifier is reported only once for each function it appears in

vim +471 mm/memory_hotplug.c

   459	
   460	static void __remove_zone(struct zone *zone, unsigned long start_pfn,
   461			unsigned long nr_pages)
   462	{
   463		struct pglist_data *pgdat = zone->zone_pgdat;
   464		unsigned long flags;
   465	
   466		/*
   467		 * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
   468		 * we will not try to shrink the zones - which is okay as
   469		 * set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
   470		 */
 > 471		if (zone_idx(zone) == ZONE_DEVICE)
   472			return;
   473	
   474		pgdat_resize_lock(zone->zone_pgdat, &flags);
   475		shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
   476		update_pgdat_span(pgdat);
   477		pgdat_resize_unlock(zone->zone_pgdat, &flags);
   478	}
   479	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29129 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 04/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_zone_span()
  2019-10-01 14:40 ` [PATCH v5 04/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_zone_span() David Hildenbrand
  2019-10-01 23:47   ` kbuild test robot
@ 2019-10-02  0:06   ` kbuild test robot
  2019-10-02  7:05     ` David Hildenbrand
  1 sibling, 1 reply; 22+ messages in thread
From: kbuild test robot @ 2019-10-02  0:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kbuild-all, linux-kernel, linux-mm, linux-arm-kernel, linux-ia64,
	linuxppc-dev, linux-s390, linux-sh, David Hildenbrand,
	Andrew Morton, Oscar Salvador, Michal Hocko, Pavel Tatashin,
	Dan Williams, Aneesh Kumar K . V

[-- Attachment #1: Type: text/plain, Size: 3001 bytes --]

Hi David,

I love your patch! Perhaps something to improve:

[auto build test WARNING on mmotm/master]

url:    https://github.com/0day-ci/linux/commits/David-Hildenbrand/mm-memory_hotplug-Shrink-zones-before-removing-memory/20191002-054310
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-b002-201939 (attached as .config)
compiler: gcc-7 (Debian 7.4.0-13) 7.4.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from include/asm-generic/bug.h:5:0,
                    from arch/x86/include/asm/bug.h:83,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/mm.h:9,
                    from mm/memory_hotplug.c:9:
   mm/memory_hotplug.c: In function '__remove_zone':
   mm/memory_hotplug.c:471:24: error: 'ZONE_DEVICE' undeclared (first use in this function); did you mean 'ZONE_MOVABLE'?
     if (zone_idx(zone) == ZONE_DEVICE)
                           ^
   include/linux/compiler.h:58:52: note: in definition of macro '__trace_if_var'
    #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
                                                       ^~~~
>> mm/memory_hotplug.c:471:2: note: in expansion of macro 'if'
     if (zone_idx(zone) == ZONE_DEVICE)
     ^~
   mm/memory_hotplug.c:471:24: note: each undeclared identifier is reported only once for each function it appears in
     if (zone_idx(zone) == ZONE_DEVICE)
                           ^
   include/linux/compiler.h:58:52: note: in definition of macro '__trace_if_var'
    #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
                                                       ^~~~
>> mm/memory_hotplug.c:471:2: note: in expansion of macro 'if'
     if (zone_idx(zone) == ZONE_DEVICE)
     ^~

vim +/if +471 mm/memory_hotplug.c

   459	
   460	static void __remove_zone(struct zone *zone, unsigned long start_pfn,
   461			unsigned long nr_pages)
   462	{
   463		struct pglist_data *pgdat = zone->zone_pgdat;
   464		unsigned long flags;
   465	
   466		/*
   467		 * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
   468		 * we will not try to shrink the zones - which is okay as
   469		 * set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
   470		 */
 > 471		if (zone_idx(zone) == ZONE_DEVICE)
   472			return;
   473	
   474		pgdat_resize_lock(zone->zone_pgdat, &flags);
   475		shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
   476		update_pgdat_span(pgdat);
   477		pgdat_resize_unlock(zone->zone_pgdat, &flags);
   478	}
   479	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30379 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 04/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_zone_span()
  2019-10-02  0:06   ` kbuild test robot
@ 2019-10-02  7:05     ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-10-02  7:05 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, linux-kernel, linux-mm, linux-arm-kernel, linux-ia64,
	linuxppc-dev, linux-s390, linux-sh, Andrew Morton,
	Oscar Salvador, Michal Hocko, Pavel Tatashin, Dan Williams,
	Aneesh Kumar K . V

On 02.10.19 02:06, kbuild test robot wrote:
> Hi David,
> 
> I love your patch! Perhaps something to improve:
> 
> [auto build test WARNING on mmotm/master]
> 
> url:    https://github.com/0day-ci/linux/commits/David-Hildenbrand/mm-memory_hotplug-Shrink-zones-before-removing-memory/20191002-054310
> base:   git://git.cmpxchg.org/linux-mmotm.git master
> config: x86_64-randconfig-b002-201939 (attached as .config)
> compiler: gcc-7 (Debian 7.4.0-13) 7.4.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot <lkp@intel.com>
> 
> All warnings (new ones prefixed by >>):
> 
>    In file included from include/asm-generic/bug.h:5:0,
>                     from arch/x86/include/asm/bug.h:83,
>                     from include/linux/bug.h:5,
>                     from include/linux/mmdebug.h:5,
>                     from include/linux/mm.h:9,
>                     from mm/memory_hotplug.c:9:
>    mm/memory_hotplug.c: In function '__remove_zone':
>    mm/memory_hotplug.c:471:24: error: 'ZONE_DEVICE' undeclared (first use in this function); did you mean 'ZONE_MOVABLE'?
>      if (zone_idx(zone) == ZONE_DEVICE)
>                            ^
>    include/linux/compiler.h:58:52: note: in definition of macro '__trace_if_var'
>     #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
>                                                        ^~~~
>>> mm/memory_hotplug.c:471:2: note: in expansion of macro 'if'
>      if (zone_idx(zone) == ZONE_DEVICE)
>      ^~
>    mm/memory_hotplug.c:471:24: note: each undeclared identifier is reported only once for each function it appears in
>      if (zone_idx(zone) == ZONE_DEVICE)
>                            ^
>    include/linux/compiler.h:58:52: note: in definition of macro '__trace_if_var'
>     #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
>                                                        ^~~~
>>> mm/memory_hotplug.c:471:2: note: in expansion of macro 'if'
>      if (zone_idx(zone) == ZONE_DEVICE)
>      ^~
> 
> vim +/if +471 mm/memory_hotplug.c
> 
>    459	
>    460	static void __remove_zone(struct zone *zone, unsigned long start_pfn,
>    461			unsigned long nr_pages)
>    462	{
>    463		struct pglist_data *pgdat = zone->zone_pgdat;
>    464		unsigned long flags;
>    465	
>    466		/*
>    467		 * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
>    468		 * we will not try to shrink the zones - which is okay as
>    469		 * set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
>    470		 */
>  > 471		if (zone_idx(zone) == ZONE_DEVICE)
>    472			return;
>    473	
>    474		pgdat_resize_lock(zone->zone_pgdat, &flags);
>    475		shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
>    476		update_pgdat_span(pgdat);
>    477		pgdat_resize_unlock(zone->zone_pgdat, &flags);
>    478	}
>    479	
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
> 

That should be easy to fix with some ifdef-ery :)

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone
  2019-10-01 15:03     ` David Hildenbrand
@ 2019-10-03 16:48       ` Aneesh Kumar K.V
  2019-10-04  9:00         ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Aneesh Kumar K.V @ 2019-10-03 16:48 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel, Dan Williams
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, Andrew Morton, Jason Gunthorpe, Logan Gunthorpe,
	Ira Weiny, Pankaj Gupta

On 10/1/19 8:33 PM, David Hildenbrand wrote:
> On 01.10.19 16:57, David Hildenbrand wrote:
>> On 01.10.19 16:40, David Hildenbrand wrote:
>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>>
>>> With altmap, all the resource pfns are not initialized. While initializing
>>> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
>>> skip pfns that were never initialized.
>>>
>>> Update memunmap_pages to calculate start and end pfn based on altmap
>>> values. This fixes a kernel crash that is observed when destroying
>>> a namespace.
>>>
>>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>>      pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>>      lr: c0000000004b9724: memunmap_pages+0x234/0x340
>>> ...
>>>      pid   = 3669, comm = ndctl
>>> kernel BUG at include/linux/mm.h:1107!
>>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>>> Cc: Logan Gunthorpe <logang@deltatee.com>
>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>> Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> [ move all pfn-realted declarations into a single line ]
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   mm/memremap.c | 13 ++++++++-----
>>>   1 file changed, 8 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>> index 557e53c6fb46..026788b2ac69 100644
>>> --- a/mm/memremap.c
>>> +++ b/mm/memremap.c
>>> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>>   void memunmap_pages(struct dev_pagemap *pgmap)
>>>   {
>>>   	struct resource *res = &pgmap->res;
>>> -	unsigned long pfn;
>>> +	unsigned long pfn, nr_pages, start_pfn, end_pfn;
>>>   	int nid;
>>>   
>>>   	dev_pagemap_kill(pgmap);
>>> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>>   		put_page(pfn_to_page(pfn));
>>>   	dev_pagemap_cleanup(pgmap);
>>>   
>>> +	start_pfn = pfn_first(pgmap);
>>> +	end_pfn = pfn_end(pgmap);
>>> +	nr_pages = end_pfn - start_pfn;
>>> +
>>>   	/* pages are dead and unused, undo the arch mapping */
>>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>>> +	nid = page_to_nid(pfn_to_page(start_pfn));
>>>   
>>>   	mem_hotplug_begin();
>>>   	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>>> -		pfn = PHYS_PFN(res->start);
>>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>>> -				 PHYS_PFN(resource_size(res)), NULL);
>>> +		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
>>> +			       nr_pages, NULL);
>>>   	} else {
>>>   		arch_remove_memory(nid, res->start, resource_size(res),
>>>   				pgmap_altmap(pgmap));
>>>
>>
>> Aneesh, I was wondering why the use of "res->start" is correct (and we
>> shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
>> could review.
>>
> 
> To be more precise, I wonder if it should actually be
> 
> __remove_pages(page_zone(pfn_to_page(start_pfn)), res->start,
>                 resource_size(res))
> 

yes, that would be make it much clear.

But for MEMORY_DEVICE_PRIVATE start_pfn and pfn should be same?


> IOW, keep calling __remove_pages() with the same parameters but read
> nid/zone from the offset one.
> 
> Hope some memunmap_pages() expert can clarify.
> 

-aneesh


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone
  2019-10-03 16:48       ` Aneesh Kumar K.V
@ 2019-10-04  9:00         ` David Hildenbrand
  2019-10-04  9:03           ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2019-10-04  9:00 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-kernel, Dan Williams
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, Andrew Morton, Jason Gunthorpe, Logan Gunthorpe,
	Ira Weiny, Pankaj Gupta

On 03.10.19 18:48, Aneesh Kumar K.V wrote:
> On 10/1/19 8:33 PM, David Hildenbrand wrote:
>> On 01.10.19 16:57, David Hildenbrand wrote:
>>> On 01.10.19 16:40, David Hildenbrand wrote:
>>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>>>
>>>> With altmap, all the resource pfns are not initialized. While initializing
>>>> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
>>>> skip pfns that were never initialized.
>>>>
>>>> Update memunmap_pages to calculate start and end pfn based on altmap
>>>> values. This fixes a kernel crash that is observed when destroying
>>>> a namespace.
>>>>
>>>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>>>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>>>      pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>>>      lr: c0000000004b9724: memunmap_pages+0x234/0x340
>>>> ...
>>>>      pid   = 3669, comm = ndctl
>>>> kernel BUG at include/linux/mm.h:1107!
>>>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>>>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>>>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>>>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>>>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>>>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>>>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>>>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>>>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>>>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>>>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>>>
>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>>>> Cc: Logan Gunthorpe <logang@deltatee.com>
>>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>>> Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>> [ move all pfn-realted declarations into a single line ]
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>   mm/memremap.c | 13 ++++++++-----
>>>>   1 file changed, 8 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>> index 557e53c6fb46..026788b2ac69 100644
>>>> --- a/mm/memremap.c
>>>> +++ b/mm/memremap.c
>>>> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>>>   void memunmap_pages(struct dev_pagemap *pgmap)
>>>>   {
>>>>   	struct resource *res = &pgmap->res;
>>>> -	unsigned long pfn;
>>>> +	unsigned long pfn, nr_pages, start_pfn, end_pfn;
>>>>   	int nid;
>>>>   
>>>>   	dev_pagemap_kill(pgmap);
>>>> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>>>   		put_page(pfn_to_page(pfn));
>>>>   	dev_pagemap_cleanup(pgmap);
>>>>   
>>>> +	start_pfn = pfn_first(pgmap);
>>>> +	end_pfn = pfn_end(pgmap);
>>>> +	nr_pages = end_pfn - start_pfn;
>>>> +
>>>>   	/* pages are dead and unused, undo the arch mapping */
>>>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>>>> +	nid = page_to_nid(pfn_to_page(start_pfn));
>>>>   
>>>>   	mem_hotplug_begin();
>>>>   	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>>>> -		pfn = PHYS_PFN(res->start);
>>>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>>>> -				 PHYS_PFN(resource_size(res)), NULL);
>>>> +		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
>>>> +			       nr_pages, NULL);
>>>>   	} else {
>>>>   		arch_remove_memory(nid, res->start, resource_size(res),
>>>>   				pgmap_altmap(pgmap));
>>>>
>>>
>>> Aneesh, I was wondering why the use of "res->start" is correct (and we
>>> shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
>>> could review.
>>>
>>
>> To be more precise, I wonder if it should actually be
>>
>> __remove_pages(page_zone(pfn_to_page(start_pfn)), res->start,
>>                 resource_size(res))
>>
> 
> yes, that would be make it much clear.
> 
> But for MEMORY_DEVICE_PRIVATE start_pfn and pfn should be same?

Okay, let's recap. We should call add_pages()/__remove_pages()
and arch_add_memory()/arch_remove_memory() with the exact same ranges.

So with PHYS_PFN(res->start) and PHYS_PFN(resource_size(res)

Now, only a subset of the pages gets actually initialized,
meaning the NID and the ZONE we read could be stale.
That, we have to fix.

What about something like this (am I missing something?):

From d77b5c50f86570819a437517a897cc40ed29eefb Mon Sep 17 00:00:00 2001
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Date: Fri, 27 Sep 2019 16:02:24 +0530
Subject: [PATCH] mm/memunmap: Don't access uninitialized memmap in
 memunmap_pages()

With an altmap, the memmap falling into the reserved altmap space are
not initialized and, therefore, contain a garbage NID and a garbage
zone. Make sure to read the NID/zone from a memmap that was initialzed.

This fixes a kernel crash that is observed when destroying a namespace:

[   81.356173] kernel BUG at include/linux/mm.h:1107!
cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
    pc: c0000000004b9728: memunmap_pages+0x238/0x340
    lr: c0000000004b9724: memunmap_pages+0x234/0x340
...
    pid   = 3669, comm = ndctl
kernel BUG at include/linux/mm.h:1107!
[c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
[c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
[c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
[c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
[c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
[c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
[c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
[c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
[c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
[c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
[c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[ minimze code changes, rephrase description ]
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memremap.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 557e53c6fb46..8b11c0da345c 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -123,6 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
 void memunmap_pages(struct dev_pagemap *pgmap)
 {
 	struct resource *res = &pgmap->res;
+	struct page *first_page;
 	unsigned long pfn;
 	int nid;
 
@@ -131,14 +132,16 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 		put_page(pfn_to_page(pfn));
 	dev_pagemap_cleanup(pgmap);
 
+	/* make sure to access a memmap that was actually initialized */
+	first_page = pfn_to_page(pfn_first(pgmap));
+
 	/* pages are dead and unused, undo the arch mapping */
-	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
+	nid = page_to_nid(first_page);
 
 	mem_hotplug_begin();
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-		pfn = PHYS_PFN(res->start);
-		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
-				 PHYS_PFN(resource_size(res)), NULL);
+		__remove_pages(page_zone(first_page), res->start,
+			       resource_size(res), NULL);
 	} else {
 		arch_remove_memory(nid, res->start, resource_size(res),
 				pgmap_altmap(pgmap));
-- 
2.21.0




-- 

Thanks,

David / dhildenb


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone
  2019-10-04  9:00         ` David Hildenbrand
@ 2019-10-04  9:03           ` David Hildenbrand
  2019-10-04  9:33             ` David Hildenbrand
  2019-10-05  6:13             ` Aneesh Kumar K.V
  0 siblings, 2 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-10-04  9:03 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-kernel, Dan Williams
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, Andrew Morton, Jason Gunthorpe, Logan Gunthorpe,
	Ira Weiny, Pankaj Gupta

On 04.10.19 11:00, David Hildenbrand wrote:
> On 03.10.19 18:48, Aneesh Kumar K.V wrote:
>> On 10/1/19 8:33 PM, David Hildenbrand wrote:
>>> On 01.10.19 16:57, David Hildenbrand wrote:
>>>> On 01.10.19 16:40, David Hildenbrand wrote:
>>>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>>>>
>>>>> With altmap, all the resource pfns are not initialized. While initializing
>>>>> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
>>>>> skip pfns that were never initialized.
>>>>>
>>>>> Update memunmap_pages to calculate start and end pfn based on altmap
>>>>> values. This fixes a kernel crash that is observed when destroying
>>>>> a namespace.
>>>>>
>>>>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>>>>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>>>>      pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>>>>      lr: c0000000004b9724: memunmap_pages+0x234/0x340
>>>>> ...
>>>>>      pid   = 3669, comm = ndctl
>>>>> kernel BUG at include/linux/mm.h:1107!
>>>>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>>>>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>>>>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>>>>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>>>>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>>>>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>>>>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>>>>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>>>>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>>>>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>>>>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>>>>
>>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>>>>> Cc: Logan Gunthorpe <logang@deltatee.com>
>>>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>>>> Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>> [ move all pfn-realted declarations into a single line ]
>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>> ---
>>>>>   mm/memremap.c | 13 ++++++++-----
>>>>>   1 file changed, 8 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>>> index 557e53c6fb46..026788b2ac69 100644
>>>>> --- a/mm/memremap.c
>>>>> +++ b/mm/memremap.c
>>>>> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>>>>   void memunmap_pages(struct dev_pagemap *pgmap)
>>>>>   {
>>>>>   	struct resource *res = &pgmap->res;
>>>>> -	unsigned long pfn;
>>>>> +	unsigned long pfn, nr_pages, start_pfn, end_pfn;
>>>>>   	int nid;
>>>>>   
>>>>>   	dev_pagemap_kill(pgmap);
>>>>> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>>>>   		put_page(pfn_to_page(pfn));
>>>>>   	dev_pagemap_cleanup(pgmap);
>>>>>   
>>>>> +	start_pfn = pfn_first(pgmap);
>>>>> +	end_pfn = pfn_end(pgmap);
>>>>> +	nr_pages = end_pfn - start_pfn;
>>>>> +
>>>>>   	/* pages are dead and unused, undo the arch mapping */
>>>>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>>>>> +	nid = page_to_nid(pfn_to_page(start_pfn));
>>>>>   
>>>>>   	mem_hotplug_begin();
>>>>>   	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>>>>> -		pfn = PHYS_PFN(res->start);
>>>>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>>>>> -				 PHYS_PFN(resource_size(res)), NULL);
>>>>> +		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
>>>>> +			       nr_pages, NULL);
>>>>>   	} else {
>>>>>   		arch_remove_memory(nid, res->start, resource_size(res),
>>>>>   				pgmap_altmap(pgmap));
>>>>>
>>>>
>>>> Aneesh, I was wondering why the use of "res->start" is correct (and we
>>>> shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
>>>> could review.
>>>>
>>>
>>> To be more precise, I wonder if it should actually be
>>>
>>> __remove_pages(page_zone(pfn_to_page(start_pfn)), res->start,
>>>                 resource_size(res))
>>>
>>
>> yes, that would be make it much clear.
>>
>> But for MEMORY_DEVICE_PRIVATE start_pfn and pfn should be same?
> 
> Okay, let's recap. We should call add_pages()/__remove_pages()
> and arch_add_memory()/arch_remove_memory() with the exact same ranges.
> 
> So with PHYS_PFN(res->start) and PHYS_PFN(resource_size(res)
> 
> Now, only a subset of the pages gets actually initialized,
> meaning the NID and the ZONE we read could be stale.
> That, we have to fix.
> 
> What about something like this (am I missing something?):
> 
> From d77b5c50f86570819a437517a897cc40ed29eefb Mon Sep 17 00:00:00 2001
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
> Date: Fri, 27 Sep 2019 16:02:24 +0530
> Subject: [PATCH] mm/memunmap: Don't access uninitialized memmap in
>  memunmap_pages()
> 
> With an altmap, the memmap falling into the reserved altmap space are
> not initialized and, therefore, contain a garbage NID and a garbage
> zone. Make sure to read the NID/zone from a memmap that was initialzed.
> 
> This fixes a kernel crash that is observed when destroying a namespace:
> 
> [   81.356173] kernel BUG at include/linux/mm.h:1107!
> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>     pc: c0000000004b9728: memunmap_pages+0x238/0x340
>     lr: c0000000004b9724: memunmap_pages+0x234/0x340
> ...
>     pid   = 3669, comm = ndctl
> kernel BUG at include/linux/mm.h:1107!
> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
> 
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> [ minimze code changes, rephrase description ]
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memremap.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 557e53c6fb46..8b11c0da345c 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -123,6 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>  void memunmap_pages(struct dev_pagemap *pgmap)
>  {
>  	struct resource *res = &pgmap->res;
> +	struct page *first_page;
>  	unsigned long pfn;
>  	int nid;
>  
> @@ -131,14 +132,16 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>  		put_page(pfn_to_page(pfn));
>  	dev_pagemap_cleanup(pgmap);
>  
> +	/* make sure to access a memmap that was actually initialized */
> +	first_page = pfn_to_page(pfn_first(pgmap));
> +
>  	/* pages are dead and unused, undo the arch mapping */
> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
> +	nid = page_to_nid(first_page);
>  
>  	mem_hotplug_begin();
>  	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> -		pfn = PHYS_PFN(res->start);
> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
> -				 PHYS_PFN(resource_size(res)), NULL);
> +		__remove_pages(page_zone(first_page), res->start,
> +			       resource_size(res), NULL);

Keeping the PHYS_PFN() calls of course ...

>  	} else {
>  		arch_remove_memory(nid, res->start, resource_size(res),
>  				pgmap_altmap(pgmap));
> 


-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone
  2019-10-04  9:03           ` David Hildenbrand
@ 2019-10-04  9:33             ` David Hildenbrand
  2019-10-05  6:13             ` Aneesh Kumar K.V
  1 sibling, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-10-04  9:33 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-kernel, Dan Williams
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, Andrew Morton, Jason Gunthorpe, Logan Gunthorpe,
	Ira Weiny, Pankaj Gupta

On 04.10.19 11:03, David Hildenbrand wrote:
> On 04.10.19 11:00, David Hildenbrand wrote:
>> On 03.10.19 18:48, Aneesh Kumar K.V wrote:
>>> On 10/1/19 8:33 PM, David Hildenbrand wrote:
>>>> On 01.10.19 16:57, David Hildenbrand wrote:
>>>>> On 01.10.19 16:40, David Hildenbrand wrote:
>>>>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>>>>>
>>>>>> With altmap, all the resource pfns are not initialized. While initializing
>>>>>> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
>>>>>> skip pfns that were never initialized.
>>>>>>
>>>>>> Update memunmap_pages to calculate start and end pfn based on altmap
>>>>>> values. This fixes a kernel crash that is observed when destroying
>>>>>> a namespace.
>>>>>>
>>>>>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>>>>>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>>>>>      pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>>>>>      lr: c0000000004b9724: memunmap_pages+0x234/0x340
>>>>>> ...
>>>>>>      pid   = 3669, comm = ndctl
>>>>>> kernel BUG at include/linux/mm.h:1107!
>>>>>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>>>>>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>>>>>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>>>>>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>>>>>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>>>>>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>>>>>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>>>>>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>>>>>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>>>>>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>>>>>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>>>>>
>>>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>>>>>> Cc: Logan Gunthorpe <logang@deltatee.com>
>>>>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>>>>> Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>> [ move all pfn-realted declarations into a single line ]
>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>> ---
>>>>>>   mm/memremap.c | 13 ++++++++-----
>>>>>>   1 file changed, 8 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>>>> index 557e53c6fb46..026788b2ac69 100644
>>>>>> --- a/mm/memremap.c
>>>>>> +++ b/mm/memremap.c
>>>>>> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>>>>>   void memunmap_pages(struct dev_pagemap *pgmap)
>>>>>>   {
>>>>>>   	struct resource *res = &pgmap->res;
>>>>>> -	unsigned long pfn;
>>>>>> +	unsigned long pfn, nr_pages, start_pfn, end_pfn;
>>>>>>   	int nid;
>>>>>>   
>>>>>>   	dev_pagemap_kill(pgmap);
>>>>>> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>>>>>   		put_page(pfn_to_page(pfn));
>>>>>>   	dev_pagemap_cleanup(pgmap);
>>>>>>   
>>>>>> +	start_pfn = pfn_first(pgmap);
>>>>>> +	end_pfn = pfn_end(pgmap);
>>>>>> +	nr_pages = end_pfn - start_pfn;
>>>>>> +
>>>>>>   	/* pages are dead and unused, undo the arch mapping */
>>>>>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>>>>>> +	nid = page_to_nid(pfn_to_page(start_pfn));
>>>>>>   
>>>>>>   	mem_hotplug_begin();
>>>>>>   	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>>>>>> -		pfn = PHYS_PFN(res->start);
>>>>>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>>>>>> -				 PHYS_PFN(resource_size(res)), NULL);
>>>>>> +		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
>>>>>> +			       nr_pages, NULL);
>>>>>>   	} else {
>>>>>>   		arch_remove_memory(nid, res->start, resource_size(res),
>>>>>>   				pgmap_altmap(pgmap));
>>>>>>
>>>>>
>>>>> Aneesh, I was wondering why the use of "res->start" is correct (and we
>>>>> shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
>>>>> could review.
>>>>>
>>>>
>>>> To be more precise, I wonder if it should actually be
>>>>
>>>> __remove_pages(page_zone(pfn_to_page(start_pfn)), res->start,
>>>>                 resource_size(res))
>>>>
>>>
>>> yes, that would be make it much clear.
>>>
>>> But for MEMORY_DEVICE_PRIVATE start_pfn and pfn should be same?
>>
>> Okay, let's recap. We should call add_pages()/__remove_pages()
>> and arch_add_memory()/arch_remove_memory() with the exact same ranges.
>>
>> So with PHYS_PFN(res->start) and PHYS_PFN(resource_size(res)
>>
>> Now, only a subset of the pages gets actually initialized,
>> meaning the NID and the ZONE we read could be stale.
>> That, we have to fix.
>>
>> What about something like this (am I missing something?):
>>
>> From d77b5c50f86570819a437517a897cc40ed29eefb Mon Sep 17 00:00:00 2001
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>> Date: Fri, 27 Sep 2019 16:02:24 +0530
>> Subject: [PATCH] mm/memunmap: Don't access uninitialized memmap in
>>  memunmap_pages()
>>
>> With an altmap, the memmap falling into the reserved altmap space are
>> not initialized and, therefore, contain a garbage NID and a garbage
>> zone. Make sure to read the NID/zone from a memmap that was initialzed.
>>
>> This fixes a kernel crash that is observed when destroying a namespace:
>>
>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>     pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>     lr: c0000000004b9724: memunmap_pages+0x234/0x340
>> ...
>>     pid   = 3669, comm = ndctl
>> kernel BUG at include/linux/mm.h:1107!
>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>> Cc: Logan Gunthorpe <logang@deltatee.com>
>> Cc: Ira Weiny <ira.weiny@intel.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> [ minimze code changes, rephrase description ]
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  mm/memremap.c | 11 +++++++----
>>  1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 557e53c6fb46..8b11c0da345c 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -123,6 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>  void memunmap_pages(struct dev_pagemap *pgmap)
>>  {
>>  	struct resource *res = &pgmap->res;
>> +	struct page *first_page;
>>  	unsigned long pfn;
>>  	int nid;
>>  
>> @@ -131,14 +132,16 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>  		put_page(pfn_to_page(pfn));
>>  	dev_pagemap_cleanup(pgmap);
>>  
>> +	/* make sure to access a memmap that was actually initialized */
>> +	first_page = pfn_to_page(pfn_first(pgmap));
>> +
>>  	/* pages are dead and unused, undo the arch mapping */
>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>> +	nid = page_to_nid(first_page);
>>  
>>  	mem_hotplug_begin();
>>  	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>> -		pfn = PHYS_PFN(res->start);
>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>> -				 PHYS_PFN(resource_size(res)), NULL);
>> +		__remove_pages(page_zone(first_page), res->start,
>> +			       resource_size(res), NULL);
> 
> Keeping the PHYS_PFN() calls of course ...
> 
>>  	} else {
>>  		arch_remove_memory(nid, res->start, resource_size(res),
>>  				pgmap_altmap(pgmap));
>>
> 
> 

The current state (with the modified patch) can be found at:
	https://github.com/davidhildenbrand/linux/tree/zone_offline

@Aneesh, It would be great if you could test the namespace removal
thingy and tell me if we are still missing something :)

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone
  2019-10-04  9:03           ` David Hildenbrand
  2019-10-04  9:33             ` David Hildenbrand
@ 2019-10-05  6:13             ` Aneesh Kumar K.V
  2019-10-06  8:13               ` David Hildenbrand
  1 sibling, 1 reply; 22+ messages in thread
From: Aneesh Kumar K.V @ 2019-10-05  6:13 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel, Dan Williams
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, Andrew Morton, Jason Gunthorpe, Logan Gunthorpe,
	Ira Weiny, Pankaj Gupta

On 10/4/19 2:33 PM, David Hildenbrand wrote:
> On 04.10.19 11:00, David Hildenbrand wrote:
>> On 03.10.19 18:48, Aneesh Kumar K.V wrote:
>>> On 10/1/19 8:33 PM, David Hildenbrand wrote:
>>>> On 01.10.19 16:57, David Hildenbrand wrote:
>>>>> On 01.10.19 16:40, David Hildenbrand wrote:
>>>>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>>>>>
>>>>>> With altmap, all the resource pfns are not initialized. While initializing
>>>>>> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
>>>>>> skip pfns that were never initialized.
>>>>>>
>>>>>> Update memunmap_pages to calculate start and end pfn based on altmap
>>>>>> values. This fixes a kernel crash that is observed when destroying
>>>>>> a namespace.
>>>>>>
>>>>>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>>>>>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>>>>>       pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>>>>>       lr: c0000000004b9724: memunmap_pages+0x234/0x340
>>>>>> ...
>>>>>>       pid   = 3669, comm = ndctl
>>>>>> kernel BUG at include/linux/mm.h:1107!
>>>>>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>>>>>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>>>>>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>>>>>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>>>>>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>>>>>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>>>>>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>>>>>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>>>>>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>>>>>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>>>>>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>>>>>
>>>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>>>>>> Cc: Logan Gunthorpe <logang@deltatee.com>
>>>>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>>>>> Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>> [ move all pfn-realted declarations into a single line ]
>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>> ---
>>>>>>    mm/memremap.c | 13 ++++++++-----
>>>>>>    1 file changed, 8 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>>>> index 557e53c6fb46..026788b2ac69 100644
>>>>>> --- a/mm/memremap.c
>>>>>> +++ b/mm/memremap.c
>>>>>> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>>>>>    void memunmap_pages(struct dev_pagemap *pgmap)
>>>>>>    {
>>>>>>    	struct resource *res = &pgmap->res;
>>>>>> -	unsigned long pfn;
>>>>>> +	unsigned long pfn, nr_pages, start_pfn, end_pfn;
>>>>>>    	int nid;
>>>>>>    
>>>>>>    	dev_pagemap_kill(pgmap);
>>>>>> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>>>>>    		put_page(pfn_to_page(pfn));
>>>>>>    	dev_pagemap_cleanup(pgmap);
>>>>>>    
>>>>>> +	start_pfn = pfn_first(pgmap);
>>>>>> +	end_pfn = pfn_end(pgmap);
>>>>>> +	nr_pages = end_pfn - start_pfn;
>>>>>> +
>>>>>>    	/* pages are dead and unused, undo the arch mapping */
>>>>>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>>>>>> +	nid = page_to_nid(pfn_to_page(start_pfn));
>>>>>>    
>>>>>>    	mem_hotplug_begin();
>>>>>>    	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>>>>>> -		pfn = PHYS_PFN(res->start);
>>>>>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>>>>>> -				 PHYS_PFN(resource_size(res)), NULL);
>>>>>> +		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
>>>>>> +			       nr_pages, NULL);
>>>>>>    	} else {
>>>>>>    		arch_remove_memory(nid, res->start, resource_size(res),
>>>>>>    				pgmap_altmap(pgmap));
>>>>>>
>>>>>
>>>>> Aneesh, I was wondering why the use of "res->start" is correct (and we
>>>>> shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
>>>>> could review.
>>>>>
>>>>
>>>> To be more precise, I wonder if it should actually be
>>>>
>>>> __remove_pages(page_zone(pfn_to_page(start_pfn)), res->start,
>>>>                  resource_size(res))
>>>>
>>>
>>> yes, that would be make it much clear.
>>>
>>> But for MEMORY_DEVICE_PRIVATE start_pfn and pfn should be same?
>>
>> Okay, let's recap. We should call add_pages()/__remove_pages()
>> and arch_add_memory()/arch_remove_memory() with the exact same ranges.
>>
>> So with PHYS_PFN(res->start) and PHYS_PFN(resource_size(res)
>>
>> Now, only a subset of the pages gets actually initialized,
>> meaning the NID and the ZONE we read could be stale.
>> That, we have to fix.
>>
>> What about something like this (am I missing something?):
>>
>>  From d77b5c50f86570819a437517a897cc40ed29eefb Mon Sep 17 00:00:00 2001
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>> Date: Fri, 27 Sep 2019 16:02:24 +0530
>> Subject: [PATCH] mm/memunmap: Don't access uninitialized memmap in
>>   memunmap_pages()
>>
>> With an altmap, the memmap falling into the reserved altmap space are
>> not initialized and, therefore, contain a garbage NID and a garbage
>> zone. Make sure to read the NID/zone from a memmap that was initialzed.
>>
>> This fixes a kernel crash that is observed when destroying a namespace:
>>
>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>      pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>      lr: c0000000004b9724: memunmap_pages+0x234/0x340
>> ...
>>      pid   = 3669, comm = ndctl
>> kernel BUG at include/linux/mm.h:1107!
>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>> Cc: Logan Gunthorpe <logang@deltatee.com>
>> Cc: Ira Weiny <ira.weiny@intel.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> [ minimze code changes, rephrase description ]
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/memremap.c | 11 +++++++----
>>   1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 557e53c6fb46..8b11c0da345c 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -123,6 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>   void memunmap_pages(struct dev_pagemap *pgmap)
>>   {
>>   	struct resource *res = &pgmap->res;
>> +	struct page *first_page;
>>   	unsigned long pfn;
>>   	int nid;
>>   
>> @@ -131,14 +132,16 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>   		put_page(pfn_to_page(pfn));
>>   	dev_pagemap_cleanup(pgmap);
>>   
>> +	/* make sure to access a memmap that was actually initialized */
>> +	first_page = pfn_to_page(pfn_first(pgmap));
>> +
>>   	/* pages are dead and unused, undo the arch mapping */
>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>> +	nid = page_to_nid(first_page);
>>   
>>   	mem_hotplug_begin();
>>   	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>> -		pfn = PHYS_PFN(res->start);
>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>> -				 PHYS_PFN(resource_size(res)), NULL);
>> +		__remove_pages(page_zone(first_page), res->start,
>> +			       resource_size(res), NULL);
> 
> Keeping the PHYS_PFN() calls of course ...
> 


That is not different from what I posted right?  For MEMORY_DEVICE_PRIVATE


start_pfn  = PHYS_PFN(rest->start)


>>   	} else {
>>   		arch_remove_memory(nid, res->start, resource_size(res),
>>   				pgmap_altmap(pgmap));
>>
> 
> 

-aneesh



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone
  2019-10-05  6:13             ` Aneesh Kumar K.V
@ 2019-10-06  8:13               ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2019-10-06  8:13 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-kernel, Dan Williams
  Cc: linux-mm, linux-arm-kernel, linux-ia64, linuxppc-dev, linux-s390,
	linux-sh, Andrew Morton, Jason Gunthorpe, Logan Gunthorpe,
	Ira Weiny, Pankaj Gupta

On 05.10.19 08:13, Aneesh Kumar K.V wrote:
> On 10/4/19 2:33 PM, David Hildenbrand wrote:
>> On 04.10.19 11:00, David Hildenbrand wrote:
>>> On 03.10.19 18:48, Aneesh Kumar K.V wrote:
>>>> On 10/1/19 8:33 PM, David Hildenbrand wrote:
>>>>> On 01.10.19 16:57, David Hildenbrand wrote:
>>>>>> On 01.10.19 16:40, David Hildenbrand wrote:
>>>>>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>>>>>>
>>>>>>> With altmap, all the resource pfns are not initialized. While initializing
>>>>>>> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
>>>>>>> skip pfns that were never initialized.
>>>>>>>
>>>>>>> Update memunmap_pages to calculate start and end pfn based on altmap
>>>>>>> values. This fixes a kernel crash that is observed when destroying
>>>>>>> a namespace.
>>>>>>>
>>>>>>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>>>>>>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>>>>>>       pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>>>>>>       lr: c0000000004b9724: memunmap_pages+0x234/0x340
>>>>>>> ...
>>>>>>>       pid   = 3669, comm = ndctl
>>>>>>> kernel BUG at include/linux/mm.h:1107!
>>>>>>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>>>>>>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>>>>>>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>>>>>>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>>>>>>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>>>>>>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>>>>>>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>>>>>>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>>>>>>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>>>>>>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>>>>>>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>>>>>>
>>>>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>>>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>>>>>>> Cc: Logan Gunthorpe <logang@deltatee.com>
>>>>>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>>>>>> Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
>>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>>> [ move all pfn-realted declarations into a single line ]
>>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>>> ---
>>>>>>>    mm/memremap.c | 13 ++++++++-----
>>>>>>>    1 file changed, 8 insertions(+), 5 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>>>>> index 557e53c6fb46..026788b2ac69 100644
>>>>>>> --- a/mm/memremap.c
>>>>>>> +++ b/mm/memremap.c
>>>>>>> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>>>>>>    void memunmap_pages(struct dev_pagemap *pgmap)
>>>>>>>    {
>>>>>>>    	struct resource *res = &pgmap->res;
>>>>>>> -	unsigned long pfn;
>>>>>>> +	unsigned long pfn, nr_pages, start_pfn, end_pfn;
>>>>>>>    	int nid;
>>>>>>>    
>>>>>>>    	dev_pagemap_kill(pgmap);
>>>>>>> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>>>>>>    		put_page(pfn_to_page(pfn));
>>>>>>>    	dev_pagemap_cleanup(pgmap);
>>>>>>>    
>>>>>>> +	start_pfn = pfn_first(pgmap);
>>>>>>> +	end_pfn = pfn_end(pgmap);
>>>>>>> +	nr_pages = end_pfn - start_pfn;
>>>>>>> +
>>>>>>>    	/* pages are dead and unused, undo the arch mapping */
>>>>>>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>>>>>>> +	nid = page_to_nid(pfn_to_page(start_pfn));
>>>>>>>    
>>>>>>>    	mem_hotplug_begin();
>>>>>>>    	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>>>>>>> -		pfn = PHYS_PFN(res->start);
>>>>>>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>>>>>>> -				 PHYS_PFN(resource_size(res)), NULL);
>>>>>>> +		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
>>>>>>> +			       nr_pages, NULL);
>>>>>>>    	} else {
>>>>>>>    		arch_remove_memory(nid, res->start, resource_size(res),
>>>>>>>    				pgmap_altmap(pgmap));
>>>>>>>
>>>>>>
>>>>>> Aneesh, I was wondering why the use of "res->start" is correct (and we
>>>>>> shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
>>>>>> could review.
>>>>>>
>>>>>
>>>>> To be more precise, I wonder if it should actually be
>>>>>
>>>>> __remove_pages(page_zone(pfn_to_page(start_pfn)), res->start,
>>>>>                  resource_size(res))
>>>>>
>>>>
>>>> yes, that would be make it much clear.
>>>>
>>>> But for MEMORY_DEVICE_PRIVATE start_pfn and pfn should be same?
>>>
>>> Okay, let's recap. We should call add_pages()/__remove_pages()
>>> and arch_add_memory()/arch_remove_memory() with the exact same ranges.
>>>
>>> So with PHYS_PFN(res->start) and PHYS_PFN(resource_size(res)
>>>
>>> Now, only a subset of the pages gets actually initialized,
>>> meaning the NID and the ZONE we read could be stale.
>>> That, we have to fix.
>>>
>>> What about something like this (am I missing something?):
>>>
>>>  From d77b5c50f86570819a437517a897cc40ed29eefb Mon Sep 17 00:00:00 2001
>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>> Date: Fri, 27 Sep 2019 16:02:24 +0530
>>> Subject: [PATCH] mm/memunmap: Don't access uninitialized memmap in
>>>   memunmap_pages()
>>>
>>> With an altmap, the memmap falling into the reserved altmap space are
>>> not initialized and, therefore, contain a garbage NID and a garbage
>>> zone. Make sure to read the NID/zone from a memmap that was initialzed.
>>>
>>> This fixes a kernel crash that is observed when destroying a namespace:
>>>
>>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>>      pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>>      lr: c0000000004b9724: memunmap_pages+0x234/0x340
>>> ...
>>>      pid   = 3669, comm = ndctl
>>> kernel BUG at include/linux/mm.h:1107!
>>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>>> Cc: Logan Gunthorpe <logang@deltatee.com>
>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> [ minimze code changes, rephrase description ]
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   mm/memremap.c | 11 +++++++----
>>>   1 file changed, 7 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>> index 557e53c6fb46..8b11c0da345c 100644
>>> --- a/mm/memremap.c
>>> +++ b/mm/memremap.c
>>> @@ -123,6 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>>   void memunmap_pages(struct dev_pagemap *pgmap)
>>>   {
>>>   	struct resource *res = &pgmap->res;
>>> +	struct page *first_page;
>>>   	unsigned long pfn;
>>>   	int nid;
>>>   
>>> @@ -131,14 +132,16 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>>   		put_page(pfn_to_page(pfn));
>>>   	dev_pagemap_cleanup(pgmap);
>>>   
>>> +	/* make sure to access a memmap that was actually initialized */
>>> +	first_page = pfn_to_page(pfn_first(pgmap));
>>> +
>>>   	/* pages are dead and unused, undo the arch mapping */
>>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>>> +	nid = page_to_nid(first_page);
>>>   
>>>   	mem_hotplug_begin();
>>>   	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>>> -		pfn = PHYS_PFN(res->start);
>>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>>> -				 PHYS_PFN(resource_size(res)), NULL);
>>> +		__remove_pages(page_zone(first_page), res->start,
>>> +			       resource_size(res), NULL);
>>
>> Keeping the PHYS_PFN() calls of course ...
>>
> 
> 
> That is not different from what I posted right?  For MEMORY_DEVICE_PRIVATE

I'll go with this modified patch for now, which does not change the ranges
passed to  __remove_pages(), because that looked like a unrelated change to me.
__remove_ages() has to be called with the same range as add_pages().


commit f66c2fa09293ae0f99afc4363146941579152409 (HEAD)
Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Date:   Fri Sep 27 16:02:24 2019 +0530

    mm/memunmap: Don't access uninitialized memmap in memunmap_pages()
    
    With an altmap, the memmap falling into the reserved altmap space are
    not initialized and, therefore, contain a garbage NID and a garbage
    zone. Make sure to read the NID/zone from a memmap that was initialzed.
    
    This fixes a kernel crash that is observed when destroying a namespace:
    
    [   81.356173] kernel BUG at include/linux/mm.h:1107!
    cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
        pc: c0000000004b9728: memunmap_pages+0x238/0x340
        lr: c0000000004b9724: memunmap_pages+0x234/0x340
    ...
        pid   = 3669, comm = ndctl
    kernel BUG at include/linux/mm.h:1107!
    [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
    [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
    [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
    [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
    [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
    [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
    [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
    [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
    [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
    [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
    [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
    
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Logan Gunthorpe <logang@deltatee.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    [ minimze code changes, rephrase description ]
    Signed-off-by: David Hildenbrand <david@redhat.com>

diff --git a/mm/memremap.c b/mm/memremap.c
index 557e53c6fb46..8c2fb44c3b4d 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -123,6 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
 void memunmap_pages(struct dev_pagemap *pgmap)
 {
        struct resource *res = &pgmap->res;
+       struct page *first_page;
        unsigned long pfn;
        int nid;
 
@@ -131,14 +132,16 @@ void memunmap_pages(struct dev_pagemap *pgmap)
                put_page(pfn_to_page(pfn));
        dev_pagemap_cleanup(pgmap);
 
+       /* make sure to access a memmap that was actually initialized */
+       first_page = pfn_to_page(pfn_first(pgmap));
+
        /* pages are dead and unused, undo the arch mapping */
-       nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
+       nid = page_to_nid(first_page);
 
        mem_hotplug_begin();
        if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-               pfn = PHYS_PFN(res->start);
-               __remove_pages(page_zone(pfn_to_page(pfn)), pfn,
-                                PHYS_PFN(resource_size(res)), NULL);
+               __remove_pages(page_zone(first_page), PHYS_PFN(res->start),
+                              PHYS_PFN(resource_size(res)), NULL);
        } else {
                arch_remove_memory(nid, res->start, resource_size(res),
                                pgmap_altmap(pgmap));



As Andrew wants to give this some testing, I'll send out a new version
shortly, then we can discuss there.


-- 

Thanks,

David / dhildenb


^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-10-06  8:13 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-01 14:40 [PATCH v5 00/10] mm/memory_hotplug: Shrink zones before removing memory David Hildenbrand
2019-10-01 14:40 ` [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone David Hildenbrand
2019-10-01 14:57   ` David Hildenbrand
2019-10-01 15:03     ` David Hildenbrand
2019-10-03 16:48       ` Aneesh Kumar K.V
2019-10-04  9:00         ` David Hildenbrand
2019-10-04  9:03           ` David Hildenbrand
2019-10-04  9:33             ` David Hildenbrand
2019-10-05  6:13             ` Aneesh Kumar K.V
2019-10-06  8:13               ` David Hildenbrand
2019-10-01 14:40 ` [PATCH v5 02/10] mm/memmap_init: Update variable name in memmap_init_zone David Hildenbrand
2019-10-01 14:40 ` [PATCH v5 03/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_pgdat_span() David Hildenbrand
2019-10-01 14:40 ` [PATCH v5 04/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_zone_span() David Hildenbrand
2019-10-01 23:47   ` kbuild test robot
2019-10-02  0:06   ` kbuild test robot
2019-10-02  7:05     ` David Hildenbrand
2019-10-01 14:40 ` [PATCH v5 05/10] mm/memory_hotplug: Shrink zones when offlining memory David Hildenbrand
2019-10-01 14:40 ` [PATCH v5 06/10] mm/memory_hotplug: Poison memmap in remove_pfn_range_from_zone() David Hildenbrand
2019-10-01 14:40 ` [PATCH v5 07/10] mm/memory_hotplug: We always have a zone in find_(smallest|biggest)_section_pfn David Hildenbrand
2019-10-01 14:40 ` [PATCH v5 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span() David Hildenbrand
2019-10-01 14:40 ` [PATCH v5 09/10] mm/memory_hotplug: Drop local variables " David Hildenbrand
2019-10-01 14:40 ` [PATCH v5 10/10] mm/memory_hotplug: Cleanup __remove_pages() David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).