* [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory
@ 2019-08-26 10:10 David Hildenbrand
  2019-08-26 10:10 ` [PATCH v2 1/6] mm/memory_hotplug: Exit early in __remove_pages() on BUGs David Hildenbrand
                   ` (7 more replies)
  0 siblings, 8 replies; 30+ messages in thread
From: David Hildenbrand @ 2019-08-26 10:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Alexander Duyck, Andrew Morton,
	Andy Lutomirski, Aneesh Kumar K.V, Anshuman Khandual, Arun KS,
	Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
	Christian Borntraeger, Christophe Leroy, Dan Williams,
	Dave Hansen, Fenghua Yu, Gerald Schaefer, Greg Kroah-Hartman,
	Halil Pasic, Heiko Carstens, H. Peter Anvin, Ingo Molnar,
	Ira Weiny, Jason Gunthorpe, Johannes Weiner, Jun Yao,
	Logan Gunthorpe, Mark Rutland, Masahiro Yamada,
	Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Michal Hocko, Mike Rapoport,
	Oscar Salvador, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Rich Felker, Robin Murphy,
	Steve Capper, Thomas Gleixner, Tom Lendacky, Tony Luck,
	Vasily Gorbik, Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon,
	Yoshinori Sato, Yu Zhao

Working on virtio-mem, I was able to trigger a kernel BUG (with debug
options enabled) when removing memory that was never onlined. I was able
to reproduce with DIMMs. As far as I can see, the same can also happen
without debug configs enabled if we're unlucky and the uninitialized
memmap contains selected garbage.

The root problem is that we should not try to derive the zone of memory we
are removing from the first PFN. The individual memory blocks of a DIMM
could be spanned by different zones, by multiple zones (after being
offlined and re-onlined), or by no zone at all (never onlined).
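
The problematic pattern, as removed from the arch_remove_memory()
implementations later in this series, is essentially:

	/* derive the zone to resize from the first PFN only - unreliable */
	zone = page_zone(pfn_to_page(start_pfn));
	__remove_pages(zone, start_pfn, nr_pages, altmap);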

Let's process all applicable zones when removing memory so we're on the
safe side. In the long term, we want to resize the zones when offlining
memory (and before removing ZONE_DEVICE memory), however, that will require
more thought (and most probably a new SECTION_ACTIVE / pfn_active()
thingy). More details about that in patch #3.

Along with the fix, some related cleanups.

v1 -> v2:
- Include "mm: Introduce for_each_zone_nid()"
- "mm/memory_hotplug: Pass nid instead of zone to __remove_pages()"
-- Pass the nid instead of the zone and use it to reduce the number of
   zones to process

--- snip ---

I gave this a quick test with a DIMM on x86-64:

Start with a memory-less node 1. Hotplug a DIMM (512MB) to node 1.
The 1st memory block is not onlined. The 2nd and 4th are onlined MOVABLE,
the 3rd is onlined NORMAL.

:/# echo "online_movable" > /sys/devices/system/memory/memory41/state
[...]
:/# echo "online_movable" > /sys/devices/system/memory/memory43/state
:/# echo "online_kernel" > /sys/devices/system/memory/memory42/state
:/# cat /sys/devices/system/memory/memory40/state
offline

:/# cat /proc/zoneinfo
Node 1, zone   Normal
 [...]
        spanned  32768
        present  32768
        managed  32768
 [...]
Node 1, zone  Movable
 [...]
        spanned  98304
        present  65536
        managed  65536
 [...]

Trigger hotunplug. If it succeeds (block 42 can be offlined):

:/# cat /proc/zoneinfo

Node 1, zone   Normal
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 0, 0)
Node 1, zone  Movable
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 0, 0)

So all zones were properly fixed up and we don't access the memmap of the
first, never-onlined memory block (garbage). I am no longer able to trigger
the BUG. I did a similar test with an already populated node.

David Hildenbrand (6):
  mm/memory_hotplug: Exit early in __remove_pages() on BUGs
  mm: Exit early in set_zone_contiguous() if already contiguous
  mm/memory_hotplug: Process all zones when removing memory
  mm/memory_hotplug: Cleanup __remove_pages()
  mm: Introduce for_each_zone_nid()
  mm/memory_hotplug: Pass nid instead of zone to __remove_pages()

 arch/arm64/mm/mmu.c            |  4 +--
 arch/ia64/mm/init.c            |  4 +--
 arch/powerpc/mm/mem.c          |  3 +-
 arch/s390/mm/init.c            |  4 +--
 arch/sh/mm/init.c              |  4 +--
 arch/x86/mm/init_32.c          |  4 +--
 arch/x86/mm/init_64.c          |  4 +--
 include/linux/memory_hotplug.h |  2 +-
 include/linux/mmzone.h         |  5 ++++
 mm/memory_hotplug.c            | 51 +++++++++++++++++++---------------
 mm/memremap.c                  |  3 +-
 mm/page_alloc.c                |  3 ++
 12 files changed, 46 insertions(+), 45 deletions(-)

-- 
2.21.0




* [PATCH v2 1/6] mm/memory_hotplug: Exit early in __remove_pages() on BUGs
  2019-08-26 10:10 [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory David Hildenbrand
@ 2019-08-26 10:10 ` David Hildenbrand
  2019-08-26 10:10 ` [PATCH v2 2/6] mm: Exit early in set_zone_contiguous() if already contiguous David Hildenbrand
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: David Hildenbrand @ 2019-08-26 10:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang

The error path should never happen in practice (unless bringing up a new
device driver, or on BUGs). However, it's clearer to not touch anything
in case we are going to return right away. Move the check/return.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 32a5386758ce..71779b7b14df 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -542,13 +542,13 @@ void __remove_pages(struct zone *zone, unsigned long pfn,
 	unsigned long map_offset = 0;
 	unsigned long nr, start_sec, end_sec;
 
+	if (check_pfn_span(pfn, nr_pages, "remove"))
+		return;
+
 	map_offset = vmem_altmap_offset(altmap);
 
 	clear_zone_contiguous(zone);
 
-	if (check_pfn_span(pfn, nr_pages, "remove"))
-		return;
-
 	start_sec = pfn_to_section_nr(pfn);
 	end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
 	for (nr = start_sec; nr <= end_sec; nr++) {
-- 
2.21.0




* [PATCH v2 2/6] mm: Exit early in set_zone_contiguous() if already contiguous
  2019-08-26 10:10 [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory David Hildenbrand
  2019-08-26 10:10 ` [PATCH v2 1/6] mm/memory_hotplug: Exit early in __remove_pages() on BUGs David Hildenbrand
@ 2019-08-26 10:10 ` David Hildenbrand
  2019-08-26 10:10 ` [PATCH v2 3/6] mm/memory_hotplug: Process all zones when removing memory David Hildenbrand
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: David Hildenbrand @ 2019-08-26 10:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Michal Hocko,
	Vlastimil Babka, Oscar Salvador, Pavel Tatashin, Mel Gorman,
	Mike Rapoport, Dan Williams, Alexander Duyck

No need to recompute in case the zone is already marked contiguous.
We will soon exploit this on the memory removal path, where we will only
clear zone->contiguous on zones that intersect with the memory to be
removed.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/page_alloc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b799e11fba3..995708e05cde 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1546,6 +1546,9 @@ void set_zone_contiguous(struct zone *zone)
 	unsigned long block_start_pfn = zone->zone_start_pfn;
 	unsigned long block_end_pfn;
 
+	if (zone->contiguous)
+		return;
+
 	block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
 	for (; block_start_pfn < zone_end_pfn(zone);
 			block_start_pfn = block_end_pfn,
-- 
2.21.0




* [PATCH v2 3/6] mm/memory_hotplug: Process all zones when removing memory
  2019-08-26 10:10 [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory David Hildenbrand
  2019-08-26 10:10 ` [PATCH v2 1/6] mm/memory_hotplug: Exit early in __remove_pages() on BUGs David Hildenbrand
  2019-08-26 10:10 ` [PATCH v2 2/6] mm: Exit early in set_zone_contiguous() if already contiguous David Hildenbrand
@ 2019-08-26 10:10 ` David Hildenbrand
  2019-08-29 15:39   ` Michal Hocko
  2019-08-26 10:10 ` [PATCH v2 4/6] mm/memory_hotplug: Cleanup __remove_pages() David Hildenbrand
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2019-08-26 10:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang

It is easier than I thought to trigger a kernel bug by removing memory that
was never onlined. With CONFIG_DEBUG_VM the memmap is initialized with
garbage, resulting in the detection of a broken zone when removing memory.
Without CONFIG_DEBUG_VM it is less likely - but we could still have
garbage in the memmap.

:/# [   23.912993] BUG: unable to handle page fault for address: 000000000000353d
[   23.914219] #PF: supervisor write access in kernel mode
[   23.915199] #PF: error_code(0x0002) - not-present page
[   23.916160] PGD 0 P4D 0
[   23.916627] Oops: 0002 [#1] SMP PTI
[   23.917256] CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
[   23.918900] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
[   23.921194] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   23.922249] RIP: 0010:clear_zone_contiguous+0x5/0x10
[   23.923173] Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
[   23.926876] RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
[   23.927928] RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
[   23.929458] RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
[   23.930899] RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
[   23.932362] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
[   23.933603] R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
[   23.934913] FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
[   23.936294] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   23.937481] CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
[   23.938687] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   23.939889] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   23.941168] Call Trace:
[   23.941580]  __remove_pages+0x4b/0x640
[   23.942303]  ? mark_held_locks+0x49/0x70
[   23.943149]  arch_remove_memory+0x63/0x8d
[   23.943921]  try_remove_memory+0xdb/0x130
[   23.944766]  ? walk_memory_blocks+0x7f/0x9e
[   23.945616]  __remove_memory+0xa/0x11
[   23.946274]  acpi_memory_device_remove+0x70/0x100
[   23.947308]  acpi_bus_trim+0x55/0x90
[   23.947914]  acpi_device_hotplug+0x227/0x3a0
[   23.948714]  acpi_hotplug_work_fn+0x1a/0x30
[   23.949433]  process_one_work+0x221/0x550
[   23.950190]  worker_thread+0x50/0x3b0
[   23.950993]  kthread+0x105/0x140
[   23.951644]  ? process_one_work+0x550/0x550
[   23.952508]  ? kthread_park+0x80/0x80
[   23.953367]  ret_from_fork+0x3a/0x50
[   23.954025] Modules linked in:
[   23.954613] CR2: 000000000000353d
[   23.955248] ---[ end trace 93d982b1fb3e1a69 ]---

But the problem is more extreme: When removing memory we could have
- Single memory blocks that fall into no zone (never onlined)
- Single memory blocks that fall into multiple zones (offlined+re-onlined)
- Multiple memory blocks that fall into different zones
Right now, the zones don't get updated properly in these cases.

So let's simply process all zones for now until we can properly handle
this via the reverse of move_pfn_range_to_zone() (which would then be
called something like remove_pfn_range_from_zone()), for example, when
offlining memory or before removing ZONE_DEVICE memory.

To speed things up, only mark applicable zones non-contiguous (and
therefore reduce the zones to recompute) and skip non-intersecting zones
when trying to resize. shrink_zone_span() and shrink_pgdat_span() seem
to be able to cope just fine with pfn ranges they don't actually
contain (but still intersect with).

Don't check for zone_intersects() when triggering set_zone_contiguous()
- we might have resized the zone and the check might no longer hold. For
now, we have to try to recompute any zone (which will be skipped in case
the zone is already contiguous).

Note1: Detecting which memory is still part of a zone is not easy before
removing memory, as the detection relies almost completely on pfn_valid()
right now. pfn_online() cannot be used, as ZONE_DEVICE memory is never
online. pfn_present() cannot be used, as all memory is present once it has
been added (but not onlined). We need to rethink/refactor this properly.

Note2: We are safe to call zone_intersects() without locking (as already
done by onlining code in default_zone_for_pfn()), as we are protected by
the memory hotplug lock - just like zone->contiguous.
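
For reference, zone_intersects() boils down to a plain interval-overlap
test; roughly (a sketch of the existing helper in include/linux/mmzone.h):

static inline bool zone_intersects(struct zone *zone,
				   unsigned long start_pfn,
				   unsigned long nr_pages)
{
	if (zone_is_empty(zone))
		return false;
	/* no overlap of [start_pfn, start_pfn + nr_pages) with the zone span */
	if (start_pfn >= zone_end_pfn(zone) ||
	    start_pfn + nr_pages <= zone->zone_start_pfn)
		return false;
	return true;
}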

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 71779b7b14df..27f0457b7512 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -505,22 +505,28 @@ static void __remove_zone(struct zone *zone, unsigned long start_pfn,
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	unsigned long flags;
 
+	if (!zone_intersects(zone, start_pfn, nr_pages))
+		return;
+
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
 	shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
 	shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 }
 
-static void __remove_section(struct zone *zone, unsigned long pfn,
-		unsigned long nr_pages, unsigned long map_offset,
-		struct vmem_altmap *altmap)
+static void __remove_section(unsigned long pfn, unsigned long nr_pages,
+			     unsigned long map_offset,
+			     struct vmem_altmap *altmap)
 {
 	struct mem_section *ms = __nr_to_section(pfn_to_section_nr(pfn));
+	struct zone *zone;
 
 	if (WARN_ON_ONCE(!valid_section(ms)))
 		return;
 
-	__remove_zone(zone, pfn, nr_pages);
+	/* TODO: move zone handling out of memory removal path */
+	for_each_zone(zone)
+		__remove_zone(zone, pfn, nr_pages);
 	sparse_remove_section(ms, pfn, nr_pages, map_offset, altmap);
 }
 
@@ -547,7 +553,10 @@ void __remove_pages(struct zone *zone, unsigned long pfn,
 
 	map_offset = vmem_altmap_offset(altmap);
 
-	clear_zone_contiguous(zone);
+	/* TODO: move zone handling out of memory removal path */
+	for_each_zone(zone)
+		if (zone_intersects(zone, pfn, nr_pages))
+			clear_zone_contiguous(zone);
 
 	start_sec = pfn_to_section_nr(pfn);
 	end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
@@ -557,13 +566,15 @@ void __remove_pages(struct zone *zone, unsigned long pfn,
 		cond_resched();
 		pfns = min(nr_pages, PAGES_PER_SECTION
 				- (pfn & ~PAGE_SECTION_MASK));
-		__remove_section(zone, pfn, pfns, map_offset, altmap);
+		__remove_section(pfn, pfns, map_offset, altmap);
 		pfn += pfns;
 		nr_pages -= pfns;
 		map_offset = 0;
 	}
 
-	set_zone_contiguous(zone);
+	/* TODO: move zone handling out of memory removal path */
+	for_each_zone(zone)
+		set_zone_contiguous(zone);
 }
 
 int set_online_page_callback(online_page_callback_t callback)
-- 
2.21.0




* [PATCH v2 4/6] mm/memory_hotplug: Cleanup __remove_pages()
  2019-08-26 10:10 [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory David Hildenbrand
                   ` (2 preceding siblings ...)
  2019-08-26 10:10 ` [PATCH v2 3/6] mm/memory_hotplug: Process all zones when removing memory David Hildenbrand
@ 2019-08-26 10:10 ` David Hildenbrand
  2019-08-26 10:10 ` [PATCH v2 5/6] mm: Introduce for_each_zone_nid() David Hildenbrand
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: David Hildenbrand @ 2019-08-26 10:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Wei Yang

Let's drop the basically unused section stuff and simplify.

Also, let's use a shorter variant to calculate the number of pages to
the next section boundary.
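
To see why -(pfn | PAGE_SECTION_MASK) is the number of pages up to the
next section boundary: PAGE_SECTION_MASK is ~(PAGES_PER_SECTION - 1), so
the OR sets all bits above the in-section offset, and the unsigned
negation then yields PAGES_PER_SECTION - (pfn & (PAGES_PER_SECTION - 1)).
A minimal standalone check (PAGES_PER_SECTION value assumed for x86-64,
128 MiB sections with 4 KiB pages):

#include <stdio.h>

#define PAGES_PER_SECTION 0x8000UL	/* assumption, not from the patch */
#define PAGE_SECTION_MASK (~(PAGES_PER_SECTION - 1))

int main(void)
{
	unsigned long pfn = 0x8100;

	/* prints 7f00: pages from pfn 0x8100 to the boundary at 0x10000 */
	printf("%lx\n", -(pfn | PAGE_SECTION_MASK));
	return 0;
}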

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 27f0457b7512..e88c96cf9d77 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -545,8 +545,9 @@ static void __remove_section(unsigned long pfn, unsigned long nr_pages,
 void __remove_pages(struct zone *zone, unsigned long pfn,
 		    unsigned long nr_pages, struct vmem_altmap *altmap)
 {
+	const unsigned long end_pfn = pfn + nr_pages;
+	unsigned long cur_nr_pages;
 	unsigned long map_offset = 0;
-	unsigned long nr, start_sec, end_sec;
 
 	if (check_pfn_span(pfn, nr_pages, "remove"))
 		return;
@@ -558,17 +559,11 @@ void __remove_pages(struct zone *zone, unsigned long pfn,
 		if (zone_intersects(zone, pfn, nr_pages))
 			clear_zone_contiguous(zone);
 
-	start_sec = pfn_to_section_nr(pfn);
-	end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
-	for (nr = start_sec; nr <= end_sec; nr++) {
-		unsigned long pfns;
-
+	for (; pfn < end_pfn; pfn += cur_nr_pages) {
 		cond_resched();
-		pfns = min(nr_pages, PAGES_PER_SECTION
-				- (pfn & ~PAGE_SECTION_MASK));
-		__remove_section(pfn, pfns, map_offset, altmap);
-		pfn += pfns;
-		nr_pages -= pfns;
+		/* Select all remaining pages up to the next section boundary */
+		cur_nr_pages = min(end_pfn - pfn, -(pfn | PAGE_SECTION_MASK));
+		__remove_section(pfn, cur_nr_pages, map_offset, altmap);
 		map_offset = 0;
 	}
 
-- 
2.21.0




* [PATCH v2 5/6] mm: Introduce for_each_zone_nid()
  2019-08-26 10:10 [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory David Hildenbrand
                   ` (3 preceding siblings ...)
  2019-08-26 10:10 ` [PATCH v2 4/6] mm/memory_hotplug: Cleanup __remove_pages() David Hildenbrand
@ 2019-08-26 10:10 ` David Hildenbrand
  2019-08-26 10:10 ` [PATCH v2 6/6] mm/memory_hotplug: Pass nid instead of zone to __remove_pages() David Hildenbrand
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: David Hildenbrand @ 2019-08-26 10:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Vlastimil Babka,
	Dan Williams, Mel Gorman, Michal Hocko, Wei Yang,
	Johannes Weiner, Arun KS

Allow iterating over all zones belonging to a nid.
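
This works because a node's zones are stored contiguously in its pgdat
and next_zone() then continues into the next node's zones, where
zone_to_nid() changes and terminates the loop. Usage, as on the removal
path in the next patch:

	struct zone *zone;

	for_each_zone_nid(zone, nid)
		if (zone_intersects(zone, pfn, nr_pages))
			clear_zone_contiguous(zone);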

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Arun KS <arunks@codeaurora.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mmzone.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8b5f758942a2..71f2b9b55069 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1004,6 +1004,11 @@ extern struct zone *next_zone(struct zone *zone);
 			; /* do nothing */		\
 		else
 
+#define for_each_zone_nid(zone, nid)			\
+	for (zone = (NODE_DATA(nid))->node_zones;	\
+	     zone && zone_to_nid(zone) == nid;		\
+	     zone = next_zone(zone))
+
 static inline struct zone *zonelist_zone(struct zoneref *zoneref)
 {
 	return zoneref->zone;
-- 
2.21.0




* [PATCH v2 6/6] mm/memory_hotplug: Pass nid instead of zone to __remove_pages()
  2019-08-26 10:10 [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory David Hildenbrand
                   ` (4 preceding siblings ...)
  2019-08-26 10:10 ` [PATCH v2 5/6] mm: Introduce for_each_zone_nid() David Hildenbrand
@ 2019-08-26 10:10 ` David Hildenbrand
  2019-08-27 10:49   ` Robin Murphy
  2019-08-26 14:53 ` [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory Aneesh Kumar K.V
  2019-08-29  8:36 ` Michal Hocko
  7 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2019-08-26 10:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Catalin Marinas, Will Deacon,
	Tony Luck, Fenghua Yu, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Yoshinori Sato, Rich Felker, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Andrew Morton,
	Mark Rutland, Steve Capper, Mike Rapoport, Anshuman Khandual,
	Yu Zhao, Jun Yao, Robin Murphy, Michal Hocko, Oscar Salvador,
	Matthew Wilcox (Oracle),
	Christophe Leroy, Aneesh Kumar K.V, Pavel Tatashin,
	Gerald Schaefer, Halil Pasic, Tom Lendacky, Greg Kroah-Hartman,
	Masahiro Yamada, Dan Williams, Wei Yang, Qian Cai,
	Jason Gunthorpe, Logan Gunthorpe, Ira Weiny, linux-arm-kernel,
	linux-ia64, linuxppc-dev, linux-s390, linux-sh

The zone parameter is no longer in use. Replace it with the nid, which
we can use to limit the number of zones we have to process (via
for_each_zone_nid()). The function signature of __remove_pages() now
looks much more similar to the one of __add_pages().

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Jun Yao <yaojun8558363@gmail.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-ia64@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/arm64/mm/mmu.c            |  4 +---
 arch/ia64/mm/init.c            |  4 +---
 arch/powerpc/mm/mem.c          |  3 +--
 arch/s390/mm/init.c            |  4 +---
 arch/sh/mm/init.c              |  4 +---
 arch/x86/mm/init_32.c          |  4 +---
 arch/x86/mm/init_64.c          |  4 +---
 include/linux/memory_hotplug.h |  2 +-
 mm/memory_hotplug.c            | 17 +++++++++--------
 mm/memremap.c                  |  3 +--
 10 files changed, 18 insertions(+), 31 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e67bab4d613e..9a2d388314f3 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1080,7 +1080,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 
 	/*
 	 * FIXME: Cleanup page tables (also in arch_add_memory() in case
@@ -1089,7 +1088,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 	 * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
 	 * unlocked yet.
 	 */
-	zone = page_zone(pfn_to_page(start_pfn));
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(nid, start_pfn, nr_pages, altmap);
 }
 #endif
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index bf9df2625bc8..ae6a3e718aa0 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -689,9 +689,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(nid, start_pfn, nr_pages, altmap);
 }
 #endif
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 9191a66b3bc5..af21e13529ce 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -130,10 +130,9 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap);
 	int ret;
 
-	__remove_pages(page_zone(page), start_pfn, nr_pages, altmap);
+	__remove_pages(nid, start_pfn, nr_pages, altmap);
 
 	/* Remove htab bolted mappings for this section of memory */
 	start = (unsigned long)__va(start);
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 20340a03ad90..2a7373ed6ded 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -296,10 +296,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(nid, start_pfn, nr_pages, altmap);
 	vmem_remove_mapping(start, size);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index dfdbaa50946e..32441b59297d 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -434,9 +434,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(nid, start_pfn, nr_pages, altmap);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 4068abb9427f..2760e4bfbc56 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -865,10 +865,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(nid, start_pfn, nr_pages, altmap);
 }
 #endif
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a6b5c653727b..99d92297f1cf 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1212,10 +1212,8 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap);
-	struct zone *zone = page_zone(page);
 
-	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pages(nid, start_pfn, nr_pages, altmap);
 	kernel_physical_mapping_remove(start, start + size);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index f46ea71b4ffd..c5b38e7dc8aa 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -125,7 +125,7 @@ static inline bool movable_node_is_enabled(void)
 
 extern void arch_remove_memory(int nid, u64 start, u64 size,
 			       struct vmem_altmap *altmap);
-extern void __remove_pages(struct zone *zone, unsigned long start_pfn,
+extern void __remove_pages(int nid, unsigned long start_pfn,
 			   unsigned long nr_pages, struct vmem_altmap *altmap);
 
 /* reasonably generic interface to expand the physical pages */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e88c96cf9d77..49ca3364eb70 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -514,7 +514,7 @@ static void __remove_zone(struct zone *zone, unsigned long start_pfn,
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 }
 
-static void __remove_section(unsigned long pfn, unsigned long nr_pages,
+static void __remove_section(int nid, unsigned long pfn, unsigned long nr_pages,
 			     unsigned long map_offset,
 			     struct vmem_altmap *altmap)
 {
@@ -525,14 +525,14 @@ static void __remove_section(unsigned long pfn, unsigned long nr_pages,
 		return;
 
 	/* TODO: move zone handling out of memory removal path */
-	for_each_zone(zone)
+	for_each_zone_nid(zone, nid)
 		__remove_zone(zone, pfn, nr_pages);
 	sparse_remove_section(ms, pfn, nr_pages, map_offset, altmap);
 }
 
 /**
  * __remove_pages() - remove sections of pages from a zone
- * @zone: zone from which pages need to be removed
+ * @nid: the nid all pages were added to
  * @pfn: starting pageframe (must be aligned to start of a section)
  * @nr_pages: number of pages to remove (must be multiple of section size)
  * @altmap: alternative device page map or %NULL if default memmap is used
@@ -542,12 +542,13 @@ static void __remove_section(unsigned long pfn, unsigned long nr_pages,
  * sure that pages are marked reserved and zones are adjust properly by
  * calling offline_pages().
  */
-void __remove_pages(struct zone *zone, unsigned long pfn,
-		    unsigned long nr_pages, struct vmem_altmap *altmap)
+void __remove_pages(int nid, unsigned long pfn, unsigned long nr_pages,
+		    struct vmem_altmap *altmap)
 {
 	const unsigned long end_pfn = pfn + nr_pages;
 	unsigned long cur_nr_pages;
 	unsigned long map_offset = 0;
+	struct zone *zone;
 
 	if (check_pfn_span(pfn, nr_pages, "remove"))
 		return;
@@ -555,7 +556,7 @@ void __remove_pages(struct zone *zone, unsigned long pfn,
 	map_offset = vmem_altmap_offset(altmap);
 
 	/* TODO: move zone handling out of memory removal path */
-	for_each_zone(zone)
+	for_each_zone_nid(zone, nid)
 		if (zone_intersects(zone, pfn, nr_pages))
 			clear_zone_contiguous(zone);
 
@@ -563,12 +564,12 @@ void __remove_pages(struct zone *zone, unsigned long pfn,
 		cond_resched();
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages = min(end_pfn - pfn, -(pfn | PAGE_SECTION_MASK));
-		__remove_section(pfn, cur_nr_pages, map_offset, altmap);
+		__remove_section(nid, pfn, cur_nr_pages, map_offset, altmap);
 		map_offset = 0;
 	}
 
 	/* TODO: move zone handling out of memory removal path */
-	for_each_zone(zone)
+	for_each_zone_nid(zone, nid)
 		set_zone_contiguous(zone);
 }
 
diff --git a/mm/memremap.c b/mm/memremap.c
index 8a394552b5bd..292ef4c6b447 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -138,8 +138,7 @@ static void devm_memremap_pages_release(void *data)
 	mem_hotplug_begin();
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
 		pfn = PHYS_PFN(res->start);
-		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
-				 PHYS_PFN(resource_size(res)), NULL);
+		__remove_pages(nid, pfn, PHYS_PFN(resource_size(res)), NULL);
 	} else {
 		arch_remove_memory(nid, res->start, resource_size(res),
 				pgmap_altmap(pgmap));
-- 
2.21.0




* Re: [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory
  2019-08-26 10:10 [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory David Hildenbrand
                   ` (5 preceding siblings ...)
  2019-08-26 10:10 ` [PATCH v2 6/6] mm/memory_hotplug: Pass nid instead of zone to __remove_pages() David Hildenbrand
@ 2019-08-26 14:53 ` Aneesh Kumar K.V
  2019-08-26 15:43   ` David Hildenbrand
  2019-08-29  8:38   ` Michal Hocko
  2019-08-29  8:36 ` Michal Hocko
  7 siblings, 2 replies; 30+ messages in thread
From: Aneesh Kumar K.V @ 2019-08-26 14:53 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, David Hildenbrand, Alexander Duyck, Andrew Morton,
	Andy Lutomirski, Anshuman Khandual, Arun KS,
	Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
	Christian Borntraeger, Christophe Leroy, Dan Williams,
	Dave Hansen, Fenghua Yu, Gerald Schaefer, Greg Kroah-Hartman,
	Halil Pasic, Heiko Carstens, H. Peter Anvin, Ingo Molnar,
	Ira Weiny, Jason Gunthorpe, Johannes Weiner, Jun Yao,
	Logan Gunthorpe, Mark Rutland, Masahiro Yamada,
	Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Michal Hocko, Mike Rapoport,
	Oscar Salvador, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Rich Felker, Robin Murphy,
	Steve Capper, Thomas Gleixner, Tom Lendacky, Tony Luck,
	Vasily Gorbik, Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon,
	Yoshinori Sato, Yu Zhao

David Hildenbrand <david@redhat.com> writes:

> Working on virtio-mem, I was able to trigger a kernel BUG (with debug
> options enabled) when removing memory that was never onlined. I was able
> to reproduce with DIMMs. As far as I can see, the same can also happen
> without debug configs enabled if we're unlucky and the uninitialized
> memmap contains selected garbage.
>
> The root problem is that we should not try to derive the zone of memory we
> are removing from the first PFN. The individual memory blocks of a DIMM
> could be spanned by different zones, by multiple zones (after being
> offlined and re-onlined), or by no zone at all (never onlined).
>
> Let's process all applicable zones when removing memory so we're on the
> safe side. In the long term, we want to resize the zones when offlining
> memory (and before removing ZONE_DEVICE memory), however, that will require
> more thought (and most probably a new SECTION_ACTIVE / pfn_active()
> thingy). More details about that in patch #3.
>
> Along with the fix, some related cleanups.
>
> v1 -> v2:
> - Include "mm: Introduce for_each_zone_nid()"
> - "mm/memory_hotplug: Pass nid instead of zone to __remove_pages()"
> -- Pass the nid instead of the zone and use it to reduce the number of
>    zones to process
>
> --- snip ---
>
> I gave this a quick test with a DIMM on x86-64:
>
> Start with a memory-less node 1. Hotplug a DIMM (512MB) to node 1.
> The 1st memory block is not onlined. The 2nd and 4th are onlined MOVABLE,
> the 3rd is onlined NORMAL.
>
> :/# echo "online_movable" > /sys/devices/system/memory/memory41/state
> [...]
> :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
> :/# echo "online_kernel" > /sys/devices/system/memory/memory42/state
> :/# cat /sys/devices/system/memory/memory40/state
> offline
>
> :/# cat /proc/zoneinfo
> Node 1, zone   Normal
>  [...]
>         spanned  32768
>         present  32768
>         managed  32768
>  [...]
> Node 1, zone  Movable
>  [...]
>         spanned  98304
>         present  65536
>         managed  65536
>  [...]
>
> Trigger hotunplug. If it succeeds (block 42 can be offlined):
>
> :/# cat /proc/zoneinfo
>
> Node 1, zone   Normal
>   pages free     0
>         min      0
>         low      0
>         high     0
>         spanned  0
>         present  0
>         managed  0
>         protection: (0, 0, 0, 0, 0)
> Node 1, zone  Movable
>   pages free     0
>         min      0
>         low      0
>         high     0
>         spanned  0
>         present  0
>         managed  0
>         protection: (0, 0, 0, 0, 0)
>
> So all zones were properly fixed up and we don't access the memmap of the
> first, never-onlined memory block (garbage). I am no longer able to trigger
> the BUG. I did a similar test with an already populated node.
>

I did report a variant of the issue at

https://lore.kernel.org/linux-mm/20190514025354.9108-1-aneesh.kumar@linux.ibm.com/

This patch series still doesn't handle the fact that struct page backing
the start_pfn might not be initialized. ie, it results in crash like
below

    pc: c0000000004bc1ec: shrink_zone_span+0x1bc/0x290
    lr: c0000000004bc1e8: shrink_zone_span+0x1b8/0x290
    sp: c0000000dac7f910
   msr: 800000000282b033
  current = 0xc0000000da2fa000
  paca    = 0xc00000000fffb300   irqmask: 0x03   irq_happened: 0x01
    pid   = 1224, comm = ndctl
kernel BUG at /home/kvaneesh/src/linux/include/linux/mm.h:1088!
Linux version 5.3.0-rc6-17495-gc7727d815970-dirty (kvaneesh@ltc-boston123) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #183 SMP Mon Aug 26 09:37:32 CDT 2019
enter ? for help
[c0000000dac7f980] c0000000004bc574 __remove_zone+0x84/0xd0
[c0000000dac7f9d0] c0000000004bc920 __remove_section+0x100/0x170
[c0000000dac7fa30] c0000000004bec98 __remove_pages+0x168/0x220
[c0000000dac7fa90] c00000000007dff8 arch_remove_memory+0x38/0x110
[c0000000dac7fb00] c00000000050cb0c devm_memremap_pages_release+0x24c/0x2f0
[c0000000dac7fb90] c000000000cfec00 devm_action_release+0x30/0x50
[c0000000dac7fbb0] c000000000cffe7c release_nodes+0x24c/0x2c0
[c0000000dac7fc20] c000000000cf8988 device_release_driver_internal+0x168/0x230
[c0000000dac7fc60] c000000000cf5624 unbind_store+0x74/0x190
[c0000000dac7fcb0] c000000000cf42a4 drv_attr_store+0x44/0x60
[c0000000dac7fcd0] c000000000617d44 sysfs_kf_write+0x74/0xa0

I do have a few patches to handle the crashes earlier in
devm_memremap_pages_release()

--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -121,7 +121,7 @@ static void devm_memremap_pages_release(void *data)
        dev_pagemap_cleanup(pgmap);
 
        /* pages are dead and unused, undo the arch mapping */
-       nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
+       nid = page_to_nid(pfn_to_page(pfn_first(pgmap)));
 

and also for pfn_first

https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg16205.html

-aneesh




* Re: [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory
  2019-08-26 14:53 ` [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory Aneesh Kumar K.V
@ 2019-08-26 15:43   ` David Hildenbrand
  2019-08-26 16:01     ` Aneesh Kumar K.V
  2019-08-29  8:38   ` Michal Hocko
  1 sibling, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2019-08-26 15:43 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-kernel
  Cc: linux-mm, Alexander Duyck, Andrew Morton, Andy Lutomirski,
	Anshuman Khandual, Arun KS, Benjamin Herrenschmidt,
	Borislav Petkov, Catalin Marinas, Christian Borntraeger,
	Christophe Leroy, Dan Williams, Dave Hansen, Fenghua Yu,
	Gerald Schaefer, Greg Kroah-Hartman, Halil Pasic, Heiko Carstens,
	H. Peter Anvin, Ingo Molnar, Ira Weiny, Jason Gunthorpe,
	Johannes Weiner, Jun Yao, Logan Gunthorpe, Mark Rutland,
	Masahiro Yamada, Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Michal Hocko, Mike Rapoport,
	Oscar Salvador, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Rich Felker, Robin Murphy,
	Steve Capper, Thomas Gleixner, Tom Lendacky, Tony Luck,
	Vasily Gorbik, Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon,
	Yoshinori Sato, Yu Zhao

On 26.08.19 16:53, Aneesh Kumar K.V wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> Working on virtio-mem, I was able to trigger a kernel BUG (with debug
>> options enabled) when removing memory that was never onlined. I was able
>> to reproduce with DIMMs. As far as I can see, the same can also happen
>> without debug configs enabled if we're unlucky and the uninitialized
>> memmap contains selected garbage.
>>
>> The root problem is that we should not try to derive the zone of memory we
>> are removing from the first PFN. The individual memory blocks of a DIMM
>> could be spanned by different zones, by multiple zones (after being
>> offlined and re-onlined), or by no zone at all (never onlined).
>>
>> Let's process all applicable zones when removing memory so we're on the
>> safe side. In the long term, we want to resize the zones when offlining
>> memory (and before removing ZONE_DEVICE memory), however, that will require
>> more thought (and most probably a new SECTION_ACTIVE / pfn_active()
>> thingy). More details about that in patch #3.
>>
>> Along with the fix, some related cleanups.
>>
>> v1 -> v2:
>> - Include "mm: Introduce for_each_zone_nid()"
>> - "mm/memory_hotplug: Pass nid instead of zone to __remove_pages()"
>> -- Pass the nid instead of the zone and use it to reduce the number of
>>    zones to process
>>
>> --- snip ---
>>
>> I gave this a quick test with a DIMM on x86-64:
>>
>> Start with a memory-less node 1. Hotplug a DIMM (512MB) to node 1.
>> The 1st memory block is not onlined. The 2nd and 4th are onlined MOVABLE,
>> the 3rd is onlined NORMAL.
>>
>> :/# echo "online_movable" > /sys/devices/system/memory/memory41/state
>> [...]
>> :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
>> :/# echo "online_kernel" > /sys/devices/system/memory/memory42/state
>> :/# cat /sys/devices/system/memory/memory40/state
>> offline
>>
>> :/# cat /proc/zoneinfo
>> Node 1, zone   Normal
>>  [...]
>>         spanned  32768
>>         present  32768
>>         managed  32768
>>  [...]
>> Node 1, zone  Movable
>>  [...]
>>         spanned  98304
>>         present  65536
>>         managed  65536
>>  [...]
>>
>> Trigger hotunplug. If it succeeds (block 42 can be offlined):
>>
>> :/# cat /proc/zoneinfo
>>
>> Node 1, zone   Normal
>>   pages free     0
>>         min      0
>>         low      0
>>         high     0
>>         spanned  0
>>         present  0
>>         managed  0
>>         protection: (0, 0, 0, 0, 0)
>> Node 1, zone  Movable
>>   pages free     0
>>         min      0
>>         low      0
>>         high     0
>>         spanned  0
>>         present  0
>>         managed  0
>>         protection: (0, 0, 0, 0, 0)
>>
>> So all zones were properly fixed up and we don't access the memmap of the
>> first, never-onlined memory block (garbage). I am no longer able to trigger
>> the BUG. I did a similar test with an already populated node.
>>
> 
> I did report a variant of the issue at
> 
> https://lore.kernel.org/linux-mm/20190514025354.9108-1-aneesh.kumar@linux.ibm.com/
> 
> This patch series still doesn't handle the fact that struct page backing
> the start_pfn might not be initialized. ie, it results in crash like
> below

Okay, that's a related but different issue I think.

I can see that current shrink_zone_span() might read-access the
uninitialized struct page of a PFN if

1. The zone has holes and we check for "zone all holes". If we get
pfn_valid(pfn), we check if "page_zone(pfn_to_page(pfn)) != zone".

2. Via find_smallest_section_pfn() / find_biggest_section_pfn() find a
spanned pfn_valid(). We check
- pfn_to_nid(start_pfn) != nid
- zone != page_zone(pfn_to_page(start_pfn)

So we don't actually use the zone/nid, only use it to sanity check. That
might result in false-positives (not that bad).

It all boils down to shrink_zone_span() not working only on active
memory, for which the PFN is not only valid but also initialized
(something for which we need a new section flag I assume).

Which access triggers the issue you describe? pfn_to_nid()?
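
For illustration, the sanity checks in question are roughly (a sketch
based on find_smallest_section_pfn() and friends, not the exact code):

	if (unlikely(pfn_to_nid(start_pfn) != nid))
		continue;
	/* both checks dereference the possibly uninitialized memmap */
	if (zone && zone != page_zone(pfn_to_page(start_pfn)))
		continue;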

> 
>     pc: c0000000004bc1ec: shrink_zone_span+0x1bc/0x290
>     lr: c0000000004bc1e8: shrink_zone_span+0x1b8/0x290
>     sp: c0000000dac7f910
>    msr: 800000000282b033
>   current = 0xc0000000da2fa000
>   paca    = 0xc00000000fffb300   irqmask: 0x03   irq_happened: 0x01
>     pid   = 1224, comm = ndctl
> kernel BUG at /home/kvaneesh/src/linux/include/linux/mm.h:1088!
> Linux version 5.3.0-rc6-17495-gc7727d815970-dirty (kvaneesh@ltc-boston123) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #183 SMP Mon Aug 26 09:37:32 CDT 2019
> enter ? for help

Which exact kernel BUG are you hitting here? (my tree doesn't seem to
have any BUG statement around include/linux/mm.h:1088)

> [c0000000dac7f980] c0000000004bc574 __remove_zone+0x84/0xd0
> [c0000000dac7f9d0] c0000000004bc920 __remove_section+0x100/0x170
> [c0000000dac7fa30] c0000000004bec98 __remove_pages+0x168/0x220
> [c0000000dac7fa90] c00000000007dff8 arch_remove_memory+0x38/0x110
> [c0000000dac7fb00] c00000000050cb0c devm_memremap_pages_release+0x24c/0x2f0
> [c0000000dac7fb90] c000000000cfec00 devm_action_release+0x30/0x50
> [c0000000dac7fbb0] c000000000cffe7c release_nodes+0x24c/0x2c0
> [c0000000dac7fc20] c000000000cf8988 device_release_driver_internal+0x168/0x230
> [c0000000dac7fc60] c000000000cf5624 unbind_store+0x74/0x190
> [c0000000dac7fcb0] c000000000cf42a4 drv_attr_store+0x44/0x60
> [c0000000dac7fcd0] c000000000617d44 sysfs_kf_write+0x74/0xa0


-- 

Thanks,

David / dhildenb



* Re: [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory
  2019-08-26 15:43   ` David Hildenbrand
@ 2019-08-26 16:01     ` Aneesh Kumar K.V
  2019-08-26 16:20       ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: Aneesh Kumar K.V @ 2019-08-26 16:01 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Alexander Duyck, Andrew Morton, Andy Lutomirski,
	Anshuman Khandual, Arun KS, Benjamin Herrenschmidt,
	Borislav Petkov, Catalin Marinas, Christian Borntraeger,
	Christophe Leroy, Dan Williams, Dave Hansen, Fenghua Yu,
	Gerald Schaefer, Greg Kroah-Hartman, Halil Pasic, Heiko Carstens,
	H. Peter Anvin, Ingo Molnar, Ira Weiny, Jason Gunthorpe,
	Johannes Weiner, Jun Yao, Logan Gunthorpe, Mark Rutland,
	Masahiro Yamada, Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Michal Hocko, Mike Rapoport,
	Oscar Salvador, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Rich Felker, Robin Murphy,
	Steve Capper, Thomas Gleixner, Tom Lendacky, Tony Luck,
	Vasily Gorbik, Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon,
	Yoshinori Sato, Yu Zhao

On 8/26/19 9:13 PM, David Hildenbrand wrote:
> On 26.08.19 16:53, Aneesh Kumar K.V wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>
>>> 

....

>>
>> I did report a variant of the issue at
>>
>> https://lore.kernel.org/linux-mm/20190514025354.9108-1-aneesh.kumar@linux.ibm.com/
>>
>> This patch series still doesn't handle the fact that struct page backing
>> the start_pfn might not be initialized. ie, it results in crash like
>> below
> 
> Okay, that's a related but different issue I think.
> 
> I can see that current shrink_zone_span() might read-access the
> uninitialized struct page of a PFN if
> 
> 1. The zone has holes and we check for "zone all holes". If we get
> pfn_valid(pfn), we check if "page_zone(pfn_to_page(pfn)) != zone".
> 
> 2. Via find_smallest_section_pfn() / find_biggest_section_pfn() find a
> spanned pfn_valid(). We check
> - pfn_to_nid(start_pfn) != nid
> - zone != page_zone(pfn_to_page(start_pfn)
> 
> So we don't actually use the zone/nid, only use it to sanity check. That
> might result in false-positives (not that bad).
> 
> It all boils down to shrink_zone_span() not working only on active
> memory, for which the PFN is not only valid but also initialized
> (something for which we need a new section flag I assume).
> 
> Which access triggers the issue you describe? pfn_to_nid()?
> 
>>
>>      pc: c0000000004bc1ec: shrink_zone_span+0x1bc/0x290
>>      lr: c0000000004bc1e8: shrink_zone_span+0x1b8/0x290
>>      sp: c0000000dac7f910
>>     msr: 800000000282b033
>>    current = 0xc0000000da2fa000
>>    paca    = 0xc00000000fffb300   irqmask: 0x03   irq_happened: 0x01
>>      pid   = 1224, comm = ndctl
>> kernel BUG at /home/kvaneesh/src/linux/include/linux/mm.h:1088!
>> Linux version 5.3.0-rc6-17495-gc7727d815970-dirty (kvaneesh@ltc-boston123) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #183 SMP Mon Aug 26 09:37:32 CDT 2019
>> enter ? for help
> 
>> Which exact kernel BUG are you hitting here? (my tree doesn't seem to
>> have any BUG statement around include/linux/mm.h:1088).



This is against upstream linus with your patches applied.


static inline int page_to_nid(const struct page *page)
{
	struct page *p = (struct page *)page;

	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
}


#define PF_POISONED_CHECK(page) ({					\
		VM_BUG_ON_PGFLAGS(PagePoisoned(page), page);		\
		page; })
#


It is the node id access.
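
For context, PagePoisoned() just compares against the pattern the memmap
is poisoned with under CONFIG_DEBUG_VM; roughly (sketch of the existing
helper):

static inline int PagePoisoned(const struct page *page)
{
	return page->flags == PAGE_POISON_PATTERN;	/* -1l */
}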

-aneesh



* Re: [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory
  2019-08-26 16:01     ` Aneesh Kumar K.V
@ 2019-08-26 16:20       ` David Hildenbrand
  2019-08-26 16:44         ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2019-08-26 16:20 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-kernel
  Cc: linux-mm, Alexander Duyck, Andrew Morton, Andy Lutomirski,
	Anshuman Khandual, Arun KS, Benjamin Herrenschmidt,
	Borislav Petkov, Catalin Marinas, Christian Borntraeger,
	Christophe Leroy, Dan Williams, Dave Hansen, Fenghua Yu,
	Gerald Schaefer, Greg Kroah-Hartman, Halil Pasic, Heiko Carstens,
	H. Peter Anvin, Ingo Molnar, Ira Weiny, Jason Gunthorpe,
	Johannes Weiner, Jun Yao, Logan Gunthorpe, Mark Rutland,
	Masahiro Yamada, Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Michal Hocko, Mike Rapoport,
	Oscar Salvador, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Rich Felker, Robin Murphy,
	Steve Capper, Thomas Gleixner, Tom Lendacky, Tony Luck,
	Vasily Gorbik, Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon,
	Yoshinori Sato, Yu Zhao

On 26.08.19 18:01, Aneesh Kumar K.V wrote:
> On 8/26/19 9:13 PM, David Hildenbrand wrote:
>> On 26.08.19 16:53, Aneesh Kumar K.V wrote:
>>> David Hildenbrand <david@redhat.com> writes:
>>>
>>>>
> 
> ....
> 
>>>
>>> I did report a variant of the issue at
>>>
>>> https://lore.kernel.org/linux-mm/20190514025354.9108-1-aneesh.kumar@linux.ibm.com/
>>>
>>> This patch series still doesn't handle the fact that struct page backing
>>> the start_pfn might not be initialized. ie, it results in crash like
>>> below
>>
>> Okay, that's a related but different issue I think.
>>
>> I can see that current shrink_zone_span() might read-access the
>> uninitialized struct page of a PFN if
>>
>> 1. The zone has holes and we check for "zone all holes". If we get
>> pfn_valid(pfn), we check if "page_zone(pfn_to_page(pfn)) != zone".
>>
>> 2. Via find_smallest_section_pfn() / find_biggest_section_pfn() find a
>> spanned pfn_valid(). We check
>> - pfn_to_nid(start_pfn) != nid
>> - zone != page_zone(pfn_to_page(start_pfn)
>>
>> So we don't actually use the zone/nid, only use it to sanity check. That
>> might result in false-positives (not that bad).
>>
>> It all boils down to shrink_zone_span() not working only on active
>> memory, for which the PFN is not only valid but also initialized
>> (something for which we need a new section flag I assume).
>>
>> Which access triggers the issue you describe? pfn_to_nid()?
>>
>>>
>>>      pc: c0000000004bc1ec: shrink_zone_span+0x1bc/0x290
>>>      lr: c0000000004bc1e8: shrink_zone_span+0x1b8/0x290
>>>      sp: c0000000dac7f910
>>>     msr: 800000000282b033
>>>    current = 0xc0000000da2fa000
>>>    paca    = 0xc00000000fffb300   irqmask: 0x03   irq_happened: 0x01
>>>      pid   = 1224, comm = ndctl
>>> kernel BUG at /home/kvaneesh/src/linux/include/linux/mm.h:1088!
>>> Linux version 5.3.0-rc6-17495-gc7727d815970-dirty (kvaneesh@ltc-boston123) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #183 SMP Mon Aug 26 09:37:32 CDT 2019
>>> enter ? for help
>>
>> Which exact kernel BUG are you hitting here? (my tree doesn't seem to
>> have any BUG statement around include/linux/mm.h:1088).
> 
> 
> 
> This is against upstream linus with your patches applied.

I'm

> 
> 
> static inline int page_to_nid(const struct page *page)
> {
> 	struct page *p = (struct page *)page;
> 
> 	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
> }
> 
> 
> #define PF_POISONED_CHECK(page) ({					\
> 		VM_BUG_ON_PGFLAGS(PagePoisoned(page), page);		\
> 		page; })
> #
> 
> 
> It is the node id access.

Ah, right. A temporary hack would be to assume in these functions
(shrink_zone_span() and friends) that we might have invalid NIDs /
zonenumbers and simply skip these. After all we're only using them for
finding zone boundaries. Not what we ultimately want, but I think until
we have a proper SECTION_ACTIVE, it might take a while.

-- 

Thanks,

David / dhildenb



* Re: [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory
  2019-08-26 16:20       ` David Hildenbrand
@ 2019-08-26 16:44         ` David Hildenbrand
  2019-08-27  5:46           ` Aneesh Kumar K.V
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2019-08-26 16:44 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-kernel
  Cc: linux-mm, Alexander Duyck, Andrew Morton, Andy Lutomirski,
	Anshuman Khandual, Arun KS, Benjamin Herrenschmidt,
	Borislav Petkov, Catalin Marinas, Christian Borntraeger,
	Christophe Leroy, Dan Williams, Dave Hansen, Fenghua Yu,
	Gerald Schaefer, Greg Kroah-Hartman, Halil Pasic, Heiko Carstens,
	H. Peter Anvin, Ingo Molnar, Ira Weiny, Jason Gunthorpe,
	Johannes Weiner, Jun Yao, Logan Gunthorpe, Mark Rutland,
	Masahiro Yamada, Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Michal Hocko, Mike Rapoport,
	Oscar Salvador, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Rich Felker, Robin Murphy,
	Steve Capper, Thomas Gleixner, Tom Lendacky, Tony Luck,
	Vasily Gorbik, Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon,
	Yoshinori Sato, Yu Zhao

On 26.08.19 18:20, David Hildenbrand wrote:
> On 26.08.19 18:01, Aneesh Kumar K.V wrote:
>> On 8/26/19 9:13 PM, David Hildenbrand wrote:
>>> On 26.08.19 16:53, Aneesh Kumar K.V wrote:
>>>> David Hildenbrand <david@redhat.com> writes:
>>>>
>>>>>
>>
>> ....
>>
>>>>
>>>> I did report a variant of the issue at
>>>>
>>>> https://lore.kernel.org/linux-mm/20190514025354.9108-1-aneesh.kumar@linux.ibm.com/
>>>>
>>>> This patch series still doesn't handle the fact that struct page backing
>>>> the start_pfn might not be initialized. ie, it results in crash like
>>>> below
>>>
>>> Okay, that's a related but different issue I think.
>>>
>>> I can see that current shrink_zone_span() might read-access the
>>> uninitialized struct page of a PFN if
>>>
>>> 1. The zone has holes and we check for "zone all holes". If we get
>>> pfn_valid(pfn), we check if "page_zone(pfn_to_page(pfn)) != zone".
>>>
>>> 2. Via find_smallest_section_pfn() / find_biggest_section_pfn() find a
>>> spanned pfn_valid(). We check
>>> - pfn_to_nid(start_pfn) != nid
>>> - zone != page_zone(pfn_to_page(start_pfn))
>>>
>>> So we don't actually use the zone/nid, only use them to sanity-check.
>>> That might result in false positives (not that bad).
>>>
>>> It all boils down to shrink_zone_span() not working only on active
>>> memory, for which the PFN is not only valid but also initialized
>>> (something for which we need a new section flag I assume).
>>>
>>> Which access triggers the issue you describe? pfn_to_nid()?
>>>
>>>>
>>>>      pc: c0000000004bc1ec: shrink_zone_span+0x1bc/0x290
>>>>      lr: c0000000004bc1e8: shrink_zone_span+0x1b8/0x290
>>>>      sp: c0000000dac7f910
>>>>     msr: 800000000282b033
>>>>    current = 0xc0000000da2fa000
>>>>    paca    = 0xc00000000fffb300   irqmask: 0x03   irq_happened: 0x01
>>>>      pid   = 1224, comm = ndctl
>>>> kernel BUG at /home/kvaneesh/src/linux/include/linux/mm.h:1088!
>>>> Linux version 5.3.0-rc6-17495-gc7727d815970-dirty (kvaneesh@ltc-boston123) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #183 SMP Mon Aug 26 09:37:32 CDT 2019
>>>> enter ? for help
>>>
>>> Which exact kernel BUG are you hitting here? (my tree doesn't seem to
>>> have any BUG statement around include/linux/mm.h:1088).
>>
>>
>>
>> This is against upstream linus with your patches applied.
> 
> I'm
> 
>>
>>
>> static inline int page_to_nid(const struct page *page)
>> {
>> 	struct page *p = (struct page *)page;
>>
>> 	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
>> }
>>
>>
>> #define PF_POISONED_CHECK(page) ({					\
>> 		VM_BUG_ON_PGFLAGS(PagePoisoned(page), page);		\
>> 		page; })
>> #
>>
>>
>> It is the node id access.
> 
> Ah, right. A temporary hack would be to assume in these functions
> (shrink_zone_span() and friends) that we might have invalid NIDs /
> zone numbers and simply skip those PFNs. After all, we're only using
> them for finding zone boundaries. Not what we ultimately want, but I
> think it might take a while until we have a proper SECTION_ACTIVE.
> 

I am talking about something as hacky as this:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8d1c7313ab3f..57ed3dd76a4f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1099,6 +1099,7 @@ static inline int page_zone_id(struct page *page)

 #ifdef NODE_NOT_IN_PAGE_FLAGS
 extern int page_to_nid(const struct page *page);
+#define __page_to_nid page_to_nid
 #else
 static inline int page_to_nid(const struct page *page)
 {
@@ -1106,6 +1107,10 @@ static inline int page_to_nid(const struct page
*page)

 	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
 }
+static inline int __page_to_nid(const struct page *page)
+{
+	return ((page)->flags >> NODES_PGSHIFT) & NODES_MASK;
+}
 #endif

 #ifdef CONFIG_NUMA_BALANCING
@@ -1249,6 +1254,12 @@ static inline struct zone *page_zone(const struct
page *page)
 	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
 }

+static inline struct zone *__page_zone(const struct page *page)
+{
+	return &NODE_DATA(__page_to_nid(page))->node_zones[page_zonenum(page)];
+}
+
+
 static inline pg_data_t *page_pgdat(const struct page *page)
 {
 	return NODE_DATA(page_to_nid(page));
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 49ca3364eb70..378b593d1fe1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -334,10 +334,10 @@ static unsigned long find_smallest_section_pfn(int
nid, struct zone *zone,
 		if (unlikely(!pfn_valid(start_pfn)))
 			continue;

-		if (unlikely(pfn_to_nid(start_pfn) != nid))
+		/* We might have uninitialized memmaps */
+		if (unlikely(__page_to_nid(pfn_to_page(start_pfn)) != nid))
 			continue;
-
-		if (zone && zone != page_zone(pfn_to_page(start_pfn)))
+		if (zone && zone != __page_zone(pfn_to_page(start_pfn)))
 			continue;

 		return start_pfn;
@@ -359,10 +359,10 @@ static unsigned long find_biggest_section_pfn(int
nid, struct zone *zone,
 		if (unlikely(!pfn_valid(pfn)))
 			continue;

-		if (unlikely(pfn_to_nid(pfn) != nid))
+		/* We might have uninitialized memmaps */
+		if (unlikely(__page_to_nid(pfn_to_page(pfn)) != nid))
 			continue;
-
-		if (zone && zone != page_zone(pfn_to_page(pfn)))
+		if (zone && zone != __page_zone(pfn_to_page(pfn)))
 			continue;

 		return pfn;
@@ -418,7 +418,10 @@ static void shrink_zone_span(struct zone *zone,
unsigned long start_pfn,
 		if (unlikely(!pfn_valid(pfn)))
 			continue;

-		if (page_zone(pfn_to_page(pfn)) != zone)
+		/* We might have uninitialized memmaps */
+		if (unlikely(__page_to_nid(pfn_to_page(pfn)) != nid))
+			continue;
+		if (__page_zone(pfn_to_page(pfn)) != zone)
 			continue;

 		/* Skip range to be removed */
@@ -483,7 +486,8 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
 		if (unlikely(!pfn_valid(pfn)))
 			continue;

-		if (pfn_to_nid(pfn) != nid)
+		/* We might have uninitialized memmaps */
+		if (unlikely(__page_to_nid(pfn_to_page(pfn)) != nid))
 			continue;

 		/* Skip range to be removed */

-- 

Thanks,

David / dhildenb


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory
  2019-08-26 16:44         ` David Hildenbrand
@ 2019-08-27  5:46           ` Aneesh Kumar K.V
  2019-08-27  7:06             ` David Hildenbrand
  2019-08-28  9:33             ` David Hildenbrand
  0 siblings, 2 replies; 30+ messages in thread
From: Aneesh Kumar K.V @ 2019-08-27  5:46 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Alexander Duyck, Andrew Morton, Andy Lutomirski,
	Anshuman Khandual, Arun KS, Benjamin Herrenschmidt,
	Borislav Petkov, Catalin Marinas, Christian Borntraeger,
	Christophe Leroy, Dan Williams, Dave Hansen, Fenghua Yu,
	Gerald Schaefer, Greg Kroah-Hartman, Halil Pasic, Heiko Carstens,
	H. Peter Anvin, Ingo Molnar, Ira Weiny, Jason Gunthorpe,
	Johannes Weiner, Jun Yao, Logan Gunthorpe, Mark Rutland,
	Masahiro Yamada, Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Michal Hocko, Mike Rapoport,
	Oscar Salvador, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Rich Felker, Robin Murphy,
	Steve Capper, Thomas Gleixner, Tom Lendacky, Tony Luck,
	Vasily Gorbik, Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon,
	Yoshinori Sato, Yu Zhao

David Hildenbrand <david@redhat.com> writes:

> On 26.08.19 18:20, David Hildenbrand wrote:
>> On 26.08.19 18:01, Aneesh Kumar K.V wrote:
>>> On 8/26/19 9:13 PM, David Hildenbrand wrote:
>>>> On 26.08.19 16:53, Aneesh Kumar K.V wrote:
>>>>> David Hildenbrand <david@redhat.com> writes:
>>>>>
>>>>>>
>>>
>>> ....
>>>
>>>>>
>>>>> I did report a variant of the issue at
>>>>>
>>>>> https://lore.kernel.org/linux-mm/20190514025354.9108-1-aneesh.kumar@linux.ibm.com/
>>>>>
>>>>> This patch series still doesn't handle the fact that the struct page
>>>>> backing the start_pfn might not be initialized, i.e., it results in a
>>>>> crash like the one below
>>>>
>>>> Okay, that's a related but different issue I think.
>>>>
>>>> I can see that current shrink_zone_span() might read-access the
>>>> uninitialized struct page of a PFN if
>>>>
>>>> 1. The zone has holes and we check for "zone all holes". If we get
>>>> pfn_valid(pfn), we check if "page_zone(pfn_to_page(pfn)) != zone".
>>>>
>>>> 2. Via find_smallest_section_pfn() / find_biggest_section_pfn() find a
>>>> spanned pfn_valid(). We check
>>>> - pfn_to_nid(start_pfn) != nid
>>>> - zone != page_zone(pfn_to_page(start_pfn))
>>>>
>>>> So we don't actually use the zone/nid, only use them to sanity-check.
>>>> That might result in false positives (not that bad).
>>>>
>>>> It all boils down to shrink_zone_span() not working only on active
>>>> memory, for which the PFN is not only valid but also initialized
>>>> (something for which we need a new section flag I assume).
>>>>
>>>> Which access triggers the issue you describe? pfn_to_nid()?
>>>>
>>>>>
>>>>>      pc: c0000000004bc1ec: shrink_zone_span+0x1bc/0x290
>>>>>      lr: c0000000004bc1e8: shrink_zone_span+0x1b8/0x290
>>>>>      sp: c0000000dac7f910
>>>>>     msr: 800000000282b033
>>>>>    current = 0xc0000000da2fa000
>>>>>    paca    = 0xc00000000fffb300   irqmask: 0x03   irq_happened: 0x01
>>>>>      pid   = 1224, comm = ndctl
>>>>> kernel BUG at /home/kvaneesh/src/linux/include/linux/mm.h:1088!
>>>>> Linux version 5.3.0-rc6-17495-gc7727d815970-dirty (kvaneesh@ltc-boston123) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #183 SMP Mon Aug 26 09:37:32 CDT 2019
>>>>> enter ? for help
>>>>
>>>> Which exact kernel BUG are you hitting here? (my tree doesn't seem to
>>>> have any BUG statement around include/linux/mm.h:1088).
>>>
>>>
>>>
>>> This is against upstream linus with your patches applied.
>> 
>> I'm
>> 
>>>
>>>
>>> static inline int page_to_nid(const struct page *page)
>>> {
>>> 	struct page *p = (struct page *)page;
>>>
>>> 	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
>>> }
>>>
>>>
>>> #define PF_POISONED_CHECK(page) ({					\
>>> 		VM_BUG_ON_PGFLAGS(PagePoisoned(page), page);		\
>>> 		page; })
>>> #
>>>
>>>
>>> It is the node id access.
>> 
>> Ah, right. A temporary hack would be to assume in these functions
>> (shrink_zone_span() and friends) that we might have invalid NIDs /
>> zone numbers and simply skip those PFNs. After all, we're only using
>> them for finding zone boundaries. Not what we ultimately want, but I
>> think it might take a while until we have a proper SECTION_ACTIVE.
>> 
>
> I am talking about something as hacky as this:
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8d1c7313ab3f..57ed3dd76a4f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1099,6 +1099,7 @@ static inline int page_zone_id(struct page *page)
>
>  #ifdef NODE_NOT_IN_PAGE_FLAGS
>  extern int page_to_nid(const struct page *page);
> +#define __page_to_nid page_to_nid
>  #else
>  static inline int page_to_nid(const struct page *page)
>  {
> @@ -1106,6 +1107,10 @@ static inline int page_to_nid(const struct page
> *page)
>
>  	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
>  }
> +static inline int __page_to_nid(const struct page *page)
> +{
> +	return ((page)->flags >> NODES_PGSHIFT) & NODES_MASK;
> +}
>  #endif
>
>  #ifdef CONFIG_NUMA_BALANCING
> @@ -1249,6 +1254,12 @@ static inline struct zone *page_zone(const struct
> page *page)
>  	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
>  }
>
> +static inline struct zone *__page_zone(const struct page *page)
> +{
> +	return &NODE_DATA(__page_to_nid(page))->node_zones[page_zonenum(page)];
> +}

We don't need that. We can always do an explicit __page_to_nid() check
and break from the loop before doing a page_zone()? Also, if the struct
page is poisoned, we should not trust the page_zonenum()?

> +
> +
>  static inline pg_data_t *page_pgdat(const struct page *page)
>  {
>  	return NODE_DATA(page_to_nid(page));
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 49ca3364eb70..378b593d1fe1 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -334,10 +334,10 @@ static unsigned long find_smallest_section_pfn(int
> nid, struct zone *zone,
>  		if (unlikely(!pfn_valid(start_pfn)))
>  			continue;
>
> -		if (unlikely(pfn_to_nid(start_pfn) != nid))
> +		/* We might have uninitialized memmaps */
> +		if (unlikely(__page_to_nid(pfn_to_page(start_pfn)) != nid))
>  			continue;

If we are here, we got a non-poisoned struct page. Hence we don't need
the below change?

> -
> -		if (zone && zone != page_zone(pfn_to_page(start_pfn)))
> +		if (zone && zone != __page_zone(pfn_to_page(start_pfn)))
>  			continue;
>
>  		return start_pfn;
> @@ -359,10 +359,10 @@ static unsigned long find_biggest_section_pfn(int
> nid, struct zone *zone,
>  		if (unlikely(!pfn_valid(pfn)))
>  			continue;
>
> -		if (unlikely(pfn_to_nid(pfn) != nid))
> +		/* We might have uninitialized memmaps */
> +		if (unlikely(__page_to_nid(pfn_to_page(pfn)) != nid))
>  			continue;

same as above

> -
> -		if (zone && zone != page_zone(pfn_to_page(pfn)))
> +		if (zone && zone != __page_zone(pfn_to_page(pfn)))
>  			continue;
>
>  		return pfn;
> @@ -418,7 +418,10 @@ static void shrink_zone_span(struct zone *zone,
> unsigned long start_pfn,
>  		if (unlikely(!pfn_valid(pfn)))
>  			continue;
>
> -		if (page_zone(pfn_to_page(pfn)) != zone)
> +		/* We might have uninitialized memmaps */
> +		if (unlikely(__page_to_nid(pfn_to_page(pfn)) != nid))
> +			continue;

same as above? 

> +		if (__page_zone(pfn_to_page(pfn)) != zone)
>  			continue;
>
>  		/* Skip range to be removed */
> @@ -483,7 +486,8 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
>  		if (unlikely(!pfn_valid(pfn)))
>  			continue;
>
> -		if (pfn_to_nid(pfn) != nid)
> +		/* We might have uninitialized memmaps */
> +		if (unlikely(__page_to_nid(pfn_to_page(pfn)) != nid))
>  			continue;
>
>  		/* Skip range to be removed */
>


But I am not sure whether this is the right approach.

-aneesh



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory
  2019-08-27  5:46           ` Aneesh Kumar K.V
@ 2019-08-27  7:06             ` David Hildenbrand
  2019-08-28  9:33             ` David Hildenbrand
  1 sibling, 0 replies; 30+ messages in thread
From: David Hildenbrand @ 2019-08-27  7:06 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-kernel
  Cc: linux-mm, Alexander Duyck, Andrew Morton, Andy Lutomirski,
	Anshuman Khandual, Arun KS, Benjamin Herrenschmidt,
	Borislav Petkov, Catalin Marinas, Christian Borntraeger,
	Christophe Leroy, Dan Williams, Dave Hansen, Fenghua Yu,
	Gerald Schaefer, Greg Kroah-Hartman, Halil Pasic, Heiko Carstens,
	H. Peter Anvin, Ingo Molnar, Ira Weiny, Jason Gunthorpe,
	Johannes Weiner, Jun Yao, Logan Gunthorpe, Mark Rutland,
	Masahiro Yamada, Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Michal Hocko, Mike Rapoport,
	Oscar Salvador, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Rich Felker, Robin Murphy,
	Steve Capper, Thomas Gleixner, Tom Lendacky, Tony Luck,
	Vasily Gorbik, Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon,
	Yoshinori Sato, Yu Zhao

On 27.08.19 07:46, Aneesh Kumar K.V wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 26.08.19 18:20, David Hildenbrand wrote:
>>> On 26.08.19 18:01, Aneesh Kumar K.V wrote:
>>>> On 8/26/19 9:13 PM, David Hildenbrand wrote:
>>>>> On 26.08.19 16:53, Aneesh Kumar K.V wrote:
>>>>>> David Hildenbrand <david@redhat.com> writes:
>>>>>>
>>>>>>>
>>>>
>>>> ....
>>>>
>>>>>>
>>>>>> I did report a variant of the issue at
>>>>>>
>>>>>> https://lore.kernel.org/linux-mm/20190514025354.9108-1-aneesh.kumar@linux.ibm.com/
>>>>>>
>>>>>> This patch series still doesn't handle the fact that the struct page
>>>>>> backing the start_pfn might not be initialized, i.e., it results in a
>>>>>> crash like the one below
>>>>>
>>>>> Okay, that's a related but different issue I think.
>>>>>
>>>>> I can see that current shrink_zone_span() might read-access the
>>>>> uninitialized struct page of a PFN if
>>>>>
>>>>> 1. The zone has holes and we check for "zone all holes". If we get
>>>>> pfn_valid(pfn), we check if "page_zone(pfn_to_page(pfn)) != zone".
>>>>>
>>>>> 2. Via find_smallest_section_pfn() / find_biggest_section_pfn() find a
>>>>> spanned pfn_valid(). We check
>>>>> - pfn_to_nid(start_pfn) != nid
>>>>> - zone != page_zone(pfn_to_page(start_pfn))
>>>>>
>>>>> So we don't actually use the zone/nid, only use them to sanity-check.
>>>>> That might result in false positives (not that bad).
>>>>>
>>>>> It all boils down to shrink_zone_span() not working only on active
>>>>> memory, for which the PFN is not only valid but also initialized
>>>>> (something for which we need a new section flag I assume).
>>>>>
>>>>> Which access triggers the issue you describe? pfn_to_nid()?
>>>>>
>>>>>>
>>>>>>      pc: c0000000004bc1ec: shrink_zone_span+0x1bc/0x290
>>>>>>      lr: c0000000004bc1e8: shrink_zone_span+0x1b8/0x290
>>>>>>      sp: c0000000dac7f910
>>>>>>     msr: 800000000282b033
>>>>>>    current = 0xc0000000da2fa000
>>>>>>    paca    = 0xc00000000fffb300   irqmask: 0x03   irq_happened: 0x01
>>>>>>      pid   = 1224, comm = ndctl
>>>>>> kernel BUG at /home/kvaneesh/src/linux/include/linux/mm.h:1088!
>>>>>> Linux version 5.3.0-rc6-17495-gc7727d815970-dirty (kvaneesh@ltc-boston123) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #183 SMP Mon Aug 26 09:37:32 CDT 2019
>>>>>> enter ? for help
>>>>>
>>>>> Which exact kernel BUG are you hitting here? (my tree doesn't seem to
>>>>> have any BUG statement around include/linux/mm.h:1088).
>>>>
>>>>
>>>>
>>>> This is against upstream linus with your patches applied.
>>>
>>> I'm
>>>
>>>>
>>>>
>>>> static inline int page_to_nid(const struct page *page)
>>>> {
>>>> 	struct page *p = (struct page *)page;
>>>>
>>>> 	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
>>>> }
>>>>
>>>>
>>>> #define PF_POISONED_CHECK(page) ({					\
>>>> 		VM_BUG_ON_PGFLAGS(PagePoisoned(page), page);		\
>>>> 		page; })
>>>> #
>>>>
>>>>
>>>> It is the node id access.
>>>
>>> Ah, right. A temporary hack would be to assume in these functions
>>> (shrink_zone_span() and friends) that we might have invalid NIDs /
>>> zone numbers and simply skip those PFNs. After all, we're only using
>>> them for finding zone boundaries. Not what we ultimately want, but I
>>> think it might take a while until we have a proper SECTION_ACTIVE.
>>>
>>
>> I am talking about something as hacky as this:
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 8d1c7313ab3f..57ed3dd76a4f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1099,6 +1099,7 @@ static inline int page_zone_id(struct page *page)
>>
>>  #ifdef NODE_NOT_IN_PAGE_FLAGS
>>  extern int page_to_nid(const struct page *page);
>> +#define __page_to_nid page_to_nid
>>  #else
>>  static inline int page_to_nid(const struct page *page)
>>  {
>> @@ -1106,6 +1107,10 @@ static inline int page_to_nid(const struct page
>> *page)
>>
>>  	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
>>  }
>> +static inline int __page_to_nid(const struct page *page)
>> +{
>> +	return ((page)->flags >> NODES_PGSHIFT) & NODES_MASK;
>> +}
>>  #endif
>>
>>  #ifdef CONFIG_NUMA_BALANCING
>> @@ -1249,6 +1254,12 @@ static inline struct zone *page_zone(const struct
>> page *page)
>>  	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
>>  }
>>
>> +static inline struct zone *__page_zone(const struct page *page)
>> +{
>> +	return &NODE_DATA(__page_to_nid(page))->node_zones[page_zonenum(page)];
>> +}
> 
> We don't need that. We can always do an explicit __page_to_nid() check
> and break from the loop before doing a page_zone()? Also, if the struct
> page is poisoned, we should not trust the page_zonenum()?

Assume __page_to_nid() works on a poisoned page. Then we might return a
garbage NID that by chance matches the expected NID. If we then continue
calling page_zone(), we might trigger the same BUG.

"Also, if the struct page is poisoned, we should not trust the
page_zonenum()?"

It's worse: we must not trust these values, ever. In case pages are not
poisoned when adding the memmap, they just contain random garbage.

> 
>> +
>> +
>>  static inline pg_data_t *page_pgdat(const struct page *page)
>>  {
>>  	return NODE_DATA(page_to_nid(page));
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 49ca3364eb70..378b593d1fe1 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -334,10 +334,10 @@ static unsigned long find_smallest_section_pfn(int
>> nid, struct zone *zone,
>>  		if (unlikely(!pfn_valid(start_pfn)))
>>  			continue;
>>
>> -		if (unlikely(pfn_to_nid(start_pfn) != nid))
>> +		/* We might have uninitialized memmaps */
>> +		if (unlikely(__page_to_nid(pfn_to_page(start_pfn)) != nid))
>>  			continue;
> 
> If we are here, we got a non-poisoned struct page. Hence we don't need
> the below change?

No, we can't check if the values are poisoned. We can only assume we
might read garbage.

> 
>> -
>> -		if (zone && zone != page_zone(pfn_to_page(start_pfn)))
>> +		if (zone && zone != __page_zone(pfn_to_page(start_pfn)))
>>  			continue;
>>
>>  		return start_pfn;
>> @@ -359,10 +359,10 @@ static unsigned long find_biggest_section_pfn(int
>> nid, struct zone *zone,
>>  		if (unlikely(!pfn_valid(pfn)))
>>  			continue;
>>
>> -		if (unlikely(pfn_to_nid(pfn) != nid))
>> +		/* We might have uninitialized memmaps */
>> +		if (unlikely(__page_to_nid(pfn_to_page(pfn)) != nid))
>>  			continue;
> 
> same as above
> 
>> -
>> -		if (zone && zone != page_zone(pfn_to_page(pfn)))
>> +		if (zone && zone != __page_zone(pfn_to_page(pfn)))
>>  			continue;
>>
>>  		return pfn;
>> @@ -418,7 +418,10 @@ static void shrink_zone_span(struct zone *zone,
>> unsigned long start_pfn,
>>  		if (unlikely(!pfn_valid(pfn)))
>>  			continue;
>>
>> -		if (page_zone(pfn_to_page(pfn)) != zone)
>> +		/* We might have uninitialized memmaps */
>> +		if (unlikely(__page_to_nid(pfn_to_page(pfn)) != nid))
>> +			continue;
> 
> same as above? 
> 
>> +		if (__page_zone(pfn_to_page(pfn)) != zone)
>>  			continue;
>>
>>  		/* Skip range to be removed */
>> @@ -483,7 +486,8 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
>>  		if (unlikely(!pfn_valid(pfn)))
>>  			continue;
>>
>> -		if (pfn_to_nid(pfn) != nid)
>> +		/* We might have uninitialized memmaps */
>> +		if (unlikely(__page_to_nid(pfn_to_page(pfn)) != nid))
>>  			continue;
>>
>>  		/* Skip range to be removed */
>>
> 
> 
> But I am not sure whether this is the right approach.

It certainly isn't the right approach. It's a temporary hack. (Trying
not to rant:) we did a horrible job moving initialization of the memmap
to the onlining of pages. We are now stuck with something that is
fundamentally broken.

Anyhow, I am looking into a proper solution, but it will need a new
bitmap for each section and some other more invasive changes. So that
might take a while.
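
Purely to illustrate the shape such a bitmap could take (the names
subsection_active_map and pfn_active() are made up here, mirroring the
existing subsection_map / pfn_section_valid() pair in
include/linux/mmzone.h - this is a sketch, not the planned patch):

struct mem_section_usage {
	DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
	/* hypothetical: set once a subsection's memmap is initialized */
	DECLARE_BITMAP(subsection_active_map, SUBSECTIONS_PER_SECTION);
	unsigned long pageblock_flags[0];
};

/* hypothetical helper, analogous to pfn_section_valid() */
static inline int pfn_active(struct mem_section *ms, unsigned long pfn)
{
	int idx = subsection_map_index(pfn);

	return test_bit(idx, ms->usage->subsection_active_map);
}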

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 6/6] mm/memory_hotplug: Pass nid instead of zone to __remove_pages()
  2019-08-26 10:10 ` [PATCH v2 6/6] mm/memory_hotplug: Pass nid instead of zone to __remove_pages() David Hildenbrand
@ 2019-08-27 10:49   ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2019-08-27 10:49 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Catalin Marinas, Will Deacon, Tony Luck, Fenghua Yu,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Yoshinori Sato, Rich Felker, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, x86, Andrew Morton, Mark Rutland, Steve Capper,
	Mike Rapoport, Anshuman Khandual, Yu Zhao, Jun Yao, Michal Hocko,
	Oscar Salvador, Matthew Wilcox (Oracle),
	Christophe Leroy, Aneesh Kumar K.V, Pavel Tatashin,
	Gerald Schaefer, Halil Pasic, Tom Lendacky, Greg Kroah-Hartman,
	Masahiro Yamada, Dan Williams, Wei Yang, Qian Cai,
	Jason Gunthorpe, Logan Gunthorpe, Ira Weiny, linux-arm-kernel,
	linux-ia64, linuxppc-dev, linux-s390, linux-sh

On 26/08/2019 11:10, David Hildenbrand wrote:
> The zone parameter is no longer in use. Replace it with the nid, which
> we can now use to limit the number of zones we have to process
> (via for_each_zone_nid()). The function signature of __remove_pages()
> now looks much more similar to the one of __add_pages().

FWIW I recall this being trivially easy to hit when first playing with 
hotremove development for arm64 - since we only have 3 zones, the page 
flags poison would cause page_zone() to dereference past the end of 
node_zones[] and go all kinds of wrong. This looks like a definite 
improvement in API terms.
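
To spell out that failure mode with illustrative numbers (three zones
need ZONES_WIDTH == 2; the exact page->flags layout is config-dependent):

unsigned long flags = -1ul;	/* poisoned page->flags: all ones */
int zonenum = (flags >> ZONES_PGSHIFT) & ZONES_MASK;	/* == 3 */

/*
 * With only three zones configured, node_zones[] has entries 0..2, so
 * &NODE_DATA(nid)->node_zones[3] points past the end of the array and
 * everything derived from that bogus zone pointer is garbage.
 */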

For arm64,

Acked-by: Robin Murphy <robin.murphy@arm.com>

Cheers,
Robin.

> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Fenghua Yu <fenghua.yu@intel.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@de.ibm.com>
> Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
> Cc: Rich Felker <dalias@libc.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: x86@kernel.org
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Steve Capper <steve.capper@arm.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Jun Yao <yaojun8558363@gmail.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Cc: Christophe Leroy <christophe.leroy@c-s.fr>
> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
> Cc: Halil Pasic <pasic@linux.ibm.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Wei Yang <richard.weiyang@gmail.com>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-ia64@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s390@vger.kernel.org
> Cc: linux-sh@vger.kernel.org
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   arch/arm64/mm/mmu.c            |  4 +---
>   arch/ia64/mm/init.c            |  4 +---
>   arch/powerpc/mm/mem.c          |  3 +--
>   arch/s390/mm/init.c            |  4 +---
>   arch/sh/mm/init.c              |  4 +---
>   arch/x86/mm/init_32.c          |  4 +---
>   arch/x86/mm/init_64.c          |  4 +---
>   include/linux/memory_hotplug.h |  2 +-
>   mm/memory_hotplug.c            | 17 +++++++++--------
>   mm/memremap.c                  |  3 +--
>   10 files changed, 18 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index e67bab4d613e..9a2d388314f3 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1080,7 +1080,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>   {
>   	unsigned long start_pfn = start >> PAGE_SHIFT;
>   	unsigned long nr_pages = size >> PAGE_SHIFT;
> -	struct zone *zone;
>   
>   	/*
>   	 * FIXME: Cleanup page tables (also in arch_add_memory() in case
> @@ -1089,7 +1088,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>   	 * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
>   	 * unlocked yet.
>   	 */
> -	zone = page_zone(pfn_to_page(start_pfn));
> -	__remove_pages(zone, start_pfn, nr_pages, altmap);
> +	__remove_pages(nid, start_pfn, nr_pages, altmap);
>   }
>   #endif
> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> index bf9df2625bc8..ae6a3e718aa0 100644
> --- a/arch/ia64/mm/init.c
> +++ b/arch/ia64/mm/init.c
> @@ -689,9 +689,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>   {
>   	unsigned long start_pfn = start >> PAGE_SHIFT;
>   	unsigned long nr_pages = size >> PAGE_SHIFT;
> -	struct zone *zone;
>   
> -	zone = page_zone(pfn_to_page(start_pfn));
> -	__remove_pages(zone, start_pfn, nr_pages, altmap);
> +	__remove_pages(nid, start_pfn, nr_pages, altmap);
>   }
>   #endif
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 9191a66b3bc5..af21e13529ce 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -130,10 +130,9 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
>   {
>   	unsigned long start_pfn = start >> PAGE_SHIFT;
>   	unsigned long nr_pages = size >> PAGE_SHIFT;
> -	struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap);
>   	int ret;
>   
> -	__remove_pages(page_zone(page), start_pfn, nr_pages, altmap);
> +	__remove_pages(nid, start_pfn, nr_pages, altmap);
>   
>   	/* Remove htab bolted mappings for this section of memory */
>   	start = (unsigned long)__va(start);
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index 20340a03ad90..2a7373ed6ded 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -296,10 +296,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>   {
>   	unsigned long start_pfn = start >> PAGE_SHIFT;
>   	unsigned long nr_pages = size >> PAGE_SHIFT;
> -	struct zone *zone;
>   
> -	zone = page_zone(pfn_to_page(start_pfn));
> -	__remove_pages(zone, start_pfn, nr_pages, altmap);
> +	__remove_pages(nid, start_pfn, nr_pages, altmap);
>   	vmem_remove_mapping(start, size);
>   }
>   #endif /* CONFIG_MEMORY_HOTPLUG */
> diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
> index dfdbaa50946e..32441b59297d 100644
> --- a/arch/sh/mm/init.c
> +++ b/arch/sh/mm/init.c
> @@ -434,9 +434,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>   {
>   	unsigned long start_pfn = PFN_DOWN(start);
>   	unsigned long nr_pages = size >> PAGE_SHIFT;
> -	struct zone *zone;
>   
> -	zone = page_zone(pfn_to_page(start_pfn));
> -	__remove_pages(zone, start_pfn, nr_pages, altmap);
> +	__remove_pages(nid, start_pfn, nr_pages, altmap);
>   }
>   #endif /* CONFIG_MEMORY_HOTPLUG */
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index 4068abb9427f..2760e4bfbc56 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -865,10 +865,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>   {
>   	unsigned long start_pfn = start >> PAGE_SHIFT;
>   	unsigned long nr_pages = size >> PAGE_SHIFT;
> -	struct zone *zone;
>   
> -	zone = page_zone(pfn_to_page(start_pfn));
> -	__remove_pages(zone, start_pfn, nr_pages, altmap);
> +	__remove_pages(nid, start_pfn, nr_pages, altmap);
>   }
>   #endif
>   
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index a6b5c653727b..99d92297f1cf 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1212,10 +1212,8 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
>   {
>   	unsigned long start_pfn = start >> PAGE_SHIFT;
>   	unsigned long nr_pages = size >> PAGE_SHIFT;
> -	struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap);
> -	struct zone *zone = page_zone(page);
>   
> -	__remove_pages(zone, start_pfn, nr_pages, altmap);
> +	__remove_pages(nid, start_pfn, nr_pages, altmap);
>   	kernel_physical_mapping_remove(start, start + size);
>   }
>   #endif /* CONFIG_MEMORY_HOTPLUG */
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index f46ea71b4ffd..c5b38e7dc8aa 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -125,7 +125,7 @@ static inline bool movable_node_is_enabled(void)
>   
>   extern void arch_remove_memory(int nid, u64 start, u64 size,
>   			       struct vmem_altmap *altmap);
> -extern void __remove_pages(struct zone *zone, unsigned long start_pfn,
> +extern void __remove_pages(int nid, unsigned long start_pfn,
>   			   unsigned long nr_pages, struct vmem_altmap *altmap);
>   
>   /* reasonably generic interface to expand the physical pages */
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e88c96cf9d77..49ca3364eb70 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -514,7 +514,7 @@ static void __remove_zone(struct zone *zone, unsigned long start_pfn,
>   	pgdat_resize_unlock(zone->zone_pgdat, &flags);
>   }
>   
> -static void __remove_section(unsigned long pfn, unsigned long nr_pages,
> +static void __remove_section(int nid, unsigned long pfn, unsigned long nr_pages,
>   			     unsigned long map_offset,
>   			     struct vmem_altmap *altmap)
>   {
> @@ -525,14 +525,14 @@ static void __remove_section(unsigned long pfn, unsigned long nr_pages,
>   		return;
>   
>   	/* TODO: move zone handling out of memory removal path */
> -	for_each_zone(zone)
> +	for_each_zone_nid(zone, nid)
>   		__remove_zone(zone, pfn, nr_pages);
>   	sparse_remove_section(ms, pfn, nr_pages, map_offset, altmap);
>   }
>   
>   /**
>    * __remove_pages() - remove sections of pages from a zone
> - * @zone: zone from which pages need to be removed
> + * @nid: the nid all pages were added to
>    * @pfn: starting pageframe (must be aligned to start of a section)
>    * @nr_pages: number of pages to remove (must be multiple of section size)
>    * @altmap: alternative device page map or %NULL if default memmap is used
> @@ -542,12 +542,13 @@ static void __remove_section(unsigned long pfn, unsigned long nr_pages,
>    * sure that pages are marked reserved and zones are adjust properly by
>    * calling offline_pages().
>    */
> -void __remove_pages(struct zone *zone, unsigned long pfn,
> -		    unsigned long nr_pages, struct vmem_altmap *altmap)
> +void __remove_pages(int nid, unsigned long pfn, unsigned long nr_pages,
> +		    struct vmem_altmap *altmap)
>   {
>   	const unsigned long end_pfn = pfn + nr_pages;
>   	unsigned long cur_nr_pages;
>   	unsigned long map_offset = 0;
> +	struct zone *zone;
>   
>   	if (check_pfn_span(pfn, nr_pages, "remove"))
>   		return;
> @@ -555,7 +556,7 @@ void __remove_pages(struct zone *zone, unsigned long pfn,
>   	map_offset = vmem_altmap_offset(altmap);
>   
>   	/* TODO: move zone handling out of memory removal path */
> -	for_each_zone(zone)
> +	for_each_zone_nid(zone, nid)
>   		if (zone_intersects(zone, pfn, nr_pages))
>   			clear_zone_contiguous(zone);
>   
> @@ -563,12 +564,12 @@ void __remove_pages(struct zone *zone, unsigned long pfn,
>   		cond_resched();
>   		/* Select all remaining pages up to the next section boundary */
>   		cur_nr_pages = min(end_pfn - pfn, -(pfn | PAGE_SECTION_MASK));
> -		__remove_section(pfn, cur_nr_pages, map_offset, altmap);
> +		__remove_section(nid, pfn, cur_nr_pages, map_offset, altmap);
>   		map_offset = 0;
>   	}
>   
>   	/* TODO: move zone handling out of memory removal path */
> -	for_each_zone(zone)
> +	for_each_zone_nid(zone, nid)
>   		set_zone_contiguous(zone);
>   }
>   
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 8a394552b5bd..292ef4c6b447 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -138,8 +138,7 @@ static void devm_memremap_pages_release(void *data)
>   	mem_hotplug_begin();
>   	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>   		pfn = PHYS_PFN(res->start);
> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
> -				 PHYS_PFN(resource_size(res)), NULL);
> +		__remove_pages(nid, pfn, PHYS_PFN(resource_size(res)), NULL);
>   	} else {
>   		arch_remove_memory(nid, res->start, resource_size(res),
>   				pgmap_altmap(pgmap));
> 
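
(for_each_zone_nid() is introduced by an earlier patch in this series
that is not quoted above. A sketch of what such an iterator plausibly
looks like - the actual patch may differ:

#define for_each_zone_nid(zone, nid)					\
	for (zone = (NODE_DATA(nid))->node_zones;			\
	     zone < (NODE_DATA(nid))->node_zones + MAX_NR_ZONES;	\
	     zone++)

It restricts the walk to one node's zones instead of every zone of every
node, which is what makes passing the nid down sufficient.)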


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory
  2019-08-27  5:46           ` Aneesh Kumar K.V
  2019-08-27  7:06             ` David Hildenbrand
@ 2019-08-28  9:33             ` David Hildenbrand
  1 sibling, 0 replies; 30+ messages in thread
From: David Hildenbrand @ 2019-08-28  9:33 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-kernel
  Cc: linux-mm, Alexander Duyck, Andrew Morton, Andy Lutomirski,
	Anshuman Khandual, Arun KS, Benjamin Herrenschmidt,
	Borislav Petkov, Catalin Marinas, Christian Borntraeger,
	Christophe Leroy, Dan Williams, Dave Hansen, Fenghua Yu,
	Gerald Schaefer, Greg Kroah-Hartman, Halil Pasic, Heiko Carstens,
	H. Peter Anvin, Ingo Molnar, Ira Weiny, Jason Gunthorpe,
	Johannes Weiner, Jun Yao, Logan Gunthorpe, Mark Rutland,
	Masahiro Yamada, Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Michal Hocko, Mike Rapoport,
	Oscar Salvador, Paul Mackerras, Pavel Tatashin, Pavel Tatashin,
	Peter Zijlstra, Qian Cai, Rich Felker, Robin Murphy,
	Steve Capper, Thomas Gleixner, Tom Lendacky, Tony Luck,
	Vasily Gorbik, Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon,
	Yoshinori Sato, Yu Zhao

On 27.08.19 07:46, Aneesh Kumar K.V wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 26.08.19 18:20, David Hildenbrand wrote:
>>> On 26.08.19 18:01, Aneesh Kumar K.V wrote:
>>>> On 8/26/19 9:13 PM, David Hildenbrand wrote:
>>>>> On 26.08.19 16:53, Aneesh Kumar K.V wrote:
>>>>>> David Hildenbrand <david@redhat.com> writes:
>>>>>>
>>>>>>>
>>>>
>>>> ....
>>>>
>>>>>>
>>>>>> I did report a variant of the issue at
>>>>>>
>>>>>> https://lore.kernel.org/linux-mm/20190514025354.9108-1-aneesh.kumar@linux.ibm.com/
>>>>>>
>>>>>> This patch series still doesn't handle the fact that the struct page
>>>>>> backing the start_pfn might not be initialized, i.e., it results in a
>>>>>> crash like the one below
>>>>>
>>>>> Okay, that's a related but different issue I think.
>>>>>
>>>>> I can see that current shrink_zone_span() might read-access the
>>>>> uninitialized struct page of a PFN if
>>>>>
>>>>> 1. The zone has holes and we check for "zone all holes". If we get
>>>>> pfn_valid(pfn), we check if "page_zone(pfn_to_page(pfn)) != zone".
>>>>>
>>>>> 2. Via find_smallest_section_pfn() / find_biggest_section_pfn() find a
>>>>> spanned pfn_valid(). We check
>>>>> - pfn_to_nid(start_pfn) != nid
>>>>> - zone != page_zone(pfn_to_page(start_pfn))
>>>>>
>>>>> So we don't actually use the zone/nid, only use them to sanity-check.
>>>>> That might result in false positives (not that bad).
>>>>>
>>>>> It all boils down to shrink_zone_span() not working only on active
>>>>> memory, for which the PFN is not only valid but also initialized
>>>>> (something for which we need a new section flag I assume).
>>>>>
>>>>> Which access triggers the issue you describe? pfn_to_nid()?
>>>>>
>>>>>>
>>>>>>      pc: c0000000004bc1ec: shrink_zone_span+0x1bc/0x290
>>>>>>      lr: c0000000004bc1e8: shrink_zone_span+0x1b8/0x290
>>>>>>      sp: c0000000dac7f910
>>>>>>     msr: 800000000282b033
>>>>>>    current = 0xc0000000da2fa000
>>>>>>    paca    = 0xc00000000fffb300   irqmask: 0x03   irq_happened: 0x01
>>>>>>      pid   = 1224, comm = ndctl
>>>>>> kernel BUG at /home/kvaneesh/src/linux/include/linux/mm.h:1088!
>>>>>> Linux version 5.3.0-rc6-17495-gc7727d815970-dirty (kvaneesh@ltc-boston123) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #183 SMP Mon Aug 26 09:37:32 CDT 2019
>>>>>> enter ? for help
>>>>>
>>>>> Which exact kernel BUG are you hitting here? (my tree doesn't seem to
>>>>> have any BUG statement around include/linux/mm.h:1088).
>>>>
>>>>
>>>>
>>>> This is against upstream linus with your patches applied.
>>>
>>> I'm
>>>
>>>>
>>>>
>>>> static inline int page_to_nid(const struct page *page)
>>>> {
>>>> 	struct page *p = (struct page *)page;
>>>>
>>>> 	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
>>>> }
>>>>
>>>>
>>>> #define PF_POISONED_CHECK(page) ({					\
>>>> 		VM_BUG_ON_PGFLAGS(PagePoisoned(page), page);		\
>>>> 		page; })
>>>> #
>>>>
>>>>
>>>> It is the node id access.
>>>
>>> Ah, right. A temporary hack would be to assume in these functions
>>> (shrink_zone_span() and friends) that we might have invalid NIDs /
>>> zone numbers and simply skip those PFNs. After all, we're only using
>>> them for finding zone boundaries. Not what we ultimately want, but I
>>> think it might take a while until we have a proper SECTION_ACTIVE.
>>>
>>
>> I am talking about something as hacky as this:
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 8d1c7313ab3f..57ed3dd76a4f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1099,6 +1099,7 @@ static inline int page_zone_id(struct page *page)
>>
>>  #ifdef NODE_NOT_IN_PAGE_FLAGS
>>  extern int page_to_nid(const struct page *page);
>> +#define __page_to_nid page_to_nid
>>  #else
>>  static inline int page_to_nid(const struct page *page)
>>  {
>> @@ -1106,6 +1107,10 @@ static inline int page_to_nid(const struct page
>> *page)
>>
>>  	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
>>  }
>> +static inline int __page_to_nid(const struct page *page)
>> +{
>> +	return ((page)->flags >> NODES_PGSHIFT) & NODES_MASK;
>> +}
>>  #endif
>>
>>  #ifdef CONFIG_NUMA_BALANCING
>> @@ -1249,6 +1254,12 @@ static inline struct zone *page_zone(const struct
>> page *page)
>>  	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
>>  }
>>
>> +static inline struct zone *__page_zone(const struct page *page)
>> +{
>> +	return &NODE_DATA(__page_to_nid(page))->node_zones[page_zonenum(page)];
>> +}
> 
> We don't need that. We can always do an explicit __page_to_nid() check
> and break from the loop before doing a page_zone()? Also, if the struct
> page is poisoned, we should not trust the page_zonenum()?
> 
>> +
>> +
>>  static inline pg_data_t *page_pgdat(const struct page *page)
>>  {
>>  	return NODE_DATA(page_to_nid(page));
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 49ca3364eb70..378b593d1fe1 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -334,10 +334,10 @@ static unsigned long find_smallest_section_pfn(int
>> nid, struct zone *zone,
>>  		if (unlikely(!pfn_valid(start_pfn)))
>>  			continue;
>>
>> -		if (unlikely(pfn_to_nid(start_pfn) != nid))
>> +		/* We might have uninitialized memmaps */
>> +		if (unlikely(__page_to_nid(pfn_to_page(start_pfn)) != nid))
>>  			continue;
> 
> If we are here, we got a non-poisoned struct page. Hence we don't need
> the below change?
> 
>> -
>> -		if (zone && zone != page_zone(pfn_to_page(start_pfn)))
>> +		if (zone && zone != __page_zone(pfn_to_page(start_pfn)))
>>  			continue;
>>
>>  		return start_pfn;
>> @@ -359,10 +359,10 @@ static unsigned long find_biggest_section_pfn(int
>> nid, struct zone *zone,
>>  		if (unlikely(!pfn_valid(pfn)))
>>  			continue;
>>
>> -		if (unlikely(pfn_to_nid(pfn) != nid))
>> +		/* We might have uninitialized memmaps */
>> +		if (unlikely(__page_to_nid(pfn_to_page(pfn)) != nid))
>>  			continue;
> 
> same as above
> 
>> -
>> -		if (zone && zone != page_zone(pfn_to_page(pfn)))
>> +		if (zone && zone != __page_zone(pfn_to_page(pfn)))
>>  			continue;
>>
>>  		return pfn;
>> @@ -418,7 +418,10 @@ static void shrink_zone_span(struct zone *zone,
>> unsigned long start_pfn,
>>  		if (unlikely(!pfn_valid(pfn)))
>>  			continue;
>>
>> -		if (page_zone(pfn_to_page(pfn)) != zone)
>> +		/* We might have uninitialized memmaps */
>> +		if (unlikely(__page_to_nid(pfn_to_page(pfn)) != nid))
>> +			continue;
> 
> same as above? 
> 
>> +		if (__page_zone(pfn_to_page(pfn)) != zone)
>>  			continue;
>>
>>  		/* Skip range to be removed */
>> @@ -483,7 +486,8 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
>>  		if (unlikely(!pfn_valid(pfn)))
>>  			continue;
>>
>> -		if (pfn_to_nid(pfn) != nid)
>> +		/* We might have uninitialized memmaps */
>> +		if (unlikely(__page_to_nid(pfn_to_page(pfn)) != nid))
>>  			continue;
>>
>>  		/* Skip range to be removed */
>>
> 
> 
> But I am not sure whether this is the right approach.
> 
> -aneesh
> 

So I'm currently preparing and testing an intermediate solution that
will not result in any false positives for !ZONE_DEVICE memory.
ZONE_DEVICE memory is tricky, especially due to subsections - we have no
easy way to check whether a memmap was initialized. For !ZONE_DEVICE
memory we can at least use SECTION_IS_ONLINE. Anyhow, even on false
positives with ZONE_DEVICE memory we should not crash.
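
To illustrate why subsections make this hard (constants from
include/linux/mmzone.h; SECTION_SIZE_BITS shown with its x86-64 value):

#define SECTION_SIZE_BITS	27	/* 128 MiB sections */
#define SUBSECTION_SHIFT	21	/* 2 MiB subsections */
#define SUBSECTIONS_PER_SECTION	(1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))

/*
 * sparse tracks *presence* per subsection (ms->usage->subsection_map),
 * but nothing records per subsection whether the memmap was
 * initialized, so a section-granular flag like SECTION_IS_ONLINE
 * cannot describe a section where only some of the 64 subsections
 * have a valid memmap.
 */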

I think I'll even be able to move zone/node resizing to offlining code -
for !ZONE_DEVICE memory. For ZONE_DEVICE memory it has to stay and we'll
have to think of a way to check if memmaps are valid without false
positives.

Horribly complicated stuff.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory
  2019-08-26 10:10 [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory David Hildenbrand
                   ` (6 preceding siblings ...)
  2019-08-26 14:53 ` [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory Aneesh Kumar K.V
@ 2019-08-29  8:36 ` Michal Hocko
  2019-08-29 11:39   ` David Hildenbrand
  7 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2019-08-29  8:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Alexander Duyck, Andrew Morton,
	Andy Lutomirski, Aneesh Kumar K.V, Anshuman Khandual, Arun KS,
	Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
	Christian Borntraeger, Christophe Leroy, Dan Williams,
	Dave Hansen, Fenghua Yu, Gerald Schaefer, Greg Kroah-Hartman,
	Halil Pasic, Heiko Carstens, H. Peter Anvin, Ingo Molnar,
	Ira Weiny, Jason Gunthorpe, Johannes Weiner, Jun Yao,
	Logan Gunthorpe, Mark Rutland, Masahiro Yamada,
	Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Mike Rapoport, Oscar Salvador,
	Paul Mackerras, Pavel Tatashin, Pavel Tatashin, Peter Zijlstra,
	Qian Cai, Rich Felker, Robin Murphy, Steve Capper,
	Thomas Gleixner, Tom Lendacky, Tony Luck, Vasily Gorbik,
	Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon, Yoshinori Sato,
	Yu Zhao

On Mon 26-08-19 12:10:06, David Hildenbrand wrote:
> Working on virtio-mem, I was able to trigger a kernel BUG (with debug
> options enabled) when removing memory that was never onlined. I was able
> to reproduce with DIMMs. As far as I can see the same can also happen
> without debug configs enabled, if we're unlucky and the uninitialized
> memmap contains selected garbage.

Could you be more specific please?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory
  2019-08-26 14:53 ` [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory Aneesh Kumar K.V
  2019-08-26 15:43   ` David Hildenbrand
@ 2019-08-29  8:38   ` Michal Hocko
  2019-08-29 11:55     ` David Hildenbrand
  1 sibling, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2019-08-29  8:38 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: David Hildenbrand, linux-kernel, linux-mm, Alexander Duyck,
	Andrew Morton, Andy Lutomirski, Anshuman Khandual, Arun KS,
	Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
	Christian Borntraeger, Christophe Leroy, Dan Williams,
	Dave Hansen, Fenghua Yu, Gerald Schaefer, Greg Kroah-Hartman,
	Halil Pasic, Heiko Carstens, H. Peter Anvin, Ingo Molnar,
	Ira Weiny, Jason Gunthorpe, Johannes Weiner, Jun Yao,
	Logan Gunthorpe, Mark Rutland, Masahiro Yamada,
	Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Mike Rapoport, Oscar Salvador,
	Paul Mackerras, Pavel Tatashin, Pavel Tatashin, Peter Zijlstra,
	Qian Cai, Rich Felker, Robin Murphy, Steve Capper,
	Thomas Gleixner, Tom Lendacky, Tony Luck, Vasily Gorbik,
	Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon, Yoshinori Sato,
	Yu Zhao

On Mon 26-08-19 20:23:38, Aneesh Kumar K.V wrote:
[...]
> I did report a variant of the issue at
> 
> https://lore.kernel.org/linux-mm/20190514025354.9108-1-aneesh.kumar@linux.ibm.com/

Is this an instance of the same problem noticed
http://lkml.kernel.org/r/20190725023100.31141-3-t-fukasawa@vx.jp.nec.com
and
http://lkml.kernel.org/r/20190828080006.GG7386@dhcp22.suse.cz
?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory
  2019-08-29  8:36 ` Michal Hocko
@ 2019-08-29 11:39   ` David Hildenbrand
  0 siblings, 0 replies; 30+ messages in thread
From: David Hildenbrand @ 2019-08-29 11:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Alexander Duyck, Andrew Morton,
	Andy Lutomirski, Aneesh Kumar K.V, Anshuman Khandual, Arun KS,
	Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
	Christian Borntraeger, Christophe Leroy, Dan Williams,
	Dave Hansen, Fenghua Yu, Gerald Schaefer, Greg Kroah-Hartman,
	Halil Pasic, Heiko Carstens, H. Peter Anvin, Ingo Molnar,
	Ira Weiny, Jason Gunthorpe, Johannes Weiner, Jun Yao,
	Logan Gunthorpe, Mark Rutland, Masahiro Yamada,
	Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Mike Rapoport, Oscar Salvador,
	Paul Mackerras, Pavel Tatashin, Pavel Tatashin, Peter Zijlstra,
	Qian Cai, Rich Felker, Robin Murphy, Steve Capper,
	Thomas Gleixner, Tom Lendacky, Tony Luck, Vasily Gorbik,
	Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon, Yoshinori Sato,
	Yu Zhao

On 29.08.19 10:36, Michal Hocko wrote:
> On Mon 26-08-19 12:10:06, David Hildenbrand wrote:
>> Working on virtio-mem, I was able to trigger a kernel BUG (with debug
>> options enabled) when removing memory that was never onlined. I was able
>> to reproduce with DIMMs. As far as I can see the same can also happen
>> without debug configs enabled, if we're unlucky and the uninitialized
>> memmap contains selected garbage.
> 
> Could you be more specific please?
> 

There is more detail in the patches. Also see the new series for more
details.

When shrinking zones, we look at all spanned pages so we can shrink as
far as possible - skipping over holes. There, it might happen that we
hit uninitialized memmaps. While "pfn_valid()" is true, the memmap is
not initialized, and doing a pfn_to_nid() or page_zone() will end badly.

For !ZONE_DEVICE we can check SECTION_IS_ONLINE. That is a guarantee
that the whole memmap of the section is valid. For ZONE_DEVICE we don't
have anything similar. As we allow subsection hotplug there (and
therefore only subsections of the memmap are initialized), we might even
step on uninitialized memmaps without being able to detect this.
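
A minimal sketch of the !ZONE_DEVICE filter (pfn_to_online_page()
already wraps the section check; the actual patches may end up looking
different):

/* NULL for holes and for sections that were never onlined - in the
 * latter case the memmap might be uninitialized, so don't touch it. */
struct page *page = pfn_to_online_page(pfn);

if (!page)
	continue;
if (page_zone(page) != zone)
	continue;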

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory
  2019-08-29  8:38   ` Michal Hocko
@ 2019-08-29 11:55     ` David Hildenbrand
  2019-08-29 12:20       ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2019-08-29 11:55 UTC (permalink / raw)
  To: Michal Hocko, Aneesh Kumar K.V
  Cc: linux-kernel, linux-mm, Alexander Duyck, Andrew Morton,
	Andy Lutomirski, Anshuman Khandual, Arun KS,
	Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
	Christian Borntraeger, Christophe Leroy, Dan Williams,
	Dave Hansen, Fenghua Yu, Gerald Schaefer, Greg Kroah-Hartman,
	Halil Pasic, Heiko Carstens, H. Peter Anvin, Ingo Molnar,
	Ira Weiny, Jason Gunthorpe, Johannes Weiner, Jun Yao,
	Logan Gunthorpe, Mark Rutland, Masahiro Yamada,
	Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Mike Rapoport, Oscar Salvador,
	Paul Mackerras, Pavel Tatashin, Pavel Tatashin, Peter Zijlstra,
	Qian Cai, Rich Felker, Robin Murphy, Steve Capper,
	Thomas Gleixner, Tom Lendacky, Tony Luck, Vasily Gorbik,
	Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon, Yoshinori Sato,
	Yu Zhao

On 29.08.19 10:38, Michal Hocko wrote:
> On Mon 26-08-19 20:23:38, Aneesh Kumar K.V wrote:
> [...]
>> I did report a variant of the issue at
>>
>> https://lore.kernel.org/linux-mm/20190514025354.9108-1-aneesh.kumar@linux.ibm.com/
> 
> Is this an instance of the same problem noticed
> http://lkml.kernel.org/r/20190725023100.31141-3-t-fukasawa@vx.jp.nec.com
> and
> http://lkml.kernel.org/r/20190828080006.GG7386@dhcp22.suse.cz
> ?
> 

It is related, as far as I can tell. I think I could also use
pfn_zone_device_reserved() to skip over uninitialized memmaps in the
zone shrinking code.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory
  2019-08-29 11:55     ` David Hildenbrand
@ 2019-08-29 12:20       ` Michal Hocko
  0 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2019-08-29 12:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Aneesh Kumar K.V, linux-kernel, linux-mm, Alexander Duyck,
	Andrew Morton, Andy Lutomirski, Anshuman Khandual, Arun KS,
	Benjamin Herrenschmidt, Borislav Petkov, Catalin Marinas,
	Christian Borntraeger, Christophe Leroy, Dan Williams,
	Dave Hansen, Fenghua Yu, Gerald Schaefer, Greg Kroah-Hartman,
	Halil Pasic, Heiko Carstens, H. Peter Anvin, Ingo Molnar,
	Ira Weiny, Jason Gunthorpe, Johannes Weiner, Jun Yao,
	Logan Gunthorpe, Mark Rutland, Masahiro Yamada,
	Matthew Wilcox (Oracle),
	Mel Gorman, Michael Ellerman, Mike Rapoport, Oscar Salvador,
	Paul Mackerras, Pavel Tatashin, Pavel Tatashin, Peter Zijlstra,
	Qian Cai, Rich Felker, Robin Murphy, Steve Capper,
	Thomas Gleixner, Tom Lendacky, Tony Luck, Vasily Gorbik,
	Vlastimil Babka, Wei Yang, Wei Yang, Will Deacon, Yoshinori Sato,
	Yu Zhao

On Thu 29-08-19 13:55:24, David Hildenbrand wrote:
> On 29.08.19 10:38, Michal Hocko wrote:
> > On Mon 26-08-19 20:23:38, Aneesh Kumar K.V wrote:
> > [...]
> >> I did report a variant of the issue at
> >>
> >> https://lore.kernel.org/linux-mm/20190514025354.9108-1-aneesh.kumar@linux.ibm.com/
> > 
> > Is this an instance of the same problem noticed
> > http://lkml.kernel.org/r/20190725023100.31141-3-t-fukasawa@vx.jp.nec.com
> > and
> > http://lkml.kernel.org/r/20190828080006.GG7386@dhcp22.suse.cz
> > ?
> > 
> 
> It is related, as far as I can tell. I think I could also use
> pfn_zone_device_reserved() to skip over uninitialized memmaps in the
> zone shrinking code.

No, we do not want to special-case this. We should be able to make these
struct pages initialized.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 3/6] mm/memory_hotplug: Process all zones when removing memory
  2019-08-26 10:10 ` [PATCH v2 3/6] mm/memory_hotplug: Process all zones when removing memory David Hildenbrand
@ 2019-08-29 15:39   ` Michal Hocko
  2019-08-29 15:54     ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2019-08-29 15:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Wei Yang

On Mon 26-08-19 12:10:09, David Hildenbrand wrote:
> It is easier than I thought to trigger a kernel bug by removing memory that
> was never onlined. With CONFIG_DEBUG_VM the memmap is initialized with
> garbage, resulting in the detection of a broken zone when removing memory.
> Without CONFIG_DEBUG_VM it is less likely - but we could still have
> garbage in the memmap.
> 
> :/# [   23.912993] BUG: unable to handle page fault for address: 000000000000353d
> [   23.914219] #PF: supervisor write access in kernel mode
> [   23.915199] #PF: error_code(0x0002) - not-present page
> [   23.916160] PGD 0 P4D 0
> [   23.916627] Oops: 0002 [#1] SMP PTI
> [   23.917256] CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
> [   23.918900] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
> [   23.921194] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> [   23.922249] RIP: 0010:clear_zone_contiguous+0x5/0x10
> [   23.923173] Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
> [   23.926876] RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
> [   23.927928] RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
> [   23.929458] RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
> [   23.930899] RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
> [   23.932362] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
> [   23.933603] R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
> [   23.934913] FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
> [   23.936294] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   23.937481] CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
> [   23.938687] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   23.939889] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   23.941168] Call Trace:
> [   23.941580]  __remove_pages+0x4b/0x640
> [   23.942303]  ? mark_held_locks+0x49/0x70
> [   23.943149]  arch_remove_memory+0x63/0x8d
> [   23.943921]  try_remove_memory+0xdb/0x130
> [   23.944766]  ? walk_memory_blocks+0x7f/0x9e
> [   23.945616]  __remove_memory+0xa/0x11
> [   23.946274]  acpi_memory_device_remove+0x70/0x100
> [   23.947308]  acpi_bus_trim+0x55/0x90
> [   23.947914]  acpi_device_hotplug+0x227/0x3a0
> [   23.948714]  acpi_hotplug_work_fn+0x1a/0x30
> [   23.949433]  process_one_work+0x221/0x550
> [   23.950190]  worker_thread+0x50/0x3b0
> [   23.950993]  kthread+0x105/0x140
> [   23.951644]  ? process_one_work+0x550/0x550
> [   23.952508]  ? kthread_park+0x80/0x80
> [   23.953367]  ret_from_fork+0x3a/0x50
> [   23.954025] Modules linked in:
> [   23.954613] CR2: 000000000000353d
> [   23.955248] ---[ end trace 93d982b1fb3e1a69 ]---

Yes, this is indeed nasty. I didn't think of this when separating
memmap initialization from the hotremove. This means that the zone
pointer is garbage in arch_remove_memory already. The proper fix is to
remove it from that level down. Moreover, the zone is only needed for
the shrinking code and the zone contiguous thingy. The latter belongs
to the offlining code unless I am missing something. I can see that you
are removing the zone parameter in a later patch, but wouldn't it be
better to remove the whole zone thing in a single patch and have this
as a bug fix for a rare bug with a Fixes tag?
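
To make that direction concrete, a rough sketch of what removing the
zone from that level down could look like (an assumption about the
eventual shape, not a patch from this series; remove_section_memmap()
is a hypothetical placeholder):

/*
 * Without a zone parameter, hot-remove only tears down sections and
 * never dereferences possibly-uninitialized struct page contents.
 */
void __remove_pages(unsigned long pfn, unsigned long nr_pages,
                    struct vmem_altmap *altmap)
{
        unsigned long start_sec = pfn_to_section_nr(pfn);
        unsigned long end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
        unsigned long nr;

        for (nr = start_sec; nr <= end_sec; nr++)
                remove_section_memmap(nr, altmap);
}
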
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 3/6] mm/memory_hotplug: Process all zones when removing memory
  2019-08-29 15:39   ` Michal Hocko
@ 2019-08-29 15:54     ` David Hildenbrand
  2019-08-29 16:27       ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2019-08-29 15:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Wei Yang

On 29.08.19 17:39, Michal Hocko wrote:
> On Mon 26-08-19 12:10:09, David Hildenbrand wrote:
>> It is easier than I thought to trigger a kernel bug by removing memory that
>> was never onlined. With CONFIG_DEBUG_VM the memmap is initialized with
>> garbage, resulting in the detection of a broken zone when removing memory.
>> Without CONFIG_DEBUG_VM it is less likely - but we could still have
>> garbage in the memmap.
>>
>> :/# [   23.912993] BUG: unable to handle page fault for address: 000000000000353d
>> [   23.914219] #PF: supervisor write access in kernel mode
>> [   23.915199] #PF: error_code(0x0002) - not-present page
>> [   23.916160] PGD 0 P4D 0
>> [   23.916627] Oops: 0002 [#1] SMP PTI
>> [   23.917256] CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
>> [   23.918900] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
>> [   23.921194] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
>> [   23.922249] RIP: 0010:clear_zone_contiguous+0x5/0x10
>> [   23.923173] Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
>> [   23.926876] RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
>> [   23.927928] RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
>> [   23.929458] RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
>> [   23.930899] RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
>> [   23.932362] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
>> [   23.933603] R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
>> [   23.934913] FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
>> [   23.936294] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   23.937481] CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
>> [   23.938687] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [   23.939889] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [   23.941168] Call Trace:
>> [   23.941580]  __remove_pages+0x4b/0x640
>> [   23.942303]  ? mark_held_locks+0x49/0x70
>> [   23.943149]  arch_remove_memory+0x63/0x8d
>> [   23.943921]  try_remove_memory+0xdb/0x130
>> [   23.944766]  ? walk_memory_blocks+0x7f/0x9e
>> [   23.945616]  __remove_memory+0xa/0x11
>> [   23.946274]  acpi_memory_device_remove+0x70/0x100
>> [   23.947308]  acpi_bus_trim+0x55/0x90
>> [   23.947914]  acpi_device_hotplug+0x227/0x3a0
>> [   23.948714]  acpi_hotplug_work_fn+0x1a/0x30
>> [   23.949433]  process_one_work+0x221/0x550
>> [   23.950190]  worker_thread+0x50/0x3b0
>> [   23.950993]  kthread+0x105/0x140
>> [   23.951644]  ? process_one_work+0x550/0x550
>> [   23.952508]  ? kthread_park+0x80/0x80
>> [   23.953367]  ret_from_fork+0x3a/0x50
>> [   23.954025] Modules linked in:
>> [   23.954613] CR2: 000000000000353d
>> [   23.955248] ---[ end trace 93d982b1fb3e1a69 ]---
> 
> Yes, this is indeed nasty. I didn't think of this when separating
> memmap initialization from the hotremove. This means that the zone
> pointer is garbage in arch_remove_memory already. The proper fix is to
> remove it from that level down. Moreover, the zone is only needed for
> the shrinking code and the zone contiguous thingy. The latter belongs
> to the offlining code unless I am missing something. I can see that you
> are removing the zone parameter in a later patch, but wouldn't it be
> better to remove the whole zone thing in a single patch and have this
> as a bug fix for a rare bug with a Fixes tag?
> 

If I remember correctly, this patch already fixed the issue for me,
without the other cleanup (removing the zone parameter). But I might be
wrong.

Anyhow, I'll send a v4 shortly (either this evening or tomorrow), so you
can save yourself some review time and wait for that one :)

I'll try to see if I can attach Fixes tags to selected commits. But if
that makes review harder, I prefer keeping this split (it has been
broken for a long time either way).

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 3/6] mm/memory_hotplug: Process all zones when removing memory
  2019-08-29 15:54     ` David Hildenbrand
@ 2019-08-29 16:27       ` Michal Hocko
  2019-08-29 16:59         ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2019-08-29 16:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Wei Yang

On Thu 29-08-19 17:54:35, David Hildenbrand wrote:
> On 29.08.19 17:39, Michal Hocko wrote:
> > On Mon 26-08-19 12:10:09, David Hildenbrand wrote:
> >> It is easier than I thought to trigger a kernel bug by removing memory that
> >> was never onlined. With CONFIG_DEBUG_VM the memmap is initialized with
> >> garbage, resulting in the detection of a broken zone when removing memory.
> >> Without CONFIG_DEBUG_VM it is less likely - but we could still have
> >> garbage in the memmap.
> >>
> >> :/# [   23.912993] BUG: unable to handle page fault for address: 000000000000353d
> >> [   23.914219] #PF: supervisor write access in kernel mode
> >> [   23.915199] #PF: error_code(0x0002) - not-present page
> >> [   23.916160] PGD 0 P4D 0
> >> [   23.916627] Oops: 0002 [#1] SMP PTI
> >> [   23.917256] CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
> >> [   23.918900] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
> >> [   23.921194] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> >> [   23.922249] RIP: 0010:clear_zone_contiguous+0x5/0x10
> >> [   23.923173] Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
> >> [   23.926876] RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
> >> [   23.927928] RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
> >> [   23.929458] RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
> >> [   23.930899] RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
> >> [   23.932362] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
> >> [   23.933603] R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
> >> [   23.934913] FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
> >> [   23.936294] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [   23.937481] CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
> >> [   23.938687] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> [   23.939889] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >> [   23.941168] Call Trace:
> >> [   23.941580]  __remove_pages+0x4b/0x640
> >> [   23.942303]  ? mark_held_locks+0x49/0x70
> >> [   23.943149]  arch_remove_memory+0x63/0x8d
> >> [   23.943921]  try_remove_memory+0xdb/0x130
> >> [   23.944766]  ? walk_memory_blocks+0x7f/0x9e
> >> [   23.945616]  __remove_memory+0xa/0x11
> >> [   23.946274]  acpi_memory_device_remove+0x70/0x100
> >> [   23.947308]  acpi_bus_trim+0x55/0x90
> >> [   23.947914]  acpi_device_hotplug+0x227/0x3a0
> >> [   23.948714]  acpi_hotplug_work_fn+0x1a/0x30
> >> [   23.949433]  process_one_work+0x221/0x550
> >> [   23.950190]  worker_thread+0x50/0x3b0
> >> [   23.950993]  kthread+0x105/0x140
> >> [   23.951644]  ? process_one_work+0x550/0x550
> >> [   23.952508]  ? kthread_park+0x80/0x80
> >> [   23.953367]  ret_from_fork+0x3a/0x50
> >> [   23.954025] Modules linked in:
> >> [   23.954613] CR2: 000000000000353d
> >> [   23.955248] ---[ end trace 93d982b1fb3e1a69 ]---
> > 
> > Yes, this is indeed nasty. I didn't think of this when separating
> > memmap initialization from the hotremove. This means that the zone
> > pointer is garbage in arch_remove_memory already. The proper fix is to
> > remove it from that level down. Moreover, the zone is only needed for
> > the shrinking code and the zone contiguous thingy. The latter belongs
> > to the offlining code unless I am missing something. I can see that you
> > are removing the zone parameter in a later patch, but wouldn't it be
> > better to remove the whole zone thing in a single patch and have this
> > as a bug fix for a rare bug with a Fixes tag?
> > 
> 
> If I remember correctly, this patch already fixed the issue for me,

That might be the case because nothing else accesses the zone on the
way. But the pointer is simply bogus. Removing it is the proper way to
fix it. And I argue that the zone shouldn't even be necessary:
re-evaluating the contiguous status of the zone is really something for
the offlining phase. Check how we use pfn_to_online_page there.
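
For illustration, a fragment sketching what that offlining-side
re-evaluation could look like (an assumption, not code from this
series; zone, start_pfn and end_pfn are taken from the shrinking code):

/*
 * Only online pfns have a fully initialized memmap;
 * pfn_to_online_page() returns NULL for everything else, so offline
 * or never-onlined ranges are skipped instead of being dereferenced.
 */
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
        struct page *page = pfn_to_online_page(pfn);

        if (!page)
                continue;       /* memmap not trustworthy */
        if (page_zone(page) != zone)
                continue;       /* initialized, but another zone */
        /* pfn is part of this zone and safe to inspect */
}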

> without the other cleanup (removing the zone parameter). But I might be
> wrong.
> 
> Anyhow, I'll send a v4 shortly (either this evening or tomorrow), so you
> can save yourself some review time and wait for that one :)

No rush, really... It seems this is quite an unlikely event, as most
hotplug use cases simply online memory before removing it later on.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 3/6] mm/memory_hotplug: Process all zones when removing memory
  2019-08-29 16:27       ` Michal Hocko
@ 2019-08-29 16:59         ` David Hildenbrand
  2019-08-30  6:01           ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2019-08-29 16:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Wei Yang

On 29.08.19 18:27, Michal Hocko wrote:
> On Thu 29-08-19 17:54:35, David Hildenbrand wrote:
>> On 29.08.19 17:39, Michal Hocko wrote:
>>> On Mon 26-08-19 12:10:09, David Hildenbrand wrote:
>>>> It is easier than I thought to trigger a kernel bug by removing memory that
>>>> was never onlined. With CONFIG_DEBUG_VM the memmap is initialized with
>>>> garbage, resulting in the detection of a broken zone when removing memory.
>>>> Without CONFIG_DEBUG_VM it is less likely - but we could still have
>>>> garbage in the memmap.
>>>>
>>>> :/# [   23.912993] BUG: unable to handle page fault for address: 000000000000353d
>>>> [   23.914219] #PF: supervisor write access in kernel mode
>>>> [   23.915199] #PF: error_code(0x0002) - not-present page
>>>> [   23.916160] PGD 0 P4D 0
>>>> [   23.916627] Oops: 0002 [#1] SMP PTI
>>>> [   23.917256] CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
>>>> [   23.918900] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
>>>> [   23.921194] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
>>>> [   23.922249] RIP: 0010:clear_zone_contiguous+0x5/0x10
>>>> [   23.923173] Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
>>>> [   23.926876] RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
>>>> [   23.927928] RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
>>>> [   23.929458] RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
>>>> [   23.930899] RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
>>>> [   23.932362] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
>>>> [   23.933603] R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
>>>> [   23.934913] FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
>>>> [   23.936294] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [   23.937481] CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
>>>> [   23.938687] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> [   23.939889] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> [   23.941168] Call Trace:
>>>> [   23.941580]  __remove_pages+0x4b/0x640
>>>> [   23.942303]  ? mark_held_locks+0x49/0x70
>>>> [   23.943149]  arch_remove_memory+0x63/0x8d
>>>> [   23.943921]  try_remove_memory+0xdb/0x130
>>>> [   23.944766]  ? walk_memory_blocks+0x7f/0x9e
>>>> [   23.945616]  __remove_memory+0xa/0x11
>>>> [   23.946274]  acpi_memory_device_remove+0x70/0x100
>>>> [   23.947308]  acpi_bus_trim+0x55/0x90
>>>> [   23.947914]  acpi_device_hotplug+0x227/0x3a0
>>>> [   23.948714]  acpi_hotplug_work_fn+0x1a/0x30
>>>> [   23.949433]  process_one_work+0x221/0x550
>>>> [   23.950190]  worker_thread+0x50/0x3b0
>>>> [   23.950993]  kthread+0x105/0x140
>>>> [   23.951644]  ? process_one_work+0x550/0x550
>>>> [   23.952508]  ? kthread_park+0x80/0x80
>>>> [   23.953367]  ret_from_fork+0x3a/0x50
>>>> [   23.954025] Modules linked in:
>>>> [   23.954613] CR2: 000000000000353d
>>>> [   23.955248] ---[ end trace 93d982b1fb3e1a69 ]---
>>>
>>> Yes, this is indeed nasty. I didn't think of this when separating
>>> memmap initialization from the hotremove. This means that the zone
>>> pointer is garbage in arch_remove_memory already. The proper fix is to
>>> remove it from that level down. Moreover, the zone is only needed for
>>> the shrinking code and the zone contiguous thingy. The latter belongs
>>> to the offlining code unless I am missing something. I can see that you
>>> are removing the zone parameter in a later patch, but wouldn't it be
>>> better to remove the whole zone thing in a single patch and have this
>>> as a bug fix for a rare bug with a Fixes tag?
>>>
>>
>> If I remember correctly, this patch already fixed the issue for me,
> 
> That might be the case because nothing else accesses the zone on the
> way. But the pointer is simply bogus. Removing it is the proper way to
> fix it. And I argue that the zone shouldn't even be necessary:
> re-evaluating the contiguous status of the zone is really something for
> the offlining phase. Check how we use pfn_to_online_page there.

Yeah, I'm with you. I think you spotted that patch 6/6 of this series
(and v3) already does exactly that. It's just a matter of rearranging
things.

> 
>> without the other cleanup (removing the zone parameter). But I might be
>> wrong.
>>
>> Anyhow, I'll send a v4 shortly (either this evening or tomorrow), so you
>> can save yourself some review time and wait for that one :)
> 
> No rush, really... It seems this is quite an unlikely event, as most
> hotplug use cases simply online memory before removing it later on.
> 

I can trigger it reliably right now while working on/testing virtio-mem,
so I finally want to clean up this mess :) (it has been on my list for a
long time). I'll try to hunt for the right commit IDs that broke it.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 3/6] mm/memory_hotplug: Process all zones when removing memory
  2019-08-29 16:59         ` David Hildenbrand
@ 2019-08-30  6:01           ` Michal Hocko
  2019-08-30  6:20             ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2019-08-30  6:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Wei Yang

On Thu 29-08-19 18:59:31, David Hildenbrand wrote:
> On 29.08.19 18:27, Michal Hocko wrote:
[...]
> > No rush, really... It seems this is quite an unlikely event, as most
> > hotplug use cases simply online memory before removing it later on.
> > 
> 
> I can trigger it reliably right now while working on/testing virtio-mem,
> so I finally want to clean up this mess :) (it has been on my list for a
> long time). I'll try to hunt for the right commit IDs that broke it.

f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to
zones until online")

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 3/6] mm/memory_hotplug: Process all zones when removing memory
  2019-08-30  6:01           ` Michal Hocko
@ 2019-08-30  6:20             ` David Hildenbrand
  2019-08-30  6:47               ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2019-08-30  6:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Wei Yang

On 30.08.19 08:01, Michal Hocko wrote:
> On Thu 29-08-19 18:59:31, David Hildenbrand wrote:
>> On 29.08.19 18:27, Michal Hocko wrote:
> [...]
>>> No rush, really... It seems this is quite an unlikely event, as most
>>> hotplug use cases simply online memory before removing it later on.
>>>
>>
>> I can trigger it reliably right now while working on/testing virtio-mem,
>> so I finally want to clean up this mess :) (it has been on my list for a
>> long time). I'll try to hunt for the right commit IDs that broke it.
> 
> f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to
> zones until online")
> 

Right, I can see that commit still does a

set_page_node(page, nid);
SetPageReserved(page);

for every added page. But of course the zone_idx might contain garbage,
so for the garbage zone pointer, f1dd2cd13c4b seems to be the right one.
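
Spelled out with the surrounding loop (an approximation for
illustration, not the exact hunk from f1dd2cd13c4b; start_pfn,
nr_pages and nid come from the add-phase context):

for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
        struct page *page = pfn_to_page(pfn);

        set_page_node(page, nid);       /* node bits in page->flags valid */
        SetPageReserved(page);
        /*
         * The zone bits in page->flags are not touched here; they are
         * only set at onlining time, so page_zone() on a never-onlined
         * page decodes leftover garbage.
         */
}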

Regarding shrink_zone_span(), I suspect it was introduced by

d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 3/6] mm/memory_hotplug: Process all zones when removing memory
  2019-08-30  6:20             ` David Hildenbrand
@ 2019-08-30  6:47               ` Michal Hocko
  2019-08-30  7:07                 ` David Hildenbrand
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2019-08-30  6:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Wei Yang

On Fri 30-08-19 08:20:32, David Hildenbrand wrote:
[...]
> Regarding shrink_zone_span(), I suspect it was introduced by
>
> d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")

The zone shrinking code is much older: 815121d2b5cd5. But I do not
think this is really needed for the Fixes tag.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 3/6] mm/memory_hotplug: Process all zones when removing memory
  2019-08-30  6:47               ` Michal Hocko
@ 2019-08-30  7:07                 ` David Hildenbrand
  2019-08-30  8:31                   ` Michal Hocko
  0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2019-08-30  7:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Wei Yang

On 30.08.19 08:47, Michal Hocko wrote:
> On Fri 30-08-19 08:20:32, David Hildenbrand wrote:
> [...]
>> Regarding shrink_zone_span(), I suspect it was introduced by
>>
>> d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
> 
> The zone shrinking code is much older: 815121d2b5cd5. But I do not
> think this is really needed for the Fixes tag.
> 

Yes, it's older, but since d0dc12e86b31 we could run into uninitialized
nids for added memory when trying to shrink the zone.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 3/6] mm/memory_hotplug: Process all zones when removing memory
  2019-08-30  7:07                 ` David Hildenbrand
@ 2019-08-30  8:31                   ` Michal Hocko
  0 siblings, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2019-08-30  8:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Wei Yang

On Fri 30-08-19 09:07:59, David Hildenbrand wrote:
> On 30.08.19 08:47, Michal Hocko wrote:
> > On Fri 30-08-19 08:20:32, David Hildenbrand wrote:
> > [...]
> >> Regarding shrink_zone_span(), I suspect it was introduced by
> >>
> >> d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
> > 
> > The zone shrinking code is much older: 815121d2b5cd5. But I do not
> > think this is really needed for the Fixes tag.
> > 
> 
> Yes, it's older, but since d0dc12e86b31 we could run into uninitialized
> nids for added memory when trying to shrink the zone.

Well, not really. Strictly speaking, pre-d0dc12e86b31 the struct page
would be initialized to zero, so it is initialized, but that doesn't
mean it has meaningful and usable content. The zone index would be 0,
i.e. the lowest zone, which is not something you want to use on the
hotremove path.
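
For reference, page_zone() decodes both the node and the zone index
from page->flags:

static inline struct zone *page_zone(const struct page *page)
{
        return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
}

With an all-zero struct page, page_to_nid() and page_zonenum() both
return 0, so this resolves to node 0's lowest zone: initialized, but
never the zone this memory would actually be onlined to.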

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2019-08-30  8:31 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-26 10:10 [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory David Hildenbrand
2019-08-26 10:10 ` [PATCH v2 1/6] mm/memory_hotplug: Exit early in __remove_pages() on BUGs David Hildenbrand
2019-08-26 10:10 ` [PATCH v2 2/6] mm: Exit early in set_zone_contiguous() if already contiguous David Hildenbrand
2019-08-26 10:10 ` [PATCH v2 3/6] mm/memory_hotplug: Process all zones when removing memory David Hildenbrand
2019-08-29 15:39   ` Michal Hocko
2019-08-29 15:54     ` David Hildenbrand
2019-08-29 16:27       ` Michal Hocko
2019-08-29 16:59         ` David Hildenbrand
2019-08-30  6:01           ` Michal Hocko
2019-08-30  6:20             ` David Hildenbrand
2019-08-30  6:47               ` Michal Hocko
2019-08-30  7:07                 ` David Hildenbrand
2019-08-30  8:31                   ` Michal Hocko
2019-08-26 10:10 ` [PATCH v2 4/6] mm/memory_hotplug: Cleanup __remove_pages() David Hildenbrand
2019-08-26 10:10 ` [PATCH v2 5/6] mm: Introduce for_each_zone_nid() David Hildenbrand
2019-08-26 10:10 ` [PATCH v2 6/6] mm/memory_hotplug: Pass nid instead of zone to __remove_pages() David Hildenbrand
2019-08-27 10:49   ` Robin Murphy
2019-08-26 14:53 ` [PATCH v2 0/6] mm/memory_hotplug: Consider all zones when removing memory Aneesh Kumar K.V
2019-08-26 15:43   ` David Hildenbrand
2019-08-26 16:01     ` Aneesh Kumar K.V
2019-08-26 16:20       ` David Hildenbrand
2019-08-26 16:44         ` David Hildenbrand
2019-08-27  5:46           ` Aneesh Kumar K.V
2019-08-27  7:06             ` David Hildenbrand
2019-08-28  9:33             ` David Hildenbrand
2019-08-29  8:38   ` Michal Hocko
2019-08-29 11:55     ` David Hildenbrand
2019-08-29 12:20       ` Michal Hocko
2019-08-29  8:36 ` Michal Hocko
2019-08-29 11:39   ` David Hildenbrand
