linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Laurent Dufour <ldufour@linux.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	Oscar Salvador <osalvador@suse.de>,
	Michal Hocko <mhocko@suse.com>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Fenghua Yu <fenghua.yu@intel.com>,
	Nathan Lynch <nathanl@linux.ibm.com>,
	Scott Cheloha <cheloha@linux.ibm.com>,
	Tony Luck <tony.luck@intel.com>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 5.4 49/57] mm: dont rely on system state to detect hot-plug operations
Date: Mon,  5 Oct 2020 17:27:01 +0200	[thread overview]
Message-ID: <20201005142112.155378615@linuxfoundation.org> (raw)
In-Reply-To: <20201005142109.796046410@linuxfoundation.org>

From: Laurent Dufour <ldufour@linux.ibm.com>

commit f85086f95fa36194eb0db5cd5c12e56801b98523 upstream.

In register_mem_sect_under_node() the system_state's value is checked to
detect whether the call is made during boot time or during an hot-plug
operation.  Unfortunately, that check against SYSTEM_BOOTING is wrong
because regular memory is registered at SYSTEM_SCHEDULING state.  In
addition, memory hot-plug operation can be triggered at this system
state by the ACPI [1].  So checking against the system state is not
enough.

The consequence is that on system with interleaved node's ranges like this:

 Early memory node ranges
   node   1: [mem 0x0000000000000000-0x000000011fffffff]
   node   2: [mem 0x0000000120000000-0x000000014fffffff]
   node   1: [mem 0x0000000150000000-0x00000001ffffffff]
   node   0: [mem 0x0000000200000000-0x000000048fffffff]
   node   2: [mem 0x0000000490000000-0x00000007ffffffff]

This can be seen on PowerPC LPAR after multiple memory hot-plug and
hot-unplug operations are done.  At the next reboot the node's memory
ranges can be interleaved and since the call to link_mem_sections() is
made in topology_init() while the system is in the SYSTEM_SCHEDULING
state, the node's id is not checked, and the sections registered to
multiple nodes:

  $ ls -l /sys/devices/system/memory/memory21/node*
  total 0
  lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
  lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2

In that case, the system is able to boot but if later one of theses
memory blocks is hot-unplugged and then hot-plugged, the sysfs
inconsistency is detected and this is triggering a BUG_ON():

  kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
  Oops: Exception in kernel mode, sig: 5 [#1]
  LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
  CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
  Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

This patch addresses the root cause by not relying on the system_state
value to detect whether the call is due to a hot-plug operation.  An
extra parameter is added to link_mem_sections() detailing whether the
operation is due to a hot-plug operation.

[1] According to Oscar Salvador, using this qemu command line, ACPI
memory hotplug operations are raised at SYSTEM_SCHEDULING state:

  $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
        -m size=$MEM,slots=255,maxmem=4294967296k  \
        -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
        -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
        -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
        -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
        -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
        -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
        -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
        -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \

Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/base/node.c  |   85 +++++++++++++++++++++++++++++++++------------------
 include/linux/node.h |   11 ++++--
 mm/memory_hotplug.c  |    3 +
 3 files changed, 64 insertions(+), 35 deletions(-)

--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -758,14 +758,36 @@ static int __ref get_nid_for_pfn(unsigne
 	return pfn_to_nid(pfn);
 }
 
+static int do_register_memory_block_under_node(int nid,
+					       struct memory_block *mem_blk)
+{
+	int ret;
+
+	/*
+	 * If this memory block spans multiple nodes, we only indicate
+	 * the last processed node.
+	 */
+	mem_blk->nid = nid;
+
+	ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj,
+				       &mem_blk->dev.kobj,
+				       kobject_name(&mem_blk->dev.kobj));
+	if (ret)
+		return ret;
+
+	return sysfs_create_link_nowarn(&mem_blk->dev.kobj,
+				&node_devices[nid]->dev.kobj,
+				kobject_name(&node_devices[nid]->dev.kobj));
+}
+
 /* register memory section under specified node if it spans that node */
-static int register_mem_sect_under_node(struct memory_block *mem_blk,
-					 void *arg)
+static int register_mem_block_under_node_early(struct memory_block *mem_blk,
+					       void *arg)
 {
 	unsigned long memory_block_pfns = memory_block_size_bytes() / PAGE_SIZE;
 	unsigned long start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
 	unsigned long end_pfn = start_pfn + memory_block_pfns - 1;
-	int ret, nid = *(int *)arg;
+	int nid = *(int *)arg;
 	unsigned long pfn;
 
 	for (pfn = start_pfn; pfn <= end_pfn; pfn++) {
@@ -782,39 +804,34 @@ static int register_mem_sect_under_node(
 		}
 
 		/*
-		 * We need to check if page belongs to nid only for the boot
-		 * case, during hotplug we know that all pages in the memory
-		 * block belong to the same node.
-		 */
-		if (system_state == SYSTEM_BOOTING) {
-			page_nid = get_nid_for_pfn(pfn);
-			if (page_nid < 0)
-				continue;
-			if (page_nid != nid)
-				continue;
-		}
-
-		/*
-		 * If this memory block spans multiple nodes, we only indicate
-		 * the last processed node.
+		 * We need to check if page belongs to nid only at the boot
+		 * case because node's ranges can be interleaved.
 		 */
-		mem_blk->nid = nid;
-
-		ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj,
-					&mem_blk->dev.kobj,
-					kobject_name(&mem_blk->dev.kobj));
-		if (ret)
-			return ret;
+		page_nid = get_nid_for_pfn(pfn);
+		if (page_nid < 0)
+			continue;
+		if (page_nid != nid)
+			continue;
 
-		return sysfs_create_link_nowarn(&mem_blk->dev.kobj,
-				&node_devices[nid]->dev.kobj,
-				kobject_name(&node_devices[nid]->dev.kobj));
+		return do_register_memory_block_under_node(nid, mem_blk);
 	}
 	/* mem section does not span the specified node */
 	return 0;
 }
 
 /*
+ * During hotplug we know that all pages in the memory block belong to the same
+ * node.
+ */
+static int register_mem_block_under_node_hotplug(struct memory_block *mem_blk,
+						 void *arg)
+{
+	int nid = *(int *)arg;
+
+	return do_register_memory_block_under_node(nid, mem_blk);
+}
+
+/*
  * Unregister a memory block device under the node it spans. Memory blocks
  * with multiple nodes cannot be offlined and therefore also never be removed.
  */
@@ -829,11 +846,19 @@ void unregister_memory_block_under_nodes
 			  kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
 }
 
-int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn)
+int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn,
+		      enum meminit_context context)
 {
+	walk_memory_blocks_func_t func;
+
+	if (context == MEMINIT_HOTPLUG)
+		func = register_mem_block_under_node_hotplug;
+	else
+		func = register_mem_block_under_node_early;
+
 	return walk_memory_blocks(PFN_PHYS(start_pfn),
 				  PFN_PHYS(end_pfn - start_pfn), (void *)&nid,
-				  register_mem_sect_under_node);
+				  func);
 }
 
 #ifdef CONFIG_HUGETLBFS
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -99,11 +99,13 @@ extern struct node *node_devices[];
 typedef  void (*node_registration_func_t)(struct node *);
 
 #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_NUMA)
-extern int link_mem_sections(int nid, unsigned long start_pfn,
-			     unsigned long end_pfn);
+int link_mem_sections(int nid, unsigned long start_pfn,
+		      unsigned long end_pfn,
+		      enum meminit_context context);
 #else
 static inline int link_mem_sections(int nid, unsigned long start_pfn,
-				    unsigned long end_pfn)
+				    unsigned long end_pfn,
+				    enum meminit_context context)
 {
 	return 0;
 }
@@ -128,7 +130,8 @@ static inline int register_one_node(int
 		if (error)
 			return error;
 		/* link memory sections under this node */
-		error = link_mem_sections(nid, start_pfn, end_pfn);
+		error = link_mem_sections(nid, start_pfn, end_pfn,
+					  MEMINIT_EARLY);
 	}
 
 	return error;
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1082,7 +1082,8 @@ int __ref add_memory_resource(int nid, s
 	}
 
 	/* link memory sections under this node.*/
-	ret = link_mem_sections(nid, PFN_DOWN(start), PFN_UP(start + size - 1));
+	ret = link_mem_sections(nid, PFN_DOWN(start), PFN_UP(start + size - 1),
+				MEMINIT_HOTPLUG);
 	BUG_ON(ret);
 
 	/* create new memmap entry */



  parent reply	other threads:[~2020-10-05 15:38 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-05 15:26 [PATCH 5.4 00/57] 5.4.70-rc1 review Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 01/57] btrfs: fix filesystem corruption after a device replace Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 02/57] mmc: sdhci: Workaround broken command queuing on Intel GLK based IRBIS models Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 03/57] USB: gadget: f_ncm: Fix NDP16 datagram validation Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 04/57] gpio: siox: explicitly support only threaded irqs Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 05/57] gpio: mockup: fix resource leak in error path Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 06/57] gpio: tc35894: fix up tc35894 interrupt configuration Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 07/57] clk: socfpga: stratix10: fix the divider for the emac_ptp_free_clk Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 08/57] vsock/virtio: add transport parameter to the virtio_transport_reset_no_sock() Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 09/57] net: virtio_vsock: Enhance connection semantics Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 10/57] xfs: trim IO to found COW extent limit Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 11/57] Input: i8042 - add nopnp quirk for Acer Aspire 5 A515 Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 12/57] iio: adc: qcom-spmi-adc5: fix driver name Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 13/57] ftrace: Move RCU is watching check after recursion check Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 14/57] memstick: Skip allocating card when removing host Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 15/57] drm/amdgpu: restore proper ref count in amdgpu_display_crtc_set_config Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 16/57] clocksource/drivers/timer-gx6605s: Fixup counter reload Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 17/57] libbpf: Remove arch-specific include path in Makefile Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 18/57] drivers/net/wan/hdlc_fr: Add needed_headroom for PVC devices Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 19/57] drm/sun4i: mixer: Extend regmap max_register Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 20/57] net: dec: de2104x: Increase receive ring size for Tulip Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 21/57] rndis_host: increase sleep time in the query-response loop Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 22/57] nvme-core: get/put ctrl and transport module in nvme_dev_open/release() Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 23/57] fuse: fix the ->direct_IO() treatment of iov_iter Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 24/57] drivers/net/wan/lapbether: Make skb->protocol consistent with the header Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 25/57] drivers/net/wan/hdlc: Set skb->protocol before transmitting Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 26/57] mac80211: Fix radiotap header channel flag for 6GHz band Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 27/57] mac80211: do not allow bigger VHT MPDUs than the hardware supports Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 28/57] tracing: Make the space reserved for the pid wider Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 29/57] tools/io_uring: fix compile breakage Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 30/57] spi: fsl-espi: Only process interrupts for expected events Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 31/57] nvme-pci: fix NULL req in completion handler Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 32/57] nvme-fc: fail new connections to a deleted host or remote port Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 33/57] gpio: sprd: Clear interrupt when setting the type as edge Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 34/57] phy: ti: am654: Fix a leak in serdes_am654_probe() Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 35/57] pinctrl: mvebu: Fix i2c sda definition for 98DX3236 Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 36/57] nfs: Fix security label length not being reset Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 37/57] clk: tegra: Always program PLL_E when enabled Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 38/57] clk: samsung: exynos4: mark chipid clock as CLK_IGNORE_UNUSED Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 39/57] iommu/exynos: add missing put_device() call in exynos_iommu_of_xlate() Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 40/57] gpio/aspeed-sgpio: enable access to all 80 input & output sgpios Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 41/57] gpio/aspeed-sgpio: dont enable all interrupts by default Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 42/57] gpio: aspeed: fix ast2600 bank properties Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 43/57] i2c: cpm: Fix i2c_ram structure Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 44/57] Input: trackpoint - enable Synaptics trackpoints Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 45/57] scripts/dtc: only append to HOST_EXTRACFLAGS instead of overwriting Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 46/57] random32: Restore __latent_entropy attribute on net_rand_state Greg Kroah-Hartman
2020-10-05 15:26 ` [PATCH 5.4 47/57] block/diskstats: more accurate approximation of io_ticks for slow disks Greg Kroah-Hartman
2020-10-05 15:27 ` [PATCH 5.4 48/57] mm: replace memmap_context by meminit_context Greg Kroah-Hartman
2020-10-05 15:27 ` Greg Kroah-Hartman [this message]
2020-10-05 15:27 ` [PATCH 5.4 50/57] nvme: Cleanup and rename nvme_block_nr() Greg Kroah-Hartman
2020-10-05 15:27 ` [PATCH 5.4 51/57] nvme: Introduce nvme_lba_to_sect() Greg Kroah-Hartman
2020-10-05 15:27 ` [PATCH 5.4 52/57] nvme: consolidate chunk_sectors settings Greg Kroah-Hartman
2020-10-05 15:27 ` [PATCH 5.4 53/57] epoll: do not insert into poll queues until all sanity checks are done Greg Kroah-Hartman
2020-10-05 15:27 ` [PATCH 5.4 54/57] epoll: replace ->visited/visited_list with generation count Greg Kroah-Hartman
2020-10-05 15:27 ` [PATCH 5.4 55/57] epoll: EPOLL_CTL_ADD: close the race in decision to take fast path Greg Kroah-Hartman
2020-10-05 15:27 ` [PATCH 5.4 56/57] ep_create_wakeup_source(): dentry name can change under you Greg Kroah-Hartman
2020-10-05 15:27 ` [PATCH 5.4 57/57] netfilter: ctnetlink: add a range check for l3/l4 protonum Greg Kroah-Hartman
2020-10-06  0:22 ` [PATCH 5.4 00/57] 5.4.70-rc1 review Shuah Khan
2020-10-06  5:54 ` Naresh Kamboju
2020-10-06  8:25   ` Naresh Kamboju
2020-10-06 14:49     ` Ben Hutchings
2020-10-06 18:17 ` Guenter Roeck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201005142112.155378615@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=akpm@linux-foundation.org \
    --cc=cheloha@linux.ibm.com \
    --cc=david@redhat.com \
    --cc=fenghua.yu@intel.com \
    --cc=ldufour@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mhocko@suse.com \
    --cc=nathanl@linux.ibm.com \
    --cc=osalvador@suse.de \
    --cc=rafael@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tony.luck@intel.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).