Linux-NVDIMM Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v4 0/6] Memory Hierarchy: Enable target node lookups for reserved memory
@ 2020-01-22  3:04 Dan Williams
  2020-01-22  3:04 ` [PATCH v4 1/6] ACPI: NUMA: Up-level "map to online node" functionality Dan Williams
                   ` (7 more replies)
  0 siblings, 8 replies; 25+ messages in thread
From: Dan Williams @ 2020-01-22  3:04 UTC (permalink / raw)
  To: tglx, mingo
  Cc: David Hildenbrand, Borislav Petkov, Rafael J. Wysocki,
	H. Peter Anvin, Aneesh Kumar K.V, kbuild test robot,
	Andrew Morton, Peter Zijlstra, Benjamin Herrenschmidt,
	Michal Hocko, Paul Mackerras, Christoph Hellwig, Dave Hansen,
	Michael Ellerman, x86, Andy Lutomirski, linux-kernel,
	linux-nvdimm

Changes since v3 [1]:
- Cleanup numa_map_to_online_node() to remove redundant "if
  (!node_online(node))" (Aneesh)

[1]: http://lore.kernel.org/r/157954696789.2239526.17707265517154476652.stgit@dwillia2-desk3.amr.corp.intel.com

---

Merge notes:

x86 folks: This has an ack from Rafael for ACPI, and Michael for Power.
With an x86 ack I plan to take this through the libnvdimm tree provided
the x86 touches look ok to you.

---

Cover:

Arrange for platform numa info to be preserved for determining
'target_node' data. Where a 'target_node' is the node a reserved memory
range will become when it is onlined.

This new infrastructure is expected to be more valuable over time for
Memory Tiers / Hierarchy management as more platforms (via the ACPI HMAT
and EFI Specific Purpose Memory) publish reserved or "soft-reserved"
ranges to Linux. Linux system administrators will expect to be able to
interact with those ranges with a unique numa node number when/if that
memory is onlined via the dax_kmem driver [2].

One configuration that currently fails to properly convey the target
node for the resulting memory hotplug operation is persistent memory
defined by the memmap=nn!ss parameter. For example, today if node1 is a
memory only node, and all the memory from node1 is specified to
memmap=nn!ss and subsequently onlined, it will end up being onlined as
node0 memory. As it stands, memory_add_physaddr_to_nid() can only
identify online nodes and since node1 in this example has no online cpus
/ memory the target node is initialized node0.

The fix is to preserve rather than discard the numa_meminfo entries that
are relevant for reserved memory ranges, and to uplevel the node
distance helper for determining the "local" (closest) node relative to
an initiator node.

[2]: https://pmem.io/ndctl/daxctl-reconfigure-device.html

---

Dan Williams (6):
      ACPI: NUMA: Up-level "map to online node" functionality
      mm/numa: Skip NUMA_NO_NODE and online nodes in numa_map_to_online_node()
      powerpc/papr_scm: Switch to numa_map_to_online_node()
      x86/mm: Introduce CONFIG_KEEP_NUMA
      x86/numa: Provide a range-to-target_node lookup facility
      libnvdimm/e820: Retrieve and populate correct 'target_node' info


 arch/powerpc/platforms/pseries/papr_scm.c |   21 --------
 arch/x86/Kconfig                          |    1 
 arch/x86/mm/numa.c                        |   74 +++++++++++++++++++++++------
 drivers/acpi/numa/srat.c                  |   41 ----------------
 drivers/nvdimm/e820.c                     |   18 ++-----
 include/linux/acpi.h                      |   23 +++++++++
 include/linux/numa.h                      |   23 +++++++++
 mm/Kconfig                                |    5 ++
 mm/mempolicy.c                            |   31 ++++++++++++
 9 files changed, 145 insertions(+), 92 deletions(-)
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 1/6] ACPI: NUMA: Up-level "map to online node" functionality
  2020-01-22  3:04 [PATCH v4 0/6] Memory Hierarchy: Enable target node lookups for reserved memory Dan Williams
@ 2020-01-22  3:04 ` Dan Williams
  2020-01-22  3:04 ` [PATCH v4 2/6] mm/numa: Skip NUMA_NO_NODE and online nodes in numa_map_to_online_node() Dan Williams
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2020-01-22  3:04 UTC (permalink / raw)
  To: tglx, mingo
  Cc: Michal Hocko, Rafael J. Wysocki, peterz, dave.hansen, hch,
	linux-kernel, linux-nvdimm, x86

The acpi_map_pxm_to_online_node() helper is used to find the closest
online node to a given proximity domain. This is used to map devices in
a proximity domain with no online memory or cpus to the closest online
node and populate a device's 'numa_node' property. The numa_node
property allows applications to be migrated "close" to a resource.

In preparation for providing a generic facility to optionally map an
address range to its closest online node, or the node the range would
represent were it to be onlined (target_node), up-level the core of
acpi_map_pxm_to_online_node() to a generic mm/numa helper.

Cc: Michal Hocko <mhocko@suse.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/numa/srat.c |   41 -----------------------------------------
 include/linux/acpi.h     |   23 ++++++++++++++++++++++-
 include/linux/numa.h     |    9 +++++++++
 mm/mempolicy.c           |   30 ++++++++++++++++++++++++++++++
 4 files changed, 61 insertions(+), 42 deletions(-)

diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index eadbf90e65d1..47b4969d9b93 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -72,47 +72,6 @@ int acpi_map_pxm_to_node(int pxm)
 }
 EXPORT_SYMBOL(acpi_map_pxm_to_node);
 
-/**
- * acpi_map_pxm_to_online_node - Map proximity ID to online node
- * @pxm: ACPI proximity ID
- *
- * This is similar to acpi_map_pxm_to_node(), but always returns an online
- * node.  When the mapped node from a given proximity ID is offline, it
- * looks up the node distance table and returns the nearest online node.
- *
- * ACPI device drivers, which are called after the NUMA initialization has
- * completed in the kernel, can call this interface to obtain their device
- * NUMA topology from ACPI tables.  Such drivers do not have to deal with
- * offline nodes.  A node may be offline when a device proximity ID is
- * unique, SRAT memory entry does not exist, or NUMA is disabled, ex.
- * "numa=off" on x86.
- */
-int acpi_map_pxm_to_online_node(int pxm)
-{
-	int node, min_node;
-
-	node = acpi_map_pxm_to_node(pxm);
-
-	if (node == NUMA_NO_NODE)
-		node = 0;
-
-	min_node = node;
-	if (!node_online(node)) {
-		int min_dist = INT_MAX, dist, n;
-
-		for_each_online_node(n) {
-			dist = node_distance(node, n);
-			if (dist < min_dist) {
-				min_dist = dist;
-				min_node = n;
-			}
-		}
-	}
-
-	return min_node;
-}
-EXPORT_SYMBOL(acpi_map_pxm_to_online_node);
-
 static void __init
 acpi_table_print_srat_entry(struct acpi_subtable_header *header)
 {
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 0f37a7d5fa77..69b73ecfbee4 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -401,9 +401,30 @@ extern void acpi_osi_setup(char *str);
 extern bool acpi_osi_is_win8(void);
 
 #ifdef CONFIG_ACPI_NUMA
-int acpi_map_pxm_to_online_node(int pxm);
 int acpi_map_pxm_to_node(int pxm);
 int acpi_get_node(acpi_handle handle);
+
+/**
+ * acpi_map_pxm_to_online_node - Map proximity ID to online node
+ * @pxm: ACPI proximity ID
+ *
+ * This is similar to acpi_map_pxm_to_node(), but always returns an online
+ * node.  When the mapped node from a given proximity ID is offline, it
+ * looks up the node distance table and returns the nearest online node.
+ *
+ * ACPI device drivers, which are called after the NUMA initialization has
+ * completed in the kernel, can call this interface to obtain their device
+ * NUMA topology from ACPI tables.  Such drivers do not have to deal with
+ * offline nodes.  A node may be offline when a device proximity ID is
+ * unique, SRAT memory entry does not exist, or NUMA is disabled, ex.
+ * "numa=off" on x86.
+ */
+static inline int acpi_map_pxm_to_online_node(int pxm)
+{
+	int node = acpi_map_pxm_to_node(pxm);
+
+	return numa_map_to_online_node(node);
+}
 #else
 static inline int acpi_map_pxm_to_online_node(int pxm)
 {
diff --git a/include/linux/numa.h b/include/linux/numa.h
index 110b0e5d0fb0..20f4e44b186c 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -13,4 +13,13 @@
 
 #define	NUMA_NO_NODE	(-1)
 
+#ifdef CONFIG_NUMA
+int numa_map_to_online_node(int node);
+#else
+static inline int numa_map_to_online_node(int node)
+{
+	return NUMA_NO_NODE;
+}
+#endif
+
 #endif /* _LINUX_NUMA_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 067cf7d3daf5..4cff069279f6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -127,6 +127,36 @@ static struct mempolicy default_policy = {
 
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
+/**
+ * numa_map_to_online_node - Find closest online node
+ * @nid: Node id to start the search
+ *
+ * Lookup the next closest node by distance if @nid is not online.
+ */
+int numa_map_to_online_node(int node)
+{
+	int min_node;
+
+	if (node == NUMA_NO_NODE)
+		node = 0;
+
+	min_node = node;
+	if (!node_online(node)) {
+		int min_dist = INT_MAX, dist, n;
+
+		for_each_online_node(n) {
+			dist = node_distance(node, n);
+			if (dist < min_dist) {
+				min_dist = dist;
+				min_node = n;
+			}
+		}
+	}
+
+	return min_node;
+}
+EXPORT_SYMBOL_GPL(numa_map_to_online_node);
+
 struct mempolicy *get_task_policy(struct task_struct *p)
 {
 	struct mempolicy *pol = p->mempolicy;
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 2/6] mm/numa: Skip NUMA_NO_NODE and online nodes in numa_map_to_online_node()
  2020-01-22  3:04 [PATCH v4 0/6] Memory Hierarchy: Enable target node lookups for reserved memory Dan Williams
  2020-01-22  3:04 ` [PATCH v4 1/6] ACPI: NUMA: Up-level "map to online node" functionality Dan Williams
@ 2020-01-22  3:04 ` Dan Williams
  2020-01-22  3:04 ` [PATCH v4 3/6] powerpc/papr_scm: Switch to numa_map_to_online_node() Dan Williams
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2020-01-22  3:04 UTC (permalink / raw)
  To: tglx, mingo
  Cc: Aneesh Kumar K.V, peterz, dave.hansen, hch, linux-kernel,
	linux-nvdimm, x86

Update numa_map_to_online_node() to stop falling back to numa node 0
when the input is NUMA_NO_NODE. Also, skip the lookup if @node is
online. This makes the routine compatible with other arch node mapping
routines.

Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/157401275716.43284.13185549705765009174.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/mempolicy.c |   20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4cff069279f6..bcb012645809 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -135,21 +135,17 @@ static struct mempolicy preferred_node_policy[MAX_NUMNODES];
  */
 int numa_map_to_online_node(int node)
 {
-	int min_node;
+	int min_dist = INT_MAX, dist, n, min_node;
 
-	if (node == NUMA_NO_NODE)
-		node = 0;
+	if (node == NUMA_NO_NODE || node_online(node))
+		return node;
 
 	min_node = node;
-	if (!node_online(node)) {
-		int min_dist = INT_MAX, dist, n;
-
-		for_each_online_node(n) {
-			dist = node_distance(node, n);
-			if (dist < min_dist) {
-				min_dist = dist;
-				min_node = n;
-			}
+	for_each_online_node(n) {
+		dist = node_distance(node, n);
+		if (dist < min_dist) {
+			min_dist = dist;
+			min_node = n;
 		}
 	}
 
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 3/6] powerpc/papr_scm: Switch to numa_map_to_online_node()
  2020-01-22  3:04 [PATCH v4 0/6] Memory Hierarchy: Enable target node lookups for reserved memory Dan Williams
  2020-01-22  3:04 ` [PATCH v4 1/6] ACPI: NUMA: Up-level "map to online node" functionality Dan Williams
  2020-01-22  3:04 ` [PATCH v4 2/6] mm/numa: Skip NUMA_NO_NODE and online nodes in numa_map_to_online_node() Dan Williams
@ 2020-01-22  3:04 ` Dan Williams
  2020-01-22  3:04 ` [PATCH v4 4/6] x86/mm: Introduce CONFIG_KEEP_NUMA Dan Williams
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2020-01-22  3:04 UTC (permalink / raw)
  To: tglx, mingo
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Aneesh Kumar K.V, peterz, dave.hansen, hch, linux-kernel,
	linux-nvdimm, x86

Now that the core exports numa_map_to_online_node() switch to that
instead of the locally coded duplicate.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/157401276263.43284.12616818803654229788.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/platforms/pseries/papr_scm.c |   21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index c2ef320ba1bf..057ed703e882 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -284,25 +284,6 @@ int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm,
 	return 0;
 }
 
-static inline int papr_scm_node(int node)
-{
-	int min_dist = INT_MAX, dist;
-	int nid, min_node;
-
-	if ((node == NUMA_NO_NODE) || node_online(node))
-		return node;
-
-	min_node = first_online_node;
-	for_each_online_node(nid) {
-		dist = node_distance(node, nid);
-		if (dist < min_dist) {
-			min_dist = dist;
-			min_node = nid;
-		}
-	}
-	return min_node;
-}
-
 static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 {
 	struct device *dev = &p->pdev->dev;
@@ -347,7 +328,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 
 	memset(&ndr_desc, 0, sizeof(ndr_desc));
 	target_nid = dev_to_node(&p->pdev->dev);
-	online_nid = papr_scm_node(target_nid);
+	online_nid = numa_map_to_online_node(target_nid);
 	ndr_desc.numa_node = online_nid;
 	ndr_desc.target_node = target_nid;
 	ndr_desc.res = &p->res;
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 4/6] x86/mm: Introduce CONFIG_KEEP_NUMA
  2020-01-22  3:04 [PATCH v4 0/6] Memory Hierarchy: Enable target node lookups for reserved memory Dan Williams
                   ` (2 preceding siblings ...)
  2020-01-22  3:04 ` [PATCH v4 3/6] powerpc/papr_scm: Switch to numa_map_to_online_node() Dan Williams
@ 2020-01-22  3:04 ` Dan Williams
  2020-02-13  9:32   ` Ingo Molnar
  2020-02-13 11:22   ` Thomas Gleixner
  2020-01-22  3:05 ` [PATCH v4 5/6] x86/numa: Provide a range-to-target_node lookup facility Dan Williams
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 25+ messages in thread
From: Dan Williams @ 2020-01-22  3:04 UTC (permalink / raw)
  To: tglx, mingo
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, x86, Andrew Morton, David Hildenbrand,
	Michal Hocko, hch, linux-kernel, linux-nvdimm

Currently x86 numa_meminfo is marked __initdata in the
CONFIG_MEMORY_HOTPLUG=n case. In support of a new facility to allow
drivers to map reserved memory to a 'target_node'
(phys_to_target_node()), add support for removing the __initdata
designation for those users. Both memory hotplug and
phys_to_target_node() users select CONFIG_KEEP_NUMA to tell the arch to
maintain its physical address to numa mapping infrastructure post init.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/mm/numa.c   |    6 +-----
 include/linux/numa.h |    6 ++++++
 mm/Kconfig           |    5 +++++
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 99f7a68738f0..5289d9d6799a 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -25,11 +25,7 @@ nodemask_t numa_nodes_parsed __initdata;
 struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
 
-static struct numa_meminfo numa_meminfo
-#ifndef CONFIG_MEMORY_HOTPLUG
-__initdata
-#endif
-;
+static struct numa_meminfo numa_meminfo __initdata_numa;
 
 static int numa_distance_cnt;
 static u8 *numa_distance;
diff --git a/include/linux/numa.h b/include/linux/numa.h
index 20f4e44b186c..c005ed6b807b 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -13,6 +13,12 @@
 
 #define	NUMA_NO_NODE	(-1)
 
+#ifdef CONFIG_KEEP_NUMA
+#define __initdata_numa
+#else
+#define __initdata_numa __initdata
+#endif
+
 #ifdef CONFIG_NUMA
 int numa_map_to_online_node(int node);
 #else
diff --git a/mm/Kconfig b/mm/Kconfig
index ab80933be65f..001f1185eadf 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -139,6 +139,10 @@ config HAVE_FAST_GUP
 config ARCH_KEEP_MEMBLOCK
 	bool
 
+# Keep arch numa mapping infrastructure post-init.
+config KEEP_NUMA
+	bool
+
 config MEMORY_ISOLATION
 	bool
 
@@ -154,6 +158,7 @@ config MEMORY_HOTPLUG
 	bool "Allow for memory hot-add"
 	depends on SPARSEMEM || X86_64_ACPI_NUMA
 	depends on ARCH_ENABLE_MEMORY_HOTPLUG
+	select KEEP_NUMA if NUMA
 
 config MEMORY_HOTPLUG_SPARSE
 	def_bool y
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 5/6] x86/numa: Provide a range-to-target_node lookup facility
  2020-01-22  3:04 [PATCH v4 0/6] Memory Hierarchy: Enable target node lookups for reserved memory Dan Williams
                   ` (3 preceding siblings ...)
  2020-01-22  3:04 ` [PATCH v4 4/6] x86/mm: Introduce CONFIG_KEEP_NUMA Dan Williams
@ 2020-01-22  3:05 ` Dan Williams
  2020-02-13  9:36   ` Ingo Molnar
  2020-02-13 11:37   ` Thomas Gleixner
  2020-01-22  3:05 ` [PATCH v4 6/6] libnvdimm/e820: Retrieve and populate correct 'target_node' info Dan Williams
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 25+ messages in thread
From: Dan Williams @ 2020-01-22  3:05 UTC (permalink / raw)
  To: tglx, mingo
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, x86, Andrew Morton, David Hildenbrand,
	Michal Hocko, kbuild test robot, hch, linux-kernel, linux-nvdimm

The DEV_DAX_KMEM facility is a generic mechanism to allow device-dax
instances, fronting performance-differentiated-memory like pmem, to be
added to the System RAM pool. The numa node for that hot-added memory is
derived from the device-dax instance's 'target_node' attribute.

Recall that the 'target_node' is the ACPI-PXM-to-node translation for
memory when it comes online whereas the 'numa_node' attribute of the
device represents the closest online cpu node.

Presently useful target_node information from the ACPI SRAT is discarded
with the expectation that "Reserved" memory will never be onlined. Now,
DEV_DAX_KMEM violates that assumption, there is a need to retain the
translation. Move, rather than discard, numa_memblk data to a secondary
array that memory_add_physaddr_to_target_node() may consider at a later
point in time.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/mm/numa.c   |   68 +++++++++++++++++++++++++++++++++++++++++++-------
 include/linux/numa.h |    8 +++++-
 mm/mempolicy.c       |    5 ++++
 3 files changed, 70 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 5289d9d6799a..f2c8fca36f28 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -26,6 +26,7 @@ struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
 
 static struct numa_meminfo numa_meminfo __initdata_numa;
+static struct numa_meminfo numa_reserved_meminfo __initdata_numa;
 
 static int numa_distance_cnt;
 static u8 *numa_distance;
@@ -164,6 +165,26 @@ void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
 		(mi->nr_blks - idx) * sizeof(mi->blk[0]));
 }
 
+/**
+ * numa_move_memblk - Move one numa_memblk from one numa_meminfo to another
+ * @dst: numa_meminfo to move block to
+ * @idx: Index of memblk to remove
+ * @src: numa_meminfo to remove memblk from
+ *
+ * If @dst is non-NULL add it at the @dst->nr_blks index and increment
+ * @dst->nr_blks, then remove it from @src.
+ */
+static void __init numa_move_memblk(struct numa_meminfo *dst, int idx,
+		struct numa_meminfo *src)
+{
+	if (dst) {
+		memcpy(&dst->blk[dst->nr_blks], &src->blk[idx],
+				sizeof(struct numa_memblk));
+		dst->nr_blks++;
+	}
+	numa_remove_memblk_from(idx, src);
+}
+
 /**
  * numa_add_memblk - Add one numa_memblk to numa_meminfo
  * @nid: NUMA node ID of the new memblk
@@ -233,14 +254,19 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
 	for (i = 0; i < mi->nr_blks; i++) {
 		struct numa_memblk *bi = &mi->blk[i];
 
-		/* make sure all blocks are inside the limits */
+		/* move / save reserved memory ranges */
+		if (!memblock_overlaps_region(&memblock.memory,
+					bi->start, bi->end - bi->start)) {
+			numa_move_memblk(&numa_reserved_meminfo, i--, mi);
+			continue;
+		}
+
+		/* make sure all non-reserved blocks are inside the limits */
 		bi->start = max(bi->start, low);
 		bi->end = min(bi->end, high);
 
-		/* and there's no empty or non-exist block */
-		if (bi->start >= bi->end ||
-		    !memblock_overlaps_region(&memblock.memory,
-			bi->start, bi->end - bi->start))
+		/* and there's no empty block */
+		if (bi->start >= bi->end)
 			numa_remove_memblk_from(i--, mi);
 	}
 
@@ -877,16 +903,38 @@ EXPORT_SYMBOL(cpumask_of_node);
 
 #endif	/* !CONFIG_DEBUG_PER_CPU_MAPS */
 
-#ifdef CONFIG_MEMORY_HOTPLUG
-int memory_add_physaddr_to_nid(u64 start)
+#ifdef CONFIG_KEEP_NUMA
+static int meminfo_to_nid(struct numa_meminfo *mi, u64 start)
 {
-	struct numa_meminfo *mi = &numa_meminfo;
-	int nid = mi->blk[0].nid;
 	int i;
 
 	for (i = 0; i < mi->nr_blks; i++)
 		if (mi->blk[i].start <= start && mi->blk[i].end > start)
-			nid = mi->blk[i].nid;
+			return mi->blk[i].nid;
+	return NUMA_NO_NODE;
+}
+
+int phys_to_target_node(phys_addr_t start)
+{
+	int nid = meminfo_to_nid(&numa_meminfo, start);
+
+	/*
+	 * Prefer online nodes, but if reserved memory might be
+	 * hot-added continue the search with reserved ranges.
+	 */
+	if (nid != NUMA_NO_NODE)
+		return nid;
+
+	return meminfo_to_nid(&numa_reserved_meminfo, start);
+}
+EXPORT_SYMBOL_GPL(phys_to_target_node);
+
+int memory_add_physaddr_to_nid(u64 start)
+{
+	int nid = meminfo_to_nid(&numa_meminfo, start);
+
+	if (nid == NUMA_NO_NODE)
+		nid = numa_meminfo.blk[0].nid;
 	return nid;
 }
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
diff --git a/include/linux/numa.h b/include/linux/numa.h
index c005ed6b807b..cad0ab165619 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -1,7 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_NUMA_H
 #define _LINUX_NUMA_H
-
+#include <linux/types.h>
 
 #ifdef CONFIG_NODES_SHIFT
 #define NODES_SHIFT     CONFIG_NODES_SHIFT
@@ -21,11 +21,17 @@
 
 #ifdef CONFIG_NUMA
 int numa_map_to_online_node(int node);
+int phys_to_target_node(phys_addr_t addr);
 #else
 static inline int numa_map_to_online_node(int node)
 {
 	return NUMA_NO_NODE;
 }
+
+static inline int phys_to_target_node(phys_addr_t addr)
+{
+	return NUMA_NO_NODE;
+}
 #endif
 
 #endif /* _LINUX_NUMA_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index bcb012645809..ed376d57e527 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -3011,3 +3011,8 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 		p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
 			       nodemask_pr_args(&nodes));
 }
+
+__weak int phys_to_target_node(phys_addr_t addr)
+{
+	return NUMA_NO_NODE;
+}
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 6/6] libnvdimm/e820: Retrieve and populate correct 'target_node' info
  2020-01-22  3:04 [PATCH v4 0/6] Memory Hierarchy: Enable target node lookups for reserved memory Dan Williams
                   ` (4 preceding siblings ...)
  2020-01-22  3:05 ` [PATCH v4 5/6] x86/numa: Provide a range-to-target_node lookup facility Dan Williams
@ 2020-01-22  3:05 ` Dan Williams
  2020-02-13  2:28 ` [PATCH v4 0/6] Memory Hierarchy: Enable target node lookups for reserved memory Dan Williams
  2020-02-16 20:00 ` [PATCH v5 " Dan Williams
  7 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2020-01-22  3:05 UTC (permalink / raw)
  To: tglx, mingo
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Andrew Morton,
	David Hildenbrand, Michal Hocko, Christoph Hellwig, linux-kernel,
	linux-nvdimm, x86

Use the new phys_to_target_node() and numa_map_to_online_node() helpers
to retrieve the correct id for the 'numa_node' ("local" / online
initiator node) and 'target_node' (offline target memory node) sysfs
attributes.

Below is an example from a 4 numa node system where all the memory on
node2 is pmem / reserved. It should be noted that with the arrival of
the ACPI HMAT table and EFI Specific Purpose Memory the kernel will
start to see more platforms with reserved / performance differentiated
memory in its own numa node. Hence all the stakeholders on the Cc for
what is ostensibly a libnvdimm local patch.

=== Before ===

/* Notice no online memory on node2 at start */

# numactl --hardware
available: 3 nodes (0-1,3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
node 0 size: 3958 MB
node 0 free: 3708 MB
node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
node 1 size: 4027 MB
node 1 free: 3871 MB
node 3 cpus:
node 3 size: 3994 MB
node 3 free: 3971 MB
node distances:
node   0   1   3
  0:  10  21  21
  1:  21  10  21
  3:  21  21  10

/*
 * Put the pmem namespace into devdax mode so it can be assigned to the
 * kmem driver
 */

# ndctl create-namespace -e namespace0.0 -m devdax -f
{
  "dev":"namespace0.0",
  "mode":"devdax",
  "map":"dev",
  "size":"3.94 GiB (4.23 GB)",
  "uuid":"1650af9b-9ba3-4704-acd6-10178399d9a3",
  [..]
}

/* Online Persistent Memory as System RAM */

# daxctl reconfigure-device --mode=system-ram dax0.0
libdaxctl: memblock_in_dev: dax0.0: memory0: Unable to determine phys_index: Success
libdaxctl: memblock_in_dev: dax0.0: memory0: Unable to determine phys_index: Success
libdaxctl: memblock_in_dev: dax0.0: memory0: Unable to determine phys_index: Success
libdaxctl: memblock_in_dev: dax0.0: memory0: Unable to determine phys_index: Success
[
  {
    "chardev":"dax0.0",
    "size":4225761280,
    "target_node":0,
    "mode":"system-ram"
  }
]
reconfigured 1 device

/* Note that the memory is onlined by default to the wrong node, node0 */

# numactl --hardware
available: 3 nodes (0-1,3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
node 0 size: 7926 MB
node 0 free: 7655 MB
node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
node 1 size: 4027 MB
node 1 free: 3871 MB
node 3 cpus:
node 3 size: 3994 MB
node 3 free: 3971 MB
node distances:
node   0   1   3
  0:  10  21  21
  1:  21  10  21
  3:  21  21  10


=== After ===

/* Notice that the "phys_index" error messages are gone */

# daxctl reconfigure-device --mode=system-ram dax0.0
[
  {
    "chardev":"dax0.0",
    "size":4225761280,
    "target_node":2,
    "mode":"system-ram"
  }
]
reconfigured 1 device

/* Notice that node2 is now correctly populated */

# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
node 0 size: 3958 MB
node 0 free: 3793 MB
node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
node 1 size: 4027 MB
node 1 free: 3851 MB
node 2 cpus:
node 2 size: 3968 MB
node 2 free: 3968 MB
node 3 cpus:
node 3 size: 3994 MB
node 3 free: 3908 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/Kconfig      |    1 +
 drivers/nvdimm/e820.c |   18 ++++--------------
 2 files changed, 5 insertions(+), 14 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5e8949953660..3a827fe7afd6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1660,6 +1660,7 @@ config X86_PMEM_LEGACY
 	depends on PHYS_ADDR_T_64BIT
 	depends on BLK_DEV
 	select X86_PMEM_LEGACY_DEVICE
+	select KEEP_NUMA if NUMA
 	select LIBNVDIMM
 	help
 	  Treat memory marked using the non-standard e820 type of 12 as used
diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
index e02f60ad6c99..4cd18be9d0e9 100644
--- a/drivers/nvdimm/e820.c
+++ b/drivers/nvdimm/e820.c
@@ -7,6 +7,7 @@
 #include <linux/memory_hotplug.h>
 #include <linux/libnvdimm.h>
 #include <linux/module.h>
+#include <linux/numa.h>
 
 static int e820_pmem_remove(struct platform_device *pdev)
 {
@@ -16,27 +17,16 @@ static int e820_pmem_remove(struct platform_device *pdev)
 	return 0;
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
-static int e820_range_to_nid(resource_size_t addr)
-{
-	return memory_add_physaddr_to_nid(addr);
-}
-#else
-static int e820_range_to_nid(resource_size_t addr)
-{
-	return NUMA_NO_NODE;
-}
-#endif
-
 static int e820_register_one(struct resource *res, void *data)
 {
 	struct nd_region_desc ndr_desc;
 	struct nvdimm_bus *nvdimm_bus = data;
+	int nid = phys_to_target_node(res->start);
 
 	memset(&ndr_desc, 0, sizeof(ndr_desc));
 	ndr_desc.res = res;
-	ndr_desc.numa_node = e820_range_to_nid(res->start);
-	ndr_desc.target_node = ndr_desc.numa_node;
+	ndr_desc.numa_node = numa_map_to_online_node(nid);
+	ndr_desc.target_node = nid;
 	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
 	if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
 		return -ENXIO;
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/6] Memory Hierarchy: Enable target node lookups for reserved memory
  2020-01-22  3:04 [PATCH v4 0/6] Memory Hierarchy: Enable target node lookups for reserved memory Dan Williams
                   ` (5 preceding siblings ...)
  2020-01-22  3:05 ` [PATCH v4 6/6] libnvdimm/e820: Retrieve and populate correct 'target_node' info Dan Williams
@ 2020-02-13  2:28 ` Dan Williams
  2020-02-13  9:37   ` Ingo Molnar
  2020-02-16 20:00 ` [PATCH v5 " Dan Williams
  7 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2020-02-13  2:28 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar
  Cc: David Hildenbrand, Borislav Petkov, Rafael J. Wysocki,
	H. Peter Anvin, Aneesh Kumar K.V, kbuild test robot,
	Andrew Morton, Peter Zijlstra, Benjamin Herrenschmidt,
	Michal Hocko, Paul Mackerras, Christoph Hellwig, Dave Hansen,
	Michael Ellerman, X86 ML, Andy Lutomirski,
	Linux Kernel Mailing List, linux-nvdimm

On Tue, Jan 21, 2020 at 7:20 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> Changes since v3 [1]:
> - Cleanup numa_map_to_online_node() to remove redundant "if
>   (!node_online(node))" (Aneesh)
>
> [1]: http://lore.kernel.org/r/157954696789.2239526.17707265517154476652.stgit@dwillia2-desk3.amr.corp.intel.com
>
> ---
>
> Merge notes:
>
> x86 folks: This has an ack from Rafael for ACPI, and Michael for Power.
> With an x86 ack I plan to take this through the libnvdimm tree provided
> the x86 touches look ok to you.

Ping x86 folks. There's no additional changes identified for this
series. Can I request an ack to take it through libnvdimm.git? Do you
need a resend?

    x86/mm: Introduce CONFIG_KEEP_NUMA
    x86/numa: Provide a range-to-target_node lookup facility


>
> ---
>
> Cover:
>
> Arrange for platform numa info to be preserved for determining
> 'target_node' data. Where a 'target_node' is the node a reserved memory
> range will become when it is onlined.
>
> This new infrastructure is expected to be more valuable over time for
> Memory Tiers / Hierarchy management as more platforms (via the ACPI HMAT
> and EFI Specific Purpose Memory) publish reserved or "soft-reserved"
> ranges to Linux. Linux system administrators will expect to be able to
> interact with those ranges with a unique numa node number when/if that
> memory is onlined via the dax_kmem driver [2].
>
> One configuration that currently fails to properly convey the target
> node for the resulting memory hotplug operation is persistent memory
> defined by the memmap=nn!ss parameter. For example, today if node1 is a
> memory only node, and all the memory from node1 is specified to
> memmap=nn!ss and subsequently onlined, it will end up being onlined as
> node0 memory. As it stands, memory_add_physaddr_to_nid() can only
> identify online nodes and since node1 in this example has no online cpus
> / memory the target node is initialized node0.
>
> The fix is to preserve rather than discard the numa_meminfo entries that
> are relevant for reserved memory ranges, and to uplevel the node
> distance helper for determining the "local" (closest) node relative to
> an initiator node.
>
> [2]: https://pmem.io/ndctl/daxctl-reconfigure-device.html
>
> ---
>
> Dan Williams (6):
>       ACPI: NUMA: Up-level "map to online node" functionality
>       mm/numa: Skip NUMA_NO_NODE and online nodes in numa_map_to_online_node()
>       powerpc/papr_scm: Switch to numa_map_to_online_node()
>       x86/mm: Introduce CONFIG_KEEP_NUMA
>       x86/numa: Provide a range-to-target_node lookup facility
>       libnvdimm/e820: Retrieve and populate correct 'target_node' info
>
>
>  arch/powerpc/platforms/pseries/papr_scm.c |   21 --------
>  arch/x86/Kconfig                          |    1
>  arch/x86/mm/numa.c                        |   74 +++++++++++++++++++++++------
>  drivers/acpi/numa/srat.c                  |   41 ----------------
>  drivers/nvdimm/e820.c                     |   18 ++-----
>  include/linux/acpi.h                      |   23 +++++++++
>  include/linux/numa.h                      |   23 +++++++++
>  mm/Kconfig                                |    5 ++
>  mm/mempolicy.c                            |   31 ++++++++++++
>  9 files changed, 145 insertions(+), 92 deletions(-)
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 4/6] x86/mm: Introduce CONFIG_KEEP_NUMA
  2020-01-22  3:04 ` [PATCH v4 4/6] x86/mm: Introduce CONFIG_KEEP_NUMA Dan Williams
@ 2020-02-13  9:32   ` Ingo Molnar
  2020-02-13 21:19     ` Dan Williams
  2020-02-13 11:22   ` Thomas Gleixner
  1 sibling, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2020-02-13  9:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: tglx, mingo, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, x86, Andrew Morton,
	David Hildenbrand, Michal Hocko, hch, linux-kernel, linux-nvdimm


* Dan Williams <dan.j.williams@intel.com> wrote:

> Currently x86 numa_meminfo is marked __initdata in the
> CONFIG_MEMORY_HOTPLUG=n case. In support of a new facility to allow
> drivers to map reserved memory to a 'target_node'
> (phys_to_target_node()), add support for removing the __initdata
> designation for those users. Both memory hotplug and
> phys_to_target_node() users select CONFIG_KEEP_NUMA to tell the arch to
> maintain its physical address to numa mapping infrastructure post init.
> 
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: <x86@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/x86/mm/numa.c   |    6 +-----
>  include/linux/numa.h |    6 ++++++
>  mm/Kconfig           |    5 +++++
>  3 files changed, 12 insertions(+), 5 deletions(-)

The concept and the x86 portions look sane, just a few minor nits:

> 
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 99f7a68738f0..5289d9d6799a 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -25,11 +25,7 @@ nodemask_t numa_nodes_parsed __initdata;
>  struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
>  EXPORT_SYMBOL(node_data);
>  
> -static struct numa_meminfo numa_meminfo
> -#ifndef CONFIG_MEMORY_HOTPLUG
> -__initdata
> -#endif
> -;
> +static struct numa_meminfo numa_meminfo __initdata_numa;
>  
>  static int numa_distance_cnt;
>  static u8 *numa_distance;
> diff --git a/include/linux/numa.h b/include/linux/numa.h
> index 20f4e44b186c..c005ed6b807b 100644
> --- a/include/linux/numa.h
> +++ b/include/linux/numa.h
> @@ -13,6 +13,12 @@
>  
>  #define	NUMA_NO_NODE	(-1)
>  
> +#ifdef CONFIG_KEEP_NUMA
> +#define __initdata_numa
> +#else
> +#define __initdata_numa __initdata
> +#endif
> +
>  #ifdef CONFIG_NUMA
>  int numa_map_to_online_node(int node);
>  #else
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ab80933be65f..001f1185eadf 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -139,6 +139,10 @@ config HAVE_FAST_GUP
>  config ARCH_KEEP_MEMBLOCK
>  	bool
>  
> +# Keep arch numa mapping infrastructure post-init.

s/numa/NUMA

Please also capitalize consistently in the rest of the series.

> +config KEEP_NUMA
> +	bool


So most of our recent new NUMA options followed the naming pattern of:

  CONFIG_NUMA_*

Such as CONFIG_NUMA_BALANCING or CONFIG_NUMA_EMU.

So I'd suggesting naming it to CONFIG_NUMA_KEEP, or, a bit more 
descriptively, such as CONFIG_NUMA_KEEP_MAPPING or such?

'Keeping NUMA' is kind of lame - of course we keep NUMA. ;-)

Thanks,

	Ingo
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 5/6] x86/numa: Provide a range-to-target_node lookup facility
  2020-01-22  3:05 ` [PATCH v4 5/6] x86/numa: Provide a range-to-target_node lookup facility Dan Williams
@ 2020-02-13  9:36   ` Ingo Molnar
  2020-02-13 11:37   ` Thomas Gleixner
  1 sibling, 0 replies; 25+ messages in thread
From: Ingo Molnar @ 2020-02-13  9:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: tglx, mingo, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, x86, Andrew Morton,
	David Hildenbrand, Michal Hocko, kbuild test robot, hch,
	linux-kernel, linux-nvdimm


* Dan Williams <dan.j.williams@intel.com> wrote:

> The DEV_DAX_KMEM facility is a generic mechanism to allow device-dax
> instances, fronting performance-differentiated-memory like pmem, to be
> added to the System RAM pool. The numa node for that hot-added memory is
> derived from the device-dax instance's 'target_node' attribute.
> 
> Recall that the 'target_node' is the ACPI-PXM-to-node translation for
> memory when it comes online whereas the 'numa_node' attribute of the
> device represents the closest online cpu node.
> 
> Presently useful target_node information from the ACPI SRAT is discarded
> with the expectation that "Reserved" memory will never be onlined. Now,
> DEV_DAX_KMEM violates that assumption, there is a need to retain the
> translation. Move, rather than discard, numa_memblk data to a secondary
> array that memory_add_physaddr_to_target_node() may consider at a later
> point in time.
> 
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: <x86@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Reported-by: kbuild test robot <lkp@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/x86/mm/numa.c   |   68 +++++++++++++++++++++++++++++++++++++++++++-------
>  include/linux/numa.h |    8 +++++-
>  mm/mempolicy.c       |    5 ++++
>  3 files changed, 70 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 5289d9d6799a..f2c8fca36f28 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -26,6 +26,7 @@ struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
>  EXPORT_SYMBOL(node_data);
>  
>  static struct numa_meminfo numa_meminfo __initdata_numa;
> +static struct numa_meminfo numa_reserved_meminfo __initdata_numa;
>  
>  static int numa_distance_cnt;
>  static u8 *numa_distance;
> @@ -164,6 +165,26 @@ void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
>  		(mi->nr_blks - idx) * sizeof(mi->blk[0]));
>  }
>  
> +/**
> + * numa_move_memblk - Move one numa_memblk from one numa_meminfo to another
> + * @dst: numa_meminfo to move block to
> + * @idx: Index of memblk to remove
> + * @src: numa_meminfo to remove memblk from
> + *
> + * If @dst is non-NULL add it at the @dst->nr_blks index and increment
> + * @dst->nr_blks, then remove it from @src.
> + */
> +static void __init numa_move_memblk(struct numa_meminfo *dst, int idx,
> +		struct numa_meminfo *src)

Nit, this is obviously not how we format function definitions if 
checkpatch complains about the col80 limit.


> +{
> +	if (dst) {
> +		memcpy(&dst->blk[dst->nr_blks], &src->blk[idx],
> +				sizeof(struct numa_memblk));

This linebreak is actually unnecessary ...

Thanks,

	Ingo
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 0/6] Memory Hierarchy: Enable target node lookups for reserved memory
  2020-02-13  2:28 ` [PATCH v4 0/6] Memory Hierarchy: Enable target node lookups for reserved memory Dan Williams
@ 2020-02-13  9:37   ` Ingo Molnar
  0 siblings, 0 replies; 25+ messages in thread
From: Ingo Molnar @ 2020-02-13  9:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: Thomas Gleixner, Ingo Molnar, David Hildenbrand, Borislav Petkov,
	Rafael J. Wysocki, H. Peter Anvin, Aneesh Kumar K.V,
	kbuild test robot, Andrew Morton, Peter Zijlstra,
	Benjamin Herrenschmidt, Michal Hocko, Paul Mackerras,
	Christoph Hellwig, Dave Hansen, Michael Ellerman, X86 ML,
	Andy Lutomirski, Linux Kernel Mailing List, linux-nvdimm


* Dan Williams <dan.j.williams@intel.com> wrote:

> On Tue, Jan 21, 2020 at 7:20 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > Changes since v3 [1]:
> > - Cleanup numa_map_to_online_node() to remove redundant "if
> >   (!node_online(node))" (Aneesh)
> >
> > [1]: http://lore.kernel.org/r/157954696789.2239526.17707265517154476652.stgit@dwillia2-desk3.amr.corp.intel.com
> >
> > ---
> >
> > Merge notes:
> >
> > x86 folks: This has an ack from Rafael for ACPI, and Michael for Power.
> > With an x86 ack I plan to take this through the libnvdimm tree provided
> > the x86 touches look ok to you.
> 
> Ping x86 folks. There's no additional changes identified for this
> series. Can I request an ack to take it through libnvdimm.git? Do you
> need a resend?
> 
>     x86/mm: Introduce CONFIG_KEEP_NUMA
>     x86/numa: Provide a range-to-target_node lookup facility

If the minor complaints I outlined are addressed:

Reviewed-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 4/6] x86/mm: Introduce CONFIG_KEEP_NUMA
  2020-01-22  3:04 ` [PATCH v4 4/6] x86/mm: Introduce CONFIG_KEEP_NUMA Dan Williams
  2020-02-13  9:32   ` Ingo Molnar
@ 2020-02-13 11:22   ` Thomas Gleixner
  2020-02-13 21:21     ` Dan Williams
  1 sibling, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2020-02-13 11:22 UTC (permalink / raw)
  To: Dan Williams, mingo
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, x86, Andrew Morton, David Hildenbrand,
	Michal Hocko, hch, linux-kernel, linux-nvdimm

Dan Williams <dan.j.williams@intel.com> writes:
> +#ifdef CONFIG_KEEP_NUMA
> +#define __initdata_numa
> +#else
> +#define __initdata_numa __initdata
> +#endif

TBH, I find this conditional annotation mightingly confusing.

__initdata_numa still suggest that this is __initdata, just a different
section and some extra rules or whatever.

Something like __initdata_or_keepnuma (sorry I could not come up with
something prettier, but you get the idea.

Thanks,

        tglx
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 5/6] x86/numa: Provide a range-to-target_node lookup facility
  2020-01-22  3:05 ` [PATCH v4 5/6] x86/numa: Provide a range-to-target_node lookup facility Dan Williams
  2020-02-13  9:36   ` Ingo Molnar
@ 2020-02-13 11:37   ` Thomas Gleixner
  2020-02-13 21:40     ` Dan Williams
  1 sibling, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2020-02-13 11:37 UTC (permalink / raw)
  To: Dan Williams, mingo
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, x86, Andrew Morton, David Hildenbrand,
	Michal Hocko, kbuild test robot, hch, linux-kernel, linux-nvdimm

Dan Williams <dan.j.williams@intel.com> writes:
> +/**
> + * numa_move_memblk - Move one numa_memblk from one numa_meminfo to another
> + * @dst: numa_meminfo to move block to
> + * @idx: Index of memblk to remove
> + * @src: numa_meminfo to remove memblk from
> + *
> + * If @dst is non-NULL add it at the @dst->nr_blks index and increment
> + * @dst->nr_blks, then remove it from @src.

This is not correct. It's suggesting that these operations are only
happening when @dst is non-NULL. Remove is unconditional though.

Also this is called with &numa_reserved_meminfo as @dst argument, which is:

> +static struct numa_meminfo numa_reserved_meminfo __initdata_numa;

So how would @dst ever be NULL?
 
> + */
> +static void __init numa_move_memblk(struct numa_meminfo *dst, int idx,
> +		struct numa_meminfo *src)
> +{
> +	if (dst) {
> +		memcpy(&dst->blk[dst->nr_blks], &src->blk[idx],
> +				sizeof(struct numa_memblk));
> +		dst->nr_blks++;
> +	}
> +	numa_remove_memblk_from(idx, src);
> +}

...

> -		/* make sure all blocks are inside the limits */
> +		/* move / save reserved memory ranges */
> +		if (!memblock_overlaps_region(&memblock.memory,
> +					bi->start, bi->end - bi->start)) {
> +			numa_move_memblk(&numa_reserved_meminfo, i--, mi);

Thanks,

        tglx
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 4/6] x86/mm: Introduce CONFIG_KEEP_NUMA
  2020-02-13  9:32   ` Ingo Molnar
@ 2020-02-13 21:19     ` Dan Williams
  0 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2020-02-13 21:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Ingo Molnar, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Borislav Petkov, H. Peter Anvin, X86 ML,
	Andrew Morton, David Hildenbrand, Michal Hocko,
	Christoph Hellwig, Linux Kernel Mailing List, linux-nvdimm

On Thu, Feb 13, 2020 at 1:32 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Dan Williams <dan.j.williams@intel.com> wrote:
>
> > Currently x86 numa_meminfo is marked __initdata in the
> > CONFIG_MEMORY_HOTPLUG=n case. In support of a new facility to allow
> > drivers to map reserved memory to a 'target_node'
> > (phys_to_target_node()), add support for removing the __initdata
> > designation for those users. Both memory hotplug and
> > phys_to_target_node() users select CONFIG_KEEP_NUMA to tell the arch to
> > maintain its physical address to numa mapping infrastructure post init.
> >
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Cc: Andy Lutomirski <luto@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > Cc: <x86@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  arch/x86/mm/numa.c   |    6 +-----
> >  include/linux/numa.h |    6 ++++++
> >  mm/Kconfig           |    5 +++++
> >  3 files changed, 12 insertions(+), 5 deletions(-)
>
> The concept and the x86 portions look sane, just a few minor nits:
>
> >
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 99f7a68738f0..5289d9d6799a 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -25,11 +25,7 @@ nodemask_t numa_nodes_parsed __initdata;
> >  struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
> >  EXPORT_SYMBOL(node_data);
> >
> > -static struct numa_meminfo numa_meminfo
> > -#ifndef CONFIG_MEMORY_HOTPLUG
> > -__initdata
> > -#endif
> > -;
> > +static struct numa_meminfo numa_meminfo __initdata_numa;
> >
> >  static int numa_distance_cnt;
> >  static u8 *numa_distance;
> > diff --git a/include/linux/numa.h b/include/linux/numa.h
> > index 20f4e44b186c..c005ed6b807b 100644
> > --- a/include/linux/numa.h
> > +++ b/include/linux/numa.h
> > @@ -13,6 +13,12 @@
> >
> >  #define      NUMA_NO_NODE    (-1)
> >
> > +#ifdef CONFIG_KEEP_NUMA
> > +#define __initdata_numa
> > +#else
> > +#define __initdata_numa __initdata
> > +#endif
> > +
> >  #ifdef CONFIG_NUMA
> >  int numa_map_to_online_node(int node);
> >  #else
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index ab80933be65f..001f1185eadf 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -139,6 +139,10 @@ config HAVE_FAST_GUP
> >  config ARCH_KEEP_MEMBLOCK
> >       bool
> >
> > +# Keep arch numa mapping infrastructure post-init.
>
> s/numa/NUMA
>
> Please also capitalize consistently in the rest of the series.
>
> > +config KEEP_NUMA
> > +     bool
>
>
> So most of our recent new NUMA options followed the naming pattern of:
>
>   CONFIG_NUMA_*
>
> Such as CONFIG_NUMA_BALANCING or CONFIG_NUMA_EMU.
>
> So I'd suggesting naming it to CONFIG_NUMA_KEEP, or, a bit more
> descriptively, such as CONFIG_NUMA_KEEP_MAPPING or such?
>
> 'Keeping NUMA' is kind of lame - of course we keep NUMA. ;-)

Ok, I settled on CONFIG_NUMA_KEEP_MEMINFO, and will fix up the
lowercase "numa" instances in the set.
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 4/6] x86/mm: Introduce CONFIG_KEEP_NUMA
  2020-02-13 11:22   ` Thomas Gleixner
@ 2020-02-13 21:21     ` Dan Williams
  0 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2020-02-13 21:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, X86 ML, Andrew Morton,
	David Hildenbrand, Michal Hocko, Christoph Hellwig,
	Linux Kernel Mailing List, linux-nvdimm

On Thu, Feb 13, 2020 at 3:22 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
> > +#ifdef CONFIG_KEEP_NUMA
> > +#define __initdata_numa
> > +#else
> > +#define __initdata_numa __initdata
> > +#endif
>
> TBH, I find this conditional annotation mightingly confusing.
>
> __initdata_numa still suggest that this is __initdata, just a different
> section and some extra rules or whatever.
>
> Something like __initdata_or_keepnuma (sorry I could not come up with
> something prettier, but you get the idea.

Yes, and to dovetail with Ingo's feedback I think
__initdata_or_meminfo conveys it's optionally init vs runtime data.
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 5/6] x86/numa: Provide a range-to-target_node lookup facility
  2020-02-13 11:37   ` Thomas Gleixner
@ 2020-02-13 21:40     ` Dan Williams
  0 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2020-02-13 21:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Borislav Petkov, H. Peter Anvin, X86 ML, Andrew Morton,
	David Hildenbrand, Michal Hocko, kbuild test robot,
	Christoph Hellwig, Linux Kernel Mailing List, linux-nvdimm

On Thu, Feb 13, 2020 at 3:38 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
> > +/**
> > + * numa_move_memblk - Move one numa_memblk from one numa_meminfo to another
> > + * @dst: numa_meminfo to move block to
> > + * @idx: Index of memblk to remove
> > + * @src: numa_meminfo to remove memblk from
> > + *
> > + * If @dst is non-NULL add it at the @dst->nr_blks index and increment
> > + * @dst->nr_blks, then remove it from @src.
>
> This is not correct. It's suggesting that these operations are only
> happening when @dst is non-NULL. Remove is unconditional though.
>
> Also this is called with &numa_reserved_meminfo as @dst argument, which is:
>
> > +static struct numa_meminfo numa_reserved_meminfo __initdata_numa;
>
> So how would @dst ever be NULL?

Ugh, something I should have caught. An earlier version of this patch
optionally defined numa_reserved_meminfo [1], but I later switched to
the current / cleaner __initdata_or_meminfo scheme. Will clean this
up.

[1]: https://lore.kernel.org/linux-mm/157309907296.1582359.7986676987778026949.stgit@dwillia2-desk3.amr.corp.intel.com/
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v5 0/6] Memory Hierarchy: Enable target node lookups for reserved memory
  2020-01-22  3:04 [PATCH v4 0/6] Memory Hierarchy: Enable target node lookups for reserved memory Dan Williams
                   ` (6 preceding siblings ...)
  2020-02-13  2:28 ` [PATCH v4 0/6] Memory Hierarchy: Enable target node lookups for reserved memory Dan Williams
@ 2020-02-16 20:00 ` Dan Williams
  2020-02-16 20:00   ` [PATCH v5 1/6] ACPI: NUMA: Up-level "map to online node" functionality Dan Williams
                     ` (5 more replies)
  7 siblings, 6 replies; 25+ messages in thread
From: Dan Williams @ 2020-02-16 20:00 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: David Hildenbrand, Borislav Petkov, Ingo Molnar,
	Rafael J. Wysocki, H. Peter Anvin, Thomas Gleixner,
	Aneesh Kumar K.V, kbuild test robot, Andrew Morton,
	Peter Zijlstra, Benjamin Herrenschmidt, Michal Hocko,
	Paul Mackerras, Christoph Hellwig, Dave Hansen, Michael Ellerman,
	x86, Andy Lutomirski, linux-kernel

Changes since v4 [1]:
- Rename CONFIG_KEEP_NUMA to CONFIG_NUMA_KEEP_MEMINFO (Ingo)
- Rename __initdata_numa to __initdata_or_meminfo (Thomas)
- Capitalize NUMA throughout (Ingo)
- Replace explicit memcpy with implicit structure copy to address an 80
  column violation, and fixup a function definition line-wrap (Ingo)
- Rename numa_move_memblk() to numa_move_tail_memblk(), and remove the
  stale kernel-doc that implied @dst was optional (Thomas)
- Comment that phys_to_target_node() is an optional arch implementation
  detail that consumers must gate with "depends on $ARCH"
- Apply Ingo's conditional reviewed-by

[1]: http://lore.kernel.org/r/157966227494.2508551.7206194169374588977.stgit@dwillia2-desk3.amr.corp.intel.com

---

Merge notes: I believe this addresses all outstanding comments, barring
additional feedback I will push to libnvdimm-for-next.

---

Cover:

Arrange for platform NUMA info to be preserved for determining
'target_node' data. Where a 'target_node' is the node a reserved memory
range will become when it is onlined.

This new infrastructure is expected to be more valuable over time for
Memory Tiers / Hierarchy management as more platforms (via the ACPI HMAT
and EFI Specific Purpose Memory) publish reserved or "soft-reserved"
ranges to Linux. Linux system administrators will expect to be able to
interact with those ranges with a unique NUMA node number when/if that
memory is onlined via the dax_kmem driver [2].

One configuration that currently fails to properly convey the target
node for the resulting memory hotplug operation is persistent memory
defined by the memmap=nn!ss parameter. For example, today if node1 is a
memory only node, and all the memory from node1 is specified to
memmap=nn!ss and subsequently onlined, it will end up being onlined as
node0 memory. As it stands, memory_add_physaddr_to_nid() can only
identify online nodes and since node1 in this example has no online cpus
/ memory the target node is initialized node0.

The fix is to preserve rather than discard the numa_meminfo entries that
are relevant for reserved memory ranges, and to uplevel the node
distance helper for determining the "local" (closest) node relative to
an initiator node.

[2]: https://pmem.io/ndctl/daxctl-reconfigure-device.html

---

Dan Williams (6):
      ACPI: NUMA: Up-level "map to online node" functionality
      mm/numa: Skip NUMA_NO_NODE and online nodes in numa_map_to_online_node()
      powerpc/papr_scm: Switch to numa_map_to_online_node()
      x86/mm: Introduce CONFIG_NUMA_KEEP_MEMINFO
      x86/NUMA: Provide a range-to-target_node lookup facility
      libnvdimm/e820: Retrieve and populate correct 'target_node' info


 arch/powerpc/platforms/pseries/papr_scm.c |   21 ---------
 arch/x86/Kconfig                          |    1 
 arch/x86/mm/numa.c                        |   67 +++++++++++++++++++++++------
 drivers/acpi/numa/srat.c                  |   41 ------------------
 drivers/nvdimm/e820.c                     |   18 ++------
 include/linux/acpi.h                      |   23 ++++++++++
 include/linux/numa.h                      |   30 +++++++++++++
 mm/Kconfig                                |    5 ++
 mm/mempolicy.c                            |   26 +++++++++++
 9 files changed, 140 insertions(+), 92 deletions(-)

base-commit: bb6d3fb354c5ee8d6bde2d576eb7220ea09862b9
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v5 1/6] ACPI: NUMA: Up-level "map to online node" functionality
  2020-02-16 20:00 ` [PATCH v5 " Dan Williams
@ 2020-02-16 20:00   ` Dan Williams
  2020-02-16 20:00   ` [PATCH v5 2/6] mm/numa: Skip NUMA_NO_NODE and online nodes in numa_map_to_online_node() Dan Williams
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2020-02-16 20:00 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Michal Hocko, Rafael J. Wysocki, Ingo Molnar, peterz,
	dave.hansen, hch, linux-kernel, x86

The acpi_map_pxm_to_online_node() helper is used to find the closest
online node to a given proximity domain. This is used to map devices in
a proximity domain with no online memory or cpus to the closest online
node and populate a device's 'numa_node' property. The numa_node
property allows applications to be migrated "close" to a resource.

In preparation for providing a generic facility to optionally map an
address range to its closest online node, or the node the range would
represent were it to be onlined (target_node), up-level the core of
acpi_map_pxm_to_online_node() to a generic mm/numa helper.

Cc: Michal Hocko <mhocko@suse.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/acpi/numa/srat.c |   41 -----------------------------------------
 include/linux/acpi.h     |   23 ++++++++++++++++++++++-
 include/linux/numa.h     |    9 +++++++++
 mm/mempolicy.c           |   30 ++++++++++++++++++++++++++++++
 4 files changed, 61 insertions(+), 42 deletions(-)

diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index eadbf90e65d1..47b4969d9b93 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -72,47 +72,6 @@ int acpi_map_pxm_to_node(int pxm)
 }
 EXPORT_SYMBOL(acpi_map_pxm_to_node);
 
-/**
- * acpi_map_pxm_to_online_node - Map proximity ID to online node
- * @pxm: ACPI proximity ID
- *
- * This is similar to acpi_map_pxm_to_node(), but always returns an online
- * node.  When the mapped node from a given proximity ID is offline, it
- * looks up the node distance table and returns the nearest online node.
- *
- * ACPI device drivers, which are called after the NUMA initialization has
- * completed in the kernel, can call this interface to obtain their device
- * NUMA topology from ACPI tables.  Such drivers do not have to deal with
- * offline nodes.  A node may be offline when a device proximity ID is
- * unique, SRAT memory entry does not exist, or NUMA is disabled, ex.
- * "numa=off" on x86.
- */
-int acpi_map_pxm_to_online_node(int pxm)
-{
-	int node, min_node;
-
-	node = acpi_map_pxm_to_node(pxm);
-
-	if (node == NUMA_NO_NODE)
-		node = 0;
-
-	min_node = node;
-	if (!node_online(node)) {
-		int min_dist = INT_MAX, dist, n;
-
-		for_each_online_node(n) {
-			dist = node_distance(node, n);
-			if (dist < min_dist) {
-				min_dist = dist;
-				min_node = n;
-			}
-		}
-	}
-
-	return min_node;
-}
-EXPORT_SYMBOL(acpi_map_pxm_to_online_node);
-
 static void __init
 acpi_table_print_srat_entry(struct acpi_subtable_header *header)
 {
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 0f24d701fbdc..3839363081f3 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -416,9 +416,30 @@ extern void acpi_osi_setup(char *str);
 extern bool acpi_osi_is_win8(void);
 
 #ifdef CONFIG_ACPI_NUMA
-int acpi_map_pxm_to_online_node(int pxm);
 int acpi_map_pxm_to_node(int pxm);
 int acpi_get_node(acpi_handle handle);
+
+/**
+ * acpi_map_pxm_to_online_node - Map proximity ID to online node
+ * @pxm: ACPI proximity ID
+ *
+ * This is similar to acpi_map_pxm_to_node(), but always returns an online
+ * node.  When the mapped node from a given proximity ID is offline, it
+ * looks up the node distance table and returns the nearest online node.
+ *
+ * ACPI device drivers, which are called after the NUMA initialization has
+ * completed in the kernel, can call this interface to obtain their device
+ * NUMA topology from ACPI tables.  Such drivers do not have to deal with
+ * offline nodes.  A node may be offline when a device proximity ID is
+ * unique, SRAT memory entry does not exist, or NUMA is disabled, ex.
+ * "numa=off" on x86.
+ */
+static inline int acpi_map_pxm_to_online_node(int pxm)
+{
+	int node = acpi_map_pxm_to_node(pxm);
+
+	return numa_map_to_online_node(node);
+}
 #else
 static inline int acpi_map_pxm_to_online_node(int pxm)
 {
diff --git a/include/linux/numa.h b/include/linux/numa.h
index 110b0e5d0fb0..20f4e44b186c 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -13,4 +13,13 @@
 
 #define	NUMA_NO_NODE	(-1)
 
+#ifdef CONFIG_NUMA
+int numa_map_to_online_node(int node);
+#else
+static inline int numa_map_to_online_node(int node)
+{
+	return NUMA_NO_NODE;
+}
+#endif
+
 #endif /* _LINUX_NUMA_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 977c641f78cf..756d6e5bb59f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -127,6 +127,36 @@ static struct mempolicy default_policy = {
 
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
+/**
+ * numa_map_to_online_node - Find closest online node
+ * @nid: Node id to start the search
+ *
+ * Lookup the next closest node by distance if @nid is not online.
+ */
+int numa_map_to_online_node(int node)
+{
+	int min_node;
+
+	if (node == NUMA_NO_NODE)
+		node = 0;
+
+	min_node = node;
+	if (!node_online(node)) {
+		int min_dist = INT_MAX, dist, n;
+
+		for_each_online_node(n) {
+			dist = node_distance(node, n);
+			if (dist < min_dist) {
+				min_dist = dist;
+				min_node = n;
+			}
+		}
+	}
+
+	return min_node;
+}
+EXPORT_SYMBOL_GPL(numa_map_to_online_node);
+
 struct mempolicy *get_task_policy(struct task_struct *p)
 {
 	struct mempolicy *pol = p->mempolicy;
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v5 2/6] mm/numa: Skip NUMA_NO_NODE and online nodes in numa_map_to_online_node()
  2020-02-16 20:00 ` [PATCH v5 " Dan Williams
  2020-02-16 20:00   ` [PATCH v5 1/6] ACPI: NUMA: Up-level "map to online node" functionality Dan Williams
@ 2020-02-16 20:00   ` Dan Williams
  2020-02-16 20:00   ` [PATCH v5 3/6] powerpc/papr_scm: Switch to numa_map_to_online_node() Dan Williams
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2020-02-16 20:00 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Aneesh Kumar K.V, Ingo Molnar, peterz, dave.hansen, hch,
	linux-kernel, x86

Update numa_map_to_online_node() to stop falling back to numa node 0
when the input is NUMA_NO_NODE. Also, skip the lookup if @node is
online. This makes the routine compatible with other arch node mapping
routines.

Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/157401275716.43284.13185549705765009174.stgit@dwillia2-desk3.amr.corp.intel.com
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/mempolicy.c |   20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 756d6e5bb59f..19f7e71945a7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -135,21 +135,17 @@ static struct mempolicy preferred_node_policy[MAX_NUMNODES];
  */
 int numa_map_to_online_node(int node)
 {
-	int min_node;
+	int min_dist = INT_MAX, dist, n, min_node;
 
-	if (node == NUMA_NO_NODE)
-		node = 0;
+	if (node == NUMA_NO_NODE || node_online(node))
+		return node;
 
 	min_node = node;
-	if (!node_online(node)) {
-		int min_dist = INT_MAX, dist, n;
-
-		for_each_online_node(n) {
-			dist = node_distance(node, n);
-			if (dist < min_dist) {
-				min_dist = dist;
-				min_node = n;
-			}
+	for_each_online_node(n) {
+		dist = node_distance(node, n);
+		if (dist < min_dist) {
+			min_dist = dist;
+			min_node = n;
 		}
 	}
 
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v5 3/6] powerpc/papr_scm: Switch to numa_map_to_online_node()
  2020-02-16 20:00 ` [PATCH v5 " Dan Williams
  2020-02-16 20:00   ` [PATCH v5 1/6] ACPI: NUMA: Up-level "map to online node" functionality Dan Williams
  2020-02-16 20:00   ` [PATCH v5 2/6] mm/numa: Skip NUMA_NO_NODE and online nodes in numa_map_to_online_node() Dan Williams
@ 2020-02-16 20:00   ` Dan Williams
  2020-02-16 20:01   ` [PATCH v5 4/6] x86/mm: Introduce CONFIG_NUMA_KEEP_MEMINFO Dan Williams
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2020-02-16 20:00 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Aneesh Kumar K.V, Ingo Molnar, peterz, dave.hansen, hch,
	linux-kernel, x86

Now that the core exports numa_map_to_online_node() switch to that
instead of the locally coded duplicate.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/157401276263.43284.12616818803654229788.stgit@dwillia2-desk3.amr.corp.intel.com
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/platforms/pseries/papr_scm.c |   21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 0b4467e378e5..3cc66224ec1f 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -285,25 +285,6 @@ int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm,
 	return 0;
 }
 
-static inline int papr_scm_node(int node)
-{
-	int min_dist = INT_MAX, dist;
-	int nid, min_node;
-
-	if ((node == NUMA_NO_NODE) || node_online(node))
-		return node;
-
-	min_node = first_online_node;
-	for_each_online_node(nid) {
-		dist = node_distance(node, nid);
-		if (dist < min_dist) {
-			min_dist = dist;
-			min_node = nid;
-		}
-	}
-	return min_node;
-}
-
 static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 {
 	struct device *dev = &p->pdev->dev;
@@ -349,7 +330,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 
 	memset(&ndr_desc, 0, sizeof(ndr_desc));
 	target_nid = dev_to_node(&p->pdev->dev);
-	online_nid = papr_scm_node(target_nid);
+	online_nid = numa_map_to_online_node(target_nid);
 	ndr_desc.numa_node = online_nid;
 	ndr_desc.target_node = target_nid;
 	ndr_desc.res = &p->res;
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v5 4/6] x86/mm: Introduce CONFIG_NUMA_KEEP_MEMINFO
  2020-02-16 20:00 ` [PATCH v5 " Dan Williams
                     ` (2 preceding siblings ...)
  2020-02-16 20:00   ` [PATCH v5 3/6] powerpc/papr_scm: Switch to numa_map_to_online_node() Dan Williams
@ 2020-02-16 20:01   ` Dan Williams
  2020-02-17  8:21     ` Thomas Gleixner
  2020-02-16 20:01   ` [PATCH v5 5/6] x86/NUMA: Provide a range-to-target_node lookup facility Dan Williams
  2020-02-16 20:01   ` [PATCH v5 6/6] libnvdimm/e820: Retrieve and populate correct 'target_node' info Dan Williams
  5 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2020-02-16 20:01 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, x86, Andrew Morton,
	David Hildenbrand, Michal Hocko, Ingo Molnar, hch, linux-kernel

Currently x86 numa_meminfo is marked __initdata in the
CONFIG_MEMORY_HOTPLUG=n case. In support of a new facility to allow
drivers to map reserved memory to a 'target_node'
(phys_to_target_node()), add support for removing the __initdata
designation for those users. Both memory hotplug and
phys_to_target_node() users select CONFIG_NUMA_KEEP_MEMINFO to tell the
arch to maintain its physical address to NUMA mapping infrastructure
post init.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/mm/numa.c   |    6 +-----
 include/linux/numa.h |    7 +++++++
 mm/Kconfig           |    5 +++++
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 99f7a68738f0..2450b21cc28a 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -25,11 +25,7 @@ nodemask_t numa_nodes_parsed __initdata;
 struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
 
-static struct numa_meminfo numa_meminfo
-#ifndef CONFIG_MEMORY_HOTPLUG
-__initdata
-#endif
-;
+static struct numa_meminfo numa_meminfo __initdata_or_meminfo;
 
 static int numa_distance_cnt;
 static u8 *numa_distance;
diff --git a/include/linux/numa.h b/include/linux/numa.h
index 20f4e44b186c..5773cd2613fc 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -13,6 +13,13 @@
 
 #define	NUMA_NO_NODE	(-1)
 
+/* optionally keep NUMA memory info available post init */
+#ifdef CONFIG_NUMA_KEEP_MEMINFO
+#define __initdata_or_meminfo
+#else
+#define __initdata_or_meminfo __initdata
+#endif
+
 #ifdef CONFIG_NUMA
 int numa_map_to_online_node(int node);
 #else
diff --git a/mm/Kconfig b/mm/Kconfig
index ab80933be65f..328268473fec 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -139,6 +139,10 @@ config HAVE_FAST_GUP
 config ARCH_KEEP_MEMBLOCK
 	bool
 
+# Keep arch NUMA mapping infrastructure post-init.
+config NUMA_KEEP_MEMINFO
+	bool
+
 config MEMORY_ISOLATION
 	bool
 
@@ -154,6 +158,7 @@ config MEMORY_HOTPLUG
 	bool "Allow for memory hot-add"
 	depends on SPARSEMEM || X86_64_ACPI_NUMA
 	depends on ARCH_ENABLE_MEMORY_HOTPLUG
+	select NUMA_KEEP_MEMINFO if NUMA
 
 config MEMORY_HOTPLUG_SPARSE
 	def_bool y
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v5 5/6] x86/NUMA: Provide a range-to-target_node lookup facility
  2020-02-16 20:00 ` [PATCH v5 " Dan Williams
                     ` (3 preceding siblings ...)
  2020-02-16 20:01   ` [PATCH v5 4/6] x86/mm: Introduce CONFIG_NUMA_KEEP_MEMINFO Dan Williams
@ 2020-02-16 20:01   ` Dan Williams
  2020-02-17  8:22     ` Thomas Gleixner
  2020-02-16 20:01   ` [PATCH v5 6/6] libnvdimm/e820: Retrieve and populate correct 'target_node' info Dan Williams
  5 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2020-02-16 20:01 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin, x86, Andrew Morton,
	David Hildenbrand, Michal Hocko, kbuild test robot, Ingo Molnar,
	hch, linux-kernel

The DEV_DAX_KMEM facility is a generic mechanism to allow device-dax
instances, fronting performance-differentiated-memory like pmem, to be
added to the System RAM pool. The NUMA node for that hot-added memory is
derived from the device-dax instance's 'target_node' attribute.

Recall that the 'target_node' is the ACPI-PXM-to-node translation for
memory when it comes online whereas the 'numa_node' attribute of the
device represents the closest online cpu node.

Presently useful target_node information from the ACPI SRAT is discarded
with the expectation that "Reserved" memory will never be onlined. Now,
DEV_DAX_KMEM violates that assumption, there is a need to retain the
translation. Move, rather than discard, numa_memblk data to a secondary
array that memory_add_physaddr_to_target_node() may consider at a later
point in time.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/mm/numa.c   |   61 ++++++++++++++++++++++++++++++++++++++++++--------
 include/linux/numa.h |   14 +++++++++++
 2 files changed, 64 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2450b21cc28a..59ba008504dc 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -26,6 +26,7 @@ struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
 
 static struct numa_meminfo numa_meminfo __initdata_or_meminfo;
+static struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo;
 
 static int numa_distance_cnt;
 static u8 *numa_distance;
@@ -164,6 +165,19 @@ void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi)
 		(mi->nr_blks - idx) * sizeof(mi->blk[0]));
 }
 
+/**
+ * numa_move_tail_memblk - Move a numa_memblk from one numa_meminfo to another
+ * @dst: numa_meminfo to append block to
+ * @idx: Index of memblk to remove
+ * @src: numa_meminfo to remove memblk from
+ */
+static void __init numa_move_tail_memblk(struct numa_meminfo *dst, int idx,
+					 struct numa_meminfo *src)
+{
+	dst->blk[dst->nr_blks++] = src->blk[idx];
+	numa_remove_memblk_from(idx, src);
+}
+
 /**
  * numa_add_memblk - Add one numa_memblk to numa_meminfo
  * @nid: NUMA node ID of the new memblk
@@ -233,14 +247,19 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
 	for (i = 0; i < mi->nr_blks; i++) {
 		struct numa_memblk *bi = &mi->blk[i];
 
-		/* make sure all blocks are inside the limits */
+		/* move / save reserved memory ranges */
+		if (!memblock_overlaps_region(&memblock.memory,
+					bi->start, bi->end - bi->start)) {
+			numa_move_tail_memblk(&numa_reserved_meminfo, i--, mi);
+			continue;
+		}
+
+		/* make sure all non-reserved blocks are inside the limits */
 		bi->start = max(bi->start, low);
 		bi->end = min(bi->end, high);
 
-		/* and there's no empty or non-exist block */
-		if (bi->start >= bi->end ||
-		    !memblock_overlaps_region(&memblock.memory,
-			bi->start, bi->end - bi->start))
+		/* and there's no empty block */
+		if (bi->start >= bi->end)
 			numa_remove_memblk_from(i--, mi);
 	}
 
@@ -877,16 +896,38 @@ EXPORT_SYMBOL(cpumask_of_node);
 
 #endif	/* !CONFIG_DEBUG_PER_CPU_MAPS */
 
-#ifdef CONFIG_MEMORY_HOTPLUG
-int memory_add_physaddr_to_nid(u64 start)
+#ifdef CONFIG_NUMA_KEEP_MEMINFO
+static int meminfo_to_nid(struct numa_meminfo *mi, u64 start)
 {
-	struct numa_meminfo *mi = &numa_meminfo;
-	int nid = mi->blk[0].nid;
 	int i;
 
 	for (i = 0; i < mi->nr_blks; i++)
 		if (mi->blk[i].start <= start && mi->blk[i].end > start)
-			nid = mi->blk[i].nid;
+			return mi->blk[i].nid;
+	return NUMA_NO_NODE;
+}
+
+int phys_to_target_node(phys_addr_t start)
+{
+	int nid = meminfo_to_nid(&numa_meminfo, start);
+
+	/*
+	 * Prefer online nodes, but if reserved memory might be
+	 * hot-added continue the search with reserved ranges.
+	 */
+	if (nid != NUMA_NO_NODE)
+		return nid;
+
+	return meminfo_to_nid(&numa_reserved_meminfo, start);
+}
+EXPORT_SYMBOL_GPL(phys_to_target_node);
+
+int memory_add_physaddr_to_nid(u64 start)
+{
+	int nid = meminfo_to_nid(&numa_meminfo, start);
+
+	if (nid == NUMA_NO_NODE)
+		nid = numa_meminfo.blk[0].nid;
 	return nid;
 }
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
diff --git a/include/linux/numa.h b/include/linux/numa.h
index 5773cd2613fc..a42df804679e 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -1,7 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_NUMA_H
 #define _LINUX_NUMA_H
-
+#include <linux/types.h>
 
 #ifdef CONFIG_NODES_SHIFT
 #define NODES_SHIFT     CONFIG_NODES_SHIFT
@@ -21,12 +21,24 @@
 #endif
 
 #ifdef CONFIG_NUMA
+/* Generic implementation available */
 int numa_map_to_online_node(int node);
+
+/*
+ * Optional architecture specific implementation, users need a "depends
+ * on $ARCH"
+ */
+int phys_to_target_node(phys_addr_t addr);
 #else
 static inline int numa_map_to_online_node(int node)
 {
 	return NUMA_NO_NODE;
 }
+
+static inline int phys_to_target_node(phys_addr_t addr)
+{
+	return NUMA_NO_NODE;
+}
 #endif
 
 #endif /* _LINUX_NUMA_H */
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v5 6/6] libnvdimm/e820: Retrieve and populate correct 'target_node' info
  2020-02-16 20:00 ` [PATCH v5 " Dan Williams
                     ` (4 preceding siblings ...)
  2020-02-16 20:01   ` [PATCH v5 5/6] x86/NUMA: Provide a range-to-target_node lookup facility Dan Williams
@ 2020-02-16 20:01   ` Dan Williams
  5 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2020-02-16 20:01 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Andrew Morton, David Hildenbrand, Michal Hocko,
	Christoph Hellwig, Ingo Molnar, linux-kernel, x86

Use the new phys_to_target_node() and numa_map_to_online_node() helpers
to retrieve the correct id for the 'numa_node' ("local" / online
initiator node) and 'target_node' (offline target memory node) sysfs
attributes.

Below is an example from a 4 NUMA node system where all the memory on
node2 is pmem / reserved. It should be noted that with the arrival of
the ACPI HMAT table and EFI Specific Purpose Memory the kernel will
start to see more platforms with reserved / performance differentiated
memory in its own NUMA node. Hence all the stakeholders on the Cc for
what is ostensibly a libnvdimm local patch.

=== Before ===

/* Notice no online memory on node2 at start */

# numactl --hardware
available: 3 nodes (0-1,3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
node 0 size: 3958 MB
node 0 free: 3708 MB
node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
node 1 size: 4027 MB
node 1 free: 3871 MB
node 3 cpus:
node 3 size: 3994 MB
node 3 free: 3971 MB
node distances:
node   0   1   3
  0:  10  21  21
  1:  21  10  21
  3:  21  21  10

/*
 * Put the pmem namespace into devdax mode so it can be assigned to the
 * kmem driver
 */

# ndctl create-namespace -e namespace0.0 -m devdax -f
{
  "dev":"namespace0.0",
  "mode":"devdax",
  "map":"dev",
  "size":"3.94 GiB (4.23 GB)",
  "uuid":"1650af9b-9ba3-4704-acd6-10178399d9a3",
  [..]
}

/* Online Persistent Memory as System RAM */

# daxctl reconfigure-device --mode=system-ram dax0.0
libdaxctl: memblock_in_dev: dax0.0: memory0: Unable to determine phys_index: Success
libdaxctl: memblock_in_dev: dax0.0: memory0: Unable to determine phys_index: Success
libdaxctl: memblock_in_dev: dax0.0: memory0: Unable to determine phys_index: Success
libdaxctl: memblock_in_dev: dax0.0: memory0: Unable to determine phys_index: Success
[
  {
    "chardev":"dax0.0",
    "size":4225761280,
    "target_node":0,
    "mode":"system-ram"
  }
]
reconfigured 1 device

/* Note that the memory is onlined by default to the wrong node, node0 */

# numactl --hardware
available: 3 nodes (0-1,3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
node 0 size: 7926 MB
node 0 free: 7655 MB
node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
node 1 size: 4027 MB
node 1 free: 3871 MB
node 3 cpus:
node 3 size: 3994 MB
node 3 free: 3971 MB
node distances:
node   0   1   3
  0:  10  21  21
  1:  21  10  21
  3:  21  21  10


=== After ===

/* Notice that the "phys_index" error messages are gone */

# daxctl reconfigure-device --mode=system-ram dax0.0
[
  {
    "chardev":"dax0.0",
    "size":4225761280,
    "target_node":2,
    "mode":"system-ram"
  }
]
reconfigured 1 device

/* Notice that node2 is now correctly populated */

# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
node 0 size: 3958 MB
node 0 free: 3793 MB
node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
node 1 size: 4027 MB
node 1 free: 3851 MB
node 2 cpus:
node 2 size: 3968 MB
node 2 free: 3968 MB
node 3 cpus:
node 3 size: 3994 MB
node 3 free: 3908 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/Kconfig      |    1 +
 drivers/nvdimm/e820.c |   18 ++++--------------
 2 files changed, 5 insertions(+), 14 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index beea77046f9b..d4e446daf457 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1664,6 +1664,7 @@ config X86_PMEM_LEGACY
 	depends on PHYS_ADDR_T_64BIT
 	depends on BLK_DEV
 	select X86_PMEM_LEGACY_DEVICE
+	select NUMA_KEEP_MEMINFO if NUMA
 	select LIBNVDIMM
 	help
 	  Treat memory marked using the non-standard e820 type of 12 as used
diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
index e02f60ad6c99..4cd18be9d0e9 100644
--- a/drivers/nvdimm/e820.c
+++ b/drivers/nvdimm/e820.c
@@ -7,6 +7,7 @@
 #include <linux/memory_hotplug.h>
 #include <linux/libnvdimm.h>
 #include <linux/module.h>
+#include <linux/numa.h>
 
 static int e820_pmem_remove(struct platform_device *pdev)
 {
@@ -16,27 +17,16 @@ static int e820_pmem_remove(struct platform_device *pdev)
 	return 0;
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
-static int e820_range_to_nid(resource_size_t addr)
-{
-	return memory_add_physaddr_to_nid(addr);
-}
-#else
-static int e820_range_to_nid(resource_size_t addr)
-{
-	return NUMA_NO_NODE;
-}
-#endif
-
 static int e820_register_one(struct resource *res, void *data)
 {
 	struct nd_region_desc ndr_desc;
 	struct nvdimm_bus *nvdimm_bus = data;
+	int nid = phys_to_target_node(res->start);
 
 	memset(&ndr_desc, 0, sizeof(ndr_desc));
 	ndr_desc.res = res;
-	ndr_desc.numa_node = e820_range_to_nid(res->start);
-	ndr_desc.target_node = ndr_desc.numa_node;
+	ndr_desc.numa_node = numa_map_to_online_node(nid);
+	ndr_desc.target_node = nid;
 	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
 	if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
 		return -ENXIO;
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 4/6] x86/mm: Introduce CONFIG_NUMA_KEEP_MEMINFO
  2020-02-16 20:01   ` [PATCH v5 4/6] x86/mm: Introduce CONFIG_NUMA_KEEP_MEMINFO Dan Williams
@ 2020-02-17  8:21     ` Thomas Gleixner
  0 siblings, 0 replies; 25+ messages in thread
From: Thomas Gleixner @ 2020-02-17  8:21 UTC (permalink / raw)
  To: Dan Williams, linux-nvdimm
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, x86, Andrew Morton, David Hildenbrand,
	Michal Hocko, Ingo Molnar, hch, linux-kernel

eviewed-by: Thomas Gleixner <tglx@linutronix.de>
Dan Williams <dan.j.williams@intel.com> writes:

> Currently x86 numa_meminfo is marked __initdata in the
> CONFIG_MEMORY_HOTPLUG=n case. In support of a new facility to allow
> drivers to map reserved memory to a 'target_node'
> (phys_to_target_node()), add support for removing the __initdata
> designation for those users. Both memory hotplug and
> phys_to_target_node() users select CONFIG_NUMA_KEEP_MEMINFO to tell the
> arch to maintain its physical address to NUMA mapping infrastructure
> post init.
>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: <x86@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 5/6] x86/NUMA: Provide a range-to-target_node lookup facility
  2020-02-16 20:01   ` [PATCH v5 5/6] x86/NUMA: Provide a range-to-target_node lookup facility Dan Williams
@ 2020-02-17  8:22     ` Thomas Gleixner
  0 siblings, 0 replies; 25+ messages in thread
From: Thomas Gleixner @ 2020-02-17  8:22 UTC (permalink / raw)
  To: Dan Williams, linux-nvdimm
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Borislav Petkov,
	H. Peter Anvin, x86, Andrew Morton, David Hildenbrand,
	Michal Hocko, kbuild test robot, Ingo Molnar, hch, linux-kernel

Dan Williams <dan.j.williams@intel.com> writes:

> The DEV_DAX_KMEM facility is a generic mechanism to allow device-dax
> instances, fronting performance-differentiated-memory like pmem, to be
> added to the System RAM pool. The NUMA node for that hot-added memory is
> derived from the device-dax instance's 'target_node' attribute.
>
> Recall that the 'target_node' is the ACPI-PXM-to-node translation for
> memory when it comes online whereas the 'numa_node' attribute of the
> device represents the closest online cpu node.
>
> Presently useful target_node information from the ACPI SRAT is discarded
> with the expectation that "Reserved" memory will never be onlined. Now,
> DEV_DAX_KMEM violates that assumption, there is a need to retain the
> translation. Move, rather than discard, numa_memblk data to a secondary
> array that memory_add_physaddr_to_target_node() may consider at a later
> point in time.
>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: <x86@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Reported-by: kbuild test robot <lkp@intel.com>
> Reviewed-by: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, back to index

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-22  3:04 [PATCH v4 0/6] Memory Hierarchy: Enable target node lookups for reserved memory Dan Williams
2020-01-22  3:04 ` [PATCH v4 1/6] ACPI: NUMA: Up-level "map to online node" functionality Dan Williams
2020-01-22  3:04 ` [PATCH v4 2/6] mm/numa: Skip NUMA_NO_NODE and online nodes in numa_map_to_online_node() Dan Williams
2020-01-22  3:04 ` [PATCH v4 3/6] powerpc/papr_scm: Switch to numa_map_to_online_node() Dan Williams
2020-01-22  3:04 ` [PATCH v4 4/6] x86/mm: Introduce CONFIG_KEEP_NUMA Dan Williams
2020-02-13  9:32   ` Ingo Molnar
2020-02-13 21:19     ` Dan Williams
2020-02-13 11:22   ` Thomas Gleixner
2020-02-13 21:21     ` Dan Williams
2020-01-22  3:05 ` [PATCH v4 5/6] x86/numa: Provide a range-to-target_node lookup facility Dan Williams
2020-02-13  9:36   ` Ingo Molnar
2020-02-13 11:37   ` Thomas Gleixner
2020-02-13 21:40     ` Dan Williams
2020-01-22  3:05 ` [PATCH v4 6/6] libnvdimm/e820: Retrieve and populate correct 'target_node' info Dan Williams
2020-02-13  2:28 ` [PATCH v4 0/6] Memory Hierarchy: Enable target node lookups for reserved memory Dan Williams
2020-02-13  9:37   ` Ingo Molnar
2020-02-16 20:00 ` [PATCH v5 " Dan Williams
2020-02-16 20:00   ` [PATCH v5 1/6] ACPI: NUMA: Up-level "map to online node" functionality Dan Williams
2020-02-16 20:00   ` [PATCH v5 2/6] mm/numa: Skip NUMA_NO_NODE and online nodes in numa_map_to_online_node() Dan Williams
2020-02-16 20:00   ` [PATCH v5 3/6] powerpc/papr_scm: Switch to numa_map_to_online_node() Dan Williams
2020-02-16 20:01   ` [PATCH v5 4/6] x86/mm: Introduce CONFIG_NUMA_KEEP_MEMINFO Dan Williams
2020-02-17  8:21     ` Thomas Gleixner
2020-02-16 20:01   ` [PATCH v5 5/6] x86/NUMA: Provide a range-to-target_node lookup facility Dan Williams
2020-02-17  8:22     ` Thomas Gleixner
2020-02-16 20:01   ` [PATCH v5 6/6] libnvdimm/e820: Retrieve and populate correct 'target_node' info Dan Williams

Linux-NVDIMM Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-nvdimm/0 linux-nvdimm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-nvdimm linux-nvdimm/ https://lore.kernel.org/linux-nvdimm \
		linux-nvdimm@lists.01.org
	public-inbox-index linux-nvdimm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.01.lists.linux-nvdimm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git