linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v12 0/5] arm64, numa: Add numa support for arm64 platforms
@ 2016-02-23  1:58 David Daney
  2016-02-23  1:58 ` [PATCH v12 1/5] efi: ARM/arm64: ignore DT memory nodes instead of removing them David Daney
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: David Daney @ 2016-02-23  1:58 UTC (permalink / raw)
  To: Will Deacon, linux-arm-kernel, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, Robert Richter
  Cc: linux-kernel, David Daney

From: David Daney <david.daney@cavium.com>

v12:

	- Replaced 6 patches from Ard Biesheuvel with new simpler, and
          more correct, single patch, also from Ard.

v11:
	- Dropped cleanup patches for other architectures, they will be
          submitted as a separate set after more testing.

	- Added patch set from Ard Biesheuvel that are needed to make
          the whole thing actually work.  Previously this was a
          separate set.

	- Kconfig and other fixes and simplifications as suggested by Rob Herring.

	- Rearranged, refactored and reordered so that we don't patch
          new files multiple times.

	- Summary:

		o 6 patches from Ard Biesheuvel to allow use of
		  "memory" nodes with efi stub.

		o 2 patches to document and add of_numa.c

		o 1 patch to add arm64 NUMA support.

		o 1 patch to add NUMA balancing support for arm64.

v10:
	- Incorporated review comments from Rob Herring.
	- Moved numa binding and implementation to devicetree core.
	- Added cleanup patch to remove redundant NODE_DATA macro from asm header files
	- Include numa balancing support for arm64 patch in this series.
	- Fix tile build issue reported by the kbuild robot(patch 7)

v9:	- Added cleanup patch to reuse and avoid redefinition of cpumask_of_pcibus
	  as suggested from Will Deacon and Bjorn Helgaas.
	  - Including patch to Make pci-host-generic driver numa aware.
	  - Incorporated comment from Shannon Zhao.

v8:
	- Incorporated review comments of Mark Rutland and Will Deacon.
	- Added pci helper function and macro for numa.

v7:
	- managing numa memory mapping using memblock.
	- Incorporated review comments of Mark Rutland.

v6:
	- defined and implemented the numa dt binding using
	node property proximity and device node distance-map.
	- renamed dt_numa to of_numa

v5:
        - created base verion of numa.c which creates dummy numa without using dt
          on single socket platforms. Then added patches for dt support.
        - Incorporated review comments from Hanjun Guo.

v4:
done changes as per Arnd review comments.

v3:
Added changes to support numa on arm64 based platforms.
Tested these patches on cavium's multinode(2 node topology) platform.
In this patchset, defined and implemented dt bindings for numa mapping
for core and memory using device node property arm,associativity.

v2:
Defined and implemented numa map for memory, cores to node and
proximity distance matrix of nodes.

v1:
Initial patchset to support numa on arm64 platforms.

Note: 1. This patchset is tested for NUMA and without NUMA with dt
        (both with and without NUMA bindings) on thunderx single
        socket and dual socket boards.


Ard Biesheuvel (1):
  efi: ARM/arm64: ignore DT memory nodes instead of removing them

Ganapatrao Kulkarni (4):
  Documentation, dt, numa: dt bindings for NUMA.
  dt, numa: Add NUMA dt binding implementation.
  arm64, numa: Add NUMA support for arm64 platforms.
  arm64, mm, numa: Add NUMA balancing support for arm64.

 Documentation/devicetree/bindings/numa.txt | 272 +++++++++++++++++++
 arch/arm64/Kconfig                         |  27 ++
 arch/arm64/include/asm/mmzone.h            |  12 +
 arch/arm64/include/asm/numa.h              |  45 ++++
 arch/arm64/include/asm/pgtable.h           |  15 ++
 arch/arm64/include/asm/topology.h          |  10 +
 arch/arm64/kernel/pci.c                    |  10 +
 arch/arm64/kernel/setup.c                  |   4 +
 arch/arm64/kernel/smp.c                    |   4 +
 arch/arm64/mm/Makefile                     |   1 +
 arch/arm64/mm/init.c                       |  34 ++-
 arch/arm64/mm/mmu.c                        |   1 +
 arch/arm64/mm/numa.c                       | 403 +++++++++++++++++++++++++++++
 drivers/firmware/efi/arm-init.c            |   8 +
 drivers/firmware/efi/libstub/fdt.c         |  24 +-
 drivers/of/Kconfig                         |   3 +
 drivers/of/Makefile                        |   1 +
 drivers/of/of_numa.c                       | 211 +++++++++++++++
 include/linux/of.h                         |   9 +
 19 files changed, 1066 insertions(+), 28 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/numa.txt
 create mode 100644 arch/arm64/include/asm/mmzone.h
 create mode 100644 arch/arm64/include/asm/numa.h
 create mode 100644 arch/arm64/mm/numa.c
 create mode 100644 drivers/of/of_numa.c

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v12 1/5] efi: ARM/arm64: ignore DT memory nodes instead of removing them
  2016-02-23  1:58 [PATCH v12 0/5] arm64, numa: Add numa support for arm64 platforms David Daney
@ 2016-02-23  1:58 ` David Daney
  2016-02-23 11:58   ` Mark Rutland
  2016-02-23  1:58 ` [PATCH v12 2/5] Documentation, dt, numa: dt bindings for NUMA David Daney
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: David Daney @ 2016-02-23  1:58 UTC (permalink / raw)
  To: Will Deacon, linux-arm-kernel, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, Robert Richter
  Cc: linux-kernel, David Daney

From: Ard Biesheuvel <ard.biesheuvel@linaro.org>

There are two problems with the UEFI stub DT memory node removal
routine:
- it deletes nodes as it traverses the tree, which happens to work
  but is not supported, as deletion invalidates the node iterator;
- deleting memory nodes entirely may discard annotations in the form
  of additional properties on the nodes.

Since the discovery of DT memory nodes occurs strictly before the
UEFI init sequence, we can simply clear the memblock memory table
before parsing the UEFI memory map. This way, it is no longer
necessary to remove the nodes, so we can remove that logic from the
stub as well.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: David Daney <david.daney@cavium.com>
---
 drivers/firmware/efi/arm-init.c    |  8 ++++++++
 drivers/firmware/efi/libstub/fdt.c | 24 +-----------------------
 2 files changed, 9 insertions(+), 23 deletions(-)

diff --git a/drivers/firmware/efi/arm-init.c b/drivers/firmware/efi/arm-init.c
index 9e15d57..40c9d85 100644
--- a/drivers/firmware/efi/arm-init.c
+++ b/drivers/firmware/efi/arm-init.c
@@ -143,6 +143,14 @@ static __init void reserve_regions(void)
 	if (efi_enabled(EFI_DBG))
 		pr_info("Processing EFI memory map:\n");
 
+	/*
+	 * Discard memblocks discovered so far: if there are any at this
+	 * point, they originate from memory nodes in the DT, and UEFI
+	 * uses its own memory map instead.
+	 */
+	memblock_dump_all();
+	memblock_remove(0, ULLONG_MAX);
+
 	for_each_efi_memory_desc(&memmap, md) {
 		paddr = md->phys_addr;
 		npages = md->num_pages;
diff --git a/drivers/firmware/efi/libstub/fdt.c b/drivers/firmware/efi/libstub/fdt.c
index cf7b7d4..9df1560 100644
--- a/drivers/firmware/efi/libstub/fdt.c
+++ b/drivers/firmware/efi/libstub/fdt.c
@@ -24,7 +24,7 @@ efi_status_t update_fdt(efi_system_table_t *sys_table, void *orig_fdt,
 			unsigned long map_size, unsigned long desc_size,
 			u32 desc_ver)
 {
-	int node, prev, num_rsv;
+	int node, num_rsv;
 	int status;
 	u32 fdt_val32;
 	u64 fdt_val64;
@@ -54,28 +54,6 @@ efi_status_t update_fdt(efi_system_table_t *sys_table, void *orig_fdt,
 		goto fdt_set_fail;
 
 	/*
-	 * Delete any memory nodes present. We must delete nodes which
-	 * early_init_dt_scan_memory may try to use.
-	 */
-	prev = 0;
-	for (;;) {
-		const char *type;
-		int len;
-
-		node = fdt_next_node(fdt, prev, NULL);
-		if (node < 0)
-			break;
-
-		type = fdt_getprop(fdt, node, "device_type", &len);
-		if (type && strncmp(type, "memory", len) == 0) {
-			fdt_del_node(fdt, node);
-			continue;
-		}
-
-		prev = node;
-	}
-
-	/*
 	 * Delete all memory reserve map entries. When booting via UEFI,
 	 * kernel will use the UEFI memory map to find reserved regions.
 	 */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v12 2/5] Documentation, dt, numa: dt bindings for NUMA.
  2016-02-23  1:58 [PATCH v12 0/5] arm64, numa: Add numa support for arm64 platforms David Daney
  2016-02-23  1:58 ` [PATCH v12 1/5] efi: ARM/arm64: ignore DT memory nodes instead of removing them David Daney
@ 2016-02-23  1:58 ` David Daney
  2016-02-23  1:58 ` [PATCH v12 3/5] dt, numa: Add NUMA dt binding implementation David Daney
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 30+ messages in thread
From: David Daney @ 2016-02-23  1:58 UTC (permalink / raw)
  To: Will Deacon, linux-arm-kernel, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, Robert Richter
  Cc: linux-kernel, David Daney

From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>

Add DT bindings for numa mapping of memory, CPUs and IOs.

Reviewed-by: Robert Richter <rrichter@cavium.com>
Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
Signed-off-by: David Daney <david.daney@cavium.com>
---
 Documentation/devicetree/bindings/numa.txt | 272 +++++++++++++++++++++++++++++
 1 file changed, 272 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/numa.txt

diff --git a/Documentation/devicetree/bindings/numa.txt b/Documentation/devicetree/bindings/numa.txt
new file mode 100644
index 0000000..ec5ed7c
--- /dev/null
+++ b/Documentation/devicetree/bindings/numa.txt
@@ -0,0 +1,272 @@
+==============================================================================
+NUMA binding description.
+==============================================================================
+
+==============================================================================
+1 - Introduction
+==============================================================================
+
+Systems employing a Non Uniform Memory Access (NUMA) architecture contain
+collections of hardware resources including processors, memory, and I/O buses,
+that comprise what is commonly known as a NUMA node.
+Processor accesses to memory within the local NUMA node is generally faster
+than processor accesses to memory outside of the local NUMA node.
+DT defines interfaces that allow the platform to convey NUMA node
+topology information to OS.
+
+==============================================================================
+2 - numa-node-id
+==============================================================================
+
+For the purpose of identification, each NUMA node is associated with a unique
+token known as a node id. For the purpose of this binding
+a node id is a 32-bit integer.
+
+A device node is associated with a NUMA node by the presence of a
+numa-node-id property which contains the node id of the device.
+
+Example:
+	/* numa node 0 */
+	numa-node-id = <0>;
+
+	/* numa node 1 */
+	numa-node-id = <1>;
+
+==============================================================================
+3 - distance-map
+==============================================================================
+
+The device tree node distance-map describes the relative
+distance (memory latency) between all numa nodes.
+
+- compatible : Should at least contain "numa-distance-map-v1".
+
+- distance-matrix
+  This property defines a matrix to describe the relative distances
+  between all numa nodes.
+  It is represented as a list of node pairs and their relative distance.
+
+  Note:
+	1. Each entry represents distance from first node to second node.
+	The distances are equal in either direction.
+	2. The distance from a node to self (local distance) is represented
+	with value 10 and all internode distance should be represented with
+	a value greater than 10.
+	3. distance-matrix should have entries in lexicographical ascending
+	order of nodes.
+	4. There must be only one device node distance-map which must reside in the root node.
+
+Example:
+	4 nodes connected in mesh/ring topology as below,
+
+		0_______20______1
+		|               |
+		|               |
+		20             20
+		|               |
+		|               |
+		|_______________|
+		3       20      2
+
+	if relative distance for each hop is 20,
+	then internode distance would be,
+	      0 -> 1 = 20
+	      1 -> 2 = 20
+	      2 -> 3 = 20
+	      3 -> 0 = 20
+	      0 -> 2 = 40
+	      1 -> 3 = 40
+
+     and dt presentation for this distance matrix is,
+
+		distance-map {
+			 compatible = "numa-distance-map-v1";
+			 distance-matrix = <0 0  10>,
+					   <0 1  20>,
+					   <0 2  40>,
+					   <0 3  20>,
+					   <1 0  20>,
+					   <1 1  10>,
+					   <1 2  20>,
+					   <1 3  40>,
+					   <2 0  40>,
+					   <2 1  20>,
+					   <2 2  10>,
+					   <2 3  20>,
+					   <3 0  20>,
+					   <3 1  40>,
+					   <3 2  20>,
+					   <3 3  10>;
+		};
+
+==============================================================================
+4 - Example dts
+==============================================================================
+
+Dual socket system consists of 2 boards connected through ccn bus and
+each board having one socket/soc of 8 cpus, memory and pci bus.
+
+	memory@c00000 {
+		device_type = "memory";
+		reg = <0x0 0xc00000 0x0 0x80000000>;
+		/* node 0 */
+		numa-node-id = <0>;
+	};
+
+	memory@10000000000 {
+		device_type = "memory";
+		reg = <0x100 0x0 0x0 0x80000000>;
+		/* node 1 */
+		numa-node-id = <1>;
+	};
+
+	cpus {
+		#address-cells = <2>;
+		#size-cells = <0>;
+
+		cpu@0 {
+			device_type = "cpu";
+			compatible =  "arm,armv8";
+			reg = <0x0 0x0>;
+			enable-method = "psci";
+			/* node 0 */
+			numa-node-id = <0>;
+		};
+		cpu@1 {
+			device_type = "cpu";
+			compatible =  "arm,armv8";
+			reg = <0x0 0x1>;
+			enable-method = "psci";
+			numa-node-id = <0>;
+		};
+		cpu@2 {
+			device_type = "cpu";
+			compatible =  "arm,armv8";
+			reg = <0x0 0x2>;
+			enable-method = "psci";
+			numa-node-id = <0>;
+		};
+		cpu@3 {
+			device_type = "cpu";
+			compatible =  "arm,armv8";
+			reg = <0x0 0x3>;
+			enable-method = "psci";
+			numa-node-id = <0>;
+		};
+		cpu@4 {
+			device_type = "cpu";
+			compatible =  "arm,armv8";
+			reg = <0x0 0x4>;
+			enable-method = "psci";
+			numa-node-id = <0>;
+		};
+		cpu@5 {
+			device_type = "cpu";
+			compatible =  "arm,armv8";
+			reg = <0x0 0x5>;
+			enable-method = "psci";
+			numa-node-id = <0>;
+		};
+		cpu@6 {
+			device_type = "cpu";
+			compatible =  "arm,armv8";
+			reg = <0x0 0x6>;
+			enable-method = "psci";
+			numa-node-id = <0>;
+		};
+		cpu@7 {
+			device_type = "cpu";
+			compatible =  "arm,armv8";
+			reg = <0x0 0x7>;
+			enable-method = "psci";
+			numa-node-id = <0>;
+		};
+		cpu@8 {
+			device_type = "cpu";
+			compatible =  "arm,armv8";
+			reg = <0x0 0x8>;
+			enable-method = "psci";
+			/* node 1 */
+			numa-node-id = <1>;
+		};
+		cpu@9 {
+			device_type = "cpu";
+			compatible =  "arm,armv8";
+			reg = <0x0 0x9>;
+			enable-method = "psci";
+			numa-node-id = <1>;
+		};
+		cpu@a {
+			device_type = "cpu";
+			compatible =  "arm,armv8";
+			reg = <0x0 0xa>;
+			enable-method = "psci";
+			numa-node-id = <1>;
+		};
+		cpu@b {
+			device_type = "cpu";
+			compatible =  "arm,armv8";
+			reg = <0x0 0xb>;
+			enable-method = "psci";
+			numa-node-id = <1>;
+		};
+		cpu@c {
+			device_type = "cpu";
+			compatible =  "arm,armv8";
+			reg = <0x0 0xc>;
+			enable-method = "psci";
+			numa-node-id = <1>;
+		};
+		cpu@d {
+			device_type = "cpu";
+			compatible =  "arm,armv8";
+			reg = <0x0 0xd>;
+			enable-method = "psci";
+			numa-node-id = <1>;
+		};
+		cpu@e {
+			device_type = "cpu";
+			compatible =  "arm,armv8";
+			reg = <0x0 0xe>;
+			enable-method = "psci";
+			numa-node-id = <1>;
+		};
+		cpu@f {
+			device_type = "cpu";
+			compatible =  "arm,armv8";
+			reg = <0x0 0xf>;
+			enable-method = "psci";
+			numa-node-id = <1>;
+		};
+	};
+
+	pcie0: pcie0@848000000000 {
+		compatible = "arm,armv8";
+		device_type = "pci";
+		bus-range = <0 255>;
+		#size-cells = <2>;
+		#address-cells = <3>;
+		reg = <0x8480 0x00000000 0 0x10000000>;  /* Configuration space */
+		ranges = <0x03000000 0x8010 0x00000000 0x8010 0x00000000 0x70 0x00000000>;
+		/* node 0 */
+		numa-node-id = <0>;
+        };
+
+	pcie1: pcie1@948000000000 {
+		compatible = "arm,armv8";
+		device_type = "pci";
+		bus-range = <0 255>;
+		#size-cells = <2>;
+		#address-cells = <3>;
+		reg = <0x9480 0x00000000 0 0x10000000>;  /* Configuration space */
+		ranges = <0x03000000 0x9010 0x00000000 0x9010 0x00000000 0x70 0x00000000>;
+		/* node 1 */
+		numa-node-id = <1>;
+        };
+
+	distance-map {
+		compatible = "numa-distance-map-v1";
+		distance-matrix = <0 0 10>,
+				  <0 1 20>,
+				  <1 1 10>;
+	};
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v12 3/5] dt, numa: Add NUMA dt binding implementation.
  2016-02-23  1:58 [PATCH v12 0/5] arm64, numa: Add numa support for arm64 platforms David Daney
  2016-02-23  1:58 ` [PATCH v12 1/5] efi: ARM/arm64: ignore DT memory nodes instead of removing them David Daney
  2016-02-23  1:58 ` [PATCH v12 2/5] Documentation, dt, numa: dt bindings for NUMA David Daney
@ 2016-02-23  1:58 ` David Daney
  2016-02-29 17:29   ` Robert Richter
  2016-02-23  1:58 ` [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms David Daney
  2016-02-23  1:58 ` [PATCH v12 5/5] arm64, mm, numa: Add NUMA balancing support for arm64 David Daney
  4 siblings, 1 reply; 30+ messages in thread
From: David Daney @ 2016-02-23  1:58 UTC (permalink / raw)
  To: Will Deacon, linux-arm-kernel, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, Robert Richter
  Cc: linux-kernel, David Daney

From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>

ADD device tree node parsing for NUMA topology using device
"numa-node-id" property distance-map.

Reviewed-by: Robert Richter <rrichter@cavium.com>
Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
Signed-off-by: David Daney <david.daney@cavium.com>
---
 drivers/of/Kconfig   |   3 +
 drivers/of/Makefile  |   1 +
 drivers/of/of_numa.c | 211 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/of.h   |   9 +++
 4 files changed, 224 insertions(+)
 create mode 100644 drivers/of/of_numa.c

diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
index e2a4841..b3bec3a 100644
--- a/drivers/of/Kconfig
+++ b/drivers/of/Kconfig
@@ -112,4 +112,7 @@ config OF_OVERLAY
 	  While this option is selected automatically when needed, you can
 	  enable it manually to improve device tree unit test coverage.
 
+config OF_NUMA
+	bool
+
 endif # OF
diff --git a/drivers/of/Makefile b/drivers/of/Makefile
index 156c072..bee3fa9 100644
--- a/drivers/of/Makefile
+++ b/drivers/of/Makefile
@@ -14,5 +14,6 @@ obj-$(CONFIG_OF_MTD)	+= of_mtd.o
 obj-$(CONFIG_OF_RESERVED_MEM) += of_reserved_mem.o
 obj-$(CONFIG_OF_RESOLVE)  += resolver.o
 obj-$(CONFIG_OF_OVERLAY) += overlay.o
+obj-$(CONFIG_OF_NUMA) += of_numa.o
 
 obj-$(CONFIG_OF_UNITTEST) += unittest-data/
diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c
new file mode 100644
index 0000000..a691d06
--- /dev/null
+++ b/drivers/of/of_numa.c
@@ -0,0 +1,211 @@
+/*
+ * OF NUMA Parsing support.
+ *
+ * Copyright (C) 2015 - 2016 Cavium Inc.
+ * Author: Ganapatrao Kulkarni <gkulkarni@cavium.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/of.h>
+#include <linux/of_fdt.h>
+#include <linux/nodemask.h>
+
+#include <asm/numa.h>
+
+/* define default numa node to 0 */
+#define DEFAULT_NODE 0
+
+/* Returns nid in the range [0..MAX_NUMNODES-1],
+ * or NUMA_NO_NODE if no valid numa-node-id entry found
+ * or DEFAULT_NODE if no numa-node-id entry exists
+ */
+static int of_numa_prop_to_nid(const __be32 *of_numa_prop, int length)
+{
+	int nid;
+
+	if (!of_numa_prop)
+		return DEFAULT_NODE;
+
+	if (length != sizeof(*of_numa_prop)) {
+		pr_warn("NUMA: Invalid of_numa_prop length %d found.\n",
+				length);
+		return NUMA_NO_NODE;
+	}
+
+	nid = of_read_number(of_numa_prop, 1);
+	if (nid >= MAX_NUMNODES) {
+		pr_warn("NUMA: Invalid numa node %d found.\n", nid);
+		return NUMA_NO_NODE;
+	}
+
+	return nid;
+}
+
+static int __init early_init_of_node_to_nid(unsigned long node)
+{
+	int length;
+	const __be32 *of_numa_prop;
+
+	of_numa_prop = of_get_flat_dt_prop(node, "numa-node-id", &length);
+
+	return of_numa_prop_to_nid(of_numa_prop, length);
+}
+
+/*
+ * Even though we connect cpus to numa domains later in SMP
+ * init, we need to know the node ids now for all cpus.
+*/
+static int __init early_init_parse_cpu_node(unsigned long node)
+{
+	int nid;
+	const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
+
+	if (type == NULL)
+		return 0;
+
+	if (strcmp(type, "cpu") != 0)
+		return 0;
+
+	nid = early_init_of_node_to_nid(node);
+	if (nid == NUMA_NO_NODE)
+		return -EINVAL;
+
+	node_set(nid, numa_nodes_parsed);
+	return 0;
+}
+
+static int __init early_init_parse_memory_node(unsigned long node)
+{
+	const __be32 *reg, *endp;
+	int length;
+	int nid;
+	const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
+
+	if (type == NULL)
+		return 0;
+
+	if (strcmp(type, "memory") != 0)
+		return 0;
+
+	nid = early_init_of_node_to_nid(node);
+	if (nid == NUMA_NO_NODE)
+		return -EINVAL;
+
+	reg = of_get_flat_dt_prop(node, "reg", &length);
+	endp = reg + (length / sizeof(__be32));
+
+	while ((endp - reg) >= (dt_root_addr_cells + dt_root_size_cells)) {
+		u64 base, size;
+
+		base = dt_mem_next_cell(dt_root_addr_cells, &reg);
+		size = dt_mem_next_cell(dt_root_size_cells, &reg);
+		pr_debug("NUMA:  base = %llx , node = %u\n",
+				base, nid);
+
+		if (numa_add_memblk(nid, base, size) < 0)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int __init early_init_parse_distance_map_v1(unsigned long node,
+		const char *uname)
+{
+	const __be32 *prop_dist_matrix;
+	int length = 0, i, matrix_count;
+	int nr_size_cells = OF_ROOT_NODE_SIZE_CELLS_DEFAULT;
+
+	pr_info("NUMA: parsing numa-distance-map-v1\n");
+
+	prop_dist_matrix =
+		of_get_flat_dt_prop(node, "distance-matrix", &length);
+
+	if (!length) {
+		pr_err("NUMA: failed to parse distance-matrix\n");
+		return -ENODEV;
+	}
+
+	matrix_count = ((length / sizeof(__be32)) / (3 * nr_size_cells));
+
+	if ((matrix_count * sizeof(__be32) * 3 * nr_size_cells) !=  length) {
+		pr_warn("NUMA: invalid distance-matrix length %d\n", length);
+		return -EINVAL;
+	}
+
+	for (i = 0; i < matrix_count; i++) {
+		u32 nodea, nodeb, distance;
+
+		nodea = dt_mem_next_cell(nr_size_cells, &prop_dist_matrix);
+		nodeb = dt_mem_next_cell(nr_size_cells, &prop_dist_matrix);
+		distance = dt_mem_next_cell(nr_size_cells, &prop_dist_matrix);
+		numa_set_distance(nodea, nodeb, distance);
+		pr_debug("NUMA:  distance[node%d -> node%d] = %d\n",
+				nodea, nodeb, distance);
+
+		/* Set default distance of node B->A same as A->B */
+		if (nodeb > nodea)
+			numa_set_distance(nodeb, nodea, distance);
+	}
+
+	return 0;
+}
+
+static int __init early_init_parse_distance_map(unsigned long node,
+		const char *uname)
+{
+	if (strcmp(uname, "distance-map") != 0)
+		return 0;
+
+	if (of_flat_dt_is_compatible(node, "numa-distance-map-v1"))
+		return early_init_parse_distance_map_v1(node, uname);
+
+	pr_err("NUMA: invalid distance-map device node\n");
+	return -EINVAL;
+}
+
+static int __init early_init_of_scan_numa_map(unsigned long node,
+					      const char *uname,
+					      int depth, void *data)
+{
+	int ret;
+
+	ret = early_init_parse_cpu_node(node);
+	if (ret)
+		return ret;
+
+	ret = early_init_parse_memory_node(node);
+	if (ret)
+		return ret;
+
+	return early_init_parse_distance_map(node, uname);
+}
+
+int of_node_to_nid(struct device_node *device)
+{
+	const __be32 *of_numa_prop;
+	int length;
+
+	of_numa_prop = of_get_property(device, "numa-node-id", &length);
+	if (of_numa_prop)
+		return of_numa_prop_to_nid(of_numa_prop, length);
+
+	return NUMA_NO_NODE;
+}
+
+/* DT node mapping is done already early_init_of_scan_memory */
+int __init of_numa_init(void)
+{
+	return of_scan_flat_dt(early_init_of_scan_numa_map, NULL);
+}
diff --git a/include/linux/of.h b/include/linux/of.h
index dc6e396..fe67a4c 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -685,6 +685,15 @@ static inline int of_node_to_nid(struct device_node *device)
 }
 #endif
 
+#ifdef CONFIG_OF_NUMA
+extern int of_numa_init(void);
+#else
+static inline int of_numa_init(void)
+{
+	return -ENOSYS;
+}
+#endif
+
 static inline struct device_node *of_find_matching_node(
 	struct device_node *from,
 	const struct of_device_id *matches)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms.
  2016-02-23  1:58 [PATCH v12 0/5] arm64, numa: Add numa support for arm64 platforms David Daney
                   ` (2 preceding siblings ...)
  2016-02-23  1:58 ` [PATCH v12 3/5] dt, numa: Add NUMA dt binding implementation David Daney
@ 2016-02-23  1:58 ` David Daney
  2016-02-23 10:26   ` Will Deacon
                     ` (3 more replies)
  2016-02-23  1:58 ` [PATCH v12 5/5] arm64, mm, numa: Add NUMA balancing support for arm64 David Daney
  4 siblings, 4 replies; 30+ messages in thread
From: David Daney @ 2016-02-23  1:58 UTC (permalink / raw)
  To: Will Deacon, linux-arm-kernel, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, Robert Richter
  Cc: linux-kernel, David Daney

From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>

Attempt to get the memory and CPU NUMA node via of_numa.  If that
fails, default the dummy NUMA node and map all memory and CPUs to node
0.

Tested-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Robert Richter <rrichter@cavium.com>
Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
Signed-off-by: David Daney <david.daney@cavium.com>
---
 arch/arm64/Kconfig                |  26 +++
 arch/arm64/include/asm/mmzone.h   |  12 ++
 arch/arm64/include/asm/numa.h     |  45 +++++
 arch/arm64/include/asm/topology.h |  10 +
 arch/arm64/kernel/pci.c           |  10 +
 arch/arm64/kernel/setup.c         |   4 +
 arch/arm64/kernel/smp.c           |   4 +
 arch/arm64/mm/Makefile            |   1 +
 arch/arm64/mm/init.c              |  34 +++-
 arch/arm64/mm/mmu.c               |   1 +
 arch/arm64/mm/numa.c              | 403 ++++++++++++++++++++++++++++++++++++++
 11 files changed, 545 insertions(+), 5 deletions(-)
 create mode 100644 arch/arm64/include/asm/mmzone.h
 create mode 100644 arch/arm64/include/asm/numa.h
 create mode 100644 arch/arm64/mm/numa.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a969970..9f0972a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -74,6 +74,7 @@ config ARM64
 	select HAVE_HW_BREAKPOINT if PERF_EVENTS
 	select HAVE_IRQ_TIME_ACCOUNTING
 	select HAVE_MEMBLOCK
+	select HAVE_MEMBLOCK_NODE_MAP if NUMA
 	select HAVE_PATA_PLATFORM
 	select HAVE_PERF_EVENTS
 	select HAVE_PERF_REGS
@@ -96,6 +97,7 @@ config ARM64
 	select SYSCTL_EXCEPTION_TRACE
 	select HAVE_CONTEXT_TRACKING
 	select HAVE_ARM_SMCCC
+	select OF_NUMA if NUMA && OF
 	help
 	  ARM 64-bit (AArch64) Linux support.
 
@@ -545,6 +547,30 @@ config HOTPLUG_CPU
 	  Say Y here to experiment with turning CPUs off and on.  CPUs
 	  can be controlled through /sys/devices/system/cpu.
 
+# Common NUMA Features
+config NUMA
+	bool "Numa Memory Allocation and Scheduler Support"
+	depends on SMP
+	help
+	  Enable NUMA (Non Uniform Memory Access) support.
+
+	  The kernel will try to allocate memory used by a CPU on the
+	  local memory of the CPU and add some more
+	  NUMA awareness to the kernel.
+
+config NODES_SHIFT
+	int "Maximum NUMA Nodes (as a power of 2)"
+	range 1 10
+	default "2"
+	depends on NEED_MULTIPLE_NODES
+	help
+	  Specify the maximum number of NUMA Nodes available on the target
+	  system.  Increases memory reserved to accommodate various tables.
+
+config USE_PERCPU_NUMA_NODE_ID
+	def_bool y
+	depends on NUMA
+
 source kernel/Kconfig.preempt
 source kernel/Kconfig.hz
 
diff --git a/arch/arm64/include/asm/mmzone.h b/arch/arm64/include/asm/mmzone.h
new file mode 100644
index 0000000..a0de9e6
--- /dev/null
+++ b/arch/arm64/include/asm/mmzone.h
@@ -0,0 +1,12 @@
+#ifndef __ASM_MMZONE_H
+#define __ASM_MMZONE_H
+
+#ifdef CONFIG_NUMA
+
+#include <asm/numa.h>
+
+extern struct pglist_data *node_data[];
+#define NODE_DATA(nid)		(node_data[(nid)])
+
+#endif /* CONFIG_NUMA */
+#endif /* __ASM_MMZONE_H */
diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
new file mode 100644
index 0000000..e9b4f29
--- /dev/null
+++ b/arch/arm64/include/asm/numa.h
@@ -0,0 +1,45 @@
+#ifndef __ASM_NUMA_H
+#define __ASM_NUMA_H
+
+#include <asm/topology.h>
+
+#ifdef CONFIG_NUMA
+
+/* currently, arm64 implements flat NUMA topology */
+#define parent_node(node)	(node)
+
+int __node_distance(int from, int to);
+#define node_distance(a, b) __node_distance(a, b)
+
+extern nodemask_t numa_nodes_parsed __initdata;
+
+/* Mappings between node number and cpus on that node. */
+extern cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
+void numa_clear_node(unsigned int cpu);
+
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+const struct cpumask *cpumask_of_node(int node);
+#else
+/* Returns a pointer to the cpumask of CPUs on Node 'node'. */
+static inline const struct cpumask *cpumask_of_node(int node)
+{
+	return node_to_cpumask_map[node];
+}
+#endif
+
+void __init arm64_numa_init(void);
+int __init numa_add_memblk(int nodeid, u64 start, u64 end);
+void __init numa_set_distance(int from, int to, int distance);
+void __init numa_free_distance(void);
+void __init early_map_cpu_to_node(unsigned int cpu, int nid);
+void numa_store_cpu_info(unsigned int cpu);
+
+#else	/* CONFIG_NUMA */
+
+static inline void numa_store_cpu_info(unsigned int cpu) { }
+static inline void arm64_numa_init(void) { }
+static inline void early_map_cpu_to_node(unsigned int cpu, int nid) { }
+
+#endif	/* CONFIG_NUMA */
+
+#endif	/* __ASM_NUMA_H */
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index a3e9d6f..8b57339 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -22,6 +22,16 @@ void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
 
+#ifdef CONFIG_NUMA
+
+struct pci_bus;
+int pcibus_to_node(struct pci_bus *bus);
+#define cpumask_of_pcibus(bus)	(pcibus_to_node(bus) == -1 ?		\
+				 cpu_all_mask :				\
+				 cpumask_of_node(pcibus_to_node(bus)))
+
+#endif /* CONFIG_NUMA */
+
 #include <asm-generic/topology.h>
 
 #endif /* _ASM_ARM_TOPOLOGY_H */
diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
index b3d098b..65e6b7d 100644
--- a/arch/arm64/kernel/pci.c
+++ b/arch/arm64/kernel/pci.c
@@ -76,6 +76,16 @@ int raw_pci_write(unsigned int domain, unsigned int bus,
 	return -ENXIO;
 }
 
+#ifdef CONFIG_NUMA
+
+int pcibus_to_node(struct pci_bus *bus)
+{
+	return dev_to_node(&bus->dev);
+}
+EXPORT_SYMBOL(pcibus_to_node);
+
+#endif
+
 #ifdef CONFIG_ACPI
 /* Root bridge scanning */
 struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 8119479..d9b9761 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -53,6 +53,7 @@
 #include <asm/cpufeature.h>
 #include <asm/cpu_ops.h>
 #include <asm/kasan.h>
+#include <asm/numa.h>
 #include <asm/sections.h>
 #include <asm/setup.h>
 #include <asm/smp_plat.h>
@@ -372,6 +373,9 @@ static int __init topology_init(void)
 {
 	int i;
 
+	for_each_online_node(i)
+		register_one_node(i);
+
 	for_each_possible_cpu(i) {
 		struct cpu *cpu = &per_cpu(cpu_data.cpu, i);
 		cpu->hotpluggable = 1;
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index b1adc51..46c45c8 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -45,6 +45,7 @@
 #include <asm/cputype.h>
 #include <asm/cpu_ops.h>
 #include <asm/mmu_context.h>
+#include <asm/numa.h>
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
 #include <asm/processor.h>
@@ -125,6 +126,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
 static void smp_store_cpu_info(unsigned int cpuid)
 {
 	store_cpu_topology(cpuid);
+	numa_store_cpu_info(cpuid);
 }
 
 /*
@@ -518,6 +520,8 @@ static void __init of_parse_and_init_cpus(void)
 
 		pr_debug("cpu logical map 0x%llx\n", hwid);
 		cpu_logical_map(cpu_count) = hwid;
+
+		early_map_cpu_to_node(cpu_count, of_node_to_nid(dn));
 next:
 		cpu_count++;
 	}
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 57f57fd..54bb209 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -4,6 +4,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
 				   context.o proc.o pageattr.o
 obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
 obj-$(CONFIG_ARM64_PTDUMP)	+= dump.o
+obj-$(CONFIG_NUMA)		+= numa.o
 
 obj-$(CONFIG_KASAN)		+= kasan_init.o
 KASAN_SANITIZE_kasan_init.o	:= n
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index f3b061e..c39c670 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -37,6 +37,7 @@
 
 #include <asm/fixmap.h>
 #include <asm/memory.h>
+#include <asm/numa.h>
 #include <asm/sections.h>
 #include <asm/setup.h>
 #include <asm/sizes.h>
@@ -77,6 +78,21 @@ static phys_addr_t __init max_zone_dma_phys(void)
 	return min(offset + (1ULL << 32), memblock_end_of_DRAM());
 }
 
+#ifdef CONFIG_NUMA
+
+static void __init zone_sizes_init(unsigned long min, unsigned long max)
+{
+	unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
+
+	if (IS_ENABLED(CONFIG_ZONE_DMA))
+		max_zone_pfns[ZONE_DMA] = PFN_DOWN(max_zone_dma_phys());
+	max_zone_pfns[ZONE_NORMAL] = max;
+
+	free_area_init_nodes(max_zone_pfns);
+}
+
+#else
+
 static void __init zone_sizes_init(unsigned long min, unsigned long max)
 {
 	struct memblock_region *reg;
@@ -117,6 +133,8 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
 	free_area_init_node(0, zone_size, min, zhole_size);
 }
 
+#endif /* CONFIG_NUMA */
+
 #ifdef CONFIG_HAVE_ARCH_PFN_VALID
 int pfn_valid(unsigned long pfn)
 {
@@ -133,10 +151,15 @@ static void __init arm64_memory_present(void)
 static void __init arm64_memory_present(void)
 {
 	struct memblock_region *reg;
+	int nid = 0;
 
-	for_each_memblock(memory, reg)
-		memory_present(0, memblock_region_memory_base_pfn(reg),
-			       memblock_region_memory_end_pfn(reg));
+	for_each_memblock(memory, reg) {
+#ifdef CONFIG_NUMA
+		nid = reg->nid;
+#endif
+		memory_present(nid, memblock_region_memory_base_pfn(reg),
+				memblock_region_memory_end_pfn(reg));
+	}
 }
 #endif
 
@@ -181,7 +204,6 @@ void __init arm64_memblock_init(void)
 	dma_contiguous_reserve(arm64_dma_phys_limit);
 
 	memblock_allow_resize();
-	memblock_dump_all();
 }
 
 void __init bootmem_init(void)
@@ -193,6 +215,9 @@ void __init bootmem_init(void)
 
 	early_memtest(min << PAGE_SHIFT, max << PAGE_SHIFT);
 
+	max_pfn = max_low_pfn = max;
+
+	arm64_numa_init();
 	/*
 	 * Sparsemem tries to allocate bootmem in memory_present(), so must be
 	 * done after the fixed reservations.
@@ -203,7 +228,6 @@ void __init bootmem_init(void)
 	zone_sizes_init(min, max);
 
 	high_memory = __va((max << PAGE_SHIFT) - 1) + 1;
-	max_pfn = max_low_pfn = max;
 }
 
 #ifndef CONFIG_SPARSEMEM_VMEMMAP
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 58faeaa..44e3854 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -463,6 +463,7 @@ void __init paging_init(void)
 	zero_page = early_alloc(PAGE_SIZE);
 
 	bootmem_init();
+	memblock_dump_all();
 
 	empty_zero_page = virt_to_page(zero_page);
 
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
new file mode 100644
index 0000000..604e886
--- /dev/null
+++ b/arch/arm64/mm/numa.c
@@ -0,0 +1,403 @@
+/*
+ * NUMA support, based on the x86 implementation.
+ *
+ * Copyright (C) 2015 Cavium Inc.
+ * Author: Ganapatrao Kulkarni <gkulkarni@cavium.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/bootmem.h>
+#include <linux/memblock.h>
+#include <linux/module.h>
+#include <linux/of.h>
+
+struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
+EXPORT_SYMBOL(node_data);
+nodemask_t numa_nodes_parsed __initdata;
+static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
+
+static int numa_off;
+static int numa_distance_cnt;
+static u8 *numa_distance;
+
+static __init int numa_parse_early_param(char *opt)
+{
+	if (!opt)
+		return -EINVAL;
+	if (!strncmp(opt, "off", 3)) {
+		pr_info("%s\n", "NUMA turned off");
+		numa_off = 1;
+	}
+	return 0;
+}
+early_param("numa", numa_parse_early_param);
+
+cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
+EXPORT_SYMBOL(node_to_cpumask_map);
+
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+
+/*
+ * Returns a pointer to the bitmask of CPUs on Node 'node'.
+ */
+const struct cpumask *cpumask_of_node(int node)
+{
+	if (WARN_ON(node >= nr_node_ids))
+		return cpu_none_mask;
+
+	if (WARN_ON(node_to_cpumask_map[node] == NULL))
+		return cpu_online_mask;
+
+	return node_to_cpumask_map[node];
+}
+EXPORT_SYMBOL(cpumask_of_node);
+
+#endif
+
+static void map_cpu_to_node(unsigned int cpu, int nid)
+{
+	set_cpu_numa_node(cpu, nid);
+	if (nid >= 0)
+		cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
+}
+
+static void unmap_cpu_to_node(unsigned int cpu)
+{
+	int nid = cpu_to_node(cpu);
+
+	if (nid >= 0)
+		cpumask_clear_cpu(cpu, node_to_cpumask_map[nid]);
+	set_cpu_numa_node(cpu, NUMA_NO_NODE);
+}
+
+void numa_clear_node(unsigned int cpu)
+{
+	unmap_cpu_to_node(cpu);
+}
+
+/*
+ * Allocate node_to_cpumask_map based on number of available nodes
+ * Requires node_possible_map to be valid.
+ *
+ * Note: cpumask_of_node() is not valid until after this is done.
+ * (Use CONFIG_DEBUG_PER_CPU_MAPS to check this.)
+ */
+static void __init setup_node_to_cpumask_map(void)
+{
+	unsigned int cpu;
+	int node;
+
+	/* setup nr_node_ids if not done yet */
+	if (nr_node_ids == MAX_NUMNODES)
+		setup_nr_node_ids();
+
+	/* allocate and clear the mapping */
+	for (node = 0; node < nr_node_ids; node++) {
+		alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]);
+		cpumask_clear(node_to_cpumask_map[node]);
+	}
+
+	for_each_possible_cpu(cpu)
+		set_cpu_numa_node(cpu, NUMA_NO_NODE);
+
+	/* cpumask_of_node() will now work */
+	pr_debug("NUMA: Node to cpumask map for %d nodes\n", nr_node_ids);
+}
+
+/*
+ *  Set the cpu to node and mem mapping
+ */
+void numa_store_cpu_info(unsigned int cpu)
+{
+	map_cpu_to_node(cpu, numa_off ? 0 : cpu_to_node_map[cpu]);
+}
+
+void __init early_map_cpu_to_node(unsigned int cpu, int nid)
+{
+	/* fallback to node 0 */
+	if (nid < 0 || nid >= MAX_NUMNODES)
+		nid = 0;
+
+	cpu_to_node_map[cpu] = nid;
+}
+
+/**
+ * numa_add_memblk - Set node id to memblk
+ * @nid: NUMA node ID of the new memblk
+ * @start: Start address of the new memblk
+ * @size:  Size of the new memblk
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int __init numa_add_memblk(int nid, u64 start, u64 size)
+{
+	int ret;
+
+	ret = memblock_set_node(start, size, &memblock.memory, nid);
+	if (ret < 0) {
+		pr_err("NUMA: memblock [0x%llx - 0x%llx] failed to add on node %d\n",
+			start, (start + size - 1), nid);
+		return ret;
+	}
+
+	node_set(nid, numa_nodes_parsed);
+	pr_info("NUMA: Adding memblock [0x%llx - 0x%llx] on node %d\n",
+			start, (start + size - 1), nid);
+	return ret;
+}
+EXPORT_SYMBOL(numa_add_memblk);
+
+/**
+ * Initialize NODE_DATA for a node on the local memory
+ */
+static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
+{
+	const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
+	u64 nd_pa;
+	void *nd;
+	int tnid;
+
+	pr_info("NUMA: Initmem setup node %d [mem %#010Lx-%#010Lx]\n",
+			nid, start_pfn << PAGE_SHIFT,
+			(end_pfn << PAGE_SHIFT) - 1);
+
+	nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
+	nd = __va(nd_pa);
+
+	/* report and initialize */
+	pr_info("NUMA: NODE_DATA [mem %#010Lx-%#010Lx]\n",
+		nd_pa, nd_pa + nd_size - 1);
+	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
+	if (tnid != nid)
+		pr_info("NUMA: NODE_DATA(%d) on node %d\n", nid, tnid);
+
+	node_data[nid] = nd;
+	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+	NODE_DATA(nid)->node_id = nid;
+	NODE_DATA(nid)->node_start_pfn = start_pfn;
+	NODE_DATA(nid)->node_spanned_pages = end_pfn - start_pfn;
+}
+
+/**
+ * numa_free_distance
+ *
+ * The current table is freed.
+ */
+void __init numa_free_distance(void)
+{
+	size_t size;
+
+	if (!numa_distance)
+		return;
+
+	size = numa_distance_cnt * numa_distance_cnt *
+		sizeof(numa_distance[0]);
+
+	memblock_free(__pa(numa_distance), size);
+	numa_distance_cnt = 0;
+	numa_distance = NULL;
+}
+
+/**
+ *
+ * Create a new NUMA distance table.
+ *
+ */
+static int __init numa_alloc_distance(void)
+{
+	size_t size;
+	u64 phys;
+	int i, j;
+
+	size = nr_node_ids * nr_node_ids * sizeof(numa_distance[0]);
+	phys = memblock_find_in_range(0, PFN_PHYS(max_pfn),
+				      size, PAGE_SIZE);
+	if (WARN_ON(!phys))
+		return -ENOMEM;
+
+	memblock_reserve(phys, size);
+
+	numa_distance = __va(phys);
+	numa_distance_cnt = nr_node_ids;
+
+	/* fill with the default distances */
+	for (i = 0; i < numa_distance_cnt; i++)
+		for (j = 0; j < numa_distance_cnt; j++)
+			numa_distance[i * numa_distance_cnt + j] = i == j ?
+				LOCAL_DISTANCE : REMOTE_DISTANCE;
+
+	pr_debug("NUMA: Initialized distance table, cnt=%d\n",
+			numa_distance_cnt);
+
+	return 0;
+}
+
+/**
+ * numa_set_distance - Set inter node NUMA distance from node to node.
+ * @from: the 'from' node to set distance
+ * @to: the 'to'  node to set distance
+ * @distance: NUMA distance
+ *
+ * Set the distance from node @from to @to to @distance.
+ * If distance table doesn't exist, a warning is printed.
+ *
+ * If @from or @to is higher than the highest known node or lower than zero
+ * or @distance doesn't make sense, the call is ignored.
+ *
+ */
+void __init numa_set_distance(int from, int to, int distance)
+{
+	if (!numa_distance) {
+		pr_warn_once("NUMA: Warning: distance table not allocated yet\n");
+		return;
+	}
+
+	if (from >= numa_distance_cnt || to >= numa_distance_cnt ||
+			from < 0 || to < 0) {
+		pr_warn_once("NUMA: Warning: node ids are out of bound, from=%d to=%d distance=%d\n",
+			    from, to, distance);
+		return;
+	}
+
+	if ((u8)distance != distance ||
+	    (from == to && distance != LOCAL_DISTANCE)) {
+		pr_warn_once("NUMA: Warning: invalid distance parameter, from=%d to=%d distance=%d\n",
+			     from, to, distance);
+		return;
+	}
+
+	numa_distance[from * numa_distance_cnt + to] = distance;
+}
+EXPORT_SYMBOL(numa_set_distance);
+
+/**
+ * Return NUMA distance @from to @to
+ */
+int __node_distance(int from, int to)
+{
+	if (from >= numa_distance_cnt || to >= numa_distance_cnt)
+		return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
+	return numa_distance[from * numa_distance_cnt + to];
+}
+EXPORT_SYMBOL(__node_distance);
+
+static int __init numa_register_nodes(void)
+{
+	int nid;
+	struct memblock_region *mblk;
+
+	/* Check that valid nid is set to memblks */
+	for_each_memblock(memory, mblk)
+		if (mblk->nid == NUMA_NO_NODE || mblk->nid >= MAX_NUMNODES) {
+			pr_warn("NUMA: Warning: invalid memblk node %d [mem %#010Lx-%#010Lx]\n",
+				mblk->nid, mblk->base,
+				mblk->base + mblk->size - 1);
+			return -EINVAL;
+		}
+
+	/* Finally register nodes. */
+	for_each_node_mask(nid, numa_nodes_parsed) {
+		unsigned long start_pfn, end_pfn;
+
+		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
+		setup_node_data(nid, start_pfn, end_pfn);
+		node_set_online(nid);
+	}
+
+	/* Setup online nodes to actual nodes*/
+	node_possible_map = numa_nodes_parsed;
+
+	return 0;
+}
+
+static int __init numa_init(int (*init_func)(void))
+{
+	int ret;
+
+	nodes_clear(numa_nodes_parsed);
+	nodes_clear(node_possible_map);
+	nodes_clear(node_online_map);
+	numa_free_distance();
+
+	ret = numa_alloc_distance();
+	if (ret < 0)
+		return ret;
+
+	ret = init_func();
+	if (ret < 0)
+		return ret;
+
+	if (nodes_empty(numa_nodes_parsed))
+		return -EINVAL;
+
+	ret = numa_register_nodes();
+	if (ret < 0)
+		return ret;
+
+	setup_node_to_cpumask_map();
+
+	/* init boot processor */
+	cpu_to_node_map[0] = 0;
+	map_cpu_to_node(0, 0);
+
+	return 0;
+}
+
+/**
+ * dummy_numa_init - Fallback dummy NUMA init
+ *
+ * Used if there's no underlying NUMA architecture, NUMA initialization
+ * fails, or NUMA is disabled on the command line.
+ *
+ * Must online at least one node (node 0) and add memory blocks that cover all
+ * allowed memory. It is unlikely that this function fails.
+ */
+static int __init dummy_numa_init(void)
+{
+	int ret;
+	struct memblock_region *mblk;
+
+	pr_info("%s\n", "No NUMA configuration found");
+	pr_info("NUMA: Faking a node at [mem %#018Lx-%#018Lx]\n",
+	       0LLU, PFN_PHYS(max_pfn) - 1);
+
+	for_each_memblock(memory, mblk) {
+		ret = numa_add_memblk(0, mblk->base, mblk->size);
+		if (!ret)
+			continue;
+
+		pr_err("NUMA init failed\n");
+		return ret;
+	}
+
+	numa_off = 1;
+	return 0;
+}
+
+/**
+ * arm64_numa_init - Initialize NUMA
+ *
+ * Try each configured NUMA initialization method until one succeeds.  The
+ * last fallback is dummy single node config encomapssing whole memory.
+ */
+void __init arm64_numa_init(void)
+{
+	if (!numa_off) {
+		if (!numa_init(of_numa_init))
+			return;
+	}
+
+	numa_init(dummy_numa_init);
+}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v12 5/5] arm64, mm, numa: Add NUMA balancing support for arm64.
  2016-02-23  1:58 [PATCH v12 0/5] arm64, numa: Add numa support for arm64 platforms David Daney
                   ` (3 preceding siblings ...)
  2016-02-23  1:58 ` [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms David Daney
@ 2016-02-23  1:58 ` David Daney
  2016-02-26 18:53   ` Will Deacon
  4 siblings, 1 reply; 30+ messages in thread
From: David Daney @ 2016-02-23  1:58 UTC (permalink / raw)
  To: Will Deacon, linux-arm-kernel, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, Robert Richter
  Cc: linux-kernel, David Daney

From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>

Enable NUMA balancing for arm64 platforms.
Add pte, pmd protnone helpers for use by automatic NUMA balancing.

Reviewed-by: Robert Richter <rrichter@cavium.com>
Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
Signed-off-by: David Daney <david.daney@cavium.com>
---
 arch/arm64/Kconfig               |  1 +
 arch/arm64/include/asm/pgtable.h | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9f0972a..6e22503 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -11,6 +11,7 @@ config ARM64
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select ARCH_SUPPORTS_ATOMIC_RMW
+	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_WANT_OPTIONAL_GPIOLIB
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION
 	select ARCH_WANT_FRAME_POINTERS
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index bf464de..5af9db2 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -346,6 +346,21 @@ static inline pgprot_t mk_sect_prot(pgprot_t prot)
 	return __pgprot(pgprot_val(prot) & ~PTE_TABLE_BIT);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * See the comment in include/asm-generic/pgtable.h
+ */
+static inline int pte_protnone(pte_t pte)
+{
+	return (pte_val(pte) & (PTE_VALID | PTE_PROT_NONE)) == PTE_PROT_NONE;
+}
+
+static inline int pmd_protnone(pmd_t pmd)
+{
+	return pte_protnone(pmd_pte(pmd));
+}
+#endif
+
 /*
  * THP definitions.
  */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms.
  2016-02-23  1:58 ` [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms David Daney
@ 2016-02-23 10:26   ` Will Deacon
  2016-02-23 17:34     ` David Daney
  2016-02-23 17:39   ` David Daney
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 30+ messages in thread
From: Will Deacon @ 2016-02-23 10:26 UTC (permalink / raw)
  To: David Daney
  Cc: linux-arm-kernel, Rob Herring, Pawel Moll, Mark Rutland,
	Ian Campbell, Kumar Gala, devicetree, Ard Biesheuvel,
	Frank Rowand, Grant Likely, Catalin Marinas, Matt Fleming,
	linux-efi, Ganapatrao Kulkarni, Robert Richter, linux-kernel,
	David Daney

On Mon, Feb 22, 2016 at 05:58:22PM -0800, David Daney wrote:
> From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
> 
> Attempt to get the memory and CPU NUMA node via of_numa.  If that
> fails, default the dummy NUMA node and map all memory and CPUs to node
> 0.
> 
> Tested-by: Shannon Zhao <shannon.zhao@linaro.org>
> Reviewed-by: Robert Richter <rrichter@cavium.com>
> Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
> Signed-off-by: David Daney <david.daney@cavium.com>
> ---
>  arch/arm64/Kconfig                |  26 +++
>  arch/arm64/include/asm/mmzone.h   |  12 ++
>  arch/arm64/include/asm/numa.h     |  45 +++++
>  arch/arm64/include/asm/topology.h |  10 +
>  arch/arm64/kernel/pci.c           |  10 +
>  arch/arm64/kernel/setup.c         |   4 +
>  arch/arm64/kernel/smp.c           |   4 +
>  arch/arm64/mm/Makefile            |   1 +
>  arch/arm64/mm/init.c              |  34 +++-
>  arch/arm64/mm/mmu.c               |   1 +
>  arch/arm64/mm/numa.c              | 403 ++++++++++++++++++++++++++++++++++++++
>  11 files changed, 545 insertions(+), 5 deletions(-)
>  create mode 100644 arch/arm64/include/asm/mmzone.h
>  create mode 100644 arch/arm64/include/asm/numa.h
>  create mode 100644 arch/arm64/mm/numa.c

[...]

> diff --git a/arch/arm64/include/asm/mmzone.h b/arch/arm64/include/asm/mmzone.h
> new file mode 100644
> index 0000000..a0de9e6
> --- /dev/null
> +++ b/arch/arm64/include/asm/mmzone.h
> @@ -0,0 +1,12 @@
> +#ifndef __ASM_MMZONE_H
> +#define __ASM_MMZONE_H
> +
> +#ifdef CONFIG_NUMA
> +
> +#include <asm/numa.h>
> +
> +extern struct pglist_data *node_data[];
> +#define NODE_DATA(nid)		(node_data[(nid)])
> +
> +#endif /* CONFIG_NUMA */
> +#endif /* __ASM_MMZONE_H */

What happened to the patch cleaning this up in generic code?

Will

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 1/5] efi: ARM/arm64: ignore DT memory nodes instead of removing them
  2016-02-23  1:58 ` [PATCH v12 1/5] efi: ARM/arm64: ignore DT memory nodes instead of removing them David Daney
@ 2016-02-23 11:58   ` Mark Rutland
  2016-02-23 12:16     ` Will Deacon
                       ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Mark Rutland @ 2016-02-23 11:58 UTC (permalink / raw)
  To: David Daney, Ard Biesheuvel, Rob Herring
  Cc: Will Deacon, linux-arm-kernel, Pawel Moll, Ian Campbell,
	Kumar Gala, devicetree, Frank Rowand, Grant Likely,
	Catalin Marinas, Matt Fleming, linux-efi, Ganapatrao Kulkarni,
	Robert Richter, linux-kernel, David Daney

Hi,

On Mon, Feb 22, 2016 at 05:58:19PM -0800, David Daney wrote:
> From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> 
> There are two problems with the UEFI stub DT memory node removal
> routine:
> - it deletes nodes as it traverses the tree, which happens to work
>   but is not supported, as deletion invalidates the node iterator;
> - deleting memory nodes entirely may discard annotations in the form
>   of additional properties on the nodes.
> 
> Since the discovery of DT memory nodes occurs strictly before the
> UEFI init sequence, we can simply clear the memblock memory table
> before parsing the UEFI memory map. This way, it is no longer
> necessary to remove the nodes, so we can remove that logic from the
> stub as well.

This is a little bit scary, but I guess this works.

My only concern is that when we get kexec, a subsequent kernel must also
have EFI memory map support, or things go bad for the next EFI-aware
kernel after that (as things like the runtime services may have been
corrupted by the kernel in the middle). It's difficult to fix the
general case later.

A different option would be to support status="disabled" for the memory
nodes, and ignore these in early_init_dt_scan_memory. That way a kernel
cannot use memory without first having parsed the EFI memory map, and we
can still get NUMA info from the disabled nodes.

You'd still need a new kernel to take into account status, but at least
we'd know all kernels would avoid using RAM that potentially needs to be
preserved.

Ard, Rob, thoughts?

Mark.

> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Signed-off-by: David Daney <david.daney@cavium.com>
> ---
>  drivers/firmware/efi/arm-init.c    |  8 ++++++++
>  drivers/firmware/efi/libstub/fdt.c | 24 +-----------------------
>  2 files changed, 9 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/firmware/efi/arm-init.c b/drivers/firmware/efi/arm-init.c
> index 9e15d57..40c9d85 100644
> --- a/drivers/firmware/efi/arm-init.c
> +++ b/drivers/firmware/efi/arm-init.c
> @@ -143,6 +143,14 @@ static __init void reserve_regions(void)
>  	if (efi_enabled(EFI_DBG))
>  		pr_info("Processing EFI memory map:\n");
>  
> +	/*
> +	 * Discard memblocks discovered so far: if there are any at this
> +	 * point, they originate from memory nodes in the DT, and UEFI
> +	 * uses its own memory map instead.
> +	 */
> +	memblock_dump_all();
> +	memblock_remove(0, ULLONG_MAX);
> +
>  	for_each_efi_memory_desc(&memmap, md) {
>  		paddr = md->phys_addr;
>  		npages = md->num_pages;
> diff --git a/drivers/firmware/efi/libstub/fdt.c b/drivers/firmware/efi/libstub/fdt.c
> index cf7b7d4..9df1560 100644
> --- a/drivers/firmware/efi/libstub/fdt.c
> +++ b/drivers/firmware/efi/libstub/fdt.c
> @@ -24,7 +24,7 @@ efi_status_t update_fdt(efi_system_table_t *sys_table, void *orig_fdt,
>  			unsigned long map_size, unsigned long desc_size,
>  			u32 desc_ver)
>  {
> -	int node, prev, num_rsv;
> +	int node, num_rsv;
>  	int status;
>  	u32 fdt_val32;
>  	u64 fdt_val64;
> @@ -54,28 +54,6 @@ efi_status_t update_fdt(efi_system_table_t *sys_table, void *orig_fdt,
>  		goto fdt_set_fail;
>  
>  	/*
> -	 * Delete any memory nodes present. We must delete nodes which
> -	 * early_init_dt_scan_memory may try to use.
> -	 */
> -	prev = 0;
> -	for (;;) {
> -		const char *type;
> -		int len;
> -
> -		node = fdt_next_node(fdt, prev, NULL);
> -		if (node < 0)
> -			break;
> -
> -		type = fdt_getprop(fdt, node, "device_type", &len);
> -		if (type && strncmp(type, "memory", len) == 0) {
> -			fdt_del_node(fdt, node);
> -			continue;
> -		}
> -
> -		prev = node;
> -	}
> -
> -	/*
>  	 * Delete all memory reserve map entries. When booting via UEFI,
>  	 * kernel will use the UEFI memory map to find reserved regions.
>  	 */
> -- 
> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 1/5] efi: ARM/arm64: ignore DT memory nodes instead of removing them
  2016-02-23 11:58   ` Mark Rutland
@ 2016-02-23 12:16     ` Will Deacon
  2016-02-23 12:20       ` Ard Biesheuvel
  2016-02-23 22:12     ` Rob Herring
  2016-02-24 19:03     ` Frank Rowand
  2 siblings, 1 reply; 30+ messages in thread
From: Will Deacon @ 2016-02-23 12:16 UTC (permalink / raw)
  To: Mark Rutland
  Cc: David Daney, Ard Biesheuvel, Rob Herring, linux-arm-kernel,
	Pawel Moll, Ian Campbell, Kumar Gala, devicetree, Frank Rowand,
	Grant Likely, Catalin Marinas, Matt Fleming, linux-efi,
	Ganapatrao Kulkarni, Robert Richter, linux-kernel, David Daney

On Tue, Feb 23, 2016 at 11:58:05AM +0000, Mark Rutland wrote:
> On Mon, Feb 22, 2016 at 05:58:19PM -0800, David Daney wrote:
> > From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > 
> > There are two problems with the UEFI stub DT memory node removal
> > routine:
> > - it deletes nodes as it traverses the tree, which happens to work
> >   but is not supported, as deletion invalidates the node iterator;
> > - deleting memory nodes entirely may discard annotations in the form
> >   of additional properties on the nodes.
> > 
> > Since the discovery of DT memory nodes occurs strictly before the
> > UEFI init sequence, we can simply clear the memblock memory table
> > before parsing the UEFI memory map. This way, it is no longer
> > necessary to remove the nodes, so we can remove that logic from the
> > stub as well.
> 
> This is a little bit scary, but I guess this works.
> 
> My only concern is that when we get kexec, a subsequent kernel must also
> have EFI memory map support, or things go bad for the next EFI-aware
> kernel after that (as things like the runtime services may have been
> corrupted by the kernel in the middle). It's difficult to fix the
> general case later.
> 
> A different option would be to support status="disabled" for the memory
> nodes, and ignore these in early_init_dt_scan_memory. That way a kernel
> cannot use memory without first having parsed the EFI memory map, and we
> can still get NUMA info from the disabled nodes.

So in that case, the middle, non-EFI kernel would fail to boot?
Realistically, once you've kexec'd a non-EFI payload, I don't think you
can rely on the EFI state remaining intact for future EFI applications.

Is this really something we should be trying to police in the kernel?

Will

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 1/5] efi: ARM/arm64: ignore DT memory nodes instead of removing them
  2016-02-23 12:16     ` Will Deacon
@ 2016-02-23 12:20       ` Ard Biesheuvel
  0 siblings, 0 replies; 30+ messages in thread
From: Ard Biesheuvel @ 2016-02-23 12:20 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mark Rutland, David Daney, Rob Herring, linux-arm-kernel,
	Pawel Moll, Ian Campbell, Kumar Gala, devicetree, Frank Rowand,
	Grant Likely, Catalin Marinas, Matt Fleming, linux-efi,
	Ganapatrao Kulkarni, Robert Richter, linux-kernel, David Daney

On 23 February 2016 at 13:16, Will Deacon <will.deacon@arm.com> wrote:
> On Tue, Feb 23, 2016 at 11:58:05AM +0000, Mark Rutland wrote:
>> On Mon, Feb 22, 2016 at 05:58:19PM -0800, David Daney wrote:
>> > From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> >
>> > There are two problems with the UEFI stub DT memory node removal
>> > routine:
>> > - it deletes nodes as it traverses the tree, which happens to work
>> >   but is not supported, as deletion invalidates the node iterator;
>> > - deleting memory nodes entirely may discard annotations in the form
>> >   of additional properties on the nodes.
>> >
>> > Since the discovery of DT memory nodes occurs strictly before the
>> > UEFI init sequence, we can simply clear the memblock memory table
>> > before parsing the UEFI memory map. This way, it is no longer
>> > necessary to remove the nodes, so we can remove that logic from the
>> > stub as well.
>>
>> This is a little bit scary, but I guess this works.
>>
>> My only concern is that when we get kexec, a subsequent kernel must also
>> have EFI memory map support, or things go bad for the next EFI-aware
>> kernel after that (as things like the runtime services may have been
>> corrupted by the kernel in the middle). It's difficult to fix the
>> general case later.
>>
>> A different option would be to support status="disabled" for the memory
>> nodes, and ignore these in early_init_dt_scan_memory. That way a kernel
>> cannot use memory without first having parsed the EFI memory map, and we
>> can still get NUMA info from the disabled nodes.
>
> So in that case, the middle, non-EFI kernel would fail to boot?
> Realistically, once you've kexec'd a non-EFI payload, I don't think you
> can rely on the EFI state remaining intact for future EFI applications.
>
> Is this really something we should be trying to police in the kernel?
>

Well, we could add entries to /reserved-memory in the stub for all the
regions UEFI cares about, that would probably be sufficient to fix
this case.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms.
  2016-02-23 10:26   ` Will Deacon
@ 2016-02-23 17:34     ` David Daney
  0 siblings, 0 replies; 30+ messages in thread
From: David Daney @ 2016-02-23 17:34 UTC (permalink / raw)
  To: Will Deacon
  Cc: David Daney, linux-arm-kernel, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, Robert Richter,
	linux-kernel, David Daney

On 02/23/2016 02:26 AM, Will Deacon wrote:
> On Mon, Feb 22, 2016 at 05:58:22PM -0800, David Daney wrote:
>> From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
>>
>> Attempt to get the memory and CPU NUMA node via of_numa.  If that
>> fails, default the dummy NUMA node and map all memory and CPUs to node
>> 0.
>>
>> Tested-by: Shannon Zhao <shannon.zhao@linaro.org>
>> Reviewed-by: Robert Richter <rrichter@cavium.com>
>> Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
>> Signed-off-by: David Daney <david.daney@cavium.com>
>> ---
>>   arch/arm64/Kconfig                |  26 +++
>>   arch/arm64/include/asm/mmzone.h   |  12 ++
>>   arch/arm64/include/asm/numa.h     |  45 +++++
>>   arch/arm64/include/asm/topology.h |  10 +
>>   arch/arm64/kernel/pci.c           |  10 +
>>   arch/arm64/kernel/setup.c         |   4 +
>>   arch/arm64/kernel/smp.c           |   4 +
>>   arch/arm64/mm/Makefile            |   1 +
>>   arch/arm64/mm/init.c              |  34 +++-
>>   arch/arm64/mm/mmu.c               |   1 +
>>   arch/arm64/mm/numa.c              | 403 ++++++++++++++++++++++++++++++++++++++
>>   11 files changed, 545 insertions(+), 5 deletions(-)
>>   create mode 100644 arch/arm64/include/asm/mmzone.h
>>   create mode 100644 arch/arm64/include/asm/numa.h
>>   create mode 100644 arch/arm64/mm/numa.c
>
> [...]
>
>> diff --git a/arch/arm64/include/asm/mmzone.h b/arch/arm64/include/asm/mmzone.h
>> new file mode 100644
>> index 0000000..a0de9e6
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/mmzone.h
>> @@ -0,0 +1,12 @@
>> +#ifndef __ASM_MMZONE_H
>> +#define __ASM_MMZONE_H
>> +
>> +#ifdef CONFIG_NUMA
>> +
>> +#include <asm/numa.h>
>> +
>> +extern struct pglist_data *node_data[];
>> +#define NODE_DATA(nid)		(node_data[(nid)])
>> +
>> +#endif /* CONFIG_NUMA */
>> +#endif /* __ASM_MMZONE_H */
>
> What happened to the patch cleaning this up in generic code?

As per 0/5 :

    v11:
	- Dropped cleanup patches for other architectures, they will be
           submitted as a separate set after more testing.

The "cleanup patches" were not building on x86, which is a bit of a
problem.  I believe I have corrected that issue, but would like to do
a little more build testing on all the other architectures that are
effected.

I probably should have supplied more detail as to my plan of attack
for these.  To make review more manageable, I would like to consider
the arm64 patches separately.  The parties interested in getting arm64
NUMA working can proceed without waiting for the cleanups to be fully
tested and acknowledged.  I am working on testing the cleanups on each
architecture touched, and will resubmit them as a separate patch set
this week.  I think splitting them like this will streamline the
process of getting to the eventual goal of having both arm64 NUMA and
the cleanups merged.

David.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms.
  2016-02-23  1:58 ` [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms David Daney
  2016-02-23 10:26   ` Will Deacon
@ 2016-02-23 17:39   ` David Daney
  2016-02-26 18:53   ` Will Deacon
  2016-02-29 17:34   ` Robert Richter
  3 siblings, 0 replies; 30+ messages in thread
From: David Daney @ 2016-02-23 17:39 UTC (permalink / raw)
  To: David Daney, Bjorn Helgaas
  Cc: Will Deacon, linux-arm-kernel, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, Robert Richter,
	linux-kernel, David Daney, linux-pci

+ Bjorn Helgaas

Bjorn,

I Inadvertently omitted you from the list of recipients.  There is an 
arm64 PCI bit in here that may be of interest...

On 02/22/2016 05:58 PM, David Daney wrote:
> From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
>
> Attempt to get the memory and CPU NUMA node via of_numa.  If that
> fails, default the dummy NUMA node and map all memory and CPUs to node
> 0.
>
> Tested-by: Shannon Zhao <shannon.zhao@linaro.org>
> Reviewed-by: Robert Richter <rrichter@cavium.com>
> Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
> Signed-off-by: David Daney <david.daney@cavium.com>
> ---
>   arch/arm64/Kconfig                |  26 +++
>   arch/arm64/include/asm/mmzone.h   |  12 ++
>   arch/arm64/include/asm/numa.h     |  45 +++++
>   arch/arm64/include/asm/topology.h |  10 +
>   arch/arm64/kernel/pci.c           |  10 +
>   arch/arm64/kernel/setup.c         |   4 +
>   arch/arm64/kernel/smp.c           |   4 +
>   arch/arm64/mm/Makefile            |   1 +
>   arch/arm64/mm/init.c              |  34 +++-
>   arch/arm64/mm/mmu.c               |   1 +
>   arch/arm64/mm/numa.c              | 403 ++++++++++++++++++++++++++++++++++++++
>   11 files changed, 545 insertions(+), 5 deletions(-)
>   create mode 100644 arch/arm64/include/asm/mmzone.h
>   create mode 100644 arch/arm64/include/asm/numa.h
>   create mode 100644 arch/arm64/mm/numa.c
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a969970..9f0972a 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -74,6 +74,7 @@ config ARM64
>   	select HAVE_HW_BREAKPOINT if PERF_EVENTS
>   	select HAVE_IRQ_TIME_ACCOUNTING
>   	select HAVE_MEMBLOCK
> +	select HAVE_MEMBLOCK_NODE_MAP if NUMA
>   	select HAVE_PATA_PLATFORM
>   	select HAVE_PERF_EVENTS
>   	select HAVE_PERF_REGS
> @@ -96,6 +97,7 @@ config ARM64
>   	select SYSCTL_EXCEPTION_TRACE
>   	select HAVE_CONTEXT_TRACKING
>   	select HAVE_ARM_SMCCC
> +	select OF_NUMA if NUMA && OF
>   	help
>   	  ARM 64-bit (AArch64) Linux support.
>
> @@ -545,6 +547,30 @@ config HOTPLUG_CPU
>   	  Say Y here to experiment with turning CPUs off and on.  CPUs
>   	  can be controlled through /sys/devices/system/cpu.
>
> +# Common NUMA Features
> +config NUMA
> +	bool "Numa Memory Allocation and Scheduler Support"
> +	depends on SMP
> +	help
> +	  Enable NUMA (Non Uniform Memory Access) support.
> +
> +	  The kernel will try to allocate memory used by a CPU on the
> +	  local memory of the CPU and add some more
> +	  NUMA awareness to the kernel.
> +
> +config NODES_SHIFT
> +	int "Maximum NUMA Nodes (as a power of 2)"
> +	range 1 10
> +	default "2"
> +	depends on NEED_MULTIPLE_NODES
> +	help
> +	  Specify the maximum number of NUMA Nodes available on the target
> +	  system.  Increases memory reserved to accommodate various tables.
> +
> +config USE_PERCPU_NUMA_NODE_ID
> +	def_bool y
> +	depends on NUMA
> +
>   source kernel/Kconfig.preempt
>   source kernel/Kconfig.hz
>
> diff --git a/arch/arm64/include/asm/mmzone.h b/arch/arm64/include/asm/mmzone.h
> new file mode 100644
> index 0000000..a0de9e6
> --- /dev/null
> +++ b/arch/arm64/include/asm/mmzone.h
> @@ -0,0 +1,12 @@
> +#ifndef __ASM_MMZONE_H
> +#define __ASM_MMZONE_H
> +
> +#ifdef CONFIG_NUMA
> +
> +#include <asm/numa.h>
> +
> +extern struct pglist_data *node_data[];
> +#define NODE_DATA(nid)		(node_data[(nid)])
> +
> +#endif /* CONFIG_NUMA */
> +#endif /* __ASM_MMZONE_H */
> diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
> new file mode 100644
> index 0000000..e9b4f29
> --- /dev/null
> +++ b/arch/arm64/include/asm/numa.h
> @@ -0,0 +1,45 @@
> +#ifndef __ASM_NUMA_H
> +#define __ASM_NUMA_H
> +
> +#include <asm/topology.h>
> +
> +#ifdef CONFIG_NUMA
> +
> +/* currently, arm64 implements flat NUMA topology */
> +#define parent_node(node)	(node)
> +
> +int __node_distance(int from, int to);
> +#define node_distance(a, b) __node_distance(a, b)
> +
> +extern nodemask_t numa_nodes_parsed __initdata;
> +
> +/* Mappings between node number and cpus on that node. */
> +extern cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
> +void numa_clear_node(unsigned int cpu);
> +
> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
> +const struct cpumask *cpumask_of_node(int node);
> +#else
> +/* Returns a pointer to the cpumask of CPUs on Node 'node'. */
> +static inline const struct cpumask *cpumask_of_node(int node)
> +{
> +	return node_to_cpumask_map[node];
> +}
> +#endif
> +
> +void __init arm64_numa_init(void);
> +int __init numa_add_memblk(int nodeid, u64 start, u64 end);
> +void __init numa_set_distance(int from, int to, int distance);
> +void __init numa_free_distance(void);
> +void __init early_map_cpu_to_node(unsigned int cpu, int nid);
> +void numa_store_cpu_info(unsigned int cpu);
> +
> +#else	/* CONFIG_NUMA */
> +
> +static inline void numa_store_cpu_info(unsigned int cpu) { }
> +static inline void arm64_numa_init(void) { }
> +static inline void early_map_cpu_to_node(unsigned int cpu, int nid) { }
> +
> +#endif	/* CONFIG_NUMA */
> +
> +#endif	/* __ASM_NUMA_H */
> diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
> index a3e9d6f..8b57339 100644
> --- a/arch/arm64/include/asm/topology.h
> +++ b/arch/arm64/include/asm/topology.h
> @@ -22,6 +22,16 @@ void init_cpu_topology(void);
>   void store_cpu_topology(unsigned int cpuid);
>   const struct cpumask *cpu_coregroup_mask(int cpu);
>
> +#ifdef CONFIG_NUMA
> +
> +struct pci_bus;
> +int pcibus_to_node(struct pci_bus *bus);
> +#define cpumask_of_pcibus(bus)	(pcibus_to_node(bus) == -1 ?		\
> +				 cpu_all_mask :				\
> +				 cpumask_of_node(pcibus_to_node(bus)))
> +
> +#endif /* CONFIG_NUMA */
> +
>   #include <asm-generic/topology.h>
>
>   #endif /* _ASM_ARM_TOPOLOGY_H */
> diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
> index b3d098b..65e6b7d 100644
> --- a/arch/arm64/kernel/pci.c
> +++ b/arch/arm64/kernel/pci.c
> @@ -76,6 +76,16 @@ int raw_pci_write(unsigned int domain, unsigned int bus,
>   	return -ENXIO;
>   }
>
> +#ifdef CONFIG_NUMA
> +
> +int pcibus_to_node(struct pci_bus *bus)
> +{
> +	return dev_to_node(&bus->dev);
> +}
> +EXPORT_SYMBOL(pcibus_to_node);
> +
> +#endif
> +
>   #ifdef CONFIG_ACPI
>   /* Root bridge scanning */
>   struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index 8119479..d9b9761 100644
> --- a/arch/arm64/kernel/setup.c
> +++ b/arch/arm64/kernel/setup.c
> @@ -53,6 +53,7 @@
>   #include <asm/cpufeature.h>
>   #include <asm/cpu_ops.h>
>   #include <asm/kasan.h>
> +#include <asm/numa.h>
>   #include <asm/sections.h>
>   #include <asm/setup.h>
>   #include <asm/smp_plat.h>
> @@ -372,6 +373,9 @@ static int __init topology_init(void)
>   {
>   	int i;
>
> +	for_each_online_node(i)
> +		register_one_node(i);
> +
>   	for_each_possible_cpu(i) {
>   		struct cpu *cpu = &per_cpu(cpu_data.cpu, i);
>   		cpu->hotpluggable = 1;
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index b1adc51..46c45c8 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -45,6 +45,7 @@
>   #include <asm/cputype.h>
>   #include <asm/cpu_ops.h>
>   #include <asm/mmu_context.h>
> +#include <asm/numa.h>
>   #include <asm/pgtable.h>
>   #include <asm/pgalloc.h>
>   #include <asm/processor.h>
> @@ -125,6 +126,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
>   static void smp_store_cpu_info(unsigned int cpuid)
>   {
>   	store_cpu_topology(cpuid);
> +	numa_store_cpu_info(cpuid);
>   }
>
>   /*
> @@ -518,6 +520,8 @@ static void __init of_parse_and_init_cpus(void)
>
>   		pr_debug("cpu logical map 0x%llx\n", hwid);
>   		cpu_logical_map(cpu_count) = hwid;
> +
> +		early_map_cpu_to_node(cpu_count, of_node_to_nid(dn));
>   next:
>   		cpu_count++;
>   	}
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index 57f57fd..54bb209 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -4,6 +4,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
>   				   context.o proc.o pageattr.o
>   obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>   obj-$(CONFIG_ARM64_PTDUMP)	+= dump.o
> +obj-$(CONFIG_NUMA)		+= numa.o
>
>   obj-$(CONFIG_KASAN)		+= kasan_init.o
>   KASAN_SANITIZE_kasan_init.o	:= n
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index f3b061e..c39c670 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -37,6 +37,7 @@
>
>   #include <asm/fixmap.h>
>   #include <asm/memory.h>
> +#include <asm/numa.h>
>   #include <asm/sections.h>
>   #include <asm/setup.h>
>   #include <asm/sizes.h>
> @@ -77,6 +78,21 @@ static phys_addr_t __init max_zone_dma_phys(void)
>   	return min(offset + (1ULL << 32), memblock_end_of_DRAM());
>   }
>
> +#ifdef CONFIG_NUMA
> +
> +static void __init zone_sizes_init(unsigned long min, unsigned long max)
> +{
> +	unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
> +
> +	if (IS_ENABLED(CONFIG_ZONE_DMA))
> +		max_zone_pfns[ZONE_DMA] = PFN_DOWN(max_zone_dma_phys());
> +	max_zone_pfns[ZONE_NORMAL] = max;
> +
> +	free_area_init_nodes(max_zone_pfns);
> +}
> +
> +#else
> +
>   static void __init zone_sizes_init(unsigned long min, unsigned long max)
>   {
>   	struct memblock_region *reg;
> @@ -117,6 +133,8 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
>   	free_area_init_node(0, zone_size, min, zhole_size);
>   }
>
> +#endif /* CONFIG_NUMA */
> +
>   #ifdef CONFIG_HAVE_ARCH_PFN_VALID
>   int pfn_valid(unsigned long pfn)
>   {
> @@ -133,10 +151,15 @@ static void __init arm64_memory_present(void)
>   static void __init arm64_memory_present(void)
>   {
>   	struct memblock_region *reg;
> +	int nid = 0;
>
> -	for_each_memblock(memory, reg)
> -		memory_present(0, memblock_region_memory_base_pfn(reg),
> -			       memblock_region_memory_end_pfn(reg));
> +	for_each_memblock(memory, reg) {
> +#ifdef CONFIG_NUMA
> +		nid = reg->nid;
> +#endif
> +		memory_present(nid, memblock_region_memory_base_pfn(reg),
> +				memblock_region_memory_end_pfn(reg));
> +	}
>   }
>   #endif
>
> @@ -181,7 +204,6 @@ void __init arm64_memblock_init(void)
>   	dma_contiguous_reserve(arm64_dma_phys_limit);
>
>   	memblock_allow_resize();
> -	memblock_dump_all();
>   }
>
>   void __init bootmem_init(void)
> @@ -193,6 +215,9 @@ void __init bootmem_init(void)
>
>   	early_memtest(min << PAGE_SHIFT, max << PAGE_SHIFT);
>
> +	max_pfn = max_low_pfn = max;
> +
> +	arm64_numa_init();
>   	/*
>   	 * Sparsemem tries to allocate bootmem in memory_present(), so must be
>   	 * done after the fixed reservations.
> @@ -203,7 +228,6 @@ void __init bootmem_init(void)
>   	zone_sizes_init(min, max);
>
>   	high_memory = __va((max << PAGE_SHIFT) - 1) + 1;
> -	max_pfn = max_low_pfn = max;
>   }
>
>   #ifndef CONFIG_SPARSEMEM_VMEMMAP
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 58faeaa..44e3854 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -463,6 +463,7 @@ void __init paging_init(void)
>   	zero_page = early_alloc(PAGE_SIZE);
>
>   	bootmem_init();
> +	memblock_dump_all();
>
>   	empty_zero_page = virt_to_page(zero_page);
>
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> new file mode 100644
> index 0000000..604e886
> --- /dev/null
> +++ b/arch/arm64/mm/numa.c
> @@ -0,0 +1,403 @@
> +/*
> + * NUMA support, based on the x86 implementation.
> + *
> + * Copyright (C) 2015 Cavium Inc.
> + * Author: Ganapatrao Kulkarni <gkulkarni@cavium.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <linux/bootmem.h>
> +#include <linux/memblock.h>
> +#include <linux/module.h>
> +#include <linux/of.h>
> +
> +struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
> +EXPORT_SYMBOL(node_data);
> +nodemask_t numa_nodes_parsed __initdata;
> +static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
> +
> +static int numa_off;
> +static int numa_distance_cnt;
> +static u8 *numa_distance;
> +
> +static __init int numa_parse_early_param(char *opt)
> +{
> +	if (!opt)
> +		return -EINVAL;
> +	if (!strncmp(opt, "off", 3)) {
> +		pr_info("%s\n", "NUMA turned off");
> +		numa_off = 1;
> +	}
> +	return 0;
> +}
> +early_param("numa", numa_parse_early_param);
> +
> +cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
> +EXPORT_SYMBOL(node_to_cpumask_map);
> +
> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
> +
> +/*
> + * Returns a pointer to the bitmask of CPUs on Node 'node'.
> + */
> +const struct cpumask *cpumask_of_node(int node)
> +{
> +	if (WARN_ON(node >= nr_node_ids))
> +		return cpu_none_mask;
> +
> +	if (WARN_ON(node_to_cpumask_map[node] == NULL))
> +		return cpu_online_mask;
> +
> +	return node_to_cpumask_map[node];
> +}
> +EXPORT_SYMBOL(cpumask_of_node);
> +
> +#endif
> +
> +static void map_cpu_to_node(unsigned int cpu, int nid)
> +{
> +	set_cpu_numa_node(cpu, nid);
> +	if (nid >= 0)
> +		cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
> +}
> +
> +static void unmap_cpu_to_node(unsigned int cpu)
> +{
> +	int nid = cpu_to_node(cpu);
> +
> +	if (nid >= 0)
> +		cpumask_clear_cpu(cpu, node_to_cpumask_map[nid]);
> +	set_cpu_numa_node(cpu, NUMA_NO_NODE);
> +}
> +
> +void numa_clear_node(unsigned int cpu)
> +{
> +	unmap_cpu_to_node(cpu);
> +}
> +
> +/*
> + * Allocate node_to_cpumask_map based on number of available nodes
> + * Requires node_possible_map to be valid.
> + *
> + * Note: cpumask_of_node() is not valid until after this is done.
> + * (Use CONFIG_DEBUG_PER_CPU_MAPS to check this.)
> + */
> +static void __init setup_node_to_cpumask_map(void)
> +{
> +	unsigned int cpu;
> +	int node;
> +
> +	/* setup nr_node_ids if not done yet */
> +	if (nr_node_ids == MAX_NUMNODES)
> +		setup_nr_node_ids();
> +
> +	/* allocate and clear the mapping */
> +	for (node = 0; node < nr_node_ids; node++) {
> +		alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]);
> +		cpumask_clear(node_to_cpumask_map[node]);
> +	}
> +
> +	for_each_possible_cpu(cpu)
> +		set_cpu_numa_node(cpu, NUMA_NO_NODE);
> +
> +	/* cpumask_of_node() will now work */
> +	pr_debug("NUMA: Node to cpumask map for %d nodes\n", nr_node_ids);
> +}
> +
> +/*
> + *  Set the cpu to node and mem mapping
> + */
> +void numa_store_cpu_info(unsigned int cpu)
> +{
> +	map_cpu_to_node(cpu, numa_off ? 0 : cpu_to_node_map[cpu]);
> +}
> +
> +void __init early_map_cpu_to_node(unsigned int cpu, int nid)
> +{
> +	/* fallback to node 0 */
> +	if (nid < 0 || nid >= MAX_NUMNODES)
> +		nid = 0;
> +
> +	cpu_to_node_map[cpu] = nid;
> +}
> +
> +/**
> + * numa_add_memblk - Set node id to memblk
> + * @nid: NUMA node ID of the new memblk
> + * @start: Start address of the new memblk
> + * @size:  Size of the new memblk
> + *
> + * RETURNS:
> + * 0 on success, -errno on failure.
> + */
> +int __init numa_add_memblk(int nid, u64 start, u64 size)
> +{
> +	int ret;
> +
> +	ret = memblock_set_node(start, size, &memblock.memory, nid);
> +	if (ret < 0) {
> +		pr_err("NUMA: memblock [0x%llx - 0x%llx] failed to add on node %d\n",
> +			start, (start + size - 1), nid);
> +		return ret;
> +	}
> +
> +	node_set(nid, numa_nodes_parsed);
> +	pr_info("NUMA: Adding memblock [0x%llx - 0x%llx] on node %d\n",
> +			start, (start + size - 1), nid);
> +	return ret;
> +}
> +EXPORT_SYMBOL(numa_add_memblk);
> +
> +/**
> + * Initialize NODE_DATA for a node on the local memory
> + */
> +static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
> +{
> +	const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
> +	u64 nd_pa;
> +	void *nd;
> +	int tnid;
> +
> +	pr_info("NUMA: Initmem setup node %d [mem %#010Lx-%#010Lx]\n",
> +			nid, start_pfn << PAGE_SHIFT,
> +			(end_pfn << PAGE_SHIFT) - 1);
> +
> +	nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> +	nd = __va(nd_pa);
> +
> +	/* report and initialize */
> +	pr_info("NUMA: NODE_DATA [mem %#010Lx-%#010Lx]\n",
> +		nd_pa, nd_pa + nd_size - 1);
> +	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
> +	if (tnid != nid)
> +		pr_info("NUMA: NODE_DATA(%d) on node %d\n", nid, tnid);
> +
> +	node_data[nid] = nd;
> +	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> +	NODE_DATA(nid)->node_id = nid;
> +	NODE_DATA(nid)->node_start_pfn = start_pfn;
> +	NODE_DATA(nid)->node_spanned_pages = end_pfn - start_pfn;
> +}
> +
> +/**
> + * numa_free_distance
> + *
> + * The current table is freed.
> + */
> +void __init numa_free_distance(void)
> +{
> +	size_t size;
> +
> +	if (!numa_distance)
> +		return;
> +
> +	size = numa_distance_cnt * numa_distance_cnt *
> +		sizeof(numa_distance[0]);
> +
> +	memblock_free(__pa(numa_distance), size);
> +	numa_distance_cnt = 0;
> +	numa_distance = NULL;
> +}
> +
> +/**
> + *
> + * Create a new NUMA distance table.
> + *
> + */
> +static int __init numa_alloc_distance(void)
> +{
> +	size_t size;
> +	u64 phys;
> +	int i, j;
> +
> +	size = nr_node_ids * nr_node_ids * sizeof(numa_distance[0]);
> +	phys = memblock_find_in_range(0, PFN_PHYS(max_pfn),
> +				      size, PAGE_SIZE);
> +	if (WARN_ON(!phys))
> +		return -ENOMEM;
> +
> +	memblock_reserve(phys, size);
> +
> +	numa_distance = __va(phys);
> +	numa_distance_cnt = nr_node_ids;
> +
> +	/* fill with the default distances */
> +	for (i = 0; i < numa_distance_cnt; i++)
> +		for (j = 0; j < numa_distance_cnt; j++)
> +			numa_distance[i * numa_distance_cnt + j] = i == j ?
> +				LOCAL_DISTANCE : REMOTE_DISTANCE;
> +
> +	pr_debug("NUMA: Initialized distance table, cnt=%d\n",
> +			numa_distance_cnt);
> +
> +	return 0;
> +}
> +
> +/**
> + * numa_set_distance - Set inter node NUMA distance from node to node.
> + * @from: the 'from' node to set distance
> + * @to: the 'to'  node to set distance
> + * @distance: NUMA distance
> + *
> + * Set the distance from node @from to @to to @distance.
> + * If distance table doesn't exist, a warning is printed.
> + *
> + * If @from or @to is higher than the highest known node or lower than zero
> + * or @distance doesn't make sense, the call is ignored.
> + *
> + */
> +void __init numa_set_distance(int from, int to, int distance)
> +{
> +	if (!numa_distance) {
> +		pr_warn_once("NUMA: Warning: distance table not allocated yet\n");
> +		return;
> +	}
> +
> +	if (from >= numa_distance_cnt || to >= numa_distance_cnt ||
> +			from < 0 || to < 0) {
> +		pr_warn_once("NUMA: Warning: node ids are out of bound, from=%d to=%d distance=%d\n",
> +			    from, to, distance);
> +		return;
> +	}
> +
> +	if ((u8)distance != distance ||
> +	    (from == to && distance != LOCAL_DISTANCE)) {
> +		pr_warn_once("NUMA: Warning: invalid distance parameter, from=%d to=%d distance=%d\n",
> +			     from, to, distance);
> +		return;
> +	}
> +
> +	numa_distance[from * numa_distance_cnt + to] = distance;
> +}
> +EXPORT_SYMBOL(numa_set_distance);
> +
> +/**
> + * Return NUMA distance @from to @to
> + */
> +int __node_distance(int from, int to)
> +{
> +	if (from >= numa_distance_cnt || to >= numa_distance_cnt)
> +		return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
> +	return numa_distance[from * numa_distance_cnt + to];
> +}
> +EXPORT_SYMBOL(__node_distance);
> +
> +static int __init numa_register_nodes(void)
> +{
> +	int nid;
> +	struct memblock_region *mblk;
> +
> +	/* Check that valid nid is set to memblks */
> +	for_each_memblock(memory, mblk)
> +		if (mblk->nid == NUMA_NO_NODE || mblk->nid >= MAX_NUMNODES) {
> +			pr_warn("NUMA: Warning: invalid memblk node %d [mem %#010Lx-%#010Lx]\n",
> +				mblk->nid, mblk->base,
> +				mblk->base + mblk->size - 1);
> +			return -EINVAL;
> +		}
> +
> +	/* Finally register nodes. */
> +	for_each_node_mask(nid, numa_nodes_parsed) {
> +		unsigned long start_pfn, end_pfn;
> +
> +		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> +		setup_node_data(nid, start_pfn, end_pfn);
> +		node_set_online(nid);
> +	}
> +
> +	/* Setup online nodes to actual nodes*/
> +	node_possible_map = numa_nodes_parsed;
> +
> +	return 0;
> +}
> +
> +static int __init numa_init(int (*init_func)(void))
> +{
> +	int ret;
> +
> +	nodes_clear(numa_nodes_parsed);
> +	nodes_clear(node_possible_map);
> +	nodes_clear(node_online_map);
> +	numa_free_distance();
> +
> +	ret = numa_alloc_distance();
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = init_func();
> +	if (ret < 0)
> +		return ret;
> +
> +	if (nodes_empty(numa_nodes_parsed))
> +		return -EINVAL;
> +
> +	ret = numa_register_nodes();
> +	if (ret < 0)
> +		return ret;
> +
> +	setup_node_to_cpumask_map();
> +
> +	/* init boot processor */
> +	cpu_to_node_map[0] = 0;
> +	map_cpu_to_node(0, 0);
> +
> +	return 0;
> +}
> +
> +/**
> + * dummy_numa_init - Fallback dummy NUMA init
> + *
> + * Used if there's no underlying NUMA architecture, NUMA initialization
> + * fails, or NUMA is disabled on the command line.
> + *
> + * Must online at least one node (node 0) and add memory blocks that cover all
> + * allowed memory. It is unlikely that this function fails.
> + */
> +static int __init dummy_numa_init(void)
> +{
> +	int ret;
> +	struct memblock_region *mblk;
> +
> +	pr_info("%s\n", "No NUMA configuration found");
> +	pr_info("NUMA: Faking a node at [mem %#018Lx-%#018Lx]\n",
> +	       0LLU, PFN_PHYS(max_pfn) - 1);
> +
> +	for_each_memblock(memory, mblk) {
> +		ret = numa_add_memblk(0, mblk->base, mblk->size);
> +		if (!ret)
> +			continue;
> +
> +		pr_err("NUMA init failed\n");
> +		return ret;
> +	}
> +
> +	numa_off = 1;
> +	return 0;
> +}
> +
> +/**
> + * arm64_numa_init - Initialize NUMA
> + *
> + * Try each configured NUMA initialization method until one succeeds.  The
> + * last fallback is dummy single node config encomapssing whole memory.
> + */
> +void __init arm64_numa_init(void)
> +{
> +	if (!numa_off) {
> +		if (!numa_init(of_numa_init))
> +			return;
> +	}
> +
> +	numa_init(dummy_numa_init);
> +}
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 1/5] efi: ARM/arm64: ignore DT memory nodes instead of removing them
  2016-02-23 11:58   ` Mark Rutland
  2016-02-23 12:16     ` Will Deacon
@ 2016-02-23 22:12     ` Rob Herring
  2016-02-24 19:38       ` Mark Rutland
  2016-02-24 19:03     ` Frank Rowand
  2 siblings, 1 reply; 30+ messages in thread
From: Rob Herring @ 2016-02-23 22:12 UTC (permalink / raw)
  To: Mark Rutland
  Cc: David Daney, Ard Biesheuvel, Will Deacon, linux-arm-kernel,
	Pawel Moll, Ian Campbell, Kumar Gala, devicetree, Frank Rowand,
	Grant Likely, Catalin Marinas, Matt Fleming, linux-efi,
	Ganapatrao Kulkarni, Robert Richter, linux-kernel, David Daney

On Tue, Feb 23, 2016 at 11:58:05AM +0000, Mark Rutland wrote:
> Hi,
> 
> On Mon, Feb 22, 2016 at 05:58:19PM -0800, David Daney wrote:
> > From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > 
> > There are two problems with the UEFI stub DT memory node removal
> > routine:
> > - it deletes nodes as it traverses the tree, which happens to work
> >   but is not supported, as deletion invalidates the node iterator;
> > - deleting memory nodes entirely may discard annotations in the form
> >   of additional properties on the nodes.
> > 
> > Since the discovery of DT memory nodes occurs strictly before the
> > UEFI init sequence, we can simply clear the memblock memory table
> > before parsing the UEFI memory map. This way, it is no longer
> > necessary to remove the nodes, so we can remove that logic from the
> > stub as well.
> 
> This is a little bit scary, but I guess this works.

The way it is worded/implemented is, I agree. But if we simply say both 
can be present and the kernel will default to UEFI memory map, that 
seems sufficient to me.
 
> My only concern is that when we get kexec, a subsequent kernel must also
> have EFI memory map support, or things go bad for the next EFI-aware
> kernel after that (as things like the runtime services may have been
> corrupted by the kernel in the middle). It's difficult to fix the
> general case later.
> 
> A different option would be to support status="disabled" for the memory
> nodes, and ignore these in early_init_dt_scan_memory. That way a kernel
> cannot use memory without first having parsed the EFI memory map, and we
> can still get NUMA info from the disabled nodes.

That would be a bit strange that the node is disabled, but still used. 

What if DT and UEFI tables are out of sync somehow? RAM is multiple 
mapped and different addresses were picked for example.

> You'd still need a new kernel to take into account status, but at least
> we'd know all kernels would avoid using RAM that potentially needs to be
> preserved.
> 
> Ard, Rob, thoughts?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 1/5] efi: ARM/arm64: ignore DT memory nodes instead of removing them
  2016-02-23 11:58   ` Mark Rutland
  2016-02-23 12:16     ` Will Deacon
  2016-02-23 22:12     ` Rob Herring
@ 2016-02-24 19:03     ` Frank Rowand
  2016-02-24 19:30       ` Rob Herring
  2016-02-24 19:33       ` Mark Rutland
  2 siblings, 2 replies; 30+ messages in thread
From: Frank Rowand @ 2016-02-24 19:03 UTC (permalink / raw)
  To: Mark Rutland
  Cc: David Daney, Ard Biesheuvel, Rob Herring, Will Deacon,
	linux-arm-kernel, Pawel Moll, Ian Campbell, Kumar Gala,
	devicetree, Grant Likely, Catalin Marinas, Matt Fleming,
	linux-efi, Ganapatrao Kulkarni, Robert Richter, linux-kernel,
	David Daney

On 2/23/2016 3:58 AM, Mark Rutland wrote:
> Hi,
> 
> On Mon, Feb 22, 2016 at 05:58:19PM -0800, David Daney wrote:
>> From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>>
>> There are two problems with the UEFI stub DT memory node removal
>> routine:
>> - it deletes nodes as it traverses the tree, which happens to work
>>   but is not supported, as deletion invalidates the node iterator;
>> - deleting memory nodes entirely may discard annotations in the form
>>   of additional properties on the nodes.
>>
>> Since the discovery of DT memory nodes occurs strictly before the
>> UEFI init sequence, we can simply clear the memblock memory table
>> before parsing the UEFI memory map. This way, it is no longer
>> necessary to remove the nodes, so we can remove that logic from the
>> stub as well.
> 
> This is a little bit scary, but I guess this works.
> 
> My only concern is that when we get kexec, a subsequent kernel must also
> have EFI memory map support, or things go bad for the next EFI-aware
> kernel after that (as things like the runtime services may have been
> corrupted by the kernel in the middle). It's difficult to fix the
> general case later.
> 
> A different option would be to support status="disabled" for the memory
> nodes, and ignore these in early_init_dt_scan_memory. That way a kernel
> cannot use memory without first having parsed the EFI memory map, and we
> can still get NUMA info from the disabled nodes.

Please do not play games of treating nodes with status="disabled" as
valid nodes.  The mindset should be if it is disabled, it does not exist.

There have been two bugs reported in the last week where code should
have been ignoring disabled nodes and failed to.  An audit of code
scanning all nodes instead of all enabled nodes is now on my todo list.

< snip >

-Frank

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 1/5] efi: ARM/arm64: ignore DT memory nodes instead of removing them
  2016-02-24 19:03     ` Frank Rowand
@ 2016-02-24 19:30       ` Rob Herring
  2016-02-24 19:33       ` Mark Rutland
  1 sibling, 0 replies; 30+ messages in thread
From: Rob Herring @ 2016-02-24 19:30 UTC (permalink / raw)
  To: Frank Rowand
  Cc: Mark Rutland, David Daney, Ard Biesheuvel, Will Deacon,
	linux-arm-kernel, Pawel Moll, Ian Campbell, Kumar Gala,
	devicetree, Grant Likely, Catalin Marinas, Matt Fleming,
	linux-efi, Ganapatrao Kulkarni, Robert Richter, linux-kernel,
	David Daney

On Wed, Feb 24, 2016 at 1:03 PM, Frank Rowand <frowand.list@gmail.com> wrote:
> On 2/23/2016 3:58 AM, Mark Rutland wrote:
>> Hi,
>>
>> On Mon, Feb 22, 2016 at 05:58:19PM -0800, David Daney wrote:
>>> From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>>>
>>> There are two problems with the UEFI stub DT memory node removal
>>> routine:
>>> - it deletes nodes as it traverses the tree, which happens to work
>>>   but is not supported, as deletion invalidates the node iterator;
>>> - deleting memory nodes entirely may discard annotations in the form
>>>   of additional properties on the nodes.
>>>
>>> Since the discovery of DT memory nodes occurs strictly before the
>>> UEFI init sequence, we can simply clear the memblock memory table
>>> before parsing the UEFI memory map. This way, it is no longer
>>> necessary to remove the nodes, so we can remove that logic from the
>>> stub as well.
>>
>> This is a little bit scary, but I guess this works.
>>
>> My only concern is that when we get kexec, a subsequent kernel must also
>> have EFI memory map support, or things go bad for the next EFI-aware
>> kernel after that (as things like the runtime services may have been
>> corrupted by the kernel in the middle). It's difficult to fix the
>> general case later.
>>
>> A different option would be to support status="disabled" for the memory
>> nodes, and ignore these in early_init_dt_scan_memory. That way a kernel
>> cannot use memory without first having parsed the EFI memory map, and we
>> can still get NUMA info from the disabled nodes.
>
> Please do not play games of treating nodes with status="disabled" as
> valid nodes.  The mindset should be if it is disabled, it does not exist.
>
> There have been two bugs reported in the last week where code should
> have been ignoring disabled nodes and failed to.  An audit of code
> scanning all nodes instead of all enabled nodes is now on my todo list.

Perhaps we should merge the default/available variants of iterators
into one. I suspect there are some valid uses. Otherwise, we could
also just not even populate those nodes in the live tree. There are
some cases where the kernel changes the status.

Rob

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 1/5] efi: ARM/arm64: ignore DT memory nodes instead of removing them
  2016-02-24 19:03     ` Frank Rowand
  2016-02-24 19:30       ` Rob Herring
@ 2016-02-24 19:33       ` Mark Rutland
  1 sibling, 0 replies; 30+ messages in thread
From: Mark Rutland @ 2016-02-24 19:33 UTC (permalink / raw)
  To: Frank Rowand
  Cc: David Daney, Ard Biesheuvel, Rob Herring, Will Deacon,
	linux-arm-kernel, Pawel Moll, Ian Campbell, Kumar Gala,
	devicetree, Grant Likely, Catalin Marinas, Matt Fleming,
	linux-efi, Ganapatrao Kulkarni, Robert Richter, linux-kernel,
	David Daney

On Wed, Feb 24, 2016 at 11:03:08AM -0800, Frank Rowand wrote:
> On 2/23/2016 3:58 AM, Mark Rutland wrote:
> > Hi,
> > 
> > On Mon, Feb 22, 2016 at 05:58:19PM -0800, David Daney wrote:
> >> From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> >>
> >> There are two problems with the UEFI stub DT memory node removal
> >> routine:
> >> - it deletes nodes as it traverses the tree, which happens to work
> >>   but is not supported, as deletion invalidates the node iterator;
> >> - deleting memory nodes entirely may discard annotations in the form
> >>   of additional properties on the nodes.
> >>
> >> Since the discovery of DT memory nodes occurs strictly before the
> >> UEFI init sequence, we can simply clear the memblock memory table
> >> before parsing the UEFI memory map. This way, it is no longer
> >> necessary to remove the nodes, so we can remove that logic from the
> >> stub as well.
> > 
> > This is a little bit scary, but I guess this works.
> > 
> > My only concern is that when we get kexec, a subsequent kernel must also
> > have EFI memory map support, or things go bad for the next EFI-aware
> > kernel after that (as things like the runtime services may have been
> > corrupted by the kernel in the middle). It's difficult to fix the
> > general case later.
> > 
> > A different option would be to support status="disabled" for the memory
> > nodes, and ignore these in early_init_dt_scan_memory. That way a kernel
> > cannot use memory without first having parsed the EFI memory map, and we
> > can still get NUMA info from the disabled nodes.
> 
> Please do not play games of treating nodes with status="disabled" as
> valid nodes.  The mindset should be if it is disabled, it does not exist.

I completely agree with this generally.

The only possible wiggle room is ePAPR's decription of the precise
meaning of the status property being binding-specific (and there may be
some way to later "enable" the node or otehrwise make use of it). As
with above, we'd only be extracting some information in the presence of
a UEFI memory map.

I agree that this is not a great pattern, and we don't necessarily want
that even for "safe" cases like NUMA.

> There have been two bugs reported in the last week where code should
> have been ignoring disabled nodes and failed to.  An audit of code
> scanning all nodes instead of all enabled nodes is now on my todo list.

That would be great!

Mark.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 1/5] efi: ARM/arm64: ignore DT memory nodes instead of removing them
  2016-02-23 22:12     ` Rob Herring
@ 2016-02-24 19:38       ` Mark Rutland
  0 siblings, 0 replies; 30+ messages in thread
From: Mark Rutland @ 2016-02-24 19:38 UTC (permalink / raw)
  To: Rob Herring
  Cc: David Daney, Ard Biesheuvel, Will Deacon, linux-arm-kernel,
	Pawel Moll, Ian Campbell, Kumar Gala, devicetree, Frank Rowand,
	Grant Likely, Catalin Marinas, Matt Fleming, linux-efi,
	Ganapatrao Kulkarni, Robert Richter, linux-kernel, David Daney

On Tue, Feb 23, 2016 at 04:12:02PM -0600, Rob Herring wrote:
> On Tue, Feb 23, 2016 at 11:58:05AM +0000, Mark Rutland wrote:
> > Hi,
> > 
> > On Mon, Feb 22, 2016 at 05:58:19PM -0800, David Daney wrote:
> > > From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > > 
> > > There are two problems with the UEFI stub DT memory node removal
> > > routine:
> > > - it deletes nodes as it traverses the tree, which happens to work
> > >   but is not supported, as deletion invalidates the node iterator;
> > > - deleting memory nodes entirely may discard annotations in the form
> > >   of additional properties on the nodes.
> > > 
> > > Since the discovery of DT memory nodes occurs strictly before the
> > > UEFI init sequence, we can simply clear the memblock memory table
> > > before parsing the UEFI memory map. This way, it is no longer
> > > necessary to remove the nodes, so we can remove that logic from the
> > > stub as well.
> > 
> > This is a little bit scary, but I guess this works.
> 
> The way it is worded/implemented is, I agree. But if we simply say both 
> can be present and the kernel will default to UEFI memory map, that 
> seems sufficient to me.
>  
> > My only concern is that when we get kexec, a subsequent kernel must also
> > have EFI memory map support, or things go bad for the next EFI-aware
> > kernel after that (as things like the runtime services may have been
> > corrupted by the kernel in the middle). It's difficult to fix the
> > general case later.
> > 
> > A different option would be to support status="disabled" for the memory
> > nodes, and ignore these in early_init_dt_scan_memory. That way a kernel
> > cannot use memory without first having parsed the EFI memory map, and we
> > can still get NUMA info from the disabled nodes.
> 
> That would be a bit strange that the node is disabled, but still used. 

I agree this would be strange, and not necessarily a precedent we'd want
to see copied elsewhere.

Per ePAPR, a "disabled" node can be enabled in a binding-specific
manner, so having the presence of a UEFI memory map "enable" the NUMA
information would appear to be permitted.

> What if DT and UEFI tables are out of sync somehow? RAM is multiple 
> mapped and different addresses were picked for example.

That applies regardless of the status of the memory nodes.

My suggestion was only that we acquired the NUMA node information, and
added this node information (and not any additional extent of memory) to
the UEFI memory map.

This is precisely what we do with Ard's code, with the exception that in
the absence of a UEFI memory map the kernel would know it was not
permitted to access memory.

Mark.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms.
  2016-02-23  1:58 ` [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms David Daney
  2016-02-23 10:26   ` Will Deacon
  2016-02-23 17:39   ` David Daney
@ 2016-02-26 18:53   ` Will Deacon
  2016-02-26 19:51     ` David Daney
  2016-02-29 17:34   ` Robert Richter
  3 siblings, 1 reply; 30+ messages in thread
From: Will Deacon @ 2016-02-26 18:53 UTC (permalink / raw)
  To: David Daney
  Cc: linux-arm-kernel, Rob Herring, Pawel Moll, Mark Rutland,
	Ian Campbell, Kumar Gala, devicetree, Ard Biesheuvel,
	Frank Rowand, Grant Likely, Catalin Marinas, Matt Fleming,
	linux-efi, Ganapatrao Kulkarni, Robert Richter, linux-kernel,
	David Daney

On Mon, Feb 22, 2016 at 05:58:22PM -0800, David Daney wrote:
> From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
> 
> Attempt to get the memory and CPU NUMA node via of_numa.  If that
> fails, default the dummy NUMA node and map all memory and CPUs to node
> 0.
> 
> Tested-by: Shannon Zhao <shannon.zhao@linaro.org>
> Reviewed-by: Robert Richter <rrichter@cavium.com>
> Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
> Signed-off-by: David Daney <david.daney@cavium.com>

[...]

> @@ -133,10 +151,15 @@ static void __init arm64_memory_present(void)
>  static void __init arm64_memory_present(void)
>  {
>  	struct memblock_region *reg;
> +	int nid = 0;
>  
> -	for_each_memblock(memory, reg)
> -		memory_present(0, memblock_region_memory_base_pfn(reg),
> -			       memblock_region_memory_end_pfn(reg));
> +	for_each_memblock(memory, reg) {
> +#ifdef CONFIG_NUMA
> +		nid = reg->nid;
> +#endif
> +		memory_present(nid, memblock_region_memory_base_pfn(reg),
> +				memblock_region_memory_end_pfn(reg));
> +	}
>  }
>  #endif
>  
> @@ -181,7 +204,6 @@ void __init arm64_memblock_init(void)
>  	dma_contiguous_reserve(arm64_dma_phys_limit);
>  
>  	memblock_allow_resize();
> -	memblock_dump_all();
>  }
>  
>  void __init bootmem_init(void)
> @@ -193,6 +215,9 @@ void __init bootmem_init(void)
>  
>  	early_memtest(min << PAGE_SHIFT, max << PAGE_SHIFT);
>  
> +	max_pfn = max_low_pfn = max;
> +
> +	arm64_numa_init();
>  	/*
>  	 * Sparsemem tries to allocate bootmem in memory_present(), so must be
>  	 * done after the fixed reservations.
> @@ -203,7 +228,6 @@ void __init bootmem_init(void)
>  	zone_sizes_init(min, max);
>  
>  	high_memory = __va((max << PAGE_SHIFT) - 1) + 1;
> -	max_pfn = max_low_pfn = max;
>  }
>  
>  #ifndef CONFIG_SPARSEMEM_VMEMMAP
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 58faeaa..44e3854 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -463,6 +463,7 @@ void __init paging_init(void)
>  	zero_page = early_alloc(PAGE_SIZE);
>  
>  	bootmem_init();
> +	memblock_dump_all();
>  
>  	empty_zero_page = virt_to_page(zero_page);
>  
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> new file mode 100644
> index 0000000..604e886
> --- /dev/null
> +++ b/arch/arm64/mm/numa.c
> @@ -0,0 +1,403 @@
> +/*
> + * NUMA support, based on the x86 implementation.
> + *
> + * Copyright (C) 2015 Cavium Inc.
> + * Author: Ganapatrao Kulkarni <gkulkarni@cavium.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <linux/bootmem.h>
> +#include <linux/memblock.h>
> +#include <linux/module.h>
> +#include <linux/of.h>
> +
> +struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
> +EXPORT_SYMBOL(node_data);
> +nodemask_t numa_nodes_parsed __initdata;
> +static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
> +
> +static int numa_off;
> +static int numa_distance_cnt;
> +static u8 *numa_distance;
> +
> +static __init int numa_parse_early_param(char *opt)
> +{
> +	if (!opt)
> +		return -EINVAL;
> +	if (!strncmp(opt, "off", 3)) {
> +		pr_info("%s\n", "NUMA turned off");
> +		numa_off = 1;
> +	}
> +	return 0;
> +}
> +early_param("numa", numa_parse_early_param);

Curious, but when is this option actually useful?

> +
> +cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
> +EXPORT_SYMBOL(node_to_cpumask_map);
> +
> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
> +
> +/*
> + * Returns a pointer to the bitmask of CPUs on Node 'node'.
> + */
> +const struct cpumask *cpumask_of_node(int node)
> +{
> +	if (WARN_ON(node >= nr_node_ids))
> +		return cpu_none_mask;
> +
> +	if (WARN_ON(node_to_cpumask_map[node] == NULL))
> +		return cpu_online_mask;
> +
> +	return node_to_cpumask_map[node];
> +}
> +EXPORT_SYMBOL(cpumask_of_node);
> +
> +#endif
> +
> +static void map_cpu_to_node(unsigned int cpu, int nid)
> +{
> +	set_cpu_numa_node(cpu, nid);
> +	if (nid >= 0)
> +		cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
> +}
> +
> +static void unmap_cpu_to_node(unsigned int cpu)
> +{
> +	int nid = cpu_to_node(cpu);
> +
> +	if (nid >= 0)
> +		cpumask_clear_cpu(cpu, node_to_cpumask_map[nid]);
> +	set_cpu_numa_node(cpu, NUMA_NO_NODE);
> +}

How do you end up with negative nids this late in the game?

> +
> +void numa_clear_node(unsigned int cpu)
> +{
> +	unmap_cpu_to_node(cpu);

Why don't you just inline this function?

> +}
> +
> +/*
> + * Allocate node_to_cpumask_map based on number of available nodes
> + * Requires node_possible_map to be valid.
> + *
> + * Note: cpumask_of_node() is not valid until after this is done.
> + * (Use CONFIG_DEBUG_PER_CPU_MAPS to check this.)
> + */
> +static void __init setup_node_to_cpumask_map(void)
> +{
> +	unsigned int cpu;
> +	int node;
> +
> +	/* setup nr_node_ids if not done yet */
> +	if (nr_node_ids == MAX_NUMNODES)
> +		setup_nr_node_ids();
> +
> +	/* allocate and clear the mapping */
> +	for (node = 0; node < nr_node_ids; node++) {
> +		alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]);
> +		cpumask_clear(node_to_cpumask_map[node]);
> +	}
> +
> +	for_each_possible_cpu(cpu)
> +		set_cpu_numa_node(cpu, NUMA_NO_NODE);
> +
> +	/* cpumask_of_node() will now work */
> +	pr_debug("NUMA: Node to cpumask map for %d nodes\n", nr_node_ids);
> +}
> +
> +/*
> + *  Set the cpu to node and mem mapping
> + */
> +void numa_store_cpu_info(unsigned int cpu)
> +{
> +	map_cpu_to_node(cpu, numa_off ? 0 : cpu_to_node_map[cpu]);
> +}
> +
> +void __init early_map_cpu_to_node(unsigned int cpu, int nid)
> +{
> +	/* fallback to node 0 */
> +	if (nid < 0 || nid >= MAX_NUMNODES)
> +		nid = 0;
> +
> +	cpu_to_node_map[cpu] = nid;
> +}
> +
> +/**
> + * numa_add_memblk - Set node id to memblk
> + * @nid: NUMA node ID of the new memblk
> + * @start: Start address of the new memblk
> + * @size:  Size of the new memblk
> + *
> + * RETURNS:
> + * 0 on success, -errno on failure.
> + */
> +int __init numa_add_memblk(int nid, u64 start, u64 size)
> +{
> +	int ret;
> +
> +	ret = memblock_set_node(start, size, &memblock.memory, nid);
> +	if (ret < 0) {
> +		pr_err("NUMA: memblock [0x%llx - 0x%llx] failed to add on node %d\n",
> +			start, (start + size - 1), nid);
> +		return ret;
> +	}
> +
> +	node_set(nid, numa_nodes_parsed);
> +	pr_info("NUMA: Adding memblock [0x%llx - 0x%llx] on node %d\n",
> +			start, (start + size - 1), nid);
> +	return ret;
> +}
> +EXPORT_SYMBOL(numa_add_memblk);

But this is marked __init... (and you've done this elsewhere in the patch
too).

Will

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 5/5] arm64, mm, numa: Add NUMA balancing support for arm64.
  2016-02-23  1:58 ` [PATCH v12 5/5] arm64, mm, numa: Add NUMA balancing support for arm64 David Daney
@ 2016-02-26 18:53   ` Will Deacon
  2016-02-26 19:26     ` David Daney
  0 siblings, 1 reply; 30+ messages in thread
From: Will Deacon @ 2016-02-26 18:53 UTC (permalink / raw)
  To: David Daney
  Cc: linux-arm-kernel, Rob Herring, Pawel Moll, Mark Rutland,
	Ian Campbell, Kumar Gala, devicetree, Ard Biesheuvel,
	Frank Rowand, Grant Likely, Catalin Marinas, Matt Fleming,
	linux-efi, Ganapatrao Kulkarni, Robert Richter, linux-kernel,
	David Daney

On Mon, Feb 22, 2016 at 05:58:23PM -0800, David Daney wrote:
> From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
> 
> Enable NUMA balancing for arm64 platforms.
> Add pte, pmd protnone helpers for use by automatic NUMA balancing.
> 
> Reviewed-by: Robert Richter <rrichter@cavium.com>
> Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
> Signed-off-by: David Daney <david.daney@cavium.com>
> ---
>  arch/arm64/Kconfig               |  1 +
>  arch/arm64/include/asm/pgtable.h | 15 +++++++++++++++
>  2 files changed, 16 insertions(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 9f0972a..6e22503 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -11,6 +11,7 @@ config ARM64
>  	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
>  	select ARCH_USE_CMPXCHG_LOCKREF
>  	select ARCH_SUPPORTS_ATOMIC_RMW
> +	select ARCH_SUPPORTS_NUMA_BALANCING
>  	select ARCH_WANT_OPTIONAL_GPIOLIB
>  	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION
>  	select ARCH_WANT_FRAME_POINTERS
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index bf464de..5af9db2 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -346,6 +346,21 @@ static inline pgprot_t mk_sect_prot(pgprot_t prot)
>  	return __pgprot(pgprot_val(prot) & ~PTE_TABLE_BIT);
>  }
>  
> +#ifdef CONFIG_NUMA_BALANCING
> +/*
> + * See the comment in include/asm-generic/pgtable.h
> + */
> +static inline int pte_protnone(pte_t pte)
> +{
> +	return (pte_val(pte) & (PTE_VALID | PTE_PROT_NONE)) == PTE_PROT_NONE;
> +}
> +
> +static inline int pmd_protnone(pmd_t pmd)
> +{
> +	return pte_protnone(pmd_pte(pmd));
> +}
> +#endif

Can these not be macros, like our other pte_* and pmd_* predicates in
this header file?

Will

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 5/5] arm64, mm, numa: Add NUMA balancing support for arm64.
  2016-02-26 18:53   ` Will Deacon
@ 2016-02-26 19:26     ` David Daney
  0 siblings, 0 replies; 30+ messages in thread
From: David Daney @ 2016-02-26 19:26 UTC (permalink / raw)
  To: Will Deacon
  Cc: David Daney, linux-arm-kernel, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, Robert Richter,
	linux-kernel, David Daney

On 02/26/2016 10:53 AM, Will Deacon wrote:
> On Mon, Feb 22, 2016 at 05:58:23PM -0800, David Daney wrote:
>> From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
>>
>> Enable NUMA balancing for arm64 platforms.
>> Add pte, pmd protnone helpers for use by automatic NUMA balancing.
>>
>> Reviewed-by: Robert Richter <rrichter@cavium.com>
>> Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
>> Signed-off-by: David Daney <david.daney@cavium.com>
>> ---
>>   arch/arm64/Kconfig               |  1 +
>>   arch/arm64/include/asm/pgtable.h | 15 +++++++++++++++
>>   2 files changed, 16 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 9f0972a..6e22503 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -11,6 +11,7 @@ config ARM64
>>   	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
>>   	select ARCH_USE_CMPXCHG_LOCKREF
>>   	select ARCH_SUPPORTS_ATOMIC_RMW
>> +	select ARCH_SUPPORTS_NUMA_BALANCING
>>   	select ARCH_WANT_OPTIONAL_GPIOLIB
>>   	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION
>>   	select ARCH_WANT_FRAME_POINTERS
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index bf464de..5af9db2 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -346,6 +346,21 @@ static inline pgprot_t mk_sect_prot(pgprot_t prot)
>>   	return __pgprot(pgprot_val(prot) & ~PTE_TABLE_BIT);
>>   }
>>
>> +#ifdef CONFIG_NUMA_BALANCING
>> +/*
>> + * See the comment in include/asm-generic/pgtable.h
>> + */
>> +static inline int pte_protnone(pte_t pte)
>> +{
>> +	return (pte_val(pte) & (PTE_VALID | PTE_PROT_NONE)) == PTE_PROT_NONE;
>> +}
>> +
>> +static inline int pmd_protnone(pmd_t pmd)
>> +{
>> +	return pte_protnone(pmd_pte(pmd));
>> +}
>> +#endif
>
> Can these not be macros, like our other pte_* and pmd_* predicates in
> this header file?

They probably could be.  However there is precedence for the static 
inline form:

$ git grep pte_protnone | grep '\.h' | grep -v return
arch/arm64/include/asm/pgtable.h:static inline int pte_protnone(pte_t pte)
arch/powerpc/include/asm/book3s/64/hash.h:static inline int 
pte_protnone(pte_t pte)
arch/powerpc/include/asm/nohash/pgtable.h:static inline int 
pte_protnone(pte_t pte)
arch/s390/include/asm/pgtable.h:static inline int pte_protnone(pte_t pte)
arch/x86/include/asm/pgtable.h:static inline int pte_protnone(pte_t pte)
include/asm-generic/pgtable.h:static inline int pte_protnone(pte_t pte)


Actually I am a little surprised that all the arm64 pte_* and pmd_* 
things are not written as static inline to achieve added type safety. 
The generated code should be identical.


>
> Will
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms.
  2016-02-26 18:53   ` Will Deacon
@ 2016-02-26 19:51     ` David Daney
  2016-02-27  4:13       ` Ganapatrao Kulkarni
  0 siblings, 1 reply; 30+ messages in thread
From: David Daney @ 2016-02-26 19:51 UTC (permalink / raw)
  To: Will Deacon
  Cc: David Daney, linux-arm-kernel, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, Robert Richter,
	linux-kernel, David Daney

On 02/26/2016 10:53 AM, Will Deacon wrote:
[...]
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> new file mode 100644
>> index 0000000..604e886
>> --- /dev/null
>> +++ b/arch/arm64/mm/numa.c
>> @@ -0,0 +1,403 @@
[...]
>> +
>> +static int numa_off;
>> +static int numa_distance_cnt;
>> +static u8 *numa_distance;
>> +
>> +static __init int numa_parse_early_param(char *opt)
>> +{
>> +	if (!opt)
>> +		return -EINVAL;
>> +	if (!strncmp(opt, "off", 3)) {
>> +		pr_info("%s\n", "NUMA turned off");
>> +		numa_off = 1;
>> +	}
>> +	return 0;
>> +}
>> +early_param("numa", numa_parse_early_param);
>
> Curious, but when is this option actually useful?
>

Good point.  I will remove that bit, it was used as an aid in debugging 
while bringing up the patch set.


>> +
>> +cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
>> +EXPORT_SYMBOL(node_to_cpumask_map);
>> +
>> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
>> +
>> +/*
>> + * Returns a pointer to the bitmask of CPUs on Node 'node'.
>> + */
>> +const struct cpumask *cpumask_of_node(int node)
>> +{
>> +	if (WARN_ON(node >= nr_node_ids))
>> +		return cpu_none_mask;
>> +
>> +	if (WARN_ON(node_to_cpumask_map[node] == NULL))
>> +		return cpu_online_mask;
>> +
>> +	return node_to_cpumask_map[node];
>> +}
>> +EXPORT_SYMBOL(cpumask_of_node);
>> +
>> +#endif
>> +
>> +static void map_cpu_to_node(unsigned int cpu, int nid)
>> +{
>> +	set_cpu_numa_node(cpu, nid);
>> +	if (nid >= 0)
>> +		cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
>> +}
>> +
>> +static void unmap_cpu_to_node(unsigned int cpu)
>> +{
>> +	int nid = cpu_to_node(cpu);
>> +
>> +	if (nid >= 0)
>> +		cpumask_clear_cpu(cpu, node_to_cpumask_map[nid]);
>> +	set_cpu_numa_node(cpu, NUMA_NO_NODE);
>> +}
>
> How do you end up with negative nids this late in the game?
>

It might be possible with some of the hot plugging code.  It is a little 
paranoia programming.

If you really don't like it, we can remove it.

>> +
>> +void numa_clear_node(unsigned int cpu)
>> +{
>> +	unmap_cpu_to_node(cpu);
>
> Why don't you just inline this function?

Good point, I will do that.

[...]
>> +int __init numa_add_memblk(int nid, u64 start, u64 size)
>> +{
>> +	int ret;
>> +
>> +	ret = memblock_set_node(start, size, &memblock.memory, nid);
>> +	if (ret < 0) {
>> +		pr_err("NUMA: memblock [0x%llx - 0x%llx] failed to add on node %d\n",
>> +			start, (start + size - 1), nid);
>> +		return ret;
>> +	}
>> +
>> +	node_set(nid, numa_nodes_parsed);
>> +	pr_info("NUMA: Adding memblock [0x%llx - 0x%llx] on node %d\n",
>> +			start, (start + size - 1), nid);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(numa_add_memblk);
>
> But this is marked __init... (and you've done this elsewhere in the patch
> too).

I will fix these.

>
> Will
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms.
  2016-02-26 19:51     ` David Daney
@ 2016-02-27  4:13       ` Ganapatrao Kulkarni
  2016-02-29 10:12         ` Robert Richter
  0 siblings, 1 reply; 30+ messages in thread
From: Ganapatrao Kulkarni @ 2016-02-27  4:13 UTC (permalink / raw)
  To: David Daney
  Cc: Will Deacon, Mark Rutland, devicetree, linux-efi, Pawel Moll,
	Ian Campbell, Matt Fleming, Catalin Marinas, Ard Biesheuvel,
	David Daney, linux-kernel, Robert Richter, Rob Herring,
	David Daney, Kumar Gala, Grant Likely, Ganapatrao Kulkarni,
	Frank Rowand, linux-arm-kernel

On Sat, Feb 27, 2016 at 1:21 AM, David Daney <ddaney@caviumnetworks.com> wrote:
> On 02/26/2016 10:53 AM, Will Deacon wrote:
> [...]
>>>
>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>> new file mode 100644
>>> index 0000000..604e886
>>> --- /dev/null
>>> +++ b/arch/arm64/mm/numa.c
>>> @@ -0,0 +1,403 @@
>
> [...]
>>>
>>> +
>>> +static int numa_off;
>>> +static int numa_distance_cnt;
>>> +static u8 *numa_distance;
>>> +
>>> +static __init int numa_parse_early_param(char *opt)
>>> +{
>>> +       if (!opt)
>>> +               return -EINVAL;
>>> +       if (!strncmp(opt, "off", 3)) {
>>> +               pr_info("%s\n", "NUMA turned off");
>>> +               numa_off = 1;
>>> +       }
>>> +       return 0;
>>> +}
>>> +early_param("numa", numa_parse_early_param);
>>
>>
>> Curious, but when is this option actually useful?
>>
>
> Good point.  I will remove that bit, it was used as an aid in debugging
> while bringing up the patch set.

this is handy in debugging new platforms.
this boot argument option forces to boot as single node dummy system
adding all resources to node0.
>
>
>
>>> +
>>> +cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
>>> +EXPORT_SYMBOL(node_to_cpumask_map);
>>> +
>>> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
>>> +
>>> +/*
>>> + * Returns a pointer to the bitmask of CPUs on Node 'node'.
>>> + */
>>> +const struct cpumask *cpumask_of_node(int node)
>>> +{
>>> +       if (WARN_ON(node >= nr_node_ids))
>>> +               return cpu_none_mask;
>>> +
>>> +       if (WARN_ON(node_to_cpumask_map[node] == NULL))
>>> +               return cpu_online_mask;
>>> +
>>> +       return node_to_cpumask_map[node];
>>> +}
>>> +EXPORT_SYMBOL(cpumask_of_node);
>>> +
>>> +#endif
>>> +
>>> +static void map_cpu_to_node(unsigned int cpu, int nid)
>>> +{
>>> +       set_cpu_numa_node(cpu, nid);
>>> +       if (nid >= 0)
>>> +               cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
>>> +}
>>> +
>>> +static void unmap_cpu_to_node(unsigned int cpu)
>>> +{
>>> +       int nid = cpu_to_node(cpu);
>>> +
>>> +       if (nid >= 0)
>>> +               cpumask_clear_cpu(cpu, node_to_cpumask_map[nid]);
>>> +       set_cpu_numa_node(cpu, NUMA_NO_NODE);
>>> +}
>>
>>
>> How do you end up with negative nids this late in the game?
>>
>
> It might be possible with some of the hot plugging code.  It is a little
> paranoia programming.
>
> If you really don't like it, we can remove it.
>
>>> +
>>> +void numa_clear_node(unsigned int cpu)
>>> +{
>>> +       unmap_cpu_to_node(cpu);
>>
>>
>> Why don't you just inline this function?
>
>
> Good point, I will do that.
>
> [...]
>>>
>>> +int __init numa_add_memblk(int nid, u64 start, u64 size)
>>> +{
>>> +       int ret;
>>> +
>>> +       ret = memblock_set_node(start, size, &memblock.memory, nid);
>>> +       if (ret < 0) {
>>> +               pr_err("NUMA: memblock [0x%llx - 0x%llx] failed to add on
>>> node %d\n",
>>> +                       start, (start + size - 1), nid);
>>> +               return ret;
>>> +       }
>>> +
>>> +       node_set(nid, numa_nodes_parsed);
>>> +       pr_info("NUMA: Adding memblock [0x%llx - 0x%llx] on node %d\n",
>>> +                       start, (start + size - 1), nid);
>>> +       return ret;
>>> +}
>>> +EXPORT_SYMBOL(numa_add_memblk);
>>
>>
>> But this is marked __init... (and you've done this elsewhere in the patch
>> too).
>
>
> I will fix these.
>
>
>>
>> Will
>>
>

Ganapat
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms.
  2016-02-27  4:13       ` Ganapatrao Kulkarni
@ 2016-02-29 10:12         ` Robert Richter
  0 siblings, 0 replies; 30+ messages in thread
From: Robert Richter @ 2016-02-29 10:12 UTC (permalink / raw)
  To: Ganapatrao Kulkarni
  Cc: David Daney, Will Deacon, Mark Rutland, devicetree, linux-efi,
	Pawel Moll, Ian Campbell, Matt Fleming, Catalin Marinas,
	Ard Biesheuvel, David Daney, linux-kernel, Rob Herring,
	David Daney, Kumar Gala, Grant Likely, Ganapatrao Kulkarni,
	Frank Rowand, linux-arm-kernel

On 27.02.16 09:43:49, Ganapatrao Kulkarni wrote:
> On Sat, Feb 27, 2016 at 1:21 AM, David Daney <ddaney@caviumnetworks.com> wrote:
> > On 02/26/2016 10:53 AM, Will Deacon wrote:

> >>> +static __init int numa_parse_early_param(char *opt)
> >>> +{
> >>> +       if (!opt)
> >>> +               return -EINVAL;
> >>> +       if (!strncmp(opt, "off", 3)) {
> >>> +               pr_info("%s\n", "NUMA turned off");
> >>> +               numa_off = 1;
> >>> +       }
> >>> +       return 0;
> >>> +}
> >>> +early_param("numa", numa_parse_early_param);
> >>
> >>
> >> Curious, but when is this option actually useful?
> >>
> >
> > Good point.  I will remove that bit, it was used as an aid in debugging
> > while bringing up the patch set.
> 
> this is handy in debugging new platforms.
> this boot argument option forces to boot as single node dummy system
> adding all resources to node0.

This is used to disable numa in a numa enabled kernel esp. for
debugging/enabling/booting single node systems. Same kernel parameter
as for x86 and other archs. Should not being removed.

-Robert

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 3/5] dt, numa: Add NUMA dt binding implementation.
  2016-02-23  1:58 ` [PATCH v12 3/5] dt, numa: Add NUMA dt binding implementation David Daney
@ 2016-02-29 17:29   ` Robert Richter
  2016-02-29 18:13     ` David Daney
  0 siblings, 1 reply; 30+ messages in thread
From: Robert Richter @ 2016-02-29 17:29 UTC (permalink / raw)
  To: David Daney
  Cc: Will Deacon, linux-arm-kernel, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, Robert Richter,
	linux-kernel, David Daney

On 22.02.16 17:58:21, David Daney wrote:
> From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
> 
> ADD device tree node parsing for NUMA topology using device
> "numa-node-id" property distance-map.
> 
> Reviewed-by: Robert Richter <rrichter@cavium.com>
> Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
> Signed-off-by: David Daney <david.daney@cavium.com>
> ---
>  drivers/of/Kconfig   |   3 +
>  drivers/of/Makefile  |   1 +
>  drivers/of/of_numa.c | 211 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/of.h   |   9 +++
>  4 files changed, 224 insertions(+)
>  create mode 100644 drivers/of/of_numa.c
> 
> diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
> index e2a4841..b3bec3a 100644
> --- a/drivers/of/Kconfig
> +++ b/drivers/of/Kconfig
> @@ -112,4 +112,7 @@ config OF_OVERLAY
>  	  While this option is selected automatically when needed, you can
>  	  enable it manually to improve device tree unit test coverage.
>  
> +config OF_NUMA

In arch/arm64/Kconfig you now need to:

	select OF_NUMA if NUMA && OF

This should depend here on OF and NUMA and enabled in that case. Why
moving that to arch code? This duplicates code as the same needs to be
implemented for every arch.

-Robert

> +	bool
> +
>  endif # OF

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms.
  2016-02-23  1:58 ` [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms David Daney
                     ` (2 preceding siblings ...)
  2016-02-26 18:53   ` Will Deacon
@ 2016-02-29 17:34   ` Robert Richter
  2016-02-29 23:42     ` David Daney
  3 siblings, 1 reply; 30+ messages in thread
From: Robert Richter @ 2016-02-29 17:34 UTC (permalink / raw)
  To: David Daney
  Cc: Will Deacon, linux-arm-kernel, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, Robert Richter,
	linux-kernel, David Daney

On 22.02.16 17:58:22, David Daney wrote:
> From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
> 
> Attempt to get the memory and CPU NUMA node via of_numa.  If that
> fails, default the dummy NUMA node and map all memory and CPUs to node
> 0.
> 
> Tested-by: Shannon Zhao <shannon.zhao@linaro.org>
> Reviewed-by: Robert Richter <rrichter@cavium.com>
> Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
> Signed-off-by: David Daney <david.daney@cavium.com>
> ---
>  arch/arm64/Kconfig                |  26 +++
>  arch/arm64/include/asm/mmzone.h   |  12 ++
>  arch/arm64/include/asm/numa.h     |  45 +++++
>  arch/arm64/include/asm/topology.h |  10 +
>  arch/arm64/kernel/pci.c           |  10 +
>  arch/arm64/kernel/setup.c         |   4 +
>  arch/arm64/kernel/smp.c           |   4 +
>  arch/arm64/mm/Makefile            |   1 +
>  arch/arm64/mm/init.c              |  34 +++-
>  arch/arm64/mm/mmu.c               |   1 +
>  arch/arm64/mm/numa.c              | 403 ++++++++++++++++++++++++++++++++++++++
>  11 files changed, 545 insertions(+), 5 deletions(-)
>  create mode 100644 arch/arm64/include/asm/mmzone.h
>  create mode 100644 arch/arm64/include/asm/numa.h
>  create mode 100644 arch/arm64/mm/numa.c

> +static int __init numa_init(int (*init_func)(void))
> +{
> +	int ret;
> +
> +	nodes_clear(numa_nodes_parsed);
> +	nodes_clear(node_possible_map);
> +	nodes_clear(node_online_map);
> +	numa_free_distance();
> +
> +	ret = numa_alloc_distance();
> +	if (ret < 0)
> +		return ret;

If you move this before the remaining initializers, you will need to
clean this up on error. So better move it back after
numa_register_nodes() as it was in v10. This should work since
distances are used not earlier than numa is enabled.

-Robert

> +
> +	ret = init_func();
> +	if (ret < 0)
> +		return ret;
> +
> +	if (nodes_empty(numa_nodes_parsed))
> +		return -EINVAL;
> +
> +	ret = numa_register_nodes();
> +	if (ret < 0)
> +		return ret;
> +
> +	setup_node_to_cpumask_map();
> +
> +	/* init boot processor */
> +	cpu_to_node_map[0] = 0;
> +	map_cpu_to_node(0, 0);
> +
> +	return 0;
> +}

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 3/5] dt, numa: Add NUMA dt binding implementation.
  2016-02-29 17:29   ` Robert Richter
@ 2016-02-29 18:13     ` David Daney
  2016-02-29 19:45       ` Robert Richter
  0 siblings, 1 reply; 30+ messages in thread
From: David Daney @ 2016-02-29 18:13 UTC (permalink / raw)
  To: Robert Richter
  Cc: David Daney, Will Deacon, linux-arm-kernel, Rob Herring,
	Pawel Moll, Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, Robert Richter,
	linux-kernel, David Daney

On 02/29/2016 09:29 AM, Robert Richter wrote:
> On 22.02.16 17:58:21, David Daney wrote:
>> From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
>>
>> ADD device tree node parsing for NUMA topology using device
>> "numa-node-id" property distance-map.
>>
>> Reviewed-by: Robert Richter <rrichter@cavium.com>
>> Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
>> Signed-off-by: David Daney <david.daney@cavium.com>
>> ---
>>   drivers/of/Kconfig   |   3 +
>>   drivers/of/Makefile  |   1 +
>>   drivers/of/of_numa.c | 211 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>   include/linux/of.h   |   9 +++
>>   4 files changed, 224 insertions(+)
>>   create mode 100644 drivers/of/of_numa.c
>>
>> diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
>> index e2a4841..b3bec3a 100644
>> --- a/drivers/of/Kconfig
>> +++ b/drivers/of/Kconfig
>> @@ -112,4 +112,7 @@ config OF_OVERLAY
>>   	  While this option is selected automatically when needed, you can
>>   	  enable it manually to improve device tree unit test coverage.
>>
>> +config OF_NUMA
>
> In arch/arm64/Kconfig you now need to:
>
> 	select OF_NUMA if NUMA && OF
>
> This should depend here on OF and NUMA and enabled in that case. Why
> moving that to arch code?

The dependency on of_numa.o is in the architecture specific code, so 
that is where the Kconfig select should be too.

For new code, we try to avoid putting architecture specific things into 
the core Kconfig files.

Since arm64 is the only architecture that is currently using this, we 
don't want to select it for non-arm64 kernels that happen to select NUMA 
&& OF.

I will restore the depends, as it is guard against errors in the 
architecture specific Kconfig select statements, but I am going to leave 
the NUMA && OF in the arm64/Kconfig for clarity.

> This duplicates code as the same needs to be
> implemented for every arch.
>

Better to duplicate a single Kconfig select statement a few times in 
architecture specific Kconfig files, than force people to continuously 
modify the depends in  drivers/of/Kconfig

David.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 3/5] dt, numa: Add NUMA dt binding implementation.
  2016-02-29 18:13     ` David Daney
@ 2016-02-29 19:45       ` Robert Richter
  2016-02-29 22:56         ` David Daney
  0 siblings, 1 reply; 30+ messages in thread
From: Robert Richter @ 2016-02-29 19:45 UTC (permalink / raw)
  To: David Daney
  Cc: David Daney, Will Deacon, linux-arm-kernel, Rob Herring,
	Pawel Moll, Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, Robert Richter,
	linux-kernel, David Daney

On 29.02.16 10:13:48, David Daney wrote:
> On 02/29/2016 09:29 AM, Robert Richter wrote:
> >On 22.02.16 17:58:21, David Daney wrote:
> >>From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
> >>
> >>ADD device tree node parsing for NUMA topology using device
> >>"numa-node-id" property distance-map.
> >>
> >>Reviewed-by: Robert Richter <rrichter@cavium.com>
> >>Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
> >>Signed-off-by: David Daney <david.daney@cavium.com>
> >>---
> >>  drivers/of/Kconfig   |   3 +
> >>  drivers/of/Makefile  |   1 +
> >>  drivers/of/of_numa.c | 211 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  include/linux/of.h   |   9 +++
> >>  4 files changed, 224 insertions(+)
> >>  create mode 100644 drivers/of/of_numa.c
> >>
> >>diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
> >>index e2a4841..b3bec3a 100644
> >>--- a/drivers/of/Kconfig
> >>+++ b/drivers/of/Kconfig
> >>@@ -112,4 +112,7 @@ config OF_OVERLAY
> >>  	  While this option is selected automatically when needed, you can
> >>  	  enable it manually to improve device tree unit test coverage.
> >>
> >>+config OF_NUMA
> >
> >In arch/arm64/Kconfig you now need to:
> >
> >	select OF_NUMA if NUMA && OF
> >
> >This should depend here on OF and NUMA and enabled in that case. Why
> >moving that to arch code?
> 
> The dependency on of_numa.o is in the architecture specific code, so that is
> where the Kconfig select should be too.

If you grep for select in Kconfig files you will see that dependency
checkers in the form of select ... if ... are very rare. Since select
does not check for depends-on at all, defbool y is commonly used to
enable a config in case all dependencies match. If OF_NUMA only
supports ARM64, why not state this there? It is common to enable
certain configs in drivers/ only for a some archs.

> For new code, we try to avoid putting architecture specific things into the
> core Kconfig files.

But to avoid arch configs in generic Kconfigs, define HAVE_OF_NUMA,
add it as dependency to OF_NUMA and select it in arch/arm64/Kconfig.
OF_NUMA is then enabled with defbool y if all requirements are met.

This also addresses your concerns below.

-Robert

> Since arm64 is the only architecture that is currently using this, we don't
> want to select it for non-arm64 kernels that happen to select NUMA && OF.
> 
> I will restore the depends, as it is guard against errors in the
> architecture specific Kconfig select statements, but I am going to leave the
> NUMA && OF in the arm64/Kconfig for clarity.
> 
> >This duplicates code as the same needs to be
> >implemented for every arch.
> >
> 
> Better to duplicate a single Kconfig select statement a few times in
> architecture specific Kconfig files, than force people to continuously
> modify the depends in  drivers/of/Kconfig

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 3/5] dt, numa: Add NUMA dt binding implementation.
  2016-02-29 19:45       ` Robert Richter
@ 2016-02-29 22:56         ` David Daney
  0 siblings, 0 replies; 30+ messages in thread
From: David Daney @ 2016-02-29 22:56 UTC (permalink / raw)
  To: Robert Richter
  Cc: David Daney, Will Deacon, linux-arm-kernel, Rob Herring,
	Pawel Moll, Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, Robert Richter,
	linux-kernel, David Daney

On 02/29/2016 11:45 AM, Robert Richter wrote:
> On 29.02.16 10:13:48, David Daney wrote:
>> On 02/29/2016 09:29 AM, Robert Richter wrote:
>>> On 22.02.16 17:58:21, David Daney wrote:
>>>> From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
>>>>
>>>> ADD device tree node parsing for NUMA topology using device
>>>> "numa-node-id" property distance-map.
>>>>
>>>> Reviewed-by: Robert Richter <rrichter@cavium.com>
>>>> Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
>>>> Signed-off-by: David Daney <david.daney@cavium.com>
>>>> ---
>>>>   drivers/of/Kconfig   |   3 +
>>>>   drivers/of/Makefile  |   1 +
>>>>   drivers/of/of_numa.c | 211 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>   include/linux/of.h   |   9 +++
>>>>   4 files changed, 224 insertions(+)
>>>>   create mode 100644 drivers/of/of_numa.c
>>>>
>>>> diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
>>>> index e2a4841..b3bec3a 100644
>>>> --- a/drivers/of/Kconfig
>>>> +++ b/drivers/of/Kconfig
>>>> @@ -112,4 +112,7 @@ config OF_OVERLAY
>>>>   	  While this option is selected automatically when needed, you can
>>>>   	  enable it manually to improve device tree unit test coverage.
>>>>
>>>> +config OF_NUMA
>>>
>>> In arch/arm64/Kconfig you now need to:
>>>
>>> 	select OF_NUMA if NUMA && OF
>>>
>>> This should depend here on OF and NUMA and enabled in that case. Why
>>> moving that to arch code?
>>
>> The dependency on of_numa.o is in the architecture specific code, so that is
>> where the Kconfig select should be too.
>
> If you grep for select in Kconfig files you will see that dependency
> checkers in the form of select ... if ... are very rare.

You are right...

$ find . -name Kconfig | xargs grep -e 'select.*if' | wc -l
1513
$ find . -name Kconfig | xargs grep -e 'select' | wc -l
14179

The 'if' variety is encountered in fewer than 11% of the cases.

That said, I am not sure if this type of statistic should be used to 
evaluate code "goodness".

[...]
>
> But to avoid arch configs in generic Kconfigs, define HAVE_OF_NUMA,
> add it as dependency to OF_NUMA and select it in arch/arm64/Kconfig.
> OF_NUMA is then enabled with defbool y if all requirements are met.
>

Really, what we want is NEEDS_OF_NUMA.  That is exactly what "select 
OF_NUMA" means.

There is no need to build a Rube Goldberg device out of these Kconfig 
things.  The "select OF_NUMA if NUMA && OF" is concise, allowed by the 
Kconfig system and exactly expresses when the file should be compiled.

Unless there are other objections, I am going to use the the Kconfig 
changes in the v12 form.

David.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms.
  2016-02-29 17:34   ` Robert Richter
@ 2016-02-29 23:42     ` David Daney
  2016-03-01 12:21       ` Robert Richter
  0 siblings, 1 reply; 30+ messages in thread
From: David Daney @ 2016-02-29 23:42 UTC (permalink / raw)
  To: Robert Richter
  Cc: Will Deacon, linux-arm-kernel, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, Robert Richter,
	linux-kernel, David Daney

On 02/29/2016 09:34 AM, Robert Richter wrote:
> On 22.02.16 17:58:22, David Daney wrote:
>> From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
>>
>> Attempt to get the memory and CPU NUMA node via of_numa.  If that
>> fails, default the dummy NUMA node and map all memory and CPUs to node
>> 0.
>>
>> Tested-by: Shannon Zhao <shannon.zhao@linaro.org>
>> Reviewed-by: Robert Richter <rrichter@cavium.com>
>> Signed-off-by: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
>> Signed-off-by: David Daney <david.daney@cavium.com>
>> ---
>>   arch/arm64/Kconfig                |  26 +++
>>   arch/arm64/include/asm/mmzone.h   |  12 ++
>>   arch/arm64/include/asm/numa.h     |  45 +++++
>>   arch/arm64/include/asm/topology.h |  10 +
>>   arch/arm64/kernel/pci.c           |  10 +
>>   arch/arm64/kernel/setup.c         |   4 +
>>   arch/arm64/kernel/smp.c           |   4 +
>>   arch/arm64/mm/Makefile            |   1 +
>>   arch/arm64/mm/init.c              |  34 +++-
>>   arch/arm64/mm/mmu.c               |   1 +
>>   arch/arm64/mm/numa.c              | 403 ++++++++++++++++++++++++++++++++++++++
>>   11 files changed, 545 insertions(+), 5 deletions(-)
>>   create mode 100644 arch/arm64/include/asm/mmzone.h
>>   create mode 100644 arch/arm64/include/asm/numa.h
>>   create mode 100644 arch/arm64/mm/numa.c
>
>> +static int __init numa_init(int (*init_func)(void))
>> +{
>> +	int ret;
>> +
>> +	nodes_clear(numa_nodes_parsed);
>> +	nodes_clear(node_possible_map);
>> +	nodes_clear(node_online_map);
>> +	numa_free_distance();

         ^^^^^^^^^^^^^
      Cleanup for any previous numa_alloc_distance()

>> +
>> +	ret = numa_alloc_distance();
>> +	if (ret < 0)
>> +		return ret;
>
> If you move this before the remaining initializers, you will need to
> clean this up on error.

Yes, we do this.  See above.

> So better move it back after
> numa_register_nodes() as it was in v10. This should work since
> distances are used not earlier than numa is enabled.

I moved it here for a reason.

The init_func (of_numa_init() in this case) makes callbacks that use the 
numa_distance object.  We need to allocate it before using it. 
Allocating it after calling the init_func() is too late.


>
> -Robert
>
>> +
>> +	ret = init_func();
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	if (nodes_empty(numa_nodes_parsed))
>> +		return -EINVAL;
>> +
>> +	ret = numa_register_nodes();
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	setup_node_to_cpumask_map();
>> +
>> +	/* init boot processor */
>> +	cpu_to_node_map[0] = 0;
>> +	map_cpu_to_node(0, 0);
>> +
>> +	return 0;
>> +}

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms.
  2016-02-29 23:42     ` David Daney
@ 2016-03-01 12:21       ` Robert Richter
  0 siblings, 0 replies; 30+ messages in thread
From: Robert Richter @ 2016-03-01 12:21 UTC (permalink / raw)
  To: David Daney
  Cc: Robert Richter, Will Deacon, linux-arm-kernel, Rob Herring,
	Pawel Moll, Mark Rutland, Ian Campbell, Kumar Gala, devicetree,
	Ard Biesheuvel, Frank Rowand, Grant Likely, Catalin Marinas,
	Matt Fleming, linux-efi, Ganapatrao Kulkarni, linux-kernel,
	David Daney

On 29.02.16 15:42:58, David Daney wrote:
> On 02/29/2016 09:34 AM, Robert Richter wrote:
> >On 22.02.16 17:58:22, David Daney wrote:
> >>From: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>

> >>+static int __init numa_init(int (*init_func)(void))
> >>+{
> >>+	int ret;
> >>+
> >>+	nodes_clear(numa_nodes_parsed);
> >>+	nodes_clear(node_possible_map);
> >>+	nodes_clear(node_online_map);
> >>+	numa_free_distance();
> 
>         ^^^^^^^^^^^^^
>      Cleanup for any previous numa_alloc_distance()
> 
> >>+
> >>+	ret = numa_alloc_distance();
> >>+	if (ret < 0)
> >>+		return ret;
> >
> >If you move this before the remaining initializers, you will need to
> >clean this up on error.
> 
> Yes, we do this.  See above.
> 
> >So better move it back after
> >numa_register_nodes() as it was in v10. This should work since
> >distances are used not earlier than numa is enabled.
> 
> I moved it here for a reason.
> 
> The init_func (of_numa_init() in this case) makes callbacks that use the
> numa_distance object.  We need to allocate it before using it. Allocating it
> after calling the init_func() is too late.

Sounds reasonable and looks sane now.

Thanks,

-Robert

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2016-03-01 12:21 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-23  1:58 [PATCH v12 0/5] arm64, numa: Add numa support for arm64 platforms David Daney
2016-02-23  1:58 ` [PATCH v12 1/5] efi: ARM/arm64: ignore DT memory nodes instead of removing them David Daney
2016-02-23 11:58   ` Mark Rutland
2016-02-23 12:16     ` Will Deacon
2016-02-23 12:20       ` Ard Biesheuvel
2016-02-23 22:12     ` Rob Herring
2016-02-24 19:38       ` Mark Rutland
2016-02-24 19:03     ` Frank Rowand
2016-02-24 19:30       ` Rob Herring
2016-02-24 19:33       ` Mark Rutland
2016-02-23  1:58 ` [PATCH v12 2/5] Documentation, dt, numa: dt bindings for NUMA David Daney
2016-02-23  1:58 ` [PATCH v12 3/5] dt, numa: Add NUMA dt binding implementation David Daney
2016-02-29 17:29   ` Robert Richter
2016-02-29 18:13     ` David Daney
2016-02-29 19:45       ` Robert Richter
2016-02-29 22:56         ` David Daney
2016-02-23  1:58 ` [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms David Daney
2016-02-23 10:26   ` Will Deacon
2016-02-23 17:34     ` David Daney
2016-02-23 17:39   ` David Daney
2016-02-26 18:53   ` Will Deacon
2016-02-26 19:51     ` David Daney
2016-02-27  4:13       ` Ganapatrao Kulkarni
2016-02-29 10:12         ` Robert Richter
2016-02-29 17:34   ` Robert Richter
2016-02-29 23:42     ` David Daney
2016-03-01 12:21       ` Robert Richter
2016-02-23  1:58 ` [PATCH v12 5/5] arm64, mm, numa: Add NUMA balancing support for arm64 David Daney
2016-02-26 18:53   ` Will Deacon
2016-02-26 19:26     ` David Daney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).