linux-arm-kernel.lists.infradead.org archive mirror
* [RFC PATCH v2 0/4] arm64:numa: Add numa support for arm64 platforms.
@ 2014-11-21 21:23 Ganapatrao Kulkarni
  2014-11-21 21:23 ` [RFC PATCH v2 1/4] arm64: defconfig: increase NR_CPUS range to 2-128 Ganapatrao Kulkarni
                   ` (3 more replies)
  0 siblings, 4 replies; 35+ messages in thread
From: Ganapatrao Kulkarni @ 2014-11-21 21:23 UTC (permalink / raw)
  To: linux-arm-kernel

This is the v2 patch set adding NUMA support for arm64 based platforms.
These patches were tested on Cavium's multi-node (2 node topology) simulator.
This patch set defines and implements DT bindings for the NUMA mapping of cores and memory.
Also tested using UEFI, which passes multiple memory range details through the UEFI
system table. The test cases in the numactl-2.0.9 package were run as well.

v2:
Defined and implemented the NUMA map for memory, the mapping of cores to
nodes, and the proximity distance matrix between nodes.

v1:
Initial patch set to support NUMA on arm64 platforms.


Ganapatrao Kulkarni (4):
  arm64: defconfig: increase NR_CPUS range to 2-128
  Documentation: arm64/arm: dt bindings for numa.
  arm64:thunder: Add initial dts for Cavium's Thunder SoC in 2 Node
    topology.
  arm64:numa: adding numa support for arm64 platforms.

 Documentation/devicetree/bindings/arm/numa.txt | 103 ++++
 arch/arm64/Kconfig                             |  37 +-
 arch/arm64/boot/dts/thunder-88xx-2n.dts        |  88 ++++
 arch/arm64/boot/dts/thunder-88xx-2n.dtsi       | 694 +++++++++++++++++++++++++
 arch/arm64/include/asm/mmzone.h                |  32 ++
 arch/arm64/include/asm/numa.h                  |  35 ++
 arch/arm64/kernel/setup.c                      |   8 +
 arch/arm64/kernel/smp.c                        |   2 +
 arch/arm64/mm/Makefile                         |   1 +
 arch/arm64/mm/init.c                           |  34 +-
 arch/arm64/mm/numa.c                           | 631 ++++++++++++++++++++++
 11 files changed, 1657 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/arm/numa.txt
 create mode 100644 arch/arm64/boot/dts/thunder-88xx-2n.dts
 create mode 100644 arch/arm64/boot/dts/thunder-88xx-2n.dtsi
 create mode 100644 arch/arm64/include/asm/mmzone.h
 create mode 100644 arch/arm64/include/asm/numa.h
 create mode 100644 arch/arm64/mm/numa.c

-- 
1.8.1.4

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 1/4] arm64: defconfig: increase NR_CPUS range to 2-128
  2014-11-21 21:23 [RFC PATCH v2 0/4] arm64:numa: Add numa support for arm64 platforms Ganapatrao Kulkarni
@ 2014-11-21 21:23 ` Ganapatrao Kulkarni
  2014-11-24 11:53   ` Arnd Bergmann
  2014-11-21 21:23 ` [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa Ganapatrao Kulkarni
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 35+ messages in thread
From: Ganapatrao Kulkarni @ 2014-11-21 21:23 UTC (permalink / raw)
  To: linux-arm-kernel

Raise the maximum NR_CPUS limit to 128. This is needed for Cavium's
Thunder system, which will have 96 cores in a multi-node configuration.

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@caviumnetworks.com>
---
 arch/arm64/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c1ad0df..f272926 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -302,8 +302,8 @@ config SCHED_SMT
 	  places. If unsure say N here.
 
 config NR_CPUS
-	int "Maximum number of CPUs (2-64)"
-	range 2 64
+	int "Maximum number of CPUs (2-128)"
+	range 2 128
 	depends on SMP
 	# These have to remain sorted largest to smallest
 	default "64"
-- 
1.8.1.4


* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-11-21 21:23 [RFC PATCH v2 0/4] arm64:numa: Add numa support for arm64 platforms Ganapatrao Kulkarni
  2014-11-21 21:23 ` [RFC PATCH v2 1/4] arm64: defconfig: increase NR_CPUS range to 2-128 Ganapatrao Kulkarni
@ 2014-11-21 21:23 ` Ganapatrao Kulkarni
  2014-11-25  3:55   ` Shannon Zhao
  2014-11-21 21:23 ` [RFC PATCH v2 3/4] arm64:thunder: Add initial dts for Cavium's Thunder SoC in 2 Node topology Ganapatrao Kulkarni
  2014-11-21 21:23 ` [RFC PATCH v2 4/4] arm64:numa: adding numa support for arm64 platforms Ganapatrao Kulkarni
  3 siblings, 1 reply; 35+ messages in thread
From: Ganapatrao Kulkarni @ 2014-11-21 21:23 UTC (permalink / raw)
  To: linux-arm-kernel

DT bindings for the NUMA map for memory, the mapping of cores to nodes,
and the proximity distance matrix between nodes.

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@caviumnetworks.com>
---
 Documentation/devicetree/bindings/arm/numa.txt | 103 +++++++++++++++++++++++++
 1 file changed, 103 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/numa.txt

diff --git a/Documentation/devicetree/bindings/arm/numa.txt b/Documentation/devicetree/bindings/arm/numa.txt
new file mode 100644
index 0000000..ec6bf2d
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/numa.txt
@@ -0,0 +1,103 @@
+==============================================================================
+NUMA binding description.
+==============================================================================
+
+==============================================================================
+1 - Introduction
+==============================================================================
+
+Systems employing a Non-Uniform Memory Access (NUMA) architecture contain
+collections of hardware resources, including processors, memory, and I/O
+buses, that comprise what is commonly known as a "NUMA node". Processor
+accesses to memory within the local NUMA node are generally faster than
+processor accesses to memory outside of the local NUMA node. The DT
+defines interfaces that allow the platform to convey NUMA node topology
+information to the OS.
+
+==============================================================================
+2 - numa-map node
+==============================================================================
+
+The NUMA binding maps memory ranges and CPUs to their respective NUMA
+nodes.
+
+The binding is defined using a numa-map node.
+The numa-map node has the following properties to define the NUMA topology.
+
+- mem-map:	This property defines the association between a range of
+		memory and the proximity domain/NUMA node to which it belongs.
+
+Note: the memory range addresses are passed using either the memory node
+of the DT or the UEFI system table, and must match the addresses in mem-map.
+
+- cpu-map:	This property defines the association between a range of
+		processors (a range of cpu ids) and the proximity domain
+		to which they belong.
+
+- node-matrix:	This table provides a matrix that describes the relative
+		distance (memory latency) between all system localities.
+		The value of each entry [i j distance] in the node-matrix
+		table, where i represents a row of the matrix and j
+		represents a column of the matrix, indicates the relative
+		distance from proximity domain/NUMA node i to every other
+		node j in the system (including itself).
+
+The numa-map node must contain the appropriate #address-cells,
+#size-cells and #node-count properties.
+
+
+==============================================================================
+3 - Example dts
+==============================================================================
+
+Example 1: 2 node system, each node having 8 CPUs and a memory range.
+
+	numa-map {
+		#address-cells = <2>;
+		#size-cells = <1>;
+		#node-count = <2>;
+		mem-map =  <0x0 0x00000000 0>,
+		           <0x100 0x00000000 1>;
+
+		cpu-map = <0 7 0>,
+			  <8 15 1>;
+
+		node-matrix = <0 0 10>,
+			      <0 1 20>,
+			      <1 0 20>,
+			      <1 1 10>;
+	};
+
+Example 2: 4 node system, each node having 8 CPUs and a memory range.
+
+	numa-map {
+		#address-cells = <2>;
+		#size-cells = <1>;
+		#node-count = <4>;
+		mem-map =  <0x0 0x00000000 0>,
+		           <0x100 0x00000000 1>,
+		           <0x200 0x00000000 2>,
+		           <0x300 0x00000000 3>;
+
+		cpu-map = <0 7 0>,
+			  <8 15 1>,
+			  <16 23 2>,
+			  <24 31 3>;
+
+		node-matrix = <0 0 10>,
+			      <0 1 20>,
+			      <0 2 20>,
+			      <0 3 20>,
+			      <1 0 20>,
+			      <1 1 10>,
+			      <1 2 20>,
+			      <1 3 20>,
+			      <2 0 20>,
+			      <2 1 20>,
+			      <2 2 10>,
+			      <2 3 20>,
+			      <3 0 20>,
+			      <3 1 20>,
+			      <3 2 20>,
+			      <3 3 10>;
+	};
-- 
1.8.1.4


* [RFC PATCH v2 3/4] arm64:thunder: Add initial dts for Cavium's Thunder SoC in 2 Node topology.
  2014-11-21 21:23 [RFC PATCH v2 0/4] arm64:numa: Add numa support for arm64 platforms Ganapatrao Kulkarni
  2014-11-21 21:23 ` [RFC PATCH v2 1/4] arm64: defconfig: increase NR_CPUS range to 2-128 Ganapatrao Kulkarni
  2014-11-21 21:23 ` [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa Ganapatrao Kulkarni
@ 2014-11-21 21:23 ` Ganapatrao Kulkarni
  2014-11-24 11:59   ` Arnd Bergmann
  2014-11-24 17:01   ` Marc Zyngier
  2014-11-21 21:23 ` [RFC PATCH v2 4/4] arm64:numa: adding numa support for arm64 platforms Ganapatrao Kulkarni
  3 siblings, 2 replies; 35+ messages in thread
From: Ganapatrao Kulkarni @ 2014-11-21 21:23 UTC (permalink / raw)
  To: linux-arm-kernel

Add a devicetree definition for Cavium's Thunder SoC in a 2 node topology.

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@caviumnetworks.com>
---
 arch/arm64/boot/dts/thunder-88xx-2n.dts  |  88 ++++
 arch/arm64/boot/dts/thunder-88xx-2n.dtsi | 694 +++++++++++++++++++++++++++++++
 2 files changed, 782 insertions(+)
 create mode 100644 arch/arm64/boot/dts/thunder-88xx-2n.dts
 create mode 100644 arch/arm64/boot/dts/thunder-88xx-2n.dtsi

diff --git a/arch/arm64/boot/dts/thunder-88xx-2n.dts b/arch/arm64/boot/dts/thunder-88xx-2n.dts
new file mode 100644
index 0000000..f87a7a4
--- /dev/null
+++ b/arch/arm64/boot/dts/thunder-88xx-2n.dts
@@ -0,0 +1,88 @@
+/*
+ * Cavium Thunder DTS file - Thunder board description
+ *
+ * Copyright (C) 2014, Cavium Inc.
+ *
+ * This file is dual-licensed: you can use it either under the terms
+ * of the GPL or the X11 license, at your option. Note that this dual
+ * licensing only applies to this file, and not this project as a
+ * whole.
+ *
+ *  a) This library is free software; you can redistribute it and/or
+ *     modify it under the terms of the GNU General Public License as
+ *     published by the Free Software Foundation; either version 2 of the
+ *     License, or (at your option) any later version.
+ *
+ *     This library is distributed in the hope that it will be useful,
+ *     but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *     GNU General Public License for more details.
+ *
+ *     You should have received a copy of the GNU General Public
+ *     License along with this library; if not, write to the Free
+ *     Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston,
+ *     MA 02110-1301 USA
+ *
+ * Or, alternatively,
+ *
+ *  b) Permission is hereby granted, free of charge, to any person
+ *     obtaining a copy of this software and associated documentation
+ *     files (the "Software"), to deal in the Software without
+ *     restriction, including without limitation the rights to use,
+ *     copy, modify, merge, publish, distribute, sublicense, and/or
+ *     sell copies of the Software, and to permit persons to whom the
+ *     Software is furnished to do so, subject to the following
+ *     conditions:
+ *
+ *     The above copyright notice and this permission notice shall be
+ *     included in all copies or substantial portions of the Software.
+ *
+ *     THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ *     EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+ *     OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ *     NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+ *     HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+ *     WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ *     FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ *     OTHER DEALINGS IN THE SOFTWARE.
+ */
+
+/dts-v1/;
+
+/include/ "thunder-88xx-2n.dtsi"
+
+/ {
+	model = "Cavium ThunderX CN88XX board";
+	compatible = "cavium,thunder-88xx";
+
+	aliases {
+		serial0 = &uaa0;
+		serial1 = &uaa1;
+	};
+
+	memory@00c00000 {
+		device_type = "memory";
+		reg = <0x0 0x00000000 0x0 0x80000000>;
+	};
+
+	memory@10000000000 {
+		device_type = "memory";
+		reg = <0x100 0x00000000 0x0 0x80000000>;
+	};
+
+	numa-map {
+		#address-cells = <2>;
+		#size-cells = <1>;
+		#node-count = <2>;
+		mem-map = <0x0 0x00000000 0>,
+			  <0x100 0x00000000 1>;
+
+		cpu-map = <0 47 0>,
+			  <48 95 1>;
+
+		node-matrix = <0 0 10>,
+			      <0 1 20>,
+			      <1 0 20>,
+			      <1 1 10>;
+	};
+};
diff --git a/arch/arm64/boot/dts/thunder-88xx-2n.dtsi b/arch/arm64/boot/dts/thunder-88xx-2n.dtsi
new file mode 100644
index 0000000..3f217b4
--- /dev/null
+++ b/arch/arm64/boot/dts/thunder-88xx-2n.dtsi
@@ -0,0 +1,694 @@
+/*
+ * Cavium Thunder DTS file - Thunder SoC description
+ *
+ * Copyright (C) 2014, Cavium Inc.
+ *
+ * This file is dual-licensed: you can use it either under the terms
+ * of the GPL or the X11 license, at your option. Note that this dual
+ * licensing only applies to this file, and not this project as a
+ * whole.
+ *
+ *  a) This library is free software; you can redistribute it and/or
+ *     modify it under the terms of the GNU General Public License as
+ *     published by the Free Software Foundation; either version 2 of the
+ *     License, or (at your option) any later version.
+ *
+ *     This library is distributed in the hope that it will be useful,
+ *     but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *     GNU General Public License for more details.
+ *
+ *     You should have received a copy of the GNU General Public
+ *     License along with this library; if not, write to the Free
+ *     Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston,
+ *     MA 02110-1301 USA
+ *
+ * Or, alternatively,
+ *
+ *  b) Permission is hereby granted, free of charge, to any person
+ *     obtaining a copy of this software and associated documentation
+ *     files (the "Software"), to deal in the Software without
+ *     restriction, including without limitation the rights to use,
+ *     copy, modify, merge, publish, distribute, sublicense, and/or
+ *     sell copies of the Software, and to permit persons to whom the
+ *     Software is furnished to do so, subject to the following
+ *     conditions:
+ *
+ *     The above copyright notice and this permission notice shall be
+ *     included in all copies or substantial portions of the Software.
+ *
+ *     THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ *     EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+ *     OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ *     NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+ *     HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+ *     WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ *     FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ *     OTHER DEALINGS IN THE SOFTWARE.
+ */
+
+/ {
+	compatible = "cavium,thunder-88xx";
+	interrupt-parent = <&gic0>;
+	#address-cells = <2>;
+	#size-cells = <2>;
+
+	psci {
+		compatible = "arm,psci-0.2";
+		method = "smc";
+	};
+
+	cpus {
+		#address-cells = <2>;
+		#size-cells = <0>;
+
+		CPU0: cpu@000 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x000>;
+			enable-method = "psci";
+		};
+		CPU1: cpu@001 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x001>;
+			enable-method = "psci";
+		};
+		CPU2: cpu@002 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x002>;
+			enable-method = "psci";
+		};
+		CPU3: cpu@003 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x003>;
+			enable-method = "psci";
+		};
+		CPU4: cpu@004 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x004>;
+			enable-method = "psci";
+		};
+		CPU5: cpu@005 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x005>;
+			enable-method = "psci";
+		};
+		CPU6: cpu@006 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x006>;
+			enable-method = "psci";
+		};
+		CPU7: cpu@007 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x007>;
+			enable-method = "psci";
+		};
+		CPU8: cpu@008 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x008>;
+			enable-method = "psci";
+		};
+		CPU9: cpu@009 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x009>;
+			enable-method = "psci";
+		};
+		CPU10: cpu@00a {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x00a>;
+			enable-method = "psci";
+		};
+		CPU11: cpu@00b {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x00b>;
+			enable-method = "psci";
+		};
+		CPU12: cpu@00c {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x00c>;
+			enable-method = "psci";
+		};
+		CPU13: cpu@00d {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x00d>;
+			enable-method = "psci";
+		};
+		CPU14: cpu@00e {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x00e>;
+			enable-method = "psci";
+		};
+		CPU15: cpu@00f {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x00f>;
+			enable-method = "psci";
+		};
+		CPU16: cpu@100 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x100>;
+			enable-method = "psci";
+		};
+		CPU17: cpu@101 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x101>;
+			enable-method = "psci";
+		};
+		CPU18: cpu@102 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x102>;
+			enable-method = "psci";
+		};
+		CPU19: cpu@103 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x103>;
+			enable-method = "psci";
+		};
+		CPU20: cpu@104 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x104>;
+			enable-method = "psci";
+		};
+		CPU21: cpu@105 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x105>;
+			enable-method = "psci";
+		};
+		CPU22: cpu@106 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x106>;
+			enable-method = "psci";
+		};
+		CPU23: cpu@107 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x107>;
+			enable-method = "psci";
+		};
+		CPU24: cpu@108 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x108>;
+			enable-method = "psci";
+		};
+		CPU25: cpu@109 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x109>;
+			enable-method = "psci";
+		};
+		CPU26: cpu@10a {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10a>;
+			enable-method = "psci";
+		};
+		CPU27: cpu@10b {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10b>;
+			enable-method = "psci";
+		};
+		CPU28: cpu@10c {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10c>;
+			enable-method = "psci";
+		};
+		CPU29: cpu@10d {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10d>;
+			enable-method = "psci";
+		};
+		CPU30: cpu@10e {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10e>;
+			enable-method = "psci";
+		};
+		CPU31: cpu@10f {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10f>;
+			enable-method = "psci";
+		};
+		CPU32: cpu@200 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x200>;
+			enable-method = "psci";
+		};
+		CPU33: cpu@201 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x201>;
+			enable-method = "psci";
+		};
+		CPU34: cpu@202 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x202>;
+			enable-method = "psci";
+		};
+		CPU35: cpu@203 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x203>;
+			enable-method = "psci";
+		};
+		CPU36: cpu@204 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x204>;
+			enable-method = "psci";
+		};
+		CPU37: cpu@205 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x205>;
+			enable-method = "psci";
+		};
+		CPU38: cpu@206 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x206>;
+			enable-method = "psci";
+		};
+		CPU39: cpu@207 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x207>;
+			enable-method = "psci";
+		};
+		CPU40: cpu@208 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x208>;
+			enable-method = "psci";
+		};
+		CPU41: cpu@209 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x209>;
+			enable-method = "psci";
+		};
+		CPU42: cpu@20a {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x20a>;
+			enable-method = "psci";
+		};
+		CPU43: cpu@20b {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x20b>;
+			enable-method = "psci";
+		};
+		CPU44: cpu@20c {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x20c>;
+			enable-method = "psci";
+		};
+		CPU45: cpu@20d {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x20d>;
+			enable-method = "psci";
+		};
+		CPU46: cpu@20e {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x20e>;
+			enable-method = "psci";
+		};
+		CPU47: cpu@20f {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x20f>;
+			enable-method = "psci";
+		};
+		CPU48: cpu@10000 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10000>;
+			enable-method = "psci";
+		};
+		CPU49: cpu@10001 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10001>;
+			enable-method = "psci";
+		};
+		CPU50: cpu@10002 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10002>;
+			enable-method = "psci";
+		};
+		CPU51: cpu@10003 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10003>;
+			enable-method = "psci";
+		};
+		CPU52: cpu@10004 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10004>;
+			enable-method = "psci";
+		};
+		CPU53: cpu@10005 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10005>;
+			enable-method = "psci";
+		};
+		CPU54: cpu@10006 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10006>;
+			enable-method = "psci";
+		};
+		CPU55: cpu@10007 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10007>;
+			enable-method = "psci";
+		};
+		CPU56: cpu@10008 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10008>;
+			enable-method = "psci";
+		};
+		CPU57: cpu@10009 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10009>;
+			enable-method = "psci";
+		};
+		CPU58: cpu@1000a {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1000a>;
+			enable-method = "psci";
+		};
+		CPU59: cpu@1000b {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1000b>;
+			enable-method = "psci";
+		};
+		CPU60: cpu@1000c {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1000c>;
+			enable-method = "psci";
+		};
+		CPU61: cpu@1000d {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1000d>;
+			enable-method = "psci";
+		};
+		CPU62: cpu@1000e {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1000e>;
+			enable-method = "psci";
+		};
+		CPU63: cpu@1000f {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1000f>;
+			enable-method = "psci";
+		};
+		CPU64: cpu@10100 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10100>;
+			enable-method = "psci";
+		};
+		CPU65: cpu@10101 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10101>;
+			enable-method = "psci";
+		};
+		CPU66: cpu@10102 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10102>;
+			enable-method = "psci";
+		};
+		CPU67: cpu@10103 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10103>;
+			enable-method = "psci";
+		};
+		CPU68: cpu@10104 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10104>;
+			enable-method = "psci";
+		};
+		CPU69: cpu@10105 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10105>;
+			enable-method = "psci";
+		};
+		CPU70: cpu@10106 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10106>;
+			enable-method = "psci";
+		};
+		CPU71: cpu@10107 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10107>;
+			enable-method = "psci";
+		};
+		CPU72: cpu@10108 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10108>;
+			enable-method = "psci";
+		};
+		CPU73: cpu@10109 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10109>;
+			enable-method = "psci";
+		};
+		CPU74: cpu@1010a {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1010a>;
+			enable-method = "psci";
+		};
+		CPU75: cpu@1010b {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1010b>;
+			enable-method = "psci";
+		};
+		CPU76: cpu@1010c {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1010c>;
+			enable-method = "psci";
+		};
+		CPU77: cpu@1010d {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1010d>;
+			enable-method = "psci";
+		};
+		CPU78: cpu@1010e {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1010e>;
+			enable-method = "psci";
+		};
+		CPU79: cpu@1010f {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1010f>;
+			enable-method = "psci";
+		};
+		CPU80: cpu@10200 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10200>;
+			enable-method = "psci";
+		};
+		CPU81: cpu@10201 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10201>;
+			enable-method = "psci";
+		};
+		CPU82: cpu@10202 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10202>;
+			enable-method = "psci";
+		};
+		CPU83: cpu@10203 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10203>;
+			enable-method = "psci";
+		};
+		CPU84: cpu@10204 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10204>;
+			enable-method = "psci";
+		};
+		CPU85: cpu@10205 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10205>;
+			enable-method = "psci";
+		};
+		CPU86: cpu@10206 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10206>;
+			enable-method = "psci";
+		};
+		CPU87: cpu@10207 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10207>;
+			enable-method = "psci";
+		};
+		CPU88: cpu@10208 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10208>;
+			enable-method = "psci";
+		};
+		CPU89: cpu@10209 {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x10209>;
+			enable-method = "psci";
+		};
+		CPU90: cpu@1020a {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1020a>;
+			enable-method = "psci";
+		};
+		CPU91: cpu@1020b {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1020b>;
+			enable-method = "psci";
+		};
+		CPU92: cpu@1020c {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1020c>;
+			enable-method = "psci";
+		};
+		CPU93: cpu@1020d {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1020d>;
+			enable-method = "psci";
+		};
+		CPU94: cpu@1020e {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1020e>;
+			enable-method = "psci";
+		};
+		CPU95: cpu@1020f {
+			device_type = "cpu";
+			compatible = "cavium,thunder", "arm,armv8";
+			reg = <0x0 0x1020f>;
+			enable-method = "psci";
+		};
+	};
+
+	timer {
+		compatible = "arm,armv8-timer";
+		interrupts = <1 13 0xff01>,
+		             <1 14 0xff01>,
+		             <1 11 0xff01>,
+		             <1 10 0xff01>;
+	};
+
+	soc {
+		compatible = "simple-bus";
+		#address-cells = <2>;
+		#size-cells = <2>;
+		ranges;
+
+		refclk50mhz: refclk50mhz {
+			compatible = "fixed-clock";
+			#clock-cells = <0>;
+			clock-frequency = <50000000>;
+			clock-output-names = "refclk50mhz";
+		};
+
+		gic0: interrupt-controller@8010,00000000 {
+			compatible = "arm,gic-v3";
+			#interrupt-cells = <3>;
+			#address-cells = <2>;
+			#size-cells = <2>;
+			#redistributor-regions = <2>;
+			ranges;
+			interrupt-controller;
+			reg = <0x8010 0x00000000 0x0 0x010000>, /* GICD */
+			      <0x8010 0x80000000 0x0 0x600000>, /* GICR Node 0 */
+			      <0x9010 0x80000000 0x0 0x600000>; /* GICR Node 1 */
+			interrupts = <1 9 0xf04>;
+		      };
+
+		uaa0: serial@87e0,24000000 {
+			compatible = "arm,pl011", "arm,primecell";
+			reg = <0x87e0 0x24000000 0x0 0x1000>;
+			interrupts = <1 21 4>;
+			clocks = <&refclk50mhz>;
+			clock-names = "apb_pclk";
+		};
+
+		uaa1: serial@87e0,25000000 {
+			compatible = "arm,pl011", "arm,primecell";
+			reg = <0x87e0 0x25000000 0x0 0x1000>;
+			interrupts = <1 22 4>;
+			clocks = <&refclk50mhz>;
+			clock-names = "apb_pclk";
+		};
+	};
+};
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 4/4] arm64:numa: adding numa support for arm64 platforms.
  2014-11-21 21:23 [RFC PATCH v2 0/4] arm64:numa: Add numa support for arm64 platforms Ganapatrao Kulkarni
                   ` (2 preceding siblings ...)
  2014-11-21 21:23 ` [RFC PATCH v2 3/4] arm64:thunder: Add initial dts for Cavium's Thunder SoC in 2 Node topology Ganapatrao Kulkarni
@ 2014-11-21 21:23 ` Ganapatrao Kulkarni
  2014-12-06  9:36   ` Ashok Kumar
       [not found]   ` <5482ce36.c9e2420a.5d40.71c7SMTPIN_ADDED_BROKEN@mx.google.com>
  3 siblings, 2 replies; 35+ messages in thread
From: Ganapatrao Kulkarni @ 2014-11-21 21:23 UTC (permalink / raw)
  To: linux-arm-kernel

Add NUMA support for arm64-based platforms.
The NUMA mapping is created by parsing the "numa-map" DT node.

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@caviumnetworks.com>
---
 arch/arm64/Kconfig              |  33 +++
 arch/arm64/include/asm/mmzone.h |  32 ++
 arch/arm64/include/asm/numa.h   |  35 +++
 arch/arm64/kernel/setup.c       |   8 +
 arch/arm64/kernel/smp.c         |   2 +
 arch/arm64/mm/Makefile          |   1 +
 arch/arm64/mm/init.c            |  34 ++-
 arch/arm64/mm/numa.c            | 631 ++++++++++++++++++++++++++++++++++++++++
 8 files changed, 770 insertions(+), 6 deletions(-)
 create mode 100644 arch/arm64/include/asm/mmzone.h
 create mode 100644 arch/arm64/include/asm/numa.h
 create mode 100644 arch/arm64/mm/numa.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index f272926..7deeda2 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -59,6 +59,7 @@ config ARM64
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_RCU_TABLE_FREE
 	select HAVE_SYSCALL_TRACEPOINTS
+	select HAVE_MEMBLOCK_NODE_MAP if NUMA
 	select IRQ_DOMAIN
 	select MODULES_USE_ELF_RELA
 	select NO_BOOTMEM
@@ -315,6 +316,38 @@ config HOTPLUG_CPU
 	  Say Y here to experiment with turning CPUs off and on.  CPUs
 	  can be controlled through /sys/devices/system/cpu.
 
+# Common NUMA Features
+config NUMA
+	bool "NUMA Memory Allocation and Scheduler Support"
+	depends on SMP
+	---help---
+	  Enable NUMA (Non Uniform Memory Access) support.
+
+	  The kernel will try to allocate memory used by a CPU on the
+	  local memory controller of the CPU and add some more
+	  NUMA awareness to the kernel.
+
+config ARM64_DT_NUMA
+	bool "DT NUMA detection"
+	depends on NUMA
+	default n
+	---help---
+	  Enable DT-based NUMA node detection.
+
+config NODES_SHIFT
+	int "Maximum NUMA Nodes (as a power of 2)"
+	range 1 10
+	default "2"
+	depends on NEED_MULTIPLE_NODES
+	---help---
+	  Specify the maximum number of NUMA Nodes available on the target
+	  system.  Increases memory reserved to accommodate various tables.
+
+config USE_PERCPU_NUMA_NODE_ID
+	def_bool y
+	depends on NUMA
+
+
 source kernel/Kconfig.preempt
 
 config HZ
diff --git a/arch/arm64/include/asm/mmzone.h b/arch/arm64/include/asm/mmzone.h
new file mode 100644
index 0000000..d27ee66
--- /dev/null
+++ b/arch/arm64/include/asm/mmzone.h
@@ -0,0 +1,32 @@
+#ifndef __ASM_ARM64_MMZONE_H_
+#define __ASM_ARM64_MMZONE_H_
+
+#ifdef CONFIG_NUMA
+
+#include <linux/mmdebug.h>
+#include <asm/smp.h>
+#include <linux/types.h>
+#include <asm/numa.h>
+
+extern struct pglist_data *node_data[];
+
+#define NODE_DATA(nid)		(node_data[nid])
+
+
+struct numa_memblk {
+	u64			start;
+	u64			end;
+	int			nid;
+};
+
+struct numa_meminfo {
+	int			nr_blks;
+	struct numa_memblk	blk[NR_NODE_MEMBLKS];
+};
+
+void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi);
+int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
+void __init numa_reset_distance(void);
+
+#endif /* CONFIG_NUMA */
+#endif /* __ASM_ARM64_MMZONE_H_ */
diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
new file mode 100644
index 0000000..e4c2ed0
--- /dev/null
+++ b/arch/arm64/include/asm/numa.h
@@ -0,0 +1,35 @@
+#ifndef _ASM_ARM64_NUMA_H
+#define _ASM_ARM64_NUMA_H
+
+#include <linux/nodemask.h>
+#include <asm/topology.h>
+
+#ifdef CONFIG_NUMA
+
+#define NR_NODE_MEMBLKS		(MAX_NUMNODES * 2)
+#define ZONE_ALIGN (1UL << (MAX_ORDER + PAGE_SHIFT))
+
+/* currently, arm64 implements flat NUMA topology */
+#define parent_node(node)	(node)
+
+/* dummy definitions for pci functions */
+#define pcibus_to_node(node)	0
+#define cpumask_of_pcibus(bus)	0
+
+const struct cpumask *cpumask_of_node(int node);
+/* Mappings between node number and cpus on that node. */
+extern cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
+
+void __init arm64_numa_init(void);
+int __init numa_add_memblk(u32 nodeid, u64 start, u64 end);
+void numa_store_cpu_info(int cpu);
+void numa_set_node(int cpu, int node);
+void numa_clear_node(int cpu);
+void numa_add_cpu(int cpu);
+void numa_remove_cpu(int cpu);
+#else	/* CONFIG_NUMA */
+static inline void numa_store_cpu_info(int cpu)	{ }
+static inline void arm64_numa_init(void)		{ }
+#endif	/* CONFIG_NUMA */
+#endif	/* _ASM_ARM64_NUMA_H */
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 2437196..80b4a9e 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -425,6 +425,9 @@ static int __init topology_init(void)
 {
 	int i;
 
+	for_each_online_node(i)
+		register_one_node(i);
+
 	for_each_possible_cpu(i) {
 		struct cpu *cpu = &per_cpu(cpu_data.cpu, i);
 		cpu->hotpluggable = 1;
@@ -461,7 +464,12 @@ static int c_show(struct seq_file *m, void *v)
 		 * "processor".  Give glibc what it expects.
 		 */
 #ifdef CONFIG_SMP
+	if (IS_ENABLED(CONFIG_NUMA)) {
+		seq_printf(m, "processor\t: %d", i);
+		seq_printf(m, " [nid: %d]\n", cpu_to_node(i));
+	} else {
 		seq_printf(m, "processor\t: %d\n", i);
+	}
 #endif
 	}
 
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index b06d1d9..1d1e86f 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -50,6 +50,7 @@
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
 #include <asm/ptrace.h>
+#include <asm/numa.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/ipi.h>
@@ -123,6 +124,7 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
 static void smp_store_cpu_info(unsigned int cpuid)
 {
 	store_cpu_topology(cpuid);
+	numa_store_cpu_info(cpuid);
 }
 
 /*
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index c56179e..c86e6de 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -3,3 +3,4 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
 				   ioremap.o mmap.o pgd.o mmu.o \
 				   context.o proc.o pageattr.o
 obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
+obj-$(CONFIG_NUMA)		+= numa.o
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 494297c..6fd6802 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -39,6 +39,7 @@
 #include <asm/setup.h>
 #include <asm/sizes.h>
 #include <asm/tlb.h>
+#include <asm/numa.h>
 
 #include "mm.h"
 
@@ -73,6 +74,20 @@ static phys_addr_t max_zone_dma_phys(void)
 	return min(offset + (1ULL << 32), memblock_end_of_DRAM());
 }
 
+#ifdef CONFIG_NUMA
+static void __init zone_sizes_init(unsigned long min, unsigned long max)
+{
+	unsigned long max_zone_pfns[MAX_NR_ZONES];
+
+	memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
+	if (IS_ENABLED(CONFIG_ZONE_DMA))
+		max_zone_pfns[ZONE_DMA] = PFN_DOWN(max_zone_dma_phys());
+	max_zone_pfns[ZONE_NORMAL] = max;
+
+	free_area_init_nodes(max_zone_pfns);
+}
+
+#else
 static void __init zone_sizes_init(unsigned long min, unsigned long max)
 {
 	struct memblock_region *reg;
@@ -111,6 +126,7 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
 
 	free_area_init_node(0, zone_size, min, zhole_size);
 }
+#endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_HAVE_ARCH_PFN_VALID
 int pfn_valid(unsigned long pfn)
@@ -128,10 +144,15 @@ static void arm64_memory_present(void)
 static void arm64_memory_present(void)
 {
 	struct memblock_region *reg;
+	int nid = 0;
 
-	for_each_memblock(memory, reg)
-		memory_present(0, memblock_region_memory_base_pfn(reg),
-			       memblock_region_memory_end_pfn(reg));
+	for_each_memblock(memory, reg) {
+#ifdef CONFIG_NUMA
+		nid = reg->nid;
+#endif
+		memory_present(nid, memblock_region_memory_base_pfn(reg),
+				memblock_region_memory_end_pfn(reg));
+	}
 }
 #endif
 
@@ -167,6 +188,10 @@ void __init bootmem_init(void)
 	min = PFN_UP(memblock_start_of_DRAM());
 	max = PFN_DOWN(memblock_end_of_DRAM());
 
+	high_memory = __va((max << PAGE_SHIFT) - 1) + 1;
+	max_pfn = max_low_pfn = max;
+
+	arm64_numa_init();
 	/*
 	 * Sparsemem tries to allocate bootmem in memory_present(), so must be
 	 * done after the fixed reservations.
@@ -175,9 +200,6 @@ void __init bootmem_init(void)
 
 	sparse_init();
 	zone_sizes_init(min, max);
-
-	high_memory = __va((max << PAGE_SHIFT) - 1) + 1;
-	max_pfn = max_low_pfn = max;
 }
 
 #ifndef CONFIG_SPARSEMEM_VMEMMAP
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
new file mode 100644
index 0000000..dbe76a3
--- /dev/null
+++ b/arch/arm64/mm/numa.c
@@ -0,0 +1,631 @@
+/*
+ * NUMA support, based on the x86 implementation.
+ *
+ * Copyright (C) 2014 Cavium Inc.
+ * Author: Ganapatrao Kulkarni <gkulkarni@cavium.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/string.h>
+#include <linux/init.h>
+#include <linux/bootmem.h>
+#include <linux/memblock.h>
+#include <linux/mmzone.h>
+#include <linux/ctype.h>
+#include <linux/module.h>
+#include <linux/nodemask.h>
+#include <linux/sched.h>
+#include <linux/topology.h>
+#include <linux/of.h>
+#include <linux/of_fdt.h>
+#include <asm/smp_plat.h>
+
+int __initdata numa_off;
+nodemask_t numa_nodes_parsed __initdata;
+static int numa_distance_cnt;
+static u8 *numa_distance;
+
+struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
+
+static struct {
+	u32 node_id;
+	u64 cpu_hwid;
+} cpu_info[NR_CPUS];
+
+EXPORT_SYMBOL(node_data);
+
+static struct numa_meminfo numa_meminfo;
+
+static __init int numa_setup(char *opt)
+{
+	if (!opt)
+		return -EINVAL;
+	if (!strncmp(opt, "off", 3)) {
+		pr_info("NUMA turned off\n");
+		numa_off = 1;
+	}
+	return 0;
+}
+early_param("numa", numa_setup);
+
+cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
+EXPORT_SYMBOL(node_to_cpumask_map);
+
+/*
+ * Returns a pointer to the bitmask of CPUs on Node 'node'.
+ */
+const struct cpumask *cpumask_of_node(int node)
+{
+	if (node >= nr_node_ids) {
+		pr_warn("cpumask_of_node(%d): node > nr_node_ids(%d)\n",
+			node, nr_node_ids);
+		dump_stack();
+		return cpu_none_mask;
+	}
+	if (node_to_cpumask_map[node] == NULL) {
+		pr_warn("cpumask_of_node(%d): no node_to_cpumask_map!\n",
+			node);
+		dump_stack();
+		return cpu_online_mask;
+	}
+	return node_to_cpumask_map[node];
+}
+EXPORT_SYMBOL(cpumask_of_node);
+
+
+int cpu_to_node_map[NR_CPUS];
+EXPORT_SYMBOL(cpu_to_node_map);
+
+void numa_clear_node(int cpu)
+{
+	cpu_to_node_map[cpu] = NUMA_NO_NODE;
+}
+
+/*
+ * Allocate node_to_cpumask_map based on number of available nodes
+ * Requires node_possible_map to be valid.
+ *
+ * Note: cpumask_of_node() is not valid until after this is done.
+ * (Use CONFIG_DEBUG_PER_CPU_MAPS to check this.)
+ */
+void __init setup_node_to_cpumask_map(void)
+{
+	unsigned int node;
+
+	/* setup nr_node_ids if not done yet */
+	if (nr_node_ids == MAX_NUMNODES)
+		setup_nr_node_ids();
+
+	/* allocate the map */
+	for (node = 0; node < nr_node_ids; node++)
+		alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]);
+
+	/* cpumask_of_node() will now work */
+	pr_debug("Node to cpumask map for %d nodes\n", nr_node_ids);
+}
+
+/*
+ *  Set the cpu to node and mem mapping
+ */
+void numa_store_cpu_info(int cpu)
+{
+	cpu_to_node_map[cpu] = cpu_info[cpu].node_id;
+	/* mapping of MPIDR/hwid, node and logical id */
+	cpu_info[cpu].cpu_hwid = cpu_logical_map(cpu);
+	cpumask_set_cpu(cpu, node_to_cpumask_map[cpu_to_node_map[cpu]]);
+	set_numa_node(cpu_to_node_map[cpu]);
+	set_numa_mem(local_memory_node(cpu_to_node_map[cpu]));
+}
+
+/**
+ * numa_add_memblk_to - Add one numa_memblk to a numa_meminfo
+ */
+
+static int __init numa_add_memblk_to(int nid, u64 start, u64 end,
+				     struct numa_meminfo *mi)
+{
+	/* ignore zero length blks */
+	if (start == end)
+		return 0;
+
+	/* whine about and ignore invalid blks */
+	if (start > end || nid < 0 || nid >= MAX_NUMNODES) {
+		pr_warn("NUMA: Warning: invalid memblk node %d [mem %#010Lx-%#010Lx]\n",
+				nid, start, end - 1);
+		return 0;
+	}
+
+	if (mi->nr_blks >= NR_NODE_MEMBLKS) {
+		pr_err("NUMA: too many memblk ranges\n");
+		return -EINVAL;
+	}
+
+	pr_info("NUMA: Adding memblock %d [0x%llx - 0x%llx] on node %d\n",
+			mi->nr_blks, start, end, nid);
+	mi->blk[mi->nr_blks].start = start;
+	mi->blk[mi->nr_blks].end = end;
+	mi->blk[mi->nr_blks].nid = nid;
+	mi->nr_blks++;
+	return 0;
+}
+
+/**
+ * numa_add_memblk - Add one numa_memblk to numa_meminfo
+ * @nid: NUMA node ID of the new memblk
+ * @start: Start address of the new memblk
+ * @end: End address of the new memblk
+ *
+ * Add a new memblk to the default numa_meminfo.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+#define MAX_PHYS_ADDR	((phys_addr_t)~0)
+
+int __init numa_add_memblk(u32 nid, u64 base, u64 size)
+{
+	const u64 phys_offset = __pa(PAGE_OFFSET);
+
+	base &= PAGE_MASK;
+	size &= PAGE_MASK;
+
+	if (base > MAX_PHYS_ADDR) {
+		pr_warn("NUMA: Ignoring memory block 0x%llx - 0x%llx\n",
+				base, base + size);
+		return -ENOMEM;
+	}
+
+	if (base + size > MAX_PHYS_ADDR) {
+		pr_info("NUMA: Ignoring memory range 0x%lx - 0x%llx\n",
+				ULONG_MAX, base + size);
+		size = MAX_PHYS_ADDR - base;
+	}
+
+	if (base + size < phys_offset) {
+		pr_warn("NUMA: Ignoring memory block 0x%llx - 0x%llx\n",
+			   base, base + size);
+		return -ENOMEM;
+	}
+	if (base < phys_offset) {
+		pr_info("NUMA: Ignoring memory range 0x%llx - 0x%llx\n",
+			   base, phys_offset);
+		size -= phys_offset - base;
+		base = phys_offset;
+	}
+
+	return numa_add_memblk_to(nid, base, base+size, &numa_meminfo);
+}
+EXPORT_SYMBOL(numa_add_memblk);
+
+/* Initialize NODE_DATA for a node on the local memory */
+static void __init setup_node_data(int nid, u64 start, u64 end)
+{
+	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+	u64 nd_pa;
+	void *nd;
+	int tnid;
+
+	start = roundup(start, ZONE_ALIGN);
+
+	pr_info("Initmem setup node %d [mem %#010Lx-%#010Lx]\n",
+	       nid, start, end - 1);
+
+	/*
+	 * Allocate node data.  Try node-local memory and then any node.
+	 */
+	nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
+	if (!nd_pa) {
+		nd_pa = __memblock_alloc_base(nd_size, SMP_CACHE_BYTES,
+					      MEMBLOCK_ALLOC_ACCESSIBLE);
+		if (!nd_pa) {
+			pr_err("Cannot find %zu bytes in node %d\n",
+			       nd_size, nid);
+			return;
+		}
+	}
+	nd = __va(nd_pa);
+
+	/* report and initialize */
+	pr_info("  NODE_DATA [mem %#010Lx-%#010Lx]\n",
+	       nd_pa, nd_pa + nd_size - 1);
+	tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
+	if (tnid != nid)
+		pr_info("    NODE_DATA(%d) on node %d\n", nid, tnid);
+
+	node_data[nid] = nd;
+	memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+	NODE_DATA(nid)->node_id = nid;
+	NODE_DATA(nid)->node_start_pfn = start >> PAGE_SHIFT;
+	NODE_DATA(nid)->node_spanned_pages = (end - start) >> PAGE_SHIFT;
+
+	node_set_online(nid);
+}
+
+/*
+ * Set nodes, which have memory in @mi, in *@nodemask.
+ */
+static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
+					      const struct numa_meminfo *mi)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
+		if (mi->blk[i].start != mi->blk[i].end &&
+		    mi->blk[i].nid != NUMA_NO_NODE)
+			node_set(mi->blk[i].nid, *nodemask);
+}
+
+/*
+ * Sanity check to catch more bad NUMA configurations (they are amazingly
+ * common).  Make sure the nodes cover all memory.
+ */
+static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
+{
+	u64 numaram, totalram;
+	int i;
+
+	numaram = 0;
+	for (i = 0; i < mi->nr_blks; i++) {
+		u64 s = mi->blk[i].start >> PAGE_SHIFT;
+		u64 e = mi->blk[i].end >> PAGE_SHIFT;
+
+		numaram += e - s;
+		numaram -= __absent_pages_in_range(mi->blk[i].nid, s, e);
+		if ((s64)numaram < 0)
+			numaram = 0;
+	}
+
+	totalram = max_pfn - absent_pages_in_range(0, max_pfn);
+
+	/* We seem to lose 3 pages somewhere. Allow 1M of slack. */
+	if ((s64)(totalram - numaram) >= (1 << (20 - PAGE_SHIFT))) {
+		pr_err("NUMA: nodes only cover %lluMB of your %lluMB Total RAM. Not used.\n",
+		       (numaram << PAGE_SHIFT) >> 20,
+		       (totalram << PAGE_SHIFT) >> 20);
+		return false;
+	}
+	return true;
+}
+
+/**
+ * numa_reset_distance - Reset NUMA distance table
+ *
+ * The current table is freed.  The next numa_set_distance() call will
+ * create a new one.
+ */
+void __init numa_reset_distance(void)
+{
+	size_t size = numa_distance_cnt * numa_distance_cnt *
+		sizeof(numa_distance[0]);
+
+	/* numa_distance could be 1LU marking allocation failure, test cnt */
+	if (numa_distance_cnt)
+		memblock_free(__pa(numa_distance), size);
+	numa_distance_cnt = 0;
+	numa_distance = NULL;	/* enable table creation */
+}
+
+static int __init numa_alloc_distance(void)
+{
+	nodemask_t nodes_parsed;
+	size_t size;
+	int i, j, cnt = 0;
+	u64 phys;
+
+	/* size the new table and allocate it */
+	nodes_parsed = numa_nodes_parsed;
+	numa_nodemask_from_meminfo(&nodes_parsed, &numa_meminfo);
+
+	for_each_node_mask(i, nodes_parsed)
+		cnt = i;
+	cnt++;
+	size = cnt * cnt * sizeof(numa_distance[0]);
+
+	phys = memblock_find_in_range(0, PFN_PHYS(max_pfn),
+				      size, PAGE_SIZE);
+	if (!phys) {
+		pr_warning("NUMA: Warning: can't allocate distance table!\n");
+		/* don't retry until explicitly reset */
+		numa_distance = (void *)1LU;
+		return -ENOMEM;
+	}
+	memblock_reserve(phys, size);
+
+	numa_distance = __va(phys);
+	numa_distance_cnt = cnt;
+
+	/* fill with the default distances */
+	for (i = 0; i < cnt; i++)
+		for (j = 0; j < cnt; j++)
+			numa_distance[i * cnt + j] = i == j ?
+				LOCAL_DISTANCE : REMOTE_DISTANCE;
+	pr_debug("NUMA: Initialized distance table, cnt=%d\n", cnt);
+
+	return 0;
+}
+
+/**
+ * numa_set_distance - Set NUMA distance from one NUMA to another
+ * @from: the 'from' node to set distance
+ * @to: the 'to'  node to set distance
+ * @distance: NUMA distance
+ *
+ * Set the distance from node @from to @to to @distance.  If distance table
+ * doesn't exist, one which is large enough to accommodate all the currently
+ * known nodes will be created.
+ *
+ * If such table cannot be allocated, a warning is printed and further
+ * calls are ignored until the distance table is reset with
+ * numa_reset_distance().
+ *
+ * If @from or @to is higher than the highest known node or lower than zero
+ * at the time of table creation or @distance doesn't make sense, the call
+ * is ignored.
+ * This is to allow simplification of specific NUMA config implementations.
+ */
+void __init numa_set_distance(int from, int to, int distance)
+{
+	if (!numa_distance && numa_alloc_distance() < 0)
+		return;
+
+	if (from >= numa_distance_cnt || to >= numa_distance_cnt ||
+			from < 0 || to < 0) {
+		pr_warn_once("NUMA: Warning: node ids are out of bound, from=%d to=%d distance=%d\n",
+			    from, to, distance);
+		return;
+	}
+
+	if ((u8)distance != distance ||
+	    (from == to && distance != LOCAL_DISTANCE)) {
+		pr_warn_once("NUMA: Warning: invalid distance parameter, from=%d to=%d distance=%d\n",
+			     from, to, distance);
+		return;
+	}
+
+	numa_distance[from * numa_distance_cnt + to] = distance;
+}
+
+int __node_distance(int from, int to)
+{
+	if (from >= numa_distance_cnt || to >= numa_distance_cnt)
+		return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
+	return numa_distance[from * numa_distance_cnt + to];
+}
+EXPORT_SYMBOL(__node_distance);
+
+static int __init numa_register_memblks(struct numa_meminfo *mi)
+{
+	unsigned long uninitialized_var(pfn_align);
+	int i, nid;
+
+	/* Account for nodes with cpus and no memory */
+	node_possible_map = numa_nodes_parsed;
+	numa_nodemask_from_meminfo(&node_possible_map, mi);
+	if (WARN_ON(nodes_empty(node_possible_map)))
+		return -EINVAL;
+
+	for (i = 0; i < mi->nr_blks; i++) {
+		struct numa_memblk *mb = &mi->blk[i];
+
+		memblock_set_node(mb->start, mb->end - mb->start,
+				  &memblock.memory, mb->nid);
+	}
+
+	/*
+	 * If sections array is gonna be used for pfn -> nid mapping, check
+	 * whether its granularity is fine enough.
+	 */
+#ifdef NODE_NOT_IN_PAGE_FLAGS
+	pfn_align = node_map_pfn_alignment();
+	if (pfn_align && pfn_align < PAGES_PER_SECTION) {
+		pr_warn("Node alignment %lluMB < min %lluMB, rejecting NUMA config\n",
+		       PFN_PHYS(pfn_align) >> 20,
+		       PFN_PHYS(PAGES_PER_SECTION) >> 20);
+		return -EINVAL;
+	}
+#endif
+	if (!numa_meminfo_cover_memory(mi))
+		return -EINVAL;
+
+	/* Finally register nodes. */
+	for_each_node_mask(nid, node_possible_map) {
+		u64 start = PFN_PHYS(max_pfn);
+		u64 end = 0;
+
+		for (i = 0; i < mi->nr_blks; i++) {
+			if (nid != mi->blk[i].nid)
+				continue;
+			start = min(mi->blk[i].start, start);
+			end = max(mi->blk[i].end, end);
+		}
+
+		if (start < end)
+			setup_node_data(nid, start, end);
+	}
+
+	/* Dump memblock with node info and return. */
+	memblock_dump_all();
+	return 0;
+}
+
+static int __init numa_init(int (*init_func)(void))
+{
+	int ret, i;
+
+	nodes_clear(node_possible_map);
+	nodes_clear(node_online_map);
+
+	ret = init_func();
+	if (ret < 0)
+		return ret;
+
+	ret = numa_register_memblks(&numa_meminfo);
+	if (ret < 0)
+		return ret;
+
+	for (i = 0; i < nr_cpu_ids; i++)
+		numa_clear_node(i);
+
+	setup_node_to_cpumask_map();
+	return 0;
+}
+
+/**
+ * dummy_numa_init - Fallback dummy NUMA init
+ *
+ * Used if there's no underlying NUMA architecture, NUMA initialization
+ * fails, or NUMA is disabled on the command line.
+ *
+ * Must online at least one node and add memory blocks that cover all
+ * allowed memory.  This function must not fail.
+ */
+static int __init dummy_numa_init(void)
+{
+	pr_info("No NUMA configuration found\n");
+	pr_info("Faking a node at [mem %#018Lx-%#018Lx]\n",
+	       0LLU, PFN_PHYS(max_pfn) - 1);
+	node_set(0, numa_nodes_parsed);
+	numa_add_memblk(0, 0, PFN_PHYS(max_pfn));
+
+	return 0;
+}
+
+/**
+ * early_init_dt_scan_numa_map - parse memory node and map nid to memory range.
+ */
+int __init early_init_dt_scan_numa_map(unsigned long node, const char *uname,
+				     int depth, void *data)
+{
+	const __be32 *numa_prop;
+	int nr_address_cells = OF_ROOT_NODE_ADDR_CELLS_DEFAULT;
+	int nr_size_cells = OF_ROOT_NODE_SIZE_CELLS_DEFAULT;
+	int node_count = MAX_NUMNODES;
+	int mem_ranges, cpu_ranges, matrix_count, i, length;
+
+	/* We are scanning "numa-map" nodes only */
+	if (strcmp(uname, "numa-map") != 0)
+		return 0;
+
+	numa_prop = of_get_flat_dt_prop(node, "#address-cells", &length);
+	if (numa_prop)
+		nr_address_cells = dt_mem_next_cell(
+				OF_ROOT_NODE_ADDR_CELLS_DEFAULT, &numa_prop);
+	pr_debug("NUMA-DT: #nr_address_cells = %u\n", nr_address_cells);
+
+	numa_prop = of_get_flat_dt_prop(node, "#size-cells", &length);
+	if (numa_prop)
+		nr_size_cells = dt_mem_next_cell(
+				OF_ROOT_NODE_SIZE_CELLS_DEFAULT, &numa_prop);
+	pr_debug("NUMA-DT: #nr_size_cells = %d\n", nr_size_cells);
+
+	numa_prop = of_get_flat_dt_prop(node, "#node-count", &length);
+	if (numa_prop == NULL)
+		return -EINVAL;
+	node_count = dt_mem_next_cell(nr_size_cells, &numa_prop);
+	pr_debug("NUMA-DT: #node-count = %d\n", node_count);
+
+	if (node_count > MAX_NUMNODES)
+		BUG();
+
+	for (i = 0; i < node_count; i++)
+		node_set(i, numa_nodes_parsed);
+
+	numa_prop = of_get_flat_dt_prop(node, "mem-map", &length);
+	if (numa_prop == NULL)
+		return -EINVAL;
+	mem_ranges = (length /
+			sizeof(__be32))/(nr_address_cells + nr_size_cells);
+	for (i = 0; i < mem_ranges; i++) {
+		u64 base;
+		u32 node;
+		struct memblock_region *reg;
+
+		base = dt_mem_next_cell(nr_address_cells, &numa_prop);
+		node = dt_mem_next_cell(nr_size_cells, &numa_prop);
+		pr_debug("NUMA-DT:  mem-address = %llx , node = %u\n",
+				base, node);
+		for_each_memblock(memory, reg) {
+			if (reg->base == base) {
+				numa_add_memblk(node, reg->base, reg->size);
+				break;
+			}
+		}
+	}
+
+	numa_prop = of_get_flat_dt_prop(node, "cpu-map", &length);
+	if (numa_prop == NULL)
+		return -EINVAL;
+	cpu_ranges = ((length / sizeof(__be32)) /
+			(nr_address_cells + nr_size_cells));
+	for (i = 0; i < cpu_ranges; i++) {
+		u32 cpus, cpue, node_id, j;
+		cpus = dt_mem_next_cell(nr_size_cells, &numa_prop);
+		cpue = dt_mem_next_cell(nr_size_cells, &numa_prop);
+		node_id = dt_mem_next_cell(nr_size_cells, &numa_prop);
+		for (j = cpus; j <= cpue; j++)
+			cpu_info[j].node_id = node_id;
+		pr_debug("NUMA-DT:  start cpu = %d end cpu = %d node-id %d\n",
+				cpus, cpue, node_id);
+	}
+
+
+	numa_prop = of_get_flat_dt_prop(node, "node-matrix", &length);
+	if (numa_prop == NULL)
+		return -EINVAL;
+
+	matrix_count = ((length / sizeof(__be32)) / (3 * nr_size_cells));
+	for (i = 0; i < matrix_count; i++) {
+		u32 nodea, nodeb, distance;
+
+		nodea = dt_mem_next_cell(nr_size_cells, &numa_prop);
+		nodeb = dt_mem_next_cell(nr_size_cells, &numa_prop);
+		distance = dt_mem_next_cell(nr_size_cells, &numa_prop);
+
+		numa_set_distance(nodea, nodeb, distance);
+		pr_debug("NUMA-DT:  distance[node%d -> node%d] = %d\n",
+				nodea, nodeb, distance);
+		/* Set default distance of node B->A same as A->B */
+		if (nodeb > nodea)
+			numa_set_distance(nodeb, nodea, distance);
+	}
+
+	return 0;
+}
+
+/* DT node mapping is already done in early_init_dt_scan_memory() */
+static inline int __init arm64_dt_numa_init(void)
+{
+	return of_scan_flat_dt(early_init_dt_scan_numa_map, NULL);
+}
+
+/**
+ * arm64_numa_init - Initialize NUMA
+ *
+ * Try each configured NUMA initialization method until one succeeds.  The
+ * last fallback is a dummy single-node config encompassing the whole of
+ * memory, which never fails.
+ */
+void __init arm64_numa_init(void)
+{
+	if (!numa_off) {
+#ifdef CONFIG_ARM64_DT_NUMA
+		if (!numa_init(arm64_dt_numa_init))
+			return;
+#endif
+	}
+
+	numa_init(dummy_numa_init);
+}
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 1/4] arm64: defconfig: increase NR_CPUS range to 2-128
  2014-11-21 21:23 ` [RFC PATCH v2 1/4] arm64: defconfig: increase NR_CPUS range to 2-128 Ganapatrao Kulkarni
@ 2014-11-24 11:53   ` Arnd Bergmann
  2014-12-09  1:57     ` Zi Shen Lim
  0 siblings, 1 reply; 35+ messages in thread
From: Arnd Bergmann @ 2014-11-24 11:53 UTC (permalink / raw)
  To: linux-arm-kernel

On Saturday 22 November 2014 02:53:27 Ganapatrao Kulkarni wrote:
> Raising the maximum limit to 128. This is needed for Cavium's
> Thunder system that will have 96 cores on Multi-node system.
> 
> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@caviumnetworks.com>
> 

Could we please raise the compile-time limit to the highest number that
you are able to boot successfully on some existing machine?

There isn't much point in doubling this every few months.

	Arnd

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 3/4] arm64:thunder: Add initial dts for Cavium's Thunder SoC in 2 Node topology.
  2014-11-21 21:23 ` [RFC PATCH v2 3/4] arm64:thunder: Add initial dts for Cavium's Thunder SoC in 2 Node topology Ganapatrao Kulkarni
@ 2014-11-24 11:59   ` Arnd Bergmann
  2014-11-24 16:32     ` Roy Franz
  2014-11-24 17:01   ` Marc Zyngier
  1 sibling, 1 reply; 35+ messages in thread
From: Arnd Bergmann @ 2014-11-24 11:59 UTC (permalink / raw)
  To: linux-arm-kernel

On Saturday 22 November 2014 02:53:29 Ganapatrao Kulkarni wrote:
> +/ {
> +	model = "Cavium ThunderX CN88XX board";
> +	compatible = "cavium,thunder-88xx";

No wildcards in compatible strings or model names please. List the
exact chip that you are using.

> +	aliases {
> +		serial0 = &uaa0;
> +		serial1 = &uaa1;
> +	};
> +
> +	memory@00c00000 {
> +		device_type = "memory";
> +		reg = <0x0 0x00000000 0x0 0x80000000>;
> +	};
> +
> +	memory@10000000000 {
> +		device_type = "memory";
> +		reg = <0x100 0x00000000 0x0 0x80000000>;
> +	};
> +
> +	numa-map {
> +		#address-cells = <2>;
> +		#size-cells = <1>;
> +		#node-count = <2>;
> +		mem-map = <0x0 0x00000000 0>,
> +		           <0x100 0x00000000 1>;
> +
> +               cpu-map = <0 47 0>,
> +			<48 95 1>;
> +
> +		node-matrix=    <0 0 10>,
> +				<0 1 20>,
> +				<1 0 20>,
> +				<1 1 10>;

I don't know how much history is behind this binding. Have you looked
at the sPAPR way of doing this? I don't remember exactly how that is
done, but we'd need a good reason to discard that and implement
something else for arm64.

If we create a new binding, I don't think the 'numa-map' node you
have here is the best solution. We already have device nodes for each
memory segment and each CPU in the system. Why not work with those
nodes directly?
> +
> +	timer {
> +		compatible = "arm,armv8-timer";
> +		interrupts = <1 13 0xff01>,
> +		             <1 14 0xff01>,
> +		             <1 11 0xff01>,
> +		             <1 10 0xff01>;
> +	};
> +
> +	soc {
> +		compatible = "simple-bus";
> +		#address-cells = <2>;
> +		#size-cells = <2>;
> +		ranges;
> +
> +		refclk50mhz: refclk50mhz {
> +			compatible = "fixed-clock";
> +			#clock-cells = <0>;
> +			clock-frequency = <50000000>;
> +			clock-output-names = "refclk50mhz";
> +		};

Why is the timer outside of the soc and the refclk is inside?
I would have expected the opposite.

	Arnd 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 3/4] arm64:thunder: Add initial dts for Cavium's Thunder SoC in 2 Node topology.
  2014-11-24 11:59   ` Arnd Bergmann
@ 2014-11-24 16:32     ` Roy Franz
  2014-11-24 17:01       ` Arnd Bergmann
  0 siblings, 1 reply; 35+ messages in thread
From: Roy Franz @ 2014-11-24 16:32 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Nov 24, 2014 at 6:59 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Saturday 22 November 2014 02:53:29 Ganapatrao Kulkarni wrote:
>> +/ {
>> +     model = "Cavium ThunderX CN88XX board";
>> +     compatible = "cavium,thunder-88xx";
>
> No wildcards in compatible strings or model names please. List the
> exact chip that you are using.
>
>> +     aliases {
>> +             serial0 = &uaa0;
>> +             serial1 = &uaa1;
>> +     };
>> +
>> +     memory@00c00000 {
>> +             device_type = "memory";
>> +             reg = <0x0 0x00000000 0x0 0x80000000>;
>> +     };
>> +
>> +     memory@10000000000 {
>> +             device_type = "memory";
>> +             reg = <0x100 0x00000000 0x0 0x80000000>;
>> +     };
>> +
>> +     numa-map {
>> +             #address-cells = <2>;
>> +             #size-cells = <1>;
>> +             #node-count = <2>;
>> +             mem-map = <0x0 0x00000000 0>,
>> +                        <0x100 0x00000000 1>;
>> +
>> +               cpu-map = <0 47 0>,
>> +                     <48 95 1>;
>> +
>> +             node-matrix=    <0 0 10>,
>> +                             <0 1 20>,
>> +                             <1 0 20>,
>> +                             <1 1 10>;
>
> I don't know how much history is behind this binding. Have you looked
> at the sPAPR way of doing this? I don't remember exactly how that is
> done, but we'd need a good reason to discard that and implement
> something else for arm64.
>
> If we create a new binding, I don't think the 'numa-map' node you
> have here is the best solution. We already have device nodes for each
> memory segment and each CPU in the system. Why not work with those
> nodes directly?

The DT memory nodes don't exist in an EFI system, as the EFI memory map
is used instead. Using EFI as the boot firmware doesn't require the use
of ACPI for hardware description, so the EFI/DT case is one that we
should support.

>> +
>> +     timer {
>> +             compatible = "arm,armv8-timer";
>> +             interrupts = <1 13 0xff01>,
>> +                          <1 14 0xff01>,
>> +                          <1 11 0xff01>,
>> +                          <1 10 0xff01>;
>> +     };
>> +
>> +     soc {
>> +             compatible = "simple-bus";
>> +             #address-cells = <2>;
>> +             #size-cells = <2>;
>> +             ranges;
>> +
>> +             refclk50mhz: refclk50mhz {
>> +                     compatible = "fixed-clock";
>> +                     #clock-cells = <0>;
>> +                     clock-frequency = <50000000>;
>> +                     clock-output-names = "refclk50mhz";
>> +             };
>
> Why is the timer outside of the soc and the refclk is inside?
> I would have expected the opposite.
>
>         Arnd
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 3/4] arm64:thunder: Add initial dts for Cavium's Thunder SoC in 2 Node topology.
  2014-11-21 21:23 ` [RFC PATCH v2 3/4] arm64:thunder: Add initial dts for Cavium's Thunder SoC in 2 Node topology Ganapatrao Kulkarni
  2014-11-24 11:59   ` Arnd Bergmann
@ 2014-11-24 17:01   ` Marc Zyngier
  1 sibling, 0 replies; 35+ messages in thread
From: Marc Zyngier @ 2014-11-24 17:01 UTC (permalink / raw)
  To: linux-arm-kernel

On 21/11/14 21:23, Ganapatrao Kulkarni wrote:
> adding devicetree definition for Cavium's Thunder SoC in 2 Node topology.
> 
> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@caviumnetworks.com>
> ---
>  arch/arm64/boot/dts/thunder-88xx-2n.dts  |  88 ++++
>  arch/arm64/boot/dts/thunder-88xx-2n.dtsi | 694 +++++++++++++++++++++++++++++++
>  2 files changed, 782 insertions(+)
>  create mode 100644 arch/arm64/boot/dts/thunder-88xx-2n.dts
>  create mode 100644 arch/arm64/boot/dts/thunder-88xx-2n.dtsi
> 

[...]

> diff --git a/arch/arm64/boot/dts/thunder-88xx-2n.dtsi b/arch/arm64/boot/dts/thunder-88xx-2n.dtsi
> new file mode 100644
> index 0000000..3f217b4
> --- /dev/null
> +++ b/arch/arm64/boot/dts/thunder-88xx-2n.dtsi

[...]

> +       timer {
> +               compatible = "arm,armv8-timer";
> +               interrupts = <1 13 0xff01>,
> +                            <1 14 0xff01>,
> +                            <1 11 0xff01>,
> +                            <1 10 0xff01>;
> +       };

These "0xff01" cells are bogus (the GICv3 binding only specifies values
1 and 4 for respectively edge and level triggered). My hunch is that
they should be 4, as the timers are likely to be level triggered.
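For reference, the valid trigger values come from the generic interrupt binding constants in include/dt-bindings/interrupt-controller/irq.h; a minimal sketch of a checker for the third interrupt cell, given that the GICv3 binding Marc cites only permits edge-rising (1) and level-high (4). The helper name is illustrative, not kernel code:

```c
#include <assert.h>

/* Generic DT interrupt trigger types, as defined in
 * include/dt-bindings/interrupt-controller/irq.h */
#define IRQ_TYPE_EDGE_RISING 1
#define IRQ_TYPE_LEVEL_HIGH  4

/* Hypothetical validity check for the third interrupt cell of a GICv3
 * interrupt specifier: only edge-rising and level-high are specified. */
static int gic_trigger_valid(unsigned int cell)
{
    return cell == IRQ_TYPE_EDGE_RISING || cell == IRQ_TYPE_LEVEL_HIGH;
}
```

Under such a check the 0xff01 cells above would be rejected, while the suggested value 4 passes.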

> +
> +       soc {
> +               compatible = "simple-bus";
> +               #address-cells = <2>;
> +               #size-cells = <2>;
> +               ranges;
> +
> +               refclk50mhz: refclk50mhz {
> +                       compatible = "fixed-clock";
> +                       #clock-cells = <0>;
> +                       clock-frequency = <50000000>;
> +                       clock-output-names = "refclk50mhz";
> +               };
> +
> +               gic0: interrupt-controller at 8010,00000000 {
> +                       compatible = "arm,gic-v3";
> +                       #interrupt-cells = <3>;
> +                       #address-cells = <2>;
> +                       #size-cells = <2>;
> +                       #redistributor-regions = <2>;
> +                       ranges;
> +                       interrupt-controller;
> +                       reg = <0x8010 0x00000000 0x0 0x010000>, /* GICD */
> +                             <0x8010 0x80000000 0x0 0x600000>, /* GICR Node 0 */
> +                             <0x9010 0x80000000 0x0 0x600000>; /* GICR Node 1 */
> +                       interrupts = <1 9 0xf04>;
> +                     };

Same here.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 3/4] arm64:thunder: Add initial dts for Cavium's Thunder SoC in 2 Node topology.
  2014-11-24 16:32     ` Roy Franz
@ 2014-11-24 17:01       ` Arnd Bergmann
  2014-11-25 12:38         ` Ard Biesheuvel
  0 siblings, 1 reply; 35+ messages in thread
From: Arnd Bergmann @ 2014-11-24 17:01 UTC (permalink / raw)
  To: linux-arm-kernel

On Monday 24 November 2014 11:32:46 Roy Franz wrote:
> >
> > I don't know how much history is behind this binding. Have you looked
> > at the sPAPR way of doing this? I don't remember exactly how that is
> > done, but we'd need a good reason to discard that and implement
> > something else for arm64.
> >
> > If we create a new binding, I don't think the 'numa-map' node you
> > have here is the best solution. We already have device nodes for each
> > memory segment and each CPU in the system. Why not work with those
> > nodes directly?
> 
> The DT memory nodes don't exist in an EFI system, as the EFI memory
> map is used instead.
> Using EFI as the boot firmware doesn't require the use of ACPI for
> hardware description,
> so the EFI/DT case is one that we should support.

But they /could/ exist, right? Can we just require them to be
present in order to use NUMA features?

I don't think it's a good idea to require a new representation
of the memory nodes in DT to make NUMA work when we already have
one that is almost always there.

	Arnd

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-11-21 21:23 ` [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa Ganapatrao Kulkarni
@ 2014-11-25  3:55   ` Shannon Zhao
  2014-11-25  9:42     ` Hanjun Guo
  0 siblings, 1 reply; 35+ messages in thread
From: Shannon Zhao @ 2014-11-25  3:55 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On 2014/11/22 5:23, Ganapatrao Kulkarni wrote:
> DT bindings for numa map for memory, cores to node and
> proximity distance matrix of nodes to each other.
> 
> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@caviumnetworks.com>
> ---
>  Documentation/devicetree/bindings/arm/numa.txt | 103 +++++++++++++++++++++++++
>  1 file changed, 103 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/arm/numa.txt
> 
> diff --git a/Documentation/devicetree/bindings/arm/numa.txt b/Documentation/devicetree/bindings/arm/numa.txt
> new file mode 100644
> index 0000000..ec6bf2d
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/arm/numa.txt
> @@ -0,0 +1,103 @@
> +==============================================================================
> +NUMA binding description.
> +==============================================================================
> +
> +==============================================================================
> +1 - Introduction
> +==============================================================================
> +
> +Systems employing a Non Uniform Memory Access (NUMA) architecture contain
> +collections of hardware resources, including processors, memory, and I/O
> +buses, that comprise what is commonly known as a "NUMA node". Processor
> +accesses to memory within the local NUMA node are
> +generally faster than processor accesses to memory outside of the local
> +NUMA node. DT defines interfaces that allow the platform to convey NUMA node
> +topology information to the OS.
> +
> +==============================================================================
> +2 - numa-map node
> +==============================================================================
> +
> +The DT binding for NUMA maps memory ranges and CPUs to their respective
> +NUMA nodes.
> +
> +The binding is defined using a numa-map node.
> +The numa-map node has the following properties to define the NUMA topology.
> +
> +- mem-map:	This property defines the association between a range of
> +		memory and the proximity domain/numa node to which it belongs.
> +
> +note: the memory range address is passed using either the DT memory node
> +or the UEFI system table, and should match the address defined in mem-map.
> +
> +- cpu-map:	This property defines the association between a range of
> +		processors (a range of cpu ids) and the proximity domain to
> +		which the processors belong.
> +
> +- node-matrix:	This table provides a matrix that describes the relative
> +		distance (memory latency) between all System Localities.
> +		The value of each Entry[i j distance] in node-matrix table,
> +		where i represents a row of a matrix and j represents a
> +		column of a matrix, indicates the relative distances
> +		from Proximity Domain/Numa node i to every other
> +		node j in the system (including itself).
> +
> +The numa-map node must contain the appropriate #address-cells,
> +#size-cells and #node-count properties.
> +
> +
> +==============================================================================
> +3 - Example dts
> +==============================================================================
> +
> +Example 1: 2 Node system each having 8 CPUs and a Memory.
> +
> +	numa-map {
> +		#address-cells = <2>;
> +		#size-cells = <1>;
> +		#node-count = <2>;
> +		mem-map =  <0x0 0x00000000 0>,
> +		           <0x100 0x00000000 1>;
> +
> +		cpu-map = <0 7 0>,
> +			  <8 15 1>;

The cpu range is continuous here. But if there is a situation like below:

0 2 4 6 belong to node 0
1 3 5 7 belong to node 1

This case is very common on x86. I don't know the real situation on arm,
as I don't have hardware with 2 nodes.

How can we describe this situation in a DTS? Like below? Can it be parsed?

		cpu-map = <0 2 4 6 0>,
			  <1 3 5 7 1>;

Thanks,
Shannon

> +
> +		node-matrix = <0 0 10>,
> +			      <0 1 20>,
> +			      <1 0 20>,
> +			      <1 1 10>;
> +	};
> +
> +Example 2: 4 Node system each having 4 CPUs and a Memory.
> +
> +	numa-map {
> +		#address-cells = <2>;
> +		#size-cells = <1>;
> +		#node-count = <4>;
> +		mem-map =  <0x0 0x00000000 0>,
> +		           <0x100 0x00000000 1>,
> +		           <0x200 0x00000000 2>,
> +		           <0x300 0x00000000 3>;
> +
> +		cpu-map = <0 7 0>,
> +			  <8 15 1>,
> +			  <16 23 2>,
> +			  <24 31 3>;
> +
> +		node-matrix = <0 0 10>,
> +			      <0 1 20>,
> +			      <0 2 20>,
> +			      <0 3 20>,
> +			      <1 0 20>,
> +			      <1 1 10>,
> +			      <1 2 20>,
> +			      <1 3 20>,
> +			      <2 0 20>,
> +			      <2 1 20>,
> +			      <2 2 10>,
> +			      <2 3 20>,
> +			      <3 0 20>,
> +			      <3 1 20>,
> +			      <3 2 20>,
> +			      <3 3 10>;
> +	};
> 
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 


-- 
Shannon
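The parsing question above hinges on the fixed <first-cpu last-cpu node> triple layout of the proposed cpu-map. A rough userspace sketch (the helper name and flat-array layout are illustrative, not the patch's code) shows that only contiguous ranges can be expressed, so an interleaved set like 0, 2, 4, 6 would need one triple per CPU:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch: the proposed cpu-map property as a flat array of
 * <first-cpu last-cpu node> triples.  Returns the node a logical CPU id
 * falls into, or -1 if no range covers it. */
static int cpu_map_to_node(const uint32_t *map, int ntriples, uint32_t cpu)
{
    for (int i = 0; i < ntriples; i++) {
        uint32_t first = map[3 * i];
        uint32_t last  = map[3 * i + 1];
        uint32_t node  = map[3 * i + 2];
        if (cpu >= first && cpu <= last)
            return (int)node;
    }
    return -1;
}
```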

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-11-25  3:55   ` Shannon Zhao
@ 2014-11-25  9:42     ` Hanjun Guo
  2014-11-25 11:02       ` Arnd Bergmann
  0 siblings, 1 reply; 35+ messages in thread
From: Hanjun Guo @ 2014-11-25  9:42 UTC (permalink / raw)
  To: linux-arm-kernel

On 2014-11-25 11:55, Shannon Zhao wrote:
> Hi,
> 
> On 2014/11/22 5:23, Ganapatrao Kulkarni wrote:
[...]
>> +==============================================================================
>> +4 - Example dts
>> +==============================================================================
>> +
>> +Example 1: 2 Node system each having 8 CPUs and a Memory.
>> +
>> +	numa-map {
>> +		#address-cells = <2>;
>> +		#size-cells = <1>;
>> +		#node-count = <2>;
>> +		mem-map =  <0x0 0x00000000 0>,
>> +		           <0x100 0x00000000 1>;
>> +
>> +		cpu-map = <0 7 0>,
>> +			  <8 15 1>;
> 
> The cpu range is continuous here. But if there is a situation like below:
> 
> 0 2 4 6 belong to node 0
> 1 3 5 7 belong to node 1
> 
> This case is very common on X86. I don't know the real situation of arm as
> I don't have a hardware with 2 nodes.
> 
> How can we generate a DTS about this situation? like below? Can be parsed?
> 
> 		cpu-map = <0 2 4 6 0>,
> 			  <1 3 5 7 1>;

I think the binding proposed here cannot cover your needs, and I think this
binding is not suitable, for a few reasons:

 - The CPU logical ID is allocated by the OS and depends on the order of the
   CPU nodes in the device tree, so it may be in a clean order as this patch
   proposes, or in the order Shannon pointed out.

 - Since the CPU logical ID is allocated by the OS, the DTS file cannot know
   these numbers.

So the problem behind this is the mapping between CPUs and NUMA nodes.
There is already a mapping between the CPU hardware ID (MPIDR) and the CPU
logical ID, and the MPIDR will not change, so why not use the MPIDR for the
mapping between NUMA node and CPU? Then the mappings would be:

CPU logical ID <------> CPU MPIDR <-----> NUMA node ID <-----> proximity domain
(allocated by OS)      (constant)       (allocated by OS)

Thanks
Hanjun
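The chain sketched above could look roughly like this (the table contents and helper names are hypothetical, only to illustrate an MPIDR-keyed lookup; here MPIDR affinity level 1 is assumed to identify the node):

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4

/* OS-allocated logical id -> firmware MPIDR (constant), in the spirit of
 * the kernel's cpu_logical_map; the values here are made up. */
static const uint64_t cpu_logical_map[NR_CPUS] = { 0x000, 0x001, 0x100, 0x101 };

/* Hypothetical firmware binding: MPIDR affinity level 1 names the node. */
static int mpidr_to_node(uint64_t mpidr)
{
    return (int)((mpidr >> 8) & 0xff);
}

static int cpu_to_node(int cpu)
{
    return mpidr_to_node(cpu_logical_map[cpu]);
}
```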

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-11-25  9:42     ` Hanjun Guo
@ 2014-11-25 11:02       ` Arnd Bergmann
  2014-11-25 13:15         ` Ganapatrao Kulkarni
                           ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Arnd Bergmann @ 2014-11-25 11:02 UTC (permalink / raw)
  To: linux-arm-kernel

On Tuesday 25 November 2014 17:42:44 Hanjun Guo wrote:
> On 2014-11-25 11:55, Shannon Zhao wrote:
> > Hi,
> > 
> > On 2014/11/22 5:23, Ganapatrao Kulkarni wrote:
> [...]
> >> +==============================================================================
> >> +4 - Example dts
> >> +==============================================================================
> >> +
> >> +Example 1: 2 Node system each having 8 CPUs and a Memory.
> >> +
> >> +    numa-map {
> >> +            #address-cells = <2>;
> >> +            #size-cells = <1>;
> >> +            #node-count = <2>;
> >> +            mem-map =  <0x0 0x00000000 0>,
> >> +                       <0x100 0x00000000 1>;
> >> +
> >> +            cpu-map = <0 7 0>,
> >> +                      <8 15 1>;
> > 
> > The cpu range is continuous here. But if there is a situation like below:
> > 
> > 0 2 4 6 belong to node 0
> > 1 3 5 7 belong to node 1
> > 
> > This case is very common on X86. I don't know the real situation of arm as
> > I don't have a hardware with 2 nodes.
> > 
> > How can we generate a DTS about this situation? like below? Can be parsed?
> > 
> >               cpu-map = <0 2 4 6 0>,
> >                         <1 3 5 7 1>;
> 
> I think the binding proposed here can not cover your needs, and I think this
> binding is not suitable, there are some reasons.
> 
>  - CPU logical ID is allocated by OS, and it depends on the order of CPU node
>    in the device tree, so it may be in a clean order like this patch proposed,
>    or it will like the order Shannon pointed out.
> 
>  - Since CPU logical ID is allocated by OS, DTS file will not know these
>    numbers.

Also:

- you cannot support hierarchical NUMA topology

- you cannot have CPU-less or memory-less nodes

- you cannot associate I/O devices with NUMA nodes, only memory and CPU

> So the problem behind this is the mappings between CPUs and NUMA nodes,
> there is already mapping for CPU hardware ID (MPIDR) and CPU logical ID,
> and MPIDR will be not changed, why not using MPIDR for the mapping of
> NUMA node and CPU? then the mappings will be:
> 
> CPU logical ID <------> CPU MPIDR <-----> NUMA node ID <-----> proximity domain
> (allocated by OS)      (constant)       (allocated by OS)

No, don't hardcode ARM specifics into a common binding either. I've looked
at the ibm,associativity properties again, and I think we should just use
those, they can cover all cases and are completely independent of the
architecture. We should probably discuss about the property name though,
as using the "ibm," prefix might not be the best idea.

	Arnd

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 3/4] arm64:thunder: Add initial dts for Cavium's Thunder SoC in 2 Node topology.
  2014-11-24 17:01       ` Arnd Bergmann
@ 2014-11-25 12:38         ` Ard Biesheuvel
  2014-11-25 12:45           ` Arnd Bergmann
  0 siblings, 1 reply; 35+ messages in thread
From: Ard Biesheuvel @ 2014-11-25 12:38 UTC (permalink / raw)
  To: linux-arm-kernel

On 24 November 2014 at 18:01, Arnd Bergmann <arnd@arndb.de> wrote:
> On Monday 24 November 2014 11:32:46 Roy Franz wrote:
>> >
>> > I don't know how much history is behind this binding. Have you looked
>> > at the sPAPR way of doing this? I don't remember exactly how that is
>> > done, but we'd need a good reason to discard that and implement
>> > something else for arm64.
>> >
>> > If we create a new binding, I don't think the 'numa-map' node you
>> > have here is the best solution. We already have device nodes for each
>> > memory segment and each CPU in the system. Why not work with those
>> > nodes directly?
>>
>> The DT memory nodes don't exist in an EFI system, as the EFI memory
>> map is used instead.
>> Using EFI as the boot firmware doesn't require the use of ACPI for
>> hardware description,
>> so the EFI/DT case is one that we should support.
>
> But they /could/ exist, right? Can we just require them to be
> present in order to use NUMA features?
>

Actually, currently the memory nodes are stripped from the device tree
by the EFI stub, so the kernel will never get to see them.
This is done more or less as a fixup, under the assumption that EFI
systems should not have DT memory nodes in the first place.

We could revisit this, of course, but it needs to be taken into
account in this discussion.

-- 
Ard.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 3/4] arm64:thunder: Add initial dts for Cavium's Thunder SoC in 2 Node topology.
  2014-11-25 12:38         ` Ard Biesheuvel
@ 2014-11-25 12:45           ` Arnd Bergmann
  0 siblings, 0 replies; 35+ messages in thread
From: Arnd Bergmann @ 2014-11-25 12:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Tuesday 25 November 2014 13:38:01 Ard Biesheuvel wrote:
> On 24 November 2014 at 18:01, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Monday 24 November 2014 11:32:46 Roy Franz wrote:
> >> >
> >> > I don't know how much history is behind this binding. Have you looked
> >> > at the sPAPR way of doing this? I don't remember exactly how that is
> >> > done, but we'd need a good reason to discard that and implement
> >> > something else for arm64.
> >> >
> >> > If we create a new binding, I don't think the 'numa-map' node you
> >> > have here is the best solution. We already have device nodes for each
> >> > memory segment and each CPU in the system. Why not work with those
> >> > nodes directly?
> >>
> >> The DT memory nodes don't exist in an EFI system, as the EFI memory
> >> map is used instead.
> >> Using EFI as the boot firmware doesn't require the use of ACPI for
> >> hardware description,
> >> so the EFI/DT case is one that we should support.
> >
> > But they /could/ exist, right? Can we just require them to be
> > present in order to use NUMA features?
> >
> 
> Actually, currently the memory nodes are stripped from the device tree
> by the EFI stub, so the kernel will never get to see them.
> This is done more or less as a fixup, under the assumption that EFI
> systems should not have DT memory nodes in the first place.
> 
> We could revisit this, of course, but it needs to be taken into
> account in this discussion.

Right. As we don't support NUMA yet, this would have to become
a requirement for implementing NUMA: If you have no memory nodes,
you could still use the DT binding for topology, but it would be
limited to CPUs and I/O devices, which of course seriously limits
the usefulness.

	Arnd

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-11-25 11:02       ` Arnd Bergmann
@ 2014-11-25 13:15         ` Ganapatrao Kulkarni
  2014-11-25 19:00           ` Arnd Bergmann
  2014-11-25 14:54         ` Hanjun Guo
  2014-11-26  2:29         ` Shannon Zhao
  2 siblings, 1 reply; 35+ messages in thread
From: Ganapatrao Kulkarni @ 2014-11-25 13:15 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Arnd,

On Tue, Nov 25, 2014 at 6:02 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 25 November 2014 17:42:44 Hanjun Guo wrote:
>> On 2014-11-25 11:55, Shannon Zhao wrote:
>> > Hi,
>> >
>> > On 2014/11/22 5:23, Ganapatrao Kulkarni wrote:
>> [...]
>> >> +==============================================================================
>> >> +4 - Example dts
>> >> +==============================================================================
>> >> +
>> >> +Example 1: 2 Node system each having 8 CPUs and a Memory.
>> >> +
>> >> +    numa-map {
>> >> +            #address-cells = <2>;
>> >> +            #size-cells = <1>;
>> >> +            #node-count = <2>;
>> >> +            mem-map =  <0x0 0x00000000 0>,
>> >> +                       <0x100 0x00000000 1>;
>> >> +
>> >> +            cpu-map = <0 7 0>,
>> >> +                      <8 15 1>;
>> >
>> > The cpu range is continuous here. But if there is a situation like below:
>> >
>> > 0 2 4 6 belong to node 0
>> > 1 3 5 7 belong to node 1
>> >
>> > This case is very common on X86. I don't know the real situation of arm as
>> > I don't have a hardware with 2 nodes.
>> >
>> > How can we generate a DTS about this situation? like below? Can be parsed?
>> >
>> >               cpu-map = <0 2 4 6 0>,
>> >                         <1 3 5 7 1>;
>>
>> I think the binding proposed here can not cover your needs, and I think this
>> binding is not suitable, there are some reasons.
>>
>>  - CPU logical ID is allocated by OS, and it depends on the order of CPU node
>>    in the device tree, so it may be in a clean order like this patch proposed,
>>    or it will like the order Shannon pointed out.
>>
>>  - Since CPU logical ID is allocated by OS, DTS file will not know these
>>    numbers.
>
> Also:
>
> - you cannot support hierarchical NUMA topology
>
> - you cannot have CPU-less or memory-less nodes
>
> - you cannot associate I/O devices with NUMA nodes, only memory and CPU
>
>> So the problem behind this is the mappings between CPUs and NUMA nodes,
>> there is already mapping for CPU hardware ID (MPIDR) and CPU logical ID,
>> and MPIDR will be not changed, why not using MPIDR for the mapping of
>> NUMA node and CPU? then the mappings will be:
>>
>> CPU logical ID <------> CPU MPIDR <-----> NUMA node ID <-----> proximity domain
>> (allocated by OS)      (constant)       (allocated by OS)
>
> No, don't hardcode ARM specifics into a common binding either. I've looked
> at the ibm,associativity properties again, and I think we should just use
> those, they can cover all cases and are completely independent of the
> architecture. We should probably discuss about the property name though,
> as using the "ibm," prefix might not be the best idea.
We started with a new proposal because we could not find enough detail on
how ibm/ppc manages NUMA using DT: there is no documentation, the
Power/PAPR spec for NUMA is not in the public domain, and there is no DT
file in arch/powerpc that describes NUMA. If we can get any one of these,
we can align with the powerpc implementation.
>
>         Arnd

thanks
ganapat

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-11-25 11:02       ` Arnd Bergmann
  2014-11-25 13:15         ` Ganapatrao Kulkarni
@ 2014-11-25 14:54         ` Hanjun Guo
  2014-11-26  2:29         ` Shannon Zhao
  2 siblings, 0 replies; 35+ messages in thread
From: Hanjun Guo @ 2014-11-25 14:54 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Arnd,

On 2014-11-25 19:02, Arnd Bergmann wrote:
> On Tuesday 25 November 2014 17:42:44 Hanjun Guo wrote:
>> On 2014-11-25 11:55, Shannon Zhao wrote:
>>> Hi,
>>>
>>> On 2014/11/22 5:23, Ganapatrao Kulkarni wrote:
>> [...]
>>>> +==============================================================================
>>>> +4 - Example dts
>>>> +==============================================================================
>>>> +
>>>> +Example 1: 2 Node system each having 8 CPUs and a Memory.
>>>> +
>>>> +    numa-map {
>>>> +            #address-cells = <2>;
>>>> +            #size-cells = <1>;
>>>> +            #node-count = <2>;
>>>> +            mem-map =  <0x0 0x00000000 0>,
>>>> +                       <0x100 0x00000000 1>;
>>>> +
>>>> +            cpu-map = <0 7 0>,
>>>> +                      <8 15 1>;
>>>
>>> The cpu range is continuous here. But if there is a situation like below:
>>>
>>> 0 2 4 6 belong to node 0
>>> 1 3 5 7 belong to node 1
>>>
>>> This case is very common on X86. I don't know the real situation of arm as
>>> I don't have a hardware with 2 nodes.
>>>
>>> How can we generate a DTS about this situation? like below? Can be parsed?
>>>
>>>                cpu-map = <0 2 4 6 0>,
>>>                          <1 3 5 7 1>;
>>
>> I think the binding proposed here can not cover your needs, and I think this
>> binding is not suitable, there are some reasons.
>>
>>   - CPU logical ID is allocated by OS, and it depends on the order of CPU node
>>     in the device tree, so it may be in a clean order like this patch proposed,
>>     or it will like the order Shannon pointed out.
>>
>>   - Since CPU logical ID is allocated by OS, DTS file will not know these
>>     numbers.
>
> Also:
>
> - you cannot support hierarchical NUMA topology
>
> - you cannot have CPU-less or memory-less nodes
>
> - you cannot associate I/O devices with NUMA nodes, only memory and CPU

Yes, I agree.

>
>> So the problem behind this is the mappings between CPUs and NUMA nodes,
>> there is already mapping for CPU hardware ID (MPIDR) and CPU logical ID,
>> and MPIDR will be not changed, why not using MPIDR for the mapping of
>> NUMA node and CPU? then the mappings will be:
>>
>> CPU logical ID <------> CPU MPIDR <-----> NUMA node ID <-----> proximity domain
>> (allocated by OS)      (constant)       (allocated by OS)
>
> No, don't hardcode ARM specifics into a common binding either. I've looked
> at the ibm,associativity properties again, and I think we should just use
> those, they can cover all cases and are completely independent of the
> architecture. We should probably discuss about the property name though,
> as using the "ibm," prefix might not be the best idea.

Is there any doc/code related to this? Please give me some pointers and I
will read them.

Thanks
Hanjun

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-11-25 13:15         ` Ganapatrao Kulkarni
@ 2014-11-25 19:00           ` Arnd Bergmann
  2014-11-25 21:09             ` Arnd Bergmann
                               ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Arnd Bergmann @ 2014-11-25 19:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Tuesday 25 November 2014 08:15:47 Ganapatrao Kulkarni wrote:
> > No, don't hardcode ARM specifics into a common binding either. I've looked
> > at the ibm,associativity properties again, and I think we should just use
> > those, they can cover all cases and are completely independent of the
> > architecture. We should probably discuss about the property name though,
> > as using the "ibm," prefix might not be the best idea.
>
> We have started with new proposal, since not got enough details how
> ibm/ppc is managing the numa using dt.
> there is no documentation and there is no power/PAPR spec for numa in
> public domain and there are no single dt file in arch/powerpc which
> describes the numa. if we get any one of these details, we can align
> to powerpc implementation.

Basically the idea is to have an "ibm,associativity" property in each
bus or device that is node specific, and this includes all CPUs and
memory nodes. The property contains an array of 32-bit integers that
count the resources. Take an example of a NUMA cluster of two machines
with four sockets and four cores each (32 cores total), a memory
channel on each socket and one PCI host per board that is connected
at equal speed to each socket on the board.

The ibm,associativity property in each PCI host, CPU or memory device
node consequently has an array of three (board, socket, core) integers:

	memory at 0,0 {
		device_type = "memory";
		reg = <0x0 0x0  0x4 0x0>;
		/* board 0, socket 0, no specific core */
		ibm,associativity = <0 0 0xffff>;
	};

	memory at 4,0 {
		device_type = "memory";
		reg = <0x4 0x0  0x4 0x0>;
		/* board 0, socket 1, no specific core */
		ibm,associativity = <0 1 0xffff>;
	};

	...

	memory at 1c,0 {
		device_type = "memory";
		reg = <0x1c 0x0  0x4 0x0>;
		/* board 1, socket 7, no specific core */
		ibm,associativity = <1 7 0xffff>;
	};

	cpus {
		#address-cells = <2>;
		#size-cells = <0>;
		cpu at 0 {
			device_type = "cpu";
			reg = <0 0>;
			/* board 0, socket 0, core 0*/
			ibm,associativity = <0 0 0>;
		};

		cpu at 1 {
			device_type = "cpu";
			reg = <0 1>;
			/* board 0, socket 0, core 1 */
			ibm,associativity = <0 0 1>;
		};

		...

		cpu at 31 {
			device_type = "cpu";
			reg = <0 31>;
			/* board 1, socket 7, core 31 */
			ibm,associativity = <1 7 31>;
		};
	};

	pci at 100,0 {
		device_type = "pci";
		/* board 0 */
		ibm,associativity = <0 0xffff 0xffff>;
		...
	};

	pci at 200,0 {
		device_type = "pci";
		/* board 1 */
		ibm,associativity = <1 0xffff 0xffff>;
		...
	};

	ibm,associativity-reference-points = <0 1>;

The "ibm,associativity-reference-points" property here indicates that index 0
of each array is the most important NUMA boundary for the particular system,
because the performance impact of allocating memory on the remote board 
is more significant than the impact of using memory on a remote socket of the
same board. Linux will consequently use the first field in the array as
the NUMA node ID. If the link between the boards however is relatively fast,
so you care mostly about allocating memory on the same socket, but going to
another board isn't much worse than going to another socket on the same
board, this would be

	ibm,associativity-reference-points = <1 0>;

so Linux would ignore the board ID and use the socket ID as the NUMA node
number. The same would apply if you have only one (otherwise identical)
board, then you would get

	ibm,associativity-reference-points = <1>;

which means that index 0 is completely irrelevant for NUMA considerations
and you just care about the socket ID. In this case, devices on the PCI
bus would also not care about NUMA policy and just allocate buffers from
anywhere, while in original example Linux would allocate DMA buffers only
from the local board.
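Resolving a device's node ID from its associativity array would then amount to indexing with the first reference point. This is a sketch under the assumption of 0-based indices, not the actual powerpc implementation:

```c
#include <assert.h>

/* Sketch: the first entry of ibm,associativity-reference-points picks which
 * level of each device's ibm,associativity array is used as the NUMA node
 * ID.  0-based indexing is assumed here for simplicity. */
static int assoc_to_nid(const int *assoc, const int *ref_points)
{
    return assoc[ref_points[0]];
}
```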

	Arnd

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-11-25 19:00           ` Arnd Bergmann
@ 2014-11-25 21:09             ` Arnd Bergmann
  2014-11-26  9:12             ` Hanjun Guo
  2014-11-30 16:38             ` Ganapatrao Kulkarni
  2 siblings, 0 replies; 35+ messages in thread
From: Arnd Bergmann @ 2014-11-25 21:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Tuesday 25 November 2014 20:00:42 Arnd Bergmann wrote:
> On Tuesday 25 November 2014 08:15:47 Ganapatrao Kulkarni wrote:
> > > No, don't hardcode ARM specifics into a common binding either. I've looked
> > > at the ibm,associativity properties again, and I think we should just use
> > > those, they can cover all cases and are completely independent of the
> > > architecture. We should probably discuss about the property name though,
> > > as using the "ibm," prefix might not be the best idea.
> >
> > We started with a new proposal since we could not get enough details on
> > how ibm/ppc manages NUMA using DT. There is no documentation, no public
> > Power/PAPR spec for NUMA, and not a single DT file in arch/powerpc that
> > describes NUMA. If we get any one of these details, we can align with
> > the powerpc implementation.
> 
> Basically the idea is to have an "ibm,associativity" property in each
> bus or device that is node specific, and this includes all CPUs and
> memory nodes. ...

I should have mentioned that the example I gave was still rather basic.
In a larger real-world system, you have more levels of associativity,
though not all of them are relevant for NUMA memory allocation.
Also, when you have levels that are not just a crossbar but instead
have multiple point-to-point connections or a ring bus, it gets more
complex, but you can still represent it with these properties.

For task placement, the associativity would also represent the
topology within one node (SMT threads, cores, clusters, chips,
mcms, sockets) as separate levels, and in large installations you
would have multiple levels of memory topology (memory controllers,
sockets, board/blade, chassis, rack, ...), which can get taken into
account for memory allocation to find the closest node. The metric
that you use here is how many levels within the topology are matching
between two devices (typically memory and i/o device, or memory and cpu).

	Arnd

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-11-25 11:02       ` Arnd Bergmann
  2014-11-25 13:15         ` Ganapatrao Kulkarni
  2014-11-25 14:54         ` Hanjun Guo
@ 2014-11-26  2:29         ` Shannon Zhao
  2014-11-26 16:51           ` Arnd Bergmann
  2 siblings, 1 reply; 35+ messages in thread
From: Shannon Zhao @ 2014-11-26  2:29 UTC (permalink / raw)
  To: linux-arm-kernel

On 2014/11/25 19:02, Arnd Bergmann wrote:
> On Tuesday 25 November 2014 17:42:44 Hanjun Guo wrote:
>> On 2014-11-25 11:55, Shannon Zhao wrote:
>>> Hi,
>>>
>>> On 2014/11/22 5:23, Ganapatrao Kulkarni wrote:
>> [...]
>>>> +==============================================================================
>>>> +4 - Example dts
>>>> +==============================================================================
>>>> +
>>>> +Example 1: 2 Node system each having 8 CPUs and a Memory.
>>>> +
>>>> +    numa-map {
>>>> +            #address-cells = <2>;
>>>> +            #size-cells = <1>;
>>>> +            #node-count = <2>;
>>>> +            mem-map =  <0x0 0x00000000 0>,
>>>> +                       <0x100 0x00000000 1>;
>>>> +
>>>> +            cpu-map = <0 7 0>,
>>>> +                      <8 15 1>;
>>>
>>> The cpu range is continuous here. But if there is a situation like below:
>>>
>>> 0 2 4 6 belong to node 0
>>> 1 3 5 7 belong to node 1
>>>
>>> This case is very common on x86. I don't know the real situation on ARM,
>>> as I don't have hardware with 2 nodes.
>>>
>>> How can we generate a DTS for this situation? Like below? Can it be parsed?
>>>
>>>               cpu-map = <0 2 4 6 0>,
>>>                         <1 3 5 7 1>;
>>
>> I think the binding proposed here cannot cover your needs, and I think this
>> binding is not suitable, for several reasons:
>>
>>  - CPU logical ID is allocated by OS, and it depends on the order of CPU node
>>    in the device tree, so it may be in a clean order like this patch proposed,
>>    or it may be like the order Shannon pointed out.
>>
>>  - Since CPU logical ID is allocated by OS, DTS file will not know these
>>    numbers.
> 
> Also:
> 
> - you cannot support hierarchical NUMA topology
> 
> - you cannot have CPU-less or memory-less nodes
> 
> - you cannot associate I/O devices with NUMA nodes, only memory and CPU
> 
>> So the problem behind this is the mapping between CPUs and NUMA nodes.
>> There is already a mapping between CPU hardware ID (MPIDR) and CPU logical
>> ID, and the MPIDR will not change, so why not use the MPIDR for the mapping
>> of NUMA node and CPU? Then the mappings will be:
>>
>> CPU logical ID <------> CPU MPIDR <-----> NUMA node ID <-----> proximity domain
>> (allocated by OS)      (constant)       (allocated by OS)
> 
> No, don't hardcode ARM specifics into a common binding either. I've looked
> at the ibm,associativity properties again, and I think we should just use
> those, they can cover all cases and are completely independent of the
> architecture. We should probably discuss about the property name though,
> as using the "ibm," prefix might not be the best idea.
> 

Yeah, I have read the relevant code in QEMU. I think "ibm,associativity" is more scalable :-)

About the prefix, my opinion is that since this is related to NUMA, maybe we can use "numa" as the prefix.

Thanks,
Shannon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-11-25 19:00           ` Arnd Bergmann
  2014-11-25 21:09             ` Arnd Bergmann
@ 2014-11-26  9:12             ` Hanjun Guo
  2014-12-10 10:57               ` Arnd Bergmann
  2014-11-30 16:38             ` Ganapatrao Kulkarni
  2 siblings, 1 reply; 35+ messages in thread
From: Hanjun Guo @ 2014-11-26  9:12 UTC (permalink / raw)
  To: linux-arm-kernel

On 2014-11-26 3:00, Arnd Bergmann wrote:
> On Tuesday 25 November 2014 08:15:47 Ganapatrao Kulkarni wrote:
>>> No, don't hardcode ARM specifics into a common binding either. I've looked
>>> at the ibm,associativity properties again, and I think we should just use
>>> those, they can cover all cases and are completely independent of the
>>> architecture. We should probably discuss about the property name though,
>>> as using the "ibm," prefix might not be the best idea.
>>
>> We started with a new proposal since we could not get enough details on
>> how ibm/ppc manages NUMA using DT. There is no documentation, no public
>> Power/PAPR spec for NUMA, and not a single DT file in arch/powerpc that
>> describes NUMA. If we get any one of these details, we can align with
>> the powerpc implementation.
> 
> Basically the idea is to have an "ibm,associativity" property in each
> bus or device that is node specific, and this includes all CPUs and
> memory nodes. The property contains an array of 32-bit integers that
> count the resources. Take an example of a NUMA cluster of two machines
> with four sockets and four cores each (32 cores total), a memory
> channel on each socket and one PCI host per board that is connected
> at equal speed to each socket on the board.
> 
> The ibm,associativity property in each PCI host, CPU or memory device
> node consequently has an array of three (board, socket, core) integers:
> 
> 	memory at 0,0 {
> 		device_type = "memory";
> 		reg = <0x0 0x0  0x4 0x0>;
> 		/* board 0, socket 0, no specific core */
> 		ibm,associativity = <0 0 0xffff>;
> 	};
> 
> 	memory at 4,0 {
> 		device_type = "memory";
> 		reg = <0x4 0x0  0x4 0x0>;
> 		/* board 0, socket 1, no specific core */
> 		ibm,associativity = <0 1 0xffff>;
> 	};
> 
> 	...
> 
> 	memory at 1c,0 {
> 		device_type = "memory";
> 		reg = <0x1c 0x0  0x4 0x0>;
> 		/* board 1, socket 7, no specific core */
> 		ibm,associativity = <1 7 0xffff>;
> 	};
> 
> 	cpus {
> 		#address-cells = <2>;
> 		#size-cells = <0>;
> 		cpu at 0 {
> 			device_type = "cpu";
> 			reg = <0 0>;
> 			/* board 0, socket 0, core 0*/
> 			ibm,associativity = <0 0 0>;
> 		};
> 
> 		cpu at 1 {
> 			device_type = "cpu";
> 			reg = <0 1>;
> 			/* board 0, socket 0, core 1 */
> 			ibm,associativity = <0 0 1>;
> 		};
> 
> 		...
> 
> 		cpu at 31 {
> 			device_type = "cpu";
> 			reg = <0 31>;
> 			/* board 1, socket 7, core 31 */
> 			ibm,associativity = <1 7 31>;
> 		};
> 	};
> 
> 	pci at 100,0 {
> 		device_type = "pci";
> 		/* board 0 */
> 		ibm,associativity = <0 0xffff 0xffff>;
> 		...
> 	};
> 
> 	pci at 200,0 {
> 		device_type = "pci";
> 		/* board 1 */
> 		ibm,associativity = <1 0xffff 0xffff>;
> 		...
> 	};
> 
> 	ibm,associativity-reference-points = <0 1>;
> 
> The "ibm,associativity-reference-points" property here indicates that index 2
> of each array is the most important NUMA boundary for the particular system,
> because the performance impact of allocating memory on the remote board 
> is more significant than the impact of using memory on a remote socket of the
> same board. Linux will consequently use the first field in the array as
> the NUMA node ID. If, however, the link between the boards is relatively
> fast, so that you care mostly about allocating memory on the same socket
> but going to another board isn't much worse than going to another socket
> on the same board, this would be
> 
> 	ibm,associativity-reference-points = <1 0>;
> 
> so Linux would ignore the board ID and use the socket ID as the NUMA node
> number. The same would apply if you have only one (otherwise identical)
> board; then you would get
> 
> 	ibm,associativity-reference-points = <1>;
> 
> which means that index 0 is completely irrelevant for NUMA considerations
> and you just care about the socket ID. In this case, devices on the PCI
> bus would also not care about NUMA policy and just allocate buffers from
> anywhere, while in original example Linux would allocate DMA buffers only
> from the local board.

Thanks for the detailed information. I have a concern about the distance
between NUMA nodes: can the "ibm,associativity-reference-points" property
represent the distance between NUMA nodes?

For example, a system with 4 sockets connected like below:

Socket 0  <---->  Socket 1  <---->  Socket 2  <---->  Socket 3

So from socket 0 to socket 1 (maybe on the same board), it just needs 1
jump to access the memory, but from socket 0 to socket 2/3, it needs
2/3 jumps and the *distance* is relatively longer. Can the
"ibm,associativity-reference-points" property cover this?

Thanks
Hanjun

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-11-26  2:29         ` Shannon Zhao
@ 2014-11-26 16:51           ` Arnd Bergmann
  0 siblings, 0 replies; 35+ messages in thread
From: Arnd Bergmann @ 2014-11-26 16:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday 26 November 2014 10:29:01 Shannon Zhao wrote:
> On 2014/11/25 19:02, Arnd Bergmann wrote:
> > No, don't hardcode ARM specifics into a common binding either. I've looked
> > at the ibm,associativity properties again, and I think we should just use
> > those, they can cover all cases and are completely independent of the
> > architecture. We should probably discuss about the property name though,
> > as using the "ibm," prefix might not be the best idea.
> > 
> 
> Yeah, I have read the relevant code in QEMU. I think "ibm,associativity" is more scalable :-)

Ok

> About the prefix, my opinion is that since this is related to NUMA,
> maybe we can use "numa" as the prefix.

A prefix should really be the name of a company or institution, so it could
be "arm" or "linux", but not "numa". We could use "numa-associativity"
with a dash instead of a comma, but that would still be somewhat imprecise
because the associativity property is about system topology inside of
a NUMA domain as well, such as cores, core clusters or SMT threads that
only share caches but not physical memory addresses.

	Arnd

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-11-25 19:00           ` Arnd Bergmann
  2014-11-25 21:09             ` Arnd Bergmann
  2014-11-26  9:12             ` Hanjun Guo
@ 2014-11-30 16:38             ` Ganapatrao Kulkarni
  2014-11-30 17:13               ` Arnd Bergmann
  2 siblings, 1 reply; 35+ messages in thread
From: Ganapatrao Kulkarni @ 2014-11-30 16:38 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Arnd,


On Tue, Nov 25, 2014 at 11:00 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 25 November 2014 08:15:47 Ganapatrao Kulkarni wrote:
>> > No, don't hardcode ARM specifics into a common binding either. I've looked
>> > at the ibm,associativity properties again, and I think we should just use
>> > those, they can cover all cases and are completely independent of the
>> > architecture. We should probably discuss about the property name though,
>> > as using the "ibm," prefix might not be the best idea.
>>
>> We started with a new proposal since we could not get enough details on
>> how ibm/ppc manages NUMA using DT. There is no documentation, no public
>> Power/PAPR spec for NUMA, and not a single DT file in arch/powerpc that
>> describes NUMA. If we get any one of these details, we can align with
>> the powerpc implementation.
>
> Basically the idea is to have an "ibm,associativity" property in each
> bus or device that is node specific, and this includes all CPUs and
> memory nodes. The property contains an array of 32-bit integers that
> count the resources. Take an example of a NUMA cluster of two machines
> with four sockets and four cores each (32 cores total), a memory
> channel on each socket and one PCI host per board that is connected
> at equal speed to each socket on the board.
Thanks for the detailed information.
IMHO, the Linux NUMA code does not care what the hardware design looks
like, e.g. how many boards and how many sockets it has. It only needs to
know how many NUMA nodes the system has, how resources are mapped to
nodes, and the node distances that define inter-node memory access
latency. I think it would be simpler if we merged board and socket into
a single entry, say node.
Also, we are assuming here that a NUMA h/w design will have multiple
boards and sockets; what if it has something different/more?

>
> The ibm,associativity property in each PCI host, CPU or memory device
> node consequently has an array of three (board, socket, core) integers:
>
>         memory at 0,0 {
>                 device_type = "memory";
>                 reg = <0x0 0x0  0x4 0x0>;
>                 /* board 0, socket 0, no specific core */
>                 ibm,associativity = <0 0 0xffff>;
>         };
>
>         memory at 4,0 {
>                 device_type = "memory";
>                 reg = <0x4 0x0  0x4 0x0>;
>                 /* board 0, socket 1, no specific core */
>                 ibm,associativity = <0 1 0xffff>;
>         };
>
>         ...
>
>         memory at 1c,0 {
>                 device_type = "memory";
>                 reg = <0x1c 0x0  0x4 0x0>;
>                 /* board 1, socket 7, no specific core */
>                 ibm,associativity = <1 7 0xffff>;
>         };
>
>         cpus {
>                 #address-cells = <2>;
>                 #size-cells = <0>;
>                 cpu at 0 {
>                         device_type = "cpu";
>                         reg = <0 0>;
>                         /* board 0, socket 0, core 0*/
>                         ibm,associativity = <0 0 0>;
>                 };
>
>                 cpu at 1 {
>                         device_type = "cpu";
>                         reg = <0 1>;
>                         /* board 0, socket 0, core 1 */
>                         ibm,associativity = <0 0 1>;
>                 };
>
>                 ...
>
>                 cpu at 31 {
>                         device_type = "cpu";
>                         reg = <0 31>;
>                         /* board 1, socket 7, core 31 */
>                         ibm,associativity = <1 7 31>;
>                 };
>         };
>
>         pci at 100,0 {
>                 device_type = "pci";
>                 /* board 0 */
>                 ibm,associativity = <0 0xffff 0xffff>;
>                 ...
>         };
>
>         pci at 200,0 {
>                 device_type = "pci";
>                 /* board 1 */
>                 ibm,associativity = <1 0xffff 0xffff>;
>                 ...
>         };
>
>         ibm,associativity-reference-points = <0 1>;
>
> The "ibm,associativity-reference-points" property here indicates that index 2
> of each array is the most important NUMA boundary for the particular system,
> because the performance impact of allocating memory on the remote board
> is more significant than the impact of using memory on a remote socket of the
> same board. Linux will consequently use the first field in the array as
> the NUMA node ID. If, however, the link between the boards is relatively
> fast, so that you care mostly about allocating memory on the same socket
> but going to another board isn't much worse than going to another socket
> on the same board, this would be
>
>         ibm,associativity-reference-points = <1 0>;
I am not able to understand fully; it would be of great help if you
could explain how we capture the node distance matrix using
"ibm,associativity-reference-points".
For example, what would the DT look like for a system with 4 nodes and
the below inter-node distance matrix?
node 0 1 distance 20
node 0 2 distance 20
node 0 3 distance 20
node 1 2 distance 20
node 1 3 distance 20
node 2 3 distance 20
>
> so Linux would ignore the board ID and use the socket ID as the NUMA node
> number. The same would apply if you have only one (otherwise identical)
> board; then you would get
>
>         ibm,associativity-reference-points = <1>;
>
> which means that index 0 is completely irrelevant for NUMA considerations
> and you just care about the socket ID. In this case, devices on the PCI
> bus would also not care about NUMA policy and just allocate buffers from
> anywhere, while in original example Linux would allocate DMA buffers only
> from the local board.
>
>         Arnd
thanks
ganapat
ps: sorry for the delayed reply.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-11-30 16:38             ` Ganapatrao Kulkarni
@ 2014-11-30 17:13               ` Arnd Bergmann
  0 siblings, 0 replies; 35+ messages in thread
From: Arnd Bergmann @ 2014-11-30 17:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Sunday 30 November 2014 08:38:02 Ganapatrao Kulkarni wrote:

> On Tue, Nov 25, 2014 at 11:00 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Tuesday 25 November 2014 08:15:47 Ganapatrao Kulkarni wrote:
> >> > No, don't hardcode ARM specifics into a common binding either. I've looked
> >> > at the ibm,associativity properties again, and I think we should just use
> >> > those, they can cover all cases and are completely independent of the
> >> > architecture. We should probably discuss about the property name though,
> >> > as using the "ibm," prefix might not be the best idea.
> >>
> >> We started with a new proposal since we could not get enough details on
> >> how ibm/ppc manages NUMA using DT. There is no documentation, no public
> >> Power/PAPR spec for NUMA, and not a single DT file in arch/powerpc that
> >> describes NUMA. If we get any one of these details, we can align with
> >> the powerpc implementation.
> >
> > Basically the idea is to have an "ibm,associativity" property in each
> > bus or device that is node specific, and this includes all CPUs and
> > memory nodes. The property contains an array of 32-bit integers that
> > count the resources. Take an example of a NUMA cluster of two machines
> > with four sockets and four cores each (32 cores total), a memory
> > channel on each socket and one PCI host per board that is connected
> > at equal speed to each socket on the board.
> Thanks for the detailed information.
> IMHO, the Linux NUMA code does not care what the hardware design looks
> like, e.g. how many boards and how many sockets it has. It only needs to
> know how many NUMA nodes the system has, how resources are mapped to
> nodes, and the node distances that define inter-node memory access
> latency. I think it would be simpler if we merged board and socket into
> a single entry, say node.

But it's not good to rely on implementation details of a particular
operating system.

> Also, we are assuming here that a NUMA h/w design will have multiple
> boards and sockets; what if it has something different/more?

As I said, this was a simplified example, you can have an arbitrary
number of levels, and normally there are more than three, to capture
the cache hierarchy and other things as well.

> > The "ibm,associativity-reference-points" property here indicates that index 2
> > of each array is the most important NUMA boundary for the particular system,
> > because the performance impact of allocating memory on the remote board
> > is more significant than the impact of using memory on a remote socket of the
> > same board. Linux will consequently use the first field in the array as
> > the NUMA node ID. If, however, the link between the boards is relatively
> > fast, so that you care mostly about allocating memory on the same socket
> > but going to another board isn't much worse than going to another socket
> > on the same board, this would be
> >
> >         ibm,associativity-reference-points = <1 0>;
> I am not able to understand fully; it would be of great help if you
> could explain how we capture the node distance matrix using
> "ibm,associativity-reference-points".
> For example, what would the DT look like for a system with 4 nodes and
> the below inter-node distance matrix?
> node 0 1 distance 20
> node 0 2 distance 20
> node 0 3 distance 20
> node 1 2 distance 20
> node 1 3 distance 20
> node 2 3 distance 20

In your example, you have only one entry in
ibm,associativity-reference-points as it's even simpler: just
one level of hierarchy, everything is the same distance from
everything else, so within the associativity hierarchy, the
ibm,associativity-reference-points just points to the one
level that indicates a NUMA node.

You would only need multiple entries here if the hierarchy is
complex enough to require multiple levels of topology.
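
For the full-mesh case in the question, the whole distance matrix
collapses to a single remote value. A small sketch (Python;
LOCAL_DISTANCE = 10 follows the usual Linux convention and is an
assumption here, as the question only gives the remote distance):

```python
LOCAL_DISTANCE = 10   # conventional distance from a node to itself
REMOTE_DISTANCE = 20  # the single inter-node distance from the question

def full_mesh_distances(nr_nodes):
    # In a full mesh every remote node is equally far away, so one
    # remote value describes the entire matrix.
    return [[LOCAL_DISTANCE if i == j else REMOTE_DISTANCE
             for j in range(nr_nodes)]
            for i in range(nr_nodes)]

m = full_mesh_distances(4)
assert all(m[i][j] == 20 for i in range(4) for j in range(4) if i != j)
assert all(m[i][i] == 10 for i in range(4))
```

This is why one level of hierarchy (one reference point) is enough here:
no pair of distinct nodes needs a distance different from any other pair.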

	Arnd

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 4/4] arm64:numa: adding numa support for arm64 platforms.
  2014-11-21 21:23 ` [RFC PATCH v2 4/4] arm64:numa: adding numa support for arm64 platforms Ganapatrao Kulkarni
@ 2014-12-06  9:36   ` Ashok Kumar
       [not found]   ` <5482ce36.c9e2420a.5d40.71c7SMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 35+ messages in thread
From: Ashok Kumar @ 2014-12-06  9:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Nov 22, 2014 at 02:53:30AM +0530, Ganapatrao Kulkarni wrote:
> Adding numa support for arm64 based platforms.
> creating numa mapping by parsing the dt node numa-map.
> 
> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@caviumnetworks.com>
Ganapat,

Can we get a simple version of this patchset with just the associativity parameter for the cpu and memory nodes
upstream? The initial patchset can be with the assumption of a simple full mesh topology for the distance.  The
hierarchical complex topologies and the DT nodes and properties for that can be added in a later patch, once the
standard for that is agreed upon.
We are planning to support ACPI based NUMA for Broadcom Vulcan processors (http://www.broadcom.com/press/release.php?id=s797235).
I am working on a patch to add support for the SRAT and SLIT ACPI tables based on your patch and I will post that
patchset soon, so it will be helpful if the core changes are upstream.

Thanks,
Ashok

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 4/4] arm64:numa: adding numa support for arm64 platforms.
       [not found]   ` <5482ce36.c9e2420a.5d40.71c7SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-12-06 18:50     ` Ganapatrao Kulkarni
  2014-12-10 12:26       ` Ashok Kumar
       [not found]       ` <54883be3.8284440a.3154.ffffa34fSMTPIN_ADDED_BROKEN@mx.google.com>
  0 siblings, 2 replies; 35+ messages in thread
From: Ganapatrao Kulkarni @ 2014-12-06 18:50 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Dec 6, 2014 at 1:36 AM, Ashok Kumar <ashoks@broadcom.com> wrote:
> On Sat, Nov 22, 2014 at 02:53:30AM +0530, Ganapatrao Kulkarni wrote:
>> Adding numa support for arm64 based platforms.
>> creating numa mapping by parsing the dt node numa-map.
>>
>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@caviumnetworks.com>
> Ganapat,
>
> Can we get a simple version of this patchset with just the associativity parameter for the cpu and memory nodes
> upstream? The initial patchset can be with the assumption of a simple full mesh topology for the distance.  The
> hierarchical complex topologies and the DT nodes and properties for that can be added in a later patch, once the
> standard for that is agreed upon.
> We are planning to support ACPI based NUMA for Broadcom Vulcan processors (http://www.broadcom.com/press/release.php?id=s797235).
> I am working on a patch to add support for the SRAT and SLIT ACPI tables based on your patch and I will post that
> patchset soon, so it will be helpful if the core changes are upstream.
>
To implement an ibm/ppc-like scheme, we need an efi-stub patch that does
not remove the memory nodes from the DT; the current efi-stub parses the
DT and removes memory nodes.
Can I get the efi-stub patch to start with, please?

> Thanks,
> Ashok
>

thanks
ganapat

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 1/4] arm64: defconfig: increase NR_CPUS range to 2-128
  2014-11-24 11:53   ` Arnd Bergmann
@ 2014-12-09  1:57     ` Zi Shen Lim
  2014-12-09  8:27       ` Arnd Bergmann
  0 siblings, 1 reply; 35+ messages in thread
From: Zi Shen Lim @ 2014-12-09  1:57 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Arnd,

On Mon, Nov 24, 2014 at 3:53 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Saturday 22 November 2014 02:53:27 Ganapatrao Kulkarni wrote:
> > Raising the maximum limit to 128. This is needed for Cavium's
> > Thunder system that will have 96 cores on Multi-node system.
> >
> > Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@caviumnetworks.com>
> >
>
> Could we please raise the compile-time limit to the highest number that
> you are able to boot successfully on some existing machine?
>
> There isn't much point in doubling this every few months.

Agreed. If we look back at [1], Mark Rutland has actually compiled and
boot-tested NR_CPUS=4096 on Juno.

[1] https://lkml.org/lkml/2014/9/8/537

>
>         Arnd
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 1/4] arm64: defconfig: increase NR_CPUS range to 2-128
  2014-12-09  1:57     ` Zi Shen Lim
@ 2014-12-09  8:27       ` Arnd Bergmann
  2014-12-24 12:33         ` Ganapatrao Kulkarni
  0 siblings, 1 reply; 35+ messages in thread
From: Arnd Bergmann @ 2014-12-09  8:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Monday 08 December 2014 17:57:03 Zi Shen Lim wrote:
> Hi Arnd,
> 
> On Mon, Nov 24, 2014 at 3:53 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Saturday 22 November 2014 02:53:27 Ganapatrao Kulkarni wrote:
> > > Raising the maximum limit to 128. This is needed for Cavium's
> > > Thunder system that will have 96 cores on Multi-node system.
> > >
> > > Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@caviumnetworks.com>
> > >
> >
> > Could we please raise the compile-time limit to the highest number that
> > you are able to boot successfully on some existing machine?
> >
> > There isn't much point in doubling this every few months.
> 
> Agreed. If we look back at [1], Mark Rutland has actually compiled and
> boot-tested NR_CPUS=4096 on Juno.
> 
> [1] https://lkml.org/lkml/2014/9/8/537

Ok, 4096 sounds like a good NR_CPUS limit then, it should last for a while.

For the defconfig, we probably want a much smaller value, either one that
covers all known machines (96 at this time), or something that covers
95% of all users (maybe 32?) and does not have a serious impact on
memory consumption or performance on small machines.

	Arnd

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-11-26  9:12             ` Hanjun Guo
@ 2014-12-10 10:57               ` Arnd Bergmann
  2014-12-11  9:16                 ` Hanjun Guo
  0 siblings, 1 reply; 35+ messages in thread
From: Arnd Bergmann @ 2014-12-10 10:57 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday 26 November 2014 17:12:49 Hanjun Guo wrote:
> 
> Thanks for the detailed information. I have a concern about the distance
> between NUMA nodes: can the "ibm,associativity-reference-points" property
> represent the distance between NUMA nodes?
> 
> For example, a system with 4 sockets connected like below:
> 
> Socket 0  <---->  Socket 1  <---->  Socket 2  <---->  Socket 3
> 
> So from socket 0 to socket 1 (maybe on the same board), it just needs 1
> jump to access the memory, but from socket 0 to socket 2/3, it needs
> 2/3 jumps and the *distance* is relatively longer. Can the
> "ibm,associativity-reference-points" property cover this?
> 

Hi Hanjun,

I only found your replies today, in my spam folder. I need to put you on
a whitelist so that doesn't happen again.

The above topology is not easy to represent, but I think it would work
like this (ignoring the threads/cores/clusters on the socket, which
would also need to be described in a full DT), using multiple logical
paths between the nodes:

socket 0
ibm,associativity = <0 0 0 0>, <1 1 1 0>, <2 2 0 0>, <3 0 0 0>;

socket 1
ibm,associativity = <1 1 1 1>, <0 0 0 1>, <2 2 2 1>, <3 3 1 1>;

socket 2
ibm,associativity = <2 2 2 2>, <0 0 2 2>, <1 1 1 2>, <3 3 3 2>;

socket 3
ibm,associativity = <3 3 3 3>, <0 3 3 3>, <1 1 3 3>, <2 2 2 3>;

This describes four levels of hierarchy, with the lowest level
being a single CPU core on one socket, and four paths between
the sockets. To compute the associativity between two sockets,
you need to look at each combination of paths to find the best
match.

Comparing sockets 0 and 1, the best matches are <1 1 1 0>
with <1 1 1 1>, and <0 0 0 0> with <0 0 0 1>. In each case, the
associativity is "3", meaning the first three entries match.

Comparing sockets 0 and 3, we have four equally bad matches
that each only match in the highest-level domain, e.g. <0 0 0 0>
with <0 3 3 3>, so the associativity is only "1", and that means
the two nodes are less closely associated than two neighboring
ones.

With the algorithm that powerpc uses to turn associativity into
distance, 2**(numlevels - associativity), this would put the
distance of neighboring nodes at "2", and the longest distance
at "8".
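
As a rough sketch of this matching in Python (the four-entry tuples and
LEVELS = 4 mirror the example above; taking the best match over every
pair of paths is my reading of "look at each combination of paths"):

```python
from itertools import product

def best_match(paths_a, paths_b):
    # Longest run of matching leading levels over every pair of
    # associativity paths of the two nodes.
    best = 0
    for pa, pb in product(paths_a, paths_b):
        n = 0
        for x, y in zip(pa, pb):
            if x != y:
                break
            n += 1
        best = max(best, n)
    return best

LEVELS = 4  # levels of hierarchy in the example above

def distance(paths_a, paths_b):
    # powerpc-style distance: 2 ** (numlevels - associativity)
    return 2 ** (LEVELS - best_match(paths_a, paths_b))

socket0 = [(0, 0, 0, 0), (1, 1, 1, 0), (2, 2, 0, 0), (3, 0, 0, 0)]
socket1 = [(1, 1, 1, 1), (0, 0, 0, 1), (2, 2, 2, 1), (3, 3, 1, 1)]

# Neighbouring sockets match in three of four levels -> distance 2,
# e.g. <1 1 1 0> against <1 1 1 1>.
assert best_match(socket0, socket1) == 3
assert distance(socket0, socket1) == 2
```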

	Arnd

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 4/4] arm64:numa: adding numa support for arm64 platforms.
  2014-12-06 18:50     ` Ganapatrao Kulkarni
@ 2014-12-10 12:26       ` Ashok Kumar
       [not found]       ` <54883be3.8284440a.3154.ffffa34fSMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 35+ messages in thread
From: Ashok Kumar @ 2014-12-10 12:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Dec 06, 2014 at 10:50:57AM -0800, Ganapatrao Kulkarni wrote:
> On Sat, Dec 6, 2014 at 1:36 AM, Ashok Kumar <ashoks@broadcom.com> wrote:
> > On Sat, Nov 22, 2014 at 02:53:30AM +0530, Ganapatrao Kulkarni wrote:
> >> Adding numa support for arm64 based platforms.
> >> creating numa mapping by parsing the dt node numa-map.
> >>
> >> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@caviumnetworks.com>
> > Ganapat,
> >
> > Can we get a simple version of this patchset with just the associativity parameter for the cpu and memory nodes
> > upstream? The initial patchset can be with the assumption of a simple full mesh topology for the distance.  The
> > hierarchical complex topologies and the DT nodes and properties for that can be added in a later patch, once the
> > standard for that is agreed upon.
> > We are planning to support ACPI based NUMA for Broadcom Vulcan processors(http://www.broadcom.com/press/release.php?id=s797235).
> > I am working on a patch to add support for the SRAT and SLIT ACPI tables based on your patch and I will post that
> > patchset soon, so it will be helpful if the core changes are upstream.
> >
> To follow an ibm/ppc-like implementation, we need an efi-stub patch
> that does not remove the memory nodes from the DT.
> The current efi-stub parses the DT and removes the memory nodes.
> Can I get the efi-stub patch to start with, please?

Ganapat,
 I tried the below patch in qemu and it works. Would this be of help to you?

Roy/Ard,
 Is the below patch fine? 

From 27b4aecf09707e3d5bd4ff7bf765cd609772476f Mon Sep 17 00:00:00 2001
From: Ashok Kumar <ashoks@broadcom.com>
Date: Tue, 9 Dec 2014 17:23:11 +0530
Subject: [PATCH] efi/arm64: Remove deleting memory nodes in efi-stub

Don't delete memory nodes from the DT, as they will be
used by the NUMA configuration.

Signed-off-by: Ashok Kumar <ashoks@broadcom.com>
---
 arch/arm64/kernel/setup.c          | 12 +++++++++++-
 drivers/firmware/efi/libstub/fdt.c | 24 +-----------------------
 2 files changed, 12 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 27f65b0..ac6d34f 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -302,7 +302,7 @@ static void __init setup_processor(void)
 
 static void __init setup_machine_fdt(phys_addr_t dt_phys)
 {
-	if (!dt_phys || !early_init_dt_scan(phys_to_virt(dt_phys))) {
+	if (!dt_phys || !early_init_dt_verify(phys_to_virt(dt_phys))) {
 		early_print("\n"
 			"Error: invalid device tree blob at physical address 0x%p (virtual address 0x%p)\n"
 			"The dtb must be 8-byte aligned and passed in the first 512MB of memory\n"
@@ -313,6 +313,16 @@ static void __init setup_machine_fdt(phys_addr_t dt_phys)
 			cpu_relax();
 	}
 
+	/* Retrieve various information from the /chosen node */
+	of_scan_flat_dt(early_init_dt_scan_chosen, boot_command_line);
+
+	/* Initialize {size,address}-cells info */
+	of_scan_flat_dt(early_init_dt_scan_root, NULL);
+
+	/* Setup memory, calling early_init_dt_add_memory_arch */
+	if (!IS_ENABLED(CONFIG_EFI))
+		of_scan_flat_dt(early_init_dt_scan_memory, NULL);
+
 	machine_name = of_flat_dt_get_machine_name();
 }
 
diff --git a/drivers/firmware/efi/libstub/fdt.c b/drivers/firmware/efi/libstub/fdt.c
index c846a96..a02e56e 100644
--- a/drivers/firmware/efi/libstub/fdt.c
+++ b/drivers/firmware/efi/libstub/fdt.c
@@ -22,7 +22,7 @@ efi_status_t update_fdt(efi_system_table_t *sys_table, void *orig_fdt,
 			unsigned long map_size, unsigned long desc_size,
 			u32 desc_ver)
 {
-	int node, prev, num_rsv;
+	int node, num_rsv;
 	int status;
 	u32 fdt_val32;
 	u64 fdt_val64;
@@ -52,28 +52,6 @@ efi_status_t update_fdt(efi_system_table_t *sys_table, void *orig_fdt,
 		goto fdt_set_fail;
 
 	/*
-	 * Delete any memory nodes present. We must delete nodes which
-	 * early_init_dt_scan_memory may try to use.
-	 */
-	prev = 0;
-	for (;;) {
-		const char *type;
-		int len;
-
-		node = fdt_next_node(fdt, prev, NULL);
-		if (node < 0)
-			break;
-
-		type = fdt_getprop(fdt, node, "device_type", &len);
-		if (type && strncmp(type, "memory", len) == 0) {
-			fdt_del_node(fdt, node);
-			continue;
-		}
-
-		prev = node;
-	}
-
-	/*
 	 * Delete all memory reserve map entries. When booting via UEFI,
 	 * kernel will use the UEFI memory map to find reserved regions.
 	 */
-- 
1.9.1


Thanks,
Ashok

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-12-10 10:57               ` Arnd Bergmann
@ 2014-12-11  9:16                 ` Hanjun Guo
  2014-12-12 14:20                   ` Arnd Bergmann
  0 siblings, 1 reply; 35+ messages in thread
From: Hanjun Guo @ 2014-12-11  9:16 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Arnd,

On 2014/12/10 18:57, Arnd Bergmann wrote:
> On Wednesday 26 November 2014 17:12:49 Hanjun Guo wrote:
>>
>> Thanks for the detailed information. I have concerns about the distance
>> between NUMA nodes: can the "ibm,associativity-reference-points" property
>> represent the distance between NUMA nodes?
>>
>> For example, a system with 4 sockets connected like below:
>>
>> Socket 0  <---->  Socket 1  <---->  Socket 2  <---->  Socket 3
>>
>> So from socket 0 to socket 1 (maybe on the same board), it just needs 1
>> jump to access the memory, but from socket 0 to socket 2/3, it needs
>> 2/3 jumps and the *distance* is relatively longer. Can
>> "ibm,associativity-reference-points" property cover this?
>>
>
> Hi Hanjun,
>
> I only today found your replies in my spam folder; I need to put you on
> a whitelist so that doesn't happen again.

Thanks. I hope my ACPI patches will not scare your email filter :)

>
> The above topology is not easy to represent, but I think it would work
> like this (ignoring the threads/cores/clusters on the socket, which
> would also need to be described in a full DT), using multiple logical
> paths between the nodes:
>
> socket 0
> ibm,associativity = <0 0 0 0>, <1 1 1 0>, <2 2 0 0>, <3 0 0 0>;
>
> socket 1
> ibm,associativity = <1 1 1 1>, <0 0 0 1>, <2 2 2 1>, <3 3 1 1>;
>
> socket 2
> ibm,associativity = <2 2 2 2>, <0 0 2 2>, <1 1 1 2>, <3 3 3 2>;
>
> socket 3
> ibm,associativity = <3 3 3 3>, <0 3 3 3>, <1 1 3 3>, <2 2 2 3>;
>
> This describes four levels of hierarchy, with the lowest level
> being a single CPU core on one socket, and four paths between
> the sockets. To compute the associativity between two sockets,
> you need to look at each combination of paths to find the best
> match.
>
> Comparing sockets 0 and 1, the best matches are <1 1 1 0>
> with <1 1 1 1>, and <0 0 0 0> with <0 0 0 1>. In each case, the
> associativity is "3", meaning the first three entries match.
>
> Comparing sockets 0 and 3, we have four equally bad matches
> that each only match in the highest-level domain, e.g. <0 0 0 0>
> with <0 3 3 3>, so the associativity is only "1", and that means
> the two nodes are less closely associated than two neighboring
> ones.
>
> With the algorithm that powerpc uses to turn associativity into
> distance, 2**(numlevels - associativity), this would put the
> distance of neighboring nodes at "2", and the longest distance
> at "8".

Thanks for the explanation, I understand how it works now.
It is a bit complicated for me, and I think the distance property
"node-matrix" in Ganapatrao's patch is more straightforward;
what do you think?

Thanks
Hanjun

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-12-11  9:16                 ` Hanjun Guo
@ 2014-12-12 14:20                   ` Arnd Bergmann
  2014-12-15  3:50                     ` Hanjun Guo
  0 siblings, 1 reply; 35+ messages in thread
From: Arnd Bergmann @ 2014-12-12 14:20 UTC (permalink / raw)
  To: linux-arm-kernel

On Thursday 11 December 2014 17:16:35 Hanjun Guo wrote:
> On 2014/12/10 18:57, Arnd Bergmann wrote:
> > On Wednesday 26 November 2014 17:12:49 Hanjun Guo wrote:
> > The above topology is not easy to represent, but I think it would work
> > like this (ignoring the threads/cores/clusters on the socket, which
> > would also need to be described in a full DT), using multiple logical
> > paths between the nodes:
> >
> > socket 0
> > ibm,associativity = <0 0 0 0>, <1 1 1 0>, <2 2 0 0>, <3 0 0 0>;
> >
> > socket 1
> > ibm,associativity = <1 1 1 1>, <0 0 0 1>, <2 2 2 1>, <3 3 1 1>;
> >
> > socket 2
> > ibm,associativity = <2 2 2 2>, <0 0 2 2>, <1 1 1 2>, <3 3 3 2>;
> >
> > socket 3
> > ibm,associativity = <3 3 3 3>, <0 3 3 3>, <1 1 3 3>, <2 2 2 3>;
> >
> > This describes four levels of hierarchy, with the lowest level
> > being a single CPU core on one socket, and four paths between
> > the sockets. To compute the associativity between two sockets,
> > you need to look at each combination of paths to find the best
> > match.
> >
> > Comparing sockets 0 and 1, the best matches are <1 1 1 0>
> > with <1 1 1 1>, and <0 0 0 0> with <0 0 0 1>. In each case, the
> > associativity is "3", meaning the first three entries match.
> >
> > Comparing sockets 0 and 3, we have four equally bad matches
> > that each only match in the highest-level domain, e.g. <0 0 0 0>
> > with <0 3 3 3>, so the associativity is only "1", and that means
> > the two nodes are less closely associated than two neighboring
> > ones.
> >
> > With the algorithm that powerpc uses to turn associativity into
> > distance, 2**(numlevels - associativity), this would put the
> > distance of neighboring nodes at "2", and the longest distance
> > at "8".
> 
> Thanks for the explanation, I understand how it works now.
> It is a bit complicated for me, and I think the distance property
> "node-matrix" in Ganapatrao's patch is more straightforward;
> what do you think?

I still think we should go the whole way of having something compatible
with the existing bindings, possibly using different property names
if there are objections to using the "ibm," prefix.

The associativity property is more expressive and lets you describe
things that you can't describe with the mem-map/cpu-map properties,
e.g. devices that are part of the NUMA hierarchy but not associated
to exactly one last-level node.

	Arnd

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa.
  2014-12-12 14:20                   ` Arnd Bergmann
@ 2014-12-15  3:50                     ` Hanjun Guo
  0 siblings, 0 replies; 35+ messages in thread
From: Hanjun Guo @ 2014-12-15  3:50 UTC (permalink / raw)
  To: linux-arm-kernel

On 2014/12/12 22:20, Arnd Bergmann wrote:
> On Thursday 11 December 2014 17:16:35 Hanjun Guo wrote:
>> On 2014/12/10 18:57, Arnd Bergmann wrote:
>>> On Wednesday 26 November 2014 17:12:49 Hanjun Guo wrote:
>>> The above topology is not easy to represent, but I think it would work
>>> like this (ignoring the threads/cores/clusters on the socket, which
>>> would also need to be described in a full DT), using multiple logical
>>> paths between the nodes:
>>>
>>> socket 0
>>> ibm,associativity = <0 0 0 0>, <1 1 1 0>, <2 2 0 0>, <3 0 0 0>;
>>>
>>> socket 1
>>> ibm,associativity = <1 1 1 1>, <0 0 0 1>, <2 2 2 1>, <3 3 1 1>;
>>>
>>> socket 2
>>> ibm,associativity = <2 2 2 2>, <0 0 2 2>, <1 1 1 2>, <3 3 3 2>;
>>>
>>> socket 3
>>> ibm,associativity = <3 3 3 3>, <0 3 3 3>, <1 1 3 3>, <2 2 2 3>;
>>>
>>> This describes four levels of hierarchy, with the lowest level
>>> being a single CPU core on one socket, and four paths between
>>> the sockets. To compute the associativity between two sockets,
>>> you need to look at each combination of paths to find the best
>>> match.
>>>
>>> Comparing sockets 0 and 1, the best matches are <1 1 1 0>
>>> with <1 1 1 1>, and <0 0 0 0> with <0 0 0 1>. In each case, the
>>> associativity is "3", meaning the first three entries match.
>>>
>>> Comparing sockets 0 and 3, we have four equally bad matches
>>> that each only match in the highest-level domain, e.g. <0 0 0 0>
>>> with <0 3 3 3>, so the associativity is only "1", and that means
>>> the two nodes are less closely associated than two neighboring
>>> ones.
>>>
>>> With the algorithm that powerpc uses to turn associativity into
>>> distance, 2**(numlevels - associativity), this would put the
>>> distance of neighboring nodes at "2", and the longest distance
>>> at "8".
>>
>> Thanks for the explanation, I understand how it works now.
>> It is a bit complicated for me, and I think the distance property
>> "node-matrix" in Ganapatrao's patch is more straightforward;
>> what do you think?
>
> I still think we should go the whole way of having something compatible
> with the existing bindings, possibly using different property names
> if there are objections to using the "ibm," prefix.

I agree that we should keep using the existing bindings and not
introduce a new one.

Thanks
Hanjun

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 4/4] arm64:numa: adding numa support for arm64 platforms.
       [not found]       ` <54883be3.8284440a.3154.ffffa34fSMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-12-15 18:16         ` Ganapatrao Kulkarni
  0 siblings, 0 replies; 35+ messages in thread
From: Ganapatrao Kulkarni @ 2014-12-15 18:16 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Roy/Leif

On Wed, Dec 10, 2014 at 4:26 AM, Ashok Kumar <ashoks@broadcom.com> wrote:
> On Sat, Dec 06, 2014 at 10:50:57AM -0800, Ganapatrao Kulkarni wrote:
>> On Sat, Dec 6, 2014 at 1:36 AM, Ashok Kumar <ashoks@broadcom.com> wrote:
>> > On Sat, Nov 22, 2014 at 02:53:30AM +0530, Ganapatrao Kulkarni wrote:
>> >> Adding numa support for arm64 based platforms.
>> >> creating numa mapping by parsing the dt node numa-map.
>> >>
>> >> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@caviumnetworks.com>
>> > Ganapat,
>> >
>> > Can we get a simple version of this patchset with just the associativity parameter for the cpu and memory nodes
>> > upstream? The initial patchset can be with the assumption of a simple full mesh topology for the distance.  The
>> > hierarchical complex topologies and the DT nodes and properties for that can be added in a later patch, once the
>> > standard for that is agreed upon.
>> > We are planning to support ACPI based NUMA for Broadcom Vulcan processors(http://www.broadcom.com/press/release.php?id=s797235).
>> > I am working on a patch to add support for the SRAT and SLIT ACPI tables based on your patch and I will post that
>> > patchset soon, so it will be helpful if the core changes are upstream.
>> >
>> To follow an ibm/ppc-like implementation, we need an efi-stub patch
>> that does not remove the memory nodes from the DT.
>> The current efi-stub parses the DT and removes the memory nodes.
>> Can I get the efi-stub patch to start with, please?
>
> Ganapat,
>  I tried the below patch in qemu and it works. Would this be of help to you?
>
> Roy/Ard,
>  Is the below patch fine?
>
> From 27b4aecf09707e3d5bd4ff7bf765cd609772476f Mon Sep 17 00:00:00 2001
> From: Ashok Kumar <ashoks@broadcom.com>
> Date: Tue, 9 Dec 2014 17:23:11 +0530
> Subject: [PATCH] efi/arm64: Remove deleting memory nodes in efi-stub
>
> Don't delete memory nodes from the DT, as they will be
> used by the NUMA configuration.
>
> Signed-off-by: Ashok Kumar <ashoks@broadcom.com>
> ---
>  arch/arm64/kernel/setup.c          | 12 +++++++++++-
>  drivers/firmware/efi/libstub/fdt.c | 24 +-----------------------
>  2 files changed, 12 insertions(+), 24 deletions(-)
>
> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index 27f65b0..ac6d34f 100644
> --- a/arch/arm64/kernel/setup.c
> +++ b/arch/arm64/kernel/setup.c
> @@ -302,7 +302,7 @@ static void __init setup_processor(void)
>
>  static void __init setup_machine_fdt(phys_addr_t dt_phys)
>  {
> -       if (!dt_phys || !early_init_dt_scan(phys_to_virt(dt_phys))) {
> +       if (!dt_phys || !early_init_dt_verify(phys_to_virt(dt_phys))) {
>                 early_print("\n"
>                         "Error: invalid device tree blob at physical address 0x%p (virtual address 0x%p)\n"
>                         "The dtb must be 8-byte aligned and passed in the first 512MB of memory\n"
> @@ -313,6 +313,16 @@ static void __init setup_machine_fdt(phys_addr_t dt_phys)
>                         cpu_relax();
>         }
>
> +       /* Retrieve various information from the /chosen node */
> +       of_scan_flat_dt(early_init_dt_scan_chosen, boot_command_line);
> +
> +       /* Initialize {size,address}-cells info */
> +       of_scan_flat_dt(early_init_dt_scan_root, NULL);
> +
> +       /* Setup memory, calling early_init_dt_add_memory_arch */
> +       if (!IS_ENABLED(CONFIG_EFI))
> +               of_scan_flat_dt(early_init_dt_scan_memory, NULL);
> +
>         machine_name = of_flat_dt_get_machine_name();
>  }
>
> diff --git a/drivers/firmware/efi/libstub/fdt.c b/drivers/firmware/efi/libstub/fdt.c
> index c846a96..a02e56e 100644
> --- a/drivers/firmware/efi/libstub/fdt.c
> +++ b/drivers/firmware/efi/libstub/fdt.c
> @@ -22,7 +22,7 @@ efi_status_t update_fdt(efi_system_table_t *sys_table, void *orig_fdt,
>                         unsigned long map_size, unsigned long desc_size,
>                         u32 desc_ver)
>  {
> -       int node, prev, num_rsv;
> +       int node, num_rsv;
>         int status;
>         u32 fdt_val32;
>         u64 fdt_val64;
> @@ -52,28 +52,6 @@ efi_status_t update_fdt(efi_system_table_t *sys_table, void *orig_fdt,
>                 goto fdt_set_fail;
>
>         /*
> -        * Delete any memory nodes present. We must delete nodes which
> -        * early_init_dt_scan_memory may try to use.
> -        */
> -       prev = 0;
> -       for (;;) {
> -               const char *type;
> -               int len;
> -
> -               node = fdt_next_node(fdt, prev, NULL);
> -               if (node < 0)
> -                       break;
> -
> -               type = fdt_getprop(fdt, node, "device_type", &len);
> -               if (type && strncmp(type, "memory", len) == 0) {
> -                       fdt_del_node(fdt, node);
> -                       continue;
> -               }
> -
> -               prev = node;
> -       }
> -
> -       /*
>          * Delete all memory reserve map entries. When booting via UEFI,
>          * kernel will use the UEFI memory map to find reserved regions.
>          */
> --
> 1.9.1
>
>
What will happen if we keep the memory nodes in the DT and boot with UEFI?
Will both the efi-stub and the DT try to add memblocks?
Does the kernel ignore a second request to add a memblock at the same address?


> Thanks,
> Ashok

thanks
Ganapat

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v2 1/4] arm64: defconfig: increase NR_CPUS range to 2-128
  2014-12-09  8:27       ` Arnd Bergmann
@ 2014-12-24 12:33         ` Ganapatrao Kulkarni
  0 siblings, 0 replies; 35+ messages in thread
From: Ganapatrao Kulkarni @ 2014-12-24 12:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Dec 9, 2014 at 1:57 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Monday 08 December 2014 17:57:03 Zi Shen Lim wrote:
>> Hi Arnd,
>>
>> On Mon, Nov 24, 2014 at 3:53 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>> >
>> > On Saturday 22 November 2014 02:53:27 Ganapatrao Kulkarni wrote:
>> > > Raising the maximum limit to 128. This is needed for Cavium's
>> > > Thunder system that will have 96 cores on Multi-node system.
>> > >
>> > > Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@caviumnetworks.com>
>> > >
>> >
>> > Could we please raise the compile-time limit to the highest number that
>> > you are able to boot successfully on some existing machine?
>> >
>> > There isn't much point in doubling this every few months.
>>
>> Agreed. If we look back at [1], Mark Rutland has actually compiled and
>> boot-tested NR_CPUS=4096 on Juno.
>>
>> [1] https://lkml.org/lkml/2014/9/8/537
>
> Ok, 4096 sounds like a good NR_CPUS limit then, it should last for a while.
Ok, I will set the range to 2-4096.
>
> For the defconfig, we probably want a much smaller value, either one that
> covers all known machines (96 at this time), or something that covers
> 95% of all users (maybe 32?) and does not have an serious impact on
> memory consumption or performance on small machines.
>
>         Arnd
thanks
Ganapat
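
Concretely, the change agreed in this exchange is a small edit to the NR_CPUS entry in arch/arm64/Kconfig. A hypothetical sketch (the prompt string and the default value shown here are assumptions, not the actual patch):

```kconfig
config NR_CPUS
	int "Maximum number of CPUs (2-4096)"
	range 2 4096
	# default kept small, per the defconfig discussion above
	default "64"
```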

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2014-12-24 12:33 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-21 21:23 [RFC PATCH v2 0/4] arm64:numa: Add numa support for arm64 platforms Ganapatrao Kulkarni
2014-11-21 21:23 ` [RFC PATCH v2 1/4] arm64: defconfig: increase NR_CPUS range to 2-128 Ganapatrao Kulkarni
2014-11-24 11:53   ` Arnd Bergmann
2014-12-09  1:57     ` Zi Shen Lim
2014-12-09  8:27       ` Arnd Bergmann
2014-12-24 12:33         ` Ganapatrao Kulkarni
2014-11-21 21:23 ` [RFC PATCH v2 2/4] Documentation: arm64/arm: dt bindings for numa Ganapatrao Kulkarni
2014-11-25  3:55   ` Shannon Zhao
2014-11-25  9:42     ` Hanjun Guo
2014-11-25 11:02       ` Arnd Bergmann
2014-11-25 13:15         ` Ganapatrao Kulkarni
2014-11-25 19:00           ` Arnd Bergmann
2014-11-25 21:09             ` Arnd Bergmann
2014-11-26  9:12             ` Hanjun Guo
2014-12-10 10:57               ` Arnd Bergmann
2014-12-11  9:16                 ` Hanjun Guo
2014-12-12 14:20                   ` Arnd Bergmann
2014-12-15  3:50                     ` Hanjun Guo
2014-11-30 16:38             ` Ganapatrao Kulkarni
2014-11-30 17:13               ` Arnd Bergmann
2014-11-25 14:54         ` Hanjun Guo
2014-11-26  2:29         ` Shannon Zhao
2014-11-26 16:51           ` Arnd Bergmann
2014-11-21 21:23 ` [RFC PATCH v2 3/4] arm64:thunder: Add initial dts for Cavium's Thunder SoC in 2 Node topology Ganapatrao Kulkarni
2014-11-24 11:59   ` Arnd Bergmann
2014-11-24 16:32     ` Roy Franz
2014-11-24 17:01       ` Arnd Bergmann
2014-11-25 12:38         ` Ard Biesheuvel
2014-11-25 12:45           ` Arnd Bergmann
2014-11-24 17:01   ` Marc Zyngier
2014-11-21 21:23 ` [RFC PATCH v2 4/4] arm64:numa: adding numa support for arm64 platforms Ganapatrao Kulkarni
2014-12-06  9:36   ` Ashok Kumar
     [not found]   ` <5482ce36.c9e2420a.5d40.71c7SMTPIN_ADDED_BROKEN@mx.google.com>
2014-12-06 18:50     ` Ganapatrao Kulkarni
2014-12-10 12:26       ` Ashok Kumar
     [not found]       ` <54883be3.8284440a.3154.ffffa34fSMTPIN_ADDED_BROKEN@mx.google.com>
2014-12-15 18:16         ` Ganapatrao Kulkarni

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).