* [PATCH 00/37] Add device tree based NUMA support to Arm
@ 2021-09-23 12:01 Wei Chen
  2021-09-23 12:02 ` [PATCH 01/37] xen/arm: Print a 64-bit number in hex from early uart Wei Chen
                   ` (36 more replies)
  0 siblings, 37 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:01 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

Xen's memory allocation and scheduler modules are NUMA aware.
However, only x86 has implemented the architecture APIs needed
to support NUMA. Arm has been providing a set of fake
architecture APIs to stay compatible with the NUMA-aware memory
allocator and scheduler.

Arm systems have worked well as single-node NUMA systems with
these fake APIs, because there were no multi-node NUMA systems
on Arm. But in recent years, more and more Arm devices support
multiple NUMA nodes.

So now we have a new problem: when Xen runs on these Arm
devices, it still treats them as single-node SMP systems. The
NUMA affinity capability of Xen's memory allocator and
scheduler becomes meaningless, because they rely on input data
that does not reflect the real NUMA layout.

Xen still assumes the access time to all memory is the same for
all CPUs. However, Xen may allocate memory to a VM from
different NUMA nodes with different access speeds. This
difference can be amplified by workloads inside the VM, causing
performance instability and timeouts.

So in this patch series, we implement a set of NUMA APIs that
use the device tree to describe the NUMA layout. We reuse most
of the x86 NUMA code to create and maintain the mapping between
memory and CPUs, and to build the distance matrix between any
two NUMA nodes. Except for ACPI and some x86-specific code, we
have moved the rest to common code. In the next stage, when we
implement ACPI-based NUMA for Arm64, we may move the ACPI NUMA
code to common as well, but for now we keep it x86-only.

This patch series has been tested and boots well on one Arm64
NUMA machine and one HPE x86 NUMA machine.
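
For reference, the device tree layout such a parser consumes
follows the standard numa-node-id / numa-distance-map-v1
binding. A minimal illustrative two-node sketch (node names,
addresses and sizes here are made up, not from this series):

```dts
/* Each cpu and memory node carries a numa-node-id, and a
 * distance-map node supplies the inter-node distance matrix. */
cpus {
    cpu@0 {
        device_type = "cpu";
        reg = <0x0>;
        numa-node-id = <0>;
    };
    cpu@1 {
        device_type = "cpu";
        reg = <0x1>;
        numa-node-id = <1>;
    };
};

memory@80000000 {
    device_type = "memory";
    reg = <0x0 0x80000000 0x0 0x80000000>;
    numa-node-id = <0>;
};

memory@100000000 {
    device_type = "memory";
    reg = <0x1 0x00000000 0x0 0x80000000>;
    numa-node-id = <1>;
};

distance-map {
    compatible = "numa-distance-map-v1";
    distance-matrix = <0 0 10>,
                      <0 1 20>,
                      <1 0 20>,
                      <1 1 10>;
};
```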

---
@Julien, about the numa=noacpi option: I haven't removed it in
this patch series. I tried, but this option is not easy to
remove, so I'm not going to attempt that in this series.

rfc -> v1:
 1. Re-order the whole patch series to avoid temporary code
 2. Add detection of discontinuous node memory ranges
 3. Fix typos in commit messages and code comments.
 4. For variables that are used in common code, we no longer
    convert them to external. Instead, we export helpers to
    provide access to them.
 5. Revert memnodemap[0] to 0 when NUMA init failed. Change
    memnodemapsize from ARRAY_SIZE(_memnodemap) to 1 to reflect
    reality.
 6. Use arch_have_default_dmazone in page_alloc.c instead of
    changing code inside Arm
 7. Keep Kconfig options alphabetically sorted.
 8. Replace #if !defined by #ifndef
 9. Use paddr_t for addresses in NUMA node structures and function
    parameters
10. Use fw_numa to replace acpi_numa for neutrality
11. Change BIOS to Firmware in print message.
12. Promote VIRTUAL_BUG_ON to ASSERT
13. Introduce CONFIG_EFI to stub API for non-EFI architecture
14. Use EFI stub API to replace arch helper for efi_enabled
15. Use NR_MEM_BANKS for Arm's NR_NODE_MEMBLKS
16. Change matrix map default value from NUMA_REMOTE_DISTANCE to 0
17. Remove check in numa_set_node.
18. Follow the x86's method of adding CPU to NUMA
19. Use fdt prefix for all device tree NUMA parser's API
20. Check mismatched bi-directional distances in the matrix map
21. Remove useless fdt type check function
22. Update docs to remove x86-specific NUMA wording
23. Introduce Arm generic NUMA Kconfig option

Wei Chen (37):
  xen/arm: Print a 64-bit number in hex from early uart
  xen: introduce a Kconfig option to configure NUMA nodes number
  xen/x86: Initialize memnodemapsize while faking NUMA node
  xen: introduce an arch helper for default dma zone status
  xen: decouple NUMA from ACPI in Kconfig
  xen/arm: use !CONFIG_NUMA to keep fake NUMA API
  xen/x86: use paddr_t for addresses in NUMA node structure
  xen/x86: add detection of discontinous node memory range
  xen/x86: introduce two helpers to access memory hotplug end
  xen/x86: use helpers to access/update mem_hotplug
  xen/x86: abstract neutral code from acpi_numa_memory_affinity_init
  xen/x86: decouple nodes_cover_memory from E820 map
  xen/x86: decouple processor_nodes_parsed from acpi numa functions
  xen/x86: use name fw_numa to replace acpi_numa
  xen/x86: rename acpi_scan_nodes to numa_scan_nodes
  xen/x86: export srat_bad to external
  xen/x86: use CONFIG_NUMA to gate numa_scan_nodes
  xen: move NUMA common code from x86 to common
  xen/x86: promote VIRTUAL_BUG_ON to ASSERT in
  xen: introduce CONFIG_EFI to stub API for non-EFI architecture
  xen/arm: Keep memory nodes in dtb for NUMA when boot from EFI
  xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS
  xen/arm: implement node distance helpers for Arm
  xen/arm: implement two arch helpers to get memory map info
  xen/arm: implement bad_srat for Arm NUMA initialization
  xen/arm: build NUMA cpu_to_node map in dt_smp_init_cpus
  xen/arm: Add boot and secondary CPU to NUMA system
  xen/arm: stub memory hotplug access helpers for Arm
  xen/arm: introduce a helper to parse device tree processor node
  xen/arm: introduce a helper to parse device tree memory node
  xen/arm: introduce a helper to parse device tree NUMA distance map
  xen/arm: unified entry to parse all NUMA data from device tree
  xen/arm: keep guest still be NUMA unware
  xen/arm: enable device tree based NUMA in system init
  xen/arm: use CONFIG_NUMA to gate node_online_map in smpboot
  xen/arm: Provide Kconfig options for Arm to enable NUMA
  docs: update numa command line to support Arm

 docs/misc/xen-command-line.pandoc |   2 +-
 xen/arch/Kconfig                  |  11 +
 xen/arch/arm/Kconfig              |  12 +
 xen/arch/arm/Makefile             |   4 +-
 xen/arch/arm/arm64/head.S         |   9 +-
 xen/arch/arm/bootfdt.c            |   8 +-
 xen/arch/arm/domain_build.c       |   6 +
 xen/arch/arm/efi/efi-boot.h       |  25 --
 xen/arch/arm/numa.c               | 155 ++++++++++
 xen/arch/arm/numa_device_tree.c   | 274 ++++++++++++++++++
 xen/arch/arm/setup.c              |  12 +
 xen/arch/arm/smpboot.c            |  39 ++-
 xen/arch/x86/Kconfig              |   3 +-
 xen/arch/x86/numa.c               | 449 ++---------------------------
 xen/arch/x86/setup.c              |   2 +-
 xen/arch/x86/srat.c               | 232 ++-------------
 xen/common/Kconfig                |  14 +
 xen/common/Makefile               |   2 +
 xen/common/numa.c                 | 450 ++++++++++++++++++++++++++++++
 xen/common/numa_srat.c            | 264 ++++++++++++++++++
 xen/common/page_alloc.c           |   2 +-
 xen/drivers/acpi/Kconfig          |   3 +-
 xen/drivers/acpi/Makefile         |   2 +-
 xen/include/asm-arm/mm.h          |  10 +
 xen/include/asm-arm/numa.h        |  50 ++++
 xen/include/asm-x86/acpi.h        |   4 -
 xen/include/asm-x86/config.h      |   1 -
 xen/include/asm-x86/mm.h          |  10 +
 xen/include/asm-x86/numa.h        |  65 +----
 xen/include/asm-x86/setup.h       |   1 -
 xen/include/xen/efi.h             |  11 +
 xen/include/xen/numa.h            |  94 ++++++-
 32 files changed, 1470 insertions(+), 756 deletions(-)
 create mode 100644 xen/arch/arm/numa.c
 create mode 100644 xen/arch/arm/numa_device_tree.c
 create mode 100644 xen/common/numa.c
 create mode 100644 xen/common/numa_srat.c

-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 01/37] xen/arm: Print a 64-bit number in hex from early uart
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-23 12:02 ` [PATCH 02/37] xen: introduce a Kconfig option to configure NUMA nodes number Wei Chen
                   ` (35 subsequent siblings)
  36 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

The current putn function used for early printing can only
print the low 32 bits of an AArch64 register. This can lose
important information while debugging with the early console.
For example:
(XEN) Bringing up CPU5
- CPU 0000000100000100 booting -
will be truncated to
(XEN) Bringing up CPU5
- CPU 00000100 booting -

In this patch, we increase the number of print loops and the
shift bits to make putn print a 64-bit number.

Signed-off-by: Wei Chen <wei.chen@arm.com>
Acked-by: Julien Grall <jgrall@amazon.com>
---
 xen/arch/arm/arm64/head.S | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/xen/arch/arm/arm64/head.S b/xen/arch/arm/arm64/head.S
index aa1f88c764..d957ea377b 100644
--- a/xen/arch/arm/arm64/head.S
+++ b/xen/arch/arm/arm64/head.S
@@ -862,17 +862,18 @@ puts:
         ret
 ENDPROC(puts)
 
-/* Print a 32-bit number in hex.  Specific to the PL011 UART.
+/* Print a 64-bit number in hex.
  * x0: Number to print.
  * x23: Early UART base address
  * Clobbers x0-x3 */
+#define PRINT_MASK 0xf000000000000000
 putn:
         adr   x1, hex
-        mov   x3, #8
+        mov   x3, #16
 1:
         early_uart_ready x23, 2
-        and   x2, x0, #0xf0000000    /* Mask off the top nybble */
-        lsr   x2, x2, #28
+        and   x2, x0, #PRINT_MASK    /* Mask off the top nybble */
+        lsr   x2, x2, #60
         ldrb  w2, [x1, x2]           /* Convert to a char */
         early_uart_transmit x23, w2
         lsl   x0, x0, #4             /* Roll it through one nybble at a time */
-- 
2.25.1




* [PATCH 02/37] xen: introduce a Kconfig option to configure NUMA nodes number
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
  2021-09-23 12:02 ` [PATCH 01/37] xen/arm: Print a 64-bit number in hex from early uart Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-23 23:45   ` Stefano Stabellini
  2021-09-24  8:55   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 03/37] xen/x86: Initialize memnodemapsize while faking NUMA node Wei Chen
                   ` (34 subsequent siblings)
  36 siblings, 2 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

The number of NUMA nodes is currently a hardcoded
configuration, which is difficult for an administrator to
change without modifying the code.

So in this patch, we introduce a new Kconfig option that lets
administrators change the number of NUMA nodes conveniently.
Considering that not all architectures support NUMA, this
Kconfig option is only visible on NUMA-enabled architectures.
Architectures without NUMA support still use 1 as MAX_NUMNODES.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/Kconfig           | 11 +++++++++++
 xen/include/asm-x86/numa.h |  2 --
 xen/include/xen/numa.h     | 10 +++++-----
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/xen/arch/Kconfig b/xen/arch/Kconfig
index f16eb0df43..8a20da67ed 100644
--- a/xen/arch/Kconfig
+++ b/xen/arch/Kconfig
@@ -17,3 +17,14 @@ config NR_CPUS
 	  For CPU cores which support Simultaneous Multi-Threading or similar
 	  technologies, this the number of logical threads which Xen will
 	  support.
+
+config NR_NUMA_NODES
+	int "Maximum number of NUMA nodes supported"
+	range 1 4095
+	default "64"
+	depends on NUMA
+	help
+	  Controls the build-time size of various arrays and bitmaps
+	  associated with multiple-nodes management. It is the upper bound of
+	  the number of NUMA nodes the scheduler, memory allocation and other
+	  NUMA-aware components can handle.
diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
index bada2c0bb9..3cf26c2def 100644
--- a/xen/include/asm-x86/numa.h
+++ b/xen/include/asm-x86/numa.h
@@ -3,8 +3,6 @@
 
 #include <xen/cpumask.h>
 
-#define NODES_SHIFT 6
-
 typedef u8 nodeid_t;
 
 extern int srat_rev;
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index 7aef1a88dc..52950a3150 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -3,14 +3,14 @@
 
 #include <asm/numa.h>
 
-#ifndef NODES_SHIFT
-#define NODES_SHIFT     0
-#endif
-
 #define NUMA_NO_NODE     0xFF
 #define NUMA_NO_DISTANCE 0xFF
 
-#define MAX_NUMNODES    (1 << NODES_SHIFT)
+#ifdef CONFIG_NR_NUMA_NODES
+#define MAX_NUMNODES CONFIG_NR_NUMA_NODES
+#else
+#define MAX_NUMNODES    1
+#endif
 
 #define vcpu_to_node(v) (cpu_to_node((v)->processor))
 
-- 
2.25.1




* [PATCH 03/37] xen/x86: Initialize memnodemapsize while faking NUMA node
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
  2021-09-23 12:02 ` [PATCH 01/37] xen/arm: Print a 64-bit number in hex from early uart Wei Chen
  2021-09-23 12:02 ` [PATCH 02/37] xen: introduce a Kconfig option to configure NUMA nodes number Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  8:57   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 04/37] xen: introduce an arch helper for default dma zone status Wei Chen
                   ` (33 subsequent siblings)
  36 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

When NUMA is turned off or the system lacks NUMA support, Xen
fakes a NUMA node to make the system work as a single-node
NUMA system.

In this case the memory node map doesn't need to be allocated
from boot pages; it uses _memnodemap directly. But
memnodemapsize hasn't been set, so an assertion should trigger
in phys_to_nid. Because x86 was using an empty macro
"VIRTUAL_BUG_ON" in place of ASSERT, this bug is not triggered
on x86.

In reality, Xen only uses 1 slot of memnodemap in this case.
So in this patch we set memnodemap[0] to 0 and memnodemapsize
to 1 to fix it.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/x86/numa.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index f1066c59c7..ce79ee44ce 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -270,6 +270,10 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
     /* setup dummy node covering all memory */
     memnode_shift = BITS_PER_LONG - 1;
     memnodemap = _memnodemap;
+    /* Dummy node only uses 1 slot in reality */
+    memnodemap[0] = 0;
+    memnodemapsize = 1;
+
     nodes_clear(node_online_map);
     node_set_online(0);
     for ( i = 0; i < nr_cpu_ids; i++ )
-- 
2.25.1




* [PATCH 04/37] xen: introduce an arch helper for default dma zone status
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (2 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 03/37] xen/x86: Initialize memnodemapsize while faking NUMA node Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-23 23:55   ` Stefano Stabellini
  2022-01-17 16:10   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 05/37] xen: decouple NUMA from ACPI in Kconfig Wei Chen
                   ` (32 subsequent siblings)
  36 siblings, 2 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

In the current code, when Xen is running on a multi-node NUMA
system, it sets dma_bitsize in end_boot_allocator to reserve
some low-address memory for DMA.

The current implementation carries some x86 implications,
because on x86 memory starts from 0, so on a multi-node NUMA
system a single node may contain the majority or all of the DMA
memory. x86 prefers to hand out memory from non-local
allocations rather than exhausting the DMA memory ranges, hence
it uses dma_bitsize to set aside a largely arbitrary amount of
memory for DMA ranges; allocations from these ranges happen
only after all other nodes' memory is exhausted.

But these implications are not shared across all architectures;
Arm, for example, doesn't have them. So in this patch, we
introduce an arch_have_default_dmazone helper for an
architecture to indicate whether it needs to set dma_bitsize to
reserve memory for DMA allocations.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/x86/numa.c        | 5 +++++
 xen/common/page_alloc.c    | 2 +-
 xen/include/asm-arm/numa.h | 5 +++++
 xen/include/asm-x86/numa.h | 1 +
 4 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index ce79ee44ce..1fabbe8281 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -371,6 +371,11 @@ unsigned int __init arch_get_dma_bitsize(void)
                  + PAGE_SHIFT, 32);
 }
 
+unsigned int arch_have_default_dmazone(void)
+{
+    return ( num_online_nodes() > 1 ) ? 1 : 0;
+}
+
 static void dump_numa(unsigned char key)
 {
     s_time_t now = NOW();
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 5801358b4b..80916205e5 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1889,7 +1889,7 @@ void __init end_boot_allocator(void)
     }
     nr_bootmem_regions = 0;
 
-    if ( !dma_bitsize && (num_online_nodes() > 1) )
+    if ( !dma_bitsize && arch_have_default_dmazone() )
         dma_bitsize = arch_get_dma_bitsize();
 
     printk("Domain heap initialised");
diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
index 31a6de4e23..9d5739542d 100644
--- a/xen/include/asm-arm/numa.h
+++ b/xen/include/asm-arm/numa.h
@@ -25,6 +25,11 @@ extern mfn_t first_valid_mfn;
 #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
 #define __node_distance(a, b) (20)
 
+static inline unsigned int arch_have_default_dmazone(void)
+{
+    return 0;
+}
+
 #endif /* __ARCH_ARM_NUMA_H */
 /*
  * Local variables:
diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
index 3cf26c2def..8060cbf3f4 100644
--- a/xen/include/asm-x86/numa.h
+++ b/xen/include/asm-x86/numa.h
@@ -78,5 +78,6 @@ extern int valid_numa_range(u64 start, u64 end, nodeid_t node);
 void srat_parse_regions(u64 addr);
 extern u8 __node_distance(nodeid_t a, nodeid_t b);
 unsigned int arch_get_dma_bitsize(void);
+unsigned int arch_have_default_dmazone(void);
 
 #endif
-- 
2.25.1




* [PATCH 05/37] xen: decouple NUMA from ACPI in Kconfig
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (3 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 04/37] xen: introduce an arch helper for default dma zone status Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-23 12:02 ` [PATCH 06/37] xen/arm: use !CONFIG_NUMA to keep fake NUMA API Wei Chen
                   ` (31 subsequent siblings)
  36 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

The current Xen code only implements x86 ACPI-based NUMA
support, so in the Xen Kconfig system NUMA equals ACPI_NUMA:
x86 selects NUMA by default, and CONFIG_ACPI_NUMA is hardcoded
in config.h.

In a follow-up patch, we will introduce support for NUMA using
the device tree. That means we will have two NUMA
implementations, so in this patch we decouple NUMA from
ACPI-based NUMA in Kconfig, making NUMA a common feature that
device tree based NUMA can also select.

Signed-off-by: Wei Chen <wei.chen@arm.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/Kconfig         | 2 +-
 xen/common/Kconfig           | 3 +++
 xen/drivers/acpi/Kconfig     | 3 ++-
 xen/drivers/acpi/Makefile    | 2 +-
 xen/include/asm-x86/config.h | 1 -
 5 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
index 1f83518ee0..28d13b9705 100644
--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -6,6 +6,7 @@ config X86
 	def_bool y
 	select ACPI
 	select ACPI_LEGACY_TABLES_LOOKUP
+	select ACPI_NUMA
 	select ALTERNATIVE_CALL
 	select ARCH_SUPPORTS_INT128
 	select CORE_PARKING
@@ -25,7 +26,6 @@ config X86
 	select HAS_UBSAN
 	select HAS_VPCI if HVM
 	select NEEDS_LIBELF
-	select NUMA
 
 config ARCH_DEFCONFIG
 	string
diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index db687b1785..9ebb1c239b 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -70,6 +70,9 @@ config MEM_ACCESS
 config NEEDS_LIBELF
 	bool
 
+config NUMA
+	bool
+
 config STATIC_MEMORY
 	bool "Static Allocation Support (UNSUPPORTED)" if UNSUPPORTED
 	depends on ARM
diff --git a/xen/drivers/acpi/Kconfig b/xen/drivers/acpi/Kconfig
index b64d3731fb..e3f3d8f4b1 100644
--- a/xen/drivers/acpi/Kconfig
+++ b/xen/drivers/acpi/Kconfig
@@ -5,5 +5,6 @@ config ACPI
 config ACPI_LEGACY_TABLES_LOOKUP
 	bool
 
-config NUMA
+config ACPI_NUMA
 	bool
+	select NUMA
diff --git a/xen/drivers/acpi/Makefile b/xen/drivers/acpi/Makefile
index 4f8e97228e..2fc5230253 100644
--- a/xen/drivers/acpi/Makefile
+++ b/xen/drivers/acpi/Makefile
@@ -3,7 +3,7 @@ obj-y += utilities/
 obj-$(CONFIG_X86) += apei/
 
 obj-bin-y += tables.init.o
-obj-$(CONFIG_NUMA) += numa.o
+obj-$(CONFIG_ACPI_NUMA) += numa.o
 obj-y += osl.o
 obj-$(CONFIG_HAS_CPUFREQ) += pmstat.o
 
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index 883c2ef0df..9a6f0a6edf 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -31,7 +31,6 @@
 /* Intel P4 currently has largest cache line (L2 line size is 128 bytes). */
 #define CONFIG_X86_L1_CACHE_SHIFT 7
 
-#define CONFIG_ACPI_NUMA 1
 #define CONFIG_ACPI_SRAT 1
 #define CONFIG_ACPI_CSTATE 1
 
-- 
2.25.1




* [PATCH 06/37] xen/arm: use !CONFIG_NUMA to keep fake NUMA API
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (4 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 05/37] xen: decouple NUMA from ACPI in Kconfig Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  0:05   ` Stefano Stabellini
  2021-09-23 12:02 ` [PATCH 07/37] xen/x86: use paddr_t for addresses in NUMA node structure Wei Chen
                   ` (30 subsequent siblings)
  36 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

We introduced CONFIG_NUMA in a previous patch, and at this
stage the option is enabled only on x86. In a follow-up patch,
we will enable it for Arm. But we still want users to be able
to disable CONFIG_NUMA through Kconfig; in that case, keeping
the current fake NUMA API lets Arm code still work with the
NUMA-aware memory allocator and scheduler.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/include/asm-arm/numa.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
index 9d5739542d..8f1c67e3eb 100644
--- a/xen/include/asm-arm/numa.h
+++ b/xen/include/asm-arm/numa.h
@@ -5,6 +5,8 @@
 
 typedef u8 nodeid_t;
 
+#ifndef CONFIG_NUMA
+
 /* Fake one node for now. See also node_online_map. */
 #define cpu_to_node(cpu) 0
 #define node_to_cpumask(node)   (cpu_online_map)
@@ -25,6 +27,8 @@ extern mfn_t first_valid_mfn;
 #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
 #define __node_distance(a, b) (20)
 
+#endif
+
 static inline unsigned int arch_have_default_dmazone(void)
 {
     return 0;
-- 
2.25.1




* [PATCH 07/37] xen/x86: use paddr_t for addresses in NUMA node structure
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (5 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 06/37] xen/arm: use !CONFIG_NUMA to keep fake NUMA API Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  0:11   ` Stefano Stabellini
  2022-01-18 15:22   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 08/37] xen/x86: add detection of discontinous node memory range Wei Chen
                   ` (29 subsequent siblings)
  36 siblings, 2 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

The NUMA node structure "struct node" uses u64 for node memory
ranges. To let other architectures reuse this NUMA node related
code, we replace the u64 with paddr_t, and use pfn_to_paddr and
paddr_to_pfn to replace explicit shift operations. The related
PRIx64 in print messages has been replaced by PRIpaddr at the
same time.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/x86/numa.c        | 32 +++++++++++++++++---------------
 xen/arch/x86/srat.c        | 26 +++++++++++++-------------
 xen/include/asm-x86/numa.h |  8 ++++----
 3 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 1fabbe8281..6337bbdf31 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -165,12 +165,12 @@ int __init compute_hash_shift(struct node *nodes, int numnodes,
     return shift;
 }
 /* initialize NODE_DATA given nodeid and start/end */
-void __init setup_node_bootmem(nodeid_t nodeid, u64 start, u64 end)
-{ 
+void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end)
+{
     unsigned long start_pfn, end_pfn;
 
-    start_pfn = start >> PAGE_SHIFT;
-    end_pfn = end >> PAGE_SHIFT;
+    start_pfn = paddr_to_pfn(start);
+    end_pfn = paddr_to_pfn(end);
 
     NODE_DATA(nodeid)->node_start_pfn = start_pfn;
     NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
@@ -201,11 +201,12 @@ void __init numa_init_array(void)
 static int numa_fake __initdata = 0;
 
 /* Numa emulation */
-static int __init numa_emulation(u64 start_pfn, u64 end_pfn)
+static int __init numa_emulation(unsigned long start_pfn,
+                                 unsigned long end_pfn)
 {
     int i;
     struct node nodes[MAX_NUMNODES];
-    u64 sz = ((end_pfn - start_pfn)<<PAGE_SHIFT) / numa_fake;
+    u64 sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake;
 
     /* Kludge needed for the hash function */
     if ( hweight64(sz) > 1 )
@@ -221,9 +222,9 @@ static int __init numa_emulation(u64 start_pfn, u64 end_pfn)
     memset(&nodes,0,sizeof(nodes));
     for ( i = 0; i < numa_fake; i++ )
     {
-        nodes[i].start = (start_pfn<<PAGE_SHIFT) + i*sz;
+        nodes[i].start = pfn_to_paddr(start_pfn) + i*sz;
         if ( i == numa_fake - 1 )
-            sz = (end_pfn<<PAGE_SHIFT) - nodes[i].start;
+            sz = pfn_to_paddr(end_pfn) - nodes[i].start;
         nodes[i].end = nodes[i].start + sz;
         printk(KERN_INFO "Faking node %d at %"PRIx64"-%"PRIx64" (%"PRIu64"MB)\n",
                i,
@@ -249,24 +250,26 @@ static int __init numa_emulation(u64 start_pfn, u64 end_pfn)
 void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 { 
     int i;
+    paddr_t start, end;
 
 #ifdef CONFIG_NUMA_EMU
     if ( numa_fake && !numa_emulation(start_pfn, end_pfn) )
         return;
 #endif
 
+    start = pfn_to_paddr(start_pfn);
+    end = pfn_to_paddr(end_pfn);
+
 #ifdef CONFIG_ACPI_NUMA
-    if ( !numa_off && !acpi_scan_nodes((u64)start_pfn << PAGE_SHIFT,
-         (u64)end_pfn << PAGE_SHIFT) )
+    if ( !numa_off && !acpi_scan_nodes(start, end) )
         return;
 #endif
 
     printk(KERN_INFO "%s\n",
            numa_off ? "NUMA turned off" : "No NUMA configuration found");
 
-    printk(KERN_INFO "Faking a node at %016"PRIx64"-%016"PRIx64"\n",
-           (u64)start_pfn << PAGE_SHIFT,
-           (u64)end_pfn << PAGE_SHIFT);
+    printk(KERN_INFO "Faking a node at %016"PRIpaddr"-%016"PRIpaddr"\n",
+           start, end);
     /* setup dummy node covering all memory */
     memnode_shift = BITS_PER_LONG - 1;
     memnodemap = _memnodemap;
@@ -279,8 +282,7 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
     for ( i = 0; i < nr_cpu_ids; i++ )
         numa_set_node(i, 0);
     cpumask_copy(&node_to_cpumask[0], cpumask_of(0));
-    setup_node_bootmem(0, (u64)start_pfn << PAGE_SHIFT,
-                    (u64)end_pfn << PAGE_SHIFT);
+    setup_node_bootmem(0, start, end);
 }
 
 void numa_add_cpu(int cpu)
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 6b77b98201..7d20d7f222 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -104,7 +104,7 @@ nodeid_t setup_node(unsigned pxm)
 	return node;
 }
 
-int valid_numa_range(u64 start, u64 end, nodeid_t node)
+int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node)
 {
 	int i;
 
@@ -119,7 +119,7 @@ int valid_numa_range(u64 start, u64 end, nodeid_t node)
 	return 0;
 }
 
-static __init int conflicting_memblks(u64 start, u64 end)
+static __init int conflicting_memblks(paddr_t start, paddr_t end)
 {
 	int i;
 
@@ -135,7 +135,7 @@ static __init int conflicting_memblks(u64 start, u64 end)
 	return -1;
 }
 
-static __init void cutoff_node(int i, u64 start, u64 end)
+static __init void cutoff_node(int i, paddr_t start, paddr_t end)
 {
 	struct node *nd = &nodes[i];
 	if (nd->start < start) {
@@ -275,7 +275,7 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
 void __init
 acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 {
-	u64 start, end;
+	paddr_t start, end;
 	unsigned pxm;
 	nodeid_t node;
 	int i;
@@ -318,7 +318,7 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 		bool mismatch = !(ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) !=
 		                !test_bit(i, memblk_hotplug);
 
-		printk("%sSRAT: PXM %u (%"PRIx64"-%"PRIx64") overlaps with itself (%"PRIx64"-%"PRIx64")\n",
+		printk("%sSRAT: PXM %u (%"PRIpaddr"-%"PRIpaddr") overlaps with itself (%"PRIpaddr"-%"PRIpaddr")\n",
 		       mismatch ? KERN_ERR : KERN_WARNING, pxm, start, end,
 		       node_memblk_range[i].start, node_memblk_range[i].end);
 		if (mismatch) {
@@ -327,7 +327,7 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 		}
 	} else {
 		printk(KERN_ERR
-		       "SRAT: PXM %u (%"PRIx64"-%"PRIx64") overlaps with PXM %u (%"PRIx64"-%"PRIx64")\n",
+		       "SRAT: PXM %u (%"PRIpaddr"-%"PRIpaddr") overlaps with PXM %u (%"PRIpaddr"-%"PRIpaddr")\n",
 		       pxm, start, end, node_to_pxm(memblk_nodeid[i]),
 		       node_memblk_range[i].start, node_memblk_range[i].end);
 		bad_srat();
@@ -346,7 +346,7 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 				nd->end = end;
 		}
 	}
-	printk(KERN_INFO "SRAT: Node %u PXM %u %"PRIx64"-%"PRIx64"%s\n",
+	printk(KERN_INFO "SRAT: Node %u PXM %u %"PRIpaddr"-%"PRIpaddr"%s\n",
 	       node, pxm, start, end,
 	       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
 
@@ -369,7 +369,7 @@ static int __init nodes_cover_memory(void)
 
 	for (i = 0; i < e820.nr_map; i++) {
 		int j, found;
-		unsigned long long start, end;
+		paddr_t start, end;
 
 		if (e820.map[i].type != E820_RAM) {
 			continue;
@@ -396,7 +396,7 @@ static int __init nodes_cover_memory(void)
 
 		if (start < end) {
 			printk(KERN_ERR "SRAT: No PXM for e820 range: "
-				"%016Lx - %016Lx\n", start, end);
+				"%"PRIpaddr" - %"PRIpaddr"\n", start, end);
 			return 0;
 		}
 	}
@@ -432,7 +432,7 @@ static int __init srat_parse_region(struct acpi_subtable_header *header,
 	return 0;
 }
 
-void __init srat_parse_regions(u64 addr)
+void __init srat_parse_regions(paddr_t addr)
 {
 	u64 mask;
 	unsigned int i;
@@ -441,7 +441,7 @@ void __init srat_parse_regions(u64 addr)
 	    acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat))
 		return;
 
-	srat_region_mask = pdx_init_mask(addr);
+	srat_region_mask = pdx_init_mask((u64)addr);
 	acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
 			      srat_parse_region, 0);
 
@@ -457,7 +457,7 @@ void __init srat_parse_regions(u64 addr)
 }
 
 /* Use the information discovered above to actually set up the nodes. */
-int __init acpi_scan_nodes(u64 start, u64 end)
+int __init acpi_scan_nodes(paddr_t start, paddr_t end)
 {
 	int i;
 	nodemask_t all_nodes_parsed;
@@ -489,7 +489,7 @@ int __init acpi_scan_nodes(u64 start, u64 end)
 	/* Finally register nodes */
 	for_each_node_mask(i, all_nodes_parsed)
 	{
-		u64 size = nodes[i].end - nodes[i].start;
+		paddr_t size = nodes[i].end - nodes[i].start;
 		if ( size == 0 )
 			printk(KERN_WARNING "SRAT: Node %u has no memory. "
 			       "BIOS Bug or mis-configured hardware?\n", i);
diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
index 8060cbf3f4..50cfd8e7ef 100644
--- a/xen/include/asm-x86/numa.h
+++ b/xen/include/asm-x86/numa.h
@@ -16,7 +16,7 @@ extern cpumask_t     node_to_cpumask[];
 #define node_to_cpumask(node)    (node_to_cpumask[node])
 
 struct node { 
-	u64 start,end; 
+	paddr_t start,end;
 };
 
 extern int compute_hash_shift(struct node *nodes, int numnodes,
@@ -36,7 +36,7 @@ extern void numa_set_node(int cpu, nodeid_t node);
 extern nodeid_t setup_node(unsigned int pxm);
 extern void srat_detect_node(int cpu);
 
-extern void setup_node_bootmem(nodeid_t nodeid, u64 start, u64 end);
+extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end);
 extern nodeid_t apicid_to_node[];
 extern void init_cpu_to_node(void);
 
@@ -73,9 +73,9 @@ static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
 #define node_end_pfn(nid)       (NODE_DATA(nid)->node_start_pfn + \
 				 NODE_DATA(nid)->node_spanned_pages)
 
-extern int valid_numa_range(u64 start, u64 end, nodeid_t node);
+extern int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node);
 
-void srat_parse_regions(u64 addr);
+void srat_parse_regions(paddr_t addr);
 extern u8 __node_distance(nodeid_t a, nodeid_t b);
 unsigned int arch_get_dma_bitsize(void);
 unsigned int arch_have_default_dmazone(void);
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 08/37] xen/x86: add detection of discontinous node memory range
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (6 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 07/37] xen/x86: use paddr_t for addresses in NUMA node structure Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  0:25   ` Stefano Stabellini
  2022-01-18 16:13   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 09/37] xen/x86: introduce two helpers to access memory hotplug end Wei Chen
                   ` (28 subsequent siblings)
  36 siblings, 2 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

One NUMA node may contain several memory blocks. In the current Xen
code, Xen maintains a single memory range per node to cover all of
that node's memory blocks. The problem is that, if the gap between
two of a node's memory blocks contains blocks that belong to other
nodes (remote memory blocks), the node's range is expanded to cover
those remote blocks as well.

Having one node's memory range contain other nodes' memory is
obviously unreasonable. It means the current NUMA code can only
support nodes with contiguous memory blocks. However, on a physical
machine, the address ranges of multiple nodes can be interleaved.

So in this patch, we add code to detect discontiguous memory blocks
for a node. NUMA initialization will fail and error messages will be
printed when Xen detects such a hardware configuration.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/x86/srat.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 7d20d7f222..2f08fa4660 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -271,6 +271,36 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
 		       pxm, pa->apic_id, node);
 }
 
+/*
+ * Check to see if there are other nodes within this node's range.
+ * We just need to check full contains situation. Because overlaps
+ * have been checked before by conflicting_memblks.
+ */
+static bool __init is_node_memory_continuous(nodeid_t nid,
+    paddr_t start, paddr_t end)
+{
+	nodeid_t i;
+
+	struct node *nd = &nodes[nid];
+	for_each_node_mask(i, memory_nodes_parsed)
+	{
+		/* Skip itself */
+		if (i == nid)
+			continue;
+
+		nd = &nodes[i];
+		if (start < nd->start && nd->end < end)
+		{
+			printk(KERN_ERR
+			       "NODE %u: (%"PRIpaddr"-%"PRIpaddr") intertwine with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",
+			       nid, start, end, i, nd->start, nd->end);
+			return false;
+		}
+	}
+
+	return true;
+}
+
 /* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
 void __init
 acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
@@ -344,6 +374,12 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 				nd->start = start;
 			if (nd->end < end)
 				nd->end = end;
+
+			/* Check whether this range contains memory for other nodes */
+			if (!is_node_memory_continuous(node, nd->start, nd->end)) {
+				bad_srat();
+				return;
+			}
 		}
 	}
 	printk(KERN_INFO "SRAT: Node %u PXM %u %"PRIpaddr"-%"PRIpaddr"%s\n",
-- 
2.25.1




* [PATCH 09/37] xen/x86: introduce two helpers to access memory hotplug end
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (7 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 08/37] xen/x86: add detection of discontinous node memory range Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  0:29   ` Stefano Stabellini
  2022-01-24 16:24   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 10/37] xen/x86: use helpers to access/update mem_hotplug Wei Chen
                   ` (27 subsequent siblings)
  36 siblings, 2 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

x86 provides a mem_hotplug variable to maintain the memory hotplug
end address, and this variable is accessed outside of mm.c. We want
some of that code to be reusable by architectures without memory
hotplug ability. So in this patch, we introduce two helpers to
replace direct access to mem_hotplug. This gives such architectures
the ability to stub these two APIs.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/include/asm-x86/mm.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index cb90527499..af2fc4b0cd 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -475,6 +475,16 @@ static inline int get_page_and_type(struct page_info *page,
 
 extern paddr_t mem_hotplug;
 
+static inline void mem_hotplug_update_boundary(paddr_t end)
+{
+    mem_hotplug = end;
+}
+
+static inline paddr_t mem_hotplug_boundary(void)
+{
+    return mem_hotplug;
+}
+
 /******************************************************************************
  * With shadow pagetables, the different kinds of address start
  * to get get confusing.
-- 
2.25.1




* [PATCH 10/37] xen/x86: use helpers to access/update mem_hotplug
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (8 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 09/37] xen/x86: introduce two helpers to access memory hotplug end Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  0:31   ` Stefano Stabellini
  2022-01-24 16:29   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 11/37] xen/x86: abstract neutral code from acpi_numa_memory_affinity_init Wei Chen
                   ` (26 subsequent siblings)
  36 siblings, 2 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

We want to abstract code from acpi_numa_memory_affinity_init, but
mem_hotplug is coupled with x86. In this patch, we use the helpers
to replace direct access to mem_hotplug. This will allow most of
the code to become common.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/x86/srat.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 2f08fa4660..3334ede7a5 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -391,8 +391,8 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 	memblk_nodeid[num_node_memblks] = node;
 	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
 		__set_bit(num_node_memblks, memblk_hotplug);
-		if (end > mem_hotplug)
-			mem_hotplug = end;
+		if (end > mem_hotplug_boundary())
+			mem_hotplug_update_boundary(end);
 	}
 	num_node_memblks++;
 }
-- 
2.25.1




* [PATCH 11/37] xen/x86: abstract neutral code from acpi_numa_memory_affinity_init
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (9 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 10/37] xen/x86: use helpers to access/update mem_hotplug Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  0:38   ` Stefano Stabellini
  2022-01-24 16:50   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 12/37] xen/x86: decouple nodes_cover_memory from E820 map Wei Chen
                   ` (25 subsequent siblings)
  36 siblings, 2 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

There is some code in acpi_numa_memory_affinity_init that updates
the node memory range and the node_memblk_range array. This code is
not ACPI specific; it can be shared by other NUMA implementations,
such as a device tree based one.

So in this patch, we abstract the memory range and block handling
code into a new function. This avoids exporting static variables
like node_memblk_range. The PXM in the neutral code's print
messages has been replaced by NODE, as PXM is ACPI specific.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/x86/srat.c        | 131 +++++++++++++++++++++----------------
 xen/include/asm-x86/numa.h |   3 +
 2 files changed, 77 insertions(+), 57 deletions(-)

diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 3334ede7a5..18bc6b19bb 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -104,6 +104,14 @@ nodeid_t setup_node(unsigned pxm)
 	return node;
 }
 
+bool __init numa_memblks_available(void)
+{
+	if (num_node_memblks < NR_NODE_MEMBLKS)
+		return true;
+
+	return false;
+}
+
 int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node)
 {
 	int i;
@@ -301,69 +309,35 @@ static bool __init is_node_memory_continuous(nodeid_t nid,
 	return true;
 }
 
-/* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
-void __init
-acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
+/* Neutral NUMA memory affinity init function for ACPI and DT */
+int __init numa_update_node_memblks(nodeid_t node,
+		paddr_t start, paddr_t size, bool hotplug)
 {
-	paddr_t start, end;
-	unsigned pxm;
-	nodeid_t node;
+	paddr_t end = start + size;
 	int i;
 
-	if (srat_disabled())
-		return;
-	if (ma->header.length != sizeof(struct acpi_srat_mem_affinity)) {
-		bad_srat();
-		return;
-	}
-	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
-		return;
-
-	start = ma->base_address;
-	end = start + ma->length;
-	/* Supplement the heuristics in l1tf_calculations(). */
-	l1tf_safe_maddr = max(l1tf_safe_maddr, ROUNDUP(end, PAGE_SIZE));
-
-	if (num_node_memblks >= NR_NODE_MEMBLKS)
-	{
-		dprintk(XENLOG_WARNING,
-                "Too many numa entry, try bigger NR_NODE_MEMBLKS \n");
-		bad_srat();
-		return;
-	}
-
-	pxm = ma->proximity_domain;
-	if (srat_rev < 2)
-		pxm &= 0xff;
-	node = setup_node(pxm);
-	if (node == NUMA_NO_NODE) {
-		bad_srat();
-		return;
-	}
-	/* It is fine to add this area to the nodes data it will be used later*/
+	/* It is fine to add this area to the nodes data it will be used later */
 	i = conflicting_memblks(start, end);
 	if (i < 0)
 		/* everything fine */;
 	else if (memblk_nodeid[i] == node) {
-		bool mismatch = !(ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) !=
-		                !test_bit(i, memblk_hotplug);
+		bool mismatch = !hotplug != !test_bit(i, memblk_hotplug);
 
-		printk("%sSRAT: PXM %u (%"PRIpaddr"-%"PRIpaddr") overlaps with itself (%"PRIpaddr"-%"PRIpaddr")\n",
-		       mismatch ? KERN_ERR : KERN_WARNING, pxm, start, end,
+		printk("%sSRAT: NODE %u (%"PRIpaddr"-%"PRIpaddr") overlaps with itself (%"PRIpaddr"-%"PRIpaddr")\n",
+		       mismatch ? KERN_ERR : KERN_WARNING, node, start, end,
 		       node_memblk_range[i].start, node_memblk_range[i].end);
 		if (mismatch) {
-			bad_srat();
-			return;
+			return -1;
 		}
 	} else {
 		printk(KERN_ERR
-		       "SRAT: PXM %u (%"PRIpaddr"-%"PRIpaddr") overlaps with PXM %u (%"PRIpaddr"-%"PRIpaddr")\n",
-		       pxm, start, end, node_to_pxm(memblk_nodeid[i]),
+		       "SRAT: NODE %u (%"PRIpaddr"-%"PRIpaddr") overlaps with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",
+		       node, start, end, memblk_nodeid[i],
 		       node_memblk_range[i].start, node_memblk_range[i].end);
-		bad_srat();
-		return;
+		return -1;
 	}
-	if (!(ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE)) {
+
+	if (!hotplug) {
 		struct node *nd = &nodes[node];
 
 		if (!node_test_and_set(node, memory_nodes_parsed)) {
@@ -375,26 +349,69 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 			if (nd->end < end)
 				nd->end = end;
 
-			/* Check whether this range contains memory for other nodes */
-			if (!is_node_memory_continuous(node, nd->start, nd->end)) {
-				bad_srat();
-				return;
-			}
+			if (!is_node_memory_continuous(node, nd->start, nd->end))
+				return -1;
 		}
 	}
-	printk(KERN_INFO "SRAT: Node %u PXM %u %"PRIpaddr"-%"PRIpaddr"%s\n",
-	       node, pxm, start, end,
-	       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
+
+	printk(KERN_INFO "SRAT: Node %u %"PRIpaddr"-%"PRIpaddr"%s\n",
+	       node, start, end, hotplug ? " (hotplug)" : "");
 
 	node_memblk_range[num_node_memblks].start = start;
 	node_memblk_range[num_node_memblks].end = end;
 	memblk_nodeid[num_node_memblks] = node;
-	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
+	if (hotplug) {
 		__set_bit(num_node_memblks, memblk_hotplug);
 		if (end > mem_hotplug_boundary())
 			mem_hotplug_update_boundary(end);
 	}
 	num_node_memblks++;
+
+	return 0;
+}
+
+/* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
+void __init
+acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
+{
+	unsigned pxm;
+	nodeid_t node;
+	int ret;
+
+	if (srat_disabled())
+		return;
+	if (ma->header.length != sizeof(struct acpi_srat_mem_affinity)) {
+		bad_srat();
+		return;
+	}
+	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
+		return;
+
+	/* Supplement the heuristics in l1tf_calculations(). */
+	l1tf_safe_maddr = max(l1tf_safe_maddr,
+			ROUNDUP((ma->base_address + ma->length), PAGE_SIZE));
+
+	if (!numa_memblks_available())
+	{
+		dprintk(XENLOG_WARNING,
+                "Too many numa entry, try bigger NR_NODE_MEMBLKS \n");
+		bad_srat();
+		return;
+	}
+
+	pxm = ma->proximity_domain;
+	if (srat_rev < 2)
+		pxm &= 0xff;
+	node = setup_node(pxm);
+	if (node == NUMA_NO_NODE) {
+		bad_srat();
+		return;
+	}
+
+	ret = numa_update_node_memblks(node, ma->base_address, ma->length,
+					ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE);
+	if (ret != 0)
+		bad_srat();
 }
 
 /* Sanity check to catch more bad SRATs (they are amazingly common).
diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
index 50cfd8e7ef..5772a70665 100644
--- a/xen/include/asm-x86/numa.h
+++ b/xen/include/asm-x86/numa.h
@@ -74,6 +74,9 @@ static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
 				 NODE_DATA(nid)->node_spanned_pages)
 
 extern int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node);
+extern bool numa_memblks_available(void);
+extern int numa_update_node_memblks(nodeid_t node,
+		paddr_t start, paddr_t size, bool hotplug);
 
 void srat_parse_regions(paddr_t addr);
 extern u8 __node_distance(nodeid_t a, nodeid_t b);
-- 
2.25.1




* [PATCH 12/37] xen/x86: decouple nodes_cover_memory from E820 map
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (10 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 11/37] xen/x86: abstract neutral code from acpi_numa_memory_affinity_init Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  0:39   ` Stefano Stabellini
  2022-01-24 16:59   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 13/37] xen/x86: decouple processor_nodes_parsed from acpi numa functions Wei Chen
                   ` (24 subsequent siblings)
  36 siblings, 2 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

We will reuse nodes_cover_memory for Arm to check its bootmem
info. So we introduce two arch helpers to get the memory map's
entry count and a specified entry's range:
    arch_meminfo_get_nr_bank
    arch_meminfo_get_ram_bank_range

Depending on these two helpers, nodes_cover_memory becomes
architecture independent. The only change from an x86 perspective
is the additional checks:
  !start || !end

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/x86/numa.c        | 18 ++++++++++++++++++
 xen/arch/x86/srat.c        | 11 ++++-------
 xen/include/asm-x86/numa.h |  3 +++
 3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 6337bbdf31..6bc4ade411 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -378,6 +378,24 @@ unsigned int arch_have_default_dmazone(void)
     return ( num_online_nodes() > 1 ) ? 1 : 0;
 }
 
+uint32_t __init arch_meminfo_get_nr_bank(void)
+{
+	return e820.nr_map;
+}
+
+int __init arch_meminfo_get_ram_bank_range(uint32_t bank,
+	paddr_t *start, paddr_t *end)
+{
+	if (e820.map[bank].type != E820_RAM || !start || !end) {
+		return -1;
+	}
+
+	*start = e820.map[bank].addr;
+	*end = e820.map[bank].addr + e820.map[bank].size;
+
+	return 0;
+}
+
 static void dump_numa(unsigned char key)
 {
     s_time_t now = NOW();
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 18bc6b19bb..aa07a7e975 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -419,17 +419,14 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 static int __init nodes_cover_memory(void)
 {
 	int i;
+	uint32_t nr_banks = arch_meminfo_get_nr_bank();
 
-	for (i = 0; i < e820.nr_map; i++) {
+	for (i = 0; i < nr_banks; i++) {
 		int j, found;
 		paddr_t start, end;
 
-		if (e820.map[i].type != E820_RAM) {
+		if (arch_meminfo_get_ram_bank_range(i, &start, &end))
 			continue;
-		}
-
-		start = e820.map[i].addr;
-		end = e820.map[i].addr + e820.map[i].size;
 
 		do {
 			found = 0;
@@ -448,7 +445,7 @@ static int __init nodes_cover_memory(void)
 		} while (found && start < end);
 
 		if (start < end) {
-			printk(KERN_ERR "SRAT: No PXM for e820 range: "
+			printk(KERN_ERR "SRAT: No NODE for memory map range: "
 				"%"PRIpaddr" - %"PRIpaddr"\n", start, end);
 			return 0;
 		}
diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
index 5772a70665..78e044a390 100644
--- a/xen/include/asm-x86/numa.h
+++ b/xen/include/asm-x86/numa.h
@@ -82,5 +82,8 @@ void srat_parse_regions(paddr_t addr);
 extern u8 __node_distance(nodeid_t a, nodeid_t b);
 unsigned int arch_get_dma_bitsize(void);
 unsigned int arch_have_default_dmazone(void);
+extern uint32_t arch_meminfo_get_nr_bank(void);
+extern int arch_meminfo_get_ram_bank_range(uint32_t bank,
+    paddr_t *start, paddr_t *end);
 
 #endif
-- 
2.25.1




* [PATCH 13/37] xen/x86: decouple processor_nodes_parsed from acpi numa functions
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (11 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 12/37] xen/x86: decouple nodes_cover_memory from E820 map Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  0:40   ` Stefano Stabellini
  2022-01-25  9:49   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 14/37] xen/x86: use name fw_numa to replace acpi_numa Wei Chen
                   ` (23 subsequent siblings)
  36 siblings, 2 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

Xen uses processor_nodes_parsed to record processor nodes parsed
from the ACPI table or another firmware-provided resource table.
This variable is used directly in the ACPI NUMA functions. In
follow-up patches, neutral NUMA code will be abstracted and moved
to other files. So in this patch, we introduce the
numa_set_processor_nodes_parsed helper to decouple
processor_nodes_parsed from the ACPI NUMA functions.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/x86/srat.c        | 9 +++++++--
 xen/include/asm-x86/numa.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index aa07a7e975..9276a52138 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -104,6 +104,11 @@ nodeid_t setup_node(unsigned pxm)
 	return node;
 }
 
+void  __init numa_set_processor_nodes_parsed(nodeid_t node)
+{
+	node_set(node, processor_nodes_parsed);
+}
+
 bool __init numa_memblks_available(void)
 {
 	if (num_node_memblks < NR_NODE_MEMBLKS)
@@ -236,7 +241,7 @@ acpi_numa_x2apic_affinity_init(const struct acpi_srat_x2apic_cpu_affinity *pa)
 	}
 
 	apicid_to_node[pa->apic_id] = node;
-	node_set(node, processor_nodes_parsed);
+	numa_set_processor_nodes_parsed(node);
 	acpi_numa = 1;
 
 	if (opt_acpi_verbose)
@@ -271,7 +276,7 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
 		return;
 	}
 	apicid_to_node[pa->apic_id] = node;
-	node_set(node, processor_nodes_parsed);
+	numa_set_processor_nodes_parsed(node);
 	acpi_numa = 1;
 
 	if (opt_acpi_verbose)
diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
index 78e044a390..295f875a51 100644
--- a/xen/include/asm-x86/numa.h
+++ b/xen/include/asm-x86/numa.h
@@ -77,6 +77,7 @@ extern int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node);
 extern bool numa_memblks_available(void);
 extern int numa_update_node_memblks(nodeid_t node,
 		paddr_t start, paddr_t size, bool hotplug);
+extern void numa_set_processor_nodes_parsed(nodeid_t node);
 
 void srat_parse_regions(paddr_t addr);
 extern u8 __node_distance(nodeid_t a, nodeid_t b);
-- 
2.25.1




* [PATCH 14/37] xen/x86: use name fw_numa to replace acpi_numa
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (12 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 13/37] xen/x86: decouple processor_nodes_parsed from acpi numa functions Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  0:40   ` Stefano Stabellini
  2022-01-25 10:12   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 15/37] xen/x86: rename acpi_scan_nodes to numa_scan_nodes Wei Chen
                   ` (22 subsequent siblings)
  36 siblings, 2 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

Xen uses acpi_numa as a switch for ACPI based NUMA. We want to use
this switch logic for other firmware based NUMA implementations,
like the device tree based NUMA introduced in follow-up patches.
Xen will never use both ACPI and device tree based NUMA at runtime,
so we rename acpi_numa to the more generic name fw_numa. This will
also allow the code to be mostly common.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/x86/numa.c        |  6 +++---
 xen/arch/x86/setup.c       |  2 +-
 xen/arch/x86/srat.c        | 10 +++++-----
 xen/include/asm-x86/acpi.h |  2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 6bc4ade411..2ef385ae3f 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -51,11 +51,11 @@ cpumask_t node_to_cpumask[MAX_NUMNODES] __read_mostly;
 nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
 
 bool numa_off;
-s8 acpi_numa = 0;
+s8 fw_numa = 0;
 
 int srat_disabled(void)
 {
-    return numa_off || acpi_numa < 0;
+    return numa_off || fw_numa < 0;
 }
 
 /*
@@ -315,7 +315,7 @@ static __init int numa_setup(const char *opt)
     else if ( !strncmp(opt,"noacpi",6) )
     {
         numa_off = false;
-        acpi_numa = -1;
+        fw_numa = -1;
     }
 #endif
     else
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index b101565f14..1a2093b554 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -313,7 +313,7 @@ void srat_detect_node(int cpu)
     node_set_online(node);
     numa_set_node(cpu, node);
 
-    if ( opt_cpu_info && acpi_numa > 0 )
+    if ( opt_cpu_info && fw_numa > 0 )
         printk("CPU %d APIC %d -> Node %d\n", cpu, apicid, node);
 }
 
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 9276a52138..4921830f94 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -167,7 +167,7 @@ static __init void bad_srat(void)
 {
 	int i;
 	printk(KERN_ERR "SRAT: SRAT not used.\n");
-	acpi_numa = -1;
+	fw_numa = -1;
 	for (i = 0; i < MAX_LOCAL_APIC; i++)
 		apicid_to_node[i] = NUMA_NO_NODE;
 	for (i = 0; i < ARRAY_SIZE(pxm2node); i++)
@@ -242,7 +242,7 @@ acpi_numa_x2apic_affinity_init(const struct acpi_srat_x2apic_cpu_affinity *pa)
 
 	apicid_to_node[pa->apic_id] = node;
 	numa_set_processor_nodes_parsed(node);
-	acpi_numa = 1;
+	fw_numa = 1;
 
 	if (opt_acpi_verbose)
 		printk(KERN_INFO "SRAT: PXM %u -> APIC %08x -> Node %u\n",
@@ -277,7 +277,7 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
 	}
 	apicid_to_node[pa->apic_id] = node;
 	numa_set_processor_nodes_parsed(node);
-	acpi_numa = 1;
+	fw_numa = 1;
 
 	if (opt_acpi_verbose)
 		printk(KERN_INFO "SRAT: PXM %u -> APIC %02x -> Node %u\n",
@@ -492,7 +492,7 @@ void __init srat_parse_regions(paddr_t addr)
 	u64 mask;
 	unsigned int i;
 
-	if (acpi_disabled || acpi_numa < 0 ||
+	if (acpi_disabled || fw_numa < 0 ||
 	    acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat))
 		return;
 
@@ -521,7 +521,7 @@ int __init acpi_scan_nodes(paddr_t start, paddr_t end)
 	for (i = 0; i < MAX_NUMNODES; i++)
 		cutoff_node(i, start, end);
 
-	if (acpi_numa <= 0)
+	if (fw_numa <= 0)
 		return -1;
 
 	if (!nodes_cover_memory()) {
diff --git a/xen/include/asm-x86/acpi.h b/xen/include/asm-x86/acpi.h
index 7032f3a001..83be71fec3 100644
--- a/xen/include/asm-x86/acpi.h
+++ b/xen/include/asm-x86/acpi.h
@@ -101,7 +101,7 @@ extern unsigned long acpi_wakeup_address;
 
 #define ARCH_HAS_POWER_INIT	1
 
-extern s8 acpi_numa;
+extern s8 fw_numa;
 extern int acpi_scan_nodes(u64 start, u64 end);
 #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
 
-- 
2.25.1




* [PATCH 15/37] xen/x86: rename acpi_scan_nodes to numa_scan_nodes
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (13 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 14/37] xen/x86: use name fw_numa to replace acpi_numa Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  0:40   ` Stefano Stabellini
  2022-01-25 10:17   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 16/37] xen/x86: export srat_bad to external Wei Chen
                   ` (21 subsequent siblings)
  36 siblings, 2 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

Most of the code in acpi_scan_nodes can be reused by other NUMA
implementations. Rename acpi_scan_nodes to the more generic name
numa_scan_nodes, and replace BIOS with Firmware in the print
message, as BIOS is an x86-specific name.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/x86/numa.c        | 2 +-
 xen/arch/x86/srat.c        | 4 ++--
 xen/include/asm-x86/acpi.h | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 2ef385ae3f..8a4710df39 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -261,7 +261,7 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
     end = pfn_to_paddr(end_pfn);
 
 #ifdef CONFIG_ACPI_NUMA
-    if ( !numa_off && !acpi_scan_nodes(start, end) )
+    if ( !numa_off && !numa_scan_nodes(start, end) )
         return;
 #endif
 
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 4921830f94..0b8b0b0c95 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -512,7 +512,7 @@ void __init srat_parse_regions(paddr_t addr)
 }
 
 /* Use the information discovered above to actually set up the nodes. */
-int __init acpi_scan_nodes(paddr_t start, paddr_t end)
+int __init numa_scan_nodes(paddr_t start, paddr_t end)
 {
 	int i;
 	nodemask_t all_nodes_parsed;
@@ -547,7 +547,7 @@ int __init acpi_scan_nodes(paddr_t start, paddr_t end)
 		paddr_t size = nodes[i].end - nodes[i].start;
 		if ( size == 0 )
 			printk(KERN_WARNING "SRAT: Node %u has no memory. "
-			       "BIOS Bug or mis-configured hardware?\n", i);
+			       "Firmware Bug or mis-configured hardware?\n", i);
 
 		setup_node_bootmem(i, nodes[i].start, nodes[i].end);
 	}
diff --git a/xen/include/asm-x86/acpi.h b/xen/include/asm-x86/acpi.h
index 83be71fec3..2add971072 100644
--- a/xen/include/asm-x86/acpi.h
+++ b/xen/include/asm-x86/acpi.h
@@ -102,7 +102,7 @@ extern unsigned long acpi_wakeup_address;
 #define ARCH_HAS_POWER_INIT	1
 
 extern s8 fw_numa;
-extern int acpi_scan_nodes(u64 start, u64 end);
+extern int numa_scan_nodes(u64 start, u64 end);
 #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
 
 extern struct acpi_sleep_info acpi_sinfo;
-- 
2.25.1




* [PATCH 16/37] xen/x86: export srat_bad to external
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (14 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 15/37] xen/x86: rename acpi_scan_nodes to numa_scan_nodes Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  0:41   ` Stefano Stabellini
  2022-01-25 10:22   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 17/37] xen/x86: use CONFIG_NUMA to gate numa_scan_nodes Wei Chen
                   ` (20 subsequent siblings)
  36 siblings, 2 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

bad_srat is used when the NUMA initialization code fails to scan
the SRAT. It turns fw_numa to disabled status. Its implementation
depends on the NUMA implementation, and we want every NUMA
implementation to provide this function for the common
initialization code.

In this patch, we export bad_srat. This will allow the code to be
mostly common.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/x86/srat.c        | 2 +-
 xen/include/asm-x86/numa.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 0b8b0b0c95..94bd5b34da 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -163,7 +163,7 @@ static __init void cutoff_node(int i, paddr_t start, paddr_t end)
 	}
 }
 
-static __init void bad_srat(void)
+__init void bad_srat(void)
 {
 	int i;
 	printk(KERN_ERR "SRAT: SRAT not used.\n");
diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
index 295f875a51..a5690a7098 100644
--- a/xen/include/asm-x86/numa.h
+++ b/xen/include/asm-x86/numa.h
@@ -32,6 +32,7 @@ extern bool numa_off;
 
 
 extern int srat_disabled(void);
+extern void bad_srat(void);
 extern void numa_set_node(int cpu, nodeid_t node);
 extern nodeid_t setup_node(unsigned int pxm);
 extern void srat_detect_node(int cpu);
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 17/37] xen/x86: use CONFIG_NUMA to gate numa_scan_nodes
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (15 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 16/37] xen/x86: export srat_bad to external Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  0:41   ` Stefano Stabellini
  2022-01-25 10:26   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 18/37] xen: move NUMA common code from x86 to common Wei Chen
                   ` (19 subsequent siblings)
  36 siblings, 2 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

Now that numa_scan_nodes has become a neutral function, it no longer
makes sense to use CONFIG_ACPI_NUMA to gate it in numa_initmem_init.
As CONFIG_ACPI_NUMA is selected by CONFIG_NUMA on x86, this patch
replaces CONFIG_ACPI_NUMA with CONFIG_NUMA.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/x86/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 8a4710df39..509d2738c0 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -260,7 +260,7 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
     start = pfn_to_paddr(start_pfn);
     end = pfn_to_paddr(end_pfn);
 
-#ifdef CONFIG_ACPI_NUMA
+#ifdef CONFIG_NUMA
     if ( !numa_off && !numa_scan_nodes(start, end) )
         return;
 #endif
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 18/37] xen: move NUMA common code from x86 to common
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (16 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 17/37] xen/x86: use CONFIG_NUMA to gate numa_scan_nodes Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-23 12:02 ` [PATCH 19/37] xen/x86: promote VIRTUAL_BUG_ON to ASSERT in Wei Chen
                   ` (18 subsequent siblings)
  36 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

Some common code has been decoupled and abstracted from the x86 ACPI
based NUMA implementation. To make this code reusable by other NUMA
implementations, we move it from x86 to the common folder. The code is
gated by CONFIG_NUMA, so it can only be used by architectures with NUMA
enabled. Architectures that do not support NUMA can still implement
stub NUMA APIs in asm/numa.h to keep NUMA-aware components happy.

In this patch, we also remove some unused include headers.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/x86/numa.c         | 446 +----------------------------------
 xen/arch/x86/srat.c         | 253 +-------------------
 xen/common/Makefile         |   2 +
 xen/common/numa.c           | 450 ++++++++++++++++++++++++++++++++++++
 xen/common/numa_srat.c      | 264 +++++++++++++++++++++
 xen/include/asm-x86/acpi.h  |   4 -
 xen/include/asm-x86/numa.h  |  68 ------
 xen/include/asm-x86/setup.h |   1 -
 xen/include/xen/numa.h      |  82 +++++++
 9 files changed, 802 insertions(+), 768 deletions(-)
 create mode 100644 xen/common/numa.c
 create mode 100644 xen/common/numa_srat.c

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 509d2738c0..92b6bdf7b9 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -3,24 +3,13 @@
  * Copyright 2002,2003 Andi Kleen, SuSE Labs.
  * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com>
  */ 
-
-#include <xen/mm.h>
-#include <xen/string.h>
 #include <xen/init.h>
-#include <xen/ctype.h>
+#include <xen/mm.h>
 #include <xen/nodemask.h>
 #include <xen/numa.h>
-#include <xen/keyhandler.h>
-#include <xen/param.h>
-#include <xen/time.h>
-#include <xen/smp.h>
-#include <xen/pfn.h>
-#include <asm/acpi.h>
 #include <xen/sched.h>
-#include <xen/softirq.h>
 
-static int numa_setup(const char *s);
-custom_param("numa", numa_setup);
+#include <asm/acpi.h>
 
 #ifndef Dprintk
 #define Dprintk(x...)
@@ -29,300 +18,12 @@ custom_param("numa", numa_setup);
 /* from proto.h */
 #define round_up(x,y) ((((x)+(y))-1) & (~((y)-1)))
 
-struct node_data node_data[MAX_NUMNODES];
-
-/* Mapping from pdx to node id */
-int memnode_shift;
-static typeof(*memnodemap) _memnodemap[64];
-unsigned long memnodemapsize;
-u8 *memnodemap;
-
-nodeid_t cpu_to_node[NR_CPUS] __read_mostly = {
-    [0 ... NR_CPUS-1] = NUMA_NO_NODE
-};
 /*
  * Keep BIOS's CPU2node information, should not be used for memory allocaion
  */
 nodeid_t apicid_to_node[MAX_LOCAL_APIC] = {
     [0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
 };
-cpumask_t node_to_cpumask[MAX_NUMNODES] __read_mostly;
-
-nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
-
-bool numa_off;
-s8 fw_numa = 0;
-
-int srat_disabled(void)
-{
-    return numa_off || fw_numa < 0;
-}
-
-/*
- * Given a shift value, try to populate memnodemap[]
- * Returns :
- * 1 if OK
- * 0 if memnodmap[] too small (of shift too small)
- * -1 if node overlap or lost ram (shift too big)
- */
-static int __init populate_memnodemap(const struct node *nodes,
-                                      int numnodes, int shift, nodeid_t *nodeids)
-{
-    unsigned long spdx, epdx;
-    int i, res = -1;
-
-    memset(memnodemap, NUMA_NO_NODE, memnodemapsize * sizeof(*memnodemap));
-    for ( i = 0; i < numnodes; i++ )
-    {
-        spdx = paddr_to_pdx(nodes[i].start);
-        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
-        if ( spdx >= epdx )
-            continue;
-        if ( (epdx >> shift) >= memnodemapsize )
-            return 0;
-        do {
-            if ( memnodemap[spdx >> shift] != NUMA_NO_NODE )
-                return -1;
-
-            if ( !nodeids )
-                memnodemap[spdx >> shift] = i;
-            else
-                memnodemap[spdx >> shift] = nodeids[i];
-
-            spdx += (1UL << shift);
-        } while ( spdx < epdx );
-        res = 1;
-    }
-
-    return res;
-}
-
-static int __init allocate_cachealigned_memnodemap(void)
-{
-    unsigned long size = PFN_UP(memnodemapsize * sizeof(*memnodemap));
-    unsigned long mfn = mfn_x(alloc_boot_pages(size, 1));
-
-    memnodemap = mfn_to_virt(mfn);
-    mfn <<= PAGE_SHIFT;
-    size <<= PAGE_SHIFT;
-    printk(KERN_DEBUG "NUMA: Allocated memnodemap from %lx - %lx\n",
-           mfn, mfn + size);
-    memnodemapsize = size / sizeof(*memnodemap);
-
-    return 0;
-}
-
-/*
- * The LSB of all start and end addresses in the node map is the value of the
- * maximum possible shift.
- */
-static int __init extract_lsb_from_nodes(const struct node *nodes,
-                                         int numnodes)
-{
-    int i, nodes_used = 0;
-    unsigned long spdx, epdx;
-    unsigned long bitfield = 0, memtop = 0;
-
-    for ( i = 0; i < numnodes; i++ )
-    {
-        spdx = paddr_to_pdx(nodes[i].start);
-        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
-        if ( spdx >= epdx )
-            continue;
-        bitfield |= spdx;
-        nodes_used++;
-        if ( epdx > memtop )
-            memtop = epdx;
-    }
-    if ( nodes_used <= 1 )
-        i = BITS_PER_LONG - 1;
-    else
-        i = find_first_bit(&bitfield, sizeof(unsigned long)*8);
-    memnodemapsize = (memtop >> i) + 1;
-    return i;
-}
-
-int __init compute_hash_shift(struct node *nodes, int numnodes,
-                              nodeid_t *nodeids)
-{
-    int shift;
-
-    shift = extract_lsb_from_nodes(nodes, numnodes);
-    if ( memnodemapsize <= ARRAY_SIZE(_memnodemap) )
-        memnodemap = _memnodemap;
-    else if ( allocate_cachealigned_memnodemap() )
-        return -1;
-    printk(KERN_DEBUG "NUMA: Using %d for the hash shift.\n", shift);
-
-    if ( populate_memnodemap(nodes, numnodes, shift, nodeids) != 1 )
-    {
-        printk(KERN_INFO "Your memory is not aligned you need to "
-               "rebuild your hypervisor with a bigger NODEMAPSIZE "
-               "shift=%d\n", shift);
-        return -1;
-    }
-
-    return shift;
-}
-/* initialize NODE_DATA given nodeid and start/end */
-void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end)
-{
-    unsigned long start_pfn, end_pfn;
-
-    start_pfn = paddr_to_pfn(start);
-    end_pfn = paddr_to_pfn(end);
-
-    NODE_DATA(nodeid)->node_start_pfn = start_pfn;
-    NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
-
-    node_set_online(nodeid);
-} 
-
-void __init numa_init_array(void)
-{
-    int rr, i;
-
-    /* There are unfortunately some poorly designed mainboards around
-       that only connect memory to a single CPU. This breaks the 1:1 cpu->node
-       mapping. To avoid this fill in the mapping for all possible
-       CPUs, as the number of CPUs is not known yet.
-       We round robin the existing nodes. */
-    rr = first_node(node_online_map);
-    for ( i = 0; i < nr_cpu_ids; i++ )
-    {
-        if ( cpu_to_node[i] != NUMA_NO_NODE )
-            continue;
-        numa_set_node(i, rr);
-        rr = cycle_node(rr, node_online_map);
-    }
-}
-
-#ifdef CONFIG_NUMA_EMU
-static int numa_fake __initdata = 0;
-
-/* Numa emulation */
-static int __init numa_emulation(unsigned long start_pfn,
-                                 unsigned long end_pfn)
-{
-    int i;
-    struct node nodes[MAX_NUMNODES];
-    u64 sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake;
-
-    /* Kludge needed for the hash function */
-    if ( hweight64(sz) > 1 )
-    {
-        u64 x = 1;
-        while ( (x << 1) < sz )
-            x <<= 1;
-        if ( x < sz/2 )
-            printk(KERN_ERR "Numa emulation unbalanced. Complain to maintainer\n");
-        sz = x;
-    }
-
-    memset(&nodes,0,sizeof(nodes));
-    for ( i = 0; i < numa_fake; i++ )
-    {
-        nodes[i].start = pfn_to_paddr(start_pfn) + i*sz;
-        if ( i == numa_fake - 1 )
-            sz = pfn_to_paddr(end_pfn) - nodes[i].start;
-        nodes[i].end = nodes[i].start + sz;
-        printk(KERN_INFO "Faking node %d at %"PRIx64"-%"PRIx64" (%"PRIu64"MB)\n",
-               i,
-               nodes[i].start, nodes[i].end,
-               (nodes[i].end - nodes[i].start) >> 20);
-        node_set_online(i);
-    }
-    memnode_shift = compute_hash_shift(nodes, numa_fake, NULL);
-    if ( memnode_shift < 0 )
-    {
-        memnode_shift = 0;
-        printk(KERN_ERR "No NUMA hash function found. Emulation disabled.\n");
-        return -1;
-    }
-    for_each_online_node ( i )
-        setup_node_bootmem(i, nodes[i].start, nodes[i].end);
-    numa_init_array();
-
-    return 0;
-}
-#endif
-
-void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
-{ 
-    int i;
-    paddr_t start, end;
-
-#ifdef CONFIG_NUMA_EMU
-    if ( numa_fake && !numa_emulation(start_pfn, end_pfn) )
-        return;
-#endif
-
-    start = pfn_to_paddr(start_pfn);
-    end = pfn_to_paddr(end_pfn);
-
-#ifdef CONFIG_NUMA
-    if ( !numa_off && !numa_scan_nodes(start, end) )
-        return;
-#endif
-
-    printk(KERN_INFO "%s\n",
-           numa_off ? "NUMA turned off" : "No NUMA configuration found");
-
-    printk(KERN_INFO "Faking a node at %016"PRIpaddr"-%016"PRIpaddr"\n",
-           start, end);
-    /* setup dummy node covering all memory */
-    memnode_shift = BITS_PER_LONG - 1;
-    memnodemap = _memnodemap;
-    /* Dummy node only uses 1 slot in reality */
-    memnodemap[0] = 0;
-    memnodemapsize = 1;
-
-    nodes_clear(node_online_map);
-    node_set_online(0);
-    for ( i = 0; i < nr_cpu_ids; i++ )
-        numa_set_node(i, 0);
-    cpumask_copy(&node_to_cpumask[0], cpumask_of(0));
-    setup_node_bootmem(0, start, end);
-}
-
-void numa_add_cpu(int cpu)
-{
-    cpumask_set_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]);
-} 
-
-void numa_set_node(int cpu, nodeid_t node)
-{
-    cpu_to_node[cpu] = node;
-}
-
-/* [numa=off] */
-static __init int numa_setup(const char *opt)
-{
-    if ( !strncmp(opt,"off",3) )
-        numa_off = true;
-    else if ( !strncmp(opt,"on",2) )
-        numa_off = false;
-#ifdef CONFIG_NUMA_EMU
-    else if ( !strncmp(opt, "fake=", 5) )
-    {
-        numa_off = false;
-        numa_fake = simple_strtoul(opt+5,NULL,0);
-        if ( numa_fake >= MAX_NUMNODES )
-            numa_fake = MAX_NUMNODES;
-    }
-#endif
-#ifdef CONFIG_ACPI_NUMA
-    else if ( !strncmp(opt,"noacpi",6) )
-    {
-        numa_off = false;
-        fw_numa = -1;
-    }
-#endif
-    else
-        return -EINVAL;
-
-    return 0;
-} 
 
 /*
  * Setup early cpu_to_node.
@@ -395,146 +96,3 @@ int __init arch_meminfo_get_ram_bank_range(uint32_t bank,
 
 	return 0;
 }
-
-static void dump_numa(unsigned char key)
-{
-    s_time_t now = NOW();
-    unsigned int i, j, n;
-    struct domain *d;
-    struct page_info *page;
-    unsigned int page_num_node[MAX_NUMNODES];
-    const struct vnuma_info *vnuma;
-
-    printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key,
-           now);
-
-    for_each_online_node ( i )
-    {
-        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
-
-        printk("NODE%u start->%lu size->%lu free->%lu\n",
-               i, node_start_pfn(i), node_spanned_pages(i),
-               avail_node_heap_pages(i));
-        /* sanity check phys_to_nid() */
-        if ( phys_to_nid(pa) != i )
-            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
-                   pa, phys_to_nid(pa), i);
-    }
-
-    j = cpumask_first(&cpu_online_map);
-    n = 0;
-    for_each_online_cpu ( i )
-    {
-        if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] )
-        {
-            if ( n > 1 )
-                printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
-            else
-                printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
-            j = i;
-            n = 1;
-        }
-        else
-            ++n;
-    }
-    if ( n > 1 )
-        printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
-    else
-        printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
-
-    rcu_read_lock(&domlist_read_lock);
-
-    printk("Memory location of each domain:\n");
-    for_each_domain ( d )
-    {
-        process_pending_softirqs();
-
-        printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d));
-
-        for_each_online_node ( i )
-            page_num_node[i] = 0;
-
-        spin_lock(&d->page_alloc_lock);
-        page_list_for_each(page, &d->page_list)
-        {
-            i = phys_to_nid(page_to_maddr(page));
-            page_num_node[i]++;
-        }
-        spin_unlock(&d->page_alloc_lock);
-
-        for_each_online_node ( i )
-            printk("    Node %u: %u\n", i, page_num_node[i]);
-
-        if ( !read_trylock(&d->vnuma_rwlock) )
-            continue;
-
-        if ( !d->vnuma )
-        {
-            read_unlock(&d->vnuma_rwlock);
-            continue;
-        }
-
-        vnuma = d->vnuma;
-        printk("     %u vnodes, %u vcpus, guest physical layout:\n",
-               vnuma->nr_vnodes, d->max_vcpus);
-        for ( i = 0; i < vnuma->nr_vnodes; i++ )
-        {
-            unsigned int start_cpu = ~0U;
-
-            if ( vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
-                printk("       %3u: pnode ???,", i);
-            else
-                printk("       %3u: pnode %3u,", i, vnuma->vnode_to_pnode[i]);
-
-            printk(" vcpus ");
-
-            for ( j = 0; j < d->max_vcpus; j++ )
-            {
-                if ( !(j & 0x3f) )
-                    process_pending_softirqs();
-
-                if ( vnuma->vcpu_to_vnode[j] == i )
-                {
-                    if ( start_cpu == ~0U )
-                    {
-                        printk("%d", j);
-                        start_cpu = j;
-                    }
-                }
-                else if ( start_cpu != ~0U )
-                {
-                    if ( j - 1 != start_cpu )
-                        printk("-%d ", j - 1);
-                    else
-                        printk(" ");
-                    start_cpu = ~0U;
-                }
-            }
-
-            if ( start_cpu != ~0U  && start_cpu != j - 1 )
-                printk("-%d", j - 1);
-
-            printk("\n");
-
-            for ( j = 0; j < vnuma->nr_vmemranges; j++ )
-            {
-                if ( vnuma->vmemrange[j].nid == i )
-                    printk("           %016"PRIx64" - %016"PRIx64"\n",
-                           vnuma->vmemrange[j].start,
-                           vnuma->vmemrange[j].end);
-            }
-        }
-
-        read_unlock(&d->vnuma_rwlock);
-    }
-
-    rcu_read_unlock(&domlist_read_lock);
-}
-
-static __init int register_numa_trigger(void)
-{
-    register_keyhandler('u', dump_numa, "dump NUMA info", 1);
-    return 0;
-}
-__initcall(register_numa_trigger);
-
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 94bd5b34da..44517c7b62 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -10,24 +10,19 @@
  * 
  * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com>
  */
-
+#include <xen/acpi.h>
 #include <xen/init.h>
 #include <xen/mm.h>
-#include <xen/inttypes.h>
 #include <xen/nodemask.h>
-#include <xen/acpi.h>
 #include <xen/numa.h>
 #include <xen/pfn.h>
+
 #include <asm/e820.h>
 #include <asm/page.h>
 #include <asm/spec_ctrl.h>
 
 static struct acpi_table_slit *__read_mostly acpi_slit;
 
-static nodemask_t memory_nodes_parsed __initdata;
-static nodemask_t processor_nodes_parsed __initdata;
-static struct node nodes[MAX_NUMNODES] __initdata;
-
 struct pxm2node {
 	unsigned pxm;
 	nodeid_t node;
@@ -37,11 +32,6 @@ static struct pxm2node __read_mostly pxm2node[MAX_NUMNODES] =
 
 static unsigned node_to_pxm(nodeid_t n);
 
-static int num_node_memblks;
-static struct node node_memblk_range[NR_NODE_MEMBLKS];
-static nodeid_t memblk_nodeid[NR_NODE_MEMBLKS];
-static __initdata DECLARE_BITMAP(memblk_hotplug, NR_NODE_MEMBLKS);
-
 static inline bool node_found(unsigned idx, unsigned pxm)
 {
 	return ((pxm2node[idx].pxm == pxm) &&
@@ -104,65 +94,6 @@ nodeid_t setup_node(unsigned pxm)
 	return node;
 }
 
-void  __init numa_set_processor_nodes_parsed(nodeid_t node)
-{
-	node_set(node, processor_nodes_parsed);
-}
-
-bool __init numa_memblks_available(void)
-{
-	if (num_node_memblks < NR_NODE_MEMBLKS)
-		return true;
-
-	return false;
-}
-
-int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node)
-{
-	int i;
-
-	for (i = 0; i < num_node_memblks; i++) {
-		struct node *nd = &node_memblk_range[i];
-
-		if (nd->start <= start && nd->end >= end &&
-			memblk_nodeid[i] == node)
-			return 1;
-	}
-
-	return 0;
-}
-
-static __init int conflicting_memblks(paddr_t start, paddr_t end)
-{
-	int i;
-
-	for (i = 0; i < num_node_memblks; i++) {
-		struct node *nd = &node_memblk_range[i];
-		if (nd->start == nd->end)
-			continue;
-		if (nd->end > start && nd->start < end)
-			return i;
-		if (nd->end == end && nd->start == start)
-			return i;
-	}
-	return -1;
-}
-
-static __init void cutoff_node(int i, paddr_t start, paddr_t end)
-{
-	struct node *nd = &nodes[i];
-	if (nd->start < start) {
-		nd->start = start;
-		if (nd->end < nd->start)
-			nd->start = nd->end;
-	}
-	if (nd->end > end) {
-		nd->end = end;
-		if (nd->start > nd->end)
-			nd->start = nd->end;
-	}
-}
-
 __init void bad_srat(void)
 {
 	int i;
@@ -284,97 +215,6 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
 		       pxm, pa->apic_id, node);
 }
 
-/*
- * Check to see if there are other nodes within this node's range.
- * We just need to check full contains situation. Because overlaps
- * have been checked before by conflicting_memblks.
- */
-static bool __init is_node_memory_continuous(nodeid_t nid,
-    paddr_t start, paddr_t end)
-{
-	nodeid_t i;
-
-	struct node *nd = &nodes[nid];
-	for_each_node_mask(i, memory_nodes_parsed)
-	{
-		/* Skip itself */
-		if (i == nid)
-			continue;
-
-		nd = &nodes[i];
-		if (start < nd->start && nd->end < end)
-		{
-			printk(KERN_ERR
-			       "NODE %u: (%"PRIpaddr"-%"PRIpaddr") intertwine with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",
-			       nid, start, end, i, nd->start, nd->end);
-			return false;
-		}
-	}
-
-	return true;
-}
-
-/* Neutral NUMA memory affinity init function for ACPI and DT */
-int __init numa_update_node_memblks(nodeid_t node,
-		paddr_t start, paddr_t size, bool hotplug)
-{
-	paddr_t end = start + size;
-	int i;
-
-	/* It is fine to add this area to the nodes data it will be used later */
-	i = conflicting_memblks(start, end);
-	if (i < 0)
-		/* everything fine */;
-	else if (memblk_nodeid[i] == node) {
-		bool mismatch = !hotplug != !test_bit(i, memblk_hotplug);
-
-		printk("%sSRAT: NODE %u (%"PRIpaddr"-%"PRIpaddr") overlaps with itself (%"PRIpaddr"-%"PRIpaddr")\n",
-		       mismatch ? KERN_ERR : KERN_WARNING, node, start, end,
-		       node_memblk_range[i].start, node_memblk_range[i].end);
-		if (mismatch) {
-			return -1;
-		}
-	} else {
-		printk(KERN_ERR
-		       "SRAT: NODE %u (%"PRIpaddr"-%"PRIpaddr") overlaps with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",
-		       node, start, end, memblk_nodeid[i],
-		       node_memblk_range[i].start, node_memblk_range[i].end);
-		return -1;
-	}
-
-	if (!hotplug) {
-		struct node *nd = &nodes[node];
-
-		if (!node_test_and_set(node, memory_nodes_parsed)) {
-			nd->start = start;
-			nd->end = end;
-		} else {
-			if (start < nd->start)
-				nd->start = start;
-			if (nd->end < end)
-				nd->end = end;
-
-			if (!is_node_memory_continuous(node, nd->start, nd->end))
-				return -1;
-		}
-	}
-
-	printk(KERN_INFO "SRAT: Node %u %"PRIpaddr"-%"PRIpaddr"%s\n",
-	       node, start, end, hotplug ? " (hotplug)" : "");
-
-	node_memblk_range[num_node_memblks].start = start;
-	node_memblk_range[num_node_memblks].end = end;
-	memblk_nodeid[num_node_memblks] = node;
-	if (hotplug) {
-		__set_bit(num_node_memblks, memblk_hotplug);
-		if (end > mem_hotplug_boundary())
-			mem_hotplug_update_boundary(end);
-	}
-	num_node_memblks++;
-
-	return 0;
-}
-
 /* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
 void __init
 acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
@@ -419,45 +259,6 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 		bad_srat();
 }
 
-/* Sanity check to catch more bad SRATs (they are amazingly common).
-   Make sure the PXMs cover all memory. */
-static int __init nodes_cover_memory(void)
-{
-	int i;
-	uint32_t nr_banks = arch_meminfo_get_nr_bank();
-
-	for (i = 0; i < nr_banks; i++) {
-		int j, found;
-		paddr_t start, end;
-
-		if (arch_meminfo_get_ram_bank_range(i, &start, &end))
-			continue;
-
-		do {
-			found = 0;
-			for_each_node_mask(j, memory_nodes_parsed)
-				if (start < nodes[j].end
-				    && end > nodes[j].start) {
-					if (start >= nodes[j].start) {
-						start = nodes[j].end;
-						found = 1;
-					}
-					if (end <= nodes[j].end) {
-						end = nodes[j].start;
-						found = 1;
-					}
-				}
-		} while (found && start < end);
-
-		if (start < end) {
-			printk(KERN_ERR "SRAT: No NODE for memory map range: "
-				"%"PRIpaddr" - %"PRIpaddr"\n", start, end);
-			return 0;
-		}
-	}
-	return 1;
-}
-
 void __init acpi_numa_arch_fixup(void) {}
 
 static uint64_t __initdata srat_region_mask;
@@ -511,56 +312,6 @@ void __init srat_parse_regions(paddr_t addr)
 	pfn_pdx_hole_setup(mask >> PAGE_SHIFT);
 }
 
-/* Use the information discovered above to actually set up the nodes. */
-int __init numa_scan_nodes(paddr_t start, paddr_t end)
-{
-	int i;
-	nodemask_t all_nodes_parsed;
-
-	/* First clean up the node list */
-	for (i = 0; i < MAX_NUMNODES; i++)
-		cutoff_node(i, start, end);
-
-	if (fw_numa <= 0)
-		return -1;
-
-	if (!nodes_cover_memory()) {
-		bad_srat();
-		return -1;
-	}
-
-	memnode_shift = compute_hash_shift(node_memblk_range, num_node_memblks,
-				memblk_nodeid);
-
-	if (memnode_shift < 0) {
-		printk(KERN_ERR
-		     "SRAT: No NUMA node hash function found. Contact maintainer\n");
-		bad_srat();
-		return -1;
-	}
-
-	nodes_or(all_nodes_parsed, memory_nodes_parsed, processor_nodes_parsed);
-
-	/* Finally register nodes */
-	for_each_node_mask(i, all_nodes_parsed)
-	{
-		paddr_t size = nodes[i].end - nodes[i].start;
-		if ( size == 0 )
-			printk(KERN_WARNING "SRAT: Node %u has no memory. "
-			       "Firmware Bug or mis-configured hardware?\n", i);
-
-		setup_node_bootmem(i, nodes[i].start, nodes[i].end);
-	}
-	for (i = 0; i < nr_cpu_ids; i++) {
-		if (cpu_to_node[i] == NUMA_NO_NODE)
-			continue;
-		if (!nodemask_test(cpu_to_node[i], &processor_nodes_parsed))
-			numa_set_node(i, NUMA_NO_NODE);
-	}
-	numa_init_array();
-	return 0;
-}
-
 static unsigned node_to_pxm(nodeid_t n)
 {
 	unsigned i;
diff --git a/xen/common/Makefile b/xen/common/Makefile
index 54de70d422..90e5bf3efb 100644
--- a/xen/common/Makefile
+++ b/xen/common/Makefile
@@ -26,6 +26,8 @@ obj-$(CONFIG_MEM_ACCESS) += mem_access.o
 obj-y += memory.o
 obj-y += multicall.o
 obj-y += notifier.o
+obj-$(CONFIG_NUMA) += numa.o
+obj-$(CONFIG_NUMA) += numa_srat.o
 obj-y += page_alloc.o
 obj-$(CONFIG_HAS_PDX) += pdx.o
 obj-$(CONFIG_PERF_COUNTERS) += perfc.o
diff --git a/xen/common/numa.c b/xen/common/numa.c
new file mode 100644
index 0000000000..fc6bba3981
--- /dev/null
+++ b/xen/common/numa.c
@@ -0,0 +1,450 @@
+/* 
+ * Generic VM initialization for NUMA setups.
+ * Copyright 2002,2003 Andi Kleen, SuSE Labs.
+ * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com>
+ */
+#include <xen/init.h>
+#include <xen/keyhandler.h>
+#include <xen/mm.h>
+#include <xen/nodemask.h>
+#include <xen/numa.h>
+#include <xen/param.h>
+
+#include <xen/sched.h>
+#include <xen/softirq.h>
+
+static int numa_setup(const char *s);
+custom_param("numa", numa_setup);
+
+struct node_data node_data[MAX_NUMNODES];
+
+/* Mapping from pdx to node id */
+int memnode_shift;
+static typeof(*memnodemap) _memnodemap[64];
+unsigned long memnodemapsize;
+u8 *memnodemap;
+
+nodeid_t cpu_to_node[NR_CPUS] __read_mostly = {
+    [0 ... NR_CPUS-1] = NUMA_NO_NODE
+};
+
+cpumask_t node_to_cpumask[MAX_NUMNODES] __read_mostly;
+nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
+
+bool numa_off;
+s8 fw_numa = 0;
+
+int srat_disabled(void)
+{
+    return numa_off || fw_numa < 0;
+}
+
+/*
+ * Given a shift value, try to populate memnodemap[]
+ * Returns :
+ * 1 if OK
+ * 0 if memnodmap[] too small (of shift too small)
+ * -1 if node overlap or lost ram (shift too big)
+ */
+static int __init populate_memnodemap(const struct node *nodes,
+                                      int numnodes, int shift, nodeid_t *nodeids)
+{
+    unsigned long spdx, epdx;
+    int i, res = -1;
+
+    memset(memnodemap, NUMA_NO_NODE, memnodemapsize * sizeof(*memnodemap));
+    for ( i = 0; i < numnodes; i++ )
+    {
+        spdx = paddr_to_pdx(nodes[i].start);
+        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
+        if ( spdx >= epdx )
+            continue;
+        if ( (epdx >> shift) >= memnodemapsize )
+            return 0;
+        do {
+            if ( memnodemap[spdx >> shift] != NUMA_NO_NODE )
+                return -1;
+
+            if ( !nodeids )
+                memnodemap[spdx >> shift] = i;
+            else
+                memnodemap[spdx >> shift] = nodeids[i];
+
+            spdx += (1UL << shift);
+        } while ( spdx < epdx );
+        res = 1;
+    }
+
+    return res;
+}
+
+static int __init allocate_cachealigned_memnodemap(void)
+{
+    unsigned long size = PFN_UP(memnodemapsize * sizeof(*memnodemap));
+    unsigned long mfn = mfn_x(alloc_boot_pages(size, 1));
+
+    memnodemap = mfn_to_virt(mfn);
+    mfn <<= PAGE_SHIFT;
+    size <<= PAGE_SHIFT;
+    printk(KERN_DEBUG "NUMA: Allocated memnodemap from %lx - %lx\n",
+           mfn, mfn + size);
+    memnodemapsize = size / sizeof(*memnodemap);
+
+    return 0;
+}
+
+/*
+ * The LSB of all start and end addresses in the node map is the value of the
+ * maximum possible shift.
+ */
+static int __init extract_lsb_from_nodes(const struct node *nodes,
+                                         int numnodes)
+{
+    int i, nodes_used = 0;
+    unsigned long spdx, epdx;
+    unsigned long bitfield = 0, memtop = 0;
+
+    for ( i = 0; i < numnodes; i++ )
+    {
+        spdx = paddr_to_pdx(nodes[i].start);
+        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
+        if ( spdx >= epdx )
+            continue;
+        bitfield |= spdx;
+        nodes_used++;
+        if ( epdx > memtop )
+            memtop = epdx;
+    }
+    if ( nodes_used <= 1 )
+        i = BITS_PER_LONG - 1;
+    else
+        i = find_first_bit(&bitfield, sizeof(unsigned long)*8);
+    memnodemapsize = (memtop >> i) + 1;
+    return i;
+}
+
+int __init compute_hash_shift(struct node *nodes, int numnodes,
+                              nodeid_t *nodeids)
+{
+    int shift;
+
+    shift = extract_lsb_from_nodes(nodes, numnodes);
+    if ( memnodemapsize <= ARRAY_SIZE(_memnodemap) )
+        memnodemap = _memnodemap;
+    else if ( allocate_cachealigned_memnodemap() )
+        return -1;
+    printk(KERN_DEBUG "NUMA: Using %d for the hash shift.\n", shift);
+
+    if ( populate_memnodemap(nodes, numnodes, shift, nodeids) != 1 )
+    {
+        printk(KERN_INFO "Your memory is not aligned you need to "
+               "rebuild your hypervisor with a bigger NODEMAPSIZE "
+               "shift=%d\n", shift);
+        return -1;
+    }
+
+    return shift;
+}
+/* initialize NODE_DATA given nodeid and start/end */
+void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end)
+{
+    unsigned long start_pfn, end_pfn;
+
+    start_pfn = paddr_to_pfn(start);
+    end_pfn = paddr_to_pfn(end);
+
+    NODE_DATA(nodeid)->node_start_pfn = start_pfn;
+    NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
+
+    node_set_online(nodeid);
+}
+
+void __init numa_init_array(void)
+{
+    int rr, i;
+
+    /* There are unfortunately some poorly designed mainboards around
+       that only connect memory to a single CPU. This breaks the 1:1 cpu->node
+       mapping. To avoid this fill in the mapping for all possible
+       CPUs, as the number of CPUs is not known yet.
+       We round robin the existing nodes. */
+    rr = first_node(node_online_map);
+    for ( i = 0; i < nr_cpu_ids; i++ )
+    {
+        if ( cpu_to_node[i] != NUMA_NO_NODE )
+            continue;
+        numa_set_node(i, rr);
+        rr = cycle_node(rr, node_online_map);
+    }
+}
+
+#ifdef CONFIG_NUMA_EMU
+static int numa_fake __initdata = 0;
+
+/* Numa emulation */
+static int __init numa_emulation(unsigned long start_pfn,
+                                 unsigned long end_pfn)
+{
+    int i;
+    struct node nodes[MAX_NUMNODES];
+    u64 sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake;
+
+    /* Kludge needed for the hash function */
+    if ( hweight64(sz) > 1 )
+    {
+        u64 x = 1;
+        while ( (x << 1) < sz )
+            x <<= 1;
+        if ( x < sz/2 )
+            printk(KERN_ERR "Numa emulation unbalanced. Complain to maintainer\n");
+        sz = x;
+    }
+
+    memset(&nodes,0,sizeof(nodes));
+    for ( i = 0; i < numa_fake; i++ )
+    {
+        nodes[i].start = pfn_to_paddr(start_pfn) + i*sz;
+        if ( i == numa_fake - 1 )
+            sz = pfn_to_paddr(end_pfn) - nodes[i].start;
+        nodes[i].end = nodes[i].start + sz;
+        printk(KERN_INFO "Faking node %d at %"PRIx64"-%"PRIx64" (%"PRIu64"MB)\n",
+               i,
+               nodes[i].start, nodes[i].end,
+               (nodes[i].end - nodes[i].start) >> 20);
+        node_set_online(i);
+    }
+    memnode_shift = compute_hash_shift(nodes, numa_fake, NULL);
+    if ( memnode_shift < 0 )
+    {
+        memnode_shift = 0;
+        printk(KERN_ERR "No NUMA hash function found. Emulation disabled.\n");
+        return -1;
+    }
+    for_each_online_node ( i )
+        setup_node_bootmem(i, nodes[i].start, nodes[i].end);
+    numa_init_array();
+
+    return 0;
+}
+#endif
+
+void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
+{
+    int i;
+    paddr_t start, end;
+
+#ifdef CONFIG_NUMA_EMU
+    if ( numa_fake && !numa_emulation(start_pfn, end_pfn) )
+        return;
+#endif
+
+    start = pfn_to_paddr(start_pfn);
+    end = pfn_to_paddr(end_pfn);
+
+#ifdef CONFIG_NUMA
+    if ( !numa_off && !numa_scan_nodes(start, end) )
+        return;
+#endif
+
+    printk(KERN_INFO "%s\n",
+           numa_off ? "NUMA turned off" : "No NUMA configuration found");
+
+    printk(KERN_INFO "Faking a node at %016"PRIpaddr"-%016"PRIpaddr"\n",
+           start, end);
+    /* setup dummy node covering all memory */
+    memnode_shift = BITS_PER_LONG - 1;
+    memnodemap = _memnodemap;
+    /* Dummy node only uses 1 slot in reality */
+    memnodemap[0] = 0;
+    memnodemapsize = 1;
+
+    nodes_clear(node_online_map);
+    node_set_online(0);
+    for ( i = 0; i < nr_cpu_ids; i++ )
+        numa_set_node(i, 0);
+    cpumask_copy(&node_to_cpumask[0], cpumask_of(0));
+    setup_node_bootmem(0, start, end);
+}
+
+void numa_add_cpu(int cpu)
+{
+    cpumask_set_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]);
+}
+
+void numa_set_node(int cpu, nodeid_t node)
+{
+    cpu_to_node[cpu] = node;
+}
+
+
+/* [numa=off] */
+static __init int numa_setup(const char *opt)
+{
+    if ( !strncmp(opt,"off",3) )
+        numa_off = true;
+    else if ( !strncmp(opt,"on",2) )
+        numa_off = false;
+#ifdef CONFIG_NUMA_EMU
+    else if ( !strncmp(opt, "fake=", 5) )
+    {
+        numa_off = false;
+        numa_fake = simple_strtoul(opt+5,NULL,0);
+        if ( numa_fake >= MAX_NUMNODES )
+            numa_fake = MAX_NUMNODES;
+    }
+#endif
+#ifdef CONFIG_ACPI_NUMA
+    else if ( !strncmp(opt,"noacpi",6) )
+    {
+        numa_off = false;
+        fw_numa = -1;
+    }
+#endif
+    else
+        return -EINVAL;
+
+    return 0;
+}
+
+
+static void dump_numa(unsigned char key)
+{
+    s_time_t now = NOW();
+    unsigned int i, j, n;
+    struct domain *d;
+    struct page_info *page;
+    unsigned int page_num_node[MAX_NUMNODES];
+    const struct vnuma_info *vnuma;
+
+    printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key,
+           now);
+
+    for_each_online_node ( i )
+    {
+        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
+
+        printk("NODE%u start->%lu size->%lu free->%lu\n",
+               i, node_start_pfn(i), node_spanned_pages(i),
+               avail_node_heap_pages(i));
+        /* sanity check phys_to_nid() */
+        if ( phys_to_nid(pa) != i )
+            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
+                   pa, phys_to_nid(pa), i);
+    }
+
+    j = cpumask_first(&cpu_online_map);
+    n = 0;
+    for_each_online_cpu ( i )
+    {
+        if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] )
+        {
+            if ( n > 1 )
+                printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
+            else
+                printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
+            j = i;
+            n = 1;
+        }
+        else
+            ++n;
+    }
+    if ( n > 1 )
+        printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
+    else
+        printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
+
+    rcu_read_lock(&domlist_read_lock);
+
+    printk("Memory location of each domain:\n");
+    for_each_domain ( d )
+    {
+        process_pending_softirqs();
+
+        printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d));
+
+        for_each_online_node ( i )
+            page_num_node[i] = 0;
+
+        spin_lock(&d->page_alloc_lock);
+        page_list_for_each(page, &d->page_list)
+        {
+            i = phys_to_nid(page_to_maddr(page));
+            page_num_node[i]++;
+        }
+        spin_unlock(&d->page_alloc_lock);
+
+        for_each_online_node ( i )
+            printk("    Node %u: %u\n", i, page_num_node[i]);
+
+        if ( !read_trylock(&d->vnuma_rwlock) )
+            continue;
+
+        if ( !d->vnuma )
+        {
+            read_unlock(&d->vnuma_rwlock);
+            continue;
+        }
+
+        vnuma = d->vnuma;
+        printk("     %u vnodes, %u vcpus, guest physical layout:\n",
+               vnuma->nr_vnodes, d->max_vcpus);
+        for ( i = 0; i < vnuma->nr_vnodes; i++ )
+        {
+            unsigned int start_cpu = ~0U;
+
+            if ( vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
+                printk("       %3u: pnode ???,", i);
+            else
+                printk("       %3u: pnode %3u,", i, vnuma->vnode_to_pnode[i]);
+
+            printk(" vcpus ");
+
+            for ( j = 0; j < d->max_vcpus; j++ )
+            {
+                if ( !(j & 0x3f) )
+                    process_pending_softirqs();
+
+                if ( vnuma->vcpu_to_vnode[j] == i )
+                {
+                    if ( start_cpu == ~0U )
+                    {
+                        printk("%d", j);
+                        start_cpu = j;
+                    }
+                }
+                else if ( start_cpu != ~0U )
+                {
+                    if ( j - 1 != start_cpu )
+                        printk("-%d ", j - 1);
+                    else
+                        printk(" ");
+                    start_cpu = ~0U;
+                }
+            }
+
+            if ( start_cpu != ~0U  && start_cpu != j - 1 )
+                printk("-%d", j - 1);
+
+            printk("\n");
+
+            for ( j = 0; j < vnuma->nr_vmemranges; j++ )
+            {
+                if ( vnuma->vmemrange[j].nid == i )
+                    printk("           %016"PRIx64" - %016"PRIx64"\n",
+                           vnuma->vmemrange[j].start,
+                           vnuma->vmemrange[j].end);
+            }
+        }
+
+        read_unlock(&d->vnuma_rwlock);
+    }
+
+    rcu_read_unlock(&domlist_read_lock);
+}
+
+static __init int register_numa_trigger(void)
+{
+    register_keyhandler('u', dump_numa, "dump NUMA info", 1);
+    return 0;
+}
+__initcall(register_numa_trigger);
diff --git a/xen/common/numa_srat.c b/xen/common/numa_srat.c
new file mode 100644
index 0000000000..7bda2ecef6
--- /dev/null
+++ b/xen/common/numa_srat.c
@@ -0,0 +1,264 @@
+/*
+ * ACPI 3.0 based NUMA setup
+ * Copyright 2004 Andi Kleen, SuSE Labs.
+ *
+ * Reads the ACPI SRAT table to figure out what memory belongs to which CPUs.
+ *
+ * Called from acpi_numa_init while reading the SRAT and SLIT tables.
+ * Assumes all memory regions belonging to a single proximity domain
+ * are in one chunk. Holes between them will be included in the node.
+ *
+ * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com>
+ */
+#include <xen/init.h>
+#include <xen/mm.h>
+#include <xen/nodemask.h>
+#include <xen/numa.h>
+
+static nodemask_t memory_nodes_parsed __initdata;
+static nodemask_t processor_nodes_parsed __initdata;
+static struct node nodes[MAX_NUMNODES] __initdata;
+
+static int num_node_memblks;
+static struct node node_memblk_range[NR_NODE_MEMBLKS];
+static nodeid_t memblk_nodeid[NR_NODE_MEMBLKS];
+static __initdata DECLARE_BITMAP(memblk_hotplug, NR_NODE_MEMBLKS);
+
+void  __init numa_set_processor_nodes_parsed(nodeid_t node)
+{
+	node_set(node, processor_nodes_parsed);
+}
+
+bool __init numa_memblks_available(void)
+{
+	if (num_node_memblks < NR_NODE_MEMBLKS)
+		return true;
+
+	return false;
+}
+
+int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node)
+{
+	int i;
+
+	for (i = 0; i < num_node_memblks; i++) {
+		struct node *nd = &node_memblk_range[i];
+
+		if (nd->start <= start && nd->end >= end &&
+			memblk_nodeid[i] == node)
+			return 1;
+	}
+
+	return 0;
+}
+
+static __init int conflicting_memblks(paddr_t start, paddr_t end)
+{
+	int i;
+
+	for (i = 0; i < num_node_memblks; i++) {
+		struct node *nd = &node_memblk_range[i];
+		if (nd->start == nd->end)
+			continue;
+		if (nd->end > start && nd->start < end)
+			return i;
+		if (nd->end == end && nd->start == start)
+			return i;
+	}
+	return -1;
+}
+
+static __init void cutoff_node(int i, paddr_t start, paddr_t end)
+{
+	struct node *nd = &nodes[i];
+	if (nd->start < start) {
+		nd->start = start;
+		if (nd->end < nd->start)
+			nd->start = nd->end;
+	}
+	if (nd->end > end) {
+		nd->end = end;
+		if (nd->start > nd->end)
+			nd->start = nd->end;
+	}
+}
+
+/*
+ * Check to see if there are other nodes within this node's range.
+ * We just need to check full contains situation. Because overlaps
+ * have been checked before by conflicting_memblks.
+ */
+static bool __init is_node_memory_continuous(nodeid_t nid,
+    paddr_t start, paddr_t end)
+{
+	nodeid_t i;
+
+	struct node *nd = &nodes[nid];
+	for_each_node_mask(i, memory_nodes_parsed)
+	{
+		/* Skip itself */
+		if (i == nid)
+			continue;
+
+		nd = &nodes[i];
+		if (start < nd->start && nd->end < end)
+		{
+			printk(KERN_ERR
+			       "NODE %u: (%"PRIpaddr"-%"PRIpaddr") intertwine with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",
+			       nid, start, end, i, nd->start, nd->end);
+			return false;
+		}
+	}
+
+	return true;
+}
+
+/* Neutral NUMA memory affinity init function for ACPI and DT */
+int __init numa_update_node_memblks(nodeid_t node,
+		paddr_t start, paddr_t size, bool hotplug)
+{
+	paddr_t end = start + size;
+	int i;
+
+	/* It is fine to add this area to the nodes data it will be used later */
+	i = conflicting_memblks(start, end);
+	if (i < 0)
+		/* everything fine */;
+	else if (memblk_nodeid[i] == node) {
+		bool mismatch = !hotplug != !test_bit(i, memblk_hotplug);
+
+		printk("%sSRAT: NODE %u (%"PRIpaddr"-%"PRIpaddr") overlaps with itself (%"PRIpaddr"-%"PRIpaddr")\n",
+		       mismatch ? KERN_ERR : KERN_WARNING, node, start, end,
+		       node_memblk_range[i].start, node_memblk_range[i].end);
+		if (mismatch) {
+			return -1;
+		}
+	} else {
+		printk(KERN_ERR
+		       "SRAT: NODE %u (%"PRIpaddr"-%"PRIpaddr") overlaps with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",
+		       node, start, end, memblk_nodeid[i],
+		       node_memblk_range[i].start, node_memblk_range[i].end);
+		return -1;
+	}
+
+	if (!hotplug) {
+		struct node *nd = &nodes[node];
+
+		if (!node_test_and_set(node, memory_nodes_parsed)) {
+			nd->start = start;
+			nd->end = end;
+		} else {
+			if (start < nd->start)
+				nd->start = start;
+			if (nd->end < end)
+				nd->end = end;
+
+			if (!is_node_memory_continuous(node, nd->start, nd->end))
+				return -1;
+		}
+	}
+
+	printk(KERN_INFO "SRAT: Node %u %"PRIpaddr"-%"PRIpaddr"%s\n",
+	       node, start, end, hotplug ? " (hotplug)" : "");
+
+	node_memblk_range[num_node_memblks].start = start;
+	node_memblk_range[num_node_memblks].end = end;
+	memblk_nodeid[num_node_memblks] = node;
+	if (hotplug) {
+		__set_bit(num_node_memblks, memblk_hotplug);
+		if (end > mem_hotplug_boundary())
+			mem_hotplug_update_boundary(end);
+	}
+	num_node_memblks++;
+
+	return 0;
+}
+
+/* Sanity check to catch more bad SRATs (they are amazingly common).
+   Make sure the PXMs cover all memory. */
+static int __init nodes_cover_memory(void)
+{
+	int i;
+	uint32_t nr_banks = arch_meminfo_get_nr_bank();
+
+	for (i = 0; i < nr_banks; i++) {
+		int j, found;
+		paddr_t start, end;
+
+		if (arch_meminfo_get_ram_bank_range(i, &start, &end))
+			continue;
+
+		do {
+			found = 0;
+			for_each_node_mask(j, memory_nodes_parsed)
+				if (start < nodes[j].end
+				    && end > nodes[j].start) {
+					if (start >= nodes[j].start) {
+						start = nodes[j].end;
+						found = 1;
+					}
+					if (end <= nodes[j].end) {
+						end = nodes[j].start;
+						found = 1;
+					}
+				}
+		} while (found && start < end);
+
+		if (start < end) {
+			printk(KERN_ERR "SRAT: No NODE for memory map range: "
+				"%"PRIpaddr" - %"PRIpaddr"\n", start, end);
+			return 0;
+		}
+	}
+	return 1;
+}
+
+/* Use the information discovered above to actually set up the nodes. */
+int __init numa_scan_nodes(paddr_t start, paddr_t end)
+{
+	int i;
+	nodemask_t all_nodes_parsed;
+
+	/* First clean up the node list */
+	for (i = 0; i < MAX_NUMNODES; i++)
+		cutoff_node(i, start, end);
+
+	if (fw_numa <= 0)
+		return -1;
+
+	if (!nodes_cover_memory()) {
+		bad_srat();
+		return -1;
+	}
+
+	memnode_shift = compute_hash_shift(node_memblk_range, num_node_memblks,
+				memblk_nodeid);
+
+	if (memnode_shift < 0) {
+		printk(KERN_ERR
+		     "SRAT: No NUMA node hash function found. Contact maintainer\n");
+		bad_srat();
+		return -1;
+	}
+
+	nodes_or(all_nodes_parsed, memory_nodes_parsed, processor_nodes_parsed);
+
+	/* Finally register nodes */
+	for_each_node_mask(i, all_nodes_parsed)
+	{
+		paddr_t size = nodes[i].end - nodes[i].start;
+		if ( size == 0 )
+			printk(KERN_WARNING "SRAT: Node %u has no memory. "
+			       "Firmware Bug or mis-configured hardware?\n", i);
+
+		setup_node_bootmem(i, nodes[i].start, nodes[i].end);
+	}
+	for (i = 0; i < nr_cpu_ids; i++) {
+		if (cpu_to_node[i] == NUMA_NO_NODE)
+			continue;
+		if (!nodemask_test(cpu_to_node[i], &processor_nodes_parsed))
+			numa_set_node(i, NUMA_NO_NODE);
+	}
+	numa_init_array();
+	return 0;
+}
diff --git a/xen/include/asm-x86/acpi.h b/xen/include/asm-x86/acpi.h
index 2add971072..2140461ff3 100644
--- a/xen/include/asm-x86/acpi.h
+++ b/xen/include/asm-x86/acpi.h
@@ -101,10 +101,6 @@ extern unsigned long acpi_wakeup_address;
 
 #define ARCH_HAS_POWER_INIT	1
 
-extern s8 fw_numa;
-extern int numa_scan_nodes(u64 start, u64 end);
-#define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
-
 extern struct acpi_sleep_info acpi_sinfo;
 #define acpi_video_flags bootsym(video_flags)
 struct xenpf_enter_acpi_sleep;
diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
index a5690a7098..cd407804c8 100644
--- a/xen/include/asm-x86/numa.h
+++ b/xen/include/asm-x86/numa.h
@@ -7,85 +7,17 @@ typedef u8 nodeid_t;
 
 extern int srat_rev;
 
-extern nodeid_t      cpu_to_node[NR_CPUS];
-extern cpumask_t     node_to_cpumask[];
-
-#define cpu_to_node(cpu)		(cpu_to_node[cpu])
-#define parent_node(node)		(node)
-#define node_to_first_cpu(node)  (__ffs(node_to_cpumask[node]))
-#define node_to_cpumask(node)    (node_to_cpumask[node])
-
-struct node { 
-	paddr_t start,end;
-};
-
-extern int compute_hash_shift(struct node *nodes, int numnodes,
-			      nodeid_t *nodeids);
 extern nodeid_t pxm_to_node(unsigned int pxm);
 
 #define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
-#define VIRTUAL_BUG_ON(x) 
-
-extern void numa_add_cpu(int cpu);
-extern void numa_init_array(void);
-extern bool numa_off;
 
-
-extern int srat_disabled(void);
-extern void bad_srat(void);
-extern void numa_set_node(int cpu, nodeid_t node);
 extern nodeid_t setup_node(unsigned int pxm);
-extern void srat_detect_node(int cpu);
 
-extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end);
 extern nodeid_t apicid_to_node[];
 extern void init_cpu_to_node(void);
 
-static inline void clear_node_cpumask(int cpu)
-{
-	cpumask_clear_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]);
-}
-
-/* Simple perfect hash to map pdx to node numbers */
-extern int memnode_shift; 
-extern unsigned long memnodemapsize;
-extern u8 *memnodemap;
-
-struct node_data {
-    unsigned long node_start_pfn;
-    unsigned long node_spanned_pages;
-};
-
-extern struct node_data node_data[];
-
-static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
-{ 
-	nodeid_t nid;
-	VIRTUAL_BUG_ON((paddr_to_pdx(addr) >> memnode_shift) >= memnodemapsize);
-	nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift]; 
-	VIRTUAL_BUG_ON(nid >= MAX_NUMNODES || !node_data[nid]); 
-	return nid; 
-} 
-
-#define NODE_DATA(nid)		(&(node_data[nid]))
-
-#define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)
-#define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)
-#define node_end_pfn(nid)       (NODE_DATA(nid)->node_start_pfn + \
-				 NODE_DATA(nid)->node_spanned_pages)
-
-extern int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node);
-extern bool numa_memblks_available(void);
-extern int numa_update_node_memblks(nodeid_t node,
-		paddr_t start, paddr_t size, bool hotplug);
-extern void numa_set_processor_nodes_parsed(nodeid_t node);
-
 void srat_parse_regions(paddr_t addr);
-extern u8 __node_distance(nodeid_t a, nodeid_t b);
 unsigned int arch_get_dma_bitsize(void);
 unsigned int arch_have_default_dmazone(void);
-extern uint32_t arch_meminfo_get_nr_bank(void);
-extern int arch_meminfo_get_ram_bank_range(uint32_t bank,
-    paddr_t *start, paddr_t *end);
 
 #endif
diff --git a/xen/include/asm-x86/setup.h b/xen/include/asm-x86/setup.h
index 24be46115d..63838ba2d1 100644
--- a/xen/include/asm-x86/setup.h
+++ b/xen/include/asm-x86/setup.h
@@ -17,7 +17,6 @@ void early_time_init(void);
 
 void set_nr_cpu_ids(unsigned int max_cpus);
 
-void numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn);
 void arch_init_memory(void);
 void subarch_init_memory(void);
 
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index 52950a3150..51391a2440 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -12,10 +12,92 @@
 #define MAX_NUMNODES    1
 #endif
 
+#define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
+
 #define vcpu_to_node(v) (cpu_to_node((v)->processor))
 
 #define domain_to_node(d) \
   (((d)->vcpu != NULL && (d)->vcpu[0] != NULL) \
    ? vcpu_to_node((d)->vcpu[0]) : NUMA_NO_NODE)
 
+/* The following content can be used when NUMA feature is enabled */
+#ifdef CONFIG_NUMA
+
+extern nodeid_t      cpu_to_node[NR_CPUS];
+extern cpumask_t     node_to_cpumask[];
+
+#define cpu_to_node(cpu)		(cpu_to_node[cpu])
+#define parent_node(node)		(node)
+#define node_to_first_cpu(node)  (__ffs(node_to_cpumask[node]))
+#define node_to_cpumask(node)    (node_to_cpumask[node])
+
+struct node {
+	paddr_t start,end;
+};
+
+extern int compute_hash_shift(struct node *nodes, int numnodes,
+			      nodeid_t *nodeids);
+
+#define VIRTUAL_BUG_ON(x)
+
+extern void numa_add_cpu(int cpu);
+extern void numa_init_array(void);
+extern bool numa_off;
+extern s8 fw_numa;
+
+extern int srat_disabled(void);
+extern void srat_detect_node(int cpu);
+extern void bad_srat(void);
+
+extern void numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn);
+extern void numa_set_node(int cpu, nodeid_t node);
+extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end);
+
+static inline void clear_node_cpumask(int cpu)
+{
+	cpumask_clear_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]);
+}
+
+/* Simple perfect hash to map pdx to node numbers */
+extern int memnode_shift;
+extern unsigned long memnodemapsize;
+extern u8 *memnodemap;
+
+struct node_data {
+    unsigned long node_start_pfn;
+    unsigned long node_spanned_pages;
+};
+
+extern struct node_data node_data[];
+
+static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
+{
+	nodeid_t nid;
+	VIRTUAL_BUG_ON((paddr_to_pdx(addr) >> memnode_shift) >= memnodemapsize);
+	nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift];
+	VIRTUAL_BUG_ON(nid >= MAX_NUMNODES || !node_data[nid]);
+	return nid;
+}
+
+#define NODE_DATA(nid)		(&(node_data[nid]))
+
+#define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)
+#define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)
+#define node_end_pfn(nid)       (NODE_DATA(nid)->node_start_pfn + \
+				 NODE_DATA(nid)->node_spanned_pages)
+
+extern int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node);
+extern bool numa_memblks_available(void);
+extern int numa_update_node_memblks(nodeid_t node,
+		paddr_t start, paddr_t size, bool hotplug);
+extern void numa_set_processor_nodes_parsed(nodeid_t node);
+extern int numa_scan_nodes(u64 start, u64 end);
+
+extern u8 __node_distance(nodeid_t a, nodeid_t b);
+extern uint32_t arch_meminfo_get_nr_bank(void);
+extern int arch_meminfo_get_ram_bank_range(uint32_t bank,
+    paddr_t *start, paddr_t *end);
+
+#endif
+
 #endif /* _XEN_NUMA_H */
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 19/37] xen/x86: promote VIRTUAL_BUG_ON to ASSERT in
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (17 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 18/37] xen: move NUMA common code from x86 to common Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2022-01-17 16:21   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture Wei Chen
                   ` (17 subsequent siblings)
  36 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

VIRTUAL_BUG_ON, which is used in phys_to_nid, is an empty macro. As a
result, the two lines of error-checking code in phys_to_nid do not
actually work. It also covers up two compilation errors:
1. error: ‘MAX_NUMNODES’ undeclared (first use in this function).
   This is because MAX_NUMNODES is defined in xen/numa.h, but
   asm/numa.h is a dependency of xen/numa.h, so we can't include
   xen/numa.h in asm/numa.h. This error has been fixed since we
   moved phys_to_nid to xen/numa.h.
2. error: wrong type argument to unary exclamation mark.
   This is because the error-checking code contains !node_data[nid],
   but node_data is an array of structures, not an array of pointers.

So, in this patch, we use ASSERT in VIRTUAL_BUG_ON to enable the two
lines of error-checking code, and fix the remaining compilation error
by replacing !node_data[nid] with !node_data[nid].node_spanned_pages.

When node_spanned_pages is 0, the node has no memory, and
numa_scan_nodes prints a warning message for such nodes:
"Firmware Bug or mis-configured hardware?". Even though Xen allows
such nodes to be brought online, phys_to_nid should never return a
memory-less node for a physical address.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/include/xen/numa.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index 51391a2440..1978e2be1b 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -38,7 +38,7 @@ struct node {
 extern int compute_hash_shift(struct node *nodes, int numnodes,
 			      nodeid_t *nodeids);
 
-#define VIRTUAL_BUG_ON(x)
+#define VIRTUAL_BUG_ON(x) ASSERT(!(x))
 
 extern void numa_add_cpu(int cpu);
 extern void numa_init_array(void);
@@ -75,7 +75,7 @@ static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
 	nodeid_t nid;
 	VIRTUAL_BUG_ON((paddr_to_pdx(addr) >> memnode_shift) >= memnodemapsize);
 	nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift];
-	VIRTUAL_BUG_ON(nid >= MAX_NUMNODES || !node_data[nid]);
+	VIRTUAL_BUG_ON(nid >= MAX_NUMNODES || !node_data[nid].node_spanned_pages);
 	return nid;
 }
 
-- 
2.25.1




* [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (18 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 19/37] xen/x86: promote VIRTUAL_BUG_ON to ASSERT in Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  1:15   ` Stefano Stabellini
  2022-01-25 10:34   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 21/37] xen/arm: Keep memory nodes in dtb for NUMA when boot from EFI Wei Chen
                   ` (16 subsequent siblings)
  36 siblings, 2 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

Some architectures do not support EFI, but the EFI APIs are used by
some common features. Instead of spreading #ifdef ARCH around, we
introduce this Kconfig option to give Xen the ability to stub the
EFI APIs on architectures without EFI support.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/arm/Kconfig  |  1 +
 xen/arch/arm/Makefile |  2 +-
 xen/arch/x86/Kconfig  |  1 +
 xen/common/Kconfig    | 11 +++++++++++
 xen/include/xen/efi.h |  4 ++++
 5 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig
index ecfa6822e4..865ad83a89 100644
--- a/xen/arch/arm/Kconfig
+++ b/xen/arch/arm/Kconfig
@@ -6,6 +6,7 @@ config ARM_64
 	def_bool y
 	depends on !ARM_32
 	select 64BIT
+	select EFI
 	select HAS_FAST_MULTIPLY
 
 config ARM
diff --git a/xen/arch/arm/Makefile b/xen/arch/arm/Makefile
index 3d3b97b5b4..ae4efbf76e 100644
--- a/xen/arch/arm/Makefile
+++ b/xen/arch/arm/Makefile
@@ -1,6 +1,6 @@
 obj-$(CONFIG_ARM_32) += arm32/
 obj-$(CONFIG_ARM_64) += arm64/
-obj-$(CONFIG_ARM_64) += efi/
+obj-$(CONFIG_EFI) += efi/
 obj-$(CONFIG_ACPI) += acpi/
 ifneq ($(CONFIG_NO_PLAT),y)
 obj-y += platforms/
diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
index 28d13b9705..b9ed187f6b 100644
--- a/xen/arch/x86/Kconfig
+++ b/xen/arch/x86/Kconfig
@@ -10,6 +10,7 @@ config X86
 	select ALTERNATIVE_CALL
 	select ARCH_SUPPORTS_INT128
 	select CORE_PARKING
+	select EFI
 	select HAS_ALTERNATIVE
 	select HAS_COMPAT
 	select HAS_CPUFREQ
diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index 9ebb1c239b..f998746a1a 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -11,6 +11,16 @@ config COMPAT
 config CORE_PARKING
 	bool
 
+config EFI
+	bool
+	---help---
+      This option provides support for runtime services provided
+      by UEFI firmware (such as non-volatile variables, realtime
+      clock, and platform reset). A UEFI stub is also provided to
+      allow the kernel to be booted as an EFI application. This
+      is only useful for kernels that may run on systems that have
+      UEFI firmware.
+
 config GRANT_TABLE
 	bool "Grant table support" if EXPERT
 	default y
@@ -196,6 +206,7 @@ config KEXEC
 
 config EFI_SET_VIRTUAL_ADDRESS_MAP
     bool "EFI: call SetVirtualAddressMap()" if EXPERT
+    depends on EFI
     ---help---
       Call EFI SetVirtualAddressMap() runtime service to setup memory map for
       further runtime services. According to UEFI spec, it isn't strictly
diff --git a/xen/include/xen/efi.h b/xen/include/xen/efi.h
index 94a7e547f9..661a48286a 100644
--- a/xen/include/xen/efi.h
+++ b/xen/include/xen/efi.h
@@ -25,6 +25,8 @@ extern struct efi efi;
 
 #ifndef __ASSEMBLY__
 
+#ifdef CONFIG_EFI
+
 union xenpf_efi_info;
 union compat_pf_efi_info;
 
@@ -45,6 +47,8 @@ int efi_runtime_call(struct xenpf_efi_runtime_call *);
 int efi_compat_get_info(uint32_t idx, union compat_pf_efi_info *);
 int efi_compat_runtime_call(struct compat_pf_efi_runtime_call *);
 
+#endif /* CONFIG_EFI*/
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __XEN_EFI_H__ */
-- 
2.25.1




* [PATCH 21/37] xen/arm: Keep memory nodes in dtb for NUMA when boot from EFI
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (19 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  1:23   ` Stefano Stabellini
  2022-01-25 10:38   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS Wei Chen
                   ` (15 subsequent siblings)
  36 siblings, 2 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

EFI can get the memory map from the EFI system table. But the EFI
system table doesn't contain memory NUMA information, so EFI depends
on the ACPI SRAT or device tree memory nodes to parse the NUMA
mapping of memory blocks.

But in the current code, when Xen boots from EFI, it deletes all
memory nodes in the device tree. So in a UEFI + DTB boot, we no
longer have numa-node-id for memory blocks.

So in this patch, we keep the memory nodes in the device tree so
that the NUMA code can parse the memory numa-node-id later.

As a side effect, if we still parse boot memory information in
early_scan_node, bootinfo.mem will accumulate the memory ranges
from the memory nodes twice. So we have to prevent early_scan_node
from parsing memory nodes in an EFI boot.

As the EFI APIs can only be used on Arm64, we introduce a stub API
for Arm32, which has no EFI support. This prevents build failures
on Arm32.
Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/arm/bootfdt.c      |  8 +++++++-
 xen/arch/arm/efi/efi-boot.h | 25 -------------------------
 xen/include/xen/efi.h       |  7 +++++++
 3 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/xen/arch/arm/bootfdt.c b/xen/arch/arm/bootfdt.c
index afaa0e249b..6bc5a465ec 100644
--- a/xen/arch/arm/bootfdt.c
+++ b/xen/arch/arm/bootfdt.c
@@ -11,6 +11,7 @@
 #include <xen/lib.h>
 #include <xen/kernel.h>
 #include <xen/init.h>
+#include <xen/efi.h>
 #include <xen/device_tree.h>
 #include <xen/libfdt/libfdt.h>
 #include <xen/sort.h>
@@ -370,7 +371,12 @@ static int __init early_scan_node(const void *fdt,
 {
     int rc = 0;
 
-    if ( device_tree_node_matches(fdt, node, "memory") )
+    /*
+     * If Xen has been booted via UEFI, the memory banks will already
+     * be populated. So we should skip the parsing.
+     */
+    if ( !efi_enabled(EFI_BOOT) &&
+         device_tree_node_matches(fdt, node, "memory"))
         rc = process_memory_node(fdt, node, name, depth,
                                  address_cells, size_cells, &bootinfo.mem);
     else if ( depth == 1 && !dt_node_cmp(name, "reserved-memory") )
diff --git a/xen/arch/arm/efi/efi-boot.h b/xen/arch/arm/efi/efi-boot.h
index cf9c37153f..d0a9987fa4 100644
--- a/xen/arch/arm/efi/efi-boot.h
+++ b/xen/arch/arm/efi/efi-boot.h
@@ -197,33 +197,8 @@ EFI_STATUS __init fdt_add_uefi_nodes(EFI_SYSTEM_TABLE *sys_table,
     int status;
     u32 fdt_val32;
     u64 fdt_val64;
-    int prev;
     int num_rsv;
 
-    /*
-     * Delete any memory nodes present.  The EFI memory map is the only
-     * memory description provided to Xen.
-     */
-    prev = 0;
-    for (;;)
-    {
-        const char *type;
-        int len;
-
-        node = fdt_next_node(fdt, prev, NULL);
-        if ( node < 0 )
-            break;
-
-        type = fdt_getprop(fdt, node, "device_type", &len);
-        if ( type && strncmp(type, "memory", len) == 0 )
-        {
-            fdt_del_node(fdt, node);
-            continue;
-        }
-
-        prev = node;
-    }
-
    /*
     * Delete all memory reserve map entries. When booting via UEFI,
     * kernel will use the UEFI memory map to find reserved regions.
diff --git a/xen/include/xen/efi.h b/xen/include/xen/efi.h
index 661a48286a..b52a4678e9 100644
--- a/xen/include/xen/efi.h
+++ b/xen/include/xen/efi.h
@@ -47,6 +47,13 @@ int efi_runtime_call(struct xenpf_efi_runtime_call *);
 int efi_compat_get_info(uint32_t idx, union compat_pf_efi_info *);
 int efi_compat_runtime_call(struct compat_pf_efi_runtime_call *);
 
+#else
+
+static inline bool efi_enabled(unsigned int feature)
+{
+    return false;
+}
+
 #endif /* CONFIG_EFI*/
 
 #endif /* !__ASSEMBLY__ */
-- 
2.25.1




* [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (20 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 21/37] xen/arm: Keep memory nodes in dtb for NUMA when boot from EFI Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  1:34   ` Stefano Stabellini
  2021-09-23 12:02 ` [PATCH 23/37] xen/arm: implement node distance helpers for Arm Wei Chen
                   ` (14 subsequent siblings)
  36 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

A memory range described in the device tree cannot be split across
multiple nodes, so we define NR_NODE_MEMBLKS as NR_MEM_BANKS in the
arch header, and keep the default NR_NODE_MEMBLKS in the common
header for architectures where NUMA is disabled.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/include/asm-arm/numa.h | 8 +++++++-
 xen/include/xen/numa.h     | 2 ++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
index 8f1c67e3eb..21569e634b 100644
--- a/xen/include/asm-arm/numa.h
+++ b/xen/include/asm-arm/numa.h
@@ -3,9 +3,15 @@
 
 #include <xen/mm.h>
 
+#include <asm/setup.h>
+
 typedef u8 nodeid_t;
 
-#ifndef CONFIG_NUMA
+#ifdef CONFIG_NUMA
+
+#define NR_NODE_MEMBLKS NR_MEM_BANKS
+
+#else
 
 /* Fake one node for now. See also node_online_map. */
 #define cpu_to_node(cpu) 0
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index 1978e2be1b..1731e1cc6b 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -12,7 +12,9 @@
 #define MAX_NUMNODES    1
 #endif
 
+#ifndef NR_NODE_MEMBLKS
 #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
+#endif
 
 #define vcpu_to_node(v) (cpu_to_node((v)->processor))
 
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 23/37] xen/arm: implement node distance helpers for Arm
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (21 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  1:46   ` Stefano Stabellini
  2021-09-23 12:02 ` [PATCH 24/37] xen/arm: implement two arch helpers to get memory map info Wei Chen
                   ` (13 subsequent siblings)
  36 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

We will parse NUMA node distances from the device tree or ACPI
table, so we need a matrix to record the distances between any
two parsed nodes. Accordingly, in this patch we provide the
numa_set_distance API for the device tree or ACPI table parsers
to set the distance between any two nodes.
When NUMA initialization fails, __node_distance will return
NUMA_REMOTE_DISTANCE; this helps us avoid rolling back the
distance matrix on failure.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/arm/Makefile      |  1 +
 xen/arch/arm/numa.c        | 69 ++++++++++++++++++++++++++++++++++++++
 xen/include/asm-arm/numa.h | 13 +++++++
 3 files changed, 83 insertions(+)
 create mode 100644 xen/arch/arm/numa.c

diff --git a/xen/arch/arm/Makefile b/xen/arch/arm/Makefile
index ae4efbf76e..41ca311b6b 100644
--- a/xen/arch/arm/Makefile
+++ b/xen/arch/arm/Makefile
@@ -35,6 +35,7 @@ obj-$(CONFIG_LIVEPATCH) += livepatch.o
 obj-y += mem_access.o
 obj-y += mm.o
 obj-y += monitor.o
+obj-$(CONFIG_NUMA) += numa.o
 obj-y += p2m.o
 obj-y += percpu.o
 obj-y += platform.o
diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
new file mode 100644
index 0000000000..3f08870d69
--- /dev/null
+++ b/xen/arch/arm/numa.c
@@ -0,0 +1,69 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Arm Architecture support layer for NUMA.
+ *
+ * Copyright (C) 2021 Arm Ltd
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ *
+ */
+#include <xen/init.h>
+#include <xen/numa.h>
+
+static uint8_t __read_mostly
+node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
+    { 0 }
+};
+
+void __init numa_set_distance(nodeid_t from, nodeid_t to, uint32_t distance)
+{
+    if ( from >= MAX_NUMNODES || to >= MAX_NUMNODES )
+    {
+        printk(KERN_WARNING
+               "NUMA: invalid nodes: from=%"PRIu8" to=%"PRIu8" MAX=%u\n",
+               from, to, MAX_NUMNODES);
+        return;
+    }
+
+    /* NUMA defines 0xff as an unreachable node and 0-9 are undefined */
+    if ( distance >= NUMA_NO_DISTANCE ||
+         (distance >= NUMA_DISTANCE_UDF_MIN &&
+          distance <= NUMA_DISTANCE_UDF_MAX) ||
+         (from == to && distance != NUMA_LOCAL_DISTANCE) )
+    {
+        printk(KERN_WARNING
+               "NUMA: invalid distance: from=%"PRIu8" to=%"PRIu8" distance=%"PRIu32"\n",
+               from, to, distance);
+        return;
+    }
+
+    node_distance_map[from][to] = distance;
+}
+
+uint8_t __node_distance(nodeid_t from, nodeid_t to)
+{
+    /* When NUMA is off, any distance will be treated as remote. */
+    if ( srat_disabled() )
+        return NUMA_REMOTE_DISTANCE;
+
+    /*
+     * Check whether the nodes are in the matrix range.
+     * When any node is out of range, except from and to nodes are the
+     * same, we treat them as unreachable (return 0xFF)
+     */
+    if ( from >= MAX_NUMNODES || to >= MAX_NUMNODES )
+        return from == to ? NUMA_LOCAL_DISTANCE : NUMA_NO_DISTANCE;
+
+    return node_distance_map[from][to];
+}
+EXPORT_SYMBOL(__node_distance);
diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
index 21569e634b..758eafeb05 100644
--- a/xen/include/asm-arm/numa.h
+++ b/xen/include/asm-arm/numa.h
@@ -9,8 +9,21 @@ typedef u8 nodeid_t;
 
 #ifdef CONFIG_NUMA
 
+/*
+ * In ACPI spec, 0-9 are the reserved values for node distance,
+ * 10 indicates local node distance, 20 indicates remote node
+ * distance. Set node distance map in device tree will follow
+ * the ACPI's definition.
+ */
+#define NUMA_DISTANCE_UDF_MIN   0
+#define NUMA_DISTANCE_UDF_MAX   9
+#define NUMA_LOCAL_DISTANCE     10
+#define NUMA_REMOTE_DISTANCE    20
+
 #define NR_NODE_MEMBLKS NR_MEM_BANKS
 
+extern void numa_set_distance(nodeid_t from, nodeid_t to, uint32_t distance);
+
 #else
 
 /* Fake one node for now. See also node_online_map. */
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 24/37] xen/arm: implement two arch helpers to get memory map info
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (22 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 23/37] xen/arm: implement node distance helpers for Arm Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  2:06   ` Stefano Stabellini
  2021-09-23 12:02 ` [PATCH 25/37] xen/arm: implement bad_srat for Arm NUMA initialization Wei Chen
                   ` (12 subsequent siblings)
  36 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

These two helpers are architecture APIs that are required by
nodes_cover_memory.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/arm/numa.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
index 3f08870d69..3755b01ef4 100644
--- a/xen/arch/arm/numa.c
+++ b/xen/arch/arm/numa.c
@@ -67,3 +67,17 @@ uint8_t __node_distance(nodeid_t from, nodeid_t to)
     return node_distance_map[from][to];
 }
 EXPORT_SYMBOL(__node_distance);
+
+uint32_t __init arch_meminfo_get_nr_bank(void)
+{
+    return bootinfo.mem.nr_banks;
+}
+
+int __init arch_meminfo_get_ram_bank_range(uint32_t bank,
+                                           paddr_t *start, paddr_t *end)
+{
+    *start = bootinfo.mem.bank[bank].start;
+    *end = bootinfo.mem.bank[bank].start + bootinfo.mem.bank[bank].size;
+
+    return 0;
+}
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 25/37] xen/arm: implement bad_srat for Arm NUMA initialization
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (23 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 24/37] xen/arm: implement two arch helpers to get memory map info Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  2:09   ` Stefano Stabellini
  2021-09-23 12:02 ` [PATCH 26/37] xen/arm: build NUMA cpu_to_node map in dt_smp_init_cpus Wei Chen
                   ` (11 subsequent siblings)
  36 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

NUMA initialization parses information from the firmware-provided
static resource affinity table (ACPI SRAT or DTB). bad_srat is a
function that will be used when the initialization code encounters
unexpected errors.

In this patch, we introduce the Arm version of bad_srat for the
NUMA common initialization code to invoke.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/arm/numa.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
index 3755b01ef4..5209d3de4d 100644
--- a/xen/arch/arm/numa.c
+++ b/xen/arch/arm/numa.c
@@ -18,6 +18,7 @@
  *
  */
 #include <xen/init.h>
+#include <xen/nodemask.h>
 #include <xen/numa.h>
 
 static uint8_t __read_mostly
@@ -25,6 +26,12 @@ node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
     { 0 }
 };
 
+__init void bad_srat(void)
+{
+    printk(KERN_ERR "NUMA: Firmware SRAT table not used.\n");
+    fw_numa = -1;
+}
+
 void __init numa_set_distance(nodeid_t from, nodeid_t to, uint32_t distance)
 {
     if ( from >= MAX_NUMNODES || to >= MAX_NUMNODES )
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 26/37] xen/arm: build NUMA cpu_to_node map in dt_smp_init_cpus
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (24 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 25/37] xen/arm: implement bad_srat for Arm NUMA initialization Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  2:26   ` Stefano Stabellini
  2021-09-23 12:02 ` [PATCH 27/37] xen/arm: Add boot and secondary CPU to NUMA system Wei Chen
                   ` (10 subsequent siblings)
  36 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

The NUMA implementation has a cpu_to_node array to store the
CPU-to-node map. Xen uses the CPU logical ID in runtime components,
so we use the CPU logical ID as the CPU index in cpu_to_node.

In the device tree case, cpu_logical_map is created in
dt_smp_init_cpus. So, when NUMA is enabled, dt_smp_init_cpus fetches
the CPU's NUMA id at the same time for cpu_to_node.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/arm/smpboot.c     | 37 ++++++++++++++++++++++++++++++++++++-
 xen/include/asm-arm/numa.h |  5 +++++
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
index 60c0e82fc5..6e3cc8d3cc 100644
--- a/xen/arch/arm/smpboot.c
+++ b/xen/arch/arm/smpboot.c
@@ -121,7 +121,12 @@ static void __init dt_smp_init_cpus(void)
     {
         [0 ... NR_CPUS - 1] = MPIDR_INVALID
     };
+    static nodeid_t node_map[NR_CPUS] __initdata =
+    {
+        [0 ... NR_CPUS - 1] = NUMA_NO_NODE
+    };
     bool bootcpu_valid = false;
+    uint32_t nid = 0;
     int rc;
 
     mpidr = system_cpuinfo.mpidr.bits & MPIDR_HWID_MASK;
@@ -172,6 +177,28 @@ static void __init dt_smp_init_cpus(void)
             continue;
         }
 
+        if ( IS_ENABLED(CONFIG_NUMA) )
+        {
+            /*
+             * When CONFIG_NUMA is set, try to fetch NUMA information
+             * from CPU dts node, otherwise the nid is always 0.
+             */
+            if ( !dt_property_read_u32(cpu, "numa-node-id", &nid) )
+            {
+                printk(XENLOG_WARNING
+                       "cpu[%d] dts path: %s: doesn't have numa information!\n",
+                       cpuidx, dt_node_full_name(cpu));
+                /*
+                 * During the early stage of NUMA initialization, if Xen
+                 * finds any CPU dts node without numa-node-id info, NUMA
+                 * will be treated as off and all CPUs will be placed on a
+                 * fake node 0. So if reading numa-node-id failed here, we
+                 * should set nid to 0.
+                 */
+                nid = 0;
+            }
+        }
+
         /*
          * 8 MSBs must be set to 0 in the DT since the reg property
          * defines the MPIDR[23:0]
@@ -231,9 +258,12 @@ static void __init dt_smp_init_cpus(void)
         {
             printk("cpu%d init failed (hwid %"PRIregister"): %d\n", i, hwid, rc);
             tmp_map[i] = MPIDR_INVALID;
+            node_map[i] = NUMA_NO_NODE;
         }
-        else
+        else {
             tmp_map[i] = hwid;
+            node_map[i] = nid;
+        }
     }
 
     if ( !bootcpu_valid )
@@ -249,6 +279,11 @@ static void __init dt_smp_init_cpus(void)
             continue;
         cpumask_set_cpu(i, &cpu_possible_map);
         cpu_logical_map(i) = tmp_map[i];
+
+        nid = node_map[i];
+        if ( nid >= MAX_NUMNODES )
+            nid = 0;
+        numa_set_node(i, nid);
     }
 }
 
diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
index 758eafeb05..8a4ad379e0 100644
--- a/xen/include/asm-arm/numa.h
+++ b/xen/include/asm-arm/numa.h
@@ -46,6 +46,11 @@ extern mfn_t first_valid_mfn;
 #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
 #define __node_distance(a, b) (20)
 
+static inline void numa_set_node(int cpu, nodeid_t node)
+{
+
+}
+
 #endif
 
 static inline unsigned int arch_have_default_dmazone(void)
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 27/37] xen/arm: Add boot and secondary CPU to NUMA system
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (25 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 26/37] xen/arm: build NUMA cpu_to_node map in dt_smp_init_cpus Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-23 12:02 ` [PATCH 28/37] xen/arm: stub memory hotplug access helpers for Arm Wei Chen
                   ` (9 subsequent siblings)
  36 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

In this patch, we bring NUMA nodes online and add CPUs to their
NUMA nodes. This gives NUMA-aware components the NUMA affinity
data they need to do their work.

To keep mostly the same behavior as x86, we still use
srat_detect_node to bring nodes online. The difference is that we
have already prepared cpu_to_node in dt_smp_init_cpus, so we don't
need to set up cpu_to_node in srat_detect_node.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/arm/numa.c        | 10 ++++++++++
 xen/arch/arm/setup.c       |  5 +++++
 xen/include/asm-arm/numa.h | 10 ++++++++++
 3 files changed, 25 insertions(+)

diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
index 5209d3de4d..7f05299b76 100644
--- a/xen/arch/arm/numa.c
+++ b/xen/arch/arm/numa.c
@@ -32,6 +32,16 @@ __init void bad_srat(void)
     fw_numa = -1;
 }
 
+void srat_detect_node(int cpu)
+{
+    nodeid_t node = cpu_to_node[cpu];
+
+    if ( node == NUMA_NO_NODE )
+        node = 0;
+
+    node_set_online(node);
+}
+
 void __init numa_set_distance(nodeid_t from, nodeid_t to, uint32_t distance)
 {
     if ( from >= MAX_NUMNODES || to >= MAX_NUMNODES )
diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index 49dc90d198..1f0fbc95b5 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -988,6 +988,11 @@ void __init start_xen(unsigned long boot_phys_offset,
 
     for_each_present_cpu ( i )
     {
+        /* Detect and online node based on cpu_to_node[] */
+        srat_detect_node(i);
+        /* Set up node_to_cpumask based on cpu_to_node[]. */
+        numa_add_cpu(i);
+
         if ( (num_online_cpus() < cpus) && !cpu_online(i) )
         {
             int ret = cpu_up(i);
diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
index 8a4ad379e0..7675012cb7 100644
--- a/xen/include/asm-arm/numa.h
+++ b/xen/include/asm-arm/numa.h
@@ -46,11 +46,21 @@ extern mfn_t first_valid_mfn;
 #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
 #define __node_distance(a, b) (20)
 
+static inline void numa_add_cpu(int cpu)
+{
+
+}
+
 static inline void numa_set_node(int cpu, nodeid_t node)
 {
 
 }
 
+static inline void srat_detect_node(int cpu)
+{
+
+}
+
 #endif
 
 static inline unsigned int arch_have_default_dmazone(void)
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 28/37] xen/arm: stub memory hotplug access helpers for Arm
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (26 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 27/37] xen/arm: Add boot and secondary CPU to NUMA system Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  2:33   ` Stefano Stabellini
  2021-09-23 12:02 ` [PATCH 29/37] xen/arm: introduce a helper to parse device tree processor node Wei Chen
                   ` (8 subsequent siblings)
  36 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

Common code in NUMA needs these two helpers to access/update the
memory hotplug end address. Arm does not support memory hotplug
yet, so we stub these two helpers in this patch to keep the NUMA
common code happy.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/include/asm-arm/mm.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
index 7b5e7b7f69..fc9433165d 100644
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -362,6 +362,16 @@ void clear_and_clean_page(struct page_info *page);
 
 unsigned int arch_get_dma_bitsize(void);
 
+static inline void mem_hotplug_update_boundary(paddr_t end)
+{
+
+}
+
+static inline paddr_t mem_hotplug_boundary(void)
+{
+    return 0;
+}
+
 #endif /*  __ARCH_ARM_MM__ */
 /*
  * Local variables:
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 29/37] xen/arm: introduce a helper to parse device tree processor node
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (27 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 28/37] xen/arm: stub memory hotplug access helpers for Arm Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  2:44   ` Stefano Stabellini
  2021-09-23 12:02 ` [PATCH 30/37] xen/arm: introduce a helper to parse device tree memory node Wei Chen
                   ` (7 subsequent siblings)
  36 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

Processor NUMA ID information is stored in the device tree's
processor nodes as "numa-node-id". We need a new helper to parse
this ID from a processor node. Once we get the ID from a processor
node, its validity still needs to be checked. If we get an invalid
NUMA ID from any processor node, the device tree will be marked as
having invalid NUMA information.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/arm/Makefile           |  1 +
 xen/arch/arm/numa_device_tree.c | 58 +++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)
 create mode 100644 xen/arch/arm/numa_device_tree.c

diff --git a/xen/arch/arm/Makefile b/xen/arch/arm/Makefile
index 41ca311b6b..c50df2c25d 100644
--- a/xen/arch/arm/Makefile
+++ b/xen/arch/arm/Makefile
@@ -36,6 +36,7 @@ obj-y += mem_access.o
 obj-y += mm.o
 obj-y += monitor.o
 obj-$(CONFIG_NUMA) += numa.o
+obj-$(CONFIG_DEVICE_TREE_NUMA) += numa_device_tree.o
 obj-y += p2m.o
 obj-y += percpu.o
 obj-y += platform.o
diff --git a/xen/arch/arm/numa_device_tree.c b/xen/arch/arm/numa_device_tree.c
new file mode 100644
index 0000000000..2428fbae0b
--- /dev/null
+++ b/xen/arch/arm/numa_device_tree.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Arm Architecture support layer for NUMA.
+ *
+ * Copyright (C) 2021 Arm Ltd
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ *
+ */
+#include <xen/init.h>
+#include <xen/nodemask.h>
+#include <xen/numa.h>
+#include <xen/libfdt/libfdt.h>
+#include <xen/device_tree.h>
+
+/* Callback for device tree processor affinity */
+static int __init fdt_numa_processor_affinity_init(nodeid_t node)
+{
+    if ( srat_disabled() )
+        return -EINVAL;
+    else if ( node == NUMA_NO_NODE || node >= MAX_NUMNODES )
+    {
+        bad_srat();
+        return -EINVAL;
+    }
+
+    numa_set_processor_nodes_parsed(node);
+    fw_numa = 1;
+
+    printk(KERN_INFO "DT: NUMA node %"PRIu8" processor parsed\n", node);
+
+    return 0;
+}
+
+/* Parse CPU NUMA node info */
+static int __init fdt_parse_numa_cpu_node(const void *fdt, int node)
+{
+    uint32_t nid;
+
+    nid = device_tree_get_u32(fdt, node, "numa-node-id", MAX_NUMNODES);
+    if ( nid >= MAX_NUMNODES )
+    {
+        printk(XENLOG_ERR "Node id %u exceeds maximum value\n", nid);
+        return -EINVAL;
+    }
+
+    return fdt_numa_processor_affinity_init(nid);
+}
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 30/37] xen/arm: introduce a helper to parse device tree memory node
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (28 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 29/37] xen/arm: introduce a helper to parse device tree processor node Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  3:05   ` Stefano Stabellini
  2021-09-23 12:02 ` [PATCH 31/37] xen/arm: introduce a helper to parse device tree NUMA distance map Wei Chen
                   ` (6 subsequent siblings)
  36 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

Memory blocks' NUMA ID information is stored in device tree's
memory nodes as "numa-node-id". We need a new helper to parse
and verify this ID from memory nodes.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/arm/numa_device_tree.c | 80 +++++++++++++++++++++++++++++++++
 1 file changed, 80 insertions(+)

diff --git a/xen/arch/arm/numa_device_tree.c b/xen/arch/arm/numa_device_tree.c
index 2428fbae0b..7918a397fa 100644
--- a/xen/arch/arm/numa_device_tree.c
+++ b/xen/arch/arm/numa_device_tree.c
@@ -42,6 +42,35 @@ static int __init fdt_numa_processor_affinity_init(nodeid_t node)
     return 0;
 }
 
+/* Callback for parsing of the memory regions affinity */
+static int __init fdt_numa_memory_affinity_init(nodeid_t node,
+                                paddr_t start, paddr_t size)
+{
+    int ret;
+
+    if ( srat_disabled() )
+    {
+        return -EINVAL;
+    }
+
+    if ( !numa_memblks_available() )
+    {
+        dprintk(XENLOG_WARNING,
+                "Too many NUMA entries, try a bigger NR_NODE_MEMBLKS\n");
+        bad_srat();
+        return -EINVAL;
+    }
+
+    ret = numa_update_node_memblks(node, start, size, false);
+    if ( ret != 0 )
+    {
+        bad_srat();
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
 /* Parse CPU NUMA node info */
 static int __init fdt_parse_numa_cpu_node(const void *fdt, int node)
 {
@@ -56,3 +85,54 @@ static int __init fdt_parse_numa_cpu_node(const void *fdt, int node)
 
     return fdt_numa_processor_affinity_init(nid);
 }
+
+/* Parse memory node NUMA info */
+static int __init fdt_parse_numa_memory_node(const void *fdt, int node,
+    const char *name, uint32_t addr_cells, uint32_t size_cells)
+{
+    uint32_t nid;
+    int ret = 0, len;
+    paddr_t addr, size;
+    const struct fdt_property *prop;
+    uint32_t idx, ranges;
+    const __be32 *addresses;
+
+    nid = device_tree_get_u32(fdt, node, "numa-node-id", MAX_NUMNODES);
+    if ( nid >= MAX_NUMNODES )
+    {
+        printk(XENLOG_WARNING "Node id %u exceeds maximum value\n", nid);
+        return -EINVAL;
+    }
+
+    prop = fdt_get_property(fdt, node, "reg", &len);
+    if ( !prop )
+    {
+        printk(XENLOG_WARNING
+               "fdt: node `%s': missing `reg' property\n", name);
+        return -EINVAL;
+    }
+
+    addresses = (const __be32 *)prop->data;
+    ranges = len / (sizeof(__be32)* (addr_cells + size_cells));
+    for ( idx = 0; idx < ranges; idx++ )
+    {
+        device_tree_get_reg(&addresses, addr_cells, size_cells, &addr, &size);
+        /* Skip zero size ranges */
+        if ( !size )
+            continue;
+
+        ret = fdt_numa_memory_affinity_init(nid, addr, size);
+        if ( ret ) {
+            return -EINVAL;
+        }
+    }
+
+    if ( idx == 0 )
+    {
+        printk(XENLOG_ERR
+               "bad property in memory node, idx=%u ret=%d\n", idx, ret);
+        return -EINVAL;
+    }
+
+    return 0;
+}
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 31/37] xen/arm: introduce a helper to parse device tree NUMA distance map
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (29 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 30/37] xen/arm: introduce a helper to parse device tree memory node Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  3:05   ` Stefano Stabellini
  2021-09-23 12:02 ` [PATCH 32/37] xen/arm: unified entry to parse all NUMA data from device tree Wei Chen
                   ` (5 subsequent siblings)
  36 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

A NUMA-aware device tree will provide a "distance-map" node to
describe the distance between any two nodes. This patch introduces
a new helper to parse this distance map.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/arm/numa_device_tree.c | 106 ++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)

diff --git a/xen/arch/arm/numa_device_tree.c b/xen/arch/arm/numa_device_tree.c
index 7918a397fa..e7fa84df4c 100644
--- a/xen/arch/arm/numa_device_tree.c
+++ b/xen/arch/arm/numa_device_tree.c
@@ -136,3 +136,109 @@ static int __init fdt_parse_numa_memory_node(const void *fdt, int node,
 
     return 0;
 }
+
+
+/* Parse NUMA distance map v1 */
+static int __init fdt_parse_numa_distance_map_v1(const void *fdt, int node)
+{
+    const struct fdt_property *prop;
+    const __be32 *matrix;
+    uint32_t entry_count;
+    int len, i;
+
+    printk(XENLOG_INFO "NUMA: parsing numa-distance-map\n");
+
+    prop = fdt_get_property(fdt, node, "distance-matrix", &len);
+    if ( !prop )
+    {
+        printk(XENLOG_WARNING
+               "NUMA: No distance-matrix property in distance-map\n");
+
+        return -EINVAL;
+    }
+
+    if ( len % sizeof(uint32_t) != 0 )
+    {
+        printk(XENLOG_WARNING
+               "distance-matrix in node is not a multiple of u32\n");
+        return -EINVAL;
+    }
+
+    entry_count = len / sizeof(uint32_t);
+    if ( entry_count == 0 )
+    {
+        printk(XENLOG_WARNING "NUMA: Invalid distance-matrix\n");
+
+        return -EINVAL;
+    }
+
+    matrix = (const __be32 *)prop->data;
+    for ( i = 0; i + 2 < entry_count; i += 3 )
+    {
+        uint32_t from, to, distance, opposite;
+
+        from = dt_next_cell(1, &matrix);
+        to = dt_next_cell(1, &matrix);
+        distance = dt_next_cell(1, &matrix);
+        if ( (from == to && distance != NUMA_LOCAL_DISTANCE) ||
+             (from != to && distance <= NUMA_LOCAL_DISTANCE) )
+        {
+            printk(XENLOG_WARNING
+                   "NUMA: Invalid distance: NODE#%u->NODE#%u:%u\n",
+                   from, to, distance);
+            return -EINVAL;
+        }
+
+        printk(XENLOG_INFO "NUMA: distance: NODE#%u->NODE#%u:%u\n",
+               from, to, distance);
+
+        /* Get opposite way distance */
+        opposite = __node_distance(from, to);
+        if ( opposite == 0 )
+        {
+            /* Bi-directions are not set, set both */
+            numa_set_distance(from, to, distance);
+            numa_set_distance(to, from, distance);
+        }
+        else
+        {
+            /*
+             * Opposite way distance has been set to a different value.
+             * It may be a firmware device tree bug?
+             */
+            if ( opposite != distance )
+            {
+                /*
+                 * In device tree NUMA distance-matrix binding:
+                 * https://www.kernel.org/doc/Documentation/devicetree/bindings/numa.txt
+                 * There is a notes mentions:
+                 * "Each entry represents distance from first node to
+                 *  second node. The distances are equal in either
+                 *  direction."
+                 *
+                 * That means device tree doesn't permit this case.
+                 * But in ACPI spec, it cares to specifically permit this
+                 * case:
+                 * "Except for the relative distance from a System Locality
+                 *  to itself, each relative distance is stored twice in the
+                 *  matrix. This provides the capability to describe the
+                 *  scenario where the relative distances for the two
+                 *  directions between System Localities is different."
+                 *
+                 * That means a real machine allows such a NUMA
+                 * configuration. So, place a WARNING here to let system
+                 * administrators check whether this is the special case
+                 * where they adjusted the device tree for such machines.
+                 */
+                printk(XENLOG_WARNING
+                       "Un-matched bi-direction! NODE#%u->NODE#%u:%u, NODE#%u->NODE#%u:%u\n",
+                       from, to, distance, to, from, opposite);
+            }
+
+            /* Opposite way distance has been set, just set this way */
+            numa_set_distance(from, to, distance);
+        }
+    }
+
+    return 0;
+}
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 32/37] xen/arm: unified entry to parse all NUMA data from device tree
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (30 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 31/37] xen/arm: introduce a helper to parse device tree NUMA distance map Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  3:16   ` Stefano Stabellini
  2021-09-23 12:02 ` [PATCH 33/37] xen/arm: keep guest still be NUMA unware Wei Chen
                   ` (4 subsequent siblings)
  36 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

In this API, we scan the whole device tree to parse CPU node ids,
memory node ids and the distance-map. Although early_scan_node has
a handler to process memory nodes, parsing memory node ids in that
handler would mean embedding NUMA parsing code there, and we would
still need to scan the whole device tree for CPU NUMA ids and the
distance-map. So we include memory NUMA id parsing in this API too.
Another benefit is that we get a single entry point for parsing
device tree NUMA data.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/arm/numa_device_tree.c | 30 ++++++++++++++++++++++++++++++
 xen/include/asm-arm/numa.h      |  1 +
 2 files changed, 31 insertions(+)

diff --git a/xen/arch/arm/numa_device_tree.c b/xen/arch/arm/numa_device_tree.c
index e7fa84df4c..6a3fed0002 100644
--- a/xen/arch/arm/numa_device_tree.c
+++ b/xen/arch/arm/numa_device_tree.c
@@ -242,3 +242,33 @@ static int __init fdt_parse_numa_distance_map_v1(const void *fdt, int node)
 
     return 0;
 }
+
+static int __init fdt_scan_numa_nodes(const void *fdt,
+                int node, const char *uname, int depth,
+                u32 address_cells, u32 size_cells, void *data)
+{
+    int len, ret = 0;
+    const void *prop;
+
+    prop = fdt_getprop(fdt, node, "device_type", &len);
+    if ( prop )
+    {
+        /* The returned length includes the NUL terminator */
+        if ( len == sizeof("cpu") && memcmp(prop, "cpu", len) == 0 )
+            ret = fdt_parse_numa_cpu_node(fdt, node);
+        else if ( len == sizeof("memory") && memcmp(prop, "memory", len) == 0 )
+            ret = fdt_parse_numa_memory_node(fdt, node, uname,
+                                             address_cells, size_cells);
+    }
+    else if ( fdt_node_check_compatible(fdt, node,
+                                "numa-distance-map-v1") == 0 )
+        ret = fdt_parse_numa_distance_map_v1(fdt, node);
+
+    return ret;
+}
+
+/* Initialize NUMA from device tree */
+int __init numa_device_tree_init(const void *fdt)
+{
+    return device_tree_for_each_node(fdt, 0, fdt_scan_numa_nodes, NULL);
+}
diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
index 7675012cb7..f46e8e2935 100644
--- a/xen/include/asm-arm/numa.h
+++ b/xen/include/asm-arm/numa.h
@@ -23,6 +23,7 @@ typedef u8 nodeid_t;
 #define NR_NODE_MEMBLKS NR_MEM_BANKS
 
 extern void numa_set_distance(nodeid_t from, nodeid_t to, uint32_t distance);
+extern int numa_device_tree_init(const void *fdt);
 
 #else
 
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 33/37] xen/arm: keep guests NUMA-unaware
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (31 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 32/37] xen/arm: unified entry to parse all NUMA data from device tree Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  3:19   ` Stefano Stabellini
  2021-09-23 12:02 ` [PATCH 34/37] xen/arm: enable device tree based NUMA in system init Wei Chen
                   ` (3 subsequent siblings)
  36 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

The NUMA information provided in the host Device-Tree is only
for Xen. For dom0, we want to hide it, as the domain's layout may
be different (and for now, dom0 is not NUMA-aware anyway). The CPU
and memory nodes are recreated from scratch for the domain, so we
already skip the "numa-node-id" property for these two types of
nodes.

However, some other devices, such as PCIe devices, may have a
"numa-node-id" property too. We have to skip it as well.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/arm/domain_build.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
index d233d634c1..6e94922238 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -737,6 +737,10 @@ static int __init write_properties(struct domain *d, struct kernel_info *kinfo,
                 continue;
         }
 
+        /* Guests are NUMA-unaware at this stage */
+        if ( dt_property_name_is_equal(prop, "numa-node-id") )
+            continue;
+
         res = fdt_property(kinfo->fdt, prop->name, prop_data, prop_len);
 
         if ( res )
@@ -1607,6 +1611,8 @@ static int __init handle_node(struct domain *d, struct kernel_info *kinfo,
         DT_MATCH_TYPE("memory"),
         /* The memory mapped timer is not supported by Xen. */
         DT_MATCH_COMPATIBLE("arm,armv7-timer-mem"),
+        /* NUMA info doesn't need to be exposed to Domain-0 */
+        DT_MATCH_COMPATIBLE("numa-distance-map-v1"),
         { /* sentinel */ },
     };
     static const struct dt_device_match timer_matches[] __initconst =
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 34/37] xen/arm: enable device tree based NUMA in system init
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (32 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 33/37] xen/arm: keep guests NUMA-unaware Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  3:28   ` Stefano Stabellini
  2021-09-23 12:02 ` [PATCH 35/37] xen/arm: use CONFIG_NUMA to gate node_online_map in smpboot Wei Chen
                   ` (2 subsequent siblings)
  36 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

With this patch, Xen can start to create a NUMA system based on
the device tree.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/arm/numa.c        | 55 ++++++++++++++++++++++++++++++++++++++
 xen/arch/arm/setup.c       |  7 +++++
 xen/include/asm-arm/numa.h |  6 +++++
 3 files changed, 68 insertions(+)

diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
index 7f05299b76..d7a3d32d4b 100644
--- a/xen/arch/arm/numa.c
+++ b/xen/arch/arm/numa.c
@@ -18,8 +18,10 @@
  *
  */
 #include <xen/init.h>
+#include <xen/device_tree.h>
 #include <xen/nodemask.h>
 #include <xen/numa.h>
+#include <xen/pfn.h>
 
 static uint8_t __read_mostly
 node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
@@ -85,6 +87,59 @@ uint8_t __node_distance(nodeid_t from, nodeid_t to)
 }
 EXPORT_SYMBOL(__node_distance);
 
+void __init numa_init(bool acpi_off)
+{
+    uint32_t idx;
+    paddr_t ram_start = ~0;
+    paddr_t ram_size = 0;
+    paddr_t ram_end = 0;
+
+    /* NUMA has been turned off through Xen parameters */
+    if ( numa_off )
+        goto mem_init;
+
+    /* Initialize NUMA from device tree when system is not ACPI booted */
+    if ( acpi_off )
+    {
+        int ret = numa_device_tree_init(device_tree_flattened);
+        if ( ret )
+        {
+            printk(XENLOG_WARNING
+                   "Init NUMA from device tree failed, ret=%d\n", ret);
+            numa_off = true;
+        }
+    }
+    else
+    {
+        /* We don't support NUMA for ACPI boot currently */
+        printk(XENLOG_WARNING
+               "ACPI NUMA is not supported yet, NUMA off!\n");
+        numa_off = true;
+    }
+
+mem_init:
+    /*
+     * Find the minimal and maximum address of RAM, NUMA will
+     * build a memory to node mapping table for the whole range.
+     */
+    ram_start = bootinfo.mem.bank[0].start;
+    ram_size  = bootinfo.mem.bank[0].size;
+    ram_end   = ram_start + ram_size;
+    for ( idx = 1 ; idx < bootinfo.mem.nr_banks; idx++ )
+    {
+        paddr_t bank_start = bootinfo.mem.bank[idx].start;
+        paddr_t bank_size = bootinfo.mem.bank[idx].size;
+        paddr_t bank_end = bank_start + bank_size;
+
+        ram_size  = ram_size + bank_size;
+        ram_start = min(ram_start, bank_start);
+        ram_end   = max(ram_end, bank_end);
+    }
+
+    numa_initmem_init(PFN_UP(ram_start), PFN_DOWN(ram_end));
+    return;
+}
+
 uint32_t __init arch_meminfo_get_nr_bank(void)
 {
 	return bootinfo.mem.nr_banks;
diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index 1f0fbc95b5..6097850682 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -905,6 +905,13 @@ void __init start_xen(unsigned long boot_phys_offset,
     /* Parse the ACPI tables for possible boot-time configuration */
     acpi_boot_table_init();
 
+    /*
+     * Try to initialize the NUMA system. If this fails, the system
+     * will fall back to a uniform memory system, i.e. a system with
+     * only one NUMA node.
+     */
+    numa_init(acpi_disabled);
+
     end_boot_allocator();
 
     /*
diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
index f46e8e2935..5b03dde87f 100644
--- a/xen/include/asm-arm/numa.h
+++ b/xen/include/asm-arm/numa.h
@@ -24,6 +24,7 @@ typedef u8 nodeid_t;
 
 extern void numa_set_distance(nodeid_t from, nodeid_t to, uint32_t distance);
 extern int numa_device_tree_init(const void *fdt);
+extern void numa_init(bool acpi_off);
 
 #else
 
@@ -47,6 +48,11 @@ extern mfn_t first_valid_mfn;
 #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
 #define __node_distance(a, b) (20)
 
+static inline void numa_init(bool acpi_off)
+{
+
+}
+
 static inline void numa_add_cpu(int cpu)
 {
 
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 35/37] xen/arm: use CONFIG_NUMA to gate node_online_map in smpboot
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (33 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 34/37] xen/arm: enable device tree based NUMA in system init Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-23 12:02 ` [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA Wei Chen
  2021-09-23 12:02 ` [PATCH 37/37] docs: update numa command line to support Arm Wei Chen
  36 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

node_online_map in smpboot is still needed for Arm when NUMA is
turned off in Kconfig.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/arm/smpboot.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
index 6e3cc8d3cc..216c8144b4 100644
--- a/xen/arch/arm/smpboot.c
+++ b/xen/arch/arm/smpboot.c
@@ -46,8 +46,10 @@ struct cpuinfo_arm cpu_data[NR_CPUS];
 /* CPU logical map: map xen cpuid to an MPIDR */
 register_t __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = MPIDR_INVALID };
 
+#ifndef CONFIG_NUMA
 /* Fake one node for now. See also include/asm-arm/numa.h */
 nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
+#endif
 
 /* Xen stack for bringing up the first CPU. */
 static unsigned char __initdata cpu0_boot_stack[STACK_SIZE]
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (34 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 35/37] xen/arm: use CONFIG_NUMA to gate node_online_map in smpboot Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  2021-09-24  3:31   ` Stefano Stabellini
  2021-09-24 10:25   ` Jan Beulich
  2021-09-23 12:02 ` [PATCH 37/37] docs: update numa command line to support Arm Wei Chen
  36 siblings, 2 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

Arm platforms support both ACPI and device tree booting. We don't
want users to have to select device tree NUMA or ACPI NUMA
manually. Instead, users should just enable NUMA for Arm, and
device tree NUMA or ACPI NUMA will be selected automatically,
depending on the device tree and ACPI feature status. This way,
the two kinds of NUMA support code can co-exist in one Xen binary,
and Xen can check the feature flags to decide whether to use
device tree or ACPI as the source of NUMA information.

So in this patch, we introduce a generic option, CONFIG_ARM_NUMA,
for users to enable NUMA for Arm, and a CONFIG_DEVICE_TREE_NUMA
option which ARM_NUMA selects when HAS_DEVICE_TREE is enabled.
Once ACPI NUMA for Arm is supported, ACPI_NUMA can be selected
here too.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 xen/arch/arm/Kconfig | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig
index 865ad83a89..ded94ebd37 100644
--- a/xen/arch/arm/Kconfig
+++ b/xen/arch/arm/Kconfig
@@ -34,6 +34,17 @@ config ACPI
 	  Advanced Configuration and Power Interface (ACPI) support for Xen is
 	  an alternative to device tree on ARM64.
 
+config DEVICE_TREE_NUMA
+	def_bool n
+	select NUMA
+
+config ARM_NUMA
+	bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)" if UNSUPPORTED
+	select DEVICE_TREE_NUMA if HAS_DEVICE_TREE
+	---help---
+
+	  Enable Non-Uniform Memory Access (NUMA) for Arm architectures
+
 config GICV3
 	bool "GICv3 driver"
 	depends on ARM_64 && !NEW_VGIC
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* [PATCH 37/37] docs: update numa command line to support Arm
  2021-09-23 12:01 [PATCH 00/37] Add device tree based NUMA support to Arm Wei Chen
                   ` (35 preceding siblings ...)
  2021-09-23 12:02 ` [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA Wei Chen
@ 2021-09-23 12:02 ` Wei Chen
  36 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-23 12:02 UTC (permalink / raw)
  To: wei.chen, xen-devel, sstabellini, julien; +Cc: Bertrand.Marquis

The numa command line option is currently documented as x86 only.
Remove the x86 arch limitation from the numa option in this patch.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
 docs/misc/xen-command-line.pandoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 177e656f12..4f3f24eb9d 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1785,7 +1785,7 @@ i.e. a limit on the number of guests it is possible to start each having
 assigned a device sharing a common interrupt line.  Accepts values between
 1 and 255.
 
-### numa (x86)
+### numa
 > `= on | off | fake=<integer> | noacpi`
 
 > Default: `on`
-- 
2.25.1



^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 02/37] xen: introduce a Kconfig option to configure NUMA nodes number
  2021-09-23 12:02 ` [PATCH 02/37] xen: introduce a Kconfig option to configure NUMA nodes number Wei Chen
@ 2021-09-23 23:45   ` Stefano Stabellini
  2021-09-24  1:24     ` Wei Chen
  2021-09-24  8:55   ` Jan Beulich
  1 sibling, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-23 23:45 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> Current NUMA nodes number is a hardcode configuration. This
> configuration is difficult for an administrator to change
> unless changing the code.
> 
> So in this patch, we introduce this new Kconfig option for
> administrators to change NUMA nodes number conveniently.
> Also considering that not all architectures support NUMA,
> this Kconfig option only can be visible on NUMA enabled
> architectures. Non-NUMA supported architectures can still
> use 1 as MAX_NUMNODES.

This is OK but I think you should also mention in the commit message
that you are taking the opportunity to remove NODES_SHIFT because it is
currently unused.

With that:

Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>


> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/Kconfig           | 11 +++++++++++
>  xen/include/asm-x86/numa.h |  2 --
>  xen/include/xen/numa.h     | 10 +++++-----
>  3 files changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/xen/arch/Kconfig b/xen/arch/Kconfig
> index f16eb0df43..8a20da67ed 100644
> --- a/xen/arch/Kconfig
> +++ b/xen/arch/Kconfig
> @@ -17,3 +17,14 @@ config NR_CPUS
>  	  For CPU cores which support Simultaneous Multi-Threading or similar
>  	  technologies, this the number of logical threads which Xen will
>  	  support.
> +
> +config NR_NUMA_NODES
> +	int "Maximum number of NUMA nodes supported"
> +	range 1 4095
> +	default "64"
> +	depends on NUMA
> +	help
> +	  Controls the build-time size of various arrays and bitmaps
> +	  associated with multiple-nodes management. It is the upper bound of
> +	  the number of NUMA nodes the scheduler, memory allocation and other
> +	  NUMA-aware components can handle.
> diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
> index bada2c0bb9..3cf26c2def 100644
> --- a/xen/include/asm-x86/numa.h
> +++ b/xen/include/asm-x86/numa.h
> @@ -3,8 +3,6 @@
>  
>  #include <xen/cpumask.h>
>  
> -#define NODES_SHIFT 6
> -
>  typedef u8 nodeid_t;
>  
>  extern int srat_rev;
> diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
> index 7aef1a88dc..52950a3150 100644
> --- a/xen/include/xen/numa.h
> +++ b/xen/include/xen/numa.h
> @@ -3,14 +3,14 @@
>  
>  #include <asm/numa.h>
>  
> -#ifndef NODES_SHIFT
> -#define NODES_SHIFT     0
> -#endif
> -
>  #define NUMA_NO_NODE     0xFF
>  #define NUMA_NO_DISTANCE 0xFF
>  
> -#define MAX_NUMNODES    (1 << NODES_SHIFT)
> +#ifdef CONFIG_NR_NUMA_NODES
> +#define MAX_NUMNODES CONFIG_NR_NUMA_NODES
> +#else
> +#define MAX_NUMNODES    1
> +#endif
>  
>  #define vcpu_to_node(v) (cpu_to_node((v)->processor))
>  
> -- 
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 04/37] xen: introduce an arch helper for default dma zone status
  2021-09-23 12:02 ` [PATCH 04/37] xen: introduce an arch helper for default dma zone status Wei Chen
@ 2021-09-23 23:55   ` Stefano Stabellini
  2021-09-24  1:50     ` Wei Chen
  2022-01-17 16:10   ` Jan Beulich
  1 sibling, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-23 23:55 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> In current code, when Xen is running in a multiple nodes NUMA
> system, it will set dma_bitsize in end_boot_allocator to reserve
> some low address memory for DMA.
> 
> There are some x86 implications in current implementation. Becuase
                                    ^ the                    ^Because

> on x86, memory starts from 0. On a multiple nodes NUMA system, if
> a single node contains the majority or all of the DMA memory. x86
                                                              ^,

> prefer to give out memory from non-local allocations rather than
> exhausting the DMA memory ranges. Hence x86 use dma_bitsize to set
> aside some largely arbitrary amount memory for DMA memory ranges.
                                     ^ of memory

> The allocations from these memory ranges would happen only after
> exhausting all other nodes' memory.
> 
> But the implications are not shared across all architectures. For
> example, Arm doesn't have these implications. So in this patch, we
> introduce an arch_have_default_dmazone helper for arch to determine
> that it need to set dma_bitsize for reserve DMA allocations or not.
          ^ needs

> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/x86/numa.c        | 5 +++++
>  xen/common/page_alloc.c    | 2 +-
>  xen/include/asm-arm/numa.h | 5 +++++
>  xen/include/asm-x86/numa.h | 1 +
>  4 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> index ce79ee44ce..1fabbe8281 100644
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -371,6 +371,11 @@ unsigned int __init arch_get_dma_bitsize(void)
>                   + PAGE_SHIFT, 32);
>  }
>  
> +unsigned int arch_have_default_dmazone(void)

Can this function return bool?
Also, can it be a static inline?


> +{
> +    return ( num_online_nodes() > 1 ) ? 1 : 0;
> +}
> +
>  static void dump_numa(unsigned char key)
>  {
>      s_time_t now = NOW();
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> index 5801358b4b..80916205e5 100644
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -1889,7 +1889,7 @@ void __init end_boot_allocator(void)
>      }
>      nr_bootmem_regions = 0;
>  
> -    if ( !dma_bitsize && (num_online_nodes() > 1) )
> +    if ( !dma_bitsize && arch_have_default_dmazone() )
>          dma_bitsize = arch_get_dma_bitsize();
>  
>      printk("Domain heap initialised");
> diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> index 31a6de4e23..9d5739542d 100644
> --- a/xen/include/asm-arm/numa.h
> +++ b/xen/include/asm-arm/numa.h
> @@ -25,6 +25,11 @@ extern mfn_t first_valid_mfn;
>  #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
>  #define __node_distance(a, b) (20)
>  
> +static inline unsigned int arch_have_default_dmazone(void)
> +{
> +    return 0;
> +}
> +
>  #endif /* __ARCH_ARM_NUMA_H */
>  /*
>   * Local variables:
> diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
> index 3cf26c2def..8060cbf3f4 100644
> --- a/xen/include/asm-x86/numa.h
> +++ b/xen/include/asm-x86/numa.h
> @@ -78,5 +78,6 @@ extern int valid_numa_range(u64 start, u64 end, nodeid_t node);
>  void srat_parse_regions(u64 addr);
>  extern u8 __node_distance(nodeid_t a, nodeid_t b);
>  unsigned int arch_get_dma_bitsize(void);
> +unsigned int arch_have_default_dmazone(void);
>  
>  #endif
> -- 
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 06/37] xen/arm: use !CONFIG_NUMA to keep fake NUMA API
  2021-09-23 12:02 ` [PATCH 06/37] xen/arm: use !CONFIG_NUMA to keep fake NUMA API Wei Chen
@ 2021-09-24  0:05   ` Stefano Stabellini
  2021-09-24 10:21     ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  0:05 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> We have introduced CONFIG_NUMA in previous patch. And this
                                   ^ a

> option is enabled only on x86 in current stage. In a follow
                                ^ at the

> up patch, we will enable this option for Arm. But we still
> want users can disable the CONFIG_NUMA through Kconfig. In
             ^ to be able to disable CONFIG_NUMA via Kconfig.


> this case, keep current fake NUMA API, will make Arm code
                 ^ the

> still can work with NUMA aware memory allocation and scheduler.
        ^ able to work

> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>

With the small grammar fixes:

Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>


> ---
>  xen/include/asm-arm/numa.h | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> index 9d5739542d..8f1c67e3eb 100644
> --- a/xen/include/asm-arm/numa.h
> +++ b/xen/include/asm-arm/numa.h
> @@ -5,6 +5,8 @@
>  
>  typedef u8 nodeid_t;
>  
> +#ifndef CONFIG_NUMA
> +
>  /* Fake one node for now. See also node_online_map. */
>  #define cpu_to_node(cpu) 0
>  #define node_to_cpumask(node)   (cpu_online_map)
> @@ -25,6 +27,8 @@ extern mfn_t first_valid_mfn;
>  #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
>  #define __node_distance(a, b) (20)
>  
> +#endif
> +
>  static inline unsigned int arch_have_default_dmazone(void)
>  {
>      return 0;
> -- 
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 07/37] xen/x86: use paddr_t for addresses in NUMA node structure
  2021-09-23 12:02 ` [PATCH 07/37] xen/x86: use paddr_t for addresses in NUMA node structure Wei Chen
@ 2021-09-24  0:11   ` Stefano Stabellini
  2021-09-24  0:13     ` Stefano Stabellini
  2022-01-18 15:22   ` Jan Beulich
  1 sibling, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  0:11 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> NUMA node structure "struct node" is using u64 as node memory
> range. In order to make other architectures can reuse this
> NUMA node relative code, we replace the u64 to paddr_t. And
> use pfn_to_paddr and paddr_to_pfn to replace explicit shift
> operations. The relate PRIx64 in print messages have been
> replaced by PRIpaddr at the same time.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/x86/numa.c        | 32 +++++++++++++++++---------------
>  xen/arch/x86/srat.c        | 26 +++++++++++++-------------
>  xen/include/asm-x86/numa.h |  8 ++++----
>  3 files changed, 34 insertions(+), 32 deletions(-)
> 
> diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> index 1fabbe8281..6337bbdf31 100644
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -165,12 +165,12 @@ int __init compute_hash_shift(struct node *nodes, int numnodes,
>      return shift;
>  }
>  /* initialize NODE_DATA given nodeid and start/end */
> -void __init setup_node_bootmem(nodeid_t nodeid, u64 start, u64 end)
> -{ 
> +void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end)
> +{
>      unsigned long start_pfn, end_pfn;
>  
> -    start_pfn = start >> PAGE_SHIFT;
> -    end_pfn = end >> PAGE_SHIFT;
> +    start_pfn = paddr_to_pfn(start);
> +    end_pfn = paddr_to_pfn(end);
>  
>      NODE_DATA(nodeid)->node_start_pfn = start_pfn;
>      NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
> @@ -201,11 +201,12 @@ void __init numa_init_array(void)
>  static int numa_fake __initdata = 0;
>  
>  /* Numa emulation */
> -static int __init numa_emulation(u64 start_pfn, u64 end_pfn)
> +static int __init numa_emulation(unsigned long start_pfn,
> +                                 unsigned long end_pfn)

Why not changing numa_emulation to take paddr_t too?


>  {
>      int i;
>      struct node nodes[MAX_NUMNODES];
> -    u64 sz = ((end_pfn - start_pfn)<<PAGE_SHIFT) / numa_fake;
> +    u64 sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake;
>  
>      /* Kludge needed for the hash function */
>      if ( hweight64(sz) > 1 )
> @@ -221,9 +222,9 @@ static int __init numa_emulation(u64 start_pfn, u64 end_pfn)
>      memset(&nodes,0,sizeof(nodes));
>      for ( i = 0; i < numa_fake; i++ )
>      {
> -        nodes[i].start = (start_pfn<<PAGE_SHIFT) + i*sz;
> +        nodes[i].start = pfn_to_paddr(start_pfn) + i*sz;
>          if ( i == numa_fake - 1 )
> -            sz = (end_pfn<<PAGE_SHIFT) - nodes[i].start;
> +            sz = pfn_to_paddr(end_pfn) - nodes[i].start;
>          nodes[i].end = nodes[i].start + sz;
>          printk(KERN_INFO "Faking node %d at %"PRIx64"-%"PRIx64" (%"PRIu64"MB)\n",
>                 i,
> @@ -249,24 +250,26 @@ static int __init numa_emulation(u64 start_pfn, u64 end_pfn)
>  void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)

same here


>  { 
>      int i;
> +    paddr_t start, end;
>  
>  #ifdef CONFIG_NUMA_EMU
>      if ( numa_fake && !numa_emulation(start_pfn, end_pfn) )
>          return;
>  #endif
>  
> +    start = pfn_to_paddr(start_pfn);
> +    end = pfn_to_paddr(end_pfn);
> +
>  #ifdef CONFIG_ACPI_NUMA
> -    if ( !numa_off && !acpi_scan_nodes((u64)start_pfn << PAGE_SHIFT,
> -         (u64)end_pfn << PAGE_SHIFT) )
> +    if ( !numa_off && !acpi_scan_nodes(start, end) )
>          return;
>  #endif
>  
>      printk(KERN_INFO "%s\n",
>             numa_off ? "NUMA turned off" : "No NUMA configuration found");
>  
> -    printk(KERN_INFO "Faking a node at %016"PRIx64"-%016"PRIx64"\n",
> -           (u64)start_pfn << PAGE_SHIFT,
> -           (u64)end_pfn << PAGE_SHIFT);
> +    printk(KERN_INFO "Faking a node at %016"PRIpaddr"-%016"PRIpaddr"\n",
> +           start, end);
>      /* setup dummy node covering all memory */
>      memnode_shift = BITS_PER_LONG - 1;
>      memnodemap = _memnodemap;
> @@ -279,8 +282,7 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
>      for ( i = 0; i < nr_cpu_ids; i++ )
>          numa_set_node(i, 0);
>      cpumask_copy(&node_to_cpumask[0], cpumask_of(0));
> -    setup_node_bootmem(0, (u64)start_pfn << PAGE_SHIFT,
> -                    (u64)end_pfn << PAGE_SHIFT);
> +    setup_node_bootmem(0, start, end);
>  }
>  
>  void numa_add_cpu(int cpu)
> diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> index 6b77b98201..7d20d7f222 100644
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -104,7 +104,7 @@ nodeid_t setup_node(unsigned pxm)
>  	return node;
>  }
>  
> -int valid_numa_range(u64 start, u64 end, nodeid_t node)
> +int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node)
>  {
>  	int i;
>  
> @@ -119,7 +119,7 @@ int valid_numa_range(u64 start, u64 end, nodeid_t node)
>  	return 0;
>  }
>  
> -static __init int conflicting_memblks(u64 start, u64 end)
> +static __init int conflicting_memblks(paddr_t start, paddr_t end)
>  {
>  	int i;
>  
> @@ -135,7 +135,7 @@ static __init int conflicting_memblks(u64 start, u64 end)
>  	return -1;
>  }
>  
> -static __init void cutoff_node(int i, u64 start, u64 end)
> +static __init void cutoff_node(int i, paddr_t start, paddr_t end)
>  {
>  	struct node *nd = &nodes[i];
>  	if (nd->start < start) {
> @@ -275,7 +275,7 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
>  void __init
>  acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
>  {
> -	u64 start, end;
> +	paddr_t start, end;
>  	unsigned pxm;
>  	nodeid_t node;
>  	int i;
> @@ -318,7 +318,7 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
>  		bool mismatch = !(ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) !=
>  		                !test_bit(i, memblk_hotplug);
>  
> -		printk("%sSRAT: PXM %u (%"PRIx64"-%"PRIx64") overlaps with itself (%"PRIx64"-%"PRIx64")\n",
> +		printk("%sSRAT: PXM %u (%"PRIpaddr"-%"PRIpaddr") overlaps with itself (%"PRIpaddr"-%"PRIpaddr")\n",
>  		       mismatch ? KERN_ERR : KERN_WARNING, pxm, start, end,
>  		       node_memblk_range[i].start, node_memblk_range[i].end);
>  		if (mismatch) {
> @@ -327,7 +327,7 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
>  		}
>  	} else {
>  		printk(KERN_ERR
> -		       "SRAT: PXM %u (%"PRIx64"-%"PRIx64") overlaps with PXM %u (%"PRIx64"-%"PRIx64")\n",
> +		       "SRAT: PXM %u (%"PRIpaddr"-%"PRIpaddr") overlaps with PXM %u (%"PRIpaddr"-%"PRIpaddr")\n",
>  		       pxm, start, end, node_to_pxm(memblk_nodeid[i]),
>  		       node_memblk_range[i].start, node_memblk_range[i].end);
>  		bad_srat();
> @@ -346,7 +346,7 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
>  				nd->end = end;
>  		}
>  	}
> -	printk(KERN_INFO "SRAT: Node %u PXM %u %"PRIx64"-%"PRIx64"%s\n",
> +	printk(KERN_INFO "SRAT: Node %u PXM %u %"PRIpaddr"-%"PRIpaddr"%s\n",
>  	       node, pxm, start, end,
>  	       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>  
> @@ -369,7 +369,7 @@ static int __init nodes_cover_memory(void)
>  
>  	for (i = 0; i < e820.nr_map; i++) {
>  		int j, found;
> -		unsigned long long start, end;
> +		paddr_t start, end;
>  
>  		if (e820.map[i].type != E820_RAM) {
>  			continue;
> @@ -396,7 +396,7 @@ static int __init nodes_cover_memory(void)
>  
>  		if (start < end) {
>  			printk(KERN_ERR "SRAT: No PXM for e820 range: "
> -				"%016Lx - %016Lx\n", start, end);
> +				"%"PRIpaddr" - %"PRIpaddr"\n", start, end);
>  			return 0;
>  		}
>  	}
> @@ -432,7 +432,7 @@ static int __init srat_parse_region(struct acpi_subtable_header *header,
>  	return 0;
>  }
>  
> -void __init srat_parse_regions(u64 addr)
> +void __init srat_parse_regions(paddr_t addr)
>  {
>  	u64 mask;
>  	unsigned int i;
> @@ -441,7 +441,7 @@ void __init srat_parse_regions(u64 addr)
>  	    acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat))
>  		return;
>  
> -	srat_region_mask = pdx_init_mask(addr);
> +	srat_region_mask = pdx_init_mask((u64)addr);
>  	acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
>  			      srat_parse_region, 0);
>  
> @@ -457,7 +457,7 @@ void __init srat_parse_regions(u64 addr)
>  }
>  
>  /* Use the information discovered above to actually set up the nodes. */
> -int __init acpi_scan_nodes(u64 start, u64 end)
> +int __init acpi_scan_nodes(paddr_t start, paddr_t end)
>  {
>  	int i;
>  	nodemask_t all_nodes_parsed;
> @@ -489,7 +489,7 @@ int __init acpi_scan_nodes(u64 start, u64 end)
>  	/* Finally register nodes */
>  	for_each_node_mask(i, all_nodes_parsed)
>  	{
> -		u64 size = nodes[i].end - nodes[i].start;
> +		paddr_t size = nodes[i].end - nodes[i].start;
>  		if ( size == 0 )
>  			printk(KERN_WARNING "SRAT: Node %u has no memory. "
>  			       "BIOS Bug or mis-configured hardware?\n", i);
> diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
> index 8060cbf3f4..50cfd8e7ef 100644
> --- a/xen/include/asm-x86/numa.h
> +++ b/xen/include/asm-x86/numa.h
> @@ -16,7 +16,7 @@ extern cpumask_t     node_to_cpumask[];
>  #define node_to_cpumask(node)    (node_to_cpumask[node])
>  
>  struct node { 
> -	u64 start,end; 
> +	paddr_t start,end;
>  };
>  
>  extern int compute_hash_shift(struct node *nodes, int numnodes,
> @@ -36,7 +36,7 @@ extern void numa_set_node(int cpu, nodeid_t node);
>  extern nodeid_t setup_node(unsigned int pxm);
>  extern void srat_detect_node(int cpu);
>  
> -extern void setup_node_bootmem(nodeid_t nodeid, u64 start, u64 end);
> +extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end);
>  extern nodeid_t apicid_to_node[];
>  extern void init_cpu_to_node(void);
>  
> @@ -73,9 +73,9 @@ static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
>  #define node_end_pfn(nid)       (NODE_DATA(nid)->node_start_pfn + \
>  				 NODE_DATA(nid)->node_spanned_pages)
>  
> -extern int valid_numa_range(u64 start, u64 end, nodeid_t node);
> +extern int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node);
>  
> -void srat_parse_regions(u64 addr);
> +void srat_parse_regions(paddr_t addr);
>  extern u8 __node_distance(nodeid_t a, nodeid_t b);
>  unsigned int arch_get_dma_bitsize(void);
>  unsigned int arch_have_default_dmazone(void);
> -- 
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 07/37] xen/x86: use paddr_t for addresses in NUMA node structure
  2021-09-24  0:11   ` Stefano Stabellini
@ 2021-09-24  0:13     ` Stefano Stabellini
  2021-09-24  3:00       ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  0:13 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Chen, xen-devel, julien, Bertrand.Marquis, jbeulich,
	andrew.cooper3, roger.pau, wl

You forgot to add the x86 maintainers in CC to all the patches touching
x86 code in this series. Adding them now but you should probably resend.


On Thu, 23 Sep 2021, Stefano Stabellini wrote:
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > NUMA node structure "struct node" uses u64 for node memory
> > ranges. To allow other architectures to reuse this NUMA node
> > related code, we replace u64 with paddr_t, and use pfn_to_paddr
> > and paddr_to_pfn to replace explicit shift operations. The
> > related PRIx64 specifiers in print messages have been replaced
> > by PRIpaddr at the same time.
> > 
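For readers following the conversion: pfn_to_paddr()/paddr_to_pfn() are thin wrappers around the PAGE_SHIFT shifts being removed, so the behaviour of setup_node_bootmem() is unchanged. A minimal self-contained sketch (the typedef and PAGE_SHIFT value below are assumptions for illustration, not taken from the Xen headers):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t paddr_t;   /* assumption: 64-bit physical addresses */
#define PAGE_SHIFT 12       /* assumption: 4K pages */

/* Equivalents of the helpers the patch substitutes for open-coded shifts. */
static inline paddr_t pfn_to_paddr(unsigned long pfn)
{
    return (paddr_t)pfn << PAGE_SHIFT;
}

static inline unsigned long paddr_to_pfn(paddr_t pa)
{
    return (unsigned long)(pa >> PAGE_SHIFT);
}
```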
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/x86/numa.c        | 32 +++++++++++++++++---------------
> >  xen/arch/x86/srat.c        | 26 +++++++++++++-------------
> >  xen/include/asm-x86/numa.h |  8 ++++----
> >  3 files changed, 34 insertions(+), 32 deletions(-)
> > 
> > diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> > index 1fabbe8281..6337bbdf31 100644
> > --- a/xen/arch/x86/numa.c
> > +++ b/xen/arch/x86/numa.c
> > @@ -165,12 +165,12 @@ int __init compute_hash_shift(struct node *nodes, int numnodes,
> >      return shift;
> >  }
> >  /* initialize NODE_DATA given nodeid and start/end */
> > -void __init setup_node_bootmem(nodeid_t nodeid, u64 start, u64 end)
> > -{ 
> > +void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end)
> > +{
> >      unsigned long start_pfn, end_pfn;
> >  
> > -    start_pfn = start >> PAGE_SHIFT;
> > -    end_pfn = end >> PAGE_SHIFT;
> > +    start_pfn = paddr_to_pfn(start);
> > +    end_pfn = paddr_to_pfn(end);
> >  
> >      NODE_DATA(nodeid)->node_start_pfn = start_pfn;
> >      NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
> > @@ -201,11 +201,12 @@ void __init numa_init_array(void)
> >  static int numa_fake __initdata = 0;
> >  
> >  /* Numa emulation */
> > -static int __init numa_emulation(u64 start_pfn, u64 end_pfn)
> > +static int __init numa_emulation(unsigned long start_pfn,
> > +                                 unsigned long end_pfn)
> 
> Why not changing numa_emulation to take paddr_t too?
> 
> 
> >  {
> >      int i;
> >      struct node nodes[MAX_NUMNODES];
> > -    u64 sz = ((end_pfn - start_pfn)<<PAGE_SHIFT) / numa_fake;
> > +    u64 sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake;
> >  
> >      /* Kludge needed for the hash function */
> >      if ( hweight64(sz) > 1 )
> > @@ -221,9 +222,9 @@ static int __init numa_emulation(u64 start_pfn, u64 end_pfn)
> >      memset(&nodes,0,sizeof(nodes));
> >      for ( i = 0; i < numa_fake; i++ )
> >      {
> > -        nodes[i].start = (start_pfn<<PAGE_SHIFT) + i*sz;
> > +        nodes[i].start = pfn_to_paddr(start_pfn) + i*sz;
> >          if ( i == numa_fake - 1 )
> > -            sz = (end_pfn<<PAGE_SHIFT) - nodes[i].start;
> > +            sz = pfn_to_paddr(end_pfn) - nodes[i].start;
> >          nodes[i].end = nodes[i].start + sz;
> >          printk(KERN_INFO "Faking node %d at %"PRIx64"-%"PRIx64" (%"PRIu64"MB)\n",
> >                 i,
> > @@ -249,24 +250,26 @@ static int __init numa_emulation(u64 start_pfn, u64 end_pfn)
> >  void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
> 
> same here
> 
> 
> >  { 
> >      int i;
> > +    paddr_t start, end;
> >  
> >  #ifdef CONFIG_NUMA_EMU
> >      if ( numa_fake && !numa_emulation(start_pfn, end_pfn) )
> >          return;
> >  #endif
> >  
> > +    start = pfn_to_paddr(start_pfn);
> > +    end = pfn_to_paddr(end_pfn);
> > +
> >  #ifdef CONFIG_ACPI_NUMA
> > -    if ( !numa_off && !acpi_scan_nodes((u64)start_pfn << PAGE_SHIFT,
> > -         (u64)end_pfn << PAGE_SHIFT) )
> > +    if ( !numa_off && !acpi_scan_nodes(start, end) )
> >          return;
> >  #endif
> >  
> >      printk(KERN_INFO "%s\n",
> >             numa_off ? "NUMA turned off" : "No NUMA configuration found");
> >  
> > -    printk(KERN_INFO "Faking a node at %016"PRIx64"-%016"PRIx64"\n",
> > -           (u64)start_pfn << PAGE_SHIFT,
> > -           (u64)end_pfn << PAGE_SHIFT);
> > +    printk(KERN_INFO "Faking a node at %016"PRIpaddr"-%016"PRIpaddr"\n",
> > +           start, end);
> >      /* setup dummy node covering all memory */
> >      memnode_shift = BITS_PER_LONG - 1;
> >      memnodemap = _memnodemap;
> > @@ -279,8 +282,7 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
> >      for ( i = 0; i < nr_cpu_ids; i++ )
> >          numa_set_node(i, 0);
> >      cpumask_copy(&node_to_cpumask[0], cpumask_of(0));
> > -    setup_node_bootmem(0, (u64)start_pfn << PAGE_SHIFT,
> > -                    (u64)end_pfn << PAGE_SHIFT);
> > +    setup_node_bootmem(0, start, end);
> >  }
> >  
> >  void numa_add_cpu(int cpu)
> > diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> > index 6b77b98201..7d20d7f222 100644
> > --- a/xen/arch/x86/srat.c
> > +++ b/xen/arch/x86/srat.c
> > @@ -104,7 +104,7 @@ nodeid_t setup_node(unsigned pxm)
> >  	return node;
> >  }
> >  
> > -int valid_numa_range(u64 start, u64 end, nodeid_t node)
> > +int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node)
> >  {
> >  	int i;
> >  
> > @@ -119,7 +119,7 @@ int valid_numa_range(u64 start, u64 end, nodeid_t node)
> >  	return 0;
> >  }
> >  
> > -static __init int conflicting_memblks(u64 start, u64 end)
> > +static __init int conflicting_memblks(paddr_t start, paddr_t end)
> >  {
> >  	int i;
> >  
> > @@ -135,7 +135,7 @@ static __init int conflicting_memblks(u64 start, u64 end)
> >  	return -1;
> >  }
> >  
> > -static __init void cutoff_node(int i, u64 start, u64 end)
> > +static __init void cutoff_node(int i, paddr_t start, paddr_t end)
> >  {
> >  	struct node *nd = &nodes[i];
> >  	if (nd->start < start) {
> > @@ -275,7 +275,7 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
> >  void __init
> >  acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
> >  {
> > -	u64 start, end;
> > +	paddr_t start, end;
> >  	unsigned pxm;
> >  	nodeid_t node;
> >  	int i;
> > @@ -318,7 +318,7 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
> >  		bool mismatch = !(ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) !=
> >  		                !test_bit(i, memblk_hotplug);
> >  
> > -		printk("%sSRAT: PXM %u (%"PRIx64"-%"PRIx64") overlaps with itself (%"PRIx64"-%"PRIx64")\n",
> > +		printk("%sSRAT: PXM %u (%"PRIpaddr"-%"PRIpaddr") overlaps with itself (%"PRIpaddr"-%"PRIpaddr")\n",
> >  		       mismatch ? KERN_ERR : KERN_WARNING, pxm, start, end,
> >  		       node_memblk_range[i].start, node_memblk_range[i].end);
> >  		if (mismatch) {
> > @@ -327,7 +327,7 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
> >  		}
> >  	} else {
> >  		printk(KERN_ERR
> > -		       "SRAT: PXM %u (%"PRIx64"-%"PRIx64") overlaps with PXM %u (%"PRIx64"-%"PRIx64")\n",
> > +		       "SRAT: PXM %u (%"PRIpaddr"-%"PRIpaddr") overlaps with PXM %u (%"PRIpaddr"-%"PRIpaddr")\n",
> >  		       pxm, start, end, node_to_pxm(memblk_nodeid[i]),
> >  		       node_memblk_range[i].start, node_memblk_range[i].end);
> >  		bad_srat();
> > @@ -346,7 +346,7 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
> >  				nd->end = end;
> >  		}
> >  	}
> > -	printk(KERN_INFO "SRAT: Node %u PXM %u %"PRIx64"-%"PRIx64"%s\n",
> > +	printk(KERN_INFO "SRAT: Node %u PXM %u %"PRIpaddr"-%"PRIpaddr"%s\n",
> >  	       node, pxm, start, end,
> >  	       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
> >  
> > @@ -369,7 +369,7 @@ static int __init nodes_cover_memory(void)
> >  
> >  	for (i = 0; i < e820.nr_map; i++) {
> >  		int j, found;
> > -		unsigned long long start, end;
> > +		paddr_t start, end;
> >  
> >  		if (e820.map[i].type != E820_RAM) {
> >  			continue;
> > @@ -396,7 +396,7 @@ static int __init nodes_cover_memory(void)
> >  
> >  		if (start < end) {
> >  			printk(KERN_ERR "SRAT: No PXM for e820 range: "
> > -				"%016Lx - %016Lx\n", start, end);
> > +				"%"PRIpaddr" - %"PRIpaddr"\n", start, end);
> >  			return 0;
> >  		}
> >  	}
> > @@ -432,7 +432,7 @@ static int __init srat_parse_region(struct acpi_subtable_header *header,
> >  	return 0;
> >  }
> >  
> > -void __init srat_parse_regions(u64 addr)
> > +void __init srat_parse_regions(paddr_t addr)
> >  {
> >  	u64 mask;
> >  	unsigned int i;
> > @@ -441,7 +441,7 @@ void __init srat_parse_regions(u64 addr)
> >  	    acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat))
> >  		return;
> >  
> > -	srat_region_mask = pdx_init_mask(addr);
> > +	srat_region_mask = pdx_init_mask((u64)addr);
> >  	acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
> >  			      srat_parse_region, 0);
> >  
> > @@ -457,7 +457,7 @@ void __init srat_parse_regions(u64 addr)
> >  }
> >  
> >  /* Use the information discovered above to actually set up the nodes. */
> > -int __init acpi_scan_nodes(u64 start, u64 end)
> > +int __init acpi_scan_nodes(paddr_t start, paddr_t end)
> >  {
> >  	int i;
> >  	nodemask_t all_nodes_parsed;
> > @@ -489,7 +489,7 @@ int __init acpi_scan_nodes(u64 start, u64 end)
> >  	/* Finally register nodes */
> >  	for_each_node_mask(i, all_nodes_parsed)
> >  	{
> > -		u64 size = nodes[i].end - nodes[i].start;
> > +		paddr_t size = nodes[i].end - nodes[i].start;
> >  		if ( size == 0 )
> >  			printk(KERN_WARNING "SRAT: Node %u has no memory. "
> >  			       "BIOS Bug or mis-configured hardware?\n", i);
> > diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
> > index 8060cbf3f4..50cfd8e7ef 100644
> > --- a/xen/include/asm-x86/numa.h
> > +++ b/xen/include/asm-x86/numa.h
> > @@ -16,7 +16,7 @@ extern cpumask_t     node_to_cpumask[];
> >  #define node_to_cpumask(node)    (node_to_cpumask[node])
> >  
> >  struct node { 
> > -	u64 start,end; 
> > +	paddr_t start,end;
> >  };
> >  
> >  extern int compute_hash_shift(struct node *nodes, int numnodes,
> > @@ -36,7 +36,7 @@ extern void numa_set_node(int cpu, nodeid_t node);
> >  extern nodeid_t setup_node(unsigned int pxm);
> >  extern void srat_detect_node(int cpu);
> >  
> > -extern void setup_node_bootmem(nodeid_t nodeid, u64 start, u64 end);
> > +extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end);
> >  extern nodeid_t apicid_to_node[];
> >  extern void init_cpu_to_node(void);
> >  
> > @@ -73,9 +73,9 @@ static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
> >  #define node_end_pfn(nid)       (NODE_DATA(nid)->node_start_pfn + \
> >  				 NODE_DATA(nid)->node_spanned_pages)
> >  
> > -extern int valid_numa_range(u64 start, u64 end, nodeid_t node);
> > +extern int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node);
> >  
> > -void srat_parse_regions(u64 addr);
> > +void srat_parse_regions(paddr_t addr);
> >  extern u8 __node_distance(nodeid_t a, nodeid_t b);
> >  unsigned int arch_get_dma_bitsize(void);
> >  unsigned int arch_have_default_dmazone(void);
> > -- 
> > 2.25.1
> > 
> 



* Re: [PATCH 08/37] xen/x86: add detection of discontinous node memory range
  2021-09-23 12:02 ` [PATCH 08/37] xen/x86: add detection of discontinous node memory range Wei Chen
@ 2021-09-24  0:25   ` Stefano Stabellini
  2021-09-24  4:28     ` Wei Chen
  2022-01-18 16:13   ` Jan Beulich
  1 sibling, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  0:25 UTC (permalink / raw)
  To: Wei Chen
  Cc: xen-devel, sstabellini, julien, Bertrand.Marquis, jbeulich,
	andrew.cooper3, roger.pau, wl

CC'ing x86 maintainers

On Thu, 23 Sep 2021, Wei Chen wrote:
> One NUMA node may contain several memory blocks. In the current Xen
> code, Xen maintains a single memory range per node to cover all of
> its memory blocks. But this creates a problem: if the gap between
> two of a node's memory blocks contains memory blocks that belong to
> other nodes (remote memory blocks), this node's memory range will be
> expanded to cover those remote memory blocks as well.
> 
> One node's memory range containing other nodes' memory is obviously
> not reasonable. It means the current NUMA code can only support
> nodes with contiguous memory blocks. However, on a physical machine,
> the addresses of multiple nodes can be interleaved.
> 
> So in this patch, we add code to detect discontiguous memory blocks
> for one node. NUMA initialization will fail and error messages will
> be printed when Xen detects such a hardware configuration.

At least on ARM, it is not just memory that can be interleaved, but also
MMIO regions. For instance:

node0 bank0 0-0x1000000
MMIO 0x1000000-0x1002000
Hole 0x1002000-0x2000000
node0 bank1 0x2000000-0x3000000

So I am not familiar with the SRAT format, but I think on ARM the check
would look different: we would just look for multiple memory ranges
under a device_type = "memory" node of a NUMA node in device tree.
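For reference, the containment test the patch performs can be sketched in isolation as follows. This is a simplified model of is_node_memory_continuous() (names and types are stand-ins); as the patch comment notes, plain overlaps are assumed to have been rejected earlier by conflicting_memblks(), so only full containment needs checking here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t paddr_t;

struct node { paddr_t start, end; };

/* Return false if node nid's (already extended) range [start, end)
 * fully contains some other parsed node's range, i.e. the two nodes'
 * addresses are intertwined. */
static bool node_memory_continuous(const struct node *nodes, int nr_nodes,
                                   int nid, paddr_t start, paddr_t end)
{
    for (int i = 0; i < nr_nodes; i++) {
        if (i == nid)
            continue;
        if (start < nodes[i].start && nodes[i].end < end)
            return false;
    }
    return true;
}
```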



> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/x86/srat.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> index 7d20d7f222..2f08fa4660 100644
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -271,6 +271,36 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
>  		       pxm, pa->apic_id, node);
>  }
>  
> +/*
> + * Check to see if there are other nodes within this node's range.
> + * We just need to check full contains situation. Because overlaps
> + * have been checked before by conflicting_memblks.
> + */
> +static bool __init is_node_memory_continuous(nodeid_t nid,
> +    paddr_t start, paddr_t end)
> +{
> +	nodeid_t i;
> +
> +	struct node *nd = &nodes[nid];
> +	for_each_node_mask(i, memory_nodes_parsed)
> +	{
> +		/* Skip itself */
> +		if (i == nid)
> +			continue;
> +
> +		nd = &nodes[i];
> +		if (start < nd->start && nd->end < end)
> +		{
> +			printk(KERN_ERR
> +			       "NODE %u: (%"PRIpaddr"-%"PRIpaddr") intertwine with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",
> +			       nid, start, end, i, nd->start, nd->end);
> +			return false;
> +		}
> +	}
> +
> +	return true;
> +}
> +
>  /* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
>  void __init
>  acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
> @@ -344,6 +374,12 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
>  				nd->start = start;
>  			if (nd->end < end)
>  				nd->end = end;
> +
> +			/* Check whether this range contains memory for other nodes */
> +			if (!is_node_memory_continuous(node, nd->start, nd->end)) {
> +				bad_srat();
> +				return;
> +			}
>  		}
>  	}
>  	printk(KERN_INFO "SRAT: Node %u PXM %u %"PRIpaddr"-%"PRIpaddr"%s\n",
> -- 
> 2.25.1
> 



* Re: [PATCH 09/37] xen/x86: introduce two helpers to access memory hotplug end
  2021-09-23 12:02 ` [PATCH 09/37] xen/x86: introduce two helpers to access memory hotplug end Wei Chen
@ 2021-09-24  0:29   ` Stefano Stabellini
  2021-09-24  4:21     ` Wei Chen
  2022-01-24 16:24   ` Jan Beulich
  1 sibling, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  0:29 UTC (permalink / raw)
  To: Wei Chen
  Cc: xen-devel, sstabellini, julien, Bertrand.Marquis, jbeulich,
	andrew.cooper3, roger.pau, wl

+x86 maintainers

On Thu, 23 Sep 2021, Wei Chen wrote:
> x86 provides a mem_hotplug to maintain the end of memory hotplug
                            ^ variable

> end address. This variable can be accessed out of mm.c. We want
> some code out of mm.c can be reused by other architectures without
                       ^ so that it can be reused

> memory hotplug ability. So in this patch, we introduce these two
> helpers to replace mem_hotplug direct access. This will give the
> ability to stub these two API.
                            ^ APIs


> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/include/asm-x86/mm.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
> index cb90527499..af2fc4b0cd 100644
> --- a/xen/include/asm-x86/mm.h
> +++ b/xen/include/asm-x86/mm.h
> @@ -475,6 +475,16 @@ static inline int get_page_and_type(struct page_info *page,
>  
>  extern paddr_t mem_hotplug;
>  
> +static inline void mem_hotplug_update_boundary(paddr_t end)
> +{
> +    mem_hotplug = end;
> +}
> +
> +static inline paddr_t mem_hotplug_boundary(void)
> +{
> +    return mem_hotplug;
> +}
> +
>  /******************************************************************************
>   * With shadow pagetables, the different kinds of address start
>   * to get get confusing.
> -- 
> 2.25.1
> 
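The point of wrapping mem_hotplug behind these two accessors is that the same call sites can compile on an architecture without memory hotplug at all. A hypothetical stub pair for such an architecture (not part of this series, purely illustrative) might look like:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t paddr_t;

/* Hypothetical stubs for an architecture with no memory hotplug:
 * the boundary never moves, so common code such as
 *     if (end > mem_hotplug_boundary())
 *         mem_hotplug_update_boundary(end);
 * compiles unchanged and stays a no-op. */
static inline void mem_hotplug_update_boundary(paddr_t end)
{
    (void)end; /* nothing to record */
}

static inline paddr_t mem_hotplug_boundary(void)
{
    return 0;
}
```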



* Re: [PATCH 10/37] xen/x86: use helpers to access/update mem_hotplug
  2021-09-23 12:02 ` [PATCH 10/37] xen/x86: use helpers to access/update mem_hotplug Wei Chen
@ 2021-09-24  0:31   ` Stefano Stabellini
  2021-09-24  4:29     ` Wei Chen
  2022-01-24 16:29   ` Jan Beulich
  1 sibling, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  0:31 UTC (permalink / raw)
  To: Wei Chen
  Cc: xen-devel, sstabellini, julien, Bertrand.Marquis, jbeulich,
	andrew.cooper3, roger.pau, wl

+x86 maintainers


On Thu, 23 Sep 2021, Wei Chen wrote:
> We want to abstract code from acpi_numa_memory_affinity_init.
> But mem_hotplug is coupled with x86. In this patch, we use
> helpers to repace mem_hotplug direct accessing. This will
             ^ replace

> allow most code can be common.
                  ^ to be

I think this patch could be merged with the previous patch


> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/x86/srat.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> index 2f08fa4660..3334ede7a5 100644
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -391,8 +391,8 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
>  	memblk_nodeid[num_node_memblks] = node;
>  	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
>  		__set_bit(num_node_memblks, memblk_hotplug);
> -		if (end > mem_hotplug)
> -			mem_hotplug = end;
> +		if (end > mem_hotplug_boundary())
> +			mem_hotplug_update_boundary(end);
>  	}
>  	num_node_memblks++;
>  }
> -- 
> 2.25.1
> 



* Re: [PATCH 11/37] xen/x86: abstract neutral code from acpi_numa_memory_affinity_init
  2021-09-23 12:02 ` [PATCH 11/37] xen/x86: abstract neutral code from acpi_numa_memory_affinity_init Wei Chen
@ 2021-09-24  0:38   ` Stefano Stabellini
  2022-01-24 16:50   ` Jan Beulich
  1 sibling, 0 replies; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  0:38 UTC (permalink / raw)
  To: Wei Chen
  Cc: xen-devel, sstabellini, julien, Bertrand.Marquis, jbeulich,
	andrew.cooper3, roger.pau, wl

+x86 maintainers


On Thu, 23 Sep 2021, Wei Chen wrote:
> There is some code in acpi_numa_memory_affinity_init to update a
> node's memory range and the node_memblk_range array. This code is
> not ACPI specific; it can be shared by other NUMA implementations,
> such as a device tree based one.
> 
> So in this patch, we abstract this memory range and block handling
> code into a new function. This avoids exporting static variables
> like node_memblk_range. PXM in the neutral code's print messages
> has been replaced by NODE, as PXM is ACPI specific.
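The conflict handling the commit message refers to relies on a scan like conflicting_memblks(); a simplified, self-contained sketch (the array layout and half-open range semantics below are assumptions, the real Xen function works on its static node_memblk_range table):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t paddr_t;

struct memblk { paddr_t start, end; };

/* Return the index of a recorded memory block that overlaps the
 * half-open range [start, end), or -1 if there is no conflict. */
static int conflicting_memblks(const struct memblk *blks, int nr,
                               paddr_t start, paddr_t end)
{
    for (int i = 0; i < nr; i++) {
        if (start < blks[i].end && blks[i].start < end)
            return i;
    }
    return -1;
}
```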
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/x86/srat.c        | 131 +++++++++++++++++++++----------------
>  xen/include/asm-x86/numa.h |   3 +
>  2 files changed, 77 insertions(+), 57 deletions(-)
> 
> diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> index 3334ede7a5..18bc6b19bb 100644
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -104,6 +104,14 @@ nodeid_t setup_node(unsigned pxm)
>  	return node;
>  }
>  
> +bool __init numa_memblks_available(void)
> +{
> +	if (num_node_memblks < NR_NODE_MEMBLKS)
> +		return true;
> +
> +	return false;
> +}
> +
>  int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node)
>  {
>  	int i;
> @@ -301,69 +309,35 @@ static bool __init is_node_memory_continuous(nodeid_t nid,
>  	return true;
>  }
>  
> -/* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
> -void __init
> -acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
> +/* Neutral NUMA memory affinity init function for ACPI and DT */
> +int __init numa_update_node_memblks(nodeid_t node,
> +		paddr_t start, paddr_t size, bool hotplug)
>  {
> -	paddr_t start, end;
> -	unsigned pxm;
> -	nodeid_t node;
> +	paddr_t end = start + size;
>  	int i;
>  
> -	if (srat_disabled())
> -		return;
> -	if (ma->header.length != sizeof(struct acpi_srat_mem_affinity)) {
> -		bad_srat();
> -		return;
> -	}
> -	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
> -		return;
> -
> -	start = ma->base_address;
> -	end = start + ma->length;
> -	/* Supplement the heuristics in l1tf_calculations(). */
> -	l1tf_safe_maddr = max(l1tf_safe_maddr, ROUNDUP(end, PAGE_SIZE));
> -
> -	if (num_node_memblks >= NR_NODE_MEMBLKS)
> -	{
> -		dprintk(XENLOG_WARNING,
> -                "Too many numa entry, try bigger NR_NODE_MEMBLKS \n");
> -		bad_srat();
> -		return;
> -	}
> -
> -	pxm = ma->proximity_domain;
> -	if (srat_rev < 2)
> -		pxm &= 0xff;
> -	node = setup_node(pxm);
> -	if (node == NUMA_NO_NODE) {
> -		bad_srat();
> -		return;
> -	}
> -	/* It is fine to add this area to the nodes data it will be used later*/
> +	/* It is fine to add this area to the nodes data it will be used later */
>  	i = conflicting_memblks(start, end);
>  	if (i < 0)
>  		/* everything fine */;
>  	else if (memblk_nodeid[i] == node) {
> -		bool mismatch = !(ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) !=
> -		                !test_bit(i, memblk_hotplug);
> +		bool mismatch = !hotplug != !test_bit(i, memblk_hotplug);
>  
> -		printk("%sSRAT: PXM %u (%"PRIpaddr"-%"PRIpaddr") overlaps with itself (%"PRIpaddr"-%"PRIpaddr")\n",
> -		       mismatch ? KERN_ERR : KERN_WARNING, pxm, start, end,
> +		printk("%sSRAT: NODE %u (%"PRIpaddr"-%"PRIpaddr") overlaps with itself (%"PRIpaddr"-%"PRIpaddr")\n",
> +		       mismatch ? KERN_ERR : KERN_WARNING, node, start, end,
>  		       node_memblk_range[i].start, node_memblk_range[i].end);
>  		if (mismatch) {
> -			bad_srat();
> -			return;
> +			return -1;
>  		}
>  	} else {
>  		printk(KERN_ERR
> -		       "SRAT: PXM %u (%"PRIpaddr"-%"PRIpaddr") overlaps with PXM %u (%"PRIpaddr"-%"PRIpaddr")\n",
> -		       pxm, start, end, node_to_pxm(memblk_nodeid[i]),
> +		       "SRAT: NODE %u (%"PRIpaddr"-%"PRIpaddr") overlaps with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",
> +		       node, start, end, memblk_nodeid[i],
>  		       node_memblk_range[i].start, node_memblk_range[i].end);
> -		bad_srat();
> -		return;
> +		return -1;
>  	}
> -	if (!(ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE)) {
> +
> +	if (!hotplug) {
>  		struct node *nd = &nodes[node];
>  
>  		if (!node_test_and_set(node, memory_nodes_parsed)) {
> @@ -375,26 +349,69 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
>  			if (nd->end < end)
>  				nd->end = end;
>  
> -			/* Check whether this range contains memory for other nodes */
> -			if (!is_node_memory_continuous(node, nd->start, nd->end)) {
> -				bad_srat();
> -				return;
> -			}
> +			if (!is_node_memory_continuous(node, nd->start, nd->end))
> +				return -1;
>  		}
>  	}
> -	printk(KERN_INFO "SRAT: Node %u PXM %u %"PRIpaddr"-%"PRIpaddr"%s\n",
> -	       node, pxm, start, end,
> -	       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
> +
> +	printk(KERN_INFO "SRAT: Node %u %"PRIpaddr"-%"PRIpaddr"%s\n",
> +	       node, start, end, hotplug ? " (hotplug)" : "");
>  
>  	node_memblk_range[num_node_memblks].start = start;
>  	node_memblk_range[num_node_memblks].end = end;
>  	memblk_nodeid[num_node_memblks] = node;
> -	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
> +	if (hotplug) {
>  		__set_bit(num_node_memblks, memblk_hotplug);
>  		if (end > mem_hotplug_boundary())
>  			mem_hotplug_update_boundary(end);
>  	}
>  	num_node_memblks++;
> +
> +	return 0;
> +}
> +
> +/* Callback for parsing of the Proximity Domain <-> Memory Area mappings */
> +void __init
> +acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
> +{
> +	unsigned pxm;
> +	nodeid_t node;
> +	int ret;
> +
> +	if (srat_disabled())
> +		return;
> +	if (ma->header.length != sizeof(struct acpi_srat_mem_affinity)) {
> +		bad_srat();
> +		return;
> +	}
> +	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
> +		return;
> +
> +	/* Supplement the heuristics in l1tf_calculations(). */
> +	l1tf_safe_maddr = max(l1tf_safe_maddr,
> +			ROUNDUP((ma->base_address + ma->length), PAGE_SIZE));
> +
> +	if (!numa_memblks_available())
> +	{
> +		dprintk(XENLOG_WARNING,
> +                "Too many numa entry, try bigger NR_NODE_MEMBLKS \n");
> +		bad_srat();
> +		return;
> +	}
> +
> +	pxm = ma->proximity_domain;
> +	if (srat_rev < 2)
> +		pxm &= 0xff;
> +	node = setup_node(pxm);
> +	if (node == NUMA_NO_NODE) {
> +		bad_srat();
> +		return;
> +	}
> +
> +	ret = numa_update_node_memblks(node, ma->base_address, ma->length,
> +					ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE);
> +	if (ret != 0)
> +		bad_srat();
>  }
>  
>  /* Sanity check to catch more bad SRATs (they are amazingly common).
> diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
> index 50cfd8e7ef..5772a70665 100644
> --- a/xen/include/asm-x86/numa.h
> +++ b/xen/include/asm-x86/numa.h
> @@ -74,6 +74,9 @@ static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
>  				 NODE_DATA(nid)->node_spanned_pages)
>  
>  extern int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node);
> +extern bool numa_memblks_available(void);
> +extern int numa_update_node_memblks(nodeid_t node,
> +		paddr_t start, paddr_t size, bool hotplug);
>  
>  void srat_parse_regions(paddr_t addr);
>  extern u8 __node_distance(nodeid_t a, nodeid_t b);
> -- 
> 2.25.1
> 



* Re: [PATCH 12/37] xen/x86: decouple nodes_cover_memory from E820 map
  2021-09-23 12:02 ` [PATCH 12/37] xen/x86: decouple nodes_cover_memory from E820 map Wei Chen
@ 2021-09-24  0:39   ` Stefano Stabellini
  2022-01-24 16:59   ` Jan Beulich
  1 sibling, 0 replies; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  0:39 UTC (permalink / raw)
  To: Wei Chen
  Cc: xen-devel, sstabellini, julien, Bertrand.Marquis, jbeulich,
	andrew.cooper3, roger.pau, wl

+x86 maintainers


On Thu, 23 Sep 2021, Wei Chen wrote:
> We will reuse nodes_cover_memory on Arm to check its bootmem
> info. So we introduce two arch helpers to get the memory map's
> entry count and a specified entry's range:
>     arch_meminfo_get_nr_bank
>     arch_meminfo_get_ram_bank_range
> 
> Based on these two helpers, nodes_cover_memory becomes
> architecture independent. The only change from an x86
> perspective is the additional checks:
>   !start || !end
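The decoupled iteration pattern can be modelled outside Xen with a toy bank table standing in for e820 (the table contents and the total_ram() consumer below are invented for illustration; only the two helper signatures come from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t paddr_t;

struct bank { paddr_t start, end; bool ram; };

/* Toy backing store playing the role of the x86 e820 map. */
static struct bank banks[] = {
    { 0x00000000, 0x01000000, true  },
    { 0x01000000, 0x01100000, false }, /* MMIO: reported as -1 */
    { 0x02000000, 0x03000000, true  },
};

static uint32_t arch_meminfo_get_nr_bank(void)
{
    return sizeof(banks) / sizeof(banks[0]);
}

static int arch_meminfo_get_ram_bank_range(uint32_t bank,
                                           paddr_t *start, paddr_t *end)
{
    if (!banks[bank].ram || !start || !end)
        return -1;
    *start = banks[bank].start;
    *end = banks[bank].end;
    return 0;
}

/* Walk the banks the way architecture-independent code would. */
static paddr_t total_ram(void)
{
    paddr_t sum = 0, s, e;
    for (uint32_t i = 0; i < arch_meminfo_get_nr_bank(); i++)
        if (!arch_meminfo_get_ram_bank_range(i, &s, &e))
            sum += e - s;
    return sum;
}
```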
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/x86/numa.c        | 18 ++++++++++++++++++
>  xen/arch/x86/srat.c        | 11 ++++-------
>  xen/include/asm-x86/numa.h |  3 +++
>  3 files changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> index 6337bbdf31..6bc4ade411 100644
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -378,6 +378,24 @@ unsigned int arch_have_default_dmazone(void)
>      return ( num_online_nodes() > 1 ) ? 1 : 0;
>  }
>  
> +uint32_t __init arch_meminfo_get_nr_bank(void)
> +{
> +	return e820.nr_map;
> +}
> +
> +int __init arch_meminfo_get_ram_bank_range(uint32_t bank,
> +	paddr_t *start, paddr_t *end)
> +{
> +	if (e820.map[bank].type != E820_RAM || !start || !end) {
> +		return -1;
> +	}
> +
> +	*start = e820.map[bank].addr;
> +	*end = e820.map[bank].addr + e820.map[bank].size;
> +
> +	return 0;
> +}
> +
>  static void dump_numa(unsigned char key)
>  {
>      s_time_t now = NOW();
> diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> index 18bc6b19bb..aa07a7e975 100644
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -419,17 +419,14 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
>  static int __init nodes_cover_memory(void)
>  {
>  	int i;
> +	uint32_t nr_banks = arch_meminfo_get_nr_bank();
>  
> -	for (i = 0; i < e820.nr_map; i++) {
> +	for (i = 0; i < nr_banks; i++) {
>  		int j, found;
>  		paddr_t start, end;
>  
> -		if (e820.map[i].type != E820_RAM) {
> +		if (arch_meminfo_get_ram_bank_range(i, &start, &end))
>  			continue;
> -		}
> -
> -		start = e820.map[i].addr;
> -		end = e820.map[i].addr + e820.map[i].size;
>  
>  		do {
>  			found = 0;
> @@ -448,7 +445,7 @@ static int __init nodes_cover_memory(void)
>  		} while (found && start < end);
>  
>  		if (start < end) {
> -			printk(KERN_ERR "SRAT: No PXM for e820 range: "
> +			printk(KERN_ERR "SRAT: No NODE for memory map range: "
>  				"%"PRIpaddr" - %"PRIpaddr"\n", start, end);
>  			return 0;
>  		}
> diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
> index 5772a70665..78e044a390 100644
> --- a/xen/include/asm-x86/numa.h
> +++ b/xen/include/asm-x86/numa.h
> @@ -82,5 +82,8 @@ void srat_parse_regions(paddr_t addr);
>  extern u8 __node_distance(nodeid_t a, nodeid_t b);
>  unsigned int arch_get_dma_bitsize(void);
>  unsigned int arch_have_default_dmazone(void);
> +extern uint32_t arch_meminfo_get_nr_bank(void);
> +extern int arch_meminfo_get_ram_bank_range(uint32_t bank,
> +    paddr_t *start, paddr_t *end);
>  
>  #endif
> -- 
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 13/37] xen/x86: decouple processor_nodes_parsed from acpi numa functions
  2021-09-23 12:02 ` [PATCH 13/37] xen/x86: decouple processor_nodes_parsed from acpi numa functions Wei Chen
@ 2021-09-24  0:40   ` Stefano Stabellini
  2022-01-25  9:49   ` Jan Beulich
  1 sibling, 0 replies; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  0:40 UTC (permalink / raw)
  To: Wei Chen
  Cc: xen-devel, sstabellini, julien, Bertrand.Marquis, jbeulich,
	andrew.cooper3, roger.pau, wl

+x86 maintainers


On Thu, 23 Sep 2021, Wei Chen wrote:
> Xen uses processor_nodes_parsed to record processor nodes parsed
> from the ACPI table or another firmware-provided resource table.
> This variable is used directly in the ACPI NUMA functions. In
> follow-up patches, arch-neutral NUMA code will be abstracted and
> moved to other files. So in this patch, we introduce the
> numa_set_processor_nodes_parsed helper to decouple
> processor_nodes_parsed from the ACPI NUMA functions.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/x86/srat.c        | 9 +++++++--
>  xen/include/asm-x86/numa.h | 1 +
>  2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> index aa07a7e975..9276a52138 100644
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -104,6 +104,11 @@ nodeid_t setup_node(unsigned pxm)
>  	return node;
>  }
>  
> +void  __init numa_set_processor_nodes_parsed(nodeid_t node)
> +{
> +	node_set(node, processor_nodes_parsed);
> +}
> +
>  bool __init numa_memblks_available(void)
>  {
>  	if (num_node_memblks < NR_NODE_MEMBLKS)
> @@ -236,7 +241,7 @@ acpi_numa_x2apic_affinity_init(const struct acpi_srat_x2apic_cpu_affinity *pa)
>  	}
>  
>  	apicid_to_node[pa->apic_id] = node;
> -	node_set(node, processor_nodes_parsed);
> +	numa_set_processor_nodes_parsed(node);
>  	acpi_numa = 1;
>  
>  	if (opt_acpi_verbose)
> @@ -271,7 +276,7 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
>  		return;
>  	}
>  	apicid_to_node[pa->apic_id] = node;
> -	node_set(node, processor_nodes_parsed);
> +	numa_set_processor_nodes_parsed(node);
>  	acpi_numa = 1;
>  
>  	if (opt_acpi_verbose)
> diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
> index 78e044a390..295f875a51 100644
> --- a/xen/include/asm-x86/numa.h
> +++ b/xen/include/asm-x86/numa.h
> @@ -77,6 +77,7 @@ extern int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node);
>  extern bool numa_memblks_available(void);
>  extern int numa_update_node_memblks(nodeid_t node,
>  		paddr_t start, paddr_t size, bool hotplug);
> +extern void numa_set_processor_nodes_parsed(nodeid_t node);
>  
>  void srat_parse_regions(paddr_t addr);
>  extern u8 __node_distance(nodeid_t a, nodeid_t b);
> -- 
> 2.25.1
> 
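The decoupling pattern above can be sketched in isolation: the mask stays file-local and the helper becomes the only way outside code updates it. The nodemask type here is a simplified stand-in for Xen's real bitmap:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins: Xen's nodemask_t is a bitmap with one bit per
 * node; 64 nodes fit in one 64-bit word for this sketch. */
#define MAX_NUMNODES 64
typedef uint8_t nodeid_t;
typedef struct { uint64_t bits; } nodemask_t;

/* Keeping the mask static confines it to this file; the helper below
 * is the only way other code can update it, which is the decoupling
 * the patch introduces for processor_nodes_parsed in srat.c. */
static nodemask_t processor_nodes_parsed;

static void numa_set_processor_nodes_parsed(nodeid_t node)
{
    processor_nodes_parsed.bits |= UINT64_C(1) << node;
}

static int node_parsed(nodeid_t node)
{
    return (processor_nodes_parsed.bits >> node) & 1;
}
```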



* Re: [PATCH 14/37] xen/x86: use name fw_numa to replace acpi_numa
  2021-09-23 12:02 ` [PATCH 14/37] xen/x86: use name fw_numa to replace acpi_numa Wei Chen
@ 2021-09-24  0:40   ` Stefano Stabellini
  2022-01-25 10:12   ` Jan Beulich
  1 sibling, 0 replies; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  0:40 UTC (permalink / raw)
  To: Wei Chen
  Cc: xen-devel, sstabellini, julien, Bertrand.Marquis, jbeulich,
	andrew.cooper3, roger.pau, wl

+x86 maintainers


On Thu, 23 Sep 2021, Wei Chen wrote:
> Xen uses acpi_numa as a switch for ACPI-based NUMA. We want to
> reuse this switch logic for other firmware-based NUMA
> implementations, like the device tree based NUMA in follow-up
> patches. As Xen will never use both ACPI and device tree based
> NUMA at runtime, rename acpi_numa to the more generic name
> fw_numa. This will also allow the code to be mostly common.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/x86/numa.c        |  6 +++---
>  xen/arch/x86/setup.c       |  2 +-
>  xen/arch/x86/srat.c        | 10 +++++-----
>  xen/include/asm-x86/acpi.h |  2 +-
>  4 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> index 6bc4ade411..2ef385ae3f 100644
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -51,11 +51,11 @@ cpumask_t node_to_cpumask[MAX_NUMNODES] __read_mostly;
>  nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
>  
>  bool numa_off;
> -s8 acpi_numa = 0;
> +s8 fw_numa = 0;
>  
>  int srat_disabled(void)
>  {
> -    return numa_off || acpi_numa < 0;
> +    return numa_off || fw_numa < 0;
>  }
>  
>  /*
> @@ -315,7 +315,7 @@ static __init int numa_setup(const char *opt)
>      else if ( !strncmp(opt,"noacpi",6) )
>      {
>          numa_off = false;
> -        acpi_numa = -1;
> +        fw_numa = -1;
>      }
>  #endif
>      else
> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index b101565f14..1a2093b554 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -313,7 +313,7 @@ void srat_detect_node(int cpu)
>      node_set_online(node);
>      numa_set_node(cpu, node);
>  
> -    if ( opt_cpu_info && acpi_numa > 0 )
> +    if ( opt_cpu_info && fw_numa > 0 )
>          printk("CPU %d APIC %d -> Node %d\n", cpu, apicid, node);
>  }
>  
> diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> index 9276a52138..4921830f94 100644
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -167,7 +167,7 @@ static __init void bad_srat(void)
>  {
>  	int i;
>  	printk(KERN_ERR "SRAT: SRAT not used.\n");
> -	acpi_numa = -1;
> +	fw_numa = -1;
>  	for (i = 0; i < MAX_LOCAL_APIC; i++)
>  		apicid_to_node[i] = NUMA_NO_NODE;
>  	for (i = 0; i < ARRAY_SIZE(pxm2node); i++)
> @@ -242,7 +242,7 @@ acpi_numa_x2apic_affinity_init(const struct acpi_srat_x2apic_cpu_affinity *pa)
>  
>  	apicid_to_node[pa->apic_id] = node;
>  	numa_set_processor_nodes_parsed(node);
> -	acpi_numa = 1;
> +	fw_numa = 1;
>  
>  	if (opt_acpi_verbose)
>  		printk(KERN_INFO "SRAT: PXM %u -> APIC %08x -> Node %u\n",
> @@ -277,7 +277,7 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
>  	}
>  	apicid_to_node[pa->apic_id] = node;
>  	numa_set_processor_nodes_parsed(node);
> -	acpi_numa = 1;
> +	fw_numa = 1;
>  
>  	if (opt_acpi_verbose)
>  		printk(KERN_INFO "SRAT: PXM %u -> APIC %02x -> Node %u\n",
> @@ -492,7 +492,7 @@ void __init srat_parse_regions(paddr_t addr)
>  	u64 mask;
>  	unsigned int i;
>  
> -	if (acpi_disabled || acpi_numa < 0 ||
> +	if (acpi_disabled || fw_numa < 0 ||
>  	    acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat))
>  		return;
>  
> @@ -521,7 +521,7 @@ int __init acpi_scan_nodes(paddr_t start, paddr_t end)
>  	for (i = 0; i < MAX_NUMNODES; i++)
>  		cutoff_node(i, start, end);
>  
> -	if (acpi_numa <= 0)
> +	if (fw_numa <= 0)
>  		return -1;
>  
>  	if (!nodes_cover_memory()) {
> diff --git a/xen/include/asm-x86/acpi.h b/xen/include/asm-x86/acpi.h
> index 7032f3a001..83be71fec3 100644
> --- a/xen/include/asm-x86/acpi.h
> +++ b/xen/include/asm-x86/acpi.h
> @@ -101,7 +101,7 @@ extern unsigned long acpi_wakeup_address;
>  
>  #define ARCH_HAS_POWER_INIT	1
>  
> -extern s8 acpi_numa;
> +extern s8 fw_numa;
>  extern int acpi_scan_nodes(u64 start, u64 end);
>  #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
>  
> -- 
> 2.25.1
> 
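The renamed switch keeps the existing tri-state convention, which can be sketched standalone (the comments describe the convention as used in this series; the sketch is illustrative, not Xen code):

```c
#include <assert.h>

/* The fw_numa tri-state as used in the series:
 *   < 0  firmware NUMA disabled (bad tables, or "numa=noacpi")
 *  == 0  nothing parsed yet
 *   > 0  firmware (ACPI SRAT, or later DT) provided NUMA affinity
 */
static signed char fw_numa;
static int numa_off;

/* Mirrors srat_disabled() after the rename. */
static int srat_disabled(void)
{
    return numa_off || fw_numa < 0;
}
```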



* Re: [PATCH 15/37] xen/x86: rename acpi_scan_nodes to numa_scan_nodes
  2021-09-23 12:02 ` [PATCH 15/37] xen/x86: rename acpi_scan_nodes to numa_scan_nodes Wei Chen
@ 2021-09-24  0:40   ` Stefano Stabellini
  2022-01-25 10:17   ` Jan Beulich
  1 sibling, 0 replies; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  0:40 UTC (permalink / raw)
  To: Wei Chen
  Cc: xen-devel, sstabellini, julien, Bertrand.Marquis, jbeulich,
	andrew.cooper3, roger.pau, wl

+x86 maintainers


On Thu, 23 Sep 2021, Wei Chen wrote:
> Most of the code in acpi_scan_nodes can be reused by other NUMA
> implementations. Rename acpi_scan_nodes to the more generic name
> numa_scan_nodes, and replace "BIOS" with "Firmware" in the print
> message, as BIOS is an x86-specific term.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/x86/numa.c        | 2 +-
>  xen/arch/x86/srat.c        | 4 ++--
>  xen/include/asm-x86/acpi.h | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> index 2ef385ae3f..8a4710df39 100644
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -261,7 +261,7 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
>      end = pfn_to_paddr(end_pfn);
>  
>  #ifdef CONFIG_ACPI_NUMA
> -    if ( !numa_off && !acpi_scan_nodes(start, end) )
> +    if ( !numa_off && !numa_scan_nodes(start, end) )
>          return;
>  #endif
>  
> diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> index 4921830f94..0b8b0b0c95 100644
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -512,7 +512,7 @@ void __init srat_parse_regions(paddr_t addr)
>  }
>  
>  /* Use the information discovered above to actually set up the nodes. */
> -int __init acpi_scan_nodes(paddr_t start, paddr_t end)
> +int __init numa_scan_nodes(paddr_t start, paddr_t end)
>  {
>  	int i;
>  	nodemask_t all_nodes_parsed;
> @@ -547,7 +547,7 @@ int __init acpi_scan_nodes(paddr_t start, paddr_t end)
>  		paddr_t size = nodes[i].end - nodes[i].start;
>  		if ( size == 0 )
>  			printk(KERN_WARNING "SRAT: Node %u has no memory. "
> -			       "BIOS Bug or mis-configured hardware?\n", i);
> +			       "Firmware Bug or mis-configured hardware?\n", i);
>  
>  		setup_node_bootmem(i, nodes[i].start, nodes[i].end);
>  	}
> diff --git a/xen/include/asm-x86/acpi.h b/xen/include/asm-x86/acpi.h
> index 83be71fec3..2add971072 100644
> --- a/xen/include/asm-x86/acpi.h
> +++ b/xen/include/asm-x86/acpi.h
> @@ -102,7 +102,7 @@ extern unsigned long acpi_wakeup_address;
>  #define ARCH_HAS_POWER_INIT	1
>  
>  extern s8 fw_numa;
> -extern int acpi_scan_nodes(u64 start, u64 end);
> +extern int numa_scan_nodes(u64 start, u64 end);
>  #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
>  
>  extern struct acpi_sleep_info acpi_sinfo;
> -- 
> 2.25.1
> 



* Re: [PATCH 16/37] xen/x86: export srat_bad to external
  2021-09-23 12:02 ` [PATCH 16/37] xen/x86: export srat_bad to external Wei Chen
@ 2021-09-24  0:41   ` Stefano Stabellini
  2022-01-25 10:22   ` Jan Beulich
  1 sibling, 0 replies; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  0:41 UTC (permalink / raw)
  To: Wei Chen
  Cc: xen-devel, sstabellini, julien, Bertrand.Marquis, jbeulich,
	andrew.cooper3, roger.pau, wl

+x86 maintainers


On Thu, 23 Sep 2021, Wei Chen wrote:
> bad_srat is used when the NUMA initialization code fails to scan the
> SRAT. It turns fw_numa to disabled status. Its implementation depends
> on the NUMA implementation. We want every NUMA implementation to
> provide this function for the common initialization code.
> 
> In this patch, we export bad_srat. This will allow the code to be
> mostly common.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/x86/srat.c        | 2 +-
>  xen/include/asm-x86/numa.h | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> index 0b8b0b0c95..94bd5b34da 100644
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -163,7 +163,7 @@ static __init void cutoff_node(int i, paddr_t start, paddr_t end)
>  	}
>  }
>  
> -static __init void bad_srat(void)
> +__init void bad_srat(void)
>  {
>  	int i;
>  	printk(KERN_ERR "SRAT: SRAT not used.\n");
> diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
> index 295f875a51..a5690a7098 100644
> --- a/xen/include/asm-x86/numa.h
> +++ b/xen/include/asm-x86/numa.h
> @@ -32,6 +32,7 @@ extern bool numa_off;
>  
>  
>  extern int srat_disabled(void);
> +extern void bad_srat(void);
>  extern void numa_set_node(int cpu, nodeid_t node);
>  extern nodeid_t setup_node(unsigned int pxm);
>  extern void srat_detect_node(int cpu);
> -- 
> 2.25.1
> 



* Re: [PATCH 17/37] xen/x86: use CONFIG_NUMA to gate numa_scan_nodes
  2021-09-23 12:02 ` [PATCH 17/37] xen/x86: use CONFIG_NUMA to gate numa_scan_nodes Wei Chen
@ 2021-09-24  0:41   ` Stefano Stabellini
  2022-01-25 10:26   ` Jan Beulich
  1 sibling, 0 replies; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  0:41 UTC (permalink / raw)
  To: Wei Chen
  Cc: xen-devel, sstabellini, julien, Bertrand.Marquis, jbeulich,
	andrew.cooper3, roger.pau, wl

+x86 maintainers


On Thu, 23 Sep 2021, Wei Chen wrote:
> Now that numa_scan_nodes has been turned into a neutral function,
> it no longer makes sense to gate it with CONFIG_ACPI_NUMA in
> numa_initmem_init. As CONFIG_ACPI_NUMA is selected by CONFIG_NUMA
> on x86, this patch replaces CONFIG_ACPI_NUMA with CONFIG_NUMA.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/x86/numa.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> index 8a4710df39..509d2738c0 100644
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -260,7 +260,7 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
>      start = pfn_to_paddr(start_pfn);
>      end = pfn_to_paddr(end_pfn);
>  
> -#ifdef CONFIG_ACPI_NUMA
> +#ifdef CONFIG_NUMA
>      if ( !numa_off && !numa_scan_nodes(start, end) )
>          return;
>  #endif
> -- 
> 2.25.1
> 



* Re: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture
  2021-09-23 12:02 ` [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture Wei Chen
@ 2021-09-24  1:15   ` Stefano Stabellini
  2021-09-24  4:34     ` Wei Chen
  2022-01-25 10:34   ` Jan Beulich
  1 sibling, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  1:15 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> Some architectures do not support EFI, but the EFI API will be used
> in some common features. Instead of spreading #ifdef ARCH around,
> we introduce this Kconfig option to give Xen the ability to stub
> the EFI API for architectures without EFI support.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/arm/Kconfig  |  1 +
>  xen/arch/arm/Makefile |  2 +-
>  xen/arch/x86/Kconfig  |  1 +
>  xen/common/Kconfig    | 11 +++++++++++
>  xen/include/xen/efi.h |  4 ++++
>  5 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig
> index ecfa6822e4..865ad83a89 100644
> --- a/xen/arch/arm/Kconfig
> +++ b/xen/arch/arm/Kconfig
> @@ -6,6 +6,7 @@ config ARM_64
>  	def_bool y
>  	depends on !ARM_32
>  	select 64BIT
> +	select EFI
>  	select HAS_FAST_MULTIPLY
>  
>  config ARM
> diff --git a/xen/arch/arm/Makefile b/xen/arch/arm/Makefile
> index 3d3b97b5b4..ae4efbf76e 100644
> --- a/xen/arch/arm/Makefile
> +++ b/xen/arch/arm/Makefile
> @@ -1,6 +1,6 @@
>  obj-$(CONFIG_ARM_32) += arm32/
>  obj-$(CONFIG_ARM_64) += arm64/
> -obj-$(CONFIG_ARM_64) += efi/
> +obj-$(CONFIG_EFI) += efi/
>  obj-$(CONFIG_ACPI) += acpi/
>  ifneq ($(CONFIG_NO_PLAT),y)
>  obj-y += platforms/
> diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
> index 28d13b9705..b9ed187f6b 100644
> --- a/xen/arch/x86/Kconfig
> +++ b/xen/arch/x86/Kconfig
> @@ -10,6 +10,7 @@ config X86
>  	select ALTERNATIVE_CALL
>  	select ARCH_SUPPORTS_INT128
>  	select CORE_PARKING
> +	select EFI
>  	select HAS_ALTERNATIVE
>  	select HAS_COMPAT
>  	select HAS_CPUFREQ
> diff --git a/xen/common/Kconfig b/xen/common/Kconfig
> index 9ebb1c239b..f998746a1a 100644
> --- a/xen/common/Kconfig
> +++ b/xen/common/Kconfig
> @@ -11,6 +11,16 @@ config COMPAT
>  config CORE_PARKING
>  	bool
>  
> +config EFI
> +	bool

Without the title the option is not user-selectable (or de-selectable).
So the help message below can never be seen.

Either add a title, e.g.:

bool "EFI support"

Or fully make the option a silent option by removing the help text.
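For reference, the two alternatives might look like this (a sketch, not the final Kconfig text):

```kconfig
# Alternative 1: user-visible option with a title, so the help is shown:
config EFI
	bool "EFI support"
	---help---
	  Support for runtime services provided by UEFI firmware.

# Alternative 2: silent (select-only) option: no title, no help text:
config EFI
	bool
```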



> +	---help---
> +      This option provides support for runtime services provided
> +      by UEFI firmware (such as non-volatile variables, realtime
> +      clock, and platform reset). A UEFI stub is also provided to
> +      allow the kernel to be booted as an EFI application. This
> +      is only useful for kernels that may run on systems that have
> +      UEFI firmware.
> +
>  config GRANT_TABLE
>  	bool "Grant table support" if EXPERT
>  	default y
> @@ -196,6 +206,7 @@ config KEXEC
>  
>  config EFI_SET_VIRTUAL_ADDRESS_MAP
>      bool "EFI: call SetVirtualAddressMap()" if EXPERT
> +    depends on EFI
>      ---help---
>        Call EFI SetVirtualAddressMap() runtime service to setup memory map for
>        further runtime services. According to UEFI spec, it isn't strictly
> diff --git a/xen/include/xen/efi.h b/xen/include/xen/efi.h
> index 94a7e547f9..661a48286a 100644
> --- a/xen/include/xen/efi.h
> +++ b/xen/include/xen/efi.h
> @@ -25,6 +25,8 @@ extern struct efi efi;
>  
>  #ifndef __ASSEMBLY__
>  
> +#ifdef CONFIG_EFI
> +
>  union xenpf_efi_info;
>  union compat_pf_efi_info;
>  
> @@ -45,6 +47,8 @@ int efi_runtime_call(struct xenpf_efi_runtime_call *);
>  int efi_compat_get_info(uint32_t idx, union compat_pf_efi_info *);
>  int efi_compat_runtime_call(struct compat_pf_efi_runtime_call *);
>  
> +#endif /* CONFIG_EFI*/
> +
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* __XEN_EFI_H__ */
> -- 
> 2.25.1
> 



* Re: [PATCH 21/37] xen/arm: Keep memory nodes in dtb for NUMA when boot from EFI
  2021-09-23 12:02 ` [PATCH 21/37] xen/arm: Keep memory nodes in dtb for NUMA when boot from EFI Wei Chen
@ 2021-09-24  1:23   ` Stefano Stabellini
  2021-09-24  4:36     ` Wei Chen
  2022-01-25 10:38   ` Jan Beulich
  1 sibling, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  1:23 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> EFI can get the memory map from the EFI system table. But the EFI
> system table doesn't contain memory NUMA information; EFI depends
> on the ACPI SRAT or device tree memory nodes to parse the memory
> blocks' NUMA mapping.
> 
> But in the current code, when Xen boots from EFI, it deletes all
> memory nodes in the device tree. So in a UEFI + DTB boot, we no
> longer have numa-node-id for the memory blocks.
> 
> So in this patch, we keep the memory nodes in the device tree for
> the NUMA code to parse the memory numa-node-id later.
> 
> As a side effect, if we still parse boot memory information in
> early_scan_node, bootinfo.mem would accumulate the memory ranges
> from the memory nodes twice. So we have to prevent early_scan_node
> from parsing memory nodes in an EFI boot.
> 
> As EFI APIs only can be used in Arm64, so we introduced a stub
> API for non-EFI supported Arm32. This will prevent

This last sentence is incomplete.

But aside from that, this patch looks good to me.


> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/arm/bootfdt.c      |  8 +++++++-
>  xen/arch/arm/efi/efi-boot.h | 25 -------------------------
>  xen/include/xen/efi.h       |  7 +++++++
>  3 files changed, 14 insertions(+), 26 deletions(-)
> 
> diff --git a/xen/arch/arm/bootfdt.c b/xen/arch/arm/bootfdt.c
> index afaa0e249b..6bc5a465ec 100644
> --- a/xen/arch/arm/bootfdt.c
> +++ b/xen/arch/arm/bootfdt.c
> @@ -11,6 +11,7 @@
>  #include <xen/lib.h>
>  #include <xen/kernel.h>
>  #include <xen/init.h>
> +#include <xen/efi.h>
>  #include <xen/device_tree.h>
>  #include <xen/libfdt/libfdt.h>
>  #include <xen/sort.h>
> @@ -370,7 +371,12 @@ static int __init early_scan_node(const void *fdt,
>  {
>      int rc = 0;
>  
> -    if ( device_tree_node_matches(fdt, node, "memory") )
> +    /*
> +     * If Xen has been booted via UEFI, the memory banks will already
> +     * be populated. So we should skip the parsing.
> +     */
> +    if ( !efi_enabled(EFI_BOOT) &&
> +         device_tree_node_matches(fdt, node, "memory"))
>          rc = process_memory_node(fdt, node, name, depth,
>                                   address_cells, size_cells, &bootinfo.mem);
>      else if ( depth == 1 && !dt_node_cmp(name, "reserved-memory") )
> diff --git a/xen/arch/arm/efi/efi-boot.h b/xen/arch/arm/efi/efi-boot.h
> index cf9c37153f..d0a9987fa4 100644
> --- a/xen/arch/arm/efi/efi-boot.h
> +++ b/xen/arch/arm/efi/efi-boot.h
> @@ -197,33 +197,8 @@ EFI_STATUS __init fdt_add_uefi_nodes(EFI_SYSTEM_TABLE *sys_table,
>      int status;
>      u32 fdt_val32;
>      u64 fdt_val64;
> -    int prev;
>      int num_rsv;
>  
> -    /*
> -     * Delete any memory nodes present.  The EFI memory map is the only
> -     * memory description provided to Xen.
> -     */
> -    prev = 0;
> -    for (;;)
> -    {
> -        const char *type;
> -        int len;
> -
> -        node = fdt_next_node(fdt, prev, NULL);
> -        if ( node < 0 )
> -            break;
> -
> -        type = fdt_getprop(fdt, node, "device_type", &len);
> -        if ( type && strncmp(type, "memory", len) == 0 )
> -        {
> -            fdt_del_node(fdt, node);
> -            continue;
> -        }
> -
> -        prev = node;
> -    }
> -
>     /*
>      * Delete all memory reserve map entries. When booting via UEFI,
>      * kernel will use the UEFI memory map to find reserved regions.
> diff --git a/xen/include/xen/efi.h b/xen/include/xen/efi.h
> index 661a48286a..b52a4678e9 100644
> --- a/xen/include/xen/efi.h
> +++ b/xen/include/xen/efi.h
> @@ -47,6 +47,13 @@ int efi_runtime_call(struct xenpf_efi_runtime_call *);
>  int efi_compat_get_info(uint32_t idx, union compat_pf_efi_info *);
>  int efi_compat_runtime_call(struct compat_pf_efi_runtime_call *);
>  
> +#else
> +
> +static inline bool efi_enabled(unsigned int feature)
> +{
> +    return false;
> +}
> +
>  #endif /* CONFIG_EFI*/
>  
>  #endif /* !__ASSEMBLY__ */
> -- 
> 2.25.1
> 
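The stub pattern this patch relies on can be sketched as a standalone unit. Without CONFIG_EFI the predicate is a constant-false static inline, so common code keeps compiling and the compiler can discard EFI-only branches. The EFI_BOOT value here is illustrative, not Xen's real flag:

```c
#include <assert.h>
#include <stdbool.h>

#define EFI_BOOT 0   /* illustrative feature-bit index */

#ifdef CONFIG_EFI
bool efi_enabled(unsigned int feature);   /* real implementation elsewhere */
#else
/* The stub from the patch: always false on non-EFI builds. */
static inline bool efi_enabled(unsigned int feature)
{
    (void)feature;
    return false;
}
#endif

/* Mirrors the early_scan_node() guard: DT memory nodes are only
 * parsed when Xen was not booted via EFI. */
static bool should_parse_dt_memory_node(void)
{
    return !efi_enabled(EFI_BOOT);
}
```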



* RE: [PATCH 02/37] xen: introduce a Kconfig option to configure NUMA nodes number
  2021-09-23 23:45   ` Stefano Stabellini
@ 2021-09-24  1:24     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24  1:24 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 2021年9月24日 7:45
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 02/37] xen: introduce a Kconfig option to configure
> NUMA nodes number
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > The current number of NUMA nodes is a hardcoded configuration,
> > which is difficult for an administrator to change without
> > changing the code.
> >
> > So in this patch, we introduce this new Kconfig option for
> > administrators to change the NUMA nodes number conveniently.
> > Also, considering that not all architectures support NUMA,
> > this Kconfig option is only visible on NUMA-enabled
> > architectures. Architectures without NUMA support can still
> > use 1 as MAX_NUMNODES.
> 
> This is OK but I think you should also mention in the commit message
> that you are taking the opportunity to remove NODES_SHIFT because it is
> currently unused.
> 
> With that:
> 
> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
> 
> 

Thanks, I will update it in next version.

> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/Kconfig           | 11 +++++++++++
> >  xen/include/asm-x86/numa.h |  2 --
> >  xen/include/xen/numa.h     | 10 +++++-----
> >  3 files changed, 16 insertions(+), 7 deletions(-)
> >
> > diff --git a/xen/arch/Kconfig b/xen/arch/Kconfig
> > index f16eb0df43..8a20da67ed 100644
> > --- a/xen/arch/Kconfig
> > +++ b/xen/arch/Kconfig
> > @@ -17,3 +17,14 @@ config NR_CPUS
> >  	  For CPU cores which support Simultaneous Multi-Threading or
> similar
> >  	  technologies, this the number of logical threads which Xen will
> >  	  support.
> > +
> > +config NR_NUMA_NODES
> > +	int "Maximum number of NUMA nodes supported"
> > +	range 1 4095
> > +	default "64"
> > +	depends on NUMA
> > +	help
> > +	  Controls the build-time size of various arrays and bitmaps
> > +	  associated with multiple-nodes management. It is the upper bound
> of
> > +	  the number of NUMA nodes the scheduler, memory allocation and
> other
> > +	  NUMA-aware components can handle.
> > diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
> > index bada2c0bb9..3cf26c2def 100644
> > --- a/xen/include/asm-x86/numa.h
> > +++ b/xen/include/asm-x86/numa.h
> > @@ -3,8 +3,6 @@
> >
> >  #include <xen/cpumask.h>
> >
> > -#define NODES_SHIFT 6
> > -
> >  typedef u8 nodeid_t;
> >
> >  extern int srat_rev;
> > diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
> > index 7aef1a88dc..52950a3150 100644
> > --- a/xen/include/xen/numa.h
> > +++ b/xen/include/xen/numa.h
> > @@ -3,14 +3,14 @@
> >
> >  #include <asm/numa.h>
> >
> > -#ifndef NODES_SHIFT
> > -#define NODES_SHIFT     0
> > -#endif
> > -
> >  #define NUMA_NO_NODE     0xFF
> >  #define NUMA_NO_DISTANCE 0xFF
> >
> > -#define MAX_NUMNODES    (1 << NODES_SHIFT)
> > +#ifdef CONFIG_NR_NUMA_NODES
> > +#define MAX_NUMNODES CONFIG_NR_NUMA_NODES
> > +#else
> > +#define MAX_NUMNODES    1
> > +#endif
> >
> >  #define vcpu_to_node(v) (cpu_to_node((v)->processor))
> >
> > --
> > 2.25.1
> >
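The MAX_NUMNODES fallback the quoted hunk introduces can be sketched on its own; CONFIG_NR_NUMA_NODES would normally come from Kconfig on NUMA-enabled architectures, and its absence here exercises the single-node fallback:

```c
#include <assert.h>

/* Sketch of the fallback in xen/include/xen/numa.h after the patch:
 * take the Kconfig value when present, otherwise assume one node. */
#ifdef CONFIG_NR_NUMA_NODES
#define MAX_NUMNODES CONFIG_NR_NUMA_NODES
#else
#define MAX_NUMNODES 1
#endif
```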


* Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS
  2021-09-23 12:02 ` [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS Wei Chen
@ 2021-09-24  1:34   ` Stefano Stabellini
  2021-09-26 13:13     ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  1:34 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> A memory range described in device tree cannot be split across
> multiple nodes, so we define NR_NODE_MEMBLKS as NR_MEM_BANKS in
> the arch header.

This statement is true but what is the goal of this patch? Is it to
reduce code size and memory consumption?

I am asking because NR_MEM_BANKS is 128, and NR_NODE_MEMBLKS is
2*MAX_NUMNODES, where MAX_NUMNODES is 64 by default, so
NR_NODE_MEMBLKS is also 128 before this patch.

In other words, this patch alone doesn't make any difference; at least
doesn't make any difference unless CONFIG_NR_NUMA_NODES is increased.

So, is the goal to reduce memory usage when CONFIG_NR_NUMA_NODES is
higher than 64?
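The arithmetic behind this question can be checked in isolation (constants copied from this thread; a toy check, not Xen code):

```c
#include <assert.h>

/* Numbers quoted in the discussion: NR_MEM_BANKS is 128 on Arm, and
 * the default CONFIG_NR_NUMA_NODES is 64. */
#define NR_MEM_BANKS            128
#define MAX_NUMNODES            64                  /* Kconfig default */
#define NR_NODE_MEMBLKS_COMMON  (MAX_NUMNODES * 2)  /* common header */
#define NR_NODE_MEMBLKS_ARM     NR_MEM_BANKS        /* after this patch */
```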


> And keep default NR_NODE_MEMBLKS in common header
> for those architectures NUMA is disabled.

This last sentence is not accurate: on x86 NUMA is enabled and
NR_NODE_MEMBLKS is still defined in xen/include/xen/numa.h (there is no
x86 definition of it)


> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/include/asm-arm/numa.h | 8 +++++++-
>  xen/include/xen/numa.h     | 2 ++
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> index 8f1c67e3eb..21569e634b 100644
> --- a/xen/include/asm-arm/numa.h
> +++ b/xen/include/asm-arm/numa.h
> @@ -3,9 +3,15 @@
>  
>  #include <xen/mm.h>
>  
> +#include <asm/setup.h>
> +
>  typedef u8 nodeid_t;
>  
> -#ifndef CONFIG_NUMA
> +#ifdef CONFIG_NUMA
> +
> +#define NR_NODE_MEMBLKS NR_MEM_BANKS
> +
> +#else
>  
>  /* Fake one node for now. See also node_online_map. */
>  #define cpu_to_node(cpu) 0
> diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
> index 1978e2be1b..1731e1cc6b 100644
> --- a/xen/include/xen/numa.h
> +++ b/xen/include/xen/numa.h
> @@ -12,7 +12,9 @@
>  #define MAX_NUMNODES    1
>  #endif
>  
> +#ifndef NR_NODE_MEMBLKS
>  #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> +#endif
>  
>  #define vcpu_to_node(v) (cpu_to_node((v)->processor))
>  
> -- 
> 2.25.1
> 



* Re: [PATCH 23/37] xen/arm: implement node distance helpers for Arm
  2021-09-23 12:02 ` [PATCH 23/37] xen/arm: implement node distance helpers for Arm Wei Chen
@ 2021-09-24  1:46   ` Stefano Stabellini
  2021-09-24  4:41     ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  1:46 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> We will parse NUMA node distances from the device tree or ACPI
> table, so we need a matrix to record the distances between any
> two nodes we parsed. Accordingly, this patch provides the
> numa_set_distance API for device tree or ACPI table parsers to
> set the distance between any two nodes.
> When NUMA initialization has failed, __node_distance will return
> NUMA_REMOTE_DISTANCE; this helps us avoid rolling back the
> distance matrix when NUMA initialization fails.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/arm/Makefile      |  1 +
>  xen/arch/arm/numa.c        | 69 ++++++++++++++++++++++++++++++++++++++
>  xen/include/asm-arm/numa.h | 13 +++++++
>  3 files changed, 83 insertions(+)
>  create mode 100644 xen/arch/arm/numa.c
> 
> diff --git a/xen/arch/arm/Makefile b/xen/arch/arm/Makefile
> index ae4efbf76e..41ca311b6b 100644
> --- a/xen/arch/arm/Makefile
> +++ b/xen/arch/arm/Makefile
> @@ -35,6 +35,7 @@ obj-$(CONFIG_LIVEPATCH) += livepatch.o
>  obj-y += mem_access.o
>  obj-y += mm.o
>  obj-y += monitor.o
> +obj-$(CONFIG_NUMA) += numa.o
>  obj-y += p2m.o
>  obj-y += percpu.o
>  obj-y += platform.o
> diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> new file mode 100644
> index 0000000000..3f08870d69
> --- /dev/null
> +++ b/xen/arch/arm/numa.c
> @@ -0,0 +1,69 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Arm Architecture support layer for NUMA.
> + *
> + * Copyright (C) 2021 Arm Ltd
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + *
> + */
> +#include <xen/init.h>
> +#include <xen/numa.h>
> +
> +static uint8_t __read_mostly
> +node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
> +    { 0 }
> +};
> +
> +void __init numa_set_distance(nodeid_t from, nodeid_t to, uint32_t distance)
> +{
> +    if ( from >= MAX_NUMNODES || to >= MAX_NUMNODES )
> +    {
> +        printk(KERN_WARNING
> +               "NUMA: invalid nodes: from=%"PRIu8" to=%"PRIu8" MAX=%"PRIu8"\n",
> +               from, to, MAX_NUMNODES);
> +        return;
> +    }
> +
> +    /* NUMA defines 0xff as an unreachable node and 0-9 are undefined */
> +    if ( distance >= NUMA_NO_DISTANCE ||
> +        (distance >= NUMA_DISTANCE_UDF_MIN &&
> +         distance <= NUMA_DISTANCE_UDF_MAX) ||
> +        (from == to && distance != NUMA_LOCAL_DISTANCE) )
> +    {
> +        printk(KERN_WARNING
> +               "NUMA: invalid distance: from=%"PRIu8" to=%"PRIu8" distance=%"PRIu32"\n",
> +               from, to, distance);
> +        return;
> +    }
> +
> +    node_distance_map[from][to] = distance;
> +}
> +
> +uint8_t __node_distance(nodeid_t from, nodeid_t to)
> +{
> +    /* When NUMA is off, any distance will be treated as remote. */
> +    if ( srat_disabled() )

Given that this is Arm-specific code and that SRAT is an ACPI-specific
concept, I don't think we should have any call to something called
"srat_disabled" here.

I suggest renaming srat_disabled to numa_distance_disabled.

Other than that, this patch looks OK to me.


> +        return NUMA_REMOTE_DISTANCE;
> +
> +    /*
> +     * Check whether the nodes are in the matrix range.
> +     * When any node is out of range, except from and to nodes are the
> +     * same, we treat them as unreachable (return 0xFF)
> +     */
> +    if ( from >= MAX_NUMNODES || to >= MAX_NUMNODES )
> +        return from == to ? NUMA_LOCAL_DISTANCE : NUMA_NO_DISTANCE;
> +
> +    return node_distance_map[from][to];
> +}
> +EXPORT_SYMBOL(__node_distance);
> diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> index 21569e634b..758eafeb05 100644
> --- a/xen/include/asm-arm/numa.h
> +++ b/xen/include/asm-arm/numa.h
> @@ -9,8 +9,21 @@ typedef u8 nodeid_t;
>  
>  #ifdef CONFIG_NUMA
>  
> +/*
> + * In ACPI spec, 0-9 are the reserved values for node distance,
> + * 10 indicates local node distance, 20 indicates remote node
> + * distance. Set node distance map in device tree will follow
> + * the ACPI's definition.
> + */
> +#define NUMA_DISTANCE_UDF_MIN   0
> +#define NUMA_DISTANCE_UDF_MAX   9
> +#define NUMA_LOCAL_DISTANCE     10
> +#define NUMA_REMOTE_DISTANCE    20
> +
>  #define NR_NODE_MEMBLKS NR_MEM_BANKS
>  
> +extern void numa_set_distance(nodeid_t from, nodeid_t to, uint32_t distance);
> +
>  #else
>  
>  /* Fake one node for now. See also node_online_map. */
> -- 
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 04/37] xen: introduce an arch helper for default dma zone status
  2021-09-23 23:55   ` Stefano Stabellini
@ 2021-09-24  1:50     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24  1:50 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: September 24, 2021 7:56
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 04/37] xen: introduce an arch helper for default dma
> zone status
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > In current code, when Xen is running in a multiple nodes NUMA
> > system, it will set dma_bitsize in end_boot_allocator to reserve
> > some low address memory for DMA.
> >
> > There are some x86 implications in current implementation. Becuase
>                                     ^ the                    ^Because
> 
> > on x86, memory starts from 0. On a multiple nodes NUMA system, if
> > a single node contains the majority or all of the DMA memory. x86
>                                                               ^,
> 
> > prefer to give out memory from non-local allocations rather than
> > exhausting the DMA memory ranges. Hence x86 use dma_bitsize to set
> > aside some largely arbitrary amount memory for DMA memory ranges.
>                                      ^ of memory
> 
> > The allocations from these memory ranges would happen only after
> > exhausting all other nodes' memory.
> >
> > But the implications are not shared across all architectures. For
> > example, Arm doesn't have these implications. So in this patch, we
> > introduce an arch_have_default_dmazone helper for arch to determine
> > that it need to set dma_bitsize for reserve DMA allocations or not.
>           ^ needs
> 

I will fix the above typos in the next version.

> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/x86/numa.c        | 5 +++++
> >  xen/common/page_alloc.c    | 2 +-
> >  xen/include/asm-arm/numa.h | 5 +++++
> >  xen/include/asm-x86/numa.h | 1 +
> >  4 files changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> > index ce79ee44ce..1fabbe8281 100644
> > --- a/xen/arch/x86/numa.c
> > +++ b/xen/arch/x86/numa.c
> > @@ -371,6 +371,11 @@ unsigned int __init arch_get_dma_bitsize(void)
> >                   + PAGE_SHIFT, 32);
> >  }
> >
> > +unsigned int arch_have_default_dmazone(void)
> 
> Can this function return bool?
> Also, can it be a static inline?
> 

Yes, bool would be better. I will place a static inline in asm/numa.h,
because Arm will have another static inline implementation.

> 
> > +{
> > +    return ( num_online_nodes() > 1 ) ? 1 : 0;
> > +}
> > +
> >  static void dump_numa(unsigned char key)
> >  {
> >      s_time_t now = NOW();
> > diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> > index 5801358b4b..80916205e5 100644
> > --- a/xen/common/page_alloc.c
> > +++ b/xen/common/page_alloc.c
> > @@ -1889,7 +1889,7 @@ void __init end_boot_allocator(void)
> >      }
> >      nr_bootmem_regions = 0;
> >
> > -    if ( !dma_bitsize && (num_online_nodes() > 1) )
> > +    if ( !dma_bitsize && arch_have_default_dmazone() )
> >          dma_bitsize = arch_get_dma_bitsize();
> >
> >      printk("Domain heap initialised");
> > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > index 31a6de4e23..9d5739542d 100644
> > --- a/xen/include/asm-arm/numa.h
> > +++ b/xen/include/asm-arm/numa.h
> > @@ -25,6 +25,11 @@ extern mfn_t first_valid_mfn;
> >  #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
> >  #define __node_distance(a, b) (20)
> >
> > +static inline unsigned int arch_have_default_dmazone(void)
> > +{
> > +    return 0;
> > +}
> > +
> >  #endif /* __ARCH_ARM_NUMA_H */
> >  /*
> >   * Local variables:
> > diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
> > index 3cf26c2def..8060cbf3f4 100644
> > --- a/xen/include/asm-x86/numa.h
> > +++ b/xen/include/asm-x86/numa.h
> > @@ -78,5 +78,6 @@ extern int valid_numa_range(u64 start, u64 end,
> nodeid_t node);
> >  void srat_parse_regions(u64 addr);
> >  extern u8 __node_distance(nodeid_t a, nodeid_t b);
> >  unsigned int arch_get_dma_bitsize(void);
> > +unsigned int arch_have_default_dmazone(void);
> >
> >  #endif
> > --
> > 2.25.1
> >


* Re: [PATCH 24/37] xen/arm: implement two arch helpers to get memory map info
  2021-09-23 12:02 ` [PATCH 24/37] xen/arm: implement two arch helpers to get memory map info Wei Chen
@ 2021-09-24  2:06   ` Stefano Stabellini
  2021-09-24  4:42     ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  2:06 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> These two helpers are architecture APIs that are required by
> nodes_cover_memory.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/arm/numa.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> index 3f08870d69..3755b01ef4 100644
> --- a/xen/arch/arm/numa.c
> +++ b/xen/arch/arm/numa.c
> @@ -67,3 +67,17 @@ uint8_t __node_distance(nodeid_t from, nodeid_t to)
>      return node_distance_map[from][to];
>  }
>  EXPORT_SYMBOL(__node_distance);
> +
> +uint32_t __init arch_meminfo_get_nr_bank(void)
> +{
> +	return bootinfo.mem.nr_banks;
> +}
> +
> +int __init arch_meminfo_get_ram_bank_range(uint32_t bank,
> +	paddr_t *start, paddr_t *end)
> +{
> +	*start = bootinfo.mem.bank[bank].start;
> +	*end = bootinfo.mem.bank[bank].start + bootinfo.mem.bank[bank].size;
> +
> +	return 0;
> +}

The rest of the file is indented using spaces, while this patch is using
tabs.

Also, given the implementation, it looks like
arch_meminfo_get_ram_bank_range should either return void or bool.



* Re: [PATCH 25/37] xen/arm: implement bad_srat for Arm NUMA initialization
  2021-09-23 12:02 ` [PATCH 25/37] xen/arm: implement bad_srat for Arm NUMA initialization Wei Chen
@ 2021-09-24  2:09   ` Stefano Stabellini
  2021-09-24  4:45     ` Wei Chen
  2021-09-24  8:07     ` Jan Beulich
  0 siblings, 2 replies; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  2:09 UTC (permalink / raw)
  To: Wei Chen
  Cc: xen-devel, sstabellini, julien, Bertrand.Marquis, jbeulich,
	andrew.cooper3, roger.pau, wl

On Thu, 23 Sep 2021, Wei Chen wrote:
> NUMA initialization will parse information from firmware provided
> static resource affinity table (ACPI SRAT or DTB). bad_srat if a
> function that will be used when initialization code encounters
> some unexcepted errors.
> 
> In this patch, we introduce Arm version bad_srat for NUMA common
> initialization code to invoke it.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/arm/numa.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> index 3755b01ef4..5209d3de4d 100644
> --- a/xen/arch/arm/numa.c
> +++ b/xen/arch/arm/numa.c
> @@ -18,6 +18,7 @@
>   *
>   */
>  #include <xen/init.h>
> +#include <xen/nodemask.h>
>  #include <xen/numa.h>
>  
>  static uint8_t __read_mostly
> @@ -25,6 +26,12 @@ node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
>      { 0 }
>  };
>  
> +__init void bad_srat(void)
> +{
> +    printk(KERN_ERR "NUMA: Firmware SRAT table not used.\n");
> +    fw_numa = -1;
> +}

I realize that the series keeps the "srat" terminology everywhere, even
in the DT code. I wonder if it is worth replacing srat with something
like "numa_distance" everywhere as appropriate. I am adding the x86
maintainers for an opinion.

If you guys prefer to keep "srat" (if nothing else, it is concise), I
am also OK with that, although it is not technically accurate.



* Re: [PATCH 26/37] xen/arm: build NUMA cpu_to_node map in dt_smp_init_cpus
  2021-09-23 12:02 ` [PATCH 26/37] xen/arm: build NUMA cpu_to_node map in dt_smp_init_cpus Wei Chen
@ 2021-09-24  2:26   ` Stefano Stabellini
  2021-09-24  4:25     ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  2:26 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> NUMA implementation has a cpu_to_node array to store CPU to NODE
> map. Xen is using CPU logical ID in runtime components, so we
> use CPU logical ID as CPU index in cpu_to_node.
> 
> In device tree case, cpu_logical_map is created in dt_smp_init_cpus.
> So, when NUMA is enabled, dt_smp_init_cpus will fetch CPU NUMA id
> at the same time for cpu_to_node.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/arm/smpboot.c     | 37 ++++++++++++++++++++++++++++++++++++-
>  xen/include/asm-arm/numa.h |  5 +++++
>  2 files changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
> index 60c0e82fc5..6e3cc8d3cc 100644
> --- a/xen/arch/arm/smpboot.c
> +++ b/xen/arch/arm/smpboot.c
> @@ -121,7 +121,12 @@ static void __init dt_smp_init_cpus(void)
>      {
>          [0 ... NR_CPUS - 1] = MPIDR_INVALID
>      };
> +    static nodeid_t node_map[NR_CPUS] __initdata =
> +    {
> +        [0 ... NR_CPUS - 1] = NUMA_NO_NODE
> +    };
>      bool bootcpu_valid = false;
> +    uint32_t nid = 0;
>      int rc;
>  
>      mpidr = system_cpuinfo.mpidr.bits & MPIDR_HWID_MASK;
> @@ -172,6 +177,28 @@ static void __init dt_smp_init_cpus(void)
>              continue;
>          }
>  
> +        if ( IS_ENABLED(CONFIG_NUMA) )
> +        {
> +            /*
> +             * When CONFIG_NUMA is set, try to fetch numa infomation
> +             * from CPU dts node, otherwise the nid is always 0.
> +             */
> +            if ( !dt_property_read_u32(cpu, "numa-node-id", &nid) )
> +            {
> +                printk(XENLOG_WARNING
> +                       "cpu[%d] dts path: %s: doesn't have numa information!\n",
                               ^ %u


> +                       cpuidx, dt_node_full_name(cpu));

I think that this message shouldn't be a warning: CONFIG_NUMA is a
compile-time option. Anybody who enables CONFIG_NUMA in the Xen build
will get this warning printed at boot time whenever Xen boots on a
regular non-NUMA machine, right?

The warning should only be printed if NUMA is actively enabled, e.g.
there is a distance-map but the cpus don't have numa-node-id.

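For context, the case described above corresponds to device tree
content like the following fragment (illustrative only, following the
standard "numa-distance-map-v1" binding with <from to distance>
triplets; dropping a "numa-node-id" property while keeping the
distance-map reproduces the scenario where a warning is justified):

```dts
/ {
        cpus {
                cpu@0 {
                        device_type = "cpu";
                        reg = <0x0>;
                        numa-node-id = <0>;
                };
                cpu@100 {
                        device_type = "cpu";
                        reg = <0x100>;
                        numa-node-id = <1>;
                };
        };

        distance-map {
                compatible = "numa-distance-map-v1";
                distance-matrix = <0 0 10>,
                                  <0 1 20>,
                                  <1 0 20>,
                                  <1 1 10>;
        };
};
```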


> +                /*
> +                 * During the early stage of NUMA initialization, when Xen
> +                 * found any CPU dts node doesn't have numa-node-id info, the
> +                 * NUMA will be treated as off, all CPU will be set to a FAKE
> +                 * node 0. So if we get numa-node-id failed here, we should
> +                 * set nid to 0.
> +                 */
> +                nid = 0;
> +            }
> +        }
> +
>          /*
>           * 8 MSBs must be set to 0 in the DT since the reg property
>           * defines the MPIDR[23:0]
> @@ -231,9 +258,12 @@ static void __init dt_smp_init_cpus(void)
>          {
>              printk("cpu%d init failed (hwid %"PRIregister"): %d\n", i, hwid, rc);
>              tmp_map[i] = MPIDR_INVALID;
> +            node_map[i] = NUMA_NO_NODE;
>          }
> -        else
> +        else {
>              tmp_map[i] = hwid;
> +            node_map[i] = nid;
> +        }
>      }
>  
>      if ( !bootcpu_valid )
> @@ -249,6 +279,11 @@ static void __init dt_smp_init_cpus(void)
>              continue;
>          cpumask_set_cpu(i, &cpu_possible_map);
>          cpu_logical_map(i) = tmp_map[i];
> +
> +        nid = node_map[i];
> +        if ( nid >= MAX_NUMNODES )
> +            nid = 0;
> +        numa_set_node(i, nid);
>      }
>  }
>  
> diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> index 758eafeb05..8a4ad379e0 100644
> --- a/xen/include/asm-arm/numa.h
> +++ b/xen/include/asm-arm/numa.h
> @@ -46,6 +46,11 @@ extern mfn_t first_valid_mfn;
>  #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
>  #define __node_distance(a, b) (20)
>  
> +static inline void numa_set_node(int cpu, nodeid_t node)
> +{
> +
> +}
> +
>  #endif
>  
>  static inline unsigned int arch_have_default_dmazone(void)
> -- 
> 2.25.1
> 



* Re: [PATCH 28/37] xen/arm: stub memory hotplug access helpers for Arm
  2021-09-23 12:02 ` [PATCH 28/37] xen/arm: stub memory hotplug access helpers for Arm Wei Chen
@ 2021-09-24  2:33   ` Stefano Stabellini
  2021-09-24  4:26     ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  2:33 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> Common code in NUMA need these two helpers to access/update
> memory hotplug end address. Arm has not support memory hotplug
> yet. So we stub these two helpers in this patch to make NUMA
> common code happy.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/include/asm-arm/mm.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
> index 7b5e7b7f69..fc9433165d 100644
> --- a/xen/include/asm-arm/mm.h
> +++ b/xen/include/asm-arm/mm.h
> @@ -362,6 +362,16 @@ void clear_and_clean_page(struct page_info *page);
>  
>  unsigned int arch_get_dma_bitsize(void);
>  
> +static inline void mem_hotplug_update_boundary(paddr_t end)
> +{
> +
> +}
> +
> +static inline paddr_t mem_hotplug_boundary(void)
> +{
> +    return 0;
> +}

Why zero? Could it be INVALID_PADDR ?



* Re: [PATCH 29/37] xen/arm: introduce a helper to parse device tree processor node
  2021-09-23 12:02 ` [PATCH 29/37] xen/arm: introduce a helper to parse device tree processor node Wei Chen
@ 2021-09-24  2:44   ` Stefano Stabellini
  2021-09-24  4:46     ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  2:44 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> Processor NUMA ID information is stored in device tree's processor
> node as "numa-node-id". We need a new helper to parse this ID from
> processor node. If we get this ID from processor node, this ID's
> validity still need to be checked. Once we got a invalid NUMA ID
> from any processor node, the device tree will be marked as NUMA
> information invalid.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/arm/Makefile           |  1 +
>  xen/arch/arm/numa_device_tree.c | 58 +++++++++++++++++++++++++++++++++
>  2 files changed, 59 insertions(+)
>  create mode 100644 xen/arch/arm/numa_device_tree.c
> 
> diff --git a/xen/arch/arm/Makefile b/xen/arch/arm/Makefile
> index 41ca311b6b..c50df2c25d 100644
> --- a/xen/arch/arm/Makefile
> +++ b/xen/arch/arm/Makefile
> @@ -36,6 +36,7 @@ obj-y += mem_access.o
>  obj-y += mm.o
>  obj-y += monitor.o
>  obj-$(CONFIG_NUMA) += numa.o
> +obj-$(CONFIG_DEVICE_TREE_NUMA) += numa_device_tree.o
>  obj-y += p2m.o
>  obj-y += percpu.o
>  obj-y += platform.o
> diff --git a/xen/arch/arm/numa_device_tree.c b/xen/arch/arm/numa_device_tree.c
> new file mode 100644
> index 0000000000..2428fbae0b
> --- /dev/null
> +++ b/xen/arch/arm/numa_device_tree.c
> @@ -0,0 +1,58 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Arm Architecture support layer for NUMA.
> + *
> + * Copyright (C) 2021 Arm Ltd
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + *
> + */
> +#include <xen/init.h>
> +#include <xen/nodemask.h>
> +#include <xen/numa.h>
> +#include <xen/libfdt/libfdt.h>
> +#include <xen/device_tree.h>
> +
> +/* Callback for device tree processor affinity */
> +static int __init fdt_numa_processor_affinity_init(nodeid_t node)
> +{
> +    if ( srat_disabled() )
> +        return -EINVAL;

fdt_numa_processor_affinity_init is called by fdt_parse_numa_cpu_node
which is already parsing NUMA related info. Should this srat_disabled
check be moved to fdt_parse_numa_cpu_node?


> +    else if ( node == NUMA_NO_NODE || node >= MAX_NUMNODES )
> +    {
> +        bad_srat();
> +        return -EINVAL;
> +	}
> +
> +    numa_set_processor_nodes_parsed(node);
> +    fw_numa = 1;
> +
> +    printk(KERN_INFO "DT: NUMA node %"PRIu7" processor parsed\n", node);
> +
> +    return 0;
> +}
> +
> +/* Parse CPU NUMA node info */
> +static int __init fdt_parse_numa_cpu_node(const void *fdt, int node)
> +{
> +    uint32_t nid;
> +
> +    nid = device_tree_get_u32(fdt, node, "numa-node-id", MAX_NUMNODES);
> +    if ( nid >= MAX_NUMNODES )
> +    {
> +        printk(XENLOG_ERR "Node id %u exceeds maximum value\n", nid);
                                      ^ PRIu32


> +        return -EINVAL;
> +    }
> +
> +    return fdt_numa_processor_affinity_init(nid);
> +}
> -- 
> 2.25.1
> 



* RE: [PATCH 07/37] xen/x86: use paddr_t for addresses in NUMA node structure
  2021-09-24  0:13     ` Stefano Stabellini
@ 2021-09-24  3:00       ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24  3:00 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, julien, Bertrand Marquis, jbeulich, andrew.cooper3,
	roger.pau, wl



> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: September 24, 2021 8:14
> To: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Wei Chen <Wei.Chen@arm.com>; xen-devel@lists.xenproject.org;
> julien@xen.org; Bertrand Marquis <Bertrand.Marquis@arm.com>;
> jbeulich@suse.com; andrew.cooper3@citrix.com; roger.pau@citrix.com;
> wl@xen.org
> Subject: Re: [PATCH 07/37] xen/x86: use paddr_t for addresses in NUMA node
> structure
> 
> You forgot to add the x86 maintainers in CC to all the patches touching
> x86 code in this series. Adding them now but you should probably resend.
>

I am very sorry about it. I realized the problem when I pressed Enter.
I had wanted to repost the series at that time, but I wasn't sure
whether reposting these patches would come across as spamming...

I will resend this series ASAP with some changes to address your comments.

> 
> On Thu, 23 Sep 2021, Stefano Stabellini wrote:
> > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > NUMA node structure "struct node" is using u64 as node memory
> > > range. In order to make other architectures can reuse this
> > > NUMA node relative code, we replace the u64 to paddr_t. And
> > > use pfn_to_paddr and paddr_to_pfn to replace explicit shift
> > > operations. The relate PRIx64 in print messages have been
> > > replaced by PRIpaddr at the same time.
> > >
> > > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > > ---
> > >  xen/arch/x86/numa.c        | 32 +++++++++++++++++---------------
> > >  xen/arch/x86/srat.c        | 26 +++++++++++++-------------
> > >  xen/include/asm-x86/numa.h |  8 ++++----
> > >  3 files changed, 34 insertions(+), 32 deletions(-)
> > >
> > > diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> > > index 1fabbe8281..6337bbdf31 100644
> > > --- a/xen/arch/x86/numa.c
> > > +++ b/xen/arch/x86/numa.c
> > > @@ -165,12 +165,12 @@ int __init compute_hash_shift(struct node *nodes,
> int numnodes,
> > >      return shift;
> > >  }
> > >  /* initialize NODE_DATA given nodeid and start/end */
> > > -void __init setup_node_bootmem(nodeid_t nodeid, u64 start, u64 end)
> > > -{
> > > +void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start,
> paddr_t end)
> > > +{
> > >      unsigned long start_pfn, end_pfn;
> > >
> > > -    start_pfn = start >> PAGE_SHIFT;
> > > -    end_pfn = end >> PAGE_SHIFT;
> > > +    start_pfn = paddr_to_pfn(start);
> > > +    end_pfn = paddr_to_pfn(end);
> > >
> > >      NODE_DATA(nodeid)->node_start_pfn = start_pfn;
> > >      NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
> > > @@ -201,11 +201,12 @@ void __init numa_init_array(void)
> > >  static int numa_fake __initdata = 0;
> > >
> > >  /* Numa emulation */
> > > -static int __init numa_emulation(u64 start_pfn, u64 end_pfn)
> > > +static int __init numa_emulation(unsigned long start_pfn,
> > > +                                 unsigned long end_pfn)
> >
> > Why not changing numa_emulation to take paddr_t too?
> >

The numa_emulation parameters are PFNs, not addresses. I discussed
PFNs with Julien in the RFC; he suggested using mfn_t or unsigned long
for PFNs. Compared to mfn_t, unsigned long requires fewer changes.

> >
> > >  {
> > >      int i;
> > >      struct node nodes[MAX_NUMNODES];
> > > -    u64 sz = ((end_pfn - start_pfn)<<PAGE_SHIFT) / numa_fake;
> > > +    u64 sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake;
> > >
> > >      /* Kludge needed for the hash function */
> > >      if ( hweight64(sz) > 1 )
> > > @@ -221,9 +222,9 @@ static int __init numa_emulation(u64 start_pfn,
> u64 end_pfn)
> > >      memset(&nodes,0,sizeof(nodes));
> > >      for ( i = 0; i < numa_fake; i++ )
> > >      {
> > > -        nodes[i].start = (start_pfn<<PAGE_SHIFT) + i*sz;
> > > +        nodes[i].start = pfn_to_paddr(start_pfn) + i*sz;
> > >          if ( i == numa_fake - 1 )
> > > -            sz = (end_pfn<<PAGE_SHIFT) - nodes[i].start;
> > > +            sz = pfn_to_paddr(end_pfn) - nodes[i].start;
> > >          nodes[i].end = nodes[i].start + sz;
> > >          printk(KERN_INFO "Faking node %d at %"PRIx64"-%"PRIx64"
> (%"PRIu64"MB)\n",
> > >                 i,
> > > @@ -249,24 +250,26 @@ static int __init numa_emulation(u64 start_pfn,
> u64 end_pfn)
> > >  void __init numa_initmem_init(unsigned long start_pfn, unsigned long
> end_pfn)
> >
> > same here
> >
> >
> > >  {
> > >      int i;
> > > +    paddr_t start, end;
> > >
> > >  #ifdef CONFIG_NUMA_EMU
> > >      if ( numa_fake && !numa_emulation(start_pfn, end_pfn) )
> > >          return;
> > >  #endif
> > >
> > > +    start = pfn_to_paddr(start_pfn);
> > > +    end = pfn_to_paddr(end_pfn);
> > > +
> > >  #ifdef CONFIG_ACPI_NUMA
> > > -    if ( !numa_off && !acpi_scan_nodes((u64)start_pfn << PAGE_SHIFT,
> > > -         (u64)end_pfn << PAGE_SHIFT) )
> > > +    if ( !numa_off && !acpi_scan_nodes(start, end) )
> > >          return;
> > >  #endif
> > >
> > >      printk(KERN_INFO "%s\n",
> > >             numa_off ? "NUMA turned off" : "No NUMA configuration
> found");
> > >
> > > -    printk(KERN_INFO "Faking a node at %016"PRIx64"-%016"PRIx64"\n",
> > > -           (u64)start_pfn << PAGE_SHIFT,
> > > -           (u64)end_pfn << PAGE_SHIFT);
> > > +    printk(KERN_INFO "Faking a node at %016"PRIpaddr"-
> %016"PRIpaddr"\n",
> > > +           start, end);
> > >      /* setup dummy node covering all memory */
> > >      memnode_shift = BITS_PER_LONG - 1;
> > >      memnodemap = _memnodemap;
> > > @@ -279,8 +282,7 @@ void __init numa_initmem_init(unsigned long
> start_pfn, unsigned long end_pfn)
> > >      for ( i = 0; i < nr_cpu_ids; i++ )
> > >          numa_set_node(i, 0);
> > >      cpumask_copy(&node_to_cpumask[0], cpumask_of(0));
> > > -    setup_node_bootmem(0, (u64)start_pfn << PAGE_SHIFT,
> > > -                    (u64)end_pfn << PAGE_SHIFT);
> > > +    setup_node_bootmem(0, start, end);
> > >  }
> > >
> > >  void numa_add_cpu(int cpu)
> > > diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> > > index 6b77b98201..7d20d7f222 100644
> > > --- a/xen/arch/x86/srat.c
> > > +++ b/xen/arch/x86/srat.c
> > > @@ -104,7 +104,7 @@ nodeid_t setup_node(unsigned pxm)
> > >  	return node;
> > >  }
> > >
> > > -int valid_numa_range(u64 start, u64 end, nodeid_t node)
> > > +int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node)
> > >  {
> > >  	int i;
> > >
> > > @@ -119,7 +119,7 @@ int valid_numa_range(u64 start, u64 end, nodeid_t
> node)
> > >  	return 0;
> > >  }
> > >
> > > -static __init int conflicting_memblks(u64 start, u64 end)
> > > +static __init int conflicting_memblks(paddr_t start, paddr_t end)
> > >  {
> > >  	int i;
> > >
> > > @@ -135,7 +135,7 @@ static __init int conflicting_memblks(u64 start,
> u64 end)
> > >  	return -1;
> > >  }
> > >
> > > -static __init void cutoff_node(int i, u64 start, u64 end)
> > > +static __init void cutoff_node(int i, paddr_t start, paddr_t end)
> > >  {
> > >  	struct node *nd = &nodes[i];
> > >  	if (nd->start < start) {
> > > @@ -275,7 +275,7 @@ acpi_numa_processor_affinity_init(const struct
> acpi_srat_cpu_affinity *pa)
> > >  void __init
> > >  acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity
> *ma)
> > >  {
> > > -	u64 start, end;
> > > +	paddr_t start, end;
> > >  	unsigned pxm;
> > >  	nodeid_t node;
> > >  	int i;
> > > @@ -318,7 +318,7 @@ acpi_numa_memory_affinity_init(const struct
> acpi_srat_mem_affinity *ma)
> > >  		bool mismatch = !(ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) !=
> > >  		                !test_bit(i, memblk_hotplug);
> > >
> > > -		printk("%sSRAT: PXM %u (%"PRIx64"-%"PRIx64") overlaps with
> itself (%"PRIx64"-%"PRIx64")\n",
> > > +		printk("%sSRAT: PXM %u (%"PRIpaddr"-%"PRIpaddr") overlaps with
> itself (%"PRIpaddr"-%"PRIpaddr")\n",
> > >  		       mismatch ? KERN_ERR : KERN_WARNING, pxm, start, end,
> > >  		       node_memblk_range[i].start, node_memblk_range[i].end);
> > >  		if (mismatch) {
> > > @@ -327,7 +327,7 @@ acpi_numa_memory_affinity_init(const struct
> acpi_srat_mem_affinity *ma)
> > >  		}
> > >  	} else {
> > >  		printk(KERN_ERR
> > > -		       "SRAT: PXM %u (%"PRIx64"-%"PRIx64") overlaps with
> PXM %u (%"PRIx64"-%"PRIx64")\n",
> > > +		       "SRAT: PXM %u (%"PRIpaddr"-%"PRIpaddr") overlaps with
> PXM %u (%"PRIpaddr"-%"PRIpaddr")\n",
> > >  		       pxm, start, end, node_to_pxm(memblk_nodeid[i]),
> > >  		       node_memblk_range[i].start, node_memblk_range[i].end);
> > >  		bad_srat();
> > > @@ -346,7 +346,7 @@ acpi_numa_memory_affinity_init(const struct
> acpi_srat_mem_affinity *ma)
> > >  				nd->end = end;
> > >  		}
> > >  	}
> > > -	printk(KERN_INFO "SRAT: Node %u PXM %u %"PRIx64"-%"PRIx64"%s\n",
> > > +	printk(KERN_INFO "SRAT: Node %u PXM %u %"PRIpaddr"-%"PRIpaddr"%s\n",
> > >  	       node, pxm, start, end,
> > >  	       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
> > >
> > > @@ -369,7 +369,7 @@ static int __init nodes_cover_memory(void)
> > >
> > >  	for (i = 0; i < e820.nr_map; i++) {
> > >  		int j, found;
> > > -		unsigned long long start, end;
> > > +		paddr_t start, end;
> > >
> > >  		if (e820.map[i].type != E820_RAM) {
> > >  			continue;
> > > @@ -396,7 +396,7 @@ static int __init nodes_cover_memory(void)
> > >
> > >  		if (start < end) {
> > >  			printk(KERN_ERR "SRAT: No PXM for e820 range: "
> > > -				"%016Lx - %016Lx\n", start, end);
> > > +				"%"PRIpaddr" - %"PRIpaddr"\n", start, end);
> > >  			return 0;
> > >  		}
> > >  	}
> > > @@ -432,7 +432,7 @@ static int __init srat_parse_region(struct
> acpi_subtable_header *header,
> > >  	return 0;
> > >  }
> > >
> > > -void __init srat_parse_regions(u64 addr)
> > > +void __init srat_parse_regions(paddr_t addr)
> > >  {
> > >  	u64 mask;
> > >  	unsigned int i;
> > > @@ -441,7 +441,7 @@ void __init srat_parse_regions(u64 addr)
> > >  	    acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat))
> > >  		return;
> > >
> > > -	srat_region_mask = pdx_init_mask(addr);
> > > +	srat_region_mask = pdx_init_mask((u64)addr);
> > >  	acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
> > >  			      srat_parse_region, 0);
> > >
> > > @@ -457,7 +457,7 @@ void __init srat_parse_regions(u64 addr)
> > >  }
> > >
> > >  /* Use the information discovered above to actually set up the nodes.
> */
> > > -int __init acpi_scan_nodes(u64 start, u64 end)
> > > +int __init acpi_scan_nodes(paddr_t start, paddr_t end)
> > >  {
> > >  	int i;
> > >  	nodemask_t all_nodes_parsed;
> > > @@ -489,7 +489,7 @@ int __init acpi_scan_nodes(u64 start, u64 end)
> > >  	/* Finally register nodes */
> > >  	for_each_node_mask(i, all_nodes_parsed)
> > >  	{
> > > -		u64 size = nodes[i].end - nodes[i].start;
> > > +		paddr_t size = nodes[i].end - nodes[i].start;
> > >  		if ( size == 0 )
> > >  			printk(KERN_WARNING "SRAT: Node %u has no memory. "
> > >  			       "BIOS Bug or mis-configured hardware?\n", i);
> > > diff --git a/xen/include/asm-x86/numa.h b/xen/include/asm-x86/numa.h
> > > index 8060cbf3f4..50cfd8e7ef 100644
> > > --- a/xen/include/asm-x86/numa.h
> > > +++ b/xen/include/asm-x86/numa.h
> > > @@ -16,7 +16,7 @@ extern cpumask_t     node_to_cpumask[];
> > >  #define node_to_cpumask(node)    (node_to_cpumask[node])
> > >
> > >  struct node {
> > > -	u64 start,end;
> > > +	paddr_t start,end;
> > >  };
> > >
> > >  extern int compute_hash_shift(struct node *nodes, int numnodes,
> > > @@ -36,7 +36,7 @@ extern void numa_set_node(int cpu, nodeid_t node);
> > >  extern nodeid_t setup_node(unsigned int pxm);
> > >  extern void srat_detect_node(int cpu);
> > >
> > > -extern void setup_node_bootmem(nodeid_t nodeid, u64 start, u64 end);
> > > +extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start,
> paddr_t end);
> > >  extern nodeid_t apicid_to_node[];
> > >  extern void init_cpu_to_node(void);
> > >
> > > @@ -73,9 +73,9 @@ static inline __attribute__((pure)) nodeid_t
> phys_to_nid(paddr_t addr)
> > >  #define node_end_pfn(nid)       (NODE_DATA(nid)->node_start_pfn + \
> > >  				 NODE_DATA(nid)->node_spanned_pages)
> > >
> > > -extern int valid_numa_range(u64 start, u64 end, nodeid_t node);
> > > +extern int valid_numa_range(paddr_t start, paddr_t end, nodeid_t
> node);
> > >
> > > -void srat_parse_regions(u64 addr);
> > > +void srat_parse_regions(paddr_t addr);
> > >  extern u8 __node_distance(nodeid_t a, nodeid_t b);
> > >  unsigned int arch_get_dma_bitsize(void);
> > >  unsigned int arch_have_default_dmazone(void);
> > > --
> > > 2.25.1
> > >
> >

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 31/37] xen/arm: introduce a helper to parse device tree NUMA distance map
  2021-09-23 12:02 ` [PATCH 31/37] xen/arm: introduce a helper to parse device tree NUMA distance map Wei Chen
@ 2021-09-24  3:05   ` Stefano Stabellini
  2021-09-24  5:23     ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  3:05 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> A NUMA aware device tree will provide a "distance-map" node to
> describe distance between any two nodes. This patch introduce a
> new helper to parse this distance map.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/arm/numa_device_tree.c | 106 ++++++++++++++++++++++++++++++++
>  1 file changed, 106 insertions(+)
> 
> diff --git a/xen/arch/arm/numa_device_tree.c b/xen/arch/arm/numa_device_tree.c
> index 7918a397fa..e7fa84df4c 100644
> --- a/xen/arch/arm/numa_device_tree.c
> +++ b/xen/arch/arm/numa_device_tree.c
> @@ -136,3 +136,109 @@ static int __init fdt_parse_numa_memory_node(const void *fdt, int node,
>  
>      return 0;
>  }
> +
> +
> +/* Parse NUMA distance map v1 */
> +static int __init fdt_parse_numa_distance_map_v1(const void *fdt, int node)
> +{
> +    const struct fdt_property *prop;
> +    const __be32 *matrix;
> +    uint32_t entry_count;
> +    int len, i;
> +
> +    printk(XENLOG_INFO "NUMA: parsing numa-distance-map\n");
> +
> +    prop = fdt_get_property(fdt, node, "distance-matrix", &len);
> +    if ( !prop )
> +    {
> +        printk(XENLOG_WARNING
> +               "NUMA: No distance-matrix property in distance-map\n");

I haven't seen where this is called from yet but make sure to print an
error here only if NUMA info is actually expected and required, not on
regular non-NUMA boots on non-NUMA machines.


> +        return -EINVAL;
> +    }
> +
> +    if ( len % sizeof(uint32_t) != 0 )
> +    {
> +        printk(XENLOG_WARNING
> +               "distance-matrix in node is not a multiple of u32\n");
> +        return -EINVAL;
> +    }
> +
> +    entry_count = len / sizeof(uint32_t);
> +    if ( entry_count == 0 )
> +    {
> +        printk(XENLOG_WARNING "NUMA: Invalid distance-matrix\n");
> +
> +        return -EINVAL;
> +    }
> +
> +    matrix = (const __be32 *)prop->data;
> +    for ( i = 0; i + 2 < entry_count; i += 3 )
> +    {
> +        uint32_t from, to, distance, opposite;
> +
> +        from = dt_next_cell(1, &matrix);
> +        to = dt_next_cell(1, &matrix);
> +        distance = dt_next_cell(1, &matrix);
> +        if ( (from == to && distance != NUMA_LOCAL_DISTANCE) ||
> +            (from != to && distance <= NUMA_LOCAL_DISTANCE) )
> +        {
> +            printk(XENLOG_WARNING
> +                   "NUMA: Invalid distance: NODE#%u->NODE#%u:%u\n",
> +                   from, to, distance);
> +            return -EINVAL;
> +        }
> +
> +        printk(XENLOG_INFO "NUMA: distance: NODE#%u->NODE#%u:%u\n",
> +               from, to, distance);
> +
> +        /* Get opposite way distance */
> +        opposite = __node_distance(from, to);

This is not checking for the opposite node distance but...


> +        if ( opposite == 0 )
> +        {
> +            /* Bi-directions are not set, set both */
> +            numa_set_distance(from, to, distance);
> +            numa_set_distance(to, from, distance);

...since you set both directions here at once then it is OK. You are
checking if this direction has already been set which is correct I
think. But the comment "Get opposite way distance" and the variable name
"opposite" are wrong.


> +        }
> +        else
> +        {
> +            /*
> +             * Opposite way distance has been set to a different value.
> +             * It may be a firmware device tree bug?
> +             */
> +            if ( opposite != distance )
> +            {
> +                /*
> +                 * In device tree NUMA distance-matrix binding:
> +                 * https://www.kernel.org/doc/Documentation/devicetree/bindings/numa.txt
> +                 * There is a notes mentions:
> +                 * "Each entry represents distance from first node to
> +                 *  second node. The distances are equal in either
> +                 *  direction."
> +                 *
> +                 * That means device tree doesn't permit this case.
> +                 * But in ACPI spec, it cares to specifically permit this
> +                 * case:
> +                 * "Except for the relative distance from a System Locality
> +                 *  to itself, each relative distance is stored twice in the
> +                 *  matrix. This provides the capability to describe the
> +                 *  scenario where the relative distances for the two
> +                 *  directions between System Localities is different."
> +                 *
> +                 * That means a real machine allows such NUMA configuration.
> +                 * So, place a WARNING here to notice system administrators,
> +                 * is it the specail case that they hijack the device tree
> +                 * to support their rare machines?
> +                 */
> +                printk(XENLOG_WARNING
> +                       "Un-matched bi-direction! NODE#%u->NODE#%u:%u, NODE#%u->NODE#%u:%u\n",
> +                       from, to, distance, to, from, opposite);

PRIu32


> +            }
> +
> +            /* Opposite way distance has been set, just set this way */
> +            numa_set_distance(from, to, distance);
> +        }
> +    }
> +
> +    return 0;
> +}
> -- 
> 2.25.1
> 



* Re: [PATCH 30/37] xen/arm: introduce a helper to parse device tree memory node
  2021-09-23 12:02 ` [PATCH 30/37] xen/arm: introduce a helper to parse device tree memory node Wei Chen
@ 2021-09-24  3:05   ` Stefano Stabellini
  2021-09-24  7:54     ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  3:05 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> Memory blocks' NUMA ID information is stored in device tree's
> memory nodes as "numa-node-id". We need a new helper to parse
> and verify this ID from memory nodes.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>

There are tabs for indentation in this patch, we use spaces.


> ---
>  xen/arch/arm/numa_device_tree.c | 80 +++++++++++++++++++++++++++++++++
>  1 file changed, 80 insertions(+)
> 
> diff --git a/xen/arch/arm/numa_device_tree.c b/xen/arch/arm/numa_device_tree.c
> index 2428fbae0b..7918a397fa 100644
> --- a/xen/arch/arm/numa_device_tree.c
> +++ b/xen/arch/arm/numa_device_tree.c
> @@ -42,6 +42,35 @@ static int __init fdt_numa_processor_affinity_init(nodeid_t node)
>      return 0;
>  }
>  
> +/* Callback for parsing of the memory regions affinity */
> +static int __init fdt_numa_memory_affinity_init(nodeid_t node,
> +                                paddr_t start, paddr_t size)

Please align the parameters


> +{
> +    int ret;
> +
> +    if ( srat_disabled() )
> +    {
> +        return -EINVAL;
> +    }
> +
> +	if ( !numa_memblks_available() )
> +	{
> +		dprintk(XENLOG_WARNING,
> +                "Too many numa entry, try bigger NR_NODE_MEMBLKS \n");
> +		bad_srat();
> +		return -EINVAL;
> +	}
> +
> +	ret = numa_update_node_memblks(node, start, size, false);
> +	if ( ret != 0 )
> +	{
> +		bad_srat();
> +	    return -EINVAL;
> +	}
> +
> +    return 0;
> +}

Aside from spaces/tabs, this is a lot better!


>  /* Parse CPU NUMA node info */
>  static int __init fdt_parse_numa_cpu_node(const void *fdt, int node)
>  {
> @@ -56,3 +85,54 @@ static int __init fdt_parse_numa_cpu_node(const void *fdt, int node)
>  
>      return fdt_numa_processor_affinity_init(nid);
>  }
> +
> +/* Parse memory node NUMA info */
> +static int __init fdt_parse_numa_memory_node(const void *fdt, int node,
> +    const char *name, uint32_t addr_cells, uint32_t size_cells)

Please align the parameters


> +{
> +    uint32_t nid;
> +    int ret = 0, len;
> +    paddr_t addr, size;
> +    const struct fdt_property *prop;
> +    uint32_t idx, ranges;
> +    const __be32 *addresses;
> +
> +    nid = device_tree_get_u32(fdt, node, "numa-node-id", MAX_NUMNODES);
> +    if ( nid >= MAX_NUMNODES )
> +    {
> +        printk(XENLOG_WARNING "Node id %u exceeds maximum value\n", nid);
> +        return -EINVAL;
> +    }
> +
> +    prop = fdt_get_property(fdt, node, "reg", &len);
> +    if ( !prop )
> +    {
> +        printk(XENLOG_WARNING
> +               "fdt: node `%s': missing `reg' property\n", name);
> +        return -EINVAL;
> +    }
> +
> +    addresses = (const __be32 *)prop->data;
> +    ranges = len / (sizeof(__be32)* (addr_cells + size_cells));
> +    for ( idx = 0; idx < ranges; idx++ )
> +    {
> +        device_tree_get_reg(&addresses, addr_cells, size_cells, &addr, &size);
> +        /* Skip zero size ranges */
> +        if ( !size )
> +            continue;
> +
> +        ret = fdt_numa_memory_affinity_init(nid, addr, size);
> +        if ( ret ) {
> +            return -EINVAL;
> +        }
> +    }

I take it that it would be difficult to parse numa-node-id and call
fdt_numa_memory_affinity_init from
xen/arch/arm/bootfdt.c:device_tree_get_meminfo. Is it because
device_tree_get_meminfo is called too early?


> +    if ( idx == 0 )
> +    {
> +        printk(XENLOG_ERR
> +               "bad property in memory node, idx=%d ret=%d\n", idx, ret);
> +        return -EINVAL;
> +    }
> +
> +    return 0;
> +}
> -- 
> 2.25.1
> 



* Re: [PATCH 32/37] xen/arm: unified entry to parse all NUMA data from device tree
  2021-09-23 12:02 ` [PATCH 32/37] xen/arm: unified entry to parse all NUMA data from device tree Wei Chen
@ 2021-09-24  3:16   ` Stefano Stabellini
  2021-09-24  7:58     ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  3:16 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> In this API, we scan whole device tree to parse CPU node id, memory
          ^ function   ^ the whole

> node id and distance-map. Though early_scan_node will invoke has a
> handler to process memory nodes. If we want to parse memory node id
> in this handler, we have to embeded NUMA parse code in this handler.
                              ^ embed

> But we still need to scan whole device tree to find CPU NUMA id and
> distance-map. In this case, we include memory NUMA id parse in this
> API too. Another benefit is that we have a unique entry for device
  ^ function

> tree NUMA data parse.

Ah, that's the explanation I was asking for earlier!


> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/arm/numa_device_tree.c | 30 ++++++++++++++++++++++++++++++
>  xen/include/asm-arm/numa.h      |  1 +
>  2 files changed, 31 insertions(+)
> 
> diff --git a/xen/arch/arm/numa_device_tree.c b/xen/arch/arm/numa_device_tree.c
> index e7fa84df4c..6a3fed0002 100644
> --- a/xen/arch/arm/numa_device_tree.c
> +++ b/xen/arch/arm/numa_device_tree.c
> @@ -242,3 +242,33 @@ static int __init fdt_parse_numa_distance_map_v1(const void *fdt, int node)
>  
>      return 0;
>  }
> +
> +static int __init fdt_scan_numa_nodes(const void *fdt,
> +                int node, const char *uname, int depth,
> +                u32 address_cells, u32 size_cells, void *data)

Please align parameters


> +{
> +    int len, ret = 0;
> +    const void *prop;
> +
> +    prop = fdt_getprop(fdt, node, "device_type", &len);
> +    if (prop)

code style


> +    {
> +        len += 1;
> +        if ( memcmp(prop, "cpu", len) == 0 )
> +            ret = fdt_parse_numa_cpu_node(fdt, node);
> +        else if ( memcmp(prop, "memory", len) == 0 )
> +            ret = fdt_parse_numa_memory_node(fdt, node, uname,
> +                                address_cells, size_cells);

I realize that with the inclusion of '\0' in the check, the usage of
memcmp should be safe, but I would prefer if we used strncmp instead.


> +    }
> +    else if ( fdt_node_check_compatible(fdt, node,
> +                                "numa-distance-map-v1") == 0 )
> +        ret = fdt_parse_numa_distance_map_v1(fdt, node);
> +
> +    return ret;
> +}
> +
> +/* Initialize NUMA from device tree */
> +int __init numa_device_tree_init(const void *fdt)
> +{
> +    return device_tree_for_each_node(fdt, 0, fdt_scan_numa_nodes, NULL);
> +}
> diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> index 7675012cb7..f46e8e2935 100644
> --- a/xen/include/asm-arm/numa.h
> +++ b/xen/include/asm-arm/numa.h
> @@ -23,6 +23,7 @@ typedef u8 nodeid_t;
>  #define NR_NODE_MEMBLKS NR_MEM_BANKS
>  
>  extern void numa_set_distance(nodeid_t from, nodeid_t to, uint32_t distance);
> +extern int numa_device_tree_init(const void *fdt);
>  
>  #else
>  
> -- 
> 2.25.1
> 



* Re: [PATCH 33/37] xen/arm: keep guest still be NUMA unware
  2021-09-23 12:02 ` [PATCH 33/37] xen/arm: keep guest still be NUMA unware Wei Chen
@ 2021-09-24  3:19   ` Stefano Stabellini
  2021-09-24 10:23     ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  3:19 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> The NUMA information provided in the host Device-Tree
> are only for Xen. For dom0, we want to hide them as they
> may be different (for now, dom0 is still not aware of NUMA)
> The CPU and memory nodes are recreated from scratch for the
> domain. So we already skip the "numa-node-id" property for
> these two types of nodes.
> 
> However, some devices like PCIe may have "numa-node-id"
> property too. We have to skip them as well.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/arm/domain_build.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
> index d233d634c1..6e94922238 100644
> --- a/xen/arch/arm/domain_build.c
> +++ b/xen/arch/arm/domain_build.c
> @@ -737,6 +737,10 @@ static int __init write_properties(struct domain *d, struct kernel_info *kinfo,
>                  continue;
>          }
>  
> +        /* Guest is numa unaware in current stage */

I would say: "Dom0 is currently NUMA unaware"

Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>


> +        if ( dt_property_name_is_equal(prop, "numa-node-id") )
> +            continue;
> +
>          res = fdt_property(kinfo->fdt, prop->name, prop_data, prop_len);
>  
>          if ( res )
> @@ -1607,6 +1611,8 @@ static int __init handle_node(struct domain *d, struct kernel_info *kinfo,
>          DT_MATCH_TYPE("memory"),
>          /* The memory mapped timer is not supported by Xen. */
>          DT_MATCH_COMPATIBLE("arm,armv7-timer-mem"),
> +        /* Numa info doesn't need to be exposed to Domain-0 */
> +        DT_MATCH_COMPATIBLE("numa-distance-map-v1"),
>          { /* sentinel */ },
>      };
>      static const struct dt_device_match timer_matches[] __initconst =
> -- 
> 2.25.1
> 



* Re: [PATCH 34/37] xen/arm: enable device tree based NUMA in system init
  2021-09-23 12:02 ` [PATCH 34/37] xen/arm: enable device tree based NUMA in system init Wei Chen
@ 2021-09-24  3:28   ` Stefano Stabellini
  2021-09-24  9:52     ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  3:28 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> In this patch, we can start to create NUMA system that is
> based on device tree.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/arm/numa.c        | 55 ++++++++++++++++++++++++++++++++++++++
>  xen/arch/arm/setup.c       |  7 +++++
>  xen/include/asm-arm/numa.h |  6 +++++
>  3 files changed, 68 insertions(+)
> 
> diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> index 7f05299b76..d7a3d32d4b 100644
> --- a/xen/arch/arm/numa.c
> +++ b/xen/arch/arm/numa.c
> @@ -18,8 +18,10 @@
>   *
>   */
>  #include <xen/init.h>
> +#include <xen/device_tree.h>
>  #include <xen/nodemask.h>
>  #include <xen/numa.h>
> +#include <xen/pfn.h>
>  
>  static uint8_t __read_mostly
>  node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
> @@ -85,6 +87,59 @@ uint8_t __node_distance(nodeid_t from, nodeid_t to)
>  }
>  EXPORT_SYMBOL(__node_distance);
>  
> +void __init numa_init(bool acpi_off)
> +{
> +    uint32_t idx;
> +    paddr_t ram_start = ~0;

INVALID_PADDR


> +    paddr_t ram_size = 0;
> +    paddr_t ram_end = 0;
> +
> +    /* NUMA has been turned off through Xen parameters */
> +    if ( numa_off )
> +        goto mem_init;
> +
> +    /* Initialize NUMA from device tree when system is not ACPI booted */
> +    if ( acpi_off )
> +    {
> +        int ret = numa_device_tree_init(device_tree_flattened);
> +        if ( ret )
> +        {
> +            printk(XENLOG_WARNING
> +                   "Init NUMA from device tree failed, ret=%d\n", ret);

As I mentioned in other patches we need to distinguish between two
cases:

1) NUMA initialization failed because no NUMA information has been found
2) NUMA initialization failed because wrong/inconsistent NUMA info has
   been found

In case of 1), we print nothing. Maybe a single XENLOG_DEBUG message.
In case of 2), all the warnings are good to print.


In this case, if ret != 0 because of 2), then it is fine to print this
warning. But it looks like it could be that ret is -EINVAL simply because a
CPU node doesn't have numa-node-id, which is a normal condition for
non-NUMA machines.


> +            numa_off = true;
> +        }
> +    }
> +    else
> +    {
> +        /* We don't support NUMA for ACPI boot currently */
> +        printk(XENLOG_WARNING
> +               "ACPI NUMA has not been supported yet, NUMA off!\n");
> +        numa_off = true;
> +    }
> +
> +mem_init:
> +    /*
> +     * Find the minimal and maximum address of RAM, NUMA will
> +     * build a memory to node mapping table for the whole range.
> +     */
> +    ram_start = bootinfo.mem.bank[0].start;
> +    ram_size  = bootinfo.mem.bank[0].size;
> +    ram_end   = ram_start + ram_size;
> +    for ( idx = 1 ; idx < bootinfo.mem.nr_banks; idx++ )
> +    {
> +        paddr_t bank_start = bootinfo.mem.bank[idx].start;
> +        paddr_t bank_size = bootinfo.mem.bank[idx].size;
> +        paddr_t bank_end = bank_start + bank_size;
> +
> +        ram_size  = ram_size + bank_size;
> +        ram_start = min(ram_start, bank_start);
> +        ram_end   = max(ram_end, bank_end);
> +    }
> +
> +    numa_initmem_init(PFN_UP(ram_start), PFN_DOWN(ram_end));
> +    return;

No need for return


> +}
> +
>  uint32_t __init arch_meminfo_get_nr_bank(void)
>  {
>  	return bootinfo.mem.nr_banks;
> diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
> index 1f0fbc95b5..6097850682 100644
> --- a/xen/arch/arm/setup.c
> +++ b/xen/arch/arm/setup.c
> @@ -905,6 +905,13 @@ void __init start_xen(unsigned long boot_phys_offset,
>      /* Parse the ACPI tables for possible boot-time configuration */
>      acpi_boot_table_init();
>  
> +    /*
> +     * Try to initialize NUMA system, if failed, the system will
> +     * fallback to uniform system which means system has only 1
> +     * NUMA node.
> +     */
> +    numa_init(acpi_disabled);
> +
>      end_boot_allocator();
>  
>      /*
> diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> index f46e8e2935..5b03dde87f 100644
> --- a/xen/include/asm-arm/numa.h
> +++ b/xen/include/asm-arm/numa.h
> @@ -24,6 +24,7 @@ typedef u8 nodeid_t;
>  
>  extern void numa_set_distance(nodeid_t from, nodeid_t to, uint32_t distance);
>  extern int numa_device_tree_init(const void *fdt);
> +extern void numa_init(bool acpi_off);
>  
>  #else
>  
> @@ -47,6 +48,11 @@ extern mfn_t first_valid_mfn;
>  #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
>  #define __node_distance(a, b) (20)
>  
> +static inline void numa_init(bool acpi_off)
> +{
> +
> +}
> +
>  static inline void numa_add_cpu(int cpu)
>  {
>  
> -- 
> 2.25.1
> 



* Re: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA
  2021-09-23 12:02 ` [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA Wei Chen
@ 2021-09-24  3:31   ` Stefano Stabellini
  2021-09-24 10:13     ` Wei Chen
  2021-09-24 10:25   ` Jan Beulich
  1 sibling, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24  3:31 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, sstabellini, julien, Bertrand.Marquis

On Thu, 23 Sep 2021, Wei Chen wrote:
> Arm platforms support both ACPI and device tree. We don't
> want users to select device tree NUMA or ACPI NUMA manually.
> We hope usrs can just enable NUMA for Arm, and device tree
          ^ users

> NUMA and ACPI NUMA can be selected depends on device tree
> feature and ACPI feature status automatically. In this case,
> these two kinds of NUMA support code can be co-exist in one
> Xen binary. Xen can check feature flags to decide using
> device tree or ACPI as NUMA based firmware.
> 
> So in this patch, we introduce a generic option:
> CONFIG_ARM_NUMA for user to enable NUMA for Arm.
                      ^ users

> And one CONFIG_DEVICE_TREE_NUMA option for ARM_NUMA
> to select when HAS_DEVICE_TREE option is enabled.
> Once when ACPI NUMA for Arm is supported, ACPI_NUMA
> can be selected here too.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
>  xen/arch/arm/Kconfig | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig
> index 865ad83a89..ded94ebd37 100644
> --- a/xen/arch/arm/Kconfig
> +++ b/xen/arch/arm/Kconfig
> @@ -34,6 +34,17 @@ config ACPI
>  	  Advanced Configuration and Power Interface (ACPI) support for Xen is
>  	  an alternative to device tree on ARM64.
>  
> + config DEVICE_TREE_NUMA
> +	def_bool n
> +	select NUMA
> +
> +config ARM_NUMA
> +	bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)" if UNSUPPORTED
> +	select DEVICE_TREE_NUMA if HAS_DEVICE_TREE

Should it be: depends on HAS_DEVICE_TREE ?
(And eventually depends on HAS_DEVICE_TREE || ACPI)


> +	---help---
> +
> +	  Enable Non-Uniform Memory Access (NUMA) for Arm architecutres
                                                      ^ architectures


> +
>  config GICV3
>  	bool "GICv3 driver"
>  	depends on ARM_64 && !NEW_VGIC
> -- 
> 2.25.1
> 


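The shape Stefano suggests would look roughly like the fragment below. This is an untested sketch, not the final patch; the option names are taken from the patch under review.

```kconfig
config ARM_NUMA
	bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)" if UNSUPPORTED
	depends on HAS_DEVICE_TREE
	select DEVICE_TREE_NUMA
	---help---

	  Enable Non-Uniform Memory Access (NUMA) for Arm architectures
```

With the dependency expressed via `depends on`, the `if HAS_DEVICE_TREE` qualifier on the `select` becomes redundant; it would become meaningful again once the dependency grows to `HAS_DEVICE_TREE || ACPI` and ACPI NUMA support is selectable.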

* Re: [PATCH 09/37] xen/x86: introduce two helpers to access memory hotplug end
  2021-09-24  0:29   ` Stefano Stabellini
@ 2021-09-24  4:21     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24  4:21 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, julien, Bertrand.Marquis, jbeulich, andrew.cooper3,
	roger.pau, wl



On 2021/9/24 8:29, Stefano Stabellini wrote:
> +x86 maintainers
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
>> x86 provides a mem_hotplug to maintain the end of memory hotplug
>                              ^ variable
> 
>> end address. This variable can be accessed out of mm.c. We want
>> some code out of mm.c can be reused by other architectures without
>                         ^ so that it can be reused
> 
>> memory hotplug ability. So in this patch, we introduce these two
>> helpers to replace mem_hotplug direct access. This will give the
>> ability to stub these two API.
>                              ^ APIs
> 
> 

OK

>> Signed-off-by: Wei Chen <wei.chen@arm.com>
>> ---
>>   xen/include/asm-x86/mm.h | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
>> index cb90527499..af2fc4b0cd 100644
>> --- a/xen/include/asm-x86/mm.h
>> +++ b/xen/include/asm-x86/mm.h
>> @@ -475,6 +475,16 @@ static inline int get_page_and_type(struct page_info *page,
>>   
>>   extern paddr_t mem_hotplug;
>>   
>> +static inline void mem_hotplug_update_boundary(paddr_t end)
>> +{
>> +    mem_hotplug = end;
>> +}
>> +
>> +static inline paddr_t mem_hotplug_boundary(void)
>> +{
>> +    return mem_hotplug;
>> +}
>> +
>>   /******************************************************************************
>>    * With shadow pagetables, the different kinds of address start
>>    * to get get confusing.
>> -- 
>> 2.25.1
>>



* RE: [PATCH 26/37] xen/arm: build NUMA cpu_to_node map in dt_smp_init_cpus
  2021-09-24  2:26   ` Stefano Stabellini
@ 2021-09-24  4:25     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24  4:25 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: September 24, 2021 10:26
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 26/37] xen/arm: build NUMA cpu_to_node map in
> dt_smp_init_cpus
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > NUMA implementation has a cpu_to_node array to store CPU to NODE
> > map. Xen is using CPU logical ID in runtime components, so we
> > use CPU logical ID as CPU index in cpu_to_node.
> >
> > In device tree case, cpu_logical_map is created in dt_smp_init_cpus.
> > So, when NUMA is enabled, dt_smp_init_cpus will fetch CPU NUMA id
> > at the same time for cpu_to_node.
> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/arm/smpboot.c     | 37 ++++++++++++++++++++++++++++++++++++-
> >  xen/include/asm-arm/numa.h |  5 +++++
> >  2 files changed, 41 insertions(+), 1 deletion(-)
> >
> > diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
> > index 60c0e82fc5..6e3cc8d3cc 100644
> > --- a/xen/arch/arm/smpboot.c
> > +++ b/xen/arch/arm/smpboot.c
> > @@ -121,7 +121,12 @@ static void __init dt_smp_init_cpus(void)
> >      {
> >          [0 ... NR_CPUS - 1] = MPIDR_INVALID
> >      };
> > +    static nodeid_t node_map[NR_CPUS] __initdata =
> > +    {
> > +        [0 ... NR_CPUS - 1] = NUMA_NO_NODE
> > +    };
> >      bool bootcpu_valid = false;
> > +    uint32_t nid = 0;
> >      int rc;
> >
> >      mpidr = system_cpuinfo.mpidr.bits & MPIDR_HWID_MASK;
> > @@ -172,6 +177,28 @@ static void __init dt_smp_init_cpus(void)
> >              continue;
> >          }
> >
> > +        if ( IS_ENABLED(CONFIG_NUMA) )
> > +        {
> > +            /*
> > +             * When CONFIG_NUMA is set, try to fetch numa infomation
> > +             * from CPU dts node, otherwise the nid is always 0.
> > +             */
> > +            if ( !dt_property_read_u32(cpu, "numa-node-id", &nid) )
> > +            {
> > +                printk(XENLOG_WARNING
> > +                       "cpu[%d] dts path: %s: doesn't have numa
> information!\n",
>                                ^ %u
> 
> 
> > +                       cpuidx, dt_node_full_name(cpu));
> 
> I think that this message shouldn't be a warning: CONFIG_NUMA is a
> compile time option. Anybody that enables CONFIG_NUMA in the Xen build
> will get this warning printed out at boot time if Xen is booting on a
> regular non-NUMA machine, right?
> 
> The warning should only be printed if NUMA is actively enabled, e.g.
> there is a distance-map but the cpus don't have numa-node-id.
> 
> 

Yes, this message would be unexpected on a regular non-NUMA machine.
I will add some check conditions before printing this message.

> 
> > +                /*
> > +                 * During the early stage of NUMA initialization, when
> Xen
> > +                 * found any CPU dts node doesn't have numa-node-id
> info, the
> > +                 * NUMA will be treated as off, all CPU will be set to
> a FAKE
> > +                 * node 0. So if we get numa-node-id failed here, we
> should
> > +                 * set nid to 0.
> > +                 */
> > +                nid = 0;
> > +            }
> > +        }
> > +
> >          /*
> >           * 8 MSBs must be set to 0 in the DT since the reg property
> >           * defines the MPIDR[23:0]
> > @@ -231,9 +258,12 @@ static void __init dt_smp_init_cpus(void)
> >          {
> >              printk("cpu%d init failed (hwid %"PRIregister"): %d\n", i,
> hwid, rc);
> >              tmp_map[i] = MPIDR_INVALID;
> > +            node_map[i] = NUMA_NO_NODE;
> >          }
> > -        else
> > +        else {
> >              tmp_map[i] = hwid;
> > +            node_map[i] = nid;
> > +        }
> >      }
> >
> >      if ( !bootcpu_valid )
> > @@ -249,6 +279,11 @@ static void __init dt_smp_init_cpus(void)
> >              continue;
> >          cpumask_set_cpu(i, &cpu_possible_map);
> >          cpu_logical_map(i) = tmp_map[i];
> > +
> > +        nid = node_map[i];
> > +        if ( nid >= MAX_NUMNODES )
> > +            nid = 0;
> > +        numa_set_node(i, nid);
> >      }
> >  }
> >
> > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > index 758eafeb05..8a4ad379e0 100644
> > --- a/xen/include/asm-arm/numa.h
> > +++ b/xen/include/asm-arm/numa.h
> > @@ -46,6 +46,11 @@ extern mfn_t first_valid_mfn;
> >  #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
> >  #define __node_distance(a, b) (20)
> >
> > +static inline void numa_set_node(int cpu, nodeid_t node)
> > +{
> > +
> > +}
> > +
> >  #endif
> >
> >  static inline unsigned int arch_have_default_dmazone(void)
> > --
> > 2.25.1
> >


* RE: [PATCH 28/37] xen/arm: stub memory hotplug access helpers for Arm
  2021-09-24  2:33   ` Stefano Stabellini
@ 2021-09-24  4:26     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24  4:26 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis



> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: September 24, 2021 10:34
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 28/37] xen/arm: stub memory hotplug access helpers for
> Arm
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > Common NUMA code needs these two helpers to access/update the
> > memory hotplug end address. Arm does not support memory hotplug
> > yet. So we stub these two helpers in this patch to make the NUMA
> > common code happy.
> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/include/asm-arm/mm.h | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
> > index 7b5e7b7f69..fc9433165d 100644
> > --- a/xen/include/asm-arm/mm.h
> > +++ b/xen/include/asm-arm/mm.h
> > @@ -362,6 +362,16 @@ void clear_and_clean_page(struct page_info *page);
> >
> >  unsigned int arch_get_dma_bitsize(void);
> >
> > +static inline void mem_hotplug_update_boundary(paddr_t end)
> > +{
> > +
> > +}
> > +
> > +static inline paddr_t mem_hotplug_boundary(void)
> > +{
> > +    return 0;
> > +}
> 
> Why zero? Could it be INVALID_PADDR ?

Yes, INVALID_PADDR is better.


* RE: [PATCH 08/37] xen/x86: add detection of discontinous node memory range
  2021-09-24  0:25   ` Stefano Stabellini
@ 2021-09-24  4:28     ` Wei Chen
  2021-09-24 19:52       ` Stefano Stabellini
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-24  4:28 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, julien, Bertrand Marquis, jbeulich, andrew.cooper3,
	roger.pau, wl



> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: September 24, 2021 8:26
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> memory range
> 
> CC'ing x86 maintainers
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > One NUMA node may contain several memory blocks. In the current Xen
> > code, Xen maintains a node memory range for each node to cover
> > all of its memory blocks. But here comes the problem: if, in the gap
> > between two of a node's memory blocks, there are memory blocks that
> > don't belong to this node (remote memory blocks), this node's memory
> > range will be expanded to cover these remote memory blocks.
> >
> > One node's memory range containing other nodes' memory is obviously
> > not reasonable. This means the current NUMA code can only support
> > nodes with contiguous memory blocks. However, on a physical machine,
> > the addresses of multiple nodes can be interleaved.
> >
> > So in this patch, we add code to detect discontiguous memory blocks
> > for one node. NUMA initialization will fail and error messages
> > will be printed when Xen detects such a hardware configuration.
> 
> At least on ARM, it is not just memory that can be interleaved, but also
> MMIO regions. For instance:
> 
> node0 bank0 0-0x1000000
> MMIO 0x1000000-0x1002000
> Hole 0x1002000-0x2000000
> node0 bank1 0x2000000-0x3000000
> 
> So I am not familiar with the SRAT format, but I think on ARM the check
> would look different: we would just look for multiple memory ranges
> under a device_type = "memory" node of a NUMA node in device tree.
> 
> 

Should I include/refine the above message in the commit log?

> 
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/x86/srat.c | 36 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 36 insertions(+)
> >
> > diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> > index 7d20d7f222..2f08fa4660 100644
> > --- a/xen/arch/x86/srat.c
> > +++ b/xen/arch/x86/srat.c
> > @@ -271,6 +271,36 @@ acpi_numa_processor_affinity_init(const struct
> acpi_srat_cpu_affinity *pa)
> >  		       pxm, pa->apic_id, node);
> >  }
> >
> > +/*
> > + * Check to see if there are other nodes within this node's range.
> > + * We just need to check full contains situation. Because overlaps
> > + * have been checked before by conflicting_memblks.
> > + */
> > +static bool __init is_node_memory_continuous(nodeid_t nid,
> > +    paddr_t start, paddr_t end)
> > +{
> > +	nodeid_t i;
> > +
> > +	struct node *nd = &nodes[nid];
> > +	for_each_node_mask(i, memory_nodes_parsed)
> > +	{
> > +		/* Skip itself */
> > +		if (i == nid)
> > +			continue;
> > +
> > +		nd = &nodes[i];
> > +		if (start < nd->start && nd->end < end)
> > +		{
> > +			printk(KERN_ERR
> > +			       "NODE %u: (%"PRIpaddr"-%"PRIpaddr") intertwine
> with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",
> > +			       nid, start, end, i, nd->start, nd->end);
> > +			return false;
> > +		}
> > +	}
> > +
> > +	return true;
> > +}
> > +
> >  /* Callback for parsing of the Proximity Domain <-> Memory Area
> mappings */
> >  void __init
> >  acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
> > @@ -344,6 +374,12 @@ acpi_numa_memory_affinity_init(const struct
> acpi_srat_mem_affinity *ma)
> >  				nd->start = start;
> >  			if (nd->end < end)
> >  				nd->end = end;
> > +
> > +			/* Check whether this range contains memory for other
> nodes */
> > +			if (!is_node_memory_continuous(node, nd->start, nd->end))
> {
> > +				bad_srat();
> > +				return;
> > +			}
> >  		}
> >  	}
> >  	printk(KERN_INFO "SRAT: Node %u PXM %u %"PRIpaddr"-%"PRIpaddr"%s\n",
> > --
> > 2.25.1
> >
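The containment check in the quoted hunk can be modelled in isolation. The node table, count, and addresses below are illustrative stand-ins for Xen's `nodes[]`, not the real layout; overlap rejection is assumed to have happened earlier, as `conflicting_memblks()` does in Xen:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t paddr_t;
typedef uint8_t nodeid_t;

struct node { paddr_t start, end; };

/* Illustrative fixed-size table standing in for Xen's nodes[] */
#define NR_NODES 4
static struct node nodes[NR_NODES];

/*
 * Returns false when any other node's range falls strictly inside
 * [start, end) -- i.e. the expanded range of node nid would swallow
 * a remote node's memory.
 */
static bool is_node_memory_continuous(nodeid_t nid, paddr_t start, paddr_t end)
{
    for ( nodeid_t i = 0; i < NR_NODES; i++ )
    {
        if ( i == nid )
            continue;   /* skip the node being checked */
        if ( start < nodes[i].start && nodes[i].end < end )
            return false;   /* a remote node sits inside our range */
    }
    return true;
}
```

The test is pure interval containment: strict inequalities on both ends, because a range merely touching another node's boundary is not an interleave.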


* RE: [PATCH 10/37] xen/x86: use helpers to access/update mem_hotplug
  2021-09-24  0:31   ` Stefano Stabellini
@ 2021-09-24  4:29     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24  4:29 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, julien, Bertrand Marquis, jbeulich, andrew.cooper3,
	roger.pau, wl



> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> Stefano Stabellini
> Sent: September 24, 2021 8:32
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> Subject: Re: [PATCH 10/37] xen/x86: use helpers to access/update
> mem_hotplug
> 
> +x86 maintainers
> 
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > We want to abstract code from acpi_numa_memory_affinity_init.
> > But mem_hotplug is coupled with x86. In this patch, we use
> > helpers to repace mem_hotplug direct accessing. This will
>              ^ replace
> 
> > allow most code can be common.
>                   ^ to be
> 
> I think this patch could be merged with the previous patch
> 

OK, I will do that and fix the above typos.

> 
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/x86/srat.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> > index 2f08fa4660..3334ede7a5 100644
> > --- a/xen/arch/x86/srat.c
> > +++ b/xen/arch/x86/srat.c
> > @@ -391,8 +391,8 @@ acpi_numa_memory_affinity_init(const struct
> acpi_srat_mem_affinity *ma)
> >  	memblk_nodeid[num_node_memblks] = node;
> >  	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
> >  		__set_bit(num_node_memblks, memblk_hotplug);
> > -		if (end > mem_hotplug)
> > -			mem_hotplug = end;
> > +		if (end > mem_hotplug_boundary())
> > +			mem_hotplug_update_boundary(end);
> >  	}
> >  	num_node_memblks++;
> >  }
> > --
> > 2.25.1
> >



* RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture
  2021-09-24  1:15   ` Stefano Stabellini
@ 2021-09-24  4:34     ` Wei Chen
  2021-09-24  7:58       ` Jan Beulich
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-24  4:34 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: September 24, 2021 9:15
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-
> EFI architecture
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > Some architectures do not support EFI, but the EFI API will be used
> > by some common features. Instead of spreading #ifdef ARCH, we
> > introduce this Kconfig option to give Xen the ability to stub the
> > EFI API for architectures without EFI support.
> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/arm/Kconfig  |  1 +
> >  xen/arch/arm/Makefile |  2 +-
> >  xen/arch/x86/Kconfig  |  1 +
> >  xen/common/Kconfig    | 11 +++++++++++
> >  xen/include/xen/efi.h |  4 ++++
> >  5 files changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig
> > index ecfa6822e4..865ad83a89 100644
> > --- a/xen/arch/arm/Kconfig
> > +++ b/xen/arch/arm/Kconfig
> > @@ -6,6 +6,7 @@ config ARM_64
> >  	def_bool y
> >  	depends on !ARM_32
> >  	select 64BIT
> > +	select EFI
> >  	select HAS_FAST_MULTIPLY
> >
> >  config ARM
> > diff --git a/xen/arch/arm/Makefile b/xen/arch/arm/Makefile
> > index 3d3b97b5b4..ae4efbf76e 100644
> > --- a/xen/arch/arm/Makefile
> > +++ b/xen/arch/arm/Makefile
> > @@ -1,6 +1,6 @@
> >  obj-$(CONFIG_ARM_32) += arm32/
> >  obj-$(CONFIG_ARM_64) += arm64/
> > -obj-$(CONFIG_ARM_64) += efi/
> > +obj-$(CONFIG_EFI) += efi/
> >  obj-$(CONFIG_ACPI) += acpi/
> >  ifneq ($(CONFIG_NO_PLAT),y)
> >  obj-y += platforms/
> > diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig
> > index 28d13b9705..b9ed187f6b 100644
> > --- a/xen/arch/x86/Kconfig
> > +++ b/xen/arch/x86/Kconfig
> > @@ -10,6 +10,7 @@ config X86
> >  	select ALTERNATIVE_CALL
> >  	select ARCH_SUPPORTS_INT128
> >  	select CORE_PARKING
> > +	select EFI
> >  	select HAS_ALTERNATIVE
> >  	select HAS_COMPAT
> >  	select HAS_CPUFREQ
> > diff --git a/xen/common/Kconfig b/xen/common/Kconfig
> > index 9ebb1c239b..f998746a1a 100644
> > --- a/xen/common/Kconfig
> > +++ b/xen/common/Kconfig
> > @@ -11,6 +11,16 @@ config COMPAT
> >  config CORE_PARKING
> >  	bool
> >
> > +config EFI
> > +	bool
> 
> Without the title the option is not user-selectable (or de-selectable).
> So the help message below can never be seen.
> 
> Either add a title, e.g.:
> 
> bool "EFI support"
> 
> Or fully make the option a silent option by removing the help text.
> 
> 

OK. In the current Xen code, EFI is unconditionally compiled. Until
we change the related code, I prefer to remove the help text.
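With the help text dropped, EFI becomes a silent Kconfig symbol: it has no prompt, so users can never toggle it, and it is only enabled via `select` from the architecture symbols. A sketch of the resulting fragments (paths as in the patch):

```
# xen/common/Kconfig -- silent option: no prompt string, so never user-visible
config EFI
	bool

# xen/arch/arm/Kconfig -- ARM_64 (like X86) pulls it in unconditionally
config ARM_64
	def_bool y
	depends on !ARM_32
	select 64BIT
	select EFI
	select HAS_FAST_MULTIPLY
```

This matches the current reality that EFI support is unconditionally built on the architectures that have it, while still giving common code a single CONFIG_EFI to test.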

> 
> > +	---help---
> > +      This option provides support for runtime services provided
> > +      by UEFI firmware (such as non-volatile variables, realtime
> > +      clock, and platform reset). A UEFI stub is also provided to
> > +      allow the kernel to be booted as an EFI application. This
> > +      is only useful for kernels that may run on systems that have
> > +      UEFI firmware.
> > +
> >  config GRANT_TABLE
> >  	bool "Grant table support" if EXPERT
> >  	default y
> > @@ -196,6 +206,7 @@ config KEXEC
> >
> >  config EFI_SET_VIRTUAL_ADDRESS_MAP
> >      bool "EFI: call SetVirtualAddressMap()" if EXPERT
> > +    depends on EFI
> >      ---help---
> >        Call EFI SetVirtualAddressMap() runtime service to setup memory
> map for
> >        further runtime services. According to UEFI spec, it isn't
> strictly
> > diff --git a/xen/include/xen/efi.h b/xen/include/xen/efi.h
> > index 94a7e547f9..661a48286a 100644
> > --- a/xen/include/xen/efi.h
> > +++ b/xen/include/xen/efi.h
> > @@ -25,6 +25,8 @@ extern struct efi efi;
> >
> >  #ifndef __ASSEMBLY__
> >
> > +#ifdef CONFIG_EFI
> > +
> >  union xenpf_efi_info;
> >  union compat_pf_efi_info;
> >
> > @@ -45,6 +47,8 @@ int efi_runtime_call(struct xenpf_efi_runtime_call *);
> >  int efi_compat_get_info(uint32_t idx, union compat_pf_efi_info *);
> >  int efi_compat_runtime_call(struct compat_pf_efi_runtime_call *);
> >
> > +#endif /* CONFIG_EFI*/
> > +
> >  #endif /* !__ASSEMBLY__ */
> >
> >  #endif /* __XEN_EFI_H__ */
> > --
> > 2.25.1
> >


* RE: [PATCH 21/37] xen/arm: Keep memory nodes in dtb for NUMA when boot from EFI
  2021-09-24  1:23   ` Stefano Stabellini
@ 2021-09-24  4:36     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24  4:36 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis


> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 2021年9月24日 9:23
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 21/37] xen/arm: Keep memory nodes in dtb for NUMA when
> boot from EFI
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > EFI can get the memory map from the EFI system table. But the EFI
> > system table doesn't contain memory NUMA information; EFI depends on
> > the ACPI SRAT or device tree memory nodes to parse the memory
> > blocks' NUMA mapping.
> >
> > But in the current code, when Xen is booting from EFI, it will
> > delete all memory nodes in the device tree. So in a UEFI + DTB
> > boot, we no longer have numa-node-id for the memory blocks.
> >
> > So in this patch, we keep the memory nodes in the device tree for
> > the NUMA code to parse the memory numa-node-id later.
> >
> > As a side effect, if we still parse boot memory information in
> > early_scan_node, bootinfo.mem will calculate memory ranges in
> > memory nodes twice. So we have to prevent early_scan_node from
> > parsing memory nodes in an EFI boot.
> >
> > As EFI APIs only can be used in Arm64, so we introduced a stub
> > API for non-EFI supported Arm32. This will prevent
> 
> This last sentence is incomplete.
> 
> But aside from that, this patch looks good to me.
> 

Ah, it was truncated by accident. I will fix it.

> 
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/arm/bootfdt.c      |  8 +++++++-
> >  xen/arch/arm/efi/efi-boot.h | 25 -------------------------
> >  xen/include/xen/efi.h       |  7 +++++++
> >  3 files changed, 14 insertions(+), 26 deletions(-)
> >
> > diff --git a/xen/arch/arm/bootfdt.c b/xen/arch/arm/bootfdt.c
> > index afaa0e249b..6bc5a465ec 100644
> > --- a/xen/arch/arm/bootfdt.c
> > +++ b/xen/arch/arm/bootfdt.c
> > @@ -11,6 +11,7 @@
> >  #include <xen/lib.h>
> >  #include <xen/kernel.h>
> >  #include <xen/init.h>
> > +#include <xen/efi.h>
> >  #include <xen/device_tree.h>
> >  #include <xen/libfdt/libfdt.h>
> >  #include <xen/sort.h>
> > @@ -370,7 +371,12 @@ static int __init early_scan_node(const void *fdt,
> >  {
> >      int rc = 0;
> >
> > -    if ( device_tree_node_matches(fdt, node, "memory") )
> > +    /*
> > +     * If Xen has been booted via UEFI, the memory banks will already
> > +     * be populated. So we should skip the parsing.
> > +     */
> > +    if ( !efi_enabled(EFI_BOOT) &&
> > +         device_tree_node_matches(fdt, node, "memory"))
> >          rc = process_memory_node(fdt, node, name, depth,
> >                                   address_cells, size_cells,
> &bootinfo.mem);
> >      else if ( depth == 1 && !dt_node_cmp(name, "reserved-memory") )
> > diff --git a/xen/arch/arm/efi/efi-boot.h b/xen/arch/arm/efi/efi-boot.h
> > index cf9c37153f..d0a9987fa4 100644
> > --- a/xen/arch/arm/efi/efi-boot.h
> > +++ b/xen/arch/arm/efi/efi-boot.h
> > @@ -197,33 +197,8 @@ EFI_STATUS __init
> fdt_add_uefi_nodes(EFI_SYSTEM_TABLE *sys_table,
> >      int status;
> >      u32 fdt_val32;
> >      u64 fdt_val64;
> > -    int prev;
> >      int num_rsv;
> >
> > -    /*
> > -     * Delete any memory nodes present.  The EFI memory map is the only
> > -     * memory description provided to Xen.
> > -     */
> > -    prev = 0;
> > -    for (;;)
> > -    {
> > -        const char *type;
> > -        int len;
> > -
> > -        node = fdt_next_node(fdt, prev, NULL);
> > -        if ( node < 0 )
> > -            break;
> > -
> > -        type = fdt_getprop(fdt, node, "device_type", &len);
> > -        if ( type && strncmp(type, "memory", len) == 0 )
> > -        {
> > -            fdt_del_node(fdt, node);
> > -            continue;
> > -        }
> > -
> > -        prev = node;
> > -    }
> > -
> >     /*
> >      * Delete all memory reserve map entries. When booting via UEFI,
> >      * kernel will use the UEFI memory map to find reserved regions.
> > diff --git a/xen/include/xen/efi.h b/xen/include/xen/efi.h
> > index 661a48286a..b52a4678e9 100644
> > --- a/xen/include/xen/efi.h
> > +++ b/xen/include/xen/efi.h
> > @@ -47,6 +47,13 @@ int efi_runtime_call(struct xenpf_efi_runtime_call *);
> >  int efi_compat_get_info(uint32_t idx, union compat_pf_efi_info *);
> >  int efi_compat_runtime_call(struct compat_pf_efi_runtime_call *);
> >
> > +#else
> > +
> > +static inline bool efi_enabled(unsigned int feature)
> > +{
> > +    return false;
> > +}
> > +
> >  #endif /* CONFIG_EFI*/
> >
> >  #endif /* !__ASSEMBLY__ */
> > --
> > 2.25.1
> >


* RE: [PATCH 23/37] xen/arm: implement node distance helpers for Arm
  2021-09-24  1:46   ` Stefano Stabellini
@ 2021-09-24  4:41     ` Wei Chen
  2021-09-24 19:36       ` Stefano Stabellini
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-24  4:41 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: September 24, 2021 9:47
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 23/37] xen/arm: implement node distance helpers for
> Arm
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > We will parse NUMA node distances from the device tree or ACPI
> > table. So we need a matrix to record the distances between
> > any two nodes we parsed. Accordingly, in this patch we provide the
> > numa_set_distance API for the device tree or ACPI table parsers
> > to set the distance between any two nodes.
> > When NUMA initialization fails, __node_distance will return
> > NUMA_REMOTE_DISTANCE; this helps us avoid rolling back the
> > distance matrix when NUMA initialization fails.
> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/arm/Makefile      |  1 +
> >  xen/arch/arm/numa.c        | 69 ++++++++++++++++++++++++++++++++++++++
> >  xen/include/asm-arm/numa.h | 13 +++++++
> >  3 files changed, 83 insertions(+)
> >  create mode 100644 xen/arch/arm/numa.c
> >
> > diff --git a/xen/arch/arm/Makefile b/xen/arch/arm/Makefile
> > index ae4efbf76e..41ca311b6b 100644
> > --- a/xen/arch/arm/Makefile
> > +++ b/xen/arch/arm/Makefile
> > @@ -35,6 +35,7 @@ obj-$(CONFIG_LIVEPATCH) += livepatch.o
> >  obj-y += mem_access.o
> >  obj-y += mm.o
> >  obj-y += monitor.o
> > +obj-$(CONFIG_NUMA) += numa.o
> >  obj-y += p2m.o
> >  obj-y += percpu.o
> >  obj-y += platform.o
> > diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> > new file mode 100644
> > index 0000000000..3f08870d69
> > --- /dev/null
> > +++ b/xen/arch/arm/numa.c
> > @@ -0,0 +1,69 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Arm Architecture support layer for NUMA.
> > + *
> > + * Copyright (C) 2021 Arm Ltd
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> > + *
> > + */
> > +#include <xen/init.h>
> > +#include <xen/numa.h>
> > +
> > +static uint8_t __read_mostly
> > +node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
> > +    { 0 }
> > +};
> > +
> > +void __init numa_set_distance(nodeid_t from, nodeid_t to, uint32_t
> distance)
> > +{
> > +    if ( from >= MAX_NUMNODES || to >= MAX_NUMNODES )
> > +    {
> > +        printk(KERN_WARNING
> > +               "NUMA: invalid nodes: from=%"PRIu8" to=%"PRIu8"
> MAX=%"PRIu8"\n",
> > +               from, to, MAX_NUMNODES);
> > +        return;
> > +    }
> > +
> > +    /* NUMA defines 0xff as an unreachable node and 0-9 are undefined
> */
> > +    if ( distance >= NUMA_NO_DISTANCE ||
> > +        (distance >= NUMA_DISTANCE_UDF_MIN &&
> > +         distance <= NUMA_DISTANCE_UDF_MAX) ||
> > +        (from == to && distance != NUMA_LOCAL_DISTANCE) )
> > +    {
> > +        printk(KERN_WARNING
> > +               "NUMA: invalid distance: from=%"PRIu8" to=%"PRIu8"
> distance=%"PRIu32"\n",
> > +               from, to, distance);
> > +        return;
> > +    }
> > +
> > +    node_distance_map[from][to] = distance;
> > +}
> > +
> > +uint8_t __node_distance(nodeid_t from, nodeid_t to)
> > +{
> > +    /* When NUMA is off, any distance will be treated as remote. */
> > +    if ( srat_disabled() )
> 
> Given that this is ARM specific code and specific to ACPI, I don't think
> we should have any call to something called "srat_disabled".
> 
> I suggest to either rename srat_disabled to numa_distance_disabled.
> 
> Other than that, this patch looks OK to me.
> 

SRAT stands for static resource affinity table; I think a DTB can also be
treated as a static resource affinity table. So I kept SRAT in this patch
and the others. I have seen your comment on patch #25. Until the x86
maintainers give any feedback, can we still keep srat here?

> 
> > +        return NUMA_REMOTE_DISTANCE;
> > +
> > +    /*
> > +     * Check whether the nodes are in the matrix range.
> > +     * When any node is out of range, except from and to nodes are the
> > +     * same, we treat them as unreachable (return 0xFF)
> > +     */
> > +    if ( from >= MAX_NUMNODES || to >= MAX_NUMNODES )
> > +        return from == to ? NUMA_LOCAL_DISTANCE : NUMA_NO_DISTANCE;
> > +
> > +    return node_distance_map[from][to];
> > +}
> > +EXPORT_SYMBOL(__node_distance);
> > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > index 21569e634b..758eafeb05 100644
> > --- a/xen/include/asm-arm/numa.h
> > +++ b/xen/include/asm-arm/numa.h
> > @@ -9,8 +9,21 @@ typedef u8 nodeid_t;
> >
> >  #ifdef CONFIG_NUMA
> >
> > +/*
> > + * In ACPI spec, 0-9 are the reserved values for node distance,
> > + * 10 indicates local node distance, 20 indicates remote node
> > + * distance. Set node distance map in device tree will follow
> > + * the ACPI's definition.
> > + */
> > +#define NUMA_DISTANCE_UDF_MIN   0
> > +#define NUMA_DISTANCE_UDF_MAX   9
> > +#define NUMA_LOCAL_DISTANCE     10
> > +#define NUMA_REMOTE_DISTANCE    20
> > +
> >  #define NR_NODE_MEMBLKS NR_MEM_BANKS
> >
> > +extern void numa_set_distance(nodeid_t from, nodeid_t to, uint32_t
> distance);
> > +
> >  #else
> >
> >  /* Fake one node for now. See also node_online_map. */
> > --
> > 2.25.1
> >
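The validation rules that numa_set_distance() applies to these values (0-9 reserved per the ACPI spec, 0xFF meaning unreachable, and the self-distance pinned to the local value) can be distilled into a standalone predicate. The macro names mirror the patch; the predicate itself is a sketch, not Xen code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t nodeid_t;

/* Values mirroring the ACPI SLIT conventions used by the patch */
#define NUMA_DISTANCE_UDF_MIN   0
#define NUMA_DISTANCE_UDF_MAX   9
#define NUMA_LOCAL_DISTANCE     10
#define NUMA_NO_DISTANCE        0xFF

/* True when (from, to, distance) would be accepted by numa_set_distance() */
static bool numa_distance_is_valid(nodeid_t from, nodeid_t to, uint32_t distance)
{
    if ( distance >= NUMA_NO_DISTANCE )
        return false;   /* 0xFF and above: unreachable / invalid */
    if ( distance <= NUMA_DISTANCE_UDF_MAX )
        return false;   /* 0-9 are reserved by the ACPI spec */
    if ( from == to && distance != NUMA_LOCAL_DISTANCE )
        return false;   /* self-distance must be exactly the local value */
    return true;
}
```

Anything rejected here is only warned about and dropped by the patch, so a bogus firmware entry cannot corrupt the distance map.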


* RE: [PATCH 24/37] xen/arm: implement two arch helpers to get memory map info
  2021-09-24  2:06   ` Stefano Stabellini
@ 2021-09-24  4:42     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24  4:42 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis



> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: September 24, 2021 10:06
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 24/37] xen/arm: implement two arch helpers to get
> memory map info
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > These two helpers are architecture APIs that are required by
> > nodes_cover_memory.
> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/arm/numa.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> >
> > diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> > index 3f08870d69..3755b01ef4 100644
> > --- a/xen/arch/arm/numa.c
> > +++ b/xen/arch/arm/numa.c
> > @@ -67,3 +67,17 @@ uint8_t __node_distance(nodeid_t from, nodeid_t to)
> >      return node_distance_map[from][to];
> >  }
> >  EXPORT_SYMBOL(__node_distance);
> > +
> > +uint32_t __init arch_meminfo_get_nr_bank(void)
> > +{
> > +	return bootinfo.mem.nr_banks;
> > +}
> > +
> > +int __init arch_meminfo_get_ram_bank_range(uint32_t bank,
> > +	paddr_t *start, paddr_t *end)
> > +{
> > +	*start = bootinfo.mem.bank[bank].start;
> > +	*end = bootinfo.mem.bank[bank].start + bootinfo.mem.bank[bank].size;
> > +
> > +	return 0;
> > +}
> 
> The rest of the file is indented using spaces, while this patch is using
> tabs.
> 
> Also, given the implementation, it looks like
> arch_meminfo_get_ram_bank_range should either return void or bool.

I will fix them in the next version.


* RE: [PATCH 25/37] xen/arm: implement bad_srat for Arm NUMA initialization
  2021-09-24  2:09   ` Stefano Stabellini
@ 2021-09-24  4:45     ` Wei Chen
  2021-09-24  8:07     ` Jan Beulich
  1 sibling, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24  4:45 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, julien, Bertrand Marquis, jbeulich, andrew.cooper3,
	roger.pau, wl

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: September 24, 2021 10:10
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> Subject: Re: [PATCH 25/37] xen/arm: implement bad_srat for Arm NUMA
> initialization
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > NUMA initialization will parse information from the firmware-provided
> > static resource affinity table (ACPI SRAT or DTB). bad_srat is a
> > function that will be used when the initialization code encounters
> > some unexpected errors.
> >
> > In this patch, we introduce an Arm version of bad_srat for the NUMA
> > common initialization code to invoke.
> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/arm/numa.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> > index 3755b01ef4..5209d3de4d 100644
> > --- a/xen/arch/arm/numa.c
> > +++ b/xen/arch/arm/numa.c
> > @@ -18,6 +18,7 @@
> >   *
> >   */
> >  #include <xen/init.h>
> > +#include <xen/nodemask.h>
> >  #include <xen/numa.h>
> >
> >  static uint8_t __read_mostly
> > @@ -25,6 +26,12 @@ node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
> >      { 0 }
> >  };
> >
> > +__init void bad_srat(void)
> > +{
> > +    printk(KERN_ERR "NUMA: Firmware SRAT table not used.\n");
> > +    fw_numa = -1;
> > +}
> 
> I realize that the series keeps the "srat" terminology everywhere on DT
> too. I wonder if it is worth replacing srat with something like
> "numa_distance" everywhere as appropriate. I am adding the x86
> maintainers for an opinion.
> 
> If you guys prefer to keep srat (if nothing else, it is concise), I am
> also OK with keeping srat although it is not technically accurate.

I have left some comments on patch #23 about srat.
I would prefer not to replace srat with numa_distance, because SRAT not
only contains distances; it also includes affinity information for CPUs
and other devices.



* RE: [PATCH 29/37] xen/arm: introduce a helper to parse device tree processor node
  2021-09-24  2:44   ` Stefano Stabellini
@ 2021-09-24  4:46     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24  4:46 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis


> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: September 24, 2021 10:45
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 29/37] xen/arm: introduce a helper to parse device
> tree processor node
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > Processor NUMA ID information is stored in a device tree processor
> > node as "numa-node-id". We need a new helper to parse this ID from
> > the processor node. Once we have read the ID, its validity still
> > needs to be checked. If we get an invalid NUMA ID from any
> > processor node, the device tree will be marked as having invalid
> > NUMA information.
> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/arm/Makefile           |  1 +
> >  xen/arch/arm/numa_device_tree.c | 58 +++++++++++++++++++++++++++++++++
> >  2 files changed, 59 insertions(+)
> >  create mode 100644 xen/arch/arm/numa_device_tree.c
> >
> > diff --git a/xen/arch/arm/Makefile b/xen/arch/arm/Makefile
> > index 41ca311b6b..c50df2c25d 100644
> > --- a/xen/arch/arm/Makefile
> > +++ b/xen/arch/arm/Makefile
> > @@ -36,6 +36,7 @@ obj-y += mem_access.o
> >  obj-y += mm.o
> >  obj-y += monitor.o
> >  obj-$(CONFIG_NUMA) += numa.o
> > +obj-$(CONFIG_DEVICE_TREE_NUMA) += numa_device_tree.o
> >  obj-y += p2m.o
> >  obj-y += percpu.o
> >  obj-y += platform.o
> > diff --git a/xen/arch/arm/numa_device_tree.c
> b/xen/arch/arm/numa_device_tree.c
> > new file mode 100644
> > index 0000000000..2428fbae0b
> > --- /dev/null
> > +++ b/xen/arch/arm/numa_device_tree.c
> > @@ -0,0 +1,58 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Arm Architecture support layer for NUMA.
> > + *
> > + * Copyright (C) 2021 Arm Ltd
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> > + *
> > + */
> > +#include <xen/init.h>
> > +#include <xen/nodemask.h>
> > +#include <xen/numa.h>
> > +#include <xen/libfdt/libfdt.h>
> > +#include <xen/device_tree.h>
> > +
> > +/* Callback for device tree processor affinity */
> > +static int __init fdt_numa_processor_affinity_init(nodeid_t node)
> > +{
> > +    if ( srat_disabled() )
> > +        return -EINVAL;
> 
> fdt_numa_processor_affinity_init is called by fdt_parse_numa_cpu_node
> which is already parsing NUMA related info. Should this srat_disabled
> check be moved to fdt_parse_numa_cpu_node?
> 

Ah, yes, it's a good suggestion, I will address it in next version.

> 
> > +    else if ( node == NUMA_NO_NODE || node >= MAX_NUMNODES )
> > +    {
> > +        bad_srat();
> > +        return -EINVAL;
> > +	}
> > +
> > +    numa_set_processor_nodes_parsed(node);
> > +    fw_numa = 1;
> > +
> > +    printk(KERN_INFO "DT: NUMA node %"PRIu7" processor parsed\n", node);
> > +
> > +    return 0;
> > +}
> > +
> > +/* Parse CPU NUMA node info */
> > +static int __init fdt_parse_numa_cpu_node(const void *fdt, int node)
> > +{
> > +    uint32_t nid;
> > +
> > +    nid = device_tree_get_u32(fdt, node, "numa-node-id", MAX_NUMNODES);
> > +    if ( nid >= MAX_NUMNODES )
> > +    {
> > +        printk(XENLOG_ERR "Node id %u exceeds maximum value\n", nid);
>                                       ^ PRIu32
> 
> 
> > +        return -EINVAL;
> > +    }
> > +
> > +    return fdt_numa_processor_affinity_init(nid);
> > +}
> > --
> > 2.25.1
> >
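The lookup in the quoted hunk amounts to: read the u32 cell if the property is present, fall back to a sentinel, then range-check against MAX_NUMNODES. A standalone model (the raw property buffer and the helper names here are illustrative stand-ins for libfdt/Xen's `device_tree_get_u32`):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NUMNODES 8   /* illustrative; Xen derives this from NODES_SHIFT */

/* FDT property cells are stored big-endian; convert one cell to host order */
static uint32_t fdt32_to_cpu_sketch(const uint8_t *cell)
{
    return ((uint32_t)cell[0] << 24) | ((uint32_t)cell[1] << 16) |
           ((uint32_t)cell[2] << 8)  |  (uint32_t)cell[3];
}

/* Model of device_tree_get_u32(): return the cell, or dflt when absent */
static uint32_t get_u32_or_default(const uint8_t *prop, uint32_t dflt)
{
    return prop ? fdt32_to_cpu_sketch(prop) : dflt;
}

/* Returns the node id, or MAX_NUMNODES to signal "absent or out of range" */
static uint32_t parse_numa_node_id(const uint8_t *prop)
{
    uint32_t nid = get_u32_or_default(prop, MAX_NUMNODES);

    return (nid >= MAX_NUMNODES) ? MAX_NUMNODES : nid;
}
```

Using MAX_NUMNODES as both the "property missing" default and the invalid marker is why the patch can get away with a single `nid >= MAX_NUMNODES` test.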


* RE: [PATCH 31/37] xen/arm: introduce a helper to parse device tree NUMA distance map
  2021-09-24  3:05   ` Stefano Stabellini
@ 2021-09-24  5:23     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24  5:23 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: September 24, 2021 11:05
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 31/37] xen/arm: introduce a helper to parse device
> tree NUMA distance map
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > A NUMA aware device tree will provide a "distance-map" node to
> > describe distance between any two nodes. This patch introduce a
> > new helper to parse this distance map.
> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/arm/numa_device_tree.c | 106 ++++++++++++++++++++++++++++++++
> >  1 file changed, 106 insertions(+)
> >
> > diff --git a/xen/arch/arm/numa_device_tree.c
> b/xen/arch/arm/numa_device_tree.c
> > index 7918a397fa..e7fa84df4c 100644
> > --- a/xen/arch/arm/numa_device_tree.c
> > +++ b/xen/arch/arm/numa_device_tree.c
> > @@ -136,3 +136,109 @@ static int __init fdt_parse_numa_memory_node(const
> void *fdt, int node,
> >
> >      return 0;
> >  }
> > +
> > +
> > +/* Parse NUMA distance map v1 */
> > +static int __init fdt_parse_numa_distance_map_v1(const void *fdt, int
> node)
> > +{
> > +    const struct fdt_property *prop;
> > +    const __be32 *matrix;
> > +    uint32_t entry_count;
> > +    int len, i;
> > +
> > +    printk(XENLOG_INFO "NUMA: parsing numa-distance-map\n");
> > +
> > +    prop = fdt_get_property(fdt, node, "distance-matrix", &len);
> > +    if ( !prop )
> > +    {
> > +        printk(XENLOG_WARNING
> > +               "NUMA: No distance-matrix property in distance-map\n");
> 
> I haven't seen where this is called from yet but make sure to print an
> error here only if NUMA info is actually expected and required, not on
> regular non-NUMA boots on non-NUMA machines.
> 

Xen can only reach this function when users have enabled the NUMA option
and numa_off is false (this check is in numa_init). So non-NUMA machines
will not reach here.

> 
> > +        return -EINVAL;
> > +    }
> > +
> > +    if ( len % sizeof(uint32_t) != 0 )
> > +    {
> > +        printk(XENLOG_WARNING
> > +               "distance-matrix in node is not a multiple of u32\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    entry_count = len / sizeof(uint32_t);
> > +    if ( entry_count == 0 )
> > +    {
> > +        printk(XENLOG_WARNING "NUMA: Invalid distance-matrix\n");
> > +
> > +        return -EINVAL;
> > +    }
> > +
> > +    matrix = (const __be32 *)prop->data;
> > +    for ( i = 0; i + 2 < entry_count; i += 3 )
> > +    {
> > +        uint32_t from, to, distance, opposite;
> > +
> > +        from = dt_next_cell(1, &matrix);
> > +        to = dt_next_cell(1, &matrix);
> > +        distance = dt_next_cell(1, &matrix);
> > +        if ( (from == to && distance != NUMA_LOCAL_DISTANCE) ||
> > +            (from != to && distance <= NUMA_LOCAL_DISTANCE) )
> > +        {
> > +            printk(XENLOG_WARNING
> > +                   "NUMA: Invalid distance: NODE#%u->NODE#%u:%u\n",
> > +                   from, to, distance);
> > +            return -EINVAL;
> > +        }
> > +
> > +        printk(XENLOG_INFO "NUMA: distance: NODE#%u->NODE#%u:%u\n",
> > +               from, to, distance);
> > +
> > +        /* Get opposite way distance */
> > +        opposite = __node_distance(from, to);
> 
> This is not checking for the opposite node distance but...
> 

Ah, yes, it's a mistake. It should be __node_distance(to, from);
> 
> > +        if ( opposite == 0 )
> > +        {
> > +            /* Bi-directions are not set, set both */
> > +            numa_set_distance(from, to, distance);
> > +            numa_set_distance(to, from, distance);
> 
> ...since you set both directions here at once then it is OK. You are
> checking if this direction has already been set which is correct I
> think. But the comment "Get opposite way distance" and the variable name
> "opposite" are wrong.
> 

My mistake above caused this misunderstanding:
I want to check whether the opposite-way distance has been set.
If it has not been set, I will set both ways here.

So I will change "opposite = __node_distance(from, to);" to
"opposite = __node_distance(to, from);" and keep the comment.
What do you think?

> 
> > +        }
> > +        else
> > +        {
> > +            /*
> > +             * Opposite way distance has been set to a different value.
> > +             * It may be a firmware device tree bug?
> > +             */
> > +            if ( opposite != distance )
> > +            {
> > +                /*
> > +                 * In device tree NUMA distance-matrix binding:
> > +                 *
> https://www.kernel.org/doc/Documentation/devicetree/bindings/numa.txt
> > +                 * There is a notes mentions:
> > +                 * "Each entry represents distance from first node to
> > +                 *  second node. The distances are equal in either
> > +                 *  direction."
> > +                 *
> > +                 * That means device tree doesn't permit this case.
> > +                 * But in ACPI spec, it cares to specifically permit
> this
> > +                 * case:
> > +                 * "Except for the relative distance from a System
> Locality
> > +                 *  to itself, each relative distance is stored twice
> in the
> > +                 *  matrix. This provides the capability to describe
> the
> > +                 *  scenario where the relative distances for the two
> > +                 *  directions between System Localities is different."
> > +                 *
> > +                 * That means a real machine allows such NUMA
> configuration.
> > +                 * So, place a WARNING here to notice system
> administrators,
> > +                 * is it the specail case that they hijack the device
> tree
> > +                 * to support their rare machines?
> > +                 */
> > +                printk(XENLOG_WARNING
> > +                       "Un-matched bi-direction! NODE#%u->NODE#%u:%u,
> NODE#%u->NODE#%u:%u\n",
> > +                       from, to, distance, to, from, opposite);
> 
> PRIu32

Yes.

> 
> 
> > +            }
> > +
> > +            /* Opposite way distance has been set, just set this way */
> > +            numa_set_distance(from, to, distance);
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > --
> > 2.25.1
> >


* RE: [PATCH 30/37] xen/arm: introduce a helper to parse device tree memory node
  2021-09-24  3:05   ` Stefano Stabellini
@ 2021-09-24  7:54     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24  7:54 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 2021年9月24日 11:05
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 30/37] xen/arm: introduce a helper to parse device
> tree memory node
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > Memory blocks' NUMA ID information is stored in device tree's
> > memory nodes as "numa-node-id". We need a new helper to parse
> > and verify this ID from memory nodes.
> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> 
> There are tabs for indentation in this patch, we use spaces.
> 

OK

> 
> > ---
> >  xen/arch/arm/numa_device_tree.c | 80 +++++++++++++++++++++++++++++++++
> >  1 file changed, 80 insertions(+)
> >
> > diff --git a/xen/arch/arm/numa_device_tree.c
> b/xen/arch/arm/numa_device_tree.c
> > index 2428fbae0b..7918a397fa 100644
> > --- a/xen/arch/arm/numa_device_tree.c
> > +++ b/xen/arch/arm/numa_device_tree.c
> > @@ -42,6 +42,35 @@ static int __init
> fdt_numa_processor_affinity_init(nodeid_t node)
> >      return 0;
> >  }
> >
> > +/* Callback for parsing of the memory regions affinity */
> > +static int __init fdt_numa_memory_affinity_init(nodeid_t node,
> > +                                paddr_t start, paddr_t size)
> 
> Please align the parameters
> 

OK

> 
> > +{
> > +    int ret;
> > +
> > +    if ( srat_disabled() )
> > +    {
> > +        return -EINVAL;
> > +    }
> > +
> > +	if ( !numa_memblks_available() )
> > +	{
> > +		dprintk(XENLOG_WARNING,
> > +                "Too many numa entry, try bigger NR_NODE_MEMBLKS \n");
> > +		bad_srat();
> > +		return -EINVAL;
> > +	}
> > +
> > +	ret = numa_update_node_memblks(node, start, size, false);
> > +	if ( ret != 0 )
> > +	{
> > +		bad_srat();
> > +	    return -EINVAL;
> > +	}
> > +
> > +    return 0;
> > +}
> 
> Aside from spaces/tabs, this is a lot better!
> 

ok

> 
> >  /* Parse CPU NUMA node info */
> >  static int __init fdt_parse_numa_cpu_node(const void *fdt, int node)
> >  {
> > @@ -56,3 +85,54 @@ static int __init fdt_parse_numa_cpu_node(const void
> *fdt, int node)
> >
> >      return fdt_numa_processor_affinity_init(nid);
> >  }
> > +
> > +/* Parse memory node NUMA info */
> > +static int __init fdt_parse_numa_memory_node(const void *fdt, int node,
> > +    const char *name, uint32_t addr_cells, uint32_t size_cells)
> 
> Please align the parameters
> 

ok

> 
> > +{
> > +    uint32_t nid;
> > +    int ret = 0, len;
> > +    paddr_t addr, size;
> > +    const struct fdt_property *prop;
> > +    uint32_t idx, ranges;
> > +    const __be32 *addresses;
> > +
> > +    nid = device_tree_get_u32(fdt, node, "numa-node-id", MAX_NUMNODES);
> > +    if ( nid >= MAX_NUMNODES )
> > +    {
> > +        printk(XENLOG_WARNING "Node id %u exceeds maximum value\n",
> nid);
> > +        return -EINVAL;
> > +    }
> > +
> > +    prop = fdt_get_property(fdt, node, "reg", &len);
> > +    if ( !prop )
> > +    {
> > +        printk(XENLOG_WARNING
> > +               "fdt: node `%s': missing `reg' property\n", name);
> > +        return -EINVAL;
> > +    }
> > +
> > +    addresses = (const __be32 *)prop->data;
> > +    ranges = len / (sizeof(__be32)* (addr_cells + size_cells));
> > +    for ( idx = 0; idx < ranges; idx++ )
> > +    {
> > +        device_tree_get_reg(&addresses, addr_cells, size_cells, &addr,
> &size);
> > +        /* Skip zero size ranges */
> > +        if ( !size )
> > +            continue;
> > +
> > +        ret = fdt_numa_memory_affinity_init(nid, addr, size);
> > +        if ( ret ) {
> > +            return -EINVAL;
> > +        }
> > +    }
> 
> I take it would be difficult to parse numa-node-id and call
> fdt_numa_memory_affinity_init from
> xen/arch/arm/bootfdt.c:device_tree_get_meminfo. Is it because
> device_tree_get_meminfo is called too early?
> 

When I was composing this patch, Penny's patch hadn't been merged yet.
I will look into it.

> 
> > +    if ( idx == 0 )
> > +    {
> > +        printk(XENLOG_ERR
> > +               "bad property in memory node, idx=%d ret=%d\n", idx,
> ret);
> > +        return -EINVAL;
> > +    }
> > +
> > +    return 0;
> > +}
> > --
> > 2.25.1
> >


* RE: [PATCH 32/37] xen/arm: unified entry to parse all NUMA data from device tree
  2021-09-24  3:16   ` Stefano Stabellini
@ 2021-09-24  7:58     ` Wei Chen
  2021-09-24 19:42       ` Stefano Stabellini
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-24  7:58 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 2021年9月24日 11:17
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 32/37] xen/arm: unified entry to parse all NUMA data
> from device tree
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > In this API, we scan whole device tree to parse CPU node id, memory
>           ^ function   ^ the whole
> 
> > node id and distance-map. Though early_scan_node will invoke has a
> > handler to process memory nodes. If we want to parse memory node id
> > in this handler, we have to embeded NUMA parse code in this handler.
>                               ^ embed
> 
> > But we still need to scan whole device tree to find CPU NUMA id and
> > distance-map. In this case, we include memory NUMA id parse in this
> > API too. Another benefit is that we have a unique entry for device
>   ^ function
> 
> > tree NUMA data parse.
> 
> Ah, that's the explanation I was asking for earlier!
> 

The question about device_tree_get_meminfo?

> 
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/arm/numa_device_tree.c | 30 ++++++++++++++++++++++++++++++
> >  xen/include/asm-arm/numa.h      |  1 +
> >  2 files changed, 31 insertions(+)
> >
> > diff --git a/xen/arch/arm/numa_device_tree.c
> b/xen/arch/arm/numa_device_tree.c
> > index e7fa84df4c..6a3fed0002 100644
> > --- a/xen/arch/arm/numa_device_tree.c
> > +++ b/xen/arch/arm/numa_device_tree.c
> > @@ -242,3 +242,33 @@ static int __init
> fdt_parse_numa_distance_map_v1(const void *fdt, int node)
> >
> >      return 0;
> >  }
> > +
> > +static int __init fdt_scan_numa_nodes(const void *fdt,
> > +                int node, const char *uname, int depth,
> > +                u32 address_cells, u32 size_cells, void *data)
> 
> Please align parameters
> 

OK

> 
> > +{
> > +    int len, ret = 0;
> > +    const void *prop;
> > +
> > +    prop = fdt_getprop(fdt, node, "device_type", &len);
> > +    if (prop)
> 
> code style
> 

OK

> 
> > +    {
> > +        len += 1;
> > +        if ( memcmp(prop, "cpu", len) == 0 )
> > +            ret = fdt_parse_numa_cpu_node(fdt, node);
> > +        else if ( memcmp(prop, "memory", len) == 0 )
> > +            ret = fdt_parse_numa_memory_node(fdt, node, uname,
> > +                                address_cells, size_cells);
> 
> I realize that with the inclusion of '\0' in the check, the usage of
> memcmp should be safe, but I would prefer if we used strncmp instead.
> 

Ok, I will use strncmp in next version.

> 
> > +    }
> > +    else if ( fdt_node_check_compatible(fdt, node,
> > +                                "numa-distance-map-v1") == 0 )
> > +        ret = fdt_parse_numa_distance_map_v1(fdt, node);
> > +
> > +    return ret;
> > +}
> > +
> > +/* Initialize NUMA from device tree */
> > +int __init numa_device_tree_init(const void *fdt)
> > +{
> > +    return device_tree_for_each_node(fdt, 0, fdt_scan_numa_nodes, NULL);
> > +}
> > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > index 7675012cb7..f46e8e2935 100644
> > --- a/xen/include/asm-arm/numa.h
> > +++ b/xen/include/asm-arm/numa.h
> > @@ -23,6 +23,7 @@ typedef u8 nodeid_t;
> >  #define NR_NODE_MEMBLKS NR_MEM_BANKS
> >
> >  extern void numa_set_distance(nodeid_t from, nodeid_t to, uint32_t
> distance);
> > +extern int numa_device_tree_init(const void *fdt);
> >
> >  #else
> >
> > --
> > 2.25.1
> >


* Re: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture
  2021-09-24  4:34     ` Wei Chen
@ 2021-09-24  7:58       ` Jan Beulich
  2021-09-24 10:31         ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2021-09-24  7:58 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, julien, Bertrand Marquis, Stefano Stabellini

On 24.09.2021 06:34, Wei Chen wrote:
>> From: Stefano Stabellini <sstabellini@kernel.org>
>> Sent: 2021年9月24日 9:15
>>
>> On Thu, 23 Sep 2021, Wei Chen wrote:
>>> --- a/xen/common/Kconfig
>>> +++ b/xen/common/Kconfig
>>> @@ -11,6 +11,16 @@ config COMPAT
>>>  config CORE_PARKING
>>>  	bool
>>>
>>> +config EFI
>>> +	bool
>>
>> Without the title the option is not user-selectable (or de-selectable).
>> So the help message below can never be seen.
>>
>> Either add a title, e.g.:
>>
>> bool "EFI support"
>>
>> Or fully make the option a silent option by removing the help text.
> 
OK. In the current Xen code, EFI is compiled unconditionally. Until
we change the related code, I prefer to remove the help text.

But that's not true: At least on x86 EFI gets compiled depending on
tool chain capabilities. Ultimately we may indeed want a user
selectable option here, but until then I'm afraid having this option
at all may be misleading on x86.

Jan
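The distinction Stefano draws above comes down to whether the `bool` line carries a prompt. A hedged Kconfig sketch of the two forms under discussion (illustrative only, not the committed code):

```kconfig
# Silent option: no prompt, so the user never sees it in menuconfig
# and any help text below it would be unreachable.
config EFI
	bool

# User-selectable option: the prompt string makes it (de)selectable,
# and the help text becomes visible.
config EFI
	bool "EFI support"
	help
	  Enable EFI boot and runtime services support.
```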




* Re: [PATCH 25/37] xen/arm: implement bad_srat for Arm NUMA initialization
  2021-09-24  2:09   ` Stefano Stabellini
  2021-09-24  4:45     ` Wei Chen
@ 2021-09-24  8:07     ` Jan Beulich
  2021-09-24 19:33       ` Stefano Stabellini
  1 sibling, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2021-09-24  8:07 UTC (permalink / raw)
  To: Stefano Stabellini, Wei Chen
  Cc: xen-devel, julien, Bertrand.Marquis, andrew.cooper3, roger.pau, wl

On 24.09.2021 04:09, Stefano Stabellini wrote:
> On Thu, 23 Sep 2021, Wei Chen wrote:
>> NUMA initialization will parse information from firmware provided
>> static resource affinity table (ACPI SRAT or DTB). bad_srat if a
>> function that will be used when initialization code encounters
>> some unexcepted errors.
>>
>> In this patch, we introduce Arm version bad_srat for NUMA common
>> initialization code to invoke it.
>>
>> Signed-off-by: Wei Chen <wei.chen@arm.com>
>> ---
>>  xen/arch/arm/numa.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
>> index 3755b01ef4..5209d3de4d 100644
>> --- a/xen/arch/arm/numa.c
>> +++ b/xen/arch/arm/numa.c
>> @@ -18,6 +18,7 @@
>>   *
>>   */
>>  #include <xen/init.h>
>> +#include <xen/nodemask.h>
>>  #include <xen/numa.h>
>>  
>>  static uint8_t __read_mostly
>> @@ -25,6 +26,12 @@ node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
>>      { 0 }
>>  };
>>  
>> +__init void bad_srat(void)
>> +{
>> +    printk(KERN_ERR "NUMA: Firmware SRAT table not used.\n");
>> +    fw_numa = -1;
>> +}
> 
> I realize that the series keeps the "srat" terminology everywhere on DT
> too. I wonder if it is worth replacing srat with something like
> "numa_distance" everywhere as appropriate. I am adding the x86
> maintainers for an opinion.
> 
> If you guys prefer to keep srat (if nothing else, it is concise), I am
> also OK with keeping srat although it is not technically accurate.

I think we want to tell apart both things: Where we truly talk about
the firmware's SRAT table, keeping that name is fine. But I suppose
there no "Firmware SRAT table" (as in the log message above) when
using DT? If so, at the very least in log messages SRAT shouldn't be
mentioned. Perhaps even functions serving both an ACPI and a DT
purpose would better not use "srat" in their names (but I'm not as
fussed about it there.)

Jan




* Re: [PATCH 02/37] xen: introduce a Kconfig option to configure NUMA nodes number
  2021-09-23 12:02 ` [PATCH 02/37] xen: introduce a Kconfig option to configure NUMA nodes number Wei Chen
  2021-09-23 23:45   ` Stefano Stabellini
@ 2021-09-24  8:55   ` Jan Beulich
  2021-09-24 10:33     ` Wei Chen
  1 sibling, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2021-09-24  8:55 UTC (permalink / raw)
  To: Wei Chen; +Cc: Bertrand.Marquis, xen-devel, sstabellini, julien

On 23.09.2021 14:02, Wei Chen wrote:
> Current NUMA nodes number is a hardcode configuration. This
> configuration is difficult for an administrator to change
> unless changing the code.
> 
> So in this patch, we introduce this new Kconfig option for
> administrators to change NUMA nodes number conveniently.
> Also considering that not all architectures support NUMA,
> this Kconfig option only can be visible on NUMA enabled
> architectures. Non-NUMA supported architectures can still
> use 1 as MAX_NUMNODES.

Do you really mean administrators here? To me command line options
are for administrators, but build decisions are usually taken by
build managers of distros.

> --- a/xen/arch/Kconfig
> +++ b/xen/arch/Kconfig
> @@ -17,3 +17,14 @@ config NR_CPUS
>  	  For CPU cores which support Simultaneous Multi-Threading or similar
>  	  technologies, this the number of logical threads which Xen will
>  	  support.
> +
> +config NR_NUMA_NODES
> +	int "Maximum number of NUMA nodes supported"
> +	range 1 4095

How was this upper bound established? Seeing 4095 is the limit of the
number of CPUs, do we really expect a CPU per node on such huge
systems? And did you check that whichever involved data types and
structures are actually suitable? I'm thinking e.g. of things like ...

> --- a/xen/include/asm-x86/numa.h
> +++ b/xen/include/asm-x86/numa.h
> @@ -3,8 +3,6 @@
>  
>  #include <xen/cpumask.h>
>  
> -#define NODES_SHIFT 6
> -
>  typedef u8 nodeid_t;

... this.

Jan




* Re: [PATCH 03/37] xen/x86: Initialize memnodemapsize while faking NUMA node
  2021-09-23 12:02 ` [PATCH 03/37] xen/x86: Initialize memnodemapsize while faking NUMA node Wei Chen
@ 2021-09-24  8:57   ` Jan Beulich
  2021-09-24 10:34     ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2021-09-24  8:57 UTC (permalink / raw)
  To: Wei Chen; +Cc: Bertrand.Marquis, xen-devel, sstabellini, julien

On 23.09.2021 14:02, Wei Chen wrote:
> When system turns NUMA off or system lacks of NUMA support,
> Xen will fake a NUMA node to make system works as a single
> node NUMA system.
> 
> In this case the memory node map doesn't need to be allocated
> from boot pages, it will use the _memnodemap directly. But
> memnodemapsize hasn't been set. Xen should assert in phys_to_nid.
> Because x86 was using an empty macro "VIRTUAL_BUG_ON" to replace
> SSERT, this bug will not be triggered on x86.

Somehow an A got lost here, which I'll add back while committing.

> Actually, Xen will only use 1 slot of memnodemap in this case.
> So we set memnodemap[0] to 0 and memnodemapsize to 1 in this
> patch to fix it.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>

Acked-by: Jan Beulich <jbeulich@suse.com>




* RE: [PATCH 34/37] xen/arm: enable device tree based NUMA in system init
  2021-09-24  3:28   ` Stefano Stabellini
@ 2021-09-24  9:52     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24  9:52 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 2021年9月24日 11:28
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 34/37] xen/arm: enable device tree based NUMA in
> system init
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > In this patch, we can start to create NUMA system that is
> > based on device tree.
> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/arm/numa.c        | 55 ++++++++++++++++++++++++++++++++++++++
> >  xen/arch/arm/setup.c       |  7 +++++
> >  xen/include/asm-arm/numa.h |  6 +++++
> >  3 files changed, 68 insertions(+)
> >
> > diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> > index 7f05299b76..d7a3d32d4b 100644
> > --- a/xen/arch/arm/numa.c
> > +++ b/xen/arch/arm/numa.c
> > @@ -18,8 +18,10 @@
> >   *
> >   */
> >  #include <xen/init.h>
> > +#include <xen/device_tree.h>
> >  #include <xen/nodemask.h>
> >  #include <xen/numa.h>
> > +#include <xen/pfn.h>
> >
> >  static uint8_t __read_mostly
> >  node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
> > @@ -85,6 +87,59 @@ uint8_t __node_distance(nodeid_t from, nodeid_t to)
> >  }
> >  EXPORT_SYMBOL(__node_distance);
> >
> > +void __init numa_init(bool acpi_off)
> > +{
> > +    uint32_t idx;
> > +    paddr_t ram_start = ~0;
> 
> INVALID_PADDR
> 

Oh, yes

> 
> > +    paddr_t ram_size = 0;
> > +    paddr_t ram_end = 0;
> > +
> > +    /* NUMA has been turned off through Xen parameters */
> > +    if ( numa_off )
> > +        goto mem_init;
> > +
> > +    /* Initialize NUMA from device tree when system is not ACPI booted
> */
> > +    if ( acpi_off )
> > +    {
> > +        int ret = numa_device_tree_init(device_tree_flattened);
> > +        if ( ret )
> > +        {
> > +            printk(XENLOG_WARNING
> > +                   "Init NUMA from device tree failed, ret=%d\n", ret);
> 
> As I mentioned in other patches we need to distinguish between two
> cases:
> 
> 1) NUMA initialization failed because no NUMA information has been found
> 2) NUMA initialization failed because wrong/inconsistent NUMA info has
>    been found
> 
> In case of 1), we print nothing. Maybe a single XENLOG_DEBUG message.
> In case of 2), all the warnings are good to print.
> 
> 
> In this case, if ret != 0 because of 2), then it is fine to print this
> warning. But it looks like could be that ret is -EINVAL simply because a
> CPU node doesn't have numa-node-id, which is a normal condition for
> non-NUMA machines.
> 

Yes, we do need to distinguish these two cases. I will try to address
it in the next version.

> 
> > +            numa_off = true;
> > +        }
> > +    }
> > +    else
> > +    {
> > +        /* We don't support NUMA for ACPI boot currently */
> > +        printk(XENLOG_WARNING
> > +               "ACPI NUMA has not been supported yet, NUMA off!\n");
> > +        numa_off = true;
> > +    }
> > +
> > +mem_init:
> > +    /*
> > +     * Find the minimal and maximum address of RAM, NUMA will
> > +     * build a memory to node mapping table for the whole range.
> > +     */
> > +    ram_start = bootinfo.mem.bank[0].start;
> > +    ram_size  = bootinfo.mem.bank[0].size;
> > +    ram_end   = ram_start + ram_size;
> > +    for ( idx = 1 ; idx < bootinfo.mem.nr_banks; idx++ )
> > +    {
> > +        paddr_t bank_start = bootinfo.mem.bank[idx].start;
> > +        paddr_t bank_size = bootinfo.mem.bank[idx].size;
> > +        paddr_t bank_end = bank_start + bank_size;
> > +
> > +        ram_size  = ram_size + bank_size;
> > +        ram_start = min(ram_start, bank_start);
> > +        ram_end   = max(ram_end, bank_end);
> > +    }
> > +
> > +    numa_initmem_init(PFN_UP(ram_start), PFN_DOWN(ram_end));
> > +    return;
> 
> No need for return
> 

Ok, I will remove it.

> 
> > +}
> > +
> >  uint32_t __init arch_meminfo_get_nr_bank(void)
> >  {
> >  	return bootinfo.mem.nr_banks;
> > diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
> > index 1f0fbc95b5..6097850682 100644
> > --- a/xen/arch/arm/setup.c
> > +++ b/xen/arch/arm/setup.c
> > @@ -905,6 +905,13 @@ void __init start_xen(unsigned long
> boot_phys_offset,
> >      /* Parse the ACPI tables for possible boot-time configuration */
> >      acpi_boot_table_init();
> >
> > +    /*
> > +     * Try to initialize NUMA system, if failed, the system will
> > +     * fallback to uniform system which means system has only 1
> > +     * NUMA node.
> > +     */
> > +    numa_init(acpi_disabled);
> > +
> >      end_boot_allocator();
> >
> >      /*
> > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > index f46e8e2935..5b03dde87f 100644
> > --- a/xen/include/asm-arm/numa.h
> > +++ b/xen/include/asm-arm/numa.h
> > @@ -24,6 +24,7 @@ typedef u8 nodeid_t;
> >
> >  extern void numa_set_distance(nodeid_t from, nodeid_t to, uint32_t
> distance);
> >  extern int numa_device_tree_init(const void *fdt);
> > +extern void numa_init(bool acpi_off);
> >
> >  #else
> >
> > @@ -47,6 +48,11 @@ extern mfn_t first_valid_mfn;
> >  #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
> >  #define __node_distance(a, b) (20)
> >
> > +static inline void numa_init(bool acpi_off)
> > +{
> > +
> > +}
> > +
> >  static inline void numa_add_cpu(int cpu)
> >  {
> >
> > --
> > 2.25.1
> >


* Re: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA
  2021-09-24  3:31   ` Stefano Stabellini
@ 2021-09-24 10:13     ` Wei Chen
  2021-09-24 19:39       ` Stefano Stabellini
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-24 10:13 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand.Marquis

Hi Stefano,

On 2021/9/24 11:31, Stefano Stabellini wrote:
> On Thu, 23 Sep 2021, Wei Chen wrote:
>> Arm platforms support both ACPI and device tree. We don't
>> want users to select device tree NUMA or ACPI NUMA manually.
>> We hope usrs can just enable NUMA for Arm, and device tree
>            ^ users
> 
>> NUMA and ACPI NUMA can be selected depends on device tree
>> feature and ACPI feature status automatically. In this case,
>> these two kinds of NUMA support code can be co-exist in one
>> Xen binary. Xen can check feature flags to decide using
>> device tree or ACPI as NUMA based firmware.
>>
>> So in this patch, we introduce a generic option:
>> CONFIG_ARM_NUMA for user to enable NUMA for Arm.
>                        ^ users
>

OK

>> And one CONFIG_DEVICE_TREE_NUMA option for ARM_NUMA
>> to select when HAS_DEVICE_TREE option is enabled.
>> Once when ACPI NUMA for Arm is supported, ACPI_NUMA
>> can be selected here too.
>>
>> Signed-off-by: Wei Chen <wei.chen@arm.com>
>> ---
>>   xen/arch/arm/Kconfig | 11 +++++++++++
>>   1 file changed, 11 insertions(+)
>>
>> diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig
>> index 865ad83a89..ded94ebd37 100644
>> --- a/xen/arch/arm/Kconfig
>> +++ b/xen/arch/arm/Kconfig
>> @@ -34,6 +34,17 @@ config ACPI
>>   	  Advanced Configuration and Power Interface (ACPI) support for Xen is
>>   	  an alternative to device tree on ARM64.
>>   
>> + config DEVICE_TREE_NUMA
>> +	def_bool n
>> +	select NUMA
>> +
>> +config ARM_NUMA
>> +	bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)" if UNSUPPORTED
>> +	select DEVICE_TREE_NUMA if HAS_DEVICE_TREE
> 
> Should it be: depends on HAS_DEVICE_TREE ?
> (And eventually depends on HAS_DEVICE_TREE || ACPI)
> 

As discussed in the RFC [1], we want to make ARM_NUMA a generic
option that users can select, and have it select DEVICE_TREE_NUMA
or ACPI_NUMA depending on HAS_DEVICE_TREE or ACPI.

If we add HAS_DEVICE_TREE || ACPI as dependencies for ARM_NUMA,
wouldn't that become a circular dependency?

[1] https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00888.html
> 
>> +	---help---
>> +
>> +	  Enable Non-Uniform Memory Access (NUMA) for Arm architecutres
>                                                        ^ architectures
> 
> 
>> +
>>   config GICV3
>>   	bool "GICv3 driver"
>>   	depends on ARM_64 && !NEW_VGIC
>> -- 
>> 2.25.1
>>



* RE: [PATCH 06/37] xen/arm: use !CONFIG_NUMA to keep fake NUMA API
  2021-09-24  0:05   ` Stefano Stabellini
@ 2021-09-24 10:21     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24 10:21 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis


> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 2021年9月24日 8:05
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 06/37] xen/arm: use !CONFIG_NUMA to keep fake NUMA API
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > We have introduced CONFIG_NUMA in previous patch. And this
>                                    ^ a
> 
> > option is enabled only on x86 in current stage. In a follow
>                                 ^ at the
> 
> > up patch, we will enable this option for Arm. But we still
> > want users can disable the CONFIG_NUMA through Kconfig. In
>              ^ to be able to disable CONFIG_NUMA via Kconfig.
> 
> 
> > this case, keep current fake NUMA API, will make Arm code
>                  ^ the
> 
> > still can work with NUMA aware memory allocation and scheduler.
>         ^ able to work
> 
> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> 
> With the small grammar fixes:
> 
> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
> 
> 

Thanks, I will fix them in next version.

> > ---
> >  xen/include/asm-arm/numa.h | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > index 9d5739542d..8f1c67e3eb 100644
> > --- a/xen/include/asm-arm/numa.h
> > +++ b/xen/include/asm-arm/numa.h
> > @@ -5,6 +5,8 @@
> >
> >  typedef u8 nodeid_t;
> >
> > +#ifndef CONFIG_NUMA
> > +
> >  /* Fake one node for now. See also node_online_map. */
> >  #define cpu_to_node(cpu) 0
> >  #define node_to_cpumask(node)   (cpu_online_map)
> > @@ -25,6 +27,8 @@ extern mfn_t first_valid_mfn;
> >  #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
> >  #define __node_distance(a, b) (20)
> >
> > +#endif
> > +
> >  static inline unsigned int arch_have_default_dmazone(void)
> >  {
> >      return 0;
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 33/37] xen/arm: keep guest still be NUMA unware
  2021-09-24  3:19   ` Stefano Stabellini
@ 2021-09-24 10:23     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24 10:23 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis


> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 2021年9月24日 11:19
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 33/37] xen/arm: keep guest still be NUMA unware
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > The NUMA information provided in the host Device-Tree
> > are only for Xen. For dom0, we want to hide them as they
> > may be different (for now, dom0 is still not aware of NUMA)
> > The CPU and memory nodes are recreated from scratch for the
> > domain. So we already skip the "numa-node-id" property for
> > these two types of nodes.
> >
> > However, some devices like PCIe may have "numa-node-id"
> > property too. We have to skip them as well.
> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/arch/arm/domain_build.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
> > index d233d634c1..6e94922238 100644
> > --- a/xen/arch/arm/domain_build.c
> > +++ b/xen/arch/arm/domain_build.c
> > @@ -737,6 +737,10 @@ static int __init write_properties(struct domain *d,
> struct kernel_info *kinfo,
> >                  continue;
> >          }
> >
> > +        /* Guest is numa unaware in current stage */
> 
> I would say: "Dom0 is currently NUMA unaware"
> 
> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
> 

I will update the code comment in the next version.
Thanks!

> 
> > +        if ( dt_property_name_is_equal(prop, "numa-node-id") )
> > +            continue;
> > +
> >          res = fdt_property(kinfo->fdt, prop->name, prop_data, prop_len);
> >
> >          if ( res )
> > @@ -1607,6 +1611,8 @@ static int __init handle_node(struct domain *d,
> struct kernel_info *kinfo,
> >          DT_MATCH_TYPE("memory"),
> >          /* The memory mapped timer is not supported by Xen. */
> >          DT_MATCH_COMPATIBLE("arm,armv7-timer-mem"),
> > +        /* Numa info doesn't need to be exposed to Domain-0 */
> > +        DT_MATCH_COMPATIBLE("numa-distance-map-v1"),
> >          { /* sentinel */ },
> >      };
> >      static const struct dt_device_match timer_matches[] __initconst =
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA
  2021-09-23 12:02 ` [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA Wei Chen
  2021-09-24  3:31   ` Stefano Stabellini
@ 2021-09-24 10:25   ` Jan Beulich
  2021-09-24 10:37     ` Wei Chen
  1 sibling, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2021-09-24 10:25 UTC (permalink / raw)
  To: Wei Chen; +Cc: Bertrand.Marquis, xen-devel, sstabellini, julien

On 23.09.2021 14:02, Wei Chen wrote:
> --- a/xen/arch/arm/Kconfig
> +++ b/xen/arch/arm/Kconfig
> @@ -34,6 +34,17 @@ config ACPI
>  	  Advanced Configuration and Power Interface (ACPI) support for Xen is
>  	  an alternative to device tree on ARM64.
>  
> + config DEVICE_TREE_NUMA
> +	def_bool n
> +	select NUMA

Two nits here: There's a stray blank on the first line, and you
appear to mean just "bool", not "def_bool n" (there's no point
in having defaults for select-only options).

> +config ARM_NUMA
> +	bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)" if UNSUPPORTED
> +	select DEVICE_TREE_NUMA if HAS_DEVICE_TREE
> +	---help---

And another nit here: We try to move away from "---help---", which
is no longer supported by Linux'es newer kconfig. Please use just
"help" in new code.

Jan



^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture
  2021-09-24  7:58       ` Jan Beulich
@ 2021-09-24 10:31         ` Wei Chen
  2021-09-24 10:49           ` Jan Beulich
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-24 10:31 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, julien, Bertrand Marquis, Stefano Stabellini

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2021年9月24日 15:59
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> <Bertrand.Marquis@arm.com>; Stefano Stabellini <sstabellini@kernel.org>
> Subject: Re: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-
> EFI architecture
> 
> On 24.09.2021 06:34, Wei Chen wrote:
> >> From: Stefano Stabellini <sstabellini@kernel.org>
> >> Sent: 2021年9月24日 9:15
> >>
> >> On Thu, 23 Sep 2021, Wei Chen wrote:
> >>> --- a/xen/common/Kconfig
> >>> +++ b/xen/common/Kconfig
> >>> @@ -11,6 +11,16 @@ config COMPAT
> >>>  config CORE_PARKING
> >>>  	bool
> >>>
> >>> +config EFI
> >>> +	bool
> >>
> >> Without the title the option is not user-selectable (or de-selectable).
> >> So the help message below can never be seen.
> >>
> >> Either add a title, e.g.:
> >>
> >> bool "EFI support"
> >>
> >> Or fully make the option a silent option by removing the help text.
> >
> > OK, in current Xen code, EFI is unconditionally compiled. Before
> > we change related code, I prefer to remove the help text.
> 
> But that's not true: At least on x86 EFI gets compiled depending on
> tool chain capabilities. Ultimately we may indeed want a user
> selectable option here, but until then I'm afraid having this option
> at all may be misleading on x86.
> 

I checked the build scripts, and yes, you're right: for x86, EFI is not a
selectable option in Kconfig. I agree that we can't use the Kconfig
system to decide whether to enable the EFI build for x86.

So how about we use this EFI option for Arm only? On Arm, we do not
have such a toolchain dependency.

> Jan


^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 02/37] xen: introduce a Kconfig option to configure NUMA nodes number
  2021-09-24  8:55   ` Jan Beulich
@ 2021-09-24 10:33     ` Wei Chen
  2021-09-24 10:47       ` Jan Beulich
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-24 10:33 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2021年9月24日 16:56
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Bertrand Marquis <Bertrand.Marquis@arm.com>; xen-
> devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org
> Subject: Re: [PATCH 02/37] xen: introduce a Kconfig option to configure
> NUMA nodes number
> 
> On 23.09.2021 14:02, Wei Chen wrote:
> > Current NUMA nodes number is a hardcode configuration. This
> > configuration is difficult for an administrator to change
> > unless changing the code.
> >
> > So in this patch, we introduce this new Kconfig option for
> > administrators to change NUMA nodes number conveniently.
> > Also considering that not all architectures support NUMA,
> > this Kconfig option only can be visible on NUMA enabled
> > architectures. Non-NUMA supported architectures can still
> > use 1 as MAX_NUMNODES.
> 
> Do you really mean administrators here? To me command line options
> are for administrators, but build decisions are usually taken by
> build managers of distros.
> 
> > --- a/xen/arch/Kconfig
> > +++ b/xen/arch/Kconfig
> > @@ -17,3 +17,14 @@ config NR_CPUS
> >  	  For CPU cores which support Simultaneous Multi-Threading or
> similar
> >  	  technologies, this the number of logical threads which Xen will
> >  	  support.
> > +
> > +config NR_NUMA_NODES
> > +	int "Maximum number of NUMA nodes supported"
> > +	range 1 4095
> 
> How was this upper bound established? Seeing 4095 is the limit of the
> number of CPUs, do we really expect a CPU per node on such huge
> systems? And did you check that whichever involved data types and
> structures are actually suitable? I'm thinking e.g. of things like ...
> 
> > --- a/xen/include/asm-x86/numa.h
> > +++ b/xen/include/asm-x86/numa.h
> > @@ -3,8 +3,6 @@
> >
> >  #include <xen/cpumask.h>
> >
> > -#define NODES_SHIFT 6
> > -
> >  typedef u8 nodeid_t;
> 
> ... this.
> 

You're right, we use u8 as nodeid_t, so 4095 as the upper bound for the node
number in this option is not reasonable. Maybe a 255 upper bound would be good?

> Jan


^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 03/37] xen/x86: Initialize memnodemapsize while faking NUMA node
  2021-09-24  8:57   ` Jan Beulich
@ 2021-09-24 10:34     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24 10:34 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien



> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2021年9月24日 16:57
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Bertrand Marquis <Bertrand.Marquis@arm.com>; xen-
> devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org
> Subject: Re: [PATCH 03/37] xen/x86: Initialize memnodemapsize while faking
> NUMA node
> 
> On 23.09.2021 14:02, Wei Chen wrote:
> > When the system turns NUMA off or lacks NUMA support,
> > Xen will fake a NUMA node to make the system work as a single
> > node NUMA system.
> >
> > In this case the memory node map doesn't need to be allocated
> > from boot pages, it will use the _memnodemap directly. But
> > memnodemapsize hasn't been set. Xen should assert in phys_to_nid.
> > Because x86 was using an empty macro "VIRTUAL_BUG_ON" to replace
> > SSERT, this bug will not be triggered on x86.
> 
> Somehow an A got lost here, which I'll add back while committing.
> 

Thanks!

> > Actually, Xen will only use 1 slot of memnodemap in this case.
> > So we set memnodemap[0] to 0 and memnodemapsize to 1 in this
> > patch to fix it.
> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> 
> Acked-by: Jan Beulich <jbeulich@suse.com>


^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA
  2021-09-24 10:25   ` Jan Beulich
@ 2021-09-24 10:37     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-24 10:37 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien


> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2021年9月24日 18:26
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Bertrand Marquis <Bertrand.Marquis@arm.com>; xen-
> devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org
> Subject: Re: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to
> enable NUMA
> 
> On 23.09.2021 14:02, Wei Chen wrote:
> > --- a/xen/arch/arm/Kconfig
> > +++ b/xen/arch/arm/Kconfig
> > @@ -34,6 +34,17 @@ config ACPI
> >  	  Advanced Configuration and Power Interface (ACPI) support for Xen
> is
> >  	  an alternative to device tree on ARM64.
> >
> > + config DEVICE_TREE_NUMA
> > +	def_bool n
> > +	select NUMA
> 
> Two nits here: There's a stray blank on the first line, and you
> appear to mean just "bool", not "def_bool n" (there's no point
> in having defaults for select-only options).
> 

Ok

> > +config ARM_NUMA
> > +	bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)" if
> UNSUPPORTED
> > +	select DEVICE_TREE_NUMA if HAS_DEVICE_TREE
> > +	---help---
> 
> And another nit here: We try to move away from "---help---", which
> is no longer supported by Linux'es newer kconfig. Please use just
> "help" in new code.
> 

Thanks, I will do it.

> Jan


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 02/37] xen: introduce a Kconfig option to configure NUMA nodes number
  2021-09-24 10:33     ` Wei Chen
@ 2021-09-24 10:47       ` Jan Beulich
  0 siblings, 0 replies; 192+ messages in thread
From: Jan Beulich @ 2021-09-24 10:47 UTC (permalink / raw)
  To: Wei Chen; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien

On 24.09.2021 12:33, Wei Chen wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 2021年9月24日 16:56
>>
>> On 23.09.2021 14:02, Wei Chen wrote:
>>> --- a/xen/arch/Kconfig
>>> +++ b/xen/arch/Kconfig
>>> @@ -17,3 +17,14 @@ config NR_CPUS
>>>  	  For CPU cores which support Simultaneous Multi-Threading or
>> similar
>>>  	  technologies, this the number of logical threads which Xen will
>>>  	  support.
>>> +
>>> +config NR_NUMA_NODES
>>> +	int "Maximum number of NUMA nodes supported"
>>> +	range 1 4095
>>
>> How was this upper bound established? Seeing 4095 is the limit of the
>> number of CPUs, do we really expect a CPU per node on such huge
>> systems? And did you check that whichever involved data types and
>> structures are actually suitable? I'm thinking e.g. of things like ...
>>
>>> --- a/xen/include/asm-x86/numa.h
>>> +++ b/xen/include/asm-x86/numa.h
>>> @@ -3,8 +3,6 @@
>>>
>>>  #include <xen/cpumask.h>
>>>
>>> -#define NODES_SHIFT 6
>>> -
>>>  typedef u8 nodeid_t;
>>
>> ... this.
>>
> 
> You're right, we use u8 as nodeid_t, so 4095 as the upper bound for the node
> number in this option is not reasonable. Maybe a 255 upper bound would be good?

I think it is, yes, but you will want to properly check.
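
A bound that fits the u8 nodeid_t would then read along these lines (an illustrative sketch; the default value shown is hypothetical, not taken from the series):

```kconfig
config NR_NUMA_NODES
	int "Maximum number of NUMA nodes supported"
	range 1 255
	default 64
```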

Jan



^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture
  2021-09-24 10:31         ` Wei Chen
@ 2021-09-24 10:49           ` Jan Beulich
  2021-09-26 10:25             ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2021-09-24 10:49 UTC (permalink / raw)
  To: Wei Chen; +Cc: xen-devel, julien, Bertrand Marquis, Stefano Stabellini

On 24.09.2021 12:31, Wei Chen wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 2021年9月24日 15:59
>>
>> On 24.09.2021 06:34, Wei Chen wrote:
>>>> From: Stefano Stabellini <sstabellini@kernel.org>
>>>> Sent: 2021年9月24日 9:15
>>>>
>>>> On Thu, 23 Sep 2021, Wei Chen wrote:
>>>>> --- a/xen/common/Kconfig
>>>>> +++ b/xen/common/Kconfig
>>>>> @@ -11,6 +11,16 @@ config COMPAT
>>>>>  config CORE_PARKING
>>>>>  	bool
>>>>>
>>>>> +config EFI
>>>>> +	bool
>>>>
>>>> Without the title the option is not user-selectable (or de-selectable).
>>>> So the help message below can never be seen.
>>>>
>>>> Either add a title, e.g.:
>>>>
>>>> bool "EFI support"
>>>>
>>>> Or fully make the option a silent option by removing the help text.
>>>
>>> OK, in current Xen code, EFI is unconditionally compiled. Before
>>> we change related code, I prefer to remove the help text.
>>
>> But that's not true: At least on x86 EFI gets compiled depending on
>> tool chain capabilities. Ultimately we may indeed want a user
>> selectable option here, but until then I'm afraid having this option
>> at all may be misleading on x86.
>>
> 
> I checked the build scripts, and yes, you're right: for x86, EFI is not a
> selectable option in Kconfig. I agree that we can't use the Kconfig
> system to decide whether to enable the EFI build for x86.
> 
> So how about we use this EFI option for Arm only? On Arm, we do not
> have such a toolchain dependency.

To be honest - don't know. That's because I don't know what you want
to use the option for subsequently.

Jan



^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 25/37] xen/arm: implement bad_srat for Arm NUMA initialization
  2021-09-24  8:07     ` Jan Beulich
@ 2021-09-24 19:33       ` Stefano Stabellini
  0 siblings, 0 replies; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24 19:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Chen, xen-devel, julien,
	Bertrand.Marquis, andrew.cooper3, roger.pau, wl

On Fri, 24 Sep 2021, Jan Beulich wrote:
> On 24.09.2021 04:09, Stefano Stabellini wrote:
> > On Thu, 23 Sep 2021, Wei Chen wrote:
> >> NUMA initialization will parse information from a firmware-provided
> >> static resource affinity table (ACPI SRAT or DTB). bad_srat is a
> >> function that will be used when the initialization code encounters
> >> some unexpected errors.
> >>
> >> In this patch, we introduce the Arm version of bad_srat for the common
> >> NUMA initialization code to invoke.
> >>
> >> Signed-off-by: Wei Chen <wei.chen@arm.com>
> >> ---
> >>  xen/arch/arm/numa.c | 7 +++++++
> >>  1 file changed, 7 insertions(+)
> >>
> >> diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> >> index 3755b01ef4..5209d3de4d 100644
> >> --- a/xen/arch/arm/numa.c
> >> +++ b/xen/arch/arm/numa.c
> >> @@ -18,6 +18,7 @@
> >>   *
> >>   */
> >>  #include <xen/init.h>
> >> +#include <xen/nodemask.h>
> >>  #include <xen/numa.h>
> >>  
> >>  static uint8_t __read_mostly
> >> @@ -25,6 +26,12 @@ node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
> >>      { 0 }
> >>  };
> >>  
> >> +__init void bad_srat(void)
> >> +{
> >> +    printk(KERN_ERR "NUMA: Firmware SRAT table not used.\n");
> >> +    fw_numa = -1;
> >> +}
> > 
> > I realize that the series keeps the "srat" terminology everywhere on DT
> > too. I wonder if it is worth replacing srat with something like
> > "numa_distance" everywhere as appropriate. I am adding the x86
> > maintainers for an opinion.
> > 
> > If you guys prefer to keep srat (if nothing else, it is concise), I am
> > also OK with keeping srat although it is not technically accurate.
> 
> I think we want to tell apart both things: Where we truly talk about
> the firmware's SRAT table, keeping that name is fine. But I suppose
> there no "Firmware SRAT table" (as in the log message above) when
> using DT?

No. FYI this is the DT binding:
https://github.com/torvalds/linux/blob/master/Documentation/devicetree/bindings/numa.txt

The interesting bit is the "distance-map"
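
For reference, a minimal fragment following that binding might look like this (values purely illustrative):

```dts
/ {
	cpus {
		cpu@0 {
			device_type = "cpu";
			numa-node-id = <0>;
		};
	};

	memory@0 {
		device_type = "memory";
		reg = <0x0 0x0 0x0 0x40000000>;
		numa-node-id = <0>;
	};

	distance-map {
		compatible = "numa-distance-map-v1";
		distance-matrix = <0 0 10>,
				  <0 1 20>,
				  <1 0 20>,
				  <1 1 10>;
	};
};
```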


> If so, at the very least in log messages SRAT shouldn't be
> mentioned. Perhaps even functions serving both an ACPI and a DT
> purpose would better not use "srat" in their names (but I'm not as
> fussed about it there.)

I agree 100% with what you wrote.


^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 23/37] xen/arm: implement node distance helpers for Arm
  2021-09-24  4:41     ` Wei Chen
@ 2021-09-24 19:36       ` Stefano Stabellini
  2021-09-26 10:15         ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24 19:36 UTC (permalink / raw)
  To: Wei Chen; +Cc: Stefano Stabellini, xen-devel, julien, Bertrand Marquis


On Fri, 24 Sep 2021, Wei Chen wrote:
> > -----Original Message-----
> > From: Stefano Stabellini <sstabellini@kernel.org>
> > Sent: 2021年9月24日 9:47
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> > Bertrand Marquis <Bertrand.Marquis@arm.com>
> > Subject: Re: [PATCH 23/37] xen/arm: implement node distance helpers for
> > Arm
> > 
> > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > We will parse NUMA node distances from the device tree or ACPI
> > > table, so we need a matrix to record the distances between
> > > any two parsed nodes. Accordingly, in this patch we provide the
> > > numa_set_distance API for device tree or ACPI table parsers
> > > to set the distance between any two nodes.
> > > When NUMA initialization fails, __node_distance will return
> > > NUMA_REMOTE_DISTANCE; this helps us avoid rolling back the
> > > distance matrix when NUMA initialization fails.
> > >
> > > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > > ---
> > >  xen/arch/arm/Makefile      |  1 +
> > >  xen/arch/arm/numa.c        | 69 ++++++++++++++++++++++++++++++++++++++
> > >  xen/include/asm-arm/numa.h | 13 +++++++
> > >  3 files changed, 83 insertions(+)
> > >  create mode 100644 xen/arch/arm/numa.c
> > >
> > > diff --git a/xen/arch/arm/Makefile b/xen/arch/arm/Makefile
> > > index ae4efbf76e..41ca311b6b 100644
> > > --- a/xen/arch/arm/Makefile
> > > +++ b/xen/arch/arm/Makefile
> > > @@ -35,6 +35,7 @@ obj-$(CONFIG_LIVEPATCH) += livepatch.o
> > >  obj-y += mem_access.o
> > >  obj-y += mm.o
> > >  obj-y += monitor.o
> > > +obj-$(CONFIG_NUMA) += numa.o
> > >  obj-y += p2m.o
> > >  obj-y += percpu.o
> > >  obj-y += platform.o
> > > diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> > > new file mode 100644
> > > index 0000000000..3f08870d69
> > > --- /dev/null
> > > +++ b/xen/arch/arm/numa.c
> > > @@ -0,0 +1,69 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * Arm Architecture support layer for NUMA.
> > > + *
> > > + * Copyright (C) 2021 Arm Ltd
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License version 2 as
> > > + * published by the Free Software Foundation.
> > > + *
> > > + * This program is distributed in the hope that it will be useful,
> > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > + * GNU General Public License for more details.
> > > + *
> > > + * You should have received a copy of the GNU General Public License
> > > + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> > > + *
> > > + */
> > > +#include <xen/init.h>
> > > +#include <xen/numa.h>
> > > +
> > > +static uint8_t __read_mostly
> > > +node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
> > > +    { 0 }
> > > +};
> > > +
> > > +void __init numa_set_distance(nodeid_t from, nodeid_t to, uint32_t
> > distance)
> > > +{
> > > +    if ( from >= MAX_NUMNODES || to >= MAX_NUMNODES )
> > > +    {
> > > +        printk(KERN_WARNING
> > > +               "NUMA: invalid nodes: from=%"PRIu8" to=%"PRIu8"
> > MAX=%"PRIu8"\n",
> > > +               from, to, MAX_NUMNODES);
> > > +        return;
> > > +    }
> > > +
> > > +    /* NUMA defines 0xff as an unreachable node and 0-9 are undefined
> > */
> > > +    if ( distance >= NUMA_NO_DISTANCE ||
> > > +        (distance >= NUMA_DISTANCE_UDF_MIN &&
> > > +         distance <= NUMA_DISTANCE_UDF_MAX) ||
> > > +        (from == to && distance != NUMA_LOCAL_DISTANCE) )
> > > +    {
> > > +        printk(KERN_WARNING
> > > +               "NUMA: invalid distance: from=%"PRIu8" to=%"PRIu8"
> > distance=%"PRIu32"\n",
> > > +               from, to, distance);
> > > +        return;
> > > +    }
> > > +
> > > +    node_distance_map[from][to] = distance;
> > > +}
> > > +
> > > +uint8_t __node_distance(nodeid_t from, nodeid_t to)
> > > +{
> > > +    /* When NUMA is off, any distance will be treated as remote. */
> > > +    if ( srat_disabled() )
> > 
> > Given that this is ARM specific code and specific to ACPI, I don't think
> > we should have any call to something called "srat_disabled".
> > 
> > I suggest to either rename srat_disabled to numa_distance_disabled.
> > 
> > Other than that, this patch looks OK to me.
> > 
> 
> SRAT stands for static resource affinity table, and I think a DTB can also
> be treated as a static resource affinity table, so I kept SRAT in this patch
> and the others. I have seen your comment on patch #25. Until the x86
> maintainers give any feedback, can we still keep srat here?

Jan and I replied in the other thread. I think that in warning messages
"SRAT" should not be mentioned when booting from DT. Ideally functions
names and variables should be renamed too when shared between ACPI and
DT but it is less critical, and it is fine if you don't do that in the
next version.

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA
  2021-09-24 10:13     ` Wei Chen
@ 2021-09-24 19:39       ` Stefano Stabellini
  2021-09-27  8:33         ` Jan Beulich
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24 19:39 UTC (permalink / raw)
  To: Wei Chen; +Cc: Stefano Stabellini, xen-devel, julien, Bertrand.Marquis

On Fri, 24 Sep 2021, Wei Chen wrote:
> Hi Stefano,
> 
> On 2021/9/24 11:31, Stefano Stabellini wrote:
> > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > Arm platforms support both ACPI and device tree. We don't
> > > want users to select device tree NUMA or ACPI NUMA manually.
> > > We hope usrs can just enable NUMA for Arm, and device tree
> >            ^ users
> > 
> > > NUMA and ACPI NUMA can be selected depends on device tree
> > > feature and ACPI feature status automatically. In this case,
> > > these two kinds of NUMA support code can be co-exist in one
> > > Xen binary. Xen can check feature flags to decide using
> > > device tree or ACPI as NUMA based firmware.
> > > 
> > > So in this patch, we introduce a generic option:
> > > CONFIG_ARM_NUMA for user to enable NUMA for Arm.
> >                        ^ users
> > 
> 
> OK
> 
> > > And one CONFIG_DEVICE_TREE_NUMA option for ARM_NUMA
> > > to select when the HAS_DEVICE_TREE option is enabled.
> > > Once ACPI NUMA for Arm is supported, ACPI_NUMA
> > > can be selected here too.
> > > 
> > > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > > ---
> > >   xen/arch/arm/Kconfig | 11 +++++++++++
> > >   1 file changed, 11 insertions(+)
> > > 
> > > diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig
> > > index 865ad83a89..ded94ebd37 100644
> > > --- a/xen/arch/arm/Kconfig
> > > +++ b/xen/arch/arm/Kconfig
> > > @@ -34,6 +34,17 @@ config ACPI
> > >   	  Advanced Configuration and Power Interface (ACPI) support for Xen is
> > >   	  an alternative to device tree on ARM64.
> > >   + config DEVICE_TREE_NUMA
> > > +	def_bool n
> > > +	select NUMA
> > > +
> > > +config ARM_NUMA
> > > +	bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)" if
> > > UNSUPPORTED
> > > +	select DEVICE_TREE_NUMA if HAS_DEVICE_TREE
> > 
> > Should it be: depends on HAS_DEVICE_TREE ?
> > (And eventually depends on HAS_DEVICE_TREE || ACPI)
> > 
> 
> As discussed in the RFC [1], we want to make ARM_NUMA a generic
> option that users can select, and depend on HAS_DEVICE_TREE
> or ACPI to select DEVICE_TREE_NUMA or ACPI_NUMA.
> 
> If we add HAS_DEVICE_TREE || ACPI as dependencies for ARM_NUMA,
> does it become a loop dependency?
> 
> https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00888.html

OK, I am fine with that. I was just trying to catch the case where a
user selects "ARM_NUMA" but actually neither ACPI nor HAS_DEVICE_TREE
are selected so nothing happens. I was trying to make it clear that
ARM_NUMA depends on having at least one between HAS_DEVICE_TREE or ACPI
because otherwise it is not going to work.

That said, I don't think this is important because HAS_DEVICE_TREE
cannot be unselected. So if we cannot find a way to express the
dependency, I think it is fine to keep the patch as is.


^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 32/37] xen/arm: unified entry to parse all NUMA data from device tree
  2021-09-24  7:58     ` Wei Chen
@ 2021-09-24 19:42       ` Stefano Stabellini
  0 siblings, 0 replies; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24 19:42 UTC (permalink / raw)
  To: Wei Chen; +Cc: Stefano Stabellini, xen-devel, julien, Bertrand Marquis


On Fri, 24 Sep 2021, Wei Chen wrote:
> > -----Original Message-----
> > From: Stefano Stabellini <sstabellini@kernel.org>
> > Sent: 2021年9月24日 11:17
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> > Bertrand Marquis <Bertrand.Marquis@arm.com>
> > Subject: Re: [PATCH 32/37] xen/arm: unified entry to parse all NUMA data
> > from device tree
> > 
> > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > In this API, we scan whole device tree to parse CPU node id, memory
> >           ^ function   ^ the whole
> > 
> > > node id and distance-map. Though early_scan_node will invoke has a
> > > handler to process memory nodes. If we want to parse memory node id
> > > in this handler, we have to embeded NUMA parse code in this handler.
> >                               ^ embed
> > 
> > > But we still need to scan whole device tree to find CPU NUMA id and
> > > distance-map. In this case, we include memory NUMA id parse in this
> > > API too. Another benefit is that we have a unique entry for device
> >   ^ function
> > 
> > > tree NUMA data parse.
> > 
> > Ah, that's the explanation I was asking for earlier!
> > 
> 
> The question about device_tree_get_meminfo?

Yes, it would be nice to reuse process_memory_node if we can, but I
understand if we cannot.

^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 08/37] xen/x86: add detection of discontinous node memory range
  2021-09-24  4:28     ` Wei Chen
@ 2021-09-24 19:52       ` Stefano Stabellini
  2021-09-26 10:11         ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-24 19:52 UTC (permalink / raw)
  To: Wei Chen
  Cc: Stefano Stabellini, xen-devel, julien, Bertrand Marquis,
	jbeulich, andrew.cooper3, roger.pau, wl


On Fri, 24 Sep 2021, Wei Chen wrote:
> > -----Original Message-----
> > From: Stefano Stabellini <sstabellini@kernel.org>
> > Sent: 2021年9月24日 8:26
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> > Bertrand Marquis <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> > Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> > memory range
> > 
> > CC'ing x86 maintainers
> > 
> > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > One NUMA node may contain several memory blocks. In the current Xen
> > > code, Xen maintains a node memory range for each node to cover
> > > all its memory blocks. But here comes the problem: if, in the gap
> > > between one node's two memory blocks, there are memory blocks that
> > > don't belong to this node (remote memory blocks), this node's memory
> > > range will be expanded to cover those remote memory blocks.
> > >
> > > One node's memory range containing other nodes' memory is obviously
> > > not very reasonable. It means the current NUMA code can only support
> > > nodes with contiguous memory blocks. However, on a physical machine,
> > > the addresses of multiple nodes can be interleaved.
> > >
> > > So in this patch, we add code to detect discontinuous memory blocks
> > > for one node. NUMA initialization will fail and error messages
> > > will be printed when Xen detects such a hardware configuration.
> > 
> > At least on ARM, it is not just memory that can be interleaved, but also
> > MMIO regions. For instance:
> > 
> > node0 bank0 0-0x1000000
> > MMIO 0x1000000-0x1002000
> > Hole 0x1002000-0x2000000
> > node0 bank1 0x2000000-0x3000000
> > 
> > So I am not familiar with the SRAT format, but I think on ARM the check
> > would look different: we would just look for multiple memory ranges
> > under a device_type = "memory" node of a NUMA node in device tree.
> > 
> > 
> 
> Should I include/refine the above message in the commit log?

Let me ask you a question first.

With the NUMA implementation of this patch series, can we deal with
cases where each node has multiple memory banks, not interleaved?
As an example:

node0: 0x0        - 0x10000000
MMIO : 0x10000000 - 0x20000000
node0: 0x20000000 - 0x30000000
MMIO : 0x30000000 - 0x50000000
node1: 0x50000000 - 0x60000000
MMIO : 0x60000000 - 0x80000000
node2: 0x80000000 - 0x90000000


I assume we can deal with this case simply by setting node0 memory to
0x0-0x30000000 even if there is actually something else, a device, that
doesn't belong to node0 in between the two node0 banks?

Is it only other nodes' memory being interleaved that causes issues? In
other words, is only the following a problematic scenario?

node0: 0x0        - 0x10000000
MMIO : 0x10000000 - 0x20000000
node1: 0x20000000 - 0x30000000
MMIO : 0x30000000 - 0x50000000
node0: 0x50000000 - 0x60000000

Because node1 is in between the two ranges of node0?


I am asking these questions because it is certainly possible to have
multiple memory ranges for each NUMA node in device tree, either by
specifying multiple ranges with a single "reg" property, or by
specifying multiple memory nodes with the same numa-node-id.
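As a concrete illustration of the two forms just mentioned, a sketch following the generic device-tree NUMA binding (the addresses and sizes here are made up):

```dts
/* Form 1: one memory node whose "reg" carries multiple ranges. */
memory@0 {
    device_type = "memory";
    reg = <0x0 0x00000000 0x0 0x10000000>,   /* bank 0 */
          <0x0 0x20000000 0x0 0x10000000>;   /* bank 1 */
    numa-node-id = <0>;
};

/* Form 2: a separate memory node sharing the same numa-node-id. */
memory@20000000 {
    device_type = "memory";
    reg = <0x0 0x20000000 0x0 0x10000000>;
    numa-node-id = <0>;
};
```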

^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 08/37] xen/x86: add detection of discontinous node memory range
  2021-09-24 19:52       ` Stefano Stabellini
@ 2021-09-26 10:11         ` Wei Chen
  2021-09-27  3:13           ` Stefano Stabellini
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-26 10:11 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, julien, Bertrand Marquis, jbeulich, andrew.cooper3,
	roger.pau, wl

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: September 25, 2021 3:53
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> <Bertrand.Marquis@arm.com>; jbeulich@suse.com; andrew.cooper3@citrix.com;
> roger.pau@citrix.com; wl@xen.org
> Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> memory range
> 
> On Fri, 24 Sep 2021, Wei Chen wrote:
> > > -----Original Message-----
> > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > Sent: September 24, 2021 8:26
> > > To: Wei Chen <Wei.Chen@arm.com>
> > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> julien@xen.org;
> > > Bertrand Marquis <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > > andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> > > Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > memory range
> > >
> > > CC'ing x86 maintainers
> > >
> > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > One NUMA node may contain several memory blocks. In the current
> > > > Xen code, Xen maintains one memory range per node to cover all
> > > > of its memory blocks. But here comes the problem: if the gap
> > > > between two memory blocks of one node contains memory blocks
> > > > that do not belong to this node (remote memory blocks), this
> > > > node's memory range will be expanded to cover these remote
> > > > memory blocks.
> > > >
> > > > One node's memory range containing other nodes' memory is
> > > > obviously not reasonable. This means the current NUMA code can
> > > > only support nodes with continuous memory blocks. However, on a
> > > > physical machine, the addresses of multiple nodes can be
> > > > interleaved.
> > > >
> > > > So in this patch, we add code to detect discontinuous memory
> > > > blocks for one node. NUMA initialization will fail and error
> > > > messages will be printed when Xen detects such a hardware
> > > > configuration.
> > >
> > > At least on ARM, it is not just memory that can be interleaved,
> > > but also MMIO regions. For instance:
> > >
> > > node0 bank0 0-0x1000000
> > > MMIO 0x1000000-0x1002000
> > > Hole 0x1002000-0x2000000
> > > node0 bank1 0x2000000-0x3000000
> > >
> > > So I am not familiar with the SRAT format, but I think on ARM the
> > > check would look different: we would just look for multiple memory
> > > ranges under a device_type = "memory" node of a NUMA node in
> > > device tree.
> > >
> > >
> >
> > Should I include/refine the above message in the commit log?
> 
> Let me ask you a question first.
> 
> With the NUMA implementation of this patch series, can we deal with
> cases where each node has multiple memory banks, not interleaved?

Yes.

> As an example:
> 
> node0: 0x0        - 0x10000000
> MMIO : 0x10000000 - 0x20000000
> node0: 0x20000000 - 0x30000000
> MMIO : 0x30000000 - 0x50000000
> node1: 0x50000000 - 0x60000000
> MMIO : 0x60000000 - 0x80000000
> node2: 0x80000000 - 0x90000000
> 
> 
> I assume we can deal with this case simply by setting node0 memory to
> 0x0-0x30000000 even if there is actually something else, a device, that
> doesn't belong to node0 in between the two node0 banks?

While this configuration is rare in SoC design, it is not impossible.

> 
> Is it only other nodes' memory being interleaved that causes issues?
> In other words, is only the following a problematic scenario?
> 
> node0: 0x0        - 0x10000000
> MMIO : 0x10000000 - 0x20000000
> node1: 0x20000000 - 0x30000000
> MMIO : 0x30000000 - 0x50000000
> node0: 0x50000000 - 0x60000000
> 
> Because node1 is in between the two ranges of node0?
> 

But only ranges with device_type="memory" are added to the allocator.
For MMIO there are two cases:
1. The MMIO region doesn't have a NUMA id property.
2. The MMIO region has a NUMA id property, just like some PCIe
   controllers. But we don't need to handle these kinds of MMIO
   devices in memory block parsing, because we don't need to allocate
   memory from these MMIO ranges. And for accessing them, we would
   need a NUMA-aware PCIe controller driver or generic NUMA-aware
   MMIO access APIs.

> 
> I am asking these questions because it is certainly possible to have
> multiple memory ranges for each NUMA node in device tree, either by
> specifying multiple ranges with a single "reg" property, or by
> specifying multiple memory nodes with the same numa-node-id.



^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 23/37] xen/arm: implement node distance helpers for Arm
  2021-09-24 19:36       ` Stefano Stabellini
@ 2021-09-26 10:15         ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-26 10:15 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis


> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: September 25, 2021 3:36
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> <Bertrand.Marquis@arm.com>
> Subject: RE: [PATCH 23/37] xen/arm: implement node distance helpers for
> Arm
> 
> On Fri, 24 Sep 2021, Wei Chen wrote:
> > > -----Original Message-----
> > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > Sent: September 24, 2021 9:47
> > > To: Wei Chen <Wei.Chen@arm.com>
> > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> julien@xen.org;
> > > Bertrand Marquis <Bertrand.Marquis@arm.com>
> > > Subject: Re: [PATCH 23/37] xen/arm: implement node distance helpers
> for
> > > Arm
> > >
> > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > We will parse NUMA node distances from the device tree or ACPI
> > > > table. So we need a matrix to record the distances between any
> > > > two nodes we parsed. Accordingly, in this patch we provide the
> > > > node_set_distance API for the device tree or ACPI table parsers
> > > > to set the distance for any two nodes.
> > > > When NUMA initialization fails, __node_distance will return
> > > > NUMA_REMOTE_DISTANCE; this helps us avoid rolling back the
> > > > distance matrix when NUMA initialization fails.
> > > >
> > > > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > > > ---
> > > >  xen/arch/arm/Makefile      |  1 +
> > > >  xen/arch/arm/numa.c        | 69 ++++++++++++++++++++++++++++++++++++++
> > > >  xen/include/asm-arm/numa.h | 13 +++++++
> > > >  3 files changed, 83 insertions(+)
> > > >  create mode 100644 xen/arch/arm/numa.c
> > > >
> > > > diff --git a/xen/arch/arm/Makefile b/xen/arch/arm/Makefile
> > > > index ae4efbf76e..41ca311b6b 100644
> > > > --- a/xen/arch/arm/Makefile
> > > > +++ b/xen/arch/arm/Makefile
> > > > @@ -35,6 +35,7 @@ obj-$(CONFIG_LIVEPATCH) += livepatch.o
> > > >  obj-y += mem_access.o
> > > >  obj-y += mm.o
> > > >  obj-y += monitor.o
> > > > +obj-$(CONFIG_NUMA) += numa.o
> > > >  obj-y += p2m.o
> > > >  obj-y += percpu.o
> > > >  obj-y += platform.o
> > > > diff --git a/xen/arch/arm/numa.c b/xen/arch/arm/numa.c
> > > > new file mode 100644
> > > > index 0000000000..3f08870d69
> > > > --- /dev/null
> > > > +++ b/xen/arch/arm/numa.c
> > > > @@ -0,0 +1,69 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/*
> > > > + * Arm Architecture support layer for NUMA.
> > > > + *
> > > > + * Copyright (C) 2021 Arm Ltd
> > > > + *
> > > > + * This program is free software; you can redistribute it and/or modify
> > > > + * it under the terms of the GNU General Public License version 2 as
> > > > + * published by the Free Software Foundation.
> > > > + *
> > > > + * This program is distributed in the hope that it will be useful,
> > > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > > + * GNU General Public License for more details.
> > > > + *
> > > > + * You should have received a copy of the GNU General Public License
> > > > + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> > > > + *
> > > > + */
> > > > +#include <xen/init.h>
> > > > +#include <xen/numa.h>
> > > > +
> > > > +static uint8_t __read_mostly
> > > > +node_distance_map[MAX_NUMNODES][MAX_NUMNODES] = {
> > > > +    { 0 }
> > > > +};
> > > > +
> > > > +void __init numa_set_distance(nodeid_t from, nodeid_t to, uint32_t distance)
> > > > +{
> > > > +    if ( from >= MAX_NUMNODES || to >= MAX_NUMNODES )
> > > > +    {
> > > > +        printk(KERN_WARNING
> > > > +               "NUMA: invalid nodes: from=%"PRIu8" to=%"PRIu8" MAX=%"PRIu8"\n",
> > > > +               from, to, MAX_NUMNODES);
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    /* NUMA defines 0xff as an unreachable node and 0-9 are undefined */
> > > > +    if ( distance >= NUMA_NO_DISTANCE ||
> > > > +        (distance >= NUMA_DISTANCE_UDF_MIN &&
> > > > +         distance <= NUMA_DISTANCE_UDF_MAX) ||
> > > > +        (from == to && distance != NUMA_LOCAL_DISTANCE) )
> > > > +    {
> > > > +        printk(KERN_WARNING
> > > > +               "NUMA: invalid distance: from=%"PRIu8" to=%"PRIu8" distance=%"PRIu32"\n",
> > > > +               from, to, distance);
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    node_distance_map[from][to] = distance;
> > > > +}
> > > > +
> > > > +uint8_t __node_distance(nodeid_t from, nodeid_t to)
> > > > +{
> > > > +    /* When NUMA is off, any distance will be treated as remote. */
> > > > +    if ( srat_disabled() )
> > >
> > > Given that this is ARM-specific code and specific to ACPI, I don't
> > > think we should have any call to something called "srat_disabled".
> > >
> > > I suggest renaming srat_disabled to numa_distance_disabled.
> > >
> > > Other than that, this patch looks OK to me.
> > >
> >
> > SRAT stands for Static Resource Affinity Table; I think the DTB can
> > also be treated as a static resource affinity table. So I keep SRAT
> > in this patch and the other patches. I have seen your comment in
> > patch#25. Before the x86 maintainers give any feedback, can we still
> > keep srat here?
> 
> Jan and I replied in the other thread. I think that "SRAT" should not
> be mentioned in warning messages when booting from DT. Ideally,
> function names and variables should be renamed too when shared between
> ACPI and DT, but that is less critical, and it is fine if you don't do
> it in the next version.

Thanks. I'll leave it as it is unless I find a better name.
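For reference, the device-tree counterpart of the SLIT-style distance information that a DT parser would feed into numa_set_distance() is the generic distance-map node; the values below are illustrative:

```dts
distance-map {
    compatible = "numa-distance-map-v1";
    /* Triplets of <from-node to-node distance>; 10 is the local
     * distance, 20 a typical remote distance. */
    distance-matrix = <0 0 10>,
                      <0 1 20>,
                      <1 0 20>,
                      <1 1 10>;
};
```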

^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture
  2021-09-24 10:49           ` Jan Beulich
@ 2021-09-26 10:25             ` Wei Chen
  2021-09-27 10:28               ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-26 10:25 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, julien, Bertrand Marquis, Stefano Stabellini

Hi Jan,

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan
> Beulich
> Sent: September 24, 2021 18:49
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> <Bertrand.Marquis@arm.com>; Stefano Stabellini <sstabellini@kernel.org>
> Subject: Re: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-
> EFI architecture
> 
> On 24.09.2021 12:31, Wei Chen wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: September 24, 2021 15:59
> >>
> >> On 24.09.2021 06:34, Wei Chen wrote:
> >>>> From: Stefano Stabellini <sstabellini@kernel.org>
> >>>> Sent: September 24, 2021 9:15
> >>>>
> >>>> On Thu, 23 Sep 2021, Wei Chen wrote:
> >>>>> --- a/xen/common/Kconfig
> >>>>> +++ b/xen/common/Kconfig
> >>>>> @@ -11,6 +11,16 @@ config COMPAT
> >>>>>  config CORE_PARKING
> >>>>>  	bool
> >>>>>
> >>>>> +config EFI
> >>>>> +	bool
> >>>>
> >>>> Without the title the option is not user-selectable (or de-
> selectable).
> >>>> So the help message below can never be seen.
> >>>>
> >>>> Either add a title, e.g.:
> >>>>
> >>>> bool "EFI support"
> >>>>
> >>>> Or fully make the option a silent option by removing the help text.
> >>>
> >>> OK, in the current Xen code, EFI is unconditionally compiled.
> >>> Before we change the related code, I prefer to remove the help text.
> >>
> >> But that's not true: At least on x86 EFI gets compiled depending on
> >> tool chain capabilities. Ultimately we may indeed want a user
> >> selectable option here, but until then I'm afraid having this option
> >> at all may be misleading on x86.
> >>
> >
> > I checked the build scripts, and yes, you're right. For x86, EFI is
> > not a selectable option in Kconfig. I agree with you that we can't
> > use the Kconfig system to decide whether to enable the EFI build for
> > x86.
> >
> > So how about we just use this EFI option for Arm only? Because on
> > Arm, we do not have such a toolchain dependency.
> 
> To be honest - don't know. That's because I don't know what you want
> to use the option for subsequently.
> 

In the last version, I had introduced an arch helper to stub EFI_BOOT
for Arm32 in Arm's common code, because Arm32 doesn't support EFI.
So Julien suggested [1] that I introduce a CONFIG_EFI option so that
architectures without EFI support can stub out the EFI layer.

[1] https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00808.html
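For context, the shape of the option being discussed is roughly the following (an illustrative Kconfig sketch, not the final patch): a silent option with no prompt string, which EFI-capable architectures enable via select, so architectures that never select it (such as Arm32) build against the stubbed EFI layer:

```kconfig
config EFI
	bool
	# No prompt string, so the option is invisible to users and can
	# only be enabled by an architecture's "select EFI" (e.g. from
	# an Arm64 Kconfig entry). Without a prompt, any help text would
	# never be shown, which is why it was dropped.
```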

> Jan
> 


^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS
  2021-09-24  1:34   ` Stefano Stabellini
@ 2021-09-26 13:13     ` Wei Chen
  2021-09-27  3:25       ` Stefano Stabellini
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-26 13:13 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: September 24, 2021 9:35
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default
> NR_NODE_MEMBLKS
> 
> On Thu, 23 Sep 2021, Wei Chen wrote:
> > A memory range described in the device tree cannot be split across
> > multiple nodes, so we define NR_NODE_MEMBLKS as NR_MEM_BANKS in the
> > arch header.
> 
> This statement is true but what is the goal of this patch? Is it to
> reduce code size and memory consumption?
> 

No, when Julien and I discussed this in the last version[1], we hadn't
thought about it so deeply. We just thought that a memory range
described in DT cannot be split across multiple nodes, so
NR_NODE_MEMBLKS should be equal to NR_MEM_BANKS.

[1] https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00974.html

> I am asking because NR_MEM_BANKS is 128 and
> NR_NODE_MEMBLKS=2*MAX_NUMNODES which is 64 by default so again
> NR_NODE_MEMBLKS is 128 before this patch.
> 
> In other words, this patch alone doesn't make any difference; at least
> doesn't make any difference unless CONFIG_NR_NUMA_NODES is increased.
> 
> So, is the goal to reduce memory usage when CONFIG_NR_NUMA_NODES is
> higher than 64?
> 

I also thought about this problem when I was writing this patch.
CONFIG_NR_NUMA_NODES can increase, but NR_MEM_BANKS is a fixed value,
so NR_MEM_BANKS can become smaller than CONFIG_NR_NUMA_NODES at some
point.

But I agree with Julien's suggestion that NR_MEM_BANKS and
NR_NODE_MEMBLKS must be aware of each other. I had thought of adding
some ASSERT check, but I don't know how to do it well. So I posted
this patch for more suggestions.

> 
> > And keep the default NR_NODE_MEMBLKS in the common header
> > for those architectures where NUMA is disabled.
> 
> This last sentence is not accurate: on x86 NUMA is enabled and
> NR_NODE_MEMBLKS is still defined in xen/include/xen/numa.h (there is no
> x86 definition of it)
> 

Yes.

> 
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> >  xen/include/asm-arm/numa.h | 8 +++++++-
> >  xen/include/xen/numa.h     | 2 ++
> >  2 files changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > index 8f1c67e3eb..21569e634b 100644
> > --- a/xen/include/asm-arm/numa.h
> > +++ b/xen/include/asm-arm/numa.h
> > @@ -3,9 +3,15 @@
> >
> >  #include <xen/mm.h>
> >
> > +#include <asm/setup.h>
> > +
> >  typedef u8 nodeid_t;
> >
> > -#ifndef CONFIG_NUMA
> > +#ifdef CONFIG_NUMA
> > +
> > +#define NR_NODE_MEMBLKS NR_MEM_BANKS
> > +
> > +#else
> >
> >  /* Fake one node for now. See also node_online_map. */
> >  #define cpu_to_node(cpu) 0
> > diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
> > index 1978e2be1b..1731e1cc6b 100644
> > --- a/xen/include/xen/numa.h
> > +++ b/xen/include/xen/numa.h
> > @@ -12,7 +12,9 @@
> >  #define MAX_NUMNODES    1
> >  #endif
> >
> > +#ifndef NR_NODE_MEMBLKS
> >  #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> > +#endif
> >
> >  #define vcpu_to_node(v) (cpu_to_node((v)->processor))
> >
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 08/37] xen/x86: add detection of discontinous node memory range
  2021-09-26 10:11         ` Wei Chen
@ 2021-09-27  3:13           ` Stefano Stabellini
  2021-09-27  5:05             ` Stefano Stabellini
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-27  3:13 UTC (permalink / raw)
  To: Wei Chen
  Cc: Stefano Stabellini, xen-devel, julien, Bertrand Marquis,
	jbeulich, andrew.cooper3, roger.pau, wl

On Sun, 26 Sep 2021, Wei Chen wrote:
> > -----Original Message-----
> > From: Stefano Stabellini <sstabellini@kernel.org>
> > Sent: September 25, 2021 3:53
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > <Bertrand.Marquis@arm.com>; jbeulich@suse.com; andrew.cooper3@citrix.com;
> > roger.pau@citrix.com; wl@xen.org
> > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> > memory range
> > 
> > On Fri, 24 Sep 2021, Wei Chen wrote:
> > > > -----Original Message-----
> > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > Sent: September 24, 2021 8:26
> > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> > julien@xen.org;
> > > > Bertrand Marquis <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > > > andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> > > > Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > > memory range
> > > >
> > > > CC'ing x86 maintainers
> > > >
> > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > One NUMA node may contain several memory blocks. In the
> > > > > current Xen code, Xen maintains one memory range per node to
> > > > > cover all of its memory blocks. But here comes the problem: if
> > > > > the gap between two memory blocks of one node contains memory
> > > > > blocks that do not belong to this node (remote memory blocks),
> > > > > this node's memory range will be expanded to cover these
> > > > > remote memory blocks.
> > > > >
> > > > > One node's memory range containing other nodes' memory is
> > > > > obviously not reasonable. This means the current NUMA code can
> > > > > only support nodes with continuous memory blocks. However, on
> > > > > a physical machine, the addresses of multiple nodes can be
> > > > > interleaved.
> > > > >
> > > > > So in this patch, we add code to detect discontinuous memory
> > > > > blocks for one node. NUMA initialization will fail and error
> > > > > messages will be printed when Xen detects such a hardware
> > > > > configuration.
> > > >
> > > > At least on ARM, it is not just memory that can be interleaved,
> > > > but also MMIO regions. For instance:
> > > >
> > > > node0 bank0 0-0x1000000
> > > > MMIO 0x1000000-0x1002000
> > > > Hole 0x1002000-0x2000000
> > > > node0 bank1 0x2000000-0x3000000
> > > >
> > > > So I am not familiar with the SRAT format, but I think on ARM
> > > > the check would look different: we would just look for multiple
> > > > memory ranges under a device_type = "memory" node of a NUMA node
> > > > in device tree.
> > > >
> > > >
> > >
> > > Should I include/refine the above message in the commit log?
> > 
> > Let me ask you a question first.
> > 
> > With the NUMA implementation of this patch series, can we deal with
> > cases where each node has multiple memory banks, not interleaved?
> 
> Yes.
> 
> > As an example:
> > 
> > node0: 0x0        - 0x10000000
> > MMIO : 0x10000000 - 0x20000000
> > node0: 0x20000000 - 0x30000000
> > MMIO : 0x30000000 - 0x50000000
> > node1: 0x50000000 - 0x60000000
> > MMIO : 0x60000000 - 0x80000000
> > node2: 0x80000000 - 0x90000000
> > 
> > 
> > I assume we can deal with this case simply by setting node0 memory to
> > 0x0-0x30000000 even if there is actually something else, a device, that
> > doesn't belong to node0 in between the two node0 banks?
> 
> While this configuration is rare in SoC design, it is not impossible.

Definitely, I have seen it before.


> > Is it only other nodes' memory being interleaved that causes
> > issues? In other words, is only the following a problematic
> > scenario?
> > 
> > node0: 0x0        - 0x10000000
> > MMIO : 0x10000000 - 0x20000000
> > node1: 0x20000000 - 0x30000000
> > MMIO : 0x30000000 - 0x50000000
> > node0: 0x50000000 - 0x60000000
> > 
> > Because node1 is in between the two ranges of node0?
> > 
> 
> But only ranges with device_type="memory" are added to the allocator.
> For MMIO there are two cases:
> 1. The MMIO region doesn't have a NUMA id property.
> 2. The MMIO region has a NUMA id property, just like some PCIe
>    controllers. But we don't need to handle these kinds of MMIO
>    devices in memory block parsing, because we don't need to allocate
>    memory from these MMIO ranges. And for accessing them, we would
>    need a NUMA-aware PCIe controller driver or generic NUMA-aware
>    MMIO access APIs.

Yes, I am not too worried about devices with a NUMA id property because
they are less common and this series doesn't handle them at all, right?
I imagine they would be treated like any other device without NUMA
awareness.

I am thinking about the case where the memory of each NUMA node is made
of multiple banks. I understand that this patch adds an explicit check
for cases where these banks are interleaving; however, there are many
other cases where NUMA memory nodes are *not* interleaving but are
still made of multiple discontinuous banks, like in the two examples
above.

My question is whether this patch series in its current form can handle
the two cases above correctly. If so, I am wondering how it works given
that we only have a single "start" and "size" parameter per node.

On the other hand if this series cannot handle the two cases above, my
question is whether it would fail explicitly or not. The new
check is_node_memory_continuous doesn't seem to be able to catch them.


> > I am asking these questions because it is certainly possible to have
> > multiple memory ranges for each NUMA node in device tree, either by
> > specifying multiple ranges with a single "reg" property, or by
> > specifying multiple memory nodes with the same numa-node-id.
> 
> 
> 

^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS
  2021-09-26 13:13     ` Wei Chen
@ 2021-09-27  3:25       ` Stefano Stabellini
  2021-09-27  4:18         ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-27  3:25 UTC (permalink / raw)
  To: Wei Chen; +Cc: Stefano Stabellini, xen-devel, julien, Bertrand Marquis

On Sun, 26 Sep 2021, Wei Chen wrote:
> > -----Original Message-----
> > From: Stefano Stabellini <sstabellini@kernel.org>
> > Sent: September 24, 2021 9:35
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org;
> > Bertrand Marquis <Bertrand.Marquis@arm.com>
> > Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default
> > NR_NODE_MEMBLKS
> > 
> > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > A memory range described in the device tree cannot be split across
> > > multiple nodes, so we define NR_NODE_MEMBLKS as NR_MEM_BANKS in
> > > the arch header.
> > 
> > This statement is true but what is the goal of this patch? Is it to
> > reduce code size and memory consumption?
> > 
> 
> No, when Julien and I discussed this in the last version[1], we
> hadn't thought about it so deeply. We just thought that a memory range
> described in DT cannot be split across multiple nodes, so
> NR_NODE_MEMBLKS should be equal to NR_MEM_BANKS.
> 
> [1] https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00974.html
> 
> > I am asking because NR_MEM_BANKS is 128 and
> > NR_NODE_MEMBLKS=2*MAX_NUMNODES which is 64 by default so again
> > NR_NODE_MEMBLKS is 128 before this patch.
> > 
> > In other words, this patch alone doesn't make any difference; at least
> > doesn't make any difference unless CONFIG_NR_NUMA_NODES is increased.
> > 
> > So, is the goal to reduce memory usage when CONFIG_NR_NUMA_NODES is
> > higher than 64?
> > 
> 
> I also thought about this problem when I was writing this patch.
> CONFIG_NR_NUMA_NODES can increase, but NR_MEM_BANKS is a fixed value,
> so NR_MEM_BANKS can become smaller than CONFIG_NR_NUMA_NODES at some
> point.
> 
> But I agree with Julien's suggestion that NR_MEM_BANKS and
> NR_NODE_MEMBLKS must be aware of each other. I had thought of adding
> some ASSERT check, but I don't know how to do it well. So I posted
> this patch for more suggestions.

OK. In that case I'd say to get rid of the previous definition of
NR_NODE_MEMBLKS as it is probably not necessary, see below.



> > 
> > > And keep the default NR_NODE_MEMBLKS in the common header
> > > for those architectures where NUMA is disabled.
> > 
> > This last sentence is not accurate: on x86 NUMA is enabled and
> > NR_NODE_MEMBLKS is still defined in xen/include/xen/numa.h (there is no
> > x86 definition of it)
> > 
> 
> Yes.
> 
> > 
> > > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > > ---
> > >  xen/include/asm-arm/numa.h | 8 +++++++-
> > >  xen/include/xen/numa.h     | 2 ++
> > >  2 files changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > > index 8f1c67e3eb..21569e634b 100644
> > > --- a/xen/include/asm-arm/numa.h
> > > +++ b/xen/include/asm-arm/numa.h
> > > @@ -3,9 +3,15 @@
> > >
> > >  #include <xen/mm.h>
> > >
> > > +#include <asm/setup.h>
> > > +
> > >  typedef u8 nodeid_t;
> > >
> > > -#ifndef CONFIG_NUMA
> > > +#ifdef CONFIG_NUMA
> > > +
> > > +#define NR_NODE_MEMBLKS NR_MEM_BANKS
> > > +
> > > +#else
> > >
> > >  /* Fake one node for now. See also node_online_map. */
> > >  #define cpu_to_node(cpu) 0
> > > diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
> > > index 1978e2be1b..1731e1cc6b 100644
> > > --- a/xen/include/xen/numa.h
> > > +++ b/xen/include/xen/numa.h
> > > @@ -12,7 +12,9 @@
> > >  #define MAX_NUMNODES    1
> > >  #endif
> > >
> > > +#ifndef NR_NODE_MEMBLKS
> > >  #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> > > +#endif

This one we can remove completely, right?

^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS
  2021-09-27  3:25       ` Stefano Stabellini
@ 2021-09-27  4:18         ` Wei Chen
  2021-09-27  4:59           ` Stefano Stabellini
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-27  4:18 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: September 27, 2021 11:26
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> <Bertrand.Marquis@arm.com>
> Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default
> NR_NODE_MEMBLKS
> 
> On Sun, 26 Sep 2021, Wei Chen wrote:
> > > -----Original Message-----
> > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > Sent: September 24, 2021 9:35
> > > To: Wei Chen <Wei.Chen@arm.com>
> > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> julien@xen.org;
> > > Bertrand Marquis <Bertrand.Marquis@arm.com>
> > > Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> default
> > > NR_NODE_MEMBLKS
> > >
> > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > A memory range described in the device tree cannot be split
> > > > across multiple nodes, so we define NR_NODE_MEMBLKS as
> > > > NR_MEM_BANKS in the arch header.
> > >
> > > This statement is true but what is the goal of this patch? Is it to
> > > reduce code size and memory consumption?
> > >
> >
> > No, when Julien and I discussed this in the last version[1], we
> > hadn't thought about it so deeply. We just thought that a memory
> > range described in DT cannot be split across multiple nodes, so
> > NR_NODE_MEMBLKS should be equal to NR_MEM_BANKS.
> >
> > [1] https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00974.html
> >
> > > I am asking because NR_MEM_BANKS is 128 and
> > > NR_NODE_MEMBLKS=2*MAX_NUMNODES which is 64 by default so again
> > > NR_NODE_MEMBLKS is 128 before this patch.
> > >
> > > In other words, this patch alone doesn't make any difference; at least
> > > doesn't make any difference unless CONFIG_NR_NUMA_NODES is increased.
> > >
> > > So, is the goal to reduce memory usage when CONFIG_NR_NUMA_NODES is
> > > higher than 64?
> > >
> >
> > I also thought about this problem when I was writing this patch.
> > CONFIG_NR_NUMA_NODES can increase, but NR_MEM_BANKS is a fixed
> > value, so NR_MEM_BANKS can become smaller than CONFIG_NR_NUMA_NODES
> > at some point.
> >
> > But I agree with Julien's suggestion that NR_MEM_BANKS and
> > NR_NODE_MEMBLKS must be aware of each other. I had thought of adding
> > some ASSERT check, but I don't know how to do it well. So I posted
> > this patch for more suggestions.
> 
> OK. In that case I'd say to get rid of the previous definition of
> NR_NODE_MEMBLKS as it is probably not necessary, see below.
> 
> 
> 
> > >
> > > > And keep the default NR_NODE_MEMBLKS in the common header
> > > > for those architectures where NUMA is disabled.
> > >
> > > This last sentence is not accurate: on x86 NUMA is enabled and
> > > NR_NODE_MEMBLKS is still defined in xen/include/xen/numa.h (there is
> no
> > > x86 definition of it)
> > >
> >
> > Yes.
> >
> > >
> > > > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > > > ---
> > > >  xen/include/asm-arm/numa.h | 8 +++++++-
> > > >  xen/include/xen/numa.h     | 2 ++
> > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > > > index 8f1c67e3eb..21569e634b 100644
> > > > --- a/xen/include/asm-arm/numa.h
> > > > +++ b/xen/include/asm-arm/numa.h
> > > > @@ -3,9 +3,15 @@
> > > >
> > > >  #include <xen/mm.h>
> > > >
> > > > +#include <asm/setup.h>
> > > > +
> > > >  typedef u8 nodeid_t;
> > > >
> > > > -#ifndef CONFIG_NUMA
> > > > +#ifdef CONFIG_NUMA
> > > > +
> > > > +#define NR_NODE_MEMBLKS NR_MEM_BANKS
> > > > +
> > > > +#else
> > > >
> > > >  /* Fake one node for now. See also node_online_map. */
> > > >  #define cpu_to_node(cpu) 0
> > > > diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
> > > > index 1978e2be1b..1731e1cc6b 100644
> > > > --- a/xen/include/xen/numa.h
> > > > +++ b/xen/include/xen/numa.h
> > > > @@ -12,7 +12,9 @@
> > > >  #define MAX_NUMNODES    1
> > > >  #endif
> > > >
> > > > +#ifndef NR_NODE_MEMBLKS
> > > >  #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> > > > +#endif
> 
> This one we can remove it completely right?

How about defining NR_MEM_BANKS as:
#ifdef CONFIG_NR_NUMA_NODES
#define NR_MEM_BANKS (CONFIG_NR_NUMA_NODES * 2)
#else
#define NR_MEM_BANKS 128
#endif
for both x86 and Arm? Architectures that do not support or enable NUMA
can still use "NR_MEM_BANKS 128". We could then replace all uses of
NR_NODE_MEMBLKS in the NUMA code with NR_MEM_BANKS and remove
NR_NODE_MEMBLKS completely. That way, NR_MEM_BANKS tracks changes to
CONFIG_NR_NUMA_NODES.





^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS
  2021-09-27  4:18         ` Wei Chen
@ 2021-09-27  4:59           ` Stefano Stabellini
  2021-09-27  6:25             ` Julien Grall
  2021-09-27  6:46             ` Wei Chen
  0 siblings, 2 replies; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-27  4:59 UTC (permalink / raw)
  To: Wei Chen
  Cc: Stefano Stabellini, xen-devel, julien, Bertrand Marquis,
	jbeulich, roger.pau, andrew.cooper3

+x86 maintainers

On Mon, 27 Sep 2021, Wei Chen wrote:
> > -----Original Message-----
> > From: Stefano Stabellini <sstabellini@kernel.org>
> > Sent: 2021年9月27日 11:26
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > <Bertrand.Marquis@arm.com>
> > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default
> > NR_NODE_MEMBLKS
> > 
> > On Sun, 26 Sep 2021, Wei Chen wrote:
> > > > -----Original Message-----
> > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > Sent: 2021年9月24日 9:35
> > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> > julien@xen.org;
> > > > Bertrand Marquis <Bertrand.Marquis@arm.com>
> > > > Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> > default
> > > > NR_NODE_MEMBLKS
> > > >
> > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > As a memory range described in device tree cannot be split across
> > > > > multiple nodes. So we define NR_NODE_MEMBLKS as NR_MEM_BANKS in
> > > > > arch header.
> > > >
> > > > This statement is true but what is the goal of this patch? Is it to
> > > > reduce code size and memory consumption?
> > > >
> > >
> > > No, when Julien and I discussed this in last version[1], we hadn't
> > thought
> > > so deeply. We just thought a memory range described in DT cannot be
> > split
> > > across multiple nodes. So NR_NODE_MEMBLKS should be equal to NR_MEM_BANKS.
> > >
> > > https://lists.xenproject.org/archives/html/xen-devel/2021-
> > 08/msg00974.html
> > >
> > > > I am asking because NR_MEM_BANKS is 128 and
> > > > NR_NODE_MEMBLKS=2*MAX_NUMNODES which is 64 by default so again
> > > > NR_NODE_MEMBLKS is 128 before this patch.
> > > >
> > > > In other words, this patch alone doesn't make any difference; at least
> > > > doesn't make any difference unless CONFIG_NR_NUMA_NODES is increased.
> > > >
> > > > So, is the goal to reduce memory usage when CONFIG_NR_NUMA_NODES is
> > > > higher than 64?
> > > >
> > >
> > > I also thought about this problem when I was writing this patch.
> > > CONFIG_NR_NUMA_NODES is increasing, but NR_MEM_BANKS is a fixed
> > > value, then NR_MEM_BANKS can be smaller than CONFIG_NR_NUMA_NODES
> > > at one point.
> > >
> > > But I agree with Julien's suggestion, NR_MEM_BANKS and NR_NODE_MEMBLKS
> > > must be aware of each other. I had thought to add some ASSERT check,
> > > but I don't know how to do it better. So I post this patch for more
> > > suggestion.
> > 
> > OK. In that case I'd say to get rid of the previous definition of
> > NR_NODE_MEMBLKS as it is probably not necessary, see below.
> > 
> > 
> > 
> > > >
> > > > > And keep default NR_NODE_MEMBLKS in common header
> > > > > for those architectures NUMA is disabled.
> > > >
> > > > This last sentence is not accurate: on x86 NUMA is enabled and
> > > > NR_NODE_MEMBLKS is still defined in xen/include/xen/numa.h (there is
> > no
> > > > x86 definition of it)
> > > >
> > >
> > > Yes.
> > >
> > > >
> > > > > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > > > > ---
> > > > >  xen/include/asm-arm/numa.h | 8 +++++++-
> > > > >  xen/include/xen/numa.h     | 2 ++
> > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > > > > index 8f1c67e3eb..21569e634b 100644
> > > > > --- a/xen/include/asm-arm/numa.h
> > > > > +++ b/xen/include/asm-arm/numa.h
> > > > > @@ -3,9 +3,15 @@
> > > > >
> > > > >  #include <xen/mm.h>
> > > > >
> > > > > +#include <asm/setup.h>
> > > > > +
> > > > >  typedef u8 nodeid_t;
> > > > >
> > > > > -#ifndef CONFIG_NUMA
> > > > > +#ifdef CONFIG_NUMA
> > > > > +
> > > > > +#define NR_NODE_MEMBLKS NR_MEM_BANKS
> > > > > +
> > > > > +#else
> > > > >
> > > > >  /* Fake one node for now. See also node_online_map. */
> > > > >  #define cpu_to_node(cpu) 0
> > > > > diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
> > > > > index 1978e2be1b..1731e1cc6b 100644
> > > > > --- a/xen/include/xen/numa.h
> > > > > +++ b/xen/include/xen/numa.h
> > > > > @@ -12,7 +12,9 @@
> > > > >  #define MAX_NUMNODES    1
> > > > >  #endif
> > > > >
> > > > > +#ifndef NR_NODE_MEMBLKS
> > > > >  #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> > > > > +#endif
> > 
> > This one we can remove it completely right?
> 
> How about define NR_MEM_BANKS to:
> #ifdef CONFIG_NR_NUMA_NODES
> #define NR_MEM_BANKS (CONFIG_NR_NUMA_NODES * 2)
> #else
> #define NR_MEM_BANKS 128
> #endif
> for both x86 and Arm. For those architectures do not support or enable
> NUMA, they can still use "NR_MEM_BANKS 128". And replace all NR_NODE_MEMBLKS
> in NUMA code to NR_MEM_BANKS to remove NR_NODE_MEMBLKS completely.
> In this case, NR_MEM_BANKS can be aware of the changes of CONFIG_NR_NUMA_NODES.

x86 doesn't have NR_MEM_BANKS as far as I can tell. I guess you also
meant to rename NR_NODE_MEMBLKS to NR_MEM_BANKS?

But NR_MEM_BANKS is not directly related to CONFIG_NR_NUMA_NODES because
there can be many memory banks for each numa node, certainly more than
2. The existing definition on x86:

#define NR_NODE_MEMBLKS (MAX_NUMNODES*2)

Doesn't make a lot of sense to me. Was it just an arbitrary limit for
the lack of a better way to set a maximum?


On the other hand, NR_MEM_BANKS and NR_NODE_MEMBLKS seem to be related.
In fact, what's the difference?

NR_MEM_BANKS is the max number of memory banks (with or without
numa-node-id).

NR_NODE_MEMBLKS is the max number of memory banks with NUMA support
(with numa-node-id)?

They are basically the same thing. On ARM I would just do:

#define NR_NODE_MEMBLKS MAX(NR_MEM_BANKS, (CONFIG_NR_NUMA_NODES * 2))


And maybe the definition could be common with x86 if we define
NR_MEM_BANKS to 128 on x86 too.


* RE: [PATCH 08/37] xen/x86: add detection of discontinous node memory range
  2021-09-27  3:13           ` Stefano Stabellini
@ 2021-09-27  5:05             ` Stefano Stabellini
  2021-09-27  9:50               ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-27  5:05 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Chen, xen-devel, julien, Bertrand Marquis, jbeulich,
	andrew.cooper3, roger.pau, wl

On Sun, 26 Sep 2021, Stefano Stabellini wrote:
> On Sun, 26 Sep 2021, Wei Chen wrote:
> > > -----Original Message-----
> > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > Sent: 2021年9月25日 3:53
> > > To: Wei Chen <Wei.Chen@arm.com>
> > > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > <Bertrand.Marquis@arm.com>; jbeulich@suse.com; andrew.cooper3@citrix.com;
> > > roger.pau@citrix.com; wl@xen.org
> > > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > memory range
> > > 
> > > On Fri, 24 Sep 2021, Wei Chen wrote:
> > > > > -----Original Message-----
> > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > Sent: 2021年9月24日 8:26
> > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> > > julien@xen.org;
> > > > > Bertrand Marquis <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > > > > andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> > > > > Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > > > memory range
> > > > >
> > > > > CC'ing x86 maintainers
> > > > >
> > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > One NUMA node may contain several memory blocks. In the
> > > > > > current Xen code, Xen maintains a single memory range per node
> > > > > > to cover all of its memory blocks. But here comes the problem:
> > > > > > in the gap between two of a node's memory blocks there may be
> > > > > > memory blocks that don't belong to this node (remote memory
> > > > > > blocks). This node's memory range will be expanded to cover
> > > > > > these remote memory blocks.
> > > > > >
> > > > > > One node's memory range containing other nodes' memory is
> > > > > > obviously not reasonable. It means the current NUMA code can
> > > > > > only support nodes with continuous memory blocks. However, on
> > > > > > a physical machine, the addresses of multiple nodes can be
> > > > > > interleaved.
> > > > > >
> > > > > > So in this patch, we add code to detect discontinuous memory
> > > > > > blocks for one node. NUMA initialization will fail and error
> > > > > > messages will be printed when Xen detects such a hardware
> > > > > > configuration.
> > > > >
> > > > > At least on ARM, it is not just memory that can be interleaved, but
> > > also
> > > > > MMIO regions. For instance:
> > > > >
> > > > > node0 bank0 0-0x1000000
> > > > > MMIO 0x1000000-0x1002000
> > > > > Hole 0x1002000-0x2000000
> > > > > node0 bank1 0x2000000-0x3000000
> > > > >
> > > > > So I am not familiar with the SRAT format, but I think on ARM the
> > > check
> > > > > would look different: we would just look for multiple memory ranges
> > > > > under a device_type = "memory" node of a NUMA node in device tree.
> > > > >
> > > > >
> > > >
> > > > Should I need to include/refine above message to commit log?
> > > 
> > > Let me ask you a question first.
> > > 
> > > With the NUMA implementation of this patch series, can we deal with
> > > cases where each node has multiple memory banks, not interleaved?
> > 
> > Yes.
> > 
> > > An an example:
> > > 
> > > node0: 0x0        - 0x10000000
> > > MMIO : 0x10000000 - 0x20000000
> > > node0: 0x20000000 - 0x30000000
> > > MMIO : 0x30000000 - 0x50000000
> > > node1: 0x50000000 - 0x60000000
> > > MMIO : 0x60000000 - 0x80000000
> > > node2: 0x80000000 - 0x90000000
> > > 
> > > 
> > > I assume we can deal with this case simply by setting node0 memory to
> > > 0x0-0x30000000 even if there is actually something else, a device, that
> > > doesn't belong to node0 in between the two node0 banks?
> > 
> > While this configuration is rare in SoC design, it is not impossible.
> 
> Definitely, I have seen it before.
> 
> 
> > > Is it only other nodes' memory interleaved that cause issues? In other
> > > words, only the following is a problematic scenario?
> > > 
> > > node0: 0x0        - 0x10000000
> > > MMIO : 0x10000000 - 0x20000000
> > > node1: 0x20000000 - 0x30000000
> > > MMIO : 0x30000000 - 0x50000000
> > > node0: 0x50000000 - 0x60000000
> > > 
> > > Because node1 is in between the two ranges of node0?
> > > 
> > 
> > But only ranges with device_type="memory" are added to the allocator.
> > For mmio there are two cases:
> > 1. mmio doesn't have NUMA id property.
> > 2. mmio has NUMA id property, just like some PCIe controllers.
> >    But we don’t need to handle these kinds of MMIO devices
> >    in memory block parsing. Because we don't need to allocate
> >    memory from these mmio ranges. And for accessing, we need
> >    a NUMA-aware PCIe controller driver or a generic NUMA-aware
> >    MMIO accessing APIs.
> 
> Yes, I am not too worried about devices with a NUMA id property because
> they are less common and this series doesn't handle them at all, right?
> I imagine they would be treated like any other device without NUMA
> awareness.
> 
> I am thinking about the case where the memory of each NUMA node is made
> of multiple banks. I understand that this patch adds an explicit check
> for cases where these banks are interleaving, however there are many
> other cases where NUMA memory nodes are *not* interleaving but they are
> still made of multiple discontinuous banks, like in the two example
> above.
> 
> My question is whether this patch series in its current form can handle
> the two cases above correctly. If so, I am wondering how it works given
> that we only have a single "start" and "size" parameter per node.
> 
> On the other hand if this series cannot handle the two cases above, my
> question is whether it would fail explicitly or not. The new
> check is_node_memory_continuous doesn't seem to be able to catch them.


Looking at numa_update_node_memblks, it is clear that the code is meant
to increase the range of each numa node to cover even MMIO regions in
between memory banks. Also see the comment at the top of the file:

 * Assumes all memory regions belonging to a single proximity domain
 * are in one chunk. Holes between them will be included in the node.

So if there are multiple banks for each node, start and end are
stretched to cover the holes between them, and it works as long as
memory banks of different NUMA nodes don't interleave.

I would appreciate if you could add an in-code comment to explain this
on top of numa_update_node_memblk.

Have you had a chance to test this? If not it would be fantastic if you
could give it a quick test to make sure it works as intended: for
instance by creating multiple memory banks for each NUMA node by
splitting a real bank into two smaller banks with a hole in between in
device tree, just for the sake of testing.


* Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS
  2021-09-27  4:59           ` Stefano Stabellini
@ 2021-09-27  6:25             ` Julien Grall
  2021-09-27  6:46             ` Wei Chen
  1 sibling, 0 replies; 192+ messages in thread
From: Julien Grall @ 2021-09-27  6:25 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Chen, xen-devel, Bertrand Marquis, Jan Beulich,
	Roger Pau Monné,
	Andrew Cooper

On Mon, 27 Sep 2021, 06:59 Stefano Stabellini, <sstabellini@kernel.org>
wrote:

> +x86 maintainers
>
> On Mon, 27 Sep 2021, Wei Chen wrote:
> > > -----Original Message-----
> > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > Sent: 2021年9月27日 11:26
> > > To: Wei Chen <Wei.Chen@arm.com>
> > > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > <Bertrand.Marquis@arm.com>
> > > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> default
> > > NR_NODE_MEMBLKS
> > >
> > > On Sun, 26 Sep 2021, Wei Chen wrote:
> > > > > -----Original Message-----
> > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > Sent: 2021年9月24日 9:35
> > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> > > julien@xen.org;
> > > > > Bertrand Marquis <Bertrand.Marquis@arm.com>
> > > > > Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> > > default
> > > > > NR_NODE_MEMBLKS
> > > > >
> > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > As a memory range described in device tree cannot be split across
> > > > > > multiple nodes. So we define NR_NODE_MEMBLKS as NR_MEM_BANKS in
> > > > > > arch header.
> > > > >
> > > > > This statement is true but what is the goal of this patch? Is it to
> > > > > reduce code size and memory consumption?
> > > > >
> > > >
> > > > No, when Julien and I discussed this in last version[1], we hadn't
> > > thought
> > > > so deeply. We just thought a memory range described in DT cannot be
> > > split
> > > > across multiple nodes. So NR_NODE_MEMBLKS should be equal to
> NR_MEM_BANKS.
> > > >
> > > > https://lists.xenproject.org/archives/html/xen-devel/2021-
> > > 08/msg00974.html
> > > >
> > > > > I am asking because NR_MEM_BANKS is 128 and
> > > > > NR_NODE_MEMBLKS=2*MAX_NUMNODES which is 64 by default so again
> > > > > NR_NODE_MEMBLKS is 128 before this patch.
> > > > >
> > > > > In other words, this patch alone doesn't make any difference; at
> least
> > > > > doesn't make any difference unless CONFIG_NR_NUMA_NODES is
> increased.
> > > > >
> > > > > So, is the goal to reduce memory usage when CONFIG_NR_NUMA_NODES is
> > > > > higher than 64?
> > > > >
> > > >
> > > > I also thought about this problem when I was writing this patch.
> > > > CONFIG_NR_NUMA_NODES is increasing, but NR_MEM_BANKS is a fixed
> > > > value, then NR_MEM_BANKS can be smaller than CONFIG_NR_NUMA_NODES
> > > > at one point.
> > > >
> > > > But I agree with Julien's suggestion, NR_MEM_BANKS and
> NR_NODE_MEMBLKS
> > > > must be aware of each other. I had thought to add some ASSERT check,
> > > > but I don't know how to do it better. So I post this patch for more
> > > > suggestion.
> > >
> > > OK. In that case I'd say to get rid of the previous definition of
> > > NR_NODE_MEMBLKS as it is probably not necessary, see below.
> > >
> > >
> > >
> > > > >
> > > > > > And keep default NR_NODE_MEMBLKS in common header
> > > > > > for those architectures NUMA is disabled.
> > > > >
> > > > > This last sentence is not accurate: on x86 NUMA is enabled and
> > > > > NR_NODE_MEMBLKS is still defined in xen/include/xen/numa.h (there
> is
> > > no
> > > > > x86 definition of it)
> > > > >
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > > > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > > > > > ---
> > > > > >  xen/include/asm-arm/numa.h | 8 +++++++-
> > > > > >  xen/include/xen/numa.h     | 2 ++
> > > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/xen/include/asm-arm/numa.h
> b/xen/include/asm-arm/numa.h
> > > > > > index 8f1c67e3eb..21569e634b 100644
> > > > > > --- a/xen/include/asm-arm/numa.h
> > > > > > +++ b/xen/include/asm-arm/numa.h
> > > > > > @@ -3,9 +3,15 @@
> > > > > >
> > > > > >  #include <xen/mm.h>
> > > > > >
> > > > > > +#include <asm/setup.h>
> > > > > > +
> > > > > >  typedef u8 nodeid_t;
> > > > > >
> > > > > > -#ifndef CONFIG_NUMA
> > > > > > +#ifdef CONFIG_NUMA
> > > > > > +
> > > > > > +#define NR_NODE_MEMBLKS NR_MEM_BANKS
> > > > > > +
> > > > > > +#else
> > > > > >
> > > > > >  /* Fake one node for now. See also node_online_map. */
> > > > > >  #define cpu_to_node(cpu) 0
> > > > > > diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
> > > > > > index 1978e2be1b..1731e1cc6b 100644
> > > > > > --- a/xen/include/xen/numa.h
> > > > > > +++ b/xen/include/xen/numa.h
> > > > > > @@ -12,7 +12,9 @@
> > > > > >  #define MAX_NUMNODES    1
> > > > > >  #endif
> > > > > >
> > > > > > +#ifndef NR_NODE_MEMBLKS
> > > > > >  #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> > > > > > +#endif
> > >
> > > This one we can remove it completely right?
> >
> > How about define NR_MEM_BANKS to:
> > #ifdef CONFIG_NR_NUMA_NODES
> > #define NR_MEM_BANKS (CONFIG_NR_NUMA_NODES * 2)
> > #else
> > #define NR_MEM_BANKS 128
> > #endif
> > for both x86 and Arm. For those architectures do not support or enable
> > NUMA, they can still use "NR_MEM_BANKS 128". And replace all
> NR_NODE_MEMBLKS
> > in NUMA code to NR_MEM_BANKS to remove NR_NODE_MEMBLKS completely.
> > In this case, NR_MEM_BANKS can be aware of the changes of
> CONFIG_NR_NUMA_NODES.
>
> x86 doesn't have NR_MEM_BANKS as far as I can tell. I guess you also
> meant to rename NR_NODE_MEMBLKS to NR_MEM_BANKS?
>
> But NR_MEM_BANKS is not directly related to CONFIG_NR_NUMA_NODES because
> there can be many memory banks for each numa node, certainly more than
> 2. The existing definition on x86:
>
> #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
>
> Doesn't make a lot of sense to me. Was it just an arbitrary limit for
> the lack of a better way to set a maximum?
>
>
> On the other hand, NR_MEM_BANKS and NR_NODE_MEMBLKS seem to be related.
> In fact, what's the difference?
>
> NR_MEM_BANKS is the max number of memory banks (with or without
> numa-node-id).
>
> NR_NODE_MEMBLKS is the max number of memory banks with NUMA support
> (with numa-node-id)?
>
> They are basically the same thing. On ARM I would just do:
>
> #define NR_NODE_MEMBLKS MAX(NR_MEM_BANKS, (CONFIG_NR_NUMA_NODES * 2))
>

As you wrote above, the second part of the MAX is totally arbitrary. In
fact, it is very likely that if you have more than 64 nodes, you will
need a lot more than 2 regions per node.

So, for Arm, I would just define NR_NODE_MEMBLKS as an alias to
NR_MEM_BANKS so it can be used by common code.


* RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS
  2021-09-27  4:59           ` Stefano Stabellini
  2021-09-27  6:25             ` Julien Grall
@ 2021-09-27  6:46             ` Wei Chen
  2021-09-27  6:53               ` Wei Chen
  1 sibling, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-27  6:46 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, julien, Bertrand Marquis, jbeulich, roger.pau, andrew.cooper3

Hi Stefano, Julien,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 2021年9月27日 13:00
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> <Bertrand.Marquis@arm.com>; jbeulich@suse.com; roger.pau@citrix.com;
> andrew.cooper3@citrix.com
> Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default
> NR_NODE_MEMBLKS
> 
> +x86 maintainers
> 
> On Mon, 27 Sep 2021, Wei Chen wrote:
> > > -----Original Message-----
> > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > Sent: 2021年9月27日 11:26
> > > To: Wei Chen <Wei.Chen@arm.com>
> > > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > <Bertrand.Marquis@arm.com>
> > > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> default
> > > NR_NODE_MEMBLKS
> > >
> > > On Sun, 26 Sep 2021, Wei Chen wrote:
> > > > > -----Original Message-----
> > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > Sent: 2021年9月24日 9:35
> > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> > > julien@xen.org;
> > > > > Bertrand Marquis <Bertrand.Marquis@arm.com>
> > > > > Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> > > default
> > > > > NR_NODE_MEMBLKS
> > > > >
> > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > As a memory range described in device tree cannot be split
> across
> > > > > > multiple nodes. So we define NR_NODE_MEMBLKS as NR_MEM_BANKS in
> > > > > > arch header.
> > > > >
> > > > > This statement is true but what is the goal of this patch? Is it
> to
> > > > > reduce code size and memory consumption?
> > > > >
> > > >
> > > > No, when Julien and I discussed this in last version[1], we hadn't
> > > thought
> > > > so deeply. We just thought a memory range described in DT cannot be
> > > split
> > > > across multiple nodes. So NR_NODE_MEMBLKS should be equal to
> NR_MEM_BANKS.
> > > >
> > > > https://lists.xenproject.org/archives/html/xen-devel/2021-
> > > 08/msg00974.html
> > > >
> > > > > I am asking because NR_MEM_BANKS is 128 and
> > > > > NR_NODE_MEMBLKS=2*MAX_NUMNODES which is 64 by default so again
> > > > > NR_NODE_MEMBLKS is 128 before this patch.
> > > > >
> > > > > In other words, this patch alone doesn't make any difference; at
> least
> > > > > doesn't make any difference unless CONFIG_NR_NUMA_NODES is
> increased.
> > > > >
> > > > > So, is the goal to reduce memory usage when CONFIG_NR_NUMA_NODES
> is
> > > > > higher than 64?
> > > > >
> > > >
> > > > I also thought about this problem when I was writing this patch.
> > > > CONFIG_NR_NUMA_NODES is increasing, but NR_MEM_BANKS is a fixed
> > > > value, then NR_MEM_BANKS can be smaller than CONFIG_NR_NUMA_NODES
> > > > at one point.
> > > >
> > > > But I agree with Julien's suggestion, NR_MEM_BANKS and
> NR_NODE_MEMBLKS
> > > > must be aware of each other. I had thought to add some ASSERT check,
> > > > but I don't know how to do it better. So I post this patch for more
> > > > suggestion.
> > >
> > > OK. In that case I'd say to get rid of the previous definition of
> > > NR_NODE_MEMBLKS as it is probably not necessary, see below.
> > >
> > >
> > >
> > > > >
> > > > > > And keep default NR_NODE_MEMBLKS in common header
> > > > > > for those architectures NUMA is disabled.
> > > > >
> > > > > This last sentence is not accurate: on x86 NUMA is enabled and
> > > > > NR_NODE_MEMBLKS is still defined in xen/include/xen/numa.h (there
> is
> > > no
> > > > > x86 definition of it)
> > > > >
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > > > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > > > > > ---
> > > > > >  xen/include/asm-arm/numa.h | 8 +++++++-
> > > > > >  xen/include/xen/numa.h     | 2 ++
> > > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-
> arm/numa.h
> > > > > > index 8f1c67e3eb..21569e634b 100644
> > > > > > --- a/xen/include/asm-arm/numa.h
> > > > > > +++ b/xen/include/asm-arm/numa.h
> > > > > > @@ -3,9 +3,15 @@
> > > > > >
> > > > > >  #include <xen/mm.h>
> > > > > >
> > > > > > +#include <asm/setup.h>
> > > > > > +
> > > > > >  typedef u8 nodeid_t;
> > > > > >
> > > > > > -#ifndef CONFIG_NUMA
> > > > > > +#ifdef CONFIG_NUMA
> > > > > > +
> > > > > > +#define NR_NODE_MEMBLKS NR_MEM_BANKS
> > > > > > +
> > > > > > +#else
> > > > > >
> > > > > >  /* Fake one node for now. See also node_online_map. */
> > > > > >  #define cpu_to_node(cpu) 0
> > > > > > diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
> > > > > > index 1978e2be1b..1731e1cc6b 100644
> > > > > > --- a/xen/include/xen/numa.h
> > > > > > +++ b/xen/include/xen/numa.h
> > > > > > @@ -12,7 +12,9 @@
> > > > > >  #define MAX_NUMNODES    1
> > > > > >  #endif
> > > > > >
> > > > > > +#ifndef NR_NODE_MEMBLKS
> > > > > >  #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> > > > > > +#endif
> > >
> > > This one we can remove it completely right?
> >
> > How about define NR_MEM_BANKS to:
> > #ifdef CONFIG_NR_NUMA_NODES
> > #define NR_MEM_BANKS (CONFIG_NR_NUMA_NODES * 2)
> > #else
> > #define NR_MEM_BANKS 128
> > #endif
> > for both x86 and Arm. For those architectures do not support or enable
> > NUMA, they can still use "NR_MEM_BANKS 128". And replace all
> NR_NODE_MEMBLKS
> > in NUMA code to NR_MEM_BANKS to remove NR_NODE_MEMBLKS completely.
> > In this case, NR_MEM_BANKS can be aware of the changes of
> CONFIG_NR_NUMA_NODES.
> 
> x86 doesn't have NR_MEM_BANKS as far as I can tell. I guess you also
> meant to rename NR_NODE_MEMBLKS to NR_MEM_BANKS?
> 

Yes.

> But NR_MEM_BANKS is not directly related to CONFIG_NR_NUMA_NODES because
> there can be many memory banks for each numa node, certainly more than
> 2. The existing definition on x86:
> 
> #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> 
> Doesn't make a lot of sense to me. Was it just an arbitrary limit for
> the lack of a better way to set a maximum?
> 

At that time, this was probably the most cost-effective approach:
sufficient and simple. But if more nodes need to be supported in the
future, that may bring more memory blocks, and this maximum value might
no longer apply. The maximum may then need to support dynamic extension.

> 
> On the other hand, NR_MEM_BANKS and NR_NODE_MEMBLKS seem to be related.
> In fact, what's the difference?
> 
> NR_MEM_BANKS is the max number of memory banks (with or without
> numa-node-id).
> 
> NR_NODE_MEMBLKS is the max number of memory banks with NUMA support
> (with numa-node-id)?
> 
> They are basically the same thing. On ARM I would just do:
> 

Probably not: NR_MEM_BANKS counts memory ranges regardless of
numa-node-id during the boot memory parsing stage (process_memory_node
or the EFI parser), whereas NR_NODE_MEMBLKS only counts memory ranges
with a numa-node-id.
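To illustrate the distinction, a hypothetical device tree fragment
(addresses and sizes made up): both banks count toward NR_MEM_BANKS, but
only the second, which carries a numa-node-id, would count toward
NR_NODE_MEMBLKS:

```dts
/* Illustrative fragment only, not from a real platform. */
memory@0 {
        device_type = "memory";
        reg = <0x0 0x0 0x0 0x10000000>;
        /* no numa-node-id: a plain memory bank */
};

memory@50000000 {
        device_type = "memory";
        reg = <0x0 0x50000000 0x0 0x10000000>;
        numa-node-id = <0x1>;   /* a NUMA-annotated memory bank */
};
```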

> #define NR_NODE_MEMBLKS MAX(NR_MEM_BANKS, (CONFIG_NR_NUMA_NODES * 2))
> 
> 
> And maybe the definition could be common with x86 if we define
> NR_MEM_BANKS to 128 on x86 too.

Julien had a comment here; I will continue in that email.


* RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS
  2021-09-27  6:46             ` Wei Chen
@ 2021-09-27  6:53               ` Wei Chen
  2021-09-27  7:35                 ` Julien Grall
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-27  6:53 UTC (permalink / raw)
  To: Wei Chen, Stefano Stabellini
  Cc: xen-devel, julien, Bertrand Marquis, jbeulich, roger.pau, andrew.cooper3

Hi Julien,

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Wei
> Chen
> Sent: 2021年9月27日 14:46
> To: Stefano Stabellini <sstabellini@kernel.org>
> Cc: xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> <Bertrand.Marquis@arm.com>; jbeulich@suse.com; roger.pau@citrix.com;
> andrew.cooper3@citrix.com
> Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default
> NR_NODE_MEMBLKS
> 
> Hi Stefano, Julien,
> 
> > -----Original Message-----
> > From: Stefano Stabellini <sstabellini@kernel.org>
> > Sent: 2021年9月27日 13:00
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > <Bertrand.Marquis@arm.com>; jbeulich@suse.com; roger.pau@citrix.com;
> > andrew.cooper3@citrix.com
> > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default
> > NR_NODE_MEMBLKS
> >
> > +x86 maintainers
> >
> > On Mon, 27 Sep 2021, Wei Chen wrote:
> > > > -----Original Message-----
> > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > Sent: 2021年9月27日 11:26
> > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > > > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > > <Bertrand.Marquis@arm.com>
> > > > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> > default
> > > > NR_NODE_MEMBLKS
> > > >
> > > > On Sun, 26 Sep 2021, Wei Chen wrote:
> > > > > > -----Original Message-----
> > > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > > Sent: 2021年9月24日 9:35
> > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> > > > julien@xen.org;
> > > > > > Bertrand Marquis <Bertrand.Marquis@arm.com>
> > > > > > Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> > > > default
> > > > > > NR_NODE_MEMBLKS
> > > > > >
> > > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > > As a memory range described in device tree cannot be split
> > across
> > > > > > > multiple nodes. So we define NR_NODE_MEMBLKS as NR_MEM_BANKS
> in
> > > > > > > arch header.
> > > > > >
> > > > > > This statement is true but what is the goal of this patch? Is it
> > to
> > > > > > reduce code size and memory consumption?
> > > > > >
> > > > >
> > > > > No, when Julien and I discussed this in last version[1], we hadn't
> > > > thought
> > > > > so deeply. We just thought a memory range described in DT cannot
> be
> > > > split
> > > > > across multiple nodes. So NR_MEM_BANKS should be equal to
> > NR_MEM_BANKS.
> > > > >
> > > > > https://lists.xenproject.org/archives/html/xen-devel/2021-
> > > > 08/msg00974.html
> > > > >
> > > > > > I am asking because NR_MEM_BANKS is 128 and
> > > > > > NR_NODE_MEMBLKS=2*MAX_NUMNODES which is 64 by default so again
> > > > > > NR_NODE_MEMBLKS is 128 before this patch.
> > > > > >
> > > > > > In other words, this patch alone doesn't make any difference; at
> > least
> > > > > > doesn't make any difference unless CONFIG_NR_NUMA_NODES is
> > increased.
> > > > > >
> > > > > > So, is the goal to reduce memory usage when CONFIG_NR_NUMA_NODES
> > is
> > > > > > higher than 64?
> > > > > >
> > > > >
> > > > > I also thought about this problem when I was writing this patch.
> > > > > CONFIG_NR_NUMA_NODES is increasing, but NR_MEM_BANKS is a fixed
> > > > > value, then NR_MEM_BANKS can be smaller than CONFIG_NR_NUMA_NODES
> > > > > at one point.
> > > > >
> > > > > But I agree with Julien's suggestion, NR_MEM_BANKS and
> > NR_NODE_MEMBLKS
> > > > > must be aware of each other. I had thought to add some ASSERT
> check,
> > > > > but I don't know how to do it better. So I post this patch for
> more
> > > > > suggestion.
> > > >
> > > > OK. In that case I'd say to get rid of the previous definition of
> > > > NR_NODE_MEMBLKS as it is probably not necessary, see below.
> > > >
> > > >
> > > >
> > > > > >
> > > > > > > And keep default NR_NODE_MEMBLKS in common header
> > > > > > > for those architectures NUMA is disabled.
> > > > > >
> > > > > > This last sentence is not accurate: on x86 NUMA is enabled and
> > > > > > NR_NODE_MEMBLKS is still defined in xen/include/xen/numa.h
> (there
> > is
> > > > no
> > > > > > x86 definition of it)
> > > > > >
> > > > >
> > > > > Yes.
> > > > >
> > > > > >
> > > > > > > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > > > > > > ---
> > > > > > >  xen/include/asm-arm/numa.h | 8 +++++++-
> > > > > > >  xen/include/xen/numa.h     | 2 ++
> > > > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-
> > arm/numa.h
> > > > > > > index 8f1c67e3eb..21569e634b 100644
> > > > > > > --- a/xen/include/asm-arm/numa.h
> > > > > > > +++ b/xen/include/asm-arm/numa.h
> > > > > > > @@ -3,9 +3,15 @@
> > > > > > >
> > > > > > >  #include <xen/mm.h>
> > > > > > >
> > > > > > > +#include <asm/setup.h>
> > > > > > > +
> > > > > > >  typedef u8 nodeid_t;
> > > > > > >
> > > > > > > -#ifndef CONFIG_NUMA
> > > > > > > +#ifdef CONFIG_NUMA
> > > > > > > +
> > > > > > > +#define NR_NODE_MEMBLKS NR_MEM_BANKS
> > > > > > > +
> > > > > > > +#else
> > > > > > >
> > > > > > >  /* Fake one node for now. See also node_online_map. */
> > > > > > >  #define cpu_to_node(cpu) 0
> > > > > > > diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
> > > > > > > index 1978e2be1b..1731e1cc6b 100644
> > > > > > > --- a/xen/include/xen/numa.h
> > > > > > > +++ b/xen/include/xen/numa.h
> > > > > > > @@ -12,7 +12,9 @@
> > > > > > >  #define MAX_NUMNODES    1
> > > > > > >  #endif
> > > > > > >
> > > > > > > +#ifndef NR_NODE_MEMBLKS
> > > > > > >  #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> > > > > > > +#endif
> > > >
> > > > This one we can remove it completely right?
> > >
> > > How about define NR_MEM_BANKS to:
> > > #ifdef CONFIG_NR_NUMA_NODES
> > > #define NR_MEM_BANKS (CONFIG_NR_NUMA_NODES * 2)
> > > #else
> > > #define NR_MEM_BANKS 128
> > > #endif
> > > for both x86 and Arm. For those architectures do not support or enable
> > > NUMA, they can still use "NR_MEM_BANKS 128". And replace all
> > NR_NODE_MEMBLKS
> > > in NUMA code to NR_MEM_BANKS to remove NR_NODE_MEMBLKS completely.
> > > In this case, NR_MEM_BANKS can be aware of the changes of
> > CONFIG_NR_NUMA_NODES.
> >
> > x86 doesn't have NR_MEM_BANKS as far as I can tell. I guess you also
> > meant to rename NR_NODE_MEMBLKS to NR_MEM_BANKS?
> >
> 
> Yes.
> 
> > But NR_MEM_BANKS is not directly related to CONFIG_NR_NUMA_NODES because
> > there can be many memory banks for each numa node, certainly more than
> > 2. The existing definition on x86:
> >
> > #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> >
> > Doesn't make a lot of sense to me. Was it just an arbitrary limit for
> > the lack of a better way to set a maximum?
> >
> 
> At that time, this was probably the most cost-effective approach.
> Enough and easy. But, if more nodes need to be supported in the
> future, it may bring more memory blocks. And this maximum value
> might not apply. The maximum may need to support dynamic extension.
> 
> >
> > On the other hand, NR_MEM_BANKS and NR_NODE_MEMBLKS seem to be related.
> > In fact, what's the difference?
> >
> > NR_MEM_BANKS is the max number of memory banks (with or without
> > numa-node-id).
> >
> > NR_NODE_MEMBLKS is the max number of memory banks with NUMA support
> > (with numa-node-id)?
> >
> > They are basically the same thing. On ARM I would just do:
> >
> 
> Probably not, NR_MEM_BANKS will count those memory ranges without
> numa-node-id in boot memory parsing stage (process_memory_node or
> EFI parser). But NR_NODE_MEMBLKS will only count those memory ranges
> with numa-node-id.
> 
> > #define NR_NODE_MEMBLKS MAX(NR_MEM_BANKS, (CONFIG_NR_NUMA_NODES * 2))
> >
> >

Quoting Julien's comment from the HTML email here:
" As you wrote above, the second part of the MAX is totally arbitrary.
In fact, it is very likely that if you have more than 64 nodes, you may
need a lot more than 2 regions per node.

So, for Arm, I would just define NR_NODE_MEMBLKS as an alias to NR_MEM_BANKS
so it can be used by common code.
"

But here comes the problem:
How can we set the NR_MEM_BANKS maximum value? 128 seems arbitrary too.
If we #define NR_MEM_BANKS (CONFIG_NR_NUMA_NODES * N), what should N be?

> > And maybe the definition could be common with x86 if we define
> > NR_MEM_BANKS to 128 on x86 too.
> 
> Julien had comment here, I will continue in that email.


* Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS
  2021-09-27  6:53               ` Wei Chen
@ 2021-09-27  7:35                 ` Julien Grall
  2021-09-27 10:21                   ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Julien Grall @ 2021-09-27  7:35 UTC (permalink / raw)
  To: Wei Chen
  Cc: Stefano Stabellini, xen-devel, Bertrand Marquis, Jan Beulich,
	Roger Pau Monné,
	Andrew Cooper


On Mon, 27 Sep 2021, 08:53 Wei Chen, <Wei.Chen@arm.com> wrote:

> Hi Julien,
>
> > -----Original Message-----
> > From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> Wei
> > Chen
> > Sent: 2021年9月27日 14:46
> > To: Stefano Stabellini <sstabellini@kernel.org>
> > Cc: xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > <Bertrand.Marquis@arm.com>; jbeulich@suse.com; roger.pau@citrix.com;
> > andrew.cooper3@citrix.com
> > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default
> > NR_NODE_MEMBLKS
> >
> > Hi Stefano, Julien,
> >
> > > -----Original Message-----
> > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > Sent: 2021年9月27日 13:00
> > > To: Wei Chen <Wei.Chen@arm.com>
> > > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > <Bertrand.Marquis@arm.com>; jbeulich@suse.com; roger.pau@citrix.com;
> > > andrew.cooper3@citrix.com
> > > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> default
> > > NR_NODE_MEMBLKS
> > >
> > > +x86 maintainers
> > >
> > > On Mon, 27 Sep 2021, Wei Chen wrote:
> > > > > -----Original Message-----
> > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > Sent: 2021年9月27日 11:26
> > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > > > > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > > > <Bertrand.Marquis@arm.com>
> > > > > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> > > default
> > > > > NR_NODE_MEMBLKS
> > > > >
> > > > > On Sun, 26 Sep 2021, Wei Chen wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > > > Sent: 2021年9月24日 9:35
> > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> > > > > julien@xen.org;
> > > > > > > Bertrand Marquis <Bertrand.Marquis@arm.com>
> > > > > > > Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to
> override
> > > > > default
> > > > > > > NR_NODE_MEMBLKS
> > > > > > >
> > > > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > > > As a memory range described in device tree cannot be split
> > > across
> > > > > > > > multiple nodes. So we define NR_NODE_MEMBLKS as NR_MEM_BANKS
> > in
> > > > > > > > arch header.
> > > > > > >
> > > > > > > This statement is true but what is the goal of this patch? Is
> it
> > > to
> > > > > > > reduce code size and memory consumption?
> > > > > > >
> > > > > >
> > > > > > No, when Julien and I discussed this in last version[1], we
> hadn't
> > > > > thought
> > > > > > so deeply. We just thought a memory range described in DT cannot
> > be
> > > > > split
> > > > > > across multiple nodes. So NR_MEM_BANKS should be equal to
> > > NR_MEM_BANKS.
> > > > > >
> > > > > > https://lists.xenproject.org/archives/html/xen-devel/2021-
> > > > > 08/msg00974.html
> > > > > >
> > > > > > > I am asking because NR_MEM_BANKS is 128 and
> > > > > > > NR_NODE_MEMBLKS=2*MAX_NUMNODES which is 64 by default so again
> > > > > > > NR_NODE_MEMBLKS is 128 before this patch.
> > > > > > >
> > > > > > > In other words, this patch alone doesn't make any difference;
> at
> > > least
> > > > > > > doesn't make any difference unless CONFIG_NR_NUMA_NODES is
> > > increased.
> > > > > > >
> > > > > > > So, is the goal to reduce memory usage when
> CONFIG_NR_NUMA_NODES
> > > is
> > > > > > > higher than 64?
> > > > > > >
> > > > > >
> > > > > > I also thought about this problem when I was writing this patch.
> > > > > > CONFIG_NR_NUMA_NODES is increasing, but NR_MEM_BANKS is a fixed
> > > > > > value, then NR_MEM_BANKS can be smaller than CONFIG_NR_NUMA_NODES
> > > > > > at one point.
> > > > > >
> > > > > > But I agree with Julien's suggestion, NR_MEM_BANKS and
> > > NR_NODE_MEMBLKS
> > > > > > must be aware of each other. I had thought to add some ASSERT
> > check,
> > > > > > but I don't know how to do it better. So I post this patch for
> > more
> > > > > > suggestion.
> > > > >
> > > > > OK. In that case I'd say to get rid of the previous definition of
> > > > > NR_NODE_MEMBLKS as it is probably not necessary, see below.
> > > > >
> > > > >
> > > > >
> > > > > > >
> > > > > > > > And keep default NR_NODE_MEMBLKS in common header
> > > > > > > > for those architectures NUMA is disabled.
> > > > > > >
> > > > > > > This last sentence is not accurate: on x86 NUMA is enabled and
> > > > > > > NR_NODE_MEMBLKS is still defined in xen/include/xen/numa.h
> > (there
> > > is
> > > > > no
> > > > > > > x86 definition of it)
> > > > > > >
> > > > > >
> > > > > > Yes.
> > > > > >
> > > > > > >
> > > > > > > > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > > > > > > > ---
> > > > > > > >  xen/include/asm-arm/numa.h | 8 +++++++-
> > > > > > > >  xen/include/xen/numa.h     | 2 ++
> > > > > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> > > > > > > >
> > > > > > > > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-
> > > arm/numa.h
> > > > > > > > index 8f1c67e3eb..21569e634b 100644
> > > > > > > > --- a/xen/include/asm-arm/numa.h
> > > > > > > > +++ b/xen/include/asm-arm/numa.h
> > > > > > > > @@ -3,9 +3,15 @@
> > > > > > > >
> > > > > > > >  #include <xen/mm.h>
> > > > > > > >
> > > > > > > > +#include <asm/setup.h>
> > > > > > > > +
> > > > > > > >  typedef u8 nodeid_t;
> > > > > > > >
> > > > > > > > -#ifndef CONFIG_NUMA
> > > > > > > > +#ifdef CONFIG_NUMA
> > > > > > > > +
> > > > > > > > +#define NR_NODE_MEMBLKS NR_MEM_BANKS
> > > > > > > > +
> > > > > > > > +#else
> > > > > > > >
> > > > > > > >  /* Fake one node for now. See also node_online_map. */
> > > > > > > >  #define cpu_to_node(cpu) 0
> > > > > > > > diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
> > > > > > > > index 1978e2be1b..1731e1cc6b 100644
> > > > > > > > --- a/xen/include/xen/numa.h
> > > > > > > > +++ b/xen/include/xen/numa.h
> > > > > > > > @@ -12,7 +12,9 @@
> > > > > > > >  #define MAX_NUMNODES    1
> > > > > > > >  #endif
> > > > > > > >
> > > > > > > > +#ifndef NR_NODE_MEMBLKS
> > > > > > > >  #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> > > > > > > > +#endif
> > > > >
> > > > > This one we can remove it completely right?
> > > >
> > > > How about define NR_MEM_BANKS to:
> > > > #ifdef CONFIG_NR_NUMA_NODES
> > > > #define NR_MEM_BANKS (CONFIG_NR_NUMA_NODES * 2)
> > > > #else
> > > > #define NR_MEM_BANKS 128
> > > > #endif
> > > > for both x86 and Arm. For those architectures do not support or
> enable
> > > > NUMA, they can still use "NR_MEM_BANKS 128". And replace all
> > > NR_NODE_MEMBLKS
> > > > in NUMA code to NR_MEM_BANKS to remove NR_NODE_MEMBLKS completely.
> > > > In this case, NR_MEM_BANKS can be aware of the changes of
> > > CONFIG_NR_NUMA_NODES.
> > >
> > > x86 doesn't have NR_MEM_BANKS as far as I can tell. I guess you also
> > > meant to rename NR_NODE_MEMBLKS to NR_MEM_BANKS?
> > >
> >
> > Yes.
> >
> > > But NR_MEM_BANKS is not directly related to CONFIG_NR_NUMA_NODES
> because
> > > there can be many memory banks for each numa node, certainly more than
> > > 2. The existing definition on x86:
> > >
> > > #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> > >
> > > Doesn't make a lot of sense to me. Was it just an arbitrary limit for
> > > the lack of a better way to set a maximum?
> > >
> >
> > At that time, this was probably the most cost-effective approach.
> > Enough and easy. But, if more nodes need to be supported in the
> > future, it may bring more memory blocks. And this maximum value
> > might not apply. The maximum may need to support dynamic extension.
> >
> > >
> > > On the other hand, NR_MEM_BANKS and NR_NODE_MEMBLKS seem to be related.
> > > In fact, what's the difference?
> > >
> > > NR_MEM_BANKS is the max number of memory banks (with or without
> > > numa-node-id).
> > >
> > > NR_NODE_MEMBLKS is the max number of memory banks with NUMA support
> > > (with numa-node-id)?
> > >
> > > They are basically the same thing. On ARM I would just do:
> > >
> >
> > Probably not, NR_MEM_BANKS will count those memory ranges without
> > numa-node-id in boot memory parsing stage (process_memory_node or
> > EFI parser). But NR_NODE_MEMBLKS will only count those memory ranges
> > with numa-node-id.
> >
> > > #define NR_NODE_MEMBLKS MAX(NR_MEM_BANKS, (CONFIG_NR_NUMA_NODES * 2))
> > >
> > >
>
> Quote Julien's comment from HTML email to here:
> " As you wrote above, the second part of the MAX is totally arbitrary.
> In fact, it is very likely that if you have more than 64 nodes, you may
> need a lot more than 2 regions per node.
>
> So, for Arm, I would just define NR_NODE_MEMBLKS as an alias to
> NR_MEM_BANKS
> so it can be used by common code.
> "
>
> But here comes the problem:
> How can we set the NR_MEM_BANKS maximum value, 128 seems an arbitrary too?
>

This is based on the hardware we currently support (the last time we bumped
the value was, IIRC, for Thunder-X). When booting via UEFI, we can get a
lot of small ranges as we discover the RAM from the UEFI memory map.

> If #define NR_MEM_BANKS (CONFIG_NR_NUMA_NODES * N)? And what N should be?


N would have to be the maximum number of ranges you can find in a NUMA node.

We would also need to make sure this doesn't break existing platforms. So N
would have to be quite large or we need a MAX as Stefano suggested.

But I would prefer to keep the existing 128 and allow it to be configured at
build time (not necessarily in this series). This avoids having different
ways to define the value based on NUMA vs non-NUMA.
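A build-time knob along these lines could work (a sketch only, not part of this series; the prompt text and help wording are hypothetical):

```kconfig
# Sketch: keep 128 as the default but make the limit tunable at build
# time, independent of whether NUMA is enabled.
config NR_MEM_BANKS
	int "Maximum number of memory banks"
	default 128
	help
	  Maximum number of memory banks Xen can handle. The default
	  covers currently supported hardware; raise it for platforms
	  where UEFI reports a more fragmented memory map.
```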


> > > And maybe the definition could be common with x86 if we define
> > > NR_MEM_BANKS to 128 on x86 too.
> >
> > Julien had comment here, I will continue in that email.
>



* Re: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA
  2021-09-24 19:39       ` Stefano Stabellini
@ 2021-09-27  8:33         ` Jan Beulich
  2021-09-27  8:45           ` Julien Grall
  0 siblings, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2021-09-27  8:33 UTC (permalink / raw)
  To: Stefano Stabellini, Wei Chen; +Cc: xen-devel, julien, Bertrand.Marquis

On 24.09.2021 21:39, Stefano Stabellini wrote:
> On Fri, 24 Sep 2021, Wei Chen wrote:
>> On 2021/9/24 11:31, Stefano Stabellini wrote:
>>> On Thu, 23 Sep 2021, Wei Chen wrote:
>>>> --- a/xen/arch/arm/Kconfig
>>>> +++ b/xen/arch/arm/Kconfig
>>>> @@ -34,6 +34,17 @@ config ACPI
>>>>   	  Advanced Configuration and Power Interface (ACPI) support for Xen is
>>>>   	  an alternative to device tree on ARM64.
>>>>   + config DEVICE_TREE_NUMA
>>>> +	def_bool n
>>>> +	select NUMA
>>>> +
>>>> +config ARM_NUMA
>>>> +	bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)" if
>>>> UNSUPPORTED
>>>> +	select DEVICE_TREE_NUMA if HAS_DEVICE_TREE
>>>
>>> Should it be: depends on HAS_DEVICE_TREE ?
>>> (And eventually depends on HAS_DEVICE_TREE || ACPI)
>>>
>>
>> As the discussion in RFC [1]. We want to make ARM_NUMA as a generic
>> option can be selected by users. And depends on has_device_tree
>> or ACPI to select DEVICE_TREE_NUMA or ACPI_NUMA.
>>
>> If we add HAS_DEVICE_TREE || ACPI as dependencies for ARM_NUMA,
>> does it become a loop dependency?
>>
>> https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00888.html
> 
> OK, I am fine with that. I was just trying to catch the case where a
> user selects "ARM_NUMA" but actually neither ACPI nor HAS_DEVICE_TREE
> are selected so nothing happens. I was trying to make it clear that
> ARM_NUMA depends on having at least one between HAS_DEVICE_TREE or ACPI
> because otherwise it is not going to work.
> 
> That said, I don't think this is important because HAS_DEVICE_TREE
> cannot be unselected. So if we cannot find a way to express the
> dependency, I think it is fine to keep the patch as is.

So how about doing things the other way around: ARM_NUMA has no prompt
and defaults to ACPI_NUMA || DT_NUMA, and DT_NUMA gains a prompt instead
(and, for Arm at least, ACPI_NUMA as well; this might even be worthwhile
to have on x86 down the road).
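Roughly like this (a sketch only; the symbol names are hypothetical, and ACPI_NUMA does not currently exist):

```kconfig
# The firmware-specific options carry the prompts...
config DT_NUMA
	bool "NUMA support via Device-Tree (UNSUPPORTED)" if UNSUPPORTED
	depends on HAS_DEVICE_TREE
	select NUMA

config ACPI_NUMA
	bool "NUMA support via ACPI (UNSUPPORTED)" if UNSUPPORTED
	depends on ACPI
	select NUMA

# ...and ARM_NUMA is derived, with no prompt of its own.
config ARM_NUMA
	def_bool DT_NUMA || ACPI_NUMA
```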

Jan




* Re: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA
  2021-09-27  8:33         ` Jan Beulich
@ 2021-09-27  8:45           ` Julien Grall
  2021-09-27  9:17             ` Jan Beulich
  0 siblings, 1 reply; 192+ messages in thread
From: Julien Grall @ 2021-09-27  8:45 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Stefano Stabellini, Wei Chen, xen-devel, Bertrand Marquis


On Mon, 27 Sep 2021, 10:33 Jan Beulich, <jbeulich@suse.com> wrote:

> On 24.09.2021 21:39, Stefano Stabellini wrote:
> > On Fri, 24 Sep 2021, Wei Chen wrote:
> >> On 2021/9/24 11:31, Stefano Stabellini wrote:
> >>> On Thu, 23 Sep 2021, Wei Chen wrote:
> >>>> --- a/xen/arch/arm/Kconfig
> >>>> +++ b/xen/arch/arm/Kconfig
> >>>> @@ -34,6 +34,17 @@ config ACPI
> >>>>      Advanced Configuration and Power Interface (ACPI) support for
> Xen is
> >>>>      an alternative to device tree on ARM64.
> >>>>   + config DEVICE_TREE_NUMA
> >>>> +  def_bool n
> >>>> +  select NUMA
> >>>> +
> >>>> +config ARM_NUMA
> >>>> +  bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)"
> if
> >>>> UNSUPPORTED
> >>>> +  select DEVICE_TREE_NUMA if HAS_DEVICE_TREE
> >>>
> >>> Should it be: depends on HAS_DEVICE_TREE ?
> >>> (And eventually depends on HAS_DEVICE_TREE || ACPI)
> >>>
> >>
> >> As the discussion in RFC [1]. We want to make ARM_NUMA as a generic
> >> option can be selected by users. And depends on has_device_tree
> >> or ACPI to select DEVICE_TREE_NUMA or ACPI_NUMA.
> >>
> >> If we add HAS_DEVICE_TREE || ACPI as dependencies for ARM_NUMA,
> >> does it become a loop dependency?
> >>
> >>
> https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00888.html
> >
> > OK, I am fine with that. I was just trying to catch the case where a
> > user selects "ARM_NUMA" but actually neither ACPI nor HAS_DEVICE_TREE
> > are selected so nothing happens. I was trying to make it clear that
> > ARM_NUMA depends on having at least one between HAS_DEVICE_TREE or ACPI
> > because otherwise it is not going to work.
> >
> > That said, I don't think this is important because HAS_DEVICE_TREE
> > cannot be unselected. So if we cannot find a way to express the
> > dependency, I think it is fine to keep the patch as is.
>
> So how about doing things the other way around: ARM_NUMA has no prompt
> and defaults to ACPI_NUMA || DT_NUMA, and DT_NUMA gains a prompt instead
> (and, for Arm at least, ACPI_NUMA as well; this might even be worthwhile
> to have on x86 down the road).
>

As I wrote before, I don't think the user should say "I want to enable NUMA
with Device-Tree or ACPI". Instead, they should say whether they want to use
NUMA and let Xen decide whether to enable the DT/ACPI support.

In other words, the prompt should stay on ARM_NUMA.
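Concretely, something along these lines keeps the single user-facing prompt (a sketch only; the ACPI_NUMA wiring is hypothetical and not in the series):

```kconfig
# The only user-visible prompt: "do you want NUMA?"
config ARM_NUMA
	bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)" if UNSUPPORTED
	# Xen picks the firmware-specific backend automatically.
	select DEVICE_TREE_NUMA if HAS_DEVICE_TREE
	select ACPI_NUMA if ACPI

# Internal helpers with no prompt; never offered to the user directly.
config DEVICE_TREE_NUMA
	def_bool n
	select NUMA

config ACPI_NUMA
	def_bool n
	select NUMA
```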

Cheers,


> Jan
>
>



* Re: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA
  2021-09-27  8:45           ` Julien Grall
@ 2021-09-27  9:17             ` Jan Beulich
  2021-09-27 17:17               ` Stefano Stabellini
  0 siblings, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2021-09-27  9:17 UTC (permalink / raw)
  To: Julien Grall; +Cc: Stefano Stabellini, Wei Chen, xen-devel, Bertrand Marquis

On 27.09.2021 10:45, Julien Grall wrote:
> On Mon, 27 Sep 2021, 10:33 Jan Beulich, <jbeulich@suse.com> wrote:
> 
>> On 24.09.2021 21:39, Stefano Stabellini wrote:
>>> On Fri, 24 Sep 2021, Wei Chen wrote:
>>>> On 2021/9/24 11:31, Stefano Stabellini wrote:
>>>>> On Thu, 23 Sep 2021, Wei Chen wrote:
>>>>>> --- a/xen/arch/arm/Kconfig
>>>>>> +++ b/xen/arch/arm/Kconfig
>>>>>> @@ -34,6 +34,17 @@ config ACPI
>>>>>>      Advanced Configuration and Power Interface (ACPI) support for
>> Xen is
>>>>>>      an alternative to device tree on ARM64.
>>>>>>   + config DEVICE_TREE_NUMA
>>>>>> +  def_bool n
>>>>>> +  select NUMA
>>>>>> +
>>>>>> +config ARM_NUMA
>>>>>> +  bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)"
>> if
>>>>>> UNSUPPORTED
>>>>>> +  select DEVICE_TREE_NUMA if HAS_DEVICE_TREE
>>>>>
>>>>> Should it be: depends on HAS_DEVICE_TREE ?
>>>>> (And eventually depends on HAS_DEVICE_TREE || ACPI)
>>>>>
>>>>
>>>> As the discussion in RFC [1]. We want to make ARM_NUMA as a generic
>>>> option can be selected by users. And depends on has_device_tree
>>>> or ACPI to select DEVICE_TREE_NUMA or ACPI_NUMA.
>>>>
>>>> If we add HAS_DEVICE_TREE || ACPI as dependencies for ARM_NUMA,
>>>> does it become a loop dependency?
>>>>
>>>>
>> https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00888.html
>>>
>>> OK, I am fine with that. I was just trying to catch the case where a
>>> user selects "ARM_NUMA" but actually neither ACPI nor HAS_DEVICE_TREE
>>> are selected so nothing happens. I was trying to make it clear that
>>> ARM_NUMA depends on having at least one between HAS_DEVICE_TREE or ACPI
>>> because otherwise it is not going to work.
>>>
>>> That said, I don't think this is important because HAS_DEVICE_TREE
>>> cannot be unselected. So if we cannot find a way to express the
>>> dependency, I think it is fine to keep the patch as is.
>>
>> So how about doing things the other way around: ARM_NUMA has no prompt
>> and defaults to ACPI_NUMA || DT_NUMA, and DT_NUMA gains a prompt instead
>> (and, for Arm at least, ACPI_NUMA as well; this might even be worthwhile
>> to have on x86 down the road).
>>
> 
> As I wrote before, I don't think the user should say "I want to enable NUMA
> with Device-Tree or ACPI". Instead, they say whether they want to use NUMA
> and let Xen decide to enable the DT/ACPI support.
> 
> In other word, the prompt should stay on ARM_NUMA.

Okay. In that case I'm confused by Stefano's question.

Jan




* RE: [PATCH 08/37] xen/x86: add detection of discontinous node memory range
  2021-09-27  5:05             ` Stefano Stabellini
@ 2021-09-27  9:50               ` Wei Chen
  2021-09-27 17:19                 ` Stefano Stabellini
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-27  9:50 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, julien, Bertrand Marquis, jbeulich, andrew.cooper3,
	roger.pau, wl

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 2021年9月27日 13:05
> To: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Wei Chen <Wei.Chen@arm.com>; xen-devel@lists.xenproject.org;
> julien@xen.org; Bertrand Marquis <Bertrand.Marquis@arm.com>;
> jbeulich@suse.com; andrew.cooper3@citrix.com; roger.pau@citrix.com;
> wl@xen.org
> Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> memory range
> 
> On Sun, 26 Sep 2021, Stefano Stabellini wrote:
> > On Sun, 26 Sep 2021, Wei Chen wrote:
> > > > -----Original Message-----
> > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > Sent: 2021年9月25日 3:53
> > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > > > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > > <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> andrew.cooper3@citrix.com;
> > > > roger.pau@citrix.com; wl@xen.org
> > > > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous
> node
> > > > memory range
> > > >
> > > > On Fri, 24 Sep 2021, Wei Chen wrote:
> > > > > > -----Original Message-----
> > > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > > Sent: 2021年9月24日 8:26
> > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> > > > julien@xen.org;
> > > > > > Bertrand Marquis <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > > > > > andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> > > > > > Subject: Re: [PATCH 08/37] xen/x86: add detection of
> discontinous node
> > > > > > memory range
> > > > > >
> > > > > > CC'ing x86 maintainers
> > > > > >
> > > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > > One NUMA node may contain several memory blocks. In current
> Xen
> > > > > > > code, Xen will maintain a node memory range for each node to
> cover
> > > > > > > all its memory blocks. But here comes the problem, in the gap
> of
> > > > > > > one node's two memory blocks, if there are some memory blocks
> don't
> > > > > > > belong to this node (remote memory blocks). This node's memory
> range
> > > > > > > will be expanded to cover these remote memory blocks.
> > > > > > >
> > > > > > > One node's memory range contains othe nodes' memory, this is
> > > > obviously
> > > > > > > not very reasonable. This means current NUMA code only can
> support
> > > > > > > node has continous memory blocks. However, on a physical
> machine,
> > > > the
> > > > > > > addresses of multiple nodes can be interleaved.
> > > > > > >
> > > > > > > So in this patch, we add code to detect discontinous memory
> blocks
> > > > > > > for one node. NUMA initializtion will be failed and error
> messages
> > > > > > > will be printed when Xen detect such hardware configuration.
> > > > > >
> > > > > > At least on ARM, it is not just memory that can be interleaved,
> but
> > > > also
> > > > > > MMIO regions. For instance:
> > > > > >
> > > > > > node0 bank0 0-0x1000000
> > > > > > MMIO 0x1000000-0x1002000
> > > > > > Hole 0x1002000-0x2000000
> > > > > > node0 bank1 0x2000000-0x3000000
> > > > > >
> > > > > > So I am not familiar with the SRAT format, but I think on ARM
> the
> > > > check
> > > > > > would look different: we would just look for multiple memory
> ranges
> > > > > > under a device_type = "memory" node of a NUMA node in device
> tree.
> > > > > >
> > > > > >
> > > > >
> > > > > Should I need to include/refine above message to commit log?
> > > >
> > > > Let me ask you a question first.
> > > >
> > > > With the NUMA implementation of this patch series, can we deal with
> > > > cases where each node has multiple memory banks, not interleaved?
> > >
> > > Yes.
> > >
> > > > An an example:
> > > >
> > > > node0: 0x0        - 0x10000000
> > > > MMIO : 0x10000000 - 0x20000000
> > > > node0: 0x20000000 - 0x30000000
> > > > MMIO : 0x30000000 - 0x50000000
> > > > node1: 0x50000000 - 0x60000000
> > > > MMIO : 0x60000000 - 0x80000000
> > > > node2: 0x80000000 - 0x90000000
> > > >
> > > >
> > > > I assume we can deal with this case simply by setting node0 memory
> to
> > > > 0x0-0x30000000 even if there is actually something else, a device,
> that
> > > > doesn't belong to node0 in between the two node0 banks?
> > >
> > > While this configuration is rare in SoC design, but it is not
> impossible.
> >
> > Definitely, I have seen it before.
> >
> >
> > > > Is it only other nodes' memory interleaved that cause issues? In
> other
> > > > words, only the following is a problematic scenario?
> > > >
> > > > node0: 0x0        - 0x10000000
> > > > MMIO : 0x10000000 - 0x20000000
> > > > node1: 0x20000000 - 0x30000000
> > > > MMIO : 0x30000000 - 0x50000000
> > > > node0: 0x50000000 - 0x60000000
> > > >
> > > > Because node1 is in between the two ranges of node0?
> > > >
> > >
> > > But only device_type="memory" can be added to allocation.
> > > For mmio there are two cases:
> > > 1. mmio doesn't have NUMA id property.
> > > 2. mmio has NUMA id property, just like some PCIe controllers.
> > >    But we don’t need to handle these kinds of MMIO devices
> > >    in memory block parsing. Because we don't need to allocate
> > >    memory from these mmio ranges. And for accessing, we need
> > >    a NUMA-aware PCIe controller driver or a generic NUMA-aware
> > >    MMIO accessing APIs.
> >
> > Yes, I am not too worried about devices with a NUMA id property because
> > they are less common and this series doesn't handle them at all, right?
> > I imagine they would be treated like any other device without NUMA
> > awareness.
> >
> > I am thinking about the case where the memory of each NUMA node is made
> > of multiple banks. I understand that this patch adds an explicit check
> > for cases where these banks are interleaving, however there are many
> > other cases where NUMA memory nodes are *not* interleaving but they are
> > still made of multiple discontinuous banks, like in the two example
> > above.
> >
> > My question is whether this patch series in its current form can handle
> > the two cases above correctly. If so, I am wondering how it works given
> > that we only have a single "start" and "size" parameter per node.
> >
> > On the other hand if this series cannot handle the two cases above, my
> > question is whether it would fail explicitly or not. The new
> > check is_node_memory_continuous doesn't seem to be able to catch them.
> 
> 
> Looking at numa_update_node_memblks, it is clear that the code is meant
> to increase the range of each numa node to cover even MMIO regions in
> between memory banks. Also see the comment at the top of the file:
> 
>  * Assumes all memory regions belonging to a single proximity domain
>  * are in one chunk. Holes between them will be included in the node.
> 
> So if there are multiple banks for each node, start and end are
> stretched to cover the holes between them, and it works as long as
> memory banks of different NUMA nodes don't interleave.
> 
> I would appreciate if you could add an in-code comment to explain this
> on top of numa_update_node_memblk.

Yes, I will do it.

> 
> Have you had a chance to test this? If not it would be fantastic if you
> could give it a quick test to make sure it works as intended: for
> instance by creating multiple memory banks for each NUMA node by
> splitting an real bank into two smaller banks with a hole in between in
> device tree, just for the sake of testing.

Yes, I have created some fake NUMA nodes in the FVP device tree to test it.
The interleaving of nodes' address ranges can be detected:

(XEN) SRAT: Node 0 0000000080000000-00000000ff000000
(XEN) SRAT: Node 1 0000000880000000-00000008c0000000
(XEN) NODE 0: (0000000080000000-00000008d0000000) intertwine with NODE 1 (0000000880000000-00000008c0000000)


^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS
  2021-09-27  7:35                 ` Julien Grall
@ 2021-09-27 10:21                   ` Wei Chen
  2021-09-27 10:39                     ` Julien Grall
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-27 10:21 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, xen-devel, Bertrand Marquis, Jan Beulich,
	Roger Pau Monné,
	Andrew Cooper

Hi Julien,

From: Julien Grall <julien.grall.oss@gmail.com> 
Sent: 2021年9月27日 15:36
To: Wei Chen <Wei.Chen@arm.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-devel <xen-devel@lists.xenproject.org>; Bertrand Marquis <Bertrand.Marquis@arm.com>; Jan Beulich <jbeulich@suse.com>; Roger Pau Monné <roger.pau@citrix.com>; Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS


On Mon, 27 Sep 2021, 08:53 Wei Chen, <mailto:Wei.Chen@arm.com> wrote:
Hi Julien,

> -----Original Message-----
> From: Xen-devel <mailto:xen-devel-bounces@lists.xenproject.org> On Behalf Of Wei
> Chen
> Sent: 2021年9月27日 14:46
> To: Stefano Stabellini <mailto:sstabellini@kernel.org>
> Cc: mailto:xen-devel@lists.xenproject.org; mailto:julien@xen.org; Bertrand Marquis
> <mailto:Bertrand.Marquis@arm.com>; mailto:jbeulich@suse.com; mailto:roger.pau@citrix.com;
> mailto:andrew.cooper3@citrix.com
> Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default
> NR_NODE_MEMBLKS
> 
> Hi Stefano, Julien,
> 
> > -----Original Message-----
> > From: Stefano Stabellini <mailto:sstabellini@kernel.org>
> > Sent: 2021年9月27日 13:00
> > To: Wei Chen <mailto:Wei.Chen@arm.com>
> > Cc: Stefano Stabellini <mailto:sstabellini@kernel.org>; xen-
> > mailto:devel@lists.xenproject.org; mailto:julien@xen.org; Bertrand Marquis
> > <mailto:Bertrand.Marquis@arm.com>; mailto:jbeulich@suse.com; mailto:roger.pau@citrix.com;
> > mailto:andrew.cooper3@citrix.com
> > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default
> > NR_NODE_MEMBLKS
> >
> > +x86 maintainers
> >
> > On Mon, 27 Sep 2021, Wei Chen wrote:
> > > > -----Original Message-----
> > > > From: Stefano Stabellini <mailto:sstabellini@kernel.org>
> > > > Sent: 2021年9月27日 11:26
> > > > To: Wei Chen <mailto:Wei.Chen@arm.com>
> > > > Cc: Stefano Stabellini <mailto:sstabellini@kernel.org>; xen-
> > > > mailto:devel@lists.xenproject.org; mailto:julien@xen.org; Bertrand Marquis
> > > > <mailto:Bertrand.Marquis@arm.com>
> > > > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> > default
> > > > NR_NODE_MEMBLKS
> > > >
> > > > On Sun, 26 Sep 2021, Wei Chen wrote:
> > > > > > -----Original Message-----
> > > > > > From: Stefano Stabellini <mailto:sstabellini@kernel.org>
> > > > > > Sent: 2021年9月24日 9:35
> > > > > > To: Wei Chen <mailto:Wei.Chen@arm.com>
> > > > > > Cc: mailto:xen-devel@lists.xenproject.org; mailto:sstabellini@kernel.org;
> > > > mailto:julien@xen.org;
> > > > > > Bertrand Marquis <mailto:Bertrand.Marquis@arm.com>
> > > > > > Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> > > > default
> > > > > > NR_NODE_MEMBLKS
> > > > > >
> > > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > > As a memory range described in device tree cannot be split
> > across
> > > > > > > multiple nodes. So we define NR_NODE_MEMBLKS as NR_MEM_BANKS
> in
> > > > > > > arch header.
> > > > > >
> > > > > > This statement is true but what is the goal of this patch? Is it
> > to
> > > > > > reduce code size and memory consumption?
> > > > > >
> > > > >
> > > > > No, when Julien and I discussed this in last version[1], we hadn't
> > > > thought
> > > > > so deeply. We just thought a memory range described in DT cannot
> be
> > > > split
> > > > > across multiple nodes. So NR_NODE_MEMBLKS should be equal to
> > NR_MEM_BANKS.
> > > > >
> > > > > https://lists.xenproject.org/archives/html/xen-devel/2021-
> > > > 08/msg00974.html
> > > > >
> > > > > > I am asking because NR_MEM_BANKS is 128 and
> > > > > > NR_NODE_MEMBLKS=2*MAX_NUMNODES which is 64 by default so again
> > > > > > NR_NODE_MEMBLKS is 128 before this patch.
> > > > > >
> > > > > > In other words, this patch alone doesn't make any difference; at
> > least
> > > > > > doesn't make any difference unless CONFIG_NR_NUMA_NODES is
> > increased.
> > > > > >
> > > > > > So, is the goal to reduce memory usage when CONFIG_NR_NUMA_NODES
> > is
> > > > > > higher than 64?
> > > > > >
> > > > >
> > > > > I also thought about this problem when I was writing this patch.
> > > > > CONFIG_NR_NUMA_NODES is increasing, but NR_MEM_BANKS is a fixed
> > > > > value, then NR_MEM_BANKS can be smaller than CONFIG_NR_NUMA_NODES
> > > > > at one point.
> > > > >
> > > > > But I agree with Julien's suggestion, NR_MEM_BANKS and
> > NR_NODE_MEMBLKS
> > > > > must be aware of each other. I had thought to add some ASSERT
> check,
> > > > > but I don't know how to do it better. So I post this patch for
> more
> > > > > suggestion.
> > > >
> > > > OK. In that case I'd say to get rid of the previous definition of
> > > > NR_NODE_MEMBLKS as it is probably not necessary, see below.
> > > >
> > > >
> > > >
> > > > > >
> > > > > > > And keep default NR_NODE_MEMBLKS in common header
> > > > > > > for those architectures NUMA is disabled.
> > > > > >
> > > > > > This last sentence is not accurate: on x86 NUMA is enabled and
> > > > > > NR_NODE_MEMBLKS is still defined in xen/include/xen/numa.h
> (there
> > is
> > > > no
> > > > > > x86 definition of it)
> > > > > >
> > > > >
> > > > > Yes.
> > > > >
> > > > > >
> > > > > > > Signed-off-by: Wei Chen <mailto:wei.chen@arm.com>
> > > > > > > ---
> > > > > > >  xen/include/asm-arm/numa.h | 8 +++++++-
> > > > > > >  xen/include/xen/numa.h     | 2 ++
> > > > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-
> > arm/numa.h
> > > > > > > index 8f1c67e3eb..21569e634b 100644
> > > > > > > --- a/xen/include/asm-arm/numa.h
> > > > > > > +++ b/xen/include/asm-arm/numa.h
> > > > > > > @@ -3,9 +3,15 @@
> > > > > > >
> > > > > > >  #include <xen/mm.h>
> > > > > > >
> > > > > > > +#include <asm/setup.h>
> > > > > > > +
> > > > > > >  typedef u8 nodeid_t;
> > > > > > >
> > > > > > > -#ifndef CONFIG_NUMA
> > > > > > > +#ifdef CONFIG_NUMA
> > > > > > > +
> > > > > > > +#define NR_NODE_MEMBLKS NR_MEM_BANKS
> > > > > > > +
> > > > > > > +#else
> > > > > > >
> > > > > > >  /* Fake one node for now. See also node_online_map. */
> > > > > > >  #define cpu_to_node(cpu) 0
> > > > > > > diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
> > > > > > > index 1978e2be1b..1731e1cc6b 100644
> > > > > > > --- a/xen/include/xen/numa.h
> > > > > > > +++ b/xen/include/xen/numa.h
> > > > > > > @@ -12,7 +12,9 @@
> > > > > > >  #define MAX_NUMNODES    1
> > > > > > >  #endif
> > > > > > >
> > > > > > > +#ifndef NR_NODE_MEMBLKS
> > > > > > >  #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> > > > > > > +#endif
> > > >
> > > > This one we can remove it completely right?
> > >
> > > How about define NR_MEM_BANKS to:
> > > #ifdef CONFIG_NR_NUMA_NODES
> > > #define NR_MEM_BANKS (CONFIG_NR_NUMA_NODES * 2)
> > > #else
> > > #define NR_MEM_BANKS 128
> > > #endif
> > > for both x86 and Arm. For those architectures do not support or enable
> > > NUMA, they can still use "NR_MEM_BANKS 128". And replace all
> > NR_NODE_MEMBLKS
> > > in NUMA code to NR_MEM_BANKS to remove NR_NODE_MEMBLKS completely.
> > > In this case, NR_MEM_BANKS can be aware of the changes of
> > CONFIG_NR_NUMA_NODES.
> >
> > x86 doesn't have NR_MEM_BANKS as far as I can tell. I guess you also
> > meant to rename NR_NODE_MEMBLKS to NR_MEM_BANKS?
> >
> 
> Yes.
> 
> > But NR_MEM_BANKS is not directly related to CONFIG_NR_NUMA_NODES because
> > there can be many memory banks for each numa node, certainly more than
> > 2. The existing definition on x86:
> >
> > #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> >
> > Doesn't make a lot of sense to me. Was it just an arbitrary limit for
> > the lack of a better way to set a maximum?
> >
> 
> At that time, this was probably the most cost-effective approach.
> Enough and easy. But, if more nodes need to be supported in the
> future, it may bring more memory blocks. And this maximum value
> might not apply. The maximum may need to support dynamic extension.
> 
> >
> > On the other hand, NR_MEM_BANKS and NR_NODE_MEMBLKS seem to be related.
> > In fact, what's the difference?
> >
> > NR_MEM_BANKS is the max number of memory banks (with or without
> > numa-node-id).
> >
> > NR_NODE_MEMBLKS is the max number of memory banks with NUMA support
> > (with numa-node-id)?
> >
> > They are basically the same thing. On ARM I would just do:
> >
> 
> Probably not, NR_MEM_BANKS will count those memory ranges without
> numa-node-id in boot memory parsing stage (process_memory_node or
> EFI parser). But NR_NODE_MEMBLKS will only count those memory ranges
> with numa-node-id.
> 
> > #define NR_NODE_MEMBLKS MAX(NR_MEM_BANKS, (CONFIG_NR_NUMA_NODES * 2))
> >
> >

> Quote Julien's comment from HTML email to here:
> " As you wrote above, the second part of the MAX is totally arbitrary.
> In fact, it is very likely than if you have more than 64 nodes, you may
> need a lot more than 2 regions per node.
> 
> So, for Arm, I would just define NR_NODE_MEMBLKS as an alias to NR_MEM_BANKS
> so it can be used by common code.
> "
> 
> > But here comes the problem:
> > How can we set the NR_MEM_BANKS maximum value, 128 seems an arbitrary too?
> 
> This is based on hardware we currently support (the last time we bumped the value was, IIRC, for Thunder-X). In the case of booting UEFI, we can get a lot of small ranges as we discover the RAM using the UEFI memory map.
> 

Thanks for the background.

> 
> > If #define NR_MEM_BANKS (CONFIG_NR_NUMA_NODES * N)? And what N should be.
> 
> N would have to be the maximum number of ranges you can find in a NUMA node.
> 
> We would also need to make sure this doesn't break existing platforms. So N would have to be quite large or we need a MAX as Stefano suggested.
> 
> But I would prefer to keep the existing 128 and allow to configure it at build time (not necessarily in this series). This avoid to have different way to define the value based NUMA vs non-NUMA.

In this case, can we use Stefano's
"#define NR_NODE_MEMBLKS MAX(NR_MEM_BANKS, (CONFIG_NR_NUMA_NODES * 2))"
in the next version? If yes, should we also change the x86 side? Because
NR_MEM_BANKS has not been defined on x86.

> > And maybe the definition could be common with x86 if we define
> > NR_MEM_BANKS to 128 on x86 too.
> 
> Julien had comment here, I will continue in that email.


* RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture
  2021-09-26 10:25             ` Wei Chen
@ 2021-09-27 10:28               ` Wei Chen
  2021-09-28  0:59                 ` Stefano Stabellini
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-27 10:28 UTC (permalink / raw)
  To: Wei Chen, Jan Beulich
  Cc: xen-devel, julien, Bertrand Marquis, Stefano Stabellini

Hi Julien, Stefano,

> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Wei
> Chen
> Sent: 2021年9月26日 18:25
> To: Jan Beulich <jbeulich@suse.com>
> Cc: xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> <Bertrand.Marquis@arm.com>; Stefano Stabellini <sstabellini@kernel.org>
> Subject: RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-
> EFI architecture
> 
> Hi Jan,
> 
> > -----Original Message-----
> > From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> Jan
> > Beulich
> > Sent: 2021年9月24日 18:49
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > <Bertrand.Marquis@arm.com>; Stefano Stabellini <sstabellini@kernel.org>
> > Subject: Re: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for
> non-
> > EFI architecture
> >
> > On 24.09.2021 12:31, Wei Chen wrote:
> > >> From: Jan Beulich <jbeulich@suse.com>
> > >> Sent: 2021年9月24日 15:59
> > >>
> > >> On 24.09.2021 06:34, Wei Chen wrote:
> > >>>> From: Stefano Stabellini <sstabellini@kernel.org>
> > >>>> Sent: 2021年9月24日 9:15
> > >>>>
> > >>>> On Thu, 23 Sep 2021, Wei Chen wrote:
> > >>>>> --- a/xen/common/Kconfig
> > >>>>> +++ b/xen/common/Kconfig
> > >>>>> @@ -11,6 +11,16 @@ config COMPAT
> > >>>>>  config CORE_PARKING
> > >>>>>  	bool
> > >>>>>
> > >>>>> +config EFI
> > >>>>> +	bool
> > >>>>
> > >>>> Without the title the option is not user-selectable (or de-
> > selectable).
> > >>>> So the help message below can never be seen.
> > >>>>
> > >>>> Either add a title, e.g.:
> > >>>>
> > >>>> bool "EFI support"
> > >>>>
> > >>>> Or fully make the option a silent option by removing the help text.
> > >>>
> > >>> OK, in current Xen code, EFI is unconditionally compiled. Before
> > >>> we change related code, I prefer to remove the help text.
> > >>
> > >> But that's not true: At least on x86 EFI gets compiled depending on
> > >> tool chain capabilities. Ultimately we may indeed want a user
> > >> selectable option here, but until then I'm afraid having this option
> > >> at all may be misleading on x86.
> > >>
> > >
> > > I check the build scripts, yes, you're right. For x86, EFI is not a
> > > selectable option in Kconfig. I agree with you, we can't use Kconfig
> > > system to decide to enable EFI build for x86 or not.
> > >
> > > So how about we just use this EFI option for Arm only? Because on Arm,
> > > we do not have such toolchain dependency.
> >
> > To be honest - don't know. That's because I don't know what you want
> > to use the option for subsequently.
> >
> 
> In last version, I had introduced an arch-helper to stub EFI_BOOT
> in Arm's common code for Arm32. Because Arm32 doesn't support EFI.
> So Julien suggested me to introduce a CONFIG_EFI option for non-EFI
> supported architectures to stub in EFI layer.
> 
> [1] https://lists.xenproject.org/archives/html/xen-devel/2021-
> 08/msg00808.html
> 

As Jan reminded, x86 doesn't depend on Kconfig to build its EFI code.
So, if we use CONFIG_EFI to stub the EFI APIs for x86, we can end up
with the toolchain enabling EFI while Kconfig disables it, or with
Kconfig enabling EFI while the toolchain doesn't provide EFI build
support. Either way, x86 would not work correctly.

If we use CONFIG_EFI for Arm only, that means CONFIG_EFI for x86
is off, which will also cause problems.

So, can we still use the previous arch helpers to stub EFI for Arm32,
until x86 can use this selectable option?

> > Jan
> >



* Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS
  2021-09-27 10:21                   ` Wei Chen
@ 2021-09-27 10:39                     ` Julien Grall
  2021-09-27 16:58                       ` Stefano Stabellini
  0 siblings, 1 reply; 192+ messages in thread
From: Julien Grall @ 2021-09-27 10:39 UTC (permalink / raw)
  To: Wei Chen
  Cc: Julien Grall, Stefano Stabellini, xen-devel, Bertrand Marquis,
	Jan Beulich, Roger Pau Monné,
	Andrew Cooper


On Mon, 27 Sep 2021, 12:22 Wei Chen, <Wei.Chen@arm.com> wrote:

> Hi Julien,
>
> From: Julien Grall <julien.grall.oss@gmail.com>
> Sent: 2021年9月27日 15:36
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-devel <
> xen-devel@lists.xenproject.org>; Bertrand Marquis <
> Bertrand.Marquis@arm.com>; Jan Beulich <jbeulich@suse.com>; Roger Pau
> Monné <roger.pau@citrix.com>; Andrew Cooper <andrew.cooper3@citrix.com>
> Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default
> NR_NODE_MEMBLKS
>
>
> On Mon, 27 Sep 2021, 08:53 Wei Chen, <mailto:Wei.Chen@arm.com> wrote:
> Hi Julien,
>
> > -----Original Message-----
> > From: Xen-devel <mailto:xen-devel-bounces@lists.xenproject.org> On
> Behalf Of Wei
> > Chen
> > Sent: 2021年9月27日 14:46
> > To: Stefano Stabellini <mailto:sstabellini@kernel.org>
> > Cc: mailto:xen-devel@lists.xenproject.org; mailto:julien@xen.org;
> Bertrand Marquis
> > <mailto:Bertrand.Marquis@arm.com>; mailto:jbeulich@suse.com; mailto:
> roger.pau@citrix.com;
> > mailto:andrew.cooper3@citrix.com
> > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default
> > NR_NODE_MEMBLKS
> >
> > Hi Stefano, Julien,
> >
> > > -----Original Message-----
> > > From: Stefano Stabellini <mailto:sstabellini@kernel.org>
> > > Sent: 2021年9月27日 13:00
> > > To: Wei Chen <mailto:Wei.Chen@arm.com>
> > > Cc: Stefano Stabellini <mailto:sstabellini@kernel.org>; xen-
> > > mailto:devel@lists.xenproject.org; mailto:julien@xen.org; Bertrand
> Marquis
> > > <mailto:Bertrand.Marquis@arm.com>; mailto:jbeulich@suse.com; mailto:
> roger.pau@citrix.com;
> > > mailto:andrew.cooper3@citrix.com
> > > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> default
> > > NR_NODE_MEMBLKS
> > >
> > > +x86 maintainers
> > >
> > > On Mon, 27 Sep 2021, Wei Chen wrote:
> > > > > -----Original Message-----
> > > > > From: Stefano Stabellini <mailto:sstabellini@kernel.org>
> > > > > Sent: 2021年9月27日 11:26
> > > > > To: Wei Chen <mailto:Wei.Chen@arm.com>
> > > > > Cc: Stefano Stabellini <mailto:sstabellini@kernel.org>; xen-
> > > > > mailto:devel@lists.xenproject.org; mailto:julien@xen.org;
> Bertrand Marquis
> > > > > <mailto:Bertrand.Marquis@arm.com>
> > > > > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> > > default
> > > > > NR_NODE_MEMBLKS
> > > > >
> > > > > On Sun, 26 Sep 2021, Wei Chen wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Stefano Stabellini <mailto:sstabellini@kernel.org>
> > > > > > > Sent: 2021年9月24日 9:35
> > > > > > > To: Wei Chen <mailto:Wei.Chen@arm.com>
> > > > > > > Cc: mailto:xen-devel@lists.xenproject.org; mailto:
> sstabellini@kernel.org;
> > > > > mailto:julien@xen.org;
> > > > > > > Bertrand Marquis <mailto:Bertrand.Marquis@arm.com>
> > > > > > > Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to
> override
> > > > > default
> > > > > > > NR_NODE_MEMBLKS
> > > > > > >
> > > > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > > > As a memory range described in device tree cannot be split
> > > across
> > > > > > > > multiple nodes. So we define NR_NODE_MEMBLKS as NR_MEM_BANKS
> > in
> > > > > > > > arch header.
> > > > > > >
> > > > > > > This statement is true but what is the goal of this patch? Is
> it
> > > to
> > > > > > > reduce code size and memory consumption?
> > > > > > >
> > > > > >
> > > > > > No, when Julien and I discussed this in last version[1], we
> hadn't
> > > > > thought
> > > > > > so deeply. We just thought a memory range described in DT cannot
> > be
> > > > > split
> > > > > > across multiple nodes. So NR_NODE_MEMBLKS should be equal to
> > > NR_MEM_BANKS.
> > > > > >
> > > > > > https://lists.xenproject.org/archives/html/xen-devel/2021-
> > > > > 08/msg00974.html
> > > > > >
> > > > > > > I am asking because NR_MEM_BANKS is 128 and
> > > > > > > NR_NODE_MEMBLKS=2*MAX_NUMNODES which is 64 by default so again
> > > > > > > NR_NODE_MEMBLKS is 128 before this patch.
> > > > > > >
> > > > > > > In other words, this patch alone doesn't make any difference;
> at
> > > least
> > > > > > > doesn't make any difference unless CONFIG_NR_NUMA_NODES is
> > > increased.
> > > > > > >
> > > > > > > So, is the goal to reduce memory usage when
> CONFIG_NR_NUMA_NODES
> > > is
> > > > > > > higher than 64?
> > > > > > >
> > > > > >
> > > > > > I also thought about this problem when I was writing this patch.
> > > > > > CONFIG_NR_NUMA_NODES is increasing, but NR_MEM_BANKS is a fixed
> > > > > > value, then NR_MEM_BANKS can be smaller than CONFIG_NR_NUMA_NODES
> > > > > > at one point.
> > > > > >
> > > > > > But I agree with Julien's suggestion, NR_MEM_BANKS and
> > > NR_NODE_MEMBLKS
> > > > > > must be aware of each other. I had thought to add some ASSERT
> > check,
> > > > > > but I don't know how to do it better. So I post this patch for
> > more
> > > > > > suggestion.
> > > > >
> > > > > OK. In that case I'd say to get rid of the previous definition of
> > > > > NR_NODE_MEMBLKS as it is probably not necessary, see below.
> > > > >
> > > > >
> > > > >
> > > > > > >
> > > > > > > > And keep default NR_NODE_MEMBLKS in common header
> > > > > > > > for those architectures NUMA is disabled.
> > > > > > >
> > > > > > > This last sentence is not accurate: on x86 NUMA is enabled and
> > > > > > > NR_NODE_MEMBLKS is still defined in xen/include/xen/numa.h
> > (there
> > > is
> > > > > no
> > > > > > > x86 definition of it)
> > > > > > >
> > > > > >
> > > > > > Yes.
> > > > > >
> > > > > > >
> > > > > > > > Signed-off-by: Wei Chen <mailto:wei.chen@arm.com>
> > > > > > > > ---
> > > > > > > >  xen/include/asm-arm/numa.h | 8 +++++++-
> > > > > > > >  xen/include/xen/numa.h     | 2 ++
> > > > > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> > > > > > > >
> > > > > > > > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-
> > > arm/numa.h
> > > > > > > > index 8f1c67e3eb..21569e634b 100644
> > > > > > > > --- a/xen/include/asm-arm/numa.h
> > > > > > > > +++ b/xen/include/asm-arm/numa.h
> > > > > > > > @@ -3,9 +3,15 @@
> > > > > > > >
> > > > > > > >  #include <xen/mm.h>
> > > > > > > >
> > > > > > > > +#include <asm/setup.h>
> > > > > > > > +
> > > > > > > >  typedef u8 nodeid_t;
> > > > > > > >
> > > > > > > > -#ifndef CONFIG_NUMA
> > > > > > > > +#ifdef CONFIG_NUMA
> > > > > > > > +
> > > > > > > > +#define NR_NODE_MEMBLKS NR_MEM_BANKS
> > > > > > > > +
> > > > > > > > +#else
> > > > > > > >
> > > > > > > >  /* Fake one node for now. See also node_online_map. */
> > > > > > > >  #define cpu_to_node(cpu) 0
> > > > > > > > diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
> > > > > > > > index 1978e2be1b..1731e1cc6b 100644
> > > > > > > > --- a/xen/include/xen/numa.h
> > > > > > > > +++ b/xen/include/xen/numa.h
> > > > > > > > @@ -12,7 +12,9 @@
> > > > > > > >  #define MAX_NUMNODES    1
> > > > > > > >  #endif
> > > > > > > >
> > > > > > > > +#ifndef NR_NODE_MEMBLKS
> > > > > > > >  #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> > > > > > > > +#endif
> > > > >
> > > > > This one we can remove it completely right?
> > > >
> > > > How about define NR_MEM_BANKS to:
> > > > #ifdef CONFIG_NR_NUMA_NODES
> > > > #define NR_MEM_BANKS (CONFIG_NR_NUMA_NODES * 2)
> > > > #else
> > > > #define NR_MEM_BANKS 128
> > > > #endif
> > > > for both x86 and Arm. For those architectures do not support or
> enable
> > > > NUMA, they can still use "NR_MEM_BANKS 128". And replace all
> > > NR_NODE_MEMBLKS
> > > > in NUMA code to NR_MEM_BANKS to remove NR_NODE_MEMBLKS completely.
> > > > In this case, NR_MEM_BANKS can be aware of the changes of
> > > CONFIG_NR_NUMA_NODES.
> > >
> > > x86 doesn't have NR_MEM_BANKS as far as I can tell. I guess you also
> > > meant to rename NR_NODE_MEMBLKS to NR_MEM_BANKS?
> > >
> >
> > Yes.
> >
> > > But NR_MEM_BANKS is not directly related to CONFIG_NR_NUMA_NODES
> because
> > > there can be many memory banks for each numa node, certainly more than
> > > 2. The existing definition on x86:
> > >
> > > #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> > >
> > > Doesn't make a lot of sense to me. Was it just an arbitrary limit for
> > > the lack of a better way to set a maximum?
> > >
> >
> > At that time, this was probably the most cost-effective approach.
> > Enough and easy. But, if more nodes need to be supported in the
> > future, it may bring more memory blocks. And this maximum value
> > might not apply. The maximum may need to support dynamic extension.
> >
> > >
> > > On the other hand, NR_MEM_BANKS and NR_NODE_MEMBLKS seem to be related.
> > > In fact, what's the difference?
> > >
> > > NR_MEM_BANKS is the max number of memory banks (with or without
> > > numa-node-id).
> > >
> > > NR_NODE_MEMBLKS is the max number of memory banks with NUMA support
> > > (with numa-node-id)?
> > >
> > > They are basically the same thing. On ARM I would just do:
> > >
> >
> > Probably not, NR_MEM_BANKS will count those memory ranges without
> > numa-node-id in boot memory parsing stage (process_memory_node or
> > EFI parser). But NR_NODE_MEMBLKS will only count those memory ranges
> > with numa-node-id.
> >
> > > #define NR_NODE_MEMBLKS MAX(NR_MEM_BANKS, (CONFIG_NR_NUMA_NODES * 2))
> > >
> > >
>
> > Quote Julien's comment from HTML email to here:
> > " As you wrote above, the second part of the MAX is totally arbitrary.
> > In fact, it is very likely than if you have more than 64 nodes, you may
> > need a lot more than 2 regions per node.
> >
> > So, for Arm, I would just define NR_NODE_MEMBLKS as an alias to
> NR_MEM_BANKS
> > so it can be used by common code.
> > "
> >
> > > But here comes the problem:
> > > How can we set the NR_MEM_BANKS maximum value, 128 seems an arbitrary
> too?
> >
> > This is based on hardware we currently support (the last time we bumped
> the value was, IIRC, for Thunder-X). In the case of booting UEFI, we can
> get a lot of small ranges as we discover the RAM using the UEFI memory map.
> >
>
> Thanks for the background.
>
> >
> > > If #define NR_MEM_BANKS (CONFIG_NR_NUMA_NODES * N)? And what N should
> be.
> >
> > N would have to be the maximum number of ranges you can find in a NUMA
> node.
> >
> > We would also need to make sure this doesn't break existing platforms.
> So N would have to be quite large or we need a MAX as Stefano suggested.
> >
> > But I would prefer to keep the existing 128 and allow to configure it at
> build time (not necessarily in this series). This avoid to have different
> way to define the value based NUMA vs non-NUMA.
>
> In this case, can we use Stefano's
> "#define NR_NODE_MEMBLKS MAX(NR_MEM_BANKS, (CONFIG_NR_NUMA_NODES * 2))"
> in next version. If yes, should we change x86 part? Because NR_MEM_BANKS
> has not been defined in x86.


What I meant by configuring dynamically is allowing NR_MEM_BANKS to be
set by the user.

The second part of the MAX makes no sense to me (at least on Arm), so I
would really prefer that it not be part of the initial version.

We can refine the value, or introduce the MAX in the future, if we have
a justification for it.


> > > And maybe the definition could be common with x86 if we define
> > > NR_MEM_BANKS to 128 on x86 too.
> >
> > Julien had comment here, I will continue in that email.
>



* Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS
  2021-09-27 10:39                     ` Julien Grall
@ 2021-09-27 16:58                       ` Stefano Stabellini
  2021-09-28  2:57                         ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-27 16:58 UTC (permalink / raw)
  To: Julien Grall
  Cc: Wei Chen, Julien Grall, Stefano Stabellini, xen-devel,
	Bertrand Marquis, Jan Beulich, Roger Pau Monné,
	Andrew Cooper


On Mon, 27 Sep 2021, Julien Grall wrote:
> On Mon, 27 Sep 2021, 12:22 Wei Chen, <Wei.Chen@arm.com> wrote:
>       Hi Julien,
> 
>       From: Julien Grall <julien.grall.oss@gmail.com>
>       Sent: 2021年9月27日 15:36
>       To: Wei Chen <Wei.Chen@arm.com>
>       Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-devel <xen-devel@lists.xenproject.org>; Bertrand Marquis
>       <Bertrand.Marquis@arm.com>; Jan Beulich <jbeulich@suse.com>; Roger Pau Monné <roger.pau@citrix.com>; Andrew Cooper
>       <andrew.cooper3@citrix.com>
>       Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS
> 
> 
>       On Mon, 27 Sep 2021, 08:53 Wei Chen, <mailto:Wei.Chen@arm.com> wrote:
>       Hi Julien,
> 
>       > -----Original Message-----
>       > From: Xen-devel <mailto:xen-devel-bounces@lists.xenproject.org> On Behalf Of Wei
>       > Chen
>       > Sent: 2021年9月27日 14:46
>       > To: Stefano Stabellini <mailto:sstabellini@kernel.org>
>       > Cc: mailto:xen-devel@lists.xenproject.org; mailto:julien@xen.org; Bertrand Marquis
>       > <mailto:Bertrand.Marquis@arm.com>; mailto:jbeulich@suse.com; mailto:roger.pau@citrix.com;
>       > mailto:andrew.cooper3@citrix.com
>       > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default
>       > NR_NODE_MEMBLKS
>       >
>       > Hi Stefano, Julien,
>       >
>       > > -----Original Message-----
>       > > From: Stefano Stabellini <mailto:sstabellini@kernel.org>
>       > > Sent: 2021年9月27日 13:00
>       > > To: Wei Chen <mailto:Wei.Chen@arm.com>
>       > > Cc: Stefano Stabellini <mailto:sstabellini@kernel.org>; xen-
>       > > mailto:devel@lists.xenproject.org; mailto:julien@xen.org; Bertrand Marquis
>       > > <mailto:Bertrand.Marquis@arm.com>; mailto:jbeulich@suse.com; mailto:roger.pau@citrix.com;
>       > > mailto:andrew.cooper3@citrix.com
>       > > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default
>       > > NR_NODE_MEMBLKS
>       > >
>       > > +x86 maintainers
>       > >
>       > > On Mon, 27 Sep 2021, Wei Chen wrote:
>       > > > > -----Original Message-----
>       > > > > From: Stefano Stabellini <sstabellini@kernel.org>
>       > > > > Sent: 2021年9月27日 11:26
>       > > > > To: Wei Chen <Wei.Chen@arm.com>
>       > > > > Cc: Stefano Stabellini <sstabellini@kernel.org>;
>       > > > > xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
>       > > > > <Bertrand.Marquis@arm.com>
>       > > > > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
>       > > default
>       > > > > NR_NODE_MEMBLKS
>       > > > >
>       > > > > On Sun, 26 Sep 2021, Wei Chen wrote:
>       > > > > > > -----Original Message-----
>       > > > > > > From: Stefano Stabellini <sstabellini@kernel.org>
>       > > > > > > Sent: 2021年9月24日 9:35
>       > > > > > > To: Wei Chen <Wei.Chen@arm.com>
>       > > > > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
>       > > > > > > julien@xen.org;
>       > > > > > > Bertrand Marquis <Bertrand.Marquis@arm.com>
>       > > > > > > Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
>       > > > > default
>       > > > > > > NR_NODE_MEMBLKS
>       > > > > > >
>       > > > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
>       > > > > > > > As a memory range described in device tree cannot be split
>       > > > > > > > across multiple nodes, we define NR_NODE_MEMBLKS as
>       > > > > > > > NR_MEM_BANKS in the arch header.
>       > > > > > >
>       > > > > > > This statement is true, but what is the goal of this patch? Is
>       > > > > > > it to reduce code size and memory consumption?
>       > > > > > >
>       > > > > >
>       > > > > > No, when Julien and I discussed this in the last version[1], we
>       > > > > > hadn't thought so deeply. We just thought a memory range described
>       > > > > > in DT cannot be split across multiple nodes. So NR_NODE_MEMBLKS
>       > > > > > should be equal to NR_MEM_BANKS.
>       > > > > >
>       > > > > > https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00974.html
>       > > > > >
>       > > > > > > I am asking because NR_MEM_BANKS is 128 and
>       > > > > > > NR_NODE_MEMBLKS=2*MAX_NUMNODES which is 64 by default so again
>       > > > > > > NR_NODE_MEMBLKS is 128 before this patch.
>       > > > > > >
>       > > > > > > In other words, this patch alone doesn't make any difference; at
>       > > > > > > least it doesn't make any difference unless CONFIG_NR_NUMA_NODES
>       > > > > > > is increased.
>       > > > > > >
>       > > > > > > So, is the goal to reduce memory usage when CONFIG_NR_NUMA_NODES
>       > > > > > > is higher than 64?
>       > > > > > >
>       > > > > >
>       > > > > > I also thought about this problem when I was writing this patch.
>       > > > > > CONFIG_NR_NUMA_NODES can be increased, but NR_MEM_BANKS is a fixed
>       > > > > > value, so NR_MEM_BANKS can become smaller than CONFIG_NR_NUMA_NODES
>       > > > > > at some point.
>       > > > > >
>       > > > > > But I agree with Julien's suggestion that NR_MEM_BANKS and
>       > > > > > NR_NODE_MEMBLKS must be aware of each other. I had thought to add
>       > > > > > some ASSERT checks, but I don't know how to do it better. So I
>       > > > > > posted this patch for more suggestions.
>       > > > >
>       > > > > OK. In that case I'd say to get rid of the previous definition of
>       > > > > NR_NODE_MEMBLKS as it is probably not necessary, see below.
>       > > > >
>       > > > >
>       > > > >
>       > > > > > >
>       > > > > > > > And keep default NR_NODE_MEMBLKS in common header
>       > > > > > > > for those architectures NUMA is disabled.
>       > > > > > >
>       > > > > > > This last sentence is not accurate: on x86 NUMA is enabled and
>       > > > > > > NR_NODE_MEMBLKS is still defined in xen/include/xen/numa.h
>       > > > > > > (there is no x86 definition of it)
>       > > > > > >
>       > > > > >
>       > > > > > Yes.
>       > > > > >
>       > > > > > >
>       > > > > > > > Signed-off-by: Wei Chen <wei.chen@arm.com>
>       > > > > > > > ---
>       > > > > > > >  xen/include/asm-arm/numa.h | 8 +++++++-
>       > > > > > > >  xen/include/xen/numa.h     | 2 ++
>       > > > > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
>       > > > > > > >
>       > > > > > > > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
>       > > > > > > > index 8f1c67e3eb..21569e634b 100644
>       > > > > > > > --- a/xen/include/asm-arm/numa.h
>       > > > > > > > +++ b/xen/include/asm-arm/numa.h
>       > > > > > > > @@ -3,9 +3,15 @@
>       > > > > > > >
>       > > > > > > >  #include <xen/mm.h>
>       > > > > > > >
>       > > > > > > > +#include <asm/setup.h>
>       > > > > > > > +
>       > > > > > > >  typedef u8 nodeid_t;
>       > > > > > > >
>       > > > > > > > -#ifndef CONFIG_NUMA
>       > > > > > > > +#ifdef CONFIG_NUMA
>       > > > > > > > +
>       > > > > > > > +#define NR_NODE_MEMBLKS NR_MEM_BANKS
>       > > > > > > > +
>       > > > > > > > +#else
>       > > > > > > >
>       > > > > > > >  /* Fake one node for now. See also node_online_map. */
>       > > > > > > >  #define cpu_to_node(cpu) 0
>       > > > > > > > diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
>       > > > > > > > index 1978e2be1b..1731e1cc6b 100644
>       > > > > > > > --- a/xen/include/xen/numa.h
>       > > > > > > > +++ b/xen/include/xen/numa.h
>       > > > > > > > @@ -12,7 +12,9 @@
>       > > > > > > >  #define MAX_NUMNODES    1
>       > > > > > > >  #endif
>       > > > > > > >
>       > > > > > > > +#ifndef NR_NODE_MEMBLKS
>       > > > > > > >  #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
>       > > > > > > > +#endif
>       > > > >
>       > > > > This one we can remove it completely right?
>       > > >
>       > > > How about define NR_MEM_BANKS to:
>       > > > #ifdef CONFIG_NR_NUMA_NODES
>       > > > #define NR_MEM_BANKS (CONFIG_NR_NUMA_NODES * 2)
>       > > > #else
>       > > > #define NR_MEM_BANKS 128
>       > > > #endif
>       > > > for both x86 and Arm. Those architectures that do not support or
>       > > > enable NUMA can still use "NR_MEM_BANKS 128". And we can replace all
>       > > > NR_NODE_MEMBLKS in the NUMA code with NR_MEM_BANKS to remove
>       > > > NR_NODE_MEMBLKS completely. In this case, NR_MEM_BANKS can be aware
>       > > > of changes to CONFIG_NR_NUMA_NODES.
>       > >
>       > > x86 doesn't have NR_MEM_BANKS as far as I can tell. I guess you also
>       > > meant to rename NR_NODE_MEMBLKS to NR_MEM_BANKS?
>       > >
>       >
>       > Yes.
>       >
>       > > But NR_MEM_BANKS is not directly related to CONFIG_NR_NUMA_NODES because
>       > > there can be many memory banks for each numa node, certainly more than
>       > > 2. The existing definition on x86:
>       > >
>       > > #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
>       > >
>       > > Doesn't make a lot of sense to me. Was it just an arbitrary limit for
>       > > the lack of a better way to set a maximum?
>       > >
>       >
>       > At that time, this was probably the most cost-effective approach:
>       > enough and easy. But if more nodes need to be supported in the
>       > future, they may bring more memory blocks, and this maximum value
>       > might no longer apply. The maximum may need to support dynamic extension.
>       >
>       > >
>       > > On the other hand, NR_MEM_BANKS and NR_NODE_MEMBLKS seem to be related.
>       > > In fact, what's the difference?
>       > >
>       > > NR_MEM_BANKS is the max number of memory banks (with or without
>       > > numa-node-id).
>       > >
>       > > NR_NODE_MEMBLKS is the max number of memory banks with NUMA support
>       > > (with numa-node-id)?
>       > >
>       > > They are basically the same thing. On ARM I would just do:
>       > >
>       >
>       > Probably not: NR_MEM_BANKS will also count memory ranges without a
>       > numa-node-id in the boot memory parsing stage (process_memory_node or
>       > the EFI parser). But NR_NODE_MEMBLKS will only count memory ranges
>       > with a numa-node-id.
>       >
>       > > #define NR_NODE_MEMBLKS MAX(NR_MEM_BANKS, (CONFIG_NR_NUMA_NODES * 2))
>       > >
>       > >
> 
>       > Quote Julien's comment from HTML email to here:
>       > " As you wrote above, the second part of the MAX is totally arbitrary.
>       > In fact, it is very likely that if you have more than 64 nodes, you may
>       > need a lot more than 2 regions per node.
>       >
>       > So, for Arm, I would just define NR_NODE_MEMBLKS as an alias to NR_MEM_BANKS
>       > so it can be used by common code.
>       > "
>       >
>       > > But here comes the problem:
>       > > How can we set the NR_MEM_BANKS maximum value? 128 seems arbitrary too.
>       >
>       > This is based on hardware we currently support (the last time we bumped the value was, IIRC, for Thunder-X). In the case of
>       booting UEFI, we can get a lot of small ranges as we discover the RAM using the UEFI memory map.
>       >
> 
>       Thanks for the background.
> 
>       >
>       > > What if we #define NR_MEM_BANKS (CONFIG_NR_NUMA_NODES * N)? And what should N be?
>       >
>       > N would have to be the maximum number of ranges you can find in a NUMA node.
>       >
>       > We would also need to make sure this doesn't break existing platforms. So N would have to be quite large or we need a MAX as
>       Stefano suggested.
>       >
>       > But I would prefer to keep the existing 128 and allow configuring it at build time (not necessarily in this series). This
>       avoids having different ways to define the value based on NUMA vs non-NUMA.
> 
>       In this case, can we use Stefano's
>       "#define NR_NODE_MEMBLKS MAX(NR_MEM_BANKS, (CONFIG_NR_NUMA_NODES * 2))"
>       in the next version? If yes, should we change the x86 part? Because
>       NR_MEM_BANKS has not been defined on x86.
> 
> 
> What I meant by configuring dynamically is allowing NR_MEM_BANKS to be set by the user.
> 
> The second part of the MAX makes no sense to me (at least on Arm). So I really prefer if this is not part of the initial version.
> 
> We can refine the value, or introduce the MAX in the future if we have a justification for it.

OK, so for clarity the suggestion is:

- define NR_NODE_MEMBLKS as NR_MEM_BANKS on ARM in this series
- in the future make NR_MEM_BANKS user-configurable via kconfig
- for now leave NR_MEM_BANKS as 128 on ARM

That's fine by me.

^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA
  2021-09-27  9:17             ` Jan Beulich
@ 2021-09-27 17:17               ` Stefano Stabellini
  2021-09-28  2:59                 ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-27 17:17 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Julien Grall, Stefano Stabellini, Wei Chen, xen-devel, Bertrand Marquis

On Mon, 27 Sep 2021, Jan Beulich wrote:
> On 27.09.2021 10:45, Julien Grall wrote:
> > On Mon, 27 Sep 2021, 10:33 Jan Beulich, <jbeulich@suse.com> wrote:
> > 
> >> On 24.09.2021 21:39, Stefano Stabellini wrote:
> >>> On Fri, 24 Sep 2021, Wei Chen wrote:
> >>>> On 2021/9/24 11:31, Stefano Stabellini wrote:
> >>>>> On Thu, 23 Sep 2021, Wei Chen wrote:
> >>>>>> --- a/xen/arch/arm/Kconfig
> >>>>>> +++ b/xen/arch/arm/Kconfig
> >>>>>> @@ -34,6 +34,17 @@ config ACPI
> >>>>>>      Advanced Configuration and Power Interface (ACPI) support for Xen is
> >>>>>>      an alternative to device tree on ARM64.
> >>>>>>
> >>>>>> +config DEVICE_TREE_NUMA
> >>>>>> +  def_bool n
> >>>>>> +  select NUMA
> >>>>>> +
> >>>>>> +config ARM_NUMA
> >>>>>> +  bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)" if UNSUPPORTED
> >>>>>> +  select DEVICE_TREE_NUMA if HAS_DEVICE_TREE
> >>>>>
> >>>>> Should it be: depends on HAS_DEVICE_TREE ?
> >>>>> (And eventually depends on HAS_DEVICE_TREE || ACPI)
> >>>>>
> >>>>
> >>>> As discussed in the RFC [1], we want to make ARM_NUMA a generic
> >>>> option that can be selected by users, and depend on HAS_DEVICE_TREE
> >>>> or ACPI to select DEVICE_TREE_NUMA or ACPI_NUMA.
> >>>>
> >>>> If we add HAS_DEVICE_TREE || ACPI as dependencies for ARM_NUMA,
> >>>> does it become a circular dependency?
> >>>>
> >>>>
> >> https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00888.html
> >>>
> >>> OK, I am fine with that. I was just trying to catch the case where a
> >>> user selects "ARM_NUMA" but actually neither ACPI nor HAS_DEVICE_TREE
> >>> are selected so nothing happens. I was trying to make it clear that
> >>> ARM_NUMA depends on having at least one between HAS_DEVICE_TREE or ACPI
> >>> because otherwise it is not going to work.
> >>>
> >>> That said, I don't think this is important because HAS_DEVICE_TREE
> >>> cannot be unselected. So if we cannot find a way to express the
> >>> dependency, I think it is fine to keep the patch as is.
> >>
> >> So how about doing things the other way around: ARM_NUMA has no prompt
> >> and defaults to ACPI_NUMA || DT_NUMA, and DT_NUMA gains a prompt instead
> >> (and, for Arm at least, ACPI_NUMA as well; this might even be worthwhile
> >> to have on x86 down the road).
> >>
> > 
> > As I wrote before, I don't think the user should say "I want to enable NUMA
> > with Device-Tree or ACPI". Instead, they should say whether they want to use
> > NUMA and let Xen decide whether to enable the DT/ACPI support.
> > 
> > In other words, the prompt should stay on ARM_NUMA.
> 
> Okay. In which case I'm confused by Stefano's question.

Let me clarify: I think it is fine to have a single prompt for NUMA in
Kconfig. However, I am just pointing out that it is theoretically
possible with the current code to present an ARM_NUMA prompt to the user
but actually have no NUMA enabled at the end because both DEVICE TREE
and ACPI are disabled. This is only a theoretical problem because DEVICE
TREE support (HAS_DEVICE_TREE) cannot be disabled today. Also I cannot
imagine how a configuration with neither DEVICE TREE nor ACPI can be
correct. So I don't think it is a critical concern.

That said, you can see that, at least theoretically, ARM_NUMA depends on
either HAS_DEVICE_TREE or ACPI, so I suggested to add:

depends on HAS_DEVICE_TREE || ACPI

Wei answered that it might introduce a circular dependency, but I did
try the addition of "depends on HAS_DEVICE_TREE || ACPI" under ARM_NUMA
in Kconfig and everything built fine here.
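Putting the pieces of this exchange together, the variant with the explicit dependency would look roughly like the Kconfig fragment below. This is an illustrative sketch combining the hunk quoted earlier with the suggested `depends on`, not the exact code from the series:

```kconfig
config DEVICE_TREE_NUMA
	def_bool n
	select NUMA

config ARM_NUMA
	bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)" if UNSUPPORTED
	# Suggested addition: make the (currently theoretical) requirement explicit
	depends on HAS_DEVICE_TREE || ACPI
	select DEVICE_TREE_NUMA if HAS_DEVICE_TREE
```

Since HAS_DEVICE_TREE cannot be deselected today, the `depends on` line changes nothing in practice; it only documents the requirement.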


^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 08/37] xen/x86: add detection of discontinous node memory range
  2021-09-27  9:50               ` Wei Chen
@ 2021-09-27 17:19                 ` Stefano Stabellini
  2021-09-28  4:41                   ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-27 17:19 UTC (permalink / raw)
  To: Wei Chen
  Cc: Stefano Stabellini, xen-devel, julien, Bertrand Marquis,
	jbeulich, andrew.cooper3, roger.pau, wl


On Mon, 27 Sep 2021, Wei Chen wrote:
> > -----Original Message-----
> > From: Stefano Stabellini <sstabellini@kernel.org>
> > Sent: 2021年9月27日 13:05
> > To: Stefano Stabellini <sstabellini@kernel.org>
> > Cc: Wei Chen <Wei.Chen@arm.com>; xen-devel@lists.xenproject.org;
> > julien@xen.org; Bertrand Marquis <Bertrand.Marquis@arm.com>;
> > jbeulich@suse.com; andrew.cooper3@citrix.com; roger.pau@citrix.com;
> > wl@xen.org
> > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> > memory range
> > 
> > On Sun, 26 Sep 2021, Stefano Stabellini wrote:
> > > On Sun, 26 Sep 2021, Wei Chen wrote:
> > > > > -----Original Message-----
> > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > Sent: 2021年9月25日 3:53
> > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > > > > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > > > <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > andrew.cooper3@citrix.com;
> > > > > roger.pau@citrix.com; wl@xen.org
> > > > > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous
> > node
> > > > > memory range
> > > > >
> > > > > On Fri, 24 Sep 2021, Wei Chen wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > > > Sent: 2021年9月24日 8:26
> > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> > > > > julien@xen.org;
> > > > > > > Bertrand Marquis <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > > > > > > andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> > > > > > > Subject: Re: [PATCH 08/37] xen/x86: add detection of
> > discontinous node
> > > > > > > memory range
> > > > > > >
> > > > > > > CC'ing x86 maintainers
> > > > > > >
> > > > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > > > One NUMA node may contain several memory blocks. In the current
> > > > > > > > Xen code, Xen will maintain a node memory range for each node to
> > > > > > > > cover all its memory blocks. But here comes the problem: if, in
> > > > > > > > the gap between one node's two memory blocks, there are memory
> > > > > > > > blocks that don't belong to this node (remote memory blocks),
> > > > > > > > this node's memory range will be expanded to cover these remote
> > > > > > > > memory blocks.
> > > > > > > >
> > > > > > > > Having one node's memory range contain other nodes' memory is
> > > > > > > > obviously not reasonable. This means the current NUMA code can
> > > > > > > > only support nodes with contiguous memory blocks. However, on a
> > > > > > > > physical machine, the addresses of multiple nodes can be
> > > > > > > > interleaved.
> > > > > > > >
> > > > > > > > So in this patch, we add code to detect discontinuous memory
> > > > > > > > blocks for one node. NUMA initialization will fail and error
> > > > > > > > messages will be printed when Xen detects such a hardware
> > > > > > > > configuration.
> > > > > > >
> > > > > > > At least on ARM, it is not just memory that can be interleaved,
> > > > > > > but also MMIO regions. For instance:
> > > > > > >
> > > > > > > node0 bank0 0-0x1000000
> > > > > > > MMIO 0x1000000-0x1002000
> > > > > > > Hole 0x1002000-0x2000000
> > > > > > > node0 bank1 0x2000000-0x3000000
> > > > > > >
> > > > > > > So I am not familiar with the SRAT format, but I think on ARM
> > > > > > > the check would look different: we would just look for multiple
> > > > > > > memory ranges under a device_type = "memory" node of a NUMA node
> > > > > > > in device tree.
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > Should I need to include/refine above message to commit log?
> > > > >
> > > > > Let me ask you a question first.
> > > > >
> > > > > With the NUMA implementation of this patch series, can we deal with
> > > > > cases where each node has multiple memory banks, not interleaved?
> > > >
> > > > Yes.
> > > >
> > > > > An an example:
> > > > >
> > > > > node0: 0x0        - 0x10000000
> > > > > MMIO : 0x10000000 - 0x20000000
> > > > > node0: 0x20000000 - 0x30000000
> > > > > MMIO : 0x30000000 - 0x50000000
> > > > > node1: 0x50000000 - 0x60000000
> > > > > MMIO : 0x60000000 - 0x80000000
> > > > > node2: 0x80000000 - 0x90000000
> > > > >
> > > > >
> > > > > I assume we can deal with this case simply by setting node0 memory
> > > > > to 0x0-0x30000000 even if there is actually something else, a device,
> > > > > that doesn't belong to node0 in between the two node0 banks?
> > > >
> > > > While this configuration is rare in SoC design, it is not
> > > > impossible.
> > >
> > > Definitely, I have seen it before.
> > >
> > >
> > > > > Is it only other nodes' memory interleaving that causes issues? In
> > > > > other words, is only the following a problematic scenario?
> > > > >
> > > > > node0: 0x0        - 0x10000000
> > > > > MMIO : 0x10000000 - 0x20000000
> > > > > node1: 0x20000000 - 0x30000000
> > > > > MMIO : 0x30000000 - 0x50000000
> > > > > node0: 0x50000000 - 0x60000000
> > > > >
> > > > > Because node1 is in between the two ranges of node0?
> > > > >
> > > >
> > > > But only device_type="memory" ranges can be added to the allocator.
> > > > For MMIO there are two cases:
> > > > 1. MMIO doesn't have a NUMA id property.
> > > > 2. MMIO has a NUMA id property, just like some PCIe controllers.
> > > >    But we don't need to handle these kinds of MMIO devices
> > > >    in memory block parsing, because we don't need to allocate
> > > >    memory from these MMIO ranges. And for access, we need
> > > >    a NUMA-aware PCIe controller driver or generic NUMA-aware
> > > >    MMIO access APIs.
> > >
> > > Yes, I am not too worried about devices with a NUMA id property because
> > > they are less common and this series doesn't handle them at all, right?
> > > I imagine they would be treated like any other device without NUMA
> > > awareness.
> > >
> > > I am thinking about the case where the memory of each NUMA node is made
> > > of multiple banks. I understand that this patch adds an explicit check
> > > for cases where these banks are interleaving, however there are many
> > > other cases where NUMA memory nodes are *not* interleaving but they are
> > > still made of multiple discontinuous banks, like in the two example
> > > above.
> > >
> > > My question is whether this patch series in its current form can handle
> > > the two cases above correctly. If so, I am wondering how it works given
> > > that we only have a single "start" and "size" parameter per node.
> > >
> > > On the other hand if this series cannot handle the two cases above, my
> > > question is whether it would fail explicitly or not. The new
> > > check is_node_memory_continuous doesn't seem to be able to catch them.
> > 
> > 
> > Looking at numa_update_node_memblks, it is clear that the code is meant
> > to increase the range of each numa node to cover even MMIO regions in
> > between memory banks. Also see the comment at the top of the file:
> > 
> >  * Assumes all memory regions belonging to a single proximity domain
> >  * are in one chunk. Holes between them will be included in the node.
> > 
> > So if there are multiple banks for each node, start and end are
> > stretched to cover the holes between them, and it works as long as
> > memory banks of different NUMA nodes don't interleave.
> > 
> > I would appreciate if you could add an in-code comment to explain this
> > on top of numa_update_node_memblk.
> 
> Yes, I will do it.
 
Thank you


> > Have you had a chance to test this? If not it would be fantastic if you
> > could give it a quick test to make sure it works as intended: for
> > instance by creating multiple memory banks for each NUMA node by
> > splitting an real bank into two smaller banks with a hole in between in
> > device tree, just for the sake of testing.
> 
> Yes, I have created some fake NUMA nodes in FVP device tree to test it.
> The intertwining of nodes' addresses can be detected.
> 
> (XEN) SRAT: Node 0 0000000080000000-00000000ff000000
> (XEN) SRAT: Node 1 0000000880000000-00000008c0000000
> (XEN) NODE 0: (0000000080000000-00000008d0000000) intertwine with NODE 1 (0000000880000000-00000008c0000000)

Great thanks! And what if there are multiple non-contiguous memory banks
per node, but *not* intertwined. Does that all work correctly as
expected?

^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture
  2021-09-27 10:28               ` Wei Chen
@ 2021-09-28  0:59                 ` Stefano Stabellini
  2021-09-28  4:16                   ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-28  0:59 UTC (permalink / raw)
  To: Wei Chen
  Cc: Jan Beulich, xen-devel, julien, Bertrand Marquis, Stefano Stabellini


On Mon, 27 Sep 2021, Wei Chen wrote:
> > -----Original Message-----
> > From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Wei
> > Chen
> > Sent: 2021年9月26日 18:25
> > To: Jan Beulich <jbeulich@suse.com>
> > Cc: xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > <Bertrand.Marquis@arm.com>; Stefano Stabellini <sstabellini@kernel.org>
> > Subject: RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-
> > EFI architecture
> > 
> > Hi Jan,
> > 
> > > -----Original Message-----
> > > From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> > Jan
> > > Beulich
> > > Sent: 2021年9月24日 18:49
> > > To: Wei Chen <Wei.Chen@arm.com>
> > > Cc: xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > <Bertrand.Marquis@arm.com>; Stefano Stabellini <sstabellini@kernel.org>
> > > Subject: Re: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for
> > non-
> > > EFI architecture
> > >
> > > On 24.09.2021 12:31, Wei Chen wrote:
> > > >> From: Jan Beulich <jbeulich@suse.com>
> > > >> Sent: 2021年9月24日 15:59
> > > >>
> > > >> On 24.09.2021 06:34, Wei Chen wrote:
> > > >>>> From: Stefano Stabellini <sstabellini@kernel.org>
> > > >>>> Sent: 2021年9月24日 9:15
> > > >>>>
> > > >>>> On Thu, 23 Sep 2021, Wei Chen wrote:
> > > >>>>> --- a/xen/common/Kconfig
> > > >>>>> +++ b/xen/common/Kconfig
> > > >>>>> @@ -11,6 +11,16 @@ config COMPAT
> > > >>>>>  config CORE_PARKING
> > > >>>>>  	bool
> > > >>>>>
> > > >>>>> +config EFI
> > > >>>>> +	bool
> > > >>>>
> > > >>>> Without the title the option is not user-selectable (or de-
> > > selectable).
> > > >>>> So the help message below can never be seen.
> > > >>>>
> > > >>>> Either add a title, e.g.:
> > > >>>>
> > > >>>> bool "EFI support"
> > > >>>>
> > > >>>> Or fully make the option a silent option by removing the help text.
> > > >>>
> > > >>> OK, in current Xen code, EFI is unconditionally compiled. Before
> > > >>> we change related code, I prefer to remove the help text.
> > > >>
> > > >> But that's not true: At least on x86 EFI gets compiled depending on
> > > >> tool chain capabilities. Ultimately we may indeed want a user
> > > >> selectable option here, but until then I'm afraid having this option
> > > >> at all may be misleading on x86.
> > > >>
> > > >
> > > > I check the build scripts, yes, you're right. For x86, EFI is not a
> > > > selectable option in Kconfig. I agree with you, we can't use Kconfig
> > > > system to decide to enable EFI build for x86 or not.
> > > >
> > > > So how about we just use this EFI option for Arm only? Because on Arm,
> > > > we do not have such toolchain dependency.
> > >
> > > To be honest - don't know. That's because I don't know what you want
> > > to use the option for subsequently.
> > >
> > 
> > In the last version, I had introduced an arch helper to stub EFI_BOOT
> > in Arm's common code for Arm32, because Arm32 doesn't support EFI.
> > So Julien suggested that I introduce a CONFIG_EFI option for
> > architectures without EFI support to stub out the EFI layer.
> > 
> > [1] https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00808.html
> > 
> 
> As Jan reminded, x86 doesn't depend on Kconfig to build EFI code.
> So, if we use CONFIG_EFI to stub the EFI APIs for x86, we may end up
> with the toolchain enabling EFI while Kconfig disables it, or Kconfig
> enabling EFI while the toolchain doesn't provide EFI build support.
> Then x86 could not work well.
> 
> If we use CONFIG_EFI for Arm only, that means CONFIG_EFI for x86
> is off, which will also cause problems.
> 
> So, can we still use the previous arch helpers to stub for Arm32,
> until x86 can use this selectable option?

EFI doesn't have to be necessarily a user-visible option in Kconfig at
this point. I think Julien was just asking to make the #ifdef based on
> an EFI-related config rather than just on CONFIG_ARM64.

On x86 EFI is detected based on compiler support, setting XEN_BUILD_EFI
in xen/arch/x86/Makefile. Let's say that we keep using the same name
"XEN_BUILD_EFI" on ARM as well.

On ARM32, XEN_BUILD_EFI should be always unset.

On ARM64 XEN_BUILD_EFI should be always set.

That's it, right? I'd argue that CONFIG_EFI or HAS_EFI are better names
than XEN_BUILD_EFI, but that's OK anyway. So for instance you can make
XEN_BUILD_EFI an invisible symbol in xen/arch/arm/Kconfig and select it
only on ARM64.
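As a rough sketch of this suggestion (the symbol names are illustrative; the final naming was left open in the thread), the invisible symbol in xen/arch/arm/Kconfig could look like:

```kconfig
# xen/arch/arm/Kconfig -- illustrative sketch only
config EFI
	bool    # invisible symbol: no prompt, so never user-selectable

config ARM_64
	def_bool y
	depends on 64BIT
	select EFI   # EFI support is always built on arm64, never on arm32
```

Because the symbol has no prompt, it mirrors how XEN_BUILD_EFI behaves on x86: the build, not the user, decides whether EFI code is compiled in.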

^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default NR_NODE_MEMBLKS
  2021-09-27 16:58                       ` Stefano Stabellini
@ 2021-09-28  2:57                         ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-09-28  2:57 UTC (permalink / raw)
  To: Stefano Stabellini, Julien Grall
  Cc: Julien Grall, xen-devel, Bertrand Marquis, Jan Beulich,
	Roger Pau Monné,
	Andrew Cooper

Hi Stefano, Julien,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 2021年9月28日 0:58
> To: Julien Grall <julien.grall@gmail.com>
> Cc: Wei Chen <Wei.Chen@arm.com>; Julien Grall <julien.grall.oss@gmail.com>;
> Stefano Stabellini <sstabellini@kernel.org>; xen-devel <xen-
> devel@lists.xenproject.org>; Bertrand Marquis <Bertrand.Marquis@arm.com>;
> Jan Beulich <jbeulich@suse.com>; Roger Pau Monné <roger.pau@citrix.com>;
> Andrew Cooper <andrew.cooper3@citrix.com>
> Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default
> NR_NODE_MEMBLKS
> 
> On Mon, 27 Sep 2021, Julien Grall wrote:
> > On Mon, 27 Sep 2021, 12:22 Wei Chen, <Wei.Chen@arm.com> wrote:
> >       Hi Julien,
> >
> >       From: Julien Grall <julien.grall.oss@gmail.com>
> >       Sent: 2021年9月27日 15:36
> >       To: Wei Chen <Wei.Chen@arm.com>
> >       Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-devel <xen-
> devel@lists.xenproject.org>; Bertrand Marquis
> >       <Bertrand.Marquis@arm.com>; Jan Beulich <jbeulich@suse.com>; Roger
> Pau Monné <roger.pau@citrix.com>; Andrew Cooper
> >       <andrew.cooper3@citrix.com>
> >       Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> default NR_NODE_MEMBLKS
> >
> >
> >       On Mon, 27 Sep 2021, 08:53 Wei Chen, <Wei.Chen@arm.com> wrote:
> >       Hi Julien,
> >
> >       > -----Original Message-----
> >       > From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Wei
> >       > Chen
> >       > Sent: 2021年9月27日 14:46
> >       > To: Stefano Stabellini <sstabellini@kernel.org>
> >       > Cc: xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> >       > <Bertrand.Marquis@arm.com>; jbeulich@suse.com; roger.pau@citrix.com;
> >       > andrew.cooper3@citrix.com
> >       > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> default
> >       > NR_NODE_MEMBLKS
> >       >
> >       > Hi Stefano, Julien,
> >       >
> >       > > -----Original Message-----
> >       > > From: Stefano Stabellini <sstabellini@kernel.org>
> >       > > Sent: 2021年9月27日 13:00
> >       > > To: Wei Chen <Wei.Chen@arm.com>
> >       > > Cc: Stefano Stabellini <sstabellini@kernel.org>;
> >       > > xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> >       > > <Bertrand.Marquis@arm.com>; jbeulich@suse.com; roger.pau@citrix.com;
> >       > > andrew.cooper3@citrix.com
> >       > > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override default
> >       > > NR_NODE_MEMBLKS
> >       > >
> >       > > +x86 maintainers
> >       > >
> >       > > On Mon, 27 Sep 2021, Wei Chen wrote:
> >       > > > > -----Original Message-----
> >       > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> >       > > > > Sent: 2021年9月27日 11:26
> >       > > > > To: Wei Chen <Wei.Chen@arm.com>
> >       > > > > Cc: Stefano Stabellini <sstabellini@kernel.org>;
> >       > > > > xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> >       > > > > <Bertrand.Marquis@arm.com>
> >       > > > > Subject: RE: [PATCH 22/37] xen/arm: use NR_MEM_BANKS to override
> >       > > > > default NR_NODE_MEMBLKS
> >       > > > >
> >       > > > > On Sun, 26 Sep 2021, Wei Chen wrote:
> >       > > > > > > -----Original Message-----
> >       > > > > > > From: Stefano Stabellini
> <sstabellini@kernel.org>
> >       > > > > > > Sent: 24 September 2021 9:35
> >       > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> >       > > > > > > Cc: xen-devel@lists.xenproject.org;
> sstabellini@kernel.org;
> >       > > > > julien@xen.org;
> >       > > > > > > Bertrand Marquis <Bertrand.Marquis@arm.com>
> >       > > > > > > Subject: Re: [PATCH 22/37] xen/arm: use NR_MEM_BANKS
> to override
> >       > > > > default
> >       > > > > > > NR_NODE_MEMBLKS
> >       > > > > > >
> >       > > > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> >       > > > > > > > A memory range described in device tree cannot be
> >       > > > > > > > split across multiple nodes, so we define
> >       > > > > > > > NR_NODE_MEMBLKS as NR_MEM_BANKS in the arch header.
> >       > > > > > >
> >       > > > > > > This statement is true but what is the goal of this
> patch? Is it
> >       > > to
> >       > > > > > > reduce code size and memory consumption?
> >       > > > > > >
> >       > > > > >
> >       > > > > > No, when Julien and I discussed this in the last
> >       > > > > > version[1], we hadn't thought about it so deeply. We just
> >       > > > > > thought a memory range described in DT cannot be split
> >       > > > > > across multiple nodes, so NR_NODE_MEMBLKS should be equal
> >       > > > > > to NR_MEM_BANKS.
> >       > > > > >
> >       > > > > > https://lists.xenproject.org/archives/html/xen-
> devel/2021-
> >       > > > > 08/msg00974.html
> >       > > > > >
> >       > > > > > > I am asking because NR_MEM_BANKS is 128 and
> >       > > > > > > NR_NODE_MEMBLKS=2*MAX_NUMNODES which is 64 by default
> so again
> >       > > > > > > NR_NODE_MEMBLKS is 128 before this patch.
> >       > > > > > >
> >       > > > > > > In other words, this patch alone doesn't make any
> difference; at
> >       > > least
> >       > > > > > > doesn't make any difference unless
> CONFIG_NR_NUMA_NODES is
> >       > > increased.
> >       > > > > > >
> >       > > > > > > So, is the goal to reduce memory usage when
> CONFIG_NR_NUMA_NODES
> >       > > is
> >       > > > > > > higher than 64?
> >       > > > > > >
> >       > > > > >
> >       > > > > > I also thought about this problem when I was writing this
> >       > > > > > patch. CONFIG_NR_NUMA_NODES can be increased, but
> >       > > > > > NR_MEM_BANKS is a fixed value, so NR_MEM_BANKS can become
> >       > > > > > smaller than CONFIG_NR_NUMA_NODES at some point.
> >       > > > > >
> >       > > > > > But I agree with Julien's suggestion that NR_MEM_BANKS and
> >       > > > > > NR_NODE_MEMBLKS must be aware of each other. I had thought
> >       > > > > > about adding some ASSERT checks, but I don't know how to do
> >       > > > > > it better, so I posted this patch for more suggestions.
> >       > > > >
> >       > > > > OK. In that case I'd say to get rid of the previous
> definition of
> >       > > > > NR_NODE_MEMBLKS as it is probably not necessary, see below.
> >       > > > >
> >       > > > >
> >       > > > >
> >       > > > > > >
> >       > > > > > > > And keep default NR_NODE_MEMBLKS in common header
> >       > > > > > > > for those architectures NUMA is disabled.
> >       > > > > > >
> >       > > > > > > This last sentence is not accurate: on x86 NUMA is
> enabled and
> >       > > > > > > NR_NODE_MEMBLKS is still defined in
> xen/include/xen/numa.h
> >       > (there
> >       > > is
> >       > > > > no
> >       > > > > > > x86 definition of it)
> >       > > > > > >
> >       > > > > >
> >       > > > > > Yes.
> >       > > > > >
> >       > > > > > >
> >       > > > > > > > Signed-off-by: Wei Chen <wei.chen@arm.com>
> >       > > > > > > > ---
> >       > > > > > > >  xen/include/asm-arm/numa.h | 8 +++++++-
> >       > > > > > > >  xen/include/xen/numa.h     | 2 ++
> >       > > > > > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> >       > > > > > > >
> >       > > > > > > > diff --git a/xen/include/asm-arm/numa.h
> b/xen/include/asm-
> >       > > arm/numa.h
> >       > > > > > > > index 8f1c67e3eb..21569e634b 100644
> >       > > > > > > > --- a/xen/include/asm-arm/numa.h
> >       > > > > > > > +++ b/xen/include/asm-arm/numa.h
> >       > > > > > > > @@ -3,9 +3,15 @@
> >       > > > > > > >
> >       > > > > > > >  #include <xen/mm.h>
> >       > > > > > > >
> >       > > > > > > > +#include <asm/setup.h>
> >       > > > > > > > +
> >       > > > > > > >  typedef u8 nodeid_t;
> >       > > > > > > >
> >       > > > > > > > -#ifndef CONFIG_NUMA
> >       > > > > > > > +#ifdef CONFIG_NUMA
> >       > > > > > > > +
> >       > > > > > > > +#define NR_NODE_MEMBLKS NR_MEM_BANKS
> >       > > > > > > > +
> >       > > > > > > > +#else
> >       > > > > > > >
> >       > > > > > > >  /* Fake one node for now. See also node_online_map.
> */
> >       > > > > > > >  #define cpu_to_node(cpu) 0
> >       > > > > > > > diff --git a/xen/include/xen/numa.h
> b/xen/include/xen/numa.h
> >       > > > > > > > index 1978e2be1b..1731e1cc6b 100644
> >       > > > > > > > --- a/xen/include/xen/numa.h
> >       > > > > > > > +++ b/xen/include/xen/numa.h
> >       > > > > > > > @@ -12,7 +12,9 @@
> >       > > > > > > >  #define MAX_NUMNODES    1
> >       > > > > > > >  #endif
> >       > > > > > > >
> >       > > > > > > > +#ifndef NR_NODE_MEMBLKS
> >       > > > > > > >  #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> >       > > > > > > > +#endif
> >       > > > >
> >       > > > > This one we can remove it completely right?
> >       > > >
> >       > > > How about defining NR_MEM_BANKS as:
> >       > > > #ifdef CONFIG_NR_NUMA_NODES
> >       > > > #define NR_MEM_BANKS (CONFIG_NR_NUMA_NODES * 2)
> >       > > > #else
> >       > > > #define NR_MEM_BANKS 128
> >       > > > #endif
> >       > > > for both x86 and Arm. Architectures that do not support or
> >       > > > enable NUMA can still use "NR_MEM_BANKS 128". And replace every
> >       > > > NR_NODE_MEMBLKS in the NUMA code with NR_MEM_BANKS, removing
> >       > > > NR_NODE_MEMBLKS completely. In this case, NR_MEM_BANKS is aware
> >       > > > of changes to CONFIG_NR_NUMA_NODES.
> >       > >
> >       > > x86 doesn't have NR_MEM_BANKS as far as I can tell. I guess
> you also
> >       > > meant to rename NR_NODE_MEMBLKS to NR_MEM_BANKS?
> >       > >
> >       >
> >       > Yes.
> >       >
> >       > > But NR_MEM_BANKS is not directly related to
> CONFIG_NR_NUMA_NODES because
> >       > > there can be many memory banks for each numa node, certainly
> more than
> >       > > 2. The existing definition on x86:
> >       > >
> >       > > #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
> >       > >
> >       > > Doesn't make a lot of sense to me. Was it just an arbitrary
> limit for
> >       > > the lack of a better way to set a maximum?
> >       > >
> >       >
> >       > At that time, this was probably the most cost-effective approach:
> >       > enough and easy. But if more nodes need to be supported in the
> >       > future, they may bring more memory blocks, and this maximum value
> >       > might no longer apply. The maximum may need to support dynamic
> >       > extension.
> >       >
> >       > >
> >       > > On the other hand, NR_MEM_BANKS and NR_NODE_MEMBLKS seem to be
> related.
> >       > > In fact, what's the difference?
> >       > >
> >       > > NR_MEM_BANKS is the max number of memory banks (with or
> without
> >       > > numa-node-id).
> >       > >
> >       > > NR_NODE_MEMBLKS is the max number of memory banks with NUMA
> support
> >       > > (with numa-node-id)?
> >       > >
> >       > > They are basically the same thing. On ARM I would just do:
> >       > >
> >       >
> >       > Probably not: NR_MEM_BANKS counts memory ranges without a
> >       > numa-node-id in the boot memory parsing stage (process_memory_node
> >       > or the EFI parser), while NR_NODE_MEMBLKS only counts memory
> >       > ranges with a numa-node-id.
> >       >
> >       > > #define NR_NODE_MEMBLKS MAX(NR_MEM_BANKS,
> (CONFIG_NR_NUMA_NODES * 2))
> >       > >
> >       > >
> >
> >       > Quote Julien's comment from HTML email to here:
> >       > " As you wrote above, the second part of the MAX is totally
> arbitrary.
> >       > In fact, it is very likely that if you have more than 64 nodes,
> you may
> >       > need a lot more than 2 regions per node.
> >       >
> >       > So, for Arm, I would just define NR_NODE_MEMBLKS as an alias to
> NR_MEM_BANKS
> >       > so it can be used by common code.
> >       > "
> >       >
> >       > > But here comes the problem:
> >       > > How can we set the NR_MEM_BANKS maximum value? 128 seems
> arbitrary too.
> >       >
> >       > This is based on hardware we currently support (the last time we
> bumped the value was, IIRC, for Thunder-X). In the case of
> >       booting UEFI, we can get a lot of small ranges as we discover the
> RAM using the UEFI memory map.
> >       >
> >
> >       Thanks for the background.
> >
> >       >
> >       > > If #define NR_MEM_BANKS (CONFIG_NR_NUMA_NODES * N)? And what N
> should be.
> >       >
> >       > N would have to be the maximum number of ranges you can find in
> a NUMA node.
> >       >
> >       > We would also need to make sure this doesn't break existing
> platforms. So N would have to be quite large or we need a MAX as
> >       Stefano suggested.
> >       >
> >       > But I would prefer to keep the existing 128 and allow it to be
> configured at build time (not necessarily in this series). This
> >       avoids having different ways to define the value for NUMA vs non-
> NUMA.
> >
> >       In this case, can we use Stefano's
> >       "#define NR_NODE_MEMBLKS MAX(NR_MEM_BANKS, (CONFIG_NR_NUMA_NODES *
> 2))"
> >       in the next version? If yes, should we change the x86 part? Because
> NR_MEM_BANKS
> >       has not been defined on x86.
> >
> >
> > What I meant by configuring dynamically is allowing NR_MEM_BANKS to be
> set by the user.
> >
> > The second part of the MAX makes no sense to me (at least on Arm). So I
> really prefer if this is not part of the initial version.
> >
> > We can refine the value, or introduce the MAX in the future if we have a
> justification for it.
> 
> OK, so for clarity the suggestion is:
> 
> - define NR_NODE_MEMBLKS as NR_MEM_BANKS on ARM in this series
> - in the future make NR_MEM_BANKS user-configurable via kconfig
> - for now leave NR_MEM_BANKS as 128 on ARM
> 
> That's fine by me.

Ok, I will only keep
#define NR_NODE_MEMBLKS NR_MEM_BANKS in asm-arm/numa.h, and leave
the x86 NR_NODE_MEMBLKS definition as it was in asm-x86/numa.h,
because at the current stage we cannot unify them in one place.
I will also update the commit message to record some of our
discussion in this thread.





^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA
  2021-09-27 17:17               ` Stefano Stabellini
@ 2021-09-28  2:59                 ` Wei Chen
  2021-09-28  3:30                   ` Stefano Stabellini
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-28  2:59 UTC (permalink / raw)
  To: Stefano Stabellini, Jan Beulich; +Cc: Julien Grall, xen-devel, Bertrand Marquis

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 28 September 2021 1:17
> To: Jan Beulich <jbeulich@suse.com>
> Cc: Julien Grall <julien.grall.oss@gmail.com>; Stefano Stabellini
> <sstabellini@kernel.org>; Wei Chen <Wei.Chen@arm.com>; xen-devel <xen-
> devel@lists.xenproject.org>; Bertrand Marquis <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to
> enable NUMA
> 
> On Mon, 27 Sep 2021, Jan Beulich wrote:
> > On 27.09.2021 10:45, Julien Grall wrote:
> > > On Mon, 27 Sep 2021, 10:33 Jan Beulich, <jbeulich@suse.com> wrote:
> > >
> > >> On 24.09.2021 21:39, Stefano Stabellini wrote:
> > >>> On Fri, 24 Sep 2021, Wei Chen wrote:
> > >>>> On 2021/9/24 11:31, Stefano Stabellini wrote:
> > >>>>> On Thu, 23 Sep 2021, Wei Chen wrote:
> > >>>>>> --- a/xen/arch/arm/Kconfig
> > >>>>>> +++ b/xen/arch/arm/Kconfig
> > >>>>>> @@ -34,6 +34,17 @@ config ACPI
> > >>>>>>      Advanced Configuration and Power Interface (ACPI) support
> for
> > >> Xen is
> > >>>>>>      an alternative to device tree on ARM64.
> > >>>>>>   + config DEVICE_TREE_NUMA
> > >>>>>> +  def_bool n
> > >>>>>> +  select NUMA
> > >>>>>> +
> > >>>>>> +config ARM_NUMA
> > >>>>>> +  bool "Arm NUMA (Non-Uniform Memory Access) Support
> (UNSUPPORTED)"
> > >> if
> > >>>>>> UNSUPPORTED
> > >>>>>> +  select DEVICE_TREE_NUMA if HAS_DEVICE_TREE
> > >>>>>
> > >>>>> Should it be: depends on HAS_DEVICE_TREE ?
> > >>>>> (And eventually depends on HAS_DEVICE_TREE || ACPI)
> > >>>>>
> > >>>>
> > >>>> As discussed in the RFC [1], we want to make ARM_NUMA a generic
> > >>>> option that can be selected by users, and depend on has_device_tree
> > >>>> or ACPI to select DEVICE_TREE_NUMA or ACPI_NUMA.
> > >>>>
> > >>>> If we add HAS_DEVICE_TREE || ACPI as dependencies for ARM_NUMA,
> > >>>> does it become a loop dependency?
> > >>>>
> > >>>>
> > >> https://lists.xenproject.org/archives/html/xen-devel/2021-
> 08/msg00888.html
> > >>>
> > >>> OK, I am fine with that. I was just trying to catch the case where a
> > >>> user selects "ARM_NUMA" but actually neither ACPI nor
> HAS_DEVICE_TREE
> > >>> are selected so nothing happens. I was trying to make it clear that
> > >>> ARM_NUMA depends on having at least one between HAS_DEVICE_TREE or
> ACPI
> > >>> because otherwise it is not going to work.
> > >>>
> > >>> That said, I don't think this is important because HAS_DEVICE_TREE
> > >>> cannot be unselected. So if we cannot find a way to express the
> > >>> dependency, I think it is fine to keep the patch as is.
> > >>
> > >> So how about doing things the other way around: ARM_NUMA has no
> prompt
> > >> and defaults to ACPI_NUMA || DT_NUMA, and DT_NUMA gains a prompt
> instead
> > >> (and, for Arm at least, ACPI_NUMA as well; this might even be
> worthwhile
> > >> to have on x86 down the road).
> > >>
> > >
> > > As I wrote before, I don't think the user should say "I want to enable
> NUMA
> > > with Device-Tree or ACPI". Instead, they say whether they want to use
> NUMA
> > > and let Xen decide to enable the DT/ACPI support.
> > >
> > > In other words, the prompt should stay on ARM_NUMA.
> >
> > Okay. In which case I'm confused by Stefano's question.
> 
> Let me clarify: I think it is fine to have a single prompt for NUMA in
> Kconfig. However, I am just pointing out that it is theoretically
> possible with the current code to present an ARM_NUMA prompt to the user
> but actually have no NUMA enabled at the end because both DEVICE TREE
> and ACPI are disabled. This is only a theoretical problem because DEVICE
> TREE support (HAS_DEVICE_TREE) cannot be disabled today. Also I cannot
> imagine how a configuration with neither DEVICE TREE nor ACPI can be
> correct. So I don't think it is a critical concern.
> 
> That said, you can see that, at least theoretically, ARM_NUMA depends on
> either HAS_DEVICE_TREE or ACPI, so I suggested to add:
> 
> depends on HAS_DEVICE_TREE || ACPI
> 
> Wei answered that it might introduce a circular dependency, but I did
> try the addition of "depends on HAS_DEVICE_TREE || ACPI" under ARM_NUMA
> in Kconfig and everything built fine here.

Ok, I will add "depends on HAS_DEVICE_TREE" in the next version; "|| ACPI"
will come later, once we have ACPI NUMA for Arm : )
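[Editorial note] The Kconfig arrangement this sub-thread converges on would look roughly like the sketch below. This is only an illustration of the discussed outcome, not the final committed form; option text and dependencies may differ in the applied version.

```kconfig
config DEVICE_TREE_NUMA
	def_bool n
	select NUMA

config ARM_NUMA
	bool "Arm NUMA (Non-Uniform Memory Access) Support (UNSUPPORTED)" if UNSUPPORTED
	depends on HAS_DEVICE_TREE
	select DEVICE_TREE_NUMA
```

With the prompt kept on ARM_NUMA, the user only chooses whether they want NUMA; Xen selects the DT (and, later, ACPI) backend itself.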



* RE: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to enable NUMA
  2021-09-28  2:59                 ` Wei Chen
@ 2021-09-28  3:30                   ` Stefano Stabellini
  0 siblings, 0 replies; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-28  3:30 UTC (permalink / raw)
  To: Wei Chen
  Cc: Stefano Stabellini, Jan Beulich, Julien Grall, xen-devel,
	Bertrand Marquis


On Tue, 28 Sep 2021, Wei Chen wrote:
> > -----Original Message-----
> > From: Stefano Stabellini <sstabellini@kernel.org>
> > Sent: 28 September 2021 1:17
> > To: Jan Beulich <jbeulich@suse.com>
> > Cc: Julien Grall <julien.grall.oss@gmail.com>; Stefano Stabellini
> > <sstabellini@kernel.org>; Wei Chen <Wei.Chen@arm.com>; xen-devel <xen-
> > devel@lists.xenproject.org>; Bertrand Marquis <Bertrand.Marquis@arm.com>
> > Subject: Re: [PATCH 36/37] xen/arm: Provide Kconfig options for Arm to
> > enable NUMA
> > 
> > On Mon, 27 Sep 2021, Jan Beulich wrote:
> > > On 27.09.2021 10:45, Julien Grall wrote:
> > > > On Mon, 27 Sep 2021, 10:33 Jan Beulich, <jbeulich@suse.com> wrote:
> > > >
> > > >> On 24.09.2021 21:39, Stefano Stabellini wrote:
> > > >>> On Fri, 24 Sep 2021, Wei Chen wrote:
> > > >>>> On 2021/9/24 11:31, Stefano Stabellini wrote:
> > > >>>>> On Thu, 23 Sep 2021, Wei Chen wrote:
> > > >>>>>> --- a/xen/arch/arm/Kconfig
> > > >>>>>> +++ b/xen/arch/arm/Kconfig
> > > >>>>>> @@ -34,6 +34,17 @@ config ACPI
> > > >>>>>>      Advanced Configuration and Power Interface (ACPI) support
> > for
> > > >> Xen is
> > > >>>>>>      an alternative to device tree on ARM64.
> > > >>>>>>   + config DEVICE_TREE_NUMA
> > > >>>>>> +  def_bool n
> > > >>>>>> +  select NUMA
> > > >>>>>> +
> > > >>>>>> +config ARM_NUMA
> > > >>>>>> +  bool "Arm NUMA (Non-Uniform Memory Access) Support
> > (UNSUPPORTED)"
> > > >> if
> > > >>>>>> UNSUPPORTED
> > > >>>>>> +  select DEVICE_TREE_NUMA if HAS_DEVICE_TREE
> > > >>>>>
> > > >>>>> Should it be: depends on HAS_DEVICE_TREE ?
> > > >>>>> (And eventually depends on HAS_DEVICE_TREE || ACPI)
> > > >>>>>
> > > >>>>
> > > >>>> As discussed in the RFC [1], we want to make ARM_NUMA a generic
> > > >>>> option that can be selected by users, and depend on has_device_tree
> > > >>>> or ACPI to select DEVICE_TREE_NUMA or ACPI_NUMA.
> > > >>>>
> > > >>>> If we add HAS_DEVICE_TREE || ACPI as dependencies for ARM_NUMA,
> > > >>>> does it become a loop dependency?
> > > >>>>
> > > >>>>
> > > >> https://lists.xenproject.org/archives/html/xen-devel/2021-
> > 08/msg00888.html
> > > >>>
> > > >>> OK, I am fine with that. I was just trying to catch the case where a
> > > >>> user selects "ARM_NUMA" but actually neither ACPI nor
> > HAS_DEVICE_TREE
> > > >>> are selected so nothing happens. I was trying to make it clear that
> > > >>> ARM_NUMA depends on having at least one between HAS_DEVICE_TREE or
> > ACPI
> > > >>> because otherwise it is not going to work.
> > > >>>
> > > >>> That said, I don't think this is important because HAS_DEVICE_TREE
> > > >>> cannot be unselected. So if we cannot find a way to express the
> > > >>> dependency, I think it is fine to keep the patch as is.
> > > >>
> > > >> So how about doing things the other way around: ARM_NUMA has no
> > prompt
> > > >> and defaults to ACPI_NUMA || DT_NUMA, and DT_NUMA gains a prompt
> > instead
> > > >> (and, for Arm at least, ACPI_NUMA as well; this might even be
> > worthwhile
> > > >> to have on x86 down the road).
> > > >>
> > > >
> > > > As I wrote before, I don't think the user should say "I want to enable
> > NUMA
> > > > with Device-Tree or ACPI". Instead, they say whether they want to use
> > NUMA
> > > > and let Xen decide to enable the DT/ACPI support.
> > > >
> > > > In other words, the prompt should stay on ARM_NUMA.
> > >
> > > Okay. In which case I'm confused by Stefano's question.
> > 
> > Let me clarify: I think it is fine to have a single prompt for NUMA in
> > Kconfig. However, I am just pointing out that it is theoretically
> > possible with the current code to present an ARM_NUMA prompt to the user
> > but actually have no NUMA enabled at the end because both DEVICE TREE
> > and ACPI are disabled. This is only a theoretical problem because DEVICE
> > TREE support (HAS_DEVICE_TREE) cannot be disabled today. Also I cannot
> > imagine how a configuration with neither DEVICE TREE nor ACPI can be
> > correct. So I don't think it is a critical concern.
> > 
> > That said, you can see that, at least theoretically, ARM_NUMA depends on
> > either HAS_DEVICE_TREE or ACPI, so I suggested to add:
> > 
> > depends on HAS_DEVICE_TREE || ACPI
> > 
> > Wei answered that it might introduce a circular dependency, but I did
> > try the addition of "depends on HAS_DEVICE_TREE || ACPI" under ARM_NUMA
> > in Kconfig and everything built fine here.
> 
> Ok, I will add "depends on HAS_DEVICE_TREE" in the next version; "|| ACPI"
> will come later, once we have ACPI NUMA for Arm : )

Good point :)


* RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture
  2021-09-28  0:59                 ` Stefano Stabellini
@ 2021-09-28  4:16                   ` Wei Chen
  2021-09-28  5:01                     ` Stefano Stabellini
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-28  4:16 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: Jan Beulich, xen-devel, julien, Bertrand Marquis

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 28 September 2021 9:00
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Jan Beulich <jbeulich@suse.com>; xen-devel@lists.xenproject.org;
> julien@xen.org; Bertrand Marquis <Bertrand.Marquis@arm.com>; Stefano
> Stabellini <sstabellini@kernel.org>
> Subject: RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-
> EFI architecture
> 
> On Mon, 27 Sep 2021, Wei Chen wrote:
> > > -----Original Message-----
> > > From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> Wei
> > > Chen
> > > Sent: 26 September 2021 18:25
> > > To: Jan Beulich <jbeulich@suse.com>
> > > Cc: xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > <Bertrand.Marquis@arm.com>; Stefano Stabellini <sstabellini@kernel.org>
> > > Subject: RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for
> non-
> > > EFI architecture
> > >
> > > Hi Jan,
> > >
> > > > -----Original Message-----
> > > > From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf
> Of
> > > Jan
> > > > Beulich
> > > > Sent: 24 September 2021 18:49
> > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > Cc: xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > > <Bertrand.Marquis@arm.com>; Stefano Stabellini
> <sstabellini@kernel.org>
> > > > Subject: Re: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for
> > > non-
> > > > EFI architecture
> > > >
> > > > On 24.09.2021 12:31, Wei Chen wrote:
> > > > >> From: Jan Beulich <jbeulich@suse.com>
> > > > >> Sent: 24 September 2021 15:59
> > > > >>
> > > > >> On 24.09.2021 06:34, Wei Chen wrote:
> > > > >>>> From: Stefano Stabellini <sstabellini@kernel.org>
> > > > >>>> Sent: 24 September 2021 9:15
> > > > >>>>
> > > > >>>> On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > >>>>> --- a/xen/common/Kconfig
> > > > >>>>> +++ b/xen/common/Kconfig
> > > > >>>>> @@ -11,6 +11,16 @@ config COMPAT
> > > > >>>>>  config CORE_PARKING
> > > > >>>>>  	bool
> > > > >>>>>
> > > > >>>>> +config EFI
> > > > >>>>> +	bool
> > > > >>>>
> > > > >>>> Without the title the option is not user-selectable (or de-
> > > > selectable).
> > > > >>>> So the help message below can never be seen.
> > > > >>>>
> > > > >>>> Either add a title, e.g.:
> > > > >>>>
> > > > >>>> bool "EFI support"
> > > > >>>>
> > > > >>>> Or fully make the option a silent option by removing the help
> text.
> > > > >>>
> > > > >>> OK, in the current Xen code, EFI is unconditionally compiled.
> > > > >>> Until we change the related code, I prefer to remove the help text.
> > > > >>
> > > > >> But that's not true: At least on x86 EFI gets compiled depending
> on
> > > > >> tool chain capabilities. Ultimately we may indeed want a user
> > > > >> selectable option here, but until then I'm afraid having this
> option
> > > > >> at all may be misleading on x86.
> > > > >>
> > > > >
> > > > > I checked the build scripts; yes, you're right. For x86, EFI is
> > > > > not a selectable option in Kconfig. I agree with you, we can't use
> > > > > the Kconfig system to decide whether to enable the EFI build for x86.
> > > > >
> > > > > So how about we just use this EFI option for Arm only? On Arm we do
> > > > > not have such a toolchain dependency.
> > > >
> > > > To be honest - don't know. That's because I don't know what you want
> > > > to use the option for subsequently.
> > > >
> > >
> > > In the last version, I introduced an arch helper to stub EFI_BOOT
> > > in Arm's common code for Arm32, because Arm32 doesn't support EFI.
> > > So Julien suggested that I introduce a CONFIG_EFI option for
> > > architectures without EFI support to stub the EFI layer.
> > >
> > > [1] https://lists.xenproject.org/archives/html/xen-devel/2021-
> > > 08/msg00808.html
> > >
> >
> > As Jan reminded us, x86 doesn't depend on Kconfig to build EFI code.
> > So, if we use CONFIG_EFI to stub the EFI APIs for x86, we may end up
> > with the toolchain enabling EFI while Kconfig disables it, or with
> > Kconfig enabling EFI while the toolchain doesn't provide EFI build
> > support. In either case x86 would not work well.
> >
> > If we use CONFIG_EFI for Arm only, that means CONFIG_EFI for x86
> > is off, which would also cause problems.
> >
> > So, can we still use the previous arch helpers to stub for Arm32,
> > until x86 can use this selectable option?
> 
> EFI doesn't have to be necessarily a user-visible option in Kconfig at
> this point. I think Julien was just asking to make the #ifdef based on
> an EFI-related config rather than just on CONFIG_ARM64.
> 
> On x86 EFI is detected based on compiler support, setting XEN_BUILD_EFI
> in xen/arch/x86/Makefile. Let's say that we keep using the same name
> "XEN_BUILD_EFI" on ARM as well.
> 
> On ARM32, XEN_BUILD_EFI should be always unset.
> 
> On ARM64 XEN_BUILD_EFI should be always set.
> 
> That's it, right? I'd argue that CONFIG_EFI or HAS_EFI are better names
> than XEN_BUILD_EFI, but that's OK anyway. So for instance you can make
> XEN_BUILD_EFI an invisible symbol in xen/arch/arm/Kconfig and select it
> only on ARM64.

Thanks, this is a good approach. But if we place XEN_BUILD_EFI in Kconfig,
it will turn into CONFIG_XEN_BUILD_EFI. How about using another name
in Kconfig, like ARM_EFI, and then using CONFIG_ARM_EFI in config.h to
define XEN_BUILD_EFI?





* RE: [PATCH 08/37] xen/x86: add detection of discontinous node memory range
  2021-09-27 17:19                 ` Stefano Stabellini
@ 2021-09-28  4:41                   ` Wei Chen
  2021-09-28  4:59                     ` Stefano Stabellini
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2021-09-28  4:41 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, julien, Bertrand Marquis, jbeulich, andrew.cooper3,
	roger.pau, wl

Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabellini@kernel.org>
> Sent: 28 September 2021 1:19
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> <Bertrand.Marquis@arm.com>; jbeulich@suse.com; andrew.cooper3@citrix.com;
> roger.pau@citrix.com; wl@xen.org
> Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> memory range
> 
> On Mon, 27 Sep 2021, Wei Chen wrote:
> > > -----Original Message-----
> > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > Sent: 27 September 2021 13:05
> > > To: Stefano Stabellini <sstabellini@kernel.org>
> > > Cc: Wei Chen <Wei.Chen@arm.com>; xen-devel@lists.xenproject.org;
> > > julien@xen.org; Bertrand Marquis <Bertrand.Marquis@arm.com>;
> > > jbeulich@suse.com; andrew.cooper3@citrix.com; roger.pau@citrix.com;
> > > wl@xen.org
> > > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > memory range
> > >
> > > On Sun, 26 Sep 2021, Stefano Stabellini wrote:
> > > > On Sun, 26 Sep 2021, Wei Chen wrote:
> > > > > > -----Original Message-----
> > > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > > Sent: 25 September 2021 3:53
> > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > > > > > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > > > > <Bertrand.Marquis@arm.com>; jbeulich@suse.com;
> > > andrew.cooper3@citrix.com;
> > > > > > roger.pau@citrix.com; wl@xen.org
> > > > > > Subject: RE: [PATCH 08/37] xen/x86: add detection of
> discontinous
> > > node
> > > > > > memory range
> > > > > >
> > > > > > On Fri, 24 Sep 2021, Wei Chen wrote:
> > > > > > > > -----Original Message-----
> > > > > > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > > > > Sent: 24 September 2021 8:26
> > > > > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > > > > Cc: xen-devel@lists.xenproject.org; sstabellini@kernel.org;
> > > > > > julien@xen.org;
> > > > > > > > Bertrand Marquis <Bertrand.Marquis@arm.com>;
> jbeulich@suse.com;
> > > > > > > > andrew.cooper3@citrix.com; roger.pau@citrix.com; wl@xen.org
> > > > > > > > Subject: Re: [PATCH 08/37] xen/x86: add detection of
> > > discontinous node
> > > > > > > > memory range
> > > > > > > >
> > > > > > > > CC'ing x86 maintainers
> > > > > > > >
> > > > > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> >       > > > > > > > > > One NUMA node may contain several memory blocks.
> >       > > > > > > > > > In the current Xen code, Xen maintains a node
> >       > > > > > > > > > memory range for each node to cover all its memory
> >       > > > > > > > > > blocks. But here comes the problem: in the gap
> >       > > > > > > > > > between one node's two memory blocks, there may be
> >       > > > > > > > > > memory blocks that do not belong to this node
> >       > > > > > > > > > (remote memory blocks). This node's memory range
> >       > > > > > > > > > will then be expanded to cover those remote blocks.
> >       > > > > > > > > >
> >       > > > > > > > > > One node's memory range containing other nodes'
> >       > > > > > > > > > memory is obviously not very reasonable. It means
> >       > > > > > > > > > the current NUMA code can only support nodes with
> >       > > > > > > > > > contiguous memory blocks. However, on a physical
> >       > > > > > > > > > machine, the addresses of multiple nodes can be
> >       > > > > > > > > > interleaved.
> >       > > > > > > > > >
> >       > > > > > > > > > So in this patch, we add code to detect
> >       > > > > > > > > > discontinuous memory blocks for one node. NUMA
> >       > > > > > > > > > initialization will fail and error messages will be
> >       > > > > > > > > > printed when Xen detects such a hardware
> >       > > > > > > > > > configuration.
> > > > > > > >
> > > > > > > > At least on ARM, it is not just memory that can be
> interleaved,
> > > but
> > > > > > also
> > > > > > > > MMIO regions. For instance:
> > > > > > > >
> > > > > > > > node0 bank0 0-0x1000000
> > > > > > > > MMIO 0x1000000-0x1002000
> > > > > > > > Hole 0x1002000-0x2000000
> > > > > > > > node0 bank1 0x2000000-0x3000000
> > > > > > > >
> > > > > > > > So I am not familiar with the SRAT format, but I think on
> ARM
> > > the
> > > > > > check
> > > > > > > > would look different: we would just look for multiple memory
> > > ranges
> > > > > > > > under a device_type = "memory" node of a NUMA node in device
> > > tree.
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > Should I need to include/refine above message to commit log?
> > > > > >
> > > > > > Let me ask you a question first.
> > > > > >
> > > > > > With the NUMA implementation of this patch series, can we deal
> with
> > > > > > cases where each node has multiple memory banks, not interleaved?
> > > > >
> > > > > Yes.
> > > > >
> > > > > > As an example:
> > > > > >
> > > > > > node0: 0x0        - 0x10000000
> > > > > > MMIO : 0x10000000 - 0x20000000
> > > > > > node0: 0x20000000 - 0x30000000
> > > > > > MMIO : 0x30000000 - 0x50000000
> > > > > > node1: 0x50000000 - 0x60000000
> > > > > > MMIO : 0x60000000 - 0x80000000
> > > > > > node2: 0x80000000 - 0x90000000
> > > > > >
> > > > > >
> > > > > > I assume we can deal with this case simply by setting node0
> > > > > > memory to 0x0-0x30000000, even if there is actually something
> > > > > > else, a device, that doesn't belong to node0 in between the two
> > > > > > node0 banks?
> > > > >
> > > > > While this configuration is rare in SoC design, it is not
> > > > > impossible.
> > > >
> > > > Definitely, I have seen it before.
> > > >
> > > >
> > > > > > Is it only other nodes' memory interleaving that causes issues?
> > > > > > In other words, is only the following a problematic scenario?
> > > > > >
> > > > > > node0: 0x0        - 0x10000000
> > > > > > MMIO : 0x10000000 - 0x20000000
> > > > > > node1: 0x20000000 - 0x30000000
> > > > > > MMIO : 0x30000000 - 0x50000000
> > > > > > node0: 0x50000000 - 0x60000000
> > > > > >
> > > > > > Because node1 is in between the two ranges of node0?
> > > > > >
> > > > >
> > > > > But only device_type = "memory" ranges can be added to the
> > > > > allocator. For MMIO there are two cases:
> > > > > 1. The MMIO region doesn't have a NUMA id property.
> > > > > 2. The MMIO region has a NUMA id property, just like some PCIe
> > > > >    controllers. But we don't need to handle these kinds of MMIO
> > > > >    devices in memory block parsing, because we don't need to
> > > > >    allocate memory from these MMIO ranges. And for accessing them,
> > > > >    we would need a NUMA-aware PCIe controller driver or generic
> > > > >    NUMA-aware MMIO access APIs.
> > > >
> > > > Yes, I am not too worried about devices with a NUMA id property
> > > > because they are less common and this series doesn't handle them at
> > > > all, right? I imagine they would be treated like any other device
> > > > without NUMA awareness.
> > > >
> > > > I am thinking about the case where the memory of each NUMA node is
> > > > made of multiple banks. I understand that this patch adds an
> > > > explicit check for cases where these banks are interleaving; however,
> > > > there are many other cases where NUMA memory nodes are *not*
> > > > interleaving but they are still made of multiple discontinuous
> > > > banks, like in the two examples above.
> > > >
> > > > My question is whether this patch series in its current form can
> > > > handle the two cases above correctly. If so, I am wondering how it
> > > > works given that we only have a single "start" and "size" parameter
> > > > per node.
> > > >
> > > > On the other hand, if this series cannot handle the two cases above,
> > > > my question is whether it would fail explicitly or not. The new
> > > > check is_node_memory_continuous doesn't seem to be able to catch
> > > > them.
> > >
> > >
> > > Looking at numa_update_node_memblks, it is clear that the code is
> > > meant to increase the range of each numa node to cover even MMIO
> > > regions in between memory banks. Also see the comment at the top of
> > > the file:
> > >
> > >  * Assumes all memory regions belonging to a single proximity domain
> > >  * are in one chunk. Holes between them will be included in the node.
> > >
> > > So if there are multiple banks for each node, start and end are
> > > stretched to cover the holes between them, and it works as long as
> > > memory banks of different NUMA nodes don't interleave.
> > >
> > > I would appreciate it if you could add an in-code comment to explain
> > > this on top of numa_update_node_memblk.
> >
> > Yes, I will do it.
> 
> Thank you
> 
> 
> > > Have you had a chance to test this? If not, it would be fantastic if
> > > you could give it a quick test to make sure it works as intended: for
> > > instance by creating multiple memory banks for each NUMA node,
> > > splitting a real bank into two smaller banks with a hole in between
> > > in device tree, just for the sake of testing.
> >
> > Yes, I have created some fake NUMA nodes in the FVP device tree to test
> > it. The intertwining of node addresses can be detected:
> >
> > (XEN) SRAT: Node 0 0000000080000000-00000000ff000000
> > (XEN) SRAT: Node 1 0000000880000000-00000008c0000000
> > (XEN) NODE 0: (0000000080000000-00000008d0000000) intertwine with NODE 1 (0000000880000000-00000008c0000000)
> 
> Great thanks! And what if there are multiple non-contiguous memory banks
> per node, but *not* intertwined. Does that all work correctly as
> expected?

Yes, I am using a device tree setting like this:

    memory@80000000 {
        device_type = "memory";
        reg = <0x0 0x80000000 0x0 0x80000000>;
        numa-node-id = <0>;
    };

    memory@880000000 {
        device_type = "memory";
        reg = <0x8 0x80000000 0x0 0x8000000>;
        numa-node-id = <0>;
    };

    memory@890000000 {
        device_type = "memory";
        reg = <0x8 0x90000000 0x0 0x8000000>;
        numa-node-id = <0>;
    };

    memory@8A0000000 {
        device_type = "memory";
        reg = <0x8 0xA0000000 0x0 0x8000000>;
        numa-node-id = <0>;
    };

    memory@8B0000000 {
        device_type = "memory";
        reg = <0x8 0xB0000000 0x0 0x8000000>;
        numa-node-id = <0>;
    };

    memory@8C0000000 {
        device_type = "memory";
        reg = <0x8 0xC0000000 0x0 0x8000000>;
        numa-node-id = <1>;
    };

    memory@8D0000000 {
        device_type = "memory";
        reg = <0x8 0xD0000000 0x0 0x8000000>;
        numa-node-id = <1>;
    };

    memory@8E0000000 {
        device_type = "memory";
        reg = <0x8 0xE0000000 0x0 0x8000000>;
        numa-node-id = <1>;
    };

    memory@8F0000000 {
        device_type = "memory";
        reg = <0x8 0xF0000000 0x0 0x8000000>;
        numa-node-id = <1>;
    };

And in Xen we got the output:

(XEN) DT: NUMA node 0 processor parsed
(XEN) DT: NUMA node 0 processor parsed
(XEN) DT: NUMA node 1 processor parsed
(XEN) DT: NUMA node 1 processor parsed
(XEN) SRAT: Node 0 0000000080000000-00000000ff000000
(XEN) SRAT: Node 0 0000000880000000-0000000888000000
(XEN) SRAT: Node 0 0000000890000000-0000000898000000
(XEN) SRAT: Node 0 00000008a0000000-00000008a8000000
(XEN) SRAT: Node 0 00000008b0000000-00000008b8000000
(XEN) SRAT: Node 1 00000008c0000000-00000008c8000000
(XEN) SRAT: Node 1 00000008d0000000-00000008d8000000
(XEN) SRAT: Node 1 00000008e0000000-00000008e8000000
(XEN) SRAT: Node 1 00000008f0000000-00000008f8000000
(XEN) NUMA: parsing numa-distance-map
(XEN) NUMA: distance: NODE#0->NODE#0:10
(XEN) NUMA: distance: NODE#0->NODE#1:20
(XEN) NUMA: distance: NODE#1->NODE#1:10
(XEN) NUMA: Using 16 for the hash shift.
(XEN) Domain heap initialised
(XEN) Booting using Device Tree
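[Editor's note: the distance values in that log correspond to a numa-distance-map node in the device tree. A fragment consistent with the output above would look roughly like this, based on the common v1 binding; only the matrix values are taken from the log, the rest is boilerplate:]

```dts
distance-map {
    compatible = "numa-distance-map-v1";
    distance-matrix = <0 0 10>,
                      <0 1 20>,
                      <1 1 10>;
};
```
Entries are (node A, node B, distance); the local distance is 10 and the remote distance 20, matching the NODE#0->NODE#1:20 line in the log.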

Dom0 can boot successfully, and xl info gives:
xl info
host                   : X-Dom0
release                : 5.12.0
version                : #20 SMP PREEMPT Wed Jul 28 13:41:28 CST 2021
machine                : aarch64
nr_cpus                : 4
max_cpu_id             : 3
nr_nodes               : 2
cores_per_socket       : 1
threads_per_core       : 1

Using the Xen debug console to dump NUMA info, we got:

(XEN) 'u' pressed -> dumping numa info (now = 13229372281010)
(XEN) NODE0 start->524288 size->8617984 free->388741
(XEN) NODE1 start->9175040 size->229376 free->106460
(XEN) CPU0...1 -> NODE0
(XEN) CPU2...3 -> NODE1
(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 262144):
(XEN)     Node 0: 262144
(XEN)     Node 1: 0


^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 08/37] xen/x86: add detection of discontinous node memory range
  2021-09-28  4:41                   ` Wei Chen
@ 2021-09-28  4:59                     ` Stefano Stabellini
  0 siblings, 0 replies; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-28  4:59 UTC (permalink / raw)
  To: Wei Chen
  Cc: Stefano Stabellini, xen-devel, julien, Bertrand Marquis,
	jbeulich, andrew.cooper3, roger.pau, wl


On Tue, 28 Sep 2021, Wei Chen wrote:
> Hi Stefano,
> 
> > -----Original Message-----
> > From: Stefano Stabellini <sstabellini@kernel.org>
> > Sent: 2021年9月28日 1:19
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: Stefano Stabellini <sstabellini@kernel.org>; xen-
> > devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > <Bertrand.Marquis@arm.com>; jbeulich@suse.com; andrew.cooper3@citrix.com;
> > roger.pau@citrix.com; wl@xen.org
> > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> > memory range
> > 
> > On Mon, 27 Sep 2021, Wei Chen wrote:
> > > > -----Original Message-----
> > > > From: Stefano Stabellini <sstabellini@kernel.org>
> > > > Sent: 2021年9月27日 13:05
> > > > To: Stefano Stabellini <sstabellini@kernel.org>
> > > > Cc: Wei Chen <Wei.Chen@arm.com>; xen-devel@lists.xenproject.org;
> > > > julien@xen.org; Bertrand Marquis <Bertrand.Marquis@arm.com>;
> > > > jbeulich@suse.com; andrew.cooper3@citrix.com; roger.pau@citrix.com;
> > > > wl@xen.org
> > > > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> > > > memory range
> > > >
> > > > [...]
> > > > Have you had a chance to test this? If not, it would be fantastic
> > > > if you could give it a quick test to make sure it works as intended:
> > > > for instance by creating multiple memory banks for each NUMA node,
> > > > splitting a real bank into two smaller banks with a hole in between
> > > > in device tree, just for the sake of testing.
> > >
> > > Yes, I have created some fake NUMA nodes in the FVP device tree to
> > > test it. The intertwining of node addresses can be detected:
> > >
> > > (XEN) SRAT: Node 0 0000000080000000-00000000ff000000
> > > (XEN) SRAT: Node 1 0000000880000000-00000008c0000000
> > > (XEN) NODE 0: (0000000080000000-00000008d0000000) intertwine with NODE 1 (0000000880000000-00000008c0000000)
> > 
> > Great thanks! And what if there are multiple non-contiguous memory banks
> > per node, but *not* intertwined. Does that all work correctly as
> > expected?
> 
> Yes, I am using a device tree setting like this:

Perfect! Thank you!




* RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture
  2021-09-28  4:16                   ` Wei Chen
@ 2021-09-28  5:01                     ` Stefano Stabellini
  2021-09-28  8:02                       ` Jan Beulich
  0 siblings, 1 reply; 192+ messages in thread
From: Stefano Stabellini @ 2021-09-28  5:01 UTC (permalink / raw)
  To: Wei Chen
  Cc: Stefano Stabellini, Jan Beulich, xen-devel, julien, Bertrand Marquis


On Tue, 28 Sep 2021, Wei Chen wrote:
> > -----Original Message-----
> > From: Stefano Stabellini <sstabellini@kernel.org>
> > Sent: 2021年9月28日 9:00
> > To: Wei Chen <Wei.Chen@arm.com>
> > Cc: Jan Beulich <jbeulich@suse.com>; xen-devel@lists.xenproject.org;
> > julien@xen.org; Bertrand Marquis <Bertrand.Marquis@arm.com>; Stefano
> > Stabellini <sstabellini@kernel.org>
> > Subject: RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-
> > EFI architecture
> > 
> > On Mon, 27 Sep 2021, Wei Chen wrote:
> > > > -----Original Message-----
> > > > From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> > Wei
> > > > Chen
> > > > Sent: 2021年9月26日 18:25
> > > > To: Jan Beulich <jbeulich@suse.com>
> > > > Cc: xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > > <Bertrand.Marquis@arm.com>; Stefano Stabellini <sstabellini@kernel.org>
> > > > Subject: RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for
> > non-
> > > > EFI architecture
> > > >
> > > > Hi Jan,
> > > >
> > > > > -----Original Message-----
> > > > > From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf
> > Of
> > > > Jan
> > > > > Beulich
> > > > > Sent: 2021年9月24日 18:49
> > > > > To: Wei Chen <Wei.Chen@arm.com>
> > > > > Cc: xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> > > > > <Bertrand.Marquis@arm.com>; Stefano Stabellini
> > <sstabellini@kernel.org>
> > > > > Subject: Re: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for
> > > > non-
> > > > > EFI architecture
> > > > >
> > > > > On 24.09.2021 12:31, Wei Chen wrote:
> > > > > >> From: Jan Beulich <jbeulich@suse.com>
> > > > > >> Sent: 2021年9月24日 15:59
> > > > > >>
> > > > > >> On 24.09.2021 06:34, Wei Chen wrote:
> > > > > >>>> From: Stefano Stabellini <sstabellini@kernel.org>
> > > > > >>>> Sent: 2021年9月24日 9:15
> > > > > >>>>
> > > > > >>>> On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > >>>>> --- a/xen/common/Kconfig
> > > > > >>>>> +++ b/xen/common/Kconfig
> > > > > >>>>> @@ -11,6 +11,16 @@ config COMPAT
> > > > > >>>>>  config CORE_PARKING
> > > > > >>>>>  	bool
> > > > > >>>>>
> > > > > >>>>> +config EFI
> > > > > >>>>> +	bool
> > > > > >>>>
> > > > > >>>> Without the title the option is not user-selectable (or de-
> > > > > selectable).
> > > > > >>>> So the help message below can never be seen.
> > > > > >>>>
> > > > > >>>> Either add a title, e.g.:
> > > > > >>>>
> > > > > >>>> bool "EFI support"
> > > > > >>>>
> > > > > >>>> Or fully make the option a silent option by removing the help
> > text.
> > > > > >>>
> > > > > >>> OK, in current Xen code, EFI is unconditionally compiled. Before
> > > > > >>> we change related code, I prefer to remove the help text.
> > > > > >>
> > > > > >> But that's not true: At least on x86, EFI gets compiled
> > > > > >> depending on tool chain capabilities. Ultimately we may indeed
> > > > > >> want a user-selectable option here, but until then I'm afraid
> > > > > >> having this option at all may be misleading on x86.
> > > > > >>
> > > > > >
> > > > > > I checked the build scripts, and yes, you're right. For x86,
> > > > > > EFI is not a selectable option in Kconfig. I agree with you; we
> > > > > > can't use the Kconfig system to decide whether to enable the EFI
> > > > > > build for x86.
> > > > > >
> > > > > > So how about we just use this EFI option for Arm only? Because
> > > > > > on Arm, we do not have such a toolchain dependency.
> > > > >
> > > > > To be honest - don't know. That's because I don't know what you want
> > > > > to use the option for subsequently.
> > > > >
> > > >
> > > > In the last version, I had introduced an arch helper to stub
> > > > EFI_BOOT in Arm's common code, because Arm32 doesn't support EFI.
> > > > So Julien suggested that I introduce a CONFIG_EFI option for
> > > > non-EFI-supported architectures to stub the EFI layer.
> > > >
> > > > [1] https://lists.xenproject.org/archives/html/xen-devel/2021-
> > > > 08/msg00808.html
> > > >
> > >
> > > As Jan reminded us, x86 doesn't depend on Kconfig to build its EFI
> > > code. So, if we use CONFIG_EFI to stub the EFI APIs for x86, we could
> > > end up with the toolchain enabling EFI while Kconfig disables it, or
> > > with Kconfig enabling EFI while the toolchain doesn't support an EFI
> > > build. In either case x86 would not work well.
> > >
> > > If we use CONFIG_EFI for Arm only, that means CONFIG_EFI for x86
> > > is off, which will also cause problems.
> > >
> > > So, can we still use the previous arch_helpers to stub for Arm32,
> > > until x86 can use this selectable option?
> > 
> > EFI doesn't necessarily have to be a user-visible option in Kconfig at
> > this point. I think Julien was just asking to make the #ifdef based on
> > an EFI-related config rather than just on CONFIG_ARM64.
> > 
> > On x86 EFI is detected based on compiler support, setting XEN_BUILD_EFI
> > in xen/arch/x86/Makefile. Let's say that we keep using the same name
> > "XEN_BUILD_EFI" on ARM as well.
> > 
> > On ARM32, XEN_BUILD_EFI should be always unset.
> > 
> > On ARM64 XEN_BUILD_EFI should be always set.
> > 
> > That's it, right? I'd argue that CONFIG_EFI or HAS_EFI are better names
> > than XEN_BUILD_EFI, but that's OK anyway. So for instance you can make
> > XEN_BUILD_EFI an invisible symbol in xen/arch/arm/Kconfig and select it
> > only on ARM64.
> 
> Thanks, this is a good approach. But if we place XEN_BUILD_EFI in
> Kconfig, it will be turned into CONFIG_XEN_BUILD_EFI. How about using
> another name in Kconfig, like ARM_EFI, but using CONFIG_ARM_EFI in
> config.h to define XEN_BUILD_EFI?

I am OK with that. Another option is to rename XEN_BUILD_EFI to
CONFIG_XEN_BUILD_EFI on x86. Either way is fine by me. Jan, do you have
a preference?
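[Editor's note: the arrangement under discussion could look roughly like the following. This is a sketch only; the ARM_EFI name is the one proposed above, and the exact file locations and symbol wiring are assumptions, not the merged implementation.]

```
# xen/arch/arm/Kconfig (sketch): an invisible symbol with no prompt,
# selected only for 64-bit Arm, where EFI boot is supported.
config ARM_EFI
	bool

config ARM_64
	def_bool y if 64BIT
	select ARM_EFI
```

With that in place, asm-arm/config.h could translate the Kconfig symbol into the existing build guard (e.g. `#ifdef CONFIG_ARM_EFI` / `#define XEN_BUILD_EFI 1`), leaving the x86 Makefile-based toolchain detection untouched.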


* Re: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture
  2021-09-28  5:01                     ` Stefano Stabellini
@ 2021-09-28  8:02                       ` Jan Beulich
  2021-10-03 23:28                         ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2021-09-28  8:02 UTC (permalink / raw)
  To: Stefano Stabellini, Wei Chen; +Cc: xen-devel, julien, Bertrand Marquis

On 28.09.2021 07:01, Stefano Stabellini wrote:
> On Tue, 28 Sep 2021, Wei Chen wrote:
>>> -----Original Message-----
>>> From: Stefano Stabellini <sstabellini@kernel.org>
>>> Sent: 2021年9月28日 9:00
>>> To: Wei Chen <Wei.Chen@arm.com>
>>> Cc: Jan Beulich <jbeulich@suse.com>; xen-devel@lists.xenproject.org;
>>> julien@xen.org; Bertrand Marquis <Bertrand.Marquis@arm.com>; Stefano
>>> Stabellini <sstabellini@kernel.org>
>>> Subject: RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-
>>> EFI architecture
>>>
>>> On Mon, 27 Sep 2021, Wei Chen wrote:
>>>>> [...]
>>>>
>>>> As Jan reminded us, x86 doesn't depend on Kconfig to build its EFI
>>>> code. So, if we use CONFIG_EFI to stub the EFI APIs for x86, we could
>>>> end up with the toolchain enabling EFI while Kconfig disables it, or
>>>> with Kconfig enabling EFI while the toolchain doesn't support an EFI
>>>> build. In either case x86 would not work well.
>>>>
>>>> If we use CONFIG_EFI for Arm only, that means CONFIG_EFI for x86
>>>> is off, which will also cause problems.
>>>>
>>>> So, can we still use the previous arch_helpers to stub for Arm32,
>>>> until x86 can use this selectable option?
>>>
>>> EFI doesn't have to be necessarily a user-visible option in Kconfig at
>>> this point. I think Julien was just asking to make the #ifdef based on
>>> an EFI-related config rather than just on CONFIG_ARM64.
>>>
>>> On x86 EFI is detected based on compiler support, setting XEN_BUILD_EFI
>>> in xen/arch/x86/Makefile. Let's say that we keep using the same name
>>> "XEN_BUILD_EFI" on ARM as well.
>>>
>>> On ARM32, XEN_BUILD_EFI should be always unset.
>>>
>>> On ARM64 XEN_BUILD_EFI should be always set.
>>>
>>> That's it, right? I'd argue that CONFIG_EFI or HAS_EFI are better names
>>> than XEN_BUILD_EFI, but that's OK anyway. So for instance you can make
>>> XEN_BUILD_EFI an invisible symbol in xen/arch/arm/Kconfig and select it
>>> only on ARM64.
>>
>> Thanks, this is a good approach. But if we place XEN_BUILD_EFI in Kconfig
>> it will be turned into CONFIG_XEN_BUILD_EFI. How about using another name
>> in Kconfig, like ARM_EFI, but using CONFIG_ARM_EFI in config.h to define
>> XEN_BUILD_EFI?
> 
> I am OK with that. Another option is to rename XEN_BUILD_EFI to
> CONFIG_XEN_BUILD_EFI on x86. Either way is fine by me. Jan, do you have
> a preference?

Yes, I do: No new CONFIG_* settings please that don't originate from
Kconfig. Hence I'm afraid this is a "no" to your suggestion.

Mid-term we should try to get rid of the remaining CONFIG_* which
get #define-d in e.g. asm/config.h.

Jan



^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-EFI architecture
  2021-09-28  8:02                       ` Jan Beulich
@ 2021-10-03 23:28                         ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2021-10-03 23:28 UTC (permalink / raw)
  To: Jan Beulich, Stefano Stabellini; +Cc: xen-devel, julien, Bertrand Marquis



> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2021年9月28日 16:03
> To: Stefano Stabellini <sstabellini@kernel.org>; Wei Chen
> <Wei.Chen@arm.com>
> Cc: xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> <Bertrand.Marquis@arm.com>
> Subject: Re: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for non-
> EFI architecture
> 
> On 28.09.2021 07:01, Stefano Stabellini wrote:
> > On Tue, 28 Sep 2021, Wei Chen wrote:
> >>> -----Original Message-----
> >>> From: Stefano Stabellini <sstabellini@kernel.org>
> >>> Sent: 2021年9月28日 9:00
> >>> To: Wei Chen <Wei.Chen@arm.com>
> >>> Cc: Jan Beulich <jbeulich@suse.com>; xen-devel@lists.xenproject.org;
> >>> julien@xen.org; Bertrand Marquis <Bertrand.Marquis@arm.com>; Stefano
> >>> Stabellini <sstabellini@kernel.org>
> >>> Subject: RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for
> non-
> >>> EFI architecture
> >>>
> >>> On Mon, 27 Sep 2021, Wei Chen wrote:
> >>>>> -----Original Message-----
> >>>>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf
> Of
> >>> Wei
> >>>>> Chen
> >>>>> Sent: 2021年9月26日 18:25
> >>>>> To: Jan Beulich <jbeulich@suse.com>
> >>>>> Cc: xen-devel@lists.xenproject.org; julien@xen.org; Bertrand Marquis
> >>>>> <Bertrand.Marquis@arm.com>; Stefano Stabellini
> <sstabellini@kernel.org>
> >>>>> Subject: RE: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API for
> >>> non-
> >>>>> EFI architecture
> >>>>>
> >>>>> Hi Jan,
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf
> >>> Of
> >>>>> Jan
> >>>>>> Beulich
> >>>>>> Sent: 2021年9月24日 18:49
> >>>>>> To: Wei Chen <Wei.Chen@arm.com>
> >>>>>> Cc: xen-devel@lists.xenproject.org; julien@xen.org; Bertrand
> Marquis
> >>>>>> <Bertrand.Marquis@arm.com>; Stefano Stabellini
> >>> <sstabellini@kernel.org>
> >>>>>> Subject: Re: [PATCH 20/37] xen: introduce CONFIG_EFI to stub API
> for
> >>>>> non-
> >>>>>> EFI architecture
> >>>>>>
> >>>>>> On 24.09.2021 12:31, Wei Chen wrote:
> >>>>>>>> From: Jan Beulich <jbeulich@suse.com>
> >>>>>>>> Sent: 2021年9月24日 15:59
> >>>>>>>>
> >>>>>>>> On 24.09.2021 06:34, Wei Chen wrote:
> >>>>>>>>>> From: Stefano Stabellini <sstabellini@kernel.org>
> >>>>>>>>>> Sent: 2021年9月24日 9:15
> >>>>>>>>>>
> >>>>>>>>>> On Thu, 23 Sep 2021, Wei Chen wrote:
> >>>>>>>>>>> --- a/xen/common/Kconfig
> >>>>>>>>>>> +++ b/xen/common/Kconfig
> >>>>>>>>>>> @@ -11,6 +11,16 @@ config COMPAT
> >>>>>>>>>>>  config CORE_PARKING
> >>>>>>>>>>>  	bool
> >>>>>>>>>>>
> >>>>>>>>>>> +config EFI
> >>>>>>>>>>> +	bool
> >>>>>>>>>>
> >>>>>>>>>> Without the title the option is not user-selectable (or
> >>>>>>>>>> de-selectable). So the help message below can never be seen.
> >>>>>>>>>>
> >>>>>>>>>> Either add a title, e.g.:
> >>>>>>>>>>
> >>>>>>>>>> bool "EFI support"
> >>>>>>>>>>
> >>>>>>>>>> Or fully make the option a silent option by removing the help
> >>>>>>>>>> text.
> >>>>>>>>>
> >>>>>>>>> OK, in current Xen code, EFI is unconditionally compiled. Before
> >>>>>>>>> we change related code, I prefer to remove the help text.
> >>>>>>>>
> >>>>>>>> But that's not true: At least on x86 EFI gets compiled depending
> >>>>>>>> on tool chain capabilities. Ultimately we may indeed want a user
> >>>>>>>> selectable option here, but until then I'm afraid having this
> >>>>>>>> option at all may be misleading on x86.
> >>>>>>>>
> >>>>>>>
> >>>>>>> I checked the build scripts; yes, you're right. For x86, EFI is
> >>>>>>> not a selectable option in Kconfig. I agree with you: we can't use
> >>>>>>> the Kconfig system to decide whether to enable the EFI build for
> >>>>>>> x86 or not.
> >>>>>>>
> >>>>>>> So how about we just use this EFI option for Arm only? Because
> >>>>>>> on Arm, we do not have such a toolchain dependency.
> >>>>>>
> >>>>>> To be honest - don't know. That's because I don't know what you
> >>>>>> want to use the option for subsequently.
> >>>>>>
> >>>>>
> >>>>> In the last version, I introduced an arch helper to stub EFI_BOOT
> >>>>> in Arm's common code for Arm32, because Arm32 doesn't support EFI.
> >>>>> So Julien suggested that I introduce a CONFIG_EFI option for
> >>>>> architectures without EFI support to stub out the EFI layer.
> >>>>>
> >>>>> [1] https://lists.xenproject.org/archives/html/xen-devel/2021-
> >>>>> 08/msg00808.html
> >>>>>
> >>>>
> >>>> As Jan reminded, x86 doesn't depend on Kconfig to build EFI code.
> >>>> So, if we use CONFIG_EFI to stub the EFI APIs for x86, we may end
> >>>> up with the toolchain enabling EFI but Kconfig disabling it, or
> >>>> Kconfig enabling EFI while the toolchain doesn't provide EFI build
> >>>> support. Then x86 would not work well.
> >>>>
> >>>> If we use CONFIG_EFI for Arm only, that means CONFIG_EFI for x86
> >>>> is off, which will also cause problems.
> >>>>
> >>>> So, can we still use the previous arch helpers to stub for Arm32,
> >>>> until x86 can use this selectable option?
> >>>
> >>> EFI doesn't have to be necessarily a user-visible option in Kconfig at
> >>> this point. I think Julien was just asking to make the #ifdef based
> >>> on an EFI-related config rather than just on CONFIG_ARM64.
> >>>
> >>> On x86 EFI is detected based on compiler support, setting
> >>> XEN_BUILD_EFI in xen/arch/x86/Makefile. Let's say that we keep using
> >>> the same name "XEN_BUILD_EFI" on ARM as well.
> >>>
> >>> On ARM32, XEN_BUILD_EFI should be always unset.
> >>>
> >>> On ARM64 XEN_BUILD_EFI should be always set.
> >>>
> >>> That's it, right? I'd argue that CONFIG_EFI or HAS_EFI are better
> >>> names than XEN_BUILD_EFI, but that's OK anyway. So for instance you
> >>> can make XEN_BUILD_EFI an invisible symbol in xen/arch/arm/Kconfig
> >>> and select it only on ARM64.
> >>
> >> Thanks, this is a good approach. But if we place XEN_BUILD_EFI in
> >> Kconfig it will be turned into CONFIG_XEN_BUILD_EFI. How about using
> >> another name in Kconfig, like ARM_EFI, but using CONFIG_ARM_EFI in
> >> config.h to define XEN_BUILD_EFI?
> >
> > I am OK with that. Another option is to rename XEN_BUILD_EFI to
> > CONFIG_XEN_BUILD_EFI on x86. Either way is fine by me. Jan, do you
> > have a preference?
> 
> Yes, I do: No new CONFIG_* settings please that don't originate from
> Kconfig. Hence I'm afraid this is a "no" to your suggestion.
> 
> Mid-term we should try to get rid of the remaining CONFIG_* which
> get #define-d in e.g. asm/config.h.
> 

I will do something like this:
 - introduce an invisible ARM_EFI symbol in Kconfig, selected by ARM64 only
 - use CONFIG_ARM_EFI to define XEN_BUILD_EFI in config.h
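Sketched in Kconfig terms, the plan above might look roughly like this (an illustration only; the exact file placement and the ARM_64 symbol spelling are assumptions based on the discussion, not the final patch):

```kconfig
# xen/arch/arm/Kconfig (sketch): with no prompt string, ARM_EFI is an
# invisible symbol that users cannot toggle; arm64 selects it always.
config ARM_EFI
	bool

config ARM_64
	def_bool y
	select ARM_EFI
```

Then `xen/include/asm-arm/config.h` could carry `#ifdef CONFIG_ARM_EFI` / `#define XEN_BUILD_EFI 1` / `#endif`, so existing code guarded by XEN_BUILD_EFI keeps working unchanged.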

> Jan



* Re: [PATCH 04/37] xen: introduce an arch helper for default dma zone status
  2021-09-23 12:02 ` [PATCH 04/37] xen: introduce an arch helper for default dma zone status Wei Chen
  2021-09-23 23:55   ` Stefano Stabellini
@ 2022-01-17 16:10   ` Jan Beulich
  2022-01-18  7:51     ` Wei Chen
  1 sibling, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2022-01-17 16:10 UTC (permalink / raw)
  To: Wei Chen; +Cc: Bertrand.Marquis, xen-devel, sstabellini, julien

I realize this series has been pending for a long time, but I don't
recall any indication that it would have been dropped. Hence as a
first try, a few comments on this relatively simple change. I'm
sorry it has taken so long to get to it.

On 23.09.2021 14:02, Wei Chen wrote:
> In the current code, when Xen is running on a multi-node NUMA
> system, it will set dma_bitsize in end_boot_allocator to reserve
> some low-address memory for DMA.
> 
> The current implementation carries some x86 implications, because
> on x86 memory starts from 0. On a multi-node NUMA system, a single
> node may contain the majority or all of the DMA memory, and x86
> prefers to give out memory from non-local allocations rather than
> exhausting the DMA memory ranges. Hence x86 uses dma_bitsize to set
> aside some largely arbitrary amount of memory for DMA memory ranges.
> Allocations from these ranges would happen only after exhausting
> all other nodes' memory.
> 
> But these implications are not shared across all architectures; Arm,
> for example, doesn't have them. So in this patch, we introduce an
> arch_have_default_dmazone helper for each arch to determine whether
> it needs to set dma_bitsize to reserve memory for DMA allocations.

How would Arm guarantee availability of memory below a certain
boundary for limited-capability devices? Or is there no need
because there's an assumption that I/O for such devices would
always pass through an IOMMU, lifting address size restrictions?
(I guess in a !PV build on x86 we could also get rid of such a
reservation.)

> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -371,6 +371,11 @@ unsigned int __init arch_get_dma_bitsize(void)
>                   + PAGE_SHIFT, 32);
>  }
>  
> +unsigned int arch_have_default_dmazone(void)
> +{
> +    return ( num_online_nodes() > 1 ) ? 1 : 0;
> +}

According to the expression and ...

> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -1889,7 +1889,7 @@ void __init end_boot_allocator(void)
>      }
>      nr_bootmem_regions = 0;
>  
> -    if ( !dma_bitsize && (num_online_nodes() > 1) )
> +    if ( !dma_bitsize && arch_have_default_dmazone() )
>          dma_bitsize = arch_get_dma_bitsize();

... the use site, you mean the function to return boolean. Please
indicate so by making it have a return type of "bool". Independent
of that you don't need a conditional expression above, nor
(malformed) use of parentheses. I further wonder whether ...

> --- a/xen/include/asm-arm/numa.h
> +++ b/xen/include/asm-arm/numa.h
> @@ -25,6 +25,11 @@ extern mfn_t first_valid_mfn;
>  #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
>  #define __node_distance(a, b) (20)
>  
> +static inline unsigned int arch_have_default_dmazone(void)
> +{
> +    return 0;
> +}

... like this one, x86'es couldn't be inline as well. If indeed
it can't be, making it a macro may still be better (and avoid a
further comment regarding the lack of __init).

Jan




* Re: [PATCH 19/37] xen/x86: promote VIRTUAL_BUG_ON to ASSERT in
  2021-09-23 12:02 ` [PATCH 19/37] xen/x86: promote VIRTUAL_BUG_ON to ASSERT in Wei Chen
@ 2022-01-17 16:21   ` Jan Beulich
  2022-01-18  7:52     ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2022-01-17 16:21 UTC (permalink / raw)
  To: Wei Chen; +Cc: Bertrand.Marquis, xen-devel, sstabellini, julien

On 23.09.2021 14:02, Wei Chen wrote:
> The VIRTUAL_BUG_ON used in phys_to_nid is an empty macro. This
> results in the two lines of error-checking code in phys_to_nid not
> actually working. It also covers up two compilation errors:
> 1. error: ‘MAX_NUMNODES’ undeclared (first use in this function).
>    This is because MAX_NUMNODES is defined in xen/numa.h, but
>    asm/numa.h is a dependency of xen/numa.h, so we can't
>    include xen/numa.h in asm/numa.h. This error has been fixed
>    after we moved phys_to_nid to xen/numa.h.

This could easily be taken care of by moving MAX_NUMNODES up ahead of
the asm/numa.h inclusion point. And then the change here would become
independent of the rest of the series (and could hence go in early).

> 2. error: wrong type argument to unary exclamation mark.
>    This is because the error-checking code contains !node_data[nid],
>    but node_data is a structure variable, not a pointer.
> 
> So, in this patch, we use ASSERT in VIRTUAL_BUG_ON to enable the two
> lines of error-checking code.

May I suggest to drop VIRTUAL_BUG_ON() and instead use ASSERT()
directly?
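Concretely, that suggestion would turn phys_to_nid into something like the following sketch. The table sizes, the memnode_shift value, and mapping Xen's ASSERT() onto the standard assert() are illustrative assumptions made to keep the example self-contained, not Xen's actual definitions.

```c
#include <assert.h>

#define ASSERT(p)       assert(p)   /* stand-in for Xen's ASSERT() */
#define MAX_NUMNODES    64
#define MEMNODEMAP_SIZE 256

typedef unsigned char nodeid_t;

/* Simplified stand-ins for the tables Xen builds during NUMA init. */
static nodeid_t memnodemap[MEMNODEMAP_SIZE]; /* all zero: node 0 */
static unsigned int memnode_shift = 20;      /* 1 MiB granularity */

/* VIRTUAL_BUG_ON() dropped; the checks become plain ASSERT()s that
 * actually fire in debug builds. */
static nodeid_t phys_to_nid(unsigned long paddr)
{
    nodeid_t nid;

    ASSERT((paddr >> memnode_shift) < MEMNODEMAP_SIZE);
    nid = memnodemap[paddr >> memnode_shift];
    ASSERT(nid < MAX_NUMNODES);
    return nid;
}
```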

Jan




* RE: [PATCH 04/37] xen: introduce an arch helper for default dma zone status
  2022-01-17 16:10   ` Jan Beulich
@ 2022-01-18  7:51     ` Wei Chen
  2022-01-18  8:16       ` Jan Beulich
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2022-01-18  7:51 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2022年1月18日 0:11
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Bertrand Marquis <Bertrand.Marquis@arm.com>; xen-
> devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org
> Subject: Re: [PATCH 04/37] xen: introduce an arch helper for default dma
> zone status
> 
> I realize this series has been pending for a long time, but I don't
> recall any indication that it would have been dropped. Hence as a
> first try, a few comments on this relatively simple change. I'm
> sorry it has taken so long to get to it.
> 

Thanks for reviewing this series and picking it up again. We are
still working on it and will send a new version soon.

> On 23.09.2021 14:02, Wei Chen wrote:
> > In the current code, when Xen is running on a multi-node NUMA
> > system, it will set dma_bitsize in end_boot_allocator to reserve
> > some low-address memory for DMA.
> >
> > The current implementation carries some x86 implications, because
> > on x86 memory starts from 0. On a multi-node NUMA system, a single
> > node may contain the majority or all of the DMA memory, and x86
> > prefers to give out memory from non-local allocations rather than
> > exhausting the DMA memory ranges. Hence x86 uses dma_bitsize to set
> > aside some largely arbitrary amount of memory for DMA memory ranges.
> > Allocations from these ranges would happen only after exhausting
> > all other nodes' memory.
> >
> > But these implications are not shared across all architectures; Arm,
> > for example, doesn't have them. So in this patch, we introduce an
> > arch_have_default_dmazone helper for each arch to determine whether
> > it needs to set dma_bitsize to reserve memory for DMA allocations.
> 
> How would Arm guarantee availability of memory below a certain
> boundary for limited-capability devices? Or is there no need
> because there's an assumption that I/O for such devices would
> always pass through an IOMMU, lifting address size restrictions?
> (I guess in a !PV build on x86 we could also get rid of such a
> reservation.)

On Arm, we can still have some devices with limited DMA capability,
and we don't force all such devices to use an IOMMU. These devices
affect dma_bitsize: the RPi platform, for example, sets its dma_bitsize
to 30. But on a multi-node NUMA system, Arm doesn't have a default
DMA zone; having multiple nodes is not a constraint on dma_bitsize.
Some previous discussion can be found here [1].
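As a side note on what dma_bitsize expresses: it is a bit width, so the RPi value of 30 mentioned above corresponds to a 1 GiB boundary, i.e. the DMA-capable range sits below 1 << dma_bitsize. A trivial illustration (the dma_limit helper is hypothetical, written only to show the arithmetic):

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound (exclusive) of the address range implied by dma_bitsize. */
static uint64_t dma_limit(unsigned int dma_bitsize)
{
    return UINT64_C(1) << dma_bitsize;
}
```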

> 
> > --- a/xen/arch/x86/numa.c
> > +++ b/xen/arch/x86/numa.c
> > @@ -371,6 +371,11 @@ unsigned int __init arch_get_dma_bitsize(void)
> >                   + PAGE_SHIFT, 32);
> >  }
> >
> > +unsigned int arch_have_default_dmazone(void)
> > +{
> > +    return ( num_online_nodes() > 1 ) ? 1 : 0;
> > +}
> 
> According to the expression and ...
> 
> > --- a/xen/common/page_alloc.c
> > +++ b/xen/common/page_alloc.c
> > @@ -1889,7 +1889,7 @@ void __init end_boot_allocator(void)
> >      }
> >      nr_bootmem_regions = 0;
> >
> > -    if ( !dma_bitsize && (num_online_nodes() > 1) )
> > +    if ( !dma_bitsize && arch_have_default_dmazone() )
> >          dma_bitsize = arch_get_dma_bitsize();
> 
> ... the use site, you mean the function to return boolean. Please
> indicate so by making it have a return type of "bool". Independent
> of that you don't need a conditional expression above, nor
> (malformed) use of parentheses. I further wonder whether ...
> 

I will fix them in the next version. But I am not very clear about
this comment: "you don't need a conditional expression above".
Does the "above" refer to this line:
"return ( num_online_nodes() > 1 ) ? 1 : 0;"?

> > --- a/xen/include/asm-arm/numa.h
> > +++ b/xen/include/asm-arm/numa.h
> > @@ -25,6 +25,11 @@ extern mfn_t first_valid_mfn;
> >  #define node_start_pfn(nid) (mfn_x(first_valid_mfn))
> >  #define __node_distance(a, b) (20)
> >
> > +static inline unsigned int arch_have_default_dmazone(void)
> > +{
> > +    return 0;
> > +}
> 
> ... like this one, x86'es couldn't be inline as well. If indeed
> it can't be, making it a macro may still be better (and avoid a
> further comment regarding the lack of __init).

OK, that would be better; I will do it in the next version.

> 
> Jan

[1] https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00772.html




* RE: [PATCH 19/37] xen/x86: promote VIRTUAL_BUG_ON to ASSERT in
  2022-01-17 16:21   ` Jan Beulich
@ 2022-01-18  7:52     ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2022-01-18  7:52 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2022年1月18日 0:22
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Bertrand Marquis <Bertrand.Marquis@arm.com>; xen-
> devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org
> Subject: Re: [PATCH 19/37] xen/x86: promote VIRTUAL_BUG_ON to ASSERT in
> 
> On 23.09.2021 14:02, Wei Chen wrote:
> > The VIRTUAL_BUG_ON used in phys_to_nid is an empty macro. This
> > results in the two lines of error-checking code in phys_to_nid not
> > actually working. It also covers up two compilation errors:
> > 1. error: ‘MAX_NUMNODES’ undeclared (first use in this function).
> >    This is because MAX_NUMNODES is defined in xen/numa.h, but
> >    asm/numa.h is a dependency of xen/numa.h, so we can't
> >    include xen/numa.h in asm/numa.h. This error has been fixed
> >    after we moved phys_to_nid to xen/numa.h.
> 
> This could easily be taken care of by moving MAX_NUMNODES up ahead of
> the asm/numa.h inclusion point. And then the change here would become
> independent of the rest of the series (and could hence go in early).
> 
> > 2. error: wrong type argument to unary exclamation mark.
> >    This is because the error-checking code contains !node_data[nid],
> >    but node_data is a structure variable, not a pointer.
> >
> > So, in this patch, we use ASSERT in VIRTUAL_BUG_ON to enable the two
> > lines of error-checking code.
> 
> May I suggest to drop VIRTUAL_BUG_ON() and instead use ASSERT()
> directly?
> 

Sure!

> Jan



* Re: [PATCH 04/37] xen: introduce an arch helper for default dma zone status
  2022-01-18  7:51     ` Wei Chen
@ 2022-01-18  8:16       ` Jan Beulich
  2022-01-18  9:20         ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2022-01-18  8:16 UTC (permalink / raw)
  To: Wei Chen; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien

On 18.01.2022 08:51, Wei Chen wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 2022年1月18日 0:11
>> On 23.09.2021 14:02, Wei Chen wrote:
>>> In the current code, when Xen is running on a multi-node NUMA
>>> system, it will set dma_bitsize in end_boot_allocator to reserve
>>> some low-address memory for DMA.
>>>
>>> The current implementation carries some x86 implications, because
>>> on x86 memory starts from 0. On a multi-node NUMA system, a single
>>> node may contain the majority or all of the DMA memory, and x86
>>> prefers to give out memory from non-local allocations rather than
>>> exhausting the DMA memory ranges. Hence x86 uses dma_bitsize to set
>>> aside some largely arbitrary amount of memory for DMA memory ranges.
>>> Allocations from these ranges would happen only after exhausting
>>> all other nodes' memory.
>>>
>>> But these implications are not shared across all architectures; Arm,
>>> for example, doesn't have them. So in this patch, we introduce an
>>> arch_have_default_dmazone helper for each arch to determine whether
>>> it needs to set dma_bitsize to reserve memory for DMA allocations.
>>
>> How would Arm guarantee availability of memory below a certain
>> boundary for limited-capability devices? Or is there no need
>> because there's an assumption that I/O for such devices would
>> always pass through an IOMMU, lifting address size restrictions?
>> (I guess in a !PV build on x86 we could also get rid of such a
>> reservation.)
> 
> On Arm, we can still have some devices with limited DMA capability,
> and we don't force all such devices to use an IOMMU. These devices
> affect dma_bitsize: the RPi platform, for example, sets its dma_bitsize
> to 30. But on a multi-node NUMA system, Arm doesn't have a default
> DMA zone; having multiple nodes is not a constraint on dma_bitsize.
> Some previous discussion can be found here [1].

I'm afraid that doesn't give me more clues. For example, in the mail
being replied to there I find "That means, only first 4GB memory can
be used for DMA." Yet that's not an implication from setting
dma_bitsize. DMA is fine to occur to any address. The special address
range is being held back in case in particular Dom0 is in need of such
a range to perform I/O to _some_ devices.

>>> --- a/xen/arch/x86/numa.c
>>> +++ b/xen/arch/x86/numa.c
>>> @@ -371,6 +371,11 @@ unsigned int __init arch_get_dma_bitsize(void)
>>>                   + PAGE_SHIFT, 32);
>>>  }
>>>
>>> +unsigned int arch_have_default_dmazone(void)
>>> +{
>>> +    return ( num_online_nodes() > 1 ) ? 1 : 0;
>>> +}
>>
>> According to the expression and ...
>>
>>> --- a/xen/common/page_alloc.c
>>> +++ b/xen/common/page_alloc.c
>>> @@ -1889,7 +1889,7 @@ void __init end_boot_allocator(void)
>>>      }
>>>      nr_bootmem_regions = 0;
>>>
>>> -    if ( !dma_bitsize && (num_online_nodes() > 1) )
>>> +    if ( !dma_bitsize && arch_have_default_dmazone() )
>>>          dma_bitsize = arch_get_dma_bitsize();
>>
>> ... the use site, you mean the function to return boolean. Please
>> indicate so by making it have a return type of "bool". Independent
>> of that you don't need a conditional expression above, nor
>> (malformed) use of parentheses. I further wonder whether ...
>>
> 
> I will fix them in the next version. But I am not very clear about
> this comment: "you don't need a conditional expression above".
> Does the "above" refer to this line:
> "return ( num_online_nodes() > 1 ) ? 1 : 0;"?

Yes. Even without the use of bool such an expression is a more
complicated form of

    return num_online_nodes() > 1;

where we'd prefer to use the simpler variant for being easier to
read / follow.

Jan

> [1] https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00772.html
> 
> 




* RE: [PATCH 04/37] xen: introduce an arch helper for default dma zone status
  2022-01-18  8:16       ` Jan Beulich
@ 2022-01-18  9:20         ` Wei Chen
  2022-01-18 14:16           ` Jan Beulich
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2022-01-18  9:20 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2022年1月18日 16:16
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Bertrand Marquis <Bertrand.Marquis@arm.com>; xen-
> devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org
> Subject: Re: [PATCH 04/37] xen: introduce an arch helper for default dma
> zone status
> 
> On 18.01.2022 08:51, Wei Chen wrote:
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: 2022年1月18日 0:11
> >> On 23.09.2021 14:02, Wei Chen wrote:
> >>> In the current code, when Xen is running on a multi-node NUMA
> >>> system, it will set dma_bitsize in end_boot_allocator to reserve
> >>> some low-address memory for DMA.
> >>>
> >>> The current implementation carries some x86 implications, because
> >>> on x86 memory starts from 0. On a multi-node NUMA system, a single
> >>> node may contain the majority or all of the DMA memory, and x86
> >>> prefers to give out memory from non-local allocations rather than
> >>> exhausting the DMA memory ranges. Hence x86 uses dma_bitsize to set
> >>> aside some largely arbitrary amount of memory for DMA memory ranges.
> >>> Allocations from these ranges would happen only after exhausting
> >>> all other nodes' memory.
> >>>
> >>> But these implications are not shared across all architectures; Arm,
> >>> for example, doesn't have them. So in this patch, we introduce an
> >>> arch_have_default_dmazone helper for each arch to determine whether
> >>> it needs to set dma_bitsize to reserve memory for DMA allocations.
> >>
> >> How would Arm guarantee availability of memory below a certain
> >> boundary for limited-capability devices? Or is there no need
> >> because there's an assumption that I/O for such devices would
> >> always pass through an IOMMU, lifting address size restrictions?
> >> (I guess in a !PV build on x86 we could also get rid of such a
> >> reservation.)
> >
> > On Arm, we can still have some devices with limited DMA capability,
> > and we don't force all such devices to use an IOMMU. These devices
> > affect dma_bitsize: the RPi platform, for example, sets its dma_bitsize
> > to 30. But on a multi-node NUMA system, Arm doesn't have a default
> > DMA zone; having multiple nodes is not a constraint on dma_bitsize.
> > Some previous discussion can be found here [1].
> 
> I'm afraid that doesn't give me more clues. For example, in the mail
> being replied to there I find "That means, only first 4GB memory can
> be used for DMA." Yet that's not an implication from setting
> dma_bitsize. DMA is fine to occur to any address. The special address
> range is being held back in case in particular Dom0 is in need of such
> a range to perform I/O to _some_ devices.

I am sorry that my last reply didn't give you more clues. On Arm, only
Dom0 can do DMA without an IOMMU. So when we allocate memory for Dom0,
we try to allocate memory under 4GB, or in the range that dma_bitsize
indicates. I think these operations match your description of Dom0's
special address range above. As we have already allocated memory for
DMA, I think we don't need a DMA zone in page allocation. I am not sure
whether that answers your earlier question.

> 
> >>> --- a/xen/arch/x86/numa.c
> >>> +++ b/xen/arch/x86/numa.c
> >>> @@ -371,6 +371,11 @@ unsigned int __init arch_get_dma_bitsize(void)
> >>>                   + PAGE_SHIFT, 32);
> >>>  }
> >>>
> >>> +unsigned int arch_have_default_dmazone(void)
> >>> +{
> >>> +    return ( num_online_nodes() > 1 ) ? 1 : 0;
> >>> +}
> >>
> >> According to the expression and ...
> >>
> >>> --- a/xen/common/page_alloc.c
> >>> +++ b/xen/common/page_alloc.c
> >>> @@ -1889,7 +1889,7 @@ void __init end_boot_allocator(void)
> >>>      }
> >>>      nr_bootmem_regions = 0;
> >>>
> >>> -    if ( !dma_bitsize && (num_online_nodes() > 1) )
> >>> +    if ( !dma_bitsize && arch_have_default_dmazone() )
> >>>          dma_bitsize = arch_get_dma_bitsize();
> >>
> >> ... the use site, you mean the function to return boolean. Please
> >> indicate so by making it have a return type of "bool". Independent
> >> of that you don't need a conditional expression above, nor
> >> (malformed) use of parentheses. I further wonder whether ...
> >>
> >
> > I will fix them in the next version. But I am not very clear about
> > this comment: "you don't need a conditional expression above".
> > Does the "above" refer to this line:
> > "return ( num_online_nodes() > 1 ) ? 1 : 0;"?
> 
> Yes. Even without the use of bool such an expression is a more
> complicated form of
> 
>     return num_online_nodes() > 1;
> 
> where we'd prefer to use the simpler variant for being easier to
> read / follow.
> 

Thanks for clarification, I will fix it. 

> Jan
> 
> > [1] https://lists.xenproject.org/archives/html/xen-devel/2021-08/msg00772.html
> >
> >



* Re: [PATCH 04/37] xen: introduce an arch helper for default dma zone status
  2022-01-18  9:20         ` Wei Chen
@ 2022-01-18 14:16           ` Jan Beulich
  2022-01-19  2:49             ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2022-01-18 14:16 UTC (permalink / raw)
  To: Wei Chen; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien

On 18.01.2022 10:20, Wei Chen wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 2022年1月18日 16:16
>>
>> On 18.01.2022 08:51, Wei Chen wrote:
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: 2022年1月18日 0:11
>>>> On 23.09.2021 14:02, Wei Chen wrote:
>>>>> In current code, when Xen is running in a multiple nodes NUMA
>>>>> system, it will set dma_bitsize in end_boot_allocator to reserve
>>>>> some low address memory for DMA.
>>>>>
>>>>> There are some x86 implications in current implementation. Becuase
>>>>> on x86, memory starts from 0. On a multiple nodes NUMA system, if
>>>>> a single node contains the majority or all of the DMA memory. x86
>>>>> prefer to give out memory from non-local allocations rather than
>>>>> exhausting the DMA memory ranges. Hence x86 use dma_bitsize to set
>>>>> aside some largely arbitrary amount memory for DMA memory ranges.
>>>>> The allocations from these memory ranges would happen only after
>>>>> exhausting all other nodes' memory.
>>>>>
>>>>> But the implications are not shared across all architectures. For
>>>>> example, Arm doesn't have these implications. So in this patch, we
>>>>> introduce an arch_have_default_dmazone helper for arch to determine
>>>>> that it need to set dma_bitsize for reserve DMA allocations or not.
>>>>
>>>> How would Arm guarantee availability of memory below a certain
>>>> boundary for limited-capability devices? Or is there no need
>>>> because there's an assumption that I/O for such devices would
>>>> always pass through an IOMMU, lifting address size restrictions?
>>>> (I guess in a !PV build on x86 we could also get rid of such a
>>>> reservation.)
>>>
>>> On Arm, we still can have some devices with limited DMA capability.
>>> And we also don't force all such devices to use an IOMMU. These devices
>>> will affect the dma_bitsize. For example, the RPi platform sets its
>>> dma_bitsize to 30. But on a multi-node NUMA system, Arm doesn't have a
>>> default DMA zone. Having multiple nodes is not a constraint on
>>> dma_bitsize. And some previous discussions can be found here [1].
>>
>> I'm afraid that doesn't give me more clues. For example, in the mail
>> being replied to there I find "That means, only first 4GB memory can
>> be used for DMA." Yet that's not an implication from setting
>> dma_bitsize. DMA is fine to occur to any address. The special address
>> range is being held back in case in particular Dom0 is in need of such
>> a range to perform I/O to _some_ devices.
> 
> I am sorry that my last reply hasn't given you more clues. On Arm, only
> Dom0 can have DMA without an IOMMU. So when we allocate memory for Dom0,
> we're trying to allocate memory under 4GB or in the range indicated by
> dma_bitsize. I think these operations match your Dom0 special address
> range description above. As we have already allocated memory for DMA, I
> think we don't need a DMA zone in page allocation. I am not sure whether
> that answers your earlier question?

I view all of this as flawed, or as a workaround at best. Xen shouldn't
make assumptions on what Dom0 may need. Instead Dom0 should make
arrangements such that it can do I/O to/from all devices of interest.
This may involve arranging for address restricted buffers. And for this
to be possible, Xen would need to have available some suitable memory.
I understand this is complicated by the fact that despite being HVM-like,
due to the lack of an IOMMU in front of certain devices address
restrictions on Dom0 address space alone (i.e. without any Xen
involvement) won't help ...

Jan



^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 07/37] xen/x86: use paddr_t for addresses in NUMA node structure
  2021-09-23 12:02 ` [PATCH 07/37] xen/x86: use paddr_t for addresses in NUMA node structure Wei Chen
  2021-09-24  0:11   ` Stefano Stabellini
@ 2022-01-18 15:22   ` Jan Beulich
  2022-01-19  6:33     ` Wei Chen
  1 sibling, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2022-01-18 15:22 UTC (permalink / raw)
  To: Wei Chen; +Cc: Bertrand.Marquis, xen-devel, sstabellini, julien

On 23.09.2021 14:02, Wei Chen wrote:
> @@ -201,11 +201,12 @@ void __init numa_init_array(void)
>  static int numa_fake __initdata = 0;
>  
>  /* Numa emulation */
> -static int __init numa_emulation(u64 start_pfn, u64 end_pfn)
> +static int __init numa_emulation(unsigned long start_pfn,
> +                                 unsigned long end_pfn)
>  {
>      int i;
>      struct node nodes[MAX_NUMNODES];
> -    u64 sz = ((end_pfn - start_pfn)<<PAGE_SHIFT) / numa_fake;
> +    u64 sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake;

Nit: Please convert to uint64_t (and alike) whenever you touch a line
anyway that uses being-phased-out types.

> @@ -249,24 +250,26 @@ static int __init numa_emulation(u64 start_pfn, u64 end_pfn)
>  void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
>  { 
>      int i;
> +    paddr_t start, end;
>  
>  #ifdef CONFIG_NUMA_EMU
>      if ( numa_fake && !numa_emulation(start_pfn, end_pfn) )
>          return;
>  #endif
>  
> +    start = pfn_to_paddr(start_pfn);
> +    end = pfn_to_paddr(end_pfn);

Nit: Would be slightly neater if these were the initializers of the
variables.

>  #ifdef CONFIG_ACPI_NUMA
> -    if ( !numa_off && !acpi_scan_nodes((u64)start_pfn << PAGE_SHIFT,
> -         (u64)end_pfn << PAGE_SHIFT) )
> +    if ( !numa_off && !acpi_scan_nodes(start, end) )
>          return;
>  #endif
>  
>      printk(KERN_INFO "%s\n",
>             numa_off ? "NUMA turned off" : "No NUMA configuration found");
>  
> -    printk(KERN_INFO "Faking a node at %016"PRIx64"-%016"PRIx64"\n",
> -           (u64)start_pfn << PAGE_SHIFT,
> -           (u64)end_pfn << PAGE_SHIFT);
> +    printk(KERN_INFO "Faking a node at %016"PRIpaddr"-%016"PRIpaddr"\n",
> +           start, end);

When switching to PRIpaddr I suppose you did look up what that one
expands to? IOW - please drop the 016 from here.

> @@ -441,7 +441,7 @@ void __init srat_parse_regions(u64 addr)
>  	    acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat))
>  		return;
>  
> -	srat_region_mask = pdx_init_mask(addr);
> +	srat_region_mask = pdx_init_mask((u64)addr);

I don't see the need for a cast here.

> @@ -489,7 +489,7 @@ int __init acpi_scan_nodes(u64 start, u64 end)
>  	/* Finally register nodes */
>  	for_each_node_mask(i, all_nodes_parsed)
>  	{
> -		u64 size = nodes[i].end - nodes[i].start;
> +		paddr_t size = nodes[i].end - nodes[i].start;
>  		if ( size == 0 )

Please take the opportunity and add the missing blank line between
declarations and statements.

> --- a/xen/include/asm-x86/numa.h
> +++ b/xen/include/asm-x86/numa.h
> @@ -16,7 +16,7 @@ extern cpumask_t     node_to_cpumask[];
>  #define node_to_cpumask(node)    (node_to_cpumask[node])
>  
>  struct node { 
> -	u64 start,end; 
> +	paddr_t start,end;

Please take the opportunity and add the missing blank after the comma.

Jan



^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 08/37] xen/x86: add detection of discontinous node memory range
  2021-09-23 12:02 ` [PATCH 08/37] xen/x86: add detection of discontinous node memory range Wei Chen
  2021-09-24  0:25   ` Stefano Stabellini
@ 2022-01-18 16:13   ` Jan Beulich
  2022-01-19  7:33     ` Wei Chen
  1 sibling, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2022-01-18 16:13 UTC (permalink / raw)
  To: Wei Chen; +Cc: Bertrand.Marquis, xen-devel, sstabellini, julien

On 23.09.2021 14:02, Wei Chen wrote:
> One NUMA node may contain several memory blocks. In the current Xen
> code, Xen maintains a node memory range for each node to cover
> all its memory blocks. But here comes the problem: if, in the gap
> between two of one node's memory blocks, there are memory blocks that
> don't belong to this node (remote memory blocks), this node's memory
> range will be expanded to cover these remote memory blocks.
>
> Having one node's memory range contain other nodes' memory is obviously
> not reasonable. This means the current NUMA code can only support
> nodes with contiguous memory blocks. However, on a physical machine, the
> addresses of multiple nodes can be interleaved.
>
> So in this patch, we add code to detect discontinuous memory blocks
> for one node. NUMA initialization will fail and error messages
> will be printed when Xen detects such a hardware configuration.

Luckily what you actually check for isn't as strict as "discontinuous":
What you're after is no interleaving of memory. A single node can still
have multiple discontiguous ranges (and that'll often be the case on
x86). Please adjust description and function name accordingly.

> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -271,6 +271,36 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
>  		       pxm, pa->apic_id, node);
>  }
>  
> +/*
> + * Check to see if there are other nodes within this node's range.
> + * We just need to check full contains situation. Because overlaps
> + * have been checked before by conflicting_memblks.
> + */
> +static bool __init is_node_memory_continuous(nodeid_t nid,
> +    paddr_t start, paddr_t end)

This indentation style demands indenting like ...

> +{
> +	nodeid_t i;

... variable declarations at function scope, i.e. in a Linux-style
file by a tab.

> +
> +	struct node *nd = &nodes[nid];
> +	for_each_node_mask(i, memory_nodes_parsed)

Please move the blank line to be between declarations and statements.

Also please make nd pointer-to const.

> +	{

In a Linux-style file opening braces do not belong on their own lines.

> +		/* Skip itself */
> +		if (i == nid)
> +			continue;
> +
> +		nd = &nodes[i];
> +		if (start < nd->start && nd->end < end)
> +		{
> +			printk(KERN_ERR
> +			       "NODE %u: (%"PRIpaddr"-%"PRIpaddr") intertwine with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",

s/intertwine/interleaves/ ?

> @@ -344,6 +374,12 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
>  				nd->start = start;
>  			if (nd->end < end)
>  				nd->end = end;
> +
> +			/* Check whether this range contains memory for other nodes */
> +			if (!is_node_memory_continuous(node, nd->start, nd->end)) {
> +				bad_srat();
> +				return;
> +			}

I think this check would better come before nodes[] gets updated? Otoh
bad_srat() will zap everything anyway ...

Jan



^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 04/37] xen: introduce an arch helper for default dma zone status
  2022-01-18 14:16           ` Jan Beulich
@ 2022-01-19  2:49             ` Wei Chen
  2022-01-19  7:50               ` Jan Beulich
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2022-01-19  2:49 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2022-01-18 22:16
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Bertrand Marquis <Bertrand.Marquis@arm.com>; xen-
> devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org
> Subject: Re: [PATCH 04/37] xen: introduce an arch helper for default dma
> zone status
> 
> On 18.01.2022 10:20, Wei Chen wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: 2022-01-18 16:16
> >>
> >> On 18.01.2022 08:51, Wei Chen wrote:
> >>>> From: Jan Beulich <jbeulich@suse.com>
> >>>> Sent: 2022-01-18 0:11
> >>>> On 23.09.2021 14:02, Wei Chen wrote:
> >>>>> In current code, when Xen is running in a multiple nodes NUMA
> >>>>> system, it will set dma_bitsize in end_boot_allocator to reserve
> >>>>> some low address memory for DMA.
> >>>>>
> >>>>> There are some x86 implications in the current implementation, because
> >>>>> on x86, memory starts from 0. On a multi-node NUMA system, if
> >>>>> a single node contains the majority or all of the DMA memory, x86
> >>>>> prefers to give out memory from non-local allocations rather than
> >>>>> exhausting the DMA memory ranges. Hence x86 uses dma_bitsize to set
> >>>>> aside some largely arbitrary amount of memory for DMA memory ranges.
> >>>>> The allocations from these memory ranges would happen only after
> >>>>> exhausting all other nodes' memory.
> >>>>>
> >>>>> But these implications are not shared across all architectures. For
> >>>>> example, Arm doesn't have them. So in this patch, we introduce an
> >>>>> arch_have_default_dmazone helper for each arch to determine whether
> >>>>> it needs to set dma_bitsize to reserve memory for DMA allocations.
> >>>>
> >>>> How would Arm guarantee availability of memory below a certain
> >>>> boundary for limited-capability devices? Or is there no need
> >>>> because there's an assumption that I/O for such devices would
> >>>> always pass through an IOMMU, lifting address size restrictions?
> >>>> (I guess in a !PV build on x86 we could also get rid of such a
> >>>> reservation.)
> >>>
> >>> On Arm, we still can have some devices with limited DMA capability.
> >>> And we also don't force all such devices to use an IOMMU. These devices
> >>> will affect the dma_bitsize. For example, the RPi platform sets its
> >>> dma_bitsize to 30. But on a multi-node NUMA system, Arm doesn't have a
> >>> default DMA zone. Having multiple nodes is not a constraint on
> >>> dma_bitsize. And some previous discussions can be found here [1].
> >>
> >> I'm afraid that doesn't give me more clues. For example, in the mail
> >> being replied to there I find "That means, only first 4GB memory can
> >> be used for DMA." Yet that's not an implication from setting
> >> dma_bitsize. DMA is fine to occur to any address. The special address
> >> range is being held back in case in particular Dom0 is in need of such
> >> a range to perform I/O to _some_ devices.
> >
> > I am sorry that my last reply hasn't given you more clues. On Arm, only
> > Dom0 can have DMA without an IOMMU. So when we allocate memory for Dom0,
> > we're trying to allocate memory under 4GB or in the range indicated by
> > dma_bitsize. I think these operations match your Dom0 special address
> > range description above. As we have already allocated memory for DMA, I
> > think we don't need a DMA zone in page allocation. I am not sure whether
> > that answers your earlier question?
> 
> I view all of this as flawed, or as a workaround at best. Xen shouldn't
> make assumptions on what Dom0 may need. Instead Dom0 should make
> arrangements such that it can do I/O to/from all devices of interest.
> This may involve arranging for address restricted buffers. And for this
> to be possible, Xen would need to have available some suitable memory.
> I understand this is complicated by the fact that despite being HVM-like,
> due to the lack of an IOMMU in front of certain devices address
> restrictions on Dom0 address space alone (i.e. without any Xen
> involvement) won't help ...
> 

I agree with you that the current implementation is probably the best
kind of workaround. Do you have any suggestions for this patch to
address the above comments? Or should I just modify the commit log
to capture some of our discussion above?

Thanks,
Wei Chen

> Jan


^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 07/37] xen/x86: use paddr_t for addresses in NUMA node structure
  2022-01-18 15:22   ` Jan Beulich
@ 2022-01-19  6:33     ` Wei Chen
  2022-01-19  7:55       ` Jan Beulich
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2022-01-19  6:33 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2022-01-18 23:23
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Bertrand Marquis <Bertrand.Marquis@arm.com>; xen-
> devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org
> Subject: Re: [PATCH 07/37] xen/x86: use paddr_t for addresses in NUMA node
> structure
> 
> On 23.09.2021 14:02, Wei Chen wrote:
> > @@ -201,11 +201,12 @@ void __init numa_init_array(void)
> >  static int numa_fake __initdata = 0;
> >
> >  /* Numa emulation */
> > -static int __init numa_emulation(u64 start_pfn, u64 end_pfn)
> > +static int __init numa_emulation(unsigned long start_pfn,
> > +                                 unsigned long end_pfn)
> >  {
> >      int i;
> >      struct node nodes[MAX_NUMNODES];
> > -    u64 sz = ((end_pfn - start_pfn)<<PAGE_SHIFT) / numa_fake;
> > +    u64 sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake;
> 
> Nit: Please convert to uint64_t (and alike) whenever you touch a line
> anyway that uses being-phased-out types.
> 

Ok, I will do it.

> > @@ -249,24 +250,26 @@ static int __init numa_emulation(u64 start_pfn,
> u64 end_pfn)
> >  void __init numa_initmem_init(unsigned long start_pfn, unsigned long
> end_pfn)
> >  {
> >      int i;
> > +    paddr_t start, end;
> >
> >  #ifdef CONFIG_NUMA_EMU
> >      if ( numa_fake && !numa_emulation(start_pfn, end_pfn) )
> >          return;
> >  #endif
> >
> > +    start = pfn_to_paddr(start_pfn);
> > +    end = pfn_to_paddr(end_pfn);
> 
> Nit: Would be slightly neater if these were the initializers of the
> variables.

But if this function returns early in the numa_fake && !numa_emulation
case, won't the two pfn_to_paddr operations be wasted?

> 
> >  #ifdef CONFIG_ACPI_NUMA
> > -    if ( !numa_off && !acpi_scan_nodes((u64)start_pfn << PAGE_SHIFT,
> > -         (u64)end_pfn << PAGE_SHIFT) )
> > +    if ( !numa_off && !acpi_scan_nodes(start, end) )
> >          return;
> >  #endif
> >
> >      printk(KERN_INFO "%s\n",
> >             numa_off ? "NUMA turned off" : "No NUMA configuration
> found");
> >
> > -    printk(KERN_INFO "Faking a node at %016"PRIx64"-%016"PRIx64"\n",
> > -           (u64)start_pfn << PAGE_SHIFT,
> > -           (u64)end_pfn << PAGE_SHIFT);
> > +    printk(KERN_INFO "Faking a node at %016"PRIpaddr"-%016"PRIpaddr"\n",
> > +           start, end);
> 
> When switching to PRIpaddr I suppose you did look up what that one
> expands to? IOW - please drop the 016 from here.

Oh, yes, I forgot to drop the duplicated 016. I will do it.

> 
> > @@ -441,7 +441,7 @@ void __init srat_parse_regions(u64 addr)
> >  	    acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat))
> >  		return;
> >
> > -	srat_region_mask = pdx_init_mask(addr);
> > +	srat_region_mask = pdx_init_mask((u64)addr);
> 
> I don't see the need for a cast here.
> 

The addr type has been changed to paddr_t, but pdx_init_mask
accepts a u64 parameter. I know paddr_t is a typedef of
u64 on Arm64/32, or unsigned long on x86. At the current stage,
their widths are the same. But in case of future
changes, I think an explicit cast here may be better?

> > @@ -489,7 +489,7 @@ int __init acpi_scan_nodes(u64 start, u64 end)
> >  	/* Finally register nodes */
> >  	for_each_node_mask(i, all_nodes_parsed)
> >  	{
> > -		u64 size = nodes[i].end - nodes[i].start;
> > +		paddr_t size = nodes[i].end - nodes[i].start;
> >  		if ( size == 0 )
> 
> Please take the opportunity and add the missing blank line between
> declarations and statements.
> 

Ok

> > --- a/xen/include/asm-x86/numa.h
> > +++ b/xen/include/asm-x86/numa.h
> > @@ -16,7 +16,7 @@ extern cpumask_t     node_to_cpumask[];
> >  #define node_to_cpumask(node)    (node_to_cpumask[node])
> >
> >  struct node {
> > -	u64 start,end;
> > +	paddr_t start,end;
> 
> Please take the opportunity and add the missing blank after the comma.
> 

Ok

> Jan


^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 08/37] xen/x86: add detection of discontinous node memory range
  2022-01-18 16:13   ` Jan Beulich
@ 2022-01-19  7:33     ` Wei Chen
  2022-01-19  8:01       ` Jan Beulich
  0 siblings, 1 reply; 192+ messages in thread
From: Wei Chen @ 2022-01-19  7:33 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2022-01-19 0:13
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Bertrand Marquis <Bertrand.Marquis@arm.com>; xen-
> devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org
> Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> memory range
> 
> On 23.09.2021 14:02, Wei Chen wrote:
> > One NUMA node may contain several memory blocks. In the current Xen
> > code, Xen maintains a node memory range for each node to cover
> > all its memory blocks. But here comes the problem: if, in the gap
> > between two of one node's memory blocks, there are memory blocks that
> > don't belong to this node (remote memory blocks), this node's memory
> > range will be expanded to cover these remote memory blocks.
> >
> > Having one node's memory range contain other nodes' memory is obviously
> > not reasonable. This means the current NUMA code can only support
> > nodes with contiguous memory blocks. However, on a physical machine, the
> > addresses of multiple nodes can be interleaved.
> >

I will adjust the above paragraph to:
... This means the current NUMA code can only support nodes with no
interlaced memory blocks. ...

> > So in this patch, we add code to detect discontinuous memory blocks
> > for one node. NUMA initialization will fail and error messages
> > will be printed when Xen detects such a hardware configuration.

I will adjust the above paragraph to:
So in this patch, we add code to detect interleaving of different nodes'
memory blocks. NUMA initialization will be ...

> 
> Luckily what you actually check for isn't as strict as "discontinuous":
>> What you're after is no interleaving of memory. A single node can still
> have multiple discontiguous ranges (and that'll often be the case on
> x86). Please adjust description and function name accordingly.
> 

Yes, we're checking for no interlaced memory among nodes. Within one
node's memory range, the memory blocks can still be discontiguous.

I will rename the subject to:
"add detection of interlaced memory for different nodes"
And I would rename is_node_memory_continuous to:
node_without_interleave_memory.

> > --- a/xen/arch/x86/srat.c
> > +++ b/xen/arch/x86/srat.c
> > @@ -271,6 +271,36 @@ acpi_numa_processor_affinity_init(const struct
> acpi_srat_cpu_affinity *pa)
> >  		       pxm, pa->apic_id, node);
> >  }
> >
> > +/*
> > + * Check to see if there are other nodes within this node's range.
> > + * We just need to check full contains situation. Because overlaps
> > + * have been checked before by conflicting_memblks.
> > + */
> > +static bool __init is_node_memory_continuous(nodeid_t nid,
> > +    paddr_t start, paddr_t end)
> 
> This indentation style demands indenting like ...
> 

Ok.

> > +{
> > +	nodeid_t i;
> 
> ... variable declarations at function scope, i.e. in a Linux-style
> file by a tab.
> 
> > +
> > +	struct node *nd = &nodes[nid];
> > +	for_each_node_mask(i, memory_nodes_parsed)
> 
> Please move the blank line to be between declarations and statements.
> 
> Also please make nd pointer-to const.

Ok.

> 
> > +	{
> 
> In a Linux-style file opening braces do not belong on their own lines.
> 

Ok.

> > +		/* Skip itself */
> > +		if (i == nid)
> > +			continue;
> > +
> > +		nd = &nodes[i];
> > +		if (start < nd->start && nd->end < end)
> > +		{
> > +			printk(KERN_ERR
> > +			       "NODE %u: (%"PRIpaddr"-%"PRIpaddr") intertwine
> with NODE %u (%"PRIpaddr"-%"PRIpaddr")\n",
> 
> s/intertwine/interleaves/ ?

Yes, interleaves. I will fix it.

> 
> > @@ -344,6 +374,12 @@ acpi_numa_memory_affinity_init(const struct
> acpi_srat_mem_affinity *ma)
> >  				nd->start = start;
> >  			if (nd->end < end)
> >  				nd->end = end;
> > +
> > +			/* Check whether this range contains memory for other
> nodes */
> > +			if (!is_node_memory_continuous(node, nd->start, nd->end))
> {
> > +				bad_srat();
> > +				return;
> > +			}
> 
> I think this check would better come before nodes[] gets updated? Otoh
> bad_srat() will zap everything anyway ...

Yes, when I wrote this check, I considered that when the check fails,
bad_srat would make NUMA initialization fail, so the values in nodes[]
would not take any effect. That's why I didn't adjust the order. But if
bad_srat is changed in the future, and nodes[] is used in some fallback
path, this would take more effort to debug. With that in mind, I agree
with you; I will update the order in the next version.

Thanks,
Wei Chen

> 
> Jan


^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 04/37] xen: introduce an arch helper for default dma zone status
  2022-01-19  2:49             ` Wei Chen
@ 2022-01-19  7:50               ` Jan Beulich
  2022-01-19  8:33                 ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2022-01-19  7:50 UTC (permalink / raw)
  To: Wei Chen; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien

On 19.01.2022 03:49, Wei Chen wrote:
> Hi Jan,
> 
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 2022-01-18 22:16
>> To: Wei Chen <Wei.Chen@arm.com>
>> Cc: Bertrand Marquis <Bertrand.Marquis@arm.com>; xen-
>> devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org
>> Subject: Re: [PATCH 04/37] xen: introduce an arch helper for default dma
>> zone status
>>
>> On 18.01.2022 10:20, Wei Chen wrote:
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: 2022-01-18 16:16
>>>>
>>>> On 18.01.2022 08:51, Wei Chen wrote:
>>>>>> From: Jan Beulich <jbeulich@suse.com>
>>>>>> Sent: 2022-01-18 0:11
>>>>>> On 23.09.2021 14:02, Wei Chen wrote:
>>>>>>> In current code, when Xen is running in a multiple nodes NUMA
>>>>>>> system, it will set dma_bitsize in end_boot_allocator to reserve
>>>>>>> some low address memory for DMA.
>>>>>>>
>>>>>>> There are some x86 implications in the current implementation, because
>>>>>>> on x86, memory starts from 0. On a multi-node NUMA system, if
>>>>>>> a single node contains the majority or all of the DMA memory, x86
>>>>>>> prefers to give out memory from non-local allocations rather than
>>>>>>> exhausting the DMA memory ranges. Hence x86 uses dma_bitsize to set
>>>>>>> aside some largely arbitrary amount of memory for DMA memory ranges.
>>>>>>> The allocations from these memory ranges would happen only after
>>>>>>> exhausting all other nodes' memory.
>>>>>>>
>>>>>>> But these implications are not shared across all architectures. For
>>>>>>> example, Arm doesn't have them. So in this patch, we introduce an
>>>>>>> arch_have_default_dmazone helper for each arch to determine whether
>>>>>>> it needs to set dma_bitsize to reserve memory for DMA allocations.
>>>>>>
>>>>>> How would Arm guarantee availability of memory below a certain
>>>>>> boundary for limited-capability devices? Or is there no need
>>>>>> because there's an assumption that I/O for such devices would
>>>>>> always pass through an IOMMU, lifting address size restrictions?
>>>>>> (I guess in a !PV build on x86 we could also get rid of such a
>>>>>> reservation.)
>>>>>
>>>>> On Arm, we still can have some devices with limited DMA capability.
>>>>> And we also don't force all such devices to use an IOMMU. These devices
>>>>> will affect the dma_bitsize. For example, the RPi platform sets its
>>>>> dma_bitsize to 30. But on a multi-node NUMA system, Arm doesn't have a
>>>>> default DMA zone. Having multiple nodes is not a constraint on
>>>>> dma_bitsize. And some previous discussions can be found here [1].
>>>>
>>>> I'm afraid that doesn't give me more clues. For example, in the mail
>>>> being replied to there I find "That means, only first 4GB memory can
>>>> be used for DMA." Yet that's not an implication from setting
>>>> dma_bitsize. DMA is fine to occur to any address. The special address
>>>> range is being held back in case in particular Dom0 is in need of such
>>>> a range to perform I/O to _some_ devices.
>>>
>>> I am sorry that my last reply hasn't given you more clues. On Arm, only
>>> Dom0 can have DMA without an IOMMU. So when we allocate memory for Dom0,
>>> we're trying to allocate memory under 4GB or in the range indicated by
>>> dma_bitsize. I think these operations match your Dom0 special address
>>> range description above. As we have already allocated memory for DMA, I
>>> think we don't need a DMA zone in page allocation. I am not sure whether
>>> that answers your earlier question?
>>
>> I view all of this as flawed, or as a workaround at best. Xen shouldn't
>> make assumptions on what Dom0 may need. Instead Dom0 should make
>> arrangements such that it can do I/O to/from all devices of interest.
>> This may involve arranging for address restricted buffers. And for this
>> to be possible, Xen would need to have available some suitable memory.
>> I understand this is complicated by the fact that despite being HVM-like,
>> due to the lack of an IOMMU in front of certain devices address
>> restrictions on Dom0 address space alone (i.e. without any Xen
>> involvement) won't help ...
>>
> 
> I agree with you that the current implementation is probably the best
> kind of workaround. Do you have any suggestions for this patch to
> address the above comments? Or should I just modify the commit log
> to capture some of our discussion above?

Extending the description is my primary request, or else we may end up
having the same discussion every time you submit a new version. As to
improving the situation such that preferably per-arch customization
wouldn't be necessary - that's perhaps better to be thought about by
Arm folks. Otoh, as said, an x86 build with CONFIG_PV=n could probably
leverage the new hook to actually not trigger reservation.

Jan



^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 07/37] xen/x86: use paddr_t for addresses in NUMA node structure
  2022-01-19  6:33     ` Wei Chen
@ 2022-01-19  7:55       ` Jan Beulich
  2022-01-19  8:36         ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2022-01-19  7:55 UTC (permalink / raw)
  To: Wei Chen; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien

On 19.01.2022 07:33, Wei Chen wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 2022-01-18 23:23
>>
>> On 23.09.2021 14:02, Wei Chen wrote:
>>> @@ -249,24 +250,26 @@ static int __init numa_emulation(u64 start_pfn,
>> u64 end_pfn)
>>>  void __init numa_initmem_init(unsigned long start_pfn, unsigned long
>> end_pfn)
>>>  {
>>>      int i;
>>> +    paddr_t start, end;
>>>
>>>  #ifdef CONFIG_NUMA_EMU
>>>      if ( numa_fake && !numa_emulation(start_pfn, end_pfn) )
>>>          return;
>>>  #endif
>>>
>>> +    start = pfn_to_paddr(start_pfn);
>>> +    end = pfn_to_paddr(end_pfn);
>>
>> Nit: Would be slightly neater if these were the initializers of the
>> variables.
> 
> But if this function returns early in the numa_fake && !numa_emulation
> case, won't the two pfn_to_paddr operations be wasted?

And what harm would that do?

>>> @@ -441,7 +441,7 @@ void __init srat_parse_regions(u64 addr)
>>>  	    acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat))
>>>  		return;
>>>
>>> -	srat_region_mask = pdx_init_mask(addr);
>>> +	srat_region_mask = pdx_init_mask((u64)addr);
>>
>> I don't see the need for a cast here.
>>
> 
> The addr type has been changed to paddr_t, but pdx_init_mask
> accepts a u64 parameter. I know paddr_t is a typedef of
> u64 on Arm64/32, or unsigned long on x86. At the current stage,
> their widths are the same. But in case of future
> changes, I think an explicit cast here may be better?

It may only ever be an up-cast, yet the compiler would do a widening
conversion (according to the usual type conversion rules) for us
anyway no matter whether there's a cast. Down-casts (in the general
compiler case, i.e. considering a wider set than just gcc and clang)
sometimes need making explicit to silence compiler warnings about
truncation, but I've not observed any compiler warning when widening
values.

Jan



^ permalink raw reply	[flat|nested] 192+ messages in thread

* Re: [PATCH 08/37] xen/x86: add detection of discontinous node memory range
  2022-01-19  7:33     ` Wei Chen
@ 2022-01-19  8:01       ` Jan Beulich
  2022-01-19  8:24         ` Wei Chen
  0 siblings, 1 reply; 192+ messages in thread
From: Jan Beulich @ 2022-01-19  8:01 UTC (permalink / raw)
  To: Wei Chen; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien

On 19.01.2022 08:33, Wei Chen wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 2022-01-19 0:13
>>
>> On 23.09.2021 14:02, Wei Chen wrote:
>>> One NUMA node may contain several memory blocks. In the current Xen
>>> code, Xen maintains a node memory range for each node to cover
>>> all its memory blocks. But here comes the problem: if, in the gap
>>> between two of one node's memory blocks, there are memory blocks that
>>> don't belong to this node (remote memory blocks), this node's memory
>>> range will be expanded to cover these remote memory blocks.
>>>
>>> Having one node's memory range contain other nodes' memory is obviously
>>> not reasonable. This means the current NUMA code can only support
>>> nodes with contiguous memory blocks. However, on a physical machine, the
>>> addresses of multiple nodes can be interleaved.
>>>
> 
> I will adjust above paragraph to:
> ... This means current NUMA code only can support node has no interlaced
> memory blocks. ...
> 
>>> So in this patch, we add code to detect discontinuous memory blocks
>>> for one node. NUMA initialization will fail and error messages
>>> will be printed when Xen detects such a hardware configuration.
> 
> I will adjust above paragraph to:
> So in this patch, we add code to detect interleave of different nodes'
> memory blocks. NUMA initialization will be ...

Taking just this part of your reply (the issue continues later), may I
ask that you use a consistent term throughout this single patch? Mixing
"interlace" and "interleave" like you do may make people wonder whether
the two are intended to express slightly different aspects. Personally,
as per my suggestion, I'd prefer "interleave", but I'm not a native
speaker.

Jan



^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 08/37] xen/x86: add detection of discontinous node memory range
  2022-01-19  8:01       ` Jan Beulich
@ 2022-01-19  8:24         ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2022-01-19  8:24 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: January 19, 2022 16:01
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Bertrand Marquis <Bertrand.Marquis@arm.com>; xen-
> devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org
> Subject: Re: [PATCH 08/37] xen/x86: add detection of discontinous node
> memory range
> 
> On 19.01.2022 08:33, Wei Chen wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: January 19, 2022 0:13
> >>
> >> On 23.09.2021 14:02, Wei Chen wrote:
> >>> One NUMA node may contain several memory blocks. In current Xen
> >>> code, Xen will maintain a node memory range for each node to cover
> >>> all its memory blocks. But here comes the problem: if, in the gap
> >>> between one node's two memory blocks, there are some memory blocks
> >>> that don't belong to this node (remote memory blocks), this node's
> >>> memory range will be expanded to cover these remote memory blocks.
> >>>
> >>> One node's memory range containing other nodes' memory is obviously
> >>> not very reasonable. This means the current NUMA code can only
> >>> support nodes with contiguous memory blocks. However, on a physical
> >>> machine, the addresses of multiple nodes can be interleaved.
> >>>
> >
> > I will adjust the above paragraph to:
> > ... This means the current NUMA code can only support nodes with no
> > interlaced memory blocks. ...
> >
> >>> So in this patch, we add code to detect discontinuous memory blocks
> >>> for one node. NUMA initialization will fail and error messages
> >>> will be printed when Xen detects such hardware configuration.
> >
> > I will adjust the above paragraph to:
> > So in this patch, we add code to detect interleaving of different nodes'
> > memory blocks. NUMA initialization will be ...
> 
> Taking just this part of your reply (the issue continues later), may I
> ask that you use a consistent term throughout this single patch? Mixing
> "interlace" and "interleave" like you do may make people wonder whether
> the two are intended to express slightly different aspects. Personally,
> as per my suggestion, I'd prefer "interleave", but I'm not a native
> speaker.
> 

Sorry, I am not a native speaker either. I had checked a dictionary for
"interlaced" before I used it: https://www.merriam-webster.com/thesaurus/interlaced

Obviously, I was probably using it incorrectly and making it harder to
understand; I will use "interleave" in my patches.

Thanks,
Wei Chen


> Jan


^ permalink raw reply	[flat|nested] 192+ messages in thread

* RE: [PATCH 04/37] xen: introduce an arch helper for default dma zone status
  2022-01-19  7:50               ` Jan Beulich
@ 2022-01-19  8:33                 ` Wei Chen
  0 siblings, 0 replies; 192+ messages in thread
From: Wei Chen @ 2022-01-19  8:33 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Bertrand Marquis, xen-devel, sstabellini, julien

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: January 19, 2022 15:50
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: Bertrand Marquis <Bertrand.Marquis@arm.com>; xen-
> devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org
> Subject: Re: [PATCH 04/37] xen: introduce an arch helper for default dma
> zone status
> 
> On 19.01.2022 03:49, Wei Chen wrote:
> > Hi Jan,
> >
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: January 18, 2022 22:16
> >> To: Wei Chen <Wei.Chen@arm.com>
> >> Cc: Bertrand Marquis <Bertrand.Marquis@arm.com>; xen-
> >> devel@lists.xenproject.org; sstabellini@kernel.org; julien@xen.org
> >> Subject: Re: [PATCH 04/37] xen: introduce an arch helper for default dma
> >> zone status
> >>
> >> On 18.01.2022 10:20, Wei Chen wrote:
> >>>> From: Jan Beulich <jbeulich@suse.com>
> >>>> Sent: January 18, 2022 16:16
> >>>>
> >>>> On 18.01.2022 08:51, Wei Chen wrote:
> >>>>>> From: Jan Beulich <jbeulich@suse.com>
> >>>>>> Sent: January 18, 2022 0:11
> >>>>>> On 23.09.2021 14:02, Wei Chen wrote:
> >>>>>>> In current code, when Xen is running on a multi-node NUMA
> >>>>>>> system, it will set dma_bitsize in end_boot_allocator to reserve
> >>>>>>> some low-address memory for DMA.
> >>>>>>>
> >>>>>>> There are some x86 implications in the current implementation,
> >>>>>>> because on x86 memory starts from 0. On a multi-node NUMA
> >>>>>>> system, if a single node contains the majority or all of the DMA
> >>>>>>> memory, x86 prefers to give out memory from non-local allocations
> >>>>>>> rather than exhausting the DMA memory ranges. Hence x86 uses
> >>>>>>> dma_bitsize to set aside some largely arbitrary amount of memory
> >>>>>>> for DMA memory ranges. Allocations from these memory ranges would
> >>>>>>> happen only after exhausting all other nodes' memory.
> >>>>>>>
> >>>>>>> But the implications are not shared across all architectures. For
> >>>>>>> example, Arm doesn't have these implications. So in this patch, we
> >>>>>>> introduce an arch_have_default_dmazone helper for an arch to
> >>>>>>> determine whether it needs to set dma_bitsize to reserve memory
> >>>>>>> for DMA allocations or not.
> >>>>>>
> >>>>>> How would Arm guarantee availability of memory below a certain
> >>>>>> boundary for limited-capability devices? Or is there no need
> >>>>>> because there's an assumption that I/O for such devices would
> >>>>>> always pass through an IOMMU, lifting address size restrictions?
> >>>>>> (I guess in a !PV build on x86 we could also get rid of such a
> >>>>>> reservation.)
> >>>>>
> >>>>> On Arm, we still can have some devices with limited DMA capability,
> >>>>> and we also don't force all such devices to use an IOMMU. These
> >>>>> devices will affect dma_bitsize; the RPi platform, for example, sets
> >>>>> its dma_bitsize to 30. But even on a multi-node NUMA system, Arm
> >>>>> doesn't have a default DMA zone: multiple nodes are not a constraint
> >>>>> on dma_bitsize. Some previous discussion can be found here [1].
> >>>>
> >>>> I'm afraid that doesn't give me more clues. For example, in the mail
> >>>> being replied to there I find "That means, only first 4GB memory can
> >>>> be used for DMA." Yet that's not an implication from setting
> >>>> dma_bitsize. DMA is fine to occur to any address. The special address
> >>>> range is being held back in case in particular Dom0 is in need of such
> >>>> a range to perform I/O to _some_ devices.
> >>>
> >>> I am sorry that my last reply hasn't given you more clues. On Arm, only
> >>> Dom0 can have DMA without an IOMMU. So when we allocate memory for Dom0,
> >>>