* [PATCH v2 0/9] Device tree based NUMA support for Arm - Part#2
@ 2022-07-08 14:54 Wei Chen
  2022-07-08 14:54 ` [PATCH v2 1/9] xen/x86: introduce a helper to update memory hotplug end Wei Chen
                   ` (8 more replies)
  0 siblings, 9 replies; 50+ messages in thread
From: Wei Chen @ 2022-07-08 14:54 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

(The Arm device tree based NUMA support patch set contains 35
patches. To make things easier for reviewers, I have split
them into 3 parts:
1. Preparation. I have re-sorted the patch series and moved
   independent patches to the head of the series - merged in [1].
2. Move generically usable code from x86 to common - this series.
3. Add new code to support Arm.

This series contains only the second part. As the whole NUMA
series has already been reviewed once in [2], this series is v2.)

Xen's memory allocation and scheduler modules are NUMA aware.
But so far only x86 has implemented the architecture APIs
required to support NUMA. Arm has been providing a set of fake
architecture APIs to stay compatible with the NUMA-aware memory
allocator and scheduler.

Arm systems worked well as single-node NUMA systems with these
fake APIs, because there were no multi-node NUMA systems on
Arm. But in recent years, more and more Arm devices have come
with multiple NUMA nodes.

This creates a new problem. When Xen runs on these Arm devices,
it still treats them as single-node SMP systems. The NUMA
affinity capability of Xen's memory allocator and scheduler
becomes meaningless, because they rely on input data that does
not reflect the real NUMA layout.

Xen still assumes the access time to all memory is the same for
all CPUs. However, Xen may allocate a VM's memory from different
NUMA nodes with different access speeds. This difference can be
amplified by workloads inside the VM, causing performance
instability and timeouts.

So in this patch series, we implement a set of NUMA APIs that use
the device tree to describe the NUMA layout. We reuse most of
the x86 NUMA code to create and maintain the mapping between
memory and CPUs, and to create the distance matrix between any
two NUMA nodes. Except for ACPI and some x86-specific code, we
have moved the rest to common. In the next stage, when we
implement ACPI based NUMA for Arm64, we may move the ACPI NUMA
code to common too, but for now we keep it x86-only.

This patch series has been tested and boots well on one
Arm64 NUMA machine and one HPE x86 NUMA machine.

[1] https://lists.xenproject.org/archives/html/xen-devel/2022-06/msg00499.html
[2] https://lists.xenproject.org/archives/html/xen-devel/2021-09/msg01903.html

---
v1 -> v2:
 1. Refine the commit messages of several patches.
 2. Merge v1 patch#9,10 into one patch. Introduce the new functions
    in the same patch where they are first used.
 3. Fold if ( end > mem_hotplug ) into mem_hotplug_update_boundary,
    so that mem_hotplug_boundary can be dropped.
 4. Remove fw_numa, use enumeration to replace numa_off and acpi_numa.
 5. Correct the return type of srat_disabled.
 6. Introduce numa_enabled_with_firmware.
 7. Refine the justification of using !node_data[nid].node_spanned_pages.
 8. Use ASSERT to replace VIRTUAL_BUG_ON in phys_to_nid.
 9. Adjust the conditional express for ASSERT.
10. Move MAX_NUMNODES from xen/numa.h to asm/numa.h for x86.
11. Use conditional macro to gate MAX_NUMNODES for other architectures.
12. Use arch_get_memory_map to replace arch_get_memory_bank_range
    and arch_get_memory_bank_number.
13. Remove the !start || !end check, because caller guarantee
    these two pointers will not be NULL.
14. Add code comment for numa_update_node_memblks to explain:
    Assumes all memory regions belonging to a single node
    are in one chunk. Holes between them will be included
    in the node.
15. Use a single patch instead of several patches to move the
    x86 SRAT code to common.
16. Export node_to_pxm to keep pxm information in NUMA scan
    nodes error messages.
17. Change the code style to the target file's Xen coding style.
18. Adjust some __init and __initdata for some functions and
    variables.
19. Replace CONFIG_ACPI_NUMA with CONFIG_NUMA. Replace "SRAT" text.
20. Turn numa_scan_nodes to static.
21. Change NR_NUMA_NODES upper bound from 4095 to 255.

Wei Chen (9):
  xen/x86: introduce a helper to update memory hotplug end
  xen/x86: Use enumerations to indicate NUMA status
  xen/x86: move generically usable NUMA code from x86 to common
  xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid
  xen/x86: use arch_get_memory_map to get information from E820 map
  xen/x86: move NUMA scan nodes codes from x86 to common
  xen/x86: rename bad_srat to numa_bad
  xen: rename acpi_scan_nodes to numa_scan_nodes
  xen: introduce a Kconfig option to configure NUMA nodes number

 xen/arch/Kconfig                 |  11 +
 xen/arch/x86/include/asm/acpi.h  |   2 -
 xen/arch/x86/include/asm/mm.h    |   6 +
 xen/arch/x86/include/asm/numa.h  |  61 +--
 xen/arch/x86/include/asm/setup.h |   1 -
 xen/arch/x86/numa.c              | 446 +-----------------
 xen/arch/x86/setup.c             |   3 +-
 xen/arch/x86/srat.c              | 310 ++-----------
 xen/common/Makefile              |   1 +
 xen/common/numa.c                | 746 +++++++++++++++++++++++++++++++
 xen/include/xen/numa.h           |  92 +++-
 11 files changed, 902 insertions(+), 777 deletions(-)
 create mode 100644 xen/common/numa.c

-- 
2.25.1




* [PATCH v2 1/9] xen/x86: introduce a helper to update memory hotplug end
  2022-07-08 14:54 [PATCH v2 0/9] Device tree based NUMA support for Arm - Part#2 Wei Chen
@ 2022-07-08 14:54 ` Wei Chen
  2022-07-12 11:53   ` Jan Beulich
  2022-07-08 14:54 ` [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status Wei Chen
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-08 14:54 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

x86 provides a mem_hotplug variable to maintain the memory hotplug
end address. We want to move some code from x86 to common so that
it can be reused by other architectures, but not all architectures
support memory hotplug. So in this patch, we introduce a helper to
replace direct accesses to mem_hotplug in that code. This gives
architectures without memory hotplug support the ability to stub
the helper out.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
v1 -> v2:
1. Refine the commit message.
2. Merge v1 patch#9,10 into one patch. Introduce the new functions
   in the same patch where they are first used.
3. Fold if ( end > mem_hotplug ) into mem_hotplug_update_boundary,
   so that mem_hotplug_boundary can be dropped.
---
 xen/arch/x86/include/asm/mm.h | 6 ++++++
 xen/arch/x86/srat.c           | 3 +--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index 07b59c982b..b3dfbdb52b 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -476,6 +476,12 @@ static inline int get_page_and_type(struct page_info *page,
 
 extern paddr_t mem_hotplug;
 
+static inline void mem_hotplug_update_boundary(paddr_t end)
+{
+    if ( end > mem_hotplug )
+        mem_hotplug = end;
+}
+
 /******************************************************************************
  * With shadow pagetables, the different kinds of address start
  * to get get confusing.
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index b62a152911..f53431f5e8 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -418,8 +418,7 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 	memblk_nodeid[num_node_memblks] = node;
 	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
 		__set_bit(num_node_memblks, memblk_hotplug);
-		if (end > mem_hotplug)
-			mem_hotplug = end;
+		mem_hotplug_update_boundary(end);
 	}
 	num_node_memblks++;
 }
-- 
2.25.1




* [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status
  2022-07-08 14:54 [PATCH v2 0/9] Device tree based NUMA support for Arm - Part#2 Wei Chen
  2022-07-08 14:54 ` [PATCH v2 1/9] xen/x86: introduce a helper to update memory hotplug end Wei Chen
@ 2022-07-08 14:54 ` Wei Chen
  2022-07-12 12:49   ` Jan Beulich
  2022-07-08 14:54 ` [PATCH v2 3/9] xen/x86: move generically usable NUMA code from x86 to common Wei Chen
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-08 14:54 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

In the current code, x86 uses two variables, numa_off and acpi_numa,
to indicate the NUMA status. This is because NUMA is not coupled
with ACPI, and ACPI can still work without NUMA on x86. With
combinations of these two variables, x86 can have several NUMA
statuses:
NUMA switched on,
NUMA switched off,
NUMA switched on with NUMA emulation,
NUMA switched on without ACPI,
NUMA switched on with ACPI.

In this patch, we introduce an enumeration, numa_mode, to indicate
the above NUMA statuses, except NUMA on with emulation. NUMA
emulation has another variable, numa_fake, to indicate the number
of nodes to emulate, so we can't replace it with the enumeration at
the same time; it is still indicated by numa_on and numa_fake as
before.

Based on the enumeration, we introduce numa_enabled_with_firmware
for callers to check that the NUMA status is enabled + ACPI. We use
this helper because some NUMA implementations will use other
firmware, and the helper will make it easy for them to check
enabled + others.

As we have touched srat_disabled, we have corrected its return type
from int to bool.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
v1 -> v2:
1. Remove fw_numa.
2. Use enumeration to replace numa_off and acpi_numa.
3. Correct the return type of srat_disabled.
4. Introduce numa_enabled_with_firmware.
---
 xen/arch/x86/include/asm/acpi.h |  1 -
 xen/arch/x86/include/asm/numa.h | 16 +++++++++++++---
 xen/arch/x86/numa.c             | 28 +++++++++++++++-------------
 xen/arch/x86/setup.c            |  3 ++-
 xen/arch/x86/srat.c             | 13 +++++++------
 5 files changed, 37 insertions(+), 24 deletions(-)

diff --git a/xen/arch/x86/include/asm/acpi.h b/xen/arch/x86/include/asm/acpi.h
index 9a9cc4c240..ab0d56dd70 100644
--- a/xen/arch/x86/include/asm/acpi.h
+++ b/xen/arch/x86/include/asm/acpi.h
@@ -101,7 +101,6 @@ extern unsigned long acpi_wakeup_address;
 
 #define ARCH_HAS_POWER_INIT	1
 
-extern s8 acpi_numa;
 extern int acpi_scan_nodes(u64 start, u64 end);
 #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
 
diff --git a/xen/arch/x86/include/asm/numa.h b/xen/arch/x86/include/asm/numa.h
index c32ccffde3..ee8262d969 100644
--- a/xen/arch/x86/include/asm/numa.h
+++ b/xen/arch/x86/include/asm/numa.h
@@ -28,12 +28,22 @@ extern nodeid_t pxm_to_node(unsigned int pxm);
 #define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
 #define VIRTUAL_BUG_ON(x) 
 
+/* Enumerations for NUMA status. */
+enum numa_mode {
+	numa_on = 0,
+	numa_off,
+	/* NUMA turns on, but ACPI table is bad or disabled. */
+	numa_no_acpi,
+	/* NUMA turns on, and ACPI table works well. */
+	numa_acpi,
+};
+
 extern void numa_add_cpu(int cpu);
 extern void numa_init_array(void);
-extern bool numa_off;
-
+extern bool numa_enabled_with_firmware(void);
+extern enum numa_mode numa_status;
 
-extern int srat_disabled(void);
+extern bool srat_disabled(void);
 extern void numa_set_node(int cpu, nodeid_t node);
 extern nodeid_t setup_node(unsigned int pxm);
 extern void srat_detect_node(int cpu);
diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 627ae8aa95..0777a7518d 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -47,12 +47,16 @@ cpumask_t node_to_cpumask[MAX_NUMNODES] __read_mostly;
 
 nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
 
-bool numa_off;
-s8 acpi_numa = 0;
+enum numa_mode numa_status;
 
-int srat_disabled(void)
+bool srat_disabled(void)
 {
-    return numa_off || acpi_numa < 0;
+    return numa_status == numa_off || numa_status == numa_no_acpi;
+}
+
+bool __init numa_enabled_with_firmware(void)
+{
+    return numa_status == numa_acpi;
 }
 
 /*
@@ -254,12 +258,13 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 #endif
 
 #ifdef CONFIG_ACPI_NUMA
-    if ( !numa_off && !acpi_scan_nodes(start, end) )
+    if ( numa_status != numa_off && !acpi_scan_nodes(start, end) )
         return;
 #endif
 
     printk(KERN_INFO "%s\n",
-           numa_off ? "NUMA turned off" : "No NUMA configuration found");
+           numa_status == numa_off ? "NUMA turned off"
+                                   : "No NUMA configuration found");
 
     printk(KERN_INFO "Faking a node at %"PRIpaddr"-%"PRIpaddr"\n",
            start, end);
@@ -292,13 +297,13 @@ void numa_set_node(int cpu, nodeid_t node)
 static int __init cf_check numa_setup(const char *opt)
 {
     if ( !strncmp(opt,"off",3) )
-        numa_off = true;
+        numa_status = numa_off;
     else if ( !strncmp(opt,"on",2) )
-        numa_off = false;
+        numa_status = numa_on;
 #ifdef CONFIG_NUMA_EMU
     else if ( !strncmp(opt, "fake=", 5) )
     {
-        numa_off = false;
+        numa_status = numa_on;
         numa_fake = simple_strtoul(opt+5,NULL,0);
         if ( numa_fake >= MAX_NUMNODES )
             numa_fake = MAX_NUMNODES;
@@ -306,10 +311,7 @@ static int __init cf_check numa_setup(const char *opt)
 #endif
 #ifdef CONFIG_ACPI_NUMA
     else if ( !strncmp(opt,"noacpi",6) )
-    {
-        numa_off = false;
-        acpi_numa = -1;
-    }
+        numa_status = numa_no_acpi;
 #endif
     else
         return -EINVAL;
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index f08b07b8de..4841af5926 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -329,7 +329,8 @@ void srat_detect_node(int cpu)
     node_set_online(node);
     numa_set_node(cpu, node);
 
-    if ( opt_cpu_info && acpi_numa > 0 )
+    /* Print CPU info when NUMA is enabled with ACPI. */
+    if ( opt_cpu_info && numa_enabled_with_firmware() )
         printk("CPU %d APIC %d -> Node %d\n", cpu, apicid, node);
 }
 
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index f53431f5e8..422e4c73e3 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -185,7 +185,7 @@ static __init void bad_srat(void)
 {
 	int i;
 	printk(KERN_ERR "SRAT: SRAT not used.\n");
-	acpi_numa = -1;
+	numa_status = numa_no_acpi;
 	for (i = 0; i < MAX_LOCAL_APIC; i++)
 		apicid_to_node[i] = NUMA_NO_NODE;
 	for (i = 0; i < ARRAY_SIZE(pxm2node); i++)
@@ -260,7 +260,7 @@ acpi_numa_x2apic_affinity_init(const struct acpi_srat_x2apic_cpu_affinity *pa)
 
 	apicid_to_node[pa->apic_id] = node;
 	node_set(node, processor_nodes_parsed);
-	acpi_numa = 1;
+	numa_status = numa_acpi;
 
 	if (opt_acpi_verbose)
 		printk(KERN_INFO "SRAT: PXM %u -> APIC %08x -> Node %u\n",
@@ -295,7 +295,7 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
 	}
 	apicid_to_node[pa->apic_id] = node;
 	node_set(node, processor_nodes_parsed);
-	acpi_numa = 1;
+	numa_status = numa_acpi;
 
 	if (opt_acpi_verbose)
 		printk(KERN_INFO "SRAT: PXM %u -> APIC %02x -> Node %u\n",
@@ -484,7 +484,7 @@ static int __init cf_check srat_parse_region(
 	    (ma->flags & ACPI_SRAT_MEM_NON_VOLATILE))
 		return 0;
 
-	if (numa_off)
+	if (numa_status == numa_off)
 		printk(KERN_INFO "SRAT: %013"PRIx64"-%013"PRIx64"\n",
 		       ma->base_address, ma->base_address + ma->length - 1);
 
@@ -499,7 +499,7 @@ void __init srat_parse_regions(paddr_t addr)
 	u64 mask;
 	unsigned int i;
 
-	if (acpi_disabled || acpi_numa < 0 ||
+	if (acpi_disabled || numa_status == numa_no_acpi ||
 	    acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat))
 		return;
 
@@ -528,7 +528,8 @@ int __init acpi_scan_nodes(paddr_t start, paddr_t end)
 	for (i = 0; i < MAX_NUMNODES; i++)
 		cutoff_node(i, start, end);
 
-	if (acpi_numa <= 0)
+	/* Only when numa_on with good firmware, we can do numa scan nodes. */
+	if (!numa_enabled_with_firmware())
 		return -1;
 
 	if (!nodes_cover_memory()) {
-- 
2.25.1




* [PATCH v2 3/9] xen/x86: move generically usable NUMA code from x86 to common
  2022-07-08 14:54 [PATCH v2 0/9] Device tree based NUMA support for Arm - Part#2 Wei Chen
  2022-07-08 14:54 ` [PATCH v2 1/9] xen/x86: introduce a helper to update memory hotplug end Wei Chen
  2022-07-08 14:54 ` [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status Wei Chen
@ 2022-07-08 14:54 ` Wei Chen
  2022-07-12 13:13   ` Jan Beulich
  2022-07-08 14:54 ` [PATCH v2 4/9] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid Wei Chen
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-08 14:54 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

Some code in x86/numa.c can be shared by architectures to implement
NUMA support, such as the variables and functions that check and
store the NUMA memory map, and those that do NUMA initialization.

In this patch, we move them to common/numa.c and xen/numa.h, and
use CONFIG_NUMA to gate them for architectures without NUMA
support. As the target header file is Xen-style, we trim some
spaces and replace tabs in the code that has been moved to
xen/numa.h at the same time.

As we have moved phys_to_nid to xen/numa.h, we don't need a
separate MAX_NUMNODES in asm/numa.h. Now all architectures can use
the same MAX_NUMNODES from xen/numa.h, and the conditional macro
can be removed.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
v1 -> v2:
1. New patch in v2.
---
 xen/arch/x86/include/asm/numa.h  |  64 -----
 xen/arch/x86/include/asm/setup.h |   1 -
 xen/arch/x86/numa.c              | 431 +-----------------------------
 xen/common/Makefile              |   1 +
 xen/common/numa.c                | 439 +++++++++++++++++++++++++++++++
 xen/include/xen/numa.h           |  74 ++++++
 6 files changed, 515 insertions(+), 495 deletions(-)
 create mode 100644 xen/common/numa.c

diff --git a/xen/arch/x86/include/asm/numa.h b/xen/arch/x86/include/asm/numa.h
index ee8262d969..d8960743d4 100644
--- a/xen/arch/x86/include/asm/numa.h
+++ b/xen/arch/x86/include/asm/numa.h
@@ -9,81 +9,17 @@ typedef u8 nodeid_t;
 
 extern int srat_rev;
 
-extern nodeid_t      cpu_to_node[NR_CPUS];
-extern cpumask_t     node_to_cpumask[];
-
-#define cpu_to_node(cpu)		(cpu_to_node[cpu])
-#define parent_node(node)		(node)
-#define node_to_first_cpu(node)  (__ffs(node_to_cpumask[node]))
-#define node_to_cpumask(node)    (node_to_cpumask[node])
-
-struct node { 
-	paddr_t start, end;
-};
-
-extern int compute_hash_shift(struct node *nodes, int numnodes,
-			      nodeid_t *nodeids);
 extern nodeid_t pxm_to_node(unsigned int pxm);
 
 #define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
-#define VIRTUAL_BUG_ON(x) 
-
-/* Enumerations for NUMA status. */
-enum numa_mode {
-	numa_on = 0,
-	numa_off,
-	/* NUMA turns on, but ACPI table is bad or disabled. */
-	numa_no_acpi,
-	/* NUMA turns on, and ACPI table works well. */
-	numa_acpi,
-};
-
-extern void numa_add_cpu(int cpu);
-extern void numa_init_array(void);
-extern bool numa_enabled_with_firmware(void);
-extern enum numa_mode numa_status;
 
 extern bool srat_disabled(void);
-extern void numa_set_node(int cpu, nodeid_t node);
 extern nodeid_t setup_node(unsigned int pxm);
 extern void srat_detect_node(int cpu);
 
-extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end);
 extern nodeid_t apicid_to_node[];
 extern void init_cpu_to_node(void);
 
-static inline void clear_node_cpumask(int cpu)
-{
-	cpumask_clear_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]);
-}
-
-/* Simple perfect hash to map pdx to node numbers */
-extern int memnode_shift; 
-extern unsigned long memnodemapsize;
-extern u8 *memnodemap;
-
-struct node_data {
-    unsigned long node_start_pfn;
-    unsigned long node_spanned_pages;
-};
-
-extern struct node_data node_data[];
-
-static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
-{ 
-	nodeid_t nid;
-	VIRTUAL_BUG_ON((paddr_to_pdx(addr) >> memnode_shift) >= memnodemapsize);
-	nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift]; 
-	VIRTUAL_BUG_ON(nid >= MAX_NUMNODES || !node_data[nid]); 
-	return nid; 
-} 
-
-#define NODE_DATA(nid)		(&(node_data[nid]))
-
-#define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)
-#define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)
-#define node_end_pfn(nid)       (NODE_DATA(nid)->node_start_pfn + \
-				 NODE_DATA(nid)->node_spanned_pages)
 #define arch_want_default_dmazone() (num_online_nodes() > 1)
 
 extern int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node);
diff --git a/xen/arch/x86/include/asm/setup.h b/xen/arch/x86/include/asm/setup.h
index 21037b7f31..ae470ea12f 100644
--- a/xen/arch/x86/include/asm/setup.h
+++ b/xen/arch/x86/include/asm/setup.h
@@ -20,7 +20,6 @@ void early_time_init(void);
 
 void set_nr_cpu_ids(unsigned int max_cpus);
 
-void numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn);
 void arch_init_memory(void);
 void subarch_init_memory(void);
 
diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 0777a7518d..193314dbd9 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -4,20 +4,11 @@
  * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com>
  */ 
 
-#include <xen/mm.h>
-#include <xen/string.h>
 #include <xen/init.h>
-#include <xen/ctype.h>
+#include <xen/mm.h>
 #include <xen/nodemask.h>
 #include <xen/numa.h>
-#include <xen/keyhandler.h>
-#include <xen/param.h>
-#include <xen/time.h>
-#include <xen/smp.h>
-#include <xen/pfn.h>
 #include <asm/acpi.h>
-#include <xen/sched.h>
-#include <xen/softirq.h>
 
 #ifndef Dprintk
 #define Dprintk(x...)
@@ -26,28 +17,12 @@
 /* from proto.h */
 #define round_up(x,y) ((((x)+(y))-1) & (~((y)-1)))
 
-struct node_data node_data[MAX_NUMNODES];
-
-/* Mapping from pdx to node id */
-int memnode_shift;
-static typeof(*memnodemap) _memnodemap[64];
-unsigned long memnodemapsize;
-u8 *memnodemap;
-
-nodeid_t cpu_to_node[NR_CPUS] __read_mostly = {
-    [0 ... NR_CPUS-1] = NUMA_NO_NODE
-};
 /*
  * Keep BIOS's CPU2node information, should not be used for memory allocaion
  */
 nodeid_t apicid_to_node[MAX_LOCAL_APIC] = {
     [0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
 };
-cpumask_t node_to_cpumask[MAX_NUMNODES] __read_mostly;
-
-nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
-
-enum numa_mode numa_status;
 
 bool srat_disabled(void)
 {
@@ -59,267 +34,6 @@ bool __init numa_enabled_with_firmware(void)
     return numa_status == numa_acpi;
 }
 
-/*
- * Given a shift value, try to populate memnodemap[]
- * Returns :
- * 1 if OK
- * 0 if memnodmap[] too small (of shift too small)
- * -1 if node overlap or lost ram (shift too big)
- */
-static int __init populate_memnodemap(const struct node *nodes,
-                                      int numnodes, int shift, nodeid_t *nodeids)
-{
-    unsigned long spdx, epdx;
-    int i, res = -1;
-
-    memset(memnodemap, NUMA_NO_NODE, memnodemapsize * sizeof(*memnodemap));
-    for ( i = 0; i < numnodes; i++ )
-    {
-        spdx = paddr_to_pdx(nodes[i].start);
-        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
-        if ( spdx >= epdx )
-            continue;
-        if ( (epdx >> shift) >= memnodemapsize )
-            return 0;
-        do {
-            if ( memnodemap[spdx >> shift] != NUMA_NO_NODE )
-                return -1;
-
-            if ( !nodeids )
-                memnodemap[spdx >> shift] = i;
-            else
-                memnodemap[spdx >> shift] = nodeids[i];
-
-            spdx += (1UL << shift);
-        } while ( spdx < epdx );
-        res = 1;
-    }
-
-    return res;
-}
-
-static int __init allocate_cachealigned_memnodemap(void)
-{
-    unsigned long size = PFN_UP(memnodemapsize * sizeof(*memnodemap));
-    unsigned long mfn = mfn_x(alloc_boot_pages(size, 1));
-
-    memnodemap = mfn_to_virt(mfn);
-    mfn <<= PAGE_SHIFT;
-    size <<= PAGE_SHIFT;
-    printk(KERN_DEBUG "NUMA: Allocated memnodemap from %lx - %lx\n",
-           mfn, mfn + size);
-    memnodemapsize = size / sizeof(*memnodemap);
-
-    return 0;
-}
-
-/*
- * The LSB of all start and end addresses in the node map is the value of the
- * maximum possible shift.
- */
-static int __init extract_lsb_from_nodes(const struct node *nodes,
-                                         int numnodes)
-{
-    int i, nodes_used = 0;
-    unsigned long spdx, epdx;
-    unsigned long bitfield = 0, memtop = 0;
-
-    for ( i = 0; i < numnodes; i++ )
-    {
-        spdx = paddr_to_pdx(nodes[i].start);
-        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
-        if ( spdx >= epdx )
-            continue;
-        bitfield |= spdx;
-        nodes_used++;
-        if ( epdx > memtop )
-            memtop = epdx;
-    }
-    if ( nodes_used <= 1 )
-        i = BITS_PER_LONG - 1;
-    else
-        i = find_first_bit(&bitfield, sizeof(unsigned long)*8);
-    memnodemapsize = (memtop >> i) + 1;
-    return i;
-}
-
-int __init compute_hash_shift(struct node *nodes, int numnodes,
-                              nodeid_t *nodeids)
-{
-    int shift;
-
-    shift = extract_lsb_from_nodes(nodes, numnodes);
-    if ( memnodemapsize <= ARRAY_SIZE(_memnodemap) )
-        memnodemap = _memnodemap;
-    else if ( allocate_cachealigned_memnodemap() )
-        return -1;
-    printk(KERN_DEBUG "NUMA: Using %d for the hash shift.\n", shift);
-
-    if ( populate_memnodemap(nodes, numnodes, shift, nodeids) != 1 )
-    {
-        printk(KERN_INFO "Your memory is not aligned you need to "
-               "rebuild your hypervisor with a bigger NODEMAPSIZE "
-               "shift=%d\n", shift);
-        return -1;
-    }
-
-    return shift;
-}
-/* initialize NODE_DATA given nodeid and start/end */
-void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end)
-{
-    unsigned long start_pfn = paddr_to_pfn(start);
-    unsigned long end_pfn = paddr_to_pfn(end);
-
-    NODE_DATA(nodeid)->node_start_pfn = start_pfn;
-    NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
-
-    node_set_online(nodeid);
-} 
-
-void __init numa_init_array(void)
-{
-    int rr, i;
-
-    /* There are unfortunately some poorly designed mainboards around
-       that only connect memory to a single CPU. This breaks the 1:1 cpu->node
-       mapping. To avoid this fill in the mapping for all possible
-       CPUs, as the number of CPUs is not known yet.
-       We round robin the existing nodes. */
-    rr = first_node(node_online_map);
-    for ( i = 0; i < nr_cpu_ids; i++ )
-    {
-        if ( cpu_to_node[i] != NUMA_NO_NODE )
-            continue;
-        numa_set_node(i, rr);
-        rr = cycle_node(rr, node_online_map);
-    }
-}
-
-#ifdef CONFIG_NUMA_EMU
-static int numa_fake __initdata = 0;
-
-/* Numa emulation */
-static int __init numa_emulation(unsigned long start_pfn,
-                                 unsigned long end_pfn)
-{
-    int i;
-    struct node nodes[MAX_NUMNODES];
-    uint64_t sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake;
-
-    /* Kludge needed for the hash function */
-    if ( hweight64(sz) > 1 )
-    {
-        u64 x = 1;
-        while ( (x << 1) < sz )
-            x <<= 1;
-        if ( x < sz/2 )
-            printk(KERN_ERR "Numa emulation unbalanced. Complain to maintainer\n");
-        sz = x;
-    }
-
-    memset(&nodes,0,sizeof(nodes));
-    for ( i = 0; i < numa_fake; i++ )
-    {
-        nodes[i].start = pfn_to_paddr(start_pfn) + i * sz;
-        if ( i == numa_fake - 1 )
-            sz = pfn_to_paddr(end_pfn) - nodes[i].start;
-        nodes[i].end = nodes[i].start + sz;
-        printk(KERN_INFO "Faking node %d at %"PRIx64"-%"PRIx64" (%"PRIu64"MB)\n",
-               i,
-               nodes[i].start, nodes[i].end,
-               (nodes[i].end - nodes[i].start) >> 20);
-        node_set_online(i);
-    }
-    memnode_shift = compute_hash_shift(nodes, numa_fake, NULL);
-    if ( memnode_shift < 0 )
-    {
-        memnode_shift = 0;
-        printk(KERN_ERR "No NUMA hash function found. Emulation disabled.\n");
-        return -1;
-    }
-    for_each_online_node ( i )
-        setup_node_bootmem(i, nodes[i].start, nodes[i].end);
-    numa_init_array();
-
-    return 0;
-}
-#endif
-
-void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
-{ 
-    int i;
-    paddr_t start = pfn_to_paddr(start_pfn);
-    paddr_t end = pfn_to_paddr(end_pfn);
-
-#ifdef CONFIG_NUMA_EMU
-    if ( numa_fake && !numa_emulation(start_pfn, end_pfn) )
-        return;
-#endif
-
-#ifdef CONFIG_ACPI_NUMA
-    if ( numa_status != numa_off && !acpi_scan_nodes(start, end) )
-        return;
-#endif
-
-    printk(KERN_INFO "%s\n",
-           numa_status == numa_off ? "NUMA turned off"
-                                   : "No NUMA configuration found");
-
-    printk(KERN_INFO "Faking a node at %"PRIpaddr"-%"PRIpaddr"\n",
-           start, end);
-    /* setup dummy node covering all memory */
-    memnode_shift = BITS_PER_LONG - 1;
-    memnodemap = _memnodemap;
-    /* Dummy node only uses 1 slot in reality */
-    memnodemap[0] = 0;
-    memnodemapsize = 1;
-
-    nodes_clear(node_online_map);
-    node_set_online(0);
-    for ( i = 0; i < nr_cpu_ids; i++ )
-        numa_set_node(i, 0);
-    cpumask_copy(&node_to_cpumask[0], cpumask_of(0));
-    setup_node_bootmem(0, start, end);
-}
-
-void numa_add_cpu(int cpu)
-{
-    cpumask_set_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]);
-} 
-
-void numa_set_node(int cpu, nodeid_t node)
-{
-    cpu_to_node[cpu] = node;
-}
-
-/* [numa=off] */
-static int __init cf_check numa_setup(const char *opt)
-{
-    if ( !strncmp(opt,"off",3) )
-        numa_status = numa_off;
-    else if ( !strncmp(opt,"on",2) )
-        numa_status = numa_on;
-#ifdef CONFIG_NUMA_EMU
-    else if ( !strncmp(opt, "fake=", 5) )
-    {
-        numa_status = numa_on;
-        numa_fake = simple_strtoul(opt+5,NULL,0);
-        if ( numa_fake >= MAX_NUMNODES )
-            numa_fake = MAX_NUMNODES;
-    }
-#endif
-#ifdef CONFIG_ACPI_NUMA
-    else if ( !strncmp(opt,"noacpi",6) )
-        numa_status = numa_no_acpi;
-#endif
-    else
-        return -EINVAL;
-
-    return 0;
-} 
-custom_param("numa", numa_setup);
-
 /*
  * Setup early cpu_to_node.
  *
@@ -368,146 +82,3 @@ unsigned int __init arch_get_dma_bitsize(void)
                  flsl(node_start_pfn(node) + node_spanned_pages(node) / 4 - 1)
                  + PAGE_SHIFT, 32);
 }
-
-static void cf_check dump_numa(unsigned char key)
-{
-    s_time_t now = NOW();
-    unsigned int i, j, n;
-    struct domain *d;
-    struct page_info *page;
-    unsigned int page_num_node[MAX_NUMNODES];
-    const struct vnuma_info *vnuma;
-
-    printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key,
-           now);
-
-    for_each_online_node ( i )
-    {
-        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
-
-        printk("NODE%u start->%lu size->%lu free->%lu\n",
-               i, node_start_pfn(i), node_spanned_pages(i),
-               avail_node_heap_pages(i));
-        /* sanity check phys_to_nid() */
-        if ( phys_to_nid(pa) != i )
-            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
-                   pa, phys_to_nid(pa), i);
-    }
-
-    j = cpumask_first(&cpu_online_map);
-    n = 0;
-    for_each_online_cpu ( i )
-    {
-        if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] )
-        {
-            if ( n > 1 )
-                printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
-            else
-                printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
-            j = i;
-            n = 1;
-        }
-        else
-            ++n;
-    }
-    if ( n > 1 )
-        printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
-    else
-        printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
-
-    rcu_read_lock(&domlist_read_lock);
-
-    printk("Memory location of each domain:\n");
-    for_each_domain ( d )
-    {
-        process_pending_softirqs();
-
-        printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d));
-
-        for_each_online_node ( i )
-            page_num_node[i] = 0;
-
-        spin_lock(&d->page_alloc_lock);
-        page_list_for_each(page, &d->page_list)
-        {
-            i = phys_to_nid(page_to_maddr(page));
-            page_num_node[i]++;
-        }
-        spin_unlock(&d->page_alloc_lock);
-
-        for_each_online_node ( i )
-            printk("    Node %u: %u\n", i, page_num_node[i]);
-
-        if ( !read_trylock(&d->vnuma_rwlock) )
-            continue;
-
-        if ( !d->vnuma )
-        {
-            read_unlock(&d->vnuma_rwlock);
-            continue;
-        }
-
-        vnuma = d->vnuma;
-        printk("     %u vnodes, %u vcpus, guest physical layout:\n",
-               vnuma->nr_vnodes, d->max_vcpus);
-        for ( i = 0; i < vnuma->nr_vnodes; i++ )
-        {
-            unsigned int start_cpu = ~0U;
-
-            if ( vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
-                printk("       %3u: pnode ???,", i);
-            else
-                printk("       %3u: pnode %3u,", i, vnuma->vnode_to_pnode[i]);
-
-            printk(" vcpus ");
-
-            for ( j = 0; j < d->max_vcpus; j++ )
-            {
-                if ( !(j & 0x3f) )
-                    process_pending_softirqs();
-
-                if ( vnuma->vcpu_to_vnode[j] == i )
-                {
-                    if ( start_cpu == ~0U )
-                    {
-                        printk("%d", j);
-                        start_cpu = j;
-                    }
-                }
-                else if ( start_cpu != ~0U )
-                {
-                    if ( j - 1 != start_cpu )
-                        printk("-%d ", j - 1);
-                    else
-                        printk(" ");
-                    start_cpu = ~0U;
-                }
-            }
-
-            if ( start_cpu != ~0U  && start_cpu != j - 1 )
-                printk("-%d", j - 1);
-
-            printk("\n");
-
-            for ( j = 0; j < vnuma->nr_vmemranges; j++ )
-            {
-                if ( vnuma->vmemrange[j].nid == i )
-                    printk("           %016"PRIx64" - %016"PRIx64"\n",
-                           vnuma->vmemrange[j].start,
-                           vnuma->vmemrange[j].end);
-            }
-        }
-
-        read_unlock(&d->vnuma_rwlock);
-    }
-
-    rcu_read_unlock(&domlist_read_lock);
-}
-
-static int __init cf_check register_numa_trigger(void)
-{
-    register_keyhandler('u', dump_numa, "dump NUMA info", 1);
-    return 0;
-}
-__initcall(register_numa_trigger);
-
diff --git a/xen/common/Makefile b/xen/common/Makefile
index 3baf83d527..9a3a12b12d 100644
--- a/xen/common/Makefile
+++ b/xen/common/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_MEM_ACCESS) += mem_access.o
 obj-y += memory.o
 obj-y += multicall.o
 obj-y += notifier.o
+obj-$(CONFIG_NUMA) += numa.o
 obj-y += page_alloc.o
 obj-$(CONFIG_HAS_PDX) += pdx.o
 obj-$(CONFIG_PERF_COUNTERS) += perfc.o
diff --git a/xen/common/numa.c b/xen/common/numa.c
new file mode 100644
index 0000000000..bc6a2ded14
--- /dev/null
+++ b/xen/common/numa.c
@@ -0,0 +1,439 @@
+/*
+ * Generic VM initialization for NUMA setups.
+ * Copyright 2002,2003 Andi Kleen, SuSE Labs.
+ * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com>
+ */
+
+#include <xen/init.h>
+#include <xen/keyhandler.h>
+#include <xen/mm.h>
+#include <xen/nodemask.h>
+#include <xen/numa.h>
+#include <xen/param.h>
+#include <xen/sched.h>
+#include <xen/softirq.h>
+#include <asm/acpi.h>
+
+struct node_data node_data[MAX_NUMNODES];
+
+/* Mapping from pdx to node id */
+int memnode_shift;
+static typeof(*memnodemap) _memnodemap[64];
+unsigned long memnodemapsize;
+u8 *memnodemap;
+
+nodeid_t cpu_to_node[NR_CPUS] __read_mostly = {
+    [0 ... NR_CPUS-1] = NUMA_NO_NODE
+};
+
+cpumask_t node_to_cpumask[MAX_NUMNODES] __read_mostly;
+
+nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
+
+enum numa_mode numa_status;
+
+/*
+ * Given a shift value, try to populate memnodemap[]
+ * Returns:
+ * 1 if OK
+ * 0 if memnodemap[] is too small (or shift too small)
+ * -1 if nodes overlap or RAM is lost (shift too big)
+ */
+static int __init populate_memnodemap(const struct node *nodes,
+                                      int numnodes, int shift, nodeid_t *nodeids)
+{
+    unsigned long spdx, epdx;
+    int i, res = -1;
+
+    memset(memnodemap, NUMA_NO_NODE, memnodemapsize * sizeof(*memnodemap));
+    for ( i = 0; i < numnodes; i++ )
+    {
+        spdx = paddr_to_pdx(nodes[i].start);
+        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
+        if ( spdx >= epdx )
+            continue;
+        if ( (epdx >> shift) >= memnodemapsize )
+            return 0;
+        do {
+            if ( memnodemap[spdx >> shift] != NUMA_NO_NODE )
+                return -1;
+
+            if ( !nodeids )
+                memnodemap[spdx >> shift] = i;
+            else
+                memnodemap[spdx >> shift] = nodeids[i];
+
+            spdx += (1UL << shift);
+        } while ( spdx < epdx );
+        res = 1;
+    }
+
+    return res;
+}
+
+static int __init allocate_cachealigned_memnodemap(void)
+{
+    unsigned long size = PFN_UP(memnodemapsize * sizeof(*memnodemap));
+    unsigned long mfn = mfn_x(alloc_boot_pages(size, 1));
+
+    memnodemap = mfn_to_virt(mfn);
+    mfn <<= PAGE_SHIFT;
+    size <<= PAGE_SHIFT;
+    printk(KERN_DEBUG "NUMA: Allocated memnodemap from %lx - %lx\n",
+           mfn, mfn + size);
+    memnodemapsize = size / sizeof(*memnodemap);
+
+    return 0;
+}
+
+/*
+ * The LSB of all start and end addresses in the node map is the value of the
+ * maximum possible shift.
+ */
+static int __init extract_lsb_from_nodes(const struct node *nodes,
+                                         int numnodes)
+{
+    int i, nodes_used = 0;
+    unsigned long spdx, epdx;
+    unsigned long bitfield = 0, memtop = 0;
+
+    for ( i = 0; i < numnodes; i++ )
+    {
+        spdx = paddr_to_pdx(nodes[i].start);
+        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
+        if ( spdx >= epdx )
+            continue;
+        bitfield |= spdx;
+        nodes_used++;
+        if ( epdx > memtop )
+            memtop = epdx;
+    }
+    if ( nodes_used <= 1 )
+        i = BITS_PER_LONG - 1;
+    else
+        i = find_first_bit(&bitfield, sizeof(unsigned long)*8);
+    memnodemapsize = (memtop >> i) + 1;
+    return i;
+}
+
+int __init compute_hash_shift(struct node *nodes, int numnodes,
+                              nodeid_t *nodeids)
+{
+    int shift;
+
+    shift = extract_lsb_from_nodes(nodes, numnodes);
+    if ( memnodemapsize <= ARRAY_SIZE(_memnodemap) )
+        memnodemap = _memnodemap;
+    else if ( allocate_cachealigned_memnodemap() )
+        return -1;
+    printk(KERN_DEBUG "NUMA: Using %d for the hash shift.\n", shift);
+
+    if ( populate_memnodemap(nodes, numnodes, shift, nodeids) != 1 )
+    {
+        printk(KERN_INFO "Your memory is not aligned you need to "
+               "rebuild your hypervisor with a bigger NODEMAPSIZE "
+               "shift=%d\n", shift);
+        return -1;
+    }
+
+    return shift;
+}
+
+/* initialize NODE_DATA given nodeid and start/end */
+void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end)
+{
+    unsigned long start_pfn = paddr_to_pfn(start);
+    unsigned long end_pfn = paddr_to_pfn(end);
+
+    NODE_DATA(nodeid)->node_start_pfn = start_pfn;
+    NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
+
+    node_set_online(nodeid);
+}
+
+void __init numa_init_array(void)
+{
+    int rr, i;
+
+    /*
+     * There are unfortunately some poorly designed mainboards around
+     * that only connect memory to a single CPU. This breaks the 1:1 cpu->node
+     * mapping. To avoid this fill in the mapping for all possible
+     * CPUs, as the number of CPUs is not known yet.
+     * We round robin the existing nodes.
+     */
+    rr = first_node(node_online_map);
+    for ( i = 0; i < nr_cpu_ids; i++ )
+    {
+        if ( cpu_to_node[i] != NUMA_NO_NODE )
+            continue;
+        numa_set_node(i, rr);
+        rr = cycle_node(rr, node_online_map);
+    }
+}
+
+#ifdef CONFIG_NUMA_EMU
+static int numa_fake __initdata = 0;
+
+/* Numa emulation */
+static int __init numa_emulation(unsigned long start_pfn,
+                                 unsigned long end_pfn)
+{
+    int i;
+    struct node nodes[MAX_NUMNODES];
+    uint64_t sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake;
+
+    /* Kludge needed for the hash function */
+    if ( hweight64(sz) > 1 )
+    {
+        u64 x = 1;
+        while ( (x << 1) < sz )
+            x <<= 1;
+        if ( x < sz/2 )
+            printk(KERN_ERR "Numa emulation unbalanced. Complain to maintainer\n");
+        sz = x;
+    }
+
+    memset(&nodes,0,sizeof(nodes));
+    for ( i = 0; i < numa_fake; i++ )
+    {
+        nodes[i].start = pfn_to_paddr(start_pfn) + i * sz;
+        if ( i == numa_fake - 1 )
+            sz = pfn_to_paddr(end_pfn) - nodes[i].start;
+        nodes[i].end = nodes[i].start + sz;
+        printk(KERN_INFO "Faking node %d at %"PRIx64"-%"PRIx64" (%"PRIu64"MB)\n",
+               i,
+               nodes[i].start, nodes[i].end,
+               (nodes[i].end - nodes[i].start) >> 20);
+        node_set_online(i);
+    }
+    memnode_shift = compute_hash_shift(nodes, numa_fake, NULL);
+    if ( memnode_shift < 0 )
+    {
+        memnode_shift = 0;
+        printk(KERN_ERR "No NUMA hash function found. Emulation disabled.\n");
+        return -1;
+    }
+    for_each_online_node ( i )
+        setup_node_bootmem(i, nodes[i].start, nodes[i].end);
+    numa_init_array();
+
+    return 0;
+}
+#endif
+
+void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
+{
+    int i;
+    paddr_t start = pfn_to_paddr(start_pfn);
+    paddr_t end = pfn_to_paddr(end_pfn);
+
+#ifdef CONFIG_NUMA_EMU
+    if ( numa_fake && !numa_emulation(start_pfn, end_pfn) )
+        return;
+#endif
+
+#ifdef CONFIG_ACPI_NUMA
+    if ( numa_status != numa_off && !acpi_scan_nodes(start, end) )
+        return;
+#endif
+
+    printk(KERN_INFO "%s\n",
+           numa_status == numa_off ? "NUMA turned off"
+                                   : "No NUMA configuration found");
+
+    printk(KERN_INFO "Faking a node at %"PRIpaddr"-%"PRIpaddr"\n",
+           start, end);
+    /* setup dummy node covering all memory */
+    memnode_shift = BITS_PER_LONG - 1;
+    memnodemap = _memnodemap;
+    /* Dummy node only uses 1 slot in reality */
+    memnodemap[0] = 0;
+    memnodemapsize = 1;
+
+    nodes_clear(node_online_map);
+    node_set_online(0);
+    for ( i = 0; i < nr_cpu_ids; i++ )
+        numa_set_node(i, 0);
+    cpumask_copy(&node_to_cpumask[0], cpumask_of(0));
+    setup_node_bootmem(0, start, end);
+}
+
+void numa_add_cpu(int cpu)
+{
+    cpumask_set_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]);
+}
+
+void numa_set_node(int cpu, nodeid_t node)
+{
+    cpu_to_node[cpu] = node;
+}
+
+/* [numa=off] */
+static int __init cf_check numa_setup(const char *opt)
+{
+    if ( !strncmp(opt,"off",3) )
+        numa_status = numa_off;
+    else if ( !strncmp(opt,"on",2) )
+        numa_status = numa_on;
+#ifdef CONFIG_NUMA_EMU
+    else if ( !strncmp(opt, "fake=", 5) )
+    {
+        numa_status = numa_on;
+        numa_fake = simple_strtoul(opt+5,NULL,0);
+        if ( numa_fake >= MAX_NUMNODES )
+            numa_fake = MAX_NUMNODES;
+    }
+#endif
+#ifdef CONFIG_ACPI_NUMA
+    else if ( !strncmp(opt,"noacpi",6) )
+        numa_status = numa_no_acpi;
+#endif
+    else
+        return -EINVAL;
+
+    return 0;
+}
+custom_param("numa", numa_setup);
+
+static void cf_check dump_numa(unsigned char key)
+{
+    s_time_t now = NOW();
+    unsigned int i, j, n;
+    struct domain *d;
+    struct page_info *page;
+    unsigned int page_num_node[MAX_NUMNODES];
+    const struct vnuma_info *vnuma;
+
+    printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key,
+           now);
+
+    for_each_online_node ( i )
+    {
+        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
+
+        printk("NODE%u start->%lu size->%lu free->%lu\n",
+               i, node_start_pfn(i), node_spanned_pages(i),
+               avail_node_heap_pages(i));
+        /* sanity check phys_to_nid() */
+        if ( phys_to_nid(pa) != i )
+            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
+                   pa, phys_to_nid(pa), i);
+    }
+
+    j = cpumask_first(&cpu_online_map);
+    n = 0;
+    for_each_online_cpu ( i )
+    {
+        if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] )
+        {
+            if ( n > 1 )
+                printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
+            else
+                printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
+            j = i;
+            n = 1;
+        }
+        else
+            ++n;
+    }
+    if ( n > 1 )
+        printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
+    else
+        printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
+
+    rcu_read_lock(&domlist_read_lock);
+
+    printk("Memory location of each domain:\n");
+    for_each_domain ( d )
+    {
+        process_pending_softirqs();
+
+        printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d));
+
+        for_each_online_node ( i )
+            page_num_node[i] = 0;
+
+        spin_lock(&d->page_alloc_lock);
+        page_list_for_each(page, &d->page_list)
+        {
+            i = phys_to_nid(page_to_maddr(page));
+            page_num_node[i]++;
+        }
+        spin_unlock(&d->page_alloc_lock);
+
+        for_each_online_node ( i )
+            printk("    Node %u: %u\n", i, page_num_node[i]);
+
+        if ( !read_trylock(&d->vnuma_rwlock) )
+            continue;
+
+        if ( !d->vnuma )
+        {
+            read_unlock(&d->vnuma_rwlock);
+            continue;
+        }
+
+        vnuma = d->vnuma;
+        printk("     %u vnodes, %u vcpus, guest physical layout:\n",
+               vnuma->nr_vnodes, d->max_vcpus);
+        for ( i = 0; i < vnuma->nr_vnodes; i++ )
+        {
+            unsigned int start_cpu = ~0U;
+
+            if ( vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
+                printk("       %3u: pnode ???,", i);
+            else
+                printk("       %3u: pnode %3u,", i, vnuma->vnode_to_pnode[i]);
+
+            printk(" vcpus ");
+
+            for ( j = 0; j < d->max_vcpus; j++ )
+            {
+                if ( !(j & 0x3f) )
+                    process_pending_softirqs();
+
+                if ( vnuma->vcpu_to_vnode[j] == i )
+                {
+                    if ( start_cpu == ~0U )
+                    {
+                        printk("%d", j);
+                        start_cpu = j;
+                    }
+                }
+                else if ( start_cpu != ~0U )
+                {
+                    if ( j - 1 != start_cpu )
+                        printk("-%d ", j - 1);
+                    else
+                        printk(" ");
+                    start_cpu = ~0U;
+                }
+            }
+
+            if ( start_cpu != ~0U  && start_cpu != j - 1 )
+                printk("-%d", j - 1);
+
+            printk("\n");
+
+            for ( j = 0; j < vnuma->nr_vmemranges; j++ )
+            {
+                if ( vnuma->vmemrange[j].nid == i )
+                    printk("           %016"PRIx64" - %016"PRIx64"\n",
+                           vnuma->vmemrange[j].start,
+                           vnuma->vmemrange[j].end);
+            }
+        }
+
+        read_unlock(&d->vnuma_rwlock);
+    }
+
+    rcu_read_unlock(&domlist_read_lock);
+}
+
+static int __init cf_check register_numa_trigger(void)
+{
+    register_keyhandler('u', dump_numa, "dump NUMA info", 1);
+    return 0;
+}
+__initcall(register_numa_trigger);
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index 7aef1a88dc..db0e524a0e 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -18,4 +18,78 @@
   (((d)->vcpu != NULL && (d)->vcpu[0] != NULL) \
    ? vcpu_to_node((d)->vcpu[0]) : NUMA_NO_NODE)
 
+/* The following content can be used when the NUMA feature is enabled */
+#ifdef CONFIG_NUMA
+
+extern nodeid_t      cpu_to_node[NR_CPUS];
+extern cpumask_t     node_to_cpumask[];
+
+#define cpu_to_node(cpu)        (cpu_to_node[cpu])
+#define parent_node(node)       (node)
+#define node_to_first_cpu(node) (__ffs(node_to_cpumask[node]))
+#define node_to_cpumask(node)   (node_to_cpumask[node])
+
+struct node {
+    paddr_t start, end;
+};
+
+extern int compute_hash_shift(struct node *nodes, int numnodes,
+                              nodeid_t *nodeids);
+
+#define VIRTUAL_BUG_ON(x)
+
+/* Enumerations for NUMA status. */
+enum numa_mode {
+    numa_on = 0,
+    numa_off,
+    /* NUMA is on, but the ACPI table is bad or disabled. */
+    numa_no_acpi,
+    /* NUMA is on, and the ACPI table works well. */
+    numa_acpi,
+};
+
+extern void numa_add_cpu(int cpu);
+extern void numa_init_array(void);
+extern void numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn);
+extern bool numa_enabled_with_firmware(void);
+extern enum numa_mode numa_status;
+
+extern void numa_set_node(int cpu, nodeid_t node);
+extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end);
+
+static inline void clear_node_cpumask(int cpu)
+{
+    cpumask_clear_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]);
+}
+
+/* Simple perfect hash to map pdx to node numbers */
+extern int memnode_shift;
+extern unsigned long memnodemapsize;
+extern u8 *memnodemap;
+
+struct node_data {
+    unsigned long node_start_pfn;
+    unsigned long node_spanned_pages;
+};
+
+extern struct node_data node_data[];
+
+static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
+{
+    nodeid_t nid;
+    VIRTUAL_BUG_ON((paddr_to_pdx(addr) >> memnode_shift) >= memnodemapsize);
+    nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift];
+    VIRTUAL_BUG_ON(nid >= MAX_NUMNODES || !node_data[nid]);
+    return nid;
+}
+
+#define NODE_DATA(nid)          (&(node_data[nid]))
+
+#define node_start_pfn(nid)     (NODE_DATA(nid)->node_start_pfn)
+#define node_spanned_pages(nid) (NODE_DATA(nid)->node_spanned_pages)
+#define node_end_pfn(nid)       (NODE_DATA(nid)->node_start_pfn + \
+                                NODE_DATA(nid)->node_spanned_pages)
+
+#endif
+
 #endif /* _XEN_NUMA_H */
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v2 4/9] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid
  2022-07-08 14:54 [PATCH v2 0/9] Device tree based NUMA support for Arm - Part#2 Wei Chen
                   ` (2 preceding siblings ...)
  2022-07-08 14:54 ` [PATCH v2 3/9] xen/x86: move generically usable NUMA code from x86 to common Wei Chen
@ 2022-07-08 14:54 ` Wei Chen
  2022-07-11  6:32   ` Jan Beulich
  2022-07-08 14:54 ` [PATCH v2 5/9] xen/x86: use arch_get_memory_map to get information from E820 map Wei Chen
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-08 14:54 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Andrew Cooper, George Dunlap, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu, Jiamei Xie

VIRTUAL_BUG_ON is an empty macro used in phys_to_nid. As a
result, the two lines of error-checking code in phys_to_nid do
nothing, and enabling them exposes two compilation errors:
1. error: "MAX_NUMNODES" undeclared (first use in this function).
   This is because "MAX_NUMNODES" is defined in the common header
   file after the common header file includes the ARCH header
   file, where phys_to_nid had attempted to use "MAX_NUMNODES".
   This error was resolved once we moved phys_to_nid from the x86
   ARCH header file to the common header file.
2. error: wrong type argument to unary exclamation mark.
   This is because the error-checking code contains
   !node_data[nid], but node_data is an array of structures, not
   an array of pointers.

So, in this patch, we use ASSERT instead of VIRTUAL_BUG_ON to
enable the two lines of error-checking code, and fix the
remaining compilation error by replacing !node_data[nid] with
!node_data[nid].node_spanned_pages. Although NUMA allows a node
to have only CPUs and no memory, and a node with 0 bytes of
memory might theoretically have an entry in memnodemap[], that
doesn't mean phys_to_nid can resolve any valid address to a node
with 0 bytes of memory.

Signed-off-by: Wei Chen <wei.chen@arm.com>
Tested-by: Jiamei Xie <jiamei.xie@arm.com>
---
v1 -> v2:
1. Move from part#1 to part#2. (Comment from NUMA part1 series)
2. Refine the justification of using
   !node_data[nid].node_spanned_pages. (From NUMA part1 series)
3. Use ASSERT to replace VIRTUAL_BUG_ON in phys_to_nid.
4. Adjust the conditional express for ASSERT.
5. Move MAX_NUMNODES from xen/numa.h to asm/numa.h for x86.
6. Use conditional macro to gate MAX_NUMNODES for other architectures.
---
 xen/include/xen/numa.h | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index db0e524a0e..695ad51df0 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -36,8 +36,6 @@ struct node {
 extern int compute_hash_shift(struct node *nodes, int numnodes,
                               nodeid_t *nodeids);
 
-#define VIRTUAL_BUG_ON(x)
-
 /* Enumerations for NUMA status. */
 enum numa_mode {
     numa_on = 0,
@@ -77,9 +75,9 @@ extern struct node_data node_data[];
 static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
 {
     nodeid_t nid;
-    VIRTUAL_BUG_ON((paddr_to_pdx(addr) >> memnode_shift) >= memnodemapsize);
+    ASSERT((paddr_to_pdx(addr) >> memnode_shift) < memnodemapsize);
     nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift];
-    VIRTUAL_BUG_ON(nid >= MAX_NUMNODES || !node_data[nid]);
+    ASSERT(nid < MAX_NUMNODES && node_data[nid].node_spanned_pages);
     return nid;
 }
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v2 5/9] xen/x86: use arch_get_memory_map to get information from E820 map
  2022-07-08 14:54 [PATCH v2 0/9] Device tree based NUMA support for Arm - Part#2 Wei Chen
                   ` (3 preceding siblings ...)
  2022-07-08 14:54 ` [PATCH v2 4/9] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid Wei Chen
@ 2022-07-08 14:54 ` Wei Chen
  2022-07-12 13:40   ` Jan Beulich
  2022-07-08 14:54 ` [PATCH v2 6/9] xen/x86: move NUMA scan nodes codes from x86 to common Wei Chen
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-08 14:54 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

The sanity check performed by the nodes_cover_memory function is
also necessary for other architectures that do not yet support
NUMA. But currently the code of nodes_cover_memory is tied to the
x86 E820 map. So we introduce arch_get_memory_map to decouple the
architecture-specific memory map from this function. This means
other architectures like Arm can also use it to check their boot
memory info.

By relying on arch_get_memory_map, nodes_cover_memory becomes
architecture-independent. We also use neutral words to replace
SRAT and E820 in this function's print message, which makes the
message more generic.

As arch_get_memory_map uses an unsigned int index, we also change
the index in nodes_cover_memory from int to unsigned int.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
v1 -> v2:
1. Use arch_get_memory_map to replace arch_get_memory_bank_range
   and arch_get_memory_bank_number.
2. Remove the !start || !end check, because callers guarantee
   that these two pointers will not be NULL.
---
 xen/arch/x86/numa.c    | 23 +++++++++++++++++++++++
 xen/arch/x86/srat.c    | 18 +++++++++++-------
 xen/include/xen/numa.h |  3 +++
 3 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 193314dbd9..fb235c07ec 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -9,6 +9,7 @@
 #include <xen/nodemask.h>
 #include <xen/numa.h>
 #include <asm/acpi.h>
+#include <asm/e820.h>
 
 #ifndef Dprintk
 #define Dprintk(x...)
@@ -82,3 +83,25 @@ unsigned int __init arch_get_dma_bitsize(void)
                  flsl(node_start_pfn(node) + node_spanned_pages(node) / 4 - 1)
                  + PAGE_SHIFT, 32);
 }
+
+/*
+ * This function provides the ability for callers to get one RAM entry
+ * from the architectural memory map by index.
+ *
+ * It returns zero if it can return a proper RAM entry; otherwise it
+ * returns -ENODEV for an out-of-range index, or -EINVAL for a non-RAM
+ * type memory entry.
+ */
+int __init arch_get_memory_map(unsigned int idx, paddr_t *start, paddr_t *end)
+{
+    if ( idx >= e820.nr_map )
+        return -ENODEV;
+
+    if ( e820.map[idx].type != E820_RAM )
+        return -EINVAL;
+
+    *start = e820.map[idx].addr;
+    *end = e820.map[idx].addr + e820.map[idx].size;
+
+    return 0;
+}
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 422e4c73e3..5bc9096a15 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -427,18 +427,22 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
    Make sure the PXMs cover all memory. */
 static int __init nodes_cover_memory(void)
 {
-	int i;
+	unsigned int i;
 
-	for (i = 0; i < e820.nr_map; i++) {
+	for (i = 0; ; i++) {
 		int j, found;
 		paddr_t start, end;
 
-		if (e820.map[i].type != E820_RAM) {
+		/* Loop over the memory map from index 0 to the end. */
+		found = arch_get_memory_map(i, &start, &end);
+
+		/* The entry at this index is not RAM; skip it. */
+		if (found == -EINVAL)
 			continue;
-		}
 
-		start = e820.map[i].addr;
-		end = e820.map[i].addr + e820.map[i].size;
+		/* Reached the end of the arch's memory map. */
+		if (found == -ENODEV)
+			break;
 
 		do {
 			found = 0;
@@ -457,7 +461,7 @@ static int __init nodes_cover_memory(void)
 		} while (found && start < end);
 
 		if (start < end) {
-			printk(KERN_ERR "SRAT: No PXM for e820 range: "
+			printk(KERN_ERR "NUMA: No node for memory map range: "
 				"[%"PRIpaddr", %"PRIpaddr"]\n", start, end - 1);
 			return 0;
 		}
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index 695ad51df0..6d02204991 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -88,6 +88,9 @@ static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
 #define node_end_pfn(nid)       (NODE_DATA(nid)->node_start_pfn + \
                                 NODE_DATA(nid)->node_spanned_pages)
 
+extern int arch_get_memory_map(unsigned int idx,
+                               paddr_t *start, paddr_t *end);
+
 #endif
 
 #endif /* _XEN_NUMA_H */
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v2 6/9] xen/x86: move NUMA scan nodes codes from x86 to common
  2022-07-08 14:54 [PATCH v2 0/9] Device tree based NUMA support for Arm - Part#2 Wei Chen
                   ` (4 preceding siblings ...)
  2022-07-08 14:54 ` [PATCH v2 5/9] xen/x86: use arch_get_memory_map to get information from E820 map Wei Chen
@ 2022-07-08 14:54 ` Wei Chen
  2022-07-12 14:21   ` Jan Beulich
  2022-07-08 14:54 ` [PATCH v2 7/9] xen/x86: rename bad_srat to numa_bad Wei Chen
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-08 14:54 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

x86 implements a set of code to scan NUMA nodes, parsing NUMA
memory and processor information from the ACPI SRAT table. But
apart from some ACPI-specific parts, most of the scan code, such
as memory block validation, node memory range updates and some
sanity checks, can be reused by other NUMA implementations.

So in this patch, we move some variables and related functions
for NUMA memory and processors to common code. At the same time,
numa_set_processor_nodes_parsed has been introduced for the
ACPI-specific code to update processor parsing results. With this
helper, we can move most of the NUMA memory affinity init code
out of the ACPI code. The bad_srat and node_to_pxm functions have
been exported for common code to do the architectural fallback
and node-to-proximity-domain conversion. For NUMA implementations
that do not support a proximity-domain-to-node map, a simple 1:1
mapping helper lets us avoid modifying some very valuable print
messages.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
v1 -> v2:
1. Add code comment for numa_update_node_memblks to explain:
   Assumes all memory regions belonging to a single node
   are in one chunk. Holes between them will be included
   in the node.
2. Merge into this single patch, instead of several patches, to
   move the x86 SRAT code to common.
3. Export node_to_pxm to keep pxm information in NUMA scan
   nodes error messages.
4. Change the code style to target file's Xen code-style.
5. Adjust some __init and __initdata for some functions and
   variables.
---
 xen/arch/x86/include/asm/numa.h |   3 +-
 xen/arch/x86/srat.c             | 288 ++----------------------------
 xen/common/numa.c               | 307 ++++++++++++++++++++++++++++++++
 xen/include/xen/numa.h          |   7 +
 4 files changed, 328 insertions(+), 277 deletions(-)

diff --git a/xen/arch/x86/include/asm/numa.h b/xen/arch/x86/include/asm/numa.h
index d8960743d4..eeb431cdb7 100644
--- a/xen/arch/x86/include/asm/numa.h
+++ b/xen/arch/x86/include/asm/numa.h
@@ -10,6 +10,7 @@ typedef u8 nodeid_t;
 extern int srat_rev;
 
 extern nodeid_t pxm_to_node(unsigned int pxm);
+extern unsigned node_to_pxm(nodeid_t n);
 
 #define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
 
@@ -22,8 +23,6 @@ extern void init_cpu_to_node(void);
 
 #define arch_want_default_dmazone() (num_online_nodes() > 1)
 
-extern int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node);
-
 void srat_parse_regions(paddr_t addr);
 extern u8 __node_distance(nodeid_t a, nodeid_t b);
 unsigned int arch_get_dma_bitsize(void);
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 5bc9096a15..9ae81afdff 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -24,10 +24,6 @@
 
 static struct acpi_table_slit *__read_mostly acpi_slit;
 
-static nodemask_t memory_nodes_parsed __initdata;
-static nodemask_t processor_nodes_parsed __initdata;
-static struct node nodes[MAX_NUMNODES] __initdata;
-
 struct pxm2node {
 	unsigned pxm;
 	nodeid_t node;
@@ -35,19 +31,6 @@ struct pxm2node {
 static struct pxm2node __read_mostly pxm2node[MAX_NUMNODES] =
 	{ [0 ... MAX_NUMNODES - 1] = {.node = NUMA_NO_NODE} };
 
-static unsigned node_to_pxm(nodeid_t n);
-
-static int num_node_memblks;
-static struct node node_memblk_range[NR_NODE_MEMBLKS];
-static nodeid_t memblk_nodeid[NR_NODE_MEMBLKS];
-static __initdata DECLARE_BITMAP(memblk_hotplug, NR_NODE_MEMBLKS);
-
-enum conflicts {
-	NO_CONFLICT,
-	OVERLAP,
-	INTERLEAVE,
-};
-
 static inline bool node_found(unsigned idx, unsigned pxm)
 {
 	return ((pxm2node[idx].pxm == pxm) &&
@@ -110,78 +93,7 @@ nodeid_t setup_node(unsigned pxm)
 	return node;
 }
 
-int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node)
-{
-	int i;
-
-	for (i = 0; i < num_node_memblks; i++) {
-		struct node *nd = &node_memblk_range[i];
-
-		if (nd->start <= start && nd->end >= end &&
-			memblk_nodeid[i] == node)
-			return 1;
-	}
-
-	return 0;
-}
-
-static
-enum conflicts __init conflicting_memblks(nodeid_t nid, paddr_t start,
-					  paddr_t end, paddr_t nd_start,
-					  paddr_t nd_end, unsigned int *mblkid)
-{
-	unsigned int i;
-
-	/*
-	 * Scan all recorded nodes' memory blocks to check conflicts:
-	 * Overlap or interleave.
-	 */
-	for (i = 0; i < num_node_memblks; i++) {
-		struct node *nd = &node_memblk_range[i];
-
-		*mblkid = i;
-
-		/* Skip 0 bytes node memory block. */
-		if (nd->start == nd->end)
-			continue;
-		/*
-		 * Use memblk range to check memblk overlaps, include the
-		 * self-overlap case. As nd's range is non-empty, the special
-		 * case "nd->end == end && nd->start == start" also can be covered.
-		 */
-		if (nd->end > start && nd->start < end)
-			return OVERLAP;
-
-		/*
-		 * Use node memory range to check whether new range contains
-		 * memory from other nodes - interleave check. We just need
-		 * to check full contains situation. Because overlaps have
-		 * been checked above.
-		 */
-	        if (nid != memblk_nodeid[i] &&
-		    nd->start >= nd_start && nd->end <= nd_end)
-			return INTERLEAVE;
-	}
-
-	return NO_CONFLICT;
-}
-
-static __init void cutoff_node(int i, paddr_t start, paddr_t end)
-{
-	struct node *nd = &nodes[i];
-	if (nd->start < start) {
-		nd->start = start;
-		if (nd->end < nd->start)
-			nd->start = nd->end;
-	}
-	if (nd->end > end) {
-		nd->end = end;
-		if (nd->start > nd->end)
-			nd->start = nd->end;
-	}
-}
-
-static __init void bad_srat(void)
+void __init bad_srat(void)
 {
 	int i;
 	printk(KERN_ERR "SRAT: SRAT not used.\n");
@@ -259,7 +171,7 @@ acpi_numa_x2apic_affinity_init(const struct acpi_srat_x2apic_cpu_affinity *pa)
 	}
 
 	apicid_to_node[pa->apic_id] = node;
-	node_set(node, processor_nodes_parsed);
+	numa_set_processor_nodes_parsed(node);
 	numa_status = numa_acpi;
 
 	if (opt_acpi_verbose)
@@ -294,7 +206,7 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
 		return;
 	}
 	apicid_to_node[pa->apic_id] = node;
-	node_set(node, processor_nodes_parsed);
+	numa_set_processor_nodes_parsed(node);
 	numa_status = numa_acpi;
 
 	if (opt_acpi_verbose)
@@ -306,12 +218,9 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
 void __init
 acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 {
-	struct node *nd;
-	paddr_t nd_start, nd_end;
-	paddr_t start, end;
 	unsigned pxm;
 	nodeid_t node;
-	unsigned int i;
+	int ret;
 
 	if (srat_disabled())
 		return;
@@ -322,15 +231,14 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
 		return;
 
-	start = ma->base_address;
-	end = start + ma->length;
 	/* Supplement the heuristics in l1tf_calculations(). */
-	l1tf_safe_maddr = max(l1tf_safe_maddr, ROUNDUP(end, PAGE_SIZE));
+	l1tf_safe_maddr = max(l1tf_safe_maddr,
+			      ROUNDUP(ma->base_address + ma->length,
+			      PAGE_SIZE));
 
-	if (num_node_memblks >= NR_NODE_MEMBLKS)
-	{
+	if (!numa_memblks_available()) {
 		dprintk(XENLOG_WARNING,
-                "Too many numa entry, try bigger NR_NODE_MEMBLKS \n");
+			"Too many numa entries, try bigger NR_NODE_MEMBLKS!\n");
 		bad_srat();
 		return;
 	}
@@ -344,129 +252,10 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 		return;
 	}
 
-	/*
-	 * For the node that already has some memory blocks, we will
-	 * expand the node memory range temporarily to check memory
-	 * interleaves with other nodes. We will not use this node
-	 * temp memory range to check overlaps, because it will mask
-	 * the overlaps in same node.
-	 *
-	 * Node with 0 bytes memory doesn't need this expandsion.
-	 */
-	nd_start = start;
-	nd_end = end;
-	nd = &nodes[node];
-	if (nd->start != nd->end) {
-		if (nd_start > nd->start)
-			nd_start = nd->start;
-
-		if (nd_end < nd->end)
-			nd_end = nd->end;
-	}
-
-	/* It is fine to add this area to the nodes data it will be used later*/
-	switch (conflicting_memblks(node, start, end, nd_start, nd_end, &i)) {
-	case OVERLAP:
-		if (memblk_nodeid[i] == node) {
-			bool mismatch = !(ma->flags &
-					  ACPI_SRAT_MEM_HOT_PLUGGABLE) !=
-			                !test_bit(i, memblk_hotplug);
-
-			printk("%sSRAT: PXM %u [%"PRIpaddr", %"PRIpaddr"] overlaps with itself [%"PRIpaddr", %"PRIpaddr"]\n",
-			       mismatch ? KERN_ERR : KERN_WARNING, pxm, start,
-			       end - 1, node_memblk_range[i].start,
-			       node_memblk_range[i].end - 1);
-			if (mismatch) {
-				bad_srat();
-				return;
-			}
-			break;
-		}
-
-		printk(KERN_ERR
-		       "SRAT: PXM %u [%"PRIpaddr", %"PRIpaddr"] overlaps with PXM %u [%"PRIpaddr", %"PRIpaddr"]\n",
-		       pxm, start, end - 1, node_to_pxm(memblk_nodeid[i]),
-		       node_memblk_range[i].start,
-		       node_memblk_range[i].end - 1);
+	ret = numa_update_node_memblks(node, pxm, ma->base_address, ma->length,
+				       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE);
+	if (ret)
 		bad_srat();
-		return;
-
-	case INTERLEAVE:
-		printk(KERN_ERR
-		       "SRAT: PXM %u: [%"PRIpaddr", %"PRIpaddr"] interleaves with PXM %u memblk [%"PRIpaddr", %"PRIpaddr"]\n",
-		       pxm, nd_start, nd_end - 1, node_to_pxm(memblk_nodeid[i]),
-		       node_memblk_range[i].start, node_memblk_range[i].end - 1);
-		bad_srat();
-		return;
-
-	case NO_CONFLICT:
-		break;
-	}
-
-	if (!(ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE)) {
-		node_set(node, memory_nodes_parsed);
-		nd->start = nd_start;
-		nd->end = nd_end;
-	}
-
-	printk(KERN_INFO "SRAT: Node %u PXM %u [%"PRIpaddr", %"PRIpaddr"]%s\n",
-	       node, pxm, start, end - 1,
-	       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
-
-	node_memblk_range[num_node_memblks].start = start;
-	node_memblk_range[num_node_memblks].end = end;
-	memblk_nodeid[num_node_memblks] = node;
-	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
-		__set_bit(num_node_memblks, memblk_hotplug);
-		mem_hotplug_update_boundary(end);
-	}
-	num_node_memblks++;
-}
-
-/* Sanity check to catch more bad SRATs (they are amazingly common).
-   Make sure the PXMs cover all memory. */
-static int __init nodes_cover_memory(void)
-{
-	unsigned int i;
-
-	for (i = 0; ; i++) {
-		int j, found;
-		paddr_t start, end;
-
-		/* Try to loop memory map from index 0 to end. */
-		found = arch_get_memory_map(i, &start, &end);
-
-		/* Index relate entry is not RAM, skip it. */
-		if (found == -EINVAL)
-			continue;
-
-		/* Reach the end of arch's memory map */
-		if (found == -ENODEV)
-			break;
-
-		do {
-			found = 0;
-			for_each_node_mask(j, memory_nodes_parsed)
-				if (start < nodes[j].end
-				    && end > nodes[j].start) {
-					if (start >= nodes[j].start) {
-						start = nodes[j].end;
-						found = 1;
-					}
-					if (end <= nodes[j].end) {
-						end = nodes[j].start;
-						found = 1;
-					}
-				}
-		} while (found && start < end);
-
-		if (start < end) {
-			printk(KERN_ERR "NUMA: No node for memory map range: "
-				"[%"PRIpaddr", %"PRIpaddr"]\n", start, end - 1);
-			return 0;
-		}
-	}
-	return 1;
 }
 
 void __init acpi_numa_arch_fixup(void) {}
@@ -522,58 +311,7 @@ void __init srat_parse_regions(paddr_t addr)
 	pfn_pdx_hole_setup(mask >> PAGE_SHIFT);
 }
 
-/* Use the information discovered above to actually set up the nodes. */
-int __init acpi_scan_nodes(paddr_t start, paddr_t end)
-{
-	int i;
-	nodemask_t all_nodes_parsed;
-
-	/* First clean up the node list */
-	for (i = 0; i < MAX_NUMNODES; i++)
-		cutoff_node(i, start, end);
-
-	/* Only when numa_on with good firmware, we can do numa scan nodes. */
-	if (!numa_enabled_with_firmware())
-		return -1;
-
-	if (!nodes_cover_memory()) {
-		bad_srat();
-		return -1;
-	}
-
-	memnode_shift = compute_hash_shift(node_memblk_range, num_node_memblks,
-				memblk_nodeid);
-
-	if (memnode_shift < 0) {
-		printk(KERN_ERR
-		     "SRAT: No NUMA node hash function found. Contact maintainer\n");
-		bad_srat();
-		return -1;
-	}
-
-	nodes_or(all_nodes_parsed, memory_nodes_parsed, processor_nodes_parsed);
-
-	/* Finally register nodes */
-	for_each_node_mask(i, all_nodes_parsed)
-	{
-		uint64_t size = nodes[i].end - nodes[i].start;
-
-		if ( size == 0 )
-			printk(KERN_INFO "SRAT: node %u has no memory\n", i);
-
-		setup_node_bootmem(i, nodes[i].start, nodes[i].end);
-	}
-	for (i = 0; i < nr_cpu_ids; i++) {
-		if (cpu_to_node[i] == NUMA_NO_NODE)
-			continue;
-		if (!nodemask_test(cpu_to_node[i], &processor_nodes_parsed))
-			numa_set_node(i, NUMA_NO_NODE);
-	}
-	numa_init_array();
-	return 0;
-}
-
-static unsigned node_to_pxm(nodeid_t n)
+unsigned node_to_pxm(nodeid_t n)
 {
 	unsigned i;
 
diff --git a/xen/common/numa.c b/xen/common/numa.c
index bc6a2ded14..e3b66c54b5 100644
--- a/xen/common/numa.c
+++ b/xen/common/numa.c
@@ -14,6 +14,21 @@
 #include <xen/softirq.h>
 #include <asm/acpi.h>
 
+static nodemask_t __initdata processor_nodes_parsed;
+static nodemask_t __initdata memory_nodes_parsed;
+static struct node __initdata nodes[MAX_NUMNODES];
+
+static int num_node_memblks;
+static struct node node_memblk_range[NR_NODE_MEMBLKS];
+static nodeid_t memblk_nodeid[NR_NODE_MEMBLKS];
+static __initdata DECLARE_BITMAP(memblk_hotplug, NR_NODE_MEMBLKS);
+
+enum conflicts {
+    NO_CONFLICT,
+    OVERLAP,
+    INTERLEAVE,
+};
+
 struct node_data node_data[MAX_NUMNODES];
 
 /* Mapping from pdx to node id */
@@ -32,6 +47,298 @@ nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
 
 enum numa_mode numa_status;
 
+void __init numa_set_processor_nodes_parsed(nodeid_t node)
+{
+    node_set(node, processor_nodes_parsed);
+}
+
+int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node)
+{
+    int i;
+
+    for ( i = 0; i < num_node_memblks; i++ )
+    {
+        struct node *nd = &node_memblk_range[i];
+
+        if ( nd->start <= start && nd->end >= end &&
+             memblk_nodeid[i] == node )
+            return 1;
+    }
+
+    return 0;
+}
+
+static
+enum conflicts __init conflicting_memblks(nodeid_t nid, paddr_t start,
+                                          paddr_t end, paddr_t nd_start,
+                                          paddr_t nd_end, unsigned int *mblkid)
+{
+    unsigned int i;
+
+    /*
+     * Scan all recorded nodes' memory blocks to check conflicts:
+     * Overlap or interleave.
+     */
+    for ( i = 0; i < num_node_memblks; i++ )
+    {
+        struct node *nd = &node_memblk_range[i];
+
+        *mblkid = i;
+
+        /* Skip 0 bytes node memory block. */
+        if ( nd->start == nd->end )
+            continue;
+        /*
+         * Use the memblk range to check for memblk overlaps, including the
+         * self-overlap case. As nd's range is non-empty, the special
+         * case "nd->end == end && nd->start == start" is also covered.
+         */
+        if ( nd->end > start && nd->start < end )
+            return OVERLAP;
+
+        /*
+         * Use the node memory range to check whether the new range
+         * contains memory from other nodes - the interleave check. We
+         * only need to check the fully-contained case, because overlaps
+         * have been checked above.
+         */
+        if ( nid != memblk_nodeid[i] &&
+             nd->start >= nd_start && nd->end <= nd_end )
+            return INTERLEAVE;
+    }
+
+    return NO_CONFLICT;
+}
+
+static void __init cutoff_node(int i, paddr_t start, paddr_t end)
+{
+    struct node *nd = &nodes[i];
+
+    if ( nd->start < start )
+    {
+        nd->start = start;
+        if ( nd->end < nd->start )
+            nd->start = nd->end;
+    }
+
+    if ( nd->end > end )
+    {
+        nd->end = end;
+        if ( nd->start > nd->end )
+            nd->start = nd->end;
+    }
+}
+
+bool __init numa_memblks_available(void)
+{
+    return num_node_memblks < NR_NODE_MEMBLKS;
+}
+
+/*
+ * This function is called by the NUMA memory affinity initialization code
+ * to update a NUMA node's memory range. We assume all memory regions
+ * belonging to a single node are in one chunk. Holes (or MMIO
+ * ranges) between them will be included in the node.
+ *
+ * So in numa_update_node_memblks, if there are multiple banks for a
+ * node, start and end are stretched to cover the holes between them, and
+ * this works as long as memory banks of different NUMA nodes don't interleave.
+ */
+int __init numa_update_node_memblks(nodeid_t node, unsigned pxm,
+                                    paddr_t start, paddr_t size,
+                                    bool hotplug)
+{
+    unsigned int i;
+    paddr_t end = start + size;
+    paddr_t nd_start = start;
+    paddr_t nd_end = end;
+    struct node *nd = &nodes[node];
+
+    /*
+     * For a node that already has some memory blocks, we temporarily
+     * expand the node's memory range to check for memory interleaving
+     * with other nodes. We do not use this temporary range to check
+     * overlaps, because that would mask overlaps within the same node.
+     *
+     * A node with 0 bytes of memory doesn't need this expansion.
+     */
+    if ( nd->start != nd->end )
+    {
+        if ( nd_start > nd->start )
+            nd_start = nd->start;
+
+        if ( nd_end < nd->end )
+            nd_end = nd->end;
+    }
+
+    /* It is fine to add this area to the nodes' data; it will be used later. */
+    switch ( conflicting_memblks(node, start, end, nd_start, nd_end, &i) )
+    {
+    case OVERLAP:
+        if ( memblk_nodeid[i] == node )
+        {
+            bool mismatch = !(hotplug) != !test_bit(i, memblk_hotplug);
+
+            printk("%sNUMA: PXM %u [%"PRIpaddr", %"PRIpaddr"] overlaps with itself [%"PRIpaddr", %"PRIpaddr"]\n",
+                   mismatch ? KERN_ERR : KERN_WARNING, pxm, start, end - 1,
+                   node_memblk_range[i].start, node_memblk_range[i].end - 1);
+            if ( mismatch )
+                return -EINVAL;
+            break;
+        }
+
+        printk(KERN_ERR
+               "NUMA: PXM %u [%"PRIpaddr", %"PRIpaddr"] overlaps with PXM %u [%"PRIpaddr", %"PRIpaddr"]\n",
+               pxm, start, end - 1, node_to_pxm(memblk_nodeid[i]),
+               node_memblk_range[i].start, node_memblk_range[i].end - 1);
+        return -EINVAL;
+
+
+    case INTERLEAVE:
+        printk(KERN_ERR
+               "NUMA: PXM %u: [%"PRIpaddr", %"PRIpaddr"] interleaves with PXM %u memblk [%"PRIpaddr", %"PRIpaddr"]\n",
+               pxm, nd_start, nd_end - 1, node_to_pxm(memblk_nodeid[i]),
+               node_memblk_range[i].start, node_memblk_range[i].end - 1);
+        return -EINVAL;
+
+    case NO_CONFLICT:
+        break;
+    }
+
+    if ( !hotplug )
+    {
+        node_set(node, memory_nodes_parsed);
+        nd->start = nd_start;
+        nd->end = nd_end;
+    }
+
+    printk(KERN_INFO "NUMA: Node %u PXM %u [%"PRIpaddr", %"PRIpaddr"]%s\n",
+           node, pxm, start, end - 1,
+           hotplug ? " (hotplug)" : "");
+
+    node_memblk_range[num_node_memblks].start = start;
+    node_memblk_range[num_node_memblks].end = end;
+    memblk_nodeid[num_node_memblks] = node;
+    if ( hotplug )
+    {
+        __set_bit(num_node_memblks, memblk_hotplug);
+        mem_hotplug_update_boundary(end);
+    }
+    num_node_memblks++;
+
+    return 0;
+}
+
+/*
+ * Sanity check to catch more bad SRATs (they are amazingly common).
+ * Make sure the PXMs cover all memory.
+ */
+static int __init nodes_cover_memory(void)
+{
+    unsigned int i;
+
+    for ( i = 0; ; i++ )
+    {
+        int j, found;
+        paddr_t start, end;
+
+        /* Walk the memory map from index 0 to the end. */
+        found = arch_get_memory_map(i, &start, &end);
+
+        /* The entry at this index is not RAM, skip it. */
+        if ( found == -EINVAL )
+            continue;
+
+        /* Reached the end of the arch's memory map. */
+        if ( found == -ENODEV )
+            break;
+
+        do {
+            found = 0;
+            for_each_node_mask( j, memory_nodes_parsed )
+                if ( start < nodes[j].end
+                    && end > nodes[j].start )
+                {
+                    if ( start >= nodes[j].start )
+                    {
+                        start = nodes[j].end;
+                        found = 1;
+                    }
+
+                    if ( end <= nodes[j].end )
+                    {
+                        end = nodes[j].start;
+                        found = 1;
+                    }
+                }
+        } while ( found && start < end );
+
+        if ( start < end )
+        {
+            printk(KERN_ERR "NUMA: No node for memory map range: "
+                   "[%"PRIpaddr", %"PRIpaddr"]\n", start, end - 1);
+            return 0;
+        }
+    }
+    return 1;
+}
+
+/* Use the information discovered above to actually set up the nodes. */
+int __init acpi_scan_nodes(paddr_t start, paddr_t end)
+{
+    int i;
+    nodemask_t all_nodes_parsed;
+
+    /* First clean up the node list */
+    for ( i = 0; i < MAX_NUMNODES; i++ )
+        cutoff_node(i, start, end);
+
+    /* We can only scan NUMA nodes when NUMA is on and the firmware is good. */
+    if ( !numa_enabled_with_firmware() )
+        return -1;
+
+    if ( !nodes_cover_memory() )
+    {
+        bad_srat();
+        return -1;
+    }
+
+    memnode_shift = compute_hash_shift(node_memblk_range, num_node_memblks,
+                                       memblk_nodeid);
+
+    if ( memnode_shift < 0 )
+    {
+        printk(KERN_ERR
+               "SRAT: No NUMA node hash function found. Contact maintainer\n");
+        bad_srat();
+        return -1;
+    }
+
+    nodes_or(all_nodes_parsed, memory_nodes_parsed, processor_nodes_parsed);
+
+    /* Finally register nodes */
+    for_each_node_mask( i, all_nodes_parsed )
+    {
+        paddr_t size = nodes[i].end - nodes[i].start;
+
+        if ( size == 0 )
+            printk(KERN_INFO "SRAT: node %u has no memory\n", i);
+
+        setup_node_bootmem(i, nodes[i].start, nodes[i].end);
+    }
+
+    for ( i = 0; i < nr_cpu_ids; i++ )
+    {
+        if ( cpu_to_node[i] == NUMA_NO_NODE )
+            continue;
+        if ( !nodemask_test(cpu_to_node[i], &processor_nodes_parsed) )
+            numa_set_node(i, NUMA_NO_NODE);
+    }
+    numa_init_array();
+    return 0;
+}
+
 /*
  * Given a shift value, try to populate memnodemap[]
  * Returns :
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index 6d02204991..564add430c 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -52,6 +52,7 @@ extern void numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn);
 extern bool numa_enabled_with_firmware(void);
 extern enum numa_mode numa_status;
 
+extern void bad_srat(void);
 extern void numa_set_node(int cpu, nodeid_t node);
 extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end);
 
@@ -90,6 +91,12 @@ static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
 
 extern int arch_get_memory_map(unsigned int idx,
                                paddr_t *start, paddr_t *end);
+extern int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node);
+extern bool numa_memblks_available(void);
+extern int numa_update_node_memblks(nodeid_t node, unsigned pxm,
+                                    paddr_t start, paddr_t size,
+                                    bool hotplug);
+extern void numa_set_processor_nodes_parsed(nodeid_t node);
 
 #endif
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v2 7/9] xen/x86: rename bad_srat to numa_bad
  2022-07-08 14:54 [PATCH v2 0/9] Device tree based NUMA support for Arm - Part#2 Wei Chen
                   ` (5 preceding siblings ...)
  2022-07-08 14:54 ` [PATCH v2 6/9] xen/x86: move NUMA scan nodes codes from x86 to common Wei Chen
@ 2022-07-08 14:54 ` Wei Chen
  2022-07-12 14:25   ` Jan Beulich
  2022-07-08 14:54 ` [PATCH v2 8/9] xen: rename acpi_scan_nodes to numa_scan_nodes Wei Chen
  2022-07-08 14:54 ` [PATCH v2 9/9] xen: introduce a Kconfig option to configure NUMA nodes number Wei Chen
  8 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-08 14:54 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

When the NUMA initialization code fails while scanning the SRAT,
it calls bad_srat to disable NUMA and clear related data. But this
name is ACPI specific. We have moved the generically usable NUMA
code to common code, and bad_srat came along with it. It is not
reasonable for other NUMA implementations to have to implement a
fallback function named bad_srat. So in this patch, we rename
bad_srat to numa_bad.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
v1 -> v2:
1. New in v2.
---
 xen/arch/x86/srat.c    | 18 +++++++++---------
 xen/common/numa.c      |  4 ++--
 xen/include/xen/numa.h |  2 +-
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 9ae81afdff..4afb37bf9f 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -93,7 +93,7 @@ nodeid_t setup_node(unsigned pxm)
 	return node;
 }
 
-void __init bad_srat(void)
+void __init numa_bad(void)
 {
 	int i;
 	printk(KERN_ERR "SRAT: SRAT not used.\n");
@@ -153,7 +153,7 @@ acpi_numa_x2apic_affinity_init(const struct acpi_srat_x2apic_cpu_affinity *pa)
 	if (srat_disabled())
 		return;
 	if (pa->header.length < sizeof(struct acpi_srat_x2apic_cpu_affinity)) {
-		bad_srat();
+		numa_bad();
 		return;
 	}
 	if (!(pa->flags & ACPI_SRAT_CPU_ENABLED))
@@ -166,7 +166,7 @@ acpi_numa_x2apic_affinity_init(const struct acpi_srat_x2apic_cpu_affinity *pa)
 	pxm = pa->proximity_domain;
 	node = setup_node(pxm);
 	if (node == NUMA_NO_NODE) {
-		bad_srat();
+		numa_bad();
 		return;
 	}
 
@@ -189,7 +189,7 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
 	if (srat_disabled())
 		return;
 	if (pa->header.length != sizeof(struct acpi_srat_cpu_affinity)) {
-		bad_srat();
+		numa_bad();
 		return;
 	}
 	if (!(pa->flags & ACPI_SRAT_CPU_ENABLED))
@@ -202,7 +202,7 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
 	}
 	node = setup_node(pxm);
 	if (node == NUMA_NO_NODE) {
-		bad_srat();
+		numa_bad();
 		return;
 	}
 	apicid_to_node[pa->apic_id] = node;
@@ -225,7 +225,7 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 	if (srat_disabled())
 		return;
 	if (ma->header.length != sizeof(struct acpi_srat_mem_affinity)) {
-		bad_srat();
+		numa_bad();
 		return;
 	}
 	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
@@ -239,7 +239,7 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 	if (!numa_memblks_available()) {
 		dprintk(XENLOG_WARNING,
 			"Too many numa entries, try bigger NR_NODE_MEMBLKS!\n");
-		bad_srat();
+		numa_bad();
 		return;
 	}
 
@@ -248,14 +248,14 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 		pxm &= 0xff;
 	node = setup_node(pxm);
 	if (node == NUMA_NO_NODE) {
-		bad_srat();
+		numa_bad();
 		return;
 	}
 
 	ret = numa_update_node_memblks(node, pxm, ma->base_address, ma->length,
 				       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE);
 	if (ret)
-		bad_srat();
+		numa_bad();
 }
 
 void __init acpi_numa_arch_fixup(void) {}
diff --git a/xen/common/numa.c b/xen/common/numa.c
index e3b66c54b5..5ab061e991 100644
--- a/xen/common/numa.c
+++ b/xen/common/numa.c
@@ -300,7 +300,7 @@ int __init acpi_scan_nodes(paddr_t start, paddr_t end)
 
     if ( !nodes_cover_memory() )
     {
-        bad_srat();
+        numa_bad();
         return -1;
     }
 
@@ -311,7 +311,7 @@ int __init acpi_scan_nodes(paddr_t start, paddr_t end)
     {
         printk(KERN_ERR
                "SRAT: No NUMA node hash function found. Contact maintainer\n");
-        bad_srat();
+        numa_bad();
         return -1;
     }
 
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index 564add430c..4c4632ec27 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -52,7 +52,7 @@ extern void numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn);
 extern bool numa_enabled_with_firmware(void);
 extern enum numa_mode numa_status;
 
-extern void bad_srat(void);
+extern void numa_bad(void);
 extern void numa_set_node(int cpu, nodeid_t node);
 extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end);
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v2 8/9] xen: rename acpi_scan_nodes to numa_scan_nodes
  2022-07-08 14:54 [PATCH v2 0/9] Device tree based NUMA support for Arm - Part#2 Wei Chen
                   ` (6 preceding siblings ...)
  2022-07-08 14:54 ` [PATCH v2 7/9] xen/x86: rename bad_srat to numa_bad Wei Chen
@ 2022-07-08 14:54 ` Wei Chen
  2022-07-12 14:27   ` Jan Beulich
  2022-07-08 14:54 ` [PATCH v2 9/9] xen: introduce a Kconfig option to configure NUMA nodes number Wei Chen
  8 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-08 14:54 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

We have moved acpi_scan_nodes from x86 to common code. Because of
our previous work, this function no longer has many ACPI
characteristics, except for some "SRAT" words in print messages.
So we rename acpi_scan_nodes to the more generic name
numa_scan_nodes, and replace the "SRAT" words in those messages.

After doing this, it doesn't make sense to use CONFIG_ACPI_NUMA
to gate numa_scan_nodes in numa_initmem_init, as CONFIG_ACPI_NUMA
is selected by CONFIG_NUMA for x86. So, we replace
CONFIG_ACPI_NUMA with CONFIG_NUMA.

We take this opportunity to make this function static, since
it currently has no external callers.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
v1 -> v2:
1. Merge two patches into this patch:
   1. replace CONFIG_ACPI_NUMA by CONFIG_NUMA.
   2. replace "SRAT" texts.
2. Turn numa_scan_nodes to static.
---
 xen/arch/x86/include/asm/acpi.h |  1 -
 xen/common/numa.c               | 10 +++++-----
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/include/asm/acpi.h b/xen/arch/x86/include/asm/acpi.h
index ab0d56dd70..f6ea3f1a9a 100644
--- a/xen/arch/x86/include/asm/acpi.h
+++ b/xen/arch/x86/include/asm/acpi.h
@@ -101,7 +101,6 @@ extern unsigned long acpi_wakeup_address;
 
 #define ARCH_HAS_POWER_INIT	1
 
-extern int acpi_scan_nodes(u64 start, u64 end);
 #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
 
 extern struct acpi_sleep_info acpi_sinfo;
diff --git a/xen/common/numa.c b/xen/common/numa.c
index 5ab061e991..0f62638e4c 100644
--- a/xen/common/numa.c
+++ b/xen/common/numa.c
@@ -285,7 +285,7 @@ static int __init nodes_cover_memory(void)
 }
 
 /* Use the information discovered above to actually set up the nodes. */
-int __init acpi_scan_nodes(paddr_t start, paddr_t end)
+static int __init numa_scan_nodes(paddr_t start, paddr_t end)
 {
     int i;
     nodemask_t all_nodes_parsed;
@@ -310,7 +310,7 @@ int __init acpi_scan_nodes(paddr_t start, paddr_t end)
     if ( memnode_shift < 0 )
     {
         printk(KERN_ERR
-               "SRAT: No NUMA node hash function found. Contact maintainer\n");
+               "NUMA: No NUMA node hash function found. Contact maintainer\n");
         numa_bad();
         return -1;
     }
@@ -323,7 +323,7 @@ int __init acpi_scan_nodes(paddr_t start, paddr_t end)
         paddr_t size = nodes[i].end - nodes[i].start;
 
         if ( size == 0 )
-            printk(KERN_INFO "SRAT: node %u has no memory\n", i);
+            printk(KERN_INFO "NUMA: node %u has no memory\n", i);
 
         setup_node_bootmem(i, nodes[i].start, nodes[i].end);
     }
@@ -540,8 +540,8 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
         return;
 #endif
 
-#ifdef CONFIG_ACPI_NUMA
-    if ( numa_status != numa_off && !acpi_scan_nodes(start, end) )
+#ifdef CONFIG_NUMA
+    if ( numa_status != numa_off && !numa_scan_nodes(start, end) )
         return;
 #endif
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v2 9/9] xen: introduce a Kconfig option to configure NUMA nodes number
  2022-07-08 14:54 [PATCH v2 0/9] Device tree based NUMA support for Arm - Part#2 Wei Chen
                   ` (7 preceding siblings ...)
  2022-07-08 14:54 ` [PATCH v2 8/9] xen: rename acpi_scan_nodes to numa_scan_nodes Wei Chen
@ 2022-07-08 14:54 ` Wei Chen
  2022-07-12 14:34   ` Jan Beulich
  8 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-08 14:54 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Andrew Cooper, George Dunlap, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu, Roger Pau Monné

The number of NUMA nodes is currently a hardcoded configuration,
which is difficult for an administrator to change without
modifying the code.

So in this patch, we introduce a new Kconfig option that lets
administrators change the number of NUMA nodes conveniently.
Considering that not all architectures support NUMA, this
Kconfig option is only visible on NUMA-enabled architectures.
Architectures without NUMA support can still use 1 as
MAX_NUMNODES.

As NODES_SHIFT is currently unused, we're taking this
opportunity to remove it.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
v1 -> v2:
1. Add NODES_SHIFT remove message in commit log
2. Change NR_NUMA_NODES upper bound from 4095 to 255.
---
 xen/arch/Kconfig                | 11 +++++++++++
 xen/arch/x86/include/asm/numa.h |  2 --
 xen/include/xen/numa.h          | 10 +++++-----
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/xen/arch/Kconfig b/xen/arch/Kconfig
index f16eb0df43..4fc16f83ac 100644
--- a/xen/arch/Kconfig
+++ b/xen/arch/Kconfig
@@ -17,3 +17,14 @@ config NR_CPUS
 	  For CPU cores which support Simultaneous Multi-Threading or similar
 	  technologies, this the number of logical threads which Xen will
 	  support.
+
+config NR_NUMA_NODES
+	int "Maximum number of NUMA nodes supported"
+	range 1 255
+	default "64"
+	depends on NUMA
+	help
+	  Controls the build-time size of various arrays and bitmaps
+	  associated with multiple-node management. It is the upper bound of
+	  the number of NUMA nodes that the scheduler, memory allocation and
+	  other NUMA-aware components can handle.
diff --git a/xen/arch/x86/include/asm/numa.h b/xen/arch/x86/include/asm/numa.h
index eeb431cdb7..db76281c08 100644
--- a/xen/arch/x86/include/asm/numa.h
+++ b/xen/arch/x86/include/asm/numa.h
@@ -3,8 +3,6 @@
 
 #include <xen/cpumask.h>
 
-#define NODES_SHIFT 6
-
 typedef u8 nodeid_t;
 
 extern int srat_rev;
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index 4c4632ec27..cac92d2688 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -3,14 +3,14 @@
 
 #include <asm/numa.h>
 
-#ifndef NODES_SHIFT
-#define NODES_SHIFT     0
-#endif
-
 #define NUMA_NO_NODE     0xFF
 #define NUMA_NO_DISTANCE 0xFF
 
-#define MAX_NUMNODES    (1 << NODES_SHIFT)
+#ifdef CONFIG_NR_NUMA_NODES
+#define MAX_NUMNODES CONFIG_NR_NUMA_NODES
+#else
+#define MAX_NUMNODES 1
+#endif
 
 #define vcpu_to_node(v) (cpu_to_node((v)->processor))
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 4/9] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid
  2022-07-08 14:54 ` [PATCH v2 4/9] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid Wei Chen
@ 2022-07-11  6:32   ` Jan Beulich
  2022-07-12  7:20     ` Wei Chen
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2022-07-11  6:32 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Jiamei Xie, xen-devel

On 08.07.2022 16:54, Wei Chen wrote:
> VIRTUAL_BUG_ON is an empty macro used in phys_to_nid. This
> results in two lines of error-checking code in phys_to_nid
> that do not actually work, and enabling them exposes two
> compilation errors:
> 1. error: "MAX_NUMNODES" undeclared (first use in this function).
>    This is because, in the common header file, "MAX_NUMNODES" is
>    defined after the common header file includes the arch header
>    file, where phys_to_nid has already tried to use "MAX_NUMNODES".
>    This error was resolved once we moved phys_to_nid from the
>    x86 arch header file to the common header file.
> 2. error: wrong type argument to unary exclamation mark.
>    This is because the error-checking code contains !node_data[nid],
>    but node_data is an array of structures, not a pointer.
> 
> So, in this patch, we use ASSERT instead of VIRTUAL_BUG_ON to
> enable the two lines of error-checking code. And fix the left
> compilation errors by replacing !node_data[nid] to
> !node_data[nid].node_spanned_pages. Although NUMA allows one node
> can only have CPUs but without any memory. And node with 0 bytes
> of memory might have an entry in memnodemap[] theoretically. But
> that doesn't mean phys_to_nid can find any valid address from a
> node with 0 bytes memory.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> Tested-by: Jiamei Xie <jiamei.xie@arm.com>
> ---
> v1 -> v2:
> 1. Move from part#1 to part#2. (Comment from NUMA part1 series)
> 2. Refine the justification of using
>    !node_data[nid].node_spanned_pages. (From NUMA part1 series)
> 3. Use ASSERT to replace VIRTUAL_BUG_ON in phys_to_nid.
> 4. Adjust the conditional express for ASSERT.
> 5. Move MAX_NUMNODES from xen/numa.h to asm/numa.h for x86.
> 6. Use conditional macro to gate MAX_NUMNODES for other architectures.

The change looks okay, but can you please clarify what these last two
points describe? They don't seem to match any change ...

> --- a/xen/include/xen/numa.h
> +++ b/xen/include/xen/numa.h
> @@ -36,8 +36,6 @@ struct node {
>  extern int compute_hash_shift(struct node *nodes, int numnodes,
>                                nodeid_t *nodeids);
>  
> -#define VIRTUAL_BUG_ON(x)
> -
>  /* Enumerations for NUMA status. */
>  enum numa_mode {
>      numa_on = 0,
> @@ -77,9 +75,9 @@ extern struct node_data node_data[];
>  static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
>  {
>      nodeid_t nid;
> -    VIRTUAL_BUG_ON((paddr_to_pdx(addr) >> memnode_shift) >= memnodemapsize);
> +    ASSERT((paddr_to_pdx(addr) >> memnode_shift) < memnodemapsize);
>      nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift];
> -    VIRTUAL_BUG_ON(nid >= MAX_NUMNODES || !node_data[nid]);
> +    ASSERT(nid < MAX_NUMNODES && node_data[nid].node_spanned_pages);
>      return nid;
>  }
>  

... in the entire patch.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v2 4/9] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid
  2022-07-11  6:32   ` Jan Beulich
@ 2022-07-12  7:20     ` Wei Chen
  2022-07-12  7:36       ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-12  7:20 UTC (permalink / raw)
  To: Jan Beulich
  Cc: nd, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Jiamei Xie, xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: July 11, 2022 14:32
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; George
> Dunlap <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>; Stefano
> Stabellini <sstabellini@kernel.org>; Wei Liu <wl@xen.org>; Jiamei Xie
> <Jiamei.Xie@arm.com>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v2 4/9] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON
> for phys_to_nid
> 
> On 08.07.2022 16:54, Wei Chen wrote:
> > VIRTUAL_BUG_ON is an empty macro used in phys_to_nid. This
> > results in two lines of error-checking code in phys_to_nid
> > that is not actually working and causing two compilation
> > errors:
> > 1. error: "MAX_NUMNODES" undeclared (first use in this function).
> >    This is because in the common header file, "MAX_NUMNODES" is
> >    defined after the common header file includes the ARCH header
> >    file, where phys_to_nid has attempted to use "MAX_NUMNODES".
> >    This error was resolved after we moved the phys_to_nid from
> >    x86 ARCH header file to common header file.
> > 2. error: wrong type argument to unary exclamation mark.
> >    This is because, the error-checking code contains !node_data[nid].
> >    But node_data is a data structure variable, it's not a pointer.
> >
> > So, in this patch, we use ASSERT instead of VIRTUAL_BUG_ON to
> > enable the two lines of error-checking code. And fix the left
> > compilation errors by replacing !node_data[nid] to
> > !node_data[nid].node_spanned_pages. Although NUMA allows one node
> > can only have CPUs but without any memory. And node with 0 bytes
> > of memory might have an entry in memnodemap[] theoretically. But
> > that doesn't mean phys_to_nid can find any valid address from a
> > node with 0 bytes memory.
> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > Tested-by: Jiamei Xie <jiamei.xie@arm.com>
> > ---
> > v1 -> v2:
> > 1. Move from part#1 to part#2. (Comment from NUMA part1 series)
> > 2. Refine the justification of using
> >    !node_data[nid].node_spanned_pages. (From NUMA part1 series)
> > 3. Use ASSERT to replace VIRTUAL_BUG_ON in phys_to_nid.
> > 4. Adjust the conditional express for ASSERT.
> > 5. Move MAX_NUMNODES from xen/numa.h to asm/numa.h for x86.
> > 6. Use conditional macro to gate MAX_NUMNODES for other architectures.
> 
> The change looks okay, but can you please clarify what these last two
> points describe? They don't seem to match any change ...
> 

I moved this patch from Part#1 to Part#2, but forgot to remove these
two change-log entries. We made those changes in Part#1, but after
moving the patch to Part#2 they are no longer applicable. I will remove
these two lines from the change log. Sorry for the confusion!

Cheers,
Wei Chen

> > --- a/xen/include/xen/numa.h
> > +++ b/xen/include/xen/numa.h
> > @@ -36,8 +36,6 @@ struct node {
> >  extern int compute_hash_shift(struct node *nodes, int numnodes,
> >                                nodeid_t *nodeids);
> >
> > -#define VIRTUAL_BUG_ON(x)
> > -
> >  /* Enumerations for NUMA status. */
> >  enum numa_mode {
> >      numa_on = 0,
> > @@ -77,9 +75,9 @@ extern struct node_data node_data[];
> >  static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
> >  {
> >      nodeid_t nid;
> > -    VIRTUAL_BUG_ON((paddr_to_pdx(addr) >> memnode_shift) >=
> memnodemapsize);
> > +    ASSERT((paddr_to_pdx(addr) >> memnode_shift) < memnodemapsize);
> >      nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift];
> > -    VIRTUAL_BUG_ON(nid >= MAX_NUMNODES || !node_data[nid]);
> > +    ASSERT(nid < MAX_NUMNODES && node_data[nid].node_spanned_pages);
> >      return nid;
> >  }
> >
> 
> ... in the entire patch.
> 
> Jan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 4/9] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid
  2022-07-12  7:20     ` Wei Chen
@ 2022-07-12  7:36       ` Jan Beulich
  0 siblings, 0 replies; 50+ messages in thread
From: Jan Beulich @ 2022-07-12  7:36 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Jiamei Xie, xen-devel

On 12.07.2022 09:20, Wei Chen wrote:
> Hi Jan,
> 
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 2022年7月11日 14:32
>> To: Wei Chen <Wei.Chen@arm.com>
>> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; George
>> Dunlap <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>; Stefano
>> Stabellini <sstabellini@kernel.org>; Wei Liu <wl@xen.org>; Jiamei Xie
>> <Jiamei.Xie@arm.com>; xen-devel@lists.xenproject.org
>> Subject: Re: [PATCH v2 4/9] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON
>> for phys_to_nid
>>
>> On 08.07.2022 16:54, Wei Chen wrote:
>>> VIRTUAL_BUG_ON is an empty macro used in phys_to_nid. This
>>> results in two lines of error-checking code in phys_to_nid
>>> that is not actually working and causing two compilation
>>> errors:
>>> 1. error: "MAX_NUMNODES" undeclared (first use in this function).
>>>    This is because in the common header file, "MAX_NUMNODES" is
>>>    defined after the common header file includes the ARCH header
>>>    file, where phys_to_nid has attempted to use "MAX_NUMNODES".
>>>    This error was resolved after we moved the phys_to_nid from
>>>    x86 ARCH header file to common header file.
>>> 2. error: wrong type argument to unary exclamation mark.
>>>    This is because, the error-checking code contains !node_data[nid].
>>>    But node_data is a data structure variable, it's not a pointer.
>>>
>>> So, in this patch, we use ASSERT instead of VIRTUAL_BUG_ON to
>>> enable the two lines of error-checking code. And fix the left
>>> compilation errors by replacing !node_data[nid] to
>>> !node_data[nid].node_spanned_pages. Although NUMA allows one node
>>> can only have CPUs but without any memory. And node with 0 bytes
>>> of memory might have an entry in memnodemap[] theoretically. But
>>> that doesn't mean phys_to_nid can find any valid address from a
>>> node with 0 bytes memory.
>>>
>>> Signed-off-by: Wei Chen <wei.chen@arm.com>
>>> Tested-by: Jiamei Xie <jiamei.xie@arm.com>
>>> ---
>>> v1 -> v2:
>>> 1. Move from part#1 to part#2. (Comment from NUMA part1 series)
>>> 2. Refine the justification of using
>>>    !node_data[nid].node_spanned_pages. (From NUMA part1 series)
>>> 3. Use ASSERT to replace VIRTUAL_BUG_ON in phys_to_nid.
>>> 4. Adjust the conditional express for ASSERT.
>>> 5. Move MAX_NUMNODES from xen/numa.h to asm/numa.h for x86.
>>> 6. Use conditional macro to gate MAX_NUMNODES for other architectures.
>>
>> The change looks okay, but can you please clarify what these last two
>> points describe? They don't seem to match any change ...
>>
> 
> I moved this patch from Part#1 to Part#2, but forgot to remove these
> two change-log entries. We made those changes in Part#1, but after
> moving the patch to Part#2 they are no longer applicable. I will remove
> these two lines from the change log. Sorry for the confusion!

With this clarified
Acked-by: Jan Beulich <jbeulich@suse.com>

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 1/9] xen/x86: introduce a helper to update memory hotplug end
  2022-07-08 14:54 ` [PATCH v2 1/9] xen/x86: introduce a helper to update memory hotplug end Wei Chen
@ 2022-07-12 11:53   ` Jan Beulich
  2022-07-13  8:22     ` Wei Chen
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2022-07-12 11:53 UTC (permalink / raw)
  To: Wei Chen; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 08.07.2022 16:54, Wei Chen wrote:
> x86 provides a mem_hotplug variable to maintain the memory hotplug
> end address. We want to move some code from x86 to common so that
> it can be reused by other architectures, but not all architectures
> support memory hotplug. So in this patch, we introduce this helper
> to replace direct accesses to mem_hotplug in that code. This makes
> it possible to stub the helper out on architectures without memory
> hotplug support.

What remains unclear to me is why Arm can't simply gain the necessary
variable. Sooner or later you'll want to support hotplug there anyway,
I suppose. Mechanically the change is fine. Albeit maybe "top" instead
of "boundary", and maybe also pass "node" even if x86 doesn't need it?

Jan

> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
> v1 -> v2:
> 1. Refine the commit message.
> 2. Merge v1 patch#9,10 into one patch. Introduce the new functions
>    in the same patch that this patch will be used first time.
> 3. Fold if ( end > mem_hotplug ) to mem_hotplug_update_boundary,
>    in this case, we can drop mem_hotplug_boundary.
> ---
>  xen/arch/x86/include/asm/mm.h | 6 ++++++
>  xen/arch/x86/srat.c           | 3 +--
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> index 07b59c982b..b3dfbdb52b 100644
> --- a/xen/arch/x86/include/asm/mm.h
> +++ b/xen/arch/x86/include/asm/mm.h
> @@ -476,6 +476,12 @@ static inline int get_page_and_type(struct page_info *page,
>  
>  extern paddr_t mem_hotplug;
>  
> +static inline void mem_hotplug_update_boundary(paddr_t end)
> +{
> +    if ( end > mem_hotplug )
> +        mem_hotplug = end;
> +}
> +
>  /******************************************************************************
>   * With shadow pagetables, the different kinds of address start
>   * to get get confusing.
> diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> index b62a152911..f53431f5e8 100644
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -418,8 +418,7 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
>  	memblk_nodeid[num_node_memblks] = node;
>  	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
>  		__set_bit(num_node_memblks, memblk_hotplug);
> -		if (end > mem_hotplug)
> -			mem_hotplug = end;
> +		mem_hotplug_update_boundary(end);
>  	}
>  	num_node_memblks++;
>  }



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status
  2022-07-08 14:54 ` [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status Wei Chen
@ 2022-07-12 12:49   ` Jan Beulich
  2022-07-14  9:03     ` Wei Chen
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2022-07-12 12:49 UTC (permalink / raw)
  To: Wei Chen; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 08.07.2022 16:54, Wei Chen wrote:
> In current code, x86 is using two variables, numa_off and acpi_numa,
> to indicate the NUMA status. This is because NUMA is not coupled with
> ACPI, and ACPI still can work without NUMA on x86. With these two
> variables' combinations, x86 can have several NUMA status:
> NUMA swith on,
> NUMA swith off,
> NUMA swith on with NUMA emulation,
> NUMA swith on with no-ACPI,
> NUMA swith on with ACPI.

Hmm, with both this and the actual change I'm not able to convince
myself that you've expressed the prior combinations correctly. May
I suggest that you make table representing the 6 (I think)
combinations of original states with their mapping to the new
enumerators? (It doesn't need to be 6 different enumerators, but
all 6 existing states need a [proper] representation in the new
model.)

As an aside - I think you mean "switched" in all five of these
lines.

> --- a/xen/arch/x86/include/asm/numa.h
> +++ b/xen/arch/x86/include/asm/numa.h
> @@ -28,12 +28,22 @@ extern nodeid_t pxm_to_node(unsigned int pxm);
>  #define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
>  #define VIRTUAL_BUG_ON(x) 
>  
> +/* Enumerations for NUMA status. */
> +enum numa_mode {
> +	numa_on = 0,
> +	numa_off,

May I suggest to switch these two around, such that "off" becomes
the meaning of 0, potentially allowing ! to be used in a boolean-
like fashion here or there? And please omit the "= 0" part - it's
only non-zero first values which actually need spelling out.

> +	/* NUMA turns on, but ACPI table is bad or disabled. */
> +	numa_no_acpi,
> +	/* NUMA turns on, and ACPI table works well. */
> +	numa_acpi,

As to the names of these: In the description you already say that
you want to re-use the code for non-ACPI cases. Wouldn't you better
avoid "acpi" in the names then (rather than perhaps renaming these
another time later on)?

I'd also like to understand what useful state "numa_no_acpi" is.
I realize this was a state expressable by the two original
variables, but does it make sense?

> @@ -528,7 +528,8 @@ int __init acpi_scan_nodes(paddr_t start, paddr_t end)
>  	for (i = 0; i < MAX_NUMNODES; i++)
>  		cutoff_node(i, start, end);
>  
> -	if (acpi_numa <= 0)
> +	/* Only when numa_on with good firmware, we can do numa scan nodes. */
> +	if (!numa_enabled_with_firmware())
>  		return -1;

Nit: Perhaps drop "do numa" from the comment?

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 3/9] xen/x86: move generically usable NUMA code from x86 to common
  2022-07-08 14:54 ` [PATCH v2 3/9] xen/x86: move generically usable NUMA code from x86 to common Wei Chen
@ 2022-07-12 13:13   ` Jan Beulich
  2022-07-13 10:16     ` Wei Chen
  2022-07-13 14:10     ` Jan Beulich
  0 siblings, 2 replies; 50+ messages in thread
From: Jan Beulich @ 2022-07-12 13:13 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 08.07.2022 16:54, Wei Chen wrote:
> --- /dev/null
> +++ b/xen/common/numa.c
> @@ -0,0 +1,439 @@
> +/*
> + * Generic VM initialization for NUMA setups.
> + * Copyright 2002,2003 Andi Kleen, SuSE Labs.
> + * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com>
> + */
> +
> +#include <xen/init.h>
> +#include <xen/keyhandler.h>
> +#include <xen/mm.h>
> +#include <xen/nodemask.h>
> +#include <xen/numa.h>
> +#include <xen/param.h>
> +#include <xen/sched.h>
> +#include <xen/softirq.h>
> +#include <asm/acpi.h>

Isn't the goal for the now common code to not be dependent upon ACPI?

> +struct node_data node_data[MAX_NUMNODES];
> +
> +/* Mapping from pdx to node id */
> +int memnode_shift;
> +static typeof(*memnodemap) _memnodemap[64];
> +unsigned long memnodemapsize;
> +u8 *memnodemap;

For code moved, please switch to (in this case) uint8_t. I'm on the
edge of asking to go further and
- also use __read_mostly for some / all of these,
- make memnode_shift unsigned int (sadly this looks to require more
  adjustments, even if negative shift counts are meaningless),
- convert from plain int to unsigned int elsewhere as appropriate,
- add const where appropriate / possible.

> +nodeid_t cpu_to_node[NR_CPUS] __read_mostly = {
> +    [0 ... NR_CPUS-1] = NUMA_NO_NODE
> +};
> +
> +cpumask_t node_to_cpumask[MAX_NUMNODES] __read_mostly;

For these two please put __read_mostly into its more conventional
place (right after the type).

> +nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
> +
> +enum numa_mode numa_status;

This should probably have been __read_mostly already in the earlier
patch introducing it.

> +#ifdef CONFIG_NUMA_EMU
> +static int numa_fake __initdata = 0;

The __initdata again wants to move, and the initializer can be omitted.

> +/* [numa=off] */
> +static int __init cf_check numa_setup(const char *opt)
> +{
> +    if ( !strncmp(opt,"off",3) )

As you're correcting style violations elsewhere, please also insert the
missing spaces here and below.

> --- a/xen/include/xen/numa.h
> +++ b/xen/include/xen/numa.h
> @@ -18,4 +18,78 @@
>    (((d)->vcpu != NULL && (d)->vcpu[0] != NULL) \
>     ? vcpu_to_node((d)->vcpu[0]) : NUMA_NO_NODE)
>  
> +/* The following content can be used when NUMA feature is enabled */
> +#ifdef CONFIG_NUMA
> +
> +extern nodeid_t      cpu_to_node[NR_CPUS];
> +extern cpumask_t     node_to_cpumask[];
> +
> +#define cpu_to_node(cpu)        (cpu_to_node[cpu])
> +#define parent_node(node)       (node)
> +#define node_to_first_cpu(node) (__ffs(node_to_cpumask[node]))
> +#define node_to_cpumask(node)   (node_to_cpumask[node])
> +
> +struct node {
> +    paddr_t start, end;
> +};
> +
> +extern int compute_hash_shift(struct node *nodes, int numnodes,
> +                              nodeid_t *nodeids);
> +
> +#define VIRTUAL_BUG_ON(x)
> +
> +/* Enumerations for NUMA status. */
> +enum numa_mode {
> +    numa_on = 0,
> +    numa_off,
> +    /* NUMA turns on, but ACPI table is bad or disabled. */
> +    numa_no_acpi,
> +    /* NUMA turns on, and ACPI table works well. */
> +    numa_acpi,
> +};
> +
> +extern void numa_add_cpu(int cpu);
> +extern void numa_init_array(void);
> +extern void numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn);
> +extern bool numa_enabled_with_firmware(void);
> +extern enum numa_mode numa_status;
> +
> +extern void numa_set_node(int cpu, nodeid_t node);
> +extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end);
> +
> +static inline void clear_node_cpumask(int cpu)
> +{
> +    cpumask_clear_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]);
> +}
> +
> +/* Simple perfect hash to map pdx to node numbers */
> +extern int memnode_shift;
> +extern unsigned long memnodemapsize;
> +extern u8 *memnodemap;
> +
> +struct node_data {
> +    unsigned long node_start_pfn;
> +    unsigned long node_spanned_pages;
> +};
> +
> +extern struct node_data node_data[];
> +
> +static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)

Please use __attribute_pure__.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 5/9] xen/x86: use arch_get_memory_map to get information from E820 map
  2022-07-08 14:54 ` [PATCH v2 5/9] xen/x86: use arch_get_memory_map to get information from E820 map Wei Chen
@ 2022-07-12 13:40   ` Jan Beulich
  2022-07-13 10:37     ` Wei Chen
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2022-07-12 13:40 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 08.07.2022 16:54, Wei Chen wrote:
> @@ -82,3 +83,25 @@ unsigned int __init arch_get_dma_bitsize(void)
>                   flsl(node_start_pfn(node) + node_spanned_pages(node) / 4 - 1)
>                   + PAGE_SHIFT, 32);
>  }
> +
> +/*
> + * This function provides the ability for caller to get one RAM entry
> + * from architectural memory map by index.
> + *
> + * This function will return zero if it can return a proper RAM entry.
> + * otherwise it will return -ENODEV for out of scope index, or return
> + * -EINVAL for non-RAM type memory entry.
> + */

I think the comment also wants to spell out that the range is
exclusive at the end (assuming that's suitable for Arm; else now
would perhaps be the time to change it).

> +int __init arch_get_memory_map(unsigned int idx, paddr_t *start, paddr_t *end)
> +{
> +    if ( idx >= e820.nr_map )
> +        return -ENODEV;

Perhaps better -ENOENT?

> +    if ( e820.map[idx].type != E820_RAM )
> +        return -EINVAL;

I'm sorry, this should have occurred to me already when commenting on
v1: "get_memory_map" doesn't really fit this "RAM only" restriction.
Maybe arch_get_ram_range()? Or maybe others have some good naming
suggestion?

> +    *start = e820.map[idx].addr;
> +    *end = e820.map[idx].addr + e820.map[idx].size;

Nit: Would be shorter to read if you (re)used *start.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 6/9] xen/x86: move NUMA scan nodes codes from x86 to common
  2022-07-08 14:54 ` [PATCH v2 6/9] xen/x86: move NUMA scan nodes codes from x86 to common Wei Chen
@ 2022-07-12 14:21   ` Jan Beulich
  2022-07-13 10:57     ` Wei Chen
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2022-07-12 14:21 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 08.07.2022 16:54, Wei Chen wrote:
> x86 implements a set of code to scan NUMA nodes, parsing NUMA
> memory and processor information from the ACPI SRAT table. Apart
> from some ACPI-specific parts, most of the scan code - memory
> block validation, node memory range updates and various sanity
> checks - can be reused by other NUMA implementations.
> 
> So in this patch, we move some NUMA memory and processor variables
> and related functions to common code. At the same time,
> numa_set_processor_nodes_parsed has been introduced for ACPI-specific
> code to record processor parsing results. With this helper, we can
> move most of the NUMA memory affinity init code out of ACPI. The
> bad_srat and node_to_pxm functions have been exported for common
> code to do the architectural fallback and node-to-proximity
> conversion.

I consider it wrong for generic (ACPI-independent) code to use
terms like "srat" or "pxm". This wants abstracting in some way,
albeit I have to admit I lack a good idea for a suggestion right
now.

> For those NUMA implementations haven't supported
> proximity domain to node map. A simple 1-1 mapping helper can help
> us to avoid the modification of some very valuable print messages.

I don't think a simple 1:1 mapping can do. Surely firmware
represents nodes by some identifiers everywhere. And I don't
consider it very likely that we would want to blindly re-use
such numbering inside Xen.

> --- a/xen/common/numa.c
> +++ b/xen/common/numa.c
> @@ -14,6 +14,21 @@
>  #include <xen/softirq.h>
>  #include <asm/acpi.h>
>  
> +static nodemask_t __initdata processor_nodes_parsed;
> +static nodemask_t __initdata memory_nodes_parsed;
> +static struct node __initdata nodes[MAX_NUMNODES];
> +
> +static int num_node_memblks;
> +static struct node node_memblk_range[NR_NODE_MEMBLKS];
> +static nodeid_t memblk_nodeid[NR_NODE_MEMBLKS];

Some (all) of these likely want to become __read_mostly again, at
this occasion.

> @@ -32,6 +47,298 @@ nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
>  
>  enum numa_mode numa_status;
>  
> +void __init numa_set_processor_nodes_parsed(nodeid_t node)
> +{
> +    node_set(node, processor_nodes_parsed);
> +}
> +
> +int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node)
> +{
> +    int i;

unsigned int? Then matching e.g. ...

> +static
> +enum conflicts __init conflicting_memblks(nodeid_t nid, paddr_t start,
> +                                          paddr_t end, paddr_t nd_start,
> +                                          paddr_t nd_end, unsigned int *mblkid)
> +{
> +    unsigned int i;

... this. But perhaps also elsewhere.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 7/9] xen/x86: rename bad_srat to numa_bad
  2022-07-08 14:54 ` [PATCH v2 7/9] xen/x86: rename bad_srat to numa_bad Wei Chen
@ 2022-07-12 14:25   ` Jan Beulich
  2022-07-14  9:17     ` Wei Chen
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2022-07-12 14:25 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 08.07.2022 16:54, Wei Chen wrote:
> When the NUMA initialization code fails while scanning the SRAT,
> it calls bad_srat to disable NUMA and clear the related data. But
> this name is ACPI-specific, and bad_srat has moved to common code
> along with the generically usable NUMA code. It is not reasonable
> for other NUMA implementations to have to implement a fallback
> function named bad_srat. So in this patch, we rename bad_srat to
> numa_bad.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>
> ---
> v1 -> v2:
> 1. New in v2.

I think this wants to come before moving to common code. And I'd
also like you to consider using e.g. numa_fw_bad() or
numa_firmware_bad() to better reflect the original purpose.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 8/9] xen: rename acpi_scan_nodes to numa_scan_nodes
  2022-07-08 14:54 ` [PATCH v2 8/9] xen: rename acpi_scan_nodes to numa_scan_nodes Wei Chen
@ 2022-07-12 14:27   ` Jan Beulich
  2022-07-14  9:18     ` Wei Chen
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2022-07-12 14:27 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 08.07.2022 16:54, Wei Chen wrote:
> We have moved acpi_scan_nodes from x86 to common. Because of our
> previous work, this function no longer has many ACPI
> characteristics, except some "SRAT" wording in print messages.
> So we rename acpi_scan_nodes to the more generic name
> numa_scan_nodes, and replace the "SRAT" wording in the messages.

Once again I think the rename should happen before the moving to
common code.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 9/9] xen: introduce a Kconfig option to configure NUMA nodes number
  2022-07-08 14:54 ` [PATCH v2 9/9] xen: introduce a Kconfig option to configure NUMA nodes number Wei Chen
@ 2022-07-12 14:34   ` Jan Beulich
  2022-07-14 10:14     ` Wei Chen
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2022-07-12 14:34 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné,
	xen-devel

On 08.07.2022 16:54, Wei Chen wrote:
> --- a/xen/arch/Kconfig
> +++ b/xen/arch/Kconfig
> @@ -17,3 +17,14 @@ config NR_CPUS
>  	  For CPU cores which support Simultaneous Multi-Threading or similar
>  	  technologies, this the number of logical threads which Xen will
>  	  support.
> +
> +config NR_NUMA_NODES
> +	int "Maximum number of NUMA nodes supported"
> +	range 1 255
> +	default "64"
> +	depends on NUMA

Does 1 make sense? That's not going to be NUMA then, I would say.

Does the value being (perhaps far) larger than NR_CPUS make sense?

Why does the range end at a not-power-of-2 value? (I was actually
going to suggest to have a shift value specified here, until
spotting that NODES_SHIFT isn't used anywhere else, and hence
you're rightfully deleting it.)

> +	help
> +	  Controls the build-time size of various arrays and bitmaps
> +	  associated with multiple-nodes management. It is the upper bound of
> +	  the number of NUMA nodes that the scheduler, memory allocation and
> +          other NUMA-aware components can handle.

Nit: indentation.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v2 1/9] xen/x86: introduce a helper to update memory hotplug end
  2022-07-12 11:53   ` Jan Beulich
@ 2022-07-13  8:22     ` Wei Chen
  2022-07-13  8:46       ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-13  8:22 UTC (permalink / raw)
  To: Jan Beulich; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: July 12, 2022 19:54
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; xen-
> devel@lists.xenproject.org
> Subject: Re: [PATCH v2 1/9] xen/x86: introduce a helper to update memory
> hotplug end
> 
> On 08.07.2022 16:54, Wei Chen wrote:
> > x86 provides a mem_hotplug variable to maintain the memory hotplug
> > end address. We want to move some code from x86 to common so that
> > it can be reused by other architectures, but not all architectures
> > support memory hotplug. So in this patch, we introduce this helper
> > to replace direct accesses to mem_hotplug in that code. This makes
> > it possible to stub the helper out on architectures without memory
> > hotplug support.
> 
> What remains unclear to me is why Arm can't simply gain the necessary
> variable. Sooner or later you'll want to support hotplug there anyway,

I am not sure my limited memory hotplug knowledge can answer your
question. Memory hotplug depends on hardware and firmware support. On
Arm, as far as I know, only ACPI firmware can notify the OS, through a
GED event (I am not completely sure about this). But ACPI and device
tree cannot both be enabled when the OS boots: with device tree based
NUMA we boot from the device tree, the ACPI service is not initialized,
and therefore we cannot receive memory hotplug event notifications from
ACPI firmware.

Of course, adding GED device nodes to the device tree and receiving
these events through GED interrupts should not be a big technical
problem. But there may be other reasons, and no one is planning to add
memory hotplug support to the device tree (it would probably need an
ACPI-like specification or a firmware abstraction interface). So if we
want to implement memory hotplug for Xen on Arm, it should also start
from ACPI. Currently, even if we gained the variable on Arm, it would
not be used: the device tree has no property to indicate whether a
memory block can be hotplugged.

> I suppose. Mechanically the change is fine. Albeit maybe "top" instead
> of "boundary", and maybe also pass "node" even if x86 doesn't need it?
> 

Sorry, I am not completely clear about these comments:
Using mem_hotplug_update_top instead of mem_hotplug_update_boundary
makes sense. But how should I understand passing "node"? Did you mean
mem_hotplug_update_top(node, end)? mem_hotplug is a global top across
all memory hotplug ranges, so I don't see how passing "node" to this
function would be useful.

Cheers,
Wei Chen

> Jan
> 
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> > v1 -> v2:
> > 1. Refine the commit message.
> > 2. Merge v1 patch#9,10 into one patch. Introduce the new functions
> >    in the same patch that this patch will be used first time.
> > 3. Fold if ( end > mem_hotplug ) to mem_hotplug_update_boundary,
> >    in this case, we can drop mem_hotplug_boundary.
> > ---
> >  xen/arch/x86/include/asm/mm.h | 6 ++++++
> >  xen/arch/x86/srat.c           | 3 +--
> >  2 files changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/xen/arch/x86/include/asm/mm.h
> b/xen/arch/x86/include/asm/mm.h
> > index 07b59c982b..b3dfbdb52b 100644
> > --- a/xen/arch/x86/include/asm/mm.h
> > +++ b/xen/arch/x86/include/asm/mm.h
> > @@ -476,6 +476,12 @@ static inline int get_page_and_type(struct
> page_info *page,
> >
> >  extern paddr_t mem_hotplug;
> >
> > +static inline void mem_hotplug_update_boundary(paddr_t end)
> > +{
> > +    if ( end > mem_hotplug )
> > +        mem_hotplug = end;
> > +}
> > +
> >
> /*************************************************************************
> *****
> >   * With shadow pagetables, the different kinds of address start
> >   * to get get confusing.
> > diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
> > index b62a152911..f53431f5e8 100644
> > --- a/xen/arch/x86/srat.c
> > +++ b/xen/arch/x86/srat.c
> > @@ -418,8 +418,7 @@ acpi_numa_memory_affinity_init(const struct
> acpi_srat_mem_affinity *ma)
> >  	memblk_nodeid[num_node_memblks] = node;
> >  	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
> >  		__set_bit(num_node_memblks, memblk_hotplug);
> > -		if (end > mem_hotplug)
> > -			mem_hotplug = end;
> > +		mem_hotplug_update_boundary(end);
> >  	}
> >  	num_node_memblks++;
> >  }


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 1/9] xen/x86: introduce a helper to update memory hotplug end
  2022-07-13  8:22     ` Wei Chen
@ 2022-07-13  8:46       ` Jan Beulich
  2022-07-13  9:55         ` Wei Chen
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2022-07-13  8:46 UTC (permalink / raw)
  To: Wei Chen; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 13.07.2022 10:22, Wei Chen wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 12 July 2022 19:54
>>
>> On 08.07.2022 16:54, Wei Chen wrote:
>>> x86 provides a mem_hotplug variable to maintain the memory hotplug
>>> end address. We want to move some codes from x86 to common, so that
>>> it can be reused by other architectures. But not all architectures
>>> have supported memory hotplug. So in this patch, we introduce this
>>> helper to replace mem_hotplug direct access in these codes. This
>>> will give the ability of stubbing this helper to those architectures
>>> without memory hotplug support.
>>
>> What remains unclear to me is why Arm can't simply gain the necessary
>> variable. Sooner or later you'll want to support hotplug there anyway,
> 
> I am not sure my limited memory hotplug knowledge can answer your question
> or not. As memory hotplug depends on hardware and firmware support. Now
> for Arm, we only have ACPI firmware that can notify OS through GED event
> (not very sure). But ACPI and device tree couldn't be enabled at the same
> time from OS booting. In device tree based NUMA, we will enable device
> tree, ACPI service will not be initialized, that means we can not get
> notification of memory hotplug events from ACPI firmware. 
> 
> Of course, adding GED device nodes to the device tree, and getting these
> events through GED interrupts should not be a big technical problem. But
> there may be other reasons, and no one is planning to add memory hotplug
> support to the device tree (perhaps need an ACPI-like specification or a
> firmware abstraction interface). So if we want to implement Xen Arm memory
> hotplug, it should also start from ACPI. So currently even if we gain the
> variable in Arm, it will not be used. Device tree does not have property
> to indicate a memory block can be hotplug or not.

But ACPI can be enabled for Arm64, and hence hotplug could be made to
work that way. Therefore I view the variable as potentially useful on
Arm as well, and hence introducing the variable might be less trouble
than introducing the per-arch helper.
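The alternative Jan describes (the variable itself living in common code, with no per-arch helper) can be sketched as below. This is a standalone illustration with Xen's paddr_t stubbed out, not the actual patch:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t paddr_t;   /* stand-in for Xen's paddr_t */

/*
 * The variable lives in common code.  Architectures without memory
 * hotplug (Arm today) simply never update it, so it stays 0, and no
 * per-arch mem_hotplug_update_boundary() stub is needed.
 */
paddr_t mem_hotplug;

/* What the (x86) SRAT parser does on seeing a hot-pluggable range. */
void note_hotplug_range(paddr_t end)
{
    if (end > mem_hotplug)
        mem_hotplug = end;
}
```

On an architecture whose firmware parsing never calls note_hotplug_range(), mem_hotplug simply remains 0.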

>> I suppose. Mechanically the change is fine. Albeit maybe "top" instead
>> of "boundary", and maybe also pass "node" even if x86 doesn't need it?
>>
> 
> Sorry, I am not very clear about these comments:
> It makes sense to use mem_hotplug_update_top instead
> of mem_hotplug_update_boundary. But how can I understand pass "node"?
> did you mean mem_hotplug_update_top(node, end)? But mem_hotplug is
> a global top for memory hotplug ranges. I don't think pass "node"
> to this function can be useful.

Please separate "current implementation" from "abstract model". In the
latter it would seem quite reasonable to me to have per-node upper
bounds (or even per-node ranges). Therefore adding node (and, on top
of what I did suggest before, region start) to the parameters of the
new per-arch hook would seem a good move to me, even if at present
only / at most the "end" parameter would actually be used.
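The wider hook signature sketched in the paragraph above might look like this; the interface is hypothetical, and only "end" is consumed by today's single global top:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t paddr_t;       /* stand-ins for Xen types */
typedef unsigned int nodeid_t;

static paddr_t mem_hotplug;     /* current implementation: one global top */

/*
 * Abstract model: node and start are part of the interface so an
 * architecture could later track per-node hotplug ranges, even though
 * the present implementation only uses "end".
 */
static inline void mem_hotplug_update_top(nodeid_t node, paddr_t start,
                                          paddr_t end)
{
    (void)node;    /* unused until per-node bounds exist */
    (void)start;
    if (end > mem_hotplug)
        mem_hotplug = end;
}

paddr_t mem_hotplug_top(void)
{
    return mem_hotplug;
}
```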

Jan



* RE: [PATCH v2 1/9] xen/x86: introduce a helper to update memory hotplug end
  2022-07-13  8:46       ` Jan Beulich
@ 2022-07-13  9:55         ` Wei Chen
  2022-07-13 10:03           ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-13  9:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 13 July 2022 16:46
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; xen-
> devel@lists.xenproject.org
> Subject: Re: [PATCH v2 1/9] xen/x86: introduce a helper to update memory
> hotplug end
> 
> On 13.07.2022 10:22, Wei Chen wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: 2022年7月12日 19:54
> >>
> >> On 08.07.2022 16:54, Wei Chen wrote:
> >>> x86 provides a mem_hotplug variable to maintain the memory hotplug
> >>> end address. We want to move some codes from x86 to common, so that
> >>> it can be reused by other architectures. But not all architectures
> >>> have supported memory hotplug. So in this patch, we introduce this
> >>> helper to replace mem_hotplug direct access in these codes. This
> >>> will give the ability of stubbing this helper to those architectures
> >>> without memory hotplug support.
> >>
> >> What remains unclear to me is why Arm can't simply gain the necessary
> >> variable. Sooner or later you'll want to support hotplug there anyway,
> >
> > I am not sure my limited memory hotplug knowledge can answer your
> question
> > or not. As memory hotplug depends on hardware and firmware support. Now
> > for Arm, we only have ACPI firmware that can notify OS through GED event
> > (not very sure). But ACPI and device tree couldn't be enabled at the
> same
> > time from OS booting. In device tree based NUMA, we will enable device
> > tree, ACPI service will not be initialized, that means we can not get
> > notification of memory hotplug events from ACPI firmware.
> >
> > Of course, adding GED device nodes to the device tree, and getting these
> > events through GED interrupts should not be a big technical problem. But
> > there may be other reasons, and no one is planning to add memory hotplug
> > support to the device tree (perhaps need an ACPI-like specification or a
> > firmware abstraction interface). So if we want to implement Xen Arm
> memory
> > hotplug, it should also start from ACPI. So currently even if we gain
> the
> > variable in Arm, it will not be used. Device tree does not have property
> > to indicate a memory block can be hotplug or not.
> 
> But ACPI is possible to be enabled for Arm64, and hence hotplug could be
> made work that way. Therefore I view the variable as potentially useful
> on Arm as well, and hence introducing the variable might be less trouble
> than introducing the per-arch helper.
> 

Ok, I will try to expose mem_hotplug to common/mm.c. As both Arm and
x86 code can then access it, we can move the related NUMA code to
common without this helper. mem_hotplug will always be zero on Arm
until Arm supports memory hotplug; this should be harmless for Arm.

> >> I suppose. Mechanically the change is fine. Albeit maybe "top" instead
> >> of "boundary", and maybe also pass "node" even if x86 doesn't need it?
> >>
> >
> > Sorry, I am not very clear about these comments:
> > It makes sense to use mem_hotplug_update_top instead
> > of mem_hotplug_update_boundary. But how can I understand pass "node"?
> > did you mean mem_hotplug_update_top(node, end)? But mem_hotplug is
> > a global top for memory hotplug ranges. I don't think pass "node"
> > to this function can be useful.
> 
> Please separate "current implementation" from "abstract model". In the
> latter it would seem quite reasonable to me to have per-node upper
> bounds (or even per-node ranges). Therefore adding node (and, on top
> of what I did suggest before, region start) to the parameters of the
> new per-arch hook would seem a good move to me, even if at present
> only / at most the "end" parameter would actually be used.
> 

Since we will export mem_hotplug to common, I think these changes are
no longer needed?

Cheers,
Wei Chen

> Jan


* Re: [PATCH v2 1/9] xen/x86: introduce a helper to update memory hotplug end
  2022-07-13  9:55         ` Wei Chen
@ 2022-07-13 10:03           ` Jan Beulich
  0 siblings, 0 replies; 50+ messages in thread
From: Jan Beulich @ 2022-07-13 10:03 UTC (permalink / raw)
  To: Wei Chen; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 13.07.2022 11:55, Wei Chen wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 13 July 2022 16:46
>>
>> On 13.07.2022 10:22, Wei Chen wrote:
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: 12 July 2022 19:54
>>>>
>>>> Mechanically the change is fine. Albeit maybe "top" instead
>>>> of "boundary", and maybe also pass "node" even if x86 doesn't need it?
>>>>
>>>
>>> Sorry, I am not very clear about these comments:
>>> It makes sense to use mem_hotplug_update_top instead
>>> of mem_hotplug_update_boundary. But how can I understand pass "node"?
>>> did you mean mem_hotplug_update_top(node, end)? But mem_hotplug is
>>> a global top for memory hotplug ranges. I don't think pass "node"
>>> to this function can be useful.
>>
>> Please separate "current implementation" from "abstract model". In the
>> latter it would seem quite reasonable to me to have per-node upper
>> bounds (or even per-node ranges). Therefore adding node (and, on top
>> of what I did suggest before, region start) to the parameters of the
>> new per-arch hook would seem a good move to me, even if at present
>> only / at most the "end" parameter would actually be used.
>>
> 
> As we will export mem_hotplug to common, I think these changes are
> not needed any more?

Indeed. I merely wanted to address your question nevertheless, or in case
there was still a reason to avoid making the global variable common.

Jan



* RE: [PATCH v2 3/9] xen/x86: move generically usable NUMA code from x86 to common
  2022-07-12 13:13   ` Jan Beulich
@ 2022-07-13 10:16     ` Wei Chen
  2022-07-13 14:10     ` Jan Beulich
  1 sibling, 0 replies; 50+ messages in thread
From: Wei Chen @ 2022-07-13 10:16 UTC (permalink / raw)
  To: Jan Beulich
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 12 July 2022 21:13
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; George Dunlap
> <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>; Stefano
> Stabellini <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v2 3/9] xen/x86: move generically usable NUMA code
> from x86 to common
> 
> On 08.07.2022 16:54, Wei Chen wrote:
> > --- /dev/null
> > +++ b/xen/common/numa.c
> > @@ -0,0 +1,439 @@
> > +/*
> > + * Generic VM initialization for NUMA setups.
> > + * Copyright 2002,2003 Andi Kleen, SuSE Labs.
> > + * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com>
> > + */
> > +
> > +#include <xen/init.h>
> > +#include <xen/keyhandler.h>
> > +#include <xen/mm.h>
> > +#include <xen/nodemask.h>
> > +#include <xen/numa.h>
> > +#include <xen/param.h>
> > +#include <xen/sched.h>
> > +#include <xen/softirq.h>
> > +#include <asm/acpi.h>
> 
> Isn't the goal for the now common code to not be dependent upon ACPI?
> 

Ah, good catch; the current common code should indeed be decoupled
from ACPI. I will remove "#include <asm/acpi.h>".

Thanks,
Wei Chen

> > +struct node_data node_data[MAX_NUMNODES];
> > +
> > +/* Mapping from pdx to node id */
> > +int memnode_shift;
> > +static typeof(*memnodemap) _memnodemap[64];
> > +unsigned long memnodemapsize;
> > +u8 *memnodemap;
> 
> For code moved, please switch to (in this case) uint8_t. I'm on the
> edge of asking to go further and
> - also use __read_mostly for some / all of these,
> - make memnode_shift unsigned int (sadly this looks to require more
>   adjustments, even if negative shift counts are meaningless),
> - convert from plain int to unsigned int elsewhere as appropriate,
> - add const where appropriate / possible.
> 

Ok, I will address them in next version.
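Applied to the declarations quoted above, the requested cleanups would look roughly like this (the section-placement macro is stubbed so the snippet stands alone; it is an illustration of the review comments, not the v3 patch):

```c
#include <assert.h>
#include <stdint.h>

#define __read_mostly  /* stub: Xen places such data in .data.read_mostly */
#define NR_CPUS      8
#define NUMA_NO_NODE 0xffu

typedef uint8_t nodeid_t;

/* u8 -> uint8_t, __read_mostly right after the type, shift unsigned. */
unsigned int __read_mostly memnode_shift;
unsigned long __read_mostly memnodemapsize;
uint8_t *__read_mostly memnodemap;

nodeid_t __read_mostly cpu_to_node[NR_CPUS] = {
    [0 ... NR_CPUS - 1] = NUMA_NO_NODE
};
```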

> > +nodeid_t cpu_to_node[NR_CPUS] __read_mostly = {
> > +    [0 ... NR_CPUS-1] = NUMA_NO_NODE
> > +};
> > +
> > +cpumask_t node_to_cpumask[MAX_NUMNODES] __read_mostly;
> 
> For these two please put __read_mostly into its more conventional
> place (right after the type).
> 

Ok.

> > +nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
> > +
> > +enum numa_mode numa_status;
> 
> This should probably have been __read_mostly already in the earlier
> patch introducing it.
> 

Ok.

> > +#ifdef CONFIG_NUMA_EMU
> > +static int numa_fake __initdata = 0;
> 
> The __initdata again wants to move, and the initializer can be omitted.
> 

Ok.

> > +/* [numa=off] */
> > +static int __init cf_check numa_setup(const char *opt)
> > +{
> > +    if ( !strncmp(opt,"off",3) )
> 
> As you're correcting style violations elsewhere, please also insert the
> missing spaces here and below.
> 

Ack.

> > --- a/xen/include/xen/numa.h
> > +++ b/xen/include/xen/numa.h
> > @@ -18,4 +18,78 @@
> >    (((d)->vcpu != NULL && (d)->vcpu[0] != NULL) \
> >     ? vcpu_to_node((d)->vcpu[0]) : NUMA_NO_NODE)
> >
> > +/* The following content can be used when NUMA feature is enabled */
> > +#ifdef CONFIG_NUMA
> > +
> > +extern nodeid_t      cpu_to_node[NR_CPUS];
> > +extern cpumask_t     node_to_cpumask[];
> > +
> > +#define cpu_to_node(cpu)        (cpu_to_node[cpu])
> > +#define parent_node(node)       (node)
> > +#define node_to_first_cpu(node) (__ffs(node_to_cpumask[node]))
> > +#define node_to_cpumask(node)   (node_to_cpumask[node])
> > +
> > +struct node {
> > +    paddr_t start, end;
> > +};
> > +
> > +extern int compute_hash_shift(struct node *nodes, int numnodes,
> > +                              nodeid_t *nodeids);
> > +
> > +#define VIRTUAL_BUG_ON(x)
> > +
> > +/* Enumerations for NUMA status. */
> > +enum numa_mode {
> > +    numa_on = 0,
> > +    numa_off,
> > +    /* NUMA turns on, but ACPI table is bad or disabled. */
> > +    numa_no_acpi,
> > +    /* NUMA turns on, and ACPI table works well. */
> > +    numa_acpi,
> > +};
> > +
> > +extern void numa_add_cpu(int cpu);
> > +extern void numa_init_array(void);
> > +extern void numa_initmem_init(unsigned long start_pfn, unsigned long
> end_pfn);
> > +extern bool numa_enabled_with_firmware(void);
> > +extern enum numa_mode numa_status;
> > +
> > +extern void numa_set_node(int cpu, nodeid_t node);
> > +extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t
> end);
> > +
> > +static inline void clear_node_cpumask(int cpu)
> > +{
> > +    cpumask_clear_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]);
> > +}
> > +
> > +/* Simple perfect hash to map pdx to node numbers */
> > +extern int memnode_shift;
> > +extern unsigned long memnodemapsize;
> > +extern u8 *memnodemap;
> > +
> > +struct node_data {
> > +    unsigned long node_start_pfn;
> > +    unsigned long node_spanned_pages;
> > +};
> > +
> > +extern struct node_data node_data[];
> > +
> > +static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
> 
> Please use __attribute_pure__.
> 

Ack.
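For context, the pdx-to-node "perfect hash" that memnode_shift and memnodemap implement works roughly as below. This is a simplified standalone sketch: the real Xen code first compresses the address through the pdx machinery before indexing.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t paddr_t;
typedef uint8_t nodeid_t;

#define PAGE_SHIFT 12

/* Each memnodemap[] slot covers 2^(PAGE_SHIFT + memnode_shift) bytes
 * and records which node owns that chunk of the address space. */
static unsigned int memnode_shift = 20;   /* one slot per 4 GiB here */
static uint8_t memnodemap[64];            /* covers 64 * 4 GiB */

nodeid_t phys_to_nid(paddr_t addr)
{
    return memnodemap[(addr >> PAGE_SHIFT) >> memnode_shift];
}

void set_chunk_node(unsigned int slot, nodeid_t nid)
{
    memnodemap[slot] = nid;
}
```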

> Jan


* RE: [PATCH v2 5/9] xen/x86: use arch_get_memory_map to get information from E820 map
  2022-07-12 13:40   ` Jan Beulich
@ 2022-07-13 10:37     ` Wei Chen
  2022-07-13 13:43       ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-13 10:37 UTC (permalink / raw)
  To: Jan Beulich
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 12 July 2022 21:40
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; George Dunlap
> <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>; Stefano
> Stabellini <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v2 5/9] xen/x86: use arch_get_memory_map to get
> information from E820 map
> 
> On 08.07.2022 16:54, Wei Chen wrote:
> > @@ -82,3 +83,25 @@ unsigned int __init arch_get_dma_bitsize(void)
> >                   flsl(node_start_pfn(node) + node_spanned_pages(node) /
> 4 - 1)
> >                   + PAGE_SHIFT, 32);
> >  }
> > +
> > +/*
> > + * This function provides the ability for caller to get one RAM entry
> > + * from architectural memory map by index.
> > + *
> > + * This function will return zero if it can return a proper RAM entry.
> > + * otherwise it will return -ENODEV for out of scope index, or return
> > + * -EINVAL for non-RAM type memory entry.
> > + */
> 
> I think the comment also wants to spell out that the range is
> exclusive at the end (assuming that's suitable for Arm; else now
> would perhaps be the time to change it).
> 

Did you mean we have to mention that the range is [start, end)?
If yes, I will add a related comment. And...

> > +int __init arch_get_memory_map(unsigned int idx, paddr_t *start,
> paddr_t *end)
> > +{
> > +    if ( idx >= e820.nr_map )
> > +        return -ENODEV;
> 
> Perhaps better -ENOENT?
> 

Ack.

> > +    if ( e820.map[idx].type != E820_RAM )
> > +        return -EINVAL;
> 
> I'm sorry, this should have occurred to me already when commenting on
> v1: "get_memory_map" doesn't really fit this "RAM only" restriction.
> Maybe arch_get_ram_range()? Or maybe others have some good naming
> suggestion?

arch_get_ram_range makes sense to me. I will update in next version.

> 
> > +    *start = e820.map[idx].addr;
> > +    *end = e820.map[idx].addr + e820.map[idx].size;
> 

...In this case, I think we don't need to adjust this line to:
*end = e820.map[idx].addr + e820.map[idx].size - 1;
as we will have mentioned that the range is [start, end).

> Nit: Would be shorter to read if you (re)used *start.
> 

Ack.
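Putting the points agreed above together (the arch_get_ram_range() name, -ENOENT for out-of-range indexes, and a half-open [start, end) result), the interface might be sketched like this. The memory map here is invented for illustration; the real function walks the e820 table:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint64_t paddr_t;

struct mem_entry { paddr_t addr, size; unsigned int type; };
#define MEM_RAM      1   /* stand-in for E820_RAM */
#define MEM_RESERVED 2

static const struct mem_entry map[] = {
    { 0x00000000ULL, 0x80000000ULL, MEM_RAM      },
    { 0x80000000ULL, 0x10000000ULL, MEM_RESERVED },
};

/* 0 on success; -ENOENT past the last entry; -EINVAL for non-RAM. */
int arch_get_ram_range(unsigned int idx, paddr_t *start, paddr_t *end)
{
    if (idx >= sizeof(map) / sizeof(map[0]))
        return -ENOENT;
    if (map[idx].type != MEM_RAM)
        return -EINVAL;
    *start = map[idx].addr;
    *end = *start + map[idx].size;  /* exclusive: range is [start, end) */
    return 0;
}
```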

> Jan


* RE: [PATCH v2 6/9] xen/x86: move NUMA scan nodes codes from x86 to common
  2022-07-12 14:21   ` Jan Beulich
@ 2022-07-13 10:57     ` Wei Chen
  2022-07-13 13:49       ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-13 10:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 12 July 2022 22:21
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; George Dunlap
> <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>; Stefano
> Stabellini <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v2 6/9] xen/x86: move NUMA scan nodes codes from x86
> to common
> 
> On 08.07.2022 16:54, Wei Chen wrote:
> > x86 has implemented a set of code to scan NUMA nodes. This code
> > parses NUMA memory and processor information from the ACPI SRAT
> > table. But aside from some ACPI specific parts, most of the scan
> > code, like memory block validation, node memory range updates and
> > some sanity checks, can be reused by other NUMA implementations.
> >
> > So in this patch, we move some variables and related functions
> > for NUMA memory and processors to common code. At the same time,
> > numa_set_processor_nodes_parsed has been introduced for ACPI
> > specific code to update processor parsing results. With this
> > helper, we can move most of the NUMA memory affinity init code
> > out of ACPI. And the bad_srat and node_to_pxm functions have been
> > exported for common code to do architectural fallback and node to
> > proximity domain conversion.
> 
> I consider it wrong for generic (ACPI-independent) code to use
> terms like "srat" or "pxm". This wants abstracting in some way,
> albeit I have to admit I lack a good idea for a suggestion right
> now.
> 

Maybe we can use fw_rsc_table or rsc_table to replace "srat", because
SRAT is just one kind of NUMA resource description table, specific to
ACPI? For PXM, I had tried to keep it confined to the x86 ACPI
implementation. But the cost is that we would have to move some common
code back into architectural code, because some messages print the PXM
for information, PXMs mean different things on different platforms,
and we cannot simply remove those messages.

> > For those NUMA implementations that don't support a proximity
> > domain to node map, a simple 1:1 mapping helper can help us avoid
> > modifying some very valuable print messages.
> 
> I don't think a simple 1:1 mapping can do. Surely firmware
> represents nodes by some identifiers everywhere. And I don't
> consider it very likely that we would want to blindly re-use
> such numbering inside Xen.
> 

If we confine PXM to x86 for now, we can avoid this 1:1 mapping
between node and PXM.

> > --- a/xen/common/numa.c
> > +++ b/xen/common/numa.c
> > @@ -14,6 +14,21 @@
> >  #include <xen/softirq.h>
> >  #include <asm/acpi.h>
> >
> > +static nodemask_t __initdata processor_nodes_parsed;
> > +static nodemask_t __initdata memory_nodes_parsed;
> > +static struct node __initdata nodes[MAX_NUMNODES];
> > +
> > +static int num_node_memblks;
> > +static struct node node_memblk_range[NR_NODE_MEMBLKS];
> > +static nodeid_t memblk_nodeid[NR_NODE_MEMBLKS];
> 
> Some (all) of these likely want to become __read_mostly again, at
> this occasion.
> 

Ok.

> > @@ -32,6 +47,298 @@ nodemask_t __read_mostly node_online_map = { { [0] =
> 1UL } };
> >
> >  enum numa_mode numa_status;
> >
> > +void __init numa_set_processor_nodes_parsed(nodeid_t node)
> > +{
> > +    node_set(node, processor_nodes_parsed);
> > +}
> > +
> > +int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node)
> > +{
> > +    int i;
> 
> unsigned int? Then matching e.g. ...
> 

Ok.

> > +static
> > +enum conflicts __init conflicting_memblks(nodeid_t nid, paddr_t start,
> > +                                          paddr_t end, paddr_t nd_start,
> > +                                          paddr_t nd_end, unsigned int
> *mblkid)
> > +{
> > +    unsigned int i;
> 
> ... this. But perhaps also elsewhere.
> 

I will check the code and fix them.

Cheers,
Wei Chen

> Jan


* Re: [PATCH v2 5/9] xen/x86: use arch_get_memory_map to get information from E820 map
  2022-07-13 10:37     ` Wei Chen
@ 2022-07-13 13:43       ` Jan Beulich
  0 siblings, 0 replies; 50+ messages in thread
From: Jan Beulich @ 2022-07-13 13:43 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 13.07.2022 12:37, Wei Chen wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 12 July 2022 21:40
>>
>> On 08.07.2022 16:54, Wei Chen wrote:
>>> @@ -82,3 +83,25 @@ unsigned int __init arch_get_dma_bitsize(void)
>>>                   flsl(node_start_pfn(node) + node_spanned_pages(node) /
>> 4 - 1)
>>>                   + PAGE_SHIFT, 32);
>>>  }
>>> +
>>> +/*
>>> + * This function provides the ability for caller to get one RAM entry
>>> + * from architectural memory map by index.
>>> + *
>>> + * This function will return zero if it can return a proper RAM entry.
>>> + * otherwise it will return -ENODEV for out of scope index, or return
>>> + * -EINVAL for non-RAM type memory entry.
>>> + */
>>
>> I think the comment also wants to spell out that the range is
>> exclusive at the end (assuming that's suitable for Arm; else now
>> would perhaps be the time to change it).
>>
> 
> Did you mean we have to mention that the range is [start, end)?
> If yes, I will add a related comment.

Yes (worded one way or another).

Jan



* Re: [PATCH v2 6/9] xen/x86: move NUMA scan nodes codes from x86 to common
  2022-07-13 10:57     ` Wei Chen
@ 2022-07-13 13:49       ` Jan Beulich
  0 siblings, 0 replies; 50+ messages in thread
From: Jan Beulich @ 2022-07-13 13:49 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 13.07.2022 12:57, Wei Chen wrote:
> Hi Jan,
> 
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 12 July 2022 22:21
>> To: Wei Chen <Wei.Chen@arm.com>
>> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
>> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; George Dunlap
>> <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>; Stefano
>> Stabellini <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
>> Subject: Re: [PATCH v2 6/9] xen/x86: move NUMA scan nodes codes from x86
>> to common
>>
>> On 08.07.2022 16:54, Wei Chen wrote:
>>> x86 has implemented a set of codes to scan NUMA nodes. These
>>> codes will parse NUMA memory and processor information from
>>> ACPI SRAT table. But except some ACPI specific codes, most
>>> of the scan codes like memory blocks validation, node memory
>>> range updates and some sanity check can be reused by other
>>> NUMA implementation.
>>>
>>> So in this patch, we move some variables and related functions
>>> for NUMA memory and processor to common code. At the same time,
>>> numa_set_processor_nodes_parsed has been introduced for ACPI
>>> specific code to update processor parsing results. With this
>>> helper, we can move most of NUMA memory affinity init code from
>>> ACPI. And bad_srat and node_to_pxm functions have been exported
>>> for common code to do architectural fallback and node to proximity
>>> converting.
>>
>> I consider it wrong for generic (ACPI-independent) code to use
>> terms like "srat" or "pxm". This wants abstracting in some way,
>> albeit I have to admit I lack a good idea for a suggestion right
>> now.
>>
> 
> Maybe we can use fw_rsc_table or rsc_table to replace srat, because
> srat is one kind of NUMA resource description table of ACPI?

Is "rsc" meant to stand for "resource"? Would be a somewhat unusual
abbreviation. I could see us using e.g. numa_fw_ as a prefix, as in
e.g. numa_fw_bad() (replacing bad_srat()).

> For PXM, I had tried to keep PXM in x86 ACPI implementation. But the
> cost is that, we have to move some common code to architectural code,
> because some messages use pxm for info, and they have different meanings
> for each platform, we cannot simply remove them.

Well, for functions wanting to emit log messages, suitable abstractions
can likely be made without needing to retain a lot of per-arch code.
E.g. the arch could pass in "PXM" and format strings then would use %s
together with it. Similarly the translation (if any is necessary) could
likely be abstracted by, in the worst case, passing in a function
pointer.
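The abstraction outlined above could be sketched as below. All names here are hypothetical (there is no such Xen interface today): the arch supplies the firmware's term for a node identifier plus a translation callback, and common code formats messages via %s.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint8_t nodeid_t;

/* Hypothetical per-arch hooks handed to common NUMA code, so common
 * code never names "SRAT" or "PXM" itself. */
struct numa_fw_ops {
    const char *node_id_name;                  /* e.g. "PXM" under ACPI */
    unsigned int (*node_to_fw_id)(nodeid_t node);
};

static unsigned int acpi_node_to_pxm(nodeid_t node)
{
    return node + 0x10;   /* made-up translation for the example */
}

const struct numa_fw_ops acpi_numa_fw_ops = {
    .node_id_name  = "PXM",
    .node_to_fw_id = acpi_node_to_pxm,
};

/* A common-code message formats the arch-provided term via %s. */
int describe_node(const struct numa_fw_ops *ops, nodeid_t node,
                  char *buf, size_t len)
{
    return snprintf(buf, len, "node %u (%s %u)", (unsigned int)node,
                    ops->node_id_name, ops->node_to_fw_id(node));
}
```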

Jan



* Re: [PATCH v2 3/9] xen/x86: move generically usable NUMA code from x86 to common
  2022-07-12 13:13   ` Jan Beulich
  2022-07-13 10:16     ` Wei Chen
@ 2022-07-13 14:10     ` Jan Beulich
  1 sibling, 0 replies; 50+ messages in thread
From: Jan Beulich @ 2022-07-13 14:10 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 12.07.2022 15:13, Jan Beulich wrote:
> On 08.07.2022 16:54, Wei Chen wrote:
>> --- /dev/null
>> +++ b/xen/common/numa.c
>> @@ -0,0 +1,439 @@
>> +/*
>> + * Generic VM initialization for NUMA setups.
>> + * Copyright 2002,2003 Andi Kleen, SuSE Labs.
>> + * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com>
>> + */
>> +
>> +#include <xen/init.h>
>> +#include <xen/keyhandler.h>
>> +#include <xen/mm.h>
>> +#include <xen/nodemask.h>
>> +#include <xen/numa.h>
>> +#include <xen/param.h>
>> +#include <xen/sched.h>
>> +#include <xen/softirq.h>
>> +#include <asm/acpi.h>
> 
> Isn't the goal for the now common code to not be dependent upon ACPI?
> 
>> +struct node_data node_data[MAX_NUMNODES];
>> +
>> +/* Mapping from pdx to node id */
>> +int memnode_shift;
>> +static typeof(*memnodemap) _memnodemap[64];
>> +unsigned long memnodemapsize;
>> +u8 *memnodemap;
> 
> For code moved, please switch to (in this case) uint8_t. I'm on the
> edge of asking to go further and
> - also use __read_mostly for some / all of these,

Actually where possible please prefer __ro_after_init over __read_mostly.
While this isn't properly enabled on Arm yet, it should at least not
cause problems, as the result section is covered by .data.* in the linker
script.
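The distinction can be written out as follows. Both annotations are stubbed here so the example stands alone; in real Xen they select linker sections, and the __ro_after_init section is additionally write-protected once boot completes:

```c
#include <assert.h>
#include <stdint.h>

/* Stubs for the standalone example. */
#define __read_mostly    /* rarely-written data, grouped together */
#define __ro_after_init  /* written during boot only, then read-only */

#define MAX_NUMNODES 64
typedef uint8_t nodeid_t;

/* Still updated at runtime (e.g. CPU hotplug): __read_mostly fits. */
static nodeid_t __read_mostly cpu_to_node_example[8];

/* Fully determined during boot-time NUMA setup: __ro_after_init. */
static nodeid_t __ro_after_init pxm_to_node_example[MAX_NUMNODES];

void boot_time_setup(void)
{
    for (unsigned int i = 0; i < MAX_NUMNODES; i++)
        pxm_to_node_example[i] = (nodeid_t)(i & 1);
    cpu_to_node_example[0] = 0;
}
```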

Jan



* RE: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status
  2022-07-12 12:49   ` Jan Beulich
@ 2022-07-14  9:03     ` Wei Chen
  2022-07-14  9:32       ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-14  9:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2022年7月12日 20:49
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; xen-
> devel@lists.xenproject.org
> Subject: Re: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA
> status
> 
> On 08.07.2022 16:54, Wei Chen wrote:
> > In the current code, x86 uses two variables, numa_off and
> > acpi_numa, to indicate the NUMA status. This is because NUMA is not
> > coupled with ACPI, and ACPI can still work without NUMA on x86.
> > With combinations of these two variables, x86 can have several NUMA
> > states:
> > NUMA switch on,
> > NUMA switch off,
> > NUMA switch on with NUMA emulation,
> > NUMA switch on with no ACPI,
> > NUMA switch on with ACPI.
> 
> Hmm, with both this and the actual change I'm not able to convince
> myself that you've expressed the prior combinations correctly. May
> I suggest that you make table representing the 6 (I think)
> combinations of original states with their mapping to the new
> enumerators? (It doesn't need to be 6 different enumerators, but
> all 6 existing states need a [proper] representation in the new
> model.)
> 

Sorry for the late reply; I spent some time checking the code again
and drew the table below. I have omitted the two columns where
numa_off=true and acpi_numa=1/-1, because when numa_off = true the
ACPI SRAT table will not be parsed:
+-----------+---------+---------------+-----------+------------+----------+
| original  |  col1   |  col2         |  col3     |  col4      |  col5    |
+-----------+---------+---------------+-----------+------------+----------+
|numa_off   |true     |false          |false      |false       |false     |
|acpi_numa  |0        |0              |1          |-1          |x         |
|numa_fake  |x        |x              |x          |x           |fake_nodes|
|enum state |numa_off |numa_fw_nodata |numa_fw_ok |numa_fw_bad |numa_emu  |
+-----------+---------+---------------+-----------+------------+----------+

Notes:
The following scenarios will make acpi_numa=0:
1. Xen disables ACPI through acpi_disabled = 1.
2. The ACPI tables don't include an SRAT, or the SRAT doesn't contain CPU
   and memory NUMA affinity information.
3. NUMA is emulated through the "numa=fake" command line parameter. But I
   found that when numa_fake is enabled, the current code does not prevent
   the ACPI SRAT parser from parsing NUMA information, so acpi_numa may
   still be changed later. Is there any further reason why we still need
   to parse NUMA info from the SRAT table when numa=fake? Or can we set
   acpi_numa = -1 in numa_setup when "numa=fake" is given, so that
   srat_disabled prevents ACPI SRAT parsing?

acpi_numa = 1 when all of the following conditions are met:
1. Xen enables ACPI through acpi_disabled = 0.
2. The ACPI SRAT parser can successfully parse CPU and memory NUMA
   affinities from the SRAT table.
3. The memory blocks pass the sanity check and hash computation in
   acpi_scan_nodes.

The following scenarios will make acpi_numa=-1:
1. The ACPI table is disabled by the "numa=noacpi" command line parameter.
2. Xen enables ACPI through acpi_disabled = 0, but the ACPI SRAT parser
   cannot successfully parse CPU or memory NUMA affinities from the SRAT
   table.
3. The memory blocks' sanity check or hash computation in acpi_scan_nodes
   fails.
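For illustration, the fold-down described by the table could be expressed as a single helper along these lines. This is a sketch only: the enumerator names come from the table, while the helper's name, signature, and the order of the checks are assumptions, not code from the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Enumerator names as in the table above. */
enum numa_mode {
    numa_off,
    numa_fw_nodata,
    numa_fw_ok,
    numa_fw_bad,
    numa_emu,
};

/*
 * Fold the three original variables into the single new state, row by
 * row from the table: "numa=off" wins outright, then "numa=fake", and
 * only then the firmware parsing result held in acpi_numa.
 */
static enum numa_mode consolidate(bool off, int acpi_numa, int fake_nodes)
{
    if ( off )
        return numa_off;
    if ( fake_nodes > 0 )
        return numa_emu;
    if ( acpi_numa > 0 )
        return numa_fw_ok;
    if ( acpi_numa < 0 )
        return numa_fw_bad;
    return numa_fw_nodata;
}
```

Each column of the table corresponds to one return path, so the "x" (don't care) cells fall out naturally from the check ordering.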

> As an aside - I think you mean "switched" in all five of these
> lines.
> 

Yes, I will fix them in the next version.

> > --- a/xen/arch/x86/include/asm/numa.h
> > +++ b/xen/arch/x86/include/asm/numa.h
> > @@ -28,12 +28,22 @@ extern nodeid_t pxm_to_node(unsigned int pxm);
> >  #define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
> >  #define VIRTUAL_BUG_ON(x)
> >
> > +/* Enumerations for NUMA status. */
> > +enum numa_mode {
> > +	numa_on = 0,
> > +	numa_off,
> 
> May I suggest to switch these two around, such that "off" becomes
> the meaning of 0, potentially allowing ! to be used in a boolean-
> like fashion here or there? And please omit the "= 0" part - it's
> only non-zero first values which actually need spelling out.
> 

Ok.

> > +	/* NUMA turns on, but ACPI table is bad or disabled. */
> > +	numa_no_acpi,
> > +	/* NUMA turns on, and ACPI table works well. */
> > +	numa_acpi,
> 
> As to the names of these: In the description you already say that
> you want to re-use the code for non-ACPI cases. Wouldn't you better
> avoid "acpi" in the names then (rather than perhaps renaming these
> another time later on)?
> 
> I'd also like to understand what useful state "numa_no_acpi" is.
> I realize this was a state expressable by the two original
> variables, but does it make sense?
> 

I have updated the new names in the table above, and "numa_no_acpi"
maps to the "numa_fw_bad" enum state.

> > @@ -528,7 +528,8 @@ int __init acpi_scan_nodes(paddr_t start, paddr_t
> end)
> >  	for (i = 0; i < MAX_NUMNODES; i++)
> >  		cutoff_node(i, start, end);
> >
> > -	if (acpi_numa <= 0)
> > +	/* Only when numa_on with good firmware, we can do numa scan nodes.
> */
> > +	if (!numa_enabled_with_firmware())
> >  		return -1;
> 
> Nit: Perhaps drop "do numa" from the comment?
> 

Ok, I will do it in the next version.

Cheers,
Wei Chen

> Jan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v2 7/9] xen/x86: rename bad_srat to numa_bad
  2022-07-12 14:25   ` Jan Beulich
@ 2022-07-14  9:17     ` Wei Chen
  0 siblings, 0 replies; 50+ messages in thread
From: Wei Chen @ 2022-07-14  9:17 UTC (permalink / raw)
  To: Jan Beulich
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: July 12, 2022 22:26
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; George Dunlap
> <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>; Stefano
> Stabellini <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v2 7/9] xen/x86: rename bad_srat to numa_bad
> 
> On 08.07.2022 16:54, Wei Chen wrote:
> > When the NUMA initialization code fails while scanning the SRAT, it
> > calls bad_srat to disable NUMA and clear the related data. But this
> > name is ACPI specific. We have moved the generically usable NUMA code
> > to common, and bad_srat came along with that code. It's not reasonable
> > for other NUMA implementations to have to implement a fallback function
> > named bad_srat. So in this patch, we rename bad_srat to numa_bad.
> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> > ---
> > v1 -> v2:
> > 1. New in v2.
> 
> I think this wants to come before moving to common code. And I'd
> also like you to consider using e.g. numa_fw_bad() or
> numa_firmware_bad() to better reflect the original purpose.
> 

Ok, I will re-order the series and update the function names.

Thanks,
Wei Chen

> Jan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v2 8/9] xen: rename acpi_scan_nodes to numa_scan_nodes
  2022-07-12 14:27   ` Jan Beulich
@ 2022-07-14  9:18     ` Wei Chen
  0 siblings, 0 replies; 50+ messages in thread
From: Wei Chen @ 2022-07-14  9:18 UTC (permalink / raw)
  To: Jan Beulich
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: July 12, 2022 22:27
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; George Dunlap
> <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>; Stefano
> Stabellini <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v2 8/9] xen: rename acpi_scan_nodes to numa_scan_nodes
> 
> On 08.07.2022 16:54, Wei Chen wrote:
> > We have moved acpi_scan_nodes from x86 to common. Because of our
> > previous work, this function no longer has many ACPI characteristics,
> > except some "SRAT" words in print messages. So we rename
> > acpi_scan_nodes to the more generic name numa_scan_nodes, and replace
> > the "SRAT" words in the print messages.
> 
> Once again I think the rename should happen before the moving to
> common code.
> 

Ack again.

Cheers,
Wei Chen

> Jan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status
  2022-07-14  9:03     ` Wei Chen
@ 2022-07-14  9:32       ` Jan Beulich
  2022-07-14  9:55         ` Wei Chen
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2022-07-14  9:32 UTC (permalink / raw)
  To: Wei Chen; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 14.07.2022 11:03, Wei Chen wrote:
> Hi Jan,
> 
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 2022年7月12日 20:49
>> To: Wei Chen <Wei.Chen@arm.com>
>> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
>> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; xen-
>> devel@lists.xenproject.org
>> Subject: Re: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA
>> status
>>
>> On 08.07.2022 16:54, Wei Chen wrote:
>>> In current code, x86 is using two variables, numa_off and acpi_numa,
>>> to indicate the NUMA status. This is because NUMA is not coupled with
>>> ACPI, and ACPI still can work without NUMA on x86. With these two
>>> variables' combinations, x86 can have several NUMA status:
>>> NUMA swith on,
>>> NUMA swith off,
>>> NUMA swith on with NUMA emulation,
>>> NUMA swith on with no-ACPI,
>>> NUMA swith on with ACPI.
>>
>> Hmm, with both this and the actual change I'm not able to convince
>> myself that you've expressed the prior combinations correctly. May
>> I suggest that you make table representing the 6 (I think)
>> combinations of original states with their mapping to the new
>> enumerators? (It doesn't need to be 6 different enumerators, but
>> all 6 existing states need a [proper] representation in the new
>> model.)
>>
> 
> Sorry for the late reply; I spent some time checking the code again
> and drew the table below. I have omitted the two columns where
> numa_off=true and acpi_numa=1/-1, because when numa_off = true the
> ACPI SRAT table will not be parsed:

While I agree with this fact, the problem is that there are two
independent command line options driving the then single variable.
IOW numa_off and acpi_numa both turned on may still need a
representation. In fact I'm not convinced we can eliminate the
original variables. What we ought to be able to do is consolidate
their values into the one single new variable you add, before
ever evaluating anything. _Then_ I think I agree with the table.

Jan

> +-----------+---------+---------------+-----------+------------+----------+
> | original  |  col1   |  col2         |  col3     |  col4      |  col5    |
> +-----------+---------+---------------+-----------+------------+----------+
> |numa_off   |true     |false          |false      |false       |false     |
> |acpi_numa  |0        |0              |1          |-1          |x         |
> |numa_fake  |x        |x              |x          |x           |fake_nodes|
> |enum state |numa_off |numa_fw_nodata |numa_fw_ok |numa_fw_bad |numa_emu  |
> +-----------+---------+---------------+-----------+------------+----------+
> 
> Notes:
> The following scenarios will make acpi_numa=0:
> 1. Xen disables ACPI through acpi_disabled = 1.
> 2. The ACPI tables don't include an SRAT, or the SRAT doesn't contain CPU
>    and memory NUMA affinity information.
> 3. NUMA is emulated through the "numa=fake" command line parameter. But I
>    found that when numa_fake is enabled, the current code does not prevent
>    the ACPI SRAT parser from parsing NUMA information, so acpi_numa may
>    still be changed later. Is there any further reason why we still need
>    to parse NUMA info from the SRAT table when numa=fake? Or can we set
>    acpi_numa = -1 in numa_setup when "numa=fake" is given, so that
>    srat_disabled prevents ACPI SRAT parsing?
> 
> acpi_numa = 1 when all of the following conditions are met:
> 1. Xen enables ACPI through acpi_disabled = 0.
> 2. The ACPI SRAT parser can successfully parse CPU and memory NUMA
>    affinities from the SRAT table.
> 3. The memory blocks pass the sanity check and hash computation in
>    acpi_scan_nodes.
> 
> The following scenarios will make acpi_numa=-1:
> 1. The ACPI table is disabled by the "numa=noacpi" command line parameter.
> 2. Xen enables ACPI through acpi_disabled = 0, but the ACPI SRAT parser
>    cannot successfully parse CPU or memory NUMA affinities from the SRAT
>    table.
> 3. The memory blocks' sanity check or hash computation in acpi_scan_nodes
>    fails.



^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status
  2022-07-14  9:32       ` Jan Beulich
@ 2022-07-14  9:55         ` Wei Chen
  2022-07-14  9:57           ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-14  9:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

Hi Jan,

> -----Original Message-----
> >>
> >
> > Sorry for the late reply; I spent some time checking the code again
> > and drew the table below. I have omitted the two columns where
> > numa_off=true and acpi_numa=1/-1, because when numa_off = true the
> > ACPI SRAT table will not be parsed:
> 
> While I agree with this fact, the problem is that there are two
> independent command line options driving the then single variable.
> IOW numa_off and acpi_numa both turned on may still need a
> representation. In fact I'm not convinced we can eliminate the
> original variables. What we ought to be able to do is consolidate
> their values into the one single new variable you add, before
> ever evaluating anything. _Then_ I think I agree with the table.
> 
> Jan
> 
> > +-----------+---------+---------------+-----------+------------+----------+
> > | original  |  col1   |  col2         |  col3     |  col4      |  col5    |
> > +-----------+---------+---------------+-----------+------------+----------+
> > |numa_off   |true     |false          |false      |false       |false     |
> > |acpi_numa  |0        |0              |1          |-1          |x         |
> > |numa_fake  |x        |x              |x          |x           |fake_nodes|
> > |enum state |numa_off |numa_fw_nodata |numa_fw_ok |numa_fw_bad |numa_emu  |
> > +-----------+---------+---------------+-----------+------------+----------+
> >

How about updating the table like this:
+------------+----------+----------------+----------------+------------+
|  original  |          |                |                |            |
+------------+----------+----------------+----------------+------------+
| numa_off   | true     | true           | true           | true       |
| acpi_numa  | 0        | 1              | -1             | x          |
| numa_fake  | x        | x              | x              | fake_nodes |
| enum state | numa_off | numa_off       | numa_off       | numa_off   |
+------------+----------+----------------+----------------+------------+

+------------+----------------+------------+-------------+------------+
|  original  |                |            |             |            |
+------------+----------------+------------+-------------+------------+
| numa_off   | false          | false      | false       | false      |
| acpi_numa  | 0              | 1          | -1          | x          |
| numa_fake  | x              | x          | x           | fake_nodes |
| enum state | numa_fw_nodata | numa_fw_ok | numa_fw_bad | numa_emu   |
+------------+----------------+------------+-------------+------------+

Cheers,
Wei Chen

> > Notes:
> > The following scenarios will make acpi_numa=0:
> > [...]
> > 3. The memory blocks' sanity check or hash computation in
> >    acpi_scan_nodes fails.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status
  2022-07-14  9:55         ` Wei Chen
@ 2022-07-14  9:57           ` Jan Beulich
  2022-07-14 10:26             ` Wei Chen
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2022-07-14  9:57 UTC (permalink / raw)
  To: Wei Chen; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 14.07.2022 11:55, Wei Chen wrote:
> Hi Jan,
> 
>> -----Original Message-----
>>>>
>>>
>>> Sorry for the late reply; I spent some time checking the code again
>>> and drew the table below. I have omitted the two columns where
>>> numa_off=true and acpi_numa=1/-1, because when numa_off = true the
>>> ACPI SRAT table will not be parsed:
>>
>> While I agree with this fact, the problem is that there are two
>> independent command line options driving the then single variable.
>> IOW numa_off and acpi_numa both turned on may still need a
>> representation. In fact I'm not convinced we can eliminate the
>> original variables. What we ought to be able to do is consolidate
>> their values into the one single new variable you add, before
>> ever evaluating anything. _Then_ I think I agree with the table.
>>
>> Jan
>>
>>> +-----------+---------+---------------+-----------+------------+----------+
>>> | original  |  col1   |  col2         |  col3     |  col4      |  col5    |
>>> +-----------+---------+---------------+-----------+------------+----------+
>>> |numa_off   |true     |false          |false      |false       |false     |
>>> |acpi_numa  |0        |0              |1          |-1          |x         |
>>> |numa_fake  |x        |x              |x          |x           |fake_nodes|
>>> |enum state |numa_off |numa_fw_nodata |numa_fw_ok |numa_fw_bad |numa_emu  |
>>> +-----------+---------+---------------+-----------+------------+----------+
>>>
> 
> How about updating the table like this:
> +------------+----------+----------------+----------------+------------+
> |  original  |          |                |                |            |
> +------------+----------+----------------+----------------+------------+
> | numa_off   | true     | true           | true           | true       |
> | acpi_numa  | 0        | 1              | -1             | x          |
> | numa_fake  | x        | x              | x              | fake_nodes |
> | enum state | numa_off | numa_off       | numa_off       | numa_off   |
> +------------+----------+----------------+----------------+------------+
> 
> +------------+----------------+------------+-------------+------------+
> |  original  |                |            |             |            |
> +------------+----------------+------------+-------------+------------+
> | numa_off   | false          | false      | false       | false      |
> | acpi_numa  | 0              | 1          | -1          | x          |
> | numa_fake  | x              | x          | x           | fake_nodes |
> | enum state | numa_fw_nodata | numa_fw_ok | numa_fw_bad | numa_emu   |
> +------------+----------------+------------+-------------+------------+

Well, this makes the table complete, but it doesn't explain how you mean
to fold the settings of the two command line options into one variable.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v2 9/9] xen: introduce a Kconfig option to configure NUMA nodes number
  2022-07-12 14:34   ` Jan Beulich
@ 2022-07-14 10:14     ` Wei Chen
  2022-07-14 11:10       ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-14 10:14 UTC (permalink / raw)
  To: Jan Beulich
  Cc: nd, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné,
	xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: July 12, 2022 22:34
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; George
> Dunlap <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>; Stefano
> Stabellini <sstabellini@kernel.org>; Wei Liu <wl@xen.org>; Roger Pau Monné
> <roger.pau@citrix.com>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v2 9/9] xen: introduce a Kconfig option to configure
> NUMA nodes number
> 
> On 08.07.2022 16:54, Wei Chen wrote:
> > --- a/xen/arch/Kconfig
> > +++ b/xen/arch/Kconfig
> > @@ -17,3 +17,14 @@ config NR_CPUS
> >  	  For CPU cores which support Simultaneous Multi-Threading or
> similar
> >  	  technologies, this the number of logical threads which Xen will
> >  	  support.
> > +
> > +config NR_NUMA_NODES
> > +	int "Maximum number of NUMA nodes supported"
> > +	range 1 255
> > +	default "64"
> > +	depends on NUMA
> 
> Does 1 make sense? That's not going to be NUMA then, I would say.
> 

Ok, we need at least 2 nodes for a real NUMA system.

> Does the value being (perhaps far) larger than NR_CPUS make sense?
> 

Arm's default NR_CPUS is 128 (except on some platforms) and x86's is
256, so I am not quite clear what you mean by "far larger". As I
understand it, a node with 2 or 4 cores is very common in a NUMA
system.

> Why does the range end at a not-power-of-2 value? (I was actually
> going to suggest to have a shift value specified here, until
> spotting that NODES_SHIFT isn't used anywhere else, and hence
> you're rightfully deleting it.)
> 

I think we discussed the 255 in v1: because Xen uses u8 as nodeid_t,
255 is the upper bound.

And from a user perspective, I don't like using a shift value. It
needs to be converted and is not intuitive enough. It also limits the
input range, even when the numerical value is reasonable. Yes, if a
machine has 15 nodes we can ask users to input 16, but why not let the
users decide, instead of the program forcing them to enter 16?

> > +	help
> > +	  Controls the build-time size of various arrays and bitmaps
> > +	  associated with multiple-nodes management. It is the upper bound
> of
> > +	  the number of NUMA nodes that the scheduler, memory allocation and
> > +          other NUMA-aware components can handle.
> 
> Nit: indentation.
> 

Ok.
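Folding the review points together (lower bound raised to 2, help-text indentation fixed), the option might end up looking like the fragment below. This is a sketch of a possible next version, not the actual follow-up patch:

```kconfig
config NR_NUMA_NODES
	int "Maximum number of NUMA nodes supported"
	range 2 255
	default "64"
	depends on NUMA
	help
	  Controls the build-time size of various arrays and bitmaps
	  associated with multiple-node management. It is the upper bound of
	  the number of NUMA nodes that the scheduler, memory allocation and
	  other NUMA-aware components can handle.
```

The 255 ceiling follows from nodeid_t being a u8, as discussed above.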

Cheers,
Wei Chen

> Jan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status
  2022-07-14  9:57           ` Jan Beulich
@ 2022-07-14 10:26             ` Wei Chen
  2022-07-14 10:37               ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-14 10:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: July 14, 2022 17:58
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; xen-
> devel@lists.xenproject.org
> Subject: Re: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA
> status
> 
> >
> > How about updating the table like this:
> > +------------+----------+----------------+----------------+------------+
> > |  original  |          |                |                |            |
> > +------------+----------+----------------+----------------+------------+
> > | numa_off   | true     | true           | true           | true       |
> > | acpi_numa  | 0        | 1              | -1             | x          |
> > | numa_fake  | x        | x              | x              | fake_nodes |
> > | enum state | numa_off | numa_off       | numa_off       | numa_off   |
> > +------------+----------+----------------+----------------+------------+
> >
> > +------------+----------------+------------+-------------+------------+
> > |  original  |                |            |             |            |
> > +------------+----------------+------------+-------------+------------+
> > | numa_off   | false          | false      | false       | false      |
> > | acpi_numa  | 0              | 1          | -1          | x          |
> > | numa_fake  | x              | x          | x           | fake_nodes |
> > | enum state | numa_fw_nodata | numa_fw_ok | numa_fw_bad | numa_emu   |
> > +------------+----------------+------------+-------------+------------+
> 
> Well, this makes the table complete, but it doesn't explain how you mean
> to fold the settings of the two command line options into one variable.
> 

No matter how many separate "numa=" parameters are parsed from the
command line, the values of these original variables are determined
once command line parsing completes. So the determined status can be
mapped to the single new variable according to the table above.

Thanks,
Wei Chen

> Jan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status
  2022-07-14 10:26             ` Wei Chen
@ 2022-07-14 10:37               ` Jan Beulich
  2022-07-14 10:49                 ` Wei Chen
  2022-07-14 10:51                 ` Jan Beulich
  0 siblings, 2 replies; 50+ messages in thread
From: Jan Beulich @ 2022-07-14 10:37 UTC (permalink / raw)
  To: Wei Chen; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 14.07.2022 12:26, Wei Chen wrote:
> Hi Jan,
> 
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: July 14, 2022 17:58
>> To: Wei Chen <Wei.Chen@arm.com>
>> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
>> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; xen-
>> devel@lists.xenproject.org
>> Subject: Re: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA
>> status
>>
>>>
>>> How about updating the table like this:
>>> +------------+----------+----------------+----------------+------------+
>>> |  original  |          |                |                |            |
>>> +------------+----------+----------------+----------------+------------+
>>> | numa_off   | true     | true           | true           | true       |
>>> | acpi_numa  | 0        | 1              | -1             | x          |
>>> | numa_fake  | x        | x              | x              | fake_nodes |
>>> | enum state | numa_off | numa_off       | numa_off       | numa_off   |
>>> +------------+----------+----------------+----------------+------------+
>>>
>>> +------------+----------------+------------+-------------+------------+
>>> |  original  |                |            |             |            |
>>> +------------+----------------+------------+-------------+------------+
>>> | numa_off   | false          | false      | false       | false      |
>>> | acpi_numa  | 0              | 1          | -1          | x          |
>>> | numa_fake  | x              | x          | x           | fake_nodes |
>>> | enum state | numa_fw_nodata | numa_fw_ok | numa_fw_bad | numa_emu   |
>>> +------------+----------------+------------+-------------+------------+
>>
>> Well, this makes the table complete, but it doesn't explain how you mean
>> to fold the settings of the two command line options into one variable.
>>
> 
> No matter how many separate "numa=" parameters are parsed from the
> command line, the values of these original variables are determined
> once command line parsing completes. So the determined status can be
> mapped to the single new variable according to the table above.

Hmm, I was partly wrong - the initial values of both variables are
indeed set from just the single "numa=" parsing. But later on they
"evolve" independently, and multiple "numa=" on the command line
can also have "interesting" effects. I'm afraid I still can't
convince myself that the new mapping fully represents all originally
possible combinations (nor can I convince myself that in the existing
code everything is working as intended).

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status
  2022-07-14 10:37               ` Jan Beulich
@ 2022-07-14 10:49                 ` Wei Chen
  2022-07-14 10:58                   ` Jan Beulich
  2022-07-14 10:51                 ` Jan Beulich
  1 sibling, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-14 10:49 UTC (permalink / raw)
  To: Jan Beulich; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: July 14, 2022 18:37
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; xen-
> devel@lists.xenproject.org
> Subject: Re: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA
> status
> >>
> >> Well, this makes the table complete, but it doesn't explain how you
> mean
> >> to fold the settings of the two command line options into one variable.
> >>
> >
> > No matter how many separate "numa=" parameters have been parsed from
> > Command line, the values of these original variables are determined
> > after parsing the command line. So the determined status can be mapped
> > to the new one variable from above table.
> 
> Hmm, I was partly wrong - the initial values of both variables are
> indeed set from just the single "numa=" parsing. But later on they
> "evolve" independently, and multiple "numa=" on the command line
> can also have "interesting" effects. I'm afraid I still can't

Can you provide some examples? This way I can better understand your
concerns.

Cheers,
Wei Chen

> convince myself that the new mapping fully represents all originally
> possible combinations (nor can I convince myself that in the existing
> code everything is working as intended).
> 
> Jan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status
  2022-07-14 10:37               ` Jan Beulich
  2022-07-14 10:49                 ` Wei Chen
@ 2022-07-14 10:51                 ` Jan Beulich
  2022-07-25  2:28                   ` Wei Chen
  1 sibling, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2022-07-14 10:51 UTC (permalink / raw)
  To: Wei Chen; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 14.07.2022 12:37, Jan Beulich wrote:
> On 14.07.2022 12:26, Wei Chen wrote:
>>> From: Jan Beulich <jbeulich@suse.com>
>>> Sent: July 14, 2022 17:58
>>> To: Wei Chen <Wei.Chen@arm.com>
>>>>
>>>> How about updating the table like this:
>>>> +------------+----------+----------------+----------------+------------+
>>>> |  original  |          |                |                |            |
>>>> +------------+----------+----------------+----------------+------------+
>>>> | numa_off   | true     | true           | true           | true       |
>>>> | acpi_numa  | 0        | 1              | -1             | x          |
>>>> | numa_fake  | x        | x              | x              | fake_nodes |
>>>> | enum state | numa_off | numa_off       | numa_off       | numa_off   |
>>>> +------------+----------+----------------+----------------+------------+
>>>>
>>>> +------------+----------------+------------+-------------+------------+
>>>> |  original  |                |            |             |            |
>>>> +------------+----------------+------------+-------------+------------+
>>>> | numa_off   | false          | false      | false       | false      |
>>>> | acpi_numa  | 0              | 1          | -1          | x          |
>>>> | numa_fake  | x              | x          | x           | fake_nodes |
>>>> | enum state | numa_fw_nodata | numa_fw_ok | numa_fw_bad | numa_emu   |
>>>> +------------+----------------+------------+-------------+------------+
>>>
>>> Well, this makes the table complete, but it doesn't explain how you mean
>>> to fold the settings of the two command line options into one variable.
>>>
>>
>> No matter how many separate "numa=" parameters are parsed from the
>> command line, the values of these original variables are determined
>> once command line parsing completes. So the determined status can be
>> mapped to the single new variable according to the table above.
> 
> Hmm, I was partly wrong - the initial values of both variables are
> indeed set from just the single "numa=" parsing. But later on they
> "evolve" independently, and multiple "numa=" on the command line
> can also have "interesting" effects. I'm afraid I still can't
> convince myself that the new mapping fully represents all originally
> possible combinations (nor can I convince myself that in the existing
> code everything is working as intended).

Maybe the solution is to make numa_off common but keep acpi_numa
arch-specific? Then e.g. the replacement of srat_disabled() could
be

int numa_disabled(void)
{
    return numa_off || arch_numa_disabled();
}

with arch_numa_disabled() evaluating acpi_numa on x86.

Jan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status
  2022-07-14 10:49                 ` Wei Chen
@ 2022-07-14 10:58                   ` Jan Beulich
  2022-07-15  7:18                     ` Wei Chen
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2022-07-14 10:58 UTC (permalink / raw)
  To: Wei Chen; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 14.07.2022 12:49, Wei Chen wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 14 July 2022 18:37
>> status
>>>>
>>>> Well, this makes the table complete, but it doesn't explain how you
>> mean
>>>> to fold the settings of the two command line options into one variable.
>>>>
>>>
>>> No matter how many separate "numa=" parameters have been parsed from
>>> Command line, the values of these original variables are determined
>>> after parsing the command line. So the determined status can be mapped
>>> to the new one variable from above table.
>>
>> Hmm, I was partly wrong - the initial values of both variables are
>> indeed set from just the single "numa=" parsing. But later on they
>> "evolve" independently, and multiple "numa=" on the command line
>> can also have "interesting" effects. I'm afraid I still can't
> 
> Can you provide some examples? This way I can better understand your
> concerns.

Take bad_srat(): you convert "acpi_numa = -1" to setting numa_no_acpi.
Yet imo (matching the present model) numa_off shouldn't be affected.
While your change is fine in practice for (current) x86, it is wrong
in the abstract model (which is relevant when making things common).

Jan



* Re: [PATCH v2 9/9] xen: introduce a Kconfig option to configure NUMA nodes number
  2022-07-14 10:14     ` Wei Chen
@ 2022-07-14 11:10       ` Jan Beulich
  2022-07-14 11:27         ` Julien Grall
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2022-07-14 11:10 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, Roger Pau Monné,
	xen-devel

On 14.07.2022 12:14, Wei Chen wrote:
> Hi Jan,
> 
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 12 July 2022 22:34
>> To: Wei Chen <Wei.Chen@arm.com>
>> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; George
>> Dunlap <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>; Stefano
>> Stabellini <sstabellini@kernel.org>; Wei Liu <wl@xen.org>; Roger Pau Monné
>> <roger.pau@citrix.com>; xen-devel@lists.xenproject.org
>> Subject: Re: [PATCH v2 9/9] xen: introduce a Kconfig option to configure
>> NUMA nodes number
>>
>> On 08.07.2022 16:54, Wei Chen wrote:
>>> --- a/xen/arch/Kconfig
>>> +++ b/xen/arch/Kconfig
>>> @@ -17,3 +17,14 @@ config NR_CPUS
>>>  	  For CPU cores which support Simultaneous Multi-Threading or
>> similar
>>>  	  technologies, this the number of logical threads which Xen will
>>>  	  support.
>>> +
>>> +config NR_NUMA_NODES
>>> +	int "Maximum number of NUMA nodes supported"
>>> +	range 1 255
>>> +	default "64"
>>> +	depends on NUMA
>>
>> Does 1 make sense? That's not going to be NUMA then, I would say.
>>
> 
> Ok, we need at least 2 nodes to be a real NUMA system.
> 
>> Does the value being (perhaps far) larger than NR_CPUS make sense?
>>
> 
> Arm's default NR_CPUS is 128 (except on some platforms) and x86's is 256.
> So I am not quite clear what you mean by "far larger"? In my
> understanding, nodes with 2 or 4 cores each are very common in a NUMA
> system.

The defaults are fine. But does it make sense to have 255 nodes when
just 32 CPUs were selected? I'm afraid kconfig is going to get in the
way, but I think I'd like the upper bound to be min(NR_CPUS, 255).

>> Why does the range end at a not-power-of-2 value? (I was actually
>> going to suggest to have a shift value specified here, until
>> spotting that NODES_SHIFT isn't used anywhere else, and hence
>> you're rightfully deleting it.)
>>
> 
> I think we discussed the 255 in v1. Because Xen uses u8 as
> nodeid_t, 255 may be the upper bound.

Indeed, but that's something you could have mentioned in the commit
message as justification. But you could also have opted to make the
upper bound 128. Let me ask you: Are you aware of systems with more
than a dozen or so nodes, that Xen can in principle run on?

> And from a user perspective, I don't like a shift value. It needs to
> be converted and is not intuitive enough. It also limits the input
> range, even when the numerical value is reasonable. Yes, if a machine
> has 15 nodes, we can ask users to input 16, but why not let the users
> decide, instead of the program forcing them to enter 16?

At least x86 Linux also wants this specified as a shift value, so
people may actually be (more) used to that. Plus non-power-of-2
values may yield more complex calculations when the compiler
generates code. Think of a two-dimensional distance table, for
example.

Jan



* Re: [PATCH v2 9/9] xen: introduce a Kconfig option to configure NUMA nodes number
  2022-07-14 11:10       ` Jan Beulich
@ 2022-07-14 11:27         ` Julien Grall
  2022-07-18  7:51           ` Wei Chen
  0 siblings, 1 reply; 50+ messages in thread
From: Julien Grall @ 2022-07-14 11:27 UTC (permalink / raw)
  To: Jan Beulich, Wei Chen
  Cc: nd, Andrew Cooper, George Dunlap, Stefano Stabellini, Wei Liu,
	Roger Pau Monné,
	xen-devel

Hi Jan,

On 14/07/2022 12:10, Jan Beulich wrote:
> On 14.07.2022 12:14, Wei Chen wrote:
>> Hi Jan,
>>
>>> -----Original Message-----
>>> From: Jan Beulich <jbeulich@suse.com>
>>> Sent: 12 July 2022 22:34
>>> To: Wei Chen <Wei.Chen@arm.com>
>>> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; George
>>> Dunlap <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>; Stefano
>>> Stabellini <sstabellini@kernel.org>; Wei Liu <wl@xen.org>; Roger Pau Monné
>>> <roger.pau@citrix.com>; xen-devel@lists.xenproject.org
>>> Subject: Re: [PATCH v2 9/9] xen: introduce a Kconfig option to configure
>>> NUMA nodes number
>>>
>>> On 08.07.2022 16:54, Wei Chen wrote:
>>>> --- a/xen/arch/Kconfig
>>>> +++ b/xen/arch/Kconfig
>>>> @@ -17,3 +17,14 @@ config NR_CPUS
>>>>   	  For CPU cores which support Simultaneous Multi-Threading or
>>> similar
>>>>   	  technologies, this the number of logical threads which Xen will
>>>>   	  support.
>>>> +
>>>> +config NR_NUMA_NODES
>>>> +	int "Maximum number of NUMA nodes supported"
>>>> +	range 1 255
>>>> +	default "64"
>>>> +	depends on NUMA
>>>
>>> Does 1 make sense? That's not going to be NUMA then, I would say.
>>>
>>
>> Ok, we need at least 2 nodes to be a real NUMA.
>>
>>> Does the value being (perhaps far) larger than NR_CPUS make sense?
>>>
>>
>> Arm has 128 default NR_CPUS (except some platforms) and x86 has 256.
>> So I am not very clear about your comments about far larger? As my
>> Understanding, one node has 2 or 4 cores are very common in a NUMA
>> System.
> 
> The defaults are fine. But does it make sense to have 255 nodes when
> just 32 CPUs were selected? I'm afraid kconfig is going to get in the
> way, but I think I'd like the upper bound to be min(NR_CPUS, 255).
Looking around, NUMA nodes with 0 CPUs exist. So I don't think we should 
tie the two.

Cheers,

-- 
Julien Grall



* RE: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status
  2022-07-14 10:58                   ` Jan Beulich
@ 2022-07-15  7:18                     ` Wei Chen
  0 siblings, 0 replies; 50+ messages in thread
From: Wei Chen @ 2022-07-15  7:18 UTC (permalink / raw)
  To: Jan Beulich; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 14 July 2022 18:58
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; xen-
> devel@lists.xenproject.org
> Subject: Re: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA
> status
> 
> On 14.07.2022 12:49, Wei Chen wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: 14 July 2022 18:37
> >> status
> >>>>
> >>>> Well, this makes the table complete, but it doesn't explain how you
> >> mean
> >>>> to fold the settings of the two command line options into one
> variable.
> >>>>
> >>>
> >>> No matter how many separate "numa=" parameters have been parsed from
> >>> Command line, the values of these original variables are determined
> >>> after parsing the command line. So the determined status can be mapped
> >>> to the new one variable from above table.
> >>
> >> Hmm, I was partly wrong - the initial values of both variables are
> >> indeed set from just the single "numa=" parsing. But later on they
> >> "evolve" independently, and multiple "numa=" on the command line
> >> can also have "interesting" effects. I'm afraid I still can't
> >
> > Can you provide some examples? This way I can better understand your
> > concerns.
> 
> Take bad_srat(): you convert "acpi_numa = -1" to setting numa_no_acpi.
> Yet imo (matching the present model) numa_off shouldn't be affected.
> While your change is fine in practice for (current) x86, it is wrong
> in the abstract model (which is relevant when making things common).
> 

Thanks, I understand your concerns now. In this case, I agree with the
suggestion from your previous e-mail:

> int numa_disabled(void)
> {
>    return numa_off || arch_numa_disabled();
> }
>
> with arch_numa_disabled() evaluating acpi_numa on x86.

I will update this patch along the lines of the above sample in the next
version. This way, I think we don't need the new enumerations and the
mapping table.

Cheers,
Wei Chen

> Jan


* RE: [PATCH v2 9/9] xen: introduce a Kconfig option to configure NUMA nodes number
  2022-07-14 11:27         ` Julien Grall
@ 2022-07-18  7:51           ` Wei Chen
  2022-07-18  8:10             ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Wei Chen @ 2022-07-18  7:51 UTC (permalink / raw)
  To: Julien Grall, Jan Beulich
  Cc: nd, Andrew Cooper, George Dunlap, Stefano Stabellini, Wei Liu,
	Roger Pau Monné,
	xen-devel

Hi Julien, Jan,

> -----Original Message-----
> From: Julien Grall <julien@xen.org>
> Sent: 14 July 2022 19:27
> To: Jan Beulich <jbeulich@suse.com>; Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; George
> Dunlap <george.dunlap@citrix.com>; Stefano Stabellini
> <sstabellini@kernel.org>; Wei Liu <wl@xen.org>; Roger Pau Monné
> <roger.pau@citrix.com>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v2 9/9] xen: introduce a Kconfig option to configure
> NUMA nodes number
> 
> Hi Jan,
> 
> On 14/07/2022 12:10, Jan Beulich wrote:
> > On 14.07.2022 12:14, Wei Chen wrote:
> >> Hi Jan,
> >>
> >>> -----Original Message-----
> >>> From: Jan Beulich <jbeulich@suse.com>
> >>> Sent: 12 July 2022 22:34
> >>> To: Wei Chen <Wei.Chen@arm.com>
> >>> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; George
> >>> Dunlap <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>;
> Stefano
> >>> Stabellini <sstabellini@kernel.org>; Wei Liu <wl@xen.org>; Roger Pau
> Monné
> >>> <roger.pau@citrix.com>; xen-devel@lists.xenproject.org
> >>> Subject: Re: [PATCH v2 9/9] xen: introduce a Kconfig option to
> configure
> >>> NUMA nodes number
> >>>
> >>> On 08.07.2022 16:54, Wei Chen wrote:
> >>>> --- a/xen/arch/Kconfig
> >>>> +++ b/xen/arch/Kconfig
> >>>> @@ -17,3 +17,14 @@ config NR_CPUS
> >>>>   	  For CPU cores which support Simultaneous Multi-Threading or
> >>> similar
> >>>>   	  technologies, this the number of logical threads which Xen
> will
> >>>>   	  support.
> >>>> +
> >>>> +config NR_NUMA_NODES
> >>>> +	int "Maximum number of NUMA nodes supported"
> >>>> +	range 1 255
> >>>> +	default "64"
> >>>> +	depends on NUMA
> >>>
> >>> Does 1 make sense? That's not going to be NUMA then, I would say.
> >>>
> >>
> >> Ok, we need at least 2 nodes to be a real NUMA.
> >>
> >>> Does the value being (perhaps far) larger than NR_CPUS make sense?
> >>>
> >>
> >> Arm has 128 default NR_CPUS (except some platforms) and x86 has 256.
> >> So I am not very clear about your comments about far larger? As my
> >> Understanding, one node has 2 or 4 cores are very common in a NUMA
> >> System.
> >
> > The defaults are fine. But does it make sense to have 255 nodes when
> > just 32 CPUs were selected? I'm afraid kconfig is going to get in the
> > way, but I think I'd like the upper bound to be min(NR_CPUS, 255).
>
> Looking around NUMA nodes with 0 CPUs exists. So I don't think we should
> tie the two.
> 

Yes, some nodes can have only RAM and some can have only CPUs.
So is it OK to use 2-255 for the node range?

> Cheers,
> 
> --
> Julien Grall


* Re: [PATCH v2 9/9] xen: introduce a Kconfig option to configure NUMA nodes number
  2022-07-18  7:51           ` Wei Chen
@ 2022-07-18  8:10             ` Jan Beulich
  2022-07-18  8:21               ` Wei Chen
  0 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2022-07-18  8:10 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, George Dunlap, Stefano Stabellini, Wei Liu,
	Roger Pau Monné,
	xen-devel, Julien Grall

On 18.07.2022 09:51, Wei Chen wrote:
>> From: Julien Grall <julien@xen.org>
>> Sent: 14 July 2022 19:27
>>
>> On 14/07/2022 12:10, Jan Beulich wrote:
>>> On 14.07.2022 12:14, Wei Chen wrote:
>>>>> From: Jan Beulich <jbeulich@suse.com>
>>>>> Sent: 12 July 2022 22:34
>>>>>
>>>>> On 08.07.2022 16:54, Wei Chen wrote:
>>>>>> --- a/xen/arch/Kconfig
>>>>>> +++ b/xen/arch/Kconfig
>>>>>> @@ -17,3 +17,14 @@ config NR_CPUS
>>>>>>   	  For CPU cores which support Simultaneous Multi-Threading or
>>>>> similar
>>>>>>   	  technologies, this the number of logical threads which Xen
>> will
>>>>>>   	  support.
>>>>>> +
>>>>>> +config NR_NUMA_NODES
>>>>>> +	int "Maximum number of NUMA nodes supported"
>>>>>> +	range 1 255
>>>>>> +	default "64"
>>>>>> +	depends on NUMA
>>>>>
>>>>> Does 1 make sense? That's not going to be NUMA then, I would say.
>>>>>
>>>>
>>>> Ok, we need at least 2 nodes to be a real NUMA.
>>>>
>>>>> Does the value being (perhaps far) larger than NR_CPUS make sense?
>>>>>
>>>>
>>>> Arm has 128 default NR_CPUS (except some platforms) and x86 has 256.
>>>> So I am not very clear about your comments about far larger? As my
>>>> Understanding, one node has 2 or 4 cores are very common in a NUMA
>>>> System.
>>>
>>> The defaults are fine. But does it make sense to have 255 nodes when
>>> just 32 CPUs were selected? I'm afraid kconfig is going to get in the
>>> way, but I think I'd like the upper bound to be min(NR_CPUS, 255).
>>
>> Looking around NUMA nodes with 0 CPUs exists. So I don't think we should
>> tie the two.
>>
> 
> Yes, some nodes can only have RAM and some nodes can only have CPUs.
> So is it ok to use 2-255 for node range?

Personally I think we shouldn't allow unreasonably high node counts,
unless proven by real hardware existing. Which goes hand in hand with
me wanting the upper bound to be a power of 2 value, if for nothing
else than a hint that using power-of-2 values here is preferable.
Hence I'd like to propose 64 or 128 as upper bound, in this order of
(my personal) preference.
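The option would then end up looking roughly like this (a sketch only, assuming the lower bound of 2 and the upper bound of 64 discussed in this thread):

```kconfig
config NR_NUMA_NODES
	int "Maximum number of NUMA nodes supported"
	range 2 64
	default "64"
	depends on NUMA
```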

Jan



* RE: [PATCH v2 9/9] xen: introduce a Kconfig option to configure NUMA nodes number
  2022-07-18  8:10             ` Jan Beulich
@ 2022-07-18  8:21               ` Wei Chen
  0 siblings, 0 replies; 50+ messages in thread
From: Wei Chen @ 2022-07-18  8:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: nd, Andrew Cooper, George Dunlap, Stefano Stabellini, Wei Liu,
	Roger Pau Monné,
	xen-devel, Julien Grall

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 18 July 2022 16:10
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; George
> Dunlap <george.dunlap@citrix.com>; Stefano Stabellini
> <sstabellini@kernel.org>; Wei Liu <wl@xen.org>; Roger Pau Monné
> <roger.pau@citrix.com>; xen-devel@lists.xenproject.org; Julien Grall
> <julien@xen.org>
> Subject: Re: [PATCH v2 9/9] xen: introduce a Kconfig option to configure
> NUMA nodes number
> 
> >>>>> Sent: 12 July 2022 22:34
> >>>>>
> >>>>> On 08.07.2022 16:54, Wei Chen wrote:
> >>>>>> --- a/xen/arch/Kconfig
> >>>>>> +++ b/xen/arch/Kconfig
> >>>>>> @@ -17,3 +17,14 @@ config NR_CPUS
> >>>>>>   	  For CPU cores which support Simultaneous Multi-Threading or
> >>>>> similar
> >>>>>>   	  technologies, this the number of logical threads which Xen
> >> will
> >>>>>>   	  support.
> >>>>>> +
> >>>>>> +config NR_NUMA_NODES
> >>>>>> +	int "Maximum number of NUMA nodes supported"
> >>>>>> +	range 1 255
> >>>>>> +	default "64"
> >>>>>> +	depends on NUMA
> >>>>>
> >>>>> Does 1 make sense? That's not going to be NUMA then, I would say.
> >>>>>
> >>>>
> >>>> Ok, we need at least 2 nodes to be a real NUMA.
> >>>>
> >>>>> Does the value being (perhaps far) larger than NR_CPUS make sense?
> >>>>>
> >>>>
> >>>> Arm has 128 default NR_CPUS (except some platforms) and x86 has 256.
> >>>> So I am not very clear about your comments about far larger? As my
> >>>> Understanding, one node has 2 or 4 cores are very common in a NUMA
> >>>> System.
> >>>
> >>> The defaults are fine. But does it make sense to have 255 nodes when
> >>> just 32 CPUs were selected? I'm afraid kconfig is going to get in the
> >>> way, but I think I'd like the upper bound to be min(NR_CPUS, 255).
> >>
> >> Looking around NUMA nodes with 0 CPUs exists. So I don't think we
> should
> >> tie the two.
> >>
> >
> > Yes, some nodes can only have RAM and some nodes can only have CPUs.
> > So is it ok to use 2-255 for node range?
> 
> Personally I think we shouldn't allow unreasonably high node counts,
> unless proven by real hardware existing. Which goes hand in hand with
> me wanting the upper bound to be a power of 2 value, if for nothing
> else than a hint that using power-of-2 values here is preferable.
> Hence I'd like to propose 64 or 128 as upper bound, in this order of
> (my personal) preference.
> 

Thanks, I will use 64 in the next version.

Wei Chen

> Jan


* Re: [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status
  2022-07-14 10:51                 ` Jan Beulich
@ 2022-07-25  2:28                   ` Wei Chen
  0 siblings, 0 replies; 50+ messages in thread
From: Wei Chen @ 2022-07-25  2:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

Hi Jan,

On 2022/7/14 18:51, Jan Beulich wrote:
>>>
>>> No matter how many separate "numa=" parameters have been parsed from
>>> Command line, the values of these original variables are determined
>>> after parsing the command line. So the determined status can be mapped
>>> to the new one variable from above table.
>>
>> Hmm, I was partly wrong - the initial values of both variables are
>> indeed set from just the single "numa=" parsing. But later on they
>> "evolve" independently, and multiple "numa=" on the command line
>> can also have "interesting" effects. I'm afraid I still can't
>> convince myself that the new mapping fully represents all originally
>> possible combinations (nor can I convince myself that in the existing
>> code everything is working as intended).
> 
> Maybe the solution is to make numa_off common but keep acpi_numa
> arch-specific? Then e.g. the replacement of srat_disabled() could
> be
> 
> int numa_disabled(void)
> {
>      return numa_off || arch_numa_disabled();
> }
> 
> with arch_numa_disabled() evaluating acpi_numa on x86.
> 

While working on the new version, I realized this may not be enough;
we may need another helper like arch_handle_numa_param to prevent
direct access to acpi_numa in numa_setup.

Cheers,
Wei Chen

> Jan



end of thread, other threads:[~2022-07-25  2:29 UTC | newest]

Thread overview: 50+ messages
-- links below jump to the message on this page --
2022-07-08 14:54 [PATCH v2 0/9] Device tree based NUMA support for Arm - Part#2 Wei Chen
2022-07-08 14:54 ` [PATCH v2 1/9] xen/x86: introduce a helper to update memory hotplug end Wei Chen
2022-07-12 11:53   ` Jan Beulich
2022-07-13  8:22     ` Wei Chen
2022-07-13  8:46       ` Jan Beulich
2022-07-13  9:55         ` Wei Chen
2022-07-13 10:03           ` Jan Beulich
2022-07-08 14:54 ` [PATCH v2 2/9] xen/x86: Use enumerations to indicate NUMA status Wei Chen
2022-07-12 12:49   ` Jan Beulich
2022-07-14  9:03     ` Wei Chen
2022-07-14  9:32       ` Jan Beulich
2022-07-14  9:55         ` Wei Chen
2022-07-14  9:57           ` Jan Beulich
2022-07-14 10:26             ` Wei Chen
2022-07-14 10:37               ` Jan Beulich
2022-07-14 10:49                 ` Wei Chen
2022-07-14 10:58                   ` Jan Beulich
2022-07-15  7:18                     ` Wei Chen
2022-07-14 10:51                 ` Jan Beulich
2022-07-25  2:28                   ` Wei Chen
2022-07-08 14:54 ` [PATCH v2 3/9] xen/x86: move generically usable NUMA code from x86 to common Wei Chen
2022-07-12 13:13   ` Jan Beulich
2022-07-13 10:16     ` Wei Chen
2022-07-13 14:10     ` Jan Beulich
2022-07-08 14:54 ` [PATCH v2 4/9] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid Wei Chen
2022-07-11  6:32   ` Jan Beulich
2022-07-12  7:20     ` Wei Chen
2022-07-12  7:36       ` Jan Beulich
2022-07-08 14:54 ` [PATCH v2 5/9] xen/x86: use arch_get_memory_map to get information from E820 map Wei Chen
2022-07-12 13:40   ` Jan Beulich
2022-07-13 10:37     ` Wei Chen
2022-07-13 13:43       ` Jan Beulich
2022-07-08 14:54 ` [PATCH v2 6/9] xen/x86: move NUMA scan nodes codes from x86 to common Wei Chen
2022-07-12 14:21   ` Jan Beulich
2022-07-13 10:57     ` Wei Chen
2022-07-13 13:49       ` Jan Beulich
2022-07-08 14:54 ` [PATCH v2 7/9] xen/x86: rename bad_srat to numa_bad Wei Chen
2022-07-12 14:25   ` Jan Beulich
2022-07-14  9:17     ` Wei Chen
2022-07-08 14:54 ` [PATCH v2 8/9] xen: rename acpi_scan_nodes to numa_scan_nodes Wei Chen
2022-07-12 14:27   ` Jan Beulich
2022-07-14  9:18     ` Wei Chen
2022-07-08 14:54 ` [PATCH v2 9/9] xen: introduce a Kconfig option to configure NUMA nodes number Wei Chen
2022-07-12 14:34   ` Jan Beulich
2022-07-14 10:14     ` Wei Chen
2022-07-14 11:10       ` Jan Beulich
2022-07-14 11:27         ` Julien Grall
2022-07-18  7:51           ` Wei Chen
2022-07-18  8:10             ` Jan Beulich
2022-07-18  8:21               ` Wei Chen
