All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/6] Device tree based NUMA support for Arm - Part#2
@ 2022-09-02  3:31 Wei Chen
  2022-09-02  3:31 ` [PATCH v4 1/6] xen/x86: Provide helpers for common code to access acpi_numa Wei Chen
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Wei Chen @ 2022-09-02  3:31 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

(The Arm device tree based NUMA support patch set contains 35
patches. In order to make stuff easier for reviewers, I split
them into 3 parts:
1. Preparation. I have re-sorted the patch series. And moved
   independent patches to the head of the series - merged in [1]
2. Move generically usable code from x86 to common - this series.
3. Add new code to support Arm.

This series only contains the second part patches. As the whole NUMA
series has been reviewed for 1 round in [2], so this series would
be v3)

Xen memory allocation and scheduler modules are NUMA aware.
But actually, on x86 has implemented the architecture APIs
to support NUMA. Arm was providing a set of fake architecture
APIs to make it compatible with NUMA awared memory allocation
and scheduler.

Arm system was working well as a single node NUMA system with
these fake APIs, because we didn't have multiple nodes NUMA
system on Arm. But in recent years, more and more Arm devices
support multiple nodes NUMA system.

So now we have a new problem. When Xen is running on these Arm
devices, Xen still treat them as single node SMP systems. The
NUMA affinity capability of Xen memory allocation and scheduler
becomes meaningless. Because they rely on input data that does
not reflect real NUMA layout.

Xen still think the access time for all of the memory is the
same for all CPUs. However, Xen may allocate memory to a VM
from different NUMA nodes with different access speeds. This
difference can be amplified in workloads inside VM, causing
performance instability and timeouts.

So in this patch series, we implement a set of NUMA API to use
device tree to describe the NUMA layout. We reuse most of the
code of x86 NUMA to create and maintain the mapping between
memory and CPU, create the matrix between any two NUMA nodes.
Except ACPI and some x86 specified code, we have moved other
code to common. In next stage, when we implement ACPI based
NUMA for Arm64, we may move the ACPI NUMA code to common too,
but in current stage, we keep it as x86 only.

This patch serires has been tested and booted well on one
Arm64 NUMA machine and one HPE x86 NUMA machine.

[1] https://lists.xenproject.org/archives/html/xen-devel/2022-06/msg00499.html
[2] https://lists.xenproject.org/archives/html/xen-devel/2021-09/msg01903.html

---
v3 -> v4:
 1. Add init_as_disable as arch_numa_disabled parameter in the patche
    where use it.
 2. Drop unnecessary "else" from arch_numa_setup, and fix its
   indentation.
 3. Restore compute_hash_shift's return value to int.
 4. Remove unnecessary parentheses for macros.
 5. Use unsigned int for proper variables.
 6. Fix some code-style.
 7. Move arch_get_ram_range function comment to header file.
 8. Use bool for found, and add a new "err" for the return
    value of arch_get_ram_range.
 9. Use -ENODATA instead of -EINVAL for non-RAM type ranges.
10. Use bool as return value for functions that only return
    0/1 or 0/-EINVAL.
11. Move mem_hotplug to a proper place in mm.h
12. Remove useless "size" in numa_scan_nodes.
13. Add CONFIG_HAS_NUMA_NODE_FWID to gate print the mapping
    between node id and architectural node id (fw node id).

v2 -> v3:
 1. Drop enumeration of numa status.
 2. Use helpers to get/update acpi_numa.
 3. Insert spaces among parameters of strncmp in numa_setup.
 4. Drop helpers to access mem_hotplug. Export mem_hotplug for all arch.
 5. Remove acpi.h from common/numa.c.
 6. Rename acpi_scan_nodes to numa_scan_nodes.
 7. Replace u8 by uint8_t for memnodemap.
 8. Use unsigned int for memnode_shift and adjust related functions
    (compute_hash_shift, populate_memnodemap) to use correct types for
    return values or parameters.
 9. Use nodeid_t for nodeid and node numbers.
10. Use __read_mostly and __ro_after_init for appropriate variables.
11. Adjust the __read_mostly and __initdata location for some variables.
12. Convert from plain int to unsigned for cpuid and other proper 
13. Remove unnecessary change items in history.
14. Rename arch_get_memory_map to arch_get_ram_range.
15. Use -ENOENT instead of -ENODEV to indicate end of memory map.
16. Add description to code comment that arch_get_ram_range returns
    RAM range in [start, end) format.
17. Rename bad_srat to numa_fw_bad.
18. Rename node_to_pxm to numa_node_to_arch_nid.
19. Merge patch#7 and #8 into patch#6.
20. Move NR_NODE_MEMBLKS from x86/acpi.h to common/numa.h
22. Use 2-64 for node range.

v1 -> v2:
 1. Refine the commit messages of several patches.
 2. Merge v1 patch#9,10 into one patch. Introduce the new functions
    in the same patch that this patch will be used first time.
 3. Fold if ( end > mem_hotplug ) to mem_hotplug_update_boundary,
    in this case, we can drop mem_hotplug_boundary.
 4. Remove fw_numa, use enumeration to replace numa_off and acpi_numa.
 5. Correct return value of srat_disabled.
 6. Introduce numa_enabled_with_firmware.
 7. Refine the justification of using !node_data[nid].node_spanned_pages.
 8. Use ASSERT to replace VIRTUAL_BUG_ON in phys_to_nid.
 9. Adjust the conditional express for ASSERT.
10. Move MAX_NUMNODES from xen/numa.h to asm/numa.h for x86.
11. Use conditional macro to gate MAX_NUMNODES for other architectures.
12. Use arch_get_memory_map to replace arch_get_memory_bank_range
    and arch_get_memory_bank_number.
13. Remove the !start || !end check, because caller guarantee
    these two pointers will not be NULL.
14. Add code comment for numa_update_node_memblks to explain:
    Assumes all memory regions belonging to a single node
    are in one chunk. Holes between them will be included
    in the node.
15. Merge this single patch instead of serval patches to move
    x86 SRAT code to common.
16. Export node_to_pxm to keep pxm information in NUMA scan
    nodes error messages.
17. Change the code style to target file's Xen code-style.
18. Adjust some __init and __initdata for some functions and
    variables.
19. Replace CONFIG_ACPI_NUMA by CONFIG_NUMA. Replace "SRAT" texts.
20. Turn numa_scan_nodes to static.
21. Change NR_NUMA_NODES upper bound from 4095 to 255.

Wei Chen (6):
  xen/x86: Provide helpers for common code to access acpi_numa
  xen/x86: move generically usable NUMA code from x86 to common
  xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid
  xen/x86: use arch_get_ram_range to get information from E820 map
  xen/x86: move NUMA scan nodes codes from x86 to common
  xen: introduce a Kconfig option to configure NUMA nodes number

 xen/arch/Kconfig                 |  11 +
 xen/arch/x86/include/asm/acpi.h  |   2 -
 xen/arch/x86/include/asm/mm.h    |   2 -
 xen/arch/x86/include/asm/numa.h  |  61 +--
 xen/arch/x86/include/asm/setup.h |   1 -
 xen/arch/x86/mm.c                |   2 -
 xen/arch/x86/numa.c              | 438 +-----------------
 xen/arch/x86/smpboot.c           |   2 +-
 xen/arch/x86/srat.c              | 311 ++-----------
 xen/common/Makefile              |   1 +
 xen/common/numa.c                | 759 +++++++++++++++++++++++++++++++
 xen/common/page_alloc.c          |   2 +
 xen/drivers/acpi/Kconfig         |   1 +
 xen/include/xen/mm.h             |   2 +
 xen/include/xen/numa.h           |  97 +++-
 15 files changed, 917 insertions(+), 775 deletions(-)
 create mode 100644 xen/common/numa.c

-- 
2.25.1



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v4 1/6] xen/x86: Provide helpers for common code to access acpi_numa
  2022-09-02  3:31 [PATCH v4 0/6] Device tree based NUMA support for Arm - Part#2 Wei Chen
@ 2022-09-02  3:31 ` Wei Chen
  2022-09-08  8:22   ` Jan Beulich
  2022-09-02  3:31 ` [PATCH v4 2/6] xen/x86: move generically usable NUMA code from x86 to common Wei Chen
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Wei Chen @ 2022-09-02  3:31 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

acpi_numa is a specific NUMA switch for ACPI NUMA implementation.
Other NUMA implementation may not need this switch. But this switch is
not only used by ACPI code, it is also used directly in some general
NUMA logic code. So far this hasn't caused any problem because Xen only
has x86 implementing ACPI NUMA, but now Arm is implementing device tree
based NUMA. Accesssing acpi_numa directly in some functions will be a
block of reusing NUMA common code. It is also difficult for us to replace
it with a new generic switch, because it is hard to prove that the new
switch states can guarantee the original code will work correctly.

So in this patch, we provide two helpers for common code to update and
get states of acpi_numa. And other new NUMA implementations just need
to provide the same helpers for common code. In this case, the generic
NUMA logic code can be reused by all NUMA implementations.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
v3 -> v4:
1. Drop parameter from arch_numa_disabled, the parameter will be
   introduced in later patch where use it.
2. Drop unnecessary "else" from arch_numa_setup, and fix its
   indentation.
v2 -> v3:
1. Drop enumeration of numa status.
2. Use helpers to get/update acpi_numa.
3. Insert spaces among parameters of strncmp in numa_setup.
v1 -> v2:
1. Remove fw_numa.
2. Use enumeration to replace numa_off and acpi_numa.
3. Correct return value of srat_disabled.
4. Introduce numa_enabled_with_firmware.
---
 xen/arch/x86/include/asm/numa.h |  5 +++--
 xen/arch/x86/numa.c             | 38 ++++++++++++++++++++++-----------
 2 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/xen/arch/x86/include/asm/numa.h b/xen/arch/x86/include/asm/numa.h
index c32ccffde3..237f2c6dbf 100644
--- a/xen/arch/x86/include/asm/numa.h
+++ b/xen/arch/x86/include/asm/numa.h
@@ -32,8 +32,9 @@ extern void numa_add_cpu(int cpu);
 extern void numa_init_array(void);
 extern bool numa_off;
 
-
-extern int srat_disabled(void);
+extern int arch_numa_setup(const char *opt);
+extern bool arch_numa_disabled(void);
+extern bool srat_disabled(void);
 extern void numa_set_node(int cpu, nodeid_t node);
 extern nodeid_t setup_node(unsigned int pxm);
 extern void srat_detect_node(int cpu);
diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 627ae8aa95..a5bc7a78c9 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -50,9 +50,28 @@ nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
 bool numa_off;
 s8 acpi_numa = 0;
 
-int srat_disabled(void)
+int __init arch_numa_setup(const char *opt)
 {
-    return numa_off || acpi_numa < 0;
+#ifdef CONFIG_ACPI_NUMA
+    if ( !strncmp(opt, "noacpi", 6) )
+    {
+        numa_off = false;
+        acpi_numa = -1;
+        return 0;
+    }
+#endif
+
+    return -EINVAL;
+}
+
+bool arch_numa_disabled(void)
+{
+    return acpi_numa < 0;
+}
+
+bool srat_disabled(void)
+{
+    return numa_off || arch_numa_disabled();
 }
 
 /*
@@ -291,28 +310,21 @@ void numa_set_node(int cpu, nodeid_t node)
 /* [numa=off] */
 static int __init cf_check numa_setup(const char *opt)
 {
-    if ( !strncmp(opt,"off",3) )
+    if ( !strncmp(opt, "off", 3) )
         numa_off = true;
-    else if ( !strncmp(opt,"on",2) )
+    else if ( !strncmp(opt, "on", 2) )
         numa_off = false;
 #ifdef CONFIG_NUMA_EMU
     else if ( !strncmp(opt, "fake=", 5) )
     {
         numa_off = false;
-        numa_fake = simple_strtoul(opt+5,NULL,0);
+        numa_fake = simple_strtoul(opt + 5, NULL, 0);
         if ( numa_fake >= MAX_NUMNODES )
             numa_fake = MAX_NUMNODES;
     }
-#endif
-#ifdef CONFIG_ACPI_NUMA
-    else if ( !strncmp(opt,"noacpi",6) )
-    {
-        numa_off = false;
-        acpi_numa = -1;
-    }
 #endif
     else
-        return -EINVAL;
+        return arch_numa_setup(opt);
 
     return 0;
 } 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 2/6] xen/x86: move generically usable NUMA code from x86 to common
  2022-09-02  3:31 [PATCH v4 0/6] Device tree based NUMA support for Arm - Part#2 Wei Chen
  2022-09-02  3:31 ` [PATCH v4 1/6] xen/x86: Provide helpers for common code to access acpi_numa Wei Chen
@ 2022-09-02  3:31 ` Wei Chen
  2022-09-08  9:09   ` Jan Beulich
  2022-09-02  3:31 ` [PATCH v4 3/6] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid Wei Chen
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Wei Chen @ 2022-09-02  3:31 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

There are some codes in x86/numa.c can be shared by common
architectures to implememnt NUMA support. Just like some
variables and functions to check and store NUMA memory map.
And some variables and functions to do NUMA initialization.

In this patch, we move them to common/numa.c and xen/numa.h
and use the CONFIG_NUMA to gate them for non-NUMA supported
architectures. As the target header file is Xen-style, so
we trim some spaces and replace tabs for the codes that has
been moved to xen/numa.h at the same time.

As acpi_scan_nodes has been used in a common function, it
doesn't make sense to use acpi_xxx in common code, so we
rename it to numa_scan_nodes in this patch too. After that
if we still use CONFIG_ACPI_NUMA in to gate numa_scan_nodes
in numa_initmem_init, that doesn't make sense. As CONFIG_NUMA
will be selected by CONFIG_ACPI_NUMA for x86. So, we replace
CONFIG_ACPI_NUMA by CONFIG_NUMA to gate numa_scan_nodes.

As arch_numa_disabled has been implememnted for ACPI NUMA,
we can rename srat_disabled to numa_disabled and move it
to common code as well.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
v3 -> v4:
 1. Restore compute_hash_shift's return value to int.
 2. Remove unnecessary parentheses for macros.
 3. Use unsigned int for proper variables.
 4. Fix some code-style.
v2 -> v3:
 1. Remove acpi.h from common/numa.c.
 2. Rename acpi_scan_nodes to numa_scan_nodes.
 3. Replace u8 by uint8_t for memnodemap.
 4. Use unsigned int for memnode_shift and adjust related functions
    (compute_hash_shift, populate_memnodemap) to use correct types for
    return values or parameters.
 5. Use nodeid_t for nodeid and node numbers.
 6. Use __read_mostly and __ro_after_init for appropriate variables.
 7. Adjust the __read_mostly and __initdata location for some variables.
 8. convert from plain int to unsigned for cpuid and other proper variables.
 9. Use __attribute_pure__ instead of __attribute__((pure)).
10. Replace CONFIG_ACPI_NUMA by CONFIG_NUMA in numa_initmem_init.
11. Add const for some functions' parameters.
12. Move srat_disabled to common code with new name numa_disabled.
13. Fix some spaces code-style for numa_emulation.
14. Change from int to unsigned int for numa_fake.
v1 -> v2:
1. New patch in v2.
---
 xen/arch/x86/include/asm/acpi.h  |   1 -
 xen/arch/x86/include/asm/numa.h  |  57 +---
 xen/arch/x86/include/asm/setup.h |   1 -
 xen/arch/x86/numa.c              | 430 +-----------------------------
 xen/arch/x86/smpboot.c           |   2 +-
 xen/arch/x86/srat.c              |   8 +-
 xen/common/Makefile              |   1 +
 xen/common/numa.c                | 442 +++++++++++++++++++++++++++++++
 xen/include/xen/numa.h           |  67 +++++
 9 files changed, 517 insertions(+), 492 deletions(-)
 create mode 100644 xen/common/numa.c

diff --git a/xen/arch/x86/include/asm/acpi.h b/xen/arch/x86/include/asm/acpi.h
index 9a9cc4c240..5c2dd5da2d 100644
--- a/xen/arch/x86/include/asm/acpi.h
+++ b/xen/arch/x86/include/asm/acpi.h
@@ -102,7 +102,6 @@ extern unsigned long acpi_wakeup_address;
 #define ARCH_HAS_POWER_INIT	1
 
 extern s8 acpi_numa;
-extern int acpi_scan_nodes(u64 start, u64 end);
 #define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
 
 extern struct acpi_sleep_info acpi_sinfo;
diff --git a/xen/arch/x86/include/asm/numa.h b/xen/arch/x86/include/asm/numa.h
index 237f2c6dbf..6c87942d43 100644
--- a/xen/arch/x86/include/asm/numa.h
+++ b/xen/arch/x86/include/asm/numa.h
@@ -9,72 +9,17 @@ typedef u8 nodeid_t;
 
 extern int srat_rev;
 
-extern nodeid_t      cpu_to_node[NR_CPUS];
-extern cpumask_t     node_to_cpumask[];
-
-#define cpu_to_node(cpu)		(cpu_to_node[cpu])
-#define parent_node(node)		(node)
-#define node_to_first_cpu(node)  (__ffs(node_to_cpumask[node]))
-#define node_to_cpumask(node)    (node_to_cpumask[node])
-
-struct node { 
-	paddr_t start, end;
-};
-
-extern int compute_hash_shift(struct node *nodes, int numnodes,
-			      nodeid_t *nodeids);
 extern nodeid_t pxm_to_node(unsigned int pxm);
 
 #define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
-#define VIRTUAL_BUG_ON(x) 
 
-extern void numa_add_cpu(int cpu);
-extern void numa_init_array(void);
-extern bool numa_off;
-
-extern int arch_numa_setup(const char *opt);
-extern bool arch_numa_disabled(void);
-extern bool srat_disabled(void);
-extern void numa_set_node(int cpu, nodeid_t node);
+extern bool numa_disabled(void);
 extern nodeid_t setup_node(unsigned int pxm);
 extern void srat_detect_node(int cpu);
 
-extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end);
 extern nodeid_t apicid_to_node[];
 extern void init_cpu_to_node(void);
 
-static inline void clear_node_cpumask(int cpu)
-{
-	cpumask_clear_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]);
-}
-
-/* Simple perfect hash to map pdx to node numbers */
-extern int memnode_shift; 
-extern unsigned long memnodemapsize;
-extern u8 *memnodemap;
-
-struct node_data {
-    unsigned long node_start_pfn;
-    unsigned long node_spanned_pages;
-};
-
-extern struct node_data node_data[];
-
-static inline __attribute__((pure)) nodeid_t phys_to_nid(paddr_t addr)
-{ 
-	nodeid_t nid;
-	VIRTUAL_BUG_ON((paddr_to_pdx(addr) >> memnode_shift) >= memnodemapsize);
-	nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift]; 
-	VIRTUAL_BUG_ON(nid >= MAX_NUMNODES || !node_data[nid]); 
-	return nid; 
-} 
-
-#define NODE_DATA(nid)		(&(node_data[nid]))
-
-#define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)
-#define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)
-#define node_end_pfn(nid)       (NODE_DATA(nid)->node_start_pfn + \
-				 NODE_DATA(nid)->node_spanned_pages)
 #define arch_want_default_dmazone() (num_online_nodes() > 1)
 
 extern int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node);
diff --git a/xen/arch/x86/include/asm/setup.h b/xen/arch/x86/include/asm/setup.h
index 21037b7f31..ae470ea12f 100644
--- a/xen/arch/x86/include/asm/setup.h
+++ b/xen/arch/x86/include/asm/setup.h
@@ -20,7 +20,6 @@ void early_time_init(void);
 
 void set_nr_cpu_ids(unsigned int max_cpus);
 
-void numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn);
 void arch_init_memory(void);
 void subarch_init_memory(void);
 
diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index a5bc7a78c9..90b2a22591 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -4,20 +4,11 @@
  * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com>
  */ 
 
-#include <xen/mm.h>
-#include <xen/string.h>
 #include <xen/init.h>
-#include <xen/ctype.h>
+#include <xen/mm.h>
 #include <xen/nodemask.h>
 #include <xen/numa.h>
-#include <xen/keyhandler.h>
-#include <xen/param.h>
-#include <xen/time.h>
-#include <xen/smp.h>
-#include <xen/pfn.h>
 #include <asm/acpi.h>
-#include <xen/sched.h>
-#include <xen/softirq.h>
 
 #ifndef Dprintk
 #define Dprintk(x...)
@@ -26,28 +17,13 @@
 /* from proto.h */
 #define round_up(x,y) ((((x)+(y))-1) & (~((y)-1)))
 
-struct node_data node_data[MAX_NUMNODES];
-
-/* Mapping from pdx to node id */
-int memnode_shift;
-static typeof(*memnodemap) _memnodemap[64];
-unsigned long memnodemapsize;
-u8 *memnodemap;
-
-nodeid_t cpu_to_node[NR_CPUS] __read_mostly = {
-    [0 ... NR_CPUS-1] = NUMA_NO_NODE
-};
 /*
  * Keep BIOS's CPU2node information, should not be used for memory allocaion
  */
 nodeid_t apicid_to_node[MAX_LOCAL_APIC] = {
     [0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
 };
-cpumask_t node_to_cpumask[MAX_NUMNODES] __read_mostly;
 
-nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
-
-bool numa_off;
 s8 acpi_numa = 0;
 
 int __init arch_numa_setup(const char *opt)
@@ -69,267 +45,6 @@ bool arch_numa_disabled(void)
     return acpi_numa < 0;
 }
 
-bool srat_disabled(void)
-{
-    return numa_off || arch_numa_disabled();
-}
-
-/*
- * Given a shift value, try to populate memnodemap[]
- * Returns :
- * 1 if OK
- * 0 if memnodmap[] too small (of shift too small)
- * -1 if node overlap or lost ram (shift too big)
- */
-static int __init populate_memnodemap(const struct node *nodes,
-                                      int numnodes, int shift, nodeid_t *nodeids)
-{
-    unsigned long spdx, epdx;
-    int i, res = -1;
-
-    memset(memnodemap, NUMA_NO_NODE, memnodemapsize * sizeof(*memnodemap));
-    for ( i = 0; i < numnodes; i++ )
-    {
-        spdx = paddr_to_pdx(nodes[i].start);
-        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
-        if ( spdx >= epdx )
-            continue;
-        if ( (epdx >> shift) >= memnodemapsize )
-            return 0;
-        do {
-            if ( memnodemap[spdx >> shift] != NUMA_NO_NODE )
-                return -1;
-
-            if ( !nodeids )
-                memnodemap[spdx >> shift] = i;
-            else
-                memnodemap[spdx >> shift] = nodeids[i];
-
-            spdx += (1UL << shift);
-        } while ( spdx < epdx );
-        res = 1;
-    }
-
-    return res;
-}
-
-static int __init allocate_cachealigned_memnodemap(void)
-{
-    unsigned long size = PFN_UP(memnodemapsize * sizeof(*memnodemap));
-    unsigned long mfn = mfn_x(alloc_boot_pages(size, 1));
-
-    memnodemap = mfn_to_virt(mfn);
-    mfn <<= PAGE_SHIFT;
-    size <<= PAGE_SHIFT;
-    printk(KERN_DEBUG "NUMA: Allocated memnodemap from %lx - %lx\n",
-           mfn, mfn + size);
-    memnodemapsize = size / sizeof(*memnodemap);
-
-    return 0;
-}
-
-/*
- * The LSB of all start and end addresses in the node map is the value of the
- * maximum possible shift.
- */
-static int __init extract_lsb_from_nodes(const struct node *nodes,
-                                         int numnodes)
-{
-    int i, nodes_used = 0;
-    unsigned long spdx, epdx;
-    unsigned long bitfield = 0, memtop = 0;
-
-    for ( i = 0; i < numnodes; i++ )
-    {
-        spdx = paddr_to_pdx(nodes[i].start);
-        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
-        if ( spdx >= epdx )
-            continue;
-        bitfield |= spdx;
-        nodes_used++;
-        if ( epdx > memtop )
-            memtop = epdx;
-    }
-    if ( nodes_used <= 1 )
-        i = BITS_PER_LONG - 1;
-    else
-        i = find_first_bit(&bitfield, sizeof(unsigned long)*8);
-    memnodemapsize = (memtop >> i) + 1;
-    return i;
-}
-
-int __init compute_hash_shift(struct node *nodes, int numnodes,
-                              nodeid_t *nodeids)
-{
-    int shift;
-
-    shift = extract_lsb_from_nodes(nodes, numnodes);
-    if ( memnodemapsize <= ARRAY_SIZE(_memnodemap) )
-        memnodemap = _memnodemap;
-    else if ( allocate_cachealigned_memnodemap() )
-        return -1;
-    printk(KERN_DEBUG "NUMA: Using %d for the hash shift.\n", shift);
-
-    if ( populate_memnodemap(nodes, numnodes, shift, nodeids) != 1 )
-    {
-        printk(KERN_INFO "Your memory is not aligned you need to "
-               "rebuild your hypervisor with a bigger NODEMAPSIZE "
-               "shift=%d\n", shift);
-        return -1;
-    }
-
-    return shift;
-}
-/* initialize NODE_DATA given nodeid and start/end */
-void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end)
-{
-    unsigned long start_pfn = paddr_to_pfn(start);
-    unsigned long end_pfn = paddr_to_pfn(end);
-
-    NODE_DATA(nodeid)->node_start_pfn = start_pfn;
-    NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
-
-    node_set_online(nodeid);
-} 
-
-void __init numa_init_array(void)
-{
-    int rr, i;
-
-    /* There are unfortunately some poorly designed mainboards around
-       that only connect memory to a single CPU. This breaks the 1:1 cpu->node
-       mapping. To avoid this fill in the mapping for all possible
-       CPUs, as the number of CPUs is not known yet.
-       We round robin the existing nodes. */
-    rr = first_node(node_online_map);
-    for ( i = 0; i < nr_cpu_ids; i++ )
-    {
-        if ( cpu_to_node[i] != NUMA_NO_NODE )
-            continue;
-        numa_set_node(i, rr);
-        rr = cycle_node(rr, node_online_map);
-    }
-}
-
-#ifdef CONFIG_NUMA_EMU
-static int numa_fake __initdata = 0;
-
-/* Numa emulation */
-static int __init numa_emulation(unsigned long start_pfn,
-                                 unsigned long end_pfn)
-{
-    int i;
-    struct node nodes[MAX_NUMNODES];
-    uint64_t sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake;
-
-    /* Kludge needed for the hash function */
-    if ( hweight64(sz) > 1 )
-    {
-        u64 x = 1;
-        while ( (x << 1) < sz )
-            x <<= 1;
-        if ( x < sz/2 )
-            printk(KERN_ERR "Numa emulation unbalanced. Complain to maintainer\n");
-        sz = x;
-    }
-
-    memset(&nodes,0,sizeof(nodes));
-    for ( i = 0; i < numa_fake; i++ )
-    {
-        nodes[i].start = pfn_to_paddr(start_pfn) + i * sz;
-        if ( i == numa_fake - 1 )
-            sz = pfn_to_paddr(end_pfn) - nodes[i].start;
-        nodes[i].end = nodes[i].start + sz;
-        printk(KERN_INFO "Faking node %d at %"PRIx64"-%"PRIx64" (%"PRIu64"MB)\n",
-               i,
-               nodes[i].start, nodes[i].end,
-               (nodes[i].end - nodes[i].start) >> 20);
-        node_set_online(i);
-    }
-    memnode_shift = compute_hash_shift(nodes, numa_fake, NULL);
-    if ( memnode_shift < 0 )
-    {
-        memnode_shift = 0;
-        printk(KERN_ERR "No NUMA hash function found. Emulation disabled.\n");
-        return -1;
-    }
-    for_each_online_node ( i )
-        setup_node_bootmem(i, nodes[i].start, nodes[i].end);
-    numa_init_array();
-
-    return 0;
-}
-#endif
-
-void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
-{ 
-    int i;
-    paddr_t start = pfn_to_paddr(start_pfn);
-    paddr_t end = pfn_to_paddr(end_pfn);
-
-#ifdef CONFIG_NUMA_EMU
-    if ( numa_fake && !numa_emulation(start_pfn, end_pfn) )
-        return;
-#endif
-
-#ifdef CONFIG_ACPI_NUMA
-    if ( !numa_off && !acpi_scan_nodes(start, end) )
-        return;
-#endif
-
-    printk(KERN_INFO "%s\n",
-           numa_off ? "NUMA turned off" : "No NUMA configuration found");
-
-    printk(KERN_INFO "Faking a node at %"PRIpaddr"-%"PRIpaddr"\n",
-           start, end);
-    /* setup dummy node covering all memory */
-    memnode_shift = BITS_PER_LONG - 1;
-    memnodemap = _memnodemap;
-    /* Dummy node only uses 1 slot in reality */
-    memnodemap[0] = 0;
-    memnodemapsize = 1;
-
-    nodes_clear(node_online_map);
-    node_set_online(0);
-    for ( i = 0; i < nr_cpu_ids; i++ )
-        numa_set_node(i, 0);
-    cpumask_copy(&node_to_cpumask[0], cpumask_of(0));
-    setup_node_bootmem(0, start, end);
-}
-
-void numa_add_cpu(int cpu)
-{
-    cpumask_set_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]);
-} 
-
-void numa_set_node(int cpu, nodeid_t node)
-{
-    cpu_to_node[cpu] = node;
-}
-
-/* [numa=off] */
-static int __init cf_check numa_setup(const char *opt)
-{
-    if ( !strncmp(opt, "off", 3) )
-        numa_off = true;
-    else if ( !strncmp(opt, "on", 2) )
-        numa_off = false;
-#ifdef CONFIG_NUMA_EMU
-    else if ( !strncmp(opt, "fake=", 5) )
-    {
-        numa_off = false;
-        numa_fake = simple_strtoul(opt + 5, NULL, 0);
-        if ( numa_fake >= MAX_NUMNODES )
-            numa_fake = MAX_NUMNODES;
-    }
-#endif
-    else
-        return arch_numa_setup(opt);
-
-    return 0;
-} 
-custom_param("numa", numa_setup);
-
 /*
  * Setup early cpu_to_node.
  *
@@ -378,146 +93,3 @@ unsigned int __init arch_get_dma_bitsize(void)
                  flsl(node_start_pfn(node) + node_spanned_pages(node) / 4 - 1)
                  + PAGE_SHIFT, 32);
 }
-
-static void cf_check dump_numa(unsigned char key)
-{
-    s_time_t now = NOW();
-    unsigned int i, j, n;
-    struct domain *d;
-    struct page_info *page;
-    unsigned int page_num_node[MAX_NUMNODES];
-    const struct vnuma_info *vnuma;
-
-    printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key,
-           now);
-
-    for_each_online_node ( i )
-    {
-        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
-
-        printk("NODE%u start->%lu size->%lu free->%lu\n",
-               i, node_start_pfn(i), node_spanned_pages(i),
-               avail_node_heap_pages(i));
-        /* sanity check phys_to_nid() */
-        if ( phys_to_nid(pa) != i )
-            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
-                   pa, phys_to_nid(pa), i);
-    }
-
-    j = cpumask_first(&cpu_online_map);
-    n = 0;
-    for_each_online_cpu ( i )
-    {
-        if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] )
-        {
-            if ( n > 1 )
-                printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
-            else
-                printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
-            j = i;
-            n = 1;
-        }
-        else
-            ++n;
-    }
-    if ( n > 1 )
-        printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
-    else
-        printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
-
-    rcu_read_lock(&domlist_read_lock);
-
-    printk("Memory location of each domain:\n");
-    for_each_domain ( d )
-    {
-        process_pending_softirqs();
-
-        printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d));
-
-        for_each_online_node ( i )
-            page_num_node[i] = 0;
-
-        spin_lock(&d->page_alloc_lock);
-        page_list_for_each(page, &d->page_list)
-        {
-            i = phys_to_nid(page_to_maddr(page));
-            page_num_node[i]++;
-        }
-        spin_unlock(&d->page_alloc_lock);
-
-        for_each_online_node ( i )
-            printk("    Node %u: %u\n", i, page_num_node[i]);
-
-        if ( !read_trylock(&d->vnuma_rwlock) )
-            continue;
-
-        if ( !d->vnuma )
-        {
-            read_unlock(&d->vnuma_rwlock);
-            continue;
-        }
-
-        vnuma = d->vnuma;
-        printk("     %u vnodes, %u vcpus, guest physical layout:\n",
-               vnuma->nr_vnodes, d->max_vcpus);
-        for ( i = 0; i < vnuma->nr_vnodes; i++ )
-        {
-            unsigned int start_cpu = ~0U;
-
-            if ( vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
-                printk("       %3u: pnode ???,", i);
-            else
-                printk("       %3u: pnode %3u,", i, vnuma->vnode_to_pnode[i]);
-
-            printk(" vcpus ");
-
-            for ( j = 0; j < d->max_vcpus; j++ )
-            {
-                if ( !(j & 0x3f) )
-                    process_pending_softirqs();
-
-                if ( vnuma->vcpu_to_vnode[j] == i )
-                {
-                    if ( start_cpu == ~0U )
-                    {
-                        printk("%d", j);
-                        start_cpu = j;
-                    }
-                }
-                else if ( start_cpu != ~0U )
-                {
-                    if ( j - 1 != start_cpu )
-                        printk("-%d ", j - 1);
-                    else
-                        printk(" ");
-                    start_cpu = ~0U;
-                }
-            }
-
-            if ( start_cpu != ~0U  && start_cpu != j - 1 )
-                printk("-%d", j - 1);
-
-            printk("\n");
-
-            for ( j = 0; j < vnuma->nr_vmemranges; j++ )
-            {
-                if ( vnuma->vmemrange[j].nid == i )
-                    printk("           %016"PRIx64" - %016"PRIx64"\n",
-                           vnuma->vmemrange[j].start,
-                           vnuma->vmemrange[j].end);
-            }
-        }
-
-        read_unlock(&d->vnuma_rwlock);
-    }
-
-    rcu_read_unlock(&domlist_read_lock);
-}
-
-static int __init cf_check register_numa_trigger(void)
-{
-    register_keyhandler('u', dump_numa, "dump NUMA info", 1);
-    return 0;
-}
-__initcall(register_numa_trigger);
-
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index b46fd9ab18..9df08e9366 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -1350,7 +1350,7 @@ int cpu_add(uint32_t apic_id, uint32_t acpi_id, uint32_t pxm)
 
     x86_acpiid_to_apicid[acpi_id] = apic_id;
 
-    if ( !srat_disabled() )
+    if ( !numa_disabled() )
     {
         nodeid_t node = setup_node(pxm);
 
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index b62a152911..4c7f0c547e 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -238,7 +238,7 @@ acpi_numa_x2apic_affinity_init(const struct acpi_srat_x2apic_cpu_affinity *pa)
 	unsigned pxm;
 	nodeid_t node;
 
-	if (srat_disabled())
+	if (numa_disabled())
 		return;
 	if (pa->header.length < sizeof(struct acpi_srat_x2apic_cpu_affinity)) {
 		bad_srat();
@@ -274,7 +274,7 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
 	unsigned pxm;
 	nodeid_t node;
 
-	if (srat_disabled())
+	if (numa_disabled())
 		return;
 	if (pa->header.length != sizeof(struct acpi_srat_cpu_affinity)) {
 		bad_srat();
@@ -313,7 +313,7 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 	nodeid_t node;
 	unsigned int i;
 
-	if (srat_disabled())
+	if (numa_disabled())
 		return;
 	if (ma->header.length != sizeof(struct acpi_srat_mem_affinity)) {
 		bad_srat();
@@ -520,7 +520,7 @@ void __init srat_parse_regions(paddr_t addr)
 }
 
 /* Use the information discovered above to actually set up the nodes. */
-int __init acpi_scan_nodes(paddr_t start, paddr_t end)
+int __init numa_scan_nodes(paddr_t start, paddr_t end)
 {
 	int i;
 	nodemask_t all_nodes_parsed;
diff --git a/xen/common/Makefile b/xen/common/Makefile
index 3baf83d527..9a3a12b12d 100644
--- a/xen/common/Makefile
+++ b/xen/common/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_MEM_ACCESS) += mem_access.o
 obj-y += memory.o
 obj-y += multicall.o
 obj-y += notifier.o
+obj-$(CONFIG_NUMA) += numa.o
 obj-y += page_alloc.o
 obj-$(CONFIG_HAS_PDX) += pdx.o
 obj-$(CONFIG_PERF_COUNTERS) += perfc.o
diff --git a/xen/common/numa.c b/xen/common/numa.c
new file mode 100644
index 0000000000..03e7318d72
--- /dev/null
+++ b/xen/common/numa.c
@@ -0,0 +1,442 @@
+/*
+ * Generic VM initialization for NUMA setups.
+ * Copyright 2002,2003 Andi Kleen, SuSE Labs.
+ * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com>
+ */
+
+#include <xen/init.h>
+#include <xen/keyhandler.h>
+#include <xen/mm.h>
+#include <xen/nodemask.h>
+#include <xen/numa.h>
+#include <xen/param.h>
+#include <xen/sched.h>
+#include <xen/softirq.h>
+
+struct node_data __ro_after_init node_data[MAX_NUMNODES];
+
+/* Mapping from pdx to node id */
+unsigned int __ro_after_init memnode_shift;
+unsigned long __ro_after_init memnodemapsize;
+uint8_t *__ro_after_init memnodemap;
+static uint8_t __ro_after_init _memnodemap[64];
+
+nodeid_t __read_mostly cpu_to_node[NR_CPUS] = {
+    [0 ... NR_CPUS-1] = NUMA_NO_NODE
+};
+
+cpumask_t __read_mostly node_to_cpumask[MAX_NUMNODES];
+
+nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
+
+bool __read_mostly numa_off;
+
+bool numa_disabled(void)
+{
+    return numa_off || arch_numa_disabled();
+}
+
+/*
+ * Given a shift value, try to populate memnodemap[]
+ * Returns :
+ * 1 if OK
+ * 0 if memnodmap[] too small (of shift too small)
+ * -1 if node overlap or lost ram (shift too big)
+ */
+static int __init populate_memnodemap(const struct node *nodes,
+                                      unsigned int numnodes, unsigned int shift,
+                                      nodeid_t *nodeids)
+{
+    unsigned long spdx, epdx;
+    unsigned int i;
+    int res = -1;
+
+    memset(memnodemap, NUMA_NO_NODE, memnodemapsize * sizeof(*memnodemap));
+    for ( i = 0; i < numnodes; i++ )
+    {
+        spdx = paddr_to_pdx(nodes[i].start);
+        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
+        if ( spdx >= epdx )
+            continue;
+        if ( (epdx >> shift) >= memnodemapsize )
+            return 0;
+        do {
+            if ( memnodemap[spdx >> shift] != NUMA_NO_NODE )
+                return -1;
+
+            if ( !nodeids )
+                memnodemap[spdx >> shift] = i;
+            else
+                memnodemap[spdx >> shift] = nodeids[i];
+
+            spdx += (1UL << shift);
+        } while ( spdx < epdx );
+        res = 1;
+    }
+
+    return res;
+}
+
+static int __init allocate_cachealigned_memnodemap(void)
+{
+    unsigned long size = PFN_UP(memnodemapsize * sizeof(*memnodemap));
+    unsigned long mfn = mfn_x(alloc_boot_pages(size, 1));
+
+    memnodemap = mfn_to_virt(mfn);
+    mfn <<= PAGE_SHIFT;
+    size <<= PAGE_SHIFT;
+    printk(KERN_DEBUG "NUMA: Allocated memnodemap from %lx - %lx\n",
+           mfn, mfn + size);
+    memnodemapsize = size / sizeof(*memnodemap);
+
+    return 0;
+}
+
+/*
+ * The LSB of all start and end addresses in the node map is the value of the
+ * maximum possible shift.
+ */
+static unsigned int __init extract_lsb_from_nodes(const struct node *nodes,
+                                                  nodeid_t numnodes)
+{
+    unsigned int i;
+    nodeid_t nodes_used = 0;
+    unsigned long spdx, epdx;
+    unsigned long bitfield = 0, memtop = 0;
+
+    for ( i = 0; i < numnodes; i++ )
+    {
+        spdx = paddr_to_pdx(nodes[i].start);
+        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
+        if ( spdx >= epdx )
+            continue;
+        bitfield |= spdx;
+        nodes_used++;
+        if ( epdx > memtop )
+            memtop = epdx;
+    }
+    if ( nodes_used <= 1 )
+        i = BITS_PER_LONG - 1;
+    else
+        i = find_first_bit(&bitfield, sizeof(unsigned long)*8);
+    memnodemapsize = (memtop >> i) + 1;
+    return i;
+}
+
+int __init compute_hash_shift(const struct node *nodes,
+                              nodeid_t numnodes, nodeid_t *nodeids)
+{
+    unsigned int shift;
+
+    shift = extract_lsb_from_nodes(nodes, numnodes);
+    if ( memnodemapsize <= ARRAY_SIZE(_memnodemap) )
+        memnodemap = _memnodemap;
+    else if ( allocate_cachealigned_memnodemap() )
+        return -1;
+    printk(KERN_DEBUG "NUMA: Using %d for the hash shift.\n", shift);
+
+    if ( populate_memnodemap(nodes, numnodes, shift, nodeids) != 1 )
+    {
+        printk(KERN_INFO "Your memory is not aligned you need to "
+               "rebuild your hypervisor with a bigger NODEMAPSIZE "
+               "shift=%d\n", shift);
+        return -1;
+    }
+
+    return shift;
+}
+
+/* initialize NODE_DATA given nodeid and start/end */
+void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end)
+{
+    unsigned long start_pfn = paddr_to_pfn(start);
+    unsigned long end_pfn = paddr_to_pfn(end);
+
+    NODE_DATA(nodeid)->node_start_pfn = start_pfn;
+    NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
+
+    node_set_online(nodeid);
+}
+
+void __init numa_init_array(void)
+{
+    unsigned int i;
+    nodeid_t rr;
+
+    /*
+     * There are unfortunately some poorly designed mainboards around
+     * that only connect memory to a single CPU. This breaks the 1:1 cpu->node
+     * mapping. To avoid this fill in the mapping for all possible
+     * CPUs, as the number of CPUs is not known yet.
+     * We round robin the existing nodes.
+     */
+    rr = first_node(node_online_map);
+    for ( i = 0; i < nr_cpu_ids; i++ )
+    {
+        if ( cpu_to_node[i] != NUMA_NO_NODE )
+            continue;
+        numa_set_node(i, rr);
+        rr = cycle_node(rr, node_online_map);
+    }
+}
+
+#ifdef CONFIG_NUMA_EMU
+static unsigned int __initdata numa_fake;
+
+/* Numa emulation */
+static int __init numa_emulation(unsigned long start_pfn,
+                                 unsigned long end_pfn)
+{
+    unsigned int i;
+    struct node nodes[MAX_NUMNODES];
+    uint64_t sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake;
+
+    /* Kludge needed for the hash function */
+    if ( hweight64(sz) > 1 )
+    {
+        uint64_t x = 1;
+
+        while ( (x << 1) < sz )
+            x <<= 1;
+        if ( x < sz / 2 )
+            printk(KERN_ERR "Numa emulation unbalanced. Complain to maintainer\n");
+        sz = x;
+    }
+
+    memset(&nodes, 0, sizeof(nodes));
+    for ( i = 0; i < numa_fake; i++ )
+    {
+        nodes[i].start = pfn_to_paddr(start_pfn) + i * sz;
+        if ( i == numa_fake - 1 )
+            sz = pfn_to_paddr(end_pfn) - nodes[i].start;
+        nodes[i].end = nodes[i].start + sz;
+        printk(KERN_INFO "Faking node %u at %"PRIx64"-%"PRIx64" (%"PRIu64"MB)\n",
+               i, nodes[i].start, nodes[i].end,
+               (nodes[i].end - nodes[i].start) >> 20);
+        node_set_online(i);
+    }
+    memnode_shift = compute_hash_shift(nodes, numa_fake, NULL);
+    if ( memnode_shift < 0 )
+    {
+        memnode_shift = 0;
+        printk(KERN_ERR "No NUMA hash function found. Emulation disabled.\n");
+        return -1;
+    }
+    for_each_online_node ( i )
+        setup_node_bootmem(i, nodes[i].start, nodes[i].end);
+    numa_init_array();
+
+    return 0;
+}
+#endif
+
+void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
+{
+    unsigned int i;
+    paddr_t start = pfn_to_paddr(start_pfn);
+    paddr_t end = pfn_to_paddr(end_pfn);
+
+#ifdef CONFIG_NUMA_EMU
+    if ( numa_fake && !numa_emulation(start_pfn, end_pfn) )
+        return;
+#endif
+
+#ifdef CONFIG_NUMA
+    if ( !numa_off && !numa_scan_nodes(start, end) )
+        return;
+#endif
+
+    printk(KERN_INFO "%s\n",
+           numa_off ? "NUMA turned off" : "No NUMA configuration found");
+
+    printk(KERN_INFO "Faking a node at %"PRIpaddr"-%"PRIpaddr"\n",
+           start, end);
+    /* setup dummy node covering all memory */
+    memnode_shift = BITS_PER_LONG - 1;
+    memnodemap = _memnodemap;
+    /* Dummy node only uses 1 slot in reality */
+    memnodemap[0] = 0;
+    memnodemapsize = 1;
+
+    nodes_clear(node_online_map);
+    node_set_online(0);
+    for ( i = 0; i < nr_cpu_ids; i++ )
+        numa_set_node(i, 0);
+    cpumask_copy(&node_to_cpumask[0], cpumask_of(0));
+    setup_node_bootmem(0, start, end);
+}
+
+void numa_add_cpu(unsigned int cpu)
+{
+    cpumask_set_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]);
+}
+
+void numa_set_node(unsigned int cpu, nodeid_t node)
+{
+    cpu_to_node[cpu] = node;
+}
+
+/* [numa=off] */
+static int __init cf_check numa_setup(const char *opt)
+{
+    if ( !strncmp(opt, "off", 3) )
+        numa_off = true;
+    else if ( !strncmp(opt, "on", 2) )
+        numa_off = false;
+#ifdef CONFIG_NUMA_EMU
+    else if ( !strncmp(opt, "fake=", 5) )
+    {
+        numa_off = false;
+        numa_fake = simple_strtoul(opt + 5, NULL, 0);
+        if ( numa_fake >= MAX_NUMNODES )
+            numa_fake = MAX_NUMNODES;
+    }
+#endif
+    else
+        return arch_numa_setup(opt);
+
+    return 0;
+}
+custom_param("numa", numa_setup);
+
+static void cf_check dump_numa(unsigned char key)
+{
+    s_time_t now = NOW();
+    unsigned int i, j, n;
+    struct domain *d;
+    struct page_info *page;
+    unsigned int page_num_node[MAX_NUMNODES];
+    const struct vnuma_info *vnuma;
+
+    printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key,
+           now);
+
+    for_each_online_node ( i )
+    {
+        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
+
+        printk("NODE%u start->%lu size->%lu free->%lu\n",
+               i, node_start_pfn(i), node_spanned_pages(i),
+               avail_node_heap_pages(i));
+        /* sanity check phys_to_nid() */
+        if ( phys_to_nid(pa) != i )
+            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
+                   pa, phys_to_nid(pa), i);
+    }
+
+    j = cpumask_first(&cpu_online_map);
+    n = 0;
+    for_each_online_cpu ( i )
+    {
+        if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] )
+        {
+            if ( n > 1 )
+                printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
+            else
+                printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
+            j = i;
+            n = 1;
+        }
+        else
+            ++n;
+    }
+    if ( n > 1 )
+        printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
+    else
+        printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
+
+    rcu_read_lock(&domlist_read_lock);
+
+    printk("Memory location of each domain:\n");
+    for_each_domain ( d )
+    {
+        process_pending_softirqs();
+
+        printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d));
+
+        for_each_online_node ( i )
+            page_num_node[i] = 0;
+
+        spin_lock(&d->page_alloc_lock);
+        page_list_for_each(page, &d->page_list)
+        {
+            i = phys_to_nid(page_to_maddr(page));
+            page_num_node[i]++;
+        }
+        spin_unlock(&d->page_alloc_lock);
+
+        for_each_online_node ( i )
+            printk("    Node %u: %u\n", i, page_num_node[i]);
+
+        if ( !read_trylock(&d->vnuma_rwlock) )
+            continue;
+
+        if ( !d->vnuma )
+        {
+            read_unlock(&d->vnuma_rwlock);
+            continue;
+        }
+
+        vnuma = d->vnuma;
+        printk("     %u vnodes, %u vcpus, guest physical layout:\n",
+               vnuma->nr_vnodes, d->max_vcpus);
+        for ( i = 0; i < vnuma->nr_vnodes; i++ )
+        {
+            unsigned int start_cpu = ~0U;
+
+            if ( vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
+                printk("       %3u: pnode ???,", i);
+            else
+                printk("       %3u: pnode %3u,", i, vnuma->vnode_to_pnode[i]);
+
+            printk(" vcpus ");
+
+            for ( j = 0; j < d->max_vcpus; j++ )
+            {
+                if ( !(j & 0x3f) )
+                    process_pending_softirqs();
+
+                if ( vnuma->vcpu_to_vnode[j] == i )
+                {
+                    if ( start_cpu == ~0U )
+                    {
+                        printk("%d", j);
+                        start_cpu = j;
+                    }
+                }
+                else if ( start_cpu != ~0U )
+                {
+                    if ( j - 1 != start_cpu )
+                        printk("-%d ", j - 1);
+                    else
+                        printk(" ");
+                    start_cpu = ~0U;
+                }
+            }
+
+            if ( start_cpu != ~0U  && start_cpu != j - 1 )
+                printk("-%d", j - 1);
+
+            printk("\n");
+
+            for ( j = 0; j < vnuma->nr_vmemranges; j++ )
+            {
+                if ( vnuma->vmemrange[j].nid == i )
+                    printk("           %016"PRIx64" - %016"PRIx64"\n",
+                           vnuma->vmemrange[j].start,
+                           vnuma->vmemrange[j].end);
+            }
+        }
+
+        read_unlock(&d->vnuma_rwlock);
+    }
+
+    rcu_read_unlock(&domlist_read_lock);
+}
+
+static int __init cf_check register_numa_trigger(void)
+{
+    register_keyhandler('u', dump_numa, "dump NUMA info", 1);
+    return 0;
+}
+__initcall(register_numa_trigger);
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index 7aef1a88dc..baef4cd553 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -18,4 +18,71 @@
   (((d)->vcpu != NULL && (d)->vcpu[0] != NULL) \
    ? vcpu_to_node((d)->vcpu[0]) : NUMA_NO_NODE)
 
+/* The following content can be used when NUMA feature is enabled */
+#ifdef CONFIG_NUMA
+
+extern nodeid_t      cpu_to_node[NR_CPUS];
+extern cpumask_t     node_to_cpumask[];
+
+#define cpu_to_node(cpu)        cpu_to_node[cpu]
+#define parent_node(node)       (node)
+#define node_to_first_cpu(node) __ffs(node_to_cpumask[node])
+#define node_to_cpumask(node)   node_to_cpumask[node]
+
+struct node {
+    paddr_t start, end;
+};
+
+extern int compute_hash_shift(const struct node *nodes,
+                              nodeid_t numnodes, nodeid_t *nodeids);
+
+#define VIRTUAL_BUG_ON(x)
+
+extern bool numa_off;
+
+extern void numa_add_cpu(unsigned int cpu);
+extern void numa_init_array(void);
+extern void numa_set_node(unsigned int cpu, nodeid_t node);
+extern void numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn);
+extern int  numa_scan_nodes(paddr_t start, paddr_t end);
+
+extern int arch_numa_setup(const char *opt);
+extern bool arch_numa_disabled(void);
+extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end);
+
+static inline void clear_node_cpumask(unsigned int cpu)
+{
+    cpumask_clear_cpu(cpu, &node_to_cpumask[cpu_to_node(cpu)]);
+}
+
+/* Simple perfect hash to map pdx to node numbers */
+extern unsigned int memnode_shift;
+extern unsigned long memnodemapsize;
+extern uint8_t *memnodemap;
+
+struct node_data {
+    unsigned long node_start_pfn;
+    unsigned long node_spanned_pages;
+};
+
+extern struct node_data node_data[];
+
+static inline nodeid_t __attribute_pure__ phys_to_nid(paddr_t addr)
+{
+    nodeid_t nid;
+    VIRTUAL_BUG_ON((paddr_to_pdx(addr) >> memnode_shift) >= memnodemapsize);
+    nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift];
+    VIRTUAL_BUG_ON(nid >= MAX_NUMNODES || !node_data[nid]);
+    return nid;
+}
+
+#define NODE_DATA(nid)          (&node_data[nid])
+
+#define node_start_pfn(nid)     (NODE_DATA(nid)->node_start_pfn)
+#define node_spanned_pages(nid) (NODE_DATA(nid)->node_spanned_pages)
+#define node_end_pfn(nid)       (NODE_DATA(nid)->node_start_pfn + \
+                                 NODE_DATA(nid)->node_spanned_pages)
+
+#endif
+
 #endif /* _XEN_NUMA_H */
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 3/6] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid
  2022-09-02  3:31 [PATCH v4 0/6] Device tree based NUMA support for Arm - Part#2 Wei Chen
  2022-09-02  3:31 ` [PATCH v4 1/6] xen/x86: Provide helpers for common code to access acpi_numa Wei Chen
  2022-09-02  3:31 ` [PATCH v4 2/6] xen/x86: move generically usable NUMA code from x86 to common Wei Chen
@ 2022-09-02  3:31 ` Wei Chen
  2022-09-02  3:31 ` [PATCH v4 4/6] xen/x86: use arch_get_ram_range to get information from E820 map Wei Chen
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Wei Chen @ 2022-09-02  3:31 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Andrew Cooper, George Dunlap, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu, Jiamei Xie

VIRTUAL_BUG_ON is an empty macro used in phys_to_nid. This
results in two lines of error-checking code in phys_to_nid
that is not actually working and causing two compilation
errors:
1. error: "MAX_NUMNODES" undeclared (first use in this function).
   This is because in the common header file, "MAX_NUMNODES" is
   defined after the common header file includes the ARCH header
   file, where phys_to_nid has attempted to use "MAX_NUMNODES".
   This error was resolved after we moved the phys_to_nid from
   x86 ARCH header file to common header file.
2. error: wrong type argument to unary exclamation mark.
   This is because, the error-checking code contains !node_data[nid].
   But node_data is a data structure variable, it's not a pointer.

So, in this patch, we use ASSERT instead of VIRTUAL_BUG_ON to
enable the two lines of error-checking code. And fix the left
compilation errors by replacing !node_data[nid] to
!node_data[nid].node_spanned_pages. Although NUMA allows one node
can only have CPUs but without any memory. And node with 0 bytes
of memory might have an entry in memnodemap[] theoretically. But
that doesn't mean phys_to_nid can find any valid address from a
node with 0 bytes memory.

Signed-off-by: Wei Chen <wei.chen@arm.com>
Tested-by: Jiamei Xie <jiamei.xie@arm.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
v3 -> v4:
no change.
v2 -> v3:
1. Remove unnecessary change items in history.
2. Add Acked-by.
v1 -> v2:
1. Use ASSERT to replace VIRTUAL_BUG_ON in phys_to_nid.
2. Adjust the conditional express for ASSERT.
3. Refine the justification of using !node_data[nid].node_spanned_pages.
---
 xen/include/xen/numa.h | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index baef4cd553..af0abfc8cf 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -36,8 +36,6 @@ struct node {
 extern int compute_hash_shift(const struct node *nodes,
                               nodeid_t numnodes, nodeid_t *nodeids);
 
-#define VIRTUAL_BUG_ON(x)
-
 extern bool numa_off;
 
 extern void numa_add_cpu(unsigned int cpu);
@@ -70,9 +68,9 @@ extern struct node_data node_data[];
 static inline nodeid_t __attribute_pure__ phys_to_nid(paddr_t addr)
 {
     nodeid_t nid;
-    VIRTUAL_BUG_ON((paddr_to_pdx(addr) >> memnode_shift) >= memnodemapsize);
+    ASSERT((paddr_to_pdx(addr) >> memnode_shift) < memnodemapsize);
     nid = memnodemap[paddr_to_pdx(addr) >> memnode_shift];
-    VIRTUAL_BUG_ON(nid >= MAX_NUMNODES || !node_data[nid]);
+    ASSERT(nid < MAX_NUMNODES && node_data[nid].node_spanned_pages);
     return nid;
 }
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 4/6] xen/x86: use arch_get_ram_range to get information from E820 map
  2022-09-02  3:31 [PATCH v4 0/6] Device tree based NUMA support for Arm - Part#2 Wei Chen
                   ` (2 preceding siblings ...)
  2022-09-02  3:31 ` [PATCH v4 3/6] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid Wei Chen
@ 2022-09-02  3:31 ` Wei Chen
  2022-09-08 12:14   ` Jan Beulich
  2022-09-02  3:31 ` [PATCH v4 5/6] xen/x86: move NUMA scan nodes codes from x86 to common Wei Chen
  2022-09-02  3:31 ` [PATCH v4 6/6] xen: introduce a Kconfig option to configure NUMA nodes number Wei Chen
  5 siblings, 1 reply; 18+ messages in thread
From: Wei Chen @ 2022-09-02  3:31 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

The sanity check of nodes_cover_memory is also a requirement of
other architectures that support NUMA. But now, the code of
nodes_cover_memory is tied to the x86 E820. In this case, we
introduce arch_get_ram_range to decouple architecture specific
memory map from this function. This means, other architectures
like Arm can also use it to check its node and memory coverage
from bootmem info.

Depends arch_get_ram_range, we make nodes_cover_memory become
architecture independent. We also use neutral words to replace
SRAT and E820 in the print message of this function. This will
to make the massage seems more common.

As arch_get_ram_range use unsigned int for index, we also adjust
the index in nodes_cover_memory from int to unsigned int.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
v3 -> v4:
1. Move function comment to header file.
2. Use bool for found, and add a new "err" for the return
   value of arch_get_ram_range.
3. Use -ENODATA instead of -EINVAL for non-RAM type ranges.
v2 -> v3:
1. Rename arch_get_memory_map to arch_get_ram_range.
2. Use -ENOENT instead of -ENODEV to indicate end of memory map.
3. Add description to code comment that arch_get_ram_range returns
   RAM range in [start, end) format.
v1 -> v2:
1. Use arch_get_memory_map to replace arch_get_memory_bank_range
   and arch_get_memory_bank_number.
2. Remove the !start || !end check, because caller guarantee
   these two pointers will not be NULL.
---
 xen/arch/x86/numa.c    | 15 +++++++++++++++
 xen/arch/x86/srat.c    | 30 ++++++++++++++++++------------
 xen/include/xen/numa.h | 13 +++++++++++++
 3 files changed, 46 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 90b2a22591..fa8caaa084 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -9,6 +9,7 @@
 #include <xen/nodemask.h>
 #include <xen/numa.h>
 #include <asm/acpi.h>
+#include <asm/e820.h>
 
 #ifndef Dprintk
 #define Dprintk(x...)
@@ -93,3 +94,17 @@ unsigned int __init arch_get_dma_bitsize(void)
                  flsl(node_start_pfn(node) + node_spanned_pages(node) / 4 - 1)
                  + PAGE_SHIFT, 32);
 }
+
+int __init arch_get_ram_range(unsigned int idx, paddr_t *start, paddr_t *end)
+{
+    if ( idx >= e820.nr_map )
+        return -ENOENT;
+
+    if ( e820.map[idx].type != E820_RAM )
+        return -ENODATA;
+
+    *start = e820.map[idx].addr;
+    *end = *start + e820.map[idx].size;
+
+    return 0;
+}
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 4c7f0c547e..bd9694db24 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -428,37 +428,43 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
    Make sure the PXMs cover all memory. */
 static int __init nodes_cover_memory(void)
 {
-	int i;
+	unsigned int i;
 
-	for (i = 0; i < e820.nr_map; i++) {
-		int j, found;
+	for (i = 0; ; i++) {
+		int err;
+		unsigned int j;
+		bool found;
 		paddr_t start, end;
 
-		if (e820.map[i].type != E820_RAM) {
-			continue;
-		}
+		/* Try to loop memory map from index 0 to end to get RAM ranges. */
+		err = arch_get_ram_range(i, &start, &end);
 
-		start = e820.map[i].addr;
-		end = e820.map[i].addr + e820.map[i].size;
+		/* Reach the end of arch's memory map */
+		if (err == -ENOENT)
+			break;
+
+		/* Index relate entry is not RAM, skip it. */
+		if (err)
+			continue;
 
 		do {
-			found = 0;
+			found = false;
 			for_each_node_mask(j, memory_nodes_parsed)
 				if (start < nodes[j].end
 				    && end > nodes[j].start) {
 					if (start >= nodes[j].start) {
 						start = nodes[j].end;
-						found = 1;
+						found = true;
 					}
 					if (end <= nodes[j].end) {
 						end = nodes[j].start;
-						found = 1;
+						found = true;
 					}
 				}
 		} while (found && start < end);
 
 		if (start < end) {
-			printk(KERN_ERR "SRAT: No PXM for e820 range: "
+			printk(KERN_ERR "NUMA: No NODE for RAM range: "
 				"[%"PRIpaddr", %"PRIpaddr"]\n", start, end - 1);
 			return 0;
 		}
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index af0abfc8cf..38be7db960 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -81,6 +81,19 @@ static inline nodeid_t __attribute_pure__ phys_to_nid(paddr_t addr)
 #define node_end_pfn(nid)       (NODE_DATA(nid)->node_start_pfn + \
                                  NODE_DATA(nid)->node_spanned_pages)
 
+/*
+ * This function provides the ability for caller to get one RAM entry
+ * from architectural memory map by index.
+ *
+ * This function will return zero if it can return a proper RAM entry.
+ * otherwise it will return -ENOENT for out of scope index, or return
+ * -ENODATA for non-RAM type memory entry.
+ *
+ * Note: the range is exclusive at the end, e.g. [start, end).
+ */
+extern int arch_get_ram_range(unsigned int idx,
+                              paddr_t *start, paddr_t *end);
+
 #endif
 
 #endif /* _XEN_NUMA_H */
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 5/6] xen/x86: move NUMA scan nodes codes from x86 to common
  2022-09-02  3:31 [PATCH v4 0/6] Device tree based NUMA support for Arm - Part#2 Wei Chen
                   ` (3 preceding siblings ...)
  2022-09-02  3:31 ` [PATCH v4 4/6] xen/x86: use arch_get_ram_range to get information from E820 map Wei Chen
@ 2022-09-02  3:31 ` Wei Chen
  2022-09-08 13:02   ` Jan Beulich
  2022-09-02  3:31 ` [PATCH v4 6/6] xen: introduce a Kconfig option to configure NUMA nodes number Wei Chen
  5 siblings, 1 reply; 18+ messages in thread
From: Wei Chen @ 2022-09-02  3:31 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

x86 has implemented a set of codes to scan NUMA nodes. These
codes will parse NUMA memory and processor information from
ACPI SRAT table. But except some ACPI specific codes, most
of the scan codes like memory blocks validation, node memory
range updates and some sanity check can be reused by other
NUMA implementation.

So in this patch, we move some variables and related functions
for NUMA memory and processor to common as library. At the
same time, numa_set_processor_nodes_parsed has been introduced
for ACPI specific code to update processor parsing results.
With this helper, we can reuse most of NUMA memory affinity init
code from ACPI. As bad_srat and node_to_pxm functions have been
used in common code to do architectural fallback and node to
architectural node info translation. But it doesn't make sense
to reuse the functions names in common code, we have rename them
to neutral names as well.

PXM is an ACPI specific item, we can't use it in common code
directly. As an alternative, we extend the parameters of
numa_update_node_memblks. The caller can pass the PXM as print
messages' prefix or as architectural node id. And we introduced
a CONFIG_HAS_NUMA_NODE_FWID to prevent print the mapping between
node id and architectural node id for those architectures do not
have architectural node id. In this case, we do not need to retain
a lot of per-arch code but still can print architectural log
messages for different NUMA implementations.

mem_hotplug also has been accessing by common code, except x86,
other architectures like Arm will also want to implement memory
hotplug in future. We export mem_hotplug to common will not bring
any harm for Arm and we also can reduce some per-arch helpers to
access mem_hotplug.

As asm/acpi.h has been removed from common/numa.c, we have to
move NR_NODE_MEMBLKS from asm/acpi.h to xen/numa.h in this patch
as well.

Signed-off-by: Wei Chen <wei.chen@arm.com>
---
v3 -> v4:
1. Use bool as return value for functions that only return
   0/1 or 0/-EINVAL.
2. Move mem_hotplug to a proper place in mm.h
3. Remove useless "size" in numa_scan_nodes.
4. Use unsigned int or const for proper variables.
5. Fix code-style.
6. Add init_as_disable as arch_numa_disabled parameter.
7. Add CONFIG_HAS_NUMA_NODE_FWID to gate print the mapping
   between node id and architectural node id (fw node id).
v2 -> v3:
1. Add __ro_after_init to proper variables.
2. Rename bad_srat to numa_fw_bad.
3. Rename node_to_pxm to numa_node_to_arch_nid.
4. Merge patch#7 and #8 into this patch.
5. Correct int to unsigned int in proper places.
6. Move NR_NODE_MEMBLKS from x86/acpi.h to common/numa.h
7. Drop helpers to access mem_hotplug, we export mem_hotplug
   from x86/mm.c to common/page_alloc.c
v1 -> v2:
1. Add code comment for numa_update_node_memblks to explain:
   Assumes all memory regions belonging to a single node
   are in one chunk. Holes between them will be included
   in the node.
2. Merge this single patch instead of serval patches to move
   x86 SRAT code to common.
3. Export node_to_pxm to keep pxm information in NUMA scan
   nodes error messages.
4. Change the code style to target file's Xen code-style.
5. Adjust some __init and __initdata for some functions and
   variables.
6. Merge two patches into this patch:
   1. replace CONFIG_ACPI_NUMA by CONFIG_NUMA.
   2. replace "SRAT" texts.
7. Turn numa_scan_nodes to static.
---
 xen/arch/x86/include/asm/acpi.h |   1 -
 xen/arch/x86/include/asm/mm.h   |   2 -
 xen/arch/x86/include/asm/numa.h |   3 +-
 xen/arch/x86/mm.c               |   2 -
 xen/arch/x86/numa.c             |   7 +-
 xen/arch/x86/srat.c             | 311 +++----------------------------
 xen/common/numa.c               | 321 +++++++++++++++++++++++++++++++-
 xen/common/page_alloc.c         |   2 +
 xen/drivers/acpi/Kconfig        |   1 +
 xen/include/xen/mm.h            |   2 +
 xen/include/xen/numa.h          |  12 +-
 11 files changed, 363 insertions(+), 301 deletions(-)

diff --git a/xen/arch/x86/include/asm/acpi.h b/xen/arch/x86/include/asm/acpi.h
index 5c2dd5da2d..c453450a74 100644
--- a/xen/arch/x86/include/asm/acpi.h
+++ b/xen/arch/x86/include/asm/acpi.h
@@ -102,7 +102,6 @@ extern unsigned long acpi_wakeup_address;
 #define ARCH_HAS_POWER_INIT	1
 
 extern s8 acpi_numa;
-#define NR_NODE_MEMBLKS (MAX_NUMNODES*2)
 
 extern struct acpi_sleep_info acpi_sinfo;
 #define acpi_video_flags bootsym(video_flags)
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index 0fc826de46..95ff71a83a 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -474,8 +474,6 @@ static inline int get_page_and_type(struct page_info *page,
     ASSERT(((_p)->count_info & PGC_count_mask) != 0);          \
     ASSERT(page_get_owner(_p) == (_d))
 
-extern paddr_t mem_hotplug;
-
 /******************************************************************************
  * With shadow pagetables, the different kinds of address start
  * to get get confusing.
diff --git a/xen/arch/x86/include/asm/numa.h b/xen/arch/x86/include/asm/numa.h
index 6c87942d43..2ca3475271 100644
--- a/xen/arch/x86/include/asm/numa.h
+++ b/xen/arch/x86/include/asm/numa.h
@@ -10,6 +10,7 @@ typedef u8 nodeid_t;
 extern int srat_rev;
 
 extern nodeid_t pxm_to_node(unsigned int pxm);
+extern unsigned int numa_node_to_arch_nid(nodeid_t n);
 
 #define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
 
@@ -22,8 +23,6 @@ extern void init_cpu_to_node(void);
 
 #define arch_want_default_dmazone() (num_online_nodes() > 1)
 
-extern int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node);
-
 void srat_parse_regions(paddr_t addr);
 extern u8 __node_distance(nodeid_t a, nodeid_t b);
 unsigned int arch_get_dma_bitsize(void);
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index db1817b691..68f9989e1f 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -157,8 +157,6 @@ l1_pgentry_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
 l1_pgentry_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
     l1_fixmap_x[L1_PAGETABLE_ENTRIES];
 
-paddr_t __read_mostly mem_hotplug;
-
 /* Frame table size in pages. */
 unsigned long max_page;
 unsigned long total_pages;
diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index fa8caaa084..e565c3a34d 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -41,9 +41,12 @@ int __init arch_numa_setup(const char *opt)
     return -EINVAL;
 }
 
-bool arch_numa_disabled(void)
+bool arch_numa_disabled(bool init_as_disable)
 {
-    return acpi_numa < 0;
+    if ( !init_as_disable )
+        return acpi_numa < 0;
+
+    return acpi_numa <= 0;
 }
 
 /*
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index bd9694db24..7964e199c5 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -24,10 +24,6 @@
 
 static struct acpi_table_slit *__read_mostly acpi_slit;
 
-static nodemask_t memory_nodes_parsed __initdata;
-static nodemask_t processor_nodes_parsed __initdata;
-static struct node nodes[MAX_NUMNODES] __initdata;
-
 struct pxm2node {
 	unsigned pxm;
 	nodeid_t node;
@@ -35,19 +31,6 @@ struct pxm2node {
 static struct pxm2node __read_mostly pxm2node[MAX_NUMNODES] =
 	{ [0 ... MAX_NUMNODES - 1] = {.node = NUMA_NO_NODE} };
 
-static unsigned node_to_pxm(nodeid_t n);
-
-static int num_node_memblks;
-static struct node node_memblk_range[NR_NODE_MEMBLKS];
-static nodeid_t memblk_nodeid[NR_NODE_MEMBLKS];
-static __initdata DECLARE_BITMAP(memblk_hotplug, NR_NODE_MEMBLKS);
-
-enum conflicts {
-	NO_CONFLICT,
-	OVERLAP,
-	INTERLEAVE,
-};
-
 static inline bool node_found(unsigned idx, unsigned pxm)
 {
 	return ((pxm2node[idx].pxm == pxm) &&
@@ -110,78 +93,7 @@ nodeid_t setup_node(unsigned pxm)
 	return node;
 }
 
-int valid_numa_range(paddr_t start, paddr_t end, nodeid_t node)
-{
-	int i;
-
-	for (i = 0; i < num_node_memblks; i++) {
-		struct node *nd = &node_memblk_range[i];
-
-		if (nd->start <= start && nd->end >= end &&
-			memblk_nodeid[i] == node)
-			return 1;
-	}
-
-	return 0;
-}
-
-static
-enum conflicts __init conflicting_memblks(nodeid_t nid, paddr_t start,
-					  paddr_t end, paddr_t nd_start,
-					  paddr_t nd_end, unsigned int *mblkid)
-{
-	unsigned int i;
-
-	/*
-	 * Scan all recorded nodes' memory blocks to check conflicts:
-	 * Overlap or interleave.
-	 */
-	for (i = 0; i < num_node_memblks; i++) {
-		struct node *nd = &node_memblk_range[i];
-
-		*mblkid = i;
-
-		/* Skip 0 bytes node memory block. */
-		if (nd->start == nd->end)
-			continue;
-		/*
-		 * Use memblk range to check memblk overlaps, include the
-		 * self-overlap case. As nd's range is non-empty, the special
-		 * case "nd->end == end && nd->start == start" also can be covered.
-		 */
-		if (nd->end > start && nd->start < end)
-			return OVERLAP;
-
-		/*
-		 * Use node memory range to check whether new range contains
-		 * memory from other nodes - interleave check. We just need
-		 * to check full contains situation. Because overlaps have
-		 * been checked above.
-		 */
-	        if (nid != memblk_nodeid[i] &&
-		    nd->start >= nd_start && nd->end <= nd_end)
-			return INTERLEAVE;
-	}
-
-	return NO_CONFLICT;
-}
-
-static __init void cutoff_node(int i, paddr_t start, paddr_t end)
-{
-	struct node *nd = &nodes[i];
-	if (nd->start < start) {
-		nd->start = start;
-		if (nd->end < nd->start)
-			nd->start = nd->end;
-	}
-	if (nd->end > end) {
-		nd->end = end;
-		if (nd->start > nd->end)
-			nd->start = nd->end;
-	}
-}
-
-static __init void bad_srat(void)
+void __init numa_fw_bad(void)
 {
 	int i;
 	printk(KERN_ERR "SRAT: SRAT not used.\n");
@@ -241,7 +153,7 @@ acpi_numa_x2apic_affinity_init(const struct acpi_srat_x2apic_cpu_affinity *pa)
 	if (numa_disabled())
 		return;
 	if (pa->header.length < sizeof(struct acpi_srat_x2apic_cpu_affinity)) {
-		bad_srat();
+		numa_fw_bad();
 		return;
 	}
 	if (!(pa->flags & ACPI_SRAT_CPU_ENABLED))
@@ -254,12 +166,12 @@ acpi_numa_x2apic_affinity_init(const struct acpi_srat_x2apic_cpu_affinity *pa)
 	pxm = pa->proximity_domain;
 	node = setup_node(pxm);
 	if (node == NUMA_NO_NODE) {
-		bad_srat();
+		numa_fw_bad();
 		return;
 	}
 
 	apicid_to_node[pa->apic_id] = node;
-	node_set(node, processor_nodes_parsed);
+	numa_set_processor_nodes_parsed(node);
 	acpi_numa = 1;
 
 	if (opt_acpi_verbose)
@@ -277,7 +189,7 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
 	if (numa_disabled())
 		return;
 	if (pa->header.length != sizeof(struct acpi_srat_cpu_affinity)) {
-		bad_srat();
+		numa_fw_bad();
 		return;
 	}
 	if (!(pa->flags & ACPI_SRAT_CPU_ENABLED))
@@ -290,11 +202,11 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
 	}
 	node = setup_node(pxm);
 	if (node == NUMA_NO_NODE) {
-		bad_srat();
+		numa_fw_bad();
 		return;
 	}
 	apicid_to_node[pa->apic_id] = node;
-	node_set(node, processor_nodes_parsed);
+	numa_set_processor_nodes_parsed(node);
 	acpi_numa = 1;
 
 	if (opt_acpi_verbose)
@@ -306,32 +218,27 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
 void __init
 acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 {
-	struct node *nd;
-	paddr_t nd_start, nd_end;
-	paddr_t start, end;
 	unsigned pxm;
 	nodeid_t node;
-	unsigned int i;
 
 	if (numa_disabled())
 		return;
 	if (ma->header.length != sizeof(struct acpi_srat_mem_affinity)) {
-		bad_srat();
+		numa_fw_bad();
 		return;
 	}
 	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
 		return;
 
-	start = ma->base_address;
-	end = start + ma->length;
 	/* Supplement the heuristics in l1tf_calculations(). */
-	l1tf_safe_maddr = max(l1tf_safe_maddr, ROUNDUP(end, PAGE_SIZE));
+	l1tf_safe_maddr = max(l1tf_safe_maddr,
+			      ROUNDUP(ma->base_address + ma->length,
+			      PAGE_SIZE));
 
-	if (num_node_memblks >= NR_NODE_MEMBLKS)
-	{
+	if (!numa_memblks_available()) {
 		dprintk(XENLOG_WARNING,
-                "Too many numa entry, try bigger NR_NODE_MEMBLKS \n");
-		bad_srat();
+			"Too many numa entries, try bigger NR_NODE_MEMBLKS!\n");
+		numa_fw_bad();
 		return;
 	}
 
@@ -340,136 +247,14 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 		pxm &= 0xff;
 	node = setup_node(pxm);
 	if (node == NUMA_NO_NODE) {
-		bad_srat();
+		numa_fw_bad();
 		return;
 	}
 
-	/*
-	 * For the node that already has some memory blocks, we will
-	 * expand the node memory range temporarily to check memory
-	 * interleaves with other nodes. We will not use this node
-	 * temp memory range to check overlaps, because it will mask
-	 * the overlaps in same node.
-	 *
-	 * Node with 0 bytes memory doesn't need this expandsion.
-	 */
-	nd_start = start;
-	nd_end = end;
-	nd = &nodes[node];
-	if (nd->start != nd->end) {
-		if (nd_start > nd->start)
-			nd_start = nd->start;
-
-		if (nd_end < nd->end)
-			nd_end = nd->end;
-	}
-
-	/* It is fine to add this area to the nodes data it will be used later*/
-	switch (conflicting_memblks(node, start, end, nd_start, nd_end, &i)) {
-	case OVERLAP:
-		if (memblk_nodeid[i] == node) {
-			bool mismatch = !(ma->flags &
-					  ACPI_SRAT_MEM_HOT_PLUGGABLE) !=
-			                !test_bit(i, memblk_hotplug);
-
-			printk("%sSRAT: PXM %u [%"PRIpaddr", %"PRIpaddr"] overlaps with itself [%"PRIpaddr", %"PRIpaddr"]\n",
-			       mismatch ? KERN_ERR : KERN_WARNING, pxm, start,
-			       end - 1, node_memblk_range[i].start,
-			       node_memblk_range[i].end - 1);
-			if (mismatch) {
-				bad_srat();
-				return;
-			}
-			break;
-		}
-
-		printk(KERN_ERR
-		       "SRAT: PXM %u [%"PRIpaddr", %"PRIpaddr"] overlaps with PXM %u [%"PRIpaddr", %"PRIpaddr"]\n",
-		       pxm, start, end - 1, node_to_pxm(memblk_nodeid[i]),
-		       node_memblk_range[i].start,
-		       node_memblk_range[i].end - 1);
-		bad_srat();
-		return;
-
-	case INTERLEAVE:
-		printk(KERN_ERR
-		       "SRAT: PXM %u: [%"PRIpaddr", %"PRIpaddr"] interleaves with PXM %u memblk [%"PRIpaddr", %"PRIpaddr"]\n",
-		       pxm, nd_start, nd_end - 1, node_to_pxm(memblk_nodeid[i]),
-		       node_memblk_range[i].start, node_memblk_range[i].end - 1);
-		bad_srat();
-		return;
-
-	case NO_CONFLICT:
-		break;
-	}
-
-	if (!(ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE)) {
-		node_set(node, memory_nodes_parsed);
-		nd->start = nd_start;
-		nd->end = nd_end;
-	}
-
-	printk(KERN_INFO "SRAT: Node %u PXM %u [%"PRIpaddr", %"PRIpaddr"]%s\n",
-	       node, pxm, start, end - 1,
-	       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
-
-	node_memblk_range[num_node_memblks].start = start;
-	node_memblk_range[num_node_memblks].end = end;
-	memblk_nodeid[num_node_memblks] = node;
-	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
-		__set_bit(num_node_memblks, memblk_hotplug);
-		if (end > mem_hotplug)
-			mem_hotplug = end;
-	}
-	num_node_memblks++;
-}
-
-/* Sanity check to catch more bad SRATs (they are amazingly common).
-   Make sure the PXMs cover all memory. */
-static int __init nodes_cover_memory(void)
-{
-	unsigned int i;
-
-	for (i = 0; ; i++) {
-		int err;
-		unsigned int j;
-		bool found;
-		paddr_t start, end;
-
-		/* Try to loop memory map from index 0 to end to get RAM ranges. */
-		err = arch_get_ram_range(i, &start, &end);
-
-		/* Reach the end of arch's memory map */
-		if (err == -ENOENT)
-			break;
-
-		/* Index relate entry is not RAM, skip it. */
-		if (err)
-			continue;
-
-		do {
-			found = false;
-			for_each_node_mask(j, memory_nodes_parsed)
-				if (start < nodes[j].end
-				    && end > nodes[j].start) {
-					if (start >= nodes[j].start) {
-						start = nodes[j].end;
-						found = true;
-					}
-					if (end <= nodes[j].end) {
-						end = nodes[j].start;
-						found = true;
-					}
-				}
-		} while (found && start < end);
-
-		if (start < end) {
-			printk(KERN_ERR "NUMA: No NODE for RAM range: "
-				"[%"PRIpaddr", %"PRIpaddr"]\n", start, end - 1);
-			return 0;
-		}
-	}
-	return 1;
+	if (!numa_update_node_memblks(node, pxm, ma->base_address,
+				      ma->length, "PXM",
+				      ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE))
+		numa_fw_bad();
 }
 
 void __init acpi_numa_arch_fixup(void) {}
@@ -525,59 +310,9 @@ void __init srat_parse_regions(paddr_t addr)
 	pfn_pdx_hole_setup(mask >> PAGE_SHIFT);
 }
 
-/* Use the information discovered above to actually set up the nodes. */
-int __init numa_scan_nodes(paddr_t start, paddr_t end)
-{
-	int i;
-	nodemask_t all_nodes_parsed;
-
-	/* First clean up the node list */
-	for (i = 0; i < MAX_NUMNODES; i++)
-		cutoff_node(i, start, end);
-
-	if (acpi_numa <= 0)
-		return -1;
-
-	if (!nodes_cover_memory()) {
-		bad_srat();
-		return -1;
-	}
-
-	memnode_shift = compute_hash_shift(node_memblk_range, num_node_memblks,
-				memblk_nodeid);
-
-	if (memnode_shift < 0) {
-		printk(KERN_ERR
-		     "SRAT: No NUMA node hash function found. Contact maintainer\n");
-		bad_srat();
-		return -1;
-	}
-
-	nodes_or(all_nodes_parsed, memory_nodes_parsed, processor_nodes_parsed);
-
-	/* Finally register nodes */
-	for_each_node_mask(i, all_nodes_parsed)
-	{
-		uint64_t size = nodes[i].end - nodes[i].start;
-
-		if ( size == 0 )
-			printk(KERN_INFO "SRAT: node %u has no memory\n", i);
-
-		setup_node_bootmem(i, nodes[i].start, nodes[i].end);
-	}
-	for (i = 0; i < nr_cpu_ids; i++) {
-		if (cpu_to_node[i] == NUMA_NO_NODE)
-			continue;
-		if (!nodemask_test(cpu_to_node[i], &processor_nodes_parsed))
-			numa_set_node(i, NUMA_NO_NODE);
-	}
-	numa_init_array();
-	return 0;
-}
-
-static unsigned node_to_pxm(nodeid_t n)
+unsigned int numa_node_to_arch_nid(nodeid_t n)
 {
-	unsigned i;
+	unsigned int i;
 
 	if ((n < ARRAY_SIZE(pxm2node)) && (pxm2node[n].node == n))
 		return pxm2node[n].pxm;
@@ -594,8 +329,8 @@ u8 __node_distance(nodeid_t a, nodeid_t b)
 
 	if (!acpi_slit)
 		return a == b ? 10 : 20;
-	index = acpi_slit->locality_count * node_to_pxm(a);
-	slit_val = acpi_slit->entry[index + node_to_pxm(b)];
+	index = acpi_slit->locality_count * numa_node_to_arch_nid(a);
+	slit_val = acpi_slit->entry[index + numa_node_to_arch_nid(b)];
 
 	/* ACPI defines 0xff as an unreachable node and 0-9 are undefined */
 	if ((slit_val == 0xff) || (slit_val <= 9))
diff --git a/xen/common/numa.c b/xen/common/numa.c
index 03e7318d72..da0ff7ae34 100644
--- a/xen/common/numa.c
+++ b/xen/common/numa.c
@@ -13,6 +13,21 @@
 #include <xen/sched.h>
 #include <xen/softirq.h>
 
+static nodemask_t __initdata processor_nodes_parsed;
+static nodemask_t __initdata memory_nodes_parsed;
+static struct node __initdata nodes[MAX_NUMNODES];
+
+static unsigned int __ro_after_init num_node_memblks;
+static struct node __ro_after_init node_memblk_range[NR_NODE_MEMBLKS];
+static nodeid_t __ro_after_init memblk_nodeid[NR_NODE_MEMBLKS];
+static __initdata DECLARE_BITMAP(memblk_hotplug, NR_NODE_MEMBLKS);
+
+enum conflicts {
+    NO_CONFLICT,
+    OVERLAP,
+    INTERLEAVE,
+};
+
 struct node_data __ro_after_init node_data[MAX_NUMNODES];
 
 /* Mapping from pdx to node id */
@@ -33,7 +48,309 @@ bool __read_mostly numa_off;
 
 bool numa_disabled(void)
 {
-    return numa_off || arch_numa_disabled();
+    return numa_off || arch_numa_disabled(false);
+}
+
+void __init numa_set_processor_nodes_parsed(nodeid_t node)
+{
+    node_set(node, processor_nodes_parsed);
+}
+
+bool valid_numa_range(paddr_t start, paddr_t end, nodeid_t node)
+{
+    unsigned int i;
+
+    for ( i = 0; i < num_node_memblks; i++ )
+    {
+        struct node *nd = &node_memblk_range[i];
+
+        if ( nd->start <= start && nd->end >= end &&
+             memblk_nodeid[i] == node )
+            return true;
+    }
+
+    return false;
+}
+
+static enum conflicts __init conflicting_memblks(
+    nodeid_t nid, paddr_t start, paddr_t end, paddr_t nd_start,
+    paddr_t nd_end, unsigned int *mblkid)
+{
+    unsigned int i;
+
+    /*
+     * Scan all recorded nodes' memory blocks to check conflicts:
+     * Overlap or interleave.
+     */
+    for ( i = 0; i < num_node_memblks; i++ )
+    {
+        const struct node *nd = &node_memblk_range[i];
+
+        *mblkid = i;
+
+        /* Skip 0 bytes node memory block. */
+        if ( nd->start == nd->end )
+            continue;
+        /*
+         * Use memblk range to check memblk overlaps, include the
+         * self-overlap case. As nd's range is non-empty, the special
+         * case "nd->end == end && nd->start == start" also can be covered.
+         */
+        if ( nd->end > start && nd->start < end )
+            return OVERLAP;
+
+        /*
+         * Use node memory range to check whether new range contains
+         * memory from other nodes - interleave check. We just need
+         * to check full contains situation. Because overlaps have
+         * been checked above.
+         */
+        if ( nid != memblk_nodeid[i] &&
+             nd->start >= nd_start && nd->end <= nd_end )
+            return INTERLEAVE;
+    }
+
+    return NO_CONFLICT;
+}
+
+static void __init cutoff_node(nodeid_t i, paddr_t start, paddr_t end)
+{
+    struct node *nd = &nodes[i];
+
+    if ( nd->start < start )
+    {
+        nd->start = start;
+        if ( nd->end < nd->start )
+            nd->start = nd->end;
+    }
+
+    if ( nd->end > end )
+    {
+        nd->end = end;
+        if ( nd->start > nd->end )
+            nd->start = nd->end;
+    }
+}
+
+bool __init numa_memblks_available(void)
+{
+    return num_node_memblks < NR_NODE_MEMBLKS;
+}
+
+/*
+ * This function will be called by NUMA memory affinity initialization to
+ * update NUMA node's memory range. In this function, we assume all memory
+ * regions belonging to a single node are in one chunk. Holes (or MMIO
+ * ranges) between them will be included in the node.
+ *
+ * So in numa_update_node_memblks, if there are multiple banks for each
+ * node, start and end are stretched to cover the holes between them, and
+ * it works as long as memory banks of different NUMA nodes don't interleave.
+ */
+bool __init numa_update_node_memblks(nodeid_t node, unsigned int arch_nid,
+                                     paddr_t start, paddr_t size,
+                                     const char *prefix,
+                                     bool hotplug)
+{
+    unsigned int i;
+    paddr_t end = start + size;
+    paddr_t nd_start = start;
+    paddr_t nd_end = end;
+    struct node *nd = &nodes[node];
+
+    /*
+     * For the node that already has some memory blocks, we will
+     * expand the node memory range temporarily to check memory
+     * interleaves with other nodes. We will not use this node
+     * temp memory range to check overlaps, because it will mask
+     * the overlaps in same node.
+     *
+     * Node with 0 bytes memory doesn't need this expandsion.
+     */
+    if ( nd->start != nd->end )
+    {
+        if ( nd_start > nd->start )
+            nd_start = nd->start;
+
+        if ( nd_end < nd->end )
+            nd_end = nd->end;
+    }
+
+    /* It is fine to add this area to the nodes data it will be used later */
+    switch ( conflicting_memblks(node, start, end, nd_start, nd_end, &i) )
+    {
+    case OVERLAP:
+        if ( memblk_nodeid[i] == node )
+        {
+            bool mismatch = !(hotplug) != !test_bit(i, memblk_hotplug);
+
+            printk("%sNUMA: %s %u [%"PRIpaddr", %"PRIpaddr"] overlaps with itself [%"PRIpaddr", %"PRIpaddr"]\n",
+                   mismatch ? KERN_ERR : KERN_WARNING, prefix,
+                   arch_nid, start, end - 1,
+                   node_memblk_range[i].start, node_memblk_range[i].end - 1);
+            if ( mismatch )
+                return false;
+            break;
+        }
+
+        printk(KERN_ERR
+               "NUMA: %s %u [%"PRIpaddr", %"PRIpaddr"] overlaps with %s %u [%"PRIpaddr", %"PRIpaddr"]\n",
+               prefix, arch_nid, start, end - 1, prefix,
+               numa_node_to_arch_nid(memblk_nodeid[i]),
+               node_memblk_range[i].start, node_memblk_range[i].end - 1);
+        return false;
+
+
+    case INTERLEAVE:
+        printk(KERN_ERR
+               "NUMA: %s %u: [%"PRIpaddr", %"PRIpaddr"] interleaves with %s %u memblk [%"PRIpaddr", %"PRIpaddr"]\n",
+               prefix, arch_nid, nd_start, nd_end - 1,
+               prefix, numa_node_to_arch_nid(memblk_nodeid[i]),
+               node_memblk_range[i].start, node_memblk_range[i].end - 1);
+        return false;
+
+    case NO_CONFLICT:
+        break;
+    }
+
+    if ( !hotplug )
+    {
+        node_set(node, memory_nodes_parsed);
+        nd->start = nd_start;
+        nd->end = nd_end;
+    }
+
+
+#ifdef CONFIG_HAS_NUMA_NODE_FWID
+    printk(KERN_INFO "NUMA: Node %u %s %u [%"PRIpaddr", %"PRIpaddr"]%s\n",
+           node, prefix, arch_nid, start, end - 1,
+           hotplug ? " (hotplug)" : "");
+#else
+    printk(KERN_INFO "NUMA: Node %u [%"PRIpaddr", %"PRIpaddr"]%s\n",
+           node, start, end - 1, hotplug ? " (hotplug)" : "");
+#endif
+
+    node_memblk_range[num_node_memblks].start = start;
+    node_memblk_range[num_node_memblks].end = end;
+    memblk_nodeid[num_node_memblks] = node;
+    if ( hotplug )
+    {
+        __set_bit(num_node_memblks, memblk_hotplug);
+        if ( end > mem_hotplug )
+            mem_hotplug = end;
+    }
+    num_node_memblks++;
+
+    return true;
+}
+
+/*
+ * Sanity check to catch more bad SRATs (they are amazingly common).
+ * Make sure the PXMs cover all memory.
+ */
+static int __init nodes_cover_memory(void)
+{
+    unsigned int i;
+
+    for ( i = 0; ; i++ )
+    {
+        int err;
+        bool found;
+        unsigned int j;
+        paddr_t start, end;
+
+        /* Try to loop memory map from index 0 to end to get RAM ranges. */
+        err = arch_get_ram_range(i, &start, &end);
+
+        /* Reach the end of arch's memory map */
+        if ( err == -ENOENT )
+            break;
+
+        /* Index relate entry is not RAM, skip it. */
+        if ( err )
+            continue;
+
+        do {
+            found = false;
+            for_each_node_mask( j, memory_nodes_parsed )
+                if ( start < nodes[j].end
+                    && end > nodes[j].start )
+                {
+                    if ( start >= nodes[j].start )
+                    {
+                        start = nodes[j].end;
+                        found = true;
+                    }
+
+                    if ( end <= nodes[j].end )
+                    {
+                        end = nodes[j].start;
+                        found = true;
+                    }
+                }
+        } while ( found && start < end );
+
+        if ( start < end )
+        {
+            printk(KERN_ERR "NUMA: No node for RAM range: "
+                   "[%"PRIpaddr", %"PRIpaddr"]\n", start, end - 1);
+            return 0;
+        }
+    }
+    return 1;
+}
+
+/* Use the information discovered above to actually set up the nodes. */
+static bool __init numa_scan_nodes(paddr_t start, paddr_t end)
+{
+    unsigned int i;
+    nodemask_t all_nodes_parsed;
+
+    /* First clean up the node list */
+    for ( i = 0; i < MAX_NUMNODES; i++ )
+        cutoff_node(i, start, end);
+
+    /* When numa is on with good firmware, we can do numa scan nodes. */
+    if ( arch_numa_disabled(true) )
+        return false;
+
+    if ( !nodes_cover_memory() )
+    {
+        numa_fw_bad();
+        return false;
+    }
+
+    memnode_shift = compute_hash_shift(node_memblk_range, num_node_memblks,
+                                       memblk_nodeid);
+
+    if ( memnode_shift < 0 )
+    {
+        printk(KERN_ERR
+               "NUMA: No NUMA node hash function found. Contact maintainer\n");
+        numa_fw_bad();
+        return false;
+    }
+
+    nodes_or(all_nodes_parsed, memory_nodes_parsed, processor_nodes_parsed);
+
+    /* Finally register nodes */
+    for_each_node_mask( i, all_nodes_parsed )
+    {
+        if ( nodes[i].end - nodes[i].start == 0 )
+            printk(KERN_INFO "NUMA: node %u has no memory\n", i);
+
+        setup_node_bootmem(i, nodes[i].start, nodes[i].end);
+    }
+
+    for ( i = 0; i < nr_cpu_ids; i++ )
+    {
+        if ( cpu_to_node[i] == NUMA_NO_NODE )
+            continue;
+        if ( !nodemask_test(cpu_to_node[i], &processor_nodes_parsed) )
+            numa_set_node(i, NUMA_NO_NODE);
+    }
+    numa_init_array();
+    return true;
 }
 
 /*
@@ -242,7 +559,7 @@ void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
 #endif
 
 #ifdef CONFIG_NUMA
-    if ( !numa_off && !numa_scan_nodes(start, end) )
+    if ( !numa_off && numa_scan_nodes(start, end) )
         return;
 #endif
 
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index bfd4150be7..39b9653286 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -159,6 +159,8 @@
 #define PGT_TYPE_INFO_INITIALIZER 0
 #endif
 
+paddr_t __read_mostly mem_hotplug;
+
 /*
  * Comma-separated list of hexadecimal page numbers containing bad bytes.
  * e.g. 'badpage=0x3f45,0x8a321'.
diff --git a/xen/drivers/acpi/Kconfig b/xen/drivers/acpi/Kconfig
index e3f3d8f4b1..6f33d1ad57 100644
--- a/xen/drivers/acpi/Kconfig
+++ b/xen/drivers/acpi/Kconfig
@@ -7,4 +7,5 @@ config ACPI_LEGACY_TABLES_LOOKUP
 
 config ACPI_NUMA
 	bool
+	select HAS_NUMA_NODE_FWID
 	select NUMA
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 35b065146f..2bced13c0c 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -148,6 +148,8 @@ int assign_page(
 /* Dump info to serial console */
 void arch_dump_shared_mem_info(void);
 
+extern paddr_t mem_hotplug;
+
 /*
  * Extra fault info types which are used to further describe
  * the source of an access violation.
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index 38be7db960..e593115ba2 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -11,6 +11,7 @@
 #define NUMA_NO_DISTANCE 0xFF
 
 #define MAX_NUMNODES    (1 << NODES_SHIFT)
+#define NR_NODE_MEMBLKS (MAX_NUMNODES * 2)
 
 #define vcpu_to_node(v) (cpu_to_node((v)->processor))
 
@@ -42,10 +43,10 @@ extern void numa_add_cpu(unsigned int cpu);
 extern void numa_init_array(void);
 extern void numa_set_node(unsigned int cpu, nodeid_t node);
 extern void numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn);
-extern int  numa_scan_nodes(paddr_t start, paddr_t end);
+extern void numa_fw_bad(void);
 
 extern int arch_numa_setup(const char *opt);
-extern bool arch_numa_disabled(void);
+extern bool arch_numa_disabled(bool init_as_disable);
 extern void setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end);
 
 static inline void clear_node_cpumask(unsigned int cpu)
@@ -93,6 +94,13 @@ static inline nodeid_t __attribute_pure__ phys_to_nid(paddr_t addr)
  */
 extern int arch_get_ram_range(unsigned int idx,
                               paddr_t *start, paddr_t *end);
+extern bool valid_numa_range(paddr_t start, paddr_t end, nodeid_t node);
+extern bool numa_memblks_available(void);
+extern bool numa_update_node_memblks(nodeid_t node, unsigned int arch_nid,
+                                     paddr_t start, paddr_t size,
+                                     const char *prefix,
+                                     bool hotplug);
+extern void numa_set_processor_nodes_parsed(nodeid_t node);
 
 #endif
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 6/6] xen: introduce a Kconfig option to configure NUMA nodes number
  2022-09-02  3:31 [PATCH v4 0/6] Device tree based NUMA support for Arm - Part#2 Wei Chen
                   ` (4 preceding siblings ...)
  2022-09-02  3:31 ` [PATCH v4 5/6] xen/x86: move NUMA scan nodes codes from x86 to common Wei Chen
@ 2022-09-02  3:31 ` Wei Chen
  5 siblings, 0 replies; 18+ messages in thread
From: Wei Chen @ 2022-09-02  3:31 UTC (permalink / raw)
  To: xen-devel
  Cc: nd, Wei Chen, Andrew Cooper, George Dunlap, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu, Roger Pau Monné

Currently the maximum number of NUMA nodes is a hardcoded value.
This provides little flexibility unless changing the code.

Introduce a new Kconfig option to change the maximum number of
NUMA nodes conveniently. Also considering that not all
architectures support NUMA, this Kconfig option is only visible
on NUMA enabled architectures. Architectures not supporting NUMA
still use 1 for MAX_NUMNODES.

As NODES_SHIFT is currently unused, we're taking this
opportunity to remove it.

Signed-off-by: Wei Chen <wei.chen@arm.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
v3 -> v4:
1. Update the commit log to follow Jan's suggestion.
2. Add Ack-by.
v2 -> v3:
1. Fix indent.
2. Use 2-64 for node range.
v1 -> v2:
1. Add NODES_SHIFT remove message in commit log
2. Change NR_NUMA_NODES upper bound from 4095 to 255.
---
 xen/arch/Kconfig                | 11 +++++++++++
 xen/arch/x86/include/asm/numa.h |  2 --
 xen/include/xen/numa.h          | 11 ++++++-----
 3 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/xen/arch/Kconfig b/xen/arch/Kconfig
index f16eb0df43..7028f7b74f 100644
--- a/xen/arch/Kconfig
+++ b/xen/arch/Kconfig
@@ -17,3 +17,14 @@ config NR_CPUS
 	  For CPU cores which support Simultaneous Multi-Threading or similar
 	  technologies, this the number of logical threads which Xen will
 	  support.
+
+config NR_NUMA_NODES
+	int "Maximum number of NUMA nodes supported"
+	range 2 64
+	default "64"
+	depends on NUMA
+	help
+	  Controls the build-time size of various arrays and bitmaps
+	  associated with multiple-nodes management. It is the upper bound of
+	  the number of NUMA nodes that the scheduler, memory allocation and
+	  other NUMA-aware components can handle.
diff --git a/xen/arch/x86/include/asm/numa.h b/xen/arch/x86/include/asm/numa.h
index 2ca3475271..7866afa408 100644
--- a/xen/arch/x86/include/asm/numa.h
+++ b/xen/arch/x86/include/asm/numa.h
@@ -3,8 +3,6 @@
 
 #include <xen/cpumask.h>
 
-#define NODES_SHIFT 6
-
 typedef u8 nodeid_t;
 
 extern int srat_rev;
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index e593115ba2..e16a7c3764 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -3,14 +3,15 @@
 
 #include <asm/numa.h>
 
-#ifndef NODES_SHIFT
-#define NODES_SHIFT     0
-#endif
-
 #define NUMA_NO_NODE     0xFF
 #define NUMA_NO_DISTANCE 0xFF
 
-#define MAX_NUMNODES    (1 << NODES_SHIFT)
+#ifdef CONFIG_NR_NUMA_NODES
+#define MAX_NUMNODES CONFIG_NR_NUMA_NODES
+#else
+#define MAX_NUMNODES 1
+#endif
+
 #define NR_NODE_MEMBLKS (MAX_NUMNODES * 2)
 
 #define vcpu_to_node(v) (cpu_to_node((v)->processor))
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 1/6] xen/x86: Provide helpers for common code to access acpi_numa
  2022-09-02  3:31 ` [PATCH v4 1/6] xen/x86: Provide helpers for common code to access acpi_numa Wei Chen
@ 2022-09-08  8:22   ` Jan Beulich
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Beulich @ 2022-09-08  8:22 UTC (permalink / raw)
  To: Wei Chen; +Cc: nd, Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 02.09.2022 05:31, Wei Chen wrote:
> acpi_numa is a specific NUMA switch for ACPI NUMA implementation.
> Other NUMA implementation may not need this switch. But this switch is
> not only used by ACPI code, it is also used directly in some general
> NUMA logic code. So far this hasn't caused any problem because Xen only
> has x86 implementing ACPI NUMA, but now Arm is implementing device tree
> based NUMA. Accesssing acpi_numa directly in some functions will be a
> block of reusing NUMA common code. It is also difficult for us to replace
> it with a new generic switch, because it is hard to prove that the new
> switch states can guarantee the original code will work correctly.
> 
> So in this patch, we provide two helpers for common code to update and
> get states of acpi_numa. And other new NUMA implementations just need
> to provide the same helpers for common code. In this case, the generic
> NUMA logic code can be reused by all NUMA implementations.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/6] xen/x86: move generically usable NUMA code from x86 to common
  2022-09-02  3:31 ` [PATCH v4 2/6] xen/x86: move generically usable NUMA code from x86 to common Wei Chen
@ 2022-09-08  9:09   ` Jan Beulich
  2022-09-08 10:32     ` Wei Chen
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2022-09-08  9:09 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 02.09.2022 05:31, Wei Chen wrote:
> --- /dev/null
> +++ b/xen/common/numa.c
> @@ -0,0 +1,442 @@
> +/*
> + * Generic VM initialization for NUMA setups.
> + * Copyright 2002,2003 Andi Kleen, SuSE Labs.
> + * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com>
> + */
> +
> +#include <xen/init.h>
> +#include <xen/keyhandler.h>
> +#include <xen/mm.h>
> +#include <xen/nodemask.h>
> +#include <xen/numa.h>
> +#include <xen/param.h>
> +#include <xen/sched.h>
> +#include <xen/softirq.h>
> +
> +struct node_data __ro_after_init node_data[MAX_NUMNODES];
> +
> +/* Mapping from pdx to node id */
> +unsigned int __ro_after_init memnode_shift;
> +unsigned long __ro_after_init memnodemapsize;
> +uint8_t *__ro_after_init memnodemap;
> +static uint8_t __ro_after_init _memnodemap[64];

These last two want to use nodeid_t instead of uint8_t. Originally
the latter used typeof(*memnodemap) for (I think - iirc it was me who
made it that way) this reason: That way correcting memnodemap's type
would have propagated without the need for further adjustments.

> +nodeid_t __read_mostly cpu_to_node[NR_CPUS] = {
> +    [0 ... NR_CPUS-1] = NUMA_NO_NODE
> +};
> +
> +cpumask_t __read_mostly node_to_cpumask[MAX_NUMNODES];
> +
> +nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
> +
> +bool __read_mostly numa_off;

The v3 review discussing this possibly becoming __ro_after_init didn't
really finish (you didn't reply to my latest request there), but you
also didn't change the attribute. Please clarify.

> +static int __init populate_memnodemap(const struct node *nodes,
> +                                      unsigned int numnodes, unsigned int shift,
> +                                      nodeid_t *nodeids)

Can't this be pointer-to-const, and then also in the caller?

> +static unsigned int __init extract_lsb_from_nodes(const struct node *nodes,
> +                                                  nodeid_t numnodes)
> +{
> +    unsigned int i;
> +    nodeid_t nodes_used = 0;

This once again is a variable which doesn't really hold a node ID (but
instead is a counter), and hence would better be unsigned int (see
./CODING_STYLE).

> +    unsigned long spdx, epdx;
> +    unsigned long bitfield = 0, memtop = 0;
> +
> +    for ( i = 0; i < numnodes; i++ )
> +    {
> +        spdx = paddr_to_pdx(nodes[i].start);
> +        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
> +        if ( spdx >= epdx )
> +            continue;
> +        bitfield |= spdx;
> +        nodes_used++;
> +        if ( epdx > memtop )
> +            memtop = epdx;
> +    }
> +    if ( nodes_used <= 1 )
> +        i = BITS_PER_LONG - 1;
> +    else
> +        i = find_first_bit(&bitfield, sizeof(unsigned long)*8);

Please add the missing blanks around * .

> +    memnodemapsize = (memtop >> i) + 1;
> +    return i;

Please add the missing blank line before the (main) return statement
of the function.

> +int __init compute_hash_shift(const struct node *nodes,
> +                              nodeid_t numnodes, nodeid_t *nodeids)

While I agree that nodeid_t can hold all necessary values, I still
don't think a cound should be expressed by nodeid_t. As above - see
./CODING_STYLE.

> +{
> +    unsigned int shift;
> +
> +    shift = extract_lsb_from_nodes(nodes, numnodes);
> +    if ( memnodemapsize <= ARRAY_SIZE(_memnodemap) )
> +        memnodemap = _memnodemap;
> +    else if ( allocate_cachealigned_memnodemap() )
> +        return -1;
> +    printk(KERN_DEBUG "NUMA: Using %d for the hash shift.\n", shift);

This wants to be %u now. I'd also like to ask to drop the full stop
at this occasion.

> +    if ( populate_memnodemap(nodes, numnodes, shift, nodeids) != 1 )
> +    {
> +        printk(KERN_INFO "Your memory is not aligned you need to "
> +               "rebuild your hypervisor with a bigger NODEMAPSIZE "
> +               "shift=%d\n", shift);

Again %u please.

> +/* initialize NODE_DATA given nodeid and start/end */
> +void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end)

Please capitalize the first letter of the comment (see ./CODING_STYLE).

> +void __init numa_init_array(void)
> +{
> +    unsigned int i;
> +    nodeid_t rr;
> +
> +    /*
> +     * There are unfortunately some poorly designed mainboards around
> +     * that only connect memory to a single CPU. This breaks the 1:1 cpu->node
> +     * mapping. To avoid this fill in the mapping for all possible
> +     * CPUs, as the number of CPUs is not known yet.
> +     * We round robin the existing nodes.
> +     */

Along with the style correction re-flowing of the text would have been
nice, such the lines aren't wrapped seemingly randomly without utilizing
permitted line length.

> +void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
> +{
> +    unsigned int i;
> +    paddr_t start = pfn_to_paddr(start_pfn);
> +    paddr_t end = pfn_to_paddr(end_pfn);
> +
> +#ifdef CONFIG_NUMA_EMU
> +    if ( numa_fake && !numa_emulation(start_pfn, end_pfn) )
> +        return;
> +#endif
> +
> +#ifdef CONFIG_NUMA
> +    if ( !numa_off && !numa_scan_nodes(start, end) )
> +        return;
> +#endif
> +
> +    printk(KERN_INFO "%s\n",
> +           numa_off ? "NUMA turned off" : "No NUMA configuration found");
> +
> +    printk(KERN_INFO "Faking a node at %"PRIpaddr"-%"PRIpaddr"\n",
> +           start, end);
> +    /* setup dummy node covering all memory */

Please again capitalize the first character of the comment.

> +static void cf_check dump_numa(unsigned char key)
> +{
> +    s_time_t now = NOW();
> +    unsigned int i, j, n;
> +    struct domain *d;
> +    struct page_info *page;

Along with the various other style corrections perhaps add const here?

> +    unsigned int page_num_node[MAX_NUMNODES];
> +    const struct vnuma_info *vnuma;
> +
> +    printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key,
> +           now);
> +
> +    for_each_online_node ( i )
> +    {
> +        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
> +
> +        printk("NODE%u start->%lu size->%lu free->%lu\n",
> +               i, node_start_pfn(i), node_spanned_pages(i),
> +               avail_node_heap_pages(i));
> +        /* sanity check phys_to_nid() */

First char of comment again.

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/6] xen/x86: move generically usable NUMA code from x86 to common
  2022-09-08  9:09   ` Jan Beulich
@ 2022-09-08 10:32     ` Wei Chen
  2022-09-08 11:42       ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Wei Chen @ 2022-09-08 10:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

Hi Jan,

On 2022/9/8 17:09, Jan Beulich wrote:
> On 02.09.2022 05:31, Wei Chen wrote:
>> --- /dev/null
>> +++ b/xen/common/numa.c
>> @@ -0,0 +1,442 @@
>> +/*
>> + * Generic VM initialization for NUMA setups.
>> + * Copyright 2002,2003 Andi Kleen, SuSE Labs.
>> + * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com>
>> + */
>> +
>> +#include <xen/init.h>
>> +#include <xen/keyhandler.h>
>> +#include <xen/mm.h>
>> +#include <xen/nodemask.h>
>> +#include <xen/numa.h>
>> +#include <xen/param.h>
>> +#include <xen/sched.h>
>> +#include <xen/softirq.h>
>> +
>> +struct node_data __ro_after_init node_data[MAX_NUMNODES];
>> +
>> +/* Mapping from pdx to node id */
>> +unsigned int __ro_after_init memnode_shift;
>> +unsigned long __ro_after_init memnodemapsize;
>> +uint8_t *__ro_after_init memnodemap;
>> +static uint8_t __ro_after_init _memnodemap[64];
> 
> These last two want to use nodeid_t instead of uint8_t. Originally
> the latter used typeof(*memnodemap) for (I think - iirc it was me who
> made it that way) this reason: That way correcting memnodemap's type
> would have propagated without the need for further adjustments.
> 

Thanks for this info, should I need to restore it to use
"typeof(*memnodemap)" in next version ?

>> +nodeid_t __read_mostly cpu_to_node[NR_CPUS] = {
>> +    [0 ... NR_CPUS-1] = NUMA_NO_NODE
>> +};
>> +
>> +cpumask_t __read_mostly node_to_cpumask[MAX_NUMNODES];
>> +
>> +nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
>> +
>> +bool __read_mostly numa_off;
> 
> The v3 review discussing this possibly becoming __ro_after_init didn't
> really finish (you didn't reply to my latest request there), but you
> also didn't change the attribute. Please clarify.
> 

I think I had answered your question by:
 >> I think yes, it will be used in numa_disabled and numa_disabled will
 >> be called in cpu_add."

And you replied me with:
 > In the original code I cannot spot such a path - can you please point
 > out how exactly you see numa_disabled() reachable from cpu_add()? I'm
 > clearly overlooking something ..."

But there is a time difference here, your reply was sent after I sent 
v3, maybe you didn't notice it

About the new question:
cpu_add will call srat_disabled, srat_disabled will access numa_off.
srat_disabled is a function without __init.

>> +static int __init populate_memnodemap(const struct node *nodes,
>> +                                      unsigned int numnodes, unsigned int shift,
>> +                                      nodeid_t *nodeids)
> 
> Can't this be pointer-to-const, and then also in the caller?
>

Yes, it's possible, I will fix it.

>> +static unsigned int __init extract_lsb_from_nodes(const struct node *nodes,
>> +                                                  nodeid_t numnodes)
>> +{
>> +    unsigned int i;
>> +    nodeid_t nodes_used = 0;
> 
> This once again is a variable which doesn't really hold a node ID (but
> instead is a counter), and hence would better be unsigned int (see
> ./CODING_STYLE).
>

Ok.

>> +    unsigned long spdx, epdx;
>> +    unsigned long bitfield = 0, memtop = 0;
>> +
>> +    for ( i = 0; i < numnodes; i++ )
>> +    {
>> +        spdx = paddr_to_pdx(nodes[i].start);
>> +        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
>> +        if ( spdx >= epdx )
>> +            continue;
>> +        bitfield |= spdx;
>> +        nodes_used++;
>> +        if ( epdx > memtop )
>> +            memtop = epdx;
>> +    }
>> +    if ( nodes_used <= 1 )
>> +        i = BITS_PER_LONG - 1;
>> +    else
>> +        i = find_first_bit(&bitfield, sizeof(unsigned long)*8);
> 
> Please add the missing blanks around * .
> 

Ok.

>> +    memnodemapsize = (memtop >> i) + 1;
>> +    return i;
> 
> Please add the missing blank line before the (main) return statement
> of the function.
> 

I'll fix him and other places, if there are any.

>> +int __init compute_hash_shift(const struct node *nodes,
>> +                              nodeid_t numnodes, nodeid_t *nodeids)
> 
> While I agree that nodeid_t can hold all necessary values, I still
> don't think a cound should be expressed by nodeid_t. As above - see
> ./CODING_STYLE.
> 

Ok.

>> +{
>> +    unsigned int shift;
>> +
>> +    shift = extract_lsb_from_nodes(nodes, numnodes);
>> +    if ( memnodemapsize <= ARRAY_SIZE(_memnodemap) )
>> +        memnodemap = _memnodemap;
>> +    else if ( allocate_cachealigned_memnodemap() )
>> +        return -1;
>> +    printk(KERN_DEBUG "NUMA: Using %d for the hash shift.\n", shift);
> 
> This wants to be %u now. I'd also like to ask to drop the full stop
> at this occasion.

Ok, that makes sense.

> 
>> +    if ( populate_memnodemap(nodes, numnodes, shift, nodeids) != 1 )
>> +    {
>> +        printk(KERN_INFO "Your memory is not aligned you need to "
>> +               "rebuild your hypervisor with a bigger NODEMAPSIZE "
>> +               "shift=%d\n", shift);
> 
> Again %u please.
> 

Ack.

>> +/* initialize NODE_DATA given nodeid and start/end */
>> +void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end)
> 
> Please capitalize the first letter of the comment (see ./CODING_STYLE).
> 

Ok.

>> +void __init numa_init_array(void)
>> +{
>> +    unsigned int i;
>> +    nodeid_t rr;
>> +
>> +    /*
>> +     * There are unfortunately some poorly designed mainboards around
>> +     * that only connect memory to a single CPU. This breaks the 1:1 cpu->node
>> +     * mapping. To avoid this fill in the mapping for all possible
>> +     * CPUs, as the number of CPUs is not known yet.
>> +     * We round robin the existing nodes.
>> +     */
> 
> Along with the style correction re-flowing of the text would have been
> nice, such the lines aren't wrapped seemingly randomly without utilizing
> permitted line length.
> 

I will adjust it.

>> +void __init numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn)
>> +{
>> +    unsigned int i;
>> +    paddr_t start = pfn_to_paddr(start_pfn);
>> +    paddr_t end = pfn_to_paddr(end_pfn);
>> +
>> +#ifdef CONFIG_NUMA_EMU
>> +    if ( numa_fake && !numa_emulation(start_pfn, end_pfn) )
>> +        return;
>> +#endif
>> +
>> +#ifdef CONFIG_NUMA
>> +    if ( !numa_off && !numa_scan_nodes(start, end) )
>> +        return;
>> +#endif
>> +
>> +    printk(KERN_INFO "%s\n",
>> +           numa_off ? "NUMA turned off" : "No NUMA configuration found");
>> +
>> +    printk(KERN_INFO "Faking a node at %"PRIpaddr"-%"PRIpaddr"\n",
>> +           start, end);
>> +    /* setup dummy node covering all memory */
> 
> Please again capitalize the first character of the comment.
> 

Ok.

>> +static void cf_check dump_numa(unsigned char key)
>> +{
>> +    s_time_t now = NOW();
>> +    unsigned int i, j, n;
>> +    struct domain *d;
>> +    struct page_info *page;
> 
> Along with the various other style corrections perhaps add const here?
> 

I will add it.

>> +    unsigned int page_num_node[MAX_NUMNODES];
>> +    const struct vnuma_info *vnuma;
>> +
>> +    printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key,
>> +           now);
>> +
>> +    for_each_online_node ( i )
>> +    {
>> +        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
>> +
>> +        printk("NODE%u start->%lu size->%lu free->%lu\n",
>> +               i, node_start_pfn(i), node_spanned_pages(i),
>> +               avail_node_heap_pages(i));
>> +        /* sanity check phys_to_nid() */
> 
> First char of comment again.
> 

Ok.

Thanks.
Wei Chen

> Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/6] xen/x86: move generically usable NUMA code from x86 to common
  2022-09-08 10:32     ` Wei Chen
@ 2022-09-08 11:42       ` Jan Beulich
  2022-09-08 14:52         ` Wei Chen
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2022-09-08 11:42 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 08.09.2022 12:32, Wei Chen wrote:
> On 2022/9/8 17:09, Jan Beulich wrote:
>> On 02.09.2022 05:31, Wei Chen wrote:
>>> --- /dev/null
>>> +++ b/xen/common/numa.c
>>> @@ -0,0 +1,442 @@
>>> +/*
>>> + * Generic VM initialization for NUMA setups.
>>> + * Copyright 2002,2003 Andi Kleen, SuSE Labs.
>>> + * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com>
>>> + */
>>> +
>>> +#include <xen/init.h>
>>> +#include <xen/keyhandler.h>
>>> +#include <xen/mm.h>
>>> +#include <xen/nodemask.h>
>>> +#include <xen/numa.h>
>>> +#include <xen/param.h>
>>> +#include <xen/sched.h>
>>> +#include <xen/softirq.h>
>>> +
>>> +struct node_data __ro_after_init node_data[MAX_NUMNODES];
>>> +
>>> +/* Mapping from pdx to node id */
>>> +unsigned int __ro_after_init memnode_shift;
>>> +unsigned long __ro_after_init memnodemapsize;
>>> +uint8_t *__ro_after_init memnodemap;
>>> +static uint8_t __ro_after_init _memnodemap[64];
>>
>> These last two want to use nodeid_t instead of uint8_t. Originally
>> the latter used typeof(*memnodemap) for (I think - iirc it was me who
>> made it that way) this reason: That way correcting memnodemap's type
>> would have propagated without the need for further adjustments.
>>
> 
> Thanks for this info, should I need to restore it to use
> "typeof(*memnodemap)" in next version ?

That would be more in line with the original code, but it's not
strictly necessary once nodeid_t if properly used for these variables.
I'd leave it up to you as long as you switch to nodeid_t.

>>> +nodeid_t __read_mostly cpu_to_node[NR_CPUS] = {
>>> +    [0 ... NR_CPUS-1] = NUMA_NO_NODE
>>> +};
>>> +
>>> +cpumask_t __read_mostly node_to_cpumask[MAX_NUMNODES];
>>> +
>>> +nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
>>> +
>>> +bool __read_mostly numa_off;
>>
>> The v3 review discussing this possibly becoming __ro_after_init didn't
>> really finish (you didn't reply to my latest request there), but you
>> also didn't change the attribute. Please clarify.
>>
> 
> I think I had answered your question by:
>  >> I think yes, it will be used in numa_disabled and numa_disabled will
>  >> be called in cpu_add."
> 
> And you replied me with:
>  > In the original code I cannot spot such a path - can you please point
>  > out how exactly you see numa_disabled() reachable from cpu_add()? I'm
>  > clearly overlooking something ..."
> 
> But there is a time difference here, your reply was sent after I sent 
> v3, maybe you didn't notice it

Which suggests you might better have waited with sending v3 until the
discussion had settled.

> About the new question:
> cpu_add will call srat_disabled, srat_disabled will access numa_off.
> srat_disabled is a function without __init.

But the request wasn't to make the variable __initdata. That would be
wrong of course. Since srat_disabled() only reads numa_off,
__ro_after_init does look usable to me.

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 4/6] xen/x86: use arch_get_ram_range to get information from E820 map
  2022-09-02  3:31 ` [PATCH v4 4/6] xen/x86: use arch_get_ram_range to get information from E820 map Wei Chen
@ 2022-09-08 12:14   ` Jan Beulich
  2022-09-08 14:55     ` Wei Chen
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2022-09-08 12:14 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 02.09.2022 05:31, Wei Chen wrote:
> The sanity check of nodes_cover_memory is also a requirement of
> other architectures that support NUMA. But now, the code of
> nodes_cover_memory is tied to the x86 E820. In this case, we
> introduce arch_get_ram_range to decouple architecture specific
> memory map from this function. This means, other architectures
> like Arm can also use it to check its node and memory coverage
> from bootmem info.
> 
> Depends arch_get_ram_range, we make nodes_cover_memory become
> architecture independent. We also use neutral words to replace
> SRAT and E820 in the print message of this function. This will
> to make the massage seems more common.
> 
> As arch_get_ram_range use unsigned int for index, we also adjust
> the index in nodes_cover_memory from int to unsigned int.
> 
> Signed-off-by: Wei Chen <wei.chen@arm.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>
albeit still with a couple of suggestions:

> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -428,37 +428,43 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
>     Make sure the PXMs cover all memory. */
>  static int __init nodes_cover_memory(void)
>  {
> -	int i;
> +	unsigned int i;
>  
> -	for (i = 0; i < e820.nr_map; i++) {
> -		int j, found;
> +	for (i = 0; ; i++) {
> +		int err;
> +		unsigned int j;
> +		bool found;
>  		paddr_t start, end;
>  
> -		if (e820.map[i].type != E820_RAM) {
> -			continue;
> -		}
> +		/* Try to loop memory map from index 0 to end to get RAM ranges. */
> +		err = arch_get_ram_range(i, &start, &end);
>  
> -		start = e820.map[i].addr;
> -		end = e820.map[i].addr + e820.map[i].size;
> +		/* Reach the end of arch's memory map */
> +		if (err == -ENOENT)
> +			break;

Such a comment ahead of an if() is often better put as a question, e.g.
"Reached the end of the memory map?" here or, if you dislike using a
question, "Exit the loop at the end of the memory map".

> +		/* Index relate entry is not RAM, skip it. */
> +		if (err)
> +			continue;

And then perhaps "Skip non-RAM entries" here.

> --- a/xen/include/xen/numa.h
> +++ b/xen/include/xen/numa.h
> @@ -81,6 +81,19 @@ static inline nodeid_t __attribute_pure__ phys_to_nid(paddr_t addr)
>  #define node_end_pfn(nid)       (NODE_DATA(nid)->node_start_pfn + \
>                                   NODE_DATA(nid)->node_spanned_pages)
>  
> +/*
> + * This function provides the ability for caller to get one RAM entry
> + * from architectural memory map by index.
> + *
> + * This function will return zero if it can return a proper RAM entry.
> + * otherwise it will return -ENOENT for out of scope index, or return
> + * -ENODATA for non-RAM type memory entry.

The way you've implemented things, -ENODATA isn't special anymore, so
better wouldn't be called out as special here. May I suggest to at
least insert "e.g." in front of it? (An alternative would be to check
for -ENODATA in nodes_cover_memory() again, followed by ASSERT(!err).)

> + * Note: the range is exclusive at the end, e.g. [start, end).

Perhaps better [*start, *end) to match ...

> + */
> +extern int arch_get_ram_range(unsigned int idx,
> +                              paddr_t *start, paddr_t *end);

... this?

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 5/6] xen/x86: move NUMA scan nodes codes from x86 to common
  2022-09-02  3:31 ` [PATCH v4 5/6] xen/x86: move NUMA scan nodes codes from x86 to common Wei Chen
@ 2022-09-08 13:02   ` Jan Beulich
  2022-09-08 15:26     ` Wei Chen
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2022-09-08 13:02 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 02.09.2022 05:31, Wei Chen wrote:
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -41,9 +41,12 @@ int __init arch_numa_setup(const char *opt)
>      return -EINVAL;
>  }
>  
> -bool arch_numa_disabled(void)
> +bool arch_numa_disabled(bool init_as_disable)

I'm afraid my question as to the meaning of the name of the parameter has
remained unanswered.

> @@ -306,32 +218,27 @@ acpi_numa_processor_affinity_init(const struct acpi_srat_cpu_affinity *pa)
>  void __init
>  acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
>  {
> -	struct node *nd;
> -	paddr_t nd_start, nd_end;
> -	paddr_t start, end;
>  	unsigned pxm;
>  	nodeid_t node;
> -	unsigned int i;
>  
>  	if (numa_disabled())
>  		return;
>  	if (ma->header.length != sizeof(struct acpi_srat_mem_affinity)) {
> -		bad_srat();
> +		numa_fw_bad();
>  		return;
>  	}
>  	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
>  		return;
>  
> -	start = ma->base_address;
> -	end = start + ma->length;
>  	/* Supplement the heuristics in l1tf_calculations(). */
> -	l1tf_safe_maddr = max(l1tf_safe_maddr, ROUNDUP(end, PAGE_SIZE));
> +	l1tf_safe_maddr = max(l1tf_safe_maddr,
> +			      ROUNDUP(ma->base_address + ma->length,
> +			      PAGE_SIZE));

Indentation:

	l1tf_safe_maddr = max(l1tf_safe_maddr,
			      ROUNDUP(ma->base_address + ma->length,
			              PAGE_SIZE));

> @@ -33,7 +48,309 @@ bool __read_mostly numa_off;
>  
>  bool numa_disabled(void)
>  {
> -    return numa_off || arch_numa_disabled();
> +    return numa_off || arch_numa_disabled(false);
> +}
> +
> +void __init numa_set_processor_nodes_parsed(nodeid_t node)
> +{
> +    node_set(node, processor_nodes_parsed);
> +}
> +
> +bool valid_numa_range(paddr_t start, paddr_t end, nodeid_t node)
> +{
> +    unsigned int i;
> +
> +    for ( i = 0; i < num_node_memblks; i++ )
> +    {
> +        struct node *nd = &node_memblk_range[i];

const? (This is particularly relevant with __ro_after_init.)

> +bool __init numa_update_node_memblks(nodeid_t node, unsigned int arch_nid,
> +                                     paddr_t start, paddr_t size,
> +                                     const char *prefix,
> +                                     bool hotplug)
> +{
> +    unsigned int i;
> +    paddr_t end = start + size;
> +    paddr_t nd_start = start;
> +    paddr_t nd_end = end;
> +    struct node *nd = &nodes[node];
> +
> +    /*
> +     * For the node that already has some memory blocks, we will
> +     * expand the node memory range temporarily to check memory
> +     * interleaves with other nodes. We will not use this node
> +     * temp memory range to check overlaps, because it will mask
> +     * the overlaps in same node.
> +     *
> +     * Node with 0 bytes memory doesn't need this expandsion.

Mind taking the opportunity and drop the 'd' from "expansion"?

> +     */
> +    if ( nd->start != nd->end )
> +    {
> +        if ( nd_start > nd->start )
> +            nd_start = nd->start;
> +
> +        if ( nd_end < nd->end )
> +            nd_end = nd->end;
> +    }
> +
> +    /* It is fine to add this area to the nodes data it will be used later */
> +    switch ( conflicting_memblks(node, start, end, nd_start, nd_end, &i) )
> +    {
> +    case OVERLAP:
> +        if ( memblk_nodeid[i] == node )
> +        {
> +            bool mismatch = !(hotplug) != !test_bit(i, memblk_hotplug);

No need to parenthesize "hotplug" here.

> +            printk("%sNUMA: %s %u [%"PRIpaddr", %"PRIpaddr"] overlaps with itself [%"PRIpaddr", %"PRIpaddr"]\n",
> +                   mismatch ? KERN_ERR : KERN_WARNING, prefix,
> +                   arch_nid, start, end - 1,
> +                   node_memblk_range[i].start, node_memblk_range[i].end - 1);
> +            if ( mismatch )
> +                return false;
> +            break;
> +        }
> +
> +        printk(KERN_ERR
> +               "NUMA: %s %u [%"PRIpaddr", %"PRIpaddr"] overlaps with %s %u [%"PRIpaddr", %"PRIpaddr"]\n",
> +               prefix, arch_nid, start, end - 1, prefix,
> +               numa_node_to_arch_nid(memblk_nodeid[i]),
> +               node_memblk_range[i].start, node_memblk_range[i].end - 1);
> +        return false;
> +
> +
> +    case INTERLEAVE:

Please don't add double blank lines anywhere (original code didn't have
these); there's at least one more instance below.

> +static int __init nodes_cover_memory(void)
> +{
> +    unsigned int i;
> +
> +    for ( i = 0; ; i++ )
> +    {
> +        int err;
> +        bool found;
> +        unsigned int j;
> +        paddr_t start, end;
> +
> +        /* Try to loop memory map from index 0 to end to get RAM ranges. */
> +        err = arch_get_ram_range(i, &start, &end);
> +
> +        /* Reach the end of arch's memory map */
> +        if ( err == -ENOENT )
> +            break;
> +
> +        /* Index relate entry is not RAM, skip it. */
> +        if ( err )
> +            continue;
> +
> +        do {
> +            found = false;
> +            for_each_node_mask( j, memory_nodes_parsed )

Please be consistent with style for constructs like this one: Either
you consider for_each_node_mask a pseudo-keyword (along the lines of
for(;;)), then there's a blank missing ahead of the opening
parenthesis. Or you consider this an ordinary identifier (i.e. the
function-like macro that it is), then there shouldn't be blanks
immediately inside the parentheses. (Same issue elsewhere.)

> +                if ( start < nodes[j].end
> +                    && end > nodes[j].start )
> +                {
> +                    if ( start >= nodes[j].start )
> +                    {
> +                        start = nodes[j].end;
> +                        found = true;
> +                    }
> +
> +                    if ( end <= nodes[j].end )
> +                    {
> +                        end = nodes[j].start;
> +                        found = true;
> +                    }
> +                }
> +        } while ( found && start < end );
> +
> +        if ( start < end )
> +        {
> +            printk(KERN_ERR "NUMA: No node for RAM range: "
> +                   "[%"PRIpaddr", %"PRIpaddr"]\n", start, end - 1);
> +            return 0;
> +        }
> +    }
> +    return 1;
> +}

Seeing the two returns (and no further ones in the function) - did
you not mean to also switch to bool/true/false here?

> +/* Use the information discovered above to actually set up the nodes. */
> +static bool __init numa_scan_nodes(paddr_t start, paddr_t end)

Is "above" in the comment actually still accurate? Aiui the discovery
is now in a different CU. Then perhaps "Use discovered information to
actually set up the nodes."

> +{
> +    unsigned int i;
> +    nodemask_t all_nodes_parsed;
> +
> +    /* First clean up the node list */
> +    for ( i = 0; i < MAX_NUMNODES; i++ )
> +        cutoff_node(i, start, end);
> +
> +    /* When numa is on with good firmware, we can do numa scan nodes. */
> +    if ( arch_numa_disabled(true) )
> +        return false;

Btw - the comment here doesn't help me figure your choice of
"init_as_disabled". The wording towards the end is also a little
odd, considering we're already in numa_scan_nodes(). Which further
points out that really there's no scanning here, just processing,
so maybe the earlier patch wants to rename the function to
numa_process_nodes()?

> +    if ( !nodes_cover_memory() )
> +    {
> +        numa_fw_bad();
> +        return false;
> +    }
> +
> +    memnode_shift = compute_hash_shift(node_memblk_range, num_node_memblks,
> +                                       memblk_nodeid);
> +
> +    if ( memnode_shift < 0 )

As previously pointed out: As of patch 2 memnode_shift is unsigned,
so this comparison is always false (and the latest Coverity will
point this out). You can't get away here without using an intermediate
(signed, i.e. plain int) variable.

> +    {
> +        printk(KERN_ERR
> +               "NUMA: No NUMA node hash function found. Contact maintainer\n");
> +        numa_fw_bad();
> +        return false;
> +    }
> +
> +    nodes_or(all_nodes_parsed, memory_nodes_parsed, processor_nodes_parsed);
> +
> +    /* Finally register nodes */
> +    for_each_node_mask( i, all_nodes_parsed )
> +    {
> +        if ( nodes[i].end - nodes[i].start == 0 )

nodes[i].end == nodes[i].start ?

> +            printk(KERN_INFO "NUMA: node %u has no memory\n", i);
> +
> +        setup_node_bootmem(i, nodes[i].start, nodes[i].end);
> +    }
> +
> +    for ( i = 0; i < nr_cpu_ids; i++ )
> +    {
> +        if ( cpu_to_node[i] == NUMA_NO_NODE )
> +            continue;
> +        if ( !nodemask_test(cpu_to_node[i], &processor_nodes_parsed) )
> +            numa_set_node(i, NUMA_NO_NODE);
> +    }
> +    numa_init_array();
> +    return true;
>  }

While you said you'd check elsewhere as well, just to be sure: Please
add a blank line before the function's main "return". And perhaps
another one between loop and function call.

> --- a/xen/drivers/acpi/Kconfig
> +++ b/xen/drivers/acpi/Kconfig
> @@ -7,4 +7,5 @@ config ACPI_LEGACY_TABLES_LOOKUP
>  
>  config ACPI_NUMA
>  	bool
> +	select HAS_NUMA_NODE_FWID

Are you selecting an option here which doesn't exist anywhere? Or
am I overlooking where this new option is being added?

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v4 2/6] xen/x86: move generically usable NUMA code from x86 to common
  2022-09-08 11:42       ` Jan Beulich
@ 2022-09-08 14:52         ` Wei Chen
  0 siblings, 0 replies; 18+ messages in thread
From: Wei Chen @ 2022-09-08 14:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2022年9月8日 19:42
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; George Dunlap
> <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>; Stefano
> Stabellini <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v4 2/6] xen/x86: move generically usable NUMA code
> from x86 to common
> 
> On 08.09.2022 12:32, Wei Chen wrote:
> > On 2022/9/8 17:09, Jan Beulich wrote:
> >> On 02.09.2022 05:31, Wei Chen wrote:
> >>> --- /dev/null
> >>> +++ b/xen/common/numa.c
> >>> @@ -0,0 +1,442 @@
> >>> +/*
> >>> + * Generic VM initialization for NUMA setups.
> >>> + * Copyright 2002,2003 Andi Kleen, SuSE Labs.
> >>> + * Adapted for Xen: Ryan Harper <ryanh@us.ibm.com>
> >>> + */
> >>> +
> >>> +#include <xen/init.h>
> >>> +#include <xen/keyhandler.h>
> >>> +#include <xen/mm.h>
> >>> +#include <xen/nodemask.h>
> >>> +#include <xen/numa.h>
> >>> +#include <xen/param.h>
> >>> +#include <xen/sched.h>
> >>> +#include <xen/softirq.h>
> >>> +
> >>> +struct node_data __ro_after_init node_data[MAX_NUMNODES];
> >>> +
> >>> +/* Mapping from pdx to node id */
> >>> +unsigned int __ro_after_init memnode_shift;
> >>> +unsigned long __ro_after_init memnodemapsize;
> >>> +uint8_t *__ro_after_init memnodemap;
> >>> +static uint8_t __ro_after_init _memnodemap[64];
> >>
> >> These last two want to use nodeid_t instead of uint8_t. Originally
> >> the latter used typeof(*memnodemap) for (I think - iirc it was me who
> >> made it that way) this reason: That way correcting memnodemap's type
> >> would have propagated without the need for further adjustments.
> >>
> >
> > Thanks for this info, should I need to restore it to use
> > "typeof(*memnodemap)" in next version ?
> 
> That would be more in line with the original code, but it's not
> strictly necessary once nodeid_t if properly used for these variables.
> I'd leave it up to you as long as you switch to nodeid_t.
> 

Ok, I will think more about it in next version.

> >>> +nodeid_t __read_mostly cpu_to_node[NR_CPUS] = {
> >>> +    [0 ... NR_CPUS-1] = NUMA_NO_NODE
> >>> +};
> >>> +
> >>> +cpumask_t __read_mostly node_to_cpumask[MAX_NUMNODES];
> >>> +
> >>> +nodemask_t __read_mostly node_online_map = { { [0] = 1UL } };
> >>> +
> >>> +bool __read_mostly numa_off;
> >>
> >> The v3 review discussing this possibly becoming __ro_after_init didn't
> >> really finish (you didn't reply to my latest request there), but you
> >> also didn't change the attribute. Please clarify.
> >>
> >
> > I think I had answered your question by:
> >  >> I think yes, it will be used in numa_disabled and numa_disabled will
> >  >> be called in cpu_add."
> >
> > And you replied me with:
> >  > In the original code I cannot spot such a path - can you please point
> >  > out how exactly you see numa_disabled() reachable from cpu_add()? I'm
> >  > clearly overlooking something ..."
> >
> > But there is a time difference here, your reply was sent after I sent
> > v3, maybe you didn't notice it
> 
> Which suggests you might better have waited with sending v3 until the
> discussion had settled.
> 
> > About the new question:
> > cpu_add will call srat_disabled, srat_disabled will access numa_off.
> > srat_disabled is a function without __init.
> 
> But the request wasn't to make the variable __initdata. That would be
> wrong of course. Since srat_disabled() only reads numa_off,
> __ro_after_init does look usable to me.
> 

Oh, yes, you're right. I had thought wrong. I will correct this.

Cheers,
Wei Chen.

> Jan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v4 4/6] xen/x86: use arch_get_ram_range to get information from E820 map
  2022-09-08 12:14   ` Jan Beulich
@ 2022-09-08 14:55     ` Wei Chen
  0 siblings, 0 replies; 18+ messages in thread
From: Wei Chen @ 2022-09-08 14:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2022年9月8日 20:14
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; George Dunlap
> <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>; Stefano
> Stabellini <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v4 4/6] xen/x86: use arch_get_ram_range to get
> information from E820 map
> 
> On 02.09.2022 05:31, Wei Chen wrote:
> > The sanity check of nodes_cover_memory is also a requirement of
> > other architectures that support NUMA. But now, the code of
> > nodes_cover_memory is tied to the x86 E820. In this case, we
> > introduce arch_get_ram_range to decouple architecture specific
> > memory map from this function. This means, other architectures
> > like Arm can also use it to check its node and memory coverage
> > from bootmem info.
> >
> > Depends arch_get_ram_range, we make nodes_cover_memory become
> > architecture independent. We also use neutral words to replace
> > SRAT and E820 in the print message of this function. This will
> > to make the massage seems more common.
> >
> > As arch_get_ram_range use unsigned int for index, we also adjust
> > the index in nodes_cover_memory from int to unsigned int.
> >
> > Signed-off-by: Wei Chen <wei.chen@arm.com>
> 
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> albeit still with a couple of suggestions:
> 

Thanks, I will adjust the code comments to address your suggestions.

> > --- a/xen/arch/x86/srat.c
> > +++ b/xen/arch/x86/srat.c
> > @@ -428,37 +428,43 @@ acpi_numa_memory_affinity_init(const struct
> acpi_srat_mem_affinity *ma)
> >     Make sure the PXMs cover all memory. */
> >  static int __init nodes_cover_memory(void)
> >  {
> > -	int i;
> > +	unsigned int i;
> >
> > -	for (i = 0; i < e820.nr_map; i++) {
> > -		int j, found;
> > +	for (i = 0; ; i++) {
> > +		int err;
> > +		unsigned int j;
> > +		bool found;
> >  		paddr_t start, end;
> >
> > -		if (e820.map[i].type != E820_RAM) {
> > -			continue;
> > -		}
> > +		/* Try to loop memory map from index 0 to end to get RAM
> ranges. */
> > +		err = arch_get_ram_range(i, &start, &end);
> >
> > -		start = e820.map[i].addr;
> > -		end = e820.map[i].addr + e820.map[i].size;
> > +		/* Reach the end of arch's memory map */
> > +		if (err == -ENOENT)
> > +			break;
> 
> Such a comment ahead of an if() is often better put as a question, e.g.
> "Reached the end of the memory map?" here or, if you dislike using a
> question, "Exit the loop at the end of the memory map".
> 
> > +		/* Index relate entry is not RAM, skip it. */
> > +		if (err)
> > +			continue;
> 
> And then perhaps "Skip non-RAM entries" here.
> 
> > --- a/xen/include/xen/numa.h
> > +++ b/xen/include/xen/numa.h
> > @@ -81,6 +81,19 @@ static inline nodeid_t __attribute_pure__
> phys_to_nid(paddr_t addr)
> >  #define node_end_pfn(nid)       (NODE_DATA(nid)->node_start_pfn + \
> >                                   NODE_DATA(nid)->node_spanned_pages)
> >
> > +/*
> > + * This function provides the ability for caller to get one RAM entry
> > + * from architectural memory map by index.
> > + *
> > + * This function will return zero if it can return a proper RAM entry.
> > + * otherwise it will return -ENOENT for out of scope index, or return
> > + * -ENODATA for non-RAM type memory entry.
> 
> The way you've implemented things, -ENODATA isn't special anymore, so
> better wouldn't be called out as special here. May I suggest to at
> least insert "e.g." in front of it? (An alternative would be to check
> for -ENODATA in nodes_cover_memory() again, followed by ASSERT(!err).)
> 
> > + * Note: the range is exclusive at the end, e.g. [start, end).
> 
> Perhaps better [*start, *end) to match ...
> 
> > + */
> > +extern int arch_get_ram_range(unsigned int idx,
> > +                              paddr_t *start, paddr_t *end);
> 
> ... this?
> 
> Jan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v4 5/6] xen/x86: move NUMA scan nodes codes from x86 to common
  2022-09-08 13:02   ` Jan Beulich
@ 2022-09-08 15:26     ` Wei Chen
  2022-09-08 16:01       ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Wei Chen @ 2022-09-08 15:26 UTC (permalink / raw)
  To: Jan Beulich
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2022年9月8日 21:03
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; George Dunlap
> <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>; Stefano
> Stabellini <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v4 5/6] xen/x86: move NUMA scan nodes codes from x86
> to common
> 
> On 02.09.2022 05:31, Wei Chen wrote:
> > --- a/xen/arch/x86/numa.c
> > +++ b/xen/arch/x86/numa.c
> > @@ -41,9 +41,12 @@ int __init arch_numa_setup(const char *opt)
> >      return -EINVAL;
> >  }
> >
> > -bool arch_numa_disabled(void)
> > +bool arch_numa_disabled(bool init_as_disable)
> 
> I'm afraid my question as to the meaning of the name of the parameter has
> remained unanswered.
> 

Sorry, I might missed some contents of your reply in v3. The name of this
parameter has been bothering me for a long time, and now this is actually
quite awkward. The origin of this parameter is because the current NUMA
implementation will make different judgments under different usage
conditions when using acpi_numa. In acpi_scan_nodes, it uses acpi_numa <= 0
as the condition for judging that ACPI NUMA is turned off. But only use
acpi_numa < 0 as condition in srat_disabled and elsewhere. I use this
parameter in the hope that we can keep the same semantics as the original
code without changing the code of the caller.

> > @@ -306,32 +218,27 @@ acpi_numa_processor_affinity_init(const struct
> acpi_srat_cpu_affinity *pa)
> >  void __init
> >  acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
> >  {
> > -	struct node *nd;
> > -	paddr_t nd_start, nd_end;
> > -	paddr_t start, end;
> >  	unsigned pxm;
> >  	nodeid_t node;
> > -	unsigned int i;
> >
> >  	if (numa_disabled())
> >  		return;
> >  	if (ma->header.length != sizeof(struct acpi_srat_mem_affinity)) {
> > -		bad_srat();
> > +		numa_fw_bad();
> >  		return;
> >  	}
> >  	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
> >  		return;
> >
> > -	start = ma->base_address;
> > -	end = start + ma->length;
> >  	/* Supplement the heuristics in l1tf_calculations(). */
> > -	l1tf_safe_maddr = max(l1tf_safe_maddr, ROUNDUP(end, PAGE_SIZE));
> > +	l1tf_safe_maddr = max(l1tf_safe_maddr,
> > +			      ROUNDUP(ma->base_address + ma->length,
> > +			      PAGE_SIZE));
> 
> Indentation:
> 
> 	l1tf_safe_maddr = max(l1tf_safe_maddr,
> 			      ROUNDUP(ma->base_address + ma->length,
> 			              PAGE_SIZE));
> 

Ok.

> > @@ -33,7 +48,309 @@ bool __read_mostly numa_off;
> >
> >  bool numa_disabled(void)
> >  {
> > -    return numa_off || arch_numa_disabled();
> > +    return numa_off || arch_numa_disabled(false);
> > +}
> > +
> > +void __init numa_set_processor_nodes_parsed(nodeid_t node)
> > +{
> > +    node_set(node, processor_nodes_parsed);
> > +}
> > +
> > +bool valid_numa_range(paddr_t start, paddr_t end, nodeid_t node)
> > +{
> > +    unsigned int i;
> > +
> > +    for ( i = 0; i < num_node_memblks; i++ )
> > +    {
> > +        struct node *nd = &node_memblk_range[i];
> 
> const? (This is particularly relevant with __ro_after_init.)
> 

Yes, I will fix it.

> > +bool __init numa_update_node_memblks(nodeid_t node, unsigned int
> arch_nid,
> > +                                     paddr_t start, paddr_t size,
> > +                                     const char *prefix,
> > +                                     bool hotplug)
> > +{
> > +    unsigned int i;
> > +    paddr_t end = start + size;
> > +    paddr_t nd_start = start;
> > +    paddr_t nd_end = end;
> > +    struct node *nd = &nodes[node];
> > +
> > +    /*
> > +     * For the node that already has some memory blocks, we will
> > +     * expand the node memory range temporarily to check memory
> > +     * interleaves with other nodes. We will not use this node
> > +     * temp memory range to check overlaps, because it will mask
> > +     * the overlaps in same node.
> > +     *
> > +     * Node with 0 bytes memory doesn't need this expandsion.
> 
> Mind taking the opportunity and drop the 'd' from "expansion"?
> 

Ok.

> > +     */
> > +    if ( nd->start != nd->end )
> > +    {
> > +        if ( nd_start > nd->start )
> > +            nd_start = nd->start;
> > +
> > +        if ( nd_end < nd->end )
> > +            nd_end = nd->end;
> > +    }
> > +
> > +    /* It is fine to add this area to the nodes data it will be used
> later */
> > +    switch ( conflicting_memblks(node, start, end, nd_start, nd_end,
> &i) )
> > +    {
> > +    case OVERLAP:
> > +        if ( memblk_nodeid[i] == node )
> > +        {
> > +            bool mismatch = !(hotplug) != !test_bit(i, memblk_hotplug);
> 
> No need to parenthesize "hotplug" here.
> 

Ok

> > +            printk("%sNUMA: %s %u [%"PRIpaddr", %"PRIpaddr"] overlaps
> with itself [%"PRIpaddr", %"PRIpaddr"]\n",
> > +                   mismatch ? KERN_ERR : KERN_WARNING, prefix,
> > +                   arch_nid, start, end - 1,
> > +                   node_memblk_range[i].start, node_memblk_range[i].end
> - 1);
> > +            if ( mismatch )
> > +                return false;
> > +            break;
> > +        }
> > +
> > +        printk(KERN_ERR
> > +               "NUMA: %s %u [%"PRIpaddr", %"PRIpaddr"] overlaps
> with %s %u [%"PRIpaddr", %"PRIpaddr"]\n",
> > +               prefix, arch_nid, start, end - 1, prefix,
> > +               numa_node_to_arch_nid(memblk_nodeid[i]),
> > +               node_memblk_range[i].start, node_memblk_range[i].end -
> 1);
> > +        return false;
> > +
> > +
> > +    case INTERLEAVE:
> 
> Please don't add double blank lines anywhere (original code didn't have
> these); there's at least one more instance below.
> 

I will check the code and fix them.

> > +static int __init nodes_cover_memory(void)
> > +{
> > +    unsigned int i;
> > +
> > +    for ( i = 0; ; i++ )
> > +    {
> > +        int err;
> > +        bool found;
> > +        unsigned int j;
> > +        paddr_t start, end;
> > +
> > +        /* Try to loop memory map from index 0 to end to get RAM ranges.
> */
> > +        err = arch_get_ram_range(i, &start, &end);
> > +
> > +        /* Reach the end of arch's memory map */
> > +        if ( err == -ENOENT )
> > +            break;
> > +
> > +        /* Index relate entry is not RAM, skip it. */
> > +        if ( err )
> > +            continue;
> > +
> > +        do {
> > +            found = false;
> > +            for_each_node_mask( j, memory_nodes_parsed )
> 
> Please be consistent with style for constructs like this one: Either
> you consider for_each_node_mask a pseudo-keyword (along the lines of
> for(;;)), then there's a blank missing ahead of the opening
> parenthesis. Or you consider this an ordinary identifier (i.e. the
> function-like macro that it is), then there shouldn't be blanks
> immediately inside the parentheses. (Same issue elsewhere.)
> 

I will check the code and fix them.

> > +                if ( start < nodes[j].end
> > +                    && end > nodes[j].start )
> > +                {
> > +                    if ( start >= nodes[j].start )
> > +                    {
> > +                        start = nodes[j].end;
> > +                        found = true;
> > +                    }
> > +
> > +                    if ( end <= nodes[j].end )
> > +                    {
> > +                        end = nodes[j].start;
> > +                        found = true;
> > +                    }
> > +                }
> > +        } while ( found && start < end );
> > +
> > +        if ( start < end )
> > +        {
> > +            printk(KERN_ERR "NUMA: No node for RAM range: "
> > +                   "[%"PRIpaddr", %"PRIpaddr"]\n", start, end - 1);
> > +            return 0;
> > +        }
> > +    }
> > +    return 1;
> > +}
> 
> Seeing the two returns (and no further ones in the function) - did
> you not mean to also switch to bool/true/false here?
> 

Ok, I will switch the return value to bool and find if there still are
some other functions that can switch to bool.

> > +/* Use the information discovered above to actually set up the nodes.
> */
> > +static bool __init numa_scan_nodes(paddr_t start, paddr_t end)
> 
> Is "above" in the comment actually still accurate? Aiui the discovery
> is now in a different CU. Then perhaps "Use discovered information to
> actually set up the nodes."
> 

Ok.

> > +{
> > +    unsigned int i;
> > +    nodemask_t all_nodes_parsed;
> > +
> > +    /* First clean up the node list */
> > +    for ( i = 0; i < MAX_NUMNODES; i++ )
> > +        cutoff_node(i, start, end);
> > +
> > +    /* When numa is on with good firmware, we can do numa scan nodes.
> */
> > +    if ( arch_numa_disabled(true) )
> > +        return false;
> 
> Btw - the comment here doesn't help me figure your choice of
> "init_as_disabled". The wording towards the end is also a little
> odd, considering we're already in numa_scan_nodes(). Which further
> points out that really there's no scanning here, just processing,
> so maybe the earlier patch wants to rename the function to
> numa_process_nodes()?
> 

Yes, scan will make some confusion, your suggestion make sense, I will
fix it in next version.

> > +    if ( !nodes_cover_memory() )
> > +    {
> > +        numa_fw_bad();
> > +        return false;
> > +    }
> > +
> > +    memnode_shift = compute_hash_shift(node_memblk_range,
> num_node_memblks,
> > +                                       memblk_nodeid);
> > +
> > +    if ( memnode_shift < 0 )
> 
> As previously pointed out: As of patch 2 memnode_shift is unsigned,
> so this comparison is always false (and the latest Coverity will
> point this out). You can't get away here without using an intermediate
> (signed, i.e. plain int) variable.
> 

Yes, you're right, I will introduce a new int variable for return value
checking.

> > +    {
> > +        printk(KERN_ERR
> > +               "NUMA: No NUMA node hash function found. Contact
> maintainer\n");
> > +        numa_fw_bad();
> > +        return false;
> > +    }
> > +
> > +    nodes_or(all_nodes_parsed, memory_nodes_parsed,
> processor_nodes_parsed);
> > +
> > +    /* Finally register nodes */
> > +    for_each_node_mask( i, all_nodes_parsed )
> > +    {
> > +        if ( nodes[i].end - nodes[i].start == 0 )
> 
> nodes[i].end == nodes[i].start ?
> 

Yes.

> > +            printk(KERN_INFO "NUMA: node %u has no memory\n", i);
> > +
> > +        setup_node_bootmem(i, nodes[i].start, nodes[i].end);
> > +    }
> > +
> > +    for ( i = 0; i < nr_cpu_ids; i++ )
> > +    {
> > +        if ( cpu_to_node[i] == NUMA_NO_NODE )
> > +            continue;
> > +        if ( !nodemask_test(cpu_to_node[i], &processor_nodes_parsed) )
> > +            numa_set_node(i, NUMA_NO_NODE);
> > +    }
> > +    numa_init_array();
> > +    return true;
> >  }
> 
> While you said you'd check elsewhere as well, just to be sure: Please
> add a blank line before the function's main "return". And perhaps
> another one between loop and function call.
> 

Ok.

> > --- a/xen/drivers/acpi/Kconfig
> > +++ b/xen/drivers/acpi/Kconfig
> > @@ -7,4 +7,5 @@ config ACPI_LEGACY_TABLES_LOOKUP
> >
> >  config ACPI_NUMA
> >  	bool
> > +	select HAS_NUMA_NODE_FWID
> 
> Are you selecting an option here which doesn't exist anywhere? Or
> am I overlooking where this new option is being added?
> 

Yes, this is a new Kconfig option. Should I need to introduce in a
separate patch?

Cheers,
Wei Chen

> Jan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 5/6] xen/x86: move NUMA scan nodes codes from x86 to common
  2022-09-08 15:26     ` Wei Chen
@ 2022-09-08 16:01       ` Jan Beulich
  2022-09-09  7:07         ` Wei Chen
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2022-09-08 16:01 UTC (permalink / raw)
  To: Wei Chen
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 08.09.2022 17:26, Wei Chen wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 2022年9月8日 21:03
>>
>> On 02.09.2022 05:31, Wei Chen wrote:
>>> --- a/xen/arch/x86/numa.c
>>> +++ b/xen/arch/x86/numa.c
>>> @@ -41,9 +41,12 @@ int __init arch_numa_setup(const char *opt)
>>>      return -EINVAL;
>>>  }
>>>
>>> -bool arch_numa_disabled(void)
>>> +bool arch_numa_disabled(bool init_as_disable)
>>
>> I'm afraid my question as to the meaning of the name of the parameter has
>> remained unanswered.
>>
> 
> Sorry, I might missed some contents of your reply in v3. The name of this
> parameter has been bothering me for a long time, and now this is actually
> quite awkward. The origin of this parameter is because the current NUMA
> implementation will make different judgments under different usage
> conditions when using acpi_numa. In acpi_scan_nodes, it uses acpi_numa <= 0
> as the condition for judging that ACPI NUMA is turned off. But only use
> acpi_numa < 0 as condition in srat_disabled and elsewhere. I use this
> parameter in the hope that we can keep the same semantics as the original
> code without changing the code of the caller.

The difference is "bad only" vs "bad or no data". Maybe that's easier
to express via two functions - arch_numa_disabled() (checking <= 0)
and arch_numa_broken() (checking < 0)? With a single function I guess
the name of the parameter would always be clumsy at best. Unless
someone has a good idea for a suitable name ...

>>> --- a/xen/drivers/acpi/Kconfig
>>> +++ b/xen/drivers/acpi/Kconfig
>>> @@ -7,4 +7,5 @@ config ACPI_LEGACY_TABLES_LOOKUP
>>>
>>>  config ACPI_NUMA
>>>  	bool
>>> +	select HAS_NUMA_NODE_FWID
>>
>> Are you selecting an option here which doesn't exist anywhere? Or
>> am I overlooking where this new option is being added?
>>
> 
> Yes, this is a new Kconfig option. Should I need to introduce in a
> separate patch?

I don't think that'll need to be in a separate patch; it can simply
be another hunk in the one here, adding the needed 2 lines (plus a
blank one) to, presumably, common/Kconfig.

Jan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v4 5/6] xen/x86: move NUMA scan nodes codes from x86 to common
  2022-09-08 16:01       ` Jan Beulich
@ 2022-09-09  7:07         ` Wei Chen
  0 siblings, 0 replies; 18+ messages in thread
From: Wei Chen @ 2022-09-09  7:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: nd, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

Hi Jan,

> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 2022年9月9日 0:02
> To: Wei Chen <Wei.Chen@arm.com>
> Cc: nd <nd@arm.com>; Andrew Cooper <andrew.cooper3@citrix.com>; Roger Pau
> Monné <roger.pau@citrix.com>; Wei Liu <wl@xen.org>; George Dunlap
> <george.dunlap@citrix.com>; Julien Grall <julien@xen.org>; Stefano
> Stabellini <sstabellini@kernel.org>; xen-devel@lists.xenproject.org
> Subject: Re: [PATCH v4 5/6] xen/x86: move NUMA scan nodes codes from x86
> to common
> 
> On 08.09.2022 17:26, Wei Chen wrote:
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: 2022年9月8日 21:03
> >>
> >> On 02.09.2022 05:31, Wei Chen wrote:
> >>> --- a/xen/arch/x86/numa.c
> >>> +++ b/xen/arch/x86/numa.c
> >>> @@ -41,9 +41,12 @@ int __init arch_numa_setup(const char *opt)
> >>>      return -EINVAL;
> >>>  }
> >>>
> >>> -bool arch_numa_disabled(void)
> >>> +bool arch_numa_disabled(bool init_as_disable)
> >>
> >> I'm afraid my question as to the meaning of the name of the parameter
> has
> >> remained unanswered.
> >>
> >
> > Sorry, I might missed some contents of your reply in v3. The name of
> this
> > parameter has been bothering me for a long time, and now this is
> actually
> > quite awkward. The origin of this parameter is because the current NUMA
> > implementation will make different judgments under different usage
> > conditions when using acpi_numa. In acpi_scan_nodes, it uses acpi_numa
> <= 0
> > as the condition for judging that ACPI NUMA is turned off. But only use
> > acpi_numa < 0 as condition in srat_disabled and elsewhere. I use this
> > parameter in the hope that we can keep the same semantics as the
> original
> > code without changing the code of the caller.
> 
> The difference is "bad only" vs "bad or no data". Maybe that's easier
> to express via two functions - arch_numa_disabled() (checking <= 0)
> and arch_numa_broken() (checking < 0)? With a single function I guess
> the name of the parameter would always be clumsy at best. Unless
> someone has a good idea for a suitable name ...
> 

Yes, I can't find a good name for the parameter, so break into two functions
would be better, I will do it in next version.

> >>> --- a/xen/drivers/acpi/Kconfig
> >>> +++ b/xen/drivers/acpi/Kconfig
> >>> @@ -7,4 +7,5 @@ config ACPI_LEGACY_TABLES_LOOKUP
> >>>
> >>>  config ACPI_NUMA
> >>>  	bool
> >>> +	select HAS_NUMA_NODE_FWID
> >>
> >> Are you selecting an option here which doesn't exist anywhere? Or
> >> am I overlooking where this new option is being added?
> >>
> >
> > Yes, this is a new Kconfig option. Should I need to introduce in a
> > separate patch?
> 
> I don't think that'll need to be in a separate patch; it can simply
> be another hunk in the one here, adding the needed 2 lines (plus a
> blank one) to, presumably, common/Kconfig.

Ok.

Thanks,
Wei Chen

> 
> Jan

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-09-09  7:08 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-02  3:31 [PATCH v4 0/6] Device tree based NUMA support for Arm - Part#2 Wei Chen
2022-09-02  3:31 ` [PATCH v4 1/6] xen/x86: Provide helpers for common code to access acpi_numa Wei Chen
2022-09-08  8:22   ` Jan Beulich
2022-09-02  3:31 ` [PATCH v4 2/6] xen/x86: move generically usable NUMA code from x86 to common Wei Chen
2022-09-08  9:09   ` Jan Beulich
2022-09-08 10:32     ` Wei Chen
2022-09-08 11:42       ` Jan Beulich
2022-09-08 14:52         ` Wei Chen
2022-09-02  3:31 ` [PATCH v4 3/6] xen/x86: Use ASSERT instead of VIRTUAL_BUG_ON for phys_to_nid Wei Chen
2022-09-02  3:31 ` [PATCH v4 4/6] xen/x86: use arch_get_ram_range to get information from E820 map Wei Chen
2022-09-08 12:14   ` Jan Beulich
2022-09-08 14:55     ` Wei Chen
2022-09-02  3:31 ` [PATCH v4 5/6] xen/x86: move NUMA scan nodes codes from x86 to common Wei Chen
2022-09-08 13:02   ` Jan Beulich
2022-09-08 15:26     ` Wei Chen
2022-09-08 16:01       ` Jan Beulich
2022-09-09  7:07         ` Wei Chen
2022-09-02  3:31 ` [PATCH v4 6/6] xen: introduce a Kconfig option to configure NUMA nodes number Wei Chen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.