xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* [XEN PATCH 0/9] x86: parallelize AP bring-up during boot
@ 2023-11-14 17:49 Krystian Hebel
  2023-11-14 17:49 ` [XEN PATCH 1/9] x86/boot: choose AP stack based on APIC ID Krystian Hebel
                   ` (9 more replies)
  0 siblings, 10 replies; 45+ messages in thread
From: Krystian Hebel @ 2023-11-14 17:49 UTC (permalink / raw)
  To: xen-devel
  Cc: Krystian Hebel, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

Patch series available on this branch:
https://github.com/TrenchBoot/xen/tree/smp_cleanup_upstreaming

This series makes AP bring-up parallel on x86. This reduces number of
IPIs (and more importantly, delays between them) required to start all
logical processors significantly.

In order to make it possible, some state variables that were global
had to be made per-CPU. In most cases, accesses to those variables can
be performed through helper macros, some of them existed before in a
different form.

In addition to work required for parallel initialization, I've fixed
issues in error path around `machine_restart()` that were discovered
during implementation and testing.

CPU hotplug should not be affected, but I had no way of testing it.
During wakeup from S3 APs are started one by one. It should be possible
to enable parallel execution there as well, but I don't have a way of
testing it as of now.

To measure the improvement, I added output lines (identical for before
and after changes so there is no impact of printing over serial) just
above and below `if ( !pv_shim )` block. `console_timestamps=raw` was
used to get as accurate timestamp as possible, and average over 3 boots
was taken into account for each measurement. The final improvement was
calculated as (1 - after/before) * 100%, rounded to 0.01%. These are
the results:

* Dell OptiPlex 9010 with Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
  (4 cores, 4 threads): 48.83%
* MSI PRO Z790-P with 13th Gen Intel(R) Core(TM) i5-13600K (14 cores,
  20 threads, 6P+8E) `smt=on`: 36.16%
* MSI PRO Z790-P with 13th Gen Intel(R) Core(TM) i5-13600K (14 cores,
  20 threads, 6P+8E) `smt=off`: 0.25% (parking takes a lot of additional
  time)
* HP t630 Thin Client with AMD Embedded G-Series GX-420GI Radeon R7E
  (4 cores, 4 threads): 68.00%

Krystian Hebel (9):
  x86/boot: choose AP stack based on APIC ID
  x86: don't access x86_cpu_to_apicid[] directly, use
    cpu_physical_id(cpu)
  x86/smp: drop x86_cpu_to_apicid, use cpu_data[cpu].apicid instead
  x86/smp: move stack_base to cpu_data
  x86/smp: call x2apic_ap_setup() earlier
  x86/shutdown: protect against recurrent machine_restart()
  x86/smp: drop booting_cpu variable
  x86/smp: make cpu_state per-CPU
  x86/smp: start APs in parallel during boot

 xen/arch/x86/acpi/cpu_idle.c          |   4 +-
 xen/arch/x86/acpi/lib.c               |   2 +-
 xen/arch/x86/apic.c                   |   2 +-
 xen/arch/x86/boot/trampoline.S        |  20 +++
 xen/arch/x86/boot/x86_64.S            |  34 ++++-
 xen/arch/x86/cpu/mwait-idle.c         |   4 +-
 xen/arch/x86/domain.c                 |   2 +-
 xen/arch/x86/include/asm/asm_defns.h  |   2 +-
 xen/arch/x86/include/asm/cpufeature.h |   2 +
 xen/arch/x86/include/asm/processor.h  |   2 +
 xen/arch/x86/include/asm/smp.h        |   7 +-
 xen/arch/x86/mpparse.c                |   6 +-
 xen/arch/x86/numa.c                   |  17 +--
 xen/arch/x86/platform_hypercall.c     |   2 +-
 xen/arch/x86/setup.c                  |  25 +++-
 xen/arch/x86/shutdown.c               |  20 ++-
 xen/arch/x86/smpboot.c                | 207 ++++++++++++++++----------
 xen/arch/x86/spec_ctrl.c              |   2 +-
 xen/arch/x86/sysctl.c                 |   2 +-
 xen/arch/x86/traps.c                  |   4 +-
 xen/arch/x86/x86_64/asm-offsets.c     |   5 +-
 xen/include/xen/smp.h                 |   2 -
 22 files changed, 248 insertions(+), 125 deletions(-)


base-commit: fb41228ececea948c7953c8c16fe28fd65c6536b
prerequisite-patch-id: 142a87c707411d49e136c3fb76f1b14963ec6dc8
prerequisite-patch-id: f155cb7e2761deec26b76f1b8b587bc56a404c80
-- 
2.41.0



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [XEN PATCH 1/9] x86/boot: choose AP stack based on APIC ID
  2023-11-14 17:49 [XEN PATCH 0/9] x86: parallelize AP bring-up during boot Krystian Hebel
@ 2023-11-14 17:49 ` Krystian Hebel
  2024-01-26 18:30   ` Julien Grall
  2024-02-07 16:11   ` Jan Beulich
  2023-11-14 17:49 ` [XEN PATCH 2/9] x86: don't access x86_cpu_to_apicid[] directly, use cpu_physical_id(cpu) Krystian Hebel
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 45+ messages in thread
From: Krystian Hebel @ 2023-11-14 17:49 UTC (permalink / raw)
  To: xen-devel
  Cc: Krystian Hebel, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu

This is made as first step of making parallel AP bring-up possible. It
should be enough for pre-C code.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
---
 xen/arch/x86/boot/trampoline.S | 20 ++++++++++++++++++++
 xen/arch/x86/boot/x86_64.S     | 28 +++++++++++++++++++++++++++-
 xen/arch/x86/setup.c           |  7 +++++++
 3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/boot/trampoline.S b/xen/arch/x86/boot/trampoline.S
index b8ab0ffdcbb0..ec254125016d 100644
--- a/xen/arch/x86/boot/trampoline.S
+++ b/xen/arch/x86/boot/trampoline.S
@@ -72,6 +72,26 @@ trampoline_protmode_entry:
         mov     $X86_CR4_PAE,%ecx
         mov     %ecx,%cr4
 
+        /*
+         * Get APIC ID while we're in non-paged mode. Start by checking if
+         * x2APIC is enabled.
+         */
+        mov     $MSR_APIC_BASE, %ecx
+        rdmsr
+        and     $APIC_BASE_EXTD, %eax
+        jnz     .Lx2apic
+
+        /* Not x2APIC, read from MMIO */
+        mov     0xfee00020, %esp
+        shr     $24, %esp
+        jmp     1f
+
+.Lx2apic:
+        mov     $(MSR_X2APIC_FIRST + (0x20 >> 4)), %ecx
+        rdmsr
+        mov     %eax, %esp
+1:
+
         /* Load pagetable base register. */
         mov     $sym_offs(idle_pg_table),%eax
         add     bootsym_rel(trampoline_xen_phys_start,4,%eax)
diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index 04bb62ae8680..b85b47b5c1a0 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -15,7 +15,33 @@ ENTRY(__high_start)
         mov     $XEN_MINIMAL_CR4,%rcx
         mov     %rcx,%cr4
 
-        mov     stack_start(%rip),%rsp
+        test    %ebx,%ebx
+        cmovz   stack_start(%rip), %rsp
+        jz      .L_stack_set
+
+        /* APs only: get stack base from APIC ID saved in %esp. */
+        mov     $-1, %rax
+        lea     x86_cpu_to_apicid(%rip), %rcx
+1:
+        add     $1, %rax
+        cmp     $NR_CPUS, %eax
+        jb      2f
+        hlt
+2:
+        cmp     %esp, (%rcx, %rax, 4)
+        jne     1b
+
+        /* %eax is now Xen CPU index. */
+        lea     stack_base(%rip), %rcx
+        mov     (%rcx, %rax, 8), %rsp
+
+        test    %rsp,%rsp
+        jnz     1f
+        hlt
+1:
+        add     $(STACK_SIZE - CPUINFO_sizeof), %rsp
+
+.L_stack_set:
 
         /* Reset EFLAGS (subsumes CLI and CLD). */
         pushq   $0
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index a3d3f797bb1e..1285969901e0 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1951,6 +1951,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
      */
     if ( !pv_shim )
     {
+        /* Separate loop to make parallel AP bringup possible. */
         for_each_present_cpu ( i )
         {
             /* Set up cpu_to_node[]. */
@@ -1958,6 +1959,12 @@ void __init noreturn __start_xen(unsigned long mbi_p)
             /* Set up node_to_cpumask based on cpu_to_node[]. */
             numa_add_cpu(i);
 
+            if ( stack_base[i] == NULL )
+                stack_base[i] = cpu_alloc_stack(i);
+        }
+
+        for_each_present_cpu ( i )
+        {
             if ( (park_offline_cpus || num_online_cpus() < max_cpus) &&
                  !cpu_online(i) )
             {
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [XEN PATCH 2/9] x86: don't access x86_cpu_to_apicid[] directly, use cpu_physical_id(cpu)
  2023-11-14 17:49 [XEN PATCH 0/9] x86: parallelize AP bring-up during boot Krystian Hebel
  2023-11-14 17:49 ` [XEN PATCH 1/9] x86/boot: choose AP stack based on APIC ID Krystian Hebel
@ 2023-11-14 17:49 ` Krystian Hebel
  2024-01-26 18:38   ` Julien Grall
  2024-02-07 16:28   ` Jan Beulich
  2023-11-14 17:50 ` [XEN PATCH 3/9] x86/smp: drop x86_cpu_to_apicid, use cpu_data[cpu].apicid instead Krystian Hebel
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 45+ messages in thread
From: Krystian Hebel @ 2023-11-14 17:49 UTC (permalink / raw)
  To: xen-devel
  Cc: Krystian Hebel, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu

This is done in preparation to move data from x86_cpu_to_apicid[]
elsewhere.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
---
 xen/arch/x86/acpi/cpu_idle.c      |  4 ++--
 xen/arch/x86/acpi/lib.c           |  2 +-
 xen/arch/x86/apic.c               |  2 +-
 xen/arch/x86/cpu/mwait-idle.c     |  4 ++--
 xen/arch/x86/domain.c             |  2 +-
 xen/arch/x86/mpparse.c            |  6 +++---
 xen/arch/x86/numa.c               |  2 +-
 xen/arch/x86/platform_hypercall.c |  2 +-
 xen/arch/x86/setup.c              | 14 +++++++-------
 xen/arch/x86/smpboot.c            |  4 ++--
 xen/arch/x86/spec_ctrl.c          |  2 +-
 xen/arch/x86/sysctl.c             |  2 +-
 12 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index cfce4cc0408f..d03e54bcef38 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -1256,7 +1256,7 @@ int get_cpu_id(u32 acpi_id)
 
     for ( i = 0; i < nr_cpu_ids; i++ )
     {
-        if ( apic_id == x86_cpu_to_apicid[i] )
+        if ( apic_id == cpu_physical_id(i) )
             return i;
     }
 
@@ -1362,7 +1362,7 @@ long set_cx_pminfo(uint32_t acpi_id, struct xen_processor_power *power)
 
     if ( !cpu_online(cpu_id) )
     {
-        uint32_t apic_id = x86_cpu_to_apicid[cpu_id];
+        uint32_t apic_id = cpu_physical_id(cpu_id);
 
         /*
          * If we've just learned of more available C states, wake the CPU if
diff --git a/xen/arch/x86/acpi/lib.c b/xen/arch/x86/acpi/lib.c
index 51cb082ca02a..87a1f38f3f5a 100644
--- a/xen/arch/x86/acpi/lib.c
+++ b/xen/arch/x86/acpi/lib.c
@@ -91,7 +91,7 @@ unsigned int acpi_get_processor_id(unsigned int cpu)
 {
 	unsigned int acpiid, apicid;
 
-	if ((apicid = x86_cpu_to_apicid[cpu]) == BAD_APICID)
+	if ((apicid = cpu_physical_id(cpu)) == BAD_APICID)
 		return INVALID_ACPIID;
 
 	for (acpiid = 0; acpiid < ARRAY_SIZE(x86_acpiid_to_apicid); acpiid++)
diff --git a/xen/arch/x86/apic.c b/xen/arch/x86/apic.c
index 6acdd0ec1468..63e18968e54c 100644
--- a/xen/arch/x86/apic.c
+++ b/xen/arch/x86/apic.c
@@ -950,7 +950,7 @@ __next:
      */
     if (boot_cpu_physical_apicid == -1U)
         boot_cpu_physical_apicid = get_apic_id();
-    x86_cpu_to_apicid[0] = get_apic_id();
+    cpu_physical_id(0) = get_apic_id();
 
     ioapic_init();
 }
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index ff5c808bc952..88168da8a0ca 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -1202,8 +1202,8 @@ static void __init ivt_idle_state_table_update(void)
 	unsigned int cpu, max_apicid = boot_cpu_physical_apicid;
 
 	for_each_present_cpu(cpu)
-		if (max_apicid < x86_cpu_to_apicid[cpu])
-			max_apicid = x86_cpu_to_apicid[cpu];
+		if (max_apicid < cpu_physical_id(cpu))
+			max_apicid = cpu_physical_id(cpu);
 	switch (apicid_to_socket(max_apicid)) {
 	case 0: case 1:
 		/* 1 and 2 socket systems use default ivt_cstates */
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 3712e36df930..f45437511a46 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1615,7 +1615,7 @@ long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
             break;
 
         cpu_id.phys_id =
-            (uint64_t)x86_cpu_to_apicid[v->vcpu_id] |
+            (uint64_t)cpu_physical_id(v->vcpu_id) |
             ((uint64_t)acpi_get_processor_id(v->vcpu_id) << 32);
 
         rc = -EFAULT;
diff --git a/xen/arch/x86/mpparse.c b/xen/arch/x86/mpparse.c
index d8ccab2449c6..b8cabebe7bf9 100644
--- a/xen/arch/x86/mpparse.c
+++ b/xen/arch/x86/mpparse.c
@@ -187,7 +187,7 @@ static int MP_processor_info_x(struct mpc_config_processor *m,
 			       " Processor with apicid %i ignored\n", apicid);
 			return cpu;
 		}
-		x86_cpu_to_apicid[cpu] = apicid;
+		cpu_physical_id(cpu) = apicid;
 		cpumask_set_cpu(cpu, &cpu_present_map);
 	}
 
@@ -822,12 +822,12 @@ void mp_unregister_lapic(uint32_t apic_id, uint32_t cpu)
 	if (!cpu || (apic_id == boot_cpu_physical_apicid))
 		return;
 
-	if (x86_cpu_to_apicid[cpu] != apic_id)
+	if (cpu_physical_id(cpu) != apic_id)
 		return;
 
 	physid_clear(apic_id, phys_cpu_present_map);
 
-	x86_cpu_to_apicid[cpu] = BAD_APICID;
+	cpu_physical_id(cpu) = BAD_APICID;
 	cpumask_clear_cpu(cpu, &cpu_present_map);
 }
 
diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 4b0b297c7e09..39e131cb4f35 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -70,7 +70,7 @@ void __init init_cpu_to_node(void)
 
     for ( i = 0; i < nr_cpu_ids; i++ )
     {
-        u32 apicid = x86_cpu_to_apicid[i];
+        u32 apicid = cpu_physical_id(i);
         if ( apicid == BAD_APICID )
             continue;
         node = apicid < MAX_LOCAL_APIC ? apicid_to_node[apicid] : NUMA_NO_NODE;
diff --git a/xen/arch/x86/platform_hypercall.c b/xen/arch/x86/platform_hypercall.c
index 9469de9045c7..9a52e048f67c 100644
--- a/xen/arch/x86/platform_hypercall.c
+++ b/xen/arch/x86/platform_hypercall.c
@@ -588,7 +588,7 @@ ret_t do_platform_op(
         }
         else
         {
-            g_info->apic_id = x86_cpu_to_apicid[g_info->xen_cpuid];
+            g_info->apic_id = cpu_physical_id(g_info->xen_cpuid);
             g_info->acpi_id = acpi_get_processor_id(g_info->xen_cpuid);
             ASSERT(g_info->apic_id != BAD_APICID);
             g_info->flags = 0;
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 1285969901e0..a19fe219bbf6 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -319,7 +319,7 @@ static void __init init_idle_domain(void)
 void srat_detect_node(int cpu)
 {
     nodeid_t node;
-    u32 apicid = x86_cpu_to_apicid[cpu];
+    u32 apicid = cpu_physical_id(cpu);
 
     node = apicid < MAX_LOCAL_APIC ? apicid_to_node[apicid] : NUMA_NO_NODE;
     if ( node == NUMA_NO_NODE )
@@ -346,7 +346,7 @@ static void __init normalise_cpu_order(void)
 
     for_each_present_cpu ( i )
     {
-        apicid = x86_cpu_to_apicid[i];
+        apicid = cpu_physical_id(i);
         min_diff = min_cpu = ~0u;
 
         /*
@@ -357,12 +357,12 @@ static void __init normalise_cpu_order(void)
               j < nr_cpu_ids;
               j = cpumask_next(j, &cpu_present_map) )
         {
-            diff = x86_cpu_to_apicid[j] ^ apicid;
+            diff = cpu_physical_id(j) ^ apicid;
             while ( diff & (diff-1) )
                 diff &= diff-1;
             if ( (diff < min_diff) ||
                  ((diff == min_diff) &&
-                  (x86_cpu_to_apicid[j] < x86_cpu_to_apicid[min_cpu])) )
+                  (cpu_physical_id(j) < cpu_physical_id(min_cpu))) )
             {
                 min_diff = diff;
                 min_cpu = j;
@@ -378,9 +378,9 @@ static void __init normalise_cpu_order(void)
 
         /* Switch the best-matching CPU with the next CPU in logical order. */
         j = cpumask_next(i, &cpu_present_map);
-        apicid = x86_cpu_to_apicid[min_cpu];
-        x86_cpu_to_apicid[min_cpu] = x86_cpu_to_apicid[j];
-        x86_cpu_to_apicid[j] = apicid;
+        apicid = cpu_physical_id(min_cpu);
+        cpu_physical_id(min_cpu) = cpu_physical_id(j);
+        cpu_physical_id(j) = apicid;
     }
 }
 
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 4c54ecbc91d7..de87c5a41926 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -1154,7 +1154,7 @@ void __init smp_prepare_cpus(void)
     print_cpu_info(0);
 
     boot_cpu_physical_apicid = get_apic_id();
-    x86_cpu_to_apicid[0] = boot_cpu_physical_apicid;
+    cpu_physical_id(0) = boot_cpu_physical_apicid;
 
     stack_base[0] = (void *)((unsigned long)stack_start & ~(STACK_SIZE - 1));
 
@@ -1374,7 +1374,7 @@ int __cpu_up(unsigned int cpu)
 {
     int apicid, ret;
 
-    if ( (apicid = x86_cpu_to_apicid[cpu]) == BAD_APICID )
+    if ( (apicid = cpu_physical_id(cpu)) == BAD_APICID )
         return -ENODEV;
 
     if ( (!x2apic_enabled && apicid >= APIC_ALL_CPUS) ||
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index a8d8af22f6d8..d54c8d93cff0 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -589,7 +589,7 @@ static bool __init check_smt_enabled(void)
      * has a non-zero thread id component indicates that SMT is active.
      */
     for_each_present_cpu ( cpu )
-        if ( x86_cpu_to_apicid[cpu] & (boot_cpu_data.x86_num_siblings - 1) )
+        if ( cpu_physical_id(cpu) & (boot_cpu_data.x86_num_siblings - 1) )
             return true;
 
     return false;
diff --git a/xen/arch/x86/sysctl.c b/xen/arch/x86/sysctl.c
index c107f40c6283..67d8ab3f824a 100644
--- a/xen/arch/x86/sysctl.c
+++ b/xen/arch/x86/sysctl.c
@@ -58,7 +58,7 @@ static long cf_check smt_up_down_helper(void *data)
     for_each_present_cpu ( cpu )
     {
         /* Skip primary siblings (those whose thread id is 0). */
-        if ( !(x86_cpu_to_apicid[cpu] & sibling_mask) )
+        if ( !(cpu_physical_id(cpu) & sibling_mask) )
             continue;
 
         if ( !up && core_parking_remove(cpu) )
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [XEN PATCH 3/9] x86/smp: drop x86_cpu_to_apicid, use cpu_data[cpu].apicid instead
  2023-11-14 17:49 [XEN PATCH 0/9] x86: parallelize AP bring-up during boot Krystian Hebel
  2023-11-14 17:49 ` [XEN PATCH 1/9] x86/boot: choose AP stack based on APIC ID Krystian Hebel
  2023-11-14 17:49 ` [XEN PATCH 2/9] x86: don't access x86_cpu_to_apicid[] directly, use cpu_physical_id(cpu) Krystian Hebel
@ 2023-11-14 17:50 ` Krystian Hebel
  2024-02-02 18:11   ` Julien Grall
  2024-02-08  7:29   ` Jan Beulich
  2023-11-14 17:50 ` [XEN PATCH 4/9] x86/smp: move stack_base to cpu_data Krystian Hebel
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 45+ messages in thread
From: Krystian Hebel @ 2023-11-14 17:50 UTC (permalink / raw)
  To: xen-devel
  Cc: Krystian Hebel, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu

Both fields held the same data.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
---
 xen/arch/x86/boot/x86_64.S           |  8 +++++---
 xen/arch/x86/include/asm/asm_defns.h |  2 +-
 xen/arch/x86/include/asm/processor.h |  2 ++
 xen/arch/x86/include/asm/smp.h       |  4 ----
 xen/arch/x86/numa.c                  | 15 +++++++--------
 xen/arch/x86/smpboot.c               |  8 ++++----
 xen/arch/x86/x86_64/asm-offsets.c    |  4 +++-
 7 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index b85b47b5c1a0..195550b5c0ea 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -20,15 +20,17 @@ ENTRY(__high_start)
         jz      .L_stack_set
 
         /* APs only: get stack base from APIC ID saved in %esp. */
-        mov     $-1, %rax
-        lea     x86_cpu_to_apicid(%rip), %rcx
+        mov     $0, %rax
+        lea     cpu_data(%rip), %rcx
+        /* cpu_data[0] is BSP, skip it. */
 1:
         add     $1, %rax
+        add     $CPUINFO_X86_sizeof, %rcx
         cmp     $NR_CPUS, %eax
         jb      2f
         hlt
 2:
-        cmp     %esp, (%rcx, %rax, 4)
+        cmp     %esp, CPUINFO_X86_apicid(%rcx)
         jne     1b
 
         /* %eax is now Xen CPU index. */
diff --git a/xen/arch/x86/include/asm/asm_defns.h b/xen/arch/x86/include/asm/asm_defns.h
index baaaccb26e17..6b05d9d140b8 100644
--- a/xen/arch/x86/include/asm/asm_defns.h
+++ b/xen/arch/x86/include/asm/asm_defns.h
@@ -158,7 +158,7 @@ register unsigned long current_stack_pointer asm("rsp");
 #endif
 
 #define CPUINFO_FEATURE_OFFSET(feature)           \
-    (CPUINFO_features + (cpufeat_word(feature) * 4))
+    (CPUINFO_X86_features + (cpufeat_word(feature) * 4))
 
 #else
 
diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
index b0d2a62c075f..8345d58094da 100644
--- a/xen/arch/x86/include/asm/processor.h
+++ b/xen/arch/x86/include/asm/processor.h
@@ -92,6 +92,8 @@ struct x86_cpu_id {
 extern struct cpuinfo_x86 cpu_data[];
 #define current_cpu_data cpu_data[smp_processor_id()]
 
+#define cpu_physical_id(cpu)	cpu_data[cpu].apicid
+
 extern bool probe_cpuid_faulting(void);
 extern void ctxt_switch_levelling(const struct vcpu *next);
 extern void (*ctxt_switch_masking)(const struct vcpu *next);
diff --git a/xen/arch/x86/include/asm/smp.h b/xen/arch/x86/include/asm/smp.h
index c0b5d7cdd8dd..94c557491860 100644
--- a/xen/arch/x86/include/asm/smp.h
+++ b/xen/arch/x86/include/asm/smp.h
@@ -39,10 +39,6 @@ extern void (*mtrr_hook) (void);
 
 extern void zap_low_mappings(void);
 
-extern u32 x86_cpu_to_apicid[];
-
-#define cpu_physical_id(cpu)	x86_cpu_to_apicid[cpu]
-
 #define cpu_is_offline(cpu) unlikely(!cpu_online(cpu))
 extern void cpu_exit_clear(unsigned int cpu);
 extern void cpu_uninit(unsigned int cpu);
diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 39e131cb4f35..91527be5b406 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -54,14 +54,13 @@ bool __init arch_numa_unavailable(void)
 /*
  * Setup early cpu_to_node.
  *
- * Populate cpu_to_node[] only if x86_cpu_to_apicid[],
- * and apicid_to_node[] tables have valid entries for a CPU.
- * This means we skip cpu_to_node[] initialisation for NUMA
- * emulation and faking node case (when running a kernel compiled
- * for NUMA on a non NUMA box), which is OK as cpu_to_node[]
- * is already initialized in a round robin manner at numa_init_array,
- * prior to this call, and this initialization is good enough
- * for the fake NUMA cases.
+ * Populate cpu_to_node[] only if cpu_data[], and apicid_to_node[]
+ * tables have valid entries for a CPU. This means we skip
+ * cpu_to_node[] initialisation for NUMA emulation and faking node
+ * case (when running a kernel compiled for NUMA on a non NUMA box),
+ * which is OK as cpu_to_node[] is already initialized in a round
+ * robin manner at numa_init_array, prior to this call, and this
+ * initialization is good enough for the fake NUMA cases.
  */
 void __init init_cpu_to_node(void)
 {
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index de87c5a41926..f061486e56eb 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -61,10 +61,8 @@ unsigned int __read_mostly nr_sockets;
 cpumask_t **__read_mostly socket_cpumask;
 static cpumask_t *secondary_socket_cpumask;
 
-struct cpuinfo_x86 cpu_data[NR_CPUS];
-
-u32 x86_cpu_to_apicid[NR_CPUS] __read_mostly =
-	{ [0 ... NR_CPUS-1] = BAD_APICID };
+struct cpuinfo_x86 cpu_data[NR_CPUS] =
+        { [0 ... NR_CPUS-1] .apicid = BAD_APICID };
 
 static int cpu_error;
 static enum cpu_state {
@@ -81,7 +79,9 @@ void *stack_base[NR_CPUS];
 
 void initialize_cpu_data(unsigned int cpu)
 {
+    uint32_t apicid = cpu_physical_id(cpu);
     cpu_data[cpu] = boot_cpu_data;
+    cpu_physical_id(cpu) = apicid;
 }
 
 static bool smp_store_cpu_info(unsigned int id)
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 57b73a4e6214..e881cd5de0a0 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -159,7 +159,9 @@ void __dummy__(void)
     OFFSET(IRQSTAT_softirq_pending, irq_cpustat_t, __softirq_pending);
     BLANK();
 
-    OFFSET(CPUINFO_features, struct cpuinfo_x86, x86_capability);
+    OFFSET(CPUINFO_X86_features, struct cpuinfo_x86, x86_capability);
+    OFFSET(CPUINFO_X86_apicid, struct cpuinfo_x86, apicid);
+    DEFINE(CPUINFO_X86_sizeof, sizeof(struct cpuinfo_x86));
     BLANK();
 
     OFFSET(MB_flags, multiboot_info_t, flags);
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [XEN PATCH 4/9] x86/smp: move stack_base to cpu_data
  2023-11-14 17:49 [XEN PATCH 0/9] x86: parallelize AP bring-up during boot Krystian Hebel
                   ` (2 preceding siblings ...)
  2023-11-14 17:50 ` [XEN PATCH 3/9] x86/smp: drop x86_cpu_to_apicid, use cpu_data[cpu].apicid instead Krystian Hebel
@ 2023-11-14 17:50 ` Krystian Hebel
  2024-02-02 18:24   ` Julien Grall
  2024-02-07 16:53   ` Jan Beulich
  2023-11-14 17:50 ` [XEN PATCH 5/9] x86/smp: call x2apic_ap_setup() earlier Krystian Hebel
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 45+ messages in thread
From: Krystian Hebel @ 2023-11-14 17:50 UTC (permalink / raw)
  To: xen-devel
  Cc: Krystian Hebel, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini

This location is easier to access from assembly. Having it close to
other data required during initialization has also positive (although
rather small) impact on prefetching data from RAM.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
---
 xen/arch/x86/boot/x86_64.S            |  5 ++---
 xen/arch/x86/include/asm/cpufeature.h |  1 +
 xen/arch/x86/include/asm/smp.h        |  2 +-
 xen/arch/x86/setup.c                  |  6 +++---
 xen/arch/x86/smpboot.c                | 25 +++++++++++++------------
 xen/arch/x86/traps.c                  |  4 ++--
 xen/arch/x86/x86_64/asm-offsets.c     |  1 +
 xen/include/xen/smp.h                 |  2 --
 8 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index 195550b5c0ea..8d61f270761f 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -33,9 +33,8 @@ ENTRY(__high_start)
         cmp     %esp, CPUINFO_X86_apicid(%rcx)
         jne     1b
 
-        /* %eax is now Xen CPU index. */
-        lea     stack_base(%rip), %rcx
-        mov     (%rcx, %rax, 8), %rsp
+        /* %rcx is now cpu_data[cpu], read stack base from it. */
+        mov     CPUINFO_X86_stack_base(%rcx), %rsp
 
         test    %rsp,%rsp
         jnz     1f
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index 06e1dd7f3332..ff0e18864cc7 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -37,6 +37,7 @@ struct cpuinfo_x86 {
     unsigned int phys_proc_id;         /* package ID of each logical CPU */
     unsigned int cpu_core_id;          /* core ID of each logical CPU */
     unsigned int compute_unit_id;      /* AMD compute unit ID of each logical CPU */
+    void *stack_base;
     unsigned short x86_clflush_size;
 } __cacheline_aligned;
 
diff --git a/xen/arch/x86/include/asm/smp.h b/xen/arch/x86/include/asm/smp.h
index 94c557491860..98739028a6ed 100644
--- a/xen/arch/x86/include/asm/smp.h
+++ b/xen/arch/x86/include/asm/smp.h
@@ -69,7 +69,7 @@ extern cpumask_t **socket_cpumask;
  * by certain scheduling code only.
  */
 #define get_cpu_current(cpu) \
-    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)
+    (get_cpu_info_from_stack((unsigned long)cpu_data[cpu].stack_base)->current_vcpu)
 
 extern unsigned int disabled_cpus;
 extern bool unaccounted_cpus;
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index a19fe219bbf6..b2c0679725ea 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -798,7 +798,7 @@ static void __init noreturn reinit_bsp_stack(void)
     /* Update SYSCALL trampolines */
     percpu_traps_init();
 
-    stack_base[0] = stack;
+    cpu_data[0].stack_base = stack;
 
     rc = setup_cpu_root_pgt(0);
     if ( rc )
@@ -1959,8 +1959,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
             /* Set up node_to_cpumask based on cpu_to_node[]. */
             numa_add_cpu(i);
 
-            if ( stack_base[i] == NULL )
-                stack_base[i] = cpu_alloc_stack(i);
+            if ( cpu_data[i].stack_base == NULL )
+                cpu_data[i].stack_base = cpu_alloc_stack(i);
         }
 
         for_each_present_cpu ( i )
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index f061486e56eb..8ae65ab1769f 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -75,13 +75,15 @@ static enum cpu_state {
 } cpu_state;
 #define set_cpu_state(state) do { smp_mb(); cpu_state = (state); } while (0)
 
-void *stack_base[NR_CPUS];
-
 void initialize_cpu_data(unsigned int cpu)
 {
     uint32_t apicid = cpu_physical_id(cpu);
+    void *stack = cpu_data[cpu].stack_base;
+
     cpu_data[cpu] = boot_cpu_data;
+
     cpu_physical_id(cpu) = apicid;
+    cpu_data[cpu].stack_base = stack;
 }
 
 static bool smp_store_cpu_info(unsigned int id)
@@ -579,8 +581,6 @@ static int do_boot_cpu(int apicid, int cpu)
         printk("Booting processor %d/%d eip %lx\n",
                cpu, apicid, start_eip);
 
-    stack_start = stack_base[cpu] + STACK_SIZE - sizeof(struct cpu_info);
-
     /* This grunge runs the startup process for the targeted processor. */
 
     set_cpu_state(CPU_STATE_INIT);
@@ -856,7 +856,7 @@ int setup_cpu_root_pgt(unsigned int cpu)
 
     /* Install direct map page table entries for stack, IDT, and TSS. */
     for ( off = rc = 0; !rc && off < STACK_SIZE; off += PAGE_SIZE )
-        rc = clone_mapping(__va(__pa(stack_base[cpu])) + off, rpt);
+        rc = clone_mapping(__va(__pa(cpu_data[cpu].stack_base)) + off, rpt);
 
     if ( !rc )
         rc = clone_mapping(idt_tables[cpu], rpt);
@@ -1007,10 +1007,10 @@ static void cpu_smpboot_free(unsigned int cpu, bool remove)
         FREE_XENHEAP_PAGE(per_cpu(gdt, cpu));
         FREE_XENHEAP_PAGE(idt_tables[cpu]);
 
-        if ( stack_base[cpu] )
+        if ( cpu_data[cpu].stack_base )
         {
-            memguard_unguard_stack(stack_base[cpu]);
-            FREE_XENHEAP_PAGES(stack_base[cpu], STACK_ORDER);
+            memguard_unguard_stack(cpu_data[cpu].stack_base);
+            FREE_XENHEAP_PAGES(cpu_data[cpu].stack_base, STACK_ORDER);
         }
     }
 }
@@ -1044,11 +1044,11 @@ static int cpu_smpboot_alloc(unsigned int cpu)
     if ( node != NUMA_NO_NODE )
         memflags = MEMF_node(node);
 
-    if ( stack_base[cpu] == NULL &&
-         (stack_base[cpu] = cpu_alloc_stack(cpu)) == NULL )
+    if ( cpu_data[cpu].stack_base == NULL &&
+         (cpu_data[cpu].stack_base = cpu_alloc_stack(cpu)) == NULL )
             goto out;
 
-    info = get_cpu_info_from_stack((unsigned long)stack_base[cpu]);
+    info = get_cpu_info_from_stack((unsigned long)cpu_data[cpu].stack_base);
     info->processor_id = cpu;
     info->per_cpu_offset = __per_cpu_offset[cpu];
 
@@ -1156,7 +1156,8 @@ void __init smp_prepare_cpus(void)
     boot_cpu_physical_apicid = get_apic_id();
     cpu_physical_id(0) = boot_cpu_physical_apicid;
 
-    stack_base[0] = (void *)((unsigned long)stack_start & ~(STACK_SIZE - 1));
+    cpu_data[0].stack_base = (void *)
+             ((unsigned long)stack_start & ~(STACK_SIZE - 1));
 
     set_nr_sockets();
 
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index e1356f696aba..90d9201d1c52 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -611,9 +611,9 @@ void show_stack_overflow(unsigned int cpu, const struct cpu_user_regs *regs)
     unsigned long curr_stack_base = esp & ~(STACK_SIZE - 1);
     unsigned long esp_top, esp_bottom;
 
-    if ( _p(curr_stack_base) != stack_base[cpu] )
+    if ( _p(curr_stack_base) != cpu_data[cpu].stack_base )
         printk("Current stack base %p differs from expected %p\n",
-               _p(curr_stack_base), stack_base[cpu]);
+               _p(curr_stack_base), cpu_data[cpu].stack_base);
 
     esp_bottom = (esp | (STACK_SIZE - 1)) + 1;
     esp_top    = esp_bottom - PRIMARY_STACK_SIZE;
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index e881cd5de0a0..d81a30344677 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -161,6 +161,7 @@ void __dummy__(void)
 
     OFFSET(CPUINFO_X86_features, struct cpuinfo_x86, x86_capability);
     OFFSET(CPUINFO_X86_apicid, struct cpuinfo_x86, apicid);
+    OFFSET(CPUINFO_X86_stack_base, struct cpuinfo_x86, stack_base);
     DEFINE(CPUINFO_X86_sizeof, sizeof(struct cpuinfo_x86));
     BLANK();
 
diff --git a/xen/include/xen/smp.h b/xen/include/xen/smp.h
index 0a9219173f0f..994fdc474200 100644
--- a/xen/include/xen/smp.h
+++ b/xen/include/xen/smp.h
@@ -67,8 +67,6 @@ void smp_send_call_function_mask(const cpumask_t *mask);
 
 int alloc_cpu_id(void);
 
-extern void *stack_base[NR_CPUS];
-
 void initialize_cpu_data(unsigned int cpu);
 int setup_cpu_root_pgt(unsigned int cpu);
 
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [XEN PATCH 5/9] x86/smp: call x2apic_ap_setup() earlier
  2023-11-14 17:49 [XEN PATCH 0/9] x86: parallelize AP bring-up during boot Krystian Hebel
                   ` (3 preceding siblings ...)
  2023-11-14 17:50 ` [XEN PATCH 4/9] x86/smp: move stack_base to cpu_data Krystian Hebel
@ 2023-11-14 17:50 ` Krystian Hebel
  2024-02-07 17:02   ` Jan Beulich
  2023-11-14 17:50 ` [XEN PATCH 6/9] x86/shutdown: protect against recurrent machine_restart() Krystian Hebel
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 45+ messages in thread
From: Krystian Hebel @ 2023-11-14 17:50 UTC (permalink / raw)
  To: xen-devel
  Cc: Krystian Hebel, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu

It used to be called from smp_callin(), however BUG_ON() was invoked on
multiple occasions before that. It may end up calling machine_restart()
which tries to get APIC ID for CPU running this code. If BSP detected
that x2APIC is enabled, get_apic_id() will try to use it for all CPUs.
Enabling x2APIC on secondary CPUs earlier protects against an endless
loop of #GP exceptions caused by attempts to read IA32_X2APIC_APICID
MSR while x2APIC is disabled in IA32_APIC_BASE.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
---
 xen/arch/x86/smpboot.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 8ae65ab1769f..a3895dafa267 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -184,7 +184,6 @@ static void smp_callin(void)
      * update until we finish. We are free to set up this CPU: first the APIC.
      */
     Dprintk("CALLIN, before setup_local_APIC().\n");
-    x2apic_ap_setup();
     setup_local_APIC(false);
 
     /* Save our processor parameters. */
@@ -351,6 +350,14 @@ void start_secondary(void *unused)
     get_cpu_info()->xen_cr3 = 0;
     get_cpu_info()->pv_cr3 = 0;
 
+    /*
+     * BUG_ON() used in load_system_tables() and later code may end up calling
+     * machine_restart() which tries to get APIC ID for CPU running this code.
+     * If BSP detected that x2APIC is enabled, get_apic_id() will try to use it
+     * for _all_ CPUs. Enable x2APIC on secondary CPUs now so we won't end up
+     * with endless #GP loop.
+     */
+    x2apic_ap_setup();
     load_system_tables();
 
     /* Full exception support from here on in. */
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [XEN PATCH 6/9] x86/shutdown: protect against recurrent machine_restart()
  2023-11-14 17:49 [XEN PATCH 0/9] x86: parallelize AP bring-up during boot Krystian Hebel
                   ` (4 preceding siblings ...)
  2023-11-14 17:50 ` [XEN PATCH 5/9] x86/smp: call x2apic_ap_setup() earlier Krystian Hebel
@ 2023-11-14 17:50 ` Krystian Hebel
  2024-02-08 11:30   ` Jan Beulich
  2023-11-14 17:50 ` [XEN PATCH 7/9] x86/smp: drop booting_cpu variable Krystian Hebel
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 45+ messages in thread
From: Krystian Hebel @ 2023-11-14 17:50 UTC (permalink / raw)
  To: xen-devel
  Cc: Krystian Hebel, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu

If multiple CPUs called machine_restart() before actual restart took
place, but after boot CPU declared itself not online, ASSERT in
on_selected_cpus() will fail. Few calls later execution would end up
in machine_restart() again, with another frame on call stack for new
exception.

To protect against running out of stack, code checks if boot CPU is
still online before calling on_selected_cpus().

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
---
 xen/arch/x86/shutdown.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/shutdown.c b/xen/arch/x86/shutdown.c
index 7619544d14da..32c70505ed77 100644
--- a/xen/arch/x86/shutdown.c
+++ b/xen/arch/x86/shutdown.c
@@ -577,9 +577,23 @@ void machine_restart(unsigned int delay_millisecs)
         /* Ensure we are the boot CPU. */
         if ( get_apic_id() != boot_cpu_physical_apicid )
         {
-            /* Send IPI to the boot CPU (logical cpu 0). */
-            on_selected_cpus(cpumask_of(0), __machine_restart,
-                             &delay_millisecs, 0);
+            /*
+             * Send IPI to the boot CPU (logical cpu 0).
+             *
+             * If multiple CPUs called machine_restart() before actual restart
+             * took place, but after boot CPU declared itself not online, ASSERT
+             * in on_selected_cpus() will fail. Few calls later we would end up
+             * here again, with another frame on call stack for new exception.
+             * To protect against running out of stack, check if boot CPU is
+             * online.
+             *
+             * Note this is not an atomic operation, so it is possible for
+             * on_selected_cpus() to be called once after boot CPU is offline
+             * before we hit halt() below.
+             */
+            if ( cpu_online(0) )
+                on_selected_cpus(cpumask_of(0), __machine_restart,
+                                 &delay_millisecs, 0);
             for ( ; ; )
                 halt();
         }
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [XEN PATCH 7/9] x86/smp: drop booting_cpu variable
  2023-11-14 17:49 [XEN PATCH 0/9] x86: parallelize AP bring-up during boot Krystian Hebel
                   ` (5 preceding siblings ...)
  2023-11-14 17:50 ` [XEN PATCH 6/9] x86/shutdown: protect against recurrent machine_restart() Krystian Hebel
@ 2023-11-14 17:50 ` Krystian Hebel
  2024-02-08 11:39   ` Jan Beulich
  2023-11-14 17:50 ` [XEN PATCH 8/9] x86/smp: make cpu_state per-CPU Krystian Hebel
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 45+ messages in thread
From: Krystian Hebel @ 2023-11-14 17:50 UTC (permalink / raw)
  To: xen-devel
  Cc: Krystian Hebel, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu

CPU id is obtained as a side effect of searching for appropriate
stack for AP. It can be used as a parameter to start_secondary().
Coincidentally this also makes further work on making AP bring-up
code parallel easier.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
---
 xen/arch/x86/boot/x86_64.S | 13 +++++++++----
 xen/arch/x86/smpboot.c     | 15 +++++----------
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index 8d61f270761f..ad01f20d548d 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -20,20 +20,24 @@ ENTRY(__high_start)
         jz      .L_stack_set
 
         /* APs only: get stack base from APIC ID saved in %esp. */
-        mov     $0, %rax
+        mov     $0, %rbx
         lea     cpu_data(%rip), %rcx
         /* cpu_data[0] is BSP, skip it. */
 1:
-        add     $1, %rax
+        add     $1, %rbx
         add     $CPUINFO_X86_sizeof, %rcx
-        cmp     $NR_CPUS, %eax
+        cmp     $NR_CPUS, %rbx
         jb      2f
         hlt
 2:
         cmp     %esp, CPUINFO_X86_apicid(%rcx)
         jne     1b
 
-        /* %rcx is now cpu_data[cpu], read stack base from it. */
+        /*
+         * At this point:
+         * - %rcx is cpu_data[cpu], read stack base from it,
+         * - %rbx (callee-save) is Xen cpu number, pass it to start_secondary().
+         */
         mov     CPUINFO_X86_stack_base(%rcx), %rsp
 
         test    %rsp,%rsp
@@ -101,6 +105,7 @@ ENTRY(__high_start)
 .L_ap_cet_done:
 #endif /* CONFIG_XEN_SHSTK || CONFIG_XEN_IBT */
 
+        mov     %rbx, %rdi
         tailcall start_secondary
 
 .L_bsp:
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index a3895dafa267..39ffd356dbbc 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -222,8 +222,6 @@ static void smp_callin(void)
         cpu_relax();
 }
 
-static int booting_cpu;
-
 /* CPUs for which sibling maps can be computed. */
 static cpumask_t cpu_sibling_setup_map;
 
@@ -311,15 +309,14 @@ static void set_cpu_sibling_map(unsigned int cpu)
     }
 }
 
-void start_secondary(void *unused)
+void start_secondary(unsigned int cpu)
 {
     struct cpu_info *info = get_cpu_info();
 
     /*
-     * Dont put anything before smp_callin(), SMP booting is so fragile that we
+     * Don't put anything before smp_callin(), SMP booting is so fragile that we
      * want to limit the things done here to the most necessary things.
      */
-    unsigned int cpu = booting_cpu;
 
     /* Critical region without IDT or TSS.  Any fault is deadly! */
 
@@ -346,9 +343,9 @@ void start_secondary(void *unused)
      */
     spin_debug_disable();
 
-    get_cpu_info()->use_pv_cr3 = false;
-    get_cpu_info()->xen_cr3 = 0;
-    get_cpu_info()->pv_cr3 = 0;
+    info->use_pv_cr3 = false;
+    info->xen_cr3 = 0;
+    info->pv_cr3 = 0;
 
     /*
      * BUG_ON() used in load_system_tables() and later code may end up calling
@@ -575,8 +572,6 @@ static int do_boot_cpu(int apicid, int cpu)
      */
     mtrr_save_state();
 
-    booting_cpu = cpu;
-
     start_eip = bootsym_phys(trampoline_realmode_entry);
 
     /* start_eip needs be page aligned, and below the 1M boundary. */
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [XEN PATCH 8/9] x86/smp: make cpu_state per-CPU
  2023-11-14 17:49 [XEN PATCH 0/9] x86: parallelize AP bring-up during boot Krystian Hebel
                   ` (6 preceding siblings ...)
  2023-11-14 17:50 ` [XEN PATCH 7/9] x86/smp: drop booting_cpu variable Krystian Hebel
@ 2023-11-14 17:50 ` Krystian Hebel
  2024-02-08 12:13   ` Jan Beulich
  2023-11-14 17:50 ` [XEN PATCH 9/9] x86/smp: start APs in parallel during boot Krystian Hebel
  2023-11-14 20:13 ` [XEN PATCH 0/9] x86: parallelize AP bring-up " Julien Grall
  9 siblings, 1 reply; 45+ messages in thread
From: Krystian Hebel @ 2023-11-14 17:50 UTC (permalink / raw)
  To: xen-devel
  Cc: Krystian Hebel, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu

This will be used for parallel AP bring-up.

CPU_STATE_INIT changed direction. It was previously set by BSP and never
consumed by AP. Now it signals that AP got through assembly part of
initialization and waits for BSP to call notifiers that set up data
structures required for further initialization.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
---
 xen/arch/x86/include/asm/cpufeature.h |  1 +
 xen/arch/x86/smpboot.c                | 80 ++++++++++++++++-----------
 2 files changed, 49 insertions(+), 32 deletions(-)

diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index ff0e18864cc7..1831b5fb368f 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -38,6 +38,7 @@ struct cpuinfo_x86 {
     unsigned int cpu_core_id;          /* core ID of each logical CPU */
     unsigned int compute_unit_id;      /* AMD compute unit ID of each logical CPU */
     void *stack_base;
+    unsigned int cpu_state;
     unsigned short x86_clflush_size;
 } __cacheline_aligned;
 
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 39ffd356dbbc..cbea2d45f70d 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -65,15 +65,18 @@ struct cpuinfo_x86 cpu_data[NR_CPUS] =
         { [0 ... NR_CPUS-1] .apicid = BAD_APICID };
 
 static int cpu_error;
-static enum cpu_state {
+enum cpu_state {
     CPU_STATE_DYING,    /* slave -> master: I am dying */
     CPU_STATE_DEAD,     /* slave -> master: I am completely dead */
-    CPU_STATE_INIT,     /* master -> slave: Early bringup phase 1 */
-    CPU_STATE_CALLOUT,  /* master -> slave: Early bringup phase 2 */
+    CPU_STATE_INIT,     /* slave -> master: Early bringup phase 1 completed */
+    CPU_STATE_CALLOUT,  /* master -> slave: Start early bringup phase 2 */
     CPU_STATE_CALLIN,   /* slave -> master: Completed phase 2 */
     CPU_STATE_ONLINE    /* master -> slave: Go fully online now. */
-} cpu_state;
-#define set_cpu_state(state) do { smp_mb(); cpu_state = (state); } while (0)
+};
+#define set_cpu_state(cpu, state) do { \
+    smp_mb(); \
+    cpu_data[cpu].cpu_state = (state); \
+} while (0)
 
 void initialize_cpu_data(unsigned int cpu)
 {
@@ -168,16 +171,7 @@ static void synchronize_tsc_slave(unsigned int slave)
 static void smp_callin(void)
 {
     unsigned int cpu = smp_processor_id();
-    int i, rc;
-
-    /* Wait 2s total for startup. */
-    Dprintk("Waiting for CALLOUT.\n");
-    for ( i = 0; cpu_state != CPU_STATE_CALLOUT; i++ )
-    {
-        BUG_ON(i >= 200);
-        cpu_relax();
-        mdelay(10);
-    }
+    int rc;
 
     /*
      * The boot CPU has finished the init stage and is spinning on cpu_state
@@ -213,12 +207,12 @@ static void smp_callin(void)
     }
 
     /* Allow the master to continue. */
-    set_cpu_state(CPU_STATE_CALLIN);
+    set_cpu_state(cpu, CPU_STATE_CALLIN);
 
     synchronize_tsc_slave(cpu);
 
     /* And wait for our final Ack. */
-    while ( cpu_state != CPU_STATE_ONLINE )
+    while ( cpu_data[cpu].cpu_state != CPU_STATE_ONLINE )
         cpu_relax();
 }
 
@@ -313,6 +307,9 @@ void start_secondary(unsigned int cpu)
 {
     struct cpu_info *info = get_cpu_info();
 
+    /* Tell BSP that we are awake. */
+    set_cpu_state(cpu, CPU_STATE_INIT);
+
     /*
      * Don't put anything before smp_callin(), SMP booting is so fragile that we
      * want to limit the things done here to the most necessary things.
@@ -320,6 +317,10 @@ void start_secondary(unsigned int cpu)
 
     /* Critical region without IDT or TSS.  Any fault is deadly! */
 
+    /* Wait until data set up by CPU_UP_PREPARE notifiers is ready. */
+    while ( cpu_data[cpu].cpu_state != CPU_STATE_CALLOUT )
+        cpu_relax();
+
     set_current(idle_vcpu[cpu]);
     this_cpu(curr_vcpu) = idle_vcpu[cpu];
     rdmsrl(MSR_EFER, this_cpu(efer));
@@ -585,26 +586,35 @@ static int do_boot_cpu(int apicid, int cpu)
 
     /* This grunge runs the startup process for the targeted processor. */
 
-    set_cpu_state(CPU_STATE_INIT);
-
     /* Starting actual IPI sequence... */
     boot_error = wakeup_secondary_cpu(apicid, start_eip);
 
     if ( !boot_error )
     {
-        /* Allow AP to start initializing. */
-        set_cpu_state(CPU_STATE_CALLOUT);
-        Dprintk("After Callout %d.\n", cpu);
-
-        /* Wait 5s total for a response. */
-        for ( timeout = 0; timeout < 50000; timeout++ )
+        /* Wait 2s total for a response. */
+        for ( timeout = 0; timeout < 20000; timeout++ )
         {
-            if ( cpu_state != CPU_STATE_CALLOUT )
+            if ( cpu_data[cpu].cpu_state == CPU_STATE_INIT )
                 break;
             udelay(100);
         }
 
-        if ( cpu_state == CPU_STATE_CALLIN )
+        if ( cpu_data[cpu].cpu_state == CPU_STATE_INIT )
+        {
+            /* Allow AP to start initializing. */
+            set_cpu_state(cpu, CPU_STATE_CALLOUT);
+            Dprintk("After Callout %d.\n", cpu);
+
+            /* Wait 5s total for a response. */
+            for ( timeout = 0; timeout < 500000; timeout++ )
+            {
+                if ( cpu_data[cpu].cpu_state != CPU_STATE_CALLOUT )
+                    break;
+                udelay(10);
+            }
+        }
+
+        if ( cpu_data[cpu].cpu_state == CPU_STATE_CALLIN )
         {
             /* number CPUs logically, starting from 1 (BSP is 0) */
             Dprintk("OK.\n");
@@ -612,7 +622,7 @@ static int do_boot_cpu(int apicid, int cpu)
             synchronize_tsc_master(cpu);
             Dprintk("CPU has booted.\n");
         }
-        else if ( cpu_state == CPU_STATE_DEAD )
+        else if ( cpu_data[cpu].cpu_state == CPU_STATE_DEAD )
         {
             smp_rmb();
             rc = cpu_error;
@@ -683,7 +693,7 @@ unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn)
 void cpu_exit_clear(unsigned int cpu)
 {
     cpu_uninit(cpu);
-    set_cpu_state(CPU_STATE_DEAD);
+    set_cpu_state(cpu, CPU_STATE_DEAD);
 }
 
 static int clone_mapping(const void *ptr, root_pgentry_t *rpt)
@@ -1161,6 +1171,12 @@ void __init smp_prepare_cpus(void)
     cpu_data[0].stack_base = (void *)
              ((unsigned long)stack_start & ~(STACK_SIZE - 1));
 
+    /* Set state as CALLOUT so APs won't change it in initialize_cpu_data() */
+    boot_cpu_data.cpu_state = CPU_STATE_CALLOUT;
+
+    /* Not really used anywhere, but set it just in case. */
+    set_cpu_state(0, CPU_STATE_ONLINE);
+
     set_nr_sockets();
 
     socket_cpumask = xzalloc_array(cpumask_t *, nr_sockets);
@@ -1267,7 +1283,7 @@ void __cpu_disable(void)
 {
     int cpu = smp_processor_id();
 
-    set_cpu_state(CPU_STATE_DYING);
+    set_cpu_state(cpu, CPU_STATE_DYING);
 
     local_irq_disable();
     clear_local_APIC();
@@ -1292,7 +1308,7 @@ void __cpu_die(unsigned int cpu)
     unsigned int i = 0;
     enum cpu_state seen_state;
 
-    while ( (seen_state = cpu_state) != CPU_STATE_DEAD )
+    while ( (seen_state = cpu_data[cpu].cpu_state) != CPU_STATE_DEAD )
     {
         BUG_ON(seen_state != CPU_STATE_DYING);
         mdelay(100);
@@ -1393,7 +1409,7 @@ int __cpu_up(unsigned int cpu)
 
     time_latch_stamps();
 
-    set_cpu_state(CPU_STATE_ONLINE);
+    set_cpu_state(cpu, CPU_STATE_ONLINE);
     while ( !cpu_online(cpu) )
     {
         cpu_relax();
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [XEN PATCH 9/9] x86/smp: start APs in parallel during boot
  2023-11-14 17:49 [XEN PATCH 0/9] x86: parallelize AP bring-up during boot Krystian Hebel
                   ` (7 preceding siblings ...)
  2023-11-14 17:50 ` [XEN PATCH 8/9] x86/smp: make cpu_state per-CPU Krystian Hebel
@ 2023-11-14 17:50 ` Krystian Hebel
  2024-02-08 12:37   ` Jan Beulich
  2023-11-14 20:13 ` [XEN PATCH 0/9] x86: parallelize AP bring-up " Julien Grall
  9 siblings, 1 reply; 45+ messages in thread
From: Krystian Hebel @ 2023-11-14 17:50 UTC (permalink / raw)
  To: xen-devel
  Cc: Krystian Hebel, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu

Multiple delays are required when sending IPIs and waiting for
responses. During boot, 4 such IPIs were sent per each AP. With this
change, only one set of broadcast IPIs is sent. This reduces boot time,
especially for platforms with large number of cores.

Single CPU initialization is still possible, it is used for hotplug.

During wakeup from S3 APs are started one by one. It should be possible
to enable parallel execution there as well, but I don't have a way of
properly testing it as of now.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
---
 xen/arch/x86/include/asm/smp.h |  1 +
 xen/arch/x86/setup.c           |  2 +
 xen/arch/x86/smpboot.c         | 68 ++++++++++++++++++++++++----------
 3 files changed, 51 insertions(+), 20 deletions(-)

diff --git a/xen/arch/x86/include/asm/smp.h b/xen/arch/x86/include/asm/smp.h
index 98739028a6ed..6ca0158a368d 100644
--- a/xen/arch/x86/include/asm/smp.h
+++ b/xen/arch/x86/include/asm/smp.h
@@ -31,6 +31,7 @@ DECLARE_PER_CPU(cpumask_var_t, send_ipi_cpumask);
 extern bool park_offline_cpus;
 
 void smp_send_nmi_allbutself(void);
+void smp_send_init_sipi_sipi_allbutself(void);
 
 void send_IPI_mask(const cpumask_t *, int vector);
 void send_IPI_self(int vector);
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index b2c0679725ea..42a9067b81eb 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1963,6 +1963,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
                 cpu_data[i].stack_base = cpu_alloc_stack(i);
         }
 
+        smp_send_init_sipi_sipi_allbutself();
+
         for_each_present_cpu ( i )
         {
             if ( (park_offline_cpus || num_online_cpus() < max_cpus) &&
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index cbea2d45f70d..e9a7f78a5a6f 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -425,7 +425,7 @@ void start_secondary(unsigned int cpu)
 
 static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
 {
-    unsigned long send_status = 0, accept_status = 0;
+    unsigned long send_status = 0, accept_status = 0, sh = 0;
     int maxlvt, timeout, i;
 
     /*
@@ -445,6 +445,12 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
     if ( tboot_in_measured_env() && !tboot_wake_ap(phys_apicid, start_eip) )
         return 0;
 
+    /*
+     * Use destination shorthand for broadcasting IPIs during boot.
+     */
+    if ( phys_apicid == BAD_APICID )
+        sh = APIC_DEST_ALLBUT;
+
     /*
      * Be paranoid about clearing APIC errors.
      */
@@ -458,7 +464,7 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
         /*
          * Turn INIT on target chip via IPI
          */
-        apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
+        apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT | sh,
                        phys_apicid);
 
         if ( !x2apic_enabled )
@@ -475,7 +481,7 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
 
             Dprintk("Deasserting INIT.\n");
 
-            apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
+            apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT | sh, phys_apicid);
 
             Dprintk("Waiting for send to finish...\n");
             timeout = 0;
@@ -512,7 +518,7 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
          * STARTUP IPI
          * Boot on the stack
          */
-        apic_icr_write(APIC_DM_STARTUP | (start_eip >> 12), phys_apicid);
+        apic_icr_write(APIC_DM_STARTUP | (start_eip >> 12) | sh, phys_apicid);
 
         if ( !x2apic_enabled )
         {
@@ -565,7 +571,6 @@ int alloc_cpu_id(void)
 static int do_boot_cpu(int apicid, int cpu)
 {
     int timeout, boot_error = 0, rc = 0;
-    unsigned long start_eip;
 
     /*
      * Save current MTRR state in case it was changed since early boot
@@ -573,21 +578,31 @@ static int do_boot_cpu(int apicid, int cpu)
      */
     mtrr_save_state();
 
-    start_eip = bootsym_phys(trampoline_realmode_entry);
+    /* Check if AP is already up. */
+    if ( cpu_data[cpu].cpu_state != CPU_STATE_INIT )
+    {
+        /* This grunge runs the startup process for the targeted processor. */
+        unsigned long start_eip;
+        start_eip = bootsym_phys(trampoline_realmode_entry);
 
-    /* start_eip needs be page aligned, and below the 1M boundary. */
-    if ( start_eip & ~0xff000 )
-        panic("AP trampoline %#lx not suitably positioned\n", start_eip);
+        /* start_eip needs be page aligned, and below the 1M boundary. */
+        if ( start_eip & ~0xff000 )
+            panic("AP trampoline %#lx not suitably positioned\n", start_eip);
 
-    /* So we see what's up   */
-    if ( opt_cpu_info )
-        printk("Booting processor %d/%d eip %lx\n",
-               cpu, apicid, start_eip);
+        /* So we see what's up   */
+        if ( opt_cpu_info )
+            printk("AP trampoline at %lx\n", start_eip);
 
-    /* This grunge runs the startup process for the targeted processor. */
+        /* mark "stuck" area as not stuck */
+        bootsym(trampoline_cpu_started) = 0;
+        smp_mb();
 
-    /* Starting actual IPI sequence... */
-    boot_error = wakeup_secondary_cpu(apicid, start_eip);
+        /* Starting actual IPI sequence... */
+        boot_error = wakeup_secondary_cpu(apicid, start_eip);
+    }
+
+    if ( opt_cpu_info )
+        printk("Booting processor %d/%d\n", cpu, apicid);
 
     if ( !boot_error )
     {
@@ -646,10 +661,6 @@ static int do_boot_cpu(int apicid, int cpu)
         rc = -EIO;
     }
 
-    /* mark "stuck" area as not stuck */
-    bootsym(trampoline_cpu_started) = 0;
-    smp_mb();
-
     return rc;
 }
 
@@ -1155,6 +1166,23 @@ static struct notifier_block cpu_smpboot_nfb = {
     .notifier_call = cpu_smpboot_callback
 };
 
+void smp_send_init_sipi_sipi_allbutself(void)
+{
+    unsigned long start_eip;
+    start_eip = bootsym_phys(trampoline_realmode_entry);
+
+    /* start_eip needs be page aligned, and below the 1M boundary. */
+    if ( start_eip & ~0xff000 )
+        panic("AP trampoline %#lx not suitably positioned\n", start_eip);
+
+    /* So we see what's up   */
+    if ( opt_cpu_info )
+        printk("Booting APs in parallel, eip %lx\n", start_eip);
+
+    /* Starting actual broadcast IPI sequence... */
+    wakeup_secondary_cpu(BAD_APICID, start_eip);
+}
+
 void __init smp_prepare_cpus(void)
 {
     register_cpu_notifier(&cpu_smpboot_nfb);
-- 
2.41.0



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 0/9] x86: parallelize AP bring-up during boot
  2023-11-14 17:49 [XEN PATCH 0/9] x86: parallelize AP bring-up during boot Krystian Hebel
                   ` (8 preceding siblings ...)
  2023-11-14 17:50 ` [XEN PATCH 9/9] x86/smp: start APs in parallel during boot Krystian Hebel
@ 2023-11-14 20:13 ` Julien Grall
  9 siblings, 0 replies; 45+ messages in thread
From: Julien Grall @ 2023-11-14 20:13 UTC (permalink / raw)
  To: Krystian Hebel, xen-devel
  Cc: Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Stefano Stabellini

Hi Krystian,

Thanks you for sending the series! Just posting some extra data point.

On 14/11/2023 17:49, Krystian Hebel wrote:
> Patch series available on this branch:
> https://github.com/TrenchBoot/xen/tree/smp_cleanup_upstreaming
> 
> This series makes AP bring-up parallel on x86. This reduces number of
> IPIs (and more importantly, delays between them) required to start all
> logical processors significantly.
> 
> In order to make it possible, some state variables that were global
> had to be made per-CPU. In most cases, accesses to those variables can
> be performed through helper macros, some of them existed before in a
> different form.
> 
> In addition to work required for parallel initialization, I've fixed
> issues in error path around `machine_restart()` that were discovered
> during implementation and testing.
> 
> CPU hotplug should not be affected, but I had no way of testing it.
> During wakeup from S3 APs are started one by one. It should be possible
> to enable parallel execution there as well, but I don't have a way of
> testing it as of now.
> 
> To measure the improvement, I added output lines (identical for before
> and after changes so there is no impact of printing over serial) just
> above and below `if ( !pv_shim )` block. `console_timestamps=raw` was
> used to get as accurate timestamp as possible, and average over 3 boots
> was taken into account for each measurement. The final improvement was
> calculated as (1 - after/before) * 100%, rounded to 0.01%. These are
> the results:
> 
> * Dell OptiPlex 9010 with Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
>    (4 cores, 4 threads): 48.83%
> * MSI PRO Z790-P with 13th Gen Intel(R) Core(TM) i5-13600K (14 cores,
>    20 threads, 6P+8E) `smt=on`: 36.16%
> * MSI PRO Z790-P with 13th Gen Intel(R) Core(TM) i5-13600K (14 cores,
>    20 threads, 6P+8E) `smt=off`: 0.25% (parking takes a lot of additional
>    time)
> * HP t630 Thin Client with AMD Embedded G-Series GX-420GI Radeon R7E
>    (4 cores, 4 threads): 68.00%

Your series reminded me some optimization we did at AWS in the SMP boot 
code. This hasn't yet been sent upstrema so I thought I would compare to 
your series to see if there are any differences.

Instead of adding support for parallel SMP boot, we decided to optimize 
some of the wait call (see diff below).

This was tested on a nested environment (KVM/QEMU) on c5 metal with 
x2apic disabled. The numbers are for bringing-up 95 CPUs:

   * Currently: 2s
   * With AWS change only: 300ms
   * With your change only: 100ms

So your approach is superior :). I see you are dropping the loop in 
smp_callin(). So the first hunk in the below diff is not necessary 
anymore. The second hunk probably still makes sense for CPU hotplug (and 
maybe S3 bring-up) even though it would only save 10ms. I will look to 
send the patch.

Diff:

diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 3a1a659082c6..86fd5500e1ea 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -172,9 +172,9 @@ static void smp_callin(void)
      Dprintk("Waiting for CALLOUT.\n");
      for ( i = 0; cpu_state != CPU_STATE_CALLOUT; i++ )
      {
-        BUG_ON(i >= 200);
+        BUG_ON(i >= 200000);
          cpu_relax();
-        mdelay(10);
+        udelay(10);
      }

      /*
@@ -430,6 +430,10 @@ static int wakeup_secondary_cpu(int phys_apicid, 
unsigned long start_eip)
       * Refer to AMD APM Vol2 15.27 "Secure Startup with SKINIT".
       */
      bool send_INIT = ap_boot_method != AP_BOOT_SKINIT;
+    bool modern_cpu = ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) &&
+                       (boot_cpu_data.x86 == 6)) ||
+                      ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) &&
+                       (boot_cpu_data.x86 >= 0xF));

      /*
       * Some versions of tboot might be able to handle the entire wake 
sequence
@@ -464,7 +468,15 @@ static int wakeup_secondary_cpu(int phys_apicid, 
unsigned long start_eip)
                  send_status = apic_read(APIC_ICR) & APIC_ICR_BUSY;
              } while ( send_status && (timeout++ < 1000) );

-            mdelay(10);
+            /*
+             * The Multiprocessor Specification 1.4 (1997) example code 
suggests
+             * that there should be a 10ms delay between the BSP 
asserting INIT
+             * and de-asserting INIT, when starting a remote processor.
+             * But that slows boot and resume on modern processors, 
which include
+             * many cores and don't require that delay.
+             */
+            if ( !modern_cpu )
+                mdelay(10);

              Dprintk("Deasserting INIT.\n");

Cheers,

-- 
Julien Grall


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 1/9] x86/boot: choose AP stack based on APIC ID
  2023-11-14 17:49 ` [XEN PATCH 1/9] x86/boot: choose AP stack based on APIC ID Krystian Hebel
@ 2024-01-26 18:30   ` Julien Grall
  2024-03-12 14:14     ` Krystian Hebel
  2024-02-07 16:11   ` Jan Beulich
  1 sibling, 1 reply; 45+ messages in thread
From: Julien Grall @ 2024-01-26 18:30 UTC (permalink / raw)
  To: Krystian Hebel, xen-devel
  Cc: Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

Hi,

I am not too familiary with the x86 boot code. But I will give a try to 
review :).

On 14/11/2023 17:49, Krystian Hebel wrote:
> This is made as first step of making parallel AP bring-up possible. It
> should be enough for pre-C code.
> 
> Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
> ---
>   xen/arch/x86/boot/trampoline.S | 20 ++++++++++++++++++++
>   xen/arch/x86/boot/x86_64.S     | 28 +++++++++++++++++++++++++++-
>   xen/arch/x86/setup.c           |  7 +++++++
>   3 files changed, 54 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/boot/trampoline.S b/xen/arch/x86/boot/trampoline.S
> index b8ab0ffdcbb0..ec254125016d 100644
> --- a/xen/arch/x86/boot/trampoline.S
> +++ b/xen/arch/x86/boot/trampoline.S
> @@ -72,6 +72,26 @@ trampoline_protmode_entry:
>           mov     $X86_CR4_PAE,%ecx
>           mov     %ecx,%cr4
>   
> +        /*
> +         * Get APIC ID while we're in non-paged mode. Start by checking if
> +         * x2APIC is enabled.
> +         */
> +        mov     $MSR_APIC_BASE, %ecx
> +        rdmsr
> +        and     $APIC_BASE_EXTD, %eax
> +        jnz     .Lx2apic
> +
> +        /* Not x2APIC, read from MMIO */
> +        mov     0xfee00020, %esp
> +        shr     $24, %esp
> +        jmp     1f
> +
> +.Lx2apic:
> +        mov     $(MSR_X2APIC_FIRST + (0x20 >> 4)), %ecx
> +        rdmsr
> +        mov     %eax, %esp
> +1:
> +
>           /* Load pagetable base register. */
>           mov     $sym_offs(idle_pg_table),%eax
>           add     bootsym_rel(trampoline_xen_phys_start,4,%eax)
> diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
> index 04bb62ae8680..b85b47b5c1a0 100644
> --- a/xen/arch/x86/boot/x86_64.S
> +++ b/xen/arch/x86/boot/x86_64.S
> @@ -15,7 +15,33 @@ ENTRY(__high_start)
>           mov     $XEN_MINIMAL_CR4,%rcx
>           mov     %rcx,%cr4
>   
> -        mov     stack_start(%rip),%rsp
> +        test    %ebx,%ebx
> +        cmovz   stack_start(%rip), %rsp
> +        jz      .L_stack_set
> +
> +        /* APs only: get stack base from APIC ID saved in %esp. */
> +        mov     $-1, %rax
> +        lea     x86_cpu_to_apicid(%rip), %rcx
I would consider to move this patch after #2 and #3, so the logic is not 
modified again. This would help the review.

> +1:
> +        add     $1, %rax
> +        cmp     $NR_CPUS, %eax
> +        jb      2f
I think we can get rid of this jump by reworking the loop so %eax is 
tested as the end of the loop. But this is boot code, so it is possibly 
not worth it. I will leave the x86 maintainers commenting.

> +        hlt
> +2:
> +        cmp     %esp, (%rcx, %rax, 4)
> +        jne     1b
> +
> +        /* %eax is now Xen CPU index. */
> +        lea     stack_base(%rip), %rcx
> +        mov     (%rcx, %rax, 8), %rsp
> +
> +        test    %rsp,%rsp
> +        jnz     1f
> +        hlt
> +1:
NIT: Can you use 3? This makes the code easier to read and less prone to 
error (you have two very close 1).

> +        add     $(STACK_SIZE - CPUINFO_sizeof), %rsp
> +
> +.L_stack_set:
>   
>           /* Reset EFLAGS (subsumes CLI and CLD). */
>           pushq   $0
> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index a3d3f797bb1e..1285969901e0 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1951,6 +1951,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>        */
>       if ( !pv_shim )
>       {
> +        /* Separate loop to make parallel AP bringup possible. */

The loop split seems to be unrelated to this patch. Actually, I was 
expecting that only the assembly code would be modified.

>           for_each_present_cpu ( i )
>           {
>               /* Set up cpu_to_node[]. */
> @@ -1958,6 +1959,12 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>               /* Set up node_to_cpumask based on cpu_to_node[]. */
>               numa_add_cpu(i);
>   
> +            if ( stack_base[i] == NULL )
> +                stack_base[i] = cpu_alloc_stack(i);

I don't quite understand this change at least in the context of this 
patch. AFAICT the stack will be currently allocated in 
cpu_smpboot_callback() which is called while the CPU is prepared. So you 
should not need this allocation right now.

Looking at the rest of the series, it seems you allocate the stack 
earlier so you start the CPU bring-up earlier. But they will still be 
held in assembly code until cpu_up() is called.

So effectively, part of the C part of the CPUs bring-up is still 
serialized. Did I understand the logic correctly?

If so, I would suggest to clarify it in the series because this wasn't 
obvious to me (I was expecting start_secondary() would also run in 
parallell).

Regarding the change in setup.c, I think it would make more sense in 
patch #9.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 2/9] x86: don't access x86_cpu_to_apicid[] directly, use cpu_physical_id(cpu)
  2023-11-14 17:49 ` [XEN PATCH 2/9] x86: don't access x86_cpu_to_apicid[] directly, use cpu_physical_id(cpu) Krystian Hebel
@ 2024-01-26 18:38   ` Julien Grall
  2024-02-07 16:28   ` Jan Beulich
  1 sibling, 0 replies; 45+ messages in thread
From: Julien Grall @ 2024-01-26 18:38 UTC (permalink / raw)
  To: Krystian Hebel, xen-devel
  Cc: Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

Hi,

On 14/11/2023 17:49, Krystian Hebel wrote:
> This is done in preparation to move data from x86_cpu_to_apicid[]
> elsewhere.
NIT: I would add "No functional changes intended".

> Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>

Reviewed-by: Julien Grall <jgrall@amazon.com>

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 3/9] x86/smp: drop x86_cpu_to_apicid, use cpu_data[cpu].apicid instead
  2023-11-14 17:50 ` [XEN PATCH 3/9] x86/smp: drop x86_cpu_to_apicid, use cpu_data[cpu].apicid instead Krystian Hebel
@ 2024-02-02 18:11   ` Julien Grall
  2024-02-07 16:41     ` Jan Beulich
  2024-02-08  7:29   ` Jan Beulich
  1 sibling, 1 reply; 45+ messages in thread
From: Julien Grall @ 2024-02-02 18:11 UTC (permalink / raw)
  To: Krystian Hebel, xen-devel
  Cc: Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

Hi,

On 14/11/2023 17:50, Krystian Hebel wrote:
> Both fields held the same data.
> 
> Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
> ---
>   xen/arch/x86/boot/x86_64.S           |  8 +++++---
>   xen/arch/x86/include/asm/asm_defns.h |  2 +-
>   xen/arch/x86/include/asm/processor.h |  2 ++
>   xen/arch/x86/include/asm/smp.h       |  4 ----
>   xen/arch/x86/numa.c                  | 15 +++++++--------
>   xen/arch/x86/smpboot.c               |  8 ++++----
>   xen/arch/x86/x86_64/asm-offsets.c    |  4 +++-
>   7 files changed, 22 insertions(+), 21 deletions(-)
> 
> diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
> index b85b47b5c1a0..195550b5c0ea 100644
> --- a/xen/arch/x86/boot/x86_64.S
> +++ b/xen/arch/x86/boot/x86_64.S
> @@ -20,15 +20,17 @@ ENTRY(__high_start)
>           jz      .L_stack_set
>   
>           /* APs only: get stack base from APIC ID saved in %esp. */
> -        mov     $-1, %rax
> -        lea     x86_cpu_to_apicid(%rip), %rcx
> +        mov     $0, %rax
> +        lea     cpu_data(%rip), %rcx
> +        /* cpu_data[0] is BSP, skip it. */
>   1:
>           add     $1, %rax
> +        add     $CPUINFO_X86_sizeof, %rcx
>           cmp     $NR_CPUS, %eax
>           jb      2f
>           hlt
>   2:
> -        cmp     %esp, (%rcx, %rax, 4)
> +        cmp     %esp, CPUINFO_X86_apicid(%rcx)
>           jne     1b
>   
>           /* %eax is now Xen CPU index. */

As mentioned in an earlier patch, I think you want to re-order the 
patches. This will avoid to modify twice the same code within the same 
series (it is best to avoid if you can).

> diff --git a/xen/arch/x86/include/asm/asm_defns.h b/xen/arch/x86/include/asm/asm_defns.h
> index baaaccb26e17..6b05d9d140b8 100644
> --- a/xen/arch/x86/include/asm/asm_defns.h
> +++ b/xen/arch/x86/include/asm/asm_defns.h
> @@ -158,7 +158,7 @@ register unsigned long current_stack_pointer asm("rsp");
>   #endif
>   
>   #define CPUINFO_FEATURE_OFFSET(feature)           \
> -    (CPUINFO_features + (cpufeat_word(feature) * 4))
> +    (CPUINFO_X86_features + (cpufeat_word(feature) * 4))
>   
>   #else
>   
> diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
> index b0d2a62c075f..8345d58094da 100644
> --- a/xen/arch/x86/include/asm/processor.h
> +++ b/xen/arch/x86/include/asm/processor.h
> @@ -92,6 +92,8 @@ struct x86_cpu_id {
>   extern struct cpuinfo_x86 cpu_data[];
>   #define current_cpu_data cpu_data[smp_processor_id()]
>   
> +#define cpu_physical_id(cpu)	cpu_data[cpu].apicid
> +
>   extern bool probe_cpuid_faulting(void);
>   extern void ctxt_switch_levelling(const struct vcpu *next);
>   extern void (*ctxt_switch_masking)(const struct vcpu *next);
> diff --git a/xen/arch/x86/include/asm/smp.h b/xen/arch/x86/include/asm/smp.h
> index c0b5d7cdd8dd..94c557491860 100644
> --- a/xen/arch/x86/include/asm/smp.h
> +++ b/xen/arch/x86/include/asm/smp.h
> @@ -39,10 +39,6 @@ extern void (*mtrr_hook) (void);
>   
>   extern void zap_low_mappings(void);
>   
> -extern u32 x86_cpu_to_apicid[];
> -
> -#define cpu_physical_id(cpu)	x86_cpu_to_apicid[cpu]
> -
>   #define cpu_is_offline(cpu) unlikely(!cpu_online(cpu))
>   extern void cpu_exit_clear(unsigned int cpu);
>   extern void cpu_uninit(unsigned int cpu);
> diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> index 39e131cb4f35..91527be5b406 100644
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -54,14 +54,13 @@ bool __init arch_numa_unavailable(void)
>   /*
>    * Setup early cpu_to_node.
>    *
> - * Populate cpu_to_node[] only if x86_cpu_to_apicid[],
> - * and apicid_to_node[] tables have valid entries for a CPU.
> - * This means we skip cpu_to_node[] initialisation for NUMA
> - * emulation and faking node case (when running a kernel compiled
> - * for NUMA on a non NUMA box), which is OK as cpu_to_node[]
> - * is already initialized in a round robin manner at numa_init_array,
> - * prior to this call, and this initialization is good enough
> - * for the fake NUMA cases.
> + * Populate cpu_to_node[] only if cpu_data[], and apicid_to_node[]
> + * tables have valid entries for a CPU. This means we skip
> + * cpu_to_node[] initialisation for NUMA emulation and faking node
> + * case (when running a kernel compiled for NUMA on a non NUMA box),
> + * which is OK as cpu_to_node[] is already initialized in a round
> + * robin manner at numa_init_array, prior to this call, and this
> + * initialization is good enough for the fake NUMA cases.
>    */
>   void __init init_cpu_to_node(void)
>   {
> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> index de87c5a41926..f061486e56eb 100644
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -61,10 +61,8 @@ unsigned int __read_mostly nr_sockets;
>   cpumask_t **__read_mostly socket_cpumask;
>   static cpumask_t *secondary_socket_cpumask;
>   
> -struct cpuinfo_x86 cpu_data[NR_CPUS];
> -
> -u32 x86_cpu_to_apicid[NR_CPUS] __read_mostly =
> -	{ [0 ... NR_CPUS-1] = BAD_APICID };
> +struct cpuinfo_x86 cpu_data[NR_CPUS] =
> +        { [0 ... NR_CPUS-1] .apicid = BAD_APICID };
>   
>   static int cpu_error;
>   static enum cpu_state {
> @@ -81,7 +79,9 @@ void *stack_base[NR_CPUS];
>   
>   void initialize_cpu_data(unsigned int cpu)
>   {
> +    uint32_t apicid = cpu_physical_id(cpu);

It would be probably worth it to add a comment explaining why you save 
apicid. What about?

/* The APIC ID is saved in cpu_data[cpu] which will be overriden below. */

Coding style: Newline after the declaration.

>       cpu_data[cpu] = boot_cpu_data;
> +    cpu_physical_id(cpu) = apicid;
>   }
>   
>   static bool smp_store_cpu_info(unsigned int id)
> diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
> index 57b73a4e6214..e881cd5de0a0 100644
> --- a/xen/arch/x86/x86_64/asm-offsets.c
> +++ b/xen/arch/x86/x86_64/asm-offsets.c
> @@ -159,7 +159,9 @@ void __dummy__(void)
>       OFFSET(IRQSTAT_softirq_pending, irq_cpustat_t, __softirq_pending);
>       BLANK();
>   
> -    OFFSET(CPUINFO_features, struct cpuinfo_x86, x86_capability);
> +    OFFSET(CPUINFO_X86_features, struct cpuinfo_x86, x86_capability);

The rename seems to be unrelated to this patch. Can you clarify?

> +    OFFSET(CPUINFO_X86_apicid, struct cpuinfo_x86, apicid);
> +    DEFINE(CPUINFO_X86_sizeof, sizeof(struct cpuinfo_x86));
>       BLANK();
>   
>       OFFSET(MB_flags, multiboot_info_t, flags);

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 4/9] x86/smp: move stack_base to cpu_data
  2023-11-14 17:50 ` [XEN PATCH 4/9] x86/smp: move stack_base to cpu_data Krystian Hebel
@ 2024-02-02 18:24   ` Julien Grall
  2024-02-05  8:41     ` Jan Beulich
  2024-02-07 16:53   ` Jan Beulich
  1 sibling, 1 reply; 45+ messages in thread
From: Julien Grall @ 2024-02-02 18:24 UTC (permalink / raw)
  To: Krystian Hebel, xen-devel
  Cc: Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Stefano Stabellini

Hi,

On 14/11/2023 17:50, Krystian Hebel wrote:
> This location is easier to access from assembly. Having it close to
> other data required during initialization has also positive (although
> rather small) impact on prefetching data from RAM.

I understand your goal but...

> 
> Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
> ---
>   xen/arch/x86/boot/x86_64.S            |  5 ++---
>   xen/arch/x86/include/asm/cpufeature.h |  1 +
>   xen/arch/x86/include/asm/smp.h        |  2 +-
>   xen/arch/x86/setup.c                  |  6 +++---
>   xen/arch/x86/smpboot.c                | 25 +++++++++++++------------
>   xen/arch/x86/traps.c                  |  4 ++--
>   xen/arch/x86/x86_64/asm-offsets.c     |  1 +
>   xen/include/xen/smp.h                 |  2 --
>   8 files changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
> index 195550b5c0ea..8d61f270761f 100644
> --- a/xen/arch/x86/boot/x86_64.S
> +++ b/xen/arch/x86/boot/x86_64.S
> @@ -33,9 +33,8 @@ ENTRY(__high_start)
>           cmp     %esp, CPUINFO_X86_apicid(%rcx)
>           jne     1b
>   
> -        /* %eax is now Xen CPU index. */
> -        lea     stack_base(%rip), %rcx
> -        mov     (%rcx, %rax, 8), %rsp
> +        /* %rcx is now cpu_data[cpu], read stack base from it. */
> +        mov     CPUINFO_X86_stack_base(%rcx), %rsp
>   
>           test    %rsp,%rsp
>           jnz     1f
> diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
> index 06e1dd7f3332..ff0e18864cc7 100644
> --- a/xen/arch/x86/include/asm/cpufeature.h
> +++ b/xen/arch/x86/include/asm/cpufeature.h

... cpufeature.h feels a rather odd place for storing the stack. I am 
not entirely sure where else to place. Andrew, Jan, Roger?

> @@ -37,6 +37,7 @@ struct cpuinfo_x86 {
>       unsigned int phys_proc_id;         /* package ID of each logical CPU */
>       unsigned int cpu_core_id;          /* core ID of each logical CPU */
>       unsigned int compute_unit_id;      /* AMD compute unit ID of each logical CPU */
> +    void *stack_base;

AFAICT, this means there will be a padding before stack_base and ...

>       unsigned short x86_clflush_size;

... another one here. Is there any particular reason the new field 
wasn't added at the end?

>   } __cacheline_aligned;
>   
> diff --git a/xen/arch/x86/include/asm/smp.h b/xen/arch/x86/include/asm/smp.h
> index 94c557491860..98739028a6ed 100644
> --- a/xen/arch/x86/include/asm/smp.h
> +++ b/xen/arch/x86/include/asm/smp.h
> @@ -69,7 +69,7 @@ extern cpumask_t **socket_cpumask;
>    * by certain scheduling code only.
>    */
>   #define get_cpu_current(cpu) \
> -    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)
> +    (get_cpu_info_from_stack((unsigned long)cpu_data[cpu].stack_base)->current_vcpu)
>   
>   extern unsigned int disabled_cpus;
>   extern bool unaccounted_cpus;
> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index a19fe219bbf6..b2c0679725ea 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -798,7 +798,7 @@ static void __init noreturn reinit_bsp_stack(void)
>       /* Update SYSCALL trampolines */
>       percpu_traps_init();
>   
> -    stack_base[0] = stack;
> +    cpu_data[0].stack_base = stack;
>   
>       rc = setup_cpu_root_pgt(0);
>       if ( rc )
> @@ -1959,8 +1959,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>               /* Set up node_to_cpumask based on cpu_to_node[]. */
>               numa_add_cpu(i);
>   
> -            if ( stack_base[i] == NULL )
> -                stack_base[i] = cpu_alloc_stack(i);
> +            if ( cpu_data[i].stack_base == NULL )
> +                cpu_data[i].stack_base = cpu_alloc_stack(i);
>           }
>   
>           for_each_present_cpu ( i )
> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> index f061486e56eb..8ae65ab1769f 100644
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -75,13 +75,15 @@ static enum cpu_state {
>   } cpu_state;
>   #define set_cpu_state(state) do { smp_mb(); cpu_state = (state); } while (0)
>   
> -void *stack_base[NR_CPUS];
> -
>   void initialize_cpu_data(unsigned int cpu)
>   {
>       uint32_t apicid = cpu_physical_id(cpu);
> +    void *stack = cpu_data[cpu].stack_base;
> +
>       cpu_data[cpu] = boot_cpu_data;
> +
>       cpu_physical_id(cpu) = apicid;
> +    cpu_data[cpu].stack_base = stack;
>   }
>   
>   static bool smp_store_cpu_info(unsigned int id)
> @@ -579,8 +581,6 @@ static int do_boot_cpu(int apicid, int cpu)
>           printk("Booting processor %d/%d eip %lx\n",
>                  cpu, apicid, start_eip);
>   
> -    stack_start = stack_base[cpu] + STACK_SIZE - sizeof(struct cpu_info);
> -

You remove this line because I can't quite figure out where stack_start 
is now set. This is used...

>       /* This grunge runs the startup process for the targeted processor. */
>   
>       set_cpu_state(CPU_STATE_INIT);
> @@ -856,7 +856,7 @@ int setup_cpu_root_pgt(unsigned int cpu)
>   
>       /* Install direct map page table entries for stack, IDT, and TSS. */
>       for ( off = rc = 0; !rc && off < STACK_SIZE; off += PAGE_SIZE )
> -        rc = clone_mapping(__va(__pa(stack_base[cpu])) + off, rpt);
> +        rc = clone_mapping(__va(__pa(cpu_data[cpu].stack_base)) + off, rpt);
>   
>       if ( !rc )
>           rc = clone_mapping(idt_tables[cpu], rpt);
> @@ -1007,10 +1007,10 @@ static void cpu_smpboot_free(unsigned int cpu, bool remove)
>           FREE_XENHEAP_PAGE(per_cpu(gdt, cpu));
>           FREE_XENHEAP_PAGE(idt_tables[cpu]);
>   
> -        if ( stack_base[cpu] )
> +        if ( cpu_data[cpu].stack_base )
>           {
> -            memguard_unguard_stack(stack_base[cpu]);
> -            FREE_XENHEAP_PAGES(stack_base[cpu], STACK_ORDER);
> +            memguard_unguard_stack(cpu_data[cpu].stack_base);
> +            FREE_XENHEAP_PAGES(cpu_data[cpu].stack_base, STACK_ORDER);
>           }
>       }
>   }
> @@ -1044,11 +1044,11 @@ static int cpu_smpboot_alloc(unsigned int cpu)
>       if ( node != NUMA_NO_NODE )
>           memflags = MEMF_node(node);
>   
> -    if ( stack_base[cpu] == NULL &&
> -         (stack_base[cpu] = cpu_alloc_stack(cpu)) == NULL )
> +    if ( cpu_data[cpu].stack_base == NULL &&
> +         (cpu_data[cpu].stack_base = cpu_alloc_stack(cpu)) == NULL )
>               goto out;
>   
> -    info = get_cpu_info_from_stack((unsigned long)stack_base[cpu]);
> +    info = get_cpu_info_from_stack((unsigned long)cpu_data[cpu].stack_base);

... here.

>       info->processor_id = cpu;
>       info->per_cpu_offset = __per_cpu_offset[cpu];
>   
> @@ -1156,7 +1156,8 @@ void __init smp_prepare_cpus(void)
>       boot_cpu_physical_apicid = get_apic_id();
>       cpu_physical_id(0) = boot_cpu_physical_apicid;
>   
> -    stack_base[0] = (void *)((unsigned long)stack_start & ~(STACK_SIZE - 1));
> +    cpu_data[0].stack_base = (void *)
> +             ((unsigned long)stack_start & ~(STACK_SIZE - 1));
>   
>       set_nr_sockets();
>   
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index e1356f696aba..90d9201d1c52 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -611,9 +611,9 @@ void show_stack_overflow(unsigned int cpu, const struct cpu_user_regs *regs)
>       unsigned long curr_stack_base = esp & ~(STACK_SIZE - 1);
>       unsigned long esp_top, esp_bottom;
>   
> -    if ( _p(curr_stack_base) != stack_base[cpu] )
> +    if ( _p(curr_stack_base) != cpu_data[cpu].stack_base )
>           printk("Current stack base %p differs from expected %p\n",
> -               _p(curr_stack_base), stack_base[cpu]);
> +               _p(curr_stack_base), cpu_data[cpu].stack_base);
>   
>       esp_bottom = (esp | (STACK_SIZE - 1)) + 1;
>       esp_top    = esp_bottom - PRIMARY_STACK_SIZE;
> diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
> index e881cd5de0a0..d81a30344677 100644
> --- a/xen/arch/x86/x86_64/asm-offsets.c
> +++ b/xen/arch/x86/x86_64/asm-offsets.c
> @@ -161,6 +161,7 @@ void __dummy__(void)
>   
>       OFFSET(CPUINFO_X86_features, struct cpuinfo_x86, x86_capability);
>       OFFSET(CPUINFO_X86_apicid, struct cpuinfo_x86, apicid);
> +    OFFSET(CPUINFO_X86_stack_base, struct cpuinfo_x86, stack_base);
>       DEFINE(CPUINFO_X86_sizeof, sizeof(struct cpuinfo_x86));
>       BLANK();
>   
> diff --git a/xen/include/xen/smp.h b/xen/include/xen/smp.h
> index 0a9219173f0f..994fdc474200 100644
> --- a/xen/include/xen/smp.h
> +++ b/xen/include/xen/smp.h
> @@ -67,8 +67,6 @@ void smp_send_call_function_mask(const cpumask_t *mask);
>   
>   int alloc_cpu_id(void);
>   
> -extern void *stack_base[NR_CPUS];
> -
>   void initialize_cpu_data(unsigned int cpu);
>   int setup_cpu_root_pgt(unsigned int cpu);
>   

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 4/9] x86/smp: move stack_base to cpu_data
  2024-02-02 18:24   ` Julien Grall
@ 2024-02-05  8:41     ` Jan Beulich
  2024-02-05  9:32       ` Julien Grall
  2024-03-12 15:59       ` Krystian Hebel
  0 siblings, 2 replies; 45+ messages in thread
From: Jan Beulich @ 2024-02-05  8:41 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Stefano Stabellini, Krystian Hebel,
	xen-devel

On 02.02.2024 19:24, Julien Grall wrote:
> On 14/11/2023 17:50, Krystian Hebel wrote:
>> This location is easier to access from assembly. Having it close to
>> other data required during initialization has also positive (although
>> rather small) impact on prefetching data from RAM.
> 
> I understand your goal but...
> 
>> --- a/xen/arch/x86/include/asm/cpufeature.h
>> +++ b/xen/arch/x86/include/asm/cpufeature.h
> 
> ... cpufeature.h feels a rather odd place for storing the stack. I am 
> not entirely sure where else to place. Andrew, Jan, Roger?

Well, without having looked at the patch/series itself yet, I can only
say that if struct cpuinfo_x86 really is the place to put this
information, then it's unavoidable to have the field added in this
header. That said, it certainly feels like an abuse - there's nothing
in common with other (collected) data in this structure. "Easier to
access from assembly" may be a fair reason, but then I'd expect the
downsides of alternatives to be discussed explicitly. For example, a
simple new array might be as "easily" accessible from assembly.

>> @@ -37,6 +37,7 @@ struct cpuinfo_x86 {
>>       unsigned int phys_proc_id;         /* package ID of each logical CPU */
>>       unsigned int cpu_core_id;          /* core ID of each logical CPU */
>>       unsigned int compute_unit_id;      /* AMD compute unit ID of each logical CPU */
>> +    void *stack_base;
> 
> AFAICT, this means there will be a padding before stack_base and ...
> 
>>       unsigned short x86_clflush_size;
> 
> ... another one here. Is there any particular reason the new field 
> wasn't added at the end?

With ...

>>   } __cacheline_aligned;

... this I'm not exactly sure this is a problem right now (there may
be ample padding space anyway, yet I didn't go count). But I agree
with your comment in principle.

>> --- a/xen/arch/x86/smpboot.c
>> +++ b/xen/arch/x86/smpboot.c
>> @@ -75,13 +75,15 @@ static enum cpu_state {
>>   } cpu_state;
>>   #define set_cpu_state(state) do { smp_mb(); cpu_state = (state); } while (0)
>>   
>> -void *stack_base[NR_CPUS];
>> -
>>   void initialize_cpu_data(unsigned int cpu)
>>   {
>>       uint32_t apicid = cpu_physical_id(cpu);
>> +    void *stack = cpu_data[cpu].stack_base;
>> +
>>       cpu_data[cpu] = boot_cpu_data;
>> +
>>       cpu_physical_id(cpu) = apicid;
>> +    cpu_data[cpu].stack_base = stack;
>>   }
>>   
>>   static bool smp_store_cpu_info(unsigned int id)
>> @@ -579,8 +581,6 @@ static int do_boot_cpu(int apicid, int cpu)
>>           printk("Booting processor %d/%d eip %lx\n",
>>                  cpu, apicid, start_eip);
>>   
>> -    stack_start = stack_base[cpu] + STACK_SIZE - sizeof(struct cpu_info);
>> -
> 
> You remove this line because I can't quite figure out where stack_start 
> is now set. This is used...

This line sets a global variable, which ...

>> @@ -856,7 +856,7 @@ int setup_cpu_root_pgt(unsigned int cpu)
>>   
>>       /* Install direct map page table entries for stack, IDT, and TSS. */
>>       for ( off = rc = 0; !rc && off < STACK_SIZE; off += PAGE_SIZE )
>> -        rc = clone_mapping(__va(__pa(stack_base[cpu])) + off, rpt);
>> +        rc = clone_mapping(__va(__pa(cpu_data[cpu].stack_base)) + off, rpt);
>>   
>>       if ( !rc )
>>           rc = clone_mapping(idt_tables[cpu], rpt);
>> @@ -1007,10 +1007,10 @@ static void cpu_smpboot_free(unsigned int cpu, bool remove)
>>           FREE_XENHEAP_PAGE(per_cpu(gdt, cpu));
>>           FREE_XENHEAP_PAGE(idt_tables[cpu]);
>>   
>> -        if ( stack_base[cpu] )
>> +        if ( cpu_data[cpu].stack_base )
>>           {
>> -            memguard_unguard_stack(stack_base[cpu]);
>> -            FREE_XENHEAP_PAGES(stack_base[cpu], STACK_ORDER);
>> +            memguard_unguard_stack(cpu_data[cpu].stack_base);
>> +            FREE_XENHEAP_PAGES(cpu_data[cpu].stack_base, STACK_ORDER);
>>           }
>>       }
>>   }
>> @@ -1044,11 +1044,11 @@ static int cpu_smpboot_alloc(unsigned int cpu)
>>       if ( node != NUMA_NO_NODE )
>>           memflags = MEMF_node(node);
>>   
>> -    if ( stack_base[cpu] == NULL &&
>> -         (stack_base[cpu] = cpu_alloc_stack(cpu)) == NULL )
>> +    if ( cpu_data[cpu].stack_base == NULL &&
>> +         (cpu_data[cpu].stack_base = cpu_alloc_stack(cpu)) == NULL )
>>               goto out;
>>   
>> -    info = get_cpu_info_from_stack((unsigned long)stack_base[cpu]);
>> +    info = get_cpu_info_from_stack((unsigned long)cpu_data[cpu].stack_base);
> 
> ... here.

... pretty clearly is not used here (anymore). Instead I'd raise the
question of what the remaining purpose of that variable then is.
Looking through updates this patch alone makes to use sites of
stack_start, it's unclear whether the use from assembly code has gone
away already - brief checking suggests it hasn't.

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 4/9] x86/smp: move stack_base to cpu_data
  2024-02-05  8:41     ` Jan Beulich
@ 2024-02-05  9:32       ` Julien Grall
  2024-03-12 15:59       ` Krystian Hebel
  1 sibling, 0 replies; 45+ messages in thread
From: Julien Grall @ 2024-02-05  9:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Stefano Stabellini, Krystian Hebel,
	xen-devel

Hi Jan,

On 05/02/2024 08:41, Jan Beulich wrote:
> On 02.02.2024 19:24, Julien Grall wrote:
>>>    static bool smp_store_cpu_info(unsigned int id)
>>> @@ -579,8 +581,6 @@ static int do_boot_cpu(int apicid, int cpu)
>>>            printk("Booting processor %d/%d eip %lx\n",
>>>                   cpu, apicid, start_eip);
>>>    
>>> -    stack_start = stack_base[cpu] + STACK_SIZE - sizeof(struct cpu_info);
>>> -
>>
>> You remove this line because I can't quite figure out where stack_start
>> is now set. This is used...
> 
> This line sets a global variable, which ...
> 
>>> @@ -856,7 +856,7 @@ int setup_cpu_root_pgt(unsigned int cpu)
>>>    
>>>        /* Install direct map page table entries for stack, IDT, and TSS. */
>>>        for ( off = rc = 0; !rc && off < STACK_SIZE; off += PAGE_SIZE )
>>> -        rc = clone_mapping(__va(__pa(stack_base[cpu])) + off, rpt);
>>> +        rc = clone_mapping(__va(__pa(cpu_data[cpu].stack_base)) + off, rpt);
>>>    
>>>        if ( !rc )
>>>            rc = clone_mapping(idt_tables[cpu], rpt);
>>> @@ -1007,10 +1007,10 @@ static void cpu_smpboot_free(unsigned int cpu, bool remove)
>>>            FREE_XENHEAP_PAGE(per_cpu(gdt, cpu));
>>>            FREE_XENHEAP_PAGE(idt_tables[cpu]);
>>>    
>>> -        if ( stack_base[cpu] )
>>> +        if ( cpu_data[cpu].stack_base )
>>>            {
>>> -            memguard_unguard_stack(stack_base[cpu]);
>>> -            FREE_XENHEAP_PAGES(stack_base[cpu], STACK_ORDER);
>>> +            memguard_unguard_stack(cpu_data[cpu].stack_base);
>>> +            FREE_XENHEAP_PAGES(cpu_data[cpu].stack_base, STACK_ORDER);
>>>            }
>>>        }
>>>    }
>>> @@ -1044,11 +1044,11 @@ static int cpu_smpboot_alloc(unsigned int cpu)
>>>        if ( node != NUMA_NO_NODE )
>>>            memflags = MEMF_node(node);
>>>    
>>> -    if ( stack_base[cpu] == NULL &&
>>> -         (stack_base[cpu] = cpu_alloc_stack(cpu)) == NULL )
>>> +    if ( cpu_data[cpu].stack_base == NULL &&
>>> +         (cpu_data[cpu].stack_base = cpu_alloc_stack(cpu)) == NULL )
>>>                goto out;
>>>    
>>> -    info = get_cpu_info_from_stack((unsigned long)stack_base[cpu]);
>>> +    info = get_cpu_info_from_stack((unsigned long)cpu_data[cpu].stack_base);
>>
>> ... here.
> 
> ... pretty clearly is not used here (anymore). Instead I'd raise the
> question of what the remaining purpose of that variable then is.
> Looking through updates this patch alone makes to use sites of
> stack_start, it's unclear whether the use from assembly code has gone
> away already - brief checking suggests it hasn't.

Whoops you are right. I am not sure how I thought this line was (still) 
using stack_start.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 1/9] x86/boot: choose AP stack based on APIC ID
  2023-11-14 17:49 ` [XEN PATCH 1/9] x86/boot: choose AP stack based on APIC ID Krystian Hebel
  2024-01-26 18:30   ` Julien Grall
@ 2024-02-07 16:11   ` Jan Beulich
  2024-03-12 15:11     ` Krystian Hebel
  1 sibling, 1 reply; 45+ messages in thread
From: Jan Beulich @ 2024-02-07 16:11 UTC (permalink / raw)
  To: Krystian Hebel; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel, Wei Liu

On 14.11.2023 18:49, Krystian Hebel wrote:
> --- a/xen/arch/x86/boot/trampoline.S
> +++ b/xen/arch/x86/boot/trampoline.S
> @@ -72,6 +72,26 @@ trampoline_protmode_entry:
>          mov     $X86_CR4_PAE,%ecx
>          mov     %ecx,%cr4
>  
> +        /*
> +         * Get APIC ID while we're in non-paged mode. Start by checking if
> +         * x2APIC is enabled.
> +         */
> +        mov     $MSR_APIC_BASE, %ecx
> +        rdmsr
> +        and     $APIC_BASE_EXTD, %eax

You don't use the result, in which case "test" is to be preferred over
"and".

> +        jnz     .Lx2apic
> +
> +        /* Not x2APIC, read from MMIO */
> +        mov     0xfee00020, %esp

Please don't open-code existing constants (APIC_ID here and below,
APIC_DEFAULT_PHYS_BASE just here, and ...

> +        shr     $24, %esp

... a to-be-introduced constant here (for {G,S}ET_xAPIC_ID() to use as
well then). This is the only way of being able to easily identify all
pieces of code accessing the same piece of hardware.

> +        jmp     1f
> +
> +.Lx2apic:
> +        mov     $(MSR_X2APIC_FIRST + (0x20 >> 4)), %ecx
> +        rdmsr
> +        mov     %eax, %esp
> +1:

Overall I'm worried of the use of %esp throughout here.

> --- a/xen/arch/x86/boot/x86_64.S
> +++ b/xen/arch/x86/boot/x86_64.S
> @@ -15,7 +15,33 @@ ENTRY(__high_start)
>          mov     $XEN_MINIMAL_CR4,%rcx
>          mov     %rcx,%cr4
>  
> -        mov     stack_start(%rip),%rsp
> +        test    %ebx,%ebx

Nit (style): Elsewhere you have blanks after the commas, just here
(and once again near the end of the hunk) you don't.

> +        cmovz   stack_start(%rip), %rsp
> +        jz      .L_stack_set
> +
> +        /* APs only: get stack base from APIC ID saved in %esp. */
> +        mov     $-1, %rax

Why a 64-bit insn here and ...

> +        lea     x86_cpu_to_apicid(%rip), %rcx
> +1:
> +        add     $1, %rax

... here, when you only use (far less than) 32-bit values?

> +        cmp     $NR_CPUS, %eax
> +        jb      2f
> +        hlt
> +2:
> +        cmp     %esp, (%rcx, %rax, 4)
> +        jne     1b

Aren't you open-coding REPNE SCASD here?

> +        /* %eax is now Xen CPU index. */
> +        lea     stack_base(%rip), %rcx
> +        mov     (%rcx, %rax, 8), %rsp
> +
> +        test    %rsp,%rsp
> +        jnz     1f
> +        hlt
> +1:
> +        add     $(STACK_SIZE - CPUINFO_sizeof), %rsp

Even this post-adjustment is worrying me. Imo the stack pointer is
better set exactly once, to its final value. Keeping it at its init
value of 0 until then yields more predictable results in case it
ends up being used somewhere.

> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1951,6 +1951,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>       */
>      if ( !pv_shim )
>      {
> +        /* Separate loop to make parallel AP bringup possible. */
>          for_each_present_cpu ( i )
>          {
>              /* Set up cpu_to_node[]. */
> @@ -1958,6 +1959,12 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>              /* Set up node_to_cpumask based on cpu_to_node[]. */
>              numa_add_cpu(i);
>  
> +            if ( stack_base[i] == NULL )
> +                stack_base[i] = cpu_alloc_stack(i);
> +        }

Imo this wants accompanying by removal of the allocation in
cpu_smpboot_alloc(). Which would then make more visible that there's
error checking there, but not here (I realize there effectively is
error checking in assembly code, but that'll end in HLT with no
useful indication of what the problem is). Provided, as Julien has
pointed out, that the change is necessary in the first place.

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 2/9] x86: don't access x86_cpu_to_apicid[] directly, use cpu_physical_id(cpu)
  2023-11-14 17:49 ` [XEN PATCH 2/9] x86: don't access x86_cpu_to_apicid[] directly, use cpu_physical_id(cpu) Krystian Hebel
  2024-01-26 18:38   ` Julien Grall
@ 2024-02-07 16:28   ` Jan Beulich
  2024-03-12 15:18     ` Krystian Hebel
  1 sibling, 1 reply; 45+ messages in thread
From: Jan Beulich @ 2024-02-07 16:28 UTC (permalink / raw)
  To: Krystian Hebel; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 14.11.2023 18:49, Krystian Hebel wrote:
> --- a/xen/arch/x86/apic.c
> +++ b/xen/arch/x86/apic.c
> @@ -950,7 +950,7 @@ __next:
>       */
>      if (boot_cpu_physical_apicid == -1U)
>          boot_cpu_physical_apicid = get_apic_id();
> -    x86_cpu_to_apicid[0] = get_apic_id();
> +    cpu_physical_id(0) = get_apic_id();

While personally I don't mind as much, I expect Andrew would not like
this: Something that looks like a function call on the lhs is against
what normal language structure would be.

> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1615,7 +1615,7 @@ long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
>              break;
>  
>          cpu_id.phys_id =
> -            (uint64_t)x86_cpu_to_apicid[v->vcpu_id] |
> +            (uint64_t)cpu_physical_id(v->vcpu_id) |
>              ((uint64_t)acpi_get_processor_id(v->vcpu_id) << 32);

While the cast on the 2nd line is necessary, the one on the 2st isn't
and would be nice to be dropped while touching the line anyway.

> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -70,7 +70,7 @@ void __init init_cpu_to_node(void)
>  
>      for ( i = 0; i < nr_cpu_ids; i++ )
>      {
> -        u32 apicid = x86_cpu_to_apicid[i];
> +        u32 apicid = cpu_physical_id(i);
>          if ( apicid == BAD_APICID )
>              continue;
>          node = apicid < MAX_LOCAL_APIC ? apicid_to_node[apicid] : NUMA_NO_NODE;

We're in the process of phasing out u32 and friends, favoring uint32_t.
Would be nice if in code being touched anyway (i.e. not just here) the
conversion would be done right away. Then again fixed-width types are
preferably avoided where not really needed (see ./CODING_STYLE), so
quite likely it actually wants to be unsigned int here.

Furthermore our style demands that declaration(s) and statement(s) are
separated by a blank line. Inserting the missing one in cases like the
one here would be very desirable as well.

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 3/9] x86/smp: drop x86_cpu_to_apicid, use cpu_data[cpu].apicid instead
  2024-02-02 18:11   ` Julien Grall
@ 2024-02-07 16:41     ` Jan Beulich
  2024-03-12 15:29       ` Krystian Hebel
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Beulich @ 2024-02-07 16:41 UTC (permalink / raw)
  To: Julien Grall, Krystian Hebel
  Cc: Andrew Cooper, Roger Pau Monné, xen-devel, Wei Liu

On 02.02.2024 19:11, Julien Grall wrote:
> Hi,
> 
> On 14/11/2023 17:50, Krystian Hebel wrote:
>> Both fields held the same data.
>>
>> Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
>> ---
>>   xen/arch/x86/boot/x86_64.S           |  8 +++++---
>>   xen/arch/x86/include/asm/asm_defns.h |  2 +-
>>   xen/arch/x86/include/asm/processor.h |  2 ++
>>   xen/arch/x86/include/asm/smp.h       |  4 ----
>>   xen/arch/x86/numa.c                  | 15 +++++++--------
>>   xen/arch/x86/smpboot.c               |  8 ++++----
>>   xen/arch/x86/x86_64/asm-offsets.c    |  4 +++-
>>   7 files changed, 22 insertions(+), 21 deletions(-)
>>
>> diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
>> index b85b47b5c1a0..195550b5c0ea 100644
>> --- a/xen/arch/x86/boot/x86_64.S
>> +++ b/xen/arch/x86/boot/x86_64.S
>> @@ -20,15 +20,17 @@ ENTRY(__high_start)
>>           jz      .L_stack_set
>>   
>>           /* APs only: get stack base from APIC ID saved in %esp. */
>> -        mov     $-1, %rax
>> -        lea     x86_cpu_to_apicid(%rip), %rcx
>> +        mov     $0, %rax
>> +        lea     cpu_data(%rip), %rcx
>> +        /* cpu_data[0] is BSP, skip it. */
>>   1:
>>           add     $1, %rax
>> +        add     $CPUINFO_X86_sizeof, %rcx
>>           cmp     $NR_CPUS, %eax
>>           jb      2f
>>           hlt
>>   2:
>> -        cmp     %esp, (%rcx, %rax, 4)
>> +        cmp     %esp, CPUINFO_X86_apicid(%rcx)
>>           jne     1b
>>   
>>           /* %eax is now Xen CPU index. */
> 
> As mentioned in an earlier patch, I think you want to re-order the 
> patches. This will avoid to modify twice the same code within the same 
> series (it is best to avoid if you can).

I second this request. Even more so that there's an unexplained move
from starting at $-1 to starting at $0 (in which case you really want
to use xor, not mov).

>> --- a/xen/arch/x86/numa.c
>> +++ b/xen/arch/x86/numa.c
>> @@ -54,14 +54,13 @@ bool __init arch_numa_unavailable(void)
>>   /*
>>    * Setup early cpu_to_node.
>>    *
>> - * Populate cpu_to_node[] only if x86_cpu_to_apicid[],
>> - * and apicid_to_node[] tables have valid entries for a CPU.
>> - * This means we skip cpu_to_node[] initialisation for NUMA
>> - * emulation and faking node case (when running a kernel compiled
>> - * for NUMA on a non NUMA box), which is OK as cpu_to_node[]
>> - * is already initialized in a round robin manner at numa_init_array,
>> - * prior to this call, and this initialization is good enough
>> - * for the fake NUMA cases.
>> + * Populate cpu_to_node[] only if cpu_data[], and apicid_to_node[]

You mean cpu_physical_id() here, and then this change wants doing when
switching to that, imo.

>> + * tables have valid entries for a CPU. This means we skip
>> + * cpu_to_node[] initialisation for NUMA emulation and faking node
>> + * case (when running a kernel compiled for NUMA on a non NUMA box),
>> + * which is OK as cpu_to_node[] is already initialized in a round
>> + * robin manner at numa_init_array, prior to this call, and this
>> + * initialization is good enough for the fake NUMA cases.
>>    */

Also if you're already re-wrapping this comment, please make better use
of line width.

>> --- a/xen/arch/x86/x86_64/asm-offsets.c
>> +++ b/xen/arch/x86/x86_64/asm-offsets.c
>> @@ -159,7 +159,9 @@ void __dummy__(void)
>>       OFFSET(IRQSTAT_softirq_pending, irq_cpustat_t, __softirq_pending);
>>       BLANK();
>>   
>> -    OFFSET(CPUINFO_features, struct cpuinfo_x86, x86_capability);
>> +    OFFSET(CPUINFO_X86_features, struct cpuinfo_x86, x86_capability);
> 
> The rename seems to be unrelated to this patch. Can you clarify?

I agree some renaming wants doing, but separately. That's because we
use CPUINFO_ as a prefix for two entirely different structure's offsets
right now. I'm not convinced of CPUINFO_X86_ as the new prefix though:
Uses are against cpu_data[], so CPUDATA_ may be better. Might be good
if Andrew and/or Roger could voice their view.

Jan

>> +    OFFSET(CPUINFO_X86_apicid, struct cpuinfo_x86, apicid);
>> +    DEFINE(CPUINFO_X86_sizeof, sizeof(struct cpuinfo_x86));
>>       BLANK();
>>   
>>       OFFSET(MB_flags, multiboot_info_t, flags);
> 
> Cheers,
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 4/9] x86/smp: move stack_base to cpu_data
  2023-11-14 17:50 ` [XEN PATCH 4/9] x86/smp: move stack_base to cpu_data Krystian Hebel
  2024-02-02 18:24   ` Julien Grall
@ 2024-02-07 16:53   ` Jan Beulich
  1 sibling, 0 replies; 45+ messages in thread
From: Jan Beulich @ 2024-02-07 16:53 UTC (permalink / raw)
  To: Krystian Hebel
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 14.11.2023 18:50, Krystian Hebel wrote:
> --- a/xen/arch/x86/boot/x86_64.S
> +++ b/xen/arch/x86/boot/x86_64.S
> @@ -33,9 +33,8 @@ ENTRY(__high_start)
>          cmp     %esp, CPUINFO_X86_apicid(%rcx)
>          jne     1b
>  
> -        /* %eax is now Xen CPU index. */
> -        lea     stack_base(%rip), %rcx
> -        mov     (%rcx, %rax, 8), %rsp
> +        /* %rcx is now cpu_data[cpu], read stack base from it. */
> +        mov     CPUINFO_X86_stack_base(%rcx), %rsp

Looks like you're not using the value in %eax anymore? If so, respective
code would want dropping. Which in turn again raises the question that
Julien already put up: By re-ordering the series, can't you avoid
altering the same code multiple times, in part even removing in a later
patch what an earlier one added?

That said, I remain unconvinced that ...

> --- a/xen/arch/x86/include/asm/cpufeature.h
> +++ b/xen/arch/x86/include/asm/cpufeature.h
> @@ -37,6 +37,7 @@ struct cpuinfo_x86 {
>      unsigned int phys_proc_id;         /* package ID of each logical CPU */
>      unsigned int cpu_core_id;          /* core ID of each logical CPU */
>      unsigned int compute_unit_id;      /* AMD compute unit ID of each logical CPU */
> +    void *stack_base;
>      unsigned short x86_clflush_size;
>  } __cacheline_aligned;

... this is a good place for the new data: As indicated before, it
doesn't fit (in nature) with everything else in this struct.

Additionally no matter where the data is put, I'd wonder if it
wouldn't better be const void *. You don't mean to ever write
through it, I suppose.

> @@ -1156,7 +1156,8 @@ void __init smp_prepare_cpus(void)
>      boot_cpu_physical_apicid = get_apic_id();
>      cpu_physical_id(0) = boot_cpu_physical_apicid;
>  
> -    stack_base[0] = (void *)((unsigned long)stack_start & ~(STACK_SIZE - 1));
> +    cpu_data[0].stack_base = (void *)
> +             ((unsigned long)stack_start & ~(STACK_SIZE - 1));

Nit: Too deep indentation. Each indentation level is 4 spaces. I also
think the cast would then also want to move on the 2nd line, such that
(see again ./CODING_STYLE) the assignment operator is last on the 1st
line.

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 5/9] x86/smp: call x2apic_ap_setup() earlier
  2023-11-14 17:50 ` [XEN PATCH 5/9] x86/smp: call x2apic_ap_setup() earlier Krystian Hebel
@ 2024-02-07 17:02   ` Jan Beulich
  2024-03-12 16:02     ` Krystian Hebel
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Beulich @ 2024-02-07 17:02 UTC (permalink / raw)
  To: Krystian Hebel, Andrew Cooper; +Cc: Roger Pau Monné, xen-devel, Wei Liu

On 14.11.2023 18:50, Krystian Hebel wrote:
> It used to be called from smp_callin(), however BUG_ON() was invoked on
> multiple occasions before that. It may end up calling machine_restart()
> which tries to get APIC ID for CPU running this code. If BSP detected
> that x2APIC is enabled, get_apic_id() will try to use it for all CPUs.
> Enabling x2APIC on secondary CPUs earlier protects against an endless
> loop of #GP exceptions caused by attempts to read IA32_X2APIC_APICID
> MSR while x2APIC is disabled in IA32_APIC_BASE.
> 
> Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
> ---
>  xen/arch/x86/smpboot.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> index 8ae65ab1769f..a3895dafa267 100644
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -184,7 +184,6 @@ static void smp_callin(void)
>       * update until we finish. We are free to set up this CPU: first the APIC.
>       */
>      Dprintk("CALLIN, before setup_local_APIC().\n");
> -    x2apic_ap_setup();
>      setup_local_APIC(false);
>  
>      /* Save our processor parameters. */
> @@ -351,6 +350,14 @@ void start_secondary(void *unused)
>      get_cpu_info()->xen_cr3 = 0;
>      get_cpu_info()->pv_cr3 = 0;
>  
> +    /*
> +     * BUG_ON() used in load_system_tables() and later code may end up calling
> +     * machine_restart() which tries to get APIC ID for CPU running this code.
> +     * If BSP detected that x2APIC is enabled, get_apic_id() will try to use it
> +     * for _all_ CPUs. Enable x2APIC on secondary CPUs now so we won't end up
> +     * with endless #GP loop.
> +     */
> +    x2apic_ap_setup();
>      load_system_tables();

While I find the argument convincing, I seem to recall that there was a
firm plan to have load_system_tables() as early as possible. Andrew?

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 3/9] x86/smp: drop x86_cpu_to_apicid, use cpu_data[cpu].apicid instead
  2023-11-14 17:50 ` [XEN PATCH 3/9] x86/smp: drop x86_cpu_to_apicid, use cpu_data[cpu].apicid instead Krystian Hebel
  2024-02-02 18:11   ` Julien Grall
@ 2024-02-08  7:29   ` Jan Beulich
  2024-03-12 15:35     ` Krystian Hebel
  1 sibling, 1 reply; 45+ messages in thread
From: Jan Beulich @ 2024-02-08  7:29 UTC (permalink / raw)
  To: Krystian Hebel; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel, Wei Liu

On 14.11.2023 18:50, Krystian Hebel wrote:
> Both fields held the same data.

Supposedly the same data only. They come from different origins, and you're
hiding this quite well by leaving all sites in place where the field is
written. Both items are also used for entirely separate purposes. So you
need to
- explain why both sources of information necessarily provide the same
  data,
- especially if there's remaining concern from the above explanation that
  the two values might end up different in corner cases (running
  virtualized ourselves comes to mind as a possible example), explain why
  nevertheless it is fine (risk free) to use the consolidated item for
  all of the originally separate purposes,
- either explain or do away with the multiple places setting this single
  remaining field.

> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -61,10 +61,8 @@ unsigned int __read_mostly nr_sockets;
>  cpumask_t **__read_mostly socket_cpumask;
>  static cpumask_t *secondary_socket_cpumask;
>  
> -struct cpuinfo_x86 cpu_data[NR_CPUS];
> -
> -u32 x86_cpu_to_apicid[NR_CPUS] __read_mostly =
> -	{ [0 ... NR_CPUS-1] = BAD_APICID };
> +struct cpuinfo_x86 cpu_data[NR_CPUS] =
> +        { [0 ... NR_CPUS-1] .apicid = BAD_APICID };

Nit: Stray blank after closing square bracket.

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 6/9] x86/shutdown: protect against recurrent machine_restart()
  2023-11-14 17:50 ` [XEN PATCH 6/9] x86/shutdown: protect against recurrent machine_restart() Krystian Hebel
@ 2024-02-08 11:30   ` Jan Beulich
  2024-03-12 16:05     ` Krystian Hebel
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Beulich @ 2024-02-08 11:30 UTC (permalink / raw)
  To: Krystian Hebel; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel, Wei Liu

On 14.11.2023 18:50, Krystian Hebel wrote:
> If multiple CPUs called machine_restart() before actual restart took
> place, but after boot CPU declared itself not online,

Can you help me please in identifying where this operation is? I can see
two places where a CPU is removed from cpu_online_map, yet neither
__stop_this_cpu() nor __cpu_disable() ought to be coming into play here.
In fact I didn't think CPU0 was ever marked not-online. Except perhaps
if we came through machine_crash_shutdown() -> nmi_shootdown_cpus(), but
I'm sure you would have mentioned such a further dependency.

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 7/9] x86/smp: drop booting_cpu variable
  2023-11-14 17:50 ` [XEN PATCH 7/9] x86/smp: drop booting_cpu variable Krystian Hebel
@ 2024-02-08 11:39   ` Jan Beulich
  2024-03-12 16:16     ` Krystian Hebel
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Beulich @ 2024-02-08 11:39 UTC (permalink / raw)
  To: Krystian Hebel; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 14.11.2023 18:50, Krystian Hebel wrote:
> CPU id is obtained as a side effect of searching for appropriate
> stack for AP. It can be used as a parameter to start_secondary().
> Coincidentally this also makes further work on making AP bring-up
> code parallel easier.

It's not just "easier", but strictly a prereq I think? Such a global
would get in the way of having multiple CPUs make it into
start_secondary() in parallel.

> --- a/xen/arch/x86/boot/x86_64.S
> +++ b/xen/arch/x86/boot/x86_64.S
> @@ -20,20 +20,24 @@ ENTRY(__high_start)
>          jz      .L_stack_set
>  
>          /* APs only: get stack base from APIC ID saved in %esp. */
> -        mov     $0, %rax
> +        mov     $0, %rbx
>          lea     cpu_data(%rip), %rcx
>          /* cpu_data[0] is BSP, skip it. */
>  1:
> -        add     $1, %rax
> +        add     $1, %rbx
>          add     $CPUINFO_X86_sizeof, %rcx
> -        cmp     $NR_CPUS, %eax
> +        cmp     $NR_CPUS, %rbx
>          jb      2f
>          hlt
>  2:
>          cmp     %esp, CPUINFO_X86_apicid(%rcx)
>          jne     1b

Once again this is code you introduced a few patches ago. Why not use
%ebx right away for that purpose? (And yes, this explains why in the
earlier patch you retained that code. Just that again suitably ordering
the series would make this look natural. Otherwise it needs at least
mentioning why dead pieces are kept around.)

> -        /* %rcx is now cpu_data[cpu], read stack base from it. */
> +        /*
> +         * At this point:
> +         * - %rcx is cpu_data[cpu], read stack base from it,
> +         * - %rbx (callee-save) is Xen cpu number, pass it to start_secondary().
> +         */
>          mov     CPUINFO_X86_stack_base(%rcx), %rsp
>  
>          test    %rsp,%rsp
> @@ -101,6 +105,7 @@ ENTRY(__high_start)
>  .L_ap_cet_done:
>  #endif /* CONFIG_XEN_SHSTK || CONFIG_XEN_IBT */
>  
> +        mov     %rbx, %rdi
>          tailcall start_secondary

As alluded to above and as mentioned before - please stick to 32-bit
operations when you deal with 32 (or less) bits of data.

> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -222,8 +222,6 @@ static void smp_callin(void)
>          cpu_relax();
>  }
>  
> -static int booting_cpu;
> -
>  /* CPUs for which sibling maps can be computed. */
>  static cpumask_t cpu_sibling_setup_map;
>  
> @@ -311,15 +309,14 @@ static void set_cpu_sibling_map(unsigned int cpu)
>      }
>  }
>  
> -void start_secondary(void *unused)
> +void start_secondary(unsigned int cpu)
>  {
>      struct cpu_info *info = get_cpu_info();
>  
>      /*
> -     * Dont put anything before smp_callin(), SMP booting is so fragile that we
> +     * Don't put anything before smp_callin(), SMP booting is so fragile that we
>       * want to limit the things done here to the most necessary things.
>       */
> -    unsigned int cpu = booting_cpu;
>  
>      /* Critical region without IDT or TSS.  Any fault is deadly! */
>  
> @@ -346,9 +343,9 @@ void start_secondary(void *unused)
>       */
>      spin_debug_disable();
>  
> -    get_cpu_info()->use_pv_cr3 = false;
> -    get_cpu_info()->xen_cr3 = 0;
> -    get_cpu_info()->pv_cr3 = 0;
> +    info->use_pv_cr3 = false;
> +    info->xen_cr3 = 0;
> +    info->pv_cr3 = 0;

This hunk looks unrelated. While tidying next to what's changed anyway
may be okay if suitably mentioned in the description, in this case I
think it needs splitting off.

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 8/9] x86/smp: make cpu_state per-CPU
  2023-11-14 17:50 ` [XEN PATCH 8/9] x86/smp: make cpu_state per-CPU Krystian Hebel
@ 2024-02-08 12:13   ` Jan Beulich
  2024-03-12 16:38     ` Krystian Hebel
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Beulich @ 2024-02-08 12:13 UTC (permalink / raw)
  To: Krystian Hebel; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 14.11.2023 18:50, Krystian Hebel wrote:
> This will be used for parallel AP bring-up.
> 
> CPU_STATE_INIT changed direction.

Nit: I think you mean "changes" as you describe what the patch does, not
what has happened before. But ...

> It was previously set by BSP and never
> consumed by AP. Now it signals that AP got through assembly part of
> initialization and waits for BSP to call notifiers that set up data
> structures required for further initialization.

... all of this is, afaict, independent of what the title says the
purpose of this patch is. Since the correctness of the state change
adjustments doesn't look straightforward to prove, please split the
mechanical change from the change to the actual logic.

> --- a/xen/arch/x86/include/asm/cpufeature.h
> +++ b/xen/arch/x86/include/asm/cpufeature.h
> @@ -38,6 +38,7 @@ struct cpuinfo_x86 {
>      unsigned int cpu_core_id;          /* core ID of each logical CPU */
>      unsigned int compute_unit_id;      /* AMD compute unit ID of each logical CPU */
>      void *stack_base;
> +    unsigned int cpu_state;
>      unsigned short x86_clflush_size;
>  } __cacheline_aligned;

Is there any reason this cannot be ordinary per-CPU data?

> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -65,15 +65,18 @@ struct cpuinfo_x86 cpu_data[NR_CPUS] =
>          { [0 ... NR_CPUS-1] .apicid = BAD_APICID };
>  
>  static int cpu_error;
> -static enum cpu_state {
> +enum cpu_state {
>      CPU_STATE_DYING,    /* slave -> master: I am dying */
>      CPU_STATE_DEAD,     /* slave -> master: I am completely dead */
> -    CPU_STATE_INIT,     /* master -> slave: Early bringup phase 1 */
> -    CPU_STATE_CALLOUT,  /* master -> slave: Early bringup phase 2 */
> +    CPU_STATE_INIT,     /* slave -> master: Early bringup phase 1 completed */
> +    CPU_STATE_CALLOUT,  /* master -> slave: Start early bringup phase 2 */

It's not really clear to me whether the adding of "Start" on the 2nd line
really adds value.

>      CPU_STATE_CALLIN,   /* slave -> master: Completed phase 2 */
>      CPU_STATE_ONLINE    /* master -> slave: Go fully online now. */
> -} cpu_state;
> -#define set_cpu_state(state) do { smp_mb(); cpu_state = (state); } while (0)
> +};
> +#define set_cpu_state(cpu, state) do { \
> +    smp_mb(); \
> +    cpu_data[cpu].cpu_state = (state); \
> +} while (0)

While you merely re-arrange it, I'd still like to ask: Does this really
need to be smp_mb(), not just smp_wmb()?

> @@ -320,6 +317,10 @@ void start_secondary(unsigned int cpu)
>  
>      /* Critical region without IDT or TSS.  Any fault is deadly! */
>  
> +    /* Wait until data set up by CPU_UP_PREPARE notifiers is ready. */
> +    while ( cpu_data[cpu].cpu_state != CPU_STATE_CALLOUT )
> +        cpu_relax();

I'm afraid I don't understand the comment (and hence whether this loop
is actually needed here): __cpu_up() is called only after those
notifiers completed.

> @@ -1161,6 +1171,12 @@ void __init smp_prepare_cpus(void)
>      cpu_data[0].stack_base = (void *)
>               ((unsigned long)stack_start & ~(STACK_SIZE - 1));
>  
> +    /* Set state as CALLOUT so APs won't change it in initialize_cpu_data() */
> +    boot_cpu_data.cpu_state = CPU_STATE_CALLOUT;

This is actually one of the reasons I don't like you putting the item
as a new field in struct cpuinfo_x86. Otherwise imo initialize_cpu_data()
ought to gain a respective assertion.

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 9/9] x86/smp: start APs in parallel during boot
  2023-11-14 17:50 ` [XEN PATCH 9/9] x86/smp: start APs in parallel during boot Krystian Hebel
@ 2024-02-08 12:37   ` Jan Beulich
  2024-03-12 17:13     ` Krystian Hebel
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Beulich @ 2024-02-08 12:37 UTC (permalink / raw)
  To: Krystian Hebel; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel, Wei Liu

On 14.11.2023 18:50, Krystian Hebel wrote:
> Multiple delays are required when sending IPIs and waiting for
> responses. During boot, 4 such IPIs were sent per each AP. With this
> change, only one set of broadcast IPIs is sent. This reduces boot time,
> especially for platforms with large number of cores.

Yet APs do their startup work in parallel only for a brief period of
time, if I'm not mistaken. Othwerwise I can't see why you'd still have
cpu_up() in __start_xen().

> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1963,6 +1963,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>                  cpu_data[i].stack_base = cpu_alloc_stack(i);
>          }
>  
> +        smp_send_init_sipi_sipi_allbutself();
> +
>          for_each_present_cpu ( i )
>          {
>              if ( (park_offline_cpus || num_online_cpus() < max_cpus) &&

So what about constraints on the number of CPUs to use? In such a case
you shouldn't send the IPI to all of them, at least if they're not
meant to be parked.

> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -425,7 +425,7 @@ void start_secondary(unsigned int cpu)
>  
>  static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>  {
> -    unsigned long send_status = 0, accept_status = 0;
> +    unsigned long send_status = 0, accept_status = 0, sh = 0;

sh doesn't need to be 64 bits wide, does it?

>      int maxlvt, timeout, i;
>  
>      /*
> @@ -445,6 +445,12 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>      if ( tboot_in_measured_env() && !tboot_wake_ap(phys_apicid, start_eip) )
>          return 0;
>  
> +    /*
> +     * Use destination shorthand for broadcasting IPIs during boot.
> +     */

Nit (style): This is a single line comment.

> +    if ( phys_apicid == BAD_APICID )
> +        sh = APIC_DEST_ALLBUT;

I think the latest for this the function parameter wants changing to
unsigned int (in another prereq patch).

> @@ -573,21 +578,31 @@ static int do_boot_cpu(int apicid, int cpu)
>       */
>      mtrr_save_state();
>  
> -    start_eip = bootsym_phys(trampoline_realmode_entry);
> +    /* Check if AP is already up. */
> +    if ( cpu_data[cpu].cpu_state != CPU_STATE_INIT )
> +    {
> +        /* This grunge runs the startup process for the targeted processor. */
> +        unsigned long start_eip;
> +        start_eip = bootsym_phys(trampoline_realmode_entry);
>  
> -    /* start_eip needs be page aligned, and below the 1M boundary. */
> -    if ( start_eip & ~0xff000 )
> -        panic("AP trampoline %#lx not suitably positioned\n", start_eip);
> +        /* start_eip needs be page aligned, and below the 1M boundary. */
> +        if ( start_eip & ~0xff000 )
> +            panic("AP trampoline %#lx not suitably positioned\n", start_eip);

Isn't this redundant now with the panic() in
smp_send_init_sipi_sipi_allbutself(), at least as long as that runs
unconditionally.

> -    /* So we see what's up   */
> -    if ( opt_cpu_info )
> -        printk("Booting processor %d/%d eip %lx\n",
> -               cpu, apicid, start_eip);
> +        /* So we see what's up   */
> +        if ( opt_cpu_info )
> +            printk("AP trampoline at %lx\n", start_eip);

Why this change in log message? It makes messages for individual CPUs
indistinguishable. And like above it's redundant with what
smp_send_init_sipi_sipi_allbutself() logs.

> -    /* This grunge runs the startup process for the targeted processor. */
> +        /* mark "stuck" area as not stuck */
> +        bootsym(trampoline_cpu_started) = 0;
> +        smp_mb();
>  
> -    /* Starting actual IPI sequence... */
> -    boot_error = wakeup_secondary_cpu(apicid, start_eip);
> +        /* Starting actual IPI sequence... */
> +        boot_error = wakeup_secondary_cpu(apicid, start_eip);
> +    }
> +
> +    if ( opt_cpu_info )
> +        printk("Booting processor %d/%d\n", cpu, apicid);

Oh, here's the other half. Yet for above it still doesn't make sense
to issue the same message for all CPUs.

> @@ -646,10 +661,6 @@ static int do_boot_cpu(int apicid, int cpu)
>          rc = -EIO;
>      }
>  
> -    /* mark "stuck" area as not stuck */
> -    bootsym(trampoline_cpu_started) = 0;
> -    smp_mb();

While you move this up, it's not clear to me how you would now
identify individual stuck CPUs. I would have expected that this is
another global that needs converting up front, to be per-CPU.

> @@ -1155,6 +1166,23 @@ static struct notifier_block cpu_smpboot_nfb = {
>      .notifier_call = cpu_smpboot_callback
>  };
>  
> +void smp_send_init_sipi_sipi_allbutself(void)

__init?

> +{
> +    unsigned long start_eip;
> +    start_eip = bootsym_phys(trampoline_realmode_entry);

This can be the initializer of the variable, which would then save
me from complaining about the missing blank line between declaration
and statement(s). (Actually, as I notice only now - same for code you
move around in do_boot_cpu().)

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 1/9] x86/boot: choose AP stack based on APIC ID
  2024-01-26 18:30   ` Julien Grall
@ 2024-03-12 14:14     ` Krystian Hebel
  0 siblings, 0 replies; 45+ messages in thread
From: Krystian Hebel @ 2024-03-12 14:14 UTC (permalink / raw)
  To: Julien Grall, xen-devel
  Cc: Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu

Hi,

On 26.01.2024 19:30, Julien Grall wrote:
> Hi,
>
> I am not too familiary with the x86 boot code. But I will give a try 
> to review :).
>
> On 14/11/2023 17:49, Krystian Hebel wrote:
>> This is made as first step of making parallel AP bring-up possible. It
>> should be enough for pre-C code.
>>
>> Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
>> ---
>>   xen/arch/x86/boot/trampoline.S | 20 ++++++++++++++++++++
>>   xen/arch/x86/boot/x86_64.S     | 28 +++++++++++++++++++++++++++-
>>   xen/arch/x86/setup.c           |  7 +++++++
>>   3 files changed, 54 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/arch/x86/boot/trampoline.S 
>> b/xen/arch/x86/boot/trampoline.S
>> index b8ab0ffdcbb0..ec254125016d 100644
>> --- a/xen/arch/x86/boot/trampoline.S
>> +++ b/xen/arch/x86/boot/trampoline.S
>> @@ -72,6 +72,26 @@ trampoline_protmode_entry:
>>           mov     $X86_CR4_PAE,%ecx
>>           mov     %ecx,%cr4
>>   +        /*
>> +         * Get APIC ID while we're in non-paged mode. Start by 
>> checking if
>> +         * x2APIC is enabled.
>> +         */
>> +        mov     $MSR_APIC_BASE, %ecx
>> +        rdmsr
>> +        and     $APIC_BASE_EXTD, %eax
>> +        jnz     .Lx2apic
>> +
>> +        /* Not x2APIC, read from MMIO */
>> +        mov     0xfee00020, %esp
>> +        shr     $24, %esp
>> +        jmp     1f
>> +
>> +.Lx2apic:
>> +        mov     $(MSR_X2APIC_FIRST + (0x20 >> 4)), %ecx
>> +        rdmsr
>> +        mov     %eax, %esp
>> +1:
>> +
>>           /* Load pagetable base register. */
>>           mov     $sym_offs(idle_pg_table),%eax
>>           add     bootsym_rel(trampoline_xen_phys_start,4,%eax)
>> diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
>> index 04bb62ae8680..b85b47b5c1a0 100644
>> --- a/xen/arch/x86/boot/x86_64.S
>> +++ b/xen/arch/x86/boot/x86_64.S
>> @@ -15,7 +15,33 @@ ENTRY(__high_start)
>>           mov     $XEN_MINIMAL_CR4,%rcx
>>           mov     %rcx,%cr4
>>   -        mov     stack_start(%rip),%rsp
>> +        test    %ebx,%ebx
>> +        cmovz   stack_start(%rip), %rsp
>> +        jz      .L_stack_set
>> +
>> +        /* APs only: get stack base from APIC ID saved in %esp. */
>> +        mov     $-1, %rax
>> +        lea     x86_cpu_to_apicid(%rip), %rcx
> I would consider to move this patch after #2 and #3, so the logic is 
> not modified again. This would help the review.
I agree, maybe even after #4 after that patch is modified according to 
other comments.
>
>> +1:
>> +        add     $1, %rax
>> +        cmp     $NR_CPUS, %eax
>> +        jb      2f
> I think we can get rid of this jump by reworking the loop so %eax is 
> tested as the end of the loop. But this is boot code, so it is 
> possibly not worth it. I will leave the x86 maintainers commenting.
Not sure if I understood "end of the loop" correctly, but if I did, it 
would result in out-of-bounds read. Anyway, this is changed by further 
patches which I still have to reorder, I'll check if final form can be 
improved.
>
>> +        hlt
>> +2:
>> +        cmp     %esp, (%rcx, %rax, 4)
>> +        jne     1b
>> +
>> +        /* %eax is now Xen CPU index. */
>> +        lea     stack_base(%rip), %rcx
>> +        mov     (%rcx, %rax, 8), %rsp
>> +
>> +        test    %rsp,%rsp
>> +        jnz     1f
>> +        hlt
>> +1:
> NIT: Can you use 3? This makes the code easier to read and less prone 
> to error (you have two very close 1).
Ack
>
>> +        add     $(STACK_SIZE - CPUINFO_sizeof), %rsp
>> +
>> +.L_stack_set:
>>             /* Reset EFLAGS (subsumes CLI and CLD). */
>>           pushq   $0
>> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
>> index a3d3f797bb1e..1285969901e0 100644
>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -1951,6 +1951,7 @@ void __init noreturn __start_xen(unsigned long 
>> mbi_p)
>>        */
>>       if ( !pv_shim )
>>       {
>> +        /* Separate loop to make parallel AP bringup possible. */
>
> The loop split seems to be unrelated to this patch. Actually, I was 
> expecting that only the assembly code would be modified.
Fair point, I originally left it here so the code can be bisected if 
needed, but code changed significantly since then. In current form it 
should be safe to do this in the last commit.
>
>>           for_each_present_cpu ( i )
>>           {
>>               /* Set up cpu_to_node[]. */
>> @@ -1958,6 +1959,12 @@ void __init noreturn __start_xen(unsigned long 
>> mbi_p)
>>               /* Set up node_to_cpumask based on cpu_to_node[]. */
>>               numa_add_cpu(i);
>>   +            if ( stack_base[i] == NULL )
>> +                stack_base[i] = cpu_alloc_stack(i);
>
> I don't quite understand this change at least in the context of this 
> patch. AFAICT the stack will be currently allocated in 
> cpu_smpboot_callback() which is called while the CPU is prepared. So 
> you should not need this allocation right now.
>
> Looking at the rest of the series, it seems you allocate the stack 
> earlier so you start the CPU bring-up earlier. But they will still be 
> held in assembly code until cpu_up() is called.
>
> So effectively, part of the C part of the CPUs bring-up is still 
> serialized. Did I understand the logic correctly?
>
> If so, I would suggest to clarify it in the series because this wasn't 
> obvious to me (I was expecting start_secondary() would also run in 
> parallell).

start_secondary() is started in parallel, CPUs are not held in assembly. 
Calling C (almost) always requires stack, and most of this series comes 
to making stack available for all APs at once, just so APs can loop 
early in start_secondary().

You are correct, most of C part is serialized. I tried to make it 
parallel as well but quickly gave up. Some of the notifiers callbacks 
are resistant against any attempts at parallelization, and this set of 
patches already gave satisfactory improvements in boot time (and was 
enough to go through peculiar SMP bring-up used by Intel TXT dynamic 
launch, which is the reason why I had to do it in the first place).

>
> Regarding the change in setup.c, I think it would make more sense in 
> patch #9.
>
> Cheers,
>
Best regards,

-- 
Krystian Hebel
Firmware Engineer
https://3mdeb.com | @3mdeb_com



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 1/9] x86/boot: choose AP stack based on APIC ID
  2024-02-07 16:11   ` Jan Beulich
@ 2024-03-12 15:11     ` Krystian Hebel
  2024-03-12 15:40       ` Jan Beulich
  0 siblings, 1 reply; 45+ messages in thread
From: Krystian Hebel @ 2024-03-12 15:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel, Wei Liu

[-- Attachment #1: Type: text/plain, Size: 4455 bytes --]


On 7.02.2024 17:11, Jan Beulich wrote:
> On 14.11.2023 18:49, Krystian Hebel wrote:
>> --- a/xen/arch/x86/boot/trampoline.S
>> +++ b/xen/arch/x86/boot/trampoline.S
>> @@ -72,6 +72,26 @@ trampoline_protmode_entry:
>>           mov     $X86_CR4_PAE,%ecx
>>           mov     %ecx,%cr4
>>   
>> +        /*
>> +         * Get APIC ID while we're in non-paged mode. Start by checking if
>> +         * x2APIC is enabled.
>> +         */
>> +        mov     $MSR_APIC_BASE, %ecx
>> +        rdmsr
>> +        and     $APIC_BASE_EXTD, %eax
> You don't use the result, in which case "test" is to be preferred over
> "and".
Ack
>
>> +        jnz     .Lx2apic
>> +
>> +        /* Not x2APIC, read from MMIO */
>> +        mov     0xfee00020, %esp
> Please don't open-code existing constants (APIC_ID here and below,
> APIC_DEFAULT_PHYS_BASE just here, and ...
>
>> +        shr     $24, %esp
> ... a to-be-introduced constant here (for {G,S}ET_xAPIC_ID() to use as
> well then). This is the only way of being able to easily identify all
> pieces of code accessing the same piece of hardware.

Yes, this was also caught in review done by Qubes OS team.

New constant and {G,S}ET_xAPIC_ID() should be in separate patch, I presume?

>> +        jmp     1f
>> +
>> +.Lx2apic:
>> +        mov     $(MSR_X2APIC_FIRST + (0x20 >> 4)), %ecx
>> +        rdmsr
>> +        mov     %eax, %esp
>> +1:
> Overall I'm worried of the use of %esp throughout here.
>
>> --- a/xen/arch/x86/boot/x86_64.S
>> +++ b/xen/arch/x86/boot/x86_64.S
>> @@ -15,7 +15,33 @@ ENTRY(__high_start)
>>           mov     $XEN_MINIMAL_CR4,%rcx
>>           mov     %rcx,%cr4
>>   
>> -        mov     stack_start(%rip),%rsp
>> +        test    %ebx,%ebx
> Nit (style): Elsewhere you have blanks after the commas, just here
> (and once again near the end of the hunk) you don't.
Is either style preferred?This file has both.
>> +        cmovz   stack_start(%rip), %rsp
>> +        jz      .L_stack_set
>> +
>> +        /* APs only: get stack base from APIC ID saved in %esp. */
>> +        mov     $-1, %rax
> Why a 64-bit insn here and ...
>
>> +        lea     x86_cpu_to_apicid(%rip), %rcx
>> +1:
>> +        add     $1, %rax
> ... here, when you only use (far less than) 32-bit values?
Fair question, no idea what I had in mind, will change.
>> +        cmp     $NR_CPUS, %eax
>> +        jb      2f
>> +        hlt
>> +2:
>> +        cmp     %esp, (%rcx, %rax, 4)
>> +        jne     1b
> Aren't you open-coding REPNE SCASD here?
Looks like I am, I missed that. I'm not sure if this will still apply after
changes from further patches, but I'll keep that in mind.
>
>> +        /* %eax is now Xen CPU index. */
>> +        lea     stack_base(%rip), %rcx
>> +        mov     (%rcx, %rax, 8), %rsp
>> +
>> +        test    %rsp,%rsp
>> +        jnz     1f
>> +        hlt
>> +1:
>> +        add     $(STACK_SIZE - CPUINFO_sizeof), %rsp
> Even this post-adjustment is worrying me. Imo the stack pointer is
> better set exactly once, to its final value. Keeping it at its init
> value of 0 until then yields more predictable results in case it
> ends up being used somewhere.
It really shouldn't be used anywhere, but I'll change it.
>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -1951,6 +1951,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>        */
>>       if ( !pv_shim )
>>       {
>> +        /* Separate loop to make parallel AP bringup possible. */
>>           for_each_present_cpu ( i )
>>           {
>>               /* Set up cpu_to_node[]. */
>> @@ -1958,6 +1959,12 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>               /* Set up node_to_cpumask based on cpu_to_node[]. */
>>               numa_add_cpu(i);
>>   
>> +            if ( stack_base[i] == NULL )
>> +                stack_base[i] = cpu_alloc_stack(i);
>> +        }
> Imo this wants accompanying by removal of the allocation in
> cpu_smpboot_alloc(). Which would then make more visible that there's
> error checking there, but not here (I realize there effectively is
> error checking in assembly code, but that'll end in HLT with no
> useful indication of what the problem is). Provided, as Julien has
> pointed out, that the change is necessary in the first place.

The allocation in cpu_smpboot_alloc() was left for hot-plug. This loops
over present CPUs, not possible ones.

>
> Jan
Best regards,

-- 
Krystian Hebel
Firmware Engineer
https://3mdeb.com  | @3mdeb_com

[-- Attachment #2: Type: text/html, Size: 7610 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 2/9] x86: don't access x86_cpu_to_apicid[] directly, use cpu_physical_id(cpu)
  2024-02-07 16:28   ` Jan Beulich
@ 2024-03-12 15:18     ` Krystian Hebel
  2024-03-12 15:42       ` Jan Beulich
  0 siblings, 1 reply; 45+ messages in thread
From: Krystian Hebel @ 2024-03-12 15:18 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

[-- Attachment #1: Type: text/plain, Size: 2335 bytes --]


On 7.02.2024 17:28, Jan Beulich wrote:
> On 14.11.2023 18:49, Krystian Hebel wrote:
>> --- a/xen/arch/x86/apic.c
>> +++ b/xen/arch/x86/apic.c
>> @@ -950,7 +950,7 @@ __next:
>>        */
>>       if (boot_cpu_physical_apicid == -1U)
>>           boot_cpu_physical_apicid = get_apic_id();
>> -    x86_cpu_to_apicid[0] = get_apic_id();
>> +    cpu_physical_id(0) = get_apic_id();
> While personally I don't mind as much, I expect Andrew would not like
> this: Something that looks like a function call on the lhs is against
> what normal language structure would be.
This made me cringe as well, but I've seen something like this used in
other places (per_cpu() mostly) so I thought it was OK. I can change it.
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -1615,7 +1615,7 @@ long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
>>               break;
>>   
>>           cpu_id.phys_id =
>> -            (uint64_t)x86_cpu_to_apicid[v->vcpu_id] |
>> +            (uint64_t)cpu_physical_id(v->vcpu_id) |
>>               ((uint64_t)acpi_get_processor_id(v->vcpu_id) << 32);
> While the cast on the 2nd line is necessary, the one on the 2st isn't
> and would be nice to be dropped while touching the line anyway.
Ack
>
>> --- a/xen/arch/x86/numa.c
>> +++ b/xen/arch/x86/numa.c
>> @@ -70,7 +70,7 @@ void __init init_cpu_to_node(void)
>>   
>>       for ( i = 0; i < nr_cpu_ids; i++ )
>>       {
>> -        u32 apicid = x86_cpu_to_apicid[i];
>> +        u32 apicid = cpu_physical_id(i);
>>           if ( apicid == BAD_APICID )
>>               continue;
>>           node = apicid < MAX_LOCAL_APIC ? apicid_to_node[apicid] : NUMA_NO_NODE;
> We're in the process of phasing out u32 and friends, favoring uint32_t.
> Would be nice if in code being touched anyway (i.e. not just here) the
> conversion would be done right away. Then again fixed-width types are
> preferably avoided where not really needed (see ./CODING_STYLE), so
> quite likely it actually wants to be unsigned int here.
>
> Furthermore our style demands that declaration(s) and statement(s) are
> separated by a blank line. Inserting the missing one in cases like the
> one here would be very desirable as well.
Good to know, thanks.
> Jan
Best regards,

-- 
Krystian Hebel
Firmware Engineer
https://3mdeb.com  | @3mdeb_com

[-- Attachment #2: Type: text/html, Size: 3753 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 3/9] x86/smp: drop x86_cpu_to_apicid, use cpu_data[cpu].apicid instead
  2024-02-07 16:41     ` Jan Beulich
@ 2024-03-12 15:29       ` Krystian Hebel
  2024-03-12 15:49         ` Jan Beulich
  0 siblings, 1 reply; 45+ messages in thread
From: Krystian Hebel @ 2024-03-12 15:29 UTC (permalink / raw)
  To: Jan Beulich, Julien Grall
  Cc: Andrew Cooper, Roger Pau Monné, xen-devel, Wei Liu

[-- Attachment #1: Type: text/plain, Size: 4717 bytes --]

Hi,

On 7.02.2024 17:41, Jan Beulich wrote:
> On 02.02.2024 19:11, Julien Grall wrote:
>> Hi,
>>
>> On 14/11/2023 17:50, Krystian Hebel wrote:
>>> Both fields held the same data.
>>>
>>> Signed-off-by: Krystian Hebel<krystian.hebel@3mdeb.com>
>>> ---
>>>    xen/arch/x86/boot/x86_64.S           |  8 +++++---
>>>    xen/arch/x86/include/asm/asm_defns.h |  2 +-
>>>    xen/arch/x86/include/asm/processor.h |  2 ++
>>>    xen/arch/x86/include/asm/smp.h       |  4 ----
>>>    xen/arch/x86/numa.c                  | 15 +++++++--------
>>>    xen/arch/x86/smpboot.c               |  8 ++++----
>>>    xen/arch/x86/x86_64/asm-offsets.c    |  4 +++-
>>>    7 files changed, 22 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
>>> index b85b47b5c1a0..195550b5c0ea 100644
>>> --- a/xen/arch/x86/boot/x86_64.S
>>> +++ b/xen/arch/x86/boot/x86_64.S
>>> @@ -20,15 +20,17 @@ ENTRY(__high_start)
>>>            jz      .L_stack_set
>>>    
>>>            /* APs only: get stack base from APIC ID saved in %esp. */
>>> -        mov     $-1, %rax
>>> -        lea     x86_cpu_to_apicid(%rip), %rcx
>>> +        mov     $0, %rax
>>> +        lea     cpu_data(%rip), %rcx
>>> +        /* cpu_data[0] is BSP, skip it. */
>>>    1:
>>>            add     $1, %rax
>>> +        add     $CPUINFO_X86_sizeof, %rcx
>>>            cmp     $NR_CPUS, %eax
>>>            jb      2f
>>>            hlt
>>>    2:
>>> -        cmp     %esp, (%rcx, %rax, 4)
>>> +        cmp     %esp, CPUINFO_X86_apicid(%rcx)
>>>            jne     1b
>>>    
>>>            /* %eax is now Xen CPU index. */
>> As mentioned in an earlier patch, I think you want to re-order the
>> patches. This will avoid to modify twice the same code within the same
>> series (it is best to avoid if you can).
> I second this request. Even more so that there's an unexplained move
> from starting at $-1 to starting at $0 (in which case you really want
> to use xor, not mov).
Will do. This may even result in squashing some patches together.
>>> --- a/xen/arch/x86/numa.c
>>> +++ b/xen/arch/x86/numa.c
>>> @@ -54,14 +54,13 @@ bool __init arch_numa_unavailable(void)
>>>    /*
>>>     * Setup early cpu_to_node.
>>>     *
>>> - * Populate cpu_to_node[] only if x86_cpu_to_apicid[],
>>> - * and apicid_to_node[] tables have valid entries for a CPU.
>>> - * This means we skip cpu_to_node[] initialisation for NUMA
>>> - * emulation and faking node case (when running a kernel compiled
>>> - * for NUMA on a non NUMA box), which is OK as cpu_to_node[]
>>> - * is already initialized in a round robin manner at numa_init_array,
>>> - * prior to this call, and this initialization is good enough
>>> - * for the fake NUMA cases.
>>> + * Populate cpu_to_node[] only if cpu_data[], and apicid_to_node[]
> You mean cpu_physical_id() here, and then this change wants doing when
> switching to that, imo.
You mean s/cpu_data[]/cpu_physical_id()/ or something else?
>>> + * tables have valid entries for a CPU. This means we skip
>>> + * cpu_to_node[] initialisation for NUMA emulation and faking node
>>> + * case (when running a kernel compiled for NUMA on a non NUMA box),
>>> + * which is OK as cpu_to_node[] is already initialized in a round
>>> + * robin manner at numa_init_array, prior to this call, and this
>>> + * initialization is good enough for the fake NUMA cases.
>>>     */
> Also if you're already re-wrapping this comment, please make better use
> of line width.
>
>>> --- a/xen/arch/x86/x86_64/asm-offsets.c
>>> +++ b/xen/arch/x86/x86_64/asm-offsets.c
>>> @@ -159,7 +159,9 @@ void __dummy__(void)
>>>        OFFSET(IRQSTAT_softirq_pending, irq_cpustat_t, __softirq_pending);
>>>        BLANK();
>>>    
>>> -    OFFSET(CPUINFO_features, struct cpuinfo_x86, x86_capability);
>>> +    OFFSET(CPUINFO_X86_features, struct cpuinfo_x86, x86_capability);
>> The rename seems to be unrelated to this patch. Can you clarify?
> I agree some renaming wants doing, but separately. That's because we
> use CPUINFO_ as a prefix for two entirely different structure's offsets
> right now. I'm not convinced of CPUINFO_X86_ as the new prefix though:
> Uses are against cpu_data[], so CPUDATA_ may be better. Might be good
> if Andrew and/or Roger could voice their view.
Yes, this was because after adding APIC ID to this structure I tried to use
CPUINFO_sizeof in the assembly, and bad things happened.
>
> Jan
>
>>> +    OFFSET(CPUINFO_X86_apicid, struct cpuinfo_x86, apicid);
>>> +    DEFINE(CPUINFO_X86_sizeof, sizeof(struct cpuinfo_x86));
>>>        BLANK();
>>>    
>>>        OFFSET(MB_flags, multiboot_info_t, flags);
>> Cheers,
>>
Best regards,

-- 
Krystian Hebel
Firmware Engineer
https://3mdeb.com  | @3mdeb_com

[-- Attachment #2: Type: text/html, Size: 6689 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 3/9] x86/smp: drop x86_cpu_to_apicid, use cpu_data[cpu].apicid instead
  2024-02-08  7:29   ` Jan Beulich
@ 2024-03-12 15:35     ` Krystian Hebel
  0 siblings, 0 replies; 45+ messages in thread
From: Krystian Hebel @ 2024-03-12 15:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel, Wei Liu


On 8.02.2024 08:29, Jan Beulich wrote:
> On 14.11.2023 18:50, Krystian Hebel wrote:
>> Both fields held the same data.
> Supposedly the same data only. They come from different origins, and you're
> hiding this quite well by leaving all sites in place where the field is
> written. Both items are also used for entirely separate purposes. So you
> need to
> - explain why both sources of information necessarily provide the same
>    data,
> - especially if there's remaining concern from the above explanation that
>    the two values might end up different in corner cases (running
>    virtualized ourselves comes to mind as a possible example), explain why
>    nevertheless it is fine (risk free) to use the consolidated item for
>    all of the originally separate purposes,
> - either explain or do away with the multiple places setting this single
>    remaining field.
I missed those writes, thanks for pointing this out. I'll have to take
a closer look before deciding what to do with this.
>
>> --- a/xen/arch/x86/smpboot.c
>> +++ b/xen/arch/x86/smpboot.c
>> @@ -61,10 +61,8 @@ unsigned int __read_mostly nr_sockets;
>>   cpumask_t **__read_mostly socket_cpumask;
>>   static cpumask_t *secondary_socket_cpumask;
>>   
>> -struct cpuinfo_x86 cpu_data[NR_CPUS];
>> -
>> -u32 x86_cpu_to_apicid[NR_CPUS] __read_mostly =
>> -	{ [0 ... NR_CPUS-1] = BAD_APICID };
>> +struct cpuinfo_x86 cpu_data[NR_CPUS] =
>> +        { [0 ... NR_CPUS-1] .apicid = BAD_APICID };
> Nit: Stray blank after closing square bracket.
>
> Jan

-- 
Krystian Hebel
Firmware Engineer
https://3mdeb.com | @3mdeb_com



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 1/9] x86/boot: choose AP stack based on APIC ID
  2024-03-12 15:11     ` Krystian Hebel
@ 2024-03-12 15:40       ` Jan Beulich
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Beulich @ 2024-03-12 15:40 UTC (permalink / raw)
  To: Krystian Hebel; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel, Wei Liu

On 12.03.2024 16:11, Krystian Hebel wrote:
> On 7.02.2024 17:11, Jan Beulich wrote:
>> On 14.11.2023 18:49, Krystian Hebel wrote:
>>> --- a/xen/arch/x86/boot/trampoline.S
>>> +++ b/xen/arch/x86/boot/trampoline.S
>>> +        /* Not x2APIC, read from MMIO */
>>> +        mov     0xfee00020, %esp
>> Please don't open-code existing constants (APIC_ID here and below,
>> APIC_DEFAULT_PHYS_BASE just here, and ...
>>
>>> +        shr     $24, %esp
>> ... a to-be-introduced constant here (for {G,S}ET_xAPIC_ID() to use as
>> well then). This is the only way of being able to easily identify all
>> pieces of code accessing the same piece of hardware.
> 
> Yes, this was also caught in review done by Qubes OS team.
> 
> New constant and {G,S}ET_xAPIC_ID() should be in separate patch, I presume?

Preferably, yes.

>>> --- a/xen/arch/x86/boot/x86_64.S
>>> +++ b/xen/arch/x86/boot/x86_64.S
>>> @@ -15,7 +15,33 @@ ENTRY(__high_start)
>>>           mov     $XEN_MINIMAL_CR4,%rcx
>>>           mov     %rcx,%cr4
>>>   
>>> -        mov     stack_start(%rip),%rsp
>>> +        test    %ebx,%ebx
>> Nit (style): Elsewhere you have blanks after the commas, just here
>> (and once again near the end of the hunk) you don't.
> Is either style preferred?This file has both.

Conversion takes time, so in new code we aim at having those blanks.
Over time we hope to have them nearly everywhere, at which point it
may make sense to to a final cleanup sweep.

>>> --- a/xen/arch/x86/setup.c
>>> +++ b/xen/arch/x86/setup.c
>>> @@ -1951,6 +1951,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>>        */
>>>       if ( !pv_shim )
>>>       {
>>> +        /* Separate loop to make parallel AP bringup possible. */
>>>           for_each_present_cpu ( i )
>>>           {
>>>               /* Set up cpu_to_node[]. */
>>> @@ -1958,6 +1959,12 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>>               /* Set up node_to_cpumask based on cpu_to_node[]. */
>>>               numa_add_cpu(i);
>>>   
>>> +            if ( stack_base[i] == NULL )
>>> +                stack_base[i] = cpu_alloc_stack(i);
>>> +        }
>> Imo this wants accompanying by removal of the allocation in
>> cpu_smpboot_alloc(). Which would then make more visible that there's
>> error checking there, but not here (I realize there effectively is
>> error checking in assembly code, but that'll end in HLT with no
>> useful indication of what the problem is). Provided, as Julien has
>> pointed out, that the change is necessary in the first place.
> 
> The allocation in cpu_smpboot_alloc() was left for hot-plug. This loops
> over present CPUs, not possible ones.

Ah, right. Yet better error checking / reporting is going to be needed
anyway.

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 2/9] x86: don't access x86_cpu_to_apicid[] directly, use cpu_physical_id(cpu)
  2024-03-12 15:18     ` Krystian Hebel
@ 2024-03-12 15:42       ` Jan Beulich
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Beulich @ 2024-03-12 15:42 UTC (permalink / raw)
  To: Krystian Hebel; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 12.03.2024 16:18, Krystian Hebel wrote:
> On 7.02.2024 17:28, Jan Beulich wrote:
>> On 14.11.2023 18:49, Krystian Hebel wrote:
>>> --- a/xen/arch/x86/apic.c
>>> +++ b/xen/arch/x86/apic.c
>>> @@ -950,7 +950,7 @@ __next:
>>>        */
>>>       if (boot_cpu_physical_apicid == -1U)
>>>           boot_cpu_physical_apicid = get_apic_id();
>>> -    x86_cpu_to_apicid[0] = get_apic_id();
>>> +    cpu_physical_id(0) = get_apic_id();
>> While personally I don't mind as much, I expect Andrew would not like
>> this: Something that looks like a function call on the lhs is against
>> what normal language structure would be.
> This made me cringe as well, but I've seen something like this used in
> other places (per_cpu() mostly) so I thought it was OK. I can change it.

Please try to get in touch with Andrew, to see what he thinks (especially
with your pointing of per_cpu()'s similarity).

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 3/9] x86/smp: drop x86_cpu_to_apicid, use cpu_data[cpu].apicid instead
  2024-03-12 15:29       ` Krystian Hebel
@ 2024-03-12 15:49         ` Jan Beulich
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Beulich @ 2024-03-12 15:49 UTC (permalink / raw)
  To: Krystian Hebel
  Cc: Andrew Cooper, Roger Pau Monné, xen-devel, Wei Liu, Julien Grall

On 12.03.2024 16:29, Krystian Hebel wrote:
> On 7.02.2024 17:41, Jan Beulich wrote:
>> On 02.02.2024 19:11, Julien Grall wrote:
>>> On 14/11/2023 17:50, Krystian Hebel wrote:
>>>> --- a/xen/arch/x86/numa.c
>>>> +++ b/xen/arch/x86/numa.c
>>>> @@ -54,14 +54,13 @@ bool __init arch_numa_unavailable(void)
>>>>    /*
>>>>     * Setup early cpu_to_node.
>>>>     *
>>>> - * Populate cpu_to_node[] only if x86_cpu_to_apicid[],
>>>> - * and apicid_to_node[] tables have valid entries for a CPU.
>>>> - * This means we skip cpu_to_node[] initialisation for NUMA
>>>> - * emulation and faking node case (when running a kernel compiled
>>>> - * for NUMA on a non NUMA box), which is OK as cpu_to_node[]
>>>> - * is already initialized in a round robin manner at numa_init_array,
>>>> - * prior to this call, and this initialization is good enough
>>>> - * for the fake NUMA cases.
>>>> + * Populate cpu_to_node[] only if cpu_data[], and apicid_to_node[]
>> You mean cpu_physical_id() here, and then this change wants doing when
>> switching to that, imo.
> You mean s/cpu_data[]/cpu_physical_id()/ or something else?

Well, in general terms - whatever the function in fact accesses. That's,
if I reconstruct it from patch 2, as you say then.

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 4/9] x86/smp: move stack_base to cpu_data
  2024-02-05  8:41     ` Jan Beulich
  2024-02-05  9:32       ` Julien Grall
@ 2024-03-12 15:59       ` Krystian Hebel
  1 sibling, 0 replies; 45+ messages in thread
From: Krystian Hebel @ 2024-03-12 15:59 UTC (permalink / raw)
  To: Jan Beulich, Julien Grall
  Cc: Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Stefano Stabellini, xen-devel

[-- Attachment #1: Type: text/plain, Size: 5698 bytes --]


On 5.02.2024 09:41, Jan Beulich wrote:
> On 02.02.2024 19:24, Julien Grall wrote:
>> On 14/11/2023 17:50, Krystian Hebel wrote:
>>> This location is easier to access from assembly. Having it close to
>>> other data required during initialization has also positive (although
>>> rather small) impact on prefetching data from RAM.
>> I understand your goal but...
>>
>>> --- a/xen/arch/x86/include/asm/cpufeature.h
>>> +++ b/xen/arch/x86/include/asm/cpufeature.h
>> ... cpufeature.h feels a rather odd place for storing the stack. I am
>> not entirely sure where else to place. Andrew, Jan, Roger?
> Well, without having looked at the patch/series itself yet, I can only
> say that if struct cpuinfo_x86 really is the place to put this
> information, then it's unavoidable to have the field added in this
> header. That said, it certainly feels like an abuse - there's nothing
> in common with other (collected) data in this structure. "Easier to
> access from assembly" may be a fair reason, but then I'd expect the
> downsides of alternatives to be discussed explicitly. For example, a
> simple new array might be as "easily" accessible from assembly.

Initially I thought I'll be using more fields from this structure a lot, 
like
booted_cores or apicid. I'll move this and cpu_state introduced in following
patch somewhere else.

>
>>> @@ -37,6 +37,7 @@ struct cpuinfo_x86 {
>>>        unsigned int phys_proc_id;         /* package ID of each logical CPU */
>>>        unsigned int cpu_core_id;          /* core ID of each logical CPU */
>>>        unsigned int compute_unit_id;      /* AMD compute unit ID of each logical CPU */
>>> +    void *stack_base;
>> AFAICT, this means there will be a padding before stack_base and ...
>>
>>>        unsigned short x86_clflush_size;
>> ... another one here. Is there any particular reason the new field
>> wasn't added at the end?
> With ...
>
>>>    } __cacheline_aligned;
> ... this I'm not exactly sure this is a problem right now (there may
> be ample padding space anyway, yet I didn't go count). But I agree
> with your comment in principle.
I've checked that the size didn't change after adding. I also think that
I checked that adding it there wouldn't add any padding, but maybe I
miscalculated something. In any way, this will be moved from here.
>>> --- a/xen/arch/x86/smpboot.c
>>> +++ b/xen/arch/x86/smpboot.c
>>> @@ -75,13 +75,15 @@ static enum cpu_state {
>>>    } cpu_state;
>>>    #define set_cpu_state(state) do { smp_mb(); cpu_state = (state); } while (0)
>>>    
>>> -void *stack_base[NR_CPUS];
>>> -
>>>    void initialize_cpu_data(unsigned int cpu)
>>>    {
>>>        uint32_t apicid = cpu_physical_id(cpu);
>>> +    void *stack = cpu_data[cpu].stack_base;
>>> +
>>>        cpu_data[cpu] = boot_cpu_data;
>>> +
>>>        cpu_physical_id(cpu) = apicid;
>>> +    cpu_data[cpu].stack_base = stack;
>>>    }
>>>    
>>>    static bool smp_store_cpu_info(unsigned int id)
>>> @@ -579,8 +581,6 @@ static int do_boot_cpu(int apicid, int cpu)
>>>            printk("Booting processor %d/%d eip %lx\n",
>>>                   cpu, apicid, start_eip);
>>>    
>>> -    stack_start = stack_base[cpu] + STACK_SIZE - sizeof(struct cpu_info);
>>> -
>> You remove this line because I can't quite figure out where stack_start
>> is now set. This is used...
> This line sets a global variable, which ...
>
>>> @@ -856,7 +856,7 @@ int setup_cpu_root_pgt(unsigned int cpu)
>>>    
>>>        /* Install direct map page table entries for stack, IDT, and TSS. */
>>>        for ( off = rc = 0; !rc && off < STACK_SIZE; off += PAGE_SIZE )
>>> -        rc = clone_mapping(__va(__pa(stack_base[cpu])) + off, rpt);
>>> +        rc = clone_mapping(__va(__pa(cpu_data[cpu].stack_base)) + off, rpt);
>>>    
>>>        if ( !rc )
>>>            rc = clone_mapping(idt_tables[cpu], rpt);
>>> @@ -1007,10 +1007,10 @@ static void cpu_smpboot_free(unsigned int cpu, bool remove)
>>>            FREE_XENHEAP_PAGE(per_cpu(gdt, cpu));
>>>            FREE_XENHEAP_PAGE(idt_tables[cpu]);
>>>    
>>> -        if ( stack_base[cpu] )
>>> +        if ( cpu_data[cpu].stack_base )
>>>            {
>>> -            memguard_unguard_stack(stack_base[cpu]);
>>> -            FREE_XENHEAP_PAGES(stack_base[cpu], STACK_ORDER);
>>> +            memguard_unguard_stack(cpu_data[cpu].stack_base);
>>> +            FREE_XENHEAP_PAGES(cpu_data[cpu].stack_base, STACK_ORDER);
>>>            }
>>>        }
>>>    }
>>> @@ -1044,11 +1044,11 @@ static int cpu_smpboot_alloc(unsigned int cpu)
>>>        if ( node != NUMA_NO_NODE )
>>>            memflags = MEMF_node(node);
>>>    
>>> -    if ( stack_base[cpu] == NULL &&
>>> -         (stack_base[cpu] = cpu_alloc_stack(cpu)) == NULL )
>>> +    if ( cpu_data[cpu].stack_base == NULL &&
>>> +         (cpu_data[cpu].stack_base = cpu_alloc_stack(cpu)) == NULL )
>>>                goto out;
>>>    
>>> -    info = get_cpu_info_from_stack((unsigned long)stack_base[cpu]);
>>> +    info = get_cpu_info_from_stack((unsigned long)cpu_data[cpu].stack_base);
>> ... here.
> ... pretty clearly is not used here (anymore). Instead I'd raise the
> question of what the remaining purpose of that variable then is.
> Looking through updates this patch alone makes to use sites of
> stack_start, it's unclear whether the use from assembly code has gone
> away already - brief checking suggests it hasn't.
BSP still uses it, but APs don't. That said, comment above declaration says
otherwise, I'll change it, or maybe this variable can be removed altogether
since it always points to the same place, and there are only two consumers,
both in assembly.
>
> Jan
Best regards,

-- 
Krystian Hebel
Firmware Engineer
https://3mdeb.com  | @3mdeb_com

[-- Attachment #2: Type: text/html, Size: 7998 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 5/9] x86/smp: call x2apic_ap_setup() earlier
  2024-02-07 17:02   ` Jan Beulich
@ 2024-03-12 16:02     ` Krystian Hebel
  2024-03-13 13:05       ` Jan Beulich
  0 siblings, 1 reply; 45+ messages in thread
From: Krystian Hebel @ 2024-03-12 16:02 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper; +Cc: Roger Pau Monné, xen-devel, Wei Liu


On 7.02.2024 18:02, Jan Beulich wrote:
> On 14.11.2023 18:50, Krystian Hebel wrote:
>> It used to be called from smp_callin(), however BUG_ON() was invoked on
>> multiple occasions before that. It may end up calling machine_restart()
>> which tries to get APIC ID for CPU running this code. If BSP detected
>> that x2APIC is enabled, get_apic_id() will try to use it for all CPUs.
>> Enabling x2APIC on secondary CPUs earlier protects against an endless
>> loop of #GP exceptions caused by attempts to read IA32_X2APIC_APICID
>> MSR while x2APIC is disabled in IA32_APIC_BASE.
>>
>> Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
>> ---
>>   xen/arch/x86/smpboot.c | 9 ++++++++-
>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
>> index 8ae65ab1769f..a3895dafa267 100644
>> --- a/xen/arch/x86/smpboot.c
>> +++ b/xen/arch/x86/smpboot.c
>> @@ -184,7 +184,6 @@ static void smp_callin(void)
>>        * update until we finish. We are free to set up this CPU: first the APIC.
>>        */
>>       Dprintk("CALLIN, before setup_local_APIC().\n");
>> -    x2apic_ap_setup();
>>       setup_local_APIC(false);
>>   
>>       /* Save our processor parameters. */
>> @@ -351,6 +350,14 @@ void start_secondary(void *unused)
>>       get_cpu_info()->xen_cr3 = 0;
>>       get_cpu_info()->pv_cr3 = 0;
>>   
>> +    /*
>> +     * BUG_ON() used in load_system_tables() and later code may end up calling
>> +     * machine_restart() which tries to get APIC ID for CPU running this code.
>> +     * If BSP detected that x2APIC is enabled, get_apic_id() will try to use it
>> +     * for _all_ CPUs. Enable x2APIC on secondary CPUs now so we won't end up
>> +     * with endless #GP loop.
>> +     */
>> +    x2apic_ap_setup();
>>       load_system_tables();
> While I find the argument convincing, I seem to recall that there was a
> firm plan to have load_system_tables() as early as possible. Andrew?
This is where the code failed for me during testing. How about moving
x2apic_ap_setup() into load_system_tables(), just before BUG_ON? Or maybe
move those BUG_ON one level higher, after load_system_tables() returns?
Either way some code will end up in place it doesn't belong, but I'd 
argue that
BUG_ON is only useful if it itself doesn't crash.
>
> Jan

-- 
Krystian Hebel
Firmware Engineer
https://3mdeb.com | @3mdeb_com



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 6/9] x86/shutdown: protect against recurrent machine_restart()
  2024-02-08 11:30   ` Jan Beulich
@ 2024-03-12 16:05     ` Krystian Hebel
  2024-03-13 13:11       ` Jan Beulich
  0 siblings, 1 reply; 45+ messages in thread
From: Krystian Hebel @ 2024-03-12 16:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel, Wei Liu


On 8.02.2024 12:30, Jan Beulich wrote:
> On 14.11.2023 18:50, Krystian Hebel wrote:
>> If multiple CPUs called machine_restart() before actual restart took
>> place, but after boot CPU declared itself not online,
> Can you help me please in identifying where this operation is? I can see
> two places where a CPU is removed from cpu_online_map, yet neither
> __stop_this_cpu() nor __cpu_disable() ought to be coming into play here.
> In fact I didn't think CPU0 was ever marked not-online. Except perhaps
> if we came through machine_crash_shutdown() -> nmi_shootdown_cpus(), but
> I'm sure you would have mentioned such a further dependency.
>
> Jan
BUG_ON() in cpu_notifier_call_chain() (I've been playing with some of
the notifiers and one of them eventually failed) resulted in panic()
around the same time AP did in pm_idle() due to inconsistent settings
between BSP and AP for MWAIT/MONITOR support after TXT dynamic
launch. There is 5s delay between smp_send_stop() and actual reboot,
during that time AP spammed the output so the original reason for
panic() was visible only after unreasonable amount of scrolling.

Adding TXT support is the reason why I even started making AP bring-up
parallel. Problem with MWAIT doesn't happen in current code or changes
in this patchset, but handling of such error is related to SMP so I've 
included it.

Best regards,

-- 
Krystian Hebel
Firmware Engineer
https://3mdeb.com | @3mdeb_com



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 7/9] x86/smp: drop booting_cpu variable
  2024-02-08 11:39   ` Jan Beulich
@ 2024-03-12 16:16     ` Krystian Hebel
  0 siblings, 0 replies; 45+ messages in thread
From: Krystian Hebel @ 2024-03-12 16:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

[-- Attachment #1: Type: text/plain, Size: 4013 bytes --]


On 8.02.2024 12:39, Jan Beulich wrote:
> On 14.11.2023 18:50, Krystian Hebel wrote:
>> CPU id is obtained as a side effect of searching for appropriate
>> stack for AP. It can be used as a parameter to start_secondary().
>> Coincidentally this also makes further work on making AP bring-up
>> code parallel easier.
> It's not just "easier", but strictly a prereq I think? Such a global
> would get in the way of having multiple CPUs make it into
> start_secondary() in parallel.
start_secondary() could also repeat what was done in assembly to get
that ID. This was what I've done in one of first attempts, and the commit
messagewas about removing that lookup loop which was squashed along the 
way and no longer applies. I'll reword it a bit.
>> --- a/xen/arch/x86/boot/x86_64.S
>> +++ b/xen/arch/x86/boot/x86_64.S
>> @@ -20,20 +20,24 @@ ENTRY(__high_start)
>>           jz      .L_stack_set
>>   
>>           /* APs only: get stack base from APIC ID saved in %esp. */
>> -        mov     $0, %rax
>> +        mov     $0, %rbx
>>           lea     cpu_data(%rip), %rcx
>>           /* cpu_data[0] is BSP, skip it. */
>>   1:
>> -        add     $1, %rax
>> +        add     $1, %rbx
>>           add     $CPUINFO_X86_sizeof, %rcx
>> -        cmp     $NR_CPUS, %eax
>> +        cmp     $NR_CPUS, %rbx
>>           jb      2f
>>           hlt
>>   2:
>>           cmp     %esp, CPUINFO_X86_apicid(%rcx)
>>           jne     1b
> Once again this is code you introduced a few patches ago. Why not use
> %ebx right away for that purpose? (And yes, this explains why in the
> earlier patch you retained that code. Just that again suitably ordering
> the series would make this look natural. Otherwise it needs at least
> mentioning why dead pieces are kept around.)
Will do.
>
>> -        /* %rcx is now cpu_data[cpu], read stack base from it. */
>> +        /*
>> +         * At this point:
>> +         * - %rcx is cpu_data[cpu], read stack base from it,
>> +         * - %rbx (callee-save) is Xen cpu number, pass it to start_secondary().
>> +         */
>>           mov     CPUINFO_X86_stack_base(%rcx), %rsp
>>   
>>           test    %rsp,%rsp
>> @@ -101,6 +105,7 @@ ENTRY(__high_start)
>>   .L_ap_cet_done:
>>   #endif /* CONFIG_XEN_SHSTK || CONFIG_XEN_IBT */
>>   
>> +        mov     %rbx, %rdi
>>           tailcall start_secondary
> As alluded to above and as mentioned before - please stick to 32-bit
> operations when you deal with 32 (or less) bits of data.
Ack
>
>> --- a/xen/arch/x86/smpboot.c
>> +++ b/xen/arch/x86/smpboot.c
>> @@ -222,8 +222,6 @@ static void smp_callin(void)
>>           cpu_relax();
>>   }
>>   
>> -static int booting_cpu;
>> -
>>   /* CPUs for which sibling maps can be computed. */
>>   static cpumask_t cpu_sibling_setup_map;
>>   
>> @@ -311,15 +309,14 @@ static void set_cpu_sibling_map(unsigned int cpu)
>>       }
>>   }
>>   
>> -void start_secondary(void *unused)
>> +void start_secondary(unsigned int cpu)
>>   {
>>       struct cpu_info *info = get_cpu_info();
>>   
>>       /*
>> -     * Dont put anything before smp_callin(), SMP booting is so fragile that we
>> +     * Don't put anything before smp_callin(), SMP booting is so fragile that we
>>        * want to limit the things done here to the most necessary things.
>>        */
>> -    unsigned int cpu = booting_cpu;
>>   
>>       /* Critical region without IDT or TSS.  Any fault is deadly! */
>>   
>> @@ -346,9 +343,9 @@ void start_secondary(void *unused)
>>        */
>>       spin_debug_disable();
>>   
>> -    get_cpu_info()->use_pv_cr3 = false;
>> -    get_cpu_info()->xen_cr3 = 0;
>> -    get_cpu_info()->pv_cr3 = 0;
>> +    info->use_pv_cr3 = false;
>> +    info->xen_cr3 = 0;
>> +    info->pv_cr3 = 0;
> This hunk looks unrelated. While tidying next to what's changed anyway
> may be okay if suitably mentioned in the description, in this case I
> think it needs splitting off.
OK, I'll split it off.
>
> Jan
Best regards,

-- 
Krystian Hebel
Firmware Engineer
https://3mdeb.com  | @3mdeb_com

[-- Attachment #2: Type: text/html, Size: 5566 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 8/9] x86/smp: make cpu_state per-CPU
  2024-02-08 12:13   ` Jan Beulich
@ 2024-03-12 16:38     ` Krystian Hebel
  2024-03-13 13:21       ` Jan Beulich
  0 siblings, 1 reply; 45+ messages in thread
From: Krystian Hebel @ 2024-03-12 16:38 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel


On 8.02.2024 13:13, Jan Beulich wrote:
> On 14.11.2023 18:50, Krystian Hebel wrote:
>> This will be used for parallel AP bring-up.
>>
>> CPU_STATE_INIT changed direction.
> Nit: I think you mean "changes" as you describe what the patch does, not
> what has happened before. But ...
>
>> It was previously set by BSP and never
>> consumed by AP. Now it signals that AP got through assembly part of
>> initialization and waits for BSP to call notifiers that set up data
>> structures required for further initialization.
> ... all of this is, afaict, independent of what the title says the
> purpose of this patch is. Since the correctness of the state change
> adjustments doesn't look straightforward to prove, please split the
> mechanical change from the change to the actual logic.
Ack
>
>> --- a/xen/arch/x86/include/asm/cpufeature.h
>> +++ b/xen/arch/x86/include/asm/cpufeature.h
>> @@ -38,6 +38,7 @@ struct cpuinfo_x86 {
>>       unsigned int cpu_core_id;          /* core ID of each logical CPU */
>>       unsigned int compute_unit_id;      /* AMD compute unit ID of each logical CPU */
>>       void *stack_base;
>> +    unsigned int cpu_state;
>>       unsigned short x86_clflush_size;
>>   } __cacheline_aligned;
> Is there any reason this cannot be ordinary per-CPU data?
Probably not, will move it away.
>
>> --- a/xen/arch/x86/smpboot.c
>> +++ b/xen/arch/x86/smpboot.c
>> @@ -65,15 +65,18 @@ struct cpuinfo_x86 cpu_data[NR_CPUS] =
>>           { [0 ... NR_CPUS-1] .apicid = BAD_APICID };
>>   
>>   static int cpu_error;
>> -static enum cpu_state {
>> +enum cpu_state {
>>       CPU_STATE_DYING,    /* slave -> master: I am dying */
>>       CPU_STATE_DEAD,     /* slave -> master: I am completely dead */
>> -    CPU_STATE_INIT,     /* master -> slave: Early bringup phase 1 */
>> -    CPU_STATE_CALLOUT,  /* master -> slave: Early bringup phase 2 */
>> +    CPU_STATE_INIT,     /* slave -> master: Early bringup phase 1 completed */
>> +    CPU_STATE_CALLOUT,  /* master -> slave: Start early bringup phase 2 */
> It's not really clear to me whether the adding of "Start" on the 2nd line
> really adds value.
Ack
>
>>       CPU_STATE_CALLIN,   /* slave -> master: Completed phase 2 */
>>       CPU_STATE_ONLINE    /* master -> slave: Go fully online now. */
>> -} cpu_state;
>> -#define set_cpu_state(state) do { smp_mb(); cpu_state = (state); } while (0)
>> +};
>> +#define set_cpu_state(cpu, state) do { \
>> +    smp_mb(); \
>> +    cpu_data[cpu].cpu_state = (state); \
>> +} while (0)
> While you merely re-arrange it, I'd still like to ask: Does this really
> need to be smp_mb(), not just smp_wmb()?
Probably not, but I didn't want to change it, assuming there was a reason
that it used smp_wmb() in the first place.
>
>> @@ -320,6 +317,10 @@ void start_secondary(unsigned int cpu)
>>   
>>       /* Critical region without IDT or TSS.  Any fault is deadly! */
>>   
>> +    /* Wait until data set up by CPU_UP_PREPARE notifiers is ready. */
>> +    while ( cpu_data[cpu].cpu_state != CPU_STATE_CALLOUT )
>> +        cpu_relax();
> I'm afraid I don't understand the comment (and hence whether this loop
> is actually needed here): __cpu_up() is called only after those
> notifiers completed.
Yes, but broadcasted INIT-SIPI-SIPI sequence added in next patch will be
sent before that call is made, and consequently APs potentially can get
to this point before that data is set up.
>
>> @@ -1161,6 +1171,12 @@ void __init smp_prepare_cpus(void)
>>       cpu_data[0].stack_base = (void *)
>>                ((unsigned long)stack_start & ~(STACK_SIZE - 1));
>>   
>> +    /* Set state as CALLOUT so APs won't change it in initialize_cpu_data() */
>> +    boot_cpu_data.cpu_state = CPU_STATE_CALLOUT;
> This is actually one of the reasons I don't like you putting the item
> as a new field in struct cpuinfo_x86. Otherwise imo initialize_cpu_data()
> ought to gain a respective assertion.
I'll move it out.
>
> Jan
Best regards,

-- 
Krystian Hebel
Firmware Engineer
https://3mdeb.com | @3mdeb_com



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 9/9] x86/smp: start APs in parallel during boot
  2024-02-08 12:37   ` Jan Beulich
@ 2024-03-12 17:13     ` Krystian Hebel
  2024-03-13 13:30       ` Jan Beulich
  0 siblings, 1 reply; 45+ messages in thread
From: Krystian Hebel @ 2024-03-12 17:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel, Wei Liu


On 8.02.2024 13:37, Jan Beulich wrote:
> On 14.11.2023 18:50, Krystian Hebel wrote:
>> Multiple delays are required when sending IPIs and waiting for
>> responses. During boot, 4 such IPIs were sent per each AP. With this
>> change, only one set of broadcast IPIs is sent. This reduces boot time,
>> especially for platforms with large number of cores.
> Yet APs do their startup work in parallel only for a brief period of
> time, if I'm not mistaken. Othwerwise I can't see why you'd still have
> cpu_up() in __start_xen().
cpu_up() is left because multiple notifiers aren't easy to convert to work
in parallel. In terms of lines of code it looks like a brief period, but all
the delays along the way were taking much more time than the actual
work. As the gain was already more than what I hoped for, I decided
against spending too much time trying to fix the notifiers' code for
minimal profit.
>
>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -1963,6 +1963,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>                   cpu_data[i].stack_base = cpu_alloc_stack(i);
>>           }
>>   
>> +        smp_send_init_sipi_sipi_allbutself();
>> +
>>           for_each_present_cpu ( i )
>>           {
>>               if ( (park_offline_cpus || num_online_cpus() < max_cpus) &&
> So what about constraints on the number of CPUs to use? In such a case
> you shouldn't send the IPI to all of them, at least if they're not
> meant to be parked.
Fair point, such check can be easily added before broadcasting and the
rest of the code should already be able to handle this.
>
>> --- a/xen/arch/x86/smpboot.c
>> +++ b/xen/arch/x86/smpboot.c
>> @@ -425,7 +425,7 @@ void start_secondary(unsigned int cpu)
>>   
>>   static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>>   {
>> -    unsigned long send_status = 0, accept_status = 0;
>> +    unsigned long send_status = 0, accept_status = 0, sh = 0;
> sh doesn't need to be 64 bits wide, does it?
No, will change.
>
>>       int maxlvt, timeout, i;
>>   
>>       /*
>> @@ -445,6 +445,12 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>>       if ( tboot_in_measured_env() && !tboot_wake_ap(phys_apicid, start_eip) )
>>           return 0;
>>   
>> +    /*
>> +     * Use destination shorthand for broadcasting IPIs during boot.
>> +     */
> Nit (style): This is a single line comment.
Ack
>
>> +    if ( phys_apicid == BAD_APICID )
>> +        sh = APIC_DEST_ALLBUT;
> I think the latest for this the function parameter wants changing to
> unsigned int (in another prereq patch).
What do you mean, phys_apicid in wakeup_secondary_cpu()? It is passed
as signed int since __cpu_up(), should I change all of those to unsigned?
>
>> @@ -573,21 +578,31 @@ static int do_boot_cpu(int apicid, int cpu)
>>        */
>>       mtrr_save_state();
>>   
>> -    start_eip = bootsym_phys(trampoline_realmode_entry);
>> +    /* Check if AP is already up. */
>> +    if ( cpu_data[cpu].cpu_state != CPU_STATE_INIT )
>> +    {
>> +        /* This grunge runs the startup process for the targeted processor. */
>> +        unsigned long start_eip;
>> +        start_eip = bootsym_phys(trampoline_realmode_entry);
>>   
>> -    /* start_eip needs be page aligned, and below the 1M boundary. */
>> -    if ( start_eip & ~0xff000 )
>> -        panic("AP trampoline %#lx not suitably positioned\n", start_eip);
>> +        /* start_eip needs be page aligned, and below the 1M boundary. */
>> +        if ( start_eip & ~0xff000 )
>> +            panic("AP trampoline %#lx not suitably positioned\n", start_eip);
> Isn't this redundant now with the panic() in
> smp_send_init_sipi_sipi_allbutself(), at least as long as that runs
> unconditionally.
Won't be running unconditionally, but it also wouldn't be redundant in case
of hot-plugging.
>
>> -    /* So we see what's up   */
>> -    if ( opt_cpu_info )
>> -        printk("Booting processor %d/%d eip %lx\n",
>> -               cpu, apicid, start_eip);
>> +        /* So we see what's up   */
>> +        if ( opt_cpu_info )
>> +            printk("AP trampoline at %lx\n", start_eip);
> Why this change in log message? It makes messages for individual CPUs
> indistinguishable. And like above it's redundant with what
> smp_send_init_sipi_sipi_allbutself() logs.
>
>> -    /* This grunge runs the startup process for the targeted processor. */
>> +        /* mark "stuck" area as not stuck */
>> +        bootsym(trampoline_cpu_started) = 0;
>> +        smp_mb();
>>   
>> -    /* Starting actual IPI sequence... */
>> -    boot_error = wakeup_secondary_cpu(apicid, start_eip);
>> +        /* Starting actual IPI sequence... */
>> +        boot_error = wakeup_secondary_cpu(apicid, start_eip);
>> +    }
>> +
>> +    if ( opt_cpu_info )
>> +        printk("Booting processor %d/%d\n", cpu, apicid);
> Oh, here's the other half. Yet for above it still doesn't make sense
> to issue the same message for all CPUs.
I'll undo it. It was important at one point for debugging, but I agree
that it doesn't make sense now.
>
>> @@ -646,10 +661,6 @@ static int do_boot_cpu(int apicid, int cpu)
>>           rc = -EIO;
>>       }
>>   
>> -    /* mark "stuck" area as not stuck */
>> -    bootsym(trampoline_cpu_started) = 0;
>> -    smp_mb();
> While you move this up, it's not clear to me how you would now
> identify individual stuck CPUs. I would have expected that this is
> another global that needs converting up front, to be per-CPU.
In the existing code this is set very early, in 16b code, and the variable
is located within the first page of trampoline. With this change it is
impossible to identify individual stuck CPUs.

I was considering removing this variable altogether. Another option
would be to make this an array with NR_CPUS elements, but this may
get too big. It would be possible to fill this relatively early, after 
CPU ID
is obtained, before paging is enabled, but after loading IDT, GDT and
jumping to protected mode. Those are things that can break due to error
in the code, but it may be better than not having that info at all.

It could also be set from 64b code, that way simple per-CPU data would
work. It is a bit late, but this would probably be easiest and cleanest to
write.

>
>> @@ -1155,6 +1166,23 @@ static struct notifier_block cpu_smpboot_nfb = {
>>       .notifier_call = cpu_smpboot_callback
>>   };
>>   
>> +void smp_send_init_sipi_sipi_allbutself(void)
> __init?
Ack
>
>> +{
>> +    unsigned long start_eip;
>> +    start_eip = bootsym_phys(trampoline_realmode_entry);
> This can be the initializer of the variable, which would then save
> me from complaining about the missing blank line between declaration
> and statement(s). (Actually, as I notice only now - same for code you
> move around in do_boot_cpu().)
Will do. It may still be split due to line length, but at least that will
follow code style.
>
> Jan
Best regards,

-- 
Krystian Hebel
Firmware Engineer
https://3mdeb.com | @3mdeb_com



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 5/9] x86/smp: call x2apic_ap_setup() earlier
  2024-03-12 16:02     ` Krystian Hebel
@ 2024-03-13 13:05       ` Jan Beulich
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Beulich @ 2024-03-13 13:05 UTC (permalink / raw)
  To: Krystian Hebel; +Cc: Roger Pau Monné, xen-devel, Wei Liu, Andrew Cooper

On 12.03.2024 17:02, Krystian Hebel wrote:
> 
> On 7.02.2024 18:02, Jan Beulich wrote:
>> On 14.11.2023 18:50, Krystian Hebel wrote:
>>> It used to be called from smp_callin(), however BUG_ON() was invoked on
>>> multiple occasions before that. It may end up calling machine_restart()
>>> which tries to get APIC ID for CPU running this code. If BSP detected
>>> that x2APIC is enabled, get_apic_id() will try to use it for all CPUs.
>>> Enabling x2APIC on secondary CPUs earlier protects against an endless
>>> loop of #GP exceptions caused by attempts to read IA32_X2APIC_APICID
>>> MSR while x2APIC is disabled in IA32_APIC_BASE.
>>>
>>> Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
>>> ---
>>>   xen/arch/x86/smpboot.c | 9 ++++++++-
>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
>>> index 8ae65ab1769f..a3895dafa267 100644
>>> --- a/xen/arch/x86/smpboot.c
>>> +++ b/xen/arch/x86/smpboot.c
>>> @@ -184,7 +184,6 @@ static void smp_callin(void)
>>>        * update until we finish. We are free to set up this CPU: first the APIC.
>>>        */
>>>       Dprintk("CALLIN, before setup_local_APIC().\n");
>>> -    x2apic_ap_setup();
>>>       setup_local_APIC(false);
>>>   
>>>       /* Save our processor parameters. */
>>> @@ -351,6 +350,14 @@ void start_secondary(void *unused)
>>>       get_cpu_info()->xen_cr3 = 0;
>>>       get_cpu_info()->pv_cr3 = 0;
>>>   
>>> +    /*
>>> +     * BUG_ON() used in load_system_tables() and later code may end up calling
>>> +     * machine_restart() which tries to get APIC ID for CPU running this code.
>>> +     * If BSP detected that x2APIC is enabled, get_apic_id() will try to use it
>>> +     * for _all_ CPUs. Enable x2APIC on secondary CPUs now so we won't end up
>>> +     * with endless #GP loop.
>>> +     */
>>> +    x2apic_ap_setup();
>>>       load_system_tables();
>> While I find the argument convincing, I seem to recall that there was a
>> firm plan to have load_system_tables() as early as possible. Andrew?
> This is where the code failed for me during testing. How about moving
> x2apic_ap_setup() into load_system_tables(),

How does a call to x2apic_ap_setup() fit in a function named
load_system_tables()?

> just before BUG_ON? Or maybe
> move those BUG_ON one level higher, after load_system_tables() returns?

But they're there for a reason.

> Either way some code will end up in place it doesn't belong, but I'd 
> argue that
> BUG_ON is only useful if it itself doesn't crash.

I guess I don't understand this: That BUG_ON() is already guarded by a
system_state check, to prevent it uselessly hanging the system.

In any event - besides you still wanting to get input from Andrew, it
ought to be clear that anything unusual / unexpected will require extra
justification in the description.

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 6/9] x86/shutdown: protect against recurrent machine_restart()
  2024-03-12 16:05     ` Krystian Hebel
@ 2024-03-13 13:11       ` Jan Beulich
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Beulich @ 2024-03-13 13:11 UTC (permalink / raw)
  To: Krystian Hebel; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel, Wei Liu

On 12.03.2024 17:05, Krystian Hebel wrote:
> 
> On 8.02.2024 12:30, Jan Beulich wrote:
>> On 14.11.2023 18:50, Krystian Hebel wrote:
>>> If multiple CPUs called machine_restart() before actual restart took
>>> place, but after boot CPU declared itself not online,
>> Can you help me please in identifying where this operation is? I can see
>> two places where a CPU is removed from cpu_online_map, yet neither
>> __stop_this_cpu() nor __cpu_disable() ought to be coming into play here.
>> In fact I didn't think CPU0 was ever marked not-online. Except perhaps
>> if we came through machine_crash_shutdown() -> nmi_shootdown_cpus(), but
>> I'm sure you would have mentioned such a further dependency.
>>
> BUG_ON() in cpu_notifier_call_chain() (I've been playing with some of
> the notifiers and one of them eventually failed) resulted in panic()
> around the same time AP did in pm_idle() due to inconsistent settings
> between BSP and AP for MWAIT/MONITOR support after TXT dynamic
> launch. There is 5s delay between smp_send_stop() and actual reboot,
> during that time AP spammed the output so the original reason for
> panic() was visible only after unreasonable amount of scrolling.
> 
> Adding TXT support is the reason why I even started making AP bring-up
> parallel. Problem with MWAIT doesn't happen in current code or changes
> in this patchset, but handling of such error is related to SMP so I've 
> included it.

If you mean to address a latent problem, then you want to say so and you
want to make sure you include enough detail on the (future) conditions
under which the problem may happen. Otherwise anything you say wants to
match present code / behavior.

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 8/9] x86/smp: make cpu_state per-CPU
  2024-03-12 16:38     ` Krystian Hebel
@ 2024-03-13 13:21       ` Jan Beulich
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Beulich @ 2024-03-13 13:21 UTC (permalink / raw)
  To: Krystian Hebel; +Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, xen-devel

On 12.03.2024 17:38, Krystian Hebel wrote:
> On 8.02.2024 13:13, Jan Beulich wrote:
>> On 14.11.2023 18:50, Krystian Hebel wrote:
>>> @@ -320,6 +317,10 @@ void start_secondary(unsigned int cpu)
>>>   
>>>       /* Critical region without IDT or TSS.  Any fault is deadly! */
>>>   
>>> +    /* Wait until data set up by CPU_UP_PREPARE notifiers is ready. */
>>> +    while ( cpu_data[cpu].cpu_state != CPU_STATE_CALLOUT )
>>> +        cpu_relax();
>> I'm afraid I don't understand the comment (and hence whether this loop
>> is actually needed here): __cpu_up() is called only after those
>> notifiers completed.
> Yes, but broadcasted INIT-SIPI-SIPI sequence added in next patch will be
> sent before that call is made, and consequently APs potentially can get
> to this point before that data is set up.

That's fine, and I was able to conclude this once having read that following
patch. But the patch here, including its description, wants to the self-
contained.

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [XEN PATCH 9/9] x86/smp: start APs in parallel during boot
  2024-03-12 17:13     ` Krystian Hebel
@ 2024-03-13 13:30       ` Jan Beulich
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Beulich @ 2024-03-13 13:30 UTC (permalink / raw)
  To: Krystian Hebel; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel, Wei Liu

On 12.03.2024 18:13, Krystian Hebel wrote:
> 
> On 8.02.2024 13:37, Jan Beulich wrote:
>> On 14.11.2023 18:50, Krystian Hebel wrote:
>>> Multiple delays are required when sending IPIs and waiting for
>>> responses. During boot, 4 such IPIs were sent per each AP. With this
>>> change, only one set of broadcast IPIs is sent. This reduces boot time,
>>> especially for platforms with large number of cores.
>> Yet APs do their startup work in parallel only for a brief period of
>> time, if I'm not mistaken. Othwerwise I can't see why you'd still have
>> cpu_up() in __start_xen().
> cpu_up() is left because multiple notifiers aren't easy to convert to work
> in parallel. In terms of lines of code it looks like a brief period, but all
> the delays along the way were taking much more time than the actual
> work. As the gain was already more than what I hoped for, I decided
> against spending too much time trying to fix the notifiers' code for
> minimal profit.

Which is all fine. Just that by title of this patch and the cover letter
I expected more. Adding "partly" or some such in both places may help.

>>> --- a/xen/arch/x86/smpboot.c
>>> +++ b/xen/arch/x86/smpboot.c
>>> @@ -425,7 +425,7 @@ void start_secondary(unsigned int cpu)
>>>   
>>>   static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>>>   {
>>> -    unsigned long send_status = 0, accept_status = 0;
>>> +    unsigned long send_status = 0, accept_status = 0, sh = 0;
>> sh doesn't need to be 64 bits wide, does it?
> No, will change.
>>
>>>       int maxlvt, timeout, i;
>>>   
>>>       /*
>>> @@ -445,6 +445,12 @@ static int wakeup_secondary_cpu(int phys_apicid, unsigned long start_eip)
>>>       if ( tboot_in_measured_env() && !tboot_wake_ap(phys_apicid, start_eip) )
>>>           return 0;
>>>   
>>> +    /*
>>> +     * Use destination shorthand for broadcasting IPIs during boot.
>>> +     */
>> Nit (style): This is a single line comment.
> Ack
>>
>>> +    if ( phys_apicid == BAD_APICID )
>>> +        sh = APIC_DEST_ALLBUT;
>> I think the latest for this the function parameter wants changing to
>> unsigned int (in another prereq patch).
> What do you mean, phys_apicid in wakeup_secondary_cpu()? It is passed
> as signed int since __cpu_up(), should I change all of those to unsigned?

That would be best, yes. BAD_APICID, after all, is an unsigned constant
(no matter that its definition involves a unary minus operator).

Jan


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2024-03-13 13:30 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-14 17:49 [XEN PATCH 0/9] x86: parallelize AP bring-up during boot Krystian Hebel
2023-11-14 17:49 ` [XEN PATCH 1/9] x86/boot: choose AP stack based on APIC ID Krystian Hebel
2024-01-26 18:30   ` Julien Grall
2024-03-12 14:14     ` Krystian Hebel
2024-02-07 16:11   ` Jan Beulich
2024-03-12 15:11     ` Krystian Hebel
2024-03-12 15:40       ` Jan Beulich
2023-11-14 17:49 ` [XEN PATCH 2/9] x86: don't access x86_cpu_to_apicid[] directly, use cpu_physical_id(cpu) Krystian Hebel
2024-01-26 18:38   ` Julien Grall
2024-02-07 16:28   ` Jan Beulich
2024-03-12 15:18     ` Krystian Hebel
2024-03-12 15:42       ` Jan Beulich
2023-11-14 17:50 ` [XEN PATCH 3/9] x86/smp: drop x86_cpu_to_apicid, use cpu_data[cpu].apicid instead Krystian Hebel
2024-02-02 18:11   ` Julien Grall
2024-02-07 16:41     ` Jan Beulich
2024-03-12 15:29       ` Krystian Hebel
2024-03-12 15:49         ` Jan Beulich
2024-02-08  7:29   ` Jan Beulich
2024-03-12 15:35     ` Krystian Hebel
2023-11-14 17:50 ` [XEN PATCH 4/9] x86/smp: move stack_base to cpu_data Krystian Hebel
2024-02-02 18:24   ` Julien Grall
2024-02-05  8:41     ` Jan Beulich
2024-02-05  9:32       ` Julien Grall
2024-03-12 15:59       ` Krystian Hebel
2024-02-07 16:53   ` Jan Beulich
2023-11-14 17:50 ` [XEN PATCH 5/9] x86/smp: call x2apic_ap_setup() earlier Krystian Hebel
2024-02-07 17:02   ` Jan Beulich
2024-03-12 16:02     ` Krystian Hebel
2024-03-13 13:05       ` Jan Beulich
2023-11-14 17:50 ` [XEN PATCH 6/9] x86/shutdown: protect against recurrent machine_restart() Krystian Hebel
2024-02-08 11:30   ` Jan Beulich
2024-03-12 16:05     ` Krystian Hebel
2024-03-13 13:11       ` Jan Beulich
2023-11-14 17:50 ` [XEN PATCH 7/9] x86/smp: drop booting_cpu variable Krystian Hebel
2024-02-08 11:39   ` Jan Beulich
2024-03-12 16:16     ` Krystian Hebel
2023-11-14 17:50 ` [XEN PATCH 8/9] x86/smp: make cpu_state per-CPU Krystian Hebel
2024-02-08 12:13   ` Jan Beulich
2024-03-12 16:38     ` Krystian Hebel
2024-03-13 13:21       ` Jan Beulich
2023-11-14 17:50 ` [XEN PATCH 9/9] x86/smp: start APs in parallel during boot Krystian Hebel
2024-02-08 12:37   ` Jan Beulich
2024-03-12 17:13     ` Krystian Hebel
2024-03-13 13:30       ` Jan Beulich
2023-11-14 20:13 ` [XEN PATCH 0/9] x86: parallelize AP bring-up " Julien Grall

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).