On Thu, 2023-02-23 at 15:37 +0100, Thomas Gleixner wrote:
> David!
>
> On Thu, Feb 23 2023 at 11:07, David Woodhouse wrote:
> > On Wed, 2023-02-22 at 17:42 +0100, Thomas Gleixner wrote:
> > > The low hanging fruit which brings most is the identification/topology
> > > muck and the microcode loading. That needs to be addressed first anyway.
> >
> > Agreed, thanks.
>
> So the problem with microcode loading is that we must ensure that a HT
> sibling is not executing anything else than a trivial loop waiting for
> the update to complete. So something like this should work:
>
>    1) Kick all CPUs into life and let them run up to cpu_init() and
>       retrieve only the topology information.
>
>    2) Wait for all CPUs to reach this point
>
>    3) Release all primary HT threads so they can load microcode in
>       parallel. The secondary HT threads stay in the wait loop and are
>       released once the primary thread has finished the microcode
>       update.
>
>    4) Let the CPUs do the full CPUID readout and let them synchronize
>       with the control CPU again.
>
>    5) Complete bringup one by one

Can we move the microcode loading to happen earlier, during the
x86-specific CPUHP_BP_PARALLEL_DYN stage(s), while the CPUs are running
in parallel?

In the existing set of patches, we send INIT/SIPI/SIPI to each CPU in
parallel and they run through the first part of start_secondary(), up to
the point where it calls cpu_init_secondary() and sets their bit in
cpu_initialized_mask. They then spin, waiting for their bit in
cpu_callout_mask.

My "part 2" test patch does another round in parallel, setting each
CPU's bit in cpu_callout_mask and letting them run a bit further through
start_secondary() until they get to the end of smp_callin(), where they
set their bit in cpu_callin_mask and (in my patch) wait for their bit in
a new cpu_finishup_mask to be set, which is what releases them to
proceed to completion in the final native_cpu_up() bringup.

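To make the chain of waits concrete, here is a userspace toy model of what each AP does across those rounds. It is only a sketch and not kernel code: struct bringup_masks, enum ap_stage and ap_advance() are hypothetical names, a plain 64-bit word stands in for each cpumask, and the "spinning" is modelled by returning the stage the AP is stuck in instead of actually looping.

```c
#include <assert.h>
#include <stdint.h>

/* Where an AP is currently parked (hypothetical names, illustration only). */
enum ap_stage { AP_WAIT_CALLOUT, AP_WAIT_FINISHUP, AP_DONE };

/* One 64-bit word per mask stands in for a real cpumask; bit N is CPU N. */
struct bringup_masks {
    uint64_t initialized;   /* set by the AP after cpu_init_secondary() */
    uint64_t callout;       /* set by the BSP to start the second round  */
    uint64_t callin;        /* set by the AP at the end of smp_callin()  */
    uint64_t finishup;      /* set by the BSP in the final bringup step  */
};

/*
 * Run one AP as far as the masks currently allow and report where it
 * stops. A real AP would spin at each wait; here we simply return.
 */
enum ap_stage ap_advance(struct bringup_masks *m, int cpu)
{
    uint64_t bit;

    assert(cpu >= 0 && cpu < 64);
    bit = UINT64_C(1) << cpu;

    m->initialized |= bit;          /* first part of start_secondary() */
    if (!(m->callout & bit))
        return AP_WAIT_CALLOUT;     /* round 1 ends here */

    m->callin |= bit;               /* end of smp_callin() */
    if (!(m->finishup & bit))
        return AP_WAIT_FINISHUP;    /* round 2 ends here */

    return AP_DONE;                 /* released to complete bringup */
}
```

Calling ap_advance() once per round, with the BSP side setting the callout bit and then the finishup bit in between, walks a CPU through AP_WAIT_CALLOUT, AP_WAIT_FINISHUP and AP_DONE in that order.
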
So perhaps the BSP doesn't need to coordinate anything here, if we can
let the siblings work it out between themselves in the (now-)parallel
stage at the end of smp_callin()? And only set their bit in
cpu_callin_mask when the microcode update is done?

Hm, maybe it's as simple as the first¹ thread on a core waiting for all
its siblings' bits in cpu_callin_mask to be set, and *then* doing the
update before setting its own bit?

¹ As long as we define "first" as the one with the lowest CPU#, which
  means that the BSP won't release any of the siblings before it
  releases the "first". Then the siblings are just spinning on
  cpu_callin_mask anyway; they don't need to do anything *more*.

Probably worth knocking it up and seeing how badly it explodes?
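
As a rough illustration of the check the "first" thread would make, here is the predicate as a userspace sketch. This is not the kernel implementation: mask_t, lowest_cpu(), is_first_sibling() and can_update_microcode() are hypothetical names, and a 64-bit word stands in for both the core's sibling mask and cpu_callin_mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: bit N represents CPU N (max 64). */
typedef uint64_t mask_t;

/* Lowest-numbered CPU in the mask, or -1 if the mask is empty. */
int lowest_cpu(mask_t m)
{
    for (int cpu = 0; cpu < 64; cpu++)
        if (m & (UINT64_C(1) << cpu))
            return cpu;
    return -1;
}

/* The "first" thread of a core is the sibling with the lowest CPU number. */
bool is_first_sibling(int cpu, mask_t siblings)
{
    /* A CPU must be a member of its own sibling mask. */
    assert(cpu >= 0 && cpu < 64 && (siblings & (UINT64_C(1) << cpu)));
    return lowest_cpu(siblings) == cpu;
}

/*
 * May this CPU load microcode now? Only if it is the first thread on
 * its core and every *other* sibling has already set its callin bit,
 * i.e. is doing nothing but spinning.
 */
bool can_update_microcode(int cpu, mask_t siblings, mask_t callin)
{
    mask_t others = siblings & ~(UINT64_C(1) << cpu);

    return is_first_sibling(cpu, siblings) && (callin & others) == others;
}
```

For a core whose threads are CPU 2 and CPU 6 (sibling mask 0x44), CPU 2 gets false while the callin mask is empty and true once bit 6 is set; CPU 6 never gets true, so it only ever sets its own bit and waits.
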