On Thu, 2023-02-23 at 15:37 +0100, Thomas Gleixner wrote:
> David!
> 
> On Thu, Feb 23 2023 at 11:07, David Woodhouse wrote:
> > On Wed, 2023-02-22 at 17:42 +0100, Thomas Gleixner wrote:
> > > The low hanging fruit which brings most is the identification/topology
> > > muck and the microcode loading. That needs to be addressed first anyway.
> > 
> > Agreed, thanks.
> 
> So the problem with microcode loading is that we must ensure that a HT
> sibling is not executing anything else than a trivial loop waiting for
> the update to complete. So something like this should work:
> 
>    1) Kick all CPUs into life and let them run up to cpu_init() and
>       retrieve only the topology information.
>
>    2) Wait for all CPUs to reach this point
>
>    3) Release all primary HT threads so they can load microcode in
>       parallel. The secondary HT threads stay in the wait loop and are
>       released once the primary thread has finished the microcode
>       update.
> 
>    4) Let the CPUs do the full CPUID readout and let them synchronize
>       with the control CPU again.
> 
>    5) Complete bringup one by one


Can we move the microcode loading earlier, into the x86-specific
CPUHP_BP_PARALLEL_DYN stage(s), while the CPUs are running in
parallel?

In the existing set of patches, we send INIT/SIPI/SIPI to each CPU in
parallel and they run through the first part of start_secondary(), up
to the point where it calls cpu_init_secondary(). There each CPU sets
its bit in cpu_initialized_mask, then spins waiting for its bit in
cpu_callout_mask.


My "part 2" test patch does another round in parallel, setting each
CPU's bit in cpu_callout_mask and letting them run a bit further
through start_secondary() until they get to the end of smp_callin(),
where they set their bit in cpu_callin_mask and (in my patch) wait for
their bit in a new cpu_finishup_mask to be set. That is what releases
them to proceed to completion in the final native_cpu_up() bringup.
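
To make the handshake concrete, here is a single-threaded userspace toy
of those three stages. It is only a sketch: the mask names mirror the
ones above, but ap_run(), the stage_*() helpers, the ap_state enum and
online_mask are made up for illustration and are not kernel code.

```c
#include <stdint.h>

#define NR_APS 4

/* Toy bitmasks, one bit per AP. Names mirror the masks discussed above. */
static uint64_t cpu_initialized_mask, cpu_callout_mask;
static uint64_t cpu_callin_mask, cpu_finishup_mask;
static uint64_t online_mask;	/* hypothetical stand-in for "fully up" */

enum ap_state { AP_RESET, AP_WAIT_CALLOUT, AP_WAIT_FINISHUP, AP_ONLINE };
static enum ap_state ap_state[NR_APS];

/* Advance one AP as far as the masks currently allow (hypothetical helper). */
static void ap_run(int cpu)
{
	uint64_t bit = 1ULL << cpu;

	if (ap_state[cpu] == AP_RESET) {
		/* Woken by INIT/SIPI/SIPI: announce, then wait for callout. */
		cpu_initialized_mask |= bit;
		ap_state[cpu] = AP_WAIT_CALLOUT;
	}
	if (ap_state[cpu] == AP_WAIT_CALLOUT && (cpu_callout_mask & bit)) {
		/* End of smp_callin(): announce, then wait for finishup. */
		cpu_callin_mask |= bit;
		ap_state[cpu] = AP_WAIT_FINISHUP;
	}
	if (ap_state[cpu] == AP_WAIT_FINISHUP && (cpu_finishup_mask & bit)) {
		online_mask |= bit;
		ap_state[cpu] = AP_ONLINE;
	}
}

/* Stage 1: kick every AP; each stops at the callout wait. */
static void stage_kick_all(void)
{
	for (int cpu = 0; cpu < NR_APS; cpu++)
		ap_run(cpu);
}

/* Stage 2 ("part 2"): set all callout bits, APs run on to the callin wait. */
static void stage_callout_all(void)
{
	for (int cpu = 0; cpu < NR_APS; cpu++)
		cpu_callout_mask |= 1ULL << cpu;
	for (int cpu = 0; cpu < NR_APS; cpu++)
		ap_run(cpu);
}

/* Stage 3: the final bringup releases one AP at a time. */
static void stage_finish_one(int cpu)
{
	cpu_finishup_mask |= 1ULL << cpu;
	ap_run(cpu);
}
```

Calling stage_kick_all(), then stage_callout_all(), then
stage_finish_one() for each CPU in turn walks every AP through the same
sequence of waits, with only the last step serialized.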

So perhaps the BSP doesn't need to coordinate anything here, if we can
let the siblings work it out between themselves in the (now-)parallel
stage at the end of smp_callin(), and only set their bit in
cpu_callin_mask when the microcode update is done?

Hm, maybe it's as simple as the first¹ thread on a core waiting for all
its siblings' bits in cpu_callin_mask to be set, and *then* doing the
update before setting its own bit?

¹ As long as we define "first" as the one with the lowest CPU#, which
means that the BSP won't release any of the siblings before it releases
the "first".

Then the siblings are just spinning on cpu_callin_mask anyway; they
don't need to do anything *more*.
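
Something like the following, as a deterministic userspace toy rather
than a patch. Everything here is hypothetical apart from the name
cpu_callin_mask: callin_step(), run_core(), released_mask and the
counter standing in for the actual microcode load are invented for the
sketch. It also assumes the other siblings spin on the first thread's
bit in cpu_callin_mask, which is one way to read the above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 8

/* Toy state, not kernel API. */
static uint64_t cpu_callin_mask;	/* CPUs at the end of smp_callin() */
static uint64_t released_mask;		/* CPUs that have stopped spinning */
static int ucode_updates;		/* number of simulated updates */

static void reset_state(void)
{
	cpu_callin_mask = 0;
	released_mask = 0;
	ucode_updates = 0;
}

/* The "first" thread is the sibling with the lowest CPU number. */
static int first_thread(uint64_t siblings)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (siblings & (1ULL << cpu))
			return cpu;
	return -1;
}

/*
 * One polling step at the end of smp_callin() for @cpu.
 * Returns true once @cpu may proceed, false while it must keep spinning.
 */
static bool callin_step(int cpu, uint64_t siblings)
{
	uint64_t self = 1ULL << cpu;
	uint64_t first, others;

	assert(siblings & self);
	first = 1ULL << first_thread(siblings);
	others = siblings & ~first;

	if (self != first) {
		/* Announce ourselves, then spin until the first thread is done. */
		cpu_callin_mask |= self;
		return (cpu_callin_mask & first) != 0;
	}

	/* First thread: wait until every other sibling is parked. */
	if ((cpu_callin_mask & others) != others)
		return false;

	/* No sibling may have been let go before the update runs. */
	assert(!(released_mask & others));
	ucode_updates++;		/* stand-in for the microcode load */
	cpu_callin_mask |= self;	/* only now set our own bit */
	return true;
}

/*
 * Poll the siblings round-robin in the given arrival order.
 * Returns the number of rounds needed, or -1 if they never all proceed.
 */
static int run_core(uint64_t siblings, const int *order, int n)
{
	for (int round = 1; round <= n + 1; round++) {
		for (int i = 0; i < n; i++) {
			uint64_t bit = 1ULL << order[i];

			if (!(released_mask & bit) &&
			    callin_step(order[i], siblings))
				released_mask |= bit;
		}
		if (released_mask == siblings)
			return round;
	}
	return -1;
}
```

In this model the update runs exactly once per core whichever sibling
arrives first, and a core with a single thread just does the update
immediately. It says nothing about real memory ordering or about what
the BSP is doing meanwhile.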

Probably worth knocking it up and seeing how badly it explodes?