From: David Woodhouse <dwmw2@infradead.org>
To: paulmck@kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Andy Lutomirski <luto@kernel.org>,
	"shenkai (D)" <shenkai8@huawei.com>,
	"Schander, Johanna 'Mimoja' Amelie" <mimoja@amazon.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	X86 ML <x86@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>,
	hewenliang4@huawei.com, hushiyuan@huawei.com,
	luolongjun@huawei.com, hejingxian <hejingxian@huawei.com>
Subject: Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation
Date: Wed, 08 Dec 2021 15:10:57 +0000
Message-ID: <0824902894565e850b79e494c38a7856f8358b99.camel@infradead.org>
In-Reply-To: <20211208145047.GR641268@paulmck-ThinkPad-P17-Gen-1>


On Wed, 2021-12-08 at 06:50 -0800, Paul E. McKenney wrote:
> On Wed, Dec 08, 2021 at 02:14:35PM +0000, David Woodhouse wrote:
> > > Actually it breaks before that, in rcu_cpu_starting(). A spinlock
> > > around that, an atomic_t to let the APs do their TSC sync one at a time
> > > (both in the above tree now), and I have a 75% saving on CPU bringup
> > > time for my 28-thread Haswell:
> > 
> > Coming back to this, I've updated it and thrown up a new branch at 
> > https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/parallel-5.16
> > 
> > For those last two fixes I had started with a trivial naïve approach of
> > just enforcing serialization.
> > 
> > I'm sure we can come up with a cleverer 1:N way of synchronizing the
> > TSCs, instead of just serializing the existing 1:1 sync.
> > 
> > For rcu_cpu_starting() I see there's *already* a lock in the
> > rcu_node... could we use that same lock to protect the manipulation of
> > rnp->ofl_seq and allow rcu_cpu_starting() to be invoked by multiple APs
> > in parallel? Paul?
> > 
> > On a related note, are you currently guaranteed that rcu_report_dead()
> > cannot be called more than once in parallel? Might you want the same
> > locking there?
> 
> Just to make sure I understand, the goal here is to bring multiple CPUs
> online concurrently, correct?  If so, this will take some digging to
> check up on the current implicit assumptions about CPU-hotplug operations
> being serialized.  Some of which might even be documented.  ;-)

Indeed. All the APs end up calling rcu_cpu_starting() at the same time,
and bad things happen. So I took the naïve "stick a lock around it"
approach and the problem went away:

commit 60984965ac1945446eb23ff5a87a4c7acc821409
Author: David Woodhouse <dwmw@amazon.co.uk>
Date:   Tue Feb 16 15:04:34 2021 +0000

    rcu: Add locking around rcu_cpu_starting()
    
    To allow architectures to bring APs online in parallel, we need locking
    around rcu_cpu_starting(). Perhaps we could use the existing node
    spinlock... Paul?
    
    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ef8d36f580fc..5816d3d40c6c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4231,6 +4231,8 @@ int rcutree_offline_cpu(unsigned int cpu)
  * from the incoming CPU rather than from the cpuhp_step mechanism.
  * This is because this function must be invoked at a precise location.
  */
+static DEFINE_RAW_SPINLOCK(rcu_startup_lock);
+
 void rcu_cpu_starting(unsigned int cpu)
 {
        unsigned long flags;
@@ -4239,9 +4241,11 @@ void rcu_cpu_starting(unsigned int cpu)
        struct rcu_node *rnp;
        bool newcpu;
 
+       raw_spin_lock(&rcu_startup_lock);
+
        rdp = per_cpu_ptr(&rcu_data, cpu);
        if (rdp->cpu_started)
-               return;
+               goto out;
        rdp->cpu_started = true;
 
        rnp = rdp->mynode;
@@ -4273,6 +4277,8 @@ void rcu_cpu_starting(unsigned int cpu)
        WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
        WARN_ON_ONCE(rnp->ofl_seq & 0x1);
        smp_mb(); /* Ensure RCU read-side usage follows above initialization. */
+ out:
+       raw_spin_unlock(&rcu_startup_lock);
 }
 
 /*
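
(As an aside, the TSC-sync serialization I mentioned above is equally
naïve; conceptually it's no more than the sketch below. Illustrative
only, not the actual commit in my tree, and the atomic_t and helper
names are made up:)

static atomic_t tsc_sync_turn = ATOMIC_INIT(0);

/*
 * Sketch: let each AP run the existing 1:1 TSC sync protocol
 * against the BSP one at a time, instead of all at once. The
 * source side (check_tsc_sync_source()) pairs with this as before.
 */
static void ap_check_tsc_sync_serialized(void)
{
	/* Take our turn; spin until the previous AP has finished. */
	while (atomic_cmpxchg(&tsc_sync_turn, 0, 1) != 0)
		cpu_relax();

	check_tsc_sync_target();	/* the existing 1:1 sync */

	/* Let the next AP in. */
	atomic_set(&tsc_sync_turn, 0);
}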


On coming back to this and trying to work out the *correct* fix, I see
that there's already a per-node spinlock covering most of this
function, and I strongly suspect that the better fix is as simple as
expanding the coverage of the existing lock? But I have questions...

--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4246,11 +4246,11 @@ void rcu_cpu_starting(unsigned int cpu)
 
        rnp = rdp->mynode;
        mask = rdp->grpmask;
+       raw_spin_lock_irqsave_rcu_node(rnp, flags);
        WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
        WARN_ON_ONCE(!(rnp->ofl_seq & 0x1));
        rcu_dynticks_eqs_online();
        smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
-       raw_spin_lock_irqsave_rcu_node(rnp, flags);
        WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
        newcpu = !(rnp->expmaskinitnext & mask);
        rnp->expmaskinitnext |= mask;
@@ -4266,13 +4266,13 @@ void rcu_cpu_starting(unsigned int cpu)
                rcu_disable_urgency_upon_qs(rdp);
                /* Report QS -after- changing ->qsmaskinitnext! */
                rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
-       } else {
-               raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+               /* Er, why didn't we drop the lock here? */
        }
        smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier().
        WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1);
        WARN_ON_ONCE(rnp->ofl_seq & 0x1);
        smp_mb(); /* Ensure RCU read-side usage follows above initialization. */
+       raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
 /*



> But first...  Is it just bringing CPUs online that is to happen
> concurrently?  Or is it also necessary to take CPUs offline concurrently?

Yes. That is: *First*, it is just bringing CPUs online that is to
happen concurrently. :)

As for offlining... well, if the cpuhp code doesn't intentionally
guarantee that your existing rcu_report_dead() function can only be
called once at a time, and you're merely relying on the fact that
*empirically* that seems to be the case, then you are naughty and
should give it the same kind of locking we speak of above.
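
Concretely, I mean the same brute-force treatment as the
rcu_cpu_starting() hack above. A sketch only; the lock name is made
up and the existing function body is elided:

static DEFINE_RAW_SPINLOCK(rcu_report_dead_lock);

void rcu_report_dead(unsigned int cpu)
{
	/*
	 * Sketch: serialize concurrent callers, in case cpuhp does
	 * not actually guarantee that only one CPU at a time gets
	 * here on the offline path.
	 */
	raw_spin_lock(&rcu_report_dead_lock);

	/* ... existing body of rcu_report_dead() ... */

	raw_spin_unlock(&rcu_report_dead_lock);
}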

And although I don't currently have designs on parallel offlining, ask
me again after I've audited the kexec code and seen how long it takes
to do that in series (if it's properly offlining them at all). 
