Re: [PATCH v3] rcu: Remove impossible wakeup rcu GP kthread action from rcu_report_qs_rdp()

From: Joel Fernandes <joel@joelfernandes.org>
To: Frederic Weisbecker <frederic@kernel.org>
Cc: "Zhang, Qiang1" <qiang1.zhang@intel.com>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	quic_neeraju@quicinc.com, rcu@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3] rcu: Remove impossible wakeup rcu GP kthread action from rcu_report_qs_rdp()
Date: Tue, 24 Jan 2023 16:58:26 +0000	[thread overview]
Message-ID: <Y9AOMhy4GCyaqxdG@google.com> (raw)
In-Reply-To: <Y873895orjOvPNg9@lothringen>

Hi Frederic,

On Mon, Jan 23, 2023 at 10:11:15PM +0100, Frederic Weisbecker wrote:
> On Mon, Jan 23, 2023 at 03:54:07PM -0500, Joel Fernandes wrote:
> > > On Jan 23, 2023, at 11:27 AM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > > 
> > > On Mon, Jan 23, 2023 at 10:22:19AM -0500, Joel Fernandes wrote:
> > >>> What am I missing?
> > >> 
> > >> That the acceleration is also done by __note_gp_changes() once the
> > >> grace period ends anyway, so if any acceleration was missed as you
> > >> say, it will be done anyway.
> > >> 
> > >> Also it is done by scheduler tick raising softirq:
> > >> 
> > >> rcu_pending() does this:
> > >>        /* Has RCU gone idle with this CPU needing another grace period? */
> > >>        if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) &&
> > >>            !rcu_rdp_is_offloaded(rdp) &&
> > >>            !rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
> > >>                return 1;
> > >> 
> > >> and rcu_core():
> > >>        /* No grace period and unregistered callbacks? */
> > >>        if (!rcu_gp_in_progress() &&
> > >>            rcu_segcblist_is_enabled(&rdp->cblist) && do_batch) {
> > >>                rcu_nocb_lock_irqsave(rdp, flags);
> > >>                if (!rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
> > >>                        rcu_accelerate_cbs_unlocked(rnp, rdp);
> > >>                rcu_nocb_unlock_irqrestore(rdp, flags);
> > >>        }
> > >> 
> > >> So, I am not sure if you need needacc at all. Those CBs that have not
> > >> been assigned grace period numbers will be taken care off :)
> > > 
> > > But that's only when there is no grace period pending, so it can't happen while
> > > we report a QS.
> > > 
> > > OTOH without the needacc, those callbacks waiting to be accelerated would be
> > > eventually processed but only on the next tick following the end of a grace
> > > period...if none has started since then. So if someone else starts a new GP
> > > before the current CPU, we must wait another GP, etc...
> > > 
> > > That's potentially dangerous.
> > 
> > Waiting for just one more GP cannot be dangerous IMO. Anyway there is no
> > guarantee that callback will run immediately at end of GP, there may be one or
> > more GPs before callback can run, if I remember correctly. That is by
> > design.. but please correct me if my understanding is different from yours.
> 
> It's not bound to just one GP. If you have N CPUs flooding callbacks for a
> long while, your CPU has 1/N chances to be the one starting the next GP on
> each turn. Relying on the acceleration to be performed only when no GP is
> running may theoretically starve your callbacks forever.

I tend to agree with you, however I was mostly referring to the "needac". The
theoretical case is even more theoretical because you have to be half way
through the transition _and_ flooding at the same time such that there is not
a chance for RCU to be idle for a long period of time (which we don't really
see in our systems per-se). But I agree, even if theoretical, maybe better to
handle it.

> > > And unfortunately we can't do the acceleration from __note_gp_changes()
> > > due to lock ordering restrictions: nocb_lock -> rnp_lock
> > > 

Yeah I was more going for the fact that if the offload to deoffload
transition completed before note_gp_changes() was called, which obviously we
cannot assume...

But yeah for the 'limbo between offload to not offload state', eventually RCU
goes idle, the scheduling clock interrupt (via rcu_pending()) will raise
softirq or note_gp_changes() will accelerate. Which does not help the
theoretical flooding case above, but still...

> > 
> > Ah. This part I am not sure. Appreciate if point me to any old archive
> > links or documentation detailing that, if possible… 
> 
> It's not documented but the code in nocb_gp_wait() or nocb_cb_wait() has
> that locking order for example.
> 
> Excerpt:
> 
> 	rcu_nocb_lock_irqsave(rdp, flags); if (rcu_segcblist_nextgp(cblist,
> 	&cur_gp_seq) && rcu_seq_done(&rnp->gp_seq, cur_gp_seq) &&
> 	raw_spin_trylock_rcu_node(rnp)) { /* irqs already disabled. */
> 	needwake_gp = rcu_advance_cbs(rdp->mynode, rdp);
> 	raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ }
> 
> Thanks.

Ah I know what you mean now, in other words there is a chance of an ABBA
deadlock if we do not acquire the locks in the correct order. Thank you for
clarifying. Indeed rcutree_migrate_callbacks() also confirms it.

I guess to Frederic's point, another case other than the CB flooding, that can
starve CB accelerations is if the system is in the 'limbo between offload and
de-offload' state for a long period of time. In such a situation also, neither
note_gp_changes(), nor the scheduling-clock interrupt via softirq will help
accelerate CBs.

Whether such a limbo state is possible indefinitely, I am not sure...

Thanks a lot for the discussions and it is always a learning experience
talking about these...!

thanks,

 - Joel