Re: 回复: [PATCH] rcu: shrink each possible cpu krcp

From: Joel Fernandes <joel@joelfernandes.org>
To: "Paul E. McKenney" <paulmck@kernel.org>, boqun.feng@gmail.com
Cc: "Zhang, Qiang" <Qiang.Zhang@windriver.com>,
	Uladzislau Rezki <urezki@gmail.com>,
	Josh Triplett <josh@joshtriplett.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Lai Jiangshan <jiangshanlai@gmail.com>, rcu <rcu@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: 回复: [PATCH] rcu: shrink each possible cpu krcp
Date: Wed, 19 Aug 2020 11:54:40 -0400	[thread overview]
Message-ID: <20200819155440.GD3886445@google.com> (raw)
In-Reply-To: <20200819152159.GX27891@paulmck-ThinkPad-P72>

On Wed, Aug 19, 2020 at 08:21:59AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 19, 2020 at 09:56:54AM -0400, Joel Fernandes wrote:
> > On Wed, Aug 19, 2020 at 03:00:55AM +0000, Zhang, Qiang wrote:
> > > 
> > > 
> > > ________________________________________
> > > 发件人: linux-kernel-owner@vger.kernel.org <linux-kernel-owner@vger.kernel.org> 代表 Joel Fernandes <joel@joelfernandes.org>
> > > 发送时间: 2020年8月19日 8:04
> > > 收件人: Paul E. McKenney
> > > 抄送: Uladzislau Rezki; Zhang, Qiang; Josh Triplett; Steven Rostedt; Mathieu Desnoyers; Lai Jiangshan; rcu; LKML
> > > 主题: Re: [PATCH] rcu: shrink each possible cpu krcp
> > > 
> > > On Tue, Aug 18, 2020 at 6:02 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > 
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index b8ccd7b5af82..6decb9ad2421 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -2336,10 +2336,15 @@ int rcutree_dead_cpu(unsigned int cpu)
> > > > >  {
> > > > >         struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
> > > > >         struct rcu_node *rnp = rdp->mynode;  /* Outgoing CPU's rdp & rnp. */
> > > > > +       struct kfree_rcu_cpu *krcp;
> > > > >
> > > > >         if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
> > > > >                 return 0;
> > > > >
> > > > > +       /* Drain the kcrp of this CPU. IRQs should be disabled? */
> > > > > +       krcp = this_cpu_ptr(&krc)
> > > > > +       schedule_delayed_work(&krcp->monitor_work, 0);
> > > > > +
> > > > >
> > > > > A cpu can be offlined and its krp will be stuck until a shrinker is involved.
> > > > > Maybe be never.
> > > >
> > > > Does the same apply to its kmalloc() per-CPU caches?  If so, I have a
> > > > hard time getting too worried about it.  ;-)
> > > 
> > > >Looking at slab_offline_cpu() , that calls cancel_delayed_work_sync()
> > > >on the cache reaper who's job is to flush the per-cpu caches. So I
> > > >believe during CPU offlining, the per-cpu slab caches are flushed.
> > > >
> > > >thanks,
> > > >
> > >  >- Joel
> > > 
> > > When cpu going offline, the slub or slab only flush free objects in offline
> > > cpu cache,  put these free objects in node list  or return buddy system,
> > > for those who are still in use, they still stay offline cpu cache.
> > > 
> > > If we want clean per-cpu "krcp" objects when cpu going offline.  we should
> > > free "krcp" cache objects in "rcutree_offline_cpu", this func be called
> > > before other rcu cpu offline func. and then "rcutree_offline_cpu" will be
> > > called in "cpuhp/%u" per-cpu thread.
> > > 
> > 
> > Could you please wrap text properly when you post to mailing list, thanks. I
> > fixed it for you above.
> > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 8ce77d9ac716..1812d4a1ac1b 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3959,6 +3959,7 @@ int rcutree_offline_cpu(unsigned int cpu)
> > >         unsigned long flags;
> > >         struct rcu_data *rdp;
> > >         struct rcu_node *rnp;
> > > +       struct kfree_rcu_cpu *krcp;
> > >  
> > >         rdp = per_cpu_ptr(&rcu_data, cpu);
> > >         rnp = rdp->mynode;
> > > @@ -3970,6 +3971,11 @@ int rcutree_offline_cpu(unsigned int cpu)
> > >  
> > >         // nohz_full CPUs need the tick for stop-machine to work quickly
> > >         tick_dep_set(TICK_DEP_BIT_RCU);
> > > +
> > > +       krcp = per_cpu_ptr(&krc, cpu);
> > > +       raw_spin_lock_irqsave(&krcp->lock, flags);
> > > +       schedule_delayed_work(&krcp->monitor_work, 0);
> > > +       raw_spin_unlock_irqrestore(&krcp->lock, flags);
> > >         return 0;
> > 
> > I realized the above is not good enough for what this is trying to do. Unlike
> > the slab, the new kfree_rcu objects cannot always be drained / submitted to
> > RCU because the previous batch may still be waiting for a grace period. So
> > the above code could very well return with the yet-to-be-submitted kfree_rcu
> > objects still in the cache.
> > 
> > One option is to spin-wait here for monitor_todo to be false and keep calling
> > kfree_rcu_drain_unlock() till then.
> > 
> > But then that's not good enough either, because if new objects are queued
> > when interrupts are enabled in the CPU offline path, then the cache will get
> > new objects after the previous set was drained. Further, spin waiting may
> > introduce deadlocks.
> > 
> > Another option is to switch the kfree_rcu() path to non-batching (so new
> > objects cannot be cached in the offline path and are submitted directly to
> > RCU), wait for a GP and then submit the work. But then not sure if 1-argument
> > kfree_rcu() will like that.
> 
> Or spawn a workqueue that does something like this:
> 
> 1.	Get any pending kvfree_rcu() requests sent off to RCU.
> 
> 2.	Do an rcu_barrier().
> 
> 3.	Do the cleanup actions.

One more thing I'n not sure about is if this will even be an issue  because
the delayed-work to run the monitor could be migrated to another (both timers
and wq get migrated).  So in that case, the draining of the kfree_rcu objects
will happened on another CPU.

Adding Boqun as well since I believe he knows a lot about wq and hotplug as
well.

Basically my question is does delayed_work get migrated to another CPU when
one CPU is offline?  If so, I am assuming all races between new
schedule_delayed_work invocations and CPU offlining are handled as well, such
that at some point those new invocations are queued on another CPU.

> > Probably Qian's original fix for for_each_possible_cpus() is good enough for
> > the shrinker case, and then we can tackle the hotplug one.
> 
> It might take some experimentation to find the best solution.

Sure.

thanks,

 - Joel