Re: rcu_process_callbacks irqsoff latency caused by taking spinlock with irqs disabled

From: Nicholas Piggin <npiggin@gmail.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: rcu_process_callbacks irqsoff latency caused by taking spinlock with irqs disabled
Date: Sat, 7 Apr 2018 07:40:42 +1000	[thread overview]
Message-ID: <20180407074042.0c50a59a@roar.ozlabs.ibm.com> (raw)
In-Reply-To: <20180405155320.GN3948@linux.vnet.ibm.com>

On Thu, 5 Apr 2018 08:53:20 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Thu, Apr 05, 2018 at 10:45:12AM +1000, Nicholas Piggin wrote:
> > On Wed, 4 Apr 2018 17:13:58 -0700
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >   
> > > On Thu, Apr 05, 2018 at 09:34:14AM +1000, Nicholas Piggin wrote:  
> > > > Hi Paul,
> > > > 
> > > > Just looking at latencies, and RCU showed up as one of the maximums.
> > > > This is a 2 socket system with (176 CPU threads). Just doing a
> > > > `make -j 352` kernel build. Got a max latency of 3ms. I don't think
> > > > that's anything to worry about really, but I wanted to check the
> > > > cause.    
> > > 
> > > Well, that 3 milliseconds would cause serious problems for some workloads...  
> > 
> > True.
> >   
> > > > # tracer: irqsoff
> > > > #
> > > > # irqsoff latency trace v1.1.5 on 4.16.0-01530-g43d1859f0994
> > > > # --------------------------------------------------------------------
> > > > # latency: 3055 us, #19/19, CPU#135 | (M:server VP:0, KP:0, SP:0 HP:0 #P:176)
> > > > #    -----------------
> > > > #    | task: cc1-58689 (uid:1003 nice:0 policy:0 rt_prio:0)
> > > > #    -----------------
> > > > #  => started at: rcu_process_callbacks
> > > > #  => ended at:   _raw_spin_unlock_irqrestore
> > > > #
> > > > #
> > > > #                  _------=> CPU#            
> > > > #                 / _-----=> irqs-off        
> > > > #                | / _----=> need-resched    
> > > > #                || / _---=> hardirq/softirq 
> > > > #                ||| / _--=> preempt-depth   
> > > > #                |||| /     delay            
> > > > #  cmd     pid   ||||| time  |   caller      
> > > > #     \   /      |||||  \    |   /         
> > > >    <...>-58689 135d.s.    0us : rcu_process_callbacks
> > > >    <...>-58689 135d.s.    1us : cpu_needs_another_gp <-rcu_process_callbacks
> > > >    <...>-58689 135d.s.    2us : rcu_segcblist_future_gp_needed <-cpu_needs_another_gp
> > > >    <...>-58689 135d.s.    3us#: _raw_spin_lock <-rcu_process_callbacks
> > > >    <...>-58689 135d.s. 3047us : rcu_start_gp <-rcu_process_callbacks
> > > >    <...>-58689 135d.s. 3048us : rcu_advance_cbs <-rcu_start_gp
> > > >    <...>-58689 135d.s. 3049us : rcu_segcblist_pend_cbs <-rcu_advance_cbs
> > > >    <...>-58689 135d.s. 3049us : rcu_segcblist_advance <-rcu_advance_cbs
> > > >    <...>-58689 135d.s. 3050us : rcu_accelerate_cbs <-rcu_start_gp
> > > >    <...>-58689 135d.s. 3050us : rcu_segcblist_pend_cbs <-rcu_accelerate_cbs
> > > >    <...>-58689 135d.s. 3051us : rcu_segcblist_accelerate <-rcu_accelerate_cbs
> > > >    <...>-58689 135d.s. 3052us : trace_rcu_future_gp.isra.0 <-rcu_accelerate_cbs
> > > >    <...>-58689 135d.s. 3052us : trace_rcu_future_gp.isra.0 <-rcu_accelerate_cbs
> > > >    <...>-58689 135d.s. 3053us : rcu_start_gp_advanced.isra.35 <-rcu_start_gp
> > > >    <...>-58689 135d.s. 3053us : cpu_needs_another_gp <-rcu_start_gp_advanced.isra.35
> > > >    <...>-58689 135d.s. 3054us : _raw_spin_unlock_irqrestore <-rcu_process_callbacks
> > > >    <...>-58689 135d.s. 3055us : _raw_spin_unlock_irqrestore
> > > >    <...>-58689 135d.s. 3056us : trace_hardirqs_on <-_raw_spin_unlock_irqrestore
> > > >    <...>-58689 135d.s. 3061us : <stack trace>
> > > > 
> > > > So it's taking a contende lock with interrupts disabled:
> > > > 
> > > > static void
> > > > __rcu_process_callbacks(struct rcu_state *rsp)
> > > > {
> > > >         unsigned long flags;
> > > >         bool needwake;
> > > >         struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
> > > > 
> > > >         WARN_ON_ONCE(!rdp->beenonline);
> > > > 
> > > >         /* Update RCU state based on any recent quiescent states. */
> > > >         rcu_check_quiescent_state(rsp, rdp);
> > > > 
> > > >         /* Does this CPU require a not-yet-started grace period? */
> > > >         local_irq_save(flags);
> > > >         if (cpu_needs_another_gp(rsp, rdp)) {
> > > >                 raw_spin_lock_rcu_node(rcu_get_root(rsp)); /* irqs disabled. */
> > > >                 needwake = rcu_start_gp(rsp);
> > > >                 raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(rsp), flags);
> > > >                 if (needwake)
> > > >                         rcu_gp_kthread_wake(rsp);
> > > >         } else {
> > > >                 local_irq_restore(flags);
> > > >         }
> > > > 
> > > > Because irqs are disabled before taking the lock, we can't spin with
> > > > interrupts enabled.
> > > > 
> > > > cpu_needs_another_gp needs interrupts off to prevent races with normal
> > > > callback registry, but that doesn't seem to be preventing any wider
> > > > races in this code, because we immediately re-enable interrupts anyway
> > > > if no gp is needed. So an interrupt can come in right after that and
> > > > queue something up.
> > > > 
> > > > So then the question is whether it's safe-albeit-racy to call with
> > > > interrupts ensabled? Would be nice to move it to a spin_lock_irqsave.    
> > > 
> > > If I recall correctly, the issue is that an unsynchronized (due to
> > > interrupts being enabled) check in the "if" statement can cause extra
> > > unneeded grace periods.  
> 
> But doesn't the code unconditionally disable interrupts before
> going for the lock in any case?
> 
> static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
> {
> 	local_irq_disable();
> 	preempt_disable();
> 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> 	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
> }

The spin_lock_irq variant does, but _irqsave is allowed to drop the
lock while spinning. I think there is no reason why _irq variant
could not do that as well.

> 
> > If the check is relatively cheap, perhaps you could do a second race
> > free verification after taking the lock and disabling interrupts?  
> 
> Given the above, I am not convinced that doing so will help.
> What am I missing?

Possibly if you can use an _irqsave lock?

> > > I am guessing that the long latency is caused by lots of CPUs suddenly
> > > needing a new grace period at about the same time.  If so, this is
> > > a bottleneck that I have been expecting for some time, and one that
> > > I would resolve by introducing funnel locking, sort of like SRCU and
> > > expedited grace periods already use.
> > > 
> > > Or am I confused about the source of the contention?  
> > 
> > The irqsoff tracer doesn't give you a lot of detail about locking
> > and concurrency, so it's hard for me to say. Any particular tests
> > or details I could try to get, that would help?  
> 
> Something checking for who holds the lock.  Let's see...  The
> rcu_grace_period trace event might be a good start.  Having a trace for
> each lock acquisition and release would be more to the point, except that
> this would likely overflow the trace buffer to the point of uselessness.
> 
> Other thoughts?

Yeah, I'm not too sure, it would be nice to have maximum wait time events
be able to trigger collection of traces of other holders that contribute
to your latency.

Might be infeasible to do it perfectly, but something good enough might be
possible.

> 
> > Note that rcu doesn't show up consistently at the top, this was
> > just one that looked *maybe* like it can be improved. So I don't
> > know how reproducible it is.  
> 
> Ah, that leads me to wonder whether the hypervisor preempted whoever is
> currently holding the lock.  Do we have anything set up to detect that
> sort of thing?

In this case it was running on bare metal, so it was a genuine latency
event. It just hasn't been consistently at the top (scheduler has been
there, but I'm bringing that down with tuning).

Thanks,
Nick