linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* rcu_process_callbacks irqsoff latency caused by taking spinlock with irqs disabled
@ 2018-04-04 23:34 Nicholas Piggin
  2018-04-05  0:13 ` Paul E. McKenney
  0 siblings, 1 reply; 7+ messages in thread
From: Nicholas Piggin @ 2018-04-04 23:34 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Linux Kernel Mailing List

Hi Paul,

Just looking at latencies, and RCU showed up as one of the maximums.
This is a 2 socket system with (176 CPU threads). Just doing a
`make -j 352` kernel build. Got a max latency of 3ms. I don't think
that's anything to worry about really, but I wanted to check the
cause.

# tracer: irqsoff
#
# irqsoff latency trace v1.1.5 on 4.16.0-01530-g43d1859f0994
# --------------------------------------------------------------------
# latency: 3055 us, #19/19, CPU#135 | (M:server VP:0, KP:0, SP:0 HP:0 #P:176)
#    -----------------
#    | task: cc1-58689 (uid:1003 nice:0 policy:0 rt_prio:0)
#    -----------------
#  => started at: rcu_process_callbacks
#  => ended at:   _raw_spin_unlock_irqrestore
#
#
#                  _------=> CPU#            
#                 / _-----=> irqs-off        
#                | / _----=> need-resched    
#                || / _---=> hardirq/softirq 
#                ||| / _--=> preempt-depth   
#                |||| /     delay            
#  cmd     pid   ||||| time  |   caller      
#     \   /      |||||  \    |   /         
   <...>-58689 135d.s.    0us : rcu_process_callbacks
   <...>-58689 135d.s.    1us : cpu_needs_another_gp <-rcu_process_callbacks
   <...>-58689 135d.s.    2us : rcu_segcblist_future_gp_needed <-cpu_needs_another_gp
   <...>-58689 135d.s.    3us#: _raw_spin_lock <-rcu_process_callbacks
   <...>-58689 135d.s. 3047us : rcu_start_gp <-rcu_process_callbacks
   <...>-58689 135d.s. 3048us : rcu_advance_cbs <-rcu_start_gp
   <...>-58689 135d.s. 3049us : rcu_segcblist_pend_cbs <-rcu_advance_cbs
   <...>-58689 135d.s. 3049us : rcu_segcblist_advance <-rcu_advance_cbs
   <...>-58689 135d.s. 3050us : rcu_accelerate_cbs <-rcu_start_gp
   <...>-58689 135d.s. 3050us : rcu_segcblist_pend_cbs <-rcu_accelerate_cbs
   <...>-58689 135d.s. 3051us : rcu_segcblist_accelerate <-rcu_accelerate_cbs
   <...>-58689 135d.s. 3052us : trace_rcu_future_gp.isra.0 <-rcu_accelerate_cbs
   <...>-58689 135d.s. 3052us : trace_rcu_future_gp.isra.0 <-rcu_accelerate_cbs
   <...>-58689 135d.s. 3053us : rcu_start_gp_advanced.isra.35 <-rcu_start_gp
   <...>-58689 135d.s. 3053us : cpu_needs_another_gp <-rcu_start_gp_advanced.isra.35
   <...>-58689 135d.s. 3054us : _raw_spin_unlock_irqrestore <-rcu_process_callbacks
   <...>-58689 135d.s. 3055us : _raw_spin_unlock_irqrestore
   <...>-58689 135d.s. 3056us : trace_hardirqs_on <-_raw_spin_unlock_irqrestore
   <...>-58689 135d.s. 3061us : <stack trace>

So it's taking a contende lock with interrupts disabled:

static void
__rcu_process_callbacks(struct rcu_state *rsp)
{
        unsigned long flags;
        bool needwake;
        struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);

        WARN_ON_ONCE(!rdp->beenonline);

        /* Update RCU state based on any recent quiescent states. */
        rcu_check_quiescent_state(rsp, rdp);

        /* Does this CPU require a not-yet-started grace period? */
        local_irq_save(flags);
        if (cpu_needs_another_gp(rsp, rdp)) {
                raw_spin_lock_rcu_node(rcu_get_root(rsp)); /* irqs disabled. */
                needwake = rcu_start_gp(rsp);
                raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(rsp), flags);
                if (needwake)
                        rcu_gp_kthread_wake(rsp);
        } else {
                local_irq_restore(flags);
        }

Because irqs are disabled before taking the lock, we can't spin with
interrupts enabled.

cpu_needs_another_gp needs interrupts off to prevent races with normal
callback registry, but that doesn't seem to be preventing any wider
races in this code, because we immediately re-enable interrupts anyway
if no gp is needed. So an interrupt can come in right after that and
queue something up.

So then the question is whether it's safe-albeit-racy to call with
interrupts ensabled? Would be nice to move it to a spin_lock_irqsave.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: rcu_process_callbacks irqsoff latency caused by taking spinlock with irqs disabled
  2018-04-04 23:34 rcu_process_callbacks irqsoff latency caused by taking spinlock with irqs disabled Nicholas Piggin
@ 2018-04-05  0:13 ` Paul E. McKenney
  2018-04-05  0:45   ` Nicholas Piggin
  0 siblings, 1 reply; 7+ messages in thread
From: Paul E. McKenney @ 2018-04-05  0:13 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: Linux Kernel Mailing List

On Thu, Apr 05, 2018 at 09:34:14AM +1000, Nicholas Piggin wrote:
> Hi Paul,
> 
> Just looking at latencies, and RCU showed up as one of the maximums.
> This is a 2 socket system with (176 CPU threads). Just doing a
> `make -j 352` kernel build. Got a max latency of 3ms. I don't think
> that's anything to worry about really, but I wanted to check the
> cause.

Well, that 3 milliseconds would cause serious problems for some workloads...

> # tracer: irqsoff
> #
> # irqsoff latency trace v1.1.5 on 4.16.0-01530-g43d1859f0994
> # --------------------------------------------------------------------
> # latency: 3055 us, #19/19, CPU#135 | (M:server VP:0, KP:0, SP:0 HP:0 #P:176)
> #    -----------------
> #    | task: cc1-58689 (uid:1003 nice:0 policy:0 rt_prio:0)
> #    -----------------
> #  => started at: rcu_process_callbacks
> #  => ended at:   _raw_spin_unlock_irqrestore
> #
> #
> #                  _------=> CPU#            
> #                 / _-----=> irqs-off        
> #                | / _----=> need-resched    
> #                || / _---=> hardirq/softirq 
> #                ||| / _--=> preempt-depth   
> #                |||| /     delay            
> #  cmd     pid   ||||| time  |   caller      
> #     \   /      |||||  \    |   /         
>    <...>-58689 135d.s.    0us : rcu_process_callbacks
>    <...>-58689 135d.s.    1us : cpu_needs_another_gp <-rcu_process_callbacks
>    <...>-58689 135d.s.    2us : rcu_segcblist_future_gp_needed <-cpu_needs_another_gp
>    <...>-58689 135d.s.    3us#: _raw_spin_lock <-rcu_process_callbacks
>    <...>-58689 135d.s. 3047us : rcu_start_gp <-rcu_process_callbacks
>    <...>-58689 135d.s. 3048us : rcu_advance_cbs <-rcu_start_gp
>    <...>-58689 135d.s. 3049us : rcu_segcblist_pend_cbs <-rcu_advance_cbs
>    <...>-58689 135d.s. 3049us : rcu_segcblist_advance <-rcu_advance_cbs
>    <...>-58689 135d.s. 3050us : rcu_accelerate_cbs <-rcu_start_gp
>    <...>-58689 135d.s. 3050us : rcu_segcblist_pend_cbs <-rcu_accelerate_cbs
>    <...>-58689 135d.s. 3051us : rcu_segcblist_accelerate <-rcu_accelerate_cbs
>    <...>-58689 135d.s. 3052us : trace_rcu_future_gp.isra.0 <-rcu_accelerate_cbs
>    <...>-58689 135d.s. 3052us : trace_rcu_future_gp.isra.0 <-rcu_accelerate_cbs
>    <...>-58689 135d.s. 3053us : rcu_start_gp_advanced.isra.35 <-rcu_start_gp
>    <...>-58689 135d.s. 3053us : cpu_needs_another_gp <-rcu_start_gp_advanced.isra.35
>    <...>-58689 135d.s. 3054us : _raw_spin_unlock_irqrestore <-rcu_process_callbacks
>    <...>-58689 135d.s. 3055us : _raw_spin_unlock_irqrestore
>    <...>-58689 135d.s. 3056us : trace_hardirqs_on <-_raw_spin_unlock_irqrestore
>    <...>-58689 135d.s. 3061us : <stack trace>
> 
> So it's taking a contende lock with interrupts disabled:
> 
> static void
> __rcu_process_callbacks(struct rcu_state *rsp)
> {
>         unsigned long flags;
>         bool needwake;
>         struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
> 
>         WARN_ON_ONCE(!rdp->beenonline);
> 
>         /* Update RCU state based on any recent quiescent states. */
>         rcu_check_quiescent_state(rsp, rdp);
> 
>         /* Does this CPU require a not-yet-started grace period? */
>         local_irq_save(flags);
>         if (cpu_needs_another_gp(rsp, rdp)) {
>                 raw_spin_lock_rcu_node(rcu_get_root(rsp)); /* irqs disabled. */
>                 needwake = rcu_start_gp(rsp);
>                 raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(rsp), flags);
>                 if (needwake)
>                         rcu_gp_kthread_wake(rsp);
>         } else {
>                 local_irq_restore(flags);
>         }
> 
> Because irqs are disabled before taking the lock, we can't spin with
> interrupts enabled.
> 
> cpu_needs_another_gp needs interrupts off to prevent races with normal
> callback registry, but that doesn't seem to be preventing any wider
> races in this code, because we immediately re-enable interrupts anyway
> if no gp is needed. So an interrupt can come in right after that and
> queue something up.
> 
> So then the question is whether it's safe-albeit-racy to call with
> interrupts ensabled? Would be nice to move it to a spin_lock_irqsave.

If I recall correctly, the issue is that an unsynchronized (due to
interrupts being enabled) check in the "if" statement can cause extra
unneeded grace periods.

I am guessing that the long latency is caused by lots of CPUs suddenly
needing a new grace period at about the same time.  If so, this is
a bottleneck that I have been expecting for some time, and one that
I would resolve by introducing funnel locking, sort of like SRCU and
expedited grace periods already use.

Or am I confused about the source of the contention?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: rcu_process_callbacks irqsoff latency caused by taking spinlock with irqs disabled
  2018-04-05  0:13 ` Paul E. McKenney
@ 2018-04-05  0:45   ` Nicholas Piggin
  2018-04-05 15:53     ` Paul E. McKenney
  0 siblings, 1 reply; 7+ messages in thread
From: Nicholas Piggin @ 2018-04-05  0:45 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Linux Kernel Mailing List

On Wed, 4 Apr 2018 17:13:58 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Thu, Apr 05, 2018 at 09:34:14AM +1000, Nicholas Piggin wrote:
> > Hi Paul,
> > 
> > Just looking at latencies, and RCU showed up as one of the maximums.
> > This is a 2 socket system with (176 CPU threads). Just doing a
> > `make -j 352` kernel build. Got a max latency of 3ms. I don't think
> > that's anything to worry about really, but I wanted to check the
> > cause.  
> 
> Well, that 3 milliseconds would cause serious problems for some workloads...

True.

> > # tracer: irqsoff
> > #
> > # irqsoff latency trace v1.1.5 on 4.16.0-01530-g43d1859f0994
> > # --------------------------------------------------------------------
> > # latency: 3055 us, #19/19, CPU#135 | (M:server VP:0, KP:0, SP:0 HP:0 #P:176)
> > #    -----------------
> > #    | task: cc1-58689 (uid:1003 nice:0 policy:0 rt_prio:0)
> > #    -----------------
> > #  => started at: rcu_process_callbacks
> > #  => ended at:   _raw_spin_unlock_irqrestore
> > #
> > #
> > #                  _------=> CPU#            
> > #                 / _-----=> irqs-off        
> > #                | / _----=> need-resched    
> > #                || / _---=> hardirq/softirq 
> > #                ||| / _--=> preempt-depth   
> > #                |||| /     delay            
> > #  cmd     pid   ||||| time  |   caller      
> > #     \   /      |||||  \    |   /         
> >    <...>-58689 135d.s.    0us : rcu_process_callbacks
> >    <...>-58689 135d.s.    1us : cpu_needs_another_gp <-rcu_process_callbacks
> >    <...>-58689 135d.s.    2us : rcu_segcblist_future_gp_needed <-cpu_needs_another_gp
> >    <...>-58689 135d.s.    3us#: _raw_spin_lock <-rcu_process_callbacks
> >    <...>-58689 135d.s. 3047us : rcu_start_gp <-rcu_process_callbacks
> >    <...>-58689 135d.s. 3048us : rcu_advance_cbs <-rcu_start_gp
> >    <...>-58689 135d.s. 3049us : rcu_segcblist_pend_cbs <-rcu_advance_cbs
> >    <...>-58689 135d.s. 3049us : rcu_segcblist_advance <-rcu_advance_cbs
> >    <...>-58689 135d.s. 3050us : rcu_accelerate_cbs <-rcu_start_gp
> >    <...>-58689 135d.s. 3050us : rcu_segcblist_pend_cbs <-rcu_accelerate_cbs
> >    <...>-58689 135d.s. 3051us : rcu_segcblist_accelerate <-rcu_accelerate_cbs
> >    <...>-58689 135d.s. 3052us : trace_rcu_future_gp.isra.0 <-rcu_accelerate_cbs
> >    <...>-58689 135d.s. 3052us : trace_rcu_future_gp.isra.0 <-rcu_accelerate_cbs
> >    <...>-58689 135d.s. 3053us : rcu_start_gp_advanced.isra.35 <-rcu_start_gp
> >    <...>-58689 135d.s. 3053us : cpu_needs_another_gp <-rcu_start_gp_advanced.isra.35
> >    <...>-58689 135d.s. 3054us : _raw_spin_unlock_irqrestore <-rcu_process_callbacks
> >    <...>-58689 135d.s. 3055us : _raw_spin_unlock_irqrestore
> >    <...>-58689 135d.s. 3056us : trace_hardirqs_on <-_raw_spin_unlock_irqrestore
> >    <...>-58689 135d.s. 3061us : <stack trace>
> > 
> > So it's taking a contende lock with interrupts disabled:
> > 
> > static void
> > __rcu_process_callbacks(struct rcu_state *rsp)
> > {
> >         unsigned long flags;
> >         bool needwake;
> >         struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
> > 
> >         WARN_ON_ONCE(!rdp->beenonline);
> > 
> >         /* Update RCU state based on any recent quiescent states. */
> >         rcu_check_quiescent_state(rsp, rdp);
> > 
> >         /* Does this CPU require a not-yet-started grace period? */
> >         local_irq_save(flags);
> >         if (cpu_needs_another_gp(rsp, rdp)) {
> >                 raw_spin_lock_rcu_node(rcu_get_root(rsp)); /* irqs disabled. */
> >                 needwake = rcu_start_gp(rsp);
> >                 raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(rsp), flags);
> >                 if (needwake)
> >                         rcu_gp_kthread_wake(rsp);
> >         } else {
> >                 local_irq_restore(flags);
> >         }
> > 
> > Because irqs are disabled before taking the lock, we can't spin with
> > interrupts enabled.
> > 
> > cpu_needs_another_gp needs interrupts off to prevent races with normal
> > callback registry, but that doesn't seem to be preventing any wider
> > races in this code, because we immediately re-enable interrupts anyway
> > if no gp is needed. So an interrupt can come in right after that and
> > queue something up.
> > 
> > So then the question is whether it's safe-albeit-racy to call with
> > interrupts ensabled? Would be nice to move it to a spin_lock_irqsave.  
> 
> If I recall correctly, the issue is that an unsynchronized (due to
> interrupts being enabled) check in the "if" statement can cause extra
> unneeded grace periods.

If the check is relatively cheap, perhaps you could do a second race
free verification after taking the lock and disabling interrupts?

> I am guessing that the long latency is caused by lots of CPUs suddenly
> needing a new grace period at about the same time.  If so, this is
> a bottleneck that I have been expecting for some time, and one that
> I would resolve by introducing funnel locking, sort of like SRCU and
> expedited grace periods already use.
> 
> Or am I confused about the source of the contention?

The irqsoff tracer doesn't give you a lot of detail about locking
and concurrency, so it's hard for me to say. Any particular tests
or details I could try to get, that would help?

Note that rcu doesn't show up consistently at the top, this was
just one that looked *maybe* like it can be improved. So I don't
know how reproducible it is.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: rcu_process_callbacks irqsoff latency caused by taking spinlock with irqs disabled
  2018-04-05  0:45   ` Nicholas Piggin
@ 2018-04-05 15:53     ` Paul E. McKenney
  2018-04-06 21:40       ` Nicholas Piggin
  0 siblings, 1 reply; 7+ messages in thread
From: Paul E. McKenney @ 2018-04-05 15:53 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: Linux Kernel Mailing List

On Thu, Apr 05, 2018 at 10:45:12AM +1000, Nicholas Piggin wrote:
> On Wed, 4 Apr 2018 17:13:58 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Thu, Apr 05, 2018 at 09:34:14AM +1000, Nicholas Piggin wrote:
> > > Hi Paul,
> > > 
> > > Just looking at latencies, and RCU showed up as one of the maximums.
> > > This is a 2 socket system with (176 CPU threads). Just doing a
> > > `make -j 352` kernel build. Got a max latency of 3ms. I don't think
> > > that's anything to worry about really, but I wanted to check the
> > > cause.  
> > 
> > Well, that 3 milliseconds would cause serious problems for some workloads...
> 
> True.
> 
> > > # tracer: irqsoff
> > > #
> > > # irqsoff latency trace v1.1.5 on 4.16.0-01530-g43d1859f0994
> > > # --------------------------------------------------------------------
> > > # latency: 3055 us, #19/19, CPU#135 | (M:server VP:0, KP:0, SP:0 HP:0 #P:176)
> > > #    -----------------
> > > #    | task: cc1-58689 (uid:1003 nice:0 policy:0 rt_prio:0)
> > > #    -----------------
> > > #  => started at: rcu_process_callbacks
> > > #  => ended at:   _raw_spin_unlock_irqrestore
> > > #
> > > #
> > > #                  _------=> CPU#            
> > > #                 / _-----=> irqs-off        
> > > #                | / _----=> need-resched    
> > > #                || / _---=> hardirq/softirq 
> > > #                ||| / _--=> preempt-depth   
> > > #                |||| /     delay            
> > > #  cmd     pid   ||||| time  |   caller      
> > > #     \   /      |||||  \    |   /         
> > >    <...>-58689 135d.s.    0us : rcu_process_callbacks
> > >    <...>-58689 135d.s.    1us : cpu_needs_another_gp <-rcu_process_callbacks
> > >    <...>-58689 135d.s.    2us : rcu_segcblist_future_gp_needed <-cpu_needs_another_gp
> > >    <...>-58689 135d.s.    3us#: _raw_spin_lock <-rcu_process_callbacks
> > >    <...>-58689 135d.s. 3047us : rcu_start_gp <-rcu_process_callbacks
> > >    <...>-58689 135d.s. 3048us : rcu_advance_cbs <-rcu_start_gp
> > >    <...>-58689 135d.s. 3049us : rcu_segcblist_pend_cbs <-rcu_advance_cbs
> > >    <...>-58689 135d.s. 3049us : rcu_segcblist_advance <-rcu_advance_cbs
> > >    <...>-58689 135d.s. 3050us : rcu_accelerate_cbs <-rcu_start_gp
> > >    <...>-58689 135d.s. 3050us : rcu_segcblist_pend_cbs <-rcu_accelerate_cbs
> > >    <...>-58689 135d.s. 3051us : rcu_segcblist_accelerate <-rcu_accelerate_cbs
> > >    <...>-58689 135d.s. 3052us : trace_rcu_future_gp.isra.0 <-rcu_accelerate_cbs
> > >    <...>-58689 135d.s. 3052us : trace_rcu_future_gp.isra.0 <-rcu_accelerate_cbs
> > >    <...>-58689 135d.s. 3053us : rcu_start_gp_advanced.isra.35 <-rcu_start_gp
> > >    <...>-58689 135d.s. 3053us : cpu_needs_another_gp <-rcu_start_gp_advanced.isra.35
> > >    <...>-58689 135d.s. 3054us : _raw_spin_unlock_irqrestore <-rcu_process_callbacks
> > >    <...>-58689 135d.s. 3055us : _raw_spin_unlock_irqrestore
> > >    <...>-58689 135d.s. 3056us : trace_hardirqs_on <-_raw_spin_unlock_irqrestore
> > >    <...>-58689 135d.s. 3061us : <stack trace>
> > > 
> > > So it's taking a contende lock with interrupts disabled:
> > > 
> > > static void
> > > __rcu_process_callbacks(struct rcu_state *rsp)
> > > {
> > >         unsigned long flags;
> > >         bool needwake;
> > >         struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
> > > 
> > >         WARN_ON_ONCE(!rdp->beenonline);
> > > 
> > >         /* Update RCU state based on any recent quiescent states. */
> > >         rcu_check_quiescent_state(rsp, rdp);
> > > 
> > >         /* Does this CPU require a not-yet-started grace period? */
> > >         local_irq_save(flags);
> > >         if (cpu_needs_another_gp(rsp, rdp)) {
> > >                 raw_spin_lock_rcu_node(rcu_get_root(rsp)); /* irqs disabled. */
> > >                 needwake = rcu_start_gp(rsp);
> > >                 raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(rsp), flags);
> > >                 if (needwake)
> > >                         rcu_gp_kthread_wake(rsp);
> > >         } else {
> > >                 local_irq_restore(flags);
> > >         }
> > > 
> > > Because irqs are disabled before taking the lock, we can't spin with
> > > interrupts enabled.
> > > 
> > > cpu_needs_another_gp needs interrupts off to prevent races with normal
> > > callback registry, but that doesn't seem to be preventing any wider
> > > races in this code, because we immediately re-enable interrupts anyway
> > > if no gp is needed. So an interrupt can come in right after that and
> > > queue something up.
> > > 
> > > So then the question is whether it's safe-albeit-racy to call with
> > > interrupts ensabled? Would be nice to move it to a spin_lock_irqsave.  
> > 
> > If I recall correctly, the issue is that an unsynchronized (due to
> > interrupts being enabled) check in the "if" statement can cause extra
> > unneeded grace periods.

But doesn't the code unconditionally disable interrupts before
going for the lock in any case?

static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
{
	local_irq_disable();
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}

> If the check is relatively cheap, perhaps you could do a second race
> free verification after taking the lock and disabling interrupts?

Given the above, I am not convinced that doing so will help.
What am I missing?

> > I am guessing that the long latency is caused by lots of CPUs suddenly
> > needing a new grace period at about the same time.  If so, this is
> > a bottleneck that I have been expecting for some time, and one that
> > I would resolve by introducing funnel locking, sort of like SRCU and
> > expedited grace periods already use.
> > 
> > Or am I confused about the source of the contention?
> 
> The irqsoff tracer doesn't give you a lot of detail about locking
> and concurrency, so it's hard for me to say. Any particular tests
> or details I could try to get, that would help?

Something checking for who holds the lock.  Let's see...  The
rcu_grace_period trace event might be a good start.  Having a trace for
each lock acquisition and release would be more to the point, except that
this would likely overflow the trace buffer to the point of uselessness.

Other thoughts?

> Note that rcu doesn't show up consistently at the top, this was
> just one that looked *maybe* like it can be improved. So I don't
> know how reproducible it is.

Ah, that leads me to wonder whether the hypervisor preempted whoever is
currently holding the lock.  Do we have anything set up to detect that
sort of thing?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: rcu_process_callbacks irqsoff latency caused by taking spinlock with irqs disabled
  2018-04-05 15:53     ` Paul E. McKenney
@ 2018-04-06 21:40       ` Nicholas Piggin
  2018-04-08 21:06         ` Paul E. McKenney
  0 siblings, 1 reply; 7+ messages in thread
From: Nicholas Piggin @ 2018-04-06 21:40 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Linux Kernel Mailing List

On Thu, 5 Apr 2018 08:53:20 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Thu, Apr 05, 2018 at 10:45:12AM +1000, Nicholas Piggin wrote:
> > On Wed, 4 Apr 2018 17:13:58 -0700
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >   
> > > On Thu, Apr 05, 2018 at 09:34:14AM +1000, Nicholas Piggin wrote:  
> > > > Hi Paul,
> > > > 
> > > > Just looking at latencies, and RCU showed up as one of the maximums.
> > > > This is a 2 socket system with (176 CPU threads). Just doing a
> > > > `make -j 352` kernel build. Got a max latency of 3ms. I don't think
> > > > that's anything to worry about really, but I wanted to check the
> > > > cause.    
> > > 
> > > Well, that 3 milliseconds would cause serious problems for some workloads...  
> > 
> > True.
> >   
> > > > # tracer: irqsoff
> > > > #
> > > > # irqsoff latency trace v1.1.5 on 4.16.0-01530-g43d1859f0994
> > > > # --------------------------------------------------------------------
> > > > # latency: 3055 us, #19/19, CPU#135 | (M:server VP:0, KP:0, SP:0 HP:0 #P:176)
> > > > #    -----------------
> > > > #    | task: cc1-58689 (uid:1003 nice:0 policy:0 rt_prio:0)
> > > > #    -----------------
> > > > #  => started at: rcu_process_callbacks
> > > > #  => ended at:   _raw_spin_unlock_irqrestore
> > > > #
> > > > #
> > > > #                  _------=> CPU#            
> > > > #                 / _-----=> irqs-off        
> > > > #                | / _----=> need-resched    
> > > > #                || / _---=> hardirq/softirq 
> > > > #                ||| / _--=> preempt-depth   
> > > > #                |||| /     delay            
> > > > #  cmd     pid   ||||| time  |   caller      
> > > > #     \   /      |||||  \    |   /         
> > > >    <...>-58689 135d.s.    0us : rcu_process_callbacks
> > > >    <...>-58689 135d.s.    1us : cpu_needs_another_gp <-rcu_process_callbacks
> > > >    <...>-58689 135d.s.    2us : rcu_segcblist_future_gp_needed <-cpu_needs_another_gp
> > > >    <...>-58689 135d.s.    3us#: _raw_spin_lock <-rcu_process_callbacks
> > > >    <...>-58689 135d.s. 3047us : rcu_start_gp <-rcu_process_callbacks
> > > >    <...>-58689 135d.s. 3048us : rcu_advance_cbs <-rcu_start_gp
> > > >    <...>-58689 135d.s. 3049us : rcu_segcblist_pend_cbs <-rcu_advance_cbs
> > > >    <...>-58689 135d.s. 3049us : rcu_segcblist_advance <-rcu_advance_cbs
> > > >    <...>-58689 135d.s. 3050us : rcu_accelerate_cbs <-rcu_start_gp
> > > >    <...>-58689 135d.s. 3050us : rcu_segcblist_pend_cbs <-rcu_accelerate_cbs
> > > >    <...>-58689 135d.s. 3051us : rcu_segcblist_accelerate <-rcu_accelerate_cbs
> > > >    <...>-58689 135d.s. 3052us : trace_rcu_future_gp.isra.0 <-rcu_accelerate_cbs
> > > >    <...>-58689 135d.s. 3052us : trace_rcu_future_gp.isra.0 <-rcu_accelerate_cbs
> > > >    <...>-58689 135d.s. 3053us : rcu_start_gp_advanced.isra.35 <-rcu_start_gp
> > > >    <...>-58689 135d.s. 3053us : cpu_needs_another_gp <-rcu_start_gp_advanced.isra.35
> > > >    <...>-58689 135d.s. 3054us : _raw_spin_unlock_irqrestore <-rcu_process_callbacks
> > > >    <...>-58689 135d.s. 3055us : _raw_spin_unlock_irqrestore
> > > >    <...>-58689 135d.s. 3056us : trace_hardirqs_on <-_raw_spin_unlock_irqrestore
> > > >    <...>-58689 135d.s. 3061us : <stack trace>
> > > > 
> > > > So it's taking a contende lock with interrupts disabled:
> > > > 
> > > > static void
> > > > __rcu_process_callbacks(struct rcu_state *rsp)
> > > > {
> > > >         unsigned long flags;
> > > >         bool needwake;
> > > >         struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
> > > > 
> > > >         WARN_ON_ONCE(!rdp->beenonline);
> > > > 
> > > >         /* Update RCU state based on any recent quiescent states. */
> > > >         rcu_check_quiescent_state(rsp, rdp);
> > > > 
> > > >         /* Does this CPU require a not-yet-started grace period? */
> > > >         local_irq_save(flags);
> > > >         if (cpu_needs_another_gp(rsp, rdp)) {
> > > >                 raw_spin_lock_rcu_node(rcu_get_root(rsp)); /* irqs disabled. */
> > > >                 needwake = rcu_start_gp(rsp);
> > > >                 raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(rsp), flags);
> > > >                 if (needwake)
> > > >                         rcu_gp_kthread_wake(rsp);
> > > >         } else {
> > > >                 local_irq_restore(flags);
> > > >         }
> > > > 
> > > > Because irqs are disabled before taking the lock, we can't spin with
> > > > interrupts enabled.
> > > > 
> > > > cpu_needs_another_gp needs interrupts off to prevent races with normal
> > > > callback registry, but that doesn't seem to be preventing any wider
> > > > races in this code, because we immediately re-enable interrupts anyway
> > > > if no gp is needed. So an interrupt can come in right after that and
> > > > queue something up.
> > > > 
> > > > So then the question is whether it's safe-albeit-racy to call with
> > > > interrupts ensabled? Would be nice to move it to a spin_lock_irqsave.    
> > > 
> > > If I recall correctly, the issue is that an unsynchronized (due to
> > > interrupts being enabled) check in the "if" statement can cause extra
> > > unneeded grace periods.  
> 
> But doesn't the code unconditionally disable interrupts before
> going for the lock in any case?
> 
> static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
> {
> 	local_irq_disable();
> 	preempt_disable();
> 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> 	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
> }

The spin_lock_irq variant does, but _irqsave is allowed to drop the
lock while spinning. I think there is no reason why _irq variant
could not do that as well.

> 
> > If the check is relatively cheap, perhaps you could do a second race
> > free verification after taking the lock and disabling interrupts?  
> 
> Given the above, I am not convinced that doing so will help.
> What am I missing?

Possibly if you can use an _irqsave lock?

> > > I am guessing that the long latency is caused by lots of CPUs suddenly
> > > needing a new grace period at about the same time.  If so, this is
> > > a bottleneck that I have been expecting for some time, and one that
> > > I would resolve by introducing funnel locking, sort of like SRCU and
> > > expedited grace periods already use.
> > > 
> > > Or am I confused about the source of the contention?  
> > 
> > The irqsoff tracer doesn't give you a lot of detail about locking
> > and concurrency, so it's hard for me to say. Any particular tests
> > or details I could try to get, that would help?  
> 
> Something checking for who holds the lock.  Let's see...  The
> rcu_grace_period trace event might be a good start.  Having a trace for
> each lock acquisition and release would be more to the point, except that
> this would likely overflow the trace buffer to the point of uselessness.
> 
> Other thoughts?

Yeah, I'm not too sure, it would be nice to have maximum wait time events
be able to trigger collection of traces of other holders that contribute
to your latency.

Might be infeasible to do it perfectly, but something good enough might be
possible.

> 
> > Note that rcu doesn't show up consistently at the top, this was
> > just one that looked *maybe* like it can be improved. So I don't
> > know how reproducible it is.  
> 
> Ah, that leads me to wonder whether the hypervisor preempted whoever is
> currently holding the lock.  Do we have anything set up to detect that
> sort of thing?

In this case it was running on bare metal, so it was a genuine latency
event. It just hasn't been consistently at the top (scheduler has been
there, but I'm bringing that down with tuning).

Thanks,
Nick

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: rcu_process_callbacks irqsoff latency caused by taking spinlock with irqs disabled
  2018-04-06 21:40       ` Nicholas Piggin
@ 2018-04-08 21:06         ` Paul E. McKenney
  2018-04-17 15:43           ` Paul E. McKenney
  0 siblings, 1 reply; 7+ messages in thread
From: Paul E. McKenney @ 2018-04-08 21:06 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: Linux Kernel Mailing List

On Sat, Apr 07, 2018 at 07:40:42AM +1000, Nicholas Piggin wrote:
> On Thu, 5 Apr 2018 08:53:20 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Thu, Apr 05, 2018 at 10:45:12AM +1000, Nicholas Piggin wrote:
> > > On Wed, 4 Apr 2018 17:13:58 -0700
> > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > >   
> > > > On Thu, Apr 05, 2018 at 09:34:14AM +1000, Nicholas Piggin wrote:  
> > > > > Hi Paul,
> > > > > 
> > > > > Just looking at latencies, and RCU showed up as one of the maximums.
> > > > > This is a 2 socket system with (176 CPU threads). Just doing a
> > > > > `make -j 352` kernel build. Got a max latency of 3ms. I don't think
> > > > > that's anything to worry about really, but I wanted to check the
> > > > > cause.    
> > > > 
> > > > Well, that 3 milliseconds would cause serious problems for some workloads...  
> > > 
> > > True.
> > >   
> > > > > # tracer: irqsoff
> > > > > #
> > > > > # irqsoff latency trace v1.1.5 on 4.16.0-01530-g43d1859f0994
> > > > > # --------------------------------------------------------------------
> > > > > # latency: 3055 us, #19/19, CPU#135 | (M:server VP:0, KP:0, SP:0 HP:0 #P:176)
> > > > > #    -----------------
> > > > > #    | task: cc1-58689 (uid:1003 nice:0 policy:0 rt_prio:0)
> > > > > #    -----------------
> > > > > #  => started at: rcu_process_callbacks
> > > > > #  => ended at:   _raw_spin_unlock_irqrestore
> > > > > #
> > > > > #
> > > > > #                  _------=> CPU#            
> > > > > #                 / _-----=> irqs-off        
> > > > > #                | / _----=> need-resched    
> > > > > #                || / _---=> hardirq/softirq 
> > > > > #                ||| / _--=> preempt-depth   
> > > > > #                |||| /     delay            
> > > > > #  cmd     pid   ||||| time  |   caller      
> > > > > #     \   /      |||||  \    |   /         
> > > > >    <...>-58689 135d.s.    0us : rcu_process_callbacks
> > > > >    <...>-58689 135d.s.    1us : cpu_needs_another_gp <-rcu_process_callbacks
> > > > >    <...>-58689 135d.s.    2us : rcu_segcblist_future_gp_needed <-cpu_needs_another_gp
> > > > >    <...>-58689 135d.s.    3us#: _raw_spin_lock <-rcu_process_callbacks
> > > > >    <...>-58689 135d.s. 3047us : rcu_start_gp <-rcu_process_callbacks
> > > > >    <...>-58689 135d.s. 3048us : rcu_advance_cbs <-rcu_start_gp
> > > > >    <...>-58689 135d.s. 3049us : rcu_segcblist_pend_cbs <-rcu_advance_cbs
> > > > >    <...>-58689 135d.s. 3049us : rcu_segcblist_advance <-rcu_advance_cbs
> > > > >    <...>-58689 135d.s. 3050us : rcu_accelerate_cbs <-rcu_start_gp
> > > > >    <...>-58689 135d.s. 3050us : rcu_segcblist_pend_cbs <-rcu_accelerate_cbs
> > > > >    <...>-58689 135d.s. 3051us : rcu_segcblist_accelerate <-rcu_accelerate_cbs
> > > > >    <...>-58689 135d.s. 3052us : trace_rcu_future_gp.isra.0 <-rcu_accelerate_cbs
> > > > >    <...>-58689 135d.s. 3052us : trace_rcu_future_gp.isra.0 <-rcu_accelerate_cbs
> > > > >    <...>-58689 135d.s. 3053us : rcu_start_gp_advanced.isra.35 <-rcu_start_gp
> > > > >    <...>-58689 135d.s. 3053us : cpu_needs_another_gp <-rcu_start_gp_advanced.isra.35
> > > > >    <...>-58689 135d.s. 3054us : _raw_spin_unlock_irqrestore <-rcu_process_callbacks
> > > > >    <...>-58689 135d.s. 3055us : _raw_spin_unlock_irqrestore
> > > > >    <...>-58689 135d.s. 3056us : trace_hardirqs_on <-_raw_spin_unlock_irqrestore
> > > > >    <...>-58689 135d.s. 3061us : <stack trace>
> > > > > 
> > > > > So it's taking a contende lock with interrupts disabled:
> > > > > 
> > > > > static void
> > > > > __rcu_process_callbacks(struct rcu_state *rsp)
> > > > > {
> > > > >         unsigned long flags;
> > > > >         bool needwake;
> > > > >         struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
> > > > > 
> > > > >         WARN_ON_ONCE(!rdp->beenonline);
> > > > > 
> > > > >         /* Update RCU state based on any recent quiescent states. */
> > > > >         rcu_check_quiescent_state(rsp, rdp);
> > > > > 
> > > > >         /* Does this CPU require a not-yet-started grace period? */
> > > > >         local_irq_save(flags);
> > > > >         if (cpu_needs_another_gp(rsp, rdp)) {
> > > > >                 raw_spin_lock_rcu_node(rcu_get_root(rsp)); /* irqs disabled. */
> > > > >                 needwake = rcu_start_gp(rsp);
> > > > >                 raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(rsp), flags);
> > > > >                 if (needwake)
> > > > >                         rcu_gp_kthread_wake(rsp);
> > > > >         } else {
> > > > >                 local_irq_restore(flags);
> > > > >         }
> > > > > 
> > > > > Because irqs are disabled before taking the lock, we can't spin with
> > > > > interrupts enabled.
> > > > > 
> > > > > cpu_needs_another_gp needs interrupts off to prevent races with normal
> > > > > callback registry, but that doesn't seem to be preventing any wider
> > > > > races in this code, because we immediately re-enable interrupts anyway
> > > > > if no gp is needed. So an interrupt can come in right after that and
> > > > > queue something up.
> > > > > 
> > > > > So then the question is whether it's safe-albeit-racy to call with
> > > > > interrupts ensabled? Would be nice to move it to a spin_lock_irqsave.    
> > > > 
> > > > If I recall correctly, the issue is that an unsynchronized (due to
> > > > interrupts being enabled) check in the "if" statement can cause extra
> > > > unneeded grace periods.  
> > 
> > But doesn't the code unconditionally disable interrupts before
> > going for the lock in any case?
> > 
> > static inline void __raw_spin_lock_irq(raw_spinlock_t *lock)
> > {
> > 	local_irq_disable();
> > 	preempt_disable();
> > 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > 	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
> > }
> 
> The spin_lock_irq variant does, but _irqsave is allowed to drop the
> lock while spinning. I think there is no reason why _irq variant
> could not do that as well.

If RCU_SOFTIRQ is running in ksoftirqd context, this might help.
If it is instead running on the back of an interrupt, preemption
is disabled regardless.

But why not try it out?  Assuming that it reproduces, anyway...

> > > If the check is relatively cheap, perhaps you could do a second race
> > > free verification after taking the lock and disabling interrupts?  
> > 
> > Given the above, I am not convinced that doing so will help.
> > What am I missing?
> 
> Possibly if you can use an _irqsave lock?
> 
> > > > I am guessing that the long latency is caused by lots of CPUs suddenly
> > > > needing a new grace period at about the same time.  If so, this is
> > > > a bottleneck that I have been expecting for some time, and one that
> > > > I would resolve by introducing funnel locking, sort of like SRCU and
> > > > expedited grace periods already use.
> > > > 
> > > > Or am I confused about the source of the contention?  
> > > 
> > > The irqsoff tracer doesn't give you a lot of detail about locking
> > > and concurrency, so it's hard for me to say. Any particular tests
> > > or details I could try to get, that would help?  
> > 
> > Something checking for who holds the lock.  Let's see...  The
> > rcu_grace_period trace event might be a good start.  Having a trace for
> > each lock acquisition and release would be more to the point, except that
> > this would likely overflow the trace buffer to the point of uselessness.
> > 
> > Other thoughts?
> 
> Yeah, I'm not too sure, it would be nice to have maximum wait time events
> be able to trigger collection of traces of other holders that contribute
> to your latency.
> 
> Might be infeasible to do it perfectly, but something good enough might be
> possible.

There is the latency tracer, though maybe that is still stuck in -rt...

> > > Note that rcu doesn't show up consistently at the top, this was
> > > just one that looked *maybe* like it can be improved. So I don't
> > > know how reproducible it is.  
> > 
> > Ah, that leads me to wonder whether the hypervisor preempted whoever is
> > currently holding the lock.  Do we have anything set up to detect that
> > sort of thing?
> 
> In this case it was running on bare metal, so it was a genuine latency
> event. It just hasn't been consistently at the top (scheduler has been
> there, but I'm bringing that down with tuning).

OK, never mind about vCPU preemption, then!  ;-)

It looks like I will have other reasons to decrease rcu_node lock
contention, so let me see what I can do.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: rcu_process_callbacks irqsoff latency caused by taking spinlock with irqs disabled
  2018-04-08 21:06         ` Paul E. McKenney
@ 2018-04-17 15:43           ` Paul E. McKenney
  0 siblings, 0 replies; 7+ messages in thread
From: Paul E. McKenney @ 2018-04-17 15:43 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: Linux Kernel Mailing List

On Sun, Apr 08, 2018 at 02:06:18PM -0700, Paul E. McKenney wrote:
> On Sat, Apr 07, 2018 at 07:40:42AM +1000, Nicholas Piggin wrote:
> > On Thu, 5 Apr 2018 08:53:20 -0700
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

[ . . . ]

> > > > Note that rcu doesn't show up consistently at the top, this was
> > > > just one that looked *maybe* like it can be improved. So I don't
> > > > know how reproducible it is.  
> > > 
> > > Ah, that leads me to wonder whether the hypervisor preempted whoever is
> > > currently holding the lock.  Do we have anything set up to detect that
> > > sort of thing?
> > 
> > In this case it was running on bare metal, so it was a genuine latency
> > event. It just hasn't been consistently at the top (scheduler has been
> > there, but I'm bringing that down with tuning).
> 
> OK, never mind about vCPU preemption, then!  ;-)
> 
> It looks like I will have other reasons to decrease rcu_node lock
> contention, so let me see what I can do.

And the intermittent contention behavior you saw makes is plausible
given the current code structure, which avoids contention in the common
case where grace periods follow immediately one after the other, but
does not in the less-likely case where RCU is idle and a bunch of CPUs
simultaneously see the need for a new grace period.  I have a fix in
the works which occasionally actually makes it through rcutorture.  ;-)

I expect to have something robust enough to post to LKML by the end
of this week.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-04-17 15:43 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-04 23:34 rcu_process_callbacks irqsoff latency caused by taking spinlock with irqs disabled Nicholas Piggin
2018-04-05  0:13 ` Paul E. McKenney
2018-04-05  0:45   ` Nicholas Piggin
2018-04-05 15:53     ` Paul E. McKenney
2018-04-06 21:40       ` Nicholas Piggin
2018-04-08 21:06         ` Paul E. McKenney
2018-04-17 15:43           ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).