From: Phil Auld <pauld@redhat.com>
To: bsegall@google.com
Cc: mingo@redhat.com, peterz@infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
Date: Tue, 5 Mar 2019 15:05:54 -0500
Message-ID: <20190305200554.GA8786@pauld.bos.csb>
In-Reply-To: <xm268sxtyx5u.fsf@bsegall-linux.svl.corp.google.com>

On Tue, Mar 05, 2019 at 10:49:01AM -0800 bsegall@google.com wrote:
> Phil Auld <pauld@redhat.com> writes:
> 
> >> >
> >> > 	raw_spin_lock(&cfs_b->lock);
> >> > 	for (;;) {
> >> > 		overrun = hrtimer_forward_now(timer, cfs_b->period);
> >> > 		if (!overrun)
> >> > 			break;
> >> >
> >> > 		idle = do_sched_cfs_period_timer(cfs_b, overrun);
> >> > 	}
> >> > 	if (idle)
> >> > 		cfs_b->period_active = 0;
> >> > 	raw_spin_unlock(&cfs_b->lock);
> >> >
> >> >
> >> > There's no guarantee that hrtimer_forward_now() will ever return 0, and
> >> > do_sched_cfs_period_timer() will also drop and re-acquire cfs_b->lock to call
> >> > distribute_cfs_runtime().
> >> >
> >> > I considered code to tune the period and quota up (proportionally so they have the same 
> >> > relative effect) but I did not like that because of the above statement and the fact that 
> >> > it would be changing the valid values the user configured. I have two versions that do that 
> >> > differently but they both still call out for protection from the forever loop. I think they 
> >> > add complexity for no real gain.
> >> >
> >> > For my current testing I'm using a more direct but less elegant approach of simply limiting 
> >> > the number of times the loop can execute. This has the potential to skip handling a period 
> >> > I think but will prevent the problem reliably. I'm not sure how many iterations this loop 
> >> > was expected to take. I'm seeing numbers in the low thousands at times.
> >> 
> >> I mean the answer should be "do_sched_cfs_period_timer runs once" the
> >> vast majority of the time; if it's not then your machine/setup can't
> >> handle whatever super low period you've set, or there's some bad
> >> behavior somewhere in the period timer handling.
> >> CFS_PERIOD_TIMER_EXIT_COUNT could reasonably be like 2 or 3 - this would
> >> mean that you've already spent an entire period inside the handler.
> >>
> >
> > Thanks for looking at this.
> >
> > That's sort of what I thought, too. Would it not make more sense to get rid
> > of the loop? I find forever loops inherently suspect.
> 
> I mean, it's probably better to have a loop than to copy-paste the code a
> couple of times; you could rework the condition in your patch to be part
> of the loop condition in general, though it might not be any clearer
> (heck, the loop /could/ be for (;overrun = hrtimer_forward_now(...);) { ... },
> it's just kinda questionable whether that's clearer).
> 

So I'm not clear on this loop thing, though.  If it's okay for the loop
to run 2-3 times, that implies it's okay for do_sched_cfs_period_timer()
to take longer than a period to run.  And if that's acceptable two or
three times, what would ensure it does not continually take that long?
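(To put rough numbers on it: with, say, a 10ms period, getting to a 2nd or
3rd pass through the loop already means something on the order of a full
period, 10ms or more, spent inside the handler.)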


> >
> > I think the fact that we drop the lock to do the distribute is the real culprit.
> > It's not do_sched_cfs_period_timer()'s code itself that takes the time,
> > I think, it's the re-acquire of the cfs_b->lock and the contention on it due
> > to all the cgroups. Lock_stat samples during that run show pretty high contention
> > on that lock and some really long wait times.
> 
> Tons of cgroups shouldn't increase contention; cfs_b->lock is
> per-cgroup, and I don't see or remember anything offhand where it even
> should be worse for deep nesting. Large machines with many cpus where
> threads in the cgroup are running on each make it worse of course, and
> tons of cgroups with throttling enabled have O(log n) cost on hrtimers
> insertion and of course O(n) cost in actual runtime in irq handlers.

The lock is per cgroup but it's taken in assign_cfs_rq_runtime() via 
__account_cfs_rq_runtime() from every non-idle cpu, I think.
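
For reference, the path I mean looks roughly like this (hand-simplified
from kernel/sched/fair.c, so not exact; I've dropped the runtime expiry
handling):

        static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
        {
                struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
                u64 amount = 0, min_amount;

                /* how much we want to pull into this cfs_rq's local pool */
                min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;

                raw_spin_lock(&cfs_b->lock);    /* the per-tg lock, hit from every busy cpu */
                if (cfs_b->quota == RUNTIME_INF)
                        amount = min_amount;
                else {
                        start_cfs_bandwidth(cfs_b);
                        if (cfs_b->runtime > 0) {
                                amount = min(cfs_b->runtime, min_amount);
                                cfs_b->runtime -= amount;
                        }
                }
                raw_spin_unlock(&cfs_b->lock);

                cfs_rq->runtime_remaining += amount;

                return cfs_rq->runtime_remaining > 0;
        }

So every cpu whose cfs_rq runs out of its local slice comes back here and
bounces cfs_b->lock.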

> 
> If you're seeing increasing contention as the cgroup depth or cgroup
> count increases, that may be a good thing to fix regardless.

I'll look at it some more. 

Interestingly, if I limit the number of child cgroups to the number I'm
actually putting processes into (16, down from 2500), the problem does
not reproduce.

Fwiw, this is an 80-cpu (incl. SMT), 4-NUMA-node system, so there are a
lot of cfs_rqs that might be doing __account_cfs_rq_runtime().
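(Rough back-of-the-envelope: with the default 5ms bandwidth slice, each
busy cpu in the group comes back to cfs_b->lock for a refill roughly every
5ms of runtime it burns, so ~80 cpus alone can account for the bulk of the
__account_cfs_rq_runtime() acquisitions in the lock_stat output below.)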

Here is lock_stat output from a short stretch of the run, taken after the pr_warn had
fired a couple of times, edited for screen width. This is the 8th lock on the list.
(I wonder, in this case, if the really long holds are the pr_warn firing.)
 
class name       &cfs_b->lock:
con-bounces      33770
contentions      33776
waittime-min         0.10
waittime-max     48088.04
waittime-total  856027.70
waittime-avg        25.34  
acq-bounces     148379
acquisitions    162184 
holdtime-min         0.00 
holdtime-max     48240.79
holdtime-total  354683.04  
holdtime-avg         2.19
                 
             ------------
             &cfs_b->lock          29414          [<00000000cfc57971>] __account_cfs_rq_runtime+0xd5/0x1a0
             &cfs_b->lock           4195          [<00000000754af0b8>] throttle_cfs_rq+0x193/0x2c0
             &cfs_b->lock            166          [<00000000b6333ad0>] unthrottle_cfs_rq+0x54/0x1d0
             &cfs_b->lock              1          [<00000000fc0c15d2>] sched_cfs_period_timer+0xe6/0x1d0
             ------------
             &cfs_b->lock          28602          [<00000000cfc57971>] __account_cfs_rq_runtime+0xd5/0x1a0
             &cfs_b->lock           3215          [<00000000754af0b8>] throttle_cfs_rq+0x193/0x2c0
             &cfs_b->lock           1938          [<00000000b6333ad0>] unthrottle_cfs_rq+0x54/0x1d0
             &cfs_b->lock             21          [<00000000fc0c15d2>] sched_cfs_period_timer+0xe6/0x1d0


> 
> The lock has to be dropped due to lock ordering vs rq locks, and the
> reverse order wouldn't be possible. That said, each cfs_rq unthrottle in
> distribute grabs the lock, and then that cpu will grab the lock when it
> wakes up, which can be while we're still in distribute. I'm not sure if
> it would be possible to move the resched_curr calls until after doing
> all the rest of unthrottle, and even if we did then we'd be taking each
> rq lock twice, which wouldn't be great either. It might at least be
> possible to coalesce all the cfs_b accounting in unthrottle to be done
> in a single locked section, but I don't know if that would actually
> help; it would still all have to be serialized regardless after all.

Yeah, that makes sense, thanks.
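
For reference (mostly so I have it in front of me), the drop/re-acquire
around distribute looks roughly like this in do_sched_cfs_period_timer()
(simplified, not verbatim):

        /* with cfs_b->lock held on entry */
        while (throttled && cfs_b->runtime > 0) {
                runtime = cfs_b->runtime;
                raw_spin_unlock(&cfs_b->lock);
                /* can't hold cfs_b->lock while distribute takes rq locks */
                runtime = distribute_cfs_runtime(cfs_b, runtime, runtime_expires);
                raw_spin_lock(&cfs_b->lock);

                throttled = !list_empty(&cfs_b->throttled_cfs_rq);
                cfs_b->runtime -= min(runtime, cfs_b->runtime);
        }

and, as you say, each unthrottle_cfs_rq() called from inside distribute
takes cfs_b->lock again briefly for its own accounting, plus the woken
cpus hit it once they start running and need runtime.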

Cheers,
Phil

> 
> >
> > My reproducer is artificial, but it is designed to trigger the issue that has
> > been hit in various customer workloads. So yes, it's exaggerated, but only
> > so I don't have to wait weeks between hits :)
> >
> >
> > Thanks,
> > Phil
> >
> >> >
> >> > A more complete fix to make sure do_sched_cfs_period_timer never takes longer than a period
> >> > would be good, but I'm not sure what that would be, and we still have this potential forever
> >> > loop.
> >> >
> >> > Below is the bandaid version. 
> >> >   
> >> > Thoughts? 
> >> >
> >> >
> >> > Cheers,
> >> > Phil
> >> >
> >> > ---
> >> >
> >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> > index 310d0637fe4b..33e55620f891 100644
> >> > --- a/kernel/sched/fair.c
> >> > +++ b/kernel/sched/fair.c
> >> > @@ -4859,12 +4859,16 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
> >> >         return HRTIMER_NORESTART;
> >> >  }
> >> >  
> >> > +/* This is pretty arbitrary just to escape the "forever" loop before the watchdog
> >> > + * kills us as there is no guarantee hrtimer_forward_now ever returns 0. */
> >> > +#define CFS_PERIOD_TIMER_EXIT_COUNT 100
> >> >  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >> >  {
> >> >         struct cfs_bandwidth *cfs_b =
> >> >                 container_of(timer, struct cfs_bandwidth, period_timer);
> >> >         int overrun;
> >> >         int idle = 0;
> >> > +       int count = 0;
> >> >  
> >> >         raw_spin_lock(&cfs_b->lock);
> >> >         for (;;) {
> >> > @@ -4872,6 +4876,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >> >                 if (!overrun)
> >> >                         break;
> >> >  
> >> > +               if (++count > CFS_PERIOD_TIMER_EXIT_COUNT) {
> >> > +                       pr_warn_ratelimited("cfs_period_timer(%d): too many overruns. Consider raising cfs_period_us (%lld)\n",
> >> > +                               smp_processor_id(), cfs_b->period/NSEC_PER_USEC);
> >> > +                       // Make sure we restart the timer.
> >> > +                       idle = 0;
> >> > +                       break;
> >> > +               }
> >> > +
> >> >                 idle = do_sched_cfs_period_timer(cfs_b, overrun);
> >> >         }
> >> >         if (idle)

-- 
