Re: [RFC] sched/fair: hard lockup in sched_cfs_period_timer

From: Phil Auld <pauld@redhat.com>
To: bsegall@google.com
Cc: mingo@redhat.com, peterz@infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
Date: Wed, 13 Mar 2019 14:50:26 -0400	[thread overview]
Message-ID: <20190313185026.GD24002@pauld.bos.csb> (raw)
In-Reply-To: <xm26ef7ay8ie.fsf@bsegall-linux.svl.corp.google.com>

On Wed, Mar 13, 2019 at 10:44:09AM -0700 bsegall@google.com wrote:
> Phil Auld <pauld@redhat.com> writes:
> 
> > On Mon, Mar 11, 2019 at 04:25:36PM -0400 Phil Auld wrote:
> >> On Mon, Mar 11, 2019 at 10:44:25AM -0700 bsegall@google.com wrote:
> >> > Letting it spin for 100ms and then only increasing by 6% seems extremely
> >> > generous. If we went this route I'd probably say "after looping N
> >> > times, set the period to time taken / N + X%" where N is like 8 or
> >> > something. I think I'd probably perfer something like this to the
> >> > previous "just abort and let it happen again next interrupt" one.
> >> 
> >> Okay. I'll try to spin something up that does this. It may be a little 
> >> trickier to keep the quota proportional to the new period. I think that's 
> >> important since we'll be changing the user's setting.
> >> 
> >> Do you mean to have it break when it hits N and recalculates the period or 
> >> reset the counter and keep going?
> >> 
> >
> > Let me know what you think of the below. It's working nicely. I like your 
> > suggestion to limit it quickly based on number of loops and use that to 
> > scale up. I think it is best to break out and let it fire again if needed. 
> > The warning fires once, very occasionally twice, and then things are quiet.
> >
> > If that looks reasonable I'll do some more testing and spin it up as a real 
> > patch submission. 
> 
> Yeah, this looks reasonable. I should probably see how unreasonable the
> other thing would be, but if your previous periods were kinda small (and
> it's just that the machine crashing isn't an ok failure mode) I suppose
> it's not a big deal.
> 

I posted it a little while ago. The periods were tiny (~2000us vs a minimum 
of 1000) and with 2500 mostly unused child cgroups (I didn't narrow that 
down much but it did reproduce still with 1250 children).  That's why I was 
thinking edge case. It also requires a fairly small quota and load to make 
sure cfs_rqs get throttled.

I'm still wrapping my head around the scheduler code but I'd be happy to 
try it the other way if you can give me a bit more description of what
you have in mind. Also happy to test a patch with my repro. 

Cheers,
Phil

> >
> > Cheers,
> > Phil
> > ---
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 310d0637fe4b..54b30adfc89e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4859,19 +4859,51 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
> >  	return HRTIMER_NORESTART;
> >  }
> >  
> > +extern const u64 max_cfs_quota_period;
> > +int cfs_period_autotune_loop_limit   = 8;
> > +int cfs_period_autotune_cushion_pct  = 15; /* percentage added to period recalculation */
> > +
> >  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >  {
> >  	struct cfs_bandwidth *cfs_b =
> >  		container_of(timer, struct cfs_bandwidth, period_timer);
> > +	s64 nsstart, nsnow, new_period;
> >  	int overrun;
> >  	int idle = 0;
> > +	int count = 0;
> >  
> >  	raw_spin_lock(&cfs_b->lock);
> > +	nsstart = ktime_to_ns(hrtimer_cb_get_time(timer));
> >  	for (;;) {
> >  		overrun = hrtimer_forward_now(timer, cfs_b->period);
> >  		if (!overrun)
> >  			break;
> >  
> > +		if (++count > cfs_period_autotune_loop_limit) {
> > +			ktime_t old_period = ktime_to_ns(cfs_b->period);
> > +
> > +			nsnow = ktime_to_ns(hrtimer_cb_get_time(timer));
> > +			new_period = (nsnow - nsstart)/cfs_period_autotune_loop_limit;
> > +
> > +			/*  Make sure new period will be larger than old. */
> > +			if (new_period < old_period) {
> > +				new_period = old_period;
> > +			}
> > +			new_period += (new_period *  cfs_period_autotune_cushion_pct) / 100;
> 
> This ordering means that it will always increase by at least 15%. This
> is a bit odd but probably a good thing; I'd just change the comment to
> make it clear this is deliberate.
> 
> > +
> > +			if (new_period >  max_cfs_quota_period)
> > +				new_period = max_cfs_quota_period;
> > +
> > +			cfs_b->period = ns_to_ktime(new_period);
> > +			cfs_b->quota += (cfs_b->quota * ((new_period - old_period) * 100)/old_period)/100;
> 
> In general it makes sense to do fixed point via 1024 or something that
> can be optimized into shifts (and a larger number is better in general
> for better precision).
> 
> > +			pr_warn_ratelimited(
> > +				"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> > +				smp_processor_id(), cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC);
> > +
> > +			idle = 0;
> > +			break;
> > +		}
> > +
> >  		idle = do_sched_cfs_period_timer(cfs_b, overrun);
> >  	}
> >  	if (idle)

--