All of lore.kernel.org
 help / color / mirror / Atom feed
From: Phil Auld <pauld@redhat.com>
To: bsegall@google.com
Cc: mingo@redhat.com, peterz@infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
Date: Wed, 13 Mar 2019 14:50:26 -0400	[thread overview]
Message-ID: <20190313185026.GD24002@pauld.bos.csb> (raw)
In-Reply-To: <xm26ef7ay8ie.fsf@bsegall-linux.svl.corp.google.com>

On Wed, Mar 13, 2019 at 10:44:09AM -0700 bsegall@google.com wrote:
> Phil Auld <pauld@redhat.com> writes:
> 
> > On Mon, Mar 11, 2019 at 04:25:36PM -0400 Phil Auld wrote:
> >> On Mon, Mar 11, 2019 at 10:44:25AM -0700 bsegall@google.com wrote:
> >> > Letting it spin for 100ms and then only increasing by 6% seems extremely
> >> > generous. If we went this route I'd probably say "after looping N
> >> > times, set the period to time taken / N + X%" where N is like 8 or
> >> > something. I think I'd probably perfer something like this to the
> >> > previous "just abort and let it happen again next interrupt" one.
> >> 
> >> Okay. I'll try to spin something up that does this. It may be a little 
> >> trickier to keep the quota proportional to the new period. I think that's 
> >> important since we'll be changing the user's setting.
> >> 
> >> Do you mean to have it break when it hits N and recalculates the period or 
> >> reset the counter and keep going?
> >> 
> >
> > Let me know what you think of the below. It's working nicely. I like your 
> > suggestion to limit it quickly based on number of loops and use that to 
> > scale up. I think it is best to break out and let it fire again if needed. 
> > The warning fires once, very occasionally twice, and then things are quiet.
> >
> > If that looks reasonable I'll do some more testing and spin it up as a real 
> > patch submission. 
> 
> Yeah, this looks reasonable. I should probably see how unreasonable the
> other thing would be, but if your previous periods were kinda small (and
> it's just that the machine crashing isn't an ok failure mode) I suppose
> it's not a big deal.
> 

I posted it a little while ago. The periods were tiny (~2000us vs a minimum 
of 1000) and with 2500 mostly unused child cgroups (I didn't narrow that 
down much but it did reproduce still with 1250 children).  That's why I was 
thinking edge case. It also requires a fairly small quota and load to make 
sure cfs_rqs get throttled.

I'm still wrapping my head around the scheduler code but I'd be happy to 
try it the other way if you can give me a bit more description of what
you have in mind. Also happy to test a patch with my repro. 


Cheers,
Phil


> >
> > Cheers,
> > Phil
> > ---
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 310d0637fe4b..54b30adfc89e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4859,19 +4859,51 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
> >  	return HRTIMER_NORESTART;
> >  }
> >  
> > +extern const u64 max_cfs_quota_period;
> > +int cfs_period_autotune_loop_limit   = 8;
> > +int cfs_period_autotune_cushion_pct  = 15; /* percentage added to period recalculation */
> > +
> >  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >  {
> >  	struct cfs_bandwidth *cfs_b =
> >  		container_of(timer, struct cfs_bandwidth, period_timer);
> > +	s64 nsstart, nsnow, new_period;
> >  	int overrun;
> >  	int idle = 0;
> > +	int count = 0;
> >  
> >  	raw_spin_lock(&cfs_b->lock);
> > +	nsstart = ktime_to_ns(hrtimer_cb_get_time(timer));
> >  	for (;;) {
> >  		overrun = hrtimer_forward_now(timer, cfs_b->period);
> >  		if (!overrun)
> >  			break;
> >  
> > +		if (++count > cfs_period_autotune_loop_limit) {
> > +			ktime_t old_period = ktime_to_ns(cfs_b->period);
> > +
> > +			nsnow = ktime_to_ns(hrtimer_cb_get_time(timer));
> > +			new_period = (nsnow - nsstart)/cfs_period_autotune_loop_limit;
> > +
> > +			/*  Make sure new period will be larger than old. */
> > +			if (new_period < old_period) {
> > +				new_period = old_period;
> > +			}
> > +			new_period += (new_period *  cfs_period_autotune_cushion_pct) / 100;
> 
> This ordering means that it will always increase by at least 15%. This
> is a bit odd but probably a good thing; I'd just change the comment to
> make it clear this is deliberate.
> 
> > +
> > +			if (new_period >  max_cfs_quota_period)
> > +				new_period = max_cfs_quota_period;
> > +
> > +			cfs_b->period = ns_to_ktime(new_period);
> > +			cfs_b->quota += (cfs_b->quota * ((new_period - old_period) * 100)/old_period)/100;
> 
> In general it makes sense to do fixed point via 1024 or something that
> can be optimized into shifts (and a larger number is better in general
> for better precision).
> 
> > +			pr_warn_ratelimited(
> > +				"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> > +				smp_processor_id(), cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC);
> > +
> > +			idle = 0;
> > +			break;
> > +		}
> > +
> >  		idle = do_sched_cfs_period_timer(cfs_b, overrun);
> >  	}
> >  	if (idle)

-- 

  reply	other threads:[~2019-03-13 18:50 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-01 14:52 [RFC] sched/fair: hard lockup in sched_cfs_period_timer Phil Auld
2019-03-04 18:13 ` bsegall
2019-03-04 19:05   ` Phil Auld
2019-03-05 18:49     ` bsegall
2019-03-05 20:05       ` Phil Auld
2019-03-05 20:45         ` bsegall
2019-03-06 16:23           ` Phil Auld
2019-03-06 19:25             ` bsegall
2019-03-09 20:33               ` Phil Auld
2019-03-11 17:44                 ` bsegall
2019-03-11 20:25                   ` Phil Auld
2019-03-12 13:57                     ` Phil Auld
2019-03-13 17:44                       ` bsegall
2019-03-13 18:50                         ` Phil Auld [this message]
2019-03-13 20:26                           ` bsegall
2019-03-13 21:10                             ` Phil Auld
2019-03-12 17:29                     ` bsegall

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190313185026.GD24002@pauld.bos.csb \
    --to=pauld@redhat.com \
    --cc=bsegall@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.