From: Phil Auld <pauld@redhat.com>
To: bsegall@google.com
Cc: mingo@redhat.com, peterz@infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
Date: Tue, 5 Mar 2019 15:05:54 -0500
Message-ID: <20190305200554.GA8786@pauld.bos.csb>
In-Reply-To: <xm268sxtyx5u.fsf@bsegall-linux.svl.corp.google.com>

On Tue, Mar 05, 2019 at 10:49:01AM -0800 bsegall@google.com wrote:
> Phil Auld <pauld@redhat.com> writes:
> 
> >> >
> >> > 	raw_spin_lock(&cfs_b->lock);
> >> > 	for (;;) {
> >> > 		overrun = hrtimer_forward_now(timer, cfs_b->period);
> >> > 		if (!overrun)
> >> > 			break;
> >> >
> >> > 		idle = do_sched_cfs_period_timer(cfs_b, overrun);
> >> > 	}
> >> > 	if (idle)
> >> > 		cfs_b->period_active = 0;
> >> > 	raw_spin_unlock(&cfs_b->lock);
> >> >
> >> >
> >> > There's no guarantee that hrtimer_forward_now() will ever return 0, and
> >> > do_sched_cfs_period_timer() will also drop and re-acquire cfs_b->lock to call
> >> > distribute_cfs_runtime().
> >> >
> >> > I considered code to tune the period and quota up (proportionally so they have the same 
> >> > relative effect) but I did not like that because of the above statement and the fact that 
> >> > it would be changing the valid values the user configured. I have two versions that do that 
> >> > differently but they both still call out for protection from the forever loop. I think they 
> >> > add complexity for no real gain.
> >> >
> >> > For my current testing I'm using a more direct but less elegant approach of simply limiting 
> >> > the number of times the loop can execute. This has the potential to skip handling a period 
> >> > I think but will prevent the problem reliably. I'm not sure how many iterations this loop 
> >> > was expected to take. I'm seeing numbers in the low thousands at times.
> >> 
> >> I mean the answer should be "do_sched_cfs_period_timer runs once" the
> >> vast majority of the time; if it's not then your machine/setup can't
> >> handle whatever super low period you've set, or there's some bad
> >> behavior somewhere in the period timer handling.
> >> CFS_PERIOD_TIMER_EXIT_COUNT could reasonably be like 2 or 3 - this would
> >> mean that you've already spent an entire period inside the handler.
> >>
> >
> > Thanks for looking at this.
> >
> > That's sort of what I thought, too. Would it not make more sense to get rid
> > of the loop? I find forever loops inherently suspect.
> 
> I mean, it's probably better to have a loop than to copy-paste the code a
> couple of times; you could rework the condition in your patch to be part
> of the loop condition in general, though it might not be any clearer
> (heck, the loop /could/ be for (;overrun = hrtimer_forward_now(...);) { ... },
> it's just kinda questionable whether that's clearer).
> 

So I'm not clear on this loop thing, though.  If it's okay for the loop
to run 2-3 times, that implies it's okay for do_sched_cfs_period_timer()
to take longer than a period to run.  And if that's acceptable two or
three times, what would ensure it does not continually take that long?
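(To put rough numbers on it: with, say, a 10ms period, getting to a 2nd or
3rd pass through the loop already means something on the order of a full
period, 10ms or more, spent inside the handler.)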


> >
> > I think the fact that we drop the lock to do the distribute is the real culprit.
> > It's not do_sched_cfs_period_timer()'s code itself that takes the time,
> > I think, it's the re-acquire of the cfs_b->lock and the contention on it due
> > to all the cgroups. Lock_stat samples during that run show pretty high contention
> > on that lock and some really long wait times.
> 
> Tons of cgroups shouldn't increase contention; cfs_b->lock is
> per-cgroup, and I don't see or remember anything offhand where it even
> should be worse for deep nesting. Large machines with many cpus where
> threads in the cgroup are running on each make it worse of course, and
> tons of cgroups with throttling enabled have O(log n) cost on hrtimers
> insertion and of course O(n) cost in actual runtime in irq handlers.

The lock is per cgroup but it's taken in assign_cfs_rq_runtime() via 
__account_cfs_rq_runtime() from every non-idle cpu, I think.
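
For reference, the path I mean looks roughly like this (hand-simplified
from kernel/sched/fair.c, so not exact; I've dropped the runtime expiry
handling):

        static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
        {
                struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
                u64 amount = 0, min_amount;

                /* how much we want to pull into this cfs_rq's local pool */
                min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;

                raw_spin_lock(&cfs_b->lock);    /* the per-tg lock, hit from every busy cpu */
                if (cfs_b->quota == RUNTIME_INF)
                        amount = min_amount;
                else {
                        start_cfs_bandwidth(cfs_b);
                        if (cfs_b->runtime > 0) {
                                amount = min(cfs_b->runtime, min_amount);
                                cfs_b->runtime -= amount;
                        }
                }
                raw_spin_unlock(&cfs_b->lock);

                cfs_rq->runtime_remaining += amount;

                return cfs_rq->runtime_remaining > 0;
        }

So every cpu whose cfs_rq runs out of its local slice comes back here and
bounces cfs_b->lock.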

> 
> If you're seeing increasing contention as the cgroup depth or cgroup
> count increases, that may be a good thing to fix regardless.

I'll look at it some more. 

Interestingly, if I limit the number of child cgroups to the number I'm
actually putting processes into (16, down from 2500), the problem does
not reproduce.

Fwiw, this is an 80-cpu (incl. SMT), 4-NUMA-node system, so there are a
lot of cfs_rqs that might be doing __account_cfs_rq_runtime().
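(Rough back-of-the-envelope: with the default 5ms bandwidth slice, each
busy cpu in the group comes back to cfs_b->lock for a refill roughly every
5ms of runtime it burns, so ~80 cpus alone can account for the bulk of the
__account_cfs_rq_runtime() acquisitions in the lock_stat output below.)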

Here is lock_stat output from a short stretch of the run, taken after the pr_warn had
fired a couple of times, edited for screen width. This is the 8th lock on the list.
(I wonder, in this case, if the really long holds are the pr_warn firing.)
 
class name       &cfs_b->lock:
con-bounces      33770
contentions      33776
waittime-min         0.10
waittime-max     48088.04
waittime-total  856027.70
waittime-avg        25.34  
acq-bounces     148379
acquisitions    162184 
holdtime-min         0.00 
holdtime-max     48240.79
holdtime-total  354683.04  
holdtime-avg         2.19
                 
             ------------
             &cfs_b->lock          29414          [<00000000cfc57971>] __account_cfs_rq_runtime+0xd5/0x1a0
             &cfs_b->lock           4195          [<00000000754af0b8>] throttle_cfs_rq+0x193/0x2c0
             &cfs_b->lock            166          [<00000000b6333ad0>] unthrottle_cfs_rq+0x54/0x1d0
             &cfs_b->lock              1          [<00000000fc0c15d2>] sched_cfs_period_timer+0xe6/0x1d0
             ------------
             &cfs_b->lock          28602          [<00000000cfc57971>] __account_cfs_rq_runtime+0xd5/0x1a0
             &cfs_b->lock           3215          [<00000000754af0b8>] throttle_cfs_rq+0x193/0x2c0
             &cfs_b->lock           1938          [<00000000b6333ad0>] unthrottle_cfs_rq+0x54/0x1d0
             &cfs_b->lock             21          [<00000000fc0c15d2>] sched_cfs_period_timer+0xe6/0x1d0


> 
> The lock has to be dropped due to lock ordering vs rq locks, and the
> reverse order wouldn't be possible. That said, each cfs_rq unthrottle in
> distribute grabs the lock, and then that cpu will grab the lock when it
> wakes up, which can be while we're still in distribute. I'm not sure if
> it would be possible to move the resched_curr calls until after doing
> all the rest of unthrottle, and even if we did then we'd be taking each
> rq lock twice, which wouldn't be great either. It might at least be
> possible to coalesce all the cfs_b accounting in unthrottle to be done
> in a single locked section, but I don't know if that would actually
> help; it would still all have to be serialized regardless after all.

Yeah, that makes sense, thanks.
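
For reference (mostly so I have it in front of me), the drop/re-acquire
around distribute looks roughly like this in do_sched_cfs_period_timer()
(simplified, not verbatim):

        /* with cfs_b->lock held on entry */
        while (throttled && cfs_b->runtime > 0) {
                runtime = cfs_b->runtime;
                raw_spin_unlock(&cfs_b->lock);
                /* can't hold cfs_b->lock while distribute takes rq locks */
                runtime = distribute_cfs_runtime(cfs_b, runtime, runtime_expires);
                raw_spin_lock(&cfs_b->lock);

                throttled = !list_empty(&cfs_b->throttled_cfs_rq);
                cfs_b->runtime -= min(runtime, cfs_b->runtime);
        }

and, as you say, each unthrottle_cfs_rq() called from inside distribute
takes cfs_b->lock again briefly for its own accounting, plus the woken
cpus hit it once they start running and need runtime.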

Cheers,
Phil

> 
> >
> > My reproducer is artificial, but it is designed to trigger the issue that has
> > been hit in various customer workloads. So yes, it's exaggerated, but only
> > so I don't have to wait weeks between hits :)
> >
> >
> > Thanks,
> > Phil
> >
> >> >
> >> > A more complete fix to make sure do_sched_cfs_period_timer never takes longer than a period
> >> > would be good, but I'm not sure what that would be, and we still have this potential forever
> >> > loop.
> >> >
> >> > Below is the bandaid version. 
> >> >   
> >> > Thoughts? 
> >> >
> >> >
> >> > Cheers,
> >> > Phil
> >> >
> >> > ---
> >> >
> >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> > index 310d0637fe4b..33e55620f891 100644
> >> > --- a/kernel/sched/fair.c
> >> > +++ b/kernel/sched/fair.c
> >> > @@ -4859,12 +4859,16 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
> >> >         return HRTIMER_NORESTART;
> >> >  }
> >> >  
> >> > +/* This is pretty arbitrary just to escape the "forever" loop before the watchdog
> >> > + * kills us as there is no guarantee hrtimer_forward_now ever returns 0. */
> >> > +#define CFS_PERIOD_TIMER_EXIT_COUNT 100
> >> >  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >> >  {
> >> >         struct cfs_bandwidth *cfs_b =
> >> >                 container_of(timer, struct cfs_bandwidth, period_timer);
> >> >         int overrun;
> >> >         int idle = 0;
> >> > +       int count = 0;
> >> >  
> >> >         raw_spin_lock(&cfs_b->lock);
> >> >         for (;;) {
> >> > @@ -4872,6 +4876,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >> >                 if (!overrun)
> >> >                         break;
> >> >  
> >> > +               if (++count > CFS_PERIOD_TIMER_EXIT_COUNT) {
> >> > +                       pr_warn_ratelimited("cfs_period_timer(%d): too many overruns. Consider raising cfs_period_us (%lld)\n",
> >> > +                               smp_processor_id(), cfs_b->period/NSEC_PER_USEC);
> >> > +                       // Make sure we restart the timer.
> >> > +                       idle = 0;
> >> > +                       break;
> >> > +               }
> >> > +
> >> >                 idle = do_sched_cfs_period_timer(cfs_b, overrun);
> >> >         }
> >> >         if (idle)

-- 
