linux-kernel.vger.kernel.org archive mirror
* [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
@ 2019-03-01 14:52 Phil Auld
  2019-03-04 18:13 ` bsegall
  0 siblings, 1 reply; 17+ messages in thread
From: Phil Auld @ 2019-03-01 14:52 UTC (permalink / raw)
  To: mingo, peterz; +Cc: linux-kernel

Hi,

I have a reproducible case of this: 

[  217.264946] NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
[  217.264948] Modules linked in: sunrpc iTCO_wdt gpio_ich iTCO_vendor_support intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si intel_cstate joydev ipmi_devintf pcspkr hpilo intel_uncore sg hpwdt ipmi_msghandler acpi_power_meter i7core_edac lpc_ich acpi_cpufreq ip_tables xfs libcrc32c sr_mod sd_mod cdrom ata_generic radeon i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix drm serio_raw crc32c_intel hpsa myri10ge libata dca scsi_transport_sas netxen_nic dm_mirror dm_region_hash dm_log dm_mod
[  217.264964] CPU: 24 PID: 46684 Comm: myapp Not tainted 5.0.0-rc7+ #25
[  217.264965] Hardware name: HP ProLiant DL580 G7, BIOS P65 08/16/2015
[  217.264965] RIP: 0010:tg_nop+0x0/0x10
[  217.264966] Code: 83 c9 08 f0 48 0f b1 0f 48 39 c2 74 0e 48 89 c2 f7 c2 00 00 20 00 75 dc 31 c0 c3 b8 01 00 00 00 c3 66 0f 1f 84 00 00 00 00 00 <66> 66 66 66 90 31 c0 c3 0f 1f 84 00 00 00 00 00 66 66 66 66 90 8b
[  217.264967] RSP: 0000:ffff976a7f703e48 EFLAGS: 00000087
[  217.264967] RAX: ffff976a7c7c8f00 RBX: ffff976a6f4fad00 RCX: ffff976a7c7c90f0
[  217.264968] RDX: ffff976a6f4faee0 RSI: ffff976a7f961f00 RDI: ffff976a6f4fad00
[  217.264968] RBP: ffff976a7f961f00 R08: 0000000000000002 R09: ffffffad2c9b3331
[  217.264969] R10: 0000000000000000 R11: 0000000000000000 R12: ffff976a7c7c8f00
[  217.264969] R13: ffffffffb2305c00 R14: 0000000000000000 R15: ffffffffb22f8510
[  217.264970] FS:  00007f6240678740(0000) GS:ffff976a7f700000(0000) knlGS:0000000000000000
[  217.264970] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  217.264971] CR2: 00000000006dee20 CR3: 000000bf2bffc005 CR4: 00000000000206e0
[  217.264971] Call Trace:
[  217.264971]  <IRQ>
[  217.264972]  walk_tg_tree_from+0x29/0xb0
[  217.264972]  unthrottle_cfs_rq+0xe0/0x1a0
[  217.264972]  distribute_cfs_runtime+0xd3/0xf0
[  217.264973]  sched_cfs_period_timer+0xcb/0x160
[  217.264973]  ? sched_cfs_slack_timer+0xd0/0xd0
[  217.264973]  __hrtimer_run_queues+0xfb/0x270
[  217.264974]  hrtimer_interrupt+0x122/0x270
[  217.264974]  smp_apic_timer_interrupt+0x6a/0x140
[  217.264975]  apic_timer_interrupt+0xf/0x20
[  217.264975]  </IRQ>
[  217.264975] RIP: 0033:0x7f6240125fe5
[  217.264976] Code: 44 17 d0 f3 44 0f 7f 47 30 f3 44 0f 7f 44 17 c0 48 01 fa 48 83 e2 c0 48 39 d1 74 a3 66 0f 1f 84 00 00 00 00 00 66 44 0f 7f 01 <66> 44 0f 7f 41 10 66 44 0f 7f 41 20 66 44 0f 7f 41 30 48 83 c1 40
...

There's more but this is the juicy part. 

The reproducer involves a large number of cgroups in a hierarchy and a very low cfs_period_us 
setting. 

The guts of sched_cfs_period_timer is this forever loop:

	raw_spin_lock(&cfs_b->lock);
	for (;;) {
		overrun = hrtimer_forward_now(timer, cfs_b->period);
		if (!overrun)
			break;

		idle = do_sched_cfs_period_timer(cfs_b, overrun);
	}
	if (idle)
		cfs_b->period_active = 0;
	raw_spin_unlock(&cfs_b->lock);


There's no guarantee that hrtimer_forward_now() will ever return 0 and also 
do_sched_cfs_period_timer() will drop and re-acquire the cfs_b->lock to call 
distribute_cfs_runtime. 
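
Roughly, the relevant pattern inside do_sched_cfs_period_timer() looks like
this (a paraphrased sketch, not the exact source; cfs_b->lock can't be held
across distribute_cfs_runtime() because that path takes per-cpu rq locks):

	while (throttled && cfs_b->runtime > 0) {
		runtime = cfs_b->runtime;
		/* can't nest cfs_b->lock inside the rq locks taken below */
		raw_spin_unlock(&cfs_b->lock);
		runtime = distribute_cfs_runtime(cfs_b, runtime, runtime_expires);
		raw_spin_lock(&cfs_b->lock);

		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
		cfs_b->runtime -= min(runtime, cfs_b->runtime);
	}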

I considered code to tune the period and quota up (proportionally so they have the same 
relative effect) but I did not like that because of the above statement and the fact that 
it would be changing the valid values the user configured. I have two versions that do that 
differently but they both still call out for protection from the forever loop. I think they 
add complexity for no real gain.

For my current testing I'm using a more direct but less elegant approach of simply limiting 
the number of times the loop can execute. This has the potential to skip handling a period 
I think but will prevent the problem reliably. I'm not sure how many iterations this loop 
was expected to take. I'm seeing numbers in the low thousands at times.

A more complete fix to make sure do_sched_cfs_period_timer never takes longer than period 
would be good but I'm not sure what that would be and we still have this potential forever
loop. 

Below is the bandaid version. 
  
Thoughts? 


Cheers,
Phil

---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 310d0637fe4b..33e55620f891 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4859,12 +4859,16 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
        return HRTIMER_NORESTART;
 }
 
+/* This is pretty arbitrary just to escape the "forever" loop before the watchdog
+ * kills us as there is no guarantee hrtimer_forward_now ever returns 0. */
+#define CFS_PERIOD_TIMER_EXIT_COUNT 100
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
        struct cfs_bandwidth *cfs_b =
                container_of(timer, struct cfs_bandwidth, period_timer);
        int overrun;
        int idle = 0;
+       int count = 0;
 
        raw_spin_lock(&cfs_b->lock);
        for (;;) {
@@ -4872,6 +4876,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
                if (!overrun)
                        break;
 
+               if (++count > CFS_PERIOD_TIMER_EXIT_COUNT) {
+                       pr_warn_ratelimited("cfs_period_timer(%d): too many overruns. Consider raising cfs_period_us (%lld)\n",
+                               smp_processor_id(), cfs_b->period/NSEC_PER_USEC);
+                       // Make sure we restart the timer.
+                       idle = 0;
+                       break;
+               }
+
                idle = do_sched_cfs_period_timer(cfs_b, overrun);
        }
        if (idle)


-- 


* Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
  2019-03-01 14:52 [RFC] sched/fair: hard lockup in sched_cfs_period_timer Phil Auld
@ 2019-03-04 18:13 ` bsegall
  2019-03-04 19:05   ` Phil Auld
  0 siblings, 1 reply; 17+ messages in thread
From: bsegall @ 2019-03-04 18:13 UTC (permalink / raw)
  To: Phil Auld; +Cc: mingo, peterz, linux-kernel

Phil Auld <pauld@redhat.com> writes:

> Hi,
>
> I have a reproducible case of this: 
>
> [  217.264946] NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
> [  217.264948] Modules linked in: sunrpc iTCO_wdt gpio_ich iTCO_vendor_support intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si intel_cstate joydev ipmi_devintf pcspkr hpilo intel_uncore sg hpwdt ipmi_msghandler acpi_power_meter i7core_edac lpc_ich acpi_cpufreq ip_tables xfs libcrc32c sr_mod sd_mod cdrom ata_generic radeon i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix drm serio_raw crc32c_intel hpsa myri10ge libata dca scsi_transport_sas netxen_nic dm_mirror dm_region_hash dm_log dm_mod
> [  217.264964] CPU: 24 PID: 46684 Comm: myapp Not tainted 5.0.0-rc7+ #25
> [  217.264965] Hardware name: HP ProLiant DL580 G7, BIOS P65 08/16/2015
> [  217.264965] RIP: 0010:tg_nop+0x0/0x10
> [  217.264966] Code: 83 c9 08 f0 48 0f b1 0f 48 39 c2 74 0e 48 89 c2 f7 c2 00 00 20 00 75 dc 31 c0 c3 b8 01 00 00 00 c3 66 0f 1f 84 00 00 00 00 00 <66> 66 66 66 90 31 c0 c3 0f 1f 84 00 00 00 00 00 66 66 66 66 90 8b
> [  217.264967] RSP: 0000:ffff976a7f703e48 EFLAGS: 00000087
> [  217.264967] RAX: ffff976a7c7c8f00 RBX: ffff976a6f4fad00 RCX: ffff976a7c7c90f0
> [  217.264968] RDX: ffff976a6f4faee0 RSI: ffff976a7f961f00 RDI: ffff976a6f4fad00
> [  217.264968] RBP: ffff976a7f961f00 R08: 0000000000000002 R09: ffffffad2c9b3331
> [  217.264969] R10: 0000000000000000 R11: 0000000000000000 R12: ffff976a7c7c8f00
> [  217.264969] R13: ffffffffb2305c00 R14: 0000000000000000 R15: ffffffffb22f8510
> [  217.264970] FS:  00007f6240678740(0000) GS:ffff976a7f700000(0000) knlGS:0000000000000000
> [  217.264970] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  217.264971] CR2: 00000000006dee20 CR3: 000000bf2bffc005 CR4: 00000000000206e0
> [  217.264971] Call Trace:
> [  217.264971]  <IRQ>
> [  217.264972]  walk_tg_tree_from+0x29/0xb0
> [  217.264972]  unthrottle_cfs_rq+0xe0/0x1a0
> [  217.264972]  distribute_cfs_runtime+0xd3/0xf0
> [  217.264973]  sched_cfs_period_timer+0xcb/0x160
> [  217.264973]  ? sched_cfs_slack_timer+0xd0/0xd0
> [  217.264973]  __hrtimer_run_queues+0xfb/0x270
> [  217.264974]  hrtimer_interrupt+0x122/0x270
> [  217.264974]  smp_apic_timer_interrupt+0x6a/0x140
> [  217.264975]  apic_timer_interrupt+0xf/0x20
> [  217.264975]  </IRQ>
> [  217.264975] RIP: 0033:0x7f6240125fe5
> [  217.264976] Code: 44 17 d0 f3 44 0f 7f 47 30 f3 44 0f 7f 44 17 c0 48 01 fa 48 83 e2 c0 48 39 d1 74 a3 66 0f 1f 84 00 00 00 00 00 66 44 0f 7f 01 <66> 44 0f 7f 41 10 66 44 0f 7f 41 20 66 44 0f 7f 41 30 48 83 c1 40
> ...
>
> There's more but this is the juicy part. 
>
> The reproducer involves a large number of cgroups in a hierarchy and a very low cfs_period_us 
> setting. 
>
> The guts of sched_cfs_period_timer is this forever loop:
>
> 	raw_spin_lock(&cfs_b->lock);
> 	for (;;) {
> 		overrun = hrtimer_forward_now(timer, cfs_b->period);
> 		if (!overrun)
> 			break;
>
> 		idle = do_sched_cfs_period_timer(cfs_b, overrun);
> 	}
> 	if (idle)
> 		cfs_b->period_active = 0;
> 	raw_spin_unlock(&cfs_b->lock);
>
>
> There's no guarantee that hrtimer_forward_now() will ever return 0 and also 
> do_sched_cfs_period_timer() will drop and re-acquire the cfs_b->lock to call 
> distribute_cfs_runtime. 
>
> I considered code to tune the period and quota up (proportionally so they have the same 
> relative effect) but I did not like that because of the above statement and the fact that 
> it would be changing the valid values the user configured. I have two versions that do that 
> differently but they both still call out for protection from the forever loop. I think they 
> add complexity for no real gain.
>
> For my current testing I'm using a more direct but less elegant approach of simply limiting 
> the number of times the loop can execute. This has the potential to skip handling a period 
> I think but will prevent the problem reliably. I'm not sure how many iterations this loop 
> was expected to take. I'm seeing numbers in the low thousands at times.

I mean the answer should be "do_sched_cfs_period_timer runs once" the
vast majority of the time; if it's not then your machine/setup can't
handle whatever super low period you've set, or there's some bad
behavior somewhere in the period timer handling.
CFS_PERIOD_TIMER_EXIT_COUNT could reasonably be like 2 or 3 - this would
mean that you've already spent an entire period inside the handler.

>
> A more complete fix to make sure do_sched_cfs_period_timer never takes longer than period 
> would be good but I'm not sure what that would be and we still have this potential forever
> loop. 
>
> Below is the bandaid version. 
>   
> Thoughts? 
>
>
> Cheers,
> Phil
>
> ---
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 310d0637fe4b..33e55620f891 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4859,12 +4859,16 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>         return HRTIMER_NORESTART;
>  }
>  
> +/* This is pretty arbitrary just to escape the "forever" loop before the watchdog
> + * kills us as there is no guarantee hrtimer_forward_now ever returns 0. */
> +#define CFS_PERIOD_TIMER_EXIT_COUNT 100
>  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  {
>         struct cfs_bandwidth *cfs_b =
>                 container_of(timer, struct cfs_bandwidth, period_timer);
>         int overrun;
>         int idle = 0;
> +       int count = 0;
>  
>         raw_spin_lock(&cfs_b->lock);
>         for (;;) {
> @@ -4872,6 +4876,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>                 if (!overrun)
>                         break;
>  
> +               if (++count > CFS_PERIOD_TIMER_EXIT_COUNT) {
> +                       pr_warn_ratelimited("cfs_period_timer(%d): too many overruns. Consider raising cfs_period_us (%lld)\n",
> +                               smp_processor_id(), cfs_b->period/NSEC_PER_USEC);
> +                       // Make sure we restart the timer.
> +                       idle = 0;
> +                       break;
> +               }
> +
>                 idle = do_sched_cfs_period_timer(cfs_b, overrun);
>         }
>         if (idle)


* Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
  2019-03-04 18:13 ` bsegall
@ 2019-03-04 19:05   ` Phil Auld
  2019-03-05 18:49     ` bsegall
  0 siblings, 1 reply; 17+ messages in thread
From: Phil Auld @ 2019-03-04 19:05 UTC (permalink / raw)
  To: bsegall; +Cc: mingo, peterz, linux-kernel

On Mon, Mar 04, 2019 at 10:13:49AM -0800 bsegall@google.com wrote:
> Phil Auld <pauld@redhat.com> writes:
> 
> > Hi,
> >
> > I have a reproducible case of this: 
> >
> > [  217.264946] NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
> > [  217.264948] Modules linked in: sunrpc iTCO_wdt gpio_ich iTCO_vendor_support intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si intel_cstate joydev ipmi_devintf pcspkr hpilo intel_uncore sg hpwdt ipmi_msghandler acpi_power_meter i7core_edac lpc_ich acpi_cpufreq ip_tables xfs libcrc32c sr_mod sd_mod cdrom ata_generic radeon i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix drm serio_raw crc32c_intel hpsa myri10ge libata dca scsi_transport_sas netxen_nic dm_mirror dm_region_hash dm_log dm_mod
> > [  217.264964] CPU: 24 PID: 46684 Comm: myapp Not tainted 5.0.0-rc7+ #25
> > [  217.264965] Hardware name: HP ProLiant DL580 G7, BIOS P65 08/16/2015
> > [  217.264965] RIP: 0010:tg_nop+0x0/0x10
> > [  217.264966] Code: 83 c9 08 f0 48 0f b1 0f 48 39 c2 74 0e 48 89 c2 f7 c2 00 00 20 00 75 dc 31 c0 c3 b8 01 00 00 00 c3 66 0f 1f 84 00 00 00 00 00 <66> 66 66 66 90 31 c0 c3 0f 1f 84 00 00 00 00 00 66 66 66 66 90 8b
> > [  217.264967] RSP: 0000:ffff976a7f703e48 EFLAGS: 00000087
> > [  217.264967] RAX: ffff976a7c7c8f00 RBX: ffff976a6f4fad00 RCX: ffff976a7c7c90f0
> > [  217.264968] RDX: ffff976a6f4faee0 RSI: ffff976a7f961f00 RDI: ffff976a6f4fad00
> > [  217.264968] RBP: ffff976a7f961f00 R08: 0000000000000002 R09: ffffffad2c9b3331
> > [  217.264969] R10: 0000000000000000 R11: 0000000000000000 R12: ffff976a7c7c8f00
> > [  217.264969] R13: ffffffffb2305c00 R14: 0000000000000000 R15: ffffffffb22f8510
> > [  217.264970] FS:  00007f6240678740(0000) GS:ffff976a7f700000(0000) knlGS:0000000000000000
> > [  217.264970] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  217.264971] CR2: 00000000006dee20 CR3: 000000bf2bffc005 CR4: 00000000000206e0
> > [  217.264971] Call Trace:
> > [  217.264971]  <IRQ>
> > [  217.264972]  walk_tg_tree_from+0x29/0xb0
> > [  217.264972]  unthrottle_cfs_rq+0xe0/0x1a0
> > [  217.264972]  distribute_cfs_runtime+0xd3/0xf0
> > [  217.264973]  sched_cfs_period_timer+0xcb/0x160
> > [  217.264973]  ? sched_cfs_slack_timer+0xd0/0xd0
> > [  217.264973]  __hrtimer_run_queues+0xfb/0x270
> > [  217.264974]  hrtimer_interrupt+0x122/0x270
> > [  217.264974]  smp_apic_timer_interrupt+0x6a/0x140
> > [  217.264975]  apic_timer_interrupt+0xf/0x20
> > [  217.264975]  </IRQ>
> > [  217.264975] RIP: 0033:0x7f6240125fe5
> > [  217.264976] Code: 44 17 d0 f3 44 0f 7f 47 30 f3 44 0f 7f 44 17 c0 48 01 fa 48 83 e2 c0 48 39 d1 74 a3 66 0f 1f 84 00 00 00 00 00 66 44 0f 7f 01 <66> 44 0f 7f 41 10 66 44 0f 7f 41 20 66 44 0f 7f 41 30 48 83 c1 40
> > ...
> >
> > There's more but this is the juicy part. 
> >
> > The reproducer involves a large number of cgroups in a hierarchy and a very low cfs_period_us 
> > setting. 
> >
> > The guts of sched_cfs_period_timer is this forever loop:
> >
> > 	raw_spin_lock(&cfs_b->lock);
> > 	for (;;) {
> > 		overrun = hrtimer_forward_now(timer, cfs_b->period);
> > 		if (!overrun)
> > 			break;
> >
> > 		idle = do_sched_cfs_period_timer(cfs_b, overrun);
> > 	}
> > 	if (idle)
> > 		cfs_b->period_active = 0;
> > 	raw_spin_unlock(&cfs_b->lock);
> >
> >
> > There's no guarantee that hrtimer_forward_now() will ever return 0 and also 
> > do_sched_cfs_period_timer() will drop and re-acquire the cfs_b->lock to call 
> > distribute_cfs_runtime. 
> >
> > I considered code to tune the period and quota up (proportionally so they have the same 
> > relative effect) but I did not like that because of the above statement and the fact that 
> > it would be changing the valid values the user configured. I have two versions that do that 
> > differently but they both still call out for protection from the forever loop. I think they 
> > add complexity for no real gain.
> >
> > For my current testing I'm using a more direct but less elegant approach of simply limiting 
> > the number of times the loop can execute. This has the potential to skip handling a period 
> > I think but will prevent the problem reliably. I'm not sure how many iterations this loop 
> > was expected to take. I'm seeing numbers in the low thousands at times.
> 
> I mean the answer should be "do_sched_cfs_period_timer runs once" the
> vast majority of the time; if it's not then your machine/setup can't
> handle whatever super low period you've set, or there's some bad
> behavior somewhere in the period timer handling.
> CFS_PERIOD_TIMER_EXIT_COUNT could reasonably be like 2 or 3 - this would
> mean that you've already spent an entire period inside the handler.
>

Thanks for looking at this.

That's sort of what I thought, too. Would it not make more sense to get rid
of the loop? I find forever loops inherently suspect.

I think the fact that we drop the lock to do the distribute is the real culprit.
It's not do_sched_cfs_period_timer()'s code itself that takes the time,
I think; it's the re-acquire of the cfs_b->lock and the contention on it due
to all the cgroups. Lock_stat samples during that run show pretty high contention
on that lock and some really long wait times.

My reproducer is artificial, but is designed to trigger the issue as has
been hit in various customer workloads. So yes, it's exaggerated, but only
so I don't have to wait weeks between hits :)


Thanks,
Phil

> >
> > A more complete fix to make sure do_sched_cfs_period_timer never takes longer than period 
> > would be good but I'm not sure what that would be and we still have this potential forever
> > loop. 
> >
> > Below is the bandaid version. 
> >   
> > Thoughts? 
> >
> >
> > Cheers,
> > Phil
> >
> > ---
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 310d0637fe4b..33e55620f891 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4859,12 +4859,16 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
> >         return HRTIMER_NORESTART;
> >  }
> >  
> > +/* This is pretty arbitrary just to escape the "forever" loop before the watchdog
> > + * kills us as there is no guarantee hrtimer_forward_now ever returns 0. */
> > +#define CFS_PERIOD_TIMER_EXIT_COUNT 100
> >  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >  {
> >         struct cfs_bandwidth *cfs_b =
> >                 container_of(timer, struct cfs_bandwidth, period_timer);
> >         int overrun;
> >         int idle = 0;
> > +       int count = 0;
> >  
> >         raw_spin_lock(&cfs_b->lock);
> >         for (;;) {
> > @@ -4872,6 +4876,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >                 if (!overrun)
> >                         break;
> >  
> > +               if (++count > CFS_PERIOD_TIMER_EXIT_COUNT) {
> > +                       pr_warn_ratelimited("cfs_period_timer(%d): too many overruns. Consider raising cfs_period_us (%lld)\n",
> > +                               smp_processor_id(), cfs_b->period/NSEC_PER_USEC);
> > +                       // Make sure we restart the timer.
> > +                       idle = 0;
> > +                       break;
> > +               }
> > +
> >                 idle = do_sched_cfs_period_timer(cfs_b, overrun);
> >         }
> >         if (idle)

-- 


* Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
  2019-03-04 19:05   ` Phil Auld
@ 2019-03-05 18:49     ` bsegall
  2019-03-05 20:05       ` Phil Auld
  0 siblings, 1 reply; 17+ messages in thread
From: bsegall @ 2019-03-05 18:49 UTC (permalink / raw)
  To: Phil Auld; +Cc: mingo, peterz, linux-kernel

Phil Auld <pauld@redhat.com> writes:

>> >
>> > 	raw_spin_lock(&cfs_b->lock);
>> > 	for (;;) {
>> > 		overrun = hrtimer_forward_now(timer, cfs_b->period);
>> > 		if (!overrun)
>> > 			break;
>> >
>> > 		idle = do_sched_cfs_period_timer(cfs_b, overrun);
>> > 	}
>> > 	if (idle)
>> > 		cfs_b->period_active = 0;
>> > 	raw_spin_unlock(&cfs_b->lock);
>> >
>> >
>> > There's no guarantee that hrtimer_forward_now() will ever return 0 and also 
>> > do_sched_cfs_period_timer() will drop and re-acquire the cfs_b->lock to call 
>> > distribute_cfs_runtime. 
>> >
>> > I considered code to tune the period and quota up (proportionally so they have the same 
>> > relative effect) but I did not like that because of the above statement and the fact that 
>> > it would be changing the valid values the user configured. I have two versions that do that 
>> > differently but they both still call out for protection from the forever loop. I think they 
>> > add complexity for no real gain.
>> >
>> > For my current testing I'm using a more direct but less elegant approach of simply limiting 
>> > the number of times the loop can execute. This has the potential to skip handling a period 
>> > I think but will prevent the problem reliably. I'm not sure how many iterations this loop 
>> > was expected to take. I'm seeing numbers in the low thousands at times.
>> 
>> I mean the answer should be "do_sched_cfs_period_timer runs once" the
>> vast majority of the time; if it's not then your machine/setup can't
>> handle whatever super low period you've set, or there's some bad
>> behavior somewhere in the period timer handling.
>> CFS_PERIOD_TIMER_EXIT_COUNT could reasonably be like 2 or 3 - this would
>> mean that you've already spent an entire period inside the handler.
>>
>
> Thanks for looking at this.
>
> That's sort of what I though, too. Would it not make more sense to get rid
> of the loop? I find forever loops inherently suspect.

I mean, probably better to have a loop than to copy-paste the code a couple
of times; you could rework the condition in your patch to be part of the
loop condition in general, though it might not be any clearer (heck,
the loop /could/ be for (;overrun = hrtimer_forward_now(...);) { ... },
it's just questionable whether that's clearer).
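
Folding the count check from your patch into that style, it would look
something like this (just a sketch of the idea, using your
CFS_PERIOD_TIMER_EXIT_COUNT; nothing like it is in the tree):

	while ((overrun = hrtimer_forward_now(timer, cfs_b->period))) {
		if (++count > CFS_PERIOD_TIMER_EXIT_COUNT) {
			pr_warn_ratelimited("cfs_period_timer(%d): too many overruns\n",
					    smp_processor_id());
			/* make sure the timer is restarted */
			idle = 0;
			break;
		}
		idle = do_sched_cfs_period_timer(cfs_b, overrun);
	}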

>
> I think the fact that we drop the lock to do the distribue is the real cuplrit.
> It's not do_sched_cfs_period_timer()'s code itself that takes the time,
> I think, it's the re-acquire of the cfs_b->lock and the contention on it due
> to all the cgroups. Lock_stat smaples during that run show pretty high contention
> on that lock and some really long wait times.

Tons of cgroups shouldn't increase contention; cfs_b->lock is
per-cgroup, and I don't see or remember anything offhand where it even
should be worse for deep nesting. Large machines with many cpus where
threads in the cgroup are running on each make it worse of course, and
tons of cgroups with throttling enabled have O(log n) cost on hrtimers
insertion and of course O(n) cost in actual runtime in irq handlers.

If you're seeing increasing contention as the cgroup depth or cgroup
count increases, that may be a good thing to fix regardless.

The lock has to be dropped due to lock ordering vs rq locks, and the
reverse order wouldn't be possible. That said, each cfs_rq unthrottle in
distribute grabs the lock, and then that cpu will grab the lock when it
wakes up, which can be while we're still in distribute. I'm not sure if
it would be possible to move the resched_curr calls until after doing
all the rest of unthrottle, and even if we did then we'd be taking each
rq lock twice, which wouldn't be great either. It might at least be
possible to coalesce all the cfs_b accounting in unthrottle to be done
in a single locked section, but I don't know if that would actually
help; it would still all have to be serialized regardless after all.

>
> My reproducer is artificial, but is designed to trigger the issue as has
> been hit in various customer workloads. So yes, it's exaggerated, but only
> so I don't have to wait weeks between hits :)
>
>
> Thanks,
> Phil
>
>> >
>> > A more complete fix to make sure do_sched_cfs_period_timer never takes longer than period 
>> > would be good but I'm not sure what that would be and we still have this potential forever
>> > loop. 
>> >
>> > Below is the bandaid version. 
>> >   
>> > Thoughts? 
>> >
>> >
>> > Cheers,
>> > Phil
>> >
>> > ---
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 310d0637fe4b..33e55620f891 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -4859,12 +4859,16 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>> >         return HRTIMER_NORESTART;
>> >  }
>> >  
>> > +/* This is pretty arbitrary just to escape the "forever" loop before the watchdog
>> > + * kills us as there is no guarantee hrtimer_forward_now ever returns 0. */
>> > +#define CFS_PERIOD_TIMER_EXIT_COUNT 100
>> >  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> >  {
>> >         struct cfs_bandwidth *cfs_b =
>> >                 container_of(timer, struct cfs_bandwidth, period_timer);
>> >         int overrun;
>> >         int idle = 0;
>> > +       int count = 0;
>> >  
>> >         raw_spin_lock(&cfs_b->lock);
>> >         for (;;) {
>> > @@ -4872,6 +4876,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> >                 if (!overrun)
>> >                         break;
>> >  
>> > +               if (++count > CFS_PERIOD_TIMER_EXIT_COUNT) {
>> > +                       pr_warn_ratelimited("cfs_period_timer(%d): too many overruns. Consider raising cfs_period_us (%lld)\n",
>> > +                               smp_processor_id(), cfs_b->period/NSEC_PER_USEC);
>> > +                       // Make sure we restart the timer.
>> > +                       idle = 0;
>> > +                       break;
>> > +               }
>> > +
>> >                 idle = do_sched_cfs_period_timer(cfs_b, overrun);
>> >         }
>> >         if (idle)


* Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
  2019-03-05 18:49     ` bsegall
@ 2019-03-05 20:05       ` Phil Auld
  2019-03-05 20:45         ` bsegall
  0 siblings, 1 reply; 17+ messages in thread
From: Phil Auld @ 2019-03-05 20:05 UTC (permalink / raw)
  To: bsegall; +Cc: mingo, peterz, linux-kernel

On Tue, Mar 05, 2019 at 10:49:01AM -0800 bsegall@google.com wrote:
> Phil Auld <pauld@redhat.com> writes:
> 
> >> >
> >> > 	raw_spin_lock(&cfs_b->lock);
> >> > 	for (;;) {
> >> > 		overrun = hrtimer_forward_now(timer, cfs_b->period);
> >> > 		if (!overrun)
> >> > 			break;
> >> >
> >> > 		idle = do_sched_cfs_period_timer(cfs_b, overrun);
> >> > 	}
> >> > 	if (idle)
> >> > 		cfs_b->period_active = 0;
> >> > 	raw_spin_unlock(&cfs_b->lock);
> >> >
> >> >
> >> > There's no guarantee that hrtimer_forward_now() will ever return 0 and also 
> >> > do_sched_cfs_period_timer() will drop and re-acquire the cfs_b->lock to call 
> >> > distribute_cfs_runtime. 
> >> >
> >> > I considered code to tune the period and quota up (proportionally so they have the same 
> >> > relative effect) but I did not like that because of the above statement and the fact that 
> >> > it would be changing the valid values the user configured. I have two versions that do that 
> >> > differently but they both still call out for protection from the forever loop. I think they 
> >> > add complexity for no real gain.
> >> >
> >> > For my current testing I'm using a more direct but less elegant approach of simply limiting 
> >> > the number of times the loop can execute. This has the potential to skip handling a period 
> >> > I think but will prevent the problem reliably. I'm not sure how many iterations this loop 
> >> > was expected to take. I'm seeing numbers in the low thousands at times.
> >> 
> >> I mean the answer should be "do_sched_cfs_period_timer runs once" the
> >> vast majority of the time; if it's not then your machine/setup can't
> >> handle whatever super low period you've set, or there's some bad
> >> behavior somewhere in the period timer handling.
> >> CFS_PERIOD_TIMER_EXIT_COUNT could reasonably be like 2 or 3 - this would
> >> mean that you've already spent an entire period inside the handler.
> >>
> >
> > Thanks for looking at this.
> >
> > That's sort of what I though, too. Would it not make more sense to get rid
> > of the loop? I find forever loops inherently suspect.
> 
> I mean, probably better to have a loop than copy paste the code a couple
> of times; you could rework the condition in your patch to be part of the
> loop condition in general, though it might not be any clearer (heck,
> the loop /could/ be for (;overrun = hrtimer_forward_now(...);) { ... },
> it's just that's kinda questionable about whether it's clearer).
> 

So I'm not clear on this loop thing, though.  If it's okay for the loop
to run 2-3 times, that implies it's okay for do_sched_cfs_period_timer()
to take longer than a period to run.  So if that's expected twice or
thrice, what would ensure it does not continually take that long?


> >
> > I think the fact that we drop the lock to do the distribue is the real cuplrit.
> > It's not do_sched_cfs_period_timer()'s code itself that takes the time,
> > I think, it's the re-acquire of the cfs_b->lock and the contention on it due
> > to all the cgroups. Lock_stat smaples during that run show pretty high contention
> > on that lock and some really long wait times.
> 
> Tons of cgroups shouldn't increase contention; cfs_b->lock is
> per-cgroup, and I don't see or remember anything offhand where it even
> should be worse for deep nesting. Large machines with many cpus where
> threads in the cgroup are running on each make it worse of course, and
> tons of cgroups with throttling enabled have O(log n) cost on hrtimers
> insertion and of course O(n) cost in actual runtime in irq handlers.

The lock is per cgroup but it's taken in assign_cfs_rq_runtime() via 
__account_cfs_rq_runtime() from every non-idle cpu, I think.
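
Roughly this path, as I understand it (paraphrased from memory, not the
actual source): every cfs_rq that exhausts its locally cached slice refills
it under the group-wide cfs_b->lock, so each busy cpu running tasks in the
group keeps taking that one lock.

static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
	u64 amount = 0, min_amount;

	/* how much is needed to top the local slice back up */
	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;

	raw_spin_lock(&cfs_b->lock);		/* the contended lock */
	if (cfs_b->quota == RUNTIME_INF) {
		amount = min_amount;
	} else if (cfs_b->runtime > 0) {
		amount = min(cfs_b->runtime, min_amount);
		cfs_b->runtime -= amount;
	}
	raw_spin_unlock(&cfs_b->lock);

	cfs_rq->runtime_remaining += amount;
	return cfs_rq->runtime_remaining > 0;
}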

> 
> If you're seeing increasing contention as the cgroup depth or cgroup
> count increases, that may be a good thing to fix regardless.

I'll look at it some more. 

Interestingly, if I limit the number of child cgroups to the number of 
them I'm actually putting processes into (16 down from 2500) the problem
does not reproduce. 

Fwiw this is an 80 cpu (incl. smt) 4-numa system. So there are a lot
of cfs_rqs that might be doing __account_cfs_rq_runtime().

Here is lock_stat output from a short bit of the run when the pr_warn has come a couple of times.
Edited for screen width. This is the 8th lock on the list. (I wonder in this case if the really long
holds are the pr_warn firing.)
 
class name       &cfs_b->lock:
con-bounces      33770
contentions      33776
waittime-min         0.10
waittime-max     48088.04
waittime-total  856027.70
waittime-avg        25.34  
acq-bounces     148379
acquisitions    162184 
holdtime-min         0.00 
holdtime-max     48240.79
holdtime-total  354683.04  
holdtime-avg         2.19
                 
             ------------
             &cfs_b->lock          29414          [<00000000cfc57971>] __account_cfs_rq_runtime+0xd5/0x1a0
             &cfs_b->lock           4195          [<00000000754af0b8>] throttle_cfs_rq+0x193/0x2c0
             &cfs_b->lock            166          [<00000000b6333ad0>] unthrottle_cfs_rq+0x54/0x1d0
             &cfs_b->lock              1          [<00000000fc0c15d2>] sched_cfs_period_timer+0xe6/0x1d0
             ------------
             &cfs_b->lock          28602          [<00000000cfc57971>] __account_cfs_rq_runtime+0xd5/0x1a0
             &cfs_b->lock           3215          [<00000000754af0b8>] throttle_cfs_rq+0x193/0x2c0
             &cfs_b->lock           1938          [<00000000b6333ad0>] unthrottle_cfs_rq+0x54/0x1d0
             &cfs_b->lock             21          [<00000000fc0c15d2>] sched_cfs_period_timer+0xe6/0x1d0


> 
> The lock has to be dropped due to lock ordering vs rq locks, and the
> reverse order wouldn't be possible. That said, each cfs_rq unthrottle in
> distribute grabs the lock, and then that cpu will grab the lock when it
> wakes up, which can be while we're still in distribute. I'm not sure if
> it would be possible to move the resched_curr calls until after doing
> all the rest of unthrottle, and even if we did then we'd be taking each
> rq lock twice, which wouldn't be great either. It might at least be
> possible to coalesce all the cfs_b accounting in unthrottle to be done
> in a single locked section, but I don't know if that would actually
> help; it would still all have to be serialized regardless after all.

Yeah, that makes sense, thanks.

Cheers,
Phil

> 
> >
> > My reproducer is artificial, but is designed to trigger the issue as has
> > been hit in various customer workloads. So yes, it's exaggerated, but only
> > so I don't have to wait weeks between hits :)
> >
> >
> > Thanks,
> > Phil
> >
> >> >
> >> > A more complete fix to make sure do_sched_cfs_period_timer never takes longer than period 
> >> > would be good but I'm not sure what that would be and we still have this potential forever
> >> > loop. 
> >> >
> >> > Below is the bandaid version. 
> >> >   
> >> > Thoughts? 
> >> >
> >> >
> >> > Cheers,
> >> > Phil
> >> >
> >> > ---
> >> >
> >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> > index 310d0637fe4b..33e55620f891 100644
> >> > --- a/kernel/sched/fair.c
> >> > +++ b/kernel/sched/fair.c
> >> > @@ -4859,12 +4859,16 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
> >> >         return HRTIMER_NORESTART;
> >> >  }
> >> >  
> >> > +/* This is pretty arbitrary just to escape the "forever" loop before the watchdog
> >> > + * kills us as there is no guarantee hrtimer_forward_now ever returns 0. */
> >> > +#define CFS_PERIOD_TIMER_EXIT_COUNT 100
> >> >  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >> >  {
> >> >         struct cfs_bandwidth *cfs_b =
> >> >                 container_of(timer, struct cfs_bandwidth, period_timer);
> >> >         int overrun;
> >> >         int idle = 0;
> >> > +       int count = 0;
> >> >  
> >> >         raw_spin_lock(&cfs_b->lock);
> >> >         for (;;) {
> >> > @@ -4872,6 +4876,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >> >                 if (!overrun)
> >> >                         break;
> >> >  
> >> > +               if (++count > CFS_PERIOD_TIMER_EXIT_COUNT) {
> >> > +                       pr_warn_ratelimited("cfs_period_timer(%d): too many overruns. Consider raising cfs_period_us (%lld)\n",
> >> > +                               smp_processor_id(), cfs_b->period/NSEC_PER_USEC);
> >> > +                       // Make sure we restart the timer.
> >> > +                       idle = 0;
> >> > +                       break;
> >> > +               }
> >> > +
> >> >                 idle = do_sched_cfs_period_timer(cfs_b, overrun);
> >> >         }
> >> >         if (idle)

-- 


* Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
  2019-03-05 20:05       ` Phil Auld
@ 2019-03-05 20:45         ` bsegall
  2019-03-06 16:23           ` Phil Auld
  0 siblings, 1 reply; 17+ messages in thread
From: bsegall @ 2019-03-05 20:45 UTC (permalink / raw)
  To: Phil Auld; +Cc: mingo, peterz, linux-kernel

Phil Auld <pauld@redhat.com> writes:

> On Tue, Mar 05, 2019 at 10:49:01AM -0800 bsegall@google.com wrote:
>> Phil Auld <pauld@redhat.com> writes:
>> 
>> >> >
>> >> > 	raw_spin_lock(&cfs_b->lock);
>> >> > 	for (;;) {
>> >> > 		overrun = hrtimer_forward_now(timer, cfs_b->period);
>> >> > 		if (!overrun)
>> >> > 			break;
>> >> >
>> >> > 		idle = do_sched_cfs_period_timer(cfs_b, overrun);
>> >> > 	}
>> >> > 	if (idle)
>> >> > 		cfs_b->period_active = 0;
>> >> > 	raw_spin_unlock(&cfs_b->lock);
>> >> >
>> >> >
>> >> > There's no guarantee that hrtimer_forward_now() will ever return 0 and also 
>> >> > do_sched_cfs_period_timer() will drop and re-acquire the cfs_b->lock to call 
>> >> > distribute_cfs_runtime. 
>> >> >
>> >> > I considered code to tune the period and quota up (proportionally so they have the same 
>> >> > relative effect) but I did not like that because of the above statement and the fact that 
>> >> > it would be changing the valid values the user configured. I have two versions that do that 
>> >> > differently but they both still call out for protection from the forever loop. I think they 
>> >> > add complexity for no real gain.
>> >> >
>> >> > For my current testing I'm using a more direct but less elegant approach of simply limiting 
>> >> > the number of times the loop can execute. This has the potential to skip handling a period 
>> >> > I think but will prevent the problem reliably. I'm not sure how many iterations this loop 
>> >> > was expected to take. I'm seeing numbers in the low thousands at times.
>> >> 
>> >> I mean the answer should be "do_sched_cfs_period_timer runs once" the
>> >> vast majority of the time; if it's not then your machine/setup can't
>> >> handle whatever super low period you've set, or there's some bad
>> >> behavior somewhere in the period timer handling.
>> >> CFS_PERIOD_TIMER_EXIT_COUNT could reasonably be like 2 or 3 - this would
>> >> mean that you've already spent an entire period inside the handler.
>> >>
>> >
>> > Thanks for looking at this.
>> >
>> > That's sort of what I though, too. Would it not make more sense to get rid
>> > of the loop? I find forever loops inherently suspect.
>> 
>> I mean, probably better to have a loop than copy paste the code a couple
>> of times; you could rework the condition in your patch to be part of the
>> loop condition in general, though it might not be any clearer (heck,
>> the loop /could/ be for (;overrun = hrtimer_forward_now(...);) { ... },
>> it's just that's kinda questionable about whether it's clearer).
>> 
>
> So I'm not clear on this loop thing, though.  If it's okay for the loop
> to run 2-3 times that implies it's okay for do_sched_cfs_period_timer() 
> to take longer than a period to run.  So if that's expected twice or 
> thrice what would ensure it does not continually take that long?

So, the loop was originally copied from the rt period timer, and I'm not
sure we really thought that much about it. hrtimer_forward_now returning
!0 twice is probably best to allow, because the hrtimer could be
delayed most-but-not-all of a period; the third time means that the
fault is definitely in do_sched_cfs_period_timer, though, and if it was
too slow once there's no reason it could not continue to be slow. I was
just being vague on which of 2 or 3 winds up with that exact meaning.
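
(For a concrete, made-up example: with a 1ms period, if the interrupt
comes in 0.9ms late the first hrtimer_forward_now() returns 1 and the
handler runs; if the handler then takes more than the remaining ~0.1ms,
the second call returns nonzero again through no real fault of the
handler. A third nonzero return means the handler by itself has been
eating whole periods.)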

>
>
>> >
>> > I think the fact that we drop the lock to do the distribue is the real cuplrit.
>> > It's not do_sched_cfs_period_timer()'s code itself that takes the time,
>> > I think, it's the re-acquire of the cfs_b->lock and the contention on it due
>> > to all the cgroups. Lock_stat smaples during that run show pretty high contention
>> > on that lock and some really long wait times.
>> 
>> Tons of cgroups shouldn't increase contention; cfs_b->lock is
>> per-cgroup, and I don't see or remember anything offhand where it even
>> should be worse for deep nesting. Large machines with many cpus where
>> threads in the cgroup are running on each make it worse of course, and
>> tons of cgroups with throttling enabled have O(log n) cost on hrtimers
>> insertion and of course O(n) cost in actual runtime in irq handlers.
>
> The lock is per cgroup but it's taken in assign_cfs_rq_runtime() via 
> __account_cfs_rq_runtime() from every non-idle cpu, I think.

Yes, that's proportional to cpus though, not cgroups.

>
>> 
>> If you're seeing increasing contention as the cgroup depth or cgroup
>> count increases, that may be a good thing to fix regardless.
>
> I'll look at it some more. 
>
> Interestingly, if I limit the number of child cgroups to the number of 
> them I'm actually putting processes into (16 down from 2500) the problem
> does not reproduce.

That is indeed interesting, and definitely not something we'd want to
matter. (Particularly if it's not root->a->b->c...->throttled_cgroup or
root->throttled->a->...->thread vs root->throttled_cgroup, which is what
I was originally thinking of)

>
> Fwiw this is a 80 cpu (incl. smt) 4-numa system. So there are a lot 
> of cfs_rqs that might be doing __account_cfs_rq_runtime().
>
> Here is lock_stat output from a short bit of the run when the pr_warn has come a couple of times. 
> Edited for screen width. This is the 8th lock on the list. (I wonder in this case of the really long 
> holds are the pr_warn firing.)

Yeah, it's probably best to check without the pr_warn.

>  
> class name       &cfs_b->lock:
> con-bounces      33770
> contentions      33776
> waittime-min         0.10
> waittime-max     48088.04
> waittime-total  856027.70
> waittime-avg        25.34  
> acq-bounces     148379
> acquisitions    162184 
> holdtime-min         0.00 
> holdtime-max     48240.79
> holdtime-total  354683.04  
> holdtime-avg         2.19
>                  
>              ------------
>              &cfs_b->lock          29414          [<00000000cfc57971>] __account_cfs_rq_runtime+0xd5/0x1a0
>              &cfs_b->lock           4195          [<00000000754af0b8>] throttle_cfs_rq+0x193/0x2c0
>              &cfs_b->lock            166          [<00000000b6333ad0>] unthrottle_cfs_rq+0x54/0x1d0
>              &cfs_b->lock              1          [<00000000fc0c15d2>] sched_cfs_period_timer+0xe6/0x1d0
>              ------------
>              &cfs_b->lock          28602          [<00000000cfc57971>] __account_cfs_rq_runtime+0xd5/0x1a0
>              &cfs_b->lock           3215          [<00000000754af0b8>] throttle_cfs_rq+0x193/0x2c0
>              &cfs_b->lock           1938          [<00000000b6333ad0>] unthrottle_cfs_rq+0x54/0x1d0
>              &cfs_b->lock             21          [<00000000fc0c15d2>] sched_cfs_period_timer+0xe6/0x1d0
>
>
>> 
>> The lock has to be dropped due to lock ordering vs rq locks, and the
>> reverse order wouldn't be possible. That said, each cfs_rq unthrottle in
>> distribute grabs the lock, and then that cpu will grab the lock when it
>> wakes up, which can be while we're still in distribute. I'm not sure if
>> it would be possible to move the resched_curr calls until after doing
>> all the rest of unthrottle, and even if we did then we'd be taking each
>> rq lock twice, which wouldn't be great either. It might at least be
>> possible to coalesce all the cfs_b accounting in unthrottle to be done
>> in a single locked section, but I don't know if that would actually
>> help; it would still all have to be serialized regardless after all.
>
> Yeah, that makes sense, thanks.
>
> Cheers,
> Phil
>
>> 
>> >
>> > My reproducer is artificial, but is designed to trigger the issue as has
>> > been hit in various customer workloads. So yes, it's exaggerated, but only
>> > so I don't have to wait weeks between hits :)
>> >
>> >
>> > Thanks,
>> > Phil
>> >
>> >> >
>> >> > A more complete fix to make sure do_sched_cfs_period_timer never takes longer than period 
>> >> > would be good but I'm not sure what that would be and we still have this potential forever
>> >> > loop. 
>> >> >
>> >> > Below is the bandaid version. 
>> >> >   
>> >> > Thoughts? 
>> >> >
>> >> >
>> >> > Cheers,
>> >> > Phil
>> >> >
>> >> > ---
>> >> >
>> >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> >> > index 310d0637fe4b..33e55620f891 100644
>> >> > --- a/kernel/sched/fair.c
>> >> > +++ b/kernel/sched/fair.c
>> >> > @@ -4859,12 +4859,16 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>> >> >         return HRTIMER_NORESTART;
>> >> >  }
>> >> >  
>> >> > +/* This is pretty arbitrary just to escape the "forever" loop before the watchdog
>> >> > + * kills us as there is no guarantee hrtimer_forward_now ever returns 0. */
>> >> > +#define CFS_PERIOD_TIMER_EXIT_COUNT 100
>> >> >  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> >> >  {
>> >> >         struct cfs_bandwidth *cfs_b =
>> >> >                 container_of(timer, struct cfs_bandwidth, period_timer);
>> >> >         int overrun;
>> >> >         int idle = 0;
>> >> > +       int count = 0;
>> >> >  
>> >> >         raw_spin_lock(&cfs_b->lock);
>> >> >         for (;;) {
>> >> > @@ -4872,6 +4876,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> >> >                 if (!overrun)
>> >> >                         break;
>> >> >  
>> >> > +               if (++count > CFS_PERIOD_TIMER_EXIT_COUNT) {
>> >> > +                       pr_warn_ratelimited("cfs_period_timer(%d): too many overruns. Consider raising cfs_period_us (%lld)\n",
>> >> > +                               smp_processor_id(), cfs_b->period/NSEC_PER_USEC);
>> >> > +                       // Make sure we restart the timer.
>> >> > +                       idle = 0;
>> >> > +                       break;
>> >> > +               }
>> >> > +
>> >> >                 idle = do_sched_cfs_period_timer(cfs_b, overrun);
>> >> >         }
>> >> >         if (idle)


* Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
  2019-03-05 20:45         ` bsegall
@ 2019-03-06 16:23           ` Phil Auld
  2019-03-06 19:25             ` bsegall
  0 siblings, 1 reply; 17+ messages in thread
From: Phil Auld @ 2019-03-06 16:23 UTC (permalink / raw)
  To: bsegall; +Cc: mingo, peterz, linux-kernel

On Tue, Mar 05, 2019 at 12:45:34PM -0800 bsegall@google.com wrote:
> Phil Auld <pauld@redhat.com> writes:
> 
> > Interestingly, if I limit the number of child cgroups to the number of 
> > them I'm actually putting processes into (16 down from 2500) the problem
> > does not reproduce.
> 
> That is indeed interesting, and definitely not something we'd want to
> matter. (Particularly if it's not root->a->b->c...->throttled_cgroup or
> root->throttled->a->...->thread vs root->throttled_cgroup, which is what
> I was originally thinking of)
> 

The locking may be a red herring.

The setup is root->throttled->a where a is 1-2500. There are 4 threads in
each of the first 16 a groups.  The parent, throttled, is where the 
cfs_period/quota_us are set. 
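
For reference, the shape of the setup is roughly this (an illustrative
userspace sketch, assuming the cgroup v1 cpu controller is mounted at
/sys/fs/cgroup/cpu; the group names and the 1ms period/quota values are
placeholders, not the exact reproducer):

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF) {
		perror(path);
		exit(1);
	}
	fclose(f);
}

int main(void)
{
	char path[256];
	int i;

	mkdir("/sys/fs/cgroup/cpu/throttled", 0755);
	/* quota/period set only on the parent, with a very low period */
	write_str("/sys/fs/cgroup/cpu/throttled/cpu.cfs_quota_us", "1000");
	write_str("/sys/fs/cgroup/cpu/throttled/cpu.cfs_period_us", "1000");

	/* ~2500 children; the reproducer then runs 4 threads in each of a1..a16 */
	for (i = 1; i <= 2500; i++) {
		snprintf(path, sizeof(path), "/sys/fs/cgroup/cpu/throttled/a%d", i);
		mkdir(path, 0755);
	}
	return 0;
}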

I wonder if the problem is the walk_tg_tree_from() call in unthrottle_cfs_rq(). 

The distribute_cfs_runtime() looks to be O(n * m) where n is the number of
throttled cfs_rqs and m is the number of child cgroups. But I'm not
completely clear on how the hierarchical cgroups play together here.

I'll pull on this thread some. 

Thanks for your input.


Cheers,
Phil

-- 


* Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
  2019-03-06 16:23           ` Phil Auld
@ 2019-03-06 19:25             ` bsegall
  2019-03-09 20:33               ` Phil Auld
  0 siblings, 1 reply; 17+ messages in thread
From: bsegall @ 2019-03-06 19:25 UTC (permalink / raw)
  To: Phil Auld; +Cc: mingo, peterz, linux-kernel

Phil Auld <pauld@redhat.com> writes:

> On Tue, Mar 05, 2019 at 12:45:34PM -0800 bsegall@google.com wrote:
>> Phil Auld <pauld@redhat.com> writes:
>> 
>> > Interestingly, if I limit the number of child cgroups to the number of 
>> > them I'm actually putting processes into (16 down from 2500) the problem
>> > does not reproduce.
>> 
>> That is indeed interesting, and definitely not something we'd want to
>> matter. (Particularly if it's not root->a->b->c...->throttled_cgroup or
>> root->throttled->a->...->thread vs root->throttled_cgroup, which is what
>> I was originally thinking of)
>> 
>
> The locking may be a red herring.
>
> The setup is root->throttled->a where a is 1-2500. There are 4 threads in
> each of the first 16 a groups.  The parent, throttled, is where the 
> cfs_period/quota_us are set. 
>
> I wonder if the problem is the walk_tg_tree_from() call in unthrottle_cfs_rq(). 
>
> The distribute_cfg_runtime looks to be O(n * m) where n is number of 
> throttled cfs_rqs and m is the number of child cgroups. But I'm not 
> completely clear on how the hierarchical cgroups play together here. 
>
> I'll pull on this thread some. 
>
> Thanks for your input.
>
>
> Cheers,
> Phil

Yeah, that isn't under the cfs_b lock, but is still part of distribute
(and under rq lock, which might also matter). I was thinking too much
about just the cfs_b regions. I'm not sure there's any good general
optimization there.

I suppose cfs_rqs (tgs/cfs_bs?) could have a "nearest
ancestor with a quota" pointer and ones with quota could have a
"descendants with quota" list, parallel to the children/parent lists of
tgs. Then throttle/unthrottle would only have to visit these lists, and
child cgroups/cfs_rqs without their own quotas would just check
cfs_rq->nearest_quota_cfs_rq->throttle_count. throttled_clock_task_time
can also probably be tracked there.
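
In rough terms, something like this (purely hypothetical field and helper
names, nothing that exists in the tree today):

struct cfs_rq {
	/* ...existing fields... */

	/* nearest ancestor cfs_rq that has its own quota, NULL if we do */
	struct cfs_rq		*nearest_quota_cfs_rq;
	/* if we have a quota: descendants whose nearest quota is us */
	struct list_head	quota_descendants;
	struct list_head	quota_sibling;
};

/* throttle check for a quota-less descendant */
static inline int cfs_rq_hier_throttled(struct cfs_rq *cfs_rq)
{
	struct cfs_rq *q = cfs_rq->nearest_quota_cfs_rq ?: cfs_rq;

	return q->throttle_count > 0;
}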


* Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
  2019-03-06 19:25             ` bsegall
@ 2019-03-09 20:33               ` Phil Auld
  2019-03-11 17:44                 ` bsegall
  0 siblings, 1 reply; 17+ messages in thread
From: Phil Auld @ 2019-03-09 20:33 UTC (permalink / raw)
  To: bsegall; +Cc: mingo, peterz, linux-kernel

On Wed, Mar 06, 2019 at 11:25:02AM -0800 bsegall@google.com wrote:
> Phil Auld <pauld@redhat.com> writes:
> 
> > On Tue, Mar 05, 2019 at 12:45:34PM -0800 bsegall@google.com wrote:
> >> Phil Auld <pauld@redhat.com> writes:
> >> 
> >> > Interestingly, if I limit the number of child cgroups to the number of 
> >> > them I'm actually putting processes into (16 down from 2500) the problem
> >> > does not reproduce.
> >> 
> >> That is indeed interesting, and definitely not something we'd want to
> >> matter. (Particularly if it's not root->a->b->c...->throttled_cgroup or
> >> root->throttled->a->...->thread vs root->throttled_cgroup, which is what
> >> I was originally thinking of)
> >> 
> >
> > The locking may be a red herring.
> >
> > The setup is root->throttled->a where a is 1-2500. There are 4 threads in
> > each of the first 16 a groups.  The parent, throttled, is where the 
> > cfs_period/quota_us are set. 
> >
> > I wonder if the problem is the walk_tg_tree_from() call in unthrottle_cfs_rq(). 
> >
> > The distribute_cfg_runtime looks to be O(n * m) where n is number of 
> > throttled cfs_rqs and m is the number of child cgroups. But I'm not 
> > completely clear on how the hierarchical cgroups play together here. 
> >
> > I'll pull on this thread some. 
> >
> > Thanks for your input.
> >
> >
> > Cheers,
> > Phil
> 
> Yeah, that isn't under the cfs_b lock, but is still part of distribute
> (and under rq lock, which might also matter). I was thinking too much
> about just the cfs_b regions. I'm not sure there's any good general
> optimization there.
>

It's really an edge case, but the watchdog NMI is pretty painful.

> I suppose cfs_rqs (tgs/cfs_bs?) could have "nearest
> ancestor with a quota" pointer and ones with quota could have
> "descendants with quota" list, parallel to the children/parent lists of
> tgs. Then throttle/unthrottle would only have to visit these lists, and
> child cgroups/cfs_rqs without their own quotas would just check
> cfs_rq->nearest_quota_cfs_rq->throttle_count. throttled_clock_task_time
> can also probably be tracked there.

That seems like it would add a lot of complexity for this edge case. Maybe
it would be acceptable to use the safety valve like my first example, or
something like the below, which will tune the period up until it doesn't
overrun forever.  The downside of this one is it does change the user's
settings, but that could be preferable to an NMI crash.

Cheers,
Phil

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 310d0637fe4b..78f9e28adc7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4859,16 +4859,42 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
+extern const u64 max_cfs_quota_period;
+s64 cfs_quota_period_autotune_thresh = 100 * NSEC_PER_MSEC;
+int cfs_quota_period_autotune_shift  = 4; /* 100 / 16 = 6.25% */
+
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
 	struct cfs_bandwidth *cfs_b =
 		container_of(timer, struct cfs_bandwidth, period_timer);
+	s64 nsprev, nsnow, new_period;
+	ktime_t now;
 	int overrun;
 	int idle = 0;
 
 	raw_spin_lock(&cfs_b->lock);
+	nsprev = ktime_to_ns(hrtimer_cb_get_time(timer));
 	for (;;) {
-		overrun = hrtimer_forward_now(timer, cfs_b->period);
+		/* 
+		 * Note this reverts the change to use hrtimer_forward_now, which avoids calling hrtimer_cb_get_time
+		 * for a value we already have
+		 */
+		now = hrtimer_cb_get_time(timer);
+		nsnow = ktime_to_ns(now);
+		if (nsnow - nsprev >= cfs_quota_period_autotune_thresh) {
+			new_period = ktime_to_ns(cfs_b->period);
+			new_period += new_period >> cfs_quota_period_autotune_shift;
+			if (new_period <= max_cfs_quota_period) {
+				cfs_b->period = ns_to_ktime(new_period);
+				cfs_b->quota += cfs_b->quota >> cfs_quota_period_autotune_shift;
+				pr_warn_ratelimited(
+					"cfs_period_timer [cpu%d] : Running too long, scaling up (new period %lld, new quota = %lld)\n", 
+					smp_processor_id(), cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC);
+			}
+			nsprev = nsnow;
+		}
+
+		overrun = hrtimer_forward(timer, now, cfs_b->period);
 		if (!overrun)
 			break;
 


-- 


* Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
  2019-03-09 20:33               ` Phil Auld
@ 2019-03-11 17:44                 ` bsegall
  2019-03-11 20:25                   ` Phil Auld
  0 siblings, 1 reply; 17+ messages in thread
From: bsegall @ 2019-03-11 17:44 UTC (permalink / raw)
  To: Phil Auld; +Cc: mingo, peterz, linux-kernel

Phil Auld <pauld@redhat.com> writes:

> On Wed, Mar 06, 2019 at 11:25:02AM -0800 bsegall@google.com wrote:
>> Phil Auld <pauld@redhat.com> writes:
>> 
>> > On Tue, Mar 05, 2019 at 12:45:34PM -0800 bsegall@google.com wrote:
>> >> Phil Auld <pauld@redhat.com> writes:
>> >> 
>> >> > Interestingly, if I limit the number of child cgroups to the number of 
>> >> > them I'm actually putting processes into (16 down from 2500) the problem
>> >> > does not reproduce.
>> >> 
>> >> That is indeed interesting, and definitely not something we'd want to
>> >> matter. (Particularly if it's not root->a->b->c...->throttled_cgroup or
>> >> root->throttled->a->...->thread vs root->throttled_cgroup, which is what
>> >> I was originally thinking of)
>> >> 
>> >
>> > The locking may be a red herring.
>> >
>> > The setup is root->throttled->a where a is 1-2500. There are 4 threads in
>> > each of the first 16 a groups.  The parent, throttled, is where the 
>> > cfs_period/quota_us are set. 
>> >
>> > I wonder if the problem is the walk_tg_tree_from() call in unthrottle_cfs_rq(). 
>> >
>> > The distribute_cfg_runtime looks to be O(n * m) where n is number of 
>> > throttled cfs_rqs and m is the number of child cgroups. But I'm not 
>> > completely clear on how the hierarchical cgroups play together here. 
>> >
>> > I'll pull on this thread some. 
>> >
>> > Thanks for your input.
>> >
>> >
>> > Cheers,
>> > Phil
>> 
>> Yeah, that isn't under the cfs_b lock, but is still part of distribute
>> (and under rq lock, which might also matter). I was thinking too much
>> about just the cfs_b regions. I'm not sure there's any good general
>> optimization there.
>>
>
> It's really an edge case, but the watchdog NMI is pretty painful.
>
>> I suppose cfs_rqs (tgs/cfs_bs?) could have "nearest
>> ancestor with a quota" pointer and ones with quota could have
>> "descendants with quota" list, parallel to the children/parent lists of
>> tgs. Then throttle/unthrottle would only have to visit these lists, and
>> child cgroups/cfs_rqs without their own quotas would just check
>> cfs_rq->nearest_quota_cfs_rq->throttle_count. throttled_clock_task_time
>> can also probably be tracked there.
>
> That seems like it would add a lot of complexity for this edge case. Maybe
> it would be acceptable to use the safety valve like my first example, or
> something like the below, which will tune the period up until it doesn't
> overrun forever.  The downside of this one is that it does change the user's
> settings, but that could be preferable to an NMI crash.

Yeah, I'm not sure what solution is best here, but one of the solutions
should be done.

>
> Cheers,
> Phil
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 310d0637fe4b..78f9e28adc7b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4859,16 +4859,42 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>  	return HRTIMER_NORESTART;
>  }
>  
> +extern const u64 max_cfs_quota_period;
> +s64 cfs_quota_period_autotune_thresh = 100 * NSEC_PER_MSEC;
> +int cfs_quota_period_autotune_shift  = 4; /* 100 / 16 = 6.25% */

Letting it spin for 100ms and then only increasing by 6% seems extremely
generous. If we went this route I'd probably say "after looping N
times, set the period to time taken / N + X%" where N is like 8 or
something. I think I'd probably prefer something like this to the
previous "just abort and let it happen again next interrupt" one.

> +
>  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  {
>  	struct cfs_bandwidth *cfs_b =
>  		container_of(timer, struct cfs_bandwidth, period_timer);
> +	s64 nsprev, nsnow, new_period;
> +	ktime_t now;
>  	int overrun;
>  	int idle = 0;
>  
>  	raw_spin_lock(&cfs_b->lock);
> +	nsprev = ktime_to_ns(hrtimer_cb_get_time(timer));
>  	for (;;) {
> -		overrun = hrtimer_forward_now(timer, cfs_b->period);
> +		/* 
> +		 * Note this reverts the change to use hrtimer_forward_now, which avoids calling hrtimer_cb_get_time
> +		 * for a value we already have
> +		 */
> +		now = hrtimer_cb_get_time(timer);
> +		nsnow = ktime_to_ns(now);
> +		if (nsnow - nsprev >= cfs_quota_period_autotune_thresh) {
> +			new_period = ktime_to_ns(cfs_b->period);
> +			new_period += new_period >> cfs_quota_period_autotune_shift;
> +			if (new_period <= max_cfs_quota_period) {
> +				cfs_b->period = ns_to_ktime(new_period);
> +				cfs_b->quota += cfs_b->quota >> cfs_quota_period_autotune_shift;
> +				pr_warn_ratelimited(
> +					"cfs_period_timer [cpu%d] : Running too long, scaling up (new period %lld, new quota = %lld)\n", 
> +					smp_processor_id(), cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC);
> +			}
> +			nsprev = nsnow;
> +		}
> +
> +		overrun = hrtimer_forward(timer, now, cfs_b->period);
>  		if (!overrun)
>  			break;

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
  2019-03-11 17:44                 ` bsegall
@ 2019-03-11 20:25                   ` Phil Auld
  2019-03-12 13:57                     ` Phil Auld
  2019-03-12 17:29                     ` bsegall
  0 siblings, 2 replies; 17+ messages in thread
From: Phil Auld @ 2019-03-11 20:25 UTC (permalink / raw)
  To: bsegall; +Cc: mingo, peterz, linux-kernel

On Mon, Mar 11, 2019 at 10:44:25AM -0700 bsegall@google.com wrote:
> Phil Auld <pauld@redhat.com> writes:
> 
> > On Wed, Mar 06, 2019 at 11:25:02AM -0800 bsegall@google.com wrote:
> >> Phil Auld <pauld@redhat.com> writes:
> >> 
> >> > On Tue, Mar 05, 2019 at 12:45:34PM -0800 bsegall@google.com wrote:
> >> >> Phil Auld <pauld@redhat.com> writes:
> >> >> 
> >> >> > Interestingly, if I limit the number of child cgroups to the number of 
> >> >> > them I'm actually putting processes into (16 down from 2500) the problem
> >> >> > does not reproduce.
> >> >> 
> >> >> That is indeed interesting, and definitely not something we'd want to
> >> >> matter. (Particularly if it's not root->a->b->c...->throttled_cgroup or
> >> >> root->throttled->a->...->thread vs root->throttled_cgroup, which is what
> >> >> I was originally thinking of)
> >> >> 
> >> >
> >> > The locking may be a red herring.
> >> >
> >> > The setup is root->throttled->a where a is 1-2500. There are 4 threads in
> >> > each of the first 16 a groups.  The parent, throttled, is where the 
> >> > cfs_period/quota_us are set. 
> >> >
> >> > I wonder if the problem is the walk_tg_tree_from() call in unthrottle_cfs_rq(). 
> >> >
> >> > The distribute_cfs_runtime looks to be O(n * m) where n is the number of
> >> > throttled cfs_rqs and m is the number of child cgroups. But I'm not
> >> > completely clear on how the hierarchical cgroups play together here. 
> >> >
> >> > I'll pull on this thread some. 
> >> >
> >> > Thanks for your input.
> >> >
> >> >
> >> > Cheers,
> >> > Phil
> >> 
> >> Yeah, that isn't under the cfs_b lock, but is still part of distribute
> >> (and under rq lock, which might also matter). I was thinking too much
> >> about just the cfs_b regions. I'm not sure there's any good general
> >> optimization there.
> >>
> >
> > It's really an edge case, but the watchdog NMI is pretty painful.
> >
> >> I suppose cfs_rqs (tgs/cfs_bs?) could have "nearest
> >> ancestor with a quota" pointer and ones with quota could have
> >> "descendants with quota" list, parallel to the children/parent lists of
> >> tgs. Then throttle/unthrottle would only have to visit these lists, and
> >> child cgroups/cfs_rqs without their own quotas would just check
> >> cfs_rq->nearest_quota_cfs_rq->throttle_count. throttled_clock_task_time
> >> can also probably be tracked there.
> >
> > That seems like it would add a lot of complexity for this edge case. Maybe
> > it would be acceptable to use the safety valve like my first example, or
> > something like the below, which will tune the period up until it doesn't
> > overrun forever.  The downside of this one is that it does change the user's
> > settings, but that could be preferable to an NMI crash.
> 
> Yeah, I'm not sure what solution is best here, but one of the solutions
> should be done.
> 
> >
> > Cheers,
> > Phil
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 310d0637fe4b..78f9e28adc7b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4859,16 +4859,42 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
> >  	return HRTIMER_NORESTART;
> >  }
> >  
> > +extern const u64 max_cfs_quota_period;
> > +s64 cfs_quota_period_autotune_thresh = 100 * NSEC_PER_MSEC;
> > +int cfs_quota_period_autotune_shift  = 4; /* 100 / 16 = 6.25% */
> 
> Letting it spin for 100ms and then only increasing by 6% seems extremely
> generous. If we went this route I'd probably say "after looping N
> times, set the period to time taken / N + X%" where N is like 8 or
> something. I think I'd probably prefer something like this to the
> previous "just abort and let it happen again next interrupt" one.

Okay. I'll try to spin something up that does this. It may be a little 
trickier to keep the quota proportional to the new period. I think that's 
important since we'll be changing the user's setting.
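
(The proportionality itself is just new_quota = old_quota * new_period / old_period;
the fiddly part is doing that division cleanly in the kernel.)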

Do you mean to have it break when it hits N and recalculates the period or 
reset the counter and keep going?



Cheers,
Phil

> 
> > +
> >  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >  {
> >  	struct cfs_bandwidth *cfs_b =
> >  		container_of(timer, struct cfs_bandwidth, period_timer);
> > +	s64 nsprev, nsnow, new_period;
> > +	ktime_t now;
> >  	int overrun;
> >  	int idle = 0;
> >  
> >  	raw_spin_lock(&cfs_b->lock);
> > +	nsprev = ktime_to_ns(hrtimer_cb_get_time(timer));
> >  	for (;;) {
> > -		overrun = hrtimer_forward_now(timer, cfs_b->period);
> > +		/* 
> > +		 * Note this reverts the change to use hrtimer_forward_now, which avoids calling hrtimer_cb_get_time
> > +		 * for a value we already have
> > +		 */
> > +		now = hrtimer_cb_get_time(timer);
> > +		nsnow = ktime_to_ns(now);
> > +		if (nsnow - nsprev >= cfs_quota_period_autotune_thresh) {
> > +			new_period = ktime_to_ns(cfs_b->period);
> > +			new_period += new_period >> cfs_quota_period_autotune_shift;
> > +			if (new_period <= max_cfs_quota_period) {
> > +				cfs_b->period = ns_to_ktime(new_period);
> > +				cfs_b->quota += cfs_b->quota >> cfs_quota_period_autotune_shift;
> > +				pr_warn_ratelimited(
> > +					"cfs_period_timer [cpu%d] : Running too long, scaling up (new period %lld, new quota = %lld)\n", 
> > +					smp_processor_id(), cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC);
> > +			}
> > +			nsprev = nsnow;
> > +		}
> > +
> > +		overrun = hrtimer_forward(timer, now, cfs_b->period);
> >  		if (!overrun)
> >  			break;

-- 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
  2019-03-11 20:25                   ` Phil Auld
@ 2019-03-12 13:57                     ` Phil Auld
  2019-03-13 17:44                       ` bsegall
  2019-03-12 17:29                     ` bsegall
  1 sibling, 1 reply; 17+ messages in thread
From: Phil Auld @ 2019-03-12 13:57 UTC (permalink / raw)
  To: bsegall; +Cc: mingo, peterz, linux-kernel

On Mon, Mar 11, 2019 at 04:25:36PM -0400 Phil Auld wrote:
> On Mon, Mar 11, 2019 at 10:44:25AM -0700 bsegall@google.com wrote:
> > Letting it spin for 100ms and then only increasing by 6% seems extremely
> > generous. If we went this route I'd probably say "after looping N
> > times, set the period to time taken / N + X%" where N is like 8 or
> > something. I think I'd probably prefer something like this to the
> > previous "just abort and let it happen again next interrupt" one.
> 
> Okay. I'll try to spin something up that does this. It may be a little 
> trickier to keep the quota proportional to the new period. I think that's 
> important since we'll be changing the user's setting.
> 
> Do you mean to have it break when it hits N and recalculates the period or 
> reset the counter and keep going?
> 

Let me know what you think of the below. It's working nicely. I like your 
suggestion to limit it quickly based on number of loops and use that to 
scale up. I think it is best to break out and let it fire again if needed. 
The warning fires once, very occasionally twice, and then things are quiet.

If that looks reasonable I'll do some more testing and spin it up as a real 
patch submission. 

Cheers,
Phil
---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 310d0637fe4b..54b30adfc89e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4859,19 +4859,51 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
+extern const u64 max_cfs_quota_period;
+int cfs_period_autotune_loop_limit   = 8;
+int cfs_period_autotune_cushion_pct  = 15; /* percentage added to period recalculation */
+
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
 	struct cfs_bandwidth *cfs_b =
 		container_of(timer, struct cfs_bandwidth, period_timer);
+	s64 nsstart, nsnow, new_period;
 	int overrun;
 	int idle = 0;
+	int count = 0;
 
 	raw_spin_lock(&cfs_b->lock);
+	nsstart = ktime_to_ns(hrtimer_cb_get_time(timer));
 	for (;;) {
 		overrun = hrtimer_forward_now(timer, cfs_b->period);
 		if (!overrun)
 			break;
 
+		if (++count > cfs_period_autotune_loop_limit) {
+			ktime_t old_period = ktime_to_ns(cfs_b->period);
+
+			nsnow = ktime_to_ns(hrtimer_cb_get_time(timer));
+			new_period = (nsnow - nsstart)/cfs_period_autotune_loop_limit;
+
+			/*  Make sure new period will be larger than old. */
+			if (new_period < old_period) {
+				new_period = old_period;
+			}
+			new_period += (new_period *  cfs_period_autotune_cushion_pct) / 100;
+
+			if (new_period >  max_cfs_quota_period)
+				new_period = max_cfs_quota_period;
+
+			cfs_b->period = ns_to_ktime(new_period);
+			cfs_b->quota += (cfs_b->quota * ((new_period - old_period) * 100)/old_period)/100;
+			pr_warn_ratelimited(
+				"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
+				smp_processor_id(), cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC);
+
+			idle = 0;
+			break;
+		}
+
 		idle = do_sched_cfs_period_timer(cfs_b, overrun);
 	}
 	if (idle)


-- 

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
  2019-03-11 20:25                   ` Phil Auld
  2019-03-12 13:57                     ` Phil Auld
@ 2019-03-12 17:29                     ` bsegall
  1 sibling, 0 replies; 17+ messages in thread
From: bsegall @ 2019-03-12 17:29 UTC (permalink / raw)
  To: Phil Auld; +Cc: mingo, peterz, linux-kernel

Phil Auld <pauld@redhat.com> writes:

> On Mon, Mar 11, 2019 at 10:44:25AM -0700 bsegall@google.com wrote:
>> Phil Auld <pauld@redhat.com> writes:
>> 
>> > On Wed, Mar 06, 2019 at 11:25:02AM -0800 bsegall@google.com wrote:
>> >> Phil Auld <pauld@redhat.com> writes:
>> >> 
>> >> > On Tue, Mar 05, 2019 at 12:45:34PM -0800 bsegall@google.com wrote:
>> >> >> Phil Auld <pauld@redhat.com> writes:
>> >> >> 
>> >> >> > Interestingly, if I limit the number of child cgroups to the number of 
>> >> >> > them I'm actually putting processes into (16 down from 2500) the problem
>> >> >> > does not reproduce.
>> >> >> 
>> >> >> That is indeed interesting, and definitely not something we'd want to
>> >> >> matter. (Particularly if it's not root->a->b->c...->throttled_cgroup or
>> >> >> root->throttled->a->...->thread vs root->throttled_cgroup, which is what
>> >> >> I was originally thinking of)
>> >> >> 
>> >> >
>> >> > The locking may be a red herring.
>> >> >
>> >> > The setup is root->throttled->a where a is 1-2500. There are 4 threads in
>> >> > each of the first 16 a groups.  The parent, throttled, is where the 
>> >> > cfs_period/quota_us are set. 
>> >> >
>> >> > I wonder if the problem is the walk_tg_tree_from() call in unthrottle_cfs_rq(). 
>> >> >
>> >> > The distribute_cfs_runtime looks to be O(n * m) where n is the number of
>> >> > throttled cfs_rqs and m is the number of child cgroups. But I'm not
>> >> > completely clear on how the hierarchical cgroups play together here. 
>> >> >
>> >> > I'll pull on this thread some. 
>> >> >
>> >> > Thanks for your input.
>> >> >
>> >> >
>> >> > Cheers,
>> >> > Phil
>> >> 
>> >> Yeah, that isn't under the cfs_b lock, but is still part of distribute
>> >> (and under rq lock, which might also matter). I was thinking too much
>> >> about just the cfs_b regions. I'm not sure there's any good general
>> >> optimization there.
>> >>
>> >
>> > It's really an edge case, but the watchdog NMI is pretty painful.
>> >
>> >> I suppose cfs_rqs (tgs/cfs_bs?) could have "nearest
>> >> ancestor with a quota" pointer and ones with quota could have
>> >> "descendants with quota" list, parallel to the children/parent lists of
>> >> tgs. Then throttle/unthrottle would only have to visit these lists, and
>> >> child cgroups/cfs_rqs without their own quotas would just check
>> >> cfs_rq->nearest_quota_cfs_rq->throttle_count. throttled_clock_task_time
>> >> can also probably be tracked there.
>> >
>> > That seems like it would add a lot of complexity for this edge case. Maybe
>> > it would be acceptable to use the safety valve like my first example, or
>> > something like the below, which will tune the period up until it doesn't
>> > overrun forever.  The downside of this one is that it does change the user's
>> > settings, but that could be preferable to an NMI crash.
>> 
>> Yeah, I'm not sure what solution is best here, but one of the solutions
>> should be done.
>> 
>> >
>> > Cheers,
>> > Phil
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 310d0637fe4b..78f9e28adc7b 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -4859,16 +4859,42 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>> >  	return HRTIMER_NORESTART;
>> >  }
>> >  
>> > +extern const u64 max_cfs_quota_period;
>> > +s64 cfs_quota_period_autotune_thresh = 100 * NSEC_PER_MSEC;
>> > +int cfs_quota_period_autotune_shift  = 4; /* 100 / 16 = 6.25% */
>> 
>> Letting it spin for 100ms and then only increasing by 6% seems extremely
>> generous. If we went this route I'd probably say "after looping N
>> times, set the period to time taken / N + X%" where N is like 8 or
>> something. I think I'd probably prefer something like this to the
>> previous "just abort and let it happen again next interrupt" one.
>
> Okay. I'll try to spin something up that does this. It may be a little 
> trickier to keep the quota proportional to the new period. I think that's 
> important since we'll be changing the user's setting.
>
> Do you mean to have it break when it hits N and recalculates the period or 
> reset the counter and keep going?
>

In theory you should be fine doing it once more I think? And yeah,
keeping the quota correct is a bit more annoying given you have to use
fixed point math.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
  2019-03-12 13:57                     ` Phil Auld
@ 2019-03-13 17:44                       ` bsegall
  2019-03-13 18:50                         ` Phil Auld
  0 siblings, 1 reply; 17+ messages in thread
From: bsegall @ 2019-03-13 17:44 UTC (permalink / raw)
  To: Phil Auld; +Cc: mingo, peterz, linux-kernel

Phil Auld <pauld@redhat.com> writes:

> On Mon, Mar 11, 2019 at 04:25:36PM -0400 Phil Auld wrote:
>> On Mon, Mar 11, 2019 at 10:44:25AM -0700 bsegall@google.com wrote:
>> > Letting it spin for 100ms and then only increasing by 6% seems extremely
>> > generous. If we went this route I'd probably say "after looping N
>> > times, set the period to time taken / N + X%" where N is like 8 or
>> > something. I think I'd probably prefer something like this to the
>> > previous "just abort and let it happen again next interrupt" one.
>> 
>> Okay. I'll try to spin something up that does this. It may be a little 
>> trickier to keep the quota proportional to the new period. I think that's 
>> important since we'll be changing the user's setting.
>> 
>> Do you mean to have it break when it hits N and recalculates the period or 
>> reset the counter and keep going?
>> 
>
> Let me know what you think of the below. It's working nicely. I like your 
> suggestion to limit it quickly based on number of loops and use that to 
> scale up. I think it is best to break out and let it fire again if needed. 
> The warning fires once, very occasionally twice, and then things are quiet.
>
> If that looks reasonable I'll do some more testing and spin it up as a real 
> patch submission. 

Yeah, this looks reasonable. I should probably see how unreasonable the
other thing would be, but if your previous periods were kinda small (and
it's just that the machine crashing isn't an ok failure mode) I suppose
it's not a big deal.

>
> Cheers,
> Phil
> ---
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 310d0637fe4b..54b30adfc89e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4859,19 +4859,51 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>  	return HRTIMER_NORESTART;
>  }
>  
> +extern const u64 max_cfs_quota_period;
> +int cfs_period_autotune_loop_limit   = 8;
> +int cfs_period_autotune_cushion_pct  = 15; /* percentage added to period recalculation */
> +
>  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  {
>  	struct cfs_bandwidth *cfs_b =
>  		container_of(timer, struct cfs_bandwidth, period_timer);
> +	s64 nsstart, nsnow, new_period;
>  	int overrun;
>  	int idle = 0;
> +	int count = 0;
>  
>  	raw_spin_lock(&cfs_b->lock);
> +	nsstart = ktime_to_ns(hrtimer_cb_get_time(timer));
>  	for (;;) {
>  		overrun = hrtimer_forward_now(timer, cfs_b->period);
>  		if (!overrun)
>  			break;
>  
> +		if (++count > cfs_period_autotune_loop_limit) {
> +			ktime_t old_period = ktime_to_ns(cfs_b->period);
> +
> +			nsnow = ktime_to_ns(hrtimer_cb_get_time(timer));
> +			new_period = (nsnow - nsstart)/cfs_period_autotune_loop_limit;
> +
> +			/*  Make sure new period will be larger than old. */
> +			if (new_period < old_period) {
> +				new_period = old_period;
> +			}
> +			new_period += (new_period *  cfs_period_autotune_cushion_pct) / 100;

This ordering means that it will always increase by at least 15%. This
is a bit odd but probably a good thing; I'd just change the comment to
make it clear this is deliberate.

> +
> +			if (new_period >  max_cfs_quota_period)
> +				new_period = max_cfs_quota_period;
> +
> +			cfs_b->period = ns_to_ktime(new_period);
> +			cfs_b->quota += (cfs_b->quota * ((new_period - old_period) * 100)/old_period)/100;

In general it makes sense to do fixed point via 1024 or something that
can be optimized into shifts (and a larger number is better in general
for better precision).
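
E.g. something roughly like the below (rough sketch only, not compile-tested;
div64_u64 is from linux/math64.h and the names are made up):

	/* scale quota by new_period/old_period using a power-of-two base */
	static u64 scale_quota_by_period(u64 quota, u64 old_period, u64 new_period)
	{
		/* base-1024 ratio so the divide-by-100s become shifts */
		u64 ratio = div64_u64(new_period << 10, old_period);

		/* quota * ratio fits in 64 bits for sane period/quota settings */
		return (quota * ratio) >> 10;
	}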

> +			pr_warn_ratelimited(
> +				"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> +				smp_processor_id(), cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC);
> +
> +			idle = 0;
> +			break;
> +		}
> +
>  		idle = do_sched_cfs_period_timer(cfs_b, overrun);
>  	}
>  	if (idle)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
  2019-03-13 17:44                       ` bsegall
@ 2019-03-13 18:50                         ` Phil Auld
  2019-03-13 20:26                           ` bsegall
  0 siblings, 1 reply; 17+ messages in thread
From: Phil Auld @ 2019-03-13 18:50 UTC (permalink / raw)
  To: bsegall; +Cc: mingo, peterz, linux-kernel

On Wed, Mar 13, 2019 at 10:44:09AM -0700 bsegall@google.com wrote:
> Phil Auld <pauld@redhat.com> writes:
> 
> > On Mon, Mar 11, 2019 at 04:25:36PM -0400 Phil Auld wrote:
> >> On Mon, Mar 11, 2019 at 10:44:25AM -0700 bsegall@google.com wrote:
> >> > Letting it spin for 100ms and then only increasing by 6% seems extremely
> >> > generous. If we went this route I'd probably say "after looping N
> >> > times, set the period to time taken / N + X%" where N is like 8 or
> >> > something. I think I'd probably prefer something like this to the
> >> > previous "just abort and let it happen again next interrupt" one.
> >> 
> >> Okay. I'll try to spin something up that does this. It may be a little 
> >> trickier to keep the quota proportional to the new period. I think that's 
> >> important since we'll be changing the user's setting.
> >> 
> >> Do you mean to have it break when it hits N and recalculates the period or 
> >> reset the counter and keep going?
> >> 
> >
> > Let me know what you think of the below. It's working nicely. I like your 
> > suggestion to limit it quickly based on number of loops and use that to 
> > scale up. I think it is best to break out and let it fire again if needed. 
> > The warning fires once, very occasionally twice, and then things are quiet.
> >
> > If that looks reasonable I'll do some more testing and spin it up as a real 
> > patch submission. 
> 
> Yeah, this looks reasonable. I should probably see how unreasonable the
> other thing would be, but if your previous periods were kinda small (and
> it's just that the machine crashing isn't an ok failure mode) I suppose
> it's not a big deal.
> 

I posted it a little while ago. The periods were tiny (~2000us vs a minimum 
of 1000us) and there were 2500 mostly unused child cgroups (I didn't narrow 
that down much, but it did still reproduce with 1250 children). That's why I 
was thinking edge case. It also requires a fairly small quota and enough load 
to make sure cfs_rqs actually get throttled.
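
Roughly, the setup side of it is just the below (sketch only, not the actual
test program; the cgroup v1 mount point and the quota value are placeholders):

	#include <stdio.h>
	#include <sys/stat.h>

	static void set(const char *path, long val)
	{
		FILE *f = fopen(path, "w");

		if (f) {
			fprintf(f, "%ld\n", val);
			fclose(f);
		}
	}

	int main(void)
	{
		char path[128];
		int i;

		/* parent group where cfs_period_us/cfs_quota_us are set */
		mkdir("/sys/fs/cgroup/cpu/throttled", 0755);
		set("/sys/fs/cgroup/cpu/throttled/cpu.cfs_period_us", 2000);
		set("/sys/fs/cgroup/cpu/throttled/cpu.cfs_quota_us", 1000);

		/* 2500 children, most of which stay empty */
		for (i = 0; i < 2500; i++) {
			snprintf(path, sizeof(path),
				 "/sys/fs/cgroup/cpu/throttled/a%d", i);
			mkdir(path, 0755);
		}

		/* then start a few busy threads in each of the first 16 children */
		return 0;
	}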

I'm still wrapping my head around the scheduler code but I'd be happy to 
try it the other way if you can give me a bit more description of what
you have in mind. Also happy to test a patch with my repro. 


Cheers,
Phil


> >
> > Cheers,
> > Phil
> > ---
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 310d0637fe4b..54b30adfc89e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4859,19 +4859,51 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
> >  	return HRTIMER_NORESTART;
> >  }
> >  
> > +extern const u64 max_cfs_quota_period;
> > +int cfs_period_autotune_loop_limit   = 8;
> > +int cfs_period_autotune_cushion_pct  = 15; /* percentage added to period recalculation */
> > +
> >  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >  {
> >  	struct cfs_bandwidth *cfs_b =
> >  		container_of(timer, struct cfs_bandwidth, period_timer);
> > +	s64 nsstart, nsnow, new_period;
> >  	int overrun;
> >  	int idle = 0;
> > +	int count = 0;
> >  
> >  	raw_spin_lock(&cfs_b->lock);
> > +	nsstart = ktime_to_ns(hrtimer_cb_get_time(timer));
> >  	for (;;) {
> >  		overrun = hrtimer_forward_now(timer, cfs_b->period);
> >  		if (!overrun)
> >  			break;
> >  
> > +		if (++count > cfs_period_autotune_loop_limit) {
> > +			ktime_t old_period = ktime_to_ns(cfs_b->period);
> > +
> > +			nsnow = ktime_to_ns(hrtimer_cb_get_time(timer));
> > +			new_period = (nsnow - nsstart)/cfs_period_autotune_loop_limit;
> > +
> > +			/*  Make sure new period will be larger than old. */
> > +			if (new_period < old_period) {
> > +				new_period = old_period;
> > +			}
> > +			new_period += (new_period *  cfs_period_autotune_cushion_pct) / 100;
> 
> This ordering means that it will always increase by at least 15%. This
> is a bit odd but probably a good thing; I'd just change the comment to
> make it clear this is deliberate.
> 
> > +
> > +			if (new_period >  max_cfs_quota_period)
> > +				new_period = max_cfs_quota_period;
> > +
> > +			cfs_b->period = ns_to_ktime(new_period);
> > +			cfs_b->quota += (cfs_b->quota * ((new_period - old_period) * 100)/old_period)/100;
> 
> In general it makes sense to do fixed point via 1024 or something that
> can be optimized into shifts (and a larger number is better in general
> for better precision).
> 
> > +			pr_warn_ratelimited(
> > +				"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> > +				smp_processor_id(), cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC);
> > +
> > +			idle = 0;
> > +			break;
> > +		}
> > +
> >  		idle = do_sched_cfs_period_timer(cfs_b, overrun);
> >  	}
> >  	if (idle)

-- 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
  2019-03-13 18:50                         ` Phil Auld
@ 2019-03-13 20:26                           ` bsegall
  2019-03-13 21:10                             ` Phil Auld
  0 siblings, 1 reply; 17+ messages in thread
From: bsegall @ 2019-03-13 20:26 UTC (permalink / raw)
  To: Phil Auld; +Cc: mingo, peterz, linux-kernel

Phil Auld <pauld@redhat.com> writes:

> On Wed, Mar 13, 2019 at 10:44:09AM -0700 bsegall@google.com wrote:
>> Phil Auld <pauld@redhat.com> writes:
>> 
>> > On Mon, Mar 11, 2019 at 04:25:36PM -0400 Phil Auld wrote:
>> >> On Mon, Mar 11, 2019 at 10:44:25AM -0700 bsegall@google.com wrote:
>> >> > Letting it spin for 100ms and then only increasing by 6% seems extremely
>> >> > generous. If we went this route I'd probably say "after looping N
>> >> > times, set the period to time taken / N + X%" where N is like 8 or
>> >> > something. I think I'd probably prefer something like this to the
>> >> > previous "just abort and let it happen again next interrupt" one.
>> >> 
>> >> Okay. I'll try to spin something up that does this. It may be a little 
>> >> trickier to keep the quota proportional to the new period. I think that's 
>> >> important since we'll be changing the user's setting.
>> >> 
>> >> Do you mean to have it break when it hits N and recalculates the period or 
>> >> reset the counter and keep going?
>> >> 
>> >
>> > Let me know what you think of the below. It's working nicely. I like your 
>> > suggestion to limit it quickly based on number of loops and use that to 
>> > scale up. I think it is best to break out and let it fire again if needed. 
>> > The warning fires once, very occasionally twice, and then things are quiet.
>> >
>> > If that looks reasonable I'll do some more testing and spin it up as a real 
>> > patch submission. 
>> 
>> Yeah, this looks reasonable. I should probably see how unreasonable the
>> other thing would be, but if your previous periods were kinda small (and
>> it's just that the machine crashing isn't an ok failure mode) I suppose
>> it's not a big deal.
>> 
>
> I posted it a little while ago. The periods were tiny (~2000us vs a minimum 
> of 1000us) and there were 2500 mostly unused child cgroups (I didn't narrow 
> that down much, but it did still reproduce with 1250 children). That's why I 
> was thinking edge case. It also requires a fairly small quota and enough load 
> to make sure cfs_rqs actually get throttled.

Ok, yeah, that's below the level where I'm too worried about it. It's
still bad to spend a ms in IRQ in the case of tons of child cgroups, but
1-2ms periods are definitely way less than what is really sensible for cfsb.

>
> I'm still wrapping my head around the scheduler code but I'd be happy to 
> try it the other way if you can give me a bit more description of what
> you have in mind. Also happy to test a patch with my repro.

Eh, I was more saying to myself, though I'm a bit busy.

The idea is that the only reason for the walk_tg_from is that we need
two pieces of information for every cfs_rq: a) if it is currently
throttled, and b) how much total time it has spent throttled. We
currently update these for all children during throttle/unthrottle
rather than have the children search through their ancestors to avoid
the cost when doing other work. However, all children who don't have a
quota set will all have identical throttle_count, and all children who
have never had a quota set will have identical throttle time (I
previously hadn't thought of the additional restriction required for the
second case here). Thus we could share the data structure for all these
identical cases, and then only have to look at children who have quota
set, or have in the past. Setting things up to walk this sort of reduced
tree of "tgs/cfs_rqs that have ever used cfs_b" would be a pain and
mirror the existing tg tree setup, but be possible.

This would be no help if the children sometimes have a quota (say if
only a few ever use it at a time but eventually all of them do, or they
do initially during some setup or something), and would just make things
worse with additional memory/cache pressure.
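
Very roughly, the per-cfs_rq bookkeeping I have in mind would be something
like the below (hypothetical sketch; nothing like this exists today and the
names are made up):

	#include <linux/list.h>
	#include <linux/types.h>

	/* shared by a cfs_rq and all of its descendants that never set a quota */
	struct cfs_quota_node {
		/* nearest ancestor (or self) that has ever had a quota set */
		struct cfs_quota_node	*nearest_quota;

		/* descendants that have ever had a quota of their own */
		struct list_head	quota_children;
		struct list_head	quota_sibling;

		int			throttle_count;
		u64			throttled_clock_task_time;
	};

Throttle/unthrottle would then only walk quota_children instead of the full
tg tree, and quota-less children would read throttle state through
nearest_quota.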

>
>
> Cheers,
> Phil
>
>
>> >
>> > Cheers,
>> > Phil
>> > ---
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 310d0637fe4b..54b30adfc89e 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -4859,19 +4859,51 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>> >  	return HRTIMER_NORESTART;
>> >  }
>> >  
>> > +extern const u64 max_cfs_quota_period;
>> > +int cfs_period_autotune_loop_limit   = 8;
>> > +int cfs_period_autotune_cushion_pct  = 15; /* percentage added to period recalculation */
>> > +
>> >  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> >  {
>> >  	struct cfs_bandwidth *cfs_b =
>> >  		container_of(timer, struct cfs_bandwidth, period_timer);
>> > +	s64 nsstart, nsnow, new_period;
>> >  	int overrun;
>> >  	int idle = 0;
>> > +	int count = 0;
>> >  
>> >  	raw_spin_lock(&cfs_b->lock);
>> > +	nsstart = ktime_to_ns(hrtimer_cb_get_time(timer));
>> >  	for (;;) {
>> >  		overrun = hrtimer_forward_now(timer, cfs_b->period);
>> >  		if (!overrun)
>> >  			break;
>> >  
>> > +		if (++count > cfs_period_autotune_loop_limit) {
>> > +			ktime_t old_period = ktime_to_ns(cfs_b->period);
>> > +
>> > +			nsnow = ktime_to_ns(hrtimer_cb_get_time(timer));
>> > +			new_period = (nsnow - nsstart)/cfs_period_autotune_loop_limit;
>> > +
>> > +			/*  Make sure new period will be larger than old. */
>> > +			if (new_period < old_period) {
>> > +				new_period = old_period;
>> > +			}
>> > +			new_period += (new_period *  cfs_period_autotune_cushion_pct) / 100;
>> 
>> This ordering means that it will always increase by at least 15%. This
>> is a bit odd but probably a good thing; I'd just change the comment to
>> make it clear this is deliberate.
>> 
>> > +
>> > +			if (new_period >  max_cfs_quota_period)
>> > +				new_period = max_cfs_quota_period;
>> > +
>> > +			cfs_b->period = ns_to_ktime(new_period);
>> > +			cfs_b->quota += (cfs_b->quota * ((new_period - old_period) * 100)/old_period)/100;
>> 
>> In general it makes sense to do fixed point via 1024 or something that
>> can be optimized into shifts (and a larger number is better in general
>> for better precision).
>> 
>> > +			pr_warn_ratelimited(
>> > +				"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
>> > +				smp_processor_id(), cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC);
>> > +
>> > +			idle = 0;
>> > +			break;
>> > +		}
>> > +
>> >  		idle = do_sched_cfs_period_timer(cfs_b, overrun);
>> >  	}
>> >  	if (idle)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer
  2019-03-13 20:26                           ` bsegall
@ 2019-03-13 21:10                             ` Phil Auld
  0 siblings, 0 replies; 17+ messages in thread
From: Phil Auld @ 2019-03-13 21:10 UTC (permalink / raw)
  To: bsegall; +Cc: mingo, peterz, linux-kernel

On Wed, Mar 13, 2019 at 01:26:51PM -0700 bsegall@google.com wrote:
> Phil Auld <pauld@redhat.com> writes:
> 
> > On Wed, Mar 13, 2019 at 10:44:09AM -0700 bsegall@google.com wrote:
> >> Phil Auld <pauld@redhat.com> writes:
> >> 
> >> > On Mon, Mar 11, 2019 at 04:25:36PM -0400 Phil Auld wrote:
> >> >> On Mon, Mar 11, 2019 at 10:44:25AM -0700 bsegall@google.com wrote:
> >> >> > Letting it spin for 100ms and then only increasing by 6% seems extremely
> >> >> > generous. If we went this route I'd probably say "after looping N
> >> >> > times, set the period to time taken / N + X%" where N is like 8 or
> >> >> > something. I think I'd probably prefer something like this to the
> >> >> > previous "just abort and let it happen again next interrupt" one.
> >> >> 
> >> >> Okay. I'll try to spin something up that does this. It may be a little 
> >> >> trickier to keep the quota proportional to the new period. I think that's 
> >> >> important since we'll be changing the user's setting.
> >> >> 
> >> >> Do you mean to have it break when it hits N and recalculates the period or 
> >> >> reset the counter and keep going?
> >> >> 
> >> >
> >> > Let me know what you think of the below. It's working nicely. I like your 
> >> > suggestion to limit it quickly based on number of loops and use that to 
> >> > scale up. I think it is best to break out and let it fire again if needed. 
> >> > The warning fires once, very occasionally twice, and then things are quiet.
> >> >
> >> > If that looks reasonable I'll do some more testing and spin it up as a real 
> >> > patch submission. 
> >> 
> >> Yeah, this looks reasonable. I should probably see how unreasonable the
> >> other thing would be, but if your previous periods were kinda small (and
> >> it's just that the machine crashing isn't an ok failure mode) I suppose
> >> it's not a big deal.
> >> 
> >
> > I posted it a little while ago. The periods were tiny (~2000us vs a minimum 
> > of 1000us) and there were 2500 mostly unused child cgroups (I didn't narrow 
> > that down much, but it did still reproduce with 1250 children). That's why I 
> > was thinking edge case. It also requires a fairly small quota and enough load 
> > to make sure cfs_rqs actually get throttled.
> 
> Ok, yeah, that's below the level where I'm too worried about it. It's
> still bad to spend a ms in IRQ in the case of tons of child cgroups, but
> 1-2ms periods are definitely way less than what is really sensible for cfsb.
> 

The original case in the wild was 17ms.  In order to make it trigger reliably 
on upstream it needed to be reduced a fair bit. 

I considered raising the minimum but was concerned that would effectively be 
an API breakage. And even then it could still fail with more cgroups. 


> >
> > I'm still wrapping my head around the scheduler code but I'd be happy to 
> > try it the other way if you can give me a bit more description of what
> > you have in mind. Also happy to test a patch with my repro.
> 
> Eh, I was more saying to myself, though I'm a bit busy.
> 
> The idea is that the only reason for the walk_tg_from is that we need
> two pieces of information for every cfs_rq: a) if it is currently
> throttled, and b) how much total time it has spent throttled. We
> currently update these for all children during throttle/unthrottle
> rather than have the children search through their ancestors to avoid
> the cost when doing other work. However, all children who don't have a
> quota set will all have identical throttle_count, and all children who
> have never had a quota set will have identical throttle time (I
> previously hadn't thought of the additional restriction required for the
> second case here). Thus we could share the data structure for all these
> identical cases, and then only have to look at children who have quota
> set, or have in the past. Setting things up to walk this sort of reduced
> tree of "tgs/cfs_rqs that have ever used cfs_b" would be a pain and
> mirror the existing tg tree setup, but be possible.
> 
> This would be no help if the children sometimes have a quota (say if
> only a few ever use it at a time but eventually all of them do, or they
> do initially during some setup or something), and would just make things
> worse with additional memory/cache pressure.

Let's see if there is any traction for the current patch then. 


Interestingly, 5.0+ seems to be a bit slower in do_sched_cfs_period_timer() than
the v5.0 tagged tree. 

The test runs once with the initial values (period 2170us) for 30 seconds, then
resets them to half that and runs for another 30s. Once the period drops below 
the minimum it fails, so the test cleans up and starts over. 

Usually there are 1-2 warnings in the first 30s and 1 in the 2nd, followed by an 
idle stretch, which usually means a CPU switch. The new cfs_period_us is the 
average over the 8 loops (modulo the 15% increase). The small ones seem to be in 
the first run, probably before the throttled list gets full enough. HW, test and 
userspace are all the same. 

This is 5.0+ (HEAD) from today:
[  249.081068] cfs_period_timer[cpu40]: period too short, scaling up (new cfs_period_us 20388, cfs_quota_us = 1174481)
[  286.184298] cfs_period_timer[cpu21]: period too short, scaling up (new cfs_period_us 12367, cfs_quota_us = 712423)
[  316.247071] cfs_period_timer[cpu12]: period too short, scaling up (new cfs_period_us 15670, cfs_quota_us = 902688)
[  353.118799] cfs_period_timer[cpu1]: period too short, scaling up (new cfs_period_us 9189, cfs_quota_us = 529375)
[  353.568433] cfs_period_timer[cpu1]: period too short, scaling up (new cfs_period_us 20875, cfs_quota_us = 1202477)
[  383.226669] cfs_period_timer[cpu1]: period too short, scaling up (new cfs_period_us 17195, cfs_quota_us = 990533)
[  420.331071] cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 16409, cfs_quota_us = 945243)
[  450.363500] cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 17041, cfs_quota_us = 981681)
[  487.198213] cfs_period_timer[cpu6]: period too short, scaling up (new cfs_period_us 7886, cfs_quota_us = 454289)
[  487.991057] cfs_period_timer[cpu6]: period too short, scaling up (new cfs_period_us 12848, cfs_quota_us = 740132)
[  517.290409] cfs_period_timer[cpu6]: period too short, scaling up (new cfs_period_us 15257, cfs_quota_us = 878867)
[  554.135046] cfs_period_timer[cpu7]: period too short, scaling up (new cfs_period_us 4689, cfs_quota_us = 270154)
[  554.393260] cfs_period_timer[cpu7]: period too short, scaling up (new cfs_period_us 19964, cfs_quota_us = 1150044)
[  584.263253] cfs_period_timer[cpu7]: period too short, scaling up (new cfs_period_us 17543, cfs_quota_us = 1010544)
[  621.359747] cfs_period_timer[cpu7]: period too short, scaling up (new cfs_period_us 4331, cfs_quota_us = 249533)
[  621.955802] cfs_period_timer[cpu7]: period too short, scaling up (new cfs_period_us 15284, cfs_quota_us = 880459)


This is v5.0:
[  105.859929] cfs_period_timer[cpu28]: period too short, scaling up (new cfs_period_us 6575, cfs_quota_us = 378757)
[  123.370507] cfs_period_timer[cpu28]: period too short, scaling up (new cfs_period_us 10336, cfs_quota_us = 595413)
[  135.866776] cfs_period_timer[cpu28]: period too short, scaling up (new cfs_period_us 11857, cfs_quota_us = 683030)
[  173.345169] cfs_period_timer[cpu33]: period too short, scaling up (new cfs_period_us 5155, cfs_quota_us = 296991)
[  182.912922] cfs_period_timer[cpu33]: period too short, scaling up (new cfs_period_us 10832, cfs_quota_us = 624006)
[  203.422465] cfs_period_timer[cpu33]: period too short, scaling up (new cfs_period_us 12305, cfs_quota_us = 708816)
[  240.946232] cfs_period_timer[cpu31]: period too short, scaling up (new cfs_period_us 11883, cfs_quota_us = 684559)
[  270.899190] cfs_period_timer[cpu31]: period too short, scaling up (new cfs_period_us 12185, cfs_quota_us = 701929)
[  308.251577] cfs_period_timer[cpu15]: period too short, scaling up (new cfs_period_us 11063, cfs_quota_us = 637318)
[  338.267753] cfs_period_timer[cpu15]: period too short, scaling up (new cfs_period_us 12288, cfs_quota_us = 707873)
[  375.636986] cfs_period_timer[cpu35]: period too short, scaling up (new cfs_period_us 7252, cfs_quota_us = 417748)
[  405.697730] cfs_period_timer[cpu35]: period too short, scaling up (new cfs_period_us 12585, cfs_quota_us = 724956)
[  443.173459] cfs_period_timer[cpu36]: period too short, scaling up (new cfs_period_us 8081, cfs_quota_us = 465516)
[  473.224266] cfs_period_timer[cpu36]: period too short, scaling up (new cfs_period_us 12703, cfs_quota_us = 731788)
[  510.600851] cfs_period_timer[cpu32]: period too short, scaling up (new cfs_period_us 10613, cfs_quota_us = 611382)
[  540.625699] cfs_period_timer[cpu32]: period too short, scaling up (new cfs_period_us 11961, cfs_quota_us = 689002)


Cheers,
Phil

> 
> >
> >
> > Cheers,
> > Phil
> >
> >
> >> >
> >> > Cheers,
> >> > Phil
> >> > ---
> >> >
> >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> > index 310d0637fe4b..54b30adfc89e 100644
> >> > --- a/kernel/sched/fair.c
> >> > +++ b/kernel/sched/fair.c
> >> > @@ -4859,19 +4859,51 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
> >> >  	return HRTIMER_NORESTART;
> >> >  }
> >> >  
> >> > +extern const u64 max_cfs_quota_period;
> >> > +int cfs_period_autotune_loop_limit   = 8;
> >> > +int cfs_period_autotune_cushion_pct  = 15; /* percentage added to period recalculation */
> >> > +
> >> >  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >> >  {
> >> >  	struct cfs_bandwidth *cfs_b =
> >> >  		container_of(timer, struct cfs_bandwidth, period_timer);
> >> > +	s64 nsstart, nsnow, new_period;
> >> >  	int overrun;
> >> >  	int idle = 0;
> >> > +	int count = 0;
> >> >  
> >> >  	raw_spin_lock(&cfs_b->lock);
> >> > +	nsstart = ktime_to_ns(hrtimer_cb_get_time(timer));
> >> >  	for (;;) {
> >> >  		overrun = hrtimer_forward_now(timer, cfs_b->period);
> >> >  		if (!overrun)
> >> >  			break;
> >> >  
> >> > +		if (++count > cfs_period_autotune_loop_limit) {
> >> > +			ktime_t old_period = ktime_to_ns(cfs_b->period);
> >> > +
> >> > +			nsnow = ktime_to_ns(hrtimer_cb_get_time(timer));
> >> > +			new_period = (nsnow - nsstart)/cfs_period_autotune_loop_limit;
> >> > +
> >> > +			/*  Make sure new period will be larger than old. */
> >> > +			if (new_period < old_period) {
> >> > +				new_period = old_period;
> >> > +			}
> >> > +			new_period += (new_period *  cfs_period_autotune_cushion_pct) / 100;
> >> 
> >> This ordering means that it will always increase by at least 15%. This
> >> is a bit odd but probably a good thing; I'd just change the comment to
> >> make it clear this is deliberate.
> >> 
> >> > +
> >> > +			if (new_period >  max_cfs_quota_period)
> >> > +				new_period = max_cfs_quota_period;
> >> > +
> >> > +			cfs_b->period = ns_to_ktime(new_period);
> >> > +			cfs_b->quota += (cfs_b->quota * ((new_period - old_period) * 100)/old_period)/100;
> >> 
> >> In general it makes sense to do fixed point via 1024 or something that
> >> can be optimized into shifts (and a larger number is better in general
> >> for better precision).
> >> 
> >> > +			pr_warn_ratelimited(
> >> > +				"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> >> > +				smp_processor_id(), cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC);
> >> > +
> >> > +			idle = 0;
> >> > +			break;
> >> > +		}
> >> > +
> >> >  		idle = do_sched_cfs_period_timer(cfs_b, overrun);
> >> >  	}
> >> >  	if (idle)

-- 

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread

Thread overview: 17+ messages
2019-03-01 14:52 [RFC] sched/fair: hard lockup in sched_cfs_period_timer Phil Auld
2019-03-04 18:13 ` bsegall
2019-03-04 19:05   ` Phil Auld
2019-03-05 18:49     ` bsegall
2019-03-05 20:05       ` Phil Auld
2019-03-05 20:45         ` bsegall
2019-03-06 16:23           ` Phil Auld
2019-03-06 19:25             ` bsegall
2019-03-09 20:33               ` Phil Auld
2019-03-11 17:44                 ` bsegall
2019-03-11 20:25                   ` Phil Auld
2019-03-12 13:57                     ` Phil Auld
2019-03-13 17:44                       ` bsegall
2019-03-13 18:50                         ` Phil Auld
2019-03-13 20:26                           ` bsegall
2019-03-13 21:10                             ` Phil Auld
2019-03-12 17:29                     ` bsegall
