* [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
@ 2014-05-15 16:58 Roman Gushchin
  2014-05-15 17:43 ` bsegall
  0 siblings, 1 reply; 19+ messages in thread
From: Roman Gushchin @ 2014-05-15 16:58 UTC (permalink / raw)
  To: linux-kernel, bsegall, peterz, pjt, chris.j.arges, gregkh

tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to
force the period timer restart. This is not safe, because it
can lead to the deadlock described in commit 927b54fccbf0:
"__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
waiting for the hrtimer to finish. However, if sched_cfs_period_timer
runs for another loop iteration, the hrtimer can attempt to take
rq->lock, resulting in deadlock."
If tg_set_cfs_bandwidth() resets cfs_b->timer_active concurrently
with the second iteration of sched_cfs_period_timer, a deadlock occurs.

Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
wait for period timer callbacks (ignoring cfs_b->timer_active) and
restart the timer explicitly.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ben Segall <bsegall@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: pjt@google.com
Cc: Chris J Arges <chris.j.arges@canonical.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 kernel/sched/core.c  |    3 +--
 kernel/sched/fair.c  |    8 ++++----
 kernel/sched/sched.h |    2 +-
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d9d8ece..e9b9c66 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7717,8 +7717,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	/* restart the period timer (if active) to handle new period expiry */
 	if (runtime_enabled && cfs_b->timer_active) {
 		/* force a reprogram */
-		cfs_b->timer_active = 0;
-		__start_cfs_bandwidth(cfs_b);
+		__start_cfs_bandwidth(cfs_b, true);
 	}
 	raw_spin_unlock_irq(&cfs_b->lock);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7570dd9..55a0a5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3129,7 +3129,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		 */
 		if (!cfs_b->timer_active) {
 			__refill_cfs_bandwidth_runtime(cfs_b);
-			__start_cfs_bandwidth(cfs_b);
+			__start_cfs_bandwidth(cfs_b, false);
 		}
 
 		if (cfs_b->runtime > 0) {
@@ -3308,7 +3308,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	raw_spin_lock(&cfs_b->lock);
 	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 	if (!cfs_b->timer_active)
-		__start_cfs_bandwidth(cfs_b);
+		__start_cfs_bandwidth(cfs_b, false);
 	raw_spin_unlock(&cfs_b->lock);
 }
 
@@ -3690,7 +3690,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 }
 
 /* requires cfs_b->lock, may release to reprogram timer */
-void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
 {
 	/*
 	 * The timer may be active because we're trying to set a new bandwidth
@@ -3705,7 +3705,7 @@ void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 		cpu_relax();
 		raw_spin_lock(&cfs_b->lock);
 		/* if someone else restarted the timer then we're done */
-		if (cfs_b->timer_active)
+		if (!force && cfs_b->timer_active)
 			return;
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 456e492..369b4d6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -278,7 +278,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
 
 extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
-extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
+extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
 
 extern void free_rt_sched_group(struct task_group *tg);
-- 
1.7.9.5



* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-15 16:58 [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock Roman Gushchin
@ 2014-05-15 17:43 ` bsegall
  2014-05-16  8:38   ` Roman Gushchin
  2014-05-19 10:40   ` Peter Zijlstra
  0 siblings, 2 replies; 19+ messages in thread
From: bsegall @ 2014-05-15 17:43 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: linux-kernel, peterz, pjt, chris.j.arges, gregkh

Roman Gushchin <klamm@yandex-team.ru> writes:

> tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to
> force the period timer restart. It's not safe, because
> can lead to deadlock, described in commit 927b54fccbf0:
> "__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
> waiting for the hrtimer to finish. However, if sched_cfs_period_timer
> runs for another loop iteration, the hrtimer can attempt to take
> rq->lock, resulting in deadlock."
> If tg_set_cfs_bandwidth() resets cfs_b->timer_active concurrently
> to the second iteration of sched_cfs_period_timer, deadlock occurs.

I do not see this deadlock. cfs_b->timer_active is protected by
cfs_b->lock on both read and write, and I don't think it would even
directly cause deadlock if it wasn't. In particular this patch makes us
run for strictly longer outside of the interrupt (although I think the
patched version is also correct). The old issue was calling
hrtimer_cancel, which would mean we hold cfs_b->lock while waiting for
the interrupt to complete, while the interrupt was waiting to take
cfs_b->lock.

This patch /does/ ensure that we call
start_bandwidth_timer/__hrtimer_start_range_ns with the updated period,
but if I'm remembering my hrtimer mechanics correctly that is not
necessary.

I just did notice however that sched_cfs_period_timer reads
cfs_b->period without any locks, which can in fact race with an update
to period (and isn't fixed by this patch). If this write doesn't happen
to be atomic (which is certainly not guaranteed, and is entirely
plausible on say 32-bit x86, much less other archs), we could read a
partially written value and move the timer incorrectly. Pulling the
lock/unlock out of do_sched_cfs_period_timer should fix this easily
enough.

unthrottle_offline_cfs_rqs could cause similar issues with its unlocked
read of quota, and can also be easily fixed.
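
For illustration, a minimal user-space sketch of that kind of tearing (not
kernel code; the variable name and the two test values are invented for the
example). On a 32-bit build the plain 64-bit store is likely split into two
32-bit stores, so the unlocked reader can occasionally observe a value that is
half old and half new, which is the same hazard as the unlocked period/quota
reads described above:

/*
 * torn_read.c: illustration only, deliberately racy.
 * Compile for a 32-bit target (e.g. gcc -m32 -O2 -pthread torn_read.c)
 * so that the plain 64-bit stores below are likely emitted as two
 * 32-bit stores each.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define VAL_A	0x0000000100000000ULL	/* high half 1, low half 0 */
#define VAL_B	0x00000000ffffffffULL	/* high half 0, low half ~0 */

static volatile uint64_t period = VAL_A;	/* stands in for cfs_b->period */

static void *writer(void *arg)
{
	for (;;) {
		period = VAL_A;		/* plain, non-atomic 64-bit stores */
		period = VAL_B;
	}
	return NULL;
}

static void *reader(void *arg)
{
	for (;;) {
		uint64_t v = period;	/* unlocked read, like the timer */

		if (v != VAL_A && v != VAL_B)
			printf("torn read: %#llx\n", (unsigned long long)v);
	}
	return NULL;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);	/* runs until interrupted */
	return 0;
}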


>
> Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
> wait for period timer callbacks (ignoring cfs_b->timer_active) and
> restart the timer explicitly.
>
> Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: pjt@google.com
> Cc: Chris J Arges <chris.j.arges@canonical.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
>  kernel/sched/core.c  |    3 +--
>  kernel/sched/fair.c  |    8 ++++----
>  kernel/sched/sched.h |    2 +-
>  3 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d9d8ece..e9b9c66 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7717,8 +7717,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>  	/* restart the period timer (if active) to handle new period expiry */
>  	if (runtime_enabled && cfs_b->timer_active) {
>  		/* force a reprogram */
> -		cfs_b->timer_active = 0;
> -		__start_cfs_bandwidth(cfs_b);
> +		__start_cfs_bandwidth(cfs_b, true);
>  	}
>  	raw_spin_unlock_irq(&cfs_b->lock);
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7570dd9..55a0a5b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3129,7 +3129,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  		 */
>  		if (!cfs_b->timer_active) {
>  			__refill_cfs_bandwidth_runtime(cfs_b);
> -			__start_cfs_bandwidth(cfs_b);
> +			__start_cfs_bandwidth(cfs_b, false);
>  		}
>  
>  		if (cfs_b->runtime > 0) {
> @@ -3308,7 +3308,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>  	raw_spin_lock(&cfs_b->lock);
>  	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
>  	if (!cfs_b->timer_active)
> -		__start_cfs_bandwidth(cfs_b);
> +		__start_cfs_bandwidth(cfs_b, false);
>  	raw_spin_unlock(&cfs_b->lock);
>  }
>  
> @@ -3690,7 +3690,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  }
>  
>  /* requires cfs_b->lock, may release to reprogram timer */
> -void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
>  {
>  	/*
>  	 * The timer may be active because we're trying to set a new bandwidth
> @@ -3705,7 +3705,7 @@ void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  		cpu_relax();
>  		raw_spin_lock(&cfs_b->lock);
>  		/* if someone else restarted the timer then we're done */
> -		if (cfs_b->timer_active)
> +		if (!force && cfs_b->timer_active)
>  			return;
>  	}
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 456e492..369b4d6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -278,7 +278,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
>  extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
>  
>  extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
> -extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
> +extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force);
>  extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
>  
>  extern void free_rt_sched_group(struct task_group *tg);


* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-15 17:43 ` bsegall
@ 2014-05-16  8:38   ` Roman Gushchin
  2014-05-16 17:57     ` bsegall
  2014-05-19 10:32     ` Peter Zijlstra
  2014-05-19 10:40   ` Peter Zijlstra
  1 sibling, 2 replies; 19+ messages in thread
From: Roman Gushchin @ 2014-05-16  8:38 UTC (permalink / raw)
  To: bsegall; +Cc: Roman Gushchin, linux-kernel, peterz, pjt, chris.j.arges, gregkh

At Thu, 15 May 2014 10:43:14 -0700,
bsegall@google.com wrote:
> 
> Roman Gushchin <klamm@yandex-team.ru> writes:
> 
> > tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to
> > force the period timer restart. It's not safe, because
> > can lead to deadlock, described in commit 927b54fccbf0:
> > "__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
> > waiting for the hrtimer to finish. However, if sched_cfs_period_timer
> > runs for another loop iteration, the hrtimer can attempt to take
> > rq->lock, resulting in deadlock."
> > If tg_set_cfs_bandwidth() resets cfs_b->timer_active concurrently
> > to the second iteration of sched_cfs_period_timer, deadlock occurs.
> 
> I do not see this deadlock. cfs_b->timer_active is protected by
> cfs_b->lock on both read and write, and I don't think it would even
> directly cause deadlock if it wasn't. In particular this patch makes us
> run for strictly longer outside of the interrupt (although I think the
> patched version is also correct). The old issue was calling
> hrtimer_cancel, which would mean we hold cfs_b->lock while waiting for
> the interrupt to complete, while the interrupt was waiting to take
> cfs_b->lock.

I still think there is a deadlock. I'll try to explain.
Three CPUs must be involved:
CPU0				CPU1				CPU2
take rq->lock			period timer fired		
...				take cfs_b lock
...				...				tg_set_cfs_bandwidth()
throttle_cfs_rq()		release cfs_b lock		take cfs_b lock
...				distribute_cfs_runtime()	timer_active = 0
take cfs_b->lock		wait for rq->lock		...
__start_cfs_bandwidth()	
{wait for timer callback
 break if timer_active == 1}

So, CPU0 and CPU1 are deadlocked.

If you recall the do_sched_cfs_period_timer() code, timer_active is set to 1 to
avoid a similar deadlock:
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
{
	...
	/*
	 * if we have relooped after returning idle once, we need to update our
	 * status as actually running, so that other cpus doing
	 * __start_cfs_bandwidth will stop trying to cancel us.
	 */
	cfs_b->timer_active = 1;
	...
	while (throttled && runtime > 0) {
		raw_spin_unlock(&cfs_b->lock);
		/* we can't nest cfs_b->lock while distributing bandwidth */
>		runtime = distribute_cfs_runtime(cfs_b, runtime,
						 runtime_expires);
	...

tg_set_cfs_bandwidth() can actually set timer_active to 0 at the moment marked
with '>' above.
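
To see why CPU0 can never get out, here is a simplified sketch of the pre-patch
__start_cfs_bandwidth() wait loop (condensed from the hunks quoted in this
thread, not the verbatim source), annotated with the state of the CPUs from the
table above:

/* Sketch only, condensed from the diffs in this thread. */
void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
	/* CPU0 enters here from throttle_cfs_rq(), holding rq->lock. */
	while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
	       hrtimer_try_to_cancel(&cfs_b->period_timer) < 0) {
		/*
		 * The callback is still running on CPU1 (spinning on the
		 * rq->lock that CPU0 holds), so the cancel always fails.
		 */
		raw_spin_unlock(&cfs_b->lock);
		cpu_relax();
		raw_spin_lock(&cfs_b->lock);
		/*
		 * The only other way out: someone marked the timer as
		 * running again. But CPU2's tg_set_cfs_bandwidth() has just
		 * reset timer_active to 0, and CPU1 cannot set it back to 1
		 * because it never gets rq->lock, so CPU0 spins here forever.
		 */
		if (cfs_b->timer_active)
			return;
	}

	cfs_b->timer_active = 1;
	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
}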

This is how it looks in practice:
CPU0:
[ 1967.348938] WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x98/0xc0()
[ 1967.356663] Watchdog detected hard LOCKUP on cpu 8
[ 1967.361269] Modules linked in: vhost_net macvtap macvlan 8021q mrp garp ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc intel_powerclamp coretemp kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 gpio_ich hid_generic dm_multipath scsi_dh usbhid ast ttm hid drm_kms_helper microcode drm sysimgblt sysfillrect syscopyarea shpchp pci_hotplug sb_edac edac_core lpc_ich tcp_htcp raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov igb dca i2c_algo_bit ahci libahci i2c_core ptp pps_core raid6_pq async_tx raid1 raid0 multipath linear
[ 1967.433217] CPU: 8 PID: 9745 Comm: kvm Tainted: G        W    3.10.38+ #17
[ 1967.440074] Hardware name: GIGABYTE GS-CFZ1/GS-CCZ/GS-CCZH/GA-7PTSH, BIOS D06 03/25/2013
[ 1967.448142]  ffffffff819faa10 ffff88207fc07bf8 ffffffff81676473 ffff88207fc07c38
[ 1967.455587]  ffffffff81054fcb 0000000000000000 ffff882029338000 0000000000000000
[ 1967.463033]  ffff88207fc07d80 0000000000000000 ffff88207fc07ef8 ffff88207fc07c98
[ 1967.470480] Call Trace:
[ 1967.472922]  <NMI>  [<ffffffff81676473>] dump_stack+0x19/0x1b
[ 1967.478691]  [<ffffffff81054fcb>] warn_slowpath_common+0x6b/0xa0
[ 1967.484689]  [<ffffffff810550a1>] warn_slowpath_fmt+0x41/0x50
[ 1967.490422]  [<ffffffff810e2a28>] watchdog_overflow_callback+0x98/0xc0
[ 1967.496933]  [<ffffffff811171bc>] __perf_event_overflow+0x9c/0x310
[ 1967.503097]  [<ffffffff810245c8>] ? x86_perf_event_set_period+0xd8/0x150
[ 1967.509780]  [<ffffffff811181f4>] perf_event_overflow+0x14/0x20
[ 1967.515686]  [<ffffffff8102a71e>] intel_pmu_handle_irq+0x1ae/0x390
[ 1967.521852]  [<ffffffff8133c31d>] ? do_raw_spin_unlock+0x5d/0xb0
[ 1967.527852]  [<ffffffff810236c1>] perf_event_nmi_handler+0x21/0x30
[ 1967.534025]  [<ffffffff810174ae>] nmi_handle.isra.2+0x9e/0x1e0
[ 1967.539852]  [<ffffffff81017410>] ? unregister_nmi_handler+0x180/0x180
[ 1967.546370]  [<ffffffff81017708>] do_nmi+0x118/0x390
[ 1967.551324]  [<ffffffff8167e0c1>] end_repeat_nmi+0x1e/0x2e
[ 1967.556804]  [<ffffffff810b2721>] ? __lock_acquire+0x201/0x1fb0
[ 1967.562716]  [<ffffffff810b2721>] ? __lock_acquire+0x201/0x1fb0
[ 1967.568621]  [<ffffffff810b2721>] ? __lock_acquire+0x201/0x1fb0
[ 1967.574525]  <<EOE>>  [<ffffffff810b293f>] ? __lock_acquire+0x41f/0x1fb0
[ 1967.581235]  [<ffffffff810b293f>] ? __lock_acquire+0x41f/0x1fb0
[ 1967.587144]  [<ffffffff81085b3b>] ? lock_hrtimer_base.isra.25+0x2b/0x60
[ 1967.593748]  [<ffffffff810b4b4c>] lock_acquire+0xac/0x140
[ 1967.599143]  [<ffffffff81098c18>] ? __start_cfs_bandwidth+0x68/0x90
[ 1967.605402]  [<ffffffff8167cdac>] _raw_spin_lock+0x2c/0x40
[ 1967.610882]  [<ffffffff81098c18>] ? __start_cfs_bandwidth+0x68/0x90
[ 1967.617132]  [<ffffffff8167d236>] ? _raw_spin_unlock+0x26/0x30
[ 1967.622952]  [<ffffffff81098c18>] __start_cfs_bandwidth+0x68/0x90
[ 1967.629038]  [<ffffffff81098d27>] __account_cfs_rq_runtime+0xe7/0x190
[ 1967.635462]  [<ffffffff81098ea8>] update_curr+0xd8/0x1c0
[ 1967.640768]  [<ffffffff8109aca8>] put_prev_task_fair+0x228/0x350
[ 1967.646760]  [<ffffffff8101bb99>] ? sched_clock+0x9/0x10
[ 1967.652060]  [<ffffffff8109558d>] ? sched_clock_cpu+0xbd/0x110
[ 1967.657879]  [<ffffffff8167b01f>] __schedule+0x13f/0x8a0
[ 1967.663187]  [<ffffffff810b0fb0>] ? mark_held_locks+0x80/0x150
[ 1967.669011]  [<ffffffff8167c405>] ? yield_to+0x145/0x210
[ 1967.674312]  [<ffffffff8167c1a4>] schedule+0x24/0x70
[ 1967.679273]  [<ffffffff8167c41a>] yield_to+0x15a/0x210
[ 1967.684419]  [<ffffffffa0270e68>] kvm_vcpu_yield_to+0xc8/0x1c0 [kvm]
[ 1967.690767]  [<ffffffffa0270da0>] ? kvm_vcpu_fault+0xf0/0xf0 [kvm]
[ 1967.696939]  [<ffffffffa027107b>] kvm_vcpu_on_spin+0x11b/0x150 [kvm]
[ 1967.703283]  [<ffffffffa0236da9>] handle_pause+0x19/0x30 [kvm_intel]
[ 1967.709630]  [<ffffffffa023908e>] vmx_handle_exit+0xfe/0x8e0 [kvm_intel]
[ 1967.716330]  [<ffffffffa028b085>] ? kvm_arch_vcpu_ioctl_run+0xe35/0x1330 [kvm]
[ 1967.723548]  [<ffffffffa028adc5>] kvm_arch_vcpu_ioctl_run+0xb75/0x1330 [kvm]
[ 1967.730594]  [<ffffffffa028ad5a>] ? kvm_arch_vcpu_ioctl_run+0xb0a/0x1330 [kvm]
[ 1967.737807]  [<ffffffffa0273136>] kvm_vcpu_ioctl+0x456/0x640 [kvm]
[ 1967.743974]  [<ffffffff810ba085>] ? do_futex+0xa45/0xbb0
[ 1967.749274]  [<ffffffff8119a37f>] do_vfs_ioctl+0x8f/0x550
[ 1967.754667]  [<ffffffff811a5bce>] ? fget_light+0xee/0x4d0
[ 1967.760059]  [<ffffffff811a5b23>] ? fget_light+0x43/0x4d0
[ 1967.765445]  [<ffffffff8167e355>] ? sysret_check+0x22/0x5d
[ 1967.770917]  [<ffffffff8119a8d1>] SyS_ioctl+0x91/0xb0
[ 1967.775956]  [<ffffffff81334cbe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1967.782381]  [<ffffffff8167e329>] system_call_fastpath+0x16/0x1b

CPU1:
[ 2044.045281] NMI backtrace for cpu 9
[ 2044.048763] CPU: 9 PID: 0 Comm: swapper/9 Tainted: G        W    3.10.38+ #17
[ 2044.055878] Hardware name: GIGABYTE GS-CFZ1/GS-CCZ/GS-CCZH/GA-7PTSH, BIOS D06 03/25/2013
[ 2044.063949] task: ffff88102956af40 ti: ffff881029570000 task.ti: ffff881029570000
[ 2044.071411] RIP: 0010:[<ffffffff81333a1d>]  [<ffffffff81333a1d>] delay_tsc+0x1d/0x80
[ 2044.079153] RSP: 0018:ffff88207fc23d58  EFLAGS: 00000006
[ 2044.084451] RAX: 0000000000000000 RBX: ffff88207fdb45c0 RCX: 000000005ae55164
[ 2044.091567] RDX: 000000000000c7c6 RSI: 000000000002a000 RDI: 0000000000000001
[ 2044.098685] RBP: ffff88207fc23d78 R08: 0000000000000002 R09: 0000000000000000
[ 2044.105801] R10: 0000000000000000 R11: 0000000000000016 R12: 00000000831982c5
[ 2044.112919] R13: 0000000000000009 R14: 0000000000000001 R15: ffff88207fdb45c0
[ 2044.120035] FS:  0000000000000000(0000) GS:ffff88207fc20000(0000) knlGS:0000000000000000
[ 2044.128105] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2044.133835] CR2: 00007f561cd6fc30 CR3: 0000000001c0c000 CR4: 00000000000427e0
[ 2044.140954] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2044.148072] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2044.155186] Stack:
[ 2044.157196]  ffff88207fdb45c0 00000000831982c5 00000000833a1014 ffff881ec1a33400
[ 2044.164641]  ffff88207fc23d88 ffffffff8133396a ffff88207fc23dc8 ffffffff8133c22c
[ 2044.172087]  ffff88207fc23dc8 ffff88207fdb45c0 ffff881024dc8258 ffff88207fdb45c0
[ 2044.179534] Call Trace:
[ 2044.181974]  <IRQ> 
[ 2044.183898]  [<ffffffff8133396a>] __delay+0xa/0x10
[ 2044.188886]  [<ffffffff8133c22c>] do_raw_spin_lock+0xcc/0x120
[ 2044.194618]  [<ffffffff8167cdb4>] _raw_spin_lock+0x34/0x40
[ 2044.200091]  [<ffffffff8109be87>] ? distribute_cfs_runtime+0xb7/0x290
[ 2044.206523]  [<ffffffff8109be87>] distribute_cfs_runtime+0xb7/0x290
[ 2044.212783]  [<ffffffff8109bdd0>] ? rq_offline_fair+0x140/0x140
[ 2044.218695]  [<ffffffff8109c288>] sched_cfs_period_timer+0x118/0x190
[ 2044.225035]  [<ffffffff81085c2c>] __run_hrtimer+0x6c/0x1d0
[ 2044.230514]  [<ffffffff810a626d>] ? ktime_get_update_offsets+0x4d/0xe0
[ 2044.237025]  [<ffffffff8109c170>] ? sched_cfs_slack_timer+0x110/0x110
[ 2044.243450]  [<ffffffff810864d7>] hrtimer_interrupt+0x107/0x260
[ 2044.249362]  [<ffffffff81038684>] smp_apic_timer_interrupt+0x64/0xa0
[ 2044.255702]  [<ffffffff8167efb2>] apic_timer_interrupt+0x72/0x80
[ 2044.261698]  <EOI> 
[ 2044.263622]  [<ffffffff8150baae>] ? cpuidle_enter_state+0x5e/0xe0
[ 2044.269910]  [<ffffffff8150baaa>] ? cpuidle_enter_state+0x5a/0xe0
[ 2044.275994]  [<ffffffff8150bbe9>] cpuidle_idle_call+0xb9/0x250
[ 2044.281813]  [<ffffffff8101d109>] arch_cpu_idle+0x9/0x30
[ 2044.287112]  [<ffffffff810a3fc0>] cpu_startup_entry+0xe0/0x2d0
[ 2044.292930]  [<ffffffff816662f9>] start_secondary+0x204/0x20b

CPU2:
[ 2043.793852] CPU: 12 PID: 1487 Comm: libvirtd Tainted: G        W    3.10.38+ #17
[ 2043.801227] Hardware name: GIGABYTE GS-CFZ1/GS-CCZ/GS-CCZH/GA-7PTSH, BIOS D06 03/25/2013
[ 2043.809298] task: ffff882025785e80 ti: ffff882026518000 task.ti: ffff882026518000
[ 2043.816760] RIP: 0010:[<ffffffff81333a45>]  [<ffffffff81333a45>] delay_tsc+0x45/0x80
[ 2043.824500] RSP: 0018:ffff882026519cd8  EFLAGS: 00000002
[ 2043.829800] RAX: 000011815ae55170 RBX: ffff881024dc8170 RCX: 000000005ae55170
[ 2043.836917] RDX: 0000000000001181 RSI: 0000000dc4934000 RDI: 0000000000000001
[ 2043.844033] RBP: ffff882026519cf8 R08: 0000000000000003 R09: 0000000000000000
[ 2043.851151] R10: 0000000000000000 R11: 000000000000049b R12: 000000005ae55170
[ 2043.858267] R13: 000000000000000c R14: 0000000000000001 R15: 0000000000000000
[ 2043.865384] FS:  00007f67cf908700(0000) GS:ffff88207fc80000(0000) knlGS:0000000000000000
[ 2043.873453] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2043.879185] CR2: 00007f67d5c22000 CR3: 0000001026737000 CR4: 00000000000427e0
[ 2043.886301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2043.893418] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2043.900534] Stack:
[ 2043.902543]  ffff881024dc8170 0000000000000032 00000000833a1014 0000000000000001
[ 2043.909990]  ffff882026519d08 ffffffff8133396a ffff882026519d48 ffffffff8133c22c
[ 2043.917437]  ffff882026519d38 ffff881024dc8170 ffff881024dc81d8 ffff881024dc8170
[ 2043.924883] Call Trace:
[ 2043.927326]  [<ffffffff8133396a>] __delay+0xa/0x10
[ 2043.932114]  [<ffffffff8133c22c>] do_raw_spin_lock+0xcc/0x120
[ 2043.937846]  [<ffffffff8167cdb4>] _raw_spin_lock+0x34/0x40
[ 2043.943325]  [<ffffffff81098c18>] ? __start_cfs_bandwidth+0x68/0x90
[ 2043.949577]  [<ffffffff8167d236>] ? _raw_spin_unlock+0x26/0x30
[ 2043.955395]  [<ffffffff81098c18>] __start_cfs_bandwidth+0x68/0x90
[ 2043.961474]  [<ffffffff8108f9a8>] tg_set_cfs_bandwidth+0x2e8/0x330
[ 2043.967638]  [<ffffffff8108f768>] ? tg_set_cfs_bandwidth+0xa8/0x330
[ 2043.973897]  [<ffffffff810952e7>] tg_set_cfs_period+0x17/0x20
[ 2043.979629]  [<ffffffff81095300>] cpu_cfs_period_write_u64+0x10/0x20
[ 2043.985969]  [<ffffffff810caa63>] cgroup_file_write+0x143/0x2e0
[ 2043.991882]  [<ffffffff8118773b>] ? vfs_write+0x1bb/0x1e0
[ 2043.997274]  [<ffffffff8118773b>] ? vfs_write+0x1bb/0x1e0
[ 2044.002665]  [<ffffffff812c8cb7>] ? security_file_permission+0x27/0xb0
[ 2044.009180]  [<ffffffff81187649>] vfs_write+0xc9/0x1e0
[ 2044.014315]  [<ffffffff81187b10>] SyS_write+0x50/0xa0
[ 2044.019360]  [<ffffffff8167e329>] system_call_fastpath+0x16/0x1b

> 
> This patch /does/ ensure that we call
> start_bandwidth_timer/__hrtimer_start_range_ns with the updated period,
> but if I'm remembering my hrtimer mechanics correctly that is not
> necessary.
> 
> I just did notice however that sched_cfs_period_timer reads
> cfs_b->period without any locks, which can in fact race with an update
> to period (and isn't fixed by this patch). If this write doesn't happen
> to be atomic (which is certainly not guaranteed, and is entirely
> plausible on say 32-bit x86, much less other archs), we could read a
> partially written value and move the timer incorrectly. Pulling the
> lock/unlock out of do_sched_cfs_period_timer should fix this easily
> enough.

Looks like another issue.

Thanks,
Roman

> 
> unthrottle_offline_cfs_rqs could cause similar issues with its unlocked
> read of quota, and can also be easily fixed.
> 
> 
> >
> > Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
> > wait for period timer callbacks (ignoring cfs_b->timer_active) and
> > restart the timer explicitly.
> >
> > Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: pjt@google.com
> > Cc: Chris J Arges <chris.j.arges@canonical.com>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > ---
> >  kernel/sched/core.c  |    3 +--
> >  kernel/sched/fair.c  |    8 ++++----
> >  kernel/sched/sched.h |    2 +-
> >  3 files changed, 6 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index d9d8ece..e9b9c66 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7717,8 +7717,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> >  	/* restart the period timer (if active) to handle new period expiry */
> >  	if (runtime_enabled && cfs_b->timer_active) {
> >  		/* force a reprogram */
> > -		cfs_b->timer_active = 0;
> > -		__start_cfs_bandwidth(cfs_b);
> > +		__start_cfs_bandwidth(cfs_b, true);
> >  	}
> >  	raw_spin_unlock_irq(&cfs_b->lock);
> >  
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 7570dd9..55a0a5b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3129,7 +3129,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >  		 */
> >  		if (!cfs_b->timer_active) {
> >  			__refill_cfs_bandwidth_runtime(cfs_b);
> > -			__start_cfs_bandwidth(cfs_b);
> > +			__start_cfs_bandwidth(cfs_b, false);
> >  		}
> >  
> >  		if (cfs_b->runtime > 0) {
> > @@ -3308,7 +3308,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
> >  	raw_spin_lock(&cfs_b->lock);
> >  	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
> >  	if (!cfs_b->timer_active)
> > -		__start_cfs_bandwidth(cfs_b);
> > +		__start_cfs_bandwidth(cfs_b, false);
> >  	raw_spin_unlock(&cfs_b->lock);
> >  }
> >  
> > @@ -3690,7 +3690,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >  }
> >  
> >  /* requires cfs_b->lock, may release to reprogram timer */
> > -void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> > +void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
> >  {
> >  	/*
> >  	 * The timer may be active because we're trying to set a new bandwidth
> > @@ -3705,7 +3705,7 @@ void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> >  		cpu_relax();
> >  		raw_spin_lock(&cfs_b->lock);
> >  		/* if someone else restarted the timer then we're done */
> > -		if (cfs_b->timer_active)
> > +		if (!force && cfs_b->timer_active)
> >  			return;
> >  	}
> >  
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 456e492..369b4d6 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -278,7 +278,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
> >  extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
> >  
> >  extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
> > -extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
> > +extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force);
> >  extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
> >  
> >  extern void free_rt_sched_group(struct task_group *tg);


* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-16  8:38   ` Roman Gushchin
@ 2014-05-16 17:57     ` bsegall
  2014-05-19 11:10       ` Roman Gushchin
  2014-05-19 10:32     ` Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: bsegall @ 2014-05-16 17:57 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: linux-kernel, peterz, pjt, chris.j.arges, gregkh

Roman Gushchin <klamm@yandex-team.ru> writes:

> At Thu, 15 May 2014 10:43:14 -0700,
> bsegall@google.com wrote:
>> 
>> Roman Gushchin <klamm@yandex-team.ru> writes:
>> 
>> > tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to
>> > force the period timer restart. It's not safe, because
>> > can lead to deadlock, described in commit 927b54fccbf0:
>> > "__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
>> > waiting for the hrtimer to finish. However, if sched_cfs_period_timer
>> > runs for another loop iteration, the hrtimer can attempt to take
>> > rq->lock, resulting in deadlock."
>> > If tg_set_cfs_bandwidth() resets cfs_b->timer_active concurrently
>> > to the second iteration of sched_cfs_period_timer, deadlock occurs.
>> 
>> I do not see this deadlock. cfs_b->timer_active is protected by
>> cfs_b->lock on both read and write, and I don't think it would even
>> directly cause deadlock if it wasn't. In particular this patch makes us
>> run for strictly longer outside of the interrupt (although I think the
>> patched version is also correct). The old issue was calling
>> hrtimer_cancel, which would mean we hold cfs_b->lock while waiting for
>> the interrupt to complete, while the interrupt was waiting to take
>> cfs_b->lock.
>
> I still think, there is a deadlock. I'll try to explain.
> Three CPUs must be involved:
> CPU0				CPU1				CPU2
> take rq->lock			period timer fired		
> ...				take cfs_b lock
> ...				...				tg_set_cfs_bandwidth()
> throttle_cfs_rq()		release cfs_b lock		take cfs_b lock
> ...				distribute_cfs_runtime()	timer_active = 0
> take cfs_b->lock		wait for rq->lock		...
> __start_cfs_bandwidth()	
> {wait for timer callback
>  break if timer_active == 1}
>
> So, CPU0 and CPU1 are deadlocked.

Ahhh, yes, there is a three-cpu race here. Could you clarify in the commit
message that this is a race involving all three of tg_set_cfs_bandwidth, the
period timer, and then a second __start_cfs_bandwidth call?

The code looks good however.

>> I just did notice however that sched_cfs_period_timer reads
>> cfs_b->period without any locks, which can in fact race with an update
>> to period (and isn't fixed by this patch). If this write doesn't happen
>> to be atomic (which is certainly not guaranteed, and is entirely
>> plausible on say 32-bit x86, much less other archs), we could read a
>> partially written value and move the timer incorrectly. Pulling the
>> lock/unlock out of do_sched_cfs_period_timer should fix this easily
>> enough.
>
> Looks like an another issue.

Yeah.
>
> Thanks,
> Roman
>
>> 
>> unthrottle_offline_cfs_rqs could cause similar issues with its unlocked
>> read of quota, and can also be easily fixed.
>> 
>> 
>> >
>> > Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
>> > wait for period timer callbacks (ignoring cfs_b->timer_active) and
>> > restart the timer explicitly.
>> >
>> > Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
>> > Cc: Ben Segall <bsegall@google.com>
>> > Cc: Peter Zijlstra <peterz@infradead.org>
>> > Cc: pjt@google.com
>> > Cc: Chris J Arges <chris.j.arges@canonical.com>
>> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>> > ---
>> >  kernel/sched/core.c  |    3 +--
>> >  kernel/sched/fair.c  |    8 ++++----
>> >  kernel/sched/sched.h |    2 +-
>> >  3 files changed, 6 insertions(+), 7 deletions(-)
>> >
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index d9d8ece..e9b9c66 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -7717,8 +7717,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>> >  	/* restart the period timer (if active) to handle new period expiry */
>> >  	if (runtime_enabled && cfs_b->timer_active) {
>> >  		/* force a reprogram */
>> > -		cfs_b->timer_active = 0;
>> > -		__start_cfs_bandwidth(cfs_b);
>> > +		__start_cfs_bandwidth(cfs_b, true);
>> >  	}
>> >  	raw_spin_unlock_irq(&cfs_b->lock);
>> >  
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 7570dd9..55a0a5b 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -3129,7 +3129,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>> >  		 */
>> >  		if (!cfs_b->timer_active) {
>> >  			__refill_cfs_bandwidth_runtime(cfs_b);
>> > -			__start_cfs_bandwidth(cfs_b);
>> > +			__start_cfs_bandwidth(cfs_b, false);
>> >  		}
>> >  
>> >  		if (cfs_b->runtime > 0) {
>> > @@ -3308,7 +3308,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>> >  	raw_spin_lock(&cfs_b->lock);
>> >  	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
>> >  	if (!cfs_b->timer_active)
>> > -		__start_cfs_bandwidth(cfs_b);
>> > +		__start_cfs_bandwidth(cfs_b, false);
>> >  	raw_spin_unlock(&cfs_b->lock);
>> >  }
>> >  
>> > @@ -3690,7 +3690,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>> >  }
>> >  
>> >  /* requires cfs_b->lock, may release to reprogram timer */
>> > -void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> > +void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
>> >  {
>> >  	/*
>> >  	 * The timer may be active because we're trying to set a new bandwidth
>> > @@ -3705,7 +3705,7 @@ void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> >  		cpu_relax();
>> >  		raw_spin_lock(&cfs_b->lock);
>> >  		/* if someone else restarted the timer then we're done */
>> > -		if (cfs_b->timer_active)
>> > +		if (!force && cfs_b->timer_active)
>> >  			return;
>> >  	}
>> >  
>> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> > index 456e492..369b4d6 100644
>> > --- a/kernel/sched/sched.h
>> > +++ b/kernel/sched/sched.h
>> > @@ -278,7 +278,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
>> >  extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
>> >  
>> >  extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
>> > -extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
>> > +extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force);
>> >  extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
>> >  
>> >  extern void free_rt_sched_group(struct task_group *tg);


* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-16  8:38   ` Roman Gushchin
  2014-05-16 17:57     ` bsegall
@ 2014-05-19 10:32     ` Peter Zijlstra
  2014-05-19 11:37       ` Roman Gushchin
  2014-05-19 17:30       ` bsegall
  1 sibling, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-05-19 10:32 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: bsegall, linux-kernel, pjt, chris.j.arges, gregkh

On Fri, May 16, 2014 at 12:38:21PM +0400, Roman Gushchin wrote:

> I still think, there is a deadlock. I'll try to explain.
> Three CPUs must be involved:
> CPU0				CPU1				CPU2
> take rq->lock			period timer fired		
> ...				take cfs_b lock
> ...				...				tg_set_cfs_bandwidth()
> throttle_cfs_rq()		release cfs_b lock		take cfs_b lock
> ...				distribute_cfs_runtime()	timer_active = 0
> take cfs_b->lock		wait for rq->lock		...
> __start_cfs_bandwidth()	
> {wait for timer callback
>  break if timer_active == 1}
> 
> So, CPU0 and CPU1 are deadlocked.

OK, so can someone explain this ->timer_active thing? esp. what's the
'obvious' difference with hrtimer_active()?


Ideally we'd change the lot to not have this, but if we have to keep it
we'll need to make it lockdep visible because all this stinks



* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-15 17:43 ` bsegall
  2014-05-16  8:38   ` Roman Gushchin
@ 2014-05-19 10:40   ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-05-19 10:40 UTC (permalink / raw)
  To: bsegall; +Cc: Roman Gushchin, linux-kernel, pjt, chris.j.arges, gregkh

On Thu, May 15, 2014 at 10:43:14AM -0700, bsegall@google.com wrote:
> I just did notice however that sched_cfs_period_timer reads
> cfs_b->period without any locks, which can in fact race with an update
> to period (and isn't fixed by this patch). If this write doesn't happen
> to be atomic (which is certainly not guaranteed, and is entirely
> plausible on say 32-bit x86, much less other archs), we could read a
> partially written value and move the timer incorrectly. Pulling the
> lock/unlock out of do_sched_cfs_period_timer should fix this easily
> enough.
> 
> unthrottle_offline_cfs_rqs could cause similar issues with its unlocked
> read of quota, and can also be easily fixed.

Ben, will you send patches for this?



* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-16 17:57     ` bsegall
@ 2014-05-19 11:10       ` Roman Gushchin
  2014-05-19 17:40         ` bsegall
  0 siblings, 1 reply; 19+ messages in thread
From: Roman Gushchin @ 2014-05-19 11:10 UTC (permalink / raw)
  To: bsegall; +Cc: Roman Gushchin, linux-kernel, peterz, pjt, chris.j.arges, gregkh

tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to
force the period timer restart. This is not safe, because it
can lead to the deadlock described in commit 927b54fccbf0:
"__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
waiting for the hrtimer to finish. However, if sched_cfs_period_timer
runs for another loop iteration, the hrtimer can attempt to take
rq->lock, resulting in deadlock."
Three CPUs must be involved:
CPU0               CPU1                         CPU2
take rq->lock      period timer fired
...                take cfs_b lock
...                ...                          tg_set_cfs_bandwidth()
throttle_cfs_rq()  release cfs_b lock           take cfs_b lock
...                distribute_cfs_runtime()     timer_active = 0
take cfs_b->lock   wait for rq->lock            ...
__start_cfs_bandwidth()
{wait for timer callback
 break if timer_active == 1}

So, CPU0 and CPU1 are deadlocked.

Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
wait for period timer callbacks (ignoring cfs_b->timer_active) and
restart the timer explicitly.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ben Segall <bsegall@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: pjt@google.com
Cc: Chris J Arges <chris.j.arges@canonical.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 kernel/sched/core.c  |    3 +--
 kernel/sched/fair.c  |    8 ++++----
 kernel/sched/sched.h |    2 +-
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b4308d7..e122774 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7826,8 +7826,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	/* restart the period timer (if active) to handle new period expiry */
 	if (runtime_enabled && cfs_b->timer_active) {
 		/* force a reprogram */
-		cfs_b->timer_active = 0;
-		__start_cfs_bandwidth(cfs_b);
+		__start_cfs_bandwidth(cfs_b, true);
 	}
 	raw_spin_unlock_irq(&cfs_b->lock);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c7ab8ea..54939f8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2117,7 +2117,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		 */
 		if (!cfs_b->timer_active) {
 			__refill_cfs_bandwidth_runtime(cfs_b);
-			__start_cfs_bandwidth(cfs_b);
+			__start_cfs_bandwidth(cfs_b, false);
 		}
 
 		if (cfs_b->runtime > 0) {
@@ -2298,7 +2298,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	raw_spin_lock(&cfs_b->lock);
 	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 	if (!cfs_b->timer_active)
-		__start_cfs_bandwidth(cfs_b);
+		__start_cfs_bandwidth(cfs_b, false);
 	raw_spin_unlock(&cfs_b->lock);
 }
 
@@ -2681,7 +2681,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 }
 
 /* requires cfs_b->lock, may release to reprogram timer */
-void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
 {
 	/*
 	 * The timer may be active because we're trying to set a new bandwidth
@@ -2696,7 +2696,7 @@ void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 		cpu_relax();
 		raw_spin_lock(&cfs_b->lock);
 		/* if someone else restarted the timer then we're done */
-		if (cfs_b->timer_active)
+		if (!force && cfs_b->timer_active)
 			return;
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dfa31d5..bc150a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -209,7 +209,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
 
 extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
-extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
+extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
 
 extern void free_rt_sched_group(struct task_group *tg);
-- 
1.7.9.5



* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-19 10:32     ` Peter Zijlstra
@ 2014-05-19 11:37       ` Roman Gushchin
  2014-05-19 17:35         ` bsegall
  2014-05-19 17:30       ` bsegall
  1 sibling, 1 reply; 19+ messages in thread
From: Roman Gushchin @ 2014-05-19 11:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Roman Gushchin, bsegall, linux-kernel, pjt, chris.j.arges, gregkh

At Mon, 19 May 2014 12:32:47 +0200,
Peter Zijlstra wrote:
> 
> OK, so can someone explain this ->timer_active thing? esp. what's the
> 'obvious' difference with hrtimer_active()?

A good question :)

> Ideally we'd change the lot to not have this, but if we have to keep it
> we'll need to make it lockdep visible because all this stinks

If the patch below is what Ben means, timer_active will become even more unused.
It seems to me now that it will be perfectly possible to just drop it.
I'll try to prepare a patch for that.

--

index 7570dd9..be7865e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3408,7 +3408,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	u64 runtime, runtime_expires;
 	int idle = 1, throttled;
 
-	raw_spin_lock(&cfs_b->lock);
 	/* no need to continue the timer with no bandwidth constraint */
 	if (cfs_b->quota == RUNTIME_INF)
 		goto out_unlock;
@@ -3477,7 +3476,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 out_unlock:
 	if (idle)
 		cfs_b->timer_active = 0;
-	raw_spin_unlock(&cfs_b->lock);
 
 	return idle;
 }
@@ -3656,6 +3654,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 	int overrun;
 	int idle = 0;
 
+	raw_spin_lock(&cfs_b->lock);
+
 	for (;;) {
 		now = hrtimer_cb_get_time(timer);
 		overrun = hrtimer_forward(timer, now, cfs_b->period);
@@ -3666,6 +3666,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 		idle = do_sched_cfs_period_timer(cfs_b, overrun);
 	}
 
+	raw_spin_unlock(&cfs_b->lock);
+
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
 }



* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-19 10:32     ` Peter Zijlstra
  2014-05-19 11:37       ` Roman Gushchin
@ 2014-05-19 17:30       ` bsegall
  2014-05-20 13:15         ` Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: bsegall @ 2014-05-19 17:30 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Roman Gushchin, linux-kernel, pjt, chris.j.arges, gregkh

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, May 16, 2014 at 12:38:21PM +0400, Roman Gushchin wrote:
>
>> I still think, there is a deadlock. I'll try to explain.
>> Three CPUs must be involved:
>> CPU0				CPU1				CPU2
>> take rq->lock			period timer fired		
>> ...				take cfs_b lock
>> ...				...				tg_set_cfs_bandwidth()
>> throttle_cfs_rq()		release cfs_b lock		take cfs_b lock
>> ...				distribute_cfs_runtime()	timer_active = 0
>> take cfs_b->lock		wait for rq->lock		...
>> __start_cfs_bandwidth()	
>> {wait for timer callback
>>  break if timer_active == 1}
>> 
>> So, CPU0 and CPU1 are deadlocked.
>
> OK, so can someone explain this ->timer_active thing? esp. what's the
> 'obvious' difference with hrtimer_active()?

timer_active is only changed or checked under cfs_b lock, so that we can
be certain that the timer is active whenever runtime isn't full. We
can't use hrtimer_active because the timer might have unlocked the cfs_b
on another cpu and be about to return HRTIMER_NORESTART.
>
>
> Ideally we'd change the lot to not have this, but if we have to keep it
> we'll need to make it lockdep visible because all this stinks


* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-19 11:37       ` Roman Gushchin
@ 2014-05-19 17:35         ` bsegall
  0 siblings, 0 replies; 19+ messages in thread
From: bsegall @ 2014-05-19 17:35 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: Peter Zijlstra, linux-kernel, pjt, chris.j.arges, gregkh

Roman Gushchin <klamm@yandex-team.ru> writes:

> At Mon, 19 May 2014 12:32:47 +0200,
> Peter Zijlstra wrote:
>> 
>> OK, so can someone explain this ->timer_active thing? esp. what's the
>> 'obvious' difference with hrtimer_active()?
>
> A good question :)
>
>> Ideally we'd change the lot to not have this, but if we have to keep it
>> we'll need to make it lockdep visible because all this stinks
>
> If the patch below is what Ben means, timer_active will become even more unused.
> It seems to me now, that it will be perfectly possible to just drop it.
> I'll try to prepare a patch for.

Yeah, the below patch was the sort of idea I was thinking of. I'll send
a similar patch that also hits unthrottle_offline_cfs_rqs shortly. This
still doesn't save us from the unlock vs become !hrtimer_active race
though.

>
> --
>
> index 7570dd9..be7865e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3408,7 +3408,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  	u64 runtime, runtime_expires;
>  	int idle = 1, throttled;
>  
> -	raw_spin_lock(&cfs_b->lock);
>  	/* no need to continue the timer with no bandwidth constraint */
>  	if (cfs_b->quota == RUNTIME_INF)
>  		goto out_unlock;
> @@ -3477,7 +3476,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  out_unlock:
>  	if (idle)
>  		cfs_b->timer_active = 0;
> -	raw_spin_unlock(&cfs_b->lock);
>  
>  	return idle;
>  }
> @@ -3656,6 +3654,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  	int overrun;
>  	int idle = 0;
>  
> +	raw_spin_lock(&cfs_b->lock);
> +
>  	for (;;) {
>  		now = hrtimer_cb_get_time(timer);
>  		overrun = hrtimer_forward(timer, now, cfs_b->period);
> @@ -3666,6 +3666,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  		idle = do_sched_cfs_period_timer(cfs_b, overrun);
>  	}
>  
> +	raw_spin_unlock(&cfs_b->lock);
> +
>  	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
>  }


* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-19 11:10       ` Roman Gushchin
@ 2014-05-19 17:40         ` bsegall
  0 siblings, 0 replies; 19+ messages in thread
From: bsegall @ 2014-05-19 17:40 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: linux-kernel, peterz, pjt, chris.j.arges, gregkh

Roman Gushchin <klamm@yandex-team.ru> writes:

> tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to
> force the period timer restart. It's not safe, because
> can lead to deadlock, described in commit 927b54fccbf0:
> "__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
> waiting for the hrtimer to finish. However, if sched_cfs_period_timer
> runs for another loop iteration, the hrtimer can attempt to take
> rq->lock, resulting in deadlock."
> Three CPUs must be involved:
> CPU0               CPU1                         CPU2
> take rq->lock      period timer fired
> ...                take cfs_b lock
> ...                ...                          tg_set_cfs_bandwidth()
> throttle_cfs_rq()  release cfs_b lock           take cfs_b lock
> ...                distribute_cfs_runtime()     timer_active = 0
> take cfs_b->lock   wait for rq->lock            ...
> __start_cfs_bandwidth()
> {wait for timer callback
>  break if timer_active == 1}
>
> So, CPU0 and CPU1 are deadlocked.
>
> Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
> wait for period timer callbacks (ignoring cfs_b->timer_active) and
> restart the timer explicitly.
>
> Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: pjt@google.com
> Cc: Chris J Arges <chris.j.arges@canonical.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Ben Segall <bsegall@google.com>

> ---
>  kernel/sched/core.c  |    3 +--
>  kernel/sched/fair.c  |    8 ++++----
>  kernel/sched/sched.h |    2 +-
>  3 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b4308d7..e122774 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7826,8 +7826,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>  	/* restart the period timer (if active) to handle new period expiry */
>  	if (runtime_enabled && cfs_b->timer_active) {
>  		/* force a reprogram */
> -		cfs_b->timer_active = 0;
> -		__start_cfs_bandwidth(cfs_b);
> +		__start_cfs_bandwidth(cfs_b, true);
>  	}
>  	raw_spin_unlock_irq(&cfs_b->lock);
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c7ab8ea..54939f8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2117,7 +2117,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  		 */
>  		if (!cfs_b->timer_active) {
>  			__refill_cfs_bandwidth_runtime(cfs_b);
> -			__start_cfs_bandwidth(cfs_b);
> +			__start_cfs_bandwidth(cfs_b, false);
>  		}
>  
>  		if (cfs_b->runtime > 0) {
> @@ -2298,7 +2298,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>  	raw_spin_lock(&cfs_b->lock);
>  	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
>  	if (!cfs_b->timer_active)
> -		__start_cfs_bandwidth(cfs_b);
> +		__start_cfs_bandwidth(cfs_b, false);
>  	raw_spin_unlock(&cfs_b->lock);
>  }
>  
> @@ -2681,7 +2681,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  }
>  
>  /* requires cfs_b->lock, may release to reprogram timer */
> -void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
>  {
>  	/*
>  	 * The timer may be active because we're trying to set a new bandwidth
> @@ -2696,7 +2696,7 @@ void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  		cpu_relax();
>  		raw_spin_lock(&cfs_b->lock);
>  		/* if someone else restarted the timer then we're done */
> -		if (cfs_b->timer_active)
> +		if (!force && cfs_b->timer_active)
>  			return;
>  	}
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index dfa31d5..bc150a6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -209,7 +209,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
>  extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
>  
>  extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
> -extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
> +extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force);
>  extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
>  
>  extern void free_rt_sched_group(struct task_group *tg);


* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-19 17:30       ` bsegall
@ 2014-05-20 13:15         ` Peter Zijlstra
  2014-05-20 14:01           ` Peter Zijlstra
                             ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-05-20 13:15 UTC (permalink / raw)
  To: bsegall
  Cc: Roman Gushchin, linux-kernel, pjt, chris.j.arges, gregkh,
	Thomas Gleixner

On Mon, May 19, 2014 at 10:30:05AM -0700, bsegall@google.com wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Fri, May 16, 2014 at 12:38:21PM +0400, Roman Gushchin wrote:
> >
> >> I still think, there is a deadlock. I'll try to explain.
> >> Three CPUs must be involved:
> >> CPU0				CPU1				CPU2
> >> take rq->lock			period timer fired		
> >> ...				take cfs_b lock
> >> ...				...				tg_set_cfs_bandwidth()
> >> throttle_cfs_rq()		release cfs_b lock		take cfs_b lock
> >> ...				distribute_cfs_runtime()	timer_active = 0
> >> take cfs_b->lock		wait for rq->lock		...
> >> __start_cfs_bandwidth()	
> >> {wait for timer callback
> >>  break if timer_active == 1}
> >> 
> >> So, CPU0 and CPU1 are deadlocked.
> >
> > OK, so can someone explain this ->timer_active thing? esp. what's the
> > 'obvious' difference with hrtimer_active()?
> 
> timer_active is only changed or checked under cfs_b lock, so that we can
> be certain that the timer is active whenever runtime isn't full. We
> can't use hrtimer_active because the timer might have unlocked the cfs_b
> on another cpu and be about to return HRTIMER_NORESTART.

The more I look at that code the more I'm convinced it's crack; that
entire __start_cfs_bandwidth() thing is brain melting. We don't need to
cancel a timer before starting it, *hrtimer_start*() will happily remove
the timer for you if it's still enqueued.

Removing that, removes a big part of the problem, no more ugly cancel
loop to get stuck in.

So now, if I understood you right, the entire reason you have this
cfs_b->lock guarded ->timer_active nonsense is to make sure we don't
accidentally lose the timer.

It appears to me that it should be possible to guarantee the same by
unconditionally (re)starting the timer when !queued. Because regardless
of what hrtimer::function will return, if we beat it to (re)enqueue the
timer, it doesn't matter.

Which leads us to what I think is a BUG in the current hrtimer code (and
one wonders why we never hit that): because we drop the cpu_base->lock
over calling hrtimer::function, hrtimer_start_range_ns() can in fact
come in and (re)enqueue the timer; if hrtimer::function then returns
HRTIMER_RESTART, we'll hit that BUG_ON() before trying to enqueue the
timer once more.

So I _think_ something like the below would actually solve all the
problems, no?

Completely untested... does anybody have a 'useful' test case for this
bw stuff?

---
 kernel/hrtimer.c     |  9 ++++++---
 kernel/sched/core.c  | 10 ++++++----
 kernel/sched/fair.c  | 42 +++---------------------------------------
 kernel/sched/sched.h |  2 +-
 4 files changed, 16 insertions(+), 47 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 3ab28993f6e0..28942c65635e 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1273,11 +1273,14 @@ static void __run_hrtimer(struct hrtimer *timer, ktime_t *now)
 	 * Note: We clear the CALLBACK bit after enqueue_hrtimer and
 	 * we do not reprogramm the event hardware. Happens either in
 	 * hrtimer_start_range_ns() or in hrtimer_interrupt()
+	 *
+	 * Note: Because we dropped the cpu_base->lock above,
+	 * hrtimer_start_range_ns() can have popped in and enqueued the timer
+	 * for us already.
 	 */
-	if (restart != HRTIMER_NORESTART) {
-		BUG_ON(timer->state != HRTIMER_STATE_CALLBACK);
+	if (restart != HRTIMER_NORESTART &&
+	    !(timer->state & HRTIMER_STATE_ENQUEUED))
 		enqueue_hrtimer(timer, base);
-	}
 
 	WARN_ON_ONCE(!(timer->state & HRTIMER_STATE_CALLBACK));
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ea7b3f1a087..829a158948a6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -106,17 +106,17 @@ void __smp_mb__after_atomic(void)
 EXPORT_SYMBOL(__smp_mb__after_atomic);
 #endif
 
-void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
+int start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 {
 	unsigned long delta;
 	ktime_t soft, hard, now;
+	int overruns = 0;
 
 	for (;;) {
-		if (hrtimer_active(period_timer))
+		if (hrtimer_is_queued(period_timer))
 			break;
 
-		now = hrtimer_cb_get_time(period_timer);
-		hrtimer_forward(period_timer, now, period);
+		overruns += hrtimer_forward_now(period_timer, period);
 
 		soft = hrtimer_get_softexpires(period_timer);
 		hard = hrtimer_get_expires(period_timer);
@@ -124,6 +124,8 @@ void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 		__hrtimer_start_range_ns(period_timer, soft, delta,
 					 HRTIMER_MODE_ABS_PINNED, 0);
 	}
+
+	return overruns;
 }
 
 DEFINE_MUTEX(sched_domains_mutex);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 28ccf502c63c..27dfb59f6aaf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3146,15 +3146,13 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		amount = min_amount;
 	else {
 		/*
-		 * If the bandwidth pool has become inactive, then at least one
+		 * If we had overruns starting the timer, then at least one
 		 * period must have elapsed since the last consumption.
 		 * Refresh the global state and ensure bandwidth timer becomes
 		 * active.
 		 */
-		if (!cfs_b->timer_active) {
+		if (start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period))
 			__refill_cfs_bandwidth_runtime(cfs_b);
-			__start_cfs_bandwidth(cfs_b);
-		}
 
 		if (cfs_b->runtime > 0) {
 			amount = min(cfs_b->runtime, min_amount);
@@ -3331,8 +3329,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	cfs_rq->throttled_clock = rq_clock(rq);
 	raw_spin_lock(&cfs_b->lock);
 	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
-	if (!cfs_b->timer_active)
-		__start_cfs_bandwidth(cfs_b);
+	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
 	raw_spin_unlock(&cfs_b->lock);
 }
 
@@ -3446,13 +3443,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	if (idle)
 		goto out_unlock;
 
-	/*
-	 * if we have relooped after returning idle once, we need to update our
-	 * status as actually running, so that other cpus doing
-	 * __start_cfs_bandwidth will stop trying to cancel us.
-	 */
-	cfs_b->timer_active = 1;
-
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
 	if (!throttled) {
@@ -3499,8 +3489,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	 */
 	cfs_b->idle = 0;
 out_unlock:
-	if (idle)
-		cfs_b->timer_active = 0;
 	raw_spin_unlock(&cfs_b->lock);
 
 	return idle;
@@ -3713,30 +3701,6 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
-/* requires cfs_b->lock, may release to reprogram timer */
-void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
-{
-	/*
-	 * The timer may be active because we're trying to set a new bandwidth
-	 * period or because we're racing with the tear-down path
-	 * (timer_active==0 becomes visible before the hrtimer call-back
-	 * terminates).  In either case we ensure that it's re-programmed
-	 */
-	while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
-	       hrtimer_try_to_cancel(&cfs_b->period_timer) < 0) {
-		/* bounce the lock to allow do_sched_cfs_period_timer to run */
-		raw_spin_unlock(&cfs_b->lock);
-		cpu_relax();
-		raw_spin_lock(&cfs_b->lock);
-		/* if someone else restarted the timer then we're done */
-		if (cfs_b->timer_active)
-			return;
-	}
-
-	cfs_b->timer_active = 1;
-	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
-}
-
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	hrtimer_cancel(&cfs_b->period_timer);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b2cbe81308af..3214eb78fd91 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -187,7 +187,7 @@ struct cfs_bandwidth {
 	s64 hierarchal_quota;
 	u64 runtime_expires;
 
-	int idle, timer_active;
+	int idle;
 	struct hrtimer period_timer, slack_timer;
 	struct list_head throttled_cfs_rq;
 

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-20 13:15         ` Peter Zijlstra
@ 2014-05-20 14:01           ` Peter Zijlstra
  2014-05-20 14:55             ` Peter Zijlstra
  2014-05-20 14:21           ` Roman Gushchin
  2014-05-20 19:08           ` bsegall
  2 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-05-20 14:01 UTC (permalink / raw)
  To: bsegall
  Cc: Roman Gushchin, linux-kernel, pjt, chris.j.arges, gregkh,
	Thomas Gleixner

On Tue, May 20, 2014 at 03:15:26PM +0200, Peter Zijlstra wrote:
> Which leads us to what I think is a BUG in the current hrtimer code (and
> one wonders why we never hit that), because we drop the cpu_base->lock
> over calling hrtimer::function, hrtimer_start_range_ns() can in fact
> come in and (re)enqueue the timer, if hrtimer::function then returns
> HRTIMER_RESTART, we'll hit that BUG_ON() before trying to enqueue the
> timer once more.

> ---
>  kernel/hrtimer.c     |  9 ++++++---
>  kernel/sched/core.c  | 10 ++++++----
>  kernel/sched/fair.c  | 42 +++---------------------------------------
>  kernel/sched/sched.h |  2 +-
>  4 files changed, 16 insertions(+), 47 deletions(-)
> 
> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index 3ab28993f6e0..28942c65635e 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -1273,11 +1273,14 @@ static void __run_hrtimer(struct hrtimer *timer, ktime_t *now)
>  	 * Note: We clear the CALLBACK bit after enqueue_hrtimer and
>  	 * we do not reprogramm the event hardware. Happens either in
>  	 * hrtimer_start_range_ns() or in hrtimer_interrupt()
> +	 *
> +	 * Note: Because we dropped the cpu_base->lock above,
> +	 * hrtimer_start_range_ns() can have popped in and enqueued the timer
> +	 * for us already.
>  	 */
> -	if (restart != HRTIMER_NORESTART) {
> -		BUG_ON(timer->state != HRTIMER_STATE_CALLBACK);
> +	if (restart != HRTIMER_NORESTART &&
> +	    !(timer->state & HRTIMER_STATE_ENQUEUED))
>  		enqueue_hrtimer(timer, base);
> -	}
>  
>  	WARN_ON_ONCE(!(timer->state & HRTIMER_STATE_CALLBACK));
>  

Hmm, doesn't this also mean it's entirely unsafe to call
hrtimer_forward*() from the timer callback, because it might be changing
the time of an already enqueued timer, which would corrupt the rb-tree
order?

Lemme go find a nice way out of this mess, I think I'm responsible for
creating it in the first place :-(

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-20 13:15         ` Peter Zijlstra
  2014-05-20 14:01           ` Peter Zijlstra
@ 2014-05-20 14:21           ` Roman Gushchin
  2014-05-20 14:28             ` Peter Zijlstra
  2014-05-20 19:08           ` bsegall
  2 siblings, 1 reply; 19+ messages in thread
From: Roman Gushchin @ 2014-05-20 14:21 UTC (permalink / raw)
  To: Peter Zijlstra, bsegall
  Cc: linux-kernel, pjt, chris.j.arges, gregkh, Thomas Gleixner

20.05.2014, 17:15, "Peter Zijlstra" <peterz@infradead.org>:
> Completely untested... does anybody have a 'useful' test case for this
> bw stuff?

I can test it for stability.
I'll start tomorrow; right now I'm testing Ben's and my patches together.

Thanks,
Roman

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-20 14:21           ` Roman Gushchin
@ 2014-05-20 14:28             ` Peter Zijlstra
  2014-05-20 14:34               ` Roman Gushchin
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-05-20 14:28 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bsegall, linux-kernel, pjt, chris.j.arges, gregkh, Thomas Gleixner

On Tue, May 20, 2014 at 06:21:15PM +0400, Roman Gushchin wrote:
> 20.05.2014, 17:15, "Peter Zijlstra" <peterz@infradead.org>:
> > Completely untested... does anybody have a 'useful' test case for this
> > bw stuff?
> 
> I can test it for stability.
> I'll start tomorrow, now I'm testing Ben's + my patches together.

Do you have an easy-to-reproduce setup and workload so I might be able
to test as well? Or are you simply dumping it in 'production' to see
what happens?

I mean, the trivial setup with a few cgroups and spinners isn't that
interesting because it'll just keep the timer ticking and never go idle.

I suppose I could try writing a sleep/spinner which randomly exceeds bw
constraints.
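
A minimal sketch of such a spinner/sleeper, assuming it gets run inside a
cpu cgroup that already has cfs_period_us/cfs_quota_us configured; the
cgroup paths, quota values and spin/sleep ranges in the comments are
made-up placeholders:

/*
 * Alternate busy-spinning and sleeping so the task randomly exceeds its
 * bandwidth in some periods and lets the group go idle in others.
 * Assumed setup (cgroup v1, values made up):
 *   mkdir /sys/fs/cgroup/cpu/bwtest
 *   echo 100000 > /sys/fs/cgroup/cpu/bwtest/cpu.cfs_period_us
 *   echo  20000 > /sys/fs/cgroup/cpu/bwtest/cpu.cfs_quota_us
 *   echo $$ > /sys/fs/cgroup/cpu/bwtest/tasks
 */
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static void spin_ms(long ms)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000L +
		 (now.tv_nsec - start.tv_nsec) / 1000000L < ms);
}

int main(void)
{
	srand(getpid());

	for (;;) {
		spin_ms(rand() % 200);		/* sometimes longer than the quota */
		usleep((rand() % 500) * 1000);	/* sometimes long enough to go idle */
	}
	return 0;
}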

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-20 14:28             ` Peter Zijlstra
@ 2014-05-20 14:34               ` Roman Gushchin
  0 siblings, 0 replies; 19+ messages in thread
From: Roman Gushchin @ 2014-05-20 14:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: bsegall, linux-kernel, pjt, chris.j.arges, gregkh, Thomas Gleixner

> Do you have an easy to reproduce setup and workload so I might be able
> to test as well? Or are you simply dumping it in 'production' to see
> what happens?

No. I have only a "production" setup, where I can nicely reproduce all these lockups.
It takes ~1 day. I also tried to reproduce them locally but without any success.

Thanks,
Roman

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-20 14:01           ` Peter Zijlstra
@ 2014-05-20 14:55             ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-05-20 14:55 UTC (permalink / raw)
  To: bsegall
  Cc: Roman Gushchin, linux-kernel, pjt, chris.j.arges, gregkh,
	Thomas Gleixner

On Tue, May 20, 2014 at 04:01:55PM +0200, Peter Zijlstra wrote:

> Hmm,. doesn't this also mean its entirely unsafe to call
> hrtimer_forward*() from the timer callback, because it might be changing
> the time of an already enqueued timer, which would corrupt the rb-tree
> order.
> 
> Lemme go find a nice way out of this mess, I think I'm responsible for
> creating it in the first place :-(

Ah, so since the hrtimer doesn't provide any serialization of its own,
per design, we should serialize the access to it ourselves, which is
easily done.

Something like so would do, I suppose...

Here we make sure to hold the bandwidth lock (be it cfs_b->lock or
rt_b->rt_runtime_lock) both around starting the timer and in the handler
while modifying its expiration date.

---
 kernel/hrtimer.c     |  9 +++++---
 kernel/sched/core.c  | 12 ++++++----
 kernel/sched/fair.c  | 65 +++++++++++-----------------------------------------
 kernel/sched/rt.c    | 14 +++++------
 kernel/sched/sched.h |  4 ++--
 5 files changed, 34 insertions(+), 70 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 3ab28993f6e0..28942c65635e 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1273,11 +1273,14 @@ static void __run_hrtimer(struct hrtimer *timer, ktime_t *now)
 	 * Note: We clear the CALLBACK bit after enqueue_hrtimer and
 	 * we do not reprogramm the event hardware. Happens either in
 	 * hrtimer_start_range_ns() or in hrtimer_interrupt()
+	 *
+	 * Note: Because we dropped the cpu_base->lock above,
+	 * hrtimer_start_range_ns() can have popped in and enqueued the timer
+	 * for us already.
 	 */
-	if (restart != HRTIMER_NORESTART) {
-		BUG_ON(timer->state != HRTIMER_STATE_CALLBACK);
+	if (restart != HRTIMER_NORESTART &&
+	    !(timer->state & HRTIMER_STATE_ENQUEUED))
 		enqueue_hrtimer(timer, base);
-	}
 
 	WARN_ON_ONCE(!(timer->state & HRTIMER_STATE_CALLBACK));
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ea7b3f1a087..1f4602993d37 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -106,17 +106,17 @@ void __smp_mb__after_atomic(void)
 EXPORT_SYMBOL(__smp_mb__after_atomic);
 #endif
 
-void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
+int start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 {
 	unsigned long delta;
-	ktime_t soft, hard, now;
+	ktime_t soft, hard;
+	int overruns = 0;
 
 	for (;;) {
-		if (hrtimer_active(period_timer))
+		if (hrtimer_is_queued(period_timer))
 			break;
 
-		now = hrtimer_cb_get_time(period_timer);
-		hrtimer_forward(period_timer, now, period);
+		overruns += hrtimer_forward_now(period_timer, period);
 
 		soft = hrtimer_get_softexpires(period_timer);
 		hard = hrtimer_get_expires(period_timer);
@@ -124,6 +124,8 @@ void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 		__hrtimer_start_range_ns(period_timer, soft, delta,
 					 HRTIMER_MODE_ABS_PINNED, 0);
 	}
+
+	return overruns;
 }
 
 DEFINE_MUTEX(sched_domains_mutex);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 28ccf502c63c..11152b621a31 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3146,15 +3146,13 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		amount = min_amount;
 	else {
 		/*
-		 * If the bandwidth pool has become inactive, then at least one
+		 * If we had overruns starting the timer, then at least one
 		 * period must have elapsed since the last consumption.
 		 * Refresh the global state and ensure bandwidth timer becomes
 		 * active.
 		 */
-		if (!cfs_b->timer_active) {
+		if (start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period))
 			__refill_cfs_bandwidth_runtime(cfs_b);
-			__start_cfs_bandwidth(cfs_b);
-		}
 
 		if (cfs_b->runtime > 0) {
 			amount = min(cfs_b->runtime, min_amount);
@@ -3331,8 +3329,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	cfs_rq->throttled_clock = rq_clock(rq);
 	raw_spin_lock(&cfs_b->lock);
 	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
-	if (!cfs_b->timer_active)
-		__start_cfs_bandwidth(cfs_b);
+	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
 	raw_spin_unlock(&cfs_b->lock);
 }
 
@@ -3430,12 +3427,13 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
 	u64 runtime, runtime_expires;
-	int idle = 1, throttled;
+	bool idle, throttled;
+
+	lockdep_assert_held(&cfs_b->lock);
 
-	raw_spin_lock(&cfs_b->lock);
 	/* no need to continue the timer with no bandwidth constraint */
 	if (cfs_b->quota == RUNTIME_INF)
-		goto out_unlock;
+		return 1;
 
 	throttled = !list_empty(&cfs_b->throttled_cfs_rq);
 	/* idle depends on !throttled (for the case of a large deficit) */
@@ -3444,21 +3442,14 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 
 	/* if we're going inactive then everything else can be deferred */
 	if (idle)
-		goto out_unlock;
-
-	/*
-	 * if we have relooped after returning idle once, we need to update our
-	 * status as actually running, so that other cpus doing
-	 * __start_cfs_bandwidth will stop trying to cancel us.
-	 */
-	cfs_b->timer_active = 1;
+		return 1;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
 	if (!throttled) {
 		/* mark as potentially idle for the upcoming period */
 		cfs_b->idle = 1;
-		goto out_unlock;
+		return 0;
 	}
 
 	/* account preceding periods in which throttling occurred */
@@ -3498,12 +3489,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	 * timer to remain active while there are any throttled entities.)
 	 */
 	cfs_b->idle = 0;
-out_unlock:
-	if (idle)
-		cfs_b->timer_active = 0;
-	raw_spin_unlock(&cfs_b->lock);
-
-	return idle;
+	return 0;
 }
 
 /* a cfs_rq won't donate quota below this amount */
@@ -3676,19 +3662,18 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
 	struct cfs_bandwidth *cfs_b =
 		container_of(timer, struct cfs_bandwidth, period_timer);
-	ktime_t now;
 	int overrun;
 	int idle = 0;
 
+	raw_spin_lock(&cfs_b->lock);
 	for (;;) {
-		now = hrtimer_cb_get_time(timer);
-		overrun = hrtimer_forward(timer, now, cfs_b->period);
-
+		overrun = hrtimer_forward_now(timer, cfs_b->period);
 		if (!overrun)
 			break;
 
 		idle = do_sched_cfs_period_timer(cfs_b, overrun);
 	}
+	raw_spin_unlock(&cfs_b->lock);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
 }
@@ -3713,30 +3698,6 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
-/* requires cfs_b->lock, may release to reprogram timer */
-void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
-{
-	/*
-	 * The timer may be active because we're trying to set a new bandwidth
-	 * period or because we're racing with the tear-down path
-	 * (timer_active==0 becomes visible before the hrtimer call-back
-	 * terminates).  In either case we ensure that it's re-programmed
-	 */
-	while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
-	       hrtimer_try_to_cancel(&cfs_b->period_timer) < 0) {
-		/* bounce the lock to allow do_sched_cfs_period_timer to run */
-		raw_spin_unlock(&cfs_b->lock);
-		cpu_relax();
-		raw_spin_lock(&cfs_b->lock);
-		/* if someone else restarted the timer then we're done */
-		if (cfs_b->timer_active)
-			return;
-	}
-
-	cfs_b->timer_active = 1;
-	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
-}
-
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	hrtimer_cancel(&cfs_b->period_timer);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7795e292f4c9..60bfb8caa27a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -17,19 +17,20 @@ static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
 {
 	struct rt_bandwidth *rt_b =
 		container_of(timer, struct rt_bandwidth, rt_period_timer);
-	ktime_t now;
-	int overrun;
 	int idle = 0;
+	int overrun;
 
+	raw_spin_lock(&rt_b->rt_runtime_lock);
 	for (;;) {
-		now = hrtimer_cb_get_time(timer);
-		overrun = hrtimer_forward(timer, now, rt_b->rt_period);
-
+		overrun = hrtimer_forward_now(timer, rt_b->rt_period);
 		if (!overrun)
 			break;
 
+		raw_spin_unlock(&rt_b->rt_runtime_lock);
 		idle = do_sched_rt_period_timer(rt_b, overrun);
+		raw_spin_lock(&rt_b->rt_runtime_lock);
 	}
+	raw_spin_unlock(&rt_b->rt_runtime_lock);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
 }
@@ -51,9 +52,6 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return;
 
-	if (hrtimer_active(&rt_b->rt_period_timer))
-		return;
-
 	raw_spin_lock(&rt_b->rt_runtime_lock);
 	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b2cbe81308af..0937e341e079 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -187,7 +187,7 @@ struct cfs_bandwidth {
 	s64 hierarchal_quota;
 	u64 runtime_expires;
 
-	int idle, timer_active;
+	int idle;
 	struct hrtimer period_timer, slack_timer;
 	struct list_head throttled_cfs_rq;
 
@@ -1288,7 +1288,7 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
 static inline void sched_avg_update(struct rq *rq) { }
 #endif
 
-extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period);
+extern int start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period);
 
 #ifdef CONFIG_SMP
 #ifdef CONFIG_PREEMPT

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-20 13:15         ` Peter Zijlstra
  2014-05-20 14:01           ` Peter Zijlstra
  2014-05-20 14:21           ` Roman Gushchin
@ 2014-05-20 19:08           ` bsegall
  2014-05-21  7:58             ` Peter Zijlstra
  2 siblings, 1 reply; 19+ messages in thread
From: bsegall @ 2014-05-20 19:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Roman Gushchin, linux-kernel, pjt, chris.j.arges, gregkh,
	Thomas Gleixner

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, May 19, 2014 at 10:30:05AM -0700, bsegall@google.com wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> 
>> > On Fri, May 16, 2014 at 12:38:21PM +0400, Roman Gushchin wrote:
>> >
>> >> I still think, there is a deadlock. I'll try to explain.
>> >> Three CPUs must be involved:
>> >> CPU0				CPU1				CPU2
>> >> take rq->lock			period timer fired		
>> >> ...				take cfs_b lock
>> >> ...				...				tg_set_cfs_bandwidth()
>> >> throttle_cfs_rq()		release cfs_b lock		take cfs_b lock
>> >> ...				distribute_cfs_runtime()	timer_active = 0
>> >> take cfs_b->lock		wait for rq->lock		...
>> >> __start_cfs_bandwidth()	
>> >> {wait for timer callback
>> >>  break if timer_active == 1}
>> >> 
>> >> So, CPU0 and CPU1 are deadlocked.
>> >
>> > OK, so can someone explain this ->timer_active thing? esp. what's the
>> > 'obvious' difference with hrtimer_active()?
>> 
>> timer_active is only changed or checked under cfs_b lock, so that we can
>> be certain that the timer is active whenever runtime isn't full. We
>> can't use hrtimer_active because the timer might have unlocked the cfs_b
>> on another cpu and be about to return HRTIMER_NORESTART.
>
> The more I look at that code the more I'm convinced its crack, that
> entire __start_cfs_bandwidth() thing is brain melting, we don't need to
> cancel a timer before starting it, *hrtimer_start*() will happily remove
> the timer for you if its still enqueued.
>
> Removing that, removes a big part of the problem, no more ugly cancel
> loop to get stuck in.
>
> So now, if I understood you right, the entire reason you have this
> cfs_b->lock guarded ->timer_active nonsense is to make sure we don't
> accidentally loose the timer.
>
> It appears to me that it should be possible to guarantee that same by
> unconditionally (re)starting the timer when !queued. Because regardless
> what hrtimer::function will return, if we beat it to (re)enqueue the
> timer, it doesn't matter.
>
> Which leads us to what I think is a BUG in the current hrtimer code (and
> one wonders why we never hit that), because we drop the cpu_base->lock
> over calling hrtimer::function, hrtimer_start_range_ns() can in fact
> come in and (re)enqueue the timer, if hrtimer::function then returns
> HRTIMER_RESTART, we'll hit that BUG_ON() before trying to enqueue the
> timer once more.

Yeah, I believe the whole timer_active/cancel-before-start was to not
lose the timer while still avoiding the HRTIMER_RESTART/hrtimer_start
race. I'm not sure what, if anything, all the other ~35 users of
HRTIMER_RESTART do about this. At least some do hrtimer_cancel();
hrtimer_start(); though.

>
> So I _think_ something like the below would actually solve all the
> problems, no?
>
> Completely untested... does anybody have a 'useful' test case for this
> bw stuff?

[using the updated patch]:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4ea7b3f1a087..1f4602993d37 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -106,17 +106,17 @@ void __smp_mb__after_atomic(void)
>  EXPORT_SYMBOL(__smp_mb__after_atomic);
>  #endif
>  
> -void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
> +int start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
>  {
>  	unsigned long delta;
> -	ktime_t soft, hard, now;
> +	ktime_t soft, hard;
> +	int overruns = 0;
>  
>  	for (;;) {
> -		if (hrtimer_active(period_timer))
> +		if (hrtimer_is_queued(period_timer))
>  			break;
[...]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 28ccf502c63c..11152b621a31 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3146,15 +3146,13 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  		amount = min_amount;
>  	else {
>  		/*
> -		 * If the bandwidth pool has become inactive, then at least one
> +		 * If we had overruns starting the timer, then at least one
>  		 * period must have elapsed since the last consumption.
>  		 * Refresh the global state and ensure bandwidth timer becomes
>  		 * active.
>  		 */
> -		if (!cfs_b->timer_active) {
> +		if (start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period))
>  			__refill_cfs_bandwidth_runtime(cfs_b);
> -			__start_cfs_bandwidth(cfs_b);
> -		}


So, this and the other changes will ensure that the period timer is always
queued. However, this also means that if we race so that we take
cfs_b->lock before the period timer does, but late enough that the interrupt
has fired and we see !queued, we will hrtimer_forward the entire period
away here. What assign_cfs_rq_runtime() does is insufficient - it doesn't
unthrottle any other cfs_rqs, and since start_bandwidth_timer()
hrtimer_forwarded the period away, neither will
(do_)sched_cfs_period_timer(). (In addition, throttle_cfs_rq() doesn't even
do that, but that's probably easy enough to fix.)

I suppose we could check hrtimer_active() and, if so, record any periods
that start_bandwidth_timer() forwarded, so that sched_cfs_period_timer()
could then force a do_sched_cfs_period_timer() call. Perhaps something like
this (equally untested) patch on top of your patch.

One other, possibly minor, thing is that we are now potentially calling
hrtimer_start* more often, which is more expensive than just checking
timer_active. Since that only happens when !queued, though, it seems just
about as good (and it's limited to roughly once per ms per cpu at that).

Anyway, patch:


---

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f4f9857..122cdec 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7779,11 +7779,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
        __refill_cfs_bandwidth_runtime(cfs_b);
        /* restart the period timer (if active) to handle new period expiry */
-       if (runtime_enabled && cfs_b->timer_active) {
-               /* force a reprogram */
-               cfs_b->timer_active = 0;
+       if (runtime_enabled)
                __start_cfs_bandwidth(cfs_b);
-       }
        raw_spin_unlock_irq(&cfs_b->lock);
 
        for_each_possible_cpu(i) {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd528dc..eb80b7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3177,7 +3177,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
                 * Refresh the global state and ensure bandwidth timer becomes
                 * active.
                 */
-               if (start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period))
+               if (__start_cfs_bandwidth(cfs_b))
                        __refill_cfs_bandwidth_runtime(cfs_b);
 
                if (cfs_b->runtime > 0) {
@@ -3355,7 +3355,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
        cfs_rq->throttled_clock = rq_clock(rq);
        raw_spin_lock(&cfs_b->lock);
        list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
-       start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+       __start_cfs_bandwidth(cfs_b);
        raw_spin_unlock(&cfs_b->lock);
 }
 
@@ -3718,6 +3718,22 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
        cfs_b->slack_timer.function = sched_cfs_slack_timer;
 }
 
+int __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+       int was_active = hrtimer_active(&cfs_b->period_timer);
+       int overrun = start_bandwidth_timer(&cfs_b->period_timer,
+                                           cfs_b->period);
+
+       /*
+        * If our bandwidth was completely idle, we don't care about any periods
+        * during that idle time. Otherwise we may need to unthrottle on other
+        * cpus.
+        */
+       if (was_active)
+               cfs_b->stored_periods += overrun;
+       return overrun;
+}
+
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
        cfs_rq->runtime_enabled = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 66b84f3..a3a4c35 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -187,7 +187,7 @@ struct cfs_bandwidth {
        s64 hierarchal_quota;
        u64 runtime_expires;
 
-       int idle;
+       int idle, stored_periods;
        struct hrtimer period_timer, slack_timer;
        struct list_head throttled_cfs_rq;
 

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock
  2014-05-20 19:08           ` bsegall
@ 2014-05-21  7:58             ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-05-21  7:58 UTC (permalink / raw)
  To: bsegall
  Cc: Roman Gushchin, linux-kernel, pjt, chris.j.arges, gregkh,
	Thomas Gleixner

On Tue, May 20, 2014 at 12:08:34PM -0700, bsegall@google.com wrote:
> Yeah, I believe the whole timer_active/cancel-before-start was to not
> lose the timer while still avoiding the HRTIMER_RESTART/hrtimer_start
> race. I'm not sure what, if anything, all the other ~35 users of
> HRTIMER_RESTART do about this. At least some do hrtimer_cancel();
> hrtimer_start(); though.

Yeah, I've not yet had the courage to go look through the tree :-/

That said, I've got a growing suspicion that I'm missing something
obvious and simple here.

> So, this and the other changes will ensure that a period timer is always
> queued. However, this also means that if we race so that we take
> cfs_b->lock before the period timer, but late enough that the interrupt
> has fired and we see !queued, we will hrtimer_forward the entire period
> here.

Right you are, missed that.

> What assign_cfs_rq_runtime does is insufficient - it doesn't
> unthrottle any other cfs_rqs, and since start_bandwidth_timer
> hrtimer_forwarded the period away, neither will
> (do_)sched_cfs_period_timer. (In addition throttle doesn't even do that,
> but that's probably easy enough to fix)

So I had a closer look at the bandwidth bits because of this, and it
strikes me as odd that __refill_cfs_bandwidth_runtime() does an
unconditional fill of cfs_b->runtime.

What I would have expected to see is something like:

	cfs_b->runtime += overruns * cfs_b->quota;
	if (cfs_b->runtime > cfs_b->quota)
		cfs_b->runtime = cfs_b->quota;

Which refills the runtime based on the overruns that happened and
respects the negative runtime accrued from previous inaccuracies.

If you unconditionally top up, as currently happens, it's possible to
consistently use more time than specified, due to the 1-jiffy throttle
granularity.

> I suppose we could check hrtimer_active and if so record any periods
> start_bandwidth_timer used for sched_cfs_period_timer, which could then
> force a do_sched_cfs_period_timer call. Perhaps something like this
> (equally untested) patch on top of your patch.

You left the new ->stored_periods unused but the idea was clear enough.

Anyway, given that you don't actually use overruns for anything except
pointless stats, we could cheat and assume any trigger of the handler is
due to at least 1 overrun; that should trigger the unthrottle code,
methinks.
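
A rough sketch of that cheat in sched_cfs_period_timer(), equally untested
and only one way to read the idea (the repeated-overrun loop is elided for
brevity):

	raw_spin_lock(&cfs_b->lock);
	overrun = hrtimer_forward_now(timer, cfs_b->period);
	/*
	 * The handler only fires once a period boundary has passed, so even
	 * if a racing start_bandwidth_timer() already forwarded the timer
	 * and ate the overrun count, treat it as at least one elapsed period.
	 */
	idle = do_sched_cfs_period_timer(cfs_b, overrun ?: 1);
	raw_spin_unlock(&cfs_b->lock);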

In any case, I think your tracking overruns like that does make sense,
and its something the rt.c code needs too, as it suffers from the same
problem, so maybe we can roll that into the generic
start_bandwidth_timer() thing.

/me goes see if he can track down his suspicion and find shiny simple
truth at the end of it.

^ permalink raw reply	[flat|nested] 19+ messages in thread

Thread overview: 19+ messages
2014-05-15 16:58 [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock Roman Gushchin
2014-05-15 17:43 ` bsegall
2014-05-16  8:38   ` Roman Gushchin
2014-05-16 17:57     ` bsegall
2014-05-19 11:10       ` Roman Gushchin
2014-05-19 17:40         ` bsegall
2014-05-19 10:32     ` Peter Zijlstra
2014-05-19 11:37       ` Roman Gushchin
2014-05-19 17:35         ` bsegall
2014-05-19 17:30       ` bsegall
2014-05-20 13:15         ` Peter Zijlstra
2014-05-20 14:01           ` Peter Zijlstra
2014-05-20 14:55             ` Peter Zijlstra
2014-05-20 14:21           ` Roman Gushchin
2014-05-20 14:28             ` Peter Zijlstra
2014-05-20 14:34               ` Roman Gushchin
2014-05-20 19:08           ` bsegall
2014-05-21  7:58             ` Peter Zijlstra
2014-05-19 10:40   ` Peter Zijlstra
