LKML Archive on lore.kernel.org
* [PATCH] sched/fair: initialize throttle_count for new task-groups lazily
@ 2016-06-16 12:57 Konstantin Khlebnikov
  2016-06-16 17:03 ` bsegall
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Konstantin Khlebnikov @ 2016-06-16 12:57 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, linux-kernel; +Cc: stable

A cgroup created inside a throttled group must inherit the current
throttle_count. A broken throttle_count allows throttled entries to be
nominated as the next buddy, which later leads to a NULL pointer
dereference in pick_next_task_fair().

This patch initializes cfs_rq->throttle_count at the first enqueue:
laziness allows us to skip locking every rq at group creation. The lazy
approach also allows skipping a full sub-tree scan when throttling the
hierarchy (not in this patch).

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Stable <stable@vger.kernel.org> # v3.2+
---
 kernel/sched/fair.c  |   19 +++++++++++++++++++
 kernel/sched/sched.h |    2 +-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 218f8e83db73..fe809fe169d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4185,6 +4185,25 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
 	if (!cfs_bandwidth_used())
 		return;
 
+	/* synchronize hierarchical throttle counter */
+	if (unlikely(!cfs_rq->throttle_uptodate)) {
+		struct rq *rq = rq_of(cfs_rq);
+		struct cfs_rq *pcfs_rq;
+		struct task_group *tg;
+
+		cfs_rq->throttle_uptodate = 1;
+		/* get closest uptodate node because leaves go first */
+		for (tg = cfs_rq->tg->parent; tg; tg = tg->parent) {
+			pcfs_rq = tg->cfs_rq[cpu_of(rq)];
+			if (pcfs_rq->throttle_uptodate)
+				break;
+		}
+		if (tg) {
+			cfs_rq->throttle_count = pcfs_rq->throttle_count;
+			cfs_rq->throttled_clock_task = rq_clock_task(rq);
+		}
+	}
+
 	/* an active group must be handled by the update_curr()->put() path */
 	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
 		return;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 72f1f3087b04..7cbeb92a1cb9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -437,7 +437,7 @@ struct cfs_rq {
 
 	u64 throttled_clock, throttled_clock_task;
 	u64 throttled_clock_task_time;
-	int throttled, throttle_count;
+	int throttled, throttle_count, throttle_uptodate;
 	struct list_head throttled_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
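[Editorial note: for readers outside the scheduler tree, the hunk above can be condensed into a standalone toy model. This is plain C with no kernel APIs; `toy_cfs_rq` and its fields are simplified stand-ins for the real cfs_rq/task_group structures, and the function is only a sketch of the patch's first-enqueue logic.]

```c
#include <stddef.h>

/*
 * Toy model of the lazy throttle_count sync.  One node collapses the
 * per-CPU cfs_rq and its task_group parent link; all names are
 * illustrative, not the kernel's.
 */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;   /* NULL for the root group */
	int throttle_uptodate;       /* 0 until the first enqueue */
	int throttle_count;          /* number of throttled ancestors */
};

/*
 * First-enqueue hook: inherit the count from the closest up-to-date
 * ancestor.  Leaves are initialized first, so the nearest up-to-date
 * node already carries the correct hierarchical count.
 */
void toy_sync_throttle(struct toy_cfs_rq *cfs_rq)
{
	struct toy_cfs_rq *p;

	if (cfs_rq->throttle_uptodate)
		return;
	cfs_rq->throttle_uptodate = 1;
	for (p = cfs_rq->parent; p; p = p->parent)
		if (p->throttle_uptodate)
			break;
	if (p)
		cfs_rq->throttle_count = p->throttle_count;
}
```

In this model, a group created while its parent is throttled starts with a stale throttle_count of 0 and picks up the parent's count at its first enqueue, which is exactly the inheritance the commit message describes.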


* Re: [PATCH] sched/fair: initialize throttle_count for new task-groups lazily
  2016-06-16 12:57 [PATCH] sched/fair: initialize throttle_count for new task-groups lazily Konstantin Khlebnikov
@ 2016-06-16 17:03 ` bsegall
  2016-06-16 17:23   ` Konstantin Khlebnikov
  2016-06-21 13:41 ` Konstantin Khlebnikov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: bsegall @ 2016-06-16 17:03 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, stable

Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:

> Cgroup created inside throttled group must inherit current throttle_count.
> Broken throttle_count allows to nominate throttled entries as a next buddy,
> later this leads to null pointer dereference in pick_next_task_fair().
>
> This patch initialize cfs_rq->throttle_count at first enqueue: laziness
> allows to skip locking all rq at group creation. Lazy approach also allows
> to skip full sub-tree scan at throttling hierarchy (not in this patch).
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Stable <stable@vger.kernel.org> # v3.2+
> ---
>  kernel/sched/fair.c  |   19 +++++++++++++++++++
>  kernel/sched/sched.h |    2 +-
>  2 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 218f8e83db73..fe809fe169d2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4185,6 +4185,25 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
>  	if (!cfs_bandwidth_used())
>  		return;
>  
> +	/* synchronize hierarchical throttle counter */
> +	if (unlikely(!cfs_rq->throttle_uptodate)) {
> +		struct rq *rq = rq_of(cfs_rq);
> +		struct cfs_rq *pcfs_rq;
> +		struct task_group *tg;
> +
> +		cfs_rq->throttle_uptodate = 1;
> +		/* get closest uptodate node because leaves goes first */
> +		for (tg = cfs_rq->tg->parent; tg; tg = tg->parent) {
> +			pcfs_rq = tg->cfs_rq[cpu_of(rq)];
> +			if (pcfs_rq->throttle_uptodate)
> +				break;
> +		}
> +		if (tg) {
> +			cfs_rq->throttle_count = pcfs_rq->throttle_count;
> +			cfs_rq->throttled_clock_task = rq_clock_task(rq);
> +		}
> +	}
> +

Checking just in enqueue is not sufficient - throttled_lb_pair can check
against a cfs_rq that has never been enqueued (and possibly other
paths).

It might also make sense to go ahead and initialize all the cfs_rqs we
skipped over to avoid some n^2 pathological behavior. You could also use
throttle_count == -1 or < 0. (We had our own version of this patch that
I guess we forgot to push?)
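[Editorial note: Ben's point can be made concrete with a small sketch. The names below are illustrative stand-ins; the real kernel helpers are throttled_hierarchy() and throttled_lb_pair(), whose exact signatures differ.]

```c
#include <stddef.h>

/* Toy model of the load-balancer hole; not the real kernel API. */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;
	int throttle_uptodate;  /* set lazily, at first enqueue only */
	int throttle_count;
};

/* Analogue of throttled_hierarchy(): relies on throttle_count alone. */
int toy_throttled_hierarchy(const struct toy_cfs_rq *cfs_rq)
{
	return cfs_rq->throttle_count > 0;
}

/*
 * Analogue of the throttled_lb_pair() check: the balancer must refuse
 * to move a task from or into a throttled hierarchy.  A freshly
 * created group that has never been enqueued was never lazily synced,
 * so its throttle_count is still 0 and it slips through this check.
 */
int toy_throttled_lb_pair(const struct toy_cfs_rq *src,
			  const struct toy_cfs_rq *dst)
{
	return toy_throttled_hierarchy(src) || toy_throttled_hierarchy(dst);
}
```

Because the lazy sync runs only at enqueue time, nothing in this model ever fixes the new group before the balancer consults it.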


>  	/* an active group must be handled by the update_curr()->put() path */
>  	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
>  		return;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 72f1f3087b04..7cbeb92a1cb9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -437,7 +437,7 @@ struct cfs_rq {
>  
>  	u64 throttled_clock, throttled_clock_task;
>  	u64 throttled_clock_task_time;
> -	int throttled, throttle_count;
> +	int throttled, throttle_count, throttle_uptodate;
>  	struct list_head throttled_list;
>  #endif /* CONFIG_CFS_BANDWIDTH */
>  #endif /* CONFIG_FAIR_GROUP_SCHED */


* Re: [PATCH] sched/fair: initialize throttle_count for new task-groups lazily
  2016-06-16 17:03 ` bsegall
@ 2016-06-16 17:23   ` Konstantin Khlebnikov
  2016-06-16 17:33     ` bsegall
  0 siblings, 1 reply; 9+ messages in thread
From: Konstantin Khlebnikov @ 2016-06-16 17:23 UTC (permalink / raw)
  To: bsegall; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, stable

On 16.06.2016 20:03, bsegall@google.com wrote:
> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
>
>> Cgroup created inside throttled group must inherit current throttle_count.
>> Broken throttle_count allows to nominate throttled entries as a next buddy,
>> later this leads to null pointer dereference in pick_next_task_fair().
>>
>> This patch initialize cfs_rq->throttle_count at first enqueue: laziness
>> allows to skip locking all rq at group creation. Lazy approach also allows
>> to skip full sub-tree scan at throttling hierarchy (not in this patch).
>>
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>> Cc: Stable <stable@vger.kernel.org> # v3.2+
>> ---
>>   kernel/sched/fair.c  |   19 +++++++++++++++++++
>>   kernel/sched/sched.h |    2 +-
>>   2 files changed, 20 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 218f8e83db73..fe809fe169d2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4185,6 +4185,25 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
>>   	if (!cfs_bandwidth_used())
>>   		return;
>>
>> +	/* synchronize hierarchical throttle counter */
>> +	if (unlikely(!cfs_rq->throttle_uptodate)) {
>> +		struct rq *rq = rq_of(cfs_rq);
>> +		struct cfs_rq *pcfs_rq;
>> +		struct task_group *tg;
>> +
>> +		cfs_rq->throttle_uptodate = 1;
>> +		/* get closest uptodate node because leaves goes first */
>> +		for (tg = cfs_rq->tg->parent; tg; tg = tg->parent) {
>> +			pcfs_rq = tg->cfs_rq[cpu_of(rq)];
>> +			if (pcfs_rq->throttle_uptodate)
>> +				break;
>> +		}
>> +		if (tg) {
>> +			cfs_rq->throttle_count = pcfs_rq->throttle_count;
>> +			cfs_rq->throttled_clock_task = rq_clock_task(rq);
>> +		}
>> +	}
>> +
>
> Checking just in enqueue is not sufficient - throttled_lb_pair can check
> against a cfs_rq that has never been enqueued (and possibly other
> paths).

Looks like this is a minor problem: in the worst case the load balancer
will migrate a task into a throttled hierarchy. And this can happen only
once for each newly created group.

>
> It might also make sense to go ahead and initialize all the cfs_rqs we
> skipped over to avoid some n^2 pathological behavior. You could also use
> throttle_count == -1 or < 0. (We had our own version of this patch that
> I guess we forgot to push?)

n^2 shouldn't be a problem, since this happens only once for each group.

throttle_count == -1 could be overwritten if the parent throttles or
unthrottles before initialization. We could set it to INT_MIN/2 and check
for < 0, but that would hide possible bugs here. One more int in the same
cacheline shouldn't add noticeable overhead.
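[Editorial note: the overwrite hazard described above can be sketched with a toy model of the hierarchical walk. All names are illustrative; in the kernel the walk is performed by tg_throttle_down()/tg_unthrottle_up() when a parent throttles or unthrottles.]

```c
/*
 * Toy model of why a throttle_count == -1 sentinel is fragile: the
 * throttle walk increments every descendant's count, including one
 * that still holds the hypothetical "uninitialized" sentinel.
 */
void toy_tg_throttle_down(int *throttle_count)
{
	(*throttle_count)++;   /* one more throttled ancestor */
}

void toy_tg_unthrottle_up(int *throttle_count)
{
	(*throttle_count)--;
}
```

Once the walk touches the sentinel, the check for -1 (or < 0) can no longer fire, yet the count is off by the initial -1 forever, so the group wrongly reports itself unthrottled.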

I've also added the following to our kernel to catch such problems without
crashing. It's probably worth adding upstream, because a stale buddy is a
real pain)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5506,8 +5506,11 @@ static void set_last_buddy(struct sched_entity *se)
         if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
                 return;

-       for_each_sched_entity(se)
+       for_each_sched_entity(se) {
+               if (WARN_ON_ONCE(!se->on_rq))
+                       return;
                 cfs_rq_of(se)->last = se;
+       }
  }

  static void set_next_buddy(struct sched_entity *se)
@@ -5515,8 +5518,11 @@ static void set_next_buddy(struct sched_entity *se)
         if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
                 return;

-       for_each_sched_entity(se)
+       for_each_sched_entity(se) {
+               if (WARN_ON_ONCE(!se->on_rq))
+                       return;
                 cfs_rq_of(se)->next = se;
+       }
  }

  static void set_skip_buddy(struct sched_entity *se)


>
>
>>   	/* an active group must be handled by the update_curr()->put() path */
>>   	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
>>   		return;
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 72f1f3087b04..7cbeb92a1cb9 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -437,7 +437,7 @@ struct cfs_rq {
>>
>>   	u64 throttled_clock, throttled_clock_task;
>>   	u64 throttled_clock_task_time;
>> -	int throttled, throttle_count;
>> +	int throttled, throttle_count, throttle_uptodate;
>>   	struct list_head throttled_list;
>>   #endif /* CONFIG_CFS_BANDWIDTH */
>>   #endif /* CONFIG_FAIR_GROUP_SCHED */


-- 
Konstantin


* Re: [PATCH] sched/fair: initialize throttle_count for new task-groups lazily
  2016-06-16 17:23   ` Konstantin Khlebnikov
@ 2016-06-16 17:33     ` bsegall
  0 siblings, 0 replies; 9+ messages in thread
From: bsegall @ 2016-06-16 17:33 UTC (permalink / raw)
  To: Konstantin Khlebnikov, pjt
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, stable

Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:

> On 16.06.2016 20:03, bsegall@google.com wrote:
>> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
>>
>>> Cgroup created inside throttled group must inherit current throttle_count.
>>> Broken throttle_count allows to nominate throttled entries as a next buddy,
>>> later this leads to null pointer dereference in pick_next_task_fair().
>>>
>>> This patch initialize cfs_rq->throttle_count at first enqueue: laziness
>>> allows to skip locking all rq at group creation. Lazy approach also allows
>>> to skip full sub-tree scan at throttling hierarchy (not in this patch).
>>>
>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>> Cc: Stable <stable@vger.kernel.org> # v3.2+
>>> ---
>>>   kernel/sched/fair.c  |   19 +++++++++++++++++++
>>>   kernel/sched/sched.h |    2 +-
>>>   2 files changed, 20 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 218f8e83db73..fe809fe169d2 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4185,6 +4185,25 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
>>>   	if (!cfs_bandwidth_used())
>>>   		return;
>>>
>>> +	/* synchronize hierarchical throttle counter */
>>> +	if (unlikely(!cfs_rq->throttle_uptodate)) {
>>> +		struct rq *rq = rq_of(cfs_rq);
>>> +		struct cfs_rq *pcfs_rq;
>>> +		struct task_group *tg;
>>> +
>>> +		cfs_rq->throttle_uptodate = 1;
>>> +		/* get closest uptodate node because leaves goes first */
>>> +		for (tg = cfs_rq->tg->parent; tg; tg = tg->parent) {
>>> +			pcfs_rq = tg->cfs_rq[cpu_of(rq)];
>>> +			if (pcfs_rq->throttle_uptodate)
>>> +				break;
>>> +		}
>>> +		if (tg) {
>>> +			cfs_rq->throttle_count = pcfs_rq->throttle_count;
>>> +			cfs_rq->throttled_clock_task = rq_clock_task(rq);
>>> +		}
>>> +	}
>>> +
>>
>> Checking just in enqueue is not sufficient - throttled_lb_pair can check
>> against a cfs_rq that has never been enqueued (and possibly other
>> paths).
>
> Looks like this is minor problem: in worst case load-balancer will migrate
> task into throttled hierarchy. And this could happens only once for each
> newly created group.
>
>>
>> It might also make sense to go ahead and initialize all the cfs_rqs we
>> skipped over to avoid some n^2 pathological behavior. You could also use
>> throttle_count == -1 or < 0. (We had our own version of this patch that
>> I guess we forgot to push?)
>
> n^2 shouldn't be a problem while this happens only once for each
> group.

Yeah, it shouldn't be a problem; it's more of a "why not".

>
> throttle_count == -1 could be overwritten when parent throttles/unthrottles
> before initialization. We could set it to INT_MIN/2 and check <0 but this
> will hide possible bugs here. One more int in the same cacheline shouldn't
> add noticeable overhead.

Yeah, you would have to check in tg_throttle_down()/tg_unthrottle_up() as
well to do this.

>
> I've also added this into our kernel to catch such problems without crash.
> Probably it's worth to add into upstream because stale buddy is a real pain)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5506,8 +5506,11 @@ static void set_last_buddy(struct sched_entity *se)
>         if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
>                 return;
>
> -       for_each_sched_entity(se)
> +       for_each_sched_entity(se) {
> +               if (WARN_ON_ONCE(!se->on_rq))
> +                       return;
>                 cfs_rq_of(se)->last = se;
> +       }
>  }
>
>  static void set_next_buddy(struct sched_entity *se)
> @@ -5515,8 +5518,11 @@ static void set_next_buddy(struct sched_entity *se)
>         if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
>                 return;
>
> -       for_each_sched_entity(se)
> +       for_each_sched_entity(se) {
> +               if (WARN_ON_ONCE(!se->on_rq))
> +                       return;
>                 cfs_rq_of(se)->next = se;
> +       }
>  }
>
>  static void set_skip_buddy(struct sched_entity *se)
>
>
>>
>>
>>>   	/* an active group must be handled by the update_curr()->put() path */
>>>   	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
>>>   		return;
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 72f1f3087b04..7cbeb92a1cb9 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -437,7 +437,7 @@ struct cfs_rq {
>>>
>>>   	u64 throttled_clock, throttled_clock_task;
>>>   	u64 throttled_clock_task_time;
>>> -	int throttled, throttle_count;
>>> +	int throttled, throttle_count, throttle_uptodate;
>>>   	struct list_head throttled_list;
>>>   #endif /* CONFIG_CFS_BANDWIDTH */
>>>   #endif /* CONFIG_FAIR_GROUP_SCHED */


* Re: [PATCH] sched/fair: initialize throttle_count for new task-groups lazily
  2016-06-16 12:57 [PATCH] sched/fair: initialize throttle_count for new task-groups lazily Konstantin Khlebnikov
  2016-06-16 17:03 ` bsegall
@ 2016-06-21 13:41 ` Konstantin Khlebnikov
  2016-06-21 21:10 ` Peter Zijlstra
  2016-06-24  8:59 ` [tip:sched/urgent] sched/fair: Initialize " tip-bot for Konstantin Khlebnikov
  3 siblings, 0 replies; 9+ messages in thread
From: Konstantin Khlebnikov @ 2016-06-21 13:41 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, linux-kernel; +Cc: stable

On 16.06.2016 15:57, Konstantin Khlebnikov wrote:
> Cgroup created inside throttled group must inherit current throttle_count.
> Broken throttle_count allows to nominate throttled entries as a next buddy,
> later this leads to null pointer dereference in pick_next_task_fair().

An example of the kernel oops, to summon maintainers:

<1>[3627487.878297] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
<1>[3627487.879028] IP: [<ffffffff8109ab6c>] set_next_entity+0x1c/0x80
<4>[3627487.879837] PGD 0
<4>[3627487.880567] Oops: 0000 [#1] SMP
<4>[3627487.881292] Modules linked in: macvlan overlay ipmi_si ipmi_devintf ipmi_msghandler ip6t_REJECT nf_reject_ipv6 xt_tcpudp 
ip6table_filter ip6_tables x_tables quota_v2 quota_tree cls_cgroup sch_htb bridge netconsole configfs 8021q mrp garp stp llc 
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crc32_pclmul ast ttm drm_kms_helper drm ghash_clmulni_intel aesni_intel 
ablk_helper sb_edac cryptd lrw lpc_ich gf128mul edac_core sysimgblt glue_helper aes_x86_64 microcode sysfillrect syscopyarea acpi_pad 
tcp_htcp mlx4_en mlx4_core vxlan udp_tunnel ip6_udp_tunnel igb i2c_algo_bit isci ixgbe libsas i2c_core ahci dca ptp libahci 
scsi_transport_sas pps_core mdio raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 
multipath linear [last unloaded: ipmi_msghandler]<4>[3627487.886379]
<4>[3627487.887892] CPU: 21 PID: 0 Comm: swapper/21 Not tainted 3.18.19-24 #1
<4>[3627487.889429] Hardware name: AIC 1D-HV24-02/MB-DPSB04-04, BIOS IVYBV058 07/01/2015
<4>[3627487.891008] task: ffff881fd336f540 ti: ffff881fd33a4000 task.ti: ffff881fd33a4000
<4>[3627487.892569] RIP: 0010:[<ffffffff8109ab6c>]  [<ffffffff8109ab6c>] set_next_entity+0x1c/0x80
<4>[3627487.894200] RSP: 0018:ffff881fd33a7d68  EFLAGS: 00010082
<4>[3627487.895750] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff881fffdb2b70
<4>[3627487.897276] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff881fd1193600
<4>[3627487.898793] RBP: ffff881fd33a7d88 R08: 0000000000000f6d R09: 0000000000000000
<4>[3627487.900358] R10: 0000000000000078 R11: 0000000000000000 R12: 0000000000000000
<4>[3627487.901898] R13: ffffffff8180f3c0 R14: ffff881fd33a4000 R15: ffff881fd1193600
<4>[3627487.903381] FS:  0000000000000000(0000) GS:ffff881fffda0000(0000) knlGS:0000000000000000
<4>[3627487.904920] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[3627487.906382] CR2: 0000000000000038 CR3: 0000000001c14000 CR4: 00000000001407e0
<4>[3627487.907904] Stack:
<4>[3627487.909365]  ffff881fffdb2b00 ffff881fffdb2b00 0000000000000000 ffffffff8180f3c0
<4>[3627487.910837]  ffff881fd33a7e18 ffffffff810a1b18 00000001360794f4 00000001760794f3
<4>[3627487.912322]  ffff881fd2888000 0000000000000000 0000000000012b00 ffff881fd336f540
<4>[3627487.913770] Call Trace:
<4>[3627487.915188]  [<ffffffff810a1b18>] pick_next_task_fair+0x88/0x5f0
<4>[3627487.916573]  [<ffffffff816d258f>] __schedule+0x6ef/0x820
<4>[3627487.917936]  [<ffffffff816d2799>] schedule+0x29/0x70
<4>[3627487.919277]  [<ffffffff816d2a76>] schedule_preempt_disabled+0x16/0x20
<4>[3627487.920632]  [<ffffffff810a8ddb>] cpu_startup_entry+0x14b/0x3d0
<4>[3627487.921999]  [<ffffffff810ce272>] ? clockevents_register_device+0xe2/0x140
<4>[3627487.923323]  [<ffffffff810463fc>] start_secondary+0x14c/0x160
<4>[3627487.924660] Code: 89 ff 48 89 e5 f0 48 0f b3 3e 5d c3 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 48 89 5d e8 4c 89 65 f0 48 89 
f3 4c 89 6d f8 <44> 8b 4e 38 49 89 fc 45 85 c9 74 17 4c 8d 6e 10 4c 39 6f 30 74
<1>[3627487.927435] RIP  [<ffffffff8109ab6c>] set_next_entity+0x1c/0x80
<4>[3627487.928741]  RSP <ffff881fd33a7d68>
<4>[3627487.930010] CR2: 0000000000000038

>
> This patch initialize cfs_rq->throttle_count at first enqueue: laziness
> allows to skip locking all rq at group creation. Lazy approach also allows
> to skip full sub-tree scan at throttling hierarchy (not in this patch).
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Stable <stable@vger.kernel.org> # v3.2+
> ---
>   kernel/sched/fair.c  |   19 +++++++++++++++++++
>   kernel/sched/sched.h |    2 +-
>   2 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 218f8e83db73..fe809fe169d2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4185,6 +4185,25 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
>   	if (!cfs_bandwidth_used())
>   		return;
>
> +	/* synchronize hierarchical throttle counter */
> +	if (unlikely(!cfs_rq->throttle_uptodate)) {
> +		struct rq *rq = rq_of(cfs_rq);
> +		struct cfs_rq *pcfs_rq;
> +		struct task_group *tg;
> +
> +		cfs_rq->throttle_uptodate = 1;
> +		/* get closest uptodate node because leaves goes first */
> +		for (tg = cfs_rq->tg->parent; tg; tg = tg->parent) {
> +			pcfs_rq = tg->cfs_rq[cpu_of(rq)];
> +			if (pcfs_rq->throttle_uptodate)
> +				break;
> +		}
> +		if (tg) {
> +			cfs_rq->throttle_count = pcfs_rq->throttle_count;
> +			cfs_rq->throttled_clock_task = rq_clock_task(rq);
> +		}
> +	}
> +
>   	/* an active group must be handled by the update_curr()->put() path */
>   	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
>   		return;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 72f1f3087b04..7cbeb92a1cb9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -437,7 +437,7 @@ struct cfs_rq {
>
>   	u64 throttled_clock, throttled_clock_task;
>   	u64 throttled_clock_task_time;
> -	int throttled, throttle_count;
> +	int throttled, throttle_count, throttle_uptodate;
>   	struct list_head throttled_list;
>   #endif /* CONFIG_CFS_BANDWIDTH */
>   #endif /* CONFIG_FAIR_GROUP_SCHED */
>


-- 
Konstantin


* Re: [PATCH] sched/fair: initialize throttle_count for new task-groups lazily
  2016-06-16 12:57 [PATCH] sched/fair: initialize throttle_count for new task-groups lazily Konstantin Khlebnikov
  2016-06-16 17:03 ` bsegall
  2016-06-21 13:41 ` Konstantin Khlebnikov
@ 2016-06-21 21:10 ` Peter Zijlstra
  2016-06-22  8:10   ` Konstantin Khlebnikov
  2016-06-24  8:59 ` [tip:sched/urgent] sched/fair: Initialize " tip-bot for Konstantin Khlebnikov
  3 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2016-06-21 21:10 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Ingo Molnar, linux-kernel, stable

On Thu, Jun 16, 2016 at 03:57:01PM +0300, Konstantin Khlebnikov wrote:
> Cgroup created inside throttled group must inherit current throttle_count.
> Broken throttle_count allows to nominate throttled entries as a next buddy,
> later this leads to null pointer dereference in pick_next_task_fair().
> 
> This patch initialize cfs_rq->throttle_count at first enqueue: laziness
> allows to skip locking all rq at group creation. Lazy approach also allows
> to skip full sub-tree scan at throttling hierarchy (not in this patch).

You're talking about taking rq->lock in alloc_fair_sched_group(), right?

We're about to go do that anyway... But I suppose for backports this
makes sense. Doing it at creation time also avoids the issues Ben
raised, right?


* Re: [PATCH] sched/fair: initialize throttle_count for new task-groups lazily
  2016-06-21 21:10 ` Peter Zijlstra
@ 2016-06-22  8:10   ` Konstantin Khlebnikov
  2016-06-22  8:23     ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Konstantin Khlebnikov @ 2016-06-22  8:10 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, stable

On 22.06.2016 00:10, Peter Zijlstra wrote:
> On Thu, Jun 16, 2016 at 03:57:01PM +0300, Konstantin Khlebnikov wrote:
>> Cgroup created inside throttled group must inherit current throttle_count.
>> Broken throttle_count allows to nominate throttled entries as a next buddy,
>> later this leads to null pointer dereference in pick_next_task_fair().
>>
>> This patch initialize cfs_rq->throttle_count at first enqueue: laziness
>> allows to skip locking all rq at group creation. Lazy approach also allows
>> to skip full sub-tree scan at throttling hierarchy (not in this patch).
>
> You're talking about taking rq->lock in alloc_fair_sched_group(), right?
>
> We're about to go do that anyway... But I suppose for backports this
> makes sense. Doing it at creation time also avoids the issues Ben
> raised, right?

Yes, everything will be fine. But for 8192 cores this will be a disaster =)
throttle_count must be initialized after linking the tg into the lists,
obviously.

-- 
Konstantin


* Re: [PATCH] sched/fair: initialize throttle_count for new task-groups lazily
  2016-06-22  8:10   ` Konstantin Khlebnikov
@ 2016-06-22  8:23     ` Peter Zijlstra
  0 siblings, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2016-06-22  8:23 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Ingo Molnar, linux-kernel, stable

On Wed, Jun 22, 2016 at 11:10:46AM +0300, Konstantin Khlebnikov wrote:
> On 22.06.2016 00:10, Peter Zijlstra wrote:
> >On Thu, Jun 16, 2016 at 03:57:01PM +0300, Konstantin Khlebnikov wrote:
> >>Cgroup created inside throttled group must inherit current throttle_count.
> >>Broken throttle_count allows to nominate throttled entries as a next buddy,
> >>later this leads to null pointer dereference in pick_next_task_fair().
> >>
> >>This patch initialize cfs_rq->throttle_count at first enqueue: laziness
> >>allows to skip locking all rq at group creation. Lazy approach also allows
> >>to skip full sub-tree scan at throttling hierarchy (not in this patch).
> >
> >You're talking about taking rq->lock in alloc_fair_sched_group(), right?
> >
> >We're about to go do that anyway... But I suppose for backports this
> >makes sense. Doing it at creation time also avoids the issues Ben
> >raised, right?
> 
> Yes, all will be fine. But for 8192-cores this will be disaster =)

Well, creating cgroups isn't something you do much of, and creating them
will be proportionally expensive already, as we allocate all kinds of
per-cpu data.

In any case, we 'need' to do this because of the per-entity load
tracking stuff; entities, even blocked ones, should be added to the cfs_rq.

> throttle_count must be initialized after linking tg into lists. obviously.

Crud, that's later than we currently take the rq lock. Let me see how
much pain it is to re-order all that.
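[Editorial note: the eager alternative being discussed, initializing at group creation rather than at first enqueue, might look roughly like this toy sketch. The names are hypothetical, not the actual upstream code, which also has to take each CPU's rq->lock and runs only after the tg is linked into the lists.]

```c
#include <stddef.h>

#define TOY_NR_CPUS 4

/* Toy task_group with per-CPU throttle state; names are illustrative. */
struct toy_tg {
	struct toy_tg *parent;            /* NULL for the root group */
	int throttle_count[TOY_NR_CPUS];
};

/*
 * Eager variant: right after the new group is linked into the
 * hierarchy, copy each per-CPU throttle_count from the parent (the
 * real code would do this under the corresponding rq->lock).  The
 * cost is O(nr_cpus) per cgroup creation, which is what the
 * "8192 cores" remark earlier in the thread is about.
 */
void toy_init_tg_throttle(struct toy_tg *tg)
{
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		tg->throttle_count[cpu] =
			tg->parent ? tg->parent->throttle_count[cpu] : 0;
}
```

Unlike the lazy scheme, this leaves no window in which a never-enqueued cfs_rq carries a stale count, at the price of touching every CPU at creation time.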


* [tip:sched/urgent] sched/fair: Initialize throttle_count for new task-groups lazily
  2016-06-16 12:57 [PATCH] sched/fair: initialize throttle_count for new task-groups lazily Konstantin Khlebnikov
                   ` (2 preceding siblings ...)
  2016-06-21 21:10 ` Peter Zijlstra
@ 2016-06-24  8:59 ` tip-bot for Konstantin Khlebnikov
  3 siblings, 0 replies; 9+ messages in thread
From: tip-bot for Konstantin Khlebnikov @ 2016-06-24  8:59 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, tglx, hpa, khlebnikov, peterz, torvalds, linux-kernel

Commit-ID:  094f469172e00d6ab0a3130b0e01c83b3cf3a98d
Gitweb:     http://git.kernel.org/tip/094f469172e00d6ab0a3130b0e01c83b3cf3a98d
Author:     Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
AuthorDate: Thu, 16 Jun 2016 15:57:01 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 24 Jun 2016 08:26:44 +0200

sched/fair: Initialize throttle_count for new task-groups lazily

A cgroup created inside a throttled group must inherit the current
throttle_count. A broken throttle_count allows throttled entries to be
nominated as the next buddy, which later leads to a NULL pointer
dereference in pick_next_task_fair().

This patch initializes cfs_rq->throttle_count at the first enqueue:
laziness allows us to skip locking every rq at group creation. The lazy
approach also allows skipping a full sub-tree scan when throttling the
hierarchy (not in this patch).

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Link: http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c  | 20 ++++++++++++++++++++
 kernel/sched/sched.h |  2 +-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2ae68f0..8c5d8c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4202,6 +4202,26 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
 	if (!cfs_bandwidth_used())
 		return;
 
+	/* Synchronize hierarchical throttle counter: */
+	if (unlikely(!cfs_rq->throttle_uptodate)) {
+		struct rq *rq = rq_of(cfs_rq);
+		struct cfs_rq *pcfs_rq;
+		struct task_group *tg;
+
+		cfs_rq->throttle_uptodate = 1;
+
+		/* Get closest up-to-date node, because leaves go first: */
+		for (tg = cfs_rq->tg->parent; tg; tg = tg->parent) {
+			pcfs_rq = tg->cfs_rq[cpu_of(rq)];
+			if (pcfs_rq->throttle_uptodate)
+				break;
+		}
+		if (tg) {
+			cfs_rq->throttle_count = pcfs_rq->throttle_count;
+			cfs_rq->throttled_clock_task = rq_clock_task(rq);
+		}
+	}
+
 	/* an active group must be handled by the update_curr()->put() path */
 	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
 		return;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 72f1f30..7cbeb92 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -437,7 +437,7 @@ struct cfs_rq {
 
 	u64 throttled_clock, throttled_clock_task;
 	u64 throttled_clock_task_time;
-	int throttled, throttle_count;
+	int throttled, throttle_count, throttle_uptodate;
 	struct list_head throttled_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */


end of thread, back to index

Thread overview: 9+ messages
2016-06-16 12:57 [PATCH] sched/fair: initialize throttle_count for new task-groups lazily Konstantin Khlebnikov
2016-06-16 17:03 ` bsegall
2016-06-16 17:23   ` Konstantin Khlebnikov
2016-06-16 17:33     ` bsegall
2016-06-21 13:41 ` Konstantin Khlebnikov
2016-06-21 21:10 ` Peter Zijlstra
2016-06-22  8:10   ` Konstantin Khlebnikov
2016-06-22  8:23     ` Peter Zijlstra
2016-06-24  8:59 ` [tip:sched/urgent] sched/fair: Initialize " tip-bot for Konstantin Khlebnikov
