Subject: Re: [PATCH] sched/fair: initialize throttle_count for new task-groups lazily
To: bsegall@google.com
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel@vger.kernel.org, stable@vger.kernel.org
From: Konstantin Khlebnikov
Date: Thu, 16 Jun 2016 20:23:28 +0300
Message-ID: <5762E090.4000706@yandex-team.ru>
References: <146608182119.21870.8439834428248129633.stgit@buzz>
X-Mailing-List: linux-kernel@vger.kernel.org

On 16.06.2016 20:03, bsegall@google.com wrote:
> Konstantin Khlebnikov writes:
>
>> A cgroup created inside a throttled group must inherit the current
>> throttle_count. A broken throttle_count allows throttled entries to be
>> nominated as the next buddy, and this later leads to a NULL pointer
>> dereference in pick_next_task_fair().
>>
>> This patch initializes cfs_rq->throttle_count at the first enqueue:
>> laziness allows us to skip locking all runqueues at group creation.
>> The lazy approach also allows skipping the full sub-tree scan when
>> throttling the hierarchy (not in this patch).
>>
>> Signed-off-by: Konstantin Khlebnikov
>> Cc: Stable # v3.2+
>> ---
>>  kernel/sched/fair.c  | 19 +++++++++++++++++++
>>  kernel/sched/sched.h |  2 +-
>>  2 files changed, 20 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 218f8e83db73..fe809fe169d2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4185,6 +4185,25 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
>>  	if (!cfs_bandwidth_used())
>>  		return;
>>
>> +	/* synchronize hierarchical throttle counter */
>> +	if (unlikely(!cfs_rq->throttle_uptodate)) {
>> +		struct rq *rq = rq_of(cfs_rq);
>> +		struct cfs_rq *pcfs_rq;
>> +		struct task_group *tg;
>> +
>> +		cfs_rq->throttle_uptodate = 1;
>> +		/* get the closest up-to-date node, because leaves go first */
>> +		for (tg = cfs_rq->tg->parent; tg; tg = tg->parent) {
>> +			pcfs_rq = tg->cfs_rq[cpu_of(rq)];
>> +			if (pcfs_rq->throttle_uptodate)
>> +				break;
>> +		}
>> +		if (tg) {
>> +			cfs_rq->throttle_count = pcfs_rq->throttle_count;
>> +			cfs_rq->throttled_clock_task = rq_clock_task(rq);
>> +		}
>> +	}
>> +
>
> Checking just in enqueue is not sufficient - throttled_lb_pair can check
> against a cfs_rq that has never been enqueued (and possibly other
> paths).

Looks like this is a minor problem: in the worst case the load balancer
will migrate a task into a throttled hierarchy. And this can happen only
once for each newly created group.

> It might also make sense to go ahead and initialize all the cfs_rqs we
> skipped over to avoid some n^2 pathological behavior. You could also use
> throttle_count == -1 or < 0. (We had our own version of this patch that
> I guess we forgot to push?)

n^2 shouldn't be a problem, since this happens only once for each group.

throttle_count == -1 could be overwritten when the parent throttles or
unthrottles before initialization. We could set it to INT_MIN/2 and check
for < 0, but that would hide possible bugs here. One more int in the same
cacheline shouldn't add noticeable overhead.
I've also added the following to our kernel, to catch such problems
without crashing. It's probably worth adding upstream, because a stale
buddy is a real pain.

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5506,8 +5506,11 @@ static void set_last_buddy(struct sched_entity *se)
 	if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
 		return;
 
-	for_each_sched_entity(se)
+	for_each_sched_entity(se) {
+		if (WARN_ON_ONCE(!se->on_rq))
+			return;
 		cfs_rq_of(se)->last = se;
+	}
 }
 
 static void set_next_buddy(struct sched_entity *se)
@@ -5515,8 +5518,11 @@
 	if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
 		return;
 
-	for_each_sched_entity(se)
+	for_each_sched_entity(se) {
+		if (WARN_ON_ONCE(!se->on_rq))
+			return;
 		cfs_rq_of(se)->next = se;
+	}
 }
 
 static void set_skip_buddy(struct sched_entity *se)

>
>> 	/* an active group must be handled by the update_curr()->put() path */
>> 	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
>> 		return;
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 72f1f3087b04..7cbeb92a1cb9 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -437,7 +437,7 @@ struct cfs_rq {
>>  
>>  	u64 throttled_clock, throttled_clock_task;
>>  	u64 throttled_clock_task_time;
>> -	int throttled, throttle_count;
>> +	int throttled, throttle_count, throttle_uptodate;
>>  	struct list_head throttled_list;
>>  #endif /* CONFIG_CFS_BANDWIDTH */
>>  #endif /* CONFIG_FAIR_GROUP_SCHED */

-- 
Konstantin