LKML Archive on lore.kernel.org
From: bsegall@google.com
To: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>, pjt@google.com
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: Re: [PATCH] sched/fair: initialize throttle_count for new task-groups lazily
Date: Thu, 16 Jun 2016 10:33:19 -0700
Message-ID: <xm26fusdvy1s.fsf@bsegall-linux.mtv.corp.google.com>
In-Reply-To: <5762E090.4000706@yandex-team.ru> (Konstantin Khlebnikov's message of "Thu, 16 Jun 2016 20:23:28 +0300")

Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:

> On 16.06.2016 20:03, bsegall@google.com wrote:
>> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
>>
>>> A cgroup created inside a throttled group must inherit the current
>>> throttle_count. A broken throttle_count allows nominating throttled
>>> entries as the next buddy, which later leads to a NULL pointer
>>> dereference in pick_next_task_fair().
>>>
>>> This patch initializes cfs_rq->throttle_count at the first enqueue:
>>> laziness lets us skip locking every runqueue at group creation. The lazy
>>> approach also allows skipping the full sub-tree scan when throttling a
>>> hierarchy (not in this patch).
>>>
>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>> Cc: Stable <stable@vger.kernel.org> # v3.2+
>>> ---
>>>   kernel/sched/fair.c  |   19 +++++++++++++++++++
>>>   kernel/sched/sched.h |    2 +-
>>>   2 files changed, 20 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 218f8e83db73..fe809fe169d2 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4185,6 +4185,25 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
>>>   	if (!cfs_bandwidth_used())
>>>   		return;
>>>
>>> +	/* synchronize hierarchical throttle counter */
>>> +	if (unlikely(!cfs_rq->throttle_uptodate)) {
>>> +		struct rq *rq = rq_of(cfs_rq);
>>> +		struct cfs_rq *pcfs_rq;
>>> +		struct task_group *tg;
>>> +
>>> +		cfs_rq->throttle_uptodate = 1;
>>> +		/* get closest uptodate node, because leaves go first */
>>> +		for (tg = cfs_rq->tg->parent; tg; tg = tg->parent) {
>>> +			pcfs_rq = tg->cfs_rq[cpu_of(rq)];
>>> +			if (pcfs_rq->throttle_uptodate)
>>> +				break;
>>> +		}
>>> +		if (tg) {
>>> +			cfs_rq->throttle_count = pcfs_rq->throttle_count;
>>> +			cfs_rq->throttled_clock_task = rq_clock_task(rq);
>>> +		}
>>> +	}
>>> +
>>
>> Checking just in enqueue is not sufficient - throttled_lb_pair can check
>> against a cfs_rq that has never been enqueued (and possibly other
>> paths).
>
> Looks like this is a minor problem: in the worst case the load balancer
> will migrate a task into a throttled hierarchy. And this can happen only
> once for each newly created group.
>
>>
>> It might also make sense to go ahead and initialize all the cfs_rqs we
>> skipped over to avoid some n^2 pathological behavior. You could also use
>> throttle_count == -1 or < 0. (We had our own version of this patch that
>> I guess we forgot to push?)
>
> n^2 shouldn't be a problem, since this happens only once for each
> group.

Yeah, it shouldn't be a problem; it's more of a why-not.

>
> throttle_count == -1 could be overwritten if the parent throttles or
> unthrottles before initialization. We could set it to INT_MIN/2 and check
> for < 0, but that would hide possible bugs here. One more int in the same
> cacheline shouldn't add noticeable overhead.

Yeah, you would have to check in tg_throttle_up/tg_unthrottle_up as well
to do this.

>
> I've also added this to our kernel to catch such problems without
> crashing. It's probably worth adding upstream, because a stale buddy is a
> real pain.
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5506,8 +5506,11 @@ static void set_last_buddy(struct sched_entity *se)
>         if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
>                 return;
>
> -       for_each_sched_entity(se)
> +       for_each_sched_entity(se) {
> +               if (WARN_ON_ONCE(!se->on_rq))
> +                       return;
>                 cfs_rq_of(se)->last = se;
> +       }
>  }
>
>  static void set_next_buddy(struct sched_entity *se)
> @@ -5515,8 +5518,11 @@ static void set_next_buddy(struct sched_entity *se)
>         if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
>                 return;
>
> -       for_each_sched_entity(se)
> +       for_each_sched_entity(se) {
> +               if (WARN_ON_ONCE(!se->on_rq))
> +                       return;
>                 cfs_rq_of(se)->next = se;
> +       }
>  }
>
>  static void set_skip_buddy(struct sched_entity *se)
>
>
>>
>>
>>>   	/* an active group must be handled by the update_curr()->put() path */
>>>   	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
>>>   		return;
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 72f1f3087b04..7cbeb92a1cb9 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -437,7 +437,7 @@ struct cfs_rq {
>>>
>>>   	u64 throttled_clock, throttled_clock_task;
>>>   	u64 throttled_clock_task_time;
>>> -	int throttled, throttle_count;
>>> +	int throttled, throttle_count, throttle_uptodate;
>>>   	struct list_head throttled_list;
>>>   #endif /* CONFIG_CFS_BANDWIDTH */
>>>   #endif /* CONFIG_FAIR_GROUP_SCHED */


Thread overview: 9+ messages
2016-06-16 12:57 Konstantin Khlebnikov
2016-06-16 17:03 ` bsegall
2016-06-16 17:23   ` Konstantin Khlebnikov
2016-06-16 17:33     ` bsegall [this message]
2016-06-21 13:41 ` Konstantin Khlebnikov
2016-06-21 21:10 ` Peter Zijlstra
2016-06-22  8:10   ` Konstantin Khlebnikov
2016-06-22  8:23     ` Peter Zijlstra
2016-06-24  8:59 ` [tip:sched/urgent] sched/fair: Initialize " tip-bot for Konstantin Khlebnikov
