linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Sargun Dhillon <sargun@sargun.me>,
	Xie XiuQi <xiexiuqi@huawei.com>, Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	xiezhipeng1@huawei.com, huawei.libin@huawei.com,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Dmitry Adamushko <dmitry.adamushko@gmail.com>,
	Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH] sched: fix infinity loop in update_blocked_averages
Date: Thu, 27 Dec 2018 10:15:17 -0800	[thread overview]
Message-ID: <CAHk-=wja7WN1sDz5_E7hf8c47CmX=yH8xc3h0zmmF8Y_fSR71A@mail.gmail.com> (raw)
In-Reply-To: <CAKfTPtAmkHZ5XVHpf6rE8nrSPGzW_uO7a0tvH5JqDSpTzK9D=Q@mail.gmail.com>

On Thu, Dec 27, 2018 at 9:02 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> In the original behavior, the cs_rq was removed from the list only
> when the cgroup was removed.
> patch a9e7f6544b9c (sched/fair: Fix O(nr_cgroups) in load balance
> path) has added an optimization which remove the cfs_rq when there
> were no blocked load to update in order to optimize the loop but it
> has introduced a race condition that create this infinite loop. The
> patch fixes the problem by removing the optimization.
> I will look at re-adding the optimization once i will have afix for
> the race condition

Hmm. What's the race? We seem to take the rq lock for all the cases,
but maybe I'm missing something?

That commit a9e7f6544b9c is a year and a half old, why did this start
being reported now?

[ goes off and looks ]

Oh. unthrottle_cfs_rq -> enqueue_entity -> list_add_leaf_cfs_rq()
doesn't actually seem to hold the rq lock at all. It's just called
under a rcu read lock.

So it all seems to depend on that "on_list" flag for exclusion. Which
seems fundamentally racy, since it's not protected by a lock.

So yeah, the whole logic seems to depend on "on_list is sticky and
stays set until the whole task group is destroyed".

So commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance
path") would appear to be entirely wrong, because on_list isn't
actually protected by a lock, and that can confuse things.

But that still makes me go "how come is this only noticed 18 months
after the fact"?

So I'm probably still missing something.

Tejun? PeterZ? Tell my why I'm being dense.

                   Linus

  reply	other threads:[~2018-12-27 18:15 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-27  3:04 [PATCH] sched: fix infinity loop in update_blocked_averages Xie XiuQi
2018-12-27  9:21 ` Vincent Guittot
2018-12-27 10:21   ` Vincent Guittot
2018-12-27 10:23     ` Vincent Guittot
2018-12-27 16:39       ` Sargun Dhillon
2018-12-27 17:01         ` Vincent Guittot
2018-12-27 18:15           ` Linus Torvalds [this message]
2018-12-27 21:08             ` Sargun Dhillon
2018-12-27 21:46               ` Linus Torvalds
2018-12-28  1:15             ` Tejun Heo
2018-12-28  1:36               ` Linus Torvalds
2018-12-28  1:53                 ` Tejun Heo
2018-12-28  2:02                   ` Tejun Heo
2018-12-28  2:30                     ` Xie XiuQi
2018-12-28  5:38                     ` Sargun Dhillon
2018-12-28  9:30                     ` Vincent Guittot
2018-12-28 14:26                       ` Sargun Dhillon
2018-12-28 16:54                       ` Tejun Heo
2018-12-28 17:25                         ` Vincent Guittot
2018-12-28 17:46                           ` Tejun Heo
2018-12-28 18:04                             ` Vincent Guittot
2018-12-28 10:25                     ` Xiezhipeng (EulerOS)
2018-12-30 12:04   ` Ingo Molnar
2018-12-30 12:31     ` [PATCH] sched: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c Ingo Molnar
2018-12-30 12:36     ` [PATCH] sched: fix infinity loop in update_blocked_averages Vincent Guittot
2018-12-30 12:54       ` Ingo Molnar
2018-12-30 13:00 ` [tip:sched/urgent] sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c tip-bot for Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHk-=wja7WN1sDz5_E7hf8c47CmX=yH8xc3h0zmmF8Y_fSR71A@mail.gmail.com' \
    --to=torvalds@linux-foundation.org \
    --cc=dmitry.adamushko@gmail.com \
    --cc=huawei.libin@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=sargun@sargun.me \
    --cc=tj@kernel.org \
    --cc=vincent.guittot@linaro.org \
    --cc=xiexiuqi@huawei.com \
    --cc=xiezhipeng1@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).