From: Josef Bacik <josef@toxicpanda.com>
To: Rik van Riel <riel@surriel.com>
Cc: linux-kernel@vger.kernel.org, kernel-team@fb.com, pjt@google.com,
dietmar.eggemann@arm.com, peterz@infradead.org, mingo@redhat.com,
morten.rasmussen@arm.com, tglx@linutronix.de,
mgorman@techsingularity.net, vincent.guittot@linaro.org
Subject: Re: [PATCH 10/10] sched,fair: flatten hierarchical runqueues
Date: Mon, 1 Jul 2019 16:34:08 -0400
Message-ID: <20190701203407.4fmsavivebyrocmw@macbook-pro-91.dhcp.thefacebook.com>
In-Reply-To: <20190628204913.10287-11-riel@surriel.com>
On Fri, Jun 28, 2019 at 04:49:13PM -0400, Rik van Riel wrote:
> Flatten the hierarchical runqueues into just the per CPU rq.cfs runqueue.
>
> Iteration of the sched_entity hierarchy is rate limited to once per jiffy
> per sched_entity, which is a smaller change than it seems, because load
> average adjustments were already rate limited to once per jiffy before this
> patch series.
>
> This patch breaks CONFIG_CFS_BANDWIDTH. The plan for that is to park tasks
> from throttled cgroups onto their cgroup runqueues, and slowly (using the
> GENTLE_FAIR_SLEEPERS) wake them back up, in vruntime order, once the cgroup
> gets unthrottled, to prevent thundering herd issues.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> include/linux/sched.h | 2 +
> kernel/sched/fair.c | 452 +++++++++++++++---------------------------
> kernel/sched/pelt.c | 6 +-
> kernel/sched/pelt.h | 2 +-
> kernel/sched/sched.h | 2 +-
> 5 files changed, 171 insertions(+), 293 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 84a6cc6f5c47..901c710363e7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -453,6 +453,8 @@ struct sched_entity {
> #ifdef CONFIG_FAIR_GROUP_SCHED
> int depth;
> unsigned long enqueued_h_load;
> + unsigned long enqueued_h_weight;
> + struct load_weight h_load;
> struct sched_entity *parent;
> /* rq on which this entity is (to be) queued: */
> struct cfs_rq *cfs_rq;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6fea8849cc12..c31d3da081fb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -450,6 +450,19 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
> }
> }
>
> +/* Add the cgroup cfs_rqs to the list, for update_blocked_averages */
> +static void enqueue_entity_cfs_rqs(struct sched_entity *se)
> +{
> + SCHED_WARN_ON(!entity_is_task(se));
> +
> + for_each_sched_entity(se) {
> + struct cfs_rq *cfs_rq = group_cfs_rq_of_parent(se);
> +
> + if (list_add_leaf_cfs_rq(cfs_rq))
> + break;
> + }
> +}
> +
> #else /* !CONFIG_FAIR_GROUP_SCHED */
>
> static inline struct task_struct *task_of(struct sched_entity *se)
> @@ -672,8 +685,14 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
> */
> static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
> {
> - if (unlikely(se->load.weight != NICE_0_LOAD))
> + if (task_se_in_cgroup(se)) {
> + unsigned long h_weight = task_se_h_weight(se);
> + if (h_weight != se->h_load.weight)
> + update_load_set(&se->h_load, h_weight);
> + delta = __calc_delta(delta, NICE_0_LOAD, &se->h_load);
> + } else if (unlikely(se->load.weight != NICE_0_LOAD)) {
> delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
> + }
>
> return delta;
> }
> @@ -687,22 +706,16 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
> static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> u64 slice = sysctl_sched_latency;
> + struct load_weight *load = &cfs_rq->load;
> + struct load_weight lw;
>
> - for_each_sched_entity(se) {
> - struct load_weight *load;
> - struct load_weight lw;
> -
> - cfs_rq = cfs_rq_of(se);
> - load = &cfs_rq->load;
> + if (unlikely(!se->on_rq)) {
> + lw = cfs_rq->load;
>
> - if (unlikely(!se->on_rq)) {
> - lw = cfs_rq->load;
> -
> - update_load_add(&lw, se->load.weight);
> - load = &lw;
> - }
> - slice = __calc_delta(slice, se->load.weight, load);
> + update_load_add(&lw, task_se_h_weight(se));
> + load = &lw;
> }
> + slice = __calc_delta(slice, task_se_h_weight(se), load);
>
> /*
> * To avoid cache thrashing, run at least sysctl_sched_min_granularity.
> @@ -2703,16 +2716,28 @@ static inline void update_scan_period(struct task_struct *p, int new_cpu)
> static void
> account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> - update_load_add(&cfs_rq->load, se->load.weight);
> - if (!parent_entity(se))
> + struct rq *rq;
> +
> + if (task_se_in_cgroup(se)) {
> + struct cfs_rq *cgroup_rq = group_cfs_rq_of_parent(se);
> + unsigned long h_weight;
> +
> + update_load_add(&cgroup_rq->load, se->load.weight);
> + cgroup_rq->nr_running++;
> +
> + /* Add the hierarchical weight to the CPU rq */
> + h_weight = task_se_h_weight(se);
> + se->enqueued_h_weight = h_weight;
> + update_load_add(&rq_of(cfs_rq)->load, h_weight);
Ok, I think this is where we need to handle the newly enqueued se's potential
weight. Right now task_se_h_weight() is based on its weight _and_ its load,
so it's really task_se_h_weighted_load_avg, not task_se_h_weight, right?
Instead we should separate these two things out: one that gives us our
hierarchical load, and one that gives us our hierarchical weight based
purely on the ratio of (our weight) / (total se weight on parent). Then we
can take that number, max it with the hierarchical load avg, and use the
result with update_load_add. Does that make sense? Thanks,
Josef