linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/7] introduce cpu.headroom knob to cpu controller
@ 2019-04-08 21:45 Song Liu
  2019-04-08 21:45 ` [PATCH 1/7] sched: refactor tg_set_cfs_bandwidth() Song Liu
                   ` (8 more replies)
  0 siblings, 9 replies; 26+ messages in thread
From: Song Liu @ 2019-04-08 21:45 UTC (permalink / raw)
  To: linux-kernel, cgroups
  Cc: mingo, peterz, vincent.guittot, tglx, morten.rasmussen,
	kernel-team, Song Liu

Servers running latency-sensitive workloads usually aren't fully loaded, for
various reasons including disaster readiness. The machines running our
interactive workloads (referred to as the main workload) have a lot of spare
CPU cycles that we would like to use for opportunistic side jobs like video
encoding. However, our experiments show that the side workload has a strong
impact on the latency of the main workload:

  side-job   main-load-level   main-avg-latency
     none          1.0              1.00
     none          1.1              1.10
     none          1.2              1.10 
     none          1.3              1.10
     none          1.4              1.15
     none          1.5              1.24
     none          1.6              1.74

     ffmpeg        1.0              1.82
     ffmpeg        1.1              2.74

Note: both the main-load-level and the main-avg-latency numbers are
 _normalized_.

In these experiments, ffmpeg is put in a cgroup with cpu.weight of 1
(lowest priority). However, it consumes all idle CPU cycles in the
system and causes high latency for the main workload. Further experiments
and analysis (more details below) show that, for the main workload to meet
its latency targets, it is necessary to limit the CPU usage of the side
workload so that some CPU time stays _idle_. There are various reasons
behind the need for idle CPU time. First, shared CPU resource saturation
starts to happen well before time-measured utilization reaches 100%.
Second, scheduling latency starts to impact the main workload as the CPU
approaches full utilization.

Currently, the cpu controller provides two mechanisms to protect the main 
workload: cpu.weight and cpu.max. However, neither of them is sufficient 
in these use cases. As shown in the experiments above, a side workload with
cpu.weight of 1 (lowest priority) would still consume all idle CPU and add
unacceptable latency to the main workload. cpu.max can throttle the CPU
usage of the side workload and preserve some idle CPU. However, cpu.max
cannot react to changes in load levels. For example, when the main
workload uses 40% of CPU, cpu.max of 30% for the side workload would yield
good latencies for the main workload. However, when the main workload
experiences higher load levels and uses more CPU, the same setting (cpu.max
of 30%) would cause the main workload to miss its latency target.

These experiments demonstrated the need for a mechanism to effectively 
throttle CPU usage of the side workload and preserve idle CPU cycles. 
The mechanism should be able to adjust the level of throttling based on
the load level of the main workload. 

This patchset introduces a new knob for the cpu controller: cpu.headroom.
The cgroup of the main workload uses cpu.headroom to limit the CPU cycles
available to the side workload. For example, if the main workload has a
cpu.headroom of 30%, the side workload will be throttled to leave 30% of
overall CPU idle. If the main workload uses more than 70% of CPU, the side
workload will only run with a configurable minimum of cycles. This
configurable minimum is referred to as the "tolerance" of the main workload.

The following is a detailed example:

 main/cpu.headroom    main-cpu-load    low-pri-cpu-cycle   idle-cpu
      30%                 30%                40%              30%
      30%                 40%                30%              30%
      30%                 50%                20%              30%
      30%                 60%                10%              30%
      30%                 70%                minimal          ~30%
      30%                 80%                minimal          ~20%

In the example, we use a constant cpu.headroom setting of 30%. As the main
job experiences different levels of load, the cpu controller adjusts the CPU
cycles used by the low-pri jobs.
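
The arithmetic behind the table can be sketched in a few lines of userspace
C. This is only a model of the policy described here, not the kernel code:
the function name side_cpu_budget() is made up for illustration, all values
are whole percentages of total CPU, and flooring the result at the tolerance
is an assumption.

```c
/*
 * Hypothetical userspace model of the cpu.headroom policy; not the
 * kernel implementation. All arguments are percentages of total CPU.
 */
static unsigned int side_cpu_budget(unsigned int headroom,
				    unsigned int tolerance,
				    unsigned int main_load)
{
	unsigned int reserved = main_load + headroom;
	unsigned int budget;

	/* Main load has run into the headroom: only the tolerance is left. */
	if (reserved >= 100)
		return tolerance;

	/* Otherwise the side jobs may use what main load + headroom leave. */
	budget = 100 - reserved;

	/* Assumption: the side jobs never get less than the tolerance. */
	return budget > tolerance ? budget : tolerance;
}
```

With headroom 30 and tolerance 1, a main load of 50 gives the side jobs 20,
and a main load of 70 or 80 gives them only the tolerance, matching the
"minimal" rows of the table.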

We experimented with a web server as the main workload and ffmpeg as the
side workload. The following table compares latency impact on the main
workload under different cpu.headroom settings and load levels. In all
tests, the side workload cgroup is configured with cpu.weight of 1. When
throttled, the side workload can only run 1ms per 100ms period.

                               average-latency
main-load-level   w/o-side    w/-side-      w/-side-       w/-side-
                            no-headroom   30%-headroom   20%-headroom
     1.0            1.00       1.82          1.26           1.14                      
     1.1            1.10       2.74          1.26           1.32                      
     1.2            1.10                     1.29           1.38                      
     1.3            1.10                     1.32           1.49                      
     1.4            1.15                     1.29           1.85                      
     1.5            1.24                     1.32                                
     1.6            1.74                     1.50                              

Each row of the table shows a normalized load level and average latencies
for 4 scenarios: w/o side workload; w/ side workload but no headroom; w/
side workload and 30% headroom; w/ side workload and 20% headroom.


When there is no side workload, the normalized average latency of the main
job stays between 1.00 and 1.24, except at the highest load level (1.74).
When there is a side workload but no headroom, latency of the main job goes
very high even at moderate load levels (1.82 to 2.74). With 30% headroom,
the average latency stays between 1.26 and 1.32 up to load level 1.5. With
20% headroom, the average latency ranges from 1.14 to 1.85. Blank cells are
tests we didn't finish at high load, because the latency was too high.

This experiment demonstrates that cpu.headroom is an effective and efficient
knob to control the latency of the main job.

Thanks!

Song Liu (7):
  sched: refactor tg_set_cfs_bandwidth()
  cgroup: introduce hook css_has_tasks_changed
  cgroup: introduce cgroup_parse_percentage
  sched, cgroup: add entry cpu.headroom
  sched/fair: global idleness counter for cpu.headroom
  sched/fair: throttle task runtime based on cpu.headroom
  Documentation: cgroup-v2: add information for cpu.headroom

 Documentation/admin-guide/cgroup-v2.rst |  18 +
 fs/proc/stat.c                          |   4 +-
 include/linux/cgroup-defs.h             |   2 +
 include/linux/cgroup.h                  |   1 +
 include/linux/kernel_stat.h             |   2 +
 kernel/cgroup/cgroup.c                  |  51 +++
 kernel/sched/core.c                     | 425 ++++++++++++++++++++++--
 kernel/sched/fair.c                     | 143 +++++++-
 kernel/sched/sched.h                    |  30 ++
 9 files changed, 634 insertions(+), 42 deletions(-)

-- 
2.17.1



* [PATCH 1/7] sched: refactor tg_set_cfs_bandwidth()
  2019-04-08 21:45 [PATCH 0/7] introduce cpu.headroom knob to cpu controller Song Liu
@ 2019-04-08 21:45 ` Song Liu
  2019-04-08 21:45 ` [PATCH 2/7] cgroup: introduce hook css_has_tasks_changed Song Liu
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Song Liu @ 2019-04-08 21:45 UTC (permalink / raw)
  To: linux-kernel, cgroups
  Cc: mingo, peterz, vincent.guittot, tglx, morten.rasmussen,
	kernel-team, Song Liu

This patch factors tg_switch_cfs_runtime() out of tg_set_cfs_bandwidth(),
so that later patches can extend tg_switch_cfs_runtime() to support the new
target_idle_pct value.

This patch makes no functional change.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 kernel/sched/core.c | 71 +++++++++++++++++++++++++--------------------
 1 file changed, 39 insertions(+), 32 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ead464a0f2e5..b8f220860dc7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6578,39 +6578,12 @@ const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
-static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+/* need get_online_cpus() and hold cfs_constraints_mutex */
+static void tg_switch_cfs_runtime(struct task_group *tg, u64 period, u64 quota)
 {
-	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
-
-	if (tg == &root_task_group)
-		return -EINVAL;
-
-	/*
-	 * Ensure we have at some amount of bandwidth every period.  This is
-	 * to prevent reaching a state of large arrears when throttled via
-	 * entity_tick() resulting in prolonged exit starvation.
-	 */
-	if (quota < min_cfs_quota_period || period < min_cfs_quota_period)
-		return -EINVAL;
-
-	/*
-	 * Likewise, bound things on the otherside by preventing insane quota
-	 * periods.  This also allows us to normalize in computing quota
-	 * feasibility.
-	 */
-	if (period > max_cfs_quota_period)
-		return -EINVAL;
-
-	/*
-	 * Prevent race between setting of cfs_rq->runtime_enabled and
-	 * unthrottle_offline_cfs_rqs().
-	 */
-	get_online_cpus();
-	mutex_lock(&cfs_constraints_mutex);
-	ret = __cfs_schedulable(tg, period, quota);
-	if (ret)
-		goto out_unlock;
+	int runtime_enabled, runtime_was_enabled;
+	int i;
 
 	runtime_enabled = quota != RUNTIME_INF;
 	runtime_was_enabled = cfs_b->quota != RUNTIME_INF;
@@ -6647,7 +6620,41 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	}
 	if (runtime_was_enabled && !runtime_enabled)
 		cfs_bandwidth_usage_dec();
-out_unlock:
+}
+
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+{
+	int ret = 0;
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	/*
+	 * Ensure we have at some amount of bandwidth every period.  This is
+	 * to prevent reaching a state of large arrears when throttled via
+	 * entity_tick() resulting in prolonged exit starvation.
+	 */
+	if (quota < min_cfs_quota_period || period < min_cfs_quota_period)
+		return -EINVAL;
+
+	/*
+	 * Likewise, bound things on the otherside by preventing insane quota
+	 * periods.  This also allows us to normalize in computing quota
+	 * feasibility.
+	 */
+	if (period > max_cfs_quota_period)
+		return -EINVAL;
+
+	/*
+	 * Prevent race between setting of cfs_rq->runtime_enabled and
+	 * unthrottle_offline_cfs_rqs().
+	 */
+	get_online_cpus();
+	mutex_lock(&cfs_constraints_mutex);
+	ret = __cfs_schedulable(tg, period, quota);
+	if (!ret)
+		tg_switch_cfs_runtime(tg, period, quota);
+
 	mutex_unlock(&cfs_constraints_mutex);
 	put_online_cpus();
 
-- 
2.17.1



* [PATCH 2/7] cgroup: introduce hook css_has_tasks_changed
  2019-04-08 21:45 [PATCH 0/7] introduce cpu.headroom knob to cpu controller Song Liu
  2019-04-08 21:45 ` [PATCH 1/7] sched: refactor tg_set_cfs_bandwidth() Song Liu
@ 2019-04-08 21:45 ` Song Liu
  2019-04-08 21:45 ` [PATCH 3/7] cgroup: introduce cgroup_parse_percentage Song Liu
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Song Liu @ 2019-04-08 21:45 UTC (permalink / raw)
  To: linux-kernel, cgroups
  Cc: mingo, peterz, vincent.guittot, tglx, morten.rasmussen,
	kernel-team, Song Liu

This patch introduces a new hook css_has_tasks_changed:

   void (*css_has_tasks_changed)(struct cgroup_subsys_state *css,
                                 bool has_tasks);

The hook is called when the cgroup gets its first task and when the cgroup
loses its last task. It is called under css_set_lock.

Note: has_tasks is different from populated. It only considers directly
attached tasks.
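
The intended firing rule (only on the first-task and last-task transitions)
can be modelled in userspace as below. This is a sketch under assumptions:
struct toy_cgroup and the toy_* helpers are invented for illustration and
do not exist in the kernel, and no locking is modelled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a cgroup; not a kernel structure. */
struct toy_cgroup {
	int nr_direct_tasks;	/* tasks attached directly to this group */
	int nr_notifications;	/* how many times the hook has fired */
	bool last_has_tasks;	/* argument of the most recent hook call */
};

/* Stand-in for a subsystem's css_has_tasks_changed callback. */
static void toy_has_tasks_changed(struct toy_cgroup *cg, bool has_tasks)
{
	cg->nr_notifications++;
	cg->last_has_tasks = has_tasks;
}

/* Fire the hook only when the group gets its first direct task. */
static void toy_attach_task(struct toy_cgroup *cg)
{
	if (cg->nr_direct_tasks++ == 0)
		toy_has_tasks_changed(cg, true);
}

/* Fire the hook only when the group loses its last direct task. */
static void toy_detach_task(struct toy_cgroup *cg)
{
	if (--cg->nr_direct_tasks == 0)
		toy_has_tasks_changed(cg, false);
}
```

Attaching a second task or detaching one of several tasks does not fire the
hook in this model, so a subsystem sees one call per empty/non-empty
transition.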

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/cgroup-defs.h |  2 ++
 kernel/cgroup/cgroup.c      | 14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1c70803e9f77..ba499ed5309c 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -594,6 +594,8 @@ struct cgroup_subsys {
 	void (*css_released)(struct cgroup_subsys_state *css);
 	void (*css_free)(struct cgroup_subsys_state *css);
 	void (*css_reset)(struct cgroup_subsys_state *css);
+	void (*css_has_tasks_changed)(struct cgroup_subsys_state *css,
+				      bool has_tasks);
 	void (*css_rstat_flush)(struct cgroup_subsys_state *css, int cpu);
 	int (*css_extra_stat_show)(struct seq_file *seq,
 				   struct cgroup_subsys_state *css);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 3f2b4bde0f9c..b0df96132476 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -200,6 +200,7 @@ static u16 have_fork_callback __read_mostly;
 static u16 have_exit_callback __read_mostly;
 static u16 have_release_callback __read_mostly;
 static u16 have_canfork_callback __read_mostly;
+static u16 have_has_tasks_changed_callback __read_mostly;
 
 /* cgroup namespace for init task */
 struct cgroup_namespace init_cgroup_ns = {
@@ -762,8 +763,11 @@ static bool css_set_populated(struct css_set *cset)
  */
 static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
 {
+	struct cgroup *orig_cgrp = cgrp;
 	struct cgroup *child = NULL;
 	int adj = populated ? 1 : -1;
+	struct cgroup_subsys *ss;
+	int ssid;
 
 	lockdep_assert_held(&css_set_lock);
 
@@ -788,6 +792,14 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
 		child = cgrp;
 		cgrp = cgroup_parent(cgrp);
 	} while (cgrp);
+
+	do_each_subsys_mask(ss, ssid, have_has_tasks_changed_callback) {
+		struct cgroup_subsys_state *css;
+
+		css = cgroup_css(orig_cgrp, ss);
+		if (css)
+			ss->css_has_tasks_changed(css, populated);
+	} while_each_subsys_mask();
 }
 
 /**
@@ -5370,6 +5382,8 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 	have_exit_callback |= (bool)ss->exit << ss->id;
 	have_release_callback |= (bool)ss->release << ss->id;
 	have_canfork_callback |= (bool)ss->can_fork << ss->id;
+	have_has_tasks_changed_callback |=
+		(bool)ss->css_has_tasks_changed << ss->id;
 
 	/* At system boot, before all subsystems have been
 	 * registered, no tasks have been forked, so we don't
-- 
2.17.1



* [PATCH 3/7] cgroup: introduce cgroup_parse_percentage
  2019-04-08 21:45 [PATCH 0/7] introduce cpu.headroom knob to cpu controller Song Liu
  2019-04-08 21:45 ` [PATCH 1/7] sched: refactor tg_set_cfs_bandwidth() Song Liu
  2019-04-08 21:45 ` [PATCH 2/7] cgroup: introduce hook css_has_tasks_changed Song Liu
@ 2019-04-08 21:45 ` Song Liu
  2019-04-08 21:45 ` [PATCH 4/7] sched, cgroup: add entry cpu.headroom Song Liu
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Song Liu @ 2019-04-08 21:45 UTC (permalink / raw)
  To: linux-kernel, cgroups
  Cc: mingo, peterz, vincent.guittot, tglx, morten.rasmussen,
	kernel-team, Song Liu

This patch introduces a helper to parse a percentage string into a long
integer, with a user-selected scale:

   long cgroup_parse_percentage(char *tok, unsigned long base)

Valid tok could be an integer 0 to 100, a decimal 0.00 to 100.00, or "max".
A tok of "max" is the same as "100".

Base is the desired output scale for input "1".
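
To illustrate the input format and the meaning of base, here is a userspace
sketch of such a parser. It is not the helper added by this patch: the name
parse_percentage() is hypothetical, it scans digits by hand instead of using
sscanf(), it returns -1 rather than -EINVAL, and it expects the caller to
have stripped any trailing newline.

```c
#include <ctype.h>
#include <string.h>

/*
 * Hypothetical userspace sketch, not the kernel helper. Accepts "0" to
 * "100", "0.0" to "100.00" (one or two fractional digits), or "max".
 * Returns the percentage scaled so that 1% == base, or -1 if invalid.
 */
static long parse_percentage(const char *tok, unsigned long base)
{
	unsigned long val_int = 0, val_frac = 0;
	int int_digits = 0, frac_digits = 0;

	if (strcmp(tok, "max") == 0)
		return (long)(base * 100);

	/* Integer part: one to three digits. */
	while (isdigit((unsigned char)*tok) && int_digits < 3) {
		val_int = val_int * 10 + (unsigned long)(*tok++ - '0');
		int_digits++;
	}
	if (int_digits == 0)
		return -1;

	/* Optional fractional part: one or two digits after '.'. */
	if (*tok == '.') {
		tok++;
		while (isdigit((unsigned char)*tok) && frac_digits < 2) {
			val_frac = val_frac * 10 + (unsigned long)(*tok++ - '0');
			frac_digits++;
		}
		if (frac_digits == 0)
			return -1;
		if (frac_digits == 1)
			val_frac *= 10;	/* "x.5" is 50 hundredths, not 5 */
	}

	/* Reject trailing characters and anything above 100.00. */
	if (*tok != '\0')
		return -1;
	if (val_int > 100 || (val_int == 100 && val_frac > 0))
		return -1;

	/* val_frac is in hundredths of a percent; round to nearest. */
	return (long)(val_int * base + (val_frac * base + 50) / 100);
}
```

With base 100, the input "12.5" yields 1250 and "max" yields 10000, while
"100.01" and "abc" are rejected.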

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup/cgroup.c | 37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 81f58b4a5418..b28f8a41c970 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -110,6 +110,7 @@ int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
 int cgroup_rm_cftypes(struct cftype *cfts);
 void cgroup_file_notify(struct cgroup_file *cfile);
+long cgroup_parse_percentage(char *tok, unsigned long base);
 
 int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
 int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index b0df96132476..2e48840ff613 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3967,6 +3967,43 @@ void cgroup_file_notify(struct cgroup_file *cfile)
 	spin_unlock_irqrestore(&cgroup_file_kn_lock, flags);
 }
 
+/**
+ * cgroup_parse_percentage - parse percentage number
+ * @tok: input string contains the token. Valid values are "00.00" to
+ *       "100.00", or "max"
+ * @base: number "1" in desired output scale
+ *
+ * Returns:
+ *    @base * 100 for "max";
+ *    @base * <0.00 to 100.00> for valid inputs;
+ *    -EINVAL for invalid input.
+ */
+long cgroup_parse_percentage(char *tok, unsigned long base)
+{
+	unsigned long val_int, val_frag;
+
+	if (strcmp(tok, "max") == 0) {
+		return base * 100;
+	} else if (sscanf(tok, "%lu.%02lu", &val_int, &val_frag) == 2) {
+		/* xx.1 yields val_frag = 1, while it should be 10 */
+		if (val_frag < 10 && strstr(tok, ".0") == NULL)
+			val_frag *= 10;
+		goto calculate_output;
+	} else if (sscanf(tok, "%lu", &val_int) == 1) {
+		val_frag = 0;
+		goto calculate_output;
+	} else {
+		return -EINVAL;
+	}
+
+calculate_output:
+	if (val_int > 100 || (val_int == 100 && val_frag > 0))
+		return -EINVAL;
+
+	/* round up val_frag by 0.5, to avoid repeated rounding down */
+	return (val_int * base) + div_u64((val_frag * 10 + 5) * base, 1000);
+}
+
 /**
  * css_next_child - find the next child of a given css
  * @pos: the current position (%NULL to initiate traversal)
-- 
2.17.1



* [PATCH 4/7] sched, cgroup: add entry cpu.headroom
  2019-04-08 21:45 [PATCH 0/7] introduce cpu.headroom knob to cpu controller Song Liu
                   ` (2 preceding siblings ...)
  2019-04-08 21:45 ` [PATCH 3/7] cgroup: introduce cgroup_parse_percentage Song Liu
@ 2019-04-08 21:45 ` Song Liu
  2019-04-08 21:45 ` [PATCH 5/7] sched/fair: global idleness counter for cpu.headroom Song Liu
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Song Liu @ 2019-04-08 21:45 UTC (permalink / raw)
  To: linux-kernel, cgroups
  Cc: mingo, peterz, vincent.guittot, tglx, morten.rasmussen,
	kernel-team, Song Liu

This patch introduces a new cgroup cpu controller knob: cpu.headroom.
cpu.headroom is a mechanism for latency-sensitive applications (or the
main workload) to enforce latency requirements by reserving cpu cycle
headroom. Only the latency-sensitive application can use the cpu cycle
headroom.

The cpu controller has two existing knobs: cpu.weight(.nice) and cpu.max.
These two knobs are not sufficient to ensure latency of the main workload.
With interference from shared CPU and scheduling resources, assigning a
very low cpu.weight to the side workload cannot guarantee good latency of
the main workload. While cpu.max provides a mechanism to throttle the side
workload, it cannot react to changes in the load level of the main
workload. For example, on a system where the main workload consumes 50% to
70% of overall CPU, it is necessary to adjust cpu.max for the side workload
according to the main workload traffic. When the main workload consumes
50%, the side workload should get cpu.max of 25%; when the main workload
consumes 70%, the side workload should only get cpu.max of 5%.
cpu.headroom, on the other hand, can react fast when the load level changes.
In the example above, we can set cpu.headroom of the main workload to 25%.
Then, the side workload would honor this headroom, and adjust its cpu.max
accordingly.

For example, consider a system with two cgroups: cgA with cpu.headroom of
30% and cgB with cpu.headroom of 0%. 30% of the cpu cycles are reserved for
cgA. If cgA uses 50% of cpu cycles, cgB will run at most 20%. If cgA uses
more than 70% of cpu cycles, cgB will only run with a configurable small
bandwidth, to avoid starvation and deadlock. This configurable small
bandwidth is referred to as tolerance.

cpu.headroom knob has two percentage numbers: headroom and tolerance.
headroom is how much idle cpu this cgroup would claim. Other cgroups with
lower headrooms are throttled to preserve this much idle cpu for this
cgroup. When the system is running into the headroom (idle < headroom),
cgroups with less than maximal headroom will be throttled down to use at
most tolerance % cpu. In other words, cgroup with maximal headroom could
tolerate other cgroups to run at tolerance % cpu.

Here is an example of how the knob is used. To configure headroom of 40% and
tolerance of 10%:

    root@virt-test:/sys/fs/cgroup/t# echo 40 10 > cpu.headroom

To show the setting:

    root@virt-test:/sys/fs/cgroup/t# cat cpu.headroom
    40.00 10.00

It is possible to configure a cgroup with headroom of "max", which means
exemption from throttling due to other cgroups' cpu.headroom. The user can
also configure a cgroup with tolerance of "max", which means do not really
throttle other tasks.

Similar to cpu.max, the tolerance % cpu works hierarchically. For example,
in the configuration below, when the main workload runs into its headroom,
both side-A and side-B get at most 10% of runtime. However, their parent
"side-workload" also gets at most 10% of runtime. Therefore, side-A and
side-B combined can consume at most 10% of CPU.

     side-workload       main-workload (tolerance 10%)
         /    \
    side-A    side-B

Typical cgroup configuration of cpu.headroom looks like:
 1. A main-workload slice with workload specific cpu.headroom (e.g.
 20.00 3.00), containing latency sensitive applications (e.g. web
 server);
 2. A side-workload slice with cpu.headroom of "0.00 max" (default
 setting), containing batch workloads that are insensitive to long
 latencies.

cpu.headroom achieves the throttling of side workload by computing a
"target idle percentage" number for side-workload. In the example above,
the side-workload slice has target idle percentage of 20%. If global idle
cpu is below 20%, side-workload slice can only run 3%.

With cgroup hierarchies, child cgroups cannot have more cpu.headroom
than the parent cgroup. Parent cgroups without directly attached tasks will
only claim the max headroom of their children.
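
The hierarchy rules and the target idle percentage can be illustrated with
a small userspace model. This is only a sketch: struct toy_tg and the toy_*
functions are hypothetical, headroom values are plain percentages, and
neither tolerance nor the "max" exemption is modelled. The target idle
formula follows the comment before cpu_headroom_update_config().

```c
/* Hypothetical stand-in for a task_group; not a kernel structure. */
struct toy_tg {
	struct toy_tg *parent;
	struct toy_tg *children[4];
	int nr_children;
	int has_tasks;				/* directly attached tasks? */
	unsigned long configured_headroom;	/* written by the user */
	unsigned long allowed_headroom;		/* capped by the parent */
	unsigned long effective_headroom;	/* what the group claims */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* Down pass: a child may not claim more headroom than its parent allows. */
static void toy_configure_down(struct toy_tg *tg)
{
	int i;

	tg->allowed_headroom = tg->parent ?
		min_ul(tg->configured_headroom, tg->parent->allowed_headroom) :
		tg->configured_headroom;
	for (i = 0; i < tg->nr_children; i++)
		toy_configure_down(tg->children[i]);
}

/* Up pass: a group without direct tasks claims the max of its children. */
static void toy_configure_up(struct toy_tg *tg)
{
	unsigned long best = 0;
	int i;

	for (i = 0; i < tg->nr_children; i++) {
		toy_configure_up(tg->children[i]);
		best = max_ul(best, tg->children[i]->effective_headroom);
	}
	tg->effective_headroom = tg->has_tasks ? tg->allowed_headroom : best;
}

/* Idle CPU the group must leave: global headroom minus its own claim. */
static unsigned long toy_target_idle(unsigned long global_headroom,
				     const struct toy_tg *tg)
{
	return global_headroom - tg->effective_headroom;
}
```

For a task-less parent configured with 20% and two children configured with
30% and 10%, the first child is capped to 20%, the parent claims 20%, and
with a global headroom of 20% the second child gets a target idle of 10%.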

Tasks attached to root_task_group are exempted from throttling.

For more details of how headroom and tolerance are used, please refer to
comments before cpu_headroom_update_config().

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 kernel/sched/core.c  | 358 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/fair.c  |  12 ++
 kernel/sched/sched.h |  26 ++++
 3 files changed, 392 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8f220860dc7..3cfd8e009ae6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6579,14 +6579,17 @@ const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
 /* need get_online_cpus() and hold cfs_constraints_mutex */
-static void tg_switch_cfs_runtime(struct task_group *tg, u64 period, u64 quota)
+static void tg_switch_cfs_runtime(struct task_group *tg, u64 period, u64 quota,
+				  unsigned long target_idle,
+				  unsigned long min_runtime)
 {
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
 	int runtime_enabled, runtime_was_enabled;
 	int i;
 
-	runtime_enabled = quota != RUNTIME_INF;
-	runtime_was_enabled = cfs_b->quota != RUNTIME_INF;
+	runtime_enabled = (quota != RUNTIME_INF) || (target_idle != 0);
+	runtime_was_enabled = (cfs_b->quota != RUNTIME_INF) ||
+		(cfs_b->target_idle != 0);
 	/*
 	 * If we need to toggle cfs_bandwidth_used, off->on must occur
 	 * before making related changes, and on->off must occur afterwards
@@ -6596,6 +6599,8 @@ static void tg_switch_cfs_runtime(struct task_group *tg, u64 period, u64 quota)
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
+	cfs_b->target_idle = target_idle;
+	cfs_b->min_runtime = min_runtime;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
@@ -6653,7 +6658,9 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	mutex_lock(&cfs_constraints_mutex);
 	ret = __cfs_schedulable(tg, period, quota);
 	if (!ret)
-		tg_switch_cfs_runtime(tg, period, quota);
+		tg_switch_cfs_runtime(tg, period, quota,
+				      tg->cfs_bandwidth.target_idle,
+				      tg->cfs_bandwidth.min_runtime);
 
 	mutex_unlock(&cfs_constraints_mutex);
 	put_online_cpus();
@@ -7042,6 +7049,340 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
 		ret = tg_set_cfs_bandwidth(tg, period, quota);
 	return ret ?: nbytes;
 }
+
+/*
+ * Configure headroom, down pass, cap children's allowed value with
+ * parent's values:
+ *
+ *   allowed_headroom = min(configured_headroom,
+ *                              parent->allowed_headroom);
+ *   allowed_tolerance = max(configured_tolerance,
+ *                              parent->allowed_tolerance);
+ */
+static int cpu_headroom_configure_down(struct task_group *tg, void *data)
+{
+	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+
+	/* skip non-cgroup task_group and root cgroup */
+	if (!tg->css.cgroup || !tg->parent)
+		return 0;
+
+	cfs_b->allowed_headroom =
+		min_t(unsigned long, cfs_b->configured_headroom,
+		      tg->parent->cfs_bandwidth.allowed_headroom);
+	cfs_b->allowed_tolerance =
+		max_t(unsigned long, cfs_b->configured_tolerance,
+		      tg->parent->cfs_bandwidth.allowed_tolerance);
+	return 0;
+}
+
+/*
+ * Configure headroom, up pass, calculate effective values only for
+ * populated cgroups:
+ *   if (cgroup has tasks) {
+ *        effective_headroom = allowed_headroom
+ *        effective_tolerance = allowed_tolerance
+ *   } else {
+ *        effective_headroom =  max(children's effective_headroom)
+ *        effective_tolerance =  min(children's effective_tolerance)
+ *   }
+ */
+static int cpu_headroom_configure_up(struct task_group *tg, void *data)
+{
+	struct cfs_bandwidth *cfs_b_p;
+
+	/* skip non-cgroup task_group and root cgroup */
+	if (!tg->css.cgroup || !tg->parent)
+		return 0;
+
+	cfs_b_p = &tg->cfs_bandwidth;
+
+	if (tg->css.cgroup->nr_populated_csets > 0) {
+		cfs_b_p->effective_headroom = cfs_b_p->allowed_headroom;
+		cfs_b_p->effective_tolerance = cfs_b_p->allowed_tolerance;
+	} else {
+		struct task_group *child;
+
+		cfs_b_p->effective_headroom = 0;
+		cfs_b_p->effective_tolerance = CFS_BANDWIDTH_MAX_HEADROOM;
+
+		list_for_each_entry_rcu(child, &tg->children, siblings) {
+			struct cfs_bandwidth *cfs_b_c = &child->cfs_bandwidth;
+
+			cfs_b_p->effective_headroom =
+				max_t(unsigned long,
+				      cfs_b_p->effective_headroom,
+				      cfs_b_c->effective_headroom);
+
+			cfs_b_p->effective_tolerance =
+				min_t(unsigned long,
+				      cfs_b_p->effective_tolerance,
+				      cfs_b_c->effective_tolerance);
+		}
+	}
+
+	return 0;
+}
+
+/* update target_idle, down pass */
+static int cpu_headroom_target_idle_down(struct task_group *tg, void *data)
+{
+	struct cfs_bandwidth *cfs_b_root = &root_task_group.cfs_bandwidth;
+	unsigned long root_headroom = cfs_b_root->effective_headroom;
+	unsigned long root_tolerance = cfs_b_root->effective_tolerance;
+	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+	unsigned long target_idle;
+
+	/* skip non-cgroup task_group and root cgroup */
+	if (!tg->css.cgroup || !tg->parent)
+		return 0;
+
+	if (cfs_b->effective_headroom == CFS_BANDWIDTH_MAX_HEADROOM)
+		target_idle = 0;
+	else
+		target_idle = root_headroom -
+			tg->cfs_bandwidth.effective_headroom;
+
+	tg_switch_cfs_runtime(tg, cfs_b->period, cfs_b->quota,
+			      target_idle, root_tolerance);
+	return 0;
+}
+
+/*
+ * Calculate global max headroom and tolerance based on effective_* values
+ * of top task_groups. This global max headroom is stored in
+ * root_task_group.
+ *
+ * If new global max headroom is different from previous settings, update
+ * target_idle for all task_groups.
+ *
+ * Returns whether target_idle are updated.
+ */
+static bool cpu_headroom_calculate_global_headroom(void)
+{
+	struct cfs_bandwidth *cfs_b_root = &root_task_group.cfs_bandwidth;
+	unsigned long tolerance = CFS_BANDWIDTH_MAX_HEADROOM;
+	unsigned long headroom = 0;
+	bool update_target_idle;
+	struct task_group *tg;
+
+	list_for_each_entry_rcu(tg, &root_task_group.children, siblings) {
+		struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+
+		/* skip "max", which means exempt from throttle */
+		if (cfs_b->effective_headroom == CFS_BANDWIDTH_MAX_HEADROOM)
+			continue;
+		if (cfs_b->effective_headroom > headroom) {
+			headroom = cfs_b->effective_headroom;
+			tolerance = cfs_b->effective_tolerance;
+		} else if (cfs_b->effective_headroom == headroom) {
+			if (cfs_b->effective_tolerance < tolerance)
+				tolerance = cfs_b->effective_tolerance;
+		}
+	}
+	update_target_idle =
+		(cfs_b_root->effective_headroom != headroom) ||
+		(cfs_b_root->effective_tolerance != tolerance);
+	cfs_b_root->effective_headroom = headroom;
+	cfs_b_root->effective_tolerance = tolerance;
+	if (update_target_idle)
+		walk_tg_tree(cpu_headroom_target_idle_down, tg_nop, NULL);
+
+	return update_target_idle;
+}
+
+/*
+ * Update allowed_*, effective_*, target_idle, and min_runtime. This is
+ * called when cpu.headroom configuration changes.
+ *
+ * headroom is how much idle cpu this tg would claim. Other tgs with lower
+ * headrooms are throttled to preserve this much idle cpu for this tg.
+ *
+ * When the system is running into the headroom (idle < headroom),
+ * tolerance is cpu _other_ tgs are throttled down to. In other words,
+ * this tg could tolerate other tgs to run tolerance % cpu.
+ *
+ * Note that, higher headroom and lower tolerance indicates more
+ * aggressive throttling of other task_groups. By default, each cgroup
+ * gets 0% headroom and max tolerance, which means no headroom. For
+ * headroom to be effective, the user has to configure both headroom and
+ * tolerance.
+ *
+ * Here is an example:
+ *    workload-slice: headroom = 30% tolerance = 5%
+ *    system-slice:   headroom =  0% tolerance = max
+ *
+ * In this configuration, system-slice will be throttled to try to give
+ * workload-slice 30%: if workload-slice uses 50% cpu, system-slice will
+ * use at most 20%. In case the system runs into the headroom, e.g.
+ * workload-slice uses 80% cpu, system-slice will use at most 5%
+ * (throttled).
+ *
+ * The throttling is achieved by assigning target_idle and min_runtime.
+ * In this example, the setting looks like:
+ *    workload-slice: headroom    = 30%  tolerance    = 5%
+ *                    target_idle =  0%  min_runtime = max
+ *    system-slice:   headroom    =  0%  tolerance    = max
+ *                    target_idle = 30%  min_runtime = 5%
+ *
+ * Note that, when target_idle is 0%, value of min_runtime is ignored.
+ * Also, when headroom is 0%, tolerance is ignored.
+ *
+ * headroom and tolerance follows the following hierarchical rules.
+ *   1. task_group may not use higher headroom than parent task_group;
+ *   2. task_group may not use lower tolerance than parent task_group;
+ *   3. headroom and tolerance of task_groups without directly
+ *      attached tasks are not considered effective.
+ *
+ * To follow these rules, we expand each of headroom and tolerance into 3
+ * variables configured_, allowed_, and effective_.
+ *
+ * configured_headroom is directly assigned by user. As a child task_group
+ * may not have higher headroom than the parent, allowed_headroom is the
+ * minimum of configured_headroom and parent->allowed_headroom.
+ * effective_headroom only considers task_groups with directly attached
+ * tasks. For task_groups with directly attached tasks, effective_headroom
+ * is the same as allowed_headroom; for task_groups without directly
+ * attached tasks, effective_headroom is the maximum of all child
+ * task_groups' effective_headroom.
+ *
+ * tolerance follows the same logic as headroom, except that tolerance
+ * uses minimum where headroom uses maximum, and vice versa.
+ *
+ * When headroom and tolerance are calculated for all task_groups, we pick
+ * the highest effective_headroom and corresponding effective_tolerance,
+ * namely, global_headroom and global_tolerance. Then all task_groups are
+ * assigned with target_idle and min_runtime as:
+ *     tg->target_idle =
+ *         global_headroom - tg->effective_headroom
+ *     tg->min_runtime = global_tolerance
+ *
+ * Note that a tg whose effective_headroom equals global_headroom has a
+ * target_idle of 0%, which means no throttling.
+ *
+ * In summary, target_idle and min_runtime are calculated by the following
+ * steps:
+ *    1. Walk down tg tree and calculate allowed_* and effective_* values;
+ *    2. Walk up tg tree and calculate effective_* values;
+ *    3. Find global_headroom and global_tolerance;
+ *    4. If necessary, update target_idle and min_runtime.
+ *
+ * For changes of cpu.headroom configurations, we need all these steps;
+ * for changes in task_group has_task, we only need steps 2 to 4.
+ */
+static void cpu_headroom_update_config(struct task_group *tg,
+				       bool config_change)
+{
+	struct task_group *orig_tg = tg;
+
+	get_online_cpus();
+	mutex_lock(&cfs_constraints_mutex);
+	rcu_read_lock();
+
+	/*
+	 * If this is a configuration change, update allowed_* and
+	 * effective_* values from this tg down.
+	 */
+	if (config_change)
+		walk_tg_tree_from(tg, cpu_headroom_configure_down,
+				  cpu_headroom_configure_up, NULL);
+
+	/* Update effective_* values from this tg up */
+	while (tg) {
+		cpu_headroom_configure_up(tg, NULL);
+		tg = tg->parent;
+	}
+
+	/*
+	 * Update global headroom, and (if necessary) target_idle.
+	 *
+	 * If target_idle is not updated for all task_groups, at least
+	 * update it from this task_group down.
+	 */
+	if (!cpu_headroom_calculate_global_headroom())
+		walk_tg_tree_from(orig_tg, cpu_headroom_target_idle_down,
+				  tg_nop, NULL);
+
+	rcu_read_unlock();
+	mutex_unlock(&cfs_constraints_mutex);
+	put_online_cpus();
+}
+
+void cfs_bandwidth_has_tasks_changed_work(struct work_struct *work)
+{
+	struct cfs_bandwidth *cfs_b = container_of(work, struct cfs_bandwidth,
+						   has_tasks_changed_work);
+	struct task_group *tg = container_of(cfs_b, struct task_group,
+					     cfs_bandwidth);
+
+	cpu_headroom_update_config(tg, false);
+}
+
+/*
+ * For has_task updates, no change to allowed_* values. Only update
+ * effective_* values and calculate global headroom
+ */
+static void
+cpu_cgroup_css_has_tasks_changed(struct cgroup_subsys_state *css,
+				 bool has_tasks)
+{
+	struct task_group *tg = css_tg(css);
+
+	/*
+	 * We hold css_set_lock here, so we cannot call
+	 * cpu_headroom_update_config(), which calls mutex_lock() and
+	 * get_online_cpus(). Instead, call cpu_headroom_update_config()
+	 * in an async work.
+	 */
+	schedule_work(&tg->cfs_bandwidth.has_tasks_changed_work);
+}
+
+static int cpu_headroom_show(struct seq_file *sf, void *v)
+{
+	struct task_group *tg = css_tg(seq_css(sf));
+	unsigned long val;
+
+	val = tg->cfs_bandwidth.configured_headroom;
+
+	if (val == CFS_BANDWIDTH_MAX_HEADROOM)
+		seq_printf(sf, "max ");
+	else
+		seq_printf(sf, "%lu.%02lu ", LOAD_INT(val), LOAD_FRAC(val));
+
+	val = tg->cfs_bandwidth.configured_tolerance;
+
+	if (val == CFS_BANDWIDTH_MAX_HEADROOM)
+		seq_printf(sf, "max\n");
+	else
+		seq_printf(sf, "%lu.%02lu\n", LOAD_INT(val), LOAD_FRAC(val));
+	return 0;
+}
+
+static ssize_t cpu_headroom_write(struct kernfs_open_file *of,
+				  char *buf, size_t nbytes, loff_t off)
+{
+	long headroom, tolerance;
+	struct task_group *tg = css_tg(of_css(of));
+	char tok_a[7], tok_b[7]; /* longest valid input is 100.00 */
+
+	if (sscanf(buf, "%6s %6s", tok_a, tok_b) != 2)
+		return -EINVAL;
+
+	headroom = cgroup_parse_percentage(tok_a, FIXED_1);
+	if (headroom < 0)
+		return -EINVAL;
+
+	tolerance = cgroup_parse_percentage(tok_b, FIXED_1);
+	if (tolerance < 0)
+		return -EINVAL;
+
+	tg->cfs_bandwidth.configured_headroom = headroom;
+	tg->cfs_bandwidth.configured_tolerance = tolerance;
+
+	cpu_headroom_update_config(tg, true);
+
+	return nbytes;
+}
 #endif
 
 static struct cftype cpu_files[] = {
@@ -7066,6 +7407,12 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_max_show,
 		.write = cpu_max_write,
 	},
+	{
+		.name = "headroom",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_headroom_show,
+		.write = cpu_headroom_write,
+	},
 #endif
 	{ }	/* terminate */
 };
@@ -7079,6 +7426,9 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,
+#ifdef CONFIG_CFS_BANDWIDTH
+	.css_has_tasks_changed = cpu_cgroup_css_has_tasks_changed,
+#endif
 	.legacy_cftypes	= cpu_legacy_files,
 	.dfl_cftypes	= cpu_files,
 	.early_init	= true,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea74d43924b2..65aa9d3b665f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4915,6 +4915,18 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
+	if (cfs_b == &root_task_group.cfs_bandwidth) {
+		/* allowed_* values for root tg */
+		cfs_b->allowed_headroom = CFS_BANDWIDTH_MAX_HEADROOM;
+		cfs_b->allowed_tolerance = 0;
+	} else {
+		/* default configured_* values for other tg */
+		cfs_b->configured_headroom = 0;
+		cfs_b->configured_tolerance = CFS_BANDWIDTH_MAX_HEADROOM;
+	}
+	INIT_WORK(&cfs_b->has_tasks_changed_work,
+		  cfs_bandwidth_has_tasks_changed_work);
+
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efa686eeff26..9309bf05ff0c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -154,6 +154,8 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
  */
 #define RUNTIME_INF		((u64)~0ULL)
 
+#define CFS_BANDWIDTH_MAX_HEADROOM (100UL << FSHIFT)	/* 100% */
+
 static inline int idle_policy(int policy)
 {
 	return policy == SCHED_IDLE;
@@ -334,6 +336,10 @@ struct rt_rq;
 
 extern struct list_head task_groups;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+extern void cfs_bandwidth_has_tasks_changed_work(struct work_struct *work);
+#endif
+
 struct cfs_bandwidth {
 #ifdef CONFIG_CFS_BANDWIDTH
 	raw_spinlock_t		lock;
@@ -344,6 +350,26 @@ struct cfs_bandwidth {
 	u64			runtime_expires;
 	int			expires_seq;
 
+	/*
+	 * The following values are all fixed-point. For more information
+	 * about these values, please refer to comments before
+	 * cpu_headroom_update_config().
+	 */
+	/* values configured by user */
+	unsigned long		configured_headroom;
+	unsigned long		configured_tolerance;
+	/* values capped by configuration of parent group */
+	unsigned long		allowed_headroom;
+	unsigned long		allowed_tolerance;
+	/* effective values for cgroups with tasks */
+	unsigned long		effective_headroom;
+	unsigned long		effective_tolerance;
+	/* values calculated for runtime based throttling */
+	unsigned long		target_idle;
+	unsigned long		min_runtime;
+	/* work_struct to adjust settings asynchronously */
+	struct work_struct	has_tasks_changed_work;
+
 	short			idle;
 	short			period_active;
 	struct hrtimer		period_timer;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/7] sched/fair: global idleness counter for cpu.headroom
  2019-04-08 21:45 [PATCH 0/7] introduce cpu.headroom knob to cpu controller Song Liu
                   ` (3 preceding siblings ...)
  2019-04-08 21:45 ` [PATCH 4/7] sched, cgroup: add entry cpu.headroom Song Liu
@ 2019-04-08 21:45 ` Song Liu
  2019-04-08 21:45 ` [PATCH 6/7] sched/fair: throttle task runtime based on cpu.headroom Song Liu
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Song Liu @ 2019-04-08 21:45 UTC (permalink / raw)
  To: linux-kernel, cgroups
  Cc: mingo, peterz, vincent.guittot, tglx, morten.rasmussen,
	kernel-team, Song Liu

This patch introduces a global idleness counter in fair.c for the
cpu.headroom knob. This counter is based on per cpu get_idle_time().

The counter is used via function call:

  unsigned long cfs_global_idleness_update(u64 now, u64 period);

The function returns global idleness as a fixed-point percentage since the
previous call of the function. If the time between the previous call and
@now is shorter than @period, the function returns the idleness calculated
in the previous call.

Because cfs_global_idleness_update() will be called from a non-preemptible
context, struct cfs_global_idleness uses raw_spin_lock instead of
spin_lock.
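
As a rough illustration of the arithmetic, here is a userspace sketch. It
is not the kernel code: the struct and function names are made up for this
example, the summed per-CPU idle time and the CPU count are passed in as
parameters instead of being read via get_idle_time(), and locking is
omitted.

```c
#include <stdint.h>

#define FSHIFT 11	/* fixed-point shift; same value the kernel uses for loadavg */

/* Hypothetical userspace stand-in for struct cfs_global_idleness. */
struct idleness_model {
	uint64_t prev_total_idle;	/* idle ns summed over all CPUs */
	uint64_t prev_timestamp;	/* ns */
	unsigned long idle_percent;	/* fixed-point percentage */
};

/*
 * Return idleness since the previous update as a fixed-point percentage.
 * If less than @period has passed since the previous update, return the
 * cached value instead of recomputing.
 */
static unsigned long idleness_update(struct idleness_model *m, uint64_t now,
				     uint64_t period, uint64_t total_idle,
				     unsigned int ncpus)
{
	uint64_t delta_idle;

	/* Fastpath: updated within the last period. */
	if (m->prev_timestamp + period >= now)
		return m->idle_percent;

	/* Slowpath: average idleness over all CPUs since prev_timestamp. */
	delta_idle = total_idle - m->prev_total_idle;
	m->idle_percent = ((delta_idle << FSHIFT) * 100) /
			  ((uint64_t)ncpus * (now - m->prev_timestamp));
	m->prev_total_idle = total_idle;
	m->prev_timestamp = now;
	return m->idle_percent;
}
```

With 4 CPUs, 100ms elapsed and 120ms of summed idle time, this yields
30 << FSHIFT, i.e. 30% idle.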

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 fs/proc/stat.c              |  4 +--
 include/linux/kernel_stat.h |  2 ++
 kernel/sched/fair.c         | 64 +++++++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 80c305f206bb..b327ffdb169f 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -23,7 +23,7 @@
 
 #ifdef arch_idle_time
 
-static u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
+u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
 {
 	u64 idle;
 
@@ -45,7 +45,7 @@ static u64 get_iowait_time(struct kernel_cpustat *kcs, int cpu)
 
 #else
 
-static u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
+u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
 {
 	u64 idle, idle_usecs = -1ULL;
 
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 7ee2bb43b251..337135272391 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -97,4 +97,6 @@ extern void account_process_tick(struct task_struct *, int user);
 
 extern void account_idle_ticks(unsigned long ticks);
 
+u64 get_idle_time(struct kernel_cpustat *kcs, int cpu);
+
 #endif /* _LINUX_KERNEL_STAT_H */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 65aa9d3b665f..49c68daffe7e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -116,6 +116,62 @@ static unsigned int capacity_margin			= 1280;
  * (default: 5 msec, units: microseconds)
  */
 unsigned int sysctl_sched_cfs_bandwidth_slice		= 5000UL;
+
+/* tracking global idleness for cpu.headroom */
+struct cfs_global_idleness {
+	u64		prev_total_idle_time;
+	u64		prev_timestamp;
+	unsigned long	idle_percent; /* fixed-point */
+	raw_spinlock_t	lock;
+};
+
+static struct cfs_global_idleness global_idleness;
+
+/*
+ * Calculate global idleness as a fixed-point percentage since the
+ * previous call of the function. If the time between the previous call
+ * and @now is shorter than @period, return the idleness calculated in
+ * the previous call.
+ */
+static unsigned long cfs_global_idleness_update(u64 now, u64 period)
+{
+	u64 prev_timestamp, total_idle_time, delta_idle_time;
+	unsigned long idle_percent;
+	int cpu;
+
+	/*
+	 * Fastpath: if idleness has been updated within the last period
+	 * of time, just return previous idleness.
+	 */
+	prev_timestamp = READ_ONCE(global_idleness.prev_timestamp);
+	if (prev_timestamp + period >= now)
+		return READ_ONCE(global_idleness.idle_percent);
+
+	raw_spin_lock_irq(&global_idleness.lock);
+	if (global_idleness.prev_timestamp + period >= now) {
+		idle_percent = global_idleness.idle_percent;
+		goto out;
+	}
+
+	/* Slowpath: calculate the average idleness since prev_timestamp */
+	total_idle_time = 0;
+	for_each_online_cpu(cpu)
+		total_idle_time += get_idle_time(&kcpustat_cpu(cpu), cpu);
+
+	delta_idle_time = total_idle_time -
+		global_idleness.prev_total_idle_time;
+
+	idle_percent = div64_u64((delta_idle_time << FSHIFT) * 100,
+				 num_online_cpus() *
+				 (now - global_idleness.prev_timestamp));
+
+	WRITE_ONCE(global_idleness.prev_total_idle_time, total_idle_time);
+	WRITE_ONCE(global_idleness.prev_timestamp, now);
+	WRITE_ONCE(global_idleness.idle_percent, idle_percent);
+out:
+	raw_spin_unlock_irq(&global_idleness.lock);
+	return idle_percent;
+}
 #endif
 
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
@@ -4293,6 +4349,11 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 	cfs_b->runtime = cfs_b->quota;
 	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
 	cfs_b->expires_seq++;
+
+	if (cfs_b->target_idle == 0)
+		return;
+
+	cfs_global_idleness_update(now, cfs_b->period);
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -10676,4 +10737,7 @@ __init void init_sched_fair_class(void)
 #endif
 #endif /* SMP */
 
+#ifdef CONFIG_CFS_BANDWIDTH
+	raw_spin_lock_init(&global_idleness.lock);
+#endif
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 6/7] sched/fair: throttle task runtime based on cpu.headroom
  2019-04-08 21:45 [PATCH 0/7] introduce cpu.headroom knob to cpu controller Song Liu
                   ` (4 preceding siblings ...)
  2019-04-08 21:45 ` [PATCH 5/7] sched/fair: global idleness counter for cpu.headroom Song Liu
@ 2019-04-08 21:45 ` Song Liu
  2019-04-08 21:45 ` [PATCH 7/7] Documentation: cgroup-v2: add information for cpu.headroom Song Liu
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Song Liu @ 2019-04-08 21:45 UTC (permalink / raw)
  To: linux-kernel, cgroups
  Cc: mingo, peterz, vincent.guittot, tglx, morten.rasmussen,
	kernel-team, Song Liu

This patch enables task runtime throttling based on the cpu.headroom
setting. The throttling leverages the same mechanism as the cpu.max knob.
Task groups with non-zero target_idle get throttled.

In __refill_cfs_bandwidth_runtime(), global idleness measured by function
cfs_global_idleness_update() is compared against target_idle of the task
group. If the measured idleness is lower than the target, the runtime of
this task group is reduced by the shortfall, but not below min_runtime.

A new variable "prev_runtime" is added to struct cfs_bandwidth, so that
the new runtime can be adjusted accordingly.
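
The refill step can be modelled in userspace roughly as follows. This is a
sketch under assumptions, not the patch itself: struct bw_model, refill()
and the floor_ns parameter (standing in for min_cfs_quota_period) are
names invented for the example, the CPU count is passed in explicitly, and
locking is omitted.

```c
#include <stdint.h>

#define FSHIFT 11	/* fixed-point shift; same value the kernel uses for loadavg */

/* Hypothetical subset of struct cfs_bandwidth used by this sketch. */
struct bw_model {
	uint64_t quota;			/* ns per period; UINT64_MAX = no cpu.max limit */
	uint64_t period;		/* ns */
	unsigned long target_idle;	/* fixed-point percentage */
	unsigned long min_runtime;	/* fixed-point percentage */
	uint64_t prev_runtime;		/* ns assigned in the previous period */
};

/* Convert a fixed-point percentage of total CPU time into nanoseconds. */
static uint64_t pct_to_ns(uint64_t period, unsigned int ncpus, unsigned long pct)
{
	return (period * ncpus * pct / 100) >> FSHIFT;
}

/* Compute the runtime for the next period from the measured idleness. */
static uint64_t refill(struct bw_model *b, unsigned long idle_pct,
		       unsigned int ncpus, uint64_t floor_ns)
{
	uint64_t max_rt, min_rt, idle_ns, target_ns, rt;

	max_rt = pct_to_ns(b->period, ncpus,
			   (100UL << FSHIFT) - b->target_idle);
	if (b->quota < max_rt)
		max_rt = b->quota;

	if (idle_pct >= b->target_idle && max_rt == b->prev_runtime) {
		/* Enough idle time and not throttled before: no throttling. */
		rt = max_rt;
	} else {
		idle_ns = pct_to_ns(b->period, ncpus, idle_pct);
		target_ns = pct_to_ns(b->period, ncpus, b->target_idle);
		min_rt = pct_to_ns(b->period, ncpus, b->min_runtime);
		if (min_rt < floor_ns)
			min_rt = floor_ns;

		if (b->prev_runtime + idle_ns < target_ns) {
			rt = min_rt;
		} else {
			/* Shift runtime by the idle surplus or shortfall. */
			rt = b->prev_runtime + idle_ns - target_ns;
			if (rt > max_rt)
				rt = max_rt;
			if (rt < min_rt)
				rt = min_rt;
		}
	}
	b->prev_runtime = rt;
	return rt;
}
```

For example, with 4 CPUs, a 100ms period, target_idle of 30% and
min_runtime of 5%, a group that previously ran unthrottled (280ms of
runtime per period) and now sees 10% measured idleness gets 200ms for the
next period. Once idleness returns to 30%, the runtime stays at 200ms.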

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 kernel/sched/fair.c  | 69 +++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h |  4 +++
 2 files changed, 66 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49c68daffe7e..3b0535cda7cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4331,6 +4331,16 @@ static inline u64 sched_cfs_bandwidth_slice(void)
 	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
 }
 
+static inline bool cfs_bandwidth_throttling_on(struct cfs_bandwidth *cfs_b)
+{
+	return cfs_b->quota != RUNTIME_INF || cfs_b->target_idle != 0;
+}
+
+static inline u64 cfs_bandwidth_pct_to_ns(u64 period, unsigned long pct)
+{
+	return div_u64(period * num_online_cpus() * pct, 100) >> FSHIFT;
+}
+
 /*
  * Replenish runtime according to assigned quota and update expiration time.
  * We use sched_clock_cpu directly instead of rq->clock to avoid adding
@@ -4340,9 +4350,12 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  */
 void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 {
+	/* runtimes in nanoseconds */
+	u64 idle_time, target_idle_time, max_runtime, min_runtime;
+	unsigned long idle_pct;
 	u64 now;
 
-	if (cfs_b->quota == RUNTIME_INF)
+	if (!cfs_bandwidth_throttling_on(cfs_b))
 		return;
 
 	now = sched_clock_cpu(smp_processor_id());
@@ -4353,7 +4366,49 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->target_idle == 0)
 		return;
 
-	cfs_global_idleness_update(now, cfs_b->period);
+	/*
+	 * max_runtime is the maximal possible runtime for given
+	 * target_idle and quota. In other words:
+	 *     max_runtime = min(quota,
+	 *                       total_time * (100% - target_idle))
+	 */
+	max_runtime = min_t(u64, cfs_b->quota,
+			    cfs_bandwidth_pct_to_ns(cfs_b->period,
+						    (100 << FSHIFT) - cfs_b->target_idle));
+	idle_pct = cfs_global_idleness_update(now, cfs_b->period);
+
+	/*
+	 * Throttle runtime if idle_pct is less than target_idle:
+	 *     idle_pct < cfs_b->target_idle
+	 *
+	 * or if the throttling was on in the previous period:
+	 *     max_runtime != cfs_b->prev_runtime
+	 */
+	if (idle_pct < cfs_b->target_idle ||
+	    max_runtime != cfs_b->prev_runtime) {
+		idle_time = cfs_bandwidth_pct_to_ns(cfs_b->period, idle_pct);
+		target_idle_time = cfs_bandwidth_pct_to_ns(cfs_b->period,
+							   cfs_b->target_idle);
+
+		/* minimal runtime to avoid starving */
+		min_runtime = max_t(u64, min_cfs_quota_period,
+				    cfs_bandwidth_pct_to_ns(cfs_b->period,
+							    cfs_b->min_runtime));
+		if (cfs_b->prev_runtime + idle_time < target_idle_time) {
+			cfs_b->runtime = min_runtime;
+		} else {
+			cfs_b->runtime = cfs_b->prev_runtime + idle_time -
+				target_idle_time;
+			if (cfs_b->runtime > max_runtime)
+				cfs_b->runtime = max_runtime;
+			if (cfs_b->runtime < min_runtime)
+				cfs_b->runtime = min_runtime;
+		}
+	} else {
+		/* no need for throttling */
+		cfs_b->runtime = max_runtime;
+	}
+	cfs_b->prev_runtime = cfs_b->runtime;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4382,7 +4437,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota == RUNTIME_INF)
+	if (!cfs_bandwidth_throttling_on(cfs_b))
 		amount = min_amount;
 	else {
 		start_cfs_bandwidth(cfs_b);
@@ -4690,7 +4745,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	int throttled;
 
 	/* no need to continue the timer with no bandwidth constraint */
-	if (cfs_b->quota == RUNTIME_INF)
+	if (!cfs_bandwidth_throttling_on(cfs_b))
 		goto out_deactivate;
 
 	throttled = !list_empty(&cfs_b->throttled_cfs_rq);
@@ -4806,7 +4861,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
+	if (cfs_bandwidth_throttling_on(cfs_b) &&
 	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
 		cfs_b->runtime += slack_runtime;
 
@@ -4854,7 +4909,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 		return;
 	}
 
-	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
+	if (cfs_bandwidth_throttling_on(cfs_b) && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
 	expires = cfs_b->runtime_expires;
@@ -5048,7 +5103,7 @@ static void __maybe_unused update_runtime_enabled(struct rq *rq)
 		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
 		raw_spin_lock(&cfs_b->lock);
-		cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
+		cfs_rq->runtime_enabled = cfs_bandwidth_throttling_on(cfs_b);
 		raw_spin_unlock(&cfs_b->lock);
 	}
 	rcu_read_unlock();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9309bf05ff0c..92e8a824c6fe 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -338,6 +338,7 @@ extern struct list_head task_groups;
 
 #ifdef CONFIG_CFS_BANDWIDTH
 extern void cfs_bandwidth_has_tasks_changed_work(struct work_struct *work);
+extern const u64 min_cfs_quota_period;
 #endif
 
 struct cfs_bandwidth {
@@ -370,6 +371,9 @@ struct cfs_bandwidth {
 	/* work_struct to adjust settings asynchronously */
 	struct work_struct	has_tasks_changed_work;
 
+	/* runtime assigned to previous period */
+	u64			prev_runtime;
+
 	short			idle;
 	short			period_active;
 	struct hrtimer		period_timer;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 7/7] Documentation: cgroup-v2: add information for cpu.headroom
  2019-04-08 21:45 [PATCH 0/7] introduce cpu.headroom knob to cpu controller Song Liu
                   ` (5 preceding siblings ...)
  2019-04-08 21:45 ` [PATCH 6/7] sched/fair: throttle task runtime based on cpu.headroom Song Liu
@ 2019-04-08 21:45 ` Song Liu
  2019-04-10 11:59 ` [PATCH 0/7] introduce cpu.headroom knob to cpu controller Morten Rasmussen
  2019-04-15 16:48 ` Song Liu
  8 siblings, 0 replies; 26+ messages in thread
From: Song Liu @ 2019-04-08 21:45 UTC (permalink / raw)
  To: linux-kernel, cgroups
  Cc: mingo, peterz, vincent.guittot, tglx, morten.rasmussen,
	kernel-team, Song Liu

This patch adds a simple explanation of the new cpu.headroom knob.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 20f92c16ffbf..5007076d6611 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -916,6 +916,10 @@ have placed RT processes into nonroot cgroups during the system boot
 process, and these processes may need to be moved to the root cgroup
 before the cpu controller can be enabled.
 
+Some latency-sensitive applications experience high latency when the
+CPU is fully utilized. The cpu.headroom knob provides a mechanism to
+reserve CPU headroom that is only available to certain applications.
+
 
 CPU Interface Files
 ~~~~~~~~~~~~~~~~~~~
@@ -968,6 +972,20 @@ All time durations are in microseconds.
 	$PERIOD duration.  "max" for $MAX indicates no limit.  If only
 	one number is written, $MAX is updated.
 
+  cpu.headroom
+	A read-write two value file which exists on non-root cgroups.
+	The default is "0.00 max".
+
+	The idle CPU headroom claimed by the cgroup. It is in the
+	following format::
+
+	  $HEADROOM $TOLERANCE
+
+	which indicates that other cgroups should be throttled to
+	yield $HEADROOM % idle CPU. If it is not possible to maintain
+	$HEADROOM % idle CPU, other cgroups will consume at most
+	$TOLERANCE % CPU.
+
   cpu.pressure
 	A read-only nested-key file which exists on non-root cgroups.
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-04-08 21:45 [PATCH 0/7] introduce cpu.headroom knob to cpu controller Song Liu
                   ` (6 preceding siblings ...)
  2019-04-08 21:45 ` [PATCH 7/7] Documentation: cgroup-v2: add information for cpu.headroom Song Liu
@ 2019-04-10 11:59 ` Morten Rasmussen
  2019-04-10 19:43   ` Song Liu
  2019-04-15 16:48 ` Song Liu
  8 siblings, 1 reply; 26+ messages in thread
From: Morten Rasmussen @ 2019-04-10 11:59 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-kernel, cgroups, mingo, peterz, vincent.guittot, tglx, kernel-team

Hi,

On Mon, Apr 08, 2019 at 02:45:32PM -0700, Song Liu wrote:
> Servers running latency sensitive workload usually aren't fully loaded for 
> various reasons including disaster readiness. The machines running our 
> interactive workloads (referred as main workload) have a lot of spare CPU 
> cycles that we would like to use for optimistic side jobs like video 
> encoding. However, our experiments show that the side workload has strong
> impact on the latency of main workload:
> 
>   side-job   main-load-level   main-avg-latency
>      none          1.0              1.00
>      none          1.1              1.10
>      none          1.2              1.10 
>      none          1.3              1.10
>      none          1.4              1.15
>      none          1.5              1.24
>      none          1.6              1.74
> 
>      ffmpeg        1.0              1.82
>      ffmpeg        1.1              2.74
> 
> Note: both the main-load-level and the main-avg-latency numbers are
>  _normalized_.

Could you reveal what level of utilization those main-load-level numbers
correspond to? I'm trying to understand why the latency seems to
increase rapidly once you hit 1.5. Is that the point where the system
hits 100% utilization?

> In these experiments, ffmpeg is put in a cgroup with cpu.weight of 1 
> (lowest priority). However, it consumes all idle CPU cycles in the 
> system and causes high latency for the main workload. Further experiments
> and analysis (more details below) shows that, for the main workload to meet
> its latency targets, it is necessary to limit the CPU usage of the side
> workload so that there are some _idle_ CPU. There are various reasons
> behind the need of idle CPU time. First, shared CPU resouce saturation 
> starts to happen way before time-measured utilization reaches 100%. 
> Secondly, scheduling latency starts to impact the main workload as CPU 
> reaches full utilization. 
> 
> Currently, the cpu controller provides two mechanisms to protect the main 
> workload: cpu.weight and cpu.max. However, neither of them is sufficient 
> in these use cases. As shown in the experiments above, side workload with 
> cpu.weight of 1 (lowest priority) would still consume all idle CPU and add 
> unacceptable latency to the main workload. cpu.max can throttle the CPU 
> usage of the side workload and preserve some idle CPU. However, cpu.max 
> cannot react to changes in load levels. For example, when the main 
> workload uses 40% of CPU, cpu.max of 30% for the side workload would yield 
> good latencies for the main workload. However, when the workload 
> experiences higher load levels and uses more CPU, the same setting (cpu.max 
> of 30%) would cause the interactive workload to miss its latency target. 
> 
> These experiments demonstrated the need for a mechanism to effectively 
> throttle CPU usage of the side workload and preserve idle CPU cycles. 
> The mechanism should be able to adjust the level of throttling based on
> the load level of the main workload. 
> 
> This patchset introduces a new knob for cpu controller: cpu.headroom. 
> cgroup of the main workload uses cpu.headroom to ensure side workload to 
> use limited CPU cycles. For example, if a main workload has a cpu.headroom 
> of 30%. The side workload will be throttled to give 30% overall idle CPU. 
> If the main workload uses more than 70% of CPU, the side workload will only 
> run with configurable minimal cycles. This configurable minimal cycles is
> referred as "tolerance" of the main workload.

IIUC, you are proposing to basically apply dynamic bandwidth throttling to
side-jobs to preserve a specific headroom of idle cycles.

The bit that isn't clear to me, is _why_ adding idle cycles helps your
workload. I'm not convinced that adding headroom gives any latency
improvements beyond watering down the impact of your side jobs. AFAIK,
the throttling mechanism effectively removes the throttled tasks from
the schedule according to a specific duty cycle. When the side job is
not throttled the main workload is experiencing the same latency issues
as before, but by dynamically tuning the side job throttling you can
achieve a better average latency. Am I missing something?

Have you looked at your distribution of main job latency and tried to
compare with when throttling is active/not active?

I'm wondering if the headroom solution is really the right solution for
your use-case or if what you are really after is something which is
lower priority than just setting the weight to 1. Something that
(nearly) always gets pre-empted by your main job (SCHED_BATCH and
SCHED_IDLE might not be enough). If your main job consists
of lots of relatively short wake-ups, things like the min_granularity
could have significant latency impact.

Morten

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-04-10 11:59 ` [PATCH 0/7] introduce cpu.headroom knob to cpu controller Morten Rasmussen
@ 2019-04-10 19:43   ` Song Liu
  2019-04-17 12:56     ` Vincent Guittot
  2019-05-21 13:47     ` Michal Koutný
  0 siblings, 2 replies; 26+ messages in thread
From: Song Liu @ 2019-04-10 19:43 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: linux-kernel, cgroups, mingo, peterz, vincent.guittot, tglx, Kernel Team

Hi Morten,

> On Apr 10, 2019, at 4:59 AM, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> 
> Hi,
> 
> On Mon, Apr 08, 2019 at 02:45:32PM -0700, Song Liu wrote:
>> Servers running latency sensitive workload usually aren't fully loaded for 
>> various reasons including disaster readiness. The machines running our 
>> interactive workloads (referred as main workload) have a lot of spare CPU 
>> cycles that we would like to use for optimistic side jobs like video 
>> encoding. However, our experiments show that the side workload has strong
>> impact on the latency of main workload:
>> 
>>  side-job   main-load-level   main-avg-latency
>>     none          1.0              1.00
>>     none          1.1              1.10
>>     none          1.2              1.10 
>>     none          1.3              1.10
>>     none          1.4              1.15
>>     none          1.5              1.24
>>     none          1.6              1.74
>> 
>>     ffmpeg        1.0              1.82
>>     ffmpeg        1.1              2.74
>> 
>> Note: both the main-load-level and the main-avg-latency numbers are
>> _normalized_.
> 
> Could you reveal what level of utilization those main-load-level numbers
> correspond to? I'm trying to understand why the latency seems to
> increase rapidly once you hit 1.5. Is that the point where the system
> hits 100% utilization?

The load level above is measured as requests-per-second. 

When there is no side workload, the system has about 45% busy CPU with 
load level of 1.0; and about 75% busy CPU at load level of 1.5. 

Saturation starts before the system hits 100% utilization. This is
true for many different resources: ALUs in SMT systems, cache lines,
memory bandwidth, etc.

> 
>> In these experiments, ffmpeg is put in a cgroup with cpu.weight of 1 
>> (lowest priority). However, it consumes all idle CPU cycles in the 
>> system and causes high latency for the main workload. Further experiments
>> and analysis (more details below) shows that, for the main workload to meet
>> its latency targets, it is necessary to limit the CPU usage of the side
>> workload so that there are some _idle_ CPU. There are various reasons
>> behind the need of idle CPU time. First, shared CPU resouce saturation 
>> starts to happen way before time-measured utilization reaches 100%. 
>> Secondly, scheduling latency starts to impact the main workload as CPU 
>> reaches full utilization. 
>> 
>> Currently, the cpu controller provides two mechanisms to protect the main 
>> workload: cpu.weight and cpu.max. However, neither of them is sufficient 
>> in these use cases. As shown in the experiments above, side workload with 
>> cpu.weight of 1 (lowest priority) would still consume all idle CPU and add 
>> unacceptable latency to the main workload. cpu.max can throttle the CPU 
>> usage of the side workload and preserve some idle CPU. However, cpu.max 
>> cannot react to changes in load levels. For example, when the main 
>> workload uses 40% of CPU, cpu.max of 30% for the side workload would yield 
>> good latencies for the main workload. However, when the workload 
>> experiences higher load levels and uses more CPU, the same setting (cpu.max 
>> of 30%) would cause the interactive workload to miss its latency target. 
>> 
>> These experiments demonstrated the need for a mechanism to effectively 
>> throttle CPU usage of the side workload and preserve idle CPU cycles. 
>> The mechanism should be able to adjust the level of throttling based on
>> the load level of the main workload. 
>> 
>> This patchset introduces a new knob for the cpu controller: cpu.headroom.
>> The cgroup of the main workload uses cpu.headroom to limit the CPU cycles
>> that the side workload may use. For example, if the main workload has a
>> cpu.headroom of 30%, the side workload will be throttled to leave 30%
>> overall idle CPU. If the main workload uses more than 70% of CPU, the side
>> workload will only run with a configurable minimal number of cycles. This
>> configurable minimum is referred to as the "tolerance" of the main workload.
> 
> IIUC, you are proposing to basically apply dynamic bandwidth throttling to
> side-jobs to preserve a specific headroom of idle cycles.

This is accurate. The effect is similar to cpu.max, but more dynamic. 

> 
> The bit that isn't clear to me, is _why_ adding idle cycles helps your
> workload. I'm not convinced that adding headroom gives any latency
> improvements beyond watering down the impact of your side jobs. AFAIK,

We think the latency improvements actually come from watering down the
impact of side jobs. This does not just improve the average latency
numbers statistically; it also reduces resource contention caused by the
side workload. I don't know whether it comes from reducing contention on
ALUs, memory bandwidth, CPU caches, or something else, but we saw reduced
latencies when headroom is used.

> the throttling mechanism effectively removes the throttled tasks from
> the schedule according to a specific duty cycle. When the side job is
> not throttled the main workload is experiencing the same latency issues
> as before, but by dynamically tuning the side job throttling you can
> achieve a better average latency. Am I missing something?
> 
> Have you looked at your distribution of main job latency and tried to
> compare with when throttling is active/not active?

cfs_bandwidth adjusts the allowed runtime for each task_group every period
(configurable, 100ms by default). The cpu.headroom logic applies gentle
throttling, so that the side workload gets some runtime in every period.
Therefore, if we look at a time window equal to or longer than 100ms, we
don't really see "throttling active time" vs. "throttling inactive time".
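
As a rough illustration of this per-period behavior, here is a simplified
model that mirrors the example table in the cover letter. It is not the
algorithm in the patches: the function name is made up for this sketch, the
machine is treated as a single pool of CPU time, and the 1ms default
tolerance is taken from the experiment setup (1ms per 100ms period).

```python
PERIOD_MS = 100  # default cfs_bandwidth period mentioned above


def side_runtime_ms(headroom_pct, main_load_pct, tolerance_ms=1):
    """Simplified model of the side workload's runtime per period.

    Inputs are percentages of total CPU. This mirrors the example table
    in the cover letter and is NOT the patch's actual algorithm.
    """
    spare_pct = 100 - headroom_pct - main_load_pct
    runtime_ms = spare_pct * PERIOD_MS // 100
    # Gentle throttling: the side workload always gets at least the
    # tolerance, so it still runs a little in every period.
    return max(runtime_ms, tolerance_ms)


for load in (30, 40, 50, 60, 70, 80):
    print(load, side_runtime_ms(30, load))
```

With a 30% headroom, the model gives the side workload 40ms, 30ms, 20ms and
10ms per period at 30% to 60% main load, and only the tolerance once the
main load reaches 70% or more.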

> 
> I'm wondering if the headroom solution is really the right solution for
> your use-case or if what you are really after is something which is
> lower priority than just setting the weight to 1. Something that

The experiments show that cpu.weight does its job for priority: the
main workload gets priority to use the CPU, while the side workload only
fills the idle CPU. However, this is not sufficient, as the side workload
creates enough contention to impact the main workload.

> (nearly) always gets pre-empted by your main job (SCHED_BATCH and
> SCHED_IDLE might not be enough). If your main job consist
> of lots of relatively short wake-ups things like the min_granularity
> could have significant latency impact.

cpu.headroom gives benefits in addition to optimizations on the preemption
side. By maintaining some idle time, fewer preemptions are necessary, so
the main workload gets better latency.

Thanks,
Song

> 
> Morten



* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-04-08 21:45 [PATCH 0/7] introduce cpu.headroom knob to cpu controller Song Liu
                   ` (7 preceding siblings ...)
  2019-04-10 11:59 ` [PATCH 0/7] introduce cpu.headroom knob to cpu controller Morten Rasmussen
@ 2019-04-15 16:48 ` Song Liu
  8 siblings, 0 replies; 26+ messages in thread
From: Song Liu @ 2019-04-15 16:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, vincent.guittot, Thomas Gleixner, Morten Rasmussen,
	Kernel Team, cgroups, LKML

Hi Peter,

> On Apr 8, 2019, at 2:45 PM, Song Liu <songliubraving@fb.com> wrote:
> 
> Servers running latency sensitive workload usually aren't fully loaded for 
> various reasons including disaster readiness. The machines running our 
> interactive workloads (referred as main workload) have a lot of spare CPU 
> cycles that we would like to use for optimistic side jobs like video 
> encoding. However, our experiments show that the side workload has strong
> impact on the latency of main workload:
> 
>  side-job   main-load-level   main-avg-latency
>     none          1.0              1.00
>     none          1.1              1.10
>     none          1.2              1.10 
>     none          1.3              1.10
>     none          1.4              1.15
>     none          1.5              1.24
>     none          1.6              1.74
> 
>     ffmpeg        1.0              1.82
>     ffmpeg        1.1              2.74
> 
> Note: both the main-load-level and the main-avg-latency numbers are
> _normalized_.
> 
> In these experiments, ffmpeg is put in a cgroup with cpu.weight of 1 
> (lowest priority). However, it consumes all idle CPU cycles in the 
> system and causes high latency for the main workload. Further experiments
> and analysis (more details below) show that, for the main workload to meet
> its latency targets, it is necessary to limit the CPU usage of the side
> workload so that there is some _idle_ CPU. There are various reasons
> behind the need for idle CPU time. First, shared CPU resource saturation
> starts to happen way before time-measured utilization reaches 100%.
> Secondly, scheduling latency starts to impact the main workload as CPU
> reaches full utilization.
> 
> Currently, the cpu controller provides two mechanisms to protect the main 
> workload: cpu.weight and cpu.max. However, neither of them is sufficient 
> in these use cases. As shown in the experiments above, side workload with 
> cpu.weight of 1 (lowest priority) would still consume all idle CPU and add 
> unacceptable latency to the main workload. cpu.max can throttle the CPU 
> usage of the side workload and preserve some idle CPU. However, cpu.max 
> cannot react to changes in load levels. For example, when the main 
> workload uses 40% of CPU, cpu.max of 30% for the side workload would yield 
> good latencies for the main workload. However, when the workload 
> experiences higher load levels and uses more CPU, the same setting (cpu.max 
> of 30%) would cause the interactive workload to miss its latency target. 
> 
> These experiments demonstrated the need for a mechanism to effectively 
> throttle CPU usage of the side workload and preserve idle CPU cycles. 
> The mechanism should be able to adjust the level of throttling based on
> the load level of the main workload. 
> 
> This patchset introduces a new knob for the cpu controller: cpu.headroom.
> The cgroup of the main workload uses cpu.headroom to limit the CPU cycles
> that the side workload may use. For example, if the main workload has a
> cpu.headroom of 30%, the side workload will be throttled to leave 30%
> overall idle CPU. If the main workload uses more than 70% of CPU, the side
> workload will only run with a configurable minimal number of cycles. This
> configurable minimum is referred to as the "tolerance" of the main workload.
> 
> The following is a detailed example:
> 
> main/cpu.headroom    main-cpu-load    low-pri-cpu-cycle   idle-cpu
>      30%                 30%                40%              30%
>      30%                 40%                30%              30%
>      30%                 50%                20%              30%
>      30%                 60%                10%              30%
>      30%                 70%                minimal          ~30%
>      30%                 80%                minimal          ~20%
> 
> In the example, we use a constant cpu.headroom setting of 30%. As the main
> job experiences different levels of load, the cpu controller adjusts the CPU
> cycles used by the low-pri jobs.
> 
> We experimented with a web server as the main workload and ffmpeg as the
> side workload. The following table compares latency impact on the main 
> workload under different cpu.headroom settings and load levels. In all 
> tests, the side workload cgroup is configured with cpu.weight of 1. When 
> throttled, the side workload can only run 1ms per 100ms period.
> 
>                               average-latency
> main-load-level   w/o-side    w/-side-      w/-side-       w/-side-
>                            no-headroom   30%-headroom   20%-headroom
>     1.0            1.00       1.82          1.26           1.14                      
>     1.1            1.10       2.74          1.26           1.32                      
>     1.2            1.10                     1.29           1.38                      
>     1.3            1.10                     1.32           1.49                      
>     1.4            1.15                     1.29           1.85                      
>     1.5            1.24                     1.32                                
>     1.6            1.74                     1.50                              
> 
> Each row of the table shows a normalized load level and average latencies
> for 4 scenarios: w/o side workload; w/ side workload but no headroom; w/
> side workload and 30% headroom; w/ side workload and 20% headroom.
> 
> 
> When there is no side workload, average latency of main job falls in the 
> 0.7x range, except the very high load scenarios. When there is side 
> workload but no headroom, latency of the main job goes very high at 
> moderate load levels. With 30% headroom, the average latency falls in the 
> 0.8x range. With 20% headroom, the average latency falls in the 0.9x to 
> 1.x range. We didn't finish tests in some cases with high load, because 
> the latency is too high. 
> 
> This experiment demonstrated cpu.headroom is an effective and efficient
> knob to control the latency of the main job.
> 
> Thanks!

Could you please kindly share your feedback and comments on this work?

Thanks and Regards,
Song

> Song Liu (7):
>  sched: refactor tg_set_cfs_bandwidth()
>  cgroup: introduce hook css_has_tasks_changed
>  cgroup: introduce cgroup_parse_percentage
>  sched, cgroup: add entry cpu.headroom
>  sched/fair: global idleness counter for cpu.headroom
>  sched/fair: throttle task runtime based on cpu.headroom
>  Documentation: cgroup-v2: add information for cpu.headroom
> 
> Documentation/admin-guide/cgroup-v2.rst |  18 +
> fs/proc/stat.c                          |   4 +-
> include/linux/cgroup-defs.h             |   2 +
> include/linux/cgroup.h                  |   1 +
> include/linux/kernel_stat.h             |   2 +
> kernel/cgroup/cgroup.c                  |  51 +++
> kernel/sched/core.c                     | 425 ++++++++++++++++++++++--
> kernel/sched/fair.c                     | 143 +++++++-
> kernel/sched/sched.h                    |  30 ++
> 9 files changed, 634 insertions(+), 42 deletions(-)
> 
> -- 
> 2.17.1
> 



* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-04-10 19:43   ` Song Liu
@ 2019-04-17 12:56     ` Vincent Guittot
  2019-04-22 23:22       ` Song Liu
  2019-05-21 13:47     ` Michal Koutný
  1 sibling, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2019-04-17 12:56 UTC (permalink / raw)
  To: Song Liu
  Cc: Morten Rasmussen, linux-kernel, cgroups, mingo, peterz, tglx,
	Kernel Team

On Wed, 10 Apr 2019 at 21:43, Song Liu <songliubraving@fb.com> wrote:
>
> Hi Morten,
>
> > On Apr 10, 2019, at 4:59 AM, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >

> >
> > The bit that isn't clear to me, is _why_ adding idle cycles helps your
> > workload. I'm not convinced that adding headroom gives any latency
> > improvements beyond watering down the impact of your side jobs. AFAIK,
>
> We think the latency improvements actually come from watering down the
> impact of side jobs. It is not just statistically improving average
> latency numbers, but also reduces resource contention caused by the side
> workload. I don't know whether it is from reducing contention of ALUs,
> memory bandwidth, CPU caches, or something else, but we saw reduced
> latencies when headroom is used.
>
> > the throttling mechanism effectively removes the throttled tasks from
> > the schedule according to a specific duty cycle. When the side job is
> > not throttled the main workload is experiencing the same latency issues
> > as before, but by dynamically tuning the side job throttling you can
> > achieve a better average latency. Am I missing something?
> >
> > Have you looked at your distribution of main job latency and tried to
> > compare with when throttling is active/not active?
>
> cfs_bandwidth adjusts allowed runtime for each task_group each period
> (configurable, 100ms by default). cpu.headroom logic applies gentle
> throttling, so that the side workload gets some runtime in every period.
> Therefore, if we look at time window equal to or bigger than 100ms, we
> don't really see "throttling active time" vs. "throttling inactive time".
>
> >
> > I'm wondering if the headroom solution is really the right solution for
> > your use-case or if what you are really after is something which is
> > lower priority than just setting the weight to 1. Something that
>
> The experiments show that, cpu.weight does proper work for priority: the
> main workload gets priority to use the CPU; while the side workload only
> fill the idle CPU. However, this is not sufficient, as the side workload
> creates big enough contention to impact the main workload.
>
> > (nearly) always gets pre-empted by your main job (SCHED_BATCH and
> > SCHED_IDLE might not be enough). If your main job consist
> > of lots of relatively short wake-ups things like the min_granularity
> > could have significant latency impact.
>
> cpu.headroom gives benefits in addition to optimizations in pre-empt
> side. By maintaining some idle time, fewer pre-empt actions are
> necessary, thus the main workload will get better latency.

I agree with Morten's proposal: SCHED_IDLE should help your latency
problem because the side job will be preempted directly, unlike a normal
cfs task, even one with the lowest priority.
In addition to min_granularity, sched_period also has an impact on the
time that a task has to wait before preempting the running task. Some
sched_features like GENTLE_FAIR_SLEEPERS can also impact the latency of
a task.

It would be nice to know if the latency problem comes from contention
on cache resources or if it's mainly because your main load waits
before running on a CPU.

Regards,
Vincent

>
> Thanks,
> Song
>
> >
> > Morten
>


* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-04-17 12:56     ` Vincent Guittot
@ 2019-04-22 23:22       ` Song Liu
  2019-04-28 19:47         ` Song Liu
  0 siblings, 1 reply; 26+ messages in thread
From: Song Liu @ 2019-04-22 23:22 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Morten Rasmussen, linux-kernel, cgroups, mingo, peterz, tglx,
	Kernel Team

Hi Vincent,

> On Apr 17, 2019, at 5:56 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> 
> On Wed, 10 Apr 2019 at 21:43, Song Liu <songliubraving@fb.com> wrote:
>> 
>> Hi Morten,
>> 
>>> On Apr 10, 2019, at 4:59 AM, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>>> 
> 
>>> 
>>> The bit that isn't clear to me, is _why_ adding idle cycles helps your
>>> workload. I'm not convinced that adding headroom gives any latency
>>> improvements beyond watering down the impact of your side jobs. AFAIK,
>> 
>> We think the latency improvements actually come from watering down the
>> impact of side jobs. It is not just statistically improving average
>> latency numbers, but also reduces resource contention caused by the side
>> workload. I don't know whether it is from reducing contention of ALUs,
>> memory bandwidth, CPU caches, or something else, but we saw reduced
>> latencies when headroom is used.
>> 
>>> the throttling mechanism effectively removes the throttled tasks from
>>> the schedule according to a specific duty cycle. When the side job is
>>> not throttled the main workload is experiencing the same latency issues
>>> as before, but by dynamically tuning the side job throttling you can
>>> achieve a better average latency. Am I missing something?
>>> 
>>> Have you looked at your distribution of main job latency and tried to
>>> compare with when throttling is active/not active?
>> 
>> cfs_bandwidth adjusts allowed runtime for each task_group each period
>> (configurable, 100ms by default). cpu.headroom logic applies gentle
>> throttling, so that the side workload gets some runtime in every period.
>> Therefore, if we look at time window equal to or bigger than 100ms, we
>> don't really see "throttling active time" vs. "throttling inactive time".
>> 
>>> 
>>> I'm wondering if the headroom solution is really the right solution for
>>> your use-case or if what you are really after is something which is
>>> lower priority than just setting the weight to 1. Something that
>> 
>> The experiments show that, cpu.weight does proper work for priority: the
>> main workload gets priority to use the CPU; while the side workload only
>> fill the idle CPU. However, this is not sufficient, as the side workload
>> creates big enough contention to impact the main workload.
>> 
>>> (nearly) always gets pre-empted by your main job (SCHED_BATCH and
>>> SCHED_IDLE might not be enough). If your main job consist
>>> of lots of relatively short wake-ups things like the min_granularity
>>> could have significant latency impact.
>> 
>> cpu.headroom gives benefits in addition to optimizations in pre-empt
>> side. By maintaining some idle time, fewer pre-empt actions are
>> necessary, thus the main workload will get better latency.
> 
> I agree with Morten's proposal, SCHED_IDLE should help your latency
> problem because side job will be directly preempted unlike normal cfs
> task even lowest priority.
> In addition to min_granularity, sched_period also has an impact on the
> time that a task has to wait before preempting the running task. Also,
> some sched_feature like GENTLE_FAIR_SLEEPERS can also impact the
> latency of a task.
> 
> It would be nice to know if the latency problem comes from contention
> on cache resources or if it's mainly because you main load waits
> before running on a CPU
> 
> Regards,
> Vincent

Thanks for these suggestions. Here are some more tests to show the impact 
of scheduler knobs and cpu.headroom.

side-load | cpu.headroom | side/cpu.weight | min_gran | cpu-idle | main/latency
--------------------------------------------------------------------------------
  none    |      0       |     n/a         |    1 ms  |  45.20%  |   1.00
 ffmpeg   |      0       |      1          |   10 ms  |   3.38%  |   1.46
 ffmpeg   |      0       |   SCHED_IDLE    |    1 ms  |   5.69%  |   1.42
 ffmpeg   |    20%       |   SCHED_IDLE    |    1 ms  |  19.00%  |   1.13
 ffmpeg   |    30%       |   SCHED_IDLE    |    1 ms  |  27.60%  |   1.08

In all these cases, the main workload is loaded with the same level of
traffic (requests per second). Main workload latency numbers are normalized
to the baseline (first row).

For the baseline, the main workload runs without any side workload, and
the system has about 45.20% idle CPU.

The next two rows compare the impact of the scheduling knobs cpu.weight,
SCHED_IDLE, and sched_min_granularity. With cpu.weight of 1 and
min_granularity of 10ms, we see a latency of 1.46; with SCHED_IDLE and
min_granularity of 1ms, we see a latency of 1.42. So SCHED_IDLE and
min_granularity help protect the main workload. However, they are not
sufficient, as the latency overhead is still high (>40%).

The last two rows show the benefit of cpu.headroom. With 20% headroom, 
the latency is 1.13; while with 30% headroom, the latency is 1.08. 

We can also see a clear correlation between latency and global idle CPU:
more idle CPU yields lower latency.
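
To put a rough number on that correlation, the small script below computes
the Pearson coefficient over the five rows of the table above. It is only a
sanity check on the posted numbers: five data points is a small sample, and
the helper name is just for this sketch.

```python
from math import sqrt

# (cpu-idle %, normalized main latency) for the five rows of the table above
rows = [(45.20, 1.00), (3.38, 1.46), (5.69, 1.42), (19.00, 1.13), (27.60, 1.08)]


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


r = pearson(rows)
print(f"correlation between idle CPU and latency: {r:.2f}")
```

The coefficient comes out strongly negative (around -0.94), which matches
the reading of the table: the more idle CPU is preserved, the lower the
main workload latency.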

Overall, these results show that cpu.headroom provides an effective
mechanism to control the latency impact of side workloads. Other knobs
could also help the latency, but they are not as effective or flexible
as cpu.headroom.

Does this analysis address your concern? 

Thanks,
Song

> 
>> 
>> Thanks,
>> Song
>> 
>>> 
>>> Morten
>> 



* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-04-22 23:22       ` Song Liu
@ 2019-04-28 19:47         ` Song Liu
  2019-04-29 12:24           ` Vincent Guittot
  0 siblings, 1 reply; 26+ messages in thread
From: Song Liu @ 2019-04-28 19:47 UTC (permalink / raw)
  To: Vincent Guittot, Morten Rasmussen
  Cc: linux-kernel, cgroups, mingo, peterz, tglx, Kernel Team

Hi Morten and Vincent,

> On Apr 22, 2019, at 6:22 PM, Song Liu <songliubraving@fb.com> wrote:
> 
> Hi Vincent,
> 
>> On Apr 17, 2019, at 5:56 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>> 
>> On Wed, 10 Apr 2019 at 21:43, Song Liu <songliubraving@fb.com> wrote:
>>> 
>>> Hi Morten,
>>> 
>>>> On Apr 10, 2019, at 4:59 AM, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>>>> 
>> 
>>>> 
>>>> The bit that isn't clear to me, is _why_ adding idle cycles helps your
>>>> workload. I'm not convinced that adding headroom gives any latency
>>>> improvements beyond watering down the impact of your side jobs. AFAIK,
>>> 
>>> We think the latency improvements actually come from watering down the
>>> impact of side jobs. It is not just statistically improving average
>>> latency numbers, but also reduces resource contention caused by the side
>>> workload. I don't know whether it is from reducing contention of ALUs,
>>> memory bandwidth, CPU caches, or something else, but we saw reduced
>>> latencies when headroom is used.
>>> 
>>>> the throttling mechanism effectively removes the throttled tasks from
>>>> the schedule according to a specific duty cycle. When the side job is
>>>> not throttled the main workload is experiencing the same latency issues
>>>> as before, but by dynamically tuning the side job throttling you can
>>>> achieve a better average latency. Am I missing something?
>>>> 
>>>> Have you looked at your distribution of main job latency and tried to
>>>> compare with when throttling is active/not active?
>>> 
>>> cfs_bandwidth adjusts allowed runtime for each task_group each period
>>> (configurable, 100ms by default). cpu.headroom logic applies gentle
>>> throttling, so that the side workload gets some runtime in every period.
>>> Therefore, if we look at time window equal to or bigger than 100ms, we
>>> don't really see "throttling active time" vs. "throttling inactive time".
>>> 
>>>> 
>>>> I'm wondering if the headroom solution is really the right solution for
>>>> your use-case or if what you are really after is something which is
>>>> lower priority than just setting the weight to 1. Something that
>>> 
>>> The experiments show that, cpu.weight does proper work for priority: the
>>> main workload gets priority to use the CPU; while the side workload only
>>> fill the idle CPU. However, this is not sufficient, as the side workload
>>> creates big enough contention to impact the main workload.
>>> 
>>>> (nearly) always gets pre-empted by your main job (SCHED_BATCH and
>>>> SCHED_IDLE might not be enough). If your main job consist
>>>> of lots of relatively short wake-ups things like the min_granularity
>>>> could have significant latency impact.
>>> 
>>> cpu.headroom gives benefits in addition to optimizations in pre-empt
>>> side. By maintaining some idle time, fewer pre-empt actions are
>>> necessary, thus the main workload will get better latency.
>> 
>> I agree with Morten's proposal, SCHED_IDLE should help your latency
>> problem because side job will be directly preempted unlike normal cfs
>> task even lowest priority.
>> In addition to min_granularity, sched_period also has an impact on the
>> time that a task has to wait before preempting the running task. Also,
>> some sched_feature like GENTLE_FAIR_SLEEPERS can also impact the
>> latency of a task.
>> 
>> It would be nice to know if the latency problem comes from contention
>> on cache resources or if it's mainly because you main load waits
>> before running on a CPU
>> 
>> Regards,
>> Vincent
> 
> Thanks for these suggestions. Here are some more tests to show the impact 
> of scheduler knobs and cpu.headroom.
> 
> side-load | cpu.headroom | side/cpu.weight | min_gran | cpu-idle | main/latency
> --------------------------------------------------------------------------------
>  none    |      0       |     n/a         |    1 ms  |  45.20%  |   1.00
> ffmpeg   |      0       |      1          |   10 ms  |   3.38%  |   1.46
> ffmpeg   |      0       |   SCHED_IDLE    |    1 ms  |   5.69%  |   1.42
> ffmpeg   |    20%       |   SCHED_IDLE    |    1 ms  |  19.00%  |   1.13
> ffmpeg   |    30%       |   SCHED_IDLE    |    1 ms  |  27.60%  |   1.08
> 
> In all these cases, the main workload is loaded with same level of 
> traffic (request per second). Main workload latency numbers are normalized 
> based on the baseline (first row). 
> 
> For the baseline, the main workload runs without any side workload, the 
> system has about 45.20% idle CPU. 
> 
> The next two rows compare the impact of scheduling knobs cpu.weight and 
> sched_min_granularity. With cpu.weight of 1 and min_granularity of 10ms, 
> we see a latency of 1.46; with SCHED_IDLE and min_granularity of 1ms, we 
> see a latency of 1.42. So SCHED_IDLE and min_granularity help protecting 
> the main workload. However, it is not sufficient, as the latency overhead 
> is high (>40%). 
> 
> The last two rows show the benefit of cpu.headroom. With 20% headroom, 
> the latency is 1.13; while with 30% headroom, the latency is 1.08. 
> 
> We can also see a clear correlation between latency and global idle CPU: 
> more idle CPU yields better lower latency. 
> 
> Over all, these results show that cpu.headroom provides effective 
> mechanism to control the latency impact of side workloads. Other knobs 
> could also help the latency, but they are not as effective and flexible 
> as cpu.headroom. 
> 
> Does this analysis address your concern? 
> 
> Thanks,
> Song
> 

Could you please share your comments and suggestions on this work? Did
the results address your questions/concerns? 

Thanks again,
Song 

>> 
>>> 
>>> Thanks,
>>> Song
>>> 
>>>> 
>>>> Morten



* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-04-28 19:47         ` Song Liu
@ 2019-04-29 12:24           ` Vincent Guittot
  2019-04-30  6:10             ` Song Liu
  0 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2019-04-29 12:24 UTC (permalink / raw)
  To: Song Liu
  Cc: Morten Rasmussen, linux-kernel, cgroups, mingo, peterz, tglx,
	Kernel Team, viresh kumar

Hi Song,

On Sun, 28 Apr 2019 at 21:47, Song Liu <songliubraving@fb.com> wrote:
>
> Hi Morten and Vincent,
>
> > On Apr 22, 2019, at 6:22 PM, Song Liu <songliubraving@fb.com> wrote:
> >
> > Hi Vincent,
> >
> >> On Apr 17, 2019, at 5:56 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> >>
> >> On Wed, 10 Apr 2019 at 21:43, Song Liu <songliubraving@fb.com> wrote:
> >>>
> >>> Hi Morten,
> >>>
> >>>> On Apr 10, 2019, at 4:59 AM, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >>>>
> >>
> >>>>
> >>>> The bit that isn't clear to me, is _why_ adding idle cycles helps your
> >>>> workload. I'm not convinced that adding headroom gives any latency
> >>>> improvements beyond watering down the impact of your side jobs. AFAIK,
> >>>
> >>> We think the latency improvements actually come from watering down the
> >>> impact of side jobs. It is not just statistically improving average
> >>> latency numbers, but also reduces resource contention caused by the side
> >>> workload. I don't know whether it is from reducing contention of ALUs,
> >>> memory bandwidth, CPU caches, or something else, but we saw reduced
> >>> latencies when headroom is used.
> >>>
> >>>> the throttling mechanism effectively removes the throttled tasks from
> >>>> the schedule according to a specific duty cycle. When the side job is
> >>>> not throttled the main workload is experiencing the same latency issues
> >>>> as before, but by dynamically tuning the side job throttling you can
> >>>> achieve a better average latency. Am I missing something?
> >>>>
> >>>> Have you looked at your distribution of main job latency and tried to
> >>>> compare with when throttling is active/not active?
> >>>
> >>> cfs_bandwidth adjusts allowed runtime for each task_group each period
> >>> (configurable, 100ms by default). cpu.headroom logic applies gentle
> >>> throttling, so that the side workload gets some runtime in every period.
> >>> Therefore, if we look at time window equal to or bigger than 100ms, we
> >>> don't really see "throttling active time" vs. "throttling inactive time".
> >>>
> >>>>
> >>>> I'm wondering if the headroom solution is really the right solution for
> >>>> your use-case or if what you are really after is something which is
> >>>> lower priority than just setting the weight to 1. Something that
> >>>
> >>> The experiments show that, cpu.weight does proper work for priority: the
> >>> main workload gets priority to use the CPU; while the side workload only
> >>> fill the idle CPU. However, this is not sufficient, as the side workload
> >>> creates big enough contention to impact the main workload.
> >>>
> >>>> (nearly) always gets pre-empted by your main job (SCHED_BATCH and
> >>>> SCHED_IDLE might not be enough). If your main job consist
> >>>> of lots of relatively short wake-ups things like the min_granularity
> >>>> could have significant latency impact.
> >>>
> >>> cpu.headroom gives benefits in addition to optimizations in pre-empt
> >>> side. By maintaining some idle time, fewer pre-empt actions are
> >>> necessary, thus the main workload will get better latency.
> >>
> >> I agree with Morten's proposal, SCHED_IDLE should help your latency
> >> problem because side job will be directly preempted unlike normal cfs
> >> task even lowest priority.
> >> In addition to min_granularity, sched_period also has an impact on the
> >> time that a task has to wait before preempting the running task. Also,
> >> some sched_feature like GENTLE_FAIR_SLEEPERS can also impact the
> >> latency of a task.
> >>
> >> It would be nice to know if the latency problem comes from contention
> >> on cache resources or if it's mainly because you main load waits
> >> before running on a CPU
> >>
> >> Regards,
> >> Vincent
> >
> > Thanks for these suggestions. Here are some more tests to show the impact
> > of scheduler knobs and cpu.headroom.
> >
> > side-load | cpu.headroom | side/cpu.weight | min_gran | cpu-idle | main/latency
> > --------------------------------------------------------------------------------
> >  none    |      0       |     n/a         |    1 ms  |  45.20%  |   1.00
> > ffmpeg   |      0       |      1          |   10 ms  |   3.38%  |   1.46
> > ffmpeg   |      0       |   SCHED_IDLE    |    1 ms  |   5.69%  |   1.42
> > ffmpeg   |    20%       |   SCHED_IDLE    |    1 ms  |  19.00%  |   1.13
> > ffmpeg   |    30%       |   SCHED_IDLE    |    1 ms  |  27.60%  |   1.08
> >
> > In all these cases, the main workload is loaded with same level of
> > traffic (request per second). Main workload latency numbers are normalized
> > based on the baseline (first row).
> >
> > For the baseline, the main workload runs without any side workload, the
> > system has about 45.20% idle CPU.
> >
> > The next two rows compare the impact of scheduling knobs cpu.weight and
> > sched_min_granularity. With cpu.weight of 1 and min_granularity of 10ms,
> > we see a latency of 1.46; with SCHED_IDLE and min_granularity of 1ms, we
> > see a latency of 1.42. So SCHED_IDLE and min_granularity help protecting
> > the main workload. However, it is not sufficient, as the latency overhead
> > is high (>40%).
> >
> > The last two rows show the benefit of cpu.headroom. With 20% headroom,
> > the latency is 1.13; while with 30% headroom, the latency is 1.08.
> >
> > We can also see a clear correlation between latency and global idle CPU:
> > more idle CPU yields better lower latency.
> >
> > Over all, these results show that cpu.headroom provides effective
> > mechanism to control the latency impact of side workloads. Other knobs
> > could also help the latency, but they are not as effective and flexible
> > as cpu.headroom.
> >
> > Does this analysis address your concern?

So, your results show that the sched_idle class doesn't provide the
intended behavior because it still delays the scheduling of sched_other
tasks. In fact, the wakeup path of the scheduler doesn't make any
distinction between a cpu running a sched_other task and a cpu running
a sched_idle task when looking for the idlest cpu, so it can create
contention between sched_other tasks while another cpu is running a
sched_idle task.
Viresh (cced to this email) is working on improving such behavior at
wake up and has sent a patch related to the subject:
https://lkml.org/lkml/2019/4/25/251
I'm curious if this would improve the results.
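
To make the wakeup-path point concrete, here is a toy Python model. It
is not kernel code: the Cpu class and both pick_* functions are made up
for illustration, and the real scheduler compares load rather than raw
task counts. It only shows how a policy-blind choice can stack a waking
sched_other task behind another sched_other task while a cpu that runs
nothing but sched_idle work is available.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    cpu_id: int
    nr_other: int  # runnable SCHED_OTHER tasks (hypothetical counter)
    nr_idle: int   # runnable SCHED_IDLE tasks (hypothetical counter)


def pick_cpu_policy_blind(cpus):
    """Pick the cpu with the fewest runnable tasks, whatever their policy."""
    return min(cpus, key=lambda c: (c.nr_other + c.nr_idle, c.cpu_id))


def pick_cpu_idle_aware(cpus):
    """Prefer the cpu with the fewest SCHED_OTHER tasks; SCHED_IDLE tasks
    only break ties, since a waking SCHED_OTHER task can preempt them."""
    return min(cpus, key=lambda c: (c.nr_other, c.nr_idle, c.cpu_id))


# cpu0 runs one sched_other task, cpu1 runs one sched_idle task.
cpus = [Cpu(0, nr_other=1, nr_idle=0), Cpu(1, nr_other=0, nr_idle=1)]

blind = pick_cpu_policy_blind(cpus)
aware = pick_cpu_idle_aware(cpus)
print("policy-blind pick: cpu", blind.cpu_id)
print("idle-aware pick:   cpu", aware.cpu_id)
```

In this model the policy-blind pick sees two equally busy cpus and
falls back to the lower cpu id, so the waking task queues behind the
sched_other task on cpu0. The idle-aware pick chooses cpu1.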

Regards,
Vincent

> >
> > Thanks,
> > Song
> >
>
> Could you please share your comments and suggestions on this work? Did
> the results address your questions/concerns?
>
> Thanks again,
> Song
>
> >>
> >>>
> >>> Thanks,
> >>> Song
> >>>
> >>>>
> >>>> Morten
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-04-29 12:24           ` Vincent Guittot
@ 2019-04-30  6:10             ` Song Liu
  2019-04-30 16:20               ` Vincent Guittot
  0 siblings, 1 reply; 26+ messages in thread
From: Song Liu @ 2019-04-30  6:10 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Morten Rasmussen, linux-kernel, cgroups, mingo, peterz, tglx,
	Kernel Team, viresh kumar



> On Apr 29, 2019, at 8:24 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> 
> Hi Song,
> 
> On Sun, 28 Apr 2019 at 21:47, Song Liu <songliubraving@fb.com> wrote:
>> 
>> Hi Morten and Vincent,
>> 
>>> On Apr 22, 2019, at 6:22 PM, Song Liu <songliubraving@fb.com> wrote:
>>> 
>>> Hi Vincent,
>>> 
>>>> On Apr 17, 2019, at 5:56 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>>>> 
>>>> On Wed, 10 Apr 2019 at 21:43, Song Liu <songliubraving@fb.com> wrote:
>>>>> 
>>>>> Hi Morten,
>>>>> 
>>>>>> On Apr 10, 2019, at 4:59 AM, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>>>>>> 
>>>> 
>>>>>> 
>>>>>> The bit that isn't clear to me, is _why_ adding idle cycles helps your
>>>>>> workload. I'm not convinced that adding headroom gives any latency
>>>>>> improvements beyond watering down the impact of your side jobs. AFAIK,
>>>>> 
>>>>> We think the latency improvements actually come from watering down the
>>>>> impact of side jobs. It is not just statistically improving average
>>>>> latency numbers, but also reduces resource contention caused by the side
>>>>> workload. I don't know whether it is from reducing contention of ALUs,
>>>>> memory bandwidth, CPU caches, or something else, but we saw reduced
>>>>> latencies when headroom is used.
>>>>> 
>>>>>> the throttling mechanism effectively removes the throttled tasks from
>>>>>> the schedule according to a specific duty cycle. When the side job is
>>>>>> not throttled the main workload is experiencing the same latency issues
>>>>>> as before, but by dynamically tuning the side job throttling you can
>>>>>> achieve a better average latency. Am I missing something?
>>>>>> 
>>>>>> Have you looked at your distribution of main job latency and tried to
>>>>>> compare with when throttling is active/not active?
>>>>> 
>>>>> cfs_bandwidth adjusts allowed runtime for each task_group each period
>>>>> (configurable, 100ms by default). cpu.headroom logic applies gentle
>>>>> throttling, so that the side workload gets some runtime in every period.
>>>>> Therefore, if we look at time window equal to or bigger than 100ms, we
>>>>> don't really see "throttling active time" vs. "throttling inactive time".
>>>>> 
>>>>>> 
>>>>>> I'm wondering if the headroom solution is really the right solution for
>>>>>> your use-case or if what you are really after is something which is
>>>>>> lower priority than just setting the weight to 1. Something that
>>>>> 
>>>>> The experiments show that, cpu.weight does proper work for priority: the
>>>>> main workload gets priority to use the CPU; while the side workload only
>>>>> fill the idle CPU. However, this is not sufficient, as the side workload
>>>>> creates big enough contention to impact the main workload.
>>>>> 
>>>>>> (nearly) always gets pre-empted by your main job (SCHED_BATCH and
>>>>>> SCHED_IDLE might not be enough). If your main job consist
>>>>>> of lots of relatively short wake-ups things like the min_granularity
>>>>>> could have significant latency impact.
>>>>> 
>>>>> cpu.headroom gives benefits in addition to optimizations in pre-empt
>>>>> side. By maintaining some idle time, fewer pre-empt actions are
>>>>> necessary, thus the main workload will get better latency.
>>>> 
>>>> I agree with Morten's proposal, SCHED_IDLE should help your latency
>>>> problem because side job will be directly preempted unlike normal cfs
>>>> task even lowest priority.
>>>> In addition to min_granularity, sched_period also has an impact on the
>>>> time that a task has to wait before preempting the running task. Also,
>>>> some sched_feature like GENTLE_FAIR_SLEEPERS can also impact the
>>>> latency of a task.
>>>> 
>>>> It would be nice to know if the latency problem comes from contention
>>>> on cache resources or if it's mainly because you main load waits
>>>> before running on a CPU
>>>> 
>>>> Regards,
>>>> Vincent
>>> 
>>> Thanks for these suggestions. Here are some more tests to show the impact
>>> of scheduler knobs and cpu.headroom.
>>> 
>>> side-load | cpu.headroom | side/cpu.weight | min_gran | cpu-idle | main/latency
>>> --------------------------------------------------------------------------------
>>> none    |      0       |     n/a         |    1 ms  |  45.20%  |   1.00
>>> ffmpeg   |      0       |      1          |   10 ms  |   3.38%  |   1.46
>>> ffmpeg   |      0       |   SCHED_IDLE    |    1 ms  |   5.69%  |   1.42
>>> ffmpeg   |    20%       |   SCHED_IDLE    |    1 ms  |  19.00%  |   1.13
>>> ffmpeg   |    30%       |   SCHED_IDLE    |    1 ms  |  27.60%  |   1.08
>>> 
>>> In all these cases, the main workload is loaded with same level of
>>> traffic (request per second). Main workload latency numbers are normalized
>>> based on the baseline (first row).
>>> 
>>> For the baseline, the main workload runs without any side workload, the
>>> system has about 45.20% idle CPU.
>>> 
>>> The next two rows compare the impact of scheduling knobs cpu.weight and
>>> sched_min_granularity. With cpu.weight of 1 and min_granularity of 10ms,
>>> we see a latency of 1.46; with SCHED_IDLE and min_granularity of 1ms, we
>>> see a latency of 1.42. So SCHED_IDLE and min_granularity help protecting
>>> the main workload. However, it is not sufficient, as the latency overhead
>>> is high (>40%).
>>> 
>>> The last two rows show the benefit of cpu.headroom. With 20% headroom,
>>> the latency is 1.13; while with 30% headroom, the latency is 1.08.
>>> 
>>> We can also see a clear correlation between latency and global idle CPU:
>>> more idle CPU yields better lower latency.
>>> 
>>> Over all, these results show that cpu.headroom provides effective
>>> mechanism to control the latency impact of side workloads. Other knobs
>>> could also help the latency, but they are not as effective and flexible
>>> as cpu.headroom.
>>> 
>>> Does this analysis address your concern?
> 
> So, you results show that sched_idle class doesn't provide the
> intended behavior because it still delay the scheduling of sched_other
> tasks. In fact, the wakeup path of the scheduler doesn't make any
> difference between a cpu running a sched_other and a cpu running a
> sched_idle when looking for the idlest cpu and it can create some
> contentions between sched_other tasks whereas a cpu runs sched_idle
> task.

I don't think scheduling delay is the only (or dominating) factor in
the extra latency. Here is some data to show it.

I measured IPC (instructions per cycle) of the main workload under 
different scenarios:

side-load | cpu.headroom | side/cpu.weight  | IPC   
----------------------------------------------------
 none     |     0%       |       N/A        | 0.66 
 ffmpeg   |     0%       |    SCHED_IDLE    | 0.53 
 ffmpeg   |    20%       |    SCHED_IDLE    | 0.58 
 ffmpeg   |    30%       |    SCHED_IDLE    | 0.62 

These data show that the side workload has a negative impact on the 
main workload's IPC. And cpu.headroom could help reduce this impact.  
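
For reference, the IPC figures in the table above can be restated as a
relative loss against the no-side-load baseline. This is only
arithmetic on the numbers already shown; the helper name
ipc_loss_percent is made up for illustration.

```python
# IPC values copied from the table above.
BASELINE_IPC = 0.66
IPC_WITH_FFMPEG = {
    "headroom 0%": 0.53,
    "headroom 20%": 0.58,
    "headroom 30%": 0.62,
}


def ipc_loss_percent(ipc, baseline=BASELINE_IPC):
    """Relative IPC drop versus the baseline, in percent."""
    return (baseline - ipc) / baseline * 100.0


for label, ipc in IPC_WITH_FFMPEG.items():
    loss = ipc_loss_percent(ipc)
    print(f"{label}: IPC {ipc:.2f}, {loss:.1f}% below baseline")
```

This gives roughly 19.7%, 12.1% and 6.1% below baseline for 0%, 20%
and 30% headroom respectively.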

Therefore, while optimizations in the wakeup path should help the
latency, cpu.headroom would add _significant_ benefit on top of that.

Does this assessment make sense? 


> Viresh (cced to this email) is working on improving such behavior at
> wake up and has sent an patch related to the subject:
> https://lkml.org/lkml/2019/4/25/251
> I'm curious if this would improve the results.

I could try it with our workload next week (I am at LSF/MM this
week). Also, please keep in mind that this test sometimes takes
multiple days to set up and run.

Thanks,
Song

> 
> Regards,
> Vincent
> 
>>> 
>>> Thanks,
>>> Song
>>> 
>> 
>> Could you please share your comments and suggestions on this work? Did
>> the results address your questions/concerns?
>> 
>> Thanks again,
>> Song
>> 
>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Song
>>>>> 
>>>>>> 
>>>>>> Morten


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-04-30  6:10             ` Song Liu
@ 2019-04-30 16:20               ` Vincent Guittot
  2019-04-30 16:54                 ` Song Liu
  0 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2019-04-30 16:20 UTC (permalink / raw)
  To: Song Liu
  Cc: Morten Rasmussen, linux-kernel, cgroups, mingo, peterz, tglx,
	Kernel Team, viresh kumar

Hi Song,

On Tue, 30 Apr 2019 at 08:11, Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Apr 29, 2019, at 8:24 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> >
> > Hi Song,
> >
> > On Sun, 28 Apr 2019 at 21:47, Song Liu <songliubraving@fb.com> wrote:
> >>
> >> Hi Morten and Vincent,
> >>
> >>> On Apr 22, 2019, at 6:22 PM, Song Liu <songliubraving@fb.com> wrote:
> >>>
> >>> Hi Vincent,
> >>>
> >>>> On Apr 17, 2019, at 5:56 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> >>>>
> >>>> On Wed, 10 Apr 2019 at 21:43, Song Liu <songliubraving@fb.com> wrote:
> >>>>>
> >>>>> Hi Morten,
> >>>>>
> >>>>>> On Apr 10, 2019, at 4:59 AM, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >>>>>>
> >>>>
> >>>>>>
> >>>>>> The bit that isn't clear to me, is _why_ adding idle cycles helps your
> >>>>>> workload. I'm not convinced that adding headroom gives any latency
> >>>>>> improvements beyond watering down the impact of your side jobs. AFAIK,
> >>>>>
> >>>>> We think the latency improvements actually come from watering down the
> >>>>> impact of side jobs. It is not just statistically improving average
> >>>>> latency numbers, but also reduces resource contention caused by the side
> >>>>> workload. I don't know whether it is from reducing contention of ALUs,
> >>>>> memory bandwidth, CPU caches, or something else, but we saw reduced
> >>>>> latencies when headroom is used.
> >>>>>
> >>>>>> the throttling mechanism effectively removes the throttled tasks from
> >>>>>> the schedule according to a specific duty cycle. When the side job is
> >>>>>> not throttled the main workload is experiencing the same latency issues
> >>>>>> as before, but by dynamically tuning the side job throttling you can
> >>>>>> achieve a better average latency. Am I missing something?
> >>>>>>
> >>>>>> Have you looked at your distribution of main job latency and tried to
> >>>>>> compare with when throttling is active/not active?
> >>>>>
> >>>>> cfs_bandwidth adjusts allowed runtime for each task_group each period
> >>>>> (configurable, 100ms by default). cpu.headroom logic applies gentle
> >>>>> throttling, so that the side workload gets some runtime in every period.
> >>>>> Therefore, if we look at time window equal to or bigger than 100ms, we
> >>>>> don't really see "throttling active time" vs. "throttling inactive time".
> >>>>>
> >>>>>>
> >>>>>> I'm wondering if the headroom solution is really the right solution for
> >>>>>> your use-case or if what you are really after is something which is
> >>>>>> lower priority than just setting the weight to 1. Something that
> >>>>>
> >>>>> The experiments show that, cpu.weight does proper work for priority: the
> >>>>> main workload gets priority to use the CPU; while the side workload only
> >>>>> fill the idle CPU. However, this is not sufficient, as the side workload
> >>>>> creates big enough contention to impact the main workload.
> >>>>>
> >>>>>> (nearly) always gets pre-empted by your main job (SCHED_BATCH and
> >>>>>> SCHED_IDLE might not be enough). If your main job consist
> >>>>>> of lots of relatively short wake-ups things like the min_granularity
> >>>>>> could have significant latency impact.
> >>>>>
> >>>>> cpu.headroom gives benefits in addition to optimizations in pre-empt
> >>>>> side. By maintaining some idle time, fewer pre-empt actions are
> >>>>> necessary, thus the main workload will get better latency.
> >>>>
> >>>> I agree with Morten's proposal, SCHED_IDLE should help your latency
> >>>> problem because side job will be directly preempted unlike normal cfs
> >>>> task even lowest priority.
> >>>> In addition to min_granularity, sched_period also has an impact on the
> >>>> time that a task has to wait before preempting the running task. Also,
> >>>> some sched_feature like GENTLE_FAIR_SLEEPERS can also impact the
> >>>> latency of a task.
> >>>>
> >>>> It would be nice to know if the latency problem comes from contention
> >>>> on cache resources or if it's mainly because you main load waits
> >>>> before running on a CPU
> >>>>
> >>>> Regards,
> >>>> Vincent
> >>>
> >>> Thanks for these suggestions. Here are some more tests to show the impact
> >>> of scheduler knobs and cpu.headroom.
> >>>
> >>> side-load | cpu.headroom | side/cpu.weight | min_gran | cpu-idle | main/latency
> >>> --------------------------------------------------------------------------------
> >>> none    |      0       |     n/a         |    1 ms  |  45.20%  |   1.00
> >>> ffmpeg   |      0       |      1          |   10 ms  |   3.38%  |   1.46
> >>> ffmpeg   |      0       |   SCHED_IDLE    |    1 ms  |   5.69%  |   1.42
> >>> ffmpeg   |    20%       |   SCHED_IDLE    |    1 ms  |  19.00%  |   1.13
> >>> ffmpeg   |    30%       |   SCHED_IDLE    |    1 ms  |  27.60%  |   1.08
> >>>
> >>> In all these cases, the main workload is loaded with same level of
> >>> traffic (request per second). Main workload latency numbers are normalized
> >>> based on the baseline (first row).
> >>>
> >>> For the baseline, the main workload runs without any side workload, the
> >>> system has about 45.20% idle CPU.
> >>>
> >>> The next two rows compare the impact of scheduling knobs cpu.weight and
> >>> sched_min_granularity. With cpu.weight of 1 and min_granularity of 10ms,
> >>> we see a latency of 1.46; with SCHED_IDLE and min_granularity of 1ms, we
> >>> see a latency of 1.42. So SCHED_IDLE and min_granularity help protecting
> >>> the main workload. However, it is not sufficient, as the latency overhead
> >>> is high (>40%).
> >>>
> >>> The last two rows show the benefit of cpu.headroom. With 20% headroom,
> >>> the latency is 1.13; while with 30% headroom, the latency is 1.08.
> >>>
> >>> We can also see a clear correlation between latency and global idle CPU:
> >>> more idle CPU yields better lower latency.
> >>>
> >>> Over all, these results show that cpu.headroom provides effective
> >>> mechanism to control the latency impact of side workloads. Other knobs
> >>> could also help the latency, but they are not as effective and flexible
> >>> as cpu.headroom.
> >>>
> >>> Does this analysis address your concern?
> >
> > So, you results show that sched_idle class doesn't provide the
> > intended behavior because it still delay the scheduling of sched_other
> > tasks. In fact, the wakeup path of the scheduler doesn't make any
> > difference between a cpu running a sched_other and a cpu running a
> > sched_idle when looking for the idlest cpu and it can create some
> > contentions between sched_other tasks whereas a cpu runs sched_idle
> > task.
>
> I don't think scheduling delay is the only (or dominating) factor of
> extra latency. Here are some data to show it.
>
> I measured IPC (instructions per cycle) of the main workload under
> different scenarios:
>
> side-load | cpu.headroom | side/cpu.weight  | IPC
> ----------------------------------------------------
>  none     |     0%       |       N/A        | 0.66
>  ffmpeg   |     0%       |    SCHED_IDLE    | 0.53
>  ffmpeg   |    20%       |    SCHED_IDLE    | 0.58
>  ffmpeg   |    30%       |    SCHED_IDLE    | 0.62
>
> These data show that the side workload has a negative impact on the
> main workload's IPC. And cpu.headroom could help reduce this impact.
>
> Therefore, while optimizations in the wakeup path should help the
> latency; cpu.headroom would add _significant_ benefit on top of that.

It seems normal that the side workload has a negative impact on IPC
because of resource sharing, but your previous results showed a 42%
latency regression with sched_idle, which can't be linked only to
resource access contention.
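
A rough way to check this against the numbers in the thread is
sketched below. It rests on an assumption that is not in the data:
that a request executes the same number of instructions at the same
clock frequency in every scenario, so its on-cpu time scales as
1/IPC. It also assumes the IPC rows and the SCHED_IDLE latency rows
come from matching configurations. The function name
latency_from_ipc is made up for illustration.

```python
BASELINE_IPC = 0.66

# (cpu.headroom, measured IPC, measured normalized latency), taken
# from the SCHED_IDLE rows of the two tables quoted in this thread.
ROWS = [
    ("0%", 0.53, 1.42),
    ("20%", 0.58, 1.13),
    ("30%", 0.62, 1.08),
]


def latency_from_ipc(ipc, baseline=BASELINE_IPC):
    """Normalized latency if slowdown came only from lower IPC
    (assumption: fixed instruction count and clock frequency)."""
    return baseline / ipc


for headroom, ipc, measured in ROWS:
    implied = latency_from_ipc(ipc)
    print(f"headroom {headroom}: IPC-only estimate {implied:.2f}, "
          f"measured {measured:.2f}")
```

Under that assumption the 0% headroom row comes out at about 1.25
against a measured 1.42, so lower IPC alone would not account for
the whole regression. The model ignores queueing and wakeup delay
entirely, so it is only a first-order estimate.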

>
> Does this assessment make sense?
>
>
> > Viresh (cced to this email) is working on improving such behavior at
> > wake up and has sent an patch related to the subject:
> > https://lkml.org/lkml/2019/4/25/251
> > I'm curious if this would improve the results.
>
> I could try it with our workload next week (I am at LSF/MM this
> week). Also, please keep in mind that this test sometimes takes
> multiple days to setup and run.

Yes, I understand. It would be good to have a simpler setup that
reproduces the behavior of your setup, in order to run preliminary
tests and analyse the behavior.

>
> Thanks,
> Song
>
> >
> > Regards,
> > Vincent
> >
> >>>
> >>> Thanks,
> >>> Song
> >>>
> >>
> >> Could you please share your comments and suggestions on this work? Did
> >> the results address your questions/concerns?
> >>
> >> Thanks again,
> >> Song
> >>
> >>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Song
> >>>>>
> >>>>>>
> >>>>>> Morten
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-04-30 16:20               ` Vincent Guittot
@ 2019-04-30 16:54                 ` Song Liu
  2019-05-10 18:22                   ` Song Liu
  0 siblings, 1 reply; 26+ messages in thread
From: Song Liu @ 2019-04-30 16:54 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Morten Rasmussen, linux-kernel, cgroups, mingo, peterz, tglx,
	Kernel Team, viresh kumar



> On Apr 30, 2019, at 12:20 PM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> 
> Hi Song,
> 
> On Tue, 30 Apr 2019 at 08:11, Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Apr 29, 2019, at 8:24 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>>> 
>>> Hi Song,
>>> 
>>> On Sun, 28 Apr 2019 at 21:47, Song Liu <songliubraving@fb.com> wrote:
>>>> 
>>>> Hi Morten and Vincent,
>>>> 
>>>>> On Apr 22, 2019, at 6:22 PM, Song Liu <songliubraving@fb.com> wrote:
>>>>> 
>>>>> Hi Vincent,
>>>>> 
>>>>>> On Apr 17, 2019, at 5:56 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>>>>>> 
>>>>>> On Wed, 10 Apr 2019 at 21:43, Song Liu <songliubraving@fb.com> wrote:
>>>>>>> 
>>>>>>> Hi Morten,
>>>>>>> 
>>>>>>>> On Apr 10, 2019, at 4:59 AM, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>>>>>>>> 
>>>>>> 
>>>>>>>> 
>>>>>>>> The bit that isn't clear to me, is _why_ adding idle cycles helps your
>>>>>>>> workload. I'm not convinced that adding headroom gives any latency
>>>>>>>> improvements beyond watering down the impact of your side jobs. AFAIK,
>>>>>>> 
>>>>>>> We think the latency improvements actually come from watering down the
>>>>>>> impact of side jobs. It is not just statistically improving average
>>>>>>> latency numbers, but also reduces resource contention caused by the side
>>>>>>> workload. I don't know whether it is from reducing contention of ALUs,
>>>>>>> memory bandwidth, CPU caches, or something else, but we saw reduced
>>>>>>> latencies when headroom is used.
>>>>>>> 
>>>>>>>> the throttling mechanism effectively removes the throttled tasks from
>>>>>>>> the schedule according to a specific duty cycle. When the side job is
>>>>>>>> not throttled the main workload is experiencing the same latency issues
>>>>>>>> as before, but by dynamically tuning the side job throttling you can
>>>>>>>> achieve a better average latency. Am I missing something?
>>>>>>>> 
>>>>>>>> Have you looked at your distribution of main job latency and tried to
>>>>>>>> compare with when throttling is active/not active?
>>>>>>> 
>>>>>>> cfs_bandwidth adjusts allowed runtime for each task_group each period
>>>>>>> (configurable, 100ms by default). cpu.headroom logic applies gentle
>>>>>>> throttling, so that the side workload gets some runtime in every period.
>>>>>>> Therefore, if we look at time window equal to or bigger than 100ms, we
>>>>>>> don't really see "throttling active time" vs. "throttling inactive time".
>>>>>>> 
>>>>>>>> 
>>>>>>>> I'm wondering if the headroom solution is really the right solution for
>>>>>>>> your use-case or if what you are really after is something which is
>>>>>>>> lower priority than just setting the weight to 1. Something that
>>>>>>> 
>>>>>>> The experiments show that, cpu.weight does proper work for priority: the
>>>>>>> main workload gets priority to use the CPU; while the side workload only
>>>>>>> fill the idle CPU. However, this is not sufficient, as the side workload
>>>>>>> creates big enough contention to impact the main workload.
>>>>>>> 
>>>>>>>> (nearly) always gets pre-empted by your main job (SCHED_BATCH and
>>>>>>>> SCHED_IDLE might not be enough). If your main job consist
>>>>>>>> of lots of relatively short wake-ups things like the min_granularity
>>>>>>>> could have significant latency impact.
>>>>>>> 
>>>>>>> cpu.headroom gives benefits in addition to optimizations in pre-empt
>>>>>>> side. By maintaining some idle time, fewer pre-empt actions are
>>>>>>> necessary, thus the main workload will get better latency.
>>>>>> 
>>>>>> I agree with Morten's proposal, SCHED_IDLE should help your latency
>>>>>> problem because side job will be directly preempted unlike normal cfs
>>>>>> task even lowest priority.
>>>>>> In addition to min_granularity, sched_period also has an impact on the
>>>>>> time that a task has to wait before preempting the running task. Also,
>>>>>> some sched_feature like GENTLE_FAIR_SLEEPERS can also impact the
>>>>>> latency of a task.
>>>>>> 
>>>>>> It would be nice to know if the latency problem comes from contention
>>>>>> on cache resources or if it's mainly because you main load waits
>>>>>> before running on a CPU
>>>>>> 
>>>>>> Regards,
>>>>>> Vincent
>>>>> 
>>>>> Thanks for these suggestions. Here are some more tests to show the impact
>>>>> of scheduler knobs and cpu.headroom.
>>>>> 
>>>>> side-load | cpu.headroom | side/cpu.weight | min_gran | cpu-idle | main/latency
>>>>> --------------------------------------------------------------------------------
>>>>> none    |      0       |     n/a         |    1 ms  |  45.20%  |   1.00
>>>>> ffmpeg   |      0       |      1          |   10 ms  |   3.38%  |   1.46
>>>>> ffmpeg   |      0       |   SCHED_IDLE    |    1 ms  |   5.69%  |   1.42
>>>>> ffmpeg   |    20%       |   SCHED_IDLE    |    1 ms  |  19.00%  |   1.13
>>>>> ffmpeg   |    30%       |   SCHED_IDLE    |    1 ms  |  27.60%  |   1.08
>>>>> 
>>>>> In all these cases, the main workload is loaded with same level of
>>>>> traffic (request per second). Main workload latency numbers are normalized
>>>>> based on the baseline (first row).
>>>>> 
>>>>> For the baseline, the main workload runs without any side workload, the
>>>>> system has about 45.20% idle CPU.
>>>>> 
>>>>> The next two rows compare the impact of scheduling knobs cpu.weight and
>>>>> sched_min_granularity. With cpu.weight of 1 and min_granularity of 10ms,
>>>>> we see a latency of 1.46; with SCHED_IDLE and min_granularity of 1ms, we
>>>>> see a latency of 1.42. So SCHED_IDLE and min_granularity help protecting
>>>>> the main workload. However, it is not sufficient, as the latency overhead
>>>>> is high (>40%).
>>>>> 
>>>>> The last two rows show the benefit of cpu.headroom. With 20% headroom,
>>>>> the latency is 1.13; while with 30% headroom, the latency is 1.08.
>>>>> 
>>>>> We can also see a clear correlation between latency and global idle CPU:
>>>>> more idle CPU yields better lower latency.
>>>>> 
>>>>> Over all, these results show that cpu.headroom provides effective
>>>>> mechanism to control the latency impact of side workloads. Other knobs
>>>>> could also help the latency, but they are not as effective and flexible
>>>>> as cpu.headroom.
>>>>> 
>>>>> Does this analysis address your concern?
>>> 
>>> So, you results show that sched_idle class doesn't provide the
>>> intended behavior because it still delay the scheduling of sched_other
>>> tasks. In fact, the wakeup path of the scheduler doesn't make any
>>> difference between a cpu running a sched_other and a cpu running a
>>> sched_idle when looking for the idlest cpu and it can create some
>>> contentions between sched_other tasks whereas a cpu runs sched_idle
>>> task.
>> 
>> I don't think scheduling delay is the only (or dominating) factor of
>> extra latency. Here are some data to show it.
>> 
>> I measured IPC (instructions per cycle) of the main workload under
>> different scenarios:
>> 
>> side-load | cpu.headroom | side/cpu.weight  | IPC
>> ----------------------------------------------------
>> none     |     0%       |       N/A        | 0.66
>> ffmpeg   |     0%       |    SCHED_IDLE    | 0.53
>> ffmpeg   |    20%       |    SCHED_IDLE    | 0.58
>> ffmpeg   |    30%       |    SCHED_IDLE    | 0.62
>> 
>> These data show that the side workload has a negative impact on the
>> main workload's IPC. And cpu.headroom could help reduce this impact.
>> 
>> Therefore, while optimizations in the wakeup path should help the
>> latency; cpu.headroom would add _significant_ benefit on top of that.
> 
> It seems normal that side workload has a negative impact on IPC
> because of resources sharing but your previous results showed a 42%
> regression of latency with sched_idle which is can't be only linked to
> resources access contention

Agreed. I think both scheduling latency and resource contention 
contribute noticeable latency overhead to the main workload. The 
scheduler optimization by Viresh would help reduce the scheduling
latency, but it won't help the resource contention. Hopefully, with 
optimizations in the scheduler, we can meet the latency target with 
smaller cpu.headroom. However, I don't think scheduler optimizations
will eliminate the need for cpu.headroom, as resource contention
always exists, and its impact could be significant.

Do you have further concerns with this patchset?

Thanks,
Song 


>> 
>> Does this assessment make sense?
>> 
>> 
>>> Viresh (cced to this email) is working on improving such behavior at
>>> wake up and has sent an patch related to the subject:
>>> https://lkml.org/lkml/2019/4/25/251
>>> I'm curious if this would improve the results.
>> 
>> I could try it with our workload next week (I am at LSF/MM this
>> week). Also, please keep in mind that this test sometimes takes
>> multiple days to setup and run.
> 
> Yes. I understand. That would be good to have a simpler setup to
> reproduce the behavior of your setup in order to do preliminary tests
> and analyse the behavior
> 
>> 
>> Thanks,
>> Song
>> 
>>> 
>>> Regards,
>>> Vincent
>>> 
>>>>> 
>>>>> Thanks,
>>>>> Song
>>>>> 
>>>> 
>>>> Could you please share your comments and suggestions on this work? Did
>>>> the results address your questions/concerns?
>>>> 
>>>> Thanks again,
>>>> Song
>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Song
>>>>>>> 
>>>>>>>> 
>>>>>>>> Morten


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-04-30 16:54                 ` Song Liu
@ 2019-05-10 18:22                   ` Song Liu
  2019-05-14 20:58                     ` Song Liu
  0 siblings, 1 reply; 26+ messages in thread
From: Song Liu @ 2019-05-10 18:22 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Morten Rasmussen, linux-kernel, cgroups, mingo, peterz, tglx,
	Kernel Team, viresh kumar



> On Apr 30, 2019, at 9:54 AM, Song Liu <songliubraving@fb.com> wrote:
> 
> 
> 
>> On Apr 30, 2019, at 12:20 PM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>> 
>> Hi Song,
>> 
>> On Tue, 30 Apr 2019 at 08:11, Song Liu <songliubraving@fb.com> wrote:
>>> 
>>> 
>>> 
>>>> On Apr 29, 2019, at 8:24 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>>>> 
>>>> Hi Song,
>>>> 
>>>> On Sun, 28 Apr 2019 at 21:47, Song Liu <songliubraving@fb.com> wrote:
>>>>> 
>>>>> Hi Morten and Vincent,
>>>>> 
>>>>>> On Apr 22, 2019, at 6:22 PM, Song Liu <songliubraving@fb.com> wrote:
>>>>>> 
>>>>>> Hi Vincent,
>>>>>> 
>>>>>>> On Apr 17, 2019, at 5:56 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>>>>>>> 
>>>>>>> On Wed, 10 Apr 2019 at 21:43, Song Liu <songliubraving@fb.com> wrote:
>>>>>>>> 
>>>>>>>> Hi Morten,
>>>>>>>> 
>>>>>>>>> On Apr 10, 2019, at 4:59 AM, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>>>>>>>>> 
>>>>>>> 
>>>>>>>>> 
>>>>>>>>> The bit that isn't clear to me, is _why_ adding idle cycles helps your
>>>>>>>>> workload. I'm not convinced that adding headroom gives any latency
>>>>>>>>> improvements beyond watering down the impact of your side jobs. AFAIK,
>>>>>>>> 
>>>>>>>> We think the latency improvements actually come from watering down the
>>>>>>>> impact of side jobs. It is not just statistically improving average
>>>>>>>> latency numbers, but also reduces resource contention caused by the side
>>>>>>>> workload. I don't know whether it is from reducing contention of ALUs,
>>>>>>>> memory bandwidth, CPU caches, or something else, but we saw reduced
>>>>>>>> latencies when headroom is used.
>>>>>>>> 
>>>>>>>>> the throttling mechanism effectively removes the throttled tasks from
>>>>>>>>> the schedule according to a specific duty cycle. When the side job is
>>>>>>>>> not throttled the main workload is experiencing the same latency issues
>>>>>>>>> as before, but by dynamically tuning the side job throttling you can
>>>>>>>>> achieve a better average latency. Am I missing something?
>>>>>>>>> 
>>>>>>>>> Have you looked at your distribution of main job latency and tried to
>>>>>>>>> compare with when throttling is active/not active?
>>>>>>>> 
>>>>>>>> cfs_bandwidth adjusts allowed runtime for each task_group each period
>>>>>>>> (configurable, 100ms by default). cpu.headroom logic applies gentle
>>>>>>>> throttling, so that the side workload gets some runtime in every period.
>>>>>>>> Therefore, if we look at time window equal to or bigger than 100ms, we
>>>>>>>> don't really see "throttling active time" vs. "throttling inactive time".
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I'm wondering if the headroom solution is really the right solution for
>>>>>>>>> your use-case or if what you are really after is something which is
>>>>>>>>> lower priority than just setting the weight to 1. Something that
>>>>>>>> 
>>>>>>>> The experiments show that, cpu.weight does proper work for priority: the
>>>>>>>> main workload gets priority to use the CPU; while the side workload only
>>>>>>>> fill the idle CPU. However, this is not sufficient, as the side workload
>>>>>>>> creates big enough contention to impact the main workload.
>>>>>>>> 
>>>>>>>>> (nearly) always gets pre-empted by your main job (SCHED_BATCH and
>>>>>>>>> SCHED_IDLE might not be enough). If your main job consist
>>>>>>>>> of lots of relatively short wake-ups things like the min_granularity
>>>>>>>>> could have significant latency impact.
>>>>>>>> 
>>>>>>>> cpu.headroom gives benefits in addition to optimizations in pre-empt
>>>>>>>> side. By maintaining some idle time, fewer pre-empt actions are
>>>>>>>> necessary, thus the main workload will get better latency.
>>>>>>> 
>>>>>>> I agree with Morten's proposal, SCHED_IDLE should help your latency
>>>>>>> problem because side job will be directly preempted unlike normal cfs
>>>>>>> task even lowest priority.
>>>>>>> In addition to min_granularity, sched_period also has an impact on the
>>>>>>> time that a task has to wait before preempting the running task. Also,
>>>>>>> some sched_feature like GENTLE_FAIR_SLEEPERS can also impact the
>>>>>>> latency of a task.
>>>>>>> 
>>>>>>> It would be nice to know if the latency problem comes from contention
>>>>>>> on cache resources or if it's mainly because you main load waits
>>>>>>> before running on a CPU
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Vincent
>>>>>> 
>>>>>> Thanks for these suggestions. Here are some more tests to show the impact
>>>>>> of scheduler knobs and cpu.headroom.
>>>>>> 
>>>>>> side-load | cpu.headroom | side/cpu.weight | min_gran | cpu-idle | main/latency
>>>>>> --------------------------------------------------------------------------------
>>>>>> none    |      0       |     n/a         |    1 ms  |  45.20%  |   1.00
>>>>>> ffmpeg   |      0       |      1          |   10 ms  |   3.38%  |   1.46
>>>>>> ffmpeg   |      0       |   SCHED_IDLE    |    1 ms  |   5.69%  |   1.42
>>>>>> ffmpeg   |    20%       |   SCHED_IDLE    |    1 ms  |  19.00%  |   1.13
>>>>>> ffmpeg   |    30%       |   SCHED_IDLE    |    1 ms  |  27.60%  |   1.08
>>>>>> 
>>>>>> In all these cases, the main workload is loaded with same level of
>>>>>> traffic (request per second). Main workload latency numbers are normalized
>>>>>> based on the baseline (first row).
>>>>>> 
>>>>>> For the baseline, the main workload runs without any side workload, the
>>>>>> system has about 45.20% idle CPU.
>>>>>> 
>>>>>> The next two rows compare the impact of scheduling knobs cpu.weight and
>>>>>> sched_min_granularity. With cpu.weight of 1 and min_granularity of 10ms,
>>>>>> we see a latency of 1.46; with SCHED_IDLE and min_granularity of 1ms, we
>>>>>> see a latency of 1.42. So SCHED_IDLE and min_granularity help protecting
>>>>>> the main workload. However, it is not sufficient, as the latency overhead
>>>>>> is high (>40%).
>>>>>> 
>>>>>> The last two rows show the benefit of cpu.headroom. With 20% headroom,
>>>>>> the latency is 1.13; while with 30% headroom, the latency is 1.08.
>>>>>> 
>>>>>> We can also see a clear correlation between latency and global idle CPU:
>>>>>> more idle CPU yields better lower latency.
>>>>>> 
>>>>>> Over all, these results show that cpu.headroom provides effective
>>>>>> mechanism to control the latency impact of side workloads. Other knobs
>>>>>> could also help the latency, but they are not as effective and flexible
>>>>>> as cpu.headroom.
>>>>>> 
>>>>>> Does this analysis address your concern?
>>>> 
>>>> So, you results show that sched_idle class doesn't provide the
>>>> intended behavior because it still delay the scheduling of sched_other
>>>> tasks. In fact, the wakeup path of the scheduler doesn't make any
>>>> difference between a cpu running a sched_other and a cpu running a
>>>> sched_idle when looking for the idlest cpu and it can create some
>>>> contentions between sched_other tasks whereas a cpu runs sched_idle
>>>> task.
>>> 
>>> I don't think scheduling delay is the only (or dominating) factor of
>>> extra latency. Here are some data to show it.
>>> 
>>> I measured IPC (instructions per cycle) of the main workload under
>>> different scenarios:
>>> 
>>> side-load | cpu.headroom | side/cpu.weight  | IPC
>>> ----------------------------------------------------
>>> none     |     0%       |       N/A        | 0.66
>>> ffmpeg   |     0%       |    SCHED_IDLE    | 0.53
>>> ffmpeg   |    20%       |    SCHED_IDLE    | 0.58
>>> ffmpeg   |    30%       |    SCHED_IDLE    | 0.62
>>> 
>>> These data show that the side workload has a negative impact on the
>>> main workload's IPC. And cpu.headroom could help reduce this impact.
>>> 
>>> Therefore, while optimizations in the wakeup path should help the
>>> latency; cpu.headroom would add _significant_ benefit on top of that.
>> 
>> It seems normal that side workload has a negative impact on IPC
>> because of resources sharing but your previous results showed a 42%
>> regression of latency with sched_idle which is can't be only linked to
>> resources access contention
> 
> Agreed. I think both scheduling latency and resource contention 
> contribute noticeable latency overhead to the main workload. The 
> scheduler optimization by Viresh would help reduce the scheduling
> latency, but it won't help the resource contention. Hopefully, with 
> optimizations in the scheduler, we can meet the latency target with 
> smaller cpu.headroom. However, I don't think scheduler optimizations 
> will eliminate the need of cpu.headroom, as the resource contention
> always exists, and the impact could be significant. 
> 
> Do you have further concerns with this patchset?
> 
> Thanks,
> Song 

Here are some more results with both Viresh's patch and the cpu.headroom
set. In these tests, the side job runs with SCHED_IDLE, so we get the
benefit of Viresh's patch.

We collected another metric here: the average "cpu time" used by the requests.
We also present "wall time" and "wall - cpu" time. "wall time" is the
same as "latency" in previous results. Basically, "wall time" includes cpu
time, scheduling latency, and time spent waiting for data (from the database,
memcache, etc.). We don't have good data that separates scheduling latency
from time spent waiting for data, so we present "wall - cpu" time, which is
the sum of the two. Time spent waiting for data should not change in these
tests, so changes in "wall - cpu" mostly come from scheduling latency.
All the latency numbers are normalized to the "wall time" of the
first row.

side job | cpu.headroom |  cpu-idle | wall time | cpu time | wall - cpu
------------------------------------------------------------------------
 none    |     n/a      |    42.4%  |   1.00    |   0.31   |   0.69
ffmpeg   |       0      |    10.8%  |   1.17    |   0.38   |   0.79
ffmpeg   |     25%      |    22.8%  |   1.08    |   0.35   |   0.73

From these results, we can see that Viresh's patch reduces the latency
overhead of the side job from 42% (in previous results) to 17%, and
a 25% cpu.headroom further reduces the latency overhead to 8%.
cpu.headroom reduces both "cpu time" and "wall - cpu" time,
which means cpu.headroom yields better IPC and lower scheduling latency.
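
The overhead percentages quoted above can be rechecked from the table.
The snippet below is only a sanity check of that arithmetic; the
dictionary layout and the helper name are illustrative and not part of
the patchset.

```python
# Normalized numbers copied from the table above
# (the "wall time" of the first row is 1.00 by definition).
rows = {
    "none":        {"wall": 1.00, "cpu": 0.31},
    "ffmpeg/0":    {"wall": 1.17, "cpu": 0.38},
    "ffmpeg/25%":  {"wall": 1.08, "cpu": 0.35},
}

baseline_wall = rows["none"]["wall"]


def overhead_pct(wall):
    """Latency overhead relative to the baseline row, in whole percent."""
    return round((wall / baseline_wall - 1.0) * 100)


# Latency overhead of each configuration versus the no-side-job baseline.
overheads = {name: overhead_pct(r["wall"]) for name, r in rows.items()}

# Recompute the "wall - cpu" column from the other two columns.
wall_minus_cpu = {name: round(r["wall"] - r["cpu"], 2) for name, r in rows.items()}

for name in rows:
    print(f"{name:12s} overhead={overheads[name]:3d}%  wall-cpu={wall_minus_cpu[name]:.2f}")
```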

I think these data demonstrate that:

  1. Viresh's work is helpful in reducing scheduling latency introduced
     by SCHED_IDLE side jobs.
  2. The cpu.headroom work provides a mechanism to further reduce scheduling
     latency on top of Viresh's work.

Therefore, the combination of the two would give us mechanisms to
control the latency overhead of side workloads.

@Vincent, do these data and analysis make sense from your point of view?

Thanks,
Song


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-05-10 18:22                   ` Song Liu
@ 2019-05-14 20:58                     ` Song Liu
  2019-05-15 10:18                       ` Vincent Guittot
  0 siblings, 1 reply; 26+ messages in thread
From: Song Liu @ 2019-05-14 20:58 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Morten Rasmussen, linux-kernel, cgroups, mingo, peterz, tglx,
	Kernel Team, viresh kumar

Hi Vincent,

> On May 10, 2019, at 11:22 AM, Song Liu <songliubraving@fb.com> wrote:
> 
> 
> 
>> On Apr 30, 2019, at 9:54 AM, Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Apr 30, 2019, at 12:20 PM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>>> 
>>> Hi Song,
>>> 
>>> On Tue, 30 Apr 2019 at 08:11, Song Liu <songliubraving@fb.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Apr 29, 2019, at 8:24 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>>>>> 
>>>>> Hi Song,
>>>>> 
>>>>> On Sun, 28 Apr 2019 at 21:47, Song Liu <songliubraving@fb.com> wrote:
>>>>>> 
>>>>>> Hi Morten and Vincent,
>>>>>> 
>>>>>>> On Apr 22, 2019, at 6:22 PM, Song Liu <songliubraving@fb.com> wrote:
>>>>>>> 
>>>>>>> Hi Vincent,
>>>>>>> 
>>>>>>>> On Apr 17, 2019, at 5:56 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>>>>>>>> 
>>>>>>>> On Wed, 10 Apr 2019 at 21:43, Song Liu <songliubraving@fb.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Morten,
>>>>>>>>> 
>>>>>>>>>> On Apr 10, 2019, at 4:59 AM, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> The bit that isn't clear to me, is _why_ adding idle cycles helps your
>>>>>>>>>> workload. I'm not convinced that adding headroom gives any latency
>>>>>>>>>> improvements beyond watering down the impact of your side jobs. AFAIK,
>>>>>>>>> 
>>>>>>>>> We think the latency improvements actually come from watering down the
>>>>>>>>> impact of side jobs. It is not just statistically improving average
>>>>>>>>> latency numbers, but also reduces resource contention caused by the side
>>>>>>>>> workload. I don't know whether it is from reducing contention of ALUs,
>>>>>>>>> memory bandwidth, CPU caches, or something else, but we saw reduced
>>>>>>>>> latencies when headroom is used.
>>>>>>>>> 
>>>>>>>>>> the throttling mechanism effectively removes the throttled tasks from
>>>>>>>>>> the schedule according to a specific duty cycle. When the side job is
>>>>>>>>>> not throttled the main workload is experiencing the same latency issues
>>>>>>>>>> as before, but by dynamically tuning the side job throttling you can
>>>>>>>>>> achieve a better average latency. Am I missing something?
>>>>>>>>>> 
>>>>>>>>>> Have you looked at your distribution of main job latency and tried to
>>>>>>>>>> compare with when throttling is active/not active?
>>>>>>>>> 
>>>>>>>>> cfs_bandwidth adjusts allowed runtime for each task_group each period
>>>>>>>>> (configurable, 100ms by default). cpu.headroom logic applies gentle
>>>>>>>>> throttling, so that the side workload gets some runtime in every period.
>>>>>>>>> Therefore, if we look at time window equal to or bigger than 100ms, we
>>>>>>>>> don't really see "throttling active time" vs. "throttling inactive time".
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I'm wondering if the headroom solution is really the right solution for
>>>>>>>>>> your use-case or if what you are really after is something which is
>>>>>>>>>> lower priority than just setting the weight to 1. Something that
>>>>>>>>> 
>>>>>>>>> The experiments show that, cpu.weight does proper work for priority: the
>>>>>>>>> main workload gets priority to use the CPU; while the side workload only
>>>>>>>>> fill the idle CPU. However, this is not sufficient, as the side workload
>>>>>>>>> creates big enough contention to impact the main workload.
>>>>>>>>> 
>>>>>>>>>> (nearly) always gets pre-empted by your main job (SCHED_BATCH and
>>>>>>>>>> SCHED_IDLE might not be enough). If your main job consist
>>>>>>>>>> of lots of relatively short wake-ups things like the min_granularity
>>>>>>>>>> could have significant latency impact.
>>>>>>>>> 
>>>>>>>>> cpu.headroom gives benefits in addition to optimizations in pre-empt
>>>>>>>>> side. By maintaining some idle time, fewer pre-empt actions are
>>>>>>>>> necessary, thus the main workload will get better latency.
>>>>>>>> 
>>>>>>>> I agree with Morten's proposal, SCHED_IDLE should help your latency
>>>>>>>> problem because side job will be directly preempted unlike normal cfs
>>>>>>>> task even lowest priority.
>>>>>>>> In addition to min_granularity, sched_period also has an impact on the
>>>>>>>> time that a task has to wait before preempting the running task. Also,
>>>>>>>> some sched_feature like GENTLE_FAIR_SLEEPERS can also impact the
>>>>>>>> latency of a task.
>>>>>>>> 
>>>>>>>> It would be nice to know if the latency problem comes from contention
>>>>>>>> on cache resources or if it's mainly because you main load waits
>>>>>>>> before running on a CPU
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Vincent
>>>>>>> 
>>>>>>> Thanks for these suggestions. Here are some more tests to show the impact
>>>>>>> of scheduler knobs and cpu.headroom.
>>>>>>> 
>>>>>>> side-load | cpu.headroom | side/cpu.weight | min_gran | cpu-idle | main/latency
>>>>>>> --------------------------------------------------------------------------------
>>>>>>> none    |      0       |     n/a         |    1 ms  |  45.20%  |   1.00
>>>>>>> ffmpeg   |      0       |      1          |   10 ms  |   3.38%  |   1.46
>>>>>>> ffmpeg   |      0       |   SCHED_IDLE    |    1 ms  |   5.69%  |   1.42
>>>>>>> ffmpeg   |    20%       |   SCHED_IDLE    |    1 ms  |  19.00%  |   1.13
>>>>>>> ffmpeg   |    30%       |   SCHED_IDLE    |    1 ms  |  27.60%  |   1.08
>>>>>>> 
>>>>>>> In all these cases, the main workload is loaded with same level of
>>>>>>> traffic (request per second). Main workload latency numbers are normalized
>>>>>>> based on the baseline (first row).
>>>>>>> 
>>>>>>> For the baseline, the main workload runs without any side workload, the
>>>>>>> system has about 45.20% idle CPU.
>>>>>>> 
>>>>>>> The next two rows compare the impact of scheduling knobs cpu.weight and
>>>>>>> sched_min_granularity. With cpu.weight of 1 and min_granularity of 10ms,
>>>>>>> we see a latency of 1.46; with SCHED_IDLE and min_granularity of 1ms, we
>>>>>>> see a latency of 1.42. So SCHED_IDLE and min_granularity help protecting
>>>>>>> the main workload. However, it is not sufficient, as the latency overhead
>>>>>>> is high (>40%).
>>>>>>> 
>>>>>>> The last two rows show the benefit of cpu.headroom. With 20% headroom,
>>>>>>> the latency is 1.13; while with 30% headroom, the latency is 1.08.
>>>>>>> 
>>>>>>> We can also see a clear correlation between latency and global idle CPU:
>>>>>>> more idle CPU yields better lower latency.
>>>>>>> 
>>>>>>> Over all, these results show that cpu.headroom provides effective
>>>>>>> mechanism to control the latency impact of side workloads. Other knobs
>>>>>>> could also help the latency, but they are not as effective and flexible
>>>>>>> as cpu.headroom.
>>>>>>> 
>>>>>>> Does this analysis address your concern?
>>>>> 
>>>>> So, you results show that sched_idle class doesn't provide the
>>>>> intended behavior because it still delay the scheduling of sched_other
>>>>> tasks. In fact, the wakeup path of the scheduler doesn't make any
>>>>> difference between a cpu running a sched_other and a cpu running a
>>>>> sched_idle when looking for the idlest cpu and it can create some
>>>>> contentions between sched_other tasks whereas a cpu runs sched_idle
>>>>> task.
>>>> 
>>>> I don't think scheduling delay is the only (or dominating) factor of
>>>> extra latency. Here are some data to show it.
>>>> 
>>>> I measured IPC (instructions per cycle) of the main workload under
>>>> different scenarios:
>>>> 
>>>> side-load | cpu.headroom | side/cpu.weight  | IPC
>>>> ----------------------------------------------------
>>>> none     |     0%       |       N/A        | 0.66
>>>> ffmpeg   |     0%       |    SCHED_IDLE    | 0.53
>>>> ffmpeg   |    20%       |    SCHED_IDLE    | 0.58
>>>> ffmpeg   |    30%       |    SCHED_IDLE    | 0.62
>>>> 
>>>> These data show that the side workload has a negative impact on the
>>>> main workload's IPC. And cpu.headroom could help reduce this impact.
>>>> 
>>>> Therefore, while optimizations in the wakeup path should help the
>>>> latency; cpu.headroom would add _significant_ benefit on top of that.
>>> 
>>> It seems normal that side workload has a negative impact on IPC
>>> because of resources sharing but your previous results showed a 42%
>>> regression of latency with sched_idle which is can't be only linked to
>>> resources access contention
>> 
>> Agreed. I think both scheduling latency and resource contention 
>> contribute noticeable latency overhead to the main workload. The 
>> scheduler optimization by Viresh would help reduce the scheduling
>> latency, but it won't help the resource contention. Hopefully, with 
>> optimizations in the scheduler, we can meet the latency target with 
>> smaller cpu.headroom. However, I don't think scheduler optimizations 
>> will eliminate the need of cpu.headroom, as the resource contention
>> always exists, and the impact could be significant. 
>> 
>> Do you have further concerns with this patchset?
>> 
>> Thanks,
>> Song 
> 
> Here are some more results with both Viresh's patch and the cpu.headroom
> set. In these tests, the side job runs with SCHED_IDLE, so we get benefit
> of Viresh's patch. 
> 
> We collected another metric here, average "cpu time" used by the requests.
> We also presented "wall time" and "wall - cpu" time. "wall time" is the 
> same as "latency" in previous results. Basically, "wall time" includes cpu 
> time, scheduling latency, and time spent waiting for data (from data base, 
> memcache, etc.). We don't have good data that separates scheduling latency
> and time spent waiting for data, so we present "wall - cpu" time, which is 
> the sum of the two. Time spent waiting for data should not change in these 
> tests, so changes in "wall - cpu" mostly comes from scheduling latency. 
> All the latency numbers are normalized based on the "wall time" of the 
> first row. 
> 
> side job | cpu.headroom |  cpu-idle | wall time | cpu time | wall - cpu
> ------------------------------------------------------------------------
> none    |     n/a      |    42.4%  |   1.00    |   0.31   |   0.69
> ffmpeg   |       0      |    10.8%  |   1.17    |   0.38   |   0.79
> ffmpeg   |     25%      |    22.8%  |   1.08    |   0.35   |   0.73
> 
> From these results, we can see that Viresh's patch reduces the latency
> overhead of the side job, from 42% (in previous results) to 17%. And 
> a 25% cpu.headroom further reduces the latency overhead to 8%. 
> cpu.headroom reduces time spent in "cpu time" and "wall - cpu" time, 
> which means cpu.headroom yields better IPC and lower scheduling latency.
> 
> I think these data demonstrate that 
> 
>  1. Viresh's work is helpful in reducing scheduling latency introduced 
>     by SCHED_IDLE side jobs. 
>  2. cpu.headroom work provides mechanism to further reduce scheduling
>     latency on top of Viresh's work. 
> 
> Therefore, the combination of the two work would give us mechanisms to 
> control the latency overhead of side workloads. 
> 
> @Vincent, do these data and analysis make sense from your point of view?

Do you have further questions/concerns with this set? 

As the data show, scheduling latency is not the only source of high
latency here. In fact, with hyper-threading and other shared system
resources (cache, memory, etc.), the side workload would always negatively
impact the latency of the main workload. It is impossible to eliminate
these impacts with scheduler optimizations. On the other hand,
cpu.headroom provides a mechanism to limit such impact.

Optimization and protection are two sides of the problem. While we 
spend a lot of time optimizing the workload (so Viresh's work is really
interesting for us), cpu.headroom works on the protection side. There 
are multiple reasons behind the high latencies. cpu.headroom provides 
universal protection against all these. 

With the protection of cpu.headroom, we can actually do optimizations 
more efficiently, as we can safely start with a high headroom, and 
then try to lower it.

Please let me know your thoughts on this. 

Thanks,
Song




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-05-14 20:58                     ` Song Liu
@ 2019-05-15 10:18                       ` Vincent Guittot
  2019-05-15 15:42                         ` Song Liu
  0 siblings, 1 reply; 26+ messages in thread
From: Vincent Guittot @ 2019-05-15 10:18 UTC (permalink / raw)
  To: Song Liu
  Cc: Morten Rasmussen, linux-kernel, cgroups, mingo, peterz, tglx,
	Kernel Team, viresh kumar

Hi Song,

On Tue, 14 May 2019 at 22:58, Song Liu <songliubraving@fb.com> wrote:
>
> Hi Vincent,
>

[snip]

> >
> > Here are some more results with both Viresh's patch and the cpu.headroom
> > set. In these tests, the side job runs with SCHED_IDLE, so we get benefit
> > of Viresh's patch.
> >
> > We collected another metric here, average "cpu time" used by the requests.
> > We also presented "wall time" and "wall - cpu" time. "wall time" is the
> > same as "latency" in previous results. Basically, "wall time" includes cpu
> > time, scheduling latency, and time spent waiting for data (from data base,
> > memcache, etc.). We don't have good data that separates scheduling latency
> > and time spent waiting for data, so we present "wall - cpu" time, which is
> > the sum of the two. Time spent waiting for data should not change in these
> > tests, so changes in "wall - cpu" mostly comes from scheduling latency.
> > All the latency numbers are normalized based on the "wall time" of the
> > first row.
> >
> > side job | cpu.headroom |  cpu-idle | wall time | cpu time | wall - cpu
> > ------------------------------------------------------------------------
> > none    |     n/a      |    42.4%  |   1.00    |   0.31   |   0.69
> > ffmpeg   |       0      |    10.8%  |   1.17    |   0.38   |   0.79
> > ffmpeg   |     25%      |    22.8%  |   1.08    |   0.35   |   0.73
> >
> > From these results, we can see that Viresh's patch reduces the latency
> > overhead of the side job, from 42% (in previous results) to 17%. And
> > a 25% cpu.headroom further reduces the latency overhead to 8%.
> > cpu.headroom reduces time spent in "cpu time" and "wall - cpu" time,
> > which means cpu.headroom yields better IPC and lower scheduling latency.
> >
> > I think these data demonstrate that
> >
> >  1. Viresh's work is helpful in reducing scheduling latency introduced
> >     by SCHED_IDLE side jobs.
> >  2. cpu.headroom work provides mechanism to further reduce scheduling
> >     latency on top of Viresh's work.
> >
> > Therefore, the combination of the two work would give us mechanisms to
> > control the latency overhead of side workloads.
> >
> > @Vincent, do these data and analysis make sense from your point of view?
>
> Do you have further questions/concerns with this set?

Viresh's patchset takes into account CPUs running only sched_idle tasks,
but only in the fast wakeup path. Nothing special is (yet) done for
the slow path or during idle load balance.
The histogram that you provided for "Fallback to sched-idle CPU for
better performance" shows that even though we have significantly reduced
the long wakeup latencies, there are still some wakeup latencies evenly
distributed in the range [16us-2msec]. Such values are most probably
because a sched_other task doesn't always preempt a sched_idle task and
sometimes waits for the next tick. This means that there is still
margin for improving the results with sched_idle without adding a new
knob.
The headroom knob forces CPUs to be idle from time to time, and the
scheduler falls back to the normal scheduling policy that tries to fill
idle CPUs in this case. I'm still not convinced that most of the
latency increase is linked to contention when
accessing shared resources.

>
> As the data shown, scheduling latency is not the only resource of high
> latency here. In fact, with hyper threading and other shared system
> resources (cache, memory, etc.), side workload would always negatively
> impact the latency of the main workload. It is impossible to eliminate
> these impacts with scheduler optimizations. On the other hand,
> cpu.headroom provides mechanism to limit such impact.
>
> Optimization and protection are two sides of the problem. While we
> spend a lot of time optimizing the workload (so Viresh's work is really
> interesting for us), cpu.headroom works on the protection side. There
> are multiple reasons behind the high latencies. cpu.headroom provides
> universal protection against all these.
>
> With the protection of cpu.headroom, we can actually do optimizations
> more efficiently, as we can safely start with a high headroom, and
> then try to lower it.
>
> Please let me know your thoughts on this.
>
> Thanks,
> Song
>
>
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-05-15 10:18                       ` Vincent Guittot
@ 2019-05-15 15:42                         ` Song Liu
  0 siblings, 0 replies; 26+ messages in thread
From: Song Liu @ 2019-05-15 15:42 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Morten Rasmussen, linux-kernel, cgroups, mingo, peterz, tglx,
	Kernel Team, viresh kumar

Hi Vincent,

> On May 15, 2019, at 3:18 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> 
> Hi Song,
> 
> On Tue, 14 May 2019 at 22:58, Song Liu <songliubraving@fb.com> wrote:
>> 
>> Hi Vincent,
>> 
> 
> [snip]
> 
>>> 
>>> Here are some more results with both Viresh's patch and the cpu.headroom
>>> set. In these tests, the side job runs with SCHED_IDLE, so we get benefit
>>> of Viresh's patch.
>>> 
>>> We collected another metric here, average "cpu time" used by the requests.
>>> We also presented "wall time" and "wall - cpu" time. "wall time" is the
>>> same as "latency" in previous results. Basically, "wall time" includes cpu
>>> time, scheduling latency, and time spent waiting for data (from data base,
>>> memcache, etc.). We don't have good data that separates scheduling latency
>>> and time spent waiting for data, so we present "wall - cpu" time, which is
>>> the sum of the two. Time spent waiting for data should not change in these
>>> tests, so changes in "wall - cpu" mostly comes from scheduling latency.
>>> All the latency numbers are normalized based on the "wall time" of the
>>> first row.
>>> 
>>> side job | cpu.headroom |  cpu-idle | wall time | cpu time | wall - cpu
>>> ------------------------------------------------------------------------
>>> none    |     n/a      |    42.4%  |   1.00    |   0.31   |   0.69
>>> ffmpeg   |       0      |    10.8%  |   1.17    |   0.38   |   0.79
>>> ffmpeg   |     25%      |    22.8%  |   1.08    |   0.35   |   0.73
>>> 
>>> From these results, we can see that Viresh's patch reduces the latency
>>> overhead of the side job, from 42% (in previous results) to 17%. And
>>> a 25% cpu.headroom further reduces the latency overhead to 8%.
>>> cpu.headroom reduces time spent in "cpu time" and "wall - cpu" time,
>>> which means cpu.headroom yields better IPC and lower scheduling latency.
>>> 
>>> I think these data demonstrate that
>>> 
>>> 1. Viresh's work is helpful in reducing scheduling latency introduced
>>>    by SCHED_IDLE side jobs.
>>> 2. cpu.headroom work provides mechanism to further reduce scheduling
>>>    latency on top of Viresh's work.
>>> 
>>> Therefore, the combination of the two work would give us mechanisms to
>>> control the latency overhead of side workloads.
>>> 
>>> @Vincent, do these data and analysis make sense from your point of view?
>> 
>> Do you have further questions/concerns with this set?
> 
> Viresh's patchset takes into account CPU running only sched_idle task
> only for the fast wakeup path. But nothing special is (yet) done for
> the slow path or during idle load balance.
> The histogram that you provided for "Fallback to sched-idle CPU for
> better performance", shows that even if we have significantly reduced
> the long wakeup latency, there are still some wakeup latency evenly
> distributed in the range [16us-2msec]. Such values are most probably
> because of sched_other task doesn't always preempt sched_idle task and
> sometime waits for the next tick. This means that there are still
> margin for improving the results with sched_idle without adding a new
> knob.
> The headroom knob forces cpus to be idle from time to time and the
> scheduler fallbacks to the normal scheduling policy that tries to fill
> idle CPU in this case. I'm still not convinced that most of the
> increase of the latency increase is linked to contention when
> accessing shared resources.

I would like to clarify that we are not trying to argue that most of
the increase in latency is from resource contention. Actually, we
also have data showing that scheduling latency contributes more to the
latency overhead:

side job | cpu.headroom |  cpu-idle | wall time | cpu time | wall - cpu
------------------------------------------------------------------------
 none    |     n/a      |    42.4%  |   1.00    |   0.31   |   0.69
ffmpeg   |       0      |    10.8%  |   1.17    |   0.38   |   0.79
ffmpeg   |     25%      |    22.8%  |   1.08    |   0.35   |   0.73

Compared against the first row, the second row shows 17% latency overhead
(wall time). Of this 17%, 7% is in the "cpu time" column, which
is from resource contention (lower IPC). The other 10% (wall - cpu) is
mostly from scheduling latency. These experiments already include Viresh's
current patch. Scheduling latency contributes even more to the overall
latency without Viresh's patch.
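
For reference, the 7% / 10% split can be reproduced from the first two
rows of the table. This is just a sketch of the arithmetic; the function
and variable names are made up for illustration.

```python
# First two rows of the table above (normalized units).
base = {"wall": 1.00, "cpu": 0.31}   # no side job
side = {"wall": 1.17, "cpu": 0.38}   # ffmpeg, cpu.headroom = 0


def split_overhead(base, side):
    """Split the wall-time increase into a cpu part and a waiting part.

    cpu_delta  : extra "cpu time" (attributed to lower IPC / contention)
    wait_delta : extra "wall - cpu" time (mostly scheduling latency,
                 assuming time spent waiting for data is unchanged)
    """
    cpu_delta = side["cpu"] - base["cpu"]
    wait_delta = (side["wall"] - side["cpu"]) - (base["wall"] - base["cpu"])
    return cpu_delta, wait_delta


cpu_delta, wait_delta = split_overhead(base, side)
total_delta = side["wall"] - base["wall"]

print(f"total: {total_delta:.2f}  cpu: {cpu_delta:.2f}  wall-cpu: {wait_delta:.2f}")
```

Since the baseline wall time is normalized to 1.00, these deltas read
directly as percentages of the baseline latency.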

So we agree that, in this case, most of the increased latency comes 
from scheduling latency. 

However, we still think cpu.headroom would add value. The following
table shows a comparison of ideal cases, where the overhead of scheduling
latency is totally eliminated.

side job | cpu.headroom |  cpu-idle | wall time | cpu time | wall - cpu
------------------------------------------------------------------------
 none    |     n/a      |    42.4%  |   1.00    |   0.31   |   0.69
------------------------------------------------------------------------
        below are all from estimation, not from experiments
------------------------------------------------------------------------
ffmpeg   |       0      |     TBD   |   1.07    |   0.38   |   0.69
ffmpeg   |     25%      |     TBD   |   1.04    |   0.35   |   0.69

We can see that cpu.headroom helps control latency even with ideal scheduling.
The saving here (from 7% overhead to 4%) is not as significant, but it is
still meaningful in some cases.
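
The estimated rows are consistent with a simple model: keep the measured
"cpu time" and replace "wall - cpu" with the baseline value of 0.69. The
sketch below assumes that model; it is an illustration of the estimate,
not a measurement, and the helper name is hypothetical.

```python
# Baseline row: no side job.
BASE_WALL = 1.00
BASE_CPU = 0.31
BASE_WAIT = BASE_WALL - BASE_CPU   # "wall - cpu" with no side job (0.69)


def ideal_wall(cpu_time):
    """Estimated wall time if the side job added no scheduling latency.

    Assumption: "wall - cpu" stays at the baseline value, so only the
    measured increase in "cpu time" remains as overhead.
    """
    return cpu_time + BASE_WAIT


ideal_no_headroom = ideal_wall(0.38)   # ffmpeg, cpu.headroom = 0
ideal_headroom_25 = ideal_wall(0.35)   # ffmpeg, cpu.headroom = 25%

print(f"headroom 0:   {ideal_no_headroom:.2f}")
print(f"headroom 25%: {ideal_headroom_25:.2f}")
```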
 
More importantly, cpu.headroom gives us a mechanism to control the latency
overhead. It is a _protection_ mechanism, not an optimization. It is
somewhat similar to the current cpu.max knob, which limits the max cpu usage
of a certain workload. cpu.headroom is more flexible than cpu.max, because
cpu.headroom can adjust the limits dynamically based on the system load level.
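
To illustrate the difference between a fixed cap and a load-dependent
one, here is a toy model. It is NOT the code from the patchset: the
formula (side quota = 100% - headroom - main usage, clamped at zero) and
all names are assumptions made purely for illustration.

```python
def static_quota_pct(max_pct, main_usage_pct):
    """cpu.max-like behaviour: a fixed cap that ignores the main workload."""
    return max_pct


def headroom_quota_pct(headroom_pct, main_usage_pct):
    """Hypothetical headroom-like behaviour (assumed formula, see above).

    The side workload may use whatever is left after reserving
    `headroom_pct` idle CPU and the main workload's current usage.
    """
    return max(0.0, 100.0 - headroom_pct - main_usage_pct)


# Main workload usage levels are illustrative (percent of total CPU).
for main_usage in (45.0, 60.0, 75.0):
    fixed = static_quota_pct(30.0, main_usage)
    dynamic = headroom_quota_pct(25.0, main_usage)
    print(f"main={main_usage:4.0f}%  fixed cap={fixed:4.0f}%  headroom-based={dynamic:4.0f}%")
```

In this model the fixed cap stays at 30% no matter how busy the main
workload is, while the headroom-based limit shrinks as the main workload
grows and reaches zero once less than the reserved headroom remains.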

Does this explanation make sense?

Thanks,
Song


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-04-10 19:43   ` Song Liu
  2019-04-17 12:56     ` Vincent Guittot
@ 2019-05-21 13:47     ` Michal Koutný
  2019-05-21 16:27       ` Song Liu
  1 sibling, 1 reply; 26+ messages in thread
From: Michal Koutný @ 2019-05-21 13:47 UTC (permalink / raw)
  To: Song Liu
  Cc: Morten Rasmussen, Kernel Team, peterz, vincent.guittot, tglx,
	mingo, cgroups, linux-kernel

Hello Song.

On Wed, Apr 10, 2019 at 07:43:35PM +0000, Song Liu <songliubraving@fb.com> wrote:
> The load level above is measured as requests-per-second. 
> 
> When there is no side workload, the system has about 45% busy CPU with 
> load level of 1.0; and about 75% busy CPU at load level of 1.5. 
> 
> The saturation starts before the system hitting 100% utilization. This is
> true for many different resources: ALUs in SMT systems, cache lines, 
> memory bandwidths, etc. 
I have read through the thread continuation and it appears to me there
is some misunderstanding on the latency metric (scheduler latency <=
your latency <= request wall time?).

Could you please describe how the latency that you report is defined and
measured?

Thanks,
Michal




* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-05-21 13:47     ` Michal Koutný
@ 2019-05-21 16:27       ` Song Liu
  2019-06-26  8:26         ` Michal Koutný
  0 siblings, 1 reply; 26+ messages in thread
From: Song Liu @ 2019-05-21 16:27 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Morten Rasmussen, Kernel Team, peterz, vincent.guittot, tglx,
	mingo, cgroups, linux-kernel

Hi Michal,

> On May 21, 2019, at 6:47 AM, Michal Koutný <mkoutny@suse.com> wrote:
> 
> Hello Song.
> 
> On Wed, Apr 10, 2019 at 07:43:35PM +0000, Song Liu <songliubraving@fb.com> wrote:
>> The load level above is measured as requests-per-second. 
>> 
>> When there is no side workload, the system has about 45% busy CPU with 
>> load level of 1.0; and about 75% busy CPU at load level of 1.5. 
>> 
>> The saturation starts before the system hitting 100% utilization. This is
>> true for many different resources: ALUs in SMT systems, cache lines, 
>> memory bandwidths, etc. 
> I have read through the thread continuation and it appears to me there
> is some misunderstanding on the latency metric (scheduler latency <=
> your latency <= request wall time?).

We define "your latency" == "request wall time" > "scheduler latency". 

The application is a web server. It works as:

    for a few iterations:
        (a) request data from cache/db
        (b) wait for data
        (c) data is ready, wait to be scheduled
        (d) render web page based on data

We need to do a few iterations because we need some data to decide what
other data to fetch. 

The overall latency (or wall latency) contains: 

   (1) cpu time, which is (a) and (d) in the loop above;
   (2) time waiting for data, which is (b);
   (3) scheduling latency, the time between the data becoming ready and the
       thread waking up, which is (c).

In our experiment, we can measure (1) and "(1)+(2)+(3)". In the following
table, "cpu time" is (1), "wall time" is (1)+(2)+(3), and "wall - cpu" is
"(2)+(3)". We assume (2) doesn't change, so changes in "wall - cpu" are
mostly due to changes in (3).

side job | cpu.headroom |  cpu-idle | wall time | cpu time | wall - cpu
------------------------------------------------------------------------
 none    |     n/a      |    42.4%  |   1.00    |   0.31   |   0.69
ffmpeg   |       0      |    10.8%  |   1.17    |   0.38   |   0.79
ffmpeg   |     25%      |    22.8%  |   1.08    |   0.35   |   0.73
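
The arithmetic behind the table can be sketched as follows. The numbers are
copied from the rows above; the helper names are hypothetical, and
attributing the whole increase in "wall - cpu" to (3) relies on the stated
assumption that (2) stays constant.

```python
# Normalized figures copied from the table:
# (side job, cpu.headroom, wall time, cpu time)
ROWS = [
    ("none", "n/a", 1.00, 0.31),
    ("ffmpeg", "0", 1.17, 0.38),
    ("ffmpeg", "25%", 1.08, 0.35),
]


def wall_minus_cpu(wall: float, cpu: float) -> float:
    """(2)+(3): time waiting for data plus scheduling latency."""
    return round(wall - cpu, 2)


# "wall - cpu" of the row without a side job, used as the reference point.
BASELINE = wall_minus_cpu(ROWS[0][2], ROWS[0][3])


def added_wait(wall: float, cpu: float) -> float:
    """Growth of (2)+(3) over the no-side-job row.

    If (2) is constant, this growth is the extra scheduling latency (3).
    """
    return round(wall_minus_cpu(wall, cpu) - BASELINE, 2)


for side_job, headroom, wall, cpu in ROWS:
    print(
        f"{side_job:7} headroom={headroom:4} "
        f"wall-cpu={wall_minus_cpu(wall, cpu):.2f} "
        f"added={added_wait(wall, cpu):.2f}"
    )
```

With these numbers, the added wait is 0.10 with cpu.headroom of 0 and 0.04
with cpu.headroom of 25%.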


Does this make sense?

Thanks,
Song



* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-05-21 16:27       ` Song Liu
@ 2019-06-26  8:26         ` Michal Koutný
  2019-06-26 15:56           ` Song Liu
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Koutný @ 2019-06-26  8:26 UTC (permalink / raw)
  To: Song Liu
  Cc: Morten Rasmussen, Kernel Team, peterz, vincent.guittot, tglx,
	mingo, cgroups, linux-kernel

Hello Song, and I apologize for the late reply.

I understand the motivation for the headroom attribute is to achieve
side load throttling before the CPU is fully saturated, since your
measurements show that something else gets saturated earlier than the CPU
and causes growth of the observed latency.

The second aspect of the headroom knob, i.e. dynamic partitioning of the
CPU resource is IMO something which we already have thanks to
cpu.weight.

As you wrote, plain cpu.weight of workloads didn't work for you, so I
think it'd be worth figuring out what is the resource whose saturation
affects the overall observed latency and see if a protection/weights on
that resource can be set (or implemented).

On Tue, May 21, 2019 at 04:27:02PM +0000, Song Liu <songliubraving@fb.com> wrote:
> The overall latency (or wall latency) contains: 
> 
>    (1) cpu time, which is (a) and (d) in the loop above;
How do you measure this CPU time? Does it include time spent in the
kernel? (Or can there be anything else unaccounted for in the following
calculations?)

>    (2) time waiting for data, which is (b);
Is your assumption of this being constant supported by the measurements?

The last note is regarding the semantics of the headroom knob: I'm not sure
it fits well into the weight^allocation^limit^protection model. It seems
to me that it's crafted to satisfy the division into one main workload and
one side workload; however, the concept doesn't generalize well to an
arbitrary number of siblings (e.g. two cgroups with the same headroom, a
third with less, who is winning?).

HTH,
Michal


* Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
  2019-06-26  8:26         ` Michal Koutný
@ 2019-06-26 15:56           ` Song Liu
  0 siblings, 0 replies; 26+ messages in thread
From: Song Liu @ 2019-06-26 15:56 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Morten Rasmussen, Kernel Team, peterz, vincent.guittot, tglx,
	mingo, cgroups, linux-kernel

Hi Michal,

> On Jun 26, 2019, at 1:26 AM, Michal Koutný <mkoutny@suse.com> wrote:
> 
> Hello Song, and I apologize for the late reply.
> 
> I understand the motivation for the headroom attribute is to achieve
> side load throttling before the CPU is fully saturated, since your
> measurements show that something else gets saturated earlier than the CPU
> and causes growth of the observed latency.
> 
> The second aspect of the headroom knob, i.e. dynamic partitioning of the
> CPU resource is IMO something which we already have thanks to
> cpu.weight.

I think the cpu.headroom knob is the dynamic version of the cpu.max knob.
It serves a different role than cpu.weight.

cpu.weight is like: when both tasks can run, which one gets more cycles.
cpu.headroom is like: even when there are idle CPU cycles, the side workload
should not use them all.

> 
> As you wrote, plain cpu.weight of workloads didn't work for you, so I
> think it'd be worth figuring out what is the resource whose saturation
> affects the overall observed latency and see if a protection/weights on
> that resource can be set (or implemented).

Our goal here is not to solve a particular case. Instead, we would like
a universal solution for different combinations of main workload and side
workload. cpu.headroom makes it easy to adjust throttling based on the
requirements of the main workload.

Also, there are resources that can only be protected by intentionally
leaving some cycles idle. For example, SMT siblings share ALUs, so sometimes
we have to throttle one SMT sibling to make the other sibling run faster.

> 
> On Tue, May 21, 2019 at 04:27:02PM +0000, Song Liu <songliubraving@fb.com> wrote:
>> The overall latency (or wall latency) contains: 
>> 
>>   (1) cpu time, which is (a) and (d) in the loop above;
> How do you measure this CPU time? Does it include time spent in the
> kernel? (Or can there be anything else unaccounted for in the following
> calculations?)

We measure how much time a thread is running. It includes kernel time.
I think we didn't measure the time spent processing IRQs, but that is
small compared with the overall latency.

> 
>>   (2) time waiting for data, which is (b);
> Is your assumption of this being constant supported by the measurements?

We don't measure that specifically. The data is fetched over the network
from other servers. The latency to fetch data is not constant, but the
average over thousands of requests should be the same for the different cases.

> 
> The last note is regarding the semantics of the headroom knob: I'm not sure
> it fits well into the weight^allocation^limit^protection model. It seems
> to me that it's crafted to satisfy the division into one main workload and
> one side workload; however, the concept doesn't generalize well to an
> arbitrary number of siblings (e.g. two cgroups with the same headroom, a
> third with less, who is winning?).

The semantics are not very straightforward. We discussed it for a
long time, and it really is crafted to fit the protection model.

In your example, say both A and B have 30% headroom, and C has 20%. A and
B are "winning", as they will not be throttled. C will be throttled when
the global idleness is lower than 10% (30% - 20%). 
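
The rule in this example can be sketched as below. Generalizing from the
single example to "threshold = largest sibling headroom minus own headroom"
is an assumption for illustration, and the function names are hypothetical;
they are not taken from the patches.

```python
def throttle_thresholds(headroom_pct: dict) -> dict:
    """Idleness level below which each cgroup would be throttled.

    Assumed rule, read off the A/B/C example: a cgroup is throttled once
    global idleness drops below (largest headroom - its own headroom).
    """
    largest = max(headroom_pct.values())
    return {name: largest - own for name, own in headroom_pct.items()}


def is_throttled(name: str, headroom_pct: dict, global_idle_pct: float) -> bool:
    """True if the named cgroup is throttled at the given global idleness."""
    return global_idle_pct < throttle_thresholds(headroom_pct)[name]


headrooms = {"A": 30.0, "B": 30.0, "C": 20.0}
print(throttle_thresholds(headrooms))
for idle in (15.0, 5.0):
    print(idle, {name: is_throttled(name, headrooms, idle) for name in headrooms})
```

A and B have a threshold of 0%, so they are never throttled; C has a
threshold of 10% and is throttled only when global idleness falls below it.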

Note that this is not a typical use case for cpu.headroom. If multiple
latency-sensitive applications share the same server, they would
need some partitioning scheme.

Thanks,
Song




Thread overview: 26+ messages
2019-04-08 21:45 [PATCH 0/7] introduce cpu.headroom knob to cpu controller Song Liu
2019-04-08 21:45 ` [PATCH 1/7] sched: refactor tg_set_cfs_bandwidth() Song Liu
2019-04-08 21:45 ` [PATCH 2/7] cgroup: introduce hook css_has_tasks_changed Song Liu
2019-04-08 21:45 ` [PATCH 3/7] cgroup: introduce cgroup_parse_percentage Song Liu
2019-04-08 21:45 ` [PATCH 4/7] sched, cgroup: add entry cpu.headroom Song Liu
2019-04-08 21:45 ` [PATCH 5/7] sched/fair: global idleness counter for cpu.headroom Song Liu
2019-04-08 21:45 ` [PATCH 6/7] sched/fair: throttle task runtime based on cpu.headroom Song Liu
2019-04-08 21:45 ` [PATCH 7/7] Documentation: cgroup-v2: add information for cpu.headroom Song Liu
2019-04-10 11:59 ` [PATCH 0/7] introduce cpu.headroom knob to cpu controller Morten Rasmussen
2019-04-10 19:43   ` Song Liu
2019-04-17 12:56     ` Vincent Guittot
2019-04-22 23:22       ` Song Liu
2019-04-28 19:47         ` Song Liu
2019-04-29 12:24           ` Vincent Guittot
2019-04-30  6:10             ` Song Liu
2019-04-30 16:20               ` Vincent Guittot
2019-04-30 16:54                 ` Song Liu
2019-05-10 18:22                   ` Song Liu
2019-05-14 20:58                     ` Song Liu
2019-05-15 10:18                       ` Vincent Guittot
2019-05-15 15:42                         ` Song Liu
2019-05-21 13:47     ` Michal Koutný
2019-05-21 16:27       ` Song Liu
2019-06-26  8:26         ` Michal Koutný
2019-06-26 15:56           ` Song Liu
2019-04-15 16:48 ` Song Liu
