linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/4] sched/fair: Burstable CFS bandwidth controller
@ 2021-03-16  4:49 Huaixin Chang
  2021-03-16  4:49 ` [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Huaixin Chang @ 2021-03-16  4:49 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, odin, odin, pauld, peterz, pjt, rostedt,
	shanpeic, tj, vincent.guittot, xiyou.wangcong

Changelog:

v4:
- Adjust assignments in tg_set_cfs_bandwidth(), saving unnecessary
  assignemnts when quota == RUNTIME_INF.
- Getting rid of sysctl_sched_cfs_bw_burst_onset_percent, as there seems
  no justification for both controlling start bandwidth and a percent
  way.
- Comment improvement in sched_cfs_period_timer() shifts on explaining
  why max_overrun shifting to 0 is a problem.
- Rename previous_runtime to runtime_at_period_start.
- Add cgroup2 interface and documentation.
- Getting rid of exposing current_bw as there are not enough
  justification and the updating problem.
- Add justification on cpu.stat change in the changelog.
- Rebase upon v5.12-rc3.
- Correct SoB chain.
- Several indentation fixes.
- Adjust quota in schbench test from 700000 to 600000.

v3:
- Fix another issue reported by test robot.
- Update docs as Randy Dunlap suggested.
Link:
https://lore.kernel.org/lkml/20210120122715.29493-1-changhuaixin@linux.alibaba.com/

v2:
- Fix an issue reported by test robot.
- Rewriting docs. Appreciate any further suggestions or help.
Link:
https://lore.kernel.org/lkml/20210121110453.18899-1-changhuaixin@linux.alibaba.com/

v1 Link:
https://lore.kernel.org/lkml/20201217074620.58338-1-changhuaixin@linux.alibaba.com/

The CFS bandwidth controller limits CPU requests of a task group to
quota during each period. However, parallel workloads might be bursty
so that they get throttled. And they are latency sensitive at the same
time so that throttling them is undesired.

Scaling up period and quota allows greater burst capacity. But it might
cause longer stuck till next refill. We introduce "burst" to allow
accumulating unused quota from previous periods, and to be assigned when
a task group requests more CPU than quota during a specific period. Thus
allowing CPU time requests as long as the average requested CPU time is
below quota on the long run. The maximum accumulation is capped by burst
and is set 0 by default, thus the traditional behaviour remains.

A huge drop of 99th tail latency from more than 500ms to 27ms is seen for
real java workloads when using burst. Similar drops are seen when
testing with schbench too:

	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
	echo 600000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
	echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

	# The average CPU usage is around 500%, which is 200ms CPU time
	# every 40ms.
	./schbench -m 1 -t 30 -r 10 -c 10000 -R 500

	Without burst:

	Latency percentiles (usec)
	50.0000th: 7
	75.0000th: 8
	90.0000th: 9
	95.0000th: 10
	*99.0000th: 933
	99.5000th: 981
	99.9000th: 3068
	min=0, max=20054
	rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 9.33%

	With burst:

	Latency percentiles (usec)
	50.0000th: 7
	75.0000th: 8
	90.0000th: 9
	95.0000th: 9
	*99.0000th: 12
	99.5000th: 13
	99.9000th: 19
	min=0, max=406
	rps: 498.36 p95 (usec) 9 p99 (usec) 12 p95/cputime 0.09% p99/cputime 0.12%

How much workloads with benefit from burstable CFS bandwidth control
depends on how bursty and how latency sensitive they are.

Previously, Cong Wang and Konstantin Khlebnikov proposed similar
feature:
https://lore.kernel.org/lkml/20180522062017.5193-1-xiyou.wangcong@gmail.com/
https://lore.kernel.org/lkml/157476581065.5793.4518979877345136813.stgit@buzz/

This time we present more latency statistics and handle overflow while
accumulating.

Huaixin Chang (4):
  sched/fair: Introduce primitives for CFS bandwidth burst
  sched/fair: Make CFS bandwidth controller burstable
  sched/fair: Add cfs bandwidth burst statistics
  sched/fair: Add document for burstable CFS bandwidth control

 Documentation/scheduler/sched-bwc.rst |  49 +++++++++++--
 include/linux/sched/sysctl.h          |   2 +
 kernel/sched/core.c                   | 126 +++++++++++++++++++++++++++++-----
 kernel/sched/fair.c                   |  58 +++++++++++++---
 kernel/sched/sched.h                  |   9 ++-
 kernel/sysctl.c                       |  18 +++++
 6 files changed, 232 insertions(+), 30 deletions(-)

-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-03-16  4:49 [PATCH v4 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
@ 2021-03-16  4:49 ` Huaixin Chang
  2021-03-16  9:27   ` Peter Zijlstra
                     ` (3 more replies)
  2021-03-16  4:49 ` [PATCH v4 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 21+ messages in thread
From: Huaixin Chang @ 2021-03-16  4:49 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, odin, odin, pauld, peterz, pjt, rostedt,
	shanpeic, tj, vincent.guittot, xiyou.wangcong

In this patch, we introduce the notion of CFS bandwidth burst. Unused
"quota" from pervious "periods" might be accumulated and used in the
following "periods". The maximum amount of accumulated bandwidth is
bounded by "burst". And the maximun amount of CPU a group can consume in
a given period is "buffer" which is equivalent to "quota" + "burst in
case that this group has done enough accumulation.

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
---
 kernel/sched/core.c  | 97 +++++++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/fair.c  |  2 ++
 kernel/sched/sched.h |  2 ++
 3 files changed, 84 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98191218d891..708c31e6ce1f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8952,7 +8952,8 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
-static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
+				u64 burst)
 {
 	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -8982,6 +8983,12 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	if (quota != RUNTIME_INF && quota > max_cfs_runtime)
 		return -EINVAL;
 
+	/*
+	 * Bound burst to defend burst against overflow during bandwidth shift.
+	 */
+	if (burst > max_cfs_runtime)
+		return -EINVAL;
+
 	/*
 	 * Prevent race between setting of cfs_rq->runtime_enabled and
 	 * unthrottle_offline_cfs_rqs().
@@ -9003,12 +9010,16 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
+	cfs_b->burst = burst;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
-	/* Restart the period timer (if active) to handle new period expiry: */
-	if (runtime_enabled)
+	if (runtime_enabled) {
+		cfs_b->buffer = min(max_cfs_runtime, quota + burst);
+
+		/* Restart the period timer (if active) to handle new period expiry: */
 		start_cfs_bandwidth(cfs_b);
+	}
 
 	raw_spin_unlock_irq(&cfs_b->lock);
 
@@ -9036,9 +9047,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
 static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 {
-	u64 quota, period;
+	u64 quota, period, burst;
 
 	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	burst = tg->cfs_bandwidth.burst;
 	if (cfs_quota_us < 0)
 		quota = RUNTIME_INF;
 	else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
@@ -9046,7 +9058,7 @@ static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 	else
 		return -EINVAL;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_quota(struct task_group *tg)
@@ -9064,15 +9076,16 @@ static long tg_get_cfs_quota(struct task_group *tg)
 
 static int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
 {
-	u64 quota, period;
+	u64 quota, period, burst;
 
 	if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC)
 		return -EINVAL;
 
 	period = (u64)cfs_period_us * NSEC_PER_USEC;
 	quota = tg->cfs_bandwidth.quota;
+	burst = tg->cfs_bandwidth.burst;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_period(struct task_group *tg)
@@ -9085,6 +9098,35 @@ static long tg_get_cfs_period(struct task_group *tg)
 	return cfs_period_us;
 }
 
+static int tg_set_cfs_burst(struct task_group *tg, long cfs_burst_us)
+{
+	u64 quota, period, burst;
+
+	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	quota = tg->cfs_bandwidth.quota;
+	if (cfs_burst_us < 0)
+		burst = RUNTIME_INF;
+	else if ((u64)cfs_burst_us <= U64_MAX / NSEC_PER_USEC)
+		burst = (u64)cfs_burst_us * NSEC_PER_USEC;
+	else
+		return -EINVAL;
+
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
+}
+
+static long tg_get_cfs_burst(struct task_group *tg)
+{
+	u64 burst_us;
+
+	if (tg->cfs_bandwidth.burst == RUNTIME_INF)
+		return -1;
+
+	burst_us = tg->cfs_bandwidth.burst;
+	do_div(burst_us, NSEC_PER_USEC);
+
+	return burst_us;
+}
+
 static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
 				  struct cftype *cft)
 {
@@ -9109,6 +9151,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
 	return tg_set_cfs_period(css_tg(css), cfs_period_us);
 }
 
+static s64 cpu_cfs_burst_read_s64(struct cgroup_subsys_state *css,
+				  struct cftype *cft)
+{
+	return tg_get_cfs_burst(css_tg(css));
+}
+
+static int cpu_cfs_burst_write_s64(struct cgroup_subsys_state *css,
+				   struct cftype *cftype, s64 cfs_burst_us)
+{
+	return tg_set_cfs_burst(css_tg(css), cfs_burst_us);
+}
+
 struct cfs_schedulable_data {
 	struct task_group *tg;
 	u64 period, quota;
@@ -9261,6 +9315,11 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "cfs_burst_us",
+		.read_s64 = cpu_cfs_burst_read_s64,
+		.write_s64 = cpu_cfs_burst_write_s64,
+	},
 	{
 		.name = "stat",
 		.seq_show = cpu_cfs_stat_show,
@@ -9380,27 +9439,29 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
 }
 #endif
 
-static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
-						  long period, long quota)
+static void __maybe_unused
+cpu_period_quota_print(struct seq_file *sf, long period, long quota,
+		       long burst)
 {
 	if (quota < 0)
 		seq_puts(sf, "max");
 	else
 		seq_printf(sf, "%ld", quota);
 
-	seq_printf(sf, " %ld\n", period);
+	seq_printf(sf, " %ld %ld\n", period, burst);
 }
 
-/* caller should put the current value in *@periodp before calling */
-static int __maybe_unused cpu_period_quota_parse(char *buf,
-						 u64 *periodp, u64 *quotap)
+/* caller should put the current value in *@periodp and *@burstp before calling */
+static int __maybe_unused cpu_period_quota_parse(char *buf, u64 *periodp,
+						 u64 *quotap, u64 *burstp)
 {
 	char tok[21];	/* U64_MAX */
 
-	if (sscanf(buf, "%20s %llu", tok, periodp) < 1)
+	if (sscanf(buf, "%20s %llu %llu", tok, periodp, burstp) < 1)
 		return -EINVAL;
 
 	*periodp *= NSEC_PER_USEC;
+	*burstp *= NSEC_PER_USEC;
 
 	if (sscanf(tok, "%llu", quotap))
 		*quotap *= NSEC_PER_USEC;
@@ -9417,7 +9478,8 @@ static int cpu_max_show(struct seq_file *sf, void *v)
 {
 	struct task_group *tg = css_tg(seq_css(sf));
 
-	cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+	cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg),
+			       tg_get_cfs_burst(tg));
 	return 0;
 }
 
@@ -9426,12 +9488,13 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
 {
 	struct task_group *tg = css_tg(of_css(of));
 	u64 period = tg_get_cfs_period(tg);
+	u64 burst = tg_get_cfs_burst(tg);
 	u64 quota;
 	int ret;
 
-	ret = cpu_period_quota_parse(buf, &period, &quota);
+	ret = cpu_period_quota_parse(buf, &period, &quota, &burst);
 	if (!ret)
-		ret = tg_set_cfs_bandwidth(tg, period, quota);
+		ret = tg_set_cfs_bandwidth(tg, period, quota, burst);
 	return ret ?: nbytes;
 }
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..59d816a365f3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5255,6 +5255,8 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+	cfs_b->burst = 0;
+	cfs_b->buffer = RUNTIME_INF;
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10a1522b1e30..a4a1c0116d51 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -357,6 +357,8 @@ struct cfs_bandwidth {
 	ktime_t			period;
 	u64			quota;
 	u64			runtime;
+	u64			burst;
+	u64			buffer;
 	s64			hierarchical_quota;
 
 	u8			idle;
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 2/4] sched/fair: Make CFS bandwidth controller burstable
  2021-03-16  4:49 [PATCH v4 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
  2021-03-16  4:49 ` [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
@ 2021-03-16  4:49 ` Huaixin Chang
  2021-03-16  9:52   ` Peter Zijlstra
  2021-03-16  4:49 ` [PATCH v4 3/4] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
  2021-03-16  4:49 ` [PATCH v4 4/4] sched/fair: Add document for burstable CFS bandwidth control Huaixin Chang
  3 siblings, 1 reply; 21+ messages in thread
From: Huaixin Chang @ 2021-03-16  4:49 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, odin, odin, pauld, peterz, pjt, rostedt,
	shanpeic, tj, vincent.guittot, xiyou.wangcong

Accumulate unused quota from previous periods, thus accumulated
bandwidth runtime can be used in the following periods. During
accumulation, take care of runtime overflow. Previous non-burstable
CFS bandwidth controller only assign quota to runtime, that saves a lot.

A sysctl parameter sysctl_sched_cfs_bw_burst_enabled is introduced as a
switch for burst. It is enabled by default.

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
---
 include/linux/sched/sysctl.h |  1 +
 kernel/sched/core.c          |  8 +++---
 kernel/sched/fair.c          | 58 ++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h         |  4 +--
 kernel/sysctl.c              |  9 +++++++
 5 files changed, 66 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 3c31ba88aca5..3cce25485c69 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -72,6 +72,7 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
 
 #ifdef CONFIG_CFS_BANDWIDTH
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
 #endif
 
 #ifdef CONFIG_SCHED_AUTOGROUP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 708c31e6ce1f..16e23a2499ef 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8948,7 +8948,7 @@ static DEFINE_MUTEX(cfs_constraints_mutex);
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
 static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 /* More than 203 days if BW_SHIFT equals 20. */
-static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
+const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
@@ -9012,13 +9012,13 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 	cfs_b->quota = quota;
 	cfs_b->burst = burst;
 
-	__refill_cfs_bandwidth_runtime(cfs_b);
-
 	if (runtime_enabled) {
 		cfs_b->buffer = min(max_cfs_runtime, quota + burst);
+		cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
+		cfs_b->runtime = cfs_b->quota;
 
 		/* Restart the period timer (if active) to handle new period expiry: */
-		start_cfs_bandwidth(cfs_b);
+		start_cfs_bandwidth(cfs_b, 1);
 	}
 
 	raw_spin_unlock_irq(&cfs_b->lock);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59d816a365f3..c981d4845c96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -127,6 +127,13 @@ int __weak arch_asym_cpu_priority(int cpu)
  * (default: 5 msec, units: microseconds)
  */
 unsigned int sysctl_sched_cfs_bandwidth_slice		= 5000UL;
+
+/*
+ * A switch for cfs bandwidth burst.
+ *
+ * (default: 1, enabled)
+ */
+unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
 #endif
 
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
@@ -4602,10 +4609,23 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  *
  * requires cfs_b->lock
  */
-void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
+static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b,
+					   u64 overrun)
 {
-	if (cfs_b->quota != RUNTIME_INF)
-		cfs_b->runtime = cfs_b->quota;
+	u64 refill;
+
+	if (cfs_b->quota != RUNTIME_INF) {
+
+		if (!sysctl_sched_cfs_bw_burst_enabled) {
+			cfs_b->runtime = cfs_b->quota;
+			return;
+		}
+
+		overrun = min(overrun, cfs_b->max_overrun);
+		refill = cfs_b->quota * overrun;
+		cfs_b->runtime += refill;
+		cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
+	}
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4627,7 +4647,7 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
 	else {
-		start_cfs_bandwidth(cfs_b);
+		start_cfs_bandwidth(cfs_b, 0);
 
 		if (cfs_b->runtime > 0) {
 			amount = min(cfs_b->runtime, min_amount);
@@ -4973,7 +4993,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	if (cfs_b->idle && !throttled)
 		goto out_deactivate;
 
-	__refill_cfs_bandwidth_runtime(cfs_b);
+	__refill_cfs_bandwidth_runtime(cfs_b, overrun);
 
 	if (!throttled) {
 		/* mark as potentially idle for the upcoming period */
@@ -5194,6 +5214,7 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
 }
 
 extern const u64 max_cfs_quota_period;
+extern const u64 max_cfs_runtime;
 
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
@@ -5223,7 +5244,18 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 			new = old * 2;
 			if (new < max_cfs_quota_period) {
 				cfs_b->period = ns_to_ktime(new);
-				cfs_b->quota *= 2;
+				cfs_b->quota = min(cfs_b->quota * 2,
+						   max_cfs_runtime);
+
+				cfs_b->buffer = min(cfs_b->quota + cfs_b->burst,
+						    max_cfs_runtime);
+				/*
+				 * Add 1 in case max_overrun becomes 0.
+				 * 0 max_overrun will cause no runtime being
+				 * refilled in __refill_cfs_bandwidth_runtime().
+				 */
+				cfs_b->max_overrun >>= 1;
+				cfs_b->max_overrun++;
 
 				pr_warn_ratelimited(
 	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
@@ -5272,16 +5304,26 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
-void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, int init)
 {
+	u64 overrun;
+
 	lockdep_assert_held(&cfs_b->lock);
 
 	if (cfs_b->period_active)
 		return;
 
 	cfs_b->period_active = 1;
-	hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
+
+	/*
+	 * When period timer stops, quota for the following period is not
+	 * refilled, however period timer is already forwarded. We should
+	 * accumulate quota once more than overrun here.
+	 */
+	if (!init)
+		__refill_cfs_bandwidth_runtime(cfs_b, overrun + 1);
 }
 
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a4a1c0116d51..efcbbfc31619 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -359,6 +359,7 @@ struct cfs_bandwidth {
 	u64			runtime;
 	u64			burst;
 	u64			buffer;
+	u64			max_overrun;
 	s64			hierarchical_quota;
 
 	u8			idle;
@@ -469,8 +470,7 @@ extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 			struct sched_entity *parent);
 extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 
-extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
-extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
+extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, int init);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
 
 extern void free_rt_sched_group(struct task_group *tg);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 62fbd09b5dc1..20d6a5ca9ef3 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1842,6 +1842,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "sched_cfs_bw_burst_enabled",
+		.data		= &sysctl_sched_cfs_bw_burst_enabled,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 #endif
 #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
 	{
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 3/4] sched/fair: Add cfs bandwidth burst statistics
  2021-03-16  4:49 [PATCH v4 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
  2021-03-16  4:49 ` [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
  2021-03-16  4:49 ` [PATCH v4 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
@ 2021-03-16  4:49 ` Huaixin Chang
  2021-03-16  4:49 ` [PATCH v4 4/4] sched/fair: Add document for burstable CFS bandwidth control Huaixin Chang
  3 siblings, 0 replies; 21+ messages in thread
From: Huaixin Chang @ 2021-03-16  4:49 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, odin, odin, pauld, peterz, pjt, rostedt,
	shanpeic, tj, vincent.guittot, xiyou.wangcong

When using cfs_b and meeting with some throttled periods, users shall
use burst buffer to allow bursty workloads. Apart from configuring some
burst buffer and watch whether throttled periods disappears, some
statistics on burst buffer using are also helpful. Thus expose the
following statistics into cpu.stat file:

nr_burst:   number of periods bandwidth burst occurs
burst_time: cumulative wall-time that any cpus has
	    used above quota in respective periods

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
---
 kernel/sched/core.c  | 14 +++++++++++---
 kernel/sched/fair.c  | 13 ++++++++++++-
 kernel/sched/sched.h |  3 +++
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 16e23a2499ef..f60232862300 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9016,6 +9016,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 		cfs_b->buffer = min(max_cfs_runtime, quota + burst);
 		cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
 		cfs_b->runtime = cfs_b->quota;
+		cfs_b->runtime_at_period_start = cfs_b->runtime;
 
 		/* Restart the period timer (if active) to handle new period expiry: */
 		start_cfs_bandwidth(cfs_b, 1);
@@ -9265,6 +9266,9 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
 		seq_printf(sf, "wait_sum %llu\n", ws);
 	}
 
+	seq_printf(sf, "nr_burst %d\n", cfs_b->nr_burst);
+	seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
+
 	return 0;
 }
 #endif /* CONFIG_CFS_BANDWIDTH */
@@ -9361,16 +9365,20 @@ static int cpu_extra_stat_show(struct seq_file *sf,
 	{
 		struct task_group *tg = css_tg(css);
 		struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
-		u64 throttled_usec;
+		u64 throttled_usec, burst_usec;
 
 		throttled_usec = cfs_b->throttled_time;
 		do_div(throttled_usec, NSEC_PER_USEC);
+		burst_usec = cfs_b->burst_time;
+		do_div(burst_usec, NSEC_PER_USEC);
 
 		seq_printf(sf, "nr_periods %d\n"
 			   "nr_throttled %d\n"
-			   "throttled_usec %llu\n",
+			   "throttled_usec %llu\n"
+			   "nr_burst %d\n"
+			   "burst_usec %llu\n",
 			   cfs_b->nr_periods, cfs_b->nr_throttled,
-			   throttled_usec);
+			   throttled_usec, cfs_b->nr_burst, burst_usec);
 	}
 #endif
 	return 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c981d4845c96..e7574d8bc11a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4612,7 +4612,7 @@ static inline u64 sched_cfs_bandwidth_slice(void)
 static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b,
 					   u64 overrun)
 {
-	u64 refill;
+	u64 refill, runtime;
 
 	if (cfs_b->quota != RUNTIME_INF) {
 
@@ -4621,10 +4621,21 @@ static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b,
 			return;
 		}
 
+		if (cfs_b->runtime_at_period_start > cfs_b->runtime) {
+			runtime = cfs_b->runtime_at_period_start
+				- cfs_b->runtime;
+			if (runtime > cfs_b->quota) {
+				cfs_b->burst_time += runtime - cfs_b->quota;
+				cfs_b->nr_burst++;
+			}
+		}
+
 		overrun = min(overrun, cfs_b->max_overrun);
 		refill = cfs_b->quota * overrun;
 		cfs_b->runtime += refill;
 		cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
+
+		cfs_b->runtime_at_period_start = cfs_b->runtime;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efcbbfc31619..7ef8d4733791 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -360,6 +360,7 @@ struct cfs_bandwidth {
 	u64			burst;
 	u64			buffer;
 	u64			max_overrun;
+	u64			runtime_at_period_start;
 	s64			hierarchical_quota;
 
 	u8			idle;
@@ -372,7 +373,9 @@ struct cfs_bandwidth {
 	/* Statistics: */
 	int			nr_periods;
 	int			nr_throttled;
+	int			nr_burst;
 	u64			throttled_time;
+	u64			burst_time;
 #endif
 };
 
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v4 4/4] sched/fair: Add document for burstable CFS bandwidth control
  2021-03-16  4:49 [PATCH v4 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
                   ` (2 preceding siblings ...)
  2021-03-16  4:49 ` [PATCH v4 3/4] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
@ 2021-03-16  4:49 ` Huaixin Chang
  3 siblings, 0 replies; 21+ messages in thread
From: Huaixin Chang @ 2021-03-16  4:49 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, odin, odin, pauld, peterz, pjt, rostedt,
	shanpeic, tj, vincent.guittot, xiyou.wangcong

Basic description of usage and effect for CFS Bandwidth Control Burst.

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 16 +++++----
 Documentation/scheduler/sched-bwc.rst   | 64 ++++++++++++++++++++++++++++++---
 2 files changed, 69 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 64c62b979f2f..17ec571ab4a8 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -997,6 +997,8 @@ All time durations are in microseconds.
 	- nr_periods
 	- nr_throttled
 	- throttled_usec
+	- nr_burst
+	- burst_usec
 
   cpu.weight
 	A read-write single value file which exists on non-root
@@ -1017,16 +1019,18 @@ All time durations are in microseconds.
 	the closest approximation of the current weight.
 
   cpu.max
-	A read-write two value file which exists on non-root cgroups.
-	The default is "max 100000".
+	A read-write three value file which exists on non-root cgroups.
+	The default is "max 100000 0".
 
 	The maximum bandwidth limit.  It's in the following format::
 
-	  $MAX $PERIOD
+	  $MAX $PERIOD $BURST
 
-	which indicates that the group may consume upto $MAX in each
-	$PERIOD duration.  "max" for $MAX indicates no limit.  If only
-	one number is written, $MAX is updated.
+	which indicates that the group may consume upto $MAX from this
+	period plus $BURST carried over from previous periods in each
+	$PERIOD duration.  "max" for $MAX indicates no limit. "0" for
+	$BURST indicates no bandwidth can be carried over. On partial
+	writing, values are updated accordingly.
 
   cpu.pressure
 	A read-write nested-keyed file.
diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
index 845eee659199..42e0773c0eed 100644
--- a/Documentation/scheduler/sched-bwc.rst
+++ b/Documentation/scheduler/sched-bwc.rst
@@ -22,24 +22,51 @@ cfs_quota units at each period boundary. As threads consume this bandwidth it
 is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
+By default, CPU bandwidth consumption is strictly limited to quota within each
+given period. For the sequence of CPU usage u_i served under CFS bandwidth
+control, if for any j <= k N(j,k) is the number of periods from u_j to u_k:
+
+        u_j+...+u_k <= quota * N(j,k)
+
+For a bursty sequence among which interval u_j...u_k are at the peak, CPU
+requests might have to wait for more periods to replenish enough quota.
+Otherwise, larger quota is required.
+
+With "burst" buffer, CPU requests might be served as long as:
+
+        u_j+...+u_k <= B_j + quota * N(j,k)
+
+if for any j <= k N(j,k) is the number of periods from u_j to u_k and B_j is
+the accumulated quota from previous periods in burst buffer serving u_j.
+Burst buffer helps in that serving whole bursty CPU requests without throttling
+them can be done with moderate quota setting and accumulated quota in burst
+buffer, if:
+
+        u_0+...+u_n <= B_0 + quota * N(0,n)
+
+where B_0 is the initial state of burst buffer. The maximum accumulated quota in
+the burst buffer is capped by burst. With proper burst setting, the available
+bandwidth is still determined by quota and period on the long run.
+
 Management
 ----------
-Quota and period are managed within the cpu subsystem via cgroupfs.
+Quota, period and burst are managed within the cpu subsystem via cgroupfs.
 
 .. note::
    The cgroupfs files described in this section are only applicable
    to cgroup v1. For cgroup v2, see
    :ref:`Documentation/admin-guide/cgroupv2.rst <cgroup-v2-cpu>`.
 
-- cpu.cfs_quota_us: the total available run-time within a period (in
-  microseconds)
+- cpu.cfs_quota_us: run-time replenished within a period (in microseconds)
 - cpu.cfs_period_us: the length of a period (in microseconds)
 - cpu.stat: exports throttling statistics [explained further below]
+- cpu.cfs_burst_us: the maximum accumulated run-time (in microseconds)
 
 The default values are::
 
 	cpu.cfs_period_us=100ms
-	cpu.cfs_quota=-1
+	cpu.cfs_quota_us=-1
+	cpu.cfs_burst_us=0
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
@@ -55,6 +82,11 @@ more detail below.
 Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
 and return the group to an unconstrained state once more.
 
+A value of 0 for cpu.cfs_burst_us indicates that the group can not accumulate
+any unused bandwidth. It makes the traditional bandwidth control behavior for
+CFS unchanged. Writing any (valid) positive value(s) into cpu.cfs_burst_us
+will enact the cap on unused bandwidth accumulation.
+
 Any updates to a group's bandwidth specification will result in it becoming
 unthrottled if it is in a constrained state.
 
@@ -72,9 +104,15 @@ This is tunable via procfs::
 Larger slice values will reduce transfer overheads, while smaller values allow
 for more fine-grained consumption.
 
+There is also a global switch to turn off burst for all groups::
+       /proc/sys/kernel/sched_cfs_bw_burst_enabled (default=1)
+
+By default it is enabled. Writing a 0 value means no accumulated CPU time can be
+used for any group, even if cpu.cfs_burst_us is configured.
+
 Statistics
 ----------
-A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+A group's bandwidth statistics are exported via 6 fields in cpu.stat.
 
 cpu.stat:
 
@@ -82,6 +120,10 @@ cpu.stat:
 - nr_throttled: Number of times the group has been throttled/limited.
 - throttled_time: The total time duration (in nanoseconds) for which entities
   of the group have been throttled.
+- nr_burst: Number of periods burst occurs.
+- burst_time: Cumulative wall-time that any CPUs has used above quota in
+  respective periods
+
 
 This interface is read-only.
 
@@ -179,3 +221,15 @@ Examples
 
    By using a small period here we are ensuring a consistent latency
    response at the expense of burst capacity.
+
+4. Limit a group to 20% of 1 CPU, and allow accumulate up to 60% of 1 CPU
+   additionally, in case accumulation has been done.
+
+   With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
+   And 30ms burst will be equivalent to 60% of 1 CPU.
+
+	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
+	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
+	# echo 30000 > cpu.cfs_burst_us /* burst = 30ms */
+
+   Larger buffer setting allows greater burst capacity.
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-03-16  4:49 ` [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
@ 2021-03-16  9:27   ` Peter Zijlstra
  2021-03-16  9:41   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2021-03-16  9:27 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, odin, odin, pauld, pjt, rostedt, shanpeic, tj,
	vincent.guittot, xiyou.wangcong

On Tue, Mar 16, 2021 at 12:49:28PM +0800, Huaixin Chang wrote:
> In this patch, we introduce the notion of CFS bandwidth burst. Unused

Documentation/process/submitting-patches.rst:instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-03-16  4:49 ` [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
  2021-03-16  9:27   ` Peter Zijlstra
@ 2021-03-16  9:41   ` Peter Zijlstra
  2021-03-16  9:54   ` Peter Zijlstra
  2021-03-16 10:40   ` Peter Zijlstra
  3 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2021-03-16  9:41 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, odin, odin, pauld, pjt, rostedt, shanpeic, tj,
	vincent.guittot, xiyou.wangcong

On Tue, Mar 16, 2021 at 12:49:28PM +0800, Huaixin Chang wrote:
> In this patch, we introduce the notion of CFS bandwidth burst. Unused
> "quota" from pervious "periods" might be accumulated and used in the
> following "periods". The maximum amount of accumulated bandwidth is
> bounded by "burst". And the maximun amount of CPU a group can consume in
> a given period is "buffer" which is equivalent to "quota" + "burst in
> case that this group has done enough accumulation.

Complete lack of why though. Why am I going to spend time looking at
this?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 2/4] sched/fair: Make CFS bandwidth controller burstable
  2021-03-16  4:49 ` [PATCH v4 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
@ 2021-03-16  9:52   ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2021-03-16  9:52 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, odin, odin, pauld, pjt, rostedt, shanpeic, tj,
	vincent.guittot, xiyou.wangcong



I can't make sense of patch 1 and 2 independent of one another. Why the
split?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-03-16  4:49 ` [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
  2021-03-16  9:27   ` Peter Zijlstra
  2021-03-16  9:41   ` Peter Zijlstra
@ 2021-03-16  9:54   ` Peter Zijlstra
  2021-03-17  7:16     ` changhuaixin
  2021-03-16 10:40   ` Peter Zijlstra
  3 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2021-03-16  9:54 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, odin, odin, pauld, pjt, rostedt, shanpeic, tj,
	vincent.guittot, xiyou.wangcong

On Tue, Mar 16, 2021 at 12:49:28PM +0800, Huaixin Chang wrote:
> @@ -8982,6 +8983,12 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>  	if (quota != RUNTIME_INF && quota > max_cfs_runtime)
>  		return -EINVAL;
>  
> +	/*
> +	 * Bound burst to defend burst against overflow during bandwidth shift.
> +	 */
> +	if (burst > max_cfs_runtime)
> +		return -EINVAL;

Why do you allow such a large burst? I would expect something like:

	if (burst > quote)
		return -EINVAL;

That limits the variance in the system. Allowing super long bursts seems
to defeat the entire purpose of bandwidth control.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-03-16  4:49 ` [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
                     ` (2 preceding siblings ...)
  2021-03-16  9:54   ` Peter Zijlstra
@ 2021-03-16 10:40   ` Peter Zijlstra
  3 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2021-03-16 10:40 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, odin, odin, pauld, pjt, rostedt, shanpeic, tj,
	vincent.guittot, xiyou.wangcong

On Tue, Mar 16, 2021 at 12:49:28PM +0800, Huaixin Chang wrote:
> And the maximun amount of CPU a group can consume in
> a given period is "buffer" which is equivalent to "quota" + "burst in
> case that this group has done enough accumulation.

I'm confused as heck about cfs_b->buffer. Why do you need that? What's
wrong with cfs_b->runtime ?

Currently, by being strict, we ignore any remaining runtime and the
period timer resets runtime to quota and life goes on. What you want to
do is instead of resetting runtime, add quota and limit the total.

That is, currently it does:

	runtime = min(runtime + quota, quota);

which by virtue of runtime not being allowed negative, it the exact same
as:

	runtime = quota;

Which is what we have in refill.

Fix that to be something like:

	runtime = min(runtime + quota, quota + burst)

and you're basically done. And that seems *much* simpler.

What am I missing, why have you made it so complicated?


/me looks again..

Oooh, I think I see, all this is because you don't do your constraints
right. Removing static from max_cfs_runtime is a big clue you did that
wrong.

Something like this on top of the first two. Much simpler!

Now I just need to figure out wth you mucked about with where we call
refill.

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8954,7 +8954,7 @@ static DEFINE_MUTEX(cfs_constraints_mute
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
 static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 /* More than 203 days if BW_SHIFT equals 20. */
-const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
+static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
@@ -8989,10 +8989,10 @@ static int tg_set_cfs_bandwidth(struct t
 	if (quota != RUNTIME_INF && quota > max_cfs_runtime)
 		return -EINVAL;
 
-	/*
-	 * Bound burst to defend burst against overflow during bandwidth shift.
-	 */
-	if (burst > max_cfs_runtime)
+	if (burst > quota)
+		return -EINVAL;
+
+	if (quota + burst > max_cfs_runtime)
 		return -EINVAL;
 
 	/*
@@ -9019,8 +9019,6 @@ static int tg_set_cfs_bandwidth(struct t
 	cfs_b->burst = burst;
 
 	if (runtime_enabled) {
-		cfs_b->buffer = min(max_cfs_runtime, quota + burst);
-		cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
 		cfs_b->runtime = cfs_b->quota;
 
 		/* Restart the period timer (if active) to handle new period expiry: */
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4621,23 +4621,22 @@ static inline u64 sched_cfs_bandwidth_sl
  *
  * requires cfs_b->lock
  */
-static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b,
-					   u64 overrun)
+static void
+__refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
 {
-	u64 refill;
-
-	if (cfs_b->quota != RUNTIME_INF) {
-
-		if (!sysctl_sched_cfs_bw_burst_enabled) {
-			cfs_b->runtime = cfs_b->quota;
-			return;
-		}
+	if (unlikely(cfs_b->quota == RUNTIME_INF))
+		return;
 
-		overrun = min(overrun, cfs_b->max_overrun);
-		refill = cfs_b->quota * overrun;
-		cfs_b->runtime += refill;
-		cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
+	if (!sysctl_sched_cfs_bw_burst_enabled) {
+		cfs_b->runtime = cfs_b->quota;
+		return;
 	}
+
+	/*
+	 * Ignore @overrun since burst <= quota.
+	 */
+	cfs_b->runtime += cfs_b->quota;
+	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -5226,7 +5225,6 @@ static enum hrtimer_restart sched_cfs_sl
 }
 
 extern const u64 max_cfs_quota_period;
-extern const u64 max_cfs_runtime;
 
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
@@ -5256,18 +5254,7 @@ static enum hrtimer_restart sched_cfs_pe
 			new = old * 2;
 			if (new < max_cfs_quota_period) {
 				cfs_b->period = ns_to_ktime(new);
-				cfs_b->quota = min(cfs_b->quota * 2,
-						   max_cfs_runtime);
-
-				cfs_b->buffer = min(cfs_b->quota + cfs_b->burst,
-						    max_cfs_runtime);
-				/*
-				 * Add 1 in case max_overrun becomes 0.
-				 * 0 max_overrun will cause no runtime being
-				 * refilled in __refill_cfs_bandwidth_runtime().
-				 */
-				cfs_b->max_overrun >>= 1;
-				cfs_b->max_overrun++;
+				cfs_b->quota *= 2;
 
 				pr_warn_ratelimited(
 	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
@@ -5300,7 +5287,6 @@ void init_cfs_bandwidth(struct cfs_bandw
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 	cfs_b->burst = 0;
-	cfs_b->buffer = RUNTIME_INF;
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -366,8 +366,6 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	u64			burst;
-	u64			buffer;
-	u64			max_overrun;
 	s64			hierarchical_quota;
 
 	u8			idle;

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-03-16  9:54   ` Peter Zijlstra
@ 2021-03-17  7:16     ` changhuaixin
  2021-03-17  8:06       ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: changhuaixin @ 2021-03-17  7:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: changhuaixin, Benjamin Segall, dietmar.eggemann, juri.lelli,
	khlebnikov, open list, mgorman, mingo, Odin Ugedal, Odin Ugedal,
	pauld, Paul Turner, rostedt, Shanpei Chen, Tejun Heo,
	Vincent Guittot, xiyou.wangcong



> On Mar 16, 2021, at 5:54 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Tue, Mar 16, 2021 at 12:49:28PM +0800, Huaixin Chang wrote:
>> @@ -8982,6 +8983,12 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>> 	if (quota != RUNTIME_INF && quota > max_cfs_runtime)
>> 		return -EINVAL;
>> 
>> +	/*
>> +	 * Bound burst to defend burst against overflow during bandwidth shift.
>> +	 */
>> +	if (burst > max_cfs_runtime)
>> +		return -EINVAL;
> 
> Why do you allow such a large burst? I would expect something like:
> 
> 	if (burst > quote)
> 		return -EINVAL;
> 
> That limits the variance in the system. Allowing super long bursts seems
> to defeat the entire purpose of bandwidth control.

I understand your concern. Surely large burst value might allow super long bursts
thus preventing bandwidth control entirely for a long time.

However, I am afraid it is hard to decide what the maximum burst should be from
the bandwidth control mechanism itself. Allowing some burst to the maximum of
quota is helpful, but not enough. There are cases where workloads are bursty
that they need many times more than quota in a single period. In such cases, limiting
burst to the maximum of quota fails to meet the needs.

Thus, I wonder whether is it acceptable to leave the maximum burst to users. If the desired
behavior is to allow some burst, configure burst accordingly. If that is causing variance, use share
or other fairness mechanism. And if fairness mechanism still fails to coordinate, do not use
burst maybe.

In this way, cfs_b->buffer can be removed while cfs_b->max_overrun is still needed maybe.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-03-17  7:16     ` changhuaixin
@ 2021-03-17  8:06       ` Peter Zijlstra
  2021-03-18  1:26         ` changhuaixin
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2021-03-17  8:06 UTC (permalink / raw)
  To: changhuaixin
  Cc: Benjamin Segall, dietmar.eggemann, juri.lelli, khlebnikov,
	open list, mgorman, mingo, Odin Ugedal, Odin Ugedal, pauld,
	Paul Turner, rostedt, Shanpei Chen, Tejun Heo, Vincent Guittot,
	xiyou.wangcong

On Wed, Mar 17, 2021 at 03:16:18PM +0800, changhuaixin wrote:

> > Why do you allow such a large burst? I would expect something like:
> > 
> > 	if (burst > quote)
> > 		return -EINVAL;
> > 
> > That limits the variance in the system. Allowing super long bursts seems
> > to defeat the entire purpose of bandwidth control.
> 
> I understand your concern. Surely large burst value might allow super
> long bursts thus preventing bandwidth control entirely for a long
> time.
> 
> However, I am afraid it is hard to decide what the maximum burst
> should be from the bandwidth control mechanism itself. Allowing some
> burst to the maximum of quota is helpful, but not enough. There are
> cases where workloads are bursty that they need many times more than
> quota in a single period. In such cases, limiting burst to the maximum
> of quota fails to meet the needs.
> 
> Thus, I wonder whether is it acceptable to leave the maximum burst to
> users. If the desired behavior is to allow some burst, configure burst
> accordingly. If that is causing variance, use share or other fairness
> mechanism. And if fairness mechanism still fails to coordinate, do not
> use burst maybe.

It's not fairness, bandwidth control is about isolation, and burst
introduces interference.

> In this way, cfs_b->buffer can be removed while cfs_b->max_overrun is
> still needed maybe.

So what is the typical avg,stdev,max and mode for the workloads where you find
you need this?

I would really like to put a limit on the burst. IMO a workload that has
a burst many times longer than the quota is plain broken.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-03-17  8:06       ` Peter Zijlstra
@ 2021-03-18  1:26         ` changhuaixin
  2021-03-18 12:59           ` Phil Auld
  2021-03-18 15:05           ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: changhuaixin @ 2021-03-18  1:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: changhuaixin, Benjamin Segall, dietmar.eggemann, juri.lelli,
	khlebnikov, open list, mgorman, mingo, Odin Ugedal, Odin Ugedal,
	pauld, Paul Turner, rostedt, Shanpei Chen, Tejun Heo,
	Vincent Guittot, xiyou.wangcong



> On Mar 17, 2021, at 4:06 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Mar 17, 2021 at 03:16:18PM +0800, changhuaixin wrote:
> 
>>> Why do you allow such a large burst? I would expect something like:
>>> 
>>> 	if (burst > quote)
>>> 		return -EINVAL;
>>> 
>>> That limits the variance in the system. Allowing super long bursts seems
>>> to defeat the entire purpose of bandwidth control.
>> 
>> I understand your concern. Surely large burst value might allow super
>> long bursts thus preventing bandwidth control entirely for a long
>> time.
>> 
>> However, I am afraid it is hard to decide what the maximum burst
>> should be from the bandwidth control mechanism itself. Allowing some
>> burst to the maximum of quota is helpful, but not enough. There are
>> cases where workloads are bursty that they need many times more than
>> quota in a single period. In such cases, limiting burst to the maximum
>> of quota fails to meet the needs.
>> 
>> Thus, I wonder whether is it acceptable to leave the maximum burst to
>> users. If the desired behavior is to allow some burst, configure burst
>> accordingly. If that is causing variance, use share or other fairness
>> mechanism. And if fairness mechanism still fails to coordinate, do not
>> use burst maybe.
> 
> It's not fairness, bandwidth control is about isolation, and burst
> introduces interference.
> 
>> In this way, cfs_b->buffer can be removed while cfs_b->max_overrun is
>> still needed maybe.
> 
> So what is the typical avg,stdev,max and mode for the workloads where you find
> you need this?
> 
> I would really like to put a limit on the burst. IMO a workload that has
> a burst many times longer than the quota is plain broken.

I see. Then the problem comes down to how large the limit on burst shall be.

I have sampled the CPU usage of a bursty container in 100ms periods. The statistics are:
average	: 42.2%
stddev	: 81.5%
max		: 844.5%
P95		: 183.3%
P99		: 437.0%

If quota is 100000ms, burst buffer needs to be 8 times more in order for this workload not to be throttled.
I can't say this is typical, but these workloads exist. On a machine running Kubernetes containers,
where there is often room for such burst and the interference is hard to notice, users would prefer
allowing such burst to being throttled occasionally.

In this sense, I suggest limit burst buffer to 16 times of quota or around. That should be enough for users to
improve tail latency caused by throttling. And users might choose a smaller one or even none, if the interference
is unacceptable. What do you think?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-03-18  1:26         ` changhuaixin
@ 2021-03-18 12:59           ` Phil Auld
  2021-03-18 15:10             ` Peter Zijlstra
  2021-03-19 12:51             ` changhuaixin
  2021-03-18 15:05           ` Peter Zijlstra
  1 sibling, 2 replies; 21+ messages in thread
From: Phil Auld @ 2021-03-18 12:59 UTC (permalink / raw)
  To: changhuaixin
  Cc: Peter Zijlstra, Benjamin Segall, dietmar.eggemann, juri.lelli,
	khlebnikov, open list, mgorman, mingo, Odin Ugedal, Odin Ugedal,
	pauld, Paul Turner, rostedt, Shanpei Chen, Tejun Heo,
	Vincent Guittot, xiyou.wangcong

On Thu, Mar 18, 2021 at 09:26:58AM +0800 changhuaixin wrote:
> 
> 
> > On Mar 17, 2021, at 4:06 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Wed, Mar 17, 2021 at 03:16:18PM +0800, changhuaixin wrote:
> > 
> >>> Why do you allow such a large burst? I would expect something like:
> >>> 
> >>> 	if (burst > quote)
> >>> 		return -EINVAL;
> >>> 
> >>> That limits the variance in the system. Allowing super long bursts seems
> >>> to defeat the entire purpose of bandwidth control.
> >> 
> >> I understand your concern. Surely large burst value might allow super
> >> long bursts thus preventing bandwidth control entirely for a long
> >> time.
> >> 
> >> However, I am afraid it is hard to decide what the maximum burst
> >> should be from the bandwidth control mechanism itself. Allowing some
> >> burst to the maximum of quota is helpful, but not enough. There are
> >> cases where workloads are bursty that they need many times more than
> >> quota in a single period. In such cases, limiting burst to the maximum
> >> of quota fails to meet the needs.
> >> 
> >> Thus, I wonder whether is it acceptable to leave the maximum burst to
> >> users. If the desired behavior is to allow some burst, configure burst
> >> accordingly. If that is causing variance, use share or other fairness
> >> mechanism. And if fairness mechanism still fails to coordinate, do not
> >> use burst maybe.
> > 
> > It's not fairness, bandwidth control is about isolation, and burst
> > introduces interference.
> > 
> >> In this way, cfs_b->buffer can be removed while cfs_b->max_overrun is
> >> still needed maybe.
> > 
> > So what is the typical avg,stdev,max and mode for the workloads where you find
> > you need this?
> > 
> > I would really like to put a limit on the burst. IMO a workload that has
> > a burst many times longer than the quota is plain broken.
> 
> I see. Then the problem comes down to how large the limit on burst shall be.
> 
> I have sampled the CPU usage of a bursty container in 100ms periods. The statistics are:
> average	: 42.2%
> stddev	: 81.5%
> max		: 844.5%
> P95		: 183.3%
> P99		: 437.0%
> 
> If quota is 100000ms, burst buffer needs to be 8 times more in order for this workload not to be throttled.
> I can't say this is typical, but these workloads exist. On a machine running Kubernetes containers,
> where there is often room for such burst and the interference is hard to notice, users would prefer
> allowing such burst to being throttled occasionally.
>

I admit to not having followed all the history of this patch set. That said, when I see the above I just
think your quota is too low for your workload.

The burst (mis?)feature seems to be a way to bypass the quota.  And it sort of assumes cooperative
containers that will only burst when they need it and then go back to normal. 

> In this sense, I suggest limit burst buffer to 16 times of quota or around. That should be enough for users to
> improve tail latency caused by throttling. And users might choose a smaller one or even none, if the interference
> is unacceptable. What do you think?
> 

Having quotas that can regularly be exceeded by 16 times seems to make the concept of a quota
meaningless.  I'd have thought a burst would be some small percentage.

What if several such containers burst at the same time? Can't that lead to overcommit that can effect
other well-behaved containers?


Cheers,
Phil

-- 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-03-18  1:26         ` changhuaixin
  2021-03-18 12:59           ` Phil Auld
@ 2021-03-18 15:05           ` Peter Zijlstra
  2021-03-19 12:39             ` changhuaixin
  2021-05-12 12:41             ` changhuaixin
  1 sibling, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2021-03-18 15:05 UTC (permalink / raw)
  To: changhuaixin
  Cc: Benjamin Segall, dietmar.eggemann, juri.lelli, khlebnikov,
	open list, mgorman, mingo, Odin Ugedal, Odin Ugedal, pauld,
	Paul Turner, rostedt, Shanpei Chen, Tejun Heo, Vincent Guittot,
	xiyou.wangcong

On Thu, Mar 18, 2021 at 09:26:58AM +0800, changhuaixin wrote:
> > On Mar 17, 2021, at 4:06 PM, Peter Zijlstra <peterz@infradead.org> wrote:

> > So what is the typical avg,stdev,max and mode for the workloads where you find
> > you need this?
> > 
> > I would really like to put a limit on the burst. IMO a workload that has
> > a burst many times longer than the quota is plain broken.
> 
> I see. Then the problem comes down to how large the limit on burst shall be.
> 
> I have sampled the CPU usage of a bursty container in 100ms periods. The statistics are:

So CPU usage isn't exactly what is required, job execution time is what
you're after. Assuming there is a relation...

> average	: 42.2%
> stddev	: 81.5%
> max		: 844.5%
> P95		: 183.3%
> P99		: 437.0%

Then your WCET is 844% of 100ms ? , which is .84s.

But you forgot your mode; what is the most common duration, given P95 is
so high, I doubt that avg is representative of the most common duration.

> If quota is 100000ms, burst buffer needs to be 8 times more in order
> for this workload not to be throttled.

Where does that 100s come from? And an 800s burst is bizarre.

Did you typo [us] as [ms] ?

> I can't say this is typical, but these workloads exist. On a machine
> running Kubernetes containers, where there is often room for such
> burst and the interference is hard to notice, users would prefer
> allowing such burst to being throttled occasionally.

Users also want ponies. I've no idea what kubernetes actually is or what
it has to do with containers. That's all just word salad.

> In this sense, I suggest limit burst buffer to 16 times of quota or
> around. That should be enough for users to improve tail latency caused
> by throttling. And users might choose a smaller one or even none, if
> the interference is unacceptable. What do you think?

Well, normal RT theory would suggest you pick your runtime around 200%
to get that P95 and then allow a full period burst to get your P99, but
that same RT theory would also have you calculate the resulting
interference and see if that works with the rest of the system...

16 times is horrific.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-03-18 12:59           ` Phil Auld
@ 2021-03-18 15:10             ` Peter Zijlstra
  2021-04-19  8:18               ` changhuaixin
  2021-03-19 12:51             ` changhuaixin
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2021-03-18 15:10 UTC (permalink / raw)
  To: Phil Auld
  Cc: changhuaixin, Benjamin Segall, dietmar.eggemann, juri.lelli,
	khlebnikov, open list, mgorman, mingo, Odin Ugedal, Odin Ugedal,
	Paul Turner, rostedt, Shanpei Chen, Tejun Heo, Vincent Guittot,
	xiyou.wangcong

On Thu, Mar 18, 2021 at 08:59:44AM -0400, Phil Auld wrote:
> I admit to not having followed all the history of this patch set. That
> said, when I see the above I just think your quota is too low for your
> workload.

This.

> The burst (mis?)feature seems to be a way to bypass the quota.  And it
> sort of assumes cooperative containers that will only burst when they
> need it and then go back to normal. 

Its not entirely unreasonable or unheard of. There's soft realtime
systems that use this to increase the utilization with the trade-off
that you're going to miss deadlines once every so often.

If you do it right, you can calculate the probabilities. Or usually the
other way around, you calculate the allowed variance/burst given a P
value for making the deadline.

Input then isn't the WCET for each task, but a runtime distribution as
measured for your workload on your system etc..

I used to have papers on this, but I can't seem to find them in a hurry.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-03-18 15:05           ` Peter Zijlstra
@ 2021-03-19 12:39             ` changhuaixin
  2021-03-20  2:06               ` changhuaixin
  2021-05-12 12:41             ` changhuaixin
  1 sibling, 1 reply; 21+ messages in thread
From: changhuaixin @ 2021-03-19 12:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: changhuaixin, Benjamin Segall, dietmar.eggemann, juri.lelli,
	khlebnikov, open list, mgorman, mingo, Odin Ugedal, Odin Ugedal,
	pauld, Paul Turner, rostedt, Shanpei Chen, Tejun Heo,
	Vincent Guittot, xiyou.wangcong



> On Mar 18, 2021, at 11:05 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Mar 18, 2021 at 09:26:58AM +0800, changhuaixin wrote:
>>> On Mar 17, 2021, at 4:06 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>>> So what is the typical avg,stdev,max and mode for the workloads where you find
>>> you need this?
>>> 
>>> I would really like to put a limit on the burst. IMO a workload that has
>>> a burst many times longer than the quota is plain broken.
>> 
>> I see. Then the problem comes down to how large the limit on burst shall be.
>> 
>> I have sampled the CPU usage of a bursty container in 100ms periods. The statistics are:
> 
> So CPU usage isn't exactly what is required, job execution time is what
> you're after. Assuming there is a relation...
> 

Yes, job execution time is important. To be specific, it is to improve the CPU usage of the whole
system to reduce the total cost of ownership, while not damaging job execution time. This
requires lower the average CPU resource of underutilized cgroups, and allowing their bursts
at the same time.

>> average	: 42.2%
>> stddev	: 81.5%
>> max		: 844.5%
>> P95		: 183.3%
>> P99		: 437.0%
> 
> Then your WCET is 844% of 100ms ? , which is .84s.
> 
> But you forgot your mode; what is the most common duration, given P95 is
> so high, I doubt that avg is representative of the most common duration.
> 

It is true.

>> If quota is 100000ms, burst buffer needs to be 8 times more in order
>> for this workload not to be throttled.
> 
> Where does that 100s come from? And an 800s burst is bizarre.
> 
> Did you typo [us] as [ms] ?
> 

Sorry, it should be 100000us.

>> I can't say this is typical, but these workloads exist. On a machine
>> running Kubernetes containers, where there is often room for such
>> burst and the interference is hard to notice, users would prefer
>> allowing such burst to being throttled occasionally.
> 
> Users also want ponies. I've no idea what kubernetes actually is or what
> it has to do with containers. That's all just word salad.
> 
>> In this sense, I suggest limit burst buffer to 16 times of quota or
>> around. That should be enough for users to improve tail latency caused
>> by throttling. And users might choose a smaller one or even none, if
>> the interference is unacceptable. What do you think?
> 
> Well, normal RT theory would suggest you pick your runtime around 200%
> to get that P95 and then allow a full period burst to get your P99, but
> that same RT theory would also have you calculate the resulting
> interference and see if that works with the rest of the system...
> 

I am sorry that I don't know much about the RT theory you mentioned, and can't provide
the desired calculation now. But I'd like to try and do some reading if that is needed.

> 16 times is horrific.

So can we decide on a more relative value now? Or is the interference probabilities still the
missing piece?

Is the paper you mentioned about called "Insensitivity results in statistical bandwidth sharing",
or some related ones on statistical bandwidth results under some kind of fairness?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-03-18 12:59           ` Phil Auld
  2021-03-18 15:10             ` Peter Zijlstra
@ 2021-03-19 12:51             ` changhuaixin
  1 sibling, 0 replies; 21+ messages in thread
From: changhuaixin @ 2021-03-19 12:51 UTC (permalink / raw)
  To: Phil Auld
  Cc: changhuaixin, Peter Zijlstra, Benjamin Segall, dietmar.eggemann,
	juri.lelli, khlebnikov, open list, mgorman, mingo, Odin Ugedal,
	Odin Ugedal, Paul Turner, rostedt, Shanpei Chen, Tejun Heo,
	Vincent Guittot, xiyou.wangcong



> On Mar 18, 2021, at 8:59 PM, Phil Auld <pauld@redhat.com> wrote:
> 
> On Thu, Mar 18, 2021 at 09:26:58AM +0800 changhuaixin wrote:
>> 
>> 
>>> On Mar 17, 2021, at 4:06 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Wed, Mar 17, 2021 at 03:16:18PM +0800, changhuaixin wrote:
>>> 
>>>>> Why do you allow such a large burst? I would expect something like:
>>>>> 
>>>>> 	if (burst > quote)
>>>>> 		return -EINVAL;
>>>>> 
>>>>> That limits the variance in the system. Allowing super long bursts seems
>>>>> to defeat the entire purpose of bandwidth control.
>>>> 
>>>> I understand your concern. Surely large burst value might allow super
>>>> long bursts thus preventing bandwidth control entirely for a long
>>>> time.
>>>> 
>>>> However, I am afraid it is hard to decide what the maximum burst
>>>> should be from the bandwidth control mechanism itself. Allowing some
>>>> burst to the maximum of quota is helpful, but not enough. There are
>>>> cases where workloads are bursty that they need many times more than
>>>> quota in a single period. In such cases, limiting burst to the maximum
>>>> of quota fails to meet the needs.
>>>> 
>>>> Thus, I wonder whether is it acceptable to leave the maximum burst to
>>>> users. If the desired behavior is to allow some burst, configure burst
>>>> accordingly. If that is causing variance, use share or other fairness
>>>> mechanism. And if fairness mechanism still fails to coordinate, do not
>>>> use burst maybe.
>>> 
>>> It's not fairness, bandwidth control is about isolation, and burst
>>> introduces interference.
>>> 
>>>> In this way, cfs_b->buffer can be removed while cfs_b->max_overrun is
>>>> still needed maybe.
>>> 
>>> So what is the typical avg,stdev,max and mode for the workloads where you find
>>> you need this?
>>> 
>>> I would really like to put a limit on the burst. IMO a workload that has
>>> a burst many times longer than the quota is plain broken.
>> 
>> I see. Then the problem comes down to how large the limit on burst shall be.
>> 
>> I have sampled the CPU usage of a bursty container in 100ms periods. The statistics are:
>> average	: 42.2%
>> stddev	: 81.5%
>> max		: 844.5%
>> P95		: 183.3%
>> P99		: 437.0%
>> 
>> If quota is 100000ms, burst buffer needs to be 8 times more in order for this workload not to be throttled.
>> I can't say this is typical, but these workloads exist. On a machine running Kubernetes containers,
>> where there is often room for such burst and the interference is hard to notice, users would prefer
>> allowing such burst to being throttled occasionally.
>> 
> 
> I admit to not having followed all the history of this patch set. That said, when I see the above I just
> think your quota is too low for your workload.
> 

Yeah, more quota is helpful for this workload. But that usually prevents us from improving the total CPU
usage by putting more work onto a single machine.

> The burst (mis?)feature seems to be a way to bypass the quota.  And it sort of assumes cooperative
> containers that will only burst when they need it and then go back to normal. 
> 
>> In this sense, I suggest limit burst buffer to 16 times of quota or around. That should be enough for users to
>> improve tail latency caused by throttling. And users might choose a smaller one or even none, if the interference
>> is unacceptable. What do you think?
>> 
> 
> Having quotas that can regularly be exceeded by 16 times seems to make the concept of a quota
> meaningless.  I'd have thought a burst would be some small percentage.
> 
> What if several such containers burst at the same time? Can't that lead to overcommit that can effect
> other well-behaved containers?
> 

I see. Maybe there should be some calculation on the probabilities of that, as Peter has replied.

> 
> Cheers,
> Phil
> 
> -- 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-03-19 12:39             ` changhuaixin
@ 2021-03-20  2:06               ` changhuaixin
  0 siblings, 0 replies; 21+ messages in thread
From: changhuaixin @ 2021-03-20  2:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: changhuaixin, Benjamin Segall, dietmar.eggemann, juri.lelli,
	khlebnikov, open list, mgorman, mingo, Odin Ugedal, Odin Ugedal,
	pauld, Paul Turner, rostedt, Shanpei Chen, Tejun Heo,
	Vincent Guittot, xiyou.wangcong



> On Mar 19, 2021, at 8:39 PM, changhuaixin <changhuaixin@linux.alibaba.com> wrote:
> 
> 
> 
>> On Mar 18, 2021, at 11:05 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>> On Thu, Mar 18, 2021 at 09:26:58AM +0800, changhuaixin wrote:
>>>> On Mar 17, 2021, at 4:06 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>>>> So what is the typical avg,stdev,max and mode for the workloads where you find
>>>> you need this?
>>>> 
>>>> I would really like to put a limit on the burst. IMO a workload that has
>>>> a burst many times longer than the quota is plain broken.
>>> 
>>> I see. Then the problem comes down to how large the limit on burst shall be.
>>> 
>>> I have sampled the CPU usage of a bursty container in 100ms periods. The statistics are:
>> 
>> So CPU usage isn't exactly what is required, job execution time is what
>> you're after. Assuming there is a relation...
>> 
> 
> Yes, job execution time is important. To be specific, it is to improve the CPU usage of the whole
> system to reduce the total cost of ownership, while not damaging job execution time. This
> requires lower the average CPU resource of underutilized cgroups, and allowing their bursts
> at the same time.
> 
>>> average	: 42.2%
>>> stddev	: 81.5%
>>> max		: 844.5%
>>> P95		: 183.3%
>>> P99		: 437.0%
>> 
>> Then your WCET is 844% of 100ms ? , which is .84s.
>> 
>> But you forgot your mode; what is the most common duration, given P95 is
>> so high, I doubt that avg is representative of the most common duration.
>> 
> 
> It is true.
> 
>>> If quota is 100000ms, burst buffer needs to be 8 times more in order
>>> for this workload not to be throttled.
>> 
>> Where does that 100s come from? And an 800s burst is bizarre.
>> 
>> Did you typo [us] as [ms] ?
>> 
> 
> Sorry, it should be 100000us.
> 
>>> I can't say this is typical, but these workloads exist. On a machine
>>> running Kubernetes containers, where there is often room for such
>>> burst and the interference is hard to notice, users would prefer
>>> allowing such burst to being throttled occasionally.
>> 
>> Users also want ponies. I've no idea what kubernetes actually is or what
>> it has to do with containers. That's all just word salad.
>> 
>>> In this sense, I suggest limit burst buffer to 16 times of quota or
>>> around. That should be enough for users to improve tail latency caused
>>> by throttling. And users might choose a smaller one or even none, if
>>> the interference is unacceptable. What do you think?
>> 
>> Well, normal RT theory would suggest you pick your runtime around 200%
>> to get that P95 and then allow a full period burst to get your P99, but
>> that same RT theory would also have you calculate the resulting
>> interference and see if that works with the rest of the system...
>> 
> 
> I am sorry that I don't know much about the RT theory you mentioned, and can't provide
> the desired calculation now. But I'd like to try and do some reading if that is needed.
> 
>> 16 times is horrific.
> 
> So can we decide on a more relative value now? Or is the interference probabilities still the
> missing piece?

A more [realistic] value, I mean.

> 
> Is the paper you mentioned about called "Insensitivity results in statistical bandwidth sharing",
> or some related ones on statistical bandwidth results under some kind of fairness?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-03-18 15:10             ` Peter Zijlstra
@ 2021-04-19  8:18               ` changhuaixin
  0 siblings, 0 replies; 21+ messages in thread
From: changhuaixin @ 2021-04-19  8:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: changhuaixin, Phil Auld, Benjamin Segall, dietmar.eggemann,
	juri.lelli, khlebnikov, open list, mgorman, mingo, Odin Ugedal,
	Odin Ugedal, Paul Turner, rostedt, Shanpei Chen, Tejun Heo,
	Vincent Guittot, xiyou.wangcong



> On Mar 18, 2021, at 11:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Mar 18, 2021 at 08:59:44AM -0400, Phil Auld wrote:
>> I admit to not having followed all the history of this patch set. That
>> said, when I see the above I just think your quota is too low for your
>> workload.
> 
> This.
> 
>> The burst (mis?)feature seems to be a way to bypass the quota.  And it
>> sort of assumes cooperative containers that will only burst when they
>> need it and then go back to normal. 
> 
> Its not entirely unreasonable or unheard of. There's soft realtime
> systems that use this to increase the utilization with the trade-off
> that you're going to miss deadlines once every so often.
> 
> If you do it right, you can calculate the probabilities. Or usually the
> other way around, you calculate the allowed variance/burst given a P
> value for making the deadline.
> 
> Input then isn't the WCET for each task, but a runtime distribution as
> measured for your workload on your system etc..
> 
> I used to have papers on this, but I can't seem to find them in a hurry.
> 

Hi, I have done some reading on queueing theory and done some problem definition.

Divide real time into discrete periods as cfs_b does. Assume there are m cgroups using
CFS Bandwidth Control. During each period, the i-th cgroup demands u_i CPU time,
where we assume u_i is under some distribution(exponential, Poisson or else).
At the end of a period, if the sum of u_i is under or equal to 100%, we call it an "idle" state.
The number of periods between two "idle" states stands for the WCET of tasks during these
periods.

Originally using quota, it is guaranteed that "idle" state comes at the end of each period. Thus,
the WCET is the length of period. When enabling CPU Burst, the sum of u_i may exceed 100%,
and the exceeded workload is handled in the following periods. The WCET is the number of periods
between two "idle" states.

Then, we are going to calculate the probabilities that WCET is longer than a period, and the average
WCET when using certain burst under some runtime distribution.

Basically, these are based on pervious mails. I am sending this email to see if there is anything wrong
on problem definition.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-03-18 15:05           ` Peter Zijlstra
  2021-03-19 12:39             ` changhuaixin
@ 2021-05-12 12:41             ` changhuaixin
  1 sibling, 0 replies; 21+ messages in thread
From: changhuaixin @ 2021-05-12 12:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: changhuaixin, Benjamin Segall, dietmar.eggemann, juri.lelli,
	khlebnikov, open list, mgorman, mingo, Odin Ugedal, Odin Ugedal,
	pauld, Paul Turner, rostedt, Shanpei Chen, Tejun Heo,
	Vincent Guittot, xiyou.wangcong, dtcccc



> On Mar 18, 2021, at 11:05 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Mar 18, 2021 at 09:26:58AM +0800, changhuaixin wrote:
>>> On Mar 17, 2021, at 4:06 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>>> So what is the typical avg,stdev,max and mode for the workloads where you find
>>> you need this?
>>> 
>>> I would really like to put a limit on the burst. IMO a workload that has
>>> a burst many times longer than the quota is plain broken.
>> 
>> I see. Then the problem comes down to how large the limit on burst shall be.
>> 
>> I have sampled the CPU usage of a bursty container in 100ms periods. The statistics are:
> 
> So CPU usage isn't exactly what is required, job execution time is what
> you're after. Assuming there is a relation...
> 
>> average	: 42.2%
>> stddev	: 81.5%
>> max		: 844.5%
>> P95		: 183.3%
>> P99		: 437.0%
> 
> Then your WCET is 844% of 100ms ? , which is .84s.
> 
> But you forgot your mode; what is the most common duration, given P95 is
> so high, I doubt that avg is representative of the most common duration.
> 
>> If quota is 100000ms, burst buffer needs to be 8 times more in order
>> for this workload not to be throttled.
> 
> Where does that 100s come from? And an 800s burst is bizarre.
> 
> Did you typo [us] as [ms] ?
> 
>> I can't say this is typical, but these workloads exist. On a machine
>> running Kubernetes containers, where there is often room for such
>> burst and the interference is hard to notice, users would prefer
>> allowing such burst to being throttled occasionally.
> 
> Users also want ponies. I've no idea what kubernetes actually is or what
> it has to do with containers. That's all just word salad.
> 
>> In this sense, I suggest limit burst buffer to 16 times of quota or
>> around. That should be enough for users to improve tail latency caused
>> by throttling. And users might choose a smaller one or even none, if
>> the interference is unacceptable. What do you think?
> 
> Well, normal RT theory would suggest you pick your runtime around 200%
> to get that P95 and then allow a full period burst to get your P99, but
> that same RT theory would also have you calculate the resulting
> interference and see if that works with the rest of the system...
> 
> 16 times is horrific.
> 
> 

Hi,

We have discussed much about the benefit of using the burst feature previously in the
thread[1]. This mail shows the interference on the rest of the system when burst buffer is used,
when runtime is under exponential[2], poisson[3] or pareto[4] distribution. Test results showed that,
interference is more sensitive to the average utilization and the number of cgroups, than to burst buffer
size. If there are many cgroups and CPU is under utilized, a large burst buffer is acceptable. If there are
a small number of cgroups or CPU utilization is high, burst buffer should not be used.

The calculation is done by simulating[5] the CPU utilization and CPU Bandwidth distribution,
and deciding whether deadline is broken in each period. Simulation is done for over 1M
periods to get stable results.We also tried queueing theory, such as Markov chains,
to do the simplest WCET calculation, and failed when the calculation gets more complicated.
Nor did we find any paper on this. 

The inputs for our simulation program are: distribution, u_avg, m, and b. We let m cgroups
configured eqaul quota (1/m of the total) and buffer (b * quota). The CPU utilization for each period
is under certain distribution and averaged at u_avg of quota. Utilization of m cgroups is independent
identically distributed.

The WCET is counted as the number of periods till making the deadline. If a period ends with
more CPU needs that 100% after using burst, deadline is broken for this period. Without burst
feature, deadline can be guaranteed in each period.

We've conducted our experiments with variable inputs. The result is shown with the format of
E(WCET)/P(WCET>1), meaning the expectation of WCET and the probability when WCET>1.
The rows and cols(i.e. buffer=0) with all 1.0000/0% are omitted.

For exponential distribution[2], we set λ=u_avg/m for each cgroup.

m=5:
u_avg/buffer            100%              200%                    ∞
30%               1.0001/0.01%     1.0002/0.02%   1.0002/0.02%
50%               1.0176/1.66%     1.0301/2.76%    1.0352/3.15%
70%               1.1503/11.52%    1.2691/17.51%   1.4057/21.86%
90%               1.3042/21.17%    1.6701/34.19%  3.7937/60.02%

m=20:
u_avg/buffer            100%              200%                    ∞
50%               1.0000/0%        1.0001/0.01%    1.0002/0.02%
70%               1.0060/0.59%    1.0217/2.05%    1.0476/4.18%
90%               1.1989/14.07%    1.4467/24.46%  2.8603/46.26%

Since poisson distribution[3] is discrete, we convert u_avg to integer by multiplying 10 and normalize
the result, where X = n means utilization is at n * 10% of quota. We choose this way instead of others like
X = n requires n% or n/10% of quota because it generates more bursts, and thus more interference.

m=5:
u_avg/buffer            100%              200%                    ∞
70%               1.0071/0.69%      1.0071/0.69%   1.0071/0.70%
90%               1.4313/25.22%    1.5124/27.29%  1.5237/27.52%

m=20:
u_avg/buffer            100%              200%                    ∞
90%               1.0708/6.06%     1.0879/7.20%   1.0903/7.35%

It can be concluded that, compared to exponential distribution, tasks whose utilization is under
poisson distribution generate less interference after using burst. To generate more bursty workload, we also tested
the pareto distribution[4]. We set α=4 and adjust x_min accordingly so that the max demand is at about 500%
when u_avg is at 30% during 1M periods.

m=5:
u_avg/buffer            100%              200%                    ∞
30%               1.0000/0%        1.0000/0%        1.0001/0.01%
50%               1.0001/0.01%     1.0008/0.08%    1.0019/0.16%
70%               1.0117/1.11%       1.0234/2.11%     1.0308/2.38%
90%               1.3994/21.97      1.6059/26.49%  1.8296/28.78%

m=20:
u_avg/buffer            100%              200%                    ∞
70%               1.0000/0%         1.0001/0.01%      1.0013/0.12%
90%               1.0447/3.96%    1.0941/7.49%       1.1622/10.59%

References
[1]. https://lore.kernel.org/lkml/YFNtesvbxUY4XTYR@hirez.programming.kicks-ass.net/
[2]. https://en.wikipedia.org/wiki/Exponential_distribution
[3]. https://en.wikipedia.org/wiki/Poisson_distribution
[4]. https://en.wikipedia.org/wiki/Pareto_distribution
[5]. The pseudo code for our simulation program is as follows:

input: u_avg, m, b, distribution
output: wcet_list

for  (i = 0; i < nr_periods; i++) {
   free_cpu = 0;
   for (j = 0; j < m; j++) {
       needs[j] += new_needs(u_avg, m);
       if (needs[j] <= quota[j]) {
           free_cpu += quota[j] - needs[j];
           gather_token();
           needs[j] = 0;
       } else /* needs > quota */
           needs[j] -= quota[j];
   }

   /*
    * Additional free cpu is distrubted fairly among cgroups 
    * with more needs and burst buffer to meet needs.
    */
   distribute_free_cpu();

   for (j = 0; j < m; j++) {
       needs[j] -= free_cpu_use[j];
       use_token();
   }

   /* 
    * If deadline is met for this period.
    */
   if (free_cpu > 0 || !distribute_done) {
       /*
	 * Count periods when deadline is broken.
        */ 
       count_wcet_periods();
   }
}

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-05-12 12:41 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-16  4:49 [PATCH v4 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
2021-03-16  4:49 ` [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
2021-03-16  9:27   ` Peter Zijlstra
2021-03-16  9:41   ` Peter Zijlstra
2021-03-16  9:54   ` Peter Zijlstra
2021-03-17  7:16     ` changhuaixin
2021-03-17  8:06       ` Peter Zijlstra
2021-03-18  1:26         ` changhuaixin
2021-03-18 12:59           ` Phil Auld
2021-03-18 15:10             ` Peter Zijlstra
2021-04-19  8:18               ` changhuaixin
2021-03-19 12:51             ` changhuaixin
2021-03-18 15:05           ` Peter Zijlstra
2021-03-19 12:39             ` changhuaixin
2021-03-20  2:06               ` changhuaixin
2021-05-12 12:41             ` changhuaixin
2021-03-16 10:40   ` Peter Zijlstra
2021-03-16  4:49 ` [PATCH v4 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
2021-03-16  9:52   ` Peter Zijlstra
2021-03-16  4:49 ` [PATCH v4 3/4] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
2021-03-16  4:49 ` [PATCH v4 4/4] sched/fair: Add document for burstable CFS bandwidth control Huaixin Chang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).