* [PATCH 0/4] sched/fair: Burstable CFS bandwidth controller
@ 2020-12-17  7:46 Huaixin Chang
  2020-12-17  7:46 ` [PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
                   ` (6 more replies)
  0 siblings, 7 replies; 44+ messages in thread
From: Huaixin Chang @ 2020-12-17  7:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: bsegall, changhuaixin, dietmar.eggemann, juri.lelli, mgorman,
	mingo, pauld, peterz, pjt, rostedt, vincent.guittot, khlebnikov,
	xiyou.wangcong, shanpeic

The CFS bandwidth controller limits the CPU requests of a task group to
quota during each period. However, parallel workloads are often bursty,
so they get throttled; and since they are latency sensitive at the same
time, throttling them is undesirable.

Scaling up period and quota allows greater burst capacity, but risks a
longer stall until the next refill. We introduce "burst" to allow
accumulating unused quota from previous periods, to be assigned when a
task group requests more CPU than quota during a specific period. This
allows CPU time requests as long as the average requested CPU time stays
below quota in the long run. The maximum accumulation is capped by burst
and is set to 0 by default, so the traditional behaviour remains.
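
Conceptually this is a token bucket: each period refills quota, and unused
runtime carries over up to a cap of quota + burst. A minimal user-space
model of the accounting (a sketch for illustration only; the names and
numbers are ours, not the kernel's):

	#include <stdio.h>

	/* All values in arbitrary time units. */
	struct bw {
		unsigned long long quota;	/* refill per period */
		unsigned long long burst;	/* max carry-over */
		unsigned long long runtime;	/* tokens available */
	};

	static void refill(struct bw *b)
	{
		unsigned long long buffer = b->quota + b->burst;

		b->runtime += b->quota;
		if (b->runtime > buffer)	/* cap accumulation at quota + burst */
			b->runtime = buffer;
	}

	int main(void)
	{
		struct bw b = { .quota = 100, .burst = 300, .runtime = 100 };
		unsigned long long demand[] = { 20, 20, 20, 380, 20 };

		for (int i = 0; i < 5; i++) {
			unsigned long long ran =
				demand[i] < b.runtime ? demand[i] : b.runtime;

			printf("period %d: demand %llu ran %llu%s\n",
			       i, demand[i], ran,
			       ran < demand[i] ? " (throttled)" : "");
			b.runtime -= ran;
			refill(&b);	/* refill for the next period */
		}
		return 0;
	}

With burst=0 the spike in period 3 would get only 100 of its 380; with
burst=300 the quiet periods before it bank enough runtime to run 340 of
it before throttling.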

A huge drop in 99th-percentile tail latency, from more than 500ms to
27ms, is seen for real Java workloads when using burst. Similar drops
are seen when testing with schbench too:

	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
	echo 700000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
	echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

	# The average CPU usage is around 500%, which is 200ms CPU time
	# every 40ms.
	./schbench -m 1 -t 30 -r 60 -c 10000 -R 500

	Without burst:

	Latency percentiles (usec)
	50.0000th: 7
	75.0000th: 8
	90.0000th: 9
	95.0000th: 10
	*99.0000th: 933
	99.5000th: 981
	99.9000th: 3068
	min=0, max=20054
	rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 9.33%

	With burst:

	Latency percentiles (usec)
	50.0000th: 7
	75.0000th: 8
	90.0000th: 9
	95.0000th: 9
	*99.0000th: 12
	99.5000th: 13
	99.9000th: 19
	min=0, max=406
	rps: 498.36 p95 (usec) 9 p99 (usec) 12 p95/cputime 0.09% p99/cputime 0.12%

How much a workload benefits from burstable CFS bandwidth control
depends on how bursty and how latency sensitive it is.

Previously, Cong Wang and Konstantin Khlebnikov proposed similar
features:
https://lore.kernel.org/lkml/20180522062017.5193-1-xiyou.wangcong@gmail.com/
https://lore.kernel.org/lkml/157476581065.5793.4518979877345136813.stgit@buzz/

This time we present more latency statistics and handle overflow while
accumulating.

Huaixin Chang (4):
  sched/fair: Introduce primitives for CFS bandwidth burst
  sched/fair: Make CFS bandwidth controller burstable
  sched/fair: Add cfs bandwidth burst statistics
  sched/fair: Add document for burstable CFS bandwidth control

 Documentation/scheduler/sched-bwc.rst |  49 +++++++++++--
 include/linux/sched/sysctl.h          |   2 +
 kernel/sched/core.c                   | 126 +++++++++++++++++++++++++++++-----
 kernel/sched/fair.c                   |  58 +++++++++++++---
 kernel/sched/sched.h                  |   9 ++-
 kernel/sysctl.c                       |  18 +++++
 6 files changed, 232 insertions(+), 30 deletions(-)

-- 
2.14.4.44.g2045bb6



* [PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2020-12-17  7:46 [PATCH 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
@ 2020-12-17  7:46 ` Huaixin Chang
  2020-12-17 13:36   ` Peter Zijlstra
  2020-12-17  7:46 ` [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 44+ messages in thread
From: Huaixin Chang @ 2020-12-17  7:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: bsegall, changhuaixin, dietmar.eggemann, juri.lelli, mgorman,
	mingo, pauld, peterz, pjt, rostedt, vincent.guittot, khlebnikov,
	xiyou.wangcong, shanpeic

In this patch, we introduce the notion of CFS bandwidth burst. Unused
"quota" from previous "periods" might be accumulated and used in the
following "periods". The maximum amount of accumulated bandwidth is
bounded by "burst". And the maximum amount of CPU a group can consume in
a given period is "buffer", which is equivalent to "quota" + "burst" in
case this group has done enough accumulation.
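
For reference, setting and reading the new knob might look like this (the
cgroup path and value are illustrative; on the unified hierarchy, cpu.max
gains an optional third field for burst, as the hunks below show):

	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
	cat /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us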

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
---
 kernel/sched/core.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++--------
 kernel/sched/fair.c  |  2 ++
 kernel/sched/sched.h |  2 ++
 3 files changed, 82 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e7e453492cff..48d3bad12be2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7895,10 +7895,12 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
-static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
+							u64 burst)
 {
 	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+	u64 buffer;
 
 	if (tg == &root_task_group)
 		return -EINVAL;
@@ -7925,6 +7927,16 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	if (quota != RUNTIME_INF && quota > max_cfs_runtime)
 		return -EINVAL;
 
+	/*
+	 * Bound burst to defend burst against overflow during bandwidth shift.
+	 */
+	if (burst > max_cfs_runtime)
+		return -EINVAL;
+
+	if (quota == RUNTIME_INF)
+		buffer = RUNTIME_INF;
+	else
+		buffer = min(max_cfs_runtime, quota + burst);
 	/*
 	 * Prevent race between setting of cfs_rq->runtime_enabled and
 	 * unthrottle_offline_cfs_rqs().
@@ -7946,6 +7958,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
+	cfs_b->burst = burst;
+	cfs_b->buffer = buffer;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
@@ -7979,9 +7993,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
 static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 {
-	u64 quota, period;
+	u64 quota, period, burst;
 
 	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	burst = tg->cfs_bandwidth.burst;
 	if (cfs_quota_us < 0)
 		quota = RUNTIME_INF;
 	else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
@@ -7989,7 +8004,7 @@ static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 	else
 		return -EINVAL;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_quota(struct task_group *tg)
@@ -8007,15 +8022,16 @@ static long tg_get_cfs_quota(struct task_group *tg)
 
 static int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
 {
-	u64 quota, period;
+	u64 quota, period, burst;
 
 	if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC)
 		return -EINVAL;
 
 	period = (u64)cfs_period_us * NSEC_PER_USEC;
 	quota = tg->cfs_bandwidth.quota;
+	burst = tg->cfs_bandwidth.burst;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_period(struct task_group *tg)
@@ -8028,6 +8044,35 @@ static long tg_get_cfs_period(struct task_group *tg)
 	return cfs_period_us;
 }
 
+static int tg_set_cfs_burst(struct task_group *tg, long cfs_burst_us)
+{
+	u64 quota, period, burst;
+
+	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	quota = tg->cfs_bandwidth.quota;
+	if (cfs_burst_us < 0)
+		burst = RUNTIME_INF;
+	else if ((u64)cfs_burst_us <= U64_MAX / NSEC_PER_USEC)
+		burst = (u64)cfs_burst_us * NSEC_PER_USEC;
+	else
+		return -EINVAL;
+
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
+}
+
+static long tg_get_cfs_burst(struct task_group *tg)
+{
+	u64 burst_us;
+
+	if (tg->cfs_bandwidth.burst == RUNTIME_INF)
+		return -1;
+
+	burst_us = tg->cfs_bandwidth.burst;
+	do_div(burst_us, NSEC_PER_USEC);
+
+	return burst_us;
+}
+
 static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
 				  struct cftype *cft)
 {
@@ -8052,6 +8097,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
 	return tg_set_cfs_period(css_tg(css), cfs_period_us);
 }
 
+static s64 cpu_cfs_burst_read_s64(struct cgroup_subsys_state *css,
+				  struct cftype *cft)
+{
+	return tg_get_cfs_burst(css_tg(css));
+}
+
+static int cpu_cfs_burst_write_s64(struct cgroup_subsys_state *css,
+				   struct cftype *cftype, s64 cfs_burst_us)
+{
+	return tg_set_cfs_burst(css_tg(css), cfs_burst_us);
+}
+
 struct cfs_schedulable_data {
 	struct task_group *tg;
 	u64 period, quota;
@@ -8204,6 +8261,11 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "cfs_burst_us",
+		.read_s64 = cpu_cfs_burst_read_s64,
+		.write_s64 = cpu_cfs_burst_write_s64,
+	},
 	{
 		.name = "stat",
 		.seq_show = cpu_cfs_stat_show,
@@ -8324,26 +8386,27 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
 #endif
 
 static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
-						  long period, long quota)
+					long period, long quota, long burst)
 {
 	if (quota < 0)
 		seq_puts(sf, "max");
 	else
 		seq_printf(sf, "%ld", quota);
 
-	seq_printf(sf, " %ld\n", period);
+	seq_printf(sf, " %ld %ld\n", period, burst);
 }
 
-/* caller should put the current value in *@periodp before calling */
+/* caller should put the current value in *@periodp and *@burstp before calling */
 static int __maybe_unused cpu_period_quota_parse(char *buf,
-						 u64 *periodp, u64 *quotap)
+					 u64 *periodp, u64 *quotap, u64 *burstp)
 {
 	char tok[21];	/* U64_MAX */
 
-	if (sscanf(buf, "%20s %llu", tok, periodp) < 1)
+	if (sscanf(buf, "%20s %llu %llu", tok, periodp, burstp) < 1)
 		return -EINVAL;
 
 	*periodp *= NSEC_PER_USEC;
+	*burstp *= NSEC_PER_USEC;
 
 	if (sscanf(tok, "%llu", quotap))
 		*quotap *= NSEC_PER_USEC;
@@ -8360,7 +8423,8 @@ static int cpu_max_show(struct seq_file *sf, void *v)
 {
 	struct task_group *tg = css_tg(seq_css(sf));
 
-	cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+	cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg),
+			tg_get_cfs_burst(tg));
 	return 0;
 }
 
@@ -8369,12 +8433,13 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
 {
 	struct task_group *tg = css_tg(of_css(of));
 	u64 period = tg_get_cfs_period(tg);
+	u64 burst = tg_get_cfs_burst(tg);
 	u64 quota;
 	int ret;
 
-	ret = cpu_period_quota_parse(buf, &period, &quota);
+	ret = cpu_period_quota_parse(buf, &period, &quota, &burst);
 	if (!ret)
-		ret = tg_set_cfs_bandwidth(tg, period, quota);
+		ret = tg_set_cfs_bandwidth(tg, period, quota, burst);
 	return ret ?: nbytes;
 }
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae7ceba8fd4f..6bb4f89259fd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5242,6 +5242,8 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+	cfs_b->burst = 0;
+	cfs_b->buffer = RUNTIME_INF;
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index df80bfcea92e..a8772eca8cbb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -364,6 +364,8 @@ struct cfs_bandwidth {
 	ktime_t			period;
 	u64			quota;
 	u64			runtime;
+	u64			burst;
+	u64			buffer;
 	s64			hierarchical_quota;
 
 	u8			idle;
-- 
2.14.4.44.g2045bb6



* [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable
  2020-12-17  7:46 [PATCH 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
  2020-12-17  7:46 ` [PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
@ 2020-12-17  7:46 ` Huaixin Chang
  2020-12-18  9:53     ` kernel test robot
  2020-12-17  7:46 ` [PATCH 3/4] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 44+ messages in thread
From: Huaixin Chang @ 2020-12-17  7:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: bsegall, changhuaixin, dietmar.eggemann, juri.lelli, mgorman,
	mingo, pauld, peterz, pjt, rostedt, vincent.guittot, khlebnikov,
	xiyou.wangcong, shanpeic

Accumulate unused quota from previous periods, so that the accumulated
bandwidth runtime can be used in the following periods. During
accumulation, take care of runtime overflow. The previous, non-burstable
CFS bandwidth controller simply assigned quota to runtime, which
sidestepped overflow entirely.
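
Concretely (our reading of the hunks below): the number of elapsed periods
is clamped to max_overrun = DIV_ROUND_UP(max_cfs_runtime, quota) before the
refill quota * overrun is computed, so the product stays near
max_cfs_runtime and the u64 runtime cannot wrap.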

A sysctl parameter, sysctl_sched_cfs_bw_burst_onset_percent, is introduced
to denote what percentage of burst is granted when cfs bandwidth is set. By
default it is 0, which means no burst is allowed unless accumulated.

Also, a parameter sysctl_sched_cfs_bw_burst_enabled is introduced as a
switch for burst. It is enabled by default.
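
For example (values illustrative), to grant half of burst up front when
bandwidth is set, or to switch the feature off globally:

	echo 50 > /proc/sys/kernel/sched_cfs_bw_burst_onset_percent
	echo 0 > /proc/sys/kernel/sched_cfs_bw_burst_enabled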

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
---
 include/linux/sched/sysctl.h |  2 ++
 kernel/sched/core.c          | 31 +++++++++++++++++++++++++----
 kernel/sched/fair.c          | 46 ++++++++++++++++++++++++++++++++++++--------
 kernel/sched/sched.h         |  4 ++--
 kernel/sysctl.c              | 18 +++++++++++++++++
 5 files changed, 87 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 3c31ba88aca5..3400828eaf2d 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -72,6 +72,8 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
 
 #ifdef CONFIG_CFS_BANDWIDTH
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+extern unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
+extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
 #endif
 
 #ifdef CONFIG_SCHED_AUTOGROUP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 48d3bad12be2..71cced83df2f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -66,6 +66,16 @@ const_debug unsigned int sysctl_sched_features =
  */
 const_debug unsigned int sysctl_sched_nr_migrate = 32;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * Percent of burst assigned to cfs_b->runtime on tg_set_cfs_bandwidth,
+ * 0 by default.
+ */
+unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
+
+unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
+#endif
+
 /*
  * period over which we measure -rt task CPU usage in us.
  * default: 1s
@@ -7891,7 +7901,7 @@ static DEFINE_MUTEX(cfs_constraints_mutex);
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
 static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 /* More than 203 days if BW_SHIFT equals 20. */
-static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
+const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
@@ -7900,7 +7910,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 {
 	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
-	u64 buffer;
+	u64 buffer, burst_onset;
 
 	if (tg == &root_task_group)
 		return -EINVAL;
@@ -7961,11 +7971,24 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 	cfs_b->burst = burst;
 	cfs_b->buffer = buffer;
 
-	__refill_cfs_bandwidth_runtime(cfs_b);
+	cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
+	cfs_b->runtime = cfs_b->quota;
+
+	/* burst_onset needed */
+	if (cfs_b->quota != RUNTIME_INF &&
+			sysctl_sched_cfs_bw_burst_enabled &&
+			sysctl_sched_cfs_bw_burst_onset_percent > 0) {
+
+		burst_onset = burst / 100 *
+			sysctl_sched_cfs_bw_burst_onset_percent;
+
+		cfs_b->runtime += burst_onset;
+		cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
+	}
 
 	/* Restart the period timer (if active) to handle new period expiry: */
 	if (runtime_enabled)
-		start_cfs_bandwidth(cfs_b);
+		start_cfs_bandwidth(cfs_b, 1);
 
 	raw_spin_unlock_irq(&cfs_b->lock);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6bb4f89259fd..38a726f77783 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4598,10 +4598,22 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  *
  * requires cfs_b->lock
  */
-void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
+void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
 {
-	if (cfs_b->quota != RUNTIME_INF)
-		cfs_b->runtime = cfs_b->quota;
+	u64 refill;
+
+	if (cfs_b->quota != RUNTIME_INF) {
+
+		if (!sysctl_sched_cfs_bw_burst_enabled) {
+			cfs_b->runtime = cfs_b->quota;
+			return;
+		}
+
+		overrun = min(overrun, cfs_b->max_overrun);
+		refill = cfs_b->quota * overrun;
+		cfs_b->runtime += refill;
+		cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
+	}
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4623,7 +4635,7 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
 	else {
-		start_cfs_bandwidth(cfs_b);
+		start_cfs_bandwidth(cfs_b, 0);
 
 		if (cfs_b->runtime > 0) {
 			amount = min(cfs_b->runtime, min_amount);
@@ -4957,7 +4969,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	if (cfs_b->idle && !throttled)
 		goto out_deactivate;
 
-	__refill_cfs_bandwidth_runtime(cfs_b);
+	__refill_cfs_bandwidth_runtime(cfs_b, overrun);
 
 	if (!throttled) {
 		/* mark as potentially idle for the upcoming period */
@@ -5181,6 +5193,7 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
 }
 
 extern const u64 max_cfs_quota_period;
+extern const u64 max_cfs_runtime;
 
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
@@ -5210,7 +5223,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 			new = old * 2;
 			if (new < max_cfs_quota_period) {
 				cfs_b->period = ns_to_ktime(new);
-				cfs_b->quota *= 2;
+				cfs_b->quota = min(cfs_b->quota * 2,
+						max_cfs_runtime);
+
+				cfs_b->buffer = min(max_cfs_runtime,
+						cfs_b->quota + cfs_b->burst);
+				/* Add 1 in case max_overrun becomes 0. */
+				cfs_b->max_overrun >>= 1;
+				cfs_b->max_overrun++;
 
 				pr_warn_ratelimited(
 	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
@@ -5259,16 +5279,26 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
-void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, int init)
 {
+	u64 overrun;
+
 	lockdep_assert_held(&cfs_b->lock);
 
 	if (cfs_b->period_active)
 		return;
 
 	cfs_b->period_active = 1;
-	hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
+
+	/*
+	 * When the period timer stops, quota for the following period is not
+	 * refilled; however, the period timer is already forwarded. We should
+	 * accumulate quota once more than overrun here.
+	 */
+	if (!init)
+		__refill_cfs_bandwidth_runtime(cfs_b, overrun + 1);
 }
 
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a8772eca8cbb..ff8b5382485d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -366,6 +366,7 @@ struct cfs_bandwidth {
 	u64			runtime;
 	u64			burst;
 	u64			buffer;
+	u64			max_overrun;
 	s64			hierarchical_quota;
 
 	u8			idle;
@@ -476,8 +477,7 @@ extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 			struct sched_entity *parent);
 extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 
-extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
-extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
+extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, int init);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
 
 extern void free_rt_sched_group(struct task_group *tg);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afad085960b8..291dca62a571 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1842,6 +1842,24 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "sched_cfs_bw_burst_onset_percent",
+		.data		= &sysctl_sched_cfs_bw_burst_onset_percent,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &one_hundred,
+	},
+	{
+		.procname	= "sched_cfs_bw_burst_enabled",
+		.data		= &sysctl_sched_cfs_bw_burst_enabled,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 #endif
 #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
 	{
-- 
2.14.4.44.g2045bb6



* [PATCH 3/4] sched/fair: Add cfs bandwidth burst statistics
  2020-12-17  7:46 [PATCH 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
  2020-12-17  7:46 ` [PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
  2020-12-17  7:46 ` [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
@ 2020-12-17  7:46 ` Huaixin Chang
  2020-12-17  7:46 ` [PATCH 4/4] sched/fair: Add document for burstable CFS bandwidth control Huaixin Chang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 44+ messages in thread
From: Huaixin Chang @ 2020-12-17  7:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: bsegall, changhuaixin, dietmar.eggemann, juri.lelli, mgorman,
	mingo, pauld, peterz, pjt, rostedt, vincent.guittot, khlebnikov,
	xiyou.wangcong, shanpeic

Introduce statistics exports for the burstable cfs bandwidth
controller.

The following exports are included:

current_bw: current runtime in the global pool
nr_burst:   number of periods in which a bandwidth burst occurred
burst_time: cumulative wall-time that any CPUs have
	    used above quota in their respective periods
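
Reading them might look like this (group name and all values are
illustrative):

	# cat /sys/fs/cgroup/cpu/test/cpu.stat
	nr_periods 1500
	nr_throttled 12
	throttled_time 93000000
	current_bw 700000000
	nr_burst 3
	burst_time 48000000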

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
---
 kernel/sched/core.c  |  6 ++++++
 kernel/sched/fair.c  | 12 +++++++++++-
 kernel/sched/sched.h |  3 +++
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 71cced83df2f..d80a37819d7a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7986,6 +7986,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 		cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
 	}
 
+	cfs_b->previous_runtime = cfs_b->runtime;
+
 	/* Restart the period timer (if active) to handle new period expiry: */
 	if (runtime_enabled)
 		start_cfs_bandwidth(cfs_b, 1);
@@ -8234,6 +8236,10 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
 		seq_printf(sf, "wait_sum %llu\n", ws);
 	}
 
+	seq_printf(sf, "current_bw %llu\n", cfs_b->runtime);
+	seq_printf(sf, "nr_burst %d\n", cfs_b->nr_burst);
+	seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
+
 	return 0;
 }
 #endif /* CONFIG_CFS_BANDWIDTH */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 38a726f77783..e431b2fff01d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4600,7 +4600,7 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  */
 void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
 {
-	u64 refill;
+	u64 refill, runtime;
 
 	if (cfs_b->quota != RUNTIME_INF) {
 
@@ -4609,10 +4609,20 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
 			return;
 		}
 
+		if (cfs_b->previous_runtime > cfs_b->runtime) {
+			runtime = cfs_b->previous_runtime - cfs_b->runtime;
+			if (runtime > cfs_b->quota) {
+				cfs_b->burst_time += runtime - cfs_b->quota;
+				cfs_b->nr_burst++;
+			}
+		}
+
 		overrun = min(overrun, cfs_b->max_overrun);
 		refill = cfs_b->quota * overrun;
 		cfs_b->runtime += refill;
 		cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
+
+		cfs_b->previous_runtime = cfs_b->runtime;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ff8b5382485d..4adb984e5197 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -367,6 +367,7 @@ struct cfs_bandwidth {
 	u64			burst;
 	u64			buffer;
 	u64			max_overrun;
+	u64			previous_runtime;
 	s64			hierarchical_quota;
 
 	u8			idle;
@@ -379,7 +380,9 @@ struct cfs_bandwidth {
 	/* Statistics: */
 	int			nr_periods;
 	int			nr_throttled;
+	int			nr_burst;
 	u64			throttled_time;
+	u64			burst_time;
 #endif
 };
 
-- 
2.14.4.44.g2045bb6



* [PATCH 4/4] sched/fair: Add document for burstable CFS bandwidth control
  2020-12-17  7:46 [PATCH 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
                   ` (2 preceding siblings ...)
  2020-12-17  7:46 ` [PATCH 3/4] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
@ 2020-12-17  7:46 ` Huaixin Chang
  2020-12-17 21:25 ` [PATCH 0/4] sched/fair: Burstable CFS bandwidth controller Benjamin Segall
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 44+ messages in thread
From: Huaixin Chang @ 2020-12-17  7:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: bsegall, changhuaixin, dietmar.eggemann, juri.lelli, mgorman,
	mingo, pauld, peterz, pjt, rostedt, vincent.guittot, khlebnikov,
	xiyou.wangcong, shanpeic

Basic description of usage and effect for CFS Bandwidth Control Burst.

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
---
 Documentation/scheduler/sched-bwc.rst | 49 ++++++++++++++++++++++++++++++++---
 1 file changed, 45 insertions(+), 4 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
index 9801d6b284b1..67c5de641149 100644
--- a/Documentation/scheduler/sched-bwc.rst
+++ b/Documentation/scheduler/sched-bwc.rst
@@ -16,6 +16,11 @@ assigned any additional requests for quota will result in those threads being
 throttled. Throttled threads will not be able to run again until the next
 period when the quota is replenished.
 
+With "burst", unused quota from previous periods might be accumulated and
+assigned when a task group requests more CPU than quota during a specific
+period, thus allowing CPU time requests as long as the average request is below
+quota in the long run. The maximum accumulation is capped by burst.
+
 A group's unassigned quota is globally tracked, being refreshed back to
 cfs_quota units at each period boundary. As threads consume this bandwidth it
 is transferred to cpu-local "silos" on a demand basis. The amount transferred
@@ -23,16 +28,18 @@ within each of these updates is tunable and described as the "slice".
 
 Management
 ----------
-Quota and period are managed within the cpu subsystem via cgroupfs.
+Quota, period and burst are managed within the cpu subsystem via cgroupfs.
 
-cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
+cpu.cfs_quota_us: run-time replenished within a period (in microseconds)
 cpu.cfs_period_us: the length of a period (in microseconds)
+cpu.cfs_burst_us: the maximum accumulated run-time (in microseconds)
 cpu.stat: exports throttling statistics [explained further below]
 
 The default values are::
 
 	cpu.cfs_period_us=100ms
-	cpu.cfs_quota=-1
+	cpu.cfs_quota_us=-1
+	cpu.cfs_burst_us=0
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
@@ -48,6 +55,11 @@ more detail below.
 Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
 and return the group to an unconstrained state once more.
 
+A value of 0 for cpu.cfs_burst_us indicates that the group cannot accumulate
+any unused bandwidth, leaving the traditional CFS bandwidth control behavior
+unchanged. Writing any valid positive value into cpu.cfs_burst_us sets the
+cap on unused bandwidth accumulation.
+
 Any updates to a group's bandwidth specification will result in it becoming
 unthrottled if it is in a constrained state.
 
@@ -65,9 +77,21 @@ This is tunable via procfs::
 Larger slice values will reduce transfer overheads, while smaller values allow
 for more fine-grained consumption.
 
+There is also a global switch to turn off burst for all groups::
+       /proc/sys/kernel/sched_cfs_bw_burst_enabled (default=1)
+
+By default it is enabled. Writing 0 means that no accumulated CPU time can
+be used by any group, even if cpu.cfs_burst_us is configured.
+
+Sometimes users might want a group to burst without accumulation. This is
+tunable via::
+       /proc/sys/kernel/sched_cfs_bw_burst_onset_percent (default=0)
+
+Up to 100% of cpu.cfs_burst_us of runtime may be granted when bandwidth is set.
+
 Statistics
 ----------
-A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+A group's bandwidth statistics are exported via 6 fields in cpu.stat.
 
 cpu.stat:
 
@@ -75,6 +99,11 @@ cpu.stat:
 - nr_throttled: Number of times the group has been throttled/limited.
 - throttled_time: The total time duration (in nanoseconds) for which entities
   of the group have been throttled.
+- current_bw: Current runtime in the global pool.
+- nr_burst: Number of periods in which a burst occurred.
+- burst_time: Cumulative wall-time that any CPUs have used above quota in
+  their respective periods.
+
 
 This interface is read-only.
 
@@ -172,3 +201,15 @@ Examples
 
    By using a small period here we are ensuring a consistent latency
    response at the expense of burst capacity.
+
+4. Limit a group to 20% of 1 CPU, and additionally allow it to accumulate
+   up to 60% of 1 CPU, in case accumulation has been done.
+
+   With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
+   And 30ms burst will be equivalent to 60% of 1 CPU.
+
+	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
+	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
+	# echo 30000 > cpu.cfs_burst_us /* burst = 30ms */
+
+   A larger buffer setting allows greater burst capacity.
-- 
2.14.4.44.g2045bb6



* Re: [PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2020-12-17  7:46 ` [PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
@ 2020-12-17 13:36   ` Peter Zijlstra
  2020-12-21 13:53     ` changhuaixin
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2020-12-17 13:36 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: linux-kernel, bsegall, dietmar.eggemann, juri.lelli, mgorman,
	mingo, pauld, pjt, rostedt, vincent.guittot, khlebnikov,
	xiyou.wangcong, shanpeic

On Thu, Dec 17, 2020 at 03:46:17PM +0800, Huaixin Chang wrote:
> In this patch, we introduce the notion of CFS bandwidth burst. Unused
> "quota" from previous "periods" might be accumulated and used in the
> following "periods". The maximum amount of accumulated bandwidth is
> bounded by "burst". And the maximum amount of CPU a group can consume in
> a given period is "buffer", which is equivalent to "quota" + "burst" in
> case this group has done enough accumulation.

Oh man, Juri, wasn't there a paper about statistical bandwidth
accounting somewhere? Where, if you replace every utilization by a
statistical variable, the end result is still useful?

That is, instead of something like; \Sum u_i <= 1, you get something
like: \Sum {avg(u),var(u)}_i <= {1, sqrt(\Sum var_i^2)} and you can
still prove bounded tardiness etc. (assuming a Gaussian distribution).

The proposed seems close to that, but not quite, and I'm afraid it's not
quite strong enough to still provide any guarantees.
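
One plausible reading of that bound (our reconstruction, not from the
thread): treat each group's utilization as an independent Gaussian random
variable and replace the deterministic admission test with something like

	\Sum avg(u)_i + k * sqrt(\Sum var(u)_i) <= 1

where k sets the confidence level; var(u)_i = 0 for all i recovers the
deterministic test \Sum u_i <= 1.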


* Re: [PATCH 0/4] sched/fair: Burstable CFS bandwidth controller
  2020-12-17  7:46 [PATCH 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
                   ` (3 preceding siblings ...)
  2020-12-17  7:46 ` [PATCH 4/4] sched/fair: Add document for burstable CFS bandwidth control Huaixin Chang
@ 2020-12-17 21:25 ` Benjamin Segall
  2021-01-20 12:27 ` [PATCH v2 " Huaixin Chang
  2021-01-21 11:04 ` [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
  6 siblings, 0 replies; 44+ messages in thread
From: Benjamin Segall @ 2020-12-17 21:25 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: linux-kernel, dietmar.eggemann, juri.lelli, mgorman, mingo,
	pauld, peterz, pjt, rostedt, vincent.guittot, khlebnikov,
	xiyou.wangcong, shanpeic

The code for this looks fine, and the feature is something people do
seem to ask for occasionally. I agree with peterz that using this
generally means you lose any guarantees (which are already imperfect
given CFS), but I suspect that cfsb is being used in overload anyways.

The docs could use a grammar/wording pass maybe, but that's easy enough.


* Re: [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable
  2020-12-17  7:46 ` [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
@ 2020-12-18  9:53     ` kernel test robot
  0 siblings, 0 replies; 44+ messages in thread
From: kernel test robot @ 2020-12-18  9:53 UTC (permalink / raw)
  To: Huaixin Chang, linux-kernel
  Cc: kbuild-all, bsegall, changhuaixin, dietmar.eggemann, juri.lelli,
	mgorman, mingo, pauld, peterz, pjt

[-- Attachment #1: Type: text/plain, Size: 5279 bytes --]

Hi Huaixin,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on tip/master linus/master v5.10 next-20201217]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Huaixin-Chang/sched-fair-Burstable-CFS-bandwidth-controller/20201217-155214
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 5b78f2dc315354c05300795064f587366a02c6ff
config: i386-randconfig-r013-20200812 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/d2f5cde464c872307ee33f78ba167a6f3334697c
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Huaixin-Chang/sched-fair-Burstable-CFS-bandwidth-controller/20201217-155214
        git checkout d2f5cde464c872307ee33f78ba167a6f3334697c
        # save the attached .config to linux build tree
        make W=1 ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: kernel/sched/core.o: in function `tg_set_cfs_bandwidth':
>> kernel/sched/core.c:8596: undefined reference to `__udivdi3'


vim +8596 kernel/sched/core.c

  8521	
  8522	static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
  8523								u64 burst)
  8524	{
  8525		int i, ret = 0, runtime_enabled, runtime_was_enabled;
  8526		struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
  8527		u64 buffer, burst_onset;
  8528	
  8529		if (tg == &root_task_group)
  8530			return -EINVAL;
  8531	
  8532		/*
  8533		 * Ensure we have at some amount of bandwidth every period.  This is
  8534		 * to prevent reaching a state of large arrears when throttled via
  8535		 * entity_tick() resulting in prolonged exit starvation.
  8536		 */
  8537		if (quota < min_cfs_quota_period || period < min_cfs_quota_period)
  8538			return -EINVAL;
  8539	
  8540		/*
  8541		 * Likewise, bound things on the otherside by preventing insane quota
  8542		 * periods.  This also allows us to normalize in computing quota
  8543		 * feasibility.
  8544		 */
  8545		if (period > max_cfs_quota_period)
  8546			return -EINVAL;
  8547	
  8548		/*
  8549		 * Bound quota to defend quota against overflow during bandwidth shift.
  8550		 */
  8551		if (quota != RUNTIME_INF && quota > max_cfs_runtime)
  8552			return -EINVAL;
  8553	
  8554		/*
  8555		 * Bound burst to defend burst against overflow during bandwidth shift.
  8556		 */
  8557		if (burst > max_cfs_runtime)
  8558			return -EINVAL;
  8559	
  8560		if (quota == RUNTIME_INF)
  8561			buffer = RUNTIME_INF;
  8562		else
  8563			buffer = min(max_cfs_runtime, quota + burst);
  8564		/*
  8565		 * Prevent race between setting of cfs_rq->runtime_enabled and
  8566		 * unthrottle_offline_cfs_rqs().
  8567		 */
  8568		get_online_cpus();
  8569		mutex_lock(&cfs_constraints_mutex);
  8570		ret = __cfs_schedulable(tg, period, quota);
  8571		if (ret)
  8572			goto out_unlock;
  8573	
  8574		runtime_enabled = quota != RUNTIME_INF;
  8575		runtime_was_enabled = cfs_b->quota != RUNTIME_INF;
  8576		/*
  8577		 * If we need to toggle cfs_bandwidth_used, off->on must occur
  8578		 * before making related changes, and on->off must occur afterwards
  8579		 */
  8580		if (runtime_enabled && !runtime_was_enabled)
  8581			cfs_bandwidth_usage_inc();
  8582		raw_spin_lock_irq(&cfs_b->lock);
  8583		cfs_b->period = ns_to_ktime(period);
  8584		cfs_b->quota = quota;
  8585		cfs_b->burst = burst;
  8586		cfs_b->buffer = buffer;
  8587	
  8588		cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
  8589		cfs_b->runtime = cfs_b->quota;
  8590	
  8591		/* burst_onset needed */
  8592		if (cfs_b->quota != RUNTIME_INF &&
  8593				sysctl_sched_cfs_bw_burst_enabled &&
  8594				sysctl_sched_cfs_bw_burst_onset_percent > 0) {
  8595	
> 8596			burst_onset = burst / 100 *
  8597				sysctl_sched_cfs_bw_burst_onset_percent;
  8598	
  8599			cfs_b->runtime += burst_onset;
  8600			cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
  8601		}
  8602	
  8603		/* Restart the period timer (if active) to handle new period expiry: */
  8604		if (runtime_enabled)
  8605			start_cfs_bandwidth(cfs_b, 1);
  8606	
  8607		raw_spin_unlock_irq(&cfs_b->lock);
  8608	
  8609		for_each_online_cpu(i) {
  8610			struct cfs_rq *cfs_rq = tg->cfs_rq[i];
  8611			struct rq *rq = cfs_rq->rq;
  8612			struct rq_flags rf;
  8613	
  8614			rq_lock_irq(rq, &rf);
  8615			cfs_rq->runtime_enabled = runtime_enabled;
  8616			cfs_rq->runtime_remaining = 0;
  8617	
  8618			if (cfs_rq->throttled)
  8619				unthrottle_cfs_rq(cfs_rq);
  8620			rq_unlock_irq(rq, &rf);
  8621		}
  8622		if (runtime_was_enabled && !runtime_enabled)
  8623			cfs_bandwidth_usage_dec();
  8624	out_unlock:
  8625		mutex_unlock(&cfs_constraints_mutex);
  8626		put_online_cpus();
  8627	
  8628		return ret;
  8629	}
  8630	
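
The undefined reference is the 64-bit division `burst / 100` at line 8596
above: on 32-bit targets gcc lowers a u64 division to libgcc's __udivdi3,
which the kernel does not provide. A likely fix (our sketch, not taken from
this thread) is the div_u64() helper from <linux/math64.h>:

	burst_onset = div_u64(burst, 100) *
		sysctl_sched_cfs_bw_burst_onset_percent;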

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 34741 bytes --]



* Re: [PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2020-12-17 13:36   ` Peter Zijlstra
@ 2020-12-21 13:53     ` changhuaixin
  2021-01-12  9:21       ` changhuaixin
  0 siblings, 1 reply; 44+ messages in thread
From: changhuaixin @ 2020-12-21 13:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: changhuaixin, linux-kernel, bsegall, dietmar.eggemann,
	juri.lelli, mgorman, mingo, pauld, pjt, rostedt, vincent.guittot,
	khlebnikov, xiyou.wangcong, shanpeic



> On Dec 17, 2020, at 9:36 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Dec 17, 2020 at 03:46:17PM +0800, Huaixin Chang wrote:
>> In this patch, we introduce the notion of CFS bandwidth burst. Unused
>> "quota" from previous "periods" might be accumulated and used in the
>> following "periods". The maximum amount of accumulated bandwidth is
>> bounded by "burst". And the maximum amount of CPU a group can consume in
>> a given period is "buffer", which is equivalent to "quota" + "burst" in
>> case this group has done enough accumulation.
> 
> Oh man, Juri, wasn't there a paper about statistical bandwidth
> accounting somewhere? Where, if you replace every utilization by a
> statistical variable, the end result is still useful?
> 
> That is, instead of something like; \Sum u_i <= 1, you get something
> like: \Sum {avg(u),var(u)}_i <= {1, sqrt(\Sum var_i^2)} and you can
> still prove bounded tardiness etc. (assuming a Gaussian distribution).
> 
> The proposed seems close to that, but not quite, and I'm afraid it's not
> quite strong enough to still provide any guarantees.

After reading some papers on statistical bandwidth sharing, it occurs to me that statistical bandwidth sharing is about the way bandwidth is shared between competitors. I wonder if the paper you mentioned was "Insensitivity results in statistical bandwidth sharing" or some paper it references, which showed that the end result is insensitive to the distribution or the arrival pattern.

I am sorry that I cannot give a proof using statistical bandwidth sharing theory now. However, I wonder if it is more acceptable to put rate-based cfsb after share fairness, because the input stream that cfsb accounts is the output stream of fairness. That is in contrast with what statistical bandwidth sharing does, where fairness is used to share bandwidth on the output stream of rate-based control. In this way, guarantees might be provided by share fairness, and cfsb may use a larger buffer to tolerate more jitter.

A buffer several times the quota is able to handle large bursts while still throttling threads quickly when overloaded. The present cfsb, however, has to be configured several times larger to handle jitter, which is ineffective and does not provide the designed rate-based control.


* Re: [PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2020-12-21 13:53     ` changhuaixin
@ 2021-01-12  9:21       ` changhuaixin
  0 siblings, 0 replies; 44+ messages in thread
From: changhuaixin @ 2021-01-12  9:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: changhuaixin, linux-kernel, bsegall, dietmar.eggemann,
	juri.lelli, mgorman, mingo, pauld, pjt, rostedt, vincent.guittot,
	khlebnikov, xiyou.wangcong, shanpeic



> On Dec 21, 2020, at 9:53 PM, changhuaixin <changhuaixin@linux.alibaba.com> wrote:
> 
> 
> 
>> On Dec 17, 2020, at 9:36 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>> On Thu, Dec 17, 2020 at 03:46:17PM +0800, Huaixin Chang wrote:
>>> In this patch, we introduce the notion of CFS bandwidth burst. Unused
>>> "quota" from previous "periods" might be accumulated and used in the
>>> following "periods". The maximum amount of accumulated bandwidth is
>>> bounded by "burst". And the maximum amount of CPU a group can consume in
>>> a given period is "buffer", which is equivalent to "quota" + "burst" in
>>> case this group has done enough accumulation.
>> 
>> Oh man, Juri, wasn't there a paper about statistical bandwidth
>> accounting somewhere? Where, if you replace every utilization by a
>> statistical variable, the end result is still useful?
>> 
>> That is, instead of something like; \Sum u_i <= 1, you get something
>> like: \Sum {avg(u),var(u)}_i <= {1, sqrt(\Sum var_i^2)} and you can
>> still prove bounded tardiness etc. (assuming a Gaussian distribution).
>> 
>> The proposed seems close to that, but not quite, and I'm afraid it's not
>> quite strong enough to still provide any guarantees.
> 
> After reading some papers on statistical bandwidth sharing, it occurs to me that statistical bandwidth sharing is about the way bandwidth is shared between competitors. I wonder if the paper you mentioned was "Insensitivity results in statistical bandwidth sharing" or some paper referenced, which showed the end result is insensitive to maybe the distribution or the arrival pattern.
> 
> I am sorry that I cannot prove using statistical bandwidth sharing theory now. However, I wonder if it is more acceptable to put rate-based cfsb after share fairness, because the input stream of cfsb account is the output stream of fairness. And that is in contrast with what statistical bandwidth sharing does, in which fairness is used to share bandwidth upon the output stream of rate-based control. In this way, maybe guarantees can be provided by share fairness, and cfsb may use a larger buffer to allow more jitters.
> 
> A buffer, which is several times of quota, is able to handle large bursts, and throttle threads soon when overloaded. The present cfsb, however, has to be configured several times larger to handle jitters, which is ineffective and does not provide the designed rate-based control.

Hi, Peter.

Will you please have a look at my earlier reply?

The guarantee provided by cfsb is max-min fairness with a strict upper bound, I think, and after this modification it is still roughly that. Over longer horizons, each cfsb group is still limited to its quota on average.

When users configure a huge burst buffer, cfsb doesn't really throttle threads and the upper bound is exceeded. In that case, guarantees are provided by shares, which give proportional fairness.
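
To make "limited to quota on average" concrete (our arithmetic, using the
buffer cap from patch 1): over N periods a group receives N * quota of
refill plus at most burst of banked runtime, so

	runtime(N) <= N * quota + burst
	avg(N) <= quota + burst / N  ->  quota as N grows

i.e. burst only shifts when CPU time may be consumed, not how much is
available in the long run.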

Thanks,
Huaixin


* [PATCH v2 0/4] sched/fair: Burstable CFS bandwidth controller
  2020-12-17  7:46 [PATCH 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
                   ` (4 preceding siblings ...)
  2020-12-17 21:25 ` [PATCH 0/4] sched/fair: Burstable CFS bandwidth controller Benjamin Segall
@ 2021-01-20 12:27 ` Huaixin Chang
  2021-01-20 12:27   ` [PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
                     ` (3 more replies)
  2021-01-21 11:04 ` [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
  6 siblings, 4 replies; 44+ messages in thread
From: Huaixin Chang @ 2021-01-20 12:27 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, peterz, pjt, rostedt, shanpeic,
	vincent.guittot, xiyou.wangcong

Changelog v2:
1. Fix an issue reported by test robot.
2. Rewriting docs. Appreciate any further suggestions or help.

The CFS bandwidth controller limits the CPU requests of a task group to
quota during each period. However, parallel workloads are often bursty,
so they get throttled; and since they are latency sensitive at the same
time, throttling them is undesirable.

Scaling up period and quota allows greater burst capacity, but risks a
longer stall until the next refill. We introduce "burst" to allow
accumulating unused quota from previous periods, to be assigned when a
task group requests more CPU than quota during a specific period. This
allows CPU time requests as long as the average requested CPU time stays
below quota in the long run. The maximum accumulation is capped by burst
and is set to 0 by default, so the traditional behaviour remains.

A huge drop in 99th-percentile tail latency, from more than 500ms to
27ms, is seen for real Java workloads when using burst. Similar drops
are seen when testing with schbench too:

	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
	echo 700000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
	echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

	# The average CPU usage is around 500%, which is 200ms CPU time
	# every 40ms.
	./schbench -m 1 -t 30 -r 60 -c 10000 -R 500

	Without burst:

	Latency percentiles (usec)
	50.0000th: 7
	75.0000th: 8
	90.0000th: 9
	95.0000th: 10
	*99.0000th: 933
	99.5000th: 981
	99.9000th: 3068
	min=0, max=20054
	rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 9.33%

	With burst:

	Latency percentiles (usec)
	50.0000th: 7
	75.0000th: 8
	90.0000th: 9
	95.0000th: 9
	*99.0000th: 12
	99.5000th: 13
	99.9000th: 19
	min=0, max=406
	rps: 498.36 p95 (usec) 9 p99 (usec) 12 p95/cputime 0.09% p99/cputime 0.12%

How much a workload benefits from burstable CFS bandwidth control
depends on how bursty and how latency sensitive it is.

Previously, Cong Wang and Konstantin Khlebnikov proposed similar
features:
https://lore.kernel.org/lkml/20180522062017.5193-1-xiyou.wangcong@gmail.com/
https://lore.kernel.org/lkml/157476581065.5793.4518979877345136813.stgit@buzz/

This time we present more latency statistics and handle overflow while
accumulating.

Huaixin Chang (4):
  sched/fair: Introduce primitives for CFS bandwidth burst
  sched/fair: Make CFS bandwidth controller burstable
  sched/fair: Add cfs bandwidth burst statistics
  sched/fair: Add document for burstable CFS bandwidth control

 Documentation/scheduler/sched-bwc.rst |  49 +++++++++++--
 include/linux/sched/sysctl.h          |   2 +
 kernel/sched/core.c                   | 126 +++++++++++++++++++++++++++++-----
 kernel/sched/fair.c                   |  58 +++++++++++++---
 kernel/sched/sched.h                  |   9 ++-
 kernel/sysctl.c                       |  18 +++++
 6 files changed, 232 insertions(+), 30 deletions(-)

-- 
2.14.4.44.g2045bb6



* [PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-01-20 12:27 ` [PATCH v2 " Huaixin Chang
@ 2021-01-20 12:27   ` Huaixin Chang
  2021-01-20 12:27   ` [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 44+ messages in thread
From: Huaixin Chang @ 2021-01-20 12:27 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, peterz, pjt, rostedt, shanpeic,
	vincent.guittot, xiyou.wangcong

In this patch, we introduce the notion of CFS bandwidth burst. Unused
"quota" from previous "periods" might be accumulated and used in the
following "periods". The maximum amount of accumulated bandwidth is
bounded by "burst". And the maximum amount of CPU a group can consume in
a given period is "buffer", which is equivalent to "quota" + "burst" in
case this group has done enough accumulation.

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
---
 kernel/sched/core.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++--------
 kernel/sched/fair.c  |  2 ++
 kernel/sched/sched.h |  2 ++
 3 files changed, 82 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e7e453492cff..48d3bad12be2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7895,10 +7895,12 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
-static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
+							u64 burst)
 {
 	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+	u64 buffer;
 
 	if (tg == &root_task_group)
 		return -EINVAL;
@@ -7925,6 +7927,16 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	if (quota != RUNTIME_INF && quota > max_cfs_runtime)
 		return -EINVAL;
 
+	/*
+	 * Bound burst to defend burst against overflow during bandwidth shift.
+	 */
+	if (burst > max_cfs_runtime)
+		return -EINVAL;
+
+	if (quota == RUNTIME_INF)
+		buffer = RUNTIME_INF;
+	else
+		buffer = min(max_cfs_runtime, quota + burst);
 	/*
 	 * Prevent race between setting of cfs_rq->runtime_enabled and
 	 * unthrottle_offline_cfs_rqs().
@@ -7946,6 +7958,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
+	cfs_b->burst = burst;
+	cfs_b->buffer = buffer;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
@@ -7979,9 +7993,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
 static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 {
-	u64 quota, period;
+	u64 quota, period, burst;
 
 	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	burst = tg->cfs_bandwidth.burst;
 	if (cfs_quota_us < 0)
 		quota = RUNTIME_INF;
 	else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
@@ -7989,7 +8004,7 @@ static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 	else
 		return -EINVAL;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_quota(struct task_group *tg)
@@ -8007,15 +8022,16 @@ static long tg_get_cfs_quota(struct task_group *tg)
 
 static int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
 {
-	u64 quota, period;
+	u64 quota, period, burst;
 
 	if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC)
 		return -EINVAL;
 
 	period = (u64)cfs_period_us * NSEC_PER_USEC;
 	quota = tg->cfs_bandwidth.quota;
+	burst = tg->cfs_bandwidth.burst;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_period(struct task_group *tg)
@@ -8028,6 +8044,35 @@ static long tg_get_cfs_period(struct task_group *tg)
 	return cfs_period_us;
 }
 
+static int tg_set_cfs_burst(struct task_group *tg, long cfs_burst_us)
+{
+	u64 quota, period, burst;
+
+	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	quota = tg->cfs_bandwidth.quota;
+	if (cfs_burst_us < 0)
+		burst = RUNTIME_INF;
+	else if ((u64)cfs_burst_us <= U64_MAX / NSEC_PER_USEC)
+		burst = (u64)cfs_burst_us * NSEC_PER_USEC;
+	else
+		return -EINVAL;
+
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
+}
+
+static long tg_get_cfs_burst(struct task_group *tg)
+{
+	u64 burst_us;
+
+	if (tg->cfs_bandwidth.burst == RUNTIME_INF)
+		return -1;
+
+	burst_us = tg->cfs_bandwidth.burst;
+	do_div(burst_us, NSEC_PER_USEC);
+
+	return burst_us;
+}
+
 static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
 				  struct cftype *cft)
 {
@@ -8052,6 +8097,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
 	return tg_set_cfs_period(css_tg(css), cfs_period_us);
 }
 
+static s64 cpu_cfs_burst_read_s64(struct cgroup_subsys_state *css,
+				  struct cftype *cft)
+{
+	return tg_get_cfs_burst(css_tg(css));
+}
+
+static int cpu_cfs_burst_write_s64(struct cgroup_subsys_state *css,
+				   struct cftype *cftype, s64 cfs_burst_us)
+{
+	return tg_set_cfs_burst(css_tg(css), cfs_burst_us);
+}
+
 struct cfs_schedulable_data {
 	struct task_group *tg;
 	u64 period, quota;
@@ -8204,6 +8261,11 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "cfs_burst_us",
+		.read_s64 = cpu_cfs_burst_read_s64,
+		.write_s64 = cpu_cfs_burst_write_s64,
+	},
 	{
 		.name = "stat",
 		.seq_show = cpu_cfs_stat_show,
@@ -8324,26 +8386,27 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
 #endif
 
 static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
-						  long period, long quota)
+					long period, long quota, long burst)
 {
 	if (quota < 0)
 		seq_puts(sf, "max");
 	else
 		seq_printf(sf, "%ld", quota);
 
-	seq_printf(sf, " %ld\n", period);
+	seq_printf(sf, " %ld %ld\n", period, burst);
 }
 
-/* caller should put the current value in *@periodp before calling */
+/* caller should put the current value in *@periodp and *@burstp before calling */
 static int __maybe_unused cpu_period_quota_parse(char *buf,
-						 u64 *periodp, u64 *quotap)
+					 u64 *periodp, u64 *quotap, u64 *burstp)
 {
 	char tok[21];	/* U64_MAX */
 
-	if (sscanf(buf, "%20s %llu", tok, periodp) < 1)
+	if (sscanf(buf, "%20s %llu %llu", tok, periodp, burstp) < 1)
 		return -EINVAL;
 
 	*periodp *= NSEC_PER_USEC;
+	*burstp *= NSEC_PER_USEC;
 
 	if (sscanf(tok, "%llu", quotap))
 		*quotap *= NSEC_PER_USEC;
@@ -8360,7 +8423,8 @@ static int cpu_max_show(struct seq_file *sf, void *v)
 {
 	struct task_group *tg = css_tg(seq_css(sf));
 
-	cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+	cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg),
+			tg_get_cfs_burst(tg));
 	return 0;
 }
 
@@ -8369,12 +8433,13 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
 {
 	struct task_group *tg = css_tg(of_css(of));
 	u64 period = tg_get_cfs_period(tg);
+	u64 burst = tg_get_cfs_burst(tg);
 	u64 quota;
 	int ret;
 
-	ret = cpu_period_quota_parse(buf, &period, &quota);
+	ret = cpu_period_quota_parse(buf, &period, &quota, &burst);
 	if (!ret)
-		ret = tg_set_cfs_bandwidth(tg, period, quota);
+		ret = tg_set_cfs_bandwidth(tg, period, quota, burst);
 	return ret ?: nbytes;
 }
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae7ceba8fd4f..6bb4f89259fd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5242,6 +5242,8 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+	cfs_b->burst = 0;
+	cfs_b->buffer = RUNTIME_INF;
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index df80bfcea92e..a8772eca8cbb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -364,6 +364,8 @@ struct cfs_bandwidth {
 	ktime_t			period;
 	u64			quota;
 	u64			runtime;
+	u64			burst;
+	u64			buffer;
 	s64			hierarchical_quota;
 
 	u8			idle;
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable
  2021-01-20 12:27 ` [PATCH v2 " Huaixin Chang
  2021-01-20 12:27   ` [PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
@ 2021-01-20 12:27   ` Huaixin Chang
  2021-01-20 17:06       ` kernel test robot
                       ` (3 more replies)
  2021-01-20 12:27   ` [PATCH 3/4] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
  2021-01-20 12:27   ` [PATCH 4/4] sched/fair: Add document for burstable CFS bandwidth control Huaixin Chang
  3 siblings, 4 replies; 44+ messages in thread
From: Huaixin Chang @ 2021-01-20 12:27 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, peterz, pjt, rostedt, shanpeic,
	vincent.guittot, xiyou.wangcong

Accumulate unused quota from previous periods, so that accumulated
bandwidth runtime can be used in the following periods. During
accumulation, take care of runtime overflow. The previous, non-burstable
CFS bandwidth controller only assigned quota to runtime, which avoided
all of this bookkeeping.

A sysctl parameter sysctl_sched_cfs_bw_burst_onset_percent is introduced
to denote what percentage of burst is granted when cfs bandwidth is set. By
default it is 0, which means no burst is allowed unless accumulated.

Also, parameter sysctl_sched_cfs_bw_burst_enabled is introduced as a
switch for burst. It is enabled by default.
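
A minimal user-space sketch of the overflow-safe refill (ours, with
assumed illustrative values; the kernel version operates on cfs_b under
cfs_b->lock): overrun is clamped so that quota * overrun cannot
overflow, and the result is capped by buffer = quota + burst.

	#include <stdio.h>

	int main(void)
	{
	        unsigned long long quota = 100, burst = 400;
	        unsigned long long buffer = quota + burst;
	        /* Stand-in for the kernel's max_cfs_runtime. */
	        unsigned long long max_runtime = 1ULL << 40;
	        unsigned long long max_overrun =
	                        (max_runtime + quota - 1) / quota;
	        unsigned long long runtime = 50; /* leftover in the pool */
	        unsigned long long overrun = 3;  /* periods since last refill */

	        /* Clamp overrun so quota * overrun cannot overflow u64. */
	        if (overrun > max_overrun)
	                overrun = max_overrun;
	        runtime += quota * overrun;
	        if (runtime > buffer)
	                runtime = buffer;

	        printf("runtime after refill: %llu\n", runtime); /* 350 */
	        return 0;
	}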

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 include/linux/sched/sysctl.h |  2 ++
 kernel/sched/core.c          | 31 +++++++++++++++++++++++++----
 kernel/sched/fair.c          | 46 ++++++++++++++++++++++++++++++++++++--------
 kernel/sched/sched.h         |  4 ++--
 kernel/sysctl.c              | 18 +++++++++++++++++
 5 files changed, 87 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 3c31ba88aca5..3400828eaf2d 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -72,6 +72,8 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
 
 #ifdef CONFIG_CFS_BANDWIDTH
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+extern unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
+extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
 #endif
 
 #ifdef CONFIG_SCHED_AUTOGROUP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 48d3bad12be2..fecf0f05ef0c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -66,6 +66,16 @@ const_debug unsigned int sysctl_sched_features =
  */
 const_debug unsigned int sysctl_sched_nr_migrate = 32;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * Percent of burst assigned to cfs_b->runtime on tg_set_cfs_bandwidth,
+ * 0 by default.
+ */
+unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
+
+unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
+#endif
+
 /*
  * period over which we measure -rt task CPU usage in us.
  * default: 1s
@@ -7891,7 +7901,7 @@ static DEFINE_MUTEX(cfs_constraints_mutex);
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
 static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 /* More than 203 days if BW_SHIFT equals 20. */
-static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
+const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
@@ -7900,7 +7910,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 {
 	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
-	u64 buffer;
+	u64 buffer, burst_onset;
 
 	if (tg == &root_task_group)
 		return -EINVAL;
@@ -7961,11 +7971,24 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 	cfs_b->burst = burst;
 	cfs_b->buffer = buffer;
 
-	__refill_cfs_bandwidth_runtime(cfs_b);
+	cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
+	cfs_b->runtime = cfs_b->quota;
+
+	/* burst_onset needed */
+	if (cfs_b->quota != RUNTIME_INF &&
+			sysctl_sched_cfs_bw_burst_enabled &&
+			sysctl_sched_cfs_bw_burst_onset_percent > 0) {
+
+		burst_onset = do_div(burst, 100) *
+			sysctl_sched_cfs_bw_burst_onset_percent;
+
+		cfs_b->runtime += burst_onset;
+		cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
+	}
 
 	/* Restart the period timer (if active) to handle new period expiry: */
 	if (runtime_enabled)
-		start_cfs_bandwidth(cfs_b);
+		start_cfs_bandwidth(cfs_b, 1);
 
 	raw_spin_unlock_irq(&cfs_b->lock);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6bb4f89259fd..38a726f77783 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4598,10 +4598,22 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  *
  * requires cfs_b->lock
  */
-void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
+void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
 {
-	if (cfs_b->quota != RUNTIME_INF)
-		cfs_b->runtime = cfs_b->quota;
+	u64 refill;
+
+	if (cfs_b->quota != RUNTIME_INF) {
+
+		if (!sysctl_sched_cfs_bw_burst_enabled) {
+			cfs_b->runtime = cfs_b->quota;
+			return;
+		}
+
+		overrun = min(overrun, cfs_b->max_overrun);
+		refill = cfs_b->quota * overrun;
+		cfs_b->runtime += refill;
+		cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
+	}
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4623,7 +4635,7 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
 	else {
-		start_cfs_bandwidth(cfs_b);
+		start_cfs_bandwidth(cfs_b, 0);
 
 		if (cfs_b->runtime > 0) {
 			amount = min(cfs_b->runtime, min_amount);
@@ -4957,7 +4969,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	if (cfs_b->idle && !throttled)
 		goto out_deactivate;
 
-	__refill_cfs_bandwidth_runtime(cfs_b);
+	__refill_cfs_bandwidth_runtime(cfs_b, overrun);
 
 	if (!throttled) {
 		/* mark as potentially idle for the upcoming period */
@@ -5181,6 +5193,7 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
 }
 
 extern const u64 max_cfs_quota_period;
+extern const u64 max_cfs_runtime;
 
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
@@ -5210,7 +5223,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 			new = old * 2;
 			if (new < max_cfs_quota_period) {
 				cfs_b->period = ns_to_ktime(new);
-				cfs_b->quota *= 2;
+				cfs_b->quota = min(cfs_b->quota * 2,
+						max_cfs_runtime);
+
+				cfs_b->buffer = min(max_cfs_runtime,
+						cfs_b->quota + cfs_b->burst);
+				/* Add 1 in case max_overrun becomes 0. */
+				cfs_b->max_overrun >>= 1;
+				cfs_b->max_overrun++;
 
 				pr_warn_ratelimited(
 	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
@@ -5259,16 +5279,26 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
-void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, int init)
 {
+	u64 overrun;
+
 	lockdep_assert_held(&cfs_b->lock);
 
 	if (cfs_b->period_active)
 		return;
 
 	cfs_b->period_active = 1;
-	hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
+
+	/*
+	 * When period timer stops, quota for the following period is not
+	 * refilled, however period timer is already forwarded. We should
+	 * accumulate quota once more than overrun here.
+	 */
+	if (!init)
+		__refill_cfs_bandwidth_runtime(cfs_b, overrun + 1);
 }
 
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a8772eca8cbb..ff8b5382485d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -366,6 +366,7 @@ struct cfs_bandwidth {
 	u64			runtime;
 	u64			burst;
 	u64			buffer;
+	u64			max_overrun;
 	s64			hierarchical_quota;
 
 	u8			idle;
@@ -476,8 +477,7 @@ extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 			struct sched_entity *parent);
 extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 
-extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
-extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
+extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, int init);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
 
 extern void free_rt_sched_group(struct task_group *tg);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afad085960b8..291dca62a571 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1842,6 +1842,24 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "sched_cfs_bw_burst_onset_percent",
+		.data		= &sysctl_sched_cfs_bw_burst_onset_percent,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &one_hundred,
+	},
+	{
+		.procname	= "sched_cfs_bw_burst_enabled",
+		.data		= &sysctl_sched_cfs_bw_burst_enabled,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 #endif
 #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
 	{
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 3/4] sched/fair: Add cfs bandwidth burst statistics
  2021-01-20 12:27 ` [PATCH v2 " Huaixin Chang
  2021-01-20 12:27   ` [PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
  2021-01-20 12:27   ` [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
@ 2021-01-20 12:27   ` Huaixin Chang
  2021-01-20 12:27   ` [PATCH 4/4] sched/fair: Add document for burstable CFS bandwidth control Huaixin Chang
  3 siblings, 0 replies; 44+ messages in thread
From: Huaixin Chang @ 2021-01-20 12:27 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, peterz, pjt, rostedt, shanpeic,
	vincent.guittot, xiyou.wangcong

Introduce statistics exports for the burstable cfs bandwidth
controller.

The following exports are included:

current_bw: current runtime in the global pool
nr_burst:   number of periods in which a bandwidth burst occurs
burst_time: cumulative wall-time that any CPUs have used above quota in
	    their respective periods
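
For illustration with assumed values: with quota = 100ms, if the global
pool held 150ms at the previous refill (previous_runtime) and holds 20ms
at this one, the group consumed 130ms during the period, so

	burst_time += 130ms - 100ms = 30ms

and nr_burst is incremented once.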

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
---
 kernel/sched/core.c  |  6 ++++++
 kernel/sched/fair.c  | 12 +++++++++++-
 kernel/sched/sched.h |  3 +++
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fecf0f05ef0c..80ca763ca492 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7986,6 +7986,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 		cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
 	}
 
+	cfs_b->previous_runtime = cfs_b->runtime;
+
 	/* Restart the period timer (if active) to handle new period expiry: */
 	if (runtime_enabled)
 		start_cfs_bandwidth(cfs_b, 1);
@@ -8234,6 +8236,10 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
 		seq_printf(sf, "wait_sum %llu\n", ws);
 	}
 
+	seq_printf(sf, "current_bw %llu\n", cfs_b->runtime);
+	seq_printf(sf, "nr_burst %d\n", cfs_b->nr_burst);
+	seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
+
 	return 0;
 }
 #endif /* CONFIG_CFS_BANDWIDTH */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 38a726f77783..e431b2fff01d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4600,7 +4600,7 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  */
 void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
 {
-	u64 refill;
+	u64 refill, runtime;
 
 	if (cfs_b->quota != RUNTIME_INF) {
 
@@ -4609,10 +4609,20 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
 			return;
 		}
 
+		if (cfs_b->previous_runtime > cfs_b->runtime) {
+			runtime = cfs_b->previous_runtime - cfs_b->runtime;
+			if (runtime > cfs_b->quota) {
+				cfs_b->burst_time += runtime - cfs_b->quota;
+				cfs_b->nr_burst++;
+			}
+		}
+
 		overrun = min(overrun, cfs_b->max_overrun);
 		refill = cfs_b->quota * overrun;
 		cfs_b->runtime += refill;
 		cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
+
+		cfs_b->previous_runtime = cfs_b->runtime;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ff8b5382485d..4adb984e5197 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -367,6 +367,7 @@ struct cfs_bandwidth {
 	u64			burst;
 	u64			buffer;
 	u64			max_overrun;
+	u64			previous_runtime;
 	s64			hierarchical_quota;
 
 	u8			idle;
@@ -379,7 +380,9 @@ struct cfs_bandwidth {
 	/* Statistics: */
 	int			nr_periods;
 	int			nr_throttled;
+	int			nr_burst;
 	u64			throttled_time;
+	u64			burst_time;
 #endif
 };
 
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 4/4] sched/fair: Add document for burstable CFS bandwidth control
  2021-01-20 12:27 ` [PATCH v2 " Huaixin Chang
                     ` (2 preceding siblings ...)
  2021-01-20 12:27   ` [PATCH 3/4] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
@ 2021-01-20 12:27   ` Huaixin Chang
  2021-01-20 19:48     ` Randy Dunlap
  3 siblings, 1 reply; 44+ messages in thread
From: Huaixin Chang @ 2021-01-20 12:27 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, peterz, pjt, rostedt, shanpeic,
	vincent.guittot, xiyou.wangcong

Basic description of usage and effect for CFS Bandwidth Control Burst.

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
---
 Documentation/scheduler/sched-bwc.rst | 70 +++++++++++++++++++++++++++++++++--
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
index 9801d6b284b1..2214ecaad393 100644
--- a/Documentation/scheduler/sched-bwc.rst
+++ b/Documentation/scheduler/sched-bwc.rst
@@ -21,18 +21,46 @@ cfs_quota units at each period boundary. As threads consume this bandwidth it
 is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
+By default, CPU bandwidth consumption is strictly limited to quota within each
+given period. For the sequence of CPU usage u_i served under CFS bandwidth
+control, if for any j <= k N(j,k) is the number of periods from u_j to u_k:
+
+        u_j+...+u_k <= quota * N(j,k)
+
+For a bursty sequence among which interval u_j...u_k are at the peak, CPU
+requests might have to wait for more periods to replenish enough quota.
+Otherwise, larger quota is required.
+
+With "burst" buffer, CPU requests might be served as long as:
+
+        u_j+...+u_k <= B_j + quota * N(j,k)
+
+if for any j <= k N(j,k) is the number of periods from u_j to u_k and B_j is
+the accumulated quota from previous periods in burst buffer serving u_j.
+Burst buffer helps in that serving whole bursty CPU requests without throttling
+them can be done with moderate quota setting and accumulated quota in burst
+buffer, if:
+
+        u_0+...+u_n <= B_0 + quota * N(0,n)
+
+where B_0 is the initial state of burst buffer. The maximum accumulated quota in
+the burst buffer is capped by burst. With proper burst setting, the available
+bandwidth is still determined by quota and period on the long run.
+
 Management
 ----------
-Quota and period are managed within the cpu subsystem via cgroupfs.
+Quota, period and burst are managed within the cpu subsystem via cgroupfs.
 
-cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
+cpu.cfs_quota_us: run-time replenished within a period (in microseconds)
 cpu.cfs_period_us: the length of a period (in microseconds)
+cpu.cfs_burst_us: the maximum accumulated run-time (in microseconds)
 cpu.stat: exports throttling statistics [explained further below]
 
 The default values are::
 
 	cpu.cfs_period_us=100ms
-	cpu.cfs_quota=-1
+	cpu.cfs_quota_us=-1
+	cpu.cfs_burst_us=0
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
@@ -48,6 +76,11 @@ more detail below.
 Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
 and return the group to an unconstrained state once more.
 
+A value of 0 for cpu.cfs_burst_us indicates that the group can not accumulate
+any unused bandwidth. It makes the traditional bandwidth control behavior for
+CFS unchanged. Writing any (valid) positive value(s) into cpu.cfs_burst_us
+will enact the cap on unused bandwidth accumulation.
+
 Any updates to a group's bandwidth specification will result in it becoming
 unthrottled if it is in a constrained state.
 
@@ -65,9 +98,21 @@ This is tunable via procfs::
 Larger slice values will reduce transfer overheads, while smaller values allow
 for more fine-grained consumption.
 
+There is also a global switch to turn off burst for all groups::
+       /proc/sys/kernel/sched_cfs_bw_burst_enabled (default=1)
+
+By default it is enabled. Write 0 values means no accumulated CPU time can be
+used for any group, even if cpu.cfs_burst_us is configured.
+
+Sometimes users might want a group to burst without accumulation. This is
+tunable via::
+       /proc/sys/kernel/sched_cfs_bw_burst_onset_percent (default=0)
+
+Up to 100% runtime of cpu.cfs_burst_us might be given on setting bandwidth.
+
 Statistics
 ----------
-A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+A group's bandwidth statistics are exported via 6 fields in cpu.stat.
 
 cpu.stat:
 
@@ -75,6 +120,11 @@ cpu.stat:
 - nr_throttled: Number of times the group has been throttled/limited.
 - throttled_time: The total time duration (in nanoseconds) for which entities
   of the group have been throttled.
+- current_bw: Current runtime in global pool.
+- nr_burst: Number of periods burst occurs.
+- burst_time: Cumulative wall-time that any cpus has used above quota in
+  respective periods
+
 
 This interface is read-only.
 
@@ -172,3 +222,15 @@ Examples
 
    By using a small period here we are ensuring a consistent latency
    response at the expense of burst capacity.
+
+4. Limit a group to 20% of 1 CPU, and allow accumulate up to 60% of 1 CPU
+   addtionally, in case accumulation has been done.
+
+   With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
+   And 30ms burst will be equivalent to 60% of 1 CPU.
+
+	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
+	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
+	# echo 30000 > cpu.cfs_burst_us /* burst = 30ms */
+
+   Larger buffer setting allows greater burst capacity.
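
As a worked check of this example (the arithmetic here is ours): after
three idle periods the group has accumulated the full 30ms burst, so up
to 10ms + 30ms = 40ms of CPU time can be consumed within a single 50ms
period, a one-period peak of 80% of 1 CPU, while consumption still
averages at most 20% of 1 CPU over the long run.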
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable
  2021-01-20 12:27   ` [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
@ 2021-01-20 17:06       ` kernel test robot
  2021-01-20 18:33       ` kernel test robot
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 44+ messages in thread
From: kernel test robot @ 2021-01-20 17:06 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: kbuild-all, clang-built-linux, bsegall, dietmar.eggemann,
	juri.lelli, khlebnikov, linux-kernel, mgorman, mingo, pauld,
	peterz

[-- Attachment #1: Type: text/plain, Size: 3957 bytes --]

Hi Huaixin,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/master linux/master linus/master v5.11-rc4 next-20210120]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Huaixin-Chang/sched-fair-Introduce-primitives-for-CFS-bandwidth-burst/20210120-212731
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 65bcf072e20ed7597caa902f170f293662b0af3c
config: x86_64-randconfig-a001-20210120 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 22b68440e1647e16b5ee24b924986207173c02d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # https://github.com/0day-ci/linux/commit/a62cc8421988be29990ad86e33e877fb8776f8bd
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Huaixin-Chang/sched-fair-Introduce-primitives-for-CFS-bandwidth-burst/20210120-212731
        git checkout a62cc8421988be29990ad86e33e877fb8776f8bd
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> kernel/sched/fair.c:4623:6: warning: no previous prototype for function '__refill_cfs_bandwidth_runtime' [-Wmissing-prototypes]
   void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
        ^
   kernel/sched/fair.c:4623:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
   ^
   static 
   kernel/sched/fair.c:2985:20: warning: unused function 'account_numa_enqueue' [-Wunused-function]
   static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)
                      ^
   kernel/sched/fair.c:2989:20: warning: unused function 'account_numa_dequeue' [-Wunused-function]
   static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
                      ^
   kernel/sched/fair.c:2993:20: warning: unused function 'update_scan_period' [-Wunused-function]
   static inline void update_scan_period(struct task_struct *p, int new_cpu)
                      ^
   kernel/sched/fair.c:4730:19: warning: unused function 'throttled_lb_pair' [-Wunused-function]
   static inline int throttled_lb_pair(struct task_group *tg,
                     ^
   5 warnings generated.


vim +/__refill_cfs_bandwidth_runtime +4623 kernel/sched/fair.c

  4615	
  4616	/*
  4617	 * Replenish runtime according to assigned quota. We use sched_clock_cpu
  4618	 * directly instead of rq->clock to avoid adding additional synchronization
  4619	 * around rq->lock.
  4620	 *
  4621	 * requires cfs_b->lock
  4622	 */
> 4623	void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
  4624	{
  4625		u64 refill;
  4626	
  4627		if (cfs_b->quota != RUNTIME_INF) {
  4628	
  4629			if (!sysctl_sched_cfs_bw_burst_enabled) {
  4630				cfs_b->runtime = cfs_b->quota;
  4631				return;
  4632			}
  4633	
  4634			overrun = min(overrun, cfs_b->max_overrun);
  4635			refill = cfs_b->quota * overrun;
  4636			cfs_b->runtime += refill;
  4637			cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
  4638		}
  4639	}
  4640	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 43369 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable
  2021-01-20 12:27   ` [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
@ 2021-01-20 18:33       ` kernel test robot
  2021-01-20 18:33       ` kernel test robot
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 44+ messages in thread
From: kernel test robot @ 2021-01-20 18:33 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: kbuild-all, bsegall, dietmar.eggemann, juri.lelli, khlebnikov,
	linux-kernel, mgorman, mingo, pauld, peterz

[-- Attachment #1: Type: text/plain, Size: 2757 bytes --]

Hi Huaixin,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/master linux/master linus/master v5.11-rc4 next-20210120]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Huaixin-Chang/sched-fair-Introduce-primitives-for-CFS-bandwidth-burst/20210120-212731
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 65bcf072e20ed7597caa902f170f293662b0af3c
config: c6x-randconfig-r035-20210120 (attached as .config)
compiler: c6x-elf-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/a62cc8421988be29990ad86e33e877fb8776f8bd
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Huaixin-Chang/sched-fair-Introduce-primitives-for-CFS-bandwidth-burst/20210120-212731
        git checkout a62cc8421988be29990ad86e33e877fb8776f8bd
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=c6x 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> kernel/sched/fair.c:4623:6: warning: no previous prototype for '__refill_cfs_bandwidth_runtime' [-Wmissing-prototypes]
    4623 | void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +/__refill_cfs_bandwidth_runtime +4623 kernel/sched/fair.c

  4615	
  4616	/*
  4617	 * Replenish runtime according to assigned quota. We use sched_clock_cpu
  4618	 * directly instead of rq->clock to avoid adding additional synchronization
  4619	 * around rq->lock.
  4620	 *
  4621	 * requires cfs_b->lock
  4622	 */
> 4623	void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
  4624	{
  4625		u64 refill;
  4626	
  4627		if (cfs_b->quota != RUNTIME_INF) {
  4628	
  4629			if (!sysctl_sched_cfs_bw_burst_enabled) {
  4630				cfs_b->runtime = cfs_b->quota;
  4631				return;
  4632			}
  4633	
  4634			overrun = min(overrun, cfs_b->max_overrun);
  4635			refill = cfs_b->quota * overrun;
  4636			cfs_b->runtime += refill;
  4637			cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
  4638		}
  4639	}
  4640	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 33067 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/4] sched/fair: Add document for burstable CFS bandwidth control
  2021-01-20 12:27   ` [PATCH 4/4] sched/fair: Add document for burstable CFS bandwidth control Huaixin Chang
@ 2021-01-20 19:48     ` Randy Dunlap
  0 siblings, 0 replies; 44+ messages in thread
From: Randy Dunlap @ 2021-01-20 19:48 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, peterz, pjt, rostedt, shanpeic,
	vincent.guittot, xiyou.wangcong

Hi--

Some comments below:

On 1/20/21 4:27 AM, Huaixin Chang wrote:
> Basic description of usage and effect for CFS Bandwidth Control Burst.
> 
> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
> ---
>  Documentation/scheduler/sched-bwc.rst | 70 +++++++++++++++++++++++++++++++++--
>  1 file changed, 66 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
> index 9801d6b284b1..2214ecaad393 100644
> --- a/Documentation/scheduler/sched-bwc.rst
> +++ b/Documentation/scheduler/sched-bwc.rst
> @@ -21,18 +21,46 @@ cfs_quota units at each period boundary. As threads consume this bandwidth it
>  is transferred to cpu-local "silos" on a demand basis. The amount transferred
>  within each of these updates is tunable and described as the "slice".
>  
> +By default, CPU bandwidth consumption is strictly limited to quota within each
> +given period. For the sequence of CPU usage u_i served under CFS bandwidth
> +control, if for any j <= k N(j,k) is the number of periods from u_j to u_k:
> +
> +        u_j+...+u_k <= quota * N(j,k)
> +
> +For a bursty sequence among which interval u_j...u_k are at the peak, CPU
> +requests might have to wait for more periods to replenish enough quota.
> +Otherwise, larger quota is required.
> +
> +With "burst" buffer, CPU requests might be served as long as:
> +
> +        u_j+...+u_k <= B_j + quota * N(j,k)
> +
> +if for any j <= k N(j,k) is the number of periods from u_j to u_k and B_j is
> +the accumulated quota from previous periods in burst buffer serving u_j.
> +Burst buffer helps in that serving whole bursty CPU requests without throttling
> +them can be done with moderate quota setting and accumulated quota in burst
> +buffer, if:
> +
> +        u_0+...+u_n <= B_0 + quota * N(0,n)
> +
> +where B_0 is the initial state of burst buffer. The maximum accumulated quota in
> +the burst buffer is capped by burst. With proper burst setting, the available
> +bandwidth is still determined by quota and period on the long run.
> +
>  Management
>  ----------
> -Quota and period are managed within the cpu subsystem via cgroupfs.
> +Quota, period and burst are managed within the cpu subsystem via cgroupfs.
>  
> -cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
> +cpu.cfs_quota_us: run-time replenished within a period (in microseconds)
>  cpu.cfs_period_us: the length of a period (in microseconds)
> +cpu.cfs_burst_us: the maximum accumulated run-time (in microseconds)
>  cpu.stat: exports throttling statistics [explained further below]
>  
>  The default values are::
>  
>  	cpu.cfs_period_us=100ms
> -	cpu.cfs_quota=-1
> +	cpu.cfs_quota_us=-1
> +	cpu.cfs_burst_us=0
>  
>  A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
>  bandwidth restriction in place, such a group is described as an unconstrained
> @@ -48,6 +76,11 @@ more detail below.
>  Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
>  and return the group to an unconstrained state once more.
>  
> +A value of 0 for cpu.cfs_burst_us indicates that the group can not accumulate
> +any unused bandwidth. It makes the traditional bandwidth control behavior for
> +CFS unchanged. Writing any (valid) positive value(s) into cpu.cfs_burst_us
> +will enact the cap on unused bandwidth accumulation.
> +
>  Any updates to a group's bandwidth specification will result in it becoming
>  unthrottled if it is in a constrained state.
>  
> @@ -65,9 +98,21 @@ This is tunable via procfs::
>  Larger slice values will reduce transfer overheads, while smaller values allow
>  for more fine-grained consumption.
>  
> +There is also a global switch to turn off burst for all groups::
> +       /proc/sys/kernel/sched_cfs_bw_burst_enabled (default=1)
> +
> +By default it is enabled. Write 0 values means no accumulated CPU time can be

                             Writing a 0 value means

> +used for any group, even if cpu.cfs_burst_us is configured.
> +
> +Sometimes users might want a group to burst without accumulation. This is
> +tunable via::
> +       /proc/sys/kernel/sched_cfs_bw_burst_onset_percent (default=0)
> +
> +Up to 100% runtime of cpu.cfs_burst_us might be given on setting bandwidth.
> +
>  Statistics
>  ----------
> -A group's bandwidth statistics are exported via 3 fields in cpu.stat.
> +A group's bandwidth statistics are exported via 6 fields in cpu.stat.
>  
>  cpu.stat:
>  
> @@ -75,6 +120,11 @@ cpu.stat:
>  - nr_throttled: Number of times the group has been throttled/limited.
>  - throttled_time: The total time duration (in nanoseconds) for which entities
>    of the group have been throttled.
> +- current_bw: Current runtime in global pool.
> +- nr_burst: Number of periods burst occurs.
> +- burst_time: Cumulative wall-time that any cpus has used above quota in

                                               CPUs have used

> +  respective periods
> +
>  
>  This interface is read-only.
>  
> @@ -172,3 +222,15 @@ Examples
>  
>     By using a small period here we are ensuring a consistent latency
>     response at the expense of burst capacity.
> +
> +4. Limit a group to 20% of 1 CPU, and allow accumulate up to 60% of 1 CPU
> +   addtionally, in case accumulation has been done.

      additionally,

> +
> +   With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
> +   And 30ms burst will be equivalent to 60% of 1 CPU.
> +
> +	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
> +	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
> +	# echo 30000 > cpu.cfs_burst_us /* burst = 30ms */
> +
> +   Larger buffer setting allows greater burst capacity.
> 


HTH.
-- 
~Randy
"He closes his eyes and drops the goggles.  You can't get hurt
by looking at a bitmap.  Or can you?"
(Neal Stephenson: Snow Crash)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable
  2021-01-20 12:27   ` [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
@ 2021-01-20 22:01       ` kernel test robot
  2021-01-20 18:33       ` kernel test robot
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 44+ messages in thread
From: kernel test robot @ 2021-01-20 22:01 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: kbuild-all, bsegall, dietmar.eggemann, juri.lelli, khlebnikov,
	linux-kernel, mgorman, mingo, pauld, peterz

[-- Attachment #1: Type: text/plain, Size: 8167 bytes --]

Hi Huaixin,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/master linux/master linus/master v5.11-rc4 next-20210120]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Huaixin-Chang/sched-fair-Introduce-primitives-for-CFS-bandwidth-burst/20210120-212731
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 65bcf072e20ed7597caa902f170f293662b0af3c
config: x86_64-randconfig-s022-20210120 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-208-g46a52ca4-dirty
        # https://github.com/0day-ci/linux/commit/a62cc8421988be29990ad86e33e877fb8776f8bd
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Huaixin-Chang/sched-fair-Introduce-primitives-for-CFS-bandwidth-burst/20210120-212731
        git checkout a62cc8421988be29990ad86e33e877fb8776f8bd
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


"sparse warnings: (new ones prefixed by >>)"
   kernel/sched/fair.c:879:34: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_entity *se @@     got struct sched_entity [noderef] __rcu * @@
   kernel/sched/fair.c:879:34: sparse:     expected struct sched_entity *se
   kernel/sched/fair.c:879:34: sparse:     got struct sched_entity [noderef] __rcu *
   kernel/sched/fair.c:2523:13: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct task_struct *tsk @@     got struct task_struct [noderef] __rcu * @@
   kernel/sched/fair.c:2523:13: sparse:     expected struct task_struct *tsk
   kernel/sched/fair.c:2523:13: sparse:     got struct task_struct [noderef] __rcu *
   kernel/sched/fair.c:10660:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:10660:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:10660:9: sparse:     got struct sched_domain [noderef] __rcu *parent
>> kernel/sched/fair.c:4623:6: sparse: sparse: symbol '__refill_cfs_bandwidth_runtime' was not declared. Should it be static?
   kernel/sched/fair.c:4937:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/fair.c:4937:22: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/fair.c:4937:22: sparse:    struct task_struct *
   kernel/sched/fair.c:5480:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:5480:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:5480:38: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/fair.c:6668:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:6668:20: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:6668:20: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:6790:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] tmp @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:6790:9: sparse:     expected struct sched_domain *[assigned] tmp
   kernel/sched/fair.c:6790:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:6988:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:6988:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:6988:38: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/fair.c:7239:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:7239:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:7239:38: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/fair.c:8210:40: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *child @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/fair.c:8210:40: sparse:     expected struct sched_domain *child
   kernel/sched/fair.c:8210:40: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/fair.c:8703:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/fair.c:8703:22: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/fair.c:8703:22: sparse:    struct task_struct *
   kernel/sched/fair.c:9979:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:9979:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:9979:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:9635:44: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *sd_parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:9635:44: sparse:     expected struct sched_domain *sd_parent
   kernel/sched/fair.c:9635:44: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:10057:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:10057:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:10057:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:2467:9: sparse: sparse: context imbalance in 'task_numa_placement' - different lock contexts for basic block
   kernel/sched/fair.c:4593:31: sparse: sparse: marked inline, but without a definition
   kernel/sched/fair.c: note: in included file:
   kernel/sched/sched.h:1724:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:1724:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:1724:25: sparse:    struct task_struct *
   kernel/sched/sched.h:1879:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:1879:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:1879:9: sparse:    struct task_struct *
   kernel/sched/sched.h:1724:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:1724:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:1724:25: sparse:    struct task_struct *
   kernel/sched/sched.h:1724:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:1724:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:1724:25: sparse:    struct task_struct *

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 36569 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable
@ 2021-01-20 22:01       ` kernel test robot
  0 siblings, 0 replies; 44+ messages in thread
From: kernel test robot @ 2021-01-20 22:01 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 8262 bytes --]

Hi Huaixin,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/master linux/master linus/master v5.11-rc4 next-20210120]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Huaixin-Chang/sched-fair-Introduce-primitives-for-CFS-bandwidth-burst/20210120-212731
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 65bcf072e20ed7597caa902f170f293662b0af3c
config: x86_64-randconfig-s022-20210120 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-208-g46a52ca4-dirty
        # https://github.com/0day-ci/linux/commit/a62cc8421988be29990ad86e33e877fb8776f8bd
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Huaixin-Chang/sched-fair-Introduce-primitives-for-CFS-bandwidth-burst/20210120-212731
        git checkout a62cc8421988be29990ad86e33e877fb8776f8bd
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


"sparse warnings: (new ones prefixed by >>)"
   kernel/sched/fair.c:879:34: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_entity *se @@     got struct sched_entity [noderef] __rcu * @@
   kernel/sched/fair.c:879:34: sparse:     expected struct sched_entity *se
   kernel/sched/fair.c:879:34: sparse:     got struct sched_entity [noderef] __rcu *
   kernel/sched/fair.c:2523:13: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct task_struct *tsk @@     got struct task_struct [noderef] __rcu * @@
   kernel/sched/fair.c:2523:13: sparse:     expected struct task_struct *tsk
   kernel/sched/fair.c:2523:13: sparse:     got struct task_struct [noderef] __rcu *
   kernel/sched/fair.c:10660:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:10660:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:10660:9: sparse:     got struct sched_domain [noderef] __rcu *parent
>> kernel/sched/fair.c:4623:6: sparse: sparse: symbol '__refill_cfs_bandwidth_runtime' was not declared. Should it be static?
   kernel/sched/fair.c:4937:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/fair.c:4937:22: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/fair.c:4937:22: sparse:    struct task_struct *
   kernel/sched/fair.c:5480:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:5480:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:5480:38: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/fair.c:6668:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:6668:20: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:6668:20: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:6790:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] tmp @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:6790:9: sparse:     expected struct sched_domain *[assigned] tmp
   kernel/sched/fair.c:6790:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:6988:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:6988:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:6988:38: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/fair.c:7239:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:7239:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:7239:38: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/fair.c:8210:40: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *child @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/fair.c:8210:40: sparse:     expected struct sched_domain *child
   kernel/sched/fair.c:8210:40: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/fair.c:8703:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/fair.c:8703:22: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/fair.c:8703:22: sparse:    struct task_struct *
   kernel/sched/fair.c:9979:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:9979:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:9979:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:9635:44: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *sd_parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:9635:44: sparse:     expected struct sched_domain *sd_parent
   kernel/sched/fair.c:9635:44: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:10057:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/fair.c:10057:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/fair.c:10057:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/fair.c:2467:9: sparse: sparse: context imbalance in 'task_numa_placement' - different lock contexts for basic block
   kernel/sched/fair.c:4593:31: sparse: sparse: marked inline, but without a definition
   kernel/sched/fair.c: note: in included file:
   kernel/sched/sched.h:1724:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:1724:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:1724:25: sparse:    struct task_struct *
   kernel/sched/sched.h:1879:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:1879:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:1879:9: sparse:    struct task_struct *
   kernel/sched/sched.h:1724:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:1724:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:1724:25: sparse:    struct task_struct *
   kernel/sched/sched.h:1724:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:1724:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:1724:25: sparse:    struct task_struct *

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 36569 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC PATCH] sched/fair: __refill_cfs_bandwidth_runtime() can be static
  2021-01-20 12:27   ` [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
@ 2021-01-20 22:01       ` kernel test robot
  2021-01-20 18:33       ` kernel test robot
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 44+ messages in thread
From: kernel test robot @ 2021-01-20 22:01 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: kbuild-all, bsegall, dietmar.eggemann, juri.lelli, khlebnikov,
	linux-kernel, mgorman, mingo, pauld, peterz


Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
---
 fair.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8be4651ae482dc..bf2259494136b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4620,7 +4620,7 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  *
  * requires cfs_b->lock
  */
-void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
+static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
 {
 	u64 refill;
 

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH] sched/fair: __refill_cfs_bandwidth_runtime() can be static
@ 2021-01-20 22:01       ` kernel test robot
  0 siblings, 0 replies; 44+ messages in thread
From: kernel test robot @ 2021-01-20 22:01 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 626 bytes --]


Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
---
 fair.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8be4651ae482dc..bf2259494136b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4620,7 +4620,7 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  *
  * requires cfs_b->lock
  */
-void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
+static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
 {
 	u64 refill;
 

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller
  2020-12-17  7:46 [PATCH 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
                   ` (5 preceding siblings ...)
  2021-01-20 12:27 ` [PATCH v2 " Huaixin Chang
@ 2021-01-21 11:04 ` Huaixin Chang
  2021-01-21 11:04   ` [PATCH v3 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
                     ` (5 more replies)
  6 siblings, 6 replies; 44+ messages in thread
From: Huaixin Chang @ 2021-01-21 11:04 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, peterz, pjt, rostedt, shanpeic,
	vincent.guittot, xiyou.wangcong

Changelog

v3:
1. Fix another issue reported by test robot.
2. Update docs as Randy Dunlap suggested.

v2:
1. Fix an issue reported by test robot.
2. Rewriting docs. Appreciate any further suggestions or help.

The CFS bandwidth controller limits CPU requests of a task group to
quota during each period. However, parallel workloads might be bursty
so that they get throttled. And they are latency sensitive at the same
time so that throttling them is undesired.

Scaling up period and quota allows greater burst capacity. But it might
cause a longer stall until the next refill. We introduce "burst" to allow
accumulating unused quota from previous periods, to be spent when a task
group requests more CPU than quota during a specific period, thus
allowing CPU time requests as long as the average requested CPU time
stays below quota in the long run. The maximum accumulation is capped by
burst and is set to 0 by default, so the traditional behaviour remains.

A huge drop of 99th tail latency from more than 500ms to 27ms is seen for
real java workloads when using burst. Similar drops are seen when
testing with schbench too:

	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
	echo 700000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
	echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

	# The average CPU usage is around 500%, which is 200ms CPU time
	# every 40ms.
	./schbench -m 1 -t 30 -r 60 -c 10000 -R 500

	Without burst:

	Latency percentiles (usec)
	50.0000th: 7
	75.0000th: 8
	90.0000th: 9
	95.0000th: 10
	*99.0000th: 933
	99.5000th: 981
	99.9000th: 3068
	min=0, max=20054
	rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 9.33%

	With burst:

	Latency percentiles (usec)
	50.0000th: 7
	75.0000th: 8
	90.0000th: 9
	95.0000th: 9
	*99.0000th: 12
	99.5000th: 13
	99.9000th: 19
	min=0, max=406
	rps: 498.36 p95 (usec) 9 p99 (usec) 12 p95/cputime 0.09% p99/cputime 0.12%

How much workloads will benefit from burstable CFS bandwidth control
depends on how bursty and how latency sensitive they are.

Previously, Cong Wang and Konstantin Khlebnikov proposed similar
feature:
https://lore.kernel.org/lkml/20180522062017.5193-1-xiyou.wangcong@gmail.com/
https://lore.kernel.org/lkml/157476581065.5793.4518979877345136813.stgit@buzz/

This time we present more latency statistics and handle overflow while
accumulating.

Huaixin Chang (4):
  sched/fair: Introduce primitives for CFS bandwidth burst
  sched/fair: Make CFS bandwidth controller burstable
  sched/fair: Add cfs bandwidth burst statistics
  sched/fair: Add document for burstable CFS bandwidth control

 Documentation/scheduler/sched-bwc.rst |  49 +++++++++++--
 include/linux/sched/sysctl.h          |   2 +
 kernel/sched/core.c                   | 126 +++++++++++++++++++++++++++++-----
 kernel/sched/fair.c                   |  58 +++++++++++++---
 kernel/sched/sched.h                  |   9 ++-
 kernel/sysctl.c                       |  18 +++++
 6 files changed, 232 insertions(+), 30 deletions(-)

-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v3 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-01-21 11:04 ` [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
@ 2021-01-21 11:04   ` Huaixin Chang
  2021-03-10 12:39     ` Peter Zijlstra
  2021-01-21 11:04   ` [PATCH v3 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 44+ messages in thread
From: Huaixin Chang @ 2021-01-21 11:04 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, peterz, pjt, rostedt, shanpeic,
	vincent.guittot, xiyou.wangcong

In this patch, we introduce the notion of CFS bandwidth burst. Unused
"quota" from previous "periods" might be accumulated and used in the
following "periods". The maximum amount of accumulated bandwidth is
bounded by "burst". And the maximum amount of CPU a group can consume in
a given period is "buffer", which is equivalent to "quota" + "burst" in
case that this group has done enough accumulation.
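
As a worked illustration, reusing the numbers from the cover letter's
schbench test: with period = 100ms, quota = 700ms and burst = 400ms, the
buffer is min(max_cfs_runtime, 700ms + 400ms) = 1100ms. A group that has
accumulated enough may therefore consume up to 1100ms of CPU time within a
single 100ms period, while long-term usage remains capped at an average of
700ms per period.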

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
---
 kernel/sched/core.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++--------
 kernel/sched/fair.c  |  2 ++
 kernel/sched/sched.h |  2 ++
 3 files changed, 82 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e7e453492cff..48d3bad12be2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7895,10 +7895,12 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
-static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
+							u64 burst)
 {
 	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+	u64 buffer;
 
 	if (tg == &root_task_group)
 		return -EINVAL;
@@ -7925,6 +7927,16 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	if (quota != RUNTIME_INF && quota > max_cfs_runtime)
 		return -EINVAL;
 
+	/*
+	 * Bound burst to defend burst against overflow during bandwidth shift.
+	 */
+	if (burst > max_cfs_runtime)
+		return -EINVAL;
+
+	if (quota == RUNTIME_INF)
+		buffer = RUNTIME_INF;
+	else
+		buffer = min(max_cfs_runtime, quota + burst);
 	/*
 	 * Prevent race between setting of cfs_rq->runtime_enabled and
 	 * unthrottle_offline_cfs_rqs().
@@ -7946,6 +7958,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
+	cfs_b->burst = burst;
+	cfs_b->buffer = buffer;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
@@ -7979,9 +7993,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
 static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 {
-	u64 quota, period;
+	u64 quota, period, burst;
 
 	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	burst = tg->cfs_bandwidth.burst;
 	if (cfs_quota_us < 0)
 		quota = RUNTIME_INF;
 	else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
@@ -7989,7 +8004,7 @@ static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 	else
 		return -EINVAL;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_quota(struct task_group *tg)
@@ -8007,15 +8022,16 @@ static long tg_get_cfs_quota(struct task_group *tg)
 
 static int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
 {
-	u64 quota, period;
+	u64 quota, period, burst;
 
 	if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC)
 		return -EINVAL;
 
 	period = (u64)cfs_period_us * NSEC_PER_USEC;
 	quota = tg->cfs_bandwidth.quota;
+	burst = tg->cfs_bandwidth.burst;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_period(struct task_group *tg)
@@ -8028,6 +8044,35 @@ static long tg_get_cfs_period(struct task_group *tg)
 	return cfs_period_us;
 }
 
+static int tg_set_cfs_burst(struct task_group *tg, long cfs_burst_us)
+{
+	u64 quota, period, burst;
+
+	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	quota = tg->cfs_bandwidth.quota;
+	if (cfs_burst_us < 0)
+		burst = RUNTIME_INF;
+	else if ((u64)cfs_burst_us <= U64_MAX / NSEC_PER_USEC)
+		burst = (u64)cfs_burst_us * NSEC_PER_USEC;
+	else
+		return -EINVAL;
+
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
+}
+
+static long tg_get_cfs_burst(struct task_group *tg)
+{
+	u64 burst_us;
+
+	if (tg->cfs_bandwidth.burst == RUNTIME_INF)
+		return -1;
+
+	burst_us = tg->cfs_bandwidth.burst;
+	do_div(burst_us, NSEC_PER_USEC);
+
+	return burst_us;
+}
+
 static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
 				  struct cftype *cft)
 {
@@ -8052,6 +8097,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
 	return tg_set_cfs_period(css_tg(css), cfs_period_us);
 }
 
+static s64 cpu_cfs_burst_read_s64(struct cgroup_subsys_state *css,
+				  struct cftype *cft)
+{
+	return tg_get_cfs_burst(css_tg(css));
+}
+
+static int cpu_cfs_burst_write_s64(struct cgroup_subsys_state *css,
+				   struct cftype *cftype, s64 cfs_burst_us)
+{
+	return tg_set_cfs_burst(css_tg(css), cfs_burst_us);
+}
+
 struct cfs_schedulable_data {
 	struct task_group *tg;
 	u64 period, quota;
@@ -8204,6 +8261,11 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "cfs_burst_us",
+		.read_s64 = cpu_cfs_burst_read_s64,
+		.write_s64 = cpu_cfs_burst_write_s64,
+	},
 	{
 		.name = "stat",
 		.seq_show = cpu_cfs_stat_show,
@@ -8324,26 +8386,27 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
 #endif
 
 static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
-						  long period, long quota)
+					long period, long quota, long burst)
 {
 	if (quota < 0)
 		seq_puts(sf, "max");
 	else
 		seq_printf(sf, "%ld", quota);
 
-	seq_printf(sf, " %ld\n", period);
+	seq_printf(sf, " %ld %ld\n", period, burst);
 }
 
-/* caller should put the current value in *@periodp before calling */
+/* caller should put the current value in *@periodp and *@burstp before calling */
 static int __maybe_unused cpu_period_quota_parse(char *buf,
-						 u64 *periodp, u64 *quotap)
+					 u64 *periodp, u64 *quotap, u64 *burstp)
 {
 	char tok[21];	/* U64_MAX */
 
-	if (sscanf(buf, "%20s %llu", tok, periodp) < 1)
+	if (sscanf(buf, "%20s %llu %llu", tok, periodp, burstp) < 1)
 		return -EINVAL;
 
 	*periodp *= NSEC_PER_USEC;
+	*burstp *= NSEC_PER_USEC;
 
 	if (sscanf(tok, "%llu", quotap))
 		*quotap *= NSEC_PER_USEC;
@@ -8360,7 +8423,8 @@ static int cpu_max_show(struct seq_file *sf, void *v)
 {
 	struct task_group *tg = css_tg(seq_css(sf));
 
-	cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+	cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg),
+			tg_get_cfs_burst(tg));
 	return 0;
 }
 
@@ -8369,12 +8433,13 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
 {
 	struct task_group *tg = css_tg(of_css(of));
 	u64 period = tg_get_cfs_period(tg);
+	u64 burst = tg_get_cfs_burst(tg);
 	u64 quota;
 	int ret;
 
-	ret = cpu_period_quota_parse(buf, &period, &quota);
+	ret = cpu_period_quota_parse(buf, &period, &quota, &burst);
 	if (!ret)
-		ret = tg_set_cfs_bandwidth(tg, period, quota);
+		ret = tg_set_cfs_bandwidth(tg, period, quota, burst);
 	return ret ?: nbytes;
 }
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae7ceba8fd4f..6bb4f89259fd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5242,6 +5242,8 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+	cfs_b->burst = 0;
+	cfs_b->buffer = RUNTIME_INF;
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index df80bfcea92e..a8772eca8cbb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -364,6 +364,8 @@ struct cfs_bandwidth {
 	ktime_t			period;
 	u64			quota;
 	u64			runtime;
+	u64			burst;
+	u64			buffer;
 	s64			hierarchical_quota;
 
 	u8			idle;
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 2/4] sched/fair: Make CFS bandwidth controller burstable
  2021-01-21 11:04 ` [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
  2021-01-21 11:04   ` [PATCH v3 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
@ 2021-01-21 11:04   ` Huaixin Chang
  2021-03-10 13:04     ` Peter Zijlstra
  2021-01-21 11:04   ` [PATCH v3 3/4] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 44+ messages in thread
From: Huaixin Chang @ 2021-01-21 11:04 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, peterz, pjt, rostedt, shanpeic,
	vincent.guittot, xiyou.wangcong

Accumulate unused quota from previous periods, so that accumulated
bandwidth runtime can be used in the following periods. During
accumulation, take care of runtime overflow. The previous non-burstable
CFS bandwidth controller only assigned quota to runtime, which kept the
refill logic simple.

A sysctl parameter sysctl_sched_cfs_bw_burst_onset_percent is introduced to
denote what percentage of burst is granted when cfs bandwidth is set. By
default it is 0, which means no burst is granted unless it has been
accumulated.

Also, parameter sysctl_sched_cfs_bw_burst_enabled is introduced as a
switch for burst. It is enabled by default.
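
As an illustrative sketch (the values are arbitrary examples; the paths are
the sysctls introduced by this patch):

	# grant 50% of burst immediately when cfs bandwidth is set
	echo 50 > /proc/sys/kernel/sched_cfs_bw_burst_onset_percent

	# globally disable burst accumulation without touching per-group settings
	echo 0 > /proc/sys/kernel/sched_cfs_bw_burst_enabled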

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 include/linux/sched/sysctl.h |  2 ++
 kernel/sched/core.c          | 31 +++++++++++++++++++++++++----
 kernel/sched/fair.c          | 47 ++++++++++++++++++++++++++++++++++++--------
 kernel/sched/sched.h         |  4 ++--
 kernel/sysctl.c              | 18 +++++++++++++++++
 5 files changed, 88 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 3c31ba88aca5..3400828eaf2d 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -72,6 +72,8 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
 
 #ifdef CONFIG_CFS_BANDWIDTH
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+extern unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
+extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
 #endif
 
 #ifdef CONFIG_SCHED_AUTOGROUP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 48d3bad12be2..fecf0f05ef0c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -66,6 +66,16 @@ const_debug unsigned int sysctl_sched_features =
  */
 const_debug unsigned int sysctl_sched_nr_migrate = 32;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * Percent of burst assigned to cfs_b->runtime on tg_set_cfs_bandwidth,
+ * 0 by default.
+ */
+unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
+
+unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
+#endif
+
 /*
  * period over which we measure -rt task CPU usage in us.
  * default: 1s
@@ -7891,7 +7901,7 @@ static DEFINE_MUTEX(cfs_constraints_mutex);
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
 static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 /* More than 203 days if BW_SHIFT equals 20. */
-static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
+const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
@@ -7900,7 +7910,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 {
 	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
-	u64 buffer;
+	u64 buffer, burst_onset;
 
 	if (tg == &root_task_group)
 		return -EINVAL;
@@ -7961,11 +7971,24 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 	cfs_b->burst = burst;
 	cfs_b->buffer = buffer;
 
-	__refill_cfs_bandwidth_runtime(cfs_b);
+	cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
+	cfs_b->runtime = cfs_b->quota;
+
+	/* burst_onset needed */
+	if (cfs_b->quota != RUNTIME_INF &&
+			sysctl_sched_cfs_bw_burst_enabled &&
+			sysctl_sched_cfs_bw_burst_onset_percent > 0) {
+
+		burst_onset = do_div(burst, 100) *
+			sysctl_sched_cfs_bw_burst_onset_percent;
+
+		cfs_b->runtime += burst_onset;
+		cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
+	}
 
 	/* Restart the period timer (if active) to handle new period expiry: */
 	if (runtime_enabled)
-		start_cfs_bandwidth(cfs_b);
+		start_cfs_bandwidth(cfs_b, 1);
 
 	raw_spin_unlock_irq(&cfs_b->lock);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6bb4f89259fd..abe6eb05fe09 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4598,10 +4598,23 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  *
  * requires cfs_b->lock
  */
-void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
+static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b,
+		u64 overrun)
 {
-	if (cfs_b->quota != RUNTIME_INF)
-		cfs_b->runtime = cfs_b->quota;
+	u64 refill;
+
+	if (cfs_b->quota != RUNTIME_INF) {
+
+		if (!sysctl_sched_cfs_bw_burst_enabled) {
+			cfs_b->runtime = cfs_b->quota;
+			return;
+		}
+
+		overrun = min(overrun, cfs_b->max_overrun);
+		refill = cfs_b->quota * overrun;
+		cfs_b->runtime += refill;
+		cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
+	}
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4623,7 +4636,7 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
 	else {
-		start_cfs_bandwidth(cfs_b);
+		start_cfs_bandwidth(cfs_b, 0);
 
 		if (cfs_b->runtime > 0) {
 			amount = min(cfs_b->runtime, min_amount);
@@ -4957,7 +4970,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	if (cfs_b->idle && !throttled)
 		goto out_deactivate;
 
-	__refill_cfs_bandwidth_runtime(cfs_b);
+	__refill_cfs_bandwidth_runtime(cfs_b, overrun);
 
 	if (!throttled) {
 		/* mark as potentially idle for the upcoming period */
@@ -5181,6 +5194,7 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
 }
 
 extern const u64 max_cfs_quota_period;
+extern const u64 max_cfs_runtime;
 
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
@@ -5210,7 +5224,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 			new = old * 2;
 			if (new < max_cfs_quota_period) {
 				cfs_b->period = ns_to_ktime(new);
-				cfs_b->quota *= 2;
+				cfs_b->quota = min(cfs_b->quota * 2,
+						max_cfs_runtime);
+
+				cfs_b->buffer = min(max_cfs_runtime,
+						cfs_b->quota + cfs_b->burst);
+				/* Add 1 in case max_overrun becomes 0. */
+				cfs_b->max_overrun >>= 1;
+				cfs_b->max_overrun++;
 
 				pr_warn_ratelimited(
 	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
@@ -5259,16 +5280,26 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
-void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, int init)
 {
+	u64 overrun;
+
 	lockdep_assert_held(&cfs_b->lock);
 
 	if (cfs_b->period_active)
 		return;
 
 	cfs_b->period_active = 1;
-	hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
+
+	/*
+	 * When period timer stops, quota for the following period is not
+	 * refilled, however period timer is already forwarded. We should
+	 * accumulate quota once more than overrun here.
+	 */
+	if (!init)
+		__refill_cfs_bandwidth_runtime(cfs_b, overrun + 1);
 }
 
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a8772eca8cbb..ff8b5382485d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -366,6 +366,7 @@ struct cfs_bandwidth {
 	u64			runtime;
 	u64			burst;
 	u64			buffer;
+	u64			max_overrun;
 	s64			hierarchical_quota;
 
 	u8			idle;
@@ -476,8 +477,7 @@ extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 			struct sched_entity *parent);
 extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 
-extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
-extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
+extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, int init);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
 
 extern void free_rt_sched_group(struct task_group *tg);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afad085960b8..291dca62a571 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1842,6 +1842,24 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "sched_cfs_bw_burst_onset_percent",
+		.data		= &sysctl_sched_cfs_bw_burst_onset_percent,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &one_hundred,
+	},
+	{
+		.procname	= "sched_cfs_bw_burst_enabled",
+		.data		= &sysctl_sched_cfs_bw_burst_enabled,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 #endif
 #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
 	{
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 3/4] sched/fair: Add cfs bandwidth burst statistics
  2021-01-21 11:04 ` [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
  2021-01-21 11:04   ` [PATCH v3 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
  2021-01-21 11:04   ` [PATCH v3 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
@ 2021-01-21 11:04   ` Huaixin Chang
  2021-03-10 13:10     ` Peter Zijlstra
  2021-01-21 11:04   ` [PATCH v3 4/4] sched/fair: Add document for burstable CFS bandwidth control Huaixin Chang
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 44+ messages in thread
From: Huaixin Chang @ 2021-01-21 11:04 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, peterz, pjt, rostedt, shanpeic,
	vincent.guittot, xiyou.wangcong

Introduce statistics exports for the burstable cfs bandwidth
controller.

The following exports are included:

current_bw: current runtime in the global pool
nr_burst:   number of periods in which a bandwidth burst occurs
burst_time: cumulative wall-time that any CPUs have
	    used above quota in their respective periods
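
For example, reading cpu.stat for a group might then show (hypothetical
values):

	# cat /sys/fs/cgroup/cpu/test/cpu.stat
	nr_periods 1500
	nr_throttled 12
	throttled_time 223800000
	current_bw 50000000
	nr_burst 3
	burst_time 76000000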

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
---
 kernel/sched/core.c  |  6 ++++++
 kernel/sched/fair.c  | 12 +++++++++++-
 kernel/sched/sched.h |  3 +++
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fecf0f05ef0c..80ca763ca492 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7986,6 +7986,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 		cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
 	}
 
+	cfs_b->previous_runtime = cfs_b->runtime;
+
 	/* Restart the period timer (if active) to handle new period expiry: */
 	if (runtime_enabled)
 		start_cfs_bandwidth(cfs_b, 1);
@@ -8234,6 +8236,10 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
 		seq_printf(sf, "wait_sum %llu\n", ws);
 	}
 
+	seq_printf(sf, "current_bw %llu\n", cfs_b->runtime);
+	seq_printf(sf, "nr_burst %d\n", cfs_b->nr_burst);
+	seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
+
 	return 0;
 }
 #endif /* CONFIG_CFS_BANDWIDTH */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index abe6eb05fe09..e3536301df7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4601,7 +4601,7 @@ static inline u64 sched_cfs_bandwidth_slice(void)
 static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b,
 		u64 overrun)
 {
-	u64 refill;
+	u64 refill, runtime;
 
 	if (cfs_b->quota != RUNTIME_INF) {
 
@@ -4610,10 +4610,20 @@ static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b,
 			return;
 		}
 
+		if (cfs_b->previous_runtime > cfs_b->runtime) {
+			runtime = cfs_b->previous_runtime - cfs_b->runtime;
+			if (runtime > cfs_b->quota) {
+				cfs_b->burst_time += runtime - cfs_b->quota;
+				cfs_b->nr_burst++;
+			}
+		}
+
 		overrun = min(overrun, cfs_b->max_overrun);
 		refill = cfs_b->quota * overrun;
 		cfs_b->runtime += refill;
 		cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
+
+		cfs_b->previous_runtime = cfs_b->runtime;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ff8b5382485d..4adb984e5197 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -367,6 +367,7 @@ struct cfs_bandwidth {
 	u64			burst;
 	u64			buffer;
 	u64			max_overrun;
+	u64			previous_runtime;
 	s64			hierarchical_quota;
 
 	u8			idle;
@@ -379,7 +380,9 @@ struct cfs_bandwidth {
 	/* Statistics: */
 	int			nr_periods;
 	int			nr_throttled;
+	int			nr_burst;
 	u64			throttled_time;
+	u64			burst_time;
 #endif
 };
 
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 4/4] sched/fair: Add document for burstable CFS bandwidth control
  2021-01-21 11:04 ` [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
                     ` (2 preceding siblings ...)
  2021-01-21 11:04   ` [PATCH v3 3/4] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
@ 2021-01-21 11:04   ` Huaixin Chang
  2021-03-10 13:17     ` Peter Zijlstra
  2021-01-26 10:18   ` [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller changhuaixin
  2021-02-09 13:17   ` Odin Ugedal
  5 siblings, 1 reply; 44+ messages in thread
From: Huaixin Chang @ 2021-01-21 11:04 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, peterz, pjt, rostedt, shanpeic,
	vincent.guittot, xiyou.wangcong

Basic description of usage and effect for CFS Bandwidth Control Burst.

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
---
 Documentation/scheduler/sched-bwc.rst | 70 +++++++++++++++++++++++++++++++++--
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
index 9801d6b284b1..0933c66cc68b 100644
--- a/Documentation/scheduler/sched-bwc.rst
+++ b/Documentation/scheduler/sched-bwc.rst
@@ -21,18 +21,46 @@ cfs_quota units at each period boundary. As threads consume this bandwidth it
 is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
+By default, CPU bandwidth consumption is strictly limited to quota within each
+given period. For the sequence of CPU usage u_i served under CFS bandwidth
+control, where for any j <= k, N(j,k) is the number of periods from u_j to u_k:
+
+        u_j+...+u_k <= quota * N(j,k)
+
+For a bursty sequence in which the interval u_j...u_k is at the peak, CPU
+requests might have to wait for more periods to replenish enough quota.
+Otherwise, a larger quota is required.
+
+With "burst" buffer, CPU requests might be served as long as:
+
+        u_j+...+u_k <= B_j + quota * N(j,k)
+
+where for any j <= k, N(j,k) is the number of periods from u_j to u_k, and B_j
+is the quota accumulated from previous periods in the burst buffer serving u_j.
+The burst buffer helps in that whole bursty CPU requests can be served without
+throttling, given a moderate quota setting plus the quota accumulated in the
+burst buffer, if:
+
+        u_0+...+u_n <= B_0 + quota * N(0,n)
+
+where B_0 is the initial state of the burst buffer. The maximum accumulated quota
+in the burst buffer is capped by burst. With a proper burst setting, the available
+bandwidth is still determined by quota and period in the long run.
+
 Management
 ----------
-Quota and period are managed within the cpu subsystem via cgroupfs.
+Quota, period and burst are managed within the cpu subsystem via cgroupfs.
 
-cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
+cpu.cfs_quota_us: run-time replenished within a period (in microseconds)
 cpu.cfs_period_us: the length of a period (in microseconds)
+cpu.cfs_burst_us: the maximum accumulated run-time (in microseconds)
 cpu.stat: exports throttling statistics [explained further below]
 
 The default values are::
 
 	cpu.cfs_period_us=100ms
-	cpu.cfs_quota=-1
+	cpu.cfs_quota_us=-1
+	cpu.cfs_burst_us=0
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
@@ -48,6 +76,11 @@ more detail below.
 Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
 and return the group to an unconstrained state once more.
 
+A value of 0 for cpu.cfs_burst_us indicates that the group cannot accumulate
+any unused bandwidth, leaving the traditional CFS bandwidth control behavior
+unchanged. Writing any (valid) positive value into cpu.cfs_burst_us
+sets the cap on unused bandwidth accumulation.
+
 Any updates to a group's bandwidth specification will result in it becoming
 unthrottled if it is in a constrained state.
 
@@ -65,9 +98,21 @@ This is tunable via procfs::
 Larger slice values will reduce transfer overheads, while smaller values allow
 for more fine-grained consumption.
 
+There is also a global switch to turn off burst for all groups::
+       /proc/sys/kernel/sched_cfs_bw_burst_enabled (default=1)
+
+By default it is enabled. Writing a 0 value means no accumulated CPU time can be
+used for any group, even if cpu.cfs_burst_us is configured.
+
+Sometimes users might want a group to burst without accumulation. This is
+tunable via::
+       /proc/sys/kernel/sched_cfs_bw_burst_onset_percent (default=0)
+
+Up to 100% of the runtime of cpu.cfs_burst_us might be granted when bandwidth is set.
+
 Statistics
 ----------
-A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+A group's bandwidth statistics are exported via 6 fields in cpu.stat.
 
 cpu.stat:
 
@@ -75,6 +120,11 @@ cpu.stat:
 - nr_throttled: Number of times the group has been throttled/limited.
 - throttled_time: The total time duration (in nanoseconds) for which entities
   of the group have been throttled.
+- current_bw: Current runtime in global pool.
+- nr_burst: Number of periods burst occurs.
+- burst_time: Cumulative wall-time that any CPUs have used above quota in
+  their respective periods.
+
 
 This interface is read-only.
 
@@ -172,3 +222,15 @@ Examples
 
    By using a small period here we are ensuring a consistent latency
    response at the expense of burst capacity.
+
+4. Limit a group to 20% of 1 CPU, and additionally allow it to accumulate
+   and burst up to 60% of 1 CPU, provided accumulation has been done.
+
+   With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
+   And 30ms burst will be equivalent to 60% of 1 CPU.
+
+	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
+	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
+	# echo 30000 > cpu.cfs_burst_us /* burst = 30ms */
+
+   Larger buffer setting allows greater burst capacity.
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller
  2021-01-21 11:04 ` [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
                     ` (3 preceding siblings ...)
  2021-01-21 11:04   ` [PATCH v3 4/4] sched/fair: Add document for burstable CFS bandwidth control Huaixin Chang
@ 2021-01-26 10:18   ` changhuaixin
  2021-03-10 12:26     ` Peter Zijlstra
  2021-02-09 13:17   ` Odin Ugedal
  5 siblings, 1 reply; 44+ messages in thread
From: changhuaixin @ 2021-01-26 10:18 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, peterz, pjt, rostedt, shanpeic,
	vincent.guittot, xiyou.wangcong



> On Jan 21, 2021, at 7:04 PM, Huaixin Chang <changhuaixin@linux.alibaba.com> wrote:
> 
> Changelog
> 
> v3:
> 1. Fix another issue reported by test robot.
> 2. Update docs as Randy Dunlap suggested.
> 
> v2:
> 1. Fix an issue reported by test robot.
> 2. Rewriting docs. Appreciate any further suggestions or help.
> 
> The CFS bandwidth controller limits CPU requests of a task group to
> quota during each period. However, parallel workloads might be bursty
> so that they get throttled. And they are latency sensitive at the same
> time so that throttling them is undesired.
> 
> Scaling up period and quota allows greater burst capacity. But it might
> cause a longer stall until the next refill. We introduce "burst" to allow
> accumulating unused quota from previous periods, to be spent when a task
> group requests more CPU than quota during a specific period, thus
> allowing CPU time requests as long as the average requested CPU time
> stays below quota in the long run. The maximum accumulation is capped by
> burst and is set to 0 by default, so the traditional behaviour remains.
> 
> A huge drop of 99th tail latency from more than 500ms to 27ms is seen for
> real java workloads when using burst. Similar drops are seen when
> testing with schbench too:
> 
> 	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
> 	echo 700000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
> 	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
> 	echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
> 
> 	# The average CPU usage is around 500%, which is 200ms CPU time
> 	# every 40ms.
> 	./schbench -m 1 -t 30 -r 60 -c 10000 -R 500
> 
> 	Without burst:
> 
> 	Latency percentiles (usec)
> 	50.0000th: 7
> 	75.0000th: 8
> 	90.0000th: 9
> 	95.0000th: 10
> 	*99.0000th: 933
> 	99.5000th: 981
> 	99.9000th: 3068
> 	min=0, max=20054
> 	rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 9.33%
> 
> 	With burst:
> 
> 	Latency percentiles (usec)
> 	50.0000th: 7
> 	75.0000th: 8
> 	90.0000th: 9
> 	95.0000th: 9
> 	*99.0000th: 12
> 	99.5000th: 13
> 	99.9000th: 19
> 	min=0, max=406
> 	rps: 498.36 p95 (usec) 9 p99 (usec) 12 p95/cputime 0.09% p99/cputime 0.12%
> 
> How much workloads will benefit from burstable CFS bandwidth control
> depends on how bursty and how latency sensitive they are.
> 
> Previously, Cong Wang and Konstantin Khlebnikov proposed similar
> feature:
> https://lore.kernel.org/lkml/20180522062017.5193-1-xiyou.wangcong@gmail.com/
> https://lore.kernel.org/lkml/157476581065.5793.4518979877345136813.stgit@buzz/
> 
> This time we present more latency statistics and handle overflow while
> accumulating.
> 
> Huaixin Chang (4):
>  sched/fair: Introduce primitives for CFS bandwidth burst
>  sched/fair: Make CFS bandwidth controller burstable
>  sched/fair: Add cfs bandwidth burst statistics
>  sched/fair: Add document for burstable CFS bandwidth control
> 
> Documentation/scheduler/sched-bwc.rst |  49 +++++++++++--
> include/linux/sched/sysctl.h          |   2 +
> kernel/sched/core.c                   | 126 +++++++++++++++++++++++++++++-----
> kernel/sched/fair.c                   |  58 +++++++++++++---
> kernel/sched/sched.h                  |   9 ++-
> kernel/sysctl.c                       |  18 +++++
> 6 files changed, 232 insertions(+), 30 deletions(-)
> 
> -- 
> 2.14.4.44.g2045bb6

Ping, any new comments on this patchset? If there are no other concerns, I think it is ready to be merged?



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller
  2021-02-09 13:17   ` Odin Ugedal
@ 2021-02-09 10:28     ` Tejun Heo
  2021-02-27 13:48     ` changhuaixin
  1 sibling, 0 replies; 44+ messages in thread
From: Tejun Heo @ 2021-02-09 10:28 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: changhuaixin, bsegall, dietmar.eggemann, juri.lelli, khlebnikov,
	linux-kernel, mgorman, mingo, pauld, peterz, pjt, rostedt,
	shanpeic, vincent.guittot, xiyou.wangcong

Hello,

On Tue, Feb 09, 2021 at 02:17:19PM +0100, Odin Ugedal wrote:
> I am not that familiar with how cross-subsystem patches like these are handled,
> but I am still adding Tejun Heo (the cgroup maintainer) as a CC. Should we maybe
> cc cgroup@ as well?

Yeah, that'd be great. Given that it's a mostly straightforward extension of
an existing interface, things look fine from the cgroup side; however, please
do add the cgroup2 interface and documentation. One thing which has been
bothersome about the bandwidth interface is that we're exposing
implementation details (window size and now burst size) instead of
usage-centric requirements, but that boat has already sailed, so...

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller
  2021-01-21 11:04 ` [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
                     ` (4 preceding siblings ...)
  2021-01-26 10:18   ` [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller changhuaixin
@ 2021-02-09 13:17   ` Odin Ugedal
  2021-02-09 10:28     ` Tejun Heo
  2021-02-27 13:48     ` changhuaixin
  5 siblings, 2 replies; 44+ messages in thread
From: Odin Ugedal @ 2021-02-09 13:17 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, peterz, pjt, rostedt, shanpeic,
	vincent.guittot, xiyou.wangcong, tj


Hi! This looks quite useful, but I have a few quick thoughts. :)

I know of a lot of people who would love this (especially some
Kubernetes users)! I really like how this allows users to use cfs
in a more dynamic and flexible way, without interfering with those
who like to enforce strict quotas.


> +++ b/kernel/sched/core.c
> @ -7900,7 +7910,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64
> [...]
> +	/* burst_onset needed */
> +	if (cfs_b->quota != RUNTIME_INF &&
> +			sysctl_sched_cfs_bw_burst_enabled &&
> +			sysctl_sched_cfs_bw_burst_onset_percent > 0) {
> +
> +		burst_onset = do_div(burst, 100) *
> +			sysctl_sched_cfs_bw_burst_onset_percent;
> +
> +		cfs_b->runtime += burst_onset;
> +		cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
> +	}

I saw a comment about this behavior, but I think this can lead to a bit of
confusion. If sysctl_sched_cfs_bw_burst_onset_percent=0, the amount of
bandwidth when the first process starts up will depend on the time between
when the quota was set and the startup of the process, and that feels a bit
like a "timing" race that end user applications will then have to think about.

I suspect container runtimes and/or tools like Kubernetes will then have
to tell users to set the value to a certain amount in order to make it
work as expected.

Another thing is that when a cgroup has saved some time into the
"burst quota", updating the quota, period or burst will then reset the
"burst quota", even though eg. only the burst was changed. Some tools
use dynamic quotas, resulting in multiple changes in the quota over time,
and I am a bit scared that don't allowing them to control "start burst"
on a write can be limiting.

Maybe we can allow people to set the "start bandwidth" explicitly when setting
cfs_burst if they want to do that? (edit: that might be hard for cgroup v1, but
would I think that is a good solution on cgroup v2).

This is however just my thoughts, and I am not 100% sure about what the
best solution is, but if we merge a certain behavior, we have no real
chance of changing it later.


> +++ b/kernel/sched/sched.h
> @@ -367,6 +367,7 @@ struct cfs_bandwidth {
>  	u64			burst;
>  	u64			buffer;
>  	u64			max_overrun;
> +	u64			previous_runtime;
>  	s64			hierarchical_quota;

Maybe indicate that this was the remaining runtime _after_ the previous
period ended? Not 100% sure, but maybe something like
'remaining_runtime_prev_period' or 'end_runtime_prev_period' (as inspiration).


> +++ b/kernel/sched/core.c
> @@ -8234,6 +8236,10 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
>  		seq_printf(sf, "wait_sum %llu\n", ws);
>  	}
>  
> +	seq_printf(sf, "current_bw %llu\n", cfs_b->runtime);
> +	seq_printf(sf, "nr_burst %d\n", cfs_b->nr_burst);
> +	seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
> +
>  	return 0;
>  }

Looks like these metrics are missing from the cgroup v2 stats.

Are we sure it is smart to start exposing cfs_b->runtime, since it makes it
harder to change the implementation at a later time? I don't think it is that
useful, and if it is only exposed for debugging purposes people can probably
use kprobes instead? Also, it would not be useful unless you know how much
wall time is left in the current period. In that sense,
cfs_b->previous_runtime would probably be more useful, but I am still not sure if
it deserves to be exposed to end users like this.

Also, will "cfs_b->runtime" keep updating if no processes are running, or
will it stay the same here, but update (with burst via timer overrun)
when a process starts again? If so, the runtime available when a process
starts on cgroup init can be hard to communicate if the value here doesn't
update.


> +++ b/kernel/sched/fair.c
> +void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, int init)
> [...]
> +	/*
> +	 * When period timer stops, quota for the following period is not
> +	 * refilled, however period timer is already forwarded. We should
> +	 * accumulate quota once more than overrun here.
> +	 */


Trying to wrap my head around this one... Is not refilling here, as was the
behavior before your patch, causing a "loss" in runtime and possibly causing
an unnecessary cgroup throttle?


I am not that familiar with how cross-subsystem patches like these are handled,
but I am still adding Tejun Heo (the cgroup maintainer) as a CC. Should we maybe
cc cgroup@ as well?

Sorry for the long mail; in retrospect it should have been one per patch...

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller
  2021-02-09 13:17   ` Odin Ugedal
  2021-02-09 10:28     ` Tejun Heo
@ 2021-02-27 13:48     ` changhuaixin
  2021-03-10 11:11       ` Odin Ugedal
  1 sibling, 1 reply; 44+ messages in thread
From: changhuaixin @ 2021-02-27 13:48 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: changhuaixin, bsegall, dietmar.eggemann, juri.lelli, khlebnikov,
	linux-kernel, mgorman, mingo, pauld, peterz, pjt, rostedt,
	shanpeic, vincent.guittot, xiyou.wangcong, tj

Hi,

Sorry for my late reply.

> On Feb 9, 2021, at 9:17 PM, Odin Ugedal <odin@uged.al> wrote:
> 
> 
> Hi! This looks quite useful, but I have a few quick thoughts. :)
> 
> I know of a lot of people who would love this (especially some
> Kubernetes users)! I really like how this allows users to use cfs
> in a more dynamic and flexible way, without interfering with those
> who like to enforce strict quotas.
> 
> 
>> +++ b/kernel/sched/core.c
>> @ -7900,7 +7910,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64
>> [...]
>> +	/* burst_onset needed */
>> +	if (cfs_b->quota != RUNTIME_INF &&
>> +			sysctl_sched_cfs_bw_burst_enabled &&
>> +			sysctl_sched_cfs_bw_burst_onset_percent > 0) {
>> +
>> +		burst_onset = do_div(burst, 100) *
>> +			sysctl_sched_cfs_bw_burst_onset_percent;
>> +
>> +		cfs_b->runtime += burst_onset;
>> +		cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
>> +	}
> 
> I saw a comment about this behavior, but I think this can lead to a bit of
> confusion. If sysctl_sched_cfs_bw_burst_onset_percent=0, the amount of
> bandwidth when the first process starts up will depend on the time between
> when the quota was set and the startup of the process, and that feels a bit
> like a "timing" race that end user applications will then have to think about.
> 
> I suspect container runtimes and/or tools like Kubernetes will then have
> to tell users to set the value to a certain amount in order to make it
> work as expected.
> 
> Another thing is that when a cgroup has saved some time into the
> "burst quota", updating the quota, period or burst will then reset the
> "burst quota", even though e.g. only the burst was changed. Some tools
> use dynamic quotas, resulting in multiple changes in the quota over time,
> and I am a bit scared that not allowing them to control the "start burst"
> on a write can be limiting.
> 
> Maybe we can allow people to set the "start bandwidth" explicitly when setting
> cfs_burst if they want to do that? (edit: that might be hard for cgroup v1, but
> I think that would be a good solution on cgroup v2).
> 
> These are however just my thoughts, and I am not 100% sure about what the
> best solution is, but if we merge a certain behavior, we have no real
> chance of changing it later.
> 

If there are cases where the "start bandwidth" matters, I think there is a need
to expose the "start bandwidth" explicitly too. However, from my view and the
two examples above, I doubt the existence of such cases.

In my thoughts, this patchset keeps cgroup usage within the quota in the longer term,
and allows a cgroup to respond to a burst of work with the help of a reasonable burst
buffer. If quota is set correctly above average usage, and enough burst buffer is set
to meet the needs of bursty work, it makes no difference whether this cgroup runs with
0 start bandwidth or all of it. Thus I used sysctl_sched_cfs_bw_burst_onset_percent
to decide the start bandwidth, to leave some convenience here. If this sysctl interface
is confusing, I wonder whether it is a good idea not to expose this interface.

For the first case mentioned above, if Kubernetes users care about the "start bandwidth"
for process startup, maybe it is better to give all of it rather than a part?

For the second case with quota changes over time, I think it is important to make sure
each change is in effect long enough to enforce the average quota limit. Does it really
matter to control the "start burst" on each change?
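
To make the averaging argument concrete, here is a hypothetical configuration
(the numbers are made up for illustration, not taken from any benchmark):

	# period = 100ms, quota = 100ms (1 CPU on average), burst = 300ms
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
	echo 300000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

	# A workload that idles for three periods and then demands 4 CPUs
	# for one period needs 400ms of CPU time in that period: 100ms from
	# the period's quota plus the 300ms accumulated while idle. The
	# long-term average stays at quota, and whether the group started
	# with 0 or full bandwidth only affects the very first burst.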



> 
>> +++ b/kernel/sched/sched.h
>> @@ -367,6 +367,7 @@ struct cfs_bandwidth {
>> 	u64			burst;
>> 	u64			buffer;
>> 	u64			max_overrun;
>> +	u64			previous_runtime;
>> 	s64			hierarchical_quota;
> 
> Maybe indicate that this was the remaining runtime _after_ the previous
> period ended? Not 100% sure, but maybe sometihing like
> 'remaining_runtime_prev_period' or 'end_runtime_prev_period'(as inspiration).   
> 

It is a copy of the runtime at period start, and it is used to calculate burst time
during a period. It is not quite remaining_runtime_prev_period.

> 
>> +++ b/kernel/sched/core.c
>> @@ -8234,6 +8236,10 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
>> 		seq_printf(sf, "wait_sum %llu\n", ws);
>> 	}
>> 
>> +	seq_printf(sf, "current_bw %llu\n", cfs_b->runtime);
>> +	seq_printf(sf, "nr_burst %d\n", cfs_b->nr_burst);
>> +	seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
>> +
>> 	return 0;
>> }
> 
> Looks like these metrics are missing from the cgroup v2 stats.
> 

Thanks, the cgroup v2 stats and documentation should be handled. I will add them
in the next version of this patchset.

> Are we sure it is smart to start exposing cfs_b->runtime, since it makes it
> harder to change the implementation at a later time? I don't think it is that
> useful, and if it is only exposed for debugging purposes people can probably
> use kprobes instead? Also, it would not be useful unless you know how much
> wall time is left in the current period. In that sense,
> cfs_b->previous_runtime would probably be more useful, but I am still not
> sure it deserves to be exposed to end users like this.
> 
> Also, will "cfs_b->runtime" keep updating if no processes are running, or
> will it be the same here, but update (with burst via timer overrun)
> when a process starts again? If so, the runtime available when a process
> starts on cgroup init can be hard to communicate if the value here doesn't
> update.
> 

Yeah, there is the updating problem. It is okay not to expose cfs_b->runtime then.

> 
>> +++ b/kernel/sched/fair.c
>> +void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, int init)
>> [...]
>> +	/*
>> +	 * When period timer stops, quota for the following period is not
>> +	 * refilled, however period timer is already forwarded. We should
>> +	 * accumulate quota once more than overrun here.
>> +	 */
> 
> 
> Trying to wrap my head around this one... Is not refilling here, as was the
> behavior before your patch, causing a "loss" in runtime and possibly causing
> an unnecessary cgroup throttle?
> 

This comment does not mean any loss or any unnecessary throttle for the present cfs_b.
All it means is that any quota refilling that was not done while the timer was stopped
should be refilled on timer start, for the burstable cfs_b.

Maybe I should change this comment in some way if it is misleading?
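
Roughly, the case the comment is about looks like this (a sketch, not exact
timer semantics):

	period 0:     tasks runnable, timer active, quota refilled as usual
	periods 1..n: no tasks runnable, the period timer stops, no refill
	later:        a task wakes up; start_cfs_bandwidth() forwards the
	              timer past the missed expiries (returning 'overrun')
	              and refills overrun + 1 periods worth of quota, with
	              the result capped by the buffer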

> 
> I am not that familiar with how cross-subsystem patches like these are handled,
> but I am still adding Tejun Heo (the cgroup maintainer) as a CC. Should it
> maybe be cc'd to cgroup@ as well?
> 
> Sorry for a long mail, in retrospect it should have been one per patch...
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller
  2021-02-27 13:48     ` changhuaixin
@ 2021-03-10 11:11       ` Odin Ugedal
  2021-03-12 13:26         ` changhuaixin
  0 siblings, 1 reply; 44+ messages in thread
From: Odin Ugedal @ 2021-03-10 11:11 UTC (permalink / raw)
  To: changhuaixin
  Cc: Odin Ugedal, Benjamin Segall, dietmar.eggemann, juri.lelli,
	khlebnikov, open list, mgorman, mingo, pauld, peterz,
	Paul Turner, rostedt, shanpeic, Vincent Guittot, xiyou.wangcong,
	Tejun Heo

Hi,

> If there are cases where the "start bandwidth" matters, I think there is a need
> to expose the "start bandwidth" explicitly too. However, from my view and the
> two examples above, I doubt the existence of such cases.

Yeah, I don't think there will be any cases where users will be
"depending" on having burst available,
so I agree in that sense.

> In my thoughts, this patchset keeps cgroup usage within the quota in the longer term,
> and allows a cgroup to respond to a burst of work with the help of a reasonable burst
> buffer. If quota is set correctly above average usage, and enough burst buffer is set
> to meet the needs of bursty work, it makes no difference whether this cgroup runs with
> 0 start bandwidth or all of it. Thus I used sysctl_sched_cfs_bw_burst_onset_percent
> to decide the start bandwidth, to leave some convenience here. If this sysctl interface
> is confusing, I wonder whether it is a good idea not to expose this interface.
>
> For the first case mentioned above, if Kubernetes users care about the "start bandwidth"
> for process startup, maybe it is better to give all of it rather than a part?

Yeah, I am a bit afraid there will be some confusion, so not sure if
the sysctl is the best way to do it.

But I would like feedback from others to highlight the problem as well;
that would be helpful. I think a simple "API" where you get 0 burst or
full burst on "set" (the one we decide on) would be best to avoid
unnecessary complexity.

Start burst when starting up a new process in a new cgroup might be helpful,
so maybe that is a vote for full burst? However, in the long term that
doesn't matter, so 0 burst on start would work as well.

> For the second case with quota changes over time, I think it is important to make sure
> each change is in effect long enough to enforce the average quota limit. Does it really
> matter to control the "start burst" on each change?

No, I don't think so. Doing so would be another thing to set per cgroup, and
that would just clutter the API more than necessary, IMO, since we cannot
come up with any real use cases.

> It is a copy of the runtime at period start, and it is used to calculate burst time
> during a period. It is not quite remaining_runtime_prev_period.

Ahh, I see, I misunderstood the code. So in essence it is
"runtime_at_period_start"?

> Yeah, there is the updating problem. It is okay not to expose cfs_b->runtime then.

Yeah, I think dropping it altogether is the best solution.


> This comment does not mean any loss or any unnecessary throttle for the present cfs_b.
> All it means is that any quota refilling that was not done while the timer was stopped
> should be refilled on timer start, for the burstable cfs_b.
>
> Maybe I should change this comment in some way if it is misleading?

I think I formulated my question badly. The comment makes sense; I am just
trying to compare how "start_cfs_bandwidth" works after your patch with how
it works currently. As I understand it, without this patch
"start_cfs_bandwidth" will never refill runtime, while with your patch it
will refill even when overrun=0 with burst disabled. Is that an intended
change in behavior, or am I not understanding the patch?


On another note, I have also been testing this patch, and I am not
able to reproduce your schbench results. Both with and without burst,
it gets the same result, and nr_throttled stays at 0 (tested on a
32-core system). Can you try to rerun your tests with the mainline
to see if you still get the same results? (Also, I see you are running
with 30 threads. How many cores does your test setup have?) To actually
say that the result is real, all cores used should maybe be exclusively
reserved as well, to avoid issues where other processes cause a spike
in latency.


Odin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller
  2021-01-26 10:18   ` [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller changhuaixin
@ 2021-03-10 12:26     ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-03-10 12:26 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, pjt, rostedt, shanpeic, vincent.guittot,
	xiyou.wangcong

On Tue, Jan 26, 2021 at 06:18:59PM +0800, changhuaixin wrote:
> Ping, any new comments on this patchset? If there're no other
> concerns, I think it's ready to be merged?

I only found this by accident...

Thread got lost because you're posting new series as replies to older
series. Please don't do that, it's crap.

I'll go have a look.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 1/4] sched/fair: Introduce primitives for CFS bandwidth burst
  2021-01-21 11:04   ` [PATCH v3 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
@ 2021-03-10 12:39     ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-03-10 12:39 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, pjt, rostedt, shanpeic, vincent.guittot,
	xiyou.wangcong

On Thu, Jan 21, 2021 at 07:04:50PM +0800, Huaixin Chang wrote:
> In this patch, we introduce the notion of CFS bandwidth burst. Unused
> "quota" from previous "periods" might be accumulated and used in the
> following "periods". The maximum amount of accumulated bandwidth is
> bounded by "burst". And the maximum amount of CPU a group can consume in
> a given period is "buffer", which is equivalent to "quota" + "burst" in
> case that this group has done enough accumulation.
> 
> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>

Not a valid SoB chain.

> ---
>  kernel/sched/core.c  | 91 ++++++++++++++++++++++++++++++++++++++++++++--------
>  kernel/sched/fair.c  |  2 ++
>  kernel/sched/sched.h |  2 ++
>  3 files changed, 82 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e7e453492cff..48d3bad12be2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7895,10 +7895,12 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
>  
>  static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
>  
> -static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
> +							u64 burst)

Non standard indentation.

>  {
>  	int i, ret = 0, runtime_enabled, runtime_was_enabled;
>  	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
> +	u64 buffer;
>  
>  	if (tg == &root_task_group)
>  		return -EINVAL;
> @@ -7925,6 +7927,16 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>  	if (quota != RUNTIME_INF && quota > max_cfs_runtime)
>  		return -EINVAL;
>  
> +	/*
> +	 * Bound burst to defend burst against overflow during bandwidth shift.
> +	 */
> +	if (burst > max_cfs_runtime)
> +		return -EINVAL;
> +
> +	if (quota == RUNTIME_INF)
> +		buffer = RUNTIME_INF;
> +	else
> +		buffer = min(max_cfs_runtime, quota + burst);

Why do we care about buffer when RUNTIME_INF?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 2/4] sched/fair: Make CFS bandwidth controller burstable
  2021-01-21 11:04   ` [PATCH v3 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
@ 2021-03-10 13:04     ` Peter Zijlstra
  2021-03-12 13:54       ` changhuaixin
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2021-03-10 13:04 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, pjt, rostedt, shanpeic, vincent.guittot,
	xiyou.wangcong

On Thu, Jan 21, 2021 at 07:04:51PM +0800, Huaixin Chang wrote:
> Accumulate unused quota from previous periods, thus accumulated
> bandwidth runtime can be used in the following periods. During
> accumulation, take care of runtime overflow. The previous non-burstable
> CFS bandwidth controller only assigns quota to runtime, which saves a lot.
> 
> A sysctl parameter sysctl_sched_cfs_bw_burst_onset_percent is introduced to
> denote what percentage of burst is given on setting cfs bandwidth. By
> default it is 0, which means no burst is allowed unless accumulated.
> 
> Also, parameter sysctl_sched_cfs_bw_burst_enabled is introduced as a
> switch for burst. It is enabled by default.
> 
> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>

Identical invalid SoB chain.

> Reported-by: kernel test robot <lkp@intel.com>

What exactly did the robot report; the whole patch?

> ---
>  include/linux/sched/sysctl.h |  2 ++
>  kernel/sched/core.c          | 31 +++++++++++++++++++++++++----
>  kernel/sched/fair.c          | 47 ++++++++++++++++++++++++++++++++++++--------
>  kernel/sched/sched.h         |  4 ++--
>  kernel/sysctl.c              | 18 +++++++++++++++++
>  5 files changed, 88 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index 3c31ba88aca5..3400828eaf2d 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -72,6 +72,8 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
>  extern unsigned int sysctl_sched_cfs_bandwidth_slice;
> +extern unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
> +extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
>  #endif
>  
>  #ifdef CONFIG_SCHED_AUTOGROUP
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 48d3bad12be2..fecf0f05ef0c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -66,6 +66,16 @@ const_debug unsigned int sysctl_sched_features =
>   */
>  const_debug unsigned int sysctl_sched_nr_migrate = 32;
>  
> +#ifdef CONFIG_CFS_BANDWIDTH
> +/*
> + * Percent of burst assigned to cfs_b->runtime on tg_set_cfs_bandwidth,
> + * 0 by default.
> + */
> +unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
> +
> +unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
> +#endif

There's already an #ifdef block that contains that bandwidth_slice
thing, see the previous hunk, so why create a new #ifdef here?

Also, personally I think percentages are over-represented as members of
Q.

> @@ -7891,7 +7901,7 @@ static DEFINE_MUTEX(cfs_constraints_mutex);
>  const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
>  static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
>  /* More than 203 days if BW_SHIFT equals 20. */
> -static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
> +const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
>  
>  static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
>  
> @@ -7900,7 +7910,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>  {
>  	int i, ret = 0, runtime_enabled, runtime_was_enabled;
>  	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
> -	u64 buffer;
> +	u64 buffer, burst_onset;
>  
>  	if (tg == &root_task_group)
>  		return -EINVAL;
> @@ -7961,11 +7971,24 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>  	cfs_b->burst = burst;
>  	cfs_b->buffer = buffer;
>  
> -	__refill_cfs_bandwidth_runtime(cfs_b);
> +	cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
> +	cfs_b->runtime = cfs_b->quota;
> +
> +	/* burst_onset needed */
> +	if (cfs_b->quota != RUNTIME_INF &&
> +			sysctl_sched_cfs_bw_burst_enabled &&
> +			sysctl_sched_cfs_bw_burst_onset_percent > 0) {

'creative' indentation again...

Also, this gives rise to the question as to why onset_percent is
separate from enabled.

> +
> +		burst_onset = do_div(burst, 100) *
> +			sysctl_sched_cfs_bw_burst_onset_percent;

and again..

> +
> +		cfs_b->runtime += burst_onset;
> +		cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
> +	}
>  
>  	/* Restart the period timer (if active) to handle new period expiry: */
>  	if (runtime_enabled)
> -		start_cfs_bandwidth(cfs_b);
> +		start_cfs_bandwidth(cfs_b, 1);
>  
>  	raw_spin_unlock_irq(&cfs_b->lock);
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6bb4f89259fd..abe6eb05fe09 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4598,10 +4598,23 @@ static inline u64 sched_cfs_bandwidth_slice(void)
>   *
>   * requires cfs_b->lock
>   */
> -void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
> +static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b,
> +		u64 overrun)
>  {
> -	if (cfs_b->quota != RUNTIME_INF)
> -		cfs_b->runtime = cfs_b->quota;
> +	u64 refill;
> +
> +	if (cfs_b->quota != RUNTIME_INF) {
> +
> +		if (!sysctl_sched_cfs_bw_burst_enabled) {
> +			cfs_b->runtime = cfs_b->quota;
> +			return;
> +		}
> +
> +		overrun = min(overrun, cfs_b->max_overrun);
> +		refill = cfs_b->quota * overrun;
> +		cfs_b->runtime += refill;
> +		cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
> +	}
>  }
>  
>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> @@ -4623,7 +4636,7 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
>  	if (cfs_b->quota == RUNTIME_INF)
>  		amount = min_amount;
>  	else {
> -		start_cfs_bandwidth(cfs_b);
> +		start_cfs_bandwidth(cfs_b, 0);
>  
>  		if (cfs_b->runtime > 0) {
>  			amount = min(cfs_b->runtime, min_amount);
> @@ -4957,7 +4970,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
>  	if (cfs_b->idle && !throttled)
>  		goto out_deactivate;
>  
> -	__refill_cfs_bandwidth_runtime(cfs_b);
> +	__refill_cfs_bandwidth_runtime(cfs_b, overrun);
>  
>  	if (!throttled) {
>  		/* mark as potentially idle for the upcoming period */
> @@ -5181,6 +5194,7 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>  }
>  
>  extern const u64 max_cfs_quota_period;
> +extern const u64 max_cfs_runtime;
>  
>  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  {
> @@ -5210,7 +5224,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  			new = old * 2;
>  			if (new < max_cfs_quota_period) {
>  				cfs_b->period = ns_to_ktime(new);
> -				cfs_b->quota *= 2;
> +				cfs_b->quota = min(cfs_b->quota * 2,
> +						max_cfs_runtime);

again, broken indent

> +
> +				cfs_b->buffer = min(max_cfs_runtime,
> +						cfs_b->quota + cfs_b->burst);

and again..

> +				/* Add 1 in case max_overrun becomes 0. */

A better comment would explain *why* 0 is a problem; and possibly
include a reference to the code that cares
(__refill_cfs_bandwidth_runtime() afaict).

> +				cfs_b->max_overrun >>= 1;
> +				cfs_b->max_overrun++;
>  
>  				pr_warn_ratelimited(
>  	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 3/4] sched/fair: Add cfs bandwidth burst statistics
  2021-01-21 11:04   ` [PATCH v3 3/4] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
@ 2021-03-10 13:10     ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-03-10 13:10 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, pjt, rostedt, shanpeic, vincent.guittot,
	xiyou.wangcong

On Thu, Jan 21, 2021 at 07:04:52PM +0800, Huaixin Chang wrote:
> Introduce statistics exports for the burstable cfs bandwidth
> controller.
> 
> The following exports are included:
> 
> current_bw: current runtime in global pool
> nr_burst:   number of periods bandwidth burst occurs
> burst_time: cumulative wall-time that any cpu has
> 	    used above quota in respective periods
> 
> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>

Consistently fail.

> ---
>  kernel/sched/core.c  |  6 ++++++
>  kernel/sched/fair.c  | 12 +++++++++++-
>  kernel/sched/sched.h |  3 +++
>  3 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fecf0f05ef0c..80ca763ca492 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7986,6 +7986,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>  		cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
>  	}
>  
> +	cfs_b->previous_runtime = cfs_b->runtime;
> +
>  	/* Restart the period timer (if active) to handle new period expiry: */
>  	if (runtime_enabled)
>  		start_cfs_bandwidth(cfs_b, 1);
> @@ -8234,6 +8236,10 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
>  		seq_printf(sf, "wait_sum %llu\n", ws);
>  	}
>  
> +	seq_printf(sf, "current_bw %llu\n", cfs_b->runtime);
> +	seq_printf(sf, "nr_burst %d\n", cfs_b->nr_burst);
> +	seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
> +
>  	return 0;
>  }
>  #endif /* CONFIG_CFS_BANDWIDTH */

This is ABI; and the Changelog has no justification whatsoever...


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 4/4] sched/fair: Add document for burstable CFS bandwidth control
  2021-01-21 11:04   ` [PATCH v3 4/4] sched/fair: Add document for burstable CFS bandwidth control Huaixin Chang
@ 2021-03-10 13:17     ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-03-10 13:17 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: bsegall, dietmar.eggemann, juri.lelli, khlebnikov, linux-kernel,
	mgorman, mingo, pauld, pjt, rostedt, shanpeic, vincent.guittot,
	xiyou.wangcong

On Thu, Jan 21, 2021 at 07:04:53PM +0800, Huaixin Chang wrote:
> Basic description of usage and effect for CFS Bandwidth Control Burst.
> 
> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>

Guess :-)

> +Sometimes users might want a group to burst without accumulation. This is
> +tunable via::
> +       /proc/sys/kernel/sched_cfs_bw_burst_onset_percent (default=0)
> +
> +Up to 100% runtime of cpu.cfs_burst_us might be given on setting bandwidth.

Sometimes is a very crap reason for code to exist. Also, everything is
in _us, why do we have this one thing as a percent?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller
  2021-03-10 11:11       ` Odin Ugedal
@ 2021-03-12 13:26         ` changhuaixin
  0 siblings, 0 replies; 44+ messages in thread
From: changhuaixin @ 2021-03-12 13:26 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: changhuaixin, Odin Ugedal, Benjamin Segall, dietmar.eggemann,
	juri.lelli, khlebnikov, open list, mgorman, mingo, pauld, peterz,
	Paul Turner, rostedt, shanpeic, Vincent Guittot, xiyou.wangcong,
	Tejun Heo



> On Mar 10, 2021, at 7:11 PM, Odin Ugedal <odin@ugedal.com> wrote:
> 
> Hi,
> 
>> If there are cases where the "start bandwidth" matters, I think there is a need
>> to expose the "start bandwidth" explicitly too. However, from my view and the
>> two examples above, I doubt the existence of such cases.
> 
> Yeah, I don't think there will be any cases where users will be
> "depending" on having burst available,
> so I agree in that sense.
> 
>> In my thoughts, this patchset keeps cgroup usage within the quota in the longer term,
>> and allows a cgroup to respond to a burst of work with the help of a reasonable burst
>> buffer. If quota is set correctly above average usage, and enough burst buffer is set
>> to meet the needs of bursty work, it makes no difference whether this cgroup runs with
>> 0 start bandwidth or all of it. Thus I used sysctl_sched_cfs_bw_burst_onset_percent
>> to decide the start bandwidth, to leave some convenience here. If this sysctl interface
>> is confusing, I wonder whether it is a good idea not to expose this interface.
>> 
>> For the first case mentioned above, if Kubernetes users care about the "start bandwidth"
>> for process startup, maybe it is better to give all of it rather than a part?
> 
> Yeah, I am a bit afraid there will be some confusion, so not sure if
> the sysctl is the best way to do it.
> 
> But I would like feedback from others to highlight the problem as well;
> that would be helpful. I think a simple "API" where you get 0 burst or
> full burst on "set" (the one we decide on) would be best to avoid
> unnecessary complexity.
> 
> Start burst when starting up a new process in a new cgroup might be helpful,
> so maybe that is a vote for full burst? However, in the long term that
> doesn't matter, so 0 burst on start would work as well.
> 
>> For the second case with quota changes over time, I think it is important to make sure
>> each change is in effect long enough to enforce the average quota limit. Does it really
>> matter to control the "start burst" on each change?
> 
> No, I don't think so. Doing so would be another thing to set per cgroup, and
> that would just clutter the API more than necessary, IMO, since we cannot
> come up with any real use cases.
> 
>> It is a copy of the runtime at period start, and it is used to calculate burst time
>> during a period. It is not quite remaining_runtime_prev_period.
> 
> Ahh, I see, I misunderstood the code. So in essence it is
> "runtime_at_period_start"?
> 

Yes, it is "runtime_at_period_start".

>> Yeah, there is the updating problem. It is okay not to expose cfs_b->runtime then.
> 
> Yeah, I think dropping it altogether is the best solution.
> 
> 
>> This comment does not mean any loss or any unnecessary throttle for the present cfs_b.
>> All it means is that any quota refilling that was not done while the timer was stopped
>> should be refilled on timer start, for the burstable cfs_b.
>> 
>> Maybe I should change this comment in some way if it is misleading?
> 
> I think I formulated my question badly. The comment makes sense; I am just
> trying to compare how "start_cfs_bandwidth" works after your patch with how
> it works currently. As I understand it, without this patch
> "start_cfs_bandwidth" will never refill runtime, while with your patch it
> will refill even when overrun=0 with burst disabled. Is that an intended
> change in behavior, or am I not understanding the patch?
> 

Good point. The way "start_cfs_bandwidth" works is indeed changed. The present cfs_b
doesn't have to refill bandwidth there, because quota is not consumed during the
periods before the timer stops. With this patch, runtime is refilled whether burst is
enabled or not. Do you suggest not refilling runtime here unless burst is enabled?
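
If that is preferable, one possible way to keep the old behaviour when burst
is off would be to guard the refill on the sysctl (an untested sketch against
this series, not something I have run):

	/* in start_cfs_bandwidth(): skip the refill when burst is disabled */
	if (!init && sysctl_sched_cfs_bw_burst_enabled)
		__refill_cfs_bandwidth_runtime(cfs_b, overrun + 1);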

> 
> On another note, I have also been testing this patch, and I am not
> able to reproduce your schbench results. Both with and without burst,
> it gets the same result, and nr_throttled stays at 0 (tested on a
> 32-core system). Can you try to rerun your tests with the mainline
> to see if you still get the same results? (Also, I see you are running
> with 30 threads. How many cores does your test setup have?) To actually
> say that the result is real, all cores used should maybe be exclusively
> reserved as well, to avoid issues where other processes cause a spike
> in latency.
> 

Spikes indeed cause trouble. If nr_throttled stays at 0, I suggest changing quota from
700000 to 600000, which is still above the average utilization of 500%. I have rerun on
a 64-core system and reproduced the results. And I think it should work on a 32-core
system too, as there are 20 active workers in each round.

If you still have trouble, I suggest testing in the following way. It should work on a
two-core system.

mkdir /sys/fs/cgroup/cpu/test
echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
echo 300000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

./schbench -m 1 -t 3 -r 20 -c 200000 -R 4

On my machine, two workers work for 200ms and sleep for 300ms in each round. The average utilization is
around 80%.
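
For reference, the arithmetic behind that (assuming the default 100ms period
and that the two workers run concurrently):

	demand per round: 2 workers * 200ms = 400ms CPU over a 500ms round
	quota refill:     5 periods * 100ms = 500ms -> 80% average usage

	During the 200ms busy window the group needs 400ms of CPU time but
	only receives 200ms of quota; the remaining 200ms must come from
	the accumulated burst (capped at 300ms here). So with burst there
	should be no throttling, while without it nr_throttled should rise.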
  

> 
> Odin


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 2/4] sched/fair: Make CFS bandwidth controller burstable
  2021-03-10 13:04     ` Peter Zijlstra
@ 2021-03-12 13:54       ` changhuaixin
  2021-03-16 10:55         ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: changhuaixin @ 2021-03-12 13:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: changhuaixin, Benjamin Segall, dietmar.eggemann, juri.lelli,
	khlebnikov, open list, mgorman, mingo, pauld, Paul Turner,
	rostedt, Shanpei Chen, Vincent Guittot, xiyou.wangcong



> On Mar 10, 2021, at 9:04 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Jan 21, 2021 at 07:04:51PM +0800, Huaixin Chang wrote:
>> Accumulate unused quota from previous periods, thus accumulated
>> bandwidth runtime can be used in the following periods. During
>> accumulation, take care of runtime overflow. The previous non-burstable
>> CFS bandwidth controller only assigns quota to runtime, which saves a lot.
>> 
>> A sysctl parameter sysctl_sched_cfs_bw_burst_onset_percent is introduced to
>> denote what percentage of burst is given on setting cfs bandwidth. By
>> default it is 0, which means no burst is allowed unless accumulated.
>> 
>> Also, parameter sysctl_sched_cfs_bw_burst_enabled is introduced as a
>> switch for burst. It is enabled by default.
>> 
>> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
>> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
> 
> Identical invalid SoB chain.
> 
>> Reported-by: kernel test robot <lkp@intel.com>
> 
> What exactly did the robot report; the whole patch?

A warning was reported by the robot, and I have fixed it in this series. I'll remove
this line, since it seems unnecessary.

> 
>> ---
>> include/linux/sched/sysctl.h |  2 ++
>> kernel/sched/core.c          | 31 +++++++++++++++++++++++++----
>> kernel/sched/fair.c          | 47 ++++++++++++++++++++++++++++++++++++--------
>> kernel/sched/sched.h         |  4 ++--
>> kernel/sysctl.c              | 18 +++++++++++++++++
>> 5 files changed, 88 insertions(+), 14 deletions(-)
>> 
>> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
>> index 3c31ba88aca5..3400828eaf2d 100644
>> --- a/include/linux/sched/sysctl.h
>> +++ b/include/linux/sched/sysctl.h
>> @@ -72,6 +72,8 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
>> 
>> #ifdef CONFIG_CFS_BANDWIDTH
>> extern unsigned int sysctl_sched_cfs_bandwidth_slice;
>> +extern unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
>> +extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
>> #endif
>> 
>> #ifdef CONFIG_SCHED_AUTOGROUP
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 48d3bad12be2..fecf0f05ef0c 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -66,6 +66,16 @@ const_debug unsigned int sysctl_sched_features =
>>  */
>> const_debug unsigned int sysctl_sched_nr_migrate = 32;
>> 
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +/*
>> + * Percent of burst assigned to cfs_b->runtime on tg_set_cfs_bandwidth,
>> + * 0 by default.
>> + */
>> +unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
>> +
>> +unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
>> +#endif
> 
> There's already an #ifdef block that contains that bandwidth_slice
> thing, see the previous hunk, so why create a new #ifdef here?
> 
> Also, personally I think percentages are over-represented as members of
> Q.
> 
Sorry, I don't quite understand the "members of Q". Is this saying that the percentages
are over-designed here?

>> @@ -7891,7 +7901,7 @@ static DEFINE_MUTEX(cfs_constraints_mutex);
>> const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
>> static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
>> /* More than 203 days if BW_SHIFT equals 20. */
>> -static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
>> +const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
>> 
>> static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
>> 
>> @@ -7900,7 +7910,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>> {
>> 	int i, ret = 0, runtime_enabled, runtime_was_enabled;
>> 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
>> -	u64 buffer;
>> +	u64 buffer, burst_onset;
>> 
>> 	if (tg == &root_task_group)
>> 		return -EINVAL;
>> @@ -7961,11 +7971,24 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>> 	cfs_b->burst = burst;
>> 	cfs_b->buffer = buffer;
>> 
>> -	__refill_cfs_bandwidth_runtime(cfs_b);
>> +	cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
>> +	cfs_b->runtime = cfs_b->quota;
>> +
>> +	/* burst_onset needed */
>> +	if (cfs_b->quota != RUNTIME_INF &&
>> +			sysctl_sched_cfs_bw_burst_enabled &&
>> +			sysctl_sched_cfs_bw_burst_onset_percent > 0) {
> 
> 'creative' indentation again...
> 
> Also, this gives rise to the question as to why onset_percent is
> separate from enabled.

Odin noticed the percent thing, too. Maybe I will remove this and let cfs_b start with
0 burst. In this way, this if statement can be removed too, as sketched below.
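
With the onset percent dropped, the hunk would reduce to something like this
(a sketch only, not tested):

	cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
	cfs_b->runtime = cfs_b->quota;	/* always start with 0 burst */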

> 
>> +
>> +		burst_onset = do_div(burst, 100) *
>> +			sysctl_sched_cfs_bw_burst_onset_percent;
> 
> and again..
> 
>> +
>> +		cfs_b->runtime += burst_onset;
>> +		cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
>> +	}
>> 
>> 	/* Restart the period timer (if active) to handle new period expiry: */
>> 	if (runtime_enabled)
>> -		start_cfs_bandwidth(cfs_b);
>> +		start_cfs_bandwidth(cfs_b, 1);
>> 
>> 	raw_spin_unlock_irq(&cfs_b->lock);
>> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6bb4f89259fd..abe6eb05fe09 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4598,10 +4598,23 @@ static inline u64 sched_cfs_bandwidth_slice(void)
>>  *
>>  * requires cfs_b->lock
>>  */
>> -void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>> +static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b,
>> +		u64 overrun)
>> {
>> -	if (cfs_b->quota != RUNTIME_INF)
>> -		cfs_b->runtime = cfs_b->quota;
>> +	u64 refill;
>> +
>> +	if (cfs_b->quota != RUNTIME_INF) {
>> +
>> +		if (!sysctl_sched_cfs_bw_burst_enabled) {
>> +			cfs_b->runtime = cfs_b->quota;
>> +			return;
>> +		}
>> +
>> +		overrun = min(overrun, cfs_b->max_overrun);
>> +		refill = cfs_b->quota * overrun;
>> +		cfs_b->runtime += refill;
>> +		cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
>> +	}
>> }
>> 
>> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>> @@ -4623,7 +4636,7 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
>> 	if (cfs_b->quota == RUNTIME_INF)
>> 		amount = min_amount;
>> 	else {
>> -		start_cfs_bandwidth(cfs_b);
>> +		start_cfs_bandwidth(cfs_b, 0);
>> 
>> 		if (cfs_b->runtime > 0) {
>> 			amount = min(cfs_b->runtime, min_amount);
>> @@ -4957,7 +4970,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
>> 	if (cfs_b->idle && !throttled)
>> 		goto out_deactivate;
>> 
>> -	__refill_cfs_bandwidth_runtime(cfs_b);
>> +	__refill_cfs_bandwidth_runtime(cfs_b, overrun);
>> 
>> 	if (!throttled) {
>> 		/* mark as potentially idle for the upcoming period */
>> @@ -5181,6 +5194,7 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>> }
>> 
>> extern const u64 max_cfs_quota_period;
>> +extern const u64 max_cfs_runtime;
>> 
>> static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> {
>> @@ -5210,7 +5224,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> 			new = old * 2;
>> 			if (new < max_cfs_quota_period) {
>> 				cfs_b->period = ns_to_ktime(new);
>> -				cfs_b->quota *= 2;
>> +				cfs_b->quota = min(cfs_b->quota * 2,
>> +						max_cfs_runtime);
> 
> again, broken indent
> 
>> +
>> +				cfs_b->buffer = min(max_cfs_runtime,
>> +						cfs_b->quota + cfs_b->burst);
> 
> and again..
> 
>> +				/* Add 1 in case max_overrun becomes 0. */
> 
> A better comment would explain *why* 0 is a problem; and possibly
> include a reference to the code that cares
> (__refill_cfs_bandiwdth_runtime() afaict).
> 
>> +				cfs_b->max_overrun >>= 1;
>> +				cfs_b->max_overrun++;
>> 
>> 				pr_warn_ratelimited(
>> 	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 2/4] sched/fair: Make CFS bandwidth controller burstable
  2021-03-12 13:54       ` changhuaixin
@ 2021-03-16 10:55         ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-03-16 10:55 UTC (permalink / raw)
  To: changhuaixin
  Cc: Benjamin Segall, dietmar.eggemann, juri.lelli, khlebnikov,
	open list, mgorman, mingo, pauld, Paul Turner, rostedt,
	Shanpei Chen, Vincent Guittot, xiyou.wangcong

On Fri, Mar 12, 2021 at 09:54:33PM +0800, changhuaixin wrote:
> > On Mar 10, 2021, at 9:04 PM, Peter Zijlstra <peterz@infradead.org> wrote:

> > There's already an #ifdef block that contains that bandwidth_slice
> > thing, see the previous hunk, so why create a new #ifdef here?
> > 
> > Also, personally I think percentages are over-represented as members of
> > Q.
> > 
> Sorry, I don't quite understand the "members of Q". Is this saying that the percentages
> are over-designed here?

You know the number groups (in order): N, Z, Q, R, C, H, O.

Percent being 1/100 is a fraction and thus part of Q (and anything
higher of course).

Some people seem to think percent is magical and special. It's just a
fraction like the infinitely many others in Q. It's also a very crappy
one when we consider computers.

Basically I hate percentages, they're nothing special and often employed
where they should not be.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable
  2021-02-02 11:40 [PATCH " Huaixin Chang
@ 2021-02-02 11:40 ` Huaixin Chang
  0 siblings, 0 replies; 44+ messages in thread
From: Huaixin Chang @ 2021-02-02 11:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: changhuaixin, bsegall, dietmar.eggemann, juri.lelli, khlebnikov,
	mgorman, mingo, pauld, peterz, pjt, rostedt, shanpeic,
	vincent.guittot, xiyou.wangcong

Accumulate unused quota from previous periods, thus accumulated
bandwidth runtime can be used in the following periods. During
accumulation, take care of runtime overflow. The previous non-burstable
CFS bandwidth controller only assigns quota to runtime, which saves a lot.

A sysctl parameter sysctl_sched_cfs_bw_burst_onset_percent is introduced to
denote what percentage of burst is given on setting cfs bandwidth. By
default it is 0, which means no burst is allowed unless accumulated.

Also, parameter sysctl_sched_cfs_bw_burst_enabled is introduced as a
switch for burst. It is enabled by default.

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 include/linux/sched/sysctl.h |  2 ++
 kernel/sched/core.c          | 31 +++++++++++++++++++++++++----
 kernel/sched/fair.c          | 47 ++++++++++++++++++++++++++++++++++++--------
 kernel/sched/sched.h         |  4 ++--
 kernel/sysctl.c              | 18 +++++++++++++++++
 5 files changed, 88 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 3c31ba88aca5..3400828eaf2d 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -72,6 +72,8 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
 
 #ifdef CONFIG_CFS_BANDWIDTH
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+extern unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
+extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
 #endif
 
 #ifdef CONFIG_SCHED_AUTOGROUP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 28e3165c685b..9f1b05ad0411 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -66,6 +66,16 @@ const_debug unsigned int sysctl_sched_features =
  */
 const_debug unsigned int sysctl_sched_nr_migrate = 32;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * Percent of burst assigned to cfs_b->runtime on tg_set_cfs_bandwidth,
+ * 0 by default.
+ */
+unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
+
+unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
+#endif
+
 /*
  * period over which we measure -rt task CPU usage in us.
  * default: 1s
@@ -8586,7 +8596,7 @@ static DEFINE_MUTEX(cfs_constraints_mutex);
 const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
 static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 /* More than 203 days if BW_SHIFT equals 20. */
-static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
+const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
@@ -8595,7 +8605,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 {
 	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
-	u64 buffer;
+	u64 buffer, burst_onset;
 
 	if (tg == &root_task_group)
 		return -EINVAL;
@@ -8656,11 +8666,24 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 	cfs_b->burst = burst;
 	cfs_b->buffer = buffer;
 
-	__refill_cfs_bandwidth_runtime(cfs_b);
+	cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
+	cfs_b->runtime = cfs_b->quota;
+
+	/* burst_onset needed */
+	if (cfs_b->quota != RUNTIME_INF &&
+			sysctl_sched_cfs_bw_burst_enabled &&
+			sysctl_sched_cfs_bw_burst_onset_percent > 0) {
+
+		burst_onset = div_u64(burst, 100) *
+			sysctl_sched_cfs_bw_burst_onset_percent;
+
+		cfs_b->runtime += burst_onset;
+		cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
+	}
 
 	/* Restart the period timer (if active) to handle new period expiry: */
 	if (runtime_enabled)
-		start_cfs_bandwidth(cfs_b);
+		start_cfs_bandwidth(cfs_b, 1);
 
 	raw_spin_unlock_irq(&cfs_b->lock);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46945349f209..6a7c261d206a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4609,10 +4609,23 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  *
  * requires cfs_b->lock
  */
-void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
+static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b,
+		u64 overrun)
 {
-	if (cfs_b->quota != RUNTIME_INF)
-		cfs_b->runtime = cfs_b->quota;
+	u64 refill;
+
+	if (cfs_b->quota != RUNTIME_INF) {
+
+		if (!sysctl_sched_cfs_bw_burst_enabled) {
+			cfs_b->runtime = cfs_b->quota;
+			return;
+		}
+
+		overrun = min(overrun, cfs_b->max_overrun);
+		refill = cfs_b->quota * overrun;
+		cfs_b->runtime += refill;
+		cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
+	}
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4634,7 +4647,7 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
 	else {
-		start_cfs_bandwidth(cfs_b);
+		start_cfs_bandwidth(cfs_b, 0);
 
 		if (cfs_b->runtime > 0) {
 			amount = min(cfs_b->runtime, min_amount);
@@ -4980,7 +4993,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	if (cfs_b->idle && !throttled)
 		goto out_deactivate;
 
-	__refill_cfs_bandwidth_runtime(cfs_b);
+	__refill_cfs_bandwidth_runtime(cfs_b, overrun);
 
 	if (!throttled) {
 		/* mark as potentially idle for the upcoming period */
@@ -5201,6 +5214,7 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
 }
 
 extern const u64 max_cfs_quota_period;
+extern const u64 max_cfs_runtime;
 
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
@@ -5230,7 +5244,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 			new = old * 2;
 			if (new < max_cfs_quota_period) {
 				cfs_b->period = ns_to_ktime(new);
-				cfs_b->quota *= 2;
+				cfs_b->quota = min(cfs_b->quota * 2,
+						max_cfs_runtime);
+
+				cfs_b->buffer = min(max_cfs_runtime,
+						cfs_b->quota + cfs_b->burst);
+				/* Add 1 in case max_overrun becomes 0. */
+				cfs_b->max_overrun >>= 1;
+				cfs_b->max_overrun++;
 
 				pr_warn_ratelimited(
 	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
@@ -5279,16 +5300,26 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
-void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, int init)
 {
+	u64 overrun;
+
 	lockdep_assert_held(&cfs_b->lock);
 
 	if (cfs_b->period_active)
 		return;
 
 	cfs_b->period_active = 1;
-	hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
+
+	/*
+	 * When period timer stops, quota for the following period is not
+	 * refilled, however period timer is already forwarded. We should
+	 * accumulate quota once more than overrun here.
+	 */
+	if (!init)
+		__refill_cfs_bandwidth_runtime(cfs_b, overrun + 1);
 }
 
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2c0d8469c0fb..4f53ea8e92ce 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -359,6 +359,7 @@ struct cfs_bandwidth {
 	u64			runtime;
 	u64			burst;
 	u64			buffer;
+	u64			max_overrun;
 	s64			hierarchical_quota;
 
 	u8			idle;
@@ -469,8 +470,7 @@ extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 			struct sched_entity *parent);
 extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 
-extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
-extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
+extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, int init);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
 
 extern void free_rt_sched_group(struct task_group *tg);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c9fbdd848138..6839a37ee29c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1842,6 +1842,24 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "sched_cfs_bw_burst_onset_percent",
+		.data		= &sysctl_sched_cfs_bw_burst_onset_percent,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &one_hundred,
+	},
+	{
+		.procname	= "sched_cfs_bw_burst_enabled",
+		.data		= &sysctl_sched_cfs_bw_burst_enabled,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 #endif
 #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
 	{
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2021-03-16 10:56 UTC | newest]

Thread overview: 44+ messages
2020-12-17  7:46 [PATCH 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
2020-12-17  7:46 ` [PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
2020-12-17 13:36   ` Peter Zijlstra
2020-12-21 13:53     ` changhuaixin
2021-01-12  9:21       ` changhuaixin
2020-12-17  7:46 ` [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
2020-12-18  9:53   ` kernel test robot
2020-12-18  9:53     ` kernel test robot
2020-12-17  7:46 ` [PATCH 3/4] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
2020-12-17  7:46 ` [PATCH 4/4] sched/fair: Add document for burstable CFS bandwidth control Huaixin Chang
2020-12-17 21:25 ` [PATCH 0/4] sched/fair: Burstable CFS bandwidth controller Benjamin Segall
2021-01-20 12:27 ` [PATCH v2 " Huaixin Chang
2021-01-20 12:27   ` [PATCH 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
2021-01-20 12:27   ` [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
2021-01-20 17:06     ` kernel test robot
2021-01-20 17:06       ` kernel test robot
2021-01-20 18:33     ` kernel test robot
2021-01-20 18:33       ` kernel test robot
2021-01-20 22:01     ` kernel test robot
2021-01-20 22:01       ` kernel test robot
2021-01-20 22:01     ` [RFC PATCH] sched/fair: __refill_cfs_bandwidth_runtime() can be static kernel test robot
2021-01-20 22:01       ` kernel test robot
2021-01-20 12:27   ` [PATCH 3/4] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
2021-01-20 12:27   ` [PATCH 4/4] sched/fair: Add document for burstable CFS bandwidth control Huaixin Chang
2021-01-20 19:48     ` Randy Dunlap
2021-01-21 11:04 ` [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
2021-01-21 11:04   ` [PATCH v3 1/4] sched/fair: Introduce primitives for CFS bandwidth burst Huaixin Chang
2021-03-10 12:39     ` Peter Zijlstra
2021-01-21 11:04   ` [PATCH v3 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
2021-03-10 13:04     ` Peter Zijlstra
2021-03-12 13:54       ` changhuaixin
2021-03-16 10:55         ` Peter Zijlstra
2021-01-21 11:04   ` [PATCH v3 3/4] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
2021-03-10 13:10     ` Peter Zijlstra
2021-01-21 11:04   ` [PATCH v3 4/4] sched/fair: Add document for burstable CFS bandwidth control Huaixin Chang
2021-03-10 13:17     ` Peter Zijlstra
2021-01-26 10:18   ` [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller changhuaixin
2021-03-10 12:26     ` Peter Zijlstra
2021-02-09 13:17   ` Odin Ugedal
2021-02-09 10:28     ` Tejun Heo
2021-02-27 13:48     ` changhuaixin
2021-03-10 11:11       ` Odin Ugedal
2021-03-12 13:26         ` changhuaixin
2021-02-02 11:40 [PATCH " Huaixin Chang
2021-02-02 11:40 ` [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable Huaixin Chang
