* [PATCH v5 0/3] sched/fair: Burstable CFS bandwidth controller
@ 2021-05-20 12:34 Huaixin Chang
  2021-05-20 12:34 ` [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller Huaixin Chang
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Huaixin Chang @ 2021-05-20 12:34 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, dtcccc, juri.lelli, khlebnikov,
	linux-kernel, mgorman, mingo, odin, odin, pauld, peterz, pjt,
	rostedt, shanpeic, tj, vincent.guittot, xiyou.wangcong

Changelog:
v5:
- Rearrange into 3 patches, one less than the previous version.
- The interference to other groups is evaluated.
- Put a limit on burst, so that the code is further simplified.
- Rebase upon v5.13-rc3.

v4:
- Adjust assignments in tg_set_cfs_bandwidth(), skipping unnecessary
  assignments when quota == RUNTIME_INF.
- Get rid of sysctl_sched_cfs_bw_burst_onset_percent, as there seems to
  be no justification for controlling start bandwidth both directly and
  as a percentage.
- Improve the comment on sched_cfs_period_timer() shifts, explaining why
  max_overrun shifting to 0 is a problem.
- Rename previous_runtime to runtime_at_period_start.
- Add cgroup2 interface and documentation.
- Get rid of exposing current_bw, as there is not enough justification
  and it has an updating problem.
- Add justification for the cpu.stat change in the changelog.
- Rebase upon v5.12-rc3.
- Correct SoB chain.
- Several indentation fixes.
- Adjust quota in schbench test from 700000 to 600000.
Link:
https://lore.kernel.org/lkml/20210316044931.39733-1-changhuaixin@linux.alibaba.com/

v3:
- Fix another issue reported by test robot.
- Update docs as Randy Dunlap suggested.
Link:
https://lore.kernel.org/lkml/20210120122715.29493-1-changhuaixin@linux.alibaba.com/

v2:
- Fix an issue reported by test robot.
- Rewrite the docs. Any further suggestions or help are appreciated.
Link:
https://lore.kernel.org/lkml/20210121110453.18899-1-changhuaixin@linux.alibaba.com/

v1 Link:
https://lore.kernel.org/lkml/20201217074620.58338-1-changhuaixin@linux.alibaba.com/

Previously, Cong Wang and Konstantin Khlebnikov proposed a similar
feature:
https://lore.kernel.org/lkml/20180522062017.5193-1-xiyou.wangcong@gmail.com/
https://lore.kernel.org/lkml/157476581065.5793.4518979877345136813.stgit@buzz/

This time we present more latency statistics and handle overflow while
accumulating.

Huaixin Chang (3):
  sched/fair: Introduce the burstable CFS controller
  sched/fair: Add cfs bandwidth burst statistics
  sched/fair: Add document for burstable CFS bandwidth

 Documentation/admin-guide/cgroup-v2.rst | 17 +++---
 Documentation/scheduler/sched-bwc.rst   | 76 ++++++++++++++++++++++----
 include/linux/sched/sysctl.h            |  1 +
 kernel/sched/core.c                     | 96 ++++++++++++++++++++++++++-------
 kernel/sched/fair.c                     | 32 ++++++++++-
 kernel/sched/sched.h                    |  4 ++
 kernel/sysctl.c                         |  9 ++++
 7 files changed, 200 insertions(+), 35 deletions(-)

-- 
2.14.4.44.g2045bb6



* [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller
  2021-05-20 12:34 [PATCH v5 0/3] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
@ 2021-05-20 12:34 ` Huaixin Chang
  2021-05-20 14:00   ` Odin Ugedal
  2021-05-21 14:00   ` Peter Zijlstra
  2021-05-20 12:34 ` [PATCH v5 2/3] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
  2021-05-20 12:34 ` [PATCH v5 3/3] sched/fair: Add document for burstable CFS bandwidth Huaixin Chang
  2 siblings, 2 replies; 21+ messages in thread
From: Huaixin Chang @ 2021-05-20 12:34 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, dtcccc, juri.lelli, khlebnikov,
	linux-kernel, mgorman, mingo, odin, odin, pauld, peterz, pjt,
	rostedt, shanpeic, tj, vincent.guittot, xiyou.wangcong

The CFS bandwidth controller limits the CPU requests of a task group to
quota during each period. However, parallel workloads may be bursty, so
they get throttled even when their average utilization is under quota.
Since such workloads are latency sensitive at the same time, throttling
them is undesirable.

Scaling up the period and quota allows greater burst capacity, but it may
leave tasks stuck for longer until the next refill. Introduce "burst" to
allow accumulating unused quota from previous periods, to be spent when a
task group requests more CPU than quota during a specific period.

Introducing a burst buffer may also cause interference to other groups.
Thus limit the maximum accumulated buffer by "burst", and limit the
maximum allowed burst by quota, too.

The benefit of burst is seen when testing with schbench:

	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
	echo 600000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
	echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

	# The average CPU usage is around 500%, which is 200ms CPU time
	# every 40ms.
	./schbench -m 1 -t 30 -r 10 -c 10000 -R 500

	Without burst:

	Latency percentiles (usec)
	50.0000th: 7
	75.0000th: 8
	90.0000th: 9
	95.0000th: 10
	*99.0000th: 933
	99.5000th: 981
	99.9000th: 3068
	min=0, max=20054
	rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 9.33%

	With burst:

	Latency percentiles (usec)
	50.0000th: 7
	75.0000th: 8
	90.0000th: 9
	95.0000th: 9
	*99.0000th: 12
	99.5000th: 13
	99.9000th: 19
	min=0, max=406
	rps: 498.36 p95 (usec) 9 p99 (usec) 12 p95/cputime 0.09% p99/cputime 0.12%

The interference when using burst is evaluated by the possibility of
missing the deadline and the average WCET. Test results showed that when
there are many cgroups or the CPU is underutilized, the interference is
limited. More details are shown in:
https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
---
 include/linux/sched/sysctl.h |  1 +
 kernel/sched/core.c          | 83 ++++++++++++++++++++++++++++++++++++--------
 kernel/sched/fair.c          | 21 ++++++++++-
 kernel/sched/sched.h         |  1 +
 kernel/sysctl.c              |  9 +++++
 5 files changed, 99 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index db2c0f34aaaf..08432aeb742e 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -73,6 +73,7 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
 
 #ifdef CONFIG_CFS_BANDWIDTH
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
 #endif
 
 #ifdef CONFIG_SCHED_AUTOGROUP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5226cc26a095..7d34b08ee0e5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8962,7 +8962,8 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
-static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
+				u64 burst)
 {
 	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -8992,6 +8993,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	if (quota != RUNTIME_INF && quota > max_cfs_runtime)
 		return -EINVAL;
 
+	if (quota != RUNTIME_INF && (burst > quota ||
+				     burst + quota > max_cfs_runtime))
+		return -EINVAL;
+
 	/*
 	 * Prevent race between setting of cfs_rq->runtime_enabled and
 	 * unthrottle_offline_cfs_rqs().
@@ -9013,6 +9018,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
+	cfs_b->burst = burst;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
@@ -9046,9 +9052,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
 static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 {
-	u64 quota, period;
+	u64 quota, period, burst;
 
 	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	burst = tg->cfs_bandwidth.burst;
 	if (cfs_quota_us < 0)
 		quota = RUNTIME_INF;
 	else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
@@ -9056,7 +9063,7 @@ static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 	else
 		return -EINVAL;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_quota(struct task_group *tg)
@@ -9074,15 +9081,16 @@ static long tg_get_cfs_quota(struct task_group *tg)
 
 static int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
 {
-	u64 quota, period;
+	u64 quota, period, burst;
 
 	if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC)
 		return -EINVAL;
 
 	period = (u64)cfs_period_us * NSEC_PER_USEC;
 	quota = tg->cfs_bandwidth.quota;
+	burst = tg->cfs_bandwidth.burst;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_period(struct task_group *tg)
@@ -9095,6 +9103,30 @@ static long tg_get_cfs_period(struct task_group *tg)
 	return cfs_period_us;
 }
 
+static int tg_set_cfs_burst(struct task_group *tg, long cfs_burst_us)
+{
+	u64 quota, period, burst;
+
+	if ((u64)cfs_burst_us > U64_MAX / NSEC_PER_USEC)
+		return -EINVAL;
+
+	burst = (u64)cfs_burst_us * NSEC_PER_USEC;
+	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	quota = tg->cfs_bandwidth.quota;
+
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
+}
+
+static long tg_get_cfs_burst(struct task_group *tg)
+{
+	u64 burst_us;
+
+	burst_us = tg->cfs_bandwidth.burst;
+	do_div(burst_us, NSEC_PER_USEC);
+
+	return burst_us;
+}
+
 static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
 				  struct cftype *cft)
 {
@@ -9119,6 +9151,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
 	return tg_set_cfs_period(css_tg(css), cfs_period_us);
 }
 
+static s64 cpu_cfs_burst_read_s64(struct cgroup_subsys_state *css,
+				  struct cftype *cft)
+{
+	return tg_get_cfs_burst(css_tg(css));
+}
+
+static int cpu_cfs_burst_write_s64(struct cgroup_subsys_state *css,
+				   struct cftype *cftype, s64 cfs_burst_us)
+{
+	return tg_set_cfs_burst(css_tg(css), cfs_burst_us);
+}
+
 struct cfs_schedulable_data {
 	struct task_group *tg;
 	u64 period, quota;
@@ -9271,6 +9315,11 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "cfs_burst_us",
+		.read_s64 = cpu_cfs_burst_read_s64,
+		.write_s64 = cpu_cfs_burst_write_s64,
+	},
 	{
 		.name = "stat",
 		.seq_show = cpu_cfs_stat_show,
@@ -9390,27 +9439,29 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
 }
 #endif
 
-static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
-						  long period, long quota)
+static void __maybe_unused
+cpu_period_quota_print(struct seq_file *sf, long period, long quota,
+		       long burst)
 {
 	if (quota < 0)
 		seq_puts(sf, "max");
 	else
 		seq_printf(sf, "%ld", quota);
 
-	seq_printf(sf, " %ld\n", period);
+	seq_printf(sf, " %ld %ld\n", period, burst);
 }
 
-/* caller should put the current value in *@periodp before calling */
-static int __maybe_unused cpu_period_quota_parse(char *buf,
-						 u64 *periodp, u64 *quotap)
+/* caller should put the current value in *@periodp and *@burstp before calling */
+static int __maybe_unused cpu_period_quota_parse(char *buf, u64 *periodp,
+						 u64 *quotap, u64 *burstp)
 {
 	char tok[21];	/* U64_MAX */
 
-	if (sscanf(buf, "%20s %llu", tok, periodp) < 1)
+	if (sscanf(buf, "%20s %llu %llu", tok, periodp, burstp) < 1)
 		return -EINVAL;
 
 	*periodp *= NSEC_PER_USEC;
+	*burstp *= NSEC_PER_USEC;
 
 	if (sscanf(tok, "%llu", quotap))
 		*quotap *= NSEC_PER_USEC;
@@ -9427,7 +9478,8 @@ static int cpu_max_show(struct seq_file *sf, void *v)
 {
 	struct task_group *tg = css_tg(seq_css(sf));
 
-	cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+	cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg),
+			       tg_get_cfs_burst(tg));
 	return 0;
 }
 
@@ -9436,12 +9488,13 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
 {
 	struct task_group *tg = css_tg(of_css(of));
 	u64 period = tg_get_cfs_period(tg);
+	u64 burst = tg_get_cfs_burst(tg);
 	u64 quota;
 	int ret;
 
-	ret = cpu_period_quota_parse(buf, &period, &quota);
+	ret = cpu_period_quota_parse(buf, &period, &quota, &burst);
 	if (!ret)
-		ret = tg_set_cfs_bandwidth(tg, period, quota);
+		ret = tg_set_cfs_bandwidth(tg, period, quota, burst);
 	return ret ?: nbytes;
 }
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3248e24a90b0..48fad5cf0f7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -134,6 +134,13 @@ int __weak arch_asym_cpu_priority(int cpu)
  * (default: 5 msec, units: microseconds)
  */
 unsigned int sysctl_sched_cfs_bandwidth_slice		= 5000UL;
+
+/*
+ * A switch for cfs bandwidth burst.
+ *
+ * (default: 1, enabled)
+ */
+unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
 #endif
 
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
@@ -4628,8 +4635,16 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  */
 void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 {
-	if (cfs_b->quota != RUNTIME_INF)
+	if (unlikely(cfs_b->quota == RUNTIME_INF))
+		return;
+
+	if (!sysctl_sched_cfs_bw_burst_enabled) {
 		cfs_b->runtime = cfs_b->quota;
+		return;
+	}
+
+	cfs_b->runtime += cfs_b->quota;
+	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4651,6 +4666,9 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
 	else {
+		if (!cfs_b->period_active)
+			__refill_cfs_bandwidth_runtime(cfs_b);
+
 		start_cfs_bandwidth(cfs_b);
 
 		if (cfs_b->runtime > 0) {
@@ -5279,6 +5297,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+	cfs_b->burst = 0;
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a189bec13729..d317ca74a48c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -366,6 +366,7 @@ struct cfs_bandwidth {
 	ktime_t			period;
 	u64			quota;
 	u64			runtime;
+	u64			burst;
 	s64			hierarchical_quota;
 
 	u8			idle;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 14edf84cc571..fb27bae7ace2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1816,6 +1816,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "sched_cfs_bw_burst_enabled",
+		.data		= &sysctl_sched_cfs_bw_burst_enabled,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 #endif
 #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
 	{
-- 
2.14.4.44.g2045bb6



* [PATCH v5 2/3] sched/fair: Add cfs bandwidth burst statistics
  2021-05-20 12:34 [PATCH v5 0/3] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
  2021-05-20 12:34 ` [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller Huaixin Chang
@ 2021-05-20 12:34 ` Huaixin Chang
  2021-05-20 14:11   ` Odin Ugedal
                     ` (2 more replies)
  2021-05-20 12:34 ` [PATCH v5 3/3] sched/fair: Add document for burstable CFS bandwidth Huaixin Chang
  2 siblings, 3 replies; 21+ messages in thread
From: Huaixin Chang @ 2021-05-20 12:34 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, dtcccc, juri.lelli, khlebnikov,
	linux-kernel, mgorman, mingo, odin, odin, pauld, peterz, pjt,
	rostedt, shanpeic, tj, vincent.guittot, xiyou.wangcong

When using cfs_b and encountering throttled periods, users can use the
burst buffer to allow bursty workloads. Apart from configuring a burst
buffer and watching whether throttled periods disappear, statistics on
burst buffer usage are also helpful. Thus expose the following statistics
in the cpu.stat file:

nr_burst:   number of periods in which a bandwidth burst occurs
burst_time: cumulative wall-time that any CPUs have used above quota in
	    the respective periods

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
---
 kernel/sched/core.c  | 13 ++++++++++---
 kernel/sched/fair.c  | 11 +++++++++++
 kernel/sched/sched.h |  3 +++
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7d34b08ee0e5..d442fcd85374 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9265,6 +9265,9 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
 		seq_printf(sf, "wait_sum %llu\n", ws);
 	}
 
+	seq_printf(sf, "nr_burst %d\n", cfs_b->nr_burst);
+	seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
+
 	return 0;
 }
 #endif /* CONFIG_CFS_BANDWIDTH */
@@ -9361,16 +9364,20 @@ static int cpu_extra_stat_show(struct seq_file *sf,
 	{
 		struct task_group *tg = css_tg(css);
 		struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
-		u64 throttled_usec;
+		u64 throttled_usec, burst_usec;
 
 		throttled_usec = cfs_b->throttled_time;
 		do_div(throttled_usec, NSEC_PER_USEC);
+		burst_usec = cfs_b->burst_time;
+		do_div(burst_usec, NSEC_PER_USEC);
 
 		seq_printf(sf, "nr_periods %d\n"
 			   "nr_throttled %d\n"
-			   "throttled_usec %llu\n",
+			   "throttled_usec %llu\n"
+			   "nr_burst %d\n"
+			   "burst_usec %llu\n",
 			   cfs_b->nr_periods, cfs_b->nr_throttled,
-			   throttled_usec);
+			   throttled_usec, cfs_b->nr_burst, burst_usec);
 	}
 #endif
 	return 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48fad5cf0f7a..d4783b62a010 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4635,6 +4635,8 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  */
 void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 {
+	u64 runtime;
+
 	if (unlikely(cfs_b->quota == RUNTIME_INF))
 		return;
 
@@ -4643,8 +4645,17 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 		return;
 	}
 
+	if (cfs_b->runtime_at_period_start > cfs_b->runtime) {
+		runtime = cfs_b->runtime_at_period_start - cfs_b->runtime;
+		if (runtime > cfs_b->quota) {
+			cfs_b->burst_time += runtime - cfs_b->quota;
+			cfs_b->nr_burst++;
+		}
+	}
+
 	cfs_b->runtime += cfs_b->quota;
 	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
+	cfs_b->runtime_at_period_start = cfs_b->runtime;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d317ca74a48c..b770b553dfbb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -367,6 +367,7 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	u64			burst;
+	u64			runtime_at_period_start;
 	s64			hierarchical_quota;
 
 	u8			idle;
@@ -379,7 +380,9 @@ struct cfs_bandwidth {
 	/* Statistics: */
 	int			nr_periods;
 	int			nr_throttled;
+	int			nr_burst;
 	u64			throttled_time;
+	u64			burst_time;
 #endif
 };
 
-- 
2.14.4.44.g2045bb6



* [PATCH v5 3/3] sched/fair: Add document for burstable CFS bandwidth
  2021-05-20 12:34 [PATCH v5 0/3] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
  2021-05-20 12:34 ` [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller Huaixin Chang
  2021-05-20 12:34 ` [PATCH v5 2/3] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
@ 2021-05-20 12:34 ` Huaixin Chang
  2 siblings, 0 replies; 21+ messages in thread
From: Huaixin Chang @ 2021-05-20 12:34 UTC (permalink / raw)
  To: changhuaixin
  Cc: bsegall, dietmar.eggemann, dtcccc, juri.lelli, khlebnikov,
	linux-kernel, mgorman, mingo, odin, odin, pauld, peterz, pjt,
	rostedt, shanpeic, tj, vincent.guittot, xiyou.wangcong

Basic description of usage and effect for CFS Bandwidth Control Burst.

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 17 +++++---
 Documentation/scheduler/sched-bwc.rst   | 76 ++++++++++++++++++++++++++++-----
 2 files changed, 77 insertions(+), 16 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index b1e81aa8598a..222c9942592e 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1000,6 +1000,8 @@ All time durations are in microseconds.
 	- nr_periods
 	- nr_throttled
 	- throttled_usec
+	- nr_burst
+	- burst_usec
 
   cpu.weight
 	A read-write single value file which exists on non-root
@@ -1020,16 +1022,19 @@ All time durations are in microseconds.
 	the closest approximation of the current weight.
 
   cpu.max
-	A read-write two value file which exists on non-root cgroups.
-	The default is "max 100000".
+	A read-write three value file which exists on non-root cgroups.
+	The default is "max 100000 0".
 
 	The maximum bandwidth limit.  It's in the following format::
 
-	  $MAX $PERIOD
+	  $MAX $PERIOD $BURST
 
-	which indicates that the group may consume upto $MAX in each
-	$PERIOD duration.  "max" for $MAX indicates no limit.  If only
-	one number is written, $MAX is updated.
+	which indicates that the group may consume up to $MAX from this
+	period plus $BURST carried over from previous periods in each
+	$PERIOD duration.  "max" for $MAX indicates no limit. "0" for
+	$BURST indicates no bandwidth can be carried over. $BURST must
+	not be larger than $MAX. On a partial write, the given values
+	are updated accordingly.
 
   cpu.pressure
 	A read-write nested-keyed file.
diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
index 845eee659199..3e7ebb7bc562 100644
--- a/Documentation/scheduler/sched-bwc.rst
+++ b/Documentation/scheduler/sched-bwc.rst
@@ -22,39 +22,74 @@ cfs_quota units at each period boundary. As threads consume this bandwidth it
 is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
+By default, CPU bandwidth consumption is strictly limited to quota within each
+given period. For a sequence of CPU usage u_i served under CFS bandwidth
+control, where for any j <= k, N(j,k) is the number of periods from u_j to u_k:
+
+        u_j+...+u_k <= quota * N(j,k)
+
+For a bursty sequence in which the interval u_j...u_k is at the peak, CPU
+requests may have to wait for more periods to replenish enough quota.
+Otherwise, a larger quota is required.
+
+With "burst" buffer, CPU requests might be served as long as:
+
+        u_j+...+u_k <= B_j + quota * N(j,k) <= quota * (N(j,k)+1)
+
+where, for any j <= k, N(j,k) is the number of periods from u_j to u_k and B_j
+is the quota accumulated from previous periods in the burst buffer serving u_j.
+The max value of B_j is limited to quota, so the total CPU utilization will not
+be larger than quota * (N(j,k)+1).
+The burst buffer helps in that whole bursty CPU requests can be served without
+throttling, given a moderate quota setting plus the quota accumulated in the
+burst buffer, if:
+
+        u_0+...+u_n <= B_0 + quota * N(0,n) <= quota * (N(0,n)+1)
+
+where B_0 is the initial state of the burst buffer. The maximum accumulated
+quota in the burst buffer is capped by burst. With a proper burst setting, the
+available bandwidth is still determined by quota and period in the long run.
+
 Management
 ----------
-Quota and period are managed within the cpu subsystem via cgroupfs.
+Quota, period and burst are managed within the cpu subsystem via cgroupfs.
 
 .. note::
    The cgroupfs files described in this section are only applicable
    to cgroup v1. For cgroup v2, see
    :ref:`Documentation/admin-guide/cgroupv2.rst <cgroup-v2-cpu>`.
 
-- cpu.cfs_quota_us: the total available run-time within a period (in
-  microseconds)
+- cpu.cfs_quota_us: run-time replenished within a period (in microseconds)
 - cpu.cfs_period_us: the length of a period (in microseconds)
 - cpu.stat: exports throttling statistics [explained further below]
+- cpu.cfs_burst_us: the maximum accumulated run-time (in microseconds)
 
 The default values are::
 
 	cpu.cfs_period_us=100ms
-	cpu.cfs_quota=-1
+	cpu.cfs_quota_us=-1
+	cpu.cfs_burst_us=0
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
 bandwidth group. This represents the traditional work-conserving behavior for
 CFS.
 
-Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms. There is also an
-upper bound on the period length of 1s. Additional restrictions exist when
-bandwidth limits are used in a hierarchical fashion, these are explained in
-more detail below.
+Writing any (valid) positive value(s) no smaller than cpu.cfs_burst_us will
+enact the specified bandwidth limit. The minimum allowed value for the quota
+or period is 1ms. There is also an upper bound on the period length of 1s.
+Additional restrictions exist when bandwidth limits are used in a hierarchical
+fashion; these are explained in more detail below.
 
 Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
 and return the group to an unconstrained state once more.
 
+A value of 0 for cpu.cfs_burst_us indicates that the group cannot accumulate
+any unused bandwidth. It leaves the traditional CFS bandwidth control behavior
+unchanged. Writing any (valid) positive value(s) no larger than
+cpu.cfs_quota_us into cpu.cfs_burst_us will enact the cap on unused bandwidth
+accumulation.
+
 Any updates to a group's bandwidth specification will result in it becoming
 unthrottled if it is in a constrained state.
 
@@ -72,9 +107,15 @@ This is tunable via procfs::
 Larger slice values will reduce transfer overheads, while smaller values allow
 for more fine-grained consumption.
 
+There is also a global switch to turn off burst for all groups::
+       /proc/sys/kernel/sched_cfs_bw_burst_enabled (default=1)
+
+By default it is enabled. Writing a 0 value means no accumulated CPU time can be
+used for any group, even if cpu.cfs_burst_us is configured.
+
 Statistics
 ----------
-A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+A group's bandwidth statistics are exported via 5 fields in cpu.stat.
 
 cpu.stat:
 
@@ -82,6 +123,9 @@ cpu.stat:
 - nr_throttled: Number of times the group has been throttled/limited.
 - throttled_time: The total time duration (in nanoseconds) for which entities
   of the group have been throttled.
+- nr_burst: Number of periods in which a burst occurs.
+- burst_time: Cumulative wall-time that any CPUs have used above quota in
+  the respective periods.
 
 This interface is read-only.
 
@@ -179,3 +223,15 @@ Examples
 
    By using a small period here we are ensuring a consistent latency
    response at the expense of burst capacity.
+
+4. Limit a group to 40% of 1 CPU, and allow it to accumulate an additional
+   20% of 1 CPU, when there is unused bandwidth to accumulate.
+
+   With a 50ms period, a 20ms quota is equivalent to 40% of 1 CPU, and a
+   10ms burst is equivalent to 20% of 1 CPU.
+
+	# echo 20000 > cpu.cfs_quota_us /* quota = 20ms */
+	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
+	# echo 10000 > cpu.cfs_burst_us /* burst = 10ms */
+
+   A larger burst setting (no larger than quota) allows greater burst capacity.
-- 
2.14.4.44.g2045bb6



* Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller
  2021-05-20 12:34 ` [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller Huaixin Chang
@ 2021-05-20 14:00   ` Odin Ugedal
  2021-05-20 17:04     ` Tejun Heo
  2021-05-21  9:09     ` changhuaixin
  2021-05-21 14:00   ` Peter Zijlstra
  1 sibling, 2 replies; 21+ messages in thread
From: Odin Ugedal @ 2021-05-20 14:00 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: Benjamin Segall, Dietmar Eggemann, dtcccc, Juri Lelli,
	khlebnikov, open list, Mel Gorman, Ingo Molnar, Odin Ugedal,
	pauld, Peter Zijlstra, Paul Turner, Steven Rostedt, shanpeic,
	Tejun Heo, Vincent Guittot, xiyou.wangcong

Hi,

Here are some more thoughts and questions:

> The benefit of burst is seen when testing with schbench:
>
>         echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
>         echo 600000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
>         echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
>         echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>
>         # The average CPU usage is around 500%, which is 200ms CPU time
>         # every 40ms.
>         ./schbench -m 1 -t 30 -r 10 -c 10000 -R 500
>
>         Without burst:
>
>         Latency percentiles (usec)
>         50.0000th: 7
>         75.0000th: 8
>         90.0000th: 9
>         95.0000th: 10
>         *99.0000th: 933
>         99.5000th: 981
>         99.9000th: 3068
>         min=0, max=20054
>         rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 9.33%

It should be noted that this was running on a 64 core machine (if that was
the case, ref. your previous patch).

I am curious how much you have tried tweaking both the period and the quota
for this workload. I assume a longer period can help such a bursty
application, and given the small slowdowns, a slightly higher quota could
also help, I guess. I am not saying this is a bad idea, but we need to
understand what it fixes, and how, in order to be able to understand how/if
to use it.

Also, what value of the sysctl kernel.sched_cfs_bandwidth_slice_us are you
using? Which CONFIG_HZ you are using is also interesting, due to how
bandwidth is accounted for. There is some more info about it in
Documentation/scheduler/sched-bwc.rst. I assume a smaller slice value may
also help, and it would be interesting to see what implications it has. A
high threads to (quota/period) ratio, together with a high bandwidth slice,
will probably cause some throttling, so one has to choose between precision
and overhead.

Also, here you give a burst of 66% of the quota. Would that be a typical
value for a cgroup, or is it just a result of testing? As I understand this
patchset, your example would allow 600% constant CPU load, then one period
with 1000% load, then another "long set" of periods with 600% load. Have you
discussed a way of limiting how long burst can be "saved" before expiring?

> @@ -9427,7 +9478,8 @@ static int cpu_max_show(struct seq_file *sf, void *v)
>  {
>         struct task_group *tg = css_tg(seq_css(sf));
>
> -       cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
> +       cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg),
> +                              tg_get_cfs_burst(tg));
>         return 0;
>  }

The current cgroup v2 docs say the following:

>   cpu.max
>     A read-write two value file which exists on non-root cgroups.
>     The default is "max 100000".

This will become a "three value file", and I know a few user space projects
who parse this file by splitting on the middle space. I am not sure if they are
"wrong", but I don't think we usually break such things. Not sure what
Tejun thinks about this.

Thanks
Odin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 2/3] sched/fair: Add cfs bandwidth burst statistics
  2021-05-20 12:34 ` [PATCH v5 2/3] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
@ 2021-05-20 14:11   ` Odin Ugedal
  2021-05-21 12:42     ` changhuaixin
  2021-05-21 14:05     ` Peter Zijlstra
  2021-05-20 17:06   ` Tejun Heo
  2021-05-21 14:01   ` Peter Zijlstra
  2 siblings, 2 replies; 21+ messages in thread
From: Odin Ugedal @ 2021-05-20 14:11 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: Benjamin Segall, Dietmar Eggemann, dtcccc, Juri Lelli,
	khlebnikov, open list, Mel Gorman, Ingo Molnar, Odin Ugedal,
	pauld, Peter Zijlstra, Paul Turner, Steven Rostedt, shanpeic,
	Tejun Heo, Vincent Guittot, xiyou.wangcong

I am a bit sceptical about both the nr_burst and burst_time as they are now.

As an example: consider a control group using "99.9%" of the quota each
period that is never throttled. With this patch and a burst of X, such a
group would still get nr_throttled = 0 (as before), but its nr_burst and
burst_time would keep increasing.

I think there is a big difference between runtime moved/taken from
cfs_b->runtime to cfs_rq->runtime_remaining and the actual runtime used
in the period. Currently, cfs bw can only supply info about the former,
not the latter.

I think that if people see nr_burst increasing, they will think they
_have_ to use cfs burst in order to avoid being throttled, even though
that might not be the case. It is probably fine as is, as long as it is
explicitly stated what the values mean and imply, and what they do not.
I cannot see another way to calculate it as it is now, but maybe someone
else has some thoughts.

Thanks
Odin

* Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller
  2021-05-20 14:00   ` Odin Ugedal
@ 2021-05-20 17:04     ` Tejun Heo
  2021-05-21  9:09     ` changhuaixin
  1 sibling, 0 replies; 21+ messages in thread
From: Tejun Heo @ 2021-05-20 17:04 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: Huaixin Chang, Benjamin Segall, Dietmar Eggemann, dtcccc,
	Juri Lelli, khlebnikov, open list, Mel Gorman, Ingo Molnar,
	pauld, Peter Zijlstra, Paul Turner, Steven Rostedt, shanpeic,
	Vincent Guittot, xiyou.wangcong

Hello, Odin.

On Thu, May 20, 2021 at 04:00:29PM +0200, Odin Ugedal wrote:
> >   cpu.max
> >     A read-write two value file which exists on non-root cgroups.
> >     The default is "max 100000".
> 
> This will become a "three value file", and I know a few user space projects
> who parse this file by splitting on the middle space. I am not sure if they are
> "wrong", but I don't think we usually break such things. Not sure what
> Tejun thinks about this.

Good point. I haven't thought about that. It would make more sense to
separate it out into a separate file then - e.g. sth like cpu.max.burst,
but it seems like there are important questions to answer before adding
new interfaces.

Thanks.

-- 
tejun

* Re: [PATCH v5 2/3] sched/fair: Add cfs bandwidth burst statistics
  2021-05-20 12:34 ` [PATCH v5 2/3] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
  2021-05-20 14:11   ` Odin Ugedal
@ 2021-05-20 17:06   ` Tejun Heo
  2021-05-21 14:01   ` Peter Zijlstra
  2 siblings, 0 replies; 21+ messages in thread
From: Tejun Heo @ 2021-05-20 17:06 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: bsegall, dietmar.eggemann, dtcccc, juri.lelli, khlebnikov,
	linux-kernel, mgorman, mingo, odin, odin, pauld, peterz, pjt,
	rostedt, shanpeic, vincent.guittot, xiyou.wangcong

On Thu, May 20, 2021 at 08:34:18PM +0800, Huaixin Chang wrote:
> When using cfs_b and meeting with some throttled periods, users shall
> use burst buffer to allow bursty workloads. Apart from configuring some
> burst buffer and watch whether throttled periods disappears, some
> statistics on burst buffer using are also helpful. Thus expose the
> following statistics into cpu.stat file:
> 
> nr_burst:   number of periods bandwidth burst occurs

nr_bursts

> burst_time: cumulative wall-time that any cpus has
> 	    used above quota in respective periods

burst_usec may be more in line with other fields in that file.

Thanks.

-- 
tejun

* Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller
  2021-05-20 14:00   ` Odin Ugedal
  2021-05-20 17:04     ` Tejun Heo
@ 2021-05-21  9:09     ` changhuaixin
  2021-05-21  9:38       ` Odin Ugedal
  1 sibling, 1 reply; 21+ messages in thread
From: changhuaixin @ 2021-05-21  9:09 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: changhuaixin, Benjamin Segall, Dietmar Eggemann, dtcccc,
	Juri Lelli, khlebnikov, open list, Mel Gorman, Ingo Molnar,
	pauld, Peter Zijlstra, Paul Turner, Steven Rostedt, Shanpei Chen,
	Tejun Heo, Vincent Guittot, xiyou.wangcong



> On May 20, 2021, at 10:00 PM, Odin Ugedal <odin@uged.al> wrote:
> 
> Hi,
> 
> Here are some more thoughts and questions:
> 
>> The benefit of burst is seen when testing with schbench:
>> 
>>        echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
>>        echo 600000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
>>        echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
>>        echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>> 
>>        # The average CPU usage is around 500%, which is 200ms CPU time
>>        # every 40ms.
>>        ./schbench -m 1 -t 30 -r 10 -c 10000 -R 500
>> 
>>        Without burst:
>> 
>>        Latency percentiles (usec)
>>        50.0000th: 7
>>        75.0000th: 8
>>        90.0000th: 9
>>        95.0000th: 10
>>        *99.0000th: 933
>>        99.5000th: 981
>>        99.9000th: 3068
>>        min=0, max=20054
>>        rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 9.33%
> 
> It should be noted that this was running on a 64 core machine (if that was
> the case, ref. your previous patch).
> 
> I am curious how much you have tried tweaking both the period and the quota
> for this workload. I assume a longer period can help such bursty application,
> and from the small slowdowns, a slightly higher quota could also help
> I guess. I am
> not saying this is a bad idea, but that we need to understand what it
> fixes, and how,
> in order to be able to understand how/if to use it.
> 

Yeah, it is a well-tuned workload and configuration. I did this because for benchmarks
like schbench, workloads are generated in a fixed pattern without burst. So I set schbench
params carefully to generate burst during each 100ms period, to show that burst works. Longer
periods or higher quota do help indeed, in which case a heavier workload is needed to generate
tail latency.

In my view, burst makes cfsb work like a token bucket. For the present cfsb, bucket capacity
is strictly limited to quota, and that is changed into quota + burst now. The accumulated burst
shall be used when tasks would otherwise get throttled while the CPU is under-utilized for the
whole system.
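
As a sketch, the token-bucket view of the per-period refill can be written
like this (a hypothetical helper mirroring the refill logic described
above, not the kernel code itself):

```python
def refill(runtime_ms: int, quota_ms: int, burst_ms: int) -> int:
    """Token-bucket refill: unused runtime carries over into the next
    period, but the bucket capacity is capped at quota + burst."""
    return min(runtime_ms + quota_ms, quota_ms + burst_ms)

# Values from the schbench example: 600ms quota, 400ms burst, 100ms period.
rt = 0
rt = refill(rt, 600, 400)   # first refill: 600
rt = refill(rt, 600, 400)   # idle period: capped at quota + burst = 1000
print(rt)                   # 1000
```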

> Also, what value of the sysctl kernel.sched_cfs_bandwidth_slice_us are
> you using?
> What CONFIG_HZ you are using is also interesting, due to how bw is
> accounted for.
> There is some more info about it here: Documentation/scheduler/sched-bwc.rst. I
> assume a smaller slice value may also help, and it would be interesting to see
> what implications it gives. A high threads to (quota/period) ratio, together
> with a high bandwidth_slice will probably cause some throttling, so one has
> to choose between precision and overhead.
> 

The default values of kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.

The following case might be used to rule out throttling caused by many threads and a high
bandwidth slice:

mkdir /sys/fs/cgroup/cpu/test
echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

./schbench -m 1 -t 3 -r 20 -c 80000 -R 20

On my machine, two workers work for 80ms and sleep for 120ms in each round. The average utilization is
around 80%. This will work on a two-core system. It is recommended to try it multiple times, as getting
throttled doesn't necessarily cause tail latency for schbench.


> Also, here you give a burst of 66% the quota. Would that be a typical value
> for a cgroup, or is it just a result of testing? As I understand this

Yeah, it is not a typical value; it is tuned for this test.

> patchset, your example
> would allow 600% constant CPU load, then one period with 1000% load,
> then another
> "long set" of periods with 600% load. Have you discussed a way of limiting how
> long burst can be "saved" before expiring?

I haven't thought about it much. It is interesting, but I doubt the need to do that.

> 
>> @@ -9427,7 +9478,8 @@ static int cpu_max_show(struct seq_file *sf, void *v)
>> {
>>        struct task_group *tg = css_tg(seq_css(sf));
>> 
>> -       cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
>> +       cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg),
>> +                              tg_get_cfs_burst(tg));
>>        return 0;
>> }
> 
> The current cgroup v2 docs say the following:
> 
>>  cpu.max
>>    A read-write two value file which exists on non-root cgroups.
>>    The default is "max 100000".
> 
> This will become a "three value file", and I know a few user space projects
> who parse this file by splitting on the middle space. I am not sure if they are
> "wrong", but I don't think we usually break such things. Not sure what
> Tejun thinks about this.
> 

Thanks, it will be modified in the way Tejun suggests.

> Thanks
> Odin


* Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller
  2021-05-21  9:09     ` changhuaixin
@ 2021-05-21  9:38       ` Odin Ugedal
  2021-05-21 12:38         ` changhuaixin
  0 siblings, 1 reply; 21+ messages in thread
From: Odin Ugedal @ 2021-05-21  9:38 UTC (permalink / raw)
  To: changhuaixin
  Cc: Odin Ugedal, Benjamin Segall, Dietmar Eggemann, dtcccc,
	Juri Lelli, khlebnikov, open list, Mel Gorman, Ingo Molnar,
	pauld, Peter Zijlstra, Paul Turner, Steven Rostedt, Shanpei Chen,
	Tejun Heo, Vincent Guittot, xiyou.wangcong

Hi,

> Yeah, it is a well tuned workload and configuration. I did this because for benchmarks
> like schbench, workloads are generated in a fixed pattern without burst. So I set schbench
> params carefully to generate burst during each 100ms periods, to show burst works. Longer
> period or higher quota helps indeed, in which case more workloads can be used to generate
> tail latency then.

Yeah, that makes sense. When it comes to fairness (you are talking about
generating tail latency), I think configuration of cpu shares/weight
between cgroups is more relevant.

How much more tail latency will a cgroup be able to "create" when
doubling the period?


> In my view, burst is like the cfsb way of token bucket. For the present cfsb, bucket capacity
> is strictly limited to quota. And that is changed into quota + burst now. And it shall be used when
> tasks get throttled and CPU is under utilized for the whole system.

Well, it is as strict as we can make it, depending on how one looks at
it. We cannot guarantee anything more strict than the length of a jiffy
or kernel.sched_cfs_bandwidth_slice_us (simplified ofc.), especially
since we allow runtime from one period to be used in another. I think
there is a "big" distinction between runtime transferred from the cfs_bw
to cfs_rq's in a period compared to the actual runtime used.

> Default value of kernel.sched_cfs_bandwidth_slice_us(5ms) and CONFIG_HZ(1000) is used.

You should mention that in the msg then, since it is highly relevant to
the results. Can you try to tweak kernel.sched_cfs_bandwidth_slice_us to
something like 1ms, and see what the result will be?

For such a workload and high cfs_bw_slice, a smaller CONFIG_HZ might
also be beneficial (although there are many things to consider when
talking about that, and a lot of people know more about that than me).

> The following case might be used to prevent getting throttled from many threads and high bandwidth
> slice:
>
> mkdir /sys/fs/cgroup/cpu/test
> echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>
> ./schbench -m 1 -t 3 -r 20 -c 80000 -R 20
>
> On my machine, two workers work for 80ms and sleep for 120ms in each round. The average utilization is
> around 80%. This will work on a two-core system. It is recommended to  try it multiple times as getting
> throttled doesn't necessarily cause tail latency for schbench.

When I run this, I get the following results without cfs bandwidth enabled.

$ time ./schbench -m 1 -t 3 -r 20 -c 80000 -R 20
Latency percentiles (usec) runtime 20 (s) (398 total samples)
        50.0th: 22 (201 samples)
        75.0th: 50 (158 samples)
        90.0th: 50 (0 samples)
        95.0th: 51 (38 samples)
        *99.0th: 51 (0 samples)
        99.5th: 51 (0 samples)
        99.9th: 52 (1 samples)
        min=5, max=52
rps: 19900000.00 p95 (usec) 51 p99 (usec) 51 p95/cputime 0.06% p99/cputime 0.06%
./schbench -m 1 -t 3 -r 20 -c 80000 -R 20  31.85s user 0.00s system
159% cpu 20.021 total

In this case, I see 80% load on two cores, ending at a total of 160%.
With period 100ms and quota 100ms (aka 1 CPU), throttling is what you
would expect, no? In that case, burst wouldn't matter?


Thanks
Odin

* Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller
  2021-05-21  9:38       ` Odin Ugedal
@ 2021-05-21 12:38         ` changhuaixin
  0 siblings, 0 replies; 21+ messages in thread
From: changhuaixin @ 2021-05-21 12:38 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: changhuaixin, Benjamin Segall, Dietmar Eggemann, dtcccc,
	Juri Lelli, khlebnikov, open list, Mel Gorman, Ingo Molnar,
	pauld, Peter Zijlstra, Paul Turner, Steven Rostedt, Shanpei Chen,
	Tejun Heo, Vincent Guittot, xiyou.wangcong



> On May 21, 2021, at 5:38 PM, Odin Ugedal <odin@uged.al> wrote:
> 
> Hi,
> 
>> Yeah, it is a well tuned workload and configuration. I did this because for benchmarks
>> like schbench, workloads are generated in a fixed pattern without burst. So I set schbench
>> params carefully to generate burst during each 100ms periods, to show burst works. Longer
>> period or higher quota helps indeed, in which case more workloads can be used to generate
>> tail latency then.
> 
> Yeah, that makes sense. When it comes to fairness (you are talking
> about generating tail
> latency), I think configuration of cpu shares/weight between cgroups
> is more relevant.
> 
> How much more tail latency will a cgroup be able to "create" when
> doubling the period?
> 

Indeed, fairness is another factor relevant to tail latency. However, real workloads benefit from
the burst feature, too. For Java workloads with equal fairness between cgroups, a huge drop in tail
latency, from 500ms to 27ms, is seen after enabling the burst feature. I shouldn't have deleted this
info from the msg.

I guess tail latency from schbench is small here, because schbench is simple and only measures wakeup
latency. For workloads measuring round-trip time, the effect of getting throttled is more obvious.

> 
>> In my view, burst is like the cfsb way of token bucket. For the present cfsb, bucket capacity
>> is strictly limited to quota. And that is changed into quota + burst now. And it shall be used when
>> tasks get throttled and CPU is under utilized for the whole system.
> 
> Well, it is as strict as we can make it, depending on how one looks at it. We
> cannot guarantee anything more strict than the length of a jiffy or
> kernel.sched_cfs_bandwidth_slice_us (simplified ofc.), especially since we allow
> runtime from one period to be used in another. I think there is a
> "big" distinction between
> runtime transferred from the cfs_bw to cfs_rq's in a period compared
> to the actual runtime used.
> 
>> Default value of kernel.sched_cfs_bandwidth_slice_us(5ms) and CONFIG_HZ(1000) is used.
> 
> You should mention that in the msg then, since it is highly relevant
> to the results. Can you try to tweak

Sorry for causing trouble reproducing this. I'll add this info.

> kernel.sched_cfs_bandwidth_slice_us to something like 1ms, and see
> what the result will be?
> 

After using a 1ms kernel.sched_cfs_bandwidth_slice_us, I see the 99.0th and 99.5th latencies drop,
while the 99.9th latency remains at several ms. I guess I can't tell it apart from small spikes now.

# 1ms kernel.sched_cfs_bandwidth_slice_us
echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
echo 600000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
#echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
cat /sys/fs/cgroup/cpu/test/cpu.stat | grep nr_throttled

./schbench -m 1 -t 30 -r 10 -c 10000 -R 500

Latency percentiles (usec)
	50.0000th: 8
	75.0000th: 8
	90.0000th: 9
	95.0000th: 10
	*99.0000th: 13
	99.5000th: 17
	99.9000th: 6408
	min=0, max=7576
rps: 497.44 p95 (usec) 10 p99 (usec) 13 p95/cputime 0.10% p99/cputime 0.13%


> For such a workload and high cfs_bw_slice, a smaller CONFIG_HZ might
> also be beneficial (although
> there are many things to consider when talking about that, and a lot
> of people know more about that than me).
> 
>> The following case might be used to prevent getting throttled from many threads and high bandwidth
>> slice:
>> 
>> mkdir /sys/fs/cgroup/cpu/test
>> echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
>> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
>> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>> 
>> ./schbench -m 1 -t 3 -r 20 -c 80000 -R 20
>> 
>> On my machine, two workers work for 80ms and sleep for 120ms in each round. The average utilization is
>> around 80%. This will work on a two-core system. It is recommended to  try it multiple times as getting
>> throttled doesn't necessarily cause tail latency for schbench.
> 
> When I run this, I get the following results without cfs bandwidth enabled.
> 
> $ time ./schbench -m 1 -t 3 -r 20 -c 80000 -R 20
> Latency percentiles (usec) runtime 20 (s) (398 total samples)
>        50.0th: 22 (201 samples)
>        75.0th: 50 (158 samples)
>        90.0th: 50 (0 samples)
>        95.0th: 51 (38 samples)
>        *99.0th: 51 (0 samples)
>        99.5th: 51 (0 samples)
>        99.9th: 52 (1 samples)
>        min=5, max=52
> rps: 19900000.00 p95 (usec) 51 p99 (usec) 51 p95/cputime 0.06% p99/cputime 0.06%
> ./schbench -m 1 -t 3 -r 20 -c 80000 -R 20  31.85s user 0.00s system
> 159% cpu 20.021 total
> 
> In this case, I see 80% load on two cores, ending at a total of 160%. If setting
> period: 100ms and quota: 100ms (aka. 1 cpu), throttling is what
> you would expect, or?. In this case, burst wouldn't matter?
> 

Sorry for my mistake. The -R option should be 10 instead of 20, and the case should be:

# 1ms kernel.sched_cfs_bandwidth_slice_us
mkdir /sys/fs/cgroup/cpu/test
echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

./schbench -m 1 -t 3 -r 20 -c 80000 -R 10

The average CPU usage is at 80%. I ran this 10 times, got long tail latency 6 times, and got
throttled 8 times.

Tail latencies are shown below, and this wasn't the worst case.

Latency percentiles (usec)
	50.0000th: 19872
	75.0000th: 21344
	90.0000th: 22176
	95.0000th: 22496
	*99.0000th: 22752
	99.5000th: 22752
	99.9000th: 22752
	min=0, max=22727
rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%


Sometimes the measured period of schbench is not throttled, and thus no tail latency is seen. Sometimes
tasks do not get throttled because the offset of the schbench workers' start from the period start matters,
too. In this case, the two workers run for 80ms and sleep for 120ms; if the 80ms busy window of the 2
workers is cut across two cfsb periods, they might not get throttled.

I'll use this case in the commit log.
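
The phase-offset effect above can be checked with a toy millisecond-granularity
simulation (hedged: this is just the arithmetic of the example, not the
kernel's slice-based accounting, and it assumes the group was idle long
enough before t=0 to accumulate the full burst):

```python
def ever_throttled(offset_ms, quota=100, period=100, burst=0, rounds=10):
    """Two workers each run 80ms then sleep 120ms per 200ms round.
    Return True if the group ever runs out of runtime."""
    runtime = quota + burst          # assume full burst accumulated
    throttled = False
    for t in range(rounds * 200):
        if t > 0 and t % period == 0:
            # per-period refill, capped at quota + burst
            runtime = min(runtime + quota, quota + burst)
        busy = ((t + offset_ms) % 200) < 80   # inside the 80ms busy window?
        if busy:
            if runtime >= 2:                  # two workers, 1ms each
                runtime -= 2
            else:
                throttled = True
    return throttled

print(ever_throttled(0, burst=0))     # True: 160ms demand lands in one period
print(ever_throttled(0, burst=100))   # False: burst absorbs the spike
print(ever_throttled(140, burst=0))   # False: busy window split 40ms/40ms
```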

> 
> Thanks
> Odin


* Re: [PATCH v5 2/3] sched/fair: Add cfs bandwidth burst statistics
  2021-05-20 14:11   ` Odin Ugedal
@ 2021-05-21 12:42     ` changhuaixin
  2021-05-21 14:05     ` Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: changhuaixin @ 2021-05-21 12:42 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: changhuaixin, Benjamin Segall, Dietmar Eggemann, dtcccc,
	Juri Lelli, khlebnikov, open list, Mel Gorman, Ingo Molnar,
	pauld, Peter Zijlstra, Paul Turner, Steven Rostedt, shanpeic,
	Tejun Heo, Vincent Guittot, xiyou.wangcong



> On May 20, 2021, at 10:11 PM, Odin Ugedal <odin@uged.al> wrote:
> 
> I am a bit sceptical about both the nr_burst and burst_time as they are now.
> 
> As an example; a control group using "99.9%" of the quota each period
> and that is never throttled. Such group would with this patch with a burst of X
> still get nr_throttled = 0 (as before), but it would get a nr_burst
> and burst_time that
> will keep increasing.
> 

Agreed, there are false positive and false negative cases, as the current implementation
uses cfs_b->runtime to judge instead of the actual runtime used.

> I think there is a big difference between runtime moved/taken from
> cfs_b->runtime to cfs_rq->runtime_remaining and the actual runtime used
> in the period. Currently, cfs bw can only supply info the first one, and
> not the latter.
> 
> I think that if people see nr_burst increasing, that they think they _have_
> to use cfs burst in order to avoid being throttled, even though that might
> not be the case. It is probably fine as is, as long as it is explicitly stated

People can't see nr_burst increasing before they use the cfs burst feature, since nr_burst only
counts once burst is configured. Do you mean people see nr_throttled increasing and turn to cfs
burst, while the actual usage is below quota? In that case, tasks get throttled because there is
runtime to be returned from the cfs_rq's, and they get unthrottled shortly after. That is a false
positive for nr_throttled. When users see that, using burst can help.

> what the values mean and imply, and what they do not. I cannot see another
> way to calculate it as it is now, but maybe someone else has some thoughts.
> 
> Thanks
> Odin


* Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller
  2021-05-20 12:34 ` [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller Huaixin Chang
  2021-05-20 14:00   ` Odin Ugedal
@ 2021-05-21 14:00   ` Peter Zijlstra
  2021-05-24 12:42     ` changhuaixin
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2021-05-21 14:00 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: bsegall, dietmar.eggemann, dtcccc, juri.lelli, khlebnikov,
	linux-kernel, mgorman, mingo, odin, odin, pauld, pjt, rostedt,
	shanpeic, tj, vincent.guittot, xiyou.wangcong

On Thu, May 20, 2021 at 08:34:17PM +0800, Huaixin Chang wrote:
> The CFS bandwidth controller limits CPU requests of a task group to
> quota during each period. However, parallel workloads might be bursty
> so that they get throttled even when their average utilization is under
> quota. And they are latency sensitive at the same time so that
> throttling them is undesired.
> 
> Scaling up period and quota allows greater burst capacity. But it might
> cause longer stuck till next refill. Introduce "burst" to allow
> accumulating unused quota from previous periods, and to be assigned when
> a task group requests more CPU than quota during a specific period.
> 
> Introducing burst buffer might also cause interference to other groups.
> Thus limit the maximum accumulated buffer by "burst", and limit
> the maximum allowed burst by quota, too.

Overall, *much* better than before.

However I would like a little bit better discussion of how exactly
people are supposed to reason about this. That will also help with the
question from Odin on how people are supposed to set/compute this burst
value.

So traditional (UP-EDF) bandwidth control is something like:

  (U = \Sum u_i) <= 1

This guarantees both that every deadline is met and that the system is
stable. After all, if U were > 1, then for every second of walltime,
we'd have to run more than a second of program time, and obviously miss
our deadline, but the next deadline will be further out still, there is
never time to catch up, unbounded fail.

This work observes that a workload doesn't always execute the full
quota; this enables one to describe u_i as a statistical distribution.

For example, have u_i = {x,e}_i, where x is the p(95) and x+e the p(100)
(the traditional WCET). This effectively allows u to be smaller,
increasing the efficiency (we can pack more tasks in the system), but at
the cost of missing deadlines when all the odds line up. However, it
does maintain stability, since every overrun must be paired with an
underrun as long as our x is above the average.

That is, suppose we have 2 tasks, both specify a p(95) value, then we
have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
everything is good. At the same time we have a p(5)p(5) = 0.25% chance
both tasks will exceed their quota at the same time (guaranteed deadline
fail). Somewhere in between there's a threshold where one exceeds and
the other doesn't underrun enough to compensate; this depends on the
specific CDFs.

At the same time, we can say that the worst case deadline miss, will be
\Sum e_i; that is, there is a bounded tardiness (under the assumption
that x+e is indeed WCET).

And I think you can compute more fun properties.

Now, CFS bandwidth control is not EDF, and the above doesn't fully
translate, but much does I think.

We borrow time now against our future underrun, at the cost of increased
interference against the other system users. All nicely bounded etc..
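
The two-task numbers above can be checked directly (a sketch of the
arithmetic only; the independence of the two tasks' distributions is an
assumption of the example):

```python
# Each task's quota is set at its p(95), so each stays within quota with
# probability 0.95, independently.
p = 0.95

both_within = p * p          # both tasks inside their quota
both_exceed = (1 - p) ** 2   # both exceed at once -> guaranteed miss

print(f"{both_within:.2%}")  # 90.25%
print(f"{both_exceed:.2%}")  # 0.25%
```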



> Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
> ---
>  include/linux/sched/sysctl.h |  1 +
>  kernel/sched/core.c          | 83 ++++++++++++++++++++++++++++++++++++--------
>  kernel/sched/fair.c          | 21 ++++++++++-
>  kernel/sched/sched.h         |  1 +
>  kernel/sysctl.c              |  9 +++++
>  5 files changed, 99 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index db2c0f34aaaf..08432aeb742e 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -73,6 +73,7 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
>  extern unsigned int sysctl_sched_cfs_bandwidth_slice;
> +extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
>  #endif
>  
>  #ifdef CONFIG_SCHED_AUTOGROUP
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5226cc26a095..7d34b08ee0e5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8962,7 +8962,8 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
>  
>  static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
>  
> -static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
> +				u64 burst)
>  {
>  	int i, ret = 0, runtime_enabled, runtime_was_enabled;
>  	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
> @@ -8992,6 +8993,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>  	if (quota != RUNTIME_INF && quota > max_cfs_runtime)
>  		return -EINVAL;
>  
> +	if (quota != RUNTIME_INF && (burst > quota ||
> +				     burst + quota > max_cfs_runtime))
> +		return -EINVAL;
> +
>  	/*
>  	 * Prevent race between setting of cfs_rq->runtime_enabled and
>  	 * unthrottle_offline_cfs_rqs().

<snip all API crud>

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3248e24a90b0..48fad5cf0f7a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -134,6 +134,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>   * (default: 5 msec, units: microseconds)
>   */
>  unsigned int sysctl_sched_cfs_bandwidth_slice		= 5000UL;
> +
> +/*
> + * A switch for cfs bandwidth burst.
> + *
> + * (default: 1, enabled)
> + */
> +unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
>  #endif
>  
>  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
> @@ -4628,8 +4635,16 @@ static inline u64 sched_cfs_bandwidth_slice(void)
>   */
>  void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>  {
> -	if (cfs_b->quota != RUNTIME_INF)
> +	if (unlikely(cfs_b->quota == RUNTIME_INF))
> +		return;
> +
> +	if (!sysctl_sched_cfs_bw_burst_enabled) {
>  		cfs_b->runtime = cfs_b->quota;
> +		return;
> +	}
> +
> +	cfs_b->runtime += cfs_b->quota;
> +	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
>  }
>  
>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> @@ -4651,6 +4666,9 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
>  	if (cfs_b->quota == RUNTIME_INF)
>  		amount = min_amount;
>  	else {
> +		if (!cfs_b->period_active)
> +			__refill_cfs_bandwidth_runtime(cfs_b);

Why this call?

>  		start_cfs_bandwidth(cfs_b);
>  
>  		if (cfs_b->runtime > 0) {
> @@ -5279,6 +5297,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  	cfs_b->runtime = 0;
>  	cfs_b->quota = RUNTIME_INF;
>  	cfs_b->period = ns_to_ktime(default_cfs_period());
> +	cfs_b->burst = 0;
>  
>  	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
>  	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index a189bec13729..d317ca74a48c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -366,6 +366,7 @@ struct cfs_bandwidth {
>  	ktime_t			period;
>  	u64			quota;
>  	u64			runtime;
> +	u64			burst;
>  	s64			hierarchical_quota;
>  
>  	u8			idle;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 14edf84cc571..fb27bae7ace2 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1816,6 +1816,15 @@ static struct ctl_table kern_table[] = {
>  		.proc_handler	= proc_dointvec_minmax,
>  		.extra1		= SYSCTL_ONE,
>  	},
> +	{
> +		.procname	= "sched_cfs_bw_burst_enabled",
> +		.data		= &sysctl_sched_cfs_bw_burst_enabled,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ZERO,
> +		.extra2		= SYSCTL_ONE,
> +	},
>  #endif
>  #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
>  	{

What's the purpose of this new sysctl? By default it is disabled because
burst==0, only if you set burst to some !0 value does this actually do
anything.

* Re: [PATCH v5 2/3] sched/fair: Add cfs bandwidth burst statistics
  2021-05-20 12:34 ` [PATCH v5 2/3] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
  2021-05-20 14:11   ` Odin Ugedal
  2021-05-20 17:06   ` Tejun Heo
@ 2021-05-21 14:01   ` Peter Zijlstra
  2 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2021-05-21 14:01 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: bsegall, dietmar.eggemann, dtcccc, juri.lelli, khlebnikov,
	linux-kernel, mgorman, mingo, odin, odin, pauld, pjt, rostedt,
	shanpeic, tj, vincent.guittot, xiyou.wangcong

On Thu, May 20, 2021 at 08:34:18PM +0800, Huaixin Chang wrote:
> When using cfs_b and encountering throttled periods, users can configure
> a burst buffer to accommodate bursty workloads. Apart from configuring a
> burst buffer and watching whether the throttled periods disappear,
> statistics on burst buffer usage are also helpful. Thus expose the
> following statistics in the cpu.stat file:
> 

Helpful how.. the above is a bunch of words without any actual
justification for any of this.


* Re: [PATCH v5 2/3] sched/fair: Add cfs bandwidth burst statistics
  2021-05-20 14:11   ` Odin Ugedal
  2021-05-21 12:42     ` changhuaixin
@ 2021-05-21 14:05     ` Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2021-05-21 14:05 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: Huaixin Chang, Benjamin Segall, Dietmar Eggemann, dtcccc,
	Juri Lelli, khlebnikov, open list, Mel Gorman, Ingo Molnar,
	pauld, Paul Turner, Steven Rostedt, shanpeic, Tejun Heo,
	Vincent Guittot, xiyou.wangcong

On Thu, May 20, 2021 at 04:11:52PM +0200, Odin Ugedal wrote:
> I am a bit sceptical about both nr_burst and burst_time as they are now.
> 
> As an example, consider a control group using "99.9%" of the quota each
> period that is never throttled. With this patch and a burst of X, such a
> group would still get nr_throttled = 0 (as before), but its nr_burst and
> burst_time would keep increasing.
> 
> I think there is a big difference between runtime moved/taken from
> cfs_b->runtime to cfs_rq->runtime_remaining and the actual runtime used
> in the period. Currently, cfs bw can only supply info on the former, and
> not the latter.
> 
> I think that if people see nr_burst increasing, that they think they _have_
> to use cfs burst in order to avoid being throttled, even though that might
> not be the case. It is probably fine as is, as long as it is explicitly stated
> what the values mean and imply, and what they do not. I cannot see another
> way to calculate it as it is now, but maybe someone else has some thoughts.

You can always trace the system. I don't think we have nice tracepoints
for any of this, but much can be inferred from the scheduler and hrtimer
tracepoints. Also, kprobes might be employed to stick in more appropriate
thingies, I suppose.

You can also run the workload without bandwidth controls and measure
its job execution times, and from that compute the bandwidth settings,
all without tracepoints.


* Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller
  2021-05-21 14:00   ` Peter Zijlstra
@ 2021-05-24 12:42     ` changhuaixin
  2021-05-25 10:46       ` Peter Zijlstra
                         ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: changhuaixin @ 2021-05-24 12:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: changhuaixin, Benjamin Segall, Dietmar Eggemann, dtcccc,
	Juri Lelli, khlebnikov, open list, Mel Gorman, Ingo Molnar,
	Odin Ugedal, Odin Ugedal, pauld, Paul Turner, Steven Rostedt,
	Shanpei Chen, Tejun Heo, Vincent Guittot, xiyou.wangcong



> On May 21, 2021, at 10:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, May 20, 2021 at 08:34:17PM +0800, Huaixin Chang wrote:
>> The CFS bandwidth controller limits CPU requests of a task group to
>> quota during each period. However, parallel workloads might be bursty,
>> so they get throttled even when their average utilization is under
>> quota. And they are latency sensitive at the same time, so throttling
>> them is undesired.
>> 
>> Scaling up period and quota allows greater burst capacity, but it might
>> cause longer stalls until the next refill. Introduce "burst" to allow
>> accumulating unused quota from previous periods, to be assigned when
>> a task group requests more CPU than quota during a specific period.
>> 
>> Introducing a burst buffer might also cause interference to other groups.
>> Thus limit the maximum accumulated buffer by "burst", and limit
>> the maximum allowed burst by quota, too.
> 
> Overall, *much* better than before.
> 
> However I would like a little bit better discussion of how exactly
> people are supposed to reason about this. That will also help with the
> question from Odin on how people are supposed to set/compute this burst
> value.
> 
> So traditional (UP-EDF) bandwidth control is something like:
> 
>  (U = \Sum u_i) <= 1
> 
> This guarantees both that every deadline is met and that the system is
> stable. After all, if U were > 1, then for every second of walltime,
> we'd have to run more than a second of program time, and obviously miss
> our deadline, but the next deadline will be further out still, there is
> never time to catch up, unbounded fail.
> 
> This work observes that a workload doesn't always execute the full
> quota; this enables one to describe u_i as a statistical distribution.
> 
> For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
> (the traditional WCET). This effectively allows u to be smaller,
> increasing the efficiency (we can pack more tasks in the system), but at
> the cost of missing deadlines when all the odds line up. However, it
> does maintain stability, since every overrun must be paired with an
> underrun as long as our x is above the average.
> 
> That is, suppose we have 2 tasks, both specify a p(95) value, then we
> have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
> everything is good. At the same time we have a p(5)p(5) = 0.25% chance
> both tasks will exceed their quota at the same time (guaranteed deadline
> fail). Somewhere in between there's a threshold where one exceeds and
> the other doesn't underrun enough to compensate; this depends on the
> specific CDFs.
> 
> At the same time, we can say that the worst case deadline miss, will be
> \Sum e_i; that is, there is a bounded tardiness (under the assumption
> that x+e is indeed WCET).
> 
> And I think you can compute more fun properties.
> 
> Now, CFS bandwidth control is not EDF, and the above doesn't fully
> translate, but much does I think.
> 
> We borrow time now against our future underrun, at the cost of increased
> interference against the other system users. All nicely bounded etc..
> 

I shall improve the commit log then.
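To make the quoted p(95) arithmetic concrete, here is a small numeric
sketch (editorial illustration only; the slack values e_i are
hypothetical):

```python
def p_all_within_quota(n: int, p_exceed: float = 0.05) -> float:
    """Chance that all n independent tasks stay within their quota,
    when each is provisioned at its p(95) runtime."""
    return (1.0 - p_exceed) ** n

def p_all_exceed(n: int, p_exceed: float = 0.05) -> float:
    """Chance that all n tasks exceed their quota simultaneously
    (a guaranteed deadline miss)."""
    return p_exceed ** n

print(f"{p_all_within_quota(2):.2%}")  # 90.25%
print(f"{p_all_exceed(2):.2%}")        # 0.25%

# Worst-case tardiness is bounded by \Sum e_i, the per-task gap between
# the p(95) runtime and the WCET (hypothetical values, in seconds):
e = [0.02, 0.03]
print(f"bounded tardiness <= {sum(e):.2f}s")
```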

We did some computation on the probability that the deadline is missed, and on
the expected tardiness bound. These values are calculated for different numbers
of control groups and varying CPU utilization, with runtime drawn from an
exponential, Poisson, or Pareto distribution.

The more control groups there are, the more likely the deadline is met and the
smaller the average WCET to expect, because many equal control groups mean a
small chance of U > 1.

And the more underutilized the whole system is, the more likely the deadline is
met and the smaller the average WCET to expect.

More details are posted in
https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/.
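A rough Monte-Carlo sketch of that kind of experiment (editorial
illustration; the group counts, utilization, and exponential runtimes
here are assumptions, not the exact setup of the linked posting):

```python
import random

def miss_probability(n_groups, mean_util, periods=100_000, seed=1):
    """Fraction of periods in which total demand exceeds one CPU
    (U > 1), with each group's per-period runtime exponentially
    distributed around mean_util / n_groups."""
    rng = random.Random(seed)
    per_group_mean = mean_util / n_groups
    misses = 0
    for _ in range(periods):
        demand = sum(rng.expovariate(1.0 / per_group_mean)
                     for _ in range(n_groups))
        if demand > 1.0:
            misses += 1
    return misses / periods

# At equal total utilization, more (equal) groups -> smaller chance that
# U > 1 in any given period, matching the observation above.
print(miss_probability(2, 0.7))
print(miss_probability(16, 0.7))
```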

> 
> 
>> Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
>> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
>> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
>> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
>> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
>> ---
>> include/linux/sched/sysctl.h |  1 +
>> kernel/sched/core.c          | 83 ++++++++++++++++++++++++++++++++++++--------
>> kernel/sched/fair.c          | 21 ++++++++++-
>> kernel/sched/sched.h         |  1 +
>> kernel/sysctl.c              |  9 +++++
>> 5 files changed, 99 insertions(+), 16 deletions(-)
>> 
>> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
>> index db2c0f34aaaf..08432aeb742e 100644
>> --- a/include/linux/sched/sysctl.h
>> +++ b/include/linux/sched/sysctl.h
>> @@ -73,6 +73,7 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
>> 
>> #ifdef CONFIG_CFS_BANDWIDTH
>> extern unsigned int sysctl_sched_cfs_bandwidth_slice;
>> +extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
>> #endif
>> 
>> #ifdef CONFIG_SCHED_AUTOGROUP
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 5226cc26a095..7d34b08ee0e5 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -8962,7 +8962,8 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
>> 
>> static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
>> 
>> -static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>> +				u64 burst)
>> {
>> 	int i, ret = 0, runtime_enabled, runtime_was_enabled;
>> 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
>> @@ -8992,6 +8993,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>> 	if (quota != RUNTIME_INF && quota > max_cfs_runtime)
>> 		return -EINVAL;
>> 
>> +	if (quota != RUNTIME_INF && (burst > quota ||
>> +				     burst + quota > max_cfs_runtime))
>> +		return -EINVAL;
>> +
>> 	/*
>> 	 * Prevent race between setting of cfs_rq->runtime_enabled and
>> 	 * unthrottle_offline_cfs_rqs().
> 
> <snip all API crud>
> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 3248e24a90b0..48fad5cf0f7a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -134,6 +134,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>>  * (default: 5 msec, units: microseconds)
>>  */
>> unsigned int sysctl_sched_cfs_bandwidth_slice		= 5000UL;
>> +
>> +/*
>> + * A switch for cfs bandwidth burst.
>> + *
>> + * (default: 1, enabled)
>> + */
>> +unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
>> #endif
>> 
>> static inline void update_load_add(struct load_weight *lw, unsigned long inc)
>> @@ -4628,8 +4635,16 @@ static inline u64 sched_cfs_bandwidth_slice(void)
>>  */
>> void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>> {
>> -	if (cfs_b->quota != RUNTIME_INF)
>> +	if (unlikely(cfs_b->quota == RUNTIME_INF))
>> +		return;
>> +
>> +	if (!sysctl_sched_cfs_bw_burst_enabled) {
>> 		cfs_b->runtime = cfs_b->quota;
>> +		return;
>> +	}
>> +
>> +	cfs_b->runtime += cfs_b->quota;
>> +	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
>> }
>> 
>> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>> @@ -4651,6 +4666,9 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
>> 	if (cfs_b->quota == RUNTIME_INF)
>> 		amount = min_amount;
>> 	else {
>> +		if (!cfs_b->period_active)
>> +			__refill_cfs_bandwidth_runtime(cfs_b);
> 
> Why this call?

As the cfs bandwidth period timer stops on idle with runtime unfilled, runtime is refilled when the
timer restarts, to make use of the underrun accumulated while it was stopped. Another way to do this might be:

        throttled = !list_empty(&cfs_b->throttled_cfs_rq);
        cfs_b->nr_periods += overrun;

+       __refill_cfs_bandwidth_runtime(cfs_b);
+
        /*
         * idle depends on !throttled (for the case of a large deficit), and if
         * we're going inactive then everything else can be deferred
         */
        if (cfs_b->idle && !throttled)
                goto out_deactivate;

-       __refill_cfs_bandwidth_runtime(cfs_b);
-
        if (!throttled) {
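For intuition, the burst refill discussed above behaves like a token
bucket; a minimal sketch of the semantics (editorial illustration in
Python, not kernel code):

```python
def refill(runtime: float, quota: float, burst: float) -> float:
    """Per-period refill with burst enabled, mirroring
    __refill_cfs_bandwidth_runtime(): add one period's quota, but never
    hold more than quota + burst of accumulated runtime."""
    return min(runtime + quota, quota + burst)

# quota = 100ms per period, burst = 50ms:
r = refill(0.040, 0.100, 0.050)  # 40ms left over -> 140ms available
r = refill(r, 0.100, 0.050)      # would be 240ms, clamped to 150ms
print(r)
```

So an underrun in one period can be banked for a later burst, while the
interference to other groups stays bounded by the extra "burst" runtime
per period.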

> 
>> 		start_cfs_bandwidth(cfs_b);
>> 
>> 		if (cfs_b->runtime > 0) {
>> @@ -5279,6 +5297,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> 	cfs_b->runtime = 0;
>> 	cfs_b->quota = RUNTIME_INF;
>> 	cfs_b->period = ns_to_ktime(default_cfs_period());
>> +	cfs_b->burst = 0;
>> 
>> 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
>> 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index a189bec13729..d317ca74a48c 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -366,6 +366,7 @@ struct cfs_bandwidth {
>> 	ktime_t			period;
>> 	u64			quota;
>> 	u64			runtime;
>> +	u64			burst;
>> 	s64			hierarchical_quota;
>> 
>> 	u8			idle;
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 14edf84cc571..fb27bae7ace2 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -1816,6 +1816,15 @@ static struct ctl_table kern_table[] = {
>> 		.proc_handler	= proc_dointvec_minmax,
>> 		.extra1		= SYSCTL_ONE,
>> 	},
>> +	{
>> +		.procname	= "sched_cfs_bw_burst_enabled",
>> +		.data		= &sysctl_sched_cfs_bw_burst_enabled,
>> +		.maxlen		= sizeof(unsigned int),
>> +		.mode		= 0644,
>> +		.proc_handler	= proc_dointvec_minmax,
>> +		.extra1		= SYSCTL_ZERO,
>> +		.extra2		= SYSCTL_ONE,
>> +	},
>> #endif
>> #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
>> 	{
> 
> What's the purpose of this new sysctl? By default it is disabled because
> burst==0, only if you set burst to some !0 value does this actually do
> anything.

Originally, this was designed to turn the burst feature off if the system became unstable.
I guess we can remove it, as you have questioned it.


* Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller
  2021-05-24 12:42     ` changhuaixin
@ 2021-05-25 10:46       ` Peter Zijlstra
  2021-05-31  6:59         ` luca abeni
  2021-05-25 10:47       ` Peter Zijlstra
  2021-05-25 10:49       ` Peter Zijlstra
  2 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2021-05-25 10:46 UTC (permalink / raw)
  To: changhuaixin
  Cc: Benjamin Segall, Dietmar Eggemann, dtcccc, Juri Lelli,
	khlebnikov, open list, Mel Gorman, Ingo Molnar, Odin Ugedal,
	Odin Ugedal, pauld, Paul Turner, Steven Rostedt, Shanpei Chen,
	Tejun Heo, Vincent Guittot, xiyou.wangcong, luca.abeni,
	tommaso.cucinotta, baruah, anderson

On Mon, May 24, 2021 at 08:42:03PM +0800, changhuaixin wrote:
> > On May 21, 2021, at 10:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Thu, May 20, 2021 at 08:34:17PM +0800, Huaixin Chang wrote:
> >> The CFS bandwidth controller limits CPU requests of a task group to
> >> quota during each period. However, parallel workloads might be bursty,
> >> so they get throttled even when their average utilization is under
> >> quota. And they are latency sensitive at the same time, so throttling
> >> them is undesired.
> >> 
> >> Scaling up period and quota allows greater burst capacity, but it might
> >> cause longer stalls until the next refill. Introduce "burst" to allow
> >> accumulating unused quota from previous periods, to be assigned when
> >> a task group requests more CPU than quota during a specific period.
> >> 
> >> Introducing a burst buffer might also cause interference to other groups.
> >> Thus limit the maximum accumulated buffer by "burst", and limit
> >> the maximum allowed burst by quota, too.
> > 
> > Overall, *much* better than before.
> > 
> > However I would like a little bit better discussion of how exactly
> > people are supposed to reason about this. That will also help with the
> > question from Odin on how people are supposed to set/compute this burst
> > value.
> > 
> > So traditional (UP-EDF) bandwidth control is something like:
> > 
> >  (U = \Sum u_i) <= 1
> > 
> > This guarantees both that every deadline is met and that the system is
> > stable. After all, if U were > 1, then for every second of walltime,
> > we'd have to run more than a second of program time, and obviously miss
> > our deadline, but the next deadline will be further out still, there is
> > never time to catch up, unbounded fail.
> > 
> > This work observes that a workload doesn't always execute the full
> > quota; this enables one to describe u_i as a statistical distribution.
> > 
> > For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
> > (the traditional WCET). This effectively allows u to be smaller,
> > increasing the efficiency (we can pack more tasks in the system), but at
> > the cost of missing deadlines when all the odds line up. However, it
> > does maintain stability, since every overrun must be paired with an
> > underrun as long as our x is above the average.
> > 
> > That is, suppose we have 2 tasks, both specify a p(95) value, then we
> > have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
> > everything is good. At the same time we have a p(5)p(5) = 0.25% chance
> > both tasks will exceed their quota at the same time (guaranteed deadline
> > fail). Somewhere in between there's a threshold where one exceeds and
> > the other doesn't underrun enough to compensate; this depends on the
> > specific CDFs.
> > 
> > At the same time, we can say that the worst case deadline miss, will be
> > \Sum e_i; that is, there is a bounded tardiness (under the assumption
> > that x+e is indeed WCET).

Having second thoughts about this exact claim; lightning can strike
twice, and if we exceed bounds again before having recovered from the
last time we might exceed the bound mentioned. I _think_ the property
holds, but the bound might need work.

> > And I think you can compute more fun properties.
> > 
> > Now, CFS bandwidth control is not EDF, and the above doesn't fully
> > translate, but much does I think.
> > 
> > We borrow time now against our future underrun, at the cost of increased
> > interference against the other system users. All nicely bounded etc..
> > 
> 
> I shall improve the commit log then.

Thanks!

> We did some computation on the probability that the deadline is missed, and on
> the expected tardiness bound. These values are calculated for different numbers
> of control groups and varying CPU utilization, with runtime drawn from an
> exponential, Poisson, or Pareto distribution.
> 
> The more control groups there are, the more likely the deadline is met and the
> smaller the average WCET to expect, because many equal control groups mean a
> small chance of U > 1.
> 
> And the more underutilized the whole system is, the more likely the deadline is
> met and the smaller the average WCET to expect.
> 
> More details are posted in
> https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/.

Indeed you did; I'm a bit sad it's so hard to find papers that cover
this. When one Googles for 'Probabilistic WCET' there's a fair number of
papers about using Extreme Value Theory to estimate the traditional WCET
given measurement based input. Many from the excellent WCET track at
ECRTS.

The thing is, the last time I attended that conference (which appears to
be almost 4 years ago :/), I'm sure I spoke to people about exactly the
thing explored here. Albeit, at the time we discussed this as a
SCHED_DEADLINE task model extension.

Let me Cc a bunch of people that might know more..,


* Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller
  2021-05-24 12:42     ` changhuaixin
  2021-05-25 10:46       ` Peter Zijlstra
@ 2021-05-25 10:47       ` Peter Zijlstra
  2021-05-25 21:25         ` Benjamin Segall
  2021-05-25 10:49       ` Peter Zijlstra
  2 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2021-05-25 10:47 UTC (permalink / raw)
  To: changhuaixin
  Cc: Benjamin Segall, Dietmar Eggemann, dtcccc, Juri Lelli,
	khlebnikov, open list, Mel Gorman, Ingo Molnar, Odin Ugedal,
	Odin Ugedal, pauld, Paul Turner, Steven Rostedt, Shanpei Chen,
	Tejun Heo, Vincent Guittot, xiyou.wangcong

On Mon, May 24, 2021 at 08:42:03PM +0800, changhuaixin wrote:

> >> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> >> @@ -4651,6 +4666,9 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
> >> 	if (cfs_b->quota == RUNTIME_INF)
> >> 		amount = min_amount;
> >> 	else {
> >> +		if (!cfs_b->period_active)
> >> +			__refill_cfs_bandwidth_runtime(cfs_b);
> > 
> > Why this call?
> 
> As the cfs bandwidth period timer stops on idle with runtime unfilled, runtime is refilled when the
> timer restarts, to make use of the underrun accumulated while it was stopped. Another way to do this might be:
> 
>         throttled = !list_empty(&cfs_b->throttled_cfs_rq);
>         cfs_b->nr_periods += overrun;
> 
> +       __refill_cfs_bandwidth_runtime(cfs_b);
> +
>         /*
>          * idle depends on !throttled (for the case of a large deficit), and if
>          * we're going inactive then everything else can be deferred
>          */
>         if (cfs_b->idle && !throttled)
>                 goto out_deactivate;
> 
> -       __refill_cfs_bandwidth_runtime(cfs_b);
> -
>         if (!throttled) {
> 

Ben, do you have a preference?


* Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller
  2021-05-24 12:42     ` changhuaixin
  2021-05-25 10:46       ` Peter Zijlstra
  2021-05-25 10:47       ` Peter Zijlstra
@ 2021-05-25 10:49       ` Peter Zijlstra
  2 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2021-05-25 10:49 UTC (permalink / raw)
  To: changhuaixin
  Cc: Benjamin Segall, Dietmar Eggemann, dtcccc, Juri Lelli,
	khlebnikov, open list, Mel Gorman, Ingo Molnar, Odin Ugedal,
	Odin Ugedal, pauld, Paul Turner, Steven Rostedt, Shanpei Chen,
	Tejun Heo, Vincent Guittot, xiyou.wangcong

On Mon, May 24, 2021 at 08:42:03PM +0800, changhuaixin wrote:

> >> --- a/kernel/sysctl.c
> >> +++ b/kernel/sysctl.c
> >> @@ -1816,6 +1816,15 @@ static struct ctl_table kern_table[] = {
> >> 		.proc_handler	= proc_dointvec_minmax,
> >> 		.extra1		= SYSCTL_ONE,
> >> 	},
> >> +	{
> >> +		.procname	= "sched_cfs_bw_burst_enabled",
> >> +		.data		= &sysctl_sched_cfs_bw_burst_enabled,
> >> +		.maxlen		= sizeof(unsigned int),
> >> +		.mode		= 0644,
> >> +		.proc_handler	= proc_dointvec_minmax,
> >> +		.extra1		= SYSCTL_ZERO,
> >> +		.extra2		= SYSCTL_ONE,
> >> +	},
> >> #endif
> >> #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
> >> 	{
> > 
> > What's the purpose of this new sysctl? By default it is disabled because
> > burst==0, only if you set burst to some !0 value does this actually do
> > anything.
> 
> Originally, this was designed to turn the burst feature off if the system became unstable.
> I guess we can remove it, as you have questioned it.

Is stability a concern? This is CFS after all; if we overload, we simply
share time as per usual.

If there is a real use-case for a global knob to limit/disable this I
don't object too much, but then please explicitly mention it.


* Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller
  2021-05-25 10:47       ` Peter Zijlstra
@ 2021-05-25 21:25         ` Benjamin Segall
  0 siblings, 0 replies; 21+ messages in thread
From: Benjamin Segall @ 2021-05-25 21:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: changhuaixin, Dietmar Eggemann, dtcccc, Juri Lelli, khlebnikov,
	open list, Mel Gorman, Ingo Molnar, Odin Ugedal, Odin Ugedal,
	pauld, Paul Turner, Steven Rostedt, Shanpei Chen, Tejun Heo,
	Vincent Guittot, xiyou.wangcong

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, May 24, 2021 at 08:42:03PM +0800, changhuaixin wrote:
>
>> >> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>> >> @@ -4651,6 +4666,9 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
>> >> 	if (cfs_b->quota == RUNTIME_INF)
>> >> 		amount = min_amount;
>> >> 	else {
>> >> +		if (!cfs_b->period_active)
>> >> +			__refill_cfs_bandwidth_runtime(cfs_b);
>> > 
>> > Why this call?
>> 
>> As the cfs bandwidth period timer stops on idle with runtime unfilled, runtime is refilled when the
>> timer restarts, to make use of the underrun accumulated while it was stopped. Another way to do this might be:
>> 
>>         throttled = !list_empty(&cfs_b->throttled_cfs_rq);
>>         cfs_b->nr_periods += overrun;
>> 
>> +       __refill_cfs_bandwidth_runtime(cfs_b);
>> +
>>         /*
>>          * idle depends on !throttled (for the case of a large deficit), and if
>>          * we're going inactive then everything else can be deferred
>>          */
>>         if (cfs_b->idle && !throttled)
>>                 goto out_deactivate;
>> 
>> -       __refill_cfs_bandwidth_runtime(cfs_b);
>> -
>>         if (!throttled) {
>> 
>
> Ben, do you have a preference?


I think I prefer the latter, possibly with a
/* Refill extra burst quota even if cfs_b->idle */


* Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller
  2021-05-25 10:46       ` Peter Zijlstra
@ 2021-05-31  6:59         ` luca abeni
  0 siblings, 0 replies; 21+ messages in thread
From: luca abeni @ 2021-05-31  6:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: changhuaixin, Benjamin Segall, Dietmar Eggemann, dtcccc,
	Juri Lelli, khlebnikov, open list, Mel Gorman, Ingo Molnar,
	Odin Ugedal, Odin Ugedal, pauld, Paul Turner, Steven Rostedt,
	Shanpei Chen, Tejun Heo, Vincent Guittot, xiyou.wangcong,
	tommaso.cucinotta, baruah, anderson

Hi all,

On Tue, 25 May 2021 12:46:52 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > We did some computation on the probability that the deadline is missed, and on
> > the expected tardiness bound. These values are calculated for different numbers
> > of control groups and varying CPU utilization, with runtime drawn from an
> > exponential, Poisson, or Pareto distribution.
> > 
> > The more control groups there are, the more likely the deadline is met and the
> > smaller the average WCET to expect, because many equal control groups mean a
> > small chance of U > 1.
> > 
> > And the more underutilized the whole system is, the more likely the deadline is
> > met and the smaller the average WCET to expect.
> > 
> > More details are posted in
> > https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/.  
> 
> Indeed you did; I'm a bit sad it's so hard to find papers that cover
> this. When one Googles for 'Probabilistic WCET' there's a fair number of
> papers about using Extreme Value Theory to estimate the traditional WCET
> given measurement based input. Many from the excellent WCET track at
> ECRTS.

If I understand well the context, you do not need probabilistic WCET
here...
If you assume to know the probability distribution of the inter-arrival
times and execution times (this is what is assumed at
https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/,
right?), then you can use "standard" queuing theory to compute the
response time distribution.

If I understand well, in the link mentioned above the response times are
measured by simulating a model of the scheduler. Queuing theory can be
used instead, as shown (for example) in these papers:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.7683&rep=rep1&type=pdf
http://retis.sssup.it/~giorgio/paps/2001/wpdrts01.pdf
(these papers consider a scheduler similar to SCHED_DEADLINE, but the
approach can be easily applied to every scheduler that guarantees a
runtime in a period --- I think the CFS controller falls in this
category, right?)
I think the burst mentioned above can be added to this queuing model;
I'll have a look at this in the next days.
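As a rough illustration of the queuing-theory approach (an editorial
sketch under the IID assumption, not taken from the cited papers), the
backlog of a reservation guaranteeing a fixed runtime per period follows
a Lindley-style recursion:

```python
import random

def tardy_fraction(demand_samples, supply, periods=50_000, seed=2):
    """Simulate backlog <- max(0, backlog + demand - supply) for a
    reservation granting `supply` units of runtime per period, with
    per-period demand drawn IID from `demand_samples`; return the
    long-run fraction of periods ending with unfinished (tardy) work."""
    rng = random.Random(seed)
    backlog = 0.0
    tardy = 0
    for _ in range(periods):
        backlog = max(0.0, backlog + rng.choice(demand_samples) - supply)
        if backlog > 0.0:
            tardy += 1
    return tardy / periods

# Supply above the mean demand keeps the queue stable, yet some periods
# still end tardy; supply >= the peak demand drives tardiness to zero.
print(tardy_fraction([0.3, 0.5, 0.9], supply=0.6))
print(tardy_fraction([0.3, 0.5, 0.9], supply=0.9))
```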


The problem with this approach is that the execution times of different
activations of a task are considered to be independent and identically
distributed (this is the infamous "IID assumption"). And this
assumption is often unrealistic...
The probabilistic WCET approach mentioned above allows you to analyze
the behaviour of a scheduler without assuming that the execution (and/or
inter-activation) times are IID.



			Luca


> The thing is, the last time I attended that conference (which appears to
> be almost 4 years ago :/), I'm sure I spoke to people about exactly the
> thing explored here. Albeit, at the time we discussed this as a
> SCHED_DEADLINE task model extension.
> 
> Let me Cc a bunch of people that might know more..,



end of thread, other threads:[~2021-05-31  6:59 UTC | newest]

Thread overview: 21+ messages
-- links below jump to the message on this page --
2021-05-20 12:34 [PATCH v5 0/3] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
2021-05-20 12:34 ` [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller Huaixin Chang
2021-05-20 14:00   ` Odin Ugedal
2021-05-20 17:04     ` Tejun Heo
2021-05-21  9:09     ` changhuaixin
2021-05-21  9:38       ` Odin Ugedal
2021-05-21 12:38         ` changhuaixin
2021-05-21 14:00   ` Peter Zijlstra
2021-05-24 12:42     ` changhuaixin
2021-05-25 10:46       ` Peter Zijlstra
2021-05-31  6:59         ` luca abeni
2021-05-25 10:47       ` Peter Zijlstra
2021-05-25 21:25         ` Benjamin Segall
2021-05-25 10:49       ` Peter Zijlstra
2021-05-20 12:34 ` [PATCH v5 2/3] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
2021-05-20 14:11   ` Odin Ugedal
2021-05-21 12:42     ` changhuaixin
2021-05-21 14:05     ` Peter Zijlstra
2021-05-20 17:06   ` Tejun Heo
2021-05-21 14:01   ` Peter Zijlstra
2021-05-20 12:34 ` [PATCH v5 3/3] sched/fair: Add document for burstable CFS bandwidth Huaixin Chang
