* [PATCH v6 0/3] sched/fair: Burstable CFS bandwidth controller
@ 2021-06-21  9:27 Huaixin Chang
  2021-06-21  9:27 ` [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller Huaixin Chang
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Huaixin Chang @ 2021-06-21  9:27 UTC (permalink / raw)
  To: luca.abeni
  Cc: anderson, baruah, bsegall, changhuaixin, dietmar.eggemann,
	dtcccc, juri.lelli, khlebnikov, linux-kernel, mgorman, mingo,
	odin, odin, pauld, peterz, pjt, rostedt, shanpeic, tj,
	tommaso.cucinotta, vincent.guittot, xiyou.wangcong

Changelog:
v6:
- Separate burst config to cpu.max.burst.
- Rewrite commit log and document for burst feature.
- Remove the global sysctl to disable the burst feature.
- Some code modifications.
- Rebase upon v5.13-rc6.

v5:
- Rearrange into 3 patches, one less than the previous version.
- The interference to other groups is evaluated.
- Put a limit on burst, so that code is further simplified.
- Rebase upon v5.13-rc3.
Link:
https://lore.kernel.org/lkml/20210520123419.8039-1-changhuaixin@linux.alibaba.com/

v4:
- Adjust assignments in tg_set_cfs_bandwidth(), saving unnecessary
  assignments when quota == RUNTIME_INF.
- Getting rid of sysctl_sched_cfs_bw_burst_onset_percent, as there seems to
  be no justification for both controlling the start bandwidth and doing so
  in a percentage way.
- Comment improvement in sched_cfs_period_timer() shifts on explaining
  why max_overrun shifting to 0 is a problem.
- Rename previous_runtime to runtime_at_period_start.
- Add cgroup2 interface and documentation.
- Getting rid of exposing current_bw, as there is not enough justification
  for it and it has an updating problem.
- Add justification for the cpu.stat change in the changelog.
- Rebase upon v5.12-rc3.
- Correct SoB chain.
- Several indentation fixes.
- Adjust quota in schbench test from 700000 to 600000.
Link:
https://lore.kernel.org/lkml/20210316044931.39733-1-changhuaixin@linux.alibaba.com/

v3:
- Fix another issue reported by test robot.
- Update docs as Randy Dunlap suggested.
Link:
https://lore.kernel.org/lkml/20210120122715.29493-1-changhuaixin@linux.alibaba.com/

v2:
- Fix an issue reported by test robot.
- Rewriting docs. Appreciate any further suggestions or help.
Link:
https://lore.kernel.org/lkml/20210121110453.18899-1-changhuaixin@linux.alibaba.com/

v1 Link:
https://lore.kernel.org/lkml/20201217074620.58338-1-changhuaixin@linux.alibaba.com/

Previously, Cong Wang and Konstantin Khlebnikov proposed a similar
feature:
https://lore.kernel.org/lkml/20180522062017.5193-1-xiyou.wangcong@gmail.com/
https://lore.kernel.org/lkml/157476581065.5793.4518979877345136813.stgit@buzz/

This time we present more latency statistics and handle overflow while
accumulating.

Huaixin Chang (3):
  sched/fair: Introduce the burstable CFS controller
  sched/fair: Add cfs bandwidth burst statistics
  sched/fair: Add document for burstable CFS bandwidth

 Documentation/admin-guide/cgroup-v2.rst | 17 +++---
 Documentation/scheduler/sched-bwc.rst   | 76 ++++++++++++++++++++++----
 include/linux/sched/sysctl.h            |  1 +
 kernel/sched/core.c                     | 96 ++++++++++++++++++++++++++-------
 kernel/sched/fair.c                     | 32 ++++++++++-
 kernel/sched/sched.h                    |  4 ++
 kernel/sysctl.c                         |  9 ++++
 7 files changed, 200 insertions(+), 35 deletions(-)

-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller
  2021-06-21  9:27 [PATCH v6 0/3] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
@ 2021-06-21  9:27 ` Huaixin Chang
  2021-06-22 13:19   ` Peter Zijlstra
                     ` (2 more replies)
  2021-06-21  9:27 ` [PATCH v6 2/3] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 17+ messages in thread
From: Huaixin Chang @ 2021-06-21  9:27 UTC (permalink / raw)
  To: luca.abeni
  Cc: anderson, baruah, bsegall, changhuaixin, dietmar.eggemann,
	dtcccc, juri.lelli, khlebnikov, linux-kernel, mgorman, mingo,
	odin, odin, pauld, peterz, pjt, rostedt, shanpeic, tj,
	tommaso.cucinotta, vincent.guittot, xiyou.wangcong

The CFS bandwidth controller limits CPU requests of a task group to
quota during each period. However, parallel workloads might be bursty
so that they get throttled even when their average utilization is under
quota. At the same time they are latency sensitive, so
throttling them is undesirable.

We borrow time now against our future underrun, at the cost of increased
interference against the other system users. All nicely bounded.

Traditional (UP-EDF) bandwidth control is something like:

  (U = \Sum u_i) <= 1

This guarantees both that every deadline is met and that the system is
stable. After all, if U were > 1, then for every second of walltime,
we'd have to run more than a second of program time, and obviously miss
our deadline, but the next deadline will be further out still, there is
never time to catch up, unbounded fail.

This work observes that a workload doesn't always execute the full
quota; this enables one to describe u_i as a statistical distribution.

For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
(the traditional WCET). This effectively allows u to be smaller,
increasing the efficiency (we can pack more tasks in the system), but at
the cost of missing deadlines when all the odds line up. However, it
does maintain stability, since every overrun must be paired with an
underrun as long as our x is above the average.

That is, suppose we have 2 tasks, both specify a p(95) value, then we
have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
everything is good. At the same time we have a p(5)p(5) = 0.25% chance
both tasks will exceed their quota at the same time (guaranteed deadline
fail). Somewhere in between there's a threshold where one exceeds and
the other doesn't underrun enough to compensate; this depends on the
specific CDFs.

At the same time, we can say that the worst case deadline miss, will be
\Sum e_i; that is, there is a bounded tardiness (under the assumption
that x+e is indeed WCET).
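
As a hypothetical illustration (the numbers here are made up): take two
tasks, each with x = 0.4 and e = 0.2. Provisioning on p(95) gives
U = 0.4 + 0.4 = 0.8 <= 1, so the system remains stable on average, and the
worst case deadline miss is bounded by \Sum e_i = 0.4 of a period. Classic
WCET provisioning would instead require U = 0.6 + 0.6 = 1.2 > 1 and reject
this packing outright.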

The benefit of burst is seen when testing with schbench. The default values of
kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.

	mkdir /sys/fs/cgroup/cpu/test
	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

	./schbench -m 1 -t 3 -r 20 -c 80000 -R 10

The average CPU usage is at 80%. I ran this 10 times, got long-tail
latency 6 times, and got throttled 8 times.

Tail latencies are shown below, and it wasn't the worst case.

	Latency percentiles (usec)
		50.0000th: 19872
		75.0000th: 21344
		90.0000th: 22176
		95.0000th: 22496
		*99.0000th: 22752
		99.5000th: 22752
		99.9000th: 22752
		min=0, max=22727
	rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%

The interference when using burst is evaluated by the possibility of
missing the deadline and the average WCET. Test results showed that when
there are many cgroups or the CPU is under-utilized, the interference is
limited. More details are shown in:
https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
---
 kernel/sched/core.c  | 68 +++++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/fair.c  | 13 ++++++----
 kernel/sched/sched.h |  1 +
 3 files changed, 72 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5226cc26a095..b58ced2194a0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8962,7 +8962,8 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
-static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
+				u64 burst)
 {
 	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -8992,6 +8993,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	if (quota != RUNTIME_INF && quota > max_cfs_runtime)
 		return -EINVAL;
 
+	if (quota != RUNTIME_INF && (burst > quota ||
+				     burst + quota > max_cfs_runtime))
+		return -EINVAL;
+
 	/*
 	 * Prevent race between setting of cfs_rq->runtime_enabled and
 	 * unthrottle_offline_cfs_rqs().
@@ -9013,6 +9018,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
+	cfs_b->burst = burst;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
@@ -9046,9 +9052,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
 static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 {
-	u64 quota, period;
+	u64 quota, period, burst;
 
 	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	burst = tg->cfs_bandwidth.burst;
 	if (cfs_quota_us < 0)
 		quota = RUNTIME_INF;
 	else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
@@ -9056,7 +9063,7 @@ static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 	else
 		return -EINVAL;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_quota(struct task_group *tg)
@@ -9074,15 +9081,16 @@ static long tg_get_cfs_quota(struct task_group *tg)
 
 static int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
 {
-	u64 quota, period;
+	u64 quota, period, burst;
 
 	if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC)
 		return -EINVAL;
 
 	period = (u64)cfs_period_us * NSEC_PER_USEC;
 	quota = tg->cfs_bandwidth.quota;
+	burst = tg->cfs_bandwidth.burst;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_period(struct task_group *tg)
@@ -9095,6 +9103,30 @@ static long tg_get_cfs_period(struct task_group *tg)
 	return cfs_period_us;
 }
 
+static int tg_set_cfs_burst(struct task_group *tg, long cfs_burst_us)
+{
+	u64 quota, period, burst;
+
+	if ((u64)cfs_burst_us > U64_MAX / NSEC_PER_USEC)
+		return -EINVAL;
+
+	burst = (u64)cfs_burst_us * NSEC_PER_USEC;
+	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	quota = tg->cfs_bandwidth.quota;
+
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
+}
+
+static long tg_get_cfs_burst(struct task_group *tg)
+{
+	u64 burst_us;
+
+	burst_us = tg->cfs_bandwidth.burst;
+	do_div(burst_us, NSEC_PER_USEC);
+
+	return burst_us;
+}
+
 static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
 				  struct cftype *cft)
 {
@@ -9119,6 +9151,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
 	return tg_set_cfs_period(css_tg(css), cfs_period_us);
 }
 
+static u64 cpu_cfs_burst_read_u64(struct cgroup_subsys_state *css,
+				  struct cftype *cft)
+{
+	return tg_get_cfs_burst(css_tg(css));
+}
+
+static int cpu_cfs_burst_write_u64(struct cgroup_subsys_state *css,
+				   struct cftype *cftype, u64 cfs_burst_us)
+{
+	return tg_set_cfs_burst(css_tg(css), cfs_burst_us);
+}
+
 struct cfs_schedulable_data {
 	struct task_group *tg;
 	u64 period, quota;
@@ -9271,6 +9315,11 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "cfs_burst_us",
+		.read_u64 = cpu_cfs_burst_read_u64,
+		.write_u64 = cpu_cfs_burst_write_u64,
+	},
 	{
 		.name = "stat",
 		.seq_show = cpu_cfs_stat_show,
@@ -9436,12 +9485,13 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
 {
 	struct task_group *tg = css_tg(of_css(of));
 	u64 period = tg_get_cfs_period(tg);
+	u64 burst = tg_get_cfs_burst(tg);
 	u64 quota;
 	int ret;
 
 	ret = cpu_period_quota_parse(buf, &period, &quota);
 	if (!ret)
-		ret = tg_set_cfs_bandwidth(tg, period, quota);
+		ret = tg_set_cfs_bandwidth(tg, period, quota, burst);
 	return ret ?: nbytes;
 }
 #endif
@@ -9468,6 +9518,12 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_max_show,
 		.write = cpu_max_write,
 	},
+	{
+		.name = "max.burst",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_cfs_burst_read_u64,
+		.write_u64 = cpu_cfs_burst_write_u64,
+	},
 #endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 	{
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2c8a9352590d..53d7cc4d009b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4634,8 +4634,11 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  */
 void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 {
-	if (cfs_b->quota != RUNTIME_INF)
-		cfs_b->runtime = cfs_b->quota;
+	if (unlikely(cfs_b->quota == RUNTIME_INF))
+		return;
+
+	cfs_b->runtime += cfs_b->quota;
+	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4996,6 +4999,9 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	throttled = !list_empty(&cfs_b->throttled_cfs_rq);
 	cfs_b->nr_periods += overrun;
 
+	/* Refill extra burst quota even if cfs_b->idle */
+	__refill_cfs_bandwidth_runtime(cfs_b);
+
 	/*
 	 * idle depends on !throttled (for the case of a large deficit), and if
 	 * we're going inactive then everything else can be deferred
@@ -5003,8 +5009,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	if (cfs_b->idle && !throttled)
 		goto out_deactivate;
 
-	__refill_cfs_bandwidth_runtime(cfs_b);
-
 	if (!throttled) {
 		/* mark as potentially idle for the upcoming period */
 		cfs_b->idle = 1;
@@ -5285,6 +5289,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+	cfs_b->burst = 0;
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a189bec13729..d317ca74a48c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -366,6 +366,7 @@ struct cfs_bandwidth {
 	ktime_t			period;
 	u64			quota;
 	u64			runtime;
+	u64			burst;
 	s64			hierarchical_quota;
 
 	u8			idle;
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 2/3] sched/fair: Add cfs bandwidth burst statistics
  2021-06-21  9:27 [PATCH v6 0/3] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
  2021-06-21  9:27 ` [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller Huaixin Chang
@ 2021-06-21  9:27 ` Huaixin Chang
  2021-06-28 15:00   ` Peter Zijlstra
  2021-06-21  9:28 ` [PATCH v6 3/3] sched/fair: Add document for burstable CFS bandwidth Huaixin Chang
  2021-06-22 14:25 ` [PATCH v6 0/3] sched/fair: Burstable CFS bandwidth controller Tejun Heo
  3 siblings, 1 reply; 17+ messages in thread
From: Huaixin Chang @ 2021-06-21  9:27 UTC (permalink / raw)
  To: luca.abeni
  Cc: anderson, baruah, bsegall, changhuaixin, dietmar.eggemann,
	dtcccc, juri.lelli, khlebnikov, linux-kernel, mgorman, mingo,
	odin, odin, pauld, peterz, pjt, rostedt, shanpeic, tj,
	tommaso.cucinotta, vincent.guittot, xiyou.wangcong

The following statistics are added to the cpu.stat file to show how much a
workload is making use of cfs_b burst:

nr_bursts:  number of periods in which a bandwidth burst occurs
burst_usec: cumulative wall-time that any CPUs have
	    used above quota in the respective periods

The larger nr_bursts is, the more bursty periods there are. The larger
burst_usec is, the more burst time is used by the bursty workload.
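
As a sketch of how this reads back (the cgroup path and the numbers are made
up for illustration):

	# cat /sys/fs/cgroup/cpu/test/cpu.stat
	nr_periods 200
	nr_throttled 12
	throttled_time 1200000000
	nr_bursts 10
	burst_usec 25000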

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
---
 kernel/sched/core.c  | 13 ++++++++++---
 kernel/sched/fair.c  | 11 +++++++++++
 kernel/sched/sched.h |  3 +++
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b58ced2194a0..1e41c51b14b5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9265,6 +9265,9 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
 		seq_printf(sf, "wait_sum %llu\n", ws);
 	}
 
+	seq_printf(sf, "nr_bursts %d\n", cfs_b->nr_burst);
+	seq_printf(sf, "burst_usec %llu\n", cfs_b->burst_time);
+
 	return 0;
 }
 #endif /* CONFIG_CFS_BANDWIDTH */
@@ -9361,16 +9364,20 @@ static int cpu_extra_stat_show(struct seq_file *sf,
 	{
 		struct task_group *tg = css_tg(css);
 		struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
-		u64 throttled_usec;
+		u64 throttled_usec, burst_usec;
 
 		throttled_usec = cfs_b->throttled_time;
 		do_div(throttled_usec, NSEC_PER_USEC);
+		burst_usec = cfs_b->burst_time;
+		do_div(burst_usec, NSEC_PER_USEC);
 
 		seq_printf(sf, "nr_periods %d\n"
 			   "nr_throttled %d\n"
-			   "throttled_usec %llu\n",
+			   "throttled_usec %llu\n"
+			   "nr_bursts %d\n"
+			   "burst_usec %llu\n",
 			   cfs_b->nr_periods, cfs_b->nr_throttled,
-			   throttled_usec);
+			   throttled_usec, cfs_b->nr_burst, burst_usec);
 	}
 #endif
 	return 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 53d7cc4d009b..62b73722e510 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4634,11 +4634,22 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  */
 void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 {
+	u64 runtime;
+
 	if (unlikely(cfs_b->quota == RUNTIME_INF))
 		return;
 
+	if (cfs_b->runtime_at_period_start > cfs_b->runtime) {
+		runtime = cfs_b->runtime_at_period_start - cfs_b->runtime;
+		if (runtime > cfs_b->quota) {
+			cfs_b->burst_time += runtime - cfs_b->quota;
+			cfs_b->nr_burst++;
+		}
+	}
+
 	cfs_b->runtime += cfs_b->quota;
 	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
+	cfs_b->runtime_at_period_start = cfs_b->runtime;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d317ca74a48c..b770b553dfbb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -367,6 +367,7 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	u64			burst;
+	u64			runtime_at_period_start;
 	s64			hierarchical_quota;
 
 	u8			idle;
@@ -379,7 +380,9 @@ struct cfs_bandwidth {
 	/* Statistics: */
 	int			nr_periods;
 	int			nr_throttled;
+	int			nr_burst;
 	u64			throttled_time;
+	u64			burst_time;
 #endif
 };
 
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 3/3] sched/fair: Add document for burstable CFS bandwidth
  2021-06-21  9:27 [PATCH v6 0/3] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
  2021-06-21  9:27 ` [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller Huaixin Chang
  2021-06-21  9:27 ` [PATCH v6 2/3] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
@ 2021-06-21  9:28 ` Huaixin Chang
  2021-06-22 15:26   ` Odin Ugedal
  2021-06-22 14:25 ` [PATCH v6 0/3] sched/fair: Burstable CFS bandwidth controller Tejun Heo
  3 siblings, 1 reply; 17+ messages in thread
From: Huaixin Chang @ 2021-06-21  9:28 UTC (permalink / raw)
  To: luca.abeni
  Cc: anderson, baruah, bsegall, changhuaixin, dietmar.eggemann,
	dtcccc, juri.lelli, khlebnikov, linux-kernel, mgorman, mingo,
	odin, odin, pauld, peterz, pjt, rostedt, shanpeic, tj,
	tommaso.cucinotta, vincent.guittot, xiyou.wangcong

Basic description of usage and effect for CFS Bandwidth Control Burst.
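
For instance, with the cgroup v2 interface added by this series (a sketch;
the values mirror example 4 of the v1 documentation below):

	# echo "20000 50000" > cpu.max	/* quota = 20ms, period = 50ms */
	# echo 10000 > cpu.max.burst	/* burst = 10ms */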

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  8 +++
 Documentation/scheduler/sched-bwc.rst   | 91 +++++++++++++++++++++++++++++----
 2 files changed, 89 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index b1e81aa8598a..3d0a86a065a1 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1000,6 +1000,8 @@ All time durations are in microseconds.
 	- nr_periods
 	- nr_throttled
 	- throttled_usec
+	- nr_bursts
+	- burst_usec
 
   cpu.weight
 	A read-write single value file which exists on non-root
@@ -1031,6 +1033,12 @@ All time durations are in microseconds.
 	$PERIOD duration.  "max" for $MAX indicates no limit.  If only
 	one number is written, $MAX is updated.
 
+  cpu.max.burst
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	The burst in the range [0, $PERIOD].
+
   cpu.pressure
 	A read-write nested-keyed file.
 
diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
index 845eee659199..b1a67fee1d46 100644
--- a/Documentation/scheduler/sched-bwc.rst
+++ b/Documentation/scheduler/sched-bwc.rst
@@ -22,39 +22,89 @@ cfs_quota units at each period boundary. As threads consume this bandwidth it
 is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
+Burst feature
+-------------
+This feature borrows time now against our future underrun, at the cost of
+increased interference against the other system users. All nicely bounded.
+
+Traditional (UP-EDF) bandwidth control is something like:
+
+  (U = \Sum u_i) <= 1
+
+This guarantees both that every deadline is met and that the system is
+stable. After all, if U were > 1, then for every second of walltime,
+we'd have to run more than a second of program time, and obviously miss
+our deadline, but the next deadline will be further out still, there is
+never time to catch up, unbounded fail.
+
+The burst feature observes that a workload doesn't always execute the full
+quota; this enables one to describe u_i as a statistical distribution.
+
+For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
+(the traditional WCET). This effectively allows u to be smaller,
+increasing the efficiency (we can pack more tasks in the system), but at
+the cost of missing deadlines when all the odds line up. However, it
+does maintain stability, since every overrun must be paired with an
+underrun as long as our x is above the average.
+
+That is, suppose we have 2 tasks, both specify a p(95) value, then we
+have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
+everything is good. At the same time we have a p(5)p(5) = 0.25% chance
+both tasks will exceed their quota at the same time (guaranteed deadline
+fail). Somewhere in between there's a threshold where one exceeds and
+the other doesn't underrun enough to compensate; this depends on the
+specific CDFs.
+
+At the same time, we can say that the worst case deadline miss, will be
+\Sum e_i; that is, there is a bounded tardiness (under the assumption
+that x+e is indeed WCET).
+
+The interference when using burst is evaluated by the possibility of
+missing the deadline and the average WCET. Test results showed that when
+there are many cgroups or the CPU is under-utilized, the interference is
+limited. More details are shown in:
+https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/
+
 Management
 ----------
-Quota and period are managed within the cpu subsystem via cgroupfs.
+Quota, period and burst are managed within the cpu subsystem via cgroupfs.
 
 .. note::
    The cgroupfs files described in this section are only applicable
    to cgroup v1. For cgroup v2, see
    :ref:`Documentation/admin-guide/cgroupv2.rst <cgroup-v2-cpu>`.
 
-- cpu.cfs_quota_us: the total available run-time within a period (in
-  microseconds)
+- cpu.cfs_quota_us: run-time replenished within a period (in microseconds)
 - cpu.cfs_period_us: the length of a period (in microseconds)
 - cpu.stat: exports throttling statistics [explained further below]
+- cpu.cfs_burst_us: the maximum accumulated run-time (in microseconds)
 
 The default values are::
 
 	cpu.cfs_period_us=100ms
-	cpu.cfs_quota=-1
+	cpu.cfs_quota_us=-1
+	cpu.cfs_burst_us=0
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
 bandwidth group. This represents the traditional work-conserving behavior for
 CFS.
 
-Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms. There is also an
-upper bound on the period length of 1s. Additional restrictions exist when
-bandwidth limits are used in a hierarchical fashion, these are explained in
-more detail below.
+Writing any (valid) positive value(s) no smaller than cpu.cfs_burst_us will
+enact the specified bandwidth limit. The minimum value allowed for the quota or
+period is 1ms. There is also an upper bound on the period length of 1s.
+Additional restrictions exist when bandwidth limits are used in a hierarchical
+fashion, these are explained in more detail below.
 
 Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
 and return the group to an unconstrained state once more.
 
+A value of 0 for cpu.cfs_burst_us indicates that the group cannot accumulate
+any unused bandwidth. It leaves the traditional bandwidth control behavior for
+CFS unchanged. Writing any (valid) positive value(s) no larger than
+cpu.cfs_quota_us into cpu.cfs_burst_us will enact the cap on unused bandwidth
+accumulation.
+
 Any updates to a group's bandwidth specification will result in it becoming
 unthrottled if it is in a constrained state.
 
@@ -72,9 +122,15 @@ This is tunable via procfs::
 Larger slice values will reduce transfer overheads, while smaller values allow
 for more fine-grained consumption.
 
+There is also a global switch to turn off burst for all groups::
+       /proc/sys/kernel/sched_cfs_bw_burst_enabled (default=1)
+
+By default it is enabled. Writing a 0 value means no accumulated CPU time can be
+used for any group, even if cpu.cfs_burst_us is configured.
+
 Statistics
 ----------
-A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+A group's bandwidth statistics are exported via 5 fields in cpu.stat.
 
 cpu.stat:
 
@@ -82,6 +138,9 @@ cpu.stat:
 - nr_throttled: Number of times the group has been throttled/limited.
 - throttled_time: The total time duration (in nanoseconds) for which entities
   of the group have been throttled.
+- nr_bursts: Number of periods in which a burst occurs.
+- burst_usec: Cumulative wall-time that any CPUs have used above quota in
+  the respective periods.
 
 This interface is read-only.
 
@@ -179,3 +238,15 @@ Examples
 
    By using a small period here we are ensuring a consistent latency
    response at the expense of burst capacity.
+
+4. Limit a group to 40% of 1 CPU, and allow it to accumulate up to an
+   additional 20% of 1 CPU, in case accumulation has been done.
+
+   With 50ms period, 20ms quota will be equivalent to 40% of 1 CPU.
+   And 10ms burst will be equivalent to 20% of 1 CPU.
+
+	# echo 20000 > cpu.cfs_quota_us /* quota = 20ms */
+	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
+	# echo 10000 > cpu.cfs_burst_us /* burst = 10ms */
+
+   A larger buffer setting (no larger than quota) allows greater burst capacity.
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller
  2021-06-21  9:27 ` [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller Huaixin Chang
@ 2021-06-22 13:19   ` Peter Zijlstra
  2021-06-22 18:57     ` Benjamin Segall
  2021-06-24  8:48     ` changhuaixin
  2021-06-22 15:27   ` Odin Ugedal
  2021-06-24  7:39   ` [tip: sched/core] " tip-bot2 for Huaixin Chang
  2 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2021-06-22 13:19 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: luca.abeni, anderson, baruah, bsegall, dietmar.eggemann, dtcccc,
	juri.lelli, khlebnikov, linux-kernel, mgorman, mingo, odin, odin,
	pauld, pjt, rostedt, shanpeic, tj, tommaso.cucinotta,
	vincent.guittot, xiyou.wangcong

On Mon, Jun 21, 2021 at 05:27:58PM +0800, Huaixin Chang wrote:
> The CFS bandwidth controller limits CPU requests of a task group to
> quota during each period. However, parallel workloads might be bursty
> so that they get throttled even when their average utilization is under
> quota. At the same time they are latency sensitive, so
> throttling them is undesirable.
> 
> We borrow time now against our future underrun, at the cost of increased
> interference against the other system users. All nicely bounded.
> 
> Traditional (UP-EDF) bandwidth control is something like:
> 
>   (U = \Sum u_i) <= 1
> 
> This guarantees both that every deadline is met and that the system is
> stable. After all, if U were > 1, then for every second of walltime,
> we'd have to run more than a second of program time, and obviously miss
> our deadline, but the next deadline will be further out still, there is
> never time to catch up, unbounded fail.
> 
> This work observes that a workload doesn't always execute the full
> quota; this enables one to describe u_i as a statistical distribution.
> 
> For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
> (the traditional WCET). This effectively allows u to be smaller,
> increasing the efficiency (we can pack more tasks in the system), but at
> the cost of missing deadlines when all the odds line up. However, it
> does maintain stability, since every overrun must be paired with an
> underrun as long as our x is above the average.
> 
> That is, suppose we have 2 tasks, both specify a p(95) value, then we
> have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
> everything is good. At the same time we have a p(5)p(5) = 0.25% chance
> both tasks will exceed their quota at the same time (guaranteed deadline
> fail). Somewhere in between there's a threshold where one exceeds and
> the other doesn't underrun enough to compensate; this depends on the
> specific CDFs.
> 
> At the same time, we can say that the worst case deadline miss, will be
> \Sum e_i; that is, there is a bounded tardiness (under the assumption
> that x+e is indeed WCET).
> 
> The benefit of burst is seen when testing with schbench. The default values of
> kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.
> 
> 	mkdir /sys/fs/cgroup/cpu/test
> 	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
> 	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
> 	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
> 
> 	./schbench -m 1 -t 3 -r 20 -c 80000 -R 10
> 
> The average CPU usage is at 80%. I ran this 10 times, got long-tail
> latency 6 times, and got throttled 8 times.
> 
> Tail latencies are shown below, and it wasn't the worst case.
> 
> 	Latency percentiles (usec)
> 		50.0000th: 19872
> 		75.0000th: 21344
> 		90.0000th: 22176
> 		95.0000th: 22496
> 		*99.0000th: 22752
> 		99.5000th: 22752
> 		99.9000th: 22752
> 		min=0, max=22727
> 	rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%
> 
> The interference when using burst is evaluated by the possibility of
> missing the deadline and the average WCET. Test results showed that when
> there are many cgroups or the CPU is under-utilized, the interference is
> limited. More details are shown in:
> https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/
> 
> Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
> ---

Ben, what say you? I'm tempted to pick up at least this first patch.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/3] sched/fair: Burstable CFS bandwidth controller
  2021-06-21  9:27 [PATCH v6 0/3] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
                   ` (2 preceding siblings ...)
  2021-06-21  9:28 ` [PATCH v6 3/3] sched/fair: Add document for burstable CFS bandwidth Huaixin Chang
@ 2021-06-22 14:25 ` Tejun Heo
  3 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2021-06-22 14:25 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: luca.abeni, anderson, baruah, bsegall, dietmar.eggemann, dtcccc,
	juri.lelli, khlebnikov, linux-kernel, mgorman, mingo, odin, odin,
	pauld, peterz, pjt, rostedt, shanpeic, tommaso.cucinotta,
	vincent.guittot, xiyou.wangcong

On Mon, Jun 21, 2021 at 05:27:57PM +0800, Huaixin Chang wrote:
> Changelog:
> v6:
> - Separate burst config to cpu.max.burst.
> - Rewrite commit log and document for burst feature.
> - Remove the global sysctl to disable the burst feature.
> - Some code modifications.
> - Rebase upon v5.13-rc6.

This looks fine from cgroup interface POV.

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 3/3] sched/fair: Add document for burstable CFS bandwidth
  2021-06-21  9:28 ` [PATCH v6 3/3] sched/fair: Add document for burstable CFS bandwidth Huaixin Chang
@ 2021-06-22 15:26   ` Odin Ugedal
  0 siblings, 0 replies; 17+ messages in thread
From: Odin Ugedal @ 2021-06-22 15:26 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: luca.abeni, anderson, baruah, Benjamin Segall, Dietmar Eggemann,
	dtcccc, Juri Lelli, khlebnikov, open list, Mel Gorman,
	Ingo Molnar, Odin Ugedal, pauld, Peter Zijlstra, Paul Turner,
	Steven Rostedt, Shanpei Chen, Tejun Heo, tommaso.cucinotta,
	Vincent Guittot, xiyou.wangcong

Hi,

man. 21. jun. 2021 kl. 11:28 skrev Huaixin Chang
<changhuaixin@linux.alibaba.com>:
>
> Basic description of usage and effect for CFS Bandwidth Control Burst.
>
> Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  8 +++
>  Documentation/scheduler/sched-bwc.rst   | 91 +++++++++++++++++++++++++++++----
>  2 files changed, 89 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index b1e81aa8598a..3d0a86a065a1 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1000,6 +1000,8 @@ All time durations are in microseconds.
>         - nr_periods
>         - nr_throttled
>         - throttled_usec
> +       - nr_bursts
> +       - burst_usec
>
>    cpu.weight
>         A read-write single value file which exists on non-root
> @@ -1031,6 +1033,12 @@ All time durations are in microseconds.
>         $PERIOD duration.  "max" for $MAX indicates no limit.  If only
>         one number is written, $MAX is updated.
>
> +  cpu.max.burst
> +       A read-write single value file which exists on non-root
> +       cgroups.  The default is "0".
> +
> +       The burst in the range [0, $PERIOD].

Is this supposed to be $PERIOD or $QUOTA?

> +
>    cpu.pressure
>         A read-write nested-keyed file.
>
> diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
> index 845eee659199..b1a67fee1d46 100644
> --- a/Documentation/scheduler/sched-bwc.rst
> +++ b/Documentation/scheduler/sched-bwc.rst
> @@ -22,39 +22,89 @@ cfs_quota units at each period boundary. As threads consume this bandwidth it
>  is transferred to cpu-local "silos" on a demand basis. The amount transferred
>  within each of these updates is tunable and described as the "slice".
>
> +Burst feature
> +-------------
> +This feature borrows time now against our future underrun, at the cost of
> +increased interference against the other system users. All nicely bounded.
> +
> +Traditional (UP-EDF) bandwidth control is something like:
> +
> +  (U = \Sum u_i) <= 1
> +
> +This guarantees both that every deadline is met and that the system is
> +stable. After all, if U were > 1, then for every second of walltime,
> +we'd have to run more than a second of program time, and obviously miss
> +our deadline, but the next deadline will be further out still, there is
> +never time to catch up, unbounded fail.
> +
> +The burst feature observes that a workload doesn't always execute the full
> +quota; this enables one to describe u_i as a statistical distribution.
> +
> +For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
> +(the traditional WCET). This effectively allows u to be smaller,
> +increasing the efficiency (we can pack more tasks in the system), but at
> +the cost of missing deadlines when all the odds line up. However, it
> +does maintain stability, since every overrun must be paired with an
> +underrun as long as our x is above the average.
> +
> +That is, suppose we have 2 tasks, both specify a p(95) value, then we
> +have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
> +everything is good. At the same time we have a p(5)p(5) = 0.25% chance
> +both tasks will exceed their quota at the same time (guaranteed deadline
> +fail). Somewhere in between there's a threshold where one exceeds and
> +the other doesn't underrun enough to compensate; this depends on the
> +specific CDFs.
> +
> +At the same time, we can say that the worst case deadline miss, will be
> +\Sum e_i; that is, there is a bounded tardiness (under the assumption
> +that x+e is indeed WCET).
> +
> +The interference when using burst is evaluated by the possibility of
> +missing the deadline and the average WCET. Test results showed that when
> +there are many cgroups or the CPU is under-utilized, the interference is
> +limited. More details are shown in:
> +https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/
> +
>  Management
>  ----------
> -Quota and period are managed within the cpu subsystem via cgroupfs.
> +Quota, period and burst are managed within the cpu subsystem via cgroupfs.
>
>  .. note::
>     The cgroupfs files described in this section are only applicable
>     to cgroup v1. For cgroup v2, see
>     :ref:`Documentation/admin-guide/cgroupv2.rst <cgroup-v2-cpu>`.
>
> -- cpu.cfs_quota_us: the total available run-time within a period (in
> -  microseconds)
> +- cpu.cfs_quota_us: run-time replenished within a period (in microseconds)
>  - cpu.cfs_period_us: the length of a period (in microseconds)
>  - cpu.stat: exports throttling statistics [explained further below]
> +- cpu.cfs_burst_us: the maximum accumulated run-time (in microseconds)
>
>  The default values are::
>
>         cpu.cfs_period_us=100ms
> -       cpu.cfs_quota=-1
> +       cpu.cfs_quota_us=-1
> +       cpu.cfs_burst_us=0
>
>  A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
>  bandwidth restriction in place, such a group is described as an unconstrained
>  bandwidth group. This represents the traditional work-conserving behavior for
>  CFS.
>
> -Writing any (valid) positive value(s) will enact the specified bandwidth limit.
> -The minimum quota allowed for the quota or period is 1ms. There is also an
> -upper bound on the period length of 1s. Additional restrictions exist when
> -bandwidth limits are used in a hierarchical fashion, these are explained in
> -more detail below.
> +Writing any (valid) positive value(s) no smaller than cpu.cfs_burst_us will
> +enact the specified bandwidth limit. The minimum value allowed for the quota or
> +period is 1ms. There is also an upper bound on the period length of 1s.
> +Additional restrictions exist when bandwidth limits are used in a hierarchical
> +fashion, these are explained in more detail below.
>
>  Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
>  and return the group to an unconstrained state once more.
>
> +A value of 0 for cpu.cfs_burst_us indicates that the group cannot accumulate
> +any unused bandwidth. It leaves the traditional bandwidth control behavior for
> +CFS unchanged. Writing any (valid) positive value(s) no larger than
> +cpu.cfs_quota_us into cpu.cfs_burst_us will enact the cap on unused bandwidth
> +accumulation.
> +
>  Any updates to a group's bandwidth specification will result in it becoming
>  unthrottled if it is in a constrained state.
>
> @@ -72,9 +122,15 @@ This is tunable via procfs::
>  Larger slice values will reduce transfer overheads, while smaller values allow
>  for more fine-grained consumption.
>
> +There is also a global switch to turn off burst for all groups::
> +       /proc/sys/kernel/sched_cfs_bw_burst_enabled (default=1)
> +
> +By default it is enabled. Writing a 0 value means no accumulated CPU time can be
> +used for any group, even if cpu.cfs_burst_us is configured.
> +

nit: This was removed now, or?

>  Statistics
>  ----------
> -A group's bandwidth statistics are exported via 3 fields in cpu.stat.
> +A group's bandwidth statistics are exported via 5 fields in cpu.stat.
>
>  cpu.stat:
>
> @@ -82,6 +138,9 @@ cpu.stat:
>  - nr_throttled: Number of times the group has been throttled/limited.
>  - throttled_time: The total time duration (in nanoseconds) for which entities
>    of the group have been throttled.
> +- nr_bursts: Number of periods in which a burst occurs.
> +- burst_usec: Cumulative wall-time that any CPUs have used above quota in
> +  the respective periods.
>
>  This interface is read-only.
>
> @@ -179,3 +238,15 @@ Examples
>
>     By using a small period here we are ensuring a consistent latency
>     response at the expense of burst capacity.
> +
> +4. Limit a group to 40% of 1 CPU, and allow it to accumulate up to an
> +   additional 20% of 1 CPU, in case accumulation has been done.
> +
> +   With 50ms period, 20ms quota will be equivalent to 40% of 1 CPU.
> +   And 10ms burst will be equivalent to 20% of 1 CPU.
> +
> +       # echo 20000 > cpu.cfs_quota_us /* quota = 20ms */
> +       # echo 50000 > cpu.cfs_period_us /* period = 50ms */
> +       # echo 10000 > cpu.cfs_burst_us /* burst = 10ms */
> +
> +   A larger buffer setting (no larger than quota) allows greater burst capacity.
> --
> 2.14.4.44.g2045bb6
>

Thanks
Odin

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller
  2021-06-21  9:27 ` [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller Huaixin Chang
  2021-06-22 13:19   ` Peter Zijlstra
@ 2021-06-22 15:27   ` Odin Ugedal
  2021-06-23  8:47     ` Peter Zijlstra
  2021-06-24  8:45     ` changhuaixin
  2021-06-24  7:39   ` [tip: sched/core] " tip-bot2 for Huaixin Chang
  2 siblings, 2 replies; 17+ messages in thread
From: Odin Ugedal @ 2021-06-22 15:27 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: luca.abeni, anderson, baruah, Benjamin Segall, Dietmar Eggemann,
	dtcccc, Juri Lelli, khlebnikov, open list, Mel Gorman,
	Ingo Molnar, Odin Ugedal, pauld, Peter Zijlstra, Paul Turner,
	Steven Rostedt, Shanpei Chen, Tejun Heo, tommaso.cucinotta,
	Vincent Guittot, xiyou.wangcong

Hi,

Just some more thoughts.

man. 21. jun. 2021 kl. 11:28 skrev Huaixin Chang
<changhuaixin@linux.alibaba.com>:
>
> The CFS bandwidth controller limits CPU requests of a task group to
> quota during each period. However, parallel workloads might be bursty
> so that they get throttled even when their average utilization is under
> quota. At the same time they are latency sensitive, so
> throttling them is undesirable.
>
> We borrow time now against our future underrun, at the cost of increased
> interference against the other system users. All nicely bounded.
>
> Traditional (UP-EDF) bandwidth control is something like:
>
>   (U = \Sum u_i) <= 1
>
> This guarantees both that every deadline is met and that the system is
> stable. After all, if U were > 1, then for every second of walltime,
> we'd have to run more than a second of program time, and obviously miss
> our deadline, but the next deadline will be further out still, there is
> never time to catch up, unbounded fail.
>
> This work observes that a workload doesn't always execute the full
> quota; this enables one to describe u_i as a statistical distribution.
>
> For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
> (the traditional WCET). This effectively allows u to be smaller,
> increasing the efficiency (we can pack more tasks in the system), but at
> the cost of missing deadlines when all the odds line up. However, it
> does maintain stability, since every overrun must be paired with an
> underrun as long as our x is above the average.
>
> That is, suppose we have 2 tasks, both specify a p(95) value, then we
> have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
> everything is good. At the same time we have a p(5)p(5) = 0.25% chance
> both tasks will exceed their quota at the same time (guaranteed deadline
> fail). Somewhere in between there's a threshold where one exceeds and
> the other doesn't underrun enough to compensate; this depends on the
> specific CDFs.
>
> At the same time, we can say that the worst case deadline miss, will be
> \Sum e_i; that is, there is a bounded tardiness (under the assumption
> that x+e is indeed WCET).
>
> The benefit of burst is seen when testing with schbench. The default values of
> kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.
>
>         mkdir /sys/fs/cgroup/cpu/test
>         echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
>         echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
>         echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>
>         ./schbench -m 1 -t 3 -r 20 -c 80000 -R 10
>
> The average CPU usage is at 80%. I ran this 10 times, got long-tail
> latency 6 times, and got throttled 8 times.

I don't think this is the best example of the benefit given by burst.
If you double the period to 200ms, all throttling is mitigated for this
example. Doubling the period is ~the same as having burst=quota (not 100%
the same). It would be interesting to also see the other workloads referred
to here, and how they behave with a doubled period vs. "100%" burst (e.g.
50% burst is ~the same as going from a 100ms period to a 150ms one).

In the same way, using a 50ms period and 100% burst would cause throttling
here as well.
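
To make the comparison concrete (a sketch, reusing the cgroup from the
schbench example above):

	# 100ms period with 100% burst:
	echo 100000 > cpu.cfs_period_us
	echo 100000 > cpu.cfs_quota_us
	echo 100000 > cpu.cfs_burst_us

	# ~equivalent smoothing by doubling the period instead:
	echo 200000 > cpu.cfs_period_us
	echo 200000 > cpu.cfs_quota_us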

Certain workloads that constantly use about 100% of their quota, with a
short burst over it, would certainly benefit from this. However, those
workloads should then probably have a higher quota as well, also mitigating
that problem, unless we want them to be throttled.

Timing issues for refilling the cfs_b can also cause some throttling, and
naturally giving more runtime will "fix" those issues, but I think there
are better ways of solving them than adding new APIs.


Overall, my biggest issue with this approach is that it would be hard for users
to reason about what burst is, and what it "fixes", especially compared to
the period length. I think a lot of people would benefit from using longer
periods than 100ms when they have way more processes than the ratio between
the quota and the period. I therefore think it would be best to use examples
for when to use burst where a period change would not fix the issue.


> Tail latencies are shown below, and it wasn't the worst case.
>
>         Latency percentiles (usec)
>                 50.0000th: 19872
>                 75.0000th: 21344
>                 90.0000th: 22176
>                 95.0000th: 22496
>                 *99.0000th: 22752
>                 99.5000th: 22752
>                 99.9000th: 22752
>                 min=0, max=22727
>         rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%
>
> The interference when using burst is evaluated by the possibility of
> missing the deadline and the average WCET. Test results showed that when
> there are many cgroups or the CPU is under-utilized, the interference is
> limited. More details are shown in:
> https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/
>
> Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
> ---

Anyways, this should also probably scale burst together with period and quota
when the period is too short. Something like the diff below, as well as
updating the message.


Thanks
Odin


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfaa6e1f6067..ab809bd11785 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5277,6 +5277,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 			if (new < max_cfs_quota_period) {
 				cfs_b->period = ns_to_ktime(new);
 				cfs_b->quota *= 2;
+				cfs_b->burst *= 2;
 
 				pr_warn_ratelimited(
 	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller
  2021-06-22 13:19   ` Peter Zijlstra
@ 2021-06-22 18:57     ` Benjamin Segall
  2021-06-24  8:48     ` changhuaixin
  1 sibling, 0 replies; 17+ messages in thread
From: Benjamin Segall @ 2021-06-22 18:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Huaixin Chang, luca.abeni, anderson, baruah, dietmar.eggemann,
	dtcccc, juri.lelli, khlebnikov, linux-kernel, mgorman, mingo,
	odin, odin, pauld, pjt, rostedt, shanpeic, tj, tommaso.cucinotta,
	vincent.guittot, xiyou.wangcong

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Jun 21, 2021 at 05:27:58PM +0800, Huaixin Chang wrote:
>> The CFS bandwidth controller limits CPU requests of a task group to
>> quota during each period. However, parallel workloads might be bursty
>> so that they get throttled even when their average utilization is under
>> quota. At the same time they are latency sensitive, so
>> throttling them is undesirable.
>> 
>> We borrow time now against our future underrun, at the cost of increased
>> interference against the other system users. All nicely bounded.
>> 
>> Traditional (UP-EDF) bandwidth control is something like:
>> 
>>   (U = \Sum u_i) <= 1
>> 
>> This guarantees both that every deadline is met and that the system is
>> stable. After all, if U were > 1, then for every second of walltime,
>> we'd have to run more than a second of program time, and obviously miss
>> our deadline, but the next deadline will be further out still, there is
>> never time to catch up, unbounded fail.
>> 
>> This work observes that a workload doesn't always execute the full
>> quota; this enables one to describe u_i as a statistical distribution.
>> 
>> For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
>> (the traditional WCET). This effectively allows u to be smaller,
>> increasing the efficiency (we can pack more tasks in the system), but at
>> the cost of missing deadlines when all the odds line up. However, it
>> does maintain stability, since every overrun must be paired with an
>> underrun as long as our x is above the average.
>> 
>> That is, suppose we have 2 tasks, both specify a p(95) value, then we
>> have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
>> everything is good. At the same time we have a p(5)p(5) = 0.25% chance
>> both tasks will exceed their quota at the same time (guaranteed deadline
>> fail). Somewhere in between there's a threshold where one exceeds and
>> the other doesn't underrun enough to compensate; this depends on the
>> specific CDFs.
>> 
>> At the same time, we can say that the worst case deadline miss, will be
>> \Sum e_i; that is, there is a bounded tardiness (under the assumption
>> that x+e is indeed WCET).
>> 
>> The benefit of burst is seen when testing with schbench. The default values of
>> kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.
>> 
>> 	mkdir /sys/fs/cgroup/cpu/test
>> 	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
>> 	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
>> 	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>> 
>> 	./schbench -m 1 -t 3 -r 20 -c 80000 -R 10
>> 
>> The average CPU usage is at 80%. I ran this 10 times, got long-tail
>> latency 6 times, and got throttled 8 times.
>> 
>> Tail latencies are shown below, and it wasn't the worst case.
>> 
>> 	Latency percentiles (usec)
>> 		50.0000th: 19872
>> 		75.0000th: 21344
>> 		90.0000th: 22176
>> 		95.0000th: 22496
>> 		*99.0000th: 22752
>> 		99.5000th: 22752
>> 		99.9000th: 22752
>> 		min=0, max=22727
>> 	rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%
>> 
>> The interference when using burst is evaluated by the possibility of
>> missing the deadline and the average WCET. Test results showed that when
>> there are many cgroups or the CPU is under-utilized, the interference is
>> limited. More details are shown in:
>> https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/
>> 
>> Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
>> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
>> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
>> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
>> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
>> ---
>
> Ben, what say you? I'm tempted to pick up at least this first patch.

Yeah, I'm fine with it; I know internally we've thought about adding
something like this.

Reviewed-by: Ben Segall <bsegall@google.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller
  2021-06-22 15:27   ` Odin Ugedal
@ 2021-06-23  8:47     ` Peter Zijlstra
  2021-06-24  8:45     ` changhuaixin
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2021-06-23  8:47 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: Huaixin Chang, luca.abeni, anderson, baruah, Benjamin Segall,
	Dietmar Eggemann, dtcccc, Juri Lelli, khlebnikov, open list,
	Mel Gorman, Ingo Molnar, pauld, Paul Turner, Steven Rostedt,
	Shanpei Chen, Tejun Heo, tommaso.cucinotta, Vincent Guittot,
	xiyou.wangcong

On Tue, Jun 22, 2021 at 05:27:30PM +0200, Odin Ugedal wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfaa6e1f6067..ab809bd11785 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5277,6 +5277,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  			if (new < max_cfs_quota_period) {
>  				cfs_b->period = ns_to_ktime(new);
>  				cfs_b->quota *= 2;
> +				cfs_b->burst *= 2;
>  
>  				pr_warn_ratelimited(
>  	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",

Thanks, folded that.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [tip: sched/core] sched/fair: Introduce the burstable CFS controller
  2021-06-21  9:27 ` [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller Huaixin Chang
  2021-06-22 13:19   ` Peter Zijlstra
  2021-06-22 15:27   ` Odin Ugedal
@ 2021-06-24  7:39   ` tip-bot2 for Huaixin Chang
  2 siblings, 0 replies; 17+ messages in thread
From: tip-bot2 for Huaixin Chang @ 2021-06-24  7:39 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Shanpei Chen, Tianchen Ding, Huaixin Chang,
	Peter Zijlstra (Intel),
	Ben Segall, Tejun Heo, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     f4183717b370ad28dd0c0d74760142b20e6e7931
Gitweb:        https://git.kernel.org/tip/f4183717b370ad28dd0c0d74760142b20e6e7931
Author:        Huaixin Chang <changhuaixin@linux.alibaba.com>
AuthorDate:    Mon, 21 Jun 2021 17:27:58 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 24 Jun 2021 09:07:50 +02:00

sched/fair: Introduce the burstable CFS controller

The CFS bandwidth controller limits CPU requests of a task group to
quota during each period. However, parallel workloads might be bursty
so that they get throttled even when their average utilization is under
quota. At the same time they are latency sensitive, so throttling
them is undesired.

We borrow time now against our future underrun, at the cost of increased
interference against the other system users. All nicely bounded.

Traditional (UP-EDF) bandwidth control is something like:

  (U = \Sum u_i) <= 1

This guarantees both that every deadline is met and that the system is
stable. After all, if U were > 1, then for every second of walltime,
we'd have to run more than a second of program time, and obviously miss
our deadline, but the next deadline will be further out still, there is
never time to catch up, unbounded fail.

This work observes that a workload doesn't always execute the full
quota; this enables one to describe u_i as a statistical distribution.

For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
(the traditional WCET). This effectively allows u to be smaller,
increasing the efficiency (we can pack more tasks in the system), but at
the cost of missing deadlines when all the odds line up. However, it
does maintain stability, since every overrun must be paired with an
underrun as long as our x is above the average.

That is, suppose we have 2 tasks, both specify a p(95) value, then we
have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
everything is good. At the same time we have a p(5)p(5) = 0.25% chance
both tasks will exceed their quota at the same time (guaranteed deadline
fail). Somewhere in between there's a threshold where one exceeds and
the other doesn't underrun enough to compensate; this depends on the
specific CDFs.
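
As an editorial aside, the two probabilities above can be checked with a few
lines of C (illustrative only, not part of the patch):

	#include <stdio.h>

	/* Chance that two independent tasks both stay within their p(95)
	 * quota, and that both overrun it in the same period. */
	int main(void)
	{
		double p = 0.95;

		printf("both within quota: %.4f\n", p * p);             /* 0.9025 */
		printf("both overrun:      %.4f\n", (1 - p) * (1 - p)); /* 0.0025 */
		return 0;
	}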

At the same time, we can say that the worst case deadline miss will be
\Sum e_i; that is, there is a bounded tardiness (under the assumption
that x+e is indeed WCET).

The benefit of burst is seen when testing with schbench. Default values of
kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.

	mkdir /sys/fs/cgroup/cpu/test
	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

	./schbench -m 1 -t 3 -r 20 -c 80000 -R 10

The average CPU usage is at 80%. I ran this 10 times, and got long tail
latency 6 times and got throttled 8 times.

Tail latencies are shown below, and it wasn't the worst case.

	Latency percentiles (usec)
		50.0000th: 19872
		75.0000th: 21344
		90.0000th: 22176
		95.0000th: 22496
		*99.0000th: 22752
		99.5000th: 22752
		99.9000th: 22752
		min=0, max=22727
	rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%

The interference when using burst is evaluated by the probability of
missing the deadline and by the average WCET. Test results showed that when
there are many cgroups or the CPU is under-utilized, the interference is
limited. More details are shown in:
https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.com
---
 kernel/sched/core.c  | 68 +++++++++++++++++++++++++++++++++++++++----
 kernel/sched/fair.c  | 14 ++++++---
 kernel/sched/sched.h |  1 +-
 3 files changed, 73 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc231d6..2883c22 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9780,7 +9780,8 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
-static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
+				u64 burst)
 {
 	int i, ret = 0, runtime_enabled, runtime_was_enabled;
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -9810,6 +9811,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	if (quota != RUNTIME_INF && quota > max_cfs_runtime)
 		return -EINVAL;
 
+	if (quota != RUNTIME_INF && (burst > quota ||
+				     burst + quota > max_cfs_runtime))
+		return -EINVAL;
+
 	/*
 	 * Prevent race between setting of cfs_rq->runtime_enabled and
 	 * unthrottle_offline_cfs_rqs().
@@ -9831,6 +9836,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
+	cfs_b->burst = burst;
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
@@ -9864,9 +9870,10 @@ out_unlock:
 
 static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 {
-	u64 quota, period;
+	u64 quota, period, burst;
 
 	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	burst = tg->cfs_bandwidth.burst;
 	if (cfs_quota_us < 0)
 		quota = RUNTIME_INF;
 	else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
@@ -9874,7 +9881,7 @@ static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
 	else
 		return -EINVAL;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_quota(struct task_group *tg)
@@ -9892,15 +9899,16 @@ static long tg_get_cfs_quota(struct task_group *tg)
 
 static int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
 {
-	u64 quota, period;
+	u64 quota, period, burst;
 
 	if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC)
 		return -EINVAL;
 
 	period = (u64)cfs_period_us * NSEC_PER_USEC;
 	quota = tg->cfs_bandwidth.quota;
+	burst = tg->cfs_bandwidth.burst;
 
-	return tg_set_cfs_bandwidth(tg, period, quota);
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
 }
 
 static long tg_get_cfs_period(struct task_group *tg)
@@ -9913,6 +9921,30 @@ static long tg_get_cfs_period(struct task_group *tg)
 	return cfs_period_us;
 }
 
+static int tg_set_cfs_burst(struct task_group *tg, long cfs_burst_us)
+{
+	u64 quota, period, burst;
+
+	if ((u64)cfs_burst_us > U64_MAX / NSEC_PER_USEC)
+		return -EINVAL;
+
+	burst = (u64)cfs_burst_us * NSEC_PER_USEC;
+	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	quota = tg->cfs_bandwidth.quota;
+
+	return tg_set_cfs_bandwidth(tg, period, quota, burst);
+}
+
+static long tg_get_cfs_burst(struct task_group *tg)
+{
+	u64 burst_us;
+
+	burst_us = tg->cfs_bandwidth.burst;
+	do_div(burst_us, NSEC_PER_USEC);
+
+	return burst_us;
+}
+
 static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
 				  struct cftype *cft)
 {
@@ -9937,6 +9969,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
 	return tg_set_cfs_period(css_tg(css), cfs_period_us);
 }
 
+static u64 cpu_cfs_burst_read_u64(struct cgroup_subsys_state *css,
+				  struct cftype *cft)
+{
+	return tg_get_cfs_burst(css_tg(css));
+}
+
+static int cpu_cfs_burst_write_u64(struct cgroup_subsys_state *css,
+				   struct cftype *cftype, u64 cfs_burst_us)
+{
+	return tg_set_cfs_burst(css_tg(css), cfs_burst_us);
+}
+
 struct cfs_schedulable_data {
 	struct task_group *tg;
 	u64 period, quota;
@@ -10090,6 +10134,11 @@ static struct cftype cpu_legacy_files[] = {
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
 	{
+		.name = "cfs_burst_us",
+		.read_u64 = cpu_cfs_burst_read_u64,
+		.write_u64 = cpu_cfs_burst_write_u64,
+	},
+	{
 		.name = "stat",
 		.seq_show = cpu_cfs_stat_show,
 	},
@@ -10254,12 +10303,13 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
 {
 	struct task_group *tg = css_tg(of_css(of));
 	u64 period = tg_get_cfs_period(tg);
+	u64 burst = tg_get_cfs_burst(tg);
 	u64 quota;
 	int ret;
 
 	ret = cpu_period_quota_parse(buf, &period, &quota);
 	if (!ret)
-		ret = tg_set_cfs_bandwidth(tg, period, quota);
+		ret = tg_set_cfs_bandwidth(tg, period, quota, burst);
 	return ret ?: nbytes;
 }
 #endif
@@ -10286,6 +10336,12 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_max_show,
 		.write = cpu_max_write,
 	},
+	{
+		.name = "max.burst",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = cpu_cfs_burst_read_u64,
+		.write_u64 = cpu_cfs_burst_write_u64,
+	},
 #endif
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 	{
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7b8990f..4a3e61a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4626,8 +4626,11 @@ static inline u64 sched_cfs_bandwidth_slice(void)
  */
 void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 {
-	if (cfs_b->quota != RUNTIME_INF)
-		cfs_b->runtime = cfs_b->quota;
+	if (unlikely(cfs_b->quota == RUNTIME_INF))
+		return;
+
+	cfs_b->runtime += cfs_b->quota;
+	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4988,6 +4991,9 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	throttled = !list_empty(&cfs_b->throttled_cfs_rq);
 	cfs_b->nr_periods += overrun;
 
+	/* Refill extra burst quota even if cfs_b->idle */
+	__refill_cfs_bandwidth_runtime(cfs_b);
+
 	/*
 	 * idle depends on !throttled (for the case of a large deficit), and if
 	 * we're going inactive then everything else can be deferred
@@ -4995,8 +5001,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	if (cfs_b->idle && !throttled)
 		goto out_deactivate;
 
-	__refill_cfs_bandwidth_runtime(cfs_b);
-
 	if (!throttled) {
 		/* mark as potentially idle for the upcoming period */
 		cfs_b->idle = 1;
@@ -5246,6 +5250,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 			if (new < max_cfs_quota_period) {
 				cfs_b->period = ns_to_ktime(new);
 				cfs_b->quota *= 2;
+				cfs_b->burst *= 2;
 
 				pr_warn_ratelimited(
 	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
@@ -5277,6 +5282,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+	cfs_b->burst = 0;
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 01e48f6..c80d42e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -366,6 +366,7 @@ struct cfs_bandwidth {
 	ktime_t			period;
 	u64			quota;
 	u64			runtime;
+	u64			burst;
 	s64			hierarchical_quota;
 
 	u8			idle;
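
Editorial note: the behavioural core of the diff above is the refill in
__refill_cfs_bandwidth_runtime(); a minimal userspace model of it (a
simplified sketch, with the RUNTIME_INF handling omitted) is:

	#include <stdint.h>

	struct bw {
		uint64_t quota;    /* runtime granted per period, ns */
		uint64_t burst;    /* max unused runtime carried over, ns */
		uint64_t runtime;  /* runtime currently available, ns */
	};

	/* Each period adds quota, but the accumulated runtime is capped
	 * at quota + burst, so at most 'burst' of unused time survives. */
	static void refill(struct bw *b)
	{
		b->runtime += b->quota;
		if (b->runtime > b->quota + b->burst)
			b->runtime = b->quota + b->burst;
	}

With burst == 0 the refill degenerates to the old "runtime = quota"
behaviour, which is why the default leaves existing setups unchanged. The
limit is configured through cpu.cfs_burst_us (cgroup v1) or cpu.max.burst
(cgroup v2), per the new cftype entries above.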

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller
  2021-06-22 15:27   ` Odin Ugedal
  2021-06-23  8:47     ` Peter Zijlstra
@ 2021-06-24  8:45     ` changhuaixin
  1 sibling, 0 replies; 17+ messages in thread
From: changhuaixin @ 2021-06-24  8:45 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: changhuaixin, luca.abeni, anderson, baruah, Benjamin Segall,
	Dietmar Eggemann, dtcccc, Juri Lelli, khlebnikov, open list,
	Mel Gorman, Ingo Molnar, pauld, Peter Zijlstra, Paul Turner,
	Steven Rostedt, Shanpei Chen, Tejun Heo, tommaso.cucinotta,
	Vincent Guittot, xiyou.wangcong



> On Jun 22, 2021, at 11:27 PM, Odin Ugedal <odin@uged.al> wrote:
> 
> Hi,
> 
> Just some more thoughts.
> 
> man. 21. jun. 2021 kl. 11:28 skrev Huaixin Chang
> <changhuaixin@linux.alibaba.com>:
>> 
>> The CFS bandwidth controller limits CPU requests of a task group to
>> quota during each period. However, parallel workloads might be bursty
>> so that they get throttled even when their average utilization is under
>> quota. At the same time they are latency sensitive, so throttling
>> them is undesired.
>> 
>> We borrow time now against our future underrun, at the cost of increased
>> interference against the other system users. All nicely bounded.
>> 
>> Traditional (UP-EDF) bandwidth control is something like:
>> 
>>  (U = \Sum u_i) <= 1
>> 
>> This guarantees both that every deadline is met and that the system is
>> stable. After all, if U were > 1, then for every second of walltime,
>> we'd have to run more than a second of program time, and obviously miss
>> our deadline, but the next deadline will be further out still, there is
>> never time to catch up, unbounded fail.
>> 
>> This work observes that a workload doesn't always execute the full
>> quota; this enables one to describe u_i as a statistical distribution.
>> 
>> For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
>> (the traditional WCET). This effectively allows u to be smaller,
>> increasing the efficiency (we can pack more tasks in the system), but at
>> the cost of missing deadlines when all the odds line up. However, it
>> does maintain stability, since every overrun must be paired with an
>> underrun as long as our x is above the average.
>> 
>> That is, suppose we have 2 tasks, both specify a p(95) value, then we
>> have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
>> everything is good. At the same time we have a p(5)p(5) = 0.25% chance
>> both tasks will exceed their quota at the same time (guaranteed deadline
>> fail). Somewhere in between there's a threshold where one exceeds and
>> the other doesn't underrun enough to compensate; this depends on the
>> specific CDFs.
>> 
>> At the same time, we can say that the worst case deadline miss will be
>> \Sum e_i; that is, there is a bounded tardiness (under the assumption
>> that x+e is indeed WCET).
>> 
>> The benefit of burst is seen when testing with schbench. Default values of
>> kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.
>> 
>>        mkdir /sys/fs/cgroup/cpu/test
>>        echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
>>        echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
>>        echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>> 
>>        ./schbench -m 1 -t 3 -r 20 -c 80000 -R 10
>> 
>> The average CPU usage is at 80%. I ran this 10 times, and got long tail
>> latency 6 times and got throttled 8 times.
> 
> I don't think this is the best example of the benefit given by burst.
> If you double the period to 200ms, all throttling is mitigated for this
> example. Doubling the period is ~the same as having burst=quota (not 100%
> the same). It would be interesting to also see other workloads referred to
> here, and how they behave with a doubled period vs "100%" burst (e.g. 50%
> burst is ~the same as going from a 100ms period to 150ms).
> 
> The same way, using a 50ms period and 100% burst would cause throttling
> here as well.
> 
> Certain workloads that constantly use about 100% of their quota and then
> burst briefly over it would certainly benefit from this. However, those
> workloads should then probably have a higher quota as well, which also
> mitigates the problem, unless we want them to be throttled.
> 
> Timing issues when refilling the cfs_b can also cause some throttling, and
> naturally giving more runtime will "fix" those issues, but I think there
> are better ways of solving that than adding new APIs.
> 
> 
> Overall, my biggest issue with this approach is that it would be hard for users
> to reason about what burst is, and what it "fixes", especially compared to
> the period length. I think a lot of people would benefit from using longer
> periods than 100ms when they have way more processes than the ratio between
> the quota and the period. I therefore think it would be best to use examples
> for when to use burst where a period change would not fix the issue.
> 

Hi,

Maybe burst and a changed period can be used at the same time, and the burst
feature is not designed to replace period tuning anyway. However, I suggest
that burst be considered before enlarging the period, as doubling the period
doubles the WCET. That is to say, compared with using larger periods only,
using burst avoids throttling while also improving the WCET.
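
As a rough illustration (numbers mine, not from the thread): with quota=50ms
and period=100ms, a task that exhausts its quota soon after a refill can
stall for up to ~100ms until the next refill; doubling the period to 200ms
(and quota to 100ms) doubles that worst-case stall, while granting burst
keeps the 100ms period and its tighter bound.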


> 
>> Tail latencies are shown below, and it wasn't the worst case.
>> 
>>        Latency percentiles (usec)
>>                50.0000th: 19872
>>                75.0000th: 21344
>>                90.0000th: 22176
>>                95.0000th: 22496
>>                *99.0000th: 22752
>>                99.5000th: 22752
>>                99.9000th: 22752
>>                min=0, max=22727
>>        rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%
>> 
>> The interference when using burst is evaluated by the probability of
>> missing the deadline and by the average WCET. Test results showed that when
>> there are many cgroups or the CPU is under-utilized, the interference is
>> limited. More details are shown in:
>> https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/
>> 
>> Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
>> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
>> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
>> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
>> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
>> ---
> 
> Anyways, this should also probably scale burst together with period and quota
> when the period is too short. Something like the diff below, as well as
> updating the message.
> 
> 
> Thanks
> Odin
> 
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfaa6e1f6067..ab809bd11785 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5277,6 +5277,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>                        if (new < max_cfs_quota_period) {
>                                cfs_b->period = ns_to_ktime(new);
>                                cfs_b->quota *= 2;
> +                               cfs_b->burst *= 2;
> 
>                                pr_warn_ratelimited(
>        "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller
  2021-06-22 13:19   ` Peter Zijlstra
  2021-06-22 18:57     ` Benjamin Segall
@ 2021-06-24  8:48     ` changhuaixin
  2021-06-24  9:28       ` Peter Zijlstra
  1 sibling, 1 reply; 17+ messages in thread
From: changhuaixin @ 2021-06-24  8:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: changhuaixin, luca.abeni, anderson, baruah, Benjamin Segall,
	Dietmar Eggemann, dtcccc, Juri Lelli, khlebnikov, open list,
	Mel Gorman, Ingo Molnar, Odin Ugedal, Odin Ugedal, pauld,
	Paul Turner, Steven Rostedt, Shanpei Chen, Tejun Heo,
	tommaso.cucinotta, Vincent Guittot, xiyou.wangcong



> On Jun 22, 2021, at 9:19 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, Jun 21, 2021 at 05:27:58PM +0800, Huaixin Chang wrote:
>> The CFS bandwidth controller limits CPU requests of a task group to
>> quota during each period. However, parallel workloads might be bursty
>> so that they get throttled even when their average utilization is under
>> quota. At the same time they are latency sensitive, so throttling
>> them is undesired.
>> 
>> We borrow time now against our future underrun, at the cost of increased
>> interference against the other system users. All nicely bounded.
>> 
>> Traditional (UP-EDF) bandwidth control is something like:
>> 
>>  (U = \Sum u_i) <= 1
>> 
>> This guarantees both that every deadline is met and that the system is
>> stable. After all, if U were > 1, then for every second of walltime,
>> we'd have to run more than a second of program time, and obviously miss
>> our deadline, but the next deadline will be further out still, there is
>> never time to catch up, unbounded fail.
>> 
>> This work observes that a workload doesn't always execute the full
>> quota; this enables one to describe u_i as a statistical distribution.
>> 
>> For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
>> (the traditional WCET). This effectively allows u to be smaller,
>> increasing the efficiency (we can pack more tasks in the system), but at
>> the cost of missing deadlines when all the odds line up. However, it
>> does maintain stability, since every overrun must be paired with an
>> underrun as long as our x is above the average.
>> 
>> That is, suppose we have 2 tasks, both specify a p(95) value, then we
>> have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
>> everything is good. At the same time we have a p(5)p(5) = 0.25% chance
>> both tasks will exceed their quota at the same time (guaranteed deadline
>> fail). Somewhere in between there's a threshold where one exceeds and
>> the other doesn't underrun enough to compensate; this depends on the
>> specific CDFs.
>> 
>> At the same time, we can say that the worst case deadline miss will be
>> \Sum e_i; that is, there is a bounded tardiness (under the assumption
>> that x+e is indeed WCET).
>> 
>> The benefit of burst is seen when testing with schbench. Default values of
>> kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.
>> 
>> 	mkdir /sys/fs/cgroup/cpu/test
>> 	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
>> 	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
>> 	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>> 
>> 	./schbench -m 1 -t 3 -r 20 -c 80000 -R 10
>> 
>> The average CPU usage is at 80%. I ran this 10 times, and got long tail
>> latency 6 times and got throttled 8 times.
>> 
>> Tail latencies are shown below, and it wasn't the worst case.
>> 
>> 	Latency percentiles (usec)
>> 		50.0000th: 19872
>> 		75.0000th: 21344
>> 		90.0000th: 22176
>> 		95.0000th: 22496
>> 		*99.0000th: 22752
>> 		99.5000th: 22752
>> 		99.9000th: 22752
>> 		min=0, max=22727
>> 	rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%
>> 
>> The interference when using burst is evaluated by the probability of
>> missing the deadline and by the average WCET. Test results showed that when
>> there are many cgroups or the CPU is under-utilized, the interference is
>> limited. More details are shown in:
>> https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/
>> 
>> Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
>> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
>> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
>> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
>> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
>> ---
> 
> Ben, what say you? I'm tempted to pick up at least this first patch.

Hi, apart from the documentation issues Odin has replied to, is there anything to improve in the other two patches?


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller
  2021-06-24  8:48     ` changhuaixin
@ 2021-06-24  9:28       ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2021-06-24  9:28 UTC (permalink / raw)
  To: changhuaixin
  Cc: luca.abeni, anderson, baruah, Benjamin Segall, Dietmar Eggemann,
	dtcccc, Juri Lelli, khlebnikov, open list, Mel Gorman,
	Ingo Molnar, Odin Ugedal, Odin Ugedal, pauld, Paul Turner,
	Steven Rostedt, Shanpei Chen, Tejun Heo, tommaso.cucinotta,
	Vincent Guittot, xiyou.wangcong

On Thu, Jun 24, 2021 at 04:48:54PM +0800, changhuaixin wrote:
> Hi, apart from the document issues Odin has replied, is there anything to improve for the other two patches?

I don't like the second patch much; but I've not had enough time to
actually think about anything yet, so I can't really make
recommendations :/

I'll try and get to it..

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 2/3] sched/fair: Add cfs bandwidth burst statistics
  2021-06-21  9:27 ` [PATCH v6 2/3] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
@ 2021-06-28 15:00   ` Peter Zijlstra
  2021-06-28 15:12     ` Peter Zijlstra
  2021-07-02 11:31     ` changhuaixin
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2021-06-28 15:00 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: luca.abeni, anderson, baruah, bsegall, dietmar.eggemann, dtcccc,
	juri.lelli, khlebnikov, linux-kernel, mgorman, mingo, odin, odin,
	pauld, pjt, rostedt, shanpeic, tj, tommaso.cucinotta,
	vincent.guittot, xiyou.wangcong

On Mon, Jun 21, 2021 at 05:27:59PM +0800, Huaixin Chang wrote:
> The following statistics are added to the cpu.stat file to show how much of
> the workload is making use of the cfs_b burst:
> 
> nr_bursts:  number of periods in which a bandwidth burst occurs
> burst_usec: cumulative wall-time that any CPUs have
> 	    used above quota in the respective periods
> 
> The larger nr_bursts is, the more bursty periods there are. And the larger
> burst_usec is, the more burst time is used by the bursty workload.

That's what it does, but it fails to explain why. How are these numbers
useful?

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 53d7cc4d009b..62b73722e510 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4634,11 +4634,22 @@ static inline u64 sched_cfs_bandwidth_slice(void)
>   */
>  void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>  {
> +	u64 runtime;
> +
>  	if (unlikely(cfs_b->quota == RUNTIME_INF))
>  		return;
>  
> +	if (cfs_b->runtime_at_period_start > cfs_b->runtime) {
> +		runtime = cfs_b->runtime_at_period_start - cfs_b->runtime;

That comparison is the same as the subtraction; might as well write
this:

> +		if (runtime > cfs_b->quota) {
> +			cfs_b->burst_time += runtime - cfs_b->quota;

Same here.

> +			cfs_b->nr_burst++;
> +		}
> +	}


Perhaps we can write that like:

	s64 runtime = cfs_b->runtime_snapshot - cfs_b->runtime;
	if (runtime > 0) {
		s64 burst_time = runtime - cfs_b->quota;
		if (burst_time > 0) {
			cfs_b->burst_time += burst_time;
			cfs_b->nr_bursts++;
		}
	}

I was hoping we could get away with something simpler, like maybe:

	u64 old_runtime = cfs_b->runtime;

	cfs_b->runtime += cfs_b->quota;
	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);

	if (cfs_b->runtime - old_runtime > cfs_b->quota)
		cfs_b->nr_bursts++;

Would that be good enough?


> +
>  	cfs_b->runtime += cfs_b->quota;
>  	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
> +	cfs_b->runtime_at_period_start = cfs_b->runtime;
>  }
>  
>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d317ca74a48c..b770b553dfbb 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -367,6 +367,7 @@ struct cfs_bandwidth {
>  	u64			quota;
>  	u64			runtime;
>  	u64			burst;
> +	u64			runtime_at_period_start;
>  	s64			hierarchical_quota;

As per the above, I don't really like that name; runtime_snapshot or
perhaps runtime_snap is shorter and no less clear. But not having it at
all would be even better.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 2/3] sched/fair: Add cfs bandwidth burst statistics
  2021-06-28 15:00   ` Peter Zijlstra
@ 2021-06-28 15:12     ` Peter Zijlstra
  2021-07-02 11:31     ` changhuaixin
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2021-06-28 15:12 UTC (permalink / raw)
  To: Huaixin Chang
  Cc: luca.abeni, anderson, baruah, bsegall, dietmar.eggemann, dtcccc,
	juri.lelli, khlebnikov, linux-kernel, mgorman, mingo, odin, odin,
	pauld, pjt, rostedt, shanpeic, tj, tommaso.cucinotta,
	vincent.guittot, xiyou.wangcong

On Mon, Jun 28, 2021 at 05:00:30PM +0200, Peter Zijlstra wrote:
> I was hoping we could get away with something simpler, like maybe:
> 
> 	u64 old_runtim = cfs_b->runtime;
> 
> 	cfs_b->runtime += cfs_b->quota
> 	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
> 
> 	if (cfs_b->runtime - old_runtime > cfs_b->quota)
> 		cfs_b->nr_bursts++;
> 
> Would that be good enough?

Bah, of course not ... :-/ At best we can detect == quota, which might
be a good enough indicator of burst.
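
Editorial note, making the bound explicit (inferred from the refill code, not
stated in the thread): after a refill,

	runtime' = min(old_runtime + quota, quota + burst)

so runtime' - old_runtime equals quota when old_runtime <= burst, and equals
quota + burst - old_runtime < quota otherwise. The difference can therefore
never exceed quota, so the strict "> quota" test above can never fire.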

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 2/3] sched/fair: Add cfs bandwidth burst statistics
  2021-06-28 15:00   ` Peter Zijlstra
  2021-06-28 15:12     ` Peter Zijlstra
@ 2021-07-02 11:31     ` changhuaixin
  1 sibling, 0 replies; 17+ messages in thread
From: changhuaixin @ 2021-07-02 11:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: changhuaixin, luca.abeni, anderson, baruah, Benjamin Segall,
	Dietmar Eggemann, dtcccc, Juri Lelli, khlebnikov, open list,
	Mel Gorman, Ingo Molnar, Odin Ugedal, Odin Ugedal, pauld,
	Paul Turner, Steven Rostedt, Shanpei Chen, Tejun Heo,
	tommaso.cucinotta, Vincent Guittot, xiyou.wangcong



> On Jun 28, 2021, at 11:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, Jun 21, 2021 at 05:27:59PM +0800, Huaixin Chang wrote:
>> The following statistics are added to the cpu.stat file to show how much of
>> the workload is making use of the cfs_b burst:
>> 
>> nr_bursts:  number of periods in which a bandwidth burst occurs
>> burst_usec: cumulative wall-time that any CPUs have
>> 	    used above quota in the respective periods
>> 
>> The larger nr_bursts is, the more bursty periods there are. And the larger
>> burst_usec is, the more burst time is used by the bursty workload.
> 
> That's what it does, but fails to explain why. How is this number
> useful.
> 

How about this?

The cfs_b burst feature avoids throttling by allowing bandwidth bursts. When using cfs_b
burst, users configure burst and check whether it helps via workload latency and cfs_b
interval statistics like nr_throttled. Two new statistics are also introduced to show the
internals of the burst feature and to explain why burst helps or not:

	nr_bursts:  number of periods in which a bandwidth burst occurs
	burst_usec: cumulative wall-time that any CPUs have
		    used above quota in the respective periods
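
As a hypothetical reading of such counters: nr_periods 1000, nr_bursts 120
and burst_usec 600000 would mean burst occurred in 12% of the periods,
consuming on average 5ms above quota per bursty period, evidence that the
configured burst is actually being used.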


>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 53d7cc4d009b..62b73722e510 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4634,11 +4634,22 @@ static inline u64 sched_cfs_bandwidth_slice(void)
>>  */
>> void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>> {
>> +	u64 runtime;
>> +
>> 	if (unlikely(cfs_b->quota == RUNTIME_INF))
>> 		return;
>> 
>> +	if (cfs_b->runtime_at_period_start > cfs_b->runtime) {
>> +		runtime = cfs_b->runtime_at_period_start - cfs_b->runtime;
> 
> That comparison is the same as the subtraction; might as well write
> this:
> 
>> +		if (runtime > cfs_b->quota) {
>> +			cfs_b->burst_time += runtime - cfs_b->quota;
> 
> Same here.
> 
>> +			cfs_b->nr_burst++;
>> +		}
>> +	}
> 
> 
> Perhaps we can write that like:
> 
> 	s64 runtime = cfs_b->runtime_snapshot - cfs_b->runtime;
> 	if (runtime > 0) {
> 		s64 burst_time = runtime - cfs_b->quota;
> 		if (burst_time > 0) {
> 			cfs_b->burst_time += burst_time;
> 			cfs_b->nr_bursts++;
> 		}
> 	}
> 
> I was hoping we could get away with something simpler, like maybe:
> 

Got it.

> 	u64 old_runtime = cfs_b->runtime;
> 
> 	cfs_b->runtime += cfs_b->quota;
> 	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
> 
> 	if (cfs_b->runtime - old_runtime > cfs_b->quota)
> 		cfs_b->nr_bursts++;
> 
> Would that be good enough?
> 
> 
>> +
>> 	cfs_b->runtime += cfs_b->quota;
>> 	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
>> +	cfs_b->runtime_at_period_start = cfs_b->runtime;
>> }
>> 
>> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index d317ca74a48c..b770b553dfbb 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -367,6 +367,7 @@ struct cfs_bandwidth {
>> 	u64			quota;
>> 	u64			runtime;
>> 	u64			burst;
>> +	u64			runtime_at_period_start;
>> 	s64			hierarchical_quota;
> 
> As per the above, I don't really like that name; runtime_snapshot or
> perhaps runtime_snap is shorter and no less clear. But not having it at
> all would be even better.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-07-02 11:31 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-21  9:27 [PATCH v6 0/3] sched/fair: Burstable CFS bandwidth controller Huaixin Chang
2021-06-21  9:27 ` [PATCH v6 1/3] sched/fair: Introduce the burstable CFS controller Huaixin Chang
2021-06-22 13:19   ` Peter Zijlstra
2021-06-22 18:57     ` Benjamin Segall
2021-06-24  8:48     ` changhuaixin
2021-06-24  9:28       ` Peter Zijlstra
2021-06-22 15:27   ` Odin Ugedal
2021-06-23  8:47     ` Peter Zijlstra
2021-06-24  8:45     ` changhuaixin
2021-06-24  7:39   ` [tip: sched/core] " tip-bot2 for Huaixin Chang
2021-06-21  9:27 ` [PATCH v6 2/3] sched/fair: Add cfs bandwidth burst statistics Huaixin Chang
2021-06-28 15:00   ` Peter Zijlstra
2021-06-28 15:12     ` Peter Zijlstra
2021-07-02 11:31     ` changhuaixin
2021-06-21  9:28 ` [PATCH v6 3/3] sched/fair: Add document for burstable CFS bandwidth Huaixin Chang
2021-06-22 15:26   ` Odin Ugedal
2021-06-22 14:25 ` [PATCH v6 0/3] sched/fair: Burstable CFS bandwidth controller Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).