* [PATCH 0/3] sched/fair: improve CFS throttle
@ 2018-05-22  6:20 Cong Wang
  2018-05-22  6:20 ` [PATCH 1/3] sched/fair: make CFS bandwidth slice per cpu group Cong Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Cong Wang @ 2018-05-22  6:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: ibobrik, Cong Wang

Cong Wang (3):
  sched/fair: make CFS bandwidth slice per cpu group
  sched/fair: accumulate unused cpu time for CFS throttle
  sched/fair: add tracepoints for cfs throttle

 Documentation/scheduler/sched-bwc.txt |  25 ++++++--
 include/trace/events/sched.h          |  42 +++++++++++++
 kernel/sched/core.c                   | 109 ++++++++++++++++++++++++++++++++--
 kernel/sched/fair.c                   |  76 +++++++++++++++---------
 kernel/sched/sched.h                  |   4 +-
 5 files changed, 216 insertions(+), 40 deletions(-)

-- 
2.13.0


* [PATCH 1/3] sched/fair: make CFS bandwidth slice per cpu group
  2018-05-22  6:20 [PATCH 0/3] sched/fair: improve CFS throttle Cong Wang
@ 2018-05-22  6:20 ` Cong Wang
  2018-05-22  6:20 ` [PATCH 2/3] sched/fair: accumulate unused cpu time for CFS throttle Cong Wang
  2018-05-22  6:20 ` [PATCH 3/3] sched/fair: add tracepoints for cfs throttle Cong Wang
  2 siblings, 0 replies; 9+ messages in thread
From: Cong Wang @ 2018-05-22  6:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: ibobrik, Cong Wang, Paul Turner, Peter Zijlstra, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar

We saw tasks in a CPU cgroup being throttled many times even when
they did not appear to overrun their CPU quota. Tracing the cause
showed that one of the problems is that the local CPU which grabs
the last chunk of quota may not use all of it, yet prevents other
CPUs from reusing the unused CPU time. Reducing
sched_cfs_bandwidth_slice_us from the default 5ms to 1ms mostly
solves the problem; at least no tasks were throttled after this
change in my test case.
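
As a rough illustration of the stranding effect (a toy calculation,
not the kernel's actual accounting; the numbers below are made up
for the example):

/* Toy illustration: quota stranded in per-CPU silos when the slice
 * is large. Not kernel code; all values are hypothetical.
 */
#include <stdio.h>

int main(void)
{
	long quota_us = 20000;	/* cpu.cfs_quota_us */
	long slice_us = 5000;	/* sched_cfs_bandwidth_slice_us */
	long used_us  = 1000;	/* what each CPU actually ran */
	int  cpus     = 4;	/* CPUs that each pulled one slice */

	long pulled   = (long)cpus * slice_us;
	long stranded = (long)cpus * (slice_us - used_us);

	printf("left in global pool:     %ld us\n", quota_us - pulled);
	printf("stranded in local silos: %ld us\n", stranded);
	return 0;
}

With a 1ms slice each CPU pulls only roughly what it needs, so far
less quota sits idle in the local silos.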

However, sched_cfs_bandwidth_slice_us is a global setting which
affects all cgroups. Different cgroups may want different values
based on their own workloads; neither 5ms nor 1ms is suitable for
every cgroup. A smaller slice distributes CPU time more fairly at
the cost of slightly higher overhead, so the impact of changing the
slice has to be kept as small as possible. On the other hand, the
periodically refilled global pool is per cgroup, so each cgroup
should be able to distribute its own quota to the local CPUs at its
own preferred granularity.

This patch introduces cpu.cfs_slice_us, which allows each cgroup to
specify its own slice length without any global impact. The global
sysctl sched_cfs_bandwidth_slice_us now only provides the default
value for each cpu.cfs_slice_us. Note that updating this sysctl does
not automatically update existing cgroups still using the default
value; users have to update each existing cgroup individually to
make a global change.
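
For illustration, assuming the v1 cpu controller is mounted at
/sys/fs/cgroup/cpu and a group named "test" already exists (both the
mount point and the group name are assumptions made for this
example), the new knob could be set from userspace like this:

/* Minimal sketch: set a 1ms slice for one cgroup.
 * The cgroup mount point and group name are assumed.
 */
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/cpu/test/cpu.cfs_slice_us";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return 1;
	}
	fprintf(f, "%d\n", 1000);	/* slice in microseconds */
	fclose(f);
	return 0;
}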

Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 Documentation/scheduler/sched-bwc.txt | 16 ++++++++----
 kernel/sched/core.c                   | 47 +++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c                   | 22 ++++++++++------
 kernel/sched/sched.h                  |  1 +
 4 files changed, 73 insertions(+), 13 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873f68ab..d853e5cb5553 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -48,19 +48,25 @@ and return the group to an unconstrained state once more.
 Any updates to a group's bandwidth specification will result in it becoming
 unthrottled if it is in a constrained state.
 
-System wide settings
---------------------
+Slice
+-----
 For efficiency run-time is transferred between the global pool and CPU local
 "silos" in a batch fashion.  This greatly reduces global accounting pressure
 on large systems.  The amount transferred each time such an update is required
 is described as the "slice".
 
-This is tunable via procfs:
-	/proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
-
 Larger slice values will reduce transfer overheads, while smaller values allow
 for more fine-grained consumption.
 
+The per-group file cpu.cfs_slice_us controls the slice length within each CPU
+group, different groups could set different values for their own preference.
+Its default value is tunable via procfs:
+
+	/proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
+
+Note, updating this file does not automatically update existing groups using
+a default slice.
+
 Statistics
 ----------
 A group's bandwidth statistics are exported via 3 fields in cpu.stat.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 092f7c4de903..7015a6143151 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6666,6 +6666,36 @@ long tg_get_cfs_period(struct task_group *tg)
 	return cfs_period_us;
 }
 
+static int tg_set_cfs_slice(struct task_group *tg, long cfs_slice_us)
+{
+	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+	u64 quota, slice;
+
+	if (cfs_slice_us <= 0)
+		return -ERANGE;
+	quota = tg->cfs_bandwidth.quota;
+	if (quota == RUNTIME_INF)
+		return -EINVAL;
+	slice = (u64)cfs_slice_us * NSEC_PER_USEC;
+	if (slice > quota)
+		return -EINVAL;
+
+	raw_spin_lock_irq(&cfs_b->lock);
+	cfs_b->slice = slice;
+	raw_spin_unlock_irq(&cfs_b->lock);
+	return 0;
+}
+
+static long tg_get_cfs_slice(struct task_group *tg)
+{
+	u64 slice_us;
+
+	slice_us = tg->cfs_bandwidth.slice;
+	do_div(slice_us, NSEC_PER_USEC);
+
+	return slice_us;
+}
+
 static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
 				  struct cftype *cft)
 {
@@ -6690,6 +6720,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
 	return tg_set_cfs_period(css_tg(css), cfs_period_us);
 }
 
+static s64 cpu_cfs_slice_read_s64(struct cgroup_subsys_state *css,
+				  struct cftype *cft)
+{
+	return tg_get_cfs_slice(css_tg(css));
+}
+
+static int cpu_cfs_slice_write_s64(struct cgroup_subsys_state *css,
+				   struct cftype *cftype, s64 cfs_quota_us)
+{
+	return tg_set_cfs_slice(css_tg(css), cfs_quota_us);
+}
+
 struct cfs_schedulable_data {
 	struct task_group *tg;
 	u64 period, quota;
@@ -6833,6 +6875,11 @@ static struct cftype cpu_legacy_files[] = {
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
 	{
+		.name = "cfs_slice_us",
+		.read_s64 = cpu_cfs_slice_read_s64,
+		.write_s64 = cpu_cfs_slice_write_s64,
+	},
+	{
 		.name = "stat",
 		.seq_show = cpu_cfs_stat_show,
 	},
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 79f574dba096..5d72961d89eb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4527,11 +4527,6 @@ static inline u64 default_cfs_period(void)
 	return 100000000ULL;
 }
 
-static inline u64 sched_cfs_bandwidth_slice(void)
-{
-	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
-}
-
 /*
  * Replenish runtime according to assigned quota and update expiration time.
  * We use sched_clock_cpu directly instead of rq->clock to avoid adding
@@ -4565,6 +4560,11 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
 	return rq_clock_task(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
 }
 
+static inline u64 cfs_bandwidth_slice(struct cfs_bandwidth *cfs_b)
+{
+	return cfs_b->slice;
+}
+
 /* returns 0 on failure to allocate runtime */
 static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
@@ -4573,7 +4573,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	u64 amount = 0, min_amount, expires;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
-	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
+	min_amount = cfs_bandwidth_slice(cfs_b) - cfs_rq->runtime_remaining;
 
 	raw_spin_lock(&cfs_b->lock);
 	if (cfs_b->quota == RUNTIME_INF)
@@ -4992,7 +4992,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		cfs_b->runtime += slack_runtime;
 
 		/* we are under rq->lock, defer unthrottling using a timer */
-		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
+		if (cfs_b->runtime > cfs_bandwidth_slice(cfs_b) &&
 		    !list_empty(&cfs_b->throttled_cfs_rq))
 			start_cfs_slack_bandwidth(cfs_b);
 	}
@@ -5019,7 +5019,7 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
  */
 static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
-	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	u64 runtime = 0, slice = cfs_bandwidth_slice(cfs_b);
 	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
@@ -5139,12 +5139,18 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
 }
 
+static inline u64 cfs_default_bandwidth_slice(void)
+{
+	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
+}
+
 void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	raw_spin_lock_init(&cfs_b->lock);
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+	cfs_b->slice = cfs_default_bandwidth_slice();
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1f0a4bc6a39d..a15c0458f4d1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -331,6 +331,7 @@ struct cfs_bandwidth {
 	raw_spinlock_t		lock;
 	ktime_t			period;
 	u64			quota;
+	u64			slice;
 	u64			runtime;
 	s64			hierarchical_quota;
 	u64			runtime_expires;
-- 
2.13.0


* [PATCH 2/3] sched/fair: accumulate unused cpu time for CFS throttle
  2018-05-22  6:20 [PATCH 0/3] sched/fair: improve CFS throttle Cong Wang
  2018-05-22  6:20 ` [PATCH 1/3] sched/fair: make CFS bandwidth slice per cpu group Cong Wang
@ 2018-05-22  6:20 ` Cong Wang
  2018-05-22  6:20 ` [PATCH 3/3] sched/fair: add tracepoints for cfs throttle Cong Wang
  2 siblings, 0 replies; 9+ messages in thread
From: Cong Wang @ 2018-05-22  6:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: ibobrik, Cong Wang, Paul Turner, Peter Zijlstra, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar

With the current implementation of the CFS CPU throttle, the unused
CPU time in a period is not carried over to the next one. And when
the CFS period timer is deactivated after the group is detected to
be idle, those idle periods are not counted either. This causes some
apparently counterintuitive CPU throttling when the job in a cpu
cgroup mostly sleeps during each period.

This can be reproduced with the reproducer from Ivan Babrou:
https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1
(The binary needs to be run directly rather than launched through
the Go toolchain, otherwise the CPU time used by the toolchain
itself is also accounted.)

We also observed a similar issue in production where QPS has
an interesting anti-correlation with nr_throttled/nr_periods.

This patch addresses the problem by always accumulating unused CPU
time in the global pool and, when the CFS period timer has been
deactivated, calculating the number of idle periods and replenishing
the missing quota to the cpu cgroup. To avoid bursts, a cap can be
specified by the user so that the accumulated quota never exceeds
this cap.

This patch significantly reduces the nr_throttled/nr_periods
percentage and throttled time:

Without any patch (quota = 20000, period = 100000):
  nr_periods 56
  nr_throttled 14
  throttled_time 1496288690

With this patch (burst = 10 * quota):
  nr_periods 43
  nr_throttled 1
  throttled_time 74789584

This patch first folds __refill_cfs_bandwidth_runtime() into
tg_set_cfs_bandwidth(), then renames it and reuses it from
do_sched_cfs_period_timer(), where cfs_b->last_active records the
last time the period timer was active so that we can calculate how
many periods were missed in the meantime.
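
Condensed into plain C, the replenishment amounts to roughly the
sketch below (values in nanoseconds; this only mirrors the logic of
refill_cfs_bandwidth_runtime() in this patch, it is not kernel
code):

/* Sketch of the replenishment done by refill_cfs_bandwidth_runtime(). */
unsigned long long refill(unsigned long long runtime,
			  unsigned long long quota,
			  unsigned long long period,
			  unsigned long long burst,
			  unsigned long long now,
			  unsigned long long last_active)
{
	unsigned long long idle = now - last_active;

	/* How many full periods elapsed since the timer was last active? */
	if (idle > period)
		idle /= period;
	else
		idle = 1;

	runtime += idle * quota;	/* replenish the missed periods */
	if (runtime > burst)		/* ... but never beyond the burst cap */
		runtime = burst;

	return runtime;
}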

Reported-by: Ivan Babrou <ibobrik@gmail.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 Documentation/scheduler/sched-bwc.txt |  9 +++++
 kernel/sched/core.c                   | 62 ++++++++++++++++++++++++++++++++---
 kernel/sched/fair.c                   | 52 +++++++++++++++++------------
 kernel/sched/sched.h                  |  3 +-
 4 files changed, 99 insertions(+), 27 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index d853e5cb5553..2c88a4afea9a 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -67,6 +67,15 @@ group, different groups could set different values for their own preference.
 Note, updating this file does not automatically update existing groups using
 a default slice.
 
+Burst
+-----
+When a local run queue has no task to run, its unused CPU time is returned
+back to the global pool. If the whole group becomes idle, its periodical
+timer is halted, unused CPU time during this time is accumulated too.
+To avoid burst, a cap should be set by user via cpu.cfs_burst_us, with it
+the accumulated CPU time won't exceed this cap. Its default value allows
+a group to burn all CPU's with the given quota for a whole period.
+
 Statistics
 ----------
 A group's bandwidth statistics are exported via 3 fields in cpu.stat.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7015a6143151..917f76d9d44c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6589,12 +6589,17 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
-
-	__refill_cfs_bandwidth_runtime(cfs_b);
-
-	/* Restart the period timer (if active) to handle new period expiry: */
-	if (runtime_enabled)
+	if (runtime_enabled) {
+		u64 now;
+
+		cfs_b->runtime = cfs_b->quota;
+		cfs_b->burst = nr_cpu_ids * quota;
+		now = sched_clock_cpu(smp_processor_id());
+		cfs_b->last_active = now;
+		cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
+		/* Restart the period timer (if active) to handle new period expiry: */
 		start_cfs_bandwidth(cfs_b);
+	}
 
 	raw_spin_unlock_irq(&cfs_b->lock);
 
@@ -6666,6 +6671,36 @@ long tg_get_cfs_period(struct task_group *tg)
 	return cfs_period_us;
 }
 
+static int tg_set_cfs_burst(struct task_group *tg, long cfs_burst_us)
+{
+	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+	u64 quota, burst;
+
+	if (cfs_burst_us <= 0)
+		return -ERANGE;
+	quota = tg->cfs_bandwidth.quota;
+	if (quota == RUNTIME_INF)
+		return -EINVAL;
+	burst = (u64)cfs_burst_us * NSEC_PER_USEC;
+	if (burst < quota)
+		return -EINVAL;
+
+	raw_spin_lock_irq(&cfs_b->lock);
+	cfs_b->burst = burst;
+	raw_spin_unlock_irq(&cfs_b->lock);
+	return 0;
+}
+
+static long tg_get_cfs_burst(struct task_group *tg)
+{
+	u64 burst_us;
+
+	burst_us = tg->cfs_bandwidth.burst;
+	do_div(burst_us, NSEC_PER_USEC);
+
+	return burst_us;
+}
+
 static int tg_set_cfs_slice(struct task_group *tg, long cfs_slice_us)
 {
 	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -6720,6 +6755,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
 	return tg_set_cfs_period(css_tg(css), cfs_period_us);
 }
 
+static s64 cpu_cfs_burst_read_s64(struct cgroup_subsys_state *css,
+				  struct cftype *cft)
+{
+	return tg_get_cfs_burst(css_tg(css));
+}
+
+static int cpu_cfs_burst_write_s64(struct cgroup_subsys_state *css,
+				   struct cftype *cftype, s64 cfs_quota_us)
+{
+	return tg_set_cfs_burst(css_tg(css), cfs_quota_us);
+}
+
 static s64 cpu_cfs_slice_read_s64(struct cgroup_subsys_state *css,
 				  struct cftype *cft)
 {
@@ -6875,6 +6922,11 @@ static struct cftype cpu_legacy_files[] = {
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
 	{
+		.name = "cfs_burst_us",
+		.read_s64 = cpu_cfs_burst_read_s64,
+		.write_s64 = cpu_cfs_burst_write_s64,
+	},
+	{
 		.name = "cfs_slice_us",
 		.read_s64 = cpu_cfs_slice_read_s64,
 		.write_s64 = cpu_cfs_slice_write_s64,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d72961d89eb..8a13ee006f39 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4527,25 +4527,6 @@ static inline u64 default_cfs_period(void)
 	return 100000000ULL;
 }
 
-/*
- * Replenish runtime according to assigned quota and update expiration time.
- * We use sched_clock_cpu directly instead of rq->clock to avoid adding
- * additional synchronization around rq->lock.
- *
- * requires cfs_b->lock
- */
-void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
-{
-	u64 now;
-
-	if (cfs_b->quota == RUNTIME_INF)
-		return;
-
-	now = sched_clock_cpu(smp_processor_id());
-	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-}
-
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 {
 	return &tg->cfs_bandwidth;
@@ -4862,6 +4843,31 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 }
 
 /*
+ * Replenish runtime according to assigned quota and update expiration time.
+ * We use sched_clock_cpu directly instead of rq->clock to avoid adding
+ * additional synchronization around rq->lock.
+ *
+ * requires cfs_b->lock
+ */
+static void refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
+{
+	u64 now, idle;
+
+	now = sched_clock_cpu(smp_processor_id());
+	idle = now - cfs_b->last_active;
+	if (idle > cfs_b->period)
+		do_div(idle, cfs_b->period);
+	else
+		idle = 1;
+	cfs_b->runtime += idle * cfs_b->quota;
+	if (cfs_b->runtime > cfs_b->burst)
+		cfs_b->runtime = cfs_b->burst;
+	now = sched_clock_cpu(smp_processor_id());
+	cfs_b->last_active = now;
+	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
+}
+
+/*
  * Responsible for refilling a task_group's bandwidth and unthrottling its
  * cfs_rqs as appropriate. If there has been no activity within the last
  * period the timer is deactivated until scheduling resumes; cfs_b->idle is
@@ -4886,7 +4892,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	if (cfs_b->idle && !throttled)
 		goto out_deactivate;
 
-	__refill_cfs_bandwidth_runtime(cfs_b);
+	refill_cfs_bandwidth_runtime(cfs_b);
 
 	if (!throttled) {
 		/* mark as potentially idle for the upcoming period */
@@ -5132,8 +5138,10 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 
 		idle = do_sched_cfs_period_timer(cfs_b, overrun);
 	}
-	if (idle)
+	if (idle) {
 		cfs_b->period_active = 0;
+		cfs_b->last_active = sched_clock_cpu(smp_processor_id());
+	}
 	raw_spin_unlock(&cfs_b->lock);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
@@ -5151,6 +5159,8 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 	cfs_b->slice = cfs_default_bandwidth_slice();
+	cfs_b->burst = nr_cpu_ids * cfs_b->period;
+	cfs_b->last_active = 0;
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a15c0458f4d1..9a717f82bbe3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -333,8 +333,10 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			slice;
 	u64			runtime;
+	u64			burst;
 	s64			hierarchical_quota;
 	u64			runtime_expires;
+	u64			last_active;
 
 	int			idle;
 	int			period_active;
@@ -433,7 +435,6 @@ extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 			struct sched_entity *parent);
 extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 
-extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
 extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
 
-- 
2.13.0


* [PATCH 3/3] sched/fair: add tracepoints for cfs throttle
  2018-05-22  6:20 [PATCH 0/3] sched/fair: improve CFS throttle Cong Wang
  2018-05-22  6:20 ` [PATCH 1/3] sched/fair: make CFS bandwidth slice per cpu group Cong Wang
  2018-05-22  6:20 ` [PATCH 2/3] sched/fair: accumulate unused cpu time for CFS throttle Cong Wang
@ 2018-05-22  6:20 ` Cong Wang
       [not found]   ` <CANWdNRCQCteO7L+Of7T2vWBW7GbMHyvyF5aSQMVYNGg_afxNhg@mail.gmail.com>
  2 siblings, 1 reply; 9+ messages in thread
From: Cong Wang @ 2018-05-22  6:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: ibobrik, Cong Wang, Paul Turner, Peter Zijlstra, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar

This patch introduces two sched tracepoints to track CFS throttle
and unthrottle events. The use case is to measure how long each
throttle lasts and to track when a CPU cgroup is throttled and
unthrottled.

Sample output:

     cfs-722   [000] dN.3    51.477702: sched_cfs_throttle: path=/test cpu=0 runtime_remaining=0
  <idle>-0     [000] d.h4    51.536659: sched_cfs_unthrottle: path=/test cpu=0 runtime_remaining=1
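
Assuming tracefs is mounted at /sys/kernel/debug/tracing (the mount
point is an assumption and may differ), the new events can be
enabled and streamed like any other sched tracepoint, e.g. with a
small helper along these lines:

/* Sketch: enable the two new tracepoints and stream trace_pipe.
 * Assumes tracefs is mounted at /sys/kernel/debug/tracing.
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	char line[512];
	FILE *pipe;

	write_str("/sys/kernel/debug/tracing/events/sched/sched_cfs_throttle/enable", "1");
	write_str("/sys/kernel/debug/tracing/events/sched/sched_cfs_unthrottle/enable", "1");

	pipe = fopen("/sys/kernel/debug/tracing/trace_pipe", "r");
	if (!pipe) {
		perror("trace_pipe");
		return 1;
	}
	while (fgets(line, sizeof(line), pipe))
		fputs(line, stdout);	/* lines like the sample above */
	fclose(pipe);
	return 0;
}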

Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/trace/events/sched.h | 42 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          |  2 ++
 2 files changed, 44 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index bc01e06bc716..3d9e00972e92 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -570,6 +570,48 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
 
 	TP_printk("cpu=%d", __entry->cpu)
 );
+
+#ifdef CONFIG_CFS_BANDWIDTH
+DECLARE_EVENT_CLASS(sched_fair,
+
+	TP_PROTO(struct cfs_rq *cfs_rq),
+
+	TP_ARGS(cfs_rq),
+
+	TP_STRUCT__entry(
+		__field(	s64,		runtime_remaining	)
+		__field(	int,		cpu			)
+		__dynamic_array(char,		cfs_path,
+				cgroup_path(cfs_rq->tg->css.cgroup, NULL, 0) + 1)
+	),
+
+	TP_fast_assign(
+		__entry->runtime_remaining = cfs_rq->runtime_remaining;
+		__entry->cpu = cpu_of(cfs_rq->rq);
+		cgroup_path(cfs_rq->tg->css.cgroup,
+			    __get_dynamic_array(cfs_path),
+			    __get_dynamic_array_len(cfs_path));
+	),
+
+	TP_printk("path=%s cpu=%d runtime_remaining=%lld", __get_str(cfs_path),
+		  __entry->cpu, __entry->runtime_remaining)
+);
+
+DEFINE_EVENT(sched_fair, sched_cfs_throttle,
+
+	TP_PROTO(struct cfs_rq *cfs_rq),
+
+	TP_ARGS(cfs_rq)
+);
+
+DEFINE_EVENT(sched_fair, sched_cfs_unthrottle,
+
+	TP_PROTO(struct cfs_rq *cfs_rq),
+
+	TP_ARGS(cfs_rq)
+);
+#endif /* CONFIG_CFS_BANDWIDTH */
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a13ee006f39..3bcc40f2f272 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4736,6 +4736,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	cfs_rq->throttled = 1;
 	cfs_rq->throttled_clock = rq_clock(rq);
+	trace_sched_cfs_throttle(cfs_rq);
 	raw_spin_lock(&cfs_b->lock);
 	empty = list_empty(&cfs_b->throttled_cfs_rq);
 
@@ -4766,6 +4767,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	se = cfs_rq->tg->se[cpu_of(rq)];
 
 	cfs_rq->throttled = 0;
+	trace_sched_cfs_unthrottle(cfs_rq);
 
 	update_rq_clock(rq);
 
-- 
2.13.0


* Re: [PATCH 3/3] sched/fair: add tracepoints for cfs throttle
       [not found]   ` <CANWdNRCQCteO7L+Of7T2vWBW7GbMHyvyF5aSQMVYNGg_afxNhg@mail.gmail.com>
@ 2018-05-23  0:34     ` Cong Wang
  2018-05-23  9:09     ` Peter Zijlstra
  1 sibling, 0 replies; 9+ messages in thread
From: Cong Wang @ 2018-05-23  0:34 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: LKML, Paul Turner, Peter Zijlstra, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar

On Mon, May 21, 2018 at 11:35 PM, Ivan Babrou <ibobrik@gmail.com> wrote:
>
>
> On Mon, May 21, 2018 at 11:20 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>
>> +
>> +       TP_printk("path=%s cpu=%d runtime_remaining=%lld",
>> __get_str(cfs_path),
>> +                 __entry->cpu, __entry->runtime_remaining)
>
>
> Can you add "[ns]" as the unit to runtime_remaining? We have it in
> "sched:sched_stat_runtime".

Hmm, although it is indeed in ns, this ->runtime_remaining can be
negative, unlike sched_stat_runtime, so I am not sure whether I
should mark it as "[ns]".


* Re: [PATCH 3/3] sched/fair: add tracepoints for cfs throttle
       [not found]   ` <CANWdNRCQCteO7L+Of7T2vWBW7GbMHyvyF5aSQMVYNGg_afxNhg@mail.gmail.com>
  2018-05-23  0:34     ` Cong Wang
@ 2018-05-23  9:09     ` Peter Zijlstra
  2018-05-24  4:40       ` Cong Wang
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-05-23  9:09 UTC (permalink / raw)
  To: Ivan Babrou; +Cc: xiyou.wangcong, linux-kernel, pjt, efault, tglx, mingo

On Mon, May 21, 2018 at 11:35:38PM -0700, Ivan Babrou wrote:
> > +       TP_printk("path=%s cpu=%d runtime_remaining=%lld",
> > __get_str(cfs_path),
> > +                 __entry->cpu, __entry->runtime_remaining)
> >
> 
> Can you add "[ns]" as the unit to runtime_remaining? We have it in
> "sched:sched_stat_runtime".

Yeah, don't worry. I hate tracepoints, I regret the existing ones, no
new ones will happen.


* Re: [PATCH 3/3] sched/fair: add tracepoints for cfs throttle
  2018-05-23  9:09     ` Peter Zijlstra
@ 2018-05-24  4:40       ` Cong Wang
  2018-05-24  7:11         ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Cong Wang @ 2018-05-24  4:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ivan Babrou, LKML, Paul Turner, Mike Galbraith, Thomas Gleixner,
	Ingo Molnar

On Wed, May 23, 2018 at 2:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, May 21, 2018 at 11:35:38PM -0700, Ivan Babrou wrote:
>> > +       TP_printk("path=%s cpu=%d runtime_remaining=%lld",
>> > __get_str(cfs_path),
>> > +                 __entry->cpu, __entry->runtime_remaining)
>> >
>>
>> Can you add "[ns]" as the unit to runtime_remaining? We have it in
>> "sched:sched_stat_runtime".
>
> Yeah, don't worry. I hate tracepoints, I regret the existing ones, no
> new ones will happen.

Whether you hate them or not, they are useful:

https://github.com/iovisor/bcc/blob/master/tools/runqlat.py

Unlike in sched, tracepoints are welcome in networking:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/include/trace/events/tcp.h

I don't want to change your mind, just want to show the facts.


* Re: [PATCH 3/3] sched/fair: add tracepoints for cfs throttle
  2018-05-24  4:40       ` Cong Wang
@ 2018-05-24  7:11         ` Peter Zijlstra
  2018-05-24 22:23           ` Cong Wang
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-05-24  7:11 UTC (permalink / raw)
  To: Cong Wang
  Cc: Ivan Babrou, LKML, Paul Turner, Mike Galbraith, Thomas Gleixner,
	Ingo Molnar

On Wed, May 23, 2018 at 09:40:47PM -0700, Cong Wang wrote:
> On Wed, May 23, 2018 at 2:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> > Yeah, don't worry. I hate tracepoints, I regret the existing ones, no
> > new ones will happen.
> 
> No matter you hate them or not, they are useful:

No argument there.

> Unlike in sched, tracepoints are welcome in networking:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/include/trace/events/tcp.h

They are also absolutely disallowed in the vfs..

> I don't want to change your mind, just want to show the facts.

The problem with tracepoints is that they can become ABI and you cannot
change them without breaking tools. This is a crap situation and I'm fed
up with it.


* Re: [PATCH 3/3] sched/fair: add tracepoints for cfs throttle
  2018-05-24  7:11         ` Peter Zijlstra
@ 2018-05-24 22:23           ` Cong Wang
  0 siblings, 0 replies; 9+ messages in thread
From: Cong Wang @ 2018-05-24 22:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ivan Babrou, LKML, Paul Turner, Mike Galbraith, Thomas Gleixner,
	Ingo Molnar

On Thu, May 24, 2018 at 12:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> The problem with tracepoints is that they can become ABI and you cannot
> change them without breaking tools. This is a crap situation and I'm fed
> up with it.

Yeah, I once used perf_event_open() to parse trace events, so I
understand how painful it is. Fortunately, tools like bcc mostly
hide the ABI from the end user. To be fair, kprobes have the same
problem, if not worse: different compilers may inline different
kernel functions, not to mention that kernel functions can change
at any time.

OTOH, the tracepoint text interface is sufficient for inspecting
kernel internals; in this case it lets me observe when a cpu cgroup
is throttled or unthrottled. Of course, it is not friendly for
programming.

Thanks.

