* [PATCH] sched: Make schedstats a runtime tunable that is disabled by default v2
From: Mel Gorman @ 2016-01-27 15:29 UTC
To: Peter Zijlstra
Cc: Ingo Molnar, Matt Fleming, Mike Galbraith, LKML, Mel Gorman
Changelog since V1
o Introduce schedstat_enabled and address Ingo's feedback
o More schedstat-only paths eliminated, particularly ttwu_stat
schedstats is very useful during debugging and performance tuning, but it
incurs overhead to calculate. As such, even though it can be disabled at
build time, it is often enabled because the information is useful. This
patch adds a kernel command-line parameter and a sysctl tunable to enable
or disable schedstats on demand. It is disabled by default, as anyone who
knows they need it can also learn to enable it when necessary.
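As a usage sketch (the parameter values and the sysctl name below are
taken from this patch; treat the exact spelling as illustrative until it
is merged):

	schedstats=enable                          (kernel command line)
	echo 1 > /proc/sys/kernel/sched_schedstats (enable at runtime)
	echo 0 > /proc/sys/kernel/sched_schedstats (disable at runtime)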
The benefits are workload-dependent but, when it gets down to it, the
difference is whether cache misses are incurred updating the shared
stats or not. These measurements were taken from a 48-core 2-socket
machine with Xeon(R) E5-2670 v3 CPUs, although they were also run on a
single-socket 8-core machine with an Intel i7-3770 processor.
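The mechanism behind that is the static-key (jump label) infrastructure:
with the key false, the guarded updates compile down to a patched-out
jump, so the shared statistics cachelines are never touched. A minimal
sketch of the pattern this patch applies (mirroring the stats.h hunk
below; account_ttwu() is a made-up helper for illustration, not kernel
code):

	DEFINE_STATIC_KEY_FALSE(sched_schedstats);

	#define schedstat_enabled() static_branch_unlikely(&sched_schedstats)

	static void account_ttwu(struct rq *rq)
	{
		/*
		 * With the key false (the default), this is a no-op jump:
		 * the stats cachelines are never loaded or dirtied on the
		 * wakeup fast path.
		 */
		if (schedstat_enabled())
			rq->ttwu_count++;
	}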
netperf TCP_STREAM
                             4.5.0-rc1             4.5.0-rc1
                               vanilla          nostats-v2r2
Hmean    64         560.45 (  0.00%)      576.96 (  2.94%)
Hmean    128        766.66 (  0.00%)      797.54 (  4.03%)
Hmean    256        950.51 (  0.00%)      972.24 (  2.29%)
Hmean    1024      1433.25 (  0.00%)     1492.66 (  4.15%)
Hmean    2048      2810.54 (  0.00%)     2984.70 (  6.20%)
Hmean    3312      4618.18 (  0.00%)     4778.72 (  3.48%)
Hmean    4096      5306.42 (  0.00%)     5389.35 (  1.56%)
Hmean    8192     10581.44 (  0.00%)    10824.27 (  2.29%)
Hmean    16384    18857.70 (  0.00%)    18911.32 (  0.28%)
Small gains here; UDP_STREAM showed nothing interesting and neither did
the TCP_RR tests. The gains on the 8-core machine were very similar.
tbench4
                                  4.5.0-rc1             4.5.0-rc1
                                    vanilla          nostats-v2r2
Hmean    mb/sec-1         500.85 (  0.00%)      522.43 (  4.31%)
Hmean    mb/sec-2         984.66 (  0.00%)     1017.92 (  3.38%)
Hmean    mb/sec-4        1827.91 (  0.00%)     1871.38 (  2.38%)
Hmean    mb/sec-8        3561.36 (  0.00%)     3563.62 (  0.06%)
Hmean    mb/sec-16       5824.52 (  0.00%)     5918.90 (  1.62%)
Hmean    mb/sec-32      10943.10 (  0.00%)    10967.55 (  0.22%)
Hmean    mb/sec-64      15950.81 (  0.00%)    15976.55 (  0.16%)
Hmean    mb/sec-128     15302.17 (  0.00%)    15372.01 (  0.46%)
Hmean    mb/sec-256     14866.18 (  0.00%)    14938.50 (  0.49%)
Hmean    mb/sec-512     15223.31 (  0.00%)    15360.33 (  0.90%)
Hmean    mb/sec-1024    14574.25 (  0.00%)    14632.68 (  0.40%)
Hmean    mb/sec-2048    13569.02 (  0.00%)    13861.61 (  2.16%)
Hmean    mb/sec-3072    12865.98 (  0.00%)    13106.66 (  1.87%)
Small gains of 2-4% at low thread counts and otherwise flat. The
gains on the 8-core machine were slightly different:
tbench4 on 8-core i7-3770 single socket machine
                                  4.5.0-rc1             4.5.0-rc1
                                    vanilla          nostats-v2r2
Hmean    mb/sec-1          442.59 (  0.00%)      448.73 (  1.39%)
Hmean    mb/sec-2          796.68 (  0.00%)      794.39 ( -0.29%)
Hmean    mb/sec-4         1322.52 (  0.00%)     1343.66 (  1.60%)
Hmean    mb/sec-8         2611.65 (  0.00%)     2694.86 (  3.19%)
Hmean    mb/sec-16        2537.07 (  0.00%)     2609.34 (  2.85%)
Hmean    mb/sec-32        2506.02 (  0.00%)     2578.18 (  2.88%)
Hmean    mb/sec-64        2511.06 (  0.00%)     2569.16 (  2.31%)
Hmean    mb/sec-128       2313.38 (  0.00%)     2395.50 (  3.55%)
Hmean    mb/sec-256       2110.04 (  0.00%)     2177.45 (  3.19%)
Hmean    mb/sec-512       2072.51 (  0.00%)     2053.97 ( -0.89%)
In contrast, this shows a relatively steady 2-3% gain at higher thread
counts. Given the nature of the patch and the type of workload, it is
no surprise that the result depends on the CPU used.
hackbench-pipes
                          4.5.0-rc1             4.5.0-rc1
                            vanilla          nostats-v2r2
Amean    1         0.0637 (  0.00%)       0.0666 ( -4.48%)
Amean    4         0.1229 (  0.00%)       0.1240 ( -0.93%)
Amean    7         0.1921 (  0.00%)       0.1967 ( -2.38%)
Amean    12        0.3117 (  0.00%)       0.3133 ( -0.50%)
Amean    21        0.4050 (  0.00%)       0.3954 (  2.36%)
Amean    30        0.4586 (  0.00%)       0.4529 (  1.25%)
Amean    48        0.5910 (  0.00%)       0.5733 (  3.00%)
Amean    79        0.8663 (  0.00%)       0.8394 (  3.10%)
Amean    110       1.1543 (  0.00%)       1.1449 (  0.82%)
Amean    141       1.4457 (  0.00%)       1.4526 ( -0.47%)
Amean    172       1.7090 (  0.00%)       1.7121 ( -0.18%)
Amean    192       1.9126 (  0.00%)       1.8959 (  0.87%)
This is borderline at best: small gains and losses, and while the variance
data is not included, it is well within the noise. The UMA machine did not
show anything particularly different.
pipetest
                                    4.5.0-rc1             4.5.0-rc1
                                      vanilla          nostats-v2r2
Min           Time         4.13 (  0.00%)        3.99 (  3.39%)
1st-qrtle     Time         4.38 (  0.00%)        4.27 (  2.51%)
2nd-qrtle     Time         4.46 (  0.00%)        4.39 (  1.57%)
3rd-qrtle     Time         4.56 (  0.00%)        4.51 (  1.10%)
Max-90%       Time         4.67 (  0.00%)        4.60 (  1.50%)
Max-93%       Time         4.71 (  0.00%)        4.65 (  1.27%)
Max-95%       Time         4.74 (  0.00%)        4.71 (  0.63%)
Max-99%       Time         4.88 (  0.00%)        4.79 (  1.84%)
Max           Time         4.93 (  0.00%)        4.83 (  2.03%)
Mean          Time         4.48 (  0.00%)        4.39 (  1.91%)
Best99%Mean   Time         4.47 (  0.00%)        4.39 (  1.91%)
Best95%Mean   Time         4.46 (  0.00%)        4.38 (  1.93%)
Best90%Mean   Time         4.45 (  0.00%)        4.36 (  1.98%)
Best50%Mean   Time         4.36 (  0.00%)        4.25 (  2.49%)
Best10%Mean   Time         4.23 (  0.00%)        4.10 (  3.13%)
Best5%Mean    Time         4.19 (  0.00%)        4.06 (  3.20%)
Best1%Mean    Time         4.13 (  0.00%)        4.00 (  3.39%)
A small improvement, and similar gains were seen on the UMA machine.
The gain is small, but whether this patch makes a difference will depend
on the CPU and the workload. However, it stands to reason that doing
less work in the scheduler is a good thing. The downside is that the
lack of schedstats and tracepoints will be surprising to experts doing
performance analysis until they find the existence of the schedstats=
parameter or the schedstats sysctl.
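Once found, it is cheap to wrap around a profiling session (a sketch,
using the sysctl name this patch introduces):

	sysctl -w kernel.sched_schedstats=1
	<run the workload; sample /proc/schedstat and /proc/<pid>/sched>
	sysctl -w kernel.sched_schedstats=0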
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
Documentation/kernel-parameters.txt | 5 ++
Documentation/sysctl/kernel.txt | 8 +++
include/linux/sched/sysctl.h | 4 ++
kernel/sched/core.c | 62 ++++++++++++++++++++-
kernel/sched/debug.c | 108 +++++++++++++++++++-----------------
kernel/sched/fair.c | 92 ++++++++++++++++++------------
kernel/sched/sched.h | 1 +
kernel/sched/stats.h | 8 ++-
kernel/sysctl.c | 11 ++++
9 files changed, 208 insertions(+), 91 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 87d40a72f6a1..846956abfe85 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3523,6 +3523,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
sched_debug [KNL] Enables verbose scheduler debug messages.
+ schedstats= [KNL,X86] Enable or disable scheduler statistics.
+ Allowed values are enable and disable. This feature
+ incurs a small amount of overhead in the scheduler
+ but is useful for debugging and performance tuning.
+
skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate
xtime_lock contention on larger systems, and/or RCU lock
contention on all systems with CONFIG_MAXSMP set.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index a93b414672a7..be7c3b720adf 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -760,6 +760,14 @@ rtsig-nr shows the number of RT signals currently queued.
==============================================================
+schedstats:
+
+Enables/disables scheduler statistics. Enabling this feature
+incurs a small amount of overhead in the scheduler but is
+useful for debugging and performance tuning.
+
+==============================================================
+
sg-big-buff:
This file shows the size of the generic SCSI (sg) buffer.
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c9e4731cf10b..4f080ab4f2cd 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -95,4 +95,8 @@ extern int sysctl_numa_balancing(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
+extern int sysctl_schedstats(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos);
+
#endif /* _SCHED_SYSCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 63d3a24e081a..42ea7b28a47b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2093,7 +2093,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
ttwu_queue(p, cpu);
stat:
- ttwu_stat(p, cpu, wake_flags);
+ if (schedstat_enabled())
+ ttwu_stat(p, cpu, wake_flags);
out:
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
@@ -2141,7 +2142,8 @@ static void try_to_wake_up_local(struct task_struct *p)
ttwu_activate(rq, p, ENQUEUE_WAKEUP);
ttwu_do_wakeup(rq, p, 0);
- ttwu_stat(p, smp_processor_id(), 0);
+ if (schedstat_enabled())
+ ttwu_stat(p, smp_processor_id(), 0);
out:
raw_spin_unlock(&p->pi_lock);
}
@@ -2210,6 +2212,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#endif
#ifdef CONFIG_SCHEDSTATS
+ /* Even if schedstat is disabled, there should not be garbage */
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif
@@ -2281,6 +2284,61 @@ int sysctl_numa_balancing(struct ctl_table *table, int write,
#endif
#endif
+DEFINE_STATIC_KEY_FALSE(sched_schedstats);
+
+#ifdef CONFIG_SCHEDSTATS
+void set_schedstats(bool enabled)
+{
+ if (enabled)
+ static_branch_enable(&sched_schedstats);
+ else
+ static_branch_disable(&sched_schedstats);
+}
+
+static int __init setup_schedstats(char *str)
+{
+ int ret = 0;
+ if (!str)
+ goto out;
+
+ if (!strcmp(str, "enable")) {
+ set_schedstats(true);
+ ret = 1;
+ } else if (!strcmp(str, "disable")) {
+ set_schedstats(false);
+ ret = 1;
+ }
+out:
+ if (!ret)
+ pr_warn("Unable to parse schedstats=\n");
+
+ return ret;
+}
+__setup("schedstats=", setup_schedstats);
+
+#ifdef CONFIG_PROC_SYSCTL
+int sysctl_schedstats(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct ctl_table t;
+ int err;
+ int state = static_branch_likely(&sched_schedstats);
+
+ if (write && !capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ t = *table;
+ t.data = &state;
+ err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+ if (err < 0)
+ return err;
+ if (write)
+ set_schedstats(state);
+ return err;
+}
+#endif
+#endif
+
/*
* fork()/clone()-time setup:
*/
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 641511771ae6..c3c2499ec104 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -75,16 +75,18 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
PN(se->vruntime);
PN(se->sum_exec_runtime);
#ifdef CONFIG_SCHEDSTATS
- PN(se->statistics.wait_start);
- PN(se->statistics.sleep_start);
- PN(se->statistics.block_start);
- PN(se->statistics.sleep_max);
- PN(se->statistics.block_max);
- PN(se->statistics.exec_max);
- PN(se->statistics.slice_max);
- PN(se->statistics.wait_max);
- PN(se->statistics.wait_sum);
- P(se->statistics.wait_count);
+ if (schedstat_enabled()) {
+ PN(se->statistics.wait_start);
+ PN(se->statistics.sleep_start);
+ PN(se->statistics.block_start);
+ PN(se->statistics.sleep_max);
+ PN(se->statistics.block_max);
+ PN(se->statistics.exec_max);
+ PN(se->statistics.slice_max);
+ PN(se->statistics.wait_max);
+ PN(se->statistics.wait_sum);
+ P(se->statistics.wait_count);
+ }
#endif
P(se->load.weight);
#ifdef CONFIG_SMP
@@ -122,10 +124,12 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
(long long)(p->nvcsw + p->nivcsw),
p->prio);
#ifdef CONFIG_SCHEDSTATS
- SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
- SPLIT_NS(p->se.statistics.wait_sum),
- SPLIT_NS(p->se.sum_exec_runtime),
- SPLIT_NS(p->se.statistics.sum_sleep_runtime));
+ if (schedstat_enabled()) {
+ SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
+ SPLIT_NS(p->se.statistics.wait_sum),
+ SPLIT_NS(p->se.sum_exec_runtime),
+ SPLIT_NS(p->se.statistics.sum_sleep_runtime));
+ }
#else
SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
0LL, 0L,
@@ -313,17 +317,19 @@ do { \
#define P(n) SEQ_printf(m, " .%-30s: %d\n", #n, rq->n);
#define P64(n) SEQ_printf(m, " .%-30s: %Ld\n", #n, rq->n);
- P(yld_count);
+ if (schedstat_enabled()) {
+ P(yld_count);
- P(sched_count);
- P(sched_goidle);
+ P(sched_count);
+ P(sched_goidle);
#ifdef CONFIG_SMP
- P64(avg_idle);
- P64(max_idle_balance_cost);
+ P64(avg_idle);
+ P64(max_idle_balance_cost);
#endif
- P(ttwu_count);
- P(ttwu_local);
+ P(ttwu_count);
+ P(ttwu_local);
+ }
#undef P
#undef P64
@@ -569,38 +575,38 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
nr_switches = p->nvcsw + p->nivcsw;
#ifdef CONFIG_SCHEDSTATS
- PN(se.statistics.sum_sleep_runtime);
- PN(se.statistics.wait_start);
- PN(se.statistics.sleep_start);
- PN(se.statistics.block_start);
- PN(se.statistics.sleep_max);
- PN(se.statistics.block_max);
- PN(se.statistics.exec_max);
- PN(se.statistics.slice_max);
- PN(se.statistics.wait_max);
- PN(se.statistics.wait_sum);
- P(se.statistics.wait_count);
- PN(se.statistics.iowait_sum);
- P(se.statistics.iowait_count);
- P(se.nr_migrations);
- P(se.statistics.nr_migrations_cold);
- P(se.statistics.nr_failed_migrations_affine);
- P(se.statistics.nr_failed_migrations_running);
- P(se.statistics.nr_failed_migrations_hot);
- P(se.statistics.nr_forced_migrations);
- P(se.statistics.nr_wakeups);
- P(se.statistics.nr_wakeups_sync);
- P(se.statistics.nr_wakeups_migrate);
- P(se.statistics.nr_wakeups_local);
- P(se.statistics.nr_wakeups_remote);
- P(se.statistics.nr_wakeups_affine);
- P(se.statistics.nr_wakeups_affine_attempts);
- P(se.statistics.nr_wakeups_passive);
- P(se.statistics.nr_wakeups_idle);
-
- {
+ if (schedstat_enabled()) {
u64 avg_atom, avg_per_cpu;
+ PN(se.statistics.sum_sleep_runtime);
+ PN(se.statistics.wait_start);
+ PN(se.statistics.sleep_start);
+ PN(se.statistics.block_start);
+ PN(se.statistics.sleep_max);
+ PN(se.statistics.block_max);
+ PN(se.statistics.exec_max);
+ PN(se.statistics.slice_max);
+ PN(se.statistics.wait_max);
+ PN(se.statistics.wait_sum);
+ P(se.statistics.wait_count);
+ PN(se.statistics.iowait_sum);
+ P(se.statistics.iowait_count);
+ P(se.nr_migrations);
+ P(se.statistics.nr_migrations_cold);
+ P(se.statistics.nr_failed_migrations_affine);
+ P(se.statistics.nr_failed_migrations_running);
+ P(se.statistics.nr_failed_migrations_hot);
+ P(se.statistics.nr_forced_migrations);
+ P(se.statistics.nr_wakeups);
+ P(se.statistics.nr_wakeups_sync);
+ P(se.statistics.nr_wakeups_migrate);
+ P(se.statistics.nr_wakeups_local);
+ P(se.statistics.nr_wakeups_remote);
+ P(se.statistics.nr_wakeups_affine);
+ P(se.statistics.nr_wakeups_affine_attempts);
+ P(se.statistics.nr_wakeups_passive);
+ P(se.statistics.nr_wakeups_idle);
+
avg_atom = p->se.sum_exec_runtime;
if (nr_switches)
avg_atom = div64_ul(avg_atom, nr_switches);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1926606ece80..1de9a70b0832 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -755,7 +755,9 @@ static void
update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct task_struct *p;
- u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
+ u64 delta;
+
+ delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
if (entity_is_task(se)) {
p = task_of(se);
@@ -776,22 +778,12 @@ update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
se->statistics.wait_sum += delta;
se->statistics.wait_start = 0;
}
-#else
-static inline void
-update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-}
-
-static inline void
-update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-}
-#endif
/*
* Task is being enqueued - update stats:
*/
-static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static inline void
+update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
/*
* Are we enqueueing a waiting task? (for current tasks
@@ -801,8 +793,8 @@ static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
update_stats_wait_start(cfs_rq, se);
}
-static inline void
-update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void
+update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
/*
* Mark the end of the wait period if dequeueing a
@@ -810,7 +802,40 @@ update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
if (se != cfs_rq->curr)
update_stats_wait_end(cfs_rq, se);
+
+ if (flags & DEQUEUE_SLEEP) {
+ if (entity_is_task(se)) {
+ struct task_struct *tsk = task_of(se);
+
+ if (tsk->state & TASK_INTERRUPTIBLE)
+ se->statistics.sleep_start = rq_clock(rq_of(cfs_rq));
+ if (tsk->state & TASK_UNINTERRUPTIBLE)
+ se->statistics.block_start = rq_clock(rq_of(cfs_rq));
+ }
+ }
+
+}
+#else
+static inline void
+update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
}
+#endif
/*
* We are picking a new current task - update its stats:
@@ -3106,11 +3131,14 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (flags & ENQUEUE_WAKEUP) {
place_entity(cfs_rq, se, 0);
- enqueue_sleeper(cfs_rq, se);
+ if (schedstat_enabled())
+ enqueue_sleeper(cfs_rq, se);
}
- update_stats_enqueue(cfs_rq, se);
- check_spread(cfs_rq, se);
+ if (schedstat_enabled()) {
+ update_stats_enqueue(cfs_rq, se);
+ check_spread(cfs_rq, se);
+ }
if (se != cfs_rq->curr)
__enqueue_entity(cfs_rq, se);
se->on_rq = 1;
@@ -3177,19 +3205,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
update_curr(cfs_rq);
dequeue_entity_load_avg(cfs_rq, se);
- update_stats_dequeue(cfs_rq, se);
- if (flags & DEQUEUE_SLEEP) {
-#ifdef CONFIG_SCHEDSTATS
- if (entity_is_task(se)) {
- struct task_struct *tsk = task_of(se);
-
- if (tsk->state & TASK_INTERRUPTIBLE)
- se->statistics.sleep_start = rq_clock(rq_of(cfs_rq));
- if (tsk->state & TASK_UNINTERRUPTIBLE)
- se->statistics.block_start = rq_clock(rq_of(cfs_rq));
- }
-#endif
- }
+ if (schedstat_enabled())
+ update_stats_dequeue(cfs_rq, se, flags);
clear_buddies(cfs_rq, se);
@@ -3263,7 +3280,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
* a CPU. So account for the time it spent waiting on the
* runqueue.
*/
- update_stats_wait_end(cfs_rq, se);
+ if (schedstat_enabled())
+ update_stats_wait_end(cfs_rq, se);
__dequeue_entity(cfs_rq, se);
update_load_avg(se, 1);
}
@@ -3276,7 +3294,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
* least twice that of our own weight (i.e. dont track it
* when there are only lesser-weight tasks around):
*/
- if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
+ if (schedstat_enabled() && rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
se->statistics.slice_max = max(se->statistics.slice_max,
se->sum_exec_runtime - se->prev_sum_exec_runtime);
}
@@ -3359,9 +3377,13 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
/* throttle cfs_rqs exceeding runtime */
check_cfs_rq_runtime(cfs_rq);
- check_spread(cfs_rq, prev);
+ if (schedstat_enabled()) {
+ check_spread(cfs_rq, prev);
+ if (prev->on_rq)
+ update_stats_wait_start(cfs_rq, prev);
+ }
+
if (prev->on_rq) {
- update_stats_wait_start(cfs_rq, prev);
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
/* in !on_rq case, update occurred at dequeue */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10f16374df7f..1d583870e1a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1022,6 +1022,7 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
#endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
extern struct static_key_false sched_numa_balancing;
+extern struct static_key_false sched_schedstats;
static inline u64 global_rt_period(void)
{
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index b0fbc7632de5..70b3b6a20fb0 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -29,9 +29,10 @@ rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
if (rq)
rq->rq_sched_info.run_delay += delta;
}
-# define schedstat_inc(rq, field) do { (rq)->field++; } while (0)
-# define schedstat_add(rq, field, amt) do { (rq)->field += (amt); } while (0)
-# define schedstat_set(var, val) do { var = (val); } while (0)
+# define schedstat_enabled() static_branch_unlikely(&sched_schedstats)
+# define schedstat_inc(rq, field) do { if (schedstat_enabled()) { (rq)->field++; } } while (0)
+# define schedstat_add(rq, field, amt) do { if (schedstat_enabled()) { (rq)->field += (amt); } } while (0)
+# define schedstat_set(var, val) do { if (schedstat_enabled()) { var = (val); } } while (0)
#else /* !CONFIG_SCHEDSTATS */
static inline void
rq_sched_info_arrive(struct rq *rq, unsigned long long delta)
@@ -42,6 +43,7 @@ rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
static inline void
rq_sched_info_depart(struct rq *rq, unsigned long long delta)
{}
+# define schedstat_enabled() 0
# define schedstat_inc(rq, field) do { } while (0)
# define schedstat_add(rq, field, amt) do { } while (0)
# define schedstat_set(var, val) do { } while (0)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 97715fd9e790..6fe70ccdf2ac 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -350,6 +350,17 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_SCHEDSTATS
+ {
+ .procname = "sched_schedstats",
+ .data = NULL,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sysctl_schedstats,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif /* CONFIG_SCHEDSTATS */
#endif /* CONFIG_SMP */
#ifdef CONFIG_NUMA_BALANCING
{
--
2.6.4
* Re: [PATCH] sched: Make schedstats a runtime tunable that is disabled by default v2
From: Mike Galbraith @ 2016-01-27 15:58 UTC
To: Mel Gorman, Peter Zijlstra; +Cc: Ingo Molnar, Matt Fleming, LKML
On Wed, 2016-01-27 at 15:29 +0000, Mel Gorman wrote:
> Changelog since V1
> o Introduce schedstat_enabled and address Ingo's feedback
> o More schedstat-only paths eliminated, particularly ttwu_stat
As mentioned offline, with the somewhat lighter-than-distro config of
my box, the patch delivered ~4% on cross-core pipe-test loop time, and
~2% for tbench throughput.
-Mike
* Re: [PATCH] sched: Make schedstats a runtime tunable that is disabled by default v2
From: Matt Fleming @ 2016-01-28 10:32 UTC
To: Mel Gorman; +Cc: Peter Zijlstra, Ingo Molnar, Mike Galbraith, LKML
On Wed, 27 Jan, at 03:29:26PM, Mel Gorman wrote:
> +#ifdef CONFIG_SCHEDSTATS
> +void set_schedstats(bool enabled)
> +{
> + if (enabled)
> + static_branch_enable(&sched_schedstats);
> + else
> + static_branch_disable(&sched_schedstats);
> +}
This function should probably be 'static'; it has no users outside of
this file.
> @@ -313,17 +317,19 @@ do { \
> #define P(n) SEQ_printf(m, " .%-30s: %d\n", #n, rq->n);
> #define P64(n) SEQ_printf(m, " .%-30s: %Ld\n", #n, rq->n);
>
> - P(yld_count);
> + if (schedstat_enabled()) {
> + P(yld_count);
>
> - P(sched_count);
> - P(sched_goidle);
> + P(sched_count);
> + P(sched_goidle);
> #ifdef CONFIG_SMP
> - P64(avg_idle);
> - P64(max_idle_balance_cost);
> + P64(avg_idle);
> + P64(max_idle_balance_cost);
These two fields are still updated without any kind of
schedstat_enabled() guard. We probably shouldn't refuse to print them
if we're maintaining these counters, right?
> #undef P
> #undef P64
> @@ -569,38 +575,38 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
> nr_switches = p->nvcsw + p->nivcsw;
>
> #ifdef CONFIG_SCHEDSTATS
> - PN(se.statistics.sum_sleep_runtime);
> - PN(se.statistics.wait_start);
> - PN(se.statistics.sleep_start);
> - PN(se.statistics.block_start);
> - PN(se.statistics.sleep_max);
> - PN(se.statistics.block_max);
> - PN(se.statistics.exec_max);
> - PN(se.statistics.slice_max);
> - PN(se.statistics.wait_max);
> - PN(se.statistics.wait_sum);
> - P(se.statistics.wait_count);
> - PN(se.statistics.iowait_sum);
> - P(se.statistics.iowait_count);
> - P(se.nr_migrations);
> - P(se.statistics.nr_migrations_cold);
> - P(se.statistics.nr_failed_migrations_affine);
> - P(se.statistics.nr_failed_migrations_running);
> - P(se.statistics.nr_failed_migrations_hot);
> - P(se.statistics.nr_forced_migrations);
> - P(se.statistics.nr_wakeups);
> - P(se.statistics.nr_wakeups_sync);
> - P(se.statistics.nr_wakeups_migrate);
> - P(se.statistics.nr_wakeups_local);
> - P(se.statistics.nr_wakeups_remote);
> - P(se.statistics.nr_wakeups_affine);
> - P(se.statistics.nr_wakeups_affine_attempts);
> - P(se.statistics.nr_wakeups_passive);
> - P(se.statistics.nr_wakeups_idle);
> -
> - {
> + if (schedstat_enabled()) {
> u64 avg_atom, avg_per_cpu;
>
> + PN(se.statistics.sum_sleep_runtime);
> + PN(se.statistics.wait_start);
> + PN(se.statistics.sleep_start);
> + PN(se.statistics.block_start);
> + PN(se.statistics.sleep_max);
> + PN(se.statistics.block_max);
> + PN(se.statistics.exec_max);
> + PN(se.statistics.slice_max);
> + PN(se.statistics.wait_max);
> + PN(se.statistics.wait_sum);
> + P(se.statistics.wait_count);
> + PN(se.statistics.iowait_sum);
> + P(se.statistics.iowait_count);
> + P(se.nr_migrations);
Ditto for se.nr_migrations. It has no schedstat_enabled() wrapper.
> @@ -801,8 +793,8 @@ static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> update_stats_wait_start(cfs_rq, se);
> }
>
> -static inline void
> -update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +static void
> +update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> /*
> * Mark the end of the wait period if dequeueing a
You dropped the 'inline' from this function. Since there is only one
caller, I'm guessing that was unintentional?
* Re: [PATCH] sched: Make schedstats a runtime tunable that is disabled by default v2
From: Mel Gorman @ 2016-01-28 10:59 UTC
To: Matt Fleming; +Cc: Peter Zijlstra, Ingo Molnar, Mike Galbraith, LKML
On Thu, Jan 28, 2016 at 10:32:08AM +0000, Matt Fleming wrote:
> On Wed, 27 Jan, at 03:29:26PM, Mel Gorman wrote:
> > +#ifdef CONFIG_SCHEDSTATS
> > +void set_schedstats(bool enabled)
> > +{
> > + if (enabled)
> > + static_branch_enable(&sched_schedstats);
> > + else
> > + static_branch_disable(&sched_schedstats);
> > +}
>
> This function should probably be 'static'; it has no users outside of
> this file.
>
Yes.
> > @@ -313,17 +317,19 @@ do { \
> > #define P(n) SEQ_printf(m, " .%-30s: %d\n", #n, rq->n);
> > #define P64(n) SEQ_printf(m, " .%-30s: %Ld\n", #n, rq->n);
> >
> > - P(yld_count);
> > + if (schedstat_enabled()) {
> > + P(yld_count);
> >
> > - P(sched_count);
> > - P(sched_goidle);
> > + P(sched_count);
> > + P(sched_goidle);
> > #ifdef CONFIG_SMP
> > - P64(avg_idle);
> > - P64(max_idle_balance_cost);
> > + P64(avg_idle);
> > + P64(max_idle_balance_cost);
>
> These two fields are still updated without any kind of
> schedstat_enabled() guard. We probably shouldn't refuse to print them
> if we're maintaining these counters, right?
>
Right.
> > #undef P
> > #undef P64
> > @@ -569,38 +575,38 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
> > nr_switches = p->nvcsw + p->nivcsw;
> >
> > #ifdef CONFIG_SCHEDSTATS
> > - PN(se.statistics.sum_sleep_runtime);
> > - PN(se.statistics.wait_start);
> > - PN(se.statistics.sleep_start);
> > - PN(se.statistics.block_start);
> > - PN(se.statistics.sleep_max);
> > - PN(se.statistics.block_max);
> > - PN(se.statistics.exec_max);
> > - PN(se.statistics.slice_max);
> > - PN(se.statistics.wait_max);
> > - PN(se.statistics.wait_sum);
> > - P(se.statistics.wait_count);
> > - PN(se.statistics.iowait_sum);
> > - P(se.statistics.iowait_count);
> > - P(se.nr_migrations);
> > - P(se.statistics.nr_migrations_cold);
> > - P(se.statistics.nr_failed_migrations_affine);
> > - P(se.statistics.nr_failed_migrations_running);
> > - P(se.statistics.nr_failed_migrations_hot);
> > - P(se.statistics.nr_forced_migrations);
> > - P(se.statistics.nr_wakeups);
> > - P(se.statistics.nr_wakeups_sync);
> > - P(se.statistics.nr_wakeups_migrate);
> > - P(se.statistics.nr_wakeups_local);
> > - P(se.statistics.nr_wakeups_remote);
> > - P(se.statistics.nr_wakeups_affine);
> > - P(se.statistics.nr_wakeups_affine_attempts);
> > - P(se.statistics.nr_wakeups_passive);
> > - P(se.statistics.nr_wakeups_idle);
> > -
> > - {
> > + if (schedstat_enabled()) {
> > u64 avg_atom, avg_per_cpu;
> >
> > + PN(se.statistics.sum_sleep_runtime);
> > + PN(se.statistics.wait_start);
> > + PN(se.statistics.sleep_start);
> > + PN(se.statistics.block_start);
> > + PN(se.statistics.sleep_max);
> > + PN(se.statistics.block_max);
> > + PN(se.statistics.exec_max);
> > + PN(se.statistics.slice_max);
> > + PN(se.statistics.wait_max);
> > + PN(se.statistics.wait_sum);
> > + P(se.statistics.wait_count);
> > + PN(se.statistics.iowait_sum);
> > + P(se.statistics.iowait_count);
> > + P(se.nr_migrations);
>
> Ditto for se.nr_migrations. It has no schedstat_enabled() wrapper.
>
Yes.
> > @@ -801,8 +793,8 @@ static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > update_stats_wait_start(cfs_rq, se);
> > }
> >
> > -static inline void
> > -update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > +static void
> > +update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > {
> > /*
> > * Mark the end of the wait period if dequeueing a
>
> You dropped the 'inline' from this function. Since there is only one
> caller, I'm guessing that was unintentional?
It wasn't really. The patch increased the function size by enough that
I uninlined it and let the compiler make the decision. In this case,
it should automatically inline but I can leave the inline in.
--
Mel Gorman
SUSE Labs
* Re: [PATCH] sched: Make schedstats a runtime tunable that is disabled by default v2
From: Matt Fleming @ 2016-01-28 11:34 UTC
To: Mel Gorman; +Cc: Peter Zijlstra, Ingo Molnar, Mike Galbraith, LKML
On Thu, 28 Jan, at 10:59:25AM, Mel Gorman wrote:
>
> It wasn't really. The patch increased the function size by enough that
> I uninlined it and let the compiler make the decision. In this case,
> it should automatically inline but I can leave the inline in.
Letting the compiler make the decision seems fine to me, but I'll let
other people chime in with their 'static inline' opinions.
My concern was that you didn't intend to delete the keyword, and it
was done by accident. But since that's not the case, it's no big deal.