* [PATCH] sched: Make schedstats a runtime tunable that is disabled by default
@ 2016-01-25 10:05 Mel Gorman
  2016-01-25 11:26 ` Peter Zijlstra
  2016-01-25 15:29 ` Matt Fleming
  0 siblings, 2 replies; 13+ messages in thread
From: Mel Gorman @ 2016-01-25 10:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Matt Fleming, Mike Galbraith, LKML, Mel Gorman

schedstats is very useful during debugging and performance tuning but it
incurs overhead. As such, even though it can be disabled at build time,
it is often enabled as the information is useful.  This patch adds a
kernel command-line and sysctl tunable to enable or disable schedstats on
demand. It is disabled by default as someone who knows they need it can
also learn to enable it when necessary.
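
For illustration, a usage sketch based on the interfaces added below (the
boot parameter and sysctl name are taken from this patch; the sysctl is
only present with CONFIG_SCHEDSTATS=y):

	# enable at boot, on the kernel command line
	schedstats=enable

	# toggle at runtime, as root
	echo 1 > /proc/sys/kernel/sched_schedstats
	echo 0 > /proc/sys/kernel/sched_schedstats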

The benefits are workload-dependent but, ultimately, the difference comes
down to whether cache misses are incurred updating the shared stats or not.
These measurements were taken on a 48-core 2-socket machine with
Xeon(R) E5-2670 v3 CPUs.

netperf TCP_STREAM
                               4.4.0                 4.4.0
                             vanilla               nostats
Hmean    64         566.50 (  0.00%)      571.86 (  0.95%)
Hmean    128        699.15 (  0.00%)      712.82 (  1.95%)
Hmean    256        920.33 (  0.00%)      934.86 (  1.58%)
Hmean    1024      1097.89 (  0.00%)     1207.15 (  9.95%)
Hmean    2048      2356.64 (  0.00%)     2720.96 ( 15.46%)
Hmean    3312      4363.97 (  0.00%)     4409.14 (  1.04%)
Hmean    4096      5050.99 (  0.00%)     5080.74 (  0.59%)
Hmean    8192      9703.42 (  0.00%)    10057.36 (  3.65%)
Hmean    16384    17524.31 (  0.00%)    18126.95 (  3.44%)

Small gains here; UDP_STREAM showed nothing interesting.

tbench4
                                     4.4.0                 4.4.0
                                   vanilla            nostats-v1
Hmean    mb/sec-1         430.56 (  0.00%)      453.55 (  5.34%)
Hmean    mb/sec-2         831.14 (  0.00%)      820.39 ( -1.29%)
Hmean    mb/sec-4        1537.69 (  0.00%)     1678.43 (  9.15%)
Hmean    mb/sec-8        3327.61 (  0.00%)     3124.44 ( -6.11%)
Hmean    mb/sec-16       5833.78 (  0.00%)     5936.33 (  1.76%)
Hmean    mb/sec-32      11022.03 (  0.00%)    11076.39 (  0.49%)
Hmean    mb/sec-64      15757.57 (  0.00%)    15850.09 (  0.59%)
Hmean    mb/sec-128     15086.64 (  0.00%)    15192.84 (  0.70%)
Hmean    mb/sec-256     14762.37 (  0.00%)    14905.69 (  0.97%)
Hmean    mb/sec-512     14999.88 (  0.00%)    15086.94 (  0.58%)
Hmean    mb/sec-1024    14095.00 (  0.00%)    14218.66 (  0.88%)
Hmean    mb/sec-2048    13043.98 (  0.00%)    13499.61 (  3.49%)

Borderline but small gains.

hackbench-pipes
                             4.4.0                 4.4.0
                           vanilla            nostats-v1
Amean    1        0.0753 (  0.00%)      0.0817 ( -8.54%)
Amean    4        0.1214 (  0.00%)      0.1247 ( -2.71%)
Amean    7        0.1766 (  0.00%)      0.1787 ( -1.21%)
Amean    12       0.2940 (  0.00%)      0.2637 ( 10.30%)
Amean    21       0.3799 (  0.00%)      0.2926 ( 22.98%)
Amean    30       0.3570 (  0.00%)      0.3294 (  7.72%)
Amean    48       0.5830 (  0.00%)      0.5307 (  8.97%)
Amean    79       0.9131 (  0.00%)      0.8671 (  5.04%)
Amean    110      1.3851 (  0.00%)      1.3239 (  4.42%)
Amean    141      1.9411 (  0.00%)      1.7483 (  9.94%)
Amean    172      2.3360 (  0.00%)      2.2880 (  2.05%)
Amean    192      2.5974 (  0.00%)      2.4211 (  6.79%)

This showed a mix of results, but some of the gains were large enough to
be interesting.

Even though the gain is not universal and some workloads simply do not
care, it stands to reason that doing less work within the scheduler is a
good thing assuming that the user is ok with enabling the stats if required.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 Documentation/kernel-parameters.txt |   5 ++
 Documentation/sysctl/kernel.txt     |   8 +++
 include/linux/sched/sysctl.h        |   4 ++
 kernel/sched/core.c                 |  55 ++++++++++++++++++
 kernel/sched/debug.c                | 110 +++++++++++++++++++-----------------
 kernel/sched/fair.c                 |   6 +-
 kernel/sched/sched.h                |   1 +
 kernel/sched/stats.h                |   6 +-
 kernel/sysctl.c                     |  11 ++++
 9 files changed, 149 insertions(+), 57 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 87d40a72f6a1..846956abfe85 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3523,6 +3523,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 
 	sched_debug	[KNL] Enables verbose scheduler debug messages.
 
+	schedstats=	[KNL,X86] Enable or disable scheduler statistics.
+			Allowed values are enable and disable. This feature
+			incurs a small amount of overhead in the scheduler
+			but is useful for debugging and performance tuning.
+
 	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
 			xtime_lock contention on larger systems, and/or RCU lock
 			contention on all systems with CONFIG_MAXSMP set.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index a93b414672a7..be7c3b720adf 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -760,6 +760,14 @@ rtsig-nr shows the number of RT signals currently queued.
 
 ==============================================================
 
+schedstats:
+
+Enables/disables scheduler statistics. Enabling this feature
+incurs a small amount of overhead in the scheduler but is
+useful for debugging and performance tuning.
+
+==============================================================
+
 sg-big-buff:
 
 This file shows the size of the generic SCSI (sg) buffer.
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c9e4731cf10b..4f080ab4f2cd 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -95,4 +95,8 @@ extern int sysctl_numa_balancing(struct ctl_table *table, int write,
 				 void __user *buffer, size_t *lenp,
 				 loff_t *ppos);
 
+extern int sysctl_schedstats(struct ctl_table *table, int write,
+				 void __user *buffer, size_t *lenp,
+				 loff_t *ppos);
+
 #endif /* _SCHED_SYSCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 63d3a24e081a..56ade34a65bf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2281,6 +2281,61 @@ int sysctl_numa_balancing(struct ctl_table *table, int write,
 #endif
 #endif
 
+DEFINE_STATIC_KEY_FALSE(sched_schedstats);
+
+#ifdef CONFIG_SCHEDSTATS
+void set_schedstats(bool enabled)
+{
+	if (enabled)
+		static_branch_enable(&sched_schedstats);
+	else
+		static_branch_disable(&sched_schedstats);
+}
+
+static int __init setup_schedstats(char *str)
+{
+	int ret = 0;
+	if (!str)
+		goto out;
+
+	if (!strcmp(str, "enable")) {
+		set_schedstats(true);
+		ret = 1;
+	} else if (!strcmp(str, "disable")) {
+		set_schedstats(false);
+		ret = 1;
+	}
+out:
+	if (!ret)
+		pr_warn("Unable to parse numa_balancing=\n");
+
+	return ret;
+}
+__setup("schedstats=", setup_schedstats);
+
+#ifdef CONFIG_PROC_SYSCTL
+int sysctl_schedstats(struct ctl_table *table, int write,
+			 void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = static_branch_likely(&sched_schedstats);
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0)
+		return err;
+	if (write)
+		set_schedstats(state);
+	return err;
+}
+#endif
+#endif
+
 /*
  * fork()/clone()-time setup:
  */
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 641511771ae6..c5450729cd97 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -75,16 +75,18 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	PN(se->vruntime);
 	PN(se->sum_exec_runtime);
 #ifdef CONFIG_SCHEDSTATS
-	PN(se->statistics.wait_start);
-	PN(se->statistics.sleep_start);
-	PN(se->statistics.block_start);
-	PN(se->statistics.sleep_max);
-	PN(se->statistics.block_max);
-	PN(se->statistics.exec_max);
-	PN(se->statistics.slice_max);
-	PN(se->statistics.wait_max);
-	PN(se->statistics.wait_sum);
-	P(se->statistics.wait_count);
+	if (static_branch_unlikely(&sched_schedstats)) {
+		PN(se->statistics.wait_start);
+		PN(se->statistics.sleep_start);
+		PN(se->statistics.block_start);
+		PN(se->statistics.sleep_max);
+		PN(se->statistics.block_max);
+		PN(se->statistics.exec_max);
+		PN(se->statistics.slice_max);
+		PN(se->statistics.wait_max);
+		PN(se->statistics.wait_sum);
+		P(se->statistics.wait_count);
+	}
 #endif
 	P(se->load.weight);
 #ifdef CONFIG_SMP
@@ -122,10 +124,12 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 		(long long)(p->nvcsw + p->nivcsw),
 		p->prio);
 #ifdef CONFIG_SCHEDSTATS
-	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
-		SPLIT_NS(p->se.statistics.wait_sum),
-		SPLIT_NS(p->se.sum_exec_runtime),
-		SPLIT_NS(p->se.statistics.sum_sleep_runtime));
+	if (static_branch_unlikely(&sched_schedstats)) {
+		SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
+			SPLIT_NS(p->se.statistics.wait_sum),
+			SPLIT_NS(p->se.sum_exec_runtime),
+			SPLIT_NS(p->se.statistics.sum_sleep_runtime));
+	}
 #else
 	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
 		0LL, 0L,
@@ -313,17 +317,19 @@ do {									\
 #define P(n) SEQ_printf(m, "  .%-30s: %d\n", #n, rq->n);
 #define P64(n) SEQ_printf(m, "  .%-30s: %Ld\n", #n, rq->n);
 
-	P(yld_count);
+	if (static_branch_unlikely(&sched_schedstats)) {
+		P(yld_count);
 
-	P(sched_count);
-	P(sched_goidle);
+		P(sched_count);
+		P(sched_goidle);
 #ifdef CONFIG_SMP
-	P64(avg_idle);
-	P64(max_idle_balance_cost);
+		P64(avg_idle);
+		P64(max_idle_balance_cost);
 #endif
 
-	P(ttwu_count);
-	P(ttwu_local);
+		P(ttwu_count);
+		P(ttwu_local);
+	}
 
 #undef P
 #undef P64
@@ -569,38 +575,38 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	nr_switches = p->nvcsw + p->nivcsw;
 
 #ifdef CONFIG_SCHEDSTATS
-	PN(se.statistics.sum_sleep_runtime);
-	PN(se.statistics.wait_start);
-	PN(se.statistics.sleep_start);
-	PN(se.statistics.block_start);
-	PN(se.statistics.sleep_max);
-	PN(se.statistics.block_max);
-	PN(se.statistics.exec_max);
-	PN(se.statistics.slice_max);
-	PN(se.statistics.wait_max);
-	PN(se.statistics.wait_sum);
-	P(se.statistics.wait_count);
-	PN(se.statistics.iowait_sum);
-	P(se.statistics.iowait_count);
-	P(se.nr_migrations);
-	P(se.statistics.nr_migrations_cold);
-	P(se.statistics.nr_failed_migrations_affine);
-	P(se.statistics.nr_failed_migrations_running);
-	P(se.statistics.nr_failed_migrations_hot);
-	P(se.statistics.nr_forced_migrations);
-	P(se.statistics.nr_wakeups);
-	P(se.statistics.nr_wakeups_sync);
-	P(se.statistics.nr_wakeups_migrate);
-	P(se.statistics.nr_wakeups_local);
-	P(se.statistics.nr_wakeups_remote);
-	P(se.statistics.nr_wakeups_affine);
-	P(se.statistics.nr_wakeups_affine_attempts);
-	P(se.statistics.nr_wakeups_passive);
-	P(se.statistics.nr_wakeups_idle);
-
-	{
+	if (static_branch_unlikely(&sched_schedstats)) {
 		u64 avg_atom, avg_per_cpu;
 
+		PN(se.statistics.sum_sleep_runtime);
+		PN(se.statistics.wait_start);
+		PN(se.statistics.sleep_start);
+		PN(se.statistics.block_start);
+		PN(se.statistics.sleep_max);
+		PN(se.statistics.block_max);
+		PN(se.statistics.exec_max);
+		PN(se.statistics.slice_max);
+		PN(se.statistics.wait_max);
+		PN(se.statistics.wait_sum);
+		P(se.statistics.wait_count);
+		PN(se.statistics.iowait_sum);
+		P(se.statistics.iowait_count);
+		P(se.nr_migrations);
+		P(se.statistics.nr_migrations_cold);
+		P(se.statistics.nr_failed_migrations_affine);
+		P(se.statistics.nr_failed_migrations_running);
+		P(se.statistics.nr_failed_migrations_hot);
+		P(se.statistics.nr_forced_migrations);
+		P(se.statistics.nr_wakeups);
+		P(se.statistics.nr_wakeups_sync);
+		P(se.statistics.nr_wakeups_migrate);
+		P(se.statistics.nr_wakeups_local);
+		P(se.statistics.nr_wakeups_remote);
+		P(se.statistics.nr_wakeups_affine);
+		P(se.statistics.nr_wakeups_affine_attempts);
+		P(se.statistics.nr_wakeups_passive);
+		P(se.statistics.nr_wakeups_idle);
+
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
 			avg_atom = div64_ul(avg_atom, nr_switches);
@@ -610,7 +616,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 		avg_per_cpu = p->se.sum_exec_runtime;
 		if (p->se.nr_migrations) {
 			avg_per_cpu = div64_u64(avg_per_cpu,
-						p->se.nr_migrations);
+					p->se.nr_migrations);
 		} else {
 			avg_per_cpu = -1LL;
 		}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1926606ece80..0fce4e353f3c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3110,7 +3110,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	}
 
 	update_stats_enqueue(cfs_rq, se);
-	check_spread(cfs_rq, se);
+	if (static_branch_unlikely(&sched_schedstats))
+		check_spread(cfs_rq, se);
 	if (se != cfs_rq->curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
@@ -3359,7 +3360,8 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 	/* throttle cfs_rqs exceeding runtime */
 	check_cfs_rq_runtime(cfs_rq);
 
-	check_spread(cfs_rq, prev);
+	if (static_branch_unlikely(&sched_schedstats))
+		check_spread(cfs_rq, prev);
 	if (prev->on_rq) {
 		update_stats_wait_start(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10f16374df7f..1d583870e1a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1022,6 +1022,7 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
 #endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
 
 extern struct static_key_false sched_numa_balancing;
+extern struct static_key_false sched_schedstats;
 
 static inline u64 global_rt_period(void)
 {
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index b0fbc7632de5..d183c9dc4b93 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -29,9 +29,9 @@ rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
 	if (rq)
 		rq->rq_sched_info.run_delay += delta;
 }
-# define schedstat_inc(rq, field)	do { (rq)->field++; } while (0)
-# define schedstat_add(rq, field, amt)	do { (rq)->field += (amt); } while (0)
-# define schedstat_set(var, val)	do { var = (val); } while (0)
+# define schedstat_inc(rq, field)	do { if (static_branch_unlikely(&sched_schedstats)) { (rq)->field++; } } while (0)
+# define schedstat_add(rq, field, amt)	do { if (static_branch_unlikely(&sched_schedstats)) { (rq)->field += (amt); } } while (0)
+# define schedstat_set(var, val)	do { if (static_branch_unlikely(&sched_schedstats)) { var = (val); } } while (0)
 #else /* !CONFIG_SCHEDSTATS */
 static inline void
 rq_sched_info_arrive(struct rq *rq, unsigned long long delta)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 97715fd9e790..6fe70ccdf2ac 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -350,6 +350,17 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_SCHEDSTATS
+	{
+		.procname	= "sched_schedstats",
+		.data		= NULL,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_schedstats,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif /* CONFIG_SCHEDSTATS */
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_NUMA_BALANCING
 	{
-- 
2.6.4

* Re: [PATCH] sched: Make schedstats a runtime tunable that is disabled by default
  2016-01-25 10:05 [PATCH] sched: Make schedstats a runtime tunable that is disabled by default Mel Gorman
@ 2016-01-25 11:26 ` Peter Zijlstra
  2016-01-25 13:39   ` Mel Gorman
  2016-01-25 15:29 ` Matt Fleming
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2016-01-25 11:26 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Matt Fleming, Mike Galbraith, LKML

On Mon, Jan 25, 2016 at 10:05:31AM +0000, Mel Gorman wrote:
> schedstats is very useful during debugging and performance tuning but it
> incurs overhead. As such, even though it can be disabled at build time,
> it is often enabled as the information is useful.  This patch adds a
> kernel command-line and sysctl tunable to enable or disable schedstats on
> demand. It is disabled by default as someone who knows they need it can
> also learn to enable it when necessary.

So the reason its often enabled in distro configs is (IIRC) that it
enables trace_sched_stat_{wait,sleep,iowait,blocked}().

I've not looked at the details of this patch, but I suspect this patch
would make these tracepoints available but non-functional unless you
poke the magic button.
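
For context, a simplified sketch of the coupling being described here; this
is not the exact kernel code, but roughly how one of those tracepoints sits
inside the CONFIG_SCHEDSTATS-only accounting, which is why gating the
accounting at runtime also gates the tracepoint:

	static void
	update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
	{
		u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;

		/* the tracepoint only fires from inside the stats update */
		if (entity_is_task(se))
			trace_sched_stat_wait(task_of(se), delta);

		se->statistics.wait_max = max(se->statistics.wait_max, delta);
		se->statistics.wait_sum += delta;
		se->statistics.wait_start = 0;
	}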

* Re: [PATCH] sched: Make schedstats a runtime tunable that is disabled by default
  2016-01-25 11:26 ` Peter Zijlstra
@ 2016-01-25 13:39   ` Mel Gorman
  2016-01-25 14:59     ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Mel Gorman @ 2016-01-25 13:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Matt Fleming, Mike Galbraith, LKML

On Mon, Jan 25, 2016 at 12:26:06PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 25, 2016 at 10:05:31AM +0000, Mel Gorman wrote:
> > schedstats is very useful during debugging and performance tuning but it
> > incurs overhead. As such, even though it can be disabled at build time,
> > it is often enabled as the information is useful.  This patch adds a
> > kernel command-line and sysctl tunable to enable or disable schedstats on
> > demand. It is disabled by default as someone who knows they need it can
> > also learn to enable it when necessary.
> 
> So the reason its often enabled in distro configs is (IIRC) that it
> enables trace_sched_stat_{wait,sleep,iowait,blocked}().
> 
> I've not looked at the details of this patch, but I suspect this patch
> would make these tracepoints available but non-functional unless you
> poke the magic button.
> 

It's potentially slightly worse than that. The tracepoints are available,
functional but produce garbage unless the magic button is poked and do
a lot of useful work producing that garbage. I missed a few hunks that
are included below. With this, the tracepoints will exist but unless the
magic button is poked, they'll never fire. Considering the paths
affected, this will require retesting but if it's ok, would you be ok in
general with a patch like this that forces a button to be pushed if
the user is doing performance analysis?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fce4e353f3c..b39f2ff13345 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -755,7 +755,12 @@ static void
 update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct task_struct *p;
-	u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
+	u64 delta;
+
+	if (!static_branch_unlikely(&sched_schedstats))
+		return;
+
+	delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
 
 	if (entity_is_task(se)) {
 		p = task_of(se);
@@ -2982,6 +2987,9 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #ifdef CONFIG_SCHEDSTATS
 	struct task_struct *tsk = NULL;
 
+	if (!static_branch_unlikely(&sched_schedstats))
+		return;
+
 	if (entity_is_task(se))
 		tsk = task_of(se);
 

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH] sched: Make schedstats a runtime tunable that is disabled by default
  2016-01-25 13:39   ` Mel Gorman
@ 2016-01-25 14:59     ` Peter Zijlstra
  2016-01-25 15:46       ` Ingo Molnar
  2016-01-25 17:05       ` Mel Gorman
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2016-01-25 14:59 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Ingo Molnar, Matt Fleming, Mike Galbraith, LKML

On Mon, Jan 25, 2016 at 01:39:44PM +0000, Mel Gorman wrote:
> On Mon, Jan 25, 2016 at 12:26:06PM +0100, Peter Zijlstra wrote:
> > On Mon, Jan 25, 2016 at 10:05:31AM +0000, Mel Gorman wrote:
> > > schedstats is very useful during debugging and performance tuning but it
> > > incurs overhead. As such, even though it can be disabled at build time,
> > > it is often enabled as the information is useful.  This patch adds a
> > > kernel command-line and sysctl tunable to enable or disable schedstats on
> > > demand. It is disabled by default as someone who knows they need it can
> > > also learn to enable it when necessary.
> > 
> > So the reason its often enabled in distro configs is (IIRC) that it
> > enables trace_sched_stat_{wait,sleep,iowait,blocked}().
> > 
> > I've not looked at the details of this patch, but I suspect this patch
> > would make these tracepoints available but non-functional unless you
> > poke the magic button.
> > 
> 
> It's potentially slightly worse than that. The tracepoints are available,
> functional but produce garbage unless the magic button is poked and do
> a lot of useful work producing that garbage. I missed a few hunks that
> are included below. With this, the tracepoints will exist but unless the
> magic button is poked, they'll never fire. Considering the paths
> affected, this will require retesting but if it's ok, would you be ok in
> general with a patch like this that forces a button to be pushed if
> the user is doing performance analysis?

Its rather unintuitive and error prone semantics :/

Ideally we'd auto-magically enable the magic knob if any of these
affected tracepoints become active. Or alternatively fail to enable the
tracepoints (which would then get us people going: 'WTF this used to
work').
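
One possible shape of that auto-enable idea, sketched under the assumption
that the stat trace events could be given registration callbacks (as the
TRACE_EVENT_FN()/DEFINE_EVENT_FN() variants allow); the hook names are
hypothetical and this is not what the posted patch does:

	/* Hypothetical reg/unreg hooks for the trace_sched_stat_* events:
	 * flip the schedstats key whenever a probe is attached/detached. */
	static int sched_stat_tp_reg(void)
	{
		set_schedstats(true);
		return 0;
	}

	static void sched_stat_tp_unreg(void)
	{
		set_schedstats(false);
	}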

One of the things on my TODO is look at how much of sched_stat is
required for these tracepoints and see if we can enable just that
(hopefully) little bit, while not doing the rest of the accounting.

Of course, it'll be our luck that tracking the data for these
tracepoints is the most expensive part of schedstats ...

Ingo?

* Re: [PATCH] sched: Make schedstats a runtime tunable that is disabled by default
  2016-01-25 10:05 [PATCH] sched: Make schedstats a runtime tunable that is disabled by default Mel Gorman
  2016-01-25 11:26 ` Peter Zijlstra
@ 2016-01-25 15:29 ` Matt Fleming
  2016-01-25 16:46   ` Mel Gorman
  1 sibling, 1 reply; 13+ messages in thread
From: Matt Fleming @ 2016-01-25 15:29 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Peter Zijlstra, Ingo Molnar, Mike Galbraith, LKML

On Mon, 25 Jan, at 10:05:31AM, Mel Gorman wrote:
> +static int __init setup_schedstats(char *str)
> +{
> +	int ret = 0;
> +	if (!str)
> +		goto out;
> +
> +	if (!strcmp(str, "enable")) {
> +		set_schedstats(true);
> +		ret = 1;
> +	} else if (!strcmp(str, "disable")) {
> +		set_schedstats(false);
> +		ret = 1;
> +	}
> +out:
> +	if (!ret)
> +		pr_warn("Unable to parse numa_balancing=\n");
> +

Copy+paste error?

* Re: [PATCH] sched: Make schedstats a runtime tunable that is disabled by default
  2016-01-25 14:59     ` Peter Zijlstra
@ 2016-01-25 15:46       ` Ingo Molnar
  2016-01-25 17:07         ` Mel Gorman
  2016-01-25 17:05       ` Mel Gorman
  1 sibling, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2016-01-25 15:46 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mel Gorman, Matt Fleming, Mike Galbraith, LKML


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Jan 25, 2016 at 01:39:44PM +0000, Mel Gorman wrote:
> > On Mon, Jan 25, 2016 at 12:26:06PM +0100, Peter Zijlstra wrote:
> > > On Mon, Jan 25, 2016 at 10:05:31AM +0000, Mel Gorman wrote:
> > > > schedstats is very useful during debugging and performance tuning but it
> > > > incurs overhead. As such, even though it can be disabled at build time,
> > > > it is often enabled as the information is useful.  This patch adds a
> > > > kernel command-line and sysctl tunable to enable or disable schedstats on
> > > > demand. It is disabled by default as someone who knows they need it can
> > > > also learn to enable it when necessary.
> > > 
> > > So the reason its often enabled in distro configs is (IIRC) that it
> > > enables trace_sched_stat_{wait,sleep,iowait,blocked}().
> > > 
> > > I've not looked at the details of this patch, but I suspect this patch
> > > would make these tracepoints available but non-functional unless you
> > > poke the magic button.
> > > 
> > 
> > It's potentially slightly worse than that. The tracepoints are available,
> > functional but produce garbage unless the magic button is poked and do
> > a lot of useful work producing that garbage. I missed a few hunks that
> > are included below. With this, the tracepoints will exist but unless the
> > magic button is poked, they'll never fire. Considering the paths
> > affected, this will require retesting but if it's ok, would you be ok in
> > general with a patch like this that forces a button to be pushed if
> > the user is doing performance analysis?
> 
> Its rather unintuitive and error prone semantics :/
> 
> Ideally we'd auto-magically enable the magic knob if any of these
> affected tracepoints become active. Or alternatively fail to enable the
> tracepoints (which would then get us people going: 'WTF this used to
> work').
> 
> One of the things on my TODO is look at how much of sched_stat is
> required for these tracepoints and see if we can enable just that
> (hopefully) little bit, while not doing the rest of the accounting.
> 
> Of course, it'll be our luck that tracking the data for these
> tracepoints is the most expensive part of schedstats ...
> 
> Ingo?

IIRC it needed only a small subset of schedstats to make those tracepoints work.

We already have too much overhead in the scheduler as-is - and the extra cache 
footprint does not even show on the typically cache-rich enterprise CPUs most of 
the scalability testing goes on.

My minimum requirement for such runtime enablement would be to make it entirely 
static-branch patched and triggered at the call sites as well - not hidden inside 
schedstat functions.
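
For illustration, the call-site form being asked for here (this is the
shape the follow-up version later in this thread uses):

	if (schedstat_enabled())
		ttwu_stat(p, cpu, wake_flags);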

Thanks,

	Ingo

* Re: [PATCH] sched: Make schedstats a runtime tunable that is disabled by default
  2016-01-25 15:29 ` Matt Fleming
@ 2016-01-25 16:46   ` Mel Gorman
  0 siblings, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2016-01-25 16:46 UTC (permalink / raw)
  To: Matt Fleming; +Cc: Peter Zijlstra, Ingo Molnar, Mike Galbraith, LKML

On Mon, Jan 25, 2016 at 03:29:08PM +0000, Matt Fleming wrote:
> On Mon, 25 Jan, at 10:05:31AM, Mel Gorman wrote:
> > +static int __init setup_schedstats(char *str)
> > +{
> > +	int ret = 0;
> > +	if (!str)
> > +		goto out;
> > +
> > +	if (!strcmp(str, "enable")) {
> > +		set_schedstats(true);
> > +		ret = 1;
> > +	} else if (!strcmp(str, "disable")) {
> > +		set_schedstats(false);
> > +		ret = 1;
> > +	}
> > +out:
> > +	if (!ret)
> > +		pr_warn("Unable to parse numa_balancing=\n");
> > +
> 
> Copy+paste error?

Yes.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH] sched: Make schedstats a runtime tunable that is disabled by default
  2016-01-25 14:59     ` Peter Zijlstra
  2016-01-25 15:46       ` Ingo Molnar
@ 2016-01-25 17:05       ` Mel Gorman
  1 sibling, 0 replies; 13+ messages in thread
From: Mel Gorman @ 2016-01-25 17:05 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Matt Fleming, Mike Galbraith, LKML

On Mon, Jan 25, 2016 at 03:59:44PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 25, 2016 at 01:39:44PM +0000, Mel Gorman wrote:
> > On Mon, Jan 25, 2016 at 12:26:06PM +0100, Peter Zijlstra wrote:
> > > On Mon, Jan 25, 2016 at 10:05:31AM +0000, Mel Gorman wrote:
> > > > schedstats is very useful during debugging and performance tuning but it
> > > > incurs overhead. As such, even though it can be disabled at build time,
> > > > it is often enabled as the information is useful.  This patch adds a
> > > > kernel command-line and sysctl tunable to enable or disable schedstats on
> > > > demand. It is disabled by default as someone who knows they need it can
> > > > also learn to enable it when necessary.
> > > 
> > > So the reason its often enabled in distro configs is (IIRC) that it
> > > enables trace_sched_stat_{wait,sleep,iowait,blocked}().
> > > 
> > > I've not looked at the details of this patch, but I suspect this patch
> > > would make these tracepoints available but non-functional unless you
> > > poke the magic button.
> > > 
> > 
> > It's potentially slightly worse than that. The tracepoints are available,
> > functional but produce garbage unless the magic button is poked and do
> > a lot of useful work producing that garbage. I missed a few hunks that
> > are included below. With this, the tracepoints will exist but unless the
> > magic button is poked, they'll never fire. Considering the paths
> > affected, this will require retesting but if it's ok, would you be ok in
> > general with a patch like this that forces a button to be pushed if
> > the user is doing performance analysis?
> 
> Its rather unintuitive and error prone semantics :/
> 
> Ideally we'd auto-magically enable the magic knob if any of these
> affected tracepoints become active.

This would also be misleading. Once enabled, the stats start being
updated. An already sleeping process will not have wait_start set so the
trace information for wakeups will initially be completely bogus.
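
To illustrate, using the existing wait-time calculation from the patch
(sketch only):

	/* if stats were off when the task last started waiting, wait_start
	 * is still 0, so the first delta after enabling is ~rq_clock() */
	delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;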

> Or alternatively fail to enable the
> tracepoints (which would then get us people going: 'WTF this used to
> work').
> 

Each option is at least visible to some extent so there would be a period
of time of wtf for analysing scheduler performance. I'm not sure there is
a way of failing to set a tracepoint but I'll check it out.

> One of the things on my TODO is look at how much of sched_stat is
> required for these tracepoints and see if we can enable just that
> (hopefully) little bit, while not doing the rest of the accounting.
> 

I don't think many are required but some of them are expensive to keep
track of. Look at enqueue_sleeper as an example of the amount of work
required just to have the tracepoint available.
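
A condensed sketch of what that involves (simplified from the 4.4-era
enqueue_sleeper(); the field names are the ones shown in the debug output
above, but this is not the verbatim function):

	if (se->statistics.sleep_start) {
		u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.sleep_start;

		if ((s64)delta < 0)
			delta = 0;
		if (delta > se->statistics.sleep_max)
			se->statistics.sleep_max = delta;

		se->statistics.sleep_start = 0;
		se->statistics.sum_sleep_runtime += delta;
		if (tsk)
			trace_sched_stat_sleep(tsk, delta);
	}
	/* ...followed by a similar block for block_start, iowait and the
	 * blocked/iowait tracepoints... */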

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH] sched: Make schedstats a runtime tunable that is disabled by default
  2016-01-25 15:46       ` Ingo Molnar
@ 2016-01-25 17:07         ` Mel Gorman
  2016-01-25 18:40           ` Ingo Molnar
  0 siblings, 1 reply; 13+ messages in thread
From: Mel Gorman @ 2016-01-25 17:07 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, Matt Fleming, Mike Galbraith, LKML

On Mon, Jan 25, 2016 at 04:46:35PM +0100, Ingo Molnar wrote:
> > Of course, it'll be our luck that tracking the data for these
> > tracepoints is the most expensive part of schedstats ...
> > 
> > Ingo?
> 
> IIRC it needed only a small subset of schedstats to make those tracepoints work.
> 
> We already have too much overhead in the scheduler as-is - and the extra cache 
> footprint does not even show on the typically cache-rich enterprise CPUs most of 
> the scalability testing goes on.
> 
> My minimum requirement for such runtime enablement would be to make it entirely 
> static-branch patched and triggered at the call sites as well - not hidden inside 
> schedstat functions.
> 

As it is, it's static-branch patched but I'm struggling to see why they
cannot be hidden in the schedstat_* functions which are just preprocessor
macros. The checks could be put in the callsites but it's a lot of updates
and I don't think the end result would be very nice to read.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH] sched: Make schedstats a runtime tunable that is disabled by default
  2016-01-25 17:07         ` Mel Gorman
@ 2016-01-25 18:40           ` Ingo Molnar
  2016-01-25 20:11             ` Mel Gorman
  0 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2016-01-25 18:40 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Peter Zijlstra, Matt Fleming, Mike Galbraith, LKML


* Mel Gorman <mgorman@techsingularity.net> wrote:

> On Mon, Jan 25, 2016 at 04:46:35PM +0100, Ingo Molnar wrote:
> > > Of course, it'll be our luck that tracking the data for these
> > > tracepoints is the most expensive part of schedstats ...
> > > 
> > > Ingo?
> > 
> > IIRC it needed only a small subset of schedstats to make those tracepoints work.
> > 
> > We already have too much overhead in the scheduler as-is - and the extra cache 
> > footprint does not even show on the typically cache-rich enterprise CPUs most of 
> > the scalability testing goes on.
> > 
> > My minimum requirement for such runtime enablement would be to make it entirely 
> > static-branch patched and triggered at the call sites as well - not hidden inside 
> > schedstat functions.
> > 
> 
> As it is, it's static-branch patched but I'm struggling to see why they cannot 
> be hidden in the schedstat_* functions which are just preprocessor macros. The 
> checks could be put in the callsites but it's a lot of updates and I don't think 
> the end result would be very nice to read.

So I was judging by:

@@ -755,7 +755,12 @@ static void
 update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
        struct task_struct *p;
-       u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
+       u64 delta;
+
+       if (static_branch_unlikely(&sched_schedstats))
+               return;
+

which puts a static branch inside a real function, not preprocessor macros.

Thanks,

	Ingo

* Re: [PATCH] sched: Make schedstats a runtime tunable that is disabled by default
  2016-01-25 18:40           ` Ingo Molnar
@ 2016-01-25 20:11             ` Mel Gorman
  2016-01-25 20:45               ` Mel Gorman
  0 siblings, 1 reply; 13+ messages in thread
From: Mel Gorman @ 2016-01-25 20:11 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, Matt Fleming, Mike Galbraith, LKML

On Mon, Jan 25, 2016 at 07:40:00PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > On Mon, Jan 25, 2016 at 04:46:35PM +0100, Ingo Molnar wrote:
> > > > Of course, it'll be our luck that tracking the data for these
> > > > tracepoints is the most expensive part of schedstats ...
> > > > 
> > > > Ingo?
> > > 
> > > IIRC it needed only a small subset of schedstats to make those tracepoints work.
> > > 
> > > We already have too much overhead in the scheduler as-is - and the extra cache 
> > > footprint does not even show on the typically cache-rich enterprise CPUs most of 
> > > the scalability testing goes on.
> > > 
> > > My minimum requirement for such runtime enablement would be to make it entirely 
> > > static-branch patched and triggered at the call sites as well - not hidden inside 
> > > schedstat functions.
> > > 
> > 
> > As it is, it's static-branch patched but I'm struggling to see why they cannot 
> > be hidden in the schedstat_* functions which are just preprocessor macros. The 
> > checks could be put in the callsites but it's a lot of updates and I don't think 
> > the end result would be very nice to read.
> 
> So I was judging by:
> 
> @@ -755,7 +755,12 @@ static void
>  update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         struct task_struct *p;
> -       u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
> +       u64 delta;
> +
> +       if (!static_branch_unlikely(&sched_schedstats))
> +               return;
> +
> 
> which puts a static branch inside a real function, not preprocessor macros.
> 

Ok, I see your point, thanks.

The vast majority of the checks are in the schedstat_[inc|dec|set] helpers.
update_stats_wait_end() is an oddity in that it's a large function
that only exists in schedstat and does a number of calculations. Ideally
update_stats_wait would be updated too.  In the next revision, I'll create
a schedstats_enabled() helper that is an alias of static_branch_unlikely()
and returns 0 if stats are not configured in for use at the call-site
of large functions like this. That has the added benefit of avoiding any
function call overhead.

I'm less concerned with the exact coding style at the moment. I'm more
concerned about whether people are ok with dead tracepoints and hidden
stats when schedstat is disabled at runtime by default.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH] sched: Make schedstats a runtime tunable that is disabled by default
  2016-01-25 20:11             ` Mel Gorman
@ 2016-01-25 20:45               ` Mel Gorman
  2016-01-26  8:13                 ` Ingo Molnar
  0 siblings, 1 reply; 13+ messages in thread
From: Mel Gorman @ 2016-01-25 20:45 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, Matt Fleming, Mike Galbraith, LKML

On Mon, Jan 25, 2016 at 08:11:42PM +0000, Mel Gorman wrote:
> The vast majority of the checks are in the schedstat_[inc|dec|set] helpers.
> update_stats_wait_end() is an oddity in that it's a large function
> that only exists in schedstat and does a number of calculations. Ideally
> update_stats_wait would be updated too.  In the next revision, I'll create
> a schedstats_enabled() helper that is an alias of static_branch_unlikely()
> and returns 0 if stats are not configured in for use at the call-site
> of large functions like this. That has the added benefit of avoiding any
> function call overhead.
> 

Not even build tested but this is the general direction I'm thinking. It
adds the schedstat_enabled() helper and disables more schedstat-only code.

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 87d40a72f6a1..846956abfe85 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3523,6 +3523,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 
 	sched_debug	[KNL] Enables verbose scheduler debug messages.
 
+	schedstats=	[KNL,X86] Enable or disable scheduler statistics.
+			Allowed values are enable and disable. This feature
+			incurs a small amount of overhead in the scheduler
+			but is useful for debugging and performance tuning.
+
 	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
 			xtime_lock contention on larger systems, and/or RCU lock
 			contention on all systems with CONFIG_MAXSMP set.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index a93b414672a7..be7c3b720adf 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -760,6 +760,14 @@ rtsig-nr shows the number of RT signals currently queued.
 
 ==============================================================
 
+schedstats:
+
+Enables/disables scheduler statistics. Enabling this feature
+incurs a small amount of overhead in the scheduler but is
+useful for debugging and performance tuning.
+
+==============================================================
+
 sg-big-buff:
 
 This file shows the size of the generic SCSI (sg) buffer.
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c9e4731cf10b..4f080ab4f2cd 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -95,4 +95,8 @@ extern int sysctl_numa_balancing(struct ctl_table *table, int write,
 				 void __user *buffer, size_t *lenp,
 				 loff_t *ppos);
 
+extern int sysctl_schedstats(struct ctl_table *table, int write,
+				 void __user *buffer, size_t *lenp,
+				 loff_t *ppos);
+
 #endif /* _SCHED_SYSCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 63d3a24e081a..42ea7b28a47b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2093,7 +2093,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 
 	ttwu_queue(p, cpu);
 stat:
-	ttwu_stat(p, cpu, wake_flags);
+	if (schedstat_enabled())
+		ttwu_stat(p, cpu, wake_flags);
 out:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
@@ -2141,7 +2142,8 @@ static void try_to_wake_up_local(struct task_struct *p)
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
 
 	ttwu_do_wakeup(rq, p, 0);
-	ttwu_stat(p, smp_processor_id(), 0);
+	if (schedstat_enabled())
+		ttwu_stat(p, smp_processor_id(), 0);
 out:
 	raw_spin_unlock(&p->pi_lock);
 }
@@ -2210,6 +2212,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #endif
 
 #ifdef CONFIG_SCHEDSTATS
+	/* Even if schedstat is disabled, there should not be garbage */
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
 
@@ -2281,6 +2284,61 @@ int sysctl_numa_balancing(struct ctl_table *table, int write,
 #endif
 #endif
 
+DEFINE_STATIC_KEY_FALSE(sched_schedstats);
+
+#ifdef CONFIG_SCHEDSTATS
+void set_schedstats(bool enabled)
+{
+	if (enabled)
+		static_branch_enable(&sched_schedstats);
+	else
+		static_branch_disable(&sched_schedstats);
+}
+
+static int __init setup_schedstats(char *str)
+{
+	int ret = 0;
+	if (!str)
+		goto out;
+
+	if (!strcmp(str, "enable")) {
+		set_schedstats(true);
+		ret = 1;
+	} else if (!strcmp(str, "disable")) {
+		set_schedstats(false);
+		ret = 1;
+	}
+out:
+	if (!ret)
+		pr_warn("Unable to parse schedstats=\n");
+
+	return ret;
+}
+__setup("schedstats=", setup_schedstats);
+
+#ifdef CONFIG_PROC_SYSCTL
+int sysctl_schedstats(struct ctl_table *table, int write,
+			 void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = static_branch_likely(&sched_schedstats);
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0)
+		return err;
+	if (write)
+		set_schedstats(state);
+	return err;
+}
+#endif
+#endif
+
 /*
  * fork()/clone()-time setup:
  */
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 641511771ae6..79a3d91f554d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -75,16 +75,18 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	PN(se->vruntime);
 	PN(se->sum_exec_runtime);
 #ifdef CONFIG_SCHEDSTATS
-	PN(se->statistics.wait_start);
-	PN(se->statistics.sleep_start);
-	PN(se->statistics.block_start);
-	PN(se->statistics.sleep_max);
-	PN(se->statistics.block_max);
-	PN(se->statistics.exec_max);
-	PN(se->statistics.slice_max);
-	PN(se->statistics.wait_max);
-	PN(se->statistics.wait_sum);
-	P(se->statistics.wait_count);
+	if (schedstat_enabled()) {
+		PN(se->statistics.wait_start);
+		PN(se->statistics.sleep_start);
+		PN(se->statistics.block_start);
+		PN(se->statistics.sleep_max);
+		PN(se->statistics.block_max);
+		PN(se->statistics.exec_max);
+		PN(se->statistics.slice_max);
+		PN(se->statistics.wait_max);
+		PN(se->statistics.wait_sum);
+		P(se->statistics.wait_count);
+	}
 #endif
 	P(se->load.weight);
 #ifdef CONFIG_SMP
@@ -122,10 +124,12 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 		(long long)(p->nvcsw + p->nivcsw),
 		p->prio);
 #ifdef CONFIG_SCHEDSTATS
-	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
-		SPLIT_NS(p->se.statistics.wait_sum),
-		SPLIT_NS(p->se.sum_exec_runtime),
-		SPLIT_NS(p->se.statistics.sum_sleep_runtime));
+	if (schedstat_enabled()) {
+		SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
+			SPLIT_NS(p->se.statistics.wait_sum),
+			SPLIT_NS(p->se.sum_exec_runtime),
+			SPLIT_NS(p->se.statistics.sum_sleep_runtime));
+	}
 #else
 	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
 		0LL, 0L,
@@ -313,17 +317,19 @@ do {									\
 #define P(n) SEQ_printf(m, "  .%-30s: %d\n", #n, rq->n);
 #define P64(n) SEQ_printf(m, "  .%-30s: %Ld\n", #n, rq->n);
 
-	P(yld_count);
+	if (schedstat_enabled()) {
+		P(yld_count);
 
-	P(sched_count);
-	P(sched_goidle);
+		P(sched_count);
+		P(sched_goidle);
 #ifdef CONFIG_SMP
-	P64(avg_idle);
-	P64(max_idle_balance_cost);
+		P64(avg_idle);
+		P64(max_idle_balance_cost);
 #endif
 
-	P(ttwu_count);
-	P(ttwu_local);
+		P(ttwu_count);
+		P(ttwu_local);
+	}
 
 #undef P
 #undef P64
@@ -569,38 +575,38 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	nr_switches = p->nvcsw + p->nivcsw;
 
 #ifdef CONFIG_SCHEDSTATS
-	PN(se.statistics.sum_sleep_runtime);
-	PN(se.statistics.wait_start);
-	PN(se.statistics.sleep_start);
-	PN(se.statistics.block_start);
-	PN(se.statistics.sleep_max);
-	PN(se.statistics.block_max);
-	PN(se.statistics.exec_max);
-	PN(se.statistics.slice_max);
-	PN(se.statistics.wait_max);
-	PN(se.statistics.wait_sum);
-	P(se.statistics.wait_count);
-	PN(se.statistics.iowait_sum);
-	P(se.statistics.iowait_count);
-	P(se.nr_migrations);
-	P(se.statistics.nr_migrations_cold);
-	P(se.statistics.nr_failed_migrations_affine);
-	P(se.statistics.nr_failed_migrations_running);
-	P(se.statistics.nr_failed_migrations_hot);
-	P(se.statistics.nr_forced_migrations);
-	P(se.statistics.nr_wakeups);
-	P(se.statistics.nr_wakeups_sync);
-	P(se.statistics.nr_wakeups_migrate);
-	P(se.statistics.nr_wakeups_local);
-	P(se.statistics.nr_wakeups_remote);
-	P(se.statistics.nr_wakeups_affine);
-	P(se.statistics.nr_wakeups_affine_attempts);
-	P(se.statistics.nr_wakeups_passive);
-	P(se.statistics.nr_wakeups_idle);
-
-	{
+	if (schedstat_enabled()) {
 		u64 avg_atom, avg_per_cpu;
 
+		PN(se.statistics.sum_sleep_runtime);
+		PN(se.statistics.wait_start);
+		PN(se.statistics.sleep_start);
+		PN(se.statistics.block_start);
+		PN(se.statistics.sleep_max);
+		PN(se.statistics.block_max);
+		PN(se.statistics.exec_max);
+		PN(se.statistics.slice_max);
+		PN(se.statistics.wait_max);
+		PN(se.statistics.wait_sum);
+		P(se.statistics.wait_count);
+		PN(se.statistics.iowait_sum);
+		P(se.statistics.iowait_count);
+		P(se.nr_migrations);
+		P(se.statistics.nr_migrations_cold);
+		P(se.statistics.nr_failed_migrations_affine);
+		P(se.statistics.nr_failed_migrations_running);
+		P(se.statistics.nr_failed_migrations_hot);
+		P(se.statistics.nr_forced_migrations);
+		P(se.statistics.nr_wakeups);
+		P(se.statistics.nr_wakeups_sync);
+		P(se.statistics.nr_wakeups_migrate);
+		P(se.statistics.nr_wakeups_local);
+		P(se.statistics.nr_wakeups_remote);
+		P(se.statistics.nr_wakeups_affine);
+		P(se.statistics.nr_wakeups_affine_attempts);
+		P(se.statistics.nr_wakeups_passive);
+		P(se.statistics.nr_wakeups_idle);
+
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
 			avg_atom = div64_ul(avg_atom, nr_switches);
@@ -610,7 +616,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 		avg_per_cpu = p->se.sum_exec_runtime;
 		if (p->se.nr_migrations) {
 			avg_per_cpu = div64_u64(avg_per_cpu,
-						p->se.nr_migrations);
+					p->se.nr_migrations);
 		} else {
 			avg_per_cpu = -1LL;
 		}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1926606ece80..e6ae52fd203f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -755,7 +755,9 @@ static void
 update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct task_struct *p;
-	u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
+	u64 delta;
+
+	delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
 
 	if (entity_is_task(se)) {
 		p = task_of(se);
@@ -776,22 +778,12 @@ update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	se->statistics.wait_sum += delta;
 	se->statistics.wait_start = 0;
 }
-#else
-static inline void
-update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-}
-
-static inline void
-update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-}
-#endif
 
 /*
  * Task is being enqueued - update stats:
  */
-static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static inline void
+update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	/*
 	 * Are we enqueueing a waiting task? (for current tasks
@@ -801,8 +793,8 @@ static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		update_stats_wait_start(cfs_rq, se);
 }
 
-static inline void
-update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void
+update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	/*
 	 * Mark the end of the wait period if dequeueing a
@@ -810,7 +802,40 @@ update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	 */
 	if (se != cfs_rq->curr)
 		update_stats_wait_end(cfs_rq, se);
+
+	if (flags & DEQUEUE_SLEEP) {
+		if (entity_is_task(se)) {
+			struct task_struct *tsk = task_of(se);
+
+			if (tsk->state & TASK_INTERRUPTIBLE)
+				se->statistics.sleep_start = rq_clock(rq_of(cfs_rq));
+			if (tsk->state & TASK_UNINTERRUPTIBLE)
+				se->statistics.block_start = rq_clock(rq_of(cfs_rq));
+		}
+	}
+
+}
+#else
+static inline void
+update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
 }
+#endif
 
 /*
  * We are picking a new current task - update its stats:
@@ -3106,11 +3131,14 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);
-		enqueue_sleeper(cfs_rq, se);
+		if (schedstat_enabled())
+			enqueue_sleeper(cfs_rq, se);
 	}
 
-	update_stats_enqueue(cfs_rq, se);
-	check_spread(cfs_rq, se);
+	if (schedstat_enabled()) {
+		update_stats_enqueue(cfs_rq, se);
+		check_spread(cfs_rq, se);
+	}
 	if (se != cfs_rq->curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
@@ -3177,19 +3205,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	update_curr(cfs_rq);
 	dequeue_entity_load_avg(cfs_rq, se);
 
-	update_stats_dequeue(cfs_rq, se);
-	if (flags & DEQUEUE_SLEEP) {
-#ifdef CONFIG_SCHEDSTATS
-		if (entity_is_task(se)) {
-			struct task_struct *tsk = task_of(se);
-
-			if (tsk->state & TASK_INTERRUPTIBLE)
-				se->statistics.sleep_start = rq_clock(rq_of(cfs_rq));
-			if (tsk->state & TASK_UNINTERRUPTIBLE)
-				se->statistics.block_start = rq_clock(rq_of(cfs_rq));
-		}
-#endif
-	}
+	if (schedstat_enabled())
+		update_stats_dequeue(cfs_rq, se, flags);
 
 	clear_buddies(cfs_rq, se);
 
@@ -3263,7 +3280,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 * a CPU. So account for the time it spent waiting on the
 		 * runqueue.
 		 */
-		update_stats_wait_end(cfs_rq, se);
+		if (schedstat_enabled())
+			update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
 		update_load_avg(se, 1);
 	}
@@ -3276,7 +3294,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	 * least twice that of our own weight (i.e. dont track it
 	 * when there are only lesser-weight tasks around):
 	 */
-	if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
+	if (schedstat_enabled() && rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
 		se->statistics.slice_max = max(se->statistics.slice_max,
 			se->sum_exec_runtime - se->prev_sum_exec_runtime);
 	}
@@ -3359,9 +3377,11 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 	/* throttle cfs_rqs exceeding runtime */
 	check_cfs_rq_runtime(cfs_rq);
 
-	check_spread(cfs_rq, prev);
+	if (schedstat_enabled())
+		check_spread(cfs_rq, prev);
 	if (prev->on_rq) {
-		update_stats_wait_start(cfs_rq, prev);
+		if (schedstat_enabled())
+			update_stats_wait_start(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10f16374df7f..1d583870e1a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1022,6 +1022,7 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
 #endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
 
 extern struct static_key_false sched_numa_balancing;
+extern struct static_key_false sched_schedstats;
 
 static inline u64 global_rt_period(void)
 {
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index b0fbc7632de5..70b3b6a20fb0 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -29,9 +29,10 @@ rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
 	if (rq)
 		rq->rq_sched_info.run_delay += delta;
 }
-# define schedstat_inc(rq, field)	do { (rq)->field++; } while (0)
-# define schedstat_add(rq, field, amt)	do { (rq)->field += (amt); } while (0)
-# define schedstat_set(var, val)	do { var = (val); } while (0)
+# define schedstat_enabled()		static_branch_unlikely(&sched_schedstats)
+# define schedstat_inc(rq, field)	do { if (schedstat_enabled()) { (rq)->field++; } } while (0)
+# define schedstat_add(rq, field, amt)	do { if (schedstat_enabled()) { (rq)->field += (amt); } } while (0)
+# define schedstat_set(var, val)	do { if (schedstat_enabled()) { var = (val); } } while (0)
 #else /* !CONFIG_SCHEDSTATS */
 static inline void
 rq_sched_info_arrive(struct rq *rq, unsigned long long delta)
@@ -42,6 +43,7 @@ rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
 static inline void
 rq_sched_info_depart(struct rq *rq, unsigned long long delta)
 {}
+# define schedstat_enabled()		0
 # define schedstat_inc(rq, field)	do { } while (0)
 # define schedstat_add(rq, field, amt)	do { } while (0)
 # define schedstat_set(var, val)	do { } while (0)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 97715fd9e790..6fe70ccdf2ac 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -350,6 +350,17 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_SCHEDSTATS
+	{
+		.procname	= "sched_schedstats",
+		.data		= NULL,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_schedstats,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif /* CONFIG_SCHEDSTATS */
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_NUMA_BALANCING
 	{
-- 
Mel Gorman
SUSE Labs

* Re: [PATCH] sched: Make schedstats a runtime tunable that is disabled by default
  2016-01-25 20:45               ` Mel Gorman
@ 2016-01-26  8:13                 ` Ingo Molnar
  0 siblings, 0 replies; 13+ messages in thread
From: Ingo Molnar @ 2016-01-26  8:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Matt Fleming, Mike Galbraith, LKML, Thomas Gleixner


* Mel Gorman <mgorman@techsingularity.net> wrote:

> -	ttwu_stat(p, smp_processor_id(), 0);
> +	if (schedstat_enabled())
> +		ttwu_stat(p, smp_processor_id(), 0);

Yeah. I actually like this for the cleanup factor as well: your patch makes it a 
_lot_ more obvious what is a statistics slowpath and what is not.

We tried to achieve that effect via the _stat() naming, but in hindsight that was 
a failure.

Thanks,

	Ingo
