* [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4
@ 2016-02-03 11:07 Mel Gorman
  2016-02-03 11:28 ` Ingo Molnar
  2016-02-03 11:51 ` Srikar Dronamraju
  0 siblings, 2 replies; 7+ messages in thread
From: Mel Gorman @ 2016-02-03 11:07 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Matt Fleming, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

Changelog since v3
o Force enable stats during profiling and latencytop

Changelog since V2
o Print stats that are not related to schedstat
o Reintroduce a static inline for update_stats_dequeue

Changelog since V1
o Introduce schedstat_enabled and address Ingo's feedback
o More schedstat-only paths eliminated, particularly ttwu_stat

schedstats is very useful during debugging and performance tuning but it
incurs overhead. As such, even though it can be disabled at build time,
it is often enabled as the information is useful.  This patch adds a
kernel command-line and sysctl tunable to enable or disable schedstats on
demand. It is disabled by default as someone who knows they need it can
also learn to enable it when necessary.
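
As a rough illustration of the sysctl side, the sketch below writes 0 or 1 to a sysctl file from userspace. The path follows from the "sched_schedstats" procname this patch adds to kern_table; the helper name write_schedstats_toggle is hypothetical and not part of the patch. Writing the real file needs CAP_SYS_ADMIN and a kernel built with CONFIG_SCHEDSTATS.

```c
#include <stdio.h>

/* Path implied by the "sched_schedstats" entry this patch adds to kern_table. */
#define SCHEDSTATS_SYSCTL "/proc/sys/kernel/sched_schedstats"

/*
 * Hypothetical helper: write "0" or "1" to the file at path.
 * Any nonzero value of enable is written as 1.
 * Returns 0 on success, -1 on failure (missing file, no permission, ...).
 */
static int write_schedstats_toggle(const char *path, int enable)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%d\n", enable ? 1 : 0) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

A caller running as root would use write_schedstats_toggle(SCHEDSTATS_SYSCTL, 1) to turn the stats on at runtime; the boot-time equivalent is schedstats=enable on the kernel command line.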

The benefits are workload-dependent but when it gets down to it, the
difference will be whether cache misses are incurred updating the shared
stats or not. These measurements were taken from a 48-core 2-socket machine
with Xeon(R) E5-2670 v3 CPUs, although the patch was also tested on a
single-socket 8-core machine with an Intel i7-3770 processor.

netperf-tcp
                           4.5.0-rc1             4.5.0-rc1
                             vanilla          nostats-v3r1
Hmean    64         560.45 (  0.00%)      575.98 (  2.77%)
Hmean    128        766.66 (  0.00%)      795.79 (  3.80%)
Hmean    256        950.51 (  0.00%)      981.50 (  3.26%)
Hmean    1024      1433.25 (  0.00%)     1466.51 (  2.32%)
Hmean    2048      2810.54 (  0.00%)     2879.75 (  2.46%)
Hmean    3312      4618.18 (  0.00%)     4682.09 (  1.38%)
Hmean    4096      5306.42 (  0.00%)     5346.39 (  0.75%)
Hmean    8192     10581.44 (  0.00%)    10698.15 (  1.10%)
Hmean    16384    18857.70 (  0.00%)    18937.61 (  0.42%)

Small gains here. UDP_STREAM showed nothing interesting and neither did
the TCP_RR tests. The gains on the 8-core machine were very similar.
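
The bracketed percentages appear to be the relative change against the vanilla kernel for each row. A minimal sketch of that arithmetic for throughput-style metrics, where higher is better; the function name is illustrative and the exact formula used by the reporting tool is an assumption:

```c
/*
 * Relative gain in percent of the patched kernel over the baseline,
 * for metrics where a larger value is better (e.g. Hmean throughput).
 */
static double gain_percent(double baseline, double patched)
{
	return (patched - baseline) / baseline * 100.0;
}
```

Fed the Hmean 64 row above, gain_percent(560.45, 575.98) comes out at roughly 2.77, matching the table. For time-based metrics such as hackbench the sign is presumably inverted, since lower is better there.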

tbench4
                                 4.5.0-rc1             4.5.0-rc1
                                   vanilla          nostats-v3r1
Hmean    mb/sec-1         500.85 (  0.00%)      522.43 (  4.31%)
Hmean    mb/sec-2         984.66 (  0.00%)     1018.19 (  3.41%)
Hmean    mb/sec-4        1827.91 (  0.00%)     1847.78 (  1.09%)
Hmean    mb/sec-8        3561.36 (  0.00%)     3611.28 (  1.40%)
Hmean    mb/sec-16       5824.52 (  0.00%)     5929.03 (  1.79%)
Hmean    mb/sec-32      10943.10 (  0.00%)    10802.83 ( -1.28%)
Hmean    mb/sec-64      15950.81 (  0.00%)    16211.31 (  1.63%)
Hmean    mb/sec-128     15302.17 (  0.00%)    15445.11 (  0.93%)
Hmean    mb/sec-256     14866.18 (  0.00%)    15088.73 (  1.50%)
Hmean    mb/sec-512     15223.31 (  0.00%)    15373.69 (  0.99%)
Hmean    mb/sec-1024    14574.25 (  0.00%)    14598.02 (  0.16%)
Hmean    mb/sec-2048    13569.02 (  0.00%)    13733.86 (  1.21%)
Hmean    mb/sec-3072    12865.98 (  0.00%)    13209.23 (  2.67%)

Small gains of 2-4% at low thread counts and otherwise flat.  The
gains on the 8-core machine were slightly different:

tbench4 on 8-core i7-3770 single socket machine
Hmean    mb/sec-1        442.59 (  0.00%)      448.73 (  1.39%)
Hmean    mb/sec-2        796.68 (  0.00%)      794.39 ( -0.29%)
Hmean    mb/sec-4       1322.52 (  0.00%)     1343.66 (  1.60%)
Hmean    mb/sec-8       2611.65 (  0.00%)     2694.86 (  3.19%)
Hmean    mb/sec-16      2537.07 (  0.00%)     2609.34 (  2.85%)
Hmean    mb/sec-32      2506.02 (  0.00%)     2578.18 (  2.88%)
Hmean    mb/sec-64      2511.06 (  0.00%)     2569.16 (  2.31%)
Hmean    mb/sec-128     2313.38 (  0.00%)     2395.50 (  3.55%)
Hmean    mb/sec-256     2110.04 (  0.00%)     2177.45 (  3.19%)
Hmean    mb/sec-512     2072.51 (  0.00%)     2053.97 ( -0.89%)

In contrast, this shows a relatively steady 2-3% gain at higher thread
counts. Due to the nature of the patch and the type of workload, it's
not a surprise that the result will depend on the CPU used.

hackbench-pipes
                         4.5.0-rc1             4.5.0-rc1
                           vanilla          nostats-v3r1
Amean    1        0.0637 (  0.00%)      0.0660 ( -3.59%)
Amean    4        0.1229 (  0.00%)      0.1181 (  3.84%)
Amean    7        0.1921 (  0.00%)      0.1911 (  0.52%)
Amean    12       0.3117 (  0.00%)      0.2923 (  6.23%)
Amean    21       0.4050 (  0.00%)      0.3899 (  3.74%)
Amean    30       0.4586 (  0.00%)      0.4433 (  3.33%)
Amean    48       0.5910 (  0.00%)      0.5694 (  3.65%)
Amean    79       0.8663 (  0.00%)      0.8626 (  0.43%)
Amean    110      1.1543 (  0.00%)      1.1517 (  0.22%)
Amean    141      1.4457 (  0.00%)      1.4290 (  1.16%)
Amean    172      1.7090 (  0.00%)      1.6924 (  0.97%)
Amean    192      1.9126 (  0.00%)      1.9089 (  0.19%)

Some small gains and losses and, while the variance data is not included,
it's close to the noise. The UMA machine did not show anything particularly
different.

pipetest
                             4.5.0-rc1             4.5.0-rc1
                               vanilla          nostats-v2r2
Min         Time        4.13 (  0.00%)        3.99 (  3.39%)
1st-qrtle   Time        4.38 (  0.00%)        4.27 (  2.51%)
2nd-qrtle   Time        4.46 (  0.00%)        4.39 (  1.57%)
3rd-qrtle   Time        4.56 (  0.00%)        4.51 (  1.10%)
Max-90%     Time        4.67 (  0.00%)        4.60 (  1.50%)
Max-93%     Time        4.71 (  0.00%)        4.65 (  1.27%)
Max-95%     Time        4.74 (  0.00%)        4.71 (  0.63%)
Max-99%     Time        4.88 (  0.00%)        4.79 (  1.84%)
Max         Time        4.93 (  0.00%)        4.83 (  2.03%)
Mean        Time        4.48 (  0.00%)        4.39 (  1.91%)
Best99%Mean Time        4.47 (  0.00%)        4.39 (  1.91%)
Best95%Mean Time        4.46 (  0.00%)        4.38 (  1.93%)
Best90%Mean Time        4.45 (  0.00%)        4.36 (  1.98%)
Best50%Mean Time        4.36 (  0.00%)        4.25 (  2.49%)
Best10%Mean Time        4.23 (  0.00%)        4.10 (  3.13%)
Best5%Mean  Time        4.19 (  0.00%)        4.06 (  3.20%)
Best1%Mean  Time        4.13 (  0.00%)        4.00 (  3.39%)

Small improvement and similar gains were seen on the UMA machine.

The gain is small but it'll depend on the CPU and the workload whether
this patch makes a difference.  However, it stands to reason that doing
less work in the scheduler is a good thing. The downside is that the
lack of schedstats and tracepoints will be surprising to experts doing
performance analysis until they find the existence of the schedstats=
parameter or the kernel.sched_schedstats sysctl.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
---
 Documentation/kernel-parameters.txt |   5 ++
 Documentation/sysctl/kernel.txt     |   8 +++
 include/linux/latencytop.h          |   3 ++
 include/linux/sched.h               |   4 ++
 include/linux/sched/sysctl.h        |   4 ++
 kernel/latencytop.c                 |  14 ++++-
 kernel/profile.c                    |   1 +
 kernel/sched/core.c                 |  70 ++++++++++++++++++++++++-
 kernel/sched/debug.c                | 102 +++++++++++++++++++-----------------
 kernel/sched/fair.c                 |  92 +++++++++++++++++++-------------
 kernel/sched/sched.h                |   1 +
 kernel/sched/stats.h                |   8 +--
 kernel/sysctl.c                     |  13 ++++-
 13 files changed, 235 insertions(+), 90 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 87d40a72f6a1..846956abfe85 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3523,6 +3523,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 
 	sched_debug	[KNL] Enables verbose scheduler debug messages.
 
+	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
+			Allowed values are enable and disable. This feature
+			incurs a small amount of overhead in the scheduler
+			but is useful for debugging and performance tuning.
+
 	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
 			xtime_lock contention on larger systems, and/or RCU lock
 			contention on all systems with CONFIG_MAXSMP set.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index a93b414672a7..be7c3b720adf 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -760,6 +760,14 @@ rtsig-nr shows the number of RT signals currently queued.
 
 ==============================================================
 
+schedstats:
+
+Enables/disables scheduler statistics. Enabling this feature
+incurs a small amount of overhead in the scheduler but is
+useful for debugging and performance tuning.
+
+==============================================================
+
 sg-big-buff:
 
 This file shows the size of the generic SCSI (sg) buffer.
diff --git a/include/linux/latencytop.h b/include/linux/latencytop.h
index e23121f9d82a..59ccab297ae0 100644
--- a/include/linux/latencytop.h
+++ b/include/linux/latencytop.h
@@ -37,6 +37,9 @@ account_scheduler_latency(struct task_struct *task, int usecs, int inter)
 
 void clear_all_latency_tracing(struct task_struct *p);
 
+extern int sysctl_latencytop(struct ctl_table *table, int write,
+			void __user *buffer, size_t *lenp, loff_t *ppos);
+
 #else
 
 static inline void
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a94cc3..a292c4b7e94c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -920,6 +920,10 @@ static inline int sched_info_on(void)
 #endif
 }
 
+#ifdef CONFIG_SCHEDSTATS
+void force_schedstat_enabled(void);
+#endif
+
 enum cpu_idle_type {
 	CPU_IDLE,
 	CPU_NOT_IDLE,
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c9e4731cf10b..4f080ab4f2cd 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -95,4 +95,8 @@ extern int sysctl_numa_balancing(struct ctl_table *table, int write,
 				 void __user *buffer, size_t *lenp,
 				 loff_t *ppos);
 
+extern int sysctl_schedstats(struct ctl_table *table, int write,
+				 void __user *buffer, size_t *lenp,
+				 loff_t *ppos);
+
 #endif /* _SCHED_SYSCTL_H */
diff --git a/kernel/latencytop.c b/kernel/latencytop.c
index a02812743a7e..b5c30d9f46c5 100644
--- a/kernel/latencytop.c
+++ b/kernel/latencytop.c
@@ -47,12 +47,12 @@
  * of times)
  */
 
-#include <linux/latencytop.h>
 #include <linux/kallsyms.h>
 #include <linux/seq_file.h>
 #include <linux/notifier.h>
 #include <linux/spinlock.h>
 #include <linux/proc_fs.h>
+#include <linux/latencytop.h>
 #include <linux/export.h>
 #include <linux/sched.h>
 #include <linux/list.h>
@@ -289,4 +289,16 @@ static int __init init_lstats_procfs(void)
 	proc_create("latency_stats", 0644, NULL, &lstats_fops);
 	return 0;
 }
+
+int sysctl_latencytop(struct ctl_table *table, int write,
+			void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int err;
+
+	err = proc_dointvec(table, write, buffer, lenp, ppos);
+	if (latencytop_enabled)
+		force_schedstat_enabled();
+
+	return err;
+}
 device_initcall(init_lstats_procfs);
diff --git a/kernel/profile.c b/kernel/profile.c
index 99513e1160e5..51369697466e 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -59,6 +59,7 @@ int profile_setup(char *str)
 
 	if (!strncmp(str, sleepstr, strlen(sleepstr))) {
 #ifdef CONFIG_SCHEDSTATS
+		force_schedstat_enabled();
 		prof_on = SLEEP_PROFILING;
 		if (str[strlen(sleepstr)] == ',')
 			str += strlen(sleepstr) + 1;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9503d590e5ef..5a805a459d1a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2093,7 +2093,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 
 	ttwu_queue(p, cpu);
 stat:
-	ttwu_stat(p, cpu, wake_flags);
+	if (schedstat_enabled())
+		ttwu_stat(p, cpu, wake_flags);
 out:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
@@ -2141,7 +2142,8 @@ static void try_to_wake_up_local(struct task_struct *p)
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
 
 	ttwu_do_wakeup(rq, p, 0);
-	ttwu_stat(p, smp_processor_id(), 0);
+	if (schedstat_enabled())
+		ttwu_stat(p, smp_processor_id(), 0);
 out:
 	raw_spin_unlock(&p->pi_lock);
 }
@@ -2210,6 +2212,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #endif
 
 #ifdef CONFIG_SCHEDSTATS
+	/* Even if schedstat is disabled, there should not be garbage */
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
 
@@ -2281,6 +2284,69 @@ int sysctl_numa_balancing(struct ctl_table *table, int write,
 #endif
 #endif
 
+DEFINE_STATIC_KEY_FALSE(sched_schedstats);
+
+#ifdef CONFIG_SCHEDSTATS
+static void set_schedstats(bool enabled)
+{
+	if (enabled)
+		static_branch_enable(&sched_schedstats);
+	else
+		static_branch_disable(&sched_schedstats);
+}
+
+void force_schedstat_enabled(void)
+{
+	if (!schedstat_enabled()) {
+		pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n");
+		static_branch_enable(&sched_schedstats);
+	}
+}
+
+static int __init setup_schedstats(char *str)
+{
+	int ret = 0;
+	if (!str)
+		goto out;
+
+	if (!strcmp(str, "enable")) {
+		set_schedstats(true);
+		ret = 1;
+	} else if (!strcmp(str, "disable")) {
+		set_schedstats(false);
+		ret = 1;
+	}
+out:
+	if (!ret)
+		pr_warn("Unable to parse schedstats=\n");
+
+	return ret;
+}
+__setup("schedstats=", setup_schedstats);
+
+#ifdef CONFIG_PROC_SYSCTL
+int sysctl_schedstats(struct ctl_table *table, int write,
+			 void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = static_branch_likely(&sched_schedstats);
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0)
+		return err;
+	if (write)
+		set_schedstats(state);
+	return err;
+}
+#endif
+#endif
+
 /*
  * fork()/clone()-time setup:
  */
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 641511771ae6..7cfa87bd8b89 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -75,16 +75,18 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	PN(se->vruntime);
 	PN(se->sum_exec_runtime);
 #ifdef CONFIG_SCHEDSTATS
-	PN(se->statistics.wait_start);
-	PN(se->statistics.sleep_start);
-	PN(se->statistics.block_start);
-	PN(se->statistics.sleep_max);
-	PN(se->statistics.block_max);
-	PN(se->statistics.exec_max);
-	PN(se->statistics.slice_max);
-	PN(se->statistics.wait_max);
-	PN(se->statistics.wait_sum);
-	P(se->statistics.wait_count);
+	if (schedstat_enabled()) {
+		PN(se->statistics.wait_start);
+		PN(se->statistics.sleep_start);
+		PN(se->statistics.block_start);
+		PN(se->statistics.sleep_max);
+		PN(se->statistics.block_max);
+		PN(se->statistics.exec_max);
+		PN(se->statistics.slice_max);
+		PN(se->statistics.wait_max);
+		PN(se->statistics.wait_sum);
+		P(se->statistics.wait_count);
+	}
 #endif
 	P(se->load.weight);
 #ifdef CONFIG_SMP
@@ -122,10 +124,12 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 		(long long)(p->nvcsw + p->nivcsw),
 		p->prio);
 #ifdef CONFIG_SCHEDSTATS
-	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
-		SPLIT_NS(p->se.statistics.wait_sum),
-		SPLIT_NS(p->se.sum_exec_runtime),
-		SPLIT_NS(p->se.statistics.sum_sleep_runtime));
+	if (schedstat_enabled()) {
+		SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
+			SPLIT_NS(p->se.statistics.wait_sum),
+			SPLIT_NS(p->se.sum_exec_runtime),
+			SPLIT_NS(p->se.statistics.sum_sleep_runtime));
+	}
 #else
 	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
 		0LL, 0L,
@@ -313,17 +317,18 @@ do {									\
 #define P(n) SEQ_printf(m, "  .%-30s: %d\n", #n, rq->n);
 #define P64(n) SEQ_printf(m, "  .%-30s: %Ld\n", #n, rq->n);
 
-	P(yld_count);
-
-	P(sched_count);
-	P(sched_goidle);
 #ifdef CONFIG_SMP
 	P64(avg_idle);
 	P64(max_idle_balance_cost);
 #endif
 
-	P(ttwu_count);
-	P(ttwu_local);
+	if (schedstat_enabled()) {
+		P(yld_count);
+		P(sched_count);
+		P(sched_goidle);
+		P(ttwu_count);
+		P(ttwu_local);
+	}
 
 #undef P
 #undef P64
@@ -569,38 +574,39 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	nr_switches = p->nvcsw + p->nivcsw;
 
 #ifdef CONFIG_SCHEDSTATS
-	PN(se.statistics.sum_sleep_runtime);
-	PN(se.statistics.wait_start);
-	PN(se.statistics.sleep_start);
-	PN(se.statistics.block_start);
-	PN(se.statistics.sleep_max);
-	PN(se.statistics.block_max);
-	PN(se.statistics.exec_max);
-	PN(se.statistics.slice_max);
-	PN(se.statistics.wait_max);
-	PN(se.statistics.wait_sum);
-	P(se.statistics.wait_count);
-	PN(se.statistics.iowait_sum);
-	P(se.statistics.iowait_count);
 	P(se.nr_migrations);
-	P(se.statistics.nr_migrations_cold);
-	P(se.statistics.nr_failed_migrations_affine);
-	P(se.statistics.nr_failed_migrations_running);
-	P(se.statistics.nr_failed_migrations_hot);
-	P(se.statistics.nr_forced_migrations);
-	P(se.statistics.nr_wakeups);
-	P(se.statistics.nr_wakeups_sync);
-	P(se.statistics.nr_wakeups_migrate);
-	P(se.statistics.nr_wakeups_local);
-	P(se.statistics.nr_wakeups_remote);
-	P(se.statistics.nr_wakeups_affine);
-	P(se.statistics.nr_wakeups_affine_attempts);
-	P(se.statistics.nr_wakeups_passive);
-	P(se.statistics.nr_wakeups_idle);
 
-	{
+	if (schedstat_enabled()) {
 		u64 avg_atom, avg_per_cpu;
 
+		PN(se.statistics.sum_sleep_runtime);
+		PN(se.statistics.wait_start);
+		PN(se.statistics.sleep_start);
+		PN(se.statistics.block_start);
+		PN(se.statistics.sleep_max);
+		PN(se.statistics.block_max);
+		PN(se.statistics.exec_max);
+		PN(se.statistics.slice_max);
+		PN(se.statistics.wait_max);
+		PN(se.statistics.wait_sum);
+		P(se.statistics.wait_count);
+		PN(se.statistics.iowait_sum);
+		P(se.statistics.iowait_count);
+		P(se.statistics.nr_migrations_cold);
+		P(se.statistics.nr_failed_migrations_affine);
+		P(se.statistics.nr_failed_migrations_running);
+		P(se.statistics.nr_failed_migrations_hot);
+		P(se.statistics.nr_forced_migrations);
+		P(se.statistics.nr_wakeups);
+		P(se.statistics.nr_wakeups_sync);
+		P(se.statistics.nr_wakeups_migrate);
+		P(se.statistics.nr_wakeups_local);
+		P(se.statistics.nr_wakeups_remote);
+		P(se.statistics.nr_wakeups_affine);
+		P(se.statistics.nr_wakeups_affine_attempts);
+		P(se.statistics.nr_wakeups_passive);
+		P(se.statistics.nr_wakeups_idle);
+
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
 			avg_atom = div64_ul(avg_atom, nr_switches);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56b7d4b83947..824f8d7058b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -20,8 +20,8 @@
  *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
  */
 
-#include <linux/latencytop.h>
 #include <linux/sched.h>
+#include <linux/latencytop.h>
 #include <linux/cpumask.h>
 #include <linux/cpuidle.h>
 #include <linux/slab.h>
@@ -755,7 +755,9 @@ static void
 update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct task_struct *p;
-	u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
+	u64 delta;
+
+	delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
 
 	if (entity_is_task(se)) {
 		p = task_of(se);
@@ -776,22 +778,12 @@ update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	se->statistics.wait_sum += delta;
 	se->statistics.wait_start = 0;
 }
-#else
-static inline void
-update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-}
-
-static inline void
-update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-}
-#endif
 
 /*
  * Task is being enqueued - update stats:
  */
-static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static inline void
+update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	/*
 	 * Are we enqueueing a waiting task? (for current tasks
@@ -802,7 +794,7 @@ static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 static inline void
-update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	/*
 	 * Mark the end of the wait period if dequeueing a
@@ -810,7 +802,40 @@ update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	 */
 	if (se != cfs_rq->curr)
 		update_stats_wait_end(cfs_rq, se);
+
+	if (flags & DEQUEUE_SLEEP) {
+		if (entity_is_task(se)) {
+			struct task_struct *tsk = task_of(se);
+
+			if (tsk->state & TASK_INTERRUPTIBLE)
+				se->statistics.sleep_start = rq_clock(rq_of(cfs_rq));
+			if (tsk->state & TASK_UNINTERRUPTIBLE)
+				se->statistics.block_start = rq_clock(rq_of(cfs_rq));
+		}
+	}
+
+}
+#else
+static inline void
+update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+
+static inline void
+update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
 }
+#endif
 
 /*
  * We are picking a new current task - update its stats:
@@ -3122,11 +3147,14 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);
-		enqueue_sleeper(cfs_rq, se);
+		if (schedstat_enabled())
+			enqueue_sleeper(cfs_rq, se);
 	}
 
-	update_stats_enqueue(cfs_rq, se);
-	check_spread(cfs_rq, se);
+	if (schedstat_enabled()) {
+		update_stats_enqueue(cfs_rq, se);
+		check_spread(cfs_rq, se);
+	}
 	if (se != cfs_rq->curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
@@ -3193,19 +3221,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	update_curr(cfs_rq);
 	dequeue_entity_load_avg(cfs_rq, se);
 
-	update_stats_dequeue(cfs_rq, se);
-	if (flags & DEQUEUE_SLEEP) {
-#ifdef CONFIG_SCHEDSTATS
-		if (entity_is_task(se)) {
-			struct task_struct *tsk = task_of(se);
-
-			if (tsk->state & TASK_INTERRUPTIBLE)
-				se->statistics.sleep_start = rq_clock(rq_of(cfs_rq));
-			if (tsk->state & TASK_UNINTERRUPTIBLE)
-				se->statistics.block_start = rq_clock(rq_of(cfs_rq));
-		}
-#endif
-	}
+	if (schedstat_enabled())
+		update_stats_dequeue(cfs_rq, se, flags);
 
 	clear_buddies(cfs_rq, se);
 
@@ -3279,7 +3296,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 * a CPU. So account for the time it spent waiting on the
 		 * runqueue.
 		 */
-		update_stats_wait_end(cfs_rq, se);
+		if (schedstat_enabled())
+			update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
 		update_load_avg(se, 1);
 	}
@@ -3292,7 +3310,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	 * least twice that of our own weight (i.e. dont track it
 	 * when there are only lesser-weight tasks around):
 	 */
-	if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
+	if (schedstat_enabled() && rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
 		se->statistics.slice_max = max(se->statistics.slice_max,
 			se->sum_exec_runtime - se->prev_sum_exec_runtime);
 	}
@@ -3375,9 +3393,13 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 	/* throttle cfs_rqs exceeding runtime */
 	check_cfs_rq_runtime(cfs_rq);
 
-	check_spread(cfs_rq, prev);
+	if (schedstat_enabled()) {
+		check_spread(cfs_rq, prev);
+		if (prev->on_rq)
+			update_stats_wait_start(cfs_rq, prev);
+	}
+
 	if (prev->on_rq) {
-		update_stats_wait_start(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10f16374df7f..1d583870e1a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1022,6 +1022,7 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
 #endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
 
 extern struct static_key_false sched_numa_balancing;
+extern struct static_key_false sched_schedstats;
 
 static inline u64 global_rt_period(void)
 {
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index b0fbc7632de5..70b3b6a20fb0 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -29,9 +29,10 @@ rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
 	if (rq)
 		rq->rq_sched_info.run_delay += delta;
 }
-# define schedstat_inc(rq, field)	do { (rq)->field++; } while (0)
-# define schedstat_add(rq, field, amt)	do { (rq)->field += (amt); } while (0)
-# define schedstat_set(var, val)	do { var = (val); } while (0)
+# define schedstat_enabled()		static_branch_unlikely(&sched_schedstats)
+# define schedstat_inc(rq, field)	do { if (schedstat_enabled()) { (rq)->field++; } } while (0)
+# define schedstat_add(rq, field, amt)	do { if (schedstat_enabled()) { (rq)->field += (amt); } } while (0)
+# define schedstat_set(var, val)	do { if (schedstat_enabled()) { var = (val); } } while (0)
 #else /* !CONFIG_SCHEDSTATS */
 static inline void
 rq_sched_info_arrive(struct rq *rq, unsigned long long delta)
@@ -42,6 +43,7 @@ rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
 static inline void
 rq_sched_info_depart(struct rq *rq, unsigned long long delta)
 {}
+# define schedstat_enabled()		0
 # define schedstat_inc(rq, field)	do { } while (0)
 # define schedstat_add(rq, field, amt)	do { } while (0)
 # define schedstat_set(var, val)	do { } while (0)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 97715fd9e790..f5102fabef7f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -350,6 +350,17 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_SCHEDSTATS
+	{
+		.procname	= "sched_schedstats",
+		.data		= NULL,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_schedstats,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif /* CONFIG_SCHEDSTATS */
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_NUMA_BALANCING
 	{
@@ -505,7 +516,7 @@ static struct ctl_table kern_table[] = {
 		.data		= &latencytop_enabled,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= sysctl_latencytop,
 	},
 #endif
 #ifdef CONFIG_BLK_DEV_INITRD
-- 
2.6.4

* Re: [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4
  2016-02-03 11:07 [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4 Mel Gorman
@ 2016-02-03 11:28 ` Ingo Molnar
  2016-02-03 11:39   ` Mel Gorman
  2016-02-03 11:51 ` Srikar Dronamraju
  1 sibling, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2016-02-03 11:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Matt Fleming, Mike Galbraith, Srikar Dronamraju, LKML


* Mel Gorman <mgorman@techsingularity.net> wrote:

> Changelog since v3
> o Force enable stats during profiling and latencytop
> 
> Changelog since V2
> o Print stats that are not related to schedstat
> o Reintroduce a static inline for update_stats_dequeue
> 
> Changelog since V1
> o Introduce schedstat_enabled and address Ingo's feedback
> o More schedstat-only paths eliminated, particularly ttwu_stat
> 
> schedstats is very useful during debugging and performance tuning but it
> incurs overhead. As such, even though it can be disabled at build time,
> it is often enabled as the information is useful.  This patch adds a
> kernel command-line and sysctl tunable to enable or disable schedstats on
> demand. It is disabled by default as someone who knows they need it can
> also learn to enable it when necessary.
> 
> The benefits are workload-dependent but when it gets down to it, the
> difference will be whether cache misses are incurred updating the shared
> stats or not. [...]

Hm, which shared stats are those? I think we should really fix those as well: 
those shared stats should be percpu collected as well, with no extra cache misses 
in any scheduler fast path.

Thanks,

	Ingo

* Re: [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4
  2016-02-03 11:28 ` Ingo Molnar
@ 2016-02-03 11:39   ` Mel Gorman
  2016-02-03 12:49     ` Ingo Molnar
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2016-02-03 11:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Matt Fleming, Mike Galbraith, Srikar Dronamraju, LKML

On Wed, Feb 03, 2016 at 12:28:49PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > Changelog since v3
> > o Force enable stats during profiling and latencytop
> > 
> > Changelog since V2
> > o Print stats that are not related to schedstat
> > o Reintroduce a static inline for update_stats_dequeue
> > 
> > Changelog since V1
> > o Introduce schedstat_enabled and address Ingo's feedback
> > o More schedstat-only paths eliminated, particularly ttwu_stat
> > 
> > schedstats is very useful during debugging and performance tuning but it
> > incurs overhead. As such, even though it can be disabled at build time,
> > it is often enabled as the information is useful.  This patch adds a
> > kernel command-line and sysctl tunable to enable or disable schedstats on
> > demand. It is disabled by default as someone who knows they need it can
> > also learn to enable it when necessary.
> > 
> > The benefits are workload-dependent but when it gets down to it, the
> > difference will be whether cache misses are incurred updating the shared
> > stats or not. [...]
> 
> Hm, which shared stats are those?

Extremely poor phrasing on my part. The stats share a cache line and the
impact partially depends on whether unrelated stats share a cache line or
not during updates.
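
To make the cache-line point concrete, here is a small userspace sketch, not kernel code. It reports which 64-byte line each field of a struct falls on. The struct demo_stats, its field layout and the 64-byte line size are all assumptions for illustration; the real sched_statistics layout is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64	/* assumed line size, typical for x86 */

/* Hypothetical stand-in for a block of counters, not the kernel's struct. */
struct demo_stats {
	uint64_t wait_sum;	/* offset 0 */
	uint64_t wait_count;	/* offset 8 */
	uint64_t pad[6];	/* offsets 16..63, fills the first line */
	uint64_t nr_wakeups;	/* offset 64, starts the second line */
};

/* Index of the cache line holding a field, given its offset from a line-aligned base. */
static size_t cache_line_of(size_t offset)
{
	return offset / CACHE_LINE_BYTES;
}

/* Nonzero if two fields land on the same line, so a write to one dirties the line holding the other. */
static int shares_line(size_t off_a, size_t off_b)
{
	return cache_line_of(off_a) == cache_line_of(off_b);
}
```

In this made-up layout wait_sum and wait_count share a line while nr_wakeups sits on the next one, so an update to nr_wakeups does not disturb a CPU that only reads wait_sum.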

> I think we should really fix those as well: 
> those shared stats should be percpu collected as well, with no extra cache misses 
> in any scheduler fast path.
> 

I looked into that but converting those stats to per-cpu counters would
incur sizable memory overhead. There are a *lot* of them and the basic
structure for the generic percpu-counter is

struct percpu_counter {
        raw_spinlock_t lock;
        s64 count;
#ifdef CONFIG_HOTPLUG_CPU
        struct list_head list;  /* All percpu_counters are on a list */
#endif
        s32 __percpu *counters;
};

That's not taking into account the associated runtime overhead, such as
synchronising them. Granted, some specialised implementation could be done
for the scheduler but it would be massive overkill and maintenance overhead
for stats that most users do not even want.
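
A back-of-the-envelope sketch of that memory argument follows. The count of 27 comes from the se.statistics fields printed in the debug.c hunk of this patch, and the 48 CPUs match the test machine as described. The 40-byte fixed part of struct percpu_counter is an assumed size that varies with kernel configuration and architecture, the stats are assumed to be u64 fields, and per-cpu allocator overhead is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes for n_stats plain u64 counters embedded directly in a structure. */
static size_t plain_stats_bytes(size_t n_stats)
{
	return n_stats * sizeof(uint64_t);
}

/*
 * Bytes for n_stats counters if each became a generic percpu_counter:
 * a fixed part (lock, count, list, pointer; size supplied by the caller)
 * plus one s32 slot per CPU.
 */
static size_t percpu_stats_bytes(size_t n_stats, size_t n_cpus, size_t fixed_bytes)
{
	return n_stats * (fixed_bytes + n_cpus * sizeof(int32_t));
}
```

With those assumed inputs, plain_stats_bytes(27) is 216 bytes, while percpu_stats_bytes(27, 48, 40) is 6264 bytes, and that cost would apply to every scheduling entity carrying the stats.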

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4
  2016-02-03 11:07 [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4 Mel Gorman
  2016-02-03 11:28 ` Ingo Molnar
@ 2016-02-03 11:51 ` Srikar Dronamraju
  1 sibling, 0 replies; 7+ messages in thread
From: Srikar Dronamraju @ 2016-02-03 11:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Matt Fleming, Mike Galbraith, LKML

> Changelog since v3
> o Force enable stats during profiling and latencytop
> 
> Changelog since V2
> o Print stats that are not related to schedstat
> o Reintroduce a static inline for update_stats_dequeue
> 
> Changelog since V1
> o Introduce schedstat_enabled and address Ingo's feedback
> o More schedstat-only paths eliminated, particularly ttwu_stat
> 
> schedstats is very useful during debugging and performance tuning but it
> incurs overhead. As such, even though it can be disabled at build time,
> it is often enabled as the information is useful.  This patch adds a
> kernel command-line and sysctl tunable to enable or disable schedstats on
> demand. It is disabled by default as someone who knows they need it can
> also learn to enable it when necessary.
> 
> The benefits are workload-dependent but when it gets down to it, the
> difference will be whether cache misses are incurred updating the shared
> stats or not. These measurements were taken from a 48-core 2-socket machine
> with Xeon(R) E5-2670 v3 cpus although they were also tested on a single
> socket machine 8-core machine with Intel i7-3770 processors.
> 
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>


-- 
Thanks and Regards
Srikar Dronamraju

* Re: [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4
  2016-02-03 11:39   ` Mel Gorman
@ 2016-02-03 12:49     ` Ingo Molnar
  2016-02-03 13:32       ` Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2016-02-03 12:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Matt Fleming, Mike Galbraith, Srikar Dronamraju, LKML


* Mel Gorman <mgorman@techsingularity.net> wrote:

> On Wed, Feb 03, 2016 at 12:28:49PM +0100, Ingo Molnar wrote:
> > 
> > * Mel Gorman <mgorman@techsingularity.net> wrote:
> > 
> > > Changelog since v3
> > > o Force enable stats during profiling and latencytop
> > > 
> > > Changelog since V2
> > > o Print stats that are not related to schedstat
> > > o Reintroduce a static inline for update_stats_dequeue
> > > 
> > > Changelog since V1
> > > o Introduce schedstat_enabled and address Ingo's feedback
> > > o More schedstat-only paths eliminated, particularly ttwu_stat
> > > 
> > > schedstats is very useful during debugging and performance tuning but it
> > > incurs overhead. As such, even though it can be disabled at build time,
> > > it is often enabled as the information is useful.  This patch adds a
> > > kernel command-line and sysctl tunable to enable or disable schedstats on
> > > demand. It is disabled by default as someone who knows they need it can
> > > also learn to enable it when necessary.
> > > 
> > > The benefits are workload-dependent but when it gets down to it, the
> > > difference will be whether cache misses are incurred updating the shared
> > > stats or not. [...]
> > 
> > Hm, which shared stats are those?
> 
> Extremely poor phrasing on my part. The stats share a cache line and the impact 
> partially depends on whether unrelated stats share a cache line or not during 
> updates.

Yes, but the question is, are there true cross-CPU cache-misses? I.e. are there 
any 'global' (or per node) counters that we keep touching and which keep 
generating cache-misses?

> > I think we should really fix those as well: those shared stats should be 
> > percpu collected as well, with no extra cache misses in any scheduler fast 
> > path.
> 
> I looked into that but converting those stats to per-cpu counters would incur 
> sizable memory overhead. There are a *lot* of them and the basic structure for 
> the generic percpu-counter is
> 
> struct percpu_counter {
>         raw_spinlock_t lock;
>         s64 count;
> #ifdef CONFIG_HOTPLUG_CPU
>         struct list_head list;  /* All percpu_counters are on a list */
> #endif
>         s32 __percpu *counters;
> };

We don't have to reuse percpu_counter().

> That's not taking into account the associated runtime overhead such as synchronising them.

Why do we have to synchronize them in the kernel? User-space can recover them on a 
percpu basis and add them up if it wishes to. We can update the schedstat utility 
to handle the more spread out fields as well.

Thanks,

	Ingo

* Re: [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4
  2016-02-03 12:49     ` Ingo Molnar
@ 2016-02-03 13:32       ` Mel Gorman
  2016-02-03 14:56         ` Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2016-02-03 13:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Matt Fleming, Mike Galbraith, Srikar Dronamraju, LKML

On Wed, Feb 03, 2016 at 01:49:21PM +0100, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > On Wed, Feb 03, 2016 at 12:28:49PM +0100, Ingo Molnar wrote:
> > > 
> > > * Mel Gorman <mgorman@techsingularity.net> wrote:
> > > 
> > > > Changelog since v3
> > > > o Force enable stats during profiling and latencytop
> > > > 
> > > > Changelog since V2
> > > > o Print stats that are not related to schedstat
> > > > o Reintroduce a static inline for update_stats_dequeue
> > > > 
> > > > Changelog since V1
> > > > o Introduce schedstat_enabled and address Ingo's feedback
> > > > o More schedstat-only paths eliminated, particularly ttwu_stat
> > > > 
> > > > schedstats is very useful during debugging and performance tuning but it
> > > > incurs overhead. As such, even though it can be disabled at build time,
> > > > it is often enabled as the information is useful.  This patch adds a
> > > > kernel command-line and sysctl tunable to enable or disable schedstats on
> > > > demand. It is disabled by default as someone who knows they need it can
> > > > also learn to enable it when necessary.
> > > > 
> > > > The benefits are workload-dependent but when it gets down to it, the
> > > > difference will be whether cache misses are incurred updating the shared
> > > > stats or not. [...]
> > > 
> > > Hm, which shared stats are those?
> > 
> > Extremely poor phrasing on my part. The stats share a cache line and the impact 
> > partially depends on whether unrelated stats share a cache line or not during 
> > updates.
> 
> Yes, but the question is, are there true cross-CPU cache-misses? I.e. are there 
> any 'global' (or per node) counters that we keep touching and which keep 
> generating cache-misses?
> 

I haven't specifically identified them as I consider the calculations for
some of them to be expensive in their own right even without accounting for
cache misses. Moving to per-cpu counters would not eliminate all cache misses
as a stat updated on one CPU for a task that is woken on a separate CPU is
still going to trigger a cache miss. Even if such counters were identified
and moved to separate cache lines, the calculation overhead would remain.

> > > I think we should really fix those as well: those shared stats should be 
> > > percpu collected as well, with no extra cache misses in any scheduler fast 
> > > path.
> > 
> > I looked into that but converting those stats to per-cpu counters would incur 
> > sizable memory overhead. There are a *lot* of them and the basic structure for 
> > the generic percpu-counter is
> > 
> > struct percpu_counter {
> >         raw_spinlock_t lock;
> >         s64 count;
> > #ifdef CONFIG_HOTPLUG_CPU
> >         struct list_head list;  /* All percpu_counters are on a list */
> > #endif
> >         s32 __percpu *counters;
> > };
> 
> We don't have to reuse percpu_counter().
> 

No, but rolling a specialised solution for a debugging feature is overkill
and the calculation overhead would remain. It's specialised code with very
little upside.

The main gain from the patch is that the calculation overhead is
avoided. Avoiding any potential cache miss is a bonus.

> > That's not taking into account the associated runtime overhead such as synchronising them.
> 
> Why do we have to synchronize them in the kernel?

Because some simply require it or are not suitable for moving to per-cpu
counters at all. sleep_start is an obvious one as it can wake on another
CPU.

> User-space can recover them on a
> percpu basis and add them up if it wishes to. We can update the schedstat utility 
> to handle the more spread out fields as well.
> 

Any user of /proc/pid/sched, including latencytop, would also need updating.
All of them would need to handle CPU hotplug or else deal with the output
from all possible CPUs instead of only the currently online ones.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4
  2016-02-03 13:32       ` Mel Gorman
@ 2016-02-03 14:56         ` Mel Gorman
  0 siblings, 0 replies; 7+ messages in thread
From: Mel Gorman @ 2016-02-03 14:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Matt Fleming, Mike Galbraith, Srikar Dronamraju, LKML

On Wed, Feb 03, 2016 at 01:32:46PM +0000, Mel Gorman wrote:
> > Yes, but the question is, are there true cross-CPU cache-misses? I.e. are there 
> > any 'global' (or per node) counters that we keep touching and which keep 
> > generating cache-misses?
> > 
> 
> I haven't specifically identified them as I consider the calculations for
> some of them to be expensive in their own right even without accounting for
> cache misses. Moving to per-cpu counters would not eliminate all cache misses
> as a stat updated on one CPU for a task that is woken on a separate CPU is
> still going to trigger a cache miss. Even if such counters were identified
> and moved to separate cache lines, the calculation overhead would remain.
> 

I looked closer with perf stat to see if there was a good case for reducing
cache misses using per-cpu counters.

Workload was hackbench with pipes and twice as many processes as there
are CPUs to generate a reasonable amount of scheduler activity.

Kernel 4.5-rc2 vanilla
 Performance counter stats for './hackbench -pipe 96 process 1000' (5 runs):

      54355.194747      task-clock (msec)         #   35.825 CPUs utilized            ( +-  0.72% )  (100.00%)
         6,654,707      context-switches          #    0.122 M/sec                    ( +-  1.56% )  (100.00%)
           376,624      cpu-migrations            #    0.007 M/sec                    ( +-  3.43% )  (100.00%)
           128,533      page-faults               #    0.002 M/sec                    ( +-  1.80% )  (100.00%)
   111,173,775,559      cycles                    #    2.045 GHz                      ( +-  0.76% )  (52.55%)
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    87,243,428,243      instructions              #    0.78  insns per cycle          ( +-  0.38% )  (63.74%)
    17,067,078,003      branches                  #  313.992 M/sec                    ( +-  0.39% )  (61.79%)
        65,864,607      branch-misses             #    0.39% of all branches          ( +-  2.10% )  (61.51%)
    26,873,984,605      L1-dcache-loads           #  494.414 M/sec                    ( +-  0.45% )  (33.08%)
     1,531,628,468      L1-dcache-load-misses     #    5.70% of all L1-dcache hits    ( +-  1.14% )  (31.65%)
       410,990,209      LLC-loads                 #    7.561 M/sec                    ( +-  1.08% )  (31.38%)
        38,279,473      LLC-load-misses           #    9.31% of all LL-cache hits     ( +-  6.82% )  (42.35%)

       1.517251315 seconds time elapsed                                          ( +-  1.55% )

Note that the actual cache miss ratio is quite low and indicates that
there is potentially little to gain from using per-cpu counters.

Kernel 4.5-rc2 plus patch that disables schedstats by default

 Performance counter stats for './hackbench -pipe 96 process 1000' (5 runs):

      51904.139186      task-clock (msec)         #   35.322 CPUs utilized            ( +-  2.07% )  (100.00%)
         5,958,009      context-switches          #    0.115 M/sec                    ( +-  5.90% )  (100.00%)
           327,235      cpu-migrations            #    0.006 M/sec                    ( +-  8.24% )  (100.00%)
           130,063      page-faults               #    0.003 M/sec                    ( +-  1.10% )  (100.00%)
   104,926,877,727      cycles                    #    2.022 GHz                      ( +-  2.12% )  (52.08%)
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    83,768,167,895      instructions              #    0.80  insns per cycle          ( +-  1.25% )  (63.49%)
    16,379,438,730      branches                  #  315.571 M/sec                    ( +-  1.47% )  (61.99%)
        59,841,332      branch-misses             #    0.37% of all branches          ( +-  4.60% )  (61.68%)
    25,749,569,276      L1-dcache-loads           #  496.099 M/sec                    ( +-  1.37% )  (34.08%)
     1,385,090,233      L1-dcache-load-misses     #    5.38% of all L1-dcache hits    ( +-  3.40% )  (31.88%)
       358,531,172      LLC-loads                 #    6.908 M/sec                    ( +-  4.65% )  (31.04%)
        33,476,691      LLC-load-misses           #    9.34% of all LL-cache hits     ( +-  4.95% )  (41.71%)

       1.469447783 seconds time elapsed                                          ( +-  2.23% )

Now, note that there is a reduction in cache misses but it's not a major
percentage and the miss ratio only drops slightly in comparison to
having stats enabled.

A perf report shows there is a drop in cache references in
functions like ttwu_stat and [en|de]queue_entity but it's a small
percentage overall. The same is true for the cycle count. The overall
percentage is small but the patch eliminates them.

Based on the low level of cache misses, I see no value to using per-cpu
counters as an alternative.

-- 
Mel Gorman
SUSE Labs

Thread overview: 7+ messages
2016-02-03 11:07 [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4 Mel Gorman
2016-02-03 11:28 ` Ingo Molnar
2016-02-03 11:39   ` Mel Gorman
2016-02-03 12:49     ` Ingo Molnar
2016-02-03 13:32       ` Mel Gorman
2016-02-03 14:56         ` Mel Gorman
2016-02-03 11:51 ` Srikar Dronamraju