linux-kernel.vger.kernel.org archive mirror
* [RFC] smt-aware rt load balancer
@ 2014-09-04 12:20 Roman Gushchin
  2014-09-04 15:40 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Roman Gushchin @ 2014-09-04 12:20 UTC (permalink / raw)
  To: peterz, mingo, Kirill Tkhai, LKML; +Cc: Stanislav Fomichev

Hello!

We had an earlier discussion about using real-time policies on modern CPUs
for cpu-bound tasks with "near real-time" execution time expectations (like front-end servers):
https://lkml.org/lkml/2014/4/24/602 .

I said that I had a prototype of a real-time load balancer (called smart) that performs well in this case.

Now it's ready to be published.

The patch set on top of the 3.10.x branch can be found here:
https://github.com/yandex/smart .

It's stable.
We have been using it in production for a couple of months on more than
a thousand machines.
We see a noticeable performance increase for many projects with different
load patterns (up to 10-15% in both RPS and latency).

Any feedback, comments, questions are welcome!

Regards,
Roman


* Re: [RFC] smt-aware rt load balancer
  2014-09-04 12:20 [RFC] smt-aware rt load balancer Roman Gushchin
@ 2014-09-04 15:40 ` Peter Zijlstra
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
  2014-09-04 16:49 ` [RFC] smt-aware rt load balancer Peter Zijlstra
  2 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2014-09-04 15:40 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: mingo, Kirill Tkhai, LKML, Stanislav Fomichev

On Thu, Sep 04, 2014 at 04:20:06PM +0400, Roman Gushchin wrote:
> Hello!
> 
> We had an earlier discussion about using real-time policies on modern CPUs
> for cpu-bound tasks with "near real-time" execution time expectations (like front-end servers):
> https://lkml.org/lkml/2014/4/24/602 .
> 
> I said that I had a prototype of a real-time load balancer (called smart) that performs well in this case.
> 
> Now it's ready to be published.
> 
> The patch set on top of the 3.10.x branch can be found here:
> https://github.com/yandex/smart .
> 
> It's stable.
> We have been using it in production for a couple of months on more than
> a thousand machines.
> We see a noticeable performance increase for many projects with different
> load patterns (up to 10-15% in both RPS and latency).
> 
> Any feedback, comments, questions are welcome!

-ENOPATCH


* [PATCH 01/19] smart: define and build per-core data structures
  2014-09-04 12:20 [RFC] smt-aware rt load balancer Roman Gushchin
  2014-09-04 15:40 ` Peter Zijlstra
@ 2014-09-04 16:30 ` klamm
  2014-09-04 16:30   ` [PATCH 02/19] smart: add config option for SMART-related code klamm
                     ` (17 more replies)
  2014-09-04 16:49 ` [RFC] smt-aware rt load balancer Peter Zijlstra
  2 siblings, 18 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:30 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

First, this patch introduces the smart_core_data structure.
This structure contains the following fields:
cpu_core_id - per-cpu core id (first SMT thread on this core)
core_next - id of next core on local node
core_node_sibling - id of corresponding core on next node
core_locked - per-core lock used for synchronizing core selection

The following macros/functions are introduced to access smart data:
cpu_core_id(cpu) - returns core id of CPU
smart_data(cpu) - returns per-core smart_data (macro)
next_core(cpu) - returns the id of the next core on CPU's local node
core_node_sibling(cpu) - returns id of sibling core on next node

Also, this patch introduces the build_smart_topology() function,
which fills smart_core_data for each cpu.
Below is an illustration of how it should look on a 2-node system
with 8 physical cores and 16 SMT threads.

cpu    cpu_core_id
0,8              0
1,9              1
2,10             2
3,11             3
4,12             4
5,13             5
6,14             6
7,15             7

           node 0                              node 1
------------------------------      ------------------------------
core         0  core         1      core         4  core         5
core_next    1  core_next    2      core_next    5  core_next    6
node_sibling 4  node_sibling 5      node_sibling 0  node_sibling 1

core         2  core         3      core         6  core         7
core_next    3  core_next    0      core_next    7  core_next    4
node_sibling 6  node_sibling 7      node_sibling 2  node_sibling 3
------------------------------      ------------------------------

build_smart_topology() uses sched_domains data and is called
each time the sched domains are rebuilt. If the smart topology is built
successfully (checked by check_smart_data()), the
__smart_initialized static key is set to true.
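
For orientation, a minimal sketch (not part of the patch) of how the
per-node core ring built here is meant to be walked, using only the
accessors introduced above; visit_local_cores() is a hypothetical helper:

static void visit_local_cores(int cpu)
{
	/* start from the first SMT thread of this cpu's core */
	int start = cpu_core_id(cpu);
	int core = start;

	do {
		/* 'core' iterates over the first SMT threads of all
		 * physical cores on the local node, in ring order */
		pr_info("smart: core %d\n", core);
	} while (core = next_core(core), core != start);
}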

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 kernel/sched/core.c  |   3 +
 kernel/sched/rt.c    | 169 +++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  40 ++++++++++++
 3 files changed, 212 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c771f25..14bcdd6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6671,6 +6671,7 @@ static int init_sched_domains(const struct cpumask *cpu_map)
 		doms_cur = &fallback_doms;
 	cpumask_andnot(doms_cur[0], cpu_map, cpu_isolated_map);
 	err = build_sched_domains(doms_cur[0], NULL);
+	build_smart_topology();
 	register_sched_domain_sysctl();
 
 	return err;
@@ -6791,6 +6792,8 @@ match2:
 
 	register_sched_domain_sysctl();
 
+	build_smart_topology();
+
 	mutex_unlock(&sched_domains_mutex);
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 2dffc7b..fed3992 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -7,6 +7,15 @@
 
 #include <linux/slab.h>
 
+#ifdef CONFIG_SMART
+#include <linux/jump_label.h>
+
+struct static_key __smart_initialized = STATIC_KEY_INIT_FALSE;
+DEFINE_MUTEX(smart_mutex);
+
+DEFINE_PER_CPU_SHARED_ALIGNED(struct smart_core_data, smart_core_data);
+#endif /* CONFIG_SMART */
+
 int sched_rr_timeslice = RR_TIMESLICE;
 
 static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
@@ -2114,3 +2123,163 @@ void print_rt_stats(struct seq_file *m, int cpu)
 	rcu_read_unlock();
 }
 #endif /* CONFIG_SCHED_DEBUG */
+
+#ifdef CONFIG_SMART
+int check_smart_data(void)
+{
+	int cpu, core;
+	int iterations;
+
+	for_each_online_cpu(cpu) {
+		if (cpu_core_id(cpu) == -1 || next_core(cpu) == -1 ||
+		    core_node_sibling(cpu) == -1)
+			goto error;
+
+		if (!cpumask_test_cpu(cpu_core_id(cpu), cpu_online_mask))
+			goto error;
+
+		if (!cpumask_test_cpu(core_node_sibling(cpu), cpu_online_mask))
+			goto error;
+
+		iterations = 0;
+		core = cpu_core_id(cpu);
+		do {
+			if (core == -1)
+				goto error;
+			if (++iterations > NR_CPUS)
+				goto error;
+		} while (core = next_core(core), core != cpu_core_id(cpu));
+
+		iterations = 0;
+		core = core_node_sibling(cpu);
+		do {
+			if (core == -1)
+				goto error;
+			if (++iterations > NR_CPUS)
+				goto error;
+		} while (core = next_core(core), core != core_node_sibling(cpu));
+
+	}
+
+	return 0;
+
+error:
+	printk(KERN_INFO "smart: init error (cpu %d core %d next %d sibling %d)\n",
+	       cpu, cpu_core_id(cpu), next_core(cpu),  core_node_sibling(cpu));
+	return -1;
+}
+
+static int number_of_cpu(int cpu, cpumask_t *mask)
+{
+	int tmp;
+	int count = 0;
+
+	for_each_cpu(tmp, mask) {
+		if (tmp == cpu)
+			return count;
+		count++;
+	}
+
+	return -1;
+}
+
+static int cpu_with_number(int number, cpumask_t *mask)
+{
+	int tmp;
+	int count = 0;
+
+	for_each_cpu(tmp, mask) {
+		if (count == number)
+			return tmp;
+		count++;
+	}
+
+	return -1;
+}
+
+void build_smart_topology(void)
+{
+	int cpu;
+	int was_initialized;
+
+	mutex_lock(&smart_mutex);
+
+	was_initialized = static_key_enabled(&__smart_initialized);
+	if (was_initialized)
+		static_key_slow_dec(&__smart_initialized);
+	synchronize_rcu();
+
+	if (was_initialized)
+		printk(KERN_INFO "smart: disabled\n");
+
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		/* __cpu_core_id */
+		per_cpu(smart_core_data, cpu).cpu_core_id =
+			cpumask_first(topology_thread_cpumask(cpu));
+		if (per_cpu(smart_core_data, cpu).cpu_core_id < 0 ||
+		    per_cpu(smart_core_data, cpu).cpu_core_id >= nr_cpu_ids)
+			per_cpu(smart_core_data, cpu).cpu_core_id = cpu;
+
+		atomic_set(&per_cpu(smart_core_data, cpu).core_locked, 0);
+	}
+
+	rcu_read_lock();
+	for_each_online_cpu(cpu) {
+		struct sched_domain *sd;
+
+		/* core_node_sibling */
+		smart_data(cpu).core_node_sibling = -1;
+		for_each_domain(cpu, sd) {
+			struct sched_group *sg, *next_sg;
+			int number;
+
+			if (sd->flags & SD_SHARE_PKG_RESOURCES)
+				continue;
+
+			sg = sd->groups;
+			next_sg = sg->next;
+
+			if (sg == next_sg)
+				continue;
+
+			number = number_of_cpu(cpu, sched_group_cpus(sg));
+			if (number != -1) {
+				int sibling = cpu_with_number(number,
+							      sched_group_cpus(next_sg));
+				if (sibling != -1)
+					smart_data(cpu).core_node_sibling = cpu_core_id(sibling);
+			}
+		}
+
+		/* local_core_list */
+		smart_data(cpu).core_next = -1;
+		for_each_domain(cpu, sd) {
+			if (sd->flags & SD_SHARE_CPUPOWER)
+				continue;
+
+			if (likely(sd->groups)) {
+				struct sched_group *sg = sd->groups->next;
+				int next = group_first_cpu(sg);
+
+				if (next < nr_cpu_ids)
+					smart_data(cpu).core_next = cpu_core_id(next);
+			}
+
+			break;
+		}
+	}
+
+	if (!check_smart_data()) {
+		printk(KERN_INFO "smart: enabled\n");
+		static_key_slow_inc(&__smart_initialized);
+	}
+
+	rcu_read_unlock();
+
+	put_online_cpus();
+
+	mutex_unlock(&smart_mutex);
+}
+
+#endif /* CONFIG_SMART */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dfa31d5..357736b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1378,3 +1378,43 @@ static inline u64 irq_time_read(int cpu)
 }
 #endif /* CONFIG_64BIT */
 #endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
+#ifdef CONFIG_SMART
+struct smart_core_data {
+	int cpu_core_id;
+
+	/* Per core data, use smart_data macro for access */
+	int core_next;
+	int core_node_sibling;
+	atomic_t core_locked;
+} ____cacheline_aligned_in_smp;
+
+extern struct static_key __smart_initialized;
+
+DECLARE_PER_CPU_SHARED_ALIGNED(struct smart_core_data, smart_core_data);
+
+static inline int cpu_core_id(int cpu)
+{
+	return per_cpu(smart_core_data, cpu).cpu_core_id;
+}
+
+#define smart_data(cpu) per_cpu(smart_core_data, cpu_core_id(cpu))
+
+static inline int core_node_sibling(int cpu)
+{
+	return smart_data(cpu).core_node_sibling;
+}
+
+static inline int next_core(int cpu)
+{
+	return smart_data(cpu).core_next;
+}
+
+void build_smart_topology(void);
+
+#else /* CONFIG_SMART */
+static inline void build_smart_topology(void)
+{
+}
+
+#endif /* CONFIG_SMART */
-- 
1.9.3



* [PATCH 02/19] smart: add config option for SMART-related code
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
@ 2014-09-04 16:30   ` klamm
  2014-09-04 16:30   ` [PATCH 03/19] smart: introduce smart_enabled() klamm
                     ` (16 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:30 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

All SMART-related code will be guarded by the CONFIG_SMART config option.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 init/Kconfig | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 5d6feba..98dd173 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -802,6 +802,17 @@ config NUMA_BALANCING
 
 	  This system will be inactive on UMA systems.
 
+config SMART
+	bool "Enable SMART rt scheduler extension"
+	default y
+	depends on SMP
+	help
+	  This option enables the SMART (Simultaneous Multithreading-Aware
+	  Real-Time) scheduler extension. SMART allows efficient use of the
+	  real-time scheduler for runtime tasks on modern multi-core CPUs
+	  with hyper-threading enabled.
+	  Do not use for hard real-time purposes.
+
 menuconfig CGROUPS
 	boolean "Control Group support"
 	depends on EVENTFD
-- 
1.9.3



* [PATCH 03/19] smart: introduce smart_enabled()
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
  2014-09-04 16:30   ` [PATCH 02/19] smart: add config option for SMART-related code klamm
@ 2014-09-04 16:30   ` klamm
  2014-09-04 16:30   ` [PATCH 04/19] smart: helper functions for smart_data manipulations klamm
                     ` (15 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:30 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

This patch introduces the __smart_enabled static key and the smart_enabled()
function, which will be used to check whether smart functionality
is enabled.
The ability to turn it on/off will be added later.
By default, __smart_enabled is set to true.
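
For illustration, a minimal sketch of the intended call pattern
(pick_cpu() is hypothetical; smart_find_lowest_rq() is the smart
selection function added later in this series, find_lowest_rq() is the
stock rt selection):

static int pick_cpu(struct task_struct *p)
{
	/* take the smart path only when the topology has been built and
	 * smart is administratively enabled; otherwise fall back to the
	 * stock rt CPU selection */
	if (smart_enabled())
		return smart_find_lowest_rq(p, true);

	return find_lowest_rq(p);
}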

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 kernel/sched/rt.c    |  4 ++++
 kernel/sched/sched.h | 13 +++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index fed3992..c71b9a3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -11,6 +11,7 @@
 #include <linux/jump_label.h>
 
 struct static_key __smart_initialized = STATIC_KEY_INIT_FALSE;
+struct static_key __smart_enabled = STATIC_KEY_INIT_TRUE;
 DEFINE_MUTEX(smart_mutex);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct smart_core_data, smart_core_data);
@@ -2202,6 +2203,9 @@ void build_smart_topology(void)
 	int cpu;
 	int was_initialized;
 
+	if (!static_key_enabled(&__smart_enabled))
+		return;
+
 	mutex_lock(&smart_mutex);
 
 	was_initialized = static_key_enabled(&__smart_initialized);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 357736b..7e26454 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -6,6 +6,7 @@
 #include <linux/spinlock.h>
 #include <linux/stop_machine.h>
 #include <linux/tick.h>
+#include <linux/jump_label.h>
 
 #include "cpupri.h"
 #include "cpuacct.h"
@@ -1390,6 +1391,7 @@ struct smart_core_data {
 } ____cacheline_aligned_in_smp;
 
 extern struct static_key __smart_initialized;
+extern struct static_key __smart_enabled;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct smart_core_data, smart_core_data);
 
@@ -1400,6 +1402,12 @@ static inline int cpu_core_id(int cpu)
 
 #define smart_data(cpu) per_cpu(smart_core_data, cpu_core_id(cpu))
 
+static inline bool smart_enabled(void)
+{
+	return static_key_false(&__smart_initialized) &&
+		static_key_true(&__smart_enabled);
+}
+
 static inline int core_node_sibling(int cpu)
 {
 	return smart_data(cpu).core_node_sibling;
@@ -1417,4 +1425,9 @@ static inline void build_smart_topology(void)
 {
 }
 
+static inline bool smart_enabled(void)
+{
+	return false;
+}
+
 #endif /* CONFIG_SMART */
-- 
1.9.3



* [PATCH 04/19] smart: helper functions for smart_data manipulations
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
  2014-09-04 16:30   ` [PATCH 02/19] smart: add config option for SMART-related code klamm
  2014-09-04 16:30   ` [PATCH 03/19] smart: introduce smart_enabled() klamm
@ 2014-09-04 16:30   ` klamm
  2014-09-04 16:30   ` [PATCH 05/19] smart: CPU selection logic klamm
                     ` (14 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:30 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

This commit adds the following helper functions (a usage sketch follows the list):
acquire_core() - acquires per-core lock
release_core() - releases per-core lock
core_acquired() - checks if per-core lock is acquired
core_is_rt_free() - checks if there are no rt tasks on specified core
core_rt_free_thread() - finds free SMT thread on specified core
find_rt_free_core() - finds free core starting with specified core
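
A minimal usage sketch of the per-core lock (not part of the patch;
try_place_on_core() is hypothetical, but the pattern mirrors the CPU
selection code added in a later patch):

static bool try_place_on_core(int core)
{
	/* only one waker may claim a core at a time */
	if (!acquire_core(core))
		return false;

	/* re-check under the lock: the core may have been taken meanwhile */
	if (!core_is_rt_free(core)) {
		release_core(core);
		return false;
	}

	/* the core is ours; release_core() is called later, once the task
	 * is actually enqueued on one of its runqueues */
	return true;
}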

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 kernel/sched/sched.h | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7e26454..4603096 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1418,6 +1418,89 @@ static inline int next_core(int cpu)
 	return smart_data(cpu).core_next;
 }
 
+static inline int acquire_core(int cpu)
+{
+	return (0 == atomic_cmpxchg(&smart_data(cpu).core_locked, 0, 1));
+}
+
+static inline void release_core(int cpu)
+{
+	atomic_set(&smart_data(cpu).core_locked, 0);
+}
+
+static inline int core_acquired(int cpu)
+{
+	return atomic_read(&smart_data(cpu).core_locked);
+}
+
+static inline int core_is_rt_free(int core)
+{
+	struct rq *rq;
+	int cpu;
+	unsigned int nr_rt;
+	struct task_struct *task;
+
+	for_each_cpu(cpu, topology_thread_cpumask(core)) {
+		rq = cpu_rq(cpu);
+
+		if (rq->rt.rt_throttled)
+			return 0;
+
+		nr_rt = rq->rt.rt_nr_running;
+		if (nr_rt) {
+			if (nr_rt > 1)
+				return 0;
+
+			task = ACCESS_ONCE(rq->curr);
+			if (task->mm)
+				return 0;
+		}
+	}
+
+	return 1;
+}
+
+static inline int core_rt_free_thread(int core)
+{
+	struct rq *rq;
+	int cpu;
+
+	for_each_cpu(cpu, topology_thread_cpumask(core)) {
+		rq = cpu_rq(cpu);
+
+		if (rq->rt.rt_throttled)
+			continue;
+
+		if (!rq->rt.rt_nr_running)
+			return cpu;
+	}
+
+	return -1;
+}
+
+static inline int find_rt_free_core(int start_cpu, struct task_struct *task)
+{
+	int core;
+
+	/* Local cores */
+	core = cpu_core_id(start_cpu);
+	do {
+		if (!core_acquired(core) && core_is_rt_free(core) &&
+		    cpumask_test_cpu(core, tsk_cpus_allowed(task)))
+			return core;
+	} while (core = next_core(core), core != cpu_core_id(start_cpu));
+
+	/* Remote cores */
+	core = core_node_sibling(start_cpu);
+	do {
+		if (!core_acquired(core) && core_is_rt_free(core) &&
+		    cpumask_test_cpu(core, tsk_cpus_allowed(task)))
+			return core;
+	} while (core = next_core(core), core != core_node_sibling(start_cpu));
+
+	return -1;
+}
+
 void build_smart_topology(void);
 
 #else /* CONFIG_SMART */
@@ -1430,4 +1513,8 @@ static inline bool smart_enabled(void)
 	return false;
 }
 
+static inline void release_core(int cpu)
+{
+}
+
 #endif /* CONFIG_SMART */
-- 
1.9.3



* [PATCH 05/19] smart: CPU selection logic
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
                     ` (2 preceding siblings ...)
  2014-09-04 16:30   ` [PATCH 04/19] smart: helper functions for smart_data manipulations klamm
@ 2014-09-04 16:30   ` klamm
  2014-09-04 16:30   ` [PATCH 06/19] smart: use CPU selection logic if smart is enabled klamm
                     ` (13 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:30 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

This commit contains the most important code: the CPU selection logic.
It's implemented by the smart_find_lowest_rq() function and some helper
functions.

The logic is relatively simple:
1) try to find a free core on the local node (starting with the task's previous CPU)
2) try to find a free core on the next remote node
3) try to find a free SMT thread on the local node
4) try to find a free SMT thread on a remote node
5) try to find the SMT thread with the minimum number of rt tasks

If we find a free CPU at any step, we try to acquire it; if that succeeds, we
stop further processing and quit; otherwise, if the CPU is locked, we restart
the search.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 kernel/sched/rt.c    | 32 ++++++++++++++++++++++++++++++++
 kernel/sched/sched.h | 43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 75 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index c71b9a3..805951b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2286,4 +2286,36 @@ void build_smart_topology(void)
 	mutex_unlock(&smart_mutex);
 }
 
+static int smart_find_lowest_rq(struct task_struct *task, bool wakeup)
+{
+	int prev_cpu = task_cpu(task);
+	int best_cpu;
+	int attempts;
+
+	if (task->nr_cpus_allowed == 1)
+		return -1; /* No other targets possible */
+
+	rcu_read_lock();
+
+
+	for (attempts = 3; attempts; attempts--) {
+		best_cpu = find_rt_free_core(prev_cpu, task);
+		if (best_cpu == -1) {
+			best_cpu = find_rt_best_thread(prev_cpu, task);
+
+			break;
+		}
+
+		if (!acquire_core(best_cpu))
+			continue;
+
+		if (likely(core_is_rt_free(best_cpu)))
+			break;
+
+		release_core(best_cpu);
+	}
+
+	rcu_read_unlock();
+	return best_cpu;
+}
 #endif /* CONFIG_SMART */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4603096..b662a89 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1501,6 +1501,49 @@ static inline int find_rt_free_core(int start_cpu, struct task_struct *task)
 	return -1;
 }
 
+static inline int find_rt_best_thread(int start_cpu, struct task_struct *task)
+{
+	int core;
+	int cpu;
+	struct rq *rq;
+	unsigned int min_running = 2;
+	int best_cpu = -1;
+	int nr_running;
+
+	/* Local cores */
+	core = cpu_core_id(start_cpu);
+	do {
+		cpu = core_rt_free_thread(core);
+		if (cpu != -1 && cpumask_test_cpu(cpu, tsk_cpus_allowed(task)))
+			return cpu;
+	} while (core = next_core(core), core != cpu_core_id(start_cpu));
+
+	/* Remote cores */
+	core = core_node_sibling(start_cpu);
+	do {
+		cpu = core_rt_free_thread(core);
+		if (cpu != -1 && cpumask_test_cpu(cpu, tsk_cpus_allowed(task)))
+			return cpu;
+	} while (core = next_core(core), core != core_node_sibling(start_cpu));
+
+	/* Find local thread with min. number of tasks */
+	for_each_cpu(cpu, topology_core_cpumask(start_cpu)) {
+		rq = cpu_rq(cpu);
+		nr_running = rq->rt.rt_nr_running;
+		if (nr_running < min_running &&
+		    cpumask_test_cpu(cpu, tsk_cpus_allowed(task))) {
+			min_running = nr_running;
+			best_cpu = cpu;
+		}
+	}
+
+	if (best_cpu != -1 &&
+	    min_running == cpu_rq(start_cpu)->rt.rt_nr_running)
+		best_cpu = -1;
+
+	return best_cpu;
+}
+
 void build_smart_topology(void);
 
 #else /* CONFIG_SMART */
-- 
1.9.3



* [PATCH 06/19] smart: use CPU selection logic if smart is enabled
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
                     ` (3 preceding siblings ...)
  2014-09-04 16:30   ` [PATCH 05/19] smart: CPU selection logic klamm
@ 2014-09-04 16:30   ` klamm
  2014-09-21 17:31     ` Pavel Machek
  2014-09-04 16:30   ` [PATCH 07/19] smart: balance load between nodes klamm
                     ` (12 subsequent siblings)
  17 siblings, 1 reply; 24+ messages in thread
From: klamm @ 2014-09-04 16:30 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

This patch causes the rt scheduler to use the smart CPU selection logic
if smart_enabled() returns true.

Also, release_core() should be called every time an rt task is actually
enqueued on the runqueue, or when it is clear that it will never be
enqueued on the selected CPU.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 kernel/sched/core.c |  6 +++++-
 kernel/sched/rt.c   | 29 +++++++++++++++++++++++++++--
 2 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 14bcdd6..832b3d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1262,8 +1262,12 @@ int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
 	 *   not worry about this generic constraint ]
 	 */
 	if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
-		     !cpu_online(cpu)))
+		     !cpu_online(cpu))) {
+		if (smart_enabled() && task_has_rt_policy(p) && cpu >= 0)
+			release_core(cpu);
+
 		cpu = select_fallback_rq(task_cpu(p), p);
+	}
 
 	return cpu;
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 805951b..1993c47 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -15,6 +15,14 @@ struct static_key __smart_enabled = STATIC_KEY_INIT_TRUE;
 DEFINE_MUTEX(smart_mutex);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct smart_core_data, smart_core_data);
+
+static int smart_find_lowest_rq(struct task_struct *task, bool wakeup);
+
+#else /* CONFIG_SMART */
+static inline int smart_find_lowest_rq(struct task_struct *task, bool wakeup)
+{
+	return -1;
+}
 #endif /* CONFIG_SMART */
 
 int sched_rr_timeslice = RR_TIMESLICE;
@@ -1211,6 +1219,7 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 		enqueue_pushable_task(rq, p);
 
 	inc_nr_running(rq);
+	release_core(cpu_of(rq));
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -1278,6 +1287,13 @@ select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
 	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
 		goto out;
 
+	if (smart_enabled()) {
+		int target = smart_find_lowest_rq(p, true);
+		if (likely(target != -1))
+			cpu = target;
+		goto out;
+	}
+
 	rq = cpu_rq(cpu);
 
 	rcu_read_lock();
@@ -1580,10 +1596,17 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 	int cpu;
 
 	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
-		cpu = find_lowest_rq(task);
+		if (smart_enabled())
+			cpu = smart_find_lowest_rq(task, false);
+		else
+			cpu = find_lowest_rq(task);
+
+		if ((cpu == -1) || (cpu == rq->cpu)) {
+			if (cpu == rq->cpu)
+				release_core(cpu);
 
-		if ((cpu == -1) || (cpu == rq->cpu))
 			break;
+		}
 
 		lowest_rq = cpu_rq(cpu);
 
@@ -1602,6 +1625,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 				     !task->on_rq)) {
 
 				double_unlock_balance(rq, lowest_rq);
+				release_core(cpu);
 				lowest_rq = NULL;
 				break;
 			}
@@ -1614,6 +1638,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 		/* try again */
 		double_unlock_balance(rq, lowest_rq);
 		lowest_rq = NULL;
+		release_core(cpu);
 	}
 
 	return lowest_rq;
-- 
1.9.3



* [PATCH 07/19] smart: balance load between nodes
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
                     ` (4 preceding siblings ...)
  2014-09-04 16:30   ` [PATCH 06/19] smart: use CPU selection logic if smart is enabled klamm
@ 2014-09-04 16:30   ` klamm
  2014-09-04 16:30   ` [PATCH 08/19] smart: smart pull klamm
                     ` (11 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:30 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

Although the previously introduced CPU selection logic isn't limited to
the local node, a significant load imbalance between nodes can occur if
the number of rt tasks is smaller than the number of physical cores per node.

Modern CPUs tend to scale their frequency depending on the number
of loaded CPUs, so such an imbalance can lead to decreased per-CPU
performance on the more loaded node.

To solve this problem, this commit adds the following rule to the CPU
selection logic: if the number of running rt tasks on the current node
is greater than on a remote node, and the difference is more than 1/4 of
the number of rt tasks on the local node, start the search from the
corresponding core on the remote node.

The number of rt tasks on each node is tracked with a per-node atomic counter.
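
Concretely, the test added to smart_find_lowest_rq() reduces to the
sketch below (variable names as in the diff). For example, with 8 rt
tasks on the local node and 5 on the remote one, (8 - 5) * 4 = 12 > 8,
so the search starts from the sibling core on the remote node.

	/* prefer the remote node when it is noticeably less loaded: the
	 * difference exceeds 1/4 of the local node's rt task count */
	if (this_node_rt > other_node_rt &&
	    (this_node_rt - other_node_rt) * 4 > this_node_rt)
		prev_cpu = core_node_sibling(prev_cpu);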

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 kernel/sched/rt.c    | 22 ++++++++++++++++++++++
 kernel/sched/sched.h | 29 +++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 1993c47..3202ab4 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -15,6 +15,7 @@ struct static_key __smart_enabled = STATIC_KEY_INIT_TRUE;
 DEFINE_MUTEX(smart_mutex);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct smart_core_data, smart_core_data);
+struct smart_node_data smart_node_data[MAX_NUMNODES] ____cacheline_aligned_in_smp;
 
 static int smart_find_lowest_rq(struct task_struct *task, bool wakeup);
 
@@ -1218,6 +1219,7 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
 
+	inc_node_running(cpu_of(rq));
 	inc_nr_running(rq);
 	release_core(cpu_of(rq));
 }
@@ -1231,6 +1233,7 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 
 	dequeue_pushable_task(rq, p);
 
+	dec_node_running(cpu_of(rq));
 	dec_nr_running(rq);
 }
 
@@ -2316,12 +2319,31 @@ static int smart_find_lowest_rq(struct task_struct *task, bool wakeup)
 	int prev_cpu = task_cpu(task);
 	int best_cpu;
 	int attempts;
+	int this_node_rt, other_node_rt;
+	int node, this_node;
 
 	if (task->nr_cpus_allowed == 1)
 		return -1; /* No other targets possible */
 
 	rcu_read_lock();
 
+	if (wakeup) {
+		this_node = cpu_to_node(prev_cpu);
+		this_node_rt = node_running(this_node);
+
+		for_each_online_node(node) {
+			if (node == this_node)
+				continue;
+
+			other_node_rt = node_running(node);
+
+			if (this_node_rt > other_node_rt &&
+			    ((this_node_rt - other_node_rt) * 4 > this_node_rt)) {
+				this_node_rt = other_node_rt;
+				prev_cpu = core_node_sibling(prev_cpu);
+			}
+		}
+	}
 
 	for (attempts = 3; attempts; attempts--) {
 		best_cpu = find_rt_free_core(prev_cpu, task);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b662a89..dd539ca 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1390,10 +1390,15 @@ struct smart_core_data {
 	atomic_t core_locked;
 } ____cacheline_aligned_in_smp;
 
+struct smart_node_data {
+	atomic_t nr_rt_running;
+} ____cacheline_aligned_in_smp;
+
 extern struct static_key __smart_initialized;
 extern struct static_key __smart_enabled;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct smart_core_data, smart_core_data);
+extern struct smart_node_data smart_node_data[MAX_NUMNODES];
 
 static inline int cpu_core_id(int cpu)
 {
@@ -1401,6 +1406,7 @@ static inline int cpu_core_id(int cpu)
 }
 
 #define smart_data(cpu) per_cpu(smart_core_data, cpu_core_id(cpu))
+#define smart_node_ptr(cpu) smart_node_data[cpu_to_node(cpu)]
 
 static inline bool smart_enabled(void)
 {
@@ -1433,6 +1439,21 @@ static inline int core_acquired(int cpu)
 	return atomic_read(&smart_data(cpu).core_locked);
 }
 
+static inline void inc_node_running(int cpu)
+{
+	atomic_inc(&smart_node_ptr(cpu).nr_rt_running);
+}
+
+static inline void dec_node_running(int cpu)
+{
+	atomic_dec(&smart_node_ptr(cpu).nr_rt_running);
+}
+
+static inline int node_running(int node)
+{
+	return atomic_read(&smart_node_data[node].nr_rt_running);
+}
+
 static inline int core_is_rt_free(int core)
 {
 	struct rq *rq;
@@ -1560,4 +1581,12 @@ static inline void release_core(int cpu)
 {
 }
 
+static inline void inc_node_running(int cpu)
+{
+}
+
+static inline void dec_node_running(int cpu)
+{
+}
+
 #endif /* CONFIG_SMART */
-- 
1.9.3



* [PATCH 08/19] smart: smart pull
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
                     ` (5 preceding siblings ...)
  2014-09-04 16:30   ` [PATCH 07/19] smart: balance load between nodes klamm
@ 2014-09-04 16:30   ` klamm
  2014-09-04 16:30   ` [PATCH 09/19] smart: throttle CFS tasks by affining to first SMT thread klamm
                     ` (10 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:30 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

This patch implements migration of running rt tasks (aka "smart pull").
The idea is quite simple: if there are free cores, there is no reason to
suffer from hyper-threading.

The implementation is a bit trickier:
in pre_schedule() we check if we are switching from an rt task to a
non-rt one (CFS or idle), and if so, we schedule a work item. This work
searches for the rt task that is the best candidate for migration (has
the maximal smart_score). If there is such a candidate, it is migrated.
The migration always goes from a non-first SMT thread on the source core
to the first SMT thread on the destination core.

The smart score is computed in the following way: on each scheduler tick,
for each running rt task, we count the tasks concurrently running on the
same core. If there are any, we add this number to the task's smart score.
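
Roughly, the per-tick accounting amounts to the following condensed
sketch of update_curr_smart() from the diff below (p is the running rt
task); for example, an rt task sharing its core with one other runnable
task gains one point per tick.

	int points = 0, cpu;

	/* count the tasks runnable on p's sibling SMT threads and add the
	 * result to p's smart score on every scheduler tick */
	for_each_cpu(cpu, topology_thread_cpumask(task_cpu(p)))
		if (cpu != task_cpu(p))
			points += cpu_rq(cpu)->nr_running;

	if (points)
		atomic_add(points, &p->rt.smart_score);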

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 include/linux/sched.h |   3 ++
 kernel/sched/core.c   |  35 +++++++++++++++++
 kernel/sched/rt.c     | 107 +++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h  |  10 +++++
 4 files changed, 154 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 597c8ab..49b7361 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1023,6 +1023,9 @@ struct sched_rt_entity {
 	/* rq "owned" by this entity/group: */
 	struct rt_rq		*my_q;
 #endif
+#ifdef CONFIG_SMART
+	atomic_t smart_score;
+#endif
 };
 
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 832b3d0..9d888610c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4920,6 +4920,41 @@ static int migration_cpu_stop(void *data)
 	return 0;
 }
 
+#ifdef CONFIG_SMART
+int smart_migrate_task(struct task_struct *p, int prev_cpu,
+		       int dest_cpu)
+{
+	unsigned long flags;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &flags);
+
+	/* Something has changed? Do nothing. */
+	if (unlikely(prev_cpu != cpu_of(rq)))
+		goto out;
+
+	if (unlikely(!rt_task(p)))
+		goto out;
+
+	if (p->nr_cpus_allowed == 1 ||
+	    !cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
+		goto out;
+
+	if (p->on_rq) {
+		struct migration_arg arg = { p, dest_cpu };
+		/* Need help from migration thread: drop lock and wait. */
+		task_rq_unlock(rq, p, &flags);
+		stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
+		tlb_migrate_finish(p->mm);
+		return 0;
+	}
+out:
+	task_rq_unlock(rq, p, &flags);
+
+	return -1;
+}
+#endif
+
 #ifdef CONFIG_HOTPLUG_CPU
 
 /*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3202ab4..7ef0fd0 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -8,6 +8,7 @@
 #include <linux/slab.h>
 
 #ifdef CONFIG_SMART
+#include <linux/workqueue.h>
 #include <linux/jump_label.h>
 
 struct static_key __smart_initialized = STATIC_KEY_INIT_FALSE;
@@ -18,12 +19,26 @@ DEFINE_PER_CPU_SHARED_ALIGNED(struct smart_core_data, smart_core_data);
 struct smart_node_data smart_node_data[MAX_NUMNODES] ____cacheline_aligned_in_smp;
 
 static int smart_find_lowest_rq(struct task_struct *task, bool wakeup);
+static void update_curr_smart(struct rq *rq, struct task_struct *p);
+static void pre_schedule_smart(struct rq *rq, struct task_struct *prev);
+
+static void smart_pull(struct work_struct *dummy);
+static DECLARE_WORK(smart_work, smart_pull);
+
 
 #else /* CONFIG_SMART */
 static inline int smart_find_lowest_rq(struct task_struct *task, bool wakeup)
 {
 	return -1;
 }
+
+static void update_curr_smart(struct rq *rq, struct task_struct *p)
+{
+}
+
+static void pre_schedule_smart(struct rq *rq, struct task_struct *prev)
+{
+}
 #endif /* CONFIG_SMART */
 
 int sched_rr_timeslice = RR_TIMESLICE;
@@ -1211,9 +1226,12 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct sched_rt_entity *rt_se = &p->rt;
 
-	if (flags & ENQUEUE_WAKEUP)
+	if (flags & ENQUEUE_WAKEUP) {
 		rt_se->timeout = 0;
 
+		reset_smart_score(rt_se);
+	}
+
 	enqueue_rt_entity(rt_se, flags & ENQUEUE_HEAD);
 
 	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
@@ -1845,6 +1863,8 @@ static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
 	/* Try to pull RT tasks here if we lower this rq's prio */
 	if (rq->rt.highest_prio.curr > prev->prio)
 		pull_rt_task(rq);
+
+	pre_schedule_smart(rq, prev);
 }
 
 static void post_schedule_rt(struct rq *rq)
@@ -2058,6 +2078,8 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 
 	update_curr_rt(rq);
 
+	update_curr_smart(rq, p);
+
 	watchdog(rq, p);
 
 	/*
@@ -2365,4 +2387,87 @@ static int smart_find_lowest_rq(struct task_struct *task, bool wakeup)
 	rcu_read_unlock();
 	return best_cpu;
 }
+
+static void smart_pull(struct work_struct *dummy)
+{
+	int this_cpu = smp_processor_id();
+	int cpu;
+	struct rq *rq = cpu_rq(this_cpu);
+	struct task_struct *task;
+	int points;
+	struct task_struct *best_task = NULL;
+	int best_points = 2;
+
+	if (rq->rt.rt_nr_running > 0)
+		return;
+
+	if (core_acquired(this_cpu))
+		return;
+
+	rcu_read_lock();
+	for_each_online_cpu(cpu) {
+		if (cpu == cpu_core_id(cpu))
+			continue;
+
+		rq = cpu_rq(cpu);
+		if (!rq->rt.rt_nr_running)
+			continue;
+
+		task = ACCESS_ONCE(rq->curr);
+		if (!rt_task(task))
+			continue;
+
+		points = atomic_read(&task->rt.smart_score);
+		if (points > best_points) {
+			best_task = task;
+			best_points = points;
+		}
+	}
+
+	if (!best_task) {
+		rcu_read_unlock();
+		return;
+	}
+
+	get_task_struct(best_task);
+	rcu_read_unlock();
+
+	smart_migrate_task(best_task, task_cpu(best_task), this_cpu);
+
+	put_task_struct(best_task);
+}
+
+static void update_curr_smart(struct rq *this_rq, struct task_struct *p)
+{
+	int this_cpu = cpu_of(this_rq);
+	int cpu;
+	struct rq *rq;
+	int points = 0;
+
+	for_each_cpu(cpu, topology_thread_cpumask(this_cpu)) {
+		if (cpu == this_cpu)
+			continue;
+
+		rq = cpu_rq(cpu);
+
+		points += rq->nr_running;
+	}
+
+	if (points)
+		atomic_add(points, &p->rt.smart_score);
+}
+
+static void pre_schedule_smart(struct rq *rq, struct task_struct *prev)
+{
+	if (smart_enabled()) {
+		int cpu = cpu_of(rq);
+
+		if (cpu == cpu_core_id(cpu) && !rq->rt.rt_nr_running) {
+			/* Try to pull rt tasks */
+			raw_spin_unlock(&rq->lock);
+			schedule_work_on(cpu, &smart_work);
+			raw_spin_lock(&rq->lock);
+		}
+	}
+}
 #endif /* CONFIG_SMART */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dd539ca..463fdbe 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1565,6 +1565,12 @@ static inline int find_rt_best_thread(int start_cpu, struct task_struct *task)
 	return best_cpu;
 }
 
+static inline void reset_smart_score(struct sched_rt_entity *rt_se)
+{
+	atomic_set(&rt_se->smart_score, 0);
+}
+
+int smart_migrate_task(struct task_struct *p, int prev_cpu, int dest_cpu);
 void build_smart_topology(void);
 
 #else /* CONFIG_SMART */
@@ -1589,4 +1595,8 @@ static inline void dec_node_running(int cpu)
 {
 }
 
+static inline void reset_smart_score(struct sched_rt_entity *rt_se)
+{
+}
+
 #endif /* CONFIG_SMART */
-- 
1.9.3



* [PATCH 09/19] smart: throttle CFS tasks by affining to first SMT thread
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
                     ` (6 preceding siblings ...)
  2014-09-04 16:30   ` [PATCH 08/19] smart: smart pull klamm
@ 2014-09-04 16:30   ` klamm
  2014-09-04 16:30   ` [PATCH 10/19] smart: smart gathering klamm
                     ` (9 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:30 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

Normal ("CFS") tasks can affect the performance of rt tasks by loading
other SMT threads on the same core. If we want guaranteed performance
for rt tasks, that is unacceptable.
So, this patch denies enqueuing of CFS tasks on a CPU if there are rt
tasks running on the same core.
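
The decision is centralized in the cpu_allowed_for_cfs() helper added to
sched.h below; once smart is enabled it essentially reduces to this
sketch (omitting the static-key checks):

	/* a CPU may take CFS work unless it is a secondary SMT thread of a
	 * core whose first thread currently has runnable, unthrottled rt
	 * tasks */
	struct rq *rq = cpu_rq(cpu_core_id(cpu));

	if (cpu == cpu_core_id(cpu))
		return true;

	return !rq->rt.rt_nr_running || rq->rt.rt_throttled;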

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 kernel/sched/fair.c  | 25 ++++++++++++++++++++-----
 kernel/sched/rt.c    |  4 ++++
 kernel/sched/sched.h | 21 +++++++++++++++++++++
 kernel/sysctl.c      |  4 ++++
 4 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c7ab8ea..629aa0d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3285,6 +3285,9 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
+		if (!cpu_allowed_for_cfs(i))
+			continue;
+
 		load = weighted_cpuload(i);
 
 		if (load < min_load || (load == min_load && i == this_cpu)) {
@@ -3305,13 +3308,14 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_group *sg;
 	int i = task_cpu(p);
 
-	if (idle_cpu(target))
+	if (idle_cpu(target) && cpu_allowed_for_cfs(target))
 		return target;
 
 	/*
 	 * If the prevous cpu is cache affine and idle, don't be stupid.
 	 */
-	if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
+	if (i != target && cpus_share_cache(i, target) && idle_cpu(i) &&
+	    cpu_allowed_for_cfs(i))
 		return i;
 
 	/*
@@ -3326,12 +3330,15 @@ static int select_idle_sibling(struct task_struct *p, int target)
 				goto next;
 
 			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (i == target || !idle_cpu(i))
+				if (i == target || !idle_cpu(i) ||
+				    !cpu_allowed_for_cfs(i))
 					goto next;
 			}
 
 			target = cpumask_first_and(sched_group_cpus(sg),
 					tsk_cpus_allowed(p));
+			if (!cpu_allowed_for_cfs(target))
+				goto next;
 			goto done;
 next:
 			sg = sg->next;
@@ -3366,7 +3373,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		return prev_cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) &&
+		    cpu_allowed_for_cfs(cpu))
 			want_affine = 1;
 		new_cpu = prev_cpu;
 	}
@@ -3931,6 +3939,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
 
+	if (!cpu_allowed_for_cfs(env->dst_cpu))
+		return 0;
+
 	if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
 		int cpu;
 
@@ -5191,7 +5202,8 @@ more_balance:
 			 * moved to this_cpu
 			 */
 			if (!cpumask_test_cpu(this_cpu,
-					tsk_cpus_allowed(busiest->curr))) {
+					tsk_cpus_allowed(busiest->curr)) ||
+			    !cpu_allowed_for_cfs(this_cpu)) {
 				raw_spin_unlock_irqrestore(&busiest->lock,
 							    flags);
 				env.flags |= LBF_ALL_PINNED;
@@ -5270,6 +5282,9 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 
 	this_rq->idle_stamp = this_rq->clock;
 
+	if (!cpu_allowed_for_cfs(this_cpu))
+		return;
+
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
 		return;
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7ef0fd0..8621443 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1436,7 +1436,11 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	if (!rt_rq->rt_nr_running)
 		return NULL;
 
+#ifdef CONFIG_SMART
+	if (rt_rq_throttled(rt_rq) && rq->cfs.h_nr_running)
+#else
 	if (rt_rq_throttled(rt_rq))
+#endif
 		return NULL;
 
 	do {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 463fdbe..c7c1cdc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1396,6 +1396,7 @@ struct smart_node_data {
 
 extern struct static_key __smart_initialized;
 extern struct static_key __smart_enabled;
+extern struct static_key smart_cfs_throttle;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct smart_core_data, smart_core_data);
 extern struct smart_node_data smart_node_data[MAX_NUMNODES];
@@ -1454,6 +1455,21 @@ static inline int node_running(int node)
 	return atomic_read(&smart_node_data[node].nr_rt_running);
 }
 
+static inline bool cpu_allowed_for_cfs(int cpu)
+{
+	struct rq *rq;
+	int core = cpu_core_id(cpu);
+
+	if (!smart_enabled() || !static_key_true(&smart_cfs_throttle))
+		return true;
+
+	if (cpu == core)
+		return true;
+
+	rq = cpu_rq(core);
+	return (!rq->rt.rt_nr_running || rq->rt.rt_throttled);
+}
+
 static inline int core_is_rt_free(int core)
 {
 	struct rq *rq;
@@ -1599,4 +1615,9 @@ static inline void reset_smart_score(struct sched_rt_entity *rt_se)
 {
 }
 
+static inline bool cpu_allowed_for_cfs(int cpu)
+{
+	return true;
+}
+
 #endif /* CONFIG_SMART */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9469f4c..7ee22ef 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -175,6 +175,10 @@ extern int unaligned_dump_stack;
 extern int no_unaligned_warning;
 #endif
 
+#ifdef CONFIG_SMART
+struct static_key smart_cfs_throttle = STATIC_KEY_INIT_TRUE;
+#endif
+
 #ifdef CONFIG_PROC_SYSCTL
 static int proc_do_cad_pid(struct ctl_table *table, int write,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
-- 
1.9.3



* [PATCH 10/19] smart: smart gathering
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
                     ` (7 preceding siblings ...)
  2014-09-04 16:30   ` [PATCH 09/19] smart: throttle CFS tasks by affining to first SMT thread klamm
@ 2014-09-04 16:30   ` klamm
  2014-09-04 16:30   ` [PATCH 11/19] smart: smart debug klamm
                     ` (8 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:30 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

The previous patch denies starting new CFS tasks on a core with running
rt tasks. The problem still exists if there are CFS tasks already running
at the moment an rt task starts.
This patch introduces smart gathering: migration of already running CFS
tasks from the other SMT threads of a core to the CPU running the rt task.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 kernel/sched/core.c  | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h | 14 ++++++++
 kernel/sysctl.c      |  1 +
 3 files changed, 113 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9d888610c..5954f48 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2753,6 +2753,8 @@ void scheduler_tick(void)
 	curr->sched_class->task_tick(rq, curr, 0);
 	raw_spin_unlock(&rq->lock);
 
+	smart_tick(cpu);
+
 	perf_event_task_tick();
 
 #ifdef CONFIG_SMP
@@ -4921,6 +4923,97 @@ static int migration_cpu_stop(void *data)
 }
 
 #ifdef CONFIG_SMART
+
+DEFINE_PER_CPU_SHARED_ALIGNED(struct smart_gathering, smart_gathering_data);
+
+static int smart_gathering_cpu_stop(void *data)
+{
+	int this_cpu = smp_processor_id();
+	int dest_cpu = cpu_core_id(this_cpu);
+	struct rq *rq = cpu_rq(this_cpu);
+	struct task_struct *next;
+	struct smart_gathering *sg;
+	unsigned long flags;
+	int ret;
+	int iter;
+
+	WARN_ON(this_cpu == dest_cpu);
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+	for (iter = 0; iter < rq->cfs.h_nr_running; iter++) {
+		next = fair_sched_class.pick_next_task(rq);
+		if (!next)
+			break;
+		next->sched_class->put_prev_task(rq, next);
+
+		if (!cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(next)) ||
+		    !cpu_online(dest_cpu))
+			break;
+
+		raw_spin_unlock(&rq->lock);
+		ret = __migrate_task(next, this_cpu, dest_cpu);
+		raw_spin_lock(&rq->lock);
+
+		if (!ret)
+			break;
+	}
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+	sg = &smart_gathering_data(this_cpu);
+	spin_lock_irqsave(&sg->lock, flags);
+	WARN_ON(!sg->gather);
+	sg->gather = 0;
+	spin_unlock_irqrestore(&sg->lock, flags);
+
+	return 0;
+}
+
+void smart_tick(int cpu)
+{
+	unsigned long flags;
+	struct smart_gathering *sg;
+	int gather = 0;
+	struct rq *rq;
+	int core;
+	struct task_struct *curr;
+
+	if (idle_cpu(cpu) || !smart_enabled() ||
+	    !static_key_true(&smart_cfs_gather))
+		return;
+
+	rcu_read_lock();
+
+	core = cpu_core_id(cpu);
+	if (cpu != core) {
+		rq = cpu_rq(core);
+		curr = rq->curr;
+		if (rt_task(curr) && curr->mm)
+			gather = 1;
+
+		rq = cpu_rq(cpu);
+		curr = rq->curr;
+		if (rt_task(curr))
+			gather = 0;
+	}
+
+	if (gather) {
+		sg = &smart_gathering_data(cpu);
+
+		spin_lock_irqsave(&sg->lock, flags);
+		if (sg->gather)
+			gather = 0;
+		else
+			sg->gather = 1;
+		spin_unlock_irqrestore(&sg->lock, flags);
+	}
+
+	rcu_read_unlock();
+
+	if (gather)
+		stop_one_cpu_nowait(cpu, smart_gathering_cpu_stop, NULL,
+				    &sg->work);
+}
+
 int smart_migrate_task(struct task_struct *p, int prev_cpu,
 		       int dest_cpu)
 {
@@ -7093,6 +7186,11 @@ void __init sched_init(void)
 #endif
 		init_rq_hrtick(rq);
 		atomic_set(&rq->nr_iowait, 0);
+
+#ifdef CONFIG_SMART
+		spin_lock_init(&smart_gathering_data(i).lock);
+		smart_gathering_data(i).gather = 0;
+#endif
 	}
 
 	set_load_weight(&init_task);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7c1cdc..80d202e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1394,11 +1394,19 @@ struct smart_node_data {
 	atomic_t nr_rt_running;
 } ____cacheline_aligned_in_smp;
 
+struct smart_gathering {
+	spinlock_t lock;
+	int gather;
+	struct cpu_stop_work work;
+};
+
 extern struct static_key __smart_initialized;
 extern struct static_key __smart_enabled;
+extern struct static_key smart_cfs_gather;
 extern struct static_key smart_cfs_throttle;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct smart_core_data, smart_core_data);
+DECLARE_PER_CPU_SHARED_ALIGNED(struct smart_gathering, smart_gathering_data);
 extern struct smart_node_data smart_node_data[MAX_NUMNODES];
 
 static inline int cpu_core_id(int cpu)
@@ -1408,6 +1416,7 @@ static inline int cpu_core_id(int cpu)
 
 #define smart_data(cpu) per_cpu(smart_core_data, cpu_core_id(cpu))
 #define smart_node_ptr(cpu) smart_node_data[cpu_to_node(cpu)]
+#define smart_gathering_data(cpu) per_cpu(smart_gathering_data, cpu)
 
 static inline bool smart_enabled(void)
 {
@@ -1586,6 +1595,7 @@ static inline void reset_smart_score(struct sched_rt_entity *rt_se)
 	atomic_set(&rt_se->smart_score, 0);
 }
 
+void smart_tick(int cpu);
 int smart_migrate_task(struct task_struct *p, int prev_cpu, int dest_cpu);
 void build_smart_topology(void);
 
@@ -1620,4 +1630,8 @@ static inline bool cpu_allowed_for_cfs(int cpu)
 	return true;
 }
 
+static inline void smart_tick(int cpu)
+{
+}
+
 #endif /* CONFIG_SMART */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7ee22ef..a1e71e9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -176,6 +176,7 @@ extern int no_unaligned_warning;
 #endif
 
 #ifdef CONFIG_SMART
+struct static_key smart_cfs_gather = STATIC_KEY_INIT_TRUE;
 struct static_key smart_cfs_throttle = STATIC_KEY_INIT_TRUE;
 #endif
 
-- 
1.9.3



* [PATCH 11/19] smart: smart debug
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
                     ` (8 preceding siblings ...)
  2014-09-04 16:30   ` [PATCH 10/19] smart: smart gathering klamm
@ 2014-09-04 16:30   ` klamm
  2014-09-04 16:30   ` [PATCH 12/19] smart: cgroup interface for smart klamm
                     ` (7 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:30 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

This patch introduces debug infrastructure for smart.
The infrastructure contains a number of per-cpu counters, a
smart_event() function to register smart-related events and a procfs
interface to read and reset these counters.

The following events are counted:
pull - smart pull
balance_local - local cpu selection
balance_remote - remote cpu selection
select_core - free local core selected
select_rcore - free remote core selected
select_thread - free local cpu selected
select_rthread - free remote cpu selected
select_busy - busy cpu selected
select_busy_curr - current busy cpu selected
select_fallback - select_fallback_rq() called
rt_pull - standard rt pull
rt_push - standard rt push

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 init/Kconfig         |  8 +++++
 kernel/sched/core.c  |  3 ++
 kernel/sched/rt.c    | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h | 38 +++++++++++++++++++++++
 4 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/init/Kconfig b/init/Kconfig
index 98dd173..d936b49 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -813,6 +813,14 @@ config SMART
 	  enabled hyper-threading.
 	  Do not use for hard real-time purposes.
 
+config SMART_DEBUG
+	bool "Enable SMART debug features"
+	default y
+	depends on SMART
+	help
+	  This option adds gathering of SMART statistics. They are
+	  available via the /proc/smart_stat interface.
+
 menuconfig CGROUPS
 	boolean "Control Group support"
 	depends on EVENTFD
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5954f48..c2b988c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1180,6 +1180,9 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 	enum { cpuset, possible, fail } state = cpuset;
 	int dest_cpu;
 
+	if (smart_enabled() && task_has_rt_policy(p))
+		smart_event(select_fallback);
+
 	/*
 	 * If the node that the cpu is on has been offlined, cpu_to_node()
 	 * will return -1. There is no cpu on the node, and we should
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 8621443..14acd51 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -9,8 +9,69 @@
 
 #ifdef CONFIG_SMART
 #include <linux/workqueue.h>
+#include <linux/proc_fs.h>
+
+#ifdef CONFIG_SMART_DEBUG
+#include <linux/seq_file.h>
 #include <linux/jump_label.h>
 
+DEFINE_PER_CPU(struct smart_stat, smart_stat);
+
+static struct proc_dir_entry *smart_pde;
+
+static ssize_t smart_proc_write(struct file *file, const char __user *buffer,
+				size_t count, loff_t *ppos)
+{
+	int c;
+
+	for_each_possible_cpu(c)
+		memset(&per_cpu(smart_stat, c), 0, sizeof(struct smart_stat));
+
+	return count;
+}
+
+#define smart_stat_print(m, field)				\
+	({							\
+		u64 res = 0;					\
+		unsigned int c;					\
+		for_each_possible_cpu(c)			\
+			res += per_cpu(smart_stat, c).field;	\
+		seq_printf(m, "%-16s %llu\n", #field, res);	\
+	})
+
+static int smart_proc_show(struct seq_file *m, void *arg)
+{
+	smart_stat_print(m, pull);
+	smart_stat_print(m, balance_local);
+	smart_stat_print(m, balance_remote);
+	smart_stat_print(m, select_core);
+	smart_stat_print(m, select_rcore);
+	smart_stat_print(m, select_thread);
+	smart_stat_print(m, select_rthread);
+	smart_stat_print(m, select_busy);
+	smart_stat_print(m, select_busy_curr);
+	smart_stat_print(m, select_fallback);
+	smart_stat_print(m, rt_pull);
+	smart_stat_print(m, rt_push);
+
+	return 0;
+}
+
+static int smart_proc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, smart_proc_show, NULL);
+}
+
+static const struct file_operations smart_proc_fops = {
+	.owner		= THIS_MODULE,
+	.open		= smart_proc_open,
+	.read		= seq_read,
+	.write          = smart_proc_write,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif
+
 struct static_key __smart_initialized = STATIC_KEY_INIT_FALSE;
 struct static_key __smart_enabled = STATIC_KEY_INIT_TRUE;
 DEFINE_MUTEX(smart_mutex);
@@ -1773,6 +1834,9 @@ retry:
 out:
 	put_task_struct(next_task);
 
+	if (ret)
+		smart_event(rt_push);
+
 	return ret;
 }
 
@@ -1859,6 +1923,9 @@ skip:
 		double_unlock_balance(this_rq, src_rq);
 	}
 
+	if (ret)
+		smart_event(rt_pull);
+
 	return ret;
 }
 
@@ -2270,6 +2337,13 @@ void build_smart_topology(void)
 	if (was_initialized)
 		printk(KERN_INFO "smart: disabled\n");
 
+#ifdef CONFIG_SMART_DEBUG
+	if (!smart_pde)
+		smart_pde = proc_create("smart_stat",
+					S_IWUSR | S_IRUSR | S_IRGRP | S_IROTH,
+					NULL, &smart_proc_fops);
+#endif
+
 	get_online_cpus();
 	for_each_online_cpu(cpu) {
 		/* __cpu_core_id */
@@ -2369,6 +2443,9 @@ static int smart_find_lowest_rq(struct task_struct *task, bool wakeup)
 				prev_cpu = core_node_sibling(prev_cpu);
 			}
 		}
+
+		smart_event_node(task_cpu(task), prev_cpu,
+				 balance_local, balance_remote);
 	}
 
 	for (attempts = 3; attempts; attempts--) {
@@ -2376,14 +2453,20 @@ static int smart_find_lowest_rq(struct task_struct *task, bool wakeup)
 		if (best_cpu == -1) {
 			best_cpu = find_rt_best_thread(prev_cpu, task);
 
+			smart_event_node(task_cpu(task), best_cpu,
+					 select_thread, select_rthread);
+
 			break;
 		}
 
 		if (!acquire_core(best_cpu))
 			continue;
 
-		if (likely(core_is_rt_free(best_cpu)))
+		if (likely(core_is_rt_free(best_cpu))) {
+			smart_event_node(task_cpu(task), best_cpu,
+					 select_core, select_rcore);
 			break;
+		}
 
 		release_core(best_cpu);
 	}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 80d202e..d450b8f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1380,6 +1380,38 @@ static inline u64 irq_time_read(int cpu)
 #endif /* CONFIG_64BIT */
 #endif /* CONFIG_IRQ_TIME_ACCOUNTING */
 
+#ifdef CONFIG_SMART_DEBUG
+struct smart_stat {
+	u64 pull;
+	u64 balance_local;
+	u64 balance_remote;
+	u64 select_core;
+	u64 select_rcore;
+	u64 select_thread;
+	u64 select_rthread;
+	u64 select_busy;
+	u64 select_busy_curr;
+	u64 select_fallback;
+	u64 rt_pull;
+	u64 rt_push;
+};
+
+DECLARE_PER_CPU(struct smart_stat, smart_stat);
+#define smart_event(e) do { __get_cpu_var(smart_stat).e++; } while (0)
+#define smart_event_node(prev_cpu, next_cpu, local_event, remote_event) \
+	do {								\
+		if (prev_cpu >= 0 && next_cpu >= 0 &&			\
+		    cpu_to_node(prev_cpu) == cpu_to_node(next_cpu))	\
+			smart_event(local_event);			\
+		else							\
+			smart_event(remote_event);			\
+	} while (0)
+#else
+#define smart_event(e) do { } while (0)
+#define smart_event_node(prev_cpu, next_cpu, local_event, remote_event) \
+	do { } while (0)
+#endif /* CONFIG_SMART_DEBUG */
+
 #ifdef CONFIG_SMART
 struct smart_core_data {
 	int cpu_core_id;
@@ -1417,6 +1449,7 @@ static inline int cpu_core_id(int cpu)
 #define smart_data(cpu) per_cpu(smart_core_data, cpu_core_id(cpu))
 #define smart_node_ptr(cpu) smart_node_data[cpu_to_node(cpu)]
 #define smart_gathering_data(cpu) per_cpu(smart_gathering_data, cpu)
+#define smart_stats(cpu) per_cpu(smart_core_data, cpu_core_id(cpu)).stats
 
 static inline bool smart_enabled(void)
 {
@@ -1587,6 +1620,11 @@ static inline int find_rt_best_thread(int start_cpu, struct task_struct *task)
 	    min_running == cpu_rq(start_cpu)->rt.rt_nr_running)
 		best_cpu = -1;
 
+	if (best_cpu == -1 || best_cpu == start_cpu)
+		smart_event(select_busy_curr);
+	else
+		smart_event(select_busy);
+
 	return best_cpu;
 }
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 12/19] smart: cgroup interface for smart
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
                     ` (9 preceding siblings ...)
  2014-09-04 16:30   ` [PATCH 11/19] smart: smart debug klamm
@ 2014-09-04 16:30   ` klamm
  2014-09-04 16:30   ` [PATCH 13/19] smart: nosmart boot option klamm
                     ` (6 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:30 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

This patch extends the cpu cgroup controller with a convenient
interface for using smart.
The interface consists of a single knob: smart. If it is set to 1, the
SCHED_RR scheduling policy (with priority 10) is assigned to all non-rt
tasks in the group. If it is set to 0, the scheduling policy of all rt
tasks in the group is reset to SCHED_NORMAL.
Enabling or disabling smart globally doesn't affect the per-cgroup
smart knob state, but tasks in a smart cgroup are actually scheduled by
CFS while smart is disabled globally. In other words, tasks in a cgroup
with the smart knob set are scheduled by the real-time scheduler only
if smart is enabled. If smart is temporarily disabled globally (due to
cpu hotplug, for instance), all tasks in smart cgroups are temporarily
scheduled by CFS and then the rt scheduling policy is restored. Such
behavior guarantees graceful degradation if something goes wrong with
smart.
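
A usage sketch (assuming the cpu controller is mounted at
/sys/fs/cgroup/cpu; the "frontend" group name is hypothetical):

  mkdir /sys/fs/cgroup/cpu/frontend
  echo $$ > /sys/fs/cgroup/cpu/frontend/tasks
  echo 1 > /sys/fs/cgroup/cpu/frontend/cpu.smart   # tasks become SCHED_RR, prio 10
  echo 0 > /sys/fs/cgroup/cpu/frontend/cpu.smart   # rt tasks revert to SCHED_NORMAL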

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 kernel/sched/core.c  | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/rt.c    |   8 ++++
 kernel/sched/sched.h |   6 +++
 3 files changed, 139 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c2b988c..0f25fe0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7413,6 +7413,10 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+#ifdef CONFIG_SMART
+	tg->smart = parent->smart;
+#endif
+
 	return tg;
 
 err:
@@ -7877,13 +7881,119 @@ static int cpu_cgroup_can_attach(struct cgroup *cgrp,
 	return 0;
 }
 
+#ifdef CONFIG_SMART
+
+int sched_smart_prio = 10;
+
+static void __update_task_smart(struct task_struct *task,
+				struct cgroup *cgrp)
+{
+	struct task_group *tg;
+	int policy;
+	struct sched_param param;
+
+	if (!task->mm)
+		return;
+
+	tg = cgroup_tg(cgrp);
+
+	if (!rt_task(task) && tg->smart > 0) {
+		policy = SCHED_RR;
+		param.sched_priority = sched_smart_prio;
+	} else if (rt_task(task) && tg->smart <= 0) {
+		policy = SCHED_NORMAL;
+		param.sched_priority = 0;
+	} else
+		return;
+
+	WARN_ON(sched_setscheduler_nocheck(task, policy, &param));
+}
+
+static void update_task_smart(struct task_struct *task,
+			      struct cgroup_scanner *scan)
+{
+	__update_task_smart(task, scan->cg);
+}
+
+static int update_cgrp_smart(struct cgroup *cgrp)
+{
+	struct cgroup_scanner scan;
+
+	scan.cg = cgrp;
+	scan.test_task = NULL;
+	scan.process_task = update_task_smart;
+	scan.heap = NULL;
+
+	return cgroup_scan_tasks(&scan);
+}
+
+static u64 cpu_smart_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	return cgroup_tg(cgrp)->smart == 1 ? 1 : 0;
+}
+
+static int cpu_smart_write(struct cgroup *cgrp, struct cftype *cftype,
+			   u64 enable)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+
+	if (enable != 0 && enable != 1)
+		return -EINVAL;
+
+	/* Don't allow to enable smart for root cgroup */
+	if (!tg->se[0])
+		return -EINVAL;
+
+	mutex_lock(&smart_mutex);
+	tg->smart = (smart_enabled() ? 1 : -1) * enable;
+	update_cgrp_smart(cgrp);
+	mutex_unlock(&smart_mutex);
+
+	return 0;
+}
+
+static int update_smart_tg(struct task_group *tg, void *data)
+{
+	int ret = 0;
+	int enabled = smart_enabled();
+
+	if (enabled && tg->smart < 0) {
+		tg->smart = 1;
+		ret = update_cgrp_smart(tg->css.cgroup);
+	} else if (!enabled && tg->smart > 0) {
+		tg->smart = -1;
+		ret = update_cgrp_smart(tg->css.cgroup);
+	}
+
+	return ret;
+}
+
+int smart_update_globally(void)
+{
+	int ret;
+
+	rcu_read_lock();
+	ret = walk_tg_tree(update_smart_tg, tg_nop, NULL);
+	rcu_read_unlock();
+
+	return ret;
+}
+#else /* CONFIG_SMART */
+static void __update_task_smart(struct task_struct *task,
+				struct cgroup *cgrp)
+{
+}
+#endif  /* CONFIG_SMART */
+
 static void cpu_cgroup_attach(struct cgroup *cgrp,
 			      struct cgroup_taskset *tset)
 {
 	struct task_struct *task;
 
-	cgroup_taskset_for_each(task, cgrp, tset)
+	cgroup_taskset_for_each(task, cgrp, tset) {
 		sched_move_task(task);
+		__update_task_smart(task, cgrp);
+	}
 }
 
 static void
@@ -8214,6 +8324,13 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+#ifdef CONFIG_SMART
+	{
+		.name = "smart",
+		.read_u64 = cpu_smart_read,
+		.write_u64 = cpu_smart_write,
+	},
+#endif
 	{ }	/* terminate */
 };
 
@@ -8231,6 +8348,13 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 	.early_init	= 1,
 };
 
+#else /* CONFIG_CGROUP_SCHED */
+#ifdef CONFIG_SMART
+int smart_update_globally(void)
+{
+	return 0;
+}
+#endif
 #endif	/* CONFIG_CGROUP_SCHED */
 
 void dump_cpu_task(int cpu)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 14acd51..a3fd83c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2405,6 +2405,14 @@ void build_smart_topology(void)
 	if (!check_smart_data()) {
 		printk(KERN_INFO "smart: enabled\n");
 		static_key_slow_inc(&__smart_initialized);
+		if (!was_initialized) {
+			smart_update_globally();
+			printk(KERN_INFO "smart: enabled globally\n");
+		}
+	} else if (was_initialized) {
+		printk(KERN_ALERT "smart: can't build smart topology\n");
+		smart_update_globally();
+		printk(KERN_ALERT "smart: disabled globally\n");
 	}
 
 	rcu_read_unlock();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d450b8f..6ab02dd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -165,6 +165,10 @@ struct task_group {
 #endif
 
 	struct cfs_bandwidth cfs_bandwidth;
+
+#ifdef CONFIG_SMART
+	int smart;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1434,6 +1438,7 @@ struct smart_gathering {
 
 extern struct static_key __smart_initialized;
 extern struct static_key __smart_enabled;
+extern struct mutex smart_mutex;
 extern struct static_key smart_cfs_gather;
 extern struct static_key smart_cfs_throttle;
 
@@ -1636,6 +1641,7 @@ static inline void reset_smart_score(struct sched_rt_entity *rt_se)
 void smart_tick(int cpu);
 int smart_migrate_task(struct task_struct *p, int prev_cpu, int dest_cpu);
 void build_smart_topology(void);
+int smart_update_globally(void);
 
 #else /* CONFIG_SMART */
 static inline void build_smart_topology(void)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 13/19] smart: nosmart boot option
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
                     ` (10 preceding siblings ...)
  2014-09-04 16:30   ` [PATCH 12/19] smart: cgroup interface for smart klamm
@ 2014-09-04 16:30   ` klamm
  2014-09-04 16:31   ` [PATCH 14/19] smart: smart-related sysctl's klamm
                     ` (5 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:30 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

This patch introduces the nosmart boot option, which disables smart
globally at boot time.
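
For example, appending the option to the kernel command line disables
smart from boot (a hypothetical GRUB entry; only the trailing "nosmart"
matters here):

  linux /boot/vmlinuz-3.10 root=/dev/sda1 ro quiet nosmart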

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 kernel/sched/rt.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a3fd83c..ff7751a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2319,6 +2319,13 @@ static int cpu_with_number(int number, cpumask_t *mask)
 	return -1;
 }
 
+static int __init nosmart_setup(char *str)
+{
+	static_key_slow_dec(&__smart_enabled);
+	return 0;
+}
+early_param("nosmart", nosmart_setup);
+
 void build_smart_topology(void)
 {
 	int cpu;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 14/19] smart: smart-related sysctl's
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
                     ` (11 preceding siblings ...)
  2014-09-04 16:30   ` [PATCH 13/19] smart: nosmart boot option klamm
@ 2014-09-04 16:31   ` klamm
  2014-09-04 16:31   ` [PATCH 15/19] smart: decrease default rt_period/rt_runtime values klamm
                     ` (4 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:31 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

This patch introduces a sysctl interface for smart configuration; the
default values are given in parentheses, and a usage example follows
the list:
sched_smart_enable (1) - enable/disable smart globally
sched_smart_prio (10) - rt priority of tasks in a cgroup with smart set
sched_smart_gathering (1) - enable/disable smart gathering
sched_smart_throttle (1) - enable/disable throttling of CFS tasks
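
Since the entries are added to kern_table, they should appear under
/proc/sys/kernel/, so smart can be tuned at runtime roughly like this
(a sketch, not a tested transcript):

  echo 0  > /proc/sys/kernel/sched_smart_enable   # disable smart globally
  echo 20 > /proc/sys/kernel/sched_smart_prio     # rt prio used for smart cgroups
  echo 1  > /proc/sys/kernel/sched_smart_enable   # re-enable smart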

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 kernel/sched/rt.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c   | 39 ++++++++++++++++++++++++++++
 2 files changed, 115 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ff7751a..21ffe09 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2247,6 +2247,82 @@ void print_rt_stats(struct seq_file *m, int cpu)
 #endif /* CONFIG_SCHED_DEBUG */
 
 #ifdef CONFIG_SMART
+int proc_smart_enable(struct ctl_table *table, int write,
+		      void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int enable = static_key_enabled(&__smart_enabled);
+	int ret;
+
+	build_smart_topology();
+
+	mutex_lock(&smart_mutex);
+
+	if (write && !static_key_false(&__smart_initialized)) {
+		ret = -EBUSY;
+		goto exit;
+	}
+
+	table->data = &enable;
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+	if (ret)
+		goto exit;
+
+	if (write) {
+		if (enable && !static_key_enabled(&__smart_enabled)) {
+			static_key_slow_inc(&__smart_enabled);
+			smart_update_globally();
+		}
+
+		if (!enable && static_key_enabled(&__smart_enabled)) {
+			static_key_slow_dec(&__smart_enabled);
+			smart_update_globally();
+		}
+	}
+
+exit:
+	mutex_unlock(&smart_mutex);
+
+	return ret;
+}
+
+int proc_smart_static_key(struct ctl_table *table, int write,
+			  void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct static_key *key;
+	int enable;
+	int ret;
+	static DEFINE_MUTEX(mutex);
+
+	mutex_lock(&mutex);
+
+	key = table->data;
+	enable = static_key_enabled(key);
+
+	if (write && !static_key_false(&__smart_initialized)) {
+		ret = -EBUSY;
+		goto exit;
+	}
+
+	table->data = &enable;
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+	if (ret)
+		goto exit;
+
+	if (write) {
+		if (enable && !static_key_enabled(key))
+			static_key_slow_inc(key);
+
+		if (!enable && static_key_enabled(key))
+			static_key_slow_dec(key);
+	}
+
+exit:
+	table->data = key;
+	mutex_unlock(&mutex);
+
+	return ret;
+}
+
 int check_smart_data(void)
 {
 	int cpu, core;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a1e71e9..27eb548 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -128,6 +128,7 @@ static int __maybe_unused one = 1;
 static int __maybe_unused two = 2;
 static int __maybe_unused three = 3;
 static unsigned long one_ul = 1;
+static int ninety_eight = 98;
 static int one_hundred = 100;
 #ifdef CONFIG_PRINTK
 static int ten_thousand = 10000;
@@ -178,6 +179,13 @@ extern int no_unaligned_warning;
 #ifdef CONFIG_SMART
 struct static_key smart_cfs_gather = STATIC_KEY_INIT_TRUE;
 struct static_key smart_cfs_throttle = STATIC_KEY_INIT_TRUE;
+
+extern int sched_smart_prio;
+
+int proc_smart_static_key(struct ctl_table *table, int write,
+			  void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_smart_enable(struct ctl_table *table, int write,
+		      void __user *buffer, size_t *lenp, loff_t *ppos);
 #endif
 
 #ifdef CONFIG_PROC_SYSCTL
@@ -446,6 +454,37 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &one,
 	},
 #endif
+#ifdef CONFIG_SMART
+	{
+		.procname	= "sched_smart_enable",
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_smart_enable,
+	},
+	{
+		.procname	= "sched_smart_prio",
+		.data		= &sched_smart_prio,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+		.extra2		= &ninety_eight,
+	},
+	{
+		.procname	= "sched_smart_gathering",
+		.data		= &smart_cfs_gather,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_smart_static_key,
+	},
+	{
+		.procname	= "sched_smart_throttle",
+		.data		= &smart_cfs_throttle,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_smart_static_key,
+	},
+#endif
 #ifdef CONFIG_PROVE_LOCKING
 	{
 		.procname	= "prove_locking",
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 15/19] smart: decrease default rt_period/rt_runtime values
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
                     ` (12 preceding siblings ...)
  2014-09-04 16:31   ` [PATCH 14/19] smart: smart-related sysctl's klamm
@ 2014-09-04 16:31   ` klamm
  2014-09-04 16:31   ` [PATCH 16/19] smart: change default RR timeslice to 5ms klamm
                     ` (3 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:31 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

By default, rt_period is 1s and rt_runtime is 0.95s. These values are
too coarse to achieve acceptable system responsiveness under high rt
load. Decrease them to 100ms and 99ms respectively.
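
To put rough numbers on it: with the stock values, non-rt tasks are
only guaranteed a single 50ms window once per second, so under full rt
load CFS work can be starved for up to ~0.95s at a time. With the
100ms/99ms defaults the unthrottled window comes around every 100ms,
which keeps the system responsive even though the total CFS share
drops from 5% to 1%.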

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 kernel/sched/core.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0f25fe0..a1b116e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -282,19 +282,26 @@ const_debug unsigned int sysctl_sched_nr_migrate = 32;
  */
 const_debug unsigned int sysctl_sched_time_avg = MSEC_PER_SEC;
 
+#ifdef CONFIG_SMART
+#define DEFAULT_SCHED_RT_PERIOD 100000
+#define DEFAULT_SCHED_RT_RUNTIME 99000
+
+#else
+#define DEFAULT_SCHED_RT_PERIOD 1000000
+#define DEFAULT_SCHED_RT_RUNTIME 950000
+#endif
+
 /*
  * period over which we measure -rt task cpu usage in us.
- * default: 1s
  */
-unsigned int sysctl_sched_rt_period = 1000000;
+unsigned int sysctl_sched_rt_period = DEFAULT_SCHED_RT_PERIOD;
 
 __read_mostly int scheduler_running;
 
 /*
  * part of the period that we allow rt tasks to run in us.
- * default: 0.95s
  */
-int sysctl_sched_rt_runtime = 950000;
+int sysctl_sched_rt_runtime = DEFAULT_SCHED_RT_RUNTIME;
 
 
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 16/19] smart: change default RR timeslice to 5ms
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
                     ` (13 preceding siblings ...)
  2014-09-04 16:31   ` [PATCH 15/19] smart: decrease default rt_period/rt_runtime values klamm
@ 2014-09-04 16:31   ` klamm
  2014-09-04 16:31   ` [PATCH 17/19] smart: disable RT runtime sharing klamm
                     ` (2 subsequent siblings)
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:31 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

By default, the SCHED_RR timeslice is 100ms. Change it to 5ms to
achieve better responsiveness.
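
Note that RR_TIMESLICE is expressed in jiffies: 5 * HZ / 1000 is 5
ticks at CONFIG_HZ=1000 and 1 tick at HZ=250, while with integer
division it evaluates to 0 at HZ=100, so low-HZ configurations
presumably need a different constant.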

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 include/linux/sched/rt.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index 440434d..988eae8 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -59,6 +59,10 @@ extern void normalize_rt_tasks(void);
  * default timeslice is 100 msecs (used only for SCHED_RR tasks).
  * Timeslices get refilled after they expire.
  */
+#ifdef CONFIG_SMART
+#define RR_TIMESLICE		(5 * HZ / 1000)
+#else
 #define RR_TIMESLICE		(100 * HZ / 1000)
+#endif
 
 #endif /* _SCHED_RT_H */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 17/19] smart: disable RT runtime sharing
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
                     ` (14 preceding siblings ...)
  2014-09-04 16:31   ` [PATCH 16/19] smart: change default RR timeslice to 5ms klamm
@ 2014-09-04 16:31   ` klamm
  2014-09-04 16:31   ` [PATCH 18/19] hrtimers: calculate expires_next after all timers are executed klamm
  2014-09-04 16:31   ` [PATCH 19/19] smart: add description to the README file klamm
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:31 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

This patch disables runtime sharing for rt tasks (RT_RUNTIME_SHARE).
With runtime sharing enabled, a busy runqueue can borrow unused rt
runtime from other CPUs and end up with no CFS window at all. Disabling
it is necessary to guarantee that CFS tasks (including system threads
like rcu workers) regularly get a time slice on each cpu even when the
system is heavily loaded with rt tasks.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 kernel/sched/features.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 99399f8..0945d38 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,7 +57,7 @@ SCHED_FEAT(NONTASK_POWER, true)
 SCHED_FEAT(TTWU_QUEUE, true)
 
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
-SCHED_FEAT(RT_RUNTIME_SHARE, true)
+SCHED_FEAT(RT_RUNTIME_SHARE, false)
 SCHED_FEAT(LB_MIN, false)
 
 /*
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 18/19] hrtimers: calculate expires_next after all timers are executed
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
                     ` (15 preceding siblings ...)
  2014-09-04 16:31   ` [PATCH 17/19] smart: disable RT runtime sharing klamm
@ 2014-09-04 16:31   ` klamm
  2014-09-04 16:31   ` [PATCH 19/19] smart: add description to the README file klamm
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:31 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev

From: Stanislav Fomichev <stfomichev@yandex-team.ru>

I think I'm hitting a particularly subtle issue with a NOHZ_IDLE kernel.

The sequence is as follows:
- the CPU enters idle, we disable the tick
- the hrtimer interrupt fires (for hrtimer_wakeup)
- for clock base #1 (REALTIME) we wake up a SCHED_RT thread and start
  the RT period timer (from start_rt_bandwidth) on clock base #0 (MONOTONIC)
- because we have already checked the expiry time for clock base #0,
  we end up programming the wrong wake-up time (minutes away, from tick_sched_timer)
- then we exit the idle loop and restart the tick;
  but because tick_sched_timer is not the leftmost timer (the leftmost
  one is the RT period timer) we don't program it

So in the end, I see a working CPU without a tick interrupt.
This eventually leads to an RCU stall on that CPU: rcu_gp_kthread
is not woken up because there is no tick (this is the reason
I started digging into this).

The proposed fix runs the expired timers first and only then looks for
the next expiry time across all clock bases.

Signed-off-by: Stanislav Fomichev <stfomichev@yandex-team.ru>
---
 include/linux/hrtimer.h |  2 ++
 kernel/hrtimer.c        | 41 +++++++++++++++++++++++++++++++----------
 2 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index d19a5c2..520a671 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -141,6 +141,7 @@ struct hrtimer_sleeper {
  * @get_time:		function to retrieve the current time of the clock
  * @softirq_time:	the time when running the hrtimer queue in the softirq
  * @offset:		offset of this clock to the monotonic base
+ * @next:		time of the next event on this clock base
  */
 struct hrtimer_clock_base {
 	struct hrtimer_cpu_base	*cpu_base;
@@ -151,6 +152,7 @@ struct hrtimer_clock_base {
 	ktime_t			(*get_time)(void);
 	ktime_t			softirq_time;
 	ktime_t			offset;
+	ktime_t			next;
 };
 
 enum  hrtimer_base_type {
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index aadf4b7..f4c028d 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -900,6 +900,8 @@ EXPORT_SYMBOL_GPL(hrtimer_forward);
 static int enqueue_hrtimer(struct hrtimer *timer,
 			   struct hrtimer_clock_base *base)
 {
+	ktime_t expires;
+
 	debug_activate(timer);
 
 	timerqueue_add(&base->active, &timer->node);
@@ -911,6 +913,10 @@ static int enqueue_hrtimer(struct hrtimer *timer,
 	 */
 	timer->state |= HRTIMER_STATE_ENQUEUED;
 
+	expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
+	if (ktime_compare(base->next, expires) > 0)
+		base->next = expires;
+
 	return (&timer->node == base->active.next);
 }
 
@@ -947,8 +953,10 @@ static void __remove_hrtimer(struct hrtimer *timer,
 		}
 #endif
 	}
-	if (!timerqueue_getnext(&base->active))
+	if (!timerqueue_getnext(&base->active)) {
 		base->cpu_base->active_bases &= ~(1 << base->index);
+		base->next = ktime_set(KTIME_SEC_MAX, 0);
+	}
 out:
 	timer->state = newstate;
 }
@@ -1300,6 +1308,7 @@ static void __run_hrtimer(struct hrtimer *timer, ktime_t *now)
  */
 void hrtimer_interrupt(struct clock_event_device *dev)
 {
+	struct hrtimer_clock_base *base;
 	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
 	ktime_t expires_next, now, entry_time, delta;
 	int i, retries = 0;
@@ -1322,7 +1331,6 @@ retry:
 	cpu_base->expires_next.tv64 = KTIME_MAX;
 
 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
-		struct hrtimer_clock_base *base;
 		struct timerqueue_node *node;
 		ktime_t basenow;
 
@@ -1351,14 +1359,8 @@ retry:
 			 */
 
 			if (basenow.tv64 < hrtimer_get_softexpires_tv64(timer)) {
-				ktime_t expires;
-
-				expires = ktime_sub(hrtimer_get_expires(timer),
-						    base->offset);
-				if (expires.tv64 < 0)
-					expires.tv64 = KTIME_MAX;
-				if (expires.tv64 < expires_next.tv64)
-					expires_next = expires;
+				base->next = ktime_sub(hrtimer_get_expires(timer),
+						       base->offset);
 				break;
 			}
 
@@ -1367,6 +1369,25 @@ retry:
 	}
 
 	/*
+	 * Because timer handler may add new timer on a different clock base,
+	 * we need to find next expiry only after we execute all timers.
+	 */
+	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
+		ktime_t expires;
+
+		if (!(cpu_base->active_bases & (1 << i)))
+			continue;
+
+		base = cpu_base->clock_base + i;
+		expires = base->next;
+
+		if (expires.tv64 < 0)
+			expires.tv64 = KTIME_MAX;
+		if (expires.tv64 < expires_next.tv64)
+			expires_next = expires;
+	}
+
+	/*
 	 * Store the new expiry value so the migration code can verify
 	 * against it.
 	 */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 19/19] smart: add description to the README file
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
                     ` (16 preceding siblings ...)
  2014-09-04 16:31   ` [PATCH 18/19] hrtimers: calculate expires_next after all timers are executed klamm
@ 2014-09-04 16:31   ` klamm
  17 siblings, 0 replies; 24+ messages in thread
From: klamm @ 2014-09-04 16:31 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: stfomichev, Roman Gushchin

From: Roman Gushchin <klamm@yandex-team.ru>

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
---
 README | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/README b/README
index a24ec89..c8af02d 100644
--- a/README
+++ b/README
@@ -1,3 +1,59 @@
+        SMART (Simultaneous Multithreading-Aware Real-Time)
+
+This commit series adds SMT awareness to Linux RT scheduler.
+
+SMART is aimed at realtime CPU-bound applications which run for a
+short period of time (tens of milliseconds) and expect no interruption
+(basically anything that is user-facing and should finish execution as
+quickly as possible).
+
+On our memory-hot search clusters with thousands of machines we see
+10-15% more RPS compared to CFS. The largest gains are seen when
+machine load is around 50%.
+
+While we aim for full CPU utilization with SMART, some care is still
+taken of CFS tasks, because it's crucial to be able to ssh into a fully
+loaded machine.
+
+The Linux realtime scheduler uses the cpupri structure to distribute
+tasks among CPU cores; it doesn't take topology into account and thus
+can make very poor scheduling decisions. SMART completely abandons the
+cpupri mechanism, tries to balance RT load between NUMA nodes and
+avoids placing tasks (if possible) on adjacent SMT threads of a core.
+
+Among other features are:
+- SMART pull: if all SMT threads of a single core execute RT tasks, we
+  migrate them whenever a free core becomes available
+- CFS throttling: if some core has RT tasks running, we don't enqueue
+  CFS tasks on the other SMT threads of this core
+- SMART gathering: we migrate running CFS tasks from all SMT threads
+  of a core that don't have RT tasks, to the thread that has
+
+Usage:
+
+SMART is just an improvement on the classic Linux realtime scheduler,
+so all tasks that have the SCHED_FIFO or SCHED_RR policy use SMART.
+From the command line a task's scheduling policy can be changed with
+the chrt command.
+
+Also, we provide a cpu.smart knob in the cpu cgroup which sets the
+scheduling policy of all tasks in the cgroup to SCHED_RR.
+
+
+When not to use SMART (you can still use it, but don't expect any profit):
+- target application is not bursty and its execution takes second(s)
+- CPU utilization of target machine approaches 100%
+- target application is true real-time application and you do care
+  about latency measured in microseconds
+
+
+Copyright (c) 2013-2014, Yandex LLC
+
+Authors: Roman Gushchin (klamm@yandex-team.ru),
+	 Stanislav Fomichev (stfomichev@yandex-team.ru)
+
+--------------------------------------------------------------------------------
+
         Linux kernel release 3.x <http://kernel.org/>
 
 These are the release notes for Linux version 3.  Read them carefully,
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC] smt-aware rt load balancer
  2014-09-04 12:20 [RFC] smt-aware rt load balancer Roman Gushchin
  2014-09-04 15:40 ` Peter Zijlstra
  2014-09-04 16:30 ` [PATCH 01/19] smart: define and build per-core data structures klamm
@ 2014-09-04 16:49 ` Peter Zijlstra
  2014-09-04 17:06   ` Roman Gushchin
  2 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2014-09-04 16:49 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: mingo, Kirill Tkhai, LKML, Stanislav Fomichev


No, we're not going to merge a second rt balancer. If you want to change
it change the one that is there. But it needs to remain a valid
_Real_Time_ balancer at all times.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] smt-aware rt load balancer
  2014-09-04 16:49 ` [RFC] smt-aware rt load balancer Peter Zijlstra
@ 2014-09-04 17:06   ` Roman Gushchin
  0 siblings, 0 replies; 24+ messages in thread
From: Roman Gushchin @ 2014-09-04 17:06 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, Kirill Tkhai, LKML, Stanislav Fomichev

I'm not aiming to get it merged right now (and that's why I sent a link to the repository instead of patches),
but there are probably people here who can try it and save a bit of hardware.

Regards,
Roman


04.09.2014, 20:49, "Peter Zijlstra" <peterz@infradead.org>:
> No, we're not going to merge a second rt balancer. If you want to change
> it change the one that is there. But it needs to remain a valid
> _Real_Time_ balancer at all times.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 06/19] smart: use CPU selection logic if smart is enabled
  2014-09-04 16:30   ` [PATCH 06/19] smart: use CPU selection logic if smart is enabled klamm
@ 2014-09-21 17:31     ` Pavel Machek
  0 siblings, 0 replies; 24+ messages in thread
From: Pavel Machek @ 2014-09-21 17:31 UTC (permalink / raw)
  To: klamm; +Cc: peterz, mingo, linux-kernel, stfomichev

On Thu 2014-09-04 20:30:52, klamm@yandex-team.ru wrote:
> From: Roman Gushchin <klamm@yandex-team.ru>
> 
> This patch causes rt scheduler to use smart CPU selection logic,
> if smart_enabled() returns true.

SMART is a technology to report HDD health, so while it is a cool name, it might
be better to use a different name to avoid confusion.

									Pavel

> @@ -1262,8 +1262,12 @@ int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
>  	 *   not worry about this generic constraint ]
>  	 */
>  	if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
> -		     !cpu_online(cpu)))
> +		     !cpu_online(cpu))) {
> +		if (smart_enabled() && task_has_rt_policy(p) && cpu >= 0)
> +			release_core(cpu);
> +
>  		cpu = select_fallback_rq(task_cpu(p), p);
> +	}
>  
>  	return cpu;
>  }
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 805951b..1993c47 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -15,6 +15,14 @@ struct static_key __smart_enabled = STATIC_KEY_INIT_TRUE;
>  DEFINE_MUTEX(smart_mutex);
>  
>  DEFINE_PER_CPU_SHARED_ALIGNED(struct smart_core_data, smart_core_data);
> +
> +static int smart_find_lowest_rq(struct task_struct *task, bool wakeup);
> +
> +#else /* CONFIG_SMART */
> +static inline int smart_find_lowest_rq(struct task_struct *task, bool wakeup)
> +{
> +	return -1;
> +}
>  #endif /* CONFIG_SMART */
>  
>  int sched_rr_timeslice = RR_TIMESLICE;
> @@ -1211,6 +1219,7 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
>  		enqueue_pushable_task(rq, p);
>  
>  	inc_nr_running(rq);
> +	release_core(cpu_of(rq));
>  }
>  
>  static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
> @@ -1278,6 +1287,13 @@ select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
>  	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
>  		goto out;
>  
> +	if (smart_enabled()) {
> +		int target = smart_find_lowest_rq(p, true);
> +		if (likely(target != -1))
> +			cpu = target;
> +		goto out;
> +	}
> +
>  	rq = cpu_rq(cpu);
>  
>  	rcu_read_lock();
> @@ -1580,10 +1596,17 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
>  	int cpu;
>  
>  	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
> -		cpu = find_lowest_rq(task);
> +		if (smart_enabled())
> +			cpu = smart_find_lowest_rq(task, false);
> +		else
> +			cpu = find_lowest_rq(task);
> +
> +		if ((cpu == -1) || (cpu == rq->cpu)) {
> +			if (cpu == rq->cpu)
> +				release_core(cpu);
>  
> -		if ((cpu == -1) || (cpu == rq->cpu))
>  			break;
> +		}
>  
>  		lowest_rq = cpu_rq(cpu);
>  
> @@ -1602,6 +1625,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
>  				     !task->on_rq)) {
>  
>  				double_unlock_balance(rq, lowest_rq);
> +				release_core(cpu);
>  				lowest_rq = NULL;
>  				break;
>  			}
> @@ -1614,6 +1638,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
>  		/* try again */
>  		double_unlock_balance(rq, lowest_rq);
>  		lowest_rq = NULL;
> +		release_core(cpu);
>  	}
>  
>  	return lowest_rq;
> -- 
> 1.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

