* [RFC/PATCH 0/4] CPUSET driven CPU isolation
@ 2008-02-27 22:21 Peter Zijlstra
  2008-02-27 22:21 ` [RFC/PATCH 1/4] sched: remove isolcpus Peter Zijlstra
                   ` (7 more replies)
  0 siblings, 8 replies; 94+ messages in thread
From: Peter Zijlstra @ 2008-02-27 22:21 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, Max Krasnyanskiy
  Cc: linux-kernel, Peter Zijlstra

My vision on the direction we should take wrt cpu isolation.

Next on the list would be figuring out a nice solution to the workqueue
flush issue.
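
For illustration, the intended flow would be something like: mkdir
/dev/cpuset/boot; echo 0-2 > /dev/cpuset/boot/cpus; echo 0 >
/dev/cpuset/system (paths assume the cpuset fs is mounted on /dev/cpuset;
the 'system' file comes from patch 2/4, and new sets have it set by
default). cpu_system_map then shrinks to cpus 0-2 and the notifier chain
pushes unbound IRQs and kernel threads off cpu 3.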



* [RFC/PATCH 1/4] sched: remove isolcpus
  2008-02-27 22:21 [RFC/PATCH 0/4] CPUSET driven CPU isolation Peter Zijlstra
@ 2008-02-27 22:21 ` Peter Zijlstra
  2008-02-27 23:57   ` Max Krasnyanskiy
  2008-02-27 22:21 ` [RFC/PATCH 2/4] cpuset: system sets Peter Zijlstra
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 94+ messages in thread
From: Peter Zijlstra @ 2008-02-27 22:21 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, Max Krasnyanskiy
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-remove-isol.patch --]
[-- Type: text/plain, Size: 2054 bytes --]

cpu isolation doesn't offer anything over cpusets, hence remove it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   24 +++---------------------
 1 file changed, 3 insertions(+), 21 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -6217,24 +6217,6 @@ cpu_attach_domain(struct sched_domain *s
 	rcu_assign_pointer(rq->sd, sd);
 }
 
-/* cpus with isolated domains */
-static cpumask_t cpu_isolated_map = CPU_MASK_NONE;
-
-/* Setup the mask of cpus configured for isolated domains */
-static int __init isolated_cpu_setup(char *str)
-{
-	int ints[NR_CPUS], i;
-
-	str = get_options(str, ARRAY_SIZE(ints), ints);
-	cpus_clear(cpu_isolated_map);
-	for (i = 1; i <= ints[0]; i++)
-		if (ints[i] < NR_CPUS)
-			cpu_set(ints[i], cpu_isolated_map);
-	return 1;
-}
-
-__setup("isolcpus=", isolated_cpu_setup);
-
 /*
  * init_sched_build_groups takes the cpumask we wish to span, and a pointer
  * to a function which identifies what group(along with sched group) a CPU
@@ -6856,7 +6838,7 @@ static int arch_init_sched_domains(const
 	doms_cur = kmalloc(sizeof(cpumask_t), GFP_KERNEL);
 	if (!doms_cur)
 		doms_cur = &fallback_doms;
-	cpus_andnot(*doms_cur, *cpu_map, cpu_isolated_map);
+	*doms_cur = *cpu_map;
 	err = build_sched_domains(doms_cur);
 	register_sched_domain_sysctl();
 
@@ -6917,7 +6899,7 @@ void partition_sched_domains(int ndoms_n
 	if (doms_new == NULL) {
 		ndoms_new = 1;
 		doms_new = &fallback_doms;
-		cpus_andnot(doms_new[0], cpu_online_map, cpu_isolated_map);
+		doms_new[0] = cpu_online_map;
 	}
 
 	/* Destroy deleted domains */
@@ -7076,7 +7058,7 @@ void __init sched_init_smp(void)
 
 	get_online_cpus();
 	arch_init_sched_domains(&cpu_online_map);
-	cpus_andnot(non_isolated_cpus, cpu_possible_map, cpu_isolated_map);
+	non_isolated_cpus = cpu_possible_map;
 	if (cpus_empty(non_isolated_cpus))
 		cpu_set(smp_processor_id(), non_isolated_cpus);
 	put_online_cpus();

--



* [RFC/PATCH 2/4] cpuset: system sets
  2008-02-27 22:21 [RFC/PATCH 0/4] CPUSET driven CPU isolation Peter Zijlstra
  2008-02-27 22:21 ` [RFC/PATCH 1/4] sched: remove isolcpus Peter Zijlstra
@ 2008-02-27 22:21 ` Peter Zijlstra
  2008-02-27 23:39   ` Paul Jackson
  2008-02-27 23:52   ` Max Krasnyanskiy
  2008-02-27 22:21 ` [RFC/PATCH 3/4] genirq: system set irq affinities Peter Zijlstra
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 94+ messages in thread
From: Peter Zijlstra @ 2008-02-27 22:21 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, Max Krasnyanskiy
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: cpuset-system.patch --]
[-- Type: text/plain, Size: 9243 bytes --]

Introduce the notion of a System set. A system set will be one that caters to the
general-purpose OS. This patch provides the infrastructure, but doesn't
actually provide any new functionality.

Typical functionality would be setting the IRQ affinity of unbound IRQs to
within the system set, and setting the affinity of unbound kernel threads to
within the system set.

Future patches will provide this.
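
For illustration, a consumer of this infrastructure would look roughly like
the sketch below (the my_subsys_* names are made up; patches 3/4 do exactly
this for unbound IRQs and kthreads):

	static int my_subsys_notifier(struct notifier_block *nb,
			unsigned long action, void *cpus)
	{
		cpumask_t *new_system_map = cpus;

		/* re-home whatever this subsystem owns into *new_system_map */
		return NOTIFY_OK;
	}

	static struct notifier_block my_subsys_nb = {
		.notifier_call = my_subsys_notifier,
	};

	/* somewhere in the subsystem's init code: */
	blocking_notifier_chain_register(&system_map_notifier, &my_subsys_nb);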

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/cpumask.h |    5 ++
 include/linux/cpuset.h  |    2 
 kernel/cpuset.c         |  115 +++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched.c          |    3 +
 4 files changed, 124 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/cpuset.c
===================================================================
--- linux-2.6.orig/kernel/cpuset.c
+++ linux-2.6/kernel/cpuset.c
@@ -64,6 +64,7 @@
  * short circuit some hooks.
  */
 int number_of_cpusets __read_mostly;
+int number_of_system_sets __read_mostly;
 
 /* Forward declare cgroup structures */
 struct cgroup_subsys cpuset_subsys;
@@ -128,6 +129,7 @@ typedef enum {
 	CS_SCHED_LOAD_BALANCE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_SYSTEM,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -161,6 +163,11 @@ static inline int is_spread_slab(const s
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
+static inline int is_system(const struct cpuset *cs)
+{
+	return test_bit(CS_SYSTEM, &cs->flags);
+}
+
 /*
  * Increment this integer everytime any cpuset changes its
  * mems_allowed value.  Users of cpusets can track this generation
@@ -183,7 +190,9 @@ static inline int is_spread_slab(const s
 static int cpuset_mems_generation;
 
 static struct cpuset top_cpuset = {
-	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
+	.flags = ((1 << CS_CPU_EXCLUSIVE) |
+		  (1 << CS_MEM_EXCLUSIVE) |
+		  (1 << CS_SYSTEM)),
 	.cpus_allowed = CPU_MASK_ALL,
 	.mems_allowed = NODE_MASK_ALL,
 };
@@ -465,6 +474,9 @@ static int validate_change(const struct 
 		}
 	}
 
+	if (number_of_system_sets == 1 && is_system(cur) && !is_system(trial))
+		return -EINVAL;
+
 	return 0;
 }
 
@@ -1011,6 +1023,74 @@ static int update_memory_pressure_enable
 	return 0;
 }
 
+BLOCKING_NOTIFIER_HEAD(system_map_notifier);
+EXPORT_SYMBOL_GPL(system_map_notifier);
+
+int cpus_match_system(cpumask_t mask)
+{
+	cpumask_t online_system, online_mask;
+
+	cpus_and(online_system, cpu_system_map, cpu_online_map);
+	cpus_and(online_mask, mask, cpu_online_map);
+
+	return cpus_equal(online_system, online_mask);
+}
+
+static void rebuild_system_map(void)
+{
+	cpumask_t *new_system_map;
+	struct kfifo *q = NULL;
+	struct cpuset *cp;
+
+	new_system_map = kmalloc(sizeof(cpumask_t), GFP_KERNEL);
+	if (!new_system_map)
+		return;
+
+	if (is_system(&top_cpuset)) {
+		cpus_setall(*new_system_map);
+		goto notify;
+	}
+
+	cpus_clear(*new_system_map);
+
+	q = kfifo_alloc(number_of_cpusets * sizeof(cp), GFP_KERNEL, NULL);
+	if (IS_ERR(q))
+		goto done;
+
+	cp = &top_cpuset;
+	__kfifo_put(q, (void *)&cp, sizeof(cp));
+	while (__kfifo_get(q, (void *)&cp, sizeof(cp))) {
+		struct cgroup *cont;
+		struct cpuset *child;
+
+		if (is_system(cp)) {
+			cpus_or(*new_system_map,
+					*new_system_map, cp->cpus_allowed);
+			continue;
+		}
+
+		list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
+			child = cgroup_cs(cont);
+			__kfifo_put(q, (void *)&child, sizeof(cp));
+		}
+	}
+
+	if (cpus_empty(*new_system_map))
+		BUG();
+
+notify:
+	if (!cpus_match_system(*new_system_map)) {
+		blocking_notifier_call_chain(&system_map_notifier, 0,
+				new_system_map);
+	}
+	cpu_system_map = *new_system_map;
+
+done:
+	kfree(new_system_map);
+	if (q && !IS_ERR(q))
+		kfifo_free(q);
+}
+
 /*
  * update_flag - read a 0 or a 1 in a file and update associated flag
  * bit:	the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE,
@@ -1029,6 +1109,7 @@ static int update_flag(cpuset_flagbits_t
 	struct cpuset trialcs;
 	int err;
 	int cpus_nonempty, balance_flag_changed;
+	int system_flag_changed;
 
 	turning_on = (simple_strtoul(buf, NULL, 10) != 0);
 
@@ -1045,6 +1126,7 @@ static int update_flag(cpuset_flagbits_t
 	cpus_nonempty = !cpus_empty(trialcs.cpus_allowed);
 	balance_flag_changed = (is_sched_load_balance(cs) !=
 		 			is_sched_load_balance(&trialcs));
+	system_flag_changed = (is_system(cs) != is_system(&trialcs));
 
 	mutex_lock(&callback_mutex);
 	cs->flags = trialcs.flags;
@@ -1053,6 +1135,15 @@ static int update_flag(cpuset_flagbits_t
 	if (cpus_nonempty && balance_flag_changed)
 		rebuild_sched_domains();
 
+	if (system_flag_changed) {
+		rebuild_system_map();
+
+		if (is_system(cs))
+			number_of_system_sets++;
+		else
+			number_of_system_sets--;
+	}
+
 	return 0;
 }
 
@@ -1206,6 +1297,7 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_SYSTEM,
 } cpuset_filetype_t;
 
 static ssize_t cpuset_common_file_write(struct cgroup *cont,
@@ -1273,6 +1365,9 @@ static ssize_t cpuset_common_file_write(
 		retval = update_flag(CS_SPREAD_SLAB, cs, buffer);
 		cs->mems_generation = cpuset_mems_generation++;
 		break;
+	case FILE_SYSTEM:
+		retval = update_flag(CS_SYSTEM, cs, buffer);
+		break;
 	default:
 		retval = -EINVAL;
 		goto out2;
@@ -1369,6 +1464,9 @@ static ssize_t cpuset_common_file_read(s
 	case FILE_SPREAD_SLAB:
 		*s++ = is_spread_slab(cs) ? '1' : '0';
 		break;
+	case FILE_SYSTEM:
+		*s++ = is_system(cs) ? '1' : '0';
+		break;
 	default:
 		retval = -EINVAL;
 		goto out;
@@ -1459,6 +1557,13 @@ static struct cftype cft_spread_slab = {
 	.private = FILE_SPREAD_SLAB,
 };
 
+static struct cftype cft_system = {
+	.name = "system",
+	.read = cpuset_common_file_read,
+	.write = cpuset_common_file_write,
+	.private = FILE_SYSTEM,
+};
+
 static int cpuset_populate(struct cgroup_subsys *ss, struct cgroup *cont)
 {
 	int err;
@@ -1481,6 +1586,8 @@ static int cpuset_populate(struct cgroup
 		return err;
 	if ((err = cgroup_add_file(cont, ss, &cft_spread_slab)) < 0)
 		return err;
+	if ((err = cgroup_add_file(cont, ss, &cft_system)) < 0)
+		return err;
 	/* memory_pressure_enabled is in root cpuset only */
 	if (err == 0 && !cont->parent)
 		err = cgroup_add_file(cont, ss,
@@ -1555,6 +1662,7 @@ static struct cgroup_subsys_state *cpuse
 	if (is_spread_slab(parent))
 		set_bit(CS_SPREAD_SLAB, &cs->flags);
 	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
+	set_bit(CS_SYSTEM, &cs->flags);
 	cs->cpus_allowed = CPU_MASK_NONE;
 	cs->mems_allowed = NODE_MASK_NONE;
 	cs->mems_generation = cpuset_mems_generation++;
@@ -1562,6 +1670,7 @@ static struct cgroup_subsys_state *cpuse
 
 	cs->parent = parent;
 	number_of_cpusets++;
+	number_of_system_sets++;
 	return &cs->css ;
 }
 
@@ -1585,8 +1694,11 @@ static void cpuset_destroy(struct cgroup
 
 	if (is_sched_load_balance(cs))
 		update_flag(CS_SCHED_LOAD_BALANCE, cs, "0");
+	if (!is_system(cs))
+		update_flag(CS_SYSTEM, cs, "1");
 
 	number_of_cpusets--;
+	number_of_system_sets--;
 	kfree(cs);
 }
 
@@ -1637,6 +1749,7 @@ int __init cpuset_init(void)
 		return err;
 
 	number_of_cpusets = 1;
+	number_of_system_sets = 1;
 	return 0;
 }
 
Index: linux-2.6/include/linux/cpumask.h
===================================================================
--- linux-2.6.orig/include/linux/cpumask.h
+++ linux-2.6/include/linux/cpumask.h
@@ -380,6 +380,7 @@ static inline void __cpus_remap(cpumask_
 extern cpumask_t cpu_possible_map;
 extern cpumask_t cpu_online_map;
 extern cpumask_t cpu_present_map;
+extern cpumask_t cpu_system_map;
 
 #if NR_CPUS > 1
 #define num_online_cpus()	cpus_weight(cpu_online_map)
@@ -388,6 +389,7 @@ extern cpumask_t cpu_present_map;
 #define cpu_online(cpu)		cpu_isset((cpu), cpu_online_map)
 #define cpu_possible(cpu)	cpu_isset((cpu), cpu_possible_map)
 #define cpu_present(cpu)	cpu_isset((cpu), cpu_present_map)
+#define cpu_system(cpu)		cpu_isset((cpu), cpu_system_map)
 #else
 #define num_online_cpus()	1
 #define num_possible_cpus()	1
@@ -395,8 +397,11 @@ extern cpumask_t cpu_present_map;
 #define cpu_online(cpu)		((cpu) == 0)
 #define cpu_possible(cpu)	((cpu) == 0)
 #define cpu_present(cpu)	((cpu) == 0)
+#define cpu_system(cpu)		((cpu) == 0)
 #endif
 
+extern int cpus_match_system(cpumask_t mask);
+
 #define cpu_is_offline(cpu)	unlikely(!cpu_online(cpu))
 
 #ifdef CONFIG_SMP
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4854,6 +4854,9 @@ asmlinkage long sys_sched_setaffinity(pi
 cpumask_t cpu_present_map __read_mostly;
 EXPORT_SYMBOL(cpu_present_map);
 
+cpumask_t cpu_system_map __read_mostly = CPU_MASK_ALL;
+EXPORT_SYMBOL(cpu_system_map);
+
 #ifndef CONFIG_SMP
 cpumask_t cpu_online_map __read_mostly = CPU_MASK_ALL;
 EXPORT_SYMBOL(cpu_online_map);
Index: linux-2.6/include/linux/cpuset.h
===================================================================
--- linux-2.6.orig/include/linux/cpuset.h
+++ linux-2.6/include/linux/cpuset.h
@@ -78,6 +78,8 @@ extern void cpuset_track_online_nodes(vo
 
 extern int current_cpuset_is_being_rebound(void);
 
+extern struct blocking_notifier_head system_map_notifier;
+
 #else /* !CONFIG_CPUSETS */
 
 static inline int cpuset_init_early(void) { return 0; }

--



* [RFC/PATCH 3/4] genirq: system set irq affinities
  2008-02-27 22:21 [RFC/PATCH 0/4] CPUSET driven CPU isolation Peter Zijlstra
  2008-02-27 22:21 ` [RFC/PATCH 1/4] sched: remove isolcpus Peter Zijlstra
  2008-02-27 22:21 ` [RFC/PATCH 2/4] cpuset: system sets Peter Zijlstra
@ 2008-02-27 22:21 ` Peter Zijlstra
  2008-02-28  0:10   ` Max Krasnyanskiy
  2008-02-27 22:21 ` [RFC/PATCH 4/4] kthread: system set kthread affinities Peter Zijlstra
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 94+ messages in thread
From: Peter Zijlstra @ 2008-02-27 22:21 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, Max Krasnyanskiy
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: cpuset-system-irq.patch --]
[-- Type: text/plain, Size: 3227 bytes --]

Keep the affinity of unbound IRQs within the system set.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/alpha/kernel/irq.c |    2 -
 include/linux/irq.h     |    7 -----
 kernel/irq/manage.c     |   62 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 63 insertions(+), 8 deletions(-)

Index: linux-2.6/arch/alpha/kernel/irq.c
===================================================================
--- linux-2.6.orig/arch/alpha/kernel/irq.c
+++ linux-2.6/arch/alpha/kernel/irq.c
@@ -51,7 +51,7 @@ select_smp_affinity(unsigned int irq)
 	if (!irq_desc[irq].chip->set_affinity || irq_user_affinity[irq])
 		return 1;
 
-	while (!cpu_possible(cpu))
+	while (!cpu_possible(cpu) || !cpu_system(cpu))
 		cpu = (cpu < (NR_CPUS-1) ? cpu + 1 : 0);
 	last_cpu = cpu;
 
Index: linux-2.6/include/linux/irq.h
===================================================================
--- linux-2.6.orig/include/linux/irq.h
+++ linux-2.6/include/linux/irq.h
@@ -253,14 +253,7 @@ static inline void set_balance_irq_affin
 }
 #endif
 
-#ifdef CONFIG_AUTO_IRQ_AFFINITY
 extern int select_smp_affinity(unsigned int irq);
-#else
-static inline int select_smp_affinity(unsigned int irq)
-{
-	return 1;
-}
-#endif
 
 extern int no_irq_affinity;
 
Index: linux-2.6/kernel/irq/manage.c
===================================================================
--- linux-2.6.orig/kernel/irq/manage.c
+++ linux-2.6/kernel/irq/manage.c
@@ -11,6 +11,8 @@
 #include <linux/module.h>
 #include <linux/random.h>
 #include <linux/interrupt.h>
+#include <linux/cpumask.h>
+#include <linux/cpuset.h>
 
 #include "internals.h"
 
@@ -488,6 +490,24 @@ void free_irq(unsigned int irq, void *de
 }
 EXPORT_SYMBOL(free_irq);
 
+#ifndef CONFIG_AUTO_IRQ_AFFINITY
+int select_smp_affinity(unsigned int irq)
+{
+	cpumask_t online_system;
+
+	if (!irq_can_set_affinity(irq))
+		return 0;
+
+	cpus_and(online_system, cpu_system_map, cpu_online_map);
+
+	set_balance_irq_affinity(irq, online_system);
+
+	irq_desc[irq].affinity = online_system;
+	irq_desc[irq].chip->set_affinity(irq, online_system);
+	return 0;
+}
+#endif
+
 /**
  *	request_irq - allocate an interrupt line
  *	@irq: Interrupt line to allocate
@@ -580,3 +600,45 @@ int request_irq(unsigned int irq, irq_ha
 	return retval;
 }
 EXPORT_SYMBOL(request_irq);
+
+#ifdef CONFIG_CPUSETS
+static int system_irq_notifier(struct notifier_block *nb,
+		unsigned long action, void *cpus)
+{
+	cpumask_t *new_system_map = (cpumask_t *)cpus;
+	int i;
+
+	for (i = 0; i < NR_IRQS; i++) {
+		struct irq_desc *desc = &irq_desc[i];
+
+		if (desc->chip == &no_irq_chip || !irq_can_set_affinity(i))
+			continue;
+
+		if (cpus_match_system(desc->affinity)) {
+			cpumask_t online_system;
+
+			cpus_and(online_system, *new_system_map, cpu_online_map);
+
+			set_balance_irq_affinity(i, online_system);
+
+			desc->affinity = online_system;
+			desc->chip->set_affinity(i, online_system);
+		}
+	}
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block fn_system_irq_notifier = {
+	.notifier_call = system_irq_notifier,
+};
+
+static int __init init_irq(void)
+{
+	blocking_notifier_chain_register(&system_map_notifier,
+			&fn_system_irq_notifier);
+	return 0;
+}
+
+module_init(init_irq);
+#endif

--



* [RFC/PATCH 4/4] kthread: system set kthread affinities
  2008-02-27 22:21 [RFC/PATCH 0/4] CPUSET driven CPU isolation Peter Zijlstra
                   ` (2 preceding siblings ...)
  2008-02-27 22:21 ` [RFC/PATCH 3/4] genirq: system set irq affinities Peter Zijlstra
@ 2008-02-27 22:21 ` Peter Zijlstra
  2008-02-27 23:38 ` [RFC/PATCH 0/4] CPUSET driven CPU isolation Max Krasnyanskiy
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 94+ messages in thread
From: Peter Zijlstra @ 2008-02-27 22:21 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, Max Krasnyanskiy
  Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: cpuset-system-kthread.patch --]
[-- Type: text/plain, Size: 2291 bytes --]

Keep the affinities of unbound kthreads within the system set.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/kthread.c |   50 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 48 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/kthread.c
===================================================================
--- linux-2.6.orig/kernel/kthread.c
+++ linux-2.6/kernel/kthread.c
@@ -13,6 +13,8 @@
 #include <linux/file.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
+#include <linux/cpumask.h>
+#include <linux/cpuset.h>
 #include <asm/semaphore.h>
 
 #define KTHREAD_NICE_LEVEL (-5)
@@ -107,7 +109,7 @@ static void create_kthread(struct kthrea
 		 */
 		sched_setscheduler(create->result, SCHED_NORMAL, &param);
 		set_user_nice(create->result, KTHREAD_NICE_LEVEL);
-		set_cpus_allowed(create->result, CPU_MASK_ALL);
+		set_cpus_allowed(create->result, cpu_system_map);
 	}
 	complete(&create->done);
 }
@@ -232,7 +234,7 @@ int kthreadd(void *unused)
 	set_task_comm(tsk, "kthreadd");
 	ignore_signals(tsk);
 	set_user_nice(tsk, KTHREAD_NICE_LEVEL);
-	set_cpus_allowed(tsk, CPU_MASK_ALL);
+	set_cpus_allowed(tsk, cpu_system_map);
 
 	current->flags |= PF_NOFREEZE;
 
@@ -260,3 +262,47 @@ int kthreadd(void *unused)
 
 	return 0;
 }
+
+#ifdef CONFIG_CPUSETS
+static int system_kthread_notifier(struct notifier_block *nb,
+		unsigned long action, void *cpus)
+{
+	cpumask_t *new_system_map = (cpumask_t *)cpus;
+	struct task_struct *g, *t;
+
+again:
+	rcu_read_lock();
+	do_each_thread(g, t) {
+		if (t->parent != kthreadd_task && t != kthreadd_task)
+			continue;
+
+		if (cpus_match_system(t->cpus_allowed) &&
+		    !cpus_equal(t->cpus_allowed, *new_system_map)) {
+			/*
+			 * What is holding a ref on t->usage here?!
+			 */
+			get_task_struct(t);
+			rcu_read_unlock();
+			set_cpus_allowed(t, *new_system_map);
+			put_task_struct(t);
+			goto again;
+		}
+	} while_each_thread(g, t);
+	rcu_read_unlock();
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block fn_system_kthread_notifier = {
+	.notifier_call = system_kthread_notifier,
+};
+
+static int __init init_kthread(void)
+{
+	blocking_notifier_chain_register(&system_map_notifier,
+			&fn_system_kthread_notifier);
+	return 0;
+}
+
+module_init(init_kthread);
+#endif

--



* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-27 22:21 [RFC/PATCH 0/4] CPUSET driven CPU isolation Peter Zijlstra
                   ` (3 preceding siblings ...)
  2008-02-27 22:21 ` [RFC/PATCH 4/4] kthread: system set kthread affinities Peter Zijlstra
@ 2008-02-27 23:38 ` Max Krasnyanskiy
  2008-02-28 10:19   ` Peter Zijlstra
  2008-02-28  7:50 ` Ingo Molnar
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 94+ messages in thread
From: Max Krasnyanskiy @ 2008-02-27 23:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, linux-kernel

Peter Zijlstra wrote:
> My vision on the direction we should take wrt cpu isolation.
General impressions:
- "cpu_system_map" is %100 identical to the "~cpu_isolated_map" as in my 
patches. It's updated from different place but functionally wise it's very 
much the same. I guess you did not like the 'isolated' name ;-). As I 
mentioned before I'm not hung up on the name so it's cool :).

- Updating cpu_system_map from cpusets
There are a couple of things that I do not like about this approach:
1. We lost the ability to isolate CPUs at boot. Which means slower boot times 
for me (ie before I can start my apps I need to create cpuset, etc). Not a big 
deal, I can live with it.

2. We now need another notification mechanism to propagate the updates to the 
cpu_system_map. That by itself is not a big deal. The big deal is that now we 
need to basically audit the kernel and make sure that everything affected has 
a proper notifier and reacts to the mask changes.
For example your current patch does not move the timers, and I do not think it 
makes sense to go and add a notifier for the timers. I think the better approach 
is to use CPU hotplug here, i.e. enforce the rule that cpu_system_map is updated 
only when the CPU is off-line.
By bringing the CPU down first we get a lot of features for free. All the kernel 
threads, timers, softirqs, HW irqs, workqueues, etc. are properly 
terminated/moved/canceled/etc. This gives us a very clean state when we bring 
the CPU back online with the "system" bit cleared (or the "isolated" bit set, as 
in my patches). I do not see a good reason for reimplementing that functionality 
via system_map notifiers.

I'll comment more on the individual patches.

> Next on the list would be figuring out a nice solution to the workqueue
> flush issue.
Do not forget the "stop machine", or more specifically module loading/unloading.

Max


* Re: [RFC/PATCH 2/4] cpuset: system sets
  2008-02-27 22:21 ` [RFC/PATCH 2/4] cpuset: system sets Peter Zijlstra
@ 2008-02-27 23:39   ` Paul Jackson
  2008-02-28  1:53     ` Max Krasnyanskiy
  2008-02-27 23:52   ` Max Krasnyanskiy
  1 sibling, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-02-27 23:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, oleg, rostedt, maxk, linux-kernel, a.p.zijlstra

Peter wrote:
> A system set will be one that caters to the
> general-purpose OS. This patch provides the infrastructure, but doesn't
> actually provide any new functionality.
> 
> Typical functionality would be setting the IRQ affinity of unbound IRQs to
> within the system set, and setting the affinity of unbound kernel threads to
> within the system set.

"one that caters the general purpose OS" ... a tad terse on the
documentation ;).

I guess what you have is a new cpumask_t cpu_system_map, which is the
union of the CPUs of all the cpusets marked 'system', where to a rough
approximation the CPUs -not- in that cpumask are what we would have
called the isolated CPUs by the old code?
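
If I read rebuild_system_map() right, the invariant is roughly this (with a
made-up for_each_system_cpuset() standing in for the kfifo walk over the
hierarchy):

	cpus_clear(cpu_system_map);
	for_each_system_cpuset(cs)
		cpus_or(cpu_system_map, cpu_system_map, cs->cpus_allowed);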

In any case, if this patch survives its birth, it will need an added
change for some file in the Documentation directory.

Could we get the term 'cpu' in the name 'system' somehow?  Perhaps call
this new cpuset flag 'cpus_system' or some such.  Cpusets handles both
CPU and memory configuration, and I make some effort to mark per-cpuset
specific attributes that apply to only one of these with a prefix
indicating to which they apply.  The per-cpuset flag name 'system', by
itself, would mean little to someone just listing the files in a cpuset
directory.

In the rebuild_system_map() code, you have:
+	if (cpus_empty(*new_system_map))
+		BUG();

... what's to prevent simply turning off the 'system' (aka cpus_system)
in the top cpuset, on a system with only that one cpuset, and hitting
this BUG()?

Overall I like this approach.  I suspect you made a good choice in
marking the non-isolated (aka system) CPUs, rather than the isolated
CPUs.  It seems clearer that way, in understanding the affects of
overlapping cpusets with various markings.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214


* Re: [RFC/PATCH 2/4] cpuset: system sets
  2008-02-27 22:21 ` [RFC/PATCH 2/4] cpuset: system sets Peter Zijlstra
  2008-02-27 23:39   ` Paul Jackson
@ 2008-02-27 23:52   ` Max Krasnyanskiy
  2008-02-28  0:11     ` Paul Jackson
  1 sibling, 1 reply; 94+ messages in thread
From: Max Krasnyanskiy @ 2008-02-27 23:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, linux-kernel

Hmm, shouldn't patches be sent inline?
Otherwise I need to cut&paste in order to reply.

Anyway. cpu_system_map looks fine. It's identical in functionality (minus the 
notifier) to the ~cpu_isolated_map. Different name works for me.

As I explained in the previous reply, I suggest we use CPU hotplug instead of the 
brand-new notifier mechanism that requires changes to a bunch of things and 
at the end of the day ends up doing the exact same thing, i.e. moving things off 
the CPU that is being isolated.

> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -4854,6 +4854,9 @@ asmlinkage long sys_sched_setaffinity(pi
>  cpumask_t cpu_present_map __read_mostly;
>  EXPORT_SYMBOL(cpu_present_map);
>  
> +cpumask_t cpu_system_map __read_mostly = CPU_MASK_ALL;
> +EXPORT_SYMBOL(cpu_system_map);
> +
>  #ifndef CONFIG_SMP
>  cpumask_t cpu_online_map __read_mostly = CPU_MASK_ALL;
>  EXPORT_SYMBOL(cpu_online_map);

I believe those masks belong in kernel/cpu.c instead of kernel/sched.c.
It can be done with a separate patch, of course.

Max







* Re: [RFC/PATCH 1/4] sched: remove isolcpus
  2008-02-27 22:21 ` [RFC/PATCH 1/4] sched: remove isolcpus Peter Zijlstra
@ 2008-02-27 23:57   ` Max Krasnyanskiy
  2008-02-28 10:19     ` Peter Zijlstra
  0 siblings, 1 reply; 94+ messages in thread
From: Max Krasnyanskiy @ 2008-02-27 23:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, linux-kernel

Peter Zijlstra wrote:

> cpu isolation doesn't offer anything over cpusets, hence remove it.

Works for me. That's what I suggested in my reply to your comments.
Here is the quote from the previous thread:

>>>> This also allows for isolated groups, there are good reasons to isolate groups,
>>>> esp. now that we have a stronger RT balancer. SMP and hard RT are not
>>>> exclusive. A design that does not take that into account is too rigid.
>> 
>>> You're thinking scheduling only. Paul had the same confusion ;-)
>> 
>> I'm not, I'm thinking it ought to allow for it.
> One way I can think of to support groups and allow for the RT balancer is 
> this: Make the scheduler ignore cpu_isolated_map and give cpusets full control of 
> the scheduler domains. Use cpu_isolated_map only for hw irqs and other 
> kernel sub-systems. That way cpusets could mark cpus in the group as isolated 
> to get rid of the kernel activity and build sched domains such that tasks get 
> balanced in them.
> The thing I do not like about it is that there is no way to boot the system 
> with CPU N isolated from the beginning. Also dynamic isolation currently 
> relies on the cpu hotplug to clear pending irqs, softirqs, kernel timers and 
> threads. So cpusets would have to simulate the cpu hotplug event I guess.

Max



* Re: [RFC/PATCH 3/4] genirq: system set irq affinities
  2008-02-27 22:21 ` [RFC/PATCH 3/4] genirq: system set irq affinities Peter Zijlstra
@ 2008-02-28  0:10   ` Max Krasnyanskiy
  2008-02-28 10:19     ` Peter Zijlstra
  0 siblings, 1 reply; 94+ messages in thread
From: Max Krasnyanskiy @ 2008-02-28  0:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, linux-kernel

Besides the notifier stuff it's identical to my genirq patch that I sent to 
Thomas and you for review ~5 days ago.
There are a couple of things you missed.
The current call site for select_smp_affinity() inside request_irq() is incorrect. 
It ends up moving the irq each time request_irq() is called, and it is called 
multiple times for shared irqs. My patch moves it into setup_irq() under 
an if (!shared) check.

Also the following part is unsafe

> +#ifdef CONFIG_CPUSETS
> +static int system_irq_notifier(struct notifier_block *nb,
> +		unsigned long action, void *cpus)
> +{
> +	cpumask_t *new_system_map = (cpumask_t *)cpus;
> +	int i;
> +
> +	for (i = 0; i < NR_IRQS; i++) {
> +		struct irq_desc *desc = &irq_desc[i];
> +
> +		if (desc->chip == &no_irq_chip || !irq_can_set_affinity(i))
> +			continue;
> +
> +		if (cpus_match_system(desc->affinity)) {
> +			cpumask_t online_system;
> +
> > +			cpus_and(online_system, *new_system_map, cpu_online_map);
> +
> +			set_balance_irq_affinity(i, online_system);
> +
> +			desc->affinity = online_system;
> +			desc->chip->set_affinity(i, online_system);
Two lines above should be
	irq_set_affinity(i, online_system);

If you look at how irq_set_affinity() is implemented, you'll see this

#ifdef CONFIG_GENERIC_PENDING_IRQ
         set_pending_irq(irq, cpumask);
#else
         desc->affinity = cpumask;
         desc->chip->set_affinity(irq, cpumask);
#endif

set_pending_irq() is the safe way to move pending irqs.
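
For reference, set_pending_irq() (kernel/irq/migration.c) just records the
request and lets the interrupt path do the actual move at a safe point;
from memory it is roughly:

	void set_pending_irq(unsigned int irq, cpumask_t mask)
	{
		struct irq_desc *desc = irq_desc + irq;
		unsigned long flags;

		spin_lock_irqsave(&desc->lock, flags);
		desc->status |= IRQ_MOVE_PENDING;
		desc->pending_mask = mask;
		spin_unlock_irqrestore(&desc->lock, flags);
	}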

Btw, it should be ok to call chip->set_affinity() directly from 
select_smp_affinity() because in my patch it is guaranteed to be called only 
for the first handler registration.

Max


* Re: [RFC/PATCH 2/4] cpuset: system sets
  2008-02-27 23:52   ` Max Krasnyanskiy
@ 2008-02-28  0:11     ` Paul Jackson
  2008-02-28  0:29       ` Steven Rostedt
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-02-28  0:11 UTC (permalink / raw)
  To: Max Krasnyanskiy; +Cc: a.p.zijlstra, mingo, tglx, oleg, rostedt, linux-kernel

Max wrote:
> Hmm, shouldn't patches be sent inline?

Huh?  Peter's patches were inline for me ... odd.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214


* Re: [RFC/PATCH 2/4] cpuset: system sets
  2008-02-28  0:11     ` Paul Jackson
@ 2008-02-28  0:29       ` Steven Rostedt
  2008-02-28  1:45         ` Max Krasnyanskiy
  0 siblings, 1 reply; 94+ messages in thread
From: Steven Rostedt @ 2008-02-28  0:29 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Max Krasnyanskiy, a.p.zijlstra, mingo, tglx, oleg, linux-kernel


On Wed, 27 Feb 2008, Paul Jackson wrote:

> Max wrote:
> > Hmm, shouldn't patches be sent inline?
>
> Huh?  Peter's patches were inline for me ... odd.
>

Me too.

Max, what email client are you using?

-- Steve



* Re: [RFC/PATCH 2/4] cpuset: system sets
  2008-02-28  0:29       ` Steven Rostedt
@ 2008-02-28  1:45         ` Max Krasnyanskiy
  2008-02-28  3:41           ` Steven Rostedt
  0 siblings, 1 reply; 94+ messages in thread
From: Max Krasnyanskiy @ 2008-02-28  1:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul Jackson, a.p.zijlstra, mingo, tglx, oleg, linux-kernel

Steven Rostedt wrote:
> On Wed, 27 Feb 2008, Paul Jackson wrote:
> 
>> Max wrote:
>>> Hmm, shouldn't patches be sent inline?
>> Huh?  Peter's patches were inline for me ... odd.
>>
> 
> Me too.
> 
> Max, what email client are you using?

Thunderbird. The patches were actually attached. Thunderbird does show
them inline but when you hit reply it's an empty message.

Max



* Re: [RFC/PATCH 2/4] cpuset: system sets
  2008-02-27 23:39   ` Paul Jackson
@ 2008-02-28  1:53     ` Max Krasnyanskiy
  0 siblings, 0 replies; 94+ messages in thread
From: Max Krasnyanskiy @ 2008-02-28  1:53 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Peter Zijlstra, mingo, tglx, oleg, rostedt, linux-kernel

Paul Jackson wrote:
> Peter wrote:
>> A system set will be one that caters to the
>> general-purpose OS. This patch provides the infrastructure, but doesn't
>> actually provide any new functionality.
>>
>> Typical functionality would be setting the IRQ affinity of unbound IRQs to
>> within the system set, and setting the affinity of unbound kernel threads to
>> within the system set.
> 
> "one that caters the general purpose OS" ... a tad terse on the
> documentation ;).
> 
> I guess what you have is a new cpumask_t cpu_system_map, which is the
> union of the CPUs of all the cpusets marked 'system', where to a rough
> approximation the CPUs -not- in that cpumask are what we would have
> called the isolated CPUs by the old code?
Yes, it's in fact exactly the same as ~cpu_isolated_map in the patches that I 
sent out earlier.

> In any case, if this patch survives its birth, it will need an added
> change for some file in the Documentation directory.
Sure. We can just update the readme from my patch to use cpusets instead of 
the /sys/system/cpu/cpu1/isolated bits. If we go with this approach, that is.

> Could we get the term 'cpu' in the name 'system' somehow?  Perhaps call
> this new cpuset flag 'cpus_system' or some such.  Cpusets handles both
> CPU and memory configuration, and I make some effort to mark per-cpuset
> specific attributes that apply to only one of these with a prefix
> indicating to which they apply.  The per-cpuset flag name 'system', by
> itself, would mean little to someone just listing the files in a cpuset
> directory.
Makes sense to me too, i.e. cpus_system is more descriptive.

> In the rebuild_system_map() code, you have:
> +	if (cpus_empty(*new_system_map))
> +		BUG();
> 
> ... what's to prevent simply turning off the 'system' (aka cpus_system)
> in the top cpuset, on a system with only that one cpuset, and hitting
> this BUG()?
Good point.

> Overall I like this approach.  I suspect you made a good choice in
> marking the non-isolated (aka system) CPUs, rather than the isolated
> CPUs.  It seems clearer that way, in understanding the affects of
> overlapping cpusets with various markings.
Ok, so you did not like the 'isolated' name either ;-).

Max







* Re: [RFC/PATCH 2/4] cpuset: system sets
  2008-02-28  1:45         ` Max Krasnyanskiy
@ 2008-02-28  3:41           ` Steven Rostedt
  2008-02-28  4:58             ` Max Krasnyansky
  0 siblings, 1 reply; 94+ messages in thread
From: Steven Rostedt @ 2008-02-28  3:41 UTC (permalink / raw)
  To: Max Krasnyanskiy
  Cc: Paul Jackson, a.p.zijlstra, mingo, tglx, oleg, linux-kernel



On Wed, 27 Feb 2008, Max Krasnyanskiy wrote:

> Steven Rostedt wrote:
>
> Thunderbird. The patches were actually attached. Thunderbird does show
> them inline but when you hit reply it's an empty message.

Nope, Thunderbird simply got fooled by the added header line:

 Content-Disposition: inline; filename=cpuset-system.patch

These patches were sent by quilt...

  User-Agent: quilt/0.45-1

and as such are the proper way to send patch series.

I pulled this up in Thunderbird, hit reply and it kept the patch. Perhaps
your options are not set up correctly, or you need to upgrade your
Thunderbird.

-- Steve



* Re: [RFC/PATCH 2/4] cpuset: system sets
  2008-02-28  3:41           ` Steven Rostedt
@ 2008-02-28  4:58             ` Max Krasnyansky
  0 siblings, 0 replies; 94+ messages in thread
From: Max Krasnyansky @ 2008-02-28  4:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul Jackson, a.p.zijlstra, mingo, tglx, oleg, linux-kernel

Steven Rostedt wrote:
> 
> On Wed, 27 Feb 2008, Max Krasnyanskiy wrote:
> 
>> Steven Rostedt wrote:
>>
>> Thunderbird. The patches were actually attached. Thunderbird does show
>> them inline but when you hit reply it's an empty message.
> 
> Nope, Thunderbird simply got fooled by the added header line:
> 
>  Content-Disposition: inline; filename=cpuset-system.patch
> 
> These patches were sent by quilt...
> 
>   User-Agent: quilt/0.45-1
> 
> and as such are the proper way to send patch series.
> 
> I pulled this up in Thunderbird, hit reply and it kept the patch. Perhaps
> your options are not set up correctly, or you need to upgrade your
> Thunderbird.

Interesting. I've got Thunderbird 2.0.0.9 here, which came with the latest Fedora 8 updates.
This is the first series of patches it has done that for me. Stuff sent by git-send-email, 
for example, comes out just fine. I looked through the content handling options and do not 
see anything obvious.
Anyway, sorry for the complaint.

Max


* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-27 22:21 [RFC/PATCH 0/4] CPUSET driven CPU isolation Peter Zijlstra
                   ` (4 preceding siblings ...)
  2008-02-27 23:38 ` [RFC/PATCH 0/4] CPUSET driven CPU isolation Max Krasnyanskiy
@ 2008-02-28  7:50 ` Ingo Molnar
  2008-02-28  8:08   ` Paul Jackson
                     ` (2 more replies)
  2008-02-28 12:12 ` Mark Hounschell
  2008-02-29 18:55 ` [RFC/PATCH] cpuset: cpuset irq affinities Peter Zijlstra
  7 siblings, 3 replies; 94+ messages in thread
From: Ingo Molnar @ 2008-02-28  7:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Oleg Nesterov, Steven Rostedt, Paul Jackson,
	Max Krasnyanskiy, linux-kernel


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> My vision on the direction we should take wrt cpu isolation.
> 
> Next on the list would be figuring out a nice solution to the 
> workqueue flush issue.

nice work Peter, i find this "system sets" extension to cpusets a much 
more elegant (and much more future-proof) solution than the proposed 
spreadout of the limited hack of isolcpus/cpu_isolated_map. It 
concentrates us on a single API and on a single mechanism to handle 
isolation matters. (be that for clustering/supercomputing or real-time 
purposes)

Thanks for insisting on using cpusets for this!

i've queued up your patches in sched-devel.git, and let's make sure this 
has no side-effects on existing functionality. (it shouldn't)

	Ingo


* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28  7:50 ` Ingo Molnar
@ 2008-02-28  8:08   ` Paul Jackson
  2008-02-28  9:08     ` Ingo Molnar
  2008-02-28 17:48   ` Max Krasnyanskiy
  2008-02-29  8:31   ` Andrew Morton
  2 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-02-28  8:08 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: a.p.zijlstra, tglx, oleg, rostedt, maxk, linux-kernel

Ingo wrote:
> nice work Peter

agreed

> i've queued up your patches in sched-devel.git

Before this patchset gets too far, I'd like to decide on whether to
adopt my suggestion to call that per-cpuset flag 'cpus_system' (or
anything else with 'cpu' in it, perhaps 'system_cpus' would be more
idiomatic), rather than the tad too generic 'system'.

People doing 'ls /dev/cpuset' should be able to half-way guess
what things do, just from their name.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214


* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28  8:08   ` Paul Jackson
@ 2008-02-28  9:08     ` Ingo Molnar
  2008-02-28  9:17       ` Paul Jackson
  2008-02-28 20:23       ` Max Krasnyansky
  0 siblings, 2 replies; 94+ messages in thread
From: Ingo Molnar @ 2008-02-28  9:08 UTC (permalink / raw)
  To: Paul Jackson; +Cc: a.p.zijlstra, tglx, oleg, rostedt, maxk, linux-kernel


* Paul Jackson <pj@sgi.com> wrote:

> > i've queued up your patches in sched-devel.git
> 
> Before this patchset gets too far, I'd like to decide on whether to 
> > adopt my suggestion to call that per-cpuset flag 'cpus_system' (or 
> anything else with 'cpu' in it, perhaps 'system_cpus' would be more 
> idiomatic), rather than the tad too generic 'system'.

yeah. In fact i'm not at all sure this is really a "system" thing - it's 
more of a "bootup" default.

once the system has booted up and the user is in a position to create 
cpusets, i believe the distinction and assymetry between any bootup 
cpuset and the other cpusets should vanish. The "bootup" cpuset is just 
a convenience container to handle everything that the box booted up 
with, and then we can shrink it (without having to enumerate every PID 
and every irq and other resource explicitly) to make place for other 
cpusets.

maybe it's even more idiomatic to call it "set0" and just create a 
/dev/cpuset/set0/ directory for it and making it an explicit cpuset - 
instead of the hardcoded /dev/cpusets/system thing? Do you have any 
established naming scheme for cpusets that we could follow here?

> People doing 'ls /dev/cpuset' should be able to half-way guess what 
> things do, just from their name.

oh, certainly. This is at the earliest v2.6.26 material - but now it at 
least looks clean conceptually, fits more nicely into cpusets instead of 
being a bolted-on thing.

	Ingo


* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28  9:08     ` Ingo Molnar
@ 2008-02-28  9:17       ` Paul Jackson
  2008-02-28  9:32         ` David Rientjes
  2008-02-28 10:46         ` Ingo Molnar
  2008-02-28 20:23       ` Max Krasnyansky
  1 sibling, 2 replies; 94+ messages in thread
From: Paul Jackson @ 2008-02-28  9:17 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: a.p.zijlstra, tglx, oleg, rostedt, maxk, linux-kernel

Ingo wrote:
>  The "bootup" cpuset is just 
> a convenience container to handle everything that the box booted up 
> with, and then we can shrink it (without having to enumerate every PID 
> and every irq and other resource explicitly) to make place for other 
> cpusets.

I'm not quite sure of what you're thinking here; rather I'm just
bouncing off the sound of your words.

But your words sound a lot like what we at SGI call a 'boot' cpuset.

Our big honkin NUMA customers, who are managing most of the system
either for a few dedicated, very important jobs, and/or under a
batch scheduler, need to leave a few nodes to run the classic Unix
load such as init, cron, assorted daemons and the admin's login shell.

So we provide them some init script mechanisms that make it easy to
set this up, which includes moving every task (not many at the low
numbered init script time this runs) that isn't pinned (doesn't have
a restricted Cpus_allowed) into the boot cpuset, conventionally
named /dev/cpuset/boot.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214


* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28  9:17       ` Paul Jackson
@ 2008-02-28  9:32         ` David Rientjes
  2008-02-28 10:12           ` David Rientjes
  2008-02-28 10:46         ` Ingo Molnar
  1 sibling, 1 reply; 94+ messages in thread
From: David Rientjes @ 2008-02-28  9:32 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Ingo Molnar, a.p.zijlstra, tglx, oleg, rostedt, maxk, linux-kernel

On Thu, 28 Feb 2008, Paul Jackson wrote:

> So we provide them some init script mechanisms that make it easy to
> set this up, which includes moving every task (not many at the low
> numbered init script time this runs) that isn't pinned (doesn't have
> a restricted Cpus_allowed) into the boot cpuset, conventionally
> named /dev/cpuset/boot.
> 

Should the kernel refuse to move some threads, such as the migration 
or watchdog kthreads, out of the root cpuset where the mems can be 
adjusted to disallow access to the cpu to which they are bound?  This is 
a quick way to cause a crash or soft lockup.

		David


* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28  9:32         ` David Rientjes
@ 2008-02-28 10:12           ` David Rientjes
  2008-02-28 10:26             ` Peter Zijlstra
  2008-02-28 17:37             ` Paul Jackson
  0 siblings, 2 replies; 94+ messages in thread
From: David Rientjes @ 2008-02-28 10:12 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Ingo Molnar, a.p.zijlstra, tglx, oleg, rostedt, maxk, linux-kernel

On Thu, 28 Feb 2008, David Rientjes wrote:

> Should the kernel refuse to move some threads, such as the migration 
> or watchdog kthreads, out of the root cpuset where the mems can be 
> adjusted to disallow access to the cpu to which they are bound?  This is 
> a quick way to cause a crash or soft lockup.
> 

Something like this?
---
 include/linux/sched.h |    1 +
 kernel/cpuset.c       |    5 ++++-
 kernel/kthread.c      |    1 +
 kernel/sched.c        |    6 ++++++
 4 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1464,6 +1464,7 @@ static inline void put_task_struct(struct task_struct *t)
 #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
 #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
 #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
+#define PF_CPU_BOUND	0x04000000	/* Kthread bound to specific cpu */
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezeable */
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1175,11 +1175,14 @@ static void cpuset_attach(struct cgroup_subsys *ss,
 	struct mm_struct *mm;
 	struct cpuset *cs = cgroup_cs(cont);
 	struct cpuset *oldcs = cgroup_cs(oldcont);
+	int ret;
 
 	mutex_lock(&callback_mutex);
 	guarantee_online_cpus(cs, &cpus);
-	set_cpus_allowed(tsk, cpus);
+	ret = set_cpus_allowed(tsk, cpus);
 	mutex_unlock(&callback_mutex);
+	if (ret < 0)
+		return;
 
 	from = oldcs->mems_allowed;
 	to = cs->mems_allowed;
diff --git a/kernel/kthread.c b/kernel/kthread.c
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -180,6 +180,7 @@ void kthread_bind(struct task_struct *k, unsigned int cpu)
 	wait_task_inactive(k);
 	set_task_cpu(k, cpu);
 	k->cpus_allowed = cpumask_of_cpu(cpu);
+	k->flags |= PF_CPU_BOUND;
 }
 EXPORT_SYMBOL(kthread_bind);
 
diff --git a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5345,6 +5345,12 @@ int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
 		goto out;
 	}
 
+	if (unlikely((p->flags & PF_CPU_BOUND) && p != current &&
+	    	     !cpus_equal(p->cpus_allowed, new_mask))) {
+		ret = -EINVAL;
+		goto out;
+	}
+
 	if (p->sched_class->set_cpus_allowed)
 		p->sched_class->set_cpus_allowed(p, &new_mask);
 	else {


* Re: [RFC/PATCH 1/4] sched: remove isolcpus
  2008-02-27 23:57   ` Max Krasnyanskiy
@ 2008-02-28 10:19     ` Peter Zijlstra
  2008-02-28 19:36       ` Max Krasnyansky
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Zijlstra @ 2008-02-28 10:19 UTC (permalink / raw)
  To: Max Krasnyanskiy
  Cc: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, linux-kernel


On Wed, 2008-02-27 at 15:57 -0800, Max Krasnyanskiy wrote:
> Peter Zijlstra wrote:
> 
> > cpu isolation doesn't offer anything over cpusets, hence remove it.
> 
> Works for me. That's what I suggested in my reply to your comments.
> Here is the quote from the previous thread:

Dude, I've been pushing a cpuset interface from the get go.



* Re: [RFC/PATCH 3/4] genirq: system set irq affinities
  2008-02-28  0:10   ` Max Krasnyanskiy
@ 2008-02-28 10:19     ` Peter Zijlstra
  0 siblings, 0 replies; 94+ messages in thread
From: Peter Zijlstra @ 2008-02-28 10:19 UTC (permalink / raw)
  To: Max Krasnyanskiy
  Cc: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, linux-kernel


On Wed, 2008-02-27 at 16:10 -0800, Max Krasnyanskiy wrote:
> Besides the notifier stuff it's identical to my genirq patch that I sent to 
> Thomas and you for review ~5 days ago.
> There are a couple of things you missed.

Or left out because I didn't take the time to actually look at the code too 
much. It took me ~4 hours to whip this up (then a few more to debug and
test it).

You missed the call to set_balance_irq_affinity() and went poking in the
balancer itself. 

> The current call site for select_smp_affinity() inside request_irq() is incorrect. 
> It ends up moving the irq each time request_irq() is called, and it is called 
> multiple times for shared irqs. My patch moves it into setup_irq() under 
> an if (!shared) check.

I'll leave that to tglx and mingo and claim lack of clue.

> Also the following part is unsafe
> 
> > +#ifdef CONFIG_CPUSETS
> > +static int system_irq_notifier(struct notifier_block *nb,
> > +		unsigned long action, void *cpus)
> > +{
> > +	cpumask_t *new_system_map = (cpumask_t *)cpus;
> > +	int i;
> > +
> > +	for (i = 0; i < NR_IRQS; i++) {
> > +		struct irq_desc *desc = &irq_desc[i];
> > +
> > +		if (desc->chip == &no_irq_chip || !irq_can_set_affinity(i))
> > +			continue;
> > +
> > +		if (cpus_match_system(desc->affinity)) {
> > +			cpumask_t online_system;
> > +
> > +			cpus_and(online_system, *new_system_map, cpu_online_map);
> > +
> > +			set_balance_irq_affinity(i, online_system);
> > +
> > +			desc->affinity = online_system;
> > +			desc->chip->set_affinity(i, online_system);
> Two lines above should be
> 	irq_set_affinity(i, online_system);
> 
> If you look at how irq_set_affinity() is implemented, you'll see this
> 
> #ifdef CONFIG_GENERIC_PENDING_IRQ
>          set_pending_irq(irq, cpumask);
> #else
>          desc->affinity = cpumask;
>          desc->chip->set_affinity(irq, cpumask);
> #endif
> 
> set_pending_irq() is the safe way to move pending irqs.

Seems not unsafe, just not handling pending irqs.






* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-27 23:38 ` [RFC/PATCH 0/4] CPUSET driven CPU isolation Max Krasnyanskiy
@ 2008-02-28 10:19   ` Peter Zijlstra
  2008-02-28 17:33     ` Max Krasnyanskiy
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Zijlstra @ 2008-02-28 10:19 UTC (permalink / raw)
  To: Max Krasnyanskiy
  Cc: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, linux-kernel


On Wed, 2008-02-27 at 15:38 -0800, Max Krasnyanskiy wrote:
> Peter Zijlstra wrote:
> > My vision on the direction we should take wrt cpu isolation.
> General impressions:
> - "cpu_system_map" is %100 identical to the "~cpu_isolated_map" as in my 
> patches. It's updated from different place but functionally wise it's very 
> much the same. I guess you did not like the 'isolated' name ;-). As I 
> mentioned before I'm not hung up on the name so it's cool :).

Ah, but you miss that cpu_system_map doesn't do the one thing the
cpu_isolated_map ever did: prevent sched_domains from forming on those
cpus.

Which is a major point.

> - Updating cpu_system_map from cpusets
> There are a couple of things that I do not like about this approach:
> 1. We lost the ability to isolate CPUs at boot. Which means slower boot times 
> for me (ie before I can start my apps I need to create cpuset, etc). Not a big 
> deal, I can live with it.

I'm sure those few lines in rc.local won't grow your boot time by a
significant amount.

That said, we should look into a replacement for the boot time parameter
(if only to decide not to do it) because of backward compatibility.

> 2. We now need another notification mechanism to propagate the updates to the 
> cpu_system_map. That by itself is not a big deal. The big deal is that now we 
> need to basically audit the kernel and make sure that everything affected has 
> a proper notifier and reacts to the mask changes.
> For example your current patch does not move the timers, and I do not think it 
> makes sense to go and add a notifier for the timers. I think the better approach 
> is to use CPU hotplug here, i.e. enforce the rule that cpu_system_map is updated 
> only when the CPU is off-line.
> By bringing the CPU down first we get a lot of features for free. All the kernel 
> threads, timers, softirqs, HW irqs, workqueues, etc. are properly 
> terminated/moved/canceled/etc. This gives us a very clean state when we bring 
> the CPU back online with the "system" bit cleared (or the "isolated" bit set, as 
> in my patches). I do not see a good reason for reimplementing that functionality 
> via system_map notifiers.

I'm not convinced cpu hotplug notifiers are the right thing here. Sure
we could easily iterate the old and new system map and call the matching
cpu hotplug notifiers, but they seem overly complex to me.
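
For concreteness, such an iteration might look like the following
(system_map_transition() is a hypothetical helper, not part of these
patches):

	static void system_map_transition(const cpumask_t *old,
			const cpumask_t *new)
	{
		cpumask_t removed, added;

		cpus_andnot(removed, *old, *new);
		cpus_andnot(added, *new, *old);
		/*
		 * Every cpu in 'removed' would need the equivalent of the
		 * CPU_DOWN_PREPARE..CPU_DEAD notifier sequence; every cpu
		 * in 'added' the CPU_UP_PREPARE..CPU_ONLINE one.
		 */
	}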

The audit would be a good idea anyway. If we do indeed end up with a 1:1
mapping of whatever cpu hotplug does, then well, perhaps you're right.

> I'll comment more on the individual patches.
> 
> > Next on the list would be figuring out a nice solution to the workqueue
> > flush issue.
> Do not forget the "stop machine", or more specifically module loading/unloading.

No, the full stop-machine thing; there are more interesting users than 
module loading. But I'm not too interested in solving this particular 
problem atm, I have too much on my plate as it is.



* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28 10:12           ` David Rientjes
@ 2008-02-28 10:26             ` Peter Zijlstra
  2008-02-28 17:37             ` Paul Jackson
  1 sibling, 0 replies; 94+ messages in thread
From: Peter Zijlstra @ 2008-02-28 10:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: Paul Jackson, Ingo Molnar, tglx, oleg, rostedt, maxk, linux-kernel


On Thu, 2008-02-28 at 02:12 -0800, David Rientjes wrote:
> On Thu, 28 Feb 2008, David Rientjes wrote:
> 
> > Should the kernel refuse to move some threads, such as the migration 
> > or watchdog kthreads, out of the root cpuset where the mems can be 
> > adjusted to disallow access to the cpu to which they are bound?  This is 
> > a quick way to cause a crash or soft lockup.

Indeed, there is a hole in my cpus_match_system() logic in that when the
system set is reduced to a single cpu, the tasks bound to that cpu also
match.

I had wanted to avoid adding PF_ flags (as I remember we're running
short on them), but I think you're right.

Thanks!

> Something like this?
> ---
>  include/linux/sched.h |    1 +
>  kernel/cpuset.c       |    5 ++++-
>  kernel/kthread.c      |    1 +
>  kernel/sched.c        |    6 ++++++
>  4 files changed, 12 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1464,6 +1464,7 @@ static inline void put_task_struct(struct task_struct *t)
>  #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
>  #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
>  #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
> +#define PF_CPU_BOUND	0x04000000	/* Kthread bound to specific cpu */
>  #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
>  #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
>  #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezeable */
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -1175,11 +1175,14 @@ static void cpuset_attach(struct cgroup_subsys *ss,
>  	struct mm_struct *mm;
>  	struct cpuset *cs = cgroup_cs(cont);
>  	struct cpuset *oldcs = cgroup_cs(oldcont);
> +	int ret;
>  
>  	mutex_lock(&callback_mutex);
>  	guarantee_online_cpus(cs, &cpus);
> -	set_cpus_allowed(tsk, cpus);
> +	ret = set_cpus_allowed(tsk, cpus);
>  	mutex_unlock(&callback_mutex);
> +	if (ret < 0)
> +		return;
>  
>  	from = oldcs->mems_allowed;
>  	to = cs->mems_allowed;
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -180,6 +180,7 @@ void kthread_bind(struct task_struct *k, unsigned int cpu)
>  	wait_task_inactive(k);
>  	set_task_cpu(k, cpu);
>  	k->cpus_allowed = cpumask_of_cpu(cpu);
> +	k->flags |= PF_CPU_BOUND;
>  }
>  EXPORT_SYMBOL(kthread_bind);
>  
> diff --git a/kernel/sched.c b/kernel/sched.c
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5345,6 +5345,12 @@ int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
>  		goto out;
>  	}
>  
> +	if (unlikely((p->flags & PF_CPU_BOUND) && p != current &&
> +	    	     !cpus_equal(p->cpus_allowed, new_mask))) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
>  	if (p->sched_class->set_cpus_allowed)
>  		p->sched_class->set_cpus_allowed(p, &new_mask);
>  	else {


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28  9:17       ` Paul Jackson
  2008-02-28  9:32         ` David Rientjes
@ 2008-02-28 10:46         ` Ingo Molnar
  2008-02-28 17:47           ` Paul Jackson
  2008-02-28 20:11           ` Max Krasnyansky
  1 sibling, 2 replies; 94+ messages in thread
From: Ingo Molnar @ 2008-02-28 10:46 UTC (permalink / raw)
  To: Paul Jackson; +Cc: a.p.zijlstra, tglx, oleg, rostedt, maxk, linux-kernel


* Paul Jackson <pj@sgi.com> wrote:

> But your words sound a lot like what we at SGI call a 'boot' cpuset.
> 
> Our big honkin NUMA customers, who are managing most of the system 
> either for a few dedicated, very-important jobs or under a batch 
> scheduler, need to leave a few nodes to run the classic Unix load such 
> as init, cron, assorted daemons and the admin's login shell.
> 
> So we provide them some init script mechanisms that make it easy to 
> set this up, which includes moving every task (not many at the low 
> numbered init script time this runs) that isn't pinned (doesn't have a 
> restricted Cpus_allowed) into the boot cpuset, conventionally named 
> /dev/cpuset/boot.

yes. Ideally Peter's patchset should turn into something equivalent and 
i very much agree with Peter's arguments. There was never any design 
level problem with cpusets, and the parallel cpu_isolated_map approach 
was misdirected IMO.

There was indeed a problem with the _manageability_ of cpusets in 
certain (rather new) usecases like real-time or virtualization, and how 
they are connected to other system resources like IRQs and how easy it 
is to manage these resources. IRQs should probably be tied to specific 
cpusets and should migrate together with them, should the span of that 
cpuset be changed. (by default they'd be tied to the boot cpuset)

IMO Peter's patchset is a good first step in that it removes the 
cpu_isolated_map API hack, and i think we should try to go the whole way 
and just offer a /dev/cpuset/boot/ default set that can then be 
restricted to isolate the default workloads away from other CPUs.

( an initscripts approach, while i'm sure it works, would always be a
  bit fragile in that it requires precise knowledge about which task is
  what. I think we should make this a turn-key in-kernel solution that
  both the big-honking NUMA-box guys and the real-time guys would be
  happy with. )

	Ingo

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-27 22:21 [RFC/PATCH 0/4] CPUSET driven CPU isolation Peter Zijlstra
                   ` (5 preceding siblings ...)
  2008-02-28  7:50 ` Ingo Molnar
@ 2008-02-28 12:12 ` Mark Hounschell
  2008-02-28 19:57   ` Max Krasnyansky
  2008-02-29 18:55 ` [RFC/PATCH] cpuset: cpuset irq affinities Peter Zijlstra
  7 siblings, 1 reply; 94+ messages in thread
From: Mark Hounschell @ 2008-02-28 12:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, Max Krasnyanskiy, linux-kernel,
	markh@compro.net >> Mark Hounschell

Peter Zijlstra wrote:
> My vision on the direction we should take wrt cpu isolation.
> 
> Next on the list would be figuring out a nice solution to the workqueue
> flush issue.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

Is it now the intent that, not only do I have to enable cpusets in the
kernel, but I will also have to use them in userland to take advantage of
this?

And hot-plug too??

Can I predict that in the future userland sched_setaffinity will also be
taken away and we will be forced to use cpusets?

And hot-plug too??

Mark

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28 10:19   ` Peter Zijlstra
@ 2008-02-28 17:33     ` Max Krasnyanskiy
  0 siblings, 0 replies; 94+ messages in thread
From: Max Krasnyanskiy @ 2008-02-28 17:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, linux-kernel

Peter Zijlstra wrote:
> On Wed, 2008-02-27 at 15:38 -0800, Max Krasnyanskiy wrote:
>> Peter Zijlstra wrote:
>>> My vision on the direction we should take wrt cpu isolation.
>> General impressions:
>> - "cpu_system_map" is %100 identical to the "~cpu_isolated_map" as in my 
>> patches. It's updated from different place but functionally wise it's very 
>> much the same. I guess you did not like the 'isolated' name ;-). As I 
>> mentioned before I'm not hung up on the name so it's cool :).
> 
> Ah, but you miss that cpu_system_map doesn't do the one thing the
> cpu_isolated_map ever did, prevent sched_domains from forming on those
> cpus.
> 
> Which is a major point.
Did you see my reply on how to support "RT balancer" with cpu_isolated_map ?
Anyway, my point was that you could've just told me something like:
    "let's rename cpu_isolated_map and invert it"
That's it. The gist of the idea is exactly the same. There is a bitmap that 
tells various subsystems what cpus can be used for kernel activities.

>> - Updating cpu_system_map from cpusets
>> There are a couple of things that I do not like about this approach:
>> 1. We lost the ability to isolate CPUs at boot, which means slower boot times 
>> for me (ie before I can start my apps I need to create a cpuset, etc). Not a big 
>> deal, I can live with it.
> 
> I'm sure those few lines in rc.local won't grow your boot time by a
> significant amount.
> 
> That said, we should look into a replacement for the boot time parameter
> (if only to decide not to do it) because of backward compatibility.
As I said I can live with it.

>> 2. We now need another notification mechanism to propagate the updates to the 
>> cpu_system_map. That by itself is not a big deal. The big deal is that now we 
>> need to basically audit the kernel and make sure that everything affected
>> has a proper notifier and reacts to the mask changes.
>> For example your current patch does not move the timers and I do not think it 
>> makes sense to go and add notifier for the timers. I think the better approach 
>> is to use CPU hotplug here. ie Enforce the rule that cpu_system_map is updated 
>>   only when CPU is off-line.
>> By bringing CPU down first we get a lot of features for free. All the kernel 
>> threads, timers, softirqs, HW irqs, workqueues, etc are properly 
>> terminated/moved/canceled/etc. This gives us a very clean state when we bring 
>> the CPU back online with "system" bit cleared (or "isolated" bit set like in 
>> my patches). I do not see a good reason for reimplementing that functionality 
>> via system_map notifiers.
> 
> I'm not convinced cpu hotplug notifiers are the right thing here. Sure
> we could easily iterate the old and new system map and call the matching
> cpu hotplug notifiers, but they seem overly complex to me.
> 
> The audit would be a good idea anyway. If we do indeed end up with a 1:1
> mapping of whatever cpu hotplug does, then well, perhaps you're right.
I was not talking about calling notifiers directly. We can literally bring the 
CPU down. Just like echo 0 > /sys/devices/system/cpu/cpuN/online does.
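Roughly like this (isolate_cpu() is a hypothetical wrapper; cpu_down() and
cpu_up() are the existing hotplug entry points, and cpu_system_map is the
mask from your patches):

static int isolate_cpu(unsigned int cpu)
{
	int ret;

	ret = cpu_down(cpu);	/* hotplug migrates timers, irqs, kthreads */
	if (ret)
		return ret;
	cpu_clear(cpu, cpu_system_map);	/* flip the bit while offline */
	return cpu_up(cpu);	/* comes back with no kernel activity on it */
}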

>> I'll comment more on the individual patches.
>>
>>> Next on the list would be figuring out a nice solution to the workqueue
>>> flush issue.
>> Do not forget the "stop machine", or more specifically module loading/unloading.
> 
> No, the full stop machine thing; there are more interesting users than
> module loading.
> problem atm, I have too much on my plate as it is.
I was not suggesting that it's up to you to solve it. You made it sound like 
your patches provide a complete solution that just needs the workqueues fixed. 
I was merely pointing out that there is also stop machine that needs fixing.

btw You did not comment on the fact that your patch does not move timers.
I was trying it out last night. It's definitely not ready yet.

Max

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28 10:12           ` David Rientjes
  2008-02-28 10:26             ` Peter Zijlstra
@ 2008-02-28 17:37             ` Paul Jackson
  2008-02-28 21:24               ` David Rientjes
  1 sibling, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-02-28 17:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: mingo, a.p.zijlstra, tglx, oleg, rostedt, maxk, linux-kernel

I don't have strong opinions either way on this patch; it adds an error
check that makes sense.  I haven't seen much problem not having this check,
nor do I know of any code that depends on doing what this check prohibits.

Except for three details:

 1) +	if (unlikely((p->flags & PF_CPU_BOUND) && p != current &&
    +	    	     !cpus_equal(p->cpus_allowed, new_mask))) {
    +		ret = -EINVAL;

    The check for equal cpus allowed seems too strong.  Shouldn't you be
    checking that all of task p's cpus_allowed would still be allowed in
    the new mask:

    +	if (unlikely((p->flags & PF_CPU_BOUND) && p != current &&
    +	    	     !cpus_subset(p->cpus_allowed, new_mask))) {
    +		ret = -EINVAL;

 2) Doesn't this leave out a check for the flip side -- shrinking
    the cpus allowed by a cpuset so that it no longer contains those
    required by any PF_CPU_BOUND tasks in that cpuset?  I'm not sure
    if this second check is a good idea or not.

 3) Could we call this PF_CPU_PINNED instead?  I tend to use "cpu
    bound" to refer to tasks that consume alot of CPU cycles (which
    these pinned tasks rarely do), and "pinned" to refer to what is
    done to confine a task to a particular subset of all possible CPUs.
    It looks to me like some code in kernel/sched.c already uses the
    word pinned in this same way, so PF_CPU_PINNED would be more
    consistent terminology.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28 10:46         ` Ingo Molnar
@ 2008-02-28 17:47           ` Paul Jackson
  2008-02-28 20:11           ` Max Krasnyansky
  1 sibling, 0 replies; 94+ messages in thread
From: Paul Jackson @ 2008-02-28 17:47 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: a.p.zijlstra, tglx, oleg, rostedt, maxk, linux-kernel

Ingo wrote:
> ( an initscripts approach, while i'm sure it works, would always be a
>   bit fragile

Agreed.  In the early days of cpusets, (1) the less I put in the kernel,
the easier it was to get those first patches accepted, and (2) the more
that remained for me to code in user space, where my employer
could provide better, leading-edge products for a profit.

Over time, those user space features that prove their worth, but that
were difficult to code in the most usable and robust manner in user
space, such as boot cpusets here, and, on another thread, cpuset
relative NUMA mempolicies, naturally migrate to the kernel for wider
availability.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28  7:50 ` Ingo Molnar
  2008-02-28  8:08   ` Paul Jackson
@ 2008-02-28 17:48   ` Max Krasnyanskiy
  2008-02-29  8:31   ` Andrew Morton
  2 siblings, 0 replies; 94+ messages in thread
From: Max Krasnyanskiy @ 2008-02-28 17:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, linux-kernel

Ingo Molnar wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
>> My vision on the direction we should take wrt cpu isolation.
>>
>> Next on the list would be figuring out a nice solution to the 
>> workqueue flush issue.
> 
> nice work Peter, i find this "system sets" extension to cpusets a much 
> more elegant (and much more future-proof) solution than the proposed 
> spreadout of the limited hack of isolcpus/cpu_isolated_map. It 
> concentrates us on a single API and on a single mechanism to handle 
> isolation matters. (be that for clustering/supercomputing or real-time 
> purposes)
Come on Ingo. You make it sound like it's a radically different solution.
At the end of the day we have a bitmap that represents which CPUs can be used 
for the kernel stuff. How is that different?
I was saying all along that cpusets is a higher level API and was discussing 
or trying to discuss (people were ignoring my questions) ways to integrate it.

> Thanks for insisting on using cpusets for this!
> 
> i've queued up your patches in sched-devel.git, and lets make sure this 
> has no side-effects on existing functionality. (it shouldnt)
Hmm, that was easy. Not a single ack. Even the core part is not complete yet. 
I pointed out several issues, like the fact that it does not provide full 
isolation because it does not move timers and does not handle workqueues.
I did not even get a chance to test this stuff properly and see if it actually 
solves the usecase I was solving with my patches.
_Obviously_ we could not have taken my tested solution and evolved it in the 
direction people wanted to see it evolve, ie integration with the cpusets :(.

My main concern is that it introduces a whole new set of notifiers that 
perform similar functions to what CPU hotplug already does.

Max

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 1/4] sched: remove isolcpus
  2008-02-28 10:19     ` Peter Zijlstra
@ 2008-02-28 19:36       ` Max Krasnyansky
  0 siblings, 0 replies; 94+ messages in thread
From: Max Krasnyansky @ 2008-02-28 19:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, linux-kernel

Peter Zijlstra wrote:
> On Wed, 2008-02-27 at 15:57 -0800, Max Krasnyanskiy wrote:
>> Peter Zijlstra wrote:
>>
>>> cpu isolation doesn't offer anything over cpusets, hence remove it.
>> Works for me. That's what I suggested in my reply to your comments.
>> Here is the quote from the previous thread:
> 
> Dude, I've been pushing a cpuset interface from the get go.
Right. And you ended up with the exact same solution for the map that represents
the CPUs that are isolated (or not isolated).
I was just saying that I was asking people how to integrate cpu_isolated_map
with cpusets, and none of you guys acked or nacked it.

Max

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28 12:12 ` Mark Hounschell
@ 2008-02-28 19:57   ` Max Krasnyansky
  0 siblings, 0 replies; 94+ messages in thread
From: Max Krasnyansky @ 2008-02-28 19:57 UTC (permalink / raw)
  To: Mark Hounschell
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Oleg Nesterov,
	Steven Rostedt, Paul Jackson, linux-kernel,
	markh@compro.net >> Mark Hounschell

Mark Hounschell wrote:
> Peter Zijlstra wrote:
>> My vision on the direction we should take wrt cpu isolation.
>>
>> Next on the list would be figuring out a nice solution to the workqueue
>> flush issue.
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>>
> 
> Is it now the intent that, not only do I have to enable cpusets in the
> kernel, but I will also have to use them in userland to take advantage of
> this?
> 
> And hot-plug too??
> 
> Can I predict that in the future userland sched_setaffinity will also be
> taken away and we will be forced to use cpusets?
> 
> And hot-plug too??

Mark,
I bet you won't get any replies (besides mine). And yes, this means that you
will have to enable cpusets if Peter's patches go in (looks like they will).
Hot-plug may not be needed unless I convince people to reuse the hot-plug
mechanism instead of introducing new notifiers.
I guess we can make some extensions to expose the "system" bit just like I did
with the "isolated" bit via sysfs, in which case cpusets may not be needed. We'll see.

Max



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28 10:46         ` Ingo Molnar
  2008-02-28 17:47           ` Paul Jackson
@ 2008-02-28 20:11           ` Max Krasnyansky
  2008-02-28 20:13             ` Paul Jackson
  1 sibling, 1 reply; 94+ messages in thread
From: Max Krasnyansky @ 2008-02-28 20:11 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Paul Jackson, a.p.zijlstra, tglx, oleg, rostedt, linux-kernel

Ingo Molnar wrote:
> * Paul Jackson <pj@sgi.com> wrote:
> 
>> But your words sound a lot like what we at SGI call a 'boot' cpuset.
>>
>> Our big honkin NUMA customers, who are managing most of the system 
>> either for a few dedicated, very-important jobs or under a batch 
>> scheduler, need to leave a few nodes to run the classic Unix load such 
>> as init, cron, assorted daemons and the admin's login shell.
>>
>> So we provide them some init script mechanisms that make it easy to 
>> set this up, which includes moving every task (not many at the low 
>> numbered init script time this runs) that isn't pinned (doesn't have a 
>> restricted Cpus_allowed) into the boot cpuset, conventionally named 
>> /dev/cpuset/boot.
> 
> yes. Ideally Peter's patchset should turn into something equivalent and 
> i very much agree with Peter's arguments. There was never any design 
> level problem with cpusets, and the parallel cpu_isolated_map approach 
> was misdirected IMO.
> 
> There was indeed a problem with the _manageability_ of cpusets in 
> certain (rather new) usecases like real-time or virtualization, and how 
> they are connected to other system resources like IRQs and how easy it 
> is to manage these resources. IRQs should probably be tied to specific 
> cpusets and should migrate together with them, should the span of that 
> cpuset be changed. (by default they'd be tied to the boot cpuset)
> 
> IMO Peter's patchset is a good first step in that it removes the 
> cpu_isolated_map API hack, and i think we should try to go the whole way 
> and just offer a /dev/cpuset/boot/ default set that can then be 
> restricted to isolate the default workloads away from other CPUs.

I like the concept of the "boot" set. But we still need a separate "system"
flag. Users should not have to move a cpu back into the "boot" set to allow for
kernel (irqs, etc) activity on it. And it's just more explicit and clear that way.

Max




^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28 20:11           ` Max Krasnyansky
@ 2008-02-28 20:13             ` Paul Jackson
  2008-02-28 20:26               ` Max Krasnyansky
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-02-28 20:13 UTC (permalink / raw)
  To: Max Krasnyansky; +Cc: mingo, a.p.zijlstra, tglx, oleg, rostedt, linux-kernel

Max K wrote:
> I like the concept of the "boot" set. But we still need a separate "system"
> flag. Users should not have to move a cpu back into the "boot" set to allow for
> kernel (irqs, etc) activity on it. And it's just more explicit and clear that way.

Good point -- a "boot" cpuset might be 4 CPUs out of 256 CPUs, just for running
the classic Unix load (daemons, init, login, ...).  But irq's might need to go
to most CPUs, except for some (dare I use the word) isolated CPUs.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28  9:08     ` Ingo Molnar
  2008-02-28  9:17       ` Paul Jackson
@ 2008-02-28 20:23       ` Max Krasnyansky
  1 sibling, 0 replies; 94+ messages in thread
From: Max Krasnyansky @ 2008-02-28 20:23 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Paul Jackson, a.p.zijlstra, tglx, oleg, rostedt, linux-kernel

Ingo Molnar wrote:
> * Paul Jackson <pj@sgi.com> wrote:
> 
>>> i've queued up your patches in sched-devel.git
>> Before this patchset gets too far, I'd like to decide on whether to 
>> adapt my suggestion to call that per-cpuset flag 'cpus_system' (or 
>> anything else with 'cpu' in it, perhaps 'system_cpus' would be more 
>> idiomatic), rather than the tad too generic 'system'.
> 
> yeah. In fact i'm not at all sure this is really a "system" thing - it's 
> more of a "bootup" default.
> once the system has booted up and the user is in a position to create 
> cpusets, i believe the distinction and assymetry between any bootup 
> cpuset and the other cpusets should vanish. The "bootup" cpuset is just 
> a convenience container to handle everything that the box booted up 
> with, and then we can shrink it (without having to enumerate every PID 
> and every irq and other resource explicitly) to make room for other 
> cpusets.
> 
> maybe it's even more idiomatic to call it "set0" and just create a 
> /dev/cpuset/set0/ directory for it and making it an explicit cpuset - 
> instead of the hardcoded /dev/cpusets/system thing? Do you have any 
> established naming scheme for cpusets that we could follow here?

I think that is a separate thing. A bootup default is one thing; being able
to explicitly allow/disallow kernel activity on a CPU is another.

I think "boot" or "set0" makes perfect sense. In fact that was the first thing
I noticed when I started playing with it. ie Even if I just wanted to isolated
 one cpu I now need to create a cpuset for the other cpus and move all the
tasks there explicitly. It'd be very useful if it happens by default.
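
The manual steps look something like this userspace sketch (the "boot" name,
mount point and cpu/mem numbers are illustrative; writes for pinned tasks
simply fail, which is fine here):

#include <stdio.h>
#include <sys/stat.h>

static void write_str(const char *path, const char *str)
{
	FILE *f = fopen(path, "w");

	if (f) {
		fputs(str, f);
		fclose(f);
	}
}

int main(void)
{
	char pid[32];
	FILE *tasks;

	/* give every cpu but cpu 3 to the "boot" set */
	mkdir("/dev/cpuset/boot", 0755);
	write_str("/dev/cpuset/boot/cpus", "0-2");
	write_str("/dev/cpuset/boot/mems", "0");

	/* move each task out of the top set, one pid per write */
	tasks = fopen("/dev/cpuset/tasks", "r");
	if (!tasks)
		return 1;
	while (fgets(pid, sizeof(pid), tasks))
		write_str("/dev/cpuset/boot/tasks", pid);
	fclose(tasks);
	return 0;
}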

Max




^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28 20:13             ` Paul Jackson
@ 2008-02-28 20:26               ` Max Krasnyansky
  2008-02-28 20:27                 ` Paul Jackson
  0 siblings, 1 reply; 94+ messages in thread
From: Max Krasnyansky @ 2008-02-28 20:26 UTC (permalink / raw)
  To: Paul Jackson; +Cc: mingo, a.p.zijlstra, tglx, oleg, rostedt, linux-kernel



Paul Jackson wrote:
> Max K wrote:
>> I like the concept of the "boot" set. But we still need a separate "system"
>> flag. Users should not have to move a cpu back into the "boot" set to allow for
>> kernel (irqs, etc) activity on it. And it's just more explicit and clear that way.
> 
> Good point -- a "boot" cpuset might be 4 CPUs out of 256 CPUs, just for running
> the classic Unix load (daemons, init, login, ...).  But irq's might need to go
> to most CPUs, except for some (dare I use the word) isolated CPUs.
> 

btw Can you send me those init scripts that you mentioned? ie those that
create the "boot" set from userspace.

Max

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28 20:26               ` Max Krasnyansky
@ 2008-02-28 20:27                 ` Paul Jackson
  2008-02-28 20:45                   ` Max Krasnyansky
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-02-28 20:27 UTC (permalink / raw)
  To: Max Krasnyansky; +Cc: mingo, a.p.zijlstra, tglx, oleg, rostedt, linux-kernel

Max K wrote:
> btw Can you send me those init scripts that you mentioned? ie those that
> create the "boot" set from userspace.

They aren't open source ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28 20:27                 ` Paul Jackson
@ 2008-02-28 20:45                   ` Max Krasnyansky
  0 siblings, 0 replies; 94+ messages in thread
From: Max Krasnyansky @ 2008-02-28 20:45 UTC (permalink / raw)
  To: Paul Jackson; +Cc: mingo, a.p.zijlstra, tglx, oleg, rostedt, linux-kernel

Paul Jackson wrote:
> Max K wrote:
>> btw Can you send me those init scripts that you mentioned? ie those that
>> create the "boot" set from userspace.
> 
> They aren't open source ;).

Oh, I see. It's probably patented too ;-)

Max

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28 17:37             ` Paul Jackson
@ 2008-02-28 21:24               ` David Rientjes
  2008-02-28 22:46                 ` Paul Jackson
  0 siblings, 1 reply; 94+ messages in thread
From: David Rientjes @ 2008-02-28 21:24 UTC (permalink / raw)
  To: Paul Jackson; +Cc: mingo, a.p.zijlstra, tglx, oleg, rostedt, maxk, linux-kernel

On Thu, 28 Feb 2008, Paul Jackson wrote:

> I don't have strong opinions either way on this patch; it adds an error
> check that makes sense.  I haven't seen much problem not having this check,
> nor do I know of any code that depends on doing what this check prohibits.
> 

How about moving watchdog/0 to a cpuset with exclusive access to only cpu 
1?

> Except for three details:
> 
>  1) +	if (unlikely((p->flags & PF_CPU_BOUND) && p != current &&
>     +	    	     !cpus_equal(p->cpus_allowed, new_mask))) {
>     +		ret = -EINVAL;
> 
>     The check for equal cpus allowed seems too strong.  Shouldn't you be
>     checking that all of task p's cpus_allowed would still be allowed in
>     the new mask:
> 
>     +	if (unlikely((p->flags & PF_CPU_BOUND) && p != current &&
>     +	    	     !cpus_subset(p->cpus_allowed, new_mask))) {
>     +		ret = -EINVAL;
> 

That's a convenient way for a kthread to temporarily expand its set of 
cpus_allowed and then never be able to remove the added cpus again.  Do 
you have any examples in the tree where a kthread does this?

>  2) Doesn't this leave out a check for the flip side -- shrinking
>     the cpus allowed by a cpuset so that it no longer contains those
>     required by any PF_CPU_BOUND tasks in that cpuset?  I'm not sure
>     if this second check is a good idea or not.
> 

That's why the check in set_cpus_allowed() is

	cpus_equal(p->cpus_allowed, newmask)

since it prevents PF_CPU_BOUND tasks from being moved out of the root 
cpuset.
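
Spelled out for a kthread pinned to cpu 0 being attached to a cpuset
spanning cpus 0-1 (an illustration only, using the current cpumask helpers):

static void contrast_checks(void)
{
	cpumask_t pinned = cpumask_of_cpu(0);	/* set by kthread_bind() */
	cpumask_t new_mask = CPU_MASK_NONE;	/* cpuset 'a' spans 0-1 */

	cpu_set(0, new_mask);
	cpu_set(1, new_mask);

	/* cpus_subset() is true here: Paul's variant would allow the move */
	BUG_ON(!cpus_subset(pinned, new_mask));
	/* cpus_equal() is false here: my variant rejects the move, so the
	 * kthread stays in the root cpuset */
	BUG_ON(cpus_equal(pinned, new_mask));
}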

>  3) Could we call this PF_CPU_PINNED instead?  I tend to use "cpu
>     bound" to refer to tasks that consume alot of CPU cycles (which
>     these pinned tasks rarely do), and "pinned" to refer to what is
>     done to confine a task to a particular subset of all possible CPUs.
>     It looks to me like some code in kernel/sched.c already uses the
>     word pinned in this same way, so PF_CPU_PINNED would be more
>     consistent terminology.
> 

PF_CPU_BOUND follows the nomenclature of kthread_bind() really well, but 
it could probably be confused with a processor-bound task.  So perhaps 
PF_BOUND_CPU is even better?

		David

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28 21:24               ` David Rientjes
@ 2008-02-28 22:46                 ` Paul Jackson
  2008-02-28 23:00                   ` David Rientjes
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-02-28 22:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: mingo, a.p.zijlstra, tglx, oleg, rostedt, maxk, linux-kernel

David wrote:
> How about moving watchdog/0 to a cpuset with exclusive access to only cpu 
> 1?

I don't understand your question here.

> >     +	if (unlikely((p->flags & PF_CPU_BOUND) && p != current &&
> >     +	    	     !cpus_subset(p->cpus_allowed, new_mask))) {
> >     +		ret = -EINVAL;
> > 
> 
> That's a convenient way for a kthread to temporarily expand its set of 
> cpus_allowed and then never be able to remove the added cpus again.  Do 
> you have any examples in the tree where a kthread does this?

Good question.  Actually, we -normally- have pinned tasks in the top cpuset,
where the top cpuset allows all CPUs, but the pinned task has a cpus_allowed
(in its task struct) of just one or a few CPUs (for node pinning.)

So ... could we allow moving pinned threads to any cpuset that allowed
the CPUs to which it was pinned (my cpus_subset() test, above), but
-not- change the pinned tasks cpus_allowed in its task struct, keeping
it pinned to just the same one or few CPUs?

> That's why the check in set_cpus_allowed() is
> 
> 	cpus_equal(p->cpus_allowed, newmask)
> 
> since it prevents PF_CPU_BOUND tasks from being moved out of the root 
> cpuset.

I don't think that the cpus_equal() check prevents that (shrinking a
pinned tasks cpuset out from under it.)  Try the following on a freshly
booted system with your proposed patch:

  mkdir /dev/cpuset
  mount -t cpuset cpuset /dev/cpuset
  cd /dev/cpuset
  mkdir a
  cp ???s a                                # copy 'cpus' and 'mems' into 'a'
  < tasks sed -un -e p -e 10q > a/tasks    # move the first ten pids, one write each

I'll wager you just moved a few pinned tasks into cpuset 'a'.  This
would be allowed, as 'a' has the same cpus as the top cpuset.  But then
one could shrink a (if it had more than 1 CPU in the first place), leaving
some pinned tasks in a cpuset they weren't allowed to run in, essentially
unpinning them.

> PF_CPU_BOUND follows the nomenclature of kthread_bind() really well, but 
> it could probably be confused with a processor-bound task.  So perhaps 
> PF_BOUND_CPU is even better?

Good point - "BOUND" as the past tense of "BIND".  How about
PF_THREAD_BIND ;)?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28 22:46                 ` Paul Jackson
@ 2008-02-28 23:00                   ` David Rientjes
  2008-02-29  0:16                     ` Paul Jackson
  0 siblings, 1 reply; 94+ messages in thread
From: David Rientjes @ 2008-02-28 23:00 UTC (permalink / raw)
  To: Paul Jackson; +Cc: mingo, a.p.zijlstra, tglx, oleg, rostedt, maxk, linux-kernel

On Thu, 28 Feb 2008, Paul Jackson wrote:

> > How about moving watchdog/0 to a cpuset with exclusive access to only cpu 
> > 1?
> 
> I don't understand your question here.
> 

Move the watchdog/0 thread to a cpuset that doesn't have access to cpu 0.  

> > >     +	if (unlikely((p->flags & PF_CPU_BOUND) && p != current &&
> > >     +	    	     !cpus_subset(p->cpus_allowed, new_mask))) {
> > >     +		ret = -EINVAL;
> > > 
> > 
> > That's a convenient way for a kthread to temporarily expand its set of 
> > cpus_allowed and then never be able to remove the added cpus again.  Do 
> > you have any examples in the tree where a kthread does this?
> 
> Good question.  Actually, we -normally- have pinned tasks in the top cpuset,
> where the top cpuset allows all CPUs, but the pinned task has a cpus_allowed
> (in its task struct) of just one or a few CPUs (for node pinning.)
> 

Same situation occurs for a task bound to a cpuset that issues a 
sched_setaffinity() call to restrict its cpumask further.  There's nothing 
new there.
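
For reference, a minimal runnable sketch of a task doing exactly that (cpu 2
is an arbitrary pick, assumed to be inside the task's cpuset):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(2, &set);	/* restrict further, within the cpuset */
	if (sched_setaffinity(0, sizeof(set), &set) < 0) {
		perror("sched_setaffinity");
		return 1;
	}
	return 0;
}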

> So ... could we allow moving pinned threads to any cpuset that allowed
> the CPUs to which it was pinned (my cpus_subset() test, above), but
> -not- change the pinned tasks cpus_allowed in its task struct, keeping
> it pinned to just the same one or few CPUs?
> 

I'd hesitate to do that unless you can guarantee that restricting 
kthreads mems_allowed via the cpuset interface won't cause any problems 
either.  Is there a benefit to reducing the size of a kthread's 
mems_allowed that doesn't have an adverse effect on the kernel?  What 
about kswapd?

> I don't think that the cpus_equal() check prevents that (shrinking a
> pinned tasks cpuset out from under it.)  Try the following on a freshly
> booted system with your proposed patch:
> 
>   mkdir /dev/cpuset
>   mount -t cpuset cpuset /dev/cpuset
>   cd /dev/cpuset
>   mkdir a
>   cp ???s a
>   < tasks sed -un -e p -e 10q > a/tasks
> 
> I'll wager you just moved a few pinned tasks into cpuset 'a'.  This
> would be allowed, as 'a' has the same cpus as the top cpuset.  But then
> one could shrink a (if it had more than 1 CPU in the first place), leaving
> some pinned tasks in a cpuset they weren't allowed to run in, essentially
> unpinning them.
> 

You can move them, but you cannot reduce the size of the thread's 
cpus_allowed since it was bound to a specific cpu via kthread_bind().  The 
change to set_cpus_allowed() simply prevents this from happening in the 
kernel as opposed to just stopping the possibility through cpusets.

> > PF_CPU_BOUND follows the nomenclature of kthread_bind() really well, but 
> > it could probably be confused with a processor-bound task.  So perhaps 
> > PF_BOUND_CPU is even better?
> 
> Good point - "BOUND" as the past tense of "BIND".  How about
> PF_THREAD_BIND ;)?
> 

Sure.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28 23:00                   ` David Rientjes
@ 2008-02-29  0:16                     ` Paul Jackson
  2008-02-29  1:05                       ` David Rientjes
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-02-29  0:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: mingo, a.p.zijlstra, tglx, oleg, rostedt, maxk, linux-kernel

David wrote:
> Move the watchdog/0 thread to a cpuset that doesn't have access to cpu 0.  

I still don't understand ... you must have some context in mind that
I've spaced out ... I can't even tell if that is a statement or a
question.

> I'd hesitate to do that unless you can guarantee that restricting 
> kthreads mems_allowed via the cpuset interface won't cause any problems 
> either.  Is there a benefit to reducing the size of a kthread's 
> mems_allowed that doesn't have an adverse effect on the kernel?  What 
> about kswapd?

Well ... I'm suspecting we've got this portion of our discussion wrapped
around the axle one time too many.

Backing up, hopefully unwrapping, you seemed to allow moving bound tasks
only to cpusets with the same cpus (how come you didn't check for the
same memory nodes too?).  If you really needed to move bound tasks at all,
then that seemed like an unnecessarily tight constraint.  It wouldn't hurt
the bound task to move to another cpuset that still allowed the CPUs it was
bound to.

... but after an another iteration of that subthread ... I'm wondering
why you have to move bound tasks at all.  How about PF_THREAD_BIND just
meaning (1) "can't be moved to any other cpuset", and (2) "always
placed in the top cpuset," so we don't have to worry about being unable
to move threads out of child cpusets.

Do you have any situation in which pinned threads have to be moved?

I don't.  Can we just prohibit it?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-29  0:16                     ` Paul Jackson
@ 2008-02-29  1:05                       ` David Rientjes
  2008-02-29  3:34                         ` Paul Jackson
  0 siblings, 1 reply; 94+ messages in thread
From: David Rientjes @ 2008-02-29  1:05 UTC (permalink / raw)
  To: Paul Jackson; +Cc: mingo, a.p.zijlstra, tglx, oleg, rostedt, maxk, linux-kernel

On Thu, 28 Feb 2008, Paul Jackson wrote:

> > Move the watchdog/0 thread to a cpuset that doesn't have access to cpu 0.  
> 
> I still don't understand ... you must have some context in mind that
> I've spaced out ... I can't even tell if that is a statement or a
> question.
> 

You said that you weren't aware of any problems that could arise that are 
fixed with this added check in set_cpus_allowed(), so I asked that you 
set up a cpuset that doesn't have access to cpu 0 and move the watchdog/0 
thread to it.

> Backing up, hopefully unwrapping, you seemed to allow moving bound tasks
> only to cpusets with the same cpus (how come you didn't check for the
> same memory nodes too?).  If you really needed to move bound tasks at all,
> then that seemed like an unnecessarily tight constraint.  It wouldn't hurt
> the bound task to move to another cpuset that still allowed the CPUs it was
> bound to.
> 
> ... but after an another iteration of that subthread ... I'm wondering
> why you have to move bound tasks at all.  How about PF_THREAD_BIND just
> meaning (1) "can't be moved to any other cpuset", and (2) "always
> placed in the top cpuset," so we don't have to worry about being unable
> to move threads out of child cpusets.
> 
> Do you have any situation in which pinned threads have to be moved?
> 
> I don't.  Can we just prohibit it?
> 

The fix is more general than just the cpusets case, even though it is 
probably the only user in the kernel of set_cpus_allowed() that allows 
bound kthreads to be moved.

The idea is to expressly deny kthreads that have called kthread_bind() 
from changing their cpus_allowed via set_cpus_allowed().  Cpusets should 
handle that gracefully, but my fix is more general: it adds the check to 
set_cpus_allowed() instead of in the cpusets code.

This is a little difficult with the current way that cpusets calls 
set_cpus_allowed() since its attach function is void and does not return 
error or success to the cgroup interface, yet my patch prevents the soft 
lockups and kernel crashes that can occur when cpusets changes the 
cpus_allowed of watchdog or migration threads, which is obviously in error.

		David

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-29  1:05                       ` David Rientjes
@ 2008-02-29  3:34                         ` Paul Jackson
  2008-02-29  4:00                           ` David Rientjes
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-02-29  3:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: mingo, a.p.zijlstra, tglx, oleg, rostedt, maxk, linux-kernel

David, responding to pj, responding to ...:
>
> > > Move the watchdog/0 thread to a cpuset that doesn't have access to cpu 0.  
> > 
> > I still don't understand ... you must have some context in mind that
> > I've spaced out ... I can't even tell if that is a statement or a
> > question.
> > 
> 
> You said that you weren't aware of any problems that could arise that are 
> fixed with this added check in set_cpus_allowed(),

Ok, now I understand your question - thanks.

I think your question arises from misreading what I wrote.

I did not say that I wasn't "aware of any problems that could arise"

I did say, as you quoted, from Thu, 28 Feb 2008 11:37:28 -0600:
>
> I don't have strong opinions either way on this patch; it adds an error
> check that makes sense.  I haven't seen much problem not having this check,
> nor do I know of any code that depends on doing what this check prohibits.

 - This does not say no (none whatsoever) problem could (ever in the future) arise.

 - This does say not much (just a little) problem had arisen (so far in the past).

Apparently, you thought I was trying to reject the patch on the grounds
that no such problem could ever occur, and you were showing how such a
problem could occur.  I wasn't trying to reject the patch, and I agree
that the check made sense, and I agree that such a problem could occur,
as your example shows.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-29  3:34                         ` Paul Jackson
@ 2008-02-29  4:00                           ` David Rientjes
  2008-02-29  6:53                             ` Paul Jackson
  0 siblings, 1 reply; 94+ messages in thread
From: David Rientjes @ 2008-02-29  4:00 UTC (permalink / raw)
  To: Paul Jackson; +Cc: mingo, a.p.zijlstra, tglx, oleg, rostedt, maxk, linux-kernel

On Thu, 28 Feb 2008, Paul Jackson wrote:

> Apparently, you thought I was trying to reject the patch on the grounds
> that no such problem could ever occur, and you were showing how such a
> problem could occur.  I wasn't trying to reject the patch, and I agree
> that the check made sense, and I agree that such a problem could occur,
> as your example shows.
> 

Ok, good.  We're in agreement that my patch fixes a problem in allowing 
kthreads to move to cpusets that either currently or eventually deny them 
access to the cpu to which they are bound.

Now do you have any preference, without extensive cpusets or cgroups 
hacking, for where we can gracefully handle -EINVAL being returned from 
set_cpus_allowed() so the task doesn't end up in a different cpuset?  The 
return value of set_cpus_allowed() is currently ignored by the cpuset 
implementation and, regardless, cpuset_attach() returns void.

cpuset_can_attach() does some sanity checking, but we still need the call to 
set_cpus_allowed() to go through the new logic, and it is what actually 
changes tsk->cpus_allowed when correctly invoked.

So the cpusets implementation needs a similar check in cpuset_can_attach() 
so that the task is expressly denied from moving before we ever call 
cpuset_attach().  It doesn't make sense that a kthread is going to move 
itself through the cpusets interface, so this can simply be done by 
checking for tsk->flags & PF_THREAD_BOUND.



sched: prevent bound kthreads from changing cpus_allowed

Kthreads that have called kthread_bind() are bound to specific cpus, so 
other tasks should not be able to change their cpus_allowed from under 
them.  Otherwise, it is possible to move kthreads, such as the migration 
or watchdog threads, so they are not allowed access to the cpu they work 
on.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/sched.h |    1 +
 kernel/cpuset.c       |    2 ++
 kernel/kthread.c      |    1 +
 kernel/sched.c        |    6 ++++++
 4 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1464,6 +1464,7 @@ static inline void put_task_struct(struct task_struct *t)
 #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
 #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
 #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
+#define PF_THREAD_BOUND 0x04000000	/* Thread bound to specific cpu */
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezeable */
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1162,6 +1162,8 @@ static int cpuset_can_attach(struct cgroup_subsys *ss,
 
 	if (cpus_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
 		return -ENOSPC;
+	if (tsk->flags & PF_THREAD_BOUND)
+		return -EINVAL;
 
 	return security_task_setscheduler(tsk, 0, NULL);
 }
diff --git a/kernel/kthread.c b/kernel/kthread.c
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -180,6 +180,7 @@ void kthread_bind(struct task_struct *k, unsigned int cpu)
 	wait_task_inactive(k);
 	set_task_cpu(k, cpu);
 	k->cpus_allowed = cpumask_of_cpu(cpu);
+	k->flags |= PF_THREAD_BOUND;
 }
 EXPORT_SYMBOL(kthread_bind);
 
diff --git a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5345,6 +5345,12 @@ int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
 		goto out;
 	}
 
+	if (unlikely((p->flags & PF_THREAD_BOUND) && p != current &&
+		     !cpus_equal(p->cpus_allowed, new_mask))) {
+		ret = -EINVAL;
+		goto out;
+	}
+
 	if (p->sched_class->set_cpus_allowed)
 		p->sched_class->set_cpus_allowed(p, &new_mask);
 	else {

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-29  4:00                           ` David Rientjes
@ 2008-02-29  6:53                             ` Paul Jackson
  0 siblings, 0 replies; 94+ messages in thread
From: Paul Jackson @ 2008-02-29  6:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: mingo, a.p.zijlstra, tglx, oleg, rostedt, maxk, linux-kernel

David wrote:
> @@ -1162,6 +1162,8 @@ static int cpuset_can_attach(struct cgroup_subsys *ss,
>  
>  	if (cpus_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
>  		return -ENOSPC;
> +	if (tsk->flags & PF_THREAD_BOUND)
> +		return -EINVAL;
>  
>  	return security_task_setscheduler(tsk, 0, NULL);
>  }

I'm ok with this.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-28  7:50 ` Ingo Molnar
  2008-02-28  8:08   ` Paul Jackson
  2008-02-28 17:48   ` Max Krasnyanskiy
@ 2008-02-29  8:31   ` Andrew Morton
  2008-02-29  8:36     ` Andrew Morton
  2008-02-29  9:10     ` Ingo Molnar
  2 siblings, 2 replies; 94+ messages in thread
From: Andrew Morton @ 2008-02-29  8:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, Max Krasnyanskiy, linux-kernel

On Thu, 28 Feb 2008 08:50:11 +0100 Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > My vision on the direction we should take wrt cpu isolation.
> > 
> > Next on the list would be figuring out a nice solution to the 
> > workqueue flush issue.
> 
> nice work Peter, i find this "system sets" extension to cpusets a much 
> more elegant (and much more future-proof) solution than the proposed 
> spreadout of the limited hack of isolcpus/cpu_isolated_map. It 
> concentrates us on a single API and on a single mechanism to handle 
> isolation matters. (be that for clustering/supercomputing or real-time 
> purposes)
> 
> Thanks for insisting on using cpusets for this!
> 
> i've queued up your patches in sched-devel.git, and lets make sure this 
> has no side-effects on existing functionality. (it shouldnt)
> 

It of course lays waste to a series of cgroup patches from Paul Menage
which I already had queued.

So I shall drop git-sched again.

How often do I have to say this?  git-sched is not
git-everything-which-looks-shiny!  It is for the CPU scheduler.

If you had put this patchset into a private branch for private testing, or
even into a separate git-petes-stuff then I wouldn't have to collaterally
drop the entirety of git-sched because of this.

Sigh.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-29  8:31   ` Andrew Morton
@ 2008-02-29  8:36     ` Andrew Morton
  2008-02-29  9:10     ` Ingo Molnar
  1 sibling, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2008-02-29  8:36 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Thomas Gleixner, Oleg Nesterov,
	Steven Rostedt, Paul Jackson, Max Krasnyanskiy, linux-kernel

On Fri, 29 Feb 2008 00:31:55 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:

> So I shall drop git-sched again.

And when I do this I get:

***************
*** 8125,8137 ****
  	return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
  }
  
- static int cpu_rt_period_write_uint(struct cgroup *cgrp, struct cftype *cftype,
  		u64 rt_period_us)
  {
  	return sched_group_set_rt_period(cgroup_tg(cgrp), rt_period_us);
  }
  
- static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
  {
  	return sched_group_rt_period(cgroup_tg(cgrp));
  }
--- 8125,8137 ----
  	return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
  }
  
+ static int cpu_rt_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
  		u64 rt_period_us)
  {
  	return sched_group_set_rt_period(cgroup_tg(cgrp), rt_period_us);
  }
  
+ static u64 cpu_rt_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
  {
  	return sched_group_rt_period(cgroup_tg(cgrp));
  }
***************
*** 8367,8374 ****
  	},
  	{
  		.name = "rt_period_us",
- 		.read_uint = cpu_rt_period_read_uint,
- 		.write_uint = cpu_rt_period_write_uint,
  	},
  #endif
  };
--- 8367,8374 ----
  	},
  	{
  		.name = "rt_period_us",
+ 		.read_u64 = cpu_rt_period_read_u64,
+ 		.write_u64 = cpu_rt_period_write_u64,
  	},
  #endif
  };

and if I then fix that up, and later restore git-sched, Paul's patch is now
broken.

Your trees continue to cause more trouble than anyone else's have ever
done, by a lot.

Let me try yesterday's git-sched.patch.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-29  8:31   ` Andrew Morton
  2008-02-29  8:36     ` Andrew Morton
@ 2008-02-29  9:10     ` Ingo Molnar
  2008-02-29 18:06       ` Max Krasnyanskiy
  1 sibling, 1 reply; 94+ messages in thread
From: Ingo Molnar @ 2008-02-29  9:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, Max Krasnyanskiy, linux-kernel


* Andrew Morton <akpm@linux-foundation.org> wrote:

> It of course lays waste to a series of cgroup patches from Paul Menage 
> which I already had queued.

Andrew, please stop tracking sched-devel.git and track this tree 
instead:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

thanks,

	Ingo

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation
  2008-02-29  9:10     ` Ingo Molnar
@ 2008-02-29 18:06       ` Max Krasnyanskiy
  0 siblings, 0 replies; 94+ messages in thread
From: Max Krasnyanskiy @ 2008-02-29 18:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Peter Zijlstra, Thomas Gleixner, Oleg Nesterov,
	Steven Rostedt, Paul Jackson, linux-kernel

Ingo Molnar wrote:
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
>> It of course lays waste to a series of cgroup patches from Paul Menage 
>> which I already had queued.
> 
> Andrew, please stop tracking sched-devel.git and track this tree 
> instead:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

Another option is to have a cpusets or cpu isolation tree, which I've started 
already. I was saying from the very beginning that cpu isolation stuff does not 
imho belong in the scheduler tree. Besides a tiny patch to sched.c that 
adds/removes the bitmaps, there are no scheduler changes needed for this 
specifically.

Peter, Ingo, if you guys are ok with this let's just have this stuff in 
cpuisol-2.6.git. I'm anyway rebasing it on top of Peter's work. Of course we'll 
go through regular review and stuff and Andrew can track that tree separately.

Just a suggestion. I'm ok with submitting patches via sched-devel. Separate 
tree seems more appropriate though.

Max

^ permalink raw reply	[flat|nested] 94+ messages in thread

* [RFC/PATCH] cpuset: cpuset irq affinities
  2008-02-27 22:21 [RFC/PATCH 0/4] CPUSET driven CPU isolation Peter Zijlstra
                   ` (6 preceding siblings ...)
  2008-02-28 12:12 ` Mark Hounschell
@ 2008-02-29 18:55 ` Peter Zijlstra
  2008-02-29 19:02   ` Ingo Molnar
                     ` (2 more replies)
  7 siblings, 3 replies; 94+ messages in thread
From: Peter Zijlstra @ 2008-02-29 18:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Oleg Nesterov, Steven Rostedt, Paul Jackson,
	Max Krasnyanskiy, linux-kernel, David Rientjes

Hi Paul,

How about something like this; along with the in-kernel version
of /cgroup/boot this could also provide the desired semantics.

Another benefit of this approach would be that it no longer requires
PF_THREAD_BIND, as we'd only stick unbound kthreads into that cgroup.

(compile tested only)
---
Subject: cpuset: cpuset irq affinities

Allow for an association between cpusets and irqs. 

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/irq.h |    9 ++
 kernel/cpuset.c     |  160 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/irq/manage.c |   19 ++++++
 3 files changed, 188 insertions(+)

Index: linux-2.6-2/include/linux/irq.h
===================================================================
--- linux-2.6-2.orig/include/linux/irq.h
+++ linux-2.6-2/include/linux/irq.h
@@ -174,11 +174,20 @@ struct irq_desc {
 #ifdef CONFIG_PROC_FS
 	struct proc_dir_entry	*dir;
 #endif
+#ifdef CONFIG_CPUSETS
+	struct cpuset		*cs;
+#endif
 	const char		*name;
 } ____cacheline_internodealigned_in_smp;
 
 extern struct irq_desc irq_desc[NR_IRQS];
 
+struct irq_iterator {
+	int (*function)(struct irq_iterator *, int, struct irq_desc *);
+};
+
+extern int irq_iterator(struct irq_iterator *);
+
 /*
  * Migration helpers for obsolete names, they will go away:
  */
Index: linux-2.6-2/kernel/cpuset.c
===================================================================
--- linux-2.6-2.orig/kernel/cpuset.c
+++ linux-2.6-2/kernel/cpuset.c
@@ -50,6 +50,9 @@
 #include <linux/time.h>
 #include <linux/backing-dev.h>
 #include <linux/sort.h>
+#ifdef CONFIG_GENERIC_HARDIRQS
+#include <linux/irq.h>
+#endif
 
 #include <asm/uaccess.h>
 #include <asm/atomic.h>
@@ -732,6 +735,44 @@ void cpuset_change_cpumask(struct task_s
 	set_cpus_allowed(tsk, (cgroup_cs(scan->cg))->cpus_allowed);
 }
 
+#ifdef CONFIG_GENERIC_HARDIRQS
+struct cpuset_irq_cpumask {
+	struct irq_iterator v;
+	struct cpuset *cs;
+	cpumask_t mask;
+};
+
+static int
+update_irq_cpumask(struct irq_iterator *v, int irq, struct irq_desc *desc)
+{
+	struct cpuset_irq_cpumask *s =
+		container_of(v, struct cpuset_irq_cpumask, v);
+
+	if (desc->cs != s->cs)
+		return 0;
+
+	irq_set_affinity(irq, s->mask);
+
+	return 0;
+}
+
+static void update_irqs_cpumask(struct cpuset *cs)
+{
+	struct cpuset_irq_cpumask s = {
+		.v = { .function = update_irq_cpumask },
+		.cs = cs,
+	};
+
+	cpus_and(s.mask, cpu_online_map, cs->cpus_allowed);
+
+	irq_iterator(&s.v);
+}
+#else
+static void update_irqs_cpumask(struct cpuset *cs)
+{
+}
+#endif
+
 /**
  * update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
  * @cs: the cpuset to consider
@@ -795,6 +836,8 @@ static int update_cpumask(struct cpuset 
 	cgroup_scan_tasks(&scan);
 	heap_free(&heap);
 
+	update_irqs_cpumask(cs);
+
 	if (is_load_balanced)
 		rebuild_sched_domains();
 	return 0;
@@ -1056,6 +1099,52 @@ static int update_flag(cpuset_flagbits_t
 	return 0;
 }
 
+#ifdef CONFIG_GENERIC_HARDIRQS
+struct cpuset_irq_update {
+	struct irq_iterator v;
+	struct cpuset *cs;
+	int irq;
+};
+
+static int
+cpuset_update_irq(struct irq_iterator *v, int irq, struct irq_desc *desc)
+{
+	struct cpuset_irq_update *s =
+		container_of(v, struct cpuset_irq_update, v);
+	cpumask_t online_set;
+	int ret;
+
+	if (irq != s->irq)
+		return 0;
+
+	cpus_and(online_set, cpu_online_map, s->cs->cpus_allowed);
+
+	ret = irq_set_affinity(irq, online_set);
+	if (!ret)
+		desc->cs = s->cs;
+
+	return ret;
+}
+
+static int update_irqs(struct cpuset *cs, char *buf)
+{
+	struct cpuset_irq_update s = {
+		.v = { .function = cpuset_update_irq },
+		.cs = cs,
+	};
+
+	if (sscanf(buf, "%d", &s.irq) != 1)
+		return -EIO;
+
+	return irq_iterator(&s.v);
+}
+#else
+static int update_irqs(struct cpuset *cs, char *buf)
+{
+	return 0;
+}
+#endif
+
 /*
  * Frequency meter - How fast is some event occurring?
  *
@@ -1206,6 +1295,7 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_IRQS,
 } cpuset_filetype_t;
 
 static ssize_t cpuset_common_file_write(struct cgroup *cont,
@@ -1273,6 +1363,9 @@ static ssize_t cpuset_common_file_write(
 		retval = update_flag(CS_SPREAD_SLAB, cs, buffer);
 		cs->mems_generation = cpuset_mems_generation++;
 		break;
+	case FILE_IRQS:
+		retval = update_irqs(cs, buffer);
+		break;
 	default:
 		retval = -EINVAL;
 		goto out2;
@@ -1321,6 +1414,59 @@ static int cpuset_sprintf_memlist(char *
 	return nodelist_scnprintf(page, PAGE_SIZE, mask);
 }
 
+#ifdef CONFIG_GENERIC_HARDIRQS
+struct cpuset_irq_print {
+	struct irq_iterator v;
+	struct cpuset *cs;
+	char *buf;
+	int len;
+	int buflen;
+};
+
+static int
+cpuset_sprintf_irq(struct irq_iterator *v, int irq, struct irq_desc *desc)
+{
+	struct cpuset_irq_print *s =
+		container_of(v, struct cpuset_irq_print, v);
+
+	if (desc->cs != s->cs)
+		return 0;
+
+	if (s->len > 0)
+		s->len += scnprintf(s->buf + s->len, s->buflen - s->len, " ");
+	s->len += scnprintf(s->buf + s->len, s->buflen - s->len, "%d", irq);
+
+	return 0;
+}
+
+static int cpuset_sprintf_irqlist(char *page, struct cpuset *cs)
+{
+	int ret;
+
+	struct cpuset_irq_print s = {
+		.v = { .function = cpuset_sprintf_irq },
+		.cs = cs,
+		.buf = page,
+		.len = 0,
+		.buflen = PAGE_SIZE,
+	};
+
+	mutex_lock(&callback_mutex);
+	ret = irq_iterator(&s.v);
+	mutex_unlock(&callback_mutex);
+
+	if (!ret)
+		ret = s.len;
+
+	return ret;
+}
+#else
+static int cpuset_sprintf_irqlist(char *page, struct cpuset *cs)
+{
+	return 0;
+}
+#endif
+
 static ssize_t cpuset_common_file_read(struct cgroup *cont,
 				       struct cftype *cft,
 				       struct file *file,
@@ -1369,6 +1515,9 @@ static ssize_t cpuset_common_file_read(s
 	case FILE_SPREAD_SLAB:
 		*s++ = is_spread_slab(cs) ? '1' : '0';
 		break;
+	case FILE_IRQS:
+		s += cpuset_sprintf_irqlist(s, cs);
+		break;
 	default:
 		retval = -EINVAL;
 		goto out;
@@ -1459,6 +1608,13 @@ static struct cftype cft_spread_slab = {
 	.private = FILE_SPREAD_SLAB,
 };
 
+static struct cftype cft_irqs = {
+	.name = "irqs",
+	.read = cpuset_common_file_read,
+	.write = cpuset_common_file_write,
+	.private = FILE_IRQS,
+};
+
 static int cpuset_populate(struct cgroup_subsys *ss, struct cgroup *cont)
 {
 	int err;
@@ -1481,6 +1637,10 @@ static int cpuset_populate(struct cgroup
 		return err;
 	if ((err = cgroup_add_file(cont, ss, &cft_spread_slab)) < 0)
 		return err;
+#ifdef CONFIG_GENERIC_HARDIRQS
+	if ((err = cgroup_add_file(cont, ss, &cft_irqs)) < 0)
+		return err;
+#endif
 	/* memory_pressure_enabled is in root cpuset only */
 	if (err == 0 && !cont->parent)
 		err = cgroup_add_file(cont, ss,
Index: linux-2.6-2/kernel/irq/manage.c
===================================================================
--- linux-2.6-2.orig/kernel/irq/manage.c
+++ linux-2.6-2/kernel/irq/manage.c
@@ -96,6 +96,25 @@ int irq_set_affinity(unsigned int irq, c
 
 #endif
 
+int irq_iterator(struct irq_iterator *v)
+{
+	int ret = 0;
+	int irq;
+
+	for (irq = 0; irq < NR_IRQS; irq++) {
+		struct irq_desc *desc = &irq_desc[irq];
+
+		if (desc->chip == &no_irq_chip)
+			continue;
+
+		ret = v->function(v, irq, desc);
+		if (ret)
+			return ret;
+	}
+
+	return ret;
+}
+
 /**
  *	disable_irq_nosync - disable an irq without waiting
  *	@irq: Interrupt to disable



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-02-29 18:55 ` [RFC/PATCH] cpuset: cpuset irq affinities Peter Zijlstra
@ 2008-02-29 19:02   ` Ingo Molnar
  2008-02-29 20:52     ` Max Krasnyanskiy
  2008-02-29 20:55   ` Paul Jackson
  2008-03-02  5:18   ` Christoph Hellwig
  2 siblings, 1 reply; 94+ messages in thread
From: Ingo Molnar @ 2008-02-29 19:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Oleg Nesterov, Steven Rostedt, Paul Jackson,
	Max Krasnyanskiy, linux-kernel, David Rientjes


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> @@ -174,11 +174,20 @@ struct irq_desc {
>  #ifdef CONFIG_PROC_FS
>  	struct proc_dir_entry	*dir;
>  #endif
> +#ifdef CONFIG_CPUSETS
> +	struct cpuset		*cs;
> +#endif

i like this approach - it makes irqs more resource-alike and attaches 
them to a specific resource control group.

So if /cgroup/boot is changed to have less CPUs then the "default" irqs 
move along with it.

but if an isolated RT domain has specific irqs attached to it (say the 
IRQ of some high-speed data capture device), then the irqs would move 
together with that domain.

irqs are no longer a bolted-upon concept, but more explicitly managed.

[ If you boot-test it and if Paul agrees with the general approach then
  i could even apply it to sched-devel.git ;-) ]

	Ingo

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-02-29 19:02   ` Ingo Molnar
@ 2008-02-29 20:52     ` Max Krasnyanskiy
  2008-02-29 21:03       ` Peter Zijlstra
  0 siblings, 1 reply; 94+ messages in thread
From: Max Krasnyanskiy @ 2008-02-29 20:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, linux-kernel, David Rientjes

Ingo Molnar wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
>> @@ -174,11 +174,20 @@ struct irq_desc {
>>  #ifdef CONFIG_PROC_FS
>>  	struct proc_dir_entry	*dir;
>>  #endif
>> +#ifdef CONFIG_CPUSETS
>> +	struct cpuset		*cs;
>> +#endif
> 
> i like this approach - it makes irqs more resource-alike and attaches 
> them to a specific resource control group.
> 
> So if /cgroup/boot is changed to have less CPUs then the "default" irqs 
> move along with it.
> 
> but if an isolated RT domain has specific irqs attached to it (say the 
> IRQ of some high-speed data capture device), then the irqs would move 
> together with that domain.
> 
> irqs are no longer a bolted-upon concept, but more explicitly managed.
> 
> [ If you boot-test it and if Paul agrees with the general approach then
>   i could even apply it to sched-devel.git ;-) ]

Believe it or not, I like it too :).
Now we're talking a different approach compared to the cpu_isolated_map, since 
with this patch cpu_system_map is no longer needed.
I've been playing with the latest sched-devel tree, and while I think we'll end 
up adding a lot more code, doing it with cpusets is definitely more flexible.
This way we can provide more fine-grained control over which "system" 
services are allowed to run on a cpuset, rather than a "catch all" system flag.

The current sched-devel tree does not provide complete isolation at this point. 
There are still many things here and there that need to be added/fixed.
Having finer control here helps.

One concern I have is that this API conflicts with /proc/irq/X/smp_affinity,
i.e. setting smp_affinity manually will override the affinity set by the cpuset.
In other words, I think
	int irq_set_affinity(unsigned int irq, cpumask_t cpumask)
now needs to make sure that cpumask does not contain cpus that do not belong to 
the cpuset this irq belongs to, just like sched_setaffinity() does for tasks.
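
Something along these lines -- a rough, untested sketch that reuses the
desc->cs field from the patch; the irq_check_affinity() helper name is
made up for illustration:

	/* Sketch: reject user-supplied affinity masks that fall
	 * entirely outside the owning cpuset, in the spirit of
	 * sched_setaffinity(). */
	static int irq_check_affinity(struct irq_desc *desc, cpumask_t mask)
	{
		cpumask_t allowed;

		if (!desc->cs)
			return 0;

		cpus_and(allowed, mask, desc->cs->cpus_allowed);
		if (cpus_empty(allowed))
			return -EINVAL;

		return 0;
	}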

Max

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-02-29 18:55 ` [RFC/PATCH] cpuset: cpuset irq affinities Peter Zijlstra
  2008-02-29 19:02   ` Ingo Molnar
@ 2008-02-29 20:55   ` Paul Jackson
  2008-02-29 21:14     ` Peter Zijlstra
  2008-03-02  5:18   ` Christoph Hellwig
  2 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-02-29 20:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, tglx, oleg, rostedt, maxk, linux-kernel, rientjes

Like Ingo, I like the approach.

But I am concerned it won't work, as stated.

Unfortunately, my blithering ignorance of how one might want to
distribute irqs across a system is making it difficult for me
to say for sure if this works or not.

The thing about /dev/cpuset that I am afraid will get in the way
with this use of cpusets to place irqs is that we can really only
have a single purpose hierarchy below /dev/cpuset.

For example, let's say we have:

    /dev/cpuset
        boot
	big_special_app
	a_few_isolated_rt_nodes
	batchscheduler
            batch job 1
	    batch job 2
	    ...

I guess, with your "cpuset: cpuset irq affinities" patch, we'd start
off with /dev/cpuset/irqs listing the irqs available, and we could
reasonably decide to move any or all irqs to /dev/cpuset/boot/irqs,
by writing the numbers of those irqs to that file, one irq number
per write(2) system call (as is the cpuset convention.)

Do these irqs have any special hardware affinity?  Or are they
just consumers of CPU cycles that can be jammed onto whatever CPU(s)
we're willing to let be interrupted?

If for reasons of desired hardware affinity, or perhaps for some other
reason that I'm not aware of, we wanted to have the combined CPUs in
both the 'boot' and 'big_special_app' handle some irq, then we'd be
screwed.  We can't easily define, using the cpuset interface and its
conventions, a distinct cpuset overlapping boot and big_special_app,
to hold that irq.  Any such combining cpuset would have to be the
common parent of both the combined cpusets, an annoying intrusion on
the expected hierarchy.

If the actual set of CPUs we wanted to handle a particular irq wasn't
even the union of any pre-existing set of cpusets, then we'd be even
more screwed, unable even to force the issue by imposing additional
intermediate combined cpusets to meet the need.

If there is any potential for this to be a problem, then we should
examine the possibility of making irqs their own cgroup, rather than
piggy backing them on cpusets (which are now just one instance of a
cgroup module.)

Could you educate me a little, Peter, on what these irqs are and on
the sorts of ways people might want to place them across CPUs?


> +	if (s->len > 0)
> +		s->len += scnprintf(s->buf + s->len, s->buflen - s->len, " ");

The other 'vector' type cpuset file, "tasks", uses a newline '\n'
field terminator, not a space ' ' separator.  Would '\n' work here,
or is ' ' just too much the expected irq separator in such ascii lists?
My preference is toward using the exact same vector syntax in each
place, so that once someone has code that handles one, they can
repurpose that code for another with minimum breakage.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-02-29 20:52     ` Max Krasnyanskiy
@ 2008-02-29 21:03       ` Peter Zijlstra
  2008-02-29 21:20         ` Max Krasnyanskiy
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Zijlstra @ 2008-02-29 21:03 UTC (permalink / raw)
  To: Max Krasnyanskiy
  Cc: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, linux-kernel, David Rientjes


On Fri, 2008-02-29 at 12:52 -0800, Max Krasnyanskiy wrote:
> Ingo Molnar wrote:
> > * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > 
> >> @@ -174,11 +174,20 @@ struct irq_desc {
> >>  #ifdef CONFIG_PROC_FS
> >>  	struct proc_dir_entry	*dir;
> >>  #endif
> >> +#ifdef CONFIG_CPUSETS
> >> +	struct cpuset		*cs;
> >> +#endif
> > 
> > i like this approach - it makes irqs more resource-alike and attaches 
> > them to a specific resource control group.
> > 
> > So if /cgroup/boot is changed to have less CPUs then the "default" irqs 
> > move along with it.
> > 
> > but if an isolated RT domain has specific irqs attached to it (say the 
> > IRQ of some high-speed data capture device), then the irqs would move 
> > together with that domain.
> > 
> > irqs are no longer a bolted-upon concept, but more explicitly managed.
> > 
> > [ If you boot-test it and if Paul agrees with the general approach then
> >   i could even apply it to sched-devel.git ;-) ]
> 
> Believe it or not, I like it too :).
> Now we're talking a different approach compared to the cpu_isolated_map, since 
> with this patch cpu_system_map is no longer needed.
> I've been playing with the latest sched-devel tree, and while I think we'll end 
> up adding a lot more code, doing it with cpusets is definitely more flexible.
> This way we can provide more fine-grained control over which "system" 
> services are allowed to run on a cpuset, rather than a "catch all" system flag.
> 
> The current sched-devel tree does not provide complete isolation at this point. 
> There are still many things here and there that need to be added/fixed.
> Having finer control here helps.
> 
> One concern I have is that this API conflicts with /proc/irq/X/smp_affinity,
> i.e. setting smp_affinity manually will override the affinity set by the cpuset.
> In other words, I think
> 	int irq_set_affinity(unsigned int irq, cpumask_t cpumask)
> now needs to make sure that cpumask does not contain cpus that do not belong to 
> the cpuset this irq belongs to, just like sched_setaffinity() does for tasks.

The patch also needs to handle group destruction; currently it
leaves cpuset pointers dangling. So it would either have to refuse
to remove a group while there are still irqs associated with it, or move
them to the parent.

But yeah, this was just a quick hack to show the idea, glad you like it.
Will try to flesh it out a bit in the coming week.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-02-29 20:55   ` Paul Jackson
@ 2008-02-29 21:14     ` Peter Zijlstra
  2008-02-29 21:29       ` Ingo Molnar
                         ` (3 more replies)
  0 siblings, 4 replies; 94+ messages in thread
From: Peter Zijlstra @ 2008-02-29 21:14 UTC (permalink / raw)
  To: Paul Jackson; +Cc: mingo, tglx, oleg, rostedt, maxk, linux-kernel, rientjes


On Fri, 2008-02-29 at 14:55 -0600, Paul Jackson wrote:
> Like Ingo, I like the approach.
> 
> But I am concerned it won't work, as stated.
> 
> Unfortunately, my blithering ignorance of how one might want to
> distribute irqs across a system is making it difficult for me
> to say for sure if this works or not.
> 
> The thing about /dev/cpuset that I am afraid will get in the way
> with this use of cpusets to place irqs is that we can really only
> have a single purpose hierarchy below /dev/cpuset.
> 
> For example, let's say we have:
> 
>     /dev/cpuset
>         boot
> 	big_special_app
> 	a_few_isolated_rt_nodes
> 	batchscheduler
>             batch job 1
> 	    batch job 2
> 	    ...

I might just be new-fangled, but I have a /cgroup mount.
But I guess that's just a different mount-point of cgroup, right?

> I guess, with your "cpuset: cpuset irq affinities" patch, we'd start
> off with /dev/cpuset/irqs listing the irqs available, and we could
> reasonably decide to move any or all irqs to /dev/cpuset/boot/irqs,
> by writing the numbers of those irqs to that file, one irq number
> per write(2) system call (as is the cpuset convention.)

Right.

> Do these irqs have any special hardware affinity?  Or are they
> just consumers of CPU cycles that can be jammed onto whatever CPU(s)
> we're willing to let be interrupted?

Depends a bit, the genirq layer seems to allow for irqs that can't be
freely placed. But most of them can be given a free mask - /me looks @
tglx/ingo.

> If for reasons of desired hardware affinity, or perhaps for some other
> reason that I'm not aware of, we wanted to have the combined CPUs in
> both the 'boot' and 'big_special_app' handle some irq, then we'd be
> screwed.  We can't easily define, using the cpuset interface and its
> conventions, a distinct cpuset overlapping boot and big_special_app,
> to hold that irq.  Any such combining cpuset would have to be the
> common parent of both the combined cpusets, an annoying intrusion on
> the expected hierarchy.
> 
> If the actual set of CPUs we wanted to handle a particular irq wasn't
> even the union of any pre-existing set of cpusets, then we'd be even
> more screwed, unable even to force the issue by imposing additional
> intermediate combined cpusets to meet the need.

I see the issue. We don't support mv on cgroups, right? To easily create
common parents...

> If there is any potential for this to be a problem, then we should
> examine the possibility of making irqs their own cgroup, rather than
> piggy backing them on cpusets (which are now just one instance of a
> cgroup module.)

Hmm, but that would then be another controller based on cpus. Might be a
tad confusing. Might be needed. I'll ponder..

> Could you educate me a little, Peter, on what these irqs are and on
> the sorts of ways people might want to place them across CPUs?

I'm not sure I know what you're asking. IRQs are hardware notifiers and
do all kinds of things depending on the hardware. Network cards
typically use them to notify the CPU of incoming packets. Video cards
can do vsync notifiers, empty dma buffers, whatnot.

> > +	if (s->len > 0)
> > +		s->len += scnprintf(s->buf + s->len, s->buflen - s->len, " ");
> 
> The other 'vector' type cpuset file, "tasks", uses a newline '\n'
> field terminator, not a space ' ' separator.  Would '\n' work here,
> or is ' ' just too much the expected irq separator in such ascii lists?
> My preference is toward using the exact same vector syntax in each
> place, so that once someone has code that handles one, they can
> repurpose that code for another with minimum breakage.

I'm fine with whatever; I saw a ',' in the bitmap stuff, not really sure
how that ended up being a ' ' in the patch I sent out... :-)


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-02-29 21:03       ` Peter Zijlstra
@ 2008-02-29 21:20         ` Max Krasnyanskiy
  2008-03-03 11:57           ` Peter Zijlstra
  0 siblings, 1 reply; 94+ messages in thread
From: Max Krasnyanskiy @ 2008-02-29 21:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, linux-kernel, David Rientjes

Peter Zijlstra wrote:

> But yeah, this was just a quick hack to show the idea, glad you like it.
> Will try to flesh it out a bit in the coming week.

Are you going to add code for the "boot" cpuset?
I wrote user-space code that does that, but as I understand from previous 
discussions we want to create it in the kernel.

Max


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-02-29 21:14     ` Peter Zijlstra
@ 2008-02-29 21:29       ` Ingo Molnar
  2008-02-29 21:32       ` Ingo Molnar
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 94+ messages in thread
From: Ingo Molnar @ 2008-02-29 21:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Jackson, tglx, oleg, rostedt, maxk, linux-kernel, rientjes


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> > Do these irqs have any special hardware affinity?  Or are they just 
> > consumers of CPU cycles that can be jammed onto whatever CPU(s) 
> > we're willing to let be interrupted?
> 
> Depends a bit, the genirq layer seems to allow for irqs that can't be 
> freely placed. But most of them can be given a free mask - /me looks @ 
> tglx/ingo.

yes - and when they cannot be arbitrarily migrated we just don't move 
them (but still keep them attached to that cpuset). The affinity calls 
will just fail in that case. Might want to emit a kernel warning, but 
that's all. (if so, it's a hardware constraint)

	Ingo

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-02-29 21:14     ` Peter Zijlstra
  2008-02-29 21:29       ` Ingo Molnar
@ 2008-02-29 21:32       ` Ingo Molnar
  2008-02-29 21:42       ` Max Krasnyanskiy
  2008-02-29 21:53       ` Paul Jackson
  3 siblings, 0 replies; 94+ messages in thread
From: Ingo Molnar @ 2008-02-29 21:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Jackson, tglx, oleg, rostedt, maxk, linux-kernel, rientjes


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> > Could you educate me a little, Peter, on what these irqs are and on 
> > the sorts of ways people might want to place them across CPUs?
> 
> I'm not sure I know what you're asking. IRQs are hardware notifiers and 
> do all kinds of things depending on the hardware. Network cards 
> typically use them to notify the CPU of incoming packets. Video cards 
> can do vsync notifiers, empty dma buffers, whatnot.

irq affinity masks can basically be thought of as: "these are the CPUs 
where external hardware events will trigger certain kernel functions and 
cause overhead on those CPUs". An IRQ can have followup effects: softirq 
execution, workqueue execution, etc.

so managing the IRQ masks is very meaningful and just as meaningful as 
managing the affinity masks of tasks. You can think of "IRQ# 123" as 
"special kernel task # 123".

	Ingo

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-02-29 21:14     ` Peter Zijlstra
  2008-02-29 21:29       ` Ingo Molnar
  2008-02-29 21:32       ` Ingo Molnar
@ 2008-02-29 21:42       ` Max Krasnyanskiy
  2008-02-29 22:00         ` Paul Jackson
  2008-02-29 21:53       ` Paul Jackson
  3 siblings, 1 reply; 94+ messages in thread
From: Max Krasnyanskiy @ 2008-02-29 21:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Jackson, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

Peter Zijlstra wrote:
> On Fri, 2008-02-29 at 14:55 -0600, Paul Jackson wrote:

>> Do these irqs have any special hardware affinity?  Or are they
>> just consumers of CPU cycles that can be jammed onto whatever CPU(s)
>> we're willing to let be interrupted?
> 
> Depends a bit, the genirq layer seems to allow for irqs that can't be
> freely placed. But most of them can be given a free mask - /me looks @
> tglx/ingo.
We should just check the return value from irq_set_affinity(). If it fails we 
refuse to add it to the set.

>> If for reasons of desired hardware affinity, or perhaps for some other
>> reason that I'm not aware of, we wanted to have the combined CPUs in
>> both the 'boot' and 'big_special_app' handle some irq, then we'd be
>> screwed.  We can't easily define, using the cpuset interface and its
>> conventions, a distinct cpuset overlapping boot and big_special_app,
>> to hold that irq.  Any such combining cpuset would have to be the
>> common parent of both the combined cpusets, an annoying intrusion on
>> the expected hierarchy.
>>
>> If the actual set of CPUs we wanted to handle a particular irq wasn't
>> even the union of any pre-existing set of cpusets, then we'd be even
>> more screwed, unable even to force the issue by imposing additional
>> intermediate combined cpusets to meet the need.
> 
> I see the issue. We don't support mv on cgroups, right? To easily create
> common parents...
I guess there may be some fancy HW topologies that could be a problem, but for 
most cases we should be OK.
Simple cases like unmovable IRQs are easy to handle (i.e. set_affinity() fails 
and we refuse to add the irq to the cpuset).

>> If there is any potential for this to be a problem, then we should
>> examine the possibility of making irqs their own cgroup, rather than
>> piggy backing them on cpusets (which are now just one instance of a
>> cgroup module.)
> 
> Hmm, but that would then be another controller based on cpus. Might be a
> tad confusing. Might be needed. I'll ponder..
Yeah, I'd prefer it to be along with cpusets. As I mentioned, we will need similar 
mechanisms for other things besides irqs for complete isolation. Creating a 
separate group for each sounds like overkill.

Max



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-02-29 21:14     ` Peter Zijlstra
                         ` (2 preceding siblings ...)
  2008-02-29 21:42       ` Max Krasnyanskiy
@ 2008-02-29 21:53       ` Paul Jackson
  3 siblings, 0 replies; 94+ messages in thread
From: Paul Jackson @ 2008-02-29 21:53 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, tglx, oleg, rostedt, maxk, linux-kernel, rientjes

Peter wrote:
> IRQs are hardware notifiers and
> do all kinds of things depending on the hardware.

So some of these irqs might have very particular node affinity then,
right?  If my thingamajig board is attached to node 72, then I might
want its interrupts going to a CPU on node 72, right?

In which case, putting that irq in my boot cpuset that only has nodes
0-3 would be harmful to the performance of my thingamajig board, right?

I suspect that you don't want a small 'boot' cpuset
(usually one running legacy Unix stuff) holding the irqs,
but rather a big 'system' cpuset, which has all but the few
nodes dedicated to hard real time or other isolated (there's that
word again) purposes.

That way, most irqs can go to most CPUs, depending on their specific
needs.

Unfortunately, I don't think the cpuset hierarchy and conventions admit
of both a big 'system' cpuset (all but a few isolated nodes) and a small
overlapping 'boot' cpuset.

> We don't support mv on cgroups, right? To easily create
> common parents...

The only mv supported is simple rename, preserving parentage.

And if one could and did a tree reshaping mv near the top of the
hierarchy, it would confuse the heck out of existing uses and users.

> I might just be new-fangled, but I have a /cgroup mount.
> but I guess that's just different mount-point of cgroup, right?

All cgroups mount beneath /cgroup.  For backwards compatibility,
one can also "mount -t cpuset cpuset /dev/cpuset", and just get
the cpuset interface, with a couple of legacy hooks to make it behave
just like good old fashioned cpusets, rather than new fangled cgroups.

> I'm fine with whatever; I saw a ',' in the bitmap stuff, not really sure
> how that ended up being a ' ' in the patch I sent out... :-)

Yes - that's another commonly supported form.  If that's a better
presentation, then you'd probably want to rework your code, to take
in and display the entire vector of irq numbers in one line, using
a comma-separated list of irqs and ranges of irqs.

See further bitmap_scnprintf(), bitmap_parse_user(),
bitmap_scnlistprintf() and bitmap_parselist(), in bitmap.c.

Given that you don't have a pre-existing bitmap of irqs (that I know
of) and that you might have a distinct error code for each irq that you
try to attach to a different cpuset, I'm guessing you want to stick
with the single irq per write on input, single irq per line on output,
paradigm, similar to what the 'tasks' file uses for task pids.
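
(For comparison: if there were a per-cpuset bitmap of irqs, the
comma-separated list-with-ranges form would be nearly a one-liner --
sketch only, the "irqs" bitmap is hypothetical:

	DECLARE_BITMAP(irqs, NR_IRQS);

	/* prints e.g. "0-3,8,12-15" into page */
	int len = bitmap_scnlistprintf(page, PAGE_SIZE, irqs, NR_IRQS);

but lacking such a bitmap, one irq per line is simpler.)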

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-02-29 21:42       ` Max Krasnyanskiy
@ 2008-02-29 22:00         ` Paul Jackson
  0 siblings, 0 replies; 94+ messages in thread
From: Paul Jackson @ 2008-02-29 22:00 UTC (permalink / raw)
  To: Max Krasnyanskiy
  Cc: a.p.zijlstra, mingo, tglx, oleg, rostedt, linux-kernel, rientjes,
	Paul Menage

Max wrote:
> Yeah, I'd prefer it to be along with cpusets. As I mentioned, we will need similar 
> mechanisms for other things besides irqs for complete isolation. Creating a 
> separate group for each sounds like overkill.

One can combine cgroups into a single hierarchy.

If we had irqs and these similar mechanisms you have in mind each in
their separate cgroup subsystem, and if you normally wanted to deal
with them all in a single hierarchy, then you can mount those cgroup
subsystems together.

I've managed to avoid learning how to use cgroups, so you might want
to ask Paul Menage (just added to the cc list) how to do that, if
Documentation/cgroups.txt isn't sufficient.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-02-29 18:55 ` [RFC/PATCH] cpuset: cpuset irq affinities Peter Zijlstra
  2008-02-29 19:02   ` Ingo Molnar
  2008-02-29 20:55   ` Paul Jackson
@ 2008-03-02  5:18   ` Christoph Hellwig
  2 siblings, 0 replies; 94+ messages in thread
From: Christoph Hellwig @ 2008-03-02  5:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, Max Krasnyanskiy, linux-kernel, David Rientjes

On Fri, Feb 29, 2008 at 07:55:51PM +0100, Peter Zijlstra wrote:
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/irq.h |    9 ++
>  kernel/cpuset.c     |  160 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/irq/manage.c |   19 ++++++
>  3 files changed, 188 insertions(+)

linux/irq.h must not be included in generic code; it's actually more
an asm-generic/hw_irq.h.  Please restructure the code so that the
cpuset code calls into an arch interface which will then be implemented
by arch code (which in most cases will be genirq; the others can be left
stubbed out for now).


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-02-29 21:20         ` Max Krasnyanskiy
@ 2008-03-03 11:57           ` Peter Zijlstra
  2008-03-03 17:36             ` Paul Jackson
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Zijlstra @ 2008-03-03 11:57 UTC (permalink / raw)
  To: Max Krasnyanskiy
  Cc: Ingo Molnar, Thomas Gleixner, Oleg Nesterov, Steven Rostedt,
	Paul Jackson, linux-kernel, David Rientjes

On Fri, 2008-02-29 at 13:20 -0800, Max Krasnyanskiy wrote:
> Peter Zijlstra wrote:
> 
> > But yeah, this was just a quick hack to show the idea, glad you like it.
> > Will try to flesh it out a bit in the coming week.
> 
> Are you going to add code for the "boot" cpuset?
> I wrote user-space code that does that, but as I understand from previous 
> discussions we want to create it in the kernel.

Yeah, I'll be trying to (lack of cgroup fu for the moment).

I think something like

 /cgroup
 /cgroup/system
 /cgroup/system/boot

 /cgroup/big_honking_app
 /cgroup/rt_domain

Where the system group includes all IRQs and all unbound kernel threads
(by default). The system/boot group will contain all of userspace.

Doing it in this way ought to allow for some weird setups. The system
group can overlap with anything that does need system services. The boot
group must be a subset thereof, and can be shrunk to a small part of the
machine.




^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-03 11:57           ` Peter Zijlstra
@ 2008-03-03 17:36             ` Paul Jackson
  2008-03-03 17:57               ` Peter Zijlstra
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-03-03 17:36 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

Peter wrote:
> The system group can overlap with anything that does need system services.

I suppose IRQs need to overlap like this, but cpusets often can't
overlap like this.

If a system has the cgroup hierarchy you draw:

  /cgroup
  /cgroup/system
  /cgroup/system/boot

  /cgroup/big_honking_app
  /cgroup/rt_domain

this must not force the cpuset hierarchy to be:

  /dev/cpuset
  /dev/cpuset/system
  /dev/cpuset/system/boot

  /dev/cpuset/big_honking_app
  /dev/cpuset/rt_domain

I guess this means IRQs cannot be added to the cpuset subsystem
of cgroups.  Rather they have to be added to some other cgroup
subsystem, perhaps a new one just for IRQs.

In perhaps the most common sort of cpuset hierarchy:

  /dev/cpuset
  /dev/cpuset/boot
  /dev/cpuset/batch_sched
  /dev/cpuset/big_honking_app
  /dev/cpuset/rt_domain

none of boot or its siblings overlap.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-03 17:36             ` Paul Jackson
@ 2008-03-03 17:57               ` Peter Zijlstra
  2008-03-03 18:10                 ` Paul Jackson
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Zijlstra @ 2008-03-03 17:57 UTC (permalink / raw)
  To: Paul Jackson; +Cc: maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes


On Mon, 2008-03-03 at 11:36 -0600, Paul Jackson wrote:
> Peter wrote:
> > The system group can overlap with anything that does need system services.
> 
> I suppose IRQs need to overlap like this, but cpusets often can't
> overlap like this.

Due to CS_CPU_EXCLUSIVE usage?

I had hoped system would be allowed to overlap.

> If a system has the cgroup hierarchy you draw:
> 
>   /cgroup
>   /cgroup/system
>   /cgroup/system/boot
> 
>   /cgroup/big_honking_app
>   /cgroup/rt_domain
> 
> this must not force the cpuset hierarchy to be:
> 
>   /dev/cpuset
>   /dev/cpuset/system
>   /dev/cpuset/system/boot
> 
>   /dev/cpuset/big_honking_app
>   /dev/cpuset/rt_domain
> 
> I guess this means IRQs cannot be added to the cpuset subsystem
> of cgroups.  Rather they have to be added to some other cgroup
> subsystem, perhaps a new one just for IRQs.

The trouble is, cgroups are primarily about tasks, whereas IRQs are not.
So we would create a cgroup that does not manage tasks, but rather
associates irqs with sets of cpus - which are not cpusets.

See how that would be awkward?

> In perhaps the most common sort of cpuset hierarchy:
> 
>   /dev/cpuset
>   /dev/cpuset/boot
>   /dev/cpuset/batch_sched
>   /dev/cpuset/big_honking_app
>   /dev/cpuset/rt_domain
> 
> none of boot or its siblings overlap.

But as long as nobody does CS_CPU_EXCLUSIVE they may overlap, right?


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-03 17:57               ` Peter Zijlstra
@ 2008-03-03 18:10                 ` Paul Jackson
  2008-03-03 18:18                   ` Peter Zijlstra
  2008-03-03 18:41                   ` Paul Menage
  0 siblings, 2 replies; 94+ messages in thread
From: Paul Jackson @ 2008-03-03 18:10 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

> But as long as nobody does CS_CPU_EXCLUSIVE they may overlap, right?

It's a bit stronger than that:

 1) They need non-overlapping cpusets at this level to control
    the sched_domain setup, if they want to avoid load balancing
    across almost all CPUs in the system.  Depending on the kernel
    version, sched_domain partitioning is controlled either by the
    cpuset flag cpu_exclusive, or the cpuset flag sched_load_balance.

 2) They need non-overlapping cpusets at this level to control
    memory placement of some kernel allocations, which are allowed
    outside the current tasks cpuset, to be confined by the nearest
    ancestor cpuset marked 'mem_exclusive'

 3) Some sysadmin tools are likely coded to expect a /dev/cpuset/boot
    cpuset, not a /dev/cpuset/system/boot cpuset, as that has been
    customary for a long time.

(1) and (2) would break the major batch schedulers.  They typically
mark their top cpuset, /dev/cpuset/pbs or /dev/cpuset/lfs or whatever
batch scheduler it is, as cpu_exclusive and mem_exclusive, by way of
expressing their intention to pretty much own those CPUs and memory
nodes.  If we fired them up on a system where that wasn't allowed due
to overlap with /dev/cpuset/system, they'd croak.  Such changes as that
are costly and unappreciated.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-03 18:10                 ` Paul Jackson
@ 2008-03-03 18:18                   ` Peter Zijlstra
  2008-03-04  7:35                     ` Paul Jackson
  2008-03-03 18:41                   ` Paul Menage
  1 sibling, 1 reply; 94+ messages in thread
From: Peter Zijlstra @ 2008-03-03 18:18 UTC (permalink / raw)
  To: Paul Jackson; +Cc: maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

On Mon, 2008-03-03 at 12:10 -0600, Paul Jackson wrote:
> > But as long as nobody does CS_CPU_EXCLUSIVE they may overlap, right?
> 
> It's a bit stronger than that:
> 
>  1) They need non-overlapping cpusets at this level to control
>     the sched_domain setup, if they want to avoid load balancing
>     across almost all CPUs in the system.  Depending on the kernel
>     version, sched_domain partitioning is controlled either by the
>     cpuset flag cpu_exclusive, or the cpuset flag sched_load_balance.
> 
>  2) They need non-overlapping cpusets at this level to control
>     memory placement of some kernel allocations, which are allowed
>     outside the current tasks cpuset, to be confined by the nearest
>     ancestor cpuset marked 'mem_exclusive'
> 
>  3) Some sysadmin tools are likely coded to expect a /dev/cpuset/boot
>     cpuset, not a /dev/cpuset/system/boot cpuset, as that has been
>     customary for a long time.
> 
> (1) and (2) would break the major batch schedulers.  They typically
> mark their top cpuset, /dev/cpuset/pbs or /dev/cpuset/lfs or whatever
> batch scheduler it is, as cpu_exclusive and mem_exclusive, by way of
> expressing their intention to pretty much own those CPUs and memory
> nodes.  If we fired them up on a system where that wasn't allowed due
> to overlap with /dev/cpuset/system, they'd croak.  Such changes as that
> are costly and unappreciated.

OK, understood, I'll try and come up with yet another scheme :-)


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-03 18:10                 ` Paul Jackson
  2008-03-03 18:18                   ` Peter Zijlstra
@ 2008-03-03 18:41                   ` Paul Menage
  2008-03-03 18:52                     ` Paul Jackson
  1 sibling, 1 reply; 94+ messages in thread
From: Paul Menage @ 2008-03-03 18:41 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Peter Zijlstra, maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

On Mon, Mar 3, 2008 at 10:10 AM, Paul Jackson <pj@sgi.com> wrote:
>   2) They need non-overlapping cpusets at this level to control
>     memory placement of some kernel allocations, which are allowed
>     outside the current tasks cpuset, to be confined by the nearest
>     ancestor cpuset marked 'mem_exclusive'

A while ago I posted a patch that split "cpusets" into "cpusets"
(controlling CPU) and "memsets" (controlling memory node placement).
It got a luke-warm reception at the time, but maybe it's worth me fixing
it up and resending? It wouldn't have to affect the legacy mounts of
cpusets, but would allow memory and CPU assignments to be controlled
independently.

Also, one of the problems with the mem_exclusive flag at the moment is
that it's overloaded to mean "no-overlapping" and "no GFP_KERNEL
allocations outside this cpuset". If we added a "mem_hardwall" flag
that just had the latter semantics (i.e. either mem_exclusive or
mem_hardwall would be sufficient to confine GFP_KERNEL allocations
within the cpuset), you could have the confinement without worrying
about overlap issues.
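
In code terms, roughly -- a sketch, where is_mem_hardwall() and the
underlying CS_MEM_HARDWALL bit would be the new additions:

	/* kernel allocations are confined if either flag is set */
	static int is_hardwalled(const struct cpuset *cs)
	{
		return is_mem_exclusive(cs) || is_mem_hardwall(cs);
	}

so confinement no longer drags in the non-overlap requirement.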

Paul

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-03 18:41                   ` Paul Menage
@ 2008-03-03 18:52                     ` Paul Jackson
  2008-03-04  5:26                       ` Paul Menage
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-03-03 18:52 UTC (permalink / raw)
  To: Paul Menage
  Cc: a.p.zijlstra, maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

Paul M wrote:
> It ... would allow memory and CPU assignments to be controlled
> independently.

Could you motivate this suggestion -- who needs it, or why
is it needed?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-03 18:52                     ` Paul Jackson
@ 2008-03-04  5:26                       ` Paul Menage
  2008-03-04  6:15                         ` Paul Jackson
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Menage @ 2008-03-04  5:26 UTC (permalink / raw)
  To: Paul Jackson
  Cc: a.p.zijlstra, maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

On Mon, Mar 3, 2008 at 10:52 AM, Paul Jackson <pj@sgi.com> wrote:
> Paul M wrote:
>  > It ... would allow memory and CPU assignments to be controlled
>  > independently.
>
>  Could you motivate this suggestion -- who needs it, or why
>  is it needed?

My impression was that Peter wanted to be able to control the
assignments of CPUs to IRQs in a way that could result in overlapping.
One of the arguments that you posted against his proposal was that
this would break due to the memory overlap requirements of
mem_exclusive cpusets. So this appeared to be a case where the fact
that memory and cpu masks are combined in the same cgroups subsystem
is a drawback. (But maybe I'm misunderstanding the discussion).

I'm sure if cpusets were being developed today on top of cgroups,
rather than being its inspiration, there would be no good reason to
have the memory mask assignment and the cpu mask assignment be part of
the same subsystem - they're only together now because there was no
general grouping mechanism in the kernel when cpusets was written.

Paul

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-04  5:26                       ` Paul Menage
@ 2008-03-04  6:15                         ` Paul Jackson
  2008-03-04  6:21                           ` Paul Menage
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-03-04  6:15 UTC (permalink / raw)
  To: Paul Menage
  Cc: a.p.zijlstra, maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

Paul M wrote:
> One of the arguments that you posted against his proposal was that
> this would break due to the memory overlap requirements of
> mem_exclusive cpusets.

There were three concerns I had with his proposal -- (1) it conflicted
with memory placement, (2) it conflicted with cpu placement (sched
domain definition) and (3) it broke the conventional cpuset hierarchy
configuration on which users have come to depend.

Separating cpus and memory as distinct cgroup subsystems wouldn't help
much; there's still (2) and (3).  And in the legacy /dev/cpuset
interface, cpus and memory remain together, whether or not cgroups
enables them to be separate.


> I'm sure if cpusets were being developed today on top of cgroups,
> rather than being its inspiration, there would be no good reason to
> have the memory mask assignment and the cpu mask assignment be part of
> the same subsystem 

Perhaps ... though they work together rather well in practice, perhaps
because CPUs and memory banks usually are physically associated;
selecting which CPUs to use and which memory banks to use really is not
an independent choice, on most hardware.

Now however the shoe is on the other foot.  Is there a good reason to
separate them ... an actual would-be user, or actual problems inflicted
on current users due to the lack of this split?

If there's just the rare occasion of such an independently split CPU and
memory hierarchy, one can still use the current combined cpuset
implementation, just with a less natural cpuset hierarchy (more leaf
node cpusets, representing the cross product of interesting CPU subsets
and interesting memory node subsets.)  If some users start
doing this routinely, then that gets sufficiently cumbersome that it's
worth trying to remedy.

As usual, I seem to be counseling resisting adding code, complexity and
incompatible change, without actual need, beyond just increased
conceptual sophistication.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-04  6:15                         ` Paul Jackson
@ 2008-03-04  6:21                           ` Paul Menage
  2008-03-04  6:26                             ` Paul Jackson
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Menage @ 2008-03-04  6:21 UTC (permalink / raw)
  To: Paul Jackson
  Cc: a.p.zijlstra, maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

On Mon, Mar 3, 2008 at 10:15 PM, Paul Jackson <pj@sgi.com> wrote:
>  implementation, just with a less natural cpuset hierarchy (more leaf
>  node cpusets, representing the cross product of interesting CPU subsets
>  and interesting memory node subsets.)

Except that this isn't currently possible if you're also trying to do
memory hardwalling on those cpusets, since then sibling cpusets can't
share memory nodes.

Having said that, this bit of the problem can be fixed without
splitting cpus/mems, by my other earlier proposal of adding a separate
"mem_hardwall" flag that can enable the hardwall behaviour without the
exclusive behaviour. (i.e. hardwall behaviour occurs if mem_exclusive
|| mem_hardwall)

Paul

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-04  6:21                           ` Paul Menage
@ 2008-03-04  6:26                             ` Paul Jackson
  2008-03-04  6:34                               ` Paul Menage
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-03-04  6:26 UTC (permalink / raw)
  To: Paul Menage
  Cc: a.p.zijlstra, maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

Paul M wrote:
> Except that this isn't currently possible if you're also trying to do
> memory hardwalling on those cpusets, since then sibling cpusets can't
> share memory nodes.
 
Yes, there would be interactions, on both the CPU and Memory side with
some second order mechanisms (hardwalls and sched domains.)

Still, I ask, where's the beef -- the real users?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-04  6:26                             ` Paul Jackson
@ 2008-03-04  6:34                               ` Paul Menage
  2008-03-04  6:51                                 ` Paul Jackson
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Menage @ 2008-03-04  6:34 UTC (permalink / raw)
  To: Paul Jackson
  Cc: a.p.zijlstra, maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

On Mon, Mar 3, 2008 at 10:26 PM, Paul Jackson <pj@sgi.com> wrote:
> Paul M wrote:
>  > Except that this isn't currently possible if you're also trying to do
>  > memory hardwalling on those cpusets, since then sibling cpusets can't
>  > share memory nodes.
>
>  Yes, there would be interactions, on both the CPU and Memory side with
>  some second order mechanisms (hardwalls and sched domains.)
>
>  Still, I ask, where's the beef -- the real users?
>

I'm one such user who's been forced to add the mem_hardwall flag to
get around the fact that exclusive and hardwall are controlled by the
same flag. I keep meaning to send it in as a patch but haven't yet got
round to it.

Also, if you're using fake numa for memory isolation (which we're
experimenting with) then the correlation between cpu placement and
memory placement is much much weaker, or non-existent.

Paul

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-04  6:34                               ` Paul Menage
@ 2008-03-04  6:51                                 ` Paul Jackson
  0 siblings, 0 replies; 94+ messages in thread
From: Paul Jackson @ 2008-03-04  6:51 UTC (permalink / raw)
  To: Paul Menage
  Cc: a.p.zijlstra, maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

Paul M wrote:
> I'm one such user who's been forced to add the mem_hardwall flag to
> get around the fact that exclusive and hardwall are controlled by the
> same flag. I keep meaning to send it in as a patch but haven't yet got
> round to it.

I made essentially the same mistake twice in the evolution of cpusets:
 1) overloading the cpu_exclusive flag to define sched domains, and
 2) overloading the mem_exclusive flag to define memory hardwalls.

I eventually reversed (1), with a deliberately incompatible change
(and you know how I resist those ;), creating a new 'sched_load_balance'
flag that controls the sched_domain partitioning, and removing any
effect that the cpu_exclusive flag has on this.

Perhaps the unfortunate interaction of mem_exclusive and hardwall is
destined to go the same path.  Though the audience that is currently
using mem_exclusive for the purpose of hardwall enforcement of kernel
allocations might be broader than the specialized real-time audience
that was using cpu_exclusive for dynamic sched domain isolation, and so
we might not choose to just break compatibility in one shot, but rather
phase in your new flag, before, perhaps, in a later release, phasing
out the old hardwall overloading of the mem_exclusive flag.

(My primeval mistake was including the cpu_exclusive and mem_exclusive
flags in the original cpuset design; those two flags have given me
nothing but temptation to commit further design errors ;).


> Also, if you're using fake numa for memory isolation (which we're
> experimenting with) then the correlation between cpu placement and
> memory placement is much much weaker, or non-existent.

That might be a good answer to my asking where the beef was.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-03 18:18                   ` Peter Zijlstra
@ 2008-03-04  7:35                     ` Paul Jackson
  2008-03-04 11:06                       ` Peter Zijlstra
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-03-04  7:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

Peter wrote:
> OK, understood, I'll try and come up with yet another scheme :-)

Would your per-cpuset 'irqs' file work if, unlike pids in the 'tasks' file,
we allowed the same irq to be listed in multiple 'irqs' files?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-04  7:35                     ` Paul Jackson
@ 2008-03-04 11:06                       ` Peter Zijlstra
  2008-03-04 19:52                         ` Max Krasnyanskiy
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Zijlstra @ 2008-03-04 11:06 UTC (permalink / raw)
  To: Paul Jackson; +Cc: maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes


On Tue, 2008-03-04 at 01:35 -0600, Paul Jackson wrote:
> Peter wrote:
> > OK, understood, I'll try and come up with yet another scheme :-)
> 
> Would your per-cpuset 'irqs' file work if, unlike pids in the 'tasks' file,
> we allowed the same irq to be listed in multiple 'irqs' files?

I did think of that, but that seems rather awkward. For one, how would
you remove an irq from a cpuset?

Secondly, the beauty of the current solution is that we use
irq_desc->cs->cpus_allowed; if it were in multiple sets, we'd have to
iterate a list and cpus_or() the bunch.
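
Roughly -- sketch only, the per-irq list of owning cpusets and the
irq_link member are hypothetical:

	cpumask_t mask = CPU_MASK_NONE;
	struct cpuset *cs;

	/* effective mask becomes an OR over all owning cpusets,
	 * instead of one desc->cs->cpus_allowed dereference */
	list_for_each_entry(cs, &desc->cpusets, irq_link)
		cpus_or(mask, mask, cs->cpus_allowed);

and that would also need locking against sets coming and going.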




^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-04 11:06                       ` Peter Zijlstra
@ 2008-03-04 19:52                         ` Max Krasnyanskiy
  2008-03-05  1:11                           ` Paul Jackson
  0 siblings, 1 reply; 94+ messages in thread
From: Max Krasnyanskiy @ 2008-03-04 19:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Jackson, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

Peter Zijlstra wrote:
> On Tue, 2008-03-04 at 01:35 -0600, Paul Jackson wrote:
>> Peter wrote:
>>> OK, understood, I'll try and come up with yet another scheme :-)
>> Would your per-cpuset 'irqs' file work if, unlike pids in the 'tasks' file,
>> we allowed the same irq to be listed in multiple 'irqs' files?
> 
> I did think of that, but that seems rather awkward. For one, how would
> you remove an irq from a cpuset?
> 
> Secondly, the beauty of the current solution is that we use
> irq_desc->cs->cpus_allowed; if it were in multiple sets, we'd have to
> iterate a list and cpus_or() the bunch.
> 
Yeah, that would definitely be awkward.

Max

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-04 19:52                         ` Max Krasnyanskiy
@ 2008-03-05  1:11                           ` Paul Jackson
  2008-03-05  8:37                             ` Peter Zijlstra
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-03-05  1:11 UTC (permalink / raw)
  To: Max Krasnyanskiy
  Cc: a.p.zijlstra, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

Max K wrote:
> Yeah, that would definitely be awkward.

Yeah - agreed - awkward.

Forget that idea (allowing the same irq in multiple 'irqs' files.)

It seems to me that we get into trouble trying to cram that 'system'
cpuset into the cpuset hierarchy, where that system cpuset is there to
hold a list of irqs, but is only partially a good fit for the existing
cpuset hierarchy.

Could this irq configuration be partly a system-wide configuration
decision (which irqs are 'system' irqs), and partly a per-cpuset
decision -- which cpusets (such as a real-time one) want to disable
the usual system irqs that everyone else gets.

The cpuset portion of this should take only a single per-cpuset Boolean
flag -- which if set True (1), asks the system to "please leave my CPUs
off the list of CPUs receiving the usual system irqs."

Then the list of "usual system irqs" would be established in some /proc
or /sys configuration.  Such irqs would be able to go to any CPUs
except those CPUs which found themselves in a cpuset with the above
per-cpuset Boolean flag set True (1).
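
In effect the kernel would then maintain something like the following --
sketch only, with shielded_cpus (the union of cpus_allowed over all
cpusets that set the flag) being hypothetical:

	cpumask_t system_irq_cpus;

	/* CPUs eligible for the usual system irqs: everything
	 * online, minus the CPUs of flagged cpusets */
	cpus_andnot(system_irq_cpus, cpu_online_map, shielded_cpus);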

How does all this interact with /proc/irq/N/smp_affinity?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-05  1:11                           ` Paul Jackson
@ 2008-03-05  8:37                             ` Peter Zijlstra
  2008-03-05  8:50                               ` Ingo Molnar
                                                 ` (2 more replies)
  0 siblings, 3 replies; 94+ messages in thread
From: Peter Zijlstra @ 2008-03-05  8:37 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Max Krasnyanskiy, mingo, tglx, oleg, rostedt, linux-kernel, rientjes


On Tue, 2008-03-04 at 19:11 -0600, Paul Jackson wrote:
> Max K wrote:
> > Yeah, that would definitely be awkward.
> 
> Yeah - agreed - awkward.
> 
> Forget that idea (allowing the same irq in multiple 'irqs' files.)
> 
> It seems to me that we get into trouble trying to cram that 'system'
> cpuset into the cpuset hierarchy: it is there to hold a list of irqs,
> but is only partially a good fit for the existing hierarchy.
> 
> Could this irq configuration be partly a system-wide decision (which
> irqs are 'system' irqs), and partly a per-cpuset decision -- which
> cpusets (such as a real-time one) want to disable the usual system
> irqs that everyone else gets?
> 
> The cpuset portion of this should take only a single per-cpuset Boolean
> flag -- which, if set True (1), asks the system to "please leave my CPUs
> off the list of CPUs receiving the usual system irqs."
> 
> Then the list of "usual system irqs" would be established in some /proc
> or /sys configuration.  Such irqs would be able to go to any CPUs
> except those CPUs which found themselves in a cpuset with the above
> per-cpuset Boolean flag set True (1).

How about we make this an in-kernel boot set that by default contains
all IRQs, all unbound kthreads, and all of user-space?

To be compatible with your existing clients you only need to move all
the IRQs to the root domain.

(Upgrading a kernel would require distributing some new userspace
anyway, right? - and we could offer a .config option to disable the boot
set for those who do upgrade kernels without upgrading user-space).

Then, once you want to make use of the new features, you have to update
your batch scheduler to only make use of load_balance and not
cpus_exclusive (as they're only interested in sched_domains, right?)

So if you want to do IRQ isolation and batch scheduling on the same
machine (which is not possible now), you need to update userspace as said
before, so that it allows for the overlapping cpusets.

For example, on a 32 cpu machine:

/cgroup/boot 0-1 (kthreads - initial userspace)
/cgroup/irqs 0-27 (most irqs)
/cgroup/batch_A 2-5
/cgroup/batch_B 6-13
/cgroup/another_big_app 14-27
/cgroup/RT-domain 28-31 (my special irq)

So by providing a .config option for strict backward compatibility, and
a simple path for runtime compatibility (moving all IRQs to the root),
the transition should be easy whenever the kernel upgrade is accompanied
by a (limited) user-space upgrade.

And once all the features need to be used together (something that is
not possible now - so this is new usage), the code that relies on
cpus_exclusive to create sched_domains needs to be changed to use
load_balance instead.

Does that sound like a feasible plan?

> How does all this interact with /proc/irq/N/smp_affinity?

Much the same way the cpuset cpus_allowed interacts with a task's
cpus_allowed. That is, cs->cpus_allowed is a mask on top of the provided
affinity.

If for some reason the cs->cpus_allowed changes in such a way that the
user-specified mask becomes empty (irq->cpus_allowed & cs->cpus_allowed
== 0), then print a message and set it to the full mask
(irq->cpus_allowed = cs->cpus_allowed).

If for some reason the cs->cpus_allowed changes in such a way that the
mask is physically impossible (set_irq_affinity(cs->cpus_allowed)
fails), then print a message and move the IRQ to the parent set.
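
As a sketch, the update path could look like this -- the cpuset link
and move_irq_to_parent() are hypothetical names for what is described
above, not existing kernel code:

	static void irq_cpuset_changed(unsigned int irq, struct cpuset *cs)
	{
		struct irq_desc *desc = irq_desc + irq;
		cpumask_t mask;

		cpus_and(mask, desc->affinity, cs->cpus_allowed);
		if (cpus_empty(mask)) {
			/* user mask emptied: fall back to the full cpuset mask */
			printk(KERN_INFO "irq %u: affinity emptied, using cpuset mask\n", irq);
			mask = cs->cpus_allowed;
		}
		if (irq_set_affinity(irq, mask)) {
			/* physically impossible: punt the irq to the parent set */
			printk(KERN_INFO "irq %u: moving to parent cpuset\n", irq);
			move_irq_to_parent(irq, cs);
		}
	}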




^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-05  8:37                             ` Peter Zijlstra
@ 2008-03-05  8:50                               ` Ingo Molnar
  2008-03-05 12:35                                 ` Paul Jackson
  2008-03-05 19:17                               ` Max Krasnyansky
  2008-03-06 13:47                               ` Paul Jackson
  2 siblings, 1 reply; 94+ messages in thread
From: Ingo Molnar @ 2008-03-05  8:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Jackson, Max Krasnyanskiy, tglx, oleg, rostedt,
	linux-kernel, rientjes


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> So by providing a .config option for strict backward compatibility, and 
> a simple path for runtime compatibility (moving all IRQs to the root), 
> the transition should be easy whenever the kernel upgrade is 
> accompanied by a (limited) user-space upgrade.

/me likes

this looks like the most straightforward and most manageable approach 
proposed so far - i always thought that cpusets should boot up with some 
meaningful default set that people could play with. This would really 
push cpusets into mainstream use i believe.

Any patch i could try out, to see how well this works in practice? ;-)

	Ingo

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-05  8:50                               ` Ingo Molnar
@ 2008-03-05 12:35                                 ` Paul Jackson
  2008-03-05 12:43                                   ` Ingo Molnar
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-03-05 12:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: a.p.zijlstra, maxk, tglx, oleg, rostedt, linux-kernel, rientjes

Ingo wrote:
> i always thought that cpusets should boot up with some 
> meaningful default set that people could play with.

They do ... you just have to mount cpusets to see it ;).

If your kernel is configured with CONFIG_CPUSETS, then
there is one 'top' cpuset containing all tasks, all mems,
and all cpus, set up during kernel init, unconditionally.

Anyone who does:
	mkdir -p /dev/cpuset
	mount -t cpuset cpuset /dev/cpuset
can see it and play with it.

... perhaps I misunderstood your comment?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-05 12:35                                 ` Paul Jackson
@ 2008-03-05 12:43                                   ` Ingo Molnar
  2008-03-05 17:44                                     ` Paul Jackson
  0 siblings, 1 reply; 94+ messages in thread
From: Ingo Molnar @ 2008-03-05 12:43 UTC (permalink / raw)
  To: Paul Jackson
  Cc: a.p.zijlstra, maxk, tglx, oleg, rostedt, linux-kernel, rientjes


* Paul Jackson <pj@sgi.com> wrote:

> > i always thought that cpusets should boot up with some meaningful 
> > default set that people could play with.
> 
> They do ... you just have to mount cpusets to see it ;).

the root cpuset is special as it is the root of the tree. You cannot 
shrink it in practice (in a meaningful way) because then you cannot have 
children outside of its scope.

	Ingo

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-05 12:43                                   ` Ingo Molnar
@ 2008-03-05 17:44                                     ` Paul Jackson
  0 siblings, 0 replies; 94+ messages in thread
From: Paul Jackson @ 2008-03-05 17:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: a.p.zijlstra, maxk, tglx, oleg, rostedt, linux-kernel, rientjes

Ingo wrote:
> > i always thought that cpusets should boot up with some meaningful 
> > default set that people could play with.

Paul replied:
> They do ... you just have to mount cpusets to see it ;).

Ingo replied:
> the root cpuset is special as it is the root of the tree.

Yes ... true ... so?  I guess whatever you meant by "meaningful"
doesn't include the root cpuset.  Oh well <grin>.

To be a tad more serious, it wasn't (still isn't) clear to me
what you did mean here.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-05  8:37                             ` Peter Zijlstra
  2008-03-05  8:50                               ` Ingo Molnar
@ 2008-03-05 19:17                               ` Max Krasnyansky
  2008-03-06 13:47                               ` Paul Jackson
  2 siblings, 0 replies; 94+ messages in thread
From: Max Krasnyansky @ 2008-03-05 19:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Jackson, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

Peter Zijlstra wrote:
> How about we make this an in-kernel boot set that by default contains
> all IRQs, all unbound kthreads, and all of user-space?
I assumed that this is exactly what we meant by the boot set all along :).

One thing I wanted to clarify is that by all IRQs we literally mean all of
them, even those that do not have handlers yet. /proc/irq/N/smp_affinity,
for example, is not available for irqs that are not active. In other words,
an IRQ must not all of a sudden show up in the root set when something does
request_irq() on it.
This applies to other things too.

>> How does all this interact with /proc/irq/N/smp_affinity?
> 
> Much the same way the cpuset cpus_allowed interacts with a task's
> cpus_allowed. That is, cs->cpus_allowed is a mask on top of the provided
> affinity.
> 
> If for some reason the cs->cpus_allowed changes in such a way that the
> user-specified mask becomes empty (irq->cpus_allowed & cs->cpus_allowed
> == 0), then print a message and set it to the full mask
> (irq->cpus_allowed = cs->cpus_allowed).
> 
> If for some reason the cs->cpus_allowed changes in such a way that the
> mask is physically impossible (set_irq_affinity(cs->cpus_allowed)
> fails), then print a message and move the IRQ to the parent set.

I think Paul missed my earlier reply, where I pointed out that the original
patch conflicted with the /proc/irq/N/smp_affinity API. The solution is for
irq_set_affinity() to enforce cpus_allowed just like sched_setaffinity()
does for tasks.
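
Something like the following sketch, where the irq_desc->cs back-link
is the one this patch set would introduce (not mainline code):

	int irq_affinity_store(unsigned int irq, cpumask_t requested)
	{
		struct irq_desc *desc = irq_desc + irq;
		cpumask_t mask;

		/* clip the request against the owning cpuset, the same way
		 * sched_setaffinity() clips a task's requested mask */
		cpus_and(mask, requested, desc->cs->cpus_allowed);
		if (cpus_empty(mask))
			return -EINVAL;	/* request entirely outside the cpuset */

		return irq_set_affinity(irq, mask);
	}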

Max

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-05  8:37                             ` Peter Zijlstra
  2008-03-05  8:50                               ` Ingo Molnar
  2008-03-05 19:17                               ` Max Krasnyansky
@ 2008-03-06 13:47                               ` Paul Jackson
  2008-03-06 15:21                                 ` Peter Zijlstra
  2 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-03-06 13:47 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

Peter wrote:
> How about we make this an in-kernel boot set that by default contains
> all IRQs, all unbound kthreads, and all of user-space?

If I understood your proposal, the /cgroup/irqs cpuset is rather an
odd cpuset -- it seems to be just used to place the 'system' irqs on
the specified CPUs.  This is not the usual use of cpusets, which is
to associate a set of tasks with some resources.  In your example,
the 'tasks' file in /cgroup/irqs is probably empty, right?

And then you have to argue that certain incompatibilities this
causes are tolerable:

> (Upgrading a kernel would require distributing some new userspace
> anyway, right? - and we could offer a .config option to disable the boot
> set for those who do upgrade kernels without upgrading user-space).

No ... we normally do our best not to force apps to change source
code, or where possible not even force them to recompile, to run on
new kernels.

> So by providing a .config option for strict backward compatibility,

That sort of alternative is useless when dealing with the major
distros.  They choose one config for half or all of their entire
market (perhaps distinguishing personal PC from server, but no more).
They enable everything that doesn't break something they care about.

> > How does all this interact with /proc/irq/N/smp_affinity?
> 
> Much the same way the cpuset cpus_allowed interacts with a task's
> cpus_allowed. That is, cs->cpus_allowed is a mask on top of the provided
> affinity.

Ok - something like that could be done, if we had to, just as
a cpusets 'mems' masks mbind and set_mempolicy nodemasks, and
a cpusets 'cpus' masks sched_setaffinity cpumasks.

Could be ... I suppose ... if we had to ... however ...


I suspect that the reason you had the odd /cgroup/irqs cpuset is that
it wasn't clear what to do with the irqs of cpusets that had both:
 1) overlapping cpus, and
 2) specified different lists of irqs.

In my view, you are trying to:
 A] force the <<irq, cpu>, Boolean> mapping into linear lists, and
 B] then force those linear lists into the cpuset hierarchy.

With each capability that we've added to cpusets, beginning with CPUs
and Memory Nodes themselves, and more recently with sched_domains,
I've strived to keep cpusets fully hierarchical and nestable, supporting
arbitrary combinations of overlapping CPUs and Memory Nodes.

What we have here, in this cpuset-irq proposal, I claim, is another
example of trying to put in a tree what is essentially flat.

No ... that last paragraph is wrong.  It's worse than that.

The mapping of <cpu, irq> pairs to Boolean ("can we direct this irq
to this CPU - yes or no?") is not flat; it's a two-dimensional matrix
of Booleans.

So you're [A] squinting out the left eye, flattening the <irq, cpu> map
to a linear list; then [B] squinting out the right eye and flattening
the cpuset hierarchy into a flat world; and then exclaiming "Gee,
these two worlds have similar shape, so let's join them!"

Why, why, why?  Why such insanity?

What in tarnation are you trying to do, that's painful or impossible
to do, with what we have now?

The only opportunity I've sensed in all this so far is some sort
of Karnaugh-map factoring of the <<irq, cpu>, Boolean> matrix, as
an improved representation of the simple, brute force, low level
/proc/irq/N/smp_affinity map.  Sets of irqs that held a common
usage pattern could be named, and sets of CPUs that had similar irq
needs could be named (or named lists of existing cpusets might be
a suitable proxy for some, not all, such sets), and then this map
could be represented using these named sets.  Mind you, I don't see
that big a gain from presenting a second way of looking at what we
can already see by the current, simple-minded, way, because the
elegance of being able to group together common clusters (e.g.,
the set of CPUs that all get the same set of RealTime IRQ's) of the
existing smp_affinity map is offset by the added complexities of
having to deal with two ways of saying the same thing.  But perhaps
I'm missing some real opportunity here to gain significant leverage
on some problem.  If that's so, then perhaps some new cgroup would
be appropriate, representing this new interface, pairing named sets
of irqs with sets of CPUs.  And perhaps, for -some- system
configurations, it would make sense to use the cgroup capability
to mount multiple cgroup subsystems together, in a common mount.
Perhaps ...  I haven't seen what would be sufficient motivation to
justify such an effort, but I can imagine that such could succeed,
if some need, as yet unforeseen by myself, existed.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-06 13:47                               ` Paul Jackson
@ 2008-03-06 15:21                                 ` Peter Zijlstra
  2008-03-07  3:40                                   ` Paul Jackson
  0 siblings, 1 reply; 94+ messages in thread
From: Peter Zijlstra @ 2008-03-06 15:21 UTC (permalink / raw)
  To: Paul Jackson; +Cc: maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes


On Thu, 2008-03-06 at 07:47 -0600, Paul Jackson wrote:
> Peter wrote:
> > How about we make this an in-kernel boot set that by default contains
> > all IRQs, all unbound kthreads, and all of user-space?
> 
> If I understood your proposal, the /cgroup/irqs cpuset is rather an
> odd cpuset -- it seems to be just used to place the 'system' irqs on
> the specified CPUs.  This is not the usual use of cpusets, which is
> to associate a set of tasks with some resources.  In your example,
> the 'tasks' file in /cgroup/irqs is probably empty, right?

Likely; if for instance you'd want some unbound kernel threads to join
in that overlapping set, then perhaps that name would be badly chosen.

Although I'm not sure which unbound kernel threads would benefit from
such treatment.

> And then you have to argue that certain incompatibilities this
> causes are tolerable:
> 
> > (Upgrading a kernel would require distributing some new userspace
> > anyway, right? - and we could offer a .config option to disable the boot
> > set for those who do upgrade kernels without upgrading user-space).
> 
> No ... we normally do our best not to force apps to change source
> code, or where possible not even force them to recompile, to run on
> new kernels.

Perhaps we're talking about something else here; how bad would it be to
require:

for irq in `cat /cgroup/boot/irqs` ; do echo $irq > /cgroup/irqs/irqs; done

be added to rc.local, or fully replace your home-brew boot cpuset script?
It's basically an update for that script, giving the exact same semantics
to its user, but moving the larger part of it in-kernel.

> > So by providing a .config option for strict backward compatibility,
> 
> That sort of alternative is useless when dealing with the major
> distros.  They choose one config for half or all of their entire
> market (perhaps distinguishing personal PC from server, but no more).
> They enable everything that doesn't break something they care about.

Sure, but your application vendors will need to re-certify their
applications to run on new distros, sometimes even re-compile because of
ABI changes and the like. Certainly providing a new script in the new
version certified for a new distro isn't too much work?

> > > How does all this interact with /proc/irq/N/smp_affinity?
> > 
> > Much the same way the cpuset cpus_allowed interacts with a task's
> > cpus_allowed. That is, cs->cpus_allowed is a mask on top of the provided
> > affinity.
> 
> Ok - something like that could be done, if we had to, just as
> a cpusets 'mems' masks mbind and set_mempolicy nodemasks, and
> a cpusets 'cpus' masks sched_setaffinity cpumasks.
> 
> Could be ... I suppose ... if we had to ... however ...
> 
> 
> I suspect that the reason you had the odd /cgroup/irqs cpuset is that
> it wasn't clear what to do with the irqs of cpusets that had both:
>  1) overlapping cpus, and
>  2) specified different lists of irqs.
> 
> In my view, you are trying to:
>  A] force the <<irq, cpu>, Boolean> mapping into linear lists, and
> B] then force those linear lists into the cpuset hierarchy.
> 
> With each capability that we've added to cpusets, beginning with CPUs
> and Memory Nodes themselves, and more recently with sched_domains,
> I've strived to keep cpusets fully hierarchical and nestable, supporting
> arbitrary combinations of overlapping CPUs and Memory Nodes.
> 
> What we have here, in this cpuset-irq proposal, I claim, is another
> example of trying to put in a tree what is essentially flat.
> 
> No ... that last paragraph is wrong.  It's worse than that.
> 
> The mapping of <cpu, irq> pairs to Boolean ("can we direct this irq
> to this CPU - yes or no?") is not flat; it's a two-dimensional matrix
> of Booleans.
> 
> So you're [A] squinting out the left eye, flattening the <irq, cpu> map
> to a linear list; then [B] squinting out the right eye and flattening
> the cpuset hierarchy into a flat world; and then exclaiming "Gee,
> these two worlds have similar shape, so let's join them!"
> 
> Why, why, why?  Why such insanity?
> 
> What in tarnation are you trying to do, that's painful or impossible
> to do, with what we have now?

Assign a map of cpus into which irqs will default, and a way to
explicitly move them out of it.

> The only opportunity I've sensed in all this so far is some sort
> of Karnaugh-map factoring of the <<irq, cpu>, Boolean> matrix, as
> an improved representation of the simple, brute force, low level
> /proc/irq/N/smp_affinity map. 

>  Sets of irqs that held a common
> usage pattern could be named, and sets of CPUs that had similar irq
> needs could be named (or named lists of existing cpusets might be
> a suitable proxy for some, not all, such sets), and then this map
> could be represented using these named sets. 

>  Mind you, I don't see
> that big a gain from presenting a second way of looking at what we
> can already see by the current, simple-minded, way, because the
> elegance of being able to group together common clusters (e.g.,
> the set of CPUs that all get the same set of RealTime IRQ's) of the
> existing smp_affinity map is offset by the added complexities of
> having to deal with two ways of saying the same thing. 

>  But perhaps
> I'm missing some real opportunity here to gain significant leverage
> on some problem.  If that's so, then perhaps some new cgroup would
> be appropriate, representing this new interface, pairing named sets
> of irqs with sets of CPUs.  And perhaps, for -some- system
> configurations, it would make sense to use the cgroup capability
> to mount multiple cgroup subsystems together, in a common mount.

> Perhaps ...  I haven't seen what would be sufficient motivation to
> justify such an effort, but I can imagine that such could succeed,
> if some need, as yet unforeseen by myself, existed.

So, yes, cgroups are perhaps awkward because they group tasks whereas
the current problem is grouping IRQs.

Because we're mapping them onto CPUs, cpusets came to mind.

The thing we 'need' is to provide named groups of irqs and, for each
such group, specify an appropriate cpu mask.

Grouping them makes sense in that we want to make a functional division.
Some IRQs serve the system as a whole, others serve a subset. A typical
subset could be an RT process space bound to a cpu/mem domain.

Other usable subsets could be limiting the IRQs of node local network
and IO cards to the cpu/mem domain that runs the application that uses
them.

So we group irqs like:

  system_on_nodes_1_2_and_3 (default)
  big_io_app_on_nodes_2_and_3
  rt_app_on_node_4

Where, again, you see a strong similarity to the cpu/mem divisions
already made by cpusets.

I understand it's somewhat at odds with the hierarchical nature of the
<task, cpu/mem> mapping you currently have. But it's not too far away
either.

Do you see what we want to do, and why we made our - perhaps misguided -
choice of cpusets? Creating a whole new cgroup controller is also weird
because we don't deal in tasks.

[ Aside from grouping existing IRQs we also want to provide a way to
designate a default group for new IRQs. But I think such functionality
will fall into place once we can agree on the rest.]


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-06 15:21                                 ` Peter Zijlstra
@ 2008-03-07  3:40                                   ` Paul Jackson
  2008-03-07  6:39                                     ` Paul Jackson
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-03-07  3:40 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

Helpful reply, Peter.  Thanks.

Peter, replying to pj, wrote:
> On Thu, 2008-03-06 at 07:47 -0600, Paul Jackson wrote:
> > ... In your example,
> > the 'tasks' file in /cgroup/irqs is probably empty, right?
> 
> Likely; if for instance you'd want some unbound kernel threads to join
> in that overlapping set, then perhaps that name would be badly chosen.

Ok.  So, as you note below, discussing cgroups:
> ... yes, cgroups are perhaps awkward because they group tasks whereas
> the current problem is grouping IRQs.

Essentially, cpusets are like cgroups in this regard.  They group
tasks.  They just happen to be grouping tasks to associate them
with sets of CPUs (and Memory Nodes), which seems relevant somehow
to the present need, to group irqs to associate them with sets of
CPUs.



> Perhaps we're talking about something else here; how bad would it be to
> require:
> 
> for irq in `cat /cgroup/boot/irqs` ; do echo $irq > /cgroup/irqs/irqs; done

I haven't gotten my head around what such a script would do yet,
but you are correct in suspecting that we could add a script like
this easily enough in future releases, if that was useful.

I can change init scripts, for each kernel version, much easier than I
can ask the big batch scheduler providers to change their application
code (user level system code) to deal with incompatible changes.


> Certainly providing a new script in the new
> version certified for a new distro isn't too much work?

Correct - that's quite easy, from my perspective.

> > What in tarnation are you trying to do, that's painful or impossible
> > to do, with what we have now?
> 
> Assign a map of cpus into which irqs will default, and a way to
> explicitly move them out of it.

Can you spell out how or why /proc/irq/N/smp_affinity doesn't
provide what you need here?

My guess is that it's fairly obvious why /proc/irq/N/smp_affinity is
not well suited for this.  It requires poking lots of settings, one
at a time, which is cumbersome and racy from user space, difficult
to keep in sync with any other changes in placement of RT or other
jobs, and it requires root permissions, with no finer granularity
practical.


> Because we're mapping them [irqs] onto CPUs, cpusets came to mind.
> 
> The thing we 'need' is to provide named groups of irqs and, for each
> such group, specify an appropriate cpu mask.
> 
> Grouping them makes sense in that we want to make a functional division.
> Some IRQs serve the system as a whole, others serve a subset. A typical
> subset could be an RT process space bound to a cpu/mem domain.
> 
> Other usable subsets could be limiting the IRQs of node local network
> and IO cards to the cpu/mem domain that runs the application that uses
> them.
> 
> So we group irqs like:
> 
>   system_on_nodes_1_2_and_3 (default)
>   big_io_app_on_nodes_2_and_3
>   rt_app_on_node_4
> 
> Where, again, you see a strong similarity to the cpu/mem divisions
> already made by cpusets.


Cool -- I'm glad now I asked (rather impatiently) what we needed.

That's a helpful reply.

Could we:
 1) name some sets of IRQs
 2) for each cpuset, specify which named IRQ set applied to it
 3) prioritize these sets of IRQs (linear order), so that
    for any given CPU, if it were in multiple cpusets
    specifying conflicting IRQ sets, we could select the
    IRQ set to apply to that CPU.

Given the reliance in (2) of cpusets on these IRQ set names, this
still needs to be part of cpusets.

But rather than (ab)use cpusets to directly accomplish (1), how
about adding some files to the root cpuset to define IRQ sets,
with names such as (for example):

	irqs.0.system
	irqs.1.big_io_apps
	irqs.2.rt

That is, more generally, add one or more "irqs.N.name" files to
the top cpuset, where N is a distinct natural number and "name" is
a user-space-specified name (except that perhaps the first one,
'irqs.0.system' -- with its name 'system', or perhaps 'boot' --
is pre-ordained during system boot.)

Each of these 'irqs.N.name' files would contain a newline separated
list of irq numbers.

Also add, per item (2) above, to each cpuset, one more file, containing
a single line, naming one of these irq.* files to be found in the
root cpuset.  Let me call this new per-cpuset file 'irqs'.

The number N in the name "irqs.N.name" would order these sets of irqs.

If in this example a cpuset's "irqs" file specified 'rt', that would
take priority (for the CPUs in that cpuset's 'cpus' file) over the
other two irqs.N.* files above, because the '2' in "irqs.2.rt" is
bigger than the other irqs.N numbers.

For each CPU, we'd find the largest N such that some cpuset (1) had
that CPU in its 'cpus' mask, and (2) had an 'irqs' file naming the
corresponding irqs.N.name file; then we'd use the irqs listed in
that irqs.N.name file on that CPU.
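
In code, the selection rule would be something like this sketch (the
irq_set structure, the cpuset->irq_set link, and for_each_cpuset()
are all invented here, purely to restate the rule above):

	struct irq_set {
		int prio;		/* the N in irqs.N.name */
		const char *name;	/* the user-chosen part */
		/* ... plus the list of irq numbers ... */
	};

	static struct irq_set *irq_set_for_cpu(int cpu)
	{
		struct irq_set *best = NULL;
		struct cpuset *cs;

		for_each_cpuset(cs) {
			if (!cpu_isset(cpu, cs->cpus_allowed))
				continue;
			/* cs->irq_set is whatever this cpuset's 'irqs' file names */
			if (!best || cs->irq_set->prio > best->prio)
				best = cs->irq_set;
		}
		return best;
	}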

The default value of the top (root) cpuset's 'irqs' file at boot
would be 'system' (or 'boot').  The default value for any cpuset
created thereafter would be inherited from the cpuset's parent.

These 'irqs.N.name' files would be the first instance of allowing
user created files in cpuset directories.  That will require some
changes to the cpuset or cgroup code; I don't know how much.

If one of these 'irqs.N.name' files were removed, then any cpuset
that had been using it (had that 'name' in its 'irqs' file) would
have to be reverted, I suppose to its parent's 'irqs' setting.

An application (any job with permission to write its own cpuset's
files) could control which named set of irqs it wanted to use,
by writing the 'irqs' file in its cpuset.  But system permissions
(such as root) would probably be required to specify which irqs
were listed in each /dev/cpuset/irqs.N.* file (unless some admin
script decided to change the permissions on those files at runtime,
of course.)

Does that make any sense?  What have I missed?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-07  3:40                                   ` Paul Jackson
@ 2008-03-07  6:39                                     ` Paul Jackson
  2008-03-07  8:47                                       ` Paul Menage
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Jackson @ 2008-03-07  6:39 UTC (permalink / raw)
  To: Paul Jackson
  Cc: a.p.zijlstra, maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

pj wrote:
> These 'irqs.N.name' files would be the first instance of allowing
> user created files in cpuset directories.  That will require some
> changes to the cpuset or cgroup code; I don't know how much.

I guess this will require adding the line:
	.create = cgroup_create
to the:
	static struct file_operations cgroup_file_operations = {
initialization in kernel/cgroup.c, and a cgroup_create() routine
in kernel/cgroup.c, that calls an optional per-cgroup-subsystem
create routine, that, in the case of cpusets, is willing to create
one of these irqs.N.name files, in the top cpuset, if all looks
right.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-07  6:39                                     ` Paul Jackson
@ 2008-03-07  8:47                                       ` Paul Menage
  2008-03-07 14:57                                         ` Paul Jackson
  0 siblings, 1 reply; 94+ messages in thread
From: Paul Menage @ 2008-03-07  8:47 UTC (permalink / raw)
  To: Paul Jackson
  Cc: a.p.zijlstra, maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

On Thu, Mar 6, 2008 at 10:39 PM, Paul Jackson <pj@sgi.com> wrote:
>  I guess this will require adding the line:
>         .create = cgroup_create
>  to the:
>         static struct file_operations cgroup_file_operations = {
>  initialization in kernel/cgroup.c, and a cgroup_create() routine
>  in kernel/cgroup.c, that calls an optional per-cgroup-subsystem
>  create routine, that, in the case of cpusets, is willing to create
>  one of these irqs.N.name files, in the top cpuset, if all looks
>  right.
>

An alternative would be to just have some kind of "irqsets" file in
the top-level cpuset directory and let the user write irq group
definitions into that file.

Paul

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC/PATCH] cpuset: cpuset irq affinities
  2008-03-07  8:47                                       ` Paul Menage
@ 2008-03-07 14:57                                         ` Paul Jackson
  0 siblings, 0 replies; 94+ messages in thread
From: Paul Jackson @ 2008-03-07 14:57 UTC (permalink / raw)
  To: Paul Menage
  Cc: a.p.zijlstra, maxk, mingo, tglx, oleg, rostedt, linux-kernel, rientjes

Paul M wrote:
> An alternative would be to just have some kind of "irqsets" file in
> the top-level cpuset directory and let the user write irq group
> definitions into that file.

Yes, exactly, that's the alternative.

I tried that, in my mind, and got stuck on the complicated syntax that
would have been needed to represent an arbitrary length array of named
lists of irqs with each list having a priority attribute.

So I went with the separate "irqs.N.name" files, (ab)using the file
system directory apparatus to handle the "arbitrary length array of
named" entities aspect, burying the priority attribute (N) in the name,
and leaving each individual irqs.N.name file only needing to hold a
single vector of irq numbers.

A security-conscious sysadmin can even assign different permissions to
the different irq lists with this.  And updates of one irq list don't
endanger the other irq lists, thanks to the innate and elaborate
capabilities in the kernel vfs code to handle concurrent updates to
separate files correctly and reliably.

This reminds me of the difference between the Windows Registry (one big
awful file) and the Unix /etc directory (2745 files, on the system
nearest to my shell prompt.)  Well, in this irq case, the difference is
not -that- dramatic, but it is a small echo of such.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 94+ messages in thread

end of thread, other threads:[~2008-03-07 14:57 UTC | newest]

Thread overview: 94+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-27 22:21 [RFC/PATCH 0/4] CPUSET driven CPU isolation Peter Zijlstra
2008-02-27 22:21 ` [RFC/PATCH 1/4] sched: remove isolcpus Peter Zijlstra
2008-02-27 23:57   ` Max Krasnyanskiy
2008-02-28 10:19     ` Peter Zijlstra
2008-02-28 19:36       ` Max Krasnyansky
2008-02-27 22:21 ` [RFC/PATCH 2/4] cpuset: system sets Peter Zijlstra
2008-02-27 23:39   ` Paul Jackson
2008-02-28  1:53     ` Max Krasnyanskiy
2008-02-27 23:52   ` Max Krasnyanskiy
2008-02-28  0:11     ` Paul Jackson
2008-02-28  0:29       ` Steven Rostedt
2008-02-28  1:45         ` Max Krasnyanskiy
2008-02-28  3:41           ` Steven Rostedt
2008-02-28  4:58             ` Max Krasnyansky
2008-02-27 22:21 ` [RFC/PATCH 3/4] genirq: system set irq affinities Peter Zijlstra
2008-02-28  0:10   ` Max Krasnyanskiy
2008-02-28 10:19     ` Peter Zijlstra
2008-02-27 22:21 ` [RFC/PATCH 4/4] kthread: system set kthread affinities Peter Zijlstra
2008-02-27 23:38 ` [RFC/PATCH 0/4] CPUSET driven CPU isolation Max Krasnyanskiy
2008-02-28 10:19   ` Peter Zijlstra
2008-02-28 17:33     ` Max Krasnyanskiy
2008-02-28  7:50 ` Ingo Molnar
2008-02-28  8:08   ` Paul Jackson
2008-02-28  9:08     ` Ingo Molnar
2008-02-28  9:17       ` Paul Jackson
2008-02-28  9:32         ` David Rientjes
2008-02-28 10:12           ` David Rientjes
2008-02-28 10:26             ` Peter Zijlstra
2008-02-28 17:37             ` Paul Jackson
2008-02-28 21:24               ` David Rientjes
2008-02-28 22:46                 ` Paul Jackson
2008-02-28 23:00                   ` David Rientjes
2008-02-29  0:16                     ` Paul Jackson
2008-02-29  1:05                       ` David Rientjes
2008-02-29  3:34                         ` Paul Jackson
2008-02-29  4:00                           ` David Rientjes
2008-02-29  6:53                             ` Paul Jackson
2008-02-28 10:46         ` Ingo Molnar
2008-02-28 17:47           ` Paul Jackson
2008-02-28 20:11           ` Max Krasnyansky
2008-02-28 20:13             ` Paul Jackson
2008-02-28 20:26               ` Max Krasnyansky
2008-02-28 20:27                 ` Paul Jackson
2008-02-28 20:45                   ` Max Krasnyansky
2008-02-28 20:23       ` Max Krasnyansky
2008-02-28 17:48   ` Max Krasnyanskiy
2008-02-29  8:31   ` Andrew Morton
2008-02-29  8:36     ` Andrew Morton
2008-02-29  9:10     ` Ingo Molnar
2008-02-29 18:06       ` Max Krasnyanskiy
2008-02-28 12:12 ` Mark Hounschell
2008-02-28 19:57   ` Max Krasnyansky
2008-02-29 18:55 ` [RFC/PATCH] cpuset: cpuset irq affinities Peter Zijlstra
2008-02-29 19:02   ` Ingo Molnar
2008-02-29 20:52     ` Max Krasnyanskiy
2008-02-29 21:03       ` Peter Zijlstra
2008-02-29 21:20         ` Max Krasnyanskiy
2008-03-03 11:57           ` Peter Zijlstra
2008-03-03 17:36             ` Paul Jackson
2008-03-03 17:57               ` Peter Zijlstra
2008-03-03 18:10                 ` Paul Jackson
2008-03-03 18:18                   ` Peter Zijlstra
2008-03-04  7:35                     ` Paul Jackson
2008-03-04 11:06                       ` Peter Zijlstra
2008-03-04 19:52                         ` Max Krasnyanskiy
2008-03-05  1:11                           ` Paul Jackson
2008-03-05  8:37                             ` Peter Zijlstra
2008-03-05  8:50                               ` Ingo Molnar
2008-03-05 12:35                                 ` Paul Jackson
2008-03-05 12:43                                   ` Ingo Molnar
2008-03-05 17:44                                     ` Paul Jackson
2008-03-05 19:17                               ` Max Krasnyansky
2008-03-06 13:47                               ` Paul Jackson
2008-03-06 15:21                                 ` Peter Zijlstra
2008-03-07  3:40                                   ` Paul Jackson
2008-03-07  6:39                                     ` Paul Jackson
2008-03-07  8:47                                       ` Paul Menage
2008-03-07 14:57                                         ` Paul Jackson
2008-03-03 18:41                   ` Paul Menage
2008-03-03 18:52                     ` Paul Jackson
2008-03-04  5:26                       ` Paul Menage
2008-03-04  6:15                         ` Paul Jackson
2008-03-04  6:21                           ` Paul Menage
2008-03-04  6:26                             ` Paul Jackson
2008-03-04  6:34                               ` Paul Menage
2008-03-04  6:51                                 ` Paul Jackson
2008-02-29 20:55   ` Paul Jackson
2008-02-29 21:14     ` Peter Zijlstra
2008-02-29 21:29       ` Ingo Molnar
2008-02-29 21:32       ` Ingo Molnar
2008-02-29 21:42       ` Max Krasnyanskiy
2008-02-29 22:00         ` Paul Jackson
2008-02-29 21:53       ` Paul Jackson
2008-03-02  5:18   ` Christoph Hellwig
