All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
@ 2012-02-07 18:55 Srivatsa S. Bhat
  2012-02-07 18:56 ` [PATCH 1/4] CPU hotplug, cpuset: Maintain a copy of the cpus_allowed mask before CPU hotplug Srivatsa S. Bhat
                   ` (4 more replies)
  0 siblings, 5 replies; 32+ messages in thread
From: Srivatsa S. Bhat @ 2012-02-07 18:55 UTC (permalink / raw)
  To: paul, a.p.zijlstra, mingo, rjw, tj
  Cc: frank.rowand, pjt, tglx, lizf, prashanth, paulmck, vatsa,
	srivatsa.bhat, linux-kernel, linux-pm

There is a very long standing issue related to how cpusets handle CPU
hotplug events. The problem is that when a CPU goes offline, it is removed
from all cpusets. However, when that CPU comes back online, it is added
*only* to the root cpuset. Which means, any task attached to a cpuset lower
in the hierarchy will have one CPU less in its cpuset, though it had this
CPU in its cpuset before the CPU went offline.

The issue gets enormously aggravated in the case of suspend/resume. During
suspend, all non-boot CPUs are taken offline. Which means, all those CPUs
get removed from all the cpusets. When the system resumes, all CPUs are
brought back online; however, the newly onlined CPUs get added only to the
root cpuset - and all other cpusets have cpuset.cpus = 0 (boot cpu alone)!
This means, (as is obvious), all those tasks attached to non-root cpusets
will be constrained to run only on one single cpu!

So, imagine the amount of performance degradation after suspend/resume!!

In particular, libvirt is one of the active users of cpusets. And apparently,
people hit this problem long ago:
https://bugzilla.redhat.com/show_bug.cgi?id=714271

But unfortunately this never got resolved since people probably thought that
the bug was in libvirt... and all this time the kernel was the culprit!

--
 Srivatsa S. Bhat (4):
      CPU hotplug, cpuset: Maintain a copy of the cpus_allowed mask before CPU hotplug
      cpuset: Split up update_cpumask() so that its functionality can be reused
      cpuset: Add function to introduce CPUs to cpusets during CPU online
      CPU hotplug, cpusets: Differentiate the CPU online and CPU offline callbacks


  include/linux/cpuset.h |    4 +
 kernel/cpuset.c        |  177 +++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/core.c    |   12 +++
 3 files changed, 163 insertions(+), 30 deletions(-)


Regards,
Srivatsa S. Bhat
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/4] CPU hotplug, cpuset: Maintain a copy of the cpus_allowed mask before CPU hotplug
  2012-02-07 18:55 [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets Srivatsa S. Bhat
@ 2012-02-07 18:56 ` Srivatsa S. Bhat
  2012-02-07 18:56 ` [PATCH 2/4] cpuset: Split up update_cpumask() so that its functionality can be reused Srivatsa S. Bhat
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 32+ messages in thread
From: Srivatsa S. Bhat @ 2012-02-07 18:56 UTC (permalink / raw)
  To: paul, a.p.zijlstra, mingo, rjw, tj
  Cc: frank.rowand, pjt, tglx, lizf, prashanth, paulmck, vatsa,
	srivatsa.bhat, linux-kernel, linux-pm

Maintain a new per-cpuset mask 'cpus_allowed_before_hotplug', that closely
reflects the value of 'cpus_allowed' of a cpuset, in order to decide how to
react upon a CPU offline followed by CPU online.

Whenever 'cpus_allowed' of the cpuset changes, 'cpus_allowed_before_hotplug'
is updated. The only exception is: when 'cpus_allowed' changes due to a CPU
hotplug event (either offline or online) we don't update
'cpus_allowed_before_hotplug', so that it still reflects the state of
'cpus_allowed' as seen *before* the hotplug event.

This is to handle cases like:

* cpus_allowed has 0-15, cpus_allowed_before_hotplug is also 0-15

* CPU 15 was taken offline
  So now cpus_allowed = 0-14
  No changes to cpus_allowed_before_hotplug.

* CPU 14 was taken offline
  So now cpus_allowed = 0-13
  But we still remember that the original cpuset was 0-15, because
  cpus_allowed_before_hotplug was never changed.

* cpuset was changed to 0-10 from userspace.
  So now cpus_allowed = 0-10, and cpus_allowed_before_hotplug is also updated
  to 0-10.

So, essentially cpus_allowed_before_hotplug is maintained in such a way that
if a CPU comes back online, we know whether it was in the cpuset earlier (and
hence we need to add that CPU back into the cpuset) or whether it was
originally absent from that cpuset.

Reported-by: Prashanth K. Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
---

 kernel/cpuset.c |   40 +++++++++++++++++++++++++++++++++++-----
 1 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index a09ac2b..5e2323b 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -93,6 +93,15 @@ struct cpuset {
 
 	unsigned long flags;		/* "unsigned long" so bitops work */
 	cpumask_var_t cpus_allowed;	/* CPUs allowed to tasks in cpuset */
+
+	/*
+	 * Copy of 'cpus_allowed'. This is updated whenever 'cpus_allowed' is
+	 * updated, except during CPU hotplug operations (in which case, only
+	 * 'cpus_allowed' is updated). This is used to decide whether to add a
+	 * CPU to this cpuset when that offline CPU comes back online.
+	 */
+	cpumask_var_t cpus_allowed_before_hotplug;
+
 	nodemask_t mems_allowed;	/* Memory Nodes allowed to tasks */
 
 	struct cpuset *parent;		/* my parent */
@@ -903,6 +912,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 
 	mutex_lock(&callback_mutex);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
+	cpumask_copy(cs->cpus_allowed_before_hotplug, cs->cpus_allowed);
 	mutex_unlock(&callback_mutex);
 
 	/*
@@ -1851,6 +1861,7 @@ static void cpuset_post_clone(struct cgroup_subsys *ss,
 	mutex_lock(&callback_mutex);
 	cs->mems_allowed = parent_cs->mems_allowed;
 	cpumask_copy(cs->cpus_allowed, parent_cs->cpus_allowed);
+	cpumask_copy(cs->cpus_allowed_before_hotplug, cs->cpus_allowed);
 	mutex_unlock(&callback_mutex);
 	return;
 }
@@ -1875,10 +1886,12 @@ static struct cgroup_subsys_state *cpuset_create(
 	cs = kmalloc(sizeof(*cs), GFP_KERNEL);
 	if (!cs)
 		return ERR_PTR(-ENOMEM);
-	if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL)) {
-		kfree(cs);
-		return ERR_PTR(-ENOMEM);
-	}
+
+	if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL))
+		goto out_cs;
+
+	if (!alloc_cpumask_var(&cs->cpus_allowed_before_hotplug, GFP_KERNEL))
+		goto out_cpus_allowed;
 
 	cs->flags = 0;
 	if (is_spread_page(parent))
@@ -1887,6 +1900,7 @@ static struct cgroup_subsys_state *cpuset_create(
 		set_bit(CS_SPREAD_SLAB, &cs->flags);
 	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 	cpumask_clear(cs->cpus_allowed);
+	cpumask_clear(cs->cpus_allowed_before_hotplug);
 	nodes_clear(cs->mems_allowed);
 	fmeter_init(&cs->fmeter);
 	cs->relax_domain_level = -1;
@@ -1894,6 +1908,12 @@ static struct cgroup_subsys_state *cpuset_create(
 	cs->parent = parent;
 	number_of_cpusets++;
 	return &cs->css ;
+
+out_cpus_allowed:
+	kfree(cs->cpus_allowed);
+out_cs:
+	kfree(cs);
+	return ERR_PTR(-ENOMEM);
 }
 
 /*
@@ -1911,6 +1931,7 @@ static void cpuset_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
 
 	number_of_cpusets--;
 	free_cpumask_var(cs->cpus_allowed);
+	free_cpumask_var(cs->cpus_allowed_before_hotplug);
 	kfree(cs);
 }
 
@@ -1936,7 +1957,9 @@ int __init cpuset_init(void)
 {
 	int err = 0;
 
-	if (!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL))
+	if (!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL) ||
+		!alloc_cpumask_var(&top_cpuset.cpus_allowed_before_hotplug,
+				GFP_KERNEL))
 		BUG();
 
 	cpumask_setall(top_cpuset.cpus_allowed);
@@ -2077,6 +2100,13 @@ static void scan_for_empty_cpusets(struct cpuset *root)
 		mutex_lock(&callback_mutex);
 		cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
 			    cpu_active_mask);
+
+		/*
+		 * Do NOT update cpus_allowed_before_hotplug here. We want it
+		 * to reflect the value of cpus_allowed as seen *before* the
+		 * hotplug event.
+		 */
+
 		nodes_and(cp->mems_allowed, cp->mems_allowed,
 						node_states[N_HIGH_MEMORY]);
 		mutex_unlock(&callback_mutex);


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 2/4] cpuset: Split up update_cpumask() so that its functionality can be reused
  2012-02-07 18:55 [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets Srivatsa S. Bhat
  2012-02-07 18:56 ` [PATCH 1/4] CPU hotplug, cpuset: Maintain a copy of the cpus_allowed mask before CPU hotplug Srivatsa S. Bhat
@ 2012-02-07 18:56 ` Srivatsa S. Bhat
  2012-02-07 18:57 ` [PATCH 3/4] cpuset: Add function to introduce CPUs to cpusets during CPU online Srivatsa S. Bhat
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 32+ messages in thread
From: Srivatsa S. Bhat @ 2012-02-07 18:56 UTC (permalink / raw)
  To: paul, a.p.zijlstra, mingo, rjw, tj
  Cc: frank.rowand, pjt, tglx, lizf, prashanth, paulmck, vatsa,
	srivatsa.bhat, linux-kernel, linux-pm

update_cpumask() expects the new cpuset in the form of characters (owing to
the fact that it expects cpuset updates from userspace). This might not be
appropriate if an in-kernel user wants to call it.

So, split up the function so that its functionality can be reused. That is,
introduce do_update_cpumask() and offload the core work to it.

Reported-by: Prashanth K. Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
---

 kernel/cpuset.c |   61 ++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 42 insertions(+), 19 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 5e2323b..2be71da 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -864,13 +864,14 @@ static void update_tasks_cpumask(struct cpuset *cs, struct ptr_heap *heap)
 	cgroup_scan_tasks(&scan);
 }
 
+
 /**
- * update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
+ * do_update_cpumask - update the cpus_allowed mask of a cpuset and all tasks
+ * in it
  * @cs: the cpuset to consider
- * @buf: buffer of cpu numbers written to this cpuset
+ * @trialcs: the updated cpuset value requested
  */
-static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
-			  const char *buf)
+static int do_update_cpumask(struct cpuset *cs, struct cpuset *trialcs)
 {
 	struct ptr_heap heap;
 	int retval;
@@ -880,22 +881,9 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (cs == &top_cpuset)
 		return -EACCES;
 
-	/*
-	 * An empty cpus_allowed is ok only if the cpuset has no tasks.
-	 * Since cpulist_parse() fails on an empty mask, we special case
-	 * that parsing.  The validate_change() call ensures that cpusets
-	 * with tasks have cpus.
-	 */
-	if (!*buf) {
-		cpumask_clear(trialcs->cpus_allowed);
-	} else {
-		retval = cpulist_parse(buf, trialcs->cpus_allowed);
-		if (retval < 0)
-			return retval;
-
-		if (!cpumask_subset(trialcs->cpus_allowed, cpu_active_mask))
+	if (!cpumask_subset(trialcs->cpus_allowed, cpu_active_mask))
 			return -EINVAL;
-	}
+
 	retval = validate_change(cs, trialcs);
 	if (retval < 0)
 		return retval;
@@ -925,9 +913,44 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 
 	if (is_load_balanced)
 		async_rebuild_sched_domains();
+
 	return 0;
 }
 
+/**
+ * update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
+ * @cs: the cpuset to consider
+ * @buf: buffer of cpu numbers written to this cpuset
+ */
+static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
+			  const char *buf)
+{
+	int retval;
+
+	/* top_cpuset.cpus_allowed tracks cpu_online_map; it's read-only */
+	if (cs == &top_cpuset)
+		return -EACCES;
+
+	/*
+	 * An empty cpus_allowed is ok only if the cpuset has no tasks.
+	 * Since cpulist_parse() fails on an empty mask, we special case
+	 * that parsing.  The validate_change() call ensures that cpusets
+	 * with tasks have cpus.
+	 */
+	if (!*buf) {
+		cpumask_clear(trialcs->cpus_allowed);
+	} else {
+		retval = cpulist_parse(buf, trialcs->cpus_allowed);
+		if (retval < 0)
+			return retval;
+
+		if (!cpumask_subset(trialcs->cpus_allowed, cpu_active_mask))
+			return -EINVAL;
+	}
+
+	return do_update_cpumask(cs, trialcs);
+}
+
 /*
  * cpuset_migrate_mm
  *


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 3/4] cpuset: Add function to introduce CPUs to cpusets during CPU online
  2012-02-07 18:55 [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets Srivatsa S. Bhat
  2012-02-07 18:56 ` [PATCH 1/4] CPU hotplug, cpuset: Maintain a copy of the cpus_allowed mask before CPU hotplug Srivatsa S. Bhat
  2012-02-07 18:56 ` [PATCH 2/4] cpuset: Split up update_cpumask() so that its functionality can be reused Srivatsa S. Bhat
@ 2012-02-07 18:57 ` Srivatsa S. Bhat
  2012-02-07 18:57 ` [PATCH 4/4] CPU hotplug, cpusets: Differentiate the CPU online and CPU offline callbacks Srivatsa S. Bhat
  2012-02-08  3:22 ` [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets Peter Zijlstra
  4 siblings, 0 replies; 32+ messages in thread
From: Srivatsa S. Bhat @ 2012-02-07 18:57 UTC (permalink / raw)
  To: paul, a.p.zijlstra, mingo, rjw, tj
  Cc: frank.rowand, pjt, tglx, lizf, prashanth, paulmck, vatsa,
	srivatsa.bhat, linux-kernel, linux-pm

During a CPU online operation, at present, that CPU is added only to
top_cpuset.cpus_allowed, but not to any of the other cpusets, which
is a mistake.

So, add a function to go through all the cpusets and add this CPU to
each cpuset which contained this CPU previously. In other words, add this
CPU to those cpusets, whose cpus_allowed_before_hotplug mask contains
this CPU.

Note that this function should not modify cpus_allowed_before_hotplug.
Hence, we move that particular piece of code from do_update_cpumask()
to update_cpumask().

Reported-by: Prashanth K. Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
---

 kernel/cpuset.c |   71 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 2be71da..c56e705 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -900,7 +900,6 @@ static int do_update_cpumask(struct cpuset *cs, struct cpuset *trialcs)
 
 	mutex_lock(&callback_mutex);
 	cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
-	cpumask_copy(cs->cpus_allowed_before_hotplug, cs->cpus_allowed);
 	mutex_unlock(&callback_mutex);
 
 	/*
@@ -948,7 +947,10 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 			return -EINVAL;
 	}
 
-	return do_update_cpumask(cs, trialcs);
+	retval = do_update_cpumask(cs, trialcs);
+
+	cpumask_copy(cs->cpus_allowed_before_hotplug, cs->cpus_allowed);
+	return retval;
 }
 
 /*
@@ -2079,6 +2081,71 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
 	move_member_tasks_to_cpuset(cs, parent);
 }
 
+
+/*
+ * Walk the specified cpuset subtree and add the newly onlined CPU to the
+ * cpuset, if that CPU was in this cpuset previously (IOW, if this CPU
+ * is still in the cpuset_allowed_before_hotplug mask of this cpuset).
+ *
+ * Called with cgroup_mutex held.
+ *
+ * This walk processes the tree from top to bottom, completing one layer
+ * before dropping down to the next.  It always processes a node before
+ * any of its children.
+ */
+static void scan_cpusets_during_cpu_online(struct cpuset *root)
+{
+	LIST_HEAD(queue);
+	struct cpuset *cp;	/* scans cpusets being updated */
+	struct cpuset *child;	/* scans child cpusets of cp */
+	struct cgroup *cont;
+	struct cpuset *trialcs;
+
+	trialcs = alloc_trial_cpuset(root);
+	if (!trialcs)
+		return;
+
+	list_add_tail((struct list_head *)&root->stack_list, &queue);
+
+	while (!list_empty(&queue)) {
+		cp = list_first_entry(&queue, struct cpuset, stack_list);
+		list_del(queue.next);
+		list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
+			child = cgroup_cs(cont);
+			list_add_tail(&child->stack_list, &queue);
+		}
+
+		/*
+		 * Continue past cpusets which didn't have this CPU previously
+		 * (Note that when this CPU went offline, it had been removed
+		 * from all cpusets.)
+		 */
+		if (cpumask_equal(cp->cpus_allowed,
+					cp->cpus_allowed_before_hotplug))
+			continue;
+
+		/*
+		 * Add newly onlined CPUs to this cpuset if they were part of
+		 * this cpuset previously.
+		 */
+		cpumask_and(trialcs->cpus_allowed,
+				cp->cpus_allowed_before_hotplug,
+				cpu_active_mask);
+
+		/* Update the cpuset */
+		do_update_cpumask(cp, trialcs);
+
+		/*
+		 * Do NOT update cpus_allowed_before_hotplug here. We want it
+		 * to reflect the value of cpus_allowed as seen *before* the
+		 * hotplug event.
+		 */
+	}
+
+	free_trial_cpuset(trialcs);
+}
+
+
 /*
  * Walk the specified cpuset subtree and look for empty cpusets.
  * The tasks of such cpuset must be moved to a parent cpuset.


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 4/4] CPU hotplug, cpusets: Differentiate the CPU online and CPU offline callbacks
  2012-02-07 18:55 [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets Srivatsa S. Bhat
                   ` (2 preceding siblings ...)
  2012-02-07 18:57 ` [PATCH 3/4] cpuset: Add function to introduce CPUs to cpusets during CPU online Srivatsa S. Bhat
@ 2012-02-07 18:57 ` Srivatsa S. Bhat
  2012-02-08  3:22 ` [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets Peter Zijlstra
  4 siblings, 0 replies; 32+ messages in thread
From: Srivatsa S. Bhat @ 2012-02-07 18:57 UTC (permalink / raw)
  To: paul, a.p.zijlstra, mingo, rjw, tj
  Cc: frank.rowand, pjt, tglx, lizf, prashanth, paulmck, vatsa,
	srivatsa.bhat, linux-kernel, linux-pm

Call scan_cpusets_during_cpu_online() upon CPU online events and
call scan_for_empty_cpusets() upon CPU offline events.

Reported-by: Prashanth K. Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
---

 include/linux/cpuset.h |    4 ++--
 kernel/cpuset.c        |    9 +++++++--
 kernel/sched/core.c    |   12 ++++++++++--
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index e9eaec5..acfe484 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -20,7 +20,7 @@ extern int number_of_cpusets;	/* How many cpusets are defined in system? */
 
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
-extern void cpuset_update_active_cpus(void);
+extern void cpuset_update_active_cpus(bool cpu_online);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern int cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -133,7 +133,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 static inline int cpuset_init(void) { return 0; }
 static inline void cpuset_init_smp(void) {}
 
-static inline void cpuset_update_active_cpus(void)
+static inline void cpuset_update_active_cpus(bool cpu_online)
 {
 	partition_sched_domains(1, NULL, NULL);
 }
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index c56e705..d863df5 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2223,8 +2223,10 @@ static void scan_for_empty_cpusets(struct cpuset *root)
  *
  * Called within get_online_cpus().  Needs to call cgroup_lock()
  * before calling generate_sched_domains().
+ *
+ * @cpu_online: whether this is a CPU online event or a CPU offline event.
  */
-void cpuset_update_active_cpus(void)
+void cpuset_update_active_cpus(bool cpu_online)
 {
 	struct sched_domain_attr *attr;
 	cpumask_var_t *doms;
@@ -2234,7 +2236,10 @@ void cpuset_update_active_cpus(void)
 	mutex_lock(&callback_mutex);
 	cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
 	mutex_unlock(&callback_mutex);
-	scan_for_empty_cpusets(&top_cpuset);
+	if (cpu_online)
+		scan_cpusets_during_cpu_online(&top_cpuset);
+	else
+		scan_for_empty_cpusets(&top_cpuset);
 	ndoms = generate_sched_domains(&doms, &attr);
 	cgroup_unlock();
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5255c9d..d4dd9ef 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6732,7 +6732,11 @@ static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
 	switch (action & ~CPU_TASKS_FROZEN) {
 	case CPU_ONLINE:
 	case CPU_DOWN_FAILED:
-		cpuset_update_active_cpus();
+		/*
+		 * The 'true' value indicates that we are calling the function
+		 * due to a CPU online operation in progress.
+		 */
+		cpuset_update_active_cpus(true);
 		return NOTIFY_OK;
 	default:
 		return NOTIFY_DONE;
@@ -6744,7 +6748,11 @@ static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
 {
 	switch (action & ~CPU_TASKS_FROZEN) {
 	case CPU_DOWN_PREPARE:
-		cpuset_update_active_cpus();
+		/*
+		 * The 'false' value indicates that we are calling the function
+		 * due to a CPU offline operation in progress.
+		 */
+		cpuset_update_active_cpus(false);
 		return NOTIFY_OK;
 	default:
 		return NOTIFY_DONE;


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-07 18:55 [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets Srivatsa S. Bhat
                   ` (3 preceding siblings ...)
  2012-02-07 18:57 ` [PATCH 4/4] CPU hotplug, cpusets: Differentiate the CPU online and CPU offline callbacks Srivatsa S. Bhat
@ 2012-02-08  3:22 ` Peter Zijlstra
  2012-02-08  6:33   ` Srivatsa S. Bhat
  4 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2012-02-08  3:22 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: paul, mingo, rjw, tj, frank.rowand, pjt, tglx, lizf, prashanth,
	paulmck, vatsa, linux-kernel, linux-pm

On Wed, 2012-02-08 at 00:25 +0530, Srivatsa S. Bhat wrote:
> There is a very long standing issue related to how cpusets handle CPU
> hotplug events. The problem is that when a CPU goes offline, it is removed
> from all cpusets. However, when that CPU comes back online, it is added
> *only* to the root cpuset. Which means, any task attached to a cpuset lower
> in the hierarchy will have one CPU less in its cpuset, though it had this
> CPU in its cpuset before the CPU went offline.

Yeah so? That's known behaviour..

> The issue gets enormously aggravated in the case of suspend/resume.

Why does suspend resume does this anyway? hotunplug is terribly
expensive, surely not doing it would make suspend ever so much faster?

>  During
> suspend, all non-boot CPUs are taken offline. Which means, all those CPUs
> get removed from all the cpusets. When the system resumes, all CPUs are
> brought back online; however, the newly onlined CPUs get added only to the
> root cpuset - and all other cpusets have cpuset.cpus = 0 (boot cpu alone)!
> This means, (as is obvious), all those tasks attached to non-root cpusets
> will be constrained to run only on one single cpu!
> 
> So, imagine the amount of performance degradation after suspend/resume!!
> 
> In particular, libvirt is one of the active users of cpusets. And apparently,
> people hit this problem long ago:
> https://bugzilla.redhat.com/show_bug.cgi?id=714271
> 
> But unfortunately this never got resolved since people probably thought that
> the bug was in libvirt... and all this time the kernel was the culprit!

/me boggles, why do you use cpusets on a system small enough to suspend,
and I'm so not going to ask about libvirt because I know I'll just get
sad.




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-08  3:22 ` [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets Peter Zijlstra
@ 2012-02-08  6:33   ` Srivatsa S. Bhat
  2012-02-09  7:57     ` Ingo Molnar
  2012-02-09 16:43     ` Peter Zijlstra
  0 siblings, 2 replies; 32+ messages in thread
From: Srivatsa S. Bhat @ 2012-02-08  6:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paul, mingo, rjw, tj, frank.rowand, pjt, tglx, lizf, prashanth,
	paulmck, vatsa, linux-kernel, linux-pm, akpm

On 02/08/2012 08:52 AM, Peter Zijlstra wrote:

> On Wed, 2012-02-08 at 00:25 +0530, Srivatsa S. Bhat wrote:
>> There is a very long standing issue related to how cpusets handle CPU
>> hotplug events. The problem is that when a CPU goes offline, it is removed
>> from all cpusets. However, when that CPU comes back online, it is added
>> *only* to the root cpuset. Which means, any task attached to a cpuset lower
>> in the hierarchy will have one CPU less in its cpuset, though it had this
>> CPU in its cpuset before the CPU went offline.
> 
> Yeah so? That's known behaviour..


This might be a known behaviour, but this is surely not the behaviour we
want right? I understand that if you take a CPU offline, we have no other
choice but to remove it from all cpusets. But if the same CPU comes back
online and the userspace did not request any change to cpusets in between
those events (offline-online), then is it not wrong to silently keep that
CPU out of the cpuset even when it comes online?

IOW, consider:

cpuset A has 0-10

- Take CPU 10 offline
  [We are forced to remove CPU 10 from cpuset A, which becomes 0-9 now]


   <Userspace didn't request any change to cpuset A>


- Bring back CPU 10 online

Now cpuset A is still 0-9! IMO, it should have been 0-10.
That is, why should a totally unrelated operation like CPU Hotplug alter
the cpuset silently under the hood? Or, put another way, if the kernel
is intelligent enough to restore the root cpuset on CPU hotplug events,
why should it not restore the rest of the cpusets?

> 
>> The issue gets enormously aggravated in the case of suspend/resume.
> 
> Why does suspend resume does this anyway? hotunplug is terribly
> expensive, surely not doing it would make suspend ever so much faster?
> 


Well, the point I am trying to make is not about speeding up suspend/resume
itself. I am trying to say that there is a bug (or atleast an "undesirable
behaviour" if you feel "bug" is too strong a word to use) in cpu hotplug
handling in cpusets which gets magnified during suspend/resume
(agreed, because suspend/resume relies on cpu hotplug at the moment).

[And one of the promises of suspend/resume is to restore the system to its
original state to the best extent it can. And cpusets is clearly breaking
this promise. And the good news is: this luckily falls under our "things
that we *can* restore after resume" list and this patchset achieves this.]

>>  During
>> suspend, all non-boot CPUs are taken offline. Which means, all those CPUs
>> get removed from all the cpusets. When the system resumes, all CPUs are
>> brought back online; however, the newly onlined CPUs get added only to the
>> root cpuset - and all other cpusets have cpuset.cpus = 0 (boot cpu alone)!
>> This means, (as is obvious), all those tasks attached to non-root cpusets
>> will be constrained to run only on one single cpu!
>>
>> So, imagine the amount of performance degradation after suspend/resume!!
>>
>> In particular, libvirt is one of the active users of cpusets. And apparently,
>> people hit this problem long ago:
>> https://bugzilla.redhat.com/show_bug.cgi?id=714271
>>
>> But unfortunately this never got resolved since people probably thought that
>> the bug was in libvirt... and all this time the kernel was the culprit!
> 
> /me boggles, why do you use cpusets on a system small enough to suspend,
> and I'm so not going to ask about libvirt because I know I'll just get
> sad.
> 
 

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-08  6:33   ` Srivatsa S. Bhat
@ 2012-02-09  7:57     ` Ingo Molnar
  2012-02-09  8:42       ` Srivatsa S. Bhat
  2012-02-09 16:43     ` Peter Zijlstra
  1 sibling, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2012-02-09  7:57 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Peter Zijlstra, paul, rjw, tj, frank.rowand, pjt, tglx, lizf,
	prashanth, paulmck, vatsa, linux-kernel, linux-pm, akpm


* Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> wrote:

> IOW, consider:
> 
> cpuset A has 0-10
> 
> - Take CPU 10 offline
>   [We are forced to remove CPU 10 from cpuset A, which becomes 0-9 now]
> 
> 
>    <Userspace didn't request any change to cpuset A>
> 
> 
> - Bring back CPU 10 online
> 
> Now cpuset A is still 0-9! IMO, it should have been 0-10.

Why is CPU 10 taken out of the cpuset to begin with?

The cpuset code should be fixed to work with offline CPUs as 
well - it can obviously not schedule to them, but otherwise it 
should be fine to have a wider cpuset than the hw can support.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-09  7:57     ` Ingo Molnar
@ 2012-02-09  8:42       ` Srivatsa S. Bhat
  2012-02-09 15:11         ` Ingo Molnar
  2012-02-10 15:53         ` Peter Zijlstra
  0 siblings, 2 replies; 32+ messages in thread
From: Srivatsa S. Bhat @ 2012-02-09  8:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, paul, rjw, tj, frank.rowand, pjt, tglx, lizf,
	prashanth, paulmck, vatsa, linux-kernel, linux-pm, akpm

Hi Ingo,

On 02/09/2012 01:27 PM, Ingo Molnar wrote:

> 
> * Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> 
>> IOW, consider:
>>
>> cpuset A has 0-10
>>
>> - Take CPU 10 offline
>>   [We are forced to remove CPU 10 from cpuset A, which becomes 0-9 now]
>>
>>
>>    <Userspace didn't request any change to cpuset A>
>>
>>
>> - Bring back CPU 10 online
>>
>> Now cpuset A is still 0-9! IMO, it should have been 0-10.
> 
> Why is CPU 10 taken out of the cpuset to begin with?
> 
> The cpuset code should be fixed to work with offline CPUs as 
> well - it can obviously not schedule to them, but otherwise it 
> should be fine to have a wider cpuset than the hw can support.
> 


My understanding of the code is that when a CPU is taken offline, it is
removed from all the cpusets and then the scan_for_empty_cpusets() function
is run to move tasks from empty cpusets to their parent cpusets.

And then update_tasks_cpumask() will update the cpus_allowed mask of each
task whose cpus_allowed mask needs an update (because it moved between
cpusets). (This is achieved by setting up cpuset_change_cpumask() as the
callback and calling cgroup_scan_tasks()).

So, from my point of view, cpusets seems to be handling CPU Offline pretty
well, and correctly too. The corresponding CPU online handling is what is
buggy, IMHO, and that is what I intended to fix with this patchset.
 
Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-09  8:42       ` Srivatsa S. Bhat
@ 2012-02-09 15:11         ` Ingo Molnar
  2012-02-10 15:52           ` Peter Zijlstra
  2012-02-10 15:53         ` Peter Zijlstra
  1 sibling, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2012-02-09 15:11 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Peter Zijlstra, paul, rjw, tj, frank.rowand, pjt, tglx, lizf,
	prashanth, paulmck, vatsa, linux-kernel, linux-pm, akpm


* Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> wrote:

> Hi Ingo,
> 
> On 02/09/2012 01:27 PM, Ingo Molnar wrote:
> 
> > 
> > * Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> > 
> >> IOW, consider:
> >>
> >> cpuset A has 0-10
> >>
> >> - Take CPU 10 offline
> >>   [We are forced to remove CPU 10 from cpuset A, which becomes 0-9 now]
> >>
> >>
> >>    <Userspace didn't request any change to cpuset A>
> >>
> >>
> >> - Bring back CPU 10 online
> >>
> >> Now cpuset A is still 0-9! IMO, it should have been 0-10.
> > 
> > Why is CPU 10 taken out of the cpuset to begin with?
> > 
> > The cpuset code should be fixed to work with offline CPUs as 
> > well - it can obviously not schedule to them, but otherwise 
> > it should be fine to have a wider cpuset than the hw can 
> > support.
> 
> My understanding of the code is that when a CPU is taken 
> offline, it is removed from all the cpusets and then the 
> scan_for_empty_cpusets() function is run to move tasks from 
> empty cpusets to their parent cpusets.

Why is that done that way? offlining a CPU should be an 
invariant as far as cpusets are concerned.

Not touching the cpuset would avoid the hot-replug complications 
as well.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-08  6:33   ` Srivatsa S. Bhat
  2012-02-09  7:57     ` Ingo Molnar
@ 2012-02-09 16:43     ` Peter Zijlstra
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2012-02-09 16:43 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Peter Zijlstra, paul, mingo, rjw, tj, frank.rowand, pjt, tglx,
	lizf, prashanth, paulmck, vatsa, linux-kernel, linux-pm, akpm

On Wed, Feb 08, 2012 at 12:03:50PM +0530, Srivatsa S. Bhat wrote:
> On 02/08/2012 08:52 AM, Peter Zijlstra wrote:
> 
> > On Wed, 2012-02-08 at 00:25 +0530, Srivatsa S. Bhat wrote:
> >> There is a very long standing issue related to how cpusets handle CPU
> >> hotplug events. The problem is that when a CPU goes offline, it is removed
> >> from all cpusets. However, when that CPU comes back online, it is added
> >> *only* to the root cpuset. Which means, any task attached to a cpuset lower
> >> in the hierarchy will have one CPU less in its cpuset, though it had this
> >> CPU in its cpuset before the CPU went offline.
> > 
> > Yeah so? That's known behaviour..
> 
> 
> This might be a known behaviour, but this is surely not the behaviour we
> want right? I understand that if you take a CPU offline, we have no other
> choice but to remove it from all cpusets. But if the same CPU comes back
> online and the userspace did not request any change to cpusets in between
> those events (offline-online), then is it not wrong to silently keep that
> CPU out of the cpuset even when it comes online?

no, it is what we want because unplug is destructive, it cannot be undone in generic. consider the case where you unplug allcpus of a set. 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-09 15:11         ` Ingo Molnar
@ 2012-02-10 15:52           ` Peter Zijlstra
  2012-02-10 16:53             ` Paul E. McKenney
  2012-02-11 13:39             ` Ingo Molnar
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2012-02-10 15:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Srivatsa S. Bhat, paul, rjw, tj, frank.rowand, pjt, tglx, lizf,
	prashanth, paulmck, vatsa, linux-kernel, linux-pm, akpm

On Thu, 2012-02-09 at 16:11 +0100, Ingo Molnar wrote:

> > My understanding of the code is that when a CPU is taken 
> > offline, it is removed from all the cpusets and then the 
> > scan_for_empty_cpusets() function is run to move tasks from 
> > empty cpusets to their parent cpusets.
> 
> Why is that done that way? offlining a CPU should be an 
> invariant as far as cpusets are concerned.

Can't, tasks need to run someplace. There's two choices, add a still
online cpu to the now empty cpuset or move the tasks to a parent that
still has online cpus.

Both are destructive.




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-09  8:42       ` Srivatsa S. Bhat
  2012-02-09 15:11         ` Ingo Molnar
@ 2012-02-10 15:53         ` Peter Zijlstra
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2012-02-10 15:53 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Ingo Molnar, paul, rjw, tj, frank.rowand, pjt, tglx, lizf,
	prashanth, paulmck, vatsa, linux-kernel, linux-pm, akpm

On Thu, 2012-02-09 at 14:12 +0530, Srivatsa S. Bhat wrote:
> 
> So, from my point of view, cpusets seems to be handling CPU Offline pretty
> well, and correctly too. The corresponding CPU online handling is what is
> buggy, IMHO, and that is what I intended to fix with this patchset. 

No, online is not buggy, since offline is destructive of the cpuset
state (it has to) you cannot undo it.




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-10 15:52           ` Peter Zijlstra
@ 2012-02-10 16:53             ` Paul E. McKenney
  2012-02-10 17:34               ` Peter Zijlstra
  2012-02-13 17:47               ` Srivatsa S. Bhat
  2012-02-11 13:39             ` Ingo Molnar
  1 sibling, 2 replies; 32+ messages in thread
From: Paul E. McKenney @ 2012-02-10 16:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Srivatsa S. Bhat, paul, rjw, tj, frank.rowand, pjt,
	tglx, lizf, prashanth, vatsa, linux-kernel, linux-pm, akpm

On Fri, Feb 10, 2012 at 04:52:07PM +0100, Peter Zijlstra wrote:
> On Thu, 2012-02-09 at 16:11 +0100, Ingo Molnar wrote:
> 
> > > My understanding of the code is that when a CPU is taken 
> > > offline, it is removed from all the cpusets and then the 
> > > scan_for_empty_cpusets() function is run to move tasks from 
> > > empty cpusets to their parent cpusets.
> > 
> > Why is that done that way? offlining a CPU should be an 
> > invariant as far as cpusets are concerned.
> 
> Can't, tasks need to run someplace. There's two choices, add a still
> online cpu to the now empty cpuset or move the tasks to a parent that
> still has online cpus.
> 
> Both are destructive.

OK, I will ask the stupid question...  Hey, somebody has to!  ;-)

Would it make sense for offlining the last CPU in a cpuset to be
destructive, but to allow offlining of a non-last CPU to be reversible?

For example, assume that cpuset A has CPUs 0 and 1, and cpuset B has
1, 2, and 3.  Then offlining any single CPU and then onlining it would
restore the cpusets to their original state.  Offlining both CPUs 0 and 1
would be destructive to cpuset A, so that onlining those two CPUs would
leave any tasks in cpuset A in some ancestor of cpuset A, and would
leave cpuset A with no assigned CPUs.  However, that same operation
(offlining both CPUs 0 and 1, then onlining them) would restore cpuset
B to its original state, covering CPUs 1, 2, and 3.

/me ducks.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-10 16:53             ` Paul E. McKenney
@ 2012-02-10 17:34               ` Peter Zijlstra
  2012-02-10 21:51                 ` Alan Stern
  2012-02-11 16:00                 ` [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets Paul E. McKenney
  2012-02-13 17:47               ` Srivatsa S. Bhat
  1 sibling, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2012-02-10 17:34 UTC (permalink / raw)
  To: paulmck
  Cc: Ingo Molnar, Srivatsa S. Bhat, paul, rjw, tj, frank.rowand, pjt,
	tglx, lizf, prashanth, vatsa, linux-kernel, linux-pm, akpm

On Fri, 2012-02-10 at 08:53 -0800, Paul E. McKenney wrote:
> On Fri, Feb 10, 2012 at 04:52:07PM +0100, Peter Zijlstra wrote:
> > On Thu, 2012-02-09 at 16:11 +0100, Ingo Molnar wrote:
> > 
> > > > My understanding of the code is that when a CPU is taken 
> > > > offline, it is removed from all the cpusets and then the 
> > > > scan_for_empty_cpusets() function is run to move tasks from 
> > > > empty cpusets to their parent cpusets.
> > > 
> > > Why is that done that way? offlining a CPU should be an 
> > > invariant as far as cpusets are concerned.
> > 
> > Can't, tasks need to run someplace. There's two choices, add a still
> > online cpu to the now empty cpuset or move the tasks to a parent that
> > still has online cpus.
> > 
> > Both are destructive.
> 
> OK, I will ask the stupid question...  Hey, somebody has to!  ;-)
> 
> Would it make sense for offlining the last CPU in a cpuset to be
> destructive, but to allow offlining of a non-last CPU to be reversible?

No, that's very inconsistent and will lead to way more 'surprises'. 

> /me ducks.  ;-)

/me quacks ;-)

Now the whole problem here seems to be that suspend uses cpu-hotplug to
reduce the machine to UP -- I've no clue why it does that but I can
imagine its because the BIOS calls only work on CPU0 and/or the resume
only wakes CPU0 so you have to bootstrap the SMP thing again..

Some suspend person wanna clarify? Rafael?

Anyway, the whole suspend case is magic anyway since all tasks will have
been frozen, so we could simply leave all of cpuset alone and ignore the
hotplug notifier on CPU_TASKS_FROZEN callbacks, hmm?

Do we unfreeze after we bring up the machine again?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-10 17:34               ` Peter Zijlstra
@ 2012-02-10 21:51                 ` Alan Stern
  2012-02-10 22:39                   ` Rafael J. Wysocki
  2012-02-11 16:00                 ` [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets Paul E. McKenney
  1 sibling, 1 reply; 32+ messages in thread
From: Alan Stern @ 2012-02-10 21:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, Ingo Molnar, Srivatsa S. Bhat, paul, rjw, tj,
	frank.rowand, pjt, tglx, lizf, prashanth, vatsa, linux-kernel,
	linux-pm, akpm

On Fri, 10 Feb 2012, Peter Zijlstra wrote:

> Now the whole problem here seems to be that suspend uses cpu-hotplug to
> reduce the machine to UP -- I've no clue why it does that but I can
> imagine its because the BIOS calls only work on CPU0 and/or the resume
> only wakes CPU0 so you have to bootstrap the SMP thing again..
> 
> Some suspend person wanna clarify? Rafael?

If I understand correctly, ACPI requires that only CPU0 be running when
the system is suspended (i.e., your guess is right).  Of course, this
doesn't apply to non-ACPI systems, but it's easiest to do the same
thing everywhere.

> Anyway, the whole suspend case is magic anyway since all tasks will have
> been frozen, so we could simply leave all of cpuset alone and ignore the
> hotplug notifier on CPU_TASKS_FROZEN callbacks, hmm?

I don't see why not.  Presumably no CPUs will be added or removed while 
the system is asleep.

> Do we unfreeze after we bring up the machine again?

Yes.

Alan Stern


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-10 21:51                 ` Alan Stern
@ 2012-02-10 22:39                   ` Rafael J. Wysocki
  2012-02-11  2:07                     ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2012-02-10 22:39 UTC (permalink / raw)
  To: Alan Stern
  Cc: Peter Zijlstra, paulmck, Ingo Molnar, Srivatsa S. Bhat, paul, tj,
	frank.rowand, pjt, tglx, lizf, prashanth, vatsa, linux-kernel,
	linux-pm, akpm

On Friday, February 10, 2012, Alan Stern wrote:
> On Fri, 10 Feb 2012, Peter Zijlstra wrote:
> 
> > Now the whole problem here seems to be that suspend uses cpu-hotplug to
> > reduce the machine to UP -- I've no clue why it does that but I can
> > imagine its because the BIOS calls only work on CPU0 and/or the resume
> > only wakes CPU0 so you have to bootstrap the SMP thing again..

Yes, that's the main problem.

> > Some suspend person wanna clarify? Rafael?
> 
> If I understand correctly, ACPI requires that only CPU0 be running when
> the system is suspended (i.e., your guess is right).  Of course, this
> doesn't apply to non-ACPI systems, but it's easiest to do the same
> thing everywhere.

That's correct.  Plus we have some code in the kernel (syscore_ops in
general) that assumes to be executed on one CPU with interrupts off.
Plus I'm not sure how the ACPI low-level suspend is going to behave if it's
not executed on the boot CPU.

> > Anyway, the whole suspend case is magic anyway since all tasks will have
> > been frozen, so we could simply leave all of cpuset alone and ignore the
> > hotplug notifier on CPU_TASKS_FROZEN callbacks, hmm?
> 
> I don't see why not.  Presumably no CPUs will be added or removed while 
> the system is asleep.

ACPI explicitly forbids that level of hardware reconfiguration in a sleep
state (even in S4), AFAICS.  Still, people may try to do that ...

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-10 22:39                   ` Rafael J. Wysocki
@ 2012-02-11  2:07                     ` Peter Zijlstra
  2012-02-11  4:26                       ` Srivatsa S. Bhat
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2012-02-11  2:07 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, paulmck, Ingo Molnar, Srivatsa S. Bhat, paul, tj,
	frank.rowand, pjt, tglx, lizf, prashanth, vatsa, linux-kernel,
	linux-pm, akpm

On Fri, 2012-02-10 at 23:39 +0100, Rafael J. Wysocki wrote:
> 
> > I don't see why not.  Presumably no CPUs will be added or removed while 
> > the system is asleep.
> 
> ACPI explicitly forbids that level of hardware reconfiguration in a sleep
> state (even in S4), AFAICS.  Still, people may try to do that ... 

I'm ok with breaking that :-) If its really really important to someone
we (him most likely) could fix it by detecting the topology changed over
the suspend and do a fixup. Assuming it actually gets that far.

Srivatsa, wanna give this (the proposal to not modify cpusets on
CPU_TASKS_FROZEN) a try?



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-11  2:07                     ` Peter Zijlstra
@ 2012-02-11  4:26                       ` Srivatsa S. Bhat
  2012-02-13 17:47                         ` Srivatsa S. Bhat
  0 siblings, 1 reply; 32+ messages in thread
From: Srivatsa S. Bhat @ 2012-02-11  4:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Alan Stern, paulmck, Ingo Molnar, paul, tj,
	frank.rowand, pjt, tglx, lizf, prashanth, vatsa, linux-kernel,
	linux-pm, akpm

On 02/11/2012 07:37 AM, Peter Zijlstra wrote:

> On Fri, 2012-02-10 at 23:39 +0100, Rafael J. Wysocki wrote:
>>
>>> I don't see why not.  Presumably no CPUs will be added or removed while 
>>> the system is asleep.
>>
>> ACPI explicitly forbids that level of hardware reconfiguration in a sleep
>> state (even in S4), AFAICS.  Still, people may try to do that ... 
> 
> I'm ok with breaking that :-) If its really really important to someone
> we (him most likely) could fix it by detecting the topology changed over
> the suspend and do a fixup. Assuming it actually gets that far.
> 
> Srivatsa, wanna give this (the proposal to not modify cpusets on
> CPU_TASKS_FROZEN) a try?
> 


Sure! After you pointed out that CPU Hotplug is destructive in general and
hence it is not a good idea to put back online CPUs to cpusets, the next
thing I thought of trying was a special case handling for suspend/resume
alone, IOW, not calling the cpuset update upon CPU_TASKS_FROZEN.

So, yes, I'll write up a patch for that and post it soon :-)

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-10 15:52           ` Peter Zijlstra
  2012-02-10 16:53             ` Paul E. McKenney
@ 2012-02-11 13:39             ` Ingo Molnar
  1 sibling, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2012-02-11 13:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa S. Bhat, paul, rjw, tj, frank.rowand, pjt, tglx, lizf,
	prashanth, paulmck, vatsa, linux-kernel, linux-pm, akpm


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Thu, 2012-02-09 at 16:11 +0100, Ingo Molnar wrote:
> 
> > > My understanding of the code is that when a CPU is taken 
> > > offline, it is removed from all the cpusets and then the 
> > > scan_for_empty_cpusets() function is run to move tasks from 
> > > empty cpusets to their parent cpusets.
> > 
> > Why is that done that way? offlining a CPU should be an 
> > invariant as far as cpusets are concerned.
> 
> Can't, tasks need to run someplace. There's two choices, add a 
> still online cpu to the now empty cpuset or move the tasks to 
> a parent that still has online cpus.
> 
> Both are destructive.

You aren't thinking hard enough ;-) There's several solutions 
off the top of my mind:

1) refuse the "impossible" offlining of the CPU, with a clear 
   enough error to make it actionable

2) offer a 'forced' offlinign of a CPU that will SIGTERM all 
   tasks that are on the now offline CPU and can only be there.

3) offer a 'nice' offlining variant that moves all orphan tasks
   to their or any other well-defined fallback CPU.

4) *allow* 'impossible' cpusets and just run them on CPU#0 or 
   any other natural approximation. Don't touch the cpuset!

All of these would be exception mechanisms with no need to do 
anything at hot-replug time.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-10 17:34               ` Peter Zijlstra
  2012-02-10 21:51                 ` Alan Stern
@ 2012-02-11 16:00                 ` Paul E. McKenney
  1 sibling, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2012-02-11 16:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Srivatsa S. Bhat, paul, rjw, tj, frank.rowand, pjt,
	tglx, lizf, prashanth, vatsa, linux-kernel, linux-pm, akpm

On Fri, Feb 10, 2012 at 06:34:04PM +0100, Peter Zijlstra wrote:
> On Fri, 2012-02-10 at 08:53 -0800, Paul E. McKenney wrote:
> > On Fri, Feb 10, 2012 at 04:52:07PM +0100, Peter Zijlstra wrote:
> > > On Thu, 2012-02-09 at 16:11 +0100, Ingo Molnar wrote:
> > > 
> > > > > My understanding of the code is that when a CPU is taken 
> > > > > offline, it is removed from all the cpusets and then the 
> > > > > scan_for_empty_cpusets() function is run to move tasks from 
> > > > > empty cpusets to their parent cpusets.
> > > > 
> > > > Why is that done that way? offlining a CPU should be an 
> > > > invariant as far as cpusets are concerned.
> > > 
> > > Can't, tasks need to run someplace. There's two choices, add a still
> > > online cpu to the now empty cpuset or move the tasks to a parent that
> > > still has online cpus.
> > > 
> > > Both are destructive.
> > 
> > OK, I will ask the stupid question...  Hey, somebody has to!  ;-)
> > 
> > Would it make sense for offlining the last CPU in a cpuset to be
> > destructive, but to allow offlining of a non-last CPU to be reversible?
> 
> No, that's very inconsistent and will lead to way more 'surprises'. 

It might well lead to surprises, but so does INT_MIN==-INT_MIN.  IOW,
the inconsistency certainly is a disadvantage, but it must be weighed
against the disadvantages of the current situation.

> > /me ducks.  ;-)
> 
> /me quacks ;-)
> 
> Now the whole problem here seems to be that suspend uses cpu-hotplug to
> reduce the machine to UP -- I've no clue why it does that but I can
> imagine its because the BIOS calls only work on CPU0 and/or the resume
> only wakes CPU0 so you have to bootstrap the SMP thing again..
> 
> Some suspend person wanna clarify? Rafael?
> 
> Anyway, the whole suspend case is magic anyway since all tasks will have
> been frozen, so we could simply leave all of cpuset alone and ignore the
> hotplug notifier on CPU_TASKS_FROZEN callbacks, hmm?
> 
> Do we unfreeze after we bring up the machine again?

Agreed, the suspend case is the highest priority in that losing your cpusets
after suspending and resuming is -very- surprising.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-11  4:26                       ` Srivatsa S. Bhat
@ 2012-02-13 17:47                         ` Srivatsa S. Bhat
  2012-02-17 12:15                           ` Srivatsa S. Bhat
  0 siblings, 1 reply; 32+ messages in thread
From: Srivatsa S. Bhat @ 2012-02-13 17:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Alan Stern, paulmck, Ingo Molnar, paul, tj,
	frank.rowand, pjt, tglx, lizf, prashanth, vatsa, linux-kernel,
	linux-pm, akpm

On 02/11/2012 09:56 AM, Srivatsa S. Bhat wrote:

> On 02/11/2012 07:37 AM, Peter Zijlstra wrote:
> 
>> On Fri, 2012-02-10 at 23:39 +0100, Rafael J. Wysocki wrote:
>>>
>>>> I don't see why not.  Presumably no CPUs will be added or removed while 
>>>> the system is asleep.
>>>
>>> ACPI explicitly forbids that level of hardware reconfiguration in a sleep
>>> state (even in S4), AFAICS.  Still, people may try to do that ... 
>>
>> I'm ok with breaking that :-) If its really really important to someone
>> we (him most likely) could fix it by detecting the topology changed over
>> the suspend and do a fixup. Assuming it actually gets that far.
>>
>> Srivatsa, wanna give this (the proposal to not modify cpusets on
>> CPU_TASKS_FROZEN) a try?
>>
> 
> 
> Sure! After you pointed out that CPU Hotplug is destructive in general and
> hence it is not a good idea to put back online CPUs to cpusets, the next
> thing I thought of trying was a special case handling for suspend/resume
> alone, IOW, not calling the cpuset update upon CPU_TASKS_FROZEN.
> 
> So, yes, I'll write up a patch for that and post it soon :-)
> 


Trivially removing CPU_TASKS_FROZEN as shown below doesn't look right to me:

---

 kernel/sched/core.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5255c9d..43a166e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6729,7 +6729,7 @@ int __init sched_create_sysfs_power_savings_entries(struct device *dev)
 static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
 			     void *hcpu)
 {
-	switch (action & ~CPU_TASKS_FROZEN) {
+	switch (action) {
 	case CPU_ONLINE:
 	case CPU_DOWN_FAILED:
 		cpuset_update_active_cpus();
@@ -6742,7 +6742,7 @@ static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
 static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
 			       void *hcpu)
 {
-	switch (action & ~CPU_TASKS_FROZEN) {
+	switch (action) {
 	case CPU_DOWN_PREPARE:
 		cpuset_update_active_cpus();
 		return NOTIFY_OK;


IMO, irrespective of whether we keep cpusets unaware of all CPU Hotplug or
only unaware of the CPU hotplug in the suspend/resume path, I feel the
scheduler should always know the true state of the system, ie., offline CPUs
must not be part of any sched domain, at any point in time.

At the moment, I am exploring several ways to achieve this (I can think of 2
ways at the moment, will see which one is better). But in case this approach
itself seems wrong for any reason, please let me know.

Regards,
Srivatsa S. Bhat


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-10 16:53             ` Paul E. McKenney
  2012-02-10 17:34               ` Peter Zijlstra
@ 2012-02-13 17:47               ` Srivatsa S. Bhat
  2012-02-13 20:39                 ` Paul E. McKenney
  1 sibling, 1 reply; 32+ messages in thread
From: Srivatsa S. Bhat @ 2012-02-13 17:47 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Ingo Molnar, paul, rjw, tj, frank.rowand, pjt,
	tglx, lizf, prashanth, vatsa, linux-kernel, linux-pm, akpm

On 02/10/2012 10:23 PM, Paul E. McKenney wrote:

> On Fri, Feb 10, 2012 at 04:52:07PM +0100, Peter Zijlstra wrote:
>> On Thu, 2012-02-09 at 16:11 +0100, Ingo Molnar wrote:
>>
>>>> My understanding of the code is that when a CPU is taken 
>>>> offline, it is removed from all the cpusets and then the 
>>>> scan_for_empty_cpusets() function is run to move tasks from 
>>>> empty cpusets to their parent cpusets.
>>>
>>> Why is that done that way? offlining a CPU should be an 
>>> invariant as far as cpusets are concerned.
>>
>> Can't, tasks need to run someplace. There's two choices, add a still
>> online cpu to the now empty cpuset or move the tasks to a parent that
>> still has online cpus.
>>
>> Both are destructive.
> 
> OK, I will ask the stupid question...  Hey, somebody has to!  ;-)
> 
> Would it make sense for offlining the last CPU in a cpuset to be
> destructive, but to allow offlining of a non-last CPU to be reversible?
> 
> For example, assume that cpuset A has CPUs 0 and 1, and cpuset B has
> 1, 2, and 3.  Then offlining any single CPU and then onlining it would
> restore the cpusets to their original state.  Offlining both CPUs 0 and 1
> would be destructive to cpuset A, so that onlining those two CPUs would
> leave any tasks in cpuset A in some ancestor of cpuset A, and would
> leave cpuset A with no assigned CPUs.  However, that same operation
> (offlining both CPUs 0 and 1, then onlining them) would restore cpuset
> B to its original state, covering CPUs 1, 2, and 3.
> 


But how would this scheme help us? During suspend, all non-boot CPUs are
taken offline. Which means, it would be destructive to any cpuset that
didn't originally contain CPU0 (even when using the above scheme). So, upon
resume, it is still not the same as how it was before suspend.

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-13 17:47               ` Srivatsa S. Bhat
@ 2012-02-13 20:39                 ` Paul E. McKenney
  2012-02-13 20:49                   ` Srivatsa S. Bhat
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2012-02-13 20:39 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Peter Zijlstra, Ingo Molnar, paul, rjw, tj, frank.rowand, pjt,
	tglx, lizf, prashanth, vatsa, linux-kernel, linux-pm, akpm

On Mon, Feb 13, 2012 at 11:17:53PM +0530, Srivatsa S. Bhat wrote:
> On 02/10/2012 10:23 PM, Paul E. McKenney wrote:
> 
> > On Fri, Feb 10, 2012 at 04:52:07PM +0100, Peter Zijlstra wrote:
> >> On Thu, 2012-02-09 at 16:11 +0100, Ingo Molnar wrote:
> >>
> >>>> My understanding of the code is that when a CPU is taken 
> >>>> offline, it is removed from all the cpusets and then the 
> >>>> scan_for_empty_cpusets() function is run to move tasks from 
> >>>> empty cpusets to their parent cpusets.
> >>>
> >>> Why is that done that way? offlining a CPU should be an 
> >>> invariant as far as cpusets are concerned.
> >>
> >> Can't, tasks need to run someplace. There's two choices, add a still
> >> online cpu to the now empty cpuset or move the tasks to a parent that
> >> still has online cpus.
> >>
> >> Both are destructive.
> > 
> > OK, I will ask the stupid question...  Hey, somebody has to!  ;-)
> > 
> > Would it make sense for offlining the last CPU in a cpuset to be
> > destructive, but to allow offlining of a non-last CPU to be reversible?
> > 
> > For example, assume that cpuset A has CPUs 0 and 1, and cpuset B has
> > 1, 2, and 3.  Then offlining any single CPU and then onlining it would
> > restore the cpusets to their original state.  Offlining both CPUs 0 and 1
> > would be destructive to cpuset A, so that onlining those two CPUs would
> > leave any tasks in cpuset A in some ancestor of cpuset A, and would
> > leave cpuset A with no assigned CPUs.  However, that same operation
> > (offlining both CPUs 0 and 1, then onlining them) would restore cpuset
> > B to its original state, covering CPUs 1, 2, and 3.
> 
> But how would this scheme help us? During suspend, all non-boot CPUs are
> taken offline. Which means, it would be destructive to any cpuset that
> didn't originally contain CPU0 (even when using the above scheme). So, upon
> resume, it is still not the same as how it was before suspend.

Yep, it would only help for incremental cases.  Or if all cpusets had
CPU 0 in them.  So preserving cpusets across suspend will require a
bigger hammer.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-13 20:39                 ` Paul E. McKenney
@ 2012-02-13 20:49                   ` Srivatsa S. Bhat
  0 siblings, 0 replies; 32+ messages in thread
From: Srivatsa S. Bhat @ 2012-02-13 20:49 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Ingo Molnar, paul, rjw, tj, frank.rowand, pjt,
	tglx, lizf, prashanth, vatsa, linux-kernel, linux-pm, akpm

On 02/14/2012 02:09 AM, Paul E. McKenney wrote:

> On Mon, Feb 13, 2012 at 11:17:53PM +0530, Srivatsa S. Bhat wrote:
>> On 02/10/2012 10:23 PM, Paul E. McKenney wrote:
>>
>>> On Fri, Feb 10, 2012 at 04:52:07PM +0100, Peter Zijlstra wrote:
>>>> On Thu, 2012-02-09 at 16:11 +0100, Ingo Molnar wrote:
>>>>
>>>>>> My understanding of the code is that when a CPU is taken 
>>>>>> offline, it is removed from all the cpusets and then the 
>>>>>> scan_for_empty_cpusets() function is run to move tasks from 
>>>>>> empty cpusets to their parent cpusets.
>>>>>
>>>>> Why is that done that way? offlining a CPU should be an 
>>>>> invariant as far as cpusets are concerned.
>>>>
>>>> Can't, tasks need to run someplace. There's two choices, add a still
>>>> online cpu to the now empty cpuset or move the tasks to a parent that
>>>> still has online cpus.
>>>>
>>>> Both are destructive.
>>>
>>> OK, I will ask the stupid question...  Hey, somebody has to!  ;-)
>>>
>>> Would it make sense for offlining the last CPU in a cpuset to be
>>> destructive, but to allow offlining of a non-last CPU to be reversible?
>>>
>>> For example, assume that cpuset A has CPUs 0 and 1, and cpuset B has
>>> 1, 2, and 3.  Then offlining any single CPU and then onlining it would
>>> restore the cpusets to their original state.  Offlining both CPUs 0 and 1
>>> would be destructive to cpuset A, so that onlining those two CPUs would
>>> leave any tasks in cpuset A in some ancestor of cpuset A, and would
>>> leave cpuset A with no assigned CPUs.  However, that same operation
>>> (offlining both CPUs 0 and 1, then onlining them) would restore cpuset
>>> B to its original state, covering CPUs 1, 2, and 3.
>>
>> But how would this scheme help us? During suspend, all non-boot CPUs are
>> taken offline. Which means, it would be destructive to any cpuset that
>> didn't originally contain CPU0 (even when using the above scheme). So, upon
>> resume, it is still not the same as how it was before suspend.
> 
> Yep, it would only help for incremental cases.  Or if all cpusets had
> CPU 0 in them.  So preserving cpusets across suspend will require a
> bigger hammer.
> 


Hehe ;-)

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-13 17:47                         ` Srivatsa S. Bhat
@ 2012-02-17 12:15                           ` Srivatsa S. Bhat
  2012-02-20 12:49                             ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Srivatsa S. Bhat @ 2012-02-17 12:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Alan Stern, paulmck, Ingo Molnar, paul, tj,
	frank.rowand, pjt, tglx, lizf, prashanth, vatsa, linux-kernel,
	linux-pm, akpm

On 02/13/2012 11:17 PM, Srivatsa S. Bhat wrote:

> On 02/11/2012 09:56 AM, Srivatsa S. Bhat wrote:
> 
>> On 02/11/2012 07:37 AM, Peter Zijlstra wrote:
>>
>>> On Fri, 2012-02-10 at 23:39 +0100, Rafael J. Wysocki wrote:
>>>>
>>>>> I don't see why not.  Presumably no CPUs will be added or removed while 
>>>>> the system is asleep.
>>>>
>>>> ACPI explicitly forbids that level of hardware reconfiguration in a sleep
>>>> state (even in S4), AFAICS.  Still, people may try to do that ... 
>>>
>>> I'm ok with breaking that :-) If its really really important to someone
>>> we (him most likely) could fix it by detecting the topology changed over
>>> the suspend and do a fixup. Assuming it actually gets that far.
>>>
>>> Srivatsa, wanna give this (the proposal to not modify cpusets on
>>> CPU_TASKS_FROZEN) a try?
>>>
>>
>>
>> Sure! After you pointed out that CPU Hotplug is destructive in general and
>> hence it is not a good idea to put back online CPUs to cpusets, the next
>> thing I thought of trying was a special case handling for suspend/resume
>> alone, IOW, not calling the cpuset update upon CPU_TASKS_FROZEN.
>>
>> So, yes, I'll write up a patch for that and post it soon :-)
>>
> 
> 
> Trivially removing CPU_TASKS_FROZEN as shown below doesn't look right to me:
> 
> ---
> 
>  kernel/sched/core.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5255c9d..43a166e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6729,7 +6729,7 @@ int __init sched_create_sysfs_power_savings_entries(struct device *dev)
>  static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
>  			     void *hcpu)
>  {
> -	switch (action & ~CPU_TASKS_FROZEN) {
> +	switch (action) {
>  	case CPU_ONLINE:
>  	case CPU_DOWN_FAILED:
>  		cpuset_update_active_cpus();
> @@ -6742,7 +6742,7 @@ static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
>  static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
>  			       void *hcpu)
>  {
> -	switch (action & ~CPU_TASKS_FROZEN) {
> +	switch (action) {
>  	case CPU_DOWN_PREPARE:
>  		cpuset_update_active_cpus();
>  		return NOTIFY_OK;
> 
> 
> IMO, irrespective of whether we keep cpusets unaware of all CPU Hotplug or
> only unaware of the CPU hotplug in the suspend/resume path, I feel the
> scheduler should always know the true state of the system, ie., offline CPUs
> must not be part of any sched domain, at any point in time.
> 
> At the moment, I am exploring several ways to achieve this (I can think of 2
> ways at the moment, will see which one is better). But in case this approach
> itself seems wrong for any reason, please let me know.
> 


Ok, here it goes (v2):

----
From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Subject: CPU hotplug, cpusets, suspend: Don't touch cpusets during suspend/resume

Currently, during CPU hotplug, the cpuset callbacks modify the cpusets
to reflect the state of the system, and this handling is asymmetric.
That is, upon CPU offline, that CPU is removed from all cpusets. However
when it comes back online, it is put back only to the root cpuset.

This gives rise to a significant problem during suspend/resume. During
suspend, we offline all non-boot cpus and during resume we online them back.
Which means, after a resume, all cpusets (except the root cpuset) will be
restricted to just one single CPU (the boot cpu). But the whole point of
suspend/resume is to restore the system to a state which is as close as
possible to how it was before suspend.

So to fix this, to begin with, don't touch cpusets during suspend/resume.
But cpusets are closely tied up with sched domain rebuild. But since our
intention is to only fool cpusets but not the scheduler during a suspend/
resume, we do the following:

* Towards the end of suspend, we will anyway end up with one single sched
  domain with one CPU in it. So, right from the first CPU offline, we just
  build a single sched domain, ignoring cpusets. Moreover, this won't do
  any harm since tasks would have been anyways frozen before beginning CPU
  hotplug in the suspend/resume sequence.

* Luckily suspend/resume is symmetric with respect to CPU hotplug. That is,
  exactly the same number of CPUs are taken offline and online. So, keep
  track of the number of CPUs taken offline/online, and when we detect that
  this is the last CPU online operation (ie., all CPUs onlined during resume,
  or after a suspend failure), invoke the cpusets code to rebuild the sched
  domains as per the cpusets.

Ultimately, this will not only solve the cpuset problem related to suspend
resume (ie., restores the cpusets to exactly what it was before suspend, by
not touching it at all) but also speeds up suspend/resume because we avoid
running cpuset update code for every CPU being offlined/onlined.

Related bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=714271

Reported-by: Prashanth K. Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/sched/core.c |   40 ++++++++++++++++++++++++++++++++++++----
 1 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5255c9d..e2b4e2b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6721,34 +6721,66 @@ int __init sched_create_sysfs_power_savings_entries(struct device *dev)
 }
 #endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
 
+static int num_cpus_frozen;
+
 /*
  * Update cpusets according to cpu_active mask.  If cpusets are
  * disabled, cpuset_update_active_cpus() becomes a simple wrapper
  * around partition_sched_domains().
+ *
+ * If we come here as part of a suspend/resume, don't touch cpusets because we
+ * want to restore it back to its original state upon resume anyways.
  */
 static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
 			     void *hcpu)
 {
-	switch (action & ~CPU_TASKS_FROZEN) {
+	switch (action) {
+	case CPU_ONLINE_FROZEN:
+	case CPU_DOWN_FAILED_FROZEN:
+	case CPU_UP_CANCELED_FROZEN:
+		/*
+		 * num_cpus_frozen tracks how many CPUs are involved in suspend
+		 * resume sequence. As long as we are in an intermediate state
+		 * (some more CPUs online/offline is necessary to get system
+		 * back to original state), we just build a single sched domain.
+		 */
+		num_cpus_frozen--;
+		if (num_cpus_frozen != 0) {
+			/*
+			 * This is not the last CPU to come online. More
+			 * CPU online events are yet to come.
+			 */
+			partition_sched_domains(1, NULL, NULL);
+			break;
+		}
+
+		/* Fall through, if this is the last CPU to come online */
+
 	case CPU_ONLINE:
 	case CPU_DOWN_FAILED:
 		cpuset_update_active_cpus();
-		return NOTIFY_OK;
+		break;
 	default:
 		return NOTIFY_DONE;
 	}
+	return NOTIFY_OK;
 }
 
 static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
 			       void *hcpu)
 {
-	switch (action & ~CPU_TASKS_FROZEN) {
+	switch (action) {
 	case CPU_DOWN_PREPARE:
 		cpuset_update_active_cpus();
-		return NOTIFY_OK;
+		break;
+	case CPU_DOWN_PREPARE_FROZEN:
+		num_cpus_frozen++;
+		partition_sched_domains(1, NULL, NULL);
+		break;
 	default:
 		return NOTIFY_DONE;
 	}
+	return NOTIFY_OK;
 }
 
 void __init sched_init_smp(void)



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-17 12:15                           ` Srivatsa S. Bhat
@ 2012-02-20 12:49                             ` Peter Zijlstra
  2012-02-20 12:59                               ` Srivatsa S. Bhat
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2012-02-20 12:49 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Rafael J. Wysocki, Alan Stern, paulmck, Ingo Molnar, paul, tj,
	frank.rowand, pjt, tglx, lizf, prashanth, vatsa, linux-kernel,
	linux-pm, akpm

On Fri, 2012-02-17 at 17:45 +0530, Srivatsa S. Bhat wrote:

> > Trivially removing CPU_TASKS_FROZEN as shown below doesn't look right to me:
> > 
> > ---
> > 
> >  kernel/sched/core.c |    4 ++--
> >  1 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 5255c9d..43a166e 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6729,7 +6729,7 @@ int __init sched_create_sysfs_power_savings_entries(struct device *dev)
> >  static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
> >  			     void *hcpu)
> >  {
> > -	switch (action & ~CPU_TASKS_FROZEN) {
> > +	switch (action) {
> >  	case CPU_ONLINE:
> >  	case CPU_DOWN_FAILED:
> >  		cpuset_update_active_cpus();
> > @@ -6742,7 +6742,7 @@ static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
> >  static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
> >  			       void *hcpu)
> >  {
> > -	switch (action & ~CPU_TASKS_FROZEN) {
> > +	switch (action) {
> >  	case CPU_DOWN_PREPARE:
> >  		cpuset_update_active_cpus();
> >  		return NOTIFY_OK;
> > 
> > 
> > IMO, irrespective of whether we keep cpusets unaware of all CPU Hotplug or
> > only unaware of the CPU hotplug in the suspend/resume path, I feel the
> > scheduler should always know the true state of the system, ie., offline CPUs
> > must not be part of any sched domain, at any point in time.

That's really not a problem as long as they're not in the active mask.

> > At the moment, I am exploring several ways to achieve this (I can think of 2
> > ways at the moment, will see which one is better). But in case this approach
> > itself seems wrong for any reason, please let me know.


Have you actually tried the simple patch? 

Calling partition_sched_domains() like you do doesn't seem right, it
completely ignores cpusets, it will make certain cpuset configurations
mis-behave.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-20 12:49                             ` Peter Zijlstra
@ 2012-02-20 12:59                               ` Srivatsa S. Bhat
  2012-02-23  9:57                                 ` Srivatsa S. Bhat
  0 siblings, 1 reply; 32+ messages in thread
From: Srivatsa S. Bhat @ 2012-02-20 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Alan Stern, paulmck, Ingo Molnar, paul, tj,
	frank.rowand, pjt, tglx, lizf, prashanth, vatsa, linux-kernel,
	linux-pm, akpm

Hi Peter,

On 02/20/2012 06:19 PM, Peter Zijlstra wrote:

> On Fri, 2012-02-17 at 17:45 +0530, Srivatsa S. Bhat wrote:
> 
>>> Trivially removing CPU_TASKS_FROZEN as shown below doesn't look right to me:
>>>
>>> ---
>>>
>>>  kernel/sched/core.c |    4 ++--
>>>  1 files changed, 2 insertions(+), 2 deletions(-)
>>>
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 5255c9d..43a166e 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -6729,7 +6729,7 @@ int __init sched_create_sysfs_power_savings_entries(struct device *dev)
>>>  static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
>>>  			     void *hcpu)
>>>  {
>>> -	switch (action & ~CPU_TASKS_FROZEN) {
>>> +	switch (action) {
>>>  	case CPU_ONLINE:
>>>  	case CPU_DOWN_FAILED:
>>>  		cpuset_update_active_cpus();
>>> @@ -6742,7 +6742,7 @@ static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
>>>  static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
>>>  			       void *hcpu)
>>>  {
>>> -	switch (action & ~CPU_TASKS_FROZEN) {
>>> +	switch (action) {
>>>  	case CPU_DOWN_PREPARE:
>>>  		cpuset_update_active_cpus();
>>>  		return NOTIFY_OK;
>>>
>>>
>>> IMO, irrespective of whether we keep cpusets unaware of all CPU Hotplug or
>>> only unaware of the CPU hotplug in the suspend/resume path, I feel the
>>> scheduler should always know the true state of the system, ie., offline CPUs
>>> must not be part of any sched domain, at any point in time.
> 
> That's really not a problem as long as they're not in the active mask.
> 


Ok..

>>> At the moment, I am exploring several ways to achieve this (I can think of 2
>>> ways at the moment, will see which one is better). But in case this approach
>>> itself seems wrong for any reason, please let me know.
> 
> 
> Have you actually tried the simple patch? 
> 


Yes, I tried it out and it fixed the suspend/resume bug. I was just unsure that it
was the right fix, because the scheduler's sched domains wouldn't be up-to-date,
unintentionally.

> Calling partition_sched_domains() like you do doesn't seem right, it
> completely ignores cpusets, it will make certain cpuset configurations
> mis-behave.
> 


Well, why do we care about cpuset configurations when we are in the midst of
suspend/resume? Anyway all tasks are frozen during suspend, and when we resume,
we restore everything (including cpusets and sched domains) to exactly how it
was before suspend.. So that must be fine right?

(I am referring to the patch posted at https://lkml.org/lkml/2012/2/17/125).

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-20 12:59                               ` Srivatsa S. Bhat
@ 2012-02-23  9:57                                 ` Srivatsa S. Bhat
  2012-02-24 23:24                                   ` Rafael J. Wysocki
                                                     ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Srivatsa S. Bhat @ 2012-02-23  9:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Alan Stern, paulmck, Ingo Molnar, paul, tj,
	frank.rowand, pjt, tglx, lizf, prashanth, vatsa, linux-kernel,
	linux-pm, akpm

On 02/20/2012 06:29 PM, Srivatsa S. Bhat wrote:

> Hi Peter,
> 
> On 02/20/2012 06:19 PM, Peter Zijlstra wrote:
> 
>> On Fri, 2012-02-17 at 17:45 +0530, Srivatsa S. Bhat wrote:
>>
>>>> Trivially removing CPU_TASKS_FROZEN as shown below doesn't look right to me:
>>>>
>>>> ---
>>>>
>>>>  kernel/sched/core.c |    4 ++--
>>>>  1 files changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 5255c9d..43a166e 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -6729,7 +6729,7 @@ int __init sched_create_sysfs_power_savings_entries(struct device *dev)
>>>>  static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
>>>>  			     void *hcpu)
>>>>  {
>>>> -	switch (action & ~CPU_TASKS_FROZEN) {
>>>> +	switch (action) {
>>>>  	case CPU_ONLINE:
>>>>  	case CPU_DOWN_FAILED:
>>>>  		cpuset_update_active_cpus();
>>>> @@ -6742,7 +6742,7 @@ static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
>>>>  static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
>>>>  			       void *hcpu)
>>>>  {
>>>> -	switch (action & ~CPU_TASKS_FROZEN) {
>>>> +	switch (action) {
>>>>  	case CPU_DOWN_PREPARE:
>>>>  		cpuset_update_active_cpus();
>>>>  		return NOTIFY_OK;
>>>>
>>>>
>>>> IMO, irrespective of whether we keep cpusets unaware of all CPU Hotplug or
>>>> only unaware of the CPU hotplug in the suspend/resume path, I feel the
>>>> scheduler should always know the true state of the system, ie., offline CPUs
>>>> must not be part of any sched domain, at any point in time.
>>
>> That's really not a problem as long as they're not in the active mask.
>>


[...]

So, based on what you said above, I guess we can go with that simple patch.
(See below, for the patch with changelog).

I thought about what Ingo suggested (ie., not touching cpusets during cpu
hotplug, irrespective of whether it is part of suspend or not). And we can
implement that by having a scheme something like:

o Currently if a cpuset's cpus_allowed mask becomes empty due to CPU offline,
  all tasks in that cpuset is moved to a parent cpuset whose cpus_allowed mask
  is non-empty.
  Here, instead of *moving* the tasks to another cpuset, we could just change
  the cpus_allowed mask of each task in that cpuset to reflect the non-empty
  parent cpuset's cpus_allowed mask. IOW, during a CPU offline, we never touch
  a cpuset's cpus_allowed mask, we only modify the cpus_allowed mask of the
  *tasks* in that cpuset. Also, we never move a task from one cpuset to another
  due to CPU offline.

o Since we never modify a cpuset's cpus_allowed mask due to CPU offline, it is
  trivial to get back to original state when that CPU comes back online. Just
  compare the cpuset's cpus_allowed mask with cpu_active_mask and update the
  cpus_allowed masks of all the tasks in that cpuset.

We can definitely do all this, but I am not quite sure if this complexity is
justified (ie., complexity in the sense that the cpus_allowed mask of the tasks
in a cpuset might not always be the same as the cpus_allowed mask of that
cpuset).

However, if somebody feels that the above mentioned approach looks good and
the complexity is justified, please let me know.. But until then, the
following simple fix for the suspend/resume bug should suffice.

----

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Subject: CPU hotplug, cpusets, suspend: Don't touch cpusets during suspend/resume

Currently, during CPU hotplug, the cpuset callbacks modify the cpusets
to reflect the state of the system, and this handling is asymmetric.
That is, upon CPU offline, that CPU is removed from all cpusets. However
when it comes back online, it is put back only to the root cpuset.

This gives rise to a significant problem during suspend/resume. During
suspend, we offline all non-boot cpus and during resume we online them back.
Which means, after a resume, all cpusets (except the root cpuset) will be
restricted to just one single CPU (the boot cpu). But the whole point of
suspend/resume is to restore the system to a state which is as close as
possible to how it was before suspend.

So to fix this, don't touch cpusets during suspend/resume. That is, modify
the cpuset-related CPU hotplug callback to just ignore CPU hotplug when it
is initiated as part of the suspend/resume sequence.

Reported-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
---

 kernel/sched/core.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1169246..49ba9d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6728,7 +6728,7 @@ int __init sched_create_sysfs_power_savings_entries(struct device *dev)
 static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
 			     void *hcpu)
 {
-	switch (action & ~CPU_TASKS_FROZEN) {
+	switch (action) {
 	case CPU_ONLINE:
 	case CPU_DOWN_FAILED:
 		cpuset_update_active_cpus();
@@ -6741,7 +6741,7 @@ static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
 static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
 			       void *hcpu)
 {
-	switch (action & ~CPU_TASKS_FROZEN) {
+	switch (action) {
 	case CPU_DOWN_PREPARE:
 		cpuset_update_active_cpus();
 		return NOTIFY_OK;



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-23  9:57                                 ` Srivatsa S. Bhat
@ 2012-02-24 23:24                                   ` Rafael J. Wysocki
  2012-02-27 10:18                                   ` Peter Zijlstra
  2012-02-27 12:09                                   ` [tip:sched/urgent] CPU hotplug, cpusets, suspend: Don' t touch cpusets during suspend/resume tip-bot for Srivatsa S. Bhat
  2 siblings, 0 replies; 32+ messages in thread
From: Rafael J. Wysocki @ 2012-02-24 23:24 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Peter Zijlstra, Alan Stern, paulmck, Ingo Molnar, paul, tj,
	frank.rowand, pjt, tglx, lizf, prashanth, vatsa, linux-kernel,
	linux-pm, akpm

On Thursday, February 23, 2012, Srivatsa S. Bhat wrote:
> On 02/20/2012 06:29 PM, Srivatsa S. Bhat wrote:
> 
> > Hi Peter,
> > 
> > On 02/20/2012 06:19 PM, Peter Zijlstra wrote:
> > 
> >> On Fri, 2012-02-17 at 17:45 +0530, Srivatsa S. Bhat wrote:
> >>
> >>>> Trivially removing CPU_TASKS_FROZEN as shown below doesn't look right to me:
> >>>>
> >>>> ---
> >>>>
> >>>>  kernel/sched/core.c |    4 ++--
> >>>>  1 files changed, 2 insertions(+), 2 deletions(-)
> >>>>
> >>>>
> >>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >>>> index 5255c9d..43a166e 100644
> >>>> --- a/kernel/sched/core.c
> >>>> +++ b/kernel/sched/core.c
> >>>> @@ -6729,7 +6729,7 @@ int __init sched_create_sysfs_power_savings_entries(struct device *dev)
> >>>>  static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
> >>>>  			     void *hcpu)
> >>>>  {
> >>>> -	switch (action & ~CPU_TASKS_FROZEN) {
> >>>> +	switch (action) {
> >>>>  	case CPU_ONLINE:
> >>>>  	case CPU_DOWN_FAILED:
> >>>>  		cpuset_update_active_cpus();
> >>>> @@ -6742,7 +6742,7 @@ static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
> >>>>  static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
> >>>>  			       void *hcpu)
> >>>>  {
> >>>> -	switch (action & ~CPU_TASKS_FROZEN) {
> >>>> +	switch (action) {
> >>>>  	case CPU_DOWN_PREPARE:
> >>>>  		cpuset_update_active_cpus();
> >>>>  		return NOTIFY_OK;
> >>>>
> >>>>
> >>>> IMO, irrespective of whether we keep cpusets unaware of all CPU Hotplug or
> >>>> only unaware of the CPU hotplug in the suspend/resume path, I feel the
> >>>> scheduler should always know the true state of the system, ie., offline CPUs
> >>>> must not be part of any sched domain, at any point in time.
> >>
> >> That's really not a problem as long as they're not in the active mask.
> >>
> 
> 
> [...]
> 
> So, based on what you said above, I guess we can go with that simple patch.
> (See below, for the patch with changelog).
> 
> I thought about what Ingo suggested (ie., not touching cpusets during cpu
> hotplug, irrespective of whether it is part of suspend or not). And we can
> implement that by having a scheme something like:
> 
> o Currently if a cpuset's cpus_allowed mask becomes empty due to CPU offline,
>   all tasks in that cpuset is moved to a parent cpuset whose cpus_allowed mask
>   is non-empty.
>   Here, instead of *moving* the tasks to another cpuset, we could just change
>   the cpus_allowed mask of each task in that cpuset to reflect the non-empty
>   parent cpuset's cpus_allowed mask. IOW, during a CPU offline, we never touch
>   a cpuset's cpus_allowed mask, we only modify the cpus_allowed mask of the
>   *tasks* in that cpuset. Also, we never move a task from one cpuset to another
>   due to CPU offline.
> 
> o Since we never modify a cpuset's cpus_allowed mask due to CPU offline, it is
>   trivial to get back to original state when that CPU comes back online. Just
>   compare the cpuset's cpus_allowed mask with cpu_active_mask and update the
>   cpus_allowed masks of all the tasks in that cpuset.
> 
> We can definitely do all this, but I am not quite sure if this complexity is
> justified (ie., complexity in the sense that the cpus_allowed mask of the tasks
> in a cpuset might not always be the same as the cpus_allowed mask of that
> cpuset).
> 
> However, if somebody feels that the above mentioned approach looks good and
> the complexity is justified, please let me know.. But until then, the
> following simple fix for the suspend/resume bug should suffice.
> 
> ----
> 
> From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> Subject: CPU hotplug, cpusets, suspend: Don't touch cpusets during suspend/resume
> 
> Currently, during CPU hotplug, the cpuset callbacks modify the cpusets
> to reflect the state of the system, and this handling is asymmetric.
> That is, upon CPU offline, that CPU is removed from all cpusets. However
> when it comes back online, it is put back only to the root cpuset.
> 
> This gives rise to a significant problem during suspend/resume. During
> suspend, we offline all non-boot cpus and during resume we online them back.
> Which means, after a resume, all cpusets (except the root cpuset) will be
> restricted to just one single CPU (the boot cpu). But the whole point of
> suspend/resume is to restore the system to a state which is as close as
> possible to how it was before suspend.
> 
> So to fix this, don't touch cpusets during suspend/resume. That is, modify
> the cpuset-related CPU hotplug callback to just ignore CPU hotplug when it
> is initiated as part of the suspend/resume sequence.
> 
> Reported-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> Cc: stable@vger.kernel.org

So I wonder what people think about this patch.  Are there any objections?
If not, I'd like to take it for v3.4 until there's a better fix.

Thanks,
Rafael


> ---
> 
>  kernel/sched/core.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1169246..49ba9d4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6728,7 +6728,7 @@ int __init sched_create_sysfs_power_savings_entries(struct device *dev)
>  static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
>  			     void *hcpu)
>  {
> -	switch (action & ~CPU_TASKS_FROZEN) {
> +	switch (action) {
>  	case CPU_ONLINE:
>  	case CPU_DOWN_FAILED:
>  		cpuset_update_active_cpus();
> @@ -6741,7 +6741,7 @@ static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
>  static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
>  			       void *hcpu)
>  {
> -	switch (action & ~CPU_TASKS_FROZEN) {
> +	switch (action) {
>  	case CPU_DOWN_PREPARE:
>  		cpuset_update_active_cpus();
>  		return NOTIFY_OK;
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets
  2012-02-23  9:57                                 ` Srivatsa S. Bhat
  2012-02-24 23:24                                   ` Rafael J. Wysocki
@ 2012-02-27 10:18                                   ` Peter Zijlstra
  2012-02-27 12:09                                   ` [tip:sched/urgent] CPU hotplug, cpusets, suspend: Don' t touch cpusets during suspend/resume tip-bot for Srivatsa S. Bhat
  2 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2012-02-27 10:18 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Rafael J. Wysocki, Alan Stern, paulmck, Ingo Molnar, paul, tj,
	frank.rowand, pjt, tglx, lizf, prashanth, vatsa, linux-kernel,
	linux-pm, akpm

On Thu, 2012-02-23 at 15:27 +0530, Srivatsa S. Bhat wrote:
> From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> Subject: CPU hotplug, cpusets, suspend: Don't touch cpusets during suspend/resume
> 
> Currently, during CPU hotplug, the cpuset callbacks modify the cpusets
> to reflect the state of the system, and this handling is asymmetric.
> That is, upon CPU offline, that CPU is removed from all cpusets. However
> when it comes back online, it is put back only to the root cpuset.
> 
> This gives rise to a significant problem during suspend/resume. During
> suspend, we offline all non-boot cpus and during resume we online them back.
> Which means, after a resume, all cpusets (except the root cpuset) will be
> restricted to just one single CPU (the boot cpu). But the whole point of
> suspend/resume is to restore the system to a state which is as close as
> possible to how it was before suspend.
> 
> So to fix this, don't touch cpusets during suspend/resume. That is, modify
> the cpuset-related CPU hotplug callback to just ignore CPU hotplug when it
> is initiated as part of the suspend/resume sequence.
> 
> Reported-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> Cc: stable@vger.kernel.org

Fair enough, I'll take it and push it through sched/urgent so that it
should still make 3.4.

> ---
> 
>  kernel/sched/core.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1169246..49ba9d4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6728,7 +6728,7 @@ int __init sched_create_sysfs_power_savings_entries(struct device *dev)
>  static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
>                              void *hcpu)
>  {
> -       switch (action & ~CPU_TASKS_FROZEN) {
> +       switch (action) {
>         case CPU_ONLINE:
>         case CPU_DOWN_FAILED:
>                 cpuset_update_active_cpus();
> @@ -6741,7 +6741,7 @@ static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
>  static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
>                                void *hcpu)
>  {
> -       switch (action & ~CPU_TASKS_FROZEN) {
> +       switch (action) {
>         case CPU_DOWN_PREPARE:
>                 cpuset_update_active_cpus();
>                 return NOTIFY_OK;
> 
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [tip:sched/urgent] CPU hotplug, cpusets, suspend: Don' t touch cpusets during suspend/resume
  2012-02-23  9:57                                 ` Srivatsa S. Bhat
  2012-02-24 23:24                                   ` Rafael J. Wysocki
  2012-02-27 10:18                                   ` Peter Zijlstra
@ 2012-02-27 12:09                                   ` tip-bot for Srivatsa S. Bhat
  2 siblings, 0 replies; 32+ messages in thread
From: tip-bot for Srivatsa S. Bhat @ 2012-02-27 12:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, torvalds, srivatsa.bhat,
	prashanth, akpm, tglx, mingo

Commit-ID:  8f2f748b0656257153bcf0941df8d6060acc5ca6
Gitweb:     http://git.kernel.org/tip/8f2f748b0656257153bcf0941df8d6060acc5ca6
Author:     Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
AuthorDate: Thu, 23 Feb 2012 15:27:15 +0530
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 27 Feb 2012 11:38:13 +0100

CPU hotplug, cpusets, suspend: Don't touch cpusets during suspend/resume

Currently, during CPU hotplug, the cpuset callbacks modify the cpusets
to reflect the state of the system, and this handling is asymmetric.
That is, upon CPU offline, that CPU is removed from all cpusets. However
when it comes back online, it is put back only to the root cpuset.

This gives rise to a significant problem during suspend/resume. During
suspend, we offline all non-boot cpus and during resume we online them back.
Which means, after a resume, all cpusets (except the root cpuset) will be
restricted to just one single CPU (the boot cpu). But the whole point of
suspend/resume is to restore the system to a state which is as close as
possible to how it was before suspend.

So to fix this, don't touch cpusets during suspend/resume. That is, modify
the cpuset-related CPU hotplug callback to just ignore CPU hotplug when it
is initiated as part of the suspend/resume sequence.

Reported-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/4F460D7B.1020703@linux.vnet.ibm.com
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched/core.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b342f57..33a0676 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6728,7 +6728,7 @@ int __init sched_create_sysfs_power_savings_entries(struct device *dev)
 static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
 			     void *hcpu)
 {
-	switch (action & ~CPU_TASKS_FROZEN) {
+	switch (action) {
 	case CPU_ONLINE:
 	case CPU_DOWN_FAILED:
 		cpuset_update_active_cpus();
@@ -6741,7 +6741,7 @@ static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
 static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
 			       void *hcpu)
 {
-	switch (action & ~CPU_TASKS_FROZEN) {
+	switch (action) {
 	case CPU_DOWN_PREPARE:
 		cpuset_update_active_cpus();
 		return NOTIFY_OK;

^ permalink raw reply related	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2012-02-27 12:10 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-07 18:55 [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets Srivatsa S. Bhat
2012-02-07 18:56 ` [PATCH 1/4] CPU hotplug, cpuset: Maintain a copy of the cpus_allowed mask before CPU hotplug Srivatsa S. Bhat
2012-02-07 18:56 ` [PATCH 2/4] cpuset: Split up update_cpumask() so that its functionality can be reused Srivatsa S. Bhat
2012-02-07 18:57 ` [PATCH 3/4] cpuset: Add function to introduce CPUs to cpusets during CPU online Srivatsa S. Bhat
2012-02-07 18:57 ` [PATCH 4/4] CPU hotplug, cpusets: Differentiate the CPU online and CPU offline callbacks Srivatsa S. Bhat
2012-02-08  3:22 ` [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets Peter Zijlstra
2012-02-08  6:33   ` Srivatsa S. Bhat
2012-02-09  7:57     ` Ingo Molnar
2012-02-09  8:42       ` Srivatsa S. Bhat
2012-02-09 15:11         ` Ingo Molnar
2012-02-10 15:52           ` Peter Zijlstra
2012-02-10 16:53             ` Paul E. McKenney
2012-02-10 17:34               ` Peter Zijlstra
2012-02-10 21:51                 ` Alan Stern
2012-02-10 22:39                   ` Rafael J. Wysocki
2012-02-11  2:07                     ` Peter Zijlstra
2012-02-11  4:26                       ` Srivatsa S. Bhat
2012-02-13 17:47                         ` Srivatsa S. Bhat
2012-02-17 12:15                           ` Srivatsa S. Bhat
2012-02-20 12:49                             ` Peter Zijlstra
2012-02-20 12:59                               ` Srivatsa S. Bhat
2012-02-23  9:57                                 ` Srivatsa S. Bhat
2012-02-24 23:24                                   ` Rafael J. Wysocki
2012-02-27 10:18                                   ` Peter Zijlstra
2012-02-27 12:09                                   ` [tip:sched/urgent] CPU hotplug, cpusets, suspend: Don' t touch cpusets during suspend/resume tip-bot for Srivatsa S. Bhat
2012-02-11 16:00                 ` [PATCH 0/4] CPU hotplug, cpusets: Fix CPU online handling related to cpusets Paul E. McKenney
2012-02-13 17:47               ` Srivatsa S. Bhat
2012-02-13 20:39                 ` Paul E. McKenney
2012-02-13 20:49                   ` Srivatsa S. Bhat
2012-02-11 13:39             ` Ingo Molnar
2012-02-10 15:53         ` Peter Zijlstra
2012-02-09 16:43     ` Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.