linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/5] CPU hotplug, cpusets, suspend/resume: Fixes, cleanups and optimizations
@ 2012-05-17 16:59 Srivatsa S. Bhat
  2012-05-17 16:59 ` [PATCH v5 1/5] CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume Srivatsa S. Bhat
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Srivatsa S. Bhat @ 2012-05-17 16:59 UTC (permalink / raw)
  To: a.p.zijlstra, mingo, pjt, paul, akpm
  Cc: rjw, nacc, rientjes, paulmck, tglx, seto.hidetoshi, tj, mschmidt,
	berrange, nikunj, vatsa, liuj97, linux-kernel, linux-pm,
	srivatsa.bhat

Currently the kernel doesn't handle cpusets properly during suspend/resume.
After a resume, all non-root cpusets end up having only 1 cpu (the boot cpu),
causing massive performance degradation of workloads. One major user of cpusets
is libvirt, which means that after a suspend/hibernation cycle, all VMs
suddenly end up running terribly slow!

Also, the kernel moves the tasks from one cpuset to another during CPU hotplug
in the suspend/resume path, leading to a task-management nightmare after
resume.

Patch 1 fixes this by keeping cpusets unmodified in the suspend/resume path.
But to ensure we don't trip over, it keeps the sched domains updated during
every CPU hotplug in the s/r path.
This is a long standing issue and we need to fix up stable kernels too.

The rest of the patches in the series are mostly cleanups/optimizations
(except patch 4 which is a non-critical bug fix).

--
 Srivatsa S. Bhat (5):
      CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume
      cpusets, hotplug: Implement cpuset tree traversal in a helper function
      cpusets, hotplug: Restructure functions that are invoked during hotplug
      cpusets: Update tasks' cpus_allowed mask upon updates to root cpuset
      cpusets: Remove/update outdated comments


  include/linux/cpuset.h |    4 +
 kernel/cpuset.c        |  161 +++++++++++++++++++++++++++++++++++-------------
 kernel/sched/core.c    |   44 +++++++++++--
 3 files changed, 159 insertions(+), 50 deletions(-)



Thanks,
Srivatsa S. Bhat
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v5 1/5] CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume
  2012-05-17 16:59 [PATCH v5 0/5] CPU hotplug, cpusets, suspend/resume: Fixes, cleanups and optimizations Srivatsa S. Bhat
@ 2012-05-17 16:59 ` Srivatsa S. Bhat
  2012-05-17 17:00 ` [PATCH v5 2/5] cpusets, hotplug: Implement cpuset tree traversal in a helper function Srivatsa S. Bhat
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Srivatsa S. Bhat @ 2012-05-17 16:59 UTC (permalink / raw)
  To: a.p.zijlstra, mingo, pjt, paul, akpm
  Cc: rjw, nacc, rientjes, paulmck, tglx, seto.hidetoshi, tj, mschmidt,
	berrange, nikunj, vatsa, liuj97, linux-kernel, linux-pm,
	srivatsa.bhat

In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
masks as and when necessary to ensure that the tasks belonging to the cpusets
have some place (online CPUs) to run on. And regular CPU hotplug is
destructive in the sense that the kernel doesn't remember the original cpuset
configurations set by the user, across hotplug operations.

However, suspend/resume (which uses CPU hotplug) is a special case in which
the kernel has the responsibility to restore the system (during resume), to
exactly the same state it was in before suspend.

In order to achieve that, do the following:

1. Don't modify cpusets during suspend/resume. At all.
   In particular, don't move the tasks from one cpuset to another, and
   don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
   during the CPU hotplug operations that are carried out in the
   suspend/resume path.

2. However, cpusets and sched domains are related. We just want to avoid
   altering cpusets alone. So, to keep the sched domains updated, build
   a single sched domain (containing all active cpus) during each of the
   CPU hotplug operations carried out in s/r path, effectively ignoring
   the cpusets' cpus_allowed masks.

   (Since userspace is frozen while doing all this, it will go unnoticed.)

3. During the last CPU online operation during resume, build the sched
   domains by looking up the (unaltered) cpusets' cpus_allowed masks.
   That will bring back the system to the same original state as it was in
   before suspend.

Ultimately, this will not only solve the cpuset problem related to suspend
resume (ie., restores the cpusets to exactly what it was before suspend, by
not touching it at all) but also speeds up suspend/resume because we avoid
running cpuset update code for every CPU being offlined/onlined.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
---

 kernel/cpuset.c     |    3 +++
 kernel/sched/core.c |   40 ++++++++++++++++++++++++++++++++++++----
 2 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 14f7070..5fc1570 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2065,6 +2065,9 @@ static void scan_for_empty_cpusets(struct cpuset *root)
  * (of no affect) on systems that are actively using CPU hotplug
  * but making no active use of cpusets.
  *
+ * The only exception to this is suspend/resume, where we don't
+ * modify cpusets at all.
+ *
  * This routine ensures that top_cpuset.cpus_allowed tracks
  * cpu_active_mask on each CPU hotplug (cpuhp) event.
  *
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0533a68..9c6ce0c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6801,34 +6801,66 @@ int __init sched_create_sysfs_power_savings_entries(struct device *dev)
 }
 #endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
 
+static int num_cpus_frozen;	/* used to mark begin/end of suspend/resume */
+
 /*
  * Update cpusets according to cpu_active mask.  If cpusets are
  * disabled, cpuset_update_active_cpus() becomes a simple wrapper
  * around partition_sched_domains().
+ *
+ * If we come here as part of a suspend/resume, don't touch cpusets because we
+ * want to restore it back to its original state upon resume anyway.
  */
 static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
 			     void *hcpu)
 {
-	switch (action & ~CPU_TASKS_FROZEN) {
+	switch (action) {
+	case CPU_ONLINE_FROZEN:
+	case CPU_DOWN_FAILED_FROZEN:
+
+		/*
+		 * num_cpus_frozen tracks how many CPUs are involved in suspend
+		 * resume sequence. As long as this is not the last online
+		 * operation in the resume sequence, just build a single sched
+		 * domain, ignoring cpusets.
+		 */
+		num_cpus_frozen--;
+		if (likely(num_cpus_frozen)) {
+			partition_sched_domains(1, NULL, NULL);
+			break;
+		}
+
+		/*
+		 * This is the last CPU online operation. So fall through and
+		 * restore the original sched domains by considering the
+		 * cpuset configurations.
+		 */
+
 	case CPU_ONLINE:
 	case CPU_DOWN_FAILED:
 		cpuset_update_active_cpus();
-		return NOTIFY_OK;
+		break;
 	default:
 		return NOTIFY_DONE;
 	}
+	return NOTIFY_OK;
 }
 
 static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
 			       void *hcpu)
 {
-	switch (action & ~CPU_TASKS_FROZEN) {
+	switch (action) {
 	case CPU_DOWN_PREPARE:
 		cpuset_update_active_cpus();
-		return NOTIFY_OK;
+		break;
+	case CPU_DOWN_PREPARE_FROZEN:
+		num_cpus_frozen++;
+		partition_sched_domains(1, NULL, NULL);
+		break;
 	default:
 		return NOTIFY_DONE;
 	}
+	return NOTIFY_OK;
 }
 
 void __init sched_init_smp(void)


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v5 2/5] cpusets, hotplug: Implement cpuset tree traversal in a helper function
  2012-05-17 16:59 [PATCH v5 0/5] CPU hotplug, cpusets, suspend/resume: Fixes, cleanups and optimizations Srivatsa S. Bhat
  2012-05-17 16:59 ` [PATCH v5 1/5] CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume Srivatsa S. Bhat
@ 2012-05-17 17:00 ` Srivatsa S. Bhat
  2012-05-17 17:00 ` [PATCH v5 3/5] cpusets, hotplug: Restructure functions that are invoked during hotplug Srivatsa S. Bhat
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Srivatsa S. Bhat @ 2012-05-17 17:00 UTC (permalink / raw)
  To: a.p.zijlstra, mingo, pjt, paul, akpm
  Cc: rjw, nacc, rientjes, paulmck, tglx, seto.hidetoshi, tj, mschmidt,
	berrange, nikunj, vatsa, liuj97, linux-kernel, linux-pm,
	srivatsa.bhat

At present, the functions that deal with cpusets during CPU/Mem hotplug
are quite messy, since a lot of the functionality is mixed up without clear
separation. And this takes a toll on optimization as well. For example,
the function cpuset_update_active_cpus() is called on both CPU offline and CPU
online events; and it invokes scan_for_empty_cpusets(), which makes sense
only for CPU offline events. And hence, the current code ends up unnecessarily
traversing the cpuset tree during CPU online also.

As a first step towards cleaning up those functions, encapsulate the cpuset
tree traversal in a helper function, so as to facilitate upcoming changes.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/cpuset.c |   36 +++++++++++++++++++++++++++---------
 1 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 5fc1570..15926f8 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2001,6 +2001,32 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
 }
 
 /*
+ * Helper function to traverse cpusets.
+ * It can be used to walk the cpuset tree from top to bottom, completing
+ * one layer before dropping down to the next (thus always processing a
+ * node before any of its children).
+ */
+static struct cpuset *cpuset_next(struct list_head *queue)
+{
+	struct cpuset *cp;
+	struct cpuset *child;	/* scans child cpusets of cp */
+	struct cgroup *cont;
+
+	if (list_empty(queue))
+		return NULL;
+
+	cp = list_first_entry(queue, struct cpuset, stack_list);
+	list_del(queue->next);
+	list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
+		child = cgroup_cs(cont);
+		list_add_tail(&child->stack_list, queue);
+	}
+
+	return cp;
+}
+
+
+/*
  * Walk the specified cpuset subtree and look for empty cpusets.
  * The tasks of such cpuset must be moved to a parent cpuset.
  *
@@ -2019,19 +2045,11 @@ static void scan_for_empty_cpusets(struct cpuset *root)
 {
 	LIST_HEAD(queue);
 	struct cpuset *cp;	/* scans cpusets being updated */
-	struct cpuset *child;	/* scans child cpusets of cp */
-	struct cgroup *cont;
 	static nodemask_t oldmems;	/* protected by cgroup_mutex */
 
 	list_add_tail((struct list_head *)&root->stack_list, &queue);
 
-	while (!list_empty(&queue)) {
-		cp = list_first_entry(&queue, struct cpuset, stack_list);
-		list_del(queue.next);
-		list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
-			child = cgroup_cs(cont);
-			list_add_tail(&child->stack_list, &queue);
-		}
+	while ((cp = cpuset_next(&queue)) != NULL) {
 
 		/* Continue past cpusets with all cpus, mems online */
 		if (cpumask_subset(cp->cpus_allowed, cpu_active_mask) &&


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v5 3/5] cpusets, hotplug: Restructure functions that are invoked during hotplug
  2012-05-17 16:59 [PATCH v5 0/5] CPU hotplug, cpusets, suspend/resume: Fixes, cleanups and optimizations Srivatsa S. Bhat
  2012-05-17 16:59 ` [PATCH v5 1/5] CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume Srivatsa S. Bhat
  2012-05-17 17:00 ` [PATCH v5 2/5] cpusets, hotplug: Implement cpuset tree traversal in a helper function Srivatsa S. Bhat
@ 2012-05-17 17:00 ` Srivatsa S. Bhat
  2012-05-17 17:03 ` [PATCH v5 4/5] cpusets: Update tasks' cpus_allowed mask upon updates to root cpuset Srivatsa S. Bhat
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Srivatsa S. Bhat @ 2012-05-17 17:00 UTC (permalink / raw)
  To: a.p.zijlstra, mingo, pjt, paul, akpm
  Cc: rjw, nacc, rientjes, paulmck, tglx, seto.hidetoshi, tj, mschmidt,
	berrange, nikunj, vatsa, liuj97, linux-kernel, linux-pm,
	srivatsa.bhat

Separate out the cpuset related handling for CPU/Memory online/offline.
This also helps us exploit the most obvious and basic level of optimization
that any notification mechanism (CPU/Mem online/offline) has to offer us:
"We *know* why we have been invoked. So stop pretending that we are lost,
and do only the necessary amount of processing!".

And while at it, rename scan_for_empty_cpusets() to
scan_cpusets_upon_hotplug(), which is more appropriate considering how
it is restructured.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/cpuset.h |    4 +-
 kernel/cpuset.c        |   88 +++++++++++++++++++++++++++++++++---------------
 kernel/sched/core.c    |    4 +-
 3 files changed, 65 insertions(+), 31 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 668f66b..838320f 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -20,7 +20,7 @@ extern int number_of_cpusets;	/* How many cpusets are defined in system? */
 
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
-extern void cpuset_update_active_cpus(void);
+extern void cpuset_update_active_cpus(bool cpu_online);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -124,7 +124,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 static inline int cpuset_init(void) { return 0; }
 static inline void cpuset_init_smp(void) {}
 
-static inline void cpuset_update_active_cpus(void)
+static inline void cpuset_update_active_cpus(bool cpu_online)
 {
 	partition_sched_domains(1, NULL, NULL);
 }
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 15926f8..276a35a 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -147,6 +147,12 @@ typedef enum {
 	CS_SPREAD_SLAB,
 } cpuset_flagbits_t;
 
+/* the type of hotplug event */
+enum hotplug_event {
+	CPUSET_CPU_OFFLINE,
+	CPUSET_MEM_OFFLINE,
+};
+
 /* convenient tests for these bits */
 static inline int is_cpu_exclusive(const struct cpuset *cs)
 {
@@ -2027,8 +2033,10 @@ static struct cpuset *cpuset_next(struct list_head *queue)
 
 
 /*
- * Walk the specified cpuset subtree and look for empty cpusets.
- * The tasks of such cpuset must be moved to a parent cpuset.
+ * Walk the specified cpuset subtree upon a hotplug operation (CPU/Memory
+ * online/offline) and update the cpusets accordingly.
+ * For regular CPU/Mem hotplug, look for empty cpusets; the tasks of such
+ * cpuset must be moved to a parent cpuset.
  *
  * Called with cgroup_mutex held.  We take callback_mutex to modify
  * cpus_allowed and mems_allowed.
@@ -2041,38 +2049,58 @@ static struct cpuset *cpuset_next(struct list_head *queue)
  * that has tasks along with an empty 'mems'.  But if we did see such
  * a cpuset, we'd handle it just like we do if its 'cpus' was empty.
  */
-static void scan_for_empty_cpusets(struct cpuset *root)
+static void
+scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
 {
 	LIST_HEAD(queue);
-	struct cpuset *cp;	/* scans cpusets being updated */
+	struct cpuset *cp;		/* scans cpusets being updated */
 	static nodemask_t oldmems;	/* protected by cgroup_mutex */
 
 	list_add_tail((struct list_head *)&root->stack_list, &queue);
 
-	while ((cp = cpuset_next(&queue)) != NULL) {
+	switch (event) {
+	case CPUSET_CPU_OFFLINE:
+		while ((cp = cpuset_next(&queue)) != NULL) {
 
-		/* Continue past cpusets with all cpus, mems online */
-		if (cpumask_subset(cp->cpus_allowed, cpu_active_mask) &&
-		    nodes_subset(cp->mems_allowed, node_states[N_HIGH_MEMORY]))
-			continue;
+			/* Continue past cpusets with all cpus online */
+			if (cpumask_subset(cp->cpus_allowed, cpu_active_mask))
+				continue;
 
-		oldmems = cp->mems_allowed;
+			/* Remove offline cpus from this cpuset. */
+			mutex_lock(&callback_mutex);
+			cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
+							cpu_active_mask);
+			mutex_unlock(&callback_mutex);
 
-		/* Remove offline cpus and mems from this cpuset. */
-		mutex_lock(&callback_mutex);
-		cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
-			    cpu_active_mask);
-		nodes_and(cp->mems_allowed, cp->mems_allowed,
+			/* Move tasks from the empty cpuset to a parent */
+			if (cpumask_empty(cp->cpus_allowed))
+				remove_tasks_in_empty_cpuset(cp);
+			else
+				update_tasks_cpumask(cp, NULL);
+		}
+		break;
+
+	case CPUSET_MEM_OFFLINE:
+		while ((cp = cpuset_next(&queue)) != NULL) {
+
+			/* Continue past cpusets with all mems online */
+			if (nodes_subset(cp->mems_allowed,
+					node_states[N_HIGH_MEMORY]))
+				continue;
+
+			oldmems = cp->mems_allowed;
+
+			/* Remove offline mems from this cpuset. */
+			mutex_lock(&callback_mutex);
+			nodes_and(cp->mems_allowed, cp->mems_allowed,
 						node_states[N_HIGH_MEMORY]);
-		mutex_unlock(&callback_mutex);
+			mutex_unlock(&callback_mutex);
 
-		/* Move tasks from the empty cpuset to a parent */
-		if (cpumask_empty(cp->cpus_allowed) ||
-		     nodes_empty(cp->mems_allowed))
-			remove_tasks_in_empty_cpuset(cp);
-		else {
-			update_tasks_cpumask(cp, NULL);
-			update_tasks_nodemask(cp, &oldmems, NULL);
+			/* Move tasks from the empty cpuset to a parent */
+			if (nodes_empty(cp->mems_allowed))
+				remove_tasks_in_empty_cpuset(cp);
+			else
+				update_tasks_nodemask(cp, &oldmems, NULL);
 		}
 	}
 }
@@ -2091,8 +2119,11 @@ static void scan_for_empty_cpusets(struct cpuset *root)
  *
  * Called within get_online_cpus().  Needs to call cgroup_lock()
  * before calling generate_sched_domains().
+ *
+ * @cpu_online: Indicates whether this is a CPU online event (true) or
+ * a CPU offline event (false).
  */
-void cpuset_update_active_cpus(void)
+void cpuset_update_active_cpus(bool cpu_online)
 {
 	struct sched_domain_attr *attr;
 	cpumask_var_t *doms;
@@ -2102,7 +2133,10 @@ void cpuset_update_active_cpus(void)
 	mutex_lock(&callback_mutex);
 	cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
 	mutex_unlock(&callback_mutex);
-	scan_for_empty_cpusets(&top_cpuset);
+
+	if (!cpu_online)
+		scan_cpusets_upon_hotplug(&top_cpuset, CPUSET_CPU_OFFLINE);
+
 	ndoms = generate_sched_domains(&doms, &attr);
 	cgroup_unlock();
 
@@ -2133,9 +2167,9 @@ static int cpuset_track_online_nodes(struct notifier_block *self,
 	case MEM_OFFLINE:
 		/*
 		 * needn't update top_cpuset.mems_allowed explicitly because
-		 * scan_for_empty_cpusets() will update it.
+		 * scan_cpusets_upon_hotplug() will update it.
 		 */
-		scan_for_empty_cpusets(&top_cpuset);
+		scan_cpusets_upon_hotplug(&top_cpuset, CPUSET_MEM_OFFLINE);
 		break;
 	default:
 		break;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9c6ce0c..b5a0a0c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6838,7 +6838,7 @@ static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
 
 	case CPU_ONLINE:
 	case CPU_DOWN_FAILED:
-		cpuset_update_active_cpus();
+		cpuset_update_active_cpus(true);
 		break;
 	default:
 		return NOTIFY_DONE;
@@ -6851,7 +6851,7 @@ static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
 {
 	switch (action) {
 	case CPU_DOWN_PREPARE:
-		cpuset_update_active_cpus();
+		cpuset_update_active_cpus(false);
 		break;
 	case CPU_DOWN_PREPARE_FROZEN:
 		num_cpus_frozen++;


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v5 4/5] cpusets: Update tasks' cpus_allowed mask upon updates to root cpuset
  2012-05-17 16:59 [PATCH v5 0/5] CPU hotplug, cpusets, suspend/resume: Fixes, cleanups and optimizations Srivatsa S. Bhat
                   ` (2 preceding siblings ...)
  2012-05-17 17:00 ` [PATCH v5 3/5] cpusets, hotplug: Restructure functions that are invoked during hotplug Srivatsa S. Bhat
@ 2012-05-17 17:03 ` Srivatsa S. Bhat
  2012-05-24  9:13   ` Peter Zijlstra
  2012-05-17 17:03 ` [PATCH v5 5/5] cpusets: Remove/update outdated comments Srivatsa S. Bhat
  2012-05-20 13:58 ` [PATCH v5 0/5] CPU hotplug, cpusets, suspend/resume: Fixes, cleanups and optimizations Srivatsa S. Bhat
  5 siblings, 1 reply; 11+ messages in thread
From: Srivatsa S. Bhat @ 2012-05-17 17:03 UTC (permalink / raw)
  To: a.p.zijlstra, mingo, pjt, paul, akpm
  Cc: rjw, nacc, rientjes, paulmck, tglx, seto.hidetoshi, tj, mschmidt,
	berrange, nikunj, vatsa, liuj97, linux-kernel, linux-pm,
	srivatsa.bhat

The root cpuset's cpus_allowed mask can get updated during CPU hotplug.
However, during those updates, the tasks belonging to the root cpuset
aren't touched at all - their cpus_allowed masks aren't updated to reflect
the change in the configuration of the cpuset they belong to.

Fix this by moving the update of top_cpuset.cpus_allowed to
scan_cpusets_upon_hotplug() and ensure that the call to update_tasks_cpumask()
is not missed out for the root cpuset, including for cpu online events.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
---
David, I updated this code slightly (added the cpumask_equal() check
in CPUSET_CPU_ONLINE case) but retained your ack. Let me know if you have
objections.

 kernel/cpuset.c |   39 +++++++++++++++++++++++++++++++--------
 1 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 276a35a..e44db78 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -150,6 +150,7 @@ typedef enum {
 /* the type of hotplug event */
 enum hotplug_event {
 	CPUSET_CPU_OFFLINE,
+	CPUSET_CPU_ONLINE,
 	CPUSET_MEM_OFFLINE,
 };
 
@@ -2066,10 +2067,14 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
 			if (cpumask_subset(cp->cpus_allowed, cpu_active_mask))
 				continue;
 
-			/* Remove offline cpus from this cpuset. */
 			mutex_lock(&callback_mutex);
-			cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
-							cpu_active_mask);
+			/* top_cpuset.cpus_allowed must track cpu_active_mask */
+			if (cp == &top_cpuset)
+				cpumask_copy(cp->cpus_allowed, cpu_active_mask);
+			else
+				/* Remove offline cpus from this cpuset. */
+				cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
+							      cpu_active_mask);
 			mutex_unlock(&callback_mutex);
 
 			/* Move tasks from the empty cpuset to a parent */
@@ -2080,6 +2085,24 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
 		}
 		break;
 
+	case CPUSET_CPU_ONLINE:
+		/*
+		 * Restore only the top_cpuset because it has to track
+		 * cpu_active_mask always.
+		 * (We don't need to do anything if we come here during resume
+		 * from suspend, since top_cpuset.cpus_allowed will already be
+		 * equal to cpu_active_mask.)
+		 */
+		if (root == &top_cpuset && !cpumask_equal(root->cpus_allowed,
+							  cpu_active_mask)) {
+			mutex_lock(&callback_mutex);
+			cpumask_copy(root->cpus_allowed, cpu_active_mask);
+			mutex_unlock(&callback_mutex);
+			update_tasks_cpumask(root, NULL);
+		}
+		list_del(queue.next);
+		break;
+
 	case CPUSET_MEM_OFFLINE:
 		while ((cp = cpuset_next(&queue)) != NULL) {
 
@@ -2115,7 +2138,8 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
  * modify cpusets at all.
  *
  * This routine ensures that top_cpuset.cpus_allowed tracks
- * cpu_active_mask on each CPU hotplug (cpuhp) event.
+ * cpu_active_mask on each CPU hotplug (cpuhp) event. (This maintenance
+ * is actually implemented in scan_cpusets_upon_hotplug().)
  *
  * Called within get_online_cpus().  Needs to call cgroup_lock()
  * before calling generate_sched_domains().
@@ -2130,11 +2154,10 @@ void cpuset_update_active_cpus(bool cpu_online)
 	int ndoms;
 
 	cgroup_lock();
-	mutex_lock(&callback_mutex);
-	cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
-	mutex_unlock(&callback_mutex);
 
-	if (!cpu_online)
+	if (cpu_online)
+		scan_cpusets_upon_hotplug(&top_cpuset, CPUSET_CPU_ONLINE);
+	else
 		scan_cpusets_upon_hotplug(&top_cpuset, CPUSET_CPU_OFFLINE);
 
 	ndoms = generate_sched_domains(&doms, &attr);


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v5 5/5] cpusets: Remove/update outdated comments
  2012-05-17 16:59 [PATCH v5 0/5] CPU hotplug, cpusets, suspend/resume: Fixes, cleanups and optimizations Srivatsa S. Bhat
                   ` (3 preceding siblings ...)
  2012-05-17 17:03 ` [PATCH v5 4/5] cpusets: Update tasks' cpus_allowed mask upon updates to root cpuset Srivatsa S. Bhat
@ 2012-05-17 17:03 ` Srivatsa S. Bhat
  2012-05-20 13:58 ` [PATCH v5 0/5] CPU hotplug, cpusets, suspend/resume: Fixes, cleanups and optimizations Srivatsa S. Bhat
  5 siblings, 0 replies; 11+ messages in thread
From: Srivatsa S. Bhat @ 2012-05-17 17:03 UTC (permalink / raw)
  To: a.p.zijlstra, mingo, pjt, paul, akpm
  Cc: rjw, nacc, rientjes, paulmck, tglx, seto.hidetoshi, tj, mschmidt,
	berrange, nikunj, vatsa, liuj97, linux-kernel, linux-pm,
	srivatsa.bhat

cpuset_track_online_cpus() is no longer present. So remove the
outdated comment and replace it with reference to cpuset_update_active_cpus()
which is its equivalent.

Also, we don't lack memory hot-unplug anymore. And David Rientjes pointed
out how it is dealt with. So update that comment as well.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/cpuset.c |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index e44db78..89ff649 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2046,9 +2046,8 @@ static struct cpuset *cpuset_next(struct list_head *queue)
  * before dropping down to the next.  It always processes a node before
  * any of its children.
  *
- * For now, since we lack memory hot unplug, we'll never see a cpuset
- * that has tasks along with an empty 'mems'.  But if we did see such
- * a cpuset, we'd handle it just like we do if its 'cpus' was empty.
+ * In the case of memory hot-unplug, it will remove nodes from N_HIGH_MEMORY
+ * if all present pages from a node are offlined.
  */
 static void
 scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
@@ -2171,7 +2170,7 @@ void cpuset_update_active_cpus(bool cpu_online)
 /*
  * Keep top_cpuset.mems_allowed tracking node_states[N_HIGH_MEMORY].
  * Call this routine anytime after node_states[N_HIGH_MEMORY] changes.
- * See also the previous routine cpuset_track_online_cpus().
+ * See cpuset_update_active_cpus() for CPU hotplug handling.
  */
 static int cpuset_track_online_nodes(struct notifier_block *self,
 				unsigned long action, void *arg)


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 0/5] CPU hotplug, cpusets, suspend/resume: Fixes, cleanups and optimizations
  2012-05-17 16:59 [PATCH v5 0/5] CPU hotplug, cpusets, suspend/resume: Fixes, cleanups and optimizations Srivatsa S. Bhat
                   ` (4 preceding siblings ...)
  2012-05-17 17:03 ` [PATCH v5 5/5] cpusets: Remove/update outdated comments Srivatsa S. Bhat
@ 2012-05-20 13:58 ` Srivatsa S. Bhat
  5 siblings, 0 replies; 11+ messages in thread
From: Srivatsa S. Bhat @ 2012-05-20 13:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa S. Bhat, mingo, pjt, paul, akpm, rjw, nacc, rientjes,
	paulmck, tglx, seto.hidetoshi, tj, mschmidt, berrange, nikunj,
	vatsa, liuj97, linux-kernel, linux-pm

Hi Peter,

Did you get a chance to take a look at this patchset?

(I have added a summary of the changes between the versions below.)

Thanks a lot!

On 05/17/2012 10:29 PM, Srivatsa S. Bhat wrote:

> Currently the kernel doesn't handle cpusets properly during suspend/resume.
> After a resume, all non-root cpusets end up having only 1 cpu (the boot cpu),
> causing massive performance degradation of workloads. One major user of cpusets
> is libvirt, which means that after a suspend/hibernation cycle, all VMs
> suddenly end up running terribly slow!
> 
> Also, the kernel moves the tasks from one cpuset to another during CPU hotplug
> in the suspend/resume path, leading to a task-management nightmare after
> resume.
> 
> Patch 1 fixes this by keeping cpusets unmodified in the suspend/resume path.
> But to ensure we don't trip over, it keeps the sched domains updated during
> every CPU hotplug in the s/r path.
> This is a long standing issue and we need to fix up stable kernels too.
> 
> The rest of the patches in the series are mostly cleanups/optimizations
> (except patch 4 which is a non-critical bug fix).
>


Changelog:

v5 : Don't do explicit save/restore of cpusets' cpu masks during s/r, to avoid
     the overhead of a new per-cpuset cpu mask to help with that. Also, ensure
     that the sched domains are updated on each hotplug.
     The s/r fix (patch 1) is now distinct from the cleanups (patches 2-5).

v4 : (Never had this version. I skipped from v3 to v5 accidentally).

v3 : Explicitly save and restore cpusets' cpu masks during s/r. Don't alter
     the semantics of regular cpu hotplug.
     http://thread.gmane.org/gmane.linux.kernel/1296339

v2 : http://thread.gmane.org/gmane.linux.documentation/4805/

v1 : thread.gmane.org/gmane.linux.kernel/1250097/
 

> --
>  Srivatsa S. Bhat (5):
>       CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume
>       cpusets, hotplug: Implement cpuset tree traversal in a helper function
>       cpusets, hotplug: Restructure functions that are invoked during hotplug
>       cpusets: Update tasks' cpus_allowed mask upon updates to root cpuset
>       cpusets: Remove/update outdated comments
> 
> 
>   include/linux/cpuset.h |    4 +
>  kernel/cpuset.c        |  161 +++++++++++++++++++++++++++++++++++-------------
>  kernel/sched/core.c    |   44 +++++++++++--
>  3 files changed, 159 insertions(+), 50 deletions(-)


Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 4/5] cpusets: Update tasks' cpus_allowed mask upon updates to root cpuset
  2012-05-17 17:03 ` [PATCH v5 4/5] cpusets: Update tasks' cpus_allowed mask upon updates to root cpuset Srivatsa S. Bhat
@ 2012-05-24  9:13   ` Peter Zijlstra
  2012-05-24  9:44     ` Srivatsa S. Bhat
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2012-05-24  9:13 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: mingo, pjt, paul, akpm, rjw, nacc, rientjes, paulmck, tglx,
	seto.hidetoshi, tj, mschmidt, berrange, nikunj, vatsa, liuj97,
	linux-kernel, linux-pm

On Thu, 2012-05-17 at 22:33 +0530, Srivatsa S. Bhat wrote:
> +               /*
> +                * Restore only the top_cpuset because it has to track
> +                * cpu_active_mask always.
> +                * (We don't need to do anything if we come here during resume
> +                * from suspend, since top_cpuset.cpus_allowed will already be
> +                * equal to cpu_active_mask.)
> +                */
> +               if (root == &top_cpuset && !cpumask_equal(root->cpus_allowed,
> +                                                         cpu_active_mask)) {
> +                       mutex_lock(&callback_mutex);
> +                       cpumask_copy(root->cpus_allowed, cpu_active_mask);
> +                       mutex_unlock(&callback_mutex);
> +                       update_tasks_cpumask(root, NULL);
> +               } 

This looks absolutely broken.

Suppose I set an explicit cpu affinity mask on my task, which per not
using cpusets is in the root group.

If I then do a hotplug cycle, I'll find my affinity mask is lost.




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 4/5] cpusets: Update tasks' cpus_allowed mask upon updates to root cpuset
  2012-05-24  9:13   ` Peter Zijlstra
@ 2012-05-24  9:44     ` Srivatsa S. Bhat
  2012-05-24 11:58       ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Srivatsa S. Bhat @ 2012-05-24  9:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, pjt, paul, akpm, rjw, nacc, rientjes, paulmck, tglx,
	seto.hidetoshi, tj, mschmidt, berrange, nikunj, vatsa, liuj97,
	linux-kernel, linux-pm

On 05/24/2012 02:43 PM, Peter Zijlstra wrote:

> On Thu, 2012-05-17 at 22:33 +0530, Srivatsa S. Bhat wrote:
>> +               /*
>> +                * Restore only the top_cpuset because it has to track
>> +                * cpu_active_mask always.
>> +                * (We don't need to do anything if we come here during resume
>> +                * from suspend, since top_cpuset.cpus_allowed will already be
>> +                * equal to cpu_active_mask.)
>> +                */
>> +               if (root == &top_cpuset && !cpumask_equal(root->cpus_allowed,
>> +                                                         cpu_active_mask)) {
>> +                       mutex_lock(&callback_mutex);
>> +                       cpumask_copy(root->cpus_allowed, cpu_active_mask);
>> +                       mutex_unlock(&callback_mutex);
>> +                       update_tasks_cpumask(root, NULL);
>> +               } 
> 
> This looks absolutely broken.
> 
> Suppose I set an explicit cpu affinity mask on my task, which per not
> using cpusets is in the root group.
> 
> If I then do a hotplug cycle, I'll find my affinity mask is lost.
> 


Sorry, my bad, I hadn't considered that. Thanks for pointing it out!

So, I am wondering how we ought to deal with CPU hotplug for tasks attached
to the root cpuset..

Considering tasks attached to the root cpuset, if a cpu present in a task's
cpus_allowed mask goes offline, it should be removed from that mask right?
And if that cpu comes back online, it should not be put back to the task's
cpus_allowed mask (just like we don't put back cpus in non-root cpusets).

Is the above understanding correct?

In the current kernel, during cpu hotplug, we don't touch cpus_allowed mask
of the tasks attached to the root cpuset at all.. Whereas we update the
cpus_allowed mask of tasks belonging to non-root cpusets, during cpu offline.

So, is this differentiation intended?

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 4/5] cpusets: Update tasks' cpus_allowed mask upon updates to root cpuset
  2012-05-24  9:44     ` Srivatsa S. Bhat
@ 2012-05-24 11:58       ` Peter Zijlstra
  2012-05-24 12:24         ` Srivatsa S. Bhat
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2012-05-24 11:58 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: mingo, pjt, paul, akpm, rjw, nacc, rientjes, paulmck, tglx,
	seto.hidetoshi, tj, mschmidt, berrange, nikunj, vatsa, liuj97,
	linux-kernel, linux-pm

On Thu, 2012-05-24 at 15:14 +0530, Srivatsa S. Bhat wrote:

> Sorry, my bad, I hadn't considered that. Thanks for pointing it out!
> 
> So, I am wondering how we ought to deal with CPU hotplug for tasks attached
> to the root cpuset..

By not doing anything at all. If a task is in the root set it is
supposed to behave as it cpusets don't exist.

> Considering tasks attached to the root cpuset, if a cpu present in a task's
> cpus_allowed mask goes offline, it should be removed from that mask right?

Nope, that shouldn't happen. We only reset the mask if all cpus in the
affinity mask go away. This is where task affinity and cpusets differ.

> And if that cpu comes back online, it should not be put back to the task's
> cpus_allowed mask (just like we don't put back cpus in non-root cpusets).

Online shouldn't ever change anything.

> Is the above understanding correct?

Nope.

> In the current kernel, during cpu hotplug, we don't touch cpus_allowed mask
> of the tasks attached to the root cpuset at all.. Whereas we update the
> cpus_allowed mask of tasks belonging to non-root cpusets, during cpu offline.
> 
> So, is this differentiation intended?

Yes, although arguably the cpuset case is 'weird' in that it came later
and didn't mirror the cpu affinity semantics.

Tasks aren't attached to the root cpuset, they live there because
there's no other place to be when you don't use cpusets. This very much
means that tasks in the root set should behave as if cpusets didn't
exist.




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 4/5] cpusets: Update tasks' cpus_allowed mask upon updates to root cpuset
  2012-05-24 11:58       ` Peter Zijlstra
@ 2012-05-24 12:24         ` Srivatsa S. Bhat
  0 siblings, 0 replies; 11+ messages in thread
From: Srivatsa S. Bhat @ 2012-05-24 12:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, pjt, paul, akpm, rjw, nacc, rientjes, paulmck, tglx,
	seto.hidetoshi, tj, mschmidt, berrange, nikunj, vatsa, liuj97,
	linux-kernel, linux-pm

On 05/24/2012 05:28 PM, Peter Zijlstra wrote:

> On Thu, 2012-05-24 at 15:14 +0530, Srivatsa S. Bhat wrote:
> 
>> Sorry, my bad, I hadn't considered that. Thanks for pointing it out!
>>
>> So, I am wondering how we ought to deal with CPU hotplug for tasks attached
>> to the root cpuset..
> 
> By not doing anything at all. If a task is in the root set it is
> supposed to behave as it cpusets don't exist.
> 
>> Considering tasks attached to the root cpuset, if a cpu present in a task's
>> cpus_allowed mask goes offline, it should be removed from that mask right?
> 
> Nope, that shouldn't happen. We only reset the mask if all cpus in the
> affinity mask go away. This is where task affinity and cpusets differ.
> 
>> And if that cpu comes back online, it should not be put back to the task's
>> cpus_allowed mask (just like we don't put back cpus in non-root cpusets).
> 
> Online shouldn't ever change anything.
> 
>> Is the above understanding correct?
> 
> Nope.
> 
>> In the current kernel, during cpu hotplug, we don't touch cpus_allowed mask
>> of the tasks attached to the root cpuset at all.. Whereas we update the
>> cpus_allowed mask of tasks belonging to non-root cpusets, during cpu offline.
>>
>> So, is this differentiation intended?
> 
> Yes, although arguably the cpuset case is 'weird' in that it came later
> and didn't mirror the cpu affinity semantics.
> 
> Tasks aren't attached to the root cpuset, they live there because
> there's no other place to be when you don't use cpusets. This very much
> means that tasks in the root set should behave as if cpusets didn't
> exist.
> 


Ok, got it.. that logic makes sense. Thanks a lot for the explanation!
I will resubmit this patchset without patch 4/5 then, if you don't have any
other objections.

 

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2012-05-24 12:25 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-17 16:59 [PATCH v5 0/5] CPU hotplug, cpusets, suspend/resume: Fixes, cleanups and optimizations Srivatsa S. Bhat
2012-05-17 16:59 ` [PATCH v5 1/5] CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume Srivatsa S. Bhat
2012-05-17 17:00 ` [PATCH v5 2/5] cpusets, hotplug: Implement cpuset tree traversal in a helper function Srivatsa S. Bhat
2012-05-17 17:00 ` [PATCH v5 3/5] cpusets, hotplug: Restructure functions that are invoked during hotplug Srivatsa S. Bhat
2012-05-17 17:03 ` [PATCH v5 4/5] cpusets: Update tasks' cpus_allowed mask upon updates to root cpuset Srivatsa S. Bhat
2012-05-24  9:13   ` Peter Zijlstra
2012-05-24  9:44     ` Srivatsa S. Bhat
2012-05-24 11:58       ` Peter Zijlstra
2012-05-24 12:24         ` Srivatsa S. Bhat
2012-05-17 17:03 ` [PATCH v5 5/5] cpusets: Remove/update outdated comments Srivatsa S. Bhat
2012-05-20 13:58 ` [PATCH v5 0/5] CPU hotplug, cpusets, suspend/resume: Fixes, cleanups and optimizations Srivatsa S. Bhat

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).