* [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core
@ 2004-01-31 14:16 Rusty Russell
  2004-02-01  8:03 ` Andrew Morton
  2004-02-02  9:12 ` Ingo Molnar
  0 siblings, 2 replies; 19+ messages in thread
From: Rusty Russell @ 2004-01-31 14:16 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, Nick Piggin, dipankar, vatsa, mingo

The actual CPU patch.  It's big, but almost all of it is under
CONFIG_HOTPLUG_CPU, or under macros which have the same effect.

I'm particularly interested in Nick's views here: I tried to catch all
the places where the new scheduler might pull tasks onto an offline cpu.

Name: Hotplug CPU Patch: -mm Version
Author: Almost Everyone on the lhcs List
Status: Experimental
Depends: Hotcpu/use-kthread-simple.patch.gz
Depends: Hotcpu/drain_local_pages.patch.gz
Depends: Hotcpu/kstat-notifier-removal.patch.gz
Depends: Hotcpu/cpucontrol_macros.patch.gz
Depends: Module/module_kthread.patch.gz
Depends: Hotcpu/cpu_prepare.patch.gz
Depends: Hotcpu/zero-initializer-notifier-removal.patch.gz
Depends: Hotcpu/workqueue-cleanup.patch.gz
Depends: Hotcpu/kmod-affinity.patch.gz
Version: -mm

D: This is the arch-indep hotplug cpu code, contributed by Matt
D: Fleming, Zwane Mwaikambo, Dipankar Sarma, Joel Schopp,
D: Vatsa Vaddagiri and me.
D: 
D: Changes are designed to be NOOPs when !CONFIG_HOTPLUG_CPU: usually by
D: explicit use of CONFIG_HOTPLUG_CPU ifdefs.
D: 
D: The way CPUs go down is:
D: 1) Userspace writes 0 to sysfs ("online" attrib).
D: 2) The cpucontrol mutex is grabbed.
D: 3) The task moves onto the cpu.
D: 4) cpu_online(cpu) is set to false.
D: 5) Tasks are migrated off; cpus_allowed is broken if required (not kernel threads).
D: 6) We move back off the cpu.
D: 7) CPU_OFFLINE notifier is called for the cpu (kernel threads clean up).
D: 8) Arch-specific __cpu_die() is called.
D: 9) CPU_DEAD notifier is called (caches drained, etc).
D: 10) The hotplug script (/sbin/hotplug) is run.
D: 11) cpucontrol mutex released.
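D: 
D: For illustration, step 1 amounts to a userspace write like the
D: following minimal sketch (the sysfs path is the standard sysdev
D: location for the new "online" attribute, and cpu1 is just an
D: example -- treat both as assumptions):
D: 
D: 	#include <fcntl.h>
D: 	#include <unistd.h>
D: 
D: 	int main(void)
D: 	{
D: 		/* Writing '0' invokes store_online(), which runs
D: 		   steps 2-11 above via cpu_down(). */
D: 		int fd = open("/sys/devices/system/cpu/cpu1/online",
D: 			      O_WRONLY);
D: 
D: 		if (fd < 0)
D: 			return 1;
D: 		if (write(fd, "0", 1) != 1) {
D: 			close(fd);
D: 			return 1;
D: 		}
D: 		close(fd);
D: 		return 0;
D: 	}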
D: 
D: CPUs go up as before, except with CONFIG_HOTPLUG_CPU, they can go
D: up after boot.
D: 
D: Changes are:
D: - drivers/base/cpu.c: Add an "online" attribute.
D: 
D: - drivers/scsi/scsi.c: drain the dead CPU's scsi_done_q into the
D:   current CPU on CPU_DEAD
D:
D: - fs/buffer.c: release buffer heads from bh_lrus on CPU_DEAD
D:
D: - linux/cpu.h: Add cpu_down().  Add hotcpu_notifier convenience macro.
D:   Add cpu_is_offline() for when you know a cpu was (once) online: this
D:   is a constant if !CONFIG_HOTPLUG_CPU.
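D: 
D:   For illustration, a typical (hypothetical) subsystem user of the
D:   new macro would look like this sketch; foo_drain_cpu() stands in
D:   for whatever per-cpu cleanup the subsystem needs, following the
D:   same pattern as the fs/buffer.c and mm/page_alloc.c hunks below:
D: 
D: 	static int foo_cpu_notify(struct notifier_block *self,
D: 				  unsigned long action, void *hcpu)
D: 	{
D: 		if (action == CPU_DEAD)
D: 			foo_drain_cpu((unsigned long)hcpu); /* hypothetical */
D: 		return NOTIFY_OK;
D: 	}
D: 
D: 	void __init foo_init(void)
D: 	{
D: 		/* Expands to a static notifier_block plus
D: 		   register_cpu_notifier(); compiles away to nothing
D: 		   if !CONFIG_HOTPLUG_CPU. */
D: 		hotcpu_notifier(foo_cpu_notify, 0);
D: 	}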
D: 
D: - linux/interrupt.h: Add tasklet_kill_immediate for RCU tasklet takedown
D:   on dead CPU.
D: 
D: - linux/mmzone.h: keep a pointer to kswapd, in case we need to move it.
D: 
D: - linux/sched.h: Add migrate_all_tasks() for cpu-down.  Add migrate_to_cpu
D:   for kernel threads to escape.  Add wake_idle_cpu() for i386 hotplug code.
D: 
D: - kernel/cpu.c: Take cpucontrol lock around notifiers: notifiers have
D:   no locking themselves.  Implement cpu_down() and code to call
D:   sbin_hotplug.
D: 
D: - kernel/kmod.c: As a kernel thread bound to a specific cpu, take
D:   care of the case where that cpu is going down before the keventd()
D:   CPU_OFFLINE notifier can return.
D: 
D: - kernel/kthread.c: Same as kmod.c: take care of the race where the
D:   CPU goes down and we didn't get migrated.
D: 
D: - kernel/rcupdate.c: Move rcu batch when cpu dies.
D: 
D: - kernel/sched.c: Add wake_idle_cpu() for i386 hotplug.
D:      
D:      Check everywhere to make sure we never move tasks onto an
D:      offline cpu: we'll just be fighting migrate_all_tasks().
D: 
D:      Change sched_migrate_task() to migrate_to_cpu and expose it
D:      for hotplug cpu.
D: 
D:      Take hotplug lock in sys_sched_setaffinity.
D:      
D:      Return cpus_allowed masked by cpu_possible_map, not
D:      cpu_online_map, in sys_sched_getaffinity: otherwise tasks
D:      can't know whether they can run on currently-offline cpus.
D: 
D:      Implement migrate_all_tasks() to push tasks off the dying cpu.
D:      
D:      Add callbacks to stop migration thread.
D: 
D: - kernel/softirq.c:
D:     Handle the case where an interrupt tries to kick ksoftirqd
D:     after it has exited.
D:
D:     Implement tasklet_kill_immediate for RCU.
D:     
D:     Code to shut down ksoftirqd, and take over dead cpu's tasklets.
D: 
D: - kernel/timer.c:
D:     Code to migrate timers off dead CPU.
D: 
D: - kernel/workqueue.c:
D:     Keep all workqueues in a list, with their names: when cpus come
D:     up we need to create a new thread for each one.
D:
D:     Lock cpu hotplug in flush_workqueue for simplicity.
D: 
D:     create_workqueue_thread doesn't need the name arg; it's now in
D:     the struct.
D:
D:     Lock cpu hotplug when creating and destroying workqueues.
D: 
D:     cleanup_workqueue_thread needs to block irqs: can now be called
D:     on live workqueues.
D:
D:     Implement callbacks to add and delete threads, and take over work.
D: 
D: - mm/page_alloc.c:
D:    Drain the dead CPU's local pages.
D:
D: - mm/slab.c:
D:    Move free_block decl, ac_entry and ac_data earlier in file.
D:    
D:    Stop the reap timer as the cpu goes offline.
D:
D:    Neaten list_for_each into list_for_each_entry.
D:
D:    Free cache blocks and correct stats once the CPU is dead.
D: 
D:    Make reap_timer_fnc terminate if cpu goes offline.
D: 
D: - mm/swap.c:
D:    Fix up committed stats and drain the lru cache when a cpu is dead.
D: 
D: - mm/vmscan.c:
D:    Keep kswapd on cpus within node unless all CPUs go offline, and
D:    restore when they come back.
D:    
D:    Record kswapd tasks for each pgdat upon creation.
D: 
D: - net/core/dev.c:
D:    Drain skb queues to this cpu when another cpu is dead.
D: 
D: - net/core/flow.c:
D:    Call __flow_cache_shrink() to shrink cache to zero when CPU dead.

diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/drivers/base/cpu.c .22517-linux-2.6.2-rc2-mm2.updated/drivers/base/cpu.c
--- .22517-linux-2.6.2-rc2-mm2/drivers/base/cpu.c	2003-10-26 14:52:48.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/drivers/base/cpu.c	2004-02-01 01:08:03.000000000 +1100
@@ -7,6 +7,7 @@
 #include <linux/init.h>
 #include <linux/cpu.h>
 #include <linux/topology.h>
+#include <linux/device.h>
 
 
 struct sysdev_class cpu_sysdev_class = {
@@ -14,6 +15,46 @@ struct sysdev_class cpu_sysdev_class = {
 };
 EXPORT_SYMBOL(cpu_sysdev_class);
 
+#ifdef CONFIG_HOTPLUG_CPU
+static ssize_t show_online(struct sys_device *dev, char *buf)
+{
+	struct cpu *cpu = container_of(dev, struct cpu, sysdev);
+
+	return sprintf(buf, "%u\n", !!cpu_online(cpu->sysdev.id));
+}
+
+static ssize_t store_online(struct sys_device *dev, const char *buf,
+			    size_t count)
+{
+	struct cpu *cpu = container_of(dev, struct cpu, sysdev);
+	ssize_t ret;
+
+	switch (buf[0]) {
+	case '0':
+		ret = cpu_down(cpu->sysdev.id);
+		break;
+	case '1':
+		ret = cpu_up(cpu->sysdev.id);
+		break;
+	default:
+		ret = -EINVAL;
+	}
+
+	if (ret >= 0)
+		ret = count;
+	return ret;
+}
+static SYSDEV_ATTR(online, 0600, show_online, store_online);
+
+static void __init register_cpu_control(struct cpu *cpu)
+{
+	sysdev_create_file(&cpu->sysdev, &attr_online);
+}
+#else /* ... !CONFIG_HOTPLUG_CPU */
+static void __init register_cpu_control(struct cpu *cpu)
+{
+}
+#endif /* CONFIG_HOTPLUG_CPU */
 
 /*
  * register_cpu - Setup a driverfs device for a CPU.
@@ -34,6 +75,8 @@ int __init register_cpu(struct cpu *cpu,
 		error = sysfs_create_link(&root->sysdev.kobj,
 					  &cpu->sysdev.kobj,
 					  kobject_name(&cpu->sysdev.kobj));
+	if (!error)
+		register_cpu_control(cpu);
 	return error;
 }
 
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/drivers/scsi/scsi.c .22517-linux-2.6.2-rc2-mm2.updated/drivers/scsi/scsi.c
--- .22517-linux-2.6.2-rc2-mm2/drivers/scsi/scsi.c	2004-01-31 17:28:10.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/drivers/scsi/scsi.c	2004-02-01 01:08:03.000000000 +1100
@@ -53,6 +53,8 @@
 #include <linux/spinlock.h>
 #include <linux/kmod.h>
 #include <linux/interrupt.h>
+#include <linux/notifier.h>
+#include <linux/cpu.h>
 
 #include <scsi/scsi_host.h>
 #include "scsi.h"
@@ -1129,6 +1131,38 @@ int scsi_device_cancel(struct scsi_devic
 	return 0;
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+static int scsi_cpu_notify(struct notifier_block *self, 
+			   unsigned long action, void *hcpu)
+{
+	int cpu = (unsigned long)hcpu;
+
+	switch(action) {
+	case CPU_DEAD:
+		/* Drain scsi_done_q. */
+		local_irq_disable();
+		list_splice_init(&per_cpu(scsi_done_q, cpu),
+				 &__get_cpu_var(scsi_done_q));
+		raise_softirq_irqoff(SCSI_SOFTIRQ);
+		local_irq_enable();
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __devinitdata scsi_cpu_nb = {
+	.notifier_call	= scsi_cpu_notify,
+};
+
+#define register_scsi_cpu() register_cpu_notifier(&scsi_cpu_nb)
+#define unregister_scsi_cpu() unregister_cpu_notifier(&scsi_cpu_nb)
+#else
+#define register_scsi_cpu()
+#define unregister_scsi_cpu()
+#endif /* CONFIG_HOTPLUG_CPU */
+
 MODULE_DESCRIPTION("SCSI core");
 MODULE_LICENSE("GPL");
 
@@ -1163,6 +1197,7 @@ static int __init init_scsi(void)
 
 	devfs_mk_dir("scsi");
 	open_softirq(SCSI_SOFTIRQ, scsi_softirq, NULL);
+	register_scsi_cpu();
 	printk(KERN_NOTICE "SCSI subsystem initialized\n");
 	return 0;
 
@@ -1190,6 +1225,7 @@ static void __exit exit_scsi(void)
 	devfs_remove("scsi");
 	scsi_exit_procfs();
 	scsi_exit_queue();
+	unregister_scsi_cpu();
 }
 
 subsys_initcall(init_scsi);
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/fs/buffer.c .22517-linux-2.6.2-rc2-mm2.updated/fs/buffer.c
--- .22517-linux-2.6.2-rc2-mm2/fs/buffer.c	2004-01-31 17:28:12.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/fs/buffer.c	2004-02-01 01:08:03.000000000 +1100
@@ -2991,6 +2991,26 @@ init_buffer_head(void *data, kmem_cache_
 	}
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+static void buffer_exit_cpu(int cpu)
+{
+	int i;
+	struct bh_lru *b = &per_cpu(bh_lrus, cpu);
+
+	for (i = 0; i < BH_LRU_SIZE; i++) {
+		brelse(b->bhs[i]);
+		b->bhs[i] = NULL;
+	}
+}
+
+static int buffer_cpu_notify(struct notifier_block *self, 
+			      unsigned long action, void *hcpu)
+{
+	if (action == CPU_DEAD)
+		buffer_exit_cpu((unsigned long)hcpu);
+	return NOTIFY_OK;
+}
+#endif /* CONFIG_HOTPLUG_CPU */
 
 void __init buffer_init(void)
 {
@@ -3008,6 +3028,7 @@ void __init buffer_init(void)
 	 */
 	nrpages = (nr_free_buffer_pages() * 10) / 100;
 	max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
+	hotcpu_notifier(buffer_cpu_notify, 0);
 }
 
 EXPORT_SYMBOL(__bforget);
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/include/linux/cpu.h .22517-linux-2.6.2-rc2-mm2.updated/include/linux/cpu.h
--- .22517-linux-2.6.2-rc2-mm2/include/linux/cpu.h	2004-01-31 17:28:19.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/include/linux/cpu.h	2004-02-01 01:08:03.000000000 +1100
@@ -21,6 +21,8 @@
 
 #include <linux/sysdev.h>
 #include <linux/node.h>
+#include <linux/compiler.h>
+#include <linux/cpumask.h>
 #include <asm/semaphore.h>
 
 struct cpu {
@@ -56,9 +58,19 @@ extern struct sysdev_class cpu_sysdev_cl
 extern struct semaphore cpucontrol;
 #define lock_cpu_hotplug()	down(&cpucontrol)
 #define unlock_cpu_hotplug()	up(&cpucontrol)
+int cpu_down(unsigned int cpu);
+#define hotcpu_notifier(fn, pri) {				\
+	static struct notifier_block fn##_nb = { .notifier_call = fn, .priority = pri }; \
+	register_cpu_notifier(&fn##_nb);			\
+}
+#define cpu_is_offline(cpu) unlikely(!cpu_online(cpu))
 #else
 #define lock_cpu_hotplug()	do { } while (0)
 #define unlock_cpu_hotplug()	do { } while (0)
+#define hotcpu_notifier(fn, pri)
+
+/* CPUs don't go offline once they're online w/o CONFIG_HOTPLUG_CPU */
+#define cpu_is_offline(cpu) 0
 #endif
 
 #endif /* _LINUX_CPU_H_ */
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/include/linux/interrupt.h .22517-linux-2.6.2-rc2-mm2.updated/include/linux/interrupt.h
--- .22517-linux-2.6.2-rc2-mm2/include/linux/interrupt.h	2003-09-29 10:26:04.000000000 +1000
+++ .22517-linux-2.6.2-rc2-mm2.updated/include/linux/interrupt.h	2004-02-01 01:08:03.000000000 +1100
@@ -211,6 +211,7 @@ static inline void tasklet_hi_enable(str
 }
 
 extern void tasklet_kill(struct tasklet_struct *t);
+extern void tasklet_kill_immediate(struct tasklet_struct *t, unsigned int cpu);
 extern void tasklet_init(struct tasklet_struct *t,
 			 void (*func)(unsigned long), unsigned long data);
 
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/include/linux/mmzone.h .22517-linux-2.6.2-rc2-mm2.updated/include/linux/mmzone.h
--- .22517-linux-2.6.2-rc2-mm2/include/linux/mmzone.h	2004-01-31 17:28:19.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/include/linux/mmzone.h	2004-02-01 01:08:03.000000000 +1100
@@ -213,6 +213,7 @@ typedef struct pglist_data {
 	int node_id;
 	struct pglist_data *pgdat_next;
 	wait_queue_head_t       kswapd_wait;
+	struct task_struct *kswapd;
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/include/linux/sched.h .22517-linux-2.6.2-rc2-mm2.updated/include/linux/sched.h
--- .22517-linux-2.6.2-rc2-mm2/include/linux/sched.h	2004-01-31 17:28:19.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/include/linux/sched.h	2004-02-01 01:08:03.000000000 +1100
@@ -634,6 +634,10 @@ extern void sched_balance_exec(void);
 #define sched_balance_exec()   {}
 #endif
 
+/* Move tasks off this (offline) CPU onto another. */
+extern void migrate_all_tasks(void);
+/* Try to move me here, if possible. */
+void migrate_to_cpu(int dest_cpu);
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
@@ -1010,6 +1014,8 @@ static inline void set_task_cpu(struct t
 	p->thread_info->cpu = cpu;
 }
 
+/* Wake up a CPU from idle */
+void wake_idle_cpu(unsigned int cpu);
 #else
 
 static inline unsigned int task_cpu(struct task_struct *p)
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/kernel/cpu.c .22517-linux-2.6.2-rc2-mm2.updated/kernel/cpu.c
--- .22517-linux-2.6.2-rc2-mm2/kernel/cpu.c	2004-01-31 17:28:20.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/kernel/cpu.c	2004-02-01 01:08:03.000000000 +1100
@@ -1,14 +1,17 @@
 /* CPU control.
- * (C) 2001 Rusty Russell
+ * (C) 2001, 2002, 2003, 2004 Rusty Russell
+ *
  * This code is licenced under the GPL.
  */
 #include <linux/proc_fs.h>
 #include <linux/smp.h>
 #include <linux/init.h>
-#include <linux/notifier.h>
 #include <linux/sched.h>
 #include <linux/unistd.h>
+#include <linux/kmod.h>		/* for hotplug_path */
 #include <linux/cpu.h>
+#include <linux/module.h>
+#include <linux/notifier.h>
 #include <asm/semaphore.h>
 
 /* This protects CPUs going up and down... */
@@ -19,14 +22,150 @@ static struct notifier_block *cpu_chain;
 /* Need to know about CPUs going up/down? */
 int register_cpu_notifier(struct notifier_block *nb)
 {
-	return notifier_chain_register(&cpu_chain, nb);
+	int ret;
+
+	if ((ret = down_interruptible(&cpucontrol)) != 0)
+		return ret;
+	ret = notifier_chain_register(&cpu_chain, nb);
+	up(&cpucontrol);
+	return ret;
 }
+EXPORT_SYMBOL(register_cpu_notifier);
 
 void unregister_cpu_notifier(struct notifier_block *nb)
 {
-	notifier_chain_unregister(&cpu_chain,nb);
+	down(&cpucontrol);
+	notifier_chain_unregister(&cpu_chain, nb);
+	up(&cpucontrol);
+}
+EXPORT_SYMBOL(unregister_cpu_notifier);
+
+#ifdef CONFIG_HOTPLUG_CPU
+static inline void check_for_tasks(int cpu)
+{
+	struct task_struct *p;
+
+	write_lock_irq(&tasklist_lock);
+	for_each_process(p) {
+		if (task_cpu(p) == cpu)
+			printk(KERN_WARNING "Task %s is on cpu %d, "
+				"not dying\n", p->comm, cpu);
+	}
+	write_unlock_irq(&tasklist_lock);
+}
+
+/* Notify userspace when a cpu event occurs, by running '/sbin/hotplug
+ * cpu' with certain environment variables set.  */
+static int cpu_run_sbin_hotplug(unsigned int cpu, const char *action)
+{
+	char *argv[3], *envp[5], cpu_str[12], action_str[32];
+	int i;
+
+	sprintf(cpu_str, "CPU=%d", cpu);
+	sprintf(action_str, "ACTION=%s", action);
+	/* FIXME: Add DEVPATH. --RR */
+
+	i = 0;
+	argv[i++] = hotplug_path;
+	argv[i++] = "cpu";
+	argv[i] = NULL;
+
+	i = 0;
+	/* minimal command environment */
+	envp [i++] = "HOME=/";
+	envp [i++] = "PATH=/sbin:/bin:/usr/sbin:/usr/bin";
+	envp [i++] = cpu_str;
+	envp [i++] = action_str;
+	envp [i] = NULL;
+
+	return call_usermodehelper(argv[0], argv, envp, 0);
+}
+
+static inline int cpu_down_check(unsigned int cpu, cpumask_t mask)
+{
+	if (!cpu_online(cpu))
+		return -EINVAL;
+
+	cpu_clear(cpu, mask);
+	if (any_online_cpu(mask) == NR_CPUS)
+		return -EBUSY;
+
+	return 0;
 }
 
+static inline int cpu_disable(int cpu)
+{
+	int ret;
+
+	ret = __cpu_disable();
+	if (ret < 0)
+		return ret;
+
+	/* Everyone looking at cpu_online() should be doing so with
+	 * preemption disabled. */
+	synchronize_kernel();
+	BUG_ON(cpu_online(cpu));
+	return 0;
+}
+
+int cpu_down(unsigned int cpu)
+{
+	int err, rc;
+	void *vcpu = (void *)(long)cpu;
+	cpumask_t oldmask, mask;
+
+	if ((err = down_interruptible(&cpucontrol)) != 0) 
+		return err;
+
+	/* There has to be another cpu for this task. */
+	oldmask = current->cpus_allowed;
+	if ((err = cpu_down_check(cpu, oldmask)) != 0)
+		goto out;
+
+	/* Schedule ourselves on the dying CPU. */
+	set_cpus_allowed(current, cpumask_of_cpu(cpu));
+
+	if ((err = cpu_disable(cpu)) != 0)
+		goto out_allowed;
+
+	/* Move other tasks off to other CPUs (simple since they are
+           not running now). */
+	migrate_all_tasks();
+
+	/* Move off dying CPU, which will revert to idle process. */
+	cpus_clear(mask);
+	cpus_complement(mask);
+	cpu_clear(cpu, mask);
+	set_cpus_allowed(current, mask);
+
+	/* Tell kernel threads to go away: they can't fail here. */
+	rc = notifier_call_chain(&cpu_chain, CPU_OFFLINE, vcpu);
+	BUG_ON(rc == NOTIFY_BAD);
+
+	check_for_tasks(cpu);
+
+	/* This actually kills the CPU. */
+	__cpu_die(cpu);
+
+	/* CPU is completely dead: tell everyone.  Too late to complain. */
+	rc = notifier_call_chain(&cpu_chain, CPU_DEAD, vcpu);
+	BUG_ON(rc == NOTIFY_BAD);
+
+	cpu_run_sbin_hotplug(cpu, "offline");
+
+ out_allowed:
+	set_cpus_allowed(current, oldmask);
+ out:
+	up(&cpucontrol);
+	return err;
+}
+#else
+static inline int cpu_run_sbin_hotplug(unsigned int cpu, const char *action)
+{
+	return 0;
+}
+#endif /*CONFIG_HOTPLUG_CPU*/
+
 int __devinit cpu_up(unsigned int cpu)
 {
 	int ret;
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/kernel/kmod.c .22517-linux-2.6.2-rc2-mm2.updated/kernel/kmod.c
--- .22517-linux-2.6.2-rc2-mm2/kernel/kmod.c	2004-01-31 17:28:20.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/kernel/kmod.c	2004-02-01 01:08:03.000000000 +1100
@@ -34,6 +34,7 @@
 #include <linux/security.h>
 #include <linux/mount.h>
 #include <linux/kernel.h>
+#include <linux/cpu.h>
 #include <asm/uaccess.h>
 
 extern int max_threads;
@@ -170,6 +171,12 @@ static int ____call_usermodehelper(void 
 
 	/* We can run anywhere, unlike our parent keventd(). */
 	set_cpus_allowed(current, CPU_MASK_ALL);
+	/* As a kernel thread which was bound to a specific cpu,
+	   migrate_all_tasks wouldn't touch us.  Avoid running child
+	   on dying CPU. */
+	if (cpu_is_offline(smp_processor_id()))
+		migrate_to_cpu(any_online_cpu(CPU_MASK_ALL));
+
 	retval = -EPERM;
 	if (current->fs->root)
 		retval = execve(sub_info->path, sub_info->argv,sub_info->envp);
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/kernel/kthread.c .22517-linux-2.6.2-rc2-mm2.updated/kernel/kthread.c
--- .22517-linux-2.6.2-rc2-mm2/kernel/kthread.c	2004-01-31 17:28:20.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/kernel/kthread.c	2004-02-01 01:08:03.000000000 +1100
@@ -12,6 +12,7 @@
 #include <linux/completion.h>
 #include <linux/err.h>
 #include <linux/unistd.h>
+#include <linux/cpu.h>
 #include <asm/semaphore.h>
 
 struct kthread_create_info
@@ -39,7 +40,7 @@ static int kthread(void *_create)
 	threadfn = create->threadfn;
 	data = create->data;
 
-	/* Block and flush all signals (in case we're not from keventd). */
+	/* Block and flush all signals. */
 	sigfillset(&blocked);
 	sigprocmask(SIG_BLOCK, &blocked, NULL);
 	flush_signals(current);
@@ -47,6 +48,12 @@ static int kthread(void *_create)
 	/* By default we can run anywhere, unlike keventd. */
 	set_cpus_allowed(current, CPU_MASK_ALL);
 
+	/* As a kernel thread which was bound to a specific cpu,
+	   migrate_all_tasks wouldn't touch us.  Avoid running on
+	   dying CPU. */
+	if (cpu_is_offline(smp_processor_id()))
+		migrate_to_cpu(any_online_cpu(CPU_MASK_ALL));
+
 	/* OK, tell user we're spawned, wait for stop or wakeup */
 	__set_current_state(TASK_INTERRUPTIBLE);
 	complete(&create->started);
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/kernel/rcupdate.c .22517-linux-2.6.2-rc2-mm2.updated/kernel/rcupdate.c
--- .22517-linux-2.6.2-rc2-mm2/kernel/rcupdate.c	2003-10-09 18:03:02.000000000 +1000
+++ .22517-linux-2.6.2-rc2-mm2.updated/kernel/rcupdate.c	2004-02-01 01:08:03.000000000 +1100
@@ -154,6 +154,60 @@ out_unlock:
 }
 
 
+#ifdef CONFIG_HOTPLUG_CPU
+
+/* warning! helper for rcu_offline_cpu. do not use elsewhere without reviewing
+ * locking requirements, the list it's pulling from has to belong to a cpu
+ * which is dead and hence not processing interrupts.
+ */
+static void rcu_move_batch(struct list_head *list)
+{
+	struct list_head *entry;
+	int cpu = smp_processor_id();
+
+	local_irq_disable();
+	while (!list_empty(list)) {
+		entry = list->next;
+		list_del(entry);
+		list_add_tail(entry, &RCU_nxtlist(cpu));
+	}
+	local_irq_enable();
+}
+
+static void rcu_offline_cpu(int cpu)
+{
+	/* if the cpu going offline owns the grace period
+	 * we can block indefinitely waiting for it, so flush
+	 * it here
+	 */
+	spin_lock_irq(&rcu_ctrlblk.mutex);
+	if (!rcu_ctrlblk.rcu_cpu_mask)
+		goto unlock;
+
+	cpu_clear(cpu, rcu_ctrlblk.rcu_cpu_mask);
+	if (cpus_empty(rcu_ctrlblk.rcu_cpu_mask)) {
+		rcu_ctrlblk.curbatch++;
+		/* We may avoid calling start batch if
+		 * we are starting the batch only
+		 * because of the DEAD CPU (the current
+		 * CPU will start a new batch anyway for 
+		 * the callbacks we will move to current CPU).
+		 * However, we will avoid this optimisation
+		 * for now.
+		 */
+		rcu_start_batch(rcu_ctrlblk.maxbatch);
+	}
+unlock:
+	spin_unlock_irq(&rcu_ctrlblk.mutex);
+
+	rcu_move_batch(&RCU_curlist(cpu));
+	rcu_move_batch(&RCU_nxtlist(cpu));
+
+	tasklet_kill_immediate(&RCU_tasklet(cpu), cpu);
+}
+
+#endif
+
 /*
  * This does the RCU processing work from tasklet context. 
  */
@@ -214,7 +268,11 @@ static int __devinit rcu_cpu_notify(stru
 	case CPU_UP_PREPARE:
 		rcu_online_cpu(cpu);
 		break;
-	/* Space reserved for CPU_OFFLINE :) */
+#ifdef CONFIG_HOTPLUG_CPU
+	case CPU_DEAD:
+		rcu_offline_cpu(cpu);
+		break;
+#endif
 	default:
 		break;
 	}
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/kernel/sched.c .22517-linux-2.6.2-rc2-mm2.updated/kernel/sched.c
--- .22517-linux-2.6.2-rc2-mm2/kernel/sched.c	2004-01-31 17:28:20.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/kernel/sched.c	2004-02-01 01:08:03.000000000 +1100
@@ -508,6 +508,12 @@ static inline void resched_task(task_t *
 #endif
 }
 
+/* Wake up a CPU from idle */
+void wake_idle_cpu(unsigned int cpu)
+{
+	resched_task(cpu_rq(cpu)->idle);
+}
+
 /**
  * task_curr - is this task currently executing on a CPU?
  * @p: the task in question.
@@ -723,7 +729,8 @@ static int try_to_wake_up(task_t * p, un
 		goto out_activate;
 
 	if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)
-				|| task_running(rq, p)))
+		     || task_running(rq, p)
+		     || cpu_is_offline(this_cpu)))
 		goto out_activate;
 
 	/* Passive load balancing */
@@ -1112,25 +1119,21 @@ enum idle_type
 };
 
 #ifdef CONFIG_SMP
-#ifdef CONFIG_NUMA
-/*
- * If dest_cpu is allowed for this process, migrate the task to it.
- * This is accomplished by forcing the cpu_allowed mask to only
- * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
- * the cpu_allowed mask is restored.
- */
-static void sched_migrate_task(task_t *p, int dest_cpu)
+
+#if defined(CONFIG_NUMA) || defined(CONFIG_HOTPLUG_CPU)
+/* If dest_cpu is allowed for this process, migrate us to it. */
+void migrate_to_cpu(int dest_cpu)
 {
 	runqueue_t *rq;
 	migration_req_t req;
 	unsigned long flags;
 
-	rq = task_rq_lock(p, &flags);
-	if (!cpu_isset(dest_cpu, p->cpus_allowed))
+	rq = task_rq_lock(current, &flags);
+	if (!cpu_isset(dest_cpu, current->cpus_allowed))
 		goto out;
 
 	/* force the process onto the specified CPU */
-	if (migrate_task(p, dest_cpu, &req)) {
+	if (migrate_task(current, dest_cpu, &req)) {
 		/* Need to wait for migration thread. */
 		task_rq_unlock(rq, &flags);
 		wake_up_process(rq->migration_thread);
@@ -1140,7 +1143,9 @@ static void sched_migrate_task(task_t *p
 out:
 	task_rq_unlock(rq, &flags);
 }
+#endif /* CONFIG_NUMA || CONFIG_HOTPLUG_CPU */
 
+#ifdef CONFIG_NUMA
 /*
  * Find the least loaded CPU.  Slightly favor the current CPU by
  * setting its runqueue length as the minimum to start.
@@ -1185,7 +1190,7 @@ void sched_balance_exec(void)
 	if (domain->flags & SD_FLAG_EXEC) {
 		new_cpu = sched_best_cpu(current, domain);
 		if (new_cpu != this_cpu)
-			sched_migrate_task(current, new_cpu);
+			migrate_to_cpu(new_cpu);
 	}
 }
 #endif /* CONFIG_NUMA */
@@ -1597,6 +1602,10 @@ static inline void idle_balance(int this
 {
 	struct sched_domain *domain = this_sched_domain();
 
+	/* CPU going down is a special case: we don't pull more tasks here */
+	if (cpu_is_offline(this_cpu))
+		return;
+
 	do {
 		if (unlikely(!domain->groups))
 			/* hasn't been setup yet */
@@ -1701,6 +1710,10 @@ static void rebalance_tick(int this_cpu,
 	unsigned long j = jiffies + CPU_OFFSET(this_cpu);
 	struct sched_domain *domain = this_sched_domain();
 
+	/* CPU going down is a special case: we don't pull more tasks here */
+	if (cpu_is_offline(this_cpu))
+		return;
+
 	/* Run through all this CPU's domains */
 	do {
 		unsigned long interval;
@@ -2623,11 +2636,13 @@ asmlinkage long sys_sched_setaffinity(pi
 	if (copy_from_user(&new_mask, user_mask_ptr, sizeof(new_mask)))
 		return -EFAULT;
 
+	lock_cpu_hotplug();
 	read_lock(&tasklist_lock);
 
 	p = find_process_by_pid(pid);
 	if (!p) {
 		read_unlock(&tasklist_lock);
+		unlock_cpu_hotplug();
 		return -ESRCH;
 	}
 
@@ -2648,6 +2663,7 @@ asmlinkage long sys_sched_setaffinity(pi
 
 out_unlock:
 	put_task_struct(p);
+	unlock_cpu_hotplug();
 	return retval;
 }
 
@@ -2677,7 +2693,7 @@ asmlinkage long sys_sched_getaffinity(pi
 		goto out_unlock;
 
 	retval = 0;
-	cpus_and(mask, p->cpus_allowed, cpu_online_map);
+	cpus_and(mask, p->cpus_allowed, cpu_possible_map);
 
 out_unlock:
 	read_unlock(&tasklist_lock);
@@ -3058,6 +3074,9 @@ static void __migrate_task(struct task_s
 	/* Affinity changed (again). */
 	if (!cpu_isset(dest_cpu, p->cpus_allowed))
 		goto out;
+	/* CPU went down. */
+	if (cpu_is_offline(dest_cpu))
+		goto out;
 
 	set_task_cpu(p, dest_cpu);
 	if (p->array) {
@@ -3123,6 +3142,72 @@ static int migration_thread(void * data)
 	return 0;
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+/* migrate_all_tasks - function to migrate all the tasks from the
+ * current cpu caller must have already scheduled this to the target
+ * cpu via set_cpus_allowed  */
+void migrate_all_tasks(void)
+{
+	struct task_struct *tsk, *t;
+	int dest_cpu, src_cpu;
+	unsigned int node;
+
+	/* We're nailed to this CPU. */
+	src_cpu = smp_processor_id();
+
+	/* lock out everyone else intentionally */
+	write_lock_irq(&tasklist_lock);
+
+	/* watch out for per node tasks, let's stay on this node */
+	node = cpu_to_node(src_cpu);
+
+	do_each_thread(t, tsk) {
+		cpumask_t mask;
+		if (tsk == current)
+			continue;
+
+		if (task_cpu(tsk) != src_cpu)
+			continue;
+
+		/* Figure out where this task should go (attempting to
+		 * keep it on-node), and check if it can be migrated
+		 * as-is.  NOTE that kernel threads bound to more than
+		 * one online cpu will be migrated. */
+		mask = node_to_cpumask(node);
+		cpus_and(mask, mask, tsk->cpus_allowed);
+		dest_cpu = any_online_cpu(mask);
+		if (dest_cpu == NR_CPUS)
+			dest_cpu = any_online_cpu(tsk->cpus_allowed);
+		if (dest_cpu == NR_CPUS) {
+			/* Kernel threads which are bound to specific
+			 * processors need to look after themselves
+			 * with their own callbacks.  Exiting tasks
+			 * can have mm NULL too. */
+			if (tsk->mm == NULL && !(tsk->flags & PF_EXITING))
+				continue;
+
+			cpus_clear(tsk->cpus_allowed);
+			cpus_complement(tsk->cpus_allowed);
+			dest_cpu = any_online_cpu(tsk->cpus_allowed);
+
+			/* Don't tell them about moving exiting tasks,
+			   since they never leave kernel. */
+			if (!(tsk->flags & PF_EXITING)
+			    && printk_ratelimit())
+				printk(KERN_INFO "process %d (%s) no "
+				       "longer affine to cpu%d\n",
+				       tsk->pid, tsk->comm, src_cpu);
+		}
+
+		get_task_struct(tsk);
+		__migrate_task(tsk, dest_cpu);
+		put_task_struct(tsk);
+	} while_each_thread(t, tsk);
+
+	write_unlock_irq(&tasklist_lock);
+}
+#endif /* CONFIG_HOTPLUG_CPU */
+
 /*
  * migration_call - callback that gets triggered when a CPU is added.
  * Here we can start up the necessary migration thread for the new CPU.
@@ -3145,14 +3230,26 @@ static int migration_call(struct notifie
 		/* Strictly unneccessary, as first user will wake it. */
 		wake_up_process(cpu_rq(cpu)->migration_thread);
 		break;
+#ifdef CONFIG_HOTPLUG_CPU
+	case CPU_UP_CANCELED:
+		/* Unbind it from offline cpu so it can run.  Fall thru. */
+		kthread_bind(cpu_rq(cpu)->migration_thread,smp_processor_id());
+	case CPU_OFFLINE:
+		kthread_stop(cpu_rq(cpu)->migration_thread);
+		cpu_rq(cpu)->migration_thread = NULL;
+		break;
+
+	case CPU_DEAD:
+ 		BUG_ON(cpu_rq(cpu)->nr_running != 0);
+ 		break;
+#endif
 	}
 	return NOTIFY_OK;
 }
 
-/*
- * We want this after the other threads, so they can use set_cpus_allowed
- * from their CPU_OFFLINE callback
- */
+/* Want this after the other threads, so they can use set_cpus_allowed
+ * from their CPU_OFFLINE callback, and also so everyone is off
+ * runqueue so we know it's empty. */
 static struct notifier_block __devinitdata migration_notifier = {
 	.notifier_call = migration_call,
 	.priority = -10,
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/kernel/softirq.c .22517-linux-2.6.2-rc2-mm2.updated/kernel/softirq.c
--- .22517-linux-2.6.2-rc2-mm2/kernel/softirq.c	2004-01-31 17:28:20.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/kernel/softirq.c	2004-02-01 01:08:03.000000000 +1100
@@ -104,7 +104,11 @@ restart:
 		local_irq_disable();
 
 		pending = local_softirq_pending();
-		if (pending && --max_restart)
+		/*
+		 * If ksoftirqd is dead transiently during cpu offlining
+		 * continue to process from interrupt context.
+		 */
+		if (pending && (--max_restart || !__get_cpu_var(ksoftirqd)))
 			goto restart;
 		if (pending)
 			wakeup_softirqd();
@@ -344,6 +348,58 @@ static int ksoftirqd(void * __bind_cpu)
 	return 0;
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+/* 
+ * tasklet_kill_immediate is called to remove a tasklet which can already be
+ * scheduled for execution on @cpu.
+ *
+ * Unlike tasklet_kill, this function removes the tasklet
+ * _immediately_, even if the tasklet is in TASKLET_STATE_SCHED state.
+ *
+ * When this function is called, @cpu must be in the CPU_DEAD state.
+ */
+void tasklet_kill_immediate(struct tasklet_struct *t, unsigned int cpu)
+{
+	struct tasklet_struct **i;
+
+	BUG_ON(cpu_online(cpu));
+	BUG_ON(test_bit(TASKLET_STATE_RUN, &t->state));
+
+	if (!test_bit(TASKLET_STATE_SCHED, &t->state))
+		return;
+
+	/* CPU is dead, so no lock needed. */
+	for (i = &per_cpu(tasklet_vec, cpu).list; *i; i = &(*i)->next) {
+		if (*i == t) {
+			*i = t->next;
+			return;
+		}
+	}
+	BUG();
+}
+
+static void takeover_tasklets(unsigned int cpu)
+{
+	struct tasklet_struct **i;
+
+	/* CPU is dead, so no lock needed. */
+	local_irq_disable();
+
+	/* Find end, append list for that CPU. */
+	for (i = &__get_cpu_var(tasklet_vec).list; *i; i = &(*i)->next);
+	*i = per_cpu(tasklet_vec, cpu).list;
+	per_cpu(tasklet_vec, cpu).list = NULL;
+	raise_softirq_irqoff(TASKLET_SOFTIRQ);
+
+	for (i = &__get_cpu_var(tasklet_hi_vec).list; *i; i = &(*i)->next);
+	*i = per_cpu(tasklet_hi_vec, cpu).list;
+	per_cpu(tasklet_hi_vec, cpu).list = NULL;
+	raise_softirq_irqoff(HI_SOFTIRQ);
+
+	local_irq_enable();
+}
+#endif /* CONFIG_HOTPLUG_CPU */
+
 static int __devinit cpu_callback(struct notifier_block *nfb,
 				  unsigned long action,
 				  void *hcpu)
@@ -367,6 +423,21 @@ static int __devinit cpu_callback(struct
 	case CPU_ONLINE:
 		wake_up_process(per_cpu(ksoftirqd, hotcpu));
 		break;
+#ifdef CONFIG_HOTPLUG_CPU
+	case CPU_UP_CANCELED:
+		/* Unbind so it can run.  Fall thru. */
+		kthread_bind(per_cpu(ksoftirqd, hotcpu), smp_processor_id());
+	case CPU_OFFLINE:
+		p = per_cpu(ksoftirqd, hotcpu);
+		per_cpu(ksoftirqd, hotcpu) = NULL;
+		/* Must set to NULL before interrupt tries to wake it */
+		barrier();
+		kthread_stop(p);
+		break;
+	case CPU_DEAD:
+		takeover_tasklets(hotcpu);
+		break;
+#endif /* CONFIG_HOTPLUG_CPU */
  	}
 	return NOTIFY_OK;
 }
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/kernel/timer.c .22517-linux-2.6.2-rc2-mm2.updated/kernel/timer.c
--- .22517-linux-2.6.2-rc2-mm2/kernel/timer.c	2004-01-26 16:15:09.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/kernel/timer.c	2004-02-01 01:08:03.000000000 +1100
@@ -1223,7 +1223,74 @@ static void __devinit init_timers_cpu(in
 
 	base->timer_jiffies = jiffies;
 }
-	
+
+#ifdef CONFIG_HOTPLUG_CPU
+static int migrate_timer_list(tvec_base_t *new_base, struct list_head *head)
+{
+	struct timer_list *timer;
+
+	while (!list_empty(head)) {
+		timer = list_entry(head->next, struct timer_list, entry);
+		/* We're locking backwards from __mod_timer order here,
+		   beware deadlock. */
+		if (!spin_trylock(&timer->lock))
+			return 0;
+		list_del(&timer->entry);
+		internal_add_timer(new_base, timer);
+		timer->base = new_base;
+		spin_unlock(&timer->lock);
+	}
+	return 1;
+}
+
+static void __devinit migrate_timers(int cpu)
+{
+	tvec_base_t *old_base;
+	tvec_base_t *new_base;
+	int i;
+
+	BUG_ON(cpu_online(cpu));
+	old_base = &per_cpu(tvec_bases, cpu);
+	new_base = &get_cpu_var(tvec_bases);
+	printk("migrate_timers: offlined base %p\n", old_base);
+
+	local_irq_disable();
+again:
+	/* Prevent deadlocks via ordering by old_base < new_base. */
+	if (old_base < new_base) {
+		spin_lock(&new_base->lock);
+		spin_lock(&old_base->lock);
+	} else {
+		spin_lock(&old_base->lock);
+		spin_lock(&new_base->lock);
+	}
+
+	if (old_base->running_timer)
+		BUG();
+	for (i = 0; i < TVR_SIZE; i++)
+		if (!migrate_timer_list(new_base, old_base->tv1.vec + i))
+			goto unlock_again;
+	for (i = 0; i < TVN_SIZE; i++)
+		if (!migrate_timer_list(new_base, old_base->tv2.vec + i)
+		    || !migrate_timer_list(new_base, old_base->tv3.vec + i)
+		    || !migrate_timer_list(new_base, old_base->tv4.vec + i)
+		    || !migrate_timer_list(new_base, old_base->tv5.vec + i))
+			goto unlock_again;
+	spin_unlock(&old_base->lock);
+	spin_unlock(&new_base->lock);
+	local_irq_enable();
+	put_cpu_var(tvec_bases);
+	return;
+
+unlock_again:
+	/* Avoid deadlock with __mod_timer, by backing off. */
+	spin_unlock(&old_base->lock);
+	spin_unlock(&new_base->lock);
+	cpu_relax();
+	goto again;
+}
+#endif /* CONFIG_HOTPLUG_CPU */
+
 static int __devinit timer_cpu_notify(struct notifier_block *self, 
 				unsigned long action, void *hcpu)
 {
@@ -1232,6 +1299,11 @@ static int __devinit timer_cpu_notify(st
 	case CPU_UP_PREPARE:
 		init_timers_cpu(cpu);
 		break;
+#ifdef CONFIG_HOTPLUG_CPU
+	case CPU_DEAD:
+		migrate_timers(cpu);
+		break;
+#endif
 	default:
 		break;
 	}
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/kernel/workqueue.c .22517-linux-2.6.2-rc2-mm2.updated/kernel/workqueue.c
--- .22517-linux-2.6.2-rc2-mm2/kernel/workqueue.c	2004-01-31 17:28:20.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/kernel/workqueue.c	2004-02-01 01:08:03.000000000 +1100
@@ -22,6 +22,8 @@
 #include <linux/completion.h>
 #include <linux/workqueue.h>
 #include <linux/slab.h>
+#include <linux/cpu.h>
+#include <linux/notifier.h>
 #include <linux/kthread.h>
 
 /*
@@ -55,8 +57,22 @@ struct cpu_workqueue_struct {
  */
 struct workqueue_struct {
 	struct cpu_workqueue_struct cpu_wq[NR_CPUS];
+	const char *name;
+	struct list_head list;
 };
 
+#ifdef CONFIG_HOTPLUG_CPU
+/* All the workqueues on the system, for hotplug cpu to add/remove
+   threads to each one as cpus come/go.  Protected by cpucontrol
+   sem. */
+static LIST_HEAD(workqueues);
+#define add_workqueue(wq) list_add(&(wq)->list, &workqueues)
+#define del_workqueue(wq) list_del(&(wq)->list)
+#else
+#define add_workqueue(wq)
+#define del_workqueue(wq)
+#endif /* CONFIG_HOTPLUG_CPU */
+
 /* Preempt must be disabled. */
 static void __queue_work(struct cpu_workqueue_struct *cwq,
 			 struct work_struct *work)
@@ -210,6 +226,7 @@ void flush_workqueue(struct workqueue_st
 
 	might_sleep();
 
+	lock_cpu_hotplug();
 	for (cpu = 0; cpu < NR_CPUS; cpu++) {
 		DEFINE_WAIT(wait);
 		long sequence_needed;
@@ -231,11 +248,10 @@ void flush_workqueue(struct workqueue_st
 		finish_wait(&cwq->work_done, &wait);
 		spin_unlock_irq(&cwq->lock);
 	}
+	unlock_cpu_hotplug();
 }
 
-static int create_workqueue_thread(struct workqueue_struct *wq,
-				   const char *name,
-				   int cpu)
+static int create_workqueue_thread(struct workqueue_struct *wq, int cpu)
 {
 	struct cpu_workqueue_struct *cwq = wq->cpu_wq + cpu;
 	struct task_struct *p;
@@ -249,7 +265,7 @@ static int create_workqueue_thread(struc
 	init_waitqueue_head(&cwq->more_work);
 	init_waitqueue_head(&cwq->work_done);
 
-	p = kthread_create(worker_thread, cwq, "%s/%d", name, cpu);
+	p = kthread_create(worker_thread, cwq, "%s/%d", wq->name, cpu);
 	if (IS_ERR(p))
 		return PTR_ERR(p);
 	cwq->thread = p;
@@ -268,14 +284,19 @@ struct workqueue_struct *create_workqueu
 	if (!wq)
 		return NULL;
 
+	wq->name = name;
+	/* We don't need the distraction of CPUs appearing and vanishing. */
+	lock_cpu_hotplug();
 	for (cpu = 0; cpu < NR_CPUS; cpu++) {
 		if (!cpu_online(cpu))
 			continue;
-		if (create_workqueue_thread(wq, name, cpu) < 0)
+		if (create_workqueue_thread(wq, cpu) < 0)
 			destroy = 1;
 		else
 			wake_up_process(wq->cpu_wq[cpu].thread);
 	}
+	add_workqueue(wq);
+
 	/*
 	 * Was there any error during startup? If yes then clean up:
 	 */
@@ -283,16 +304,23 @@ struct workqueue_struct *create_workqueu
 		destroy_workqueue(wq);
 		wq = NULL;
 	}
+	unlock_cpu_hotplug();
 	return wq;
 }
 
 static void cleanup_workqueue_thread(struct workqueue_struct *wq, int cpu)
 {
 	struct cpu_workqueue_struct *cwq;
+	unsigned long flags;
+	struct task_struct *p;
 
 	cwq = wq->cpu_wq + cpu;
-	if (cwq->thread)
-		kthread_stop(cwq->thread);
+	spin_lock_irqsave(&cwq->lock, flags);
+	p = cwq->thread;
+	cwq->thread = NULL;
+	spin_unlock_irqrestore(&cwq->lock, flags);
+	if (p)
+		kthread_stop(p);
 }
 
 void destroy_workqueue(struct workqueue_struct *wq)
@@ -301,10 +329,14 @@ void destroy_workqueue(struct workqueue_
 
 	flush_workqueue(wq);
 
+	/* We don't need the distraction of CPUs appearing and vanishing. */
+	lock_cpu_hotplug();
 	for (cpu = 0; cpu < NR_CPUS; cpu++) {
 		if (cpu_online(cpu))
 			cleanup_workqueue_thread(wq, cpu);
 	}
+	list_del(&wq->list);
+	unlock_cpu_hotplug();
 	kfree(wq);
 }
 
@@ -345,8 +377,78 @@ int current_is_keventd(void)
 	return 0;
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+/* Take the work from this (downed) CPU. */
+static void take_over_work(struct workqueue_struct *wq, unsigned int cpu)
+{
+	struct cpu_workqueue_struct *cwq = wq->cpu_wq + cpu;
+	LIST_HEAD(list);
+	struct work_struct *work;
+
+	spin_lock_irq(&cwq->lock);
+	printk("Workqueue %s: %i\n", wq->name, list_empty(&cwq->worklist));
+	list_splice_init(&cwq->worklist, &list);
+
+	while (!list_empty(&list)) {
+		printk("Taking work for %s\n", wq->name);
+		work = list_entry(list.next,struct work_struct,entry);
+		list_del(&work->entry);
+		__queue_work(wq->cpu_wq + smp_processor_id(), work);
+	}
+	spin_unlock_irq(&cwq->lock);
+}
+
+/* We're holding the cpucontrol mutex here */
+static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
+				  unsigned long action,
+				  void *hcpu)
+{
+	unsigned int hotcpu = (unsigned long)hcpu;
+	struct workqueue_struct *wq;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+		/* Create a new workqueue thread for it. */
+		list_for_each_entry(wq, &workqueues, list) {
+			if (create_workqueue_thread(wq, hotcpu) < 0) {
+				printk("workqueue for %i failed\n", hotcpu);
+				return NOTIFY_BAD;
+			}
+		}
+		break;
+
+	case CPU_ONLINE:
+		/* Kick off worker threads. */
+		list_for_each_entry(wq, &workqueues, list)
+			wake_up_process(wq->cpu_wq[hotcpu].thread);
+		break;
+
+	case CPU_UP_CANCELED:
+		list_for_each_entry(wq, &workqueues, list) {
+			/* Unbind so it can run. */
+			kthread_bind(wq->cpu_wq[hotcpu].thread,
+				     smp_processor_id());
+			cleanup_workqueue_thread(wq, hotcpu);
+		}
+		break;
+
+	case CPU_OFFLINE:
+		list_for_each_entry(wq, &workqueues, list)
+			take_over_work(wq, hotcpu);
+		list_for_each_entry(wq, &workqueues, list)
+			cleanup_workqueue_thread(wq, hotcpu);
+		break;
+	case CPU_DEAD:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+#endif
+
 void init_workqueues(void)
 {
+	hotcpu_notifier(workqueue_cpu_callback, 0);
 	keventd_wq = create_workqueue("events");
 	BUG_ON(!keventd_wq);
 }
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/mm/page_alloc.c .22517-linux-2.6.2-rc2-mm2.updated/mm/page_alloc.c
--- .22517-linux-2.6.2-rc2-mm2/mm/page_alloc.c	2004-02-01 01:08:00.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/mm/page_alloc.c	2004-02-01 01:08:03.000000000 +1100
@@ -1538,9 +1538,27 @@ struct seq_operations vmstat_op = {
 
 #endif /* CONFIG_PROC_FS */
 
+#ifdef CONFIG_HOTPLUG_CPU
+static int page_alloc_cpu_notify(struct notifier_block *self, 
+				 unsigned long action, void *hcpu)
+{
+	int cpu = (unsigned long)hcpu;
+	long *count;
+
+	if (action == CPU_DEAD) {
+		/* Drain local pagecache count. */
+		count = &per_cpu(nr_pagecache_local, cpu);
+		atomic_add(*count, &nr_pagecache);
+		*count = 0;
+		drain_local_pages(cpu);
+	}
+	return NOTIFY_OK;
+}
+#endif /* CONFIG_HOTPLUG_CPU */
 
 void __init page_alloc_init(void)
 {
+	hotcpu_notifier(page_alloc_cpu_notify, 0);
 }
 
 /*
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/mm/slab.c .22517-linux-2.6.2-rc2-mm2.updated/mm/slab.c
--- .22517-linux-2.6.2-rc2-mm2/mm/slab.c	2004-01-31 17:28:20.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/mm/slab.c	2004-02-01 01:08:03.000000000 +1100
@@ -521,9 +521,19 @@ enum {
 static DEFINE_PER_CPU(struct timer_list, reap_timers);
 
 static void reap_timer_fnc(unsigned long data);
-
+static void free_block (kmem_cache_t* cachep, void** objpp, int len);
 static void enable_cpucache (kmem_cache_t *cachep);
 
+static inline void ** ac_entry(struct array_cache *ac)
+{
+	return (void**)(ac+1);
+}
+
+static inline struct array_cache *ac_data(kmem_cache_t *cachep)
+{
+	return cachep->array[smp_processor_id()];
+}
+
 /* Cal the num objs, wastage, and bytes left over for a given slab size. */
 static void cache_estimate (unsigned long gfporder, size_t size,
 		 int flags, size_t *left_over, unsigned int *num)
@@ -578,27 +588,33 @@ static void start_cpu_timer(int cpu)
 	}
 }
 
-/*
- * Note: if someone calls kmem_cache_alloc() on the new
- * cpu before the cpuup callback had a chance to allocate
- * the head arrays, it will oops.
- * Is CPU_ONLINE early enough?
- */
+#ifdef CONFIG_HOTPLUG_CPU
+static void stop_cpu_timer(int cpu)
+{
+	struct timer_list *rt = &per_cpu(reap_timers, cpu);
+
+	if (rt->function) {
+		del_timer_sync(rt);
+		WARN_ON(timer_pending(rt));
+		rt->function = NULL;
+	}	
+}
+#endif
+
 static int __devinit cpuup_callback(struct notifier_block *nfb,
 				  unsigned long action,
 				  void *hcpu)
 {
 	long cpu = (long)hcpu;
-	struct list_head *p;
+	kmem_cache_t* cachep;
 
 	switch (action) {
 	case CPU_UP_PREPARE:
 		down(&cache_chain_sem);
-		list_for_each(p, &cache_chain) {
+		list_for_each_entry(cachep, &cache_chain, next) {
 			int memsize;
 			struct array_cache *nc;
 
-			kmem_cache_t* cachep = list_entry(p, kmem_cache_t, next);
 			memsize = sizeof(void*)*cachep->limit+sizeof(struct array_cache);
 			nc = kmalloc(memsize, GFP_KERNEL);
 			if (!nc)
@@ -618,22 +634,32 @@ static int __devinit cpuup_callback(stru
 		up(&cache_chain_sem);
 		break;
 	case CPU_ONLINE:
-		if (g_cpucache_up == FULL)
-			start_cpu_timer(cpu);
+		start_cpu_timer(cpu);
 		break;
+
+#ifdef CONFIG_HOTPLUG_CPU
+	case CPU_OFFLINE:
+		stop_cpu_timer(cpu);
+		break;
+
 	case CPU_UP_CANCELED:
+	case CPU_DEAD:
 		down(&cache_chain_sem);
-
-		list_for_each(p, &cache_chain) {
+		list_for_each_entry(cachep, &cache_chain, next) {
 			struct array_cache *nc;
-			kmem_cache_t* cachep = list_entry(p, kmem_cache_t, next);
 
+			spin_lock_irq(&cachep->spinlock);
+			/* cpu is dead; no one can alloc from it. */
 			nc = cachep->array[cpu];
 			cachep->array[cpu] = NULL;
+			cachep->free_limit -= cachep->batchcount;
+			free_block(cachep, ac_entry(nc), nc->avail);
+			spin_unlock_irq(&cachep->spinlock);
 			kfree(nc);
 		}
 		up(&cache_chain_sem);
 		break;
+#endif /* CONFIG_HOTPLUG_CPU */
 	}
 	return NOTIFY_OK;
 bad:
@@ -643,16 +669,6 @@ bad:
 
 static struct notifier_block cpucache_notifier = { &cpuup_callback, NULL, 0 };
 
-static inline void ** ac_entry(struct array_cache *ac)
-{
-	return (void**)(ac+1);
-}
-
-static inline struct array_cache *ac_data(kmem_cache_t *cachep)
-{
-	return cachep->array[smp_processor_id()];
-}
-
 /* Initialisation.
  * Called after the gfp() functions have been enabled, and before smp_init().
  */
@@ -1378,7 +1394,6 @@ static void smp_call_function_all_cpus(v
 	preempt_enable();
 }
 
-static void free_block (kmem_cache_t* cachep, void** objpp, int len);
 static void drain_array_locked(kmem_cache_t* cachep,
 				struct array_cache *ac, int force);
 
@@ -1499,6 +1514,9 @@ int kmem_cache_destroy (kmem_cache_t * c
 		return 1;
 	}
 
+	/* no cpu_online check required here since we clear the percpu
+	 * array on cpu offline and set this to NULL.
+	 */
 	for (i = 0; i < NR_CPUS; i++)
 		kfree(cachep->array[i]);
 
@@ -2590,7 +2608,8 @@ static void reap_timer_fnc(unsigned long
 	struct timer_list *rt = &__get_cpu_var(reap_timers);
 
 	cache_reap();
-	mod_timer(rt, jiffies + REAPTIMEOUT_CPUC + cpu);
+	if (!cpu_is_offline(cpu))
+		mod_timer(rt, jiffies + REAPTIMEOUT_CPUC + cpu);
 }
 
 #ifdef CONFIG_PROC_FS
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/mm/swap.c .22517-linux-2.6.2-rc2-mm2.updated/mm/swap.c
--- .22517-linux-2.6.2-rc2-mm2/mm/swap.c	2004-01-10 13:59:39.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/mm/swap.c	2004-02-01 01:08:03.000000000 +1100
@@ -27,6 +27,9 @@
 #include <linux/module.h>
 #include <linux/percpu_counter.h>
 #include <linux/percpu.h>
+#include <linux/cpu.h>
+#include <linux/notifier.h>
+#include <linux/init.h>
 
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
@@ -381,7 +384,43 @@ void vm_acct_memory(long pages)
 	preempt_enable();
 }
 EXPORT_SYMBOL(vm_acct_memory);
-#endif
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void lru_drain_cache(unsigned int cpu)
+{
+	struct pagevec *pvec = &per_cpu(lru_add_pvecs, cpu);
+
+	/* CPU is dead, so no locking needed. */
+	if (pagevec_count(pvec)) {
+		printk("pagevec_count for %u is %u\n",
+		       cpu, pagevec_count(pvec));
+		__pagevec_lru_add(pvec);
+	}
+	pvec = &per_cpu(lru_add_active_pvecs, cpu);
+	if (pagevec_count(pvec)) {
+		printk("active pagevec_count for %u is %u\n",
+		       cpu, pagevec_count(pvec));
+		__pagevec_lru_add_active(pvec);
+	}
+}
+
+/* Drop the CPU's cached committed space back into the central pool. */
+static int cpu_swap_callback(struct notifier_block *nfb,
+			     unsigned long action,
+			     void *hcpu)
+{
+	long *committed;
+
+	committed = &per_cpu(committed_space, (long)hcpu);
+	if (action == CPU_DEAD) {
+		atomic_add(*committed, &vm_committed_space);
+		*committed = 0;
+		lru_drain_cache((long)hcpu);
+	}
+	return NOTIFY_OK;
+}
+#endif /* CONFIG_HOTPLUG_CPU */
+#endif /* CONFIG_SMP */
 
 #ifdef CONFIG_SMP
 void percpu_counter_mod(struct percpu_counter *fbc, long amount)
@@ -420,4 +459,5 @@ void __init swap_setup(void)
 	 * Right now other parts of the system means that we
 	 * _really_ don't want to cluster much more
 	 */
+	hotcpu_notifier(cpu_swap_callback, 0);
 }
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/mm/vmscan.c .22517-linux-2.6.2-rc2-mm2.updated/mm/vmscan.c
--- .22517-linux-2.6.2-rc2-mm2/mm/vmscan.c	2004-01-31 17:28:20.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/mm/vmscan.c	2004-02-01 01:08:03.000000000 +1100
@@ -30,6 +30,8 @@
 #include <linux/backing-dev.h>
 #include <linux/rmap-locking.h>
 #include <linux/topology.h>
+#include <linux/cpu.h>
+#include <linux/notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
@@ -1105,13 +1107,54 @@ int shrink_all_memory(int nr_pages)
 }
 #endif
 
+#ifdef CONFIG_HOTPLUG_CPU
+/* It's optimal to keep kswapds on the same CPUs as their memory, but
+   not required for correctness.  So if the last cpu in a node goes
+   away, let them run anywhere, and as the first one comes back,
+   restore their cpu bindings. */
+static int __devinit cpu_callback(struct notifier_block *nfb,
+				  unsigned long action,
+				  void *hcpu)
+{
+	pg_data_t *pgdat;
+	unsigned int hotcpu = (unsigned long)hcpu;
+	cpumask_t mask;
+
+	if (action == CPU_OFFLINE) {
+		/* Make sure that kswapd never becomes unschedulable. */
+		for_each_pgdat(pgdat) {
+			mask = node_to_cpumask(pgdat->node_id);
+			if (any_online_cpu(mask) == NR_CPUS) {
+				cpus_complement(mask);
+				set_cpus_allowed(pgdat->kswapd, mask);
+			}
+		}
+	}
+
+	if (action == CPU_ONLINE) {
+		for_each_pgdat(pgdat) {
+			mask = node_to_cpumask(pgdat->node_id);
+			cpu_clear(hotcpu, mask);
+			if (any_online_cpu(mask) == NR_CPUS) {
+				cpu_set(hotcpu, mask);
+				/* One of our CPUs came back: restore mask */
+				set_cpus_allowed(pgdat->kswapd, mask);
+			}
+		}
+	}
+	return NOTIFY_OK;
+}
+#endif /* CONFIG_HOTPLUG_CPU */
+
 static int __init kswapd_init(void)
 {
 	pg_data_t *pgdat;
 	swap_setup();
 	for_each_pgdat(pgdat)
-		kernel_thread(kswapd, pgdat, CLONE_KERNEL);
+		pgdat->kswapd
+		= find_task_by_pid(kernel_thread(kswapd, pgdat, CLONE_KERNEL));
 	total_memory = nr_free_pagecache_pages();
+	hotcpu_notifier(cpu_callback, 0);
 	return 0;
 }
 
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/net/core/dev.c .22517-linux-2.6.2-rc2-mm2.updated/net/core/dev.c
--- .22517-linux-2.6.2-rc2-mm2/net/core/dev.c	2004-01-31 17:28:21.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/net/core/dev.c	2004-02-01 01:08:03.000000000 +1100
@@ -105,6 +105,7 @@
 #include <linux/kmod.h>
 #include <linux/module.h>
 #include <linux/kallsyms.h>
+#include <linux/cpu.h>
 #include <linux/netpoll.h>
 #ifdef CONFIG_NET_RADIO
 #include <linux/wireless.h>		/* Note : will define WIRELESS_EXT */
@@ -3080,6 +3081,52 @@ int unregister_netdevice(struct net_devi
 	return 0;
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+static int dev_cpu_callback(struct notifier_block *nfb,
+			    unsigned long action,
+			    void *ocpu)
+{
+	struct sk_buff **list_skb;
+	struct net_device **list_net;
+	struct sk_buff *skb;
+	unsigned int cpu, oldcpu = (unsigned long)ocpu;
+	struct softnet_data *sd, *oldsd;
+
+	if (action != CPU_DEAD)
+		return NOTIFY_OK;
+
+	local_irq_disable();
+	cpu = smp_processor_id();
+	sd = &per_cpu(softnet_data, cpu);
+	oldsd = &per_cpu(softnet_data, oldcpu);
+
+	/* Find end of our completion_queue. */
+	list_skb = &sd->completion_queue;
+	while (*list_skb)
+		list_skb = &(*list_skb)->next;
+	/* Append completion queue from offline CPU. */
+	*list_skb = oldsd->completion_queue;
+	oldsd->completion_queue = NULL;
+
+	/* Find end of our output_queue. */
+	list_net = &sd->output_queue;
+	while (*list_net)
+		list_net = &(*list_net)->next_sched;
+	/* Append output queue from offline CPU. */
+	*list_net = oldsd->output_queue;
+	oldsd->output_queue = NULL;
+
+	raise_softirq_irqoff(NET_TX_SOFTIRQ);
+	local_irq_enable();
+
+	/* Process offline CPU's input_pkt_queue */
+	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue)))
+		netif_rx(skb);
+
+	return NOTIFY_OK;
+}
+#endif /* CONFIG_HOTPLUG_CPU */
+
 
 /*
  *	Initialize the DEV module. At boot time this walks the device list and
@@ -3144,6 +3191,7 @@ static int __init net_dev_init(void)
 #ifdef CONFIG_NET_SCHED
 	pktsched_init();
 #endif
+	hotcpu_notifier(dev_cpu_callback, 0);
 	rc = 0;
 out:
 	return rc;
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22517-linux-2.6.2-rc2-mm2/net/core/flow.c .22517-linux-2.6.2-rc2-mm2.updated/net/core/flow.c
--- .22517-linux-2.6.2-rc2-mm2/net/core/flow.c	2004-01-31 17:28:21.000000000 +1100
+++ .22517-linux-2.6.2-rc2-mm2.updated/net/core/flow.c	2004-02-01 01:08:03.000000000 +1100
@@ -326,6 +326,17 @@ static void __devinit flow_cache_cpu_pre
 	tasklet_init(tasklet, flow_cache_flush_tasklet, 0);
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+static int flow_cache_cpu(struct notifier_block *nfb,
+			  unsigned long action,
+			  void *hcpu)
+{
+	if (action == CPU_DEAD)
+		__flow_cache_shrink((unsigned long)hcpu, 0);
+	return NOTIFY_OK;
+}
+#endif /* CONFIG_HOTPLUG_CPU */
+
 static int __init flow_cache_init(void)
 {
 	int i;
@@ -350,6 +361,7 @@ static int __init flow_cache_init(void)
 	for_each_cpu(i)
 		flow_cache_cpu_prepare(i);
 
+	hotcpu_notifier(flow_cache_cpu, 0);
 	return 0;
 }
 
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.


* Re: [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core
  2004-01-31 14:16 [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core Rusty Russell
@ 2004-02-01  8:03 ` Andrew Morton
  2004-02-01 10:19   ` Nick Piggin
  2004-02-01 12:07   ` Rusty Russell
  2004-02-02  9:12 ` Ingo Molnar
  1 sibling, 2 replies; 19+ messages in thread
From: Andrew Morton @ 2004-02-01  8:03 UTC (permalink / raw)
  To: Rusty Russell; +Cc: linux-kernel, piggin, dipankar, vatsa, mingo

Rusty Russell <rusty@rustcorp.com.au> wrote:
>
>  The actual CPU patch.  It's big, but almost all of it is under
>  CONFIG_HOTPLUG_CPU, or under macros which have the same effect.

Needs a fixup.



CPU_MASK_ALL and CPU_MASK_NONE may only be used for initialisers.  It
doesn't compile if NR_CPUS>4*BITS_PER_LONG.  Fixes to the cpumask
infrastructure for this remain welcome.
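
Roughly why, as a sketch (the 2.6-era cpumask macro bodies below are
reconstructed from memory and should be treated as an assumption): for
large NR_CPUS the mask is a struct-wrapped array, and CPU_MASK_ALL is a
bare brace-enclosed initialiser list rather than a compound literal, so
it is only legal where an initialiser is.  Hence the static cpumask_t
in the patch below:

	typedef struct {
		unsigned long bits[BITS_TO_LONGS(NR_CPUS)];
	} cpumask_t;

	/* When NR_CPUS > 4*BITS_PER_LONG there is no (cpumask_t) cast
	   in front, so this is just an initialiser list: */
	#define CPU_MASK_ALL \
		{ { [0 ... BITS_TO_LONGS(NR_CPUS)-1] = ~0UL } }

	static cpumask_t all_cpus = CPU_MASK_ALL;	/* OK: initialiser */

	void broken(void)
	{
		set_cpus_allowed(current, CPU_MASK_ALL);  /* parse error */
	}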



---

 25-akpm/kernel/kmod.c    |    5 +++--
 25-akpm/kernel/kthread.c |    5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff -puN kernel/kmod.c~set_cpus_allowed-fix kernel/kmod.c
--- 25/kernel/kmod.c~set_cpus_allowed-fix	Sun Feb  1 02:21:19 2004
+++ 25-akpm/kernel/kmod.c	Sun Feb  1 02:21:19 2004
@@ -159,6 +159,7 @@ struct subprocess_info {
 static int ____call_usermodehelper(void *data)
 {
 	struct subprocess_info *sub_info = data;
+	static cpumask_t all_cpus = CPU_MASK_ALL;
 	int retval;
 
 	/* Unblock all signals. */
@@ -170,12 +171,12 @@ static int ____call_usermodehelper(void 
 	spin_unlock_irq(&current->sighand->siglock);
 
 	/* We can run anywhere, unlike our parent keventd(). */
-	set_cpus_allowed(current, CPU_MASK_ALL);
+	set_cpus_allowed(current, all_cpus);
 	/* As a kernel thread which was bound to a specific cpu,
 	   migrate_all_tasks wouldn't touch us.  Avoid running child
 	   on dying CPU. */
 	if (cpu_is_offline(smp_processor_id()))
-		migrate_to_cpu(any_online_cpu(CPU_MASK_ALL));
+		migrate_to_cpu(any_online_cpu(all_cpus));
 
 	retval = -EPERM;
 	if (current->fs->root)
diff -puN kernel/kthread.c~set_cpus_allowed-fix kernel/kthread.c
--- 25/kernel/kthread.c~set_cpus_allowed-fix	Sun Feb  1 02:21:19 2004
+++ 25-akpm/kernel/kthread.c	Sun Feb  1 02:21:19 2004
@@ -35,6 +35,7 @@ static int kthread(void *_create)
 	void *data;
 	sigset_t blocked;
 	int ret = -EINTR;
+	static cpumask_t all_cpus = CPU_MASK_ALL;
 
 	/* Copy data: it's on keventd's stack */
 	threadfn = create->threadfn;
@@ -46,13 +47,13 @@ static int kthread(void *_create)
 	flush_signals(current);
 
 	/* By default we can run anywhere, unlike keventd. */
-	set_cpus_allowed(current, CPU_MASK_ALL);
+	set_cpus_allowed(current, all_cpus);
 
 	/* As a kernel thread which was bound to a specific cpu,
 	   migrate_all_tasks wouldn't touch us.  Avoid running on
 	   dying CPU. */
 	if (cpu_is_offline(smp_processor_id()))
-		migrate_to_cpu(any_online_cpu(CPU_MASK_ALL));
+		migrate_to_cpu(any_online_cpu(all_cpus));
 
 	/* OK, tell user we're spawned, wait for stop or wakeup */
 	__set_current_state(TASK_INTERRUPTIBLE);

_



* Re: [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core
  2004-02-01  8:03 ` Andrew Morton
@ 2004-02-01 10:19   ` Nick Piggin
  2004-02-01 12:07   ` Rusty Russell
  1 sibling, 0 replies; 19+ messages in thread
From: Nick Piggin @ 2004-02-01 10:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rusty Russell, linux-kernel, dipankar, vatsa, mingo



Andrew Morton wrote:

>Rusty Russell <rusty@rustcorp.com.au> wrote:
>
>> The actual CPU patch.  It's big, but almost all under
>> CONFIG_HOTPLUG_CPU, or macros which have same effect.
>>
>
>Needs a fixup.
>
>
>
>CPU_MASK_ALL and CPU_MASK_NONE may only be used for initialisers.  It
>doesn't compile if NR_CPUS>4*BITS_PER_LONG.  Fixes to the cpumask
>infrastructure for this remain welcome.
>
>

The kernel/sched.c stuff otherwise looks pretty straightforward
at a quick glance. I can't see any fundamental problem with it.



* Re: [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core
  2004-02-01  8:03 ` Andrew Morton
  2004-02-01 10:19   ` Nick Piggin
@ 2004-02-01 12:07   ` Rusty Russell
  1 sibling, 0 replies; 19+ messages in thread
From: Rusty Russell @ 2004-02-01 12:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, piggin, dipankar, vatsa, mingo

In message <20040201000314.233e05a7.akpm@osdl.org> you write:
> CPU_MASK_ALL and CPU_MASK_NONE may only be used for initialisers.  It
> doesn't compile if NR_CPUS>4*BITS_PER_LONG.  Fixes to the cpumask
> infrastructure for this remain welcome.

Of course: *THAT* makes sense.

I dislike the "flexibility" of the cpumask code immensely.

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.


* Re: [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core
  2004-01-31 14:16 [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core Rusty Russell
  2004-02-01  8:03 ` Andrew Morton
@ 2004-02-02  9:12 ` Ingo Molnar
  2004-02-02 10:55   ` Rusty Russell
  1 sibling, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2004-02-02  9:12 UTC (permalink / raw)
  To: Rusty Russell; +Cc: akpm, linux-kernel, Nick Piggin, dipankar, vatsa


On Sun, 1 Feb 2004, Rusty Russell wrote:

> D: - kernel/sched.c: Add wake_idle_cpu() for i386 hotplug.
> D:      
> D:      Check everywhere to make sure we never move tasks onto an
> D:      offline cpu: we'll just be fighting migrate_all_tasks().
> D: 
> D:      Change sched_migrate_task() to migrate_to_cpu and expose it
> D:      for hotplug cpu.
> D: 
> D:      Take hotplug lock in sys_sched_setaffinity.
> D:      
> D:      Return cpus_allowed masked by cpu_possible_map, not
> D:      cpu_online_map in sys_sched_getaffinity, otherwise tasks can't
> D:      know about whether they can run on currently-offline cpus.
> D: 
> D:      Implement migrate_all_tasks() to push tasks off the dying cpu.
> D:      
> D:      Add callbacks to stop migration thread.

the sched.c bits look good. A question: why is the migrate_all_tasks()
code nonatomic? If you run this code in the context of the migration
thread of the downed CPU then all of the migration can be done in one big
atomic locked section which holds all runqueue locks and just moves all
tasks off the current CPU. If it were done this way then eg. the
__migrate_task() race-avoidance check (cpu_is_offline()) could go away and
the whole thing would be more robust i believe.
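
for concreteness, the shape being suggested might look like the
following hypothetical sketch -- rq_of() and move_all_tasks_off_cpu()
are stand-ins for the real scheduler internals, not code from the patch:

        /* Run in the dying CPU's migration thread: take every runqueue
         * lock so nothing can be woken onto or pulled from this CPU,
         * then push all tasks off in one atomic section. */
        static void migrate_all_tasks_atomic(int dying_cpu)
        {
                unsigned long flags;
                int i;

                local_irq_save(flags);
                for (i = 0; i < NR_CPUS; i++)   /* fixed order: no deadlock */
                        if (cpu_online(i))
                                spin_lock(&rq_of(i)->lock);

                /* With every lock held nobody can race with us, so no
                 * cpu_is_offline() recheck in __migrate_task(). */
                move_all_tasks_off_cpu(dying_cpu);

                for (i = NR_CPUS - 1; i >= 0; i--)
                        if (cpu_online(i))
                                spin_unlock(&rq_of(i)->lock);
                local_irq_restore(flags);
        }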

	Ingo


* Re: [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core
  2004-02-02  9:12 ` Ingo Molnar
@ 2004-02-02 10:55   ` Rusty Russell
  2004-02-02 12:45     ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Rusty Russell @ 2004-02-02 10:55 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, linux-kernel, Nick Piggin, dipankar, vatsa

In message <Pine.LNX.4.58.0402020353240.25194@devserv.devel.redhat.com> you write:
> the sched.c bits look good. A question: why is the migrate_all_tasks()
> code nonatomic? If you run this code in the context of the migration
> thread of the downed CPU then all of the migration can be done in one big
> atomic locked section which holds all runqueue locks and just moves all
> tasks off the current CPU. If it were done this way then eg. the
> __migrate_task() race-avoidance check (cpu_is_offline()) could go away and
> the whole thing would be more robust i believe.

Hmm...

I can code it up and we can see what it looks like.

Unfortunately the __migrate_task() check won't go away: someone may
have asked to move from CPU 0 to 1, and by the time migration thread
on 0 gets to the request, 1 has gone down.  We don't want all the
callers to hold the cpucontrol lock, because now the NUMA scheduler
uses migration as a common case 8(

Thanks,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.


* Re: [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core
  2004-02-02 10:55   ` Rusty Russell
@ 2004-02-02 12:45     ` Ingo Molnar
  2004-02-02 13:22       ` Srivatsa Vaddagiri
  2004-02-03  0:34       ` [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core Rusty Russell
  0 siblings, 2 replies; 19+ messages in thread
From: Ingo Molnar @ 2004-02-02 12:45 UTC (permalink / raw)
  To: Rusty Russell; +Cc: akpm, linux-kernel, Nick Piggin, dipankar, vatsa


On Mon, 2 Feb 2004, Rusty Russell wrote:

> Unfortunately the __migrate_task() check won't go away: someone may have
> asked to move from CPU 0 to 1, and by the time migration thread on 0
> gets to the request, 1 has gone down.  We don't want all the callers to
> hold the cpucontrol lock, because now the NUMA scheduler uses migration
> as a common case 8(

well, when a CPU goes down it could process the migration request queue as
well. (this would be a pretty natural thing to do if CPU-down executes in
the migration-thread context.)

Another question is user-space semantics - if user-space relies on CPU
affinity, is the kernel allowed to violate it or should the process be
notified. Sending it a signal (SIGTERM or anything similar) if the
affinity was non-generic might be a good thing to do.

	Ingo


* Re: [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core
  2004-02-02 12:45     ` Ingo Molnar
@ 2004-02-02 13:22       ` Srivatsa Vaddagiri
  2004-02-02 15:40         ` Ingo Molnar
  2004-02-03  0:34       ` [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core Rusty Russell
  1 sibling, 1 reply; 19+ messages in thread
From: Srivatsa Vaddagiri @ 2004-02-02 13:22 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Rusty Russell, akpm, linux-kernel, Nick Piggin, dipankar

On Mon, Feb 02, 2004 at 07:45:32AM -0500, Ingo Molnar wrote:
> 
> On Mon, 2 Feb 2004, Rusty Russell wrote:
> 
> > Unfortunately the __migrate_task() check won't go away: someone may have
> > asked to move from CPU 0 to 1, and by the time migration thread on 0
> > gets to the request, 1 has gone down.  We don't want all the callers to
> > hold the cpucontrol lock, because now the NUMA scheduler uses migration
> > as a common case 8(
> 
> well, when a CPU goes down it could process the migration request queue as
> well. (this would be a pretty natural thing to do if CPU-down executes in
> the migration-thread context.)


The catch is that the migration request will be in CPU 0's queue and
not in CPU 1's, so there is nothing (in CPU 1's migration request
queue) to process when CPU 1 is brought down. By the time the
migration thread on CPU 0 gets around to processing the request,
CPU 1 is already down -- hence the check.


-- 
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017


* Re: [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core
  2004-02-02 13:22       ` Srivatsa Vaddagiri
@ 2004-02-02 15:40         ` Ingo Molnar
  2004-02-03  0:45           ` Rusty Russell
  2004-02-03  7:39           ` New v. v. experimental HOTPLUG CPU megapatch Rusty Russell
  0 siblings, 2 replies; 19+ messages in thread
From: Ingo Molnar @ 2004-02-02 15:40 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Rusty Russell, akpm, linux-kernel, Nick Piggin, dipankar


* Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote:

> The catch is that the migration request will be in CPU 0's queue and
> not in CPU 1's, so there is nothing (in CPU 1's migration request
> queue) to process when CPU 1 is brought down. By the time the
> migration thread on CPU 0 gets around to processing the request,
> CPU 1 is already down -- hence the check.

IMO the semantics of this area are not defined clearly enough, hence
these problems. __migrate_task(cpu) is a wrapper around
set_cpus_allowed(). If a CPU goes away atomically then the only source
of a reference to a non-online CPU is user-space or unsafe uses of
smp_processor_id() [which are also preempt-unsafe so need to be fixed
anyway]. set_cpus_allowed() deals with invalid masks just fine: it does
nothing and returns an error. The following sequence shouldn't be
possible to trigger if CPU-down is an atomic event:

        /* Affinity changed (again). */
        if (!cpu_isset(dest_cpu, p->cpus_allowed))
                goto out;
        /* CPU went down. */
        if (cpu_is_offline(dest_cpu))
                goto out;

because p->cpus_allowed is fixed up atomically. If this task is in
CPU1's migration queue then the CPU-down mechanism takes care of it. If
this task is in CPU0's migration queue then the task will be migrated
off CPU1 by the CPU-down mechanism and CPU0's migration won't have a
lot of work left.

kernel-space threads that rely on running on a specific CPU need the
callback mechanism on CPU-down, so there's no problem with them.

user-space tasks that rely on running on a specific CPU need a callback
too. (probably in the form of a signal, which, if unhandled, terminates
the task.) Eg. if a webserver has a mode to run one thread per CPU, then
the server needs to adapt to the new situation when a CPU goes away. We
cannot just unilaterally migrate a task and violate its affinity.

another thing: if the migrate-irqs op is done atomically too (together
with the migrate-tasks op) then the special-cases in idle_balance() and
rebalance_tick() could go away too.

	Ingo


* Re: [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core
  2004-02-02 12:45     ` Ingo Molnar
  2004-02-02 13:22       ` Srivatsa Vaddagiri
@ 2004-02-03  0:34       ` Rusty Russell
  2004-02-03  9:26         ` Ingo Molnar
  1 sibling, 1 reply; 19+ messages in thread
From: Rusty Russell @ 2004-02-03  0:34 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, linux-kernel, Nick Piggin, dipankar, vatsa

In message <Pine.LNX.4.58.0402020741250.16748@devserv.devel.redhat.com> you write:
> 
> On Mon, 2 Feb 2004, Rusty Russell wrote:
> 
> > Unfortunately the __migrate_task() check won't go away: someone may have
> > asked to move from CPU 0 to 1, and by the time migration thread on 0
> > gets to the request, 1 has gone down.  We don't want all the callers to
> > hold the cpucontrol lock, because now the NUMA scheduler uses migration
> > as a common case 8(
> 
> well, when a CPU goes down it could process the migration request queue as
> well. (this would be a pretty natural thing to do if CPU-down executes in
> the migration-thread context.)

Wrong migration thread.  The migration thread on CPU 1 has been asked
to push into CPU 0, which is now going down.

Now I've slept on the "do it atomically" idea, I think it's a good
one.  I've even worked out how to maintain the "last thread on CPU is
the idle thread", although I'd need to test in code.

> Another question is user-space semantics - if user-space relies on CPU
> affinity, is the kernel allowed to violate it or should the process be
> notified. Sending it a signal (SIGTERM or anything similar) if the
> affinity was non-generic might be a good thing to do.

We've been there: we used to send SIGPWR with a new si_info field to
say what CPU it was.  But it turns out not to be useful, so we took it
out.

In practice, any app which wants to scale with # CPUs needs to know
when CPUs are coming up, as well as going down.  Ditto memory, etc.
This means they need to listen for the hotplug event (DBUS anyone?),
or we introduce a SIGRECONF (default ignored).  But AFAICT,
introducing a new signal isn't possible (at least on x86) without
breaking glibc.

Thanks!
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.


* Re: [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core
  2004-02-02 15:40         ` Ingo Molnar
@ 2004-02-03  0:45           ` Rusty Russell
  2004-02-03  8:04             ` Ingo Molnar
  2004-02-03  7:39           ` New v. v. experimental HOTPLUG CPU megapatch Rusty Russell
  1 sibling, 1 reply; 19+ messages in thread
From: Rusty Russell @ 2004-02-03  0:45 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Srivatsa Vaddagiri, akpm, linux-kernel, Nick Piggin, dipankar

In message <20040202154040.GA5895@elte.hu> you write:
> user-space tasks that rely on running on a specific CPU need a callback
> too. (probably in the form of a signal, which, if unhandled, terminates
> the task.) Eg. if a webserver has a mode to run one thread per CPU, then
> the server needs to adapt to the new situation when a CPU goes away. We
> cannot just unilaterally migrate a task and violate its affinity.

Well, that's what we'd do anyway, to deliver the signal.

This terminating signal idea is simply flawed: affinity is inherited,
so you're killing a process which knows nothing anyway.

If we can't do it well, leave it to userspace to sort out 8)

> another thing: if the migrate-irqs op is done atomically too (together
> with the migrate-tasks op) then the special-cases in idle_balance() and
> rebalance_tick() could go away too.

Exactly.  It simplifies a number of things.

Thanks,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.


* New v. v. experimental HOTPLUG CPU megapatch.
  2004-02-02 15:40         ` Ingo Molnar
  2004-02-03  0:45           ` Rusty Russell
@ 2004-02-03  7:39           ` Rusty Russell
  2004-02-03  9:35             ` Ingo Molnar
  2004-02-05 18:12             ` Pavel Machek
  1 sibling, 2 replies; 19+ messages in thread
From: Rusty Russell @ 2004-02-03  7:39 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Srivatsa Vaddagiri, akpm, linux-kernel, Nick Piggin, dipankar

[-- Attachment #1: Type: text/plain, Size: 1470 bytes --]

This is my first cut of a patch, still has some old code in it.  As an
attachment since it's 70k (I'll split into multiple parts later, this
is the x86 part, too).

Patch against 2.6.2-rc2-mm2.  Works basically, gives "APIC error on
CPU1: 08(08)" under stress.  Clues welcome.

Basically consists of:

1) New file stop_machine.[ch] which takes logic out of module.c (I
   haven't converted module.c code over yet though; see the sketch
   after this list).

2) x86: arch_cpu_down_check (called before offlining) and arch_cpu_down
   (called when machine stopped, moves irqs).

3) x86: idle loop code to play dead.

4) migrate_all_tasks() called with machine stopped, and migrates
   kernel threads as well.

5) cpu.c fires off a thread to do the dirty work: it schedules with
   interrupts still disabled on the dead cpu.

6) Most threads are happy to run on "wrong" CPUs, but the slab.c reap
   timer needs a little fixing, and still needs to stop when the CPU
   goes offline.  softirq threads are tied to a CPU: I just hacked in
   a check so they do nothing if the CPU is offline.  Moving the
   migration thread is safe since it should have nothing to do.

7) Ugly change to finish_arch_switch so it doesn't re-enable
   interrupts if the CPU is down (switching from take_cpu_down kthread
   to idle task).  __migrate_task() still needs check for cpu down,
   AFAICT.
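
As a rough sketch of how (1), (4) and (5) fit together -- assuming a
stop_machine_run(fn, data, cpu) entry point, with take_cpu_down()
purely illustrative, not the attached patch's actual code:

        /* Runs on the target CPU while every other CPU spins with
         * interrupts disabled, so the offline transition and the task
         * migration happen in one atomic section. */
        static int take_cpu_down(void *unused)
        {
                cpu_clear(smp_processor_id(), cpu_online_map);
                migrate_all_tasks();    /* kernel threads too, per (4) */
                return 0;
        }

        /* ...and from cpu_down(), per (5): */
        err = stop_machine_run(take_cpu_down, NULL, cpu);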

Given it was about a day's work, I'm happy it works at all...

Cheers,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.


[-- Attachment #2: hotcpu-atomic-test.patch.bz2 --]
[-- Type: application/octet-stream, Size: 17121 bytes --]


* Re: [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core
  2004-02-03  0:45           ` Rusty Russell
@ 2004-02-03  8:04             ` Ingo Molnar
  2004-02-03  8:16               ` Rusty Russell
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2004-02-03  8:04 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Srivatsa Vaddagiri, akpm, linux-kernel, Nick Piggin, dipankar


* Rusty Russell <rusty@rustcorp.com.au> wrote:

> This terminating signal idea is simply flawed: affinity is inherited,
> so you're killing a process which knows nothing anyway.
> 
> If we can't do it well, leave it to userspace to sort out 8)

yes, but currently there's no mechanism for userspace to get notified -
hence no clean way for userspace to sort it out. (other than userspace
continuously polling the CPU mask - bleh.) But this is a separate issue.

	Ingo


* Re: [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core
  2004-02-03  8:04             ` Ingo Molnar
@ 2004-02-03  8:16               ` Rusty Russell
  2004-02-03  8:31                 ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Rusty Russell @ 2004-02-03  8:16 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Srivatsa Vaddagiri, akpm, linux-kernel, Nick Piggin, dipankar

In message <20040203080436.GA374@elte.hu> you write:
> 
> * Rusty Russell <rusty@rustcorp.com.au> wrote:
> 
> > This terminating signal idea is simply flawed: affinity is inherited,
> > so you're killing a process which knows nothing anyway.
> > 
> > If we can't do it well, leave it to userspace to sort out 8)
> 
> yes, but currently there's no mechanism for userspace to get notified -
> hence no clean way for userspace to sort it out. (other than userspace
> continuously polling the CPU mask - bleh.) But this is a separate issue.

We use /sbin/hotplug at the moment.  I'm hoping that DBUS will turn
this from "possible" into "easy" some day...

Thanks,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.


* Re: [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core
  2004-02-03  8:16               ` Rusty Russell
@ 2004-02-03  8:31                 ` Ingo Molnar
  0 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2004-02-03  8:31 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Srivatsa Vaddagiri, akpm, linux-kernel, Nick Piggin, dipankar


* Rusty Russell <rusty@rustcorp.com.au> wrote:

> > > If we can't do it well, leave it to userspace to sort out 8)
> > 
> > yes, but currently there's no mechanism for userspace to get notified -
> > hence no clean way for userspace to sort it out. (other than userspace
> > continuously polling the CPU mask - bleh.) But this is a separate issue.
> 
> We use /sbin/hotplug at the moment.  I'm hoping that DBUS will turn
> this from "possible" into "easy" some day...

ok. Anyway, it should be relatively rare for userspace to care about
this.

	Ingo


* Re: [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core
  2004-02-03  0:34       ` [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core Rusty Russell
@ 2004-02-03  9:26         ` Ingo Molnar
  2004-02-04  0:19           ` Rusty Russell
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2004-02-03  9:26 UTC (permalink / raw)
  To: Rusty Russell; +Cc: akpm, linux-kernel, Nick Piggin, dipankar, vatsa


On Tue, 3 Feb 2004, Rusty Russell wrote:

> Wrong migration thread.  The migration thread on CPU 1 has been asked to
> push into CPU 0, which is now going down.

you are right - this one place would have to be aware of CPUs going away.

> Now I've slept on the "do it atomically" idea, I think it's a good one.  
> I've even worked out how to maintain the "last thread on CPU is the idle
> thread", although I'd need to test in code.

good!

> In practice, any app which wants to scale with # CPUs needs to know when
> CPUs are coming up, as well as going down.  Ditto memory, etc. This
> means they need to listen for the hotplug event (DBUS anyone?), or we
> introduce a SIGRECONF (default ignored).  But AFAICT, introducing a new
> signal isn't possible (at least on x86) without breaking glibc.

you can introduce a variable RT signal for this purpose no problem - and
one would want to have a queued signal for this anyway. Ie. by default the
notification is disabled, but a new syscall sets up the process to be
notified of CPU up/down events, on a signal # picked by the app. But this
indeed is dbus domain ...
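
a hypothetical userspace sketch of that shape -- the registration
syscall does not exist, only the RT-signal plumbing is real:

        #include <signal.h>
        #include <unistd.h>

        static void cpu_change(int sig)
        {
                /* re-read the CPU set; respawn or retire per-CPU workers */
        }

        int main(void)
        {
                int sig = SIGRTMIN + 4;         /* app-chosen RT signal # */

                signal(sig, cpu_change);
                /* a (hypothetical, nonexistent) syscall would go here to
                 * ask the kernel to queue `sig' on CPU up/down events */
                for (;;)
                        pause();
        }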

	Ingo


* Re: New v. v. experimental HOTPLUG CPU megapatch.
  2004-02-03  7:39           ` New v. v. experimental HOTPLUG CPU megapatch Rusty Russell
@ 2004-02-03  9:35             ` Ingo Molnar
  2004-02-05 18:12             ` Pavel Machek
  1 sibling, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2004-02-03  9:35 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Srivatsa Vaddagiri, akpm, linux-kernel, Nick Piggin, dipankar


* Rusty Russell <rusty@rustcorp.com.au> wrote:

> Patch against 2.6.2-rc2-mm2.  Works basically, gives "APIC error on
> CPU1: 08(08)" under stress.  Clues welcome.

APIC error 08 is a receive error. Ie. most likely there was a pending IPI
(or pending hwirq) to that CPU but the CPU was zapped and the APIC
reset. I'd suggest adding "sti;nop;cli" instructions after the IO-APIC
masks have been redirected [note the nop - the interrupt-enable boundary
on x86 is two instructions from sti] - to flush out pending hardirqs and
IPIs. After this point nothing is supposed to reach this CPU. Enabling
irqs at this point should not cause any races, because you do this
first, right?
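
as a concrete (hedged) sketch of that flush, run on the dying CPU after
the IO-APIC entries have been pointed away from it:

        /* Open the interrupt window for exactly one instruction so any
         * pending IPI/hardirq gets taken and acked before the local
         * APIC is shut down.  The nop matters: the window sti opens
         * only takes effect after the following instruction, so a bare
         * "sti; cli" would never accept anything. */
        __asm__ __volatile__("sti; nop; cli" : : : "memory");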

the pending cross-CPU-IPI case should not happen if the infrastructure
is correct, external hardirqs are not an issue unless it's an
edge-triggered device. So the worst-case with your current code could be
a lost timer IRQ or a lost edge-triggered PCI irq (old ne2k cards).

	Ingo


* Re: [PATCH 3/4] 2.6.2-rc2-mm2 CPU Hotplug: The Core
  2004-02-03  9:26         ` Ingo Molnar
@ 2004-02-04  0:19           ` Rusty Russell
  0 siblings, 0 replies; 19+ messages in thread
From: Rusty Russell @ 2004-02-04  0:19 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, linux-kernel, Nick Piggin, dipankar, vatsa

In message <Pine.LNX.4.58.0402030418310.22596@devserv.devel.redhat.com> you write:
> > introduce a SIGRECONF (default ignored).  But AFAICT, introducing a new
> > signal isn't possible (at least on x86) without breaking glibc.
> 
> you can introduce a variable RT signal for this purpose no problem - and
> one would want to have a queued signal for this anyway. Ie. by default the
> notification is disabled, but a new syscall sets up the process to be
> notified of CPU up/down events, on a signal # picked by the app. But this
> this indeed is dbus domain ...

Yeah, this workaround for not being able to add a new signal isn't
very good, though ("I can't add a new signal, so you choose!").  In
practice, it sucks for libraries and it sucks for children.

As you point out, DBUS is a better solution anyway.  Of course,
proposing that something which doesn't yet exist will solve all our
problems is usually a sign of laziness or optimism 8).

Cheers,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.


* Re: New v. v. experimental HOTPLUG CPU megapatch.
  2004-02-03  7:39           ` New v. v. experimental HOTPLUG CPU megapatch Rusty Russell
  2004-02-03  9:35             ` Ingo Molnar
@ 2004-02-05 18:12             ` Pavel Machek
  1 sibling, 0 replies; 19+ messages in thread
From: Pavel Machek @ 2004-02-05 18:12 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Ingo Molnar, Srivatsa Vaddagiri, akpm, linux-kernel, Nick Piggin,
	dipankar

Hi!

> This is my first cut of a patch, still has some old code in it.  As an
> attachment since it's 70k (I'll split into multiple parts later, this
> is the x86 part, too).
> 
> Patch against 2.6.2-rc2-mm2.  Works basically, gives "APIC error on
> CPU1: 08(08)" under stress.  Clues welcome.
> 
> Basically consists of:
> 
> 1) New file stop_machine.[ch] which takes logic out of module.c (I
>    haven't converted module.c code over yet though).

I like this one. Something like this will be handy for SMP sleep
support.

								Pavel
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

