* [PATCHSET tip/sched/core] sched: concurrency managed workqueue related sched patches
@ 2009-12-02  3:56 Tejun Heo
  2009-12-02  3:56 ` [PATCH 1/7] sched: revert 498657a478c60be092208422fefa9c7b248729c2 Tejun Heo
                   ` (6 more replies)
  0 siblings, 7 replies; 47+ messages in thread
From: Tejun Heo @ 2009-12-02  3:56 UTC (permalink / raw)
  To: tglx, mingo, avi, peterz, efault, rusty, linux-kernel

Hello, all.

This patchset contains sched patches necessary for concurrency managed
workqueue and is composed of two parts.  The first four patches are to
be applied to sched/core and, other than the implementation of
force_cpus_allowed(), make no functional changes.  The purpose of this
part is to reduce unnecessary conflicts between ongoing scheduler
development and the temporary scheduler notifier updates added by the
next three patches.

The next three patches are to be applied to a separate branch.  This
latter part adds sched_notifiers for wakeup and sleep and
try_to_wake_up_local(), which can be called from those notifiers.
These changes are not acked by Ingo.  For them to go upstream, the
scheduler notifier framework needs to be unified and cleaned up.
However, merging these scheduler changes in a separate stable branch
which will only be pulled into linux-next and other testing trees
gives c-m-wq a stable base to be developed and tested on.

This patchset contains the following patches.

 0001-sched-revert-498657a478c60be092208422fefa9c7b248729c.patch
 0002-sched-rename-preempt_notifiers-to-sched_notifiers-an.patch
 0003-sched-refactor-try_to_wake_up.patch
 0004-sched-implement-force_cpus_allowed.patch
 0005-sched-make-sched_notifiers-unconditional.patch
 0006-sched-add-wakeup-sleep-sched_notifiers-and-allow-NUL.patch
 0007-sched-implement-try_to_wake_up_local.patch

0001 reverts the mis-fix to the sched_in notifier.  0002 and 0003
rename and move code around.  These two patches cause either no
difference or only a very minor difference in the final binary output.
0004 implements force_cpus_allowed().  Although force_cpus_allowed()
won't be used upstream yet, the implementation touches various parts
of sched.c, so I think it's better to include it in the upstream code
earlier.

0005-0007 make sched_notifiers unconditional, add wakeup/sleep
notifiers and implement try_to_wake_up_local().  As said above, these
three can live in a separate branch which won't be pushed upstream
until the sched notifier framework is improved.

These patches are available in the following git trees.

 First four:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git sched-core-for-ingo

 Rest:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git sched-wq-for-ingo

and contain the following changes.

 First four:

 arch/ia64/kvm/Kconfig    |    2 
 arch/powerpc/kvm/Kconfig |    2 
 arch/s390/kvm/Kconfig    |    2 
 arch/x86/kvm/Kconfig     |    2 
 include/linux/kvm_host.h |    4 
 include/linux/preempt.h  |   43 ------
 include/linux/sched.h    |   60 ++++++++-
 init/Kconfig             |    2 
 kernel/sched.c           |  309 +++++++++++++++++++++++++----------------------
 virt/kvm/kvm_main.c      |   26 +--
 10 files changed, 243 insertions(+), 209 deletions(-)

 Rest:

 arch/ia64/kvm/Kconfig    |    1 
 arch/powerpc/kvm/Kconfig |    1 
 arch/s390/kvm/Kconfig    |    1 
 arch/x86/kvm/Kconfig     |    1 
 include/linux/kvm_host.h |    2 -
 include/linux/sched.h    |   14 ++++----
 init/Kconfig             |    4 --
 kernel/sched.c           |   80 +++++++++++++++++++++++++++++++++++++----------
 8 files changed, 72 insertions(+), 32 deletions(-)

Thanks.

--
tejun


* [PATCH 1/7] sched: revert 498657a478c60be092208422fefa9c7b248729c2
  2009-12-02  3:56 [PATCHSET tip/sched/core] sched: concurrency managed workqueue related sched patches Tejun Heo
@ 2009-12-02  3:56 ` Tejun Heo
  2009-12-02 10:42   ` [tip:sched/core] sched: Revert 498657a478c60be092208422fefa9c7b248729c2 tip-bot for Tejun Heo
  2009-12-02  3:56 ` [PATCH 2/7] sched: rename preempt_notifiers to sched_notifiers and refactor implementation Tejun Heo
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2009-12-02  3:56 UTC (permalink / raw)
  To: tglx, mingo, avi, peterz, efault, rusty, linux-kernel; +Cc: Tejun Heo

498657a478c60be092208422fefa9c7b248729c2 incorrectly assumed that
preempt wasn't disabled around context_switch() and thus was fixing an
imaginary problem.  It also broke kvm, which depends on ->sched_in()
being called with irqs enabled so that it can do smp calls from there.

Revert the incorrect commit and add a comment describing the different
contexts under which the two callbacks are invoked.

Avi: spotted transposed in/out in the added comment.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Avi Kivity <avi@redhat.com>
---
 include/linux/preempt.h |    5 +++++
 kernel/sched.c          |    2 +-
 2 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 72b1a10..2e681d9 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -105,6 +105,11 @@ struct preempt_notifier;
  * @sched_out: we've just been preempted
  *    notifier: struct preempt_notifier for the task being preempted
  *    next: the task that's kicking us out
+ *
+ * Please note that sched_in and out are called under different
+ * contexts.  sched_out is called with rq lock held and irq disabled
+ * while sched_in is called without rq lock and irq enabled.  This
+ * difference is intentional and depended upon by its users.
  */
 struct preempt_ops {
 	void (*sched_in)(struct preempt_notifier *notifier, int cpu);
diff --git a/kernel/sched.c b/kernel/sched.c
index b3d4e2b..1031cae 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2768,9 +2768,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	prev_state = prev->state;
 	finish_arch_switch(prev);
 	perf_event_task_sched_in(current, cpu_of(rq));
-	fire_sched_in_preempt_notifiers(current);
 	finish_lock_switch(rq, prev);
 
+	fire_sched_in_preempt_notifiers(current);
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
-- 
1.6.4.2



* [PATCH 2/7] sched: rename preempt_notifiers to sched_notifiers and refactor implementation
  2009-12-02  3:56 [PATCHSET tip/sched/core] sched: concurrency managed workqueue related sched patches Tejun Heo
  2009-12-02  3:56 ` [PATCH 1/7] sched: revert 498657a478c60be092208422fefa9c7b248729c2 Tejun Heo
@ 2009-12-02  3:56 ` Tejun Heo
  2009-12-02  3:56 ` [PATCH 3/7] sched: refactor try_to_wake_up() Tejun Heo
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 47+ messages in thread
From: Tejun Heo @ 2009-12-02  3:56 UTC (permalink / raw)
  To: tglx, mingo, avi, peterz, efault, rusty, linux-kernel; +Cc: Tejun Heo

Rename preempt_notifiers to sched_notifiers and move the definitions
to sched.h.  Also, refactor the implementation in sched.c so that
adding new callbacks is easier.

This patch does not introduce any functional change and in fact
generates the same binary, at least with my configuration (x86_64 SMP,
kvm and some debug options).
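
For reference, here is a minimal usage sketch of the renamed API as
defined by this patch; it mirrors the kvm conversion below.  The my_*
names are made up for illustration and are not part of the patch.

	#include <linux/sched.h>

	struct my_ctx {
		struct sched_notifier notifier;
		int last_cpu;
	};

	static void my_in(struct sched_notifier *sn, int cpu)
	{
		struct my_ctx *ctx = container_of(sn, struct my_ctx, notifier);

		/* current has just been scheduled in on @cpu */
		ctx->last_cpu = cpu;
	}

	static void my_out(struct sched_notifier *sn, struct task_struct *next)
	{
		/* current is being scheduled out in favor of @next */
	}

	static struct sched_notifier_ops my_ops = {
		.in	= my_in,
		.out	= my_out,
	};

	/* notifiers are per-task, so this must run in the interested task */
	static void my_attach(struct my_ctx *ctx)
	{
		sched_notifier_init(&ctx->notifier, &my_ops);
		sched_notifier_register(&ctx->notifier);
	}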

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 arch/ia64/kvm/Kconfig    |    2 +-
 arch/powerpc/kvm/Kconfig |    2 +-
 arch/s390/kvm/Kconfig    |    2 +-
 arch/x86/kvm/Kconfig     |    2 +-
 include/linux/kvm_host.h |    4 +-
 include/linux/preempt.h  |   48 --------------------
 include/linux/sched.h    |   53 +++++++++++++++++++++-
 init/Kconfig             |    2 +-
 kernel/sched.c           |  108 +++++++++++++++++++---------------------------
 virt/kvm/kvm_main.c      |   26 +++++------
 10 files changed, 113 insertions(+), 136 deletions(-)

diff --git a/arch/ia64/kvm/Kconfig b/arch/ia64/kvm/Kconfig
index ef3e7be..a38b72e 100644
--- a/arch/ia64/kvm/Kconfig
+++ b/arch/ia64/kvm/Kconfig
@@ -22,7 +22,7 @@ config KVM
 	depends on HAVE_KVM && MODULES && EXPERIMENTAL
 	# for device assignment:
 	depends on PCI
-	select PREEMPT_NOTIFIERS
+	select SCHED_NOTIFIERS
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
 	select KVM_APIC_ARCHITECTURE
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index c299268..6447702 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -18,7 +18,7 @@ if VIRTUALIZATION
 
 config KVM
 	bool
-	select PREEMPT_NOTIFIERS
+	select SCHED_NOTIFIERS
 	select ANON_INODES
 
 config KVM_440
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index bf164fc..3bc7045 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -18,7 +18,7 @@ if VIRTUALIZATION
 config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM && EXPERIMENTAL
-	select PREEMPT_NOTIFIERS
+	select SCHED_NOTIFIERS
 	select ANON_INODES
 	select S390_SWITCH_AMODE
 	---help---
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b84e571..23609bd 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -22,7 +22,7 @@ config KVM
 	depends on HAVE_KVM
 	# for device assignment:
 	depends on PCI
-	select PREEMPT_NOTIFIERS
+	select SCHED_NOTIFIERS
 	select MMU_NOTIFIER
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b7bbb5d..d05fadc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -74,8 +74,8 @@ void kvm_io_bus_unregister_dev(struct kvm *kvm, struct kvm_io_bus *bus,
 
 struct kvm_vcpu {
 	struct kvm *kvm;
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-	struct preempt_notifier preempt_notifier;
+#ifdef CONFIG_SCHED_NOTIFIERS
+	struct sched_notifier sched_notifier;
 #endif
 	int vcpu_id;
 	struct mutex mutex;
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 2e681d9..538c675 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -93,52 +93,4 @@ do { \
 
 #endif
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-
-struct preempt_notifier;
-
-/**
- * preempt_ops - notifiers called when a task is preempted and rescheduled
- * @sched_in: we're about to be rescheduled:
- *    notifier: struct preempt_notifier for the task being scheduled
- *    cpu:  cpu we're scheduled on
- * @sched_out: we've just been preempted
- *    notifier: struct preempt_notifier for the task being preempted
- *    next: the task that's kicking us out
- *
- * Please note that sched_in and out are called under different
- * contexts.  sched_out is called with rq lock held and irq disabled
- * while sched_in is called without rq lock and irq enabled.  This
- * difference is intentional and depended upon by its users.
- */
-struct preempt_ops {
-	void (*sched_in)(struct preempt_notifier *notifier, int cpu);
-	void (*sched_out)(struct preempt_notifier *notifier,
-			  struct task_struct *next);
-};
-
-/**
- * preempt_notifier - key for installing preemption notifiers
- * @link: internal use
- * @ops: defines the notifier functions to be called
- *
- * Usually used in conjunction with container_of().
- */
-struct preempt_notifier {
-	struct hlist_node link;
-	struct preempt_ops *ops;
-};
-
-void preempt_notifier_register(struct preempt_notifier *notifier);
-void preempt_notifier_unregister(struct preempt_notifier *notifier);
-
-static inline void preempt_notifier_init(struct preempt_notifier *notifier,
-				     struct preempt_ops *ops)
-{
-	INIT_HLIST_NODE(&notifier->link);
-	notifier->ops = ops;
-}
-
-#endif
-
 #endif /* __LINUX_PREEMPT_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0395b0f..a6f64cb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1210,6 +1210,53 @@ struct sched_rt_entity {
 #endif
 };
 
+#ifdef CONFIG_SCHED_NOTIFIERS
+
+struct sched_notifier;
+
+/**
+ * sched_notifier_ops - notifiers called for scheduling events
+ * @in: we're about to be rescheduled:
+ *    notifier: struct sched_notifier for the task being scheduled
+ *    cpu:  cpu we're scheduled on
+ * @out: we've just been preempted
+ *    notifier: struct sched_notifier for the task being preempted
+ *    next: the task that's kicking us out
+ *
+ * Please note that in and out are called under different contexts.
+ * out is called with rq lock held and irq disabled while in is called
+ * without rq lock and irq enabled.  This difference is intentional
+ * and depended upon by its users.
+ */
+struct sched_notifier_ops {
+	void (*in)(struct sched_notifier *notifier, int cpu);
+	void (*out)(struct sched_notifier *notifier, struct task_struct *next);
+};
+
+/**
+ * sched_notifier - key for installing scheduler notifiers
+ * @link: internal use
+ * @ops: defines the notifier functions to be called
+ *
+ * Usually used in conjunction with container_of().
+ */
+struct sched_notifier {
+	struct hlist_node link;
+	struct sched_notifier_ops *ops;
+};
+
+void sched_notifier_register(struct sched_notifier *notifier);
+void sched_notifier_unregister(struct sched_notifier *notifier);
+
+static inline void sched_notifier_init(struct sched_notifier *notifier,
+				       struct sched_notifier_ops *ops)
+{
+	INIT_HLIST_NODE(&notifier->link);
+	notifier->ops = ops;
+}
+
+#endif	/* CONFIG_SCHED_NOTIFIERS */
+
 struct rcu_node;
 
 struct task_struct {
@@ -1233,9 +1280,9 @@ struct task_struct {
 	struct sched_entity se;
 	struct sched_rt_entity rt;
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-	/* list of struct preempt_notifier: */
-	struct hlist_head preempt_notifiers;
+#ifdef CONFIG_SCHED_NOTIFIERS
+	/* list of struct sched_notifier: */
+	struct hlist_head sched_notifiers;
 #endif
 
 	/*
diff --git a/init/Kconfig b/init/Kconfig
index 9e03ef8..c7eef2b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1207,6 +1207,6 @@ config STOP_MACHINE
 
 source "block/Kconfig"
 
-config PREEMPT_NOTIFIERS
+config SCHED_NOTIFIERS
 	bool
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 1031cae..204ca3e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1433,6 +1433,44 @@ static inline void cpuacct_update_stats(struct task_struct *tsk,
 		enum cpuacct_stat_index idx, cputime_t val) {}
 #endif
 
+#ifdef CONFIG_SCHED_NOTIFIERS
+
+#define fire_sched_notifiers(p, callback, args...) do {			\
+	struct sched_notifier *__sn;					\
+	struct hlist_node *__pos;					\
+									\
+	hlist_for_each_entry(__sn, __pos, &(p)->sched_notifiers, link)	\
+		__sn->ops->callback(__sn , ##args);			\
+} while (0)
+
+/**
+ * sched_notifier_register - register scheduler notifier
+ * @notifier: notifier struct to register
+ */
+void sched_notifier_register(struct sched_notifier *notifier)
+{
+	hlist_add_head(&notifier->link, &current->sched_notifiers);
+}
+EXPORT_SYMBOL_GPL(sched_notifier_register);
+
+/**
+ * sched_notifier_unregister - unregister scheduler notifier
+ * @notifier: notifier struct to unregister
+ *
+ * This is safe to call from within a scheduler notifier.
+ */
+void sched_notifier_unregister(struct sched_notifier *notifier)
+{
+	hlist_del(&notifier->link);
+}
+EXPORT_SYMBOL_GPL(sched_notifier_unregister);
+
+#else	/* !CONFIG_SCHED_NOTIFIERS */
+
+#define fire_sched_notifiers(p, callback, args...)	do { } while (0)
+
+#endif	/* CONFIG_SCHED_NOTIFIERS */
+
 static inline void inc_cpu_load(struct rq *rq, unsigned long load)
 {
 	update_load_add(&rq->load, load);
@@ -2539,8 +2577,8 @@ static void __sched_fork(struct task_struct *p)
 	p->se.on_rq = 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-	INIT_HLIST_HEAD(&p->preempt_notifiers);
+#ifdef CONFIG_SCHED_NOTIFIERS
+	INIT_HLIST_HEAD(&p->sched_notifiers);
 #endif
 
 	/*
@@ -2651,64 +2689,6 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
 	task_rq_unlock(rq, &flags);
 }
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-
-/**
- * preempt_notifier_register - tell me when current is being preempted & rescheduled
- * @notifier: notifier struct to register
- */
-void preempt_notifier_register(struct preempt_notifier *notifier)
-{
-	hlist_add_head(&notifier->link, &current->preempt_notifiers);
-}
-EXPORT_SYMBOL_GPL(preempt_notifier_register);
-
-/**
- * preempt_notifier_unregister - no longer interested in preemption notifications
- * @notifier: notifier struct to unregister
- *
- * This is safe to call from within a preemption notifier.
- */
-void preempt_notifier_unregister(struct preempt_notifier *notifier)
-{
-	hlist_del(&notifier->link);
-}
-EXPORT_SYMBOL_GPL(preempt_notifier_unregister);
-
-static void fire_sched_in_preempt_notifiers(struct task_struct *curr)
-{
-	struct preempt_notifier *notifier;
-	struct hlist_node *node;
-
-	hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link)
-		notifier->ops->sched_in(notifier, raw_smp_processor_id());
-}
-
-static void
-fire_sched_out_preempt_notifiers(struct task_struct *curr,
-				 struct task_struct *next)
-{
-	struct preempt_notifier *notifier;
-	struct hlist_node *node;
-
-	hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link)
-		notifier->ops->sched_out(notifier, next);
-}
-
-#else /* !CONFIG_PREEMPT_NOTIFIERS */
-
-static void fire_sched_in_preempt_notifiers(struct task_struct *curr)
-{
-}
-
-static void
-fire_sched_out_preempt_notifiers(struct task_struct *curr,
-				 struct task_struct *next)
-{
-}
-
-#endif /* CONFIG_PREEMPT_NOTIFIERS */
-
 /**
  * prepare_task_switch - prepare to switch tasks
  * @rq: the runqueue preparing to switch
@@ -2726,7 +2706,7 @@ static inline void
 prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
-	fire_sched_out_preempt_notifiers(prev, next);
+	fire_sched_notifiers(prev, out, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
 }
@@ -2770,7 +2750,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	perf_event_task_sched_in(current, cpu_of(rq));
 	finish_lock_switch(rq, prev);
 
-	fire_sched_in_preempt_notifiers(current);
+	fire_sched_notifiers(current, in, raw_smp_processor_id());
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
@@ -9569,8 +9549,8 @@ void __init sched_init(void)
 
 	set_load_weight(&init_task);
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-	INIT_HLIST_HEAD(&init_task.preempt_notifiers);
+#ifdef CONFIG_SCHED_NOTIFIERS
+	INIT_HLIST_HEAD(&init_task.sched_notifiers);
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7495ce3..4e8e33f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -79,7 +79,7 @@ static cpumask_var_t cpus_hardware_enabled;
 struct kmem_cache *kvm_vcpu_cache;
 EXPORT_SYMBOL_GPL(kvm_vcpu_cache);
 
-static __read_mostly struct preempt_ops kvm_preempt_ops;
+static __read_mostly struct sched_notifier_ops kvm_sched_notifier_ops;
 
 struct dentry *kvm_debugfs_dir;
 
@@ -713,7 +713,7 @@ void vcpu_load(struct kvm_vcpu *vcpu)
 
 	mutex_lock(&vcpu->mutex);
 	cpu = get_cpu();
-	preempt_notifier_register(&vcpu->preempt_notifier);
+	sched_notifier_register(&vcpu->sched_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
 }
@@ -722,7 +722,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
 {
 	preempt_disable();
 	kvm_arch_vcpu_put(vcpu);
-	preempt_notifier_unregister(&vcpu->preempt_notifier);
+	sched_notifier_unregister(&vcpu->sched_notifier);
 	preempt_enable();
 	mutex_unlock(&vcpu->mutex);
 }
@@ -1772,7 +1772,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 	if (IS_ERR(vcpu))
 		return PTR_ERR(vcpu);
 
-	preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
+	sched_notifier_init(&vcpu->sched_notifier, &kvm_sched_notifier_ops);
 
 	r = kvm_arch_vcpu_setup(vcpu);
 	if (r)
@@ -2690,23 +2690,21 @@ static struct sys_device kvm_sysdev = {
 struct page *bad_page;
 pfn_t bad_pfn;
 
-static inline
-struct kvm_vcpu *preempt_notifier_to_vcpu(struct preempt_notifier *pn)
+static inline struct kvm_vcpu *sched_notifier_to_vcpu(struct sched_notifier *sn)
 {
-	return container_of(pn, struct kvm_vcpu, preempt_notifier);
+	return container_of(sn, struct kvm_vcpu, sched_notifier);
 }
 
-static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
+static void kvm_sched_in(struct sched_notifier *sn, int cpu)
 {
-	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
+	struct kvm_vcpu *vcpu = sched_notifier_to_vcpu(sn);
 
 	kvm_arch_vcpu_load(vcpu, cpu);
 }
 
-static void kvm_sched_out(struct preempt_notifier *pn,
-			  struct task_struct *next)
+static void kvm_sched_out(struct sched_notifier *sn, struct task_struct *next)
 {
-	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
+	struct kvm_vcpu *vcpu = sched_notifier_to_vcpu(sn);
 
 	kvm_arch_vcpu_put(vcpu);
 }
@@ -2780,8 +2778,8 @@ int kvm_init(void *opaque, unsigned int vcpu_size,
 		goto out_free;
 	}
 
-	kvm_preempt_ops.sched_in = kvm_sched_in;
-	kvm_preempt_ops.sched_out = kvm_sched_out;
+	kvm_sched_notifier_ops.in = kvm_sched_in;
+	kvm_sched_notifier_ops.out = kvm_sched_out;
 
 	kvm_init_debug();
 
-- 
1.6.4.2



* [PATCH 3/7] sched: refactor try_to_wake_up()
  2009-12-02  3:56 [PATCHSET tip/sched/core] sched: concurrency managed workqueue related sched patches Tejun Heo
  2009-12-02  3:56 ` [PATCH 1/7] sched: revert 498657a478c60be092208422fefa9c7b248729c2 Tejun Heo
  2009-12-02  3:56 ` [PATCH 2/7] sched: rename preempt_notifiers to sched_notifiers and refactor implementation Tejun Heo
@ 2009-12-02  3:56 ` Tejun Heo
  2009-12-02  9:05   ` Mike Galbraith
  2009-12-03  6:11   ` [PATCH UPDATED " Tejun Heo
  2009-12-02  3:56 ` [PATCH 4/7] sched: implement force_cpus_allowed() Tejun Heo
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 47+ messages in thread
From: Tejun Heo @ 2009-12-02  3:56 UTC (permalink / raw)
  To: tglx, mingo, avi, peterz, efault, rusty, linux-kernel; +Cc: Tejun Heo

Factor ttwu_activate() and ttwu_woken_up() out of try_to_wake_up().
The factoring out doesn't affect try_to_wake_up() much
code-generation-wise.  Depending on configuration options, it ends up
generating either the same object code as before or a slightly
different one due to different register assignment.

This is to help future implementation of try_to_wake_up_local().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |  112 +++++++++++++++++++++++++++++++------------------------
 1 files changed, 63 insertions(+), 49 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 204ca3e..dd8c91c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2361,11 +2361,67 @@ void task_oncpu_function_call(struct task_struct *p,
 	preempt_enable();
 }
 
-/***
+static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
+				 bool is_sync, bool is_migrate, bool is_local)
+{
+	schedstat_inc(p, se.nr_wakeups);
+	if (is_sync)
+		schedstat_inc(p, se.nr_wakeups_sync);
+	if (is_migrate)
+		schedstat_inc(p, se.nr_wakeups_migrate);
+	if (is_local)
+		schedstat_inc(p, se.nr_wakeups_local);
+	else
+		schedstat_inc(p, se.nr_wakeups_remote);
+
+	activate_task(rq, p, 1);
+
+	/*
+	 * Only attribute actual wakeups done by this task.
+	 */
+	if (!in_interrupt()) {
+		struct sched_entity *se = &current->se;
+		u64 sample = se->sum_exec_runtime;
+
+		if (se->last_wakeup)
+			sample -= se->last_wakeup;
+		else
+			sample -= se->start_runtime;
+		update_avg(&se->avg_wakeup, sample);
+
+		se->last_wakeup = se->sum_exec_runtime;
+	}
+}
+
+static inline void ttwu_woken_up(struct task_struct *p, struct rq *rq,
+				 int wake_flags, bool success)
+{
+	trace_sched_wakeup(rq, p, success);
+	check_preempt_curr(rq, p, wake_flags);
+
+	p->state = TASK_RUNNING;
+#ifdef CONFIG_SMP
+	if (p->sched_class->task_wake_up)
+		p->sched_class->task_wake_up(rq, p);
+
+	if (unlikely(rq->idle_stamp)) {
+		u64 delta = rq->clock - rq->idle_stamp;
+		u64 max = 2*sysctl_sched_migration_cost;
+
+		if (delta > max)
+			rq->avg_idle = max;
+		else
+			update_avg(&rq->avg_idle, delta);
+		rq->idle_stamp = 0;
+	}
+#endif
+}
+
+/**
  * try_to_wake_up - wake up a thread
  * @p: the to-be-woken-up thread
  * @state: the mask of task states that can be woken
- * @sync: do a synchronous wakeup?
+ * @wake_flags: wake modifier flags (WF_*)
  *
  * Put it on the run-queue if it's not already there. The "current"
  * thread is always on the run-queue (except when the actual
@@ -2373,7 +2429,8 @@ void task_oncpu_function_call(struct task_struct *p,
  * the simpler "current->state = TASK_RUNNING" to mark yourself
  * runnable without the overhead of this.
  *
- * returns failure only if the task is already active.
+ * Returns %true if @p was woken up, %false if it was already running
+ * or @state didn't match @p's state.
  */
 static int try_to_wake_up(struct task_struct *p, unsigned int state,
 			  int wake_flags)
@@ -2444,54 +2501,11 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
 
 out_activate:
 #endif /* CONFIG_SMP */
-	schedstat_inc(p, se.nr_wakeups);
-	if (wake_flags & WF_SYNC)
-		schedstat_inc(p, se.nr_wakeups_sync);
-	if (orig_cpu != cpu)
-		schedstat_inc(p, se.nr_wakeups_migrate);
-	if (cpu == this_cpu)
-		schedstat_inc(p, se.nr_wakeups_local);
-	else
-		schedstat_inc(p, se.nr_wakeups_remote);
-	activate_task(rq, p, 1);
+	ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
+		      cpu == this_cpu);
 	success = 1;
-
-	/*
-	 * Only attribute actual wakeups done by this task.
-	 */
-	if (!in_interrupt()) {
-		struct sched_entity *se = &current->se;
-		u64 sample = se->sum_exec_runtime;
-
-		if (se->last_wakeup)
-			sample -= se->last_wakeup;
-		else
-			sample -= se->start_runtime;
-		update_avg(&se->avg_wakeup, sample);
-
-		se->last_wakeup = se->sum_exec_runtime;
-	}
-
 out_running:
-	trace_sched_wakeup(rq, p, success);
-	check_preempt_curr(rq, p, wake_flags);
-
-	p->state = TASK_RUNNING;
-#ifdef CONFIG_SMP
-	if (p->sched_class->task_wake_up)
-		p->sched_class->task_wake_up(rq, p);
-
-	if (unlikely(rq->idle_stamp)) {
-		u64 delta = rq->clock - rq->idle_stamp;
-		u64 max = 2*sysctl_sched_migration_cost;
-
-		if (delta > max)
-			rq->avg_idle = max;
-		else
-			update_avg(&rq->avg_idle, delta);
-		rq->idle_stamp = 0;
-	}
-#endif
+	ttwu_woken_up(p, rq, wake_flags, success);
 out:
 	task_rq_unlock(rq, &flags);
 	put_cpu();
-- 
1.6.4.2



* [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-02  3:56 [PATCHSET tip/sched/core] sched: concurrency managed workqueue related sched patches Tejun Heo
                   ` (2 preceding siblings ...)
  2009-12-02  3:56 ` [PATCH 3/7] sched: refactor try_to_wake_up() Tejun Heo
@ 2009-12-02  3:56 ` Tejun Heo
  2009-12-04 10:40   ` Peter Zijlstra
  2009-12-02  3:56 ` [PATCH 5/7] sched: make sched_notifiers unconditional Tejun Heo
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2009-12-02  3:56 UTC (permalink / raw)
  To: tglx, mingo, avi, peterz, efault, rusty, linux-kernel; +Cc: Tejun Heo

set_cpus_allowed_ptr() modifies the allowed cpu mask of a task.  The
function performs the following checks before applying the new mask.

* Check whether PF_THREAD_BOUND is set.  This is set for bound
  kthreads so that they can't be moved around.

* Check whether the target cpu is still marked active - cpu_active().
  Active state is cleared early while downing a cpu.

This patch adds force_cpus_allowed() which bypasses the above two
checks.  The caller is responsible for guaranteeing that the
destination cpu doesn't go down until force_cpus_allowed() finishes.

The first check is bypassed by factoring out actual migration part
into __set_cpus_allowed() from set_cpus_allowed_ptr() and calling the
inner function from force_cpus_allowed().

The second check is buried deep down in __migrate_task(), which is
executed by the migration threads.  A @force parameter is added to
__migrate_task().  As the only way to pass parameters from
__set_cpus_allowed() to the migration thread is through migration_req,
migration_req->force is added and @force is passed down to
__migrate_task().

Please note the naming discrepancy between set_cpus_allowed_ptr() and
the new functions.  The _ptr suffix is from the days when the cpumask
API wasn't mature, and future changes should drop it from
set_cpus_allowed_ptr() too.

force_cpus_allowed() will be used by the concurrency managed workqueue.
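
As an illustration of the intended usage (not part of this patch;
my_rebind_worker and the hotplug setup are assumptions), a caller
which guarantees the target cpu stays online could do something like:

	#include <linux/sched.h>
	#include <linux/cpumask.h>

	/*
	 * Rebind a PF_THREAD_BOUND worker to @cpu.  The caller guarantees
	 * that @cpu stays online until this returns, e.g. by running from
	 * a CPU hotplug callback for @cpu.
	 */
	static void my_rebind_worker(struct task_struct *worker, unsigned int cpu)
	{
		/*
		 * set_cpus_allowed_ptr() would fail with -EINVAL here due to
		 * PF_THREAD_BOUND; force_cpus_allowed() skips that check and
		 * the cpu_active() test in __migrate_task().
		 */
		if (force_cpus_allowed(worker, cpumask_of(cpu)))
			printk(KERN_WARNING "failed to rebind %s\n",
			       worker->comm);
	}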

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    7 ++++
 kernel/sched.c        |   89 +++++++++++++++++++++++++++++++++----------------
 2 files changed, 67 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a6f64cb..ae4fcfe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1854,6 +1854,8 @@ static inline void rcu_copy_process(struct task_struct *p)
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed_ptr(struct task_struct *p,
 				const struct cpumask *new_mask);
+extern int force_cpus_allowed(struct task_struct *p,
+				  const struct cpumask *new_mask);
 #else
 static inline int set_cpus_allowed_ptr(struct task_struct *p,
 				       const struct cpumask *new_mask)
@@ -1862,6 +1864,11 @@ static inline int set_cpus_allowed_ptr(struct task_struct *p,
 		return -EINVAL;
 	return 0;
 }
+static inline int force_cpus_allowed(struct task_struct *p,
+				     const struct cpumask *new_mask)
+{
+	return set_cpus_allowed_ptr(p, new_mask);
+}
 #endif
 
 #ifndef CONFIG_CPUMASK_OFFSTACK
diff --git a/kernel/sched.c b/kernel/sched.c
index dd8c91c..8fd3981 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2135,6 +2135,7 @@ struct migration_req {
 
 	struct task_struct *task;
 	int dest_cpu;
+	bool force;
 
 	struct completion done;
 };
@@ -2143,8 +2144,8 @@ struct migration_req {
  * The task's runqueue lock must be held.
  * Returns true if you have to wait for migration thread.
  */
-static int
-migrate_task(struct task_struct *p, int dest_cpu, struct migration_req *req)
+static int migrate_task(struct task_struct *p, int dest_cpu,
+			struct migration_req *req, bool force)
 {
 	struct rq *rq = task_rq(p);
 
@@ -2161,6 +2162,7 @@ migrate_task(struct task_struct *p, int dest_cpu, struct migration_req *req)
 	init_completion(&req->done);
 	req->task = p;
 	req->dest_cpu = dest_cpu;
+	req->force = force;
 	list_add(&req->list, &rq->migration_queue);
 
 	return 1;
@@ -3127,7 +3129,7 @@ static void sched_migrate_task(struct task_struct *p, int dest_cpu)
 		goto out;
 
 	/* force the process onto the specified CPU */
-	if (migrate_task(p, dest_cpu, &req)) {
+	if (migrate_task(p, dest_cpu, &req, false)) {
 		/* Need to wait for migration thread (might exit: take ref). */
 		struct task_struct *mt = rq->migration_thread;
 
@@ -7033,34 +7035,19 @@ static inline void sched_init_granularity(void)
  * 7) we wake up and the migration is done.
  */
 
-/*
- * Change a given task's CPU affinity. Migrate the thread to a
- * proper CPU and schedule it away if the CPU it's executing on
- * is removed from the allowed bitmask.
- *
- * NOTE: the caller must have a valid reference to the task, the
- * task must not exit() & deallocate itself prematurely. The
- * call is not atomic; no spinlocks may be held.
- */
-int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
+static inline int __set_cpus_allowed(struct task_struct *p,
+				     const struct cpumask *new_mask,
+				     struct rq *rq, unsigned long *flags,
+				     bool force)
 {
 	struct migration_req req;
-	unsigned long flags;
-	struct rq *rq;
 	int ret = 0;
 
-	rq = task_rq_lock(p, &flags);
 	if (!cpumask_intersects(new_mask, cpu_online_mask)) {
 		ret = -EINVAL;
 		goto out;
 	}
 
-	if (unlikely((p->flags & PF_THREAD_BOUND) && p != current &&
-		     !cpumask_equal(&p->cpus_allowed, new_mask))) {
-		ret = -EINVAL;
-		goto out;
-	}
-
 	if (p->sched_class->set_cpus_allowed)
 		p->sched_class->set_cpus_allowed(p, new_mask);
 	else {
@@ -7072,12 +7059,13 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
 	if (cpumask_test_cpu(task_cpu(p), new_mask))
 		goto out;
 
-	if (migrate_task(p, cpumask_any_and(cpu_online_mask, new_mask), &req)) {
+	if (migrate_task(p, cpumask_any_and(cpu_online_mask, new_mask), &req,
+			 force)) {
 		/* Need help from migration thread: drop lock and wait. */
 		struct task_struct *mt = rq->migration_thread;
 
 		get_task_struct(mt);
-		task_rq_unlock(rq, &flags);
+		task_rq_unlock(rq, flags);
 		wake_up_process(rq->migration_thread);
 		put_task_struct(mt);
 		wait_for_completion(&req.done);
@@ -7085,13 +7073,54 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
 		return 0;
 	}
 out:
-	task_rq_unlock(rq, &flags);
+	task_rq_unlock(rq, flags);
 
 	return ret;
 }
+
+/*
+ * Change a given task's CPU affinity. Migrate the thread to a
+ * proper CPU and schedule it away if the CPU it's executing on
+ * is removed from the allowed bitmask.
+ *
+ * NOTE: the caller must have a valid reference to the task, the
+ * task must not exit() & deallocate itself prematurely. The
+ * call is not atomic; no spinlocks may be held.
+ */
+int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
+{
+	unsigned long flags;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &flags);
+
+	if (unlikely((p->flags & PF_THREAD_BOUND) && p != current &&
+		     !cpumask_equal(&p->cpus_allowed, new_mask))) {
+		task_rq_unlock(rq, &flags);
+		return -EINVAL;
+	}
+
+	return __set_cpus_allowed(p, new_mask, rq, &flags, false);
+}
 EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
 /*
+ * Similar to set_cpus_allowed_ptr() but bypasses PF_THREAD_BOUND
+ * check and ignores cpu_active() status as long as the cpu is online.
+ * The caller is responsible for guaranteeing that the destination
+ * cpus don't go down until this function finishes and in general
+ * ensuring things don't go bonkers.
+ */
+int force_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
+{
+	unsigned long flags;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &flags);
+	return __set_cpus_allowed(p, new_mask, rq, &flags, true);
+}
+
+/*
  * Move (not current) task off this cpu, onto dest cpu. We're doing
  * this because either it can't run here any more (set_cpus_allowed()
  * away from this CPU, or CPU going down), or because we're
@@ -7102,12 +7131,13 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
  *
  * Returns non-zero if task was successfully migrated.
  */
-static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
+static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu,
+			  bool force)
 {
 	struct rq *rq_dest, *rq_src;
 	int ret = 0, on_rq;
 
-	if (unlikely(!cpu_active(dest_cpu)))
+	if (!force && unlikely(!cpu_active(dest_cpu)))
 		return ret;
 
 	rq_src = cpu_rq(src_cpu);
@@ -7186,7 +7216,8 @@ static int migration_thread(void *data)
 
 		if (req->task != NULL) {
 			spin_unlock(&rq->lock);
-			__migrate_task(req->task, cpu, req->dest_cpu);
+			__migrate_task(req->task, cpu, req->dest_cpu,
+				       req->force);
 		} else if (likely(cpu == (badcpu = smp_processor_id()))) {
 			req->dest_cpu = RCU_MIGRATION_GOT_QS;
 			spin_unlock(&rq->lock);
@@ -7211,7 +7242,7 @@ static int __migrate_task_irq(struct task_struct *p, int src_cpu, int dest_cpu)
 	int ret;
 
 	local_irq_disable();
-	ret = __migrate_task(p, src_cpu, dest_cpu);
+	ret = __migrate_task(p, src_cpu, dest_cpu, false);
 	local_irq_enable();
 	return ret;
 }
-- 
1.6.4.2



* [PATCH 5/7] sched: make sched_notifiers unconditional
  2009-12-02  3:56 [PATCHSET tip/sched/core] sched: concurrency managed workqueue related sched patches Tejun Heo
                   ` (3 preceding siblings ...)
  2009-12-02  3:56 ` [PATCH 4/7] sched: implement force_cpus_allowed() Tejun Heo
@ 2009-12-02  3:56 ` Tejun Heo
  2009-12-02  3:56 ` [PATCH 6/7] sched: add wakeup/sleep sched_notifiers and allow NULL notifier ops Tejun Heo
  2009-12-02  3:56 ` [PATCH 7/7] sched: implement try_to_wake_up_local() Tejun Heo
  6 siblings, 0 replies; 47+ messages in thread
From: Tejun Heo @ 2009-12-02  3:56 UTC (permalink / raw)
  To: tglx, mingo, avi, peterz, efault, rusty, linux-kernel; +Cc: Tejun Heo

sched_notifiers will be used by workqueue, which is always built in.
Always enable sched_notifiers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 arch/ia64/kvm/Kconfig    |    1 -
 arch/powerpc/kvm/Kconfig |    1 -
 arch/s390/kvm/Kconfig    |    1 -
 arch/x86/kvm/Kconfig     |    1 -
 include/linux/kvm_host.h |    2 --
 include/linux/sched.h    |    6 ------
 init/Kconfig             |    4 ----
 kernel/sched.c           |   13 -------------
 8 files changed, 0 insertions(+), 29 deletions(-)

diff --git a/arch/ia64/kvm/Kconfig b/arch/ia64/kvm/Kconfig
index a38b72e..a9e2b9c 100644
--- a/arch/ia64/kvm/Kconfig
+++ b/arch/ia64/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
 	depends on HAVE_KVM && MODULES && EXPERIMENTAL
 	# for device assignment:
 	depends on PCI
-	select SCHED_NOTIFIERS
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
 	select KVM_APIC_ARCHITECTURE
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 6447702..092503e 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -18,7 +18,6 @@ if VIRTUALIZATION
 
 config KVM
 	bool
-	select SCHED_NOTIFIERS
 	select ANON_INODES
 
 config KVM_440
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 3bc7045..e125d45 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -18,7 +18,6 @@ if VIRTUALIZATION
 config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM && EXPERIMENTAL
-	select SCHED_NOTIFIERS
 	select ANON_INODES
 	select S390_SWITCH_AMODE
 	---help---
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 23609bd..b391852 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
 	depends on HAVE_KVM
 	# for device assignment:
 	depends on PCI
-	select SCHED_NOTIFIERS
 	select MMU_NOTIFIER
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d05fadc..bc0c1d4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -74,9 +74,7 @@ void kvm_io_bus_unregister_dev(struct kvm *kvm, struct kvm_io_bus *bus,
 
 struct kvm_vcpu {
 	struct kvm *kvm;
-#ifdef CONFIG_SCHED_NOTIFIERS
 	struct sched_notifier sched_notifier;
-#endif
 	int vcpu_id;
 	struct mutex mutex;
 	int   cpu;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae4fcfe..bc8be2d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1210,8 +1210,6 @@ struct sched_rt_entity {
 #endif
 };
 
-#ifdef CONFIG_SCHED_NOTIFIERS
-
 struct sched_notifier;
 
 /**
@@ -1255,8 +1253,6 @@ static inline void sched_notifier_init(struct sched_notifier *notifier,
 	notifier->ops = ops;
 }
 
-#endif	/* CONFIG_SCHED_NOTIFIERS */
-
 struct rcu_node;
 
 struct task_struct {
@@ -1280,10 +1276,8 @@ struct task_struct {
 	struct sched_entity se;
 	struct sched_rt_entity rt;
 
-#ifdef CONFIG_SCHED_NOTIFIERS
 	/* list of struct sched_notifier: */
 	struct hlist_head sched_notifiers;
-#endif
 
 	/*
 	 * fpu_counter contains the number of consecutive context switches
diff --git a/init/Kconfig b/init/Kconfig
index c7eef2b..0220aa7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1206,7 +1206,3 @@ config STOP_MACHINE
 	  Need stop_machine() primitive.
 
 source "block/Kconfig"
-
-config SCHED_NOTIFIERS
-	bool
-
diff --git a/kernel/sched.c b/kernel/sched.c
index 8fd3981..f7d9bb3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1433,8 +1433,6 @@ static inline void cpuacct_update_stats(struct task_struct *tsk,
 		enum cpuacct_stat_index idx, cputime_t val) {}
 #endif
 
-#ifdef CONFIG_SCHED_NOTIFIERS
-
 #define fire_sched_notifiers(p, callback, args...) do {			\
 	struct sched_notifier *__sn;					\
 	struct hlist_node *__pos;					\
@@ -1465,12 +1463,6 @@ void sched_notifier_unregister(struct sched_notifier *notifier)
 }
 EXPORT_SYMBOL_GPL(sched_notifier_unregister);
 
-#else	/* !CONFIG_SCHED_NOTIFIERS */
-
-#define fire_sched_notifiers(p, callback, args...)	do { } while (0)
-
-#endif	/* CONFIG_SCHED_NOTIFIERS */
-
 static inline void inc_cpu_load(struct rq *rq, unsigned long load)
 {
 	update_load_add(&rq->load, load);
@@ -2592,10 +2584,7 @@ static void __sched_fork(struct task_struct *p)
 	INIT_LIST_HEAD(&p->rt.run_list);
 	p->se.on_rq = 0;
 	INIT_LIST_HEAD(&p->se.group_node);
-
-#ifdef CONFIG_SCHED_NOTIFIERS
 	INIT_HLIST_HEAD(&p->sched_notifiers);
-#endif
 
 	/*
 	 * We mark the process as running here, but have not actually
@@ -9594,9 +9583,7 @@ void __init sched_init(void)
 
 	set_load_weight(&init_task);
 
-#ifdef CONFIG_SCHED_NOTIFIERS
 	INIT_HLIST_HEAD(&init_task.sched_notifiers);
-#endif
 
 #ifdef CONFIG_SMP
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
-- 
1.6.4.2



* [PATCH 6/7] sched: add wakeup/sleep sched_notifiers and allow NULL notifier ops
  2009-12-02  3:56 [PATCHSET tip/sched/core] sched: concurrency managed workqueue related sched patches Tejun Heo
                   ` (4 preceding siblings ...)
  2009-12-02  3:56 ` [PATCH 5/7] sched: make sched_notifiers unconditional Tejun Heo
@ 2009-12-02  3:56 ` Tejun Heo
  2009-12-02  3:56 ` [PATCH 7/7] sched: implement try_to_wake_up_local() Tejun Heo
  6 siblings, 0 replies; 47+ messages in thread
From: Tejun Heo @ 2009-12-02  3:56 UTC (permalink / raw)
  To: tglx, mingo, avi, peterz, efault, rusty, linux-kernel; +Cc: Tejun Heo

Add wakeup and sleep notifiers to sched_notifiers and allow omitting
some of the callbacks in the ops table.  These will be used by the
concurrency managed workqueue.
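
A minimal sketch of an ops table using only the new callbacks (the
my_* names are made up; the real user would be the workqueue code).
Since fire_sched_notifiers() now checks each pointer, .in and .out can
simply be left NULL:

	static void my_wakeup(struct sched_notifier *sn)
	{
		/* the task owning @sn has just completed a wakeup */
	}

	static void my_sleep(struct sched_notifier *sn)
	{
		/* the task owning @sn is about to be deactivated in schedule() */
	}

	static struct sched_notifier_ops my_wq_ops = {
		.wakeup	= my_wakeup,
		.sleep	= my_sleep,
		/* .in and .out are NULL and get skipped */
	};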

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    6 ++++++
 kernel/sched.c        |   11 ++++++++---
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index bc8be2d..cad6fd6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1214,6 +1214,10 @@ struct sched_notifier;
 
 /**
  * sched_notifier_ops - notifiers called for scheduling events
+ * @wakeup: we're waking up
+ *    notifier: struct sched_notifier for the task being woken up
+ * @sleep: we're going to bed
+ *    notifier: struct sched_notifier for the task sleeping
  * @in: we're about to be rescheduled:
  *    notifier: struct sched_notifier for the task being scheduled
  *    cpu:  cpu we're scheduled on
@@ -1227,6 +1231,8 @@ struct sched_notifier;
  * and depended upon by its users.
  */
 struct sched_notifier_ops {
+	void (*wakeup)(struct sched_notifier *notifier);
+	void (*sleep)(struct sched_notifier *notifier);
 	void (*in)(struct sched_notifier *notifier, int cpu);
 	void (*out)(struct sched_notifier *notifier, struct task_struct *next);
 };
diff --git a/kernel/sched.c b/kernel/sched.c
index f7d9bb3..9011dca 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1438,7 +1438,8 @@ static inline void cpuacct_update_stats(struct task_struct *tsk,
 	struct hlist_node *__pos;					\
 									\
 	hlist_for_each_entry(__sn, __pos, &(p)->sched_notifiers, link)	\
-		__sn->ops->callback(__sn , ##args);			\
+		if (__sn->ops->callback)				\
+			__sn->ops->callback(__sn , ##args);		\
 } while (0)
 
 /**
@@ -2409,6 +2410,8 @@ static inline void ttwu_woken_up(struct task_struct *p, struct rq *rq,
 		rq->idle_stamp = 0;
 	}
 #endif
+	if (success)
+		fire_sched_notifiers(p, wakeup);
 }
 
 /**
@@ -5431,10 +5434,12 @@ need_resched_nonpreemptible:
 	clear_tsk_need_resched(prev);
 
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
-		if (unlikely(signal_pending_state(prev->state, prev)))
+		if (unlikely(signal_pending_state(prev->state, prev))) {
 			prev->state = TASK_RUNNING;
-		else
+		} else {
+			fire_sched_notifiers(prev, sleep);
 			deactivate_task(rq, prev, 1);
+		}
 		switch_count = &prev->nvcsw;
 	}
 
-- 
1.6.4.2



* [PATCH 7/7] sched: implement try_to_wake_up_local()
  2009-12-02  3:56 [PATCHSET tip/sched/core] sched: concurrency managed workqueue related sched patches Tejun Heo
                   ` (5 preceding siblings ...)
  2009-12-02  3:56 ` [PATCH 6/7] sched: add wakeup/sleep sched_notifiers and allow NULL notifier ops Tejun Heo
@ 2009-12-02  3:56 ` Tejun Heo
  2009-12-03  6:13   ` [PATCH UPDATED " Tejun Heo
  2009-12-04 10:44   ` [PATCH " Peter Zijlstra
  6 siblings, 2 replies; 47+ messages in thread
From: Tejun Heo @ 2009-12-02  3:56 UTC (permalink / raw)
  To: tglx, mingo, avi, peterz, efault, rusty, linux-kernel; +Cc: Tejun Heo

Implement try_to_wake_up_local().  try_to_wake_up_local() is similar
to try_to_wake_up() but it assumes the caller has this_rq() locked and
the target task is bound to this_rq().  try_to_wake_up_local() can be
called from wakeup and sleep scheduler notifiers.
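
A rough sketch of how a sleep notifier might use it (illustrative
only; my_next_worker and the policy around it are assumptions, not
part of this patch):

	#include <linux/percpu.h>
	#include <linux/sched.h>

	/* a task bound to this cpu to be woken when the current one sleeps */
	static DEFINE_PER_CPU(struct task_struct *, my_next_worker);

	static void my_sleep(struct sched_notifier *sn)
	{
		struct task_struct *next = __get_cpu_var(my_next_worker);

		/*
		 * This runs from schedule() with this_rq() locked and irqs
		 * disabled, so the regular try_to_wake_up() can't be used;
		 * the local variant operates on the already locked rq.
		 */
		if (next && next != current)
			try_to_wake_up_local(next, TASK_NORMAL, 0);
	}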

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    2 +
 kernel/sched.c        |   56 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cad6fd6..cd709bb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2067,6 +2067,8 @@ extern void release_uids(struct user_namespace *ns);
 
 extern void do_timer(unsigned long ticks);
 
+extern bool try_to_wake_up_local(struct task_struct *p, unsigned int state,
+				 int wake_flags);
 extern int wake_up_state(struct task_struct *tsk, unsigned int state);
 extern int wake_up_process(struct task_struct *tsk);
 extern void wake_up_new_task(struct task_struct *tsk,
diff --git a/kernel/sched.c b/kernel/sched.c
index 9011dca..18437de 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2410,6 +2410,10 @@ static inline void ttwu_woken_up(struct task_struct *p, struct rq *rq,
 		rq->idle_stamp = 0;
 	}
 #endif
+	/*
+	 * Wake up is complete, fire wake up notifier.  This allows
+	 * try_to_wake_up_local() to be called from wake up notifiers.
+	 */
 	if (success)
 		fire_sched_notifiers(p, wakeup);
 }
@@ -2511,6 +2515,53 @@ out:
 }
 
 /**
+ * try_to_wake_up_local - try to wake up a local task with rq lock held
+ * @p: the to-be-woken-up thread
+ * @state: the mask of task states that can be woken
+ * @wake_flags: wake modifier flags (WF_*)
+ *
+ * Put @p on the run-queue if it's not alredy there.  The caller must
+ * ensure that this_rq() is locked, @p is bound to this_rq() and @p is
+ * not the current task.  this_rq() stays locked over invocation.
+ *
+ * This function can be called from wakeup and sleep scheduler
+ * notifiers.  Be careful not to create deep recursion by chaining
+ * wakeup notifiers.
+ *
+ * Returns %true if @p was woken up, %false if it was already running
+ * or @state didn't match @p's state.
+ */
+bool try_to_wake_up_local(struct task_struct *p, unsigned int state,
+			  int wake_flags)
+{
+	struct rq *rq = task_rq(p);
+	bool success = false;
+
+	BUG_ON(rq != this_rq());
+	BUG_ON(p == current);
+	lockdep_assert_held(&rq->lock);
+
+	if (!sched_feat(SYNC_WAKEUPS))
+		wake_flags &= ~WF_SYNC;
+
+	if (!(p->state & state))
+		return false;
+
+	if (!p->se.on_rq) {
+		if (likely(!task_running(rq, p))) {
+			schedstat_inc(rq, ttwu_count);
+			schedstat_inc(rq, ttwu_local);
+		}
+		ttwu_activate(p, rq, wake_flags & WF_SYNC, false, true);
+		success = true;
+	}
+
+	ttwu_woken_up(p, rq, wake_flags, success);
+
+	return success;
+}
+
+/**
  * wake_up_process - Wake up a specific process
  * @p: The process to be woken up.
  *
@@ -5437,6 +5488,11 @@ need_resched_nonpreemptible:
 		if (unlikely(signal_pending_state(prev->state, prev))) {
 			prev->state = TASK_RUNNING;
 		} else {
+			/*
+			 * Fire sleep notifier before changing any scheduler
+			 * state.  This allows try_to_wake_up_local() to be
+			 * called from sleep notifiers.
+			 */
 			fire_sched_notifiers(prev, sleep);
 			deactivate_task(rq, prev, 1);
 		}
-- 
1.6.4.2



* Re: [PATCH 3/7] sched: refactor try_to_wake_up()
  2009-12-02  3:56 ` [PATCH 3/7] sched: refactor try_to_wake_up() Tejun Heo
@ 2009-12-02  9:05   ` Mike Galbraith
  2009-12-02  9:51     ` Tejun Heo
  2009-12-03  6:11   ` [PATCH UPDATED " Tejun Heo
  1 sibling, 1 reply; 47+ messages in thread
From: Mike Galbraith @ 2009-12-02  9:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: tglx, mingo, avi, peterz, rusty, linux-kernel

On Wed, 2009-12-02 at 12:56 +0900, Tejun Heo wrote:
> Factor ttwu_activate() and ttwu_woken_up() out of try_to_wake_up().

Nit: ttwu_woken_up() sounds decidedly strange to my ear. Perhaps
ttwu_post_activation()?

As a $.02 comment, factoring here doesn't look nice, reader scrolls
around whereas he currently sees all the why/wherefore at a glance.
Needing to pass three booleans for stats also looks bad.  I think it
would _look_ better with the thing just duplicated/stripped down and
called what it is, sched_notifier_wakeup() or such.

Which leaves growth in its wake though...

> +/**
>   * try_to_wake_up - wake up a thread
>   * @p: the to-be-woken-up thread

Nit: thread to be awakened sounds better.

	-Mike



* Re: [PATCH 3/7] sched: refactor try_to_wake_up()
  2009-12-02  9:05   ` Mike Galbraith
@ 2009-12-02  9:51     ` Tejun Heo
  2009-12-02 10:10       ` Mike Galbraith
  0 siblings, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2009-12-02  9:51 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: tglx, mingo, avi, peterz, rusty, linux-kernel

Hello,

On 12/02/2009 06:05 PM, Mike Galbraith wrote:
> On Wed, 2009-12-02 at 12:56 +0900, Tejun Heo wrote:
>> Factor ttwu_activate() and ttwu_woken_up() out of try_to_wake_up().
> 
> Nit: ttwu_woken_up() sounds decidedly strange to my ear. Perhaps
> ttwu_post_activation()?

Sure, I can rename it.

> As a $.02 comment, factoring here doesn't look nice, reader scrolls
> around whereas he currently sees all the why/wherefore at a glance.
> Needing to pass three booleans for stats also looks bad.

The three bools aren't the prettiest thing in the world but I couldn't
prevent gcc from re-evaluating expressions without those.

> I think it would _look_ better with the thing just
> duplicated/stripped down and called what it is,
> sched_notifier_wakeup() or such.

Sorry, I'm not following.  Can you elaborate a bit?

> Which leaves growth in its wake though...
>
>> +/**
>>   * try_to_wake_up - wake up a thread
>>   * @p: the to-be-woken-up thread
> 
> Nit: thread to be awakened sounds better.

Will update.

Thanks.

-- 
tejun


* Re: [PATCH 3/7] sched: refactor try_to_wake_up()
  2009-12-02  9:51     ` Tejun Heo
@ 2009-12-02 10:10       ` Mike Galbraith
  2009-12-02 10:14         ` Tejun Heo
  0 siblings, 1 reply; 47+ messages in thread
From: Mike Galbraith @ 2009-12-02 10:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: tglx, mingo, avi, peterz, rusty, linux-kernel

On Wed, 2009-12-02 at 18:51 +0900, Tejun Heo wrote:
> Hello,

Greetings,

> > I think it would _look_ better with the thing just
> > duplicated/stripped down and called what it is,
> > sched_notifier_wakeup() or such.
> 
> Sorry, I'm not following.  Can you elaborate a bit?

I mean copying ttwu, stripping out locking etc, and renaming the result
would look better.  No functional difference though, and binary growth
would be left in the wake of such a duplication.

	-Mike



* Re: [PATCH 3/7] sched: refactor try_to_wake_up()
  2009-12-02 10:10       ` Mike Galbraith
@ 2009-12-02 10:14         ` Tejun Heo
  2009-12-02 11:01           ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2009-12-02 10:14 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: tglx, mingo, avi, peterz, rusty, linux-kernel

On 12/02/2009 07:10 PM, Mike Galbraith wrote:
>>> I think it would _look_ better with the thing just
>>> duplicated/stripped down and called what it is,
>>> sched_notifier_wakeup() or such.
>>
>> Sorry, I'm not following.  Can you elaborate a bit?
> 
> I mean copying ttwu, stripping out locking etc, and renaming the result
> would look better.  No functional difference though, and binary growth
> would be left in the wake of such a duplication.

Ah... okay.  Well, that was the first approach (with some missing
parts).  Peter wasn't particularly happy with code duplication and
suggested refactoring ttwu and reusing the common parts.

Thanks.

-- 
tejun


* [tip:sched/core] sched: Revert 498657a478c60be092208422fefa9c7b248729c2
  2009-12-02  3:56 ` [PATCH 1/7] sched: revert 498657a478c60be092208422fefa9c7b248729c2 Tejun Heo
@ 2009-12-02 10:42   ` tip-bot for Tejun Heo
  0 siblings, 0 replies; 47+ messages in thread
From: tip-bot for Tejun Heo @ 2009-12-02 10:42 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, tj, tglx, mingo, avi

Commit-ID:  8592e6486a177a02f048567cb928bc3a1f9a86c3
Gitweb:     http://git.kernel.org/tip/8592e6486a177a02f048567cb928bc3a1f9a86c3
Author:     Tejun Heo <tj@kernel.org>
AuthorDate: Wed, 2 Dec 2009 12:56:46 +0900
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 2 Dec 2009 09:55:33 +0100

sched: Revert 498657a478c60be092208422fefa9c7b248729c2

498657a478c60be092208422fefa9c7b248729c2 incorrectly assumed
that preempt wasn't disabled around context_switch() and thus
was fixing an imaginary problem.  It also broke KVM, which
depends on ->sched_in() being called with irqs enabled so that
it can do smp calls from there.

Revert the incorrect commit and add a comment describing the
different contexts under which the two callbacks are invoked.

Avi: spotted transposed in/out in the added comment.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Avi Kivity <avi@redhat.com>
Cc: peterz@infradead.org
Cc: efault@gmx.de
Cc: rusty@rustcorp.com.au
LKML-Reference: <1259726212-30259-2-git-send-email-tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/preempt.h |    5 +++++
 kernel/sched.c          |    2 +-
 2 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 72b1a10..2e681d9 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -105,6 +105,11 @@ struct preempt_notifier;
  * @sched_out: we've just been preempted
  *    notifier: struct preempt_notifier for the task being preempted
  *    next: the task that's kicking us out
+ *
+ * Please note that sched_in and out are called under different
+ * contexts.  sched_out is called with rq lock held and irq disabled
+ * while sched_in is called without rq lock and irq enabled.  This
+ * difference is intentional and depended upon by its users.
  */
 struct preempt_ops {
 	void (*sched_in)(struct preempt_notifier *notifier, int cpu);
diff --git a/kernel/sched.c b/kernel/sched.c
index b3d4e2b..1031cae 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2768,9 +2768,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	prev_state = prev->state;
 	finish_arch_switch(prev);
 	perf_event_task_sched_in(current, cpu_of(rq));
-	fire_sched_in_preempt_notifiers(current);
 	finish_lock_switch(rq, prev);
 
+	fire_sched_in_preempt_notifiers(current);
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {


* Re: [PATCH 3/7] sched: refactor try_to_wake_up()
  2009-12-02 10:14         ` Tejun Heo
@ 2009-12-02 11:01           ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-02 11:01 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Mike Galbraith, tglx, mingo, avi, rusty, linux-kernel

On Wed, 2009-12-02 at 19:14 +0900, Tejun Heo wrote:
> On 12/02/2009 07:10 PM, Mike Galbraith wrote:
> >>> I think it would _look_ better with the thing just
> >>> duplicated/stripped down and called what it is,
> >>> sched_notifier_wakeup() or such.
> >>
> >> Sorry, I'm not following.  Can you elaborate a bit?
> > 
> > I mean copying ttwu, stripping out locking etc, and renaming the result
> > would look better.  No functional difference though, and binary growth
> > would be left in the wake of such a duplication.
> 
> Ah... okay.  Well, that was the first approach (with some missing
> parts).  Peter wasn't particularly happy with code duplication and
> suggested refactoring ttwu and reuse common parts.

Right, esp since it missed some parts.

I haven't yet had time to look at this patch, but yeah, sharing code
seems like a good idea since its tricky and fragile code.




^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH UPDATED 3/7] sched: refactor try_to_wake_up()
  2009-12-02  3:56 ` [PATCH 3/7] sched: refactor try_to_wake_up() Tejun Heo
  2009-12-02  9:05   ` Mike Galbraith
@ 2009-12-03  6:11   ` Tejun Heo
  1 sibling, 0 replies; 47+ messages in thread
From: Tejun Heo @ 2009-12-03  6:11 UTC (permalink / raw)
  To: Tejun Heo; +Cc: tglx, mingo, avi, peterz, efault, rusty, linux-kernel

Factor ttwu_activate() and ttwu_woken_up() out of try_to_wake_up().
The factoring out doesn't affect try_to_wake_up() much
code-generation-wise.  Depending on configuration options, it ends up
generating either the same object code as before or a slightly
different one due to different register assignment.

This is to help future implementation of try_to_wake_up_local().

Mike Galbraith suggested renaming ttwu_woken_up() to ttwu_post_activation()
and updating the comment in try_to_wake_up().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
Updated to reflect Mike Galbraith's comment.  The only differences are
the function rename and the comment change.

Updated tree available on top of current sched/core available in the
following git trees.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git sched-core-for-ingo
  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git sched-wq-for-ingo

Thanks.

 kernel/sched.c |  114 ++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 64 insertions(+), 50 deletions(-)

Index: work3/kernel/sched.c
===================================================================
--- work3.orig/kernel/sched.c
+++ work3/kernel/sched.c
@@ -2361,11 +2361,67 @@ void task_oncpu_function_call(struct tas
 	preempt_enable();
 }
 
-/***
+static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
+				 bool is_sync, bool is_migrate, bool is_local)
+{
+	schedstat_inc(p, se.nr_wakeups);
+	if (is_sync)
+		schedstat_inc(p, se.nr_wakeups_sync);
+	if (is_migrate)
+		schedstat_inc(p, se.nr_wakeups_migrate);
+	if (is_local)
+		schedstat_inc(p, se.nr_wakeups_local);
+	else
+		schedstat_inc(p, se.nr_wakeups_remote);
+
+	activate_task(rq, p, 1);
+
+	/*
+	 * Only attribute actual wakeups done by this task.
+	 */
+	if (!in_interrupt()) {
+		struct sched_entity *se = &current->se;
+		u64 sample = se->sum_exec_runtime;
+
+		if (se->last_wakeup)
+			sample -= se->last_wakeup;
+		else
+			sample -= se->start_runtime;
+		update_avg(&se->avg_wakeup, sample);
+
+		se->last_wakeup = se->sum_exec_runtime;
+	}
+}
+
+static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
+					int wake_flags, bool success)
+{
+	trace_sched_wakeup(rq, p, success);
+	check_preempt_curr(rq, p, wake_flags);
+
+	p->state = TASK_RUNNING;
+#ifdef CONFIG_SMP
+	if (p->sched_class->task_wake_up)
+		p->sched_class->task_wake_up(rq, p);
+
+	if (unlikely(rq->idle_stamp)) {
+		u64 delta = rq->clock - rq->idle_stamp;
+		u64 max = 2*sysctl_sched_migration_cost;
+
+		if (delta > max)
+			rq->avg_idle = max;
+		else
+			update_avg(&rq->avg_idle, delta);
+		rq->idle_stamp = 0;
+	}
+#endif
+}
+
+/**
  * try_to_wake_up - wake up a thread
- * @p: the to-be-woken-up thread
+ * @p: the thread to be awakened
  * @state: the mask of task states that can be woken
- * @sync: do a synchronous wakeup?
+ * @wake_flags: wake modifier flags (WF_*)
  *
  * Put it on the run-queue if it's not already there. The "current"
  * thread is always on the run-queue (except when the actual
@@ -2373,7 +2429,8 @@ void task_oncpu_function_call(struct tas
  * the simpler "current->state = TASK_RUNNING" to mark yourself
  * runnable without the overhead of this.
  *
- * returns failure only if the task is already active.
+ * Returns %true if @p was woken up, %false if it was already running
+ * or @state didn't match @p's state.
  */
 static int try_to_wake_up(struct task_struct *p, unsigned int state,
 			  int wake_flags)
@@ -2444,54 +2501,11 @@ static int try_to_wake_up(struct task_st
 
 out_activate:
 #endif /* CONFIG_SMP */
-	schedstat_inc(p, se.nr_wakeups);
-	if (wake_flags & WF_SYNC)
-		schedstat_inc(p, se.nr_wakeups_sync);
-	if (orig_cpu != cpu)
-		schedstat_inc(p, se.nr_wakeups_migrate);
-	if (cpu == this_cpu)
-		schedstat_inc(p, se.nr_wakeups_local);
-	else
-		schedstat_inc(p, se.nr_wakeups_remote);
-	activate_task(rq, p, 1);
+	ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
+		      cpu == this_cpu);
 	success = 1;
-
-	/*
-	 * Only attribute actual wakeups done by this task.
-	 */
-	if (!in_interrupt()) {
-		struct sched_entity *se = &current->se;
-		u64 sample = se->sum_exec_runtime;
-
-		if (se->last_wakeup)
-			sample -= se->last_wakeup;
-		else
-			sample -= se->start_runtime;
-		update_avg(&se->avg_wakeup, sample);
-
-		se->last_wakeup = se->sum_exec_runtime;
-	}
-
 out_running:
-	trace_sched_wakeup(rq, p, success);
-	check_preempt_curr(rq, p, wake_flags);
-
-	p->state = TASK_RUNNING;
-#ifdef CONFIG_SMP
-	if (p->sched_class->task_wake_up)
-		p->sched_class->task_wake_up(rq, p);
-
-	if (unlikely(rq->idle_stamp)) {
-		u64 delta = rq->clock - rq->idle_stamp;
-		u64 max = 2*sysctl_sched_migration_cost;
-
-		if (delta > max)
-			rq->avg_idle = max;
-		else
-			update_avg(&rq->avg_idle, delta);
-		rq->idle_stamp = 0;
-	}
-#endif
+	ttwu_post_activation(p, rq, wake_flags, success);
 out:
 	task_rq_unlock(rq, &flags);
 	put_cpu();

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH UPDATED 7/7] sched: implement try_to_wake_up_local()
  2009-12-02  3:56 ` [PATCH 7/7] sched: implement try_to_wake_up_local() Tejun Heo
@ 2009-12-03  6:13   ` Tejun Heo
  2009-12-04 10:47     ` Peter Zijlstra
  2009-12-04 10:44   ` [PATCH " Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2009-12-03  6:13 UTC (permalink / raw)
  To: Tejun Heo; +Cc: tglx, mingo, avi, peterz, efault, rusty, linux-kernel

Implement try_to_wake_up_local().  try_to_wake_up_local() is similar
to try_to_wake_up() but it assumes the caller has this_rq() locked and
the target task is bound to this_rq().  try_to_wake_up_local() can be
called from wakeup and sleep scheduler notifiers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
The only change is to accommodate the function name change from the
earlier patch.

Updated tree available on top of current sched/core available in the
following git trees.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git sched-core-for-ingo
  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git sched-wq-for-ingo

Thanks.

 include/linux/sched.h |    2 +
 kernel/sched.c        |   56 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

Index: work3/include/linux/sched.h
===================================================================
--- work3.orig/include/linux/sched.h
+++ work3/include/linux/sched.h
@@ -2067,6 +2067,8 @@ extern void release_uids(struct user_nam
 
 extern void do_timer(unsigned long ticks);
 
+extern bool try_to_wake_up_local(struct task_struct *p, unsigned int state,
+				 int wake_flags);
 extern int wake_up_state(struct task_struct *tsk, unsigned int state);
 extern int wake_up_process(struct task_struct *tsk);
 extern void wake_up_new_task(struct task_struct *tsk,
Index: work3/kernel/sched.c
===================================================================
--- work3.orig/kernel/sched.c
+++ work3/kernel/sched.c
@@ -2410,6 +2410,10 @@ static inline void ttwu_post_activation(
 		rq->idle_stamp = 0;
 	}
 #endif
+	/*
+	 * Wake up is complete, fire wake up notifier.  This allows
+	 * try_to_wake_up_local() to be called from wake up notifiers.
+	 */
 	if (success)
 		fire_sched_notifiers(p, wakeup);
 }
@@ -2511,6 +2515,53 @@ out:
 }
 
 /**
+ * try_to_wake_up_local - try to wake up a local task with rq lock held
+ * @p: the thread to be awakened
+ * @state: the mask of task states that can be woken
+ * @wake_flags: wake modifier flags (WF_*)
+ *
+ * Put @p on the run-queue if it's not already there.  The caller must
+ * ensure that this_rq() is locked, @p is bound to this_rq() and @p is
+ * not the current task.  this_rq() stays locked over invocation.
+ *
+ * This function can be called from wakeup and sleep scheduler
+ * notifiers.  Be careful not to create deep recursion by chaining
+ * wakeup notifiers.
+ *
+ * Returns %true if @p was woken up, %false if it was already running
+ * or @state didn't match @p's state.
+ */
+bool try_to_wake_up_local(struct task_struct *p, unsigned int state,
+			  int wake_flags)
+{
+	struct rq *rq = task_rq(p);
+	bool success = false;
+
+	BUG_ON(rq != this_rq());
+	BUG_ON(p == current);
+	lockdep_assert_held(&rq->lock);
+
+	if (!sched_feat(SYNC_WAKEUPS))
+		wake_flags &= ~WF_SYNC;
+
+	if (!(p->state & state))
+		return false;
+
+	if (!p->se.on_rq) {
+		if (likely(!task_running(rq, p))) {
+			schedstat_inc(rq, ttwu_count);
+			schedstat_inc(rq, ttwu_local);
+		}
+		ttwu_activate(p, rq, wake_flags & WF_SYNC, false, true);
+		success = true;
+	}
+
+	ttwu_post_activation(p, rq, wake_flags, success);
+
+	return success;
+}
+
+/**
  * wake_up_process - Wake up a specific process
  * @p: The process to be woken up.
  *
@@ -5437,6 +5488,11 @@ need_resched_nonpreemptible:
 		if (unlikely(signal_pending_state(prev->state, prev))) {
 			prev->state = TASK_RUNNING;
 		} else {
+			/*
+			 * Fire sleep notifier before changing any scheduler
+			 * state.  This allows try_to_wake_up_local() to be
+			 * called from sleep notifiers.
+			 */
 			fire_sched_notifiers(prev, sleep);
 			deactivate_task(rq, prev, 1);
 		}
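
As a usage illustration (not part of the patch; struct worker_pool, struct
worker and first_idle_worker() below are hypothetical stand-ins), a sleep
notifier in something like cmwq could wake an idle worker bound to this cpu
without dropping the rq lock:

/*
 * Hypothetical helper called from a sched_notifier sleep callback,
 * i.e. from schedule() with this_rq() locked and irqs disabled.
 */
static void wake_one_idle_worker(struct worker_pool *pool)
{
	struct worker *idle = first_idle_worker(pool);

	/*
	 * idle->task is bound to this cpu and is not current, as
	 * try_to_wake_up_local() requires; this_rq() stays locked
	 * across the call.
	 */
	if (idle)
		try_to_wake_up_local(idle->task, TASK_INTERRUPTIBLE, 0);
}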

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-02  3:56 ` [PATCH 4/7] sched: implement force_cpus_allowed() Tejun Heo
@ 2009-12-04 10:40   ` Peter Zijlstra
  2009-12-04 10:43     ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-04 10:40 UTC (permalink / raw)
  To: Tejun Heo; +Cc: tglx, mingo, avi, efault, rusty, linux-kernel

On Wed, 2009-12-02 at 12:56 +0900, Tejun Heo wrote:
> set_cpus_allowed_ptr() modifies the allowed cpu mask of a task.  The
> function performs the following checks before applying new mask.
> 
> * Check whether PF_THREAD_BOUND is set.  This is set for bound
>   kthreads so that they can't be moved around.
> 
> * Check whether the target cpu is still marked active - cpu_active().
>   Active state is cleared early while downing a cpu.
> 
> This patch adds force_cpus_allowed() which bypasses the above two
> checks.  The caller is responsible for guaranteeing that the
> destination cpu doesn't go down until force_cpus_allowed() finishes.
> 
> The first check is bypassed by factoring out actual migration part
> into __set_cpus_allowed() from set_cpus_allowed_ptr() and calling the
> inner function from force_cpus_allowed().
> 
> The second check is buried deep down in __migrate_task() which is
> executed by migration threads.  @force parameter is added to
> __migrate_task().  As the only way to pass parameters from
> __set_cpus_allowed() is through migration_req, migration_req->force is
> added and the @force parameter is passed down to __migrate_task().
> 
> Please note the naming discrepancy between set_cpus_allowed_ptr() and
> the new functions.  The _ptr suffix is from the days when cpumask api
> wasn't mature and future changes should drop it from
> set_cpus_allowed_ptr() too.
> 
> force_cpus_allowed() will be used for concurrency-managed workqueue.

Would still like to know why all this is needed.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-04 10:40   ` Peter Zijlstra
@ 2009-12-04 10:43     ` Peter Zijlstra
  2009-12-07  4:34       ` Tejun Heo
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-04 10:43 UTC (permalink / raw)
  To: Tejun Heo; +Cc: tglx, mingo, avi, efault, rusty, linux-kernel

On Fri, 2009-12-04 at 11:40 +0100, Peter Zijlstra wrote:
> On Wed, 2009-12-02 at 12:56 +0900, Tejun Heo wrote:
> > set_cpus_allowed_ptr() modifies the allowed cpu mask of a task.  The
> > function performs the following checks before applying new mask.
> > 
> > * Check whether PF_THREAD_BOUND is set.  This is set for bound
> >   kthreads so that they can't be moved around.
> > 
> > * Check whether the target cpu is still marked active - cpu_active().
> >   Active state is cleared early while downing a cpu.
> > 
> > This patch adds force_cpus_allowed() which bypasses the above two
> > checks.  The caller is responsible for guaranteeing that the
> > destination cpu doesn't go down until force_cpus_allowed() finishes.
> > 
> > The first check is bypassed by factoring out actual migration part
> > into __set_cpus_allowed() from set_cpus_allowed_ptr() and calling the
> > inner function from force_cpus_allowed().
> > 
> > The second check is buried deep down in __migrate_task() which is
> > executed by migration threads.  @force parameter is added to
> > __migrate_task().  As the only way to pass parameters from
> > __set_cpus_allowed() is through migration_req, migration_req->force is
> > added and the @force parameter is passed down to __migrate_task().
> > 
> > Please note the naming discrepancy between set_cpus_allowed_ptr() and
> > the new functions.  The _ptr suffix is from the days when cpumask api
> > wasn't mature and future changes should drop it from
> > set_cpus_allowed_ptr() too.
> > 
> > force_cpus_allowed() will be used for concurrency-managed workqueue.
> 
> Would still like to know why all this is needed.

That is, what problem do these new-fangled workqueues have and why is
this a good solution.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] sched: implement try_to_wake_up_local()
  2009-12-02  3:56 ` [PATCH 7/7] sched: implement try_to_wake_up_local() Tejun Heo
  2009-12-03  6:13   ` [PATCH UPDATED " Tejun Heo
@ 2009-12-04 10:44   ` Peter Zijlstra
  2009-12-07  3:26     ` Tejun Heo
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-04 10:44 UTC (permalink / raw)
  To: Tejun Heo; +Cc: tglx, mingo, avi, efault, rusty, linux-kernel

On Wed, 2009-12-02 at 12:56 +0900, Tejun Heo wrote:

> +++ b/kernel/sched.c
> @@ -2410,6 +2410,10 @@ static inline void ttwu_woken_up(struct task_struct *p, struct rq *rq,
>  		rq->idle_stamp = 0;
>  	}
>  #endif
> +	/*
> +	 * Wake up is complete, fire wake up notifier.  This allows
> +	 * try_to_wake_up_local() to be called from wake up notifiers.
> +	 */
>  	if (success)
>  		fire_sched_notifiers(p, wakeup);
>  }

> @@ -5437,6 +5488,11 @@ need_resched_nonpreemptible:
>  		if (unlikely(signal_pending_state(prev->state, prev))) {
>  			prev->state = TASK_RUNNING;
>  		} else {
> +			/*
> +			 * Fire sleep notifier before changing any scheduler
> +			 * state.  This allows try_to_wake_up_local() to be
> +			 * called from sleep notifiers.
> +			 */
>  			fire_sched_notifiers(prev, sleep);
>  			deactivate_task(rq, prev, 1);
>  		}

These two hunks seem to belong to patch 6


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH UPDATED 7/7] sched: implement try_to_wake_up_local()
  2009-12-03  6:13   ` [PATCH UPDATED " Tejun Heo
@ 2009-12-04 10:47     ` Peter Zijlstra
  2009-12-07  3:31       ` Tejun Heo
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-04 10:47 UTC (permalink / raw)
  To: Tejun Heo; +Cc: tglx, mingo, avi, efault, rusty, linux-kernel

On Thu, 2009-12-03 at 15:13 +0900, Tejun Heo wrote:
> Updated tree available on top of current sched/core available in the
> following git trees.
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git sched-core-for-ingo
>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git sched-wq-for-ingo

Would you have that in quilt format? git sucks for reviewing.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] sched: implement try_to_wake_up_local()
  2009-12-04 10:44   ` [PATCH " Peter Zijlstra
@ 2009-12-07  3:26     ` Tejun Heo
  2009-12-07  8:50       ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2009-12-07  3:26 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, mingo, avi, efault, rusty, linux-kernel

Hello,

On 12/04/2009 07:44 PM, Peter Zijlstra wrote:
> On Wed, 2009-12-02 at 12:56 +0900, Tejun Heo wrote:
> 
>> +++ b/kernel/sched.c
>> @@ -2410,6 +2410,10 @@ static inline void ttwu_woken_up(struct task_struct *p, struct rq *rq,
>>  		rq->idle_stamp = 0;
>>  	}
>>  #endif
>> +	/*
>> +	 * Wake up is complete, fire wake up notifier.  This allows
>> +	 * try_to_wake_up_local() to be called from wake up notifiers.
>> +	 */
>>  	if (success)
>>  		fire_sched_notifiers(p, wakeup);
>>  }
> 
>> @@ -5437,6 +5488,11 @@ need_resched_nonpreemptible:
>>  		if (unlikely(signal_pending_state(prev->state, prev))) {
>>  			prev->state = TASK_RUNNING;
>>  		} else {
>> +			/*
>> +			 * Fire sleep notifier before changing any scheduler
>> +			 * state.  This allows try_to_wake_up_local() to be
>> +			 * called from sleep notifiers.
>> +			 */
>>  			fire_sched_notifiers(prev, sleep);
>>  			deactivate_task(rq, prev, 1);
>>  		}
> 
> These two hunks seem to belong to patch 6

Hmmm... it was intentional as, before this patch, there's no
try_to_wake_up_local() so it was strange to mention it in the comment.
I can move the comments but I don't think it's particularly better
that way.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH UPDATED 7/7] sched: implement try_to_wake_up_local()
  2009-12-04 10:47     ` Peter Zijlstra
@ 2009-12-07  3:31       ` Tejun Heo
  0 siblings, 0 replies; 47+ messages in thread
From: Tejun Heo @ 2009-12-07  3:31 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, mingo, avi, efault, rusty, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 216 bytes --]

Hello,

On 12/04/2009 07:47 PM, Peter Zijlstra wrote:
> Would you have that in quilt format? git sucks for reviewing.

Here it is.  I'll post full series again after this review pass is
complete.

Thanks.

-- 
tejun

[-- Attachment #2: sched-wq.tar.gz --]
[-- Type: application/x-gzip, Size: 10309 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-04 10:43     ` Peter Zijlstra
@ 2009-12-07  4:34       ` Tejun Heo
  2009-12-07  8:35         ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2009-12-07  4:34 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, mingo, avi, efault, rusty, linux-kernel

Hello, Peter.

On 12/04/2009 07:43 PM, Peter Zijlstra wrote:
>>> force_cpus_allowed() will be used for concurrency-managed workqueue.
>>
>> Would still like to know why all this is needed.
> 
> That is, what problem do these new-fangled workqueues have and why is
> this a good solution.

This is the original RFC posting of cmwq which includes the whole
thing.  I'm a few days away from posting a new version but the usage
of force_cpus_allowed() remains the same.

  http://thread.gmane.org/gmane.linux.kernel/896268/focus=896294

There are two tests which are bypassed by the force_ variant.

* PF_THREAD_BOUND.  This is used to mark tasks which are bound to a
  cpu using kthread_bind() to be bound permanently.  However, new
  trustee based workqueue hotplugging decouples per-cpu workqueue
  flushing from cpu hot plug/unplugging.  This is necessary because
  with cmwq, long running works can be served by regular workqueues,
  so delaying completion of hot plug/unplugging till certain works are
  flushed isn't feasible.  So, what becomes necessary is the ability
  to re-bind tasks which have PF_THREAD_BOUND set but were unbound from
  their now offline cpu when it comes back online again.

* cpu_active() test.  CPU activeness encloses cpu online status which
  is the actual on/offline state.  Workqueues need to keep running
  while a cpu is going down, and with cmwq, keeping workqueues running
  involves creating new workers (this is part of the forward-progress
  guarantee, and one of the cpu down callbacks might end up waiting on
  completion of certain works).

  The problem with active state is that during cpu down, active going
  off doesn't mean all tasks have been migrated off the cpu, so
  without a migration interface which is synchronized with the actual
  offline migration, it is difficult to guarantee that all works are
  either running on the designated cpu if the cpu is online, or on
  other cpus if the cpu is offline.

  Another related problem is that there's no way to monitor the cpu
  activeness change notifications.
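
Roughly, the split described in the changelog quoted above looks like the
sketch below.  This is illustrative only; in the actual patch the @force
flag travels to __migrate_task() via migration_req and the signatures differ:

/* check 1: refuse to move permanently bound kthreads */
int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
{
	if (p->flags & PF_THREAD_BOUND)
		return -EINVAL;
	return __set_cpus_allowed(p, new_mask, false);
}

/* bypasses both checks; caller keeps the destination cpu from going down */
int force_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
{
	return __set_cpus_allowed(p, new_mask, true);
}

/* check 2: run by the migration thread, skipped when forced */
static int __migrate_task(struct task_struct *p, int dest_cpu, bool force)
{
	if (!force && !cpu_active(dest_cpu))
		return 0;
	/* ... move @p to dest_cpu's runqueue ... */
	return 1;
}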

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-07  4:34       ` Tejun Heo
@ 2009-12-07  8:35         ` Peter Zijlstra
  2009-12-07 10:34           ` Tejun Heo
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-07  8:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

On Mon, 2009-12-07 at 13:34 +0900, Tejun Heo wrote:
> Hello, Peter.
> 
> On 12/04/2009 07:43 PM, Peter Zijlstra wrote:
> >>> force_cpus_allowed() will be used for concurrency-managed workqueue.
> >>
> >> Would still like to know why all this is needed.
> > 
> > That is, what problem do these new-fangled workqueues have and why is
> > this a good solution.
> 
> This is the original RFC posting of cmwq which includes the whole
> thing.  I'm a few days away from posting a new version but the usage
> of force_cpus_allowed() remains the same.
> 
>   http://thread.gmane.org/gmane.linux.kernel/896268/focus=896294
> 
> There are two tests which are bypassed by the force_ variant.
> 
> * PF_THREAD_BOUND.  This is used to mark tasks which are bound to a
>   cpu using kthread_bind() to be bound permanently.  However, new
>   trustee based workqueue hotplugging decouples per-cpu workqueue
>   flushing with cpu hot plug/unplugging.  This is necessary because
>   with cmwq, long running works can be served by regular workqueues,
>   so delaying completion of hot plug/unplugging till certain works are
>   flushed isn't feasible.  So, what becomes necessary is the ability
>   to re-bind tasks which has PF_THREAD_BOUND set but unbound from its
>   now offline cpu which is coming online again.

I'm not at all sure I like that. I'd be perfectly happy with delaying
the hot-unplug.

The whole cpu hotplug mess is tricky enough as it is and I see no
compelling reason to further complicate it. If people are really going
to enqueue strict per-cpu worklets (queue_work_on()) that take seconds
to complete, then they get to keep the results of that, which includes
slow hot unplug.

Having an off-line cpu still process code like it was online is asking
for trouble, don't go there.

> * cpu_active() test.  CPU activeness encloses cpu online status which
>   is the actual on/offline state.  Workqueues need to keep running
>   while a cpu is going down and with cmwq keeping workqueues running
>   involves creation of new workers (it consists part of
>   forward-progress guarantee and one of cpu down callbacks might end
>   up waiting on completion of certain works).
> 
>   The problem with active state is that during cpu down, active going
>   off doesn't mean all tasks have been migrated off the cpu, so
>   without a migration interface which is synchronized with the actual
>   offline migration, it is difficult to guarantee that all works are
>   either running on the designated cpu if the cpu is online or all
>   work on other cpus if the cpu is offline.
> 
>   Another related problem is that there's no way to monitor the cpu
>   activeness change notifications.

cpu_active() is basically meant for the scheduler to not stick new tasks
on a dying cpu.

So on hot-unplug you'd want to splice your worklets to another cpu,
except maybe those strictly enqueued to the dying cpu, and since there
was work on the dying cpu, you already had a task processing them, so
you don't need new tasks, right?





^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] sched: implement try_to_wake_up_local()
  2009-12-07  3:26     ` Tejun Heo
@ 2009-12-07  8:50       ` Peter Zijlstra
  2009-12-07  8:56         ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-07  8:50 UTC (permalink / raw)
  To: Tejun Heo; +Cc: tglx, mingo, avi, efault, rusty, linux-kernel

On Mon, 2009-12-07 at 12:26 +0900, Tejun Heo wrote:
> Hello,
> 
> On 12/04/2009 07:44 PM, Peter Zijlstra wrote:
> > On Wed, 2009-12-02 at 12:56 +0900, Tejun Heo wrote:
> > 
> >> +++ b/kernel/sched.c
> >> @@ -2410,6 +2410,10 @@ static inline void ttwu_woken_up(struct task_struct *p, struct rq *rq,
> >>  		rq->idle_stamp = 0;
> >>  	}
> >>  #endif
> >> +	/*
> >> +	 * Wake up is complete, fire wake up notifier.  This allows
> >> +	 * try_to_wake_up_local() to be called from wake up notifiers.
> >> +	 */
> >>  	if (success)
> >>  		fire_sched_notifiers(p, wakeup);
> >>  }
> > 
> >> @@ -5437,6 +5488,11 @@ need_resched_nonpreemptible:
> >>  		if (unlikely(signal_pending_state(prev->state, prev))) {
> >>  			prev->state = TASK_RUNNING;
> >>  		} else {
> >> +			/*
> >> +			 * Fire sleep notifier before changing any scheduler
> >> +			 * state.  This allows try_to_wake_up_local() to be
> >> +			 * called from sleep notifiers.
> >> +			 */
> >>  			fire_sched_notifiers(prev, sleep);
> >>  			deactivate_task(rq, prev, 1);
> >>  		}
> > 
> > These two hunks seem to belong to patch 6
> 
> Hmmm... it was intentional as, before this patch, there's no
> try_to_wake_up_local() so it was strange to mention it in the comment.
> I can move the comments but I don't think it's particularly better
> that way.

/me reads the comments and goes ah!

OK, maybe you've got a point there ;-)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] sched: implement try_to_wake_up_local()
  2009-12-07  8:50       ` Peter Zijlstra
@ 2009-12-07  8:56         ` Peter Zijlstra
  2009-12-07 10:27           ` Tejun Heo
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-07  8:56 UTC (permalink / raw)
  To: Tejun Heo; +Cc: tglx, mingo, avi, efault, rusty, linux-kernel

On Mon, 2009-12-07 at 09:50 +0100, Peter Zijlstra wrote:
> On Mon, 2009-12-07 at 12:26 +0900, Tejun Heo wrote:

> > Hmmm... it was intentional as, before this patch, there's no
> > try_to_wake_up_local() so it was strange to mention it in the comment.
> > I can move the comments but I don't think it's particularly better
> > that way.
> 
> /me reads the comments and goes ah!
> 
> OK, maybe you've got a point there ;-)


OK, so you fork and wake up a new thread when an existing one goes to
sleep, but do you also limit the concurrency on wakeup?

Otherwise we can end up with say 100 workqueue tasks running, simply
because they all ran into a contended lock and then woke up again.

Where does that fork happen? Having to do memory allocations and all
that while holding the rq->lock doesn't seem like a very good idea.

What happens when you run out of memory and the workqueue progress is
needed to get free memory?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] sched: implement try_to_wake_up_local()
  2009-12-07  8:56         ` Peter Zijlstra
@ 2009-12-07 10:27           ` Tejun Heo
  2009-12-08  8:53             ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2009-12-07 10:27 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, mingo, avi, efault, rusty, linux-kernel

Hello,

On 12/07/2009 05:56 PM, Peter Zijlstra wrote:
> OK, so you fork and wakeup a new thread when an existing one goes to
> sleep, but do you also limit the concurrency on wakeup?
>
> Otherwise we can end up with say 100 workqueue tasks running, simply
> because they all ran into a contended lock and then woke up again.

Yes, the total number of workers is limited.

> Where does that fork happen? Having to do memory allocations and all
> that while holding the rq->lock doesn't seem like a very good idea.

There is always guaranteed to be at least one idle worker.  When the
last running worker goes to sleep, an idle worker is woken up and
assumes the role of manager, forking another worker if necessary, so
all that happens under rq->lock is waking up another worker.

> What happens when you run out of memory and the workqueue progress is
> needed to get free memory?

That's when the "rescuer" workers kick in.  All workqueues which might
be used in allocation paths are required to have one.  When the
manager fails to fork a new worker, it summons the rescuers.  They
migrate to the distressed cpu and process only the works from their
specific workqueues, thus guaranteeing forward progress.
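
Conceptually (purely illustrative, with hypothetical names; the actual cmwq
manager/rescuer code is not part of this thread), the manager's
forward-progress path is something like:

static void maybe_create_worker(struct worker_pool *pool)
{
	while (need_more_workers(pool)) {
		if (create_worker(pool))	/* fork a new kthread */
			continue;
		/*
		 * Fork failed (e.g. under memory pressure): summon each
		 * workqueue's rescuer to migrate to this cpu and process
		 * that workqueue's works, keeping forward progress.
		 */
		send_mayday_to_rescuers(pool);
		break;
	}
}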

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-07  8:35         ` Peter Zijlstra
@ 2009-12-07 10:34           ` Tejun Heo
  2009-12-07 10:54             ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2009-12-07 10:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

Hello,

On 12/07/2009 05:35 PM, Peter Zijlstra wrote:
>> * PF_THREAD_BOUND.  This is used to mark tasks which are bound to a
>>   cpu using kthread_bind() to be bound permanently.  However, new
>>   trustee based workqueue hotplugging decouples per-cpu workqueue
>>   flushing with cpu hot plug/unplugging.  This is necessary because
>>   with cmwq, long running works can be served by regular workqueues,
>>   so delaying completion of hot plug/unplugging till certain works are
>>   flushed isn't feasible.  So, what becomes necessary is the ability
>>   to re-bind tasks which has PF_THREAD_BOUND set but unbound from its
>>   now offline cpu which is coming online again.
> 
> I'm not at all sure I like that. I'd be perfectly happy with delaying
> the hot-unplug.
> 
> The whole cpu hotplug mess is tricky enough as it is and I see no
> compelling reason to further complicate it. If people are really going
> to enqueue strict per-cpu worklets (queue_work_on()) that takes seconds
> to complete, then they get to keep the results of that, which includes
> slow hot unplug.
> 
> Having an off-line cpu still process code like it was online is asking
> for trouble, don't go there.

We're already there.  Users of workqueue which require strict CPU
affinity are required to flush respective works from CPU down
notifiers and fire them as necessary on up notifiers; otherwise, works
will continue to run until they're done after the cpu went down
regardless of whether explicit queue_work_on() was used or not.

The only difference is that now the ability to create new workers is
necessary to guarantee forward progress while cpu is going down, so
it's not that different from what we're doing now.

>>   Another related problem is that there's no way to monitor the cpu
>>   activeness change notifications.
> 
> cpu_active() is basically meant for the scheduler to not stick new tasks
> on a dying cpu.
> 
> So on hot-unplug you'd want to splice your worklets to another cpu,
> except maybe those strictly enqueued to the dying cpu, and since there
> was work on the dying cpu, you already had a task processing them, so
> you don't need new tasks, right?

We need to be able to fork to guarantee forward progress at any point,
which means that, to guarantee a flush_work() or flush_workqueue() from
a cpu down notification doesn't block indefinitely, we need to be able
to create new workers after cpu_active() is cleared.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-07 10:34           ` Tejun Heo
@ 2009-12-07 10:54             ` Peter Zijlstra
  2009-12-07 11:07               ` Tejun Heo
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-07 10:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

On Mon, 2009-12-07 at 19:34 +0900, Tejun Heo wrote:
> Hello,
> 
> On 12/07/2009 05:35 PM, Peter Zijlstra wrote:
> >> * PF_THREAD_BOUND.  This is used to mark tasks which are bound to a
> >>   cpu using kthread_bind() to be bound permanently.  However, new
> >>   trustee based workqueue hotplugging decouples per-cpu workqueue
> >>   flushing with cpu hot plug/unplugging.  This is necessary because
> >>   with cmwq, long running works can be served by regular workqueues,
> >>   so delaying completion of hot plug/unplugging till certain works are
> >>   flushed isn't feasible.  So, what becomes necessary is the ability
> >>   to re-bind tasks which has PF_THREAD_BOUND set but unbound from its
> >>   now offline cpu which is coming online again.
> > 
> > I'm not at all sure I like that. I'd be perfectly happy with delaying
> > the hot-unplug.
> > 
> > The whole cpu hotplug mess is tricky enough as it is and I see no
> > compelling reason to further complicate it. If people are really going
> > to enqueue strict per-cpu worklets (queue_work_on()) that takes seconds
> > to complete, then they get to keep the results of that, which includes
> > slow hot unplug.
> > 
> > Having an off-line cpu still process code like it was online is asking
> > for trouble, don't go there.
> 
> We're already there.  Users of workqueue which require strict CPU
> affinity are required to flush respective works from CPU down
> notifiers and fire them as necessary on up notifiers; otherwise, works
> will continue to run until they're done after the cpu went down
> regardless of whether explicit queue_work_on() was used or not.

So we seem to do cleanup_workqueue_thread() from CPU_POST_DEAD, but at
that time any thread that might still be around will most certainly not
be running on the offlined cpu anymore.

If you really want to ensure you remain on the cpu, you have to complete
from CPU_DOWN_PREPARE.

We're not running things from offline CPUs.




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-07 10:54             ` Peter Zijlstra
@ 2009-12-07 11:07               ` Tejun Heo
  2009-12-08  8:41                 ` Tejun Heo
  0 siblings, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2009-12-07 11:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

Hello,

On 12/07/2009 07:54 PM, Peter Zijlstra wrote:
> So we seem to do cleanup_workqueue_thread() from CPU_POST_DEAD, but at
> that time any thread that might still be around will most certainly not
> be running on the offlined cpu anymore.
> 
> If you really want to ensure you remain on the cpu, you have to complete
> from CPU_DOWN_PREPARE.
> 
> We're not running things from offline CPUs.

Oh, no, we're not doing that.  We can't do that.  What we're doing is
continuing to process works which were queued on the now offline cpu
unless they have been flushed/cancelled from one of the cpu down
notifications, and the reason we need to be able to fork after active
is cleared is to guarantee those flushes/cancels don't deadlock.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-07 11:07               ` Tejun Heo
@ 2009-12-08  8:41                 ` Tejun Heo
  2009-12-08  9:02                   ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2009-12-08  8:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

Hello,

On 12/07/2009 08:07 PM, Tejun Heo wrote:
> On 12/07/2009 07:54 PM, Peter Zijlstra wrote:
>> So we seem to do cleanup_workqueue_thread() from CPU_POST_DEAD, but at
>> that time any thread that might still be around will most certainly not
>> be running on the offlined cpu anymore.
>>
>> If you really want to ensure you remain on the cpu, you have to complete
>> from CPU_DOWN_PREPARE.
>>
>> We're not running things from offline CPUs.
> 
> Oh, no, we're not doing that.  We can't do that.  What we're doing is
> to continue to process works which were queued on the now offline cpu
> unless it has been flushed/cancled from one of the cpu down
> notifications and the reason why we need to be able to fork after
> active is clear is to guarantee those flush/cancels don't deadlock.

Does my explanation justify the patch?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] sched: implement try_to_wake_up_local()
  2009-12-07 10:27           ` Tejun Heo
@ 2009-12-08  8:53             ` Peter Zijlstra
  2009-12-08  9:16               ` Tejun Heo
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-08  8:53 UTC (permalink / raw)
  To: Tejun Heo; +Cc: tglx, mingo, avi, efault, rusty, linux-kernel

On Mon, 2009-12-07 at 19:27 +0900, Tejun Heo wrote:
> 
> > What happens when you run out of memory and the workqueue progress is
> > needed to get free memory?
> 
> That's when the "rescuer" workers kick in.  All workqueues which might
> be used during allocation paths are required to have one.  When
> manager fails to fork a new worker, it summons the rescuers.  They
> migrate to the distressed cpu and processes only the works from their
> specific workqueues thus guaranteeing forward progress.

What happens if the rescue worker is tied up on another cpu in a
deadlock avoidance game, which requires progress on this cpu in order to
resolve it?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-08  8:41                 ` Tejun Heo
@ 2009-12-08  9:02                   ` Peter Zijlstra
  2009-12-08  9:12                     ` Tejun Heo
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-08  9:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

On Tue, 2009-12-08 at 17:41 +0900, Tejun Heo wrote:
> Hello,
> 
> On 12/07/2009 08:07 PM, Tejun Heo wrote:
> > On 12/07/2009 07:54 PM, Peter Zijlstra wrote:
> >> So we seem to do cleanup_workqueue_thread() from CPU_POST_DEAD, but at
> >> that time any thread that might still be around will most certainly not
> >> be running on the offlined cpu anymore.
> >>
> >> If you really want to ensure you remain on the cpu, you have to complete
> >> from CPU_DOWN_PREPARE.
> >>
> >> We're not running things from offline CPUs.
> > 
> > Oh, no, we're not doing that.  We can't do that.  What we're doing is
> > to continue to process works which were queued on the now offline cpu
> > unless it has been flushed/cancled from one of the cpu down
> > notifications and the reason why we need to be able to fork after
> > active is clear is to guarantee those flush/cancels don't deadlock.
> 
> Does my explanation justify the patch?

So it's only needed in order to flush a workqueue from CPU_DOWN_PREPARE?
And all you need is to place a new kthread on a !active cpu?

Or is this in order to allow migrate_live_tasks() to move the worker
threads away from the dead cpu?

I'm really not thrilled by the whole fork-fest workqueue design.




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-08  9:02                   ` Peter Zijlstra
@ 2009-12-08  9:12                     ` Tejun Heo
  2009-12-08 10:34                       ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2009-12-08  9:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

Hello,

On 12/08/2009 06:02 PM, Peter Zijlstra wrote:
> So its only needed in order to flush a workqueue from CPU_DOWN_PREPARE?
> And all you need it to place a new kthread on a !active cpu?

Yes, that's all I need.

> Or is this in order to allow migrate_live_tasks() to move the worker
> threads away from the dead cpu?

Nope, that's left to sched cpu callbacks.

> I'm really not thrilled by the whole fork-fest workqueue design.

A shared work pool kind of needs forks unless it's gonna create all the
workers upfront.  Most issues arise from corner cases - deadlocks, cpu
hotplugging, freeze/thaw and so on.  During usual operation, the
thread pool would remain largely stable.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] sched: implement try_to_wake_up_local()
  2009-12-08  8:53             ` Peter Zijlstra
@ 2009-12-08  9:16               ` Tejun Heo
  0 siblings, 0 replies; 47+ messages in thread
From: Tejun Heo @ 2009-12-08  9:16 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, mingo, avi, efault, rusty, linux-kernel

Hello,

On 12/08/2009 05:53 PM, Peter Zijlstra wrote:
> On Mon, 2009-12-07 at 19:27 +0900, Tejun Heo wrote:
>>
>>> What happens when you run out of memory and the workqueue progress is
>>> needed to get free memory?
>>
>> That's when the "rescuer" workers kick in.  All workqueues which might
>> be used during allocation paths are required to have one.  When
>> manager fails to fork a new worker, it summons the rescuers.  They
>> migrate to the distressed cpu and processes only the works from their
>> specific workqueues thus guaranteeing forward progress.
> 
> What happens if the rescue worker is tied up on another cpu in a
> deadlock avoidance game, which requires progress on this cpu in order to
> resolve it?

Hmmm.... I think one rescuer thread per workqueue is enough to
guarantee forward-progress.  If there are workqueues which have
inter-cpu dependencies, we'll probably need per-cpu rescuers, but are
there any?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-08  9:12                     ` Tejun Heo
@ 2009-12-08 10:34                       ` Peter Zijlstra
  2009-12-08 10:38                         ` Peter Zijlstra
  2009-12-08 11:24                         ` Tejun Heo
  0 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-08 10:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

On Tue, 2009-12-08 at 18:12 +0900, Tejun Heo wrote:
> > So its only needed in order to flush a workqueue from CPU_DOWN_PREPARE?
> > And all you need it to place a new kthread on a !active cpu?
> 
> Yes, that's all I need. 

Then you don't need most of that patch, you don't need to touch the
migration stuff since a new kthread isn't running.

All you need to do is make a kthread_bind/set_cpus_allowed variant that
checks against cpu_online_mask instead of cpu_active_mask.

It might even make sense to have kthread_bind() always check with
cpu_online_mask as the kernel really ought to know what it's doing
anyway.

You also don't need to play trickery with PF_THREAD_BOUND, since a new
kthread will not have that set.

In fact, your patch is against a tree that still has cpu_online_mask in
all those places, so you wouldn't have needed any of that, confused.. ?!


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-08 10:34                       ` Peter Zijlstra
@ 2009-12-08 10:38                         ` Peter Zijlstra
  2009-12-08 11:26                           ` Tejun Heo
  2009-12-08 11:24                         ` Tejun Heo
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-08 10:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

On Tue, 2009-12-08 at 11:34 +0100, Peter Zijlstra wrote:
> On Tue, 2009-12-08 at 18:12 +0900, Tejun Heo wrote:
> > > So its only needed in order to flush a workqueue from CPU_DOWN_PREPARE?
> > > And all you need it to place a new kthread on a !active cpu?
> > 
> > Yes, that's all I need. 
> 
> Then you don't need most of that patch, you don't need to touch the
> migration stuff since a new kthread isn't running.
> 
> All you need to do is make a kthread_bind/set_cpus_allowed variant that
> checks against cpu_online_mask instead of cpu_active_mask.
> 
> It might even make sense to have kthread_bind() always check with
> cpu_online_mask as the kernel really ought to know what its doing
> anyway.
> 
> You also don't need to play trickery with PF_THREAD_BOUND, since a new
> kthread will not have that set.
> 
> In fact, your patch is against a tree that still has cpu_online_mask in
> all those places, so you wouldn't have needed any of that, confused.. ?!

More confusion, kthread_bind() doesn't even use set_cpus_allowed() so it
wouldn't have hit any of that crap.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-08 10:34                       ` Peter Zijlstra
  2009-12-08 10:38                         ` Peter Zijlstra
@ 2009-12-08 11:24                         ` Tejun Heo
  2009-12-08 11:48                           ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2009-12-08 11:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

Hello,

On 12/08/2009 07:34 PM, Peter Zijlstra wrote:
> On Tue, 2009-12-08 at 18:12 +0900, Tejun Heo wrote:
>>> So its only needed in order to flush a workqueue from CPU_DOWN_PREPARE?
>>> And all you need it to place a new kthread on a !active cpu?
>>
>> Yes, that's all I need. 

Let me augment the above sentence.  Yes, that's all I need *during
CPU_DOWN*.  During CPU_UP, I need to migrate back the still-running
workers which survived the last CPU_DOWN.  In the original patch, the
down path is worker_maybe_bind_and_lock() and the latter path is
trustee_unset_rogue().

> Then you don't need most of that patch, you don't need to touch the
> migration stuff since a new kthread isn't running.
> 
> All you need to do is make a kthread_bind/set_cpus_allowed variant that
> checks against cpu_online_mask instead of cpu_active_mask.

kthread_bind() doesn't check against cpu_online_mask.  Isn't a
set_cpus_allowed() variant which checks against cpu_online_mask
what's implemented by the patch (+ the PF_THREAD_BOUND bypass)?

> It might even make sense to have kthread_bind() always check with
> cpu_online_mask as the kernel really ought to know what its doing
> anyway.

Oh... yeah, it was a bit strange that the function doesn't check
against cpu onliness but if that is removed what would be the point of
kthread_bind() when set_cpus_allowed() provides pretty much the same
capability?

> You also don't need to play trickery with PF_THREAD_BOUND, since a new
> kthread will not have that set.

Yeap, but when the cpu comes back online, those kthreads need to be
rebound to the cpu.

> In fact, your patch is against a tree that still has cpu_online_mask in
> all those places, so you wouldn't have needed any of that, confused.. ?!

To migrate back the workers from CPU_ONLINE callback.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-08 10:38                         ` Peter Zijlstra
@ 2009-12-08 11:26                           ` Tejun Heo
  0 siblings, 0 replies; 47+ messages in thread
From: Tejun Heo @ 2009-12-08 11:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

Hello,

On 12/08/2009 07:38 PM, Peter Zijlstra wrote:
> More confusion, kthread_bind() doesn't even use set_cpus_allowed() so it
> wouldn't have hit any of that crap.

kthread_bind() can't be used safely unless cpu up/down is synchronized
against in some way.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-08 11:24                         ` Tejun Heo
@ 2009-12-08 11:48                           ` Peter Zijlstra
  2009-12-08 11:56                             ` Tejun Heo
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-08 11:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

On Tue, 2009-12-08 at 20:24 +0900, Tejun Heo wrote:
> 
> Let me augment the above sentence.  Yes, that's all I need *during
> CPU_DOWN*.  During CPU_UP, I need to migrate back left running workers
> which survived from the last CPU_DOWN.  In the original patch, the
> down path is worker_maybe_bind_and_lock() and the latter path is
> trustee_unset_rogue(). 

Why bother with that?

workqueue's CPU_POST_DEAD will flush the workqueue and destroy all
threads under cpu_add_remove_lock, which excludes the cpu from coming
back up before it's fully destroyed.

So there are no remaining tasks to be migrated back.

Changing that semantics is not worthwhile.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-08 11:48                           ` Peter Zijlstra
@ 2009-12-08 11:56                             ` Tejun Heo
  2009-12-08 12:10                               ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2009-12-08 11:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

Hello,

On 12/08/2009 08:48 PM, Peter Zijlstra wrote:
> Why bother with that?
> 
> workqueue's CPU_POST_DEAD will flush the workqueue and destroy all
> threads under cpu_add_remove_lock, which excludes the cpu from coming
> back up before its fully destroyed.
> 
> So there's no remaining tasks to be migrated back.
> 
> Changing that semantics is not worthwhile.

It is worthwhile because the goal is to unify all worker pool
mechanisms.  So, slow_work or whatnot (scsi EHs, FS specific worker
pools, async workers used for parallel probing) will all be converted
to use workqueue instead, and with that we can't afford to wait for all
works to flush before downing a cpu.  All that's necessary to implement that
is migrating back the unbound workers which can be implemented as a
separate piece of code apart from regular operation.  It would even
benefit the current implementation as it makes cpu up/down operations
much more deterministic.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-08 11:56                             ` Tejun Heo
@ 2009-12-08 12:10                               ` Peter Zijlstra
  2009-12-08 12:23                                 ` Tejun Heo
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-08 12:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

On Tue, 2009-12-08 at 20:56 +0900, Tejun Heo wrote:
> Hello,
> 
> On 12/08/2009 08:48 PM, Peter Zijlstra wrote:
> > Why bother with that?
> > 
> > workqueue's CPU_POST_DEAD will flush the workqueue and destroy all
> > threads under cpu_add_remove_lock, which excludes the cpu from coming
> > back up before its fully destroyed.
> > 
> > So there's no remaining tasks to be migrated back.
> > 
> > Changing that semantics is not worthwhile.
> 
> It is worthwhile because the goal is to unify all worker pool
> mechanisms.  So, slow_work or whatnot (scsi EHs, FS specific worker
> pools, async workers used for parallel probing) will all be converted
> to use workqueue instead and with that we can't afford to wait for all
> works to flush to down a cpu.  All that's necessary to implement that
> is migrating back the unbound workers which can be implemented as a
> separate piece of code apart from regular operation.  It would even
> benefit the current implementation as it makes cpu up/down operations
> much more deterministic.

Hotplug and deterministic are not to be used in the same sentence; it's
an utter slow path and I'd much rather have simple code than clever code
there -- there have been way too many 'interesting' hotplug problems.

If there is work being enqueued that takes more than a few seconds to
complete then I'm thinking there's something seriously wrong and up to
that point it's perfectly fine to simply wait for it.

Furthermore, if its objective is to cater to generic thread pools then I
think it's an utter fail simply because it mandates strict cpu affinity;
that basically requires you to write a work scheduler to balance work
load etc.  Much easier is a simple unbounded thread pool that gets
balanced by the regular scheduler.

/me liking this stuff less and less :/


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-08 12:10                               ` Peter Zijlstra
@ 2009-12-08 12:23                                 ` Tejun Heo
  2009-12-08 13:35                                   ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2009-12-08 12:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

Hello,

On 12/08/2009 09:10 PM, Peter Zijlstra wrote:
> Hotplug and deterministic are not to be used in the same sentence, its
> an utter slow path and I'd much rather have simple code than clever code
> there -- there's been way too many 'interesting' hotplug problems.

Slow and indeterminism comes in different magnitudes.

> If there is work being enqueued that takes more than a few seconds to
> complete then I'm thinking there's something seriously wrong and up to
> that point its perfectly fine to simply wait for it.
> 
> Furthermore if it's objective is to cater to generic thread pools then I
> think its an utter fail simply because it mandates strict cpu affinity,
> that basically requires you to write a work scheduler to balance work
> load etc.. Much easier is a simple unbounded thread pool that gets
> balanced by the regular scheduler.

The observation was that for most long running async jobs, most time
is spent sleeping instead of burning cpu cycles, and long running ones
are relatively few compared to short ones, so the strict affinity would
be more helpful.  That is the basis of the whole design and why it has
scheduler callbacks to regulate concurrency instead of creating a
bunch of active workers and letting the scheduler take care of it.
Works wouldn't be competing for cpu cycles.

In short, the target workload is the current short works + long
running, mostly sleeping async works, which cover most of the worker
pools we have in the kernel.  I thought about adding an unbound pool
of workers for cpu intensive works for completeness but I really
couldn't find much use for that.  If enough users need something
like that, we can add an anonymous pool but for now I really don't see
the need to worry about that.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-08 12:23                                 ` Tejun Heo
@ 2009-12-08 13:35                                   ` Peter Zijlstra
  2009-12-09  5:25                                     ` Tejun Heo
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-08 13:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

On Tue, 2009-12-08 at 21:23 +0900, Tejun Heo wrote:
> On 12/08/2009 09:10 PM, Peter Zijlstra wrote:
> > Hotplug and deterministic are not to be used in the same sentence, its
> > an utter slow path and I'd much rather have simple code than clever code
> > there -- there's been way too many 'interesting' hotplug problems.
> 
> Slow and indeterminism comes in different magnitudes.

Determinism does _not_ come in magnitudes; it's a very binary property:
either it is or it is not.

As to the order of slowness for unplug, that is about maximal; it's _the_
slowest path in the whole kernel.

> > Furthermore if it's objective is to cater to generic thread pools then I
> > think its an utter fail simply because it mandates strict cpu affinity,
> > that basically requires you to write a work scheduler to balance work
> > load etc.. Much easier is a simple unbounded thread pool that gets
> > balanced by the regular scheduler.
> 
> The observation was that for most long running async jobs, most time
> is spent sleeping instead of burning cpu cycles and long running ones
> are relatively few compared to short ones so the strict affinity would
> be more helpful.  That is the basis of whole design and why it has
> scheduler callbacks to regulate concurrency instead of creating a
> bunch of active workers and letting the scheduler take care of it.
> Works wouldn't be competing for cpu cycles.
> 
> In short, the target workload is the current short works + long
> running mostly sleeping async works, which cover most of worker pools
> we have in kernel.

Ok, maybe, but that is not what I would call a generic thread pool.

So the reason we have tons of idle workqueues around is purely because
of deadlock scenarios?  Or is there other crap about?

So why not start simple and only have one thread per cpu (let's call it
events/#) and run all works there?  Then, when you enqueue a work and
events/# is already busy with a work from another wq, hand the work to a
global event thread which will spawn a special single-shot kthread for
it, with a second exception for those reclaim wq's, for which you'll
have this rescue thread which you'll bind to the right cpu for that
work.

That should get rid of all these gazillion threads we have, preserve the
queue property and not be as invasive as your current thing.

If they're really as idle as reported you'll never need the fork-fest
you currently propose, simply because there's not enough work.

So basically, have events/# service the first non-empty cwq; when
there are more non-empty cwqs, spawn single-shot threads for them, or
use a rescue thread.
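
In pseudo-code, the enqueue side would look something like the sketch
below; every helper name in it is made up, it's only meant to show the
shape of the thing.

/* rough sketch of the dispatch described above, all helpers made up */
static void dispatch_work(unsigned int cpu, struct cpu_workqueue_struct *cwq,
                          struct work_struct *work)
{
        if (!events_thread_busy(cpu)) {
                /* common case: run it on this cpu's events/# thread */
                queue_on_events_thread(cpu, work);
        } else if (cwq_is_reclaim_wq(cwq)) {
                /* reclaim wq: hand it to its rescue thread, bound to this cpu */
                queue_on_rescue_thread(cwq, cpu, work);
        } else {
                /* events/# busy with another wq's work: one-shot kthread */
                spawn_single_shot_kthread(cpu, work);
        }
}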

>   I thought about adding an unbound pool of workers
> for cpu intensive works for completeness but I really couldn't find
> much use for that.  If enough number of users would need something
> like that, we can add an anonymous pool but for now I really don't see
> the need to worry about that.

And I thought I'd heard multiple parties express interest in exactly
that; btrfs, bdi and pohmelfs come to mind, and crypto also looks like
one that could actually do some work.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-08 13:35                                   ` Peter Zijlstra
@ 2009-12-09  5:25                                     ` Tejun Heo
  2009-12-09  7:41                                       ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2009-12-09  5:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

Hello,

On 12/08/2009 10:35 PM, Peter Zijlstra wrote:
>> Slow and indeterminism comes in different magnitudes.
> 
> Determinism does _not_ come in magnitudes, its a very binary property,
> either it is or it is not.

It's semantics, right?  The whole process is made closer to being
deterministic by making each component deterministic, so in that way
"more deterministic" is a meaningful expression; it can also describe
how deterministic it feels to human perception.

> As to the order of slowness for unplug, that is about maximal, its _the_
> slowest path in the whole kernel.

Long running works may run for minutes.  They're slow to wait for as
perceived by human beings, so we're talking about pretty different
scales.

> Ok, maybe, but that is not what I would call a generic thread pool.

Sure, it's not completely generic, not yet anyway.  The main focus is
to solve concurrency issues with the current workqueue implementation.
With concurrency issues solved, accommodating long running works
becomes quite easy - all we need to do is migrate back the unbound
workers during cpu up - and as we have a considerable number of users
which require such usage, it's reasonable to implement it.

> So the reason we have tons of idle workqueues around are purely because
> of deadlock scenarios? Or is there other crap about?

No, that's part of the concurrency issues.  Works don't devour and
compete for CPU cycles, but they overlap each other frequently.  Works
are often used to provide task context so that the users can use
sleeping synchronization constructs or wait for events.

> So why not start simple and only have one thread per cpu (lets call it
> events/#) and run all works there. Then when you enqueue a work and
> events/# is already busy with a work from anther wq, hand the work to a
> global event thread which will spawn a special single shot kthread for
> it, with a second exception for those reclaim wq's, for which you'll
> have this rescue thread which you'll bind to the right cpu for that
> work.
>
> That should get rid of all these gazillion threads we have, preserve the
> queue property and not be as invasive as your current thing.

That doesn't solve the concurrency problem at all, and we'll end up
bouncing a LOT of works around.

> If they're really as idle as reported you'll never need the fork-fest
> you currently propose, simply because there's not enough work.
>
> So basically, have events/# service the first non-empty cwq, when
> there's more non empty cwqs spawn them single shot threads, or use a
> rescue thread.

Idle doesn't mean they don't overlap.  They aren't cpu cycle hungry
but are context hungry.

It just isn't possible to implement a shared worker pool which can
scale according to the necessary level of concurrency without doing
some sort of dynamic worker management, which necessarily involves
forking new workers when they're needed and killing some of them off
when there are more than enough.
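
Roughly, that management boils down to something like the following
sketch (the names are illustrative only, not actual code):

/* illustrative sketch of the shape of dynamic worker management */
static void manage_workers(struct cpu_pool *pool)
{
        /* works are pending but no worker is available: fork new ones */
        while (pool_has_pending_work(pool) && !pool->nr_idle)
                create_new_worker(pool);

        /* more idle workers than we can plausibly need: reap the excess */
        while (too_many_idle_workers(pool))
                reap_idle_worker(pool);
}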

As for the force_cpus_allowed() bit, I think it's a rather natural
interface to have, and maybe we can replace kthread_bind() with it or
implement kthread_bind() in terms of it.  It's the basic migration
function which adheres to the cpu hotplug/unplug synchronization rules.
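
For example, assuming force_cpus_allowed() takes the same arguments as
set_cpus_allowed_ptr() but skips the cpu_active() test, kthread_bind()
could become something like this (sketch only):

/* sketch: kthread_bind() expressed in terms of force_cpus_allowed() */
void kthread_bind(struct task_struct *k, unsigned int cpu)
{
        /* the kthread must still be dormant, i.e. parked in schedule() */
        if (WARN_ON(!wait_task_inactive(k, TASK_UNINTERRUPTIBLE)))
                return;

        force_cpus_allowed(k, cpumask_of(cpu));
        k->flags |= PF_THREAD_BOUND;
}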

>>   I thought about adding an unbound pool of workers
>> for cpu intensive works for completeness but I really couldn't find
>> much use for that.  If enough number of users would need something
>> like that, we can add an anonymous pool but for now I really don't see
>> the need to worry about that.
> 
> And I though I'd heard multiple parties express interesting in exactly
> that, btrfs, bdi and pohmelfs come to mind, also crypto looks like one
> that could actually do some work.

Yeah, anything cryptography related or anything that crunches large
chunks of data would be a good candidate, but compared to the works and
kthreads we have just to have context and be able to wait for things,
they're still a minority.  Plus, it's something we can continue to work
on if there are enough reasons to do so.  If accommodating long running
works there makes more sense, we'll do that, but for most of them
whether they're bound to a certain cpu or not just isn't interesting at
all, and all we need to serve them is the ability to migrate back the
threads during CPU UP.  It's a pretty isolated path.
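
That path would be a single CPU_ONLINE callback along the lines of the
sketch below; the pool/worker structures and per_cpu_pool() are made-up
placeholders.

/* sketch: move unbound workers back when their cpu comes back online */
static int workqueue_cpu_up_callback(struct notifier_block *nb,
                                     unsigned long action, void *hcpu)
{
        unsigned int cpu = (unsigned long)hcpu;
        struct worker *worker;

        if (action != CPU_ONLINE)
                return NOTIFY_OK;

        /*
         * These workers kept running unbound while the cpu was down.
         * The cpu is fully online again, so move them back.
         */
        list_for_each_entry(worker, &per_cpu_pool(cpu)->workers, entry)
                force_cpus_allowed(worker->task, cpumask_of(cpu));

        return NOTIFY_OK;
}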

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-09  5:25                                     ` Tejun Heo
@ 2009-12-09  7:41                                       ` Peter Zijlstra
  2009-12-09  8:03                                         ` Tejun Heo
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2009-12-09  7:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

On Wed, 2009-12-09 at 14:25 +0900, Tejun Heo wrote:
> As for the force_cpus_allowed() bit, I think it's a rather natural
> interface to have and maybe we can replace kthread_bind() with it or
> make kthread_bind() in terms of it.  It's the basic migration function
> which adheres to the cpu hot plug/unplug synchronization rules.

I quite disagree; it's quite unnatural to be adding threads to a cpu that
is about to go down, or about to come up for that matter, and we most
certainly don't want to add to the hotplug rules -- there are quite enough
of them already.

You could always do it from CPU_ONLINE; since they don't care about
cpu-affinity anyway (the thing was down, for crying out loud), it really
doesn't matter when they're moved back, if at all.

I still think it's utter insanity to even consider moving them back, or
for that matter to have worklets that take minutes to complete; that's
just daft.

I think I'm going to NAK all this; it looks quite ill-conceived.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] sched: implement force_cpus_allowed()
  2009-12-09  7:41                                       ` Peter Zijlstra
@ 2009-12-09  8:03                                         ` Tejun Heo
  0 siblings, 0 replies; 47+ messages in thread
From: Tejun Heo @ 2009-12-09  8:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, avi, efault, rusty, linux-kernel, Gautham R Shenoy,
	Linus Torvalds

Hello,

On 12/09/2009 04:41 PM, Peter Zijlstra wrote:
> On Wed, 2009-12-09 at 14:25 +0900, Tejun Heo wrote:
>> As for the force_cpus_allowed() bit, I think it's a rather natural
>> interface to have and maybe we can replace kthread_bind() with it or
>> make kthread_bind() in terms of it.  It's the basic migration function
>> which adheres to the cpu hot plug/unplug synchronization rules.
> 
> I quite disagree, its quite unnatural to be adding threads to a cpu that
> is about to go down or about to come up for that matter, and we most
> certainly don't want to add to the hotplug rules, there are quite enough
> of them already.

I don't think it adds to the hotplug rules in any way.  While a cpu is
going down, there's this single atomic moment when all threads are taken
off the cpu, and that's the natural synchronization point.  You can add
more restrictions on top of that, like the cpu activeness test, to avoid
certain undesirable behavior, but in the end the natural ultimate sync
point is the final migration off the cpu.  That said, let's agree to
disagree here.  We're talking about what's 'natural', and it depends
largely on personal POVs, right?

> You could always do it from CPU_ONLINE, since they don't care about
> cpu-affinity anyway (the thing was down for crying out loud), it really
> doesn't matter when they're moved back if at all.
> 
> I still think its utter insanity to even consider moving them back, or
> for that matter to have worklets that take minutes to complete, that's
> just daft.
>
> I think I'm going to NAK all this, it looks quite ill conceived.

Alright, fair enough.  So, I get that you NACK the whole concurrency
managed workqueue thing.  The reason why this part was split out and
sent separately for the scheduler tree was that I thought there was
general consensus toward having concurrency managed workqueues and
their basic design.  It looks like we'll need to have another round
about that.

Here's what I'll do.  I'm about done with the second round of properly
split cmwq patches.  It's feature complete and should be able to
replace the existing workqueue implementation without any problem
(well, theoretically).  I'll send them as a whole series with these
scheduler patches, cc all the interested people, and explain Ingo's
concern about the notification framework and your general objection to
the whole thing.  Let's continue there, okay?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2009-12-09  8:03 UTC | newest]

Thread overview: 47+ messages
2009-12-02  3:56 [PATCHSET tip/sched/core] sched: concurrency managed workqueue related sched patches Tejun Heo
2009-12-02  3:56 ` [PATCH 1/7] sched: revert 498657a478c60be092208422fefa9c7b248729c2 Tejun Heo
2009-12-02 10:42   ` [tip:sched/core] sched: Revert 498657a478c60be092208422fefa9c7b248729c2 tip-bot for Tejun Heo
2009-12-02  3:56 ` [PATCH 2/7] sched: rename preempt_notifiers to sched_notifiers and refactor implementation Tejun Heo
2009-12-02  3:56 ` [PATCH 3/7] sched: refactor try_to_wake_up() Tejun Heo
2009-12-02  9:05   ` Mike Galbraith
2009-12-02  9:51     ` Tejun Heo
2009-12-02 10:10       ` Mike Galbraith
2009-12-02 10:14         ` Tejun Heo
2009-12-02 11:01           ` Peter Zijlstra
2009-12-03  6:11   ` [PATCH UDPATED " Tejun Heo
2009-12-02  3:56 ` [PATCH 4/7] sched: implement force_cpus_allowed() Tejun Heo
2009-12-04 10:40   ` Peter Zijlstra
2009-12-04 10:43     ` Peter Zijlstra
2009-12-07  4:34       ` Tejun Heo
2009-12-07  8:35         ` Peter Zijlstra
2009-12-07 10:34           ` Tejun Heo
2009-12-07 10:54             ` Peter Zijlstra
2009-12-07 11:07               ` Tejun Heo
2009-12-08  8:41                 ` Tejun Heo
2009-12-08  9:02                   ` Peter Zijlstra
2009-12-08  9:12                     ` Tejun Heo
2009-12-08 10:34                       ` Peter Zijlstra
2009-12-08 10:38                         ` Peter Zijlstra
2009-12-08 11:26                           ` Tejun Heo
2009-12-08 11:24                         ` Tejun Heo
2009-12-08 11:48                           ` Peter Zijlstra
2009-12-08 11:56                             ` Tejun Heo
2009-12-08 12:10                               ` Peter Zijlstra
2009-12-08 12:23                                 ` Tejun Heo
2009-12-08 13:35                                   ` Peter Zijlstra
2009-12-09  5:25                                     ` Tejun Heo
2009-12-09  7:41                                       ` Peter Zijlstra
2009-12-09  8:03                                         ` Tejun Heo
2009-12-02  3:56 ` [PATCH 5/7] sched: make sched_notifiers unconditional Tejun Heo
2009-12-02  3:56 ` [PATCH 6/7] sched: add wakeup/sleep sched_notifiers and allow NULL notifier ops Tejun Heo
2009-12-02  3:56 ` [PATCH 7/7] sched: implement try_to_wake_up_local() Tejun Heo
2009-12-03  6:13   ` [PATCH UPDATED " Tejun Heo
2009-12-04 10:47     ` Peter Zijlstra
2009-12-07  3:31       ` Tejun Heo
2009-12-04 10:44   ` [PATCH " Peter Zijlstra
2009-12-07  3:26     ` Tejun Heo
2009-12-07  8:50       ` Peter Zijlstra
2009-12-07  8:56         ` Peter Zijlstra
2009-12-07 10:27           ` Tejun Heo
2009-12-08  8:53             ` Peter Zijlstra
2009-12-08  9:16               ` Tejun Heo
