linux-kernel.vger.kernel.org archive mirror
* (no subject)
@ 2009-12-18 12:57 Tejun Heo
  2009-12-18 12:57 ` [PATCH 01/27] sched: rename preempt_notifiers to sched_notifiers and refactor implementation Tejun Heo
                   ` (29 more replies)
  0 siblings, 30 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi

Subject: [RFC PATCHSET] concurrency managed workqueue, take#2

Hello, all.

This is the second take of cmwq (concurrency managed workqueue).  It's
on top of linus#master 55639353a0035052d9ea6cfe4dde0ac7fcbb2c9f
(v2.6.33-rc1).  Git tree is available at

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-cmwq

Quilt series is available at

  http://master.kernel.org/~tj/patches/review-cmwq.tar.gz


ISSUES FROM THE FIRST RFC AND THEIR RESOLUTIONS
===============================================

The first RFC round[1] was in October.  Several issues were raised but
there was no objection to the basic design.  The issues raised there
and later are:

A. Hackish scheduler notification implemented by overriding scheduler
   class needs to be made generic.

B. Scheduler local wake up function needs to be reimplemented and
   share code path with try_to_wake_up().

C. Dual-colored workqueue flushing scheme may become a scalability
   issue.

D. Some users (xfs, for example) end up issuing too many concurrent
   works unless throttled somehow.

E. Single thread workqueue is broken.  Works queued to a single thread
   workqueue require strict ordering.

F. The patch to actually implement cmwq is too large and needs to be
   split.

A and B are scheduler related and will be discussed further below
along with other unresolved issues.

C is solved by implementing multi colored flush.  It has two
properties which make it resistant to scalability issues.  First, 14
flushes can be in progress simultaneously.  Second, when all the
colors are used up, new flushers don't wait in line to be processed
one by one.  All the overflowed ones get assigned the same color and
are processed in batch when a color frees up, so throughput increases
along with congestion.
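
For illustration, the flush bookkeeping looks roughly like the
following sketch.  The structure and names are made up here to match
the description above; this is not the actual patch code.

  /*
   * Sketch of multi colored flushing: each flush claims the current
   * work color and waits for works carrying that color to drain.
   * With one color reserved to mean "no color", 14 flushes can be in
   * flight at once; once colors run out, overflowed flushers share
   * the next color that frees up and are woken in batch.
   */
  #define NR_FLUSH_COLORS 15    /* 14 usable + 1 reserved for "no color" */

  struct wq_flush_sketch {
          int work_color;                    /* color assigned to new works */
          int flush_color;                   /* color currently being flushed */
          int nr_in_flight[NR_FLUSH_COLORS]; /* works in flight per color */
          struct list_head flusher_overflow; /* flushers waiting for a color */
  };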

D is solved by introducing max_active per cpu_workqueue_struct.  If
the number of active works (running or pending execution) goes over
the limit, further works are put on the delayed_works list, giving
workqueues the ability to throttle concurrency.  The original
freeze/thaw implementation is replaced with a max_active based one
(max_active is temporarily quenched to zero while frozen), so the
increase in overall complexity isn't too great.
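
In sketch form, the queueing decision is roughly the following.  The
field names follow the description above but the function itself is
illustrative only, not the patch code.

  /*
   * Queue to the worklist while under the max_active limit,
   * otherwise park the work on delayed_works until an active work
   * retires.  Freezing just forces max_active to zero, so everything
   * queued while frozen ends up on delayed_works.
   */
  static void sketch_enqueue(struct cpu_workqueue_struct *cwq,
                             struct work_struct *work)
  {
          if (cwq->nr_active < cwq->max_active) {
                  cwq->nr_active++;
                  list_add_tail(&work->entry, &cwq->worklist);
          } else {
                  list_add_tail(&work->entry, &cwq->delayed_works);
          }
  }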

E is also implemented using max_active.  The SINGLE_THREAD flag is
replaced with SINGLE_CPU.  CWQs dynamically arbitrate which CWQ
serves a SINGLE_CPU workqueue using atomic accesses to wq->single_cpu
so that only one CWQ is active at any given time.  Combined with
max_active set to one, this results in the same queueing and
execution behavior as single thread workqueues without requiring a
dedicated thread.
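
The arbitration can be pictured as below.  This is a sketch under the
assumption that wq->single_cpu is a plain unsigned int updated with
cmpxchg(); the constant name is made up.

  /*
   * The first CWQ to claim wq->single_cpu services the SINGLE_CPU
   * workqueue; other CPUs queue their works over there until the
   * claim is released after the worklist runs dry.
   */
  #define SINGLE_CPU_NONE NR_CPUS               /* nobody holds the claim */

  static bool sketch_claim_single_cpu(struct workqueue_struct *wq,
                                      unsigned int cpu)
  {
          return cmpxchg(&wq->single_cpu, SINGLE_CPU_NONE, cpu) ==
                  SINGLE_CPU_NONE;
  }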

F is solved by introducing workers, gcwqs, trustee, shared worklist
and concurrency managed worker pool in separate steps.  Although the
logic added in each step carries superfluous parts which only become
fully useful after the complete implementation, each step achieves
pretty good execution coverage of the new logic and should be useful
as a review and bisection step.


UN/HALF-RESOLVED ISSUES
=======================

A. After a couple of tries, scheduler notification is currently
   implemented as a generalized version of preempt_notifiers, which
   used to be used only by kvm.  Two more notifications - wakeup and
   sleep - were added.  Ingo was unsatisfied with the fact that there
   are now three different notification-like mechanisms living around
   the scheduler code and refused to accept the new notifiers unless
   all the scheduler notification mechanisms are unified.

   To prevent the cmwq patches from floating around too long without
   a stable branch to be tested in linux-next, it was agreed to do
   this in the following stages[2].

   1. Apply patches which don't change scheduler behavior but will
      reduce conflicts to sched tree.

   2. Create a new sched branch which will contain the new notifiers.
      This branch will be stable and will end up in linux-next but
      won't be pushed to Linus unless the notification mechanisms are
      unified.

   3. Base cmwq branch on top of the devel branch created in #2 and
      publish it to linux-next for testing.

   4. Unify scheduler notification mechanisms in the sched devel
      branch and when it's done push it and cmwq to Linus.

B. set_cpus_allowed_ptr() doesn't move threads which are bound with
   kthread_bind() and doesn't move threads to CPUs which aren't
   marked active.  The active state is a subset of the online state
   and is used by the scheduler to avoid scheduling threads on a
   dying CPU unless strictly necessary.

   However, it's desirable to have PF_THREAD_BOUND set for kworkers
   during usual operation, and new and rescue workers need to be able
   to migrate to CPUs in CPU_DOWN_PREPARE state to guarantee forward
   progress of wq/work flushes issued from DOWN_PREPARE callbacks.
   Also, if a CPU comes back online, workers left running need to be
   rebound to the CPU, ignoring the PF_THREAD_BOUND restriction.

   Using kthread_bind() isn't feasible because kthread_bind() isn't
   synchronized against cpu online state and is allowed to put a
   thread on a dead cpu.

   Originally, force_cpus_allowed() was added, which bypasses the
   PF_THREAD_BOUND and active checks.  The current version instead
   adds a __set_cpus_allowed() function which takes a @force param to
   do about the same thing (the new version properly checks online
   state so it will never put a task on a dead cpu).  This is still
   temporary.

   I think the cleanest solution here would be making sure that
   nobody depends on kthread_bind() being able to put a task on a
   dead cpu and then letting kthread_bind() bind a task to online
   cpus by calling __set_cpus_allowed().  The interface visible
   outside would then be set_cpus_allowed_ptr() for regular tasks and
   kthread_bind() for kthreads.  I'll be happy to pursue this path if
   it can be agreed on.
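
   In other words, something along the lines of the following sketch,
   which is illustrative only and glosses over locking and the
   TASK_UNINTERRUPTIBLE check the current kthread_bind() does.

     void sketch_kthread_bind(struct task_struct *k, unsigned int cpu)
     {
             /* @force bypasses PF_THREAD_BOUND and the active check */
             __set_cpus_allowed(k, cpumask_of(cpu), true);
             k->flags |= PF_THREAD_BOUND;
     }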

C. While discussing issue B [3], Peter Zijlstra objected to the
   basic design of cmwq.  Peter's objections are...

   o1. It isn't a generic worker pool mechanism in that it can't serve
       cpu-intensive workloads because all works are affined to local
       cpus.

   o2. Allowing long (> 5s for example) running works isn't a good
       idea, and by not allowing long running works, the need to
       migrate workers back when a cpu comes back online can be
       removed.

   o3. It's a fork-fest.

   My rationales for each are

   r1. The first design goal of cmwq is solving the issues the current
       workqueue implementation has: hard-to-detect deadlocks,
       unexpectedly long latencies caused by long running works which
       share the workqueue, and the excessive number of worker threads
       necessitated by each workqueue having its own workers.

       cmwq solves these issues quite efficiently without depending on
       fragile and complex heuristics.  Concurrency is managed to a
       minimal yet sufficient level, workers are reused as much as
       possible and only the necessary number of workers is created
       and maintained.

       cmwq is cpu affine because its target workloads are not cpu
       intensive.  Most works are context hungry, not cpu cycle
       hungry, and as such providing the necessary context (or
       concurrency) from the local CPU is the most efficient way to
       serve them.

       The second design goal is to unify different async mechanisms
       in the kernel.  Although cmwq wouldn't be able to serve CPU
       cycle intensive workloads, most in-kernel async mechanisms are
       there to provide context and concurrency and they can all be
       converted to use cmwq.

       Async workloads which need to burn a large amount of CPU
       cycles, such as encryption and IO checksumming, have pretty
       different requirements, and a worker pool designed to serve
       them would probably require a fair amount of heuristics to
       determine the appropriate level of concurrency.  The workqueue
       API may be extended to cover such workloads by providing an
       anonymous CPU for those works to bind to, but the underlying
       operation would be fairly different.  If this is something
       necessary, let's pursue it, but I don't think it's exclusive
       with cmwq.

   r2. The only thing necessary to support long running works is the
       ability to rebind workers to the cpu if it comes back online.
       Allowing long running works will let most existing worker pools
       be served by cmwq and also make CPU down/up latencies more
       predictable.

   r3. I don't think there is any way to implement a shared worker
       pool without forking when more concurrency is required.  The
       actual amount of forking would be low, as cmwq scales the
       number of idle workers it keeps according to the current
       concurrency level and uses a rather long timeout (5min) for
       idlers, roughly as sketched below.
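
   To put r3 in concrete terms, a new worker is forked only in roughly
   the following situation.  The structure and names below are made up
   for illustration; they are not the actual code.

     struct gcwq_sketch {
             int nr_running;            /* workers currently running works */
             int nr_idle;               /* idle workers kept around */
             struct list_head worklist; /* pending works */
     };

     /* fork only if there is work pending but nobody left to run it */
     static bool sketch_need_new_worker(struct gcwq_sketch *gcwq)
     {
             return !list_empty(&gcwq->worklist) &&
                     !gcwq->nr_running && !gcwq->nr_idle;
     }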

We know what to do about A.  I'm pretty sure B can be solved one way
or another.  So, the biggest question here is whether the basic
design of cmwq itself is agreed on.  Being the author, I'm probably
pretty biased, but I really think it's a good solution for the
problems it tries to solve, and many other developers seemed to agree
during the first RFC round.  So, let's discuss.  If I missed some
points of the objection, please go ahead and add them.


CHANGES FROM THE LAST RFC TAKE[1] AND PREP PATCHSET[4]
======================================================

* All scheduler related parts - notification, forced task migration
  and wake up from notification - are redone.  This part is still in
  flux and likely to change further.

* Barrier works are now uncolored.  They don't participate in
  workqueue flushing and don't contribute to the active count.  This
  change is necessary to enable max_active throttling.

* max_active throttling is added and freezing is reimplemented using
  it.  The fixed limit on the total number of workers is removed;
  it's now regulated by max_active.

* Singlethread workqueue is un-removed and works properly.  It's
  implemented as SINGLE_CPU workqueue with max_active == 1.

* The monster patch to implement cmwq is split into logical steps.

This patchset contains the following 27 patches.

 0001-sched-rename-preempt_notifiers-to-sched_notifiers-an.patch
 0002-sched-refactor-try_to_wake_up.patch
 0003-sched-implement-__set_cpus_allowed.patch
 0004-sched-make-sched_notifiers-unconditional.patch
 0005-sched-add-wakeup-sleep-sched_notifiers-and-allow-NUL.patch
 0006-sched-implement-try_to_wake_up_local.patch
 0007-acpi-use-queue_work_on-instead-of-binding-workqueue-.patch
 0008-stop_machine-reimplement-without-using-workqueue.patch
 0009-workqueue-misc-cosmetic-updates.patch
 0010-workqueue-merge-feature-parameters-into-flags.patch
 0011-workqueue-define-both-bit-position-and-mask-for-work.patch
 0012-workqueue-separate-out-process_one_work.patch
 0013-workqueue-temporarily-disable-workqueue-tracing.patch
 0014-workqueue-kill-cpu_populated_map.patch
 0015-workqueue-update-cwq-alignement.patch
 0016-workqueue-reimplement-workqueue-flushing-using-color.patch
 0017-workqueue-introduce-worker.patch
 0018-workqueue-reimplement-work-flushing-using-linked-wor.patch
 0019-workqueue-implement-per-cwq-active-work-limit.patch
 0020-workqueue-reimplement-workqueue-freeze-using-max_act.patch
 0021-workqueue-introduce-global-cwq-and-unify-cwq-locks.patch
 0022-workqueue-implement-worker-states.patch
 0023-workqueue-reimplement-CPU-hotplugging-support-using-.patch
 0024-workqueue-make-single-thread-workqueue-shared-worker.patch
 0025-workqueue-use-shared-worklist-and-pool-all-workers-p.patch
 0026-workqueue-implement-concurrency-managed-dynamic-work.patch
 0027-workqueue-increase-max_active-of-keventd-and-kill-cu.patch

0001-0006 are scheduler related changes.

0007-0008 change two unusual users.  After the change, acpi creates
per-cpu workers which weren't necessary before, but in the end it
won't be doing anything suboptimal.  stop_machine won't use workqueue
from this point on.

0009-0013 do misc preparations.  0007-0013 stayed about the same as
in the previous round.

0014 kills cpu_populated_map, creates workers for all possible CPUs
and simplifies CPU hotplugging.

0015-0024 introduce new constructs step by step and reimplement
workqueue features so that they can be used with a shared worker
pool.

0025 makes all workqueues share per-cpu worklists and pool their
workers.  At this stage, all the pieces other than the concurrency
managed worker pool are there.

0026 implements the concurrency managed worker pool.  Even after
this, there is no visible behavior difference to workqueue users as
all workqueues still have max_active of 1.
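
To give a rough idea of what 0026 does with the scheduler notifiers,
the core of the concurrency management is approximately the
following.  The names are illustrative (gcwq_sketch reuses the fields
from the earlier sketch); this is not the actual code.

  /*
   * A worker is going to sleep: if it was the last one running on
   * this CPU and work is still pending, kick an idle worker to take
   * its place.  try_to_wake_up_local() is the helper added in 0006.
   */
  static void sketch_worker_sleeping(struct gcwq_sketch *gcwq,
                                     struct task_struct *idle_worker_task)
  {
          if (!--gcwq->nr_running && !list_empty(&gcwq->worklist))
                  try_to_wake_up_local(idle_worker_task,
                                       TASK_INTERRUPTIBLE, 0);
  }

  /* a worker is waking up: it counts toward concurrency again */
  static void sketch_worker_waking_up(struct gcwq_sketch *gcwq)
  {
          gcwq->nr_running++;
  }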

0027 increases max_active of keventd.  This patch isn't signed off
yet; lockdep annotations need to be updated.

Each feature of cmwq has been verified using test scenarios (well, I
tried, at least).  In a reply, I'll attach the source of the test
module I used.

Things to do from here are...

* Hopefully, establish a stable tree.

* Audit workqueue users, drop unnecessary workqueues and make them use
  keventd.

* Restore workqueue tracing.

* Replace various in-kernel async mechanisms which are there to
  provide context and concurrency.

Diffstat follows.

 arch/ia64/kernel/smpboot.c   |    2 
 arch/ia64/kvm/Kconfig        |    1 
 arch/powerpc/kvm/Kconfig     |    1 
 arch/s390/kvm/Kconfig        |    1 
 arch/x86/kernel/smpboot.c    |    2 
 arch/x86/kvm/Kconfig         |    1 
 drivers/acpi/osl.c           |   41 
 include/linux/kvm_host.h     |    4 
 include/linux/preempt.h      |   48 
 include/linux/sched.h        |   71 +
 include/linux/stop_machine.h |    6 
 include/linux/workqueue.h    |   88 +
 init/Kconfig                 |    4 
 init/main.c                  |    2 
 kernel/power/process.c       |   21 
 kernel/sched.c               |  329 +++--
 kernel/stop_machine.c        |  151 ++
 kernel/trace/Kconfig         |    4 
 kernel/workqueue.c           | 2640 +++++++++++++++++++++++++++++++++++++------
 virt/kvm/kvm_main.c          |   26 
 20 files changed, 2783 insertions(+), 660 deletions(-)

Thanks.

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel/896268
[2] http://patchwork.kernel.org/patch/63119/
[3] http://thread.gmane.org/gmane.linux.kernel/921267
[4] http://thread.gmane.org/gmane.linux.kernel/917570


* [PATCH 01/27] sched: rename preempt_notifiers to sched_notifiers and refactor implementation
  2009-12-18 12:57 Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 02/27] sched: refactor try_to_wake_up() Tejun Heo
                   ` (28 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo, Mike Galbraith

Rename preempt_notifiers to sched_notifiers and move it to sched.h.
Also, refactor implementation in sched.c such that adding new
callbacks is easier.

This patch does not introduce any functional change and in fact
generates the same binary at least with my configuration (x86_64 SMP,
kvm and some debug options).

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 arch/ia64/kvm/Kconfig    |    2 +-
 arch/powerpc/kvm/Kconfig |    2 +-
 arch/s390/kvm/Kconfig    |    2 +-
 arch/x86/kvm/Kconfig     |    2 +-
 include/linux/kvm_host.h |    4 +-
 include/linux/preempt.h  |   48 --------------------
 include/linux/sched.h    |   53 +++++++++++++++++++++-
 init/Kconfig             |    2 +-
 kernel/sched.c           |  108 +++++++++++++++++++---------------------------
 virt/kvm/kvm_main.c      |   26 +++++------
 10 files changed, 113 insertions(+), 136 deletions(-)

diff --git a/arch/ia64/kvm/Kconfig b/arch/ia64/kvm/Kconfig
index ef3e7be..a38b72e 100644
--- a/arch/ia64/kvm/Kconfig
+++ b/arch/ia64/kvm/Kconfig
@@ -22,7 +22,7 @@ config KVM
 	depends on HAVE_KVM && MODULES && EXPERIMENTAL
 	# for device assignment:
 	depends on PCI
-	select PREEMPT_NOTIFIERS
+	select SCHED_NOTIFIERS
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
 	select KVM_APIC_ARCHITECTURE
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 07703f7..d3a65c6 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -18,7 +18,7 @@ if VIRTUALIZATION
 
 config KVM
 	bool
-	select PREEMPT_NOTIFIERS
+	select SCHED_NOTIFIERS
 	select ANON_INODES
 
 config KVM_BOOK3S_64_HANDLER
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 6ee55ae..a0adddd 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -18,7 +18,7 @@ if VIRTUALIZATION
 config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM && EXPERIMENTAL
-	select PREEMPT_NOTIFIERS
+	select SCHED_NOTIFIERS
 	select ANON_INODES
 	---help---
 	  Support hosting paravirtualized guest machines using the SIE
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 4cd4983..fd38f79 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -22,7 +22,7 @@ config KVM
 	depends on HAVE_KVM
 	# for device assignment:
 	depends on PCI
-	select PREEMPT_NOTIFIERS
+	select SCHED_NOTIFIERS
 	select MMU_NOTIFIER
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bd5a616..8079759 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -74,8 +74,8 @@ void kvm_io_bus_unregister_dev(struct kvm *kvm, struct kvm_io_bus *bus,
 
 struct kvm_vcpu {
 	struct kvm *kvm;
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-	struct preempt_notifier preempt_notifier;
+#ifdef CONFIG_SCHED_NOTIFIERS
+	struct sched_notifier sched_notifier;
 #endif
 	int vcpu_id;
 	struct mutex mutex;
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 2e681d9..538c675 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -93,52 +93,4 @@ do { \
 
 #endif
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-
-struct preempt_notifier;
-
-/**
- * preempt_ops - notifiers called when a task is preempted and rescheduled
- * @sched_in: we're about to be rescheduled:
- *    notifier: struct preempt_notifier for the task being scheduled
- *    cpu:  cpu we're scheduled on
- * @sched_out: we've just been preempted
- *    notifier: struct preempt_notifier for the task being preempted
- *    next: the task that's kicking us out
- *
- * Please note that sched_in and out are called under different
- * contexts.  sched_out is called with rq lock held and irq disabled
- * while sched_in is called without rq lock and irq enabled.  This
- * difference is intentional and depended upon by its users.
- */
-struct preempt_ops {
-	void (*sched_in)(struct preempt_notifier *notifier, int cpu);
-	void (*sched_out)(struct preempt_notifier *notifier,
-			  struct task_struct *next);
-};
-
-/**
- * preempt_notifier - key for installing preemption notifiers
- * @link: internal use
- * @ops: defines the notifier functions to be called
- *
- * Usually used in conjunction with container_of().
- */
-struct preempt_notifier {
-	struct hlist_node link;
-	struct preempt_ops *ops;
-};
-
-void preempt_notifier_register(struct preempt_notifier *notifier);
-void preempt_notifier_unregister(struct preempt_notifier *notifier);
-
-static inline void preempt_notifier_init(struct preempt_notifier *notifier,
-				     struct preempt_ops *ops)
-{
-	INIT_HLIST_NODE(&notifier->link);
-	notifier->ops = ops;
-}
-
-#endif
-
 #endif /* __LINUX_PREEMPT_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e898578..6231576 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1211,6 +1211,53 @@ struct sched_rt_entity {
 #endif
 };
 
+#ifdef CONFIG_SCHED_NOTIFIERS
+
+struct sched_notifier;
+
+/**
+ * sched_notifier_ops - notifiers called for scheduling events
+ * @in: we're about to be rescheduled:
+ *    notifier: struct sched_notifier for the task being scheduled
+ *    cpu:  cpu we're scheduled on
+ * @out: we've just been preempted
+ *    notifier: struct sched_notifier for the task being preempted
+ *    next: the task that's kicking us out
+ *
+ * Please note that in and out are called under different contexts.
+ * out is called with rq lock held and irq disabled while in is called
+ * without rq lock and irq enabled.  This difference is intentional
+ * and depended upon by its users.
+ */
+struct sched_notifier_ops {
+	void (*in)(struct sched_notifier *notifier, int cpu);
+	void (*out)(struct sched_notifier *notifier, struct task_struct *next);
+};
+
+/**
+ * sched_notifier - key for installing scheduler notifiers
+ * @link: internal use
+ * @ops: defines the notifier functions to be called
+ *
+ * Usually used in conjunction with container_of().
+ */
+struct sched_notifier {
+	struct hlist_node link;
+	struct sched_notifier_ops *ops;
+};
+
+void sched_notifier_register(struct sched_notifier *notifier);
+void sched_notifier_unregister(struct sched_notifier *notifier);
+
+static inline void sched_notifier_init(struct sched_notifier *notifier,
+				       struct sched_notifier_ops *ops)
+{
+	INIT_HLIST_NODE(&notifier->link);
+	notifier->ops = ops;
+}
+
+#endif	/* CONFIG_SCHED_NOTIFIERS */
+
 struct rcu_node;
 
 struct task_struct {
@@ -1234,9 +1281,9 @@ struct task_struct {
 	struct sched_entity se;
 	struct sched_rt_entity rt;
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-	/* list of struct preempt_notifier: */
-	struct hlist_head preempt_notifiers;
+#ifdef CONFIG_SCHED_NOTIFIERS
+	/* list of struct sched_notifier: */
+	struct hlist_head sched_notifiers;
 #endif
 
 	/*
diff --git a/init/Kconfig b/init/Kconfig
index a23da9f..37aefe1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1249,7 +1249,7 @@ config STOP_MACHINE
 
 source "block/Kconfig"
 
-config PREEMPT_NOTIFIERS
+config SCHED_NOTIFIERS
 	bool
 
 source "kernel/Kconfig.locks"
diff --git a/kernel/sched.c b/kernel/sched.c
index 18cceee..fe3bab0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1434,6 +1434,44 @@ static inline void cpuacct_update_stats(struct task_struct *tsk,
 		enum cpuacct_stat_index idx, cputime_t val) {}
 #endif
 
+#ifdef CONFIG_SCHED_NOTIFIERS
+
+#define fire_sched_notifiers(p, callback, args...) do {			\
+	struct sched_notifier *__sn;					\
+	struct hlist_node *__pos;					\
+									\
+	hlist_for_each_entry(__sn, __pos, &(p)->sched_notifiers, link)	\
+		__sn->ops->callback(__sn , ##args);			\
+} while (0)
+
+/**
+ * sched_notifier_register - register scheduler notifier
+ * @notifier: notifier struct to register
+ */
+void sched_notifier_register(struct sched_notifier *notifier)
+{
+	hlist_add_head(&notifier->link, &current->sched_notifiers);
+}
+EXPORT_SYMBOL_GPL(sched_notifier_register);
+
+/**
+ * sched_notifier_unregister - unregister scheduler notifier
+ * @notifier: notifier struct to unregister
+ *
+ * This is safe to call from within a scheduler notifier.
+ */
+void sched_notifier_unregister(struct sched_notifier *notifier)
+{
+	hlist_del(&notifier->link);
+}
+EXPORT_SYMBOL_GPL(sched_notifier_unregister);
+
+#else	/* !CONFIG_SCHED_NOTIFIERS */
+
+#define fire_sched_notifiers(p, callback, args...)	do { } while (0)
+
+#endif	/* CONFIG_SCHED_NOTIFIERS */
+
 static inline void inc_cpu_load(struct rq *rq, unsigned long load)
 {
 	update_load_add(&rq->load, load);
@@ -2535,8 +2573,8 @@ static void __sched_fork(struct task_struct *p)
 	p->se.on_rq = 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-	INIT_HLIST_HEAD(&p->preempt_notifiers);
+#ifdef CONFIG_SCHED_NOTIFIERS
+	INIT_HLIST_HEAD(&p->sched_notifiers);
 #endif
 
 	/*
@@ -2636,64 +2674,6 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
 	task_rq_unlock(rq, &flags);
 }
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-
-/**
- * preempt_notifier_register - tell me when current is being preempted & rescheduled
- * @notifier: notifier struct to register
- */
-void preempt_notifier_register(struct preempt_notifier *notifier)
-{
-	hlist_add_head(&notifier->link, &current->preempt_notifiers);
-}
-EXPORT_SYMBOL_GPL(preempt_notifier_register);
-
-/**
- * preempt_notifier_unregister - no longer interested in preemption notifications
- * @notifier: notifier struct to unregister
- *
- * This is safe to call from within a preemption notifier.
- */
-void preempt_notifier_unregister(struct preempt_notifier *notifier)
-{
-	hlist_del(&notifier->link);
-}
-EXPORT_SYMBOL_GPL(preempt_notifier_unregister);
-
-static void fire_sched_in_preempt_notifiers(struct task_struct *curr)
-{
-	struct preempt_notifier *notifier;
-	struct hlist_node *node;
-
-	hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link)
-		notifier->ops->sched_in(notifier, raw_smp_processor_id());
-}
-
-static void
-fire_sched_out_preempt_notifiers(struct task_struct *curr,
-				 struct task_struct *next)
-{
-	struct preempt_notifier *notifier;
-	struct hlist_node *node;
-
-	hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link)
-		notifier->ops->sched_out(notifier, next);
-}
-
-#else /* !CONFIG_PREEMPT_NOTIFIERS */
-
-static void fire_sched_in_preempt_notifiers(struct task_struct *curr)
-{
-}
-
-static void
-fire_sched_out_preempt_notifiers(struct task_struct *curr,
-				 struct task_struct *next)
-{
-}
-
-#endif /* CONFIG_PREEMPT_NOTIFIERS */
-
 /**
  * prepare_task_switch - prepare to switch tasks
  * @rq: the runqueue preparing to switch
@@ -2711,7 +2691,7 @@ static inline void
 prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
-	fire_sched_out_preempt_notifiers(prev, next);
+	fire_sched_notifiers(prev, out, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
 }
@@ -2755,7 +2735,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	perf_event_task_sched_in(current, cpu_of(rq));
 	finish_lock_switch(rq, prev);
 
-	fire_sched_in_preempt_notifiers(current);
+	fire_sched_notifiers(current, in, raw_smp_processor_id());
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
@@ -9615,8 +9595,8 @@ void __init sched_init(void)
 
 	set_load_weight(&init_task);
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-	INIT_HLIST_HEAD(&init_task.preempt_notifiers);
+#ifdef CONFIG_SCHED_NOTIFIERS
+	INIT_HLIST_HEAD(&init_task.sched_notifiers);
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e1f2bf8..255b07e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -77,7 +77,7 @@ static atomic_t hardware_enable_failed;
 struct kmem_cache *kvm_vcpu_cache;
 EXPORT_SYMBOL_GPL(kvm_vcpu_cache);
 
-static __read_mostly struct preempt_ops kvm_preempt_ops;
+static __read_mostly struct sched_notifier_ops kvm_sched_notifier_ops;
 
 struct dentry *kvm_debugfs_dir;
 
@@ -109,7 +109,7 @@ void vcpu_load(struct kvm_vcpu *vcpu)
 
 	mutex_lock(&vcpu->mutex);
 	cpu = get_cpu();
-	preempt_notifier_register(&vcpu->preempt_notifier);
+	sched_notifier_register(&vcpu->sched_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
 }
@@ -118,7 +118,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
 {
 	preempt_disable();
 	kvm_arch_vcpu_put(vcpu);
-	preempt_notifier_unregister(&vcpu->preempt_notifier);
+	sched_notifier_unregister(&vcpu->sched_notifier);
 	preempt_enable();
 	mutex_unlock(&vcpu->mutex);
 }
@@ -1192,7 +1192,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 	if (IS_ERR(vcpu))
 		return PTR_ERR(vcpu);
 
-	preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
+	sched_notifier_init(&vcpu->sched_notifier, &kvm_sched_notifier_ops);
 
 	r = kvm_arch_vcpu_setup(vcpu);
 	if (r)
@@ -2026,23 +2026,21 @@ static struct sys_device kvm_sysdev = {
 struct page *bad_page;
 pfn_t bad_pfn;
 
-static inline
-struct kvm_vcpu *preempt_notifier_to_vcpu(struct preempt_notifier *pn)
+static inline struct kvm_vcpu *sched_notifier_to_vcpu(struct sched_notifier *sn)
 {
-	return container_of(pn, struct kvm_vcpu, preempt_notifier);
+	return container_of(sn, struct kvm_vcpu, sched_notifier);
 }
 
-static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
+static void kvm_sched_in(struct sched_notifier *sn, int cpu)
 {
-	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
+	struct kvm_vcpu *vcpu = sched_notifier_to_vcpu(sn);
 
 	kvm_arch_vcpu_load(vcpu, cpu);
 }
 
-static void kvm_sched_out(struct preempt_notifier *pn,
-			  struct task_struct *next)
+static void kvm_sched_out(struct sched_notifier *sn, struct task_struct *next)
 {
-	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
+	struct kvm_vcpu *vcpu = sched_notifier_to_vcpu(sn);
 
 	kvm_arch_vcpu_put(vcpu);
 }
@@ -2115,8 +2113,8 @@ int kvm_init(void *opaque, unsigned int vcpu_size,
 		goto out_free;
 	}
 
-	kvm_preempt_ops.sched_in = kvm_sched_in;
-	kvm_preempt_ops.sched_out = kvm_sched_out;
+	kvm_sched_notifier_ops.in = kvm_sched_in;
+	kvm_sched_notifier_ops.out = kvm_sched_out;
 
 	kvm_init_debug();
 
-- 
1.6.4.2



* [PATCH 02/27] sched: refactor try_to_wake_up()
  2009-12-18 12:57 Tejun Heo
  2009-12-18 12:57 ` [PATCH 01/27] sched: rename preempt_notifiers to sched_notifiers and refactor implementation Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 03/27] sched: implement __set_cpus_allowed() Tejun Heo
                   ` (27 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo, Mike Galbraith

Factor ttwu_activate() and ttwu_woken_up() out of try_to_wake_up().
The factoring out doesn't affect try_to_wake_up() much
code-generation-wise.  Depending on configuration options, it ends up
generating either the same object code as before or a slightly
different one due to different register assignment.

This is to help future implementation of try_to_wake_up_local().

Mike Galbraith suggested renaming ttwu_woken_up() to
ttwu_post_activation() and updating the comment in try_to_wake_up().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |  114 +++++++++++++++++++++++++++++++------------------------
 1 files changed, 64 insertions(+), 50 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index fe3bab0..910bf74 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2362,11 +2362,67 @@ int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
 }
 #endif
 
-/***
+static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
+				 bool is_sync, bool is_migrate, bool is_local)
+{
+	schedstat_inc(p, se.nr_wakeups);
+	if (is_sync)
+		schedstat_inc(p, se.nr_wakeups_sync);
+	if (is_migrate)
+		schedstat_inc(p, se.nr_wakeups_migrate);
+	if (is_local)
+		schedstat_inc(p, se.nr_wakeups_local);
+	else
+		schedstat_inc(p, se.nr_wakeups_remote);
+
+	activate_task(rq, p, 1);
+
+	/*
+	 * Only attribute actual wakeups done by this task.
+	 */
+	if (!in_interrupt()) {
+		struct sched_entity *se = &current->se;
+		u64 sample = se->sum_exec_runtime;
+
+		if (se->last_wakeup)
+			sample -= se->last_wakeup;
+		else
+			sample -= se->start_runtime;
+		update_avg(&se->avg_wakeup, sample);
+
+		se->last_wakeup = se->sum_exec_runtime;
+	}
+}
+
+static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
+					int wake_flags, bool success)
+{
+	trace_sched_wakeup(rq, p, success);
+	check_preempt_curr(rq, p, wake_flags);
+
+	p->state = TASK_RUNNING;
+#ifdef CONFIG_SMP
+	if (p->sched_class->task_wake_up)
+		p->sched_class->task_wake_up(rq, p);
+
+	if (unlikely(rq->idle_stamp)) {
+		u64 delta = rq->clock - rq->idle_stamp;
+		u64 max = 2*sysctl_sched_migration_cost;
+
+		if (delta > max)
+			rq->avg_idle = max;
+		else
+			update_avg(&rq->avg_idle, delta);
+		rq->idle_stamp = 0;
+	}
+#endif
+}
+
+/**
  * try_to_wake_up - wake up a thread
- * @p: the to-be-woken-up thread
+ * @p: the thread to be awakened
  * @state: the mask of task states that can be woken
- * @sync: do a synchronous wakeup?
+ * @wake_flags: wake modifier flags (WF_*)
  *
  * Put it on the run-queue if it's not already there. The "current"
  * thread is always on the run-queue (except when the actual
@@ -2374,7 +2430,8 @@ int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
  * the simpler "current->state = TASK_RUNNING" to mark yourself
  * runnable without the overhead of this.
  *
- * returns failure only if the task is already active.
+ * Returns %true if @p was woken up, %false if it was already running
+ * or @state didn't match @p's state.
  */
 static int try_to_wake_up(struct task_struct *p, unsigned int state,
 			  int wake_flags)
@@ -2442,54 +2499,11 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
 
 out_activate:
 #endif /* CONFIG_SMP */
-	schedstat_inc(p, se.nr_wakeups);
-	if (wake_flags & WF_SYNC)
-		schedstat_inc(p, se.nr_wakeups_sync);
-	if (orig_cpu != cpu)
-		schedstat_inc(p, se.nr_wakeups_migrate);
-	if (cpu == this_cpu)
-		schedstat_inc(p, se.nr_wakeups_local);
-	else
-		schedstat_inc(p, se.nr_wakeups_remote);
-	activate_task(rq, p, 1);
+	ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
+		      cpu == this_cpu);
 	success = 1;
-
-	/*
-	 * Only attribute actual wakeups done by this task.
-	 */
-	if (!in_interrupt()) {
-		struct sched_entity *se = &current->se;
-		u64 sample = se->sum_exec_runtime;
-
-		if (se->last_wakeup)
-			sample -= se->last_wakeup;
-		else
-			sample -= se->start_runtime;
-		update_avg(&se->avg_wakeup, sample);
-
-		se->last_wakeup = se->sum_exec_runtime;
-	}
-
 out_running:
-	trace_sched_wakeup(rq, p, success);
-	check_preempt_curr(rq, p, wake_flags);
-
-	p->state = TASK_RUNNING;
-#ifdef CONFIG_SMP
-	if (p->sched_class->task_wake_up)
-		p->sched_class->task_wake_up(rq, p);
-
-	if (unlikely(rq->idle_stamp)) {
-		u64 delta = rq->clock - rq->idle_stamp;
-		u64 max = 2*sysctl_sched_migration_cost;
-
-		if (delta > max)
-			rq->avg_idle = max;
-		else
-			update_avg(&rq->avg_idle, delta);
-		rq->idle_stamp = 0;
-	}
-#endif
+	ttwu_post_activation(p, rq, wake_flags, success);
 out:
 	task_rq_unlock(rq, &flags);
 	put_cpu();
-- 
1.6.4.2



* [PATCH 03/27] sched: implement __set_cpus_allowed()
  2009-12-18 12:57 Tejun Heo
  2009-12-18 12:57 ` [PATCH 01/27] sched: rename preempt_notifiers to sched_notifiers and refactor implementation Tejun Heo
  2009-12-18 12:57 ` [PATCH 02/27] sched: refactor try_to_wake_up() Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 04/27] sched: make sched_notifiers unconditional Tejun Heo
                   ` (26 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo, Mike Galbraith

set_cpus_allowed_ptr() modifies the allowed cpu mask of a task.  The
function performs the following checks before applying a new mask.

* Check whether PF_THREAD_BOUND is set.  This is set for bound
  kthreads so that they can't be moved around.

* Check whether the target cpu is still marked active - cpu_active().
  Active state is cleared early while downing a cpu.

This patch adds __set_cpus_allowed(), which takes a @force parameter
which, when true, makes __set_cpus_allowed() ignore PF_THREAD_BOUND
and use cpu online state instead of active state for the latter
check.  This allows migrating tasks to CPUs as long as they are
online.  set_cpus_allowed_ptr() is implemented as an inline wrapper
around __set_cpus_allowed().

Due to the way migration is implemented, the @force parameter needs
to be passed over to the migration thread.  The @force parameter is
added to struct migration_req and passed to __migrate_task().

Please note the naming discrepancy between set_cpus_allowed_ptr() and
the new functions.  The _ptr suffix is from the days when cpumask API
wasn't mature and future changes should drop it from
set_cpus_allowed_ptr() too.

NOTE: It would be nice to implement kthread_bind() in terms of
      __set_cpus_allowed() if we can drop the capability to bind to a
      dead CPU from kthread_bind(), which doesn't seem too popular
      anyway.  With such change, we'll have set_cpus_allowed_ptr() for
      regular tasks and kthread_bind() for kthreads and can use
      PF_THREAD_BOUND instead of passing @force parameter around.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |   14 +++++++++---
 kernel/sched.c        |   55 ++++++++++++++++++++++++++++++++-----------------
 2 files changed, 46 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6231576..c3a9c3d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1864,11 +1864,11 @@ static inline void rcu_copy_process(struct task_struct *p)
 #endif
 
 #ifdef CONFIG_SMP
-extern int set_cpus_allowed_ptr(struct task_struct *p,
-				const struct cpumask *new_mask);
+extern int __set_cpus_allowed(struct task_struct *p,
+			      const struct cpumask *new_mask, bool force);
 #else
-static inline int set_cpus_allowed_ptr(struct task_struct *p,
-				       const struct cpumask *new_mask)
+static inline int __set_cpus_allowed(struct task_struct *p,
+				     const struct cpumask *new_mask, bool force)
 {
 	if (!cpumask_test_cpu(0, new_mask))
 		return -EINVAL;
@@ -1876,6 +1876,12 @@ static inline int set_cpus_allowed_ptr(struct task_struct *p,
 }
 #endif
 
+static inline int set_cpus_allowed_ptr(struct task_struct *p,
+				       const struct cpumask *new_mask)
+{
+	return __set_cpus_allowed(p, new_mask, false);
+}
+
 #ifndef CONFIG_CPUMASK_OFFSTACK
 static inline int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
 {
diff --git a/kernel/sched.c b/kernel/sched.c
index 910bf74..f7e2c32 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2128,6 +2128,7 @@ struct migration_req {
 
 	struct task_struct *task;
 	int dest_cpu;
+	bool force;
 
 	struct completion done;
 };
@@ -2136,8 +2137,8 @@ struct migration_req {
  * The task's runqueue lock must be held.
  * Returns true if you have to wait for migration thread.
  */
-static int
-migrate_task(struct task_struct *p, int dest_cpu, struct migration_req *req)
+static int migrate_task(struct task_struct *p, int dest_cpu, bool force,
+			struct migration_req *req)
 {
 	struct rq *rq = task_rq(p);
 
@@ -2154,6 +2155,7 @@ migrate_task(struct task_struct *p, int dest_cpu, struct migration_req *req)
 	init_completion(&req->done);
 	req->task = p;
 	req->dest_cpu = dest_cpu;
+	req->force = force;
 	list_add(&req->list, &rq->migration_queue);
 
 	return 1;
@@ -3112,7 +3114,7 @@ static void sched_migrate_task(struct task_struct *p, int dest_cpu)
 		goto out;
 
 	/* force the process onto the specified CPU */
-	if (migrate_task(p, dest_cpu, &req)) {
+	if (migrate_task(p, dest_cpu, false, &req)) {
 		/* Need to wait for migration thread (might exit: take ref). */
 		struct task_struct *mt = rq->migration_thread;
 
@@ -7078,29 +7080,39 @@ static inline void sched_init_granularity(void)
  * 7) we wake up and the migration is done.
  */
 
-/*
- * Change a given task's CPU affinity. Migrate the thread to a
- * proper CPU and schedule it away if the CPU it's executing on
- * is removed from the allowed bitmask.
+/**
+ * __set_cpus_allowed - change a task's CPU affinity
+ * @p: task to change CPU affinity for
+ * @new_mask: new CPU affinity
+ * @force: override CPU active status and PF_THREAD_BOUND check
+ *
+ * Migrate the thread to a proper CPU and schedule it away if the CPU
+ * it's executing on is removed from the allowed bitmask.
+ *
+ * The caller must have a valid reference to the task, the task must
+ * not exit() & deallocate itself prematurely. The call is not atomic;
+ * no spinlocks may be held.
  *
- * NOTE: the caller must have a valid reference to the task, the
- * task must not exit() & deallocate itself prematurely. The
- * call is not atomic; no spinlocks may be held.
+ * If @force is %true, PF_THREAD_BOUND test is bypassed and CPU active
+ * state is ignored as long as the CPU is online.
  */
-int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
+int __set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask,
+		       bool force)
 {
+	const struct cpumask *cpu_cand_mask =
+		force ? cpu_online_mask : cpu_active_mask;
 	struct migration_req req;
 	unsigned long flags;
 	struct rq *rq;
 	int ret = 0;
 
 	rq = task_rq_lock(p, &flags);
-	if (!cpumask_intersects(new_mask, cpu_active_mask)) {
+	if (!cpumask_intersects(new_mask, cpu_cand_mask)) {
 		ret = -EINVAL;
 		goto out;
 	}
 
-	if (unlikely((p->flags & PF_THREAD_BOUND) && p != current &&
+	if (unlikely((p->flags & PF_THREAD_BOUND) && !force && p != current &&
 		     !cpumask_equal(&p->cpus_allowed, new_mask))) {
 		ret = -EINVAL;
 		goto out;
@@ -7117,7 +7129,8 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
 	if (cpumask_test_cpu(task_cpu(p), new_mask))
 		goto out;
 
-	if (migrate_task(p, cpumask_any_and(cpu_active_mask, new_mask), &req)) {
+	if (migrate_task(p, cpumask_any_and(cpu_cand_mask, new_mask), force,
+			 &req)) {
 		/* Need help from migration thread: drop lock and wait. */
 		struct task_struct *mt = rq->migration_thread;
 
@@ -7134,7 +7147,7 @@ out:
 
 	return ret;
 }
-EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
+EXPORT_SYMBOL_GPL(__set_cpus_allowed);
 
 /*
  * Move (not current) task off this cpu, onto dest cpu. We're doing
@@ -7147,12 +7160,15 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
  *
  * Returns non-zero if task was successfully migrated.
  */
-static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
+static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu,
+			  bool force)
 {
+	const struct cpumask *cpu_cand_mask =
+		force ? cpu_online_mask : cpu_active_mask;
 	struct rq *rq_dest, *rq_src;
 	int ret = 0, on_rq;
 
-	if (unlikely(!cpu_active(dest_cpu)))
+	if (unlikely(!cpumask_test_cpu(dest_cpu, cpu_cand_mask)))
 		return ret;
 
 	rq_src = cpu_rq(src_cpu);
@@ -7231,7 +7247,8 @@ static int migration_thread(void *data)
 
 		if (req->task != NULL) {
 			raw_spin_unlock(&rq->lock);
-			__migrate_task(req->task, cpu, req->dest_cpu);
+			__migrate_task(req->task, cpu, req->dest_cpu,
+				       req->force);
 		} else if (likely(cpu == (badcpu = smp_processor_id()))) {
 			req->dest_cpu = RCU_MIGRATION_GOT_QS;
 			raw_spin_unlock(&rq->lock);
@@ -7256,7 +7273,7 @@ static int __migrate_task_irq(struct task_struct *p, int src_cpu, int dest_cpu)
 	int ret;
 
 	local_irq_disable();
-	ret = __migrate_task(p, src_cpu, dest_cpu);
+	ret = __migrate_task(p, src_cpu, dest_cpu, false);
 	local_irq_enable();
 	return ret;
 }
-- 
1.6.4.2



* [PATCH 04/27] sched: make sched_notifiers unconditional
  2009-12-18 12:57 Tejun Heo
                   ` (2 preceding siblings ...)
  2009-12-18 12:57 ` [PATCH 03/27] sched: implement __set_cpus_allowed() Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 05/27] sched: add wakeup/sleep sched_notifiers and allow NULL notifier ops Tejun Heo
                   ` (25 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo, Mike Galbraith

sched_notifiers will be used by the workqueue code, which is always
there.  Always enable sched_notifiers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 arch/ia64/kvm/Kconfig    |    1 -
 arch/powerpc/kvm/Kconfig |    1 -
 arch/s390/kvm/Kconfig    |    1 -
 arch/x86/kvm/Kconfig     |    1 -
 include/linux/kvm_host.h |    2 --
 include/linux/sched.h    |    6 ------
 init/Kconfig             |    4 ----
 kernel/sched.c           |   13 -------------
 8 files changed, 0 insertions(+), 29 deletions(-)

diff --git a/arch/ia64/kvm/Kconfig b/arch/ia64/kvm/Kconfig
index a38b72e..a9e2b9c 100644
--- a/arch/ia64/kvm/Kconfig
+++ b/arch/ia64/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
 	depends on HAVE_KVM && MODULES && EXPERIMENTAL
 	# for device assignment:
 	depends on PCI
-	select SCHED_NOTIFIERS
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
 	select KVM_APIC_ARCHITECTURE
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index d3a65c6..38818c0 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -18,7 +18,6 @@ if VIRTUALIZATION
 
 config KVM
 	bool
-	select SCHED_NOTIFIERS
 	select ANON_INODES
 
 config KVM_BOOK3S_64_HANDLER
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index a0adddd..f9b46b0 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -18,7 +18,6 @@ if VIRTUALIZATION
 config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM && EXPERIMENTAL
-	select SCHED_NOTIFIERS
 	select ANON_INODES
 	---help---
 	  Support hosting paravirtualized guest machines using the SIE
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index fd38f79..337a4e5 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
 	depends on HAVE_KVM
 	# for device assignment:
 	depends on PCI
-	select SCHED_NOTIFIERS
 	select MMU_NOTIFIER
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8079759..45b631e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -74,9 +74,7 @@ void kvm_io_bus_unregister_dev(struct kvm *kvm, struct kvm_io_bus *bus,
 
 struct kvm_vcpu {
 	struct kvm *kvm;
-#ifdef CONFIG_SCHED_NOTIFIERS
 	struct sched_notifier sched_notifier;
-#endif
 	int vcpu_id;
 	struct mutex mutex;
 	int   cpu;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c3a9c3d..f327ac7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1211,8 +1211,6 @@ struct sched_rt_entity {
 #endif
 };
 
-#ifdef CONFIG_SCHED_NOTIFIERS
-
 struct sched_notifier;
 
 /**
@@ -1256,8 +1254,6 @@ static inline void sched_notifier_init(struct sched_notifier *notifier,
 	notifier->ops = ops;
 }
 
-#endif	/* CONFIG_SCHED_NOTIFIERS */
-
 struct rcu_node;
 
 struct task_struct {
@@ -1281,10 +1277,8 @@ struct task_struct {
 	struct sched_entity se;
 	struct sched_rt_entity rt;
 
-#ifdef CONFIG_SCHED_NOTIFIERS
 	/* list of struct sched_notifier: */
 	struct hlist_head sched_notifiers;
-#endif
 
 	/*
 	 * fpu_counter contains the number of consecutive context switches
diff --git a/init/Kconfig b/init/Kconfig
index 37aefe1..4a6b14b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1248,8 +1248,4 @@ config STOP_MACHINE
 	  Need stop_machine() primitive.
 
 source "block/Kconfig"
-
-config SCHED_NOTIFIERS
-	bool
-
 source "kernel/Kconfig.locks"
diff --git a/kernel/sched.c b/kernel/sched.c
index f7e2c32..f44db0b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1434,8 +1434,6 @@ static inline void cpuacct_update_stats(struct task_struct *tsk,
 		enum cpuacct_stat_index idx, cputime_t val) {}
 #endif
 
-#ifdef CONFIG_SCHED_NOTIFIERS
-
 #define fire_sched_notifiers(p, callback, args...) do {			\
 	struct sched_notifier *__sn;					\
 	struct hlist_node *__pos;					\
@@ -1466,12 +1464,6 @@ void sched_notifier_unregister(struct sched_notifier *notifier)
 }
 EXPORT_SYMBOL_GPL(sched_notifier_unregister);
 
-#else	/* !CONFIG_SCHED_NOTIFIERS */
-
-#define fire_sched_notifiers(p, callback, args...)	do { } while (0)
-
-#endif	/* CONFIG_SCHED_NOTIFIERS */
-
 static inline void inc_cpu_load(struct rq *rq, unsigned long load)
 {
 	update_load_add(&rq->load, load);
@@ -2588,10 +2580,7 @@ static void __sched_fork(struct task_struct *p)
 	INIT_LIST_HEAD(&p->rt.run_list);
 	p->se.on_rq = 0;
 	INIT_LIST_HEAD(&p->se.group_node);
-
-#ifdef CONFIG_SCHED_NOTIFIERS
 	INIT_HLIST_HEAD(&p->sched_notifiers);
-#endif
 
 	/*
 	 * We mark the process as running here, but have not actually
@@ -9626,9 +9615,7 @@ void __init sched_init(void)
 
 	set_load_weight(&init_task);
 
-#ifdef CONFIG_SCHED_NOTIFIERS
 	INIT_HLIST_HEAD(&init_task.sched_notifiers);
-#endif
 
 #ifdef CONFIG_SMP
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
-- 
1.6.4.2



* [PATCH 05/27] sched: add wakeup/sleep sched_notifiers and allow NULL notifier ops
  2009-12-18 12:57 Tejun Heo
                   ` (3 preceding siblings ...)
  2009-12-18 12:57 ` [PATCH 04/27] sched: make sched_notifiers unconditional Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 06/27] sched: implement try_to_wake_up_local() Tejun Heo
                   ` (24 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo, Mike Galbraith

Add wakeup and sleep notifiers to sched_notifiers and allow omitting
some of the notifiers in the ops table.  These will be used by
concurrency managed workqueue.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    6 ++++++
 kernel/sched.c        |   11 ++++++++---
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f327ac7..b3c1666 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1215,6 +1215,10 @@ struct sched_notifier;
 
 /**
  * sched_notifier_ops - notifiers called for scheduling events
+ * @wakeup: we're waking up
+ *    notifier: struct sched_notifier for the task being woken up
+ * @sleep: we're going to bed
+ *    notifier: struct sched_notifier for the task sleeping
  * @in: we're about to be rescheduled:
  *    notifier: struct sched_notifier for the task being scheduled
  *    cpu:  cpu we're scheduled on
@@ -1228,6 +1232,8 @@ struct sched_notifier;
  * and depended upon by its users.
  */
 struct sched_notifier_ops {
+	void (*wakeup)(struct sched_notifier *notifier);
+	void (*sleep)(struct sched_notifier *notifier);
 	void (*in)(struct sched_notifier *notifier, int cpu);
 	void (*out)(struct sched_notifier *notifier, struct task_struct *next);
 };
diff --git a/kernel/sched.c b/kernel/sched.c
index f44db0b..35af985 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1439,7 +1439,8 @@ static inline void cpuacct_update_stats(struct task_struct *tsk,
 	struct hlist_node *__pos;					\
 									\
 	hlist_for_each_entry(__sn, __pos, &(p)->sched_notifiers, link)	\
-		__sn->ops->callback(__sn , ##args);			\
+		if (__sn->ops->callback)				\
+			__sn->ops->callback(__sn , ##args);		\
 } while (0)
 
 /**
@@ -2410,6 +2411,8 @@ static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
 		rq->idle_stamp = 0;
 	}
 #endif
+	if (success)
+		fire_sched_notifiers(p, wakeup);
 }
 
 /**
@@ -5448,10 +5451,12 @@ need_resched_nonpreemptible:
 	clear_tsk_need_resched(prev);
 
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
-		if (unlikely(signal_pending_state(prev->state, prev)))
+		if (unlikely(signal_pending_state(prev->state, prev))) {
 			prev->state = TASK_RUNNING;
-		else
+		} else {
+			fire_sched_notifiers(prev, sleep);
 			deactivate_task(rq, prev, 1);
+		}
 		switch_count = &prev->nvcsw;
 	}
 
-- 
1.6.4.2



* [PATCH 06/27] sched: implement try_to_wake_up_local()
  2009-12-18 12:57 Tejun Heo
                   ` (4 preceding siblings ...)
  2009-12-18 12:57 ` [PATCH 05/27] sched: add wakeup/sleep sched_notifiers and allow NULL notifier ops Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 07/27] acpi: use queue_work_on() instead of binding workqueue worker to cpu0 Tejun Heo
                   ` (23 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo, Mike Galbraith

Implement try_to_wake_up_local().  try_to_wake_up_local() is similar
to try_to_wake_up() but it assumes the caller has this_rq() locked and
the target task is bound to this_rq().  try_to_wake_up_local() can be
called from wakeup and sleep scheduler notifiers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    2 +
 kernel/sched.c        |   56 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b3c1666..40a164e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2087,6 +2087,8 @@ extern void release_uids(struct user_namespace *ns);
 
 extern void do_timer(unsigned long ticks);
 
+extern bool try_to_wake_up_local(struct task_struct *p, unsigned int state,
+				 int wake_flags);
 extern int wake_up_state(struct task_struct *tsk, unsigned int state);
 extern int wake_up_process(struct task_struct *tsk);
 extern void wake_up_new_task(struct task_struct *tsk,
diff --git a/kernel/sched.c b/kernel/sched.c
index 35af985..70432c8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2411,6 +2411,10 @@ static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
 		rq->idle_stamp = 0;
 	}
 #endif
+	/*
+	 * Wake up is complete, fire wake up notifier.  This allows
+	 * try_to_wake_up_local() to be called from wake up notifiers.
+	 */
 	if (success)
 		fire_sched_notifiers(p, wakeup);
 }
@@ -2509,6 +2513,53 @@ out:
 }
 
 /**
+ * try_to_wake_up_local - try to wake up a local task with rq lock held
+ * @p: the thread to be awakened
+ * @state: the mask of task states that can be woken
+ * @wake_flags: wake modifier flags (WF_*)
+ *
+ * Put @p on the run-queue if it's not already there.  The caller must
+ * ensure that this_rq() is locked, @p is bound to this_rq() and @p is
+ * not the current task.  this_rq() stays locked over invocation.
+ *
+ * This function can be called from wakeup and sleep scheduler
+ * notifiers.  Be careful not to create deep recursion by chaining
+ * wakeup notifiers.
+ *
+ * Returns %true if @p was woken up, %false if it was already running
+ * or @state didn't match @p's state.
+ */
+bool try_to_wake_up_local(struct task_struct *p, unsigned int state,
+			  int wake_flags)
+{
+	struct rq *rq = task_rq(p);
+	bool success = false;
+
+	BUG_ON(rq != this_rq());
+	BUG_ON(p == current);
+	lockdep_assert_held(&rq->lock);
+
+	if (!sched_feat(SYNC_WAKEUPS))
+		wake_flags &= ~WF_SYNC;
+
+	if (!(p->state & state))
+		return false;
+
+	if (!p->se.on_rq) {
+		if (likely(!task_running(rq, p))) {
+			schedstat_inc(rq, ttwu_count);
+			schedstat_inc(rq, ttwu_local);
+		}
+		ttwu_activate(p, rq, wake_flags & WF_SYNC, false, true);
+		success = true;
+	}
+
+	ttwu_post_activation(p, rq, wake_flags, success);
+
+	return success;
+}
+
+/**
  * wake_up_process - Wake up a specific process
  * @p: The process to be woken up.
  *
@@ -5454,6 +5505,11 @@ need_resched_nonpreemptible:
 		if (unlikely(signal_pending_state(prev->state, prev))) {
 			prev->state = TASK_RUNNING;
 		} else {
+			/*
+			 * Fire sleep notifier before changing any scheduler
+			 * state.  This allows try_to_wake_up_local() to be
+			 * called from sleep notifiers.
+			 */
 			fire_sched_notifiers(prev, sleep);
 			deactivate_task(rq, prev, 1);
 		}
-- 
1.6.4.2



* [PATCH 07/27] acpi: use queue_work_on() instead of binding workqueue worker to cpu0
  2009-12-18 12:57 Tejun Heo
                   ` (5 preceding siblings ...)
  2009-12-18 12:57 ` [PATCH 06/27] sched: implement try_to_wake_up_local() Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 08/27] stop_machine: reimplement without using workqueue Tejun Heo
                   ` (22 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

ACPI work items need to be executed on cpu0 and acpi/osl.c achieves
this by creating singlethread workqueues and then binding them to cpu0
from a work item, which is quite unorthodox.  Make it create regular
workqueues and use queue_work_on() instead.  This is in preparation
for concurrency managed workqueue; the extra workers won't be a
problem once it's implemented.
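
For reference, a minimal sketch of the queue_work_on() pattern this
switches to (names are illustrative, not taken from drivers/acpi/osl.c):

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;
static struct work_struct example_work;

static void example_fn(struct work_struct *work)
{
	/* always executes on CPU 0 */
}

static int __init example_init(void)
{
	example_wq = create_workqueue("example");
	if (!example_wq)
		return -ENOMEM;
	INIT_WORK(&example_work, example_fn);
	/* queue to CPU 0 explicitly instead of binding the worker */
	queue_work_on(0, example_wq, &example_work);
	return 0;
}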

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 drivers/acpi/osl.c |   41 ++++++++++++-----------------------------
 1 files changed, 12 insertions(+), 29 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 02e8464..93f6647 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -191,36 +191,11 @@ acpi_status __init acpi_os_initialize(void)
 	return AE_OK;
 }
 
-static void bind_to_cpu0(struct work_struct *work)
-{
-	set_cpus_allowed_ptr(current, cpumask_of(0));
-	kfree(work);
-}
-
-static void bind_workqueue(struct workqueue_struct *wq)
-{
-	struct work_struct *work;
-
-	work = kzalloc(sizeof(struct work_struct), GFP_KERNEL);
-	INIT_WORK(work, bind_to_cpu0);
-	queue_work(wq, work);
-}
-
 acpi_status acpi_os_initialize1(void)
 {
-	/*
-	 * On some machines, a software-initiated SMI causes corruption unless
-	 * the SMI runs on CPU 0.  An SMI can be initiated by any AML, but
-	 * typically it's done in GPE-related methods that are run via
-	 * workqueues, so we can avoid the known corruption cases by binding
-	 * the workqueues to CPU 0.
-	 */
-	kacpid_wq = create_singlethread_workqueue("kacpid");
-	bind_workqueue(kacpid_wq);
-	kacpi_notify_wq = create_singlethread_workqueue("kacpi_notify");
-	bind_workqueue(kacpi_notify_wq);
-	kacpi_hotplug_wq = create_singlethread_workqueue("kacpi_hotplug");
-	bind_workqueue(kacpi_hotplug_wq);
+	kacpid_wq = create_workqueue("kacpid");
+	kacpi_notify_wq = create_workqueue("kacpi_notify");
+	kacpi_hotplug_wq = create_workqueue("kacpi_hotplug");
 	BUG_ON(!kacpid_wq);
 	BUG_ON(!kacpi_notify_wq);
 	BUG_ON(!kacpi_hotplug_wq);
@@ -759,7 +734,15 @@ static acpi_status __acpi_os_execute(acpi_execute_type type,
 		(type == OSL_NOTIFY_HANDLER ? kacpi_notify_wq : kacpid_wq);
 	dpc->wait = hp ? 1 : 0;
 	INIT_WORK(&dpc->work, acpi_os_execute_deferred);
-	ret = queue_work(queue, &dpc->work);
+
+	/*
+	 * On some machines, a software-initiated SMI causes corruption unless
+	 * the SMI runs on CPU 0.  An SMI can be initiated by any AML, but
+	 * typically it's done in GPE-related methods that are run via
+	 * workqueues, so we can avoid the known corruption cases by always
+	 * queueing on CPU 0.
+	 */
+	ret = queue_work_on(0, queue, &dpc->work);
 
 	if (!ret) {
 		printk(KERN_ERR PREFIX
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 08/27] stop_machine: reimplement without using workqueue
  2009-12-18 12:57 Tejun Heo
                   ` (6 preceding siblings ...)
  2009-12-18 12:57 ` [PATCH 07/27] acpi: use queue_work_on() instead of binding workqueue worker to cpu0 Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 09/27] workqueue: misc/cosmetic updates Tejun Heo
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

stop_machine() is the only user of the RT workqueue.  Reimplement it
using kthreads directly and rip RT support from workqueue.  This is in
preparation for concurrency managed workqueue.
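
For context, a typical stop_machine() caller is unaffected by the
switch from the RT workqueue to dedicated kthreads.  A minimal usage
sketch (function names made up for the example):

#include <linux/stop_machine.h>

static int example_patch_text(void *data)
{
	/* runs while all other online CPUs spin with IRQs disabled */
	return 0;
}

static int example_do_patch(void)
{
	/* NULL cpumask: run example_patch_text() on any one CPU */
	return stop_machine(example_patch_text, NULL, NULL);
}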

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/stop_machine.h |    6 ++
 include/linux/workqueue.h    |   20 +++---
 init/main.c                  |    2 +
 kernel/stop_machine.c        |  151 ++++++++++++++++++++++++++++++++++-------
 kernel/workqueue.c           |    6 --
 5 files changed, 142 insertions(+), 43 deletions(-)

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index baba3a2..2d32e06 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -53,6 +53,11 @@ int stop_machine_create(void);
  */
 void stop_machine_destroy(void);
 
+/**
+ * init_stop_machine: initialize stop_machine during boot
+ */
+void init_stop_machine(void);
+
 #else
 
 static inline int stop_machine(int (*fn)(void *), void *data,
@@ -67,6 +72,7 @@ static inline int stop_machine(int (*fn)(void *), void *data,
 
 static inline int stop_machine_create(void) { return 0; }
 static inline void stop_machine_destroy(void) { }
+static inline void init_stop_machine(void) { }
 
 #endif /* CONFIG_SMP */
 #endif /* _LINUX_STOP_MACHINE */
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 9466e86..0697946 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -181,12 +181,11 @@ static inline void destroy_work_on_stack(struct work_struct *work) { }
 
 
 extern struct workqueue_struct *
-__create_workqueue_key(const char *name, int singlethread,
-		       int freezeable, int rt, struct lock_class_key *key,
-		       const char *lock_name);
+__create_workqueue_key(const char *name, int singlethread, int freezeable,
+		       struct lock_class_key *key, const char *lock_name);
 
 #ifdef CONFIG_LOCKDEP
-#define __create_workqueue(name, singlethread, freezeable, rt)	\
+#define __create_workqueue(name, singlethread, freezeable)	\
 ({								\
 	static struct lock_class_key __key;			\
 	const char *__lock_name;				\
@@ -197,19 +196,18 @@ __create_workqueue_key(const char *name, int singlethread,
 		__lock_name = #name;				\
 								\
 	__create_workqueue_key((name), (singlethread),		\
-			       (freezeable), (rt), &__key,	\
+			       (freezeable), &__key,		\
 			       __lock_name);			\
 })
 #else
-#define __create_workqueue(name, singlethread, freezeable, rt)	\
-	__create_workqueue_key((name), (singlethread), (freezeable), (rt), \
+#define __create_workqueue(name, singlethread, freezeable)	\
+	__create_workqueue_key((name), (singlethread), (freezeable), \
 			       NULL, NULL)
 #endif
 
-#define create_workqueue(name) __create_workqueue((name), 0, 0, 0)
-#define create_rt_workqueue(name) __create_workqueue((name), 0, 0, 1)
-#define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1, 0)
-#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0, 0)
+#define create_workqueue(name) __create_workqueue((name), 0, 0)
+#define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1)
+#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0)
 
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
diff --git a/init/main.c b/init/main.c
index c3db4a9..a88bfee 100644
--- a/init/main.c
+++ b/init/main.c
@@ -34,6 +34,7 @@
 #include <linux/security.h>
 #include <linux/smp.h>
 #include <linux/workqueue.h>
+#include <linux/stop_machine.h>
 #include <linux/profile.h>
 #include <linux/rcupdate.h>
 #include <linux/moduleparam.h>
@@ -774,6 +775,7 @@ static void __init do_initcalls(void)
 static void __init do_basic_setup(void)
 {
 	init_workqueues();
+	init_stop_machine();
 	cpuset_init_smp();
 	usermodehelper_init();
 	init_tmpfs();
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 912823e..671a4ac 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -25,6 +25,8 @@ enum stopmachine_state {
 	STOPMACHINE_RUN,
 	/* Exit */
 	STOPMACHINE_EXIT,
+	/* Done */
+	STOPMACHINE_DONE,
 };
 static enum stopmachine_state state;
 
@@ -42,10 +44,9 @@ static DEFINE_MUTEX(lock);
 static DEFINE_MUTEX(setup_lock);
 /* Users of stop_machine. */
 static int refcount;
-static struct workqueue_struct *stop_machine_wq;
+static struct task_struct **stop_machine_threads;
 static struct stop_machine_data active, idle;
 static const struct cpumask *active_cpus;
-static void *stop_machine_work;
 
 static void set_state(enum stopmachine_state newstate)
 {
@@ -63,14 +64,31 @@ static void ack_state(void)
 }
 
 /* This is the actual function which stops the CPU. It runs
- * in the context of a dedicated stopmachine workqueue. */
-static void stop_cpu(struct work_struct *unused)
+ * on dedicated per-cpu kthreads. */
+static int stop_cpu(void *unused)
 {
 	enum stopmachine_state curstate = STOPMACHINE_NONE;
-	struct stop_machine_data *smdata = &idle;
+	struct stop_machine_data *smdata;
 	int cpu = smp_processor_id();
 	int err;
 
+repeat:
+	/* Wait for __stop_machine() to initiate */
+	while (true) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		/* <- kthread_stop() and __stop_machine()::smp_wmb() */
+		if (kthread_should_stop()) {
+			__set_current_state(TASK_RUNNING);
+			return 0;
+		}
+		if (state == STOPMACHINE_PREPARE)
+			break;
+		schedule();
+	}
+	smp_rmb();	/* <- __stop_machine()::set_state() */
+
+	/* Okay, let's go */
+	smdata = &idle;
 	if (!active_cpus) {
 		if (cpu == cpumask_first(cpu_online_mask))
 			smdata = &active;
@@ -104,6 +122,7 @@ static void stop_cpu(struct work_struct *unused)
 	} while (curstate != STOPMACHINE_EXIT);
 
 	local_irq_enable();
+	goto repeat;
 }
 
 /* Callback for CPUs which aren't supposed to do anything. */
@@ -112,46 +131,122 @@ static int chill(void *unused)
 	return 0;
 }
 
+static int create_stop_machine_thread(unsigned int cpu)
+{
+	struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+	struct task_struct **pp = per_cpu_ptr(stop_machine_threads, cpu);
+	struct task_struct *p;
+
+	if (*pp)
+		return -EBUSY;
+
+	p = kthread_create(stop_cpu, NULL, "kstop/%u", cpu);
+	if (IS_ERR(p))
+		return PTR_ERR(p);
+
+	sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
+	*pp = p;
+	return 0;
+}
+
+/* Should be called with cpu hotplug disabled and setup_lock held */
+static void kill_stop_machine_threads(void)
+{
+	unsigned int cpu;
+
+	if (!stop_machine_threads)
+		return;
+
+	for_each_online_cpu(cpu) {
+		struct task_struct *p = *per_cpu_ptr(stop_machine_threads, cpu);
+		if (p)
+			kthread_stop(p);
+	}
+	free_percpu(stop_machine_threads);
+	stop_machine_threads = NULL;
+}
+
 int stop_machine_create(void)
 {
+	unsigned int cpu;
+
+	get_online_cpus();
 	mutex_lock(&setup_lock);
 	if (refcount)
 		goto done;
-	stop_machine_wq = create_rt_workqueue("kstop");
-	if (!stop_machine_wq)
-		goto err_out;
-	stop_machine_work = alloc_percpu(struct work_struct);
-	if (!stop_machine_work)
+
+	stop_machine_threads = alloc_percpu(struct task_struct *);
+	if (!stop_machine_threads)
 		goto err_out;
+
+	/*
+	 * cpu hotplug is disabled, create only for online cpus,
+	 * cpu_callback() will handle cpu hot [un]plugs.
+	 */
+	for_each_online_cpu(cpu) {
+		if (create_stop_machine_thread(cpu))
+			goto err_out;
+		kthread_bind(*per_cpu_ptr(stop_machine_threads, cpu), cpu);
+	}
 done:
 	refcount++;
 	mutex_unlock(&setup_lock);
+	put_online_cpus();
 	return 0;
 
 err_out:
-	if (stop_machine_wq)
-		destroy_workqueue(stop_machine_wq);
+	kill_stop_machine_threads();
 	mutex_unlock(&setup_lock);
+	put_online_cpus();
 	return -ENOMEM;
 }
 EXPORT_SYMBOL_GPL(stop_machine_create);
 
 void stop_machine_destroy(void)
 {
+	get_online_cpus();
 	mutex_lock(&setup_lock);
-	refcount--;
-	if (refcount)
-		goto done;
-	destroy_workqueue(stop_machine_wq);
-	free_percpu(stop_machine_work);
-done:
+	if (!--refcount)
+		kill_stop_machine_threads();
 	mutex_unlock(&setup_lock);
+	put_online_cpus();
 }
 EXPORT_SYMBOL_GPL(stop_machine_destroy);
 
+static int __cpuinit stop_machine_cpu_callback(struct notifier_block *nfb,
+					       unsigned long action, void *hcpu)
+{
+	unsigned int cpu = (unsigned long)hcpu;
+	struct task_struct **pp = per_cpu_ptr(stop_machine_threads, cpu);
+
+	/* Hotplug exclusion is enough, no need to worry about setup_lock */
+	if (!stop_machine_threads)
+		return NOTIFY_OK;
+
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_UP_PREPARE:
+		if (create_stop_machine_thread(cpu)) {
+			printk(KERN_ERR "failed to create stop machine "
+			       "thread for %u\n", cpu);
+			return NOTIFY_BAD;
+		}
+		break;
+
+	case CPU_ONLINE:
+		kthread_bind(*pp, cpu);
+		break;
+
+	case CPU_UP_CANCELED:
+	case CPU_POST_DEAD:
+		kthread_stop(*pp);
+		*pp = NULL;
+		break;
+	}
+	return NOTIFY_OK;
+}
+
 int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 {
-	struct work_struct *sm_work;
 	int i, ret;
 
 	/* Set up initial state. */
@@ -164,19 +259,18 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 	idle.fn = chill;
 	idle.data = NULL;
 
-	set_state(STOPMACHINE_PREPARE);
+	set_state(STOPMACHINE_PREPARE);	/* -> stop_cpu()::smp_rmb() */
+	smp_wmb();			/* -> stop_cpu()::set_current_state() */
 
 	/* Schedule the stop_cpu work on all cpus: hold this CPU so one
 	 * doesn't hit this CPU until we're ready. */
 	get_cpu();
-	for_each_online_cpu(i) {
-		sm_work = per_cpu_ptr(stop_machine_work, i);
-		INIT_WORK(sm_work, stop_cpu);
-		queue_work_on(i, stop_machine_wq, sm_work);
-	}
+	for_each_online_cpu(i)
+		wake_up_process(*per_cpu_ptr(stop_machine_threads, i));
 	/* This will release the thread on our CPU. */
 	put_cpu();
-	flush_workqueue(stop_machine_wq);
+	while (state < STOPMACHINE_DONE)
+		yield();
 	ret = active.fnret;
 	mutex_unlock(&lock);
 	return ret;
@@ -197,3 +291,8 @@ int stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 	return ret;
 }
 EXPORT_SYMBOL_GPL(stop_machine);
+
+void __init init_stop_machine(void)
+{
+	hotcpu_notifier(stop_machine_cpu_callback, 0);
+}
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index dee4865..3dccec6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -62,7 +62,6 @@ struct workqueue_struct {
 	const char *name;
 	int singlethread;
 	int freezeable;		/* Freeze threads during suspend */
-	int rt;
 #ifdef CONFIG_LOCKDEP
 	struct lockdep_map lockdep_map;
 #endif
@@ -913,7 +912,6 @@ init_cpu_workqueue(struct workqueue_struct *wq, int cpu)
 
 static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
 {
-	struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
 	struct workqueue_struct *wq = cwq->wq;
 	const char *fmt = is_wq_single_threaded(wq) ? "%s" : "%s/%d";
 	struct task_struct *p;
@@ -929,8 +927,6 @@ static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
 	 */
 	if (IS_ERR(p))
 		return PTR_ERR(p);
-	if (cwq->wq->rt)
-		sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
 	cwq->thread = p;
 
 	trace_workqueue_creation(cwq->thread, cpu);
@@ -952,7 +948,6 @@ static void start_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
 struct workqueue_struct *__create_workqueue_key(const char *name,
 						int singlethread,
 						int freezeable,
-						int rt,
 						struct lock_class_key *key,
 						const char *lock_name)
 {
@@ -974,7 +969,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
 	wq->singlethread = singlethread;
 	wq->freezeable = freezeable;
-	wq->rt = rt;
 	INIT_LIST_HEAD(&wq->list);
 
 	if (singlethread) {
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 09/27] workqueue: misc/cosmetic updates
  2009-12-18 12:57 Tejun Heo
                   ` (7 preceding siblings ...)
  2009-12-18 12:57 ` [PATCH 08/27] stop_machine: reimplement without using workqueue Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 10/27] workqueue: merge feature parameters into flags Tejun Heo
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

Make the following updates in preparation for concurrency managed
workqueue.  None of these changes causes any visible behavior
difference.

* Add comments and adjust indentation in data structures and several
  functions.

* Rename wq_per_cpu() to get_cwq() and swap the position of two
  parameters for consistency.  Convert a direct per_cpu_ptr() access
  to wq->cpu_wq to get_cwq().

* Add work_static() and update set_wq_data() such that it sets the
  flags part to WORK_STRUCT_PENDING | @extra_flags, plus
  WORK_STRUCT_STATIC if the work is statically initialized.

* Move the sanity check on work->entry emptiness from queue_work_on()
  to __queue_work(), which all queueing paths share.

* Make __queue_work() take @cpu and @wq instead of @cwq.

* Restructure flush_work() and __create_workqueue_key() to make them
  easier to modify.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/workqueue.h |    5 ++
 kernel/workqueue.c        |  127 +++++++++++++++++++++++++++++----------------
 2 files changed, 88 insertions(+), 44 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 0697946..ac06c55 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -96,9 +96,14 @@ struct execute_work {
 #ifdef CONFIG_DEBUG_OBJECTS_WORK
 extern void __init_work(struct work_struct *work, int onstack);
 extern void destroy_work_on_stack(struct work_struct *work);
+static inline bool work_static(struct work_struct *work)
+{
+	return test_bit(WORK_STRUCT_STATIC, work_data_bits(work));
+}
 #else
 static inline void __init_work(struct work_struct *work, int onstack) { }
 static inline void destroy_work_on_stack(struct work_struct *work) { }
+static inline bool work_static(struct work_struct *work) { return false; }
 #endif
 
 /*
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3dccec6..e16c457 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -37,6 +37,16 @@
 #include <trace/events/workqueue.h>
 
 /*
+ * Structure fields follow one of the following exclusion rules.
+ *
+ * I: Set during initialization and read-only afterwards.
+ *
+ * L: cwq->lock protected.  Access with cwq->lock held.
+ *
+ * W: workqueue_lock protected.
+ */
+
+/*
  * The per-CPU workqueue (if single thread, we always use the first
  * possible cpu).
  */
@@ -48,8 +58,8 @@ struct cpu_workqueue_struct {
 	wait_queue_head_t more_work;
 	struct work_struct *current_work;
 
-	struct workqueue_struct *wq;
-	struct task_struct *thread;
+	struct workqueue_struct *wq;		/* I: the owning workqueue */
+	struct task_struct	*thread;
 } ____cacheline_aligned;
 
 /*
@@ -57,13 +67,13 @@ struct cpu_workqueue_struct {
  * per-CPU workqueues:
  */
 struct workqueue_struct {
-	struct cpu_workqueue_struct *cpu_wq;
-	struct list_head list;
-	const char *name;
+	struct cpu_workqueue_struct *cpu_wq;	/* I: cwq's */
+	struct list_head	list;		/* W: list of all workqueues */
+	const char		*name;		/* I: workqueue name */
 	int singlethread;
 	int freezeable;		/* Freeze threads during suspend */
 #ifdef CONFIG_LOCKDEP
-	struct lockdep_map lockdep_map;
+	struct lockdep_map	lockdep_map;
 #endif
 };
 
@@ -204,8 +214,8 @@ static const struct cpumask *wq_cpu_map(struct workqueue_struct *wq)
 		? cpu_singlethread_map : cpu_populated_map;
 }
 
-static
-struct cpu_workqueue_struct *wq_per_cpu(struct workqueue_struct *wq, int cpu)
+static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
+					    struct workqueue_struct *wq)
 {
 	if (unlikely(is_wq_single_threaded(wq)))
 		cpu = singlethread_cpu;
@@ -217,15 +227,14 @@ struct cpu_workqueue_struct *wq_per_cpu(struct workqueue_struct *wq, int cpu)
  * - Must *only* be called if the pending flag is set
  */
 static inline void set_wq_data(struct work_struct *work,
-				struct cpu_workqueue_struct *cwq)
+			       struct cpu_workqueue_struct *cwq,
+			       unsigned long extra_flags)
 {
-	unsigned long new;
-
 	BUG_ON(!work_pending(work));
 
-	new = (unsigned long) cwq | (1UL << WORK_STRUCT_PENDING);
-	new |= WORK_STRUCT_FLAG_MASK & *work_data_bits(work);
-	atomic_long_set(&work->data, new);
+	atomic_long_set(&work->data, (unsigned long)cwq |
+			(work_static(work) ? (1UL << WORK_STRUCT_STATIC) : 0) |
+			(1UL << WORK_STRUCT_PENDING) | extra_flags);
 }
 
 static inline
@@ -234,29 +243,47 @@ struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
 	return (void *) (atomic_long_read(&work->data) & WORK_STRUCT_WQ_DATA_MASK);
 }
 
+/**
+ * insert_work - insert a work into cwq
+ * @cwq: cwq @work belongs to
+ * @work: work to insert
+ * @head: insertion point
+ * @extra_flags: extra WORK_STRUCT_* flags to set
+ *
+ * Insert @work into @cwq after @head.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
 static void insert_work(struct cpu_workqueue_struct *cwq,
-			struct work_struct *work, struct list_head *head)
+			struct work_struct *work, struct list_head *head,
+			unsigned int extra_flags)
 {
 	trace_workqueue_insertion(cwq->thread, work);
 
-	set_wq_data(work, cwq);
+	/* we own @work, set data and link */
+	set_wq_data(work, cwq, extra_flags);
+
 	/*
 	 * Ensure that we get the right work->data if we see the
 	 * result of list_add() below, see try_to_grab_pending().
 	 */
 	smp_wmb();
+
 	list_add_tail(&work->entry, head);
 	wake_up(&cwq->more_work);
 }
 
-static void __queue_work(struct cpu_workqueue_struct *cwq,
+static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 			 struct work_struct *work)
 {
+	struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 	unsigned long flags;
 
 	debug_work_activate(work);
 	spin_lock_irqsave(&cwq->lock, flags);
-	insert_work(cwq, work, &cwq->worklist);
+	BUG_ON(!list_empty(&work->entry));
+	insert_work(cwq, work, &cwq->worklist, 0);
 	spin_unlock_irqrestore(&cwq->lock, flags);
 }
 
@@ -298,8 +325,7 @@ queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
 	int ret = 0;
 
 	if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
-		BUG_ON(!list_empty(&work->entry));
-		__queue_work(wq_per_cpu(wq, cpu), work);
+		__queue_work(cpu, wq, work);
 		ret = 1;
 	}
 	return ret;
@@ -310,9 +336,8 @@ static void delayed_work_timer_fn(unsigned long __data)
 {
 	struct delayed_work *dwork = (struct delayed_work *)__data;
 	struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work);
-	struct workqueue_struct *wq = cwq->wq;
 
-	__queue_work(wq_per_cpu(wq, smp_processor_id()), &dwork->work);
+	__queue_work(smp_processor_id(), cwq->wq, &dwork->work);
 }
 
 /**
@@ -356,7 +381,7 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 		timer_stats_timer_set_start_info(&dwork->timer);
 
 		/* This stores cwq for the moment, for the timer_fn */
-		set_wq_data(work, wq_per_cpu(wq, raw_smp_processor_id()));
+		set_wq_data(work, get_cwq(raw_smp_processor_id(), wq), 0);
 		timer->expires = jiffies + delay;
 		timer->data = (unsigned long)dwork;
 		timer->function = delayed_work_timer_fn;
@@ -420,6 +445,12 @@ static void run_workqueue(struct cpu_workqueue_struct *cwq)
 	spin_unlock_irq(&cwq->lock);
 }
 
+/**
+ * worker_thread - the worker thread function
+ * @__cwq: cwq to serve
+ *
+ * The cwq worker thread function.
+ */
 static int worker_thread(void *__cwq)
 {
 	struct cpu_workqueue_struct *cwq = __cwq;
@@ -458,6 +489,17 @@ static void wq_barrier_func(struct work_struct *work)
 	complete(&barr->done);
 }
 
+/**
+ * insert_wq_barrier - insert a barrier work
+ * @cwq: cwq to insert barrier into
+ * @barr: wq_barrier to insert
+ * @head: insertion point
+ *
+ * Insert barrier @barr into @cwq before @head.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
 static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
 			struct wq_barrier *barr, struct list_head *head)
 {
@@ -469,11 +511,10 @@ static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
 	 */
 	INIT_WORK_ON_STACK(&barr->work, wq_barrier_func);
 	__set_bit(WORK_STRUCT_PENDING, work_data_bits(&barr->work));
-
 	init_completion(&barr->done);
 
 	debug_work_activate(&barr->work);
-	insert_work(cwq, &barr->work, head);
+	insert_work(cwq, &barr->work, head, 0);
 }
 
 static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
@@ -507,9 +548,6 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
  *
  * We sleep until all works which were queued on entry have been handled,
  * but we are not livelocked by new incoming ones.
- *
- * This function used to run the workqueues itself.  Now we just wait for the
- * helper threads to do it.
  */
 void flush_workqueue(struct workqueue_struct *wq)
 {
@@ -548,7 +586,6 @@ int flush_work(struct work_struct *work)
 	lock_map_acquire(&cwq->wq->lockdep_map);
 	lock_map_release(&cwq->wq->lockdep_map);
 
-	prev = NULL;
 	spin_lock_irq(&cwq->lock);
 	if (!list_empty(&work->entry)) {
 		/*
@@ -557,22 +594,22 @@ int flush_work(struct work_struct *work)
 		 */
 		smp_rmb();
 		if (unlikely(cwq != get_wq_data(work)))
-			goto out;
+			goto already_gone;
 		prev = &work->entry;
 	} else {
 		if (cwq->current_work != work)
-			goto out;
+			goto already_gone;
 		prev = &cwq->worklist;
 	}
 	insert_wq_barrier(cwq, &barr, prev->next);
-out:
-	spin_unlock_irq(&cwq->lock);
-	if (!prev)
-		return 0;
 
+	spin_unlock_irq(&cwq->lock);
 	wait_for_completion(&barr.done);
 	destroy_work_on_stack(&barr.work);
 	return 1;
+already_gone:
+	spin_unlock_irq(&cwq->lock);
+	return 0;
 }
 EXPORT_SYMBOL_GPL(flush_work);
 
@@ -655,7 +692,7 @@ static void wait_on_work(struct work_struct *work)
 	cpu_map = wq_cpu_map(wq);
 
 	for_each_cpu(cpu, cpu_map)
-		wait_on_cpu_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
+		wait_on_cpu_work(get_cwq(cpu, wq), work);
 }
 
 static int __cancel_work_timer(struct work_struct *work,
@@ -772,9 +809,7 @@ EXPORT_SYMBOL(schedule_delayed_work);
 void flush_delayed_work(struct delayed_work *dwork)
 {
 	if (del_timer_sync(&dwork->timer)) {
-		struct cpu_workqueue_struct *cwq;
-		cwq = wq_per_cpu(keventd_wq, get_cpu());
-		__queue_work(cwq, &dwork->work);
+		__queue_work(get_cpu(), keventd_wq, &dwork->work);
 		put_cpu();
 	}
 	flush_work(&dwork->work);
@@ -957,13 +992,11 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 
 	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
 	if (!wq)
-		return NULL;
+		goto err;
 
 	wq->cpu_wq = alloc_percpu(struct cpu_workqueue_struct);
-	if (!wq->cpu_wq) {
-		kfree(wq);
-		return NULL;
-	}
+	if (!wq->cpu_wq)
+		goto err;
 
 	wq->name = name;
 	lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
@@ -1007,6 +1040,12 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 		wq = NULL;
 	}
 	return wq;
+err:
+	if (wq) {
+		free_percpu(wq->cpu_wq);
+		kfree(wq);
+	}
+	return NULL;
 }
 EXPORT_SYMBOL_GPL(__create_workqueue_key);
 
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 10/27] workqueue: merge feature parameters into flags
  2009-12-18 12:57 Tejun Heo
                   ` (8 preceding siblings ...)
  2009-12-18 12:57 ` [PATCH 09/27] workqueue: misc/cosmetic updates Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 11/27] workqueue: define both bit position and mask for work flags Tejun Heo
                   ` (19 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

Currently, __create_workqueue_key() takes @singlethread and
@freezeable parameters and stores them separately in workqueue_struct.
Merge them into a single flags parameter and field and use
WQ_FREEZEABLE and WQ_SINGLE_THREAD.
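
For reference, a quick sketch of how workqueue creation looks after the
merge (illustrative caller-side code, not part of the patch):

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *events_wq, *frozen_wq;

static int __init example_init(void)
{
	/* the wrappers are unchanged for existing callers */
	events_wq = create_workqueue("example_events");

	/* same as create_freezeable_workqueue("example_frozen") */
	frozen_wq = __create_workqueue("example_frozen",
				       WQ_FREEZEABLE | WQ_SINGLE_THREAD);

	if (!events_wq || !frozen_wq)
		return -ENOMEM;
	return 0;
}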

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/workqueue.h |   25 +++++++++++++++----------
 kernel/workqueue.c        |   17 +++++++----------
 2 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index ac06c55..495572a 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -184,13 +184,17 @@ static inline bool work_static(struct work_struct *work) { return false; }
 #define work_clear_pending(work) \
 	clear_bit(WORK_STRUCT_PENDING, work_data_bits(work))
 
+enum {
+	WQ_FREEZEABLE		= 1 << 0, /* freeze during suspend */
+	WQ_SINGLE_THREAD	= 1 << 1, /* no per-cpu worker */
+};
 
 extern struct workqueue_struct *
-__create_workqueue_key(const char *name, int singlethread, int freezeable,
+__create_workqueue_key(const char *name, unsigned int flags,
 		       struct lock_class_key *key, const char *lock_name);
 
 #ifdef CONFIG_LOCKDEP
-#define __create_workqueue(name, singlethread, freezeable)	\
+#define __create_workqueue(name, flags)				\
 ({								\
 	static struct lock_class_key __key;			\
 	const char *__lock_name;				\
@@ -200,19 +204,20 @@ __create_workqueue_key(const char *name, int singlethread, int freezeable,
 	else							\
 		__lock_name = #name;				\
 								\
-	__create_workqueue_key((name), (singlethread),		\
-			       (freezeable), &__key,		\
+	__create_workqueue_key((name), (flags), &__key,		\
 			       __lock_name);			\
 })
 #else
-#define __create_workqueue(name, singlethread, freezeable)	\
-	__create_workqueue_key((name), (singlethread), (freezeable), \
-			       NULL, NULL)
+#define __create_workqueue(name, flags)				\
+	__create_workqueue_key((name), (flags), NULL, NULL)
 #endif
 
-#define create_workqueue(name) __create_workqueue((name), 0, 0)
-#define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1)
-#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0)
+#define create_workqueue(name)					\
+	__create_workqueue((name), 0)
+#define create_freezeable_workqueue(name)			\
+	__create_workqueue((name), WQ_FREEZEABLE | WQ_SINGLE_THREAD)
+#define create_singlethread_workqueue(name)			\
+	__create_workqueue((name), WQ_SINGLE_THREAD)
 
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e16c457..579041f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -67,11 +67,10 @@ struct cpu_workqueue_struct {
  * per-CPU workqueues:
  */
 struct workqueue_struct {
+	unsigned int		flags;		/* I: WQ_* flags */
 	struct cpu_workqueue_struct *cpu_wq;	/* I: cwq's */
 	struct list_head	list;		/* W: list of all workqueues */
 	const char		*name;		/* I: workqueue name */
-	int singlethread;
-	int freezeable;		/* Freeze threads during suspend */
 #ifdef CONFIG_LOCKDEP
 	struct lockdep_map	lockdep_map;
 #endif
@@ -203,9 +202,9 @@ static const struct cpumask *cpu_singlethread_map __read_mostly;
 static cpumask_var_t cpu_populated_map __read_mostly;
 
 /* If it's single threaded, it isn't in the list of workqueues. */
-static inline int is_wq_single_threaded(struct workqueue_struct *wq)
+static inline bool is_wq_single_threaded(struct workqueue_struct *wq)
 {
-	return wq->singlethread;
+	return wq->flags & WQ_SINGLE_THREAD;
 }
 
 static const struct cpumask *wq_cpu_map(struct workqueue_struct *wq)
@@ -456,7 +455,7 @@ static int worker_thread(void *__cwq)
 	struct cpu_workqueue_struct *cwq = __cwq;
 	DEFINE_WAIT(wait);
 
-	if (cwq->wq->freezeable)
+	if (cwq->wq->flags & WQ_FREEZEABLE)
 		set_freezable();
 
 	for (;;) {
@@ -981,8 +980,7 @@ static void start_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
 }
 
 struct workqueue_struct *__create_workqueue_key(const char *name,
-						int singlethread,
-						int freezeable,
+						unsigned int flags,
 						struct lock_class_key *key,
 						const char *lock_name)
 {
@@ -998,13 +996,12 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	if (!wq->cpu_wq)
 		goto err;
 
+	wq->flags = flags;
 	wq->name = name;
 	lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
-	wq->singlethread = singlethread;
-	wq->freezeable = freezeable;
 	INIT_LIST_HEAD(&wq->list);
 
-	if (singlethread) {
+	if (flags & WQ_SINGLE_THREAD) {
 		cwq = init_cpu_workqueue(wq, singlethread_cpu);
 		err = create_workqueue_thread(cwq, singlethread_cpu);
 		start_workqueue_thread(cwq, -1);
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 11/27] workqueue: define both bit position and mask for work flags
  2009-12-18 12:57 Tejun Heo
                   ` (9 preceding siblings ...)
  2009-12-18 12:57 ` [PATCH 10/27] workqueue: merge feature parameters into flags Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 12/27] workqueue: separate out process_one_work() Tejun Heo
                   ` (18 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

Work flags are about to see more traditional mask handling.  Define
WORK_STRUCT_*_BIT as the bit position constants and redefine
WORK_STRUCT_* as bit masks.

While at it, re-define these constants as enums and use
WORK_STRUCT_STATIC instead of hard-coding 2 in
WORK_DATA_STATIC_INIT().
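
To illustrate the intended usage split (a sketch based on the
definitions below, not new API):

/* bit positions feed the atomic bitops */
static bool example_claim(struct work_struct *work)
{
	return !test_and_set_bit(WORK_STRUCT_PENDING_BIT,
				 work_data_bits(work));
}

/* masks are for composing and testing the flags word directly */
static bool example_is_pending(struct work_struct *work)
{
	return *work_data_bits(work) & WORK_STRUCT_PENDING;
}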

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/workqueue.h |   23 +++++++++++++++--------
 kernel/workqueue.c        |   14 +++++++-------
 2 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 495572a..e51b5dc 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -22,12 +22,19 @@ typedef void (*work_func_t)(struct work_struct *work);
  */
 #define work_data_bits(work) ((unsigned long *)(&(work)->data))
 
+enum {
+	WORK_STRUCT_PENDING_BIT	= 0,	/* work item is pending execution */
+	WORK_STRUCT_STATIC_BIT	= 1,	/* static initializer (debugobjects) */
+
+	WORK_STRUCT_PENDING	= 1 << WORK_STRUCT_PENDING_BIT,
+	WORK_STRUCT_STATIC	= 1 << WORK_STRUCT_STATIC_BIT,
+
+	WORK_STRUCT_FLAG_MASK	= 3UL,
+	WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
+};
+
 struct work_struct {
 	atomic_long_t data;
-#define WORK_STRUCT_PENDING 0		/* T if work item pending execution */
-#define WORK_STRUCT_STATIC  1		/* static initializer (debugobjects) */
-#define WORK_STRUCT_FLAG_MASK (3UL)
-#define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
 	struct list_head entry;
 	work_func_t func;
 #ifdef CONFIG_LOCKDEP
@@ -36,7 +43,7 @@ struct work_struct {
 };
 
 #define WORK_DATA_INIT()	ATOMIC_LONG_INIT(0)
-#define WORK_DATA_STATIC_INIT()	ATOMIC_LONG_INIT(2)
+#define WORK_DATA_STATIC_INIT()	ATOMIC_LONG_INIT(WORK_STRUCT_STATIC)
 
 struct delayed_work {
 	struct work_struct work;
@@ -98,7 +105,7 @@ extern void __init_work(struct work_struct *work, int onstack);
 extern void destroy_work_on_stack(struct work_struct *work);
 static inline bool work_static(struct work_struct *work)
 {
-	return test_bit(WORK_STRUCT_STATIC, work_data_bits(work));
+	return test_bit(WORK_STRUCT_STATIC_BIT, work_data_bits(work));
 }
 #else
 static inline void __init_work(struct work_struct *work, int onstack) { }
@@ -167,7 +174,7 @@ static inline bool work_static(struct work_struct *work) { return false; }
  * @work: The work item in question
  */
 #define work_pending(work) \
-	test_bit(WORK_STRUCT_PENDING, work_data_bits(work))
+	test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))
 
 /**
  * delayed_work_pending - Find out whether a delayable work item is currently
@@ -182,7 +189,7 @@ static inline bool work_static(struct work_struct *work) { return false; }
  * @work: The work item in question
  */
 #define work_clear_pending(work) \
-	clear_bit(WORK_STRUCT_PENDING, work_data_bits(work))
+	clear_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))
 
 enum {
 	WQ_FREEZEABLE		= 1 << 0, /* freeze during suspend */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 579041f..f8e4d67 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -115,7 +115,7 @@ static int work_fixup_activate(void *addr, enum debug_obj_state state)
 		 * statically initialized. We just make sure that it
 		 * is tracked in the object tracker.
 		 */
-		if (test_bit(WORK_STRUCT_STATIC, work_data_bits(work))) {
+		if (test_bit(WORK_STRUCT_STATIC_BIT, work_data_bits(work))) {
 			debug_object_init(work, &work_debug_descr);
 			debug_object_activate(work, &work_debug_descr);
 			return 0;
@@ -232,8 +232,8 @@ static inline void set_wq_data(struct work_struct *work,
 	BUG_ON(!work_pending(work));
 
 	atomic_long_set(&work->data, (unsigned long)cwq |
-			(work_static(work) ? (1UL << WORK_STRUCT_STATIC) : 0) |
-			(1UL << WORK_STRUCT_PENDING) | extra_flags);
+			(work_static(work) ? WORK_STRUCT_STATIC : 0) |
+			WORK_STRUCT_PENDING | extra_flags);
 }
 
 static inline
@@ -323,7 +323,7 @@ queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
 {
 	int ret = 0;
 
-	if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
 		__queue_work(cpu, wq, work);
 		ret = 1;
 	}
@@ -373,7 +373,7 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 	struct timer_list *timer = &dwork->timer;
 	struct work_struct *work = &dwork->work;
 
-	if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
 		BUG_ON(timer_pending(timer));
 		BUG_ON(!list_empty(&work->entry));
 
@@ -509,7 +509,7 @@ static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
 	 * might deadlock.
 	 */
 	INIT_WORK_ON_STACK(&barr->work, wq_barrier_func);
-	__set_bit(WORK_STRUCT_PENDING, work_data_bits(&barr->work));
+	__set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&barr->work));
 	init_completion(&barr->done);
 
 	debug_work_activate(&barr->work);
@@ -621,7 +621,7 @@ static int try_to_grab_pending(struct work_struct *work)
 	struct cpu_workqueue_struct *cwq;
 	int ret = -1;
 
-	if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work)))
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
 		return 0;
 
 	/*
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 12/27] workqueue: separate out process_one_work()
  2009-12-18 12:57 Tejun Heo
                   ` (10 preceding siblings ...)
  2009-12-18 12:57 ` [PATCH 11/27] workqueue: define both bit position and mask for work flags Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 13/27] workqueue: temporarily disable workqueue tracing Tejun Heo
                   ` (17 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

Separate process_one_work() out of run_workqueue().  This patch
doesn't cause any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c |  100 +++++++++++++++++++++++++++++++--------------------
 1 files changed, 61 insertions(+), 39 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f8e4d67..aa1d680 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -395,51 +395,73 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL_GPL(queue_delayed_work_on);
 
+/**
+ * process_one_work - process single work
+ * @cwq: cwq to process work for
+ * @work: work to process
+ *
+ * Process @work.  This function contains all the logics necessary to
+ * process a single work including synchronization against and
+ * interaction with other workers on the same cpu, queueing and
+ * flushing.  As long as context requirement is met, any worker can
+ * call this function to process a work.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock) which is released and regrabbed.
+ */
+static void process_one_work(struct cpu_workqueue_struct *cwq,
+			     struct work_struct *work)
+{
+	work_func_t f = work->func;
+#ifdef CONFIG_LOCKDEP
+	/*
+	 * It is permissible to free the struct work_struct from
+	 * inside the function that is called from it, this we need to
+	 * take into account for lockdep too.  To avoid bogus "held
+	 * lock freed" warnings as well as problems when looking into
+	 * work->lockdep_map, make a copy and use that here.
+	 */
+	struct lockdep_map lockdep_map = work->lockdep_map;
+#endif
+	/* claim and process */
+	trace_workqueue_execution(cwq->thread, work);
+	debug_work_deactivate(work);
+	cwq->current_work = work;
+	list_del_init(&work->entry);
+
+	spin_unlock_irq(&cwq->lock);
+
+	BUG_ON(get_wq_data(work) != cwq);
+	work_clear_pending(work);
+	lock_map_acquire(&cwq->wq->lockdep_map);
+	lock_map_acquire(&lockdep_map);
+	f(work);
+	lock_map_release(&lockdep_map);
+	lock_map_release(&cwq->wq->lockdep_map);
+
+	if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
+		printk(KERN_ERR "BUG: workqueue leaked lock or atomic: "
+		       "%s/0x%08x/%d\n",
+		       current->comm, preempt_count(), task_pid_nr(current));
+		printk(KERN_ERR "    last function: ");
+		print_symbol("%s\n", (unsigned long)f);
+		debug_show_held_locks(current);
+		dump_stack();
+	}
+
+	spin_lock_irq(&cwq->lock);
+
+	/* we're done with it, release */
+	cwq->current_work = NULL;
+}
+
 static void run_workqueue(struct cpu_workqueue_struct *cwq)
 {
 	spin_lock_irq(&cwq->lock);
 	while (!list_empty(&cwq->worklist)) {
 		struct work_struct *work = list_entry(cwq->worklist.next,
 						struct work_struct, entry);
-		work_func_t f = work->func;
-#ifdef CONFIG_LOCKDEP
-		/*
-		 * It is permissible to free the struct work_struct
-		 * from inside the function that is called from it,
-		 * this we need to take into account for lockdep too.
-		 * To avoid bogus "held lock freed" warnings as well
-		 * as problems when looking into work->lockdep_map,
-		 * make a copy and use that here.
-		 */
-		struct lockdep_map lockdep_map = work->lockdep_map;
-#endif
-		trace_workqueue_execution(cwq->thread, work);
-		debug_work_deactivate(work);
-		cwq->current_work = work;
-		list_del_init(cwq->worklist.next);
-		spin_unlock_irq(&cwq->lock);
-
-		BUG_ON(get_wq_data(work) != cwq);
-		work_clear_pending(work);
-		lock_map_acquire(&cwq->wq->lockdep_map);
-		lock_map_acquire(&lockdep_map);
-		f(work);
-		lock_map_release(&lockdep_map);
-		lock_map_release(&cwq->wq->lockdep_map);
-
-		if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
-			printk(KERN_ERR "BUG: workqueue leaked lock or atomic: "
-					"%s/0x%08x/%d\n",
-					current->comm, preempt_count(),
-				       	task_pid_nr(current));
-			printk(KERN_ERR "    last function: ");
-			print_symbol("%s\n", (unsigned long)f);
-			debug_show_held_locks(current);
-			dump_stack();
-		}
-
-		spin_lock_irq(&cwq->lock);
-		cwq->current_work = NULL;
+		process_one_work(cwq, work);
 	}
 	spin_unlock_irq(&cwq->lock);
 }
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 13/27] workqueue: temporarily disable workqueue tracing
  2009-12-18 12:57 Tejun Heo
                   ` (11 preceding siblings ...)
  2009-12-18 12:57 ` [PATCH 12/27] workqueue: separate out process_one_work() Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 14/27] workqueue: kill cpu_populated_map Tejun Heo
                   ` (16 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

Strip tracing code from workqueue and disable workqueue tracing.  This
is a temporary measure until concurrency managed workqueue is complete.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/trace/Kconfig |    4 +++-
 kernel/workqueue.c   |   14 +++-----------
 2 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index d006554..de0bfeb 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -414,7 +414,9 @@ config KMEMTRACE
 	  If unsure, say N.
 
 config WORKQUEUE_TRACER
-	bool "Trace workqueues"
+# Temporarily disabled during workqueue reimplementation
+#	bool "Trace workqueues"
+	def_bool n
 	select GENERIC_TRACER
 	help
 	  The workqueue tracer provides some statistical informations
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index aa1d680..a48a9b8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -33,8 +33,6 @@
 #include <linux/kallsyms.h>
 #include <linux/debug_locks.h>
 #include <linux/lockdep.h>
-#define CREATE_TRACE_POINTS
-#include <trace/events/workqueue.h>
 
 /*
  * Structure fields follow one of the following exclusion rules.
@@ -236,10 +234,10 @@ static inline void set_wq_data(struct work_struct *work,
 			WORK_STRUCT_PENDING | extra_flags);
 }
 
-static inline
-struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
+static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
 {
-	return (void *) (atomic_long_read(&work->data) & WORK_STRUCT_WQ_DATA_MASK);
+	return (void *)(atomic_long_read(&work->data) &
+			WORK_STRUCT_WQ_DATA_MASK);
 }
 
 /**
@@ -258,8 +256,6 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
 			struct work_struct *work, struct list_head *head,
 			unsigned int extra_flags)
 {
-	trace_workqueue_insertion(cwq->thread, work);
-
 	/* we own @work, set data and link */
 	set_wq_data(work, cwq, extra_flags);
 
@@ -424,7 +420,6 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
 	struct lockdep_map lockdep_map = work->lockdep_map;
 #endif
 	/* claim and process */
-	trace_workqueue_execution(cwq->thread, work);
 	debug_work_deactivate(work);
 	cwq->current_work = work;
 	list_del_init(&work->entry);
@@ -985,8 +980,6 @@ static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
 		return PTR_ERR(p);
 	cwq->thread = p;
 
-	trace_workqueue_creation(cwq->thread, cpu);
-
 	return 0;
 }
 
@@ -1091,7 +1084,6 @@ static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq)
 	 * checks list_empty(), and a "normal" queue_work() can't use
 	 * a dead CPU.
 	 */
-	trace_workqueue_destruction(cwq->thread);
 	kthread_stop(cwq->thread);
 	cwq->thread = NULL;
 }
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 14/27] workqueue: kill cpu_populated_map
  2009-12-18 12:57 Tejun Heo
                   ` (12 preceding siblings ...)
  2009-12-18 12:57 ` [PATCH 13/27] workqueue: temporarily disable workqueue tracing Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 15/27] workqueue: update cwq alignment Tejun Heo
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

Worker management is about to be overhauled.  Simplify things by
removing cpu_populated_map, creating workers for all possible cpus and
making single threaded workqueues behave more like multi threaded
ones.

After this patch, all cwqs are always initialized, all workqueues are
linked on the workqueues list and workers for all possible cpus
always exist.  This also makes CPU hotplug support simpler - rebinding
workers on CPU_ONLINE and flushing on CPU_POST_DEAD are enough.

While at it, make get_cwq() always return the cwq for the specified
cpu, add target_cwq() for cases where single thread distinction is
necessary and drop all direct usage of per_cpu_ptr() on wq->cpu_wq.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c |  165 +++++++++++++++++-----------------------------------
 1 files changed, 54 insertions(+), 111 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a48a9b8..d29e069 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -189,34 +189,19 @@ static DEFINE_SPINLOCK(workqueue_lock);
 static LIST_HEAD(workqueues);
 
 static int singlethread_cpu __read_mostly;
-static const struct cpumask *cpu_singlethread_map __read_mostly;
-/*
- * _cpu_down() first removes CPU from cpu_online_map, then CPU_DEAD
- * flushes cwq->worklist. This means that flush_workqueue/wait_on_work
- * which comes in between can't use for_each_online_cpu(). We could
- * use cpu_possible_map, the cpumask below is more a documentation
- * than optimization.
- */
-static cpumask_var_t cpu_populated_map __read_mostly;
-
-/* If it's single threaded, it isn't in the list of workqueues. */
-static inline bool is_wq_single_threaded(struct workqueue_struct *wq)
-{
-	return wq->flags & WQ_SINGLE_THREAD;
-}
 
-static const struct cpumask *wq_cpu_map(struct workqueue_struct *wq)
+static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
+					    struct workqueue_struct *wq)
 {
-	return is_wq_single_threaded(wq)
-		? cpu_singlethread_map : cpu_populated_map;
+	return per_cpu_ptr(wq->cpu_wq, cpu);
 }
 
-static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
-					    struct workqueue_struct *wq)
+static struct cpu_workqueue_struct *target_cwq(unsigned int cpu,
+					       struct workqueue_struct *wq)
 {
-	if (unlikely(is_wq_single_threaded(wq)))
+	if (unlikely(wq->flags & WQ_SINGLE_THREAD))
 		cpu = singlethread_cpu;
-	return per_cpu_ptr(wq->cpu_wq, cpu);
+	return get_cwq(cpu, wq);
 }
 
 /*
@@ -272,7 +257,7 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
 static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 			 struct work_struct *work)
 {
-	struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+	struct cpu_workqueue_struct *cwq = target_cwq(cpu, wq);
 	unsigned long flags;
 
 	debug_work_activate(work);
@@ -376,7 +361,7 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 		timer_stats_timer_set_start_info(&dwork->timer);
 
 		/* This stores cwq for the moment, for the timer_fn */
-		set_wq_data(work, get_cwq(raw_smp_processor_id(), wq), 0);
+		set_wq_data(work, target_cwq(raw_smp_processor_id(), wq), 0);
 		timer->expires = jiffies + delay;
 		timer->data = (unsigned long)dwork;
 		timer->function = delayed_work_timer_fn;
@@ -567,14 +552,13 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
  */
 void flush_workqueue(struct workqueue_struct *wq)
 {
-	const struct cpumask *cpu_map = wq_cpu_map(wq);
 	int cpu;
 
 	might_sleep();
 	lock_map_acquire(&wq->lockdep_map);
 	lock_map_release(&wq->lockdep_map);
-	for_each_cpu(cpu, cpu_map)
-		flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
+	for_each_possible_cpu(cpu)
+		flush_cpu_workqueue(get_cwq(cpu, wq));
 }
 EXPORT_SYMBOL_GPL(flush_workqueue);
 
@@ -692,7 +676,6 @@ static void wait_on_work(struct work_struct *work)
 {
 	struct cpu_workqueue_struct *cwq;
 	struct workqueue_struct *wq;
-	const struct cpumask *cpu_map;
 	int cpu;
 
 	might_sleep();
@@ -705,9 +688,8 @@ static void wait_on_work(struct work_struct *work)
 		return;
 
 	wq = cwq->wq;
-	cpu_map = wq_cpu_map(wq);
 
-	for_each_cpu(cpu, cpu_map)
+	for_each_possible_cpu(cpu)
 		wait_on_cpu_work(get_cwq(cpu, wq), work);
 }
 
@@ -940,7 +922,7 @@ int current_is_keventd(void)
 
 	BUG_ON(!keventd_wq);
 
-	cwq = per_cpu_ptr(keventd_wq->cpu_wq, cpu);
+	cwq = get_cwq(cpu, keventd_wq);
 	if (current == cwq->thread)
 		ret = 1;
 
@@ -948,26 +930,12 @@ int current_is_keventd(void)
 
 }
 
-static struct cpu_workqueue_struct *
-init_cpu_workqueue(struct workqueue_struct *wq, int cpu)
-{
-	struct cpu_workqueue_struct *cwq = per_cpu_ptr(wq->cpu_wq, cpu);
-
-	cwq->wq = wq;
-	spin_lock_init(&cwq->lock);
-	INIT_LIST_HEAD(&cwq->worklist);
-	init_waitqueue_head(&cwq->more_work);
-
-	return cwq;
-}
-
 static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
 {
 	struct workqueue_struct *wq = cwq->wq;
-	const char *fmt = is_wq_single_threaded(wq) ? "%s" : "%s/%d";
 	struct task_struct *p;
 
-	p = kthread_create(worker_thread, cwq, fmt, wq->name, cpu);
+	p = kthread_create(worker_thread, cwq, "%s/%d", wq->name, cpu);
 	/*
 	 * Nobody can add the work_struct to this cwq,
 	 *	if (caller is __create_workqueue)
@@ -999,8 +967,8 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 						struct lock_class_key *key,
 						const char *lock_name)
 {
+	bool singlethread = flags & WQ_SINGLE_THREAD;
 	struct workqueue_struct *wq;
-	struct cpu_workqueue_struct *cwq;
 	int err = 0, cpu;
 
 	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
@@ -1016,41 +984,40 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
 	INIT_LIST_HEAD(&wq->list);
 
-	if (flags & WQ_SINGLE_THREAD) {
-		cwq = init_cpu_workqueue(wq, singlethread_cpu);
-		err = create_workqueue_thread(cwq, singlethread_cpu);
-		start_workqueue_thread(cwq, -1);
-	} else {
-		cpu_maps_update_begin();
-		/*
-		 * We must place this wq on list even if the code below fails.
-		 * cpu_down(cpu) can remove cpu from cpu_populated_map before
-		 * destroy_workqueue() takes the lock, in that case we leak
-		 * cwq[cpu]->thread.
-		 */
-		spin_lock(&workqueue_lock);
-		list_add(&wq->list, &workqueues);
-		spin_unlock(&workqueue_lock);
-		/*
-		 * We must initialize cwqs for each possible cpu even if we
-		 * are going to call destroy_workqueue() finally. Otherwise
-		 * cpu_up() can hit the uninitialized cwq once we drop the
-		 * lock.
-		 */
-		for_each_possible_cpu(cpu) {
-			cwq = init_cpu_workqueue(wq, cpu);
-			if (err || !cpu_online(cpu))
-				continue;
-			err = create_workqueue_thread(cwq, cpu);
+	cpu_maps_update_begin();
+	/*
+	 * We must initialize cwqs for each possible cpu even if we
+	 * are going to call destroy_workqueue() finally. Otherwise
+	 * cpu_up() can hit the uninitialized cwq once we drop the
+	 * lock.
+	 */
+	for_each_possible_cpu(cpu) {
+		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+		cwq->wq = wq;
+		spin_lock_init(&cwq->lock);
+		INIT_LIST_HEAD(&cwq->worklist);
+		init_waitqueue_head(&cwq->more_work);
+
+		if (err)
+			continue;
+		err = create_workqueue_thread(cwq, cpu);
+		if (cpu_online(cpu) && !singlethread)
 			start_workqueue_thread(cwq, cpu);
-		}
-		cpu_maps_update_done();
+		else
+			start_workqueue_thread(cwq, -1);
 	}
+	cpu_maps_update_done();
 
 	if (err) {
 		destroy_workqueue(wq);
 		wq = NULL;
 	}
+
+	spin_lock(&workqueue_lock);
+	list_add(&wq->list, &workqueues);
+	spin_unlock(&workqueue_lock);
+
 	return wq;
 err:
 	if (wq) {
@@ -1096,17 +1063,14 @@ static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq)
  */
 void destroy_workqueue(struct workqueue_struct *wq)
 {
-	const struct cpumask *cpu_map = wq_cpu_map(wq);
 	int cpu;
 
-	cpu_maps_update_begin();
 	spin_lock(&workqueue_lock);
 	list_del(&wq->list);
 	spin_unlock(&workqueue_lock);
 
-	for_each_cpu(cpu, cpu_map)
-		cleanup_workqueue_thread(per_cpu_ptr(wq->cpu_wq, cpu));
- 	cpu_maps_update_done();
+	for_each_possible_cpu(cpu)
+		cleanup_workqueue_thread(get_cwq(cpu, wq));
 
 	free_percpu(wq->cpu_wq);
 	kfree(wq);
@@ -1120,47 +1084,30 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
 	unsigned int cpu = (unsigned long)hcpu;
 	struct cpu_workqueue_struct *cwq;
 	struct workqueue_struct *wq;
-	int ret = NOTIFY_OK;
 
 	action &= ~CPU_TASKS_FROZEN;
 
-	switch (action) {
-	case CPU_UP_PREPARE:
-		cpumask_set_cpu(cpu, cpu_populated_map);
-	}
-undo:
 	list_for_each_entry(wq, &workqueues, list) {
-		cwq = per_cpu_ptr(wq->cpu_wq, cpu);
+		if (wq->flags & WQ_SINGLE_THREAD)
+			continue;
 
-		switch (action) {
-		case CPU_UP_PREPARE:
-			if (!create_workqueue_thread(cwq, cpu))
-				break;
-			printk(KERN_ERR "workqueue [%s] for %i failed\n",
-				wq->name, cpu);
-			action = CPU_UP_CANCELED;
-			ret = NOTIFY_BAD;
-			goto undo;
+		cwq = get_cwq(cpu, wq);
 
+		switch (action) {
 		case CPU_ONLINE:
-			start_workqueue_thread(cwq, cpu);
+			__set_cpus_allowed(cwq->thread, get_cpu_mask(cpu),
+					   true);
 			break;
 
-		case CPU_UP_CANCELED:
-			start_workqueue_thread(cwq, -1);
 		case CPU_POST_DEAD:
-			cleanup_workqueue_thread(cwq);
+			lock_map_acquire(&cwq->wq->lockdep_map);
+			lock_map_release(&cwq->wq->lockdep_map);
+			flush_cpu_workqueue(cwq);
 			break;
 		}
 	}
 
-	switch (action) {
-	case CPU_UP_CANCELED:
-	case CPU_POST_DEAD:
-		cpumask_clear_cpu(cpu, cpu_populated_map);
-	}
-
-	return ret;
+	return NOTIFY_OK;
 }
 
 #ifdef CONFIG_SMP
@@ -1212,11 +1159,7 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
 
 void __init init_workqueues(void)
 {
-	alloc_cpumask_var(&cpu_populated_map, GFP_KERNEL);
-
-	cpumask_copy(cpu_populated_map, cpu_online_mask);
 	singlethread_cpu = cpumask_first(cpu_possible_mask);
-	cpu_singlethread_map = cpumask_of(singlethread_cpu);
 	hotcpu_notifier(workqueue_cpu_callback, 0);
 	keventd_wq = create_workqueue("events");
 	BUG_ON(!keventd_wq);
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 15/27] workqueue: update cwq alignment
  2009-12-18 12:57 Tejun Heo
                   ` (13 preceding siblings ...)
  2009-12-18 12:57 ` [PATCH 14/27] workqueue: kill cpu_populated_map Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 16/27] workqueue: reimplement workqueue flushing using color coded works Tejun Heo
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

The work->data field is used for two purposes.  It points to the cwq
the work is queued on and the lower bits are used for flags.
Currently, two bits are reserved, which is always safe as 4-byte
alignment is guaranteed on every architecture.  However, future
changes will need more flag bits.

On SMP, the percpu allocator is capable of honoring larger alignment
(there are other users which depend on it) and larger alignment works
just fine.  On UP, the percpu allocator is a thin wrapper around
kzalloc/kfree() and doesn't honor the alignment request.

This patch introduces WORK_STRUCT_FLAG_BITS and implements
alloc/free_cwqs() which guarantee (1 << WORK_STRUCT_FLAG_BITS)
alignment both on SMP and UP.  On SMP, simply wrapping the percpu
allocator is enough.  On UP, extra space is allocated so that the cwq
can be aligned and the original pointer can be stored after it, which
is used in the free path.

While at it, as cwqs are now force-aligned, make sure the resulting
alignment is at least that of long long.

The alignment problem on UP was reported by Michal Simek.
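
As a rough userspace sketch of the UP fallback described above (not
part of the patch itself): over-allocate, align the pointer by hand
and stash the original allocation right behind the aligned object so
the free path can find it again.  struct fake_cwq and the helper names
below are invented for illustration, and 7 flag bits is just an
example value; the mechanism is the same for any alignment.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define FLAG_BITS	7			/* example value only */
#define CWQ_ALIGN	(1UL << FLAG_BITS)

struct fake_cwq {
	int	work_color;
	int	flush_color;
};

static struct fake_cwq *alloc_aligned_cwq(void)
{
	size_t size = sizeof(struct fake_cwq);
	void *ptr, *aligned;

	/* room for the object, worst-case padding and the stashed pointer */
	ptr = malloc(size + CWQ_ALIGN + sizeof(void *));
	if (!ptr)
		return NULL;

	aligned = (void *)(((uintptr_t)ptr + CWQ_ALIGN - 1) & ~(CWQ_ALIGN - 1));
	/* remember the original allocation right after the aligned object */
	*(void **)((char *)aligned + size) = ptr;
	return aligned;
}

static void free_aligned_cwq(struct fake_cwq *cwq)
{
	/* the free path digs the original pointer back out */
	if (cwq)
		free(*(void **)((char *)cwq + sizeof(*cwq)));
}

int main(void)
{
	struct fake_cwq *cwq = alloc_aligned_cwq();

	assert(((uintptr_t)cwq & (CWQ_ALIGN - 1)) == 0);
	printf("cwq at %p, low %d bits free for flags\n", (void *)cwq, FLAG_BITS);
	free_aligned_cwq(cwq);
	return 0;
}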

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Reported-by: Michal Simek <michal.simek@petalogix.com>
---
 include/linux/workqueue.h |    4 ++-
 kernel/workqueue.c        |   59 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 57 insertions(+), 6 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index e51b5dc..011738a 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -29,7 +29,9 @@ enum {
 	WORK_STRUCT_PENDING	= 1 << WORK_STRUCT_PENDING_BIT,
 	WORK_STRUCT_STATIC	= 1 << WORK_STRUCT_STATIC_BIT,
 
-	WORK_STRUCT_FLAG_MASK	= 3UL,
+	WORK_STRUCT_FLAG_BITS	= 2,
+
+	WORK_STRUCT_FLAG_MASK	= (1UL << WORK_STRUCT_FLAG_BITS) - 1,
 	WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
 };
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d29e069..14edfcd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -46,7 +46,9 @@
 
 /*
  * The per-CPU workqueue (if single thread, we always use the first
- * possible cpu).
+ * possible cpu).  The lower WORK_STRUCT_FLAG_BITS of
+ * work_struct->data are used for flags and thus cwqs need to be
+ * aligned at two's power of the number of flag bits.
  */
 struct cpu_workqueue_struct {
 
@@ -58,7 +60,7 @@ struct cpu_workqueue_struct {
 
 	struct workqueue_struct *wq;		/* I: the owning workqueue */
 	struct task_struct	*thread;
-} ____cacheline_aligned;
+};
 
 /*
  * The externally visible workqueue abstraction is an array of
@@ -930,6 +932,44 @@ int current_is_keventd(void)
 
 }
 
+static struct cpu_workqueue_struct *alloc_cwqs(void)
+{
+	const size_t size = sizeof(struct cpu_workqueue_struct);
+	const size_t align = 1 << WORK_STRUCT_FLAG_BITS;
+	struct cpu_workqueue_struct *cwqs;
+#ifndef CONFIG_SMP
+	void *ptr;
+
+	/*
+	 * On UP, percpu allocator doesn't honor alignment parameter
+	 * and simply uses arch-dependent default.  Allocate enough
+	 * room to align cwq and put an extra pointer at the end
+	 * pointing back to the originally allocated pointer which
+	 * will be used for free.
+	 */
+	ptr = __alloc_percpu(size + align + sizeof(void *), 1);
+	cwqs = PTR_ALIGN(ptr, align);
+	*(void **)per_cpu_ptr(cwqs + 1, 0) = ptr;
+#else
+	/* On SMP, percpu allocator can do it itself */
+	cwqs = __alloc_percpu(size, align);
+#endif
+	/* just in case, make sure it's actually aligned */
+	BUG_ON(!IS_ALIGNED((unsigned long)cwqs, align));
+	return cwqs;
+}
+
+static void free_cwqs(struct cpu_workqueue_struct *cwqs)
+{
+#ifndef CONFIG_SMP
+	/* on UP, the pointer to free is stored right after the cwq */
+	if (cwqs)
+		free_percpu(*(void **)per_cpu_ptr(cwqs + 1, 0));
+#else
+	free_percpu(cwqs);
+#endif
+}
+
 static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
 {
 	struct workqueue_struct *wq = cwq->wq;
@@ -975,7 +1015,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	if (!wq)
 		goto err;
 
-	wq->cpu_wq = alloc_percpu(struct cpu_workqueue_struct);
+	wq->cpu_wq = alloc_cwqs();
 	if (!wq->cpu_wq)
 		goto err;
 
@@ -994,6 +1034,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	for_each_possible_cpu(cpu) {
 		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 
+		BUG_ON((unsigned long)cwq & WORK_STRUCT_FLAG_MASK);
 		cwq->wq = wq;
 		spin_lock_init(&cwq->lock);
 		INIT_LIST_HEAD(&cwq->worklist);
@@ -1021,7 +1062,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	return wq;
 err:
 	if (wq) {
-		free_percpu(wq->cpu_wq);
+		free_cwqs(wq->cpu_wq);
 		kfree(wq);
 	}
 	return NULL;
@@ -1072,7 +1113,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
 	for_each_possible_cpu(cpu)
 		cleanup_workqueue_thread(get_cwq(cpu, wq));
 
-	free_percpu(wq->cpu_wq);
+	free_cwqs(wq->cpu_wq);
 	kfree(wq);
 }
 EXPORT_SYMBOL_GPL(destroy_workqueue);
@@ -1159,6 +1200,14 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
 
 void __init init_workqueues(void)
 {
+	/*
+	 * cwqs are forced aligned according to WORK_STRUCT_FLAG_BITS.
+	 * Make sure that the alignment isn't lower than that of
+	 * unsigned long long.
+	 */
+	BUILD_BUG_ON(__alignof__(struct cpu_workqueue_struct) <
+		     __alignof__(unsigned long long));
+
 	singlethread_cpu = cpumask_first(cpu_possible_mask);
 	hotcpu_notifier(workqueue_cpu_callback, 0);
 	keventd_wq = create_workqueue("events");
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 16/27] workqueue: reimplement workqueue flushing using color coded works
  2009-12-18 12:57 Tejun Heo
                   ` (14 preceding siblings ...)
  2009-12-18 12:57 ` [PATCH 15/27] workqueue: update cwq alignment Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 17/27] workqueue: introduce worker Tejun Heo
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

Reimplement workqueue flushing using color coded works.  wq has the
current work color which is painted on the works being issued via
cwqs.  Flushing a workqueue is achieved by advancing the current work
colors of cwqs and waiting for all the works which have any of the
previous colors to drain.

Currently there are 16 possible colors; one is reserved for no color
and 15 colors are usable, allowing 14 concurrent flushes.  When the
color space gets full, flush attempts are batched up and processed
together when a color frees up, so even with many concurrent flushers,
the new implementation won't build up a huge queue of flushers which
has to be processed one after another.

Only works which are queued via __queue_work() are colored.  Works
which are directly put on a queue using insert_work() use NO_COLOR and
don't participate in workqueue flushing.  Currently, only works used
for work-specific flushes fall into this category.

This new implementation leaves only cleanup_workqueue_thread() as the
user of flush_cpu_workqueue().  Just make its users use
flush_workqueue() and kthread_stop() directly and kill
cleanup_workqueue_thread().  As workqueue flushing doesn't use barrier
requests anymore, the comment describing the complex synchronization
around it in cleanup_workqueue_thread() is removed together with the
function.

This new implementation is to allow having and sharing multiple
workers per cpu.

Please note that one more bit is reserved for a future work flag by
this patch.  This is to avoid shifting bits and updating comments
later.
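
A standalone sketch of the color arithmetic described above (plain
userspace code, not part of the patch): it mirrors the constants this
patch adds, 4 color bits with the last value reserved as no-color, and
shows how a color is packed into the flag bits and how the color
counter wraps without ever producing NO_COLOR.  The shortened names
below are for illustration only.

#include <stdio.h>

enum {
	COLOR_SHIFT	= 3,			/* flag bits below the color */
	COLOR_BITS	= 4,
	NR_COLORS	= (1 << COLOR_BITS) - 1,	/* 15 usable colors */
	NO_COLOR	= NR_COLORS,		/* exempt from flushing */
};

static unsigned int color_to_flags(int color)
{
	return color << COLOR_SHIFT;
}

static int flags_to_color(unsigned int flags)
{
	return (flags >> COLOR_SHIFT) & ((1 << COLOR_BITS) - 1);
}

static int next_color(int color)
{
	return (color + 1) % NR_COLORS;	/* wraps 14 -> 0, skips NO_COLOR */
}

int main(void)
{
	int color = 0, i;

	/* walk the whole color space once; NO_COLOR never shows up */
	for (i = 0; i < NR_COLORS; i++) {
		unsigned int flags = color_to_flags(color);

		printf("color %2d -> flags 0x%02x -> color %2d\n",
		       color, flags, flags_to_color(flags));
		color = next_color(color);
	}
	printf("back at color %d, NO_COLOR is %d\n", color, NO_COLOR);
	return 0;
}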

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/workqueue.h |   17 ++-
 kernel/workqueue.c        |  351 ++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 315 insertions(+), 53 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 011738a..316cc48 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -29,7 +29,22 @@ enum {
 	WORK_STRUCT_PENDING	= 1 << WORK_STRUCT_PENDING_BIT,
 	WORK_STRUCT_STATIC	= 1 << WORK_STRUCT_STATIC_BIT,
 
-	WORK_STRUCT_FLAG_BITS	= 2,
+	WORK_STRUCT_COLOR_SHIFT	= 3,	/* color for workqueue flushing */
+	WORK_STRUCT_COLOR_BITS	= 4,
+
+	/*
+	 * The last color is no color used for works which don't
+	 * participate in workqueue flushing.
+	 */
+	WORK_NR_COLORS		= (1 << WORK_STRUCT_COLOR_BITS) - 1,
+	WORK_NO_COLOR		= WORK_NR_COLORS,
+
+	/*
+	 * Reserve 7 bits off of cwq pointer.  This makes cwqs aligned
+	 * to 128 bytes which isn't too excessive while allowing 15
+	 * workqueue flush colors.
+	 */
+	WORK_STRUCT_FLAG_BITS	= 7,
 
 	WORK_STRUCT_FLAG_MASK	= (1UL << WORK_STRUCT_FLAG_BITS) - 1,
 	WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 14edfcd..9e55c80 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -41,6 +41,8 @@
  *
  * L: cwq->lock protected.  Access with cwq->lock held.
  *
+ * F: wq->flush_mutex protected.
+ *
  * W: workqueue_lock protected.
  */
 
@@ -59,10 +61,23 @@ struct cpu_workqueue_struct {
 	struct work_struct *current_work;
 
 	struct workqueue_struct *wq;		/* I: the owning workqueue */
+	int			work_color;	/* L: current color */
+	int			flush_color;	/* L: flushing color */
+	int			nr_in_flight[WORK_NR_COLORS];
+						/* L: nr of in_flight works */
 	struct task_struct	*thread;
 };
 
 /*
+ * Structure used to wait for workqueue flush.
+ */
+struct wq_flusher {
+	struct list_head	list;		/* F: list of flushers */
+	int			flush_color;	/* F: flush color waiting for */
+	struct completion	done;		/* flush completion */
+};
+
+/*
  * The externally visible workqueue abstraction is an array of
  * per-CPU workqueues:
  */
@@ -70,6 +85,15 @@ struct workqueue_struct {
 	unsigned int		flags;		/* I: WQ_* flags */
 	struct cpu_workqueue_struct *cpu_wq;	/* I: cwq's */
 	struct list_head	list;		/* W: list of all workqueues */
+
+	struct mutex		flush_mutex;	/* protects wq flushing */
+	int			work_color;	/* F: current work color */
+	int			flush_color;	/* F: current flush color */
+	atomic_t		nr_cwqs_to_flush; /* flush in progress */
+	struct wq_flusher	*first_flusher;	/* F: first flusher */
+	struct list_head	flusher_queue;	/* F: flush waiters */
+	struct list_head	flusher_overflow; /* F: flush overflow list */
+
 	const char		*name;		/* I: workqueue name */
 #ifdef CONFIG_LOCKDEP
 	struct lockdep_map	lockdep_map;
@@ -206,6 +230,22 @@ static struct cpu_workqueue_struct *target_cwq(unsigned int cpu,
 	return get_cwq(cpu, wq);
 }
 
+static unsigned int work_color_to_flags(int color)
+{
+	return color << WORK_STRUCT_COLOR_SHIFT;
+}
+
+static int work_flags_to_color(unsigned int flags)
+{
+	return (flags >> WORK_STRUCT_COLOR_SHIFT) &
+		((1 << WORK_STRUCT_COLOR_BITS) - 1);
+}
+
+static int work_next_color(int color)
+{
+	return (color + 1) % WORK_NR_COLORS;
+}
+
 /*
  * Set the workqueue on which a work item is to be run
  * - Must *only* be called if the pending flag is set
@@ -265,7 +305,9 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 	debug_work_activate(work);
 	spin_lock_irqsave(&cwq->lock, flags);
 	BUG_ON(!list_empty(&work->entry));
-	insert_work(cwq, work, &cwq->worklist, 0);
+	cwq->nr_in_flight[cwq->work_color]++;
+	insert_work(cwq, work, &cwq->worklist,
+		    work_color_to_flags(cwq->work_color));
 	spin_unlock_irqrestore(&cwq->lock, flags);
 }
 
@@ -379,6 +421,44 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 EXPORT_SYMBOL_GPL(queue_delayed_work_on);
 
 /**
+ * cwq_dec_nr_in_flight - decrement cwq's nr_in_flight
+ * @cwq: cwq of interest
+ * @color: color of work which left the queue
+ *
+ * A work either has completed or is removed from pending queue,
+ * decrement nr_in_flight of its cwq and handle workqueue flushing.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
+static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
+{
+	/* ignore uncolored works */
+	if (color == WORK_NO_COLOR)
+		return;
+
+	cwq->nr_in_flight[color]--;
+
+	/* is flush in progress and are we at the flushing tip? */
+	if (likely(cwq->flush_color != color))
+		return;
+
+	/* are there still in-flight works? */
+	if (cwq->nr_in_flight[color])
+		return;
+
+	/* this cwq is done, clear flush_color */
+	cwq->flush_color = -1;
+
+	/*
+	 * If this was the last cwq, wake up the first flusher.  It
+	 * will handle the rest.
+	 */
+	if (atomic_dec_and_test(&cwq->wq->nr_cwqs_to_flush))
+		complete(&cwq->wq->first_flusher->done);
+}
+
+/**
  * process_one_work - process single work
  * @cwq: cwq to process work for
  * @work: work to process
@@ -396,6 +476,7 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
 			     struct work_struct *work)
 {
 	work_func_t f = work->func;
+	int work_color;
 #ifdef CONFIG_LOCKDEP
 	/*
 	 * It is permissible to free the struct work_struct from
@@ -409,6 +490,7 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
 	/* claim and process */
 	debug_work_deactivate(work);
 	cwq->current_work = work;
+	work_color = work_flags_to_color(*work_data_bits(work));
 	list_del_init(&work->entry);
 
 	spin_unlock_irq(&cwq->lock);
@@ -435,6 +517,7 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
 
 	/* we're done with it, release */
 	cwq->current_work = NULL;
+	cwq_dec_nr_in_flight(cwq, work_color);
 }
 
 static void run_workqueue(struct cpu_workqueue_struct *cwq)
@@ -517,29 +600,72 @@ static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
 	init_completion(&barr->done);
 
 	debug_work_activate(&barr->work);
-	insert_work(cwq, &barr->work, head, 0);
+	insert_work(cwq, &barr->work, head, work_color_to_flags(WORK_NO_COLOR));
 }
 
-static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
+/**
+ * flush_workqueue_prep_cwqs - prepare cwqs for workqueue flushing
+ * @wq: workqueue being flushed
+ * @flush_color: new flush color, < 0 for no-op
+ * @work_color: new work color, < 0 for no-op
+ *
+ * Prepare cwqs for workqueue flushing.
+ *
+ * If @flush_color is non-negative, flush_color on all cwqs should be
+ * -1.  If no cwq has in-flight commands at the specified color, all
+ * cwq->flush_color's stay at -1 and %false is returned.  If any cwq
+ * has in flight commands, its cwq->flush_color is set to
+ * @flush_color, @wq->nr_cwqs_to_flush is updated accordingly, cwq
+ * wakeup logic is armed and %true is returned.
+ *
+ * The caller should have initialized @wq->first_flusher prior to
+ * calling this function with non-negative @flush_color.  If
+ * @flush_color is negative, no flush color update is done and %false
+ * is returned.
+ *
+ * If @work_color is non-negative, all cwqs should have the same
+ * work_color which is previous to @work_color and all will be
+ * advanced to @work_color.
+ *
+ * CONTEXT:
+ * mutex_lock(wq->flush_mutex).
+ *
+ * RETURNS:
+ * %true if @flush_color >= 0 and wakeup logic is armed.  %false
+ * otherwise.
+ */
+static bool flush_workqueue_prep_cwqs(struct workqueue_struct *wq,
+				      int flush_color, int work_color)
 {
-	int active = 0;
-	struct wq_barrier barr;
+	bool wait = false;
+	unsigned int cpu;
 
-	WARN_ON(cwq->thread == current);
+	BUG_ON(flush_color >= 0 && atomic_read(&wq->nr_cwqs_to_flush));
 
-	spin_lock_irq(&cwq->lock);
-	if (!list_empty(&cwq->worklist) || cwq->current_work != NULL) {
-		insert_wq_barrier(cwq, &barr, &cwq->worklist);
-		active = 1;
-	}
-	spin_unlock_irq(&cwq->lock);
+	for_each_possible_cpu(cpu) {
+		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 
-	if (active) {
-		wait_for_completion(&barr.done);
-		destroy_work_on_stack(&barr.work);
+		spin_lock_irq(&cwq->lock);
+
+		if (flush_color >= 0) {
+			BUG_ON(cwq->flush_color != -1);
+
+			if (cwq->nr_in_flight[flush_color]) {
+				cwq->flush_color = flush_color;
+				atomic_inc(&wq->nr_cwqs_to_flush);
+				wait = true;
+			}
+		}
+
+		if (work_color >= 0) {
+			BUG_ON(work_color != work_next_color(cwq->work_color));
+			cwq->work_color = work_color;
+		}
+
+		spin_unlock_irq(&cwq->lock);
 	}
 
-	return active;
+	return wait;
 }
 
 /**
@@ -554,13 +680,144 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
  */
 void flush_workqueue(struct workqueue_struct *wq)
 {
-	int cpu;
+	struct wq_flusher this_flusher = {
+		.list = LIST_HEAD_INIT(this_flusher.list),
+		.flush_color = -1,
+		.done = COMPLETION_INITIALIZER_ONSTACK(this_flusher.done),
+	};
+	int next_color;
 
-	might_sleep();
 	lock_map_acquire(&wq->lockdep_map);
 	lock_map_release(&wq->lockdep_map);
-	for_each_possible_cpu(cpu)
-		flush_cpu_workqueue(get_cwq(cpu, wq));
+
+	mutex_lock(&wq->flush_mutex);
+
+	/*
+	 * Start-to-wait phase
+	 */
+	next_color = work_next_color(wq->work_color);
+
+	if (next_color != wq->flush_color) {
+		/*
+		 * Color space is not full.  The current work_color
+		 * becomes our flush_color and work_color is advanced
+		 * by one.
+		 */
+		BUG_ON(!list_empty(&wq->flusher_overflow));
+		this_flusher.flush_color = wq->work_color;
+		wq->work_color = next_color;
+
+		if (!wq->first_flusher) {
+			/* no flush in progress, become the first flusher */
+			BUG_ON(wq->flush_color != this_flusher.flush_color);
+
+			wq->first_flusher = &this_flusher;
+
+			if (!flush_workqueue_prep_cwqs(wq, wq->flush_color,
+						       wq->work_color)) {
+				/* nothing to flush, done */
+				wq->flush_color = next_color;
+				wq->first_flusher = NULL;
+				goto out_unlock;
+			}
+		} else {
+			/* wait in queue */
+			BUG_ON(wq->flush_color == this_flusher.flush_color);
+			list_add_tail(&this_flusher.list, &wq->flusher_queue);
+			flush_workqueue_prep_cwqs(wq, -1, wq->work_color);
+		}
+	} else {
+		/*
+		 * Oops, color space is full, wait on overflow queue.
+		 * The next flush completion will assign us
+		 * flush_color and transfer to flusher_queue.
+		 */
+		list_add_tail(&this_flusher.list, &wq->flusher_overflow);
+	}
+
+	mutex_unlock(&wq->flush_mutex);
+
+	wait_for_completion(&this_flusher.done);
+
+	/*
+	 * Wake-up-and-cascade phase
+	 *
+	 * First flushers are responsible for cascading flushes and
+	 * handling overflow.  Non-first flushers can simply return.
+	 */
+	if (wq->first_flusher != &this_flusher)
+		return;
+
+	mutex_lock(&wq->flush_mutex);
+
+	wq->first_flusher = NULL;
+
+	BUG_ON(!list_empty(&this_flusher.list));
+	BUG_ON(wq->flush_color != this_flusher.flush_color);
+
+	while (true) {
+		struct wq_flusher *next, *tmp;
+
+		/* complete all the flushers sharing the current flush color */
+		list_for_each_entry_safe(next, tmp, &wq->flusher_queue, list) {
+			if (next->flush_color != wq->flush_color)
+				break;
+			list_del_init(&next->list);
+			complete(&next->done);
+		}
+
+		BUG_ON(!list_empty(&wq->flusher_overflow) &&
+		       wq->flush_color != work_next_color(wq->work_color));
+
+		/* this flush_color is finished, advance by one */
+		wq->flush_color = work_next_color(wq->flush_color);
+
+		/* one color has been freed, handle overflow queue */
+		if (!list_empty(&wq->flusher_overflow)) {
+			/*
+			 * Assign the same color to all overflowed
+			 * flushers, advance work_color and append to
+			 * flusher_queue.  This is the start-to-wait
+			 * phase for these overflowed flushers.
+			 */
+			list_for_each_entry(tmp, &wq->flusher_overflow, list)
+				tmp->flush_color = wq->work_color;
+
+			wq->work_color = work_next_color(wq->work_color);
+
+			list_splice_tail_init(&wq->flusher_overflow,
+					      &wq->flusher_queue);
+			flush_workqueue_prep_cwqs(wq, -1, wq->work_color);
+		}
+
+		if (list_empty(&wq->flusher_queue)) {
+			BUG_ON(wq->flush_color != wq->work_color);
+			break;
+		}
+
+		/*
+		 * Need to flush more colors.  Make the next flusher
+		 * the new first flusher and arm cwqs.
+		 */
+		BUG_ON(wq->flush_color == wq->work_color);
+		BUG_ON(wq->flush_color != next->flush_color);
+
+		list_del_init(&next->list);
+		wq->first_flusher = next;
+
+		if (flush_workqueue_prep_cwqs(wq, wq->flush_color, -1))
+			break;
+
+		/*
+		 * Meh... this color is already done, clear first
+		 * flusher and repeat cascading.
+		 */
+		wq->first_flusher = NULL;
+		complete(&next->done);
+	}
+
+out_unlock:
+	mutex_unlock(&wq->flush_mutex);
 }
 EXPORT_SYMBOL_GPL(flush_workqueue);
 
@@ -647,6 +904,8 @@ static int try_to_grab_pending(struct work_struct *work)
 		if (cwq == get_wq_data(work)) {
 			debug_work_deactivate(work);
 			list_del_init(&work->entry);
+			cwq_dec_nr_in_flight(cwq,
+				work_flags_to_color(*work_data_bits(work)));
 			ret = 1;
 		}
 	}
@@ -1020,6 +1279,10 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 		goto err;
 
 	wq->flags = flags;
+	mutex_init(&wq->flush_mutex);
+	atomic_set(&wq->nr_cwqs_to_flush, 0);
+	INIT_LIST_HEAD(&wq->flusher_queue);
+	INIT_LIST_HEAD(&wq->flusher_overflow);
 	wq->name = name;
 	lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
 	INIT_LIST_HEAD(&wq->list);
@@ -1036,6 +1299,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 
 		BUG_ON((unsigned long)cwq & WORK_STRUCT_FLAG_MASK);
 		cwq->wq = wq;
+		cwq->flush_color = -1;
 		spin_lock_init(&cwq->lock);
 		INIT_LIST_HEAD(&cwq->worklist);
 		init_waitqueue_head(&cwq->more_work);
@@ -1069,33 +1333,6 @@ err:
 }
 EXPORT_SYMBOL_GPL(__create_workqueue_key);
 
-static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq)
-{
-	/*
-	 * Our caller is either destroy_workqueue() or CPU_POST_DEAD,
-	 * cpu_add_remove_lock protects cwq->thread.
-	 */
-	if (cwq->thread == NULL)
-		return;
-
-	lock_map_acquire(&cwq->wq->lockdep_map);
-	lock_map_release(&cwq->wq->lockdep_map);
-
-	flush_cpu_workqueue(cwq);
-	/*
-	 * If the caller is CPU_POST_DEAD and cwq->worklist was not empty,
-	 * a concurrent flush_workqueue() can insert a barrier after us.
-	 * However, in that case run_workqueue() won't return and check
-	 * kthread_should_stop() until it flushes all work_struct's.
-	 * When ->worklist becomes empty it is safe to exit because no
-	 * more work_structs can be queued on this cwq: flush_workqueue
-	 * checks list_empty(), and a "normal" queue_work() can't use
-	 * a dead CPU.
-	 */
-	kthread_stop(cwq->thread);
-	cwq->thread = NULL;
-}
-
 /**
  * destroy_workqueue - safely terminate a workqueue
  * @wq: target workqueue
@@ -1110,8 +1347,20 @@ void destroy_workqueue(struct workqueue_struct *wq)
 	list_del(&wq->list);
 	spin_unlock(&workqueue_lock);
 
-	for_each_possible_cpu(cpu)
-		cleanup_workqueue_thread(get_cwq(cpu, wq));
+	flush_workqueue(wq);
+
+	for_each_possible_cpu(cpu) {
+		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+		int i;
+
+		if (cwq->thread) {
+			kthread_stop(cwq->thread);
+			cwq->thread = NULL;
+		}
+
+		for (i = 0; i < WORK_NR_COLORS; i++)
+			BUG_ON(cwq->nr_in_flight[i]);
+	}
 
 	free_cwqs(wq->cpu_wq);
 	kfree(wq);
@@ -1141,9 +1390,7 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
 			break;
 
 		case CPU_POST_DEAD:
-			lock_map_acquire(&cwq->wq->lockdep_map);
-			lock_map_release(&cwq->wq->lockdep_map);
-			flush_cpu_workqueue(cwq);
+			flush_workqueue(wq);
 			break;
 		}
 	}
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 17/27] workqueue: introduce worker
  2009-12-18 12:57 Tejun Heo
                   ` (15 preceding siblings ...)
  2009-12-18 12:57 ` [PATCH 16/27] workqueue: reimplement workqueue flushing using color coded works Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:57 ` [PATCH 18/27] workqueue: reimplement work flushing using linked works Tejun Heo
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

Separate out worker thread related information to struct worker from
struct cpu_workqueue_struct and implement helper functions to deal
with the new struct worker.  The only externally visible change is
that workqueue workers are now all named "kworker/CPUID:WORKERID",
where WORKERID is allocated from a per-cpu ida.

This is in preparation for concurrency managed workqueue where
multiple shared workers will be available per cpu.
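
For illustration only, a tiny userspace model of the naming scheme:
each cpu hands out sequential worker ids and the thread name is built
as "kworker/CPUID:WORKERID".  The next_id array is a stand-in for the
per-cpu ida the patch uses; NR_CPUS and the helper name are made up.

#include <stdio.h>

#define NR_CPUS	2

static int next_id[NR_CPUS];		/* toy stand-in for per-cpu ida */

static void name_worker(unsigned int cpu, char *buf, size_t len)
{
	snprintf(buf, len, "kworker/%u:%d", cpu, next_id[cpu]++);
}

int main(void)
{
	char name[32];
	unsigned int cpu;
	int i;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		for (i = 0; i < 2; i++) {
			name_worker(cpu, name, sizeof(name));
			printf("%s\n", name);	/* kworker/0:0, kworker/0:1, ... */
		}
	return 0;
}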

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c |  211 +++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 151 insertions(+), 60 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9e55c80..17d270b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -33,6 +33,7 @@
 #include <linux/kallsyms.h>
 #include <linux/debug_locks.h>
 #include <linux/lockdep.h>
+#include <linux/idr.h>
 
 /*
  * Structure fields follow one of the following exclusion rules.
@@ -46,6 +47,15 @@
  * W: workqueue_lock protected.
  */
 
+struct cpu_workqueue_struct;
+
+struct worker {
+	struct work_struct	*current_work;	/* L: work being processed */
+	struct task_struct	*task;		/* I: worker task */
+	struct cpu_workqueue_struct *cwq;	/* I: the associated cwq */
+	int			id;		/* I: worker id */
+};
+
 /*
  * The per-CPU workqueue (if single thread, we always use the first
  * possible cpu).  The lower WORK_STRUCT_FLAG_BITS of
@@ -58,14 +68,14 @@ struct cpu_workqueue_struct {
 
 	struct list_head worklist;
 	wait_queue_head_t more_work;
-	struct work_struct *current_work;
+	unsigned int		cpu;
+	struct worker		*worker;
 
 	struct workqueue_struct *wq;		/* I: the owning workqueue */
 	int			work_color;	/* L: current color */
 	int			flush_color;	/* L: flushing color */
 	int			nr_in_flight[WORK_NR_COLORS];
 						/* L: nr of in_flight works */
-	struct task_struct	*thread;
 };
 
 /*
@@ -213,6 +223,9 @@ static inline void debug_work_deactivate(struct work_struct *work) { }
 /* Serializes the accesses to the list of workqueues. */
 static DEFINE_SPINLOCK(workqueue_lock);
 static LIST_HEAD(workqueues);
+static DEFINE_PER_CPU(struct ida, worker_ida);
+
+static int worker_thread(void *__worker);
 
 static int singlethread_cpu __read_mostly;
 
@@ -420,6 +433,105 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL_GPL(queue_delayed_work_on);
 
+static struct worker *alloc_worker(void)
+{
+	struct worker *worker;
+
+	worker = kzalloc(sizeof(*worker), GFP_KERNEL);
+	return worker;
+}
+
+/**
+ * create_worker - create a new workqueue worker
+ * @cwq: cwq the new worker will belong to
+ * @bind: whether to set affinity to @cpu or not
+ *
+ * Create a new worker which is bound to @cwq.  The returned worker
+ * can be started by calling start_worker() or destroyed using
+ * destroy_worker().
+ *
+ * CONTEXT:
+ * Might sleep.  Does GFP_KERNEL allocations.
+ *
+ * RETURNS:
+ * Pointer to the newly created worker.
+ */
+static struct worker *create_worker(struct cpu_workqueue_struct *cwq, bool bind)
+{
+	int id = -1;
+	struct worker *worker = NULL;
+
+	spin_lock(&workqueue_lock);
+	while (ida_get_new(&per_cpu(worker_ida, cwq->cpu), &id)) {
+		spin_unlock(&workqueue_lock);
+		if (!ida_pre_get(&per_cpu(worker_ida, cwq->cpu), GFP_KERNEL))
+			goto fail;
+		spin_lock(&workqueue_lock);
+	}
+	spin_unlock(&workqueue_lock);
+
+	worker = alloc_worker();
+	if (!worker)
+		goto fail;
+
+	worker->cwq = cwq;
+	worker->id = id;
+
+	worker->task = kthread_create(worker_thread, worker, "kworker/%u:%d",
+				      cwq->cpu, id);
+	if (IS_ERR(worker->task))
+		goto fail;
+
+	if (bind)
+		kthread_bind(worker->task, cwq->cpu);
+
+	return worker;
+fail:
+	if (id >= 0) {
+		spin_lock(&workqueue_lock);
+		ida_remove(&per_cpu(worker_ida, cwq->cpu), id);
+		spin_unlock(&workqueue_lock);
+	}
+	kfree(worker);
+	return NULL;
+}
+
+/**
+ * start_worker - start a newly created worker
+ * @worker: worker to start
+ *
+ * Start @worker.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
+static void start_worker(struct worker *worker)
+{
+	wake_up_process(worker->task);
+}
+
+/**
+ * destroy_worker - destroy a workqueue worker
+ * @worker: worker to be destroyed
+ *
+ * Destroy @worker.
+ */
+static void destroy_worker(struct worker *worker)
+{
+	int cpu = worker->cwq->cpu;
+	int id = worker->id;
+
+	/* sanity check frenzy */
+	BUG_ON(worker->current_work);
+
+	kthread_stop(worker->task);
+	kfree(worker);
+
+	spin_lock(&workqueue_lock);
+	ida_remove(&per_cpu(worker_ida, cpu), id);
+	spin_unlock(&workqueue_lock);
+}
+
 /**
  * cwq_dec_nr_in_flight - decrement cwq's nr_in_flight
  * @cwq: cwq of interest
@@ -460,7 +572,7 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
 
 /**
  * process_one_work - process single work
- * @cwq: cwq to process work for
+ * @worker: self
  * @work: work to process
  *
  * Process @work.  This function contains all the logics necessary to
@@ -472,9 +584,9 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
  * CONTEXT:
  * spin_lock_irq(cwq->lock) which is released and regrabbed.
  */
-static void process_one_work(struct cpu_workqueue_struct *cwq,
-			     struct work_struct *work)
+static void process_one_work(struct worker *worker, struct work_struct *work)
 {
+	struct cpu_workqueue_struct *cwq = worker->cwq;
 	work_func_t f = work->func;
 	int work_color;
 #ifdef CONFIG_LOCKDEP
@@ -489,7 +601,7 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
 #endif
 	/* claim and process */
 	debug_work_deactivate(work);
-	cwq->current_work = work;
+	worker->current_work = work;
 	work_color = work_flags_to_color(*work_data_bits(work));
 	list_del_init(&work->entry);
 
@@ -516,30 +628,33 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
 	spin_lock_irq(&cwq->lock);
 
 	/* we're done with it, release */
-	cwq->current_work = NULL;
+	worker->current_work = NULL;
 	cwq_dec_nr_in_flight(cwq, work_color);
 }
 
-static void run_workqueue(struct cpu_workqueue_struct *cwq)
+static void run_workqueue(struct worker *worker)
 {
+	struct cpu_workqueue_struct *cwq = worker->cwq;
+
 	spin_lock_irq(&cwq->lock);
 	while (!list_empty(&cwq->worklist)) {
 		struct work_struct *work = list_entry(cwq->worklist.next,
 						struct work_struct, entry);
-		process_one_work(cwq, work);
+		process_one_work(worker, work);
 	}
 	spin_unlock_irq(&cwq->lock);
 }
 
 /**
  * worker_thread - the worker thread function
- * @__cwq: cwq to serve
+ * @__worker: self
  *
  * The cwq worker thread function.
  */
-static int worker_thread(void *__cwq)
+static int worker_thread(void *__worker)
 {
-	struct cpu_workqueue_struct *cwq = __cwq;
+	struct worker *worker = __worker;
+	struct cpu_workqueue_struct *cwq = worker->cwq;
 	DEFINE_WAIT(wait);
 
 	if (cwq->wq->flags & WQ_FREEZEABLE)
@@ -558,7 +673,7 @@ static int worker_thread(void *__cwq)
 		if (kthread_should_stop())
 			break;
 
-		run_workqueue(cwq);
+		run_workqueue(worker);
 	}
 
 	return 0;
@@ -856,7 +971,7 @@ int flush_work(struct work_struct *work)
 			goto already_gone;
 		prev = &work->entry;
 	} else {
-		if (cwq->current_work != work)
+		if (!cwq->worker || cwq->worker->current_work != work)
 			goto already_gone;
 		prev = &cwq->worklist;
 	}
@@ -921,7 +1036,7 @@ static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,
 	int running = 0;
 
 	spin_lock_irq(&cwq->lock);
-	if (unlikely(cwq->current_work == work)) {
+	if (unlikely(cwq->worker && cwq->worker->current_work == work)) {
 		insert_wq_barrier(cwq, &barr, cwq->worklist.next);
 		running = 1;
 	}
@@ -1184,7 +1299,7 @@ int current_is_keventd(void)
 	BUG_ON(!keventd_wq);
 
 	cwq = get_cwq(cpu, keventd_wq);
-	if (current == cwq->thread)
+	if (current == cwq->worker->task)
 		ret = 1;
 
 	return ret;
@@ -1229,38 +1344,6 @@ static void free_cwqs(struct cpu_workqueue_struct *cwqs)
 #endif
 }
 
-static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
-{
-	struct workqueue_struct *wq = cwq->wq;
-	struct task_struct *p;
-
-	p = kthread_create(worker_thread, cwq, "%s/%d", wq->name, cpu);
-	/*
-	 * Nobody can add the work_struct to this cwq,
-	 *	if (caller is __create_workqueue)
-	 *		nobody should see this wq
-	 *	else // caller is CPU_UP_PREPARE
-	 *		cpu is not on cpu_online_map
-	 * so we can abort safely.
-	 */
-	if (IS_ERR(p))
-		return PTR_ERR(p);
-	cwq->thread = p;
-
-	return 0;
-}
-
-static void start_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
-{
-	struct task_struct *p = cwq->thread;
-
-	if (p != NULL) {
-		if (cpu >= 0)
-			kthread_bind(p, cpu);
-		wake_up_process(p);
-	}
-}
-
 struct workqueue_struct *__create_workqueue_key(const char *name,
 						unsigned int flags,
 						struct lock_class_key *key,
@@ -1268,7 +1351,8 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 {
 	bool singlethread = flags & WQ_SINGLE_THREAD;
 	struct workqueue_struct *wq;
-	int err = 0, cpu;
+	bool failed = false;
+	unsigned int cpu;
 
 	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
 	if (!wq)
@@ -1298,23 +1382,25 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 
 		BUG_ON((unsigned long)cwq & WORK_STRUCT_FLAG_MASK);
+		cwq->cpu = cpu;
 		cwq->wq = wq;
 		cwq->flush_color = -1;
 		spin_lock_init(&cwq->lock);
 		INIT_LIST_HEAD(&cwq->worklist);
 		init_waitqueue_head(&cwq->more_work);
 
-		if (err)
+		if (failed)
 			continue;
-		err = create_workqueue_thread(cwq, cpu);
-		if (cpu_online(cpu) && !singlethread)
-			start_workqueue_thread(cwq, cpu);
+		cwq->worker = create_worker(cwq,
+					    cpu_online(cpu) && !singlethread);
+		if (cwq->worker)
+			start_worker(cwq->worker);
 		else
-			start_workqueue_thread(cwq, -1);
+			failed = true;
 	}
 	cpu_maps_update_done();
 
-	if (err) {
+	if (failed) {
 		destroy_workqueue(wq);
 		wq = NULL;
 	}
@@ -1353,9 +1439,9 @@ void destroy_workqueue(struct workqueue_struct *wq)
 		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 		int i;
 
-		if (cwq->thread) {
-			kthread_stop(cwq->thread);
-			cwq->thread = NULL;
+		if (cwq->worker) {
+			destroy_worker(cwq->worker);
+			cwq->worker = NULL;
 		}
 
 		for (i = 0; i < WORK_NR_COLORS; i++)
@@ -1385,8 +1471,8 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
 
 		switch (action) {
 		case CPU_ONLINE:
-			__set_cpus_allowed(cwq->thread, get_cpu_mask(cpu),
-					   true);
+			__set_cpus_allowed(cwq->worker->task,
+					   get_cpu_mask(cpu), true);
 			break;
 
 		case CPU_POST_DEAD:
@@ -1447,6 +1533,8 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
 
 void __init init_workqueues(void)
 {
+	unsigned int cpu;
+
 	/*
 	 * cwqs are forced aligned according to WORK_STRUCT_FLAG_BITS.
 	 * Make sure that the alignment isn't lower than that of
@@ -1455,6 +1543,9 @@ void __init init_workqueues(void)
 	BUILD_BUG_ON(__alignof__(struct cpu_workqueue_struct) <
 		     __alignof__(unsigned long long));
 
+	for_each_possible_cpu(cpu)
+		ida_init(&per_cpu(worker_ida, cpu));
+
 	singlethread_cpu = cpumask_first(cpu_possible_mask);
 	hotcpu_notifier(workqueue_cpu_callback, 0);
 	keventd_wq = create_workqueue("events");
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 18/27] workqueue: reimplement work flushing using linked works
  2009-12-18 12:57 Tejun Heo
                   ` (16 preceding siblings ...)
  2009-12-18 12:57 ` [PATCH 17/27] workqueue: introduce worker Tejun Heo
@ 2009-12-18 12:57 ` Tejun Heo
  2009-12-18 12:58 ` [PATCH 19/27] workqueue: implement per-cwq active work limit Tejun Heo
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:57 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

A work is linked to the next one by having the WORK_STRUCT_LINKED bit
set, and these links can be chained.  When a linked work is dispatched
to a worker, all linked works are dispatched to the worker's newly
added ->scheduled queue and processed back-to-back.

Currently, as there's only a single worker per cwq, having linked works
doesn't make any visible behavior difference.  This change is to
prepare for multiple shared workers per cpu.
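
A minimal userspace sketch of the linking rule (illustration only, not
the kernel list_head code): a work whose LINKED flag is set drags the
following work along when it is handed over, so a barrier queued right
after its target is always processed by the same worker.  The struct
and function names are invented for the example.

#include <stdbool.h>
#include <stdio.h>

struct work {
	const char	*name;
	bool		linked;		/* next work belongs to this chain */
	struct work	*next;
};

/* pop the first work plus every work chained to it via the LINKED flag */
static struct work *take_chain(struct work **worklist)
{
	struct work *chain = *worklist, *w = chain;

	while (w && w->linked)
		w = w->next;
	if (w) {
		*worklist = w->next;	/* detach after the last chained work */
		w->next = NULL;
	} else {
		*worklist = NULL;
	}
	return chain;
}

int main(void)
{
	struct work c = { "barrier",  false, NULL };
	struct work b = { "target",   true,  &c };	/* barrier linked to it */
	struct work a = { "ordinary", false, &b };
	struct work *worklist = &a, *chain;

	chain = take_chain(&worklist);		/* just "ordinary" */
	for (; chain; chain = chain->next)
		printf("scheduled: %s\n", chain->name);

	chain = take_chain(&worklist);		/* "target" and "barrier" together */
	for (; chain; chain = chain->next)
		printf("scheduled: %s\n", chain->name);
	return 0;
}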

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/workqueue.h |    2 +
 kernel/workqueue.c        |  152 ++++++++++++++++++++++++++++++++++++++------
 2 files changed, 133 insertions(+), 21 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 316cc48..a6650f1 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -25,9 +25,11 @@ typedef void (*work_func_t)(struct work_struct *work);
 enum {
 	WORK_STRUCT_PENDING_BIT	= 0,	/* work item is pending execution */
 	WORK_STRUCT_STATIC_BIT	= 1,	/* static initializer (debugobjects) */
+	WORK_STRUCT_LINKED_BIT	= 2,	/* next work is linked to this one */
 
 	WORK_STRUCT_PENDING	= 1 << WORK_STRUCT_PENDING_BIT,
 	WORK_STRUCT_STATIC	= 1 << WORK_STRUCT_STATIC_BIT,
+	WORK_STRUCT_LINKED	= 1 << WORK_STRUCT_LINKED_BIT,
 
 	WORK_STRUCT_COLOR_SHIFT	= 3,	/* color for workqueue flushing */
 	WORK_STRUCT_COLOR_BITS	= 4,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 17d270b..2ac1624 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -51,6 +51,7 @@ struct cpu_workqueue_struct;
 
 struct worker {
 	struct work_struct	*current_work;	/* L: work being processed */
+	struct list_head	scheduled;	/* L: scheduled works */
 	struct task_struct	*task;		/* I: worker task */
 	struct cpu_workqueue_struct *cwq;	/* I: the associated cwq */
 	int			id;		/* I: worker id */
@@ -438,6 +439,8 @@ static struct worker *alloc_worker(void)
 	struct worker *worker;
 
 	worker = kzalloc(sizeof(*worker), GFP_KERNEL);
+	if (worker)
+		INIT_LIST_HEAD(&worker->scheduled);
 	return worker;
 }
 
@@ -523,6 +526,7 @@ static void destroy_worker(struct worker *worker)
 
 	/* sanity check frenzy */
 	BUG_ON(worker->current_work);
+	BUG_ON(!list_empty(&worker->scheduled));
 
 	kthread_stop(worker->task);
 	kfree(worker);
@@ -533,6 +537,48 @@ static void destroy_worker(struct worker *worker)
 }
 
 /**
+ * move_linked_works - move linked works to a list
+ * @work: start of series of works to be scheduled
+ * @head: target list to append @work to
+ * @nextp: out parameter for nested worklist walking
+ *
+ * Schedule linked works starting from @work to @head.  Work series to
+ * be scheduled starts at @work and includes any consecutive work with
+ * WORK_STRUCT_LINKED set in its predecessor.
+ *
+ * If @nextp is not NULL, it's updated to point to the next work of
+ * the last scheduled work.  This allows move_linked_works() to be
+ * nested inside outer list_for_each_entry_safe().
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
+static void move_linked_works(struct work_struct *work, struct list_head *head,
+			      struct work_struct **nextp)
+{
+	struct work_struct *n;
+
+	/*
+	 * Linked worklist will always end before the end of the list,
+	 * use NULL for list head.
+	 */
+	work = list_entry(work->entry.prev, struct work_struct, entry);
+	list_for_each_entry_safe_continue(work, n, NULL, entry) {
+		list_move_tail(&work->entry, head);
+		if (!(*work_data_bits(work) & WORK_STRUCT_LINKED))
+			break;
+	}
+
+	/*
+	 * If we're already inside safe list traversal and have moved
+	 * multiple works to the scheduled queue, the next position
+	 * needs to be updated.
+	 */
+	if (nextp)
+		*nextp = n;
+}
+
+/**
  * cwq_dec_nr_in_flight - decrement cwq's nr_in_flight
  * @cwq: cwq of interest
  * @color: color of work which left the queue
@@ -632,17 +678,25 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
 	cwq_dec_nr_in_flight(cwq, work_color);
 }
 
-static void run_workqueue(struct worker *worker)
+/**
+ * process_scheduled_works - process scheduled works
+ * @worker: self
+ *
+ * Process all scheduled works.  Please note that the scheduled list
+ * may change while processing a work, so this function repeatedly
+ * fetches a work from the top and executes it.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock) which may be released and regrabbed
+ * multiple times.
+ */
+static void process_scheduled_works(struct worker *worker)
 {
-	struct cpu_workqueue_struct *cwq = worker->cwq;
-
-	spin_lock_irq(&cwq->lock);
-	while (!list_empty(&cwq->worklist)) {
-		struct work_struct *work = list_entry(cwq->worklist.next,
+	while (!list_empty(&worker->scheduled)) {
+		struct work_struct *work = list_first_entry(&worker->scheduled,
 						struct work_struct, entry);
 		process_one_work(worker, work);
 	}
-	spin_unlock_irq(&cwq->lock);
 }
 
 /**
@@ -673,7 +727,27 @@ static int worker_thread(void *__worker)
 		if (kthread_should_stop())
 			break;
 
-		run_workqueue(worker);
+		spin_lock_irq(&cwq->lock);
+
+		while (!list_empty(&cwq->worklist)) {
+			struct work_struct *work =
+				list_first_entry(&cwq->worklist,
+						 struct work_struct, entry);
+
+			if (likely(!(*work_data_bits(work) &
+				     WORK_STRUCT_LINKED))) {
+				/* optimization path, not strictly necessary */
+				process_one_work(worker, work);
+				if (unlikely(!list_empty(&worker->scheduled)))
+					process_scheduled_works(worker);
+			} else {
+				move_linked_works(work, &worker->scheduled,
+						  NULL);
+				process_scheduled_works(worker);
+			}
+		}
+
+		spin_unlock_irq(&cwq->lock);
 	}
 
 	return 0;
@@ -694,16 +768,33 @@ static void wq_barrier_func(struct work_struct *work)
  * insert_wq_barrier - insert a barrier work
  * @cwq: cwq to insert barrier into
  * @barr: wq_barrier to insert
- * @head: insertion point
+ * @target: target work to attach @barr to
+ * @worker: worker currently executing @target, NULL if @target is not executing
  *
- * Insert barrier @barr into @cwq before @head.
+ * @barr is linked to @target such that @barr is completed only after
+ * @target finishes execution.  Please note that the ordering
+ * guarantee is observed only with respect to @target and on the local
+ * cpu.
+ *
+ * Currently, a queued barrier can't be canceled.  This is because
+ * try_to_grab_pending() can't determine whether the work to be
+ * grabbed is at the head of the queue and thus can't clear LINKED
+ * flag of the previous work while there must be a valid next work
+ * after a work with LINKED flag set.
+ *
+ * Note that when @worker is non-NULL, @target may be modified
+ * underneath us, so we can't reliably determine cwq from @target.
  *
  * CONTEXT:
  * spin_lock_irq(cwq->lock).
  */
 static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
-			struct wq_barrier *barr, struct list_head *head)
+			      struct wq_barrier *barr,
+			      struct work_struct *target, struct worker *worker)
 {
+	struct list_head *head;
+	unsigned int linked = 0;
+
 	/*
 	 * debugobject calls are safe here even with cwq->lock locked
 	 * as we know for sure that this will not trigger any of the
@@ -714,8 +805,24 @@ static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
 	__set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&barr->work));
 	init_completion(&barr->done);
 
+	/*
+	 * If @target is currently being executed, schedule the
+	 * barrier to the worker; otherwise, put it after @target.
+	 */
+	if (worker)
+		head = worker->scheduled.next;
+	else {
+		unsigned long *bits = work_data_bits(target);
+
+		head = target->entry.next;
+		/* there can already be other linked works, inherit and set */
+		linked = *bits & WORK_STRUCT_LINKED;
+		*bits |= WORK_STRUCT_LINKED;
+	}
+
 	debug_work_activate(&barr->work);
-	insert_work(cwq, &barr->work, head, work_color_to_flags(WORK_NO_COLOR));
+	insert_work(cwq, &barr->work, head,
+		    work_color_to_flags(WORK_NO_COLOR) | linked);
 }
 
 /**
@@ -948,8 +1055,8 @@ EXPORT_SYMBOL_GPL(flush_workqueue);
  */
 int flush_work(struct work_struct *work)
 {
+	struct worker *worker = NULL;
 	struct cpu_workqueue_struct *cwq;
-	struct list_head *prev;
 	struct wq_barrier barr;
 
 	might_sleep();
@@ -969,14 +1076,14 @@ int flush_work(struct work_struct *work)
 		smp_rmb();
 		if (unlikely(cwq != get_wq_data(work)))
 			goto already_gone;
-		prev = &work->entry;
 	} else {
-		if (!cwq->worker || cwq->worker->current_work != work)
+		if (cwq->worker && cwq->worker->current_work == work)
+			worker = cwq->worker;
+		if (!worker)
 			goto already_gone;
-		prev = &cwq->worklist;
 	}
-	insert_wq_barrier(cwq, &barr, prev->next);
 
+	insert_wq_barrier(cwq, &barr, work, worker);
 	spin_unlock_irq(&cwq->lock);
 	wait_for_completion(&barr.done);
 	destroy_work_on_stack(&barr.work);
@@ -1033,16 +1140,19 @@ static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,
 				struct work_struct *work)
 {
 	struct wq_barrier barr;
-	int running = 0;
+	struct worker *worker;
 
 	spin_lock_irq(&cwq->lock);
+
+	worker = NULL;
 	if (unlikely(cwq->worker && cwq->worker->current_work == work)) {
-		insert_wq_barrier(cwq, &barr, cwq->worklist.next);
-		running = 1;
+		worker = cwq->worker;
+		insert_wq_barrier(cwq, &barr, work, worker);
 	}
+
 	spin_unlock_irq(&cwq->lock);
 
-	if (unlikely(running)) {
+	if (unlikely(worker)) {
 		wait_for_completion(&barr.done);
 		destroy_work_on_stack(&barr.work);
 	}
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 19/27] workqueue: implement per-cwq active work limit
  2009-12-18 12:57 Tejun Heo
                   ` (17 preceding siblings ...)
  2009-12-18 12:57 ` [PATCH 18/27] workqueue: reimplement work flushing using linked works Tejun Heo
@ 2009-12-18 12:58 ` Tejun Heo
  2009-12-18 12:58 ` [PATCH 20/27] workqueue: reimplement workqueue freeze using max_active Tejun Heo
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:58 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

Add cwq->nr_active, cwq->max_active and cwq->delayed_works.  nr_active
counts the number of active works per cwq.  A work is active if it's
flushable (colored) and is on the cwq's worklist.  If nr_active reaches
max_active, new works are queued on cwq->delayed_works and activated
later as works on the cwq complete and decrement nr_active.

cwq->max_active can be specified via the new @max_active parameter to
__create_workqueue() and is set to 1 for all workqueues for now.  As
each cwq has only a single worker now, this double queueing doesn't
cause any behavior difference visible to its users.

This will be used to reimplement freeze/thaw and implement a shared
worker pool.
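
A toy model of the throttle described above (counters only, purely for
illustration): at most max_active works sit on the active worklist,
the rest wait on the delayed list and are promoted one at a time as
active works complete.  struct toy_cwq and the helper names are
invented.

#include <stdio.h>

struct toy_cwq {
	int	nr_active;
	int	max_active;
	int	nr_delayed;
};

static void toy_queue_work(struct toy_cwq *cwq)
{
	if (cwq->nr_active < cwq->max_active)
		cwq->nr_active++;		/* goes straight to the worklist */
	else
		cwq->nr_delayed++;		/* parked on the delayed list */
}

static void toy_complete_work(struct toy_cwq *cwq)
{
	cwq->nr_active--;
	if (cwq->nr_delayed && cwq->nr_active < cwq->max_active) {
		cwq->nr_delayed--;		/* activate the first delayed work */
		cwq->nr_active++;
	}
}

int main(void)
{
	struct toy_cwq cwq = { .nr_active = 0, .max_active = 1, .nr_delayed = 0 };
	int i;

	for (i = 0; i < 3; i++)
		toy_queue_work(&cwq);
	printf("after queueing 3: active=%d delayed=%d\n",
	       cwq.nr_active, cwq.nr_delayed);	/* 1 active, 2 delayed */

	toy_complete_work(&cwq);
	printf("after one completion: active=%d delayed=%d\n",
	       cwq.nr_active, cwq.nr_delayed);	/* still 1 active, 1 delayed */
	return 0;
}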

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/workqueue.h |   18 +++++++++---------
 kernel/workqueue.c        |   39 +++++++++++++++++++++++++++++++++++++--
 2 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index a6650f1..974a232 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -216,11 +216,11 @@ enum {
 };
 
 extern struct workqueue_struct *
-__create_workqueue_key(const char *name, unsigned int flags,
+__create_workqueue_key(const char *name, unsigned int flags, int max_active,
 		       struct lock_class_key *key, const char *lock_name);
 
 #ifdef CONFIG_LOCKDEP
-#define __create_workqueue(name, flags)				\
+#define __create_workqueue(name, flags, max_active)		\
 ({								\
 	static struct lock_class_key __key;			\
 	const char *__lock_name;				\
@@ -230,20 +230,20 @@ __create_workqueue_key(const char *name, unsigned int flags,
 	else							\
 		__lock_name = #name;				\
 								\
-	__create_workqueue_key((name), (flags), &__key,		\
-			       __lock_name);			\
+	__create_workqueue_key((name), (flags), (max_active),	\
+				&__key, __lock_name);		\
 })
 #else
-#define __create_workqueue(name, flags)				\
-	__create_workqueue_key((name), (flags), NULL, NULL)
+#define __create_workqueue(name, flags, max_active)		\
+	__create_workqueue_key((name), (flags), (max_active), NULL, NULL)
 #endif
 
 #define create_workqueue(name)					\
-	__create_workqueue((name), 0)
+	__create_workqueue((name), 0, 1)
 #define create_freezeable_workqueue(name)			\
-	__create_workqueue((name), WQ_FREEZEABLE | WQ_SINGLE_THREAD)
+	__create_workqueue((name), WQ_FREEZEABLE | WQ_SINGLE_THREAD, 1)
 #define create_singlethread_workqueue(name)			\
-	__create_workqueue((name), WQ_SINGLE_THREAD)
+	__create_workqueue((name), WQ_SINGLE_THREAD, 1)
 
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2ac1624..0c9c01d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -77,6 +77,9 @@ struct cpu_workqueue_struct {
 	int			flush_color;	/* L: flushing color */
 	int			nr_in_flight[WORK_NR_COLORS];
 						/* L: nr of in_flight works */
+	int			nr_active;	/* L: nr of active works */
+	int			max_active;	/* I: max active works */
+	struct list_head	delayed_works;	/* L: delayed works */
 };
 
 /*
@@ -314,14 +317,24 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 			 struct work_struct *work)
 {
 	struct cpu_workqueue_struct *cwq = target_cwq(cpu, wq);
+	struct list_head *worklist;
 	unsigned long flags;
 
 	debug_work_activate(work);
+
 	spin_lock_irqsave(&cwq->lock, flags);
 	BUG_ON(!list_empty(&work->entry));
+
 	cwq->nr_in_flight[cwq->work_color]++;
-	insert_work(cwq, work, &cwq->worklist,
-		    work_color_to_flags(cwq->work_color));
+
+	if (likely(cwq->nr_active < cwq->max_active)) {
+		cwq->nr_active++;
+		worklist = &cwq->worklist;
+	} else
+		worklist = &cwq->delayed_works;
+
+	insert_work(cwq, work, worklist, work_color_to_flags(cwq->work_color));
+
 	spin_unlock_irqrestore(&cwq->lock, flags);
 }
 
@@ -578,6 +591,15 @@ static void move_linked_works(struct work_struct *work, struct list_head *head,
 		*nextp = n;
 }
 
+static void cwq_activate_first_delayed(struct cpu_workqueue_struct *cwq)
+{
+	struct work_struct *work = list_first_entry(&cwq->delayed_works,
+						    struct work_struct, entry);
+
+	move_linked_works(work, &cwq->worklist, NULL);
+	cwq->nr_active++;
+}
+
 /**
  * cwq_dec_nr_in_flight - decrement cwq's nr_in_flight
  * @cwq: cwq of interest
@@ -596,6 +618,12 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
 		return;
 
 	cwq->nr_in_flight[color]--;
+	cwq->nr_active--;
+
+	/* one down, submit a delayed one */
+	if (!list_empty(&cwq->delayed_works) &&
+	    cwq->nr_active < cwq->max_active)
+		cwq_activate_first_delayed(cwq);
 
 	/* is flush in progress and are we at the flushing tip? */
 	if (likely(cwq->flush_color != color))
@@ -1456,6 +1484,7 @@ static void free_cwqs(struct cpu_workqueue_struct *cwqs)
 
 struct workqueue_struct *__create_workqueue_key(const char *name,
 						unsigned int flags,
+						int max_active,
 						struct lock_class_key *key,
 						const char *lock_name)
 {
@@ -1464,6 +1493,8 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	bool failed = false;
 	unsigned int cpu;
 
+	max_active = clamp_val(max_active, 1, INT_MAX);
+
 	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
 	if (!wq)
 		goto err;
@@ -1495,8 +1526,10 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 		cwq->cpu = cpu;
 		cwq->wq = wq;
 		cwq->flush_color = -1;
+		cwq->max_active = max_active;
 		spin_lock_init(&cwq->lock);
 		INIT_LIST_HEAD(&cwq->worklist);
+		INIT_LIST_HEAD(&cwq->delayed_works);
 		init_waitqueue_head(&cwq->more_work);
 
 		if (failed)
@@ -1556,6 +1589,8 @@ void destroy_workqueue(struct workqueue_struct *wq)
 
 		for (i = 0; i < WORK_NR_COLORS; i++)
 			BUG_ON(cwq->nr_in_flight[i]);
+		BUG_ON(cwq->nr_active);
+		BUG_ON(!list_empty(&cwq->delayed_works));
 	}
 
 	free_cwqs(wq->cpu_wq);
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 20/27] workqueue: reimplement workqueue freeze using max_active
  2009-12-18 12:57 Tejun Heo
                   ` (18 preceding siblings ...)
  2009-12-18 12:58 ` [PATCH 19/27] workqueue: implement per-cwq active work limit Tejun Heo
@ 2009-12-18 12:58 ` Tejun Heo
  2009-12-18 12:58 ` [PATCH 21/27] workqueue: introduce global cwq and unify cwq locks Tejun Heo
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:58 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

Currently, workqueue freezing is implemented by marking the worker
freezeable and calling try_to_freeze() from the dispatch loop.
Reimplement it using cwq->max_active so that the workqueue is frozen
instead of the worker.

* workqueue_struct->saved_max_active is added which stores the
  specified max_active on initialization.

* On freeze, all cwq->max_active's are quenched to zero.  Freezing is
  complete when nr_active on all cwqs reaches zero.

* On thaw, all cwq->max_active's are restored to wq->saved_max_active
  and the worklist is repopulated.

This new implementation allows having a single shared pool of workers
per cpu.
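
Continuing the toy model from the previous patch (illustration only),
a rough sketch of the freeze flow: freezing quenches max_active to 0
so nothing new becomes active, the workqueue counts as frozen once
nr_active drains to zero, and thawing restores the saved limit and
re-activates delayed works.  All names are invented.

#include <stdbool.h>
#include <stdio.h>

struct toy_wq {
	int	nr_active;
	int	nr_delayed;
	int	max_active;
	int	saved_max_active;
};

static void toy_freeze_begin(struct toy_wq *wq)
{
	wq->max_active = 0;			/* new works go to delayed only */
}

static bool toy_freeze_busy(const struct toy_wq *wq)
{
	return wq->nr_active > 0;		/* still draining in-flight works */
}

static void toy_thaw(struct toy_wq *wq)
{
	wq->max_active = wq->saved_max_active;
	while (wq->nr_delayed && wq->nr_active < wq->max_active) {
		wq->nr_delayed--;		/* repopulate the worklist */
		wq->nr_active++;
	}
}

int main(void)
{
	struct toy_wq wq = { .nr_active = 1, .nr_delayed = 0,
			     .max_active = 1, .saved_max_active = 1 };

	toy_freeze_begin(&wq);
	wq.nr_delayed++;			/* a work queued while frozen */
	printf("busy=%d\n", toy_freeze_busy(&wq));	/* 1: still running */

	wq.nr_active--;				/* the last active work finishes */
	printf("busy=%d\n", toy_freeze_busy(&wq));	/* 0: freeze complete */

	toy_thaw(&wq);
	printf("active=%d delayed=%d\n", wq.nr_active, wq.nr_delayed);	/* 1 0 */
	return 0;
}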

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/workqueue.h |    7 ++
 kernel/power/process.c    |   21 +++++-
 kernel/workqueue.c        |  163 ++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 179 insertions(+), 12 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 974a232..7a260df 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -331,4 +331,11 @@ static inline long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg)
 #else
 long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg);
 #endif /* CONFIG_SMP */
+
+#ifdef CONFIG_FREEZER
+extern void freeze_workqueues_begin(void);
+extern bool freeze_workqueues_busy(void);
+extern void thaw_workqueues(void);
+#endif /* CONFIG_FREEZER */
+
 #endif
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 5ade1bd..6f89afd 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -15,6 +15,7 @@
 #include <linux/syscalls.h>
 #include <linux/freezer.h>
 #include <linux/delay.h>
+#include <linux/workqueue.h>
 
 /* 
  * Timeout for stopping processes
@@ -35,6 +36,7 @@ static int try_to_freeze_tasks(bool sig_only)
 	struct task_struct *g, *p;
 	unsigned long end_time;
 	unsigned int todo;
+	bool wq_busy = false;
 	struct timeval start, end;
 	u64 elapsed_csecs64;
 	unsigned int elapsed_csecs;
@@ -42,6 +44,10 @@ static int try_to_freeze_tasks(bool sig_only)
 	do_gettimeofday(&start);
 
 	end_time = jiffies + TIMEOUT;
+
+	if (!sig_only)
+		freeze_workqueues_begin();
+
 	while (true) {
 		todo = 0;
 		read_lock(&tasklist_lock);
@@ -63,6 +69,12 @@ static int try_to_freeze_tasks(bool sig_only)
 				todo++;
 		} while_each_thread(g, p);
 		read_unlock(&tasklist_lock);
+
+		if (!sig_only) {
+			wq_busy = freeze_workqueues_busy();
+			todo += wq_busy;
+		}
+
 		if (!todo || time_after(jiffies, end_time))
 			break;
 
@@ -86,9 +98,13 @@ static int try_to_freeze_tasks(bool sig_only)
 		 */
 		printk("\n");
 		printk(KERN_ERR "Freezing of tasks failed after %d.%02d seconds "
-				"(%d tasks refusing to freeze):\n",
-				elapsed_csecs / 100, elapsed_csecs % 100, todo);
+		       "(%d tasks refusing to freeze, wq_busy=%d):\n",
+		       elapsed_csecs / 100, elapsed_csecs % 100,
+		       todo - wq_busy, wq_busy);
 		show_state();
+
+		thaw_workqueues();
+
 		read_lock(&tasklist_lock);
 		do_each_thread(g, p) {
 			task_lock(p);
@@ -158,6 +174,7 @@ void thaw_processes(void)
 	oom_killer_enable();
 
 	printk("Restarting tasks ... ");
+	thaw_workqueues();
 	thaw_tasks(true);
 	thaw_tasks(false);
 	schedule();
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0c9c01d..eca3925 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -78,7 +78,7 @@ struct cpu_workqueue_struct {
 	int			nr_in_flight[WORK_NR_COLORS];
 						/* L: nr of in_flight works */
 	int			nr_active;	/* L: nr of active works */
-	int			max_active;	/* I: max active works */
+	int			max_active;	/* L: max active works */
 	struct list_head	delayed_works;	/* L: delayed works */
 };
 
@@ -108,6 +108,7 @@ struct workqueue_struct {
 	struct list_head	flusher_queue;	/* F: flush waiters */
 	struct list_head	flusher_overflow; /* F: flush overflow list */
 
+	int			saved_max_active; /* I: saved cwq max_active */
 	const char		*name;		/* I: workqueue name */
 #ifdef CONFIG_LOCKDEP
 	struct lockdep_map	lockdep_map;
@@ -228,6 +229,7 @@ static inline void debug_work_deactivate(struct work_struct *work) { }
 static DEFINE_SPINLOCK(workqueue_lock);
 static LIST_HEAD(workqueues);
 static DEFINE_PER_CPU(struct ida, worker_ida);
+static bool workqueue_freezing;		/* W: have wqs started freezing? */
 
 static int worker_thread(void *__worker);
 
@@ -739,19 +741,13 @@ static int worker_thread(void *__worker)
 	struct cpu_workqueue_struct *cwq = worker->cwq;
 	DEFINE_WAIT(wait);
 
-	if (cwq->wq->flags & WQ_FREEZEABLE)
-		set_freezable();
-
 	for (;;) {
 		prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
-		if (!freezing(current) &&
-		    !kthread_should_stop() &&
+		if (!kthread_should_stop() &&
 		    list_empty(&cwq->worklist))
 			schedule();
 		finish_wait(&cwq->more_work, &wait);
 
-		try_to_freeze();
-
 		if (kthread_should_stop())
 			break;
 
@@ -1504,6 +1500,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 		goto err;
 
 	wq->flags = flags;
+	wq->saved_max_active = max_active;
 	mutex_init(&wq->flush_mutex);
 	atomic_set(&wq->nr_cwqs_to_flush, 0);
 	INIT_LIST_HEAD(&wq->flusher_queue);
@@ -1548,8 +1545,19 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 		wq = NULL;
 	}
 
+	/*
+	 * workqueue_lock protects global freeze state and workqueues
+	 * list.  Grab it, set max_active accordingly and add the new
+	 * workqueue to workqueues list.
+	 */
 	spin_lock(&workqueue_lock);
+
+	if (workqueue_freezing && wq->flags & WQ_FREEZEABLE)
+		for_each_possible_cpu(cpu)
+			get_cwq(cpu, wq)->max_active = 0;
+
 	list_add(&wq->list, &workqueues);
+
 	spin_unlock(&workqueue_lock);
 
 	return wq;
@@ -1572,12 +1580,16 @@ void destroy_workqueue(struct workqueue_struct *wq)
 {
 	int cpu;
 
+	flush_workqueue(wq);
+
+	/*
+	 * wq list is used to freeze wq, remove from list after
+	 * flushing is complete in case freeze races us.
+	 */
 	spin_lock(&workqueue_lock);
 	list_del(&wq->list);
 	spin_unlock(&workqueue_lock);
 
-	flush_workqueue(wq);
-
 	for_each_possible_cpu(cpu) {
 		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 		int i;
@@ -1676,6 +1688,137 @@ long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg)
 EXPORT_SYMBOL_GPL(work_on_cpu);
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_FREEZER
+
+/**
+ * freeze_workqueues_begin - begin freezing workqueues
+ *
+ * Start freezing workqueues.  After this function returns, all
+ * freezeable workqueues will queue new works to their frozen_works
+ * list instead of the cwq ones.
+ *
+ * CONTEXT:
+ * Grabs and releases workqueue_lock and cwq->lock's.
+ */
+void freeze_workqueues_begin(void)
+{
+	struct workqueue_struct *wq;
+	unsigned int cpu;
+
+	spin_lock(&workqueue_lock);
+
+	BUG_ON(workqueue_freezing);
+	workqueue_freezing = true;
+
+	for_each_possible_cpu(cpu) {
+		list_for_each_entry(wq, &workqueues, list) {
+			struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+			spin_lock_irq(&cwq->lock);
+
+			if (wq->flags & WQ_FREEZEABLE)
+				cwq->max_active = 0;
+
+			spin_unlock_irq(&cwq->lock);
+		}
+	}
+
+	spin_unlock(&workqueue_lock);
+}
+
+/**
+ * freeze_workqueues_busy - are freezeable workqueues still busy?
+ *
+ * Check whether freezing is complete.  This function must be called
+ * between freeze_workqueues_begin() and thaw_workqueues().
+ *
+ * CONTEXT:
+ * Grabs and releases workqueue_lock.
+ *
+ * RETURNS:
+ * %true if some freezeable workqueues are still busy.  %false if
+ * freezing is complete.
+ */
+bool freeze_workqueues_busy(void)
+{
+	struct workqueue_struct *wq;
+	unsigned int cpu;
+	bool busy = false;
+
+	spin_lock(&workqueue_lock);
+
+	BUG_ON(!workqueue_freezing);
+
+	for_each_possible_cpu(cpu) {
+		/*
+		 * nr_active is monotonically decreasing.  It's safe
+		 * to peek without lock.
+		 */
+		list_for_each_entry(wq, &workqueues, list) {
+			struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+			if (!(wq->flags & WQ_FREEZEABLE))
+				continue;
+
+			BUG_ON(cwq->nr_active < 0);
+			if (cwq->nr_active) {
+				busy = true;
+				goto out_unlock;
+			}
+		}
+	}
+out_unlock:
+	spin_unlock(&workqueue_lock);
+	return busy;
+}
+
+/**
+ * thaw_workqueues - thaw workqueues
+ *
+ * Thaw workqueues.  Normal queueing is restored and all collected
+ * frozen works are transferred to their respective cwq worklists.
+ *
+ * CONTEXT:
+ * Grabs and releases workqueue_lock and cwq->lock's.
+ */
+void thaw_workqueues(void)
+{
+	struct workqueue_struct *wq;
+	unsigned int cpu;
+
+	spin_lock(&workqueue_lock);
+
+	if (!workqueue_freezing)
+		goto out_unlock;
+
+	for_each_possible_cpu(cpu) {
+		list_for_each_entry(wq, &workqueues, list) {
+			struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+			if (!(wq->flags & WQ_FREEZEABLE))
+				continue;
+
+			spin_lock_irq(&cwq->lock);
+
+			/* restore max_active and repopulate worklist */
+			cwq->max_active = wq->saved_max_active;
+
+			while (!list_empty(&cwq->delayed_works) &&
+			       cwq->nr_active < cwq->max_active)
+				cwq_activate_first_delayed(cwq);
+
+			wake_up(&cwq->more_work);
+
+			spin_unlock_irq(&cwq->lock);
+		}
+	}
+
+	workqueue_freezing = false;
+out_unlock:
+	spin_unlock(&workqueue_lock);
+}
+#endif /* CONFIG_FREEZER */
+
 void __init init_workqueues(void)
 {
 	unsigned int cpu;
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 21/27] workqueue: introduce global cwq and unify cwq locks
  2009-12-18 12:57 Tejun Heo
                   ` (19 preceding siblings ...)
  2009-12-18 12:58 ` [PATCH 20/27] workqueue: reimplement workqueue freeze using max_active Tejun Heo
@ 2009-12-18 12:58 ` Tejun Heo
  2009-12-18 12:58 ` [PATCH 22/27] workqueue: implement worker states Tejun Heo
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:58 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

There is one gcwq (global cwq) per cpu and all cwqs on a cpu point to
it.  A gcwq contains a lock to be used by all cwqs on the cpu and an
ida to give IDs to workers belonging to the cpu.

This patch introduces gcwq, moves worker_ida into gcwq and makes all
cwqs on the same cpu use the cpu's gcwq->lock instead of separate
locks.  gcwq->worker_ida is now protected by gcwq->lock too.
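
To visualize the locking change, here is a tiny userspace sketch.
pthread mutexes stand in for spin_lock_irq(), the structure names are
borrowed from the patch, and everything else is purely illustrative:
every cwq on a cpu points at that cpu's gcwq and serializes on
gcwq->lock.

#include <pthread.h>
#include <stdio.h>

#define NR_CPUS         4
#define WQS_PER_CPU     3

struct global_cwq {
        pthread_mutex_t lock;   /* shared by every cwq on this cpu */
        unsigned int cpu;
};

struct cpu_workqueue_struct {
        struct global_cwq *gcwq;        /* no private lock any more */
        int nr_queued;
};

static struct global_cwq global_cwq[NR_CPUS];

int main(void)
{
        static struct cpu_workqueue_struct cwqs[NR_CPUS][WQS_PER_CPU];

        for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
                pthread_mutex_init(&global_cwq[cpu].lock, NULL);
                global_cwq[cpu].cpu = cpu;
                for (int i = 0; i < WQS_PER_CPU; i++)
                        cwqs[cpu][i].gcwq = &global_cwq[cpu];
        }

        /* queueing to any cwq on cpu 1 grabs that cpu's gcwq lock */
        struct cpu_workqueue_struct *cwq = &cwqs[1][2];

        pthread_mutex_lock(&cwq->gcwq->lock);
        cwq->nr_queued++;
        pthread_mutex_unlock(&cwq->gcwq->lock);

        printf("cwq[1][2] queued %d work(s) under cpu %u's gcwq lock\n",
               cwq->nr_queued, cwq->gcwq->cpu);
        return 0;
}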

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c |  156 ++++++++++++++++++++++++++++++++--------------------
 1 files changed, 96 insertions(+), 60 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index eca3925..91c924c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -40,38 +40,45 @@
  *
  * I: Set during initialization and read-only afterwards.
  *
- * L: cwq->lock protected.  Access with cwq->lock held.
+ * L: gcwq->lock protected.  Access with gcwq->lock held.
  *
  * F: wq->flush_mutex protected.
  *
  * W: workqueue_lock protected.
  */
 
+struct global_cwq;
 struct cpu_workqueue_struct;
 
 struct worker {
 	struct work_struct	*current_work;	/* L: work being processed */
 	struct list_head	scheduled;	/* L: scheduled works */
 	struct task_struct	*task;		/* I: worker task */
+	struct global_cwq	*gcwq;		/* I: the associated gcwq */
 	struct cpu_workqueue_struct *cwq;	/* I: the associated cwq */
 	int			id;		/* I: worker id */
 };
 
 /*
+ * Global per-cpu workqueue.
+ */
+struct global_cwq {
+	spinlock_t		lock;		/* the gcwq lock */
+	unsigned int		cpu;		/* I: the associated cpu */
+	struct ida		worker_ida;	/* L: for worker IDs */
+} ____cacheline_aligned_in_smp;
+
+/*
  * The per-CPU workqueue (if single thread, we always use the first
  * possible cpu).  The lower WORK_STRUCT_FLAG_BITS of
  * work_struct->data are used for flags and thus cwqs need to be
  * aligned at two's power of the number of flag bits.
  */
 struct cpu_workqueue_struct {
-
-	spinlock_t lock;
-
+	struct global_cwq	*gcwq;		/* I: the associated gcwq */
 	struct list_head worklist;
 	wait_queue_head_t more_work;
-	unsigned int		cpu;
 	struct worker		*worker;
-
 	struct workqueue_struct *wq;		/* I: the owning workqueue */
 	int			work_color;	/* L: current color */
 	int			flush_color;	/* L: flushing color */
@@ -228,13 +235,19 @@ static inline void debug_work_deactivate(struct work_struct *work) { }
 /* Serializes the accesses to the list of workqueues. */
 static DEFINE_SPINLOCK(workqueue_lock);
 static LIST_HEAD(workqueues);
-static DEFINE_PER_CPU(struct ida, worker_ida);
 static bool workqueue_freezing;		/* W: have wqs started freezing? */
 
+static DEFINE_PER_CPU(struct global_cwq, global_cwq);
+
 static int worker_thread(void *__worker);
 
 static int singlethread_cpu __read_mostly;
 
+static struct global_cwq *get_gcwq(unsigned int cpu)
+{
+	return &per_cpu(global_cwq, cpu);
+}
+
 static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
 					    struct workqueue_struct *wq)
 {
@@ -296,7 +309,7 @@ static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
  * Insert @work into @cwq after @head.
  *
  * CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
  */
 static void insert_work(struct cpu_workqueue_struct *cwq,
 			struct work_struct *work, struct list_head *head,
@@ -319,12 +332,13 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 			 struct work_struct *work)
 {
 	struct cpu_workqueue_struct *cwq = target_cwq(cpu, wq);
+	struct global_cwq *gcwq = cwq->gcwq;
 	struct list_head *worklist;
 	unsigned long flags;
 
 	debug_work_activate(work);
 
-	spin_lock_irqsave(&cwq->lock, flags);
+	spin_lock_irqsave(&gcwq->lock, flags);
 	BUG_ON(!list_empty(&work->entry));
 
 	cwq->nr_in_flight[cwq->work_color]++;
@@ -337,7 +351,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 
 	insert_work(cwq, work, worklist, work_color_to_flags(cwq->work_color));
 
-	spin_unlock_irqrestore(&cwq->lock, flags);
+	spin_unlock_irqrestore(&gcwq->lock, flags);
 }
 
 /**
@@ -476,39 +490,41 @@ static struct worker *alloc_worker(void)
  */
 static struct worker *create_worker(struct cpu_workqueue_struct *cwq, bool bind)
 {
+	struct global_cwq *gcwq = cwq->gcwq;
 	int id = -1;
 	struct worker *worker = NULL;
 
-	spin_lock(&workqueue_lock);
-	while (ida_get_new(&per_cpu(worker_ida, cwq->cpu), &id)) {
-		spin_unlock(&workqueue_lock);
-		if (!ida_pre_get(&per_cpu(worker_ida, cwq->cpu), GFP_KERNEL))
+	spin_lock_irq(&gcwq->lock);
+	while (ida_get_new(&gcwq->worker_ida, &id)) {
+		spin_unlock_irq(&gcwq->lock);
+		if (!ida_pre_get(&gcwq->worker_ida, GFP_KERNEL))
 			goto fail;
-		spin_lock(&workqueue_lock);
+		spin_lock_irq(&gcwq->lock);
 	}
-	spin_unlock(&workqueue_lock);
+	spin_unlock_irq(&gcwq->lock);
 
 	worker = alloc_worker();
 	if (!worker)
 		goto fail;
 
+	worker->gcwq = gcwq;
 	worker->cwq = cwq;
 	worker->id = id;
 
 	worker->task = kthread_create(worker_thread, worker, "kworker/%u:%d",
-				      cwq->cpu, id);
+				      gcwq->cpu, id);
 	if (IS_ERR(worker->task))
 		goto fail;
 
 	if (bind)
-		kthread_bind(worker->task, cwq->cpu);
+		kthread_bind(worker->task, gcwq->cpu);
 
 	return worker;
 fail:
 	if (id >= 0) {
-		spin_lock(&workqueue_lock);
-		ida_remove(&per_cpu(worker_ida, cwq->cpu), id);
-		spin_unlock(&workqueue_lock);
+		spin_lock_irq(&gcwq->lock);
+		ida_remove(&gcwq->worker_ida, id);
+		spin_unlock_irq(&gcwq->lock);
 	}
 	kfree(worker);
 	return NULL;
@@ -521,7 +537,7 @@ fail:
  * Start @worker.
  *
  * CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
  */
 static void start_worker(struct worker *worker)
 {
@@ -536,7 +552,7 @@ static void start_worker(struct worker *worker)
  */
 static void destroy_worker(struct worker *worker)
 {
-	int cpu = worker->cwq->cpu;
+	struct global_cwq *gcwq = worker->gcwq;
 	int id = worker->id;
 
 	/* sanity check frenzy */
@@ -546,9 +562,9 @@ static void destroy_worker(struct worker *worker)
 	kthread_stop(worker->task);
 	kfree(worker);
 
-	spin_lock(&workqueue_lock);
-	ida_remove(&per_cpu(worker_ida, cpu), id);
-	spin_unlock(&workqueue_lock);
+	spin_lock_irq(&gcwq->lock);
+	ida_remove(&gcwq->worker_ida, id);
+	spin_unlock_irq(&gcwq->lock);
 }
 
 /**
@@ -566,7 +582,7 @@ static void destroy_worker(struct worker *worker)
  * nested inside outer list_for_each_entry_safe().
  *
  * CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
  */
 static void move_linked_works(struct work_struct *work, struct list_head *head,
 			      struct work_struct **nextp)
@@ -611,7 +627,7 @@ static void cwq_activate_first_delayed(struct cpu_workqueue_struct *cwq)
  * decrement nr_in_flight of its cwq and handle workqueue flushing.
  *
  * CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
  */
 static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
 {
@@ -658,11 +674,12 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
  * call this function to process a work.
  *
  * CONTEXT:
- * spin_lock_irq(cwq->lock) which is released and regrabbed.
+ * spin_lock_irq(gcwq->lock) which is released and regrabbed.
  */
 static void process_one_work(struct worker *worker, struct work_struct *work)
 {
 	struct cpu_workqueue_struct *cwq = worker->cwq;
+	struct global_cwq *gcwq = cwq->gcwq;
 	work_func_t f = work->func;
 	int work_color;
 #ifdef CONFIG_LOCKDEP
@@ -681,7 +698,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
 	work_color = work_flags_to_color(*work_data_bits(work));
 	list_del_init(&work->entry);
 
-	spin_unlock_irq(&cwq->lock);
+	spin_unlock_irq(&gcwq->lock);
 
 	BUG_ON(get_wq_data(work) != cwq);
 	work_clear_pending(work);
@@ -701,7 +718,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
 		dump_stack();
 	}
 
-	spin_lock_irq(&cwq->lock);
+	spin_lock_irq(&gcwq->lock);
 
 	/* we're done with it, release */
 	worker->current_work = NULL;
@@ -717,7 +734,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
  * fetches a work from the top and executes it.
  *
  * CONTEXT:
- * spin_lock_irq(cwq->lock) which may be released and regrabbed
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
  * multiple times.
  */
 static void process_scheduled_works(struct worker *worker)
@@ -738,6 +755,7 @@ static void process_scheduled_works(struct worker *worker)
 static int worker_thread(void *__worker)
 {
 	struct worker *worker = __worker;
+	struct global_cwq *gcwq = worker->gcwq;
 	struct cpu_workqueue_struct *cwq = worker->cwq;
 	DEFINE_WAIT(wait);
 
@@ -751,7 +769,7 @@ static int worker_thread(void *__worker)
 		if (kthread_should_stop())
 			break;
 
-		spin_lock_irq(&cwq->lock);
+		spin_lock_irq(&gcwq->lock);
 
 		while (!list_empty(&cwq->worklist)) {
 			struct work_struct *work =
@@ -771,7 +789,7 @@ static int worker_thread(void *__worker)
 			}
 		}
 
-		spin_unlock_irq(&cwq->lock);
+		spin_unlock_irq(&gcwq->lock);
 	}
 
 	return 0;
@@ -810,7 +828,7 @@ static void wq_barrier_func(struct work_struct *work)
  * underneath us, so we can't reliably determine cwq from @target.
  *
  * CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
  */
 static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
 			      struct wq_barrier *barr,
@@ -820,7 +838,7 @@ static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
 	unsigned int linked = 0;
 
 	/*
-	 * debugobject calls are safe here even with cwq->lock locked
+	 * debugobject calls are safe here even with gcwq->lock locked
 	 * as we know for sure that this will not trigger any of the
 	 * checks and call back into the fixup functions where we
 	 * might deadlock.
@@ -890,8 +908,9 @@ static bool flush_workqueue_prep_cwqs(struct workqueue_struct *wq,
 
 	for_each_possible_cpu(cpu) {
 		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+		struct global_cwq *gcwq = cwq->gcwq;
 
-		spin_lock_irq(&cwq->lock);
+		spin_lock_irq(&gcwq->lock);
 
 		if (flush_color >= 0) {
 			BUG_ON(cwq->flush_color != -1);
@@ -908,7 +927,7 @@ static bool flush_workqueue_prep_cwqs(struct workqueue_struct *wq,
 			cwq->work_color = work_color;
 		}
 
-		spin_unlock_irq(&cwq->lock);
+		spin_unlock_irq(&gcwq->lock);
 	}
 
 	return wait;
@@ -1081,17 +1100,19 @@ int flush_work(struct work_struct *work)
 {
 	struct worker *worker = NULL;
 	struct cpu_workqueue_struct *cwq;
+	struct global_cwq *gcwq;
 	struct wq_barrier barr;
 
 	might_sleep();
 	cwq = get_wq_data(work);
 	if (!cwq)
 		return 0;
+	gcwq = cwq->gcwq;
 
 	lock_map_acquire(&cwq->wq->lockdep_map);
 	lock_map_release(&cwq->wq->lockdep_map);
 
-	spin_lock_irq(&cwq->lock);
+	spin_lock_irq(&gcwq->lock);
 	if (!list_empty(&work->entry)) {
 		/*
 		 * See the comment near try_to_grab_pending()->smp_rmb().
@@ -1108,12 +1129,12 @@ int flush_work(struct work_struct *work)
 	}
 
 	insert_wq_barrier(cwq, &barr, work, worker);
-	spin_unlock_irq(&cwq->lock);
+	spin_unlock_irq(&gcwq->lock);
 	wait_for_completion(&barr.done);
 	destroy_work_on_stack(&barr.work);
 	return 1;
 already_gone:
-	spin_unlock_irq(&cwq->lock);
+	spin_unlock_irq(&gcwq->lock);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(flush_work);
@@ -1124,6 +1145,7 @@ EXPORT_SYMBOL_GPL(flush_work);
  */
 static int try_to_grab_pending(struct work_struct *work)
 {
+	struct global_cwq *gcwq;
 	struct cpu_workqueue_struct *cwq;
 	int ret = -1;
 
@@ -1138,8 +1160,9 @@ static int try_to_grab_pending(struct work_struct *work)
 	cwq = get_wq_data(work);
 	if (!cwq)
 		return ret;
+	gcwq = cwq->gcwq;
 
-	spin_lock_irq(&cwq->lock);
+	spin_lock_irq(&gcwq->lock);
 	if (!list_empty(&work->entry)) {
 		/*
 		 * This work is queued, but perhaps we locked the wrong cwq.
@@ -1155,7 +1178,7 @@ static int try_to_grab_pending(struct work_struct *work)
 			ret = 1;
 		}
 	}
-	spin_unlock_irq(&cwq->lock);
+	spin_unlock_irq(&gcwq->lock);
 
 	return ret;
 }
@@ -1163,10 +1186,11 @@ static int try_to_grab_pending(struct work_struct *work)
 static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,
 				struct work_struct *work)
 {
+	struct global_cwq *gcwq = cwq->gcwq;
 	struct wq_barrier barr;
 	struct worker *worker;
 
-	spin_lock_irq(&cwq->lock);
+	spin_lock_irq(&gcwq->lock);
 
 	worker = NULL;
 	if (unlikely(cwq->worker && cwq->worker->current_work == work)) {
@@ -1174,7 +1198,7 @@ static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,
 		insert_wq_barrier(cwq, &barr, work, worker);
 	}
 
-	spin_unlock_irq(&cwq->lock);
+	spin_unlock_irq(&gcwq->lock);
 
 	if (unlikely(worker)) {
 		wait_for_completion(&barr.done);
@@ -1518,13 +1542,13 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	 */
 	for_each_possible_cpu(cpu) {
 		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+		struct global_cwq *gcwq = get_gcwq(cpu);
 
 		BUG_ON((unsigned long)cwq & WORK_STRUCT_FLAG_MASK);
-		cwq->cpu = cpu;
+		cwq->gcwq = gcwq;
 		cwq->wq = wq;
 		cwq->flush_color = -1;
 		cwq->max_active = max_active;
-		spin_lock_init(&cwq->lock);
 		INIT_LIST_HEAD(&cwq->worklist);
 		INIT_LIST_HEAD(&cwq->delayed_works);
 		init_waitqueue_head(&cwq->more_work);
@@ -1698,7 +1722,7 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
  * list instead of the cwq ones.
  *
  * CONTEXT:
- * Grabs and releases workqueue_lock and cwq->lock's.
+ * Grabs and releases workqueue_lock and gcwq->lock's.
  */
 void freeze_workqueues_begin(void)
 {
@@ -1711,16 +1735,18 @@ void freeze_workqueues_begin(void)
 	workqueue_freezing = true;
 
 	for_each_possible_cpu(cpu) {
+		struct global_cwq *gcwq = get_gcwq(cpu);
+
+		spin_lock_irq(&gcwq->lock);
+
 		list_for_each_entry(wq, &workqueues, list) {
 			struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 
-			spin_lock_irq(&cwq->lock);
-
 			if (wq->flags & WQ_FREEZEABLE)
 				cwq->max_active = 0;
-
-			spin_unlock_irq(&cwq->lock);
 		}
+
+		spin_unlock_irq(&gcwq->lock);
 	}
 
 	spin_unlock(&workqueue_lock);
@@ -1779,7 +1805,7 @@ out_unlock:
  * frozen works are transferred to their respective cwq worklists.
  *
  * CONTEXT:
- * Grabs and releases workqueue_lock and cwq->lock's.
+ * Grabs and releases workqueue_lock and gcwq->lock's.
  */
 void thaw_workqueues(void)
 {
@@ -1792,14 +1818,16 @@ void thaw_workqueues(void)
 		goto out_unlock;
 
 	for_each_possible_cpu(cpu) {
+		struct global_cwq *gcwq = get_gcwq(cpu);
+
+		spin_lock_irq(&gcwq->lock);
+
 		list_for_each_entry(wq, &workqueues, list) {
 			struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 
 			if (!(wq->flags & WQ_FREEZEABLE))
 				continue;
 
-			spin_lock_irq(&cwq->lock);
-
 			/* restore max_active and repopulate worklist */
 			cwq->max_active = wq->saved_max_active;
 
@@ -1808,9 +1836,9 @@ void thaw_workqueues(void)
 				cwq_activate_first_delayed(cwq);
 
 			wake_up(&cwq->more_work);
-
-			spin_unlock_irq(&cwq->lock);
 		}
+
+		spin_unlock_irq(&gcwq->lock);
 	}
 
 	workqueue_freezing = false;
@@ -1831,11 +1859,19 @@ void __init init_workqueues(void)
 	BUILD_BUG_ON(__alignof__(struct cpu_workqueue_struct) <
 		     __alignof__(unsigned long long));
 
-	for_each_possible_cpu(cpu)
-		ida_init(&per_cpu(worker_ida, cpu));
-
 	singlethread_cpu = cpumask_first(cpu_possible_mask);
 	hotcpu_notifier(workqueue_cpu_callback, 0);
+
+	/* initialize gcwqs */
+	for_each_possible_cpu(cpu) {
+		struct global_cwq *gcwq = get_gcwq(cpu);
+
+		spin_lock_init(&gcwq->lock);
+		gcwq->cpu = cpu;
+
+		ida_init(&gcwq->worker_ida);
+	}
+
 	keventd_wq = create_workqueue("events");
 	BUG_ON(!keventd_wq);
 }
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 22/27] workqueue: implement worker states
  2009-12-18 12:57 Tejun Heo
                   ` (20 preceding siblings ...)
  2009-12-18 12:58 ` [PATCH 21/27] workqueue: introduce global cwq and unify cwq locks Tejun Heo
@ 2009-12-18 12:58 ` Tejun Heo
  2009-12-18 12:58 ` [PATCH 23/27] workqueue: reimplement CPU hotplugging support using trustee Tejun Heo
                   ` (7 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:58 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

Implement worker states.  After creation, a worker is STARTED.  While
a worker isn't processing a work, it is IDLE and chained on
gcwq->idle_list.  While processing a work, a worker is BUSY and
chained on gcwq->busy_hash.  The gcwq also now counts the total number
of workers and the number of idle ones.

worker_thread() is restructured to reflect the state transitions.
cwq->more_work is removed and waking up a worker now makes it check
for events.  A worker is killed by setting the DIE flag while it is
IDLE and then waking it up.

This gives the gcwq better visibility into what is going on and
allows it to quickly determine whether a given work is currently
executing, which is necessary for having multiple workers process the
same cwq.
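
The lookup relies on a simple shift-and-fold hash over the work's
address.  The userspace sketch below mirrors busy_worker_head() from
the patch; the surrounding program is illustrative only.

#include <stdio.h>
#include <stdlib.h>

#define BUSY_WORKER_HASH_ORDER  5
#define BUSY_WORKER_HASH_SIZE   (1 << BUSY_WORKER_HASH_ORDER)
#define BUSY_WORKER_HASH_MASK   (BUSY_WORKER_HASH_SIZE - 1)

struct work_struct { void (*func)(void *); unsigned long data; };

/* rough stand-in for ilog2(sizeof(struct work_struct)) */
static unsigned int base_shift(void)
{
        size_t size = sizeof(struct work_struct);
        unsigned int shift = 0;

        while (size >>= 1)
                shift++;
        return shift;
}

static unsigned long busy_hash(const struct work_struct *work)
{
        unsigned long v = (unsigned long)work;

        v >>= base_shift();                     /* drop alignment bits */
        v += v >> BUSY_WORKER_HASH_ORDER;       /* fold high bits in */
        return v & BUSY_WORKER_HASH_MASK;
}

int main(void)
{
        for (int i = 0; i < 4; i++) {
                struct work_struct *w = malloc(sizeof(*w));

                printf("work %p hashes to bucket %lu\n",
                       (void *)w, busy_hash(w));
                free(w);
        }
        return 0;
}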

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c |  207 ++++++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 170 insertions(+), 37 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 91c924c..104f72f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -35,6 +35,17 @@
 #include <linux/lockdep.h>
 #include <linux/idr.h>
 
+enum {
+	/* worker flags */
+	WORKER_STARTED		= 1 << 0,	/* started */
+	WORKER_DIE		= 1 << 1,	/* die die die */
+	WORKER_IDLE		= 1 << 2,	/* is idle */
+
+	BUSY_WORKER_HASH_ORDER	= 5,		/* 32 pointers */
+	BUSY_WORKER_HASH_SIZE	= 1 << BUSY_WORKER_HASH_ORDER,
+	BUSY_WORKER_HASH_MASK	= BUSY_WORKER_HASH_SIZE - 1,
+};
+
 /*
  * Structure fields follow one of the following exclusion rules.
  *
@@ -51,11 +62,18 @@ struct global_cwq;
 struct cpu_workqueue_struct;
 
 struct worker {
+	/* on idle list while idle, on busy hash table while busy */
+	union {
+		struct list_head	entry;	/* L: while idle */
+		struct hlist_node	hentry;	/* L: while busy */
+	};
+
 	struct work_struct	*current_work;	/* L: work being processed */
 	struct list_head	scheduled;	/* L: scheduled works */
 	struct task_struct	*task;		/* I: worker task */
 	struct global_cwq	*gcwq;		/* I: the associated gcwq */
 	struct cpu_workqueue_struct *cwq;	/* I: the associated cwq */
+	unsigned int		flags;		/* L: flags */
 	int			id;		/* I: worker id */
 };
 
@@ -65,6 +83,15 @@ struct worker {
 struct global_cwq {
 	spinlock_t		lock;		/* the gcwq lock */
 	unsigned int		cpu;		/* I: the associated cpu */
+
+	int			nr_workers;	/* L: total number of workers */
+	int			nr_idle;	/* L: currently idle ones */
+
+	/* workers are chained either in the idle_list or busy_hash */
+	struct list_head	idle_list;	/* L: list of idle workers */
+	struct hlist_head	busy_hash[BUSY_WORKER_HASH_SIZE];
+						/* L: hash of busy workers */
+
 	struct ida		worker_ida;	/* L: for worker IDs */
 } ____cacheline_aligned_in_smp;
 
@@ -77,7 +104,6 @@ struct global_cwq {
 struct cpu_workqueue_struct {
 	struct global_cwq	*gcwq;		/* I: the associated gcwq */
 	struct list_head worklist;
-	wait_queue_head_t more_work;
 	struct worker		*worker;
 	struct workqueue_struct *wq;		/* I: the owning workqueue */
 	int			work_color;	/* L: current color */
@@ -300,6 +326,33 @@ static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
 }
 
 /**
+ * busy_worker_head - return the busy hash head for a work
+ * @gcwq: gcwq of interest
+ * @work: work to be hashed
+ *
+ * Return hash head of @gcwq for @work.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to the hash head.
+ */
+static struct hlist_head *busy_worker_head(struct global_cwq *gcwq,
+					   struct work_struct *work)
+{
+	const int base_shift = ilog2(sizeof(struct work_struct));
+	unsigned long v = (unsigned long)work;
+
+	/* simple shift and fold hash, do we need something better? */
+	v >>= base_shift;
+	v += v >> BUSY_WORKER_HASH_ORDER;
+	v &= BUSY_WORKER_HASH_MASK;
+
+	return &gcwq->busy_hash[v];
+}
+
+/**
  * insert_work - insert a work into cwq
  * @cwq: cwq @work belongs to
  * @work: work to insert
@@ -325,7 +378,7 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
 	smp_wmb();
 
 	list_add_tail(&work->entry, head);
-	wake_up(&cwq->more_work);
+	wake_up_process(cwq->worker->task);
 }
 
 static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
@@ -463,13 +516,59 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL_GPL(queue_delayed_work_on);
 
+/**
+ * worker_enter_idle - enter idle state
+ * @worker: worker which is entering idle state
+ *
+ * @worker is entering idle state.  Update stats and idle timer if
+ * necessary.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock).
+ */
+static void worker_enter_idle(struct worker *worker)
+{
+	struct global_cwq *gcwq = worker->gcwq;
+
+	BUG_ON(worker->flags & WORKER_IDLE);
+	BUG_ON(!list_empty(&worker->entry) &&
+	       (worker->hentry.next || worker->hentry.pprev));
+
+	worker->flags |= WORKER_IDLE;
+	gcwq->nr_idle++;
+
+	/* idle_list is LIFO */
+	list_add(&worker->entry, &gcwq->idle_list);
+}
+
+/**
+ * worker_leave_idle - leave idle state
+ * @worker: worker which is leaving idle state
+ *
+ * @worker is leaving idle state.  Update stats.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock).
+ */
+static void worker_leave_idle(struct worker *worker)
+{
+	struct global_cwq *gcwq = worker->gcwq;
+
+	BUG_ON(!(worker->flags & WORKER_IDLE));
+	worker->flags &= ~WORKER_IDLE;
+	gcwq->nr_idle--;
+	list_del_init(&worker->entry);
+}
+
 static struct worker *alloc_worker(void)
 {
 	struct worker *worker;
 
 	worker = kzalloc(sizeof(*worker), GFP_KERNEL);
-	if (worker)
+	if (worker) {
+		INIT_LIST_HEAD(&worker->entry);
 		INIT_LIST_HEAD(&worker->scheduled);
+	}
 	return worker;
 }
 
@@ -534,13 +633,16 @@ fail:
  * start_worker - start a newly created worker
  * @worker: worker to start
  *
- * Start @worker.
+ * Make the gcwq aware of @worker and start it.
  *
  * CONTEXT:
  * spin_lock_irq(gcwq->lock).
  */
 static void start_worker(struct worker *worker)
 {
+	worker->flags |= WORKER_STARTED;
+	worker->gcwq->nr_workers++;
+	worker_enter_idle(worker);
 	wake_up_process(worker->task);
 }
 
@@ -548,7 +650,10 @@ static void start_worker(struct worker *worker)
  * destroy_worker - destroy a workqueue worker
  * @worker: worker to be destroyed
  *
- * Destroy @worker.
+ * Destroy @worker and adjust @gcwq stats accordingly.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which is released and regrabbed.
  */
 static void destroy_worker(struct worker *worker)
 {
@@ -559,12 +664,21 @@ static void destroy_worker(struct worker *worker)
 	BUG_ON(worker->current_work);
 	BUG_ON(!list_empty(&worker->scheduled));
 
+	if (worker->flags & WORKER_STARTED)
+		gcwq->nr_workers--;
+	if (worker->flags & WORKER_IDLE)
+		gcwq->nr_idle--;
+
+	list_del_init(&worker->entry);
+	worker->flags |= WORKER_DIE;
+
+	spin_unlock_irq(&gcwq->lock);
+
 	kthread_stop(worker->task);
 	kfree(worker);
 
 	spin_lock_irq(&gcwq->lock);
 	ida_remove(&gcwq->worker_ida, id);
-	spin_unlock_irq(&gcwq->lock);
 }
 
 /**
@@ -680,6 +794,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
 {
 	struct cpu_workqueue_struct *cwq = worker->cwq;
 	struct global_cwq *gcwq = cwq->gcwq;
+	struct hlist_head *bwh = busy_worker_head(gcwq, work);
 	work_func_t f = work->func;
 	int work_color;
 #ifdef CONFIG_LOCKDEP
@@ -694,6 +809,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
 #endif
 	/* claim and process */
 	debug_work_deactivate(work);
+	hlist_add_head(&worker->hentry, bwh);
 	worker->current_work = work;
 	work_color = work_flags_to_color(*work_data_bits(work));
 	list_del_init(&work->entry);
@@ -721,6 +837,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
 	spin_lock_irq(&gcwq->lock);
 
 	/* we're done with it, release */
+	hlist_del_init(&worker->hentry);
 	worker->current_work = NULL;
 	cwq_dec_nr_in_flight(cwq, work_color);
 }
@@ -757,42 +874,52 @@ static int worker_thread(void *__worker)
 	struct worker *worker = __worker;
 	struct global_cwq *gcwq = worker->gcwq;
 	struct cpu_workqueue_struct *cwq = worker->cwq;
-	DEFINE_WAIT(wait);
 
-	for (;;) {
-		prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
-		if (!kthread_should_stop() &&
-		    list_empty(&cwq->worklist))
-			schedule();
-		finish_wait(&cwq->more_work, &wait);
+woke_up:
+	spin_lock_irq(&gcwq->lock);
 
-		if (kthread_should_stop())
-			break;
+	/* DIE can be set only while we're idle, checking here is enough */
+	if (worker->flags & WORKER_DIE) {
+		spin_unlock_irq(&gcwq->lock);
+		return 0;
+	}
 
-		spin_lock_irq(&gcwq->lock);
+	worker_leave_idle(worker);
 
-		while (!list_empty(&cwq->worklist)) {
-			struct work_struct *work =
-				list_first_entry(&cwq->worklist,
-						 struct work_struct, entry);
-
-			if (likely(!(*work_data_bits(work) &
-				     WORK_STRUCT_LINKED))) {
-				/* optimization path, not strictly necessary */
-				process_one_work(worker, work);
-				if (unlikely(!list_empty(&worker->scheduled)))
-					process_scheduled_works(worker);
-			} else {
-				move_linked_works(work, &worker->scheduled,
-						  NULL);
+	/*
+	 * ->scheduled list can only be filled while a worker is
+	 * preparing to process a work or actually processing it.
+	 * Make sure nobody diddled with it while I was sleeping.
+	 */
+	BUG_ON(!list_empty(&worker->scheduled));
+
+	while (!list_empty(&cwq->worklist)) {
+		struct work_struct *work =
+			list_first_entry(&cwq->worklist,
+					 struct work_struct, entry);
+
+		if (likely(!(*work_data_bits(work) & WORK_STRUCT_LINKED))) {
+			/* optimization path, not strictly necessary */
+			process_one_work(worker, work);
+			if (unlikely(!list_empty(&worker->scheduled)))
 				process_scheduled_works(worker);
-			}
+		} else {
+			move_linked_works(work, &worker->scheduled, NULL);
+			process_scheduled_works(worker);
 		}
-
-		spin_unlock_irq(&gcwq->lock);
 	}
 
-	return 0;
+	/*
+	 * gcwq->lock is held and there's no work to process, sleep.
+	 * Workers are woken up only while holding gcwq->lock, so
+	 * setting the current state before releasing gcwq->lock is
+	 * enough to prevent losing any event.
+	 */
+	worker_enter_idle(worker);
+	__set_current_state(TASK_INTERRUPTIBLE);
+	spin_unlock_irq(&gcwq->lock);
+	schedule();
+	goto woke_up;
 }
 
 struct wq_barrier {
@@ -1551,7 +1678,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 		cwq->max_active = max_active;
 		INIT_LIST_HEAD(&cwq->worklist);
 		INIT_LIST_HEAD(&cwq->delayed_works);
-		init_waitqueue_head(&cwq->more_work);
 
 		if (failed)
 			continue;
@@ -1602,7 +1728,7 @@ EXPORT_SYMBOL_GPL(__create_workqueue_key);
  */
 void destroy_workqueue(struct workqueue_struct *wq)
 {
-	int cpu;
+	unsigned int cpu;
 
 	flush_workqueue(wq);
 
@@ -1619,8 +1745,10 @@ void destroy_workqueue(struct workqueue_struct *wq)
 		int i;
 
 		if (cwq->worker) {
+			spin_lock_irq(&cwq->gcwq->lock);
 			destroy_worker(cwq->worker);
 			cwq->worker = NULL;
+			spin_unlock_irq(&cwq->gcwq->lock);
 		}
 
 		for (i = 0; i < WORK_NR_COLORS; i++)
@@ -1835,7 +1963,7 @@ void thaw_workqueues(void)
 			       cwq->nr_active < cwq->max_active)
 				cwq_activate_first_delayed(cwq);
 
-			wake_up(&cwq->more_work);
+			wake_up_process(cwq->worker->task);
 		}
 
 		spin_unlock_irq(&gcwq->lock);
@@ -1850,6 +1978,7 @@ out_unlock:
 void __init init_workqueues(void)
 {
 	unsigned int cpu;
+	int i;
 
 	/*
 	 * cwqs are forced aligned according to WORK_STRUCT_FLAG_BITS.
@@ -1869,6 +1998,10 @@ void __init init_workqueues(void)
 		spin_lock_init(&gcwq->lock);
 		gcwq->cpu = cpu;
 
+		INIT_LIST_HEAD(&gcwq->idle_list);
+		for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++)
+			INIT_HLIST_HEAD(&gcwq->busy_hash[i]);
+
 		ida_init(&gcwq->worker_ida);
 	}
 
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 23/27] workqueue: reimplement CPU hotplugging support using trustee
  2009-12-18 12:57 Tejun Heo
                   ` (21 preceding siblings ...)
  2009-12-18 12:58 ` [PATCH 22/27] workqueue: implement worker states Tejun Heo
@ 2009-12-18 12:58 ` Tejun Heo
  2009-12-18 12:58 ` [PATCH 24/27] workqueue: make single thread workqueue shared worker pool friendly Tejun Heo
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:58 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

Reimplement CPU hotplugging support using a trustee thread.  On CPU
down, a trustee thread is created and each step of the CPU down
operation is executed by the trustee, while workqueue_cpu_callback()
simply drives and waits for trustee state transitions.

The CPU down operation no longer waits for works to be drained;
instead, the trustee sticks around until all pending works have been
completed.  If the CPU is brought back up while works are still
draining, workqueue_cpu_callback() tells the trustee to step down and
rebinds the workers to the cpu.

As it's difficult to tell whether cwqs are empty while freezing or
frozen, the trustee doesn't consider draining to be complete while a
gcwq is freezing or frozen (tracked by the new GCWQ_FREEZING flag).
Also, workers which get unbound from their cpu are marked with
WORKER_ROGUE.

The trustee-based implementation doesn't bring any new feature at
this point, but it will be used to manage the worker pool when the
dynamic shared worker pool is implemented.
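
The state machine the trustee and workqueue_cpu_callback() walk
through can be sketched as below.  The real code runs the trustee in a
kthread and synchronizes on gcwq->trustee_wait; this userspace
fragment only simulates the transitions for illustration.

#include <stdio.h>

enum trustee_state {
        TRUSTEE_START,          /* command: take over the gcwq */
        TRUSTEE_IN_CHARGE,      /* trustee owns the gcwq, draining works */
        TRUSTEE_BUTCHER,        /* cpu is gone, butcher idle workers */
        TRUSTEE_RELEASE,        /* cpu came back, hand workers back */
        TRUSTEE_DONE,           /* trustee has finished */
};

static const char *state_name[] = {
        "START", "IN_CHARGE", "BUTCHER", "RELEASE", "DONE",
};

/* matches init_workqueues() setting trustee_state to TRUSTEE_DONE */
static enum trustee_state state = TRUSTEE_DONE;

static void set_state(enum trustee_state new_state)
{
        printf("trustee: %s -> %s\n",
               state_name[state], state_name[new_state]);
        state = new_state;
}

int main(void)
{
        /* CPU_DOWN_PREPARE: start a trustee, wait for IN_CHARGE */
        set_state(TRUSTEE_START);
        set_state(TRUSTEE_IN_CHARGE);

        /* CPU_POST_DEAD: the cpu is really gone */
        set_state(TRUSTEE_BUTCHER);

        /* worklist drained and gcwq not freezing/frozen */
        set_state(TRUSTEE_DONE);

        /*
         * Alternative path: CPU_DOWN_FAILED or CPU_ONLINE would set
         * RELEASE instead of BUTCHER and the trustee would go
         * RELEASE -> DONE once the workers were rebound.
         */
        return 0;
}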

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c |  296 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 281 insertions(+), 15 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 104f72f..265f480 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -36,14 +36,27 @@
 #include <linux/idr.h>
 
 enum {
+	/* global_cwq flags */
+	GCWQ_FREEZING		= 1 << 3,	/* freeze in progress */
+
 	/* worker flags */
 	WORKER_STARTED		= 1 << 0,	/* started */
 	WORKER_DIE		= 1 << 1,	/* die die die */
 	WORKER_IDLE		= 1 << 2,	/* is idle */
+	WORKER_ROGUE		= 1 << 4,	/* not bound to any cpu */
+
+	/* gcwq->trustee_state */
+	TRUSTEE_START		= 0,		/* start */
+	TRUSTEE_IN_CHARGE	= 1,		/* trustee in charge of gcwq */
+	TRUSTEE_BUTCHER		= 2,		/* butcher workers */
+	TRUSTEE_RELEASE		= 3,		/* release workers */
+	TRUSTEE_DONE		= 4,		/* trustee is done */
 
 	BUSY_WORKER_HASH_ORDER	= 5,		/* 32 pointers */
 	BUSY_WORKER_HASH_SIZE	= 1 << BUSY_WORKER_HASH_ORDER,
 	BUSY_WORKER_HASH_MASK	= BUSY_WORKER_HASH_SIZE - 1,
+
+	TRUSTEE_COOLDOWN	= HZ / 10,	/* for trustee draining */
 };
 
 /*
@@ -83,6 +96,7 @@ struct worker {
 struct global_cwq {
 	spinlock_t		lock;		/* the gcwq lock */
 	unsigned int		cpu;		/* I: the associated cpu */
+	unsigned int		flags;		/* L: GCWQ_* flags */
 
 	int			nr_workers;	/* L: total number of workers */
 	int			nr_idle;	/* L: currently idle ones */
@@ -93,6 +107,10 @@ struct global_cwq {
 						/* L: hash of busy workers */
 
 	struct ida		worker_ida;	/* L: for worker IDs */
+
+	struct task_struct	*trustee;	/* L: for gcwq shutdown */
+	unsigned int		trustee_state;	/* L: trustee state */
+	wait_queue_head_t	trustee_wait;	/* trustee wait */
 } ____cacheline_aligned_in_smp;
 
 /*
@@ -148,6 +166,10 @@ struct workqueue_struct {
 #endif
 };
 
+#define for_each_busy_worker(worker, i, pos, gcwq)			\
+	for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++)			\
+		hlist_for_each_entry(worker, pos, &gcwq->busy_hash[i], hentry)
+
 #ifdef CONFIG_DEBUG_OBJECTS_WORK
 
 static struct debug_obj_descr work_debug_descr;
@@ -539,6 +561,9 @@ static void worker_enter_idle(struct worker *worker)
 
 	/* idle_list is LIFO */
 	list_add(&worker->entry, &gcwq->idle_list);
+
+	if (unlikely(worker->flags & WORKER_ROGUE))
+		wake_up_all(&gcwq->trustee_wait);
 }
 
 /**
@@ -615,8 +640,15 @@ static struct worker *create_worker(struct cpu_workqueue_struct *cwq, bool bind)
 	if (IS_ERR(worker->task))
 		goto fail;
 
+	/*
+	 * A rogue worker will become a regular one if CPU comes
+	 * online later on.  Make sure every worker has
+	 * PF_THREAD_BOUND set.
+	 */
 	if (bind)
 		kthread_bind(worker->task, gcwq->cpu);
+	else
+		worker->task->flags |= PF_THREAD_BOUND;
 
 	return worker;
 fail:
@@ -1762,34 +1794,259 @@ void destroy_workqueue(struct workqueue_struct *wq)
 }
 EXPORT_SYMBOL_GPL(destroy_workqueue);
 
+/*
+ * CPU hotplug.
+ *
+ * CPU hotplug is implemented by allowing cwqs to be detached from
+ * CPU, running with unbound workers and allowing them to be
+ * reattached later if the cpu comes back online.  A separate thread
+ * is created to govern cwqs in such state and is called the trustee.
+ *
+ * Trustee states and their descriptions.
+ *
+ * START	Command state used on startup.  On CPU_DOWN_PREPARE, a
+ *		new trustee is started with this state.
+ *
+ * IN_CHARGE	Once started, trustee will enter this state after
+ *		making all existing workers rogue.  DOWN_PREPARE waits
+ *		for trustee to enter this state.  After reaching
+ *		IN_CHARGE, trustee tries to execute the pending
+ *		worklist until it's empty and the state is set to
+ *		BUTCHER, or the state is set to RELEASE.
+ *
+ * BUTCHER	Command state which is set by the cpu callback after
+ *		the cpu has went down.  Once this state is set trustee
+ *		knows that there will be no new works on the worklist
+ *		and once the worklist is empty it can proceed to
+ *		killing idle workers.
+ *
+ * RELEASE	Command state which is set by the cpu callback if the
+ *		cpu down has been canceled or it has come online
+ *		again.  After recognizing this state, trustee stops
+ *		trying to drain or butcher and transits to DONE.
+ *
+ * DONE		Trustee will enter this state after BUTCHER or RELEASE
+ *		is complete.
+ *
+ *          trustee                 CPU                draining
+ *         took over                down               complete
+ * START -----------> IN_CHARGE -----------> BUTCHER -----------> DONE
+ *                        |                     |                  ^
+ *                        | CPU is back online  v   return workers |
+ *                         ----------------> RELEASE --------------
+ */
+
+/**
+ * trustee_wait_event_timeout - timed event wait for trustee
+ * @cond: condition to wait for
+ * @timeout: timeout in jiffies
+ *
+ * wait_event_timeout() for trustee to use.  Handles locking and
+ * checks for RELEASE request.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times.  To be used by trustee.
+ *
+ * RETURNS:
+ * Positive indicating left time if @cond is satisfied, 0 if timed
+ * out, -1 if canceled.
+ */
+#define trustee_wait_event_timeout(cond, timeout) ({			\
+	long __ret = (timeout);						\
+	while (!((cond) || (gcwq->trustee_state == TRUSTEE_RELEASE)) &&	\
+	       __ret) {							\
+		spin_unlock_irq(&gcwq->lock);				\
+		__wait_event_timeout(gcwq->trustee_wait, (cond) ||	\
+			(gcwq->trustee_state == TRUSTEE_RELEASE),	\
+			__ret);						\
+		spin_lock_irq(&gcwq->lock);				\
+	}								\
+	gcwq->trustee_state == TRUSTEE_RELEASE ? -1 : (__ret);		\
+})
+
+/**
+ * trustee_wait_event - event wait for trustee
+ * @cond: condition to wait for
+ *
+ * wait_event() for trustee to use.  Automatically handles locking and
+ * checks for CANCEL request.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times.  To be used by trustee.
+ *
+ * RETURNS:
+ * 0 if @cond is satisfied, -1 if canceled.
+ */
+#define trustee_wait_event(cond) ({					\
+	long __ret1;							\
+	__ret1 = trustee_wait_event_timeout(cond, MAX_SCHEDULE_TIMEOUT);\
+	__ret1 < 0 ? -1 : 0;						\
+})
+
+static bool __cpuinit trustee_unset_rogue(struct worker *worker)
+{
+	struct global_cwq *gcwq = worker->gcwq;
+	int rc;
+
+	if (!(worker->flags & WORKER_ROGUE))
+		return false;
+
+	spin_unlock_irq(&gcwq->lock);
+	rc = __set_cpus_allowed(worker->task, get_cpu_mask(gcwq->cpu), true);
+	BUG_ON(rc);
+	spin_lock_irq(&gcwq->lock);
+	worker->flags &= ~WORKER_ROGUE;
+	return true;
+}
+
+static int __cpuinit trustee_thread(void *__gcwq)
+{
+	struct global_cwq *gcwq = __gcwq;
+	struct worker *worker;
+	struct hlist_node *pos;
+	int i;
+
+	BUG_ON(gcwq->cpu != smp_processor_id());
+
+	spin_lock_irq(&gcwq->lock);
+	/*
+	 * Make all multithread workers rogue.  Trustee must be bound
+	 * to the target cpu and can't be cancelled.
+	 */
+	BUG_ON(gcwq->cpu != smp_processor_id());
+
+	list_for_each_entry(worker, &gcwq->idle_list, entry)
+		if (!(worker->cwq->wq->flags & WQ_SINGLE_THREAD))
+			worker->flags |= WORKER_ROGUE;
+
+	for_each_busy_worker(worker, i, pos, gcwq)
+		if (!(worker->cwq->wq->flags & WQ_SINGLE_THREAD))
+			worker->flags |= WORKER_ROGUE;
+
+	/*
+	 * We're now in charge.  Notify and proceed to drain.  We need
+	 * to keep the gcwq running during the whole CPU down
+	 * procedure as other cpu hotunplug callbacks may need to
+	 * flush currently running tasks.
+	 */
+	gcwq->trustee_state = TRUSTEE_IN_CHARGE;
+	wake_up_all(&gcwq->trustee_wait);
+
+	/*
+	 * The original cpu is in the process of dying and may go away
+	 * anytime now.  When that happens, we and all workers would
+	 * be migrated to other cpus.  Try draining any left work.
+	 * Note that if the gcwq is frozen, there may be frozen works
+	 * in freezeable cwqs.  Don't declare completion while frozen.
+	 */
+	while (gcwq->nr_workers != gcwq->nr_idle ||
+	       gcwq->flags & GCWQ_FREEZING ||
+	       gcwq->trustee_state == TRUSTEE_IN_CHARGE) {
+		/* give a breather */
+		if (trustee_wait_event_timeout(false, TRUSTEE_COOLDOWN) < 0)
+			break;
+	}
+
+	/* notify completion */
+	gcwq->trustee = NULL;
+	gcwq->trustee_state = TRUSTEE_DONE;
+	wake_up_all(&gcwq->trustee_wait);
+	spin_unlock_irq(&gcwq->lock);
+	return 0;
+}
+
+/**
+ * wait_trustee_state - wait for trustee to enter the specified state
+ * @gcwq: gcwq the trustee of interest belongs to
+ * @state: target state to wait for
+ *
+ * Wait for the trustee to reach @state.  DONE is already matched.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times.  To be used by cpu_callback.
+ */
+static void __cpuinit wait_trustee_state(struct global_cwq *gcwq, int state)
+{
+	if (!(gcwq->trustee_state == state ||
+	      gcwq->trustee_state == TRUSTEE_DONE)) {
+		spin_unlock_irq(&gcwq->lock);
+		__wait_event(gcwq->trustee_wait,
+			     gcwq->trustee_state == state ||
+			     gcwq->trustee_state == TRUSTEE_DONE);
+		spin_lock_irq(&gcwq->lock);
+	}
+}
+
 static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
 						unsigned long action,
 						void *hcpu)
 {
 	unsigned int cpu = (unsigned long)hcpu;
-	struct cpu_workqueue_struct *cwq;
-	struct workqueue_struct *wq;
+	struct global_cwq *gcwq = get_gcwq(cpu);
+	struct task_struct *new_trustee = NULL;
+	struct worker *worker;
+	struct hlist_node *pos;
+	int i;
 
 	action &= ~CPU_TASKS_FROZEN;
 
-	list_for_each_entry(wq, &workqueues, list) {
-		if (wq->flags & WQ_SINGLE_THREAD)
-			continue;
-
-		cwq = get_cwq(cpu, wq);
+	switch (action) {
+	case CPU_DOWN_PREPARE:
+		new_trustee = kthread_create(trustee_thread, gcwq,
+					     "workqueue_trustee/%d\n", cpu);
+		if (IS_ERR(new_trustee))
+			return NOTIFY_BAD;
+		kthread_bind(new_trustee, cpu);
+	}
 
-		switch (action) {
-		case CPU_ONLINE:
-			__set_cpus_allowed(cwq->worker->task,
-					   get_cpu_mask(cpu), true);
-			break;
+	spin_lock_irq(&gcwq->lock);
 
-		case CPU_POST_DEAD:
-			flush_workqueue(wq);
-			break;
+	switch (action) {
+	case CPU_DOWN_PREPARE:
+		/* initialize trustee and tell it to acquire the gcwq */
+		BUG_ON(gcwq->trustee || gcwq->trustee_state != TRUSTEE_DONE);
+		gcwq->trustee = new_trustee;
+		gcwq->trustee_state = TRUSTEE_START;
+		wake_up_process(gcwq->trustee);
+		wait_trustee_state(gcwq, TRUSTEE_IN_CHARGE);
+		break;
+
+	case CPU_POST_DEAD:
+		gcwq->trustee_state = TRUSTEE_BUTCHER;
+		break;
+
+	case CPU_DOWN_FAILED:
+	case CPU_ONLINE:
+		if (gcwq->trustee_state != TRUSTEE_DONE) {
+			gcwq->trustee_state = TRUSTEE_RELEASE;
+			wake_up_process(gcwq->trustee);
+			wait_trustee_state(gcwq, TRUSTEE_DONE);
 		}
+
+		/*
+		 * Clear ROGUE from and rebind all multithread
+		 * workers.  Unsetting ROGUE and rebinding require
+		 * dropping gcwq->lock.  Restart loop after each
+		 * successful release.
+		 */
+	recheck:
+		list_for_each_entry(worker, &gcwq->idle_list, entry)
+			if (!(worker->cwq->wq->flags & WQ_SINGLE_THREAD) &&
+			    trustee_unset_rogue(worker))
+				goto recheck;
+
+		for_each_busy_worker(worker, i, pos, gcwq)
+			if (!(worker->cwq->wq->flags & WQ_SINGLE_THREAD) &&
+			    trustee_unset_rogue(worker))
+				goto recheck;
+		break;
 	}
 
+	spin_unlock_irq(&gcwq->lock);
+
 	return NOTIFY_OK;
 }
 
@@ -1867,6 +2124,9 @@ void freeze_workqueues_begin(void)
 
 		spin_lock_irq(&gcwq->lock);
 
+		BUG_ON(gcwq->flags & GCWQ_FREEZING);
+		gcwq->flags |= GCWQ_FREEZING;
+
 		list_for_each_entry(wq, &workqueues, list) {
 			struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 
@@ -1950,6 +2210,9 @@ void thaw_workqueues(void)
 
 		spin_lock_irq(&gcwq->lock);
 
+		BUG_ON(!(gcwq->flags & GCWQ_FREEZING));
+		gcwq->flags &= ~GCWQ_FREEZING;
+
 		list_for_each_entry(wq, &workqueues, list) {
 			struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 
@@ -2003,6 +2266,9 @@ void __init init_workqueues(void)
 			INIT_HLIST_HEAD(&gcwq->busy_hash[i]);
 
 		ida_init(&gcwq->worker_ida);
+
+		gcwq->trustee_state = TRUSTEE_DONE;
+		init_waitqueue_head(&gcwq->trustee_wait);
 	}
 
 	keventd_wq = create_workqueue("events");
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 24/27] workqueue: make single thread workqueue shared worker pool friendly
  2009-12-18 12:57 Tejun Heo
                   ` (22 preceding siblings ...)
  2009-12-18 12:58 ` [PATCH 23/27] workqueue: reimplement CPU hotplugging support using trustee Tejun Heo
@ 2009-12-18 12:58 ` Tejun Heo
  2009-12-18 12:58 ` [PATCH 25/27] workqueue: use shared worklist and pool all workers per cpu Tejun Heo
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:58 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

Reimplement the st (single thread) workqueue so that it is friendly
to a shared worker pool.  It was originally implemented by confining
st workqueues to the cwq of a fixed cpu and always having a worker for
that cpu.  This implementation isn't very friendly to a shared worker
pool and is suboptimal in that it often ends up crossing cpu
boundaries.

Reimplement the st workqueue using dynamic single cpu binding and
cwq->max_active.  WQ_SINGLE_THREAD is replaced with WQ_SINGLE_CPU.  In
a single cpu workqueue, at most one cwq is bound to the wq at any
given time.  Arbitration is done using atomic accesses to
wq->single_cpu when queueing a work.  Once bound, the binding stays
until the workqueue is drained.

Note that the binding is never broken while a workqueue is frozen.
This is because idle cwqs may have works waiting on their
delayed_works queues while frozen.  On thaw, the cwq is restarted if
there are any delayed works, or unbound otherwise.

When combined with a max_active limit of 1, a single cpu workqueue
has exactly the same execution properties as the original single
thread workqueue while allowing sharing of per-cpu workers.
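
The arbitration on wq->single_cpu can be sketched in userspace as
follows.  C11 atomics stand in for the kernel's cmpxchg() and the
recheck under gcwq->lock, and the unbind-on-drain step is simplified,
so treat this purely as an illustration of the idea.

#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS 4       /* also the "unbound" sentinel, as in the patch */

struct workqueue {
        atomic_uint single_cpu;         /* NR_CPUS while unbound */
        int nr_pending[NR_CPUS];
};

static unsigned int queue_work(struct workqueue *wq, unsigned int req_cpu)
{
        for (;;) {
                unsigned int cpu = atomic_load(&wq->single_cpu);
                unsigned int expected = NR_CPUS;

                if (cpu == NR_CPUS) {
                        /* unbound: try to claim the wq for req_cpu */
                        atomic_compare_exchange_strong(&wq->single_cpu,
                                                       &expected, req_cpu);
                        cpu = atomic_load(&wq->single_cpu);
                }
                /* retry if the owner changed while we were looking */
                if (atomic_load(&wq->single_cpu) == cpu) {
                        wq->nr_pending[cpu]++;
                        return cpu;
                }
        }
}

static void work_done(struct workqueue *wq, unsigned int cpu)
{
        /* last work gone: drop the binding */
        if (--wq->nr_pending[cpu] == 0)
                atomic_store(&wq->single_cpu, NR_CPUS);
}

int main(void)
{
        struct workqueue wq = { .single_cpu = NR_CPUS };

        unsigned int a = queue_work(&wq, 2);    /* claims cpu 2 */
        unsigned int b = queue_work(&wq, 0);    /* follows to cpu 2 */

        printf("works queued on cpus %u and %u\n", a, b);

        work_done(&wq, a);
        work_done(&wq, b);
        printf("unbound again: %d\n",
               atomic_load(&wq.single_cpu) == NR_CPUS);
        return 0;
}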

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/workqueue.h |    6 +-
 kernel/workqueue.c        |  139 ++++++++++++++++++++++++++++++++-------------
 2 files changed, 103 insertions(+), 42 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 7a260df..b012da7 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -212,7 +212,7 @@ static inline bool work_static(struct work_struct *work) { return false; }
 
 enum {
 	WQ_FREEZEABLE		= 1 << 0, /* freeze during suspend */
-	WQ_SINGLE_THREAD	= 1 << 1, /* no per-cpu worker */
+	WQ_SINGLE_CPU		= 1 << 1, /* only single cpu at a time */
 };
 
 extern struct workqueue_struct *
@@ -241,9 +241,9 @@ __create_workqueue_key(const char *name, unsigned int flags, int max_active,
 #define create_workqueue(name)					\
 	__create_workqueue((name), 0, 1)
 #define create_freezeable_workqueue(name)			\
-	__create_workqueue((name), WQ_FREEZEABLE | WQ_SINGLE_THREAD, 1)
+	__create_workqueue((name), WQ_FREEZEABLE | WQ_SINGLE_CPU, 1)
 #define create_singlethread_workqueue(name)			\
-	__create_workqueue((name), WQ_SINGLE_THREAD, 1)
+	__create_workqueue((name), WQ_SINGLE_CPU, 1)
 
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 265f480..19cfa12 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -114,8 +114,7 @@ struct global_cwq {
 } ____cacheline_aligned_in_smp;
 
 /*
- * The per-CPU workqueue (if single thread, we always use the first
- * possible cpu).  The lower WORK_STRUCT_FLAG_BITS of
+ * The per-CPU workqueue.  The lower WORK_STRUCT_FLAG_BITS of
  * work_struct->data are used for flags and thus cwqs need to be
  * aligned at two's power of the number of flag bits.
  */
@@ -159,6 +158,8 @@ struct workqueue_struct {
 	struct list_head	flusher_queue;	/* F: flush waiters */
 	struct list_head	flusher_overflow; /* F: flush overflow list */
 
+	unsigned long		single_cpu;	/* cpu for single cpu wq */
+
 	int			saved_max_active; /* I: saved cwq max_active */
 	const char		*name;		/* I: workqueue name */
 #ifdef CONFIG_LOCKDEP
@@ -289,8 +290,6 @@ static DEFINE_PER_CPU(struct global_cwq, global_cwq);
 
 static int worker_thread(void *__worker);
 
-static int singlethread_cpu __read_mostly;
-
 static struct global_cwq *get_gcwq(unsigned int cpu)
 {
 	return &per_cpu(global_cwq, cpu);
@@ -302,14 +301,6 @@ static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
 	return per_cpu_ptr(wq->cpu_wq, cpu);
 }
 
-static struct cpu_workqueue_struct *target_cwq(unsigned int cpu,
-					       struct workqueue_struct *wq)
-{
-	if (unlikely(wq->flags & WQ_SINGLE_THREAD))
-		cpu = singlethread_cpu;
-	return get_cwq(cpu, wq);
-}
-
 static unsigned int work_color_to_flags(int color)
 {
 	return color << WORK_STRUCT_COLOR_SHIFT;
@@ -403,17 +394,84 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
 	wake_up_process(cwq->worker->task);
 }
 
-static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
-			 struct work_struct *work)
+/**
+ * cwq_unbind_single_cpu - unbind cwq from single cpu workqueue processing
+ * @cwq: cwq to unbind
+ *
+ * Try to unbind @cwq from single cpu workqueue processing.  If
+ * @cwq->wq is frozen, unbind is delayed till the workqueue is thawed.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ */
+static void cwq_unbind_single_cpu(struct cpu_workqueue_struct *cwq)
 {
-	struct cpu_workqueue_struct *cwq = target_cwq(cpu, wq);
+	struct workqueue_struct *wq = cwq->wq;
 	struct global_cwq *gcwq = cwq->gcwq;
+
+	BUG_ON(wq->single_cpu != gcwq->cpu);
+	/*
+	 * Unbind from workqueue if @cwq is not frozen.  If frozen,
+	 * thaw_workqueues() will either restart processing on this
+	 * cpu or unbind if empty.  This keeps works queued while
+	 * frozen fully ordered and flushable.
+	 */
+	if (likely(!(gcwq->flags & GCWQ_FREEZING))) {
+		smp_wmb();	/* paired with cmpxchg() in __queue_work() */
+		wq->single_cpu = NR_CPUS;
+	}
+}
+
+static void __queue_work(unsigned int req_cpu, struct workqueue_struct *wq,
+			 struct work_struct *work)
+{
+	struct global_cwq *gcwq;
+	struct cpu_workqueue_struct *cwq;
 	struct list_head *worklist;
+	unsigned int cpu;
 	unsigned long flags;
+	bool arbitrate;
 
 	debug_work_activate(work);
 
-	spin_lock_irqsave(&gcwq->lock, flags);
+	if (!(wq->flags & WQ_SINGLE_CPU)) {
+		/* just use the requested cpu for multicpu workqueues */
+		cwq = get_cwq(req_cpu, wq);
+		gcwq = cwq->gcwq;
+		spin_lock_irqsave(&gcwq->lock, flags);
+	} else {
+		/*
+		 * It's a bit more complex for single cpu workqueues.
+		 * We first need to determine which cpu is going to be
+		 * used.  If no cpu is currently serving this
+		 * workqueue, arbitrate using atomic accesses to
+		 * wq->single_cpu; otherwise, use the current one.
+		 */
+	retry:
+		cpu = wq->single_cpu;
+		arbitrate = cpu == NR_CPUS;
+		if (arbitrate)
+			cpu = req_cpu;
+
+		cwq = get_cwq(cpu, wq);
+		gcwq = cwq->gcwq;
+		spin_lock_irqsave(&gcwq->lock, flags);
+
+		/*
+		 * The following cmpxchg() is full barrier paired with
+		 * smp_wmb() in cwq_unbind_single_cpu() and guarantees
+		 * that all changes to wq->st_* fields are visible on
+		 * the new cpu after this point.
+		 */
+		if (arbitrate)
+			cmpxchg(&wq->single_cpu, NR_CPUS, cpu);
+
+		if (unlikely(wq->single_cpu != cpu)) {
+			spin_unlock_irqrestore(&gcwq->lock, flags);
+			goto retry;
+		}
+	}
+
 	BUG_ON(!list_empty(&work->entry));
 
 	cwq->nr_in_flight[cwq->work_color]++;
@@ -523,7 +581,7 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 		timer_stats_timer_set_start_info(&dwork->timer);
 
 		/* This stores cwq for the moment, for the timer_fn */
-		set_wq_data(work, target_cwq(raw_smp_processor_id(), wq), 0);
+		set_wq_data(work, get_cwq(raw_smp_processor_id(), wq), 0);
 		timer->expires = jiffies + delay;
 		timer->data = (unsigned long)dwork;
 		timer->function = delayed_work_timer_fn;
@@ -784,10 +842,14 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
 	cwq->nr_in_flight[color]--;
 	cwq->nr_active--;
 
-	/* one down, submit a delayed one */
-	if (!list_empty(&cwq->delayed_works) &&
-	    cwq->nr_active < cwq->max_active)
-		cwq_activate_first_delayed(cwq);
+	if (!list_empty(&cwq->delayed_works)) {
+		/* one down, submit a delayed one */
+		if (cwq->nr_active < cwq->max_active)
+			cwq_activate_first_delayed(cwq);
+	} else if (!cwq->nr_active && cwq->wq->flags & WQ_SINGLE_CPU) {
+		/* this was the last work, unbind from single cpu */
+		cwq_unbind_single_cpu(cwq);
+	}
 
 	/* is flush in progress and are we at the flushing tip? */
 	if (likely(cwq->flush_color != color))
@@ -1667,7 +1729,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 						struct lock_class_key *key,
 						const char *lock_name)
 {
-	bool singlethread = flags & WQ_SINGLE_THREAD;
 	struct workqueue_struct *wq;
 	bool failed = false;
 	unsigned int cpu;
@@ -1688,6 +1749,8 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	atomic_set(&wq->nr_cwqs_to_flush, 0);
 	INIT_LIST_HEAD(&wq->flusher_queue);
 	INIT_LIST_HEAD(&wq->flusher_overflow);
+	wq->single_cpu = NR_CPUS;
+
 	wq->name = name;
 	lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
 	INIT_LIST_HEAD(&wq->list);
@@ -1713,8 +1776,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 
 		if (failed)
 			continue;
-		cwq->worker = create_worker(cwq,
-					    cpu_online(cpu) && !singlethread);
+		cwq->worker = create_worker(cwq, cpu_online(cpu));
 		if (cwq->worker)
 			start_worker(cwq->worker);
 		else
@@ -1912,18 +1974,16 @@ static int __cpuinit trustee_thread(void *__gcwq)
 
 	spin_lock_irq(&gcwq->lock);
 	/*
-	 * Make all multithread workers rogue.  Trustee must be bound
-	 * to the target cpu and can't be cancelled.
+	 * Make all workers rogue.  Trustee must be bound to the
+	 * target cpu and can't be cancelled.
 	 */
 	BUG_ON(gcwq->cpu != smp_processor_id());
 
 	list_for_each_entry(worker, &gcwq->idle_list, entry)
-		if (!(worker->cwq->wq->flags & WQ_SINGLE_THREAD))
-			worker->flags |= WORKER_ROGUE;
+		worker->flags |= WORKER_ROGUE;
 
 	for_each_busy_worker(worker, i, pos, gcwq)
-		if (!(worker->cwq->wq->flags & WQ_SINGLE_THREAD))
-			worker->flags |= WORKER_ROGUE;
+		worker->flags |= WORKER_ROGUE;
 
 	/*
 	 * We're now in charge.  Notify and proceed to drain.  We need
@@ -2027,20 +2087,17 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
 		}
 
 		/*
-		 * Clear ROGUE from and rebind all multithread
-		 * workers.  Unsetting ROGUE and rebinding require
-		 * dropping gcwq->lock.  Restart loop after each
-		 * successful release.
+		 * Clear ROGUE from and rebind all workers.  Unsetting
+		 * ROGUE and rebinding require dropping gcwq->lock.
+		 * Restart loop after each successful release.
 		 */
 	recheck:
 		list_for_each_entry(worker, &gcwq->idle_list, entry)
-			if (!(worker->cwq->wq->flags & WQ_SINGLE_THREAD) &&
-			    trustee_unset_rogue(worker))
+			if (trustee_unset_rogue(worker))
 				goto recheck;
 
 		for_each_busy_worker(worker, i, pos, gcwq)
-			if (!(worker->cwq->wq->flags & WQ_SINGLE_THREAD) &&
-			    trustee_unset_rogue(worker))
+			if (trustee_unset_rogue(worker))
 				goto recheck;
 		break;
 	}
@@ -2226,6 +2283,11 @@ void thaw_workqueues(void)
 			       cwq->nr_active < cwq->max_active)
 				cwq_activate_first_delayed(cwq);
 
+			/* perform delayed unbind from single cpu if empty */
+			if (wq->single_cpu == gcwq->cpu &&
+			    !cwq->nr_active && list_empty(&cwq->delayed_works))
+				cwq_unbind_single_cpu(cwq);
+
 			wake_up_process(cwq->worker->task);
 		}
 
@@ -2251,7 +2313,6 @@ void __init init_workqueues(void)
 	BUILD_BUG_ON(__alignof__(struct cpu_workqueue_struct) <
 		     __alignof__(unsigned long long));
 
-	singlethread_cpu = cpumask_first(cpu_possible_mask);
 	hotcpu_notifier(workqueue_cpu_callback, 0);
 
 	/* initialize gcwqs */
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 25/27] workqueue: use shared worklist and pool all workers per cpu
  2009-12-18 12:57 Tejun Heo
                   ` (23 preceding siblings ...)
  2009-12-18 12:58 ` [PATCH 24/27] workqueue: make single thread workqueue shared worker pool friendly Tejun Heo
@ 2009-12-18 12:58 ` Tejun Heo
  2009-12-18 12:58 ` [PATCH 26/27] workqueue: implement concurrency managed dynamic worker pool Tejun Heo
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:58 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

Use gcwq->worklist instead of cwq->worklist and break the strict
association between a cwq and its worker.  All works queued on a cpu
are queued on gcwq->worklist and processed by any available worker on
the gcwq.
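
(Illustration only, not part of the patch.)  A minimal userspace
sketch of a shared per-cpu worklist; the names loosely follow the
patch below but locking, work colors and per-cwq accounting are left
out:

#include <stdio.h>

struct work {
	struct work *next;
	void (*func)(struct work *);
	const char *wq_name;		/* owning workqueue, for show */
};

struct gcwq {				/* one per cpu in the patch */
	struct work *worklist;		/* shared by all workqueues */
	struct work **tail;
};

static void queue_work(struct gcwq *gcwq, struct work *work)
{
	work->next = NULL;
	*gcwq->tail = work;		/* list_add_tail() equivalent */
	gcwq->tail = &work->next;
}

static void worker_run(struct gcwq *gcwq)
{
	while (gcwq->worklist) {	/* any available worker may do this */
		struct work *work = gcwq->worklist;

		gcwq->worklist = work->next;
		if (!gcwq->worklist)
			gcwq->tail = &gcwq->worklist;
		work->func(work);
	}
}

static void say(struct work *w) { printf("%s work ran\n", w->wq_name); }

int main(void)
{
	struct gcwq gcwq = { .worklist = NULL, .tail = &gcwq.worklist };
	struct work a = { .func = say, .wq_name = "wq_a" };
	struct work b = { .func = say, .wq_name = "wq_b" };

	queue_work(&gcwq, &a);		/* two different workqueues ... */
	queue_work(&gcwq, &b);		/* ... feed the same worklist */
	worker_run(&gcwq);
	return 0;
}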

As there no longer is a strict association between a cwq and its
worker, whether a work is executing can now only be determined by
looking it up in gcwq->busy_hash[].  [__]find_worker_executing_work()
are implemented for this purpose and used wherever it's necessary to
find out whether a work is being executed and, if so, which worker is
executing it.
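
(Again illustrative only.)  A reduced userspace model of the busy-hash
lookup; the bucket count, pointer hash and plain singly linked chain
are stand-ins for the hlist based code further down:

#include <stdint.h>
#include <stdio.h>

#define BUSY_HASH_SIZE 64		/* assumed size for the sketch */

struct work { int id; };

struct worker {
	struct worker *next;		/* bucket chain */
	struct work *current_work;	/* work this worker is executing */
};

static struct worker *busy_hash[BUSY_HASH_SIZE];

/* hash the work pointer into a bucket, as busy_worker_head() does */
static unsigned int busy_head(struct work *work)
{
	return ((uintptr_t)work >> 4) % BUSY_HASH_SIZE;
}

/* is anyone on this cpu already executing @work?  NULL if not */
static struct worker *find_worker_executing_work(struct work *work)
{
	struct worker *w;

	for (w = busy_hash[busy_head(work)]; w; w = w->next)
		if (w->current_work == work)
			return w;
	return NULL;
}

int main(void)
{
	struct work w = { .id = 1 };
	struct worker busy = { .next = NULL, .current_work = &w };

	busy_hash[busy_head(&w)] = &busy;	/* a worker picked up @w */
	printf("executing: %s\n",
	       find_worker_executing_work(&w) ? "yes" : "no");
	return 0;
}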

After this change, the only association between a cwq and its worker
is that a cwq puts a worker into the shared worker pool on creation
and kills it on destruction.  As all workqueues are still limited to a
max_active of one, there are always at least as many workers as active
works and thus there's no danger of deadlock.

Breaking the strong association between cwqs and workers requires
somewhat clumsy changes to current_is_keventd() and
destroy_workqueue().  Dynamic worker pool management will remove both
clumsy changes.  current_is_keventd() won't be necessary at all, as
the only reason it exists is to avoid queueing a work from within
another work, which will be allowed just fine.  The clumsy part of
destroy_workqueue() is added because a worker can only be destroyed
while idle and there's no guarantee that a worker is idle when its wq
is going down.  With dynamic pool management, workers are not
associated with workqueues at all and only idle ones will be submitted
to destroy_workqueue(), so the code won't be necessary anymore.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c |  192 +++++++++++++++++++++++++++++++++++++++++----------
 1 files changed, 154 insertions(+), 38 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 19cfa12..f38d263 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -72,7 +72,6 @@ enum {
  */
 
 struct global_cwq;
-struct cpu_workqueue_struct;
 
 struct worker {
 	/* on idle list while idle, on busy hash table while busy */
@@ -85,7 +84,6 @@ struct worker {
 	struct list_head	scheduled;	/* L: scheduled works */
 	struct task_struct	*task;		/* I: worker task */
 	struct global_cwq	*gcwq;		/* I: the associated gcwq */
-	struct cpu_workqueue_struct *cwq;	/* I: the associated cwq */
 	unsigned int		flags;		/* L: flags */
 	int			id;		/* I: worker id */
 };
@@ -95,6 +93,7 @@ struct worker {
  */
 struct global_cwq {
 	spinlock_t		lock;		/* the gcwq lock */
+	struct list_head	worklist;	/* L: list of pending works */
 	unsigned int		cpu;		/* I: the associated cpu */
 	unsigned int		flags;		/* L: GCWQ_* flags */
 
@@ -120,7 +119,6 @@ struct global_cwq {
  */
 struct cpu_workqueue_struct {
 	struct global_cwq	*gcwq;		/* I: the associated gcwq */
-	struct list_head worklist;
 	struct worker		*worker;
 	struct workqueue_struct *wq;		/* I: the owning workqueue */
 	int			work_color;	/* L: current color */
@@ -338,6 +336,32 @@ static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
 			WORK_STRUCT_WQ_DATA_MASK);
 }
 
+/* Return the first worker.  Safe with preemption disabled */
+static struct worker *first_worker(struct global_cwq *gcwq)
+{
+	if (unlikely(list_empty(&gcwq->idle_list)))
+		return NULL;
+
+	return list_first_entry(&gcwq->idle_list, struct worker, entry);
+}
+
+/**
+ * wake_up_worker - wake up an idle worker
+ * @gcwq: gcwq to wake worker for
+ *
+ * Wake up the first idle worker of @gcwq.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ */
+static void wake_up_worker(struct global_cwq *gcwq)
+{
+	struct worker *worker = first_worker(gcwq);
+
+	if (likely(worker))
+		wake_up_process(worker->task);
+}
+
 /**
  * busy_worker_head - return the busy hash head for a work
  * @gcwq: gcwq of interest
@@ -366,13 +390,67 @@ static struct hlist_head *busy_worker_head(struct global_cwq *gcwq,
 }
 
 /**
- * insert_work - insert a work into cwq
+ * __find_worker_executing_work - find worker which is executing a work
+ * @gcwq: gcwq of interest
+ * @bwh: hash head as returned by busy_worker_head()
+ * @work: work to find worker for
+ *
+ * Find a worker which is executing @work on @gcwq.  @bwh should be
+ * the hash head obtained by calling busy_worker_head() with the same
+ * work.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to worker which is executing @work if found, NULL
+ * otherwise.
+ */
+static struct worker *__find_worker_executing_work(struct global_cwq *gcwq,
+						   struct hlist_head *bwh,
+						   struct work_struct *work)
+{
+	struct worker *worker;
+	struct hlist_node *tmp;
+
+	hlist_for_each_entry(worker, tmp, bwh, hentry)
+		if (worker->current_work == work)
+			return worker;
+	return NULL;
+}
+
+/**
+ * find_worker_executing_work - find worker which is executing a work
+ * @gcwq: gcwq of interest
+ * @work: work to find worker for
+ *
+ * Find a worker which is executing @work on @gcwq.  This function is
+ * identical to __find_worker_executing_work() except that this
+ * function calculates @bwh itself.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to worker which is executing @work if found, NULL
+ * otherwise.
+ */
+static struct worker *find_worker_executing_work(struct global_cwq *gcwq,
+						 struct work_struct *work)
+{
+	return __find_worker_executing_work(gcwq, busy_worker_head(gcwq, work),
+					    work);
+}
+
+/**
+ * insert_work - insert a work into gcwq
  * @cwq: cwq @work belongs to
  * @work: work to insert
  * @head: insertion point
  * @extra_flags: extra WORK_STRUCT_* flags to set
  *
- * Insert @work into @cwq after @head.
+ * Insert @work which belongs to @cwq into @gcwq after @head.
+ * @extra_flags is or'd to work_struct flags.
  *
  * CONTEXT:
  * spin_lock_irq(gcwq->lock).
@@ -391,7 +469,7 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
 	smp_wmb();
 
 	list_add_tail(&work->entry, head);
-	wake_up_process(cwq->worker->task);
+	wake_up_worker(cwq->gcwq);
 }
 
 /**
@@ -478,7 +556,7 @@ static void __queue_work(unsigned int req_cpu, struct workqueue_struct *wq,
 
 	if (likely(cwq->nr_active < cwq->max_active)) {
 		cwq->nr_active++;
-		worklist = &cwq->worklist;
+		worklist = &gcwq->worklist;
 	} else
 		worklist = &cwq->delayed_works;
 
@@ -657,10 +735,10 @@ static struct worker *alloc_worker(void)
 
 /**
  * create_worker - create a new workqueue worker
- * @cwq: cwq the new worker will belong to
+ * @gcwq: gcwq the new worker will belong to
  * @bind: whether to set affinity to @cpu or not
  *
- * Create a new worker which is bound to @cwq.  The returned worker
+ * Create a new worker which is bound to @gcwq.  The returned worker
  * can be started by calling start_worker() or destroyed using
  * destroy_worker().
  *
@@ -670,9 +748,8 @@ static struct worker *alloc_worker(void)
  * RETURNS:
  * Pointer to the newly created worker.
  */
-static struct worker *create_worker(struct cpu_workqueue_struct *cwq, bool bind)
+static struct worker *create_worker(struct global_cwq *gcwq, bool bind)
 {
-	struct global_cwq *gcwq = cwq->gcwq;
 	int id = -1;
 	struct worker *worker = NULL;
 
@@ -690,7 +767,6 @@ static struct worker *create_worker(struct cpu_workqueue_struct *cwq, bool bind)
 		goto fail;
 
 	worker->gcwq = gcwq;
-	worker->cwq = cwq;
 	worker->id = id;
 
 	worker->task = kthread_create(worker_thread, worker, "kworker/%u:%d",
@@ -818,7 +894,7 @@ static void cwq_activate_first_delayed(struct cpu_workqueue_struct *cwq)
 	struct work_struct *work = list_first_entry(&cwq->delayed_works,
 						    struct work_struct, entry);
 
-	move_linked_works(work, &cwq->worklist, NULL);
+	move_linked_works(work, &cwq->gcwq->worklist, NULL);
 	cwq->nr_active++;
 }
 
@@ -886,11 +962,12 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq, int color)
  */
 static void process_one_work(struct worker *worker, struct work_struct *work)
 {
-	struct cpu_workqueue_struct *cwq = worker->cwq;
+	struct cpu_workqueue_struct *cwq = get_wq_data(work);
 	struct global_cwq *gcwq = cwq->gcwq;
 	struct hlist_head *bwh = busy_worker_head(gcwq, work);
 	work_func_t f = work->func;
 	int work_color;
+	struct worker *collision;
 #ifdef CONFIG_LOCKDEP
 	/*
 	 * It is permissible to free the struct work_struct from
@@ -901,6 +978,18 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
 	 */
 	struct lockdep_map lockdep_map = work->lockdep_map;
 #endif
+	/*
+	 * A single work shouldn't be executed concurrently by
+	 * multiple workers on a single cpu.  Check whether anyone is
+	 * already processing the work.  If so, defer the work to the
+	 * currently executing one.
+	 */
+	collision = __find_worker_executing_work(gcwq, bwh, work);
+	if (unlikely(collision)) {
+		move_linked_works(work, &collision->scheduled, NULL);
+		return;
+	}
+
 	/* claim and process */
 	debug_work_deactivate(work);
 	hlist_add_head(&worker->hentry, bwh);
@@ -910,7 +999,6 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
 
 	spin_unlock_irq(&gcwq->lock);
 
-	BUG_ON(get_wq_data(work) != cwq);
 	work_clear_pending(work);
 	lock_map_acquire(&cwq->wq->lockdep_map);
 	lock_map_acquire(&lockdep_map);
@@ -967,7 +1055,6 @@ static int worker_thread(void *__worker)
 {
 	struct worker *worker = __worker;
 	struct global_cwq *gcwq = worker->gcwq;
-	struct cpu_workqueue_struct *cwq = worker->cwq;
 
 woke_up:
 	spin_lock_irq(&gcwq->lock);
@@ -987,9 +1074,9 @@ woke_up:
 	 */
 	BUG_ON(!list_empty(&worker->scheduled));
 
-	while (!list_empty(&cwq->worklist)) {
+	while (!list_empty(&gcwq->worklist)) {
 		struct work_struct *work =
-			list_first_entry(&cwq->worklist,
+			list_first_entry(&gcwq->worklist,
 					 struct work_struct, entry);
 
 		if (likely(!(*work_data_bits(work) & WORK_STRUCT_LINKED))) {
@@ -1343,8 +1430,7 @@ int flush_work(struct work_struct *work)
 		if (unlikely(cwq != get_wq_data(work)))
 			goto already_gone;
 	} else {
-		if (cwq->worker && cwq->worker->current_work == work)
-			worker = cwq->worker;
+		worker = find_worker_executing_work(gcwq, work);
 		if (!worker)
 			goto already_gone;
 	}
@@ -1413,11 +1499,9 @@ static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,
 
 	spin_lock_irq(&gcwq->lock);
 
-	worker = NULL;
-	if (unlikely(cwq->worker && cwq->worker->current_work == work)) {
-		worker = cwq->worker;
+	worker = find_worker_executing_work(gcwq, work);
+	if (unlikely(worker))
 		insert_wq_barrier(cwq, &barr, work, worker);
-	}
 
 	spin_unlock_irq(&gcwq->lock);
 
@@ -1671,18 +1755,37 @@ int keventd_up(void)
 
 int current_is_keventd(void)
 {
-	struct cpu_workqueue_struct *cwq;
-	int cpu = raw_smp_processor_id(); /* preempt-safe: keventd is per-cpu */
-	int ret = 0;
+	bool found = false;
+	unsigned int cpu;
 
-	BUG_ON(!keventd_wq);
+	/*
+	 * There no longer is one-to-one relation between worker and
+	 * work queue and a worker task might be unbound from its cpu
+	 * if the cpu was offlined.  Match all busy workers.  This
+	 * function will go away once dynamic pool is implemented.
+	 */
+	for_each_possible_cpu(cpu) {
+		struct global_cwq *gcwq = get_gcwq(cpu);
+		struct worker *worker;
+		struct hlist_node *pos;
+		unsigned long flags;
+		int i;
 
-	cwq = get_cwq(cpu, keventd_wq);
-	if (current == cwq->worker->task)
-		ret = 1;
+		spin_lock_irqsave(&gcwq->lock, flags);
 
-	return ret;
+		for_each_busy_worker(worker, i, pos, gcwq) {
+			if (worker->task == current) {
+				found = true;
+				break;
+			}
+		}
+
+		spin_unlock_irqrestore(&gcwq->lock, flags);
+		if (found)
+			break;
+	}
 
+	return found;
 }
 
 static struct cpu_workqueue_struct *alloc_cwqs(void)
@@ -1771,12 +1874,11 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 		cwq->wq = wq;
 		cwq->flush_color = -1;
 		cwq->max_active = max_active;
-		INIT_LIST_HEAD(&cwq->worklist);
 		INIT_LIST_HEAD(&cwq->delayed_works);
 
 		if (failed)
 			continue;
-		cwq->worker = create_worker(cwq, cpu_online(cpu));
+		cwq->worker = create_worker(gcwq, cpu_online(cpu));
 		if (cwq->worker)
 			start_worker(cwq->worker);
 		else
@@ -1836,13 +1938,26 @@ void destroy_workqueue(struct workqueue_struct *wq)
 
 	for_each_possible_cpu(cpu) {
 		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+		struct global_cwq *gcwq = cwq->gcwq;
 		int i;
 
 		if (cwq->worker) {
-			spin_lock_irq(&cwq->gcwq->lock);
+		retry:
+			spin_lock_irq(&gcwq->lock);
+			/*
+			 * Worker can only be destroyed while idle.
+			 * Wait till it becomes idle.  This is ugly
+			 * and prone to starvation.  It will go away
+			 * once dynamic worker pool is implemented.
+			 */
+			if (!(cwq->worker->flags & WORKER_IDLE)) {
+				spin_unlock_irq(&gcwq->lock);
+				msleep(100);
+				goto retry;
+			}
 			destroy_worker(cwq->worker);
 			cwq->worker = NULL;
-			spin_unlock_irq(&cwq->gcwq->lock);
+			spin_unlock_irq(&gcwq->lock);
 		}
 
 		for (i = 0; i < WORK_NR_COLORS; i++)
@@ -2161,7 +2276,7 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
  *
  * Start freezing workqueues.  After this function returns, all
  * freezeable workqueues will queue new works to their frozen_works
- * list instead of the cwq ones.
+ * list instead of gcwq->worklist.
  *
  * CONTEXT:
  * Grabs and releases workqueue_lock and gcwq->lock's.
@@ -2247,7 +2362,7 @@ out_unlock:
  * thaw_workqueues - thaw workqueues
  *
  * Thaw workqueues.  Normal queueing is restored and all collected
- * frozen works are transferred to their respective cwq worklists.
+ * frozen works are transferred to their respective gcwq worklists.
  *
  * CONTEXT:
  * Grabs and releases workqueue_lock and gcwq->lock's.
@@ -2320,6 +2435,7 @@ void __init init_workqueues(void)
 		struct global_cwq *gcwq = get_gcwq(cpu);
 
 		spin_lock_init(&gcwq->lock);
+		INIT_LIST_HEAD(&gcwq->worklist);
 		gcwq->cpu = cpu;
 
 		INIT_LIST_HEAD(&gcwq->idle_list);
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 26/27] workqueue: implement concurrency managed dynamic worker pool
  2009-12-18 12:57 Tejun Heo
                   ` (24 preceding siblings ...)
  2009-12-18 12:58 ` [PATCH 25/27] workqueue: use shared worklist and pool all workers per cpu Tejun Heo
@ 2009-12-18 12:58 ` Tejun Heo
  2009-12-18 12:58 ` [PATCH 27/27] workqueue: increase max_active of keventd and kill current_is_keventd() Tejun Heo
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:58 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo

Instead of creating a worker for each cwq and putting it into the
shared pool, manage per-cpu workers dynamically.

Works aren't supposed to be cpu cycle hogs, so maintaining just enough
concurrency to prevent work processing from stalling due to lack of
processing context is optimal.  gcwq keeps the number of concurrently
active workers to the minimum necessary, but no fewer.  As long as
there's one or more running workers on the cpu, no new worker is
scheduled so that works can be processed in batch as much as possible;
but when the last running worker blocks, gcwq immediately schedules a
new worker so that the cpu doesn't sit idle while there are works to
be processed.
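
Stripped of the locking, atomics and scheduler notifiers added below,
the policy boils down to roughly the following checks (illustrative
sketch only; the field names are simplified stand-ins):

#include <stdbool.h>

struct gcwq_model {
	int nr_pending;		/* works sitting on gcwq->worklist */
	int nr_running;		/* workers currently burning cpu */
};

/* queueing and wakeups ask: does this cpu need another worker? */
static bool need_more_worker(const struct gcwq_model *g)
{
	return g->nr_pending > 0 && g->nr_running == 0;
}

/* a running worker asks after each work: should I keep going? */
static bool keep_working(const struct gcwq_model *g)
{
	return g->nr_pending > 0 && g->nr_running <= 1;
}

/* called from the scheduler hook when a running worker blocks */
static bool worker_blocked(struct gcwq_model *g)
{
	g->nr_running--;
	return need_more_worker(g);	/* true => wake an idle worker now */
}

int main(void)
{
	struct gcwq_model g = { .nr_pending = 3, .nr_running = 1 };

	/* last running worker blocks with works pending => wake another */
	return worker_blocked(&g) && keep_working(&g) ? 0 : 1;
}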

gcwq always keeps at least a single idle worker around.  When a new
worker is necessary and that worker is the last idle one, the worker
assumes the role of "manager" and manages the worker pool -
ie. creates another worker.  Forward progress is guaranteed by having
dedicated rescue workers for workqueues whose works may be necessary
while creating a new worker.  When the manager has trouble creating a
new worker, the mayday timer activates and rescue workers are summoned
to the cpu to execute works which might be necessary to create new
workers.
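
A rough sketch of that mayday path, with the timer, cpumask and
kthread machinery of the actual patch reduced to plain fields
(illustration only):

#include <stdbool.h>

struct wq_model {
	bool has_rescuer;	/* WQ_RESCUER in the patch */
	bool rescuer_woken;
};

struct pending_work {
	struct pending_work *next;
	struct wq_model *wq;	/* workqueue the work belongs to */
};

/* runs when worker creation has stalled (the mayday timer's job) */
static void mayday_timeout(struct pending_work *worklist)
{
	struct pending_work *w;

	for (w = worklist; w; w = w->next)
		if (w->wq->has_rescuer)
			w->wq->rescuer_woken = true;	/* send_mayday() */
}

int main(void)
{
	struct wq_model wq = { .has_rescuer = true, .rescuer_woken = false };
	struct pending_work work = { .next = 0, .wq = &wq };

	mayday_timeout(&work);			/* creation stalled */
	return wq.rescuer_woken ? 0 : 1;
}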

The trustee is expanded to serve the role of manager while a CPU is
being taken down and stays down.  As no new works are supposed to be
queued on a dead cpu, it just needs to drain all the existing ones.
The trustee continues to try to create new workers and summon rescuers
as long as there are pending works.  If the CPU is brought back up
while the trustee is still trying to drain the gcwq from the previous
offlining, the trustee puts all workers back on the cpu and passes
control over to the gcwq, which assumes the manager role as necessary.

The concurrency managed worker pool drastically reduces the number of
workers.  Only workers which are necessary to keep the processing
going are created and kept.  It also reduces cache footprint by
avoiding unnecessary context switches between different workers.

Please note that this patch does not increase max_active of any
workqueue.  All workqueues can still only process one work per cpu.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/workqueue.h |    8 +-
 kernel/workqueue.c        |  858 ++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 778 insertions(+), 88 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index b012da7..adb3080 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -213,6 +213,7 @@ static inline bool work_static(struct work_struct *work) { return false; }
 enum {
 	WQ_FREEZEABLE		= 1 << 0, /* freeze during suspend */
 	WQ_SINGLE_CPU		= 1 << 1, /* only single cpu at a time */
+	WQ_RESCUER		= 1 << 2, /* has a rescue worker */
 };
 
 extern struct workqueue_struct *
@@ -239,11 +240,12 @@ __create_workqueue_key(const char *name, unsigned int flags, int max_active,
 #endif
 
 #define create_workqueue(name)					\
-	__create_workqueue((name), 0, 1)
+	__create_workqueue((name), WQ_RESCUER, 1)
 #define create_freezeable_workqueue(name)			\
-	__create_workqueue((name), WQ_FREEZEABLE | WQ_SINGLE_CPU, 1)
+	__create_workqueue((name),				\
+			   WQ_FREEZEABLE | WQ_SINGLE_CPU | WQ_RESCUER, 1)
 #define create_singlethread_workqueue(name)			\
-	__create_workqueue((name), WQ_SINGLE_CPU, 1)
+	__create_workqueue((name), WQ_SINGLE_CPU | WQ_RESCUER, 1)
 
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f38d263..9baf7a8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -43,8 +43,16 @@ enum {
 	WORKER_STARTED		= 1 << 0,	/* started */
 	WORKER_DIE		= 1 << 1,	/* die die die */
 	WORKER_IDLE		= 1 << 2,	/* is idle */
+	WORKER_PREP		= 1 << 3,	/* preparing to run works */
 	WORKER_ROGUE		= 1 << 4,	/* not bound to any cpu */
 
+	WORKER_IGN_RUNNING	= WORKER_PREP | WORKER_ROGUE,
+
+	/* global_cwq flags */
+	GCWQ_MANAGE_WORKERS	= 1 << 0,	/* need to manage workers */
+	GCWQ_MANAGING_WORKERS	= 1 << 1,	/* managing workers */
+	GCWQ_DISASSOCIATED	= 1 << 2,	/* cpu can't serve workers */
+
 	/* gcwq->trustee_state */
 	TRUSTEE_START		= 0,		/* start */
 	TRUSTEE_IN_CHARGE	= 1,		/* trustee in charge of gcwq */
@@ -56,7 +64,19 @@ enum {
 	BUSY_WORKER_HASH_SIZE	= 1 << BUSY_WORKER_HASH_ORDER,
 	BUSY_WORKER_HASH_MASK	= BUSY_WORKER_HASH_SIZE - 1,
 
+	MAX_IDLE_WORKERS_RATIO	= 4,		/* 1/4 of busy can be idle */
+	IDLE_WORKER_TIMEOUT	= 300 * HZ,	/* keep idle ones for 5 mins */
+
+	MAYDAY_INITIAL_TIMEOUT	= HZ / 100,	/* call for help after 10ms */
+	MAYDAY_INTERVAL		= HZ / 10,	/* and then every 100ms */
+	CREATE_COOLDOWN		= HZ,		/* time to breathe after fail */
 	TRUSTEE_COOLDOWN	= HZ / 10,	/* for trustee draining */
+
+	/*
+	 * Rescue workers are used only on emergencies and shared by
+	 * all cpus.  Give -20.
+	 */
+	RESCUER_NICE_LEVEL	= -20,
 };
 
 /*
@@ -64,8 +84,16 @@ enum {
  *
  * I: Set during initialization and read-only afterwards.
  *
+ * P: Preemption protected.  Disabling preemption is enough and should
+ *    only be modified and accessed from the local cpu.
+ *
  * L: gcwq->lock protected.  Access with gcwq->lock held.
  *
+ * X: During normal operation, modification requires gcwq->lock and
+ *    should be done only from local cpu.  Either disabling preemption
+ *    on local cpu or grabbing gcwq->lock is enough for read access.
+ *    While trustee is in charge, it's identical to L.
+ *
  * F: wq->flush_mutex protected.
  *
  * W: workqueue_lock protected.
@@ -73,6 +101,10 @@ enum {
 
 struct global_cwq;
 
+/*
+ * The poor guys doing the actual heavy lifting.  All on-duty workers
+ * are either serving the manager role, on idle list or on busy hash.
+ */
 struct worker {
 	/* on idle list while idle, on busy hash table while busy */
 	union {
@@ -84,12 +116,17 @@ struct worker {
 	struct list_head	scheduled;	/* L: scheduled works */
 	struct task_struct	*task;		/* I: worker task */
 	struct global_cwq	*gcwq;		/* I: the associated gcwq */
-	unsigned int		flags;		/* L: flags */
+	unsigned long		last_active;	/* L: last active timestamp */
+	/* 64 bytes boundary on 64bit, 32 on 32bit */
+	struct sched_notifier	sched_notifier;	/* I: scheduler notifier */
+	unsigned int		flags;		/* ?: flags */
 	int			id;		/* I: worker id */
 };
 
 /*
- * Global per-cpu workqueue.
+ * Global per-cpu workqueue.  There's one and only one for each cpu
+ * and all works are queued and processed here regardless of their
+ * target workqueues.
  */
 struct global_cwq {
 	spinlock_t		lock;		/* the gcwq lock */
@@ -101,15 +138,19 @@ struct global_cwq {
 	int			nr_idle;	/* L: currently idle ones */
 
 	/* workers are chained either in the idle_list or busy_hash */
-	struct list_head	idle_list;	/* L: list of idle workers */
+	struct list_head	idle_list;	/* ?: list of idle workers */
 	struct hlist_head	busy_hash[BUSY_WORKER_HASH_SIZE];
 						/* L: hash of busy workers */
 
+	struct timer_list	idle_timer;	/* L: worker idle timeout */
+	struct timer_list	mayday_timer;	/* L: SOS timer for dworkers */
+
 	struct ida		worker_ida;	/* L: for worker IDs */
 
 	struct task_struct	*trustee;	/* L: for gcwq shutdown */
 	unsigned int		trustee_state;	/* L: trustee state */
 	wait_queue_head_t	trustee_wait;	/* trustee wait */
+	struct worker		*first_idle;	/* L: first idle worker */
 } ____cacheline_aligned_in_smp;
 
 /*
@@ -119,7 +160,6 @@ struct global_cwq {
  */
 struct cpu_workqueue_struct {
 	struct global_cwq	*gcwq;		/* I: the associated gcwq */
-	struct worker		*worker;
 	struct workqueue_struct *wq;		/* I: the owning workqueue */
 	int			work_color;	/* L: current color */
 	int			flush_color;	/* L: flushing color */
@@ -158,6 +198,9 @@ struct workqueue_struct {
 
 	unsigned long		single_cpu;	/* cpu for single cpu wq */
 
+	cpumask_var_t		mayday_mask;	/* cpus requesting rescue */
+	struct worker		*rescuer;	/* I: rescue worker */
+
 	int			saved_max_active; /* I: saved cwq max_active */
 	const char		*name;		/* I: workqueue name */
 #ifdef CONFIG_LOCKDEP
@@ -284,7 +327,14 @@ static DEFINE_SPINLOCK(workqueue_lock);
 static LIST_HEAD(workqueues);
 static bool workqueue_freezing;		/* W: have wqs started freezing? */
 
+/*
+ * The almighty global cpu workqueues.  nr_running is the only field
+ * which is expected to be used frequently by other cpus by
+ * try_to_wake_up() which ends up incrementing it.  Put it in a
+ * separate cacheline.
+ */
 static DEFINE_PER_CPU(struct global_cwq, global_cwq);
+static DEFINE_PER_CPU_SHARED_ALIGNED(atomic_t, gcwq_nr_running);
 
 static int worker_thread(void *__worker);
 
@@ -293,6 +343,11 @@ static struct global_cwq *get_gcwq(unsigned int cpu)
 	return &per_cpu(global_cwq, cpu);
 }
 
+static atomic_t *get_gcwq_nr_running(unsigned int cpu)
+{
+	return &per_cpu(gcwq_nr_running, cpu);
+}
+
 static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
 					    struct workqueue_struct *wq)
 {
@@ -336,6 +391,63 @@ static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
 			WORK_STRUCT_WQ_DATA_MASK);
 }
 
+/*
+ * Policy functions.  These define the policies on how the global
+ * worker pool is managed.  Unless noted otherwise, these functions
+ * assume that they're being called with gcwq->lock held.
+ */
+
+/*
+ * Need to wake up a worker?  Called from anything but currently
+ * running workers.
+ */
+static bool need_more_worker(struct global_cwq *gcwq)
+{
+	atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+
+	return !list_empty(&gcwq->worklist) && !atomic_read(nr_running);
+}
+
+/* Can I start working?  Called from busy but !running workers. */
+static bool may_start_working(struct global_cwq *gcwq)
+{
+	return gcwq->nr_idle;
+}
+
+/* Do I need to keep working?  Called from currently running workers. */
+static bool keep_working(struct global_cwq *gcwq)
+{
+	atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+
+	return !list_empty(&gcwq->worklist) && atomic_read(nr_running) <= 1;
+}
+
+/* Do we need a new worker?  Called from manager. */
+static bool need_to_create_worker(struct global_cwq *gcwq)
+{
+	return need_more_worker(gcwq) && !may_start_working(gcwq);
+}
+
+/* Do I need to be the manager? */
+static bool need_to_manage_workers(struct global_cwq *gcwq)
+{
+	return need_to_create_worker(gcwq) || gcwq->flags & GCWQ_MANAGE_WORKERS;
+}
+
+/* Do we have too many workers and should some go away? */
+static bool too_many_workers(struct global_cwq *gcwq)
+{
+	bool managing = gcwq->flags & GCWQ_MANAGING_WORKERS;
+	int nr_idle = gcwq->nr_idle + managing; /* manager is considered idle */
+	int nr_busy = gcwq->nr_workers - nr_idle;
+
+	return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
+}
+
+/*
+ * Wake up functions.
+ */
+
 /* Return the first worker.  Safe with preemption disabled */
 static struct worker *first_worker(struct global_cwq *gcwq)
 {
@@ -363,6 +475,70 @@ static void wake_up_worker(struct global_cwq *gcwq)
 }
 
 /**
+ * sched_wake_up_worker - wake up an idle worker from a scheduler notifier
+ * @gcwq: gcwq to wake worker for
+ *
+ * Wake up the first idle worker of @gcwq.
+ *
+ * CONTEXT:
+ * Scheduler callback.  DO NOT call from anywhere else.
+ */
+static void sched_wake_up_worker(struct global_cwq *gcwq)
+{
+	struct worker *worker = first_worker(gcwq);
+
+	if (likely(worker))
+		try_to_wake_up_local(worker->task, TASK_NORMAL, 0);
+}
+
+/*
+ * Scheduler notifier callbacks.  These functions are called during
+ * schedule() with rq lock held.  Don't try to acquire any lock and
+ * only access fields which are safe with preemption disabled from
+ * local cpu.
+ */
+
+/* called when a worker task wakes up from sleep */
+static void worker_sched_wakeup(struct sched_notifier *sn)
+{
+	struct worker *worker = container_of(sn, struct worker, sched_notifier);
+	struct global_cwq *gcwq = worker->gcwq;
+	atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+
+	if (unlikely(worker->flags & WORKER_IGN_RUNNING))
+		return;
+
+	atomic_inc(nr_running);
+}
+
+/* called when a worker task goes into sleep */
+static void worker_sched_sleep(struct sched_notifier *sn)
+{
+	struct worker *worker = container_of(sn, struct worker, sched_notifier);
+	struct global_cwq *gcwq = worker->gcwq;
+	atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+
+	if (unlikely(worker->flags & WORKER_IGN_RUNNING))
+		return;
+
+	/* this can only happen on the local cpu */
+	BUG_ON(gcwq->cpu != raw_smp_processor_id());
+
+	/*
+	 * The counterpart of the following dec_and_test, implied mb,
+	 * worklist not empty test sequence is in insert_work().
+	 * Please read comment there.
+	 */
+	if (atomic_dec_and_test(nr_running) && !list_empty(&gcwq->worklist))
+		sched_wake_up_worker(gcwq);
+}
+
+static struct sched_notifier_ops wq_sched_notifier_ops = {
+	.wakeup		= worker_sched_wakeup,
+	.sleep		= worker_sched_sleep,
+};
+
+/**
  * busy_worker_head - return the busy hash head for a work
  * @gcwq: gcwq of interest
  * @work: work to be hashed
@@ -459,6 +635,8 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
 			struct work_struct *work, struct list_head *head,
 			unsigned int extra_flags)
 {
+	struct global_cwq *gcwq = cwq->gcwq;
+
 	/* we own @work, set data and link */
 	set_wq_data(work, cwq, extra_flags);
 
@@ -469,7 +647,16 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
 	smp_wmb();
 
 	list_add_tail(&work->entry, head);
-	wake_up_worker(cwq->gcwq);
+
+	/*
+	 * Ensure either worker_sched_deactivated() sees the above
+	 * list_add_tail() or we see zero nr_running to avoid workers
+	 * lying around lazily while there are works to be processed.
+	 */
+	smp_mb();
+
+	if (!atomic_read(get_gcwq_nr_running(gcwq->cpu)))
+		wake_up_worker(gcwq);
 }
 
 /**
@@ -694,11 +881,16 @@ static void worker_enter_idle(struct worker *worker)
 
 	worker->flags |= WORKER_IDLE;
 	gcwq->nr_idle++;
+	worker->last_active = jiffies;
 
 	/* idle_list is LIFO */
 	list_add(&worker->entry, &gcwq->idle_list);
 
-	if (unlikely(worker->flags & WORKER_ROGUE))
+	if (likely(!(worker->flags & WORKER_ROGUE))) {
+		if (too_many_workers(gcwq) && !timer_pending(&gcwq->idle_timer))
+			mod_timer(&gcwq->idle_timer,
+				  jiffies + IDLE_WORKER_TIMEOUT);
+	} else
 		wake_up_all(&gcwq->trustee_wait);
 }
 
@@ -729,6 +921,9 @@ static struct worker *alloc_worker(void)
 	if (worker) {
 		INIT_LIST_HEAD(&worker->entry);
 		INIT_LIST_HEAD(&worker->scheduled);
+		sched_notifier_init(&worker->sched_notifier,
+				    &wq_sched_notifier_ops);
+		/* on creation a worker is not idle */
 	}
 	return worker;
 }
@@ -806,7 +1001,7 @@ fail:
  */
 static void start_worker(struct worker *worker)
 {
-	worker->flags |= WORKER_STARTED;
+	worker->flags |= WORKER_STARTED | WORKER_PREP;
 	worker->gcwq->nr_workers++;
 	worker_enter_idle(worker);
 	wake_up_process(worker->task);
@@ -847,6 +1042,220 @@ static void destroy_worker(struct worker *worker)
 	ida_remove(&gcwq->worker_ida, id);
 }
 
+static void idle_worker_timeout(unsigned long __gcwq)
+{
+	struct global_cwq *gcwq = (void *)__gcwq;
+
+	spin_lock_irq(&gcwq->lock);
+
+	if (too_many_workers(gcwq)) {
+		struct worker *worker;
+		unsigned long expires;
+
+		/* idle_list is kept in LIFO order, check the last one */
+		worker = list_entry(gcwq->idle_list.prev, struct worker, entry);
+		expires = worker->last_active + IDLE_WORKER_TIMEOUT;
+
+		if (time_before(jiffies, expires))
+			mod_timer(&gcwq->idle_timer, expires);
+		else {
+			/* it's been idle for too long, wake up manager */
+			gcwq->flags |= GCWQ_MANAGE_WORKERS;
+			wake_up_worker(gcwq);
+		}
+	}
+
+	spin_unlock_irq(&gcwq->lock);
+}
+
+static bool send_mayday(struct work_struct *work)
+{
+	struct cpu_workqueue_struct *cwq = get_wq_data(work);
+	struct workqueue_struct *wq = cwq->wq;
+
+	if (!(wq->flags & WQ_RESCUER))
+		return false;
+
+	/* mayday mayday mayday */
+	if (!cpumask_test_and_set_cpu(cwq->gcwq->cpu, wq->mayday_mask))
+		wake_up_process(wq->rescuer->task);
+	return true;
+}
+
+static void gcwq_mayday_timeout(unsigned long __gcwq)
+{
+	struct global_cwq *gcwq = (void *)__gcwq;
+	struct work_struct *work;
+
+	spin_lock_irq(&gcwq->lock);
+
+	if (need_to_create_worker(gcwq)) {
+		/*
+		 * We've been trying to create a new worker but
+		 * haven't been successful.  We might be hitting an
+		 * allocation deadlock.  Send distress signals to
+		 * rescuers.
+		 */
+		list_for_each_entry(work, &gcwq->worklist, entry)
+			send_mayday(work);
+	}
+
+	spin_unlock_irq(&gcwq->lock);
+
+	mod_timer(&gcwq->mayday_timer, jiffies + MAYDAY_INTERVAL);
+}
+
+/**
+ * maybe_create_worker - create a new worker if necessary
+ * @gcwq: gcwq to create a new worker for
+ *
+ * Create a new worker for @gcwq if necessary.  @gcwq is guaranteed to
+ * have at least one idle worker on return from this function.  If
+ * creating a new worker takes longer than MAYDAY_INTERVAL, mayday is
+ * sent to all rescuers with works scheduled on @gcwq to resolve
+ * possible allocation deadlock.
+ *
+ * On return, need_to_create_worker() is guaranteed to be false and
+ * may_start_working() true.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times.  Does GFP_KERNEL allocations.  Called only from
+ * manager.
+ *
+ * RETURNS:
+ * false if no action was taken and gcwq->lock stayed locked, true
+ * otherwise.
+ */
+static bool maybe_create_worker(struct global_cwq *gcwq)
+{
+	if (!need_to_create_worker(gcwq))
+		return false;
+restart:
+	/* if we don't make progress in MAYDAY_INITIAL_TIMEOUT, call for help */
+	mod_timer(&gcwq->mayday_timer, jiffies + MAYDAY_INITIAL_TIMEOUT);
+
+	while (true) {
+		struct worker *worker;
+
+		spin_unlock_irq(&gcwq->lock);
+
+		worker = create_worker(gcwq, true);
+		if (worker) {
+			del_timer_sync(&gcwq->mayday_timer);
+			spin_lock_irq(&gcwq->lock);
+			start_worker(worker);
+			BUG_ON(need_to_create_worker(gcwq));
+			return true;
+		}
+
+		if (!need_to_create_worker(gcwq))
+			break;
+
+		spin_unlock_irq(&gcwq->lock);
+		__set_current_state(TASK_INTERRUPTIBLE);
+		schedule_timeout(CREATE_COOLDOWN);
+		spin_lock_irq(&gcwq->lock);
+		if (!need_to_create_worker(gcwq))
+			break;
+	}
+
+	spin_unlock_irq(&gcwq->lock);
+	del_timer_sync(&gcwq->mayday_timer);
+	spin_lock_irq(&gcwq->lock);
+	if (need_to_create_worker(gcwq))
+		goto restart;
+	return true;
+}
+
+/**
+ * maybe_destroy_worker - destroy workers which have been idle for a while
+ * @gcwq: gcwq to destroy workers for
+ *
+ * Destroy @gcwq workers which have been idle for longer than
+ * IDLE_WORKER_TIMEOUT.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times.  Called only from manager.
+ *
+ * RETURNS:
+ * false if no action was taken and gcwq->lock stayed locked, true
+ * otherwise.
+ */
+static bool maybe_destroy_workers(struct global_cwq *gcwq)
+{
+	bool ret = false;
+
+	while (too_many_workers(gcwq)) {
+		struct worker *worker;
+		unsigned long expires;
+
+		worker = list_entry(gcwq->idle_list.prev, struct worker, entry);
+		expires = worker->last_active + IDLE_WORKER_TIMEOUT;
+
+		if (time_before(jiffies, expires)) {
+			mod_timer(&gcwq->idle_timer, expires);
+			break;
+		}
+
+		destroy_worker(worker);
+		ret = true;
+	}
+
+	return ret;
+}
+
+/**
+ * manage_workers - manage worker pool
+ * @worker: self
+ *
+ * Assume the manager role and manage gcwq worker pool @worker belongs
+ * to.  At any given time, there can be only zero or one manager per
+ * gcwq.  The exclusion is handled automatically by this function.
+ *
+ * The caller can safely start processing works on false return.  On
+ * true return, it's guaranteed that need_to_create_worker() is false
+ * and may_start_working() is true.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times.  Does GFP_KERNEL allocations.
+ *
+ * RETURNS:
+ * false if no action was taken and gcwq->lock stayed locked, true if
+ * some action was taken.
+ */
+static bool manage_workers(struct worker *worker)
+{
+	struct global_cwq *gcwq = worker->gcwq;
+	bool ret = false;
+
+	if (gcwq->flags & GCWQ_MANAGING_WORKERS)
+		return ret;
+
+	gcwq->flags &= ~GCWQ_MANAGE_WORKERS;
+	gcwq->flags |= GCWQ_MANAGING_WORKERS;
+
+	/*
+	 * Destroy and then create so that may_start_working() is true
+	 * on return.
+	 */
+	ret |= maybe_destroy_workers(gcwq);
+	ret |= maybe_create_worker(gcwq);
+
+	gcwq->flags &= ~GCWQ_MANAGING_WORKERS;
+
+	/*
+	 * The trustee might be waiting to take over the manager
+	 * position, tell it we're done.
+	 */
+	if (unlikely(gcwq->trustee))
+		wake_up_all(&gcwq->trustee_wait);
+
+	return ret;
+}
+
 /**
  * move_linked_works - move linked works to a list
  * @work: start of series of works to be scheduled
@@ -1049,23 +1458,39 @@ static void process_scheduled_works(struct worker *worker)
  * worker_thread - the worker thread function
  * @__worker: self
  *
- * The cwq worker thread function.
+ * The gcwq worker thread function.  There's a single dynamic pool of
+ * these per each cpu.  These workers process all works regardless of
+ * their specific target workqueue.  The only exception is works which
+ * belong to workqueues with a rescuer which will be explained in
+ * rescuer_thread().
  */
 static int worker_thread(void *__worker)
 {
 	struct worker *worker = __worker;
 	struct global_cwq *gcwq = worker->gcwq;
+	atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
 
+	/* register sched_notifiers */
+	sched_notifier_register(&worker->sched_notifier);
 woke_up:
 	spin_lock_irq(&gcwq->lock);
 
 	/* DIE can be set only while we're idle, checking here is enough */
 	if (worker->flags & WORKER_DIE) {
 		spin_unlock_irq(&gcwq->lock);
+		sched_notifier_unregister(&worker->sched_notifier);
 		return 0;
 	}
 
 	worker_leave_idle(worker);
+recheck:
+	/* no more worker necessary? */
+	if (!need_more_worker(gcwq))
+		goto sleep;
+
+	/* do we need to manage? */
+	if (unlikely(!may_start_working(gcwq)) && manage_workers(worker))
+		goto recheck;
 
 	/*
 	 * ->scheduled list can only be filled while a worker is
@@ -1074,7 +1499,16 @@ woke_up:
 	 */
 	BUG_ON(!list_empty(&worker->scheduled));
 
-	while (!list_empty(&gcwq->worklist)) {
+	/*
+	 * When control reaches this point, we're guaranteed to have
+	 * at least one idle worker or that someone else has already
+	 * assumed the manager role.
+	 */
+	worker->flags &= ~WORKER_PREP;
+	if (likely(!(worker->flags & WORKER_IGN_RUNNING)))
+		atomic_inc(nr_running);
+
+	do {
 		struct work_struct *work =
 			list_first_entry(&gcwq->worklist,
 					 struct work_struct, entry);
@@ -1088,13 +1522,21 @@ woke_up:
 			move_linked_works(work, &worker->scheduled, NULL);
 			process_scheduled_works(worker);
 		}
-	}
+	} while (keep_working(gcwq));
+
+	if (likely(!(worker->flags & WORKER_IGN_RUNNING)))
+		atomic_dec(nr_running);
+	worker->flags |= WORKER_PREP;
 
+	if (unlikely(need_to_manage_workers(gcwq)) && manage_workers(worker))
+		goto recheck;
+sleep:
 	/*
-	 * gcwq->lock is held and there's no work to process, sleep.
-	 * Workers are woken up only while holding gcwq->lock, so
-	 * setting the current state before releasing gcwq->lock is
-	 * enough to prevent losing any event.
+	 * gcwq->lock is held and there's no work to process and no
+	 * need to manage, sleep.  Workers are woken up only while
+	 * holding gcwq->lock or from local cpu, so setting the
+	 * current state before releasing gcwq->lock is enough to
+	 * prevent losing any event.
 	 */
 	worker_enter_idle(worker);
 	__set_current_state(TASK_INTERRUPTIBLE);
@@ -1103,6 +1545,122 @@ woke_up:
 	goto woke_up;
 }
 
+/**
+ * worker_maybe_bind_and_lock - bind worker to its cpu if possible and lock gcwq
+ * @worker: target worker
+ *
+ * Works which are scheduled while the cpu is online must at least be
+ * scheduled to a worker which is bound to the cpu so that if they are
+ * flushed from cpu callbacks while cpu is going down, they are
+ * guaranteed to execute on the cpu.
+ *
+ * This function is to be used to bind rescuers and new rogue workers
+ * to the target cpu and may race with cpu going down or coming
+ * online.  kthread_bind() can't be used because it may put the worker
+ * to already dead cpu and __set_cpus_allowed() can't be used verbatim
+ * as it's best effort and blocking and gcwq may be [dis]associated in
+ * the meantime.
+ *
+ * This function tries __set_cpus_allowed() and locks gcwq and
+ * verifies the binding against GCWQ_DISASSOCIATED which is set during
+ * CPU_DYING and cleared during CPU_ONLINE, so if the worker enters
+ * idle state or fetches works without dropping lock, it can guarantee
+ * the scheduling requirement described in the first paragraph.
+ *
+ * CONTEXT:
+ * Might sleep.  Called without any lock but returns with gcwq->lock
+ * held.
+ */
+static void worker_maybe_bind_and_lock(struct worker *worker)
+{
+	struct global_cwq *gcwq = worker->gcwq;
+	struct task_struct *task = worker->task;
+
+	while (true) {
+		/*
+		 * The following call may fail, succeed or succeed
+		 * without actually migrating the task to the cpu if
+		 * it races with cpu hotunplug operation.  Verify
+		 * against GCWQ_DISASSOCIATED.
+		 */
+		__set_cpus_allowed(task, get_cpu_mask(gcwq->cpu), true);
+
+		spin_lock_irq(&gcwq->lock);
+		if (gcwq->flags & GCWQ_DISASSOCIATED)
+			return;
+		if (task_cpu(task) == gcwq->cpu &&
+		    cpumask_equal(&current->cpus_allowed,
+				  get_cpu_mask(gcwq->cpu)))
+			return;
+		spin_unlock_irq(&gcwq->lock);
+
+		/* CPU has come up inbetween, retry migration */
+		cpu_relax();
+	}
+}
+
+/**
+ * rescuer_thread - the rescuer thread function
+ * @__wq: the associated workqueue
+ *
+ * Workqueue rescuer thread function.  There's one rescuer for each
+ * workqueue which has WQ_RESCUER set.
+ *
+ * Regular work processing on a gcwq may block trying to create a new
+ * worker which uses GFP_KERNEL allocation which has slight chance of
+ * developing into deadlock if some works currently on the same queue
+ * need to be processed to satisfy the GFP_KERNEL allocation.  This is
+ * the problem rescuer solves.
+ *
+ * When such condition is possible, the gcwq summons rescuers of all
+ * workqueues which have works queued on the gcwq and let them process
+ * those works so that forward progress can be guaranteed.
+ *
+ * This should happen rarely.
+ */
+static int rescuer_thread(void *__wq)
+{
+	struct workqueue_struct *wq = __wq;
+	struct worker *rescuer = wq->rescuer;
+	struct list_head *scheduled = &rescuer->scheduled;
+	unsigned int cpu;
+
+	set_user_nice(current, RESCUER_NICE_LEVEL);
+repeat:
+	set_current_state(TASK_INTERRUPTIBLE);
+
+	if (kthread_should_stop())
+		return 0;
+
+	for_each_cpu(cpu, wq->mayday_mask) {
+		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+		struct global_cwq *gcwq = cwq->gcwq;
+		struct work_struct *work, *n;
+
+		__set_current_state(TASK_RUNNING);
+		cpumask_clear_cpu(cpu, wq->mayday_mask);
+
+		/* migrate to the target cpu if possible */
+		rescuer->gcwq = gcwq;
+		worker_maybe_bind_and_lock(rescuer);
+
+		/*
+		 * Slurp in all works issued via this workqueue and
+		 * process'em.
+		 */
+		BUG_ON(!list_empty(&rescuer->scheduled));
+		list_for_each_entry_safe(work, n, &gcwq->worklist, entry)
+			if (get_wq_data(work) == cwq)
+				move_linked_works(work, scheduled, &n);
+
+		process_scheduled_works(rescuer);
+		spin_unlock_irq(&gcwq->lock);
+	}
+
+	schedule();
+	goto repeat;
+}
+
 struct wq_barrier {
 	struct work_struct	work;
 	struct completion	done;
@@ -1833,7 +2391,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 						const char *lock_name)
 {
 	struct workqueue_struct *wq;
-	bool failed = false;
 	unsigned int cpu;
 
 	max_active = clamp_val(max_active, 1, INT_MAX);
@@ -1858,13 +2415,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
 	INIT_LIST_HEAD(&wq->list);
 
-	cpu_maps_update_begin();
-	/*
-	 * We must initialize cwqs for each possible cpu even if we
-	 * are going to call destroy_workqueue() finally. Otherwise
-	 * cpu_up() can hit the uninitialized cwq once we drop the
-	 * lock.
-	 */
 	for_each_possible_cpu(cpu) {
 		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 		struct global_cwq *gcwq = get_gcwq(cpu);
@@ -1875,20 +2425,25 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 		cwq->flush_color = -1;
 		cwq->max_active = max_active;
 		INIT_LIST_HEAD(&cwq->delayed_works);
-
-		if (failed)
-			continue;
-		cwq->worker = create_worker(gcwq, cpu_online(cpu));
-		if (cwq->worker)
-			start_worker(cwq->worker);
-		else
-			failed = true;
 	}
-	cpu_maps_update_done();
 
-	if (failed) {
-		destroy_workqueue(wq);
-		wq = NULL;
+	if (flags & WQ_RESCUER) {
+		struct worker *rescuer;
+
+		if (!alloc_cpumask_var(&wq->mayday_mask, GFP_KERNEL))
+			goto err;
+
+		wq->rescuer = rescuer = alloc_worker();
+		if (!rescuer)
+			goto err;
+
+		rescuer->task = kthread_create(rescuer_thread, wq, "%s", name);
+		if (IS_ERR(rescuer->task))
+			goto err;
+
+		wq->rescuer = rescuer;
+		rescuer->task->flags |= PF_THREAD_BOUND;
+		wake_up_process(rescuer->task);
 	}
 
 	/*
@@ -1910,6 +2465,8 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 err:
 	if (wq) {
 		free_cwqs(wq->cpu_wq);
+		free_cpumask_var(wq->mayday_mask);
+		kfree(wq->rescuer);
 		kfree(wq);
 	}
 	return NULL;
@@ -1936,36 +2493,22 @@ void destroy_workqueue(struct workqueue_struct *wq)
 	list_del(&wq->list);
 	spin_unlock(&workqueue_lock);
 
+	/* sanity check */
 	for_each_possible_cpu(cpu) {
 		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
-		struct global_cwq *gcwq = cwq->gcwq;
 		int i;
 
-		if (cwq->worker) {
-		retry:
-			spin_lock_irq(&gcwq->lock);
-			/*
-			 * Worker can only be destroyed while idle.
-			 * Wait till it becomes idle.  This is ugly
-			 * and prone to starvation.  It will go away
-			 * once dynamic worker pool is implemented.
-			 */
-			if (!(cwq->worker->flags & WORKER_IDLE)) {
-				spin_unlock_irq(&gcwq->lock);
-				msleep(100);
-				goto retry;
-			}
-			destroy_worker(cwq->worker);
-			cwq->worker = NULL;
-			spin_unlock_irq(&gcwq->lock);
-		}
-
 		for (i = 0; i < WORK_NR_COLORS; i++)
 			BUG_ON(cwq->nr_in_flight[i]);
 		BUG_ON(cwq->nr_active);
 		BUG_ON(!list_empty(&cwq->delayed_works));
 	}
 
+	if (wq->flags & WQ_RESCUER) {
+		kthread_stop(wq->rescuer->task);
+		free_cpumask_var(wq->mayday_mask);
+	}
+
 	free_cwqs(wq->cpu_wq);
 	kfree(wq);
 }
@@ -1974,10 +2517,18 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
 /*
  * CPU hotplug.
  *
- * CPU hotplug is implemented by allowing cwqs to be detached from
- * CPU, running with unbound workers and allowing them to be
- * reattached later if the cpu comes back online.  A separate thread
- * is created to govern cwqs in such state and is called the trustee.
+ * There are two challenges in supporting CPU hotplug.  Firstly, there
+ * are a lot of assumptions on strong associations among work, cwq and
+ * gcwq which make migrating pending and scheduled works very
+ * difficult to implement without impacting hot paths.  Secondly,
+ * gcwqs serve mix of short, long and very long running works making
+ * blocked draining impractical.
+ *
+ * This is solved by allowing a gcwq to be detached from CPU, running
+ * it with unbound (rogue) workers and allowing it to be reattached
+ * later if the cpu comes back online.  A separate thread is created
+ * to govern a gcwq in such state and is called the trustee of the
+ * gcwq.
  *
  * Trustee states and their descriptions.
  *
@@ -1985,11 +2536,12 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
  *		new trustee is started with this state.
  *
  * IN_CHARGE	Once started, trustee will enter this state after
- *		making all existing workers rogue.  DOWN_PREPARE waits
- *		for trustee to enter this state.  After reaching
- *		IN_CHARGE, trustee tries to execute the pending
- *		worklist until it's empty and the state is set to
- *		BUTCHER, or the state is set to RELEASE.
+ *		assuming the manager role and making all existing
+ *		workers rogue.  DOWN_PREPARE waits for trustee to
+ *		enter this state.  After reaching IN_CHARGE, trustee
+ *		tries to execute the pending worklist until it's empty
+ *		and the state is set to BUTCHER, or the state is set
+ *		to RELEASE.
  *
  * BUTCHER	Command state which is set by the cpu callback after
  *		the cpu has went down.  Once this state is set trustee
@@ -2000,7 +2552,9 @@ EXPORT_SYMBOL_GPL(destroy_workqueue);
  * RELEASE	Command state which is set by the cpu callback if the
  *		cpu down has been canceled or it has come online
  *		again.  After recognizing this state, trustee stops
- *		trying to drain or butcher and transits to DONE.
+ *		trying to drain or butcher and clears ROGUE, rebinds
+ *		all remaining workers back to the cpu and releases
+ *		manager role.
  *
  * DONE		Trustee will enter this state after BUTCHER or RELEASE
  *		is complete.
@@ -2081,18 +2635,26 @@ static bool __cpuinit trustee_unset_rogue(struct worker *worker)
 static int __cpuinit trustee_thread(void *__gcwq)
 {
 	struct global_cwq *gcwq = __gcwq;
+	atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
 	struct worker *worker;
+	struct work_struct *work;
 	struct hlist_node *pos;
+	long rc;
 	int i;
 
 	BUG_ON(gcwq->cpu != smp_processor_id());
 
 	spin_lock_irq(&gcwq->lock);
 	/*
-	 * Make all workers rogue.  Trustee must be bound to the
-	 * target cpu and can't be cancelled.
+	 * Claim the manager position and make all workers rogue.
+	 * Trustee must be bound to the target cpu and can't be
+	 * cancelled.
 	 */
 	BUG_ON(gcwq->cpu != smp_processor_id());
+	rc = trustee_wait_event(!(gcwq->flags & GCWQ_MANAGING_WORKERS));
+	BUG_ON(rc < 0);
+
+	gcwq->flags |= GCWQ_MANAGING_WORKERS;
 
 	list_for_each_entry(worker, &gcwq->idle_list, entry)
 		worker->flags |= WORKER_ROGUE;
@@ -2101,6 +2663,28 @@ static int __cpuinit trustee_thread(void *__gcwq)
 		worker->flags |= WORKER_ROGUE;
 
 	/*
+	 * Call schedule() so that we cross rq->lock and thus can
+	 * guarantee sched callbacks see the rogue flag.  This is
+	 * necessary as scheduler callbacks may be invoked from other
+	 * cpus.
+	 */
+	spin_unlock_irq(&gcwq->lock);
+	schedule();
+	spin_lock_irq(&gcwq->lock);
+
+	/*
+	 * Sched callbacks are disabled now.  Zap nr_running.  After
+	 * this, gcwq->nr_running stays zero and need_more_worker()
+	 * and keep_working() are always true as long as the worklist
+	 * is not empty.
+	 */
+	atomic_set(nr_running, 0);
+
+	spin_unlock_irq(&gcwq->lock);
+	del_timer_sync(&gcwq->idle_timer);
+	spin_lock_irq(&gcwq->lock);
+
+	/*
 	 * We're now in charge.  Notify and proceed to drain.  We need
 	 * to keep the gcwq running during the whole CPU down
 	 * procedure as other cpu hotunplug callbacks may need to
@@ -2112,18 +2696,80 @@ static int __cpuinit trustee_thread(void *__gcwq)
 	/*
 	 * The original cpu is in the process of dying and may go away
 	 * anytime now.  When that happens, we and all workers would
-	 * be migrated to other cpus.  Try draining any left work.
-	 * Note that if the gcwq is frozen, there may be frozen works
-	 * in freezeable cwqs.  Don't declare completion while frozen.
+	 * be migrated to other cpus.  Try draining any left work.  We
+	 * want to get it over with ASAP - spam rescuers, wake up as
+	 * many idlers as necessary and create new ones till the
+	 * worklist is empty.  Note that if the gcwq is frozen, there
+	 * may be frozen works in freezeable cwqs.  Don't declare
+	 * completion while frozen.
 	 */
 	while (gcwq->nr_workers != gcwq->nr_idle ||
 	       gcwq->flags & GCWQ_FREEZING ||
 	       gcwq->trustee_state == TRUSTEE_IN_CHARGE) {
+		int nr_works = 0;
+
+		list_for_each_entry(work, &gcwq->worklist, entry) {
+			send_mayday(work);
+			nr_works++;
+		}
+
+		list_for_each_entry(worker, &gcwq->idle_list, entry) {
+			if (!nr_works--)
+				break;
+			wake_up_process(worker->task);
+		}
+
+		if (need_to_create_worker(gcwq)) {
+			spin_unlock_irq(&gcwq->lock);
+			worker = create_worker(gcwq, false);
+			if (worker) {
+				worker_maybe_bind_and_lock(worker);
+				worker->flags |= WORKER_ROGUE;
+				start_worker(worker);
+			} else
+				spin_lock_irq(&gcwq->lock);
+		}
+
 		/* give a breather */
 		if (trustee_wait_event_timeout(false, TRUSTEE_COOLDOWN) < 0)
 			break;
 	}
 
+	/*
+	 * Either all works have been scheduled and cpu is down, or
+	 * cpu down has already been canceled.  Wait for and butcher
+	 * all workers till we're canceled.
+	 */
+	while (gcwq->nr_workers) {
+		if (trustee_wait_event(!list_empty(&gcwq->idle_list)) < 0)
+			break;
+
+		while (!list_empty(&gcwq->idle_list)) {
+			worker = list_first_entry(&gcwq->idle_list,
+						  struct worker, entry);
+			destroy_worker(worker);
+		}
+	}
+
+	/*
+	 * At this point, either draining has completed and no worker
+	 * is left, or cpu down has been canceled or the cpu is being
+	 * brought back up.  Clear ROGUE from and rebind all left
+	 * workers.  Unsetting ROGUE and rebinding require dropping
+	 * gcwq->lock.  Restart loop after each successful release.
+	 */
+recheck:
+	list_for_each_entry(worker, &gcwq->idle_list, entry)
+		if (trustee_unset_rogue(worker))
+			goto recheck;
+
+	for_each_busy_worker(worker, i, pos, gcwq)
+		if (trustee_unset_rogue(worker))
+			goto recheck;
+
+	/* relinquish manager role */
+	gcwq->flags &= ~GCWQ_MANAGING_WORKERS;
+
 	/* notify completion */
 	gcwq->trustee = NULL;
 	gcwq->trustee_state = TRUSTEE_DONE;
@@ -2162,9 +2808,7 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
 	unsigned int cpu = (unsigned long)hcpu;
 	struct global_cwq *gcwq = get_gcwq(cpu);
 	struct task_struct *new_trustee = NULL;
-	struct worker *worker;
-	struct hlist_node *pos;
-	int i;
+	struct worker *uninitialized_var(new_worker);
 
 	action &= ~CPU_TASKS_FROZEN;
 
@@ -2175,6 +2819,15 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
 		if (IS_ERR(new_trustee))
 			return NOTIFY_BAD;
 		kthread_bind(new_trustee, cpu);
+		/* fall through */
+	case CPU_UP_PREPARE:
+		BUG_ON(gcwq->first_idle);
+		new_worker = create_worker(gcwq, false);
+		if (!new_worker) {
+			if (new_trustee)
+				kthread_stop(new_trustee);
+			return NOTIFY_BAD;
+		}
 	}
 
 	spin_lock_irq(&gcwq->lock);
@@ -2187,14 +2840,32 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
 		gcwq->trustee_state = TRUSTEE_START;
 		wake_up_process(gcwq->trustee);
 		wait_trustee_state(gcwq, TRUSTEE_IN_CHARGE);
+		/* fall through */
+	case CPU_UP_PREPARE:
+		BUG_ON(gcwq->first_idle);
+		gcwq->first_idle = new_worker;
+		break;
+
+	case CPU_DYING:
+		/*
+		 * Before this, the trustee and all workers must have
+		 * stayed on the cpu.  After this, they'll all be
+		 * diasporas.
+		 */
+		gcwq->flags |= GCWQ_DISASSOCIATED;
 		break;
 
 	case CPU_POST_DEAD:
 		gcwq->trustee_state = TRUSTEE_BUTCHER;
+		/* fall through */
+	case CPU_UP_CANCELED:
+		destroy_worker(gcwq->first_idle);
+		gcwq->first_idle = NULL;
 		break;
 
 	case CPU_DOWN_FAILED:
 	case CPU_ONLINE:
+		gcwq->flags &= ~GCWQ_DISASSOCIATED;
 		if (gcwq->trustee_state != TRUSTEE_DONE) {
 			gcwq->trustee_state = TRUSTEE_RELEASE;
 			wake_up_process(gcwq->trustee);
@@ -2202,18 +2873,16 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
 		}
 
 		/*
-		 * Clear ROGUE from and rebind all workers.  Unsetting
-		 * ROGUE and rebinding require dropping gcwq->lock.
-		 * Restart loop after each successful release.
+		 * Trustee is done and there might be no worker left.
+		 * Put the first_idle in and request a real manager to
+		 * take a look.
 		 */
-	recheck:
-		list_for_each_entry(worker, &gcwq->idle_list, entry)
-			if (trustee_unset_rogue(worker))
-				goto recheck;
-
-		for_each_busy_worker(worker, i, pos, gcwq)
-			if (trustee_unset_rogue(worker))
-				goto recheck;
+		spin_unlock_irq(&gcwq->lock);
+		kthread_bind(gcwq->first_idle->task, cpu);
+		spin_lock_irq(&gcwq->lock);
+		gcwq->flags |= GCWQ_MANAGE_WORKERS;
+		start_worker(gcwq->first_idle);
+		gcwq->first_idle = NULL;
 		break;
 	}
 
@@ -2402,10 +3071,10 @@ void thaw_workqueues(void)
 			if (wq->single_cpu == gcwq->cpu &&
 			    !cwq->nr_active && list_empty(&cwq->delayed_works))
 				cwq_unbind_single_cpu(cwq);
-
-			wake_up_process(cwq->worker->task);
 		}
 
+		wake_up_worker(gcwq);
+
 		spin_unlock_irq(&gcwq->lock);
 	}
 
@@ -2442,12 +3111,31 @@ void __init init_workqueues(void)
 		for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++)
 			INIT_HLIST_HEAD(&gcwq->busy_hash[i]);
 
+		init_timer_deferrable(&gcwq->idle_timer);
+		gcwq->idle_timer.function = idle_worker_timeout;
+		gcwq->idle_timer.data = (unsigned long)gcwq;
+
+		setup_timer(&gcwq->mayday_timer, gcwq_mayday_timeout,
+			    (unsigned long)gcwq);
+
 		ida_init(&gcwq->worker_ida);
 
 		gcwq->trustee_state = TRUSTEE_DONE;
 		init_waitqueue_head(&gcwq->trustee_wait);
 	}
 
+	/* create the initial worker */
+	for_each_online_cpu(cpu) {
+		struct global_cwq *gcwq = get_gcwq(cpu);
+		struct worker *worker;
+
+		worker = create_worker(gcwq, true);
+		BUG_ON(!worker);
+		spin_lock_irq(&gcwq->lock);
+		start_worker(worker);
+		spin_unlock_irq(&gcwq->lock);
+	}
+
 	keventd_wq = create_workqueue("events");
 	BUG_ON(!keventd_wq);
 }
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 27/27] workqueue: increase max_active of keventd and kill current_is_keventd()
  2009-12-18 12:57 Tejun Heo
                   ` (25 preceding siblings ...)
  2009-12-18 12:58 ` [PATCH 26/27] workqueue: implement concurrency managed dynamic worker pool Tejun Heo
@ 2009-12-18 12:58 ` Tejun Heo
  2009-12-18 13:00 ` SUBJ: [RFC PATCHSET] concurrency managed workqueue, take#2 Tejun Heo
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 12:58 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi
  Cc: Tejun Heo, Thomas Gleixner, Tony Luck, Andi Kleen, Oleg Nesterov

Define WQ_MAX_ACTIVE and create keventd with max_active set to half of
it, which means that keventd can now process up to WQ_MAX_ACTIVE / 2 - 1
works concurrently.  Unless some combination can result in a dependency
loop longer than max_active, deadlock won't happen and thus it's
unnecessary to check current_is_keventd() before trying to schedule a
work.  Kill current_is_keventd().

(Lockdep annotations are broken.  We need lock_map_acquire_read_norecurse())
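
To illustrate the reasoning (a sketch for this description only, not
part of the patch; the example_* identifiers are made up):

#include <linux/workqueue.h>

/*
 * Work A queues work B on keventd and waits for it.  With the old
 * one-worker-per-cpu keventd this pattern needed the
 * current_is_keventd() special case to avoid deadlock; with
 * max_active = WQ_MAX_ACTIVE / 2 another worker can run B while A
 * sleeps, as long as no dependency chain is longer than max_active.
 */
static void example_b_fn(struct work_struct *work)
{
	/* short piece of work */
}
static DECLARE_WORK(example_b, example_b_fn);

static void example_a_fn(struct work_struct *work)
{
	schedule_work(&example_b);
	flush_work(&example_b);	/* no current_is_keventd() check needed */
}
static DECLARE_WORK(example_a, example_a_fn);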

NOT_SIGNED_OFF_YET: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 arch/ia64/kernel/smpboot.c |    2 +-
 arch/x86/kernel/smpboot.c  |    2 +-
 include/linux/workqueue.h  |    3 +-
 kernel/workqueue.c         |   54 +++----------------------------------------
 4 files changed, 8 insertions(+), 53 deletions(-)

diff --git a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
index de100aa..3a46feb 100644
--- a/arch/ia64/kernel/smpboot.c
+++ b/arch/ia64/kernel/smpboot.c
@@ -516,7 +516,7 @@ do_boot_cpu (int sapicid, int cpu)
 	/*
 	 * We can't use kernel_thread since we must avoid to reschedule the child.
 	 */
-	if (!keventd_up() || current_is_keventd())
+	if (!keventd_up())
 		c_idle.work.func(&c_idle.work);
 	else {
 		schedule_work(&c_idle.work);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 678d0b8..93175af 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -724,7 +724,7 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu)
 		goto do_rest;
 	}
 
-	if (!keventd_up() || current_is_keventd())
+	if (!keventd_up())
 		c_idle.work.func(&c_idle.work);
 	else {
 		schedule_work(&c_idle.work);
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index adb3080..f43a260 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -214,6 +214,8 @@ enum {
 	WQ_FREEZEABLE		= 1 << 0, /* freeze during suspend */
 	WQ_SINGLE_CPU		= 1 << 1, /* only single cpu at a time */
 	WQ_RESCUER		= 1 << 2, /* has an rescue worker */
+
+	WQ_MAX_ACTIVE		= 256,    /* I like 256, better ideas? */
 };
 
 extern struct workqueue_struct *
@@ -267,7 +269,6 @@ extern int schedule_delayed_work(struct delayed_work *work, unsigned long delay)
 extern int schedule_delayed_work_on(int cpu, struct delayed_work *work,
 					unsigned long delay);
 extern int schedule_on_each_cpu(work_func_t func);
-extern int current_is_keventd(void);
 extern int keventd_up(void);
 
 extern void init_workqueues(void);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9baf7a8..4ffaad2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2239,7 +2239,6 @@ EXPORT_SYMBOL(schedule_delayed_work_on);
 int schedule_on_each_cpu(work_func_t func)
 {
 	int cpu;
-	int orig = -1;
 	struct work_struct *works;
 
 	works = alloc_percpu(struct work_struct);
@@ -2248,23 +2247,12 @@ int schedule_on_each_cpu(work_func_t func)
 
 	get_online_cpus();
 
-	/*
-	 * When running in keventd don't schedule a work item on
-	 * itself.  Can just call directly because the work queue is
-	 * already bound.  This also is faster.
-	 */
-	if (current_is_keventd())
-		orig = raw_smp_processor_id();
-
 	for_each_online_cpu(cpu) {
 		struct work_struct *work = per_cpu_ptr(works, cpu);
 
 		INIT_WORK(work, func);
-		if (cpu != orig)
-			schedule_work_on(cpu, work);
+		schedule_work_on(cpu, work);
 	}
-	if (orig >= 0)
-		func(per_cpu_ptr(works, orig));
 
 	for_each_online_cpu(cpu)
 		flush_work(per_cpu_ptr(works, cpu));
@@ -2311,41 +2299,6 @@ int keventd_up(void)
 	return keventd_wq != NULL;
 }
 
-int current_is_keventd(void)
-{
-	bool found = false;
-	unsigned int cpu;
-
-	/*
-	 * There no longer is one-to-one relation between worker and
-	 * work queue and a worker task might be unbound from its cpu
-	 * if the cpu was offlined.  Match all busy workers.  This
-	 * function will go away once dynamic pool is implemented.
-	 */
-	for_each_possible_cpu(cpu) {
-		struct global_cwq *gcwq = get_gcwq(cpu);
-		struct worker *worker;
-		struct hlist_node *pos;
-		unsigned long flags;
-		int i;
-
-		spin_lock_irqsave(&gcwq->lock, flags);
-
-		for_each_busy_worker(worker, i, pos, gcwq) {
-			if (worker->task == current) {
-				found = true;
-				break;
-			}
-		}
-
-		spin_unlock_irqrestore(&gcwq->lock, flags);
-		if (found)
-			break;
-	}
-
-	return found;
-}
-
 static struct cpu_workqueue_struct *alloc_cwqs(void)
 {
 	const size_t size = sizeof(struct cpu_workqueue_struct);
@@ -2393,7 +2346,8 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	struct workqueue_struct *wq;
 	unsigned int cpu;
 
-	max_active = clamp_val(max_active, 1, INT_MAX);
+	WARN_ON(max_active < 1 || max_active > WQ_MAX_ACTIVE);
+	max_active = clamp_val(max_active, 1, WQ_MAX_ACTIVE);
 
 	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
 	if (!wq)
@@ -3136,6 +3090,6 @@ void __init init_workqueues(void)
 		spin_unlock_irq(&gcwq->lock);
 	}
 
-	keventd_wq = create_workqueue("events");
+	keventd_wq = __create_workqueue("events", 0, WQ_MAX_ACTIVE / 2);
 	BUG_ON(!keventd_wq);
 }
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* SUBJ: [RFC PATCHSET] concurrency managed workqueue, take#2
  2009-12-18 12:57 Tejun Heo
                   ` (26 preceding siblings ...)
  2009-12-18 12:58 ` [PATCH 27/27] workqueue: increase max_active of keventd and kill current_is_keventd() Tejun Heo
@ 2009-12-18 13:00 ` Tejun Heo
  2009-12-18 13:03 ` Tejun Heo
  2009-12-18 13:45 ` workqueue thing Peter Zijlstra
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 13:00 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi

Aiee... Subject ended up in body.

Sorry.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC PATCHSET] concurrency managed workqueue, take#2
  2009-12-18 12:57 Tejun Heo
                   ` (27 preceding siblings ...)
  2009-12-18 13:00 ` SUBJ: [RFC PATCHSET] concurrency managed workqueue, take#2 Tejun Heo
@ 2009-12-18 13:03 ` Tejun Heo
  2009-12-18 13:45 ` workqueue thing Peter Zijlstra
  29 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-18 13:03 UTC (permalink / raw)
  To: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, peterz, johannes, andi

Here's the test program I've been using to verify features.

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <linux/delay.h>
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/cpu.h>
#include <linux/kthread.h>

#define MAX_WQ_NAME		64
#define MAX_WQS			64
#define MAX_WORKS		64

struct wq_spec {
	int			id;	/* -1 terminates */
	unsigned int		max_active;
	unsigned int		flags;
};

enum action {
	ACT_TERM,			/* end */
	ACT_LOG,			/* const char * */
	ACT_BURN,			/* ulong duration_msecs */
	ACT_SLEEP,			/* ulong duration_msecs */
	ACT_WAKEUP,			/* ulong work_id */
	ACT_REQUEUE,			/* ulong delay_msecs */
	ACT_FLUSH,			/* ulong work_id */
	ACT_FLUSH_WQ,			/* ulong workqueue_id */
	ACT_CANCEL,			/* ulong work_id */
};

struct work_action {
	enum action		action;	/* ACT_TERM terminates */
	union {
		unsigned long	v;
		const char	*s;
	};
};

struct work_spec {
	int			id;		/* -1 terminates */
	int			wq_id;
	int			requeue_cnt;
	unsigned int		cpu;
	unsigned long		initial_delay;	/* msecs */

	const struct work_action *actions;
};

struct test_scenario {
	const struct wq_spec	*wq_spec;
	const struct work_spec	**work_spec;	/* NULL terminated */
};

static const struct wq_spec dfl_wq_spec[] = {
	{
		.id		= 0,
		.max_active	= 32,
		.flags		= 0,
	},
	{
		.id		= 1,
		.max_active	= 32,
		.flags		= 0,
	},
	{
		.id		= 2,
		.max_active	= 32,
		.flags		= WQ_RESCUER,
	},
	{
		.id		= 3,
		.max_active	= 32,
		.flags		= WQ_FREEZEABLE,
	},
	{
		.id		= 4,
		.max_active	= 1,
		.flags		= WQ_SINGLE_CPU | WQ_FREEZEABLE/* | WQ_DBG*/,
	},
	{ .id = -1 },
};

/*
 * Scenario 0.  All are on cpu0.  work16 and 17 burn cpus for 10 and
 * 5msecs respectively and requeue themselves.  18 sleeps 2 secs and
 * cancel both.
 */
static const struct work_spec work_spec0[] = {
	{
		.id		= 16,
		.requeue_cnt	= 1024,
		.actions	= (const struct work_action[]) {
			{ ACT_BURN,	{ 10 }},
			{ ACT_REQUEUE,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 17,
		.requeue_cnt	= 1024,
		.actions	= (const struct work_action[]) {
			{ ACT_BURN,	{ 5 }},
			{ ACT_REQUEUE,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 18,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "will sleep 2s and cancel both" }},
			{ ACT_SLEEP,	{ 2000 }},
			{ ACT_CANCEL,	{ 16 }},
			{ ACT_CANCEL,	{ 17 }},
			{ ACT_TERM },
		},
	},
	{ .id = -1 },
};

static const struct test_scenario scenario0 = {
	.wq_spec		= dfl_wq_spec,
	.work_spec		=
	(const struct work_spec *[]) { work_spec0, NULL },
};

/*
 * Scenario 1.  All are on cpu0.  Work 0, 1 and 2 sleep for different
 * intervals but all three will terminate at around 30secs.  3 starts
 * at @28 and 4 at @33 and both sleep for five secs and then
 * terminate.  5 waits for 0, 1, 2 and then flush wq which by the time
 * should have 3 on it.  After 3 completes @32, 5 terminates too.
 * After 4 secs, 4 terminates and all test sequence is done.
 */
static const struct work_spec work_spec1[] = {
	{
		.id		= 0,
		.actions	= (const struct work_action[]) {
			{ ACT_BURN,	{ 3 }},	/* to cause sched activation */
			{ ACT_LOG,	{ .s = "will sleep 30s" }},
			{ ACT_SLEEP,	{ 30000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_BURN,	{ 5 }},
			{ ACT_LOG,	{ .s = "will sleep 10s and burn 5msec and repeat 3 times" }},
			{ ACT_SLEEP,	{ 10000 }},
			{ ACT_BURN,	{ 5 }},
			{ ACT_LOG,	{ .s = "@10s" }},
			{ ACT_SLEEP,	{ 10000 }},
			{ ACT_BURN,	{ 5 }},
			{ ACT_LOG,	{ .s = "@20s" }},
			{ ACT_SLEEP,	{ 10000 }},
			{ ACT_BURN,	{ 5 }},
			{ ACT_LOG,	{ .s = "@30s" }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 2,
		.actions	= (const struct work_action[]) {
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "will sleep 3s and burn 1msec and repeat 10 times" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@3s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@6s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@9s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@12s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@15s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@18s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@21s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@24s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@27s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@30s" }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 3,
		.initial_delay	= 29000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "started@28s, will sleep for 5s" }},
			{ ACT_SLEEP,	{ 5000 }},
			{ ACT_TERM },
		}
	},
	{
		.id		= 4,
		.initial_delay	= 33000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "started@33s, will sleep for 5s" }},
			{ ACT_SLEEP,	{ 5000 }},
			{ ACT_TERM },
		}
	},
	{
		.id		= 5,
		.wq_id		= 1,	/* can't flush self */
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "flushing 0, 1 and 2" }},
			{ ACT_FLUSH,	{ 0 }},
			{ ACT_FLUSH,	{ 1 }},
			{ ACT_FLUSH,	{ 2 }},
			{ ACT_FLUSH_WQ,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{ .id = -1 },
};

static const struct test_scenario scenario1 = {
	.wq_spec		= dfl_wq_spec,
	.work_spec		=
	(const struct work_spec *[]) { work_spec1, NULL },
};

/*
 * Scenario 2.  Combination of scenario 0 and 1.
 */
static const struct test_scenario scenario2 = {
	.wq_spec		= dfl_wq_spec,
	.work_spec		=
	(const struct work_spec *[]) { work_spec0, work_spec1, NULL },
};

/*
 * Scenario 3.  More complex flushing.
 *
 *               2:burn 2s        3:4s
 *                <---->      <---------->
 *     0:4s                1:4s
 * <---------->      <..----------->
 *    ^               ^
 *    |               |
 *    |               |
 * 4:flush(cpu0)    flush_wq
 * 5:flush(cpu0)    flush
 * 6:flush(cpu1)    flush_wq
 * 7:flush(cpu1)    flush
 */
static const struct work_spec work_spec2[] = {
	{
		.id		= 0,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 1,
		.initial_delay	= 6000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 2,
		.initial_delay	= 5000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "burning 2s" }},
			{ ACT_BURN,	{ 2000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 3,
		.initial_delay	= 9000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},

	{
		.id		= 4,
		.wq_id		= 1,
		.initial_delay	= 1000,
		.actions	= (const struct work_action[]) {
			{ ACT_FLUSH,	{ 0 }},
			{ ACT_SLEEP,	{ 2500 }},
			{ ACT_FLUSH_WQ,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 5,
		.wq_id		= 1,
		.initial_delay	= 1000,
		.actions	= (const struct work_action[]) {
			{ ACT_FLUSH,	{ 0 }},
			{ ACT_SLEEP,	{ 2500 }},
			{ ACT_FLUSH,	{ 1 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 6,
		.wq_id		= 1,
		.cpu		= 1,
		.initial_delay	= 1000,
		.actions	= (const struct work_action[]) {
			{ ACT_FLUSH,	{ 0 }},
			{ ACT_SLEEP,	{ 2500 }},
			{ ACT_FLUSH_WQ,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 7,
		.wq_id		= 1,
		.cpu		= 1,
		.initial_delay	= 1000,
		.actions	= (const struct work_action[]) {
			{ ACT_FLUSH,	{ 0 }},
			{ ACT_SLEEP,	{ 2500 }},
			{ ACT_FLUSH,	{ 1 }},
			{ ACT_TERM },
		},
	},
	{ .id = -1 },
};

static const struct test_scenario scenario3 = {
	.wq_spec		= dfl_wq_spec,
	.work_spec		=
	(const struct work_spec *[]) { work_spec2, NULL },
};

/*
 * Scenario 4.  Mayday!  To be used with MAX_CPU_WORKERS_ORDER reduced
 * to 2.
 */
static const struct work_spec work_spec4[] = {
	{
		.id		= 0,
		.requeue_cnt	= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 5s" }},
			{ ACT_SLEEP,	{ 5000 }},
			{ ACT_REQUEUE,	{ 5000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 1,
		.requeue_cnt	= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 5s" }},
			{ ACT_SLEEP,	{ 5000 }},
			{ ACT_REQUEUE,	{ 5000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 2,
		.requeue_cnt	= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 5s" }},
			{ ACT_SLEEP,	{ 5000 }},
			{ ACT_REQUEUE,	{ 5000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 3,
		.requeue_cnt	= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 5s" }},
			{ ACT_SLEEP,	{ 5000 }},
			{ ACT_REQUEUE,	{ 5000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 4,
		.wq_id		= 2,
		.requeue_cnt	= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 1s" }},
			{ ACT_SLEEP,	{ 1000 }},
			{ ACT_REQUEUE,	{ 5000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 5,
		.wq_id		= 2,
		.requeue_cnt	= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 1s" }},
			{ ACT_SLEEP,	{ 1000 }},
			{ ACT_REQUEUE,	{ 5000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 6,
		.wq_id		= 2,
		.requeue_cnt	= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 1s" }},
			{ ACT_SLEEP,	{ 1000 }},
			{ ACT_REQUEUE,	{ 5000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 7,
		.wq_id		= 2,
		.requeue_cnt	= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 1s" }},
			{ ACT_SLEEP,	{ 1000 }},
			{ ACT_REQUEUE,	{ 5000 }},
			{ ACT_TERM },
		},
	},
	{ .id = -1 },
};

static const struct test_scenario scenario4 = {
	.wq_spec		= dfl_wq_spec,
	.work_spec		=
	(const struct work_spec *[]) { work_spec4, NULL },
};

/*
 * Scenario 5.  To test cpu off/onlining.  A bunch of long running
 * tasks on cpu1.  Gets interesting with various other conditions
 * applied together - lowered MAX_CPU_WORKERS_ORDER, induced failure
 * or delay during CPU_DOWN/UP_PREPARE and so on.
 */
static const struct work_spec work_spec5[] = {
	/* runs for 30 secs */
	{
		.id		= 0,
		.cpu		= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 30s" }},
			{ ACT_SLEEP,	{ 30000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 1,
		.cpu		= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 30s" }},
			{ ACT_SLEEP,	{ 30000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 2,
		.cpu		= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 30s" }},
			{ ACT_SLEEP,	{ 30000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 3,
		.cpu		= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 30s" }},
			{ ACT_SLEEP,	{ 30000 }},
			{ ACT_TERM },
		},
	},

	/* kicks in @15 and runs for 15 from wq0 */
	{
		.id		= 4,
		.cpu		= 1,
		.initial_delay	= 15000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 15s" }},
			{ ACT_SLEEP,	{ 15000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 5,
		.cpu		= 1,
		.initial_delay	= 15000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 15s" }},
			{ ACT_SLEEP,	{ 15000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 6,
		.cpu		= 1,
		.initial_delay	= 15000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 15s" }},
			{ ACT_SLEEP,	{ 15000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 7,
		.cpu		= 1,
		.initial_delay	= 15000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 15s" }},
			{ ACT_SLEEP,	{ 15000 }},
			{ ACT_TERM },
		},
	},

	/* kicks in @15 and runs for 15 from wq2 */
	{
		.id		= 8,
		.wq_id		= 2,
		.cpu		= 1,
		.initial_delay	= 15000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 15s" }},
			{ ACT_SLEEP,	{ 15000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 9,
		.wq_id		= 2,
		.cpu		= 1,
		.initial_delay	= 15000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 15s" }},
			{ ACT_SLEEP,	{ 15000 }},
			{ ACT_TERM },
		},
	},

	/* kicks in @30 and runs for 15 */
	{
		.id		= 10,
		.cpu		= 1,
		.initial_delay	= 30000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 15s" }},
			{ ACT_SLEEP,	{ 15000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 11,
		.cpu		= 1,
		.initial_delay	= 30000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 15s" }},
			{ ACT_SLEEP,	{ 15000 }},
			{ ACT_TERM },
		},
	},

	{ .id = -1 },
};

static const struct test_scenario scenario5 = {
	.wq_spec		= dfl_wq_spec,
	.work_spec		=
	(const struct work_spec *[]) { work_spec5, NULL },
};

/*
 * Scenario 6.  Scenario to test freezeable workqueue.  User should
 * freeze the machine between 0s and 9s.
 *
 * 0,1:sleep 10s
 * <-------->
 *      <--freezing--><--frozen--><-thawed
 *         2,3: sleeps for 10s
 *          <.....................-------->
 *          <.....................-------->
 */
static const struct work_spec work_spec6[] = {
	/* two works which get queued @0s and sleeps for 10s */
	{
		.id		= 0,
		.wq_id		= 3,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 10s" }},
			{ ACT_SLEEP,	{ 10000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 1,
		.wq_id		= 3,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 10s" }},
			{ ACT_SLEEP,	{ 10000 }},
			{ ACT_TERM },
		},
	},
	/* two works which get queued @9s and sleeps for 10s */
	{
		.id		= 2,
		.wq_id		= 3,
		.initial_delay	= 9000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 10s" }},
			{ ACT_SLEEP,	{ 10000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 3,
		.wq_id		= 3,
		.initial_delay	= 9000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 10s" }},
			{ ACT_SLEEP,	{ 10000 }},
			{ ACT_TERM },
		},
	},

	{ .id = -1 },
};

static const struct test_scenario scenario6 = {
	.wq_spec		= dfl_wq_spec,
	.work_spec		=
	(const struct work_spec *[]) { work_spec6, NULL },
};

/*
 * Scenario 7.  Scenario to test multi-colored workqueue.
 *
 *   0 1 2 3 4 5 6 7 8 9 0 1 2	: time in seconds
 * 0:  <------>				cpu0
 * 1:    <------>			cpu1
 * 2:      <------>			cpu2
 * 3:        <------>			cpu3
 * 4:          <------>			cpu0
 * 5:            <------>		cpu1
 * 6:              <------>		cpu2
 * 7:                <------>		cpu3
 * Flush workqueues
 * 10:^
 * 11:  ^
 * 12:  ^
 * 13:      ^
 * 14:        ^
 * 15:        ^
 * 16:            ^
 * 17:              ^
 * 18:              ^
 */
static const struct work_spec work_spec7[] = {
	/* 8 works - each sleeps 4sec and starts staggered */
	{
		.id		= 0,
		.wq_id		= 0,
		.cpu		= 0,
		.initial_delay	= 1000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s @ 1s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 1,
		.wq_id		= 0,
		.cpu		= 1,
		.initial_delay	= 2000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s @ 2s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 2,
		.wq_id		= 0,
		.cpu		= 2,
		.initial_delay	= 3000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s @ 3s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 3,
		.wq_id		= 0,
		.cpu		= 3,
		.initial_delay	= 4000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s @ 4s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 4,
		.wq_id		= 0,
		.cpu		= 0,
		.initial_delay	= 5000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s @ 5s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 5,
		.wq_id		= 0,
		.cpu		= 1,
		.initial_delay	= 6000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s @ 6s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 6,
		.wq_id		= 0,
		.cpu		= 2,
		.initial_delay	= 7000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s @ 7s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 7,
		.wq_id		= 0,
		.cpu		= 3,
		.initial_delay	= 8000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s @ 8s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},

	/* 9 workqueue flushers */
	{
		.id		= 10,
		.wq_id		= 1,
		.initial_delay	= 500,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "flush_wq @ 0.5s" }},
			{ ACT_FLUSH_WQ,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 11,
		.wq_id		= 1,
		.initial_delay	= 1500,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "flush_wq @ 1.5s" }},
			{ ACT_FLUSH_WQ,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 12,
		.wq_id		= 1,
		.initial_delay	= 1500,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "flush_wq @ 1.5s" }},
			{ ACT_FLUSH_WQ,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 13,
		.wq_id		= 1,
		.initial_delay	= 3500,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "flush_wq @ 3.5s" }},
			{ ACT_FLUSH_WQ,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 14,
		.wq_id		= 1,
		.initial_delay	= 4500,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "flush_wq @ 4.5s" }},
			{ ACT_FLUSH_WQ,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 15,
		.wq_id		= 1,
		.initial_delay	= 4500,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "flush_wq @ 4.5s" }},
			{ ACT_FLUSH_WQ,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 16,
		.wq_id		= 1,
		.initial_delay	= 6500,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "flush_wq @ 6.5s" }},
			{ ACT_FLUSH_WQ,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 17,
		.wq_id		= 1,
		.initial_delay	= 7500,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "flush_wq @ 7.5s" }},
			{ ACT_FLUSH_WQ,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 18,
		.wq_id		= 1,
		.initial_delay	= 7500,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "flush_wq @ 7.5s" }},
			{ ACT_FLUSH_WQ,	{ 0 }},
			{ ACT_TERM },
		},
	},

	{ .id = -1 },
};

static const struct test_scenario scenario7 = {
	.wq_spec		= dfl_wq_spec,
	.work_spec		=
	(const struct work_spec *[]) { work_spec7, NULL },
};

/*
 * Scenario 8.  Scenario to test single thread workqueue.  Test with
 * freeze/thaw, suspend/resume at various points.
 *
 * 0@cpu0 <------>
 * 1@cpu1     <...--->
 * 2@cpu2         <...--->
 * 3@cpu3                  <------>
 * 4@cpu0                      <...--->
 * 5@cpu1                              <------>
 */
static const struct work_spec work_spec8[] = {
	{
		.id		= 0,
		.wq_id		= 4,
		.cpu		= 0,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s @0s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 1,
		.wq_id		= 4,
		.cpu		= 1,
		.initial_delay	= 2000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 2s @2:4s" }},
			{ ACT_SLEEP,	{ 2000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 2,
		.wq_id		= 4,
		.cpu		= 2,
		.initial_delay	= 4000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 2s @4:6s" }},
			{ ACT_SLEEP,	{ 2000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 3,
		.wq_id		= 4,
		.cpu		= 3,
		.initial_delay	= 8500,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s @8.5s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 4,
		.wq_id		= 4,
		.cpu		= 0,
		.initial_delay	= 10500,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 2s @10.5:12.5s" }},
			{ ACT_SLEEP,	{ 2000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 5,
		.wq_id		= 4,
		.cpu		= 1,
		.initial_delay	= 14500,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s @14.5s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},
	{ .id = -1 },
};

static const struct test_scenario scenario8 = {
	.wq_spec		= dfl_wq_spec,
	.work_spec		=
	(const struct work_spec *[]) { work_spec8, NULL },
};

static const struct test_scenario *scenarios[] = {
	&scenario0, &scenario1, &scenario2, &scenario3, &scenario4, &scenario5,
	&scenario6, &scenario7, &scenario8,
};

/*
 * Execute
 */
static struct task_struct *sequencer;
static char wq_names[MAX_WQS][MAX_WQ_NAME];
static struct workqueue_struct *wqs[MAX_WQS];
static struct delayed_work dworks[MAX_WORKS];
#ifdef CONFIG_LOCKDEP
static struct lock_class_key wq_lockdep_keys[MAX_WORKS];
static struct lock_class_key dwork_lockdep_keys[MAX_WORKS];
#endif
static const struct work_spec *work_specs[MAX_WORKS];
static bool sleeping[MAX_WORKS];
static wait_queue_head_t wait_heads[MAX_WORKS];
static int requeue_cnt[MAX_WORKS];

static void test_work_fn(struct work_struct *work)
{
	int id = to_delayed_work(work) - dworks;
	const struct work_spec *spec = work_specs[id];
	const struct work_action *act = spec->actions;
	int rc;

#define pd(lvl, fmt, args...)	\
	printk("w%02d/%04d@%d "lvl": "fmt"\n", id, current->pid, raw_smp_processor_id() , ##args);
#define plog(fmt, args...)	pd("LOG ", fmt , ##args)
#define pinfo(fmt, args...)	pd("INFO", fmt , ##args)
#define perr(fmt, args...)	pd("ERR ", fmt , ##args)
repeat:
	switch (act->action) {
	case ACT_TERM:
		pinfo("TERM");
		return;
	case ACT_LOG:
		plog("%s", act->s);
		break;
	case ACT_BURN:
		mdelay(act->v);
		break;
	case ACT_SLEEP:
		sleeping[id] = true;
		wait_event_timeout(wait_heads[id], !sleeping[id],
				   msecs_to_jiffies(act->v));
		if (!sleeping[id])
			pinfo("somebody woke me up");
		sleeping[id] = false;
		break;
	case ACT_WAKEUP:
		if (act->v < MAX_WORKS && sleeping[act->v]) {
			pinfo("waking %lu up", act->v);
			sleeping[act->v] = false;
			wake_up(&wait_heads[act->v]);
		} else
			perr("trying to wake up non-sleeping work %lu",
			     act->v);
		break;
	case ACT_REQUEUE:
		if (requeue_cnt[id] > 0 || requeue_cnt[id] < 0) {
			int cpu;

			get_online_cpus();

			if (spec->cpu < nr_cpu_ids && cpu_online(spec->cpu))
				cpu = spec->cpu;
			else
				cpu = raw_smp_processor_id();

			if (act->v)
				queue_delayed_work_on(cpu, wqs[spec->wq_id],
						      &dworks[id],
						      msecs_to_jiffies(act->v));
			else
				queue_work_on(cpu, wqs[spec->wq_id],
					      &dworks[id].work);

			pinfo("requeued on cpu%d delay=%lumsecs",
			      cpu, act->v);
			if (requeue_cnt[id] > 0)
				requeue_cnt[id]--;

			put_online_cpus();
		} else
			pinfo("requeue limit reached");

		break;
	case ACT_FLUSH:
		if (act->v < MAX_WORKS && work_specs[act->v]) {
			pinfo("flushing work %lu", act->v);
			rc = flush_work(&dworks[act->v].work);
			pinfo("flushed work %lu, rc=%d", act->v, rc);
		} else
			perr("trying to flush non-existent work %lu", act->v);
		break;
	case ACT_FLUSH_WQ:
		if (act->v < MAX_WQS && wqs[act->v]) {
			pinfo("flushing workqueue %lu", act->v);
			flush_workqueue(wqs[act->v]);
			pinfo("flushed workqueue %lu", act->v);
		} else
			perr("trying to flush non-existent workqueue %lu",
			     act->v);
		break;
	case ACT_CANCEL:
		if (act->v < MAX_WORKS && work_specs[act->v]) {
			pinfo("canceling work %lu", act->v);
			rc = cancel_delayed_work_sync(&dworks[act->v]);
			pinfo("canceled work %lu, rc=%d", act->v, rc);
		} else
			perr("trying to cancel non-existent work %lu", act->v);
		break;
	}
	act++;
	goto repeat;
}

#define for_each_work_spec(spec, i, scenario)				\
	for (i = 0, spec = scenario->work_spec[i]; spec;		\
	     spec = (spec + 1)->id >= 0 ? spec + 1 : scenario->work_spec[++i])

static int sequencer_thread(void *__scenario)
{
	const struct test_scenario *scenario = __scenario;
	const struct wq_spec *wq_spec;
	const struct work_spec *work_spec;
	int i, id;

	for (wq_spec = scenario->wq_spec; wq_spec->id >= 0; wq_spec++) {
		if (wq_spec->id >= MAX_WQS) {
			printk("ERR : wq id %d too high\n", wq_spec->id);
			goto err;
		}

		if (wqs[wq_spec->id]) {
			printk("ERR : wq %d already occupied\n", wq_spec->id);
			goto err;
		}

		snprintf(wq_names[wq_spec->id], MAX_WQ_NAME, "test-wq-%02d",
			 wq_spec->id);
		wqs[wq_spec->id] = __create_workqueue_key(wq_names[wq_spec->id],
						wq_spec->flags,
						wq_spec->max_active,
						&wq_lockdep_keys[wq_spec->id],
						wq_names[wq_spec->id]);
		if (!wqs[wq_spec->id]) {
			printk("ERR : failed create wq %d\n", wq_spec->id);
			goto err;
		}
	}

	for_each_work_spec(work_spec, i, scenario) {
		struct delayed_work *dwork = &dworks[work_spec->id];
		struct workqueue_struct *wq = wqs[work_spec->wq_id];

		if (work_spec->id >= MAX_WORKS) {
			printk("ERR : work id %d too high\n", work_spec->id);
			goto err;
		}

		if (!wq) {
			printk("ERR : work %d references non-existent wq %d\n",
			       work_spec->id, work_spec->wq_id);
			goto err;
		}

		if (work_specs[work_spec->id]) {
			printk("ERR : work %d already initialized\n",
			       work_spec->id);
			goto err;
		}
		INIT_DELAYED_WORK(dwork, test_work_fn);
#ifdef CONFIG_LOCKDEP
		lockdep_init_map(&dwork->work.lockdep_map, "test-dwork",
				 &dwork_lockdep_keys[work_spec->id], 0);
#endif
		work_specs[work_spec->id] = work_spec;
		init_waitqueue_head(&wait_heads[work_spec->id]);
		requeue_cnt[work_spec->id] = work_spec->requeue_cnt ?: -1;
	}

	for_each_work_spec(work_spec, i, scenario) {
		struct delayed_work *dwork = &dworks[work_spec->id];
		struct workqueue_struct *wq = wqs[work_spec->wq_id];
		int cpu;

		get_online_cpus();

		if (work_spec->cpu < nr_cpu_ids && cpu_online(work_spec->cpu))
			cpu = work_spec->cpu;
		else
			cpu = raw_smp_processor_id();

		if (work_spec->initial_delay)
			queue_delayed_work_on(cpu, wq, dwork,
				msecs_to_jiffies(work_spec->initial_delay));
		else
			queue_work_on(cpu, wq, &dwork->work);

		put_online_cpus();
	}

	set_current_state(TASK_INTERRUPTIBLE);
	while (!kthread_should_stop()) {
		schedule();
		set_current_state(TASK_INTERRUPTIBLE);
	}
	set_current_state(TASK_RUNNING);

err:
	for (id = 0; id < MAX_WORKS; id++)
		if (work_specs[id])
			cancel_delayed_work_sync(&dworks[id]);
	for (id = 0; id < MAX_WQS; id++)
		if (wqs[id])
			destroy_workqueue(wqs[id]);

	set_current_state(TASK_INTERRUPTIBLE);
	while (!kthread_should_stop()) {
		schedule();
		set_current_state(TASK_INTERRUPTIBLE);
	}

	return 0;
}

static int scenario_switch = -1;
module_param_named(scenario, scenario_switch, int, 0444);

static int __init test_init(void)
{
	if (scenario_switch < 0 || scenario_switch >= ARRAY_SIZE(scenarios)) {
		printk("TEST WQ - no such scenario\n");
		return -EINVAL;
	}

	printk("TEST WQ - executing scenario %d\n", scenario_switch);
	sequencer = kthread_run(sequencer_thread,
				(void *)scenarios[scenario_switch],
				"test-wq-sequencer");
	if (IS_ERR(sequencer))
		return PTR_ERR(sequencer);
	return 0;
}

static void __exit test_exit(void)
{
	kthread_stop(sequencer);
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");
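
For reference, the scenario to run is selected with the "scenario"
module parameter (something like scenario=3 on the insmod/modprobe
command line for the built module), and each work logs to the kernel
log with lines prefixed by its work id, pid and executing cpu.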

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-18 12:57 Tejun Heo
                   ` (28 preceding siblings ...)
  2009-12-18 13:03 ` Tejun Heo
@ 2009-12-18 13:45 ` Peter Zijlstra
  2009-12-18 13:50   ` Andi Kleen
                     ` (2 more replies)
  29 siblings, 3 replies; 104+ messages in thread
From: Peter Zijlstra @ 2009-12-18 13:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi

On Fri, 2009-12-18 at 21:57 +0900, Tejun Heo wrote:
> 
> C. While discussing issue B [3], Peter Zijlstra objected to the
>    basic design of cmwq.  Peter's objections are...
> 
>    o1. It isn't a generic worker pool mechanism in that it can't serve
>        cpu-intensive workloads because all works are affined to local
>        cpus.
> 
>    o2. Allowing long (> 5s for example) running works isn't a good
>        idea and by not allowing long running works, the need to
>        migrate back workers when cpu comes back online can be removed.
> 
>    o3. It's a fork-fest.
> 
>    My rationales for each are
> 
>    r1. The first design goal of cmwq is solving the issues the current
>        workqueue implementation has including hard to detect
>        deadlocks, 

lockdep is quite proficient at finding these, these days.

>	 unexpectedly long latencies caused by long running
>        works which share the workqueue and excessive number of worker
>        threads necessitated by each workqueue having its own workers.

works shouldn't be long running to begin with

>        cmwq solves these issues quite efficiently without depending on
>        fragile and complex heuristics.  Concurrency is managed to
>        minimal yet sufficient level, workers are reused as much as
>        possible and only necessary number of workers are created and
>        maintained.
> 
>        cmwq is cpu affine because its target workloads are not cpu
>        intensive.  Most works are context hungry not cpu cycle hungry
>        and as such providing the necessary context (or concurrency)
>        from the local CPU is the most efficient way to serve them.

Things cannot be not cpu intensive and long running.

And this design is patently unsuited for cpu intensive tasks, hence they
should not be long running.

The only way something can be not cpu intensive and long 'running' is if
it got blocked that long, and the right solution is to fix that
contention, things should not be blocked for seconds.

>        The second design goal is to unify different async mechanisms
>        in kernel.  Although cmwq wouldn't be able to serve CPU cycle
>        intensive workload, most in-kernel async mechanisms are there
>        to provide context and concurrency and they all can be
>        converted to use cmwq.

Which specifically, the ones I'm aware of are mostly cpu intensive.

>        Async workloads which need to burn large amount of CPU cycles
>        such as encryption and IO checksumming have pretty different
>        requirements and worker pool designed to serve them would
>        probably require fair amount of heuristics to determine the
>        appropriate level of concurrency.  Workqueue API may be
>        extended to cover such workloads by providing an anonymous CPU
>        for those works to bind to but the underlying operation would
>        be fairly different.  If this is something necessary, let's
>        pursue it but I don't think it's exclusive with cmwq.

The interesting bit is limiting runnable tasks; that will introduce
deadlock potential.

>    r2. The only thing necessary to support long running works is the
>        ability to rebind workers to the cpu if it comes back online
>        and allowing long running works will allow most existing worker
>        pools to be served by cmwq and also make CPU down/up latencies
>        more predictable.

That's not necessary at all, and introduces quite a lot of ugly code.

Furthermore, let me restate that having long running works is the
problem.

>    r3. I don't think there is any way to implement shared worker pool
>        without forking when more concurrency is required and the
>        actual amount of forking would be low as cmwq scales the number
>        of idle workers to keep according to the current concurrency
>        level and uses rather long timeout (5min) for idlers.

I'm still not convinced more concurrency is required.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-18 13:45 ` workqueue thing Peter Zijlstra
@ 2009-12-18 13:50   ` Andi Kleen
  2009-12-18 15:01     ` Arjan van de Ven
  2009-12-18 15:30   ` workqueue thing Linus Torvalds
  2009-12-21  3:04   ` Tejun Heo
  2 siblings, 1 reply; 104+ messages in thread
From: Andi Kleen @ 2009-12-18 13:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, torvalds, awalls, linux-kernel, jeff, mingo, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi

> The only way something can be not cpu intensive and long 'running' is if
> it got blocked that long, and the right solution is to fix that
> contention, things should not be blocked for seconds.

Work queue items shouldn't be blocking for seconds in the normal case, but it can
always happen in the exceptional case (e.g. when handling some error condition).
In that case it would be good if the other jobs weren't too
disrupted and the whole setup degraded gracefully.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-18 13:50   ` Andi Kleen
@ 2009-12-18 15:01     ` Arjan van de Ven
  2009-12-21  3:19       ` Tejun Heo
  2009-12-21  9:17       ` Jens Axboe
  0 siblings, 2 replies; 104+ messages in thread
From: Arjan van de Ven @ 2009-12-18 15:01 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Tejun Heo, torvalds, awalls, linux-kernel, jeff,
	mingo, akpm, jens.axboe, rusty, cl, dhowells, avi, johannes

On 12/18/2009 5:50, Andi Kleen wrote:
>> The only way something can be not cpu intensive and long 'running' is if
>> it got blocked that long, and the right solution is to fix that
>> contention, things should not be blocked for seconds.
>
> Work queue items shouldn't be blocking for seconds in the normal case, but it can
> always happen in the exceptional case (e.g. when handling some error condition).
> In that case it would be good if the other jobs weren't too
> disrupted and the whole setup degraded gracefully.

in addition, threads are cheap. Linux has no technical problem with running 100's of kernel threads
(if not 1000s); they cost basically a task struct and a stack (2 pages) each and that's about it.
making an elaborate-and-thus-fragile design to save a few kernel threads is likely a bad design direction...

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-18 13:45 ` workqueue thing Peter Zijlstra
  2009-12-18 13:50   ` Andi Kleen
@ 2009-12-18 15:30   ` Linus Torvalds
  2009-12-18 15:39     ` Ingo Molnar
  2009-12-18 15:39     ` Peter Zijlstra
  2009-12-21  3:04   ` Tejun Heo
  2 siblings, 2 replies; 104+ messages in thread
From: Linus Torvalds @ 2009-12-18 15:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi



On Fri, 18 Dec 2009, Peter Zijlstra wrote:

> >    r1. The first design goal of cmwq is solving the issues the current
> >        workqueue implementation has including hard to detect
> >        deadlocks, 
> 
> lockdep is quite proficient at finding these, these days.

I don't think so.

The reason it is not is that workqueues fundamentally do _different_ 
things in the same context, and lockdep has no clue whatsoever.

IOW, if you hold a lock, and then do 'flush_workqueue()', lockdep has no 
idea that maybe one of the entries on a workqueue might need the lock that 
you are holding. But I don't think lockdep sees the dependency that gets 
created by the flush - because it's not a direct code execution 
dependency.

It's not a deadlock _directly_ due to lock ordering, but indirectly due to 
waiting for unrelated code that needs locks.

Now, maybe lockdep could be _taught_ to consider workqueues themselves to 
be 'locks', and ordering those pseudo-locks wrt the real locks they take. 
So if workqueue Q takes lock A, the fact that it is _taken_ in a workqueue 
makes the ordering be Q->A. Then, if somebody does a "flush_workqueue" 
while holding lock B, the flush implies a "lock ordering" of B->Q (where 
"Q" is the set of all workqueues that could be flushed).

		Linus

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-18 15:30   ` workqueue thing Linus Torvalds
@ 2009-12-18 15:39     ` Ingo Molnar
  2009-12-18 15:39     ` Peter Zijlstra
  1 sibling, 0 replies; 104+ messages in thread
From: Ingo Molnar @ 2009-12-18 15:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Tejun Heo, awalls, linux-kernel, jeff, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> IOW, if you hold a lock, and then do 'flush_workqueue()', lockdep has no 
> idea that maybe one of the entries on a workqueue might need the lock that 
> you are holding. But I don't think lockdep sees the dependency that gets 
> created by the flush - because it's not a direct code execution dependency.

Do you mean like the annotations we added in:

  4e6045f: workqueue: debug flushing deadlocks with lockdep
  a67da70: workqueues: lockdep annotations for flush_work()

?

It looks like this currently in the worklet:

                lock_map_acquire(&cwq->wq->lockdep_map);
                lock_map_acquire(&lockdep_map);
                f(work);
                lock_map_release(&lockdep_map);
                lock_map_release(&cwq->wq->lockdep_map);

and like this in flush:

        lock_map_acquire(&wq->lockdep_map);
        lock_map_release(&wq->lockdep_map);
        for_each_cpu(cpu, cpu_map)
                flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));

We basically track the implicit dependencies even if they are not executed 
(only theoretically possible) - and we subsequently caught a few bugs that 
way.

Or did you have some other dependency in mind?

	Ingo

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-18 15:30   ` workqueue thing Linus Torvalds
  2009-12-18 15:39     ` Ingo Molnar
@ 2009-12-18 15:39     ` Peter Zijlstra
  2009-12-18 15:47       ` Linus Torvalds
  1 sibling, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2009-12-18 15:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi

On Fri, 2009-12-18 at 07:30 -0800, Linus Torvalds wrote:
> 
> On Fri, 18 Dec 2009, Peter Zijlstra wrote:
> 
> > >    r1. The first design goal of cmwq is solving the issues the current
> > >        workqueue implementation has including hard to detect
> > >        deadlocks, 
> > 
> > lockdep is quite proficient at finding these, these days.
> 
> I don't think so.
> 
> The reason it is not is that workqueues fundamentally do _different_ 
> things in the same context, and lockdep has no clue whatsoever.
> 
> IOW, if you hold a lock, and then do 'flush_workqueue()', lockdep has no 
> idea that maybe one of the entries on a workqueue might need the lock that 
> you are holding. But I don't think lockdep sees the dependency that gets 
> created by the flush - because it's not a direct code execution 
> dependency.
> 
> It's not a deadlock _directly_ due to lock ordering, but indirectly due to 
> waiting for unrelated code that needs locks.
> 
> Now, maybe lockdep could be _taught_ to consider workqueues themselves to 
> be 'locks', and ordering those pseudo-locks wrt the real locks they take. 
> So if workqueue Q takes lock A, the fact that it is _taken_ in a workqueue 
> makes the ordering be Q->A. Then, if somebody does a "flush_workqueue" 
> while holding lock B, the flush implies a "lock ordering" of B->Q (where 
> "Q" is the set of all workqueues that could be flushed).

That's exactly what it does..

4e6045f134784f4b158b3c0f7a282b04bd816887
eb13ba873881abd5e15af784756a61af635e665e
a67da70dc0955580665f5444f318b92e69a3c272





^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-18 15:39     ` Peter Zijlstra
@ 2009-12-18 15:47       ` Linus Torvalds
  2009-12-18 15:53         ` Peter Zijlstra
  0 siblings, 1 reply; 104+ messages in thread
From: Linus Torvalds @ 2009-12-18 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi



On Fri, 18 Dec 2009, Peter Zijlstra wrote:
> 
> That's exactly what it does..
> 
> 4e6045f134784f4b158b3c0f7a282b04bd816887
> eb13ba873881abd5e15af784756a61af635e665e
> a67da70dc0955580665f5444f318b92e69a3c272

Ahh. Never even triggered my "I've seen those patches".

		Linus

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-18 15:47       ` Linus Torvalds
@ 2009-12-18 15:53         ` Peter Zijlstra
  0 siblings, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2009-12-18 15:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi

On Fri, 2009-12-18 at 07:47 -0800, Linus Torvalds wrote:
> 
> On Fri, 18 Dec 2009, Peter Zijlstra wrote:
> > 
> > That's exactly what it does..
> > 
> > 4e6045f134784f4b158b3c0f7a282b04bd816887
> > eb13ba873881abd5e15af784756a61af635e665e
> > a67da70dc0955580665f5444f318b92e69a3c272
> 
> Ahh. Never even triggered my "I've seen those patches".

Hehe, with the volume going through these days I'm amazed at the amount
you do remember ;-)


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-18 13:45 ` workqueue thing Peter Zijlstra
  2009-12-18 13:50   ` Andi Kleen
  2009-12-18 15:30   ` workqueue thing Linus Torvalds
@ 2009-12-21  3:04   ` Tejun Heo
  2009-12-21  9:22     ` Peter Zijlstra
  2 siblings, 1 reply; 104+ messages in thread
From: Tejun Heo @ 2009-12-21  3:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi

Hello,

On 12/18/2009 10:45 PM, Peter Zijlstra wrote:
>>    r1. The first design goal of cmwq is solving the issues the current
>>        workqueue implementation has including hard to detect
>>        deadlocks, 
> 
> lockdep is quite proficient at finding these, these days.

I've been thinking there are cases which the current lockdep
annotations can't detect, but I can't think of any.  Still, when such
possible deadlocks are detected, the only solution we now have is to
put the offending work into a separate workqueue.  As it is often
overkill to create a multithreaded workqueue for that single work, it
usually ends up being a singlethread one, which often isn't optimal.
With cmwq, this problem is gone.
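
The workaround being described typically ends up looking something
like the sketch below (for illustration only; the foo_* names are made
up):

#include <linux/workqueue.h>

/* dedicated queue created only to break a flush/lock dependency */
static struct workqueue_struct *foo_wq;

static void foo_work_fn(struct work_struct *work)
{
	/* ... */
}
static DECLARE_WORK(foo_work, foo_work_fn);

static int foo_setup(void)
{
	foo_wq = create_singlethread_workqueue("foo_wq");
	if (!foo_wq)
		return -ENOMEM;
	queue_work(foo_wq, &foo_work);
	return 0;
}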

>> 	 unexpectedly long latencies caused by long running
>>        works which share the workqueue and excessive number of worker
>>        threads necessitated by each workqueue having its own workers.
> 
> works shouldn't be long running to begin with

Let's discuss this below.

>>        cmwq is cpu affine because its target workloads are not cpu
>>        intensive.  Most works are context hungry not cpu cycle hungry
>>        and as such providing the necessary context (or concurrency)
>>        from the local CPU is the most efficient way to serve them.
> 
> Things cannot be not cpu intensive and long running.

What are you talking about?  There's a huge world outside of CPUs and
RAMs where taking seconds or even tens of seconds isn't too strange.
SCSI/ATA exception conditions can easily take tens of seconds, which in
turn means that anything depending on IO may take tens of seconds.

> And this design is patently unsuited for cpu intensive tasks, hence they
> should not be long running.

Burning a lot of CPU cycles in the kernel is, and must remain, a very
rare exception.  Waiting for IO or other external events is far less
so.  Seriously, take a look at the other async mechanisms we have -
async, long works, SCSI EHs.  They're there to provide concurrency so
that they can wait for *EVENTS*, not to burn CPU cycles.

> The only way something can be not cpu intensive and long 'running' is if
> it got blocked that long, and the right solution is to fix that
> contention, things should not be blocked for seconds.

IOs and events.  Not CPUs or RAMs.

>>        The second design goal is to unify different async mechanisms
>>        in kernel.  Although cmwq wouldn't be able to serve CPU cycle
>>        intensive workload, most in-kernel async mechanisms are there
>>        to provide context and concurrency and they all can be
>>        converted to use cmwq.
> 
> Which specifically, the ones I'm aware of are mostly cpu intensive.

Async which currently is only used to make ATA probing parallel.  Long
works which are used for fscache.  SCSI/ATA EHs.  ATA polling PIO and
probing helper workers.  To-be-implemented in-kernel media presence
polling.  xfs or other fs IO threads.

The ones you're aware of are very different from the ones I'm aware
of.  The difference probably comes from the areas we usually work in.
CPU intensive ones require a pretty different solution from the ones
which just need contexts to wait for events.  For the latter, things
can be made very mechanical: the optimal level of concurrency is
simply the minimal level which is just enough to avoid blocking other
works, because the resource they're competing for - the contexts -
isn't scarce.  And for this, the work interface is very well suited.

For CPU intensive ones, it's more difficult as CPU cycles are in
contention and things like fairness need to be considered.  It's a
different class of problem.  From what I can see, this class of
problems is still smaller in volume than the event waiting class.
The increasing popularity of FS end-to-end verification and maybe
encryption is likely to increase pressure here tho.  This is gonna
require a different solution where the scheduler would play the core
role.

>>    r2. The only thing necessary to support long running works is the
>>        ability to rebind workers to the cpu if it comes back online
>>        and allowing long running works will allow most existing worker
>>        pools to be served by cmwq and also make CPU down/up latencies
>>        more predictable.
> 
> That's not necessary at all, and introduces quite a lot of ugly code.
>
> Furthermore, let me restate that having long running works is the
> problem.

I guess I explained this enough.  When IO goes wrong, in extreme
cases, it can easily take over thirty secs to recover and that's
required by the hardware specifications, so anything which ends up
waiting on IO can take a pretty long time.  The only piece of code
necessary to support that is the code to migrate tasks back to CPUs
when they come online again.  It's not a lot of
ugly code.

>>    r3. I don't think there is any way to implement shared worker pool
>>        without forking when more concurrency is required and the
>>        actual amount of forking would be low as cmwq scales the number
>>        of idle workers to keep according to the current concurrency
>>        level and uses rather long timeout (5min) for idlers.
> 
> I'm still not convinced more concurrency is required.

Hope my explanations helped convincing you.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-18 15:01     ` Arjan van de Ven
@ 2009-12-21  3:19       ` Tejun Heo
  2009-12-21  9:17       ` Jens Axboe
  1 sibling, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-21  3:19 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, Peter Zijlstra, torvalds, awalls, linux-kernel, jeff,
	mingo, akpm, jens.axboe, rusty, cl, dhowells, avi, johannes

Hello,

On 12/19/2009 12:01 AM, Arjan van de Ven wrote:
> in addition, threads are cheap. Linux has no technical problem with
> running 100's of kernel threads (if not 1000s); they cost basically
> a task struct and a stack (2 pages) each and that's about it.
> making an elaborate-and-thus-fragile design to save a few kernel
> threads is likely a bad design direction...

A resource not being scarce is very different from a resource being
limitless and doesn't mean we can use it without thinking about it.
If we hadn't been trying to limit the number of threads used by
various subsystems, we could easily be looking at a far larger number
of threads than we're using now and things would be breaking at the
far ends of the spectrum (very small or extremely large systems).

What cmwq does is manage this not-so-scarce resource automatically in
a single place, and because the resource is not so scarce, that
management can be made very mechanical and reliable.  There's no need
to worry about managing concurrency or limiting the number of threads
from other places.  Sure, it involves some amount of complex code in
the workqueue implementation itself, but we'll be removing a lot of
complexity and code from more peripheral places, and that is the right
direction to go.  We can easily get the core code right, and after
being properly debugged, cmwq won't be fragile.  As an async
mechanism, its operation is highly deterministic and mechanical.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-18 15:01     ` Arjan van de Ven
  2009-12-21  3:19       ` Tejun Heo
@ 2009-12-21  9:17       ` Jens Axboe
  2009-12-21 10:35         ` Peter Zijlstra
                           ` (2 more replies)
  1 sibling, 3 replies; 104+ messages in thread
From: Jens Axboe @ 2009-12-21  9:17 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, Peter Zijlstra, Tejun Heo, torvalds, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes

On Fri, Dec 18 2009, Arjan van de Ven wrote:
> in addition, threads are cheap. Linux has no technical problem with
> running 100's of kernel threads (if not 1000s); they cost basically a
> task struct and a stack (2 pages) each and that's about it.  making an
> elaborate-and-thus-fragile design to save a few kernel threads is
> likely a bad design direction...

One would hope not, since that is by no means outside of what you see on
boxes today... Thousands. The fact that they are cheap, is not an
argument against doing it right. Conceptually, I think the concurrency
managed work queue pool is a much cleaner (and efficient) design.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21  3:04   ` Tejun Heo
@ 2009-12-21  9:22     ` Peter Zijlstra
  2009-12-21 13:30       ` Tejun Heo
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2009-12-21  9:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi

On Mon, 2009-12-21 at 12:04 +0900, Tejun Heo wrote:
> When IO goes wrong, in extreme
> cases, it can easily take over thirty secs to recover and that's
> required by the hardware specifications, so anything which ends up
> waiting on IO can take a pretty long time.  The only piece of code
> which is necessary to support that is the code necessary to migrate
> back tasks to CPUs when they come online again.  It's not a lot of
> ugly code. 

Why does it need to get migrated back, there are no affinity promises if
you allow hotplug to continue, so it might as well complete and continue
on the other cpu.

And yes, it is a lot of very ugly code.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21  9:17       ` Jens Axboe
@ 2009-12-21 10:35         ` Peter Zijlstra
  2009-12-21 11:09         ` Andi Kleen
  2009-12-21 11:11         ` Arjan van de Ven
  2 siblings, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2009-12-21 10:35 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Arjan van de Ven, Andi Kleen, Tejun Heo, torvalds, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes

On Mon, 2009-12-21 at 10:17 +0100, Jens Axboe wrote:
> On Fri, Dec 18 2009, Arjan van de Ven wrote:
> > in addition, threads are cheap. Linux has no technical problem with
> > running 100's of kernel threads (if not 1000s); they cost basically a
> > task struct and a stack (2 pages) each and that's about it.  making an
> > elaborate-and-thus-fragile design to save a few kernel threads is
> > likely a bad design direction...
> 
> One would hope not, since that is by no means outside of what you see on
> boxes today... Thousands. The fact that they are cheap, is not an
> argument against doing it right. Conceptually, I think the concurrency
> managed work queue pool is a much cleaner (and efficient) design.

If your only concern is the number of idle threads, and it reads like
that, then there is a much easier solution for that.

But I tend to agree with Arjan: who cares if there are thousands of
idle threads around.

The fact is that this concurrent workqueue stuff really only works with
works that don't consume CPU, and that's simply not the case today:
there are a number of workqueue users which really do burn CPU.

But even then, the corner cases introduced by memory pressure and
reclaim just make the whole thing an utterly fragile mess.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21  9:17       ` Jens Axboe
  2009-12-21 10:35         ` Peter Zijlstra
@ 2009-12-21 11:09         ` Andi Kleen
  2009-12-21 11:17           ` Arjan van de Ven
  2009-12-21 11:11         ` Arjan van de Ven
  2 siblings, 1 reply; 104+ messages in thread
From: Andi Kleen @ 2009-12-21 11:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Arjan van de Ven, Andi Kleen, Peter Zijlstra, Tejun Heo,
	torvalds, awalls, linux-kernel, jeff, mingo, akpm, rusty, cl,
	dhowells, avi, johannes

On Mon, Dec 21, 2009 at 10:17:54AM +0100, Jens Axboe wrote:
> On Fri, Dec 18 2009, Arjan van de Ven wrote:
> > in addition, threads are cheap. Linux has no technical problem with
> > running 100's of kernel threads (if not 1000s); they cost basically a
> > task struct and a stack (2 pages) each and that's about it.  making an
> > elaborate-and-thus-fragile design to save a few kernel threads is
> > likely a bad design direction...
> 
> One would hope not, since that is by no means outside of what you see on
> boxes today... Thousands. The fact that they are cheap, is not an
> argument against doing it right. Conceptually, I think the concurrency
> managed work queue pool is a much cleaner (and efficient) design.

Agreed. Even if possible, thousands of threads waste precious cache.
And they look ugly in ps.

Also the nice thing about dynamically sizing the thread pool
is that if something bad (error condition that takes long) happens
in one work queue for a specific subsystem there's still a chance
to make progress with other operations in the same subsystem.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21  9:17       ` Jens Axboe
  2009-12-21 10:35         ` Peter Zijlstra
  2009-12-21 11:09         ` Andi Kleen
@ 2009-12-21 11:11         ` Arjan van de Ven
  2009-12-21 13:22           ` Tejun Heo
  2 siblings, 1 reply; 104+ messages in thread
From: Arjan van de Ven @ 2009-12-21 11:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andi Kleen, Peter Zijlstra, Tejun Heo, torvalds, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes

On 12/21/2009 10:17, Jens Axboe wrote:
> On Fri, Dec 18 2009, Arjan van de Ven wrote:
>> in addition, threads are cheap. Linux has no technical problem with
>> running 100's of kernel threads (if not 1000s); they cost basically a
>> task struct and a stack (2 pages) each and that's about it.  making an
>> elaborate-and-thus-fragile design to save a few kernel threads is
>> likely a bad design direction...
>
> One would hope not, since that is by no means outside of what you see on
> boxes today... Thousands. The fact that they are cheap, is not an
> argument against doing it right. Conceptually, I think the concurrency
> managed work queue pool is a much cleaner (and efficient) design.
>

I don't mind a good and clean design; and for sure sharing thread pools into one pool
is really good.
But if I have to choose between a complex "how to deal with deadlocks" algorithm, versus just
running some more threads in the pool, I'll pick the latter.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21 11:09         ` Andi Kleen
@ 2009-12-21 11:17           ` Arjan van de Ven
  2009-12-21 11:33             ` Andi Kleen
  2009-12-21 13:18             ` Tejun Heo
  0 siblings, 2 replies; 104+ messages in thread
From: Arjan van de Ven @ 2009-12-21 11:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jens Axboe, Peter Zijlstra, Tejun Heo, torvalds, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes

On 12/21/2009 12:09, Andi Kleen wrote:
> On Mon, Dec 21, 2009 at 10:17:54AM +0100, Jens Axboe wrote:
>> On Fri, Dec 18 2009, Arjan van de Ven wrote:
>>> in addition, threads are cheap. Linux has no technical problem with
>>> running 100's of kernel threads (if not 1000s); they cost basically a
>>> task struct and a stack (2 pages) each and that's about it.  making an
>>> elaborate-and-thus-fragile design to save a few kernel threads is
>>> likely a bad design direction...
>>
>> One would hope not, since that is by no means outside of what you see on
>> boxes today... Thousands. The fact that they are cheap, is not an
>> argument against doing it right. Conceptually, I think the concurrency
>> managed work queue pool is a much cleaner (and efficient) design.
>
> Agreed. Even if possible, thousands of threads waste precious cache.

only used ones waste cache ;-)

> And they look ugly in ps.

that we could solve by making them properly threads of each other; ps and co
already (at least by default) fold threads of the same program into one.

>
> Also the nice thing about dynamically sizing the thread pool
> is that if something bad (error condition that takes long) happens
> in one work queue for a specific subsystem there's still a chance
> to make progress with other operations in the same subsystem.

yup
same is true for hitting some form of contention; just make an extra thread
so that the rest can continue.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21 11:17           ` Arjan van de Ven
@ 2009-12-21 11:33             ` Andi Kleen
  2009-12-21 13:18             ` Tejun Heo
  1 sibling, 0 replies; 104+ messages in thread
From: Andi Kleen @ 2009-12-21 11:33 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, Jens Axboe, Peter Zijlstra, Tejun Heo, torvalds,
	awalls, linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells,
	avi, johannes

>> Also the nice thing about dynamically sizing the thread pool
>> is that if something bad (error condition that takes long) happens
>> in one work queue for a specific subsystem there's still a chance
>> to make progress with other operations in the same subsystem.
>
> yup
> same is true for hitting some form of contention; just make an extra thread
> so that the rest can continue.

If you dynamically increase, you can/should dynamically shrink as well.
And with some more gimmicks you're nearly at what Tejun is proposing.

Not sure what you're arguing against?

-Andi
>

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21 11:17           ` Arjan van de Ven
  2009-12-21 11:33             ` Andi Kleen
@ 2009-12-21 13:18             ` Tejun Heo
  1 sibling, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-21 13:18 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, Jens Axboe, Peter Zijlstra, torvalds, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes

Hello, Arjan.

On 12/21/2009 08:17 PM, Arjan van de Ven wrote:
>>> One would hope not, since that is by no means outside of what you see on
>>> boxes today... Thousands. The fact that they are cheap, is not an
>>> argument against doing it right. Conceptually, I think the concurrency
>>> managed work queue pool is a much cleaner (and efficient) design.
>>
>> Agreed. Even if possible, thousands of threads waste precious cache.
> 
> only used ones waste cache ;-)

Yes, and using dedicated threads increases the number of used stacks.
ie. with cmwq, in most cases, only a few stacks would be active and
shared among different works.  With workqueues with dedicated workers,
different types of works will always end up using different stacks,
thus unnecessarily increasing cache footprint.

>> And they look ugly in ps.
> 
> that we could solve by making them properly threads of each other;
> ps and co already (at least by default) fold threads of the same
> program into one.

That way poses two unnecessary problems.  It will easily incur a
scalability issue.  ie. I've been thinking about making block EHs
per-device so that per-device EH actions can be implemented which
won't block the whole host.  If I do this with dedicated threads and
allocate a single thread per block device, it will be the easiest,
right?  The problem is that there are machines with tens of thousands
of LUNs (not that uncommon either) and such a design would simply
collapse there.

Such potential scalability issues would thus require special crafting
at the block layer to manage concurrency, so as to guarantee both EH
forward progress and a proper level of concurrency without paying too
much upfront.  We'll need another partial solution to manage
concurrency there, and it never stops.  What about in-kernel media
presence polling?  Or what about ATA PIO pollers?

>> Also the nice thing about dynamically sizing the thread pool
>> is that if something bad (error condition that takes long) happens
>> in one work queue for a specific subsystem there's still a chance
>> to make process with other operations in the same subsystem.
> 
> yup same is true for hitting some form of contention; just make an
> extra thread so that the rest can continue.

cmwq tries to do exactly that.  It uses scheduler notifications to
detect those contentions and creates new workers if everything is
blocked.  Rescuers are required to guarantee forward progress in
creating workers.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21 11:11         ` Arjan van de Ven
@ 2009-12-21 13:22           ` Tejun Heo
  2009-12-21 13:53             ` Arjan van de Ven
  0 siblings, 1 reply; 104+ messages in thread
From: Tejun Heo @ 2009-12-21 13:22 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Jens Axboe, Andi Kleen, Peter Zijlstra, torvalds, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes

Hello,

On 12/21/2009 08:11 PM, Arjan van de Ven wrote:
> I don't mind a good and clean design; and for sure sharing thread
> pools into one pool is really good.  But if I have to choose between
> a complex "how to deal with deadlocks" algorithm, versus just
> running some more threads in the pool, I'll pick the latter.

The deadlock avoidance algorithm is pretty simple.  It creates a new
worker when everything is blocked.  If the attempt to create a new
worker blocks, it calls in dedicated workers to ensure allocation path
is not blocked.  It's not that complex.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21  9:22     ` Peter Zijlstra
@ 2009-12-21 13:30       ` Tejun Heo
  2009-12-21 14:26         ` Peter Zijlstra
  0 siblings, 1 reply; 104+ messages in thread
From: Tejun Heo @ 2009-12-21 13:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi

Hello, Peter.

On 12/21/2009 06:22 PM, Peter Zijlstra wrote:
> On Mon, 2009-12-21 at 12:04 +0900, Tejun Heo wrote:
>> When IO goes wrong, in extreme
>> cases, it can easily take over thirty secs to recover and that's
>> required by the hardware specifications, so anything which ends up
>> waiting on IO can take a pretty long time.  The only piece of code
>> which is necessary to support that is the code necessary to migrate
>> back tasks to CPUs when they come online again.  It's not a lot of
>> ugly code. 
> 
> Why does it need to get migrated back, there are no affinity promises if
> you allow hotplug to continue, so it might as well complete and continue
> on the other cpu.
> 
> And yes, it is a lot of very ugly code.

Migrating to an online but !active CPU is necessary to call rescuers
during CPU_DOWN_PREPARE, which in turn is necessary to guarantee
forward progress during a cpu down operation.  Given that, the only
extra code which is needed purely for migrating back when a CPU comes
back online is a few tens of lines which handle the TRUSTEE_RELEASE
case.  That's not a lot.  If we do it differently (ie. let unbound
workers not process new works, just drain and let them die), it will
take more code.

I think you're primarily concerned with the scheduler modifications
and think that the choose-between-two-masks on migration is ugly.  I
agree it's not the prettiest thing in this world but then again it's
not a lot of code.  The reason it looks ugly is the way migration is
implemented and the way the parameter is passed in.  API-wise, I
think making kthread_bind() synchronized against cpu onliness should
be pretty clean.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21 13:22           ` Tejun Heo
@ 2009-12-21 13:53             ` Arjan van de Ven
  2009-12-21 14:19               ` Tejun Heo
  0 siblings, 1 reply; 104+ messages in thread
From: Arjan van de Ven @ 2009-12-21 13:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Andi Kleen, Peter Zijlstra, torvalds, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes

On 12/21/2009 14:22, Tejun Heo wrote:
> Hello,
>
> On 12/21/2009 08:11 PM, Arjan van de Ven wrote:
>> I don't mind a good and clean design; and for sure sharing thread
>> pools into one pool is really good.  But if I have to choose between
>> a complex "how to deal with deadlocks" algorithm, versus just
>> running some more threads in the pool, I'll pick the latter.
>
> The deadlock avoidance algorithm is pretty simple.  It creates a new
> worker when everything is blocked.  If the attempt to create a new
> worker blocks, it calls in dedicated workers to ensure allocation path
> is not blocked.  It's not that complex.

I'm just wondering if even that is overkill; I suspect you can do entirely without the scheduler intrusion;
just make a new thread for each work item, with some hysteresis:
* threads should stay around for a bit before dying (you do that)
* after some minimum nr of threads (say 4 per cpu), you wait, say, 0.1 seconds before deciding it's time
   to spawn more threads, to smooth out spikes of very short lived stuff.

wouldn't that be a lot simpler than "ask the scheduler to see if they are all blocked". If they are all
very busy churning cpu (say doing raid6 work, or btrfs checksumming) you still would want more threads
I suspect
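
Something along these lines is what I have in mind (rough sketch only;
the helpers are made up and the numbers are just the ones from above):

#define MIN_THREADS_PER_CPU	4
#define SPAWN_DELAY		(HZ / 10)	/* 0.1 seconds */

static void pool_wants_more_threads(struct thread_pool *pool)
{
	if (pool->nr_threads < MIN_THREADS_PER_CPU) {
		spawn_pool_thread(pool);	/* spawn immediately */
	} else if (!timer_pending(&pool->spawn_timer)) {
		/* hysteresis: smooth out spikes of short lived work */
		mod_timer(&pool->spawn_timer, jiffies + SPAWN_DELAY);
	}
}

/* the spawn_timer handler forks another thread iff work is still pending */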

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21 13:53             ` Arjan van de Ven
@ 2009-12-21 14:19               ` Tejun Heo
  2009-12-21 15:19                 ` Arjan van de Ven
  0 siblings, 1 reply; 104+ messages in thread
From: Tejun Heo @ 2009-12-21 14:19 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Jens Axboe, Andi Kleen, Peter Zijlstra, torvalds, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes

Hello, Arjan.

On 12/21/2009 10:53 PM, Arjan van de Ven wrote:
> I'm just wondering if even that is overkill; I suspect you can do
> entirely without the scheduler intrusion;
> just make a new thread for each work item, with some hysteresis:
>
> * threads should stay around for a bit before dying (you do that)
> * after some minimum nr of threads (say 4 per cpu), you wait, say, 0.1
> seconds before deciding it's time
>   to spawn more threads, to smooth out spikes of very short lived stuff.
> 
> wouldn't that be a lot simpler than "ask the scheduler to see if
> they are all blocked". If they are all very busy churning cpu (say
> doing raid6 work, or btrfs checksumming) you still would want more
> threads I suspect

Ah... okay, there are two aspects in which cmwq involves the scheduler.

A. Concurrency management.  This is achieved by the scheduler
   callbacks which watch how many workers are working.

B. Deadlock avoidance.  This requires migrating rescuers to CPUs under
   allocation distress.  The problem here is that
   set_cpus_allowed_ptr() doesn't allow migrating tasks to CPUs which
   are online but !active (CPU_DOWN_PREPARE).

B would be necessary whichever way you implement a shared worker pool
unless you create all the workers which might possibly be necessary
for allocation.

For A, it's far more efficient and robust with scheduler callbacks.
It's conceptually pretty simple too.  If you look at the patch which
actually implements the dynamic pool, the amount of code necessary for
implementing this part isn't that big.  Most of the complexity in the
series comes from sharing workers, not from the dynamic pool
management.  Even if it switched to a timer based one, there simply
wouldn't be much reduction in complexity.  So, I don't think there's
any reason to choose rather fragile heuristics when it can be
implemented in a pretty mechanical way.
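
To illustrate the callback part (heavily simplified and from memory,
not the exact code in the series; field and helper names are
illustrative):

/*
 * Called by the scheduler when a worker is about to sleep.  If that
 * leaves the cpu with no runnable worker while works are still
 * pending, wake up an idle worker to keep the worklist moving.
 */
void wq_worker_sleeping(struct task_struct *task)
{
	struct worker *worker = kthread_data(task);
	struct worker_pool *pool = worker->pool;

	if (atomic_dec_and_test(&pool->nr_running) &&
	    !list_empty(&pool->worklist))
		wake_up_process(first_idle_worker(pool)->task);
}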

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21 13:30       ` Tejun Heo
@ 2009-12-21 14:26         ` Peter Zijlstra
  2009-12-21 23:50           ` Tejun Heo
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2009-12-21 14:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi

On Mon, 2009-12-21 at 22:30 +0900, Tejun Heo wrote:

> I think you're primarily concerned with the scheduler modifications

No I think the whole wq redesign sucks chunks, because it:

 1) loses the queue property
 2) doesn't deal with cpu heavy tasks/wakeup parallelism
 3) gets fragile at memory-pressure/reclaim
 4) didn't consider the RT needs

Also, I think that whole move tasks back on online stuff is utterly crazy,
just move them to another cpu and leave them there.

Also, I don't think it can properly warn of simple AB-BA flush
deadlocks, where work A flushes B and B flushes A.

(I also don't much like the colour coding flush implementation, but I
haven't spent a lot of time considering alternatives)

> and think that the choose-between-two-masks on migration is ugly.  I
> agree it's not the prettiest thing in this world but then again it's
> not a lot of code.  The reason why it looks ugly is because the way
> migration is implemented and parameter is passed in.  API-wise, I
> think making kthread_bind() synchronized against cpu onliness should
> be pretty clean.

Assuming you only migrate blocked tasks the current kthread_bind()
should suit your needs -- I recently reworked most of the migration
logic.

But as it stands I don't think it's wise to replace the current workqueue
implementation with this, esp since there are known heavy CPU users
using it, nor have you addressed the queueing issue (or is that the
restoration of the single-queue workqueue?)




^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21 14:19               ` Tejun Heo
@ 2009-12-21 15:19                 ` Arjan van de Ven
  2009-12-22  0:00                   ` Tejun Heo
  0 siblings, 1 reply; 104+ messages in thread
From: Arjan van de Ven @ 2009-12-21 15:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Andi Kleen, Peter Zijlstra, torvalds, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes

On 12/21/2009 15:19, Tejun Heo wrote:
>
> Ah... okay, there are two aspects in which cmwq involves the scheduler.
>
> A. Concurrency management.  This is achieved by the scheduler
>     callbacks which watch how many workers are working.
>
> B. Deadlock avoidance.  This requires migrating rescuers to CPUs under
>     allocation distress.  The problem here is that
>     set_cpus_allowed_ptr() doesn't allow migrating tasks to CPUs which
>     are online but !active (CPU_DOWN_PREPARE).

why would you get involved in what-runs-where at all? Wouldn't the scheduler
load balancer be a useful thing ? ;-)


>
> For A, it's far more efficient and robust with scheduler callbacks.
> It's conceptually pretty simple too.

Assuming you don't get tasks that do lots of cpu intense work and that need
to get preempted etc...like raid5/6 sync but also the btrfs checksumming etc etc.
That to me is the weak spot of the scheduler callback thing. We already have
a whole thing that knows how to load balance and preempt tasks... it's called "scheduler".
Let the scheduler do its job and just give it insight in the work (by having a runnable thread)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21 14:26         ` Peter Zijlstra
@ 2009-12-21 23:50           ` Tejun Heo
  2009-12-22 11:00             ` Peter Zijlstra
                               ` (4 more replies)
  0 siblings, 5 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-21 23:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi

Hello,

On 12/21/2009 11:26 PM, Peter Zijlstra wrote:
>> I think you're primarily concerned with the scheduler modifications
> 
> No I think the whole wq redesign sucks chunks, because it:
> 
>  1) loses the queue property

Restored with max_active based ST workqueue implementation.

>  2) doesn't deal with cpu heavy tasks/wakeup parallelism

Workqueues were never suited for this.  MT workqueues have strong CPU
affinity, which doesn't make sense for CPU-heavy workloads, and ST
workqueues can't take advantage of parallelism.  For MT CPU-heavy
workloads, we can easily mark some workqueues as not contributing to
concurrency accounting such that works queued to such wqs are always
considered blocked.  But I don't think that's the right use of
workqueues.  For CPU-heavy workloads, let's use something which is
more suited to the job - something which can take advantage of
parallelism.

>  3) gets fragile at memory-pressure/reclaim

Shared dynamic pool is going to be affected by memory pressure no
matter how you implement it.  cmwq tries to maintain stable level of
workers and has forward progress guarantee.  If you're gonna do shared
pool, it can't get much better.

>  4) didn't consider the RT needs

Can you be more specific?  What RT needs?  It's pretty difficult to
tell when there's no in-kernel user and any shared worker pool would
have pretty similar properties as cmwq.

> Also, I think that whole move tasks back on online stuff is utterly crazy,
> just move them to another cpu and leave them there.

Can you be more specific?  Why is it crazy when moving to online but
!active cpu is necessary anyway?

> Also, I don't think it can properly warn of simple AB-BA flush
> deadlocks, where work A flushes B and B flushes A.

During the transformation, I lost the per-work lockdep annotation in
the flush_work path which used to go through wait_on_work().  It can
easily be restored.  Nothing changed at the per-work level.  The same
lockdep annotations will work.

> (I also don't much like the colour coding flush implementation, but I
> haven't spent a lot of time considering alternatives)

When you come up with a much better one, let me know.  I'll be happy
to replace the current one.

>> and think that the choose-between-two-masks on migration is ugly.  I
>> agree it's not the prettiest thing in this world but then again it's
>> not a lot of code.  The reason it looks ugly is the way migration is
>> implemented and the way the parameter is passed in.  API-wise, I
>> think making kthread_bind() synchronized against cpu onliness should
>> be pretty clean.
> 
> Assuming you only migrate blocked tasks the current kthread_bind()
> should suit your needs -- I recently reworked most of the migration
> logic.

Okay, that requires another thread which would watch whether those
rescuers which need to be migrated are scheduled out, and a mechanism
to make sure they stay scheduled out.  After the "migration" thread
has made sure of that, it would call kthread_bind() and wake up the
migrated rescuers, right?  That's basically what the scheduler
"migration" mechanism does.  Why implement it again?

> But as it stands I don't think it's wise to replace the current workqueue
> implementation with this, esp since there are known heavy CPU users
> using it, nor have you addressed the queueing issue (or is that the
> restoration of the single-queue workqueue?)

The queueing one is addressed and which CPU-heavy users are you
talking about?  Workqueues have always been CPU-affine.  Not much
changes for its users by cmwq.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21 15:19                 ` Arjan van de Ven
@ 2009-12-22  0:00                   ` Tejun Heo
  2009-12-22 11:10                     ` Peter Zijlstra
  0 siblings, 1 reply; 104+ messages in thread
From: Tejun Heo @ 2009-12-22  0:00 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Jens Axboe, Andi Kleen, Peter Zijlstra, torvalds, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes

Hello,

On 12/22/2009 12:19 AM, Arjan van de Ven wrote:
> On 12/21/2009 15:19, Tejun Heo wrote:
>>
>> Ah... okay, there are two aspects in which cmwq involves the scheduler.
>>
>> A. Concurrency management.  This is achieved by the scheduler
>>     callbacks which watch how many workers are working.
>>
>> B. Deadlock avoidance.  This requires migrating rescuers to CPUs under
>>     allocation distress.  The problem here is that
>>     set_cpus_allowed_ptr() doesn't allow migrating tasks to CPUs which
>>     are online but !active (CPU_DOWN_PREPARE).
> 
> why would you get involved in what-runs-where at all? Wouldn't the
> scheduler load balancer be a useful thing ? ;-)

That's the way workqueues have been designed and used: their primary
purpose is to supply sleepable context, and for that, crossing cpu
boundaries is unnecessary overhead, so many of their users wouldn't
benefit from load balancing and some of them assume strong CPU
affinity.  cmwq's goal is to improve workqueues.  If you want to
implement something which would take advantage of processing
parallelism, it would need to be a separate mechanism (which may share
the API via out-of-range CPU affinity or something, but still a
separate mechanism).

>> For A, it's far more efficient and robust with scheduler callbacks.
>> It's conceptually pretty simple too.
> 
> Assuming you don't get tasks that do lots of cpu intense work and
> that need to get preempted etc...like raid5/6 sync but also the
> btrfs checksumming etc etc.  That to me is the weak spot of the
> scheduler callback thing. We already have a whole thing that knows
> how to load balance and preempt tasks... it's called "scheduler".
> Let the scheduler do its job and just give it insight in the work
> (by having a runnable thread)

Yeah, sure, it's not suited for offloading CPU intensive workloads
(checksumming and encryption basically).  Workqueue interface was
never meant to be used by them - it has strong cpu affinity.  We need
something which is more parallelism aware for those.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21 23:50           ` Tejun Heo
@ 2009-12-22 11:00             ` Peter Zijlstra
  2009-12-22 11:03             ` Peter Zijlstra
                               ` (3 subsequent siblings)
  4 siblings, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2009-12-22 11:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi

On Tue, 2009-12-22 at 08:50 +0900, Tejun Heo wrote:
> 
> >  4) didn't consider the RT needs
> 
> Can you be more specific?  What RT needs?  It's pretty difficult to
> tell when there's no in-kernel user and any shared worker pool would
> have pretty similar properties as cmwq. 

http://programming.kicks-ass.net/kernel-patches/rt-workqueue-prio/

Most of that passed over lkml too at various times and in various forms;
not even sure that that is the last one, but it's the one I found first.

Basically it boils down to letting the worklets run at the priority of
the enqueueing task and boosting currently running works with the prio
of the highest pending work, including proper barrier support.
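
For reference, the gist of it in rough code (illustrative only; the
prio helpers are made up):

struct prio_work {
	struct work_struct	work;
	int			prio;	/* prio of the queueing task */
};

static void queue_prio_work(struct workqueue_struct *wq,
			    struct prio_work *pwork)
{
	pwork->prio = current->prio;
	queue_work(wq, &pwork->work);

	/* lower value == higher prio; boost the worker if needed */
	if (pwork->prio < worker_current_prio(wq))
		boost_worker_prio(wq, pwork->prio);
}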


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21 23:50           ` Tejun Heo
  2009-12-22 11:00             ` Peter Zijlstra
@ 2009-12-22 11:03             ` Peter Zijlstra
  2009-12-23  3:43               ` Tejun Heo
  2009-12-22 11:04             ` Peter Zijlstra
                               ` (2 subsequent siblings)
  4 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2009-12-22 11:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi

On Tue, 2009-12-22 at 08:50 +0900, Tejun Heo wrote:
> > But as it stands I don't think it's wise to replace the current workqueue
> > implementation with this, esp since there are known heavy CPU users
> > using it, nor have you addressed the queueing issue (or is that the
> > restoration of the single-queue workqueue?)
> 
> The queueing one is addressed and which CPU-heavy users are you
> talking about?  Workqueues have always been CPU-affine.  Not much
> changes for its users by cmwq. 

crypto for one (no clue what other bits we have atm, there might be some
async raid-n helper bits too, or whatever).

And yes that does change, the current design ensures we don't run more
than one crypto job per cpu, whereas with your stuff that can be
arbitrarily many per cpu (up to some artificial limit of 127 or so).




^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21 23:50           ` Tejun Heo
  2009-12-22 11:00             ` Peter Zijlstra
  2009-12-22 11:03             ` Peter Zijlstra
@ 2009-12-22 11:04             ` Peter Zijlstra
  2009-12-23  3:48               ` Tejun Heo
  2009-12-22 11:06             ` Peter Zijlstra
  2009-12-23  8:31             ` Stijn Devriendt
  4 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2009-12-22 11:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi

On Tue, 2009-12-22 at 08:50 +0900, Tejun Heo wrote:
> > Also, I think that whole move tasks back on online stuff is utterly crazy,
> > just move them to another cpu and leave them there.
> 
> Can you be more specific?  Why is it crazy when moving to online but
> !active cpu is necessary anyway? 

because it's extra (and above all ugly) code that serves no purpose
whatsoever.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21 23:50           ` Tejun Heo
                               ` (2 preceding siblings ...)
  2009-12-22 11:04             ` Peter Zijlstra
@ 2009-12-22 11:06             ` Peter Zijlstra
  2009-12-23  4:18               ` Tejun Heo
  2009-12-23  8:31             ` Stijn Devriendt
  4 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2009-12-22 11:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi

On Tue, 2009-12-22 at 08:50 +0900, Tejun Heo wrote:
> 
> >  3) gets fragile at memory-pressure/reclaim
> 
> Shared dynamic pool is going to be affected by memory pressure no
> matter how you implement it.  cmwq tries to maintain stable level of
> workers and has forward progress guarantee.  If you're gonna do shared
> pool, it can't get much better. 

And here I'm questioning the very need for shared stuff, I don't see
any. That is, I'm not seeing it being worth the hassle.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-22  0:00                   ` Tejun Heo
@ 2009-12-22 11:10                     ` Peter Zijlstra
  2009-12-22 17:20                       ` Linus Torvalds
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2009-12-22 11:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Arjan van de Ven, Jens Axboe, Andi Kleen, torvalds, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes

On Tue, 2009-12-22 at 09:00 +0900, Tejun Heo wrote:
> 
> Yeah, sure, it's not suited for offloading CPU intensive workloads
> (checksumming and encryption basically).  Workqueue interface was
> never meant to be used by them - it has strong cpu affinity.  We need
> something which is more parallelism aware for those. 

Right, so what about cleaning that up first, then looking how many
workqueues can be removed by converting to threaded interrupts and then
maybe look again at some of your async things?

As to the SCSI/ATA error threads that have to wait for a year on
hardware, why not simply use the existing work to create a one-time
(non-affine) thread for that?  It's not like it'll ever happen much.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-22 11:10                     ` Peter Zijlstra
@ 2009-12-22 17:20                       ` Linus Torvalds
  2009-12-22 17:47                         ` Peter Zijlstra
  0 siblings, 1 reply; 104+ messages in thread
From: Linus Torvalds @ 2009-12-22 17:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Arjan van de Ven, Jens Axboe, Andi Kleen, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes



On Tue, 22 Dec 2009, Peter Zijlstra wrote:

> On Tue, 2009-12-22 at 09:00 +0900, Tejun Heo wrote:
> > 
> > Yeah, sure, it's not suited for offloading CPU intensive workloads
> > (checksumming and encryption basically).  Workqueue interface was
> > never meant to be used by them - it has strong cpu affinity.  We need
> > something which is more parallelism aware for those. 
> 
> Right, so what about cleaning that up first, then looking how many
> workqueues can be removed by converting to threaded interrupts and then
> maybe look again at some of your async things?

Peter - nobody is interested in that.

People use workqueues for other things _today_, and they have annoying 
problems as they stand. It would be nice to get rid of the deadlock 
issue, for example - right now the tty driver literally does crazy things, 
and drops locks that it shouldn't drop due to the fact that it needs to 
wait for queued work - even if the queued work it is actually waiting for 
isn't the one that takes the lock!

So why do you argue about all those irrelevant things, and ask Tejun to 
clean up stuff that nobody cares about, when our existing workqueues have 
problems that people -do- care about and that his patches address?

So stop arguing about irrelevancies. Nobody uses workqueues for RT or for 
CPU-intensive crap. It's not what they were designed for, or used for.

If you _want_ to use them for that, that is _your_ problem. Not Tejuns.

			Linus

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-22 17:20                       ` Linus Torvalds
@ 2009-12-22 17:47                         ` Peter Zijlstra
  2009-12-22 18:07                           ` Andi Kleen
                                             ` (2 more replies)
  0 siblings, 3 replies; 104+ messages in thread
From: Peter Zijlstra @ 2009-12-22 17:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Arjan van de Ven, Jens Axboe, Andi Kleen, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes

On Tue, 2009-12-22 at 09:20 -0800, Linus Torvalds wrote:
> 
> So stop arguing about irrelevancies. Nobody uses workqueues for RT or for 
> CPU-intensive crap. It's not what they were designed for, or used for.

RT crap maybe, but cpu intensive bits are used for sure, see the
crypto/crypto_wq.c drivers/md/dm*.c.

I've seen those consume significant amounts of cpu, now I'm not going to
argue that workqueues are not the best way to consume lots of cpu, but
the fact is they _are_ used for that.

And since tejun's thing doesn't have wakeup parallelism covered these
uses can turn into significant loads.

> If you _want_ to use them for that, that is _your_ problem. Not Tejuns.

I don't want to use workqueues at all.

> People use workqueues for other things _today_, and they have annoying 
> problems as they stand. It would be nice to get rid of the deadlock 
> issue, for example - right now the tty driver literally does crazy things, 
> and drops locks that it shouldn't drop due to the fact that it needs to 
> wait for queued work - even if the queued work it is actually waiting for 
> isn't the one that takes the lock!

Which in turn would imply we cannot carry fwd the current lockdep
annotations, right?

Which means we'll be stuck in a situation where A flushes B and B
flushes A will go undetected until we actually hit it.

Where exactly does the tty thing live in the code?


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-22 17:47                         ` Peter Zijlstra
@ 2009-12-22 18:07                           ` Andi Kleen
  2009-12-22 18:20                             ` Peter Zijlstra
  2009-12-23  8:17                             ` Stijn Devriendt
  2009-12-22 18:28                           ` Linus Torvalds
  2009-12-23  3:37                           ` Tejun Heo
  2 siblings, 2 replies; 104+ messages in thread
From: Andi Kleen @ 2009-12-22 18:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Tejun Heo, Arjan van de Ven, Jens Axboe,
	Andi Kleen, awalls, linux-kernel, jeff, mingo, akpm, rusty, cl,
	dhowells, avi, johannes

> I've seen those consume significant amounts of cpu, now I'm not going to
> argue that workqueues are not the best way to consume lots of cpu, but
> the fact is they _are_ used for that.

AFAIK they currently have the following problems:
- Consuming CPU in shared work queues is a bad idea for obvious reasons
(but I don't think anybody does that)
- Single threaded work is, well, single threaded; it does not scale
and wastes cores.
- Per CPU work queues are bound to each CPU so that the scheduler
cannot properly load balance (so you might end up with over/under
subscribed CPUs)

One reason I liked a more dynamic framework for this is that it
has the potential to be exposed to user space and allow automatic
work partitioning there based on available cores.  User space
has a lot more CPU consumption than the kernel.

I think Grand Central Dispatch does something in this direction. 
TBB would probably also benefit.

Short term, an alternative for the kernel would also be
to generalize the simple framework that is in btrfs.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-22 18:07                           ` Andi Kleen
@ 2009-12-22 18:20                             ` Peter Zijlstra
  2009-12-23  8:17                             ` Stijn Devriendt
  1 sibling, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2009-12-22 18:20 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Tejun Heo, Arjan van de Ven, Jens Axboe, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes

On Tue, 2009-12-22 at 19:07 +0100, Andi Kleen wrote:

> One reason I liked a more dynamic framework for this is that it
> has the potential to be exposed to user space and allow automatic
> work partitioning there based on available cores.  User space
> has a lot more CPU consumption than the kernel.

What you want is something that does not block, but does not generate
more load than a single runnable task either. Otherwise you'll end up
with massive load issues.

The thing currently proposed only does the first, but does not guarantee
the latter. Meaning you can only use it if you know it will not do much
work (after blocking), which is a fine restriction in the kernel (not
something I think the current workqueue users all adhere to though).

However, it is not something you'd want to expose to userspace - not,
that is, without the counterpart limiting wakeup parallelism.

> I think Grand Central Dispatch does something in this direction. 
> TBB would probably also benefit.

What are those?

> Short term an alternative for the kernel would be also
> to generalize the simple framework that is in btrfs.

And here I thought btrfs was one of those that do actually consume heaps
of CPU in their worklets doing data checksums and the like.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-22 17:47                         ` Peter Zijlstra
  2009-12-22 18:07                           ` Andi Kleen
@ 2009-12-22 18:28                           ` Linus Torvalds
  2009-12-23  8:06                             ` Johannes Berg
  2009-12-23  3:37                           ` Tejun Heo
  2 siblings, 1 reply; 104+ messages in thread
From: Linus Torvalds @ 2009-12-22 18:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Arjan van de Ven, Jens Axboe, Andi Kleen, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes



On Tue, 22 Dec 2009, Peter Zijlstra wrote:
> 
> Which in turn would imply we cannot carry fwd the current lockdep
> annotations, right?
> 
> Which means we'll be stuck in a situation where A flushes B and B
> flushes A will go undetected until we actually hit it.

No, lockdep should still work. It just means that waiting for an
individual work should be seen as a matter of only waiting for the locks
that that work itself has taken - rather than waiting for all the locks
that any worker has taken.

And the way the workqueue lockdep stuff is done, I'd assume this just 
automatically fixes itself when rewritten.
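
Roughly, the per-work annotation pattern is something like this
(simplified sketch, not the exact workqueue.c code; the real thing
also annotates the workqueue-wide map):

static void run_one_work(struct work_struct *work)
{
	/*
	 * The lockdep map lives in the work item, not in the worker
	 * thread, so flushing a work only depends on the locks that
	 * particular work takes.
	 */
	lock_map_acquire(&work->lockdep_map);
	work->func(work);
	lock_map_release(&work->lockdep_map);
}

/* flush_work() then does lock_map_acquire/release on the same map */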

> Where exactly does the tty thing live in the code?

I think I worked around it all, but I cursed the workqueues while I did 
it. The locking problem is tty->ldisc_mutex vs flushing the ldisc buffers. 
The flushing itself doesn't even care about the ldisc_mutex, but if it 
happens to be behind the hangup work - which does care about that mutex 
- you still can't wait for it.

Happily, it turns out that you can synchronously _cancel_ the damn thing 
despite this problem, because the cancel can take it off the list if it is 
waiting for something else (eg another workqueue entry in front of it), 
and if it's actively running we know that it's not blocked waiting for 
that hangup work that needs the lock, so for that particular case we can 
even wait for it to finish running - even if we couldn't do that in 
general.

And I 

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-22 17:47                         ` Peter Zijlstra
  2009-12-22 18:07                           ` Andi Kleen
  2009-12-22 18:28                           ` Linus Torvalds
@ 2009-12-23  3:37                           ` Tejun Heo
  2009-12-23  6:52                             ` Herbert Xu
  2 siblings, 1 reply; 104+ messages in thread
From: Tejun Heo @ 2009-12-23  3:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Arjan van de Ven, Jens Axboe, Andi Kleen, awalls,
	linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells, avi,
	johannes, Herbert Xu, David S. Miller

Hello,

On 12/23/2009 02:47 AM, Peter Zijlstra wrote:
> On Tue, 2009-12-22 at 09:20 -0800, Linus Torvalds wrote:
>>
>> So stop arguing about irrelevancies. Nobody uses workqueues for RT or for 
>> CPU-intensive crap. It's not what they were designed for, or used for.
> 
> RT crap maybe, but cpu intensive bits are used for sure, see the
> crypto/crypto_wq.c drivers/md/dm*.c.
>
> I've seen those consume significant amounts of cpu, now I'm not going to
> argue that workqueues are not the best way to consume lots of cpu, but
> the fact is they _are_ used for that.
>
> And since tejun's thing doesn't have wakeup parallelism covered these
> uses can turn into significant loads.

(cc'ing Herbert Xu and David Miller for crypt)

I wrote this before but if something is burning CPU cycles using MT
workqueues, this can be easily supported by marking such a workqueue
as not concurrency-managed.  ie. works queued to such workqueues
wouldn't contribute to the perceived workqueue concurrency and would
be left to be managed solely by the scheduler.  This would completely
cover the crypto_wq case, which uses an MT workqueue with local cpu
binding.
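
Roughly speaking, something like the following (the flag name and
value are made up for illustration):

#define WQ_CPU_INTENSIVE	(1 << 4)	/* hypothetical flag */

/*
 * Works on such a workqueue never count towards nr_running, so they
 * don't suppress starting other workers while they burn cpu.
 */
static bool worker_counts_for_concurrency(struct worker *worker)
{
	return !(worker->current_cwq->wq->flags & WQ_CPU_INTENSIVE);
}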

I don't think dm is currently using a workqueue for checksumming but
it does use kcryptd (a ST workqueue) for dm-crypt.  My understanding
of the crypto subsystem is limited, but kcryptd simply passes the
encryption request to the crypto subsystem, which may then again pass
the thing to crypto workers (the MT workqueue described above) or
process it synchronously depending on how the encryption chain is set
up.  In the former case, cmwq with the above described modification
wouldn't change anything.  In the latter case, it would keep the
worker pinned to a cpu while encrypting/decrypting, but then again
kcryptd spending a large amount of cpu cycles is simply wrong to begin
with.  There's only a _single_ kcryptd which is shared by all dm-crypt
instances.  It should be issuing works to crypto workers instead of
doing the encrypt/decrypt itself.

It seems that cryptd is building an async mechanism for CPU-intensive
tasks on top of the workqueue mechanism, which cmwq can support well
(no different from the current situation).  In the long run, the right
thing to do would be to build a generic mechanism for this type of
workload and share it among crypto and dm/btrfs block processing.
But, at any rate, this doesn't really change anything for cmwq.  With
a slight modification, it can support the current mechanisms just as
well.

>> People use workqueues for other things _today_, and they have annoying 
>> problems as they stand. It would be nice to get rid of the deadlock 
>> issue, for example - right now the tty driver literally does crazy things, 
>> and drops locks that it shouldn't drop due to the fact that it needs to 
>> wait for queued work - even if the queued work it is actually waiting for 
>> isn't the one that takes the lock!
> 
> Which in turn would imply we cannot carry fwd the current lockdep
> annotations, right?
>
> Which means we'll be stuck in a situation where A flushes B and B
> flushes A will go undetected until we actually hit it.

Lockdep annotations will keep working where they matter.  Why would
that change at all?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-22 11:03             ` Peter Zijlstra
@ 2009-12-23  3:43               ` Tejun Heo
  0 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-23  3:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi

On 12/22/2009 08:03 PM, Peter Zijlstra wrote:
> On Tue, 2009-12-22 at 08:50 +0900, Tejun Heo wrote:
>>> But as it stands I don't think it's wise to replace the current workqueue
>>> implementation with this, esp since there are known heavy CPU users
>>> using it, nor have you addressed the queueing issue (or is that the
>>> restoration of the single-queue workqueue?)
>>
>> The queueing one is addressed and which CPU-heavy users are you
>> talking about?  Workqueues have always been CPU-affine.  Not much
>> changes for its users by cmwq. 
> 
> crypto for one (no clue what other bits we have atm, there might be some
> async raid-n helper bits too, or whatever).

This one is being discussed in a different thread.

> And yes that does change, the current design ensures we don't run more
> than one crypto job per cpu, whereas with your stuff that can be
> arbitrarily many per cpu (up to some artificial limit of 127 or so).

Nope, with max_active set to 1 the behavior will remain exactly the
same as with the current workqueues.  As I've described in the head
message and the patch description, there now is no global limit; only
max_active limits apply.  It got solved together with the singlethread
thing.
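
For reference, the queueing side boils down to something like this
(sketch only, not the patch verbatim):

/*
 * Works beyond max_active go to the delayed list and get pulled in as
 * active works retire; with max_active == 1 this gives strict
 * one-at-a-time execution, ie. the old single thread behavior.
 */
static void cwq_enqueue(struct cpu_workqueue_struct *cwq,
			struct work_struct *work)
{
	if (cwq->nr_active < cwq->max_active) {
		cwq->nr_active++;
		list_add_tail(&work->entry, &cwq->worklist);
	} else {
		list_add_tail(&work->entry, &cwq->delayed_works);
	}
}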

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-22 11:04             ` Peter Zijlstra
@ 2009-12-23  3:48               ` Tejun Heo
  0 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-23  3:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi

Hello,

On 12/22/2009 08:04 PM, Peter Zijlstra wrote:
> On Tue, 2009-12-22 at 08:50 +0900, Tejun Heo wrote:
>>> Also, I think that whole move tasks back on online stuff is utterly crazy,
>>> just move them to another cpu and leave them there.
>>
>> Can you be more specific?  Why is it crazy when moving to online but
>> !active cpu is necessary anyway? 
> 
> because it's extra (and above all ugly) code that serves no purpose
> whatsoever.

The forward progress guarantee requires rescuers to migrate to online
but !active CPUs during CPU_DOWN_PREPARE, so the only extra code
necessary for migrating the remaining workers back when a CPU comes
back online is the code to actually do that.  That's only a couple of
tens of lines of code in the trustee thread.

Now, the other option would be to leave those unbound workers alone
and make sure they don't take up new works once their CPUs come back
online, which would require code in a hotter path and result in less
consistent behavior.  The tradeoff seems pretty obvious to me.
Doesn't it?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-22 11:06             ` Peter Zijlstra
@ 2009-12-23  4:18               ` Tejun Heo
  2009-12-23  4:42                 ` Linus Torvalds
  0 siblings, 1 reply; 104+ messages in thread
From: Tejun Heo @ 2009-12-23  4:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, awalls, linux-kernel, jeff, mingo, akpm, jens.axboe,
	rusty, cl, dhowells, arjan, avi, johannes, andi

Hello,

On 12/22/2009 08:06 PM, Peter Zijlstra wrote:
> On Tue, 2009-12-22 at 08:50 +0900, Tejun Heo wrote:
>>
>>>  3) gets fragile at memory-pressure/reclaim
>>
>> Shared dynamic pool is going to be affected by memory pressure no
>> matter how you implement it.  cmwq tries to maintain stable level of
>> workers and has forward progress guarantee.  If you're gonna do shared
>> pool, it can't get much better. 
> 
> And here I'm questioning the very need for shared stuff, I don't see
> any. That is, I'm not seeing it being worth the hassle.

Then you see the situation pretty differently from the way I do.
Maybe it's caused by the different things we work on.  Whenever I want
to create something which would need async context, I'm always faced
with tradeoffs that I think are silly to worry about at that layer.
It ends up scattering partial solutions all over the place.

libata has two workqueues just because one may depend on the other.
The workqueue used for polling is MT to increase parallelism in case
there are multiple devices which would require polling, but it's both
wasteful and not enough - the extra workers won't be used most of the
time, yet they aren't enough when there are multiple pollers on the
same CPU.  libata just had to make a rather mediocre in-the-middle
tradeoff between having one poller for each device and sharing a
single poller for all devices.

The same goes for EH threads.  How often they are used depends heavily
on the system configuration.  For example, libata handles ATAPI CHECK
CONDITION as an exception and acquires sense data from the exception
handler, and that happens pretty frequently.  So, I want to have
per-device EHs and have ideas on how to escalate from device level EH
to host level EH.  The problem here again is how to maintain the
concurrency, because having a single kthread for each block device
won't be acceptable from a scalability POV.

Another similar but less severe problem is in-kernel media presence
pollers.  Here, I think I can have a single poller for each device
without too many scalability issues, but it just isn't efficient
because most of the time one poller would be enough.  It's only in
corner cases or error conditions that you would need more than one.
So, again, I could implement a special poller pool just for this.

And there are slow work and async, both of which exist just to provide
process context to tasks which may spend a long time waiting for IO,
and quite a few ST workqueues which got separated out because they
somehow got involved in some obscure deadlock condition; the only
reason they're ST is that MT would create too many threads.  CPU
affinity would work better for them but they have to make these
tradeoffs.
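
For reference, the kind of dedicated ST workqueue pattern being talked
about here looks roughly like the sketch below in a driver.  The foo_*
names are made up; the point is only the "one mostly idle kernel
thread per driver" cost.

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *foo_wq;
static struct work_struct foo_work;

static void foo_work_fn(struct work_struct *work)
{
	/* may sleep, may wait for IO */
}

static int __init foo_init(void)
{
	/* dedicated kernel thread, idle most of the time */
	foo_wq = create_singlethread_workqueue("foo");
	if (!foo_wq)
		return -ENOMEM;

	INIT_WORK(&foo_work, foo_work_fn);
	queue_work(foo_wq, &foo_work);
	return 0;
}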

So, if we can have a mechanism which solves these issues, it's an
obvious plus.  Shifting complexity out of peripheral code into better
crafted and managed core code is the right thing to do, and it will
take a lot of complexity out of peripheral code.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  4:18               ` Tejun Heo
@ 2009-12-23  4:42                 ` Linus Torvalds
  2009-12-23  6:02                   ` Ingo Molnar
  0 siblings, 1 reply; 104+ messages in thread
From: Linus Torvalds @ 2009-12-23  4:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, awalls, linux-kernel, jeff, mingo, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi



On Wed, 23 Dec 2009, Tejun Heo wrote:
> 
> So, if we can have a mechanism which solves these issues, it's an
> obvious plus.  Shifting complexity out of peripheral code into better
> crafted and managed core code is the right thing to do, and it will
> take a lot of complexity out of peripheral code.

I really think this is key. I don't think Peter really has worked much 
outside of very core code (direct CPU-related stuff), and hasn't seen the 
kind of annoying problems our current workqueue code has.

Half the kernel is drivers. And 95% of all workqueue users are those 
drivers.

Workqueues generally aren't about heavy CPU usage, although some workqueue 
usage has scalability issues. And the most common scalability problem is 
not "I need more than one CPU", but often "I need to run even though 
another workqueue entry is blocked on IO" - iow, it's not about lacking 
CPU power, it's about in-fighting with other workqueue users.
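
To make that concrete, here is a trivial sketch (hypothetical names,
the IO wait simulated with msleep): on the old per-CPU worker model, a
work that sleeps holds up every other work queued behind it on that
CPU, even though the CPU itself is idle.

#include <linux/delay.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

/* Work A: sleeps waiting for slow IO (simulated here with msleep). */
static void slow_io_work(struct work_struct *work)
{
	msleep(5000);			/* stands in for waiting on a device */
}

/* Work B: cheap, but it shares the per-CPU worker thread with work A
 * and therefore cannot run until slow_io_work() returns. */
static void quick_work(struct work_struct *work)
{
	pr_info("finally running\n");
}

static DECLARE_WORK(work_a, slow_io_work);
static DECLARE_WORK(work_b, quick_work);

static void demo(void)
{
	schedule_work(&work_a);
	schedule_work(&work_b);		/* stuck behind work_a on this CPU */
}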

That said, if we can improve on this further, I'd be all for it. I'd love 
to have some generic async model that really works. So far, our async 
models have tended to not really work out well, whether they be softirq's 
or kernel threads (many of the same issues: some subsystems start tons of 
kernel threads just because one kernel thread blocks, not because you need 
multi-processor CPU usage per se). And AIO/threadlets never got anywhere 
etc etc.

		Linus

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  4:42                 ` Linus Torvalds
@ 2009-12-23  6:02                   ` Ingo Molnar
  2009-12-23  6:13                     ` Jeff Garzik
                                       ` (2 more replies)
  0 siblings, 3 replies; 104+ messages in thread
From: Ingo Molnar @ 2009-12-23  6:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Peter Zijlstra, awalls, linux-kernel, jeff, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Workqueues generally aren't about heavy CPU usage, although some workqueue 
> usage has scalability issues. And the most common scalability problem is not 
> "I need more than one CPU", but often "I need to run even though another 
> workqueue entry is blocked on IO" - iow, it's not about lacking CPU power, 
> it's about in-fighting with other workqueue users.
> 
> That said, if we can improve on this further, I'd be all for it. I'd love to 
> have some generic async model that really works. So far, our async models 
> have tended to not really work out well, whether they be softirq's or kernel 
> threads (many of the same issues: some subsystems start tons of kernel 
> threads just because one kernel thread blocks, not because you need 
> multi-processor CPU usage per se). And AIO/threadlets never got anywhere etc 
> etc.

Not from lack of trying though ;-)

One key thing i havent seen in this discussion are actual measurements. I 
think a lot could be decided by simply testing this patch-set, by looking at 
the hard numbers: how much faster (or slower) did a particular key workload 
get before/after these patches.

Likewise, if there's a reduction in complexity, that is a tangible metric as 
well: lets do a few conversions as part of the patch-set and see how much 
simpler things have become as a result of it.

We really are not forced to the space of Gedankenexperiments here.

	Ingo

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  6:02                   ` Ingo Molnar
@ 2009-12-23  6:13                     ` Jeff Garzik
  2009-12-23  7:53                       ` Ingo Molnar
  2009-12-23  8:41                       ` Peter Zijlstra
  2009-12-23  7:09                     ` Tejun Heo
  2009-12-23 13:00                     ` Stefan Richter
  2 siblings, 2 replies; 104+ messages in thread
From: Jeff Garzik @ 2009-12-23  6:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Tejun Heo, Peter Zijlstra, awalls, linux-kernel,
	akpm, jens.axboe, rusty, cl, dhowells, arjan, avi, johannes,
	andi

On 12/23/2009 01:02 AM, Ingo Molnar wrote:
> One key thing i havent seen in this discussion are actual measurements. I
> think a lot could be decided by simply testing this patch-set, by looking at
> the hard numbers: how much faster (or slower) did a particular key workload
> get before/after these patches.

We are dealing with situations where drivers are using workqueues to 
provide a sleep-able context, and trying to solve problems related to that.

"faster or slower?" is not very relevant for such cases.

	Jeff



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  3:37                           ` Tejun Heo
@ 2009-12-23  6:52                             ` Herbert Xu
  2009-12-23  8:00                               ` Steffen Klassert
  0 siblings, 1 reply; 104+ messages in thread
From: Herbert Xu @ 2009-12-23  6:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Linus Torvalds, Arjan van de Ven, Jens Axboe,
	Andi Kleen, awalls, linux-kernel, jeff, mingo, akpm, rusty, cl,
	dhowells, avi, johannes, David S. Miller, Steffen Klassert

On Wed, Dec 23, 2009 at 12:37:32PM +0900, Tejun Heo wrote:
> 
> I wrote this before, but if something is burning CPU cycles using MT
> workqueues, this can easily be supported by marking such a workqueue
> as not concurrency-managed, i.e. works queued to such workqueues
> wouldn't contribute to the perceived workqueue concurrency and would
> be left to be managed solely by the scheduler.  This will completely
> cover the crypto_wq case, which uses an MT workqueue with local CPU
> binding.

Right, the main use of workqueues in the crypto subsystem currently
is to perform operations in a process context, e.g., in order to
use SSE instructions on x86, so there is no real parallelism
involved.

However, Steffen Klassert has proposed a parallelisation mechanism
whereby extremely CPU-intensive operations such as encryption may
be split across CPUs.  Steffen, could you repost your padata patches
here?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  6:02                   ` Ingo Molnar
  2009-12-23  6:13                     ` Jeff Garzik
@ 2009-12-23  7:09                     ` Tejun Heo
  2009-12-23  8:01                       ` Ingo Molnar
  2009-12-23  8:25                       ` Arjan van de Ven
  2009-12-23 13:00                     ` Stefan Richter
  2 siblings, 2 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-23  7:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, awalls, linux-kernel, jeff, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi

Hello, Ingo.

On 12/23/2009 03:02 PM, Ingo Molnar wrote:
> Not from lack of trying though ;-)
> 
> One key thing i havent seen in this discussion are actual measurements. I 
> think a lot could be decided by simply testing this patch-set, by looking at 
> the hard numbers: how much faster (or slower) did a particular key workload 
> get before/after these patches.

As Jeff pointed out, I don't think this is gonna result in any major
performance increase or decrease.  The upside would be lowered cache
pressure coming from different types of works sharing the same context
(this will be very difficult to measure).  The downside would be the
more complex code the workqueue has to run to manage the shared pool.
I don't think it's gonna be anything noticeable either way.  Anyway, I
can try to set up a synthetic case which involves a lot of work
executions and at least make sure there's no noticeable slowdown.

> Likewise, if there's a reduction in complexity, that is a tangible metric as 
> well: lets do a few conversions as part of the patch-set and see how much 
> simpler things have become as a result of it.

Doing several conversions shouldn't be difficult at all.  I'll try to
convert async and slow work.

> We really are not forced to the space of Gedankenexperiments here.

Sure, but there's a reason why I posted the patchset without the actual
conversions.  I wanted to make sure that it's not rejected on the
grounds of its basic design.  I thought it was acceptable after the
first RFC round, but while trying to merge the scheduler part, Peter
seemed mightily unhappy with the whole thing, hence this second RFC
round.  So, if anyone has major issues with the basic design, please
step forward *now* before I go spending more time working on it.

Another thing is that after spending a couple of months polishing the
patchset, it feels quite tiring to keep the patchset floating (you
know - oh, this new thing should be merged into this patch; dang, now
I have to refresh 15 patches on top of it).  I would really appreciate
it if I could set up a stable tree.  Would it be possible to set up a
sched devel branch?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  6:13                     ` Jeff Garzik
@ 2009-12-23  7:53                       ` Ingo Molnar
  2009-12-23  8:41                       ` Peter Zijlstra
  1 sibling, 0 replies; 104+ messages in thread
From: Ingo Molnar @ 2009-12-23  7:53 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Tejun Heo, Peter Zijlstra, awalls, linux-kernel,
	akpm, jens.axboe, rusty, cl, dhowells, arjan, avi, johannes,
	andi


* Jeff Garzik <jeff@garzik.org> wrote:

> On 12/23/2009 01:02 AM, Ingo Molnar wrote:
> >One key thing i havent seen in this discussion are actual measurements. I
> >think a lot could be decided by simply testing this patch-set, by looking at
> >the hard numbers: how much faster (or slower) did a particular key workload
> >get before/after these patches.
> 
> We are dealing with situations where drivers are using workqueues to
> provide a sleep-able context, and trying to solve problems related
> to that.
> 
> "faster or slower?" is not very relevant for such cases.

Claims to performance were made though, so it's a valid question.

But it's not the only effect, which is why i continued my email with the issue 
you raised:

> > Likewise, if there's a reduction in complexity, that is a tangible metric 
> > as well: lets do a few conversions as part of the patch-set and see how 
> > much simpler things have become as a result of it.
> >
> > We really are not forced to the space of Gedankenexperiments here.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  6:52                             ` Herbert Xu
@ 2009-12-23  8:00                               ` Steffen Klassert
  2009-12-23  8:01                                 ` [PATCH 0/2] Parallel crypto/IPsec v7 Steffen Klassert
  0 siblings, 1 reply; 104+ messages in thread
From: Steffen Klassert @ 2009-12-23  8:00 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Tejun Heo, Peter Zijlstra, Linus Torvalds, Arjan van de Ven,
	Jens Axboe, Andi Kleen, awalls, linux-kernel, jeff, mingo, akpm,
	rusty, cl, dhowells, avi, johannes, David S. Miller

On Wed, Dec 23, 2009 at 02:52:40PM +0800, Herbert Xu wrote:
> 
> However, Steffen Klassert has proposed a parallelisation mechanism
> whereby extremely CPU-intensive operations such as encryption may
> be split across CPUs.  Steffen, could you repost your padata patches
> here?
> 

Yes, I'll send the same version I sent to linux-crypto last week in
reply to this mail.

Steffen

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  7:09                     ` Tejun Heo
@ 2009-12-23  8:01                       ` Ingo Molnar
  2009-12-23  8:12                         ` Ingo Molnar
  2009-12-23  8:27                         ` Tejun Heo
  2009-12-23  8:25                       ` Arjan van de Ven
  1 sibling, 2 replies; 104+ messages in thread
From: Ingo Molnar @ 2009-12-23  8:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Peter Zijlstra, awalls, linux-kernel, jeff, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi


* Tejun Heo <tj@kernel.org> wrote:

> > We really are not forced to the space of Gedankenexperiments here.
> 
> Sure but there's a reason why I posted the patchset without the actual 
> conversions.  I wanted to make sure that it's not rejected on the ground of 
> its basic design.  I thought it was acceptable after the first RFC round but 
> while trying to merge the scheduler part, Peter seemed mightily unhappy with 
> the whole thing, so this second RFC round.  So, if anyone has major issues 
> with the basic design, please step forward *now* before I go spending more 
> time working on it.

At least as far as i'm concerned, i'd like to see actual uses. It's a big 
linecount increase all things considered:

   20 files changed, 2783 insertions(+), 660 deletions(-)

and you say it _wont_ help performance/scalability (this aspect wasnt clear to 
me from previous discussions), so the (yet to be seen) complexity reduction in 
other code ought to be worth it.

	Ingo

^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 0/2] Parallel crypto/IPsec v7
  2009-12-23  8:00                               ` Steffen Klassert
@ 2009-12-23  8:01                                 ` Steffen Klassert
  2009-12-23  8:03                                   ` [PATCH 1/2] padata: generic parallelization/serialization interface Steffen Klassert
                                                     ` (2 more replies)
  0 siblings, 3 replies; 104+ messages in thread
From: Steffen Klassert @ 2009-12-23  8:01 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Tejun Heo, Peter Zijlstra, Linus Torvalds, Arjan van de Ven,
	Jens Axboe, Andi Kleen, awalls, linux-kernel, jeff, mingo, akpm,
	rusty, cl, dhowells, avi, johannes, David S. Miller

This patchset adds the 'pcrypt' parallel crypto template. With this template it
is possible to process the crypto requests of a transform in parallel without
request reordering. This is particularly interesting for IPsec.

The parallel crypto template is based on the 'padata' generic
parallelization/serialization method. With this method data objects can
be processed in parallel, starting at some given point.
The parallelized data objects return after serialization in the same order
as they were before the parallelization. In the case of IPsec, this makes it
possible to run the expensive parts in parallel without packet
reordering.
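
As a rough usage sketch of the padata interface from patch 1/2
(simplified, error handling omitted, the foo_* names are made up):

#include <linux/padata.h>

struct foo_request {
	struct padata_priv	padata;
	/* ... payload ... */
};

/* Runs on some CPU of the instance's cpumask, with BHs off. */
static void foo_parallel(struct padata_priv *padata)
{
	/* ... do the expensive work ... */
	padata_do_serial(padata);	/* hand back for in-order completion */
}

/* Runs on the requested cb_cpu, in the original submission order. */
static void foo_serial(struct padata_priv *padata)
{
	/* ... complete the request ... */
}

static int foo_submit(struct padata_instance *pinst,
		      struct foo_request *req, int cb_cpu)
{
	req->padata.parallel = foo_parallel;
	req->padata.serial   = foo_serial;

	return padata_do_parallel(pinst, &req->padata, cb_cpu);
}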

IPsec forwarding tests with two quad core machines (Intel Core 2 Quad Q6600)
and an EXFO FTB-400 packet blazer showed the following results:

On all tests I used smp_affinity to pin the interrupts of the network cards
to different cpus.

linux-2.6.33-rc1 (64 bit)
Packetsize: 1420 byte
Test time: 60 sec
Encryption: aes192-sha1
bidirectional throughput without packet loss: 2 x 325 Mbit/s
unidirectional throughput without packet loss: 325 Mbit/s

linux-2.6.33-rc1 (64 bit)
Packetsize: 128 byte
Test time: 60 sec
Encryption: aes192-sha1
bidirectional throughput without packet loss: 2 x 100 Mbit/s
unidirectional throughput without packet loss: 125 Mbit/s


linux-2.6.33-rc1 with padata/pcrypt (64 bit)
Packetsize: 1420 byte
Test time: 60 sec
Encryption: aes192-sha1
bidirectional throughput without packet loss: 2 x 650 Mbit/s
unidirectional throughput without packet loss: 850  Mbit/s

linux-2.6.33-rc1 with padata/pcrypt (64 bit)
Packetsize: 128 byte
Test time: 60 sec
Encryption: aes192-sha1
bidirectional throughput without packet loss: 2 x 100 Mbit/s
unidirectional throughput without packet loss: 125 Mbit/s


So the performance win on big packets is quite good. But on small packets
the throughput results with and without the workqueue-based parallelization
are almost the same in my testing environment.

Changes from v6:

- Rework padata to use workqueues instead of softirqs for
  parallelization/serialization

- Add a cyclic sequence number pattern, which makes the reset of the padata
  serialization logic on sequence number overrun superfluous.

- Adapt pcrypt to the changed padata interface.

- Rebased to linux-2.6.33-rc1

Steffen

^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 1/2] padata: generic parallelization/serialization interface
  2009-12-23  8:01                                 ` [PATCH 0/2] Parallel crypto/IPsec v7 Steffen Klassert
@ 2009-12-23  8:03                                   ` Steffen Klassert
  2009-12-23  8:04                                   ` [PATCH 2/2] crypto: pcrypt - Add pcrypt crypto parallelization wrapper Steffen Klassert
  2010-01-07  5:39                                   ` [PATCH 0/2] Parallel crypto/IPsec v7 Herbert Xu
  2 siblings, 0 replies; 104+ messages in thread
From: Steffen Klassert @ 2009-12-23  8:03 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Tejun Heo, Peter Zijlstra, Linus Torvalds, Arjan van de Ven,
	Jens Axboe, Andi Kleen, awalls, linux-kernel, jeff, mingo, akpm,
	rusty, cl, dhowells, avi, johannes, David S. Miller

This patch introduces an interface to process data objects
in parallel. The parallelized objects return after serialization
in the same order as they were before the parallelization.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 include/linux/padata.h |   88 ++++++
 init/Kconfig           |    4 +
 kernel/Makefile        |    1 +
 kernel/padata.c        |  690 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 783 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/padata.h
 create mode 100644 kernel/padata.c

diff --git a/include/linux/padata.h b/include/linux/padata.h
new file mode 100644
index 0000000..51611da
--- /dev/null
+++ b/include/linux/padata.h
@@ -0,0 +1,88 @@
+/*
+ * padata.h - header for the padata parallelization interface
+ *
+ * Copyright (C) 2008, 2009 secunet Security Networks AG
+ * Copyright (C) 2008, 2009 Steffen Klassert <steffen.klassert@secunet.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#ifndef PADATA_H
+#define PADATA_H
+
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
+
+struct padata_priv {
+	struct list_head	list;
+	struct parallel_data	*pd;
+	int			cb_cpu;
+	int			seq_nr;
+	int			info;
+	void                    (*parallel)(struct padata_priv *padata);
+	void                    (*serial)(struct padata_priv *padata);
+};
+
+struct padata_list {
+	struct list_head        list;
+	spinlock_t              lock;
+};
+
+struct padata_queue {
+	struct padata_list	parallel;
+	struct padata_list	reorder;
+	struct padata_list	serial;
+	struct work_struct	pwork;
+	struct work_struct	swork;
+	struct parallel_data    *pd;
+	atomic_t		num_obj;
+	int			cpu_index;
+};
+
+struct parallel_data {
+	struct padata_instance	*pinst;
+	struct padata_queue	*queue;
+	atomic_t		seq_nr;
+	atomic_t		reorder_objects;
+	atomic_t                refcnt;
+	unsigned int		max_seq_nr;
+	cpumask_var_t		cpumask;
+	spinlock_t              lock;
+};
+
+struct padata_instance {
+	struct notifier_block   cpu_notifier;
+	struct workqueue_struct *wq;
+	struct parallel_data	*pd;
+	cpumask_var_t           cpumask;
+	struct mutex		lock;
+	u8			flags;
+#define	PADATA_INIT		1
+#define	PADATA_RESET		2
+};
+
+extern struct padata_instance *padata_alloc(const struct cpumask *cpumask,
+					    struct workqueue_struct *wq);
+extern void padata_free(struct padata_instance *pinst);
+extern int padata_do_parallel(struct padata_instance *pinst,
+			      struct padata_priv *padata, int cb_cpu);
+extern void padata_do_serial(struct padata_priv *padata);
+extern int padata_set_cpumask(struct padata_instance *pinst,
+			      cpumask_var_t cpumask);
+extern int padata_add_cpu(struct padata_instance *pinst, int cpu);
+extern int padata_remove_cpu(struct padata_instance *pinst, int cpu);
+extern void padata_start(struct padata_instance *pinst);
+extern void padata_stop(struct padata_instance *pinst);
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index a23da9f..9fd23bc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1252,4 +1252,8 @@ source "block/Kconfig"
 config PREEMPT_NOTIFIERS
 	bool
 
+config PADATA
+	depends on SMP
+	bool
+
 source "kernel/Kconfig.locks"
diff --git a/kernel/Makefile b/kernel/Makefile
index 864ff75..6aebdeb 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_SLOW_WORK_DEBUG) += slow-work-debugfs.o
 obj-$(CONFIG_PERF_EVENTS) += perf_event.o
 obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
 obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
+obj-$(CONFIG_PADATA) += padata.o
 
 ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/padata.c b/kernel/padata.c
new file mode 100644
index 0000000..6f9bcb8
--- /dev/null
+++ b/kernel/padata.c
@@ -0,0 +1,690 @@
+/*
+ * padata.c - generic interface to process data streams in parallel
+ *
+ * Copyright (C) 2008, 2009 secunet Security Networks AG
+ * Copyright (C) 2008, 2009 Steffen Klassert <steffen.klassert@secunet.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include <linux/module.h>
+#include <linux/cpumask.h>
+#include <linux/err.h>
+#include <linux/cpu.h>
+#include <linux/padata.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include <linux/rcupdate.h>
+
+#define MAX_SEQ_NR (INT_MAX - NR_CPUS)
+#define MAX_OBJ_NUM (10000 * NR_CPUS)
+
+static int padata_index_to_cpu(struct parallel_data *pd, int cpu_index)
+{
+	int cpu, target_cpu;
+
+	target_cpu = cpumask_first(pd->cpumask);
+	for (cpu = 0; cpu < cpu_index; cpu++)
+		target_cpu = cpumask_next(target_cpu, pd->cpumask);
+
+	return target_cpu;
+}
+
+static int padata_cpu_hash(struct padata_priv *padata)
+{
+	int cpu_index;
+	struct parallel_data *pd;
+
+	pd =  padata->pd;
+
+	/*
+	 * Hash the sequence numbers to the cpus by taking
+	 * seq_nr mod. number of cpus in use.
+	 */
+	cpu_index =  padata->seq_nr % cpumask_weight(pd->cpumask);
+
+	return padata_index_to_cpu(pd, cpu_index);
+}
+
+static void padata_parallel_worker(struct work_struct *work)
+{
+	struct padata_queue *queue;
+	struct parallel_data *pd;
+	struct padata_instance *pinst;
+	LIST_HEAD(local_list);
+
+	local_bh_disable();
+	queue = container_of(work, struct padata_queue, pwork);
+	pd = queue->pd;
+	pinst = pd->pinst;
+
+	spin_lock(&queue->parallel.lock);
+	list_replace_init(&queue->parallel.list, &local_list);
+	spin_unlock(&queue->parallel.lock);
+
+	while (!list_empty(&local_list)) {
+		struct padata_priv *padata;
+
+		padata = list_entry(local_list.next,
+				    struct padata_priv, list);
+
+		list_del_init(&padata->list);
+
+		padata->parallel(padata);
+	}
+
+	local_bh_enable();
+}
+
+/*
+ * padata_do_parallel - padata parallelization function
+ *
+ * @pinst: padata instance
+ * @padata: object to be parallelized
+ * @cb_cpu: cpu the serialization callback function will run on,
+ *          must be in the cpumask of padata.
+ *
+ * The parallelization callback function will run with BHs off.
+ * Note: Every object which is parallelized by padata_do_parallel
+ * must be seen by padata_do_serial.
+ */
+int padata_do_parallel(struct padata_instance *pinst,
+		       struct padata_priv *padata, int cb_cpu)
+{
+	int target_cpu, err;
+	struct padata_queue *queue;
+	struct parallel_data *pd;
+
+	rcu_read_lock_bh();
+
+	pd = rcu_dereference(pinst->pd);
+
+	err = 0;
+	if (!(pinst->flags & PADATA_INIT))
+		goto out;
+
+	err =  -EBUSY;
+	if ((pinst->flags & PADATA_RESET))
+		goto out;
+
+	if (atomic_read(&pd->refcnt) >= MAX_OBJ_NUM)
+		goto out;
+
+	err = -EINVAL;
+	if (!cpumask_test_cpu(cb_cpu, pd->cpumask))
+		goto out;
+
+	err = -EINPROGRESS;
+	atomic_inc(&pd->refcnt);
+	padata->pd = pd;
+	padata->cb_cpu = cb_cpu;
+
+	if (unlikely(atomic_read(&pd->seq_nr) == pd->max_seq_nr))
+		atomic_set(&pd->seq_nr, -1);
+
+	padata->seq_nr = atomic_inc_return(&pd->seq_nr);
+
+	target_cpu = padata_cpu_hash(padata);
+	queue = per_cpu_ptr(pd->queue, target_cpu);
+
+	spin_lock(&queue->parallel.lock);
+	list_add_tail(&padata->list, &queue->parallel.list);
+	spin_unlock(&queue->parallel.lock);
+
+	queue_work_on(target_cpu, pinst->wq, &queue->pwork);
+
+out:
+	rcu_read_unlock_bh();
+
+	return err;
+}
+EXPORT_SYMBOL(padata_do_parallel);
+
+static struct padata_priv *padata_get_next(struct parallel_data *pd)
+{
+	int cpu, num_cpus, empty, calc_seq_nr;
+	int seq_nr, next_nr, overrun, next_overrun;
+	struct padata_queue *queue, *next_queue;
+	struct padata_priv *padata;
+	struct padata_list *reorder;
+
+	empty = 0;
+	next_nr = -1;
+	next_overrun = 0;
+	next_queue = NULL;
+
+	num_cpus = cpumask_weight(pd->cpumask);
+
+	for_each_cpu(cpu, pd->cpumask) {
+		queue = per_cpu_ptr(pd->queue, cpu);
+		reorder = &queue->reorder;
+
+		/*
+		 * Calculate the seq_nr of the object that should be
+		 * next in this queue.
+		 */
+		overrun = 0;
+		calc_seq_nr = (atomic_read(&queue->num_obj) * num_cpus)
+			       + queue->cpu_index;
+
+		if (unlikely(calc_seq_nr > pd->max_seq_nr)) {
+			calc_seq_nr = calc_seq_nr - pd->max_seq_nr - 1;
+			overrun = 1;
+		}
+
+		if (!list_empty(&reorder->list)) {
+			padata = list_entry(reorder->list.next,
+					    struct padata_priv, list);
+
+			seq_nr  = padata->seq_nr;
+			BUG_ON(calc_seq_nr != seq_nr);
+		} else {
+			seq_nr = calc_seq_nr;
+			empty++;
+		}
+
+		if (next_nr < 0 || seq_nr < next_nr
+		    || (next_overrun && !overrun)) {
+			next_nr = seq_nr;
+			next_overrun = overrun;
+			next_queue = queue;
+		}
+	}
+
+	padata = NULL;
+
+	if (empty == num_cpus)
+		goto out;
+
+	reorder = &next_queue->reorder;
+
+	if (!list_empty(&reorder->list)) {
+		padata = list_entry(reorder->list.next,
+				    struct padata_priv, list);
+
+		if (unlikely(next_overrun)) {
+			for_each_cpu(cpu, pd->cpumask) {
+				queue = per_cpu_ptr(pd->queue, cpu);
+				atomic_set(&queue->num_obj, 0);
+			}
+		}
+
+		spin_lock(&reorder->lock);
+		list_del_init(&padata->list);
+		atomic_dec(&pd->reorder_objects);
+		spin_unlock(&reorder->lock);
+
+		atomic_inc(&next_queue->num_obj);
+
+		goto out;
+	}
+
+	if (next_nr % num_cpus == next_queue->cpu_index) {
+		padata = ERR_PTR(-ENODATA);
+		goto out;
+	}
+
+	padata = ERR_PTR(-EINPROGRESS);
+out:
+	return padata;
+}
+
+static void padata_reorder(struct parallel_data *pd)
+{
+	struct padata_priv *padata;
+	struct padata_queue *queue;
+	struct padata_instance *pinst = pd->pinst;
+
+try_again:
+	if (!spin_trylock_bh(&pd->lock))
+		goto out;
+
+	while (1) {
+		padata = padata_get_next(pd);
+
+		if (!padata || PTR_ERR(padata) == -EINPROGRESS)
+			break;
+
+		if (PTR_ERR(padata) == -ENODATA) {
+			spin_unlock_bh(&pd->lock);
+			goto out;
+		}
+
+		queue = per_cpu_ptr(pd->queue, padata->cb_cpu);
+
+		spin_lock(&queue->serial.lock);
+		list_add_tail(&padata->list, &queue->serial.list);
+		spin_unlock(&queue->serial.lock);
+
+		queue_work_on(padata->cb_cpu, pinst->wq, &queue->swork);
+	}
+
+	spin_unlock_bh(&pd->lock);
+
+	if (atomic_read(&pd->reorder_objects))
+		goto try_again;
+
+out:
+	return;
+}
+
+static void padata_serial_worker(struct work_struct *work)
+{
+	struct padata_queue *queue;
+	struct parallel_data *pd;
+	LIST_HEAD(local_list);
+
+	local_bh_disable();
+	queue = container_of(work, struct padata_queue, swork);
+	pd = queue->pd;
+
+	spin_lock(&queue->serial.lock);
+	list_replace_init(&queue->serial.list, &local_list);
+	spin_unlock(&queue->serial.lock);
+
+	while (!list_empty(&local_list)) {
+		struct padata_priv *padata;
+
+		padata = list_entry(local_list.next,
+				    struct padata_priv, list);
+
+		list_del_init(&padata->list);
+
+		padata->serial(padata);
+		atomic_dec(&pd->refcnt);
+	}
+	local_bh_enable();
+}
+
+/*
+ * padata_do_serial - padata serialization function
+ *
+ * @padata: object to be serialized.
+ *
+ * padata_do_serial must be called for every parallelized object.
+ * The serialization callback function will run with BHs off.
+ */
+void padata_do_serial(struct padata_priv *padata)
+{
+	int cpu;
+	struct padata_queue *queue;
+	struct parallel_data *pd;
+
+	pd = padata->pd;
+
+	cpu = get_cpu();
+	queue = per_cpu_ptr(pd->queue, cpu);
+
+	spin_lock(&queue->reorder.lock);
+	atomic_inc(&pd->reorder_objects);
+	list_add_tail(&padata->list, &queue->reorder.list);
+	spin_unlock(&queue->reorder.lock);
+
+	put_cpu();
+
+	padata_reorder(pd);
+}
+EXPORT_SYMBOL(padata_do_serial);
+
+static struct parallel_data *padata_alloc_pd(struct padata_instance *pinst,
+					     const struct cpumask *cpumask)
+{
+	int cpu, cpu_index, num_cpus;
+	struct padata_queue *queue;
+	struct parallel_data *pd;
+
+	cpu_index = 0;
+
+	pd = kzalloc(sizeof(struct parallel_data), GFP_KERNEL);
+	if (!pd)
+		goto err;
+
+	pd->queue = alloc_percpu(struct padata_queue);
+	if (!pd->queue)
+		goto err_free_pd;
+
+	if (!alloc_cpumask_var(&pd->cpumask, GFP_KERNEL))
+		goto err_free_queue;
+
+	for_each_possible_cpu(cpu) {
+		queue = per_cpu_ptr(pd->queue, cpu);
+
+		queue->pd = pd;
+
+		if (cpumask_test_cpu(cpu, cpumask)
+		    && cpumask_test_cpu(cpu, cpu_active_mask)) {
+			queue->cpu_index = cpu_index;
+			cpu_index++;
+		} else
+			queue->cpu_index = -1;
+
+		INIT_LIST_HEAD(&queue->reorder.list);
+		INIT_LIST_HEAD(&queue->parallel.list);
+		INIT_LIST_HEAD(&queue->serial.list);
+		spin_lock_init(&queue->reorder.lock);
+		spin_lock_init(&queue->parallel.lock);
+		spin_lock_init(&queue->serial.lock);
+
+		INIT_WORK(&queue->pwork, padata_parallel_worker);
+		INIT_WORK(&queue->swork, padata_serial_worker);
+		atomic_set(&queue->num_obj, 0);
+	}
+
+	cpumask_and(pd->cpumask, cpumask, cpu_active_mask);
+
+	num_cpus = cpumask_weight(pd->cpumask);
+	pd->max_seq_nr = (MAX_SEQ_NR / num_cpus) * num_cpus - 1;
+
+	atomic_set(&pd->seq_nr, -1);
+	atomic_set(&pd->reorder_objects, 0);
+	atomic_set(&pd->refcnt, 0);
+	pd->pinst = pinst;
+	spin_lock_init(&pd->lock);
+
+	return pd;
+
+err_free_queue:
+	free_percpu(pd->queue);
+err_free_pd:
+	kfree(pd);
+err:
+	return NULL;
+}
+
+static void padata_free_pd(struct parallel_data *pd)
+{
+	free_cpumask_var(pd->cpumask);
+	free_percpu(pd->queue);
+	kfree(pd);
+}
+
+static void padata_replace(struct padata_instance *pinst,
+			   struct parallel_data *pd_new)
+{
+	struct parallel_data *pd_old = pinst->pd;
+
+	pinst->flags |= PADATA_RESET;
+
+	rcu_assign_pointer(pinst->pd, pd_new);
+
+	synchronize_rcu();
+
+	while (atomic_read(&pd_old->refcnt) != 0)
+		yield();
+
+	flush_workqueue(pinst->wq);
+
+	padata_free_pd(pd_old);
+
+	pinst->flags &= ~PADATA_RESET;
+}
+
+/*
+ * padata_set_cpumask - set the cpumask that padata should use
+ *
+ * @pinst: padata instance
+ * @cpumask: the cpumask to use
+ */
+int padata_set_cpumask(struct padata_instance *pinst,
+			cpumask_var_t cpumask)
+{
+	struct parallel_data *pd;
+	int err = 0;
+
+	might_sleep();
+
+	mutex_lock(&pinst->lock);
+
+	pd = padata_alloc_pd(pinst, cpumask);
+	if (!pd) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	cpumask_copy(pinst->cpumask, cpumask);
+
+	padata_replace(pinst, pd);
+
+out:
+	mutex_unlock(&pinst->lock);
+
+	return err;
+}
+EXPORT_SYMBOL(padata_set_cpumask);
+
+static int __padata_add_cpu(struct padata_instance *pinst, int cpu)
+{
+	struct parallel_data *pd;
+
+	if (cpumask_test_cpu(cpu, cpu_active_mask)) {
+		pd = padata_alloc_pd(pinst, pinst->cpumask);
+		if (!pd)
+			return -ENOMEM;
+
+		padata_replace(pinst, pd);
+	}
+
+	return 0;
+}
+
+/*
+ * padata_add_cpu - add a cpu to the padata cpumask
+ *
+ * @pinst: padata instance
+ * @cpu: cpu to add
+ */
+int padata_add_cpu(struct padata_instance *pinst, int cpu)
+{
+	int err;
+
+	might_sleep();
+
+	mutex_lock(&pinst->lock);
+
+	cpumask_set_cpu(cpu, pinst->cpumask);
+	err = __padata_add_cpu(pinst, cpu);
+
+	mutex_unlock(&pinst->lock);
+
+	return err;
+}
+EXPORT_SYMBOL(padata_add_cpu);
+
+static int __padata_remove_cpu(struct padata_instance *pinst, int cpu)
+{
+	struct parallel_data *pd;
+
+	if (cpumask_test_cpu(cpu, cpu_online_mask)) {
+		pd = padata_alloc_pd(pinst, pinst->cpumask);
+		if (!pd)
+			return -ENOMEM;
+
+		padata_replace(pinst, pd);
+	}
+
+	return 0;
+}
+
+/*
+ * padata_remove_cpu - remove a cpu from the padata cpumask
+ *
+ * @pinst: padata instance
+ * @cpu: cpu to remove
+ */
+int padata_remove_cpu(struct padata_instance *pinst, int cpu)
+{
+	int err;
+
+	might_sleep();
+
+	mutex_lock(&pinst->lock);
+
+	cpumask_clear_cpu(cpu, pinst->cpumask);
+	err = __padata_remove_cpu(pinst, cpu);
+
+	mutex_unlock(&pinst->lock);
+
+	return err;
+}
+EXPORT_SYMBOL(padata_remove_cpu);
+
+/*
+ * padata_start - start the parallel processing
+ *
+ * @pinst: padata instance to start
+ */
+void padata_start(struct padata_instance *pinst)
+{
+	might_sleep();
+
+	mutex_lock(&pinst->lock);
+	pinst->flags |= PADATA_INIT;
+	mutex_unlock(&pinst->lock);
+}
+EXPORT_SYMBOL(padata_start);
+
+/*
+ * padata_stop - stop the parallel processing
+ *
+ * @pinst: padata instance to stop
+ */
+void padata_stop(struct padata_instance *pinst)
+{
+	might_sleep();
+
+	mutex_lock(&pinst->lock);
+	pinst->flags &= ~PADATA_INIT;
+	mutex_unlock(&pinst->lock);
+}
+EXPORT_SYMBOL(padata_stop);
+
+static int __cpuinit padata_cpu_callback(struct notifier_block *nfb,
+					 unsigned long action, void *hcpu)
+{
+	int err;
+	struct padata_instance *pinst;
+	int cpu = (unsigned long)hcpu;
+
+	pinst = container_of(nfb, struct padata_instance, cpu_notifier);
+
+	switch (action) {
+	case CPU_ONLINE:
+	case CPU_ONLINE_FROZEN:
+		if (!cpumask_test_cpu(cpu, pinst->cpumask))
+			break;
+		mutex_lock(&pinst->lock);
+		err = __padata_add_cpu(pinst, cpu);
+		mutex_unlock(&pinst->lock);
+		if (err)
+			return NOTIFY_BAD;
+		break;
+
+	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
+		if (!cpumask_test_cpu(cpu, pinst->cpumask))
+			break;
+		mutex_lock(&pinst->lock);
+		err = __padata_remove_cpu(pinst, cpu);
+		mutex_unlock(&pinst->lock);
+		if (err)
+			return NOTIFY_BAD;
+		break;
+
+	case CPU_UP_CANCELED:
+	case CPU_UP_CANCELED_FROZEN:
+		if (!cpumask_test_cpu(cpu, pinst->cpumask))
+			break;
+		mutex_lock(&pinst->lock);
+		__padata_remove_cpu(pinst, cpu);
+		mutex_unlock(&pinst->lock);
+
+	case CPU_DOWN_FAILED:
+	case CPU_DOWN_FAILED_FROZEN:
+		if (!cpumask_test_cpu(cpu, pinst->cpumask))
+			break;
+		mutex_lock(&pinst->lock);
+		__padata_add_cpu(pinst, cpu);
+		mutex_unlock(&pinst->lock);
+	}
+
+	return NOTIFY_OK;
+}
+
+/*
+ * padata_alloc - allocate and initialize a padata instance
+ *
+ * @cpumask: cpumask that padata uses for parallelization
+ * @wq: workqueue to use for the allocated padata instance
+ */
+struct padata_instance *padata_alloc(const struct cpumask *cpumask,
+				     struct workqueue_struct *wq)
+{
+	int err;
+	struct padata_instance *pinst;
+	struct parallel_data *pd;
+
+	pinst = kzalloc(sizeof(struct padata_instance), GFP_KERNEL);
+	if (!pinst)
+		goto err;
+
+	pd = padata_alloc_pd(pinst, cpumask);
+	if (!pd)
+		goto err_free_inst;
+
+	rcu_assign_pointer(pinst->pd, pd);
+
+	pinst->wq = wq;
+
+	cpumask_copy(pinst->cpumask, cpumask);
+
+	pinst->flags = 0;
+
+	pinst->cpu_notifier.notifier_call = padata_cpu_callback;
+	pinst->cpu_notifier.priority = 0;
+	err = register_hotcpu_notifier(&pinst->cpu_notifier);
+	if (err)
+		goto err_free_pd;
+
+	mutex_init(&pinst->lock);
+
+	return pinst;
+
+err_free_pd:
+	padata_free_pd(pd);
+err_free_inst:
+	kfree(pinst);
+err:
+	return NULL;
+}
+EXPORT_SYMBOL(padata_alloc);
+
+/*
+ * padata_free - free a padata instance
+ *
+ * @ padata_inst: padata instance to free
+ */
+void padata_free(struct padata_instance *pinst)
+{
+	padata_stop(pinst);
+
+	synchronize_rcu();
+
+	while (atomic_read(&pinst->pd->refcnt) != 0)
+		yield();
+
+	unregister_hotcpu_notifier(&pinst->cpu_notifier);
+	padata_free_pd(pinst->pd);
+	kfree(pinst);
+}
+EXPORT_SYMBOL(padata_free);
-- 
1.5.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH 2/2] crypto: pcrypt - Add pcrypt crypto parallelization wrapper
  2009-12-23  8:01                                 ` [PATCH 0/2] Parallel crypto/IPsec v7 Steffen Klassert
  2009-12-23  8:03                                   ` [PATCH 1/2] padata: generic parallelization/serialization interface Steffen Klassert
@ 2009-12-23  8:04                                   ` Steffen Klassert
  2010-01-07  5:39                                   ` [PATCH 0/2] Parallel crypto/IPsec v7 Herbert Xu
  2 siblings, 0 replies; 104+ messages in thread
From: Steffen Klassert @ 2009-12-23  8:04 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Tejun Heo, Peter Zijlstra, Linus Torvalds, Arjan van de Ven,
	Jens Axboe, Andi Kleen, awalls, linux-kernel, jeff, mingo, akpm,
	rusty, cl, dhowells, avi, johannes, David S. Miller

This patch adds a parallel crypto template that takes a crypto
algorithm and converts it to process the crypto transforms in
parallel. For the moment only aead algorithms are supported.
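
(Purely as an illustration - the algorithm string below is just an
example - a user would instantiate the template like any other crypto
template:)

#include <linux/crypto.h>
#include <linux/err.h>

static struct crypto_aead *tfm;

static int example_init(void)
{
	/* wrap an existing AEAD algorithm in the pcrypt template */
	tfm = crypto_alloc_aead("pcrypt(authenc(hmac(sha1),cbc(aes)))", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	return 0;
}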

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 crypto/Kconfig          |   10 +
 crypto/Makefile         |    1 +
 crypto/pcrypt.c         |  445 +++++++++++++++++++++++++++++++++++++++++++++++
 include/crypto/pcrypt.h |   51 ++++++
 4 files changed, 507 insertions(+), 0 deletions(-)
 create mode 100644 crypto/pcrypt.c
 create mode 100644 include/crypto/pcrypt.h

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 81c185a..6a2e295 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -114,6 +114,16 @@ config CRYPTO_NULL
 	help
 	  These are 'Null' algorithms, used by IPsec, which do nothing.
 
+config CRYPTO_PCRYPT
+	tristate "Parallel crypto engine (EXPERIMENTAL)"
+	depends on SMP && EXPERIMENTAL
+	select PADATA
+	select CRYPTO_MANAGER
+	select CRYPTO_AEAD
+	help
+	  This converts an arbitrary crypto algorithm into a parallel
+	  algorithm that executes in kernel threads.
+
 config CRYPTO_WORKQUEUE
        tristate
 
diff --git a/crypto/Makefile b/crypto/Makefile
index 9e8f619..d7e6441 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -56,6 +56,7 @@ obj-$(CONFIG_CRYPTO_XTS) += xts.o
 obj-$(CONFIG_CRYPTO_CTR) += ctr.o
 obj-$(CONFIG_CRYPTO_GCM) += gcm.o
 obj-$(CONFIG_CRYPTO_CCM) += ccm.o
+obj-$(CONFIG_CRYPTO_PCRYPT) += pcrypt.o
 obj-$(CONFIG_CRYPTO_CRYPTD) += cryptd.o
 obj-$(CONFIG_CRYPTO_DES) += des_generic.o
 obj-$(CONFIG_CRYPTO_FCRYPT) += fcrypt.o
diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
new file mode 100644
index 0000000..b9527d0
--- /dev/null
+++ b/crypto/pcrypt.c
@@ -0,0 +1,445 @@
+/*
+ * pcrypt - Parallel crypto wrapper.
+ *
+ * Copyright (C) 2009 secunet Security Networks AG
+ * Copyright (C) 2009 Steffen Klassert <steffen.klassert@secunet.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include <crypto/algapi.h>
+#include <crypto/internal/aead.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <crypto/pcrypt.h>
+
+static struct padata_instance *pcrypt_enc_padata;
+static struct padata_instance *pcrypt_dec_padata;
+static struct workqueue_struct *encwq;
+static struct workqueue_struct *decwq;
+
+struct pcrypt_instance_ctx {
+	struct crypto_spawn spawn;
+	unsigned int tfm_count;
+};
+
+struct pcrypt_aead_ctx {
+	struct crypto_aead *child;
+	unsigned int cb_cpu;
+};
+
+static int pcrypt_do_parallel(struct padata_priv *padata, unsigned int *cb_cpu,
+			      struct padata_instance *pinst)
+{
+	unsigned int cpu_index, cpu, i;
+
+	cpu = *cb_cpu;
+
+	if (cpumask_test_cpu(cpu, cpu_active_mask))
+			goto out;
+
+	cpu_index = cpu % cpumask_weight(cpu_active_mask);
+
+	cpu = cpumask_first(cpu_active_mask);
+	for (i = 0; i < cpu_index; i++)
+		cpu = cpumask_next(cpu, cpu_active_mask);
+
+	*cb_cpu = cpu;
+
+out:
+	return padata_do_parallel(pinst, padata, cpu);
+}
+
+static int pcrypt_aead_setkey(struct crypto_aead *parent,
+			      const u8 *key, unsigned int keylen)
+{
+	struct pcrypt_aead_ctx *ctx = crypto_aead_ctx(parent);
+
+	return crypto_aead_setkey(ctx->child, key, keylen);
+}
+
+static int pcrypt_aead_setauthsize(struct crypto_aead *parent,
+				   unsigned int authsize)
+{
+	struct pcrypt_aead_ctx *ctx = crypto_aead_ctx(parent);
+
+	return crypto_aead_setauthsize(ctx->child, authsize);
+}
+
+static void pcrypt_aead_serial(struct padata_priv *padata)
+{
+	struct pcrypt_request *preq = pcrypt_padata_request(padata);
+	struct aead_request *req = pcrypt_request_ctx(preq);
+
+	aead_request_complete(req->base.data, padata->info);
+}
+
+static void pcrypt_aead_giv_serial(struct padata_priv *padata)
+{
+	struct pcrypt_request *preq = pcrypt_padata_request(padata);
+	struct aead_givcrypt_request *req = pcrypt_request_ctx(preq);
+
+	aead_request_complete(req->areq.base.data, padata->info);
+}
+
+static void pcrypt_aead_done(struct crypto_async_request *areq, int err)
+{
+	struct aead_request *req = areq->data;
+	struct pcrypt_request *preq = aead_request_ctx(req);
+	struct padata_priv *padata = pcrypt_request_padata(preq);
+
+	padata->info = err;
+	req->base.flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
+
+	padata_do_serial(padata);
+}
+
+static void pcrypt_aead_enc(struct padata_priv *padata)
+{
+	struct pcrypt_request *preq = pcrypt_padata_request(padata);
+	struct aead_request *req = pcrypt_request_ctx(preq);
+
+	padata->info = crypto_aead_encrypt(req);
+
+	if (padata->info)
+		return;
+
+	padata_do_serial(padata);
+}
+
+static int pcrypt_aead_encrypt(struct aead_request *req)
+{
+	int err;
+	struct pcrypt_request *preq = aead_request_ctx(req);
+	struct aead_request *creq = pcrypt_request_ctx(preq);
+	struct padata_priv *padata = pcrypt_request_padata(preq);
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	struct pcrypt_aead_ctx *ctx = crypto_aead_ctx(aead);
+	u32 flags = aead_request_flags(req);
+
+	memset(padata, 0, sizeof(struct padata_priv));
+
+	padata->parallel = pcrypt_aead_enc;
+	padata->serial = pcrypt_aead_serial;
+
+	aead_request_set_tfm(creq, ctx->child);
+	aead_request_set_callback(creq, flags & ~CRYPTO_TFM_REQ_MAY_SLEEP,
+				  pcrypt_aead_done, req);
+	aead_request_set_crypt(creq, req->src, req->dst,
+			       req->cryptlen, req->iv);
+	aead_request_set_assoc(creq, req->assoc, req->assoclen);
+
+	err = pcrypt_do_parallel(padata, &ctx->cb_cpu, pcrypt_enc_padata);
+	if (err)
+		return err;
+	else
+		err = crypto_aead_encrypt(creq);
+
+	return err;
+}
+
+static void pcrypt_aead_dec(struct padata_priv *padata)
+{
+	struct pcrypt_request *preq = pcrypt_padata_request(padata);
+	struct aead_request *req = pcrypt_request_ctx(preq);
+
+	padata->info = crypto_aead_decrypt(req);
+
+	if (padata->info)
+		return;
+
+	padata_do_serial(padata);
+}
+
+static int pcrypt_aead_decrypt(struct aead_request *req)
+{
+	int err;
+	struct pcrypt_request *preq = aead_request_ctx(req);
+	struct aead_request *creq = pcrypt_request_ctx(preq);
+	struct padata_priv *padata = pcrypt_request_padata(preq);
+	struct crypto_aead *aead = crypto_aead_reqtfm(req);
+	struct pcrypt_aead_ctx *ctx = crypto_aead_ctx(aead);
+	u32 flags = aead_request_flags(req);
+
+	memset(padata, 0, sizeof(struct padata_priv));
+
+	padata->parallel = pcrypt_aead_dec;
+	padata->serial = pcrypt_aead_serial;
+
+	aead_request_set_tfm(creq, ctx->child);
+	aead_request_set_callback(creq, flags & ~CRYPTO_TFM_REQ_MAY_SLEEP,
+				  pcrypt_aead_done, req);
+	aead_request_set_crypt(creq, req->src, req->dst,
+			       req->cryptlen, req->iv);
+	aead_request_set_assoc(creq, req->assoc, req->assoclen);
+
+	err = pcrypt_do_parallel(padata, &ctx->cb_cpu, pcrypt_dec_padata);
+	if (err)
+		return err;
+	else
+		err = crypto_aead_decrypt(creq);
+
+	return err;
+}
+
+static void pcrypt_aead_givenc(struct padata_priv *padata)
+{
+	struct pcrypt_request *preq = pcrypt_padata_request(padata);
+	struct aead_givcrypt_request *req = pcrypt_request_ctx(preq);
+
+	padata->info = crypto_aead_givencrypt(req);
+
+	if (padata->info)
+		return;
+
+	padata_do_serial(padata);
+}
+
+static int pcrypt_aead_givencrypt(struct aead_givcrypt_request *req)
+{
+	int err;
+	struct aead_request *areq = &req->areq;
+	struct pcrypt_request *preq = aead_request_ctx(areq);
+	struct aead_givcrypt_request *creq = pcrypt_request_ctx(preq);
+	struct padata_priv *padata = pcrypt_request_padata(preq);
+	struct crypto_aead *aead = aead_givcrypt_reqtfm(req);
+	struct pcrypt_aead_ctx *ctx = crypto_aead_ctx(aead);
+	u32 flags = aead_request_flags(areq);
+
+	memset(padata, 0, sizeof(struct padata_priv));
+
+	padata->parallel = pcrypt_aead_givenc;
+	padata->serial = pcrypt_aead_giv_serial;
+
+	aead_givcrypt_set_tfm(creq, ctx->child);
+	aead_givcrypt_set_callback(creq, flags & ~CRYPTO_TFM_REQ_MAY_SLEEP,
+				   pcrypt_aead_done, areq);
+	aead_givcrypt_set_crypt(creq, areq->src, areq->dst,
+				areq->cryptlen, areq->iv);
+	aead_givcrypt_set_assoc(creq, areq->assoc, areq->assoclen);
+	aead_givcrypt_set_giv(creq, req->giv, req->seq);
+
+	err = pcrypt_do_parallel(padata, &ctx->cb_cpu, pcrypt_enc_padata);
+	if (err)
+		return err;
+	else
+		err = crypto_aead_givencrypt(creq);
+
+	return err;
+}
+
+static int pcrypt_aead_init_tfm(struct crypto_tfm *tfm)
+{
+	int cpu, cpu_index;
+	struct crypto_instance *inst = crypto_tfm_alg_instance(tfm);
+	struct pcrypt_instance_ctx *ictx = crypto_instance_ctx(inst);
+	struct pcrypt_aead_ctx *ctx = crypto_tfm_ctx(tfm);
+	struct crypto_aead *cipher;
+
+	ictx->tfm_count++;
+
+	cpu_index = ictx->tfm_count % cpumask_weight(cpu_active_mask);
+
+	ctx->cb_cpu = cpumask_first(cpu_active_mask);
+	for (cpu = 0; cpu < cpu_index; cpu++)
+		ctx->cb_cpu = cpumask_next(ctx->cb_cpu, cpu_active_mask);
+
+	cipher = crypto_spawn_aead(crypto_instance_ctx(inst));
+
+	if (IS_ERR(cipher))
+		return PTR_ERR(cipher);
+
+	ctx->child = cipher;
+	tfm->crt_aead.reqsize = sizeof(struct pcrypt_request)
+		+ sizeof(struct aead_givcrypt_request)
+		+ crypto_aead_reqsize(cipher);
+
+	return 0;
+}
+
+static void pcrypt_aead_exit_tfm(struct crypto_tfm *tfm)
+{
+	struct pcrypt_aead_ctx *ctx = crypto_tfm_ctx(tfm);
+
+	crypto_free_aead(ctx->child);
+}
+
+static struct crypto_instance *pcrypt_alloc_instance(struct crypto_alg *alg)
+{
+	struct crypto_instance *inst;
+	struct pcrypt_instance_ctx *ctx;
+	int err;
+
+	inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
+	if (!inst) {
+		inst = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+
+	err = -ENAMETOOLONG;
+	if (snprintf(inst->alg.cra_driver_name, CRYPTO_MAX_ALG_NAME,
+		     "pcrypt(%s)", alg->cra_driver_name) >= CRYPTO_MAX_ALG_NAME)
+		goto out_free_inst;
+
+	memcpy(inst->alg.cra_name, alg->cra_name, CRYPTO_MAX_ALG_NAME);
+
+	ctx = crypto_instance_ctx(inst);
+	err = crypto_init_spawn(&ctx->spawn, alg, inst,
+				CRYPTO_ALG_TYPE_MASK);
+	if (err)
+		goto out_free_inst;
+
+	inst->alg.cra_priority = alg->cra_priority + 100;
+	inst->alg.cra_blocksize = alg->cra_blocksize;
+	inst->alg.cra_alignmask = alg->cra_alignmask;
+
+out:
+	return inst;
+
+out_free_inst:
+	kfree(inst);
+	inst = ERR_PTR(err);
+	goto out;
+}
+
+static struct crypto_instance *pcrypt_alloc_aead(struct rtattr **tb)
+{
+	struct crypto_instance *inst;
+	struct crypto_alg *alg;
+	struct crypto_attr_type *algt;
+
+	algt = crypto_get_attr_type(tb);
+
+	alg = crypto_get_attr_alg(tb, algt->type,
+				  (algt->mask & CRYPTO_ALG_TYPE_MASK));
+	if (IS_ERR(alg))
+		return ERR_CAST(alg);
+
+	inst = pcrypt_alloc_instance(alg);
+	if (IS_ERR(inst))
+		goto out_put_alg;
+
+	inst->alg.cra_flags = CRYPTO_ALG_TYPE_AEAD | CRYPTO_ALG_ASYNC;
+	inst->alg.cra_type = &crypto_aead_type;
+
+	inst->alg.cra_aead.ivsize = alg->cra_aead.ivsize;
+	inst->alg.cra_aead.geniv = alg->cra_aead.geniv;
+	inst->alg.cra_aead.maxauthsize = alg->cra_aead.maxauthsize;
+
+	inst->alg.cra_ctxsize = sizeof(struct pcrypt_aead_ctx);
+
+	inst->alg.cra_init = pcrypt_aead_init_tfm;
+	inst->alg.cra_exit = pcrypt_aead_exit_tfm;
+
+	inst->alg.cra_aead.setkey = pcrypt_aead_setkey;
+	inst->alg.cra_aead.setauthsize = pcrypt_aead_setauthsize;
+	inst->alg.cra_aead.encrypt = pcrypt_aead_encrypt;
+	inst->alg.cra_aead.decrypt = pcrypt_aead_decrypt;
+	inst->alg.cra_aead.givencrypt = pcrypt_aead_givencrypt;
+
+out_put_alg:
+	crypto_mod_put(alg);
+	return inst;
+}
+
+static struct crypto_instance *pcrypt_alloc(struct rtattr **tb)
+{
+	struct crypto_attr_type *algt;
+
+	algt = crypto_get_attr_type(tb);
+	if (IS_ERR(algt))
+		return ERR_CAST(algt);
+
+	switch (algt->type & algt->mask & CRYPTO_ALG_TYPE_MASK) {
+	case CRYPTO_ALG_TYPE_AEAD:
+		return pcrypt_alloc_aead(tb);
+	}
+
+	return ERR_PTR(-EINVAL);
+}
+
+static void pcrypt_free(struct crypto_instance *inst)
+{
+	struct pcrypt_instance_ctx *ctx = crypto_instance_ctx(inst);
+
+	crypto_drop_spawn(&ctx->spawn);
+	kfree(inst);
+}
+
+static struct crypto_template pcrypt_tmpl = {
+	.name = "pcrypt",
+	.alloc = pcrypt_alloc,
+	.free = pcrypt_free,
+	.module = THIS_MODULE,
+};
+
+static int __init pcrypt_init(void)
+{
+	encwq = create_workqueue("pencrypt");
+	if (!encwq)
+		goto err;
+
+	decwq = create_workqueue("pdecrypt");
+	if (!decwq)
+		goto err_destroy_encwq;
+
+
+	pcrypt_enc_padata = padata_alloc(cpu_possible_mask, encwq);
+	if (!pcrypt_enc_padata)
+		goto err_destroy_decwq;
+
+	pcrypt_dec_padata = padata_alloc(cpu_possible_mask, decwq);
+	if (!pcrypt_dec_padata)
+		goto err_free_padata;
+
+	padata_start(pcrypt_enc_padata);
+	padata_start(pcrypt_dec_padata);
+
+	return crypto_register_template(&pcrypt_tmpl);
+
+err_free_padata:
+	padata_free(pcrypt_enc_padata);
+
+err_destroy_decwq:
+	destroy_workqueue(decwq);
+
+err_destroy_encwq:
+	destroy_workqueue(encwq);
+
+err:
+	return -ENOMEM;
+}
+
+static void __exit pcrypt_exit(void)
+{
+	padata_stop(pcrypt_enc_padata);
+	padata_stop(pcrypt_dec_padata);
+
+	destroy_workqueue(encwq);
+	destroy_workqueue(decwq);
+
+	padata_free(pcrypt_enc_padata);
+	padata_free(pcrypt_dec_padata);
+
+	crypto_unregister_template(&pcrypt_tmpl);
+}
+
+module_init(pcrypt_init);
+module_exit(pcrypt_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Steffen Klassert <steffen.klassert@secunet.com>");
+MODULE_DESCRIPTION("Parallel crypto wrapper");
diff --git a/include/crypto/pcrypt.h b/include/crypto/pcrypt.h
new file mode 100644
index 0000000..d7d8bd8
--- /dev/null
+++ b/include/crypto/pcrypt.h
@@ -0,0 +1,51 @@
+/*
+ * pcrypt - Parallel crypto engine.
+ *
+ * Copyright (C) 2009 secunet Security Networks AG
+ * Copyright (C) 2009 Steffen Klassert <steffen.klassert@secunet.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#ifndef _CRYPTO_PCRYPT_H
+#define _CRYPTO_PCRYPT_H
+
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/padata.h>
+
+struct pcrypt_request {
+	struct padata_priv	padata;
+	void			*data;
+	void			*__ctx[] CRYPTO_MINALIGN_ATTR;
+};
+
+static inline void *pcrypt_request_ctx(struct pcrypt_request *req)
+{
+	return req->__ctx;
+}
+
+static inline
+struct padata_priv *pcrypt_request_padata(struct pcrypt_request *req)
+{
+	return &req->padata;
+}
+
+static inline
+struct pcrypt_request *pcrypt_padata_request(struct padata_priv *padata)
+{
+	return container_of(padata, struct pcrypt_request, padata);
+}
+
+#endif
-- 
1.5.4.2


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-22 18:28                           ` Linus Torvalds
@ 2009-12-23  8:06                             ` Johannes Berg
  0 siblings, 0 replies; 104+ messages in thread
From: Johannes Berg @ 2009-12-23  8:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Tejun Heo, Arjan van de Ven, Jens Axboe,
	Andi Kleen, awalls, linux-kernel, jeff, mingo, akpm, rusty, cl,
	dhowells, avi

[-- Attachment #1: Type: text/plain, Size: 1245 bytes --]

On Tue, 2009-12-22 at 10:28 -0800, Linus Torvalds wrote:
> 
> On Tue, 22 Dec 2009, Peter Zijlstra wrote:
> > 
> > Which in turn would imply we cannot carry fwd the current lockdep
> > annotations, right?
> > 
> > Which means we'll be stuck in a situation where A flushes B and B
> > flushes A will go undetected until we actually hit it.
> 
> No, lockdep should still work. It just means that waiting for an 
> individual work should be seen as a matter of only waiting for the
> locks that work itself has done - rather than waiting for all the locks
> that any worker has taken.
> 
> And the way the workqueue lockdep stuff is done, I'd assume this just 
> automatically fixes itself when rewritten.

Yeah, you'd just have to remove the per-workqueue lockmap since that's
no longer applicable if there is, in effect, one workqueue per work item
available. We do track each work struct already.
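
A rough sketch of what that per-work tracking looks like (the helper name
run_one_work() and its exact shape are assumptions, not the actual patch;
lock_map_acquire()/lock_map_release() and work->lockdep_map are the existing
lockdep hooks, CONFIG_LOCKDEP conditionals omitted):

static void run_one_work(struct work_struct *work)
{
        work_func_t f = work->func;
        /* copy the map: the callback is allowed to free the work item itself */
        struct lockdep_map lockdep_map = work->lockdep_map;

        /* only the per-work map is taken; no per-workqueue map anymore */
        lock_map_acquire(&lockdep_map);
        f(work);
        lock_map_release(&lockdep_map);
}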

However, I'm not sure what you (Peter) mean by "A flushes B" since flush
is a workqueue operation which appears to no longer exist after this
patchset. If struct work A tries to remove struct work B and vice versa,
it'd still be detected though, assuming the annotations are taken
forward properly of course.

johannes


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  8:01                       ` Ingo Molnar
@ 2009-12-23  8:12                         ` Ingo Molnar
  2009-12-23  8:32                           ` Tejun Heo
  2009-12-23  8:27                         ` Tejun Heo
  1 sibling, 1 reply; 104+ messages in thread
From: Ingo Molnar @ 2009-12-23  8:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Peter Zijlstra, awalls, linux-kernel, jeff, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi


* Ingo Molnar <mingo@elte.hu> wrote:

> At least as far as i'm concerned, i'd like to see actual uses. It's a big 
> linecount increase all things considered:
> 
>    20 files changed, 2783 insertions(+), 660 deletions(-)
> 
> and you say it _wont_ help performance/scalability (this aspect wasnt clear 
> to me from previous discussions), so the (yet to be seen) complexity 
> reduction in other code ought to be worth it.

To further stress this point, i'd like to point to the very first commit that 
introduced kernel/workqueue.c into Linux 7 years ago:

 | From 6ed12ff83c765aeda7d38d3bf9df7d46d24bfb11 Mon Sep 17 00:00:00 2001
 | From: Ingo Molnar <mingo@elte.hu>
 | Date: Mon, 30 Sep 2002 22:17:42 -0700
 | Subject: [PATCH] [PATCH] Workqueue Abstraction

look at the diffstat of that commit:

   201 files changed, 1102 insertions(+), 1194 deletions(-)

despite adding a new abstraction and kernel subsystem (workqueues), that 
commit modified more than a hundred drivers to make use of it, and managed to 
achieve a net linecount decrease of 92 lines - despite adding hundreds of 
lines of a new core facility.

Likewise, for this particular patchset it should be possible to identify 
existing patterns of code in the existing code base of 6+ millions lines of 
Linux driver code that would make the advantages of this +2000 lines of core 
kernel code plain obvious. There were multiple claims of problems with the 
current abstractions - so there sure must be a way to show off the new code in 
a few places.

	Ingo

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-22 18:07                           ` Andi Kleen
  2009-12-22 18:20                             ` Peter Zijlstra
@ 2009-12-23  8:17                             ` Stijn Devriendt
  2009-12-23  8:43                               ` Peter Zijlstra
  1 sibling, 1 reply; 104+ messages in thread
From: Stijn Devriendt @ 2009-12-23  8:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Linus Torvalds, Tejun Heo, Arjan van de Ven,
	Jens Axboe, awalls, linux-kernel, jeff, mi

On Tue, Dec 22, 2009 at 7:07 PM, Andi Kleen <andi@firstfloor.org> wrote:
> One reason I liked a more dynamic frame work for this is that it
> has the potential to be exposed to user space and allow automatic
> work partitioning there based on available cores.  User space
> has a lot more CPU consumption than the kernel.
>
Basically, this is exactly what I was trying to solve with my
sched_wait_block patch. It was broken in all ways, but the ultimate
goal was to have concurrency managed workqueues (to nick the term)
in userspace and have a way out when I/O hits the workqueue.

I'm currently in the progress of implementing this using perf_events
and hope to send a patch in the near future (probably asking for some
help to solve a couple of my issues).

> I think Grand Central Dispatch does something in this direction.
> TBB would probably also benefit
>
Exactly my goal.
I'd definitely be happy if Tejun's work can be generalized to fit userspace
as well. My sched_wait_block patch did exactly what his sched_notifiers
do: notify the workqueue that a new thread should be scheduled to keep
the CPU busy. Although, as I understand it, Tejun's reason for scheduling
a new thread is rather to avoid deadlocks. Different problem, same mechanism.

Stijn

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  7:09                     ` Tejun Heo
  2009-12-23  8:01                       ` Ingo Molnar
@ 2009-12-23  8:25                       ` Arjan van de Ven
  1 sibling, 0 replies; 104+ messages in thread
From: Arjan van de Ven @ 2009-12-23  8:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, awalls,
	linux-kernel, jeff, akpm, jens.axboe, rusty, cl, dhowells, avi,
	johannes, andi

On 12/23/2009 8:09, Tejun Heo wrote:
> Doing several conversions shouldn't be difficult at all.  I'll try to
> convert async and slow work.

btw for async, it is essential that all the scheduled async functions run as soon as possible;
so make sure that no async work is held back for lack of threads... just make more immediately.
(the async work is rather latency sensitive; in general it is <poke hardware> <wait a long time> <done>,
and the <poke hardware> step needs to be done as soon as possible to cut down the total time; latencies are often
cumulative in async due to dependencies)
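
The shape described above, as a sketch (the my_* names and struct my_device
are hypothetical; async_schedule() is the existing <linux/async.h> API):

#include <linux/async.h>

static void my_async_probe(void *data, async_cookie_t cookie)
{
        struct my_device *mydev = data;

        my_poke_hardware(mydev);        /* cheap, wants to run as early as possible */
        my_wait_for_ready(mydev);       /* the long sleep */
        my_finish_setup(mydev);         /* done */
}

static int my_probe(struct my_device *mydev)
{
        /* fire-and-forget; holding this back for lack of threads delays the chain */
        async_schedule(my_async_probe, mydev);
        return 0;
}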

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  8:01                       ` Ingo Molnar
  2009-12-23  8:12                         ` Ingo Molnar
@ 2009-12-23  8:27                         ` Tejun Heo
  2009-12-23  8:37                           ` Ingo Molnar
  1 sibling, 1 reply; 104+ messages in thread
From: Tejun Heo @ 2009-12-23  8:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, awalls, linux-kernel, jeff, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi

Hello,

On 12/23/2009 05:01 PM, Ingo Molnar wrote:
> At least as far as i'm concerned, i'd like to see actual uses. It's a big 
> linecount increase all things considered:
> 
>    20 files changed, 2783 insertions(+), 660 deletions(-)
> 
> and you say it _wont_ help performance/scalability (this aspect
> wasnt clear to me from previous discussions),

I'm just not sure how it would turn out.  I guess it would be an
overall win under loaded situations due to lowered cache footprint but
I don't think it will be anything which would stand out.

> so the (yet to be seen) complexity reduction in other code ought to
> be worth it.

Sure, fair enough but there's also a different side.  It'll allow much
easier implementation of things like in-kernel media presence polling
(I have some code for this but it's still just forming) and
per-device.  It gives a much easier tool to extract concurrency and
thus opens up new possibilities.

So, anyways, alright, I'll go try some conversions.

Happy holidays.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-21 23:50           ` Tejun Heo
                               ` (3 preceding siblings ...)
  2009-12-22 11:06             ` Peter Zijlstra
@ 2009-12-23  8:31             ` Stijn Devriendt
  4 siblings, 0 replies; 104+ messages in thread
From: Stijn Devriendt @ 2009-12-23  8:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, torvalds, awalls, linux-kernel, jeff, mingo,
	akpm, jens.axboe, rusty, cl, dhowells, arjan, a

On Tue, Dec 22, 2009 at 12:50 AM, Tejun Heo <tj@kernel.org> wrote:
>>  2) doesn't deal with cpu heavy tasks/wakeup parallelism
>
> workqueue was never suited for this.  MT workqueues have strong CPU
> affinity which doesn't make sense for CPU-heavy workloads.

It does, really. Have a look at TBB and others. You always want to keep
workqueue items as close to the scheduling thread as possible, as the
chance of having hot caches, TLBs and such is far greater.
The end result is CPU-affine threads fetching work from CPU-affine
queues with work items scheduled by a thread on that CPU.
To improve parallelism, workqueue threads with empty workqueues
start stealing work away from non-empty workqueues.
The added warmup doesn't weigh in against the added parallelism
in those cases.
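
A userspace illustration of that policy (all names, the pthread plumbing and
the fixed-size queues are assumptions for illustration, not TBB or kernel
code): each worker drains its own CPU-local queue first and only steals from
the other queues once its own runs dry.

#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4
#define NITEMS   16

struct queue {
        pthread_mutex_t lock;
        int items[NITEMS];
        int head, tail;                 /* head == tail: empty */
};

static struct queue queues[NWORKERS];

static int try_pop(struct queue *q)
{
        int item = -1;

        pthread_mutex_lock(&q->lock);
        if (q->head < q->tail)
                item = q->items[q->head++];
        pthread_mutex_unlock(&q->lock);
        return item;
}

static void *worker(void *arg)
{
        long self = (long)arg;

        for (;;) {
                int item = -1, i;

                /* local queue first, then steal round-robin from the rest */
                for (i = 0; i < NWORKERS && item < 0; i++)
                        item = try_pop(&queues[(self + i) % NWORKERS]);
                if (item < 0)
                        return NULL;    /* all queues drained; real code would sleep */
                printf("worker %ld ran item %d\n", self, item);
        }
}

int main(void)
{
        pthread_t tid[NWORKERS];
        long i;
        int j;

        for (i = 0; i < NWORKERS; i++) {
                pthread_mutex_init(&queues[i].lock, NULL);
                for (j = 0; j < NITEMS; j++)
                        queues[i].items[queues[i].tail++] = (int)(i * 100 + j);
        }
        for (i = 0; i < NWORKERS; i++)
                pthread_create(&tid[i], NULL, worker, (void *)i);
        for (i = 0; i < NWORKERS; i++)
                pthread_join(tid[i], NULL);
        return 0;
}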

Stijn

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  8:12                         ` Ingo Molnar
@ 2009-12-23  8:32                           ` Tejun Heo
  2009-12-23  8:42                             ` Ingo Molnar
  0 siblings, 1 reply; 104+ messages in thread
From: Tejun Heo @ 2009-12-23  8:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, awalls, linux-kernel, jeff, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi

Hello, Ingo.

On 12/23/2009 05:12 PM, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
>> At least as far as i'm concerned, i'd like to see actual uses. It's a big 
>> linecount increase all things considered:
>>
>>    20 files changed, 2783 insertions(+), 660 deletions(-)

BTW, the code contains way more comments afterwards and has other
benefits like not having a crazy number of workers around on many-core
machines.

>> and you say it _wont_ help performance/scalability (this aspect wasnt clear 

And I think it will help scalability for sure although it depends on
what type of scalability you're talking about.

>> to me from previous discussions), so the (yet to be seen) complexity 
>> reduction in other code ought to be worth it.
> 
> To further stress this point, i'd like to point to the very first commit that 
> introduced kernel/workqueue.c into Linux 7 years ago:
> 
>  | From 6ed12ff83c765aeda7d38d3bf9df7d46d24bfb11 Mon Sep 17 00:00:00 2001
>  | From: Ingo Molnar <mingo@elte.hu>
>  | Date: Mon, 30 Sep 2002 22:17:42 -0700
>  | Subject: [PATCH] [PATCH] Workqueue Abstraction
> 
> look at the diffstat of that commit:
> 
>    201 files changed, 1102 insertions(+), 1194 deletions(-)
> 
> despite adding a new abstraction and kernel subsystem (workqueues), that 
> commit modified more than a hundred drivers to make use of it, and managed to 
> achieve a net linecount decrease of 92 lines - despite adding hundreds of 
> lines of a new core facility.
> 
> Likewise, for this particular patchset it should be possible to identify 
> existing patterns of code in the existing code base of 6+ millions lines of 
> Linux driver code that would make the advantages of this +2000 lines of core 
> kernel code plain obvious. There were multipe claims of problems with the 
> current abstractions - so there sure must be a way to show off the new code in

I'm not sure I'm gonna update that many places in a single sweep but
yeah let's give it a shot.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  8:27                         ` Tejun Heo
@ 2009-12-23  8:37                           ` Ingo Molnar
  2009-12-23  8:49                             ` Tejun Heo
  2009-12-23 13:40                             ` Stefan Richter
  0 siblings, 2 replies; 104+ messages in thread
From: Ingo Molnar @ 2009-12-23  8:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Peter Zijlstra, awalls, linux-kernel, jeff, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi


* Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On 12/23/2009 05:01 PM, Ingo Molnar wrote:
> > At least as far as i'm concerned, i'd like to see actual uses. It's a big 
> > linecount increase all things considered:
> > 
> >    20 files changed, 2783 insertions(+), 660 deletions(-)
> > 
> > and you say it _wont_ help performance/scalability (this aspect
> > wasnt clear to me from previous discussions),
> 
> I'm just not sure how it would turn out. I guess it would be an overall win 
> under loaded situations due to lowered cache footprint but I don't think it 
> will be anything which would stand out.
>
> > so the (yet to be seen) complexity reduction in other code ought to be 
> > worth it.
> 
> Sure, fair enough but there's also a different side.  It'll allow much 
> easier implementation of things like in-kernel media presence polling (I 
> have some code for this but it's still just forming) and per-device.  It 
> gives a much easier tool to extract concurrency and thus opens up new 
> possibilities.
> 
> So, anyways, alright, I'll go try some conversions.

Well, but note that you are again talking performance. Concurrency _IS_ 
performance: either in terms of reduced IO/app/request latency or in terms of 
CPU utilization.

Both metrics can be measured (and there's a massive effort underway to help 
measure such things - see current results under tools/perf/ in your favorite 
kernel repo ;-)

(Plus reduction in driver complexity can be measured as well, in the diffstat 
space.)

So there's no leap of faith needed really, IMHO.

	Ingo

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  6:13                     ` Jeff Garzik
  2009-12-23  7:53                       ` Ingo Molnar
@ 2009-12-23  8:41                       ` Peter Zijlstra
  2009-12-23 10:25                         ` Jeff Garzik
  1 sibling, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2009-12-23  8:41 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Ingo Molnar, Linus Torvalds, Tejun Heo, awalls, linux-kernel,
	akpm, jens.axboe, rusty, cl, dhowells, arjan, avi, johannes,
	andi

On Wed, 2009-12-23 at 01:13 -0500, Jeff Garzik wrote:
> On 12/23/2009 01:02 AM, Ingo Molnar wrote:
> > One key thing i havent seen in this discussion are actual measurements. I
> > think a lot could be decided by simply testing this patch-set, by looking at
> > the hard numbers: how much faster (or slower) did a particular key workload
> > get before/after these patches.
> 
> We are dealing with situations where drivers are using workqueues to 
> provide a sleep-able context, and trying to solve problems related to that.

So why are threaded interrupts not considered? Isn't the typical atomic
context of drivers the IRQ handler?


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  8:32                           ` Tejun Heo
@ 2009-12-23  8:42                             ` Ingo Molnar
  0 siblings, 0 replies; 104+ messages in thread
From: Ingo Molnar @ 2009-12-23  8:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Peter Zijlstra, awalls, linux-kernel, jeff, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi


* Tejun Heo <tj@kernel.org> wrote:

> Hello, Ingo.
> 
> On 12/23/2009 05:12 PM, Ingo Molnar wrote:
> > 
> > * Ingo Molnar <mingo@elte.hu> wrote:
> > 
> >> At least as far as i'm concerned, i'd like to see actual uses. It's a big 
> >> linecount increase all things considered:
> >>
> >>    20 files changed, 2783 insertions(+), 660 deletions(-)
> 
> BTW, the code contains way more comments afterwards and has other benefits 
> like not having a crazy number of workers around on many-core machines.

(the original workqueue.c had way more comments as well.)

> >> and you say it _wont_ help performance/scalability (this aspect wasnt clear 
> 
> And I think it will help scalability for sure although it depends on
> what type of scalability you're talking about.

_I_ am not making any claims - i am simply asking what the benefits are, just 
to move the discussion forward. If there are benefits, they must be measurable, 
simple as that.

> >> to me from previous discussions), so the (yet to be seen) complexity 
> >> reduction in other code ought to be worth it.
> > 
> > To further stress this point, i'd like to point to the very first commit that 
> > introduced kernel/workqueue.c into Linux 7 years ago:
> > 
> >  | From 6ed12ff83c765aeda7d38d3bf9df7d46d24bfb11 Mon Sep 17 00:00:00 2001
> >  | From: Ingo Molnar <mingo@elte.hu>
> >  | Date: Mon, 30 Sep 2002 22:17:42 -0700
> >  | Subject: [PATCH] [PATCH] Workqueue Abstraction
> > 
> > look at the diffstat of that commit:
> > 
> >    201 files changed, 1102 insertions(+), 1194 deletions(-)
> > 
> > despite adding a new abstraction and kernel subsystem (workqueues), that 
> > commit modified more than a hundred drivers to make use of it, and managed to 
> > achieve a net linecount decrease of 92 lines - despite adding hundreds of 
> > lines of a new core facility.
> > 
> > Likewise, for this particular patchset it should be possible to identify 
> > existing patterns of code in the existing code base of 6+ millions lines of 
> > Linux driver code that would make the advantages of this +2000 lines of core 
> > kernel code plain obvious. There were multiple claims of problems with the 
> > current abstractions - so there sure must be a way to show off the new code in
> 
> I'm not sure I'm gonna update that many places in a single sweep but yeah 
> let's give it a shot.

In all fairness the original workqueue.c had an advantage: it basically 
piggybacked on usable patterns from the tqueue (task-queue) abstraction - and 
that was rather repetitive.

Your code adds a new _paradigm_ for which no easily reusable patterns exist - 
so in no way are you expected to show such a massive amount of conversion - 
just a handful of cases would be enough to show the benefits - we can 
extrapolate from there.

It would also give us hands-on experience with the utility (and robustness) of 
your proposal, so it's a win-win IMO.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  8:17                             ` Stijn Devriendt
@ 2009-12-23  8:43                               ` Peter Zijlstra
  2009-12-23  9:01                                 ` Stijn Devriendt
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2009-12-23  8:43 UTC (permalink / raw)
  To: Stijn Devriendt
  Cc: Andi Kleen, Linus Torvalds, Tejun Heo, Arjan van de Ven,
	Jens Axboe, awalls, linux-kernel, jeff, mi

On Wed, 2009-12-23 at 09:17 +0100, Stijn Devriendt wrote:
> On Tue, Dec 22, 2009 at 7:07 PM, Andi Kleen <andi@firstfloor.org> wrote:
> > One reason I liked a more dynamic frame work for this is that it
> > has the potential to be exposed to user space and allow automatic
> > work partitioning there based on available cores.  User space
> > has a lot more CPU consumption than the kernel.
> >
> Basically, this is exactly what I was trying to solve with my
> sched_wait_block patch. It was broken in all ways, but the ultimate
> goal was to have concurrency managed workqueues (to nick the term)
> in userspace and have a way out when I/O hits the workqueue.

Don't we have the problem of wakeup concurrency here?

Forking on blocking is only half the problem (and imho the easy half).


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  8:37                           ` Ingo Molnar
@ 2009-12-23  8:49                             ` Tejun Heo
  2009-12-23  8:49                               ` Ingo Molnar
  2009-12-23 13:40                             ` Stefan Richter
  1 sibling, 1 reply; 104+ messages in thread
From: Tejun Heo @ 2009-12-23  8:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, awalls, linux-kernel, jeff, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi

Hello,

On 12/23/2009 05:37 PM, Ingo Molnar wrote:
>> Sure, fair enough but there's also a different side.  It'll allow much 
>> easier implementation of things like in-kernel media presence polling (I 
>> have some code for this but it's still just forming) and per-device.  It 
>> gives a much easier tool to extract concurrency and thus opens up new 
>> possibilities.
>>
>> So, anyways, alright, I'll go try some conversions.
> 
> Well, but note that you are again talking performance. Concurrency
> _IS_ performance: either in terms of reduced IO/app/request latency
> or in terms of CPU utilization.

I wasn't talking about performance above.  The ease and flexibility of
extracting concurrency opens up possibilities for new things or easier
ways of doing things.  It affects the design process.  You don't have
to jump through hoops for concurrency management, and removing that
restriction results in less convolution and simplifies design.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  8:49                             ` Tejun Heo
@ 2009-12-23  8:49                               ` Ingo Molnar
  2009-12-23  9:03                                 ` Tejun Heo
  0 siblings, 1 reply; 104+ messages in thread
From: Ingo Molnar @ 2009-12-23  8:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Peter Zijlstra, awalls, linux-kernel, jeff, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi


* Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On 12/23/2009 05:37 PM, Ingo Molnar wrote:
> >> Sure, fair enough but there's also a different side.  It'll allow much 
> >> easier implementation of things like in-kernel media presence polling (I 
> >> have some code for this but it's still just forming) and per-device.  It 
> >> gives a much easier tool to extract concurrency and thus opens up new 
> >> possibilities.
> >>
> >> So, anyways, alright, I'll go try some conversions.
> > 
> > Well, but note that you are again talking performance. Concurrency
> > _IS_ performance: either in terms of reduced IO/app/request latency
> > or in terms of CPU utilization.
> 
> I wasn't talking about performance above.  The ease and flexibility of 
> extracting concurrency opens up possibilities for new things or easier ways of 
> doing things.  It affects the design process.  You don't have to jump 
> through hoops for concurrency management, and removing that restriction 
> results in less convolution and simplifies design.

Which is why i said this in the next paragraph:

> > ( Plus reduction in driver complexity can be measured as well, in the 
> >   diffstat space.)

A new facility that is so mysterious that it cannot be shown to have any 
performance/scalability/latency benefit _nor_ can it be shown to reduce driver 
complexity simply does not exist IMO.

A tangible benefit has to show up _somewhere_ - if not in the performance space 
then in the diffstat space (and vice versa) - that's all what i'm arguing.

	Ingo

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  8:43                               ` Peter Zijlstra
@ 2009-12-23  9:01                                 ` Stijn Devriendt
  0 siblings, 0 replies; 104+ messages in thread
From: Stijn Devriendt @ 2009-12-23  9:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Linus Torvalds, Tejun Heo, Arjan van de Ven,
	Jens Axboe, awalls, linux-kernel, jeff, mi

On Wed, Dec 23, 2009 at 9:43 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2009-12-23 at 09:17 +0100, Stijn Devriendt wrote:
>> On Tue, Dec 22, 2009 at 7:07 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> > One reason I liked a more dynamic frame work for this is that it
>> > has the potential to be exposed to user space and allow automatic
>> > work partitioning there based on available cores.  User space
>> > has a lot more CPU consumption than the kernel.
>> >
>> Basically, this is exactly what I was trying to solve with my
>> sched_wait_block patch. It was broken in all ways, but the ultimate
>> goal was to have concurrency managed workqueues (to nick the term)
>> in userspace and have a way out when I/O hits the workqueue.
>
> Don't we have the problem of wakeup concurrency here?
>
> Forking on blocking is only half the problem (and imho the easy half).
>
>

The original design was to always have 1 spare thread handy that would
wait until the worker thread blocked. At that point it would wake up and
continue trying to keep the CPU busy.
The current perf-event approach is to have threads poll based upon the
concurrency as measured in the kernel by the perf event. When too many
threads are on the runqueue, the poll() blocks. When threads go to sleep/block,
another thread falls out of poll() to continue work.

Stijn

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  8:49                               ` Ingo Molnar
@ 2009-12-23  9:03                                 ` Tejun Heo
  0 siblings, 0 replies; 104+ messages in thread
From: Tejun Heo @ 2009-12-23  9:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, awalls, linux-kernel, jeff, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, avi, johannes, andi

Hello,

On 12/23/2009 05:49 PM, Ingo Molnar wrote:
>> I wasn't talking about performance above.  The ease and flexibility of 
>> extracting concurrency opens up possibilities for new things or easier ways of 
>> doing things.  It affects the design process.  You don't have to jump 
>> through hoops for concurrency management, and removing that restriction 
>> results in less convolution and simplifies design.
> 
> Which is why i said this in the next paragraph:
> 
>>> ( Plus reduction in driver complexity can be measured as well, in the 
>>>   diffstat space.)
> 
> A new facility that is so mysterious that it cannot be shown to have
> any performance/scalability/latency benefit _nor_ can it be shown to
> reduce driver complexity simply does not exist IMO.

Sure, I'm not arguing against that at all.  I completely agree with
you and I'm gonna do that.  I was trying to point out that it's gonna
allow things to be designed in new ways which didn't make much sense
before, because implementing full-blown concurrency management would be
too costly just for that thing.  And by definition, those things are
not in the current kernel because they didn't make sense before.  For
me, the first thing which will make use of that would be in-kernel
media presence polling, so it's not all that mysterious.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  8:41                       ` Peter Zijlstra
@ 2009-12-23 10:25                         ` Jeff Garzik
  2009-12-23 13:33                           ` Stefan Richter
  2009-12-23 14:20                           ` Mark Brown
  0 siblings, 2 replies; 104+ messages in thread
From: Jeff Garzik @ 2009-12-23 10:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Linus Torvalds, Tejun Heo, awalls, linux-kernel,
	akpm, jens.axboe, rusty, cl, dhowells, arjan, avi, johannes,
	andi

On 12/23/2009 03:41 AM, Peter Zijlstra wrote:
> On Wed, 2009-12-23 at 01:13 -0500, Jeff Garzik wrote:
>> On 12/23/2009 01:02 AM, Ingo Molnar wrote:
>>> One key thing i havent seen in this discussion are actual measurements. I
>>> think a lot could be decided by simply testing this patch-set, by looking at
>>> the hard numbers: how much faster (or slower) did a particular key workload
>>> get before/after these patches.
>>
>> We are dealing with situations where drivers are using workqueues to
>> provide a sleep-able context, and trying to solve problems related to that.
>
> So why are threaded interrupts not considered? Isn't the typical atomic
> context of drivers the IRQ handler?


I don't see a whole lot of driver authors rushing to support threaded 
interrupts.  It is questionable whether the myriad crazy IDE interrupt 
routing schemes are even compatible.  Thomas's Mar 23 2009 email says 
"the primary handler must disable the interrupt at the device level". 
That is not an easy request for all the hardware libata must support.
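
For reference, the threaded-interrupt shape being referred to, sketched with
hypothetical my_* device helpers (request_threaded_irq() and IRQ_WAKE_THREAD
are the existing genirq API):

#include <linux/interrupt.h>

/* hard handler: runs in atomic context, only quiesces the device */
static irqreturn_t my_hard_handler(int irq, void *dev_id)
{
        struct my_ctrl *ctrl = dev_id;

        my_mask_device_irq(ctrl);       /* "disable the interrupt at the device level" */
        return IRQ_WAKE_THREAD;
}

/* thread handler: runs in process context, may sleep, does the real work */
static irqreturn_t my_thread_handler(int irq, void *dev_id)
{
        struct my_ctrl *ctrl = dev_id;

        my_handle_events(ctrl);
        my_unmask_device_irq(ctrl);
        return IRQ_HANDLED;
}

/* in probe:
 *      err = request_threaded_irq(irq, my_hard_handler, my_thread_handler,
 *                                 0, "mydev", ctrl);
 */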

But the most obvious reason is also the most compelling:  Tejun's work 
maps precisely to libata's needs.  And his work would seem to mesh well 
with other drivers in similar situations.

	Jeff



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  6:02                   ` Ingo Molnar
  2009-12-23  6:13                     ` Jeff Garzik
  2009-12-23  7:09                     ` Tejun Heo
@ 2009-12-23 13:00                     ` Stefan Richter
  2 siblings, 0 replies; 104+ messages in thread
From: Stefan Richter @ 2009-12-23 13:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Tejun Heo, Peter Zijlstra, awalls, linux-kernel,
	jeff, akpm, jens.axboe, rusty, cl, dhowells, arjan, avi,
	johannes, andi

Ingo Molnar wrote:
> Likewise, if there's a reduction in complexity, that is a tangible metric as 
> well: lets do a few conversions as part of the patch-set and see how much 
> simpler things have become as a result of it.

There are for example about 160 users of  create_singlethread_workqueue
(about 140 of them in drivers/.).  As has been mentioned by others, many if
not most of them would be better served by either the existing slow-work
API or by the proposed worker thread pool.  Conversion to the former
takes a bit more effort than the latter (not much, but it matters).
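
The pattern in question, sketched with invented names (the workqueue calls
themselves are the existing API): a driver creating its own single-thread
workqueue purely to get a sleepable, serialized context.

#include <linux/workqueue.h>

struct my_dev {
        struct workqueue_struct *wq;
        struct work_struct      reset_work;
};

static void my_reset_work(struct work_struct *work)
{
        struct my_dev *dev = container_of(work, struct my_dev, reset_work);

        my_slow_reset(dev);             /* hypothetical sleeping bus I/O */
}

static int my_init(struct my_dev *dev)
{
        dev->wq = create_singlethread_workqueue("mydev");
        if (!dev->wq)
                return -ENOMEM;
        INIT_WORK(&dev->reset_work, my_reset_work);
        return 0;
}

/* later, possibly from atomic context: queue_work(dev->wq, &dev->reset_work); */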

The little driver subsystem which I maintain extensively uses the shared
workqueue and also one single-thread workqueue.  The facts that the
shared queue is used for several purposes and that a single thread is
used for some are both just compromises which I would rather like to
get rid of.  I
should have converted the create_singlethread_workqueue usage to David
Howells' slow-work infrastructure immediately when that was merged, but
I didn't do so yet because there is too much else on my to-do list for
that particular to-do item...

> We really are not forced to the space of Gedankenexperiments here.

...to actually leave Gedankenexperiment stage as quickly as I would
like.  Tejun's worker pool would make things easier for me.
-- 
Stefan Richter
-=====-==--= ==-- =-===
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23 10:25                         ` Jeff Garzik
@ 2009-12-23 13:33                           ` Stefan Richter
  2009-12-23 14:20                           ` Mark Brown
  1 sibling, 0 replies; 104+ messages in thread
From: Stefan Richter @ 2009-12-23 13:33 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Tejun Heo, awalls,
	linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan, avi,
	johannes, andi

Jeff Garzik wrote:
> On 12/23/2009 03:41 AM, Peter Zijlstra wrote:
>> On Wed, 2009-12-23 at 01:13 -0500, Jeff Garzik wrote:
>>> We are dealing with situations where drivers are using workqueues to
>>> provide a sleep-able context, and trying to solve problems related to
>>> that.
>>
>> So why are threaded interrupts not considered? Isn't the typical atomic
>> context of drivers the IRQ handler?
> 
> 
> I don't see a whole lot of driver authors rushing to support threaded
> interrupts.  It is questionable whether the myriad crazy IDE interrupt
> routing schemes are even compatible.  Thomas's Mar 23 2009 email says
> "the primary handler must disable the interrupt at the device level"
> That is not an easy request for all the hardware libata must support.
> 
> But the most obvious reason is also the most compelling:  Tejun's work
> maps precisely to libata's needs.  And his work would seem to mesh well
> with other drivers in similar situations.

Exactly; threaded interrupts are not a solution for much of what
workqueues, slow-work, async... are used for and cmwq would be (more)
useful for.

In case of FireWire for example, each FireWire bus ( = FireWire link
layer controller) is associated with a single interrupt handler, yet
contains several DMA units for very different purposes (input vs.
output, isochronous vs. asynchronous); but more importantly, one link
layer controller is merely the gateway to 0...n FireWire peer nodes
(cameras, storage devices etc.).  Everything that happens at a
somewhat higher level needs concurrency per FireWire peer, not per
FireWire bus.
-- 
Stefan Richter
-=====-==--= ==-- =-===
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23  8:37                           ` Ingo Molnar
  2009-12-23  8:49                             ` Tejun Heo
@ 2009-12-23 13:40                             ` Stefan Richter
  2009-12-23 13:43                               ` Stefan Richter
  1 sibling, 1 reply; 104+ messages in thread
From: Stefan Richter @ 2009-12-23 13:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, Linus Torvalds, Peter Zijlstra, awalls, linux-kernel,
	jeff, akpm, jens.axboe, rusty, cl, dhowells, arjan, avi,
	johannes, andi

Ingo Molnar wrote:
> (Plus reduction in driver complexity can be measured as well, in the diffstat 
> space.)
> 
> So there's no leap of faith needed really, IMHO.

To measure this properly by means of diffstat, first convert typical
create_singlethread_workqueue users to <linux/slow-work.h>, then convert
them back to <linux/workqueue.h>.  Then look at the diffstat of the
_second_ step.
-- 
Stefan Richter
-=====-==--= ==-- =-===
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23 13:40                             ` Stefan Richter
@ 2009-12-23 13:43                               ` Stefan Richter
  0 siblings, 0 replies; 104+ messages in thread
From: Stefan Richter @ 2009-12-23 13:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, Linus Torvalds, Peter Zijlstra, awalls, linux-kernel,
	jeff, akpm, jens.axboe, rusty, cl, dhowells, arjan, avi,
	johannes, andi

Stefan Richter wrote:
> Ingo Molnar wrote:
>> (Plus reduction in driver complexity can be measured as well, in the diffstat 
>> space.)
>>
>> So there's no leap of faith needed really, IMHO.
> 
> To measure this properly by means of diffstat, first convert typical
> create_singlethread_workqueue users to <linux/slow-work.h>, then convert
> them back to <linux/workqueue.h>.  Then look at the diffstat of the
> _second_ step.

PS:  Or easier, just do the first step and take its diffstat --- the
reverse of that diffstat is more or less the benefit that cmwq would
provide.
-- 
Stefan Richter
-=====-==--= ==-- =-===
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: workqueue thing
  2009-12-23 10:25                         ` Jeff Garzik
  2009-12-23 13:33                           ` Stefan Richter
@ 2009-12-23 14:20                           ` Mark Brown
  1 sibling, 0 replies; 104+ messages in thread
From: Mark Brown @ 2009-12-23 14:20 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Peter Zijlstra, Ingo Molnar, Linus Torvalds, Tejun Heo, awalls,
	linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan, avi,
	johannes, andi

On Wed, Dec 23, 2009 at 05:25:44AM -0500, Jeff Garzik wrote:
> On 12/23/2009 03:41 AM, Peter Zijlstra wrote:

>> So why are threaded interrupts not considered? Isn't the typical atomic
>> context of drivers the IRQ handler?

> I don't see a whole lot of driver authors rushing to support threaded  
> interrupts.  It is questionable whether the myriad crazy IDE interrupt  

Threaded interrupts are fairly new and there's not an enormous win from
converting existing drivers at present; it's more likely that new code
will use them at the minute.

> routing schemes are even compatible.  Thomas's Mar 23 2009 email says  
> "the primary handler must disable the interrupt at the device level"  
> That is not an easy request for all the hardware libata must support.

That requirement is no longer there since it was making threaded
interrupts useless for devices controlled over interrupt driven buses,
which are one of the major use cases for threaded IRQs.

None of which should be taken as a comment on the proposal under
discusssion.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 0/2] Parallel crypto/IPsec v7
  2009-12-23  8:01                                 ` [PATCH 0/2] Parallel crypto/IPsec v7 Steffen Klassert
  2009-12-23  8:03                                   ` [PATCH 1/2] padata: generic parallelization/serialization interface Steffen Klassert
  2009-12-23  8:04                                   ` [PATCH 2/2] crypto: pcrypt - Add pcrypt crypto parallelization wrapper Steffen Klassert
@ 2010-01-07  5:39                                   ` Herbert Xu
  2010-01-16  9:44                                     ` David Miller
  2 siblings, 1 reply; 104+ messages in thread
From: Herbert Xu @ 2010-01-07  5:39 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Tejun Heo, Peter Zijlstra, Linus Torvalds, Arjan van de Ven,
	Jens Axboe, Andi Kleen, awalls, linux-kernel, jeff, mingo, akpm,
	rusty, cl, dhowells, avi, johannes, David S. Miller

On Wed, Dec 23, 2009 at 09:01:52AM +0100, Steffen Klassert wrote:
> This patchset adds the 'pcrypt' parallel crypto template. With this template it
> is possible to process the crypto requests of a transform in parallel without
> getting request reordering. This is particularly interesting for IPsec.

All applied.  Thanks Steffen!
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 0/2] Parallel crypto/IPsec v7
  2010-01-07  5:39                                   ` [PATCH 0/2] Parallel crypto/IPsec v7 Herbert Xu
@ 2010-01-16  9:44                                     ` David Miller
  0 siblings, 0 replies; 104+ messages in thread
From: David Miller @ 2010-01-16  9:44 UTC (permalink / raw)
  To: herbert
  Cc: steffen.klassert, tj, peterz, torvalds, arjan, jens.axboe, andi,
	awalls, linux-kernel, jeff, mingo, akpm, rusty, cl, dhowells,
	avi, johannes

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 7 Jan 2010 16:39:03 +1100

> On Wed, Dec 23, 2009 at 09:01:52AM +0100, Steffen Klassert wrote:
>> This patchset adds the 'pcrypt' parallel crypto template. With this template it
>> is possible to process the crypto requests of a transform in parallel without
>> getting request reordering. This is particularly interesting for IPsec.
> 
> All applied.  Thanks Steffen!

Steffen, thanks so much for sticking to this all the way to
the end.


^ permalink raw reply	[flat|nested] 104+ messages in thread

end of thread, other threads:[~2010-01-16  9:44 UTC | newest]

Thread overview: 104+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-12-18 12:57 Tejun Heo
2009-12-18 12:57 ` [PATCH 01/27] sched: rename preempt_notifiers to sched_notifiers and refactor implementation Tejun Heo
2009-12-18 12:57 ` [PATCH 02/27] sched: refactor try_to_wake_up() Tejun Heo
2009-12-18 12:57 ` [PATCH 03/27] sched: implement __set_cpus_allowed() Tejun Heo
2009-12-18 12:57 ` [PATCH 04/27] sched: make sched_notifiers unconditional Tejun Heo
2009-12-18 12:57 ` [PATCH 05/27] sched: add wakeup/sleep sched_notifiers and allow NULL notifier ops Tejun Heo
2009-12-18 12:57 ` [PATCH 06/27] sched: implement try_to_wake_up_local() Tejun Heo
2009-12-18 12:57 ` [PATCH 07/27] acpi: use queue_work_on() instead of binding workqueue worker to cpu0 Tejun Heo
2009-12-18 12:57 ` [PATCH 08/27] stop_machine: reimplement without using workqueue Tejun Heo
2009-12-18 12:57 ` [PATCH 09/27] workqueue: misc/cosmetic updates Tejun Heo
2009-12-18 12:57 ` [PATCH 10/27] workqueue: merge feature parameters into flags Tejun Heo
2009-12-18 12:57 ` [PATCH 11/27] workqueue: define both bit position and mask for work flags Tejun Heo
2009-12-18 12:57 ` [PATCH 12/27] workqueue: separate out process_one_work() Tejun Heo
2009-12-18 12:57 ` [PATCH 13/27] workqueue: temporarily disable workqueue tracing Tejun Heo
2009-12-18 12:57 ` [PATCH 14/27] workqueue: kill cpu_populated_map Tejun Heo
2009-12-18 12:57 ` [PATCH 15/27] workqueue: update cwq alignement Tejun Heo
2009-12-18 12:57 ` [PATCH 16/27] workqueue: reimplement workqueue flushing using color coded works Tejun Heo
2009-12-18 12:57 ` [PATCH 17/27] workqueue: introduce worker Tejun Heo
2009-12-18 12:57 ` [PATCH 18/27] workqueue: reimplement work flushing using linked works Tejun Heo
2009-12-18 12:58 ` [PATCH 19/27] workqueue: implement per-cwq active work limit Tejun Heo
2009-12-18 12:58 ` [PATCH 20/27] workqueue: reimplement workqueue freeze using max_active Tejun Heo
2009-12-18 12:58 ` [PATCH 21/27] workqueue: introduce global cwq and unify cwq locks Tejun Heo
2009-12-18 12:58 ` [PATCH 22/27] workqueue: implement worker states Tejun Heo
2009-12-18 12:58 ` [PATCH 23/27] workqueue: reimplement CPU hotplugging support using trustee Tejun Heo
2009-12-18 12:58 ` [PATCH 24/27] workqueue: make single thread workqueue shared worker pool friendly Tejun Heo
2009-12-18 12:58 ` [PATCH 25/27] workqueue: use shared worklist and pool all workers per cpu Tejun Heo
2009-12-18 12:58 ` [PATCH 26/27] workqueue: implement concurrency managed dynamic worker pool Tejun Heo
2009-12-18 12:58 ` [PATCH 27/27] workqueue: increase max_active of keventd and kill current_is_keventd() Tejun Heo
2009-12-18 13:00 ` SUBJ: [RFC PATCHSET] concurrency managed workqueue, take#2 Tejun Heo
2009-12-18 13:03 ` Tejun Heo
2009-12-18 13:45 ` workqueue thing Peter Zijlstra
2009-12-18 13:50   ` Andi Kleen
2009-12-18 15:01     ` Arjan van de Ven
2009-12-21  3:19       ` Tejun Heo
2009-12-21  9:17       ` Jens Axboe
2009-12-21 10:35         ` Peter Zijlstra
2009-12-21 11:09         ` Andi Kleen
2009-12-21 11:17           ` Arjan van de Ven
2009-12-21 11:33             ` Andi Kleen
2009-12-21 13:18             ` Tejun Heo
2009-12-21 11:11         ` Arjan van de Ven
2009-12-21 13:22           ` Tejun Heo
2009-12-21 13:53             ` Arjan van de Ven
2009-12-21 14:19               ` Tejun Heo
2009-12-21 15:19                 ` Arjan van de Ven
2009-12-22  0:00                   ` Tejun Heo
2009-12-22 11:10                     ` Peter Zijlstra
2009-12-22 17:20                       ` Linus Torvalds
2009-12-22 17:47                         ` Peter Zijlstra
2009-12-22 18:07                           ` Andi Kleen
2009-12-22 18:20                             ` Peter Zijlstra
2009-12-23  8:17                             ` Stijn Devriendt
2009-12-23  8:43                               ` Peter Zijlstra
2009-12-23  9:01                                 ` Stijn Devriendt
2009-12-22 18:28                           ` Linus Torvalds
2009-12-23  8:06                             ` Johannes Berg
2009-12-23  3:37                           ` Tejun Heo
2009-12-23  6:52                             ` Herbert Xu
2009-12-23  8:00                               ` Steffen Klassert
2009-12-23  8:01                                 ` [PATCH 0/2] Parallel crypto/IPsec v7 Steffen Klassert
2009-12-23  8:03                                   ` [PATCH 1/2] padata: generic parallelization/serialization interface Steffen Klassert
2009-12-23  8:04                                   ` [PATCH 2/2] crypto: pcrypt - Add pcrypt crypto parallelization wrapper Steffen Klassert
2010-01-07  5:39                                   ` [PATCH 0/2] Parallel crypto/IPsec v7 Herbert Xu
2010-01-16  9:44                                     ` David Miller
2009-12-18 15:30   ` workqueue thing Linus Torvalds
2009-12-18 15:39     ` Ingo Molnar
2009-12-18 15:39     ` Peter Zijlstra
2009-12-18 15:47       ` Linus Torvalds
2009-12-18 15:53         ` Peter Zijlstra
2009-12-21  3:04   ` Tejun Heo
2009-12-21  9:22     ` Peter Zijlstra
2009-12-21 13:30       ` Tejun Heo
2009-12-21 14:26         ` Peter Zijlstra
2009-12-21 23:50           ` Tejun Heo
2009-12-22 11:00             ` Peter Zijlstra
2009-12-22 11:03             ` Peter Zijlstra
2009-12-23  3:43               ` Tejun Heo
2009-12-22 11:04             ` Peter Zijlstra
2009-12-23  3:48               ` Tejun Heo
2009-12-22 11:06             ` Peter Zijlstra
2009-12-23  4:18               ` Tejun Heo
2009-12-23  4:42                 ` Linus Torvalds
2009-12-23  6:02                   ` Ingo Molnar
2009-12-23  6:13                     ` Jeff Garzik
2009-12-23  7:53                       ` Ingo Molnar
2009-12-23  8:41                       ` Peter Zijlstra
2009-12-23 10:25                         ` Jeff Garzik
2009-12-23 13:33                           ` Stefan Richter
2009-12-23 14:20                           ` Mark Brown
2009-12-23  7:09                     ` Tejun Heo
2009-12-23  8:01                       ` Ingo Molnar
2009-12-23  8:12                         ` Ingo Molnar
2009-12-23  8:32                           ` Tejun Heo
2009-12-23  8:42                             ` Ingo Molnar
2009-12-23  8:27                         ` Tejun Heo
2009-12-23  8:37                           ` Ingo Molnar
2009-12-23  8:49                             ` Tejun Heo
2009-12-23  8:49                               ` Ingo Molnar
2009-12-23  9:03                                 ` Tejun Heo
2009-12-23 13:40                             ` Stefan Richter
2009-12-23 13:43                               ` Stefan Richter
2009-12-23  8:25                       ` Arjan van de Ven
2009-12-23 13:00                     ` Stefan Richter
2009-12-23  8:31             ` Stijn Devriendt
