* [RFC PATCHSET] workqueue: implement concurrency managed workqueue
@ 2009-10-01  8:08 Tejun Heo
  2009-10-01  8:09 ` [PATCH 01/19] freezer: don't get over-anxious while waiting Tejun Heo
                   ` (26 more replies)
  0 siblings, 27 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:08 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan

Hello, all.

This rather large patchset implements concurrency managed workqueue.
It's not complete yet: singlethread workqueue handling needs more
work, workqueue users need to be audited and simplified, and async
and slow-work should be reimplemented in terms of workqueue.
Although this patchset currently adds ~2000 lines of code, I'm fairly
optimistic that after the whole conversion is done it will be a net
decrease in lines of code.

This patchset reimplements workqueue such that it auto-regulates
concurrency and thus relieves its users of that duty.  It works by
maintaining a single shared pool of per-cpu workers and hooking into
the scheduler to get notified when workers go to sleep and wake up.
Using these notifications, the workqueue implementation keeps track
of the current level of concurrency and schedules only as many
workers as are necessary to keep the cpu occupied.
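
To illustrate the idea (this is only a sketch for this description,
not code from the patchset; gcwq, nr_running and first_idle_worker()
are made-up names), the bookkeeping maintained from the scheduler
hooks boils down to something like the following.

  /* illustration only, not from the patchset */
  struct gcwq {
          atomic_t                nr_running;     /* workers using the cpu */
          struct list_head        worklist;       /* pending works */
          struct list_head        idle_list;      /* idle workers */
  };

  /* scheduler hook: a worker on this cpu is about to sleep */
  static void worker_sleeping(struct gcwq *gcwq)
  {
          /*
           * The last runnable worker is going to sleep.  If works
           * are still pending, wake up an idle worker so that the
           * cpu doesn't sit idle while there's work to do.  The real
           * hook runs under the rq lock and thus needs a special
           * wakeup (see sched_workqueue_wake_up_process() in 0003).
           */
          if (atomic_dec_and_test(&gcwq->nr_running) &&
              !list_empty(&gcwq->worklist))
                  wake_up_process(first_idle_worker(gcwq));
  }

  /* scheduler hook: a worker on this cpu is waking up */
  static void worker_waking_up(struct gcwq *gcwq)
  {
          atomic_inc(&gcwq->nr_running);
  }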

Concurrency managed workqueue has the following benefits.

* Workqueue users no longer have to worry about managing concurrency
  and, in most cases, deadlocks.  Concurrency is managed
  automatically, and a deadlock can occur only if the dependency
  chain involves many (currently 127) works.

* There's a single shared pool of workers per cpu and one rescuer
  for each workqueue which requires it, so there are far fewer
  kthreads.

* More efficient.  Although the patchset adds a considerable amount
  of code, little of it is in the hot path.  Works are executed on
  the local cpu and in batches as much as possible using a minimal
  number of kthreads, leading to fewer task switches and a lower
  cache footprint. <NEED SOME BACKING NUMBERS>

* As concurrency is no longer a problem, most types of asynchronous
  jobs can be done using the generic workqueue, and other async
  mechanisms, including slow-work, async and ad-hoc subsystem-specific
  ones, can be removed.  In other words, it can serve as the unified
  async thread pool mechanism.

Please read the patch description of the last patch for more details.

This patchset contains the following 19 patches, most of which are
not signed off yet.

  0001-freezer-don-t-get-over-anxious-while-waiting.patch
  0002-scheduler-implement-sched_class_equal.patch
  0003-scheduler-implement-workqueue-scheduler-class.patch
  0004-scheduler-implement-force_cpus_allowed_ptr.patch
  0005-kthread-implement-kthread_data.patch
  0006-acpi-use-queue_work_on-instead-of-binding-workqueue-.patch
  0007-stop_machine-reimplement-without-using-workqueue.patch
  0008-workqueue-misc-cosmetic-updates.patch
  0009-workqueue-merge-feature-parametesr-into-flags.patch
  0010-workqueue-update-cwq-alignement-and-make-one-more-fl.patch
  0011-workqueue-define-both-bit-position-and-mask-for-work.patch
  0012-workqueue-separate-out-process_one_work.patch
  0013-workqueue-temporarily-disable-workqueue-tracing.patch
  0014-workqueue-TEMPORARY-kill-singlethread-variant.patch
  0015-workqueue-reimplement-workqueue-flushing-using-color.patch
  0016-workqueue-introduce-worker.patch
  0017-workqueue-reimplement-work-flushing-using-linked-wor.patch
  0018-workqueue-reimplement-workqueue-freeze-using-cwq-fro.patch
  0019-workqueue-implement-concurrency-managed-workqueue.patch

0001 makes freezer less anxious.  It's a mostly unrelated change I did
while looking at the code.

0002-0004 implement scheduler callbacks for workqueue.

0005 implements kthread_data() which will be used to reach the
workqueue worker data structure from the scheduler callbacks.

0006-0007 kill exotic usages of workqueue.

0008-0010 do miscellaneous preparations in the workqueue code.

0011-0014 strip out some features from workqueue to prepare for the
new implementation.  Please note that 0014 is incorrect and I'm
currently working on a proper solution.

0015-0018 reimplement flushing and freezing in a way which is friendly
to multiple workers per cpu.

0019 implements concurrency managed workqueue.

Please note that none of the workqueue users has been updated yet.
Everything will work fine, but most workqueues which don't need a
rescuer will still create one.

I'm attaching the test-wq source and Makefile which I used to verify
each aspect of the new workqueue.  It's pretty easy to write new test
scenarios using the module, so it is quite useful if you're
interested in how this actually works.
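
For a rough idea of the kind of scenario the module makes easy to
express, a stand-alone equivalent could look like the sketch below
(hypothetical, not the attached test-wq code): queue a burst of works
which sleep, so that concurrency management has to kick in and put
additional workers to work.

  /* sketch of a "burst of sleeping works" scenario, not test-wq */
  #include <linux/module.h>
  #include <linux/init.h>
  #include <linux/workqueue.h>
  #include <linux/delay.h>
  #include <linux/smp.h>

  #define NR_TEST_WORKS   16

  static struct work_struct test_works[NR_TEST_WORKS];

  static void test_work_fn(struct work_struct *work)
  {
          /* block inside the work so that more workers are needed */
          msleep(100);
          printk(KERN_INFO "test-wq: work %d done on cpu %d\n",
                 (int)(work - test_works), raw_smp_processor_id());
  }

  static int __init test_wq_init(void)
  {
          int i;

          for (i = 0; i < NR_TEST_WORKS; i++) {
                  INIT_WORK(&test_works[i], test_work_fn);
                  /* all on cpu0 so they contend for one worker pool */
                  schedule_work_on(0, &test_works[i]);
          }
          return 0;
  }

  static void __exit test_wq_exit(void)
  {
          int i;

          for (i = 0; i < NR_TEST_WORKS; i++)
                  flush_work(&test_works[i]);
  }

  module_init(test_wq_init);
  module_exit(test_wq_exit);
  MODULE_LICENSE("GPL");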

This patchset is on top of the current linus#master
(817b33d38f81c8736d39283c35c886ae4668f1af) and available in the
following git tree.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git concurrency-managed-workqueue

and contains the following changes.

 drivers/acpi/osl.c           |   41 
 include/linux/kthread.h      |    1 
 include/linux/sched.h        |    9 
 include/linux/stop_machine.h |    6 
 include/linux/workqueue.h    |   63 -
 init/main.c                  |    2 
 kernel/kthread.c             |   15 
 kernel/power/process.c       |   34 
 kernel/sched.c               |  105 +
 kernel/sched_fair.c          |   62 -
 kernel/sched_idletask.c      |    1 
 kernel/sched_rt.c            |    1 
 kernel/sched_workqueue.c     |  130 ++
 kernel/sched_workqueue.h     |   17 
 kernel/stop_machine.c        |  151 ++
 kernel/trace/Kconfig         |    4 
 kernel/workqueue.c           | 2268 ++++++++++++++++++++++++++++++++++++-------
 17 files changed, 2417 insertions(+), 493 deletions(-)

So, what do you guys think?

PS. I'm still working on it and will probably write up the overall
design under Documentation/.

Thanks.

--
tejun


* [PATCH 01/19] freezer: don't get over-anxious while waiting
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01 18:36   ` Pavel Machek
  2009-10-01  8:09 ` [PATCH 02/19] scheduler: implement sched_class_equal() Tejun Heo
                   ` (25 subsequent siblings)
  26 siblings, 1 reply; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

Freezing isn't exactly the most latency-sensitive operation and
there's no reason to burn cpu cycles and power waiting for it to
complete.  Use msleep(10) instead of yield().  This should improve
the reliability of emergency hibernation.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/power/process.c |   13 +++++++++----
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/power/process.c b/kernel/power/process.c
index cc2e553..9d26a0a 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -41,7 +41,7 @@ static int try_to_freeze_tasks(bool sig_only)
 	do_gettimeofday(&start);
 
 	end_time = jiffies + TIMEOUT;
-	do {
+	while (true) {
 		todo = 0;
 		read_lock(&tasklist_lock);
 		do_each_thread(g, p) {
@@ -62,10 +62,15 @@ static int try_to_freeze_tasks(bool sig_only)
 				todo++;
 		} while_each_thread(g, p);
 		read_unlock(&tasklist_lock);
-		yield();			/* Yield is okay here */
-		if (time_after(jiffies, end_time))
+		if (!todo || time_after(jiffies, end_time))
 			break;
-	} while (todo);
+
+		/*
+		 * We need to retry.  There's no reason to be
+		 * over-anxious about it and waste power.
+		 */
+		msleep(10);
+	}
 
 	do_gettimeofday(&end);
 	elapsed_csecs64 = timeval_to_ns(&end) - timeval_to_ns(&start);
-- 
1.6.4.2



* [PATCH 02/19] scheduler: implement sched_class_equal()
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
  2009-10-01  8:09 ` [PATCH 01/19] freezer: don't get over-anxious while waiting Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01  8:09 ` [PATCH 03/19] scheduler: implement workqueue scheduler class Tejun Heo
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

Add ->identity to sched_class and implement and use
sched_class_equal(), which compares the field to test for equality.
This allows sub-classing a scheduler class so that parts of it can be
overridden while maintaining most of the original behavior.  Please
note that __setscheduler() now switches sched_class only if the new
sched_class's identity differs from the current one.

NOT_SIGNED_OFF_YET
---
 include/linux/sched.h   |    1 +
 kernel/sched.c          |   16 +++++++++++-----
 kernel/sched_fair.c     |    5 +++--
 kernel/sched_idletask.c |    1 +
 kernel/sched_rt.c       |    1 +
 5 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75e6e60..02f505d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1065,6 +1065,7 @@ struct sched_domain;
 
 struct sched_class {
 	const struct sched_class *next;
+	const struct sched_class *identity;
 
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
diff --git a/kernel/sched.c b/kernel/sched.c
index ee61f45..66d918a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1813,6 +1813,8 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
 
 static void calc_load_account_active(struct rq *this_rq);
 
+#define sched_class_equal(a, b)		((a)->identity == (b)->identity)
+
 #include "sched_stats.h"
 #include "sched_idletask.c"
 #include "sched_fair.c"
@@ -1987,7 +1989,7 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,
 				       const struct sched_class *prev_class,
 				       int oldprio, int running)
 {
-	if (prev_class != p->sched_class) {
+	if (!sched_class_equal(prev_class, p->sched_class)) {
 		if (prev_class->switched_from)
 			prev_class->switched_from(rq, p, running);
 		p->sched_class->switched_to(rq, p, running);
@@ -2012,7 +2014,7 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 			 &p->se == cfs_rq_of(&p->se)->last))
 		return 1;
 
-	if (p->sched_class != &fair_sched_class)
+	if (!sched_class_equal(p->sched_class, &fair_sched_class))
 		return 0;
 
 	if (sysctl_sched_migration_cost == -1)
@@ -6139,6 +6141,8 @@ static struct task_struct *find_process_by_pid(pid_t pid)
 static void
 __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 {
+	const struct sched_class *new_class = NULL;
+
 	BUG_ON(p->se.on_rq);
 
 	p->policy = policy;
@@ -6146,13 +6150,15 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
 	case SCHED_IDLE:
-		p->sched_class = &fair_sched_class;
+		new_class = &fair_sched_class;
 		break;
 	case SCHED_FIFO:
 	case SCHED_RR:
-		p->sched_class = &rt_sched_class;
+		new_class = &rt_sched_class;
 		break;
 	}
+	if (!sched_class_equal(p->sched_class, new_class))
+		p->sched_class = new_class;
 
 	p->rt_priority = prio;
 	p->normal_prio = normal_prio(p);
@@ -10384,7 +10390,7 @@ cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		return -EINVAL;
 #else
 	/* We don't support RT-tasks being in separate groups */
-	if (tsk->sched_class != &fair_sched_class)
+	if (!sched_class_equal(tsk->sched_class, &fair_sched_class))
 		return -EINVAL;
 #endif
 	return 0;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 4e777b4..a12d1bd 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -961,7 +961,7 @@ static void hrtick_update(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 
-	if (curr->sched_class != &fair_sched_class)
+	if (!sched_class_equal(curr->sched_class, &fair_sched_class))
 		return;
 
 	if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
@@ -1576,7 +1576,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		return;
 	}
 
-	if (unlikely(p->sched_class != &fair_sched_class))
+	if (unlikely(!sched_class_equal(p->sched_class, &fair_sched_class)))
 		return;
 
 	if (unlikely(se == pse))
@@ -1962,6 +1962,7 @@ unsigned int get_rr_interval_fair(struct task_struct *task)
  * All the scheduling class methods:
  */
 static const struct sched_class fair_sched_class = {
+	.identity		= &fair_sched_class,
 	.next			= &idle_sched_class,
 	.enqueue_task		= enqueue_task_fair,
 	.dequeue_task		= dequeue_task_fair,
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index b133a28..57a0c4b 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -106,6 +106,7 @@ unsigned int get_rr_interval_idle(struct task_struct *task)
  * Simple, special scheduling class for the per-CPU idle tasks:
  */
 static const struct sched_class idle_sched_class = {
+	.identity		= &idle_sched_class,
 	/* .next is NULL */
 	/* no enqueue/yield_task for idle tasks */
 
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a4d790c..06a8106 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1746,6 +1746,7 @@ unsigned int get_rr_interval_rt(struct task_struct *task)
 }
 
 static const struct sched_class rt_sched_class = {
+	.identity		= &rt_sched_class,
 	.next			= &fair_sched_class,
 	.enqueue_task		= enqueue_task_rt,
 	.dequeue_task		= dequeue_task_rt,
-- 
1.6.4.2



* [PATCH 03/19] scheduler: implement workqueue scheduler class
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
  2009-10-01  8:09 ` [PATCH 01/19] freezer: don't get over-anxious while waiting Tejun Heo
  2009-10-01  8:09 ` [PATCH 02/19] scheduler: implement sched_class_equal() Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01 16:57   ` Linus Torvalds
  2009-10-01  8:09 ` [PATCH 04/19] scheduler: implement force_cpus_allowed_ptr() Tejun Heo
                   ` (23 subsequent siblings)
  26 siblings, 1 reply; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

Implement the workqueue scheduler class.  The workqueue sched_class
inherits from the fair sched_class and behaves exactly like it except
that it has two callback functions which get called when a task goes
to sleep and wakes up, and it doesn't allow switching to a different
scheduler class.

The workqueue sched_class can only be selected by calling
switch_sched_workqueue() while the current sched_class is fair.
workqueue is updated to select the workqueue sched_class for all
workers.  Both scheduler callbacks are noops for now; they'll be used
to implement concurrency-managed workqueue.

This patch also updates current_is_keventd() to check for the
scheduler class instead of directly matching the keventd workers, so
the function now returns true for any workqueue worker.  For the
current users, this shouldn't be a problem.

NOT_SIGNED_OFF_YET
---
 include/linux/sched.h    |    1 +
 kernel/sched.c           |    1 +
 kernel/sched_fair.c      |   59 +++++++++++++--------
 kernel/sched_workqueue.c |  130 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_workqueue.h |   17 ++++++
 kernel/workqueue.c       |   34 ++++++++----
 6 files changed, 207 insertions(+), 35 deletions(-)
 create mode 100644 kernel/sched_workqueue.c
 create mode 100644 kernel/sched_workqueue.h

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 02f505d..cbebadf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1952,6 +1952,7 @@ extern int idle_cpu(int cpu);
 extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
 extern int sched_setscheduler_nocheck(struct task_struct *, int,
 				      struct sched_param *);
+extern void sched_setscheduler_workqueue(struct task_struct *p);
 extern struct task_struct *idle_task(int cpu);
 extern struct task_struct *curr_task(int cpu);
 extern void set_curr_task(int cpu, struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index 66d918a..4e3e789 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1819,6 +1819,7 @@ static void calc_load_account_active(struct rq *this_rq);
 #include "sched_idletask.c"
 #include "sched_fair.c"
 #include "sched_rt.c"
+#include "sched_workqueue.c"
 #ifdef CONFIG_SCHED_DEBUG
 # include "sched_debug.c"
 #endif
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a12d1bd..eb116f0 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1961,37 +1961,50 @@ unsigned int get_rr_interval_fair(struct task_struct *task)
 /*
  * All the scheduling class methods:
  */
-static const struct sched_class fair_sched_class = {
-	.identity		= &fair_sched_class,
-	.next			= &idle_sched_class,
-	.enqueue_task		= enqueue_task_fair,
-	.dequeue_task		= dequeue_task_fair,
-	.yield_task		= yield_task_fair,
-
-	.check_preempt_curr	= check_preempt_wakeup,
-
-	.pick_next_task		= pick_next_task_fair,
-	.put_prev_task		= put_prev_task_fair,
+#define FAIR_SCHED_CLASS_INIT_BASE					\
+	.identity		= &fair_sched_class,			\
+	.next			= &idle_sched_class,			\
+	.enqueue_task		= enqueue_task_fair,			\
+	.dequeue_task		= dequeue_task_fair,			\
+	.yield_task		= yield_task_fair,			\
+									\
+	.check_preempt_curr	= check_preempt_wakeup,			\
+									\
+	.pick_next_task		= pick_next_task_fair,			\
+	.put_prev_task		= put_prev_task_fair,			\
+									\
+	.set_curr_task          = set_curr_task_fair,			\
+	.task_tick		= task_tick_fair,			\
+	.task_new		= task_new_fair,			\
+									\
+	.prio_changed		= prio_changed_fair,			\
+	.switched_to		= switched_to_fair,			\
+									\
+	.get_rr_interval	= get_rr_interval_fair,
 
 #ifdef CONFIG_SMP
-	.select_task_rq		= select_task_rq_fair,
-
-	.load_balance		= load_balance_fair,
+#define FAIR_SCHED_CLASS_INIT_SMP					\
+	.select_task_rq		= select_task_rq_fair,			\
+	.load_balance		= load_balance_fair,			\
 	.move_one_task		= move_one_task_fair,
+#else
+#define FAIR_SCHED_CLASS_INIT_SMP
 #endif
 
-	.set_curr_task          = set_curr_task_fair,
-	.task_tick		= task_tick_fair,
-	.task_new		= task_new_fair,
-
-	.prio_changed		= prio_changed_fair,
-	.switched_to		= switched_to_fair,
-
-	.get_rr_interval	= get_rr_interval_fair,
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
+#define FAIR_SCHED_CLASS_INIT_GROUP					\
 	.moved_group		= moved_group_fair,
+#else
+#define FAIR_SCHED_CLASS_INIT_GROUP
 #endif
+
+#define FAIR_SCHED_CLASS_INIT						\
+	FAIR_SCHED_CLASS_INIT_BASE					\
+	FAIR_SCHED_CLASS_INIT_SMP					\
+	FAIR_SCHED_CLASS_INIT_GROUP
+
+static const struct sched_class fair_sched_class = {
+	FAIR_SCHED_CLASS_INIT
 };
 
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched_workqueue.c b/kernel/sched_workqueue.c
new file mode 100644
index 0000000..d8d6cb2
--- /dev/null
+++ b/kernel/sched_workqueue.c
@@ -0,0 +1,130 @@
+/*
+ * kernel/sched_workqueue.c - workqueue scheduler class
+ *
+ * Copyright (C) 2009		SUSE Linux Products GmbH
+ * Copyright (C) 2009		Tejun Heo <tj@kernel.org>
+ *
+ * This file is released under the GPLv2.
+ *
+ * This scheduler class wraps the fair class and provide scheduling
+ * hints to workqueue to help it maintain proper level of concurrency.
+ * Other than calling workqueue hook functions and disallowing
+ * switching to other classes, this scheduler class is identical to
+ * the fair class.
+ */
+#include "sched_workqueue.h"
+
+static void enqueue_task_wq(struct rq *rq, struct task_struct *p, int wakeup)
+{
+	if (wakeup)
+		sched_workqueue_worker_wakeup(p);
+
+	return enqueue_task_fair(rq, p, wakeup);
+}
+
+static void dequeue_task_wq(struct rq *rq, struct task_struct *p, int sleep)
+{
+	if (sleep)
+		sched_workqueue_worker_sleep(p);
+
+	return dequeue_task_fair(rq, p, sleep);
+}
+
+static void switched_from_wq(struct rq *this_rq, struct task_struct *task,
+			     int running)
+{
+	BUG();	/* no can do */
+}
+
+/*
+ * If you want to add more override methods, please check sched.c
+ * _CAREFULLY_ before doing so.  There are several places where fair
+ * sched specific optimizations are made and the overrides might not
+ * work as expected.
+ */
+static const struct sched_class workqueue_sched_class = {
+	FAIR_SCHED_CLASS_INIT
+	.enqueue_task		= enqueue_task_wq,
+	.dequeue_task		= dequeue_task_wq,
+	.switched_from		= switched_from_wq,
+};
+
+/**
+ * sched_workqueue_wake_up_process - wake up a process from sched callbacks
+ * @p: task to wake up
+ *
+ * Wake up @p.  This function can only be called from workqueue
+ * scheduler callbacks and can only wake up tasks which are bound to
+ * the cpu in question.
+ *
+ * CONTEXT:
+ * workqueue scheduler callbacks.
+ *
+ * RETURNS:
+ * true if @p was waken up, false if @p was already awake.
+ */
+bool sched_workqueue_wake_up_process(struct task_struct *p)
+{
+	struct rq *rq = task_rq(p);
+	bool success = false;
+
+	assert_spin_locked(&rq->lock);
+
+	if (!p->se.on_rq) {
+		schedstat_inc(p, se.nr_wakeups);
+		schedstat_inc(p, se.nr_wakeups_local);
+		activate_task(rq, p, 1);
+		success = true;
+	}
+
+	trace_sched_wakeup(rq, p, success);
+	p->state = TASK_RUNNING;
+#ifdef CONFIG_SMP
+	if (p->sched_class->task_wake_up)
+		p->sched_class->task_wake_up(rq, p);
+#endif
+	return success;
+}
+
+/**
+ * switch_sched_workqueue - switch workqueue scheduler class
+ * @p: target task
+ * @enable: enable or disable sched_workqueue
+ *
+ * Switch @p to or from workqueue scheduler class.  @p is assumed to
+ * have either fair or one of its alias classes on entry.
+ *
+ * CONTEXT:
+ * !in_interrupt().
+ */
+void switch_sched_workqueue(struct task_struct *p, bool enable)
+{
+	struct sched_param sched_param = { .sched_priority = 0 };
+	struct rq *rq;
+	unsigned long flags;
+
+	rq = task_rq_lock(p, &flags);
+	BUG_ON(!sched_class_equal(p->sched_class, &fair_sched_class));
+	p->sched_class = enable ? &workqueue_sched_class : &fair_sched_class;
+	task_rq_unlock(rq, &flags);
+
+	BUG_ON(sched_setscheduler_nocheck(p, SCHED_NORMAL, &sched_param));
+}
+
+/**
+ * is_sched_workqueue - test whether a task is in workqueue scheduler class
+ * @p: target task
+ *
+ * Tests whether @p is in workqueue scheduler class.
+ *
+ * CONTEXT:
+ * The caller is responsible for ensuring that @p doesn't go away or
+ * change scheduler class.
+ *
+ * RETURNS:
+ * true if @p is in workerqueue scheduler class, false otherwise.
+ */
+bool is_sched_workqueue(struct task_struct *p)
+{
+	return p->sched_class == &workqueue_sched_class;
+}
diff --git a/kernel/sched_workqueue.h b/kernel/sched_workqueue.h
new file mode 100644
index 0000000..5a52a4f
--- /dev/null
+++ b/kernel/sched_workqueue.h
@@ -0,0 +1,17 @@
+/*
+ * kernel/sched_workqueue.h - workqueue scheduler class interface
+ *
+ * Copyright (C) 2009		SUSE Linux Products GmbH
+ * Copyright (C) 2009		Tejun Heo <tj@kernel.org>
+ *
+ * This file is released under the GPLv2.
+ *
+ * This is interface between sched_workqueue and workqueue.  Read
+ * comments in sched_workqueue.c and workqueue.c for details.
+ */
+void sched_workqueue_worker_wakeup(struct task_struct *task);
+void sched_workqueue_worker_sleep(struct task_struct *task);
+
+bool sched_workqueue_wake_up_process(struct task_struct *p);
+void switch_sched_workqueue(struct task_struct *p, bool enable);
+bool is_sched_workqueue(struct task_struct *p);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index addfe2d..b56737b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -36,6 +36,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/workqueue.h>
 
+#include "sched_workqueue.h"
+
 /*
  * The per-CPU workqueue (if single thread, we always use the first
  * possible cpu).
@@ -125,6 +127,22 @@ struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
 	return (void *) (atomic_long_read(&work->data) & WORK_STRUCT_WQ_DATA_MASK);
 }
 
+/*
+ * Scheduler callbacks.  These functions are called during schedule()
+ * with rq lock held.  Don't try to acquire any lock and only access
+ * fields which are safe with preemption disabled from local cpu.
+ */
+
+/* called when a worker task wakes up from sleep */
+void sched_workqueue_worker_wakeup(struct task_struct *task)
+{
+}
+
+/* called when a worker task goes into sleep */
+void sched_workqueue_worker_sleep(struct task_struct *task)
+{
+}
+
 static void insert_work(struct cpu_workqueue_struct *cwq,
 			struct work_struct *work, struct list_head *head)
 {
@@ -314,6 +332,9 @@ static int worker_thread(void *__cwq)
 	struct cpu_workqueue_struct *cwq = __cwq;
 	DEFINE_WAIT(wait);
 
+	/* set workqueue scheduler */
+	switch_sched_workqueue(current, true);
+
 	if (cwq->wq->freezeable)
 		set_freezable();
 
@@ -726,18 +747,7 @@ int keventd_up(void)
 
 int current_is_keventd(void)
 {
-	struct cpu_workqueue_struct *cwq;
-	int cpu = raw_smp_processor_id(); /* preempt-safe: keventd is per-cpu */
-	int ret = 0;
-
-	BUG_ON(!keventd_wq);
-
-	cwq = per_cpu_ptr(keventd_wq->cpu_wq, cpu);
-	if (current == cwq->thread)
-		ret = 1;
-
-	return ret;
-
+	return is_sched_workqueue(current);
 }
 
 static struct cpu_workqueue_struct *
-- 
1.6.4.2



* [PATCH 04/19] scheduler: implement force_cpus_allowed_ptr()
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (2 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 03/19] scheduler: implement workqueue scheduler class Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01  8:09 ` [PATCH 05/19] kthread: implement kthread_data() Tejun Heo
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

Implement force_cpus_allowed_ptr() which is similar to
set_cpus_allowed_ptr() but bypasses the PF_THREAD_BOUND check and
ignores the cpu_active() status as long as the target cpu is online.
This will be used by the concurrency-managed workqueue.
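
For example, the workqueue code could use it along these lines to put
a worker back onto a cpu which is coming up (online but not yet
active); this is just an illustration, not code from a later patch,
and rebind_worker() is a made-up name.

  /* illustration only */
  static void rebind_worker(struct task_struct *task, unsigned int cpu)
  {
          /*
           * @task has PF_THREAD_BOUND set and @cpu may not be active
           * yet, so set_cpus_allowed_ptr() would refuse this.  The
           * caller guarantees that @cpu is online.
           */
          WARN_ON(force_cpus_allowed_ptr(task, cpumask_of(cpu)));
  }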

NOT_SIGNED_OFF_YET
---
 include/linux/sched.h |    7 ++++
 kernel/sched.c        |   88 +++++++++++++++++++++++++++++++++----------------
 2 files changed, 66 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cbebadf..5fe60bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1810,6 +1810,8 @@ static inline void rcu_copy_process(struct task_struct *p)
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed_ptr(struct task_struct *p,
 				const struct cpumask *new_mask);
+extern int force_cpus_allowed_ptr(struct task_struct *p,
+				  const struct cpumask *new_mask);
 #else
 static inline int set_cpus_allowed_ptr(struct task_struct *p,
 				       const struct cpumask *new_mask)
@@ -1818,6 +1820,11 @@ static inline int set_cpus_allowed_ptr(struct task_struct *p,
 		return -EINVAL;
 	return 0;
 }
+static inline int force_cpus_allowed_ptr(struct task_struct *p,
+					 const struct cpumask *new_mask)
+{
+	return set_cpus_allowed_ptr(p, new_mask);
+}
 #endif
 
 #ifndef CONFIG_CPUMASK_OFFSTACK
diff --git a/kernel/sched.c b/kernel/sched.c
index 4e3e789..02f07b2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2070,6 +2070,7 @@ struct migration_req {
 
 	struct task_struct *task;
 	int dest_cpu;
+	bool force;
 
 	struct completion done;
 };
@@ -2078,8 +2079,8 @@ struct migration_req {
  * The task's runqueue lock must be held.
  * Returns true if you have to wait for migration thread.
  */
-static int
-migrate_task(struct task_struct *p, int dest_cpu, struct migration_req *req)
+static int migrate_task(struct task_struct *p, int dest_cpu,
+			struct migration_req *req, bool force)
 {
 	struct rq *rq = task_rq(p);
 
@@ -2095,6 +2096,7 @@ migrate_task(struct task_struct *p, int dest_cpu, struct migration_req *req)
 	init_completion(&req->done);
 	req->task = p;
 	req->dest_cpu = dest_cpu;
+	req->force = force;
 	list_add(&req->list, &rq->migration_queue);
 
 	return 1;
@@ -3096,7 +3098,7 @@ static void sched_migrate_task(struct task_struct *p, int dest_cpu)
 		goto out;
 
 	/* force the process onto the specified CPU */
-	if (migrate_task(p, dest_cpu, &req)) {
+	if (migrate_task(p, dest_cpu, &req, false)) {
 		/* Need to wait for migration thread (might exit: take ref). */
 		struct task_struct *mt = rq->migration_thread;
 
@@ -7015,34 +7017,19 @@ static inline void sched_init_granularity(void)
  * 7) we wake up and the migration is done.
  */
 
-/*
- * Change a given task's CPU affinity. Migrate the thread to a
- * proper CPU and schedule it away if the CPU it's executing on
- * is removed from the allowed bitmask.
- *
- * NOTE: the caller must have a valid reference to the task, the
- * task must not exit() & deallocate itself prematurely. The
- * call is not atomic; no spinlocks may be held.
- */
-int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
+static inline int __set_cpus_allowed_ptr(struct task_struct *p,
+					 const struct cpumask *new_mask,
+					 struct rq *rq, unsigned long *flags,
+					 bool force)
 {
 	struct migration_req req;
-	unsigned long flags;
-	struct rq *rq;
 	int ret = 0;
 
-	rq = task_rq_lock(p, &flags);
 	if (!cpumask_intersects(new_mask, cpu_online_mask)) {
 		ret = -EINVAL;
 		goto out;
 	}
 
-	if (unlikely((p->flags & PF_THREAD_BOUND) && p != current &&
-		     !cpumask_equal(&p->cpus_allowed, new_mask))) {
-		ret = -EINVAL;
-		goto out;
-	}
-
 	if (p->sched_class->set_cpus_allowed)
 		p->sched_class->set_cpus_allowed(p, new_mask);
 	else {
@@ -7054,12 +7041,13 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
 	if (cpumask_test_cpu(task_cpu(p), new_mask))
 		goto out;
 
-	if (migrate_task(p, cpumask_any_and(cpu_online_mask, new_mask), &req)) {
+	if (migrate_task(p, cpumask_any_and(cpu_online_mask, new_mask), &req,
+			 force)) {
 		/* Need help from migration thread: drop lock and wait. */
 		struct task_struct *mt = rq->migration_thread;
 
 		get_task_struct(mt);
-		task_rq_unlock(rq, &flags);
+		task_rq_unlock(rq, flags);
 		wake_up_process(rq->migration_thread);
 		put_task_struct(mt);
 		wait_for_completion(&req.done);
@@ -7067,13 +7055,53 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
 		return 0;
 	}
 out:
-	task_rq_unlock(rq, &flags);
+	task_rq_unlock(rq, flags);
 
 	return ret;
 }
+
+/*
+ * Change a given task's CPU affinity. Migrate the thread to a
+ * proper CPU and schedule it away if the CPU it's executing on
+ * is removed from the allowed bitmask.
+ *
+ * NOTE: the caller must have a valid reference to the task, the
+ * task must not exit() & deallocate itself prematurely. The
+ * call is not atomic; no spinlocks may be held.
+ */
+int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
+{
+	unsigned long flags;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &flags);
+
+	if (unlikely((p->flags & PF_THREAD_BOUND) && p != current &&
+		     !cpumask_equal(&p->cpus_allowed, new_mask))) {
+		task_rq_unlock(rq, &flags);
+		return -EINVAL;
+	}
+
+	return __set_cpus_allowed_ptr(p, new_mask, rq, &flags, false);
+}
 EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
 /*
+ * Similar to set_cpus_allowed_ptr() but bypasses PF_THREAD_BOUND
+ * check and ignores cpu_active() status as long as the cpu is online.
+ * The caller is responsible for ensuring things don't go bonkers.
+ */
+int force_cpus_allowed_ptr(struct task_struct *p,
+			   const struct cpumask *new_mask)
+{
+	unsigned long flags;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &flags);
+	return __set_cpus_allowed_ptr(p, new_mask, rq, &flags, true);
+}
+
+/*
  * Move (not current) task off this cpu, onto dest cpu. We're doing
  * this because either it can't run here any more (set_cpus_allowed()
  * away from this CPU, or CPU going down), or because we're
@@ -7084,12 +7112,13 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
  *
  * Returns non-zero if task was successfully migrated.
  */
-static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
+static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu,
+			  bool force)
 {
 	struct rq *rq_dest, *rq_src;
 	int ret = 0, on_rq;
 
-	if (unlikely(!cpu_active(dest_cpu)))
+	if (!force && unlikely(!cpu_active(dest_cpu)))
 		return ret;
 
 	rq_src = cpu_rq(src_cpu);
@@ -7168,7 +7197,8 @@ static int migration_thread(void *data)
 
 		if (req->task != NULL) {
 			spin_unlock(&rq->lock);
-			__migrate_task(req->task, cpu, req->dest_cpu);
+			__migrate_task(req->task, cpu, req->dest_cpu,
+				       req->force);
 		} else if (likely(cpu == (badcpu = smp_processor_id()))) {
 			req->dest_cpu = RCU_MIGRATION_GOT_QS;
 			spin_unlock(&rq->lock);
@@ -7193,7 +7223,7 @@ static int __migrate_task_irq(struct task_struct *p, int src_cpu, int dest_cpu)
 	int ret;
 
 	local_irq_disable();
-	ret = __migrate_task(p, src_cpu, dest_cpu);
+	ret = __migrate_task(p, src_cpu, dest_cpu, false);
 	local_irq_enable();
 	return ret;
 }
-- 
1.6.4.2



* [PATCH 05/19] kthread: implement kthread_data()
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (3 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 04/19] scheduler: implement force_cpus_allowed_ptr() Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01  8:09 ` [PATCH 06/19] acpi: use queue_work_on() instead of binding workqueue worker to cpu0 Tejun Heo
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

Implement kthread_data() which takes @task pointing to a kthread and
returns @data specified when creating the kthread.  The caller is
responsible for ensuring the validity of @task when calling this
function.
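
In this series the intended user is the workqueue scheduler class.
Roughly (a sketch with a made-up struct worker, not code from the
later patches), the worker kthread is created with its per-worker
struct as @data, which the scheduler callbacks can then retrieve:

  /* sketch only; struct worker and its fields are made up */
  struct worker {
          struct task_struct      *task; /* kthread_create(worker_thread, worker, ...) */
          bool                    sleeping;
  };

  /*
   * Scheduler callback.  Because the worker kthread was created with
   * its struct worker as @data, kthread_data() hands it straight
   * back without any lookup or locking.
   */
  void sched_workqueue_worker_sleep(struct task_struct *task)
  {
          struct worker *worker = kthread_data(task);

          worker->sleeping = true;
  }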

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/kthread.h |    1 +
 kernel/kthread.c        |   15 +++++++++++++++
 2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index aabc8a1..14f63e8 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -30,6 +30,7 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
 void kthread_bind(struct task_struct *k, unsigned int cpu);
 int kthread_stop(struct task_struct *k);
 int kthread_should_stop(void);
+void *kthread_data(struct task_struct *k);
 
 int kthreadd(void *unused);
 extern struct task_struct *kthreadd_task;
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 5fe7099..bd4cb7f 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -35,6 +35,7 @@ struct kthread_create_info
 
 struct kthread {
 	int should_stop;
+	void *data;
 	struct completion exited;
 };
 
@@ -54,6 +55,19 @@ int kthread_should_stop(void)
 }
 EXPORT_SYMBOL(kthread_should_stop);
 
+/**
+ * kthread_data - return data value specified on kthread creation
+ * @task: kthread task in question
+ *
+ * Return the data value specified when kthread @task was created.
+ * The caller is responsible for ensuring the validity of @task when
+ * calling this function.
+ */
+void *kthread_data(struct task_struct *task)
+{
+	return to_kthread(task)->data;
+}
+
 static int kthread(void *_create)
 {
 	/* Copy data: it's on kthread's stack */
@@ -64,6 +78,7 @@ static int kthread(void *_create)
 	int ret;
 
 	self.should_stop = 0;
+	self.data = data;
 	init_completion(&self.exited);
 	current->vfork_done = &self.exited;
 
-- 
1.6.4.2



* [PATCH 06/19] acpi: use queue_work_on() instead of binding workqueue worker to cpu0
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (4 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 05/19] kthread: implement kthread_data() Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01  8:09 ` [PATCH 07/19] stop_machine: reimplement without using workqueue Tejun Heo
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

ACPI works need to be executed on cpu0 and acpi/osl.c achieves this
by creating singlethread workqueues and then binding them to cpu0
from a work item, which is quite unorthodox.  Make it create regular
workqueues and use queue_work_on() instead.  This is in preparation
for concurrency managed workqueue; the extra workers won't be a
problem once it's implemented.

NOT_SIGNED_OFF_YET
---
 drivers/acpi/osl.c |   41 ++++++++++++-----------------------------
 1 files changed, 12 insertions(+), 29 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 7c1c59e..f742b7b 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -191,36 +191,11 @@ acpi_status __init acpi_os_initialize(void)
 	return AE_OK;
 }
 
-static void bind_to_cpu0(struct work_struct *work)
-{
-	set_cpus_allowed_ptr(current, cpumask_of(0));
-	kfree(work);
-}
-
-static void bind_workqueue(struct workqueue_struct *wq)
-{
-	struct work_struct *work;
-
-	work = kzalloc(sizeof(struct work_struct), GFP_KERNEL);
-	INIT_WORK(work, bind_to_cpu0);
-	queue_work(wq, work);
-}
-
 acpi_status acpi_os_initialize1(void)
 {
-	/*
-	 * On some machines, a software-initiated SMI causes corruption unless
-	 * the SMI runs on CPU 0.  An SMI can be initiated by any AML, but
-	 * typically it's done in GPE-related methods that are run via
-	 * workqueues, so we can avoid the known corruption cases by binding
-	 * the workqueues to CPU 0.
-	 */
-	kacpid_wq = create_singlethread_workqueue("kacpid");
-	bind_workqueue(kacpid_wq);
-	kacpi_notify_wq = create_singlethread_workqueue("kacpi_notify");
-	bind_workqueue(kacpi_notify_wq);
-	kacpi_hotplug_wq = create_singlethread_workqueue("kacpi_hotplug");
-	bind_workqueue(kacpi_hotplug_wq);
+	kacpid_wq = create_workqueue("kacpid");
+	kacpi_notify_wq = create_workqueue("kacpi_notify");
+	kacpi_hotplug_wq = create_workqueue("kacpi_hotplug");
 	BUG_ON(!kacpid_wq);
 	BUG_ON(!kacpi_notify_wq);
 	BUG_ON(!kacpi_hotplug_wq);
@@ -759,7 +734,15 @@ static acpi_status __acpi_os_execute(acpi_execute_type type,
 		(type == OSL_NOTIFY_HANDLER ? kacpi_notify_wq : kacpid_wq);
 	dpc->wait = hp ? 1 : 0;
 	INIT_WORK(&dpc->work, acpi_os_execute_deferred);
-	ret = queue_work(queue, &dpc->work);
+
+	/*
+	 * On some machines, a software-initiated SMI causes corruption unless
+	 * the SMI runs on CPU 0.  An SMI can be initiated by any AML, but
+	 * typically it's done in GPE-related methods that are run via
+	 * workqueues, so we can avoid the known corruption cases by always
+	 * queueing on CPU 0.
+	 */
+	ret = queue_work_on(0, queue, &dpc->work);
 
 	if (!ret) {
 		printk(KERN_ERR PREFIX
-- 
1.6.4.2



* [PATCH 07/19] stop_machine: reimplement without using workqueue
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (5 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 06/19] acpi: use queue_work_on() instead of binding workqueue worker to cpu0 Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-06  9:36   ` Rusty Russell
  2009-10-01  8:09 ` [PATCH 08/19] workqueue: misc/cosmetic updates Tejun Heo
                   ` (19 subsequent siblings)
  26 siblings, 1 reply; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

stop_machine() is the only user of the RT workqueue.  Reimplement it
using kthreads directly and rip RT support out of workqueue.  This is
in preparation for concurrency managed workqueue.

NOT_SIGNED_OFF_YET
---
 include/linux/stop_machine.h |    6 ++
 include/linux/workqueue.h    |   20 +++---
 init/main.c                  |    2 +
 kernel/stop_machine.c        |  151 ++++++++++++++++++++++++++++++++++-------
 kernel/workqueue.c           |    6 --
 5 files changed, 142 insertions(+), 43 deletions(-)

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index baba3a2..2d32e06 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -53,6 +53,11 @@ int stop_machine_create(void);
  */
 void stop_machine_destroy(void);
 
+/**
+ * init_stop_machine: initialize stop_machine during boot
+ */
+void init_stop_machine(void);
+
 #else
 
 static inline int stop_machine(int (*fn)(void *), void *data,
@@ -67,6 +72,7 @@ static inline int stop_machine(int (*fn)(void *), void *data,
 
 static inline int stop_machine_create(void) { return 0; }
 static inline void stop_machine_destroy(void) { }
+static inline void init_stop_machine(void) { }
 
 #endif /* CONFIG_SMP */
 #endif /* _LINUX_STOP_MACHINE */
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 7ef0c7b..05f0998 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -165,12 +165,11 @@ struct execute_work {
 
 
 extern struct workqueue_struct *
-__create_workqueue_key(const char *name, int singlethread,
-		       int freezeable, int rt, struct lock_class_key *key,
-		       const char *lock_name);
+__create_workqueue_key(const char *name, int singlethread, int freezeable,
+		       struct lock_class_key *key, const char *lock_name);
 
 #ifdef CONFIG_LOCKDEP
-#define __create_workqueue(name, singlethread, freezeable, rt)	\
+#define __create_workqueue(name, singlethread, freezeable)	\
 ({								\
 	static struct lock_class_key __key;			\
 	const char *__lock_name;				\
@@ -181,19 +180,18 @@ __create_workqueue_key(const char *name, int singlethread,
 		__lock_name = #name;				\
 								\
 	__create_workqueue_key((name), (singlethread),		\
-			       (freezeable), (rt), &__key,	\
+			       (freezeable), &__key,		\
 			       __lock_name);			\
 })
 #else
-#define __create_workqueue(name, singlethread, freezeable, rt)	\
-	__create_workqueue_key((name), (singlethread), (freezeable), (rt), \
+#define __create_workqueue(name, singlethread, freezeable)	\
+	__create_workqueue_key((name), (singlethread), (freezeable), \
 			       NULL, NULL)
 #endif
 
-#define create_workqueue(name) __create_workqueue((name), 0, 0, 0)
-#define create_rt_workqueue(name) __create_workqueue((name), 0, 0, 1)
-#define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1, 0)
-#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0, 0)
+#define create_workqueue(name) __create_workqueue((name), 0, 0)
+#define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1)
+#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0)
 
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
diff --git a/init/main.c b/init/main.c
index 7449819..5ae8d76 100644
--- a/init/main.c
+++ b/init/main.c
@@ -34,6 +34,7 @@
 #include <linux/security.h>
 #include <linux/smp.h>
 #include <linux/workqueue.h>
+#include <linux/stop_machine.h>
 #include <linux/profile.h>
 #include <linux/rcupdate.h>
 #include <linux/moduleparam.h>
@@ -780,6 +781,7 @@ static void __init do_basic_setup(void)
 {
 	rcu_init_sched(); /* needed by module_init stage. */
 	init_workqueues();
+	init_stop_machine();
 	cpuset_init_smp();
 	usermodehelper_init();
 	init_tmpfs();
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 912823e..671a4ac 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -25,6 +25,8 @@ enum stopmachine_state {
 	STOPMACHINE_RUN,
 	/* Exit */
 	STOPMACHINE_EXIT,
+	/* Done */
+	STOPMACHINE_DONE,
 };
 static enum stopmachine_state state;
 
@@ -42,10 +44,9 @@ static DEFINE_MUTEX(lock);
 static DEFINE_MUTEX(setup_lock);
 /* Users of stop_machine. */
 static int refcount;
-static struct workqueue_struct *stop_machine_wq;
+static struct task_struct **stop_machine_threads;
 static struct stop_machine_data active, idle;
 static const struct cpumask *active_cpus;
-static void *stop_machine_work;
 
 static void set_state(enum stopmachine_state newstate)
 {
@@ -63,14 +64,31 @@ static void ack_state(void)
 }
 
 /* This is the actual function which stops the CPU. It runs
- * in the context of a dedicated stopmachine workqueue. */
-static void stop_cpu(struct work_struct *unused)
+ * on dedicated per-cpu kthreads. */
+static int stop_cpu(void *unused)
 {
 	enum stopmachine_state curstate = STOPMACHINE_NONE;
-	struct stop_machine_data *smdata = &idle;
+	struct stop_machine_data *smdata;
 	int cpu = smp_processor_id();
 	int err;
 
+repeat:
+	/* Wait for __stop_machine() to initiate */
+	while (true) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		/* <- kthread_stop() and __stop_machine()::smp_wmb() */
+		if (kthread_should_stop()) {
+			__set_current_state(TASK_RUNNING);
+			return 0;
+		}
+		if (state == STOPMACHINE_PREPARE)
+			break;
+		schedule();
+	}
+	smp_rmb();	/* <- __stop_machine()::set_state() */
+
+	/* Okay, let's go */
+	smdata = &idle;
 	if (!active_cpus) {
 		if (cpu == cpumask_first(cpu_online_mask))
 			smdata = &active;
@@ -104,6 +122,7 @@ static void stop_cpu(struct work_struct *unused)
 	} while (curstate != STOPMACHINE_EXIT);
 
 	local_irq_enable();
+	goto repeat;
 }
 
 /* Callback for CPUs which aren't supposed to do anything. */
@@ -112,46 +131,122 @@ static int chill(void *unused)
 	return 0;
 }
 
+static int create_stop_machine_thread(unsigned int cpu)
+{
+	struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+	struct task_struct **pp = per_cpu_ptr(stop_machine_threads, cpu);
+	struct task_struct *p;
+
+	if (*pp)
+		return -EBUSY;
+
+	p = kthread_create(stop_cpu, NULL, "kstop/%u", cpu);
+	if (IS_ERR(p))
+		return PTR_ERR(p);
+
+	sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
+	*pp = p;
+	return 0;
+}
+
+/* Should be called with cpu hotplug disabled and setup_lock held */
+static void kill_stop_machine_threads(void)
+{
+	unsigned int cpu;
+
+	if (!stop_machine_threads)
+		return;
+
+	for_each_online_cpu(cpu) {
+		struct task_struct *p = *per_cpu_ptr(stop_machine_threads, cpu);
+		if (p)
+			kthread_stop(p);
+	}
+	free_percpu(stop_machine_threads);
+	stop_machine_threads = NULL;
+}
+
 int stop_machine_create(void)
 {
+	unsigned int cpu;
+
+	get_online_cpus();
 	mutex_lock(&setup_lock);
 	if (refcount)
 		goto done;
-	stop_machine_wq = create_rt_workqueue("kstop");
-	if (!stop_machine_wq)
-		goto err_out;
-	stop_machine_work = alloc_percpu(struct work_struct);
-	if (!stop_machine_work)
+
+	stop_machine_threads = alloc_percpu(struct task_struct *);
+	if (!stop_machine_threads)
 		goto err_out;
+
+	/*
+	 * cpu hotplug is disabled, create only for online cpus,
+	 * cpu_callback() will handle cpu hot [un]plugs.
+	 */
+	for_each_online_cpu(cpu) {
+		if (create_stop_machine_thread(cpu))
+			goto err_out;
+		kthread_bind(*per_cpu_ptr(stop_machine_threads, cpu), cpu);
+	}
 done:
 	refcount++;
 	mutex_unlock(&setup_lock);
+	put_online_cpus();
 	return 0;
 
 err_out:
-	if (stop_machine_wq)
-		destroy_workqueue(stop_machine_wq);
+	kill_stop_machine_threads();
 	mutex_unlock(&setup_lock);
+	put_online_cpus();
 	return -ENOMEM;
 }
 EXPORT_SYMBOL_GPL(stop_machine_create);
 
 void stop_machine_destroy(void)
 {
+	get_online_cpus();
 	mutex_lock(&setup_lock);
-	refcount--;
-	if (refcount)
-		goto done;
-	destroy_workqueue(stop_machine_wq);
-	free_percpu(stop_machine_work);
-done:
+	if (!--refcount)
+		kill_stop_machine_threads();
 	mutex_unlock(&setup_lock);
+	put_online_cpus();
 }
 EXPORT_SYMBOL_GPL(stop_machine_destroy);
 
+static int __cpuinit stop_machine_cpu_callback(struct notifier_block *nfb,
+					       unsigned long action, void *hcpu)
+{
+	unsigned int cpu = (unsigned long)hcpu;
+	struct task_struct **pp = per_cpu_ptr(stop_machine_threads, cpu);
+
+	/* Hotplug exclusion is enough, no need to worry about setup_lock */
+	if (!stop_machine_threads)
+		return NOTIFY_OK;
+
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_UP_PREPARE:
+		if (create_stop_machine_thread(cpu)) {
+			printk(KERN_ERR "failed to create stop machine "
+			       "thread for %u\n", cpu);
+			return NOTIFY_BAD;
+		}
+		break;
+
+	case CPU_ONLINE:
+		kthread_bind(*pp, cpu);
+		break;
+
+	case CPU_UP_CANCELED:
+	case CPU_POST_DEAD:
+		kthread_stop(*pp);
+		*pp = NULL;
+		break;
+	}
+	return NOTIFY_OK;
+}
+
 int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 {
-	struct work_struct *sm_work;
 	int i, ret;
 
 	/* Set up initial state. */
@@ -164,19 +259,18 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 	idle.fn = chill;
 	idle.data = NULL;
 
-	set_state(STOPMACHINE_PREPARE);
+	set_state(STOPMACHINE_PREPARE);	/* -> stop_cpu()::smp_rmb() */
+	smp_wmb();			/* -> stop_cpu()::set_current_state() */
 
 	/* Schedule the stop_cpu work on all cpus: hold this CPU so one
 	 * doesn't hit this CPU until we're ready. */
 	get_cpu();
-	for_each_online_cpu(i) {
-		sm_work = per_cpu_ptr(stop_machine_work, i);
-		INIT_WORK(sm_work, stop_cpu);
-		queue_work_on(i, stop_machine_wq, sm_work);
-	}
+	for_each_online_cpu(i)
+		wake_up_process(*per_cpu_ptr(stop_machine_threads, i));
 	/* This will release the thread on our CPU. */
 	put_cpu();
-	flush_workqueue(stop_machine_wq);
+	while (state < STOPMACHINE_DONE)
+		yield();
 	ret = active.fnret;
 	mutex_unlock(&lock);
 	return ret;
@@ -197,3 +291,8 @@ int stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
 	return ret;
 }
 EXPORT_SYMBOL_GPL(stop_machine);
+
+void __init init_stop_machine(void)
+{
+	hotcpu_notifier(stop_machine_cpu_callback, 0);
+}
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b56737b..a8a35a9 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -64,7 +64,6 @@ struct workqueue_struct {
 	const char *name;
 	int singlethread;
 	int freezeable;		/* Freeze threads during suspend */
-	int rt;
 #ifdef CONFIG_LOCKDEP
 	struct lockdep_map lockdep_map;
 #endif
@@ -765,7 +764,6 @@ init_cpu_workqueue(struct workqueue_struct *wq, int cpu)
 
 static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
 {
-	struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
 	struct workqueue_struct *wq = cwq->wq;
 	const char *fmt = is_wq_single_threaded(wq) ? "%s" : "%s/%d";
 	struct task_struct *p;
@@ -781,8 +779,6 @@ static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
 	 */
 	if (IS_ERR(p))
 		return PTR_ERR(p);
-	if (cwq->wq->rt)
-		sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
 	cwq->thread = p;
 
 	trace_workqueue_creation(cwq->thread, cpu);
@@ -804,7 +800,6 @@ static void start_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
 struct workqueue_struct *__create_workqueue_key(const char *name,
 						int singlethread,
 						int freezeable,
-						int rt,
 						struct lock_class_key *key,
 						const char *lock_name)
 {
@@ -826,7 +821,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
 	wq->singlethread = singlethread;
 	wq->freezeable = freezeable;
-	wq->rt = rt;
 	INIT_LIST_HEAD(&wq->list);
 
 	if (singlethread) {
-- 
1.6.4.2



* [PATCH 08/19] workqueue: misc/cosmetic updates
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (6 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 07/19] stop_machine: reimplement without using workqueue Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01  8:09 ` [PATCH 09/19] workqueue: merge feature parametesr into flags Tejun Heo
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

Make the following updates in preparation for concurrency managed
workqueue.  None of these changes causes any visible behavior
difference.

* Add comments and adjust indentations to data structures and several
  functions.

* Rename wq_per_cpu() to get_cwq() and swap the position of its two
  parameters for consistency.  Convert a direct per_cpu_ptr() access
  to wq->cpu_wq to get_cwq().

* Update set_wq_data() such that it sets the flags part to
  WORK_STRUCT_PENDING | @extra_flags.

* Move the sanity check on work->entry emptiness from queue_work_on()
  to __queue_work(), which all queueing paths share.

* Make __queue_work() take @cpu and @wq instead of @cwq.

* Restructure flush_work() and __create_workqueue_key() to make them
  easier to modify.

NOT_SIGNED_OFF_YET
---
 kernel/workqueue.c |  122 ++++++++++++++++++++++++++++++++++------------------
 1 files changed, 80 insertions(+), 42 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a8a35a9..10c0cc4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -39,6 +39,16 @@
 #include "sched_workqueue.h"
 
 /*
+ * Structure fields follow one of the following exclusion rules.
+ *
+ * I: Set during initialization and read-only afterwards.
+ *
+ * L: cwq->lock protected.  Access with cwq->lock held.
+ *
+ * W: workqueue_lock protected.
+ */
+
+/*
  * The per-CPU workqueue (if single thread, we always use the first
  * possible cpu).
  */
@@ -50,8 +60,8 @@ struct cpu_workqueue_struct {
 	wait_queue_head_t more_work;
 	struct work_struct *current_work;
 
-	struct workqueue_struct *wq;
-	struct task_struct *thread;
+	struct workqueue_struct *wq;		/* I: the owning workqueue */
+	struct task_struct	*thread;
 } ____cacheline_aligned;
 
 /*
@@ -59,13 +69,13 @@ struct cpu_workqueue_struct {
  * per-CPU workqueues:
  */
 struct workqueue_struct {
-	struct cpu_workqueue_struct *cpu_wq;
-	struct list_head list;
-	const char *name;
+	struct cpu_workqueue_struct *cpu_wq;	/* I: cwq's */
+	struct list_head	list;		/* W: list of all workqueues */
+	const char		*name;		/* I: workqueue name */
 	int singlethread;
 	int freezeable;		/* Freeze threads during suspend */
 #ifdef CONFIG_LOCKDEP
-	struct lockdep_map lockdep_map;
+	struct lockdep_map	lockdep_map;
 #endif
 };
 
@@ -96,8 +106,8 @@ static const struct cpumask *wq_cpu_map(struct workqueue_struct *wq)
 		? cpu_singlethread_map : cpu_populated_map;
 }
 
-static
-struct cpu_workqueue_struct *wq_per_cpu(struct workqueue_struct *wq, int cpu)
+static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
+					    struct workqueue_struct *wq)
 {
 	if (unlikely(is_wq_single_threaded(wq)))
 		cpu = singlethread_cpu;
@@ -109,15 +119,12 @@ struct cpu_workqueue_struct *wq_per_cpu(struct workqueue_struct *wq, int cpu)
  * - Must *only* be called if the pending flag is set
  */
 static inline void set_wq_data(struct work_struct *work,
-				struct cpu_workqueue_struct *cwq)
+			       struct cpu_workqueue_struct *cwq,
+			       unsigned long extra_flags)
 {
-	unsigned long new;
-
 	BUG_ON(!work_pending(work));
-
-	new = (unsigned long) cwq | (1UL << WORK_STRUCT_PENDING);
-	new |= WORK_STRUCT_FLAG_MASK & *work_data_bits(work);
-	atomic_long_set(&work->data, new);
+	atomic_long_set(&work->data, (unsigned long)cwq |
+			(1UL << WORK_STRUCT_PENDING) | extra_flags);
 }
 
 static inline
@@ -142,28 +149,46 @@ void sched_workqueue_worker_sleep(struct task_struct *task)
 {
 }
 
+/**
+ * insert_work - insert a work into cwq
+ * @cwq: cwq @work belongs to
+ * @work: work to insert
+ * @head: insertion point
+ * @extra_flags: extra WORK_STRUCT_* flags to set
+ *
+ * Insert @work into @cwq after @head.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
 static void insert_work(struct cpu_workqueue_struct *cwq,
-			struct work_struct *work, struct list_head *head)
+			struct work_struct *work, struct list_head *head,
+			unsigned int extra_flags)
 {
 	trace_workqueue_insertion(cwq->thread, work);
 
-	set_wq_data(work, cwq);
+	/* we own @work, set data and link */
+	set_wq_data(work, cwq, extra_flags);
+
 	/*
 	 * Ensure that we get the right work->data if we see the
 	 * result of list_add() below, see try_to_grab_pending().
 	 */
 	smp_wmb();
+
 	list_add_tail(&work->entry, head);
 	wake_up(&cwq->more_work);
 }
 
-static void __queue_work(struct cpu_workqueue_struct *cwq,
+static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 			 struct work_struct *work)
 {
+	struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 	unsigned long flags;
 
 	spin_lock_irqsave(&cwq->lock, flags);
-	insert_work(cwq, work, &cwq->worklist);
+	BUG_ON(!list_empty(&work->entry));
+	insert_work(cwq, work, &cwq->worklist, 0);
 	spin_unlock_irqrestore(&cwq->lock, flags);
 }
 
@@ -205,8 +230,7 @@ queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
 	int ret = 0;
 
 	if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
-		BUG_ON(!list_empty(&work->entry));
-		__queue_work(wq_per_cpu(wq, cpu), work);
+		__queue_work(cpu, wq, work);
 		ret = 1;
 	}
 	return ret;
@@ -217,9 +241,8 @@ static void delayed_work_timer_fn(unsigned long __data)
 {
 	struct delayed_work *dwork = (struct delayed_work *)__data;
 	struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work);
-	struct workqueue_struct *wq = cwq->wq;
 
-	__queue_work(wq_per_cpu(wq, smp_processor_id()), &dwork->work);
+	__queue_work(smp_processor_id(), cwq->wq, &dwork->work);
 }
 
 /**
@@ -263,7 +286,7 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 		timer_stats_timer_set_start_info(&dwork->timer);
 
 		/* This stores cwq for the moment, for the timer_fn */
-		set_wq_data(work, wq_per_cpu(wq, raw_smp_processor_id()));
+		set_wq_data(work, get_cwq(raw_smp_processor_id(), wq), 0);
 		timer->expires = jiffies + delay;
 		timer->data = (unsigned long)dwork;
 		timer->function = delayed_work_timer_fn;
@@ -326,6 +349,12 @@ static void run_workqueue(struct cpu_workqueue_struct *cwq)
 	spin_unlock_irq(&cwq->lock);
 }
 
+/**
+ * worker_thread - the worker thread function
+ * @__cwq: cwq to serve
+ *
+ * The cwq worker thread function.
+ */
 static int worker_thread(void *__cwq)
 {
 	struct cpu_workqueue_struct *cwq = __cwq;
@@ -367,15 +396,25 @@ static void wq_barrier_func(struct work_struct *work)
 	complete(&barr->done);
 }
 
+/**
+ * insert_wq_barrier - insert a barrier work
+ * @cwq: cwq to insert barrier into
+ * @barr: wq_barrier to insert
+ * @head: insertion point
+ *
+ * Insert barrier @barr into @cwq before @head.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
 static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
 			struct wq_barrier *barr, struct list_head *head)
 {
 	INIT_WORK(&barr->work, wq_barrier_func);
 	__set_bit(WORK_STRUCT_PENDING, work_data_bits(&barr->work));
-
 	init_completion(&barr->done);
 
-	insert_work(cwq, &barr->work, head);
+	insert_work(cwq, &barr->work, head, 0);
 }
 
 static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
@@ -407,9 +446,6 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
  *
  * We sleep until all works which were queued on entry have been handled,
  * but we are not livelocked by new incoming ones.
- *
- * This function used to run the workqueues itself.  Now we just wait for the
- * helper threads to do it.
  */
 void flush_workqueue(struct workqueue_struct *wq)
 {
@@ -448,7 +484,6 @@ int flush_work(struct work_struct *work)
 	lock_map_acquire(&cwq->wq->lockdep_map);
 	lock_map_release(&cwq->wq->lockdep_map);
 
-	prev = NULL;
 	spin_lock_irq(&cwq->lock);
 	if (!list_empty(&work->entry)) {
 		/*
@@ -457,21 +492,20 @@ int flush_work(struct work_struct *work)
 		 */
 		smp_rmb();
 		if (unlikely(cwq != get_wq_data(work)))
-			goto out;
+			goto already_gone;
 		prev = &work->entry;
 	} else {
 		if (cwq->current_work != work)
-			goto out;
+			goto already_gone;
 		prev = &cwq->worklist;
 	}
 	insert_wq_barrier(cwq, &barr, prev->next);
-out:
 	spin_unlock_irq(&cwq->lock);
-	if (!prev)
-		return 0;
-
 	wait_for_completion(&barr.done);
 	return 1;
+already_gone:
+	spin_unlock_irq(&cwq->lock);
+	return 0;
 }
 EXPORT_SYMBOL_GPL(flush_work);
 
@@ -551,7 +585,7 @@ static void wait_on_work(struct work_struct *work)
 	cpu_map = wq_cpu_map(wq);
 
 	for_each_cpu(cpu, cpu_map)
-		wait_on_cpu_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
+		wait_on_cpu_work(get_cwq(cpu, wq), work);
 }
 
 static int __cancel_work_timer(struct work_struct *work,
@@ -809,13 +843,11 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 
 	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
 	if (!wq)
-		return NULL;
+		goto err;
 
 	wq->cpu_wq = alloc_percpu(struct cpu_workqueue_struct);
-	if (!wq->cpu_wq) {
-		kfree(wq);
-		return NULL;
-	}
+	if (!wq->cpu_wq)
+		goto err;
 
 	wq->name = name;
 	lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
@@ -859,6 +891,12 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 		wq = NULL;
 	}
 	return wq;
+err:
+	if (wq) {
+		free_percpu(wq->cpu_wq);
+		kfree(wq);
+	}
+	return NULL;
 }
 EXPORT_SYMBOL_GPL(__create_workqueue_key);
 
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 09/19] workqueue: merge feature parameters into flags
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (7 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 08/19] workqueue: misc/cosmetic updates Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01  8:09 ` [PATCH 10/19] workqueue: update cwq alignment and make one more flag bit available Tejun Heo
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

Currently, __create_workqueue_key() takes @singlethread and
@freezeable parameters and stores them separately in workqueue_struct.
Merge them into a single flags parameter and field and use
WQ_FREEZEABLE and WQ_SINGLE_THREAD.
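
As a rough sketch of the resulting interface change (illustration only,
not part of the diff below; "example" is a made-up workqueue name):

        struct workqueue_struct *wq;

        /* before: positional parameters (name, singlethread, freezeable) */
        wq = __create_workqueue("example", 0, 1);

        /* after: a single flags argument */
        wq = __create_workqueue("example", WQ_FREEZEABLE);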

NOT_SIGNED_OFF_YET
---
 include/linux/workqueue.h |   25 +++++++++++++++----------
 kernel/workqueue.c        |   17 +++++++----------
 2 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 05f0998..d3334a0 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -163,13 +163,17 @@ struct execute_work {
 #define work_clear_pending(work) \
 	clear_bit(WORK_STRUCT_PENDING, work_data_bits(work))
 
+enum {
+	WQ_FREEZEABLE		= 1 << 0, /* freeze during suspend */
+	WQ_SINGLE_THREAD	= 1 << 1, /* no per-cpu worker */
+};
 
 extern struct workqueue_struct *
-__create_workqueue_key(const char *name, int singlethread, int freezeable,
+__create_workqueue_key(const char *name, unsigned int flags,
 		       struct lock_class_key *key, const char *lock_name);
 
 #ifdef CONFIG_LOCKDEP
-#define __create_workqueue(name, singlethread, freezeable)	\
+#define __create_workqueue(name, flags)				\
 ({								\
 	static struct lock_class_key __key;			\
 	const char *__lock_name;				\
@@ -179,19 +183,20 @@ __create_workqueue_key(const char *name, int singlethread, int freezeable,
 	else							\
 		__lock_name = #name;				\
 								\
-	__create_workqueue_key((name), (singlethread),		\
-			       (freezeable), &__key,		\
+	__create_workqueue_key((name), (flags), &__key,		\
 			       __lock_name);			\
 })
 #else
-#define __create_workqueue(name, singlethread, freezeable)	\
-	__create_workqueue_key((name), (singlethread), (freezeable), \
-			       NULL, NULL)
+#define __create_workqueue(name, flags)				\
+	__create_workqueue_key((name), (flags), NULL, NULL)
 #endif
 
-#define create_workqueue(name) __create_workqueue((name), 0, 0)
-#define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1)
-#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0)
+#define create_workqueue(name)					\
+	__create_workqueue((name), 0)
+#define create_freezeable_workqueue(name)			\
+	__create_workqueue((name), WQ_FREEZEABLE | WQ_SINGLE_THREAD)
+#define create_singlethread_workqueue(name)			\
+	__create_workqueue((name), WQ_SINGLE_THREAD)
 
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 10c0cc4..befda6c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -69,11 +69,10 @@ struct cpu_workqueue_struct {
  * per-CPU workqueues:
  */
 struct workqueue_struct {
+	unsigned int		flags;		/* I: WQ_* flags */
 	struct cpu_workqueue_struct *cpu_wq;	/* I: cwq's */
 	struct list_head	list;		/* W: list of all workqueues */
 	const char		*name;		/* I: workqueue name */
-	int singlethread;
-	int freezeable;		/* Freeze threads during suspend */
 #ifdef CONFIG_LOCKDEP
 	struct lockdep_map	lockdep_map;
 #endif
@@ -95,9 +94,9 @@ static const struct cpumask *cpu_singlethread_map __read_mostly;
 static cpumask_var_t cpu_populated_map __read_mostly;
 
 /* If it's single threaded, it isn't in the list of workqueues. */
-static inline int is_wq_single_threaded(struct workqueue_struct *wq)
+static inline bool is_wq_single_threaded(struct workqueue_struct *wq)
 {
-	return wq->singlethread;
+	return wq->flags & WQ_SINGLE_THREAD;
 }
 
 static const struct cpumask *wq_cpu_map(struct workqueue_struct *wq)
@@ -363,7 +362,7 @@ static int worker_thread(void *__cwq)
 	/* set workqueue scheduler */
 	switch_sched_workqueue(current, true);
 
-	if (cwq->wq->freezeable)
+	if (cwq->wq->flags & WQ_FREEZEABLE)
 		set_freezable();
 
 	for (;;) {
@@ -832,8 +831,7 @@ static void start_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
 }
 
 struct workqueue_struct *__create_workqueue_key(const char *name,
-						int singlethread,
-						int freezeable,
+						unsigned int flags,
 						struct lock_class_key *key,
 						const char *lock_name)
 {
@@ -849,13 +847,12 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	if (!wq->cpu_wq)
 		goto err;
 
+	wq->flags = flags;
 	wq->name = name;
 	lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
-	wq->singlethread = singlethread;
-	wq->freezeable = freezeable;
 	INIT_LIST_HEAD(&wq->list);
 
-	if (singlethread) {
+	if (flags & WQ_SINGLE_THREAD) {
 		cwq = init_cpu_workqueue(wq, singlethread_cpu);
 		err = create_workqueue_thread(cwq, singlethread_cpu);
 		start_workqueue_thread(cwq, -1);
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 10/19] workqueue: update cwq alignment and make one more flag bit available
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (8 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 09/19] workqueue: merge feature parameters into flags Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01  8:09 ` [PATCH 11/19] workqueue: define both bit position and mask for work flags Tejun Heo
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

Currently cwqs are cacheline aligned and the lower two bits of
work_struct->data are considered to be available and used as flags.
Make the alignment requirement official by defining
WORK_STRUCT_FLAG_BITS and aligning cwqs to the two's power of it.

This is in preparation for concurrency managed workqueue, where
cacheline alignment of cwqs won't matter as much.  While at it, this
patch reserves one more bit for work flags and makes sure the resulting
alignment is at least as large as that of long long.
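
For reference, the resulting packing of work_struct->data can be read
back roughly like this (a sketch using the masks defined in this patch;
it mirrors the existing get_wq_data() logic):

        unsigned long data = atomic_long_read(&work->data);

        /* the low WORK_STRUCT_FLAG_BITS bits carry WORK_STRUCT_* flags */
        unsigned long flags = data & WORK_STRUCT_FLAG_MASK;

        /* the rest is the cwq pointer, hence the alignment requirement */
        struct cpu_workqueue_struct *cwq =
                (void *)(data & WORK_STRUCT_WQ_DATA_MASK);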

NOT_SIGNED_OFF_YET
---
 include/linux/workqueue.h |   17 ++++++++++++++---
 kernel/workqueue.c        |   15 +++++++++++++--
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index d3334a0..3d11ce3 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -22,11 +22,22 @@ typedef void (*work_func_t)(struct work_struct *work);
  */
 #define work_data_bits(work) ((unsigned long *)(&(work)->data))
 
+enum {
+	WORK_STRUCT_PENDING	= 0,	/* work item is pending execution */
+
+	/*
+	 * Reserve 3bits off of cwq pointer.  This is enough and
+	 * provides acceptable alignment on both 32 and 64bit
+	 * machines.
+	 */
+	WORK_STRUCT_FLAG_BITS	= 3,
+
+	WORK_STRUCT_FLAG_MASK	= (1UL << WORK_STRUCT_FLAG_BITS) - 1,
+	WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
+};
+
 struct work_struct {
 	atomic_long_t data;
-#define WORK_STRUCT_PENDING 0		/* T if work item pending execution */
-#define WORK_STRUCT_FLAG_MASK (3UL)
-#define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
 	struct list_head entry;
 	work_func_t func;
 #ifdef CONFIG_LOCKDEP
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index befda6c..28de966 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -50,7 +50,9 @@
 
 /*
  * The per-CPU workqueue (if single thread, we always use the first
- * possible cpu).
+ * possible cpu).  The lower WORK_STRUCT_FLAG_BITS of
+ * work_struct->data are used for flags and thus cwqs need to be
+ * aligned at two's power of the number of flag bits.
  */
 struct cpu_workqueue_struct {
 
@@ -62,7 +64,7 @@ struct cpu_workqueue_struct {
 
 	struct workqueue_struct *wq;		/* I: the owning workqueue */
 	struct task_struct	*thread;
-} ____cacheline_aligned;
+} __attribute__((aligned(1 << WORK_STRUCT_FLAG_BITS)));
 
 /*
  * The externally visible workqueue abstraction is an array of
@@ -1049,6 +1051,15 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
 
 void __init init_workqueues(void)
 {
+	/*
+	 * cwqs are forced aligned according to WORK_STRUCT_FLAG_BITS.
+	 * Make sure that the alignment isn't lower than that of
+	 * unsigned long long in case this code survives for longer
+	 * than twenty years.  :-P
+	 */
+	BUILD_BUG_ON(__alignof__(struct cpu_workqueue_struct) <
+		     __alignof__(unsigned long long));
+
 	alloc_cpumask_var(&cpu_populated_map, GFP_KERNEL);
 
 	cpumask_copy(cpu_populated_map, cpu_online_mask);
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 11/19] workqueue: define both bit position and mask for work flags
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (9 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 10/19] workqueue: update cwq alignment and make one more flag bit available Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01  8:09 ` [PATCH 12/19] workqueue: separate out process_one_work() Tejun Heo
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

Work flags are about to see more traditional mask handling.  Define
WORK_STRUCT_*_BIT as the bit position constants and redefine
WORK_STRUCT_* as the corresponding bit masks.
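
In other words, atomic bitops keep using the _BIT position while code
that composes work->data directly uses the mask form, e.g. (sketch
based on the hunks below):

        /* bit position, for the atomic bitops on work->data */
        if (test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
                return 0;       /* already pending */

        /* bit mask, for building work->data in one store */
        atomic_long_set(&work->data,
                        (unsigned long)cwq | WORK_STRUCT_PENDING | extra_flags);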

NOT_SIGNED_OFF_YET
---
 include/linux/workqueue.h |    8 +++++---
 kernel/workqueue.c        |   12 ++++++------
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 3d11ce3..541c5eb 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -23,7 +23,9 @@ typedef void (*work_func_t)(struct work_struct *work);
 #define work_data_bits(work) ((unsigned long *)(&(work)->data))
 
 enum {
-	WORK_STRUCT_PENDING	= 0,	/* work item is pending execution */
+	WORK_STRUCT_PENDING_BIT	= 0,	/* work item is pending execution */
+
+	WORK_STRUCT_PENDING	= 1 << WORK_STRUCT_PENDING_BIT,
 
 	/*
 	 * Reserve 3bits off of cwq pointer.  This is enough and
@@ -157,7 +159,7 @@ struct execute_work {
  * @work: The work item in question
  */
 #define work_pending(work) \
-	test_bit(WORK_STRUCT_PENDING, work_data_bits(work))
+	test_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))
 
 /**
  * delayed_work_pending - Find out whether a delayable work item is currently
@@ -172,7 +174,7 @@ struct execute_work {
  * @work: The work item in question
  */
 #define work_clear_pending(work) \
-	clear_bit(WORK_STRUCT_PENDING, work_data_bits(work))
+	clear_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))
 
 enum {
 	WQ_FREEZEABLE		= 1 << 0, /* freeze during suspend */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 28de966..1678dd1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -124,8 +124,8 @@ static inline void set_wq_data(struct work_struct *work,
 			       unsigned long extra_flags)
 {
 	BUG_ON(!work_pending(work));
-	atomic_long_set(&work->data, (unsigned long)cwq |
-			(1UL << WORK_STRUCT_PENDING) | extra_flags);
+	atomic_long_set(&work->data,
+			(unsigned long)cwq | WORK_STRUCT_PENDING | extra_flags);
 }
 
 static inline
@@ -230,7 +230,7 @@ queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
 {
 	int ret = 0;
 
-	if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
 		__queue_work(cpu, wq, work);
 		ret = 1;
 	}
@@ -280,7 +280,7 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 	struct timer_list *timer = &dwork->timer;
 	struct work_struct *work = &dwork->work;
 
-	if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
 		BUG_ON(timer_pending(timer));
 		BUG_ON(!list_empty(&work->entry));
 
@@ -412,7 +412,7 @@ static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
 			struct wq_barrier *barr, struct list_head *head)
 {
 	INIT_WORK(&barr->work, wq_barrier_func);
-	__set_bit(WORK_STRUCT_PENDING, work_data_bits(&barr->work));
+	__set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&barr->work));
 	init_completion(&barr->done);
 
 	insert_work(cwq, &barr->work, head, 0);
@@ -519,7 +519,7 @@ static int try_to_grab_pending(struct work_struct *work)
 	struct cpu_workqueue_struct *cwq;
 	int ret = -1;
 
-	if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work)))
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)))
 		return 0;
 
 	/*
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 12/19] workqueue: separate out process_one_work()
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (10 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 11/19] workqueue: define both bit position and mask for work flags Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01  8:09 ` [PATCH 13/19] workqueue: temporarily disable workqueue tracing Tejun Heo
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

Separate process_one_work() out of run_workqueue().  This patch
doesn't cause any behavior change.
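
The split keeps the locking contract explicit: run_workqueue() holds
cwq->lock around the list walk while process_one_work() releases and
regrabs it around the work callback, roughly (sketch of the hunk
below):

        spin_lock_irq(&cwq->lock);
        while (!list_empty(&cwq->worklist)) {
                struct work_struct *work = list_entry(cwq->worklist.next,
                                                      struct work_struct, entry);
                process_one_work(cwq, work);    /* drops and reacquires cwq->lock */
        }
        spin_unlock_irq(&cwq->lock);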

NOT_SIGNED_OFF_YET
---
 kernel/workqueue.c |   98 ++++++++++++++++++++++++++++++++--------------------
 1 files changed, 60 insertions(+), 38 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1678dd1..0f94eb5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -302,50 +302,72 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL_GPL(queue_delayed_work_on);
 
+/**
+ * process_one_work - process single work
+ * @cwq: cwq to process work for
+ * @work: work to process
+ *
+ * Process @work.  This function contains all the logics necessary to
+ * process a single work including synchronization against and
+ * interaction with other workers on the same cpu, queueing and
+ * flushing.  As long as context requirement is met, any worker can
+ * call this function to process a work.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock) which is released and regrabbed.
+ */
+static void process_one_work(struct cpu_workqueue_struct *cwq,
+			     struct work_struct *work)
+{
+	work_func_t f = work->func;
+#ifdef CONFIG_LOCKDEP
+	/*
+	 * It is permissible to free the struct work_struct from
+	 * inside the function that is called from it, this we need to
+	 * take into account for lockdep too.  To avoid bogus "held
+	 * lock freed" warnings as well as problems when looking into
+	 * work->lockdep_map, make a copy and use that here.
+	 */
+	struct lockdep_map lockdep_map = work->lockdep_map;
+#endif
+	/* claim and process */
+	trace_workqueue_execution(cwq->thread, work);
+	cwq->current_work = work;
+	list_del_init(&work->entry);
+
+	spin_unlock_irq(&cwq->lock);
+
+	BUG_ON(get_wq_data(work) != cwq);
+	work_clear_pending(work);
+	lock_map_acquire(&cwq->wq->lockdep_map);
+	lock_map_acquire(&lockdep_map);
+	f(work);
+	lock_map_release(&lockdep_map);
+	lock_map_release(&cwq->wq->lockdep_map);
+
+	if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
+		printk(KERN_ERR "BUG: workqueue leaked lock or atomic: "
+		       "%s/0x%08x/%d\n",
+		       current->comm, preempt_count(), task_pid_nr(current));
+		printk(KERN_ERR "    last function: ");
+		print_symbol("%s\n", (unsigned long)f);
+		debug_show_held_locks(current);
+		dump_stack();
+	}
+
+	spin_lock_irq(&cwq->lock);
+
+	/* we're done with it, release */
+	cwq->current_work = NULL;
+}
+
 static void run_workqueue(struct cpu_workqueue_struct *cwq)
 {
 	spin_lock_irq(&cwq->lock);
 	while (!list_empty(&cwq->worklist)) {
 		struct work_struct *work = list_entry(cwq->worklist.next,
 						struct work_struct, entry);
-		work_func_t f = work->func;
-#ifdef CONFIG_LOCKDEP
-		/*
-		 * It is permissible to free the struct work_struct
-		 * from inside the function that is called from it,
-		 * this we need to take into account for lockdep too.
-		 * To avoid bogus "held lock freed" warnings as well
-		 * as problems when looking into work->lockdep_map,
-		 * make a copy and use that here.
-		 */
-		struct lockdep_map lockdep_map = work->lockdep_map;
-#endif
-		trace_workqueue_execution(cwq->thread, work);
-		cwq->current_work = work;
-		list_del_init(cwq->worklist.next);
-		spin_unlock_irq(&cwq->lock);
-
-		BUG_ON(get_wq_data(work) != cwq);
-		work_clear_pending(work);
-		lock_map_acquire(&cwq->wq->lockdep_map);
-		lock_map_acquire(&lockdep_map);
-		f(work);
-		lock_map_release(&lockdep_map);
-		lock_map_release(&cwq->wq->lockdep_map);
-
-		if (unlikely(in_atomic() || lockdep_depth(current) > 0)) {
-			printk(KERN_ERR "BUG: workqueue leaked lock or atomic: "
-					"%s/0x%08x/%d\n",
-					current->comm, preempt_count(),
-				       	task_pid_nr(current));
-			printk(KERN_ERR "    last function: ");
-			print_symbol("%s\n", (unsigned long)f);
-			debug_show_held_locks(current);
-			dump_stack();
-		}
-
-		spin_lock_irq(&cwq->lock);
-		cwq->current_work = NULL;
+		process_one_work(cwq, work);
 	}
 	spin_unlock_irq(&cwq->lock);
 }
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 13/19] workqueue: temporarily disable workqueue tracing
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (11 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 12/19] workqueue: separate out process_one_work() Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01  8:09 ` [PATCH 14/19] workqueue: (TEMPORARY) kill singlethread variant Tejun Heo
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

Strip tracing code from workqueue and disable workqueue tracing.  This
is a temporary measure until concurrency managed workqueue is complete.

NOT_SIGNED_OFF_YET
---
 kernel/trace/Kconfig |    4 +++-
 kernel/workqueue.c   |   14 +++-----------
 2 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index b416512..0e14ecf 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -393,7 +393,9 @@ config KMEMTRACE
 	  If unsure, say N.
 
 config WORKQUEUE_TRACER
-	bool "Trace workqueues"
+# Temporarily disabled during workqueue reimplementation
+#	bool "Trace workqueues"
+	def_bool n
 	select GENERIC_TRACER
 	help
 	  The workqueue tracer provides some statistical informations
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0f94eb5..39a04ec 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -33,8 +33,6 @@
 #include <linux/kallsyms.h>
 #include <linux/debug_locks.h>
 #include <linux/lockdep.h>
-#define CREATE_TRACE_POINTS
-#include <trace/events/workqueue.h>
 
 #include "sched_workqueue.h"
 
@@ -128,10 +126,10 @@ static inline void set_wq_data(struct work_struct *work,
 			(unsigned long)cwq | WORK_STRUCT_PENDING | extra_flags);
 }
 
-static inline
-struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
+static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
 {
-	return (void *) (atomic_long_read(&work->data) & WORK_STRUCT_WQ_DATA_MASK);
+	return (void *)(atomic_long_read(&work->data) &
+			WORK_STRUCT_WQ_DATA_MASK);
 }
 
 /*
@@ -166,8 +164,6 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
 			struct work_struct *work, struct list_head *head,
 			unsigned int extra_flags)
 {
-	trace_workqueue_insertion(cwq->thread, work);
-
 	/* we own @work, set data and link */
 	set_wq_data(work, cwq, extra_flags);
 
@@ -331,7 +327,6 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
 	struct lockdep_map lockdep_map = work->lockdep_map;
 #endif
 	/* claim and process */
-	trace_workqueue_execution(cwq->thread, work);
 	cwq->current_work = work;
 	list_del_init(&work->entry);
 
@@ -838,8 +833,6 @@ static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
 		return PTR_ERR(p);
 	cwq->thread = p;
 
-	trace_workqueue_creation(cwq->thread, cpu);
-
 	return 0;
 }
 
@@ -944,7 +937,6 @@ static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq)
 	 * checks list_empty(), and a "normal" queue_work() can't use
 	 * a dead CPU.
 	 */
-	trace_workqueue_destruction(cwq->thread);
 	kthread_stop(cwq->thread);
 	cwq->thread = NULL;
 }
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 14/19] workqueue: (TEMPORARY) kill singlethread variant
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (12 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 13/19] workqueue: temporarily disable workqueue tracing Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01  8:09 ` [PATCH 15/19] workqueue: reimplement workqueue flushing using color coded works Tejun Heo
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

This is incorrect.  There are workqueue users which depend on a single
thread for synchronization purposes.  Working on a proper solution.

NOT_SIGNED_OFF
---
 include/linux/workqueue.h |    5 +-
 kernel/workqueue.c        |  128 ++++++++++++---------------------------------
 2 files changed, 36 insertions(+), 97 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 541c5eb..5aa0e15 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -178,7 +178,6 @@ struct execute_work {
 
 enum {
 	WQ_FREEZEABLE		= 1 << 0, /* freeze during suspend */
-	WQ_SINGLE_THREAD	= 1 << 1, /* no per-cpu worker */
 };
 
 extern struct workqueue_struct *
@@ -207,9 +206,9 @@ __create_workqueue_key(const char *name, unsigned int flags,
 #define create_workqueue(name)					\
 	__create_workqueue((name), 0)
 #define create_freezeable_workqueue(name)			\
-	__create_workqueue((name), WQ_FREEZEABLE | WQ_SINGLE_THREAD)
+	__create_workqueue((name), WQ_FREEZEABLE)
 #define create_singlethread_workqueue(name)			\
-	__create_workqueue((name), WQ_SINGLE_THREAD)
+	__create_workqueue((name), 0)
 
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 39a04ec..6370c9b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -47,8 +47,7 @@
  */
 
 /*
- * The per-CPU workqueue (if single thread, we always use the first
- * possible cpu).  The lower WORK_STRUCT_FLAG_BITS of
+ * The per-CPU workqueue.  The lower WORK_STRUCT_FLAG_BITS of
  * work_struct->data are used for flags and thus cwqs need to be
  * aligned at two's power of the number of flag bits.
  */
@@ -82,34 +81,9 @@ struct workqueue_struct {
 static DEFINE_SPINLOCK(workqueue_lock);
 static LIST_HEAD(workqueues);
 
-static int singlethread_cpu __read_mostly;
-static const struct cpumask *cpu_singlethread_map __read_mostly;
-/*
- * _cpu_down() first removes CPU from cpu_online_map, then CPU_DEAD
- * flushes cwq->worklist. This means that flush_workqueue/wait_on_work
- * which comes in between can't use for_each_online_cpu(). We could
- * use cpu_possible_map, the cpumask below is more a documentation
- * than optimization.
- */
-static cpumask_var_t cpu_populated_map __read_mostly;
-
-/* If it's single threaded, it isn't in the list of workqueues. */
-static inline bool is_wq_single_threaded(struct workqueue_struct *wq)
-{
-	return wq->flags & WQ_SINGLE_THREAD;
-}
-
-static const struct cpumask *wq_cpu_map(struct workqueue_struct *wq)
-{
-	return is_wq_single_threaded(wq)
-		? cpu_singlethread_map : cpu_populated_map;
-}
-
 static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
 					    struct workqueue_struct *wq)
 {
-	if (unlikely(is_wq_single_threaded(wq)))
-		cpu = singlethread_cpu;
 	return per_cpu_ptr(wq->cpu_wq, cpu);
 }
 
@@ -467,13 +441,12 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
  */
 void flush_workqueue(struct workqueue_struct *wq)
 {
-	const struct cpumask *cpu_map = wq_cpu_map(wq);
 	int cpu;
 
 	might_sleep();
 	lock_map_acquire(&wq->lockdep_map);
 	lock_map_release(&wq->lockdep_map);
-	for_each_cpu(cpu, cpu_map)
+	for_each_possible_cpu(cpu)
 		flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
 }
 EXPORT_SYMBOL_GPL(flush_workqueue);
@@ -587,7 +560,6 @@ static void wait_on_work(struct work_struct *work)
 {
 	struct cpu_workqueue_struct *cwq;
 	struct workqueue_struct *wq;
-	const struct cpumask *cpu_map;
 	int cpu;
 
 	might_sleep();
@@ -600,9 +572,8 @@ static void wait_on_work(struct work_struct *work)
 		return;
 
 	wq = cwq->wq;
-	cpu_map = wq_cpu_map(wq);
 
-	for_each_cpu(cpu, cpu_map)
+	for_each_possible_cpu(cpu)
 		wait_on_cpu_work(get_cwq(cpu, wq), work);
 }
 
@@ -801,26 +772,12 @@ int current_is_keventd(void)
 	return is_sched_workqueue(current);
 }
 
-static struct cpu_workqueue_struct *
-init_cpu_workqueue(struct workqueue_struct *wq, int cpu)
-{
-	struct cpu_workqueue_struct *cwq = per_cpu_ptr(wq->cpu_wq, cpu);
-
-	cwq->wq = wq;
-	spin_lock_init(&cwq->lock);
-	INIT_LIST_HEAD(&cwq->worklist);
-	init_waitqueue_head(&cwq->more_work);
-
-	return cwq;
-}
-
 static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
 {
 	struct workqueue_struct *wq = cwq->wq;
-	const char *fmt = is_wq_single_threaded(wq) ? "%s" : "%s/%d";
 	struct task_struct *p;
 
-	p = kthread_create(worker_thread, cwq, fmt, wq->name, cpu);
+	p = kthread_create(worker_thread, cwq, "%s/%d", wq->name, cpu);
 	/*
 	 * Nobody can add the work_struct to this cwq,
 	 *	if (caller is __create_workqueue)
@@ -853,7 +810,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 						const char *lock_name)
 {
 	struct workqueue_struct *wq;
-	struct cpu_workqueue_struct *cwq;
 	int err = 0, cpu;
 
 	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
@@ -869,36 +825,36 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
 	INIT_LIST_HEAD(&wq->list);
 
-	if (flags & WQ_SINGLE_THREAD) {
-		cwq = init_cpu_workqueue(wq, singlethread_cpu);
-		err = create_workqueue_thread(cwq, singlethread_cpu);
-		start_workqueue_thread(cwq, -1);
-	} else {
-		cpu_maps_update_begin();
-		/*
-		 * We must place this wq on list even if the code below fails.
-		 * cpu_down(cpu) can remove cpu from cpu_populated_map before
-		 * destroy_workqueue() takes the lock, in that case we leak
-		 * cwq[cpu]->thread.
-		 */
-		spin_lock(&workqueue_lock);
-		list_add(&wq->list, &workqueues);
-		spin_unlock(&workqueue_lock);
-		/*
-		 * We must initialize cwqs for each possible cpu even if we
-		 * are going to call destroy_workqueue() finally. Otherwise
-		 * cpu_up() can hit the uninitialized cwq once we drop the
-		 * lock.
-		 */
-		for_each_possible_cpu(cpu) {
-			cwq = init_cpu_workqueue(wq, cpu);
-			if (err || !cpu_online(cpu))
-				continue;
-			err = create_workqueue_thread(cwq, cpu);
-			start_workqueue_thread(cwq, cpu);
-		}
-		cpu_maps_update_done();
+	cpu_maps_update_begin();
+	/*
+	 * We must place this wq on list even if the code below fails.
+	 * cpu_down(cpu) can remove cpu from cpu_populated_map before
+	 * destroy_workqueue() takes the lock, in that case we leak
+	 * cwq[cpu]->thread.
+	 */
+	spin_lock(&workqueue_lock);
+	list_add(&wq->list, &workqueues);
+	spin_unlock(&workqueue_lock);
+	/*
+	 * We must initialize cwqs for each possible cpu even if we
+	 * are going to call destroy_workqueue() finally. Otherwise
+	 * cpu_up() can hit the uninitialized cwq once we drop the
+	 * lock.
+	 */
+	for_each_possible_cpu(cpu) {
+		struct cpu_workqueue_struct *cwq = per_cpu_ptr(wq->cpu_wq, cpu);
+
+		cwq->wq = wq;
+		spin_lock_init(&cwq->lock);
+		INIT_LIST_HEAD(&cwq->worklist);
+		init_waitqueue_head(&cwq->more_work);
+
+		if (err || !cpu_online(cpu))
+			continue;
+		err = create_workqueue_thread(cwq, cpu);
+		start_workqueue_thread(cwq, cpu);
 	}
+	cpu_maps_update_done();
 
 	if (err) {
 		destroy_workqueue(wq);
@@ -949,7 +905,6 @@ static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq)
  */
 void destroy_workqueue(struct workqueue_struct *wq)
 {
-	const struct cpumask *cpu_map = wq_cpu_map(wq);
 	int cpu;
 
 	cpu_maps_update_begin();
@@ -957,7 +912,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
 	list_del(&wq->list);
 	spin_unlock(&workqueue_lock);
 
-	for_each_cpu(cpu, cpu_map)
+	for_each_possible_cpu(cpu)
 		cleanup_workqueue_thread(per_cpu_ptr(wq->cpu_wq, cpu));
  	cpu_maps_update_done();
 
@@ -977,10 +932,6 @@ static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
 
 	action &= ~CPU_TASKS_FROZEN;
 
-	switch (action) {
-	case CPU_UP_PREPARE:
-		cpumask_set_cpu(cpu, cpu_populated_map);
-	}
 undo:
 	list_for_each_entry(wq, &workqueues, list) {
 		cwq = per_cpu_ptr(wq->cpu_wq, cpu);
@@ -1007,12 +958,6 @@ undo:
 		}
 	}
 
-	switch (action) {
-	case CPU_UP_CANCELED:
-	case CPU_POST_DEAD:
-		cpumask_clear_cpu(cpu, cpu_populated_map);
-	}
-
 	return ret;
 }
 
@@ -1074,11 +1019,6 @@ void __init init_workqueues(void)
 	BUILD_BUG_ON(__alignof__(struct cpu_workqueue_struct) <
 		     __alignof__(unsigned long long));
 
-	alloc_cpumask_var(&cpu_populated_map, GFP_KERNEL);
-
-	cpumask_copy(cpu_populated_map, cpu_online_mask);
-	singlethread_cpu = cpumask_first(cpu_possible_mask);
-	cpu_singlethread_map = cpumask_of(singlethread_cpu);
 	hotcpu_notifier(workqueue_cpu_callback, 0);
 	keventd_wq = create_workqueue("events");
 	BUG_ON(!keventd_wq);
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 15/19] workqueue: reimplement workqueue flushing using color coded works
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (13 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 14/19] workqueue: (TEMPORARY) kill singlethread variant Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01 17:03   ` Linus Torvalds
  2009-10-01  8:09 ` [PATCH 16/19] workqueue: introduce worker Tejun Heo
                   ` (11 subsequent siblings)
  26 siblings, 1 reply; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

Reimplement workqueue flushing using color coded works.  There are two
colors and each cwq has a current color which is painted on the works
being issued via the cwq.  Flushing a workqueue is achieved by flipping
the current color of each cwq and waiting for the works which carry the
old color to drain.  This new implementation is to allow having and
sharing multiple workers per cpu.  One restriction this implementation
has is that only a single workqueue flush can be in progress at any
given time.  If one is in progress, others have to wait for their turn.

This new flush implementation leaves cleanup_workqueue_thread() as the
only user of flush_cpu_workqueue().  Just make its users use
flush_workqueue() and kthread_stop() directly and kill
cleanup_workqueue_thread().  As workqueue flushing doesn't use barrier
requests anymore, the comment describing the complex synchronization
around it in cleanup_workqueue_thread() is removed together with the
function.
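
Schematically, flush_workqueue() does the following per cwq under
cwq->lock (simplified sketch of the code below):

        /* new works will be painted with the flipped color */
        cwq->flush_color ^= WORK_STRUCT_COLOR;
        /* works still carrying the old color have to drain */
        cwq->flush_cnt = cwq->nr_in_flight;
        if (cwq->flush_cnt)
                atomic_inc(&wq->nr_cwqs_to_flush);

and cwq_dec_nr_in_flight() counts the old-color works back down, with
the last one waking the flusher:

        if (work_color ^ cwq->flush_color && !--cwq->flush_cnt &&
            atomic_dec_and_test(&cwq->wq->nr_cwqs_to_flush))
                complete(cwq->wq->flush_done);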

NOT_SIGNED_OFF_YET
---
 include/linux/workqueue.h |    2 +
 kernel/workqueue.c        |  151 ++++++++++++++++++++++++++++-----------------
 2 files changed, 97 insertions(+), 56 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 5aa0e15..78fd6eb 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -24,8 +24,10 @@ typedef void (*work_func_t)(struct work_struct *work);
 
 enum {
 	WORK_STRUCT_PENDING_BIT	= 0,	/* work item is pending execution */
+	WORK_STRUCT_COLOR_BIT	= 1,	/* color for workqueue flushing */
 
 	WORK_STRUCT_PENDING	= 1 << WORK_STRUCT_PENDING_BIT,
+	WORK_STRUCT_COLOR	= 1 << WORK_STRUCT_COLOR_BIT,
 
 	/*
 	 * Reserve 3bits off of cwq pointer.  This is enough and
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6370c9b..269f6c5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -59,6 +59,9 @@ struct cpu_workqueue_struct {
 	wait_queue_head_t more_work;
 	struct work_struct *current_work;
 
+	int			nr_in_flight;	/* L: nr of in_flight works */
+	unsigned int		flush_color;	/* L: current flush color */
+	int			flush_cnt;	/* L: in-progress flush count */
 	struct workqueue_struct *wq;		/* I: the owning workqueue */
 	struct task_struct	*thread;
 } __attribute__((aligned(1 << WORK_STRUCT_FLAG_BITS)));
@@ -71,6 +74,11 @@ struct workqueue_struct {
 	unsigned int		flags;		/* I: WQ_* flags */
 	struct cpu_workqueue_struct *cpu_wq;	/* I: cwq's */
 	struct list_head	list;		/* W: list of all workqueues */
+
+	struct mutex		flush_mutex;	/* single flush at a time */
+	atomic_t		nr_cwqs_to_flush; /* flush in progress */
+	struct completion	*flush_done;	/* flush done */
+
 	const char		*name;		/* I: workqueue name */
 #ifdef CONFIG_LOCKDEP
 	struct lockdep_map	lockdep_map;
@@ -138,8 +146,10 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
 			struct work_struct *work, struct list_head *head,
 			unsigned int extra_flags)
 {
+	cwq->nr_in_flight++;
+
 	/* we own @work, set data and link */
-	set_wq_data(work, cwq, extra_flags);
+	set_wq_data(work, cwq, cwq->flush_color | extra_flags);
 
 	/*
 	 * Ensure that we get the right work->data if we see the
@@ -273,6 +283,28 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 EXPORT_SYMBOL_GPL(queue_delayed_work_on);
 
 /**
+ * cwq_dec_nr_in_flight - decrement cwq's nr_in_flight
+ * @cwq: cwq of interest
+ * @work_color: color of work which left the queue
+ *
+ * A work either has completed or is removed from pending queue,
+ * decrement nr_in_flight of its cwq and handle workqueue flushing.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
+static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq,
+				 unsigned int work_color)
+{
+	cwq->nr_in_flight--;
+	if (unlikely(cwq->flush_cnt)) {
+		if (work_color ^ cwq->flush_color && !--cwq->flush_cnt &&
+		    atomic_dec_and_test(&cwq->wq->nr_cwqs_to_flush))
+			complete(cwq->wq->flush_done);
+	}
+}
+
+/**
  * process_one_work - process single work
  * @cwq: cwq to process work for
  * @work: work to process
@@ -290,6 +322,7 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
 			     struct work_struct *work)
 {
 	work_func_t f = work->func;
+	unsigned int work_color;
 #ifdef CONFIG_LOCKDEP
 	/*
 	 * It is permissible to free the struct work_struct from
@@ -302,6 +335,7 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
 #endif
 	/* claim and process */
 	cwq->current_work = work;
+	work_color = *work_data_bits(work) & WORK_STRUCT_COLOR;
 	list_del_init(&work->entry);
 
 	spin_unlock_irq(&cwq->lock);
@@ -328,6 +362,7 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
 
 	/* we're done with it, release */
 	cwq->current_work = NULL;
+	cwq_dec_nr_in_flight(cwq, work_color);
 }
 
 static void run_workqueue(struct cpu_workqueue_struct *cwq)
@@ -409,26 +444,6 @@ static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
 	insert_work(cwq, &barr->work, head, 0);
 }
 
-static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
-{
-	int active = 0;
-	struct wq_barrier barr;
-
-	WARN_ON(cwq->thread == current);
-
-	spin_lock_irq(&cwq->lock);
-	if (!list_empty(&cwq->worklist) || cwq->current_work != NULL) {
-		insert_wq_barrier(cwq, &barr, &cwq->worklist);
-		active = 1;
-	}
-	spin_unlock_irq(&cwq->lock);
-
-	if (active)
-		wait_for_completion(&barr.done);
-
-	return active;
-}
-
 /**
  * flush_workqueue - ensure that any scheduled work has run to completion.
  * @wq: workqueue to flush
@@ -441,13 +456,44 @@ static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
  */
 void flush_workqueue(struct workqueue_struct *wq)
 {
-	int cpu;
+	DECLARE_COMPLETION_ONSTACK(flush_done);
+	bool wait = false;
+	unsigned int cpu;
 
-	might_sleep();
 	lock_map_acquire(&wq->lockdep_map);
 	lock_map_release(&wq->lockdep_map);
-	for_each_possible_cpu(cpu)
-		flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
+
+	/* only single flush can be in progress at any given time */
+	mutex_lock(&wq->flush_mutex);
+
+	BUG_ON(atomic_read(&wq->nr_cwqs_to_flush) || wq->flush_done);
+
+	wq->flush_done = &flush_done;
+
+	for_each_possible_cpu(cpu) {
+		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+		spin_lock_irq(&cwq->lock);
+
+		BUG_ON(cwq->flush_cnt);
+
+		cwq->flush_color ^= WORK_STRUCT_COLOR;
+		cwq->flush_cnt = cwq->nr_in_flight;
+
+		if (cwq->flush_cnt) {
+			atomic_inc(&wq->nr_cwqs_to_flush);
+			wait = true;
+		}
+
+		spin_unlock_irq(&cwq->lock);
+	}
+
+	if (wait)
+		wait_for_completion(&flush_done);
+
+	wq->flush_done = NULL;
+
+	mutex_unlock(&wq->flush_mutex);
 }
 EXPORT_SYMBOL_GPL(flush_workqueue);
 
@@ -531,6 +577,8 @@ static int try_to_grab_pending(struct work_struct *work)
 		smp_rmb();
 		if (cwq == get_wq_data(work)) {
 			list_del_init(&work->entry);
+			cwq_dec_nr_in_flight(cwq,
+				*work_data_bits(work) & WORK_STRUCT_COLOR);
 			ret = 1;
 		}
 	}
@@ -821,6 +869,8 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 		goto err;
 
 	wq->flags = flags;
+	mutex_init(&wq->flush_mutex);
+	atomic_set(&wq->nr_cwqs_to_flush, 0);
 	wq->name = name;
 	lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
 	INIT_LIST_HEAD(&wq->list);
@@ -842,7 +892,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	 * lock.
 	 */
 	for_each_possible_cpu(cpu) {
-		struct cpu_workqueue_struct *cwq = per_cpu_ptr(wq->cpu_wq, cpu);
+		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 
 		cwq->wq = wq;
 		spin_lock_init(&cwq->lock);
@@ -870,33 +920,6 @@ err:
 }
 EXPORT_SYMBOL_GPL(__create_workqueue_key);
 
-static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq)
-{
-	/*
-	 * Our caller is either destroy_workqueue() or CPU_POST_DEAD,
-	 * cpu_add_remove_lock protects cwq->thread.
-	 */
-	if (cwq->thread == NULL)
-		return;
-
-	lock_map_acquire(&cwq->wq->lockdep_map);
-	lock_map_release(&cwq->wq->lockdep_map);
-
-	flush_cpu_workqueue(cwq);
-	/*
-	 * If the caller is CPU_POST_DEAD and cwq->worklist was not empty,
-	 * a concurrent flush_workqueue() can insert a barrier after us.
-	 * However, in that case run_workqueue() won't return and check
-	 * kthread_should_stop() until it flushes all work_struct's.
-	 * When ->worklist becomes empty it is safe to exit because no
-	 * more work_structs can be queued on this cwq: flush_workqueue
-	 * checks list_empty(), and a "normal" queue_work() can't use
-	 * a dead CPU.
-	 */
-	kthread_stop(cwq->thread);
-	cwq->thread = NULL;
-}
-
 /**
  * destroy_workqueue - safely terminate a workqueue
  * @wq: target workqueue
@@ -912,8 +935,19 @@ void destroy_workqueue(struct workqueue_struct *wq)
 	list_del(&wq->list);
 	spin_unlock(&workqueue_lock);
 
-	for_each_possible_cpu(cpu)
-		cleanup_workqueue_thread(per_cpu_ptr(wq->cpu_wq, cpu));
+	flush_workqueue(wq);
+
+	for_each_possible_cpu(cpu) {
+		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+		/* cpu_add_remove_lock protects cwq->thread */
+		if (cwq->thread) {
+			kthread_stop(cwq->thread);
+			cwq->thread = NULL;
+		}
+		BUG_ON(cwq->nr_in_flight);
+	}
+
  	cpu_maps_update_done();
 
 	free_percpu(wq->cpu_wq);
@@ -953,7 +987,12 @@ undo:
 		case CPU_UP_CANCELED:
 			start_workqueue_thread(cwq, -1);
 		case CPU_POST_DEAD:
-			cleanup_workqueue_thread(cwq);
+			flush_workqueue(wq);
+			/* cpu_add_remove_lock protects cwq->thread */
+			if (cwq->thread) {
+				kthread_stop(cwq->thread);
+				cwq->thread = NULL;
+			}
 			break;
 		}
 	}
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 16/19] workqueue: introduce worker
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (14 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 15/19] workqueue: reimplement workqueue flushing using color coded works Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01  8:09 ` [PATCH 17/19] workqueue: reimplement work flushing using linked works Tejun Heo
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

Separate worker thread related information out of struct
cpu_workqueue_struct into a new struct worker and implement helper
functions to deal with it.  This patch only shifts things around
without any actual behavior change.  This is in preparation for
concurrency managed workqueue, where multiple shared workers will be
available per cpu.
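
The helpers give cwq setup and teardown a simple lifecycle to follow
(sketch based on the patch, error handling trimmed):

        struct worker *worker;

        worker = create_worker(cwq, true);      /* kthread created, bound to cwq->cpu */
        if (worker)
                start_worker(worker);           /* wake the task; it runs worker_thread() */

        /* and on teardown */
        destroy_worker(worker);                 /* kthread_stop() + kfree() */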

NOT_SIGNED_OFF_YET
---
 kernel/workqueue.c |  186 +++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 125 insertions(+), 61 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 269f6c5..f10fe4a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -46,6 +46,14 @@
  * W: workqueue_lock protected.
  */
 
+struct cpu_workqueue_struct;
+
+struct worker {
+	struct work_struct	*current_work;	/* L: work being processed */
+	struct task_struct	*task;		/* I: worker task */
+	struct cpu_workqueue_struct *cwq;	/* I: the associated cwq */
+};
+
 /*
  * The per-CPU workqueue.  The lower WORK_STRUCT_FLAG_BITS of
  * work_struct->data are used for flags and thus cwqs need to be
@@ -57,13 +65,13 @@ struct cpu_workqueue_struct {
 
 	struct list_head worklist;
 	wait_queue_head_t more_work;
-	struct work_struct *current_work;
+	unsigned int		cpu;
+	struct worker		*worker;
 
 	int			nr_in_flight;	/* L: nr of in_flight works */
 	unsigned int		flush_color;	/* L: current flush color */
 	int			flush_cnt;	/* L: in-progress flush count */
 	struct workqueue_struct *wq;		/* I: the owning workqueue */
-	struct task_struct	*thread;
 } __attribute__((aligned(1 << WORK_STRUCT_FLAG_BITS)));
 
 /*
@@ -89,6 +97,8 @@ struct workqueue_struct {
 static DEFINE_SPINLOCK(workqueue_lock);
 static LIST_HEAD(workqueues);
 
+static int worker_thread(void *__worker);
+
 static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
 					    struct workqueue_struct *wq)
 {
@@ -282,6 +292,82 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL_GPL(queue_delayed_work_on);
 
+static struct worker *alloc_worker(void)
+{
+	struct worker *worker;
+
+	worker = kzalloc(sizeof(*worker), GFP_KERNEL);
+	return worker;
+}
+
+/**
+ * create_worker - create a new workqueue worker
+ * @cwq: cwq the new worker will belong to
+ * @bind: whether to set affinity to @cpu or not
+ *
+ * Create a new worker which is bound to @cwq.  The returned worker
+ * can be started by calling start_worker() or destroyed using
+ * destroy_worker().
+ *
+ * CONTEXT:
+ * Might sleep.  Does GFP_KERNEL allocations.
+ *
+ * RETURNS:
+ * Pointer to the newly created worker.
+ */
+static struct worker *create_worker(struct cpu_workqueue_struct *cwq, bool bind)
+{
+	struct worker *worker;
+
+	worker = alloc_worker();
+	if (!worker)
+		goto fail;
+
+	worker->cwq = cwq;
+
+	worker->task = kthread_create(worker_thread, worker, "kworker/%u",
+				      cwq->cpu);
+	if (IS_ERR(worker->task))
+		goto fail;
+
+	if (bind)
+		kthread_bind(worker->task, cwq->cpu);
+
+	return worker;
+fail:
+	kfree(worker);
+	return NULL;
+}
+
+/**
+ * start_worker - start a newly created worker
+ * @worker: worker to start
+ *
+ * Start @worker.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
+static void start_worker(struct worker *worker)
+{
+	wake_up_process(worker->task);
+}
+
+/**
+ * destroy_worker - destroy a workqueue worker
+ * @worker: worker to be destroyed
+ *
+ * Destroy @worker.
+ */
+static void destroy_worker(struct worker *worker)
+{
+	/* sanity check frenzy */
+	BUG_ON(worker->current_work);
+
+	kthread_stop(worker->task);
+	kfree(worker);
+}
+
 /**
  * cwq_dec_nr_in_flight - decrement cwq's nr_in_flight
  * @cwq: cwq of interest
@@ -306,7 +392,7 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq,
 
 /**
  * process_one_work - process single work
- * @cwq: cwq to process work for
+ * @worker: self
  * @work: work to process
  *
  * Process @work.  This function contains all the logics necessary to
@@ -318,9 +404,9 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq,
  * CONTEXT:
  * spin_lock_irq(cwq->lock) which is released and regrabbed.
  */
-static void process_one_work(struct cpu_workqueue_struct *cwq,
-			     struct work_struct *work)
+static void process_one_work(struct worker *worker, struct work_struct *work)
 {
+	struct cpu_workqueue_struct *cwq = worker->cwq;
 	work_func_t f = work->func;
 	unsigned int work_color;
 #ifdef CONFIG_LOCKDEP
@@ -334,7 +420,7 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
 	struct lockdep_map lockdep_map = work->lockdep_map;
 #endif
 	/* claim and process */
-	cwq->current_work = work;
+	worker->current_work = work;
 	work_color = *work_data_bits(work) & WORK_STRUCT_COLOR;
 	list_del_init(&work->entry);
 
@@ -361,30 +447,33 @@ static void process_one_work(struct cpu_workqueue_struct *cwq,
 	spin_lock_irq(&cwq->lock);
 
 	/* we're done with it, release */
-	cwq->current_work = NULL;
+	worker->current_work = NULL;
 	cwq_dec_nr_in_flight(cwq, work_color);
 }
 
-static void run_workqueue(struct cpu_workqueue_struct *cwq)
+static void run_workqueue(struct worker *worker)
 {
+	struct cpu_workqueue_struct *cwq = worker->cwq;
+
 	spin_lock_irq(&cwq->lock);
 	while (!list_empty(&cwq->worklist)) {
 		struct work_struct *work = list_entry(cwq->worklist.next,
 						struct work_struct, entry);
-		process_one_work(cwq, work);
+		process_one_work(worker, work);
 	}
 	spin_unlock_irq(&cwq->lock);
 }
 
 /**
  * worker_thread - the worker thread function
- * @__cwq: cwq to serve
+ * @__worker: self
  *
  * The cwq worker thread function.
  */
-static int worker_thread(void *__cwq)
+static int worker_thread(void *__worker)
 {
-	struct cpu_workqueue_struct *cwq = __cwq;
+	struct worker *worker = __worker;
+	struct cpu_workqueue_struct *cwq = worker->cwq;
 	DEFINE_WAIT(wait);
 
 	/* set workqueue scheduler */
@@ -406,7 +495,7 @@ static int worker_thread(void *__cwq)
 		if (kthread_should_stop())
 			break;
 
-		run_workqueue(cwq);
+		run_workqueue(worker);
 	}
 
 	return 0;
@@ -532,7 +621,7 @@ int flush_work(struct work_struct *work)
 			goto already_gone;
 		prev = &work->entry;
 	} else {
-		if (cwq->current_work != work)
+		if (!cwq->worker || cwq->worker->current_work != work)
 			goto already_gone;
 		prev = &cwq->worklist;
 	}
@@ -594,7 +683,7 @@ static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,
 	int running = 0;
 
 	spin_lock_irq(&cwq->lock);
-	if (unlikely(cwq->current_work == work)) {
+	if (unlikely(cwq->worker && cwq->worker->current_work == work)) {
 		insert_wq_barrier(cwq, &barr, cwq->worklist.next);
 		running = 1;
 	}
@@ -820,45 +909,14 @@ int current_is_keventd(void)
 	return is_sched_workqueue(current);
 }
 
-static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
-{
-	struct workqueue_struct *wq = cwq->wq;
-	struct task_struct *p;
-
-	p = kthread_create(worker_thread, cwq, "%s/%d", wq->name, cpu);
-	/*
-	 * Nobody can add the work_struct to this cwq,
-	 *	if (caller is __create_workqueue)
-	 *		nobody should see this wq
-	 *	else // caller is CPU_UP_PREPARE
-	 *		cpu is not on cpu_online_map
-	 * so we can abort safely.
-	 */
-	if (IS_ERR(p))
-		return PTR_ERR(p);
-	cwq->thread = p;
-
-	return 0;
-}
-
-static void start_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
-{
-	struct task_struct *p = cwq->thread;
-
-	if (p != NULL) {
-		if (cpu >= 0)
-			kthread_bind(p, cpu);
-		wake_up_process(p);
-	}
-}
-
 struct workqueue_struct *__create_workqueue_key(const char *name,
 						unsigned int flags,
 						struct lock_class_key *key,
 						const char *lock_name)
 {
 	struct workqueue_struct *wq;
-	int err = 0, cpu;
+	bool failed = false;
+	unsigned int cpu;
 
 	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
 	if (!wq)
@@ -894,19 +952,23 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	for_each_possible_cpu(cpu) {
 		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 
+		cwq->cpu = cpu;
 		cwq->wq = wq;
 		spin_lock_init(&cwq->lock);
 		INIT_LIST_HEAD(&cwq->worklist);
 		init_waitqueue_head(&cwq->more_work);
 
-		if (err || !cpu_online(cpu))
+		if (failed || !cpu_online(cpu))
 			continue;
-		err = create_workqueue_thread(cwq, cpu);
-		start_workqueue_thread(cwq, cpu);
+		cwq->worker = create_worker(cwq, true);
+		if (cwq->worker)
+			start_worker(cwq->worker);
+		else
+			failed = true;
 	}
 	cpu_maps_update_done();
 
-	if (err) {
+	if (failed) {
 		destroy_workqueue(wq);
 		wq = NULL;
 	}
@@ -941,9 +1003,9 @@ void destroy_workqueue(struct workqueue_struct *wq)
 		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 
 		/* cpu_add_remove_lock protects cwq->thread */
-		if (cwq->thread) {
-			kthread_stop(cwq->thread);
-			cwq->thread = NULL;
+		if (cwq->worker) {
+			destroy_worker(cwq->worker);
+			cwq->worker = NULL;
 		}
 		BUG_ON(cwq->nr_in_flight);
 	}
@@ -972,7 +1034,8 @@ undo:
 
 		switch (action) {
 		case CPU_UP_PREPARE:
-			if (!create_workqueue_thread(cwq, cpu))
+			cwq->worker = create_worker(cwq, false);
+			if (cwq->worker)
 				break;
 			printk(KERN_ERR "workqueue [%s] for %i failed\n",
 				wq->name, cpu);
@@ -981,17 +1044,18 @@ undo:
 			goto undo;
 
 		case CPU_ONLINE:
-			start_workqueue_thread(cwq, cpu);
+			kthread_bind(cwq->worker->task, cpu);
+			start_worker(cwq->worker);
 			break;
 
 		case CPU_UP_CANCELED:
-			start_workqueue_thread(cwq, -1);
+			start_worker(cwq->worker);
 		case CPU_POST_DEAD:
 			flush_workqueue(wq);
 			/* cpu_add_remove_lock protects cwq->thread */
-			if (cwq->thread) {
-				kthread_stop(cwq->thread);
-				cwq->thread = NULL;
+			if (cwq->worker) {
+				destroy_worker(cwq->worker);
+				cwq->worker = NULL;
 			}
 			break;
 		}
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 17/19] workqueue: reimplement work flushing using linked works
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (15 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 16/19] workqueue: introduce worker Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01  8:09 ` [PATCH 18/19] workqueue: reimplement workqueue freeze using cwq->frozen_works queue Tejun Heo
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

A work is linked to the next one by having the WORK_STRUCT_LINKED bit
set and these links can be chained.  When a linked work is dispatched
to a worker, all works linked to it are dispatched to the worker's
newly added ->scheduled queue and processed back-to-back.

Currently, as there's only a single worker per cwq, having linked
works doesn't make any visible difference in behavior.  This change
is preparation for multiple shared workers per cpu.
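
As an illustration of the chaining rule only, here is a minimal
user-space sketch.  The array-backed worklist, the toy_work type and
the main() driver are made up for this example and are not part of
the patch; only the "a LINKED work drags its successor along" rule
mirrors the list walk done by schedule_work_to_worker().

/*
 * Toy model of linked-work dispatch.  works[] stands in for the cwq
 * worklist and .linked for the WORK_STRUCT_LINKED bit.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_work {
	const char	*name;
	bool		linked;		/* WORK_STRUCT_LINKED stand-in */
};

/* move works[from..] to ->scheduled until the chain ends */
static int schedule_chain(struct toy_work *works, int n, int from)
{
	int i;

	for (i = from; i < n; i++) {
		printf("move %s to ->scheduled\n", works[i].name);
		if (!works[i].linked)	/* chain ends at the first !LINKED work */
			break;
	}
	return i + 1;			/* next position for the outer walk */
}

int main(void)
{
	struct toy_work works[] = {
		{ "A", true }, { "B", true }, { "C", false }, { "D", false },
	};
	int n = sizeof(works) / sizeof(works[0]);
	int pos = 0;

	while (pos < n)
		pos = schedule_chain(works, n, pos);
	return 0;
}

Running the sketch moves A, B and C to ->scheduled as one unit and D
on its own, which is the grouping that insert_wq_barrier() below
relies on when it links a barrier behind its target work.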

NOT_SIGNED_OFF_YET
---
 include/linux/workqueue.h |    2 +
 kernel/workqueue.c        |  152 ++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 131 insertions(+), 23 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 78fd6eb..a6136ca 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -25,9 +25,11 @@ typedef void (*work_func_t)(struct work_struct *work);
 enum {
 	WORK_STRUCT_PENDING_BIT	= 0,	/* work item is pending execution */
 	WORK_STRUCT_COLOR_BIT	= 1,	/* color for workqueue flushing */
+	WORK_STRUCT_LINKED_BIT	= 2,	/* next work is linked to this one */
 
 	WORK_STRUCT_PENDING	= 1 << WORK_STRUCT_PENDING_BIT,
 	WORK_STRUCT_COLOR	= 1 << WORK_STRUCT_COLOR_BIT,
+	WORK_STRUCT_LINKED	= 1 << WORK_STRUCT_LINKED_BIT,
 
 	/*
 	 * Reserve 3bits off of cwq pointer.  This is enough and
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f10fe4a..e234604 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -50,6 +50,7 @@ struct cpu_workqueue_struct;
 
 struct worker {
 	struct work_struct	*current_work;	/* L: work being processed */
+	struct list_head	scheduled;	/* L: scheduled works */
 	struct task_struct	*task;		/* I: worker task */
 	struct cpu_workqueue_struct *cwq;	/* I: the associated cwq */
 };
@@ -297,6 +298,8 @@ static struct worker *alloc_worker(void)
 	struct worker *worker;
 
 	worker = kzalloc(sizeof(*worker), GFP_KERNEL);
+	if (worker)
+		INIT_LIST_HEAD(&worker->scheduled);
 	return worker;
 }
 
@@ -363,12 +366,56 @@ static void destroy_worker(struct worker *worker)
 {
 	/* sanity check frenzy */
 	BUG_ON(worker->current_work);
+	BUG_ON(!list_empty(&worker->scheduled));
 
 	kthread_stop(worker->task);
 	kfree(worker);
 }
 
 /**
+ * schedule_work_to_worker - schedule linked works to a worker
+ * @worker: target worker
+ * @work: start of series of works to be scheduled
+ * @nextp: out parameter for nested worklist walking
+ *
+ * Schedule linked works starting from @work to @worker.  Work series
+ * to be scheduled starts at @work and includes any consecutive work
+ * with WORK_STRUCT_LINKED set in its predecessor.
+ *
+ * If @nextp is not NULL, it's updated to point to the next work of
+ * the last scheduled work.  This allows schedule_work_to_worker() to
+ * be nested inside outer list_for_each_entry_safe().
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock).
+ */
+static void schedule_work_to_worker(struct worker *worker,
+				    struct work_struct *work,
+				    struct work_struct **nextp)
+{
+	struct work_struct *n;
+
+	/*
+	 * Linked worklist will always end before the end of the list,
+	 * use NULL for list head.
+	 */
+	work = list_entry(work->entry.prev, struct work_struct, entry);
+	list_for_each_entry_safe_continue(work, n, NULL, entry) {
+		list_move_tail(&work->entry, &worker->scheduled);
+		if (!(*work_data_bits(work) & WORK_STRUCT_LINKED))
+			break;
+	}
+
+	/*
+	 * If we're already inside safe list traversal and have moved
+	 * multiple works to the scheduled queue, the next position
+	 * needs to be updated.
+	 */
+	if (nextp)
+		*nextp = n;
+}
+
+/**
  * cwq_dec_nr_in_flight - decrement cwq's nr_in_flight
  * @cwq: cwq of interest
  * @work_color: color of work which left the queue
@@ -451,17 +498,25 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
 	cwq_dec_nr_in_flight(cwq, work_color);
 }
 
-static void run_workqueue(struct worker *worker)
+/**
+ * process_scheduled_works - process scheduled works
+ * @worker: self
+ *
+ * Process all scheduled works.  Please note that the scheduled list
+ * may change while processing a work, so this function repeatedly
+ * fetches a work from the top and executes it.
+ *
+ * CONTEXT:
+ * spin_lock_irq(cwq->lock) which may be released and regrabbed
+ * multiple times.
+ */
+static void process_scheduled_works(struct worker *worker)
 {
-	struct cpu_workqueue_struct *cwq = worker->cwq;
-
-	spin_lock_irq(&cwq->lock);
-	while (!list_empty(&cwq->worklist)) {
-		struct work_struct *work = list_entry(cwq->worklist.next,
+	while (!list_empty(&worker->scheduled)) {
+		struct work_struct *work = list_first_entry(&worker->scheduled,
 						struct work_struct, entry);
 		process_one_work(worker, work);
 	}
-	spin_unlock_irq(&cwq->lock);
 }
 
 /**
@@ -495,7 +550,26 @@ static int worker_thread(void *__worker)
 		if (kthread_should_stop())
 			break;
 
-		run_workqueue(worker);
+		spin_lock_irq(&cwq->lock);
+
+		while (!list_empty(&cwq->worklist)) {
+			struct work_struct *work =
+				list_first_entry(&cwq->worklist,
+						 struct work_struct, entry);
+
+			if (likely(!(*work_data_bits(work) &
+				     WORK_STRUCT_LINKED))) {
+				/* optimization path, not strictly necessary */
+				process_one_work(worker, work);
+				if (unlikely(!list_empty(&worker->scheduled)))
+					process_scheduled_works(worker);
+			} else {
+				schedule_work_to_worker(worker, work, NULL);
+				process_scheduled_works(worker);
+			}
+		}
+
+		spin_unlock_irq(&cwq->lock);
 	}
 
 	return 0;
@@ -514,23 +588,51 @@ static void wq_barrier_func(struct work_struct *work)
 
 /**
  * insert_wq_barrier - insert a barrier work
- * @cwq: cwq to insert barrier into
  * @barr: wq_barrier to insert
- * @head: insertion point
+ * @target: target work to attach @barr to
+ * @worker: worker currently executing @target, NULL if @target is not executing
+ *
+ * @barr is linked to @target such that @barr is completed only after
+ * @target finishes execution.  Please note that the ordering
+ * guarantee is observed only with respect to @target and on the local
+ * cpu.
  *
- * Insert barrier @barr into @cwq before @head.
+ * Currently, a queued barrier can't be canceled.  This is because
+ * try_to_grab_pending() can't determine whether the work to be
+ * grabbed is at the head of the queue and thus can't clear the
+ * LINKED flag of the previous work, while a work with the LINKED
+ * flag set must always be followed by a valid next work.
  *
  * CONTEXT:
  * spin_lock_irq(cwq->lock).
  */
-static void insert_wq_barrier(struct cpu_workqueue_struct *cwq,
-			struct wq_barrier *barr, struct list_head *head)
+static void insert_wq_barrier(struct wq_barrier *barr,
+			      struct work_struct *target, struct worker *worker)
 {
+	struct cpu_workqueue_struct *cwq = get_wq_data(target);
+	struct list_head *head;
+	unsigned int linked = 0;
+
 	INIT_WORK(&barr->work, wq_barrier_func);
 	__set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&barr->work));
 	init_completion(&barr->done);
 
-	insert_work(cwq, &barr->work, head, 0);
+	/*
+	 * If @target is currently being executed, schedule the
+	 * barrier to the worker; otherwise, put it after @target.
+	 */
+	if (worker)
+		head = worker->scheduled.next;
+	else {
+		unsigned long *bits = work_data_bits(target);
+
+		head = target->entry.next;
+		/* there can already be other linked works, inherit and set */
+		linked = *bits & WORK_STRUCT_LINKED;
+		*bits |= WORK_STRUCT_LINKED;
+	}
+
+	insert_work(cwq, &barr->work, head, linked);
 }
 
 /**
@@ -598,8 +700,8 @@ EXPORT_SYMBOL_GPL(flush_workqueue);
  */
 int flush_work(struct work_struct *work)
 {
+	struct worker *worker = NULL;
 	struct cpu_workqueue_struct *cwq;
-	struct list_head *prev;
 	struct wq_barrier barr;
 
 	might_sleep();
@@ -619,13 +721,14 @@ int flush_work(struct work_struct *work)
 		smp_rmb();
 		if (unlikely(cwq != get_wq_data(work)))
 			goto already_gone;
-		prev = &work->entry;
 	} else {
-		if (!cwq->worker || cwq->worker->current_work != work)
+		if (cwq->worker && cwq->worker->current_work == work)
+			worker = cwq->worker;
+		if (!worker)
 			goto already_gone;
-		prev = &cwq->worklist;
 	}
-	insert_wq_barrier(cwq, &barr, prev->next);
+
+	insert_wq_barrier(&barr, work, worker);
 	spin_unlock_irq(&cwq->lock);
 	wait_for_completion(&barr.done);
 	return 1;
@@ -680,16 +783,19 @@ static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,
 				struct work_struct *work)
 {
 	struct wq_barrier barr;
-	int running = 0;
+	struct worker *worker;
 
 	spin_lock_irq(&cwq->lock);
+
+	worker = NULL;
 	if (unlikely(cwq->worker && cwq->worker->current_work == work)) {
-		insert_wq_barrier(cwq, &barr, cwq->worklist.next);
-		running = 1;
+		worker = cwq->worker;
+		insert_wq_barrier(&barr, work, worker);
 	}
+
 	spin_unlock_irq(&cwq->lock);
 
-	if (unlikely(running))
+	if (unlikely(worker))
 		wait_for_completion(&barr.done);
 }
 
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 18/19] workqueue: reimplement workqueue freeze using cwq->frozen_works queue
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (16 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 17/19] workqueue: reimplement work flushing using linked works Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01  8:09 ` [PATCH 19/19] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

Currently, workqueue freezing is implemented by marking the worker
freezeable and calling try_to_freeze() from the dispatch loop.
Reimplement it so that the workqueue is frozen instead of the worker.

* cwq->cur_worklist and cwq->frozen_works are added.  During normal
  operation cwq->cur_worklist points to cwq->worklist.

* When freezing starts, cwq->cur_worklist is switched to
  cwq->frozen_works so that new works are stored in cwq->frozen_works
  instead of being processed.

* Freezing is complete when cwq->nr_in_flight equals the number of
  works on cwq->frozen_works for all cwqs of all freezeable
  workqueues.

* Thawing is done by restoring cwq->cur_worklist to cwq->worklist and
  splicing cwq->frozen_works to cwq->worklist.

This new implementation allows having multiple shared workers per cpu.
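
The bullet points above boil down to a pointer switch plus a list
splice.  Below is a minimal user-space sketch of that idea; the
array-backed toy_list and the helper names are made up for this
illustration, and only the cur_worklist redirection and the
splice-on-thaw mirror what the patch does (in the real code both
happen under cwq->lock).

/*
 * Toy model of the freeze/thaw mechanism.  Queueing always goes to
 * *cur_worklist; freezing repoints it at frozen_works, thawing points
 * it back and splices the parked works over.
 */
#include <stdio.h>

#define MAX_WORKS	8

struct toy_list {
	int	items[MAX_WORKS];
	int	nr;
};

static struct toy_list worklist, frozen_works;
static struct toy_list *cur_worklist = &worklist;	/* normal operation */

static void queue_work(int id)
{
	cur_worklist->items[cur_worklist->nr++] = id;
}

static void freeze_begin(void)
{
	cur_worklist = &frozen_works;	/* new works are parked, not run */
}

static void thaw(void)
{
	int i;

	cur_worklist = &worklist;
	/* splice the parked works back onto the normal worklist */
	for (i = 0; i < frozen_works.nr; i++)
		worklist.items[worklist.nr++] = frozen_works.items[i];
	frozen_works.nr = 0;
}

int main(void)
{
	queue_work(1);		/* goes to worklist, runs normally */
	freeze_begin();
	queue_work(2);		/* parked on frozen_works */
	queue_work(3);
	thaw();			/* 2 and 3 move over to worklist */
	printf("worklist now holds %d works\n", worklist.nr);	/* 3 */
	return 0;
}

freeze_workqueues_busy() then only needs to compare cwq->nr_in_flight
against the number of works parked on frozen_works to decide whether
freezing is complete.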

NOT_SIGNED_OFF_YET
---
 include/linux/workqueue.h |    7 ++
 kernel/power/process.c    |   21 +++++-
 kernel/workqueue.c        |  173 +++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 194 insertions(+), 7 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index a6136ca..351466d 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -299,4 +299,11 @@ static inline long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg)
 #else
 long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg);
 #endif /* CONFIG_SMP */
+
+#ifdef CONFIG_FREEZER
+extern void freeze_workqueues_begin(void);
+extern bool freeze_workqueues_busy(void);
+extern void thaw_workqueues(void);
+#endif /* CONFIG_FREEZER */
+
 #endif
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 9d26a0a..18d4835 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -14,6 +14,7 @@
 #include <linux/module.h>
 #include <linux/syscalls.h>
 #include <linux/freezer.h>
+#include <linux/workqueue.h>
 
 /* 
  * Timeout for stopping processes
@@ -34,6 +35,7 @@ static int try_to_freeze_tasks(bool sig_only)
 	struct task_struct *g, *p;
 	unsigned long end_time;
 	unsigned int todo;
+	bool wq_busy = false;
 	struct timeval start, end;
 	u64 elapsed_csecs64;
 	unsigned int elapsed_csecs;
@@ -41,6 +43,10 @@ static int try_to_freeze_tasks(bool sig_only)
 	do_gettimeofday(&start);
 
 	end_time = jiffies + TIMEOUT;
+
+	if (!sig_only)
+		freeze_workqueues_begin();
+
 	while (true) {
 		todo = 0;
 		read_lock(&tasklist_lock);
@@ -62,6 +68,12 @@ static int try_to_freeze_tasks(bool sig_only)
 				todo++;
 		} while_each_thread(g, p);
 		read_unlock(&tasklist_lock);
+
+		if (!sig_only) {
+			wq_busy = freeze_workqueues_busy();
+			todo += wq_busy;
+		}
+
 		if (!todo || time_after(jiffies, end_time))
 			break;
 
@@ -85,9 +97,13 @@ static int try_to_freeze_tasks(bool sig_only)
 		 */
 		printk("\n");
 		printk(KERN_ERR "Freezing of tasks failed after %d.%02d seconds "
-				"(%d tasks refusing to freeze):\n",
-				elapsed_csecs / 100, elapsed_csecs % 100, todo);
+		       "(%d tasks refusing to freeze, wq_busy=%d):\n",
+		       elapsed_csecs / 100, elapsed_csecs % 100,
+		       todo - wq_busy, wq_busy);
 		show_state();
+
+		thaw_workqueues();
+
 		read_lock(&tasklist_lock);
 		do_each_thread(g, p) {
 			task_lock(p);
@@ -157,6 +173,7 @@ void thaw_processes(void)
 	oom_killer_enable();
 
 	printk("Restarting tasks ... ");
+	thaw_workqueues();
 	thaw_tasks(true);
 	thaw_tasks(false);
 	schedule();
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e234604..097da97 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -44,6 +44,10 @@
  * L: cwq->lock protected.  Access with cwq->lock held.
  *
  * W: workqueue_lock protected.
+ *
+ * V: Similar to L except that operation is limited to only one
+ *    direction if workqueues are frozen (ie. can be added but can't
+ *    be removed).
  */
 
 struct cpu_workqueue_struct;
@@ -69,10 +73,12 @@ struct cpu_workqueue_struct {
 	unsigned int		cpu;
 	struct worker		*worker;
 
+	struct list_head	*cur_worklist;	/* L: current worklist */
 	int			nr_in_flight;	/* L: nr of in_flight works */
 	unsigned int		flush_color;	/* L: current flush color */
 	int			flush_cnt;	/* L: in-progress flush count */
 	struct workqueue_struct *wq;		/* I: the owning workqueue */
+	struct list_head	frozen_works;	/* V: used while frozen */
 } __attribute__((aligned(1 << WORK_STRUCT_FLAG_BITS)));
 
 /*
@@ -97,6 +103,7 @@ struct workqueue_struct {
 /* Serializes the accesses to the list of workqueues. */
 static DEFINE_SPINLOCK(workqueue_lock);
 static LIST_HEAD(workqueues);
+static bool workqueue_frozen;
 
 static int worker_thread(void *__worker);
 
@@ -180,7 +187,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 
 	spin_lock_irqsave(&cwq->lock, flags);
 	BUG_ON(!list_empty(&work->entry));
-	insert_work(cwq, work, &cwq->worklist, 0);
+	insert_work(cwq, work, cwq->cur_worklist, 0);
 	spin_unlock_irqrestore(&cwq->lock, flags);
 }
 
@@ -545,8 +552,6 @@ static int worker_thread(void *__worker)
 			schedule();
 		finish_wait(&cwq->more_work, &wait);
 
-		try_to_freeze();
-
 		if (kthread_should_stop())
 			break;
 
@@ -1048,6 +1053,14 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	 */
 	spin_lock(&workqueue_lock);
 	list_add(&wq->list, &workqueues);
+	for_each_possible_cpu(cpu) {
+		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+		if (workqueue_frozen && wq->flags & WQ_FREEZEABLE)
+			cwq->cur_worklist = &cwq->frozen_works;
+		else
+			cwq->cur_worklist = &cwq->worklist;
+	}
 	spin_unlock(&workqueue_lock);
 	/*
 	 * We must initialize cwqs for each possible cpu even if we
@@ -1063,6 +1076,7 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 		spin_lock_init(&cwq->lock);
 		INIT_LIST_HEAD(&cwq->worklist);
 		init_waitqueue_head(&cwq->more_work);
+		INIT_LIST_HEAD(&cwq->frozen_works);
 
 		if (failed || !cpu_online(cpu))
 			continue;
@@ -1099,12 +1113,17 @@ void destroy_workqueue(struct workqueue_struct *wq)
 	int cpu;
 
 	cpu_maps_update_begin();
+
+	flush_workqueue(wq);
+
+	/*
+	 * wq list is used to freeze wq, remove from list after
+	 * flushing is complete in case freeze races us.
+	 */
 	spin_lock(&workqueue_lock);
 	list_del(&wq->list);
 	spin_unlock(&workqueue_lock);
 
-	flush_workqueue(wq);
-
 	for_each_possible_cpu(cpu) {
 		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 
@@ -1114,6 +1133,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
 			cwq->worker = NULL;
 		}
 		BUG_ON(cwq->nr_in_flight);
+		BUG_ON(!list_empty(&cwq->frozen_works));
 	}
 
  	cpu_maps_update_done();
@@ -1217,6 +1237,149 @@ long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg)
 EXPORT_SYMBOL_GPL(work_on_cpu);
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_FREEZER
+/**
+ * freeze_workqueues_begin - begin freezing workqueues
+ *
+ * Start freezing workqueues.  After this function returns, all
+ * freezeable workqueues will queue new works to their frozen_works
+ * list instead of the cwq ones.
+ *
+ * CONTEXT:
+ * Grabs and releases workqueue_lock and cwq->lock's.
+ */
+void freeze_workqueues_begin(void)
+{
+	struct workqueue_struct *wq;
+	unsigned int cpu;
+
+	spin_lock(&workqueue_lock);
+
+	BUG_ON(workqueue_frozen);
+	workqueue_frozen = true;
+
+	for_each_possible_cpu(cpu) {
+		list_for_each_entry(wq, &workqueues, list) {
+			struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+			if (!(wq->flags & WQ_FREEZEABLE))
+				continue;
+
+			spin_lock_irq(&cwq->lock);
+
+			BUG_ON(cwq->cur_worklist != &cwq->worklist);
+			BUG_ON(!list_empty(&cwq->frozen_works));
+
+			cwq->cur_worklist = &cwq->frozen_works;
+
+			spin_unlock_irq(&cwq->lock);
+		}
+	}
+	spin_unlock(&workqueue_lock);
+}
+
+/**
+ * freeze_workqueues_busy - are freezeable workqueues still busy?
+ *
+ * Check whether freezing is complete.  This function must be called
+ * between freeze_workqueues_begin() and thaw_workqueues().
+ *
+ * CONTEXT:
+ * Grabs and releases workqueue_lock.
+ *
+ * RETURNS:
+ * %true if some freezeable workqueues are still busy.  %false if
+ * freezing is complete.
+ */
+bool freeze_workqueues_busy(void)
+{
+	struct workqueue_struct *wq;
+	unsigned int cpu;
+	bool busy = false;
+
+	spin_lock(&workqueue_lock);
+
+	BUG_ON(!workqueue_frozen);
+
+	for_each_possible_cpu(cpu) {
+		list_for_each_entry(wq, &workqueues, list) {
+			struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+			struct work_struct *work;
+			int nr_in_flight;
+
+			if (!(wq->flags & WQ_FREEZEABLE))
+				continue;
+
+			spin_lock_irq(&cwq->lock);
+
+			BUG_ON(cwq->cur_worklist != &cwq->frozen_works);
+
+			nr_in_flight = cwq->nr_in_flight;
+			list_for_each_entry(work, &cwq->frozen_works, entry)
+				nr_in_flight--;
+
+			spin_unlock_irq(&cwq->lock);
+
+			BUG_ON(nr_in_flight < 0);
+			if (nr_in_flight) {
+				busy = true;
+				break;
+			}
+		}
+		if (busy)
+			break;
+	}
+	spin_unlock(&workqueue_lock);
+	return busy;
+}
+
+/**
+ * thaw_workqueues - thaw workqueues
+ *
+ * Thaw workqueues.  Normal queueing is restored and all collected
+ * frozen works are transferred to their respective cwq worklists.
+ *
+ * CONTEXT:
+ * Grabs and releases workqueue_lock and cwq->lock's.
+ */
+void thaw_workqueues(void)
+{
+	struct workqueue_struct *wq;
+	unsigned int cpu;
+
+	spin_lock(&workqueue_lock);
+
+	if (!workqueue_frozen)
+		goto out_unlock;
+
+	workqueue_frozen = false;
+
+	for_each_possible_cpu(cpu) {
+		list_for_each_entry(wq, &workqueues, list) {
+			struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+
+			if (!(wq->flags & WQ_FREEZEABLE))
+				continue;
+
+			spin_lock_irq(&cwq->lock);
+
+			/* switch to normal work queueing */
+			BUG_ON(cwq->cur_worklist != &cwq->frozen_works);
+			cwq->cur_worklist = &cwq->worklist;
+
+			/* transfer frozen works to the cwq worklist */
+			list_splice_tail(&cwq->frozen_works, &cwq->worklist);
+			INIT_LIST_HEAD(&cwq->frozen_works);
+			wake_up(&cwq->more_work);
+
+			spin_unlock_irq(&cwq->lock);
+		}
+	}
+out_unlock:
+	spin_unlock(&workqueue_lock);
+}
+#endif /* CONFIG_FREEZER */
+
 void __init init_workqueues(void)
 {
 	/*
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH 19/19] workqueue: implement concurrency managed workqueue
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (17 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 18/19] workqueue: reimplement workqueue freeze using cwq->frozen_works queue Tejun Heo
@ 2009-10-01  8:09 ` Tejun Heo
  2009-10-01 14:49   ` Andrew Morton
                     ` (3 more replies)
  2009-10-01  8:24 ` [RFC PATCHSET] " Jens Axboe
                   ` (7 subsequent siblings)
  26 siblings, 4 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01  8:09 UTC (permalink / raw)
  To: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan
  Cc: Tejun Heo

Currently each workqueue has its own dedicated worker pool.  This
causes the following problems.

* Works which are dependent on each other can cause a deadlock by
  depending on the same execution resource.  This is bad because this
  type of dependency is quite difficult to find.

* Works which may sleep and take a long time to finish need separate
  workqueues so that they don't block other works.  Similarly, works
  which want to be executed in a timely manner often need their own
  custom workqueues to avoid being blocked by long running ones.
  This leads to a large number of workqueues and thus many workers.

* The static one-per-cpu worker isn't good enough for jobs which
  require a higher level of concurrency, necessitating other worker
  pool mechanisms.  slow-work and async are good examples and there
  are also some custom implementations buried in subsystems.

* Combined, the above factors lead to many workqueues with a large
  number of dedicated and mostly unused workers.  This also makes
  work processing less optimal as the dedicated workers end up
  switching among themselves, costing scheduling overhead and wasting
  cache footprint for their stacks, and as the system gets busy,
  these workers end up competing with each other.

To solve the above issues, this patch implements concurrency-managed
workqueue.

There is a single global cpu workqueue (gcwq) for each cpu which
serves all the workqueues.  The gcwq maintains a single pool of
workers which is shared by all cwqs on the cpu.

The gcwq keeps the number of concurrently active workers to the
minimum necessary but no less.  As long as one or more workers are
running on the cpu, no new worker is scheduled, so that works can be
processed in batch as much as possible; but when the last running
worker blocks, the gcwq immediately schedules a new worker so that
the cpu doesn't sit idle while there are works to be processed.
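
Stripped of locking and the scheduler hooks, the policy reduces to a
couple of counter checks.  The sketch below is a toy model for
illustration only; the plain counters and the queue_work() and
worker_blocks() helpers are invented here, while in the patch
nr_running is a per-cpu atomic maintained from the scheduler
callbacks and the actual test lives in need_more_worker().

/*
 * Toy model of the concurrency policy: wake a worker only when work
 * is pending and nothing is currently running on the cpu.  The
 * decrement in worker_blocks() stands in for the scheduler sleep hook.
 */
#include <stdbool.h>
#include <stdio.h>

static int nr_pending;		/* works sitting on the gcwq worklist */
static int nr_running;		/* workers currently executing, not blocked */

static bool need_more_worker(void)
{
	return nr_pending > 0 && nr_running == 0;
}

static void queue_work(void)
{
	nr_pending++;
	if (need_more_worker())
		printf("wake up an idle worker\n");
}

static void worker_blocks(void)
{
	nr_running--;			/* sleep hook in the real code */
	if (need_more_worker())
		printf("last runner blocked, wake a replacement\n");
}

int main(void)
{
	queue_work();		/* nothing running yet -> wake one */
	nr_running = 1;		/* the woken worker starts executing */
	queue_work();		/* a runner exists -> no extra wake-up */
	worker_blocks();	/* the runner sleeps -> wake a replacement */
	return 0;
}

Queueing wakes somebody only when nothing is running, and the sleep
hook wakes a replacement only when pending work would otherwise
stall, which is what keeps a single worker busy in the common case.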

The gcwq always keeps at least a single idle worker around.  When a
new worker is necessary and the worker is the last idle one, the
worker assumes the role of "manager" and manages the worker pool -
ie. it creates another worker.  Forward progress is guaranteed by
having dedicated rescue workers for the workqueues whose works may be
necessary while creating a new worker.  When the manager has trouble
creating a new worker, the mayday timer activates and rescue workers
are summoned to the cpu to execute the works which may be necessary
to create new workers.
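
What the manager actually does boils down to two checks, sketched
below in user-space form.  The counters and the scenario in main()
are invented for the illustration; the idle ratio matches
MAX_IDLE_WORKERS_RATIO introduced by the patch and the real decisions
live in need_to_create_worker() and too_many_workers().

/*
 * Toy model of the manager's pool-sizing rules: create a worker when
 * work is pending but no idle worker is available, and start shedding
 * workers once too large a share of the pool sits idle.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_IDLE_WORKERS_RATIO	4	/* 1/4 of busy workers may be idle */

static int nr_pending = 1;	/* something is queued */
static int nr_workers = 8;
static int nr_idle;

static bool need_to_create_worker(void)
{
	return nr_pending > 0 && nr_idle == 0;
}

static bool too_many_workers(void)
{
	int nr_busy = nr_workers - nr_idle;

	return nr_idle > 2 &&
	       (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
}

int main(void)
{
	if (need_to_create_worker())
		printf("no idle worker left, create one\n");

	nr_idle = 6;		/* load drops, most workers go idle */
	if (too_many_workers())
		printf("too many idle workers, start the idle timer\n");
	return 0;
}

In the patch the create side additionally arms the mayday timer so
that rescuers are summoned if worker creation itself can't make
progress.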

To keep track of which worker is executing which work, the gcwq uses
a hash table.  This is necessary as a work may be destroyed once it
starts executing, so flushing has to be implemented by tracking
whether any worker is currently executing the work.
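
The bucket is picked by a simple shift-and-fold of the work's
address.  The stand-alone snippet below reproduces that calculation
outside the kernel; the WORK_SIZE_SHIFT value (a stand-in for
ilog2(sizeof(struct work_struct))) and the sample addresses are
invented for the example, while the hash order and mask match the
BUSY_WORKER_HASH_* constants introduced by the patch.

/*
 * Stand-alone version of the shift-and-fold hash which picks a
 * busy_hash bucket from a work_struct address.
 */
#include <stdio.h>

#define BUSY_WORKER_HASH_ORDER	4			/* 16 buckets */
#define BUSY_WORKER_HASH_SIZE	(1 << BUSY_WORKER_HASH_ORDER)
#define BUSY_WORKER_HASH_MASK	(BUSY_WORKER_HASH_SIZE - 1)
#define WORK_SIZE_SHIFT		5			/* assume 32-byte works */

static unsigned long busy_bucket(unsigned long work_addr)
{
	unsigned long v = work_addr;

	v >>= WORK_SIZE_SHIFT;			/* drop the alignment bits */
	v += v >> BUSY_WORKER_HASH_ORDER;	/* fold higher bits back in */
	return v & BUSY_WORKER_HASH_MASK;	/* bucket index 0..15 */
}

int main(void)
{
	unsigned long addrs[] = { 0x12345620UL, 0x12345640UL };
	int i;

	for (i = 0; i < 2; i++)
		printf("work at %#lx -> bucket %lu\n",
		       addrs[i], busy_bucket(addrs[i]));
	return 0;
}

Colliding works simply share a bucket and
__find_worker_executing_work() walks the hlist to find the worker
whose ->current_work matches.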

The cpu hotplug implementation is more complex than before because
there are multiple workers and the workqueue is now capable of
hosting long running works.  cpu offlining is implemented by creating
a "trustee" kthread which runs the gcwq as if the cpu is still online
until all works are drained.  As soon as the trustee takes over the
gcwq, the cpu hotunplug operation can proceed without waiting for
workqueues to be drained.  Onlining is the reverse.  If the trustee
is still trying to drain the gcwq from the previous offlining, it
puts all the workers back on the cpu and lets the gcwq run as if the
cpu had been online the whole time.

The new implementation has the following benefits.

* Workqueue users no longer have to worry about managing concurrency
  and, in most cases, deadlocks.  The workqueue will manage it
  automatically and unless the deadlock chain involves more than 127
  works, it won't happen.

* There's one single shared pool of workers per cpu and one rescuer
  for each workqueue which requires it, so there are far fewer number
  of kthreads.

* More efficient.  Although it adds considerable amount of code, the
  code added to hot path isn't big and works will be executed on the
  local cpu and in batch as much as possible using minimal number of
  kthreads leading to fewer task switches and lower cache
  footprint. <NEED SOME BACKING NUMBERS>

* As concurrency is no longer a problem, most types of asynchronous
  jobs can be done using generic workqueue and other async mechanisms
  can be removed.

NOT_SIGNED_OFF_YET
---
 include/linux/workqueue.h |    7 +-
 kernel/workqueue.c        | 1566 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 1378 insertions(+), 195 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 351466d..9dbdbc2 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -182,6 +182,7 @@ struct execute_work {
 
 enum {
 	WQ_FREEZEABLE		= 1 << 0, /* freeze during suspend */
+	WQ_RESCUER		= 1 << 1, /* has a rescue worker */
 };
 
 extern struct workqueue_struct *
@@ -208,11 +209,11 @@ __create_workqueue_key(const char *name, unsigned int flags,
 #endif
 
 #define create_workqueue(name)					\
-	__create_workqueue((name), 0)
+	__create_workqueue((name), WQ_RESCUER)
 #define create_freezeable_workqueue(name)			\
-	__create_workqueue((name), WQ_FREEZEABLE)
+	__create_workqueue((name), WQ_FREEZEABLE | WQ_RESCUER)
 #define create_singlethread_workqueue(name)			\
-	__create_workqueue((name), 0)
+	__create_workqueue((name), WQ_RESCUER)
 
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 097da97..67cb3a1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -29,19 +29,72 @@
 #include <linux/kthread.h>
 #include <linux/hardirq.h>
 #include <linux/mempolicy.h>
-#include <linux/freezer.h>
 #include <linux/kallsyms.h>
 #include <linux/debug_locks.h>
 #include <linux/lockdep.h>
+#include <linux/wait.h>
 
 #include "sched_workqueue.h"
 
+enum {
+	/* worker flags */
+	WORKER_STARTED		= 1 << 0,	/* started */
+	WORKER_DIE		= 1 << 1,	/* die die die */
+	WORKER_IDLE		= 1 << 2,	/* is idle */
+	WORKER_MANAGER		= 1 << 3,	/* I'm the manager */
+	WORKER_RESCUER		= 1 << 4,	/* I'm a rescuer */
+	WORKER_ROGUE		= 1 << 5,	/* not bound to any cpu */
+
+	WORKER_IGN_RUNNING	= WORKER_IDLE | WORKER_MANAGER | WORKER_ROGUE,
+
+	/* global_cwq flags */
+	GCWQ_MANAGE_WORKERS	= 1 << 0,	/* need to manage workers */
+	GCWQ_MANAGING_WORKERS	= 1 << 1,	/* managing workers */
+	GCWQ_DISASSOCIATED	= 1 << 2,	/* cpu can't serve workers */
+
+	/* gcwq->trustee_state */
+	TRUSTEE_START		= 0,		/* start */
+	TRUSTEE_IN_CHARGE	= 1,		/* trustee in charge of gcwq */
+	TRUSTEE_BUTCHER		= 2,		/* butcher workers */
+	TRUSTEE_RELEASE		= 3,		/* release workers */
+	TRUSTEE_DONE		= 4,		/* trustee is done */
+
+	MAX_CPU_WORKERS_ORDER	= 7,		/* 128 */
+	MAX_WORKERS_PER_CPU	= 1 << MAX_CPU_WORKERS_ORDER,
+
+	BUSY_WORKER_HASH_ORDER	= 4,		/* 16 pointers */
+	BUSY_WORKER_HASH_SIZE	= 1 << BUSY_WORKER_HASH_ORDER,
+	BUSY_WORKER_HASH_MASK	= BUSY_WORKER_HASH_SIZE - 1,
+
+	MAX_IDLE_WORKERS_RATIO	= 4,		/* 1/4 of busy can be idle */
+	IDLE_WORKER_TIMEOUT	= 300 * HZ,	/* keep idle ones for 5 mins */
+
+	MAYDAY_INITIAL_TIMEOUT	= HZ / 100,	/* call for help after 10ms */
+	MAYDAY_INTERVAL		= HZ / 10,	/* and then every 100ms */
+	CREATE_COOLDOWN		= HZ,		/* time to breathe after fail */
+	TRUSTEE_COOLDOWN	= HZ / 10,	/* for trustee draining */
+
+	/*
+	 * Rescue workers are used only on emergencies and shared by
+	 * Rescue workers are used only in emergencies and shared by
+	 * all cpus.  Give them nice level -20.
+	RESCUER_NICE_LEVEL	= -20,
+};
+
 /*
  * Structure fields follow one of the following exclusion rules.
  *
  * I: Set during initialization and read-only afterwards.
  *
- * L: cwq->lock protected.  Access with cwq->lock held.
+ * P: Preemption protected.  Disabling preemption is enough and should
+ *    only be modified and accessed from the local cpu.
+ *
+ * L: gcwq->lock protected.  Access with gcwq->lock held.
+ *
+ * X: During normal operation, modification requires gcwq->lock and
+ *    should be done only from local cpu.  Either disabling preemption
+ *    on local cpu or grabbing gcwq->lock is enough for read access.
+ *    While trustee is in charge, it's identical to L.
  *
  * W: workqueue_lock protected.
  *
@@ -50,14 +103,56 @@
  *    be removed).
  */
 
-struct cpu_workqueue_struct;
+struct global_cwq;
 
+/*
+ * The poor guys doing the actual heavy lifting.  All on-duty workers
+ * are either serving the manager role, on idle list or on busy hash.
+ */
 struct worker {
+	/* on idle list while idle, on busy hash table while busy */
+	union {
+		struct list_head	entry;	/* L: while idle */
+		struct hlist_node	hentry;	/* L: while busy */
+	};
+
 	struct work_struct	*current_work;	/* L: work being processed */
 	struct list_head	scheduled;	/* L: scheduled works */
 	struct task_struct	*task;		/* I: worker task */
-	struct cpu_workqueue_struct *cwq;	/* I: the associated cwq */
-};
+	struct global_cwq	*gcwq;		/* I: the associated gcwq */
+	unsigned long		last_active;	/* L: last active timestamp */
+	/* 64 bytes boundary on 64bit, 32 on 32bit */
+	bool			running;	/* ?: is this worker running? */
+	unsigned int		flags;		/* ?: flags */
+} ____cacheline_aligned_in_smp;
+
+/*
+ * Global per-cpu workqueue.  There's one and only one for each cpu
+ * and all works are queued and processed here regardless of their
+ * target workqueues.
+ */
+struct global_cwq {
+	spinlock_t		lock;		/* the gcwq lock */
+	struct list_head	worklist;	/* L: list of pending works */
+	unsigned int		cpu;		/* I: the associated cpu */
+	unsigned int		flags;		/* L: GCWQ_* flags */
+
+	int			nr_workers;	/* L: total number of workers */
+	int			nr_idle;	/* L: currently idle ones */
+
+	/* workers are chained either in the idle_list or busy_hash */
+	struct list_head	idle_list;	/* ?: list of idle workers */
+	struct hlist_head	busy_hash[BUSY_WORKER_HASH_SIZE];
+						/* L: hash of busy workers */
+
+	struct timer_list	idle_timer;	/* L: worker idle timeout */
+	struct timer_list	mayday_timer;	/* L: SOS timer for workers */
+
+	struct task_struct	*trustee;	/* L: for gcwq shutdown */
+	unsigned int		trustee_state;	/* L: trustee state */
+	wait_queue_head_t	trustee_wait;	/* trustee wait */
+	struct worker		*first_idle;	/* L: first idle worker */
+} ____cacheline_aligned_in_smp;
 
 /*
  * The per-CPU workqueue.  The lower WORK_STRUCT_FLAG_BITS of
@@ -65,14 +160,7 @@ struct worker {
  * aligned at two's power of the number of flag bits.
  */
 struct cpu_workqueue_struct {
-
-	spinlock_t lock;
-
-	struct list_head worklist;
-	wait_queue_head_t more_work;
-	unsigned int		cpu;
-	struct worker		*worker;
-
+	struct global_cwq	*gcwq;		/* I: the associated gcwq */
 	struct list_head	*cur_worklist;	/* L: current worklist */
 	int			nr_in_flight;	/* L: nr of in_flight works */
 	unsigned int		flush_color;	/* L: current flush color */
@@ -94,6 +182,9 @@ struct workqueue_struct {
 	atomic_t		nr_cwqs_to_flush; /* flush in progress */
 	struct completion	*flush_done;	/* flush done */
 
+	cpumask_var_t		mayday_mask;	/* cpus requesting rescue */
+	struct worker		*rescuer;	/* I: rescue worker */
+
 	const char		*name;		/* I: workqueue name */
 #ifdef CONFIG_LOCKDEP
 	struct lockdep_map	lockdep_map;
@@ -105,8 +196,27 @@ static DEFINE_SPINLOCK(workqueue_lock);
 static LIST_HEAD(workqueues);
 static bool workqueue_frozen;
 
+/*
+ * The almighty global cpu workqueues.  nr_running is the only field
+ * which is expected to be used frequently by other cpus by
+ * try_to_wake_up() which ends up incrementing it.  Put it in a
+ * separate cacheline.
+ */
+static DEFINE_PER_CPU(struct global_cwq, global_cwq);
+static DEFINE_PER_CPU_SHARED_ALIGNED(atomic_t, gcwq_nr_running);
+
 static int worker_thread(void *__worker);
 
+static struct global_cwq *get_gcwq(unsigned int cpu)
+{
+	return &per_cpu(global_cwq, cpu);
+}
+
+static atomic_t *get_gcwq_nr_running(unsigned int cpu)
+{
+	return &per_cpu(gcwq_nr_running, cpu);
+}
+
 static struct cpu_workqueue_struct *get_cwq(unsigned int cpu,
 					    struct workqueue_struct *wq)
 {
@@ -133,6 +243,106 @@ static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
 }
 
 /*
+ * Policy functions.  These define the policies on how the global
+ * worker pool is managed.  Unless noted otherwise, these functions
+ * assume that they're being called with gcwq->lock held.
+ */
+
+/*
+ * Need to wake up a worker?  Called from anything but currently
+ * running workers.
+ */
+static bool need_more_worker(struct global_cwq *gcwq)
+{
+	atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+
+	return !list_empty(&gcwq->worklist) && !atomic_read(nr_running);
+}
+
+/* Can I start working?  Called from busy but !running workers. */
+static bool may_start_working(struct global_cwq *gcwq)
+{
+	return gcwq->nr_idle;
+}
+
+/* Do I need to keep working?  Called from currently running workers. */
+static bool keep_working(struct global_cwq *gcwq)
+{
+	atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+
+	return !list_empty(&gcwq->worklist) && atomic_read(nr_running) <= 1;
+}
+
+/* Do we need a new worker?  Called from manager. */
+static bool need_to_create_worker(struct global_cwq *gcwq)
+{
+	return need_more_worker(gcwq) && !may_start_working(gcwq);
+}
+
+/* Do I need to be the manager? */
+static bool need_to_manage_workers(struct global_cwq *gcwq)
+{
+	return need_to_create_worker(gcwq) || gcwq->flags & GCWQ_MANAGE_WORKERS;
+}
+
+/* Do we have too many workers and should some go away? */
+static bool too_many_workers(struct global_cwq *gcwq)
+{
+	bool managing = gcwq->flags & GCWQ_MANAGING_WORKERS;
+	int nr_idle = gcwq->nr_idle + managing; /* manager is considered idle */
+	int nr_busy = gcwq->nr_workers - nr_idle;
+
+	return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
+}
+
+/*
+ * Wake up functions.
+ */
+
+/* Return the first worker.  Safe with preemption disabled */
+static struct worker *first_worker(struct global_cwq *gcwq)
+{
+	if (unlikely(list_empty(&gcwq->idle_list)))
+		return NULL;
+
+	return list_first_entry(&gcwq->idle_list, struct worker, entry);
+}
+
+/**
+ * wake_up_worker - wake up an idle worker
+ * @gcwq: gcwq to wake worker for
+ *
+ * Wake up the first idle worker of @gcwq.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ */
+static void wake_up_worker(struct global_cwq *gcwq)
+{
+	struct worker *worker = first_worker(gcwq);
+
+	if (likely(worker))
+		wake_up_process(worker->task);
+}
+
+/**
+ * sched_wake_up_worker - wake up an idle worker from a scheduler callback
+ * @gcwq: gcwq to wake worker for
+ *
+ * Wake up the first idle worker of @gcwq.
+ *
+ * CONTEXT:
+ * Scheduler callback.  DO NOT call from anywhere else.
+ */
+static void sched_wake_up_worker(struct global_cwq *gcwq)
+{
+	struct worker *worker = first_worker(gcwq);
+
+	if (likely(worker))
+		sched_workqueue_wake_up_process(worker->task);
+}
+
+/*
  * Scheduler callbacks.  These functions are called during schedule()
  * with rq lock held.  Don't try to acquire any lock and only access
  * fields which are safe with preemption disabled from local cpu.
@@ -141,29 +351,144 @@ static inline struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
 /* called when a worker task wakes up from sleep */
 void sched_workqueue_worker_wakeup(struct task_struct *task)
 {
+	struct worker *worker = kthread_data(task);
+	struct global_cwq *gcwq = worker->gcwq;
+	atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+
+	if (unlikely(worker->flags & WORKER_IGN_RUNNING))
+		return;
+
+	if (likely(!worker->running)) {
+		worker->running = true;
+		atomic_inc(nr_running);
+	}
 }
 
 /* called when a worker task goes into sleep */
 void sched_workqueue_worker_sleep(struct task_struct *task)
 {
+	struct worker *worker = kthread_data(task);
+	struct global_cwq *gcwq = worker->gcwq;
+	atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+
+	if (unlikely(worker->flags & WORKER_IGN_RUNNING))
+		return;
+
+	/* this can only happen on the local cpu */
+	BUG_ON(gcwq->cpu != raw_smp_processor_id());
+
+	if (likely(worker->running)) {
+		worker->running = false;
+		/*
+		 * The counterpart of the following dec_and_test,
+		 * implied mb, worklist not empty test sequence is in
+		 * insert_work().  Please read comment there.
+		 */
+		if (atomic_dec_and_test(nr_running) &&
+		    !list_empty(&gcwq->worklist))
+			sched_wake_up_worker(gcwq);
+	}
+}
+
+/**
+ * busy_worker_head - return the busy hash head for a work
+ * @gcwq: gcwq of interest
+ * @work: work to be hashed
+ *
+ * Return hash head of @gcwq for @work.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to the hash head.
+ */
+static struct hlist_head *busy_worker_head(struct global_cwq *gcwq,
+					   struct work_struct *work)
+{
+	const int base_shift = ilog2(sizeof(struct work_struct));
+	unsigned long v = (unsigned long)work;
+
+	/* simple shift and fold hash, do we need something better? */
+	v >>= base_shift;
+	v += v >> BUSY_WORKER_HASH_ORDER;
+	v &= BUSY_WORKER_HASH_MASK;
+
+	return &gcwq->busy_hash[v];
 }
 
 /**
- * insert_work - insert a work into cwq
+ * __find_worker_executing_work - find worker which is executing a work
+ * @gcwq: gcwq of interest
+ * @bwh: hash head as returned by busy_worker_head()
+ * @work: work to find worker for
+ *
+ * Find a worker which is executing @work on @gcwq.  @bwh should be
+ * the hash head obtained by calling busy_worker_head() with the same
+ * work.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to worker which is executing @work if found, NULL
+ * otherwise.
+ */
+static struct worker *__find_worker_executing_work(struct global_cwq *gcwq,
+						   struct hlist_head *bwh,
+						   struct work_struct *work)
+{
+	struct worker *worker;
+	struct hlist_node *tmp;
+
+	hlist_for_each_entry(worker, tmp, bwh, hentry)
+		if (worker->current_work == work)
+			return worker;
+	return NULL;
+}
+
+/**
+ * find_worker_executing_work - find worker which is executing a work
+ * @gcwq: gcwq of interest
+ * @work: work to find worker for
+ *
+ * Find a worker which is executing @work on @gcwq.  This function is
+ * identical to __find_worker_executing_work() except that this
+ * function calculates @bwh itself.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to worker which is executing @work if found, NULL
+ * otherwise.
+ */
+static struct worker *find_worker_executing_work(struct global_cwq *gcwq,
+						 struct work_struct *work)
+{
+	return __find_worker_executing_work(gcwq, busy_worker_head(gcwq, work),
+					    work);
+}
+
+/**
+ * insert_work - insert a work into gcwq
  * @cwq: cwq @work belongs to
  * @work: work to insert
  * @head: insertion point
  * @extra_flags: extra WORK_STRUCT_* flags to set
  *
- * Insert @work into @cwq after @head.
+ * Insert @work which belongs to @cwq into @gcwq after @head.
+ * @extra_flags is ORd to WORK_STRUCT flags.
  *
  * CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
  */
 static void insert_work(struct cpu_workqueue_struct *cwq,
 			struct work_struct *work, struct list_head *head,
 			unsigned int extra_flags)
 {
+	struct global_cwq *gcwq = cwq->gcwq;
+
 	cwq->nr_in_flight++;
 
 	/* we own @work, set data and link */
@@ -176,19 +501,29 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
 	smp_wmb();
 
 	list_add_tail(&work->entry, head);
-	wake_up(&cwq->more_work);
+
+	/*
+	 * Ensure either sched_workqueue_worker_sleep() sees the above
+	 * list_add_tail() or we see zero nr_running to avoid workers
+	 * lying around lazily while there are works to be processed.
+	 */
+	smp_mb();
+
+	if (!atomic_read(get_gcwq_nr_running(gcwq->cpu)))
+		wake_up_worker(gcwq);
 }
 
 static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 			 struct work_struct *work)
 {
 	struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+	struct global_cwq *gcwq = cwq->gcwq;
 	unsigned long flags;
 
-	spin_lock_irqsave(&cwq->lock, flags);
+	spin_lock_irqsave(&gcwq->lock, flags);
 	BUG_ON(!list_empty(&work->entry));
 	insert_work(cwq, work, cwq->cur_worklist, 0);
-	spin_unlock_irqrestore(&cwq->lock, flags);
+	spin_unlock_irqrestore(&gcwq->lock, flags);
 }
 
 /**
@@ -300,22 +635,77 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL_GPL(queue_delayed_work_on);
 
+/**
+ * worker_enter_idle - enter idle state
+ * @worker: worker which is entering idle state
+ *
+ * @worker is entering idle state.  Update stats and idle timer if
+ * necessary.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock).
+ */
+static void worker_enter_idle(struct worker *worker)
+{
+	struct global_cwq *gcwq = worker->gcwq;
+
+	BUG_ON(worker->flags & WORKER_IDLE);
+	BUG_ON(!list_empty(&worker->entry) &&
+	       (worker->hentry.next || worker->hentry.pprev));
+
+	worker->flags |= WORKER_IDLE;
+	gcwq->nr_idle++;
+	worker->last_active = jiffies;
+
+	/* idle_list is LIFO */
+	list_add(&worker->entry, &gcwq->idle_list);
+
+	if (likely(!(worker->flags & WORKER_ROGUE))) {
+		if (too_many_workers(gcwq) && !timer_pending(&gcwq->idle_timer))
+			mod_timer(&gcwq->idle_timer,
+				  jiffies + IDLE_WORKER_TIMEOUT);
+	} else
+		wake_up_all(&gcwq->trustee_wait);
+}
+
+/**
+ * worker_leave_idle - leave idle state
+ * @worker: worker which is leaving idle state
+ *
+ * @worker is leaving idle state.  Update stats.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock).
+ */
+static void worker_leave_idle(struct worker *worker)
+{
+	struct global_cwq *gcwq = worker->gcwq;
+
+	BUG_ON(!(worker->flags & WORKER_IDLE));
+	worker->flags &= ~WORKER_IDLE;
+	gcwq->nr_idle--;
+	list_del_init(&worker->entry);
+}
+
 static struct worker *alloc_worker(void)
 {
 	struct worker *worker;
 
 	worker = kzalloc(sizeof(*worker), GFP_KERNEL);
-	if (worker)
+	if (worker) {
+		INIT_LIST_HEAD(&worker->entry);
 		INIT_LIST_HEAD(&worker->scheduled);
+		/* on creation a worker is not idle */
+	}
 	return worker;
 }
 
 /**
  * create_worker - create a new workqueue worker
- * @cwq: cwq the new worker will belong to
+ * @gcwq: gcwq the new worker will belong to
  * @bind: whether to set affinity to @cpu or not
  *
- * Create a new worker which is bound to @cwq.  The returned worker
+ * Create a new worker which is bound to @gcwq.  The returned worker
  * can be started by calling start_worker() or destroyed using
  * destroy_worker().
  *
@@ -325,23 +715,30 @@ static struct worker *alloc_worker(void)
  * RETURNS:
  * Pointer to the newly created worker.
  */
-static struct worker *create_worker(struct cpu_workqueue_struct *cwq, bool bind)
+static struct worker *create_worker(struct global_cwq *gcwq, bool bind)
 {
-	struct worker *worker;
+	struct worker *worker = NULL;
 
 	worker = alloc_worker();
 	if (!worker)
 		goto fail;
 
-	worker->cwq = cwq;
+	worker->gcwq = gcwq;
 
 	worker->task = kthread_create(worker_thread, worker, "kworker/%u",
-				      cwq->cpu);
+				      gcwq->cpu);
 	if (IS_ERR(worker->task))
 		goto fail;
 
+	/*
+	 * A rogue worker will become a regular one if CPU comes
+	 * online later on.  Make sure every worker has
+	 * PF_THREAD_BOUND set.
+	 */
 	if (bind)
-		kthread_bind(worker->task, cwq->cpu);
+		kthread_bind(worker->task, gcwq->cpu);
+	else
+		worker->task->flags |= PF_THREAD_BOUND;
 
 	return worker;
 fail:
@@ -353,13 +750,16 @@ fail:
  * start_worker - start a newly created worker
  * @worker: worker to start
  *
- * Start @worker.
+ * Make the gcwq aware of @worker and start it.
  *
  * CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
  */
 static void start_worker(struct worker *worker)
 {
+	worker->flags |= WORKER_STARTED;
+	worker->gcwq->nr_workers++;
+	worker_enter_idle(worker);
 	wake_up_process(worker->task);
 }
 
@@ -367,16 +767,263 @@ static void start_worker(struct worker *worker)
  * destroy_worker - destroy a workqueue worker
  * @worker: worker to be destroyed
  *
- * Destroy @worker.
+ * Destroy @worker and adjust @gcwq stats accordingly.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which is released and regrabbed.
  */
 static void destroy_worker(struct worker *worker)
 {
+	struct global_cwq *gcwq = worker->gcwq;
+
 	/* sanity check frenzy */
 	BUG_ON(worker->current_work);
 	BUG_ON(!list_empty(&worker->scheduled));
+	BUG_ON(worker->running);
+
+	if (worker->flags & WORKER_STARTED)
+		gcwq->nr_workers--;
+	if (worker->flags & WORKER_IDLE)
+		gcwq->nr_idle--;
+
+	list_del_init(&worker->entry);
+	worker->flags |= WORKER_DIE;
+
+	spin_unlock_irq(&gcwq->lock);
 
 	kthread_stop(worker->task);
 	kfree(worker);
+
+	spin_lock_irq(&gcwq->lock);
+}
+
+static void idle_worker_timeout(unsigned long __gcwq)
+{
+	struct global_cwq *gcwq = (void *)__gcwq;
+
+	spin_lock_irq(&gcwq->lock);
+
+	if (too_many_workers(gcwq)) {
+		struct worker *worker;
+		unsigned long expires;
+
+		/* idle_list is kept in LIFO order, check the last one */
+		worker = list_entry(gcwq->idle_list.prev, struct worker, entry);
+		expires = worker->last_active + IDLE_WORKER_TIMEOUT;
+
+		if (time_before(jiffies, expires))
+			mod_timer(&gcwq->idle_timer, expires);
+		else {
+			/* it's been idle for too long, wake up manager */
+			gcwq->flags |= GCWQ_MANAGE_WORKERS;
+			wake_up_worker(gcwq);
+		}
+	}
+
+	spin_unlock_irq(&gcwq->lock);
+}
+
+static bool send_mayday(struct work_struct *work)
+{
+	struct cpu_workqueue_struct *cwq = get_wq_data(work);
+	struct workqueue_struct *wq = cwq->wq;
+
+	if (!(wq->flags & WQ_RESCUER))
+		return false;
+
+	/* mayday mayday mayday */
+	if (!cpumask_test_and_set_cpu(cwq->gcwq->cpu, wq->mayday_mask))
+		wake_up_process(wq->rescuer->task);
+	return true;
+}
+
+static void gcwq_mayday_timeout(unsigned long __gcwq)
+{
+	struct global_cwq *gcwq = (void *)__gcwq;
+	struct work_struct *work;
+
+	spin_lock_irq(&gcwq->lock);
+
+	if (need_to_create_worker(gcwq)) {
+		/*
+		 * We've been trying to create a new worker but
+		 * haven't been successful.  We might be hitting an
+		 * allocation deadlock.  Send distress calls to
+		 * rescuers.
+		 */
+		list_for_each_entry(work, &gcwq->worklist, entry)
+			send_mayday(work);
+	}
+
+	spin_unlock_irq(&gcwq->lock);
+
+	mod_timer(&gcwq->mayday_timer, jiffies + MAYDAY_INTERVAL);
+}
+
+/**
+ * maybe_create_worker - create a new worker if necessary
+ * @gcwq: gcwq to create a new worker for
+ *
+ * Create a new worker for @gcwq if necessary.  @gcwq is guaranteed to
+ * have at least one idle worker on return from this function.  If
+ * creating a new worker takes longer than MAYDAY_INTERVAL, mayday is
+ * sent to all rescuers with works scheduled on @gcwq to resolve
+ * possible allocation deadlock.
+ *
+ * On return, need_to_create_worker() is guaranteed to be false and
+ * may_start_working() true.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times.  Does GFP_KERNEL allocations.  Called only from
+ * manager.
+ *
+ * RETURNS:
+ * false if no action was taken and gcwq->lock stayed locked, true
+ * otherwise.
+ */
+static bool maybe_create_worker(struct global_cwq *gcwq)
+{
+	if (!need_to_create_worker(gcwq))
+		return false;
+restart:
+	/* if we don't make progress in MAYDAY_INITIAL_TIMEOUT, call for help */
+	mod_timer(&gcwq->mayday_timer, jiffies + MAYDAY_INITIAL_TIMEOUT);
+
+	while (true) {
+		struct worker *worker;
+
+		if (gcwq->nr_workers >= MAX_WORKERS_PER_CPU) {
+			if (printk_ratelimit())
+				printk(KERN_WARNING "workqueue: too many "
+				       "workers (%d) on cpu %d, can't create "
+				       "new ones\n",
+				       gcwq->nr_workers, gcwq->cpu);
+			goto cooldown;
+		}
+
+		spin_unlock_irq(&gcwq->lock);
+
+		worker = create_worker(gcwq, true);
+		if (worker) {
+			del_timer_sync(&gcwq->mayday_timer);
+			spin_lock_irq(&gcwq->lock);
+			start_worker(worker);
+			BUG_ON(need_to_create_worker(gcwq));
+			return true;
+		}
+
+		if (!need_to_create_worker(gcwq))
+			break;
+	cooldown:
+		spin_unlock_irq(&gcwq->lock);
+		__set_current_state(TASK_INTERRUPTIBLE);
+		schedule_timeout(CREATE_COOLDOWN);
+		spin_lock_irq(&gcwq->lock);
+		if (!need_to_create_worker(gcwq))
+			break;
+	}
+
+	spin_unlock_irq(&gcwq->lock);
+	del_timer_sync(&gcwq->mayday_timer);
+	spin_lock_irq(&gcwq->lock);
+	if (need_to_create_worker(gcwq))
+		goto restart;
+	return true;
+}
+
+/**
+ * maybe_destroy_worker - destroy workers which have been idle for a while
+ * @gcwq: gcwq to destroy workers for
+ *
+ * Destroy @gcwq workers which have been idle for longer than
+ * IDLE_WORKER_TIMEOUT.
+ *
+ * LOCKING:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times.  Called only from manager.
+ *
+ * RETURNS:
+ * false if no action was taken and gcwq->lock stayed locked, true
+ * otherwise.
+ */
+static bool maybe_destroy_workers(struct global_cwq *gcwq)
+{
+	bool ret = false;
+
+	while (too_many_workers(gcwq)) {
+		struct worker *worker;
+		unsigned long expires;
+
+		worker = list_entry(gcwq->idle_list.prev, struct worker, entry);
+		expires = worker->last_active + IDLE_WORKER_TIMEOUT;
+
+		if (time_before(jiffies, expires)) {
+			mod_timer(&gcwq->idle_timer, expires);
+			break;
+		}
+
+		destroy_worker(worker);
+		ret = true;
+	}
+
+	return ret;
+}
+
+/**
+ * manage_workers - manage worker pool
+ * @worker: self
+ *
+ * Assume the manager role and manage gcwq worker pool @worker belongs
+ * to.  At any given time, there can be only zero or one manager per
+ * gcwq.  The exclusion is handled automatically by this function.
+ *
+ * The caller can safely start processing works on false return.  On
+ * true return, it's guaranteed that need_to_create_worker() is false
+ * and may_start_working() is true.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times.  Does GFP_KERNEL allocations.
+ *
+ * RETURNS:
+ * false if no action was taken and gcwq->lock stayed locked, true if
+ * some action was taken.
+ */
+static bool manage_workers(struct worker *worker)
+{
+	struct global_cwq *gcwq = worker->gcwq;
+	bool ret = false;
+
+	if (gcwq->flags & GCWQ_MANAGING_WORKERS)
+		return ret;
+
+	gcwq->flags &= ~GCWQ_MANAGE_WORKERS;
+	gcwq->flags |= GCWQ_MANAGING_WORKERS;
+
+	/* manager should never be accounted as running */
+	BUG_ON(worker->running);
+	worker->flags |= WORKER_MANAGER;
+
+	/*
+	 * Destroy and then create so that may_start_working() is true
+	 * on return.
+	 */
+	ret |= maybe_destroy_workers(gcwq);
+	ret |= maybe_create_worker(gcwq);
+
+	gcwq->flags &= ~GCWQ_MANAGING_WORKERS;
+	worker->flags &= ~WORKER_MANAGER;
+	BUG_ON(worker->running);
+
+	/*
+	 * The trustee might be waiting to take over the manager
+	 * position, tell it we're done.
+	 */
+	if (unlikely(gcwq->trustee))
+		wake_up_all(&gcwq->trustee_wait);
+
+	return ret;
 }
 
 /**
@@ -394,7 +1041,7 @@ static void destroy_worker(struct worker *worker)
  * be nested inside outer list_for_each_entry_safe().
  *
  * CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
  */
 static void schedule_work_to_worker(struct worker *worker,
 				    struct work_struct *work,
@@ -431,7 +1078,7 @@ static void schedule_work_to_worker(struct worker *worker,
  * decrement nr_in_flight of its cwq and handle workqueue flushing.
  *
  * CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
  */
 static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq,
 				 unsigned int work_color)
@@ -456,13 +1103,16 @@ static void cwq_dec_nr_in_flight(struct cpu_workqueue_struct *cwq,
  * call this function to process a work.
  *
  * CONTEXT:
- * spin_lock_irq(cwq->lock) which is released and regrabbed.
+ * spin_lock_irq(gcwq->lock) which is released and regrabbed.
  */
 static void process_one_work(struct worker *worker, struct work_struct *work)
 {
-	struct cpu_workqueue_struct *cwq = worker->cwq;
+	struct cpu_workqueue_struct *cwq = get_wq_data(work);
+	struct global_cwq *gcwq = cwq->gcwq;
+	struct hlist_head *bwh = busy_worker_head(gcwq, work);
 	work_func_t f = work->func;
 	unsigned int work_color;
+	struct worker *collision;
 #ifdef CONFIG_LOCKDEP
 	/*
 	 * It is permissible to free the struct work_struct from
@@ -473,14 +1123,26 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
 	 */
 	struct lockdep_map lockdep_map = work->lockdep_map;
 #endif
+	/*
+	 * A single work shouldn't be executed concurrently by
+	 * multiple workers on a single cpu.  Check whether anyone is
+	 * already processing the work.  If so, defer the work to the
+	 * currently executing one.
+	 */
+	collision = __find_worker_executing_work(gcwq, bwh, work);
+	if (unlikely(collision)) {
+		schedule_work_to_worker(collision, work, NULL);
+		return;
+	}
+
 	/* claim and process */
+	hlist_add_head(&worker->hentry, bwh);
 	worker->current_work = work;
 	work_color = *work_data_bits(work) & WORK_STRUCT_COLOR;
 	list_del_init(&work->entry);
 
-	spin_unlock_irq(&cwq->lock);
+	spin_unlock_irq(&gcwq->lock);
 
-	BUG_ON(get_wq_data(work) != cwq);
 	work_clear_pending(work);
 	lock_map_acquire(&cwq->wq->lockdep_map);
 	lock_map_acquire(&lockdep_map);
@@ -498,9 +1160,10 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
 		dump_stack();
 	}
 
-	spin_lock_irq(&cwq->lock);
+	spin_lock_irq(&gcwq->lock);
 
 	/* we're done with it, release */
+	hlist_del_init(&worker->hentry);
 	worker->current_work = NULL;
 	cwq_dec_nr_in_flight(cwq, work_color);
 }
@@ -514,7 +1177,7 @@ static void process_one_work(struct worker *worker, struct work_struct *work)
  * fetches a work from the top and executes it.
  *
  * CONTEXT:
- * spin_lock_irq(cwq->lock) which may be released and regrabbed
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
  * multiple times.
  */
 static void process_scheduled_works(struct worker *worker)
@@ -530,36 +1193,60 @@ static void process_scheduled_works(struct worker *worker)
  * worker_thread - the worker thread function
  * @__worker: self
  *
- * The cwq worker thread function.
+ * The gcwq worker thread function.  There's a single dynamic pool of
+ * these per cpu.  These workers process all works regardless of
+ * their specific target workqueue.  The only exception is works which
+ * belong to workqueues with a rescuer which will be explained in
+ * rescuer_thread().
  */
 static int worker_thread(void *__worker)
 {
 	struct worker *worker = __worker;
-	struct cpu_workqueue_struct *cwq = worker->cwq;
-	DEFINE_WAIT(wait);
-
-	/* set workqueue scheduler */
-	switch_sched_workqueue(current, true);
-
-	if (cwq->wq->flags & WQ_FREEZEABLE)
-		set_freezable();
+	struct global_cwq *gcwq = worker->gcwq;
+	atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+
+	/* set workqueue scheduler and adjust nice level */
+	switch_sched_workqueue(worker->task, true);
+woke_up:
+	spin_lock_irq(&gcwq->lock);
+
+	/* DIE can be set only while we're idle, checking here is enough */
+	if (worker->flags & WORKER_DIE) {
+		spin_unlock_irq(&gcwq->lock);
+		switch_sched_workqueue(worker->task, false);
+		return 0;
+	}
 
-	for (;;) {
-		prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
-		if (!freezing(current) &&
-		    !kthread_should_stop() &&
-		    list_empty(&cwq->worklist))
-			schedule();
-		finish_wait(&cwq->more_work, &wait);
+	worker_leave_idle(worker);
+repeat:
+	if (need_more_worker(gcwq)) {
+		if (unlikely(!may_start_working(gcwq)) &&
+		    manage_workers(worker))
+			goto repeat;
 
-		if (kthread_should_stop())
-			break;
+		/*
+		 * ->scheduled list can only be filled while a worker
+		 * is preparing to process a work or actually
+		 * processing it.  Make sure nobody diddled with it
+		 * while I was sleeping.  Also, nobody should have set
+		 * running till this point.
+		 */
+		BUG_ON(!list_empty(&worker->scheduled));
+		BUG_ON(worker->running);
 
-		spin_lock_irq(&cwq->lock);
+		/*
+		 * When control reaches this point, we're guaranteed
+		 * to have at least one idle worker or that someone
+		 * else has already assumed the manager role.
+		 */
+		if (likely(!(worker->flags & WORKER_IGN_RUNNING))) {
+			worker->running = true;
+			atomic_inc(nr_running);
+		}
 
-		while (!list_empty(&cwq->worklist)) {
+		do {
 			struct work_struct *work =
-				list_first_entry(&cwq->worklist,
+				list_first_entry(&gcwq->worklist,
 						 struct work_struct, entry);
 
 			if (likely(!(*work_data_bits(work) &
@@ -572,12 +1259,145 @@ static int worker_thread(void *__worker)
 				schedule_work_to_worker(worker, work, NULL);
 				process_scheduled_works(worker);
 			}
+		} while (keep_working(gcwq));
+
+		if (likely(worker->running)) {
+			worker->running = false;
+			atomic_dec(nr_running);
 		}
+	}
+
+	if (unlikely(need_to_manage_workers(gcwq)) && manage_workers(worker))
+		goto repeat;
+
+	/*
+	 * gcwq->lock is held and there's no work to process and no
+	 * need to manage, sleep.  Workers are woken up only while
+	 * holding gcwq->lock or from local cpu, so setting the
+	 * current state before releasing gcwq->lock is enough to
+	 * prevent losing any event.
+	 */
+	worker_enter_idle(worker);
+	__set_current_state(TASK_INTERRUPTIBLE);
+	spin_unlock_irq(&gcwq->lock);
+	schedule();
+	goto woke_up;
+}
+
+/**
+ * worker_maybe_bind_and_lock - bind worker to its cpu if possible and lock gcwq
+ * @worker: target worker
+ *
+ * Works which are scheduled while the cpu is online must at least be
+ * scheduled to a worker which is bound to the cpu so that if they are
+ * flushed from cpu callbacks while cpu is going down, they are
+ * guaranteed to execute on the cpu.
+ *
+ * This function is to be used to bind rescuers and new rogue workers
+ * to the target cpu and may race with cpu going down or coming
+ * online.  kthread_bind() can't be used because it may put the worker
+ * to already dead cpu and force_cpus_allowed_ptr() can't be used
+ * verbatim as it's best effort and blocking and gcwq may be
+ * [dis]associated in the meantime.
+ *
+ * This function tries force_cpus_allowed_ptr() and locks gcwq and
+ * verifies the binding against GCWQ_DISASSOCIATED which is set during
+ * CPU_DYING and cleared during CPU_ONLINE, so if the worker enters
+ * idle state or fetches works without dropping lock, it can guarantee
+ * the scheduling requirement described in the first paragraph.
+ *
+ * CONTEXT:
+ * Might sleep.  Called without any lock but returns with gcwq->lock
+ * held.
+ */
+static void worker_maybe_bind_and_lock(struct worker *worker)
+{
+	struct global_cwq *gcwq = worker->gcwq;
+	struct task_struct *task = worker->task;
 
-		spin_unlock_irq(&cwq->lock);
+	while (true) {
+		/*
+		 * The following call may fail, succeed or succeed
+		 * without actually migrating the task to the cpu if
+		 * it races with cpu hotunplug operation.  Verify
+		 * against GCWQ_DISASSOCIATED.
+		 */
+		force_cpus_allowed_ptr(task, get_cpu_mask(gcwq->cpu));
+
+		spin_lock_irq(&gcwq->lock);
+		if (gcwq->flags & GCWQ_DISASSOCIATED)
+			return;
+		if (task_cpu(task) == gcwq->cpu &&
+		    cpumask_equal(&current->cpus_allowed,
+				  get_cpu_mask(gcwq->cpu)))
+			return;
+		spin_unlock_irq(&gcwq->lock);
+
+		/* CPU has come up inbetween, retry migration */
+		cpu_relax();
 	}
+}
 
-	return 0;
+/**
+ * rescuer_thread - the rescuer thread function
+ * @__wq: the associated workqueue
+ *
+ * Workqueue rescuer thread function.  There's one rescuer for each
+ * workqueue which has WQ_RESCUER set.
+ *
+ * Regular work processing on a gcwq may block trying to create a new
+ * worker which uses GFP_KERNEL allocation which has slight chance of
+ * developing into deadlock if some works currently on the same queue
+ * need to be processed to satisfy the GFP_KERNEL allocation.  This is
+ * the problem rescuer solves.
+ *
+ * When such a condition is possible, the gcwq summons rescuers of all
+ * workqueues which have works queued on the gcwq and lets them process
+ * those works so that forward progress can be guaranteed.
+ *
+ * This should happen rarely.
+ */
+static int rescuer_thread(void *__wq)
+{
+	struct workqueue_struct *wq = __wq;
+	struct worker *rescuer = wq->rescuer;
+	unsigned int cpu;
+
+	rescuer->flags |= WORKER_RESCUER;
+	set_user_nice(current, RESCUER_NICE_LEVEL);
+repeat:
+	set_current_state(TASK_INTERRUPTIBLE);
+
+	if (kthread_should_stop())
+		return 0;
+
+	for_each_cpu(cpu, wq->mayday_mask) {
+		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+		struct global_cwq *gcwq = cwq->gcwq;
+		struct work_struct *work, *n;
+
+		__set_current_state(TASK_RUNNING);
+		cpumask_clear_cpu(cpu, wq->mayday_mask);
+
+		/* migrate to the target cpu if possible */
+		rescuer->gcwq = gcwq;
+		worker_maybe_bind_and_lock(rescuer);
+
+		/*
+		 * Slurp in all works issued via this workqueue and
+		 * process'em.
+		 */
+		BUG_ON(!list_empty(&rescuer->scheduled));
+		list_for_each_entry_safe(work, n, &gcwq->worklist, entry)
+			if (get_wq_data(work) == cwq)
+				schedule_work_to_worker(rescuer, work, &n);
+
+		process_scheduled_works(rescuer);
+		spin_unlock_irq(&gcwq->lock);
+	}
+
+	schedule();
+	goto repeat;
 }
 
 struct wq_barrier {
@@ -609,7 +1429,7 @@ static void wq_barrier_func(struct work_struct *work)
  * after a work with LINKED flag set.
  *
  * CONTEXT:
- * spin_lock_irq(cwq->lock).
+ * spin_lock_irq(gcwq->lock).
  */
 static void insert_wq_barrier(struct wq_barrier *barr,
 			      struct work_struct *target, struct worker *worker)
@@ -668,8 +1488,9 @@ void flush_workqueue(struct workqueue_struct *wq)
 
 	for_each_possible_cpu(cpu) {
 		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+		struct global_cwq *gcwq = cwq->gcwq;
 
-		spin_lock_irq(&cwq->lock);
+		spin_lock_irq(&gcwq->lock);
 
 		BUG_ON(cwq->flush_cnt);
 
@@ -681,7 +1502,7 @@ void flush_workqueue(struct workqueue_struct *wq)
 			wait = true;
 		}
 
-		spin_unlock_irq(&cwq->lock);
+		spin_unlock_irq(&gcwq->lock);
 	}
 
 	if (wait)
@@ -707,17 +1528,19 @@ int flush_work(struct work_struct *work)
 {
 	struct worker *worker = NULL;
 	struct cpu_workqueue_struct *cwq;
+	struct global_cwq *gcwq;
 	struct wq_barrier barr;
 
 	might_sleep();
 	cwq = get_wq_data(work);
 	if (!cwq)
 		return 0;
+	gcwq = cwq->gcwq;
 
 	lock_map_acquire(&cwq->wq->lockdep_map);
 	lock_map_release(&cwq->wq->lockdep_map);
 
-	spin_lock_irq(&cwq->lock);
+	spin_lock_irq(&gcwq->lock);
 	if (!list_empty(&work->entry)) {
 		/*
 		 * See the comment near try_to_grab_pending()->smp_rmb().
@@ -727,18 +1550,17 @@ int flush_work(struct work_struct *work)
 		if (unlikely(cwq != get_wq_data(work)))
 			goto already_gone;
 	} else {
-		if (cwq->worker && cwq->worker->current_work == work)
-			worker = cwq->worker;
+		worker = find_worker_executing_work(gcwq, work);
 		if (!worker)
 			goto already_gone;
 	}
 
 	insert_wq_barrier(&barr, work, worker);
-	spin_unlock_irq(&cwq->lock);
+	spin_unlock_irq(&gcwq->lock);
 	wait_for_completion(&barr.done);
 	return 1;
 already_gone:
-	spin_unlock_irq(&cwq->lock);
+	spin_unlock_irq(&gcwq->lock);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(flush_work);
@@ -749,6 +1571,7 @@ EXPORT_SYMBOL_GPL(flush_work);
  */
 static int try_to_grab_pending(struct work_struct *work)
 {
+	struct global_cwq *gcwq;
 	struct cpu_workqueue_struct *cwq;
 	int ret = -1;
 
@@ -763,8 +1586,9 @@ static int try_to_grab_pending(struct work_struct *work)
 	cwq = get_wq_data(work);
 	if (!cwq)
 		return ret;
+	gcwq = cwq->gcwq;
 
-	spin_lock_irq(&cwq->lock);
+	spin_lock_irq(&gcwq->lock);
 	if (!list_empty(&work->entry)) {
 		/*
 		 * This work is queued, but perhaps we locked the wrong cwq.
@@ -779,7 +1603,7 @@ static int try_to_grab_pending(struct work_struct *work)
 			ret = 1;
 		}
 	}
-	spin_unlock_irq(&cwq->lock);
+	spin_unlock_irq(&gcwq->lock);
 
 	return ret;
 }
@@ -787,18 +1611,17 @@ static int try_to_grab_pending(struct work_struct *work)
 static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq,
 				struct work_struct *work)
 {
+	struct global_cwq *gcwq = cwq->gcwq;
 	struct wq_barrier barr;
 	struct worker *worker;
 
-	spin_lock_irq(&cwq->lock);
+	spin_lock_irq(&gcwq->lock);
 
-	worker = NULL;
-	if (unlikely(cwq->worker && cwq->worker->current_work == work)) {
-		worker = cwq->worker;
+	worker = find_worker_executing_work(gcwq, work);
+	if (unlikely(worker))
 		insert_wq_barrier(&barr, work, worker);
-	}
 
-	spin_unlock_irq(&cwq->lock);
+	spin_unlock_irq(&gcwq->lock);
 
 	if (unlikely(worker))
 		wait_for_completion(&barr.done);
@@ -1026,7 +1849,6 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 						const char *lock_name)
 {
 	struct workqueue_struct *wq;
-	bool failed = false;
 	unsigned int cpu;
 
 	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
@@ -1044,58 +1866,55 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
 	lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
 	INIT_LIST_HEAD(&wq->list);
 
-	cpu_maps_update_begin();
-	/*
-	 * We must place this wq on list even if the code below fails.
-	 * cpu_down(cpu) can remove cpu from cpu_populated_map before
-	 * destroy_workqueue() takes the lock, in that case we leak
-	 * cwq[cpu]->thread.
-	 */
-	spin_lock(&workqueue_lock);
-	list_add(&wq->list, &workqueues);
 	for_each_possible_cpu(cpu) {
 		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+		struct global_cwq *gcwq = get_gcwq(cpu);
 
-		if (workqueue_frozen && wq->flags & WQ_FREEZEABLE)
-			cwq->cur_worklist = &cwq->frozen_works;
-		else
-			cwq->cur_worklist = &cwq->worklist;
-	}
-	spin_unlock(&workqueue_lock);
-	/*
-	 * We must initialize cwqs for each possible cpu even if we
-	 * are going to call destroy_workqueue() finally. Otherwise
-	 * cpu_up() can hit the uninitialized cwq once we drop the
-	 * lock.
-	 */
-	for_each_possible_cpu(cpu) {
-		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
-
-		cwq->cpu = cpu;
+		BUG_ON((unsigned long)cwq & WORK_STRUCT_FLAG_MASK);
+		cwq->gcwq = gcwq;
+		cwq->cur_worklist = &gcwq->worklist;
 		cwq->wq = wq;
-		spin_lock_init(&cwq->lock);
-		INIT_LIST_HEAD(&cwq->worklist);
-		init_waitqueue_head(&cwq->more_work);
 		INIT_LIST_HEAD(&cwq->frozen_works);
-
-		if (failed || !cpu_online(cpu))
-			continue;
-		cwq->worker = create_worker(cwq, true);
-		if (cwq->worker)
-			start_worker(cwq->worker);
-		else
-			failed = true;
 	}
-	cpu_maps_update_done();
 
-	if (failed) {
-		destroy_workqueue(wq);
-		wq = NULL;
+	if (flags & WQ_RESCUER) {
+		struct worker *rescuer;
+
+		if (!alloc_cpumask_var(&wq->mayday_mask, GFP_KERNEL))
+			goto err;
+
+		wq->rescuer = rescuer = alloc_worker();
+		if (!rescuer)
+			goto err;
+
+		rescuer->task = kthread_create(rescuer_thread, wq, "%s", name);
+		if (IS_ERR(rescuer->task))
+			goto err;
+
+		wq->rescuer = rescuer;
+		rescuer->task->flags |= PF_THREAD_BOUND;
+		wake_up_process(rescuer->task);
 	}
+
+	/*
+	 * Works can't be queued before we return.  Add to workqueue
+	 * list and set cur_worklist to frozen_works if frozen.
+	 */
+	spin_lock(&workqueue_lock);
+	list_add(&wq->list, &workqueues);
+	if (workqueue_frozen && wq->flags & WQ_FREEZEABLE)
+		for_each_possible_cpu(cpu) {
+			struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
+			cwq->cur_worklist = &cwq->frozen_works;
+		}
+	spin_unlock(&workqueue_lock);
+
 	return wq;
 err:
 	if (wq) {
 		free_percpu(wq->cpu_wq);
+		free_cpumask_var(wq->mayday_mask);
+		kfree(wq->rescuer);
 		kfree(wq);
 	}
 	return NULL;
@@ -1110,9 +1929,7 @@ EXPORT_SYMBOL_GPL(__create_workqueue_key);
  */
 void destroy_workqueue(struct workqueue_struct *wq)
 {
-	int cpu;
-
-	cpu_maps_update_begin();
+	unsigned int cpu;
 
 	flush_workqueue(wq);
 
@@ -1124,70 +1941,400 @@ void destroy_workqueue(struct workqueue_struct *wq)
 	list_del(&wq->list);
 	spin_unlock(&workqueue_lock);
 
+	/* sanity check */
 	for_each_possible_cpu(cpu) {
 		struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
-
-		/* cpu_add_remove_lock protects cwq->thread */
-		if (cwq->worker) {
-			destroy_worker(cwq->worker);
-			cwq->worker = NULL;
-		}
 		BUG_ON(cwq->nr_in_flight);
 		BUG_ON(!list_empty(&cwq->frozen_works));
 	}
 
- 	cpu_maps_update_done();
+	if (wq->flags & WQ_RESCUER) {
+		kthread_stop(wq->rescuer->task);
+		free_cpumask_var(wq->mayday_mask);
+	}
 
 	free_percpu(wq->cpu_wq);
 	kfree(wq);
 }
 EXPORT_SYMBOL_GPL(destroy_workqueue);
 
-static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
-						unsigned long action,
-						void *hcpu)
+/*
+ * CPU hotplug.
+ *
+ * There are two challenges in supporting CPU hotplug.  Firstly, there
+ * are a lot of assumptions on strong associations among work, cwq and
+ * gcwq which make migrating pending and scheduled works very
+ * difficult to implement without impacting hot paths.  Secondly,
+ * gcwqs serve a mix of short, long and very long running works, making
+ * blocked draining impractical.
+ *
+ * This is solved by allowing a gcwq to be detached from CPU, running
+ * it with unbound (rogue) workers and allowing it to be reattached
+ * later if the cpu comes back online.  A separate thread is created
+ * to govern a gcwq in such state and is called the trustee of the
+ * gcwq.
+ *
+ * Trustee states and their descriptions.
+ *
+ * START	Command state used on startup.  On CPU_DOWN_PREPARE, a
+ *		new trustee is started with this state.
+ *
+ * IN_CHARGE	Once started, trustee will enter this state after
+ *		assuming the manager role and making all existing
+ *		workers rogue.  DOWN_PREPARE waits for trustee to
+ *		enter this state.  After reaching IN_CHARGE, trustee
+ *		tries to execute the pending worklist until it's empty
+ *		and the state is set to BUTCHER, or the state is set
+ *		to RELEASE.
+ *
+ * BUTCHER	Command state which is set by the cpu callback after
+ *		the cpu has gone down.  Once this state is set, the trustee
+ *		knows that there will be no new works on the worklist
+ *		and once the worklist is empty it can proceed to
+ *		killing idle workers.
+ *
+ * RELEASE	Command state which is set by the cpu callback if the
+ *		cpu down has been canceled or it has come online
+ *		again.  After recognizing this state, trustee stops
+ *		trying to drain or butcher and clears ROGUE, rebinds
+ *		all remaining workers back to the cpu and releases
+ *		manager role.
+ *
+ * DONE		Trustee will enter this state after BUTCHER or RELEASE
+ *		is complete.
+ *
+ *          trustee                 CPU                draining
+ *         took over                down               complete
+ * START -----------> IN_CHARGE -----------> BUTCHER -----------> DONE
+ *                        |                     |                  ^
+ *                        | CPU is back online  v   return workers |
+ *                         ----------------> RELEASE --------------
+ */
+
+#define for_each_busy_worker(worker, i, pos, gcwq)			\
+	for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++)			\
+		hlist_for_each_entry(worker, pos, &gcwq->busy_hash[i], hentry) \
+			if (!(worker->flags & WORKER_RESCUER))
+
+/**
+ * trustee_wait_event_timeout - timed event wait for trustee
+ * @cond: condition to wait for
+ * @timeout: timeout in jiffies
+ *
+ * wait_event_timeout() for trustee to use.  Handles locking and
+ * checks for RELEASE request.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times.  To be used by trustee.
+ *
+ * RETURNS:
+ * Positive indicating left time if @cond is satisfied, 0 if timed
+ * out, -1 if canceled.
+ */
+#define trustee_wait_event_timeout(cond, timeout) ({			\
+	long __ret = (timeout);						\
+	while (!((cond) || (gcwq->trustee_state == TRUSTEE_RELEASE)) &&	\
+	       __ret) {							\
+		spin_unlock_irq(&gcwq->lock);				\
+		__wait_event_timeout(gcwq->trustee_wait, (cond) ||	\
+			(gcwq->trustee_state == TRUSTEE_RELEASE),	\
+			__ret);						\
+		spin_lock_irq(&gcwq->lock);				\
+	}								\
+	gcwq->trustee_state == TRUSTEE_RELEASE ? -1 : (__ret);		\
+})
+
+/**
+ * trustee_wait_event - event wait for trustee
+ * @cond: condition to wait for
+ *
+ * wait_event() for trustee to use.  Automatically handles locking and
+ * checks for RELEASE request.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times.  To be used by trustee.
+ *
+ * RETURNS:
+ * 0 if @cond is satisfied, -1 if canceled.
+ */
+#define trustee_wait_event(cond) ({					\
+	long __ret1;							\
+	__ret1 = trustee_wait_event_timeout(cond, MAX_SCHEDULE_TIMEOUT);\
+	__ret1 < 0 ? -1 : 0;						\
+})
+
+static bool __cpuinit trustee_unset_rogue(struct worker *worker)
 {
-	unsigned int cpu = (unsigned long)hcpu;
-	struct cpu_workqueue_struct *cwq;
-	struct workqueue_struct *wq;
-	int ret = NOTIFY_OK;
+	struct global_cwq *gcwq = worker->gcwq;
 
-	action &= ~CPU_TASKS_FROZEN;
+	if (!(worker->flags & WORKER_ROGUE))
+		return false;
+
+	spin_unlock_irq(&gcwq->lock);
+	BUG_ON(force_cpus_allowed_ptr(worker->task, get_cpu_mask(gcwq->cpu)));
+	spin_lock_irq(&gcwq->lock);
+	worker->flags &= ~WORKER_ROGUE;
+	return true;
+}
+
+static int __cpuinit trustee_thread(void *__gcwq)
+{
+	struct global_cwq *gcwq = __gcwq;
+	atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
+	struct worker *worker;
+	struct work_struct *work;
+	struct hlist_node *pos;
+	int i;
+
+	BUG_ON(gcwq->cpu != smp_processor_id());
+
+	spin_lock_irq(&gcwq->lock);
+	/*
+	 * Claim the manager position and make all workers rogue.
+	 * Trustee must be bound to the target cpu and can't be
+	 * cancelled.
+	 */
+	BUG_ON(gcwq->cpu != smp_processor_id());
+	BUG_ON(trustee_wait_event(!(gcwq->flags & GCWQ_MANAGING_WORKERS)) < 0);
+	gcwq->flags |= GCWQ_MANAGING_WORKERS;
+
+	list_for_each_entry(worker, &gcwq->idle_list, entry)
+		worker->flags |= WORKER_ROGUE;
+
+	for_each_busy_worker(worker, i, pos, gcwq)
+		worker->flags |= WORKER_ROGUE;
+
+	/*
+	 * Call schedule() so that we cross rq->lock and thus can
+	 * guarantee sched callbacks see the rogue flag.  This is
+	 * necessary as scheduler callbacks may be invoked from other
+	 * cpus.
+	 */
+	spin_unlock_irq(&gcwq->lock);
+	schedule();
+	spin_lock_irq(&gcwq->lock);
+
+	/*
+	 * Sched callbacks are disabled now.  Clear running and adjust
+	 * nr_running accordingly.  After this, gcwq->nr_running stays
+	 * zero and need_more_worker() and keep_working() are always
+	 * true as long as the worklist is not empty.
+	 */
+	for_each_busy_worker(worker, i, pos, gcwq)
+		if (worker->running) {
+			worker->running = false;
+			atomic_dec(nr_running);
+		}
+	WARN_ON(atomic_read(nr_running));
+
+	spin_unlock_irq(&gcwq->lock);
+	del_timer_sync(&gcwq->idle_timer);
+	spin_lock_irq(&gcwq->lock);
+
+	/*
+	 * We're now in charge.  Notify and proceed to drain.  We need
+	 * to keep the gcwq running during the whole CPU down
+	 * procedure as other cpu hotunplug callbacks may need to
+	 * flush currently running tasks.
+	 */
+	gcwq->trustee_state = TRUSTEE_IN_CHARGE;
+	wake_up_all(&gcwq->trustee_wait);
+
+	/*
+	 * The original cpu is in the process of dying and may go away
+	 * anytime now.  When that happens, we and all workers would
+	 * be migrated to other cpus.  Try draining any left work.  We
+	 * want to get it over with ASAP - spam rescuers, wake up as
+	 * many idlers as necessary and create new ones till the
+	 * worklist is empty.
+	 */
+	while (!list_empty(&gcwq->worklist) ||
+	       gcwq->trustee_state == TRUSTEE_IN_CHARGE) {
+		int nr_works = 0;
 
-undo:
-	list_for_each_entry(wq, &workqueues, list) {
-		cwq = per_cpu_ptr(wq->cpu_wq, cpu);
+		list_for_each_entry(work, &gcwq->worklist, entry) {
+			send_mayday(work);
+			nr_works++;
+		}
 
-		switch (action) {
-		case CPU_UP_PREPARE:
-			cwq->worker = create_worker(cwq, false);
-			if (cwq->worker)
+		list_for_each_entry(worker, &gcwq->idle_list, entry) {
+			if (!nr_works--)
 				break;
-			printk(KERN_ERR "workqueue [%s] for %i failed\n",
-				wq->name, cpu);
-			action = CPU_UP_CANCELED;
-			ret = NOTIFY_BAD;
-			goto undo;
-
-		case CPU_ONLINE:
-			kthread_bind(cwq->worker->task, cpu);
-			start_worker(cwq->worker);
+			wake_up_process(worker->task);
+		}
+
+		if (need_to_create_worker(gcwq) &&
+		    gcwq->nr_workers < MAX_WORKERS_PER_CPU) {
+			spin_unlock_irq(&gcwq->lock);
+			worker = create_worker(gcwq, false);
+			if (worker) {
+				worker_maybe_bind_and_lock(worker);
+				worker->flags |= WORKER_ROGUE;
+				start_worker(worker);
+			} else
+				spin_lock_irq(&gcwq->lock);
+		}
+
+		/* give a breather */
+		if (trustee_wait_event_timeout(false, TRUSTEE_COOLDOWN) < 0)
 			break;
+	}
 
-		case CPU_UP_CANCELED:
-			start_worker(cwq->worker);
-		case CPU_POST_DEAD:
-			flush_workqueue(wq);
-			/* cpu_add_remove_lock protects cwq->thread */
-			if (cwq->worker) {
-				destroy_worker(cwq->worker);
-				cwq->worker = NULL;
-			}
+	/*
+	 * Either all works have been scheduled and cpu is down, or
+	 * cpu down has already been canceled.  Wait for and butcher
+	 * all workers till we're canceled.
+	 */
+	while (gcwq->nr_workers) {
+		if (trustee_wait_event(!list_empty(&gcwq->idle_list)) < 0)
 			break;
+
+		while (!list_empty(&gcwq->idle_list)) {
+			worker = list_first_entry(&gcwq->idle_list,
+						  struct worker, entry);
+			destroy_worker(worker);
 		}
 	}
 
-	return ret;
+	/*
+	 * At this point, either draining has completed and no worker
+	 * is left, or cpu down has been canceled or the cpu is being
+	 * brought back up.  Clear ROGUE from and rebind all left
+	 * workers.  Unsetting ROGUE and rebinding require dropping
+	 * gcwq->lock.  Restart loop after each successful release.
+	 */
+recheck:
+	list_for_each_entry(worker, &gcwq->idle_list, entry)
+		if (trustee_unset_rogue(worker))
+			goto recheck;
+
+	for_each_busy_worker(worker, i, pos, gcwq)
+		if (trustee_unset_rogue(worker))
+			goto recheck;
+
+	/* relinquish manager role */
+	gcwq->flags &= ~GCWQ_MANAGING_WORKERS;
+
+	/* notify completion */
+	gcwq->trustee = NULL;
+	gcwq->trustee_state = TRUSTEE_DONE;
+	wake_up_all(&gcwq->trustee_wait);
+	spin_unlock_irq(&gcwq->lock);
+	return 0;
+}
+
+/**
+ * wait_trustee_state - wait for trustee to enter the specified state
+ * @gcwq: gcwq the trustee of interest belongs to
+ * @state: target state to wait for
+ *
+ * Wait for the trustee to reach @state.  DONE is already matched.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times.  To be used by cpu_callback.
+ */
+static void __cpuinit wait_trustee_state(struct global_cwq *gcwq, int state)
+{
+	if (!(gcwq->trustee_state == state ||
+	      gcwq->trustee_state == TRUSTEE_DONE)) {
+		spin_unlock_irq(&gcwq->lock);
+		__wait_event(gcwq->trustee_wait,
+			     gcwq->trustee_state == state ||
+			     gcwq->trustee_state == TRUSTEE_DONE);
+		spin_lock_irq(&gcwq->lock);
+	}
+}
+
+static int __cpuinit workqueue_cpu_callback(struct notifier_block *nfb,
+						unsigned long action,
+						void *hcpu)
+{
+	unsigned int cpu = (unsigned long)hcpu;
+	struct global_cwq *gcwq = get_gcwq(cpu);
+	struct task_struct *new_trustee = NULL;
+	struct worker *uninitialized_var(new_worker);
+
+	action &= ~CPU_TASKS_FROZEN;
+
+	switch (action) {
+	case CPU_DOWN_PREPARE:
+		new_trustee = kthread_create(trustee_thread, gcwq,
+					     "workqueue_trustee/%d\n", cpu);
+		if (IS_ERR(new_trustee))
+			return NOTIFY_BAD;
+		kthread_bind(new_trustee, cpu);
+		/* fall through */
+	case CPU_UP_PREPARE:
+		BUG_ON(gcwq->first_idle);
+		new_worker = create_worker(gcwq, false);
+		if (!new_worker) {
+			if (new_trustee)
+				kthread_stop(new_trustee);
+			return NOTIFY_BAD;
+		}
+	}
+
+	spin_lock_irq(&gcwq->lock);
+
+	switch (action) {
+	case CPU_DOWN_PREPARE:
+		/* initialize trustee and tell it to acquire the gcwq */
+		BUG_ON(gcwq->trustee || gcwq->trustee_state != TRUSTEE_DONE);
+		gcwq->trustee = new_trustee;
+		gcwq->trustee_state = TRUSTEE_START;
+		wake_up_process(gcwq->trustee);
+		wait_trustee_state(gcwq, TRUSTEE_IN_CHARGE);
+		/* fall through */
+	case CPU_UP_PREPARE:
+		BUG_ON(gcwq->first_idle);
+		gcwq->first_idle = new_worker;
+		break;
+
+	case CPU_DYING:
+		/*
+		 * Before this, the trustee and all workers must have
+		 * stayed on the cpu.  After this, they'll all be
+		 * diasporas.
+		 */
+		gcwq->flags |= GCWQ_DISASSOCIATED;
+		break;
+
+	case CPU_POST_DEAD:
+		gcwq->trustee_state = TRUSTEE_BUTCHER;
+		/* fall through */
+	case CPU_UP_CANCELED:
+		destroy_worker(gcwq->first_idle);
+		gcwq->first_idle = NULL;
+		break;
+
+	case CPU_DOWN_FAILED:
+	case CPU_ONLINE:
+		gcwq->flags &= ~GCWQ_DISASSOCIATED;
+		if (gcwq->trustee_state != TRUSTEE_DONE) {
+			gcwq->trustee_state = TRUSTEE_RELEASE;
+			wake_up_process(gcwq->trustee);
+			wait_trustee_state(gcwq, TRUSTEE_DONE);
+		}
+		/*
+		 * Trustee is done and there might be no worker left.
+		 * Put the first_idle in and request a real manager to
+		 * take a look.
+		 */
+		spin_unlock_irq(&gcwq->lock);
+		kthread_bind(gcwq->first_idle->task, cpu);
+		spin_lock_irq(&gcwq->lock);
+		gcwq->flags |= GCWQ_MANAGE_WORKERS;
+		start_worker(gcwq->first_idle);
+		gcwq->first_idle = NULL;
+		break;
+	}
+
+	spin_unlock_irq(&gcwq->lock);
+
+	return NOTIFY_OK;
 }
 
 #ifdef CONFIG_SMP
@@ -1243,10 +2390,10 @@ EXPORT_SYMBOL_GPL(work_on_cpu);
  *
  * Start freezing workqueues.  After this function returns, all
  * freezeable workqueues will queue new works to their frozen_works
- * list instead of the cwq ones.
+ * list instead of the gcwq ones.
  *
  * CONTEXT:
- * Grabs and releases workqueue_lock and cwq->lock's.
+ * Grabs and releases workqueue_lock and gcwq->lock's.
  */
 void freeze_workqueues_begin(void)
 {
@@ -1259,21 +2406,22 @@ void freeze_workqueues_begin(void)
 	workqueue_frozen = true;
 
 	for_each_possible_cpu(cpu) {
+		struct global_cwq *gcwq = get_gcwq(cpu);
+
+		spin_lock_irq(&gcwq->lock);
+
 		list_for_each_entry(wq, &workqueues, list) {
 			struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 
 			if (!(wq->flags & WQ_FREEZEABLE))
 				continue;
 
-			spin_lock_irq(&cwq->lock);
-
-			BUG_ON(cwq->cur_worklist != &cwq->worklist);
+			BUG_ON(cwq->cur_worklist != &gcwq->worklist);
 			BUG_ON(!list_empty(&cwq->frozen_works));
 
 			cwq->cur_worklist = &cwq->frozen_works;
-
-			spin_unlock_irq(&cwq->lock);
 		}
+		spin_unlock_irq(&gcwq->lock);
 	}
 	spin_unlock(&workqueue_lock);
 }
@@ -1302,6 +2450,10 @@ bool freeze_workqueues_busy(void)
 	BUG_ON(!workqueue_frozen);
 
 	for_each_possible_cpu(cpu) {
+		struct global_cwq *gcwq = get_gcwq(cpu);
+
+		spin_lock_irq(&gcwq->lock);
+
 		list_for_each_entry(wq, &workqueues, list) {
 			struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 			struct work_struct *work;
@@ -1310,22 +2462,19 @@ bool freeze_workqueues_busy(void)
 			if (!(wq->flags & WQ_FREEZEABLE))
 				continue;
 
-			spin_lock_irq(&cwq->lock);
-
 			BUG_ON(cwq->cur_worklist != &cwq->frozen_works);
 
 			nr_in_flight = cwq->nr_in_flight;
 			list_for_each_entry(work, &cwq->frozen_works, entry)
 				nr_in_flight--;
 
-			spin_unlock_irq(&cwq->lock);
-
 			BUG_ON(nr_in_flight < 0);
 			if (nr_in_flight) {
 				busy = true;
 				break;
 			}
 		}
+		spin_unlock_irq(&gcwq->lock);
 		if (busy)
 			break;
 	}
@@ -1337,10 +2486,10 @@ bool freeze_workqueues_busy(void)
  * thaw_workqueues - thaw workqueues
  *
  * Thaw workqueues.  Normal queueing is restored and all collected
- * frozen works are transferred to their respective cwq worklists.
+ * frozen works are transferred to their respective gcwq worklists.
  *
  * CONTEXT:
- * Grabs and releases workqueue_lock and cwq->lock's.
+ * Grabs and releases workqueue_lock and gcwq->lock's.
  */
 void thaw_workqueues(void)
 {
@@ -1355,25 +2504,28 @@ void thaw_workqueues(void)
 	workqueue_frozen = false;
 
 	for_each_possible_cpu(cpu) {
+		struct global_cwq *gcwq = get_gcwq(cpu);
+
+		spin_lock_irq(&gcwq->lock);
+
 		list_for_each_entry(wq, &workqueues, list) {
 			struct cpu_workqueue_struct *cwq = get_cwq(cpu, wq);
 
 			if (!(wq->flags & WQ_FREEZEABLE))
 				continue;
 
-			spin_lock_irq(&cwq->lock);
-
 			/* switch to normal work queueing */
 			BUG_ON(cwq->cur_worklist != &cwq->frozen_works);
-			cwq->cur_worklist = &cwq->worklist;
+			cwq->cur_worklist = &gcwq->worklist;
 
-			/* transfer frozen tasks to cwq worklist */
-			list_splice_tail(&cwq->frozen_works, &cwq->worklist);
+			/* transfer frozen tasks to gcwq worklist */
+			list_splice_tail(&cwq->frozen_works, &gcwq->worklist);
 			INIT_LIST_HEAD(&cwq->frozen_works);
-			wake_up(&cwq->more_work);
-
-			spin_unlock_irq(&cwq->lock);
 		}
+
+		wake_up_worker(gcwq);
+
+		spin_unlock_irq(&gcwq->lock);
 	}
 out_unlock:
 	spin_unlock(&workqueue_lock);
@@ -1382,16 +2534,46 @@ out_unlock:
 
 void __init init_workqueues(void)
 {
-	/*
-	 * cwqs are forced aligned according to WORK_STRUCT_FLAG_BITS.
-	 * Make sure that the alignment isn't lower than that of
-	 * unsigned long long in case this code survives for longer
-	 * than twenty years.  :-P
-	 */
-	BUILD_BUG_ON(__alignof__(struct cpu_workqueue_struct) <
-		     __alignof__(unsigned long long));
+	unsigned int cpu;
+	int i;
 
 	hotcpu_notifier(workqueue_cpu_callback, 0);
-	keventd_wq = create_workqueue("events");
+
+	/* initialize gcwqs */
+	for_each_possible_cpu(cpu) {
+		struct global_cwq *gcwq = get_gcwq(cpu);
+
+		spin_lock_init(&gcwq->lock);
+		INIT_LIST_HEAD(&gcwq->worklist);
+		gcwq->cpu = cpu;
+
+		INIT_LIST_HEAD(&gcwq->idle_list);
+		for (i = 0; i < BUSY_WORKER_HASH_SIZE; i++)
+			INIT_HLIST_HEAD(&gcwq->busy_hash[i]);
+
+		init_timer_deferrable(&gcwq->idle_timer);
+		gcwq->idle_timer.function = idle_worker_timeout;
+		gcwq->idle_timer.data = (unsigned long)gcwq;
+
+		setup_timer(&gcwq->mayday_timer, gcwq_mayday_timeout,
+			    (unsigned long)gcwq);
+
+		gcwq->trustee_state = TRUSTEE_DONE;
+		init_waitqueue_head(&gcwq->trustee_wait);
+	}
+
+	/* create the initial worker */
+	for_each_online_cpu(cpu) {
+		struct global_cwq *gcwq = get_gcwq(cpu);
+		struct worker *worker;
+
+		worker = create_worker(gcwq, true);
+		BUG_ON(!worker);
+		spin_lock_irq(&gcwq->lock);
+		start_worker(worker);
+		spin_unlock_irq(&gcwq->lock);
+	}
+
+	keventd_wq = __create_workqueue("events", 0);
 	BUG_ON(!keventd_wq);
 }
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (18 preceding siblings ...)
  2009-10-01  8:09 ` [PATCH 19/19] workqueue: implement concurrency managed workqueue Tejun Heo
@ 2009-10-01  8:24 ` Jens Axboe
  2009-10-01 16:36   ` Tejun Heo
  2009-10-01  8:24 ` Jens Axboe
                   ` (6 subsequent siblings)
  26 siblings, 1 reply; 90+ messages in thread
From: Jens Axboe @ 2009-10-01  8:24 UTC (permalink / raw)
  To: Tejun Heo; +Cc: jeff, mingo, linux-kernel, akpm, rusty, cl, dhowells, arjan

On Thu, Oct 01 2009, Tejun Heo wrote:
> Hello, all.
> 
> This rather large patchset implements concurrency managed workqueue.
> It's not complete yet.  Singlethread workqueue handling needs more
> work and workqueue users need to be audited and simplified and async
> and slow-work should be reimplemented in terms of workqueue.  Although
> this patchset currently adds ~2000 lines of code, I'm fairly
> optimistic that after the whole conversion is done, it would be a net
> decrease in lines of code.
> 
> This patchset reimplements workqueue such that it auto-regulates
> concurrency and thus relieves its users from the managing duty.  It
> works by managing single shared pool of per-cpu workers and hooking
> into the scheduler to get notifications about workers going to sleep
> and waking up.  Using the mechanism, workqueue implementation keeps
> track of the current level of concurrency and schedules only the
> necessary number of workers to keep the cpu occupied.
> 
> Concurrency managed workqueue has the following benefits.
> 
> * Workqueue users no longer have to worry about managing concurrency
>   and, in most cases, deadlocks.  The workqueue will manage it
>   automatically and unless the deadlock chain involves many (currently
>   127) works, it won't happen.
> 
> * There's one single shared pool of workers per cpu and one rescuer
>   for each workqueue which requires it, so there are far fewer number
>   of kthreads.
> 
> * More efficient.  Although it adds considerable amount of code, the
>   code added to hot path isn't big and works will be executed on the
>   local cpu and in batch as much as possible using minimal number of
>   kthreads leading to fewer task switches and lower cache
>   footprint. <NEED SOME BACKING NUMBERS>
> 
> * As concurrency is no longer a problem, most types of asynchronous
>   jobs can be done using generic workqueue and other async mechanisms,
>   including slow-work, async and adhoc subsystem custom ones, can be
>   removed.  ie. It can serve as the unified async thread pool
>   mechanism.

Awesome work so far Tejun, I have high hopes for this patchset! I'll
take some time to review this when I have it, just consider this so far
a big encouragement to crank away on this. It's always annoyed me that
we have various methods for doing async work, this promises to unify
that very nicely.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (19 preceding siblings ...)
  2009-10-01  8:24 ` [RFC PATCHSET] " Jens Axboe
@ 2009-10-01  8:24 ` Jens Axboe
  2009-10-01 16:25   ` Tejun Heo
  2009-10-01  8:40 ` Ingo Molnar
                   ` (5 subsequent siblings)
  26 siblings, 1 reply; 90+ messages in thread
From: Jens Axboe @ 2009-10-01  8:24 UTC (permalink / raw)
  To: Tejun Heo; +Cc: jeff, mingo, linux-kernel, akpm, rusty, cl, dhowells, arjan

On Thu, Oct 01 2009, Tejun Heo wrote:
> I'm attaching test-wq source and Makefile which I used to verify each
> aspect of the new workqueue.  It's pretty easy to write new test
> scenarios using the module so if you're interested in how this
> actually work, it is quite useful.

BTW, you forgot to attach the test-wq source.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (20 preceding siblings ...)
  2009-10-01  8:24 ` Jens Axboe
@ 2009-10-01  8:40 ` Ingo Molnar
  2009-10-01  8:47   ` Jens Axboe
                     ` (2 more replies)
  2009-10-01 12:53 ` David Howells
                   ` (4 subsequent siblings)
  26 siblings, 3 replies; 90+ messages in thread
From: Ingo Molnar @ 2009-10-01  8:40 UTC (permalink / raw)
  To: Tejun Heo, Peter Zijlstra, Linus Torvalds, Avi Kivity,
	Frédéric Weisbecker
  Cc: jeff, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan


* Tejun Heo <tj@kernel.org> wrote:

> Hello, all.
> 
> This rather large patchset implements concurrency managed workqueue. 
> It's not complete yet.  Singlethread workqueue handling needs more 
> work and workqueue users need to be audited and simplified and async 
> and slow-work should be reimplemented in terms of workqueue.  Although 
> this patchset currently adds ~2000 lines of code, I'm fairly 
> optimistic that after the whole conversion is done, it would be a net 
> decrease in lines of code.
> 
> This patchset reimplements workqueue such that it auto-regulates 
> concurrency and thus relieves its users from the managing duty.  It 
> works by managing single shared pool of per-cpu workers and hooking 
> into the scheduler to get notifications about workers going to sleep 
> and waking up.  Using the mechanism, workqueue implementation keeps 
> track of the current level of concurrency and schedules only the 
> necessary number of workers to keep the cpu occupied.

Ok, this looks fairly interesting - and the way you reused scheduler 
classes to auto-regulate with no impact on regular performance is quite 
an ingenious idea as well. (KVM's preempt notifiers should probably use 
this trick too, instead of an ugly notifier in the scheduler hotpath)

This mechanism could be used to implement threadlets/syslets too btw., 
and other forms of asynchronous IO.

My main worry is that in practice workqueues arent all that performance 
critical - so we are shooting to optimize something that doesnt 
necessarily use all the potential goodness inherent in this approach.

	Ingo

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01  8:40 ` Ingo Molnar
@ 2009-10-01  8:47   ` Jens Axboe
  2009-10-01  9:01     ` Ingo Molnar
  2009-10-01  9:11   ` Avi Kivity
  2009-10-01 16:43   ` Tejun Heo
  2 siblings, 1 reply; 90+ messages in thread
From: Jens Axboe @ 2009-10-01  8:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, Peter Zijlstra, Linus Torvalds, Avi Kivity,
	Frédéric Weisbecker, jeff, linux-kernel, akpm, rusty,
	cl, dhowells, arjan

On Thu, Oct 01 2009, Ingo Molnar wrote:
> My main worry is that in practice workqueues arent all that performance 
> critical - so we are shooting to optimize something that doesnt 
> necessarily use all the potential goodness inherent in this approach.

Well, the main problem with the current code is that per-cpu workqueues
are way abused. I don't look at this patchset from a performance point
of view, but rather a way to limit this huge number of idle and
pointless threads. Another issue is the specific workqueues created to deal
with slowly executing work, or with dependency issues. The end result is just
hundreds of pointless kernel threads. I have two 64-thread boxes here, I
dare not think what the 128 or 256 thread (and even bigger) boxes look
like when booted. The latter would be safely into the thousands of
useless threads.

If we get this right, we'll have a leaner kernel and far fewer threads.
On the source side, we'll potentially get rid of specialized workqueue
akin implementations.

It's a win-win as far as I'm concerned.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01  8:47   ` Jens Axboe
@ 2009-10-01  9:01     ` Ingo Molnar
  2009-10-01  9:05       ` Jens Axboe
  0 siblings, 1 reply; 90+ messages in thread
From: Ingo Molnar @ 2009-10-01  9:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Peter Zijlstra, Linus Torvalds, Avi Kivity,
	Frédéric Weisbecker, jeff, linux-kernel, akpm, rusty,
	cl, dhowells, arjan


* Jens Axboe <jens.axboe@oracle.com> wrote:

> On Thu, Oct 01 2009, Ingo Molnar wrote:
> > My main worry is that in practice workqueues arent all that performance 
> > critical - so we are shooting to optimize something that doesnt 
> > necessarily use all the potential goodness inherent in this approach.
> 
> Well, the main problem with the current code is that per-cpu 
> workqueues are way abused. I don't look at this patchset from a 
> performance point of view, but rather a way to limit this huge number 
> of idle and pointless threads. [...]

I do look at it as a potentially (and primarily) big performance feature 
- if only it was utilized in a place where the performance aspect 
mattered.

Sure, the memory savings are nice too.

	Ingo

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01  9:01     ` Ingo Molnar
@ 2009-10-01  9:05       ` Jens Axboe
  0 siblings, 0 replies; 90+ messages in thread
From: Jens Axboe @ 2009-10-01  9:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, Peter Zijlstra, Linus Torvalds, Avi Kivity,
	Frédéric Weisbecker, jeff, linux-kernel, akpm, rusty,
	cl, dhowells, arjan

On Thu, Oct 01 2009, Ingo Molnar wrote:
> 
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > On Thu, Oct 01 2009, Ingo Molnar wrote:
> > > My main worry is that in practice workqueues arent all that performance 
> > > critical - so we are shooting to optimize something that doesnt 
> > > necessarily use all the potential goodness inherent in this approach.
> > 
> > Well, the main problem with the current code is that per-cpu 
> > workqueues are way abused. I don't look at this patchset from a 
> > performance point of view, but rather a way to limit this huge number 
> > of idle and pointless threads. [...]
> 
> I do look at it as a potentially (and primarily) big performance feature 
> - if only it was utilized in a place where the performance aspect 
> mattered.

That just makes it win-win-win :-)

> Sure, the memory savings are nice too.

As is the unified approach to async work handling, compared with what we
have now (which is several flavors of workqueues, async work, slow work,
etc).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01  8:40 ` Ingo Molnar
  2009-10-01  8:47   ` Jens Axboe
@ 2009-10-01  9:11   ` Avi Kivity
  2009-10-01  9:22     ` Avi Kivity
  2009-10-01 16:55     ` Tejun Heo
  2009-10-01 16:43   ` Tejun Heo
  2 siblings, 2 replies; 90+ messages in thread
From: Avi Kivity @ 2009-10-01  9:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, Peter Zijlstra, Linus Torvalds,
	Frédéric Weisbecker, jeff, linux-kernel, akpm,
	jens.axboe, rusty, cl, dhowells, arjan

On 10/01/2009 10:40 AM, Ingo Molnar wrote:
> Ok, this looks fairly interesting - and the way you reused scheduler
> classes to auto-regulate with no impact on regular performance is quite
> an ingenious idea as well. (KVM's preempt notifiers should probably use
> this trick too, instead of an ugly notifier in the scheduler hotpath)
>
> This mechanism could be used to implement threadlets/syslets too btw.,
> and other forms of asynchronous IO.
>    

In fact I've thought of implementing threadlets and concurrency-managed 
workqueues with preempt notifiers ;)

Isn't a scheduling class overkill for two existing callbacks?  Note we 
can easily use a thread flag and __switch_to_xtra() to avoid the overhead.

For kvm, we don't want to force a specific scheduling class for vcpu 
threads, so we'd need infrastructure to create a new scheduling class 
out of an existing one to hook the two callbacks.  Seems like quite a 
lot of work, for something that is orthogonal to scheduling.

Tejun, would preempt notifiers work for your workqueues?  see bottom of 
include/linux/preempt.h.
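
For reference, a rough sketch of what the two hooks could look like through
that API (assuming CONFIG_PREEMPT_NOTIFIERS; the atomic counter below is only
a stand-in for the real per-cpu bookkeeping, and sched_out fires on
preemption too, not only when the worker actually goes to sleep):

#include <linux/preempt.h>
#include <linux/sched.h>

static atomic_t example_nr_running = ATOMIC_INIT(0);

static void worker_sched_in(struct preempt_notifier *pn, int cpu)
{
	/* a worker got the cpu back */
	atomic_inc(&example_nr_running);
}

static void worker_sched_out(struct preempt_notifier *pn,
			     struct task_struct *next)
{
	/*
	 * Fires on every switch-out, preemption included, so the real
	 * bookkeeping would have to check the task state to tell an
	 * actual sleep (a drop in concurrency) from a preemption.
	 */
	atomic_dec(&example_nr_running);
}

static struct preempt_ops worker_preempt_ops = {
	.sched_in	= worker_sched_in,
	.sched_out	= worker_sched_out,
};

/* each worker would register this on itself, e.g. from worker_thread() */
static void worker_hook_scheduler(struct preempt_notifier *pn)
{
	preempt_notifier_init(pn, &worker_preempt_ops);
	preempt_notifier_register(pn);
}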

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01  9:11   ` Avi Kivity
@ 2009-10-01  9:22     ` Avi Kivity
  2009-10-01 16:55     ` Tejun Heo
  1 sibling, 0 replies; 90+ messages in thread
From: Avi Kivity @ 2009-10-01  9:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tejun Heo, Peter Zijlstra, Linus Torvalds,
	Frédéric Weisbecker, jeff, linux-kernel, akpm,
	jens.axboe, rusty, cl, dhowells, arjan

On 10/01/2009 11:11 AM, Avi Kivity wrote:
>
> In fact I've thought of implementing threadlets and 
> concurrency-managed workqueues with preempt notifiers ;)

Oh, and use them to make kernel_fpu_begin() preemptible.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (21 preceding siblings ...)
  2009-10-01  8:40 ` Ingo Molnar
@ 2009-10-01 12:53 ` David Howells
  2009-10-02 11:44   ` Tejun Heo
  2009-10-02 15:38   ` David Howells
  2009-10-01 12:57 ` [PATCH 06/19] acpi: use queue_work_on() instead of binding workqueue worker to cpu0 David Howells
                   ` (3 subsequent siblings)
  26 siblings, 2 replies; 90+ messages in thread
From: David Howells @ 2009-10-01 12:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan


Sounds interesting as a replacement for slow-work.  Some thoughts for you:

The most important features of slow-work are:

 (1) Work items are not re-entered whilst they are executing.

 (2) The slow-work facility keeps references on its work items by asking the
     client to get and put on the client's refcount.

 (3) The slow-work facility can create a lot more threads than the number of
     CPUs on a system, and the system doesn't grind to a halt if they're all
     taken up with long term I/O (performing several mkdirs for example).

I think you have (1) and (3) covered, but I'm unsure about (2).
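
To make (2) concrete, the get/put discipline a client would have to provide
by itself on top of a plain workqueue looks roughly like the sketch below
(the my_obj type and helpers are made up for illustration):

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/kref.h>
#include <linux/slab.h>

struct my_obj {
	struct kref		ref;
	struct work_struct	work;
};

static void my_obj_release(struct kref *ref)
{
	kfree(container_of(ref, struct my_obj, ref));
}

static void my_obj_work_fn(struct work_struct *work)
{
	struct my_obj *obj = container_of(work, struct my_obj, work);

	/* ... the slow part goes here ... */

	/* drop the reference taken at queueing time */
	kref_put(&obj->ref, my_obj_release);
}

static struct my_obj *my_obj_alloc(void)
{
	struct my_obj *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

	if (obj) {
		kref_init(&obj->ref);
		INIT_WORK(&obj->work, my_obj_work_fn);
	}
	return obj;
}

static void my_obj_queue(struct my_obj *obj)
{
	kref_get(&obj->ref);		/* pin the object across execution */
	if (!schedule_work(&obj->work))
		/* already queued; that instance owns the reference */
		kref_put(&obj->ref, my_obj_release);
}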

Also, does it actually make sense to bind threads that are performing
long-term I/O to particular CPUs?  Threads that are going to spend a lot more
time sleeping on disk I/O than actually running on a CPU?

David

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 06/19] acpi: use queue_work_on() instead of binding workqueue worker to cpu0
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (22 preceding siblings ...)
  2009-10-01 12:53 ` David Howells
@ 2009-10-01 12:57 ` David Howells
  2009-10-01 17:07   ` Tejun Heo
  2009-10-01 13:05 ` [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available David Howells
                   ` (2 subsequent siblings)
  26 siblings, 1 reply; 90+ messages in thread
From: David Howells @ 2009-10-01 12:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: dhowells, jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, arjan

Tejun Heo <tj@kernel.org> wrote:

> -	kacpid_wq = create_singlethread_workqueue("kacpid");
> -	bind_workqueue(kacpid_wq);
> -	kacpi_notify_wq = create_singlethread_workqueue("kacpi_notify");
> -	bind_workqueue(kacpi_notify_wq);
> -	kacpi_hotplug_wq = create_singlethread_workqueue("kacpi_hotplug");
> -	bind_workqueue(kacpi_hotplug_wq);
> +	kacpid_wq = create_workqueue("kacpid");
> +	kacpi_notify_wq = create_workqueue("kacpi_notify");
> +	kacpi_hotplug_wq = create_workqueue("kacpi_hotplug");

Doesn't that then create one worker thread per CPU and then eschew all but
those attached to CPU 0?  Sounds excessive, but presumably you deal with that
in later patches.

David

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (23 preceding siblings ...)
  2009-10-01 12:57 ` [PATCH 06/19] acpi: use queue_work_on() instead of binding workqueue worker to cpu0 David Howells
@ 2009-10-01 13:05 ` David Howells
  2009-10-01 16:15   ` Jeff Garzik
                     ` (3 more replies)
  2009-10-01 13:15 ` [PATCH 19/19] workqueue: implement concurrency managed workqueue David Howells
  2009-10-04  8:41 ` [RFC PATCHSET] " Peter Zijlstra
  26 siblings, 4 replies; 90+ messages in thread
From: David Howells @ 2009-10-01 13:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: dhowells, jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, arjan

Tejun Heo <tj@kernel.org> wrote:

> +enum {
> +	WORK_STRUCT_PENDING	= 0,	/* work item is pending execution */
> +
> +	/*
> +	 * Reserve 3bits off of cwq pointer.  This is enough and
> +	 * provides acceptable alignment on both 32 and 64bit
> +	 * machines.
> +	 */
> +	WORK_STRUCT_FLAG_BITS	= 3,
> +
> +	WORK_STRUCT_FLAG_MASK	= (1UL << WORK_STRUCT_FLAG_BITS) - 1,
> +	WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
> +};

There's some great enum abuse going on here:-)

David

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (24 preceding siblings ...)
  2009-10-01 13:05 ` [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available David Howells
@ 2009-10-01 13:15 ` David Howells
  2009-10-02 12:03   ` Tejun Heo
  2009-10-04  8:41 ` [RFC PATCHSET] " Peter Zijlstra
  26 siblings, 1 reply; 90+ messages in thread
From: David Howells @ 2009-10-01 13:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: dhowells, jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, arjan

Tejun Heo <tj@kernel.org> wrote:

> gcwq always keeps at least single idle worker around.  When a new
> worker is necessary and the worker is the last idle one, the worker
> assumes the role of "manager" and manages the worker pool -
> ie. creates another worker.  Forward-progress is guaranteed by having
> dedicated rescue workers for workqueues which may be necessary while
> creating a new worker.  When the manager is having problem creating a
> new worker, mayday timer activates and rescue workers are summoned to
> the cpu and execute works which may be necessary to create new
> workers.

I take it that means that the rescue-workers are donated to the pool to become
worker threads (perhaps on a temporary basis) in the event that forking fails
due to ENOMEM, such that resources can be freed up for fork() to use.

David

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue
  2009-10-01  8:09 ` [PATCH 19/19] workqueue: implement concurrency managed workqueue Tejun Heo
@ 2009-10-01 14:49   ` Andrew Morton
  2009-10-01 15:12     ` Ingo Molnar
                       ` (2 more replies)
  2009-10-01 17:12   ` Linus Torvalds
                     ` (2 subsequent siblings)
  3 siblings, 3 replies; 90+ messages in thread
From: Andrew Morton @ 2009-10-01 14:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jeff, mingo, linux-kernel, jens.axboe, rusty, cl, dhowells, arjan

On Thu,  1 Oct 2009 17:09:18 +0900 Tejun Heo <tj@kernel.org> wrote:

> To solve the above issues, this patch implements concurrency-managed
> workqueue.

Seems reasonable.

This approach would appear to rule out the option of setting a work
thread's state (scheduling policy, scheduling priority, uid, etc) to
anything other than some default.

I guess that's unlikely to be a problem if we haven't yet had a need to
do that, but I'd be a bit surprised to discover that nobody has done
that sort of thing yet?  Nobody has niced up their workqueue threads?
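
For context, the kind of per-thread tweak in question looks roughly like the
sketch below with today's dedicated workqueue threads (the mydrv_* names are
made up); with a single shared pool the same trick would renice a shared
worker rather than a private thread:

#include <linux/workqueue.h>
#include <linux/sched.h>

/* runs inside the (currently dedicated) workqueue thread and renices it */
static void mydrv_renice_fn(struct work_struct *work)
{
	set_user_nice(current, -10);
}

static DECLARE_WORK(mydrv_renice_work, mydrv_renice_fn);

static int mydrv_tune_wq(struct workqueue_struct *wq)
{
	/* today this runs in wq's own kthread, so the nice value sticks */
	return queue_work(wq, &mydrv_renice_work);
}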


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue
  2009-10-01 14:49   ` Andrew Morton
@ 2009-10-01 15:12     ` Ingo Molnar
  2009-10-01 16:34     ` Tejun Heo
  2009-10-04  8:49     ` Peter Zijlstra
  2 siblings, 0 replies; 90+ messages in thread
From: Ingo Molnar @ 2009-10-01 15:12 UTC (permalink / raw)
  To: Andrew Morton, Arnaldo Carvalho de Melo, Clark Williams,
	Steven Rostedt, Peter Zijlstra, Frédéric Weisbecker
  Cc: Tejun Heo, jeff, linux-kernel, jens.axboe, rusty, cl, dhowells, arjan


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu,  1 Oct 2009 17:09:18 +0900 Tejun Heo <tj@kernel.org> wrote:
> 
> > To solve the above issues, this patch implements concurrency-managed
> > workqueue.
> 
> Seems reasonable.
> 
> This approach would appear to rule out the option of setting a work 
> thread's state (scheduling policy, scheduling priority, uid, etc) to 
> anything other than some default.
> 
> I guess that's unlikely to be a problem if we haven't yet had a need 
> to do that, but I'd be a bit surprised to discover that nobody has 
> done that sort of thing yet?  Nobody has niced up their workqueue 
> threads?

tuna in -rt might do that perhaps?

	Ingo

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available
  2009-10-01 13:05 ` [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available David Howells
@ 2009-10-01 16:15   ` Jeff Garzik
  2009-10-01 16:20   ` David Howells
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 90+ messages in thread
From: Jeff Garzik @ 2009-10-01 16:15 UTC (permalink / raw)
  To: David Howells
  Cc: Tejun Heo, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, arjan

On 10/01/2009 09:05 AM, David Howells wrote:
> Tejun Heo<tj@kernel.org>  wrote:
>
>> +enum {
>> +	WORK_STRUCT_PENDING	= 0,	/* work item is pending execution */
>> +
>> +	/*
>> +	 * Reserve 3bits off of cwq pointer.  This is enough and
>> +	 * provides acceptable alignment on both 32 and 64bit
>> +	 * machines.
>> +	 */
>> +	WORK_STRUCT_FLAG_BITS	= 3,
>> +
>> +	WORK_STRUCT_FLAG_MASK	= (1UL<<  WORK_STRUCT_FLAG_BITS) - 1,
>> +	WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
>> +};
>
> There's some great enum abuse going on here:-)

The quoted code is a standard kernel technique for creating typed 
constants that are visible to the C compiler (rather than CPP) and debugger.

It is found in a great many Linux kernel drivers at this point, and is 
definitely not abuse, IMO.
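
A minimal illustration of the pattern under discussion, with made-up
constants rather than anything from the patch:

/* cpp constant: untyped text substitution, invisible to the debugger */
#define MYDEV_IRQ_ENABLE_BIT	0x01

/* enum constants: real compiler symbols with scope, visible to gdb */
enum {
	MYDEV_CTRL_REG		= 0x00,
	MYDEV_STATUS_REG	= 0x04,
	MYDEV_IRQ_ENABLE	= 1 << 0,
	MYDEV_IRQ_PENDING	= 1 << 1,
};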

	Jeff



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available
  2009-10-01 13:05 ` [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available David Howells
  2009-10-01 16:15   ` Jeff Garzik
@ 2009-10-01 16:20   ` David Howells
  2009-10-01 16:30     ` Tejun Heo
  2009-10-01 16:39   ` Alan Cox
  2009-10-01 18:45   ` Ben Pfaff
  3 siblings, 1 reply; 90+ messages in thread
From: David Howells @ 2009-10-01 16:20 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: dhowells, Tejun Heo, mingo, linux-kernel, akpm, jens.axboe,
	rusty, cl, arjan

Jeff Garzik <jeff@garzik.org> wrote:

> The quoted code is a standard kernel technique for creating typed constants
> that are visible to the C compiler (rather than CPP) and debugger.
> 
> It is found in a great many Linux kernel drivers at this point, and is
> definitely not abuse, IMO.

Just because it's common practice, and just because you like it, doesn't make
it not abuse.  But that's a side issue, and fairly irrelevant to the actual
implementation.

David

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01  8:24 ` Jens Axboe
@ 2009-10-01 16:25   ` Tejun Heo
  0 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01 16:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: jeff, mingo, linux-kernel, akpm, rusty, cl, dhowells, arjan

[-- Attachment #1: Type: text/plain, Size: 399 bytes --]

Jens Axboe wrote:
> On Thu, Oct 01 2009, Tejun Heo wrote:
>> I'm attaching test-wq source and Makefile which I used to verify each
>> aspect of the new workqueue.  It's pretty easy to write new test
>> scenarios using the module so if you're interested in how this
>> actually work, it is quite useful.
> 
> BTW, you forgot to attach the test-wq source.
> 

Oooh... right, here they are.

-- 
tejun

[-- Attachment #2: Makefile --]
[-- Type: text/plain, Size: 19 bytes --]

obj-m := test-wq.o

[-- Attachment #3: test-wq.c --]
[-- Type: text/x-csrc, Size: 20501 bytes --]

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <linux/delay.h>
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/cpu.h>
#include <linux/kthread.h>

#define MAX_WQ_NAME		64
#define MAX_WQS			64
#define MAX_WORKS		64

struct wq_spec {
	int			id;	/* -1 terminates */
	unsigned int		flags;
};

enum action {
	ACT_TERM,			/* end */
	ACT_LOG,			/* const char * */
	ACT_BURN,			/* ulong duration_msecs */
	ACT_SLEEP,			/* ulong duration_msecs */
	ACT_WAKEUP,			/* ulong work_id */
	ACT_REQUEUE,			/* ulong delay_msecs */
	ACT_FLUSH,			/* ulong work_id */
	ACT_FLUSH_WQ,			/* ulong workqueue_id */
	ACT_CANCEL,			/* ulong work_id */
};

struct work_action {
	enum action		action;	/* ACT_TERM terminates */
	union {
		unsigned long	v;
		const char	*s;
	};
};

struct work_spec {
	int			id;		/* -1 terminates */
	int			wq_id;
	int			requeue_cnt;
	unsigned int		cpu;
	unsigned long		initial_delay;	/* msecs */

	const struct work_action *actions;
};

struct test_scenario {
	const struct wq_spec	*wq_spec;
	const struct work_spec	**work_spec;	/* NULL terminated */
};

static const struct wq_spec dfl_wq_spec[] = {
	{
		.id		= 0,
		.flags		= 0,
	},
	{
		.id		= 1,
		.flags		= 0,
	},
	{
		.id		= 2,
		.flags		= WQ_RESCUER,
	},
	{
		.id		= 3,
		.flags		= WQ_FREEZEABLE,
	},
	{ .id = -1 },
};

/*
 * Scenario 0.  All are on cpu0.  work16 and 17 burn cpus for 10 and
 * 5msecs respectively and requeue themselves.  18 sleeps 2 secs and
 * cancels both.
 */
static const struct work_spec work_spec0[] = {
	{
		.id		= 16,
		.requeue_cnt	= 1024,
		.actions	= (const struct work_action[]) {
			{ ACT_BURN,	{ 10 }},
			{ ACT_REQUEUE,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 17,
		.requeue_cnt	= 1024,
		.actions	= (const struct work_action[]) {
			{ ACT_BURN,	{ 5 }},
			{ ACT_REQUEUE,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 18,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "will sleep 2s and cancel both" }},
			{ ACT_SLEEP,	{ 2000 }},
			{ ACT_CANCEL,	{ 16 }},
			{ ACT_CANCEL,	{ 17 }},
			{ ACT_TERM },
		},
	},
	{ .id = -1 },
};

static const struct test_scenario scenario0 = {
	.wq_spec		= dfl_wq_spec,
	.work_spec		=
	(const struct work_spec *[]) { work_spec0, NULL },
};

/*
 * Scenario 1.  All are on cpu0.  Work 0, 1 and 2 sleep for different
 * intervals but all three will terminate at around 30secs.  3 starts
 * at @28 and 4 at @33 and both sleep for five secs and then
 * terminate.  5 waits for 0, 1 and 2 and then flushes the wq, which by that
 * time should have 3 on it.  After 3 completes @32, 5 terminates too.
 * After 4 more secs, 4 terminates and the whole test sequence is done.
 */
static const struct work_spec work_spec1[] = {
	{
		.id		= 0,
		.actions	= (const struct work_action[]) {
			{ ACT_BURN,	{ 3 }},	/* to cause sched activation */
			{ ACT_LOG,	{ .s = "will sleep 30s" }},
			{ ACT_SLEEP,	{ 30000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_BURN,	{ 5 }},
			{ ACT_LOG,	{ .s = "will sleep 10s and burn 5msec and repeat 3 times" }},
			{ ACT_SLEEP,	{ 10000 }},
			{ ACT_BURN,	{ 5 }},
			{ ACT_LOG,	{ .s = "@10s" }},
			{ ACT_SLEEP,	{ 10000 }},
			{ ACT_BURN,	{ 5 }},
			{ ACT_LOG,	{ .s = "@20s" }},
			{ ACT_SLEEP,	{ 10000 }},
			{ ACT_BURN,	{ 5 }},
			{ ACT_LOG,	{ .s = "@30s" }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 2,
		.actions	= (const struct work_action[]) {
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "will sleep 3s and burn 1msec and repeat 10 times" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@3s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@6s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@9s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@12s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@15s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@18s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@21s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@24s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@27s" }},
			{ ACT_SLEEP,	{ 3000 }},
			{ ACT_BURN,	{ 1 }},
			{ ACT_LOG,	{ .s = "@30s" }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 3,
		.initial_delay	= 29000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "started@28s, will sleep for 5s" }},
			{ ACT_SLEEP,	{ 5000 }},
			{ ACT_TERM },
		}
	},
	{
		.id		= 4,
		.initial_delay	= 33000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "started@33s, will sleep for 5s" }},
			{ ACT_SLEEP,	{ 5000 }},
			{ ACT_TERM },
		}
	},
	{
		.id		= 5,
		.wq_id		= 1,	/* can't flush self */
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "flushing 0, 1 and 2" }},
			{ ACT_FLUSH,	{ 0 }},
			{ ACT_FLUSH,	{ 1 }},
			{ ACT_FLUSH,	{ 2 }},
			{ ACT_FLUSH_WQ,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{ .id = -1 },
};

static const struct test_scenario scenario1 = {
	.wq_spec		= dfl_wq_spec,
	.work_spec		=
	(const struct work_spec *[]) { work_spec1, NULL },
};

/*
 * Scenario 2.  Combination of scenario 0 and 1.
 */
static const struct test_scenario scenario2 = {
	.wq_spec		= dfl_wq_spec,
	.work_spec		=
	(const struct work_spec *[]) { work_spec0, work_spec1, NULL },
};

/*
 * Scenario 3.  More complex flushing.
 *
 *               2:burn 2s        3:4s
 *                <---->      <---------->
 *     0:4s                1:4s
 * <---------->      <..----------->
 *    ^               ^
 *    |               |
 *    |               |
 * 4:flush(cpu0)    flush_wq
 * 5:flush(cpu0)    flush
 * 6:flush(cpu1)    flush_wq
 * 7:flush(cpu1)    flush
 */
static const struct work_spec work_spec2[] = {
	{
		.id		= 0,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 1,
		.initial_delay	= 6000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 2,
		.initial_delay	= 5000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "burning 2s" }},
			{ ACT_BURN,	{ 2000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 3,
		.initial_delay	= 9000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 4s" }},
			{ ACT_SLEEP,	{ 4000 }},
			{ ACT_TERM },
		},
	},

	{
		.id		= 4,
		.wq_id		= 1,
		.initial_delay	= 1000,
		.actions	= (const struct work_action[]) {
			{ ACT_FLUSH,	{ 0 }},
			{ ACT_SLEEP,	{ 2500 }},
			{ ACT_FLUSH_WQ,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 5,
		.wq_id		= 1,
		.initial_delay	= 1000,
		.actions	= (const struct work_action[]) {
			{ ACT_FLUSH,	{ 0 }},
			{ ACT_SLEEP,	{ 2500 }},
			{ ACT_FLUSH,	{ 1 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 6,
		.wq_id		= 1,
		.cpu		= 1,
		.initial_delay	= 1000,
		.actions	= (const struct work_action[]) {
			{ ACT_FLUSH,	{ 0 }},
			{ ACT_SLEEP,	{ 2500 }},
			{ ACT_FLUSH_WQ,	{ 0 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 7,
		.wq_id		= 1,
		.cpu		= 1,
		.initial_delay	= 1000,
		.actions	= (const struct work_action[]) {
			{ ACT_FLUSH,	{ 0 }},
			{ ACT_SLEEP,	{ 2500 }},
			{ ACT_FLUSH,	{ 1 }},
			{ ACT_TERM },
		},
	},
	{ .id = -1 },
};

static const struct test_scenario scenario3 = {
	.wq_spec		= dfl_wq_spec,
	.work_spec		=
	(const struct work_spec *[]) { work_spec2, NULL },
};

/*
 * Scenario 4.  Mayday!  To be used with MAX_CPU_WORKERS_ORDER reduced
 * to 2.
 */
static const struct work_spec work_spec4[] = {
	{
		.id		= 0,
		.requeue_cnt	= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 5s" }},
			{ ACT_SLEEP,	{ 5000 }},
			{ ACT_REQUEUE,	{ 5000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 1,
		.requeue_cnt	= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 5s" }},
			{ ACT_SLEEP,	{ 5000 }},
			{ ACT_REQUEUE,	{ 5000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 2,
		.requeue_cnt	= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 5s" }},
			{ ACT_SLEEP,	{ 5000 }},
			{ ACT_REQUEUE,	{ 5000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 3,
		.requeue_cnt	= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 5s" }},
			{ ACT_SLEEP,	{ 5000 }},
			{ ACT_REQUEUE,	{ 5000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 4,
		.wq_id		= 2,
		.requeue_cnt	= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 1s" }},
			{ ACT_SLEEP,	{ 1000 }},
			{ ACT_REQUEUE,	{ 5000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 5,
		.wq_id		= 2,
		.requeue_cnt	= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 1s" }},
			{ ACT_SLEEP,	{ 1000 }},
			{ ACT_REQUEUE,	{ 5000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 6,
		.wq_id		= 2,
		.requeue_cnt	= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 1s" }},
			{ ACT_SLEEP,	{ 1000 }},
			{ ACT_REQUEUE,	{ 5000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 7,
		.wq_id		= 2,
		.requeue_cnt	= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 1s" }},
			{ ACT_SLEEP,	{ 1000 }},
			{ ACT_REQUEUE,	{ 5000 }},
			{ ACT_TERM },
		},
	},
	{ .id = -1 },
};

static const struct test_scenario scenario4 = {
	.wq_spec		= dfl_wq_spec,
	.work_spec		=
	(const struct work_spec *[]) { work_spec4, NULL },
};

/*
 * Scenario 5.  Tests cpu off/onlining.  A bunch of long-running works
 * on cpu1.  Gets interesting with various other conditions applied
 * together - lowered MAX_CPU_WORKERS_ORDER, induced failure or delay
 * during CPU_DOWN/UP_PREPARE and so on.
 */
static const struct work_spec work_spec5[] = {
	/* runs for 30 secs */
	{
		.id		= 0,
		.cpu		= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 30s" }},
			{ ACT_SLEEP,	{ 30000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 1,
		.cpu		= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 30s" }},
			{ ACT_SLEEP,	{ 30000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 2,
		.cpu		= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 30s" }},
			{ ACT_SLEEP,	{ 30000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 3,
		.cpu		= 1,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 30s" }},
			{ ACT_SLEEP,	{ 30000 }},
			{ ACT_TERM },
		},
	},

	/* kicks in @15 and runs for 15 from wq0 */
	{
		.id		= 4,
		.cpu		= 1,
		.initial_delay	= 15000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 15s" }},
			{ ACT_SLEEP,	{ 15000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 5,
		.cpu		= 1,
		.initial_delay	= 15000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 15s" }},
			{ ACT_SLEEP,	{ 15000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 6,
		.cpu		= 1,
		.initial_delay	= 15000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 15s" }},
			{ ACT_SLEEP,	{ 15000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 7,
		.cpu		= 1,
		.initial_delay	= 15000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 15s" }},
			{ ACT_SLEEP,	{ 15000 }},
			{ ACT_TERM },
		},
	},

	/* kicks in @15 and runs for 15 from wq2 */
	{
		.id		= 8,
		.wq_id		= 2,
		.cpu		= 1,
		.initial_delay	= 15000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 15s" }},
			{ ACT_SLEEP,	{ 15000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 9,
		.wq_id		= 2,
		.cpu		= 1,
		.initial_delay	= 15000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 15s" }},
			{ ACT_SLEEP,	{ 15000 }},
			{ ACT_TERM },
		},
	},

	/* kicks in @30 and runs for 15 */
	{
		.id		= 10,
		.cpu		= 1,
		.initial_delay	= 30000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 15s" }},
			{ ACT_SLEEP,	{ 15000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 11,
		.cpu		= 1,
		.initial_delay	= 30000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 15s" }},
			{ ACT_SLEEP,	{ 15000 }},
			{ ACT_TERM },
		},
	},

	{ .id = -1 },
};

static const struct test_scenario scenario5 = {
	.wq_spec		= dfl_wq_spec,
	.work_spec		=
	(const struct work_spec *[]) { work_spec5, NULL },
};

/*
 * Scenario 6.  Tests the freezeable workqueue.  The user should
 * freeze the machine between 0s and 9s.
 *
 * 0,1:sleep 10s
 * <-------->
 *      <--freezing--><--frozen--><-thawed
 *         2,3: sleeps for 10s
 *          <.....................-------->
 *          <.....................-------->
 */
static const struct work_spec work_spec6[] = {
	/* two works which get queued @0s and sleep for 10s */
	{
		.id		= 0,
		.wq_id		= 3,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 10s" }},
			{ ACT_SLEEP,	{ 10000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 1,
		.wq_id		= 3,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 10s" }},
			{ ACT_SLEEP,	{ 10000 }},
			{ ACT_TERM },
		},
	},
	/* two works which get queued @9s and sleep for 10s */
	{
		.id		= 2,
		.wq_id		= 3,
		.initial_delay	= 9000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 10s" }},
			{ ACT_SLEEP,	{ 10000 }},
			{ ACT_TERM },
		},
	},
	{
		.id		= 3,
		.wq_id		= 3,
		.initial_delay	= 9000,
		.actions	= (const struct work_action[]) {
			{ ACT_LOG,	{ .s = "sleeping for 10s" }},
			{ ACT_SLEEP,	{ 10000 }},
			{ ACT_TERM },
		},
	},

	{ .id = -1 },
};

static const struct test_scenario scenario6 = {
	.wq_spec		= dfl_wq_spec,
	.work_spec		=
	(const struct work_spec *[]) { work_spec6, NULL },
};

static const struct test_scenario *scenarios[] = {
	&scenario0, &scenario1, &scenario2, &scenario3, &scenario4, &scenario5,
	&scenario6,
};

/*
 * Execute
 */
static struct task_struct *sequencer;
static char wq_names[MAX_WQS][MAX_WQ_NAME];
static struct workqueue_struct *wqs[MAX_WQS];
static struct delayed_work dworks[MAX_WORKS];
#ifdef CONFIG_LOCKDEP
static struct lock_class_key wq_lockdep_keys[MAX_WORKS];
static struct lock_class_key dwork_lockdep_keys[MAX_WORKS];
#endif
static const struct work_spec *work_specs[MAX_WORKS];
static bool sleeping[MAX_WORKS];
static wait_queue_head_t wait_heads[MAX_WORKS];
static int requeue_cnt[MAX_WORKS];

static void test_work_fn(struct work_struct *work)
{
	int id = to_delayed_work(work) - dworks;
	const struct work_spec *spec = work_specs[id];
	const struct work_action *act = spec->actions;
	int rc;

#define pd(lvl, fmt, args...)	\
	printk("w%02d/%04d@%d "lvl": "fmt"\n", id, current->pid, raw_smp_processor_id() , ##args);
#define plog(fmt, args...)	pd("LOG ", fmt , ##args)
#define pinfo(fmt, args...)	pd("INFO", fmt , ##args)
#define perr(fmt, args...)	pd("ERR ", fmt , ##args)
repeat:
	switch (act->action) {
	case ACT_TERM:
		pinfo("TERM");
		return;
	case ACT_LOG:
		plog("%s", act->s);
		break;
	case ACT_BURN:
		mdelay(act->v);
		break;
	case ACT_SLEEP:
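		/*
		 * Interruptible sleep: an ACT_WAKEUP from another work
		 * clears the flag and wakes us up early; otherwise we
		 * time out after act->v msecs.
		 */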
		sleeping[id] = true;
		wait_event_timeout(wait_heads[id], !sleeping[id],
				   msecs_to_jiffies(act->v));
		if (!sleeping[id])
			pinfo("somebody woke me up");
		sleeping[id] = false;
		break;
	case ACT_WAKEUP:
		if (act->v < MAX_WORKS && sleeping[act->v]) {
			pinfo("waking %lu up", act->v);
			sleeping[act->v] = false;
			wake_up(&wait_heads[act->v]);
		} else
			perr("trying to wake up non-sleeping work %lu",
			     act->v);
		break;
	case ACT_REQUEUE:
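		/* requeue_cnt < 0 (spec left it at 0) means no requeue limit */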
		if (requeue_cnt[id] > 0 || requeue_cnt[id] < 0) {
			int cpu;

			get_online_cpus();

			if (spec->cpu < nr_cpu_ids && cpu_online(spec->cpu))
				cpu = spec->cpu;
			else
				cpu = raw_smp_processor_id();

			if (act->v)
				queue_delayed_work_on(cpu, wqs[spec->wq_id],
						      &dworks[id],
						      msecs_to_jiffies(act->v));
			else
				queue_work_on(cpu, wqs[spec->wq_id],
					      &dworks[id].work);

			pinfo("requeued on cpu%d delay=%lumsecs",
			      cpu, act->v);
			if (requeue_cnt[id] > 0)
				requeue_cnt[id]--;

			put_online_cpus();
		} else
			pinfo("requeue limit reached");

		break;
	case ACT_FLUSH:
		if (act->v < MAX_WORKS && work_specs[act->v]) {
			pinfo("flushing work %lu", act->v);
			rc = flush_work(&dworks[act->v].work);
			pinfo("flushed work %lu, rc=%d", act->v, rc);
		} else
			perr("trying to flush non-existent work %lu", act->v);
		break;
	case ACT_FLUSH_WQ:
		if (act->v < MAX_WQS && wqs[act->v]) {
			pinfo("flushing workqueue %lu", act->v);
			flush_workqueue(wqs[act->v]);
			pinfo("flushed workqueue %lu", act->v);
		} else
			perr("trying to flush non-existent workqueue %lu",
			     act->v);
		break;
	case ACT_CANCEL:
		if (act->v < MAX_WORKS && work_specs[act->v]) {
			pinfo("canceling work %lu", act->v);
			rc = cancel_delayed_work_sync(&dworks[act->v]);
			pinfo("canceled work %lu, rc=%d", act->v, rc);
		} else
			perr("trying to cancel non-existent work %lu", act->v);
		break;
	}
	act++;
	goto repeat;
}
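
/*
 * for_each_work_spec() below walks every work_spec of a scenario: each
 * spec array is terminated by .id == -1 and the array list by NULL.
 */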

#define for_each_work_spec(spec, i, scenario)				\
	for (i = 0, spec = scenario->work_spec[i]; spec;		\
	     spec = (spec + 1)->id >= 0 ? spec + 1 : scenario->work_spec[++i])

static int sequencer_thread(void *__scenario)
{
	const struct test_scenario *scenario = __scenario;
	const struct wq_spec *wq_spec;
	const struct work_spec *work_spec;
	int i, id;

	for (wq_spec = scenario->wq_spec; wq_spec->id >= 0; wq_spec++) {
		if (wq_spec->id >= MAX_WQS) {
			printk("ERR : wq id %d too high\n", wq_spec->id);
			goto err;
		}

		if (wqs[wq_spec->id]) {
			printk("ERR : wq %d already occupied\n", wq_spec->id);
			goto err;
		}

		snprintf(wq_names[wq_spec->id], MAX_WQ_NAME, "test-wq-%02d",
			 wq_spec->id);
		wqs[wq_spec->id] = __create_workqueue_key(wq_names[wq_spec->id],
						wq_spec->flags,
						&wq_lockdep_keys[wq_spec->id],
						wq_names[wq_spec->id]);
		if (!wqs[wq_spec->id]) {
			printk("ERR : failed to create wq %d\n", wq_spec->id);
			goto err;
		}
	}

	for_each_work_spec(work_spec, i, scenario) {
		struct delayed_work *dwork = &dworks[work_spec->id];
		struct workqueue_struct *wq = wqs[work_spec->wq_id];

		if (work_spec->id >= MAX_WORKS) {
			printk("ERR : work id %d too high\n", work_spec->id);
			goto err;
		}

		if (!wq) {
			printk("ERR : work %d references non-existent wq %d\n",
			       work_spec->id, work_spec->wq_id);
			goto err;
		}

		if (work_specs[work_spec->id]) {
			printk("ERR : work %d already initialized\n",
			       work_spec->id);
			goto err;
		}
		INIT_DELAYED_WORK(dwork, test_work_fn);
#ifdef CONFIG_LOCKDEP
		lockdep_init_map(&dwork->work.lockdep_map, "test-dwork",
				 &dwork_lockdep_keys[work_spec->id], 0);
#endif
		work_specs[work_spec->id] = work_spec;
		init_waitqueue_head(&wait_heads[work_spec->id]);
		requeue_cnt[work_spec->id] = work_spec->requeue_cnt ?: -1;
	}

	for_each_work_spec(work_spec, i, scenario) {
		struct delayed_work *dwork = &dworks[work_spec->id];
		struct workqueue_struct *wq = wqs[work_spec->wq_id];
		int cpu;

		get_online_cpus();

		if (work_spec->cpu < nr_cpu_ids && cpu_online(work_spec->cpu))
			cpu = work_spec->cpu;
		else
			cpu = raw_smp_processor_id();

		if (work_spec->initial_delay)
			queue_delayed_work_on(cpu, wq, dwork,
				msecs_to_jiffies(work_spec->initial_delay));
		else
			queue_work_on(cpu, wq, &dwork->work);

		put_online_cpus();
	}

	set_current_state(TASK_INTERRUPTIBLE);
	while (!kthread_should_stop()) {
		schedule();
		set_current_state(TASK_INTERRUPTIBLE);
	}
	set_current_state(TASK_RUNNING);

err:
	for (id = 0; id < MAX_WORKS; id++)
		if (work_specs[id])
			cancel_delayed_work_sync(&dworks[id]);
	for (id = 0; id < MAX_WQS; id++)
		if (wqs[id])
			destroy_workqueue(wqs[id]);

	set_current_state(TASK_INTERRUPTIBLE);
	while (!kthread_should_stop()) {
		schedule();
		set_current_state(TASK_INTERRUPTIBLE);
	}

	return 0;
}

static int scenario_switch = -1;
module_param_named(scenario, scenario_switch, int, 0444);

static int __init test_init(void)
{
	if (scenario_switch < 0 || scenario_switch >= ARRAY_SIZE(scenarios)) {
		printk("TEST WQ - no such scenario\n");
		return -EINVAL;
	}

	printk("TEST WQ - executing scenario %d\n", scenario_switch);
	sequencer = kthread_run(sequencer_thread,
				(void *)scenarios[scenario_switch],
				"test-wq-sequencer");
	if (IS_ERR(sequencer))
		return PTR_ERR(sequencer);
	return 0;
}

static void __exit test_exit(void)
{
	kthread_stop(sequencer);
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available
  2009-10-01 16:20   ` David Howells
@ 2009-10-01 16:30     ` Tejun Heo
  0 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01 16:30 UTC (permalink / raw)
  To: David Howells
  Cc: Jeff Garzik, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, arjan

David Howells wrote:
> Jeff Garzik <jeff@garzik.org> wrote:
> 
>> The quoted code is a standard kernel technique for creating typed constants
>> that are visible to the C compiler (rather than CPP) and debugger.
>>
>> It is found in a great many Linux kernel drivers at this point, and is
>> definitely not abuse, IMO.
> 
> Just because it's common practice, and just because you like it, doesn't make
> it not abuse.  But that's a side issue, and fairly irrelevant to the actual
> implementation.

Yeap, that's something I picked up from Jeff while working on libata
and I'm quite fond of it.  I haven't encountered any downside of it
yet.  Any reason why you think it's an abuse?  But, at any rate, yeap,
it's a pretty peripheral issue.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue
  2009-10-01 14:49   ` Andrew Morton
  2009-10-01 15:12     ` Ingo Molnar
@ 2009-10-01 16:34     ` Tejun Heo
  2009-10-04  8:49     ` Peter Zijlstra
  2 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01 16:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: jeff, mingo, linux-kernel, jens.axboe, rusty, cl, dhowells, arjan

Andrew Morton wrote:
> On Thu,  1 Oct 2009 17:09:18 +0900 Tejun Heo <tj@kernel.org> wrote:
> 
>> To solve the above issues, this patch implements concurrency-managed
>> workqueue.
> 
> Seems reasonable.
> 
> This approach would appear to rule out the option of setting a work
> thread's state (scheduling policy, scheduling priority, uid, etc) to
> anything other than some default.
> 
> I guess that's unlikely to be a problem if we haven't yet had a need to
> do that, but I'd be a bit surprised to discover that nobody has done
> that sort of thing yet?  Nobody has niced up their workqueue threads?

There were only two users in mainline which diddled with the worker
kthread: the stop machine call, which uses RT priority, and osl, which
binds its worker to cpu0.  Both are updated not to do that in earlier
patches.  For most cases, I don't think it would matter.  For the
special cases, given their rarity, I think we're better off with
custom kthreads for now.  If they become more prevalent, we'll need to
add support so that it can be done more easily, but even if that turns
out to be the case I think it would be better to implement it
separately from the generic workqueue.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01  8:24 ` [RFC PATCHSET] " Jens Axboe
@ 2009-10-01 16:36   ` Tejun Heo
  0 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01 16:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: jeff, mingo, linux-kernel, akpm, rusty, cl, dhowells, arjan

Jens Axboe wrote:
> Awesome work so far Tejun, I have high hopes for this patchset! I'll
> take some time to review this when I have it, just consider this so far
> a big encouragement to crank away on this. It's always annoyed me that
> we have various methods for doing async work, this promises to unify
> that very nicely.

Ah... thanks.  I've had this in my queue for too long (workqueue
semantics was quite intricate and getting hotplug right took me quite
some time).  It's nice to see it turns out to be not such a bad idea.
:-)

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available
  2009-10-01 13:05 ` [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available David Howells
  2009-10-01 16:15   ` Jeff Garzik
  2009-10-01 16:20   ` David Howells
@ 2009-10-01 16:39   ` Alan Cox
  2009-10-01 18:45   ` Ben Pfaff
  3 siblings, 0 replies; 90+ messages in thread
From: Alan Cox @ 2009-10-01 16:39 UTC (permalink / raw)
  To: David Howells
  Cc: Tejun Heo, dhowells, jeff, mingo, linux-kernel, akpm, jens.axboe,
	rusty, cl, arjan

On Thu, 01 Oct 2009 14:05:20 +0100
David Howells <dhowells@redhat.com> wrote:

> Tejun Heo <tj@kernel.org> wrote:
> 
> > +enum {
> > +	WORK_STRUCT_PENDING	= 0,	/* work item is pending execution */
> > +
> > +	/*
> > +	 * Reserve 3bits off of cwq pointer.  This is enough and
> > +	 * provides acceptable alignment on both 32 and 64bit
> > +	 * machines.
> > +	 */
> > +	WORK_STRUCT_FLAG_BITS	= 3,
> > +
> > +	WORK_STRUCT_FLAG_MASK	= (1UL << WORK_STRUCT_FLAG_BITS) - 1,
> > +	WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
> > +};
> 
> There's some great enum abuse going on here:-)

Its actually a very sensible use of enum - enum constants don't contain
the deathtraps that CPP constants do.
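
For illustration, a tiny user-space sketch (not from the patchset) of
one classic CPP deathtrap that an enum constant sidesteps - the macro
is pasted textually, the enum constant is a real value the compiler
(and debugger) knows about:

#include <stdio.h>

#define CPP_MASK	1 << 3		/* textual paste, no parentheses */

enum {
	ENUM_MASK = 1 << 3,		/* a real constant with a real value */
};

int main(void)
{
	/* the macro expands to 1 << 3 * 2, i.e. 1 << 6 == 64 */
	printf("CPP_MASK  * 2 = %d\n", CPP_MASK * 2);
	/* the enum constant behaves like any other value: 8 * 2 == 16 */
	printf("ENUM_MASK * 2 = %d\n", ENUM_MASK * 2);
	return 0;
}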

Alan

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01  8:40 ` Ingo Molnar
  2009-10-01  8:47   ` Jens Axboe
  2009-10-01  9:11   ` Avi Kivity
@ 2009-10-01 16:43   ` Tejun Heo
  2 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01 16:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Linus Torvalds, Avi Kivity,
	Frédéric Weisbecker, jeff, linux-kernel, akpm,
	jens.axboe, rusty, cl, dhowells, arjan

Hello,

Ingo Molnar wrote:
> Ok, this looks fairly interesting - and the way you reused scheduler 
> classes to auto-regulate with no impact on regular performance is quite 
> an ingenious idea as well. (KVM's preempt notifiers should probably use 
> this trick too, instead of an ugly notifier in the scheduler hotpath)
> 
> This mechanism could be used to implement threadlets/syslets too btw., 
> and other forms of asynchronous IO.
> 
> My main worry is that in practice workqueues arent all that performance 
> critical - so we are shooting to optimize something that doesnt 
> necessarily use all the potential goodness inherent in this approach.

The scheduler code was pretty nice to hook into.  But as Jens said,
this patchset is more about getting an async framework which can
scale and be used universally.  Performance-wise, I was mainly aiming
for not introducing a noticeable slowdown, as I expect workqueues to
be used more widely with this change.  Workqueue is already pretty
heavily used in certain paths - ie. the block IO completion path with
SSDs - and any significant overhead would be noticeable.  With a
mixture of works of different run times, I think the net effect would
be positive.  It should be able to achieve better latency and
throughput.  I don't have any numbers to back my rather optimistic
projection yet tho.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01  9:11   ` Avi Kivity
  2009-10-01  9:22     ` Avi Kivity
@ 2009-10-01 16:55     ` Tejun Heo
  2009-10-01 17:06       ` Avi Kivity
  1 sibling, 1 reply; 90+ messages in thread
From: Tejun Heo @ 2009-10-01 16:55 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	Frédéric Weisbecker, jeff, linux-kernel, akpm,
	jens.axboe, rusty, cl, dhowells, arjan

Hello,

Avi Kivity wrote:
> On 10/01/2009 10:40 AM, Ingo Molnar wrote:
>> Ok, this looks fairly interesting - and the way you reused scheduler
>> classes to auto-regulate with no impact on regular performance is quite
>> an ingenious idea as well. (KVM's preempt notifiers should probably use
>> this trick too, instead of an ugly notifier in the scheduler hotpath)
>>
>> This mechanism could be used to implement threadlets/syslets too btw.,
>> and other forms of asynchronous IO.
>>    
> 
> In fact I've thought of implementing threadlets and concurrency-managed
> workqueues with preempt notifiers ;)

:-)

> Isn't a scheduling class overkill for two existing callbacks?  Note we
> can easily use a thread flag and __switch_to_xtra() to avoid the overhead.

The partial overriding only takes about seventy lines of code and is
conceptually trivial.  I don't think it's overkill.

> For kvm, we don't want to force a specific scheduling class for vcpu
> threads, so we'd need infrastructure to create a new scheduling class
> out of an existing one to hook the two callbacks.  Seems like quite a
> lot of work, for something that is orthogonal to scheduling.
> 
> Tejun, would preempt notifiers work for your workqueues?  see bottom of
> include/linux/preempt.h.

I considered that, but the thing is that the workqueue needs to know
when a thread wakes up, not when it gets scheduled.  Of course we
could add another notifier op and call it from try_to_wake_up(), but I
really didn't want to add yet another hook in a very hot path which is
useful only for a very small number of tasks yet has to be called for
every operation.  The sched_class mechanism means that we already have
hooks at all the interesting spots, so I think it's better to make use
of them instead of adding another set of callbacks.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 03/19] scheduler: implement workqueue scheduler class
  2009-10-01  8:09 ` [PATCH 03/19] scheduler: implement workqueue scheduler class Tejun Heo
@ 2009-10-01 16:57   ` Linus Torvalds
  2009-10-01 18:48     ` Ingo Molnar
  2009-10-02 12:23     ` Tejun Heo
  0 siblings, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2009-10-01 16:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan



On Thu, 1 Oct 2009, Tejun Heo wrote:
>
> Implement workqueue scheduler class.  Workqueue sched_class inherits
> fair sched_class and behaves exactly the same as sched_class except
> that it has two callback functions which get called when a task is put
> to sleep and wakes up and doesn't allow switching to different
> scheduler class.

So this looks odd to me.

I agree completely with the callback functions, but what I don't agree 
with is that this is somehow workqueue-related. I bet that others could 
use this, and in fact, I suspect that it should not be tied to the 
scheduler class at all, but to the _thread_.

Just as an example, I could imagine that we would do lock_kernel() 
releasing the kernel lock on scheduling (and re-taking it on wakeup) as a 
callback. Or async IO handling - both done independently of any 
"workqueue" logic.

So tying this to the scheduler class seems a bit odd. But maybe I'm 
missing something?

		Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 15/19] workqueue: reimplement workqueue flushing using color coded works
  2009-10-01  8:09 ` [PATCH 15/19] workqueue: reimplement workqueue flushing using color coded works Tejun Heo
@ 2009-10-01 17:03   ` Linus Torvalds
  2009-10-01 17:11     ` Tejun Heo
  0 siblings, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2009-10-01 17:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jeff, Ingo Molnar, Linux Kernel Mailing List, Andrew Morton,
	Jens Axboe, rusty, cl, dhowells, arjan



On Thu, 1 Oct 2009, Tejun Heo wrote:
>
> Reimplement workqueue flushing using color coded works.  There are two
> colors and each cwq has the current color which is painted on the
> works being issued via the cwq.  Flushing a workqueue is achieved by
> flipping the current colors of each cwq and wait for the works which
> have the old color to drain.

Is there any reason for the "two colors" choice? I could imagine that it 
could end up being a limitation (and possible deadlock?) to allow just a 
single flush pending at any time.

Could the color be an 8-bit counter or something like that instead?

		Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01 16:55     ` Tejun Heo
@ 2009-10-01 17:06       ` Avi Kivity
  0 siblings, 0 replies; 90+ messages in thread
From: Avi Kivity @ 2009-10-01 17:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	Frédéric Weisbecker, jeff, linux-kernel, akpm,
	jens.axboe, rusty, cl, dhowells, arjan

On 10/01/2009 06:55 PM, Tejun Heo wrote:
>> For kvm, we don't want to force a specific scheduling class for vcpu
>> threads, so we'd need infrastructure to create a new scheduling class
>> out of an existing one to hook the two callbacks.  Seems like quite a
>> lot of work, for something that is orthogonal to scheduling.
>>
>> Tejun, would preempt notifiers work for your workqueues?  see bottom of
>> include/linux/preempt.h.
>>      
> I considered that, but the thing is that the workqueue needs to know
> when a thread wakes up, not when it gets scheduled.  Of course we
> could add another notifier op and call it from try_to_wake_up(), but I
> really didn't want to add yet another hook in a very hot path which is
> useful only for a very small number of tasks yet has to be called for
> every operation.  The sched_class mechanism means that we already have
> hooks at all the interesting spots, so I think it's better to make use
> of them instead of adding another set of callbacks.
>    

Right, it's a subtle difference that makes p_n unusable for you.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 06/19] acpi: use queue_work_on() instead of binding workqueue worker to cpu0
  2009-10-01 12:57 ` [PATCH 06/19] acpi: use queue_work_on() instead of binding workqueue worker to cpu0 David Howells
@ 2009-10-01 17:07   ` Tejun Heo
  0 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01 17:07 UTC (permalink / raw)
  To: David Howells
  Cc: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, arjan

David Howells wrote:
> Tejun Heo <tj@kernel.org> wrote:
> 
>> -	kacpid_wq = create_singlethread_workqueue("kacpid");
>> -	bind_workqueue(kacpid_wq);
>> -	kacpi_notify_wq = create_singlethread_workqueue("kacpi_notify");
>> -	bind_workqueue(kacpi_notify_wq);
>> -	kacpi_hotplug_wq = create_singlethread_workqueue("kacpi_hotplug");
>> -	bind_workqueue(kacpi_hotplug_wq);
>> +	kacpid_wq = create_workqueue("kacpid");
>> +	kacpi_notify_wq = create_workqueue("kacpi_notify");
>> +	kacpi_hotplug_wq = create_workqueue("kacpi_hotplug");
> 
> Doesn't that then create one worker thread per CPU and then eschew all but
> those attached to CPU 0?  Sounds excessive, but presumably you deal with that
> in later patches.

Yeah, I'm just trying to strip down odd usages to ease conversion to
the new mechanism, so between this patch and the actual new workqueue
implementation there will be a whole lot of unused kthreads.  After
the new workqueue, they automatically disappear.  Also, the above can
be cleaned up such that only one of the workqueues remains afterwards.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 15/19] workqueue: reimplement workqueue flushing using color coded works
  2009-10-01 17:03   ` Linus Torvalds
@ 2009-10-01 17:11     ` Tejun Heo
  2009-10-01 17:16       ` Tejun Heo
  0 siblings, 1 reply; 90+ messages in thread
From: Tejun Heo @ 2009-10-01 17:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: jeff, Ingo Molnar, Linux Kernel Mailing List, Andrew Morton,
	Jens Axboe, rusty, cl, dhowells, arjan

Hello, Linus.

Linus Torvalds wrote:
> On Thu, 1 Oct 2009, Tejun Heo wrote:
>> Reimplement workqueue flushing using color coded works.  There are two
>> colors and each cwq has the current color which is painted on the
>> works being issued via the cwq.  Flushing a workqueue is achieved by
>> flipping the current colors of each cwq and wait for the works which
>> have the old color to drain.
> 
> Is there any reason for the "two colors" choice? I could imagine that it 
> could end up being a limitation (and possible deadlock?) to allow just a 
> single flush pending at any time.
> 
> Could the color be an 8-bit counter or something like that instead?

It's related to how many bits can be used from work_struct->data which
in turn is determined by the alignment of cwq.  Currently, the
alignment is 8 bytes so 3 bits are available.  One is used for
PENDING, the other for LINKED and one last bit is used for COLOR.
Aligning cwq to, say, 64bytes wouldn't be bad at all and then we can
have 6-bits of coloring.  Hmmm... yeap, I'll work on that.
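
To make the arithmetic concrete, a minimal user-space sketch
(simplified names, not the actual kernel code) of how the low bits of
an aligned pointer double as flag space:

#include <stdio.h>

#define FLAG_BITS	3			/* 8-byte aligned cwq -> 3 free bits */
#define FLAG_MASK	((1UL << FLAG_BITS) - 1)
#define DATA_MASK	(~FLAG_MASK)

int main(void)
{
	/* pretend this is the 8-byte aligned cwq address kept in work->data */
	unsigned long cwq  = 0x12345678UL & DATA_MASK;
	unsigned long data = cwq | 0x3;		/* a couple of flag bits set */

	printf("cwq   back out: %#lx\n", data & DATA_MASK);	/* 0x12345678 */
	printf("flags back out: %#lx\n", data & FLAG_MASK);	/* 0x3 */

	/* 64-byte alignment would make FLAG_BITS 6, freeing bits for color */
	return 0;
}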

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue
  2009-10-01  8:09 ` [PATCH 19/19] workqueue: implement concurrency managed workqueue Tejun Heo
  2009-10-01 14:49   ` Andrew Morton
@ 2009-10-01 17:12   ` Linus Torvalds
  2009-10-01 17:22     ` Tejun Heo
  2009-10-02  0:42   ` Andi Kleen
  2009-10-02 14:28   ` Frédéric Weisbecker
  3 siblings, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2009-10-01 17:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jeff, Ingo Molnar, Linux Kernel Mailing List, Andrew Morton,
	Jens Axboe, rusty, cl, dhowells, arjan



On Thu, 1 Oct 2009, Tejun Heo wrote:
>
> To solve the above issues, this patch implements concurrency-managed
> workqueue.

Ok, all the other patches in the series looked great, this one looks 
scary. Everything else was pretty small and targeted, and if bugs appear, 
bisection would show the step that was nasty.

And then comes the big flag-day patch that actually introduces all the new 
logic, and is the most complex of the lot.

Maybe it's inevitable, but if this one could be split up a bit more, I'd 
find the series less scary.

I don't know how, though (except by adding code that simply isn't used in 
previous patches, and that doesn't really help anything at all), so maybe 
the big "switch everything over at once" thing is unavoidable.

		Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 15/19] workqueue: reimplement workqueue flushing using color coded works
  2009-10-01 17:11     ` Tejun Heo
@ 2009-10-01 17:16       ` Tejun Heo
  0 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-01 17:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: jeff, Ingo Molnar, Linux Kernel Mailing List, Andrew Morton,
	Jens Axboe, rusty, cl, dhowells, arjan

Tejun Heo wrote:
> Hello, Linus.
> 
> Linus Torvalds wrote:
>> On Thu, 1 Oct 2009, Tejun Heo wrote:
>>> Reimplement workqueue flushing using color coded works.  There are two
>>> colors and each cwq has the current color which is painted on the
>>> works being issued via the cwq.  Flushing a workqueue is achieved by
>>> flipping the current colors of each cwq and wait for the works which
>>> have the old color to drain.
>> Is there any reason for the "two colors" choice? I could imagine that it 
>> could end up being a limitation (and possible deadlock?) to allow just a 
>> single flush pending at any time.
>>
>> Could the color be an 8-bit counter or something like that instead?
> 
> It's related to how many bits can be used from work_struct->data which
> in turn is determined by the alignment of cwq.  Currently, the
> alignment is 8 bytes so 3 bits are available.  One is used for
> PENDING, the other for LINKED and one last bit is used for COLOR.
> Aligning cwq to, say, 64bytes wouldn't be bad at all and then we can
> have 6-bits of coloring.  Hmmm... yeap, I'll work on that.

Oops, 4 not 6.

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue
  2009-10-01 17:12   ` Linus Torvalds
@ 2009-10-01 17:22     ` Tejun Heo
  2009-10-01 17:55       ` Linus Torvalds
  0 siblings, 1 reply; 90+ messages in thread
From: Tejun Heo @ 2009-10-01 17:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: jeff, Ingo Molnar, Linux Kernel Mailing List, Andrew Morton,
	Jens Axboe, rusty, cl, dhowells, arjan

Hello,

Linus Torvalds wrote:
> 
> On Thu, 1 Oct 2009, Tejun Heo wrote:
>> To solve the above issues, this patch implements concurrency-managed
>> workqueue.
> 
> Ok, all the other patches in the series looked great, this one looks 
> scary. Everything else was pretty small and targeted, and if bugs appear, 
> bisection would show the step that was nasty.
> 
> And then comes the big flag-day patch that actually introduces all the new 
> logic, and is the most complex of the lot.
> 
> Maybe it's inevitable, but if this one could be split up a bit more, I'd 
> find the series less scary.
> 
> I don't know how, though (except by adding code that simply isn't used in 
> previous patches, and that doesn't really help anything at all), so maybe 
> the big "switch everything over at once" thing is unavoidable.

Yeap, this one is pretty scary.  I tried pretty hard to at least make
the diffs fall into relevant sections (ie. updates to a certain part
of the logic show up as diffs to the original part) but I agree this
patch is a tad too big.  Maybe I can introduce gcwq first.  The big
problem is the strong inter-dependency between the single global
worker pool and the hotplug logic.  I tried to introduce the hotplug
logic first but it basically required implementing a different interim
mechanism altogether.  One solution could be disabling or crippling
cpu hotplug support for several commits so that the processing part
and the hotplug part can be introduced separately.  Would that be
acceptable?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue
  2009-10-01 17:22     ` Tejun Heo
@ 2009-10-01 17:55       ` Linus Torvalds
  0 siblings, 0 replies; 90+ messages in thread
From: Linus Torvalds @ 2009-10-01 17:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jeff, Ingo Molnar, Linux Kernel Mailing List, Andrew Morton,
	Jens Axboe, rusty, cl, dhowells, arjan



On Fri, 2 Oct 2009, Tejun Heo wrote:
>
> Yeap, this one is pretty scary.  I tried pretty hard to at least make
> the diffs fall into relevant sections (ie. updates to a certain part
> of the logic show up as diffs to the original part) but I agree this
> patch is a tad too big.  Maybe I can introduce gcwq first.  The big
> problem is the strong inter-dependency between the single global
> worker pool and the hotplug logic.  I tried to introduce the hotplug
> logic first but it basically required implementing a different interim
> mechanism altogether.  One solution could be disabling or crippling
> cpu hotplug support for several commits so that the processing part
> and the hotplug part can be introduced separately.  Would that be
> acceptable?

I would say that disabling CPU hotplug for a while is likely a good way to 
at least limit the damage, except for one thing - suspend and resume. 

That's one of the things that is hard to debug anyway, is tied to 
workqueues (with that whole freezing thing) _and_ uses CPU hotplug to 
actually get its work done. So if you effectively disable CPU hotplug for 
a part of the series, then that just means that bisection doesn't work 
over that part - and is probably one of the areas where it would be the 
most important that it works ;^/

So I dunno. I do like the series even as-is. It seems to have fairly solid 
reasoning behind it, and I absolutely hate the deadlocks we have now in 
workqueue handling and the insane things we do to work around them. At the 
same time, this code is something I get worried about - I can easily 
imagine subtle bugs that only happen under certain circumstances and with 
certain timings - and the last patch was the scariest of the lot.

So if the thing could be split up while limiting CPU hotplug only a bit 
(so that the case of suspend/resume would hopefully still be ok), that 
would be wonderful.

			Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 01/19] freezer: don't get over-anxious while waiting
  2009-10-01  8:09 ` [PATCH 01/19] freezer: don't get over-anxious while waiting Tejun Heo
@ 2009-10-01 18:36   ` Pavel Machek
  2009-10-01 21:04     ` Rafael J. Wysocki
  2009-10-01 21:04     ` Rafael J. Wysocki
  0 siblings, 2 replies; 90+ messages in thread
From: Pavel Machek @ 2009-10-01 18:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells,
	arjan, Rafael J. Wysocki


> Freezing isn't exactly the most latency sensitive operation and
> there's no reason to burn cpu cycles and power waiting for it to
> complete.  msleep(10) instead of yield().  This should improve
> reliability of emergency hibernation.

I don't see how it improves reliability, but it's probably ok.

Well... for hibernation anyway. I can imagine cgroup users where
freeze is so fast that this matters. rjw cc-ed.		pavel

> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/power/process.c |   13 +++++++++----
>  1 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/power/process.c b/kernel/power/process.c
> index cc2e553..9d26a0a 100644
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -41,7 +41,7 @@ static int try_to_freeze_tasks(bool sig_only)
>  	do_gettimeofday(&start);
>  
>  	end_time = jiffies + TIMEOUT;
> -	do {
> +	while (true) {
>  		todo = 0;
>  		read_lock(&tasklist_lock);
>  		do_each_thread(g, p) {
> @@ -62,10 +62,15 @@ static int try_to_freeze_tasks(bool sig_only)
>  				todo++;
>  		} while_each_thread(g, p);
>  		read_unlock(&tasklist_lock);
> -		yield();			/* Yield is okay here */
> -		if (time_after(jiffies, end_time))
> +		if (!todo || time_after(jiffies, end_time))
>  			break;
> -	} while (todo);
> +
> +		/*
> +		 * We need to retry.  There's no reason to be
> +		 * over-anxious about it and waste power.
> +		 */
> +		msleep(10);
> +	}
>  
>  	do_gettimeofday(&end);
>  	elapsed_csecs64 = timeval_to_ns(&end) - timeval_to_ns(&start);

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available
  2009-10-01 13:05 ` [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available David Howells
                     ` (2 preceding siblings ...)
  2009-10-01 16:39   ` Alan Cox
@ 2009-10-01 18:45   ` Ben Pfaff
  2009-10-02 11:56     ` Tejun Heo
  3 siblings, 1 reply; 90+ messages in thread
From: Ben Pfaff @ 2009-10-01 18:45 UTC (permalink / raw)
  To: linux-kernel

David Howells <dhowells@redhat.com> writes:

> Tejun Heo <tj@kernel.org> wrote:
>
>> +enum {
>> +	WORK_STRUCT_PENDING	= 0,	/* work item is pending execution */
>> +
>> +	/*
>> +	 * Reserve 3bits off of cwq pointer.  This is enough and
>> +	 * provides acceptable alignment on both 32 and 64bit
>> +	 * machines.
>> +	 */
>> +	WORK_STRUCT_FLAG_BITS	= 3,
>> +
>> +	WORK_STRUCT_FLAG_MASK	= (1UL << WORK_STRUCT_FLAG_BITS) - 1,
>> +	WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
>> +};
>
> There's some great enum abuse going on here:-)

The "1UL" part is a bit worrisome.  Enumeration constants always
have type "int"[*], so if code that uses WORK_STRUCT_WQ_DATA_MASK
actually depends on getting a full "long" worth of bits, it is
not going to work on 64-bit systems.

[*] See C99:

     6.4.4.3 Enumeration constants
     Syntax
1              enumeration-constant:
                      identifier
     Semantics
2    An identifier declared as an enumeration constant has type int.
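
A quick user-space check (not from the patch) of what a given compiler
actually does with such an enumerator - whether a full-width value
survives here relies on a compiler extension, which is exactly the
worry:

#include <stdio.h>

enum {
	FLAG_BITS = 3,
	FLAG_MASK = (1UL << FLAG_BITS) - 1,
	DATA_MASK = ~FLAG_MASK,		/* meant to be a full-width mask */
};

int main(void)
{
	/* on a strict C99 reading the constant is an int; gcc may widen it */
	printf("sizeof(DATA_MASK) = %zu\n", sizeof(DATA_MASK));
	printf("DATA_MASK         = %#lx\n", (unsigned long)DATA_MASK);
	return 0;
}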

-- 
Ben Pfaff 
http://benpfaff.org


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 03/19] scheduler: implement workqueue scheduler class
  2009-10-01 16:57   ` Linus Torvalds
@ 2009-10-01 18:48     ` Ingo Molnar
  2009-10-01 19:00       ` Avi Kivity
  2009-10-01 19:06       ` Linus Torvalds
  2009-10-02 12:23     ` Tejun Heo
  1 sibling, 2 replies; 90+ messages in thread
From: Ingo Molnar @ 2009-10-01 18:48 UTC (permalink / raw)
  To: Linus Torvalds, Avi Kivity, Peter Zijlstra
  Cc: Tejun Heo, jeff, linux-kernel, akpm, jens.axboe, rusty, cl,
	dhowells, arjan


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 1 Oct 2009, Tejun Heo wrote:
> >
> > Implement workqueue scheduler class.  Workqueue sched_class inherits 
> > fair sched_class and behaves exactly the same as sched_class except 
> > that it has two callback functions which get called when a task is 
> > put to sleep and wakes up and doesn't allow switching to different 
> > scheduler class.
> 
> So this looks odd to me.
> 
> I agree completely with the callback functions, but what I don't agree 
> with is that this is somehow workqueue-related. I bet that others 
> could use this, and in fact, I suspect that it should not be tied to 
> the scheduler class at all, but to the _thread_.
> 
> Just as an example, I could imagine that we would do lock_kernel() 
> releasing the kernel lock on scheduling (and re-taking it on wakeup) 
> as a callback. Or async IO handling - both done independently of any 
> "workqueue" logic.
> 
> So tying this to the scheduler class seems a bit odd. But maybe I'm 
> missing something?

We could do what Avi suggested: not use scheduler classes at all for 
this (that brings in other limitations like lack of p->policy freedom), 
but use the existing preempt-notifications callbacks.

They are per task - we would simply make preempt notifiers 
unconditional, i.e. remove CONFIG_PREEMPT_NOTIFIERS and make it all 
unconditional scheduler logic.

( Sidenote: we could even unify that logic with the perf event 
  sched-in/sched-out logic which is per task in a similar fashion and 
  which can have callback properties as well. That would mean a single,
  well-defined callback facility for scheduler events. )

	Ingo

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 03/19] scheduler: implement workqueue scheduler class
  2009-10-01 18:48     ` Ingo Molnar
@ 2009-10-01 19:00       ` Avi Kivity
  2009-10-01 19:11         ` Linus Torvalds
  2009-10-01 19:15         ` Ingo Molnar
  2009-10-01 19:06       ` Linus Torvalds
  1 sibling, 2 replies; 90+ messages in thread
From: Avi Kivity @ 2009-10-01 19:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, Tejun Heo, jeff, linux-kernel,
	akpm, jens.axboe, rusty, cl, dhowells, arjan

On 10/01/2009 08:48 PM, Ingo Molnar wrote:
> We could do what Avi suggested: not use scheduler classes at all for
> this (that brings in other limitations like lack of p->policy freedom),
> but use the existing preempt-notifications callbacks.
>
> They are per task - we would simply make preempt notifiers
> unconditional, i.e. remove CONFIG_PREEMPT_NOTIFIERS and make it all
> unconditional scheduler logic.
>
>    

Sure, but it would mean that we need a new notifier.  sched_out, 
sched_in, and wakeup (and, return to userspace, with the new notifier).

btw, I've been thinking we should extend concurrency managed workqueues 
to userspace.  Right now userspace can spawn a massive amount of 
threads, hoping to hide any waiting by making more work available to the 
scheduler.  That has the drawback of increasing latency due to 
involuntary preemption.  Or userspace can use one thread per cpu, hope 
it's the only application on the machine, and go all-aio.

But what if we added a way for userspace to tell the kernel to fork off 
threads when processing power and work to do are both available?  The 
scheduler knows when there is processing power, and an epoll fd can tell 
it when there is work to do.  So the scheduler will create threads to 
saturate the processors, if one of them waits for I/O the scheduler 
forks off another one until all queues are busy again.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 03/19] scheduler: implement workqueue scheduler class
  2009-10-01 18:48     ` Ingo Molnar
  2009-10-01 19:00       ` Avi Kivity
@ 2009-10-01 19:06       ` Linus Torvalds
  1 sibling, 0 replies; 90+ messages in thread
From: Linus Torvalds @ 2009-10-01 19:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Peter Zijlstra, Tejun Heo, jeff, linux-kernel, akpm,
	jens.axboe, rusty, cl, dhowells, arjan



On Thu, 1 Oct 2009, Ingo Molnar wrote:
>
> > So tying this to the scheduler class seems a bit odd. But maybe I'm 
> > missing something?
> 
> We could do what Avi suggested: not use scheduler classes at all for 
> this (that brings in other limitations like lack of p->policy freedom), 
> but use the existing preempt-notifications callbacks.

I don't think that works. I think the preempt notifiers are done too early 
and too late (ie they are done at the actual context switch), whereas what 
Tejun's code wants is to be called so that he can actually _do_ something 
about the task before the next task is selected.

But if the preempt-notification users are ok with being called earlier, 
then yes, I guess we could share the logic (and rename it). I do agree 
that what I'd like to see is more _like_ those preempt notifications, with 
a list of things to do before/after.

> They are per task - we would simply make preempt notifiers 
> unconditional, i.e. remove CONFIG_PREEMPT_NOTIFIERS and make it all 
> unconditional scheduler logic.

I don't mind that, but see above: I think the semantics are fundamentally 
different. One wants to be called exactly when the actual context switch 
happens, the other wants to be called before the choice of the next thread 
is even done (which in turn can mean that no context switch actually 
happens at all, because maybe you end up picking the same thread after 
all - possibly _due_ to the callback doing something)

But it's possible that I'm making a semantic difference out of something 
that nobody actually cares about.

		Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 03/19] scheduler: implement workqueue scheduler class
  2009-10-01 19:00       ` Avi Kivity
@ 2009-10-01 19:11         ` Linus Torvalds
  2009-10-01 19:23           ` Ingo Molnar
  2009-10-01 19:15         ` Ingo Molnar
  1 sibling, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2009-10-01 19:11 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, jeff, linux-kernel, akpm,
	jens.axboe, rusty, cl, dhowells, arjan



On Thu, 1 Oct 2009, Avi Kivity wrote:
> 
> Sure, but it would mean that we need a new notifier.  sched_out, sched_in, and
> wakeup (and, return to userspace, with the new notifier).

Ok, see the email I just sent out.

And I don't think we want a new notifier - mainly because I don't think we 
want to walk the list four times (prepare, out, in, final - we need to 
make sure that these things nest properly, so even if "in" and "final" 
happen with the same state, they aren't the same, because "in" only 
happens if "out" was called, while "final" would happen if "prepare" was 
called)

So it would be better to have separate lists, in order to avoid walking 
the lists four times just because there was a single notifier that just 
wanted to be called for the inner (or outer) cases.

> btw, I've been thinking we should extend concurrency managed workqueues to
> userspace.  Right now userspace can spawn a massive amount of threads, hoping
> to hide any waiting by making more work available to the scheduler.  That has
> the drawback of increasing latency due to involuntary preemption.  Or
> userspace can use one thread per cpu, hope it's the only application on the
> machine, and go all-aio.

This is what the whole next-gen AIO was supposed to do with the 
threadlets, ie avoid doing a new thread if it could do the IO all cached 
and without being preempted. 

			Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 03/19] scheduler: implement workqueue scheduler class
  2009-10-01 19:00       ` Avi Kivity
  2009-10-01 19:11         ` Linus Torvalds
@ 2009-10-01 19:15         ` Ingo Molnar
  1 sibling, 0 replies; 90+ messages in thread
From: Ingo Molnar @ 2009-10-01 19:15 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Linus Torvalds, Peter Zijlstra, Tejun Heo, jeff, linux-kernel,
	akpm, jens.axboe, rusty, cl, dhowells, arjan


* Avi Kivity <avi@redhat.com> wrote:

> On 10/01/2009 08:48 PM, Ingo Molnar wrote:
>> We could do what Avi suggested: not use scheduler classes at all for
>> this (that brings in other limitations like lack of p->policy freedom),
>> but use the existing preempt-notifications callbacks.
>>
>> They are per task - we would simply make preempt notifiers 
>> unconditional, i.e. remove CONFIG_PREEMPT_NOTIFIERS and make it all 
>> unconditional scheduler logic.
>
> Sure, but it would mean that we need a new notifier.  sched_out, 
> sched_in, and wakeup (and, return to userspace, with the new 
> notifier).

perf events have sched out, sched in and wakeup events. Return to 
user-space would be interesting to add as well. (and overhead of that 
can be hidden via TIF - like you did via the return-to-userspace 
notifiers)

Sounds more generally useful (and less scary) than (clever but somewhat 
limiting) sched_class hackery.

I.e. i'd prefer if we had just one callback facility in those codepaths, 
minimizing the hotpath overhead and providing a coherent API.

> btw, I've been thinking we should extend concurrency managed 
> workqueues to userspace.  Right now userspace can spawn a massive 
> amount of threads, hoping to hide any waiting by making more work 
> available to the scheduler.  That has the drawback of increasing 
> latency due to involuntary preemption.  Or userspace can use one 
> thread per cpu, hope it's the only application on the machine, and go 
> all-aio.
>
> But what if we added a way for userspace to tell the kernel to fork 
> off threads when processing power and work to do are both available?  
> The scheduler knows when there is processing power, and an epoll fd 
> can tell it when there is work to do.  So the scheduler will create 
> threads to saturate the processors, if one of them waits for I/O the 
> scheduler forks off another one until all queues are busy again.

Sounds like syslets done right?

	Ingo

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 03/19] scheduler: implement workqueue scheduler class
  2009-10-01 19:11         ` Linus Torvalds
@ 2009-10-01 19:23           ` Ingo Molnar
  2009-10-01 20:03             ` Linus Torvalds
  0 siblings, 1 reply; 90+ messages in thread
From: Ingo Molnar @ 2009-10-01 19:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Avi Kivity, Peter Zijlstra, Tejun Heo, jeff, linux-kernel, akpm,
	jens.axboe, rusty, cl, dhowells, arjan


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 1 Oct 2009, Avi Kivity wrote:
> > 
> > Sure, but it would mean that we need a new notifier.  sched_out, 
> > sched_in, and wakeup (and, return to userspace, with the new 
> > notifier).
> 
> Ok, see the email I just sent out.
> 
> And I don't think we want a new notifier - mainly because I don't 
> think we want to walk the list four times (prepare, out, in, final - 
> we need to make sure that these things nest properly, so even if "in" 
> and "final" happen with the same state, they aren't the same, because 
> "in" only happens if "out" was called, while "final" would happen if 
> "prepare" was called)
> 
> So it would be better to have separate lists, in order to avoid 
> walking the lists four times just because there was a single notifier 
> that just wanted to be called for the inner (or outer) cases.

Sounds a bit like perf events with callbacks, triggered at those places. 
(allowing arbitrary permutation of the callbacks)

But ... it needs some work to shape in precisely such a way. Primarily 
it would need a splitting/slimming of struct perf_event, to allow the 
callback properties to be separated out for in-kernel users that are 
only interested in the callbacks, not in the other abstractions.

But it looks straightforward and useful ... the kind of useful work 
interested parties would be able to complete by the next merge window 
;-)

Other places could use this too - we really want just one callback 
facility for certain system events - be that in-kernel use for other 
kernel facilities, or external instrumentation injected by user-space.

> > btw, I've been thinking we should extend concurrency managed 
> > workqueues to userspace.  Right now userspace can spawn a massive 
> > amount of threads, hoping to hide any waiting by making more work 
> > available to the scheduler.  That has the drawback of increasing 
> > latency due to involuntary preemption.  Or userspace can use one 
> > thread per cpu, hope it's the only application on the machine, and 
> > go all-aio.
> 
> This is what the whole next-gen AIO was supposed to do with the 
> threadlets, ie avoid doing a new thread if it could do the IO all 
> cached and without being preempted.

Yeah. That scheme was hobbled by signal semantics: it looked hard to do 
the 'flip a reserve thread with a blocked thread' trick in the scheduler 
while still keeping all the signal details in place.

	Ingo

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 03/19] scheduler: implement workqueue scheduler class
  2009-10-01 19:23           ` Ingo Molnar
@ 2009-10-01 20:03             ` Linus Torvalds
  0 siblings, 0 replies; 90+ messages in thread
From: Linus Torvalds @ 2009-10-01 20:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Peter Zijlstra, Tejun Heo, jeff, linux-kernel, akpm,
	jens.axboe, rusty, cl, dhowells, arjan



On Thu, 1 Oct 2009, Ingo Molnar wrote:
> 
> Yeah. That scheme was hobbled by signal semantics: it looked hard to do 
> the 'flip a reserve thread with a blocked thread' trick in the scheduler 
> while still keeping all the signal details in place.

I think we should look at David Wheeler's advice: "Any problem in computer 
science can be solved with another level of indirection".

In particular, my favourite solution to this is to split "struct 
thread_struct" into a new part, which would be "CPU state". 

In other words, all threadlets would share one single "struct 
thread_struct" (and thus local signal state), but they would then each 
have a "struct cpu_state" associated with them, which includes the kernel 
stack. And that cpu_state thing would not be preempted, it would be 
something like a round-robin cooperative scheduling that is only invoked 
when a threadlet goes to sleep.
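
A purely illustrative sketch of that split - the fields below are 
invented for the example, not taken from any patch:

  /* signal state etc. stays in the shared thread_struct; each threadlet
     only carries its own cpu_state */
  struct cpu_state {
          void                    *kernel_stack;  /* per-threadlet stack */
          struct pt_regs          regs;           /* saved register state */
          struct list_head        run_list;       /* round-robin siblings */
  };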

			Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 01/19] freezer: don't get over-anxious while waiting
  2009-10-01 18:36   ` Pavel Machek
@ 2009-10-01 21:04     ` Rafael J. Wysocki
  2009-10-02 10:56         ` Tejun Heo
                         ` (2 more replies)
  2009-10-01 21:04     ` Rafael J. Wysocki
  1 sibling, 3 replies; 90+ messages in thread
From: Rafael J. Wysocki @ 2009-10-01 21:04 UTC (permalink / raw)
  To: Pavel Machek, Tejun Heo
  Cc: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells,
	arjan, pm list, Oren Laadan

On Thursday 01 October 2009, Pavel Machek wrote:
> 
> > Freezing isn't exactly the most latency sensitive operation and
> > there's no reason to burn cpu cycles and power waiting for it to
> > complete.  msleep(10) instead of yield().  This should improve
> > reliability of emergency hibernation.
> 
> i don't see how it improves reliability, but its probably ok.
> 
> Well... for hibernation anyway. I can imagine cgroup users where
> freeze is so fast that this matters. rjw cc-ed.		pavel

Thanks.  I'd like to hear from the cgroup freezer people about that.

> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > ---
> >  kernel/power/process.c |   13 +++++++++----
> >  1 files changed, 9 insertions(+), 4 deletions(-)
> > 
> > diff --git a/kernel/power/process.c b/kernel/power/process.c
> > index cc2e553..9d26a0a 100644
> > --- a/kernel/power/process.c
> > +++ b/kernel/power/process.c
> > @@ -41,7 +41,7 @@ static int try_to_freeze_tasks(bool sig_only)
> >  	do_gettimeofday(&start);
> >  
> >  	end_time = jiffies + TIMEOUT;
> > -	do {
> > +	while (true) {
> >  		todo = 0;
> >  		read_lock(&tasklist_lock);
> >  		do_each_thread(g, p) {
> > @@ -62,10 +62,15 @@ static int try_to_freeze_tasks(bool sig_only)
> >  				todo++;
> >  		} while_each_thread(g, p);
> >  		read_unlock(&tasklist_lock);
> > -		yield();			/* Yield is okay here */
> > -		if (time_after(jiffies, end_time))
> > +		if (!todo || time_after(jiffies, end_time))
> >  			break;
> > -	} while (todo);
> > +
> > +		/*
> > +		 * We need to retry.  There's no reason to be
> > +		 * over-anxious about it and waste power.
> > +		 */

The comment above looks like it's only meaningful in the context of the patch.
After it's been applied the meaning of the comment won't be so obvious, I'm
afraid.

> > +		msleep(10);
> > +	}
> >  
> >  	do_gettimeofday(&end);
> >  	elapsed_csecs64 = timeval_to_ns(&end) - timeval_to_ns(&start);

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue
  2009-10-01  8:09 ` [PATCH 19/19] workqueue: implement concurrency managed workqueue Tejun Heo
  2009-10-01 14:49   ` Andrew Morton
  2009-10-01 17:12   ` Linus Torvalds
@ 2009-10-02  0:42   ` Andi Kleen
  2009-10-02 12:09     ` Tejun Heo
  2009-10-02 14:28   ` Frédéric Weisbecker
  3 siblings, 1 reply; 90+ messages in thread
From: Andi Kleen @ 2009-10-02  0:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, ar

Tejun Heo <tj@kernel.org> writes:

> Currently each workqueue has its own dedicated worker pool.  This
> causes the following problems.

I definitely like the basic idea (without having read all the code so
far).  Thanks for doing that work. On large systems the number of
kernel threads around can be mind boggling, and it will also get rid
of bizarreness like the ps2mouse thread.

I wonder have you ever thought about exposing this to user space too?
It can face similar problems for parallelization.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 01/19] freezer: don't get over-anxious while waiting
  2009-10-01 21:04     ` Rafael J. Wysocki
@ 2009-10-02 10:56         ` Tejun Heo
  2009-10-02 19:47       ` Oren Laadan
  2009-10-02 19:47       ` Oren Laadan
  2 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-02 10:56 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pavel Machek, jeff, mingo, linux-kernel, akpm, jens.axboe, rusty,
	cl, dhowells, arjan, pm list, Oren Laadan

Hello, Pavel, Rafael.

Rafael J. Wysocki wrote:
> On Thursday 01 October 2009, Pavel Machek wrote:
>>> Freezing isn't exactly the most latency sensitive operation and
>>> there's no reason to burn cpu cycles and power waiting for it to
>>> complete.  msleep(10) instead of yield().  This should improve
>>> reliability of emergency hibernation.
>> i don't see how it improves reliability, but its probably ok.

It's about battery.  When emergency hibernation kicks in and something
is taking a while to freeze (usually nfs does this for me) burning
power waiting for it to finish is a pretty bad idea.

>> Well... for hibernation anyway. I can imagine cgroup users where
>> freeze is so fast that this matters. rjw cc-ed.		pavel
> 
> Thanks.  I'd like to hear from the cgroup freezer people about that.

Oh... didn't know that.  10ms sleeps really matter there?

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01 12:53 ` David Howells
@ 2009-10-02 11:44   ` Tejun Heo
  2009-10-02 12:45     ` Stefan Richter
  2009-10-02 15:38   ` David Howells
  1 sibling, 1 reply; 90+ messages in thread
From: Tejun Heo @ 2009-10-02 11:44 UTC (permalink / raw)
  To: David Howells
  Cc: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, arjan

Hello, David Howells.

David Howells wrote:
> Sounds interesting as a replacement for slow-work.  Some thoughts for you:
> 
> The most important features of slow-work are:
> 
>  (1) Work items are not re-entered whilst they are executing.
> 
>  (2) The slow-work facility keeps references on its work items by asking the
>      client to get and put on the client's refcount.
> 
>  (3) The slow-work facility can create a lot more threads than the number of
>      CPUs on a system, and the system doesn't grind to a halt if they're all
>      taken up with long term I/O (performing several mkdirs for example).
> 
> I think you have (1) and (3) covered, but I'm unsure about (2).

Given that slow-work isn't being used too extensively yet, I was
thinking whether that part could be pushed down to the caller.  Or, we
can also wrap work and export an interface which supports the get/put
reference.
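
As a very rough sketch of the latter (all names below are made up,
nothing from the current patchset), the wrapper would carry the get/put
ops and pin a reference across queueing and execution:

  struct refcounted_work {
          struct work_struct      work;   /* INIT_WORK()ed with
                                           * refcounted_work_fn by the user */
          void (*get)(struct refcounted_work *rwork);
          void (*put)(struct refcounted_work *rwork);
          void (*func)(struct refcounted_work *rwork);
  };

  static void refcounted_work_fn(struct work_struct *work)
  {
          struct refcounted_work *rwork =
                  container_of(work, struct refcounted_work, work);

          rwork->func(rwork);
          rwork->put(rwork);      /* drop the ref taken at queueing time */
  }

  static int queue_refcounted_work(struct workqueue_struct *wq,
                                   struct refcounted_work *rwork)
  {
          rwork->get(rwork);
          if (queue_work(wq, &rwork->work))
                  return 1;
          rwork->put(rwork);      /* was already pending, undo the extra ref */
          return 0;
  }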

> Also, does it actually make sense to bind threads that are
> performing long-term I/O to particular CPUs?  Threads that are going
> to spend a lot more time sleeping on disk I/O than actually running
> on a CPU?

Binding is usually beneficial and doesn't matter for IO intensive
ones, so...

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available
  2009-10-01 18:45   ` Ben Pfaff
@ 2009-10-02 11:56     ` Tejun Heo
  0 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-02 11:56 UTC (permalink / raw)
  To: blp
  Cc: linux-kernel, Jeff Garzik, Alan Cox, David Howells, Ingo Molnar,
	Andrew Morton, Jens Axboe, Rusty Russell, Christoph Lameter,
	Arjan van de Ven

(Restoring cc list.  Please don't drop them)

Ben Pfaff wrote:
> David Howells <dhowells@redhat.com> writes:
> 
>> Tejun Heo <tj@kernel.org> wrote:
>>
>>> +enum {
>>> +	WORK_STRUCT_PENDING	= 0,	/* work item is pending execution */
>>> +
>>> +	/*
>>> +	 * Reserve 3bits off of cwq pointer.  This is enough and
>>> +	 * provides acceptable alignment on both 32 and 64bit
>>> +	 * machines.
>>> +	 */
>>> +	WORK_STRUCT_FLAG_BITS	= 3,
>>> +
>>> +	WORK_STRUCT_FLAG_MASK	= (1UL << WORK_STRUCT_FLAG_BITS) - 1,
>>> +	WORK_STRUCT_WQ_DATA_MASK = ~WORK_STRUCT_FLAG_MASK,
>>> +};
>> There's some great enum abuse going on here:-)
> 
> The "1UL" part is a bit worrisome.  Enumeration constants always
> have type "int"[*], so if code that uses WORK_STRUCT_WQ_DATA_MASK
> actually depends on getting a full "long" worth of bits, it is
> not going to work on 64-bit systems.
> 
> [*] See C99:
> 
>      6.4.4.3 Enumeration constants
>      Syntax
> 1              enumeration-constant:
>                       identifier
>      Semantics
> 2    An identifier declared as an enumeration constant has type int.

Aieee... oops.  Well, this isn't how gcc behaves.

  $ cat test.c
  #include <stdio.h>

  enum {
	  ENUM		= ~0U,
  };

  enum {
	  LONG_ENUM	= ~0UL,
  };

  int main(void)
  {
	  printf("%zu %zu\n", sizeof(ENUM), sizeof(LONG_ENUM));
	  return 0;
  }
  $ gcc test.c && ./a.out; gcc -std=c99 test.c && ./a.out
  4 8
  4 8

But, yeah, this definitely is a problem.  6.7.2.2 also says that

 Each enumerated type shall be compatible with char, a signed integer
 type, or an unsigned integer type. The choice of type is
 implementation-defined,113) but shall be capable of representing the
 values of all the members of the enumeration. The enumerated type is
 incomplete until after the } that terminates the list of enumerator
 declarations.

gcc probably is being a bit too generous with possible integer types
here.  BTW, does c++ define it like this too?

Hmmm... So, should we go back to using defines for these or keep
(ab)using gcc's generosity and maybe hope that the next iteration of the
standard becomes a bit more generous too?
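
FWIW, the #define variant would be the below and keeps the intended
unsigned long type no matter how the compiler sizes the enum (same
values as the enum quoted above):

  #define WORK_STRUCT_PENDING      0     /* work item is pending execution */
  #define WORK_STRUCT_FLAG_BITS    3
  #define WORK_STRUCT_FLAG_MASK    ((1UL << WORK_STRUCT_FLAG_BITS) - 1)
  #define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)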

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue
  2009-10-01 13:15 ` [PATCH 19/19] workqueue: implement concurrency managed workqueue David Howells
@ 2009-10-02 12:03   ` Tejun Heo
  0 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-02 12:03 UTC (permalink / raw)
  To: David Howells
  Cc: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, arjan

David Howells wrote:
> Tejun Heo <tj@kernel.org> wrote:
> 
>> gcwq always keeps at least single idle worker around.  When a new
>> worker is necessary and the worker is the last idle one, the worker
>> assumes the role of "manager" and manages the worker pool -
>> ie. creates another worker.  Forward-progress is guaranteed by having
>> dedicated rescue workers for workqueues which may be necessary while
>> creating a new worker.  When the manager is having problem creating a
>> new worker, mayday timer activates and rescue workers are summoned to
>> the cpu and execute works which may be necessary to create new
>> workers.
> 
> I take it that means that the rescue-workers are donated to the pool
> to become worker threads (perhaps on a temporary basis) in the event
> that forking fails due to ENOMEM, such that resources can be freed
> up for fork() to use.

Yes, when kthread_create() doesn't return fast enough
(MAYDAY_INITIAL_TIMEOUT, currently 10ms), rescue workers are summoned
to the cpu.  They scan the pending works and process only the works
which are for the workqueue the rescuer is bound to, which should solve
the possible allocation deadlock.  Currently, all workqueues are
created with a rescuer by default but that will be changed as most of
them don't need it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue
  2009-10-02  0:42   ` Andi Kleen
@ 2009-10-02 12:09     ` Tejun Heo
  2009-10-03  2:59       ` Andi Kleen
  0 siblings, 1 reply; 90+ messages in thread
From: Tejun Heo @ 2009-10-02 12:09 UTC (permalink / raw)
  To: Andi Kleen
  Cc: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, ar

Hello, Andi.

Andi Kleen wrote:
> Tejun Heo <tj@kernel.org> writes:
> 
>> Currently each workqueue has its own dedicated worker pool.  This
>> causes the following problems.
> 
> I definitely like the basic idea (without having read all the code so
> far).  Thanks for doing that work. On large systems the number of
> kernel threads around can be mind boggling, and it will also get rid
> of bizarreness like the ps2mouse thread.
> 
> I wonder have you ever thought about exposing this to user space too?
> It can face similar problems for parallelization.

Yeah, a similar mechanism could be useful for managing userland worker
pools too, but determining what to export would be the difficult
question.  Notification when all threads of a certain group are blocked
could be useful, and so could cpu idleness.

One thing is that for userspace there's already a pretty thick layer of
abstraction in place, and having this type of bare-metal mechanism
might not buy much.  In most cases, monitoring system parameters
periodically and supplying a bit of extra threads should be able to
achieve about the same result.  Do you have anything specific on your
mind?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 03/19] scheduler: implement workqueue scheduler class
  2009-10-01 16:57   ` Linus Torvalds
  2009-10-01 18:48     ` Ingo Molnar
@ 2009-10-02 12:23     ` Tejun Heo
  1 sibling, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-02 12:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan

Hello, Linus.

Linus Torvalds wrote:
> On Thu, 1 Oct 2009, Tejun Heo wrote:
>> Implement workqueue scheduler class.  Workqueue sched_class inherits
>> fair sched_class and behaves exactly the same as sched_class except
>> that it has two callback functions which get called when a task is put
>> to sleep and wakes up and doesn't allow switching to different
>> scheduler class.
> 
> So this looks odd to me.
> 
> I agree completely with the callback functions, but what I don't agree 
> with is that this is somehow workqueue-related. I bet that others could 
> use this, and in fact, I suspect that it should not be tied to the 
> scheduler class at all, but to the _thread_.
> 
> Just as an example, I could imagine that we would do lock_kernel() 
> releasing the kernel lock on scheduling (and re-taking it on wakeup) as a 
> callback. Or async IO handling - both done independently of any 
> "workqueue" logic.
> 
> So tying this to the scheduler class seems a bit odd. But maybe I'm 
> missing something?

The only reason the callbacks are implemented by inheriting
sched_class is that I thought it was a pretty special case, so I erred
slightly on the hackish side while keeping the added hotpath overhead
minimal.  If this type of notification mechanism is something
which can be used more widely, it would definitely be better to have
proper per-task callback mechanism instead.  Probably per-task chain
of notifier ops so that only tasks which would be interested in such
events can opt in.

Avi, such per-task notification ops should be useable for kvm too,
right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-02 11:44   ` Tejun Heo
@ 2009-10-02 12:45     ` Stefan Richter
  0 siblings, 0 replies; 90+ messages in thread
From: Stefan Richter @ 2009-10-02 12:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Howells, jeff, mingo, linux-kernel, akpm, jens.axboe,
	rusty, cl, arjan

Tejun Heo wrote:
> David Howells wrote:
>> Sounds interesting as a replacement for slow-work.  Some thoughts for you:
>> 
>> The most important features of slow-work are:
>> 
>>  (1) Work items are not re-entered whilst they are executing.
>> 
>>  (2) The slow-work facility keeps references on its work items by asking the
>>      client to get and put on the client's refcount.
>> 
>>  (3) The slow-work facility can create a lot more threads than the number of
>>      CPUs on a system, and the system doesn't grind to a halt if they're all
>>      taken up with long term I/O (performing several mkdirs for example).
>> 
>> I think you have (1) and (3) covered, but I'm unsure about (2).
> 
> Given that slow-work isn't being used too extensively yet, I was
> thinking whether that part could be pushed down to the caller.  Or, we
> can also wrap work and export an interface which supports the get/put
> reference.

BTW, not only slow-work users need reference counting, many existing
regular workqueue users could use it as well.  (Well, I guess I stated
the obvious.)  Currently we have local wrappers for that like these ones:
http://lxr.linux.no/#linux+v2.6.31/drivers/firewire/core-card.c#L212
http://lxr.linux.no/#linux+v2.6.31/drivers/firewire/sbp2.c#L838
-- 
Stefan Richter
-=====-==--= =-=- ---=-
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue
  2009-10-01  8:09 ` [PATCH 19/19] workqueue: implement concurrency managed workqueue Tejun Heo
                     ` (2 preceding siblings ...)
  2009-10-02  0:42   ` Andi Kleen
@ 2009-10-02 14:28   ` Frédéric Weisbecker
  3 siblings, 0 replies; 90+ messages in thread
From: Frédéric Weisbecker @ 2009-10-02 14:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan

2009/10/1 Tejun Heo <tj@kernel.org>:
> Currently each workqueue has its own dedicated worker pool.  This
> causes the following problems.
>
> * Works which are dependent on each other can cause a deadlock by
>  depending on the same execution resource.  This is bad because this
>  type of dependency is quite difficult to find.
>
> * Works which may sleep and take a long time to finish need to have
>  separate workqueues so that they don't block other works.  Similarly,
>  works which want to be executed in a timely manner often need to
>  create their own custom workqueue too to avoid being blocked by long
>  running ones.  This leads to a large number of workqueues and thus
>  many workers.
>
> * The static one-per-cpu worker isn't good enough for jobs which
>  require a higher level of concurrency, necessitating other worker pool
>  mechanism.  slow-work and async are good examples and there are also
>  some custom implementations buried in subsystems.
>
> * Combined, the above factors lead to many workqueues with large
>  number of dedicated and mostly unused workers.  This also makes work
>  processing less optimal as the dedicated workers end up switching
>  among themselves costing scheduling overhead and wasting cache
>  footprint for their stacks and as the system gets busy, these
>  workers end up competing with each other.
>
> To solve the above issues, this patch implements concurrency-managed
> workqueue.
>
> There is single global cpu workqueue (gcwq) for each cpu which serves
> all the workqueues.  gcwq maintains single pool of workers which is
> shared by all cwqs on the cpu.
>
> gcwq keeps the number of concurrent active workers to minimum but no
> less.  As long as there's one or more running workers on the cpu, no
> new worker is scheduled so that works can be processed in batch as
> much as possible but when the last running worker blocks, gcwq
> immediately schedules new worker so that the cpu doesn't sit idle
> while there are works to be processed.



That's really a cool thing.
So once such new workers are created, what's the state/event that triggers their
destruction?

Is it the following, propagated recursively?

    Worker A blocks.
    B is created.
    B has just finished a worklet and A has been woken up
    Then destroy B

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01 12:53 ` David Howells
  2009-10-02 11:44   ` Tejun Heo
@ 2009-10-02 15:38   ` David Howells
  2009-10-03  5:07     ` Tejun Heo
  1 sibling, 1 reply; 90+ messages in thread
From: David Howells @ 2009-10-02 15:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: dhowells, jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, arjan

Tejun Heo <tj@kernel.org> wrote:

> Given that slow-work isn't being used too extensively yet, I was
> thinking whether that part could be pushed down to the caller.  Or, we
> can also wrap work and export an interface which supports the get/put
> reference.

The caller of what?

I found the refcounting much easier to manage in slow-work when slow-work
actively got/put refs on the work items it was queueing.  The reason for that
is that slow-work can handle the queue/queue races and the requeue/execute
races much more efficiently.

Part of this was due to the fact I wanted to prevent re-entry into the work
executor, and to do that I had maintenance flags in the work item struct - but
that meant that slow-work had to modify the work item after execution.

So I should adjust point 1 on my list.

  (1) Work items can be requeued whilst they are executing, but the execution
      function will not be re-entered whilst the current execution is in
      progress; rather, the new execution will be deferred until it completes.

One possible problem with assuming that you can no longer access the work item
after you call the execution function, is that it's slightly dodgy to retain
the pointer to it to prevent reentry as the item can be destroyed, reallocated
and queued before the execution function returns.

Anyway, don't let me put you off re-implementing the whole shebang - it needs
doing.

> Binding is usually beneficial and doesn't matter for IO intensive
> ones, so...

The scenario I'm thinking of is this: someone who has an NFS volume cached
through FS-Cache does a tar of a large tree of files (say a kernel source
tree).  FS-Cache adds a long duration work item for each of those files
(~32000) to create structure in the cache.  Will all of those wind up bound to
the same CPU as was running tar?

David

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 01/19] freezer: don't get over-anxious while waiting
  2009-10-01 21:04     ` Rafael J. Wysocki
  2009-10-02 10:56         ` Tejun Heo
  2009-10-02 19:47       ` Oren Laadan
@ 2009-10-02 19:47       ` Oren Laadan
  2009-10-02 21:04         ` Matt Helsley
  2009-10-02 21:04         ` Matt Helsley
  2 siblings, 2 replies; 90+ messages in thread
From: Oren Laadan @ 2009-10-02 19:47 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pavel Machek, Tejun Heo, jeff, mingo, linux-kernel, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, pm list, Matt Helsley



Rafael J. Wysocki wrote:
> On Thursday 01 October 2009, Pavel Machek wrote:
>>> Freezing isn't exactly the most latency sensitive operation and
>>> there's no reason to burn cpu cycles and power waiting for it to
>>> complete.  msleep(10) instead of yield().  This should improve
>>> reliability of emergency hibernation.
>> i don't see how it improves reliability, but its probably ok.
>>
>> Well... for hibernation anyway. I can imagine cgroup users where
>> freeze is so fast that this matters. rjw cc-ed.		pavel
> 
> Thanks.  I'd like to hear from the cgroup freezer people about that.
> 

[Adding Matt Helsley to the CC list]

To checkpoint or migrate an application, the cgroup to which it belongs
must be frozen first.

It's a bit down the road, but if one seeks minimum application downtime
during application checkpoint and/or migration, then a (minimum of)
10ms - or multiples of it - may result in a visible/undesired hick-up.

Perhaps avoid it when freezing a cgroup ?  or maybe a way for the user
to control this behavior per cgroup ?

Oren.

>>> Signed-off-by: Tejun Heo <tj@kernel.org>
>>> ---
>>>  kernel/power/process.c |   13 +++++++++----
>>>  1 files changed, 9 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/kernel/power/process.c b/kernel/power/process.c
>>> index cc2e553..9d26a0a 100644
>>> --- a/kernel/power/process.c
>>> +++ b/kernel/power/process.c
>>> @@ -41,7 +41,7 @@ static int try_to_freeze_tasks(bool sig_only)
>>>  	do_gettimeofday(&start);
>>>  
>>>  	end_time = jiffies + TIMEOUT;
>>> -	do {
>>> +	while (true) {
>>>  		todo = 0;
>>>  		read_lock(&tasklist_lock);
>>>  		do_each_thread(g, p) {
>>> @@ -62,10 +62,15 @@ static int try_to_freeze_tasks(bool sig_only)
>>>  				todo++;
>>>  		} while_each_thread(g, p);
>>>  		read_unlock(&tasklist_lock);
>>> -		yield();			/* Yield is okay here */
>>> -		if (time_after(jiffies, end_time))
>>> +		if (!todo || time_after(jiffies, end_time))
>>>  			break;
>>> -	} while (todo);
>>> +
>>> +		/*
>>> +		 * We need to retry.  There's no reason to be
>>> +		 * over-anxious about it and waste power.
>>> +		 */
> 
> The comment above looks like it's only meaningful in the context of the patch.
> After it's been applied the meaning of the comment won't be so obvious, I'm
> afraid.
> 
>>> +		msleep(10);
>>> +	}
>>>  
>>>  	do_gettimeofday(&end);
>>>  	elapsed_csecs64 = timeval_to_ns(&end) - timeval_to_ns(&start);
> 
> Thanks,
> Rafael

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 01/19] freezer: don't get over-anxious while waiting
  2009-10-02 19:47       ` Oren Laadan
@ 2009-10-02 21:04         ` Matt Helsley
  2009-10-02 21:21           ` Rafael J. Wysocki
  2009-10-02 21:21           ` Rafael J. Wysocki
  2009-10-02 21:04         ` Matt Helsley
  1 sibling, 2 replies; 90+ messages in thread
From: Matt Helsley @ 2009-10-02 21:04 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Rafael J. Wysocki, Pavel Machek, Tejun Heo, jeff, mingo,
	linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan,
	pm list, Matt Helsley

On Fri, Oct 02, 2009 at 03:47:14PM -0400, Oren Laadan wrote:
> 
> 
> Rafael J. Wysocki wrote:
> > On Thursday 01 October 2009, Pavel Machek wrote:
> >>> Freezing isn't exactly the most latency sensitive operation and
> >>> there's no reason to burn cpu cycles and power waiting for it to
> >>> complete.  msleep(10) instead of yield().  This should improve
> >>> reliability of emergency hibernation.
> >> i don't see how it improves reliability, but its probably ok.

From what little of the patch I can see at this point I agree.
On a single cpu system the yield gives up the cpu so other tasks
are more likely to make the progress necessary to become freezable.

> >>
> >> Well... for hibernation anyway. I can imagine cgroup users where
> >> freeze is so fast that this matters. rjw cc-ed.		pavel

It doesn't (more below), though I appreciate your keeping us in mind.

> > 
> > Thanks.  I'd like to hear from the cgroup freezer people about that.
> > 
> 
> [Adding Matt Helsley to the CC list]
> 
> To checkpoint or migrate an application, the cgroup to which it belongs
> must be frozen first.
> 
> It's a bit down the road, but if one seeks minimum application downtime
> during application checkpoint and/or migration, then a (minimum of)
> 10ms - or multiples of it - may result in a visible/undesired hick-up.
> 
> Perhaps avoid it when freezing a cgroup ?  or maybe a way for the user
> to control this behavior per cgroup ?

This is already the case.

The cgroup freezer does not use this yield-loop to iterate over all the tasks.
Instead of yield() the cgroup freezer has its own "loop". It changes its
own state to FREEZING and returns to userspace so that userspace can decide
what to do -- sleep? keep trying to freeze? go back to THAWED? etc. 

[ In the future this may change depending on the blocking/non-blocking
flag of the open freezer.state cgroup file handle. ]
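
Roughly, the userspace side of that loop can look like the following
sketch (the /cgroup/job path and the retry policy are made up for
illustration; it assumes the freezer subsystem is mounted there):

  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  /* write FROZEN and re-check freezer.state until everything is frozen */
  static int freeze_cgroup(const char *state_path)
  {
          char state[32];
          int i;

          for (i = 0; i < 100; i++) {
                  int fd = open(state_path, O_RDWR);
                  ssize_t n;

                  if (fd < 0)
                          return -1;
                  if (write(fd, "FROZEN", 6) < 0) {
                          close(fd);
                          return -1;
                  }
                  lseek(fd, 0, SEEK_SET);
                  n = read(fd, state, sizeof(state) - 1);
                  close(fd);
                  state[n > 0 ? n : 0] = '\0';
                  if (!strncmp(state, "FROZEN", 6))
                          return 0;       /* whole cgroup is frozen */
                  usleep(1000);           /* userspace picks the nap length */
          }
          return -1;      /* still FREEZING - keep trying or write THAWED */
  }

  /* usage: freeze_cgroup("/cgroup/job/freezer.state"); */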

Cheers,
	-Matt Helsley

> 
> Oren.
> 
> >>> Signed-off-by: Tejun Heo <tj@kernel.org>
> >>> ---
> >>>  kernel/power/process.c |   13 +++++++++----
> >>>  1 files changed, 9 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/kernel/power/process.c b/kernel/power/process.c
> >>> index cc2e553..9d26a0a 100644
> >>> --- a/kernel/power/process.c
> >>> +++ b/kernel/power/process.c
> >>> @@ -41,7 +41,7 @@ static int try_to_freeze_tasks(bool sig_only)
> >>>  	do_gettimeofday(&start);
> >>>  
> >>>  	end_time = jiffies + TIMEOUT;
> >>> -	do {
> >>> +	while (true) {
> >>>  		todo = 0;
> >>>  		read_lock(&tasklist_lock);
> >>>  		do_each_thread(g, p) {
> >>> @@ -62,10 +62,15 @@ static int try_to_freeze_tasks(bool sig_only)
> >>>  				todo++;
> >>>  		} while_each_thread(g, p);
> >>>  		read_unlock(&tasklist_lock);
> >>> -		yield();			/* Yield is okay here */
> >>> -		if (time_after(jiffies, end_time))
> >>> +		if (!todo || time_after(jiffies, end_time))
> >>>  			break;
> >>> -	} while (todo);
> >>> +
> >>> +		/*
> >>> +		 * We need to retry.  There's no reason to be
> >>> +		 * over-anxious about it and waste power.
> >>> +		 */
> > 
> > The comment above looks like it's only meaningful in the context of the patch.
> > After it's been applied the meaning of the comment won't be so obvious, I'm
> > afraid.
> > 
> >>> +		msleep(10);
> >>> +	}
> >>>  
> >>>  	do_gettimeofday(&end);
> >>>  	elapsed_csecs64 = timeval_to_ns(&end) - timeval_to_ns(&start);
> > 
> > Thanks,
> > Rafael

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 01/19] freezer: don't get over-anxious while waiting
  2009-10-02 21:04         ` Matt Helsley
  2009-10-02 21:21           ` Rafael J. Wysocki
@ 2009-10-02 21:21           ` Rafael J. Wysocki
  2009-10-03  0:43             ` Tejun Heo
  2009-10-03  0:43             ` Tejun Heo
  1 sibling, 2 replies; 90+ messages in thread
From: Rafael J. Wysocki @ 2009-10-02 21:21 UTC (permalink / raw)
  To: Matt Helsley, Tejun Heo
  Cc: Oren Laadan, Pavel Machek, jeff, mingo, linux-kernel, akpm,
	jens.axboe, rusty, cl, dhowells, arjan, pm list

On Friday 02 October 2009, Matt Helsley wrote:
> On Fri, Oct 02, 2009 at 03:47:14PM -0400, Oren Laadan wrote:
> > 
> > 
> > Rafael J. Wysocki wrote:
> > > On Thursday 01 October 2009, Pavel Machek wrote:
> > >>> Freezing isn't exactly the most latency sensitive operation and
> > >>> there's no reason to burn cpu cycles and power waiting for it to
> > >>> complete.  msleep(10) instead of yield().  This should improve
> > >>> reliability of emergency hibernation.
> > >> i don't see how it improves reliability, but its probably ok.
> 
> From what little of the patch I can see at this point I agree. 
> On a single cpu system the yield gives up the cpu so other tasks
> are more likely to make the progress necessary to become freezable.
> 
> > >>
> > >> Well... for hibernation anyway. I can imagine cgroup users where
> > >> freeze is so fast that this matters. rjw cc-ed.		pavel
> 
> It doesn't (more below), though I appreciate your keeping us in mind.
> 
> > > 
> > > Thanks.  I'd like to hear from the cgroup freezer people about that.
> > > 
> > 
> > [Adding Matt Helsley to the CC list]
> > 
> > To checkpoint or migrate an application, the cgroup to which it belongs
> > must be frozen first.
> > 
> > It's a bit down the road, but if one seeks minimum application downtime
> > during application checkpoint and/or migration, then a (minimum of)
> > 10ms - or multiples of it - may result in a visible/undesired hick-up.
> > 
> > Perhaps avoid it when freezing a cgroup ?  or maybe a way for the user
> > to control this behavior per cgroup ?
> 
> This is already the case.
> 
> The cgroup freezer does not use this yield-loop to iterate over all the tasks.
> Instead of yield() the cgroup freezer has its own "loop". It changes its
> own state to FREEZING and returns to userspace so that userspace can decide
> what to do -- sleep? keep trying to freeze? go back to THAWED? etc. 
> 
> [ In the future this may change depending on the blocking/non-blocking
> flag of the open freezer.state cgroup file handle. ]

OK, thanks for the info.

Tejun, do you want me to take the patch or is it more convenient to you to
push it yourself?

Also, care to respond to my previous comment?

Rafael

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 01/19] freezer: don't get over-anxious while waiting
  2009-10-02 21:21           ` Rafael J. Wysocki
  2009-10-03  0:43             ` Tejun Heo
@ 2009-10-03  0:43             ` Tejun Heo
  2009-10-03 19:36                 ` Rafael J. Wysocki
  1 sibling, 1 reply; 90+ messages in thread
From: Tejun Heo @ 2009-10-03  0:43 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Matt Helsley, Oren Laadan, Pavel Machek, jeff, mingo,
	linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan,
	pm list

Hello, Rafael.

Rafael J. Wysocki wrote:
> Tejun, do you want me to take the patch or is it more convenient to you to
> push it yourself?

Please take the patch.  As long as it's in a stable git tree, it
should be fine for me.

> Also, care to respond to my previous comment?

About the comment?  Dropping it is fine.  Do you want me to resend the
patch w/o the comment?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue
  2009-10-02 12:09     ` Tejun Heo
@ 2009-10-03  2:59       ` Andi Kleen
  0 siblings, 0 replies; 90+ messages in thread
From: Andi Kleen @ 2009-10-03  2:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andi Kleen, jeff, mingo, linux-kernel, akpm, jens.axboe, rusty,
	cl, dhowells, ar

> One thing is that for userspace there's already pretty thick layer of
> abstraction in place and having this type of bare metal mechanism
> might not buy much.  In most cases, monitoring system parameters

That can be costly in terms of power/cpu cycles.

> periodically and supplying a bit of extra threads should be able to
> achieve about the same result.  Do you have anything specific on your
> mind?

I was thinking of a special file descriptor that could be polled on.
Register threads to be monitored and get a notification back this
way when they are busy above some threshold. Then a runtime could
start more threads upto the maximum number of cores. This would
be mainly useful with existing work distributing libraries 
like threadbuildingblocks.
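
Nothing like the below exists - it's only a sketch of the kind of
interface I mean, with invented pool_fd_*() calls:

  #include <poll.h>
  #include <sys/types.h>

  /* invented interface: a pollable fd which becomes readable when a cpu
   * is available and the registered threads are busy/blocked beyond the
   * given threshold */
  int pool_fd_open(void);
  int pool_fd_add_thread(int fd, pid_t tid, unsigned int busy_threshold_pct);

  void spawn_worker(void);        /* the runtime's own: start another thread */

  static void runtime_manager(int fd, int max_threads)
  {
          struct pollfd pfd = { .fd = fd, .events = POLLIN };
          int nr_threads = 1;

          while (poll(&pfd, 1, -1) > 0) {
                  if ((pfd.revents & POLLIN) && nr_threads < max_threads) {
                          spawn_worker();
                          nr_threads++;
                  }
          }
  }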

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-02 15:38   ` David Howells
@ 2009-10-03  5:07     ` Tejun Heo
  0 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-03  5:07 UTC (permalink / raw)
  To: David Howells
  Cc: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, arjan

David Howells wrote:
> Tejun Heo <tj@kernel.org> wrote:
> 
>> Given that slow-work isn't being used too extensively yet, I was
>> thinking whether that part could be pushed down to the caller.  Or, we
>> can also wrap work and export an interface which supports the get/put
>> reference.
> 
> The caller of what?

The user of the API.  It can be implemented there too, right?

> I found the refcounting much easier to manage in slow-work when slow-work
> actively got/put refs on the work items it was queueing.  The reason for that
> is that slow-work can handle the queue/queue races and the requeue/execute
> races much more efficiently.
> 
> Part of this was due to the fact I wanted to prevent re-entry into the work
> executor, and to do that I had maintenance flags in the work item struct - but
> that meant that slow-work had to modify the work item after execution.
> 
> So I should adjust point 1 on my list.
> 
>   (1) Work items can be requeued whilst they are executing, but the execution
>       function will not be re-entered until after the current execution
>       completes, but rather the execution will be deferred.

This is already guaranteed on a single cpu, so unless a work ends up
being scheduled on a different cpu, it will be okay.  This actually is
about the same problem as how to support singlethread workqueue.  I'm
not entirely sure how to choose the cpu for such works yet.

> One possible problem with assuming that you can no longer access the work item
> after you call the execution function, is that it's slightly dodgy to retain
> the pointer to it to prevent reentry as the item can be destroyed, reallocated
> and queued before the execution function returns.

All of the above is already implemented to avoid running the same work
in parallel on the same cpu and to handle flushing.

>> Binding is usually beneficial and doesn't matter for IO intensive
>> ones, so...
> 
> The scenario I'm thinking of is this: someone who has an NFS volume cached
> through FS-Cache does a tar of a large tree of files (say a kernel source
> tree).  FS-Cache adds a long duration work item for each of those files
> (~32000) to create structure in the cache.  Will all of those wind up bound to
> the same CPU as was running tar?

Yeap, something to think about.  I considered adding a cpu workqueue
which isn't bound to any cpu to serve that type of workload but it
seemed too complex for the problem.  Maybe simple round robin with
per-cpu throttling should do the trick?
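
Something as dumb as the sketch below, perhaps (illustration only, not
code from the patchset; real throttling would also skip cpus which
already have too many unbound works in flight):

  /* pick the next online cpu in round-robin order for unbound works */
  static int pick_unbound_cpu(void)
  {
          static int last_cpu = -1;       /* no locking, illustration only */
          int cpu;

          cpu = cpumask_next(last_cpu, cpu_online_mask);
          if (cpu >= nr_cpu_ids)
                  cpu = cpumask_first(cpu_online_mask);
          last_cpu = cpu;
          return cpu;
  }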

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 01/19] freezer: don't get over-anxious while waiting
  2009-10-03  0:43             ` Tejun Heo
@ 2009-10-03 19:36                 ` Rafael J. Wysocki
  0 siblings, 0 replies; 90+ messages in thread
From: Rafael J. Wysocki @ 2009-10-03 19:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Matt Helsley, Oren Laadan, Pavel Machek, jeff, mingo,
	linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan,
	pm list

On Saturday 03 October 2009, Tejun Heo wrote:
> Hello, Rafael.
> 
> Rafael J. Wysocki wrote:
> > Tejun, do you want me to take the patch or is it more convenient to you to
> > push it yourself?
> 
> Please take the patch.  As long as it's in a stable git tree, it
> should be fine for me.

OK, I'll put it into suspend-2.6/for-linus, but first it'll stay in
suspend-2.6/linux-next for a few days.

> > Also, care to respond to my previous comment?
> 
> About the comment?  Dropping it is fine.  Do you want me to resend the
> patch w/o the comment?

No, thanks. 

Best,
Rafael

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [RFC PATCHSET] workqueue: implement concurrency managed workqueue
  2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
                   ` (25 preceding siblings ...)
  2009-10-01 13:15 ` [PATCH 19/19] workqueue: implement concurrency managed workqueue David Howells
@ 2009-10-04  8:41 ` Peter Zijlstra
  26 siblings, 0 replies; 90+ messages in thread
From: Peter Zijlstra @ 2009-10-04  8:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jeff, mingo, linux-kernel, akpm, jens.axboe, rusty, cl, dhowells, arjan

On Thu, 2009-10-01 at 17:08 +0900, Tejun Heo wrote:
> Hello, all.
> 
> This rather large patchset implements concurrency managed workqueue.
> It's not complete yet.  Singlethread workqueue handling needs more
> work and workqueue users need to be audited and simplified and async
> and slow-work should be reimplemented in terms of workqueue.  Although
> this patchset currently adds ~2000 lines of code, I'm fairly
> optimistic that after the whole conversion is done, it would be a net
> decrease in lines of code.
> 
> This patchset reimplements workqueue such that it auto-regulates
> concurrency and thus relieves its users from the managing duty.  It
> works by managing single shared pool of per-cpu workers and hooking
> into the scheduler to get notifications about workers going to sleep
> and waking up.  Using the mechanism, workqueue implementation keeps
> track of the current level of concurrency and schedules only the
> necessary number of workers to keep the cpu occupied.
> 
> Concurrency managed workqueue has the following benefits.
> 
> * Workqueue users no longer have to worry about managing concurrency
>   and, in most cases, deadlocks.  The workqueue will manage it
>   automatically and unless the deadlock chain involves many (currently
>   127) works, it won't happen.
> 
> * There's one single shared pool of workers per cpu and one rescuer
>   for each workqueue which requires it, so there are far fewer number
>   of kthreads.
> 
> * More efficient.  Although it adds considerable amount of code, the
>   code added to hot path isn't big and works will be executed on the
>   local cpu and in batch as much as possible using minimal number of
>   kthreads leading to fewer task switches and lower cache
>   footprint. <NEED SOME BACKING NUMBERS>
> 
> * As concurrency is no longer a problem, most types of asynchronous
>   jobs can be done using generic workqueue and other async mechanisms,
>   including slow-work, async and adhoc subsystem custom ones, can be
>   removed.  ie. It can serve as the unified async thread pool
>   mechanism.
> 
> Please read the patch description of the last patch for more details.
> 
> This patchset contains the following 19 patches and most of these are
> not signed off yet.

Like Linus, I dislike the sched_class bits (as in really hate them).

Also, from a quick look it seems this scheme does not allow priority
inheritance of worklets, like we used to do in -rt.



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 19/19] workqueue: implement concurrency managed workqueue
  2009-10-01 14:49   ` Andrew Morton
  2009-10-01 15:12     ` Ingo Molnar
  2009-10-01 16:34     ` Tejun Heo
@ 2009-10-04  8:49     ` Peter Zijlstra
  2 siblings, 0 replies; 90+ messages in thread
From: Peter Zijlstra @ 2009-10-04  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, jeff, mingo, linux-kernel, jens.axboe, rusty, cl,
	dhowells, arjan

On Thu, 2009-10-01 at 07:49 -0700, Andrew Morton wrote:
> This approach would appear to rule out the option of setting a work
> thread's state (scheduling policy, scheduling priority, uid, etc) to
> anything other than some default.
> 
> I guess that's unlikely to be a problem if we haven't yet had a need to
> do that, but I'd be a bit surprised to discover that nobody has done
> that sort of thing yet?  Nobody has niced up their workqueue threads?

-rt actually used to do PI on worklets, and that sure flipped the
scheduling class around a lot.
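
(As an aside, to make the concern concrete: under the old
one-thread-per-workqueue model a driver could nice up its own queue
from inside a work function, since every work queued there ran in the
queue's dedicated kthread.  A sketch, not code from any particular
driver:)

	#include <linux/sched.h>
	#include <linux/workqueue.h>

	/* sketch only: with create_singlethread_workqueue() every work
	 * queued on the queue runs in the same dedicated kthread, so
	 * renicing 'current' from a work function persistently renices
	 * that thread.  With a shared per-cpu worker pool this trick no
	 * longer works, which is the concern raised above. */
	static void renice_self_fn(struct work_struct *work)
	{
		set_user_nice(current, -10);
	}

	static DECLARE_WORK(renice_self_work, renice_self_fn);

	/* driver init would do something like:
	 *	wq = create_singlethread_workqueue("mydrv");
	 *	queue_work(wq, &renice_self_work);
	 */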

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 07/19] stop_machine: reimplement without using workqueue
  2009-10-01  8:09 ` [PATCH 07/19] stop_machine: reimplement without using workqueue Tejun Heo
@ 2009-10-06  9:36   ` Rusty Russell
  2009-10-06 23:42     ` Tejun Heo
  0 siblings, 1 reply; 90+ messages in thread
From: Rusty Russell @ 2009-10-06  9:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jeff, mingo, linux-kernel, akpm, jens.axboe, cl, dhowells, arjan

On Thu, 1 Oct 2009 05:39:06 pm Tejun Heo wrote:
> stop_machine() is the only user of RT workqueue.  Reimplement it using
> kthreads directly and rip RT support from workqueue.  This is in
> preparation of concurrency managed workqueue.

Seems reasonable, but:

>  void stop_machine_destroy(void)
>  {
> +	get_online_cpus();
>  	mutex_lock(&setup_lock);

Did you test hotplug cpu after this change?  Maybe this works fine, but it's
always the one we break when we play with this code :(

Thanks!
Rusty.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH 07/19] stop_machine: reimplement without using workqueue
  2009-10-06  9:36   ` Rusty Russell
@ 2009-10-06 23:42     ` Tejun Heo
  0 siblings, 0 replies; 90+ messages in thread
From: Tejun Heo @ 2009-10-06 23:42 UTC (permalink / raw)
  To: Rusty Russell
  Cc: jeff, mingo, linux-kernel, akpm, jens.axboe, cl, dhowells, arjan

Rusty Russell wrote:
> On Thu, 1 Oct 2009 05:39:06 pm Tejun Heo wrote:
>> stop_machine() is the only user of RT workqueue.  Reimplement it using
>> kthreads directly and rip RT support from workqueue.  This is in
>> preparation of concurrency managed workqueue.
> 
> Seems reasonable, but:
> 
>>  void stop_machine_destroy(void)
>>  {
>> +	get_online_cpus();
>>  	mutex_lock(&setup_lock);
> 
> Did you test hotplug cpu after this change?  Maybe this works fine, but it's
> always the one we break when we play with this code :(

Yeap, I played with it and it seems to work fine.
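
For reference, the resulting shape is roughly the following -- only the
lines in the quoted hunk above are from the actual patch; the rest of
the body, including the put_online_cpus() pairing, is elided/assumed
here:

	void stop_machine_destroy(void)
	{
		get_online_cpus();		/* exclude cpu hotplug */
		mutex_lock(&setup_lock);
		/* ... drop the setup refcount and stop the per-cpu
		 *     stop_machine threads ... */
		mutex_unlock(&setup_lock);
		put_online_cpus();		/* allow cpu hotplug again */
	}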

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 90+ messages in thread

end of thread, other threads:[~2009-10-06 23:43 UTC | newest]

Thread overview: 90+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-01  8:08 [RFC PATCHSET] workqueue: implement concurrency managed workqueue Tejun Heo
2009-10-01  8:09 ` [PATCH 01/19] freezer: don't get over-anxious while waiting Tejun Heo
2009-10-01 18:36   ` Pavel Machek
2009-10-01 21:04     ` Rafael J. Wysocki
2009-10-02 10:56       ` Tejun Heo
2009-10-02 10:56         ` Tejun Heo
2009-10-02 19:47       ` Oren Laadan
2009-10-02 19:47       ` Oren Laadan
2009-10-02 21:04         ` Matt Helsley
2009-10-02 21:21           ` Rafael J. Wysocki
2009-10-02 21:21           ` Rafael J. Wysocki
2009-10-03  0:43             ` Tejun Heo
2009-10-03  0:43             ` Tejun Heo
2009-10-03 19:36               ` Rafael J. Wysocki
2009-10-03 19:36                 ` Rafael J. Wysocki
2009-10-02 21:04         ` Matt Helsley
2009-10-01 21:04     ` Rafael J. Wysocki
2009-10-01  8:09 ` [PATCH 02/19] scheduler: implement sched_class_equal() Tejun Heo
2009-10-01  8:09 ` [PATCH 03/19] scheduler: implement workqueue scheduler class Tejun Heo
2009-10-01 16:57   ` Linus Torvalds
2009-10-01 18:48     ` Ingo Molnar
2009-10-01 19:00       ` Avi Kivity
2009-10-01 19:11         ` Linus Torvalds
2009-10-01 19:23           ` Ingo Molnar
2009-10-01 20:03             ` Linus Torvalds
2009-10-01 19:15         ` Ingo Molnar
2009-10-01 19:06       ` Linus Torvalds
2009-10-02 12:23     ` Tejun Heo
2009-10-01  8:09 ` [PATCH 04/19] scheduler: implement force_cpus_allowed_ptr() Tejun Heo
2009-10-01  8:09 ` [PATCH 05/19] kthread: implement kthread_data() Tejun Heo
2009-10-01  8:09 ` [PATCH 06/19] acpi: use queue_work_on() instead of binding workqueue worker to cpu0 Tejun Heo
2009-10-01  8:09 ` [PATCH 07/19] stop_machine: reimplement without using workqueue Tejun Heo
2009-10-06  9:36   ` Rusty Russell
2009-10-06 23:42     ` Tejun Heo
2009-10-01  8:09 ` [PATCH 08/19] workqueue: misc/cosmetic updates Tejun Heo
2009-10-01  8:09 ` [PATCH 09/19] workqueue: merge feature parametesr into flags Tejun Heo
2009-10-01  8:09 ` [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available Tejun Heo
2009-10-01  8:09 ` [PATCH 11/19] workqueue: define both bit position and mask for work flags Tejun Heo
2009-10-01  8:09 ` [PATCH 12/19] workqueue: separate out process_one_work() Tejun Heo
2009-10-01  8:09 ` [PATCH 13/19] workqueue: temporarily disable workqueue tracing Tejun Heo
2009-10-01  8:09 ` [PATCH 14/19] workqueue: (TEMPORARY) kill singlethread variant Tejun Heo
2009-10-01  8:09 ` [PATCH 15/19] workqueue: reimplement workqueue flushing using color coded works Tejun Heo
2009-10-01 17:03   ` Linus Torvalds
2009-10-01 17:11     ` Tejun Heo
2009-10-01 17:16       ` Tejun Heo
2009-10-01  8:09 ` [PATCH 16/19] workqueue: introduce worker Tejun Heo
2009-10-01  8:09 ` [PATCH 17/19] workqueue: reimplement work flushing using linked works Tejun Heo
2009-10-01  8:09 ` [PATCH 18/19] workqueue: reimplement workqueue freeze using cwq->frozen_works queue Tejun Heo
2009-10-01  8:09 ` [PATCH 19/19] workqueue: implement concurrency managed workqueue Tejun Heo
2009-10-01 14:49   ` Andrew Morton
2009-10-01 15:12     ` Ingo Molnar
2009-10-01 16:34     ` Tejun Heo
2009-10-04  8:49     ` Peter Zijlstra
2009-10-01 17:12   ` Linus Torvalds
2009-10-01 17:22     ` Tejun Heo
2009-10-01 17:55       ` Linus Torvalds
2009-10-02  0:42   ` Andi Kleen
2009-10-02 12:09     ` Tejun Heo
2009-10-03  2:59       ` Andi Kleen
2009-10-02 14:28   ` Frédéric Weisbecker
2009-10-01  8:24 ` [RFC PATCHSET] " Jens Axboe
2009-10-01 16:36   ` Tejun Heo
2009-10-01  8:24 ` Jens Axboe
2009-10-01 16:25   ` Tejun Heo
2009-10-01  8:40 ` Ingo Molnar
2009-10-01  8:47   ` Jens Axboe
2009-10-01  9:01     ` Ingo Molnar
2009-10-01  9:05       ` Jens Axboe
2009-10-01  9:11   ` Avi Kivity
2009-10-01  9:22     ` Avi Kivity
2009-10-01 16:55     ` Tejun Heo
2009-10-01 17:06       ` Avi Kivity
2009-10-01 16:43   ` Tejun Heo
2009-10-01 12:53 ` David Howells
2009-10-02 11:44   ` Tejun Heo
2009-10-02 12:45     ` Stefan Richter
2009-10-02 15:38   ` David Howells
2009-10-03  5:07     ` Tejun Heo
2009-10-01 12:57 ` [PATCH 06/19] acpi: use queue_work_on() instead of binding workqueue worker to cpu0 David Howells
2009-10-01 17:07   ` Tejun Heo
2009-10-01 13:05 ` [PATCH 10/19] workqueue: update cwq alignement and make one more flag bit available David Howells
2009-10-01 16:15   ` Jeff Garzik
2009-10-01 16:20   ` David Howells
2009-10-01 16:30     ` Tejun Heo
2009-10-01 16:39   ` Alan Cox
2009-10-01 18:45   ` Ben Pfaff
2009-10-02 11:56     ` Tejun Heo
2009-10-01 13:15 ` [PATCH 19/19] workqueue: implement concurrency managed workqueue David Howells
2009-10-02 12:03   ` Tejun Heo
2009-10-04  8:41 ` [RFC PATCHSET] " Peter Zijlstra
