* [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues
@ 2013-03-20  0:00 ` Tejun Heo
  0 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20  0:00 UTC (permalink / raw)
  To: laijs
  Cc: axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel, herbert,
	davem, linux-crypto

Hello,

There are two types of workqueues - per-cpu and unbound.  The former
is bound to each CPU and the latter isn't bound to any by default.
While the recently added attrs support allows unbound workqueues to be
confined to a subset of CPUs, it still is quite cumbersome for
applications where strict CPU affinity is too constricting but NUMA
locality still matters.

This patchset tries to solve that issue by automatically making
unbound workqueues affine to NUMA nodes by default.  A work item
queued to an unbound workqueue is executed on one of the CPUs allowed
by the workqueue in the same node.  If no CPU in that node is allowed,
it may be executed on any CPU allowed by the workqueue.  No changes
are required on the user side; every workqueue interface functions the
same as before.
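
As a concrete illustration (hypothetical topology): on a two-node box
with CPUs 0-3 on node 0 and CPUs 4-7 on node 1, a work item queued
from CPU 5 to an unbound workqueue allowing all CPUs would be served
by a worker confined to CPUs 4-7; if the workqueue's cpumask didn't
cover node 1 at all, it would simply fall back to the allowed CPUs as
before.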

This would be most helpful to subsystems which use some form of async
execution to process a significant amount of data - e.g. crypto and
btrfs; however, I wanted to find out whether it would make any dent in
much less favorable use cases.  The following is the total run time in
seconds of building an allmodconfig kernel w/ -j20 on a dual-socket
Opteron machine with the writeback thread pool converted to an unbound
workqueue and thus made NUMA-affine.  The file system is ext4 on top
of a WD SSD.

	before conversion		after conversion
	1396.126			1394.763
	1397.621			1394.965
	1399.636			1394.738
	1397.463			1398.162
	1395.543			1393.670

AVG	1397.278			1395.260	DIFF	2.018
STDEV	   1.585			   1.700

And, yes, it actually made things go faster by about 1.2 sigma, which
isn't completely conclusive but is a pretty good indication that it's
actually faster.  Note that this is a workload which is dominated by
CPU time, and while there's writeback going on continuously it really
isn't touching much data and isn't a dominating factor, so the gain is
understandably small, 0.14%, but hey, it's still a gain, and it should
be much more interesting for crypto and btrfs, which actually access
the data, or for workloads which are more sensitive to NUMA affinity.
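
For reference, those two figures fall out of the table above roughly
as follows (pooling the two per-column stdevs):

	diff  = 1397.278 - 1395.260 = 2.018
	sigma = 2.018 / sqrt((1.585^2 + 1.700^2) / 2) ~= 2.018 / 1.644 ~= 1.2
	gain  = 2.018 / 1397.278 ~= 0.14%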

The implementation is fairly simple.  After the recent attrs support
changes, a lot of the differences in pwq (pool_workqueue) handling
between unbound and per-cpu workqueues are gone.  An unbound workqueue
still has one "current" pwq that it uses for queueing any new work
items but can handle multiple pwqs perfectly well while they're
draining, so this patchset adds a pwq dispatch table to unbound
workqueues which is indexed by NUMA node and points to the matching
pwq.  Unbound workqueues now simply have multiple "current" pwqs keyed
by NUMA node.
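
Condensed, and glossing over the locking/RCU which the actual patches
handle, the issue-path selection boils down to the following sketch
(the real code is in patches 0007-0009):

	struct workqueue_struct {
		...
		/* "current" pwq of each NUMA node, used at issue time */
		struct pool_workqueue __rcu *numa_pwq_tbl[];
	};

	/* in __queue_work() */
	if (req_cpu == WORK_CPU_UNBOUND)
		cpu = raw_smp_processor_id();

	if (!(wq->flags & WQ_UNBOUND))
		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
	else
		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));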

NUMA affinity can be turned off system-wide with the
workqueue.disable_numa kernel parameter or per-workqueue using the
"numa" sysfs file.
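
(Patch 0010 adds these knobs; in practice that means booting with
workqueue.disable_numa=1 to turn the behavior off globally, or, for
workqueues exposed via WQ_SYSFS, writing 0 to the workqueue's "numa"
attribute - presumably under /sys/devices/virtual/workqueue/.)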

This patchset contains the following ten patches.

 0001-workqueue-add-wq_numa_tbl_len-and-wq_numa_possible_c.patch
 0002-workqueue-drop-H-from-kworker-names-of-unbound-worke.patch
 0003-workqueue-determine-NUMA-node-of-workers-accourding-.patch
 0004-workqueue-add-workqueue-unbound_attrs.patch
 0005-workqueue-make-workqueue-name-fixed-len.patch
 0006-workqueue-move-hot-fields-of-workqueue_struct-to-the.patch
 0007-workqueue-map-an-unbound-workqueues-to-multiple-per-.patch
 0008-workqueue-break-init_and_link_pwq-into-two-functions.patch
 0009-workqueue-implement-NUMA-affinity-for-unbound-workqu.patch
 0010-workqueue-update-sysfs-interface-to-reflect-NUMA-awa.patch

0001 adds basic NUMA topology knowledge to workqueue.

0002-0006 are prep patches.

0007-0009 implement NUMA affinity.

0010 adds control knobs and updates the sysfs interface.

This patchset is on top of

 wq/for-3.10 a0265a7f51 ("workqueue: define workqueue_freezing static variable iff CONFIG_FREEZER")

and also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-numa

diffstat follows.

 Documentation/kernel-parameters.txt |    9
 include/linux/workqueue.h           |    5
 kernel/workqueue.c                  |  393 ++++++++++++++++++++++++++++--------
 3 files changed, 325 insertions(+), 82 deletions(-)

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
  2013-03-20  0:00 ` Tejun Heo
@ 2013-03-20  0:00   ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20  0:00 UTC (permalink / raw)
  To: laijs
  Cc: axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel, herbert,
	davem, linux-crypto, Tejun Heo

Unbound workqueues are going to be NUMA-affine.  Add wq_numa_tbl_len
and wq_numa_possible_cpumask[] in preparation.  The former is the
highest NUMA node ID + 1 and the latter holds the mask of possible CPUs for
each NUMA node.

This patch only introduces these.  Future patches will make use of
them.
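
For illustration (hypothetical topology): on a two-node machine with
CPUs 0-3 on node 0 and CPUs 4-7 on node 1, wq_numa_tbl_len ends up 2,
wq_numa_possible_cpumask[0] reads 0-3 and wq_numa_possible_cpumask[1]
reads 4-7.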

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 775c2f4..9b096e3 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -44,6 +44,7 @@
 #include <linux/jhash.h>
 #include <linux/hashtable.h>
 #include <linux/rculist.h>
+#include <linux/nodemask.h>
 
 #include "workqueue_internal.h"
 
@@ -256,6 +257,11 @@ struct workqueue_struct {
 
 static struct kmem_cache *pwq_cache;
 
+static int wq_numa_tbl_len;		/* highest possible NUMA node id + 1 */
+static cpumask_var_t *wq_numa_possible_cpumask;
+					/* possible CPUs of each node, may
+					   be NULL if init failed */
+
 static DEFINE_MUTEX(wq_mutex);		/* protects workqueues and pools */
 static DEFINE_SPINLOCK(pwq_lock);	/* protects pool_workqueues */
 static DEFINE_SPINLOCK(wq_mayday_lock);	/* protects wq->maydays list */
@@ -4416,7 +4422,7 @@ out_unlock:
 static int __init init_workqueues(void)
 {
 	int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
-	int i, cpu;
+	int i, node, cpu;
 
 	/* make sure we have enough bits for OFFQ pool ID */
 	BUILD_BUG_ON((1LU << (BITS_PER_LONG - WORK_OFFQ_POOL_SHIFT)) <
@@ -4429,6 +4435,33 @@ static int __init init_workqueues(void)
 	cpu_notifier(workqueue_cpu_up_callback, CPU_PRI_WORKQUEUE_UP);
 	hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_DOWN);
 
+	/* determine NUMA pwq table len - highest node id + 1 */
+	for_each_node(node)
+		wq_numa_tbl_len = max(wq_numa_tbl_len, node + 1);
+
+	/*
+	 * We want masks of possible CPUs of each node which isn't readily
+	 * available.  Build one from cpu_to_node() which should have been
+	 * fully initialized by now.
+	 */
+	wq_numa_possible_cpumask = kzalloc(wq_numa_tbl_len *
+					   sizeof(wq_numa_possible_cpumask[0]),
+					   GFP_KERNEL);
+	BUG_ON(!wq_numa_possible_cpumask);
+
+	for_each_node(node)
+		BUG_ON(!alloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
+					       GFP_KERNEL, node));
+	for_each_possible_cpu(cpu) {
+		node = cpu_to_node(cpu);
+		if (WARN_ON(node == NUMA_NO_NODE)) {
+			pr_err("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
+			wq_numa_possible_cpumask = NULL;
+			break;
+		}
+		cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+	}
+
 	/* initialize CPU pools */
 	for_each_possible_cpu(cpu) {
 		struct worker_pool *pool;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 02/10] workqueue: drop 'H' from kworker names of unbound worker pools
  2013-03-20  0:00 ` Tejun Heo
@ 2013-03-20  0:00   ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20  0:00 UTC (permalink / raw)
  To: laijs
  Cc: axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel, herbert,
	davem, linux-crypto, Tejun Heo

Currently, all workqueue workers which have a negative nice value have
'H' postfixed to their names.  This is necessary for per-cpu workers
as they use the CPU number instead of pool->id to identify the pool
and the 'H' postfix is the only thing distinguishing normal and
highpri workers.

As workers for unbound pools use pool->id, the 'H' postfix is purely
informational.  TASK_COMM_LEN is 16 and after the static part and
delimiters, there are only five characters left for the pool and
worker IDs.  We're expecting to have more unbound pools with the
scheduled NUMA awareness support.  Let's drop the non-essential 'H'
postfix from unbound kworker names.

While at it, restructure the kthread_create*() invocation to help
future NUMA-related changes.
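
With this, per-cpu workers keep names like kworker/2:1 and
kworker/2:1H while unbound workers end up as e.g. kworker/u9:0,
without the 'H' even for highpri pools (example IDs, obviously).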

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9b096e3..c1a931c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1648,9 +1648,10 @@ static struct worker *alloc_worker(void)
  */
 static struct worker *create_worker(struct worker_pool *pool)
 {
-	const char *pri = pool->attrs->nice < 0  ? "H" : "";
 	struct worker *worker = NULL;
+	int node = pool->cpu >= 0 ? cpu_to_node(pool->cpu) : NUMA_NO_NODE;
 	int id = -1;
+	char id_buf[16];
 
 	lockdep_assert_held(&pool->manager_mutex);
 
@@ -1676,13 +1677,13 @@ static struct worker *create_worker(struct worker_pool *pool)
 	worker->id = id;
 
 	if (pool->cpu >= 0)
-		worker->task = kthread_create_on_node(worker_thread,
-					worker, cpu_to_node(pool->cpu),
-					"kworker/%d:%d%s", pool->cpu, id, pri);
+		snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
+			 pool->attrs->nice < 0  ? "H" : "");
 	else
-		worker->task = kthread_create(worker_thread, worker,
-					      "kworker/u%d:%d%s",
-					      pool->id, id, pri);
+		snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
+
+	worker->task = kthread_create_on_node(worker_thread, worker, node,
+					      "kworker/%s", id_buf);
 	if (IS_ERR(worker->task))
 		goto fail;
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 03/10] workqueue: determine NUMA node of workers according to the allowed cpumask
  2013-03-20  0:00 ` Tejun Heo
@ 2013-03-20  0:00   ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20  0:00 UTC (permalink / raw)
  To: laijs
  Cc: axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel, herbert,
	davem, linux-crypto, Tejun Heo

When worker tasks are created using kthread_create_on_node(),
currently only per-cpu ones have the matching NUMA node specified.
All unbound workers are always created with NUMA_NO_NODE.

Now that an unbound worker pool may have an arbitrary cpumask
associated with it, this isn't optimal.  Add pool->node which is
determined by the pool's cpumask.  If the pool's cpumask is contained
inside a NUMA node proper, the pool is associated with that node, and
all workers of the pool are created on that node.

This currently only makes a difference for unbound worker pools whose
cpumask is contained inside a single NUMA node, but it will serve as
the foundation for making all unbound pools NUMA-affine.
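
For example (hypothetical topology), an unbound pool whose cpumask is
0-3 on a machine where node 0 covers CPUs 0-7 gets pool->node = 0, and
its workers' task structs and stacks are then allocated on that node
by kthread_create_on_node(); a pool whose cpumask straddles nodes
keeps NUMA_NO_NODE as before.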

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c1a931c..2768ed2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -141,6 +141,7 @@ enum {
 struct worker_pool {
 	spinlock_t		lock;		/* the pool lock */
 	int			cpu;		/* I: the associated cpu */
+	int			node;		/* I: the associated node ID */
 	int			id;		/* I: pool ID */
 	unsigned int		flags;		/* X: flags */
 
@@ -1649,7 +1650,6 @@ static struct worker *alloc_worker(void)
 static struct worker *create_worker(struct worker_pool *pool)
 {
 	struct worker *worker = NULL;
-	int node = pool->cpu >= 0 ? cpu_to_node(pool->cpu) : NUMA_NO_NODE;
 	int id = -1;
 	char id_buf[16];
 
@@ -1682,7 +1682,7 @@ static struct worker *create_worker(struct worker_pool *pool)
 	else
 		snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
 
-	worker->task = kthread_create_on_node(worker_thread, worker, node,
+	worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
 					      "kworker/%s", id_buf);
 	if (IS_ERR(worker->task))
 		goto fail;
@@ -3390,6 +3390,7 @@ static int init_worker_pool(struct worker_pool *pool)
 	spin_lock_init(&pool->lock);
 	pool->id = -1;
 	pool->cpu = -1;
+	pool->node = NUMA_NO_NODE;
 	pool->flags |= POOL_DISASSOCIATED;
 	INIT_LIST_HEAD(&pool->worklist);
 	INIT_LIST_HEAD(&pool->idle_list);
@@ -3496,6 +3497,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 {
 	u32 hash = wqattrs_hash(attrs);
 	struct worker_pool *pool;
+	int node;
 
 	mutex_lock(&wq_mutex);
 
@@ -3515,6 +3517,17 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 	lockdep_set_subclass(&pool->lock, 1);	/* see put_pwq() */
 	copy_workqueue_attrs(pool->attrs, attrs);
 
+	/* if cpumask is contained inside a NUMA node, we belong to that node */
+	if (wq_numa_possible_cpumask) {
+		for_each_node(node) {
+			if (cpumask_subset(pool->attrs->cpumask,
+					   wq_numa_possible_cpumask[node])) {
+				pool->node = node;
+				break;
+			}
+		}
+	}
+
 	if (worker_pool_assign_id(pool) < 0)
 		goto fail;
 
@@ -4473,6 +4486,7 @@ static int __init init_workqueues(void)
 			pool->cpu = cpu;
 			cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
 			pool->attrs->nice = std_nice[i++];
+			pool->node = cpu_to_node(cpu);
 
 			/* alloc pool ID */
 			mutex_lock(&wq_mutex);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 04/10] workqueue: add workqueue->unbound_attrs
  2013-03-20  0:00 ` Tejun Heo
@ 2013-03-20  0:00   ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20  0:00 UTC (permalink / raw)
  To: laijs
  Cc: axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel, herbert,
	davem, linux-crypto, Tejun Heo

Currently, when exposing attrs of an unbound workqueue via sysfs, the
workqueue_attrs of first_pwq() is used as that should equal the
current state of the workqueue.

The planned NUMA affinity support will have unbound workqueues use
multiple pool_workqueues for different NUMA nodes, and the above
assumption will no longer hold.  Introduce workqueue->unbound_attrs
which records the current attrs in effect and use it for sysfs instead
of first_pwq()->attrs.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 36 ++++++++++++++++++++++++------------
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2768ed2..57e6139 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -247,6 +247,8 @@ struct workqueue_struct {
 	int			nr_drainers;	/* WQ: drain in progress */
 	int			saved_max_active; /* PW: saved pwq max_active */
 
+	struct workqueue_attrs	*unbound_attrs;	/* WQ: only for unbound wqs */
+
 #ifdef CONFIG_SYSFS
 	struct wq_device	*wq_dev;	/* I: for sysfs interface */
 #endif
@@ -3099,10 +3101,9 @@ static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
 	struct workqueue_struct *wq = dev_to_wq(dev);
 	int written;
 
-	rcu_read_lock_sched();
-	written = scnprintf(buf, PAGE_SIZE, "%d\n",
-			    first_pwq(wq)->pool->attrs->nice);
-	rcu_read_unlock_sched();
+	mutex_lock(&wq_mutex);
+	written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->unbound_attrs->nice);
+	mutex_unlock(&wq_mutex);
 
 	return written;
 }
@@ -3116,9 +3117,9 @@ static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
 	if (!attrs)
 		return NULL;
 
-	rcu_read_lock_sched();
-	copy_workqueue_attrs(attrs, first_pwq(wq)->pool->attrs);
-	rcu_read_unlock_sched();
+	mutex_lock(&wq_mutex);
+	copy_workqueue_attrs(attrs, wq->unbound_attrs);
+	mutex_unlock(&wq_mutex);
 	return attrs;
 }
 
@@ -3149,10 +3150,9 @@ static ssize_t wq_cpumask_show(struct device *dev,
 	struct workqueue_struct *wq = dev_to_wq(dev);
 	int written;
 
-	rcu_read_lock_sched();
-	written = cpumask_scnprintf(buf, PAGE_SIZE,
-				    first_pwq(wq)->pool->attrs->cpumask);
-	rcu_read_unlock_sched();
+	mutex_lock(&wq_mutex);
+	written = cpumask_scnprintf(buf, PAGE_SIZE, wq->unbound_attrs->cpumask);
+	mutex_unlock(&wq_mutex);
 
 	written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
 	return written;
@@ -3585,8 +3585,10 @@ static void pwq_unbound_release_workfn(struct work_struct *work)
 	 * If we're the last pwq going away, @wq is already dead and no one
 	 * is gonna access it anymore.  Free it.
 	 */
-	if (list_empty(&wq->pwqs))
+	if (list_empty(&wq->pwqs)) {
+		free_workqueue_attrs(wq->unbound_attrs);
 		kfree(wq);
+	}
 }
 
 /**
@@ -3656,6 +3658,9 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
 	/* link in @pwq */
 	list_add_rcu(&pwq->pwqs_node, &wq->pwqs);
 
+	if (wq->flags & WQ_UNBOUND)
+		copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
+
 	spin_unlock_irq(&pwq_lock);
 	mutex_unlock(&wq->flush_mutex);
 }
@@ -3764,6 +3769,12 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
 	if (!wq)
 		return NULL;
 
+	if (flags & WQ_UNBOUND) {
+		wq->unbound_attrs = alloc_workqueue_attrs(GFP_KERNEL);
+		if (!wq->unbound_attrs)
+			goto err_free_wq;
+	}
+
 	vsnprintf(wq->name, namelen, fmt, args1);
 	va_end(args);
 	va_end(args1);
@@ -3832,6 +3843,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
 	return wq;
 
 err_free_wq:
+	free_workqueue_attrs(wq->unbound_attrs);
 	kfree(wq);
 	return NULL;
 err_destroy:
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 05/10] workqueue: make workqueue->name[] fixed len
  2013-03-20  0:00 ` Tejun Heo
@ 2013-03-20  0:00   ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20  0:00 UTC (permalink / raw)
  To: laijs
  Cc: axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel, herbert,
	davem, linux-crypto, Tejun Heo

Currently workqueue->name[] is of flexible length.  We want to use the
flexible field for something more useful and there isn't much benefit
in allowing arbitrary name lengths anyway.  Make it a fixed length,
capped at 24 bytes.
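
(Names longer than 23 characters are now silently truncated by
vsnprintf() rather than allocated in full.)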

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 57e6139..151ce49 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -101,6 +101,8 @@ enum {
 	 */
 	RESCUER_NICE_LEVEL	= -20,
 	HIGHPRI_NICE_LEVEL	= -20,
+
+	WQ_NAME_LEN		= 24,
 };
 
 /*
@@ -255,7 +257,7 @@ struct workqueue_struct {
 #ifdef CONFIG_LOCKDEP
 	struct lockdep_map	lockdep_map;
 #endif
-	char			name[];		/* I: workqueue name */
+	char			name[WQ_NAME_LEN]; /* I: workqueue name */
 };
 
 static struct kmem_cache *pwq_cache;
@@ -3755,17 +3757,12 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
 					       struct lock_class_key *key,
 					       const char *lock_name, ...)
 {
-	va_list args, args1;
+	va_list args;
 	struct workqueue_struct *wq;
 	struct pool_workqueue *pwq;
-	size_t namelen;
-
-	/* determine namelen, allocate wq and format name */
-	va_start(args, lock_name);
-	va_copy(args1, args);
-	namelen = vsnprintf(NULL, 0, fmt, args) + 1;
 
-	wq = kzalloc(sizeof(*wq) + namelen, GFP_KERNEL);
+	/* allocate wq and format name */
+	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
 	if (!wq)
 		return NULL;
 
@@ -3775,9 +3772,9 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
 			goto err_free_wq;
 	}
 
-	vsnprintf(wq->name, namelen, fmt, args1);
+	va_start(args, lock_name);
+	vsnprintf(wq->name, sizeof(wq->name), fmt, args);
 	va_end(args);
-	va_end(args1);
 
 	max_active = max_active ?: WQ_DFL_ACTIVE;
 	max_active = wq_clamp_max_active(max_active, flags, wq->name);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 06/10] workqueue: move hot fields of workqueue_struct to the end
  2013-03-20  0:00 ` Tejun Heo
@ 2013-03-20  0:00   ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20  0:00 UTC (permalink / raw)
  To: laijs
  Cc: axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel, herbert,
	davem, linux-crypto, Tejun Heo

Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and align
them to the cacheline.  These two fields are used in the work item
issue path and thus hot.  The scheduled NUMA affinity support will add
a dispatch table at the end of workqueue_struct, and relocating these
two fields will allow us to hit only a single cacheline on hot paths.

Note that wq->pwqs isn't moved although it currently is being used in
the work item issue path for unbound workqueues.  The dispatch table
mentioned above will replace its use in the issue path, so it will
become cold once NUMA support is implemented.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 151ce49..25dab9d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -230,8 +230,6 @@ struct wq_device;
  * the appropriate worker_pool through its pool_workqueues.
  */
 struct workqueue_struct {
-	unsigned int		flags;		/* WQ: WQ_* flags */
-	struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwq's */
 	struct list_head	pwqs;		/* FR: all pwqs of this wq */
 	struct list_head	list;		/* WQ: list of all workqueues */
 
@@ -258,6 +256,10 @@ struct workqueue_struct {
 	struct lockdep_map	lockdep_map;
 #endif
 	char			name[WQ_NAME_LEN]; /* I: workqueue name */
+
+	/* hot fields used during command issue, aligned to cacheline */
+	unsigned int		flags ____cacheline_aligned; /* WQ: WQ_* flags */
+	struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
 };
 
 static struct kmem_cache *pwq_cache;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 07/10] workqueue: map an unbound workqueue to multiple per-node pool_workqueues
  2013-03-20  0:00 ` Tejun Heo
@ 2013-03-20  0:00   ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20  0:00 UTC (permalink / raw)
  To: laijs
  Cc: axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel, herbert,
	davem, linux-crypto, Tejun Heo

Currently, an unbound workqueue has only one "current" pool_workqueue
associated with it.  It may have multiple pool_workqueues but only the
first pool_workqueue serves new work items.  For NUMA affinity, we
want to change this so that there are multiple current pool_workqueues
serving different NUMA nodes.

Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and
points to the pool_workqueue to use for each possible node.  This
replaces first_pwq() in __queue_work() and workqueue_congested().

numa_pwq_tbl[] is currently initialized to point to the same
pool_workqueue as first_pwq() so this patch doesn't make any behavior
changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 48 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 37 insertions(+), 11 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 25dab9d..3f820a5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -260,6 +260,7 @@ struct workqueue_struct {
 	/* hot fields used during command issue, aligned to cacheline */
 	unsigned int		flags ____cacheline_aligned; /* WQ: WQ_* flags */
 	struct pool_workqueue __percpu *cpu_pwqs; /* I: per-cpu pwqs */
+	struct pool_workqueue __rcu *numa_pwq_tbl[]; /* FR: unbound pwqs indexed by node */
 };
 
 static struct kmem_cache *pwq_cache;
@@ -529,6 +530,22 @@ static struct pool_workqueue *first_pwq(struct workqueue_struct *wq)
 				      pwqs_node);
 }
 
+/**
+ * unbound_pwq_by_node - return the unbound pool_workqueue for the given node
+ * @wq: the target workqueue
+ * @node: the node ID
+ *
+ * This must be called either with pwq_lock held or sched RCU read locked.
+ * If the pwq needs to be used beyond the locking in effect, the caller is
+ * responsible for guaranteeing that the pwq stays online.
+ */
+static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
+						  int node)
+{
+	assert_rcu_or_pwq_lock();
+	return rcu_dereference_sched(wq->numa_pwq_tbl[node]);
+}
+
 static unsigned int work_color_to_flags(int color)
 {
 	return color << WORK_STRUCT_COLOR_SHIFT;
@@ -1282,14 +1299,14 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 	    WARN_ON_ONCE(!is_chained_work(wq)))
 		return;
 retry:
+	if (req_cpu == WORK_CPU_UNBOUND)
+		cpu = raw_smp_processor_id();
+
 	/* pwq which will be used unless @work is executing elsewhere */
-	if (!(wq->flags & WQ_UNBOUND)) {
-		if (cpu == WORK_CPU_UNBOUND)
-			cpu = raw_smp_processor_id();
+	if (!(wq->flags & WQ_UNBOUND))
 		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
-	} else {
-		pwq = first_pwq(wq);
-	}
+	else
+		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
 
 	/*
 	 * If @work was previously on a different pool, it might still be
@@ -1319,8 +1336,8 @@ retry:
 	 * pwq is determined and locked.  For unbound pools, we could have
 	 * raced with pwq release and it could already be dead.  If its
 	 * refcnt is zero, repeat pwq selection.  Note that pwqs never die
-	 * without another pwq replacing it as the first pwq or while a
-	 * work item is executing on it, so the retying is guaranteed to
+	 * without another pwq replacing it in the numa_pwq_tbl or while
+	 * work items are executing on it, so the retrying is guaranteed to
 	 * make forward-progress.
 	 */
 	if (unlikely(!pwq->refcnt)) {
@@ -3635,6 +3652,8 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
 			      struct worker_pool *pool,
 			      struct pool_workqueue **p_last_pwq)
 {
+	int node;
+
 	BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
 
 	pwq->pool = pool;
@@ -3662,8 +3681,11 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
 	/* link in @pwq */
 	list_add_rcu(&pwq->pwqs_node, &wq->pwqs);
 
-	if (wq->flags & WQ_UNBOUND)
+	if (wq->flags & WQ_UNBOUND) {
 		copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
+		for_each_node(node)
+			rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+	}
 
 	spin_unlock_irq(&pwq_lock);
 	mutex_unlock(&wq->flush_mutex);
@@ -3759,12 +3781,16 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
 					       struct lock_class_key *key,
 					       const char *lock_name, ...)
 {
+	size_t tbl_size = 0;
 	va_list args;
 	struct workqueue_struct *wq;
 	struct pool_workqueue *pwq;
 
 	/* allocate wq and format name */
-	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
+	if (flags & WQ_UNBOUND)
+		tbl_size = wq_numa_tbl_len * sizeof(wq->numa_pwq_tbl[0]);
+
+	wq = kzalloc(sizeof(*wq) + tbl_size, GFP_KERNEL);
 	if (!wq)
 		return NULL;
 
@@ -3991,7 +4017,7 @@ bool workqueue_congested(int cpu, struct workqueue_struct *wq)
 	if (!(wq->flags & WQ_UNBOUND))
 		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
 	else
-		pwq = first_pwq(wq);
+		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
 
 	ret = !list_empty(&pwq->delayed_works);
 	preempt_enable();
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 08/10] workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
  2013-03-20  0:00 ` Tejun Heo
@ 2013-03-20  0:00   ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20  0:00 UTC (permalink / raw)
  To: laijs
  Cc: axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel, herbert,
	davem, linux-crypto, Tejun Heo

Break init_and_link_pwq() into init_pwq() and link_pwq() and move
unbound-workqueue specific handling into apply_workqueue_attrs().
Also, factor out unbound pool and pool_workqueue allocation into
alloc_unbound_pwq().

This reorganization is to prepare for NUMA affinity and doesn't
introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/workqueue.c | 75 +++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 52 insertions(+), 23 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3f820a5..bbbfc92 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3647,13 +3647,10 @@ static void pwq_adjust_max_active(struct pool_workqueue *pwq)
 	spin_unlock(&pwq->pool->lock);
 }
 
-static void init_and_link_pwq(struct pool_workqueue *pwq,
-			      struct workqueue_struct *wq,
-			      struct worker_pool *pool,
-			      struct pool_workqueue **p_last_pwq)
+/* initialize newly zalloced @pwq which is associated with @wq and @pool */
+static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
+		     struct worker_pool *pool)
 {
-	int node;
-
 	BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
 
 	pwq->pool = pool;
@@ -3663,9 +3660,16 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
 	INIT_LIST_HEAD(&pwq->delayed_works);
 	INIT_LIST_HEAD(&pwq->mayday_node);
 	INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
+}
 
-	mutex_lock(&wq->flush_mutex);
-	spin_lock_irq(&pwq_lock);
+/* sync @pwq with the current state of its associated wq and link it */
+static void link_pwq(struct pool_workqueue *pwq,
+		     struct pool_workqueue **p_last_pwq)
+{
+	struct workqueue_struct *wq = pwq->wq;
+
+	lockdep_assert_held(&wq->flush_mutex);
+	lockdep_assert_held(&pwq_lock);
 
 	/*
 	 * Set the matching work_color.  This is synchronized with
@@ -3680,15 +3684,27 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
 
 	/* link in @pwq */
 	list_add_rcu(&pwq->pwqs_node, &wq->pwqs);
+}
 
-	if (wq->flags & WQ_UNBOUND) {
-		copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
-		for_each_node(node)
-			rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+/* obtain a pool matching @attrs and create a pwq associating the pool and @wq */
+static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
+					const struct workqueue_attrs *attrs)
+{
+	struct worker_pool *pool;
+	struct pool_workqueue *pwq;
+
+	pool = get_unbound_pool(attrs);
+	if (!pool)
+		return NULL;
+
+	pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
+	if (!pwq) {
+		put_unbound_pool(pool);
+		return NULL;
 	}
 
-	spin_unlock_irq(&pwq_lock);
-	mutex_unlock(&wq->flush_mutex);
+	init_pwq(pwq, wq, pool);
+	return pwq;
 }
 
 /**
@@ -3709,7 +3725,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
 			  const struct workqueue_attrs *attrs)
 {
 	struct pool_workqueue *pwq, *last_pwq;
-	struct worker_pool *pool;
+	int node;
 
 	/* only unbound workqueues can change attributes */
 	if (WARN_ON(!(wq->flags & WQ_UNBOUND)))
@@ -3719,17 +3735,22 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
 	if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
 		return -EINVAL;
 
-	pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
+	pwq = alloc_unbound_pwq(wq, attrs);
 	if (!pwq)
 		return -ENOMEM;
 
-	pool = get_unbound_pool(attrs);
-	if (!pool) {
-		kmem_cache_free(pwq_cache, pwq);
-		return -ENOMEM;
-	}
+	mutex_lock(&wq->flush_mutex);
+	spin_lock_irq(&pwq_lock);
+
+	link_pwq(pwq, &last_pwq);
+
+	copy_workqueue_attrs(wq->unbound_attrs, pwq->pool->attrs);
+	for_each_node(node)
+		rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+
+	spin_unlock_irq(&pwq_lock);
+	mutex_unlock(&wq->flush_mutex);
 
-	init_and_link_pwq(pwq, wq, pool, &last_pwq);
 	if (last_pwq) {
 		spin_lock_irq(&last_pwq->pool->lock);
 		put_pwq(last_pwq);
@@ -3755,7 +3776,15 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
 			struct worker_pool *cpu_pools =
 				per_cpu(cpu_worker_pools, cpu);
 
-			init_and_link_pwq(pwq, wq, &cpu_pools[highpri], NULL);
+			init_pwq(pwq, wq, &cpu_pools[highpri]);
+
+			mutex_lock(&wq->flush_mutex);
+			spin_lock_irq(&pwq_lock);
+
+			link_pwq(pwq, NULL);
+
+			spin_unlock_irq(&pwq_lock);
+			mutex_unlock(&wq->flush_mutex);
 		}
 		return 0;
 	} else {
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 09/10] workqueue: implement NUMA affinity for unbound workqueues
  2013-03-20  0:00 ` Tejun Heo
@ 2013-03-20  0:00   ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20  0:00 UTC (permalink / raw)
  To: laijs
  Cc: axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel, herbert,
	davem, linux-crypto, Tejun Heo

Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued.  This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues.  Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has possible CPUs in
@attrs->cpumask.  Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.

This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active.  The limit is now per NUMA
node instead of global.  While this is an actual behavior change,
max_active is already per-cpu for per-cpu workqueues and is primarily
used as a safety mechanism rather than for active concurrency control.
Concurrency is usually limited by workqueue users through the number of
concurrently active work items, so this change shouldn't matter much.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
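A concrete (hypothetical) example of the mapping: on a machine with node 0
holding CPUs 0-3 and node 1 holding CPUs 4-7, an unbound workqueue whose
attrs->cpumask is 0-5 ends up with a node-0 pwq limited to CPUs 0-3 and a
node-1 pwq limited to CPUs 4-5; if attrs->cpumask were 0-3 instead, node 1
has no intersecting possible CPUs and simply points at the default pwq
covering CPUs 0-3.  The per-node mask is computed essentially as below
(sketch of the hunk further down):

	cpumask_and(cpumask, wq_numa_possible_cpumask[node], attrs->cpumask);
	if (cpumask_empty(cpumask))
		cpumask_copy(cpumask, attrs->cpumask);	/* node shares the default pwq */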
 kernel/workqueue.c | 108 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 86 insertions(+), 22 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bbbfc92..0c36327 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3658,13 +3658,13 @@ static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
 	pwq->flush_color = -1;
 	pwq->refcnt = 1;
 	INIT_LIST_HEAD(&pwq->delayed_works);
+	INIT_LIST_HEAD(&pwq->pwqs_node);
 	INIT_LIST_HEAD(&pwq->mayday_node);
 	INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
 }
 
 /* sync @pwq with the current state of its associated wq and link it */
-static void link_pwq(struct pool_workqueue *pwq,
-		     struct pool_workqueue **p_last_pwq)
+static void link_pwq(struct pool_workqueue *pwq)
 {
 	struct workqueue_struct *wq = pwq->wq;
 
@@ -3675,8 +3675,6 @@ static void link_pwq(struct pool_workqueue *pwq,
 	 * Set the matching work_color.  This is synchronized with
 	 * flush_mutex to avoid confusing flush_workqueue().
 	 */
-	if (p_last_pwq)
-		*p_last_pwq = first_pwq(wq);
 	pwq->work_color = wq->work_color;
 
 	/* sync max_active to the current setting */
@@ -3712,11 +3710,12 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
  * @wq: the target workqueue
  * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
  *
- * Apply @attrs to an unbound workqueue @wq.  If @attrs doesn't match the
- * current attributes, a new pwq is created and made the first pwq which
- * will serve all new work items.  Older pwqs are released as in-flight
- * work items finish.  Note that a work item which repeatedly requeues
- * itself back-to-back will stay on its current pwq.
+ * Apply @attrs to an unbound workqueue @wq.  Unless disabled, on NUMA
+ * machines, this function maps a separate pwq to each NUMA node with
+ * possible CPUs in @attrs->cpumask so that work items are affine to the
+ * NUMA node they are issued on.  Older pwqs are released as in-flight work
+ * items finish.  Note that a work item which repeatedly requeues itself
+ * back-to-back will stay on its current pwq.
  *
  * Performs GFP_KERNEL allocations.  Returns 0 on success and -errno on
  * failure.
@@ -3724,7 +3723,8 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
 int apply_workqueue_attrs(struct workqueue_struct *wq,
 			  const struct workqueue_attrs *attrs)
 {
-	struct pool_workqueue *pwq, *last_pwq;
+	struct pool_workqueue **pwq_tbl = NULL, *dfl_pwq = NULL;
+	struct workqueue_attrs *tmp_attrs = NULL;
 	int node;
 
 	/* only unbound workqueues can change attributes */
@@ -3735,29 +3735,93 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
 	if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
 		return -EINVAL;
 
-	pwq = alloc_unbound_pwq(wq, attrs);
-	if (!pwq)
-		return -ENOMEM;
+	pwq_tbl = kzalloc(wq_numa_tbl_len * sizeof(pwq_tbl[0]), GFP_KERNEL);
+	tmp_attrs = alloc_workqueue_attrs(GFP_KERNEL);
+	if (!pwq_tbl || !tmp_attrs)
+		goto enomem;
+
+	copy_workqueue_attrs(tmp_attrs, attrs);
+
+	/*
+	 * We want NUMA affinity.  For each node with intersecting possible
+	 * CPUs with the requested cpumask, create a separate pwq covering
+	 * the intersection.  Nodes without intersection are covered by
+	 * the default pwq covering the whole requested cpumask.
+	 */
+	for_each_node(node) {
+		cpumask_t *cpumask = tmp_attrs->cpumask;
+
+		/*
+		 * Just fall through if NUMA affinity isn't enabled.  We'll
+		 * end up using the default pwq which is what we want.
+		 */
+		if (wq_numa_possible_cpumask) {
+			cpumask_and(cpumask, wq_numa_possible_cpumask[node],
+				    attrs->cpumask);
+			if (cpumask_empty(cpumask))
+				cpumask_copy(cpumask, attrs->cpumask);
+		}
+
+		if (cpumask_equal(cpumask, attrs->cpumask)) {
+			if (!dfl_pwq) {
+				dfl_pwq = alloc_unbound_pwq(wq, tmp_attrs);
+				if (!dfl_pwq)
+					goto enomem;
+			} else {
+				dfl_pwq->refcnt++;
+			}
+			pwq_tbl[node] = dfl_pwq;
+		} else {
+			pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
+			if (!pwq_tbl[node])
+				goto enomem;
+		}
+	}
 
+	/* all pwqs have been created successfully, let's install'em */
 	mutex_lock(&wq->flush_mutex);
 	spin_lock_irq(&pwq_lock);
 
-	link_pwq(pwq, &last_pwq);
+	/* @attrs is now current */
+	copy_workqueue_attrs(wq->unbound_attrs, attrs);
 
-	copy_workqueue_attrs(wq->unbound_attrs, pwq->pool->attrs);
-	for_each_node(node)
-		rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+	for_each_node(node) {
+		struct pool_workqueue *pwq;
+
+		/* each new pwq should be linked once */
+		if (list_empty(&pwq_tbl[node]->pwqs_node))
+			link_pwq(pwq_tbl[node]);
+
+		/* save the previous pwq and install the new one */
+		pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
+		rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq_tbl[node]);
+		pwq_tbl[node] = pwq;
+	}
 
 	spin_unlock_irq(&pwq_lock);
 	mutex_unlock(&wq->flush_mutex);
 
-	if (last_pwq) {
-		spin_lock_irq(&last_pwq->pool->lock);
-		put_pwq(last_pwq);
-		spin_unlock_irq(&last_pwq->pool->lock);
+	/* put the old pwqs */
+	for_each_node(node) {
+		struct pool_workqueue *pwq = pwq_tbl[node];
+
+		if (pwq) {
+			spin_lock_irq(&pwq->pool->lock);
+			put_pwq(pwq);
+			spin_unlock_irq(&pwq->pool->lock);
+		}
 	}
 
 	return 0;
+
+enomem:
+	free_workqueue_attrs(tmp_attrs);
+	if (pwq_tbl) {
+		for_each_node(node)
+			kfree(pwq_tbl[node]);
+		kfree(pwq_tbl);
+	}
+	return -ENOMEM;
 }
 
 static int alloc_and_link_pwqs(struct workqueue_struct *wq)
@@ -3781,7 +3845,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
 			mutex_lock(&wq->flush_mutex);
 			spin_lock_irq(&pwq_lock);
 
-			link_pwq(pwq, NULL);
+			link_pwq(pwq);
 
 			spin_unlock_irq(&pwq_lock);
 			mutex_unlock(&wq->flush_mutex);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 10/10] workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
  2013-03-20  0:00 ` Tejun Heo
@ 2013-03-20  0:00   ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20  0:00 UTC (permalink / raw)
  To: laijs
  Cc: axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel, herbert,
	davem, linux-crypto, Tejun Heo

Unbound workqueues are now NUMA aware.  Let's add some control knobs
and update sysfs interface accordingly.

* Add kernel param workqueue.disable_numa which disables NUMA affinity
  globally.

* Replace sysfs file "pool_id" with "pool_ids" which contains
  node:pool_id pairs.  This change is userland-visible but "pool_id"
  hasn't seen a release yet, so this is okay.

* Add a new sysfs file "numa" which can toggle NUMA affinity on
  individual workqueues.  This is implemented as attrs->no_numa which
  is special in that it isn't part of a pool's attributes.  It only
  affects how apply_workqueue_attrs() picks which pools to use.

After "pool_ids" change, first_pwq() doesn't have any user left.
Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
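How the knobs combine, in short: workqueue.disable_numa at boot keeps
wq_numa_possible_cpumask NULL, writing 0 to a workqueue's "numa" sysfs file
sets attrs->no_numa, and per-node pwqs are used only when neither is set.
Reading "pool_ids" then shows one node:pool_id pair per node, e.g. "0:4 1:6"
on a two-node machine (the IDs here are made up).  Condensed from the
apply_workqueue_attrs() change below:

	bool do_numa = !attrs->no_numa && wq_numa_possible_cpumask;

	/*
	 * do_numa false: every entry of wq->numa_pwq_tbl[] points at the
	 * default pwq covering the whole attrs->cpumask (old behavior).
	 * do_numa true: each node with possible CPUs intersecting
	 * attrs->cpumask gets its own pwq limited to that intersection.
	 */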
 Documentation/kernel-parameters.txt |   9 +++
 include/linux/workqueue.h           |   5 ++
 kernel/workqueue.c                  | 125 +++++++++++++++++++++++++-----------
 3 files changed, 102 insertions(+), 37 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4609e81..c75ea0b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3222,6 +3222,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			or other driver-specific files in the
 			Documentation/watchdog/ directory.
 
+	workqueue.disable_numa
+			By default, all work items queued to unbound
+			workqueues are affine to the NUMA nodes they're
+			issued on, which results in better behavior in
+			general.  If NUMA affinity needs to be disabled for
+			whatever reason, this option can be used.  Note
+			that this also can be controlled per-workqueue for
+			workqueues visible under /sys/bus/workqueue/.
+
 	x2apic_phys	[X86-64,APIC] Use x2apic physical mode instead of
 			default x2apic cluster mode on platforms
 			supporting x2apic.
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 835d12b..7179756 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -119,10 +119,15 @@ struct delayed_work {
 /*
  * A struct for workqueue attributes.  This can be used to change
  * attributes of an unbound workqueue.
+ *
+ * Unlike other fields, ->no_numa isn't a property of a worker_pool.  It
+ * only modifies how apply_workqueue_attrs() selects pools and thus doesn't
+ * participate in pool hash calculations or equality comparisons.
  */
 struct workqueue_attrs {
 	int			nice;		/* nice level */
 	cpumask_var_t		cpumask;	/* allowed CPUs */
+	bool			no_numa;	/* disable NUMA affinity */
 };
 
 static inline struct delayed_work *to_delayed_work(struct work_struct *work)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0c36327..b48373a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -45,6 +45,7 @@
 #include <linux/hashtable.h>
 #include <linux/rculist.h>
 #include <linux/nodemask.h>
+#include <linux/moduleparam.h>
 
 #include "workqueue_internal.h"
 
@@ -302,6 +303,9 @@ EXPORT_SYMBOL_GPL(system_unbound_wq);
 struct workqueue_struct *system_freezable_wq __read_mostly;
 EXPORT_SYMBOL_GPL(system_freezable_wq);
 
+static bool wq_disable_numa;
+module_param_named(disable_numa, wq_disable_numa, bool, 0444);
+
 static int worker_thread(void *__worker);
 static void copy_workqueue_attrs(struct workqueue_attrs *to,
 				 const struct workqueue_attrs *from);
@@ -516,21 +520,6 @@ static int worker_pool_assign_id(struct worker_pool *pool)
 }
 
 /**
- * first_pwq - return the first pool_workqueue of the specified workqueue
- * @wq: the target workqueue
- *
- * This must be called either with pwq_lock held or sched RCU read locked.
- * If the pwq needs to be used beyond the locking in effect, the caller is
- * responsible for guaranteeing that the pwq stays online.
- */
-static struct pool_workqueue *first_pwq(struct workqueue_struct *wq)
-{
-	assert_rcu_or_pwq_lock();
-	return list_first_or_null_rcu(&wq->pwqs, struct pool_workqueue,
-				      pwqs_node);
-}
-
-/**
  * unbound_pwq_by_node - return the unbound pool_workqueue for the given node
  * @wq: the target workqueue
  * @node: the node ID
@@ -3101,16 +3090,21 @@ static struct device_attribute wq_sysfs_attrs[] = {
 	__ATTR_NULL,
 };
 
-static ssize_t wq_pool_id_show(struct device *dev,
-			       struct device_attribute *attr, char *buf)
+static ssize_t wq_pool_ids_show(struct device *dev,
+				struct device_attribute *attr, char *buf)
 {
 	struct workqueue_struct *wq = dev_to_wq(dev);
-	struct worker_pool *pool;
-	int written;
+	const char *delim = "";
+	int node, written = 0;
 
 	rcu_read_lock_sched();
-	pool = first_pwq(wq)->pool;
-	written = scnprintf(buf, PAGE_SIZE, "%d\n", pool->id);
+	for_each_node(node) {
+		written += scnprintf(buf + written, PAGE_SIZE - written,
+				     "%s%d:%d", delim, node,
+				     unbound_pwq_by_node(wq, node)->pool->id);
+		delim = " ";
+	}
+	written += scnprintf(buf + written, PAGE_SIZE - written, "\n");
 	rcu_read_unlock_sched();
 
 	return written;
@@ -3199,10 +3193,52 @@ static ssize_t wq_cpumask_store(struct device *dev,
 	return ret ?: count;
 }
 
+static ssize_t wq_numa_show(struct device *dev, struct device_attribute *attr,
+			    char *buf)
+{
+	struct workqueue_struct *wq = dev_to_wq(dev);
+	int written;
+
+	mutex_lock(&wq_mutex);
+	written = scnprintf(buf, PAGE_SIZE, "%d\n",
+			    !wq->unbound_attrs->no_numa &&
+			    wq_numa_possible_cpumask);
+	mutex_unlock(&wq_mutex);
+
+	return written;
+}
+
+static ssize_t wq_numa_store(struct device *dev, struct device_attribute *attr,
+			     const char *buf, size_t count)
+{
+	struct workqueue_struct *wq = dev_to_wq(dev);
+	struct workqueue_attrs *attrs;
+	int v, ret;
+
+	attrs = wq_sysfs_prep_attrs(wq);
+	if (!attrs)
+		return -ENOMEM;
+
+	ret = -EINVAL;
+	if (sscanf(buf, "%d", &v) == 1) {
+		if (!v || wq_numa_possible_cpumask) {
+			attrs->no_numa = !v;
+			ret = apply_workqueue_attrs(wq, attrs);
+		} else {
+			printk_ratelimited(KERN_WARNING "workqueue: can't enable NUMA affinity for \"%s\", disabled system-wide\n",
+					   wq->name);
+		}
+	}
+
+	free_workqueue_attrs(attrs);
+	return ret ?: count;
+}
+
 static struct device_attribute wq_sysfs_unbound_attrs[] = {
-	__ATTR(pool_id, 0444, wq_pool_id_show, NULL),
+	__ATTR(pool_ids, 0444, wq_pool_ids_show, NULL),
 	__ATTR(nice, 0644, wq_nice_show, wq_nice_store),
 	__ATTR(cpumask, 0644, wq_cpumask_show, wq_cpumask_store),
+	__ATTR(numa, 0644, wq_numa_show, wq_numa_store),
 	__ATTR_NULL,
 };
 
@@ -3725,6 +3761,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
 {
 	struct pool_workqueue **pwq_tbl = NULL, *dfl_pwq = NULL;
 	struct workqueue_attrs *tmp_attrs = NULL;
+	bool do_numa = !attrs->no_numa && wq_numa_possible_cpumask;
 	int node;
 
 	/* only unbound workqueues can change attributes */
@@ -3740,7 +3777,15 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
 	if (!pwq_tbl || !tmp_attrs)
 		goto enomem;
 
+	/*
+	 * We'll be creating multiple pwqs with differing cpumasks.  Make a
+	 * copy of @attrs which will be modified and used to obtain pools.
+	 * no_numa attribute is special in that it isn't a part of pool
+	 * attributes but modifies how pools are selected in this function.
+	 * Let's not leak no_numa to pool handling functions.
+	 */
 	copy_workqueue_attrs(tmp_attrs, attrs);
+	tmp_attrs->no_numa = false;
 
 	/*
 	 * We want NUMA affinity.  For each node with intersecting possible
@@ -3755,7 +3800,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
 		 * Just fall through if NUMA affinity isn't enabled.  We'll
 		 * end up using the default pwq which is what we want.
 		 */
-		if (wq_numa_possible_cpumask) {
+		if (do_numa) {
 			cpumask_and(cpumask, wq_numa_possible_cpumask[node],
 				    attrs->cpumask);
 			if (cpumask_empty(cpumask))
@@ -4588,22 +4633,28 @@ static int __init init_workqueues(void)
 	 * available.  Build one from cpu_to_node() which should have been
 	 * fully initialized by now.
 	 */
-	wq_numa_possible_cpumask = kzalloc(wq_numa_tbl_len *
-					   sizeof(wq_numa_possible_cpumask[0]),
-					   GFP_KERNEL);
-	BUG_ON(!wq_numa_possible_cpumask);
+	if (!wq_disable_numa) {
+		static cpumask_var_t *tbl;
 
-	for_each_node(node)
-		BUG_ON(!alloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
-					       GFP_KERNEL, node));
-	for_each_possible_cpu(cpu) {
-		node = cpu_to_node(cpu);
-		if (WARN_ON(node == NUMA_NO_NODE)) {
-			pr_err("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
-			wq_numa_possible_cpumask = NULL;
-			break;
+		tbl = kzalloc(wq_numa_tbl_len * sizeof(tbl[0]), GFP_KERNEL);
+		BUG_ON(!tbl);
+
+		for_each_node(node)
+			BUG_ON(!alloc_cpumask_var_node(&tbl[node], GFP_KERNEL,
+						       node));
+		for_each_possible_cpu(cpu) {
+			node = cpu_to_node(cpu);
+			if (WARN_ON(node == NUMA_NO_NODE)) {
+				pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
+				tbl = NULL;
+				break;
+			}
+			cpumask_set_cpu(cpu, tbl[node]);
 		}
-		cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+
+		wq_numa_possible_cpumask = tbl;
+	} else {
+		pr_info("workqueue: NUMA affinity support disabled\n");
 	}
 
 	/* initialize CPU pools */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues
  2013-03-20  0:00 ` Tejun Heo
@ 2013-03-20 12:14   ` Lai Jiangshan
  -1 siblings, 0 replies; 62+ messages in thread
From: Lai Jiangshan @ 2013-03-20 12:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: laijs, axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel,
	herbert, davem, linux-crypto

(off-topic)

Hi, tj,

 I think a0265a7f5161b6cb55e82b71edb236bbe0d9b3ae (tj/for-3.10) is the
wrong direction.  If workqueue_freezing is used only in
freeze_workqueues_begin()/thaw_workqueues(), then either it can be
removed or there is a bug which needs to be fixed.  But
a0265a7f5161b6cb55e82b71edb236bbe0d9b3ae neither removed it nor fixed
it, which is unacceptable to me.  It actually is a bug, and I fixed it
in my patch 1.  Any thoughts?

Thanks,
Lai

On Wed, Mar 20, 2013 at 8:00 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> There are two types of workqueues - per-cpu and unbound.  The former
> is bound to each CPU and the latter isn't not bound to any by default.
> While the recently added attrs support allows unbound workqueues to be
> confined to subset of CPUs, it still is quite cumbersome for
> applications where CPU affinity is too constricted but NUMA locality
> still matters.
>
> This patchset tries to solve that issue by automatically making
> unbound workqueues affine to NUMA nodes by default.  A work item
> queued to an unbound workqueue is executed on one of the CPUs allowed
> by the workqueue in the same node.  If there's none allowed, it may be
> executed on any cpu allowed by the workqueue.  It doesn't require any
> changes on the user side.  Every interface of workqueues functions the
> same as before.
>
> This would be most helpful to subsystems which use some form of async
> execution to process significant amount of data - e.g. crypto and
> btrfs; however, I wanted to find out whether it would make any dent in
> much less favorable use cases.  The following is total run time in
> seconds of buliding allmodconfig kernel w/ -j20 on a dual socket
> opteron machine with writeback thread pool converted to unbound
> workqueue and thus made NUMA-affine.  The file system is ext4 on top
> of a WD SSD.
>
>         before conversion               after conversion
>         1396.126                        1394.763
>         1397.621                        1394.965
>         1399.636                        1394.738
>         1397.463                        1398.162
>         1395.543                        1393.670
>
> AVG     1397.278                        1395.260        DIFF    2.018
> STDEV      1.585                           1.700
>
> And, yes, it actually made things go faster by about 1.2 sigma, which
> isn't completely conclusive but is a pretty good indication that it's
> actually faster.  Note that this is a workload which is dominated by
> CPU time and while there's writeback going on continously it really
> isn't touching too much data or a dominating factor, so the gain is
> understandably small, 0.14%, but hey it's still a gain and it should
> be much more interesting for crypto and btrfs which would actully
> access the data or workloads which are more sensitive to NUMA
> affinity.
>
> The implementation is fairly simple.  After the recent attrs support
> changes, a lot of the differences in pwq (pool_workqueue) handling
> between unbound and per-cpu workqueues are gone.  An unbound workqueue
> still has one "current" pwq that it uses for queueing any new work
> items but can handle multiple pwqs perfectly well while they're
> draining, so this patchset adds pwq dispatch table to unbound
> workqueues which is indexed by NUMA node and points to the matching
> pwq.  Unbound workqueues now simply have multiple "current" pwqs keyed
> by NUMA node.
>
> NUMA affinity can be turned off system-wide by workqueue.disable_numa
> kernel param or per-workqueue using "numa" sysfs file.
>
> This patchset contains the following ten patches.
>
>  0001-workqueue-add-wq_numa_tbl_len-and-wq_numa_possible_c.patch
>  0002-workqueue-drop-H-from-kworker-names-of-unbound-worke.patch
>  0003-workqueue-determine-NUMA-node-of-workers-accourding-.patch
>  0004-workqueue-add-workqueue-unbound_attrs.patch
>  0005-workqueue-make-workqueue-name-fixed-len.patch
>  0006-workqueue-move-hot-fields-of-workqueue_struct-to-the.patch
>  0007-workqueue-map-an-unbound-workqueues-to-multiple-per-.patch
>  0008-workqueue-break-init_and_link_pwq-into-two-functions.patch
>  0009-workqueue-implement-NUMA-affinity-for-unbound-workqu.patch
>  0010-workqueue-update-sysfs-interface-to-reflect-NUMA-awa.patch
>
> 0001 adds basic NUMA topology knoweldge to workqueue.
>
> 0002-0006 are prep patches.
>
> 0007-0009 implement NUMA affinity.
>
> 0010 adds control knobs and updates sysfs interface.
>
> This patchset is on top of
>
>  wq/for-3.10 a0265a7f51 ("workqueue: define workqueue_freezing static variable iff CONFIG_FREEZER")
>
> and also available in the following git branch.
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-numa
>
> diffstat follows.
>
>  Documentation/kernel-parameters.txt |    9
>  include/linux/workqueue.h           |    5
>  kernel/workqueue.c                  |  393 ++++++++++++++++++++++++++++--------
>  3 files changed, 325 insertions(+), 82 deletions(-)
>
> Thanks.
>
> --
> tejun
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
  2013-03-20  0:00   ` Tejun Heo
@ 2013-03-20 14:08     ` JoonSoo Kim
  -1 siblings, 0 replies; 62+ messages in thread
From: JoonSoo Kim @ 2013-03-20 14:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: laijs, axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel,
	herbert, davem, linux-crypto

2013/3/20 Tejun Heo <tj@kernel.org>:
> Unbound workqueues are going to be NUMA-affine.  Add wq_numa_tbl_len
> and wq_numa_possible_cpumask[] in preparation.  The former is the
> highest NUMA node ID + 1 and the latter is masks of possibles CPUs for
> each NUMA node.

It would be better to move this code to topology.c or cpumask.c,
so that it can be used generally.

Thanks.

> This patch only introduces these.  Future patches will make use of
> them.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/workqueue.c | 35 ++++++++++++++++++++++++++++++++++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 775c2f4..9b096e3 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -44,6 +44,7 @@
>  #include <linux/jhash.h>
>  #include <linux/hashtable.h>
>  #include <linux/rculist.h>
> +#include <linux/nodemask.h>
>
>  #include "workqueue_internal.h"
>
> @@ -256,6 +257,11 @@ struct workqueue_struct {
>
>  static struct kmem_cache *pwq_cache;
>
> +static int wq_numa_tbl_len;            /* highest possible NUMA node id + 1 */
> +static cpumask_var_t *wq_numa_possible_cpumask;
> +                                       /* possible CPUs of each node, may
> +                                          be NULL if init failed */
> +
>  static DEFINE_MUTEX(wq_mutex);         /* protects workqueues and pools */
>  static DEFINE_SPINLOCK(pwq_lock);      /* protects pool_workqueues */
>  static DEFINE_SPINLOCK(wq_mayday_lock);        /* protects wq->maydays list */
> @@ -4416,7 +4422,7 @@ out_unlock:
>  static int __init init_workqueues(void)
>  {
>         int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
> -       int i, cpu;
> +       int i, node, cpu;
>
>         /* make sure we have enough bits for OFFQ pool ID */
>         BUILD_BUG_ON((1LU << (BITS_PER_LONG - WORK_OFFQ_POOL_SHIFT)) <
> @@ -4429,6 +4435,33 @@ static int __init init_workqueues(void)
>         cpu_notifier(workqueue_cpu_up_callback, CPU_PRI_WORKQUEUE_UP);
>         hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_DOWN);
>
> +       /* determine NUMA pwq table len - highest node id + 1 */
> +       for_each_node(node)
> +               wq_numa_tbl_len = max(wq_numa_tbl_len, node + 1);
> +
> +       /*
> +        * We want masks of possible CPUs of each node which isn't readily
> +        * available.  Build one from cpu_to_node() which should have been
> +        * fully initialized by now.
> +        */
> +       wq_numa_possible_cpumask = kzalloc(wq_numa_tbl_len *
> +                                          sizeof(wq_numa_possible_cpumask[0]),
> +                                          GFP_KERNEL);
> +       BUG_ON(!wq_numa_possible_cpumask);
> +
> +       for_each_node(node)
> +               BUG_ON(!alloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
> +                                              GFP_KERNEL, node));
> +       for_each_possible_cpu(cpu) {
> +               node = cpu_to_node(cpu);
> +               if (WARN_ON(node == NUMA_NO_NODE)) {
> +                       pr_err("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
> +                       wq_numa_possible_cpumask = NULL;
> +                       break;
> +               }
> +               cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
> +       }
> +
>         /* initialize CPU pools */
>         for_each_possible_cpu(cpu) {
>                 struct worker_pool *pool;
> --
> 1.8.1.4
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
  2013-03-20 14:08     ` JoonSoo Kim
@ 2013-03-20 14:48       ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20 14:48 UTC (permalink / raw)
  To: JoonSoo Kim
  Cc: laijs, axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel,
	herbert, davem, linux-crypto

On Wed, Mar 20, 2013 at 11:08:29PM +0900, JoonSoo Kim wrote:
> 2013/3/20 Tejun Heo <tj@kernel.org>:
> > Unbound workqueues are going to be NUMA-affine.  Add wq_numa_tbl_len
> > and wq_numa_possible_cpumask[] in preparation.  The former is the
> > highest NUMA node ID + 1 and the latter is masks of possibles CPUs for
> > each NUMA node.
> 
> It is better to move this code to topology.c or cpumask.c,
> then it can be generally used.

Yeah, it just isn't clear where it should go.  Most NUMA
initialization happens during early boot, is arch-specific, and
different archs expect NUMA information to be valid at different
points.  We could do it from one of the early initcalls, by which time
all NUMA information should be valid, but having different pieces of
NUMA information coming up at different times seems error-prone.
There would be no apparent indication that certain parts are available
while others are not.

We could solve this by unifying how NUMA information is represented
and initialized.  e.g. if all NUMA archs used
CONFIG_USE_PERCPU_NUMA_NODE_ID, we could simply modify
set_cpu_numa_node() to build all the data structures as NUMA nodes are
discovered.  Unfortunately, this isn't true yet and it's gonna be a
bit of work to get it into a consistent state as it spans multiple
architectures (not too many tho, NUMA fortunately is rare), so if
somebody can clean it up, I'll be happy to move these to topology.
Right now, I think it's best to just carry it in workqueue.c.
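
Just to illustrate the direction, the kind of hook I mean would be
something like the sketch below (not the actual topology.h
implementation; the generic per-node table is hypothetical):

        /* sketch only, not the actual <linux/topology.h> implementation */
        extern cpumask_var_t *node_possible_cpumask;    /* hypothetical generic table */

        void set_cpu_numa_node(int cpu, int node)
        {
                per_cpu(numa_node, cpu) = node;         /* existing behavior */

                /* hypothetical: maintain the generic table as nodes are discovered */
                if (node != NUMA_NO_NODE && node_possible_cpumask)
                        cpumask_set_cpu(cpu, node_possible_cpumask[node]);
        }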

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 09/10] workqueue: implement NUMA affinity for unbound workqueues
  2013-03-20  0:00   ` Tejun Heo
@ 2013-03-20 15:03     ` Lai Jiangshan
  -1 siblings, 0 replies; 62+ messages in thread
From: Lai Jiangshan @ 2013-03-20 15:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: laijs, axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel,
	herbert, davem, linux-crypto

On Wed, Mar 20, 2013 at 8:00 AM, Tejun Heo <tj@kernel.org> wrote:
> Currently, an unbound workqueue has single current, or first, pwq
> (pool_workqueue) to which all new work items are queued.  This often
> isn't optimal on NUMA machines as workers may jump around across node
> boundaries and work items get assigned to workers without any regard
> to NUMA affinity.
>
> This patch implements NUMA affinity for unbound workqueues.  Instead
> of mapping all entries of numa_pwq_tbl[] to the same pwq,
> apply_workqueue_attrs() now creates a separate pwq covering the
> intersecting CPUs for each NUMA node which has possible CPUs in
> @attrs->cpumask.  Nodes which don't have intersecting possible CPUs
> are mapped to pwqs covering whole @attrs->cpumask.
>
> This ensures that all work items issued on a NUMA node are executed on
> the same node as long as the workqueue allows execution on the CPUs of
> the node.
>
> As this maps a workqueue to multiple pwqs and max_active is per-pwq,
> this changes the behavior of max_active.  The limit is now per NUMA
> node instead of global.  While this is an actual change, max_active is
> already per-cpu for per-cpu workqueues and primarily used as a safety
> mechanism rather than for active concurrency control.  Concurrency is
> usually limited from workqueue users by the number of concurrently
> active work items and this change shouldn't matter much.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/workqueue.c | 108 ++++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 86 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index bbbfc92..0c36327 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -3658,13 +3658,13 @@ static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
>         pwq->flush_color = -1;
>         pwq->refcnt = 1;
>         INIT_LIST_HEAD(&pwq->delayed_works);
> +       INIT_LIST_HEAD(&pwq->pwqs_node);
>         INIT_LIST_HEAD(&pwq->mayday_node);
>         INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
>  }
>
>  /* sync @pwq with the current state of its associated wq and link it */
> -static void link_pwq(struct pool_workqueue *pwq,
> -                    struct pool_workqueue **p_last_pwq)
> +static void link_pwq(struct pool_workqueue *pwq)
>  {
>         struct workqueue_struct *wq = pwq->wq;
>
> @@ -3675,8 +3675,6 @@ static void link_pwq(struct pool_workqueue *pwq,
>          * Set the matching work_color.  This is synchronized with
>          * flush_mutex to avoid confusing flush_workqueue().
>          */
> -       if (p_last_pwq)
> -               *p_last_pwq = first_pwq(wq);
>         pwq->work_color = wq->work_color;
>
>         /* sync max_active to the current setting */
> @@ -3712,11 +3710,12 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
>   * @wq: the target workqueue
>   * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
>   *
> - * Apply @attrs to an unbound workqueue @wq.  If @attrs doesn't match the
> - * current attributes, a new pwq is created and made the first pwq which
> - * will serve all new work items.  Older pwqs are released as in-flight
> - * work items finish.  Note that a work item which repeatedly requeues
> - * itself back-to-back will stay on its current pwq.
> + * Apply @attrs to an unbound workqueue @wq.  Unless disabled, on NUMA
> + * machines, this function maps a separate pwq to each NUMA node with
> + * possible CPUs in @attrs->cpumask so that work items are affine to the
> + * NUMA node they were issued on.  Older pwqs are released as in-flight work
> + * items finish.  Note that a work item which repeatedly requeues itself
> + * back-to-back will stay on its current pwq.
>   *
>   * Performs GFP_KERNEL allocations.  Returns 0 on success and -errno on
>   * failure.
> @@ -3724,7 +3723,8 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
>  int apply_workqueue_attrs(struct workqueue_struct *wq,
>                           const struct workqueue_attrs *attrs)
>  {
> -       struct pool_workqueue *pwq, *last_pwq;
> +       struct pool_workqueue **pwq_tbl = NULL, *dfl_pwq = NULL;
> +       struct workqueue_attrs *tmp_attrs = NULL;
>         int node;
>
>         /* only unbound workqueues can change attributes */
> @@ -3735,29 +3735,93 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
>         if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
>                 return -EINVAL;
>
> -       pwq = alloc_unbound_pwq(wq, attrs);
> -       if (!pwq)
> -               return -ENOMEM;
> +       pwq_tbl = kzalloc(wq_numa_tbl_len * sizeof(pwq_tbl[0]), GFP_KERNEL);
> +       tmp_attrs = alloc_workqueue_attrs(GFP_KERNEL);
> +       if (!pwq_tbl || !tmp_attrs)
> +               goto enomem;
> +
> +       copy_workqueue_attrs(tmp_attrs, attrs);
> +
> +       /*
> +        * We want NUMA affinity.  For each node whose possible CPUs
> +        * intersect the requested cpumask, create a separate pwq covering
> +        * the intersection.  Nodes without an intersection are covered by
> +        * the default pwq covering the whole requested cpumask.
> +        */
> +       for_each_node(node) {
> +               cpumask_t *cpumask = tmp_attrs->cpumask;
> +
> +               /*
> +                * Just fall through if NUMA affinity isn't enabled.  We'll
> +                * end up using the default pwq which is what we want.
> +                */
> +               if (wq_numa_possible_cpumask) {
> +                       cpumask_and(cpumask, wq_numa_possible_cpumask[node],
> +                                   attrs->cpumask);
> +                       if (cpumask_empty(cpumask))
> +                               cpumask_copy(cpumask, attrs->cpumask);
> +               }
> +
> +               if (cpumask_equal(cpumask, attrs->cpumask)) {
> +                       if (!dfl_pwq) {
> +                               dfl_pwq = alloc_unbound_pwq(wq, tmp_attrs);
> +                               if (!dfl_pwq)
> +                                       goto enomem;
> +                       } else {
> +                               dfl_pwq->refcnt++;
> +                       }
> +                       pwq_tbl[node] = dfl_pwq;
> +               } else {
> +                       pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
> +                       if (!pwq_tbl[node])
> +                               goto enomem;
> +               }
> +       }
>
> +       /* all pwqs have been created successfully, let's install'em */
>         mutex_lock(&wq->flush_mutex);
>         spin_lock_irq(&pwq_lock);
>
> -       link_pwq(pwq, &last_pwq);
> +       /* @attrs is now current */
> +       copy_workqueue_attrs(wq->unbound_attrs, attrs);
>
> -       copy_workqueue_attrs(wq->unbound_attrs, pwq->pool->attrs);
> -       for_each_node(node)
> -               rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
> +       for_each_node(node) {
> +               struct pool_workqueue *pwq;
> +
> +               /* each new pwq should be linked once */
> +               if (list_empty(&pwq_tbl[node]->pwqs_node))
> +                       link_pwq(pwq_tbl[node]);
> +
> +               /* save the previous pwq and install the new one */
> +               pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
> +               rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq_tbl[node]);
> +               pwq_tbl[node] = pwq;
> +       }
>
>         spin_unlock_irq(&pwq_lock);
>         mutex_unlock(&wq->flush_mutex);
>
> -       if (last_pwq) {
> -               spin_lock_irq(&last_pwq->pool->lock);
> -               put_pwq(last_pwq);
> -               spin_unlock_irq(&last_pwq->pool->lock);
> +       /* put the old pwqs */
> +       for_each_node(node) {
> +               struct pool_workqueue *pwq = pwq_tbl[node];
> +
> +               if (pwq) {
> +                       spin_lock_irq(&pwq->pool->lock);
> +                       put_pwq(pwq);
> +                       spin_unlock_irq(&pwq->pool->lock);
> +               }
>         }
>
>         return 0;
> +
> +enomem:
> +       free_workqueue_attrs(tmp_attrs);
> +       if (pwq_tbl) {
> +               for_each_node(node)
> +                       kfree(pwq_tbl[node]);

It will free dfl_pwq multiple times.

> +               kfree(pwq_tbl);
> +       }
> +       return -ENOMEM;
>  }
>
>  static int alloc_and_link_pwqs(struct workqueue_struct *wq)
> @@ -3781,7 +3845,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
>                         mutex_lock(&wq->flush_mutex);
>                         spin_lock_irq(&pwq_lock);
>
> -                       link_pwq(pwq, NULL);
> +                       link_pwq(pwq);
>
>                         spin_unlock_irq(&pwq_lock);
>                         mutex_unlock(&wq->flush_mutex);
> --
> 1.8.1.4
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 09/10] workqueue: implement NUMA affinity for unbound workqueues
  2013-03-20 15:03     ` Lai Jiangshan
@ 2013-03-20 15:05       ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20 15:05 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: laijs, axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel,
	herbert, davem, linux-crypto

On Wed, Mar 20, 2013 at 11:03:53PM +0800, Lai Jiangshan wrote:
> > +enomem:
> > +       free_workqueue_attrs(tmp_attrs);
> > +       if (pwq_tbl) {
> > +               for_each_node(node)
> > +                       kfree(pwq_tbl[node]);
> 
> It will free the dfl_pwq multi times.

Oops, you're right.  We need to do


	for_each_node(node)
		if (pwq_tbl[node] != dfl_pwq)
			kfree(pwq_tbl[node]);
	kfree(dfl_pwq);

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 09/10] workqueue: implement NUMA affinity for unbound workqueues
  2013-03-20 15:05       ` Tejun Heo
@ 2013-03-20 15:26         ` Lai Jiangshan
  -1 siblings, 0 replies; 62+ messages in thread
From: Lai Jiangshan @ 2013-03-20 15:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: laijs, axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel,
	herbert, davem, linux-crypto

On Wed, Mar 20, 2013 at 11:05 PM, Tejun Heo <tj@kernel.org> wrote:
> On Wed, Mar 20, 2013 at 11:03:53PM +0800, Lai Jiangshan wrote:
>> > +enomem:
>> > +       free_workqueue_attrs(tmp_attrs);
>> > +       if (pwq_tbl) {
>> > +               for_each_node(node)
>> > +                       kfree(pwq_tbl[node]);
>>
>> It will free the dfl_pwq multi times.
>
> Oops, you're right.  We need to do
>
>
>         for_each_node(node)
>                 if (pwq_tbl[node] != dfl_pwq)
>                         kfree(pwq_tbl[node]);
>         kfree(dfl_pwq);

I also missed one thing:
we still need to call put_unbound_pool() before freeing the pwq.

>
> Thanks!
>
> --
> tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 09/10] workqueue: implement NUMA affinity for unbound workqueues
  2013-03-20 15:26         ` Lai Jiangshan
@ 2013-03-20 15:32           ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20 15:32 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: laijs, axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel,
	herbert, davem, linux-crypto

On Wed, Mar 20, 2013 at 8:26 AM, Lai Jiangshan <eag0628@gmail.com> wrote:
>>         for_each_node(node)
>>                 if (pwq_tbl[node] != dfl_pwq)
>>                         kfree(pwq_tbl[node]);
>>         kfree(dfl_pwq);
>
> I also missed.
> we still need to put_unbound_pool() before free(pwq)

Yeap, we do.  I wonder whether we can just use put_pwq() there.  I'll see if I can.
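
For reference, the direction might look roughly like the sketch below
(free_unbound_pwq() is just a hypothetical name for such a helper, not
something from the posted patches; whether plain put_pwq() can replace
it is exactly the open question):

        /* hypothetical helper: release a pwq that was never installed */
        static void free_unbound_pwq(struct pool_workqueue *pwq)
        {
                if (!pwq)
                        return;
                put_unbound_pool(pwq->pool);            /* drop the pool ref first */
                kmem_cache_free(pwq_cache, pwq);
        }

        /* ... and the enomem path in apply_workqueue_attrs() could become: */
        enomem:
                free_workqueue_attrs(tmp_attrs);
                if (pwq_tbl) {
                        for_each_node(node)
                                if (pwq_tbl[node] != dfl_pwq)
                                        free_unbound_pwq(pwq_tbl[node]);
                        free_unbound_pwq(dfl_pwq);
                        kfree(pwq_tbl);
                }
                return -ENOMEM;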

-- 
tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
  2013-03-20  0:00   ` Tejun Heo
@ 2013-03-20 15:43     ` Lai Jiangshan
  -1 siblings, 0 replies; 62+ messages in thread
From: Lai Jiangshan @ 2013-03-20 15:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: laijs, axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel,
	herbert, davem, linux-crypto

On Wed, Mar 20, 2013 at 8:00 AM, Tejun Heo <tj@kernel.org> wrote:
> Unbound workqueues are going to be NUMA-affine.  Add wq_numa_tbl_len
> and wq_numa_possible_cpumask[] in preparation.  The former is the
> highest NUMA node ID + 1 and the latter is masks of possible CPUs for
> each NUMA node.
>
> This patch only introduces these.  Future patches will make use of
> them.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/workqueue.c | 35 ++++++++++++++++++++++++++++++++++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 775c2f4..9b096e3 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -44,6 +44,7 @@
>  #include <linux/jhash.h>
>  #include <linux/hashtable.h>
>  #include <linux/rculist.h>
> +#include <linux/nodemask.h>
>
>  #include "workqueue_internal.h"
>
> @@ -256,6 +257,11 @@ struct workqueue_struct {
>
>  static struct kmem_cache *pwq_cache;
>
> +static int wq_numa_tbl_len;            /* highest possible NUMA node id + 1 */
> +static cpumask_var_t *wq_numa_possible_cpumask;
> +                                       /* possible CPUs of each node, may
> +                                          be NULL if init failed */
> +
>  static DEFINE_MUTEX(wq_mutex);         /* protects workqueues and pools */
>  static DEFINE_SPINLOCK(pwq_lock);      /* protects pool_workqueues */
>  static DEFINE_SPINLOCK(wq_mayday_lock);        /* protects wq->maydays list */
> @@ -4416,7 +4422,7 @@ out_unlock:
>  static int __init init_workqueues(void)
>  {
>         int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
> -       int i, cpu;
> +       int i, node, cpu;
>
>         /* make sure we have enough bits for OFFQ pool ID */
>         BUILD_BUG_ON((1LU << (BITS_PER_LONG - WORK_OFFQ_POOL_SHIFT)) <
> @@ -4429,6 +4435,33 @@ static int __init init_workqueues(void)
>         cpu_notifier(workqueue_cpu_up_callback, CPU_PRI_WORKQUEUE_UP);
>         hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_DOWN);
>
> +       /* determine NUMA pwq table len - highest node id + 1 */
> +       for_each_node(node)
> +               wq_numa_tbl_len = max(wq_numa_tbl_len, node + 1);
> +
> +       /*
> +        * We want masks of possible CPUs of each node which isn't readily
> +        * available.  Build one from cpu_to_node() which should have been
> +        * fully initialized by now.
> +        */
> +       wq_numa_possible_cpumask = kzalloc(wq_numa_tbl_len *
> +                                          sizeof(wq_numa_possible_cpumask[0]),
> +                                          GFP_KERNEL);
> +       BUG_ON(!wq_numa_possible_cpumask);
> +
> +       for_each_node(node)
> +               BUG_ON(!alloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
> +                                              GFP_KERNEL, node));
> +       for_each_possible_cpu(cpu) {
> +               node = cpu_to_node(cpu);
> +               if (WARN_ON(node == NUMA_NO_NODE)) {
> +                       pr_err("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);

Since you used WARN_ON(), not BUG_ON(), I think you need to free the
allocated memory here.
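
e.g. something roughly like this inside the WARN_ON() branch (just a
sketch of the suggestion, reusing the names from the patch above):

                /* sketch: back out the allocations if NUMA info is unusable */
                for_each_node(node)
                        free_cpumask_var(wq_numa_possible_cpumask[node]);
                kfree(wq_numa_possible_cpumask);
                wq_numa_possible_cpumask = NULL;
                break;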

> +                       wq_numa_possible_cpumask = NULL;
> +                       break;
> +               }
> +               cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
> +       }
> +
>         /* initialize CPU pools */
>         for_each_possible_cpu(cpu) {
>                 struct worker_pool *pool;
> --
> 1.8.1.4
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
  2013-03-20 15:43     ` Lai Jiangshan
@ 2013-03-20 15:48       ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20 15:48 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: laijs, axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel,
	herbert, davem, linux-crypto

On Wed, Mar 20, 2013 at 11:43:30PM +0800, Lai Jiangshan wrote:
> > +       for_each_node(node)
> > +               BUG_ON(!alloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
> > +                                              GFP_KERNEL, node));
> > +       for_each_possible_cpu(cpu) {
> > +               node = cpu_to_node(cpu);
> > +               if (WARN_ON(node == NUMA_NO_NODE)) {
> > +                       pr_err("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
> 
> since you used WARN_ON(), not BUG_ON(), I think you need to free
> allocated memory here.

I don't know.  Is it necessary?  It shouldn't happen sans bugs in arch
code, and we're already vomiting a warning message with a full stack
trace.  The system will function but something is seriously screwed.
I don't think it matters whether we free the puny number of bytes
there or not, and I don't wanna nest deeper there. :P

-- 
tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 08/10] workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
  2013-03-20  0:00   ` Tejun Heo
@ 2013-03-20 15:52     ` Lai Jiangshan
  -1 siblings, 0 replies; 62+ messages in thread
From: Lai Jiangshan @ 2013-03-20 15:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: laijs, axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel,
	herbert, davem, linux-crypto

On Wed, Mar 20, 2013 at 8:00 AM, Tejun Heo <tj@kernel.org> wrote:
> Break init_and_link_pwq() into init_pwq() and link_pwq() and move
> unbound-workqueue specific handling into apply_workqueue_attrs().
> Also, factor out unbound pool and pool_workqueue allocation into
> alloc_unbound_pwq().
>
> This reorganization is to prepare for NUMA affinity and doesn't
> introduce any functional changes.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/workqueue.c | 75 +++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 52 insertions(+), 23 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 3f820a5..bbbfc92 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -3647,13 +3647,10 @@ static void pwq_adjust_max_active(struct pool_workqueue *pwq)
>         spin_unlock(&pwq->pool->lock);
>  }
>
> -static void init_and_link_pwq(struct pool_workqueue *pwq,
> -                             struct workqueue_struct *wq,
> -                             struct worker_pool *pool,
> -                             struct pool_workqueue **p_last_pwq)
> +/* initialize newly zalloced @pwq which is associated with @wq and @pool */
> +static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
> +                    struct worker_pool *pool)
>  {
> -       int node;
> -
>         BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
>
>         pwq->pool = pool;
> @@ -3663,9 +3660,16 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
>         INIT_LIST_HEAD(&pwq->delayed_works);
>         INIT_LIST_HEAD(&pwq->mayday_node);
>         INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
> +}
>
> -       mutex_lock(&wq->flush_mutex);
> -       spin_lock_irq(&pwq_lock);
> +/* sync @pwq with the current state of its associated wq and link it */
> +static void link_pwq(struct pool_workqueue *pwq,
> +                    struct pool_workqueue **p_last_pwq)
> +{
> +       struct workqueue_struct *wq = pwq->wq;
> +
> +       lockdep_assert_held(&wq->flush_mutex);
> +       lockdep_assert_held(&pwq_lock);
>
>         /*
>          * Set the matching work_color.  This is synchronized with
> @@ -3680,15 +3684,27 @@ static void init_and_link_pwq(struct pool_workqueue *pwq,
>
>         /* link in @pwq */
>         list_add_rcu(&pwq->pwqs_node, &wq->pwqs);
> +}
>
> -       if (wq->flags & WQ_UNBOUND) {
> -               copy_workqueue_attrs(wq->unbound_attrs, pool->attrs);
> -               for_each_node(node)
> -                       rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
> +/* obtain a pool matching @attr and create a pwq associating the pool and @wq */
> +static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
> +                                       const struct workqueue_attrs *attrs)
> +{
> +       struct worker_pool *pool;
> +       struct pool_workqueue *pwq;
> +
> +       pool = get_unbound_pool(attrs);
> +       if (!pool)
> +               return NULL;
> +
> +       pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);

This allocation is not NUMA-aware; you could use pool->node here.
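
i.e. roughly the following (just a sketch; it assumes pool->node records
the pool's NUMA node, as set up earlier in the series):

        /* allocate the pwq on the same node as the pool it will serve */
        pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL | __GFP_ZERO,
                                    pool->node);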

> +       if (!pwq) {
> +               put_unbound_pool(pool);
> +               return NULL;
>         }
>
> -       spin_unlock_irq(&pwq_lock);
> -       mutex_unlock(&wq->flush_mutex);
> +       init_pwq(pwq, wq, pool);
> +       return pwq;
>  }
>
>  /**
> @@ -3709,7 +3725,7 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
>                           const struct workqueue_attrs *attrs)
>  {
>         struct pool_workqueue *pwq, *last_pwq;
> -       struct worker_pool *pool;
> +       int node;
>
>         /* only unbound workqueues can change attributes */
>         if (WARN_ON(!(wq->flags & WQ_UNBOUND)))
> @@ -3719,17 +3735,22 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
>         if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
>                 return -EINVAL;
>
> -       pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
> +       pwq = alloc_unbound_pwq(wq, attrs);
>         if (!pwq)
>                 return -ENOMEM;
>
> -       pool = get_unbound_pool(attrs);
> -       if (!pool) {
> -               kmem_cache_free(pwq_cache, pwq);
> -               return -ENOMEM;
> -       }
> +       mutex_lock(&wq->flush_mutex);
> +       spin_lock_irq(&pwq_lock);
> +
> +       link_pwq(pwq, &last_pwq);
> +
> +       copy_workqueue_attrs(wq->unbound_attrs, pwq->pool->attrs);
> +       for_each_node(node)
> +               rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
> +
> +       spin_unlock_irq(&pwq_lock);
> +       mutex_unlock(&wq->flush_mutex);
>
> -       init_and_link_pwq(pwq, wq, pool, &last_pwq);
>         if (last_pwq) {
>                 spin_lock_irq(&last_pwq->pool->lock);
>                 put_pwq(last_pwq);
> @@ -3755,7 +3776,15 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq)
>                         struct worker_pool *cpu_pools =
>                                 per_cpu(cpu_worker_pools, cpu);
>
> -                       init_and_link_pwq(pwq, wq, &cpu_pools[highpri], NULL);
> +                       init_pwq(pwq, wq, &cpu_pools[highpri]);
> +
> +                       mutex_lock(&wq->flush_mutex);
> +                       spin_lock_irq(&pwq_lock);
> +
> +                       link_pwq(pwq, NULL);
> +
> +                       spin_unlock_irq(&pwq_lock);
> +                       mutex_unlock(&wq->flush_mutex);
>                 }
>                 return 0;
>         } else {
> --
> 1.8.1.4
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 08/10] workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
  2013-03-20 15:52     ` Lai Jiangshan
@ 2013-03-20 16:04       ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20 16:04 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: laijs, axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel,
	herbert, davem, linux-crypto

On Wed, Mar 20, 2013 at 11:52:03PM +0800, Lai Jiangshan wrote:
> > +static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
> > +                                       const struct workqueue_attrs *attrs)
> > +{
> > +       struct worker_pool *pool;
> > +       struct pool_workqueue *pwq;
> > +
> > +       pool = get_unbound_pool(attrs);
> > +       if (!pool)
> > +               return NULL;
> > +
> > +       pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
> 
> this allocation is not NUMA-aware; you may use pool->node here.

Nice catch.  Will do.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
  2013-03-20 15:48       ` Tejun Heo
@ 2013-03-20 16:43         ` Lai Jiangshan
  -1 siblings, 0 replies; 62+ messages in thread
From: Lai Jiangshan @ 2013-03-20 16:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: laijs, axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel,
	herbert, davem, linux-crypto

On Wed, Mar 20, 2013 at 11:48 PM, Tejun Heo <tj@kernel.org> wrote:
> On Wed, Mar 20, 2013 at 11:43:30PM +0800, Lai Jiangshan wrote:
>> > +       for_each_node(node)
>> > +               BUG_ON(!alloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
>> > +                                              GFP_KERNEL, node));
>> > +       for_each_possible_cpu(cpu) {
>> > +               node = cpu_to_node(cpu);
>> > +               if (WARN_ON(node == NUMA_NO_NODE)) {
>> > +                       pr_err("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
>>
>> since you used WARN_ON(), not BUG_ON(), I think you need to free
>> allocated memory here.
>
> I don't know.  Is it necessary?  It shouldn't happen sans bugs in arch
> code and we're vomiting warning message with full stack trace.  The
> system will function but something is seriously screwed.  I don't
> think it matters whether we free the puny number of bytes there or
> not and I don't wanna nest deeper there. :P


I agree with you, but many people run semantic analysis tools over the
kernel hunting for issues like this; they may bring it up with you again.
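For reference, a minimal sketch of what freeing the masks in that branch
could look like (illustrative only; wq_numa_possible_cpumask[] is from the
patch, the surrounding context is assumed):

	for_each_node(node)
		free_cpumask_var(wq_numa_possible_cpumask[node]);
	/* and leave NUMA support disabled from here on */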

>
> --
> tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v2 09/10] workqueue: implement NUMA affinity for unbound workqueues
  2013-03-20  0:00   ` Tejun Heo
@ 2013-03-20 17:08     ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20 17:08 UTC (permalink / raw)
  To: laijs
  Cc: axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel, herbert,
	davem, linux-crypto

Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued.  This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues.  Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has possible CPUs in
@attrs->cpumask.  Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.

This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
that node.
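
As a purely illustrative sketch (not part of the patch) of what this means
at queueing time - the helper below is made up, and the per-cpu branch
assumes the existing per-cpu pwq pointer:

	/* callers would hold the appropriate RCU read-side protection */
	static struct pool_workqueue *pwq_for_cpu(struct workqueue_struct *wq,
						  int cpu)
	{
		if (wq->flags & WQ_UNBOUND)
			return rcu_dereference(wq->numa_pwq_tbl[cpu_to_node(cpu)]);
		return per_cpu_ptr(wq->cpu_pwqs, cpu);	/* assumed per-cpu field */
	}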

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active.  The limit is now per NUMA
node instead of global.  While this is an actual behavior change,
max_active is already per-cpu for per-cpu workqueues and is primarily
used as a safety mechanism rather than for active concurrency control.
Concurrency is usually limited by workqueue users through the number of
concurrently active work items, so this change shouldn't matter much.
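
For example (illustrative numbers): a workqueue created with
max_active = 16 on a two-node machine can now have up to 32 work items
active at once, 16 per node, where it was previously capped at 16
globally.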

v2: Fixed pwq freeing in apply_workqueue_attrs() error path.  Spotted
    by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <eag0628@gmail.com>
---
 kernel/workqueue.c |  120 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 98 insertions(+), 22 deletions(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3658,13 +3658,13 @@ static void init_pwq(struct pool_workque
 	pwq->flush_color = -1;
 	pwq->refcnt = 1;
 	INIT_LIST_HEAD(&pwq->delayed_works);
+	INIT_LIST_HEAD(&pwq->pwqs_node);
 	INIT_LIST_HEAD(&pwq->mayday_node);
 	INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
 }
 
 /* sync @pwq with the current state of its associated wq and link it */
-static void link_pwq(struct pool_workqueue *pwq,
-		     struct pool_workqueue **p_last_pwq)
+static void link_pwq(struct pool_workqueue *pwq)
 {
 	struct workqueue_struct *wq = pwq->wq;
 
@@ -3675,8 +3675,6 @@ static void link_pwq(struct pool_workque
 	 * Set the matching work_color.  This is synchronized with
 	 * flush_mutex to avoid confusing flush_workqueue().
 	 */
-	if (p_last_pwq)
-		*p_last_pwq = first_pwq(wq);
 	pwq->work_color = wq->work_color;
 
 	/* sync max_active to the current setting */
@@ -3707,16 +3705,26 @@ static struct pool_workqueue *alloc_unbo
 	return pwq;
 }
 
+/* undo alloc_unbound_pwq(), used only in the error path */
+static void free_unbound_pwq(struct pool_workqueue *pwq)
+{
+	if (pwq) {
+		put_unbound_pool(pwq->pool);
+		kfree(pwq);
+	}
+}
+
 /**
  * apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
  * @wq: the target workqueue
  * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
  *
- * Apply @attrs to an unbound workqueue @wq.  If @attrs doesn't match the
- * current attributes, a new pwq is created and made the first pwq which
- * will serve all new work items.  Older pwqs are released as in-flight
- * work items finish.  Note that a work item which repeatedly requeues
- * itself back-to-back will stay on its current pwq.
+ * Apply @attrs to an unbound workqueue @wq.  Unless disabled, on NUMA
+ * machines, this function maps a separate pwq to each NUMA node with
+ * possible CPUs in @attrs->cpumask so that work items are affine to the
+ * NUMA node they are issued on.  Older pwqs are released as in-flight work
+ * items finish.  Note that a work item which repeatedly requeues itself
+ * back-to-back will stay on its current pwq.
  *
  * Performs GFP_KERNEL allocations.  Returns 0 on success and -errno on
  * failure.
@@ -3724,7 +3732,8 @@ static struct pool_workqueue *alloc_unbo
 int apply_workqueue_attrs(struct workqueue_struct *wq,
 			  const struct workqueue_attrs *attrs)
 {
-	struct pool_workqueue *pwq, *last_pwq;
+	struct pool_workqueue **pwq_tbl = NULL, *dfl_pwq = NULL;
+	struct workqueue_attrs *tmp_attrs = NULL;
 	int node;
 
 	/* only unbound workqueues can change attributes */
@@ -3735,29 +3744,96 @@ int apply_workqueue_attrs(struct workque
 	if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
 		return -EINVAL;
 
-	pwq = alloc_unbound_pwq(wq, attrs);
-	if (!pwq)
-		return -ENOMEM;
+	pwq_tbl = kzalloc(wq_numa_tbl_len * sizeof(pwq_tbl[0]), GFP_KERNEL);
+	tmp_attrs = alloc_workqueue_attrs(GFP_KERNEL);
+	if (!pwq_tbl || !tmp_attrs)
+		goto enomem;
+
+	copy_workqueue_attrs(tmp_attrs, attrs);
+
+	/*
+	 * We want NUMA affinity.  For each node with intersecting possible
+	 * CPUs with the requested cpumask, create a separate pwq covering
+	 * the intersection.  Nodes without intersection are covered by
+	 * the default pwq covering the whole requested cpumask.
+	 */
+	for_each_node(node) {
+		cpumask_t *cpumask = tmp_attrs->cpumask;
+
+		/*
+		 * Just fall through if NUMA affinity isn't enabled.  We'll
+		 * end up using the default pwq which is what we want.
+		 */
+		if (wq_numa_possible_cpumask) {
+			cpumask_and(cpumask, wq_numa_possible_cpumask[node],
+				    attrs->cpumask);
+			if (cpumask_empty(cpumask))
+				cpumask_copy(cpumask, attrs->cpumask);
+		}
+
+		if (cpumask_equal(cpumask, attrs->cpumask)) {
+			if (!dfl_pwq) {
+				dfl_pwq = alloc_unbound_pwq(wq, tmp_attrs);
+				if (!dfl_pwq)
+					goto enomem;
+			} else {
+				dfl_pwq->refcnt++;
+			}
+			pwq_tbl[node] = dfl_pwq;
+		} else {
+			pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
+			if (!pwq_tbl[node])
+				goto enomem;
+		}
+	}
 
+	/* all pwqs have been created successfully, let's install'em */
 	mutex_lock(&wq->flush_mutex);
 	spin_lock_irq(&pwq_lock);
 
-	link_pwq(pwq, &last_pwq);
+	/* @attrs is now current */
+	copy_workqueue_attrs(wq->unbound_attrs, attrs);
 
-	copy_workqueue_attrs(wq->unbound_attrs, pwq->pool->attrs);
-	for_each_node(node)
-		rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+	for_each_node(node) {
+		struct pool_workqueue *pwq;
+
+		/* each new pwq should be linked once */
+		if (list_empty(&pwq_tbl[node]->pwqs_node))
+			link_pwq(pwq_tbl[node]);
+
+		/* save the previous pwq and install the new one */
+		pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
+		rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq_tbl[node]);
+		pwq_tbl[node] = pwq;
+	}
 
 	spin_unlock_irq(&pwq_lock);
 	mutex_unlock(&wq->flush_mutex);
 
-	if (last_pwq) {
-		spin_lock_irq(&last_pwq->pool->lock);
-		put_pwq(last_pwq);
-		spin_unlock_irq(&last_pwq->pool->lock);
+	/* put the old pwqs */
+	for_each_node(node) {
+		struct pool_workqueue *pwq = pwq_tbl[node];
+
+		if (pwq) {
+			spin_lock_irq(&pwq->pool->lock);
+			put_pwq(pwq);
+			spin_unlock_irq(&pwq->pool->lock);
+		}
 	}
 
 	return 0;
+
+enomem:
+	free_workqueue_attrs(tmp_attrs);
+	if (pwq_tbl) {
+		for_each_node(node) {
+			if (pwq_tbl[node] != dfl_pwq)
+				free_unbound_pwq(pwq_tbl[node]);
+			free_unbound_pwq(dfl_pwq);
+		}
+		kfree(pwq_tbl);
+	}
+	return -ENOMEM;
 }
 
 static int alloc_and_link_pwqs(struct workqueue_struct *wq)
@@ -3781,7 +3857,7 @@ static int alloc_and_link_pwqs(struct wo
 			mutex_lock(&wq->flush_mutex);
 			spin_lock_irq(&pwq_lock);
 
-			link_pwq(pwq, NULL);
+			link_pwq(pwq);
 
 			spin_unlock_irq(&pwq_lock);
 			mutex_unlock(&wq->flush_mutex);

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 11/10] workqueue: use NUMA-aware allocation for pool_workqueues workqueues
  2013-03-20  0:00 ` Tejun Heo
@ 2013-03-20 17:08   ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20 17:08 UTC (permalink / raw)
  To: laijs
  Cc: axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel, herbert,
	davem, linux-crypto

Use kmem_cache_alloc_node() with @pool->node instead of
kmem_cache_zalloc() when allocating a pool_workqueue so that it's
allocated on the same node as the associated worker_pool.  As there's
no kmem_cache_zalloc_node(), move zeroing to init_pwq().

This was suggested by Lai Jiangshan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <eag0628@gmail.com>
---
 kernel/workqueue.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3683,12 +3683,14 @@ static void pwq_adjust_max_active(struct
 	spin_unlock(&pwq->pool->lock);
 }
 
-/* initialize newly zalloced @pwq which is associated with @wq and @pool */
+/* initialize newly alloced @pwq which is associated with @wq and @pool */
 static void init_pwq(struct pool_workqueue *pwq, struct workqueue_struct *wq,
 		     struct worker_pool *pool)
 {
 	BUG_ON((unsigned long)pwq & WORK_STRUCT_FLAG_MASK);
 
+	memset(pwq, 0, sizeof(*pwq));
+
 	pwq->pool = pool;
 	pwq->wq = wq;
 	pwq->flush_color = -1;
@@ -3731,7 +3733,7 @@ static struct pool_workqueue *alloc_unbo
 	if (!pool)
 		return NULL;
 
-	pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL);
+	pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool->node);
 	if (!pwq) {
 		put_unbound_pool(pool);
 		return NULL;

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v2 UPDATED 09/10] workqueue: implement NUMA affinity for unbound workqueues
  2013-03-20 17:08     ` Tejun Heo
@ 2013-03-20 18:54       ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20 18:54 UTC (permalink / raw)
  To: laijs
  Cc: axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel, herbert,
	davem, linux-crypto

Currently, an unbound workqueue has a single current, or first, pwq
(pool_workqueue) to which all new work items are queued.  This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.

This patch implements NUMA affinity for unbound workqueues.  Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has possible CPUs in
@attrs->cpumask.  Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.

This ensures that all work items issued on a NUMA node are executed on
the same node as long as the workqueue allows execution on the CPUs of
that node.

As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this changes the behavior of max_active.  The limit is now per NUMA
node instead of global.  While this is an actual behavior change,
max_active is already per-cpu for per-cpu workqueues and is primarily
used as a safety mechanism rather than for active concurrency control.
Concurrency is usually limited by workqueue users through the number of
concurrently active work items, so this change shouldn't matter much.

v2: Fixed pwq freeing in apply_workqueue_attrs() error path.  Spotted
    by Lai.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <eag0628@gmail.com>
---
Please forget about the previous posting.  It was freeing dfl_pwq
multiple times.  This one, hopefully, is correct.

Thanks.

 kernel/workqueue.c |  119 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 97 insertions(+), 22 deletions(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3666,13 +3666,13 @@ static void init_pwq(struct pool_workque
 	pwq->flush_color = -1;
 	pwq->refcnt = 1;
 	INIT_LIST_HEAD(&pwq->delayed_works);
+	INIT_LIST_HEAD(&pwq->pwqs_node);
 	INIT_LIST_HEAD(&pwq->mayday_node);
 	INIT_WORK(&pwq->unbound_release_work, pwq_unbound_release_workfn);
 }
 
 /* sync @pwq with the current state of its associated wq and link it */
-static void link_pwq(struct pool_workqueue *pwq,
-		     struct pool_workqueue **p_last_pwq)
+static void link_pwq(struct pool_workqueue *pwq)
 {
 	struct workqueue_struct *wq = pwq->wq;
 
@@ -3683,8 +3683,6 @@ static void link_pwq(struct pool_workque
 	 * Set the matching work_color.  This is synchronized with
 	 * flush_mutex to avoid confusing flush_workqueue().
 	 */
-	if (p_last_pwq)
-		*p_last_pwq = first_pwq(wq);
 	pwq->work_color = wq->work_color;
 
 	/* sync max_active to the current setting */
@@ -3715,16 +3713,26 @@ static struct pool_workqueue *alloc_unbo
 	return pwq;
 }
 
+/* undo alloc_unbound_pwq(), used only in the error path */
+static void free_unbound_pwq(struct pool_workqueue *pwq)
+{
+	if (pwq) {
+		put_unbound_pool(pwq->pool);
+		kfree(pwq);
+	}
+}
+
 /**
  * apply_workqueue_attrs - apply new workqueue_attrs to an unbound workqueue
  * @wq: the target workqueue
  * @attrs: the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()
  *
- * Apply @attrs to an unbound workqueue @wq.  If @attrs doesn't match the
- * current attributes, a new pwq is created and made the first pwq which
- * will serve all new work items.  Older pwqs are released as in-flight
- * work items finish.  Note that a work item which repeatedly requeues
- * itself back-to-back will stay on its current pwq.
+ * Apply @attrs to an unbound workqueue @wq.  Unless disabled, on NUMA
+ * machines, this function maps a separate pwq to each NUMA node with
+ * possible CPUs in @attrs->cpumask so that work items are affine to the
+ * NUMA node they are issued on.  Older pwqs are released as in-flight work
+ * items finish.  Note that a work item which repeatedly requeues itself
+ * back-to-back will stay on its current pwq.
  *
  * Performs GFP_KERNEL allocations.  Returns 0 on success and -errno on
  * failure.
@@ -3732,7 +3740,8 @@ static struct pool_workqueue *alloc_unbo
 int apply_workqueue_attrs(struct workqueue_struct *wq,
 			  const struct workqueue_attrs *attrs)
 {
-	struct pool_workqueue *pwq, *last_pwq;
+	struct pool_workqueue **pwq_tbl = NULL, *dfl_pwq = NULL;
+	struct workqueue_attrs *tmp_attrs = NULL;
 	int node;
 
 	/* only unbound workqueues can change attributes */
@@ -3743,29 +3752,95 @@ int apply_workqueue_attrs(struct workque
 	if (WARN_ON((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs)))
 		return -EINVAL;
 
-	pwq = alloc_unbound_pwq(wq, attrs);
-	if (!pwq)
-		return -ENOMEM;
+	pwq_tbl = kzalloc(wq_numa_tbl_len * sizeof(pwq_tbl[0]), GFP_KERNEL);
+	tmp_attrs = alloc_workqueue_attrs(GFP_KERNEL);
+	if (!pwq_tbl || !tmp_attrs)
+		goto enomem;
+
+	copy_workqueue_attrs(tmp_attrs, attrs);
+
+	/*
+	 * We want NUMA affinity.  For each node with intersecting possible
+	 * CPUs with the requested cpumask, create a separate pwq covering
+	 * the intersection.  Nodes without intersection are covered by
+	 * the default pwq covering the whole requested cpumask.
+	 */
+	for_each_node(node) {
+		cpumask_t *cpumask = tmp_attrs->cpumask;
+
+		/*
+		 * Just fall through if NUMA affinity isn't enabled.  We'll
+		 * end up using the default pwq which is what we want.
+		 */
+		if (wq_numa_possible_cpumask) {
+			cpumask_and(cpumask, wq_numa_possible_cpumask[node],
+				    attrs->cpumask);
+			if (cpumask_empty(cpumask))
+				cpumask_copy(cpumask, attrs->cpumask);
+		}
+
+		if (cpumask_equal(cpumask, attrs->cpumask)) {
+			if (!dfl_pwq) {
+				dfl_pwq = alloc_unbound_pwq(wq, tmp_attrs);
+				if (!dfl_pwq)
+					goto enomem;
+			} else {
+				dfl_pwq->refcnt++;
+			}
+			pwq_tbl[node] = dfl_pwq;
+		} else {
+			pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
+			if (!pwq_tbl[node])
+				goto enomem;
+		}
+	}
 
+	/* all pwqs have been created successfully, let's install'em */
 	mutex_lock(&wq->flush_mutex);
 	spin_lock_irq(&pwq_lock);
 
-	link_pwq(pwq, &last_pwq);
+	/* @attrs is now current */
+	copy_workqueue_attrs(wq->unbound_attrs, attrs);
 
-	copy_workqueue_attrs(wq->unbound_attrs, pwq->pool->attrs);
-	for_each_node(node)
-		rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq);
+	for_each_node(node) {
+		struct pool_workqueue *pwq;
+
+		/* each new pwq should be linked once */
+		if (list_empty(&pwq_tbl[node]->pwqs_node))
+			link_pwq(pwq_tbl[node]);
+
+		/* save the previous pwq and install the new one */
+		pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
+		rcu_assign_pointer(wq->numa_pwq_tbl[node], pwq_tbl[node]);
+		pwq_tbl[node] = pwq;
+	}
 
 	spin_unlock_irq(&pwq_lock);
 	mutex_unlock(&wq->flush_mutex);
 
-	if (last_pwq) {
-		spin_lock_irq(&last_pwq->pool->lock);
-		put_pwq(last_pwq);
-		spin_unlock_irq(&last_pwq->pool->lock);
+	/* put the old pwqs */
+	for_each_node(node) {
+		struct pool_workqueue *pwq = pwq_tbl[node];
+
+		if (pwq) {
+			spin_lock_irq(&pwq->pool->lock);
+			put_pwq(pwq);
+			spin_unlock_irq(&pwq->pool->lock);
+		}
 	}
 
 	return 0;
+
+enomem:
+	free_workqueue_attrs(tmp_attrs);
+	if (pwq_tbl) {
+		for_each_node(node)
+			if (pwq_tbl[node] != dfl_pwq)
+				free_unbound_pwq(pwq_tbl[node]);
+		free_unbound_pwq(dfl_pwq);
+		kfree(pwq_tbl);
+	}
+	return -ENOMEM;
 }
 
 static int alloc_and_link_pwqs(struct workqueue_struct *wq)
@@ -3789,7 +3864,7 @@ static int alloc_and_link_pwqs(struct wo
 			mutex_lock(&wq->flush_mutex);
 			spin_lock_irq(&pwq_lock);
 
-			link_pwq(pwq, NULL);
+			link_pwq(pwq);
 
 			spin_unlock_irq(&pwq_lock);
 			mutex_unlock(&wq->flush_mutex);

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues
  2013-03-20  0:00 ` Tejun Heo
@ 2013-03-20 18:57   ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-20 18:57 UTC (permalink / raw)
  To: laijs
  Cc: axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel, herbert,
	davem, linux-crypto

On Tue, Mar 19, 2013 at 05:00:19PM -0700, Tejun Heo wrote:
> and also available in the following git branch.
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-numa

Branch rebased on top of the current wq/for-3.10 with updated patches.
The new commit ID is 9555fbc12d786a9eae7cf7701a6718a0c029173e.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues
  2013-03-20 18:57   ` Tejun Heo
@ 2013-03-24 16:04     ` Lai Jiangshan
  -1 siblings, 0 replies; 62+ messages in thread
From: Lai Jiangshan @ 2013-03-24 16:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: laijs, axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel,
	herbert, davem, linux-crypto

Hi, TJ

After thinking about this for a long time (again):

I think this patchset has a problem: a work item may end up running on
the wrong CPU even when there is an online CPU in its wq's cpumask.

example:
node0(cpu0,cpu1),node1(cpu2,cpu3),
wq's cpumask: 1,3
the pwq of this wq on the node1's cpumask: 3
current online cpu: 0-2.
so the cpumask of worker tasks of the pwq on node1 is actually cpu_all_mask.
so the work scheduled from cpu2 can be executed on cpu0 or cpu2.
we expect it to be executed on cpu1 only.

It can be fixed by swapping pwqs (node's pwq <-> default pwq) on CPU
hotplug.  But could you reschedule this patchset to wq/for-3.11?
the whole patchset is more complicated than my brain.
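
Just as a rough sketch of the check involved (the helper below is
hypothetical, not something in the patchset):

	/*
	 * A node's own pwq is only appropriate while the node has an online
	 * CPU that the workqueue is allowed to run on; otherwise the default
	 * pwq covering the whole cpumask would have to be installed for it.
	 */
	static bool wq_node_pwq_usable(struct workqueue_struct *wq, int node)
	{
		int cpu;

		for_each_cpu_and(cpu, cpumask_of_node(node),
				 wq->unbound_attrs->cpumask)
			if (cpu_online(cpu))
				return true;
		return false;
	}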

If you agree, I will rebase my patches again.

Thanks,
Lai

On Thu, Mar 21, 2013 at 2:57 AM, Tejun Heo <tj@kernel.org> wrote:
> On Tue, Mar 19, 2013 at 05:00:19PM -0700, Tejun Heo wrote:
>> and also available in the following git branch.
>>
>>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-numa
>
> Branch rebased on top of the current wq/for-3.10 with updated patches.
> The new comit ID is 9555fbc12d786a9eae7cf7701a6718a0c029173e.
>
> Thanks.
>
> --
> tejun
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues
  2013-03-24 16:04     ` Lai Jiangshan
@ 2013-03-24 18:55       ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-24 18:55 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: laijs, axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel,
	herbert, davem, linux-crypto

Hey, Lai.

On Mon, Mar 25, 2013 at 12:04:19AM +0800, Lai Jiangshan wrote:
> example:
> node0(cpu0,cpu1),node1(cpu2,cpu3),
> wq's cpumask: 1,3
> the pwq of this wq on the node1's cpumask: 3
> current online cpu: 0-2.
> so the cpumask of worker tasks of the pwq on node1 is actually cpu_all_mask.
> so the work scheduled from cpu2 can be executed on cpu0 or cpu2.
> we expect it to be executed on cpu1 only.

Ah, right.  I was hoping to avoid doing pwq swaps on CPU hot[un]plug
by doing everything on the possible mask.  Looks like we'll need some
massaging after all.

> It can be fixed by swapping pwqs (node's pwq <-> default pwq) on CPU
> hotplug.  But could you reschedule this patchset to wq/for-3.11?

We're still only at -rc4, meaning we still have plenty of time to
resolve whatever issues come up, so I think I'm still gonna target
3.10.

> the whole patchset is more complicated than my brain.

It isn't that complex, is it?  I mean, the difficult part - using
multiple pwqs on unbound wq - already happened, and even that wasn't
too complex as it for the most part synchronized the behaviors between
per-cpu and unbound workqueues.  All that NUMA support is doing is
mapping different pwqs to different issuing CPUs.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues
  2013-03-24 18:55       ` Tejun Heo
@ 2013-03-25 19:15         ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-25 19:15 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: laijs, axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel,
	herbert, davem, linux-crypto

On Sun, Mar 24, 2013 at 11:55:06AM -0700, Tejun Heo wrote:
> > the whole patchset is more complicated than my brain.
> 
> It isn't that complex, is it?  I mean, the difficult part - using
> multiple pwqs on unbound wq - already happened, and even that wasn't
> > too complex as it for the most part synchronized the behaviors between
> per-cpu and unbound workqueues.  All that NUMA support is doing is
> mapping different pwqs to different issuing CPUs.

Oh, BTW, please feel free to rebase your patchset on top of the
current wq/for-3.10.  I'll take care of the conflicts with the numa
one, if there are any.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues
  2013-03-25 19:15         ` Tejun Heo
@ 2013-03-25 20:48           ` Tejun Heo
  -1 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-03-25 20:48 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: laijs, axboe, jack, fengguang.wu, jmoyer, zab, linux-kernel,
	herbert, davem, linux-crypto

On Mon, Mar 25, 2013 at 12:15:00PM -0700, Tejun Heo wrote:
> On Sun, Mar 24, 2013 at 11:55:06AM -0700, Tejun Heo wrote:
> > > the whole patchset is more complicated than my brain.
> > 
> > It isn't that complex, is it?  I mean, the difficult part - using
> > multiple pwqs on unbound wq - already happened, and even that wasn't
> > > too complex as it for the most part synchronized the behaviors between
> > per-cpu and unbound workqueues.  All that NUMA support is doing is
> > mapping different pwqs to different issuing CPUs.
> 
> Oh, BTW, please feel free to rebase your patchset on top of the
> current wq/for-3.10.  I'll take care of the conflicts with the numa
> one, if there are any.

Sorry, never mind.  I'm cherry picking wq->mutex patches.  Will apply
the rebased versions in a couple hours.  Let's base everything else on
top of those.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2013-03-25 20:48 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-20  0:00 [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues Tejun Heo
2013-03-20  0:00 ` Tejun Heo
2013-03-20  0:00 ` [PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[] Tejun Heo
2013-03-20  0:00   ` Tejun Heo
2013-03-20 14:08   ` JoonSoo Kim
2013-03-20 14:08     ` JoonSoo Kim
2013-03-20 14:48     ` Tejun Heo
2013-03-20 14:48       ` Tejun Heo
2013-03-20 15:43   ` Lai Jiangshan
2013-03-20 15:43     ` Lai Jiangshan
2013-03-20 15:48     ` Tejun Heo
2013-03-20 15:48       ` Tejun Heo
2013-03-20 16:43       ` Lai Jiangshan
2013-03-20 16:43         ` Lai Jiangshan
2013-03-20  0:00 ` [PATCH 02/10] workqueue: drop 'H' from kworker names of unbound worker pools Tejun Heo
2013-03-20  0:00   ` Tejun Heo
2013-03-20  0:00 ` [PATCH 03/10] workqueue: determine NUMA node of workers accourding to the allowed cpumask Tejun Heo
2013-03-20  0:00   ` Tejun Heo
2013-03-20  0:00 ` [PATCH 04/10] workqueue: add workqueue->unbound_attrs Tejun Heo
2013-03-20  0:00   ` Tejun Heo
2013-03-20  0:00 ` [PATCH 05/10] workqueue: make workqueue->name[] fixed len Tejun Heo
2013-03-20  0:00   ` Tejun Heo
2013-03-20  0:00 ` [PATCH 06/10] workqueue: move hot fields of workqueue_struct to the end Tejun Heo
2013-03-20  0:00   ` Tejun Heo
2013-03-20  0:00 ` [PATCH 07/10] workqueue: map an unbound workqueues to multiple per-node pool_workqueues Tejun Heo
2013-03-20  0:00   ` Tejun Heo
2013-03-20  0:00 ` [PATCH 08/10] workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq() Tejun Heo
2013-03-20  0:00   ` Tejun Heo
2013-03-20 15:52   ` Lai Jiangshan
2013-03-20 15:52     ` Lai Jiangshan
2013-03-20 16:04     ` Tejun Heo
2013-03-20 16:04       ` Tejun Heo
2013-03-20  0:00 ` [PATCH 09/10] workqueue: implement NUMA affinity for unbound workqueues Tejun Heo
2013-03-20  0:00   ` Tejun Heo
2013-03-20 15:03   ` Lai Jiangshan
2013-03-20 15:03     ` Lai Jiangshan
2013-03-20 15:05     ` Tejun Heo
2013-03-20 15:05       ` Tejun Heo
2013-03-20 15:26       ` Lai Jiangshan
2013-03-20 15:26         ` Lai Jiangshan
2013-03-20 15:32         ` Tejun Heo
2013-03-20 15:32           ` Tejun Heo
2013-03-20 17:08   ` [PATCH v2 " Tejun Heo
2013-03-20 17:08     ` Tejun Heo
2013-03-20 18:54     ` [PATCH v2 UPDATED " Tejun Heo
2013-03-20 18:54       ` Tejun Heo
2013-03-20  0:00 ` [PATCH 10/10] workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity Tejun Heo
2013-03-20  0:00   ` Tejun Heo
2013-03-20 12:14 ` [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues Lai Jiangshan
2013-03-20 12:14   ` Lai Jiangshan
2013-03-20 17:08 ` [PATCH 11/10] workqueue: use NUMA-aware allocation for pool_workqueues workqueues Tejun Heo
2013-03-20 17:08   ` Tejun Heo
2013-03-20 18:57 ` [PATCHSET wq/for-3.10] workqueue: NUMA affinity for unbound workqueues Tejun Heo
2013-03-20 18:57   ` Tejun Heo
2013-03-24 16:04   ` Lai Jiangshan
2013-03-24 16:04     ` Lai Jiangshan
2013-03-24 18:55     ` Tejun Heo
2013-03-24 18:55       ` Tejun Heo
2013-03-25 19:15       ` Tejun Heo
2013-03-25 19:15         ` Tejun Heo
2013-03-25 20:48         ` Tejun Heo
2013-03-25 20:48           ` Tejun Heo
