linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] workqueue: fix bug when numa mapping is changed
@ 2014-12-12 10:19 Lai Jiangshan
  2014-12-12 10:19 ` [PATCH 1/5] workqueue: fix memory leak in wq_numa_init() Lai Jiangshan
                   ` (6 more replies)
  0 siblings, 7 replies; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-12 10:19 UTC (permalink / raw)
  To: linux-kernel, Tejun Heo
  Cc: Lai Jiangshan, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

Workqueue code has an assumption that the numa mapping is stable
after system booted.  It is incorrectly currently.

Yasuaki Ishimatsu hit a allocation failure bug when the numa mapping
between CPU and node is changed. This was the last scene:
 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

Yasuaki Ishimatsu investigated that it happened in the following situation:

1) System Node/CPU before offline/online:
	       | CPU
	------------------------
	node 0 |  0-14, 60-74
	node 1 | 15-29, 75-89
	node 2 | 30-44, 90-104
	node 3 | 45-59, 105-119

2) A system-board (contains node2 and node3) is offline:
	       | CPU
	------------------------
	node 0 |  0-14, 60-74
	node 1 | 15-29, 75-89

3) A new system-board is online, two new node IDs are allocated
   for the two node of the SB, but the old CPU IDs are allocated for
   the SB, here the NUMA mapping between node and CPU is changed.
   (the node of CPU#30 is changed from node#2 to node#4, for example)
	       | CPU
	------------------------
	node 0 |  0-14, 60-74
	node 1 | 15-29, 75-89
	node 4 | 30-59
	node 5 | 90-119

4) now, the NUMA mapping is changed, but wq_numa_possible_cpumask
   which is the convenient NUMA mapping cache in workqueue.c is still outdated.
   thus pool->node calculated by get_unbound_pool() is incorrect.

5) when the create_worker() is called with the incorrect offlined
    pool->node, it is failed and the pool can't make any progress.

To fix this bug, we need to fixup the wq_numa_possible_cpumask and the
pool->node, it is done in patch2 and patch3.

patch1 fixes memory leak related wq_numa_possible_cpumask.
patch4 kill another assumption about how the numa mapping changed.
patch5 reduces the allocation fails when the node is offline or the node
is lack of memory.

The patchset is untested. It is sent for earlier review.

Thanks,
Lai.

Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: "Gu, Zheng" <guz.fnst@cn.fujitsu.com>
Cc: tangchen <tangchen@cn.fujitsu.com>
Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com>
Lai Jiangshan (5):
  workqueue: fix memory leak in wq_numa_init()
  workqueue: update wq_numa_possible_cpumask
  workqueue: fixup existing pool->node
  workqueue: update NUMA affinity for the node lost CPU
  workqueue: retry on NUMA_NO_NODE when create_worker() fails

 kernel/workqueue.c |  129 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 109 insertions(+), 20 deletions(-)

-- 
1.7.4.4


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/5] workqueue: fix memory leak in wq_numa_init()
  2014-12-12 10:19 [PATCH 0/5] workqueue: fix bug when numa mapping is changed Lai Jiangshan
@ 2014-12-12 10:19 ` Lai Jiangshan
  2014-12-12 17:12   ` Tejun Heo
  2014-12-12 10:19 ` [PATCH 2/5] workqueue: update wq_numa_possible_cpumask Lai Jiangshan
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-12 10:19 UTC (permalink / raw)
  To: linux-kernel, Tejun Heo
  Cc: Lai Jiangshan, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

wq_numa_init() will quit directly on some bonkers cases without freeing the
memory.  Add the missing cleanup code.

Cc: Tejun Heo <tj@kernel.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: "Gu, Zheng" <guz.fnst@cn.fujitsu.com>
Cc: tangchen <tangchen@cn.fujitsu.com>
Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 kernel/workqueue.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 09b685d..a6fd2b8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4811,6 +4811,9 @@ static void __init wq_numa_init(void)
 		if (WARN_ON(node == NUMA_NO_NODE)) {
 			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
 			/* happens iff arch is bonkers, let's just proceed */
+			for_each_node(node)
+				free_cpumask_var(tbl[node]);
+			kfree(tbl);
 			return;
 		}
 		cpumask_set_cpu(cpu, tbl[node]);
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 2/5] workqueue: update wq_numa_possible_cpumask
  2014-12-12 10:19 [PATCH 0/5] workqueue: fix bug when numa mapping is changed Lai Jiangshan
  2014-12-12 10:19 ` [PATCH 1/5] workqueue: fix memory leak in wq_numa_init() Lai Jiangshan
@ 2014-12-12 10:19 ` Lai Jiangshan
  2014-12-12 17:18   ` Tejun Heo
  2014-12-18  2:22   ` Lai Jiangshan
  2014-12-12 10:19 ` [PATCH 3/5] workqueue: fixup existing pool->node Lai Jiangshan
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-12 10:19 UTC (permalink / raw)
  To: linux-kernel, Tejun Heo
  Cc: Lai Jiangshan, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

Yasuaki Ishimatsu hit a allocation failure bug when the numa mapping
between CPU and node is changed. This was the last scene:
 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

Yasuaki Ishimatsu investigated that it happened in the following situation:

1) System Node/CPU before offline/online:
	       | CPU
	------------------------
	node 0 |  0-14, 60-74
	node 1 | 15-29, 75-89
	node 2 | 30-44, 90-104
	node 3 | 45-59, 105-119

2) A system-board (contains node2 and node3) is offline:
	       | CPU
	------------------------
	node 0 |  0-14, 60-74
	node 1 | 15-29, 75-89

3) A new system-board is online, two new node IDs are allocated
   for the two node of the SB, but the old CPU IDs are allocated for
   the SB, here the NUMA mapping between node and CPU is changed.
   (the node of CPU#30 is changed from node#2 to node#4, for example)
	       | CPU
	------------------------
	node 0 |  0-14, 60-74
	node 1 | 15-29, 75-89
	node 4 | 30-59
	node 5 | 90-119

4) now, the NUMA mapping is changed, but wq_numa_possible_cpumask
   which is the convenient NUMA mapping cache in workqueue.c is still outdated.
   thus pool->node calculated by get_unbound_pool() is incorrect.

5) when the create_worker() is called with the incorrect offlined
    pool->node, it is failed and the pool can't make any progress.

To fix this bug, we need to fixup the wq_numa_possible_cpumask and the
pool->node, the fix is so complicated that we split it into two patches,
this patch fix the wq_numa_possible_cpumask and the next fix the pool->node.

To fix the wq_numa_possible_cpumask, we only update the cpumasks of
the orig_node and the new_node of the onlining @cpu.  we con't touch
other unrelated nodes since the wq subsystem haven't seen the changed.

After this fix the new pool->node of new pools are correct.
and existing wq's affinity is fixed up by wq_update_unbound_numa()
after wq_update_numa_mapping().

Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: "Gu, Zheng" <guz.fnst@cn.fujitsu.com>
Cc: tangchen <tangchen@cn.fujitsu.com>
Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 kernel/workqueue.c |   42 +++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 41 insertions(+), 1 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a6fd2b8..4c88b61 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -266,7 +266,7 @@ struct workqueue_struct {
 static struct kmem_cache *pwq_cache;
 
 static cpumask_var_t *wq_numa_possible_cpumask;
-					/* possible CPUs of each node */
+					/* PL: possible CPUs of each node */
 
 static bool wq_disable_numa;
 module_param_named(disable_numa, wq_disable_numa, bool, 0444);
@@ -3949,6 +3949,44 @@ out_unlock:
 	put_pwq_unlocked(old_pwq);
 }
 
+static void wq_update_numa_mapping(int cpu)
+{
+	int node, orig_node = NUMA_NO_NODE, new_node = cpu_to_node(cpu);
+
+	lockdep_assert_held(&wq_pool_mutex);
+
+	if (!wq_numa_enabled)
+		return;
+
+	/* the node of onlining CPU is not NUMA_NO_NODE */
+	if (WARN_ON(new_node == NUMA_NO_NODE))
+		return;
+
+	/* test whether the NUMA node mapping is changed. */
+	if (cpumask_test_cpu(cpu, wq_numa_possible_cpumask[new_node]))
+		return;
+
+	/* find the origin node */
+	for_each_node(node) {
+		if (cpumask_test_cpu(cpu, wq_numa_possible_cpumask[node])) {
+			orig_node = node;
+			break;
+		}
+	}
+
+	/* there may be multi mappings changed, re-initial. */
+	cpumask_clear(wq_numa_possible_cpumask[new_node]);
+	if (orig_node != NUMA_NO_NODE)
+		cpumask_clear(wq_numa_possible_cpumask[orig_node]);
+	for_each_possible_cpu(cpu) {
+		node = cpu_to_node(node);
+		if (node == new_node)
+			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[new_node]);
+		else if (orig_node != NUMA_NO_NODE && node == orig_node)
+			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[orig_node]);
+	}
+}
+
 static int alloc_and_link_pwqs(struct workqueue_struct *wq)
 {
 	bool highpri = wq->flags & WQ_HIGHPRI;
@@ -4584,6 +4622,8 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
 			mutex_unlock(&pool->attach_mutex);
 		}
 
+		wq_update_numa_mapping(cpu);
+
 		/* update NUMA affinity of unbound workqueues */
 		list_for_each_entry(wq, &workqueues, list)
 			wq_update_unbound_numa(wq, cpu, true);
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 3/5] workqueue: fixup existing pool->node
  2014-12-12 10:19 [PATCH 0/5] workqueue: fix bug when numa mapping is changed Lai Jiangshan
  2014-12-12 10:19 ` [PATCH 1/5] workqueue: fix memory leak in wq_numa_init() Lai Jiangshan
  2014-12-12 10:19 ` [PATCH 2/5] workqueue: update wq_numa_possible_cpumask Lai Jiangshan
@ 2014-12-12 10:19 ` Lai Jiangshan
  2014-12-12 17:25   ` Tejun Heo
  2014-12-12 10:19 ` [PATCH 4/5] workqueue: update NUMA affinity for the node lost CPU Lai Jiangshan
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-12 10:19 UTC (permalink / raw)
  To: linux-kernel, Tejun Heo
  Cc: Lai Jiangshan, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

Yasuaki Ishimatsu hit a bug when the numa mapping between CPU and node
is changed.  And the previous path fixup wq_numa_possible_cpumask.
(See more information form the changelog of that patch)

After wq_numa_possible_cpumask was updated, the new pool->node will be
correct, but the existing pools (and workers) are still running, some of
them are running with the wrong pool->node, or even worse, with pool->node
which is quitted node, they create_worker() on wrong pool->node.
These create_worker() may create workers on wrong node or failed without
any progress (when with pool->node which is quitted node).

So we need to update the pool->node when the numa mapping is changed.

We simply re-calc the pool->node when the numa mapping changed. It reuses
the code from get_unbound_pool() for unbound pool.

Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: "Gu, Zheng" <guz.fnst@cn.fujitsu.com>
Cc: tangchen <tangchen@cn.fujitsu.com>
Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 kernel/workqueue.c |   53 ++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 39 insertions(+), 14 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4c88b61..7fbabf6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3439,6 +3439,26 @@ static void put_unbound_pool(struct worker_pool *pool)
 	call_rcu_sched(&pool->rcu, rcu_free_pool);
 }
 
+static int calc_pool_node(struct worker_pool *pool)
+{
+	int node;
+
+	if (pool->cpu >= 0)
+		return cpu_to_node(pool->cpu);
+
+	/* if cpumask is contained inside a NUMA node, we belong to that node */
+	if (wq_numa_enabled) {
+		for_each_node(node) {
+			if (cpumask_subset(pool->attrs->cpumask,
+					   wq_numa_possible_cpumask[node])) {
+				return node;
+			}
+		}
+	}
+
+	return NUMA_NO_NODE;
+}
+
 /**
  * get_unbound_pool - get a worker_pool with the specified attributes
  * @attrs: the attributes of the worker_pool to get
@@ -3457,7 +3477,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 {
 	u32 hash = wqattrs_hash(attrs);
 	struct worker_pool *pool;
-	int node;
 
 	lockdep_assert_held(&wq_pool_mutex);
 
@@ -3482,17 +3501,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 	 * 'struct workqueue_attrs' comments for detail.
 	 */
 	pool->attrs->no_numa = false;
-
-	/* if cpumask is contained inside a NUMA node, we belong to that node */
-	if (wq_numa_enabled) {
-		for_each_node(node) {
-			if (cpumask_subset(pool->attrs->cpumask,
-					   wq_numa_possible_cpumask[node])) {
-				pool->node = node;
-				break;
-			}
-		}
-	}
+	pool->node = calc_pool_node(pool);
 
 	if (worker_pool_assign_id(pool) < 0)
 		goto fail;
@@ -3952,6 +3961,8 @@ out_unlock:
 static void wq_update_numa_mapping(int cpu)
 {
 	int node, orig_node = NUMA_NO_NODE, new_node = cpu_to_node(cpu);
+	struct worker_pool *pool;
+	int pi;
 
 	lockdep_assert_held(&wq_pool_mutex);
 
@@ -3985,6 +3996,17 @@ static void wq_update_numa_mapping(int cpu)
 		else if (orig_node != NUMA_NO_NODE && node == orig_node)
 			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[orig_node]);
 	}
+
+	/*
+	 * Fixup pool->node, it needs to test for all pools, pool->node might be
+	 * changed from orig_node to new_node or from new_node to NUMA_NO_NODE
+	 * or NUMA_NO_NODE to orig_node or some other possibilities.
+	 */
+	for_each_pool(pool, pi) {
+		node = calc_pool_node(pool);
+		if (pool->node != node)
+			pool->node = node;
+	}
 }
 
 static int alloc_and_link_pwqs(struct workqueue_struct *wq)
@@ -4614,10 +4636,13 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
 		for_each_pool(pool, pi) {
 			mutex_lock(&pool->attach_mutex);
 
-			if (pool->cpu == cpu)
+			if (pool->cpu == cpu) {
+				if (unlikely(pool->node != cpu_to_node(cpu)))
+					pool->node = cpu_to_node(cpu);
 				rebind_workers(pool);
-			else if (pool->cpu < 0)
+			} else if (pool->cpu < 0) {
 				restore_unbound_workers_cpumask(pool, cpu);
+			}
 
 			mutex_unlock(&pool->attach_mutex);
 		}
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 4/5] workqueue: update NUMA affinity for the node lost CPU
  2014-12-12 10:19 [PATCH 0/5] workqueue: fix bug when numa mapping is changed Lai Jiangshan
                   ` (2 preceding siblings ...)
  2014-12-12 10:19 ` [PATCH 3/5] workqueue: fixup existing pool->node Lai Jiangshan
@ 2014-12-12 10:19 ` Lai Jiangshan
  2014-12-12 17:27   ` Tejun Heo
  2014-12-12 10:19 ` [PATCH 5/5] workqueue: retry on NUMA_NO_NODE when create_worker() fails Lai Jiangshan
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-12 10:19 UTC (permalink / raw)
  To: linux-kernel, Tejun Heo
  Cc: Lai Jiangshan, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

We fixed the major cases when the numa mapping is changed.

We still have the assumption that when the node<->cpu mapping is changed
the original node is offline, and the current code of memory-hutplug also
prove this.

This assumption might be changed in future and the orig_node is still online
in some cases.  And in these cases, the cpumask of the pwqs of the orig_node
still contains the onlining CPU which is a CPU of another node, and the worker
may run on the onlining CPU (aka run on the wrong node).

So we drop this assumption and make the code calls wq_update_unbound_numa()
to update the affinity in this case.

Cc: Tejun Heo <tj@kernel.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: "Gu, Zheng" <guz.fnst@cn.fujitsu.com>
Cc: tangchen <tangchen@cn.fujitsu.com>
Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 kernel/workqueue.c |   15 +++++++++++++++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7fbabf6..29a96c3 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4007,6 +4007,21 @@ static void wq_update_numa_mapping(int cpu)
 		if (pool->node != node)
 			pool->node = node;
 	}
+
+	/* Test whether we hit the case where orig_node is still online */
+	if (orig_node != NUMA_NO_NODE &&
+	    !cpumask_empty(cpumask_of_node(orig_node))) {
+		struct workqueue_struct *wq;
+		cpu = cpumask_any(cpumask_of_node(orig_node));
+
+		/*
+		 * the pwqs of the orig_node are still allowed on the onlining
+		 * CPU but which is belong to new_node, update NUMA affinity
+		 * for orig_node.
+		 */
+		list_for_each_entry(wq, &workqueues, list)
+			wq_update_unbound_numa(wq, cpu, true);
+	}
 }
 
 static int alloc_and_link_pwqs(struct workqueue_struct *wq)
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 5/5] workqueue: retry on NUMA_NO_NODE when create_worker() fails
  2014-12-12 10:19 [PATCH 0/5] workqueue: fix bug when numa mapping is changed Lai Jiangshan
                   ` (3 preceding siblings ...)
  2014-12-12 10:19 ` [PATCH 4/5] workqueue: update NUMA affinity for the node lost CPU Lai Jiangshan
@ 2014-12-12 10:19 ` Lai Jiangshan
  2014-12-12 16:05   ` KOSAKI Motohiro
  2014-12-12 17:29   ` Tejun Heo
  2014-12-12 17:13 ` [PATCH 0/5] workqueue: fix bug when numa mapping is changed Yasuaki Ishimatsu
  2014-12-13 16:27 ` [PATCH 0/4] workqueue: fix bug when numa mapping is changed v2 Kamezawa Hiroyuki
  6 siblings, 2 replies; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-12 10:19 UTC (permalink / raw)
  To: linux-kernel, Tejun Heo
  Cc: Lai Jiangshan, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

A pwq bound to a specified node might be last long or even forever after
the node was offline.  Especially when this pwq has some back-to-back work
items which requeue themselves and cause the pwq can't quit.

This kinds of pwqs will cause their own pools busy and maybe create workers.
This pools will fail on create_worker() since the node is offline.

This case is extremely rare, but it is possible. And we hope create_worker()
to be fault-tolerant in this case and other different cases when the node
is lack of memory, for example, create_worker() can try to allocate memory
from the whole system rather than only the target node, the most important
thing is making some progress.

So the solution is that, when the create_worker() fails on a specified node,
it will retry with NUMA_NO_NODE for further allocation.

Cc: Tejun Heo <tj@kernel.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: "Gu, Zheng" <guz.fnst@cn.fujitsu.com>
Cc: tangchen <tangchen@cn.fujitsu.com>
Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 kernel/workqueue.c |   16 +++++++++++-----
 1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 29a96c3..9e35a79 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1673,13 +1673,16 @@ static struct worker *create_worker(struct worker_pool *pool)
 	struct worker *worker = NULL;
 	int id = -1;
 	char id_buf[16];
+	int node = pool->node;
 
 	/* ID is needed to determine kthread name */
 	id = ida_simple_get(&pool->worker_ida, 0, 0, GFP_KERNEL);
 	if (id < 0)
-		goto fail;
+		return NULL;
 
-	worker = alloc_worker(pool->node);
+again:
+	if (!worker)
+		worker = alloc_worker(node);
 	if (!worker)
 		goto fail;
 
@@ -1692,7 +1695,7 @@ static struct worker *create_worker(struct worker_pool *pool)
 	else
 		snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
 
-	worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
+	worker->task = kthread_create_on_node(worker_thread, worker, node,
 					      "kworker/%s", id_buf);
 	if (IS_ERR(worker->task))
 		goto fail;
@@ -1715,8 +1718,11 @@ static struct worker *create_worker(struct worker_pool *pool)
 	return worker;
 
 fail:
-	if (id >= 0)
-		ida_simple_remove(&pool->worker_ida, id);
+	if (node != NUMA_NO_NODE) {
+		node = NUMA_NO_NODE;
+		goto again;
+	}
+	ida_simple_remove(&pool->worker_ida, id);
 	kfree(worker);
 	return NULL;
 }
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 5/5] workqueue: retry on NUMA_NO_NODE when create_worker() fails
  2014-12-12 10:19 ` [PATCH 5/5] workqueue: retry on NUMA_NO_NODE when create_worker() fails Lai Jiangshan
@ 2014-12-12 16:05   ` KOSAKI Motohiro
  2014-12-12 17:29     ` KOSAKI Motohiro
  2014-12-12 17:29   ` Tejun Heo
  1 sibling, 1 reply; 55+ messages in thread
From: KOSAKI Motohiro @ 2014-12-12 16:05 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: LKML, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen,
	Hiroyuki KAMEZAWA

On Fri, Dec 12, 2014 at 5:19 AM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
> A pwq bound to a specified node might be last long or even forever after
> the node was offline.  Especially when this pwq has some back-to-back work
> items which requeue themselves and cause the pwq can't quit.
>
> This kinds of pwqs will cause their own pools busy and maybe create workers.
> This pools will fail on create_worker() since the node is offline.
>
> This case is extremely rare, but it is possible. And we hope create_worker()
> to be fault-tolerant in this case and other different cases when the node
> is lack of memory, for example, create_worker() can try to allocate memory
> from the whole system rather than only the target node, the most important
> thing is making some progress.
>
> So the solution is that, when the create_worker() fails on a specified node,
> it will retry with NUMA_NO_NODE for further allocation.

The code looks correct. But I don't think this issue depend on node offlining.
The allocation may also fail if node has no memory.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/5] workqueue: fix memory leak in wq_numa_init()
  2014-12-12 10:19 ` [PATCH 1/5] workqueue: fix memory leak in wq_numa_init() Lai Jiangshan
@ 2014-12-12 17:12   ` Tejun Heo
  2014-12-15  5:25     ` Lai Jiangshan
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2014-12-12 17:12 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

On Fri, Dec 12, 2014 at 06:19:51PM +0800, Lai Jiangshan wrote:
> wq_numa_init() will quit directly on some bonkers cases without freeing the
> memory.  Add the missing cleanup code.
> 
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Cc: "Gu, Zheng" <guz.fnst@cn.fujitsu.com>
> Cc: tangchen <tangchen@cn.fujitsu.com>
> Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> ---
>  kernel/workqueue.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 09b685d..a6fd2b8 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -4811,6 +4811,9 @@ static void __init wq_numa_init(void)
>  		if (WARN_ON(node == NUMA_NO_NODE)) {
>  			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
>  			/* happens iff arch is bonkers, let's just proceed */
> +			for_each_node(node)
> +				free_cpumask_var(tbl[node]);
> +			kfree(tbl);

The comment right up there says that this happens if and only if the
arch code is seriously broken and it's consciously skipping exception
handling.  That's a condition where we might as well trigger BUG_ON().
Just leave it alone.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/5] workqueue: fix bug when numa mapping is changed
  2014-12-12 10:19 [PATCH 0/5] workqueue: fix bug when numa mapping is changed Lai Jiangshan
                   ` (4 preceding siblings ...)
  2014-12-12 10:19 ` [PATCH 5/5] workqueue: retry on NUMA_NO_NODE when create_worker() fails Lai Jiangshan
@ 2014-12-12 17:13 ` Yasuaki Ishimatsu
  2014-12-15  1:34   ` Lai Jiangshan
  2014-12-13 16:27 ` [PATCH 0/4] workqueue: fix bug when numa mapping is changed v2 Kamezawa Hiroyuki
  6 siblings, 1 reply; 55+ messages in thread
From: Yasuaki Ishimatsu @ 2014-12-12 17:13 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Tejun Heo, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA,
	isimatu.yasuaki

Hi Lai,

Thank you for posting the patches. I tried your patches.
But the following kernel panic occurred.

[  889.394087] BUG: unable to handle kernel paging request at 000000020000f3f1
[  889.395005] IP: [<ffffffff8108fe90>] workqueue_cpu_up_callback+0x510/0x740
[  889.395005] PGD 17a83067 PUD 0
[  889.395005] Oops: 0000 [#1] SMP
[  889.395005] Modules linked in: xt_CHECKSUM ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables
ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg vfat fat x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel iTCO_wdt sb_edac
iTCO_vendor_support i2c_i801 lrw gf128mul lpc_ich edac_core glue_helper mfd_core ablk_helper cryptd pcspkr shpchp ipmi_devintf ipmi_si ipmi_msghandler tpm_infineon nfsd auth_rpcgss nfs_acl lockd grace sunrpc uinput xfs libcrc32c sd_mod mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm
e1000e lpfc drm dca ptp i2c_algo_bit megaraid_sas scsi_transport_fc pps_core i2c_core dm_mirror dm_region_hash dm_log dm_mod
[  889.395005] CPU: 8 PID: 13595 Comm: udev_dp_bridge. Not tainted 3.18.0Lai+ #26
[  889.395005] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.81 12/03/2014
[  889.395005] task: ffff8a074a145160 ti: ffff8a077a6ec000 task.ti: ffff8a077a6ec000
[  889.395005] RIP: 0010:[<ffffffff8108fe90>]  [<ffffffff8108fe90>] workqueue_cpu_up_callback+0x510/0x740
[  889.395005] RSP: 0018:ffff8a077a6efca8  EFLAGS: 00010202
[  889.395005] RAX: 0000000000000001 RBX: 000000000000edf1 RCX: 000000000000edf1
[  889.395005] RDX: 0000000000000100 RSI: 000000020000f3f1 RDI: 0000000000000001
[  889.395005] RBP: ffff8a077a6efd08 R08: ffffffff81ac6de0 R09: ffff880874610000
[  889.395005] R10: 00000000ffffffff R11: 0000000000000001 R12: 000000000000f3f0
[  889.395005] R13: 000000000000001f R14: 00000000ffffffff R15: ffffffff81ac6de0
[  889.395005] FS:  00007f6b20c67740(0000) GS:ffff88087fd00000(0000) knlGS:0000000000000000
[  889.395005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  889.395005] CR2: 000000020000f3f1 CR3: 000000004534c000 CR4: 00000000001407e0
[  889.395005] Stack:
[  889.395005]  ffffffffffffffff 0000000000000020 fffffffffffffff8 00000004810a192d
[  889.395005]  ffff8a0700000204 0000000052f5b32d ffffffff81994fc0 00000000fffffff6
[  889.395005]  ffffffff81a13840 0000000000000002 000000000000001f 0000000000000000
[  889.395005] Call Trace:
[  889.395005]  [<ffffffff81094f6c>] notifier_call_chain+0x4c/0x70
[  889.395005]  [<ffffffff8109507e>] __raw_notifier_call_chain+0xe/0x10
[  889.395005]  [<ffffffff810750b3>] cpu_notify+0x23/0x50
[  889.395005]  [<ffffffff81075408>] _cpu_up+0x188/0x1a0
[  889.395005]  [<ffffffff810754a9>] cpu_up+0x89/0xb0
[  889.395005]  [<ffffffff8164f960>] cpu_subsys_online+0x40/0x90
[  889.395005]  [<ffffffff8140f10d>] device_online+0x6d/0xa0
[  889.395005]  [<ffffffff8140f1d5>] online_store+0x95/0xa0
[  889.395005]  [<ffffffff8140c2e8>] dev_attr_store+0x18/0x30
[  889.395005]  [<ffffffff8126210d>] sysfs_kf_write+0x3d/0x50
[  889.395005]  [<ffffffff81261624>] kernfs_fop_write+0xe4/0x160
[  889.395005]  [<ffffffff811e90d7>] vfs_write+0xb7/0x1f0
[  889.395005]  [<ffffffff81021dcc>] ? do_audit_syscall_entry+0x6c/0x70
[  889.395005]  [<ffffffff811e9bc5>] SyS_write+0x55/0xd0
[  889.395005]  [<ffffffff816646a9>] system_call_fastpath+0x12/0x17
[  889.395005] Code: 44 00 00 83 c7 01 48 63 d7 4c 89 ff e8 3a 2a 28 00 8b 15 78 84 a3 00 89 c7 39 d0 7d 70 48 63 cb 4c 89 e6 48 03 34 cd e0 3a ab 81 <8b> 1e 39 5d bc 74 36 41 39 de 74 0c 48 63 f2 eb c7 0f 1f 80 00
[  889.395005] RIP  [<ffffffff8108fe90>] workqueue_cpu_up_callback+0x510/0x740
[  889.395005]  RSP <ffff8a077a6efca8>
[  889.395005] CR2: 000000020000f3f1
[  889.785760] ---[ end trace 39abbfc9f93402f2 ]---
[  889.790931] Kernel panic - not syncing: Fatal exception
[  889.791931] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[  889.791931] drm_kms_helper: panic occurred, switching back to text console
[  889.815947] ------------[ cut here ]------------
[  889.815947] WARNING: CPU: 8 PID: 64 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x5d/0x60()
[  889.815947] Modules linked in: xt_CHECKSUM ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables
ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg vfat fat x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel iTCO_wdt sb_edac
iTCO_vendor_support i2c_i801 lrw gf128mul lpc_ich edac_core glue_helper mfd_core ablk_helper cryptd pcspkr shpchp ipmi_devintf ipmi_si ipmi_msghandler tpm_infineon nfsd auth_rpcgss nfs_acl lockd grace sunrpc uinput xfs libcrc32c sd_mod mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm
e1000e lpfc drm dca ptp i2c_algo_bit megaraid_sas scsi_transport_fc pps_core i2c_core dm_mirror dm_region_hash dm_log dm_mod
[  889.815947] CPU: 8 PID: 64 Comm: migration/8 Tainted: G      D        3.18.0Lai+ #26
[  889.815947] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.81 12/03/2014
[  889.815947]  0000000000000000 00000000f7f40529 ffff88087fd03d38 ffffffff8165c8d4
[  889.815947]  0000000000000000 0000000000000000 ffff88087fd03d78 ffffffff81074eb1
[  889.815947]  ffff88087fd03d78 0000000000000000 ffff88087fc13840 0000000000000008
[  889.815947] Call Trace:
[  889.815947]  <IRQ>  [<ffffffff8165c8d4>] dump_stack+0x46/0x58
[  889.815947]  [<ffffffff81074eb1>] warn_slowpath_common+0x81/0xa0
[  889.815947]  [<ffffffff81074fca>] warn_slowpath_null+0x1a/0x20
[  889.815947]  [<ffffffff810489bd>] native_smp_send_reschedule+0x5d/0x60
[  889.815947]  [<ffffffff810b0ad4>] trigger_load_balance+0x144/0x1b0
[  889.815947]  [<ffffffff810a009f>] scheduler_tick+0x9f/0xe0
[  889.815947]  [<ffffffff810daef4>] update_process_times+0x64/0x80
[  889.815947]  [<ffffffff810eab05>] tick_sched_handle.isra.19+0x25/0x60
[  889.815947]  [<ffffffff810eab85>] tick_sched_timer+0x45/0x80
[  889.815947]  [<ffffffff810dbbe7>] __run_hrtimer+0x77/0x1d0
[  889.815947]  [<ffffffff810eab40>] ? tick_sched_handle.isra.19+0x60/0x60
[  889.815947]  [<ffffffff810dbfd7>] hrtimer_interrupt+0xf7/0x240
[  889.815947]  [<ffffffff8104b85b>] local_apic_timer_interrupt+0x3b/0x70
[  889.815947]  [<ffffffff81667465>] smp_apic_timer_interrupt+0x45/0x60
[  889.815947]  [<ffffffff8166553d>] apic_timer_interrupt+0x6d/0x80
[  889.815947]  <EOI>  [<ffffffff810a79c7>] ? set_next_entity+0x67/0x80
[  889.815947]  [<ffffffffa011d1d7>] ? __drm_modeset_lock_all+0x37/0x120 [drm]
[  889.815947]  [<ffffffff8109c727>] ? finish_task_switch+0x57/0x180
[  889.815947]  [<ffffffff8165fba8>] __schedule+0x2e8/0x7e0
[  889.815947]  [<ffffffff816600c9>] schedule+0x29/0x70
[  889.815947]  [<ffffffff81097d43>] smpboot_thread_fn+0xd3/0x1b0
[  889.815947]  [<ffffffff81097c70>] ? SyS_setgroups+0x1a0/0x1a0
[  889.815947]  [<ffffffff81093df1>] kthread+0xe1/0x100
[  889.815947]  [<ffffffff81093d10>] ? kthread_create_on_node+0x1b0/0x1b0
[  889.815947]  [<ffffffff816645fc>] ret_from_fork+0x7c/0xb0
[  889.815947]  [<ffffffff81093d10>] ? kthread_create_on_node+0x1b0/0x1b0
[  889.815947] ---[ end trace 39abbfc9f93402f3 ]---
[  890.156187] ------------[ cut here ]------------
[  890.156187] WARNING: CPU: 8 PID: 64 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x5d/0x60()
[  890.156187] Modules linked in: xt_CHECKSUM ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables
ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg vfat fat x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel iTCO_wdt sb_edac
iTCO_vendor_support i2c_i801 lrw gf128mul lpc_ich edac_core glue_helper mfd_core ablk_helper cryptd pcspkr shpchp ipmi_devintf ipmi_si ipmi_msghandler tpm_infineon nfsd auth_rpcgss nfs_acl lockd grace sunrpc uinput xfs libcrc32c sd_mod mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm
e1000e lpfc drm dca ptp i2c_algo_bit megaraid_sas scsi_transport_fc pps_core i2c_core dm_mirror dm_region_hash dm_log dm_mod
[  890.156187] CPU: 8 PID: 64 Comm: migration/8 Tainted: G      D W      3.18.0Lai+ #26
[  890.156187] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.81 12/03/2014
[  890.156187]  0000000000000000 00000000f7f40529 ffff88087366bc08 ffffffff8165c8d4
[  890.156187]  0000000000000000 0000000000000000 ffff88087366bc48 ffffffff81074eb1
[  890.156187]  ffff88087fd142c0 0000000000000044 ffff8a074a145160 ffff8a074a145160
[  890.156187] Call Trace:
[  890.156187]  [<ffffffff8165c8d4>] dump_stack+0x46/0x58
[  890.156187]  [<ffffffff81074eb1>] warn_slowpath_common+0x81/0xa0
[  890.156187]  [<ffffffff81074fca>] warn_slowpath_null+0x1a/0x20
[  890.156187]  [<ffffffff810489bd>] native_smp_send_reschedule+0x5d/0x60
[  890.156187]  [<ffffffff8109ddd8>] resched_curr+0xa8/0xd0
[  890.156187]  [<ffffffff8109eac0>] check_preempt_curr+0x80/0xa0
[  890.156187]  [<ffffffff810a78c8>] attach_task+0x48/0x50
[  890.156187]  [<ffffffff810a7ae5>] active_load_balance_cpu_stop+0x105/0x250
[  890.156187]  [<ffffffff810a79e0>] ? set_next_entity+0x80/0x80
[  890.156187]  [<ffffffff8110cab8>] cpu_stopper_thread+0x78/0x150
[  890.156187]  [<ffffffff8165fba8>] ? __schedule+0x2e8/0x7e0
[  890.156187]  [<ffffffff81097d6f>] smpboot_thread_fn+0xff/0x1b0
[  890.156187]  [<ffffffff81097c70>] ? SyS_setgroups+0x1a0/0x1a0
[  890.156187]  [<ffffffff81093df1>] kthread+0xe1/0x100
[  890.156187]  [<ffffffff81093d10>] ? kthread_create_on_node+0x1b0/0x1b0
[  890.156187]  [<ffffffff816645fc>] ret_from_fork+0x7c/0xb0
[  890.156187]  [<ffffffff81093d10>] ? kthread_create_on_node+0x1b0/0x1b0
[  890.156187] ---[ end trace 39abbfc9f93402f4 ]---

Thanks,
Yasuaki Ishimatsu

(2014/12/12 19:19), Lai Jiangshan wrote:
> Workqueue code has an assumption that the numa mapping is stable
> after system booted.  It is incorrectly currently.
> 
> Yasuaki Ishimatsu hit a allocation failure bug when the numa mapping
> between CPU and node is changed. This was the last scene:
>   SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>    cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
>    node 0: slabs: 6172, objs: 259224, free: 245741
>    node 1: slabs: 3261, objs: 136962, free: 127656
> 
> Yasuaki Ishimatsu investigated that it happened in the following situation:
> 
> 1) System Node/CPU before offline/online:
> 	       | CPU
> 	------------------------
> 	node 0 |  0-14, 60-74
> 	node 1 | 15-29, 75-89
> 	node 2 | 30-44, 90-104
> 	node 3 | 45-59, 105-119
> 
> 2) A system-board (contains node2 and node3) is offline:
> 	       | CPU
> 	------------------------
> 	node 0 |  0-14, 60-74
> 	node 1 | 15-29, 75-89
> 
> 3) A new system-board is online, two new node IDs are allocated
>     for the two node of the SB, but the old CPU IDs are allocated for
>     the SB, here the NUMA mapping between node and CPU is changed.
>     (the node of CPU#30 is changed from node#2 to node#4, for example)
> 	       | CPU
> 	------------------------
> 	node 0 |  0-14, 60-74
> 	node 1 | 15-29, 75-89
> 	node 4 | 30-59
> 	node 5 | 90-119
> 
> 4) now, the NUMA mapping is changed, but wq_numa_possible_cpumask
>     which is the convenient NUMA mapping cache in workqueue.c is still outdated.
>     thus pool->node calculated by get_unbound_pool() is incorrect.
> 
> 5) when the create_worker() is called with the incorrect offlined
>      pool->node, it is failed and the pool can't make any progress.
> 
> To fix this bug, we need to fixup the wq_numa_possible_cpumask and the
> pool->node, it is done in patch2 and patch3.
> 
> patch1 fixes memory leak related wq_numa_possible_cpumask.
> patch4 kill another assumption about how the numa mapping changed.
> patch5 reduces the allocation fails when the node is offline or the node
> is lack of memory.
> 
> The patchset is untested. It is sent for earlier review.
> 
> Thanks,
> Lai.
> 
> Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Cc: "Gu, Zheng" <guz.fnst@cn.fujitsu.com>
> Cc: tangchen <tangchen@cn.fujitsu.com>
> Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com>
> Lai Jiangshan (5):
>    workqueue: fix memory leak in wq_numa_init()
>    workqueue: update wq_numa_possible_cpumask
>    workqueue: fixup existing pool->node
>    workqueue: update NUMA affinity for the node lost CPU
>    workqueue: retry on NUMA_NO_NODE when create_worker() fails
> 
>   kernel/workqueue.c |  129 ++++++++++++++++++++++++++++++++++++++++++++--------
>   1 files changed, 109 insertions(+), 20 deletions(-)
> 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] workqueue: update wq_numa_possible_cpumask
  2014-12-12 10:19 ` [PATCH 2/5] workqueue: update wq_numa_possible_cpumask Lai Jiangshan
@ 2014-12-12 17:18   ` Tejun Heo
  2014-12-15  2:02     ` Lai Jiangshan
  2014-12-18  2:22   ` Lai Jiangshan
  1 sibling, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2014-12-12 17:18 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

On Fri, Dec 12, 2014 at 06:19:52PM +0800, Lai Jiangshan wrote:
...
> +static void wq_update_numa_mapping(int cpu)
> +{
> +	int node, orig_node = NUMA_NO_NODE, new_node = cpu_to_node(cpu);
> +
> +	lockdep_assert_held(&wq_pool_mutex);
> +
> +	if (!wq_numa_enabled)
> +		return;
> +
> +	/* the node of onlining CPU is not NUMA_NO_NODE */
> +	if (WARN_ON(new_node == NUMA_NO_NODE))
> +		return;
> +
> +	/* test whether the NUMA node mapping is changed. */
> +	if (cpumask_test_cpu(cpu, wq_numa_possible_cpumask[new_node]))
> +		return;
> +
> +	/* find the origin node */
> +	for_each_node(node) {
> +		if (cpumask_test_cpu(cpu, wq_numa_possible_cpumask[node])) {
> +			orig_node = node;
> +			break;
> +		}
> +	}
> +
> +	/* there may be multi mappings changed, re-initial. */
> +	cpumask_clear(wq_numa_possible_cpumask[new_node]);
> +	if (orig_node != NUMA_NO_NODE)
> +		cpumask_clear(wq_numa_possible_cpumask[orig_node]);
> +	for_each_possible_cpu(cpu) {
> +		node = cpu_to_node(node);
> +		if (node == new_node)
> +			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[new_node]);
> +		else if (orig_node != NUMA_NO_NODE && node == orig_node)
> +			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[orig_node]);
> +	}
> +}

Let's please move this to NUMA code and properly update it on actual
mapping changes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/5] workqueue: fixup existing pool->node
  2014-12-12 10:19 ` [PATCH 3/5] workqueue: fixup existing pool->node Lai Jiangshan
@ 2014-12-12 17:25   ` Tejun Heo
  2014-12-15  1:23     ` Lai Jiangshan
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2014-12-12 17:25 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

On Fri, Dec 12, 2014 at 06:19:53PM +0800, Lai Jiangshan wrote:
> Yasuaki Ishimatsu hit a bug when the numa mapping between CPU and node
> is changed.  And the previous path fixup wq_numa_possible_cpumask.
> (See more information form the changelog of that patch)
> 
> After wq_numa_possible_cpumask was updated, the new pool->node will be
> correct, but the existing pools (and workers) are still running, some of
> them are running with the wrong pool->node, or even worse, with pool->node
> which is quitted node, they create_worker() on wrong pool->node.
> These create_worker() may create workers on wrong node or failed without
> any progress (when with pool->node which is quitted node).
> 
> So we need to update the pool->node when the numa mapping is changed.
> 
> We simply re-calc the pool->node when the numa mapping changed. It reuses
> the code from get_unbound_pool() for unbound pool.

I don't get this patch.  If a node is gone, all its cpus would be gone
and the pool should be discarded.  If a new node comes online with
different mappings, new sets of pools should serve them instead of
recycling the old ones.  Wouldn't it make a lot more sense to make
sure we don't reuse the pools w/ old mappings for new pwqs?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/5] workqueue: update NUMA affinity for the node lost CPU
  2014-12-12 10:19 ` [PATCH 4/5] workqueue: update NUMA affinity for the node lost CPU Lai Jiangshan
@ 2014-12-12 17:27   ` Tejun Heo
  2014-12-15  1:28     ` Lai Jiangshan
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2014-12-12 17:27 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

On Fri, Dec 12, 2014 at 06:19:54PM +0800, Lai Jiangshan wrote:
> We fixed the major cases when the numa mapping is changed.
> 
> We still have the assumption that when the node<->cpu mapping is changed
> the original node is offline, and the current code of memory-hutplug also
> prove this.
> 
> This assumption might be changed in future and the orig_node is still online
> in some cases.  And in these cases, the cpumask of the pwqs of the orig_node
> still contains the onlining CPU which is a CPU of another node, and the worker
> may run on the onlining CPU (aka run on the wrong node).
> 
> So we drop this assumption and make the code calls wq_update_unbound_numa()
> to update the affinity in this case.

This is seriously obfuscating.  I really don't think meddling with
existing pools is a good idea.  The foundation those pools were
standing are gone.  Drain and discard the pools.  Please don't try to
retro-fit it to new foundations.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 5/5] workqueue: retry on NUMA_NO_NODE when create_worker() fails
  2014-12-12 16:05   ` KOSAKI Motohiro
@ 2014-12-12 17:29     ` KOSAKI Motohiro
  0 siblings, 0 replies; 55+ messages in thread
From: KOSAKI Motohiro @ 2014-12-12 17:29 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: LKML, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen,
	Hiroyuki KAMEZAWA

On Fri, Dec 12, 2014 at 11:05 AM, KOSAKI Motohiro
<kosaki.motohiro@gmail.com> wrote:
> On Fri, Dec 12, 2014 at 5:19 AM, Lai Jiangshan <laijs@cn.fujitsu.com> wrote:
>> A pwq bound to a specified node might be last long or even forever after
>> the node was offline.  Especially when this pwq has some back-to-back work
>> items which requeue themselves and cause the pwq can't quit.
>>
>> This kinds of pwqs will cause their own pools busy and maybe create workers.
>> This pools will fail on create_worker() since the node is offline.
>>
>> This case is extremely rare, but it is possible. And we hope create_worker()
>> to be fault-tolerant in this case and other different cases when the node
>> is lack of memory, for example, create_worker() can try to allocate memory
>> from the whole system rather than only the target node, the most important
>> thing is making some progress.
>>
>> So the solution is that, when the create_worker() fails on a specified node,
>> it will retry with NUMA_NO_NODE for further allocation.
>
> The code looks correct. But I don't think this issue depend on node offlining.
> The allocation may also fail if node has no memory.

Oh, sorry. my comment is no correct. Please forget.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 5/5] workqueue: retry on NUMA_NO_NODE when create_worker() fails
  2014-12-12 10:19 ` [PATCH 5/5] workqueue: retry on NUMA_NO_NODE when create_worker() fails Lai Jiangshan
  2014-12-12 16:05   ` KOSAKI Motohiro
@ 2014-12-12 17:29   ` Tejun Heo
  1 sibling, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2014-12-12 17:29 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

On Fri, Dec 12, 2014 at 06:19:55PM +0800, Lai Jiangshan wrote:
...
>  fail:
> -	if (id >= 0)
> -		ida_simple_remove(&pool->worker_ida, id);
> +	if (node != NUMA_NO_NODE) {
> +		node = NUMA_NO_NODE;
> +		goto again;
> +	}
> +	ida_simple_remove(&pool->worker_ida, id);

The retry seems too general for the problem case it's trying to solve.
Can we interlock it properly with node offline event?  On node
offline, grab pool_mutex and clear all pool->node's which match the
node which is going down and take it off circulation.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 0/4] workqueue: fix bug when numa mapping is changed v2.
  2014-12-12 10:19 [PATCH 0/5] workqueue: fix bug when numa mapping is changed Lai Jiangshan
                   ` (5 preceding siblings ...)
  2014-12-12 17:13 ` [PATCH 0/5] workqueue: fix bug when numa mapping is changed Yasuaki Ishimatsu
@ 2014-12-13 16:27 ` Kamezawa Hiroyuki
  2014-12-13 16:30   ` [PATCH 1/4] workqueue: add a hook for node hotplug Kamezawa Hiroyuki
                     ` (4 more replies)
  6 siblings, 5 replies; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-13 16:27 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, Tejun Heo
  Cc: Yasuaki Ishimatsu, Gu, Zheng, tangchen, Kamezawa Hiroyuki


Yasuaki Ishimatsu hit a allocation failure bug when the numa mapping
between CPU and node is changed. This was the last scene:
 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

I and Yasuaki have a host which has a feature of node hotplug, this is a fix by me.
Tested several patterns of hotplug and I found no issue, now.
Of course I read Lai's patch and Tejun's comment. I hope I could reflect them.

1/4 ... add node-hotplug event callback.
2/4 ... add a sanity check (for debug)
3/4 ... remove per-node unbound workqueue if node goes offline.
4/4 ... update per-cpu pool's information and cpumasks, node information
        based on the latest (cpu, node) information.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/4] workqueue: add a hook for node hotplug
  2014-12-13 16:27 ` [PATCH 0/4] workqueue: fix bug when numa mapping is changed v2 Kamezawa Hiroyuki
@ 2014-12-13 16:30   ` Kamezawa Hiroyuki
  2014-12-13 16:33   ` [PATCH 2/4] workqueue: add warning if pool->node is offline Kamezawa Hiroyuki
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-13 16:30 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, Tejun Heo
  Cc: Yasuaki Ishimatsu, Gu, Zheng, tangchen

Subject: [PATCH 1/4] add callbackof node hotplug for workqueue.

Because workqueue is numa aware, it pool has node information.
And it should be maintained against node-hotplug.

When a node which exists at boot is unpluged, following error
is detected.
==
SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656
==
This is because pool->node points a stale node.

This patch adds callback function at node hotplug.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com
---
 include/linux/workqueue.h |  6 ++++++
 kernel/workqueue.c        | 18 ++++++++++++++++++
 mm/memory_hotplug.c       |  9 +++++++--
 3 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index b996e6cd..3f2b40b 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -591,4 +591,10 @@ static inline int workqueue_sysfs_register(struct workqueue_struct *wq)
 { return 0; }
 #endif	/* CONFIG_SYSFS */
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+/* notify node hotplug event when pgdat is created/removed */
+void workqueue_register_numanode(int node);
+void workqueue_unregister_numanode(int node);
+#endif
+
 #endif
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 09b685d..f6cb357c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4901,3 +4901,21 @@ static int __init init_workqueues(void)
 	return 0;
 }
 early_initcall(init_workqueues);
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+/*
+ * If a node itself is hot-unpluged by memory hotplug, it's guaranteed that
+ * there are no online cpus on the node. After a node unplug, it's not
+ * guaranteed that a cpuid of newly added by hot-add is tied to a node id
+ * which was determined before node unplug. pool->node should be cleared and
+ * cached pools per cpu should be freed at node unplug
+ */
+
+void workqueue_register_numanode(int nid)
+{
+}
+ 
+void workqueue_unregister_numanode(int nid)
+{
+}
+#endif
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1bf4807..504b071 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1162,7 +1162,8 @@ int try_online_node(int nid)
 		build_all_zonelists(NULL, NULL);
 		mutex_unlock(&zonelists_mutex);
 	}
-
+	/* Now zonelist for the pgdat is ready */
+	workqueue_register_numanode(nid);
 out:
 	mem_hotplug_done();
 	return ret;
@@ -1914,7 +1915,11 @@ static int check_and_unmap_cpu_on_node(pg_data_t *pgdat)
 	ret = check_cpu_on_node(pgdat);
 	if (ret)
 		return ret;
-
+	/*
+	 * There is no online cpu on the node and this node will go.
+	 * make workqueue to forget this node.
+	 */
+	workqueue_unregister_numanode(pgdat->node_id);
 	/*
 	 * the node will be offlined when we come here, so we can clear
 	 * the cpu_to_node() now.
-- 
1.8.3.1




^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 2/4] workqueue: add warning if pool->node is offline
  2014-12-13 16:27 ` [PATCH 0/4] workqueue: fix bug when numa mapping is changed v2 Kamezawa Hiroyuki
  2014-12-13 16:30   ` [PATCH 1/4] workqueue: add a hook for node hotplug Kamezawa Hiroyuki
@ 2014-12-13 16:33   ` Kamezawa Hiroyuki
  2014-12-13 16:35   ` [PATCH 3/4] workqueue: remove per-node unbound pool when node goes offline Kamezawa Hiroyuki
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-13 16:33 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, Tejun Heo
  Cc: Yasuaki Ishimatsu, Gu, Zheng, tangchen

Add warning if pool->node is offline.

This patch was originaly made for debug.
I think add warning here can show what may happen.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 kernel/workqueue.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f6cb357c..35f4f00 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -760,6 +760,15 @@ static bool too_many_workers(struct worker_pool *pool)
 	return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
 }
 
+/* Check pool->node */
+static int pool_to_node(struct worker_pool *pool)
+{
+	if (WARN_ON_ONCE(pool->node != NUMA_NO_NODE
+                         && !node_online(pool->node)))
+		return NUMA_NO_NODE;
+	return pool->node;
+}
+
 /*
  * Wake up functions.
  */
@@ -1679,7 +1688,7 @@ static struct worker *create_worker(struct worker_pool *pool)
 	if (id < 0)
 		goto fail;
 
-	worker = alloc_worker(pool->node);
+	worker = alloc_worker(pool_to_node(pool));
 	if (!worker)
 		goto fail;
 
@@ -1692,7 +1701,8 @@ static struct worker *create_worker(struct worker_pool *pool)
 	else
 		snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
 
-	worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
+	worker->task = kthread_create_on_node(worker_thread, worker,
+					      pool_to_node(pool),
 					      "kworker/%s", id_buf);
 	if (IS_ERR(worker->task))
 		goto fail;
@@ -3651,7 +3661,7 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
 	if (!pool)
 		return NULL;
 
-	pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool->node);
+	pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool_to_node(pool));
 	if (!pwq) {
 		put_unbound_pool(pool);
 		return NULL;
-- 
1.8.3.1




^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 3/4] workqueue: remove per-node unbound pool when node goes offline.
  2014-12-13 16:27 ` [PATCH 0/4] workqueue: fix bug when numa mapping is changed v2 Kamezawa Hiroyuki
  2014-12-13 16:30   ` [PATCH 1/4] workqueue: add a hook for node hotplug Kamezawa Hiroyuki
  2014-12-13 16:33   ` [PATCH 2/4] workqueue: add warning if pool->node is offline Kamezawa Hiroyuki
@ 2014-12-13 16:35   ` Kamezawa Hiroyuki
  2014-12-15  2:06     ` Lai Jiangshan
  2014-12-13 16:38   ` [PATCH 4/4] workqueue: handle change in cpu-node relationship Kamezawa Hiroyuki
  2014-12-15 11:11   ` [PATCH 0/4] workqueue: fix memory allocation after numa mapping is changed v3 Kamezawa Hiroyuki
  4 siblings, 1 reply; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-13 16:35 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, Tejun Heo
  Cc: Yasuaki Ishimatsu, Gu, Zheng, tangchen

remove node aware unbound pools if node goes offline.

scan unbound workqueue and remove numa affine pool when
a node goes offline.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 kernel/workqueue.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 35f4f00..07b4eb5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4921,11 +4921,40 @@ early_initcall(init_workqueues);
  * cached pools per cpu should be freed at node unplug
  */
 
+/*
+ * Replace per-node pwq with dfl_pwq because this node disappers.
+ * The new pool will be set at CPU_ONLINE by wq_update_unbound_numa.
+ */
+static void wq_release_unbound_numa(struct workqueue_struct *wq, int nid)
+{
+	struct pool_workqueue *old_pwq = NULL;
+
+	if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND))
+		return;
+	mutex_lock(&wq->mutex);
+	if (wq->unbound_attrs->no_numa)
+		goto out_unlock;
+	spin_lock_irq(&wq->dfl_pwq->pool->lock);
+	get_pwq(wq->dfl_pwq);
+	spin_unlock_irq(&wq->dfl_pwq->pool->lock);
+	old_pwq = numa_pwq_tbl_install(wq, nid, wq->dfl_pwq);
+out_unlock:
+	mutex_unlock(&wq->mutex);
+	put_pwq_unlocked(old_pwq);
+	return;
+}
+
 void workqueue_register_numanode(int nid)
 {
 }
  
 void workqueue_unregister_numanode(int nid)
 {
+	struct workqueue_struct *wq;
+
+	mutex_lock(&wq_pool_mutex);
+	list_for_each_entry(wq, &workqueues, list)
+		wq_release_unbound_numa(wq, nid);
+	mutex_unlock(&wq_pool_mutex);
 }
 #endif
-- 
1.8.3.1




^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 4/4] workqueue: handle change in cpu-node relationship.
  2014-12-13 16:27 ` [PATCH 0/4] workqueue: fix bug when numa mapping is changed v2 Kamezawa Hiroyuki
                     ` (2 preceding siblings ...)
  2014-12-13 16:35   ` [PATCH 3/4] workqueue: remove per-node unbound pool when node goes offline Kamezawa Hiroyuki
@ 2014-12-13 16:38   ` Kamezawa Hiroyuki
  2014-12-15  2:12     ` Lai Jiangshan
  2014-12-15 11:11   ` [PATCH 0/4] workqueue: fix memory allocation after numa mapping is changed v3 Kamezawa Hiroyuki
  4 siblings, 1 reply; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-13 16:38 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, Tejun Heo
  Cc: Yasuaki Ishimatsu, Gu, Zheng, tangchen

Although workqueue detects relationship between cpu<->node at boot,
it is finally determined in cpu_up().
This patch tries to update pool->node using online status of cpus.

1. When a node goes down, clear per-cpu pool's node attr.
2. When a cpu comes up, update per-cpu pool's node attr.
3. When a cpu comes up, update possinle node cpumask workqueue is using for sched.
4. Detect the best node for unbound pool's cpumask using the latest info.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 kernel/workqueue.c | 67 ++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 53 insertions(+), 14 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 07b4eb5..259b3ba 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -266,7 +266,8 @@ struct workqueue_struct {
 static struct kmem_cache *pwq_cache;
 
 static cpumask_var_t *wq_numa_possible_cpumask;
-					/* possible CPUs of each node */
+	/* possible CPUs of each node initialized with possible info at boot.
+           but modified at cpu hotplug to be adjusted to real info.  */
 
 static bool wq_disable_numa;
 module_param_named(disable_numa, wq_disable_numa, bool, 0444);
@@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool)
 	call_rcu_sched(&pool->rcu, rcu_free_pool);
 }
 
+/*
+ * detect best node for given cpumask.
+ */
+static int pool_detect_best_node(const struct cpumask *cpumask)
+{
+	int node, best, match, selected;
+	static struct cpumask andmask; /* we're under mutex */
+
+	/* Is any node okay ? */
+	if (!wq_numa_enabled ||
+	    cpumask_subset(cpu_online_mask, cpumask))
+		return NUMA_NO_NODE;
+	best = 0;
+	selected = NUMA_NO_NODE;
+	/* select a node which contains the most cpu of cpumask */
+	for_each_node_state(node, N_ONLINE) {
+		cpumask_and(&andmask, cpumask, cpumask_of_node(node));
+		match = cpumask_weight(&andmask);
+		if (match > best)
+			selected = node;
+	}
+	return selected;
+}
+
+
 /**
  * get_unbound_pool - get a worker_pool with the specified attributes
  * @attrs: the attributes of the worker_pool to get
@@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 {
 	u32 hash = wqattrs_hash(attrs);
 	struct worker_pool *pool;
-	int node;
 
 	lockdep_assert_held(&wq_pool_mutex);
 
@@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 	 * 'struct workqueue_attrs' comments for detail.
 	 */
 	pool->attrs->no_numa = false;
-
-	/* if cpumask is contained inside a NUMA node, we belong to that node */
-	if (wq_numa_enabled) {
-		for_each_node(node) {
-			if (cpumask_subset(pool->attrs->cpumask,
-					   wq_numa_possible_cpumask[node])) {
-				pool->node = node;
-				break;
-			}
-		}
-	}
+	pool->node = pool_detect_best_node(pool->attrs->cpumask);
 
 	if (worker_pool_assign_id(pool) < 0)
 		goto fail;
@@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
 	int cpu = (unsigned long)hcpu;
 	struct worker_pool *pool;
 	struct workqueue_struct *wq;
-	int pi;
+	int pi, node;
 
 	switch (action & ~CPU_TASKS_FROZEN) {
 	case CPU_UP_PREPARE:
@@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
 	case CPU_ONLINE:
 		mutex_lock(&wq_pool_mutex);
 
+		/* now cpu <-> node info is established, update the info. */
+		if (!wq_disable_numa) {
+			for_each_node_state(node, N_POSSIBLE)
+				cpumask_clear_cpu(cpu,
+					wq_numa_possible_cpumask[node]);
+			node = cpu_to_node(cpu);
+			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+		}
+		for_each_cpu_worker_pool(pool, cpu)
+			pool->node = cpu_to_node(cpu);
 		for_each_pool(pool, pi) {
 			mutex_lock(&pool->attach_mutex);
 
@@ -4951,7 +4976,21 @@ void workqueue_register_numanode(int nid)
 void workqueue_unregister_numanode(int nid)
 {
 	struct workqueue_struct *wq;
+	const struct cpumask *nodecpumask;
+	struct worker_pool *pool;
+	int cpu;
 
+	/* at this point, cpu-to-node relationship is not lost */
+	nodecpumask = cpumask_of_node(nid);
+	for_each_cpu(cpu, nodecpumask) {
+		/*
+		 * pool is allcated at boot and assumed to be persistent,
+		 * we cannot free this.
+		 * Update to be NUMA_NO_NODE. This will be fixed at ONLINE
+		 */
+		for_each_cpu_worker_pool(pool, cpu)
+			pool->node = NUMA_NO_NODE;
+	}
 	mutex_lock(&wq_pool_mutex);
 	list_for_each_entry(wq, &workqueues, list)
 		wq_release_unbound_numa(wq, nid);
-- 
1.8.3.1




^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/5] workqueue: fixup existing pool->node
  2014-12-12 17:25   ` Tejun Heo
@ 2014-12-15  1:23     ` Lai Jiangshan
  2014-12-25 20:14       ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-15  1:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

On 12/13/2014 01:25 AM, Tejun Heo wrote:
> On Fri, Dec 12, 2014 at 06:19:53PM +0800, Lai Jiangshan wrote:
>> Yasuaki Ishimatsu hit a bug when the numa mapping between CPU and node
>> is changed.  And the previous path fixup wq_numa_possible_cpumask.
>> (See more information form the changelog of that patch)
>>
>> After wq_numa_possible_cpumask was updated, the new pool->node will be
>> correct, but the existing pools (and workers) are still running, some of
>> them are running with the wrong pool->node, or even worse, with pool->node
>> which is quitted node, they create_worker() on wrong pool->node.
>> These create_worker() may create workers on wrong node or failed without
>> any progress (when with pool->node which is quitted node).
>>
>> So we need to update the pool->node when the numa mapping is changed.
>>
>> We simply re-calc the pool->node when the numa mapping changed. It reuses
>> the code from get_unbound_pool() for unbound pool.
> 
> I don't get this patch.  If a node is gone, all its cpus would be gone
> and the pool should be discarded.  If a new node comes online with
> different mappings, new sets of pools should serve them instead of
> recycling the old ones.  Wouldn't it make a lot more sense to make
> sure we don't reuse the pools w/ old mappings for new pwqs?
> 


The pwqs of the old node's cpumask do be discarded. But the pools of the old
node's cpumask maybe recycle. For example, a new workqueue's affinity is set to
the old node's cpumask before the pool is dead. Any old pool can long live.


> Thanks.
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/5] workqueue: update NUMA affinity for the node lost CPU
  2014-12-12 17:27   ` Tejun Heo
@ 2014-12-15  1:28     ` Lai Jiangshan
  2014-12-25 20:17       ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-15  1:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

On 12/13/2014 01:27 AM, Tejun Heo wrote:
> On Fri, Dec 12, 2014 at 06:19:54PM +0800, Lai Jiangshan wrote:
>> We fixed the major cases when the numa mapping is changed.
>>
>> We still have the assumption that when the node<->cpu mapping is changed
>> the original node is offline, and the current code of memory-hutplug also
>> prove this.
>>
>> This assumption might be changed in future and the orig_node is still online
>> in some cases.  And in these cases, the cpumask of the pwqs of the orig_node
>> still contains the onlining CPU which is a CPU of another node, and the worker
>> may run on the onlining CPU (aka run on the wrong node).
>>
>> So we drop this assumption and make the code calls wq_update_unbound_numa()
>> to update the affinity in this case.
> 
> This is seriously obfuscating.  I really don't think meddling with
> existing pools is a good idea.  

> The foundation those pools were standing are gone.

This statement is not true unless we write some code to force them,
dequeue them from the unbound_pool_hash, for example.

> Drain and discard the pools.  Please don't try to
> retro-fit it to new foundations.
> 
> Thanks.
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/5] workqueue: fix bug when numa mapping is changed
  2014-12-12 17:13 ` [PATCH 0/5] workqueue: fix bug when numa mapping is changed Yasuaki Ishimatsu
@ 2014-12-15  1:34   ` Lai Jiangshan
  2014-12-18  1:50     ` Yasuaki Ishimatsu
  0 siblings, 1 reply; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-15  1:34 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: linux-kernel, Tejun Heo, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

On 12/13/2014 01:13 AM, Yasuaki Ishimatsu wrote:
> Hi Lai,
> 
> Thank you for posting the patches. I tried your patches.
> But the following kernel panic occurred.

Hi, Yasuaki,

Thanks for testing.  Would you like to use GDB to print the code of
"workqueue_cpu_up_callback+0x510" ?

Thanks,
Lai

> 
> [  889.394087] BUG: unable to handle kernel paging request at 000000020000f3f1
> [  889.395005] IP: [<ffffffff8108fe90>] workqueue_cpu_up_callback+0x510/0x740
> [  889.395005] PGD 17a83067 PUD 0
> [  889.395005] Oops: 0000 [#1] SMP
> [  889.395005] Modules linked in: xt_CHECKSUM ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables
> ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg vfat fat x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel iTCO_wdt sb_edac
> iTCO_vendor_support i2c_i801 lrw gf128mul lpc_ich edac_core glue_helper mfd_core ablk_helper cryptd pcspkr shpchp ipmi_devintf ipmi_si ipmi_msghandler tpm_infineon nfsd auth_rpcgss nfs_acl lockd grace sunrpc uinput xfs libcrc32c sd_mod mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm
> e1000e lpfc drm dca ptp i2c_algo_bit megaraid_sas scsi_transport_fc pps_core i2c_core dm_mirror dm_region_hash dm_log dm_mod
> [  889.395005] CPU: 8 PID: 13595 Comm: udev_dp_bridge. Not tainted 3.18.0Lai+ #26
> [  889.395005] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.81 12/03/2014
> [  889.395005] task: ffff8a074a145160 ti: ffff8a077a6ec000 task.ti: ffff8a077a6ec000
> [  889.395005] RIP: 0010:[<ffffffff8108fe90>]  [<ffffffff8108fe90>] workqueue_cpu_up_callback+0x510/0x740
> [  889.395005] RSP: 0018:ffff8a077a6efca8  EFLAGS: 00010202
> [  889.395005] RAX: 0000000000000001 RBX: 000000000000edf1 RCX: 000000000000edf1
> [  889.395005] RDX: 0000000000000100 RSI: 000000020000f3f1 RDI: 0000000000000001
> [  889.395005] RBP: ffff8a077a6efd08 R08: ffffffff81ac6de0 R09: ffff880874610000
> [  889.395005] R10: 00000000ffffffff R11: 0000000000000001 R12: 000000000000f3f0
> [  889.395005] R13: 000000000000001f R14: 00000000ffffffff R15: ffffffff81ac6de0
> [  889.395005] FS:  00007f6b20c67740(0000) GS:ffff88087fd00000(0000) knlGS:0000000000000000
> [  889.395005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  889.395005] CR2: 000000020000f3f1 CR3: 000000004534c000 CR4: 00000000001407e0
> [  889.395005] Stack:
> [  889.395005]  ffffffffffffffff 0000000000000020 fffffffffffffff8 00000004810a192d
> [  889.395005]  ffff8a0700000204 0000000052f5b32d ffffffff81994fc0 00000000fffffff6
> [  889.395005]  ffffffff81a13840 0000000000000002 000000000000001f 0000000000000000
> [  889.395005] Call Trace:
> [  889.395005]  [<ffffffff81094f6c>] notifier_call_chain+0x4c/0x70
> [  889.395005]  [<ffffffff8109507e>] __raw_notifier_call_chain+0xe/0x10
> [  889.395005]  [<ffffffff810750b3>] cpu_notify+0x23/0x50
> [  889.395005]  [<ffffffff81075408>] _cpu_up+0x188/0x1a0
> [  889.395005]  [<ffffffff810754a9>] cpu_up+0x89/0xb0
> [  889.395005]  [<ffffffff8164f960>] cpu_subsys_online+0x40/0x90
> [  889.395005]  [<ffffffff8140f10d>] device_online+0x6d/0xa0
> [  889.395005]  [<ffffffff8140f1d5>] online_store+0x95/0xa0
> [  889.395005]  [<ffffffff8140c2e8>] dev_attr_store+0x18/0x30
> [  889.395005]  [<ffffffff8126210d>] sysfs_kf_write+0x3d/0x50
> [  889.395005]  [<ffffffff81261624>] kernfs_fop_write+0xe4/0x160
> [  889.395005]  [<ffffffff811e90d7>] vfs_write+0xb7/0x1f0
> [  889.395005]  [<ffffffff81021dcc>] ? do_audit_syscall_entry+0x6c/0x70
> [  889.395005]  [<ffffffff811e9bc5>] SyS_write+0x55/0xd0
> [  889.395005]  [<ffffffff816646a9>] system_call_fastpath+0x12/0x17
> [  889.395005] Code: 44 00 00 83 c7 01 48 63 d7 4c 89 ff e8 3a 2a 28 00 8b 15 78 84 a3 00 89 c7 39 d0 7d 70 48 63 cb 4c 89 e6 48 03 34 cd e0 3a ab 81 <8b> 1e 39 5d bc 74 36 41 39 de 74 0c 48 63 f2 eb c7 0f 1f 80 00
> [  889.395005] RIP  [<ffffffff8108fe90>] workqueue_cpu_up_callback+0x510/0x740
> [  889.395005]  RSP <ffff8a077a6efca8>
> [  889.395005] CR2: 000000020000f3f1
> [  889.785760] ---[ end trace 39abbfc9f93402f2 ]---
> [  889.790931] Kernel panic - not syncing: Fatal exception
> [  889.791931] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
> [  889.791931] drm_kms_helper: panic occurred, switching back to text console
> [  889.815947] ------------[ cut here ]------------
> [  889.815947] WARNING: CPU: 8 PID: 64 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x5d/0x60()
> [  889.815947] Modules linked in: xt_CHECKSUM ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables
> ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg vfat fat x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel iTCO_wdt sb_edac
> iTCO_vendor_support i2c_i801 lrw gf128mul lpc_ich edac_core glue_helper mfd_core ablk_helper cryptd pcspkr shpchp ipmi_devintf ipmi_si ipmi_msghandler tpm_infineon nfsd auth_rpcgss nfs_acl lockd grace sunrpc uinput xfs libcrc32c sd_mod mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm
> e1000e lpfc drm dca ptp i2c_algo_bit megaraid_sas scsi_transport_fc pps_core i2c_core dm_mirror dm_region_hash dm_log dm_mod
> [  889.815947] CPU: 8 PID: 64 Comm: migration/8 Tainted: G      D        3.18.0Lai+ #26
> [  889.815947] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.81 12/03/2014
> [  889.815947]  0000000000000000 00000000f7f40529 ffff88087fd03d38 ffffffff8165c8d4
> [  889.815947]  0000000000000000 0000000000000000 ffff88087fd03d78 ffffffff81074eb1
> [  889.815947]  ffff88087fd03d78 0000000000000000 ffff88087fc13840 0000000000000008
> [  889.815947] Call Trace:
> [  889.815947]  <IRQ>  [<ffffffff8165c8d4>] dump_stack+0x46/0x58
> [  889.815947]  [<ffffffff81074eb1>] warn_slowpath_common+0x81/0xa0
> [  889.815947]  [<ffffffff81074fca>] warn_slowpath_null+0x1a/0x20
> [  889.815947]  [<ffffffff810489bd>] native_smp_send_reschedule+0x5d/0x60
> [  889.815947]  [<ffffffff810b0ad4>] trigger_load_balance+0x144/0x1b0
> [  889.815947]  [<ffffffff810a009f>] scheduler_tick+0x9f/0xe0
> [  889.815947]  [<ffffffff810daef4>] update_process_times+0x64/0x80
> [  889.815947]  [<ffffffff810eab05>] tick_sched_handle.isra.19+0x25/0x60
> [  889.815947]  [<ffffffff810eab85>] tick_sched_timer+0x45/0x80
> [  889.815947]  [<ffffffff810dbbe7>] __run_hrtimer+0x77/0x1d0
> [  889.815947]  [<ffffffff810eab40>] ? tick_sched_handle.isra.19+0x60/0x60
> [  889.815947]  [<ffffffff810dbfd7>] hrtimer_interrupt+0xf7/0x240
> [  889.815947]  [<ffffffff8104b85b>] local_apic_timer_interrupt+0x3b/0x70
> [  889.815947]  [<ffffffff81667465>] smp_apic_timer_interrupt+0x45/0x60
> [  889.815947]  [<ffffffff8166553d>] apic_timer_interrupt+0x6d/0x80
> [  889.815947]  <EOI>  [<ffffffff810a79c7>] ? set_next_entity+0x67/0x80
> [  889.815947]  [<ffffffffa011d1d7>] ? __drm_modeset_lock_all+0x37/0x120 [drm]
> [  889.815947]  [<ffffffff8109c727>] ? finish_task_switch+0x57/0x180
> [  889.815947]  [<ffffffff8165fba8>] __schedule+0x2e8/0x7e0
> [  889.815947]  [<ffffffff816600c9>] schedule+0x29/0x70
> [  889.815947]  [<ffffffff81097d43>] smpboot_thread_fn+0xd3/0x1b0
> [  889.815947]  [<ffffffff81097c70>] ? SyS_setgroups+0x1a0/0x1a0
> [  889.815947]  [<ffffffff81093df1>] kthread+0xe1/0x100
> [  889.815947]  [<ffffffff81093d10>] ? kthread_create_on_node+0x1b0/0x1b0
> [  889.815947]  [<ffffffff816645fc>] ret_from_fork+0x7c/0xb0
> [  889.815947]  [<ffffffff81093d10>] ? kthread_create_on_node+0x1b0/0x1b0
> [  889.815947] ---[ end trace 39abbfc9f93402f3 ]---
> [  890.156187] ------------[ cut here ]------------
> [  890.156187] WARNING: CPU: 8 PID: 64 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x5d/0x60()
> [  890.156187] Modules linked in: xt_CHECKSUM ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables
> ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg vfat fat x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel iTCO_wdt sb_edac
> iTCO_vendor_support i2c_i801 lrw gf128mul lpc_ich edac_core glue_helper mfd_core ablk_helper cryptd pcspkr shpchp ipmi_devintf ipmi_si ipmi_msghandler tpm_infineon nfsd auth_rpcgss nfs_acl lockd grace sunrpc uinput xfs libcrc32c sd_mod mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm
> e1000e lpfc drm dca ptp i2c_algo_bit megaraid_sas scsi_transport_fc pps_core i2c_core dm_mirror dm_region_hash dm_log dm_mod
> [  890.156187] CPU: 8 PID: 64 Comm: migration/8 Tainted: G      D W      3.18.0Lai+ #26
> [  890.156187] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.81 12/03/2014
> [  890.156187]  0000000000000000 00000000f7f40529 ffff88087366bc08 ffffffff8165c8d4
> [  890.156187]  0000000000000000 0000000000000000 ffff88087366bc48 ffffffff81074eb1
> [  890.156187]  ffff88087fd142c0 0000000000000044 ffff8a074a145160 ffff8a074a145160
> [  890.156187] Call Trace:
> [  890.156187]  [<ffffffff8165c8d4>] dump_stack+0x46/0x58
> [  890.156187]  [<ffffffff81074eb1>] warn_slowpath_common+0x81/0xa0
> [  890.156187]  [<ffffffff81074fca>] warn_slowpath_null+0x1a/0x20
> [  890.156187]  [<ffffffff810489bd>] native_smp_send_reschedule+0x5d/0x60
> [  890.156187]  [<ffffffff8109ddd8>] resched_curr+0xa8/0xd0
> [  890.156187]  [<ffffffff8109eac0>] check_preempt_curr+0x80/0xa0
> [  890.156187]  [<ffffffff810a78c8>] attach_task+0x48/0x50
> [  890.156187]  [<ffffffff810a7ae5>] active_load_balance_cpu_stop+0x105/0x250
> [  890.156187]  [<ffffffff810a79e0>] ? set_next_entity+0x80/0x80
> [  890.156187]  [<ffffffff8110cab8>] cpu_stopper_thread+0x78/0x150
> [  890.156187]  [<ffffffff8165fba8>] ? __schedule+0x2e8/0x7e0
> [  890.156187]  [<ffffffff81097d6f>] smpboot_thread_fn+0xff/0x1b0
> [  890.156187]  [<ffffffff81097c70>] ? SyS_setgroups+0x1a0/0x1a0
> [  890.156187]  [<ffffffff81093df1>] kthread+0xe1/0x100
> [  890.156187]  [<ffffffff81093d10>] ? kthread_create_on_node+0x1b0/0x1b0
> [  890.156187]  [<ffffffff816645fc>] ret_from_fork+0x7c/0xb0
> [  890.156187]  [<ffffffff81093d10>] ? kthread_create_on_node+0x1b0/0x1b0
> [  890.156187] ---[ end trace 39abbfc9f93402f4 ]---
> 
> Thanks,
> Yasuaki Ishimatsu
> 
> (2014/12/12 19:19), Lai Jiangshan wrote:
>> Workqueue code has an assumption that the numa mapping is stable
>> after system booted.  It is incorrectly currently.
>>
>> Yasuaki Ishimatsu hit a allocation failure bug when the numa mapping
>> between CPU and node is changed. This was the last scene:
>>   SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>>    cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
>>    node 0: slabs: 6172, objs: 259224, free: 245741
>>    node 1: slabs: 3261, objs: 136962, free: 127656
>>
>> Yasuaki Ishimatsu investigated that it happened in the following situation:
>>
>> 1) System Node/CPU before offline/online:
>> 	       | CPU
>> 	------------------------
>> 	node 0 |  0-14, 60-74
>> 	node 1 | 15-29, 75-89
>> 	node 2 | 30-44, 90-104
>> 	node 3 | 45-59, 105-119
>>
>> 2) A system-board (contains node2 and node3) is offline:
>> 	       | CPU
>> 	------------------------
>> 	node 0 |  0-14, 60-74
>> 	node 1 | 15-29, 75-89
>>
>> 3) A new system-board is online, two new node IDs are allocated
>>     for the two node of the SB, but the old CPU IDs are allocated for
>>     the SB, here the NUMA mapping between node and CPU is changed.
>>     (the node of CPU#30 is changed from node#2 to node#4, for example)
>> 	       | CPU
>> 	------------------------
>> 	node 0 |  0-14, 60-74
>> 	node 1 | 15-29, 75-89
>> 	node 4 | 30-59
>> 	node 5 | 90-119
>>
>> 4) now, the NUMA mapping is changed, but wq_numa_possible_cpumask
>>     which is the convenient NUMA mapping cache in workqueue.c is still outdated.
>>     thus pool->node calculated by get_unbound_pool() is incorrect.
>>
>> 5) when the create_worker() is called with the incorrect offlined
>>      pool->node, it is failed and the pool can't make any progress.
>>
>> To fix this bug, we need to fixup the wq_numa_possible_cpumask and the
>> pool->node, it is done in patch2 and patch3.
>>
>> patch1 fixes memory leak related wq_numa_possible_cpumask.
>> patch4 kill another assumption about how the numa mapping changed.
>> patch5 reduces the allocation fails when the node is offline or the node
>> is lack of memory.
>>
>> The patchset is untested. It is sent for earlier review.
>>
>> Thanks,
>> Lai.
>>
>> Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>> Cc: Tejun Heo <tj@kernel.org>
>> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>> Cc: "Gu, Zheng" <guz.fnst@cn.fujitsu.com>
>> Cc: tangchen <tangchen@cn.fujitsu.com>
>> Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com>
>> Lai Jiangshan (5):
>>    workqueue: fix memory leak in wq_numa_init()
>>    workqueue: update wq_numa_possible_cpumask
>>    workqueue: fixup existing pool->node
>>    workqueue: update NUMA affinity for the node lost CPU
>>    workqueue: retry on NUMA_NO_NODE when create_worker() fails
>>
>>   kernel/workqueue.c |  129 ++++++++++++++++++++++++++++++++++++++++++++--------
>>   1 files changed, 109 insertions(+), 20 deletions(-)
>>
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] workqueue: update wq_numa_possible_cpumask
  2014-12-12 17:18   ` Tejun Heo
@ 2014-12-15  2:02     ` Lai Jiangshan
  2014-12-25 20:16       ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-15  2:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

On 12/13/2014 01:18 AM, Tejun Heo wrote:
> On Fri, Dec 12, 2014 at 06:19:52PM +0800, Lai Jiangshan wrote:
> ...
>> +static void wq_update_numa_mapping(int cpu)
>> +{
>> +	int node, orig_node = NUMA_NO_NODE, new_node = cpu_to_node(cpu);
>> +
>> +	lockdep_assert_held(&wq_pool_mutex);
>> +
>> +	if (!wq_numa_enabled)
>> +		return;
>> +
>> +	/* the node of onlining CPU is not NUMA_NO_NODE */
>> +	if (WARN_ON(new_node == NUMA_NO_NODE))
>> +		return;
>> +
>> +	/* test whether the NUMA node mapping is changed. */
>> +	if (cpumask_test_cpu(cpu, wq_numa_possible_cpumask[new_node]))
>> +		return;
>> +
>> +	/* find the origin node */
>> +	for_each_node(node) {
>> +		if (cpumask_test_cpu(cpu, wq_numa_possible_cpumask[node])) {
>> +			orig_node = node;
>> +			break;
>> +		}
>> +	}
>> +
>> +	/* there may be multi mappings changed, re-initial. */
>> +	cpumask_clear(wq_numa_possible_cpumask[new_node]);
>> +	if (orig_node != NUMA_NO_NODE)
>> +		cpumask_clear(wq_numa_possible_cpumask[orig_node]);
>> +	for_each_possible_cpu(cpu) {
>> +		node = cpu_to_node(node);
>> +		if (node == new_node)
>> +			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[new_node]);
>> +		else if (orig_node != NUMA_NO_NODE && node == orig_node)
>> +			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[orig_node]);
>> +	}
>> +}
> 
> Let's please move this to NUMA code and properly update it on actual
> mapping changes.
> 

Hi, TJ

I didn't get your means.  What did you mean "NUMA code"?  Which one did you mean?

1) "NUMA code" = system's NUMA memory hotplug code, AKA, keep the numa mapping stable

   I think this is the better idea.  This idea came to my mind immediately at the time
   I received the bug report.  And after some discussions, I was told that it is too HARD
   to keep the numa mapping stable across multiple physical system-board/node online/offline.

   This idea makes the assumption "the numa mapping is stable after system booted" as
   a restriction of the NUMA.  And it will favor all the code outside of the numa code,
   otherwise (we deny the assumption like this patchset) all the code which use
   "cpu_to_node()" and cache the return value will have to be fixed up like this patchset.

   Hi, hotplug-team, any idea to keep the numa mapping stable?

2) "NUMA code" = workqueue's NUMA code
   I think I already did it, the code I added was right below the code of
   wq_update_unbound_numa().  Or I missed something?

Thanks,
Lai

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/4] workqueue: remove per-node unbound pool when node goes offline.
  2014-12-13 16:35   ` [PATCH 3/4] workqueue: remove per-node unbound pool when node goes offline Kamezawa Hiroyuki
@ 2014-12-15  2:06     ` Lai Jiangshan
  2014-12-15  2:06       ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-15  2:06 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

On 12/14/2014 12:35 AM, Kamezawa Hiroyuki wrote:
> remove node aware unbound pools if node goes offline.
> 
> scan unbound workqueue and remove numa affine pool when
> a node goes offline.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  kernel/workqueue.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 35f4f00..07b4eb5 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -4921,11 +4921,40 @@ early_initcall(init_workqueues);
>   * cached pools per cpu should be freed at node unplug
>   */
>  
> +/*
> + * Replace per-node pwq with dfl_pwq because this node disappers.
> + * The new pool will be set at CPU_ONLINE by wq_update_unbound_numa.
> + */
> +static void wq_release_unbound_numa(struct workqueue_struct *wq, int nid)
> +{
> +	struct pool_workqueue *old_pwq = NULL;
> +
> +	if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND))
> +		return;
> +	mutex_lock(&wq->mutex);
> +	if (wq->unbound_attrs->no_numa)
> +		goto out_unlock;
> +	spin_lock_irq(&wq->dfl_pwq->pool->lock);
> +	get_pwq(wq->dfl_pwq);
> +	spin_unlock_irq(&wq->dfl_pwq->pool->lock);
> +	old_pwq = numa_pwq_tbl_install(wq, nid, wq->dfl_pwq);
> +out_unlock:
> +	mutex_unlock(&wq->mutex);
> +	put_pwq_unlocked(old_pwq);
> +	return;
> +}


We have already did it in wq_update_unbound_numa().

> +
>  void workqueue_register_numanode(int nid)
>  {
>  }
>   
>  void workqueue_unregister_numanode(int nid)
>  {
> +	struct workqueue_struct *wq;
> +
> +	mutex_lock(&wq_pool_mutex);
> +	list_for_each_entry(wq, &workqueues, list)
> +		wq_release_unbound_numa(wq, nid);
> +	mutex_unlock(&wq_pool_mutex);
>  }
>  #endif


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/4] workqueue: remove per-node unbound pool when node goes offline.
  2014-12-15  2:06     ` Lai Jiangshan
@ 2014-12-15  2:06       ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-15  2:06 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

(2014/12/15 11:06), Lai Jiangshan wrote:
> On 12/14/2014 12:35 AM, Kamezawa Hiroyuki wrote:
>> remove node aware unbound pools if node goes offline.
>>
>> scan unbound workqueue and remove numa affine pool when
>> a node goes offline.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>>   kernel/workqueue.c | 29 +++++++++++++++++++++++++++++
>>   1 file changed, 29 insertions(+)
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 35f4f00..07b4eb5 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -4921,11 +4921,40 @@ early_initcall(init_workqueues);
>>    * cached pools per cpu should be freed at node unplug
>>    */
>>
>> +/*
>> + * Replace per-node pwq with dfl_pwq because this node disappers.
>> + * The new pool will be set at CPU_ONLINE by wq_update_unbound_numa.
>> + */
>> +static void wq_release_unbound_numa(struct workqueue_struct *wq, int nid)
>> +{
>> +	struct pool_workqueue *old_pwq = NULL;
>> +
>> +	if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND))
>> +		return;
>> +	mutex_lock(&wq->mutex);
>> +	if (wq->unbound_attrs->no_numa)
>> +		goto out_unlock;
>> +	spin_lock_irq(&wq->dfl_pwq->pool->lock);
>> +	get_pwq(wq->dfl_pwq);
>> +	spin_unlock_irq(&wq->dfl_pwq->pool->lock);
>> +	old_pwq = numa_pwq_tbl_install(wq, nid, wq->dfl_pwq);
>> +out_unlock:
>> +	mutex_unlock(&wq->mutex);
>> +	put_pwq_unlocked(old_pwq);
>> +	return;
>> +}
>
>
> We have already did it in wq_update_unbound_numa().
>

Ah, hm. CPU_OFFLINE event does this, ok.

Thanks,
-Kame


>> +
>>   void workqueue_register_numanode(int nid)
>>   {
>>   }
>>
>>   void workqueue_unregister_numanode(int nid)
>>   {
>> +	struct workqueue_struct *wq;
>> +
>> +	mutex_lock(&wq_pool_mutex);
>> +	list_for_each_entry(wq, &workqueues, list)
>> +		wq_release_unbound_numa(wq, nid);
>> +	mutex_unlock(&wq_pool_mutex);
>>   }
>>   #endif
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.
  2014-12-13 16:38   ` [PATCH 4/4] workqueue: handle change in cpu-node relationship Kamezawa Hiroyuki
@ 2014-12-15  2:12     ` Lai Jiangshan
  2014-12-15  2:20       ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-15  2:12 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote:
> Although workqueue detects relationship between cpu<->node at boot,
> it is finally determined in cpu_up().
> This patch tries to update pool->node using online status of cpus.
> 
> 1. When a node goes down, clear per-cpu pool's node attr.
> 2. When a cpu comes up, update per-cpu pool's node attr.
> 3. When a cpu comes up, update possinle node cpumask workqueue is using for sched.
> 4. Detect the best node for unbound pool's cpumask using the latest info.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  kernel/workqueue.c | 67 ++++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 53 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 07b4eb5..259b3ba 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -266,7 +266,8 @@ struct workqueue_struct {
>  static struct kmem_cache *pwq_cache;
>  
>  static cpumask_var_t *wq_numa_possible_cpumask;
> -					/* possible CPUs of each node */
> +	/* possible CPUs of each node initialized with possible info at boot.
> +           but modified at cpu hotplug to be adjusted to real info.  */
>  
>  static bool wq_disable_numa;
>  module_param_named(disable_numa, wq_disable_numa, bool, 0444);
> @@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool)
>  	call_rcu_sched(&pool->rcu, rcu_free_pool);
>  }
>  
> +/*
> + * detect best node for given cpumask.
> + */
> +static int pool_detect_best_node(const struct cpumask *cpumask)
> +{
> +	int node, best, match, selected;
> +	static struct cpumask andmask; /* we're under mutex */
> +
> +	/* Is any node okay ? */
> +	if (!wq_numa_enabled ||
> +	    cpumask_subset(cpu_online_mask, cpumask))
> +		return NUMA_NO_NODE;
> +	best = 0;
> +	selected = NUMA_NO_NODE;
> +	/* select a node which contains the most cpu of cpumask */
> +	for_each_node_state(node, N_ONLINE) {
> +		cpumask_and(&andmask, cpumask, cpumask_of_node(node));
> +		match = cpumask_weight(&andmask);
> +		if (match > best)
> +			selected = node;
> +	}
> +	return selected;
> +}
> +
> +
>  /**
>   * get_unbound_pool - get a worker_pool with the specified attributes
>   * @attrs: the attributes of the worker_pool to get
> @@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>  {
>  	u32 hash = wqattrs_hash(attrs);
>  	struct worker_pool *pool;
> -	int node;
>  
>  	lockdep_assert_held(&wq_pool_mutex);
>  
> @@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>  	 * 'struct workqueue_attrs' comments for detail.
>  	 */
>  	pool->attrs->no_numa = false;
> -
> -	/* if cpumask is contained inside a NUMA node, we belong to that node */
> -	if (wq_numa_enabled) {
> -		for_each_node(node) {
> -			if (cpumask_subset(pool->attrs->cpumask,
> -					   wq_numa_possible_cpumask[node])) {
> -				pool->node = node;
> -				break;
> -			}
> -		}
> -	}
> +	pool->node = pool_detect_best_node(pool->attrs->cpumask);
>  
>  	if (worker_pool_assign_id(pool) < 0)
>  		goto fail;
> @@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>  	int cpu = (unsigned long)hcpu;
>  	struct worker_pool *pool;
>  	struct workqueue_struct *wq;
> -	int pi;
> +	int pi, node;
>  
>  	switch (action & ~CPU_TASKS_FROZEN) {
>  	case CPU_UP_PREPARE:
> @@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>  	case CPU_ONLINE:
>  		mutex_lock(&wq_pool_mutex);
>  
> +		/* now cpu <-> node info is established, update the info. */
> +		if (!wq_disable_numa) {



> +			for_each_node_state(node, N_POSSIBLE)
> +				cpumask_clear_cpu(cpu,
> +					wq_numa_possible_cpumask[node]);

The wq code try to reuse the origin pwqs/pools when the node still have cpu online.
these 3 lines of code will cause the origin pwqs/pools be on the road of dying, and
create a new set of pwqs/pools.

> +			node = cpu_to_node(cpu);
> +			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
> +		}
> +		for_each_cpu_worker_pool(pool, cpu)
> +			pool->node = cpu_to_node(cpu);
>  		for_each_pool(pool, pi) {
>  			mutex_lock(&pool->attach_mutex);
>  
> @@ -4951,7 +4976,21 @@ void workqueue_register_numanode(int nid)
>  void workqueue_unregister_numanode(int nid)
>  {
>  	struct workqueue_struct *wq;
> +	const struct cpumask *nodecpumask;
> +	struct worker_pool *pool;
> +	int cpu;
>  
> +	/* at this point, cpu-to-node relationship is not lost */
> +	nodecpumask = cpumask_of_node(nid);
> +	for_each_cpu(cpu, nodecpumask) {
> +		/*
> +		 * pool is allcated at boot and assumed to be persistent,
> +		 * we cannot free this.
> +		 * Update to be NUMA_NO_NODE. This will be fixed at ONLINE
> +		 */
> +		for_each_cpu_worker_pool(pool, cpu)
> +			pool->node = NUMA_NO_NODE;
> +	}
>  	mutex_lock(&wq_pool_mutex);
>  	list_for_each_entry(wq, &workqueues, list)
>  		wq_release_unbound_numa(wq, nid);


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.
  2014-12-15  2:12     ` Lai Jiangshan
@ 2014-12-15  2:20       ` Kamezawa Hiroyuki
  2014-12-15  2:48         ` Lai Jiangshan
  0 siblings, 1 reply; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-15  2:20 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

(2014/12/15 11:12), Lai Jiangshan wrote:
> On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote:
>> Although workqueue detects relationship between cpu<->node at boot,
>> it is finally determined in cpu_up().
>> This patch tries to update pool->node using online status of cpus.
>>
>> 1. When a node goes down, clear per-cpu pool's node attr.
>> 2. When a cpu comes up, update per-cpu pool's node attr.
>> 3. When a cpu comes up, update possinle node cpumask workqueue is using for sched.
>> 4. Detect the best node for unbound pool's cpumask using the latest info.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>>   kernel/workqueue.c | 67 ++++++++++++++++++++++++++++++++++++++++++------------
>>   1 file changed, 53 insertions(+), 14 deletions(-)
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 07b4eb5..259b3ba 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -266,7 +266,8 @@ struct workqueue_struct {
>>   static struct kmem_cache *pwq_cache;
>>
>>   static cpumask_var_t *wq_numa_possible_cpumask;
>> -					/* possible CPUs of each node */
>> +	/* possible CPUs of each node initialized with possible info at boot.
>> +           but modified at cpu hotplug to be adjusted to real info.  */
>>
>>   static bool wq_disable_numa;
>>   module_param_named(disable_numa, wq_disable_numa, bool, 0444);
>> @@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool)
>>   	call_rcu_sched(&pool->rcu, rcu_free_pool);
>>   }
>>
>> +/*
>> + * detect best node for given cpumask.
>> + */
>> +static int pool_detect_best_node(const struct cpumask *cpumask)
>> +{
>> +	int node, best, match, selected;
>> +	static struct cpumask andmask; /* we're under mutex */
>> +
>> +	/* Is any node okay ? */
>> +	if (!wq_numa_enabled ||
>> +	    cpumask_subset(cpu_online_mask, cpumask))
>> +		return NUMA_NO_NODE;
>> +	best = 0;
>> +	selected = NUMA_NO_NODE;
>> +	/* select a node which contains the most cpu of cpumask */
>> +	for_each_node_state(node, N_ONLINE) {
>> +		cpumask_and(&andmask, cpumask, cpumask_of_node(node));
>> +		match = cpumask_weight(&andmask);
>> +		if (match > best)
>> +			selected = node;
>> +	}
>> +	return selected;
>> +}
>> +
>> +
>>   /**
>>    * get_unbound_pool - get a worker_pool with the specified attributes
>>    * @attrs: the attributes of the worker_pool to get
>> @@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>>   {
>>   	u32 hash = wqattrs_hash(attrs);
>>   	struct worker_pool *pool;
>> -	int node;
>>
>>   	lockdep_assert_held(&wq_pool_mutex);
>>
>> @@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>>   	 * 'struct workqueue_attrs' comments for detail.
>>   	 */
>>   	pool->attrs->no_numa = false;
>> -
>> -	/* if cpumask is contained inside a NUMA node, we belong to that node */
>> -	if (wq_numa_enabled) {
>> -		for_each_node(node) {
>> -			if (cpumask_subset(pool->attrs->cpumask,
>> -					   wq_numa_possible_cpumask[node])) {
>> -				pool->node = node;
>> -				break;
>> -			}
>> -		}
>> -	}
>> +	pool->node = pool_detect_best_node(pool->attrs->cpumask);
>>
>>   	if (worker_pool_assign_id(pool) < 0)
>>   		goto fail;
>> @@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>   	int cpu = (unsigned long)hcpu;
>>   	struct worker_pool *pool;
>>   	struct workqueue_struct *wq;
>> -	int pi;
>> +	int pi, node;
>>
>>   	switch (action & ~CPU_TASKS_FROZEN) {
>>   	case CPU_UP_PREPARE:
>> @@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>   	case CPU_ONLINE:
>>   		mutex_lock(&wq_pool_mutex);
>>
>> +		/* now cpu <-> node info is established, update the info. */
>> +		if (!wq_disable_numa) {
>
>
>
>> +			for_each_node_state(node, N_POSSIBLE)
>> +				cpumask_clear_cpu(cpu,
>> +					wq_numa_possible_cpumask[node]);
>
> The wq code try to reuse the origin pwqs/pools when the node still have cpu online.
> these 3 lines of code will cause the origin pwqs/pools be on the road of dying, and
> create a new set of pwqs/pools.

because the result of wq_calc_node_cpumask() changes ?
Do you mean some comment should be added here ? or explaination for your reply for [3/4] ?

Thanks,
-Kame





>
>> +			node = cpu_to_node(cpu);
>> +			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
>> +		}
>> +		for_each_cpu_worker_pool(pool, cpu)
>> +			pool->node = cpu_to_node(cpu);
>>   		for_each_pool(pool, pi) {
>>   			mutex_lock(&pool->attach_mutex);
>>
>> @@ -4951,7 +4976,21 @@ void workqueue_register_numanode(int nid)
>>   void workqueue_unregister_numanode(int nid)
>>   {
>>   	struct workqueue_struct *wq;
>> +	const struct cpumask *nodecpumask;
>> +	struct worker_pool *pool;
>> +	int cpu;
>>
>> +	/* at this point, cpu-to-node relationship is not lost */
>> +	nodecpumask = cpumask_of_node(nid);
>> +	for_each_cpu(cpu, nodecpumask) {
>> +		/*
>> +		 * pool is allcated at boot and assumed to be persistent,
>> +		 * we cannot free this.
>> +		 * Update to be NUMA_NO_NODE. This will be fixed at ONLINE
>> +		 */
>> +		for_each_cpu_worker_pool(pool, cpu)
>> +			pool->node = NUMA_NO_NODE;
>> +	}
>>   	mutex_lock(&wq_pool_mutex);
>>   	list_for_each_entry(wq, &workqueues, list)
>>   		wq_release_unbound_numa(wq, nid);
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.
  2014-12-15  2:20       ` Kamezawa Hiroyuki
@ 2014-12-15  2:48         ` Lai Jiangshan
  2014-12-15  2:55           ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-15  2:48 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

On 12/15/2014 10:20 AM, Kamezawa Hiroyuki wrote:
> (2014/12/15 11:12), Lai Jiangshan wrote:
>> On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote:
>>> Although workqueue detects relationship between cpu<->node at boot,
>>> it is finally determined in cpu_up().
>>> This patch tries to update pool->node using online status of cpus.
>>>
>>> 1. When a node goes down, clear per-cpu pool's node attr.
>>> 2. When a cpu comes up, update per-cpu pool's node attr.
>>> 3. When a cpu comes up, update possinle node cpumask workqueue is using for sched.
>>> 4. Detect the best node for unbound pool's cpumask using the latest info.
>>>
>>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>> ---
>>>   kernel/workqueue.c | 67 ++++++++++++++++++++++++++++++++++++++++++------------
>>>   1 file changed, 53 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>>> index 07b4eb5..259b3ba 100644
>>> --- a/kernel/workqueue.c
>>> +++ b/kernel/workqueue.c
>>> @@ -266,7 +266,8 @@ struct workqueue_struct {
>>>   static struct kmem_cache *pwq_cache;
>>>
>>>   static cpumask_var_t *wq_numa_possible_cpumask;
>>> -                    /* possible CPUs of each node */
>>> +    /* possible CPUs of each node initialized with possible info at boot.
>>> +           but modified at cpu hotplug to be adjusted to real info.  */
>>>
>>>   static bool wq_disable_numa;
>>>   module_param_named(disable_numa, wq_disable_numa, bool, 0444);
>>> @@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool)
>>>       call_rcu_sched(&pool->rcu, rcu_free_pool);
>>>   }
>>>
>>> +/*
>>> + * detect best node for given cpumask.
>>> + */
>>> +static int pool_detect_best_node(const struct cpumask *cpumask)
>>> +{
>>> +    int node, best, match, selected;
>>> +    static struct cpumask andmask; /* we're under mutex */
>>> +
>>> +    /* Is any node okay ? */
>>> +    if (!wq_numa_enabled ||
>>> +        cpumask_subset(cpu_online_mask, cpumask))
>>> +        return NUMA_NO_NODE;
>>> +    best = 0;
>>> +    selected = NUMA_NO_NODE;
>>> +    /* select a node which contains the most cpu of cpumask */
>>> +    for_each_node_state(node, N_ONLINE) {
>>> +        cpumask_and(&andmask, cpumask, cpumask_of_node(node));
>>> +        match = cpumask_weight(&andmask);
>>> +        if (match > best)
>>> +            selected = node;
>>> +    }
>>> +    return selected;
>>> +}
>>> +
>>> +
>>>   /**
>>>    * get_unbound_pool - get a worker_pool with the specified attributes
>>>    * @attrs: the attributes of the worker_pool to get
>>> @@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>>>   {
>>>       u32 hash = wqattrs_hash(attrs);
>>>       struct worker_pool *pool;
>>> -    int node;
>>>
>>>       lockdep_assert_held(&wq_pool_mutex);
>>>
>>> @@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>>>        * 'struct workqueue_attrs' comments for detail.
>>>        */
>>>       pool->attrs->no_numa = false;
>>> -
>>> -    /* if cpumask is contained inside a NUMA node, we belong to that node */
>>> -    if (wq_numa_enabled) {
>>> -        for_each_node(node) {
>>> -            if (cpumask_subset(pool->attrs->cpumask,
>>> -                       wq_numa_possible_cpumask[node])) {
>>> -                pool->node = node;
>>> -                break;
>>> -            }
>>> -        }
>>> -    }
>>> +    pool->node = pool_detect_best_node(pool->attrs->cpumask);
>>>
>>>       if (worker_pool_assign_id(pool) < 0)
>>>           goto fail;
>>> @@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>>       int cpu = (unsigned long)hcpu;
>>>       struct worker_pool *pool;
>>>       struct workqueue_struct *wq;
>>> -    int pi;
>>> +    int pi, node;
>>>
>>>       switch (action & ~CPU_TASKS_FROZEN) {
>>>       case CPU_UP_PREPARE:
>>> @@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>>       case CPU_ONLINE:
>>>           mutex_lock(&wq_pool_mutex);
>>>
>>> +        /* now cpu <-> node info is established, update the info. */
>>> +        if (!wq_disable_numa) {
>>
>>
>>
>>> +            for_each_node_state(node, N_POSSIBLE)
>>> +                cpumask_clear_cpu(cpu,
>>> +                    wq_numa_possible_cpumask[node]);
>>
>> The wq code try to reuse the origin pwqs/pools when the node still have cpu online.
>> these 3 lines of code will cause the origin pwqs/pools be on the road of dying, and
>> create a new set of pwqs/pools.
> 
> because the result of wq_calc_node_cpumask() changes ?

Yes.

> Do you mean some comment should be added here ? or explaination for your reply for [3/4] ?

this fix [4/4] breaks the original design.

> 
> Thanks,
> -Kame
> 
> 
> 
> 
> 
>>
>>> +            node = cpu_to_node(cpu);
>>> +            cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
>>> +        }
>>> +        for_each_cpu_worker_pool(pool, cpu)
>>> +            pool->node = cpu_to_node(cpu);
>>>           for_each_pool(pool, pi) {
>>>               mutex_lock(&pool->attach_mutex);
>>>
>>> @@ -4951,7 +4976,21 @@ void workqueue_register_numanode(int nid)
>>>   void workqueue_unregister_numanode(int nid)
>>>   {
>>>       struct workqueue_struct *wq;
>>> +    const struct cpumask *nodecpumask;
>>> +    struct worker_pool *pool;
>>> +    int cpu;
>>>
>>> +    /* at this point, cpu-to-node relationship is not lost */
>>> +    nodecpumask = cpumask_of_node(nid);
>>> +    for_each_cpu(cpu, nodecpumask) {
>>> +        /*
>>> +         * pool is allcated at boot and assumed to be persistent,
>>> +         * we cannot free this.
>>> +         * Update to be NUMA_NO_NODE. This will be fixed at ONLINE
>>> +         */
>>> +        for_each_cpu_worker_pool(pool, cpu)
>>> +            pool->node = NUMA_NO_NODE;
>>> +    }
>>>       mutex_lock(&wq_pool_mutex);
>>>       list_for_each_entry(wq, &workqueues, list)
>>>           wq_release_unbound_numa(wq, nid);
>>
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.
  2014-12-15  2:48         ` Lai Jiangshan
@ 2014-12-15  2:55           ` Kamezawa Hiroyuki
  2014-12-15  3:30             ` Lai Jiangshan
  2014-12-15  3:34             ` Lai Jiangshan
  0 siblings, 2 replies; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-15  2:55 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

(2014/12/15 11:48), Lai Jiangshan wrote:
> On 12/15/2014 10:20 AM, Kamezawa Hiroyuki wrote:
>> (2014/12/15 11:12), Lai Jiangshan wrote:
>>> On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote:
>>>> Although workqueue detects relationship between cpu<->node at boot,
>>>> it is finally determined in cpu_up().
>>>> This patch tries to update pool->node using online status of cpus.
>>>>
>>>> 1. When a node goes down, clear per-cpu pool's node attr.
>>>> 2. When a cpu comes up, update per-cpu pool's node attr.
>>>> 3. When a cpu comes up, update possinle node cpumask workqueue is using for sched.
>>>> 4. Detect the best node for unbound pool's cpumask using the latest info.
>>>>
>>>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>>> ---
>>>>    kernel/workqueue.c | 67 ++++++++++++++++++++++++++++++++++++++++++------------
>>>>    1 file changed, 53 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>>>> index 07b4eb5..259b3ba 100644
>>>> --- a/kernel/workqueue.c
>>>> +++ b/kernel/workqueue.c
>>>> @@ -266,7 +266,8 @@ struct workqueue_struct {
>>>>    static struct kmem_cache *pwq_cache;
>>>>
>>>>    static cpumask_var_t *wq_numa_possible_cpumask;
>>>> -                    /* possible CPUs of each node */
>>>> +    /* possible CPUs of each node initialized with possible info at boot.
>>>> +           but modified at cpu hotplug to be adjusted to real info.  */
>>>>
>>>>    static bool wq_disable_numa;
>>>>    module_param_named(disable_numa, wq_disable_numa, bool, 0444);
>>>> @@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool)
>>>>        call_rcu_sched(&pool->rcu, rcu_free_pool);
>>>>    }
>>>>
>>>> +/*
>>>> + * detect best node for given cpumask.
>>>> + */
>>>> +static int pool_detect_best_node(const struct cpumask *cpumask)
>>>> +{
>>>> +    int node, best, match, selected;
>>>> +    static struct cpumask andmask; /* we're under mutex */
>>>> +
>>>> +    /* Is any node okay ? */
>>>> +    if (!wq_numa_enabled ||
>>>> +        cpumask_subset(cpu_online_mask, cpumask))
>>>> +        return NUMA_NO_NODE;
>>>> +    best = 0;
>>>> +    selected = NUMA_NO_NODE;
>>>> +    /* select a node which contains the most cpu of cpumask */
>>>> +    for_each_node_state(node, N_ONLINE) {
>>>> +        cpumask_and(&andmask, cpumask, cpumask_of_node(node));
>>>> +        match = cpumask_weight(&andmask);
>>>> +        if (match > best)
>>>> +            selected = node;
>>>> +    }
>>>> +    return selected;
>>>> +}
>>>> +
>>>> +
>>>>    /**
>>>>     * get_unbound_pool - get a worker_pool with the specified attributes
>>>>     * @attrs: the attributes of the worker_pool to get
>>>> @@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>>>>    {
>>>>        u32 hash = wqattrs_hash(attrs);
>>>>        struct worker_pool *pool;
>>>> -    int node;
>>>>
>>>>        lockdep_assert_held(&wq_pool_mutex);
>>>>
>>>> @@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>>>>         * 'struct workqueue_attrs' comments for detail.
>>>>         */
>>>>        pool->attrs->no_numa = false;
>>>> -
>>>> -    /* if cpumask is contained inside a NUMA node, we belong to that node */
>>>> -    if (wq_numa_enabled) {
>>>> -        for_each_node(node) {
>>>> -            if (cpumask_subset(pool->attrs->cpumask,
>>>> -                       wq_numa_possible_cpumask[node])) {
>>>> -                pool->node = node;
>>>> -                break;
>>>> -            }
>>>> -        }
>>>> -    }
>>>> +    pool->node = pool_detect_best_node(pool->attrs->cpumask);
>>>>
>>>>        if (worker_pool_assign_id(pool) < 0)
>>>>            goto fail;
>>>> @@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>>>        int cpu = (unsigned long)hcpu;
>>>>        struct worker_pool *pool;
>>>>        struct workqueue_struct *wq;
>>>> -    int pi;
>>>> +    int pi, node;
>>>>
>>>>        switch (action & ~CPU_TASKS_FROZEN) {
>>>>        case CPU_UP_PREPARE:
>>>> @@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>>>        case CPU_ONLINE:
>>>>            mutex_lock(&wq_pool_mutex);
>>>>
>>>> +        /* now cpu <-> node info is established, update the info. */
>>>> +        if (!wq_disable_numa) {
>>>
>>>
>>>
>>>> +            for_each_node_state(node, N_POSSIBLE)
>>>> +                cpumask_clear_cpu(cpu,
>>>> +                    wq_numa_possible_cpumask[node]);
>>>
>>> The wq code try to reuse the origin pwqs/pools when the node still have cpu online.
>>> these 3 lines of code will cause the origin pwqs/pools be on the road of dying, and
>>> create a new set of pwqs/pools.
>>
>> because the result of wq_calc_node_cpumask() changes ?
>
> Yes.
>
>> Do you mean some comment should be added here ? or explaination for your reply for [3/4] ?
>
> this fix [4/4] breaks the original design.
>

I'm sorry that I can't understand what this patch breaks.
Do you mean it's better to work with broken wq_numa_possible_cpumask ?

I guess removing wq_numa_possible_mask entirely may be the best way
byusing online_mask of the node.

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.
  2014-12-15  2:55           ` Kamezawa Hiroyuki
@ 2014-12-15  3:30             ` Lai Jiangshan
  2014-12-15  3:34             ` Lai Jiangshan
  1 sibling, 0 replies; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-15  3:30 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

On 12/15/2014 10:55 AM, Kamezawa Hiroyuki wrote:
> (2014/12/15 11:48), Lai Jiangshan wrote:
>> On 12/15/2014 10:20 AM, Kamezawa Hiroyuki wrote:
>>> (2014/12/15 11:12), Lai Jiangshan wrote:
>>>> On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote:
>>>>> Although workqueue detects relationship between cpu<->node at boot,
>>>>> it is finally determined in cpu_up().
>>>>> This patch tries to update pool->node using online status of cpus.
>>>>>
>>>>> 1. When a node goes down, clear per-cpu pool's node attr.
>>>>> 2. When a cpu comes up, update per-cpu pool's node attr.
>>>>> 3. When a cpu comes up, update possinle node cpumask workqueue is using for sched.
>>>>> 4. Detect the best node for unbound pool's cpumask using the latest info.
>>>>>
>>>>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>>>> ---
>>>>>    kernel/workqueue.c | 67 ++++++++++++++++++++++++++++++++++++++++++------------
>>>>>    1 file changed, 53 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>>>>> index 07b4eb5..259b3ba 100644
>>>>> --- a/kernel/workqueue.c
>>>>> +++ b/kernel/workqueue.c
>>>>> @@ -266,7 +266,8 @@ struct workqueue_struct {
>>>>>    static struct kmem_cache *pwq_cache;
>>>>>
>>>>>    static cpumask_var_t *wq_numa_possible_cpumask;
>>>>> -                    /* possible CPUs of each node */
>>>>> +    /* possible CPUs of each node initialized with possible info at boot.
>>>>> +           but modified at cpu hotplug to be adjusted to real info.  */
>>>>>
>>>>>    static bool wq_disable_numa;
>>>>>    module_param_named(disable_numa, wq_disable_numa, bool, 0444);
>>>>> @@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool)
>>>>>        call_rcu_sched(&pool->rcu, rcu_free_pool);
>>>>>    }
>>>>>
>>>>> +/*
>>>>> + * detect best node for given cpumask.
>>>>> + */
>>>>> +static int pool_detect_best_node(const struct cpumask *cpumask)
>>>>> +{
>>>>> +    int node, best, match, selected;
>>>>> +    static struct cpumask andmask; /* we're under mutex */
>>>>> +
>>>>> +    /* Is any node okay ? */
>>>>> +    if (!wq_numa_enabled ||
>>>>> +        cpumask_subset(cpu_online_mask, cpumask))
>>>>> +        return NUMA_NO_NODE;
>>>>> +    best = 0;
>>>>> +    selected = NUMA_NO_NODE;
>>>>> +    /* select a node which contains the most cpu of cpumask */
>>>>> +    for_each_node_state(node, N_ONLINE) {
>>>>> +        cpumask_and(&andmask, cpumask, cpumask_of_node(node));
>>>>> +        match = cpumask_weight(&andmask);
>>>>> +        if (match > best)
>>>>> +            selected = node;
>>>>> +    }
>>>>> +    return selected;
>>>>> +}
>>>>> +
>>>>> +
>>>>>    /**
>>>>>     * get_unbound_pool - get a worker_pool with the specified attributes
>>>>>     * @attrs: the attributes of the worker_pool to get
>>>>> @@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>>>>>    {
>>>>>        u32 hash = wqattrs_hash(attrs);
>>>>>        struct worker_pool *pool;
>>>>> -    int node;
>>>>>
>>>>>        lockdep_assert_held(&wq_pool_mutex);
>>>>>
>>>>> @@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>>>>>         * 'struct workqueue_attrs' comments for detail.
>>>>>         */
>>>>>        pool->attrs->no_numa = false;
>>>>> -
>>>>> -    /* if cpumask is contained inside a NUMA node, we belong to that node */
>>>>> -    if (wq_numa_enabled) {
>>>>> -        for_each_node(node) {
>>>>> -            if (cpumask_subset(pool->attrs->cpumask,
>>>>> -                       wq_numa_possible_cpumask[node])) {
>>>>> -                pool->node = node;
>>>>> -                break;
>>>>> -            }
>>>>> -        }
>>>>> -    }
>>>>> +    pool->node = pool_detect_best_node(pool->attrs->cpumask);
>>>>>
>>>>>        if (worker_pool_assign_id(pool) < 0)
>>>>>            goto fail;
>>>>> @@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>>>>        int cpu = (unsigned long)hcpu;
>>>>>        struct worker_pool *pool;
>>>>>        struct workqueue_struct *wq;
>>>>> -    int pi;
>>>>> +    int pi, node;
>>>>>
>>>>>        switch (action & ~CPU_TASKS_FROZEN) {
>>>>>        case CPU_UP_PREPARE:
>>>>> @@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>>>>        case CPU_ONLINE:
>>>>>            mutex_lock(&wq_pool_mutex);
>>>>>
>>>>> +        /* now cpu <-> node info is established, update the info. */
>>>>> +        if (!wq_disable_numa) {
>>>>
>>>>
>>>>
>>>>> +            for_each_node_state(node, N_POSSIBLE)
>>>>> +                cpumask_clear_cpu(cpu,
>>>>> +                    wq_numa_possible_cpumask[node]);
>>>>
>>>> The wq code try to reuse the origin pwqs/pools when the node still have cpu online.
>>>> these 3 lines of code will cause the origin pwqs/pools be on the road of dying, and
>>>> create a new set of pwqs/pools.
>>>
>>> because the result of wq_calc_node_cpumask() changes ?
>>
>> Yes.
>>
>>> Do you mean some comment should be added here ? or explaination for your reply for [3/4] ?
>>
>> this fix [4/4] breaks the original design.
>>
> 
> I'm sorry that I can't understand what this patch breaks.
> Do you mean it's better to work with broken wq_numa_possible_cpumask ?
> 
> I guess removing wq_numa_possible_mask entirely may be the best way
> byusing online_mask of the node.


You patch achieves the same result of "removing wq_numa_possible_mask entirely
and using cpumask_of_node()", but it breaks the original design which we don't want to do.

> 
> Thanks,
> -Kame
> 
> 
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.
  2014-12-15  2:55           ` Kamezawa Hiroyuki
  2014-12-15  3:30             ` Lai Jiangshan
@ 2014-12-15  3:34             ` Lai Jiangshan
  2014-12-15  4:04               ` Kamezawa Hiroyuki
  1 sibling, 1 reply; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-15  3:34 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

On 12/15/2014 10:55 AM, Kamezawa Hiroyuki wrote:
> (2014/12/15 11:48), Lai Jiangshan wrote:
>> On 12/15/2014 10:20 AM, Kamezawa Hiroyuki wrote:
>>> (2014/12/15 11:12), Lai Jiangshan wrote:
>>>> On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote:
>>>>> Although workqueue detects relationship between cpu<->node at boot,
>>>>> it is finally determined in cpu_up().
>>>>> This patch tries to update pool->node using online status of cpus.
>>>>>
>>>>> 1. When a node goes down, clear per-cpu pool's node attr.
>>>>> 2. When a cpu comes up, update per-cpu pool's node attr.
>>>>> 3. When a cpu comes up, update possinle node cpumask workqueue is using for sched.
>>>>> 4. Detect the best node for unbound pool's cpumask using the latest info.
>>>>>
>>>>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>>>> ---
>>>>>    kernel/workqueue.c | 67 ++++++++++++++++++++++++++++++++++++++++++------------
>>>>>    1 file changed, 53 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>>>>> index 07b4eb5..259b3ba 100644
>>>>> --- a/kernel/workqueue.c
>>>>> +++ b/kernel/workqueue.c
>>>>> @@ -266,7 +266,8 @@ struct workqueue_struct {
>>>>>    static struct kmem_cache *pwq_cache;
>>>>>
>>>>>    static cpumask_var_t *wq_numa_possible_cpumask;
>>>>> -                    /* possible CPUs of each node */
>>>>> +    /* possible CPUs of each node initialized with possible info at boot.
>>>>> +           but modified at cpu hotplug to be adjusted to real info.  */
>>>>>
>>>>>    static bool wq_disable_numa;
>>>>>    module_param_named(disable_numa, wq_disable_numa, bool, 0444);
>>>>> @@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool)
>>>>>        call_rcu_sched(&pool->rcu, rcu_free_pool);
>>>>>    }
>>>>>
>>>>> +/*
>>>>> + * detect best node for given cpumask.
>>>>> + */
>>>>> +static int pool_detect_best_node(const struct cpumask *cpumask)
>>>>> +{
>>>>> +    int node, best, match, selected;
>>>>> +    static struct cpumask andmask; /* we're under mutex */
>>>>> +
>>>>> +    /* Is any node okay ? */
>>>>> +    if (!wq_numa_enabled ||
>>>>> +        cpumask_subset(cpu_online_mask, cpumask))
>>>>> +        return NUMA_NO_NODE;
>>>>> +    best = 0;
>>>>> +    selected = NUMA_NO_NODE;
>>>>> +    /* select a node which contains the most cpu of cpumask */
>>>>> +    for_each_node_state(node, N_ONLINE) {
>>>>> +        cpumask_and(&andmask, cpumask, cpumask_of_node(node));
>>>>> +        match = cpumask_weight(&andmask);
>>>>> +        if (match > best)
>>>>> +            selected = node;
>>>>> +    }
>>>>> +    return selected;
>>>>> +}
>>>>> +
>>>>> +
>>>>>    /**
>>>>>     * get_unbound_pool - get a worker_pool with the specified attributes
>>>>>     * @attrs: the attributes of the worker_pool to get
>>>>> @@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>>>>>    {
>>>>>        u32 hash = wqattrs_hash(attrs);
>>>>>        struct worker_pool *pool;
>>>>> -    int node;
>>>>>
>>>>>        lockdep_assert_held(&wq_pool_mutex);
>>>>>
>>>>> @@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>>>>>         * 'struct workqueue_attrs' comments for detail.
>>>>>         */
>>>>>        pool->attrs->no_numa = false;
>>>>> -
>>>>> -    /* if cpumask is contained inside a NUMA node, we belong to that node */
>>>>> -    if (wq_numa_enabled) {
>>>>> -        for_each_node(node) {
>>>>> -            if (cpumask_subset(pool->attrs->cpumask,
>>>>> -                       wq_numa_possible_cpumask[node])) {
>>>>> -                pool->node = node;
>>>>> -                break;
>>>>> -            }
>>>>> -        }
>>>>> -    }
>>>>> +    pool->node = pool_detect_best_node(pool->attrs->cpumask);
>>>>>
>>>>>        if (worker_pool_assign_id(pool) < 0)
>>>>>            goto fail;
>>>>> @@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>>>>        int cpu = (unsigned long)hcpu;
>>>>>        struct worker_pool *pool;
>>>>>        struct workqueue_struct *wq;
>>>>> -    int pi;
>>>>> +    int pi, node;
>>>>>
>>>>>        switch (action & ~CPU_TASKS_FROZEN) {
>>>>>        case CPU_UP_PREPARE:
>>>>> @@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>>>>        case CPU_ONLINE:
>>>>>            mutex_lock(&wq_pool_mutex);
>>>>>
>>>>> +        /* now cpu <-> node info is established, update the info. */
>>>>> +        if (!wq_disable_numa) {
>>>>
>>>>
>>>>
>>>>> +            for_each_node_state(node, N_POSSIBLE)
>>>>> +                cpumask_clear_cpu(cpu,
>>>>> +                    wq_numa_possible_cpumask[node]);
>>>>
>>>> The wq code try to reuse the origin pwqs/pools when the node still have cpu online.
>>>> these 3 lines of code will cause the origin pwqs/pools be on the road of dying, and
>>>> create a new set of pwqs/pools.
>>>
>>> because the result of wq_calc_node_cpumask() changes ?
>>
>> Yes.
>>
>>> Do you mean some comment should be added here ? or explaination for your reply for [3/4] ?
>>
>> this fix [4/4] breaks the original design.
>>
> 
> I'm sorry that I can't understand what this patch breaks.

the pwqs/pools should be kept if the node still have cpu online.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.
  2014-12-15  3:34             ` Lai Jiangshan
@ 2014-12-15  4:04               ` Kamezawa Hiroyuki
  2014-12-15  5:19                 ` Lai Jiangshan
  0 siblings, 1 reply; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-15  4:04 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

(2014/12/15 12:34), Lai Jiangshan wrote:
> On 12/15/2014 10:55 AM, Kamezawa Hiroyuki wrote:
>> (2014/12/15 11:48), Lai Jiangshan wrote:
>>> On 12/15/2014 10:20 AM, Kamezawa Hiroyuki wrote:
>>>> (2014/12/15 11:12), Lai Jiangshan wrote:
>>>>> On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote:
>>>>>> Although workqueue detects relationship between cpu<->node at boot,
>>>>>> it is finally determined in cpu_up().
>>>>>> This patch tries to update pool->node using online status of cpus.
>>>>>>
>>>>>> 1. When a node goes down, clear per-cpu pool's node attr.
>>>>>> 2. When a cpu comes up, update per-cpu pool's node attr.
>>>>>> 3. When a cpu comes up, update possinle node cpumask workqueue is using for sched.
>>>>>> 4. Detect the best node for unbound pool's cpumask using the latest info.
>>>>>>
>>>>>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>>>>> ---
>>>>>>     kernel/workqueue.c | 67 ++++++++++++++++++++++++++++++++++++++++++------------
>>>>>>     1 file changed, 53 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>>>>>> index 07b4eb5..259b3ba 100644
>>>>>> --- a/kernel/workqueue.c
>>>>>> +++ b/kernel/workqueue.c
>>>>>> @@ -266,7 +266,8 @@ struct workqueue_struct {
>>>>>>     static struct kmem_cache *pwq_cache;
>>>>>>
>>>>>>     static cpumask_var_t *wq_numa_possible_cpumask;
>>>>>> -                    /* possible CPUs of each node */
>>>>>> +    /* possible CPUs of each node initialized with possible info at boot.
>>>>>> +           but modified at cpu hotplug to be adjusted to real info.  */
>>>>>>
>>>>>>     static bool wq_disable_numa;
>>>>>>     module_param_named(disable_numa, wq_disable_numa, bool, 0444);
>>>>>> @@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool)
>>>>>>         call_rcu_sched(&pool->rcu, rcu_free_pool);
>>>>>>     }
>>>>>>
>>>>>> +/*
>>>>>> + * detect best node for given cpumask.
>>>>>> + */
>>>>>> +static int pool_detect_best_node(const struct cpumask *cpumask)
>>>>>> +{
>>>>>> +    int node, best, match, selected;
>>>>>> +    static struct cpumask andmask; /* we're under mutex */
>>>>>> +
>>>>>> +    /* Is any node okay ? */
>>>>>> +    if (!wq_numa_enabled ||
>>>>>> +        cpumask_subset(cpu_online_mask, cpumask))
>>>>>> +        return NUMA_NO_NODE;
>>>>>> +    best = 0;
>>>>>> +    selected = NUMA_NO_NODE;
>>>>>> +    /* select a node which contains the most cpu of cpumask */
>>>>>> +    for_each_node_state(node, N_ONLINE) {
>>>>>> +        cpumask_and(&andmask, cpumask, cpumask_of_node(node));
>>>>>> +        match = cpumask_weight(&andmask);
>>>>>> +        if (match > best)
>>>>>> +            selected = node;
>>>>>> +    }
>>>>>> +    return selected;
>>>>>> +}
>>>>>> +
>>>>>> +
>>>>>>     /**
>>>>>>      * get_unbound_pool - get a worker_pool with the specified attributes
>>>>>>      * @attrs: the attributes of the worker_pool to get
>>>>>> @@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>>>>>>     {
>>>>>>         u32 hash = wqattrs_hash(attrs);
>>>>>>         struct worker_pool *pool;
>>>>>> -    int node;
>>>>>>
>>>>>>         lockdep_assert_held(&wq_pool_mutex);
>>>>>>
>>>>>> @@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>>>>>>          * 'struct workqueue_attrs' comments for detail.
>>>>>>          */
>>>>>>         pool->attrs->no_numa = false;
>>>>>> -
>>>>>> -    /* if cpumask is contained inside a NUMA node, we belong to that node */
>>>>>> -    if (wq_numa_enabled) {
>>>>>> -        for_each_node(node) {
>>>>>> -            if (cpumask_subset(pool->attrs->cpumask,
>>>>>> -                       wq_numa_possible_cpumask[node])) {
>>>>>> -                pool->node = node;
>>>>>> -                break;
>>>>>> -            }
>>>>>> -        }
>>>>>> -    }
>>>>>> +    pool->node = pool_detect_best_node(pool->attrs->cpumask);
>>>>>>
>>>>>>         if (worker_pool_assign_id(pool) < 0)
>>>>>>             goto fail;
>>>>>> @@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>>>>>         int cpu = (unsigned long)hcpu;
>>>>>>         struct worker_pool *pool;
>>>>>>         struct workqueue_struct *wq;
>>>>>> -    int pi;
>>>>>> +    int pi, node;
>>>>>>
>>>>>>         switch (action & ~CPU_TASKS_FROZEN) {
>>>>>>         case CPU_UP_PREPARE:
>>>>>> @@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>>>>>         case CPU_ONLINE:
>>>>>>             mutex_lock(&wq_pool_mutex);
>>>>>>
>>>>>> +        /* now cpu <-> node info is established, update the info. */
>>>>>> +        if (!wq_disable_numa) {
>>>>>
>>>>>
>>>>>
>>>>>> +            for_each_node_state(node, N_POSSIBLE)
>>>>>> +                cpumask_clear_cpu(cpu,
>>>>>> +                    wq_numa_possible_cpumask[node]);
>>>>>
>>>>> The wq code try to reuse the origin pwqs/pools when the node still have cpu online.
>>>>> these 3 lines of code will cause the origin pwqs/pools be on the road of dying, and
>>>>> create a new set of pwqs/pools.
>>>>
>>>> because the result of wq_calc_node_cpumask() changes ?
>>>
>>> Yes.
>>>
>>>> Do you mean some comment should be added here ? or explaination for your reply for [3/4] ?
>>>
>>> this fix [4/4] breaks the original design.
>>>
>>
>> I'm sorry that I can't understand what this patch breaks.
>
> the pwqs/pools should be kept if the node still have cpu online.

So, the fix's grand design should be

  1. drop old pwq/pools only at node offline.
  2. set proper pool->node based on online node info.
  3. update pool->node of per-cpu-pool at cpu ONLINE.

Hm. (1) is  done because cpumask_of_node() turns to be zero-filled
after all cpus on a node offlined.

But, cpu-to-node relationship cannot be available until a cpu get onlined.
It changes at every cpu onlining. So, at node online, refleshing cpumasks
of workqueues only after _all_ cpus on node are onlined seems to be the only way
but I'm not sure how to get a mask of possible cpus on a node.

Possible another way may be using auto numa balancing's numa node hint for worker scheduling.

Do you have any idea ?

Thanks,
-Kame











^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.
  2014-12-15  4:04               ` Kamezawa Hiroyuki
@ 2014-12-15  5:19                 ` Lai Jiangshan
  2014-12-15  5:33                   ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-15  5:19 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

On 12/15/2014 12:04 PM, Kamezawa Hiroyuki wrote:
> (2014/12/15 12:34), Lai Jiangshan wrote:
>> On 12/15/2014 10:55 AM, Kamezawa Hiroyuki wrote:
>>> (2014/12/15 11:48), Lai Jiangshan wrote:
>>>> On 12/15/2014 10:20 AM, Kamezawa Hiroyuki wrote:
>>>>> (2014/12/15 11:12), Lai Jiangshan wrote:
>>>>>> On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote:
>>>>>>> Although workqueue detects relationship between cpu<->node at boot,
>>>>>>> it is finally determined in cpu_up().
>>>>>>> This patch tries to update pool->node using online status of cpus.
>>>>>>>
>>>>>>> 1. When a node goes down, clear per-cpu pool's node attr.
>>>>>>> 2. When a cpu comes up, update per-cpu pool's node attr.
>>>>>>> 3. When a cpu comes up, update possinle node cpumask workqueue is using for sched.
>>>>>>> 4. Detect the best node for unbound pool's cpumask using the latest info.
>>>>>>>
>>>>>>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>>>>>> ---
>>>>>>>     kernel/workqueue.c | 67 ++++++++++++++++++++++++++++++++++++++++++------------
>>>>>>>     1 file changed, 53 insertions(+), 14 deletions(-)
>>>>>>>
>>>>>>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>>>>>>> index 07b4eb5..259b3ba 100644
>>>>>>> --- a/kernel/workqueue.c
>>>>>>> +++ b/kernel/workqueue.c
>>>>>>> @@ -266,7 +266,8 @@ struct workqueue_struct {
>>>>>>>     static struct kmem_cache *pwq_cache;
>>>>>>>
>>>>>>>     static cpumask_var_t *wq_numa_possible_cpumask;
>>>>>>> -                    /* possible CPUs of each node */
>>>>>>> +    /* possible CPUs of each node initialized with possible info at boot.
>>>>>>> +           but modified at cpu hotplug to be adjusted to real info.  */
>>>>>>>
>>>>>>>     static bool wq_disable_numa;
>>>>>>>     module_param_named(disable_numa, wq_disable_numa, bool, 0444);
>>>>>>> @@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool)
>>>>>>>         call_rcu_sched(&pool->rcu, rcu_free_pool);
>>>>>>>     }
>>>>>>>
>>>>>>> +/*
>>>>>>> + * detect best node for given cpumask.
>>>>>>> + */
>>>>>>> +static int pool_detect_best_node(const struct cpumask *cpumask)
>>>>>>> +{
>>>>>>> +    int node, best, match, selected;
>>>>>>> +    static struct cpumask andmask; /* we're under mutex */
>>>>>>> +
>>>>>>> +    /* Is any node okay ? */
>>>>>>> +    if (!wq_numa_enabled ||
>>>>>>> +        cpumask_subset(cpu_online_mask, cpumask))
>>>>>>> +        return NUMA_NO_NODE;
>>>>>>> +    best = 0;
>>>>>>> +    selected = NUMA_NO_NODE;
>>>>>>> +    /* select a node which contains the most cpu of cpumask */
>>>>>>> +    for_each_node_state(node, N_ONLINE) {
>>>>>>> +        cpumask_and(&andmask, cpumask, cpumask_of_node(node));
>>>>>>> +        match = cpumask_weight(&andmask);
>>>>>>> +        if (match > best)
>>>>>>> +            selected = node;
>>>>>>> +    }
>>>>>>> +    return selected;
>>>>>>> +}
>>>>>>> +
>>>>>>> +
>>>>>>>     /**
>>>>>>>      * get_unbound_pool - get a worker_pool with the specified attributes
>>>>>>>      * @attrs: the attributes of the worker_pool to get
>>>>>>> @@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>>>>>>>     {
>>>>>>>         u32 hash = wqattrs_hash(attrs);
>>>>>>>         struct worker_pool *pool;
>>>>>>> -    int node;
>>>>>>>
>>>>>>>         lockdep_assert_held(&wq_pool_mutex);
>>>>>>>
>>>>>>> @@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>>>>>>>          * 'struct workqueue_attrs' comments for detail.
>>>>>>>          */
>>>>>>>         pool->attrs->no_numa = false;
>>>>>>> -
>>>>>>> -    /* if cpumask is contained inside a NUMA node, we belong to that node */
>>>>>>> -    if (wq_numa_enabled) {
>>>>>>> -        for_each_node(node) {
>>>>>>> -            if (cpumask_subset(pool->attrs->cpumask,
>>>>>>> -                       wq_numa_possible_cpumask[node])) {
>>>>>>> -                pool->node = node;
>>>>>>> -                break;
>>>>>>> -            }
>>>>>>> -        }
>>>>>>> -    }
>>>>>>> +    pool->node = pool_detect_best_node(pool->attrs->cpumask);
>>>>>>>
>>>>>>>         if (worker_pool_assign_id(pool) < 0)
>>>>>>>             goto fail;
>>>>>>> @@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>>>>>>         int cpu = (unsigned long)hcpu;
>>>>>>>         struct worker_pool *pool;
>>>>>>>         struct workqueue_struct *wq;
>>>>>>> -    int pi;
>>>>>>> +    int pi, node;
>>>>>>>
>>>>>>>         switch (action & ~CPU_TASKS_FROZEN) {
>>>>>>>         case CPU_UP_PREPARE:
>>>>>>> @@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>>>>>>         case CPU_ONLINE:
>>>>>>>             mutex_lock(&wq_pool_mutex);
>>>>>>>
>>>>>>> +        /* now cpu <-> node info is established, update the info. */
>>>>>>> +        if (!wq_disable_numa) {
>>>>>>
>>>>>>
>>>>>>
>>>>>>> +            for_each_node_state(node, N_POSSIBLE)
>>>>>>> +                cpumask_clear_cpu(cpu,
>>>>>>> +                    wq_numa_possible_cpumask[node]);
>>>>>>
>>>>>> The wq code try to reuse the origin pwqs/pools when the node still have cpu online.
>>>>>> these 3 lines of code will cause the origin pwqs/pools be on the road of dying, and
>>>>>> create a new set of pwqs/pools.
>>>>>
>>>>> because the result of wq_calc_node_cpumask() changes ?
>>>>
>>>> Yes.
>>>>
>>>>> Do you mean some comment should be added here ? or explaination for your reply for [3/4] ?
>>>>
>>>> this fix [4/4] breaks the original design.
>>>>
>>>
>>> I'm sorry that I can't understand what this patch breaks.
>>
>> the pwqs/pools should be kept if the node still have cpu online.
> 
> So, the fix's grand design should be
> 
>  1. drop old pwq/pools only at node offline.
>  2. set proper pool->node based on online node info.
>  3. update pool->node of per-cpu-pool at cpu ONLINE.
> 
> Hm. (1) is  done because cpumask_of_node() turns to be zero-filled
> after all cpus on a node offlined.
> 
> But, cpu-to-node relationship cannot be available until a cpu get onlined.
> It changes at every cpu onlining. So, at node online, refleshing cpumasks

It changes at every cpu being added which earlier than any cpu of the node will be online.

> of workqueues only after _all_ cpus on node are onlined seems to be the only way

When _any_ cpu on the new node is onlining, _add_ cpus of the node were *added*, which means
all cpu_to_node()s of the all cpus on new node ware updated.

so we can check the updated information when up online.  This is what my patchset did.

> but I'm not sure how to get a mask of possible cpus on a node.

like this:

+	for_each_possible_cpu(cpu) {
+		node = cpu_to_node(node);
+		if (node == new_node)
+			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[new_node]);
+	}


> 
> Possible another way may be using auto numa balancing's numa node hint for worker scheduling.
> 
> Do you have any idea ?

Is it hard to keep the cpu-node relationship unchanged?
You know that, all cpu ids, node ids are soft-ware logical id, we can change relationship
between logic id and physical id but keep relationship among logical ids unchanged.

> 
> Thanks,
> -Kame
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/5] workqueue: fix memory leak in wq_numa_init()
  2014-12-12 17:12   ` Tejun Heo
@ 2014-12-15  5:25     ` Lai Jiangshan
  0 siblings, 0 replies; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-15  5:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

On 12/13/2014 01:12 AM, Tejun Heo wrote:
> On Fri, Dec 12, 2014 at 06:19:51PM +0800, Lai Jiangshan wrote:
>> wq_numa_init() will quit directly on some bonkers cases without freeing the
>> memory.  Add the missing cleanup code.
>>
>> Cc: Tejun Heo <tj@kernel.org>
>> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>> Cc: "Gu, Zheng" <guz.fnst@cn.fujitsu.com>
>> Cc: tangchen <tangchen@cn.fujitsu.com>
>> Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com>
>> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
>> ---
>>  kernel/workqueue.c |    3 +++
>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 09b685d..a6fd2b8 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -4811,6 +4811,9 @@ static void __init wq_numa_init(void)
>>  		if (WARN_ON(node == NUMA_NO_NODE)) {
>>  			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
>>  			/* happens iff arch is bonkers, let's just proceed */
>> +			for_each_node(node)
>> +				free_cpumask_var(tbl[node]);
>> +			kfree(tbl);
> 
> The comment right up there says that this happens if and only if the
> arch code is seriously broken and it's consciously skipping exception
> handling.  That's a condition where we might as well trigger BUG_ON().
> Just leave it alone.


cpu_to_node() can return NUMA_NO_NODE after system booted (when node offline) currently.
so I don't think it is seriously broken if cpu_to_node() returns NUMA_NO_NODE
when booting.

See:

static void unmap_cpu_on_node(pg_data_t *pgdat)
{
#ifdef CONFIG_ACPI_NUMA
	int cpu;

	for_each_possible_cpu(cpu)
		if (cpu_to_node(cpu) == pgdat->node_id)
			numa_clear_node(cpu);
#endif
}

> 
> Thanks.
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.
  2014-12-15  5:19                 ` Lai Jiangshan
@ 2014-12-15  5:33                   ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-15  5:33 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

(2014/12/15 14:19), Lai Jiangshan wrote:
> On 12/15/2014 12:04 PM, Kamezawa Hiroyuki wrote:
>> (2014/12/15 12:34), Lai Jiangshan wrote:
>>> On 12/15/2014 10:55 AM, Kamezawa Hiroyuki wrote:
>>>> (2014/12/15 11:48), Lai Jiangshan wrote:
>>>>> On 12/15/2014 10:20 AM, Kamezawa Hiroyuki wrote:
>>>>>> (2014/12/15 11:12), Lai Jiangshan wrote:
>>>>>>> On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote:
>>>>>>>> Although workqueue detects relationship between cpu<->node at boot,
>>>>>>>> it is finally determined in cpu_up().
>>>>>>>> This patch tries to update pool->node using online status of cpus.
>>>>>>>>
>>>>>>>> 1. When a node goes down, clear per-cpu pool's node attr.
>>>>>>>> 2. When a cpu comes up, update per-cpu pool's node attr.
>>>>>>>> 3. When a cpu comes up, update possinle node cpumask workqueue is using for sched.
>>>>>>>> 4. Detect the best node for unbound pool's cpumask using the latest info.
>>>>>>>>
>>>>>>>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>>>>>>> ---
>>>>>>>>      kernel/workqueue.c | 67 ++++++++++++++++++++++++++++++++++++++++++------------
>>>>>>>>      1 file changed, 53 insertions(+), 14 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>>>>>>>> index 07b4eb5..259b3ba 100644
>>>>>>>> --- a/kernel/workqueue.c
>>>>>>>> +++ b/kernel/workqueue.c
>>>>>>>> @@ -266,7 +266,8 @@ struct workqueue_struct {
>>>>>>>>      static struct kmem_cache *pwq_cache;
>>>>>>>>
>>>>>>>>      static cpumask_var_t *wq_numa_possible_cpumask;
>>>>>>>> -                    /* possible CPUs of each node */
>>>>>>>> +    /* possible CPUs of each node initialized with possible info at boot.
>>>>>>>> +           but modified at cpu hotplug to be adjusted to real info.  */
>>>>>>>>
>>>>>>>>      static bool wq_disable_numa;
>>>>>>>>      module_param_named(disable_numa, wq_disable_numa, bool, 0444);
>>>>>>>> @@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool)
>>>>>>>>          call_rcu_sched(&pool->rcu, rcu_free_pool);
>>>>>>>>      }
>>>>>>>>
>>>>>>>> +/*
>>>>>>>> + * detect best node for given cpumask.
>>>>>>>> + */
>>>>>>>> +static int pool_detect_best_node(const struct cpumask *cpumask)
>>>>>>>> +{
>>>>>>>> +    int node, best, match, selected;
>>>>>>>> +    static struct cpumask andmask; /* we're under mutex */
>>>>>>>> +
>>>>>>>> +    /* Is any node okay ? */
>>>>>>>> +    if (!wq_numa_enabled ||
>>>>>>>> +        cpumask_subset(cpu_online_mask, cpumask))
>>>>>>>> +        return NUMA_NO_NODE;
>>>>>>>> +    best = 0;
>>>>>>>> +    selected = NUMA_NO_NODE;
>>>>>>>> +    /* select a node which contains the most cpu of cpumask */
>>>>>>>> +    for_each_node_state(node, N_ONLINE) {
>>>>>>>> +        cpumask_and(&andmask, cpumask, cpumask_of_node(node));
>>>>>>>> +        match = cpumask_weight(&andmask);
>>>>>>>> +        if (match > best)
>>>>>>>> +            selected = node;
>>>>>>>> +    }
>>>>>>>> +    return selected;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +
>>>>>>>>      /**
>>>>>>>>       * get_unbound_pool - get a worker_pool with the specified attributes
>>>>>>>>       * @attrs: the attributes of the worker_pool to get
>>>>>>>> @@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>>>>>>>>      {
>>>>>>>>          u32 hash = wqattrs_hash(attrs);
>>>>>>>>          struct worker_pool *pool;
>>>>>>>> -    int node;
>>>>>>>>
>>>>>>>>          lockdep_assert_held(&wq_pool_mutex);
>>>>>>>>
>>>>>>>> @@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>>>>>>>>           * 'struct workqueue_attrs' comments for detail.
>>>>>>>>           */
>>>>>>>>          pool->attrs->no_numa = false;
>>>>>>>> -
>>>>>>>> -    /* if cpumask is contained inside a NUMA node, we belong to that node */
>>>>>>>> -    if (wq_numa_enabled) {
>>>>>>>> -        for_each_node(node) {
>>>>>>>> -            if (cpumask_subset(pool->attrs->cpumask,
>>>>>>>> -                       wq_numa_possible_cpumask[node])) {
>>>>>>>> -                pool->node = node;
>>>>>>>> -                break;
>>>>>>>> -            }
>>>>>>>> -        }
>>>>>>>> -    }
>>>>>>>> +    pool->node = pool_detect_best_node(pool->attrs->cpumask);
>>>>>>>>
>>>>>>>>          if (worker_pool_assign_id(pool) < 0)
>>>>>>>>              goto fail;
>>>>>>>> @@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>>>>>>>          int cpu = (unsigned long)hcpu;
>>>>>>>>          struct worker_pool *pool;
>>>>>>>>          struct workqueue_struct *wq;
>>>>>>>> -    int pi;
>>>>>>>> +    int pi, node;
>>>>>>>>
>>>>>>>>          switch (action & ~CPU_TASKS_FROZEN) {
>>>>>>>>          case CPU_UP_PREPARE:
>>>>>>>> @@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>>>>>>>          case CPU_ONLINE:
>>>>>>>>              mutex_lock(&wq_pool_mutex);
>>>>>>>>
>>>>>>>> +        /* now cpu <-> node info is established, update the info. */
>>>>>>>> +        if (!wq_disable_numa) {
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> +            for_each_node_state(node, N_POSSIBLE)
>>>>>>>> +                cpumask_clear_cpu(cpu,
>>>>>>>> +                    wq_numa_possible_cpumask[node]);
>>>>>>>
>>>>>>> The wq code try to reuse the origin pwqs/pools when the node still have cpu online.
>>>>>>> these 3 lines of code will cause the origin pwqs/pools be on the road of dying, and
>>>>>>> create a new set of pwqs/pools.
>>>>>>
>>>>>> because the result of wq_calc_node_cpumask() changes ?
>>>>>
>>>>> Yes.
>>>>>
>>>>>> Do you mean some comment should be added here ? or explaination for your reply for [3/4] ?
>>>>>
>>>>> this fix [4/4] breaks the original design.
>>>>>
>>>>
>>>> I'm sorry that I can't understand what this patch breaks.
>>>
>>> the pwqs/pools should be kept if the node still have cpu online.
>>
>> So, the fix's grand design should be
>>
>>   1. drop old pwq/pools only at node offline.
>>   2. set proper pool->node based on online node info.
>>   3. update pool->node of per-cpu-pool at cpu ONLINE.
>>
>> Hm. (1) is  done because cpumask_of_node() turns to be zero-filled
>> after all cpus on a node offlined.
>>
>> But, cpu-to-node relationship cannot be available until a cpu get onlined.
>> It changes at every cpu onlining. So, at node online, refleshing cpumasks
>
> It changes at every cpu being added which earlier than any cpu of the node will be online.
>
>> of workqueues only after _all_ cpus on node are onlined seems to be the only way
>
> When _any_ cpu on the new node is onlining, _add_ cpus of the node were *added*, which means
> all cpu_to_node()s of the all cpus on new node ware updated.
>

Thank you, I misuderstood that.


> so we can check the updated information when up online.  This is what my patchset did.
>
>> but I'm not sure how to get a mask of possible cpus on a node.
>
> like this:
>
> +	for_each_possible_cpu(cpu) {
> +		node = cpu_to_node(node);
> +		if (node == new_node)
> +			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[new_node]);
> +	}
>

Okay, I talked with Ishimatu-san and get some more knowledege.

let me clarify. not at cpu hotplug,

(A) At physical cpu hot add, cpu <-> node relationship is established.
(B) If an exisitng cpu was on a memory less node, cpu<->node relationship may change
     at CPU_ONLINE.

Because (2) will not change topology of cpuids, this doesn't affect cpumask for sched.
So, we just need to handle following.

1. update pc->node if necessary.
2. update wq_numa_possible_mask[] at physical cpu hot-add.

Leave wq_numa_possible_mask as it is for case (B) because the change doesn't affect
the affinity group of cpus. no affects to cpumask.

I'll cook a patch and do a test.

Thanks,
-Kame









^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 0/4] workqueue: fix memory allocation after numa mapping is changed v3
  2014-12-13 16:27 ` [PATCH 0/4] workqueue: fix bug when numa mapping is changed v2 Kamezawa Hiroyuki
                     ` (3 preceding siblings ...)
  2014-12-13 16:38   ` [PATCH 4/4] workqueue: handle change in cpu-node relationship Kamezawa Hiroyuki
@ 2014-12-15 11:11   ` Kamezawa Hiroyuki
  2014-12-15 11:14     ` [PATCH 1/4] workqueue:Fix unbound workqueue's node affinity detection Kamezawa Hiroyuki
                       ` (3 more replies)
  4 siblings, 4 replies; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-15 11:11 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, Tejun Heo
  Cc: Yasuaki Ishimatsu, Gu, Zheng, tangchen, Kamezawa Hiroyuki

Lai-san, Tejun-san,

Thank you for review, this a fix v3. This has been tested on NUMA node hotplug
machine and seems work well.

The probelm is memory allocation failure because pool->node information can be stale
after node hotplug. This patch(1,2) tries to fix pool->node calculation.
Patch (3,4) tries to update cpumask calculation.
(Fixing memory allocation bug just requires patch 1,2. But cpumask should be update
 I think.)

Changelog since v2.
- reordered patch and split for each problem cases.
- removed unnecessary calls pointed out by Lai-san.
- restore node/cpu relationship when a new node comes online.
- handle corner case at CPU_ONLINE.

1/4 .... fix unbound workqueue's memory node affinity calculation.
2/4 ...  update percpu workqueue's memory node affinity at online/offline
3/4 ...  update workqueue's possible cpumask when a new node onlined.
4/4 ...  handle cpu-node affinity change at CPU_ONLINE.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/4] workqueue:Fix unbound workqueue's node affinity detection
  2014-12-15 11:11   ` [PATCH 0/4] workqueue: fix memory allocation after numa mapping is changed v3 Kamezawa Hiroyuki
@ 2014-12-15 11:14     ` Kamezawa Hiroyuki
  2014-12-16  5:30       ` Lai Jiangshan
  2014-12-15 11:16     ` [PATCH 2/4] workqueue: update per-cpu workqueue's node affinity at,online-offline Kamezawa Hiroyuki
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-15 11:14 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, Tejun Heo
  Cc: Yasuaki Ishimatsu, Gu, Zheng, tangchen

Unbound wq pool's node attribute is calculated at its allocation.
But it's now calculated based on possible cpu<->node information
which can be wrong after cpu hotplug/unplug.

If wrong pool->node is set, following allocation error will happen.
==
 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order:
1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656
==

This patch fixes the node detection by making use of online cpu info.
Unlike cpumask, the best node can be calculated by degree of overlap
between attr->cpumask and numanode->online_cpumask.
This change doesn't corrupt original purpose of the old calculation.

Note: it's expected that this function is called as
      pool_detect_best_node
      get_unbound_pool
      alloc_unbound_pwq
      wq_update_unbound_numa
      called at CPU_ONLINE/CPU_DOWN_PREPARE
and the latest online cpu info can be applied to a new wq pool,
which replaces old one.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 kernel/workqueue.c | 38 ++++++++++++++++++++++++++------------
 1 file changed, 26 insertions(+), 12 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 09b685d..7809154 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3440,6 +3440,31 @@ static void put_unbound_pool(struct worker_pool *pool)
 }
 
 /**
+ * pool_detect_best_node - detect a node which contains specified cpumask.
+ * Should be called with wq_pool_mutex held.
+ * Returns a online node where the most of given cpus are tied to. 
+ */
+static int pool_detect_best_node(const struct cpumask *cpumask)
+{
+	int node, best, match, selected = NUMA_NO_NODE;
+	static struct cpumask andmask; /* under wq_pool_mutex */
+
+	if (!wq_numa_enabled ||
+		cpumask_subset(cpu_online_mask, cpumask))
+		goto out;
+	best = 0;
+	/* select a node which contains the most number of cpu */
+	for_each_node_state(node, N_ONLINE) {
+		cpumask_and(&andmask, cpumask, cpumask_of_node(node));
+		match = cpumask_weight(&andmask);
+		if (match > best)
+			selected = best;
+	}
+out:
+	return selected;
+}
+
+/**
  * get_unbound_pool - get a worker_pool with the specified attributes
  * @attrs: the attributes of the worker_pool to get
  *
@@ -3457,7 +3482,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 {
 	u32 hash = wqattrs_hash(attrs);
 	struct worker_pool *pool;
-	int node;
 
 	lockdep_assert_held(&wq_pool_mutex);
 
@@ -3482,17 +3506,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
 	 * 'struct workqueue_attrs' comments for detail.
 	 */
 	pool->attrs->no_numa = false;
-
-	/* if cpumask is contained inside a NUMA node, we belong to that node */
-	if (wq_numa_enabled) {
-		for_each_node(node) {
-			if (cpumask_subset(pool->attrs->cpumask,
-					   wq_numa_possible_cpumask[node])) {
-				pool->node = node;
-				break;
-			}
-		}
-	}
+	pool->node = pool_detect_best_node(pool->attrs->cpumask);
 
 	if (worker_pool_assign_id(pool) < 0)
 		goto fail;
-- 
1.8.3.1




^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 2/4] workqueue: update per-cpu workqueue's node affinity at,online-offline
  2014-12-15 11:11   ` [PATCH 0/4] workqueue: fix memory allocation after numa mapping is changed v3 Kamezawa Hiroyuki
  2014-12-15 11:14     ` [PATCH 1/4] workqueue:Fix unbound workqueue's node affinity detection Kamezawa Hiroyuki
@ 2014-12-15 11:16     ` Kamezawa Hiroyuki
  2014-12-16  5:32       ` Lai Jiangshan
  2014-12-15 11:18     ` [PATCH 3/4] workqueue: Update workqueue's possible cpumask when a new node, coming up Kamezawa Hiroyuki
  2014-12-15 11:22     ` [PATCH 4/4] workqueue: Handle cpu-node affinity change at CPU_ONLINE Kamezawa Hiroyuki
  3 siblings, 1 reply; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-15 11:16 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, Tejun Heo
  Cc: Yasuaki Ishimatsu, Gu, Zheng, tangchen

The percpu workqueue pool are persistend and never be freed.
But cpu<->node relationship can be changed by cpu hotplug and pool->node
can point to an offlined node.

If pool->node points to an offlined node,
following allocation failure can happen.
    ==
     SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
      cache: kmalloc-192, object size: 192, buffer size: 192, default
order:
    1, min order: 0
      node 0: slabs: 6172, objs: 259224, free: 245741
      node 1: slabs: 3261, objs: 136962, free: 127656
    ==

This patch clears per-cpu workqueue pool's node affinity at
cpu offlining and restore it at cpu onlining.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 kernel/workqueue.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7809154..2fd0bd7 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4586,6 +4586,11 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
 	case CPU_DOWN_FAILED:
 	case CPU_ONLINE:
 		mutex_lock(&wq_pool_mutex);
+		/*
+		 * now cpu <-> node info is established. update numa node
+		 */
+		for_each_cpu_worker_pool(pool, cpu)
+			pool->node = cpu_to_node(cpu);
 
 		for_each_pool(pool, pi) {
 			mutex_lock(&pool->attach_mutex);
@@ -4619,6 +4624,7 @@ static int workqueue_cpu_down_callback(struct notifier_block *nfb,
 	int cpu = (unsigned long)hcpu;
 	struct work_struct unbind_work;
 	struct workqueue_struct *wq;
+	struct worker_pool *pool;
 
 	switch (action & ~CPU_TASKS_FROZEN) {
 	case CPU_DOWN_PREPARE:
@@ -4626,10 +4632,13 @@ static int workqueue_cpu_down_callback(struct notifier_block *nfb,
 		INIT_WORK_ONSTACK(&unbind_work, wq_unbind_fn);
 		queue_work_on(cpu, system_highpri_wq, &unbind_work);
 
-		/* update NUMA affinity of unbound workqueues */
 		mutex_lock(&wq_pool_mutex);
+		/* update NUMA affinity of unbound workqueues */
 		list_for_each_entry(wq, &workqueues, list)
 			wq_update_unbound_numa(wq, cpu, false);
+		/* clear per-cpu workqueues's numa affinity. */
+		for_each_cpu_worker_pool(pool, cpu)
+			pool->node = NUMA_NO_NODE; /* restored at online */
 		mutex_unlock(&wq_pool_mutex);
 
 		/* wait for per-cpu unbinding to finish */
-- 
1.8.3.1





^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 3/4] workqueue: Update workqueue's possible cpumask when a new node, coming up.
  2014-12-15 11:11   ` [PATCH 0/4] workqueue: fix memory allocation after numa mapping is changed v3 Kamezawa Hiroyuki
  2014-12-15 11:14     ` [PATCH 1/4] workqueue:Fix unbound workqueue's node affinity detection Kamezawa Hiroyuki
  2014-12-15 11:16     ` [PATCH 2/4] workqueue: update per-cpu workqueue's node affinity at,online-offline Kamezawa Hiroyuki
@ 2014-12-15 11:18     ` Kamezawa Hiroyuki
  2014-12-16  7:49       ` Lai Jiangshan
  2014-12-15 11:22     ` [PATCH 4/4] workqueue: Handle cpu-node affinity change at CPU_ONLINE Kamezawa Hiroyuki
  3 siblings, 1 reply; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-15 11:18 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, Tejun Heo
  Cc: Yasuaki Ishimatsu, Gu, Zheng, tangchen

Workqueue keeps cpu<->node relationship including all possible cpus.
The original information was made at boot but it may change when
a new node is added.

Update information if a new node is ready with using node-hotplug callback.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memory_hotplug.h |  2 ++
 kernel/workqueue.c             | 28 +++++++++++++++++++++++++++-
 mm/memory_hotplug.c            |  4 ++++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8f1a419..cd3cb67 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -270,4 +270,6 @@ extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
 					  unsigned long pnum);
 
+/* update for workqueues */
+void workqueue_node_register(int node);
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2fd0bd7..5499b76 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -266,7 +266,7 @@ struct workqueue_struct {
 static struct kmem_cache *pwq_cache;
 
 static cpumask_var_t *wq_numa_possible_cpumask;
-					/* possible CPUs of each node */
+					/* PL: possible CPUs of each node */
 
 static bool wq_disable_numa;
 module_param_named(disable_numa, wq_disable_numa, bool, 0444);
@@ -4559,6 +4559,32 @@ static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu)
 		WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
 						  pool->attrs->cpumask) < 0);
 }
+#ifdef CONFIG_MEMORY_HOTPLUG
+
+static void reflesh_wq_possible_mask(int cpu, int node)
+{
+	int oldnode;
+	for_each_node_state(oldnode, N_POSSIBLE)
+		cpumask_clear_cpu(cpu, wq_numa_possible_cpumask[oldnode]);
+	cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+}
+/*
+ * When a cpu is physically added, cpu<->node relationship is established.
+ * We can catch the whole view when a new NODE_DATA() coming up.
+ * Update cpumask used for sched affinity of workers.
+ */
+void workqueue_node_register(int node)
+{
+	int cpu;
+	mutex_lock(&wq_pool_mutex);
+	for_each_possible_cpu(cpu) {
+		if (node == cpu_to_node(cpu))
+			reflesh_wq_possible_mask(cpu, node);
+	}
+	mutex_unlock(&wq_pool_mutex);
+}
+
+#endif
 
 /*
  * Workqueues should be brought up before normal priority CPU notifiers.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1bf4807..d0c1ebb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1121,6 +1121,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 	 * online_pages() and offline_pages().
 	 */
 	reset_node_present_pages(pgdat);
+	/*
+	 * Update workqueue's cpu/node affinity info.
+	 */
+	workqueue_node_register(nid);
 
 	return pgdat;
 }
-- 
1.8.3.1




^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 4/4] workqueue: Handle cpu-node affinity change at CPU_ONLINE.
  2014-12-15 11:11   ` [PATCH 0/4] workqueue: fix memory allocation after numa mapping is changed v3 Kamezawa Hiroyuki
                       ` (2 preceding siblings ...)
  2014-12-15 11:18     ` [PATCH 3/4] workqueue: Update workqueue's possible cpumask when a new node, coming up Kamezawa Hiroyuki
@ 2014-12-15 11:22     ` Kamezawa Hiroyuki
  3 siblings, 0 replies; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-15 11:22 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, Tejun Heo
  Cc: Yasuaki Ishimatsu, Gu, Zheng, tangchen

This is a corner case fix. In general, cpu-to-node relationship is
fixed when cpu/memory are visible to kernel based on firmware
information. But in some case, cpu-to-node relationship is updated
at CPU_ONLINE.

In arch/x86/mm/numa.c::numa_init_array(), a cpu will be tied to
a random numa node if the node firmware tells is memry-less node.
Such cpu's cpu affinity can be changed in cpu_up().

For example,
  step 1. boot with mem= boot option, hide memory.
  step 2. online hidden memory with using memory hot-add.
  step 3. offline cpus.
  step 4. online cpus.

In step 4, cpu's numa node affinity can be modified if memory-less
node turns out to be an usual node by step 2.

This patch handles the event in CPU_ONLINE callback of workqueue.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 kernel/workqueue.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5499b76..cc0e1d4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4583,6 +4583,20 @@ void workqueue_node_register(int node)
 	}
 	mutex_unlock(&wq_pool_mutex);
 }
+/*
+ * When a cpu boot up with a memory less node, the cpu is be tied 
+ * a random node (see arch/x86/mm/numa.c::numa_init_array()).
+ * The relationship is fixed when a proper memory node comes up and
+ * it's visible after CPU_ONLINE. check and reflesh it.
+ */
+void check_reflesh_node_cpumask(int cpu)
+{
+	int node = cpu_to_node(cpu);
+
+	if (likely(cpumask_test_cpu(cpu, wq_numa_possible_cpumask[node])))
+		return;
+	reflesh_wq_possible_mask(cpu, node);
+}
 
 #endif
 
@@ -4617,6 +4631,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
 		 */
 		for_each_cpu_worker_pool(pool, cpu)
 			pool->node = cpu_to_node(cpu);
+		check_reflesh_node_cpumask(cpu);
 
 		for_each_pool(pool, pi) {
 			mutex_lock(&pool->attach_mutex);
-- 
1.8.3.1




^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/4] workqueue:Fix unbound workqueue's node affinity detection
  2014-12-15 11:14     ` [PATCH 1/4] workqueue:Fix unbound workqueue's node affinity detection Kamezawa Hiroyuki
@ 2014-12-16  5:30       ` Lai Jiangshan
  2014-12-16  7:32         ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-16  5:30 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

On 12/15/2014 07:14 PM, Kamezawa Hiroyuki wrote:
> Unbound wq pool's node attribute is calculated at its allocation.
> But it's now calculated based on possible cpu<->node information
> which can be wrong after cpu hotplug/unplug.
> 
> If wrong pool->node is set, following allocation error will happen.
> ==
>  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>   cache: kmalloc-192, object size: 192, buffer size: 192, default order:
> 1, min order: 0
>   node 0: slabs: 6172, objs: 259224, free: 245741
>   node 1: slabs: 3261, objs: 136962, free: 127656
> ==
> 
> This patch fixes the node detection by making use of online cpu info.
> Unlike cpumask, the best node can be calculated by degree of overlap
> between attr->cpumask and numanode->online_cpumask.
> This change doesn't corrupt original purpose of the old calculation.
> 
> Note: it's expected that this function is called as
>       pool_detect_best_node
>       get_unbound_pool
>       alloc_unbound_pwq
>       wq_update_unbound_numa
>       called at CPU_ONLINE/CPU_DOWN_PREPARE
> and the latest online cpu info can be applied to a new wq pool,
> which replaces old one.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  kernel/workqueue.c | 38 ++++++++++++++++++++++++++------------
>  1 file changed, 26 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 09b685d..7809154 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -3440,6 +3440,31 @@ static void put_unbound_pool(struct worker_pool *pool)
>  }
>  
>  /**
> + * pool_detect_best_node - detect a node which contains specified cpumask.
> + * Should be called with wq_pool_mutex held.
> + * Returns a online node where the most of given cpus are tied to. 
> + */
> +static int pool_detect_best_node(const struct cpumask *cpumask)
> +{
> +	int node, best, match, selected = NUMA_NO_NODE;
> +	static struct cpumask andmask; /* under wq_pool_mutex */
> +
> +	if (!wq_numa_enabled ||
> +		cpumask_subset(cpu_online_mask, cpumask))
> +		goto out;
> +	best = 0;
> +	/* select a node which contains the most number of cpu */
> +	for_each_node_state(node, N_ONLINE) {
> +		cpumask_and(&andmask, cpumask, cpumask_of_node(node));
> +		match = cpumask_weight(&andmask);
> +		if (match > best)
> +			selected = best;
> +	}
> +out:
> +	return selected;
> +}


This is a mixture of fix and development.  Why not just keep the original calculation?

if the mask cover multiple nodes, NUMA_NO_NODE is the best for pool->node
after the pool was created. The memory allocation will select the best node
for manage_workers(), from which CPU that the worker actually is running on.

> +
> +/**
>   * get_unbound_pool - get a worker_pool with the specified attributes
>   * @attrs: the attributes of the worker_pool to get
>   *
> @@ -3457,7 +3482,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>  {
>  	u32 hash = wqattrs_hash(attrs);
>  	struct worker_pool *pool;
> -	int node;
>  
>  	lockdep_assert_held(&wq_pool_mutex);
>  
> @@ -3482,17 +3506,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>  	 * 'struct workqueue_attrs' comments for detail.
>  	 */
>  	pool->attrs->no_numa = false;
> -
> -	/* if cpumask is contained inside a NUMA node, we belong to that node */
> -	if (wq_numa_enabled) {
> -		for_each_node(node) {
> -			if (cpumask_subset(pool->attrs->cpumask,
> -					   wq_numa_possible_cpumask[node])) {
> -				pool->node = node;
> -				break;
> -			}
> -		}
> -	}
> +	pool->node = pool_detect_best_node(pool->attrs->cpumask);
>  
>  	if (worker_pool_assign_id(pool) < 0)
>  		goto fail;


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/4] workqueue: update per-cpu workqueue's node affinity at,online-offline
  2014-12-15 11:16     ` [PATCH 2/4] workqueue: update per-cpu workqueue's node affinity at,online-offline Kamezawa Hiroyuki
@ 2014-12-16  5:32       ` Lai Jiangshan
  2014-12-16  7:25         ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-16  5:32 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

On 12/15/2014 07:16 PM, Kamezawa Hiroyuki wrote:
> The percpu workqueue pool are persistend and never be freed.
> But cpu<->node relationship can be changed by cpu hotplug and pool->node
> can point to an offlined node.
> 
> If pool->node points to an offlined node,
> following allocation failure can happen.
>     ==
>      SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>       cache: kmalloc-192, object size: 192, buffer size: 192, default
> order:
>     1, min order: 0
>       node 0: slabs: 6172, objs: 259224, free: 245741
>       node 1: slabs: 3261, objs: 136962, free: 127656
>     ==
> 
> This patch clears per-cpu workqueue pool's node affinity at
> cpu offlining and restore it at cpu onlining.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  kernel/workqueue.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 7809154..2fd0bd7 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -4586,6 +4586,11 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>  	case CPU_DOWN_FAILED:
>  	case CPU_ONLINE:
>  		mutex_lock(&wq_pool_mutex);
> +		/*
> +		 * now cpu <-> node info is established. update numa node
> +		 */
> +		for_each_cpu_worker_pool(pool, cpu)
> +			pool->node = cpu_to_node(cpu);
>  
>  		for_each_pool(pool, pi) {
>  			mutex_lock(&pool->attach_mutex);
> @@ -4619,6 +4624,7 @@ static int workqueue_cpu_down_callback(struct notifier_block *nfb,
>  	int cpu = (unsigned long)hcpu;
>  	struct work_struct unbind_work;
>  	struct workqueue_struct *wq;
> +	struct worker_pool *pool;
>  
>  	switch (action & ~CPU_TASKS_FROZEN) {
>  	case CPU_DOWN_PREPARE:
> @@ -4626,10 +4632,13 @@ static int workqueue_cpu_down_callback(struct notifier_block *nfb,
>  		INIT_WORK_ONSTACK(&unbind_work, wq_unbind_fn);
>  		queue_work_on(cpu, system_highpri_wq, &unbind_work);
>  
> -		/* update NUMA affinity of unbound workqueues */
>  		mutex_lock(&wq_pool_mutex);
> +		/* update NUMA affinity of unbound workqueues */
>  		list_for_each_entry(wq, &workqueues, list)
>  			wq_update_unbound_numa(wq, cpu, false);
> +		/* clear per-cpu workqueues's numa affinity. */
> +		for_each_cpu_worker_pool(pool, cpu)
> +			pool->node = NUMA_NO_NODE; /* restored at online */


If the node is still online, it is better to keep the original pool->node.

>  		mutex_unlock(&wq_pool_mutex);
>  
>  		/* wait for per-cpu unbinding to finish */


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/4] workqueue: update per-cpu workqueue's node affinity at,online-offline
  2014-12-16  5:32       ` Lai Jiangshan
@ 2014-12-16  7:25         ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-16  7:25 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

(2014/12/16 14:32), Lai Jiangshan wrote:
> On 12/15/2014 07:16 PM, Kamezawa Hiroyuki wrote:
>> The percpu workqueue pool are persistend and never be freed.
>> But cpu<->node relationship can be changed by cpu hotplug and pool->node
>> can point to an offlined node.
>>
>> If pool->node points to an offlined node,
>> following allocation failure can happen.
>>      ==
>>       SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>>        cache: kmalloc-192, object size: 192, buffer size: 192, default
>> order:
>>      1, min order: 0
>>        node 0: slabs: 6172, objs: 259224, free: 245741
>>        node 1: slabs: 3261, objs: 136962, free: 127656
>>      ==
>>
>> This patch clears per-cpu workqueue pool's node affinity at
>> cpu offlining and restore it at cpu onlining.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>>   kernel/workqueue.c | 11 ++++++++++-
>>   1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 7809154..2fd0bd7 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -4586,6 +4586,11 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>>   	case CPU_DOWN_FAILED:
>>   	case CPU_ONLINE:
>>   		mutex_lock(&wq_pool_mutex);
>> +		/*
>> +		 * now cpu <-> node info is established. update numa node
>> +		 */
>> +		for_each_cpu_worker_pool(pool, cpu)
>> +			pool->node = cpu_to_node(cpu);
>>
>>   		for_each_pool(pool, pi) {
>>   			mutex_lock(&pool->attach_mutex);
>> @@ -4619,6 +4624,7 @@ static int workqueue_cpu_down_callback(struct notifier_block *nfb,
>>   	int cpu = (unsigned long)hcpu;
>>   	struct work_struct unbind_work;
>>   	struct workqueue_struct *wq;
>> +	struct worker_pool *pool;
>>
>>   	switch (action & ~CPU_TASKS_FROZEN) {
>>   	case CPU_DOWN_PREPARE:
>> @@ -4626,10 +4632,13 @@ static int workqueue_cpu_down_callback(struct notifier_block *nfb,
>>   		INIT_WORK_ONSTACK(&unbind_work, wq_unbind_fn);
>>   		queue_work_on(cpu, system_highpri_wq, &unbind_work);
>>
>> -		/* update NUMA affinity of unbound workqueues */
>>   		mutex_lock(&wq_pool_mutex);
>> +		/* update NUMA affinity of unbound workqueues */
>>   		list_for_each_entry(wq, &workqueues, list)
>>   			wq_update_unbound_numa(wq, cpu, false);
>> +		/* clear per-cpu workqueues's numa affinity. */
>> +		for_each_cpu_worker_pool(pool, cpu)
>> +			pool->node = NUMA_NO_NODE; /* restored at online */
>
>
> If the node is still online, it is better to keep the original pool->node.
>

Hm, ok, drop this code.  or moving all to node online event handler.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/4] workqueue:Fix unbound workqueue's node affinity detection
  2014-12-16  5:30       ` Lai Jiangshan
@ 2014-12-16  7:32         ` Kamezawa Hiroyuki
  2014-12-16  7:54           ` Lai Jiangshan
  0 siblings, 1 reply; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-16  7:32 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

(2014/12/16 14:30), Lai Jiangshan wrote:
> On 12/15/2014 07:14 PM, Kamezawa Hiroyuki wrote:
>> Unbound wq pool's node attribute is calculated at its allocation.
>> But it's now calculated based on possible cpu<->node information
>> which can be wrong after cpu hotplug/unplug.
>>
>> If wrong pool->node is set, following allocation error will happen.
>> ==
>>   SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>>    cache: kmalloc-192, object size: 192, buffer size: 192, default order:
>> 1, min order: 0
>>    node 0: slabs: 6172, objs: 259224, free: 245741
>>    node 1: slabs: 3261, objs: 136962, free: 127656
>> ==
>>
>> This patch fixes the node detection by making use of online cpu info.
>> Unlike cpumask, the best node can be calculated by degree of overlap
>> between attr->cpumask and numanode->online_cpumask.
>> This change doesn't corrupt original purpose of the old calculation.
>>
>> Note: it's expected that this function is called as
>>        pool_detect_best_node
>>        get_unbound_pool
>>        alloc_unbound_pwq
>>        wq_update_unbound_numa
>>        called at CPU_ONLINE/CPU_DOWN_PREPARE
>> and the latest online cpu info can be applied to a new wq pool,
>> which replaces old one.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>>   kernel/workqueue.c | 38 ++++++++++++++++++++++++++------------
>>   1 file changed, 26 insertions(+), 12 deletions(-)
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 09b685d..7809154 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -3440,6 +3440,31 @@ static void put_unbound_pool(struct worker_pool *pool)
>>   }
>>
>>   /**
>> + * pool_detect_best_node - detect a node which contains specified cpumask.
>> + * Should be called with wq_pool_mutex held.
>> + * Returns a online node where the most of given cpus are tied to.
>> + */
>> +static int pool_detect_best_node(const struct cpumask *cpumask)
>> +{
>> +	int node, best, match, selected = NUMA_NO_NODE;
>> +	static struct cpumask andmask; /* under wq_pool_mutex */
>> +
>> +	if (!wq_numa_enabled ||
>> +		cpumask_subset(cpu_online_mask, cpumask))
>> +		goto out;
>> +	best = 0;
>> +	/* select a node which contains the most number of cpu */
>> +	for_each_node_state(node, N_ONLINE) {
>> +		cpumask_and(&andmask, cpumask, cpumask_of_node(node));
>> +		match = cpumask_weight(&andmask);
>> +		if (match > best)
>> +			selected = best;
>> +	}
>> +out:
>> +	return selected;
>> +}
>
>
> This is a mixture of fix and development.  Why not just keep the original calculation?
>
Just because wq_numa_possible_mask is broken, it shouldn't be used if possible.
In this patch series, the mask is updated only when a node is coming up.
Is it better to clear it at node offline ?

> if the mask cover multiple nodes, NUMA_NO_NODE is the best for pool->node
> after the pool was created. The memory allocation will select the best node
> for manage_workers(), from which CPU that the worker actually is running on.
>

I'll drop this and try to keep original code as much as possible.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/4] workqueue: Update workqueue's possible cpumask when a new node, coming up.
  2014-12-15 11:18     ` [PATCH 3/4] workqueue: Update workqueue's possible cpumask when a new node, coming up Kamezawa Hiroyuki
@ 2014-12-16  7:49       ` Lai Jiangshan
  2014-12-16  8:10         ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-16  7:49 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

On 12/15/2014 07:18 PM, Kamezawa Hiroyuki wrote:
> Workqueue keeps cpu<->node relationship including all possible cpus.
> The original information was made at boot but it may change when
> a new node is added.
> 
> Update information if a new node is ready with using node-hotplug callback.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/memory_hotplug.h |  2 ++
>  kernel/workqueue.c             | 28 +++++++++++++++++++++++++++-
>  mm/memory_hotplug.c            |  4 ++++
>  3 files changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 8f1a419..cd3cb67 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -270,4 +270,6 @@ extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
>  extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
>  					  unsigned long pnum);
>  
> +/* update for workqueues */
> +void workqueue_node_register(int node);
>  #endif /* __LINUX_MEMORY_HOTPLUG_H */
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 2fd0bd7..5499b76 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -266,7 +266,7 @@ struct workqueue_struct {
>  static struct kmem_cache *pwq_cache;
>  
>  static cpumask_var_t *wq_numa_possible_cpumask;
> -					/* possible CPUs of each node */
> +					/* PL: possible CPUs of each node */
>  
>  static bool wq_disable_numa;
>  module_param_named(disable_numa, wq_disable_numa, bool, 0444);
> @@ -4559,6 +4559,32 @@ static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu)
>  		WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
>  						  pool->attrs->cpumask) < 0);
>  }
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +
> +static void reflesh_wq_possible_mask(int cpu, int node)
> +{
> +	int oldnode;
> +	for_each_node_state(oldnode, N_POSSIBLE)
> +		cpumask_clear_cpu(cpu, wq_numa_possible_cpumask[oldnode]);

You need to check and update all the wq->numa_pwq_tbl[oldnode]

> +	cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
> +}
> +/*
> + * When a cpu is physically added, cpu<->node relationship is established.
> + * We can catch the whole view when a new NODE_DATA() coming up.
> + * Update cpumask used for sched affinity of workers.
> + */
> +void workqueue_node_register(int node)
> +{
> +	int cpu;
> +	mutex_lock(&wq_pool_mutex);
> +	for_each_possible_cpu(cpu) {
> +		if (node == cpu_to_node(cpu))
> +			reflesh_wq_possible_mask(cpu, node);
> +	}
> +	mutex_unlock(&wq_pool_mutex);
> +}
> +
> +#endif
>  
>  /*
>   * Workqueues should be brought up before normal priority CPU notifiers.
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 1bf4807..d0c1ebb 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1121,6 +1121,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
>  	 * online_pages() and offline_pages().
>  	 */
>  	reset_node_present_pages(pgdat);
> +	/*
> +	 * Update workqueue's cpu/node affinity info.
> +	 */
> +	workqueue_node_register(nid);
>  
>  	return pgdat;
>  }


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/4] workqueue:Fix unbound workqueue's node affinity detection
  2014-12-16  7:32         ` Kamezawa Hiroyuki
@ 2014-12-16  7:54           ` Lai Jiangshan
  0 siblings, 0 replies; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-16  7:54 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

On 12/16/2014 03:32 PM, Kamezawa Hiroyuki wrote:
> (2014/12/16 14:30), Lai Jiangshan wrote:
>> On 12/15/2014 07:14 PM, Kamezawa Hiroyuki wrote:
>>> Unbound wq pool's node attribute is calculated at its allocation.
>>> But it's now calculated based on possible cpu<->node information
>>> which can be wrong after cpu hotplug/unplug.
>>>
>>> If wrong pool->node is set, following allocation error will happen.
>>> ==
>>>   SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>>>    cache: kmalloc-192, object size: 192, buffer size: 192, default order:
>>> 1, min order: 0
>>>    node 0: slabs: 6172, objs: 259224, free: 245741
>>>    node 1: slabs: 3261, objs: 136962, free: 127656
>>> ==
>>>
>>> This patch fixes the node detection by making use of online cpu info.
>>> Unlike cpumask, the best node can be calculated by degree of overlap
>>> between attr->cpumask and numanode->online_cpumask.
>>> This change doesn't corrupt original purpose of the old calculation.
>>>
>>> Note: it's expected that this function is called as
>>>        pool_detect_best_node
>>>        get_unbound_pool
>>>        alloc_unbound_pwq
>>>        wq_update_unbound_numa
>>>        called at CPU_ONLINE/CPU_DOWN_PREPARE
>>> and the latest online cpu info can be applied to a new wq pool,
>>> which replaces old one.
>>>
>>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>> ---
>>>   kernel/workqueue.c | 38 ++++++++++++++++++++++++++------------
>>>   1 file changed, 26 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>>> index 09b685d..7809154 100644
>>> --- a/kernel/workqueue.c
>>> +++ b/kernel/workqueue.c
>>> @@ -3440,6 +3440,31 @@ static void put_unbound_pool(struct worker_pool *pool)
>>>   }
>>>
>>>   /**
>>> + * pool_detect_best_node - detect a node which contains specified cpumask.
>>> + * Should be called with wq_pool_mutex held.
>>> + * Returns a online node where the most of given cpus are tied to.
>>> + */
>>> +static int pool_detect_best_node(const struct cpumask *cpumask)
>>> +{
>>> +    int node, best, match, selected = NUMA_NO_NODE;
>>> +    static struct cpumask andmask; /* under wq_pool_mutex */
>>> +
>>> +    if (!wq_numa_enabled ||
>>> +        cpumask_subset(cpu_online_mask, cpumask))
>>> +        goto out;
>>> +    best = 0;
>>> +    /* select a node which contains the most number of cpu */
>>> +    for_each_node_state(node, N_ONLINE) {
>>> +        cpumask_and(&andmask, cpumask, cpumask_of_node(node));
>>> +        match = cpumask_weight(&andmask);
>>> +        if (match > best)
>>> +            selected = best;
>>> +    }
>>> +out:
>>> +    return selected;
>>> +}
>>
>>
>> This is a mixture of fix and development.  Why not just keep the original calculation?
>>
> Just because wq_numa_possible_mask is broken, 

It is not broken if the bug is fixed.

> it shouldn't be used if possible.
> In this patch series, the mask is updated only when a node is coming up.
> Is it better to clear it at node offline ?
> 
>> if the mask cover multiple nodes, NUMA_NO_NODE is the best for pool->node
>> after the pool was created. The memory allocation will select the best node
>> for manage_workers(), from which CPU that the worker actually is running on.
>>
> 
> I'll drop this and try to keep original code as much as possible.
> 
> Thanks,
> -Kame
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/4] workqueue: Update workqueue's possible cpumask when a new node, coming up.
  2014-12-16  7:49       ` Lai Jiangshan
@ 2014-12-16  8:10         ` Kamezawa Hiroyuki
  2014-12-16  8:18           ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-16  8:10 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

(2014/12/16 16:49), Lai Jiangshan wrote:
> On 12/15/2014 07:18 PM, Kamezawa Hiroyuki wrote:
>> Workqueue keeps cpu<->node relationship including all possible cpus.
>> The original information was made at boot but it may change when
>> a new node is added.
>>
>> Update information if a new node is ready with using node-hotplug callback.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>>   include/linux/memory_hotplug.h |  2 ++
>>   kernel/workqueue.c             | 28 +++++++++++++++++++++++++++-
>>   mm/memory_hotplug.c            |  4 ++++
>>   3 files changed, 33 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>> index 8f1a419..cd3cb67 100644
>> --- a/include/linux/memory_hotplug.h
>> +++ b/include/linux/memory_hotplug.h
>> @@ -270,4 +270,6 @@ extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
>>   extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
>>   					  unsigned long pnum);
>>
>> +/* update for workqueues */
>> +void workqueue_node_register(int node);
>>   #endif /* __LINUX_MEMORY_HOTPLUG_H */
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 2fd0bd7..5499b76 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -266,7 +266,7 @@ struct workqueue_struct {
>>   static struct kmem_cache *pwq_cache;
>>
>>   static cpumask_var_t *wq_numa_possible_cpumask;
>> -					/* possible CPUs of each node */
>> +					/* PL: possible CPUs of each node */
>>
>>   static bool wq_disable_numa;
>>   module_param_named(disable_numa, wq_disable_numa, bool, 0444);
>> @@ -4559,6 +4559,32 @@ static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu)
>>   		WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
>>   						  pool->attrs->cpumask) < 0);
>>   }
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +
>> +static void reflesh_wq_possible_mask(int cpu, int node)
>> +{
>> +	int oldnode;
>> +	for_each_node_state(oldnode, N_POSSIBLE)
>> +		cpumask_clear_cpu(cpu, wq_numa_possible_cpumask[oldnode]);
>
> You need to check and update all the wq->numa_pwq_tbl[oldnode]
>

you mean : if I drop patch 1/4, it will be required. Right ?
I'll add that check.

Thanks,
-Kame


>> +	cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
>> +}
>> +/*
>> + * When a cpu is physically added, cpu<->node relationship is established.
>> + * We can catch the whole view when a new NODE_DATA() coming up.
>> + * Update cpumask used for sched affinity of workers.
>> + */
>> +void workqueue_node_register(int node)
>> +{
>> +	int cpu;
>> +	mutex_lock(&wq_pool_mutex);
>> +	for_each_possible_cpu(cpu) {
>> +		if (node == cpu_to_node(cpu))
>> +			reflesh_wq_possible_mask(cpu, node);
>> +	}
>> +	mutex_unlock(&wq_pool_mutex);
>> +}
>> +
>> +#endif
>>
>>   /*
>>    * Workqueues should be brought up before normal priority CPU notifiers.
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 1bf4807..d0c1ebb 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1121,6 +1121,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
>>   	 * online_pages() and offline_pages().
>>   	 */
>>   	reset_node_present_pages(pgdat);
>> +	/*
>> +	 * Update workqueue's cpu/node affinity info.
>> +	 */
>> +	workqueue_node_register(nid);
>>
>>   	return pgdat;
>>   }
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/4] workqueue: Update workqueue's possible cpumask when a new node, coming up.
  2014-12-16  8:10         ` Kamezawa Hiroyuki
@ 2014-12-16  8:18           ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 55+ messages in thread
From: Kamezawa Hiroyuki @ 2014-12-16  8:18 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen

(2014/12/16 17:10), Kamezawa Hiroyuki wrote:
> (2014/12/16 16:49), Lai Jiangshan wrote:
>> On 12/15/2014 07:18 PM, Kamezawa Hiroyuki wrote:
>>> Workqueue keeps cpu<->node relationship including all possible cpus.
>>> The original information was made at boot but it may change when
>>> a new node is added.
>>>
>>> Update information if a new node is ready with using node-hotplug callback.
>>>
>>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>> ---
>>>   include/linux/memory_hotplug.h |  2 ++
>>>   kernel/workqueue.c             | 28 +++++++++++++++++++++++++++-
>>>   mm/memory_hotplug.c            |  4 ++++
>>>   3 files changed, 33 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>>> index 8f1a419..cd3cb67 100644
>>> --- a/include/linux/memory_hotplug.h
>>> +++ b/include/linux/memory_hotplug.h
>>> @@ -270,4 +270,6 @@ extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
>>>   extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
>>>                         unsigned long pnum);
>>>
>>> +/* update for workqueues */
>>> +void workqueue_node_register(int node);
>>>   #endif /* __LINUX_MEMORY_HOTPLUG_H */
>>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>>> index 2fd0bd7..5499b76 100644
>>> --- a/kernel/workqueue.c
>>> +++ b/kernel/workqueue.c
>>> @@ -266,7 +266,7 @@ struct workqueue_struct {
>>>   static struct kmem_cache *pwq_cache;
>>>
>>>   static cpumask_var_t *wq_numa_possible_cpumask;
>>> -                    /* possible CPUs of each node */
>>> +                    /* PL: possible CPUs of each node */
>>>
>>>   static bool wq_disable_numa;
>>>   module_param_named(disable_numa, wq_disable_numa, bool, 0444);
>>> @@ -4559,6 +4559,32 @@ static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu)
>>>           WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
>>>                             pool->attrs->cpumask) < 0);
>>>   }
>>> +#ifdef CONFIG_MEMORY_HOTPLUG
>>> +
>>> +static void reflesh_wq_possible_mask(int cpu, int node)
>>> +{
>>> +    int oldnode;
>>> +    for_each_node_state(oldnode, N_POSSIBLE)
>>> +        cpumask_clear_cpu(cpu, wq_numa_possible_cpumask[oldnode]);
>>
>> You need to check and update all the wq->numa_pwq_tbl[oldnode]
>>
>
> you mean : if I drop patch 1/4, it will be required. Right ?
> I'll add that check.

Ah, isn't it handled by wq_update_unbound_numa() called at workqueue_cpu_down_callback() ?
Because node's online cpumask will be empty, new pwq will be attached.

Thanks
-Kame





^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/5] workqueue: fix bug when numa mapping is changed
  2014-12-15  1:34   ` Lai Jiangshan
@ 2014-12-18  1:50     ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 55+ messages in thread
From: Yasuaki Ishimatsu @ 2014-12-18  1:50 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Tejun Heo, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

Hi Lai,

Sorry for the delay in replying.

 > Thanks for testing.  Would you like to use GDB to print the code of
 > "workqueue_cpu_up_callback+0x510" ?

(gdb) l *workqueue_cpu_up_callback+0x510
0xffffffff8108fc30 is in workqueue_cpu_up_callback (include/linux/topology.h:84).
79      #endif
80
81      #ifndef cpu_to_node
82      static inline int cpu_to_node(int cpu)
83      {
84              return per_cpu(numa_node, cpu);
85      }
86      #endif
87
88      #ifndef set_numa_node

Thanks,
Yasuaki Ishimatsu

(2014/12/15 10:34), Lai Jiangshan wrote:
> On 12/13/2014 01:13 AM, Yasuaki Ishimatsu wrote:
>> Hi Lai,
>>
>> Thank you for posting the patches. I tried your patches.
>> But the following kernel panic occurred.
>
> Hi, Yasuaki,
>
> Thanks for testing.  Would you like to use GDB to print the code of
> "workqueue_cpu_up_callback+0x510" ?
>
> Thanks,
> Lai
>
>>
>> [  889.394087] BUG: unable to handle kernel paging request at 000000020000f3f1
>> [  889.395005] IP: [<ffffffff8108fe90>] workqueue_cpu_up_callback+0x510/0x740
>> [  889.395005] PGD 17a83067 PUD 0
>> [  889.395005] Oops: 0000 [#1] SMP
>> [  889.395005] Modules linked in: xt_CHECKSUM ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables
>> ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg vfat fat x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel iTCO_wdt sb_edac
>> iTCO_vendor_support i2c_i801 lrw gf128mul lpc_ich edac_core glue_helper mfd_core ablk_helper cryptd pcspkr shpchp ipmi_devintf ipmi_si ipmi_msghandler tpm_infineon nfsd auth_rpcgss nfs_acl lockd grace sunrpc uinput xfs libcrc32c sd_mod mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm
>> e1000e lpfc drm dca ptp i2c_algo_bit megaraid_sas scsi_transport_fc pps_core i2c_core dm_mirror dm_region_hash dm_log dm_mod
>> [  889.395005] CPU: 8 PID: 13595 Comm: udev_dp_bridge. Not tainted 3.18.0Lai+ #26
>> [  889.395005] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.81 12/03/2014
>> [  889.395005] task: ffff8a074a145160 ti: ffff8a077a6ec000 task.ti: ffff8a077a6ec000
>> [  889.395005] RIP: 0010:[<ffffffff8108fe90>]  [<ffffffff8108fe90>] workqueue_cpu_up_callback+0x510/0x740
>> [  889.395005] RSP: 0018:ffff8a077a6efca8  EFLAGS: 00010202
>> [  889.395005] RAX: 0000000000000001 RBX: 000000000000edf1 RCX: 000000000000edf1
>> [  889.395005] RDX: 0000000000000100 RSI: 000000020000f3f1 RDI: 0000000000000001
>> [  889.395005] RBP: ffff8a077a6efd08 R08: ffffffff81ac6de0 R09: ffff880874610000
>> [  889.395005] R10: 00000000ffffffff R11: 0000000000000001 R12: 000000000000f3f0
>> [  889.395005] R13: 000000000000001f R14: 00000000ffffffff R15: ffffffff81ac6de0
>> [  889.395005] FS:  00007f6b20c67740(0000) GS:ffff88087fd00000(0000) knlGS:0000000000000000
>> [  889.395005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  889.395005] CR2: 000000020000f3f1 CR3: 000000004534c000 CR4: 00000000001407e0
>> [  889.395005] Stack:
>> [  889.395005]  ffffffffffffffff 0000000000000020 fffffffffffffff8 00000004810a192d
>> [  889.395005]  ffff8a0700000204 0000000052f5b32d ffffffff81994fc0 00000000fffffff6
>> [  889.395005]  ffffffff81a13840 0000000000000002 000000000000001f 0000000000000000
>> [  889.395005] Call Trace:
>> [  889.395005]  [<ffffffff81094f6c>] notifier_call_chain+0x4c/0x70
>> [  889.395005]  [<ffffffff8109507e>] __raw_notifier_call_chain+0xe/0x10
>> [  889.395005]  [<ffffffff810750b3>] cpu_notify+0x23/0x50
>> [  889.395005]  [<ffffffff81075408>] _cpu_up+0x188/0x1a0
>> [  889.395005]  [<ffffffff810754a9>] cpu_up+0x89/0xb0
>> [  889.395005]  [<ffffffff8164f960>] cpu_subsys_online+0x40/0x90
>> [  889.395005]  [<ffffffff8140f10d>] device_online+0x6d/0xa0
>> [  889.395005]  [<ffffffff8140f1d5>] online_store+0x95/0xa0
>> [  889.395005]  [<ffffffff8140c2e8>] dev_attr_store+0x18/0x30
>> [  889.395005]  [<ffffffff8126210d>] sysfs_kf_write+0x3d/0x50
>> [  889.395005]  [<ffffffff81261624>] kernfs_fop_write+0xe4/0x160
>> [  889.395005]  [<ffffffff811e90d7>] vfs_write+0xb7/0x1f0
>> [  889.395005]  [<ffffffff81021dcc>] ? do_audit_syscall_entry+0x6c/0x70
>> [  889.395005]  [<ffffffff811e9bc5>] SyS_write+0x55/0xd0
>> [  889.395005]  [<ffffffff816646a9>] system_call_fastpath+0x12/0x17
>> [  889.395005] Code: 44 00 00 83 c7 01 48 63 d7 4c 89 ff e8 3a 2a 28 00 8b 15 78 84 a3 00 89 c7 39 d0 7d 70 48 63 cb 4c 89 e6 48 03 34 cd e0 3a ab 81 <8b> 1e 39 5d bc 74 36 41 39 de 74 0c 48 63 f2 eb c7 0f 1f 80 00
>> [  889.395005] RIP  [<ffffffff8108fe90>] workqueue_cpu_up_callback+0x510/0x740
>> [  889.395005]  RSP <ffff8a077a6efca8>
>> [  889.395005] CR2: 000000020000f3f1
>> [  889.785760] ---[ end trace 39abbfc9f93402f2 ]---
>> [  889.790931] Kernel panic - not syncing: Fatal exception
>> [  889.791931] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
>> [  889.791931] drm_kms_helper: panic occurred, switching back to text console
>> [  889.815947] ------------[ cut here ]------------
>> [  889.815947] WARNING: CPU: 8 PID: 64 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x5d/0x60()
>> [  889.815947] Modules linked in: xt_CHECKSUM ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables
>> ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg vfat fat x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel iTCO_wdt sb_edac
>> iTCO_vendor_support i2c_i801 lrw gf128mul lpc_ich edac_core glue_helper mfd_core ablk_helper cryptd pcspkr shpchp ipmi_devintf ipmi_si ipmi_msghandler tpm_infineon nfsd auth_rpcgss nfs_acl lockd grace sunrpc uinput xfs libcrc32c sd_mod mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm
>> e1000e lpfc drm dca ptp i2c_algo_bit megaraid_sas scsi_transport_fc pps_core i2c_core dm_mirror dm_region_hash dm_log dm_mod
>> [  889.815947] CPU: 8 PID: 64 Comm: migration/8 Tainted: G      D        3.18.0Lai+ #26
>> [  889.815947] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.81 12/03/2014
>> [  889.815947]  0000000000000000 00000000f7f40529 ffff88087fd03d38 ffffffff8165c8d4
>> [  889.815947]  0000000000000000 0000000000000000 ffff88087fd03d78 ffffffff81074eb1
>> [  889.815947]  ffff88087fd03d78 0000000000000000 ffff88087fc13840 0000000000000008
>> [  889.815947] Call Trace:
>> [  889.815947]  <IRQ>  [<ffffffff8165c8d4>] dump_stack+0x46/0x58
>> [  889.815947]  [<ffffffff81074eb1>] warn_slowpath_common+0x81/0xa0
>> [  889.815947]  [<ffffffff81074fca>] warn_slowpath_null+0x1a/0x20
>> [  889.815947]  [<ffffffff810489bd>] native_smp_send_reschedule+0x5d/0x60
>> [  889.815947]  [<ffffffff810b0ad4>] trigger_load_balance+0x144/0x1b0
>> [  889.815947]  [<ffffffff810a009f>] scheduler_tick+0x9f/0xe0
>> [  889.815947]  [<ffffffff810daef4>] update_process_times+0x64/0x80
>> [  889.815947]  [<ffffffff810eab05>] tick_sched_handle.isra.19+0x25/0x60
>> [  889.815947]  [<ffffffff810eab85>] tick_sched_timer+0x45/0x80
>> [  889.815947]  [<ffffffff810dbbe7>] __run_hrtimer+0x77/0x1d0
>> [  889.815947]  [<ffffffff810eab40>] ? tick_sched_handle.isra.19+0x60/0x60
>> [  889.815947]  [<ffffffff810dbfd7>] hrtimer_interrupt+0xf7/0x240
>> [  889.815947]  [<ffffffff8104b85b>] local_apic_timer_interrupt+0x3b/0x70
>> [  889.815947]  [<ffffffff81667465>] smp_apic_timer_interrupt+0x45/0x60
>> [  889.815947]  [<ffffffff8166553d>] apic_timer_interrupt+0x6d/0x80
>> [  889.815947]  <EOI>  [<ffffffff810a79c7>] ? set_next_entity+0x67/0x80
>> [  889.815947]  [<ffffffffa011d1d7>] ? __drm_modeset_lock_all+0x37/0x120 [drm]
>> [  889.815947]  [<ffffffff8109c727>] ? finish_task_switch+0x57/0x180
>> [  889.815947]  [<ffffffff8165fba8>] __schedule+0x2e8/0x7e0
>> [  889.815947]  [<ffffffff816600c9>] schedule+0x29/0x70
>> [  889.815947]  [<ffffffff81097d43>] smpboot_thread_fn+0xd3/0x1b0
>> [  889.815947]  [<ffffffff81097c70>] ? SyS_setgroups+0x1a0/0x1a0
>> [  889.815947]  [<ffffffff81093df1>] kthread+0xe1/0x100
>> [  889.815947]  [<ffffffff81093d10>] ? kthread_create_on_node+0x1b0/0x1b0
>> [  889.815947]  [<ffffffff816645fc>] ret_from_fork+0x7c/0xb0
>> [  889.815947]  [<ffffffff81093d10>] ? kthread_create_on_node+0x1b0/0x1b0
>> [  889.815947] ---[ end trace 39abbfc9f93402f3 ]---
>> [  890.156187] ------------[ cut here ]------------
>> [  890.156187] WARNING: CPU: 8 PID: 64 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x5d/0x60()
>> [  890.156187] Modules linked in: xt_CHECKSUM ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables
>> ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg vfat fat x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel iTCO_wdt sb_edac
>> iTCO_vendor_support i2c_i801 lrw gf128mul lpc_ich edac_core glue_helper mfd_core ablk_helper cryptd pcspkr shpchp ipmi_devintf ipmi_si ipmi_msghandler tpm_infineon nfsd auth_rpcgss nfs_acl lockd grace sunrpc uinput xfs libcrc32c sd_mod mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm
>> e1000e lpfc drm dca ptp i2c_algo_bit megaraid_sas scsi_transport_fc pps_core i2c_core dm_mirror dm_region_hash dm_log dm_mod
>> [  890.156187] CPU: 8 PID: 64 Comm: migration/8 Tainted: G      D W      3.18.0Lai+ #26
>> [  890.156187] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.81 12/03/2014
>> [  890.156187]  0000000000000000 00000000f7f40529 ffff88087366bc08 ffffffff8165c8d4
>> [  890.156187]  0000000000000000 0000000000000000 ffff88087366bc48 ffffffff81074eb1
>> [  890.156187]  ffff88087fd142c0 0000000000000044 ffff8a074a145160 ffff8a074a145160
>> [  890.156187] Call Trace:
>> [  890.156187]  [<ffffffff8165c8d4>] dump_stack+0x46/0x58
>> [  890.156187]  [<ffffffff81074eb1>] warn_slowpath_common+0x81/0xa0
>> [  890.156187]  [<ffffffff81074fca>] warn_slowpath_null+0x1a/0x20
>> [  890.156187]  [<ffffffff810489bd>] native_smp_send_reschedule+0x5d/0x60
>> [  890.156187]  [<ffffffff8109ddd8>] resched_curr+0xa8/0xd0
>> [  890.156187]  [<ffffffff8109eac0>] check_preempt_curr+0x80/0xa0
>> [  890.156187]  [<ffffffff810a78c8>] attach_task+0x48/0x50
>> [  890.156187]  [<ffffffff810a7ae5>] active_load_balance_cpu_stop+0x105/0x250
>> [  890.156187]  [<ffffffff810a79e0>] ? set_next_entity+0x80/0x80
>> [  890.156187]  [<ffffffff8110cab8>] cpu_stopper_thread+0x78/0x150
>> [  890.156187]  [<ffffffff8165fba8>] ? __schedule+0x2e8/0x7e0
>> [  890.156187]  [<ffffffff81097d6f>] smpboot_thread_fn+0xff/0x1b0
>> [  890.156187]  [<ffffffff81097c70>] ? SyS_setgroups+0x1a0/0x1a0
>> [  890.156187]  [<ffffffff81093df1>] kthread+0xe1/0x100
>> [  890.156187]  [<ffffffff81093d10>] ? kthread_create_on_node+0x1b0/0x1b0
>> [  890.156187]  [<ffffffff816645fc>] ret_from_fork+0x7c/0xb0
>> [  890.156187]  [<ffffffff81093d10>] ? kthread_create_on_node+0x1b0/0x1b0
>> [  890.156187] ---[ end trace 39abbfc9f93402f4 ]---
>>
>> Thanks,
>> Yasuaki Ishimatsu
>>
>> (2014/12/12 19:19), Lai Jiangshan wrote:
>>> Workqueue code has an assumption that the numa mapping is stable
>>> after system booted.  It is incorrectly currently.
>>>
>>> Yasuaki Ishimatsu hit a allocation failure bug when the numa mapping
>>> between CPU and node is changed. This was the last scene:
>>>    SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>>>     cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
>>>     node 0: slabs: 6172, objs: 259224, free: 245741
>>>     node 1: slabs: 3261, objs: 136962, free: 127656
>>>
>>> Yasuaki Ishimatsu investigated that it happened in the following situation:
>>>
>>> 1) System Node/CPU before offline/online:
>>> 	       | CPU
>>> 	------------------------
>>> 	node 0 |  0-14, 60-74
>>> 	node 1 | 15-29, 75-89
>>> 	node 2 | 30-44, 90-104
>>> 	node 3 | 45-59, 105-119
>>>
>>> 2) A system-board (contains node2 and node3) is offline:
>>> 	       | CPU
>>> 	------------------------
>>> 	node 0 |  0-14, 60-74
>>> 	node 1 | 15-29, 75-89
>>>
>>> 3) A new system-board is online, two new node IDs are allocated
>>>      for the two node of the SB, but the old CPU IDs are allocated for
>>>      the SB, here the NUMA mapping between node and CPU is changed.
>>>      (the node of CPU#30 is changed from node#2 to node#4, for example)
>>> 	       | CPU
>>> 	------------------------
>>> 	node 0 |  0-14, 60-74
>>> 	node 1 | 15-29, 75-89
>>> 	node 4 | 30-59
>>> 	node 5 | 90-119
>>>
>>> 4) now, the NUMA mapping is changed, but wq_numa_possible_cpumask
>>>      which is the convenient NUMA mapping cache in workqueue.c is still outdated.
>>>      thus pool->node calculated by get_unbound_pool() is incorrect.
>>>
>>> 5) when the create_worker() is called with the incorrect offlined
>>>       pool->node, it is failed and the pool can't make any progress.
>>>
>>> To fix this bug, we need to fixup the wq_numa_possible_cpumask and the
>>> pool->node, it is done in patch2 and patch3.
>>>
>>> patch1 fixes memory leak related wq_numa_possible_cpumask.
>>> patch4 kill another assumption about how the numa mapping changed.
>>> patch5 reduces the allocation fails when the node is offline or the node
>>> is lack of memory.
>>>
>>> The patchset is untested. It is sent for earlier review.
>>>
>>> Thanks,
>>> Lai.
>>>
>>> Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>> Cc: Tejun Heo <tj@kernel.org>
>>> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>> Cc: "Gu, Zheng" <guz.fnst@cn.fujitsu.com>
>>> Cc: tangchen <tangchen@cn.fujitsu.com>
>>> Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com>
>>> Lai Jiangshan (5):
>>>     workqueue: fix memory leak in wq_numa_init()
>>>     workqueue: update wq_numa_possible_cpumask
>>>     workqueue: fixup existing pool->node
>>>     workqueue: update NUMA affinity for the node lost CPU
>>>     workqueue: retry on NUMA_NO_NODE when create_worker() fails
>>>
>>>    kernel/workqueue.c |  129 ++++++++++++++++++++++++++++++++++++++++++++--------
>>>    1 files changed, 109 insertions(+), 20 deletions(-)
>>>
>>
>>
>> .
>>
>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] workqueue: update wq_numa_possible_cpumask
  2014-12-12 10:19 ` [PATCH 2/5] workqueue: update wq_numa_possible_cpumask Lai Jiangshan
  2014-12-12 17:18   ` Tejun Heo
@ 2014-12-18  2:22   ` Lai Jiangshan
  1 sibling, 0 replies; 55+ messages in thread
From: Lai Jiangshan @ 2014-12-18  2:22 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Tejun Heo, Yasuaki Ishimatsu, Gu, Zheng, tangchen,
	Hiroyuki KAMEZAWA

On 12/12/2014 06:19 PM, Lai Jiangshan wrote:
> Yasuaki Ishimatsu hit a allocation failure bug when the numa mapping
> between CPU and node is changed. This was the last scene:
>  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>   cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
>   node 0: slabs: 6172, objs: 259224, free: 245741
>   node 1: slabs: 3261, objs: 136962, free: 127656
> 
> Yasuaki Ishimatsu investigated that it happened in the following situation:
> 
> 1) System Node/CPU before offline/online:
> 	       | CPU
> 	------------------------
> 	node 0 |  0-14, 60-74
> 	node 1 | 15-29, 75-89
> 	node 2 | 30-44, 90-104
> 	node 3 | 45-59, 105-119
> 
> 2) A system-board (contains node2 and node3) is offline:
> 	       | CPU
> 	------------------------
> 	node 0 |  0-14, 60-74
> 	node 1 | 15-29, 75-89
> 
> 3) A new system-board is online, two new node IDs are allocated
>    for the two node of the SB, but the old CPU IDs are allocated for
>    the SB, here the NUMA mapping between node and CPU is changed.
>    (the node of CPU#30 is changed from node#2 to node#4, for example)
> 	       | CPU
> 	------------------------
> 	node 0 |  0-14, 60-74
> 	node 1 | 15-29, 75-89
> 	node 4 | 30-59
> 	node 5 | 90-119
> 
> 4) now, the NUMA mapping is changed, but wq_numa_possible_cpumask
>    which is the convenient NUMA mapping cache in workqueue.c is still outdated.
>    thus pool->node calculated by get_unbound_pool() is incorrect.
> 
> 5) when the create_worker() is called with the incorrect offlined
>     pool->node, it is failed and the pool can't make any progress.
> 
> To fix this bug, we need to fixup the wq_numa_possible_cpumask and the
> pool->node, the fix is so complicated that we split it into two patches,
> this patch fix the wq_numa_possible_cpumask and the next fix the pool->node.
> 
> To fix the wq_numa_possible_cpumask, we only update the cpumasks of
> the orig_node and the new_node of the onlining @cpu.  we con't touch
> other unrelated nodes since the wq subsystem haven't seen the changed.
> 
> After this fix the new pool->node of new pools are correct.
> and existing wq's affinity is fixed up by wq_update_unbound_numa()
> after wq_update_numa_mapping().
> 
> Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Cc: "Gu, Zheng" <guz.fnst@cn.fujitsu.com>
> Cc: tangchen <tangchen@cn.fujitsu.com>
> Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> ---
>  kernel/workqueue.c |   42 +++++++++++++++++++++++++++++++++++++++++-
>  1 files changed, 41 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index a6fd2b8..4c88b61 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -266,7 +266,7 @@ struct workqueue_struct {
>  static struct kmem_cache *pwq_cache;
>  
>  static cpumask_var_t *wq_numa_possible_cpumask;
> -					/* possible CPUs of each node */
> +					/* PL: possible CPUs of each node */
>  
>  static bool wq_disable_numa;
>  module_param_named(disable_numa, wq_disable_numa, bool, 0444);
> @@ -3949,6 +3949,44 @@ out_unlock:
>  	put_pwq_unlocked(old_pwq);
>  }
>  
> +static void wq_update_numa_mapping(int cpu)
> +{
> +	int node, orig_node = NUMA_NO_NODE, new_node = cpu_to_node(cpu);
> +
> +	lockdep_assert_held(&wq_pool_mutex);
> +
> +	if (!wq_numa_enabled)
> +		return;
> +
> +	/* the node of onlining CPU is not NUMA_NO_NODE */
> +	if (WARN_ON(new_node == NUMA_NO_NODE))
> +		return;
> +
> +	/* test whether the NUMA node mapping is changed. */
> +	if (cpumask_test_cpu(cpu, wq_numa_possible_cpumask[new_node]))
> +		return;
> +
> +	/* find the origin node */
> +	for_each_node(node) {
> +		if (cpumask_test_cpu(cpu, wq_numa_possible_cpumask[node])) {
> +			orig_node = node;
> +			break;
> +		}
> +	}
> +
> +	/* there may be multi mappings changed, re-initial. */
> +	cpumask_clear(wq_numa_possible_cpumask[new_node]);
> +	if (orig_node != NUMA_NO_NODE)
> +		cpumask_clear(wq_numa_possible_cpumask[orig_node]);
> +	for_each_possible_cpu(cpu) {
> +		node = cpu_to_node(node);

Hi, Yasuaki Ishimatsu

The bug is here.  It should be
		node = cpu_to_node(cpu);

> +		if (node == new_node)
> +			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[new_node]);
> +		else if (orig_node != NUMA_NO_NODE && node == orig_node)
> +			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[orig_node]);
> +	}
> +}
> +
>  static int alloc_and_link_pwqs(struct workqueue_struct *wq)
>  {
>  	bool highpri = wq->flags & WQ_HIGHPRI;
> @@ -4584,6 +4622,8 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>  			mutex_unlock(&pool->attach_mutex);
>  		}
>  
> +		wq_update_numa_mapping(cpu);
> +
>  		/* update NUMA affinity of unbound workqueues */
>  		list_for_each_entry(wq, &workqueues, list)
>  			wq_update_unbound_numa(wq, cpu, true);


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/5] workqueue: fixup existing pool->node
  2014-12-15  1:23     ` Lai Jiangshan
@ 2014-12-25 20:14       ` Tejun Heo
  2015-01-13  7:08         ` Lai Jiangshan
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2014-12-25 20:14 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

On Mon, Dec 15, 2014 at 09:23:49AM +0800, Lai Jiangshan wrote:
> The pwqs of the old node's cpumask do be discarded. But the pools of the old
> node's cpumask maybe recycle. For example, a new workqueue's affinity is set to
> the old node's cpumask before the pool is dead. Any old pool can long live.

Hah?  Why can't it just be unhashed so that it can't be looked up for
new pwqs anymore?

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/5] workqueue: update wq_numa_possible_cpumask
  2014-12-15  2:02     ` Lai Jiangshan
@ 2014-12-25 20:16       ` Tejun Heo
  0 siblings, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2014-12-25 20:16 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

On Mon, Dec 15, 2014 at 10:02:04AM +0800, Lai Jiangshan wrote:
> 1) "NUMA code" = system's NUMA memory hotplug code, AKA, keep the numa mapping stable
> 
>    I think this is the better idea.  This idea came to my mind immediately at the time
>    I received the bug report.  And after some discussions, I was told that it is too HARD
>    to keep the numa mapping stable across multiple physical system-board/node online/offline.

If the arch needs to change it, that's fine, I guess.  What I'm saying
is that we should move this table building and maintenance to numa
code proper.  It was already a bit icky to build it inside workqueue.
Trying to manage it dynamically manage it from there is too ugly.
Let's please make this mapping table a proper part of NUMA
infrastructure with proper notification callbacks.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/5] workqueue: update NUMA affinity for the node lost CPU
  2014-12-15  1:28     ` Lai Jiangshan
@ 2014-12-25 20:17       ` Tejun Heo
  0 siblings, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2014-12-25 20:17 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

On Mon, Dec 15, 2014 at 09:28:40AM +0800, Lai Jiangshan wrote:
> > The foundation those pools were standing are gone.
> 
> This statement is not true unless we write some code to force them,
> dequeue them from the unbound_pool_hash, for example.

Yeah, that's the obvious thing to do.  Those pools are dead.  Take
them off the namespace and drain the existing usages.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/5] workqueue: fixup existing pool->node
  2014-12-25 20:14       ` Tejun Heo
@ 2015-01-13  7:08         ` Lai Jiangshan
  2015-01-13 15:24           ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Lai Jiangshan @ 2015-01-13  7:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

On 12/26/2014 04:14 AM, Tejun Heo wrote:
> On Mon, Dec 15, 2014 at 09:23:49AM +0800, Lai Jiangshan wrote:
>> The pwqs of the old node's cpumask do be discarded. But the pools of the old
>> node's cpumask maybe recycle. For example, a new workqueue's affinity is set to
>> the old node's cpumask before the pool is dead. Any old pool can long live.
> 
> Hah?  Why can't it just be unhashed so that it can't be looked up for
> new pwqs anymore?
> 

unhashing doesn't reduce the complexity in my code.

	for_each_pool(pool, pi) {
		node = calc_pool_node(pool);
		if (pool->node != node)
-			pool->node = node;
+			unhash_pool(pool);
	}

And any old pool can long live due to:
	some works queues themself back and back
	the old pool has extremely long pending works

So I prefer to fixup existing pool->node.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/5] workqueue: fixup existing pool->node
  2015-01-13  7:08         ` Lai Jiangshan
@ 2015-01-13 15:24           ` Tejun Heo
  0 siblings, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2015-01-13 15:24 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Yasuaki Ishimatsu, Gu, Zheng, tangchen, Hiroyuki KAMEZAWA

On Tue, Jan 13, 2015 at 03:08:28PM +0800, Lai Jiangshan wrote:
> On 12/26/2014 04:14 AM, Tejun Heo wrote:
> > On Mon, Dec 15, 2014 at 09:23:49AM +0800, Lai Jiangshan wrote:
> >> The pwqs of the old node's cpumask do be discarded. But the pools of the old
> >> node's cpumask maybe recycle. For example, a new workqueue's affinity is set to
> >> the old node's cpumask before the pool is dead. Any old pool can long live.
> > 
> > Hah?  Why can't it just be unhashed so that it can't be looked up for
> > new pwqs anymore?
> > 
> 
> unhashing doesn't reduce the complexity in my code.
> 
> 	for_each_pool(pool, pi) {
> 		node = calc_pool_node(pool);
> 		if (pool->node != node)
> -			pool->node = node;
> +			unhash_pool(pool);
> 	}

Hah?  You shouldn't need any of the dynamic updating code.  How does
that not reduce complexity?

> And any old pool can long live due to:
> 	some works queues themself back and back
> 	the old pool has extremely long pending works

As long as you don't give out new refs, it's fine.  It'll eventually
get drained.  Why is this a problem?

> So I prefer to fixup existing pool->node.

Can you please elaborate how this wouldn't remove the dynamic update
code?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2015-01-13 15:24 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-12 10:19 [PATCH 0/5] workqueue: fix bug when numa mapping is changed Lai Jiangshan
2014-12-12 10:19 ` [PATCH 1/5] workqueue: fix memory leak in wq_numa_init() Lai Jiangshan
2014-12-12 17:12   ` Tejun Heo
2014-12-15  5:25     ` Lai Jiangshan
2014-12-12 10:19 ` [PATCH 2/5] workqueue: update wq_numa_possible_cpumask Lai Jiangshan
2014-12-12 17:18   ` Tejun Heo
2014-12-15  2:02     ` Lai Jiangshan
2014-12-25 20:16       ` Tejun Heo
2014-12-18  2:22   ` Lai Jiangshan
2014-12-12 10:19 ` [PATCH 3/5] workqueue: fixup existing pool->node Lai Jiangshan
2014-12-12 17:25   ` Tejun Heo
2014-12-15  1:23     ` Lai Jiangshan
2014-12-25 20:14       ` Tejun Heo
2015-01-13  7:08         ` Lai Jiangshan
2015-01-13 15:24           ` Tejun Heo
2014-12-12 10:19 ` [PATCH 4/5] workqueue: update NUMA affinity for the node lost CPU Lai Jiangshan
2014-12-12 17:27   ` Tejun Heo
2014-12-15  1:28     ` Lai Jiangshan
2014-12-25 20:17       ` Tejun Heo
2014-12-12 10:19 ` [PATCH 5/5] workqueue: retry on NUMA_NO_NODE when create_worker() fails Lai Jiangshan
2014-12-12 16:05   ` KOSAKI Motohiro
2014-12-12 17:29     ` KOSAKI Motohiro
2014-12-12 17:29   ` Tejun Heo
2014-12-12 17:13 ` [PATCH 0/5] workqueue: fix bug when numa mapping is changed Yasuaki Ishimatsu
2014-12-15  1:34   ` Lai Jiangshan
2014-12-18  1:50     ` Yasuaki Ishimatsu
2014-12-13 16:27 ` [PATCH 0/4] workqueue: fix bug when numa mapping is changed v2 Kamezawa Hiroyuki
2014-12-13 16:30   ` [PATCH 1/4] workqueue: add a hook for node hotplug Kamezawa Hiroyuki
2014-12-13 16:33   ` [PATCH 2/4] workqueue: add warning if pool->node is offline Kamezawa Hiroyuki
2014-12-13 16:35   ` [PATCH 3/4] workqueue: remove per-node unbound pool when node goes offline Kamezawa Hiroyuki
2014-12-15  2:06     ` Lai Jiangshan
2014-12-15  2:06       ` Kamezawa Hiroyuki
2014-12-13 16:38   ` [PATCH 4/4] workqueue: handle change in cpu-node relationship Kamezawa Hiroyuki
2014-12-15  2:12     ` Lai Jiangshan
2014-12-15  2:20       ` Kamezawa Hiroyuki
2014-12-15  2:48         ` Lai Jiangshan
2014-12-15  2:55           ` Kamezawa Hiroyuki
2014-12-15  3:30             ` Lai Jiangshan
2014-12-15  3:34             ` Lai Jiangshan
2014-12-15  4:04               ` Kamezawa Hiroyuki
2014-12-15  5:19                 ` Lai Jiangshan
2014-12-15  5:33                   ` Kamezawa Hiroyuki
2014-12-15 11:11   ` [PATCH 0/4] workqueue: fix memory allocation after numa mapping is changed v3 Kamezawa Hiroyuki
2014-12-15 11:14     ` [PATCH 1/4] workqueue:Fix unbound workqueue's node affinity detection Kamezawa Hiroyuki
2014-12-16  5:30       ` Lai Jiangshan
2014-12-16  7:32         ` Kamezawa Hiroyuki
2014-12-16  7:54           ` Lai Jiangshan
2014-12-15 11:16     ` [PATCH 2/4] workqueue: update per-cpu workqueue's node affinity at,online-offline Kamezawa Hiroyuki
2014-12-16  5:32       ` Lai Jiangshan
2014-12-16  7:25         ` Kamezawa Hiroyuki
2014-12-15 11:18     ` [PATCH 3/4] workqueue: Update workqueue's possible cpumask when a new node, coming up Kamezawa Hiroyuki
2014-12-16  7:49       ` Lai Jiangshan
2014-12-16  8:10         ` Kamezawa Hiroyuki
2014-12-16  8:18           ` Kamezawa Hiroyuki
2014-12-15 11:22     ` [PATCH 4/4] workqueue: Handle cpu-node affinity change at CPU_ONLINE Kamezawa Hiroyuki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).