* mod_delayed_work() explosion due to 874bbfe6
@ 2016-02-02 14:44 Mike Galbraith
  2016-02-03 14:37 ` Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Mike Galbraith @ 2016-02-02 14:44 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML

Hi Tejun,

I'm looking at a crash analysis that fingers 874bbfe6 as being the
culprit.  (I didn't do the analysis, but it appears to be correct)

Scenario: CPU168 calls mod_delayed_work(), is taken offline before the
timer expires.  Due to 874bbfe6, dwork->cpu is the now offline CPU168
vs the previous WORK_CPU_UNBOUND, timer fires on CPU131, it tries to
__queue_work() with cpu == the now offline CPU168, gets to...

        } else
                pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));

... and goes boom.

<snippet>
crash> p numa_node | grep 168
  [168]: ffff8c03fdf0e328
crash> rd ffff8c03fdf0e328
ffff8c03fdf0e328:  00000000ffffffff                    ........

Thus, pwq becomes NULL. Then, as a result of the reference to
pwq->pool, a NULL pointer dereference occurs at (*PANIC).
</snippet>

What if anything is supposed to prevent this?

	-Mike

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: mod_delayed_work() explosion due to 874bbfe6
  2016-02-02 14:44 mod_delayed_work() explosion due to 874bbfe6 Mike Galbraith
@ 2016-02-03 14:37 ` Michal Hocko
  2016-02-03 16:32 ` Tejun Heo
  2016-02-03 18:54 ` [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup Tejun Heo
  2 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2016-02-03 14:37 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Tejun Heo, LKML

On Tue 02-02-16 15:44:24, Mike Galbraith wrote:
> Hi Tejun,
> 
> I'm looking at a crash analysis that fingers 874bbfe6 as being the
> culprit.  (I didn't do the analysis, but it appears to be correct)
> 
> Scenario: CPU168 calls mod_delayed_work(), is taken offline before the
> timer expires.  Due to 874bbfe6, dwork->cpu is the now offline CPU168
> vs the previous WORK_CPU_UNBOUND, timer fires on CPU131, it tries to
> __queue_work() with cpu == the now offline CPU168, gets to...
> 
>         } else
>                 pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
> 
> ... and goes boom.
> 
> <snippet>
> crash> p numa_node | grep 168
>   [168]: ffff8c03fdf0e328
> crash> rd ffff8c03fdf0e328
> ffff8c03fdf0e328:  00000000ffffffff                    ........
> 
> Thus, pwq becomes NULL. Then, as a result of the reference to
> pwq->pool, a NULL pointer dereference occurs at (*PANIC).
> </snippet>
> 
> What if anything is supposed to prevent this?

AFAICS [1] nothing...

---
[1] http://lkml.kernel.org/r/20160203122855.GB6762@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs


* Re: mod_delayed_work() explosion due to 874bbfe6
  2016-02-02 14:44 mod_delayed_work() explosion due to 874bbfe6 Mike Galbraith
  2016-02-03 14:37 ` Michal Hocko
@ 2016-02-03 16:32 ` Tejun Heo
  2016-02-03 18:54 ` [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup Tejun Heo
  2 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2016-02-03 16:32 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML, tangchen, rafael

On Tue, Feb 02, 2016 at 03:44:24PM +0100, Mike Galbraith wrote:
> Scenario: CPU168 calls mod_delayed_work(), is taken offline before the
> timer expires.  Due to 874bbfe6, dwork->cpu is the now offline CPU168
> vs the previous WORK_CPU_UNBOUND, timer fires on CPU131, it tries to
> __queue_work() with cpu == the now offline CPU168, gets to...
> 
>         } else
>                 pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
> 
> ... and goes boom.
> 
> <snippet>
> crash> p numa_node | grep 168
>   [168]: ffff8c03fdf0e328
> crash> rd ffff8c03fdf0e328
> ffff8c03fdf0e328:  00000000ffffffff                    ........
> 
> Thus, pwq becomes NULL. Then, as a result of the reference to
> pwq->pool, a NULL pointer dereference occurs at (*PANIC).
> </snippet>
> 
> What if anything is supposed to prevent this?

(cc'ing Tang Chen and Rafael)

So, the root problem here is the cpu <-> node mapping flipping across
cpu on/offlining, which is unnecessary and causes other issues too.
It has been worked on for quite a while now.

 http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com

I'll create a bandaid patch for now so that wq falls back to dfl_wq on
negative node.

Thanks.

-- 
tejun


* [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
  2016-02-02 14:44 mod_delayed_work() explosion due to 874bbfe6 Mike Galbraith
  2016-02-03 14:37 ` Michal Hocko
  2016-02-03 16:32 ` Tejun Heo
@ 2016-02-03 18:54 ` Tejun Heo
  2016-02-03 18:55   ` Tejun Heo
                     ` (3 more replies)
  2 siblings, 4 replies; 15+ messages in thread
From: Tejun Heo @ 2016-02-03 18:54 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML

When looking up the pool_workqueue to use for an unbound workqueue,
workqueue assumes that the target CPU is always bound to a valid NUMA
node.  However, currently, when a CPU goes offline, the mapping is
destroyed and cpu_to_node() returns NUMA_NO_NODE.  This has always
been broken but hasn't triggered until recently.

After 874bbfe600a6 ("workqueue: make sure delayed work run in local
cpu"), workqueue forcefully assigns the local CPU for delayed work
items without an explicit target CPU to fix a different issue.  This
widens the window in which a CPU can go offline while a delayed work
item is pending, causing delayed work items to be dispatched with
their target CPU set to an already offlined CPU.  The resulting
NUMA_NO_NODE mapping makes workqueue try to queue the work item on a
NULL pool_workqueue and thus crash.

Fix it by mapping NUMA_NO_NODE to the default pool_workqueue from
unbound_pwq_by_node().  This is a temporary workaround.  The long term
solution is keeping CPU -> NODE mapping stable across CPU off/online
cycles which is in the works.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Cc: stable@vger.kernel.org # v4.3+
Fixes: 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu")
Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com
---
 kernel/workqueue.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 61a0264..f748eab 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -570,6 +570,16 @@ static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
 						  int node)
 {
 	assert_rcu_or_wq_mutex_or_pool_mutex(wq);
+
+	/*
+	 * XXX: @node can be NUMA_NO_NODE if CPU goes offline while a
+	 * delayed item is pending.  The plan is to keep CPU -> NODE
+	 * mapping valid and stable across CPU on/offlines.  Once that
+	 * happens, this workaround can be removed.
+	 */
+	if (unlikely(node == NUMA_NO_NODE))
+		return wq->dfl_pwq;
+
 	return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
 }
 


* Re: [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
  2016-02-03 18:54 ` [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup Tejun Heo
@ 2016-02-03 18:55   ` Tejun Heo
  2016-02-04  3:15     ` Mike Galbraith
  2016-02-03 19:12   ` Thomas Gleixner
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2016-02-03 18:55 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML

On Wed, Feb 03, 2016 at 01:54:25PM -0500, Tejun Heo wrote:
> Fix it by mapping NUMA_NO_NODE to the default pool_workqueue from
> unbound_pwq_by_node().  This is a temporary workaround.  The long term
> solution is keeping CPU -> NODE mapping stable across CPU off/online
> cycles which is in the works.

Forgot to mention.  Can you please test this?  Once verified, I'll
route it through wq/for-4.5-fixes.

Thanks.

-- 
tejun


* Re: [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
  2016-02-03 18:54 ` [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup Tejun Heo
  2016-02-03 18:55   ` Tejun Heo
@ 2016-02-03 19:12   ` Thomas Gleixner
  2016-02-03 19:28     ` Tejun Heo
  2016-02-04  8:40   ` Michal Hocko
  2016-02-10 15:55   ` Tejun Heo
  3 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2016-02-03 19:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Galbraith, LKML, Michal Hocko, Jiri Slaby, Petr Mladek,
	Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, Daniel Bilik

On Wed, 3 Feb 2016, Tejun Heo wrote:
> When looking up the pool_workqueue to use for an unbound workqueue,
> workqueue assumes that the target CPU is always bound to a valid NUMA
> node.  However, currently, when a CPU goes offline, the mapping is
> destroyed and cpu_to_node() returns NUMA_NO_NODE.  This has always
> been broken but hasn't triggered until recently.
> 
> After 874bbfe600a6 ("workqueue: make sure delayed work run in local
> cpu"), workqueue forcefully assigns the local CPU for delayed work
> items without an explicit target CPU to fix a different issue.  This
> widens the window in which a CPU can go offline while a delayed work
> item is pending, causing delayed work items to be dispatched with
> their target CPU set to an already offlined CPU.  The resulting
> NUMA_NO_NODE mapping makes workqueue try to queue the work item on a
> NULL pool_workqueue and thus crash.
> 
> Fix it by mapping NUMA_NO_NODE to the default pool_workqueue from
> unbound_pwq_by_node().  This is a temporary workaround.  The long term
> solution is keeping CPU -> NODE mapping stable across CPU off/online
> cycles which is in the works.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
> Cc: Tang Chen <tangchen@cn.fujitsu.com>
> Cc: Rafael J. Wysocki <rafael@kernel.org>
> Cc: Len Brown <len.brown@intel.com>
> Cc: stable@vger.kernel.org # v4.3+

4.3+ ? Hasn't 874bbfe600a6 been backported to older stable kernels?

Adding a 'Fixes: 874bbfe600a6 ...' tag is what you really want here.

> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 61a0264..f748eab 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -570,6 +570,16 @@ static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
>  						  int node)
>  {
>  	assert_rcu_or_wq_mutex_or_pool_mutex(wq);
> +
> +	/*
> +	 * XXX: @node can be NUMA_NO_NODE if CPU goes offline while a
> +	 * delayed item is pending.  The plan is to keep CPU -> NODE
> +	 * mapping valid and stable across CPU on/offlines.  Once that
> +	 * happens, this workaround can be removed.

So what happens if the complete node is offline?

> +	 */
> +	if (unlikely(node == NUMA_NO_NODE))
> +		return wq->dfl_pwq;
> +

Thanks,

	tglx


* Re: [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
  2016-02-03 19:12   ` Thomas Gleixner
@ 2016-02-03 19:28     ` Tejun Heo
  2016-02-04  2:12       ` Mike Galbraith
  0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2016-02-03 19:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Galbraith, LKML, Michal Hocko, Jiri Slaby, Petr Mladek,
	Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, Daniel Bilik

Hello,

On Wed, Feb 03, 2016 at 08:12:19PM +0100, Thomas Gleixner wrote:
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
> > Cc: Tang Chen <tangchen@cn.fujitsu.com>
> > Cc: Rafael J. Wysocki <rafael@kernel.org>
> > Cc: Len Brown <len.brown@intel.com>
> > Cc: stable@vger.kernel.org # v4.3+
> 
> 4.3+ ? Hasn't 874bbfe600a6 been backported to older stable kernels?
> 
> Adding a 'Fixes: 874bbfe600a6 ...' tag is what you really want here.

Oops, you're right.  Will add that once Mike confirms the fix.

> > @@ -570,6 +570,16 @@ static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
> >  						  int node)
> >  {
> >  	assert_rcu_or_wq_mutex_or_pool_mutex(wq);
> > +
> > +	/*
> > +	 * XXX: @node can be NUMA_NO_NODE if CPU goes offline while a
> > +	 * delayed item is pending.  The plan is to keep CPU -> NODE
> > +	 * mapping valid and stable across CPU on/offlines.  Once that
> > +	 * happens, this workaround can be removed.
> 
> So what happens if the complete node is offline?

pool_workqueue lookup itself should be fine as dfl_pwq is assigned to
all nodes by default.  When the node comes back online, things can
break currently because the cpu to node mapping may change.  That's
what Tang has been working on.  It's a bigger problem throughout the
memory allocation path though, because there's no synchronization
around the cpu -> node mapping.  Hopefully, the pending patchset can
get through sooner rather than later.

Thanks.

-- 
tejun


* Re: [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
  2016-02-03 19:28     ` Tejun Heo
@ 2016-02-04  2:12       ` Mike Galbraith
  0 siblings, 0 replies; 15+ messages in thread
From: Mike Galbraith @ 2016-02-04  2:12 UTC (permalink / raw)
  To: Tejun Heo, Thomas Gleixner
  Cc: LKML, Michal Hocko, Jiri Slaby, Petr Mladek, Jan Kara,
	Ben Hutchings, Sasha Levin, Shaohua Li, Daniel Bilik

On Wed, 2016-02-03 at 14:28 -0500, Tejun Heo wrote:
> Hello,
> 
> On Wed, Feb 03, 2016 at 08:12:19PM +0100, Thomas Gleixner wrote:
> > > Signed-off-by: Tejun Heo <tj@kernel.org>
> > > Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
> > > Cc: Tang Chen <tangchen@cn.fujitsu.com>
> > > Cc: Rafael J. Wysocki <rafael@kernel.org>
> > > Cc: Len Brown <len.brown@intel.com>
> > > Cc: stable@vger.kernel.org # v4.3+
> > 
> > 4.3+ ? Hasn't 874bbfe600a6 been backported to older stable kernels?
> > 
> > Adding a 'Fixes: 874bbfe600a6 ...' tag is what you really want here.
> 
> Oops, you're right.  Will add that once Mike confirms the fix.
> 
> > > @@ -570,6 +570,16 @@ static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
> > >  						  int node)
> > >  {
> > >  	assert_rcu_or_wq_mutex_or_pool_mutex(wq);
> > > +
> > > +	/*
> > > +	 * XXX: @node can be NUMA_NO_NODE if CPU goes offline while a
> > > +	 * delayed item is pending.  The plan is to keep CPU -> NODE
> > > +	 * mapping valid and stable across CPU on/offlines.  Once that
> > > +	 * happens, this workaround can be removed.
> > 
> > So what happens if the complete node is offline?
> 
> pool_workqueue lookup itself should be fine as dfl_pwq is assigned to
> all nodes by default.  When the node comes back online, things can
> break currently because cpu to node mapping may change.  That's what
> Tang has been working on.

That may make confirming the fix a bit problematic.  The crash I was
looking at happened on a Fujitsu box.

	-Mike


* Re: [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
  2016-02-03 18:55   ` Tejun Heo
@ 2016-02-04  3:15     ` Mike Galbraith
  0 siblings, 0 replies; 15+ messages in thread
From: Mike Galbraith @ 2016-02-04  3:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: LKML

On Wed, 2016-02-03 at 13:55 -0500, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 01:54:25PM -0500, Tejun Heo wrote:
> > Fix it by mapping NUMA_NO_NODE to the default pool_workqueue from
> > unbound_pwq_by_node().  This is a temporary workaround.  The long
> > term
> > solution is keeping CPU -> NODE mapping stable across CPU
> > off/online
> > cycles which is in the works.
> 
> Forgot to mention.  Can you please test this?  Once verified, I'll
> route it through wq/for-4.5-fixes.

I've passed it on, hopefully the user can make time to test.

	-Mike


* Re: [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
  2016-02-03 18:54 ` [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup Tejun Heo
  2016-02-03 18:55   ` Tejun Heo
  2016-02-03 19:12   ` Thomas Gleixner
@ 2016-02-04  8:40   ` Michal Hocko
  2016-02-10 15:55   ` Tejun Heo
  3 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2016-02-04  8:40 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Mike Galbraith, LKML

On Wed 03-02-16 13:54:25, Tejun Heo wrote:
> When looking up the pool_workqueue to use for an unbound workqueue,
> workqueue assumes that the target CPU is always bound to a valid NUMA
> node.  However, currently, when a CPU goes offline, the mapping is
> destroyed and cpu_to_node() returns NUMA_NO_NODE.  This has always
> been broken but hasn't triggered until recently.
> 
> After 874bbfe600a6 ("workqueue: make sure delayed work run in local
> cpu"), workqueue forcefully assigns the local CPU for delayed work
> items without an explicit target CPU to fix a different issue.  This
> widens the window in which a CPU can go offline while a delayed work
> item is pending, causing delayed work items to be dispatched with
> their target CPU set to an already offlined CPU.  The resulting
> NUMA_NO_NODE mapping makes workqueue try to queue the work item on a
> NULL pool_workqueue and thus crash.
> 
> Fix it by mapping NUMA_NO_NODE to the default pool_workqueue from
> unbound_pwq_by_node().  This is a temporary workaround.  The long term
> solution is keeping CPU -> NODE mapping stable across CPU off/online
> cycles which is in the works.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
> Cc: Tang Chen <tangchen@cn.fujitsu.com>
> Cc: Rafael J. Wysocki <rafael@kernel.org>
> Cc: Len Brown <len.brown@intel.com>
> Cc: stable@vger.kernel.org # v4.3+
> Fixes: 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu")
> Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
> Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com

Reviewed-by: Michal Hocko <mhocko@suse.com>

> ---
>  kernel/workqueue.c |   10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 61a0264..f748eab 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -570,6 +570,16 @@ static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
>  						  int node)
>  {
>  	assert_rcu_or_wq_mutex_or_pool_mutex(wq);
> +
> +	/*
> +	 * XXX: @node can be NUMA_NO_NODE if CPU goes offline while a
> +	 * delayed item is pending.  The plan is to keep CPU -> NODE
> +	 * mapping valid and stable across CPU on/offlines.  Once that
> +	 * happens, this workaround can be removed.
> +	 */
> +	if (unlikely(node == NUMA_NO_NODE))
> +		return wq->dfl_pwq;
> +
>  	return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
>  }
>  

-- 
Michal Hocko
SUSE Labs


* [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
  2016-02-03 18:54 ` [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup Tejun Heo
                     ` (2 preceding siblings ...)
  2016-02-04  8:40   ` Michal Hocko
@ 2016-02-10 15:55   ` Tejun Heo
  2016-02-15 17:33     ` Michal Hocko
  3 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2016-02-10 15:55 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: LKML

Hello,

With 874bbfe600a6 reverted, this doesn't matter as much but I'm still
committing it to wq/for-4.5-fixes.  The commit message has been
updated accordingly.

Thanks.

------ 8< ------
From 9354baa208f47bf088c28c1da5f38abe172db5d6 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 3 Feb 2016 13:54:25 -0500

When looking up the pool_workqueue to use for an unbound workqueue,
workqueue assumes that the target CPU is always bound to a valid NUMA
node.  However, currently, when a CPU goes offline, the mapping is
destroyed and cpu_to_node() returns NUMA_NO_NODE.

This has always been broken but hasn't triggered often enough before
874bbfe600a6 ("workqueue: make sure delayed work run in local cpu").
After the commit, workqueue forcefully assigns the local CPU for
delayed work items without an explicit target CPU to fix a different
issue.  This widens the window in which a CPU can go offline while a
delayed work item is pending, causing delayed work items to be
dispatched with their target CPU set to an already offlined CPU.  The
resulting NUMA_NO_NODE mapping makes workqueue try to queue the work
item on a NULL pool_workqueue and thus crash.

While 874bbfe600a6 has been reverted for a different reason, making
the bug less visible again, it can still happen.  Fix it by mapping
NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node().
This is a temporary workaround.  The long term solution is keeping the
CPU -> NODE mapping stable across CPU off/online cycles, which is
being worked on.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com
---
 kernel/workqueue.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index dc7faad..3ca7ab8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -570,6 +570,16 @@ static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
 						  int node)
 {
 	assert_rcu_or_wq_mutex_or_pool_mutex(wq);
+
+	/*
+	 * XXX: @node can be NUMA_NO_NODE if CPU goes offline while a
+	 * delayed item is pending.  The plan is to keep CPU -> NODE
+	 * mapping valid and stable across CPU on/offlines.  Once that
+	 * happens, this workaround can be removed.
+	 */
+	if (unlikely(node == NUMA_NO_NODE))
+		return wq->dfl_pwq;
+
 	return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
 }
 
-- 
2.5.0


* Re: [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
  2016-02-10 15:55   ` Tejun Heo
@ 2016-02-15 17:33     ` Michal Hocko
  2016-02-15 18:21       ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2016-02-15 17:33 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Mike Galbraith, LKML

On Wed 10-02-16 10:55:03, Tejun Heo wrote:
[...]
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -570,6 +570,16 @@ static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
>  						  int node)
>  {
>  	assert_rcu_or_wq_mutex_or_pool_mutex(wq);
> +
> +	/*
> +	 * XXX: @node can be NUMA_NO_NODE if CPU goes offline while a
> +	 * delayed item is pending.  The plan is to keep CPU -> NODE
> +	 * mapping valid and stable across CPU on/offlines.  Once that
> +	 * happens, this workaround can be removed.

I am not sure this is completely true with the code as it currently
stands.  Don't we also need to use cpu_to_mem to handle memoryless CPUs?

> +	 */
> +	if (unlikely(node == NUMA_NO_NODE))
> +		return wq->dfl_pwq;
> +
>  	return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
>  }


diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c579dbab2e36..4785b895b9b5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1325,7 +1325,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 	if (!(wq->flags & WQ_UNBOUND))
 		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
 	else
-		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
+		pwq = unbound_pwq_by_node(wq, cpu_to_mem(cpu));
 
 	/*
 	 * If @work was previously on a different pool, it might still be

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
  2016-02-15 17:33     ` Michal Hocko
@ 2016-02-15 18:21       ` Tejun Heo
  2016-02-15 20:54         ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2016-02-15 18:21 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Mike Galbraith, LKML

Hello, Michal.

On Mon, Feb 15, 2016 at 06:33:46PM +0100, Michal Hocko wrote:
> On Wed 10-02-16 10:55:03, Tejun Heo wrote:
> [...]
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -570,6 +570,16 @@ static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
> >  						  int node)
> >  {
> >  	assert_rcu_or_wq_mutex_or_pool_mutex(wq);
> > +
> > +	/*
> > +	 * XXX: @node can be NUMA_NO_NODE if CPU goes offline while a
> > +	 * delayed item is pending.  The plan is to keep CPU -> NODE
> > +	 * mapping valid and stable across CPU on/offlines.  Once that
> > +	 * happens, this workaround can be removed.
> 
> I am not sure this is completely true with the code as it currently
> stands.  Don't we also need to use cpu_to_mem to handle memoryless CPUs?

I'm not sure.  I think we still want to distinguish workers for a
memoryless node from its neighboring node with memory.  We don't want
work items for the latter to be randomly distributed to the former
after all.

Thanks.

-- 
tejun


* Re: [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
  2016-02-15 18:21       ` Tejun Heo
@ 2016-02-15 20:54         ` Michal Hocko
  2016-02-15 21:02           ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2016-02-15 20:54 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Mike Galbraith, LKML

On Mon 15-02-16 13:21:25, Tejun Heo wrote:
> Hello, Michal.
> 
> On Mon, Feb 15, 2016 at 06:33:46PM +0100, Michal Hocko wrote:
> > On Wed 10-02-16 10:55:03, Tejun Heo wrote:
> > [...]
> > > --- a/kernel/workqueue.c
> > > +++ b/kernel/workqueue.c
> > > @@ -570,6 +570,16 @@ static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
> > >  						  int node)
> > >  {
> > >  	assert_rcu_or_wq_mutex_or_pool_mutex(wq);
> > > +
> > > +	/*
> > > +	 * XXX: @node can be NUMA_NO_NODE if CPU goes offline while a
> > > +	 * delayed item is pending.  The plan is to keep CPU -> NODE
> > > +	 * mapping valid and stable across CPU on/offlines.  Once that
> > > +	 * happens, this workaround can be removed.
> > 
> > I am not sure this is completely true with the code as it currently
> > stands.  Don't we also need to use cpu_to_mem to handle memoryless CPUs?
> 
> I'm not sure.  I think we still want to distinguish workers for a
> memoryless node from its neighboring node with memory.  We don't want
> work items for the latter to be randomly distributed to the former
> after all.

I am not sure I understand. Does that mean that a node with no memory
would have its own WQ-specific pool? I might be missing something, but
I thought that cpu_to_node will return NUMA_NO_NODE if the CPU is
memoryless. Or do you expect that cpu_to_node will always return a
valid node id even when its node doesn't contain any memory?

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH wq/for-4.5-fixes] workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
  2016-02-15 20:54         ` Michal Hocko
@ 2016-02-15 21:02           ` Tejun Heo
  0 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2016-02-15 21:02 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Mike Galbraith, LKML

Hello, Michal.

On Mon, Feb 15, 2016 at 09:54:38PM +0100, Michal Hocko wrote:
> I am not sure I understand. Does that mean that a node with no memory
> would have its WQ specific pool? I might be missing something but I

Yeah, I think it should.

> thought that cpu_to_node will return NUMA_NO_NODE if the CPU is memory
> less. Or do you expect that cpu_to_node will always return a valid node
> id even when it doesn't contain any memory?

Hmmm... that was what I was expecting.  It's not really looking for a
memory node to use but identifying affinity there.

Thanks.

-- 
tejun

