* [GIT PULL] workqueue fixes for v4.3-rc5
@ 2015-10-13 21:49 Tejun Heo
  2015-10-14 16:30   ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2015-10-13 21:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Lai Jiangshan, Shaohua Li

Hello, Linus.

Single patch to fix delayed work being queued on the wrong CPU.  This
has been broken forever (v2.6.31+) but obviously doesn't trigger in
most configurations.

Thanks.

The following changes since commit 9ffecb10283508260936b96022d4ee43a7798b4c:

  Linux 4.3-rc3 (2015-09-27 07:50:08 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-4.3-fixes

for you to fetch changes up to 874bbfe600a660cba9c776b3957b1ce393151b76:

  workqueue: make sure delayed work run in local cpu (2015-09-30 13:06:46 -0400)

----------------------------------------------------------------
Shaohua Li (1):
      workqueue: make sure delayed work run in local cpu

 kernel/workqueue.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ca71582..bcb14ca 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1458,13 +1458,13 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
 	timer_stats_timer_set_start_info(&dwork->timer);
 
 	dwork->wq = wq;
+	/* timer isn't guaranteed to run in this cpu, record earlier */
+	if (cpu == WORK_CPU_UNBOUND)
+		cpu = raw_smp_processor_id();
 	dwork->cpu = cpu;
 	timer->expires = jiffies + delay;
 
-	if (unlikely(cpu != WORK_CPU_UNBOUND))
-		add_timer_on(timer, cpu);
-	else
-		add_timer(timer);
+	add_timer_on(timer, cpu);
 }
 
 /**


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-13 21:49 [GIT PULL] workqueue fixes for v4.3-rc5 Tejun Heo
@ 2015-10-14 16:30   ` Linus Torvalds
  0 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2015-10-14 16:30 UTC (permalink / raw)
  To: Tejun Heo, Andrew Morton, Christoph Lameter, Michal Hocko
  Cc: Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

On Tue, Oct 13, 2015 at 2:49 PM, Tejun Heo <tj@kernel.org> wrote:
>
> Single patch to fix delayed work being queued on the wrong CPU.  This
> has been broken forever (v2.6.31+) but obviously doesn't trigger in
> most configurations.

So why is this a bugfix? If cpu == WORK_CPU_UNBOUND, then things
_shouldn't_ care which cpu it gets run on.

I don't think the patch is bad per se, because we obviously have other
places where we just do that "WORK_CPU_UNBOUND means current cpu", and
it's documented to "prefer" the current CPU, but at the same time I
get the very strong feeling that anything that *requires* it to mean
the current CPU is just buggy crap.

So the whole "wrong CPU" argument seems bogus. It's not a workqueue
bug, it's a caller bug.

Because I'd argue that any user that *requires* a particular CPU
should specify that. This "fix" doesn't seem to be a fix at all.

So to me, the bug seems to be that vmstat_update() is just bogus, and
uses percpu data but doesn't specify that it needs to run on the
current cpu.

Look at vmstat_shepherd() that does this *right* and starts the whole thing:

                        schedule_delayed_work_on(cpu,
                                &per_cpu(vmstat_work, cpu), 0);

and then look at vmstat_update() that gets it wrong:

                schedule_delayed_work(this_cpu_ptr(&vmstat_work),
                        round_jiffies_relative(sysctl_stat_interval));

Notice the difference?

So I feel that this is *obviously* a vmstat bug, and that working
around it by adding ad-hoc crap to the workqueues is completely the
wrong thing to do. So I'm not going to pull this, because it seems to
be hiding the real problem rather than really "fixing" anything.

(That said, it's not obvious to me why we don't just specify the cpu
in the work structure itself, and just get rid of the "use different
functions to schedule the work" model. I think it's somewhat fragile
how you can end up using the same work in different "CPU boundedness"
models like this).

Added the mm/vmstat.c suspects to the cc. Particularly Christoph
Lameter - this goes back to commit 7cc36bbddde5 ("vmstat: on-demand
vmstat workers V8").

Comments?

                Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-14 16:30   ` Linus Torvalds
@ 2015-10-14 16:57     ` Tejun Heo
  -1 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2015-10-14 16:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

Hello, Linus.

On Wed, Oct 14, 2015 at 09:30:21AM -0700, Linus Torvalds wrote:
> On Tue, Oct 13, 2015 at 2:49 PM, Tejun Heo <tj@kernel.org> wrote:
> >
> > Single patch to fix delayed work being queued on the wrong CPU.  This
> > has been broken forever (v2.6.31+) but obviously doesn't trigger in
> > most configurations.
> 
> So why is this a bugfix? If cpu == WORK_CPU_UNBOUND, then things
> _shouldn't_ care which cpu it gets run on.

That enum is a bit of a misnomer in this case.  It's more like
WORK_CPU_LOCAL.

> I don't think the patch is bad per se, because we obviously have other
> places where we just do that "WORK_CPU_UNBOUND means current cpu", and
> it's documented to "prefer" the current CPU, but at the same time I
> get the very strong feeling that anything that *requires* it to mean
> the current CPU is just buggy crap.
> 
> So the whole "wrong CPU" argument seems bogus. It's not a workqueue
> bug, it's a caller bug.
>
> Because I'd argue that any user that *requires* a particular CPU
> should specify that. This "fix" doesn't seem to be a fix at all.
> 
> So to me, the bug seems to be that vmstat_update() is just bogus, and
> uses percpu data but doesn't speicy that it needs to run on the
> current cpu.

For both delayed and !delayed work items on per-cpu workqueues,
queueing without specifying a specific CPU always meant queueing on
the local CPU.  It just happened to start that way and has never
changed.  It'd be great to make users which want a specific CPU to
always queue with the CPU specified; however, making that change would
need quite a bit of auditing and debug features (like always forcing a
different CPU) given that breakages can be a lot subtler than an
outright crash as in vmstat_update().  I've been thinking about doing
that for quite a while now but haven't had the chance yet.
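The debug feature mentioned here could, in outline, look like the following. This is entirely hypothetical: no such config option exists, and the option name is invented for illustration.

```c
/* Hypothetical debug knob: deliberately pick some *other* online
 * cpu for unbound requests, so that callers which silently depend
 * on "unbound == local cpu" break loudly instead of subtly. */
#ifdef CONFIG_DEBUG_WQ_FORCE_REMOTE_CPU	/* invented option name */
	if (cpu == WORK_CPU_UNBOUND)
		cpu = cpumask_any_but(cpu_online_mask,
				      raw_smp_processor_id());
#endif
```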

...
> So I feel that this is *obviously* a vmstat bug, and that working
> around it by adding ah-hoc crap to the workqueues is completely the
> wrong thing to do. So I'm not going to pull this, because it seems to
> be hiding the real problem rather than really "fixing" anything.

I wish this were an ad-hoc thing but this has been the guaranteed
behavior all along.  Switching the specific call-site to
queue_work_on() would still be a good idea tho.

> (That said, it's not obvious to me why we don't just specify the cpu
> in the work structure itself, and just get rid of the "use different
> functions to schedule the work" model. I think it's somewhat fragile
> how you can end up using the same work in different "CPU boundedness"
> models like this).

Hmmm... you mean associating a delayed work item with the target
pool_workqueue on queueing and sharing the queueing paths for both
delayed and !delayed work items?  Yeah, we can do that although it'd
involve somewhat invasive changes to queueing and canceling paths and
making more code paths handle both work_struct and delayed_work.  The
code is the way it is mostly for historical reasons.  I'm not sure the
cleanup (I can't think of scenarios where the observed behaviors would
be different / better) would justify the churn.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-14 16:57     ` Tejun Heo
@ 2015-10-14 17:36       ` Linus Torvalds
  -1 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2015-10-14 17:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

On Wed, Oct 14, 2015 at 9:57 AM, Tejun Heo <tj@kernel.org> wrote:
>
> That enum is a bit of misnomer in this case.  It's more like
> WORK_CPU_LOCAL.

Well, that clearly is the new semantic guarantees you are pushing. I'm
not at all convinced it's a good idea, though.

> For both delayed and !delayed work items on per-cpu workqueues,
> queueing without specifying a specific CPU always meant queueing on
> the local CPU.

That's just not true, as far as I can tell.

I went back in the history to 2012, and it does

          if (unlikely(cpu != WORK_CPU_UNBOUND))
                  add_timer_on(timer, cpu);
          else
                  add_timer(timer);

so this whole "WORK_CPU_UNBOUND means add_timer() without specifying
a CPU" behavior has been true for at least the last several years.

So the documentation, the code, and the name all agree:
WORK_CPU_UNBOUND does *not* mean that it's guaranteed to be the local
CPU. The documentation says "preferred", the code clearly doesn't
specify the CPU, and the name says "not bound to a particular CPU".


>> So I feel that this is *obviously* a vmstat bug, and that working
>> around it by adding ah-hoc crap to the workqueues is completely the
>> wrong thing to do. So I'm not going to pull this, because it seems to
>> be hiding the real problem rather than really "fixing" anything.
>
> I wish this were an ad-hoc thing but this has been the guaranteed
> behavior all along.  Switching the specific call-site to
> queue_work_on() would still be a good idea tho.

I really don't see how you can say that it has been guaranteed behavior all along.

It clearly has not at all been guaranteed behavior. The fact that you
had to change the code to do that should have made it clear.

The code has *always* done that non-cpu-specific "add_timer()", as far
as I can tell. Even back when that non-bound CPU was indicated by a
negative CPU number, and the code did

                if (unlikely(cpu >= 0))
                        add_timer_on(timer, cpu);
                else
                        add_timer(timer);

(that's from 2007, btw).

So I really don't see your "guaranteed behavior" argument. It seems to
be downright pure bullshit. The lack of a specific CPU has _always_
(where "always" means "at least since 2007") meant "non-specific cpu",
rather than "local cpu".

If some particular interface ended up then actually using "local cpu"
instead, that was neither guaranteed nor implied - it was just a
random implementation detail, and shouldn't be something we guarantee
at all.

We strive to maintain the user-space ABI even in the face of
unintentional bugs and misfeatures. But we do *not* keep broken random
in-kernel interfaces alive. We fix breakage and clean up code rather
than say "some random in-kernel user expects broken behavior".

And it seems very clear that WORK_CPU_UNBOUND does *not* mean "local
cpu", and code that depends on it meaning local cpu is broken.

Now, there are reasons why the *implementation* might want to choose
the local cpu for things - avoiding waking up other cpu's
unnecessarily with cross-cpu calls etc - but at the same time I think
it's quite clear that mm/vmstat.c is simply broken in using a
non-bound interface and then expecting/assuming a particular CPU.

>> (That said, it's not obvious to me why we don't just specify the cpu
>> in the work structure itself, and just get rid of the "use different
>> functions to schedule the work" model. I think it's somewhat fragile
>> how you can end up using the same work in different "CPU boundedness"
>> models like this).
>
> Hmmm... you mean associating a delayed work item with the target
> pool_workqueue on queueing and sharing the queueing paths for both
> delayed and !delayed work items?

I wouldn't necessarily even go that far. That's more of an
implementation detail.

I just wonder if we perhaps should add the CPU boundedness to the init
stage, and hide it away in the work structure. So if you just want to
do work (delayed or not), you'd continue to do

        INIT_DELAYED_WORK(&work, ...);

        schedule_delayed_work(&work, delay);

but if you want a percpu thing, you'd do

        INIT_DELAYED_WORK_CPU(&work, cpu, ...);

        schedule_delayed_work(&work, delay);

rather than have that "..work_on(cpu, &work, )" interface. The actual
*implementation* could stay the same, we'd just hide the CPU bound in
the work struct and use it as an implicit argument.
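Purely as an illustration of that idea (the macro, struct, and field below do not exist; they are invented here to show the shape of the proposal), carrying the CPU boundedness in the work item might look like:

```c
/* Hypothetical sketch: record the target cpu at init time instead
 * of choosing it per call site. */
struct delayed_work_cpu_sketch {
	struct delayed_work dwork;
	int cpu;	/* WORK_CPU_UNBOUND or a specific cpu */
};

#define INIT_DELAYED_WORK_CPU(_w, _cpu, _fn) do {	\
	INIT_DELAYED_WORK(&(_w)->dwork, (_fn));		\
	(_w)->cpu = (_cpu);				\
} while (0)

/* schedule_delayed_work() would then read _w->cpu internally and
 * pick add_timer_on() vs add_timer() itself, so every call site
 * uses the same scheduling function. */
```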

But it's not a big deal. I'm just saying that the current interface
obviously allows that confusion about whether a work is percpu or not,
on a per-call-site basis.

I don't think it's normally a problem. But mm/vmstat.c clearly *is*
confused, and uses both "schedule_delayed_work_on()" and
"schedule_delayed_work()" for the same work.

                 Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-14 17:36       ` Linus Torvalds
@ 2015-10-14 17:57         ` Christoph Lameter
  -1 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2015-10-14 17:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Andrew Morton, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

On Wed, 14 Oct 2015, Linus Torvalds wrote:

> I don't think it's normally a problem. But mm/vmstat.c clearly *is*
> confused, and uses both "schedule_delayed_work_on()" and
> "schedule_delayed_work()" for the same work.

Well yes, the schedule_delayed_work_on() call is from another cpu and the
schedule_delayed_work() call from the same cpu. No confusion there.

vmstat_update() is run from the cpu where the diffs have to be updated and
if it needs to reschedule itself it relies on schedule_delayed_work() to
stay on the same cpu.

The vmstat_shepherd needs to start work items on remote cpus and therefore
uses xx_work_on().

And yes this relies on work items being executed on the same cpu unless
the queue is declared to be UNBOUND, which is not the case here.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-14 16:30   ` Linus Torvalds
@ 2015-10-14 18:03     ` Christoph Lameter
  -1 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2015-10-14 18:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Andrew Morton, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

On Wed, 14 Oct 2015, Linus Torvalds wrote:

> On Tue, Oct 13, 2015 at 2:49 PM, Tejun Heo <tj@kernel.org> wrote:
> >
> > Single patch to fix delayed work being queued on the wrong CPU.  This
> > has been broken forever (v2.6.31+) but obviously doesn't trigger in
> > most configurations.
>
> So why is this a bugfix? If cpu == WORK_CPU_UNBOUND, then things
> _shouldn't_ care which cpu it gets run on.

UNBOUND means not fixed to a processor. The system should have
freedom to schedule unbound work requests anywhere it wants. This is
something we also want for the NOHZ work in order to move things like
these workqueue items to processors that are not supposed to be low
latency.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-14 17:57         ` Christoph Lameter
@ 2015-10-14 18:37           ` Linus Torvalds
  -1 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2015-10-14 18:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Andrew Morton, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

On Wed, Oct 14, 2015 at 10:57 AM, Christoph Lameter <cl@linux.com> wrote:
>
> Well yes the schedule_delayed_work_on() call is from another cpu and the
> schedule_delayed_work() from the same. No confusion there.

So "schedule_delayed_work()" does *not* guarantee that the work will
run on the same CPU.

Yes, yes, it so _happens_ that "add_timer()" preferentially uses the
current CPU etc, so in practice it may have happened to work. But
there's absolutely zero reason to think it should always work that
way.

If you want the scheduled work to happen on a particular CPU, then you
should use "schedule_delayed_work_on()".  It shouldn't matter which CPU
you call it from.

At least that's how I think the rules should be. Very simple, very
clear: if you require a specific CPU, say so. Don't silently depend on
"in practice, lots of times we tend to use the local cpu".

                  Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-14 18:03     ` Christoph Lameter
@ 2015-10-14 18:40       ` Linus Torvalds
  -1 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2015-10-14 18:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Andrew Morton, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

On Wed, Oct 14, 2015 at 11:03 AM, Christoph Lameter <cl@linux.com> wrote:
> On Wed, 14 Oct 2015, Linus Torvalds wrote:
>>
>> So why is this a bugfix? If cpu == WORK_CPU_UNBOUND, then things
>> _shouldn't_ care which cpu it gets run on.
>
> UNBOUND means not fixed to a processor.

That's exactly what I'm saying.

And "schedule_delayed_work()" uses WORK_CPU_UNBOUND.

YOUR code assumes that means "local CPU".

And I say that's bogus.

In this email you seem to even agree that it's bogus, but then you
wrote another email saying that the code isn't confused, because it
uses "schedule_delayed_work()" on the CPU that it wants to run the
code on.

I'm saying that mm/vmstat.c should use "schedule_delayed_work_on()".

               Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-14 18:37           ` Linus Torvalds
@ 2015-10-14 18:58             ` Christoph Lameter
  -1 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2015-10-14 18:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Andrew Morton, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

On Wed, 14 Oct 2015, Linus Torvalds wrote:

> On Wed, Oct 14, 2015 at 10:57 AM, Christoph Lameter <cl@linux.com> wrote:
> >
> > Well yes the schedule_delayed_work_on() call is from another cpu and the
> > schedule_delayed_work() from the same. No confusion there.
>
> So "schedule_delayed_work()" does *not* guarantee that the work will
> run on the same CPU.

That is news to me. As far as I know, the only workqueue that is not
guaranteed to run on the same cpu is an unbound workqueue.

> If you want the scheduled work to happen on a particular CPU, then you
> should use "schedule_delayed_work_on()"  It shouldn't matter which CPU
> you call it from.

Ok, then let's audit the kernel for this if that assumption is no
longer true.

> At least that's how I think the rules should be. Very simple, very
> clear: if you require a specific CPU, say so. Don't silently depend on
> "in practice, lots of times we tend to use the local cpu".

As far as I can remember, this was guaranteed and not just practice.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-14 18:40       ` Linus Torvalds
@ 2015-10-14 18:59         ` Christoph Lameter
  -1 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2015-10-14 18:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Andrew Morton, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

On Wed, 14 Oct 2015, Linus Torvalds wrote:

> And "schedule_delayed_work()" uses WORK_CPU_UNBOUND.

Uhhh. Someone changed that?

> In this email you seem to even agree that its' bogus, but then you
> wrote another email saying that the code isn't confused, because it
> uses "schedule_delayed_work()" on the CPU that it wants to run the
> code on.
>
> I'm saying that mm/vmstat.c should use "schedule_delayed_work_on()".

Then that needs to be fixed. Could occur in more places.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-14 18:37           ` Linus Torvalds
@ 2015-10-14 19:01             ` Linus Torvalds
  -1 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2015-10-14 19:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Andrew Morton, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

On Wed, Oct 14, 2015 at 11:37 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Yes, yes, it so _happens_ that "add_timer()" preferentially uses the
> current CPU etc, so in practice it may have happened to work. But
> there's absolutely zero reason to think it should always work that
> way.

Side note: even in practice, I think things like off-lining CPU's etc
(which some mobile environments seem to do as a power saving thing)
can end up moving timers to other CPU's even if they originally got
added on a particular cpu.

So I really think that the whole "schedule_delayed_work() ends up
running on the CPU" has actually never "really" been true. It has at
most been a "most of the time" thing, making it hard to see the
problem in practice.

                Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-14 17:36       ` Linus Torvalds
@ 2015-10-14 19:02         ` Tejun Heo
  -1 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2015-10-14 19:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

Hello, Linus.

On Wed, Oct 14, 2015 at 10:36:50AM -0700, Linus Torvalds wrote:
> > For both delayed and !delayed work items on per-cpu workqueues,
> > queueing without specifying a specific CPU always meant queueing on
> > the local CPU.
> 
> That's just not true, as far as I can tell.
> 
> I went back in the history to 2012, and it does
> 
>           if (unlikely(cpu != WORK_CPU_UNBOUND))
>                   add_timer_on(timer, cpu);
>           else
>                   add_timer(timer);
>
> so this whole WORK_CPU_UNBOUND means "add_timer()" without specifying
> a CPU has been true for at least the last several years.
> 
> So the documentation, the code, and the name all agree:

But wasn't add_timer() always CPU-local at the time?  add_timer()
allowing cross-cpu migrations came way after that.

commit 597d0275736dad9c3bda6f0a00a1c477dc0f37b1
Author: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Date:   Thu Apr 16 12:13:26 2009 +0530

    timers: Framework for identifying pinned timers

> WORK_CPU_UNBOUND does *not* mean that it's guaranteed to be the local
> CPU. The documentation says "preferred", the code clearly doesn't
> specify the CPU, and the name says "not bound to a particular CPU".

The name and documentation were mostly me being too hopeful and then
being lazy after realizing it wouldn't work.

> > I wish this were an ad-hoc thing but this has been the guaranteed
> > behavior all along.  Switching the specific call-site to
> > queue_work_on() would still be a good idea tho.
> 
> I really don't see how you say that it has been guaranteed behavior all along.

The behavior was just like that from the beginning.  Workqueues were
always CPU-affine and timers too and we never properly distinguished
CPU affinity for the purpose of correctness from optimization.

> It clearly has not at all been guaranteed behavior. The fact that you
> had to change the code to do that should have made it clear.

It wasn't like that before the timer change.

> The code has *always* done that non-cpu-specific "add_timer()", as far
> as I can tell. Even back when that non-bound CPU was indicated by a
> negative CPU number, and the code did
> 
>                 if (unlikely(cpu >= 0))
>                         add_timer_on(timer, cpu);
>                 else
>                         add_timer(timer);
> 
> (that's from 2007, btw).

At the time, it was just an alternative way of writing:

		if (cpu < 0)
			cpu = raw_smp_processor_id();
		add_timer_on(timer, cpu);

> So I really don't see your "guaranteed behavior" argument. It seems to
> be downright pure bullshit. The lack of a specific CPU has _always_
> (where "always" means "at least since 2007") meant "non-specific cpu",
> rather than "local cpu".
>
> If some particular interface ended up then actually using "local cpu"
> instead, that was neither guaranteed nor implied - it was just a
> random implementation detail, and shouldn't be something we guarantee
> at all.
>
> We strive to maintain user-space ABI issues even in the face of
> unintentional bugs and misfeatures. But we do *not* keep broken random
> in-kernel interfaces alive. We fix breakage and clean up code rather
> than say "some random in-kernel user expects broken behavior".
> 
> And it seems very clear that WORK_CPU_UNBOUND does *not* mean "local
> cpu", and code that depends on it meaning local cpu is broken.
> 
> Now, there are reasons why the *implementation* might want to choose
> the local cpu for things - avoiding waking up other cpu's
> unnecessarily with cross-cpu calls etc - but at the same time I think
> it's quite clear that mm/vmstat.c is simply broken in using a
> non-bound interface and then expecting/assuming a particular CPU.

I don't think I'm confused about the timer behavior.  At any rate, for
!delayed work items, the failure to distinguish binding for
correctness from binding as an optimization has been there for quite a
while.  I'd be happy to ditch that, but I don't think it'd be a good
idea to do so at -rc5 w/o auditing or debug mechanisms to detect errors.

> >> (That said, it's not obvious to me why we don't just specify the cpu
> >> in the work structure itself, and just get rid of the "use different
> >> functions to schedule the work" model. I think it's somewhat fragile
> >> how you can end up using the same work in different "CPU boundedness"
> >> models like this).
> >
> > Hmmm... you mean associating a delayed work item with the target
> > pool_workqueue on queueing and sharing the queueing paths for both
> > delayed and !delayed work items?
> 
> I wouldn't necessarily even go that far. That's more of an
> implementation detail.
> 
> I just wonder if we perhaps should add the CPU boundedness to the init
> stage, and hide it away in the work structure. So if you just want to
> do work (delayed or not), you'd continue to do
> 
>         INIT_DELAYED_WORK(&work, ...);
> 
>         schedule_delayed_work(&work, delay);
> 
> but if you want a percpu thing, you'd do
> 
>         INIT_DELAYED_WORK_CPU(&work, cpu, ...);
> 
>         schedule_delayed_work(&work, delay);
> 
> rather than have that "..work_on(cpu, &work, )" interface. The actual
> *implementation* could stay the same, we'd just hide the CPU bound in
> the work struct and use it as an implicit argument.
> 
> But it's not a big deal. I'm just saying that the current interface
> obviously allows that confusion about whether a work is percpu or not,
> on a per-call-site basis.

I was thinking more along the lines of introducing
queue_work_on_local() and friends, but yeah, we definitely need to
explicitly distinguish affinity for correctness and for performance.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-14 18:59         ` Christoph Lameter
@ 2015-10-14 19:10           ` Linus Torvalds
  -1 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2015-10-14 19:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Andrew Morton, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

On Wed, Oct 14, 2015 at 11:59 AM, Christoph Lameter <cl@linux.com> wrote:
> On Wed, 14 Oct 2015, Linus Torvalds wrote:
>
>> And "schedule_delayed_work()" uses WORK_CPU_UNBOUND.
>
> Uhhh. Someone changed that?

It always did.  This is from 2007:

int fastcall schedule_delayed_work(struct delayed_work *dwork,
                                        unsigned long delay)
{
        timer_stats_timer_set_start_info(&dwork->timer);
        return queue_delayed_work(keventd_wq, dwork, delay);
}
...
int fastcall queue_delayed_work(struct workqueue_struct *wq,
                        struct delayed_work *dwork, unsigned long delay)
{
        timer_stats_timer_set_start_info(&dwork->timer);
        if (delay == 0)
                return queue_work(wq, &dwork->work);

        return queue_delayed_work_on(-1, wq, dwork, delay);
}
...
int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
                        struct delayed_work *dwork, unsigned long delay)
{
....
                timer->function = delayed_work_timer_fn;

                if (unlikely(cpu >= 0))
                        add_timer_on(timer, cpu);
                else
                        add_timer(timer);
}
...
void delayed_work_timer_fn(unsigned long __data)
{
        int cpu = smp_processor_id();
        ...
        __queue_work(per_cpu_ptr(wq->cpu_wq, cpu), &dwork->work);
}


so notice how it always just used "add_timer()", and then queued it on
whatever cpu workqueue the timer ran on.

Now, 99.9% of the time, the timer is just added to the current CPU
queues, so yes, in practice it ended up running on the same CPU almost
all the time. There are exceptions (timers can get moved around, and
active timers end up staying on the CPU they were scheduled on when
they get updated, rather than get moved to the current cpu), but they
are hard to hit.

But the code clearly didn't do that "same CPU" intentionally, and just
going by naming of things I would also say that it was never implied.

                    Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-14 19:02         ` Tejun Heo
@ 2015-10-14 19:16           ` Linus Torvalds
  -1 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2015-10-14 19:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

On Wed, Oct 14, 2015 at 12:02 PM, Tejun Heo <tj@kernel.org> wrote:
>
> But wasn't add_timer() always CPU-local at the time?  add_timer()
> allowing cross-cpu migrations came way after that.

add_timer() has actually never been "local CPU only" either.

What add_timer() does is to keep the timer on the old CPU timer wheel
if it was active, and if it wasn't, put it on the current CPU timer
wheel.

So again, by pure *accident*, if you don't end up ever modifying an
already-active timer, then yes, it ended up being local. But even
then, things like suspend/resume can move timers around, afaik, so
even then it has never been a real guarantee.

And I see absolutely no sign that the local cpu case has ever been intentional.

Now, obviously, that said there is obviously at least one case that
seems to have relied on it (ie the mm/vmstat.c case), but I think we
should just fix that.

If it turns out that there really are *lots* of cases where
"schedule_delayed_work()" expects the work to always run on the CPU
that it is scheduled on, then we should probably take your patch just
because it's too painful not to.

But I'd like to avoid that if possible.

                             Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-14 19:16           ` Linus Torvalds
@ 2015-10-14 19:38             ` Tejun Heo
  -1 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2015-10-14 19:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

Hello, Linus.

On Wed, Oct 14, 2015 at 12:16:48PM -0700, Linus Torvalds wrote:
> On Wed, Oct 14, 2015 at 12:02 PM, Tejun Heo <tj@kernel.org> wrote:
> >
> > But wasn't add_timer() always CPU-local at the time?  add_timer()
> > allowing cross-cpu migrations came way after that.
> 
> add_timer() has actually never been "local CPU only" either.
> 
> What add_timer() does is to keep the timer on the old CPU timer wheel
> if it was active, and if it wasn't, put it on the current CPU timer
> wheel.

Doesn't seem that way.  This is from 597d0275736d^ - right before
TIMER_NOT_PINNED is introduced.  add_timer() eventually calls into
__mod_timer().

static inline int
__mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
{
	...
	base = lock_timer_base(timer, &flags);
	...
	new_base = __get_cpu_var(tvec_bases);

	if (base != new_base) {
		...
		if (likely(base->running_timer != timer)) {
			/* See the comment in lock_timer_base() */
			timer_set_base(timer, NULL);
			spin_unlock(&base->lock);
			base = new_base;
			spin_lock(&base->lock);
			timer_set_base(timer, base);
		}
	}
	...
	internal_add_timer(base, timer);
	...
}

It looks like the timers for work items will be reliably queued on the
local CPU.

> So again, by pure *accident*, if you don't end up ever modifying an
> already-active timer, then yes, it ended up being local. But even
> then, things like suspend/resume can move timers around, afaik, so
> even then it has never been a real guarantee.
> 
> And I see absolutely no sign that the local cpu case has ever been intentional.

Heh, I don't think much in this area is intended.  It's mostly all
historical accidents and failures to get things cleaned up in time.

> Now, obviously, that said there is obviously at least one case that
> seems to have relied on it (ie the mm/vmstat.c case), but I think we
> should just fix that.
> 
> If it turns out that there really are *lots* of cases where
> "schedule_delayed_work()" expects the work to always run on the CPU
> that it is scheduled on, then we should probably take your patch just
> because it's too painful not to.
> 
> But I'd like to avoid that if possible.

So, the following two things bother me about this.

* Given that this is the first reported case of breakage, I don't
  think this is gonna cause lots of critical issues; however, the
  only thing this indicates is that there simply haven't been enough
  cases where timers actually migrate.  If we end up migrating timers
  more actively in the future, it's possible that we'll see more
  breakages which will likely be subtler.

* This makes queue_delayed_work() behave differently from queue_work()
  and when I checked years ago the local queueing guarantee was
  definitely being depended upon by some users.

I do want to get rid of the local queueing guarantee for all work
items.  That said, I don't think this is the right way to do it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-14 19:38             ` Tejun Heo
@ 2015-10-14 20:10               ` Linus Torvalds
  -1 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2015-10-14 20:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

On Wed, Oct 14, 2015 at 12:38 PM, Tejun Heo <tj@kernel.org> wrote:
>
> Doesn't seem that way.  This is from 597d0275736d^ - right before
> TIMER_NOT_PINNED is introduced.  add_timer() eventually calls into
> __mod_timer().
>
>                 if (likely(base->running_timer != timer)) {
>                         /* See the comment in lock_timer_base() */
>                         timer_set_base(timer, NULL);
>                         spin_unlock(&base->lock);
>                         base = new_base;
>                         spin_lock(&base->lock);
>                         timer_set_base(timer, base);
>
> It looks like the timers for work items will be reliably queued on the
> local CPU.

.. unless they are running on another cpu at the time, yes.

Also, the new base is not necessarily the current cpu base, although I
think the exceptions to that are pretty rare (ie you have to enable
timer migration etc).

Which I guess might not actually happen with workqueue timers due to
the extra workqueue locking thing, but I'm not sure. It's going to be
very unlikely, regardless, I agree.

> Heh, I don't think much in this area is intended.  It's mostly all
> historical accidents and failures to get things cleaned up in time.

No argument there.

> So, the following two things bother me about this.
>
> * Given that this is the first reported case of breakage, I don't
>   think this is gonna cause lots of critical issues; however, the
>   only thing this indicates is that there simply haven't been enough
>   cases where timers actually migrate.  If we end up migrating timers
>   more actively in the future, it's possible that we'll see more
>   breakages which will likely be subtler.

I agree that that's a real concern.

At the same time, some of the same issues that are pushing people to
move timers around (put idle cores to deeper sleeps etc) would also
argue for moving delayed work around to other cpus if possible.

So I agree that there is a push to make timer cpu targets more dynamic
in a way we historically didn't really have. At the same time, I think
the same forces that want to move timers around would actually likely
want to move delayed work around too...

> * This makes queue_delayed_work() behave differently from queue_work()
>   and when I checked years ago the local queueing guarantee was
>   definitely being depended upon by some users.

Yes. But the delayed work really is different. By definition, we know
that the current cpu is busy and active _right_now_, and so keeping
work on that cpu isn't obviously wrong.

But it's *not* obviously right to schedule something on that
particular cpu a few seconds from now, when it might be happily asleep
and there might be better cpus to bother..

> I do want to get rid of the local queueing guarantee for all work
> items.  That said, I don't think this is the right way to do it.

Hmm. I guess that for being past rc5, taking your patch is the safe
thing. I really don't like it very much, though.

                    Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-14 20:10               ` Linus Torvalds
@ 2015-10-14 20:24                 ` Tejun Heo
  -1 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2015-10-14 20:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

Hello,

On Wed, Oct 14, 2015 at 01:10:33PM -0700, Linus Torvalds wrote:
> At the same time, some of the same issues that are pushing people to
> move timers around (put idle cores to deeper sleeps etc) would also
> argue for moving delayed work around to other cpus if possible.
> 
> So I agree that there is a push to make timer cpu targets more dynamic
> in a way we historically didn't really have. At the same time, I think
> the same forces that want to move timers around would actually likely
> want to move delayed work around too...

I fully agree.  We gotta get this in order sooner or later.  I'll try
to come up with a transition plan.

> > * This makes queue_delayed_work() behave differently from queue_work()
> >   and when I checked years ago the local queueing guarantee was
> >   definitely being depended upon by some users.
> 
> Yes. But the delayed work really is different. By definition, we know
> that the current cpu is busy and active _right_now_, and so keeping
> work on that cpu isn't obviously wrong.
> 
> But it's *not* obviously right to schedule something on that
> particular cpu a few seconds from now, when it might be happily asleep
> and there might be better cpus to bother..

But in terms of API consistency, it sucks to have queue_work()
guarantee local queueing but not queue_delayed_work().  The ideal
situation would be updating both so that neither guarantees.  If that
turns out to be too painful, maybe we can rename queue_delayed_work()
so that it signifies its difference from queue_work().  Let's see.

> > I do want to get rid of the local queueing guarantee for all work
> > items.  That said, I don't think this is the right way to do it.
> 
> Hmm. I guess that for being past rc5, taking your patch is the safe
> thing. I really don't like it very much, though.

Heh, yeah, I pondered calling it a happy accident and just
sticking with the new behavior.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH] vmstat_update: ensure work remains on the same core
  2015-10-14 20:10               ` Linus Torvalds
@ 2015-10-16 19:51                 ` Chris Metcalf
  -1 siblings, 0 replies; 39+ messages in thread
From: Chris Metcalf @ 2015-10-16 19:51 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm,
	Tejun Heo, Linus Torvalds
  Cc: Chris Metcalf

By using schedule_delayed_work(), we are preferring the local
core for the work, but not requiring it.  In my task
isolation experiments, I saw a nohz_full core's vmstat_update
end up running on a housekeeping core, and when the two works
ran back-to-back, we triggered the VM_BUG_ON() at the
end of the function.

Switch to using schedule_delayed_work_on(smp_processor_id(), ...).

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
---
This change that I made a few days ago in my local tree is
particularly amusing given that the thread I am appending this
email to ("workqueue fixes for v4.3-rc5") also fixes the symptoms
of the bug I saw, but I wasn't aware of it until just now.  And it
took a while for me to track it down!  I think this is probably a
"belt and suspenders" kind of issue where it makes sense to fix it
on both sides of the API, however.

 mm/vmstat.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index cf7d324f16e2..5c6bd7e5db07 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1369,7 +1369,8 @@ static void vmstat_update(struct work_struct *w)
 		 * to occur in the future. Keep on running the
 		 * update worker thread.
 		 */
-		schedule_delayed_work(this_cpu_ptr(&vmstat_work),
+		schedule_delayed_work_on(smp_processor_id(),
+			this_cpu_ptr(&vmstat_work),
 			round_jiffies_relative(sysctl_stat_interval));
 	else {
 		/*
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH] vmstat_update: ensure work remains on the same core
  2015-10-16 19:51                 ` Chris Metcalf
@ 2015-10-16 19:54                   ` Linus Torvalds
  -1 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2015-10-16 19:54 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Andrew Morton, Christoph Lameter, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm,
	Tejun Heo

On Fri, Oct 16, 2015 at 12:51 PM, Chris Metcalf <cmetcalf@ezchip.com> wrote:
> By using schedule_delayed_work(), we are preferring the local
> core for the work, but not requiring it.

Heh. See commit 176bed1de5bf.

I added curly braces, because I hate the "half-bracing" that the code
had (ie a "else" statement with curly braces on one side but not the
other), but otherwise identical patches, I think.

               Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [GIT PULL] workqueue fixes for v4.3-rc5
  2015-10-14 20:24                 ` Tejun Heo
@ 2015-10-19  3:51                   ` Mike Galbraith
  -1 siblings, 0 replies; 39+ messages in thread
From: Mike Galbraith @ 2015-10-19  3:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Andrew Morton, Christoph Lameter, Michal Hocko,
	Linux Kernel Mailing List, Lai Jiangshan, Shaohua Li, linux-mm

On Wed, 2015-10-14 at 16:24 -0400, Tejun Heo wrote:

> But in terms of API consistency, it sucks to have queue_work()
> guarantee local queueing but not queue_delayed_work().  The ideal
> situation would be updating both so that neither guarantees.

You don't have to change anything for neither to guarantee local
queueing.  Called from a preemptible context, local means any CPU in
->cpus_allowed... which makes WORK_CPU_UNBOUND mean what one would
imagine WORK_CPU_UNBOUND to mean, not bound to any particular cpu.

      sh-16017   3.N.. 1510500545us : queue_work_on: golly, migrated cpu7 -> cpu3 -- target cpu8
      sh-16017   3.N.. 1510500550us : <stack trace>
 => tty_flip_buffer_push
 => pty_write
 => n_tty_write
 => tty_write
 => __vfs_write
 => vfs_write
 => SyS_write
 => entry_SYSCALL_64_fastpath

That was with a udelay(100) prior to disabling interrupts, but that just
makes it easier.

	-Mike


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread

Thread overview: 39+ messages
2015-10-13 21:49 [GIT PULL] workqueue fixes for v4.3-rc5 Tejun Heo
2015-10-14 16:30 ` Linus Torvalds
2015-10-14 16:57   ` Tejun Heo
2015-10-14 17:36     ` Linus Torvalds
2015-10-14 17:57       ` Christoph Lameter
2015-10-14 18:37         ` Linus Torvalds
2015-10-14 18:58           ` Christoph Lameter
2015-10-14 19:01           ` Linus Torvalds
2015-10-14 19:02       ` Tejun Heo
2015-10-14 19:16         ` Linus Torvalds
2015-10-14 19:38           ` Tejun Heo
2015-10-14 20:10             ` Linus Torvalds
2015-10-14 20:24               ` Tejun Heo
2015-10-19  3:51                 ` Mike Galbraith
2015-10-16 19:51               ` [PATCH] vmstat_update: ensure work remains on the same core Chris Metcalf
2015-10-16 19:54                 ` Linus Torvalds
2015-10-14 18:03   ` [GIT PULL] workqueue fixes for v4.3-rc5 Christoph Lameter
2015-10-14 18:40     ` Linus Torvalds
2015-10-14 18:59       ` Christoph Lameter
2015-10-14 19:10         ` Linus Torvalds