linux-kernel.vger.kernel.org archive mirror
* Possible lock-less list race in scheduler_ipi()
       [not found] <995381344.227770.1425597484864.JavaMail.zimbra@efficios.com>
@ 2015-03-05 23:48 ` Mathieu Desnoyers
  2015-03-06  1:02   ` Linus Torvalds
  2015-03-06 17:09   ` Paul E. McKenney
  0 siblings, 2 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2015-03-05 23:48 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Huang Ying, Lai Jiangshan, Lai Jiangshan, Peter Zijlstra, LKML,
	Ingo Molnar, Steven Rostedt, Linus Torvalds

Hi,

I recently reviewed the scheduler_ipi() code by
coincidence, and noticed that the use of llist.h
in there seemed a bit odd. So I'd like to have
more eyes to help me look into this.

I've been told that there have been issues with
IPIs lately, so here is one scenario
I think might be possible:

- scheduler_ipi()
  - sched_ttwu_pending()
    - llist_del_all()
      (grabs the wake_list with xchg())
    - raw_spin_lock_irqsave(&rq->lock, flags);
    - iteration on the llist (owned locally by interrupt handler)
      (for each item)
        p = llist_entry(llist, struct task_struct, wake_entry);
        llist = llist_next(llist);
        ttwu_do_activate(rq, p, 0);
    - raw_spin_unlock_irqrestore(&rq->lock, flags);

ttwu_do_activate() ends up calling
ttwu_activate() and ttwu_do_wakeup().

I'm worried that ttwu_do_wakeup() might end up
indirectly handing off the task to the following
functions (within another execution context):

- try_to_wake_up()
  - ttwu_queue()
    - ttwu_queue_remote adds the process to another
      wake_list with a cmpxchg()
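For reference, the push/grab pattern described above can be modelled with C11 atomics. This is a simplified sketch of the llist idea, not the kernel's llist.h itself (the names and types here are illustrative):

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	struct node *next;
};

struct llist_head {
	_Atomic(struct node *) first;
};

/* Producer side, roughly what ttwu_queue_remote() does with
 * cmpxchg(): push one node onto the list head. */
static void llist_add(struct llist_head *h, struct node *n)
{
	struct node *first = atomic_load(&h->first);

	do {
		n->next = first;
	} while (!atomic_compare_exchange_weak(&h->first, &first, n));
}

/* Consumer side, roughly what llist_del_all() does with xchg():
 * atomically detach the whole chain, leaving the list empty. */
static struct node *llist_del_all(struct llist_head *h)
{
	return atomic_exchange(&h->first, NULL);
}
```

Once llist_del_all() returns, the detached chain is owned exclusively by the caller, so walking it needs no atomics; the question raised here is only whether the plain node->next load during that walk can be miscompiled.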

llist_next() is pretty simple:

static inline struct llist_node *llist_next(struct llist_node *node)
{
        return node->next;
}

It is so simple that I wonder if the compiler would be
within its rights to reorder the load of node->next
after some operations within ttwu_do_activate(), thus
causing corruption of this linked-list due to a
concurrent try_to_wake_up() performed by another core.
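If this were a real concern, the usual way to pin the load down is a volatile access, as the kernel's ACCESS_ONCE()/READ_ONCE() macros do. A minimal sketch (my own illustration, not a proposed patch):

```c
#include <stddef.h>

struct llist_node {
	struct llist_node *next;
};

/* Simplified version of the kernel's READ_ONCE(): the volatile cast
 * forces the compiler to emit exactly one load, at exactly this
 * point in the program, instead of caching or re-reading the value. */
#define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static inline struct llist_node *llist_next_once(struct llist_node *node)
{
	/* Load node->next once, before the node may be handed off
	 * to another execution context. */
	return READ_ONCE(node->next);
}
```

Note that this constrains only the compiler; it adds no CPU memory barrier, which matches the scope of the concern above.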

Am I too paranoid about the possible compiler mishaps
there, or are my concerns justified?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Possible lock-less list race in scheduler_ipi()
  2015-03-05 23:48 ` Possible lock-less list race in scheduler_ipi() Mathieu Desnoyers
@ 2015-03-06  1:02   ` Linus Torvalds
  2015-03-06 14:35     ` Mathieu Desnoyers
  2015-03-06 17:09   ` Paul E. McKenney
  1 sibling, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2015-03-06  1:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Huang Ying, Lai Jiangshan, Lai Jiangshan,
	Peter Zijlstra, LKML, Ingo Molnar, Steven Rostedt

On Thu, Mar 5, 2015 at 3:48 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> llist_next() is pretty simple:
>
> static inline struct llist_node *llist_next(struct llist_node *node)
> {
>         return node->next;
> }
>
> It is so simple that I wonder if the compiler would be
> within its rights to reorder the load of node->next
> after some operations within ttwu_do_activate(), thus
> causing corruption of this linked-list due to a
> concurrent try_to_wake_up() performed by another core.
>
> Am I too paranoid about the possible compiler mishaps
> there, or are my concerns justified ?

I *think* you are too paranoid, because that would be a major compiler
bug anyway - gcc cannot reorder the load against anything that might
be changing the value.  Which obviously includes calling non-inlined
functions.

At least the code generation I see doesn't seem to say that gcc gets this wrong:

        ...
        leaq    -32(%rbx), %rsi #, p
        movq    (%rbx), %rbx    # MEM[(struct llist_node *)__mptr_19].next, __mptr
        movq    %r12, %rdi      # tcp_ptr__,
        call    ttwu_do_activate.constprop.85   #
        ...

that "movq (%rbx), %rbx" is the "llist = llist_next(llist);" thing.

                            Linus


* Re: Possible lock-less list race in scheduler_ipi()
  2015-03-06  1:02   ` Linus Torvalds
@ 2015-03-06 14:35     ` Mathieu Desnoyers
  2015-03-06 19:03       ` Steven Rostedt
  0 siblings, 1 reply; 9+ messages in thread
From: Mathieu Desnoyers @ 2015-03-06 14:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul E. McKenney, Huang Ying, Lai Jiangshan, Lai Jiangshan,
	Peter Zijlstra, LKML, Ingo Molnar, Steven Rostedt

----- Original Message -----
> From: "Linus Torvalds" <torvalds@linux-foundation.org>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Huang Ying" <ying.huang@intel.com>, "Lai Jiangshan"
> <laijs@cn.fujitsu.com>, "Lai Jiangshan" <eag0628@gmail.com>, "Peter Zijlstra" <peterz@infradead.org>, "LKML"
> <linux-kernel@vger.kernel.org>, "Ingo Molnar" <mingo@kernel.org>, "Steven Rostedt" <rostedt@goodmis.org>
> Sent: Thursday, March 5, 2015 8:02:06 PM
> Subject: Re: Possible lock-less list race in scheduler_ipi()
> 
> On Thu, Mar 5, 2015 at 3:48 PM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> > llist_next() is pretty simple:
> >
> > static inline struct llist_node *llist_next(struct llist_node *node)
> > {
> >         return node->next;
> > }
> >
> > It is so simple that I wonder if the compiler would be
> > within its rights to reorder the load of node->next
> > after some operations within ttwu_do_activate(), thus
> > causing corruption of this linked-list due to a
> > concurrent try_to_wake_up() performed by another core.
> >
> > Am I too paranoid about the possible compiler mishaps
> > there, or are my concerns justified ?
> 
> I *think* you are too paranoid, because that would be a major compiler
> bug anyway - gcc cannot reorder the load against anything that might
> be changing the value.  Which obviously includes calling non-inlined
> functions.
> 
> At least the code generation I see doesn't seem to say that gcc gets this
> wrong:
> 
>         ...
>         leaq    -32(%rbx), %rsi #, p
>         movq    (%rbx), %rbx    # MEM[(struct llist_node *)__mptr_19].next, __mptr
>         movq    %r12, %rdi      # tcp_ptr__,
>         call    ttwu_do_activate.constprop.85   #
>         ...
> 
> that "movq (%rbx), %rbx" is the "llist = llist_next(llist);" thing.

Indeed, the compiler should never reorder loads/stores from/to the
same memory location from a program-order point of view. What I had
in mind is a bit more far-fetched, though: it would involve the
compiler reordering this load after a store to another memory
location, which would in turn allow another execution context
(interrupt or thread) to corrupt the list.

The assembly snippet you show above appears to be OK. However, another
compiler may choose to inline ttwu_do_activate, leaving room for
more aggressive optimizations.

But I agree that it is rather far-fetched.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: Possible lock-less list race in scheduler_ipi()
  2015-03-05 23:48 ` Possible lock-less list race in scheduler_ipi() Mathieu Desnoyers
  2015-03-06  1:02   ` Linus Torvalds
@ 2015-03-06 17:09   ` Paul E. McKenney
  1 sibling, 0 replies; 9+ messages in thread
From: Paul E. McKenney @ 2015-03-06 17:09 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Huang Ying, Lai Jiangshan, Lai Jiangshan, Peter Zijlstra, LKML,
	Ingo Molnar, Steven Rostedt, Linus Torvalds

On Thu, Mar 05, 2015 at 11:48:51PM +0000, Mathieu Desnoyers wrote:
> Hi,
> 
> I recently reviewed the scheduler_ipi() code by
> coincidence, and noticed that the use of llist.h
> in there seemed a bit odd. So I'd like to have
> more eyes to help me look into this.
> 
> I've been told that there have been issues with
> IPIs lately, so here is one scenario
> I think might be possible:
> 
> - scheduler_ipi()
>   - sched_ttwu_pending()
>     - llist_del_all()
>       (grabs the wake_list with xchg())
>     - raw_spin_lock_irqsave(&rq->lock, flags);
>     - iteration on the llist (owned locally by interrupt handler)
>       (for each item)
>         p = llist_entry(llist, struct task_struct, wake_entry);
>         llist = llist_next(llist);
>         ttwu_do_activate(rq, p, 0);
>     - raw_spin_unlock_irqrestore(&rq->lock, flags);
> 
> ttwu_do_activate() ends up calling
> ttwu_activate() and ttwu_do_wakeup().
> 
> I'm worried that ttwu_do_wakeup() might end up
> indirectly handing off the task to the following
> functions (within another execution context):
> 
> - try_to_wake_up()
>   - ttwu_queue()
>     - ttwu_queue_remote adds the process to another
>       wake_list with a cmpxchg()
> 
> llist_next() is pretty simple:
> 
> static inline struct llist_node *llist_next(struct llist_node *node)
> {
>         return node->next;
> }
> 
> It is so simple that I wonder if the compiler would be
> within its rights to reorder the load of node->next
> after some operations within ttwu_do_activate(), thus
> causing corruption of this linked-list due to a
> concurrent try_to_wake_up() performed by another core.

Well, it clearly cannot leak this load out of the critical section.
That said, there is nothing in llist_next() that tells the compiler
that it needs to be careful with this load.

And all three functions, sched_ttwu_pending(), ttwu_do_activate(),
and llist_next(), are in the same compilation unit.  The compiler
therefore has full knowledge, and full ability to rearrange.  It can
slide this load down past the CONFIG_SMP #ifdef in ttwu_do_activate()
and into ttwu_activate().  However, it cannot push it past the call to
sched_clock_cpu(), which is defined in the separate compilation unit
kernel/sched/clock.c.  And sched_clock_cpu() is invoked from
update_rq_clock(), which is invoked from enqueue_task(), which is invoked
from activate_task(),  which is invoked from ttwu_activate().
And even if sched_clock_cpu() were in the same compilation unit, its
preempt_disable_notrace() keeps the compiler from reordering.  At least
it does if the compiler cannot prove that sched_clock_cpu() takes one
of two early exits.  ;-)

So I am not seeing anything important that the compiler could reorder
this past.  The CPU might do additional reordering, of course.

> Am I too paranoid about the possible compiler mishaps
> there, or are my concerns justified ?

Now try_to_wake_up() does acquire p->pi_lock, but that of course does
not exclude sched_ttwu_pending(), which instead acquires rq->lock.

And in a virtualized environment, sched_ttwu_pending() can be preempted
and delayed pretty much arbitrarily pretty much anywhere, even when its
interrupts are disabled.

So let's look at the llist code.  One question is of course "How does
llist_add_batch() avoid the ABA problem?"  In this case, the answer
appears to be that llist_del_all() is used, and never llist_del_first().
(Mixing use of llist_del_all() and llist_del_first() by multiple
threads could cause this ABA problem, even if only one thread at a time
used llist_del_first().)
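For readers unfamiliar with it, the ABA hazard can be sketched with a hand-rolled llist_del_first() (my own illustrative model, not the kernel's llist.c):

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	struct node *next;
};

/*
 * Pop the first node with a compare-and-swap on the head pointer.
 * The CAS compares only the head value, not the rest of the list.
 */
static struct node *llist_del_first(_Atomic(struct node *) *head)
{
	struct node *first = atomic_load(head);

	while (first) {
		struct node *next = first->next;  /* read BEFORE the CAS */

		if (atomic_compare_exchange_weak(head, &first, next))
			break;
		/* CAS failure reloaded 'first'; retry with the new head. */
	}
	return first;
}

/*
 * ABA: thread T1 reads first = A and next = B, then is delayed.
 * Thread T2 pops A and B, then pushes A back: the head is A again,
 * but A->next is no longer B.  T1's CAS still sees head == A, so it
 * succeeds and installs the stale 'next' = B, resurrecting a node
 * that T2 now owns and corrupting the list.  Using only
 * llist_del_all() (a plain xchg of the entire chain) avoids this,
 * since no stale per-node 'next' value is ever written back to the
 * shared head.
 */
```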

The code does appear to assume that if a given task is on the local
llist, that it cannot be awakened any other way without acquiring the
same runqueue lock that sched_ttwu_pending() holds.  If this rule were
violated, then of course the list could be corrupted.  However, the
only accesses to the list appear to be sched_ttwu_pending()'s
llist_del_all(), ttwu_queue_remote()'s llist_add(), and a couple
of llist_empty() calls.  This combination appears to me to be safe.

The scheduler_ipi() code assumes that an IPI cannot outrun the data it
announces.  If this assumption were violated, scheduler_ipi() might see
an empty ->wake_list even though the sender of the IPI just filled it.
The last time I worked with a machine that violated this assumption
was about twenty years ago, and I very much hope that such hardware
designs have not come back from the dead.  But this would cause a
hang, not a corrupt list.

So, does this trigger any helpful thoughts?

							Thanx, Paul



* Re: Possible lock-less list race in scheduler_ipi()
  2015-03-06 14:35     ` Mathieu Desnoyers
@ 2015-03-06 19:03       ` Steven Rostedt
  2015-03-06 19:39         ` Mathieu Desnoyers
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2015-03-06 19:03 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Paul E. McKenney, Huang Ying, Lai Jiangshan,
	Lai Jiangshan, Peter Zijlstra, LKML, Ingo Molnar

On Fri, 6 Mar 2015 14:35:23 +0000 (UTC)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:


> Indeed, the compiler should never reorder loads/stores from/to
> same memory location from a program order POV. What I had in mind
> is a bit more far-fetched though: it would involve having the compiler
> reorder this load after a store to another memory location, which
> would in turn allow another execution context (interrupt or thread)
> to corrupt the list.

You mean on another CPU? Because the code you are worried about has
interrupts disabled.

-- Steve


* Re: Possible lock-less list race in scheduler_ipi()
  2015-03-06 19:03       ` Steven Rostedt
@ 2015-03-06 19:39         ` Mathieu Desnoyers
  2015-03-06 20:38           ` Steven Rostedt
  0 siblings, 1 reply; 9+ messages in thread
From: Mathieu Desnoyers @ 2015-03-06 19:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Paul E. McKenney, Huang Ying, Lai Jiangshan,
	Lai Jiangshan, Peter Zijlstra, LKML, Ingo Molnar

----- Original Message -----
> From: "Steven Rostedt" <rostedt@goodmis.org>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Huang Ying"
> <ying.huang@intel.com>, "Lai Jiangshan" <laijs@cn.fujitsu.com>, "Lai Jiangshan" <eag0628@gmail.com>, "Peter
> Zijlstra" <peterz@infradead.org>, "LKML" <linux-kernel@vger.kernel.org>, "Ingo Molnar" <mingo@kernel.org>
> Sent: Friday, March 6, 2015 2:03:47 PM
> Subject: Re: Possible lock-less list race in scheduler_ipi()
> 
> On Fri, 6 Mar 2015 14:35:23 +0000 (UTC)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> 
> > Indeed, the compiler should never reorder loads/stores from/to
> > same memory location from a program order POV. What I had in mind
> > is a bit more far-fetched though: it would involve having the compiler
> > reorder this load after a store to another memory location, which
> > would in turn allow another execution context (interrupt or thread)
> > to corrupt the list.
> 
> You mean on another CPU? Because the code you are worried about has
> interrupts disabled.

I'm worried that another CPU could issue try_to_wake_up() on a
task concurrently with the llist iteration within sched_ttwu_pending().

AFAIU, ttwu_queue_remote() is called from ttwu_queue() without holding
the rq lock. So I'm wondering what prevents corruption of the wake_list
in this situation.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: Possible lock-less list race in scheduler_ipi()
  2015-03-06 19:39         ` Mathieu Desnoyers
@ 2015-03-06 20:38           ` Steven Rostedt
  2015-03-06 20:39             ` Steven Rostedt
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2015-03-06 20:38 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Paul E. McKenney, Huang Ying, Lai Jiangshan,
	Lai Jiangshan, Peter Zijlstra, LKML, Ingo Molnar

On Fri, 6 Mar 2015 19:39:44 +0000 (UTC)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> I'm worried that another CPU could issue try_to_wake_up() on a
> task concurrently with the llist iteration within sched_ttwu_pending().
> 
> AFAIU, ttwu_queue_remote() is called from ttwu_queue() without holding
> the rq lock. So I'm wondering what prevents corruption of the wake_list
> in this situation.

I guess if it is on the wake_list, then the task's state is already
RUNNING. Any other task can switch a task's state to RUNNING, but only
the task itself can switch it back to something else. If the task is on
the wake_list, its state is already RUNNING, but it has not run yet.
That means any other wakeup will jump to the "goto out" and skip over
the ttwu_queue() call.

-- Steve


* Re: Possible lock-less list race in scheduler_ipi()
  2015-03-06 20:38           ` Steven Rostedt
@ 2015-03-06 20:39             ` Steven Rostedt
  2015-03-06 22:07               ` Mathieu Desnoyers
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2015-03-06 20:39 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Paul E. McKenney, Huang Ying, Lai Jiangshan,
	Lai Jiangshan, Peter Zijlstra, LKML, Ingo Molnar

On Fri, 6 Mar 2015 15:38:21 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 6 Mar 2015 19:39:44 +0000 (UTC)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> > I'm worried that another CPU could issue try_to_wake_up() on a
> > task concurrently with the llist iteration within sched_ttwu_pending().
> > 
> > AFAIU, ttwu_queue_remote() is called from ttwu_queue() without holding
> > the rq lock. So I'm wondering what prevents corruption of the wake_list
> > in this situation.
> 
> I guess if it is on the wake_list, then the task's state is already
> RUNNING. Any other task can switch a task's state to RUNNING, but only
> the task itself can switch it back to something else. If the task is on
> the wake_list, its state is already RUNNING, but it has not run yet.
> That means any other wakeup will jump to the "goto out" and skip over
> the ttwu_queue() call.

If my assumption is indeed the case, then these types of subtleties
really need comments in the code.

-- Steve


* Re: Possible lock-less list race in scheduler_ipi()
  2015-03-06 20:39             ` Steven Rostedt
@ 2015-03-06 22:07               ` Mathieu Desnoyers
  0 siblings, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2015-03-06 22:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Paul E. McKenney, Huang Ying, Lai Jiangshan,
	Lai Jiangshan, Peter Zijlstra, LKML, Ingo Molnar

----- Original Message -----
> From: "Steven Rostedt" <rostedt@goodmis.org>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: "Linus Torvalds" <torvalds@linux-foundation.org>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Huang Ying"
> <ying.huang@intel.com>, "Lai Jiangshan" <laijs@cn.fujitsu.com>, "Lai Jiangshan" <eag0628@gmail.com>, "Peter
> Zijlstra" <peterz@infradead.org>, "LKML" <linux-kernel@vger.kernel.org>, "Ingo Molnar" <mingo@kernel.org>
> Sent: Friday, March 6, 2015 3:39:25 PM
> Subject: Re: Possible lock-less list race in scheduler_ipi()
> 
> On Fri, 6 Mar 2015 15:38:21 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Fri, 6 Mar 2015 19:39:44 +0000 (UTC)
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > 
> > > I'm worried that another CPU could issue try_to_wake_up() on a
> > > task concurrently with the llist iteration within sched_ttwu_pending().
> > > 
> > > AFAIU, ttwu_queue_remote() is called from ttwu_queue() without holding
> > > the rq lock. So I'm wondering what prevents corruption of the wake_list
> > > in this situation.
> > 
> > I guess if it is on the wake_list, then the task's state is already
> > RUNNING. Any other task can switch a task's state to RUNNING, but only
> > the task itself can switch it back to something else. If the task is on
> > the wake_list, its state is already RUNNING, but it has not run yet.
> > That means any other wakeup will jump to the "goto out" and skip over
> > the ttwu_queue() call.
> 
> If my assumption is indeed the case, then these types of subtleties
> really need comments in the code.

My understanding is that try_to_wake_up(), by calling ttwu_queue(),
is responsible for enqueuing the task on the wake_list. Inspection of
try_to_wake_up() shows that it sets the task's state to TASK_WAKING
before queuing it.

Then, when dequeuing the task from the llist, ttwu_do_wakeup() sets
the task state to TASK_RUNNING.

Neither TASK_WAKING nor TASK_RUNNING matches the wakeup mask, so the
try_to_wake_up() check if (!(p->state & state)), typically done
against TASK_NORMAL, bails out and skips ttwu_queue() for that task
until its state is set back to TASK_INTERRUPTIBLE or
TASK_UNINTERRUPTIBLE.

So there should not be any double enqueue, AFAIU.
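The state handshake above can be modelled in a few lines. This is a toy model with made-up constants and helper names, not the scheduler's code (the real task state values and call paths differ):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TASK_RUNNING		0x0000	/* illustrative values */
#define TASK_INTERRUPTIBLE	0x0001
#define TASK_UNINTERRUPTIBLE	0x0002
#define TASK_WAKING		0x0004
#define TASK_NORMAL		(TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

struct task {
	_Atomic int state;
	_Atomic int enqueued;	/* stand-in for being on a wake_list */
};

/* Toy try_to_wake_up(): only a task whose state matches the mask is
 * switched to TASK_WAKING and queued.  A task already TASK_WAKING or
 * TASK_RUNNING fails the mask test, so it cannot be queued twice. */
static bool toy_try_to_wake_up(struct task *p, int state_mask)
{
	if (!(atomic_load(&p->state) & state_mask))
		return false;			/* the "goto out" path */

	atomic_store(&p->state, TASK_WAKING);
	atomic_fetch_add(&p->enqueued, 1);	/* ttwu_queue() stand-in */
	return true;
}

/* Toy dequeue side: ttwu_do_wakeup() marks the task TASK_RUNNING. */
static void toy_ttwu_do_wakeup(struct task *p)
{
	atomic_fetch_sub(&p->enqueued, 1);
	atomic_store(&p->state, TASK_RUNNING);
}
```

A second wakeup issued between the queue and the dequeue sees TASK_WAKING, fails the mask test, and returns without touching the wake_list, which is the no-double-enqueue property described above.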

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

