linux-kernel.vger.kernel.org archive mirror
* Re: Patch for lost wakeups
       [not found] <tencent_26310211398C21034BD3B2F9@qq.com>
@ 2013-08-08 18:19 ` Linus Torvalds
  2013-08-08 19:17   ` Oleg Nesterov
                     ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Linus Torvalds @ 2013-08-08 18:19 UTC (permalink / raw)
  To: Long Gao, Oleg Nesterov; +Cc: Al Viro, Andrew Morton, Linux Kernel Mailing List

[ Adding proper people, and the kernel mailing list ]

The patch is definitely incorrect, but the bug is interesting, so I'm
cc'ing more people in case anybody else has any input on this.

The reason I say that the patch is incorrect is because
"legacy_queue()" doesn't actually *do* anything to the signal state,
it just checks if a legacy signal is already queued, in which case we
return and do nothing.
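
For reference, legacy_queue() at the time is roughly the following
read-only check (abridged from kernel/signal.c of that era; details may
differ between versions):

        static inline int legacy_queue(struct sigpending *signals, int sig)
        {
                return (sig < SIGRTMIN) && sigismember(&signals->signal, sig);
        }

        /* and, abridged, its call site in __send_signal(): */
        if (legacy_queue(pending, sig))
                return 0;       /* complete_signal()/signal_wake_up() never run */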

As a result, doing a "recalc_sigpending_and_wake()" is definitely
incorrect, because sigpending state cannot actually have changed.

Now that said, something is definitely wrong, as shown by your
/proc/2597/status data:

> Name:    Xorg
> State:    S (sleeping)
> SigPnd:    00000000000000000000000000000000
> ShdPnd:    00000000000000000000000000200000
> SigBlk:    00000000000000000000000000000000
> SigIgn:    80000000000000000000000006001000
> SigCgt:    000000000000000000000001e020eecf

because Xorg shouldn't be sleeping when there is clearly a pending
signal, and yes, the "legacy_queue()" thing will mean that _new_
incoming signals will not wake it up, since it should already *be*
awake.

However, the fact that this happens on Loongson makes me suspect that
it's not a generic bug. It sounds like a race between Xorg going to
sleep, and getting a new signal. It could be related to subtle memory
ordering issues, though, and x86 tends to not show those as it's
pretty strictly ordered. So it _could_ be a generic bug that is just
triggered by your hardware. Thus the wider distribution in case
anybody else sees how we could get into this situation.

The particular memory barriers that should be relevant are:

 - recalc_sigpending setting the TIF_SIGPENDING flag ->
signal_wake_up() actually waking the task

 - somebody setting TASK_SLEEPING -> __schedule() testing the
signal_pending_state()

and as far as I can tell we have proper barriers for those (the
scheduler gets the rq lock and that sigpending test had better not
leak out of a spinlocked section, and try_to_wake_up() also ends up
having a spinlock between setting the TIF_SIGPENDING flag and testing
p->state)

That said, I'm a bit worried about the smp_wmb() and spinlock in
try_to_wake_up(). The signal delivery basically does

        set_tsk_thread_flag(t, TIF_SIGPENDING);
        if (!wake_up_state(t, state | TASK_INTERRUPTIBLE))

and on x86, the set_tsk_thread_flag() is a full memory barrier
(because it's a locked operation), but that's not necessarily true
elsewhere. And wake_up_state() (through that try_to_wake_up() logic)
does have a

        smp_wmb();
        raw_spin_lock_irqsave(&p->pi_lock, flags);
        if (!(p->state & state))

before it tests the task state. And the wmb() *together* with the
spinlock really should be a full memory barrier (nothing can get out
from the spinlock, and any writes before this had better be serialized
by the wmb and the write inherent in the spinlock itself). But this is
definitely some subtle stuff.

I wonder if set_tsk_thread_flag() should have a
smp_mb__after_clear_bit() after the set-bit (there is no
"smp_mb__after_set_bit", so we have to fake it). Just to make sure.

Does anybody see any situation that can cause this kind of "pending
signal, but sleeping process"? I *do* think it's triggered by hardware
issues, so I'd suggest the Loongson people look very hard at memory
barriers and cache coherency stuff, but let's bring other people in
just in case the generic code is fragile somewhere..

Whole email quoted below.

            Linus

On Thu, Aug 8, 2013 at 8:55 AM, Long Gao <gaolong@kylinos.com.cn> wrote:
>
>
>   Hi,
>       While debugging a recent kernel, I believe I have detected a "lost wakeup" in kernel signal delivery.
> I found that when the current process (Xorg) is sleeping and already has a pending non-real-time
> signal (SIGIO), the kernel may fail to wake it up, and the sleeping process
> never gets the chance to be woken. That is to say, by the time the kernel returns after
> legacy_queue(), the current process may already have a pending signal and be SLEEPING.
>
> Thus any subsequent instance of the same signal never gets the chance to wake the process
> (later identical signals return after legacy_queue() and never reach signal_wake_up() in
> complete_signal()). I observed this case once in 667238 deliveries of SIGIO, as
> /var/log/messages in the attachment records; it makes Xorg hang, and the mouse
> and keyboard die, until Xorg receives something other than SIGIO to wake it up.
>
> The patch is as follows; if the current process has a pending signal, try to wake it up immediately:
>
> --- linux-loongson-all/kernel/signal.c  2012-06-15 10:54:01.000000000 +0800
> +++ linux-loongson-all-signal/kernel/signal.c   2013-07-24 18:47:15.775415042 +0800
> @@ -900,8 +900,10 @@
>          * exactly one non-rt signal, so that we can get more
>          * detailed information about the cause of the signal.
>          */
> -       if (legacy_queue(pending, sig))
> +       if (legacy_queue(pending, sig)){
> +               recalc_sigpending_and_wake(t);
>                 return 0;
> +       }
>         /*
>          * fast-pathed signals for kernel-internal things like SIGSTOP
>          * or SIGKILL.
>
>
>  Every time Xorg hangs up, the status of Xorg reads as follows (cat /proc/2597/status):
>
> Name:    Xorg
> State:    S (sleeping)
> Tgid:    2597
> Pid:    2597
> PPid:    2595
> TracerPid:    0
> Uid:    0    0    0    0
> Gid:    0    0    0    0
> FDSize:    64
> Groups:
> VmPeak:       44640 kB
> VmSize:       31232 kB
> VmLck:           0 kB
> VmHWM:       20560 kB
> VmRSS:       20016 kB
> VmData:        5728 kB
> VmStk:         160 kB
> VmExe:        1952 kB
> VmLib:       11296 kB
> VmPTE:         128 kB
> VmSwap:           0 kB
> Threads:    1
> SigQ:    1/15809
> SigPnd:    00000000000000000000000000000000
> ShdPnd:    00000000000000000000000000200000
> SigBlk:    00000000000000000000000000000000
> SigIgn:    80000000000000000000000006001000
> SigCgt:    000000000000000000000001e020eecf
> CapInh:    0000000000000000
> CapPrm:    ffffffffffffffff
> CapEff:    ffffffffffffffff
> CapBnd:    ffffffffffffffff
> Cpus_allowed:    f
> Cpus_allowed_list:    0-3
> voluntary_ctxt_switches:    33327
> nonvoluntary_ctxt_switches:    1308959
>
>   We can see that the shared pending set (ShdPnd) has SIGIO (22, 0x200000) pending. At this
> moment, Xorg cannot be woken by any SIGIO (kill -s SIGIO xxxx never wakes Xorg,
> because the kernel returns after legacy_queue() and never reaches signal_wake_up()), but
> Xorg wakes up and returns to normal operation immediately when it receives another
> signal (kill -s SIGALRM xxxx). I conclude that the lost wakeup only happens when
> the process is sleeping and at the same time holds a pending signal.
>
>   I guess the kernel code might not be completely protected by siglock and was interrupted
> by incoming SIGIO signal handling; that is how a SLEEPING Xorg got a pending signal.
> It is hard to find where the bug is, but I thought I could easily break the deadlock by waking
> up the current sleeping process whenever I found such a situation. So I made the patch and
> TESTED it on the same machine for several weeks, and no lost wakeup
> occurred. Before patching, the same machine could hang roughly every hour with people
> continuously moving the mouse. I was using a MIPS-based Loongson CPU, and some other people
> I know have also reported similar hangups on PowerPC, SPARC, and x86, as reported in bug
> 60520,  https://bugzilla.kernel.org/show_bug.cgi?id=60520 .
>
>   Even on the MIPS-based Loongson, this bug occurs very rarely. I used
> another machine to detect occurrences of the bug's trigger condition. In about 83
> minutes, I logged 667238 deliveries of SIGIO, and only one of them satisfied the lost
> wakeup condition, i.e. Xorg sleeping while already holding a pending SIGIO.
> On x86 this might be even harder to observe, unless something else
> helps it along, for example the video card in bug 60520.
>
>   I would like your advice. Do you think this could be a common
> bug, or is it restricted to some rare hardware?


* Re: Patch for lost wakeups
  2013-08-08 18:19 ` Patch for lost wakeups Linus Torvalds
@ 2013-08-08 19:17   ` Oleg Nesterov
  2013-08-08 19:51     ` Linus Torvalds
  2013-08-09 15:18     ` [PATCH 0/1] dlm: kill the unnecessary and wrong device_close()->recalc_sigpending() Oleg Nesterov
  2013-08-09 13:28   ` Patch for lost wakeups Oleg Nesterov
  2013-08-09 15:31   ` block_all_signals() must die (Was: Patch for lost wakeups) Oleg Nesterov
  2 siblings, 2 replies; 20+ messages in thread
From: Oleg Nesterov @ 2013-08-08 19:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List

On 08/08, Linus Torvalds wrote:
>
> As a result, doing a "recalc_sigpending_and_wake()"

and btw it should die, I think.

> is definitely
> incorrect, because sigpending state cannot actually have changed.

Yes, if we need to wake up in this case, something is already wrong.

>  - somebody setting TASK_SLEEPING -> __schedule() testing the
> signal_pending_state()
>
> and as far as I can tell we have proper barriers for those (the
> scheduler gets the rq lock

Yes, but... ttwu() takes another lock, ->pi_lock, to test ->state.

This looks racy, even if wmb() actually acts as mb(), we don't
have mb() on the other side and schedule() can miss SIGPENDING?

Unless the task does set_current_state(TASK_INTERRUPTIBLE) which
adds mb(). But, just for example, sigsuspend() relies on schedule().
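
I.e., schematically:

	/* a sigsuspend()-style wait: no explicit condition re-check */
	__set_current_state(TASK_INTERRUPTIBLE);	/* plain store, no mb() */
	schedule();	/* relies on the signal_pending_state() check
			   done under rq->lock inside __schedule() */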

>         smp_wmb();
>         raw_spin_lock_irqsave(&p->pi_lock, flags);
>         if (!(p->state & state))
>
> before it tests the task state. And the wmb() *together* with the
> spinlock really should be a full memory barrier (nothing can get out
> from the spinlock, and any writes before this had better be serialized
> by the wmb and the write inherent in the spinlock itself). But this is
> definitely some subtle stuff.

So perhaps it makes sense to re-test after s/smp_wmb/smp_mb/ ?

And perhaps we can add smp_mb__before_lock(); we already have
smp_mb__after_lock().

And of course, there could be another bug. I just did
"grep recalc_sigpending" and immediately found at least one buggy
user, fs/dlm/user.c which calls it lockless.

> >  Every time Xorg hangs up, the status of Xorg reads as follows (cat /proc/2597/status):

Gao, could you show /proc/pid/stack just in case?

Oleg.



* Re: Patch for lost wakeups
  2013-08-08 19:17   ` Oleg Nesterov
@ 2013-08-08 19:51     ` Linus Torvalds
  2013-08-09 13:04       ` Oleg Nesterov
  2013-08-09 15:18     ` [PATCH 0/1] dlm: kill the unnecessary and wrong device_close()->recalc_sigpending() Oleg Nesterov
  1 sibling, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2013-08-08 19:51 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List

On Thu, Aug 8, 2013 at 12:17 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>
>>  - somebody setting TASK_SLEEPING -> __schedule() testing the
>> signal_pending_state()
>>
>> and as far as I can tell we have proper barriers for those (the
>> scheduler gets the rq lock
>
> Yes, but... ttwu() takse another lock, ->pi_lock to test ->state.

The lock is different, but for task_state, the main thing we need to
worry about is memory ordering, not locks. Because "task->state" is
not protected by any particular lock, it has some rather special rules
(namely that the thread is supposed to modify its own state, *except*
for waking things up, when we rely on memory ordering).

So the real issue is to make sure that, before the sleeper finally goes
to sleep, any other CPU that races with waking it up sees the right
ordering. And that means that

 - the sleeper needs to have the proper barrier between setting the
task state to the sleeping state and testing the waking condition

   Normally this is the "smp_mb()" in set_task_state(), and then the
sleeper is supposed to check its wakeup criteria *after* doing
set_task_state().

   The case of signals is special, in that the "wakeup criteria" is
inside the scheduler itself, but conceptually the rule is the same.

 - the waker needs to have a memory barrier between setting the wakeup
condition and actually doing the wakeup.

   This is where the "try_to_wake_up()" smp_mb+spinlock *should* be
equivalent to a full memory barrier.
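
For concreteness, the classic pairing looks schematically like this (a
sketch, not any one kernel code path):

        /* sleeper */
        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE); /* store + smp_mb() */
                if (CONDITION)
                        break;
                schedule();
        }
        __set_current_state(TASK_RUNNING);

        /* waker */
        CONDITION = 1;
        wake_up_process(p); /* ttwu(): smp_wmb() + pi_lock, then tests p->state */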

> This looks racy, even if wmb() actually acts as mb(), we don't
> have mb() on the other side and schedule() can miss SIGPENDING?

But we do have the mb, at least on x86. The "set_tsk_thread_flag()" is
a memory barrier there. But that's why I suggested adding a
smp_mb__after_clear_bit() after setting the bit, in case that's the
problem on Loongson. Because at least MIPS uses ll/sc for atomics, and
needs a sync instruction afterwards if the memory model is weak.

That said, even on MIPS I do not actually believe it should be
necessary, exactly *because* __schedule() already has a spinlock
before checking the signal state. And the spinlock already contains
sufficient serialization (much *more* than the required sync
instruction) that in practice we're already covered.

So there might be a theoretical race on some architecture, but not
afaik MIPS. I don't know the details about Loongson, though.

                   Linus


* Re: Patch for lost wakeups
  2013-08-08 19:51     ` Linus Torvalds
@ 2013-08-09 13:04       ` Oleg Nesterov
  2013-08-09 18:21         ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2013-08-09 13:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List

On 08/08, Linus Torvalds wrote:
>
> On Thu, Aug 8, 2013 at 12:17 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> >> and as far as I can tell we have proper barriers for those (the
> >> scheduler gets the rq lock
> >
> > Yes, but... ttwu() takes another lock, ->pi_lock, to test ->state.
>
> The lock is different, but for task_state, the main thing we need to
> worry about is memory ordering, not locks.

Yes sure. However, afaics in this particular case the locking does
matter.

Because:

>    The case of signals is special, in that the "wakeup criteria" is
> inside the scheduler itself, but conceptually the rule is the same.

yes, and because the waiter lacks mb().

IOW. The code like

	__set_current_state(STATE);
	if (!CONDITION)
		schedule();

is obviously racy, it doesn't have mb().

But the code like

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

was always considered correct; it relies on the try_to_wake_up()/schedule()
interaction. But after try_to_wake_up() was changed to use task->pi_lock,
this becomes racy in theory. Afaics.

This __set_current_state(TASK_INTERRUPTIBLE) can leak into the critical
section protected by rq->lock, it can be reordered with the CONDITION
check, and in this case CONDITION == signal_pending().

No?
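
Spelled out, the reordering I mean (an illustration, not real code):

	/* program order in the sleeper: */
	prev->state = TASK_INTERRUPTIBLE;		/* STORE, no barrier */
	raw_spin_lock_irq(&rq->lock);			/* one-way barrier only */
	if (signal_pending_state(prev->state, prev))	/* LOAD */
		prev->state = TASK_RUNNING;

	/*
	 * The lock only keeps accesses from escaping the critical section,
	 * so the STORE may be executed after the LOAD. A concurrent
	 * signal_wake_up() can then set TIF_SIGPENDING too late for the
	 * LOAD to see it, yet still observe the old p->state == TASK_RUNNING,
	 * so nobody wakes the task and it sleeps with a signal pending.
	 */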

> > we don't
> > have mb() on the other side and schedule() can miss SIGPENDING?
>
> But we do have the mb, at least on x86. The "set_tsk_thread_flag()" is
> a memory barrier there.

Sorry for confusion, I meant "other side", see above.

> But that's why I suggested adding a
> smp_mb__after_clear_bit() to after setting the bit,

Agreed. Or, once again, we can change try_to_wake_up() to do mb()
rather than wmb().

And compared to the theoretical race above this looks more likely
to me (although still unlikely).

But probably we should start with another debugging patch, I'll send
it in a minute.

Oleg.



* Re: Patch for lost wakeups
  2013-08-08 18:19 ` Patch for lost wakeups Linus Torvalds
  2013-08-08 19:17   ` Oleg Nesterov
@ 2013-08-09 13:28   ` Oleg Nesterov
  2013-08-09 15:31   ` block_all_signals() must die (Was: Patch for lost wakeups) Oleg Nesterov
  2 siblings, 0 replies; 20+ messages in thread
From: Oleg Nesterov @ 2013-08-09 13:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List

On 08/08, Linus Torvalds wrote:
>
> > Name:    Xorg
> > State:    S (sleeping)
> > Tgid:    2597
> > Pid:    2597
> > PPid:    2595
> > TracerPid:    0
> > Uid:    0    0    0    0
> > Gid:    0    0    0    0
> > FDSize:    64
> > Groups:
> > VmPeak:       44640 kB
> > VmSize:       31232 kB
> > VmLck:           0 kB
> > VmHWM:       20560 kB
> > VmRSS:       20016 kB
> > VmData:        5728 kB
> > VmStk:         160 kB
> > VmExe:        1952 kB
> > VmLib:       11296 kB
> > VmPTE:         128 kB
> > VmSwap:           0 kB
> > Threads:    1
> > SigQ:    1/15809
> > SigPnd:    00000000000000000000000000000000
> > ShdPnd:    00000000000000000000000000200000
> > SigBlk:    00000000000000000000000000000000
> > SigIgn:    80000000000000000000000006001000
> > SigCgt:    000000000000000000000001e020eecf

OK, given that Threads == 1 this is wrong in any case.

Could you please reproduce with the trivial patch below? (Please also
show /proc/pid/stack.)

If (with this patch) SigQ: shows "1" at the end, then we have a scheduler
race or someone does set_tsk_thread_flag(non-current-task, TIF_SIGPENDING)
without locking/wakeup.

Otherwise we should identify the bogus clearing of TIF_SIGPENDING.

Oleg.


diff --git a/fs/proc/array.c b/fs/proc/array.c
index cbd0f1b..d4a8cbb 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -286,7 +286,7 @@ static inline void task_sig(struct seq_file *m, struct task_struct *p)
 	}
 
 	seq_printf(m, "Threads:\t%d\n", num_threads);
-	seq_printf(m, "SigQ:\t%lu/%lu\n", qsize, qlim);
+	seq_printf(m, "SigQ:\t%lu/%lu %d\n", qsize, qlim, !!signal_pending(p));
 
 	/* render them all */
 	render_sigset_t(m, "SigPnd:\t", &pending);



* [PATCH 0/1] dlm: kill the unnecessary and wrong device_close()->recalc_sigpending()
  2013-08-08 19:17   ` Oleg Nesterov
  2013-08-08 19:51     ` Linus Torvalds
@ 2013-08-09 15:18     ` Oleg Nesterov
  2013-08-09 15:19       ` [PATCH 1/1] " Oleg Nesterov
  1 sibling, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2013-08-09 15:18 UTC (permalink / raw)
  To: Linus Torvalds, Christine Caulfield, David Teigland
  Cc: Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List

On 08/08, Oleg Nesterov wrote:
>
> And of course, there could be another bug. I just did
> "grep recalc_sigpending" and immediately found at least one buggy
> user, fs/dlm/user.c which calls it lockless.

and the usage of sigprocmask() is wrong, see 1/1.

In fact there are more users which play with sigprocmask() this way.
I am starting to think that, perhaps, we should implement some helpers
for this use-case.
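
For illustration, such helpers could be sketched on top of
set_current_blocked() (hypothetical names and code, not an existing API):

	/* hypothetical: block every blockable signal around a
	   driver-internal wait; set_current_blocked() updates
	   TIF_SIGPENDING under ->siglock */
	static inline void signals_stop(sigset_t *saved)
	{
		sigset_t all;

		sigfillset(&all);
		*saved = current->blocked;
		set_current_blocked(&all);
	}

	static inline void signals_start(sigset_t *saved)
	{
		set_current_blocked(saved);
	}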

Oleg.



* [PATCH 1/1] dlm: kill the unnecessary and wrong device_close()->recalc_sigpending()
  2013-08-09 15:18     ` [PATCH 0/1] dlm: kill the unnecessary and wrong device_close()->recalc_sigpending() Oleg Nesterov
@ 2013-08-09 15:19       ` Oleg Nesterov
  2013-08-12 20:26         ` David Teigland
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2013-08-09 15:19 UTC (permalink / raw)
  To: Linus Torvalds, Christine Caulfield, David Teigland
  Cc: Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List

device_close()->recalc_sigpending() is not needed, sigprocmask()
takes care of TIF_SIGPENDING correctly.

And without ->siglock it is racy and wrong, it can wrongly clear
TIF_SIGPENDING and miss a signal.

But even with this patch device_close() is still buggy:

	1. sigprocmask() should not be used, we have set_task_blocked(),
	   but this is minor.

	2. We should never block SIGKILL or SIGSTOP, and this is what
	   the code tries to do.

	3. This can't protect against SIGKILL or SIGSTOP anyway. Another
	   thread can do signal_wake_up(), say, do_signal_stop() or
	   complete_signal() or debugger.

	4. sigprocmask(SIG_BLOCK, allsigs) doesn't necessarily clear
	   TIF_SIGPENDING, say, freezing() or ->jobctl.

	5. device_write() looks equally wrong for the same reason.

It looks like this tries to protect some wait_event_interruptible() logic
from signals; it should be turned into an uninterruptible wait. Or we need
to implement something like signals_stop/start for such a use-case.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/dlm/user.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/fs/dlm/user.c b/fs/dlm/user.c
index 911649a..8121491 100644
--- a/fs/dlm/user.c
+++ b/fs/dlm/user.c
@@ -686,7 +686,6 @@ static int device_close(struct inode *inode, struct file *file)
 	   device_remove_lockspace() */
 
 	sigprocmask(SIG_SETMASK, &tmpsig, NULL);
-	recalc_sigpending();
 
 	return 0;
 }
-- 
1.5.5.1




* block_all_signals() must die (Was: Patch for lost wakeups)
  2013-08-08 18:19 ` Patch for lost wakeups Linus Torvalds
  2013-08-08 19:17   ` Oleg Nesterov
  2013-08-09 13:28   ` Patch for lost wakeups Oleg Nesterov
@ 2013-08-09 15:31   ` Oleg Nesterov
  2 siblings, 0 replies; 20+ messages in thread
From: Oleg Nesterov @ 2013-08-09 15:31 UTC (permalink / raw)
  To: Linus Torvalds, Dave Airlie
  Cc: Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List, dri-devel

And sorry for the off-topic email, but I can't resist.

Can't we finally kill block_all_signals() and ->notifier ? This
is very, very wrong and doesn't work anyway.

I tried to ask many, many times. Starting from 2007 at least.
And every time the discussion "hangs". I am quoting the last
email I sent below.

Dave, your reply was:

	 I'm on
	 holidays for another week or so, maybe once I get back I'll find some
	 time to figure out how it works vs what happens, but really I suspect
	 we can kill this with fire.

So perhaps I should simply send the patch with your ack? ;)

Oleg.


From oleg@redhat.com Tue Jul 12 20:15:36 2011
Date: Tue, 12 Jul 2011 20:15:36 +0200

Hello.

I tried many times to ask about the supposed behaviour of
block_all_signals() in drm, but it seems nobody can answer.

So I am going to send the patch which simply removes
block_all_signals() and friends. There are numerous problems
with this interface; I can't even enumerate them all. But I think
it is enough to mention that block_all_signals() simply
can not work. AT ALL. I wonder whether it was ever tested, and
how.

So. ioctl()->drm_lock() "blocks" the stop signals. Probably to
ensure the task can't be stopped until it does DRM_IOCTL_UNLOCK.
And what does this mean? Yes, the task won't stop if it receives,
say, SIGTSTP. But! Instead it will loop forever in kernel mode
until it receives another unblocked/non-ignored signal which
should be numerically less than SIGSTOP.

Why do we need this? Once again. block_all_signals(SIGTSTP)
only means that the caller will burn cpu instead of sleeping
in TASK_STOPPED after ^Z. What is the point?

And once again, there are other problems. For example, even if
block_all_signals() actually blocked SIGSTOP/etc, this can not
help if the caller is multithreaded.

I strongly believe block_all_signals() should die. Given that
it doesn't work, could somebody please explain to me what will
be broken?



Just in case... Please look at the debugging patch below. With
this patch,

	$ perl -le 'syscall 157,666 and die $!; sleep 1, print while ++$_'
	1
	2
	3
	^Z

Hang. So it does react to ^Z anyway; it is just looping endlessly
in the kernel. It can only look as if ^Z were ignored,
because obviously bash doesn't see it stopped.



Now let's look at drm_notifier(). If it returns 0 it does:

	/* Otherwise, set flag to force call to
	   drmUnlock */

drmUnlock? grep shows nothing...

	do {
		old = s->lock->lock;
		new = old | _DRM_LOCK_CONT;
		prev = cmpxchg(&s->lock->lock, old, new);
	} while (prev != old);
	return 0;

OK. So, if block_all_signals() makes any sense, it seems that this
is only because we add _DRM_LOCK_CONT.

Who checks _DRM_LOCK_CONT? _DRM_LOCK_IS_CONT(), but it has no users.
Hmm. Looks like via_release_futex() is the only user, but it doesn't
look like a "force call to drmUnlock", and it is CONFIG_DRM_VIA only.


I am totally confused. But block_all_signals() should die anyway.

We can probably implement something like 'i-am-going-to-stop' or
even 'can-i-stop' per-thread notifiers, although this all looks
like a user-space problem to me (yes, I know absolutely nothing
about drm/etc).

If nothing else, we can change drm_lock/drm_unlock to literally
block/unblock SIGSTOP/etc. (or perhaps we should only worry about
the signals from the tty?). This is an awful hack and it can't work
with multithreaded tasks either, but still it is better than what
we have now.

Oleg.

--- a/kernel/sys.c~	2011-06-16 20:12:18.000000000 +0200
+++ b/kernel/sys.c	2011-07-12 16:24:50.000000000 +0200
@@ -1614,6 +1614,11 @@ SYSCALL_DEFINE1(umask, int, mask)
 	return mask;
 }
 
+static int notifier(void *arg)
+{
+	return 0;
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -1627,6 +1632,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 
 	error = 0;
 	switch (option) {
+		case 666: {
+			sigset_t *pmask = kmalloc(sizeof(*pmask), GFP_KERNEL);
+			siginitset(pmask, sigmask(SIGTSTP));
+			block_all_signals(notifier, NULL, pmask);
+			break;
+		}
+
 		case PR_SET_PDEATHSIG:
 			if (!valid_signal(arg2)) {
 				error = -EINVAL;




* Re: Patch for lost wakeups
  2013-08-09 13:04       ` Oleg Nesterov
@ 2013-08-09 18:21         ` Linus Torvalds
  2013-08-11 17:25           ` Oleg Nesterov
                             ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Linus Torvalds @ 2013-08-09 18:21 UTC (permalink / raw)
  To: Oleg Nesterov, linux-arch
  Cc: Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List

On Fri, Aug 9, 2013 at 6:04 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
>>    The case of signals is special, in that the "wakeup criteria" is
>> inside the scheduler itself, but conceptually the rule is the same.
>
> yes, and because the waiter lacks mb().

Hmm. Ok. So say we have a sleeper that does
__set_task_state(TASK_INTERRUPTIBLE) (so without the memory barrier
semantics of set_task_state()), so we have one thread doing

   prev->state = TASK_INTERRUPTIBLE
   raw_spin_lock_irq(&rq->lock);
   if (signal_pending_state(prev->state, prev))
     prev->state = TASK_RUNNING;

and since it's a spin-lock, things can in theory percolate *into* the
critical region, but not out of it. So as far as other CPU's are
concerned, that thread might actually do the "test pending signals"
before it even sets the state to TASK_INTERRUPTIBLE. And yes, that can
race against new signals coming in.

So it does look like a race. I get the feeling that we should put a
smp_wmb() into the top of __schedule(), before the spinlock - that's
basically free on any sane architecture (yeah, yeah, a bad memory
ordering implementation can make even smp_wmb() expensive, but
whatever - I can't find it in myself to care about badly implemented
microarchitectures when the common case gets it right).

That said, at least MIPS does not make spin-locks be that kind of
"semi-transparent" locks, so this race cannot hit MIPS. And I'm pretty
sure Loongson has the same memory ordering as MIPS (i.e. "sync" will do
a memory barrier).

So this might be a real race for architectures that actually do
aggressive memory re-ordering _and_ actually implement spinlocks with
aggressive acquire semantics. I think ia64 does. The ppc memory
ordering is confused enough that *maybe* it does, but I doubt it. On
most other architectures I think spinlocks end up being full memory
barriers.

I guess that instead of a "smp_wmb()", we could do another
"smp_mb__before_spinlock()" thing, like we already allow for other
architectures to do a weaker form of mb in case the spinlock is
already a full mb. That would allow avoiding extra synchronization. Do
a

   #ifndef smp_mb__before_spinlock
     #define smp_mb__before_spinlock() smp_wmb()
   #endif

in <linux/spinlock.h> to not force everybody to implement it. Because
a wmb+acquire should be close enough to a full mb that nobody cares
(ok, so reads could move into the critical region from outside, but by
the time anybody has called "schedule()", I can't see it mattering, so
"close enough").

                  Linus


* Re: Patch for lost wakeups
  2013-08-09 18:21         ` Linus Torvalds
@ 2013-08-11 17:25           ` Oleg Nesterov
  2013-08-11 17:27             ` Oleg Nesterov
       [not found]           ` <tencent_293B72F26D71A4191C7C999A@qq.com>
  2013-08-12 17:02           ` [PATCH] sched: fix the theoretical signal_wake_up() vs schedule() race Oleg Nesterov
  2 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2013-08-11 17:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List

On 08/09, Linus Torvalds wrote:
>
> I guess that instead of a "smp_wmb()", we could do another
> "smp_mb__before_spinlock()" thing, like we already allow for other
> architectures to do a weaker form of mb in case the spinlock is
> already a full mb. That would allow avoiding extra synchronization. Do
> a
>
>    #ifndef smp_mb__before_spinlock
>      #define smp_mb__before_spinlock() smp_wmb()
>    #endif
>
> in <linux/spinlock.h> to not force everybody to implement it. Because
> a wmb+acquire should be close enough to a full mb that nobody cares
> (ok, so reads could move into the critical region from outside, but by
> the time anybody has called "schedule()", I can't see it mattering, so
> "close enough").

Yes, this is what I tried to suggest. And of course we should turn that
wmb() in try_to_wake_up() into smp_mb__before_spinlock().

I even started the patch, but we already have smp_mb__after_lock(), so
it should be smp_mb__before_lock() for consistency, and we need to turn
it into a "define" too. Or change ARCH_HAS_SMP_MB_AFTER_LOCK, or add
ARCH_HAS_SMP_MB_BEFORE_LOCK.

Oleg.



* Re: Patch for lost wakeups
  2013-08-11 17:25           ` Oleg Nesterov
@ 2013-08-11 17:27             ` Oleg Nesterov
  0 siblings, 0 replies; 20+ messages in thread
From: Oleg Nesterov @ 2013-08-11 17:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List

On 08/11, Oleg Nesterov wrote:
>
> On 08/09, Linus Torvalds wrote:
> >
> > I guess that instead of a "smp_wmb()", we could do another
> > "smp_mb__before_spinlock()" thing, like we already allow for other
> > architectures to do a weaker form of mb in case the spinlock is
> > already a full mb. That would allow avoiding extra synchronization. Do
> > a
> >
> >    #ifndef smp_mb__before_spinlock
> >      #define smp_mb__before_spinlock() smp_wmb()
> >    #endif
> >
> > in <linux/spinlock.h> to not force everybody to implement it. Because
> > a wmb+acquire should be close enough to a full mb that nobody cares
> > (ok, so reads could move into the critical region from outside, but by
> > the time anybody has called "schedule()", I can't see it mattering, so
> > "close enough").
>
> Yes, this is what I tried to suggest. And of course we should turn that
> wmb() in try_to_wake_up() into smp_mb__before_spinlock().
>
> I even started the patch, but we already have smp_mb__after_lock(), so
> it should be smp_mb__before_lock() for consistency, and we need to turn
> it into a "define" too. Or change ARCH_HAS_SMP_MB_AFTER_LOCK, or add
> ARCH_HAS_SMP_MB_BEFORE_LOCK.

Ah, please ignore. According to /bin/grep, smp_mb__before_lock() no longer
has users. So we can probably simply kill it in the same patch. I'll try
to write the changelog and send it tomorrow.

Oleg.



* Re: Patch for lost wakeups
       [not found]           ` <tencent_293B72F26D71A4191C7C999A@qq.com>
@ 2013-08-11 17:39             ` Oleg Nesterov
  2013-08-11 23:52               ` James Bottomley
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2013-08-11 17:39 UTC (permalink / raw)
  To: Long Gao
  Cc: Linus Torvalds, linux-arch, Al Viro, Andrew Morton,
	Linux Kernel Mailing List

On 08/10, Long Gao wrote:
>
>   By the way, could you help me join the linux kernel mailing list?

Do you mean, you want to subscribe?

Well, from http://www.tux.org/lkml/#s3-1

        send the line "subscribe linux-kernel your_email@your_ISP"
        in the body of the message to majordomo@vger.kernel.org

but since I was never subscribed I do not really know if this works ;)

> I have sent two letters to "Linux Kernel Mailing List"<linux-kernel@vger.kernel.org>,
> but did not receive any confirmation letter, except for two letters containing
> instructions, 3X!

If you simply want to send emails to this list, you do not need
to subscribe. However, I do not see your email in

        http://marc.info/?t=137598606900001

Perhaps this list doesn't like Chinese characters in the subject/body?

Oleg.



* Re: Patch for lost wakeups
  2013-08-11 17:39             ` Oleg Nesterov
@ 2013-08-11 23:52               ` James Bottomley
  0 siblings, 0 replies; 20+ messages in thread
From: James Bottomley @ 2013-08-11 23:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Long Gao, Linus Torvalds, linux-arch, Al Viro, Andrew Morton,
	Linux Kernel Mailing List

On Sun, 2013-08-11 at 19:39 +0200, Oleg Nesterov wrote:
> On 08/10, Long Gao wrote:
> >
> >   By the way, could you help me join the linux kernel mailing list?
> 
> Do you mean, you want to subscribe?
> 
> Well, from http://www.tux.org/lkml/#s3-1
> 
>         send the line "subscribe linux-kernel your_email@your_ISP"
>         in the body of the message to majordomo@vger.kernel.org
> 
> but since I was never subscribed I do not really know if this works ;)
> 
> > I have sent two letters to "Linux Kernel Mailing List"<linux-kernel@vger.kernel.org>,
> > but did not receive any confirmation letter, except for two letters containing
> > instructions, 3X!
> 
> If you simply want to send emails to this list, you do not need
> to subscribe. However, I do not see your email in
> 
>         http://marc.info/?t=137598606900001
> 
> Perhaps this list doesn't like Chinese characters in the subject/body?

No, we get patches with UTF-16 characters in them OK.  The usual reason
vger drops an email is if there's a text/html part.

The information about what vger drops is here:

http://vger.kernel.org/majordomo-info.html

James




* [PATCH] sched: fix the theoretical signal_wake_up() vs schedule() race
  2013-08-09 18:21         ` Linus Torvalds
  2013-08-11 17:25           ` Oleg Nesterov
       [not found]           ` <tencent_293B72F26D71A4191C7C999A@qq.com>
@ 2013-08-12 17:02           ` Oleg Nesterov
  2013-08-13  7:55             ` Peter Zijlstra
  2 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2013-08-12 17:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-arch, Long Gao, Al Viro, Andrew Morton,
	Linux Kernel Mailing List, Paul E. McKenney, Peter Zijlstra,
	Ingo Molnar

This is only theoretical, but after try_to_wake_up(p) was changed
to check p->state under p->pi_lock the code like

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal. This is the special case of wait-for-condition,
it relies on try_to_wake_up/schedule interaction and thus it does
not need mb() between __set_current_state() and if(signal_pending).

However, this __set_current_state() can move into the critical
section protected by rq->lock, now that try_to_wake_up() takes
another lock we need to ensure that it can't be reordered with
"if (signal_pending(current))" check inside that section.

The patch is actually a one-liner; it simply adds smp_wmb() before
spin_lock_irq(rq->lock). This is what try_to_wake_up() already
does by the same reason.

We turn this wmb() into the new helper, smp_mb__before_spinlock(),
for better documentation and to allow the architectures to change
the default implementation.

While at it, kill smp_mb__after_lock(), it has no callers.

Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/include/asm/spinlock.h |    4 ----
 include/linux/spinlock.h        |   13 ++++++++++---
 kernel/sched/core.c             |   14 +++++++++++++-
 3 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..e3ddd7d 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -233,8 +233,4 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
 #define arch_read_relax(lock)	cpu_relax()
 #define arch_write_relax(lock)	cpu_relax()
 
-/* The {read|write|spin}_lock() on x86 are full memory barriers. */
-static inline void smp_mb__after_lock(void) { }
-#define ARCH_HAS_SMP_MB_AFTER_LOCK
-
 #endif /* _ASM_X86_SPINLOCK_H */
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 7d537ce..f21551e 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -117,9 +117,16 @@ do {								\
 #endif /*arch_spin_is_contended*/
 #endif
 
-/* The lock does not imply full memory barrier. */
-#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
-static inline void smp_mb__after_lock(void) { smp_mb(); }
+/*
+ * Despite its name it doesn't necessarily have to be a full barrier.
+ * It should only guarantee that a STORE before the critical section
+ * can not be reordered with a LOAD inside this section.
+ * So the default implementation simply ensures that a STORE can not
+ * move into the critical section, smp_wmb() should serialize it with
+ * another STORE done by spin_lock().
+ */
+#ifndef smp_mb__before_spinlock
+#define smp_mb__before_spinlock()	smp_wmb()
 #endif
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6df0fbe..97dac0e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1491,7 +1491,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	unsigned long flags;
 	int cpu, success = 0;
 
-	smp_wmb();
+	/*
+	 * If we are going to wake up a thread waiting for CONDITION we
+	 * need to ensure that CONDITION=1 done by the caller can not be
+	 * reordered with p->state check below. This pairs with mb() in
+	 * set_current_state() the waiting thread does.
+	 */
+	smp_mb__before_spinlock();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	if (!(p->state & state))
 		goto out;
@@ -2394,6 +2400,12 @@ need_resched:
 	if (sched_feat(HRTICK))
 		hrtick_clear(rq);
 
+	/*
+	 * Make sure that signal_pending_state()->signal_pending() below
+	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
+	 * done by the caller to avoid the race with signal_wake_up().
+	 */
+	smp_mb__before_spinlock();
 	raw_spin_lock_irq(&rq->lock);
 
 	switch_count = &prev->nivcsw;
-- 
1.5.5.1




* Re: [PATCH 1/1] dlm: kill the unnecessary and wrong device_close()->recalc_sigpending()
  2013-08-09 15:19       ` [PATCH 1/1] " Oleg Nesterov
@ 2013-08-12 20:26         ` David Teigland
  0 siblings, 0 replies; 20+ messages in thread
From: David Teigland @ 2013-08-12 20:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Christine Caulfield, Long Gao, Al Viro,
	Andrew Morton, Linux Kernel Mailing List

On Fri, Aug 09, 2013 at 05:19:13PM +0200, Oleg Nesterov wrote:
> device_close()->recalc_sigpending() is not needed, sigprocmask()
> takes care of TIF_SIGPENDING correctly.
> 
> And without ->siglock it is racy and wrong, it can wrongly clear
> TIF_SIGPENDING and miss a signal.
> 
> But even with this patch device_close() is still buggy:
> 
> 	1. sigprocmask() should not be used, we have set_task_blocked(),
> 	   but this is minor.
> 
> 	2. We should never block SIGKILL or SIGSTOP, and this is what
> 	   the code tries to do.
> 
> 	3. This can't protect against SIGKILL or SIGSTOP anyway. Another
> 	   thread can do signal_wake_up(), say, do_signal_stop() or
> 	   complete_signal() or debugger.
> 
> 	4. sigprocmask(SIG_BLOCK, allsigs) doesn't necessarily clear
> 	   TIF_SIGPENDING, say, freezing() or ->jobctl.
> 
> 	5. device_write() looks equally wrong for the same reason.
> 
> It looks like this tries to protect some wait_event_interruptible() logic
> from signals; it should be turned into an uninterruptible wait. Or we need
> to implement something like signals_stop/start for such a use-case.

I can't remember why that signal code exists, or if I ever knew; it was
there when the code was added seven years ago.  I agree that if there's
something we cannot interrupt, we should use uninterruptible, but I don't
see any cases of that either.  I think we should just remove it all
(untested):

From: David Teigland <teigland@redhat.com>
Date: Mon, 12 Aug 2013 15:22:43 -0500
Subject: [PATCH] dlm: remove signal blocking

The signal blocking was incorrect and unnecessary
so just remove it.

Signed-off-by: David Teigland <teigland@redhat.com>
---
 fs/dlm/user.c | 25 ++++++-------------------
 1 file changed, 6 insertions(+), 19 deletions(-)

diff --git a/fs/dlm/user.c b/fs/dlm/user.c
index 911649a..142e216 100644
--- a/fs/dlm/user.c
+++ b/fs/dlm/user.c
@@ -493,7 +493,6 @@ static ssize_t device_write(struct file *file, const char __user *buf,
 {
 	struct dlm_user_proc *proc = file->private_data;
 	struct dlm_write_request *kbuf;
-	sigset_t tmpsig, allsigs;
 	int error;
 
 #ifdef CONFIG_COMPAT
@@ -557,9 +556,6 @@ static ssize_t device_write(struct file *file, const char __user *buf,
 		goto out_free;
 	}
 
-	sigfillset(&allsigs);
-	sigprocmask(SIG_BLOCK, &allsigs, &tmpsig);
-
 	error = -EINVAL;
 
 	switch (kbuf->cmd)
@@ -567,7 +563,7 @@ static ssize_t device_write(struct file *file, const char __user *buf,
 	case DLM_USER_LOCK:
 		if (!proc) {
 			log_print("no locking on control device");
-			goto out_sig;
+			goto out_free;
 		}
 		error = device_user_lock(proc, &kbuf->i.lock);
 		break;
@@ -575,7 +571,7 @@ static ssize_t device_write(struct file *file, const char __user *buf,
 	case DLM_USER_UNLOCK:
 		if (!proc) {
 			log_print("no locking on control device");
-			goto out_sig;
+			goto out_free;
 		}
 		error = device_user_unlock(proc, &kbuf->i.lock);
 		break;
@@ -583,7 +579,7 @@ static ssize_t device_write(struct file *file, const char __user *buf,
 	case DLM_USER_DEADLOCK:
 		if (!proc) {
 			log_print("no locking on control device");
-			goto out_sig;
+			goto out_free;
 		}
 		error = device_user_deadlock(proc, &kbuf->i.lock);
 		break;
@@ -591,7 +587,7 @@ static ssize_t device_write(struct file *file, const char __user *buf,
 	case DLM_USER_CREATE_LOCKSPACE:
 		if (proc) {
 			log_print("create/remove only on control device");
-			goto out_sig;
+			goto out_free;
 		}
 		error = device_create_lockspace(&kbuf->i.lspace);
 		break;
@@ -599,7 +595,7 @@ static ssize_t device_write(struct file *file, const char __user *buf,
 	case DLM_USER_REMOVE_LOCKSPACE:
 		if (proc) {
 			log_print("create/remove only on control device");
-			goto out_sig;
+			goto out_free;
 		}
 		error = device_remove_lockspace(&kbuf->i.lspace);
 		break;
@@ -607,7 +603,7 @@ static ssize_t device_write(struct file *file, const char __user *buf,
 	case DLM_USER_PURGE:
 		if (!proc) {
 			log_print("no locking on control device");
-			goto out_sig;
+			goto out_free;
 		}
 		error = device_user_purge(proc, &kbuf->i.purge);
 		break;
@@ -617,8 +613,6 @@ static ssize_t device_write(struct file *file, const char __user *buf,
 			  kbuf->cmd);
 	}
 
- out_sig:
-	sigprocmask(SIG_SETMASK, &tmpsig, NULL);
  out_free:
 	kfree(kbuf);
 	return error;
@@ -659,15 +653,11 @@ static int device_close(struct inode *inode, struct file *file)
 {
 	struct dlm_user_proc *proc = file->private_data;
 	struct dlm_ls *ls;
-	sigset_t tmpsig, allsigs;
 
 	ls = dlm_find_lockspace_local(proc->lockspace);
 	if (!ls)
 		return -ENOENT;
 
-	sigfillset(&allsigs);
-	sigprocmask(SIG_BLOCK, &allsigs, &tmpsig);
-
 	set_bit(DLM_PROC_FLAGS_CLOSING, &proc->flags);
 
 	dlm_clear_proc_locks(ls, proc);
@@ -685,9 +675,6 @@ static int device_close(struct inode *inode, struct file *file)
 	/* FIXME: AUTOFREE: if this ls is no longer used do
 	   device_remove_lockspace() */
 
-	sigprocmask(SIG_SETMASK, &tmpsig, NULL);
-	recalc_sigpending();
-
 	return 0;
 }
 
-- 
1.8.3.1



* Re: [PATCH] sched: fix the theoretical signal_wake_up() vs schedule() race
  2013-08-12 17:02           ` [PATCH] sched: fix the theoretical signal_wake_up() vs schedule() race Oleg Nesterov
@ 2013-08-13  7:55             ` Peter Zijlstra
  2013-08-13 14:33               ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2013-08-13  7:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, linux-arch, Long Gao, Al Viro, Andrew Morton,
	Linux Kernel Mailing List, Paul E. McKenney, Ingo Molnar

On Mon, Aug 12, 2013 at 07:02:57PM +0200, Oleg Nesterov wrote:
> This is only theoretical, but after try_to_wake_up(p) was changed
> to check p->state under p->pi_lock the code like
> 
> 	__set_current_state(TASK_INTERRUPTIBLE);
> 	schedule();
> 
> can miss a signal. This is the special case of wait-for-condition,
> it relies on try_to_wake_up/schedule interaction and thus it does
> not need mb() between __set_current_state() and if(signal_pending).
> 
> However, this __set_current_state() can move into the critical
> section protected by rq->lock, now that try_to_wake_up() takes
> another lock we need to ensure that it can't be reordered with
> "if (signal_pending(current))" check inside that section.
> 
> The patch is actually a one-liner; it simply adds smp_wmb() before
> spin_lock_irq(rq->lock). This is what try_to_wake_up() already
> does by the same reason.
> 
> We turn this wmb() into the new helper, smp_mb__before_spinlock(),
> for better documentation and to allow the architectures to change
> the default implementation.
> 
> While at it, kill smp_mb__after_lock(), it has no callers.
> 
> Perhaps we can also add smp_mb__before/after_spinunlock() for
> prepare_to_wait().
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Thanks!

> +/*
> + * Despite its name it doesn't necessarily have to be a full barrier.
> + * It should only guarantee that a STORE before the critical section
> + * can not be reordered with a LOAD inside this section.
> + * So the default implementation simply ensures that a STORE can not
> + * move into the critical section, smp_wmb() should serialize it with
> + * another STORE done by spin_lock().
> + */
> +#ifndef smp_mb__before_spinlock
> +#define smp_mb__before_spinlock()	smp_wmb()
>  #endif

I would have expected mention of the ACQUIRE of the lock keeping the
LOAD inside the locked section.




* Re: [PATCH] sched: fix the theoretical signal_wake_up() vs schedule() race
  2013-08-13  7:55             ` Peter Zijlstra
@ 2013-08-13 14:33               ` Oleg Nesterov
  2013-08-16 18:46                 ` [tip:sched/core] sched: Fix the theoretical signal_wake_up() vs. " tip-bot for Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2013-08-13 14:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, linux-arch, Long Gao, Al Viro, Andrew Morton,
	Linux Kernel Mailing List, Paul E. McKenney, Ingo Molnar

On 08/13, Peter Zijlstra wrote:
>
> On Mon, Aug 12, 2013 at 07:02:57PM +0200, Oleg Nesterov wrote:
> > +/*
> > + * Despite its name it doesn't necessarily have to be a full barrier.
> > + * It should only guarantee that a STORE before the critical section
> > + * can not be reordered with a LOAD inside this section.
> > + * So the default implementation simply ensures that a STORE can not
> > + * move into the critical section, smp_wmb() should serialize it with
> > + * another STORE done by spin_lock().
> > + */
> > +#ifndef smp_mb__before_spinlock
> > +#define smp_mb__before_spinlock()	smp_wmb()
> >  #endif
>
> I would have expected mention of the ACQUIRE of the lock keeping the
> LOAD inside the locked section.

OK, please see v2 below.

---
From 8de96d3feae3b4b51669902b7c24ac1748ecdbfe Mon Sep 17 00:00:00 2001
From: Oleg Nesterov <oleg@redhat.com>
Date: Mon, 12 Aug 2013 18:14:00 +0200
Subject: sched: fix the theoretical signal_wake_up() vs schedule() race

This is only theoretical, but after try_to_wake_up(p) was changed
to check p->state under p->pi_lock the code like

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal. This is the special case of wait-for-condition,
it relies on try_to_wake_up/schedule interaction and thus it does
not need mb() between __set_current_state() and if(signal_pending).

However, this __set_current_state() can move into the critical
section protected by rq->lock, now that try_to_wake_up() takes
another lock we need to ensure that it can't be reordered with
"if (signal_pending(current))" check inside that section.

The patch is actually a one-liner; it simply adds smp_wmb() before
spin_lock_irq(rq->lock). This is what try_to_wake_up() already
does by the same reason.

We turn this wmb() into the new helper, smp_mb__before_spinlock(),
for better documentation and to allow the architectures to change
the default implementation.

While at it, kill smp_mb__after_lock(), it has no callers.

Perhaps we can also add smp_mb__before/after_spinunlock() for
prepare_to_wait().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/include/asm/spinlock.h |    4 ----
 include/linux/spinlock.h        |   14 +++++++++++---
 kernel/sched/core.c             |   14 +++++++++++++-
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..e3ddd7d 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -233,8 +233,4 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
 #define arch_read_relax(lock)	cpu_relax()
 #define arch_write_relax(lock)	cpu_relax()
 
-/* The {read|write|spin}_lock() on x86 are full memory barriers. */
-static inline void smp_mb__after_lock(void) { }
-#define ARCH_HAS_SMP_MB_AFTER_LOCK
-
 #endif /* _ASM_X86_SPINLOCK_H */
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 7d537ce..75f3494 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -117,9 +117,17 @@ do {								\
 #endif /*arch_spin_is_contended*/
 #endif
 
-/* The lock does not imply full memory barrier. */
-#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
-static inline void smp_mb__after_lock(void) { smp_mb(); }
+/*
+ * Despite its name it doesn't necessarily have to be a full barrier.
+ * It should only guarantee that a STORE before the critical section
+ * can not be reordered with a LOAD inside this section.
+ * spin_lock() is a one-way barrier, this LOAD can not escape out
+ * of the region. So the default implementation simply ensures that
+ * a STORE can not move into the critical section, smp_wmb() should
+ * serialize it with another STORE done by spin_lock().
+ */
+#ifndef smp_mb__before_spinlock
+#define smp_mb__before_spinlock()	smp_wmb()
 #endif
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6df0fbe..97dac0e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1491,7 +1491,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	unsigned long flags;
 	int cpu, success = 0;
 
-	smp_wmb();
+	/*
+	 * If we are going to wake up a thread waiting for CONDITION we
+	 * need to ensure that CONDITION=1 done by the caller can not be
+	 * reordered with p->state check below. This pairs with mb() in
+	 * set_current_state() the waiting thread does.
+	 */
+	smp_mb__before_spinlock();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	if (!(p->state & state))
 		goto out;
@@ -2394,6 +2400,12 @@ need_resched:
 	if (sched_feat(HRTICK))
 		hrtick_clear(rq);
 
+	/*
+	 * Make sure that signal_pending_state()->signal_pending() below
+	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
+	 * done by the caller to avoid the race with signal_wake_up().
+	 */
+	smp_mb__before_spinlock();
 	raw_spin_lock_irq(&rq->lock);
 
 	switch_count = &prev->nivcsw;
-- 
1.5.5.1




* [tip:sched/core] sched: Fix the theoretical signal_wake_up() vs. schedule() race
  2013-08-13 14:33               ` Oleg Nesterov
@ 2013-08-16 18:46                 ` tip-bot for Oleg Nesterov
  2013-08-17 15:05                   ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: tip-bot for Oleg Nesterov @ 2013-08-16 18:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, gaolong, torvalds, peterz, paulmck,
	akpm, tglx, oleg

Commit-ID:  e4c711cfd32f3c74a699af3121a3b0ff631328a7
Gitweb:     http://git.kernel.org/tip/e4c711cfd32f3c74a699af3121a3b0ff631328a7
Author:     Oleg Nesterov <oleg@redhat.com>
AuthorDate: Tue, 13 Aug 2013 16:33:25 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 16 Aug 2013 17:44:54 +0200

sched: Fix the theoretical signal_wake_up() vs. schedule() race

This is only theoretical, but after try_to_wake_up(p) was changed
to check p->state under p->pi_lock the code like

	__set_current_state(TASK_INTERRUPTIBLE);
	schedule();

can miss a signal. This is the special case of wait-for-condition,
it relies on try_to_wake_up/schedule interaction and thus it does
not need mb() between __set_current_state() and if(signal_pending).

However, this __set_current_state() can move into the critical
section protected by rq->lock, now that try_to_wake_up() takes
another lock we need to ensure that it can't be reordered with
"if (signal_pending(current))" check inside that section.

The patch is actually a one-liner; it simply adds smp_wmb() before
spin_lock_irq(rq->lock). This is what try_to_wake_up() already
does by the same reason.

We turn this wmb() into the new helper, smp_mb__before_spinlock(),
for better documentation and to allow the architectures to change
the default implementation.

While at it, kill smp_mb__after_lock(), it has no callers.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Long Gao <gaolong@kylinos.com.cn>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130813143325.GA5541@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/spinlock.h |  4 ----
 include/linux/spinlock.h        | 14 +++++++++++---
 kernel/sched/core.c             | 14 +++++++++++++-
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..e3ddd7d 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -233,8 +233,4 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
 #define arch_read_relax(lock)	cpu_relax()
 #define arch_write_relax(lock)	cpu_relax()
 
-/* The {read|write|spin}_lock() on x86 are full memory barriers. */
-static inline void smp_mb__after_lock(void) { }
-#define ARCH_HAS_SMP_MB_AFTER_LOCK
-
 #endif /* _ASM_X86_SPINLOCK_H */
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 7d537ce..75f3494 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -117,9 +117,17 @@ do {								\
 #endif /*arch_spin_is_contended*/
 #endif
 
-/* The lock does not imply full memory barrier. */
-#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
-static inline void smp_mb__after_lock(void) { smp_mb(); }
+/*
+ * Despite its name it doesn't necessarily have to be a full barrier.
+ * It should only guarantee that a STORE before the critical section
+ * can not be reordered with a LOAD inside this section.
+ * spin_lock() is a one-way barrier, this LOAD can not escape out
+ * of the region. So the default implementation simply ensures that
+ * a STORE can not move into the critical section, smp_wmb() should
+ * serialize it with another STORE done by spin_lock().
+ */
+#ifndef smp_mb__before_spinlock
+#define smp_mb__before_spinlock()	smp_wmb()
 #endif
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cf8f100..d60b325 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1491,7 +1491,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	unsigned long flags;
 	int cpu, success = 0;
 
-	smp_wmb();
+	/*
+	 * If we are going to wake up a thread waiting for CONDITION we
+	 * need to ensure that CONDITION=1 done by the caller can not be
+	 * reordered with p->state check below. This pairs with mb() in
+	 * set_current_state() the waiting thread does.
+	 */
+	smp_mb__before_spinlock();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	if (!(p->state & state))
 		goto out;
@@ -2394,6 +2400,12 @@ need_resched:
 	if (sched_feat(HRTICK))
 		hrtick_clear(rq);
 
+	/*
+	 * Make sure that signal_pending_state()->signal_pending() below
+	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
+	 * done by the caller to avoid the race with signal_wake_up().
+	 */
+	smp_mb__before_spinlock();
 	raw_spin_lock_irq(&rq->lock);
 
 	switch_count = &prev->nivcsw;


* Re: [tip:sched/core] sched: Fix the theoretical signal_wake_up() vs. schedule() race
  2013-08-16 18:46                 ` [tip:sched/core] sched: Fix the theoretical signal_wake_up() vs. " tip-bot for Oleg Nesterov
@ 2013-08-17 15:05                   ` Oleg Nesterov
  2013-08-19  7:13                     ` Ingo Molnar
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2013-08-17 15:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-tip-commits, linux-kernel, hpa, mingo, gaolong, torvalds,
	peterz, paulmck, akpm, tglx

On 08/16, tip-bot for Oleg Nesterov wrote:
>
> Commit-ID:  e4c711cfd32f3c74a699af3121a3b0ff631328a7
> Gitweb:     http://git.kernel.org/tip/e4c711cfd32f3c74a699af3121a3b0ff631328a7
> Author:     Oleg Nesterov <oleg@redhat.com>
> AuthorDate: Tue, 13 Aug 2013 16:33:25 +0200
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Fri, 16 Aug 2013 17:44:54 +0200
>
> sched: Fix the theoretical signal_wake_up() vs. schedule() race

Ingo, thanks.

But fyi, this is already merged as e0acd0a6 in Linus's tree.

Oleg.



* Re: [tip:sched/core] sched: Fix the theoretical signal_wake_up() vs. schedule() race
  2013-08-17 15:05                   ` Oleg Nesterov
@ 2013-08-19  7:13                     ` Ingo Molnar
  0 siblings, 0 replies; 20+ messages in thread
From: Ingo Molnar @ 2013-08-19  7:13 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, linux-tip-commits, linux-kernel, hpa, gaolong,
	torvalds, peterz, paulmck, akpm, tglx


* Oleg Nesterov <oleg@redhat.com> wrote:

> On 08/16, tip-bot for Oleg Nesterov wrote:
> >
> > Commit-ID:  e4c711cfd32f3c74a699af3121a3b0ff631328a7
> > Gitweb:     http://git.kernel.org/tip/e4c711cfd32f3c74a699af3121a3b0ff631328a7
> > Author:     Oleg Nesterov <oleg@redhat.com>
> > AuthorDate: Tue, 13 Aug 2013 16:33:25 +0200
> > Committer:  Ingo Molnar <mingo@kernel.org>
> > CommitDate: Fri, 16 Aug 2013 17:44:54 +0200
> >
> > sched: Fix the theoretical signal_wake_up() vs. schedule() race
> 
> Ingo, thanks.
> 
> But fyi, this is already merged as e0acd0a6 in Linus's tree.

Missed that - I zapped this commit from the tip of 
sched/core, thanks Oleg!

	Ingo



Thread overview: 20+ messages
     [not found] <tencent_26310211398C21034BD3B2F9@qq.com>
2013-08-08 18:19 ` Patch for lost wakeups Linus Torvalds
2013-08-08 19:17   ` Oleg Nesterov
2013-08-08 19:51     ` Linus Torvalds
2013-08-09 13:04       ` Oleg Nesterov
2013-08-09 18:21         ` Linus Torvalds
2013-08-11 17:25           ` Oleg Nesterov
2013-08-11 17:27             ` Oleg Nesterov
     [not found]           ` <tencent_293B72F26D71A4191C7C999A@qq.com>
2013-08-11 17:39             ` Oleg Nesterov
2013-08-11 23:52               ` James Bottomley
2013-08-12 17:02           ` [PATCH] sched: fix the theoretical signal_wake_up() vs schedule() race Oleg Nesterov
2013-08-13  7:55             ` Peter Zijlstra
2013-08-13 14:33               ` Oleg Nesterov
2013-08-16 18:46                 ` [tip:sched/core] sched: Fix the theoretical signal_wake_up() vs. " tip-bot for Oleg Nesterov
2013-08-17 15:05                   ` Oleg Nesterov
2013-08-19  7:13                     ` Ingo Molnar
2013-08-09 15:18     ` [PATCH 0/1] dlm: kill the unnecessary and wrong device_close()->recalc_sigpending() Oleg Nesterov
2013-08-09 15:19       ` [PATCH 1/1] " Oleg Nesterov
2013-08-12 20:26         ` David Teigland
2013-08-09 13:28   ` Patch for lost wakeups Oleg Nesterov
2013-08-09 15:31   ` block_all_signals() must die (Was: Patch for lost wakeups) Oleg Nesterov
