From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 9 Aug 2013 11:21:07 -0700
Subject: Re: Patch for lost wakeups
From: Linus Torvalds
To: Oleg Nesterov, "linux-arch@vger.kernel.org"
Cc: Long Gao, Al Viro, Andrew Morton, Linux Kernel Mailing List
In-Reply-To: <20130809130457.GA27493@redhat.com>
References: <20130808191749.GA12062@redhat.com> <20130809130457.GA27493@redhat.com>
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Aug 9, 2013 at 6:04 AM, Oleg Nesterov wrote:
>
>> The case of signals is special, in that the "wakeup criteria" is
>> inside the scheduler itself, but conceptually the rule is the same.
>
> yes, and because the waiter lacks mb().

Hmm. Ok.

So say we have a sleeper that does __set_task_state(TASK_INTERRUPTIBLE)
(so without the memory barrier semantics of set_task_state()), so we
have one thread doing

    prev->state = TASK_INTERRUPTIBLE

    raw_spin_lock_irq(&rq->lock);
    if (signal_pending_state(prev->state, prev))
        prev->state = TASK_RUNNING;

and since it's a spin-lock, things can in theory percolate *into* the
critical region, but not out of it. So as far as other CPUs are
concerned, that thread might actually do the "test pending signals"
before it even sets the state to TASK_INTERRUPTIBLE. And yes, that can
race against new signals coming in. So it does look like a race.
I get the feeling that we should put a smp_wmb() into the top of
__schedule(), before the spinlock - that's basically free on any sane
architecture (yeah, yeah, a bad memory ordering implementation can make
even smp_wmb() expensive, but whatever - I can't find it in myself to
care about badly implemented microarchitectures when the common case
gets it right).

That said, at least MIPS does not make spin-locks be that kind of
"semi-transparent" locks, so this race cannot hit MIPS. And I'm pretty
sure Loongson has the same memory ordering as MIPS (ie "sync" will do a
memory barrier).

So this might be a real race for architectures that actually do
aggressive memory re-ordering _and_ actually implement spinlocks with
the aggressive acquire semantics. I think ia64 does. The ppc memory
ordering is confused enough that *maybe* it does, but I doubt it. On
most other architectures I think spinlocks end up being full memory
barriers.

I guess that instead of a "smp_wmb()", we could do another
"smp_mb__before_spinlock()" thing, like we already allow for other
architectures to do a weaker form of mb in case the spinlock is
already a full mb. That would allow avoiding extra synchronization.
Do a

    #ifndef smp_mb__before_spinlock
    #define smp_mb__before_spinlock() smp_wmb()
    #endif

in to not force everybody to implement it. Because a wmb+acquire
should be close enough to a full mb that nobody cares (ok, so reads
could move into the critical region from outside, but by the time
anybody has called "schedule()", I can't see it mattering, so "close
enough").

              Linus