On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote:
> On 11/11, Peter Zijlstra wrote:
> >
> > On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote:
> >
> > > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?
> >
> > I did wonder the same thing, it would simplify a number of things if
> > this were so.
>
> Yes, me too.
>
> Sometimes I even think it should have both ACQUIRE + RELEASE semantics.
> IOW, it should be "equivalent" to spin_lock() + spin_unlock().
>
> Consider this code:
>
> 	object_t *object;
> 	spinlock_t lock;
>
> 	void update(void)
> 	{
> 		object_t *o;
>
> 		spin_lock(&lock);
> 		o = READ_ONCE(object);
> 		if (o) {
> 			BUG_ON(o->dead);
> 			do_something(o);
> 		}
> 		spin_unlock(&lock);
> 	}
>
> 	void destroy(void) // can be called only once, can't race with itself
> 	{
> 		object_t *o;
>
> 		o = object;
> 		object = NULL;
>
> 		/*
> 		 * pairs with lock/ACQUIRE. The next update() must see
> 		 * object == NULL after spin_lock();
> 		 */
> 		smp_mb();
>
> 		spin_unlock_wait(&lock);
>
> 		/*
> 		 * pairs with unlock/RELEASE. The previous update() has
> 		 * already passed BUG_ON(o->dead).
> 		 *
> 		 * (Yes, yes, in this particular case it is not needed,
> 		 * we can rely on the control dependency).
> 		 */
> 		smp_mb();
>
> 		o->dead = true;
> 	}
>
> I believe the code above is correct and it needs the barriers on both sides.
>

Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
only guarantees that the memory operations following spin_lock() can't
be reordered before the *LOAD* part of spin_lock(), not the *STORE*
part. I.e. the case below can happen (assuming spin_lock() is
implemented as an ll/sc loop):

	spin_lock(&lock):
	  r1 = *lock;	// LL, r1 == 0
			o = READ_ONCE(object); // could be reordered here.
	  *lock = 1;	// SC

This could happen because of the ACQUIRE semantics of spin_lock(), and
the current implementation of spin_lock() on PPC allows this to happen.

(Cc PPC maintainers for their opinions on this one)

Therefore the case below can happen:

	CPU 1			CPU 2			CPU 3
	==================	====================	==============
							spin_unlock(&lock);
				spin_lock(&lock):
				  r1 = *lock; // r1 == 0;
				o = READ_ONCE(object); // reordered here
	object = NULL;
	smp_mb();
	spin_unlock_wait(&lock);
				  *lock = 1;
	smp_mb();
	o->dead = true;
				if (o) // true
					BUG_ON(o->dead); // true!!

To show this, I also translated this situation into a PPC litmus test
for herd[1]:

	PPC spin-lock-wait
	"
	r1: local variable of lock
	r2: constant 1
	r3: constant 0 or NULL
	r4: local variable of object, i.e. o
	r5: local variable of *o (simulates ->dead, as I didn't know how
	    to write fields of a structure in herd ;-()
	r13: the address of lock, i.e. &lock
	r14: the address of object, i.e. &object
	"
	{
	0:r1=0;0:r2=1;0:r3=0;0:r13=lock;0:r14=object;
	1:r1=0;1:r2=1;1:r3=0;1:r4=0;1:r5=0;1:r13=lock;1:r14=object;
	2:r1=0;2:r13=lock;
	lock=1; object=old; old=0;
	}
	 P0            | P1               | P2            ;
	 ld r4,0(r14)  | Lock:            | stw r1,0(r13) ;
	 std r3,0(r14) | lwarx r1,r3,r13  |               ;
	               | cmpwi r1,0       |               ;
	 sync          | bne Lock         |               ;
	               | stwcx. r2,r3,r13 |               ;
	 Wait:         | bne Lock         |               ;
	 lwz r1,0(r13) | lwsync           |               ;
	 cmpwi r1,0    | ld r4,0(r14)     |               ;
	 bne Wait      | cmpwi r4,0       |               ;
	               | beq Fail         |               ;
	 sync          | lwz r5, 0(r4)    |               ;
	 stw r2,0(r4)  | Fail:            |               ;
	               | lwsync           |               ;
	               | stw r3, 0(r13)   |               ;
	exists (1:r4=old /\ 1:r5=1)

whose result says that (1:r4=old /\ 1:r5=1) can happen:

	Test spin-lock-wait Allowed
	States 3
	1:r4=0; 1:r5=0;
	1:r4=old; 1:r5=0;
	1:r4=old; 1:r5=1;
	Loop Ok
	Witnesses
	Positive: 18 Negative: 108
	Condition exists (1:r4=old /\ 1:r5=1)
	Observation spin-lock-wait Sometimes 18 108
	Hash=244f8c0f91df5a5ed985500ed7230272

Please note that I used a backwards jump in this litmus test, which is
only supported by herd (not by ppcmem[2]), so it will take a while to
get the result. And I'm not that confident that I'm familiar with this
tool, so maybe Paul and Will can help check my translation and usage
here ;-)

IIUC, the problem here is that spin_unlock_wait() can be implemented
with only LOAD operations, and to have RELEASE semantics, a primitive
must have a *STORE* part; therefore spin_unlock_wait() cannot be a
RELEASE.

So.. we probably need to take the lock here.
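IOW, something like the sketch below (just to illustrate what "take the
lock" means, reusing the object/lock declarations from Oleg's example
above; not a tested patch):

	void destroy(void) // can be called only once, can't race with itself
	{
		object_t *o;

		/*
		 * ACQUIRE: if a previous update() saw a non-NULL object,
		 * it has released the lock, i.e. it has already passed
		 * BUG_ON(o->dead) and finished do_something(o).
		 */
		spin_lock(&lock);
		o = object;
		object = NULL;
		/*
		 * RELEASE: the next update() must take the lock first,
		 * so it observes object == NULL and never dereferences o.
		 */
		spin_unlock(&lock);

		o->dead = true;
	}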
> If it is wrong, then task_work_run() is buggy: it relies on mb() implied by
> cmpxchg() before spin_unlock_wait() the same way: the next task_work_cancel()
> should see the result of our cmpxchg(), it must not try to read work->next or
> work->func.
>
> Hmm. Not sure I really understand what I am trying to say... Perhaps in fact
> I mean that unlock_wait() should be removed because it is too subtle for me ;)
>

;-) I think it's OK for it to be an ACQUIRE (with a proper barrier), or
even just a control dependency to pair with spin_unlock(). For example,
the following snippet in do_exit() is OK, except that the smp_mb() is
redundant, unless I'm missing something subtle:

	/*
	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
	 * when the following two conditions become true.
	 *   - There is race condition of mmap_sem (It is acquired by
	 *     exit_mm()), and
	 *   - SMI occurs before setting TASK_RUNNING.
	 *     (or hypervisor of virtual machine switches to other guest)
	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
	 *
	 * To avoid it, we have to wait for releasing tsk->pi_lock which
	 * is held by try_to_wake_up()
	 */
	smp_mb();
	raw_spin_unlock_wait(&tsk->pi_lock);

	/* causes final put_task_struct in finish_task_switch(). */
	tsk->state = TASK_DEAD;

Ref:
[1]: https://github.com/herd/herdtools
[2]: http://www.cl.cam.ac.uk/~pes20/ppcmem/index.html

Regards,
Boqun