Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

From: Boqun Feng <boqun.feng@gmail.com>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	mingo@kernel.org, linux-kernel@vger.kernel.org,
	paulmck@linux.vnet.ibm.com, corbet@lwn.net, mhocko@kernel.org,
	dhowells@redhat.com, torvalds@linux-foundation.org,
	will.deacon@arm.com, Michael Ellerman <mpe@ellerman.id.au>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Paul Mackerras <paulus@samba.org>
Subject: Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
Date: Thu, 12 Nov 2015 15:14:51 +0800	[thread overview]
Message-ID: <20151112070915.GC6314@fixme-laptop.cn.ibm.com> (raw)
In-Reply-To: <20151111193953.GA23515@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 6173 bytes --]

On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote:
> On 11/11, Peter Zijlstra wrote:
> >
> > On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote:
> >
> > > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?
> >
> > I did wonder the same thing, it would simplify a number of things if
> > this were so.
> 
> Yes, me too.
> 
> Sometimes I even think it should have both ACQUIRE + RELEASE semantics.
> IOW, it should be "equivalent" to spin_lock() + spin_unlock().
> 
> Consider this code:
> 
> 	object_t *object;
> 	spinlock_t lock;
> 
> 	void update(void)
> 	{
> 		object_t *o;
> 
> 		spin_lock(&lock);
> 		o = READ_ONCE(object);
> 		if (o) {
> 			BUG_ON(o->dead);
> 			do_something(o);
> 		}
> 		spin_unlock(&lock);
> 	}
> 
> 	void destroy(void) // can be called only once, can't race with itself
> 	{
> 		object_t *o;
> 
> 		o = object;
> 		object = NULL;
> 
> 		/*
> 		 * pairs with lock/ACQUIRE. The next update() must see
> 		 * object == NULL after spin_lock();
> 		 */
> 		smp_mb();
> 
> 		spin_unlock_wait(&lock);
> 
> 		/*
> 		 * pairs with unlock/RELEASE. The previous update() has
> 		 * already passed BUG_ON(o->dead).
> 		 *
> 		 * (Yes, yes, in this particular case it is not needed,
> 		 *  we can rely on the control dependency).
> 		 */
> 		smp_mb();
> 
> 		o->dead = true;
> 	}
> 
> I believe the code above is correct and it needs the barriers on both sides.
> 

Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
only guarantees that the memory operations following spin_lock() can't
be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
i.e. the case below can happen(assuming the spin_lock() is implemented
as ll/sc loop)

	spin_lock(&lock):
	  r1 = *lock; // LL, r1 == 0
	o = READ_ONCE(object); // could be reordered here.
	  *lock = 1; // SC

This could happen because of the ACQUIRE semantics of spin_lock(), and 
the current implementation of spin_lock() on PPC allows this happen.

(Cc PPC maintainers for their opinions on this one)

Therefore the case below can happen:

	CPU 1			CPU 2			CPU 3
	==================	====================	==============
							spin_unlock(&lock);
				spin_lock(&lock):
				  r1 = *lock; // r1 == 0;
				o = READ_ONCE(object); // reordered here
	object = NULL;
	smp_mb();
	spin_unlock_wait(&lock);
				  *lock = 1;
	smp_mb();		
	o->dead = true;
				if (o) // true
				  BUG_ON(o->dead); // true!!

To show this, I also translate this situation into a PPC litmus for
herd[1]:

	PPC spin-lock-wait
	"
	r1: local variable of lock
	r2: constant 1
	r3: constant 0 or NULL
	r4: local variable of object, i.e. o
	r5: local variable of *o (simulate ->dead as I didn't know 
	    how to write fields of structure in herd ;-()
	r13: the address of lock, i.e. &lock
	r14: the address of object, i.e. &object
	"
	{
	0:r1=0;0:r2=1;0:r3=0;0:r13=lock;0:r14=object;
	1:r1=0;1:r2=1;1:r3=0;1:r4=0;1:r5=0;1:r13=lock;1:r14=object;
	2:r1=0;2:r13=lock;
	lock=1; object=old; old=0;
	}

	P0              | P1                 | P2 ;
	ld r4,0(r14)    | Lock:              | stw r1,0(r13);
	std r3,0(r14)   | lwarx r1,r3,r13    | ;
	                | cmpwi r1,0         | ;
	sync            | bne Lock           | ;
	                | stwcx. r2,r3,r13   | ;
	Wait:           | bne Lock           | ;
	lwz r1,0(r13)   | lwsync             | ;
	cmpwi r1,0      | ld r4,0(r14)       | ;
	bne Wait        | cmpwi r4,0         | ;
	                | beq Fail           | ;
	sync            | lwz r5, 0(r4)      | ;
	stw r2,0(r4)    | Fail:              | ;
	                | lwsync             | ;
	                | stw r3, 0(r13)     | ;

	exists
	(1:r4=old /\ 1:r5=1)

,whose result says that (1:r4=old /\ 1:r5=1) can happen:

	Test spin-lock-wait Allowed
	States 3
	1:r4=0; 1:r5=0;
	1:r4=old; 1:r5=0;
	1:r4=old; 1:r5=1;
	Loop Ok
	Witnesses
	Positive: 18 Negative: 108
	Condition exists (1:r4=old /\ 1:r5=1)
	Observation spin-lock-wait Sometimes 18 108
	Hash=244f8c0f91df5a5ed985500ed7230272

Please note that I use backwards jump in this litmus, which is only
supported by herd(not by ppcmem[2]), and it will take a while to get the
result. And I'm not that confident that I'm familiar with this tool,
maybe Paul and Will can help check my translate and usage here ;-)

IIUC, the problem here is that spin_lock_wait() can be implemented with
only LOAD operations, and to have a RELEASE semantics, one primivite
must have a *STORE* part, therefore spin_lock_wait() can not be a
RELEASE.

So.. we probably need to take the lock here.

> If it is wrong, then task_work_run() is buggy: it relies on mb() implied by
> cmpxchg() before spin_unlock_wait() the same way: the next task_work_cancel()
> should see the result of our cmpxchg(), it must not try to read work->next or
> work->func.
> 
> Hmm. Not sure I really understand what I am trying to say... Perhaps in fact
> I mean that unlock_wait() should be removed because it is too subtle for me ;)
> 

;-)

I think it's OK for it as an ACQUIRE(with a proper barrier) or even just
a control dependency to pair with spin_unlock(), for example, the
following snippet in do_exit() is OK, except the smp_mb() is redundant,
unless I'm missing something subtle:

	/*
	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
	 * when the following two conditions become true.
	 *   - There is race condition of mmap_sem (It is acquired by
	 *     exit_mm()), and
	 *   - SMI occurs before setting TASK_RUNINNG.
	 *     (or hypervisor of virtual machine switches to other guest)
	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
	 *
	 * To avoid it, we have to wait for releasing tsk->pi_lock which
	 * is held by try_to_wake_up()
	 */
	smp_mb();
	raw_spin_unlock_wait(&tsk->pi_lock);

	/* causes final put_task_struct in finish_task_switch(). */
	tsk->state = TASK_DEAD;

Ref:
[1]: https://github.com/herd/herdtools
[2]: http://www.cl.cam.ac.uk/~pes20/ppcmem/index.html

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]