From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753509AbbKLHQX (ORCPT <rfc822;w@1wt.eu>);
	Thu, 12 Nov 2015 02:16:23 -0500
Received: from mail-ig0-f179.google.com ([209.85.213.179]:38411 "EHLO
	mail-ig0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752997AbbKLHQV (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Thu, 12 Nov 2015 02:16:21 -0500
Date: Thu, 12 Nov 2015 15:14:51 +0800
From: Boqun Feng <boqun.feng@gmail.com>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>, mingo@kernel.org,
        linux-kernel@vger.kernel.org, paulmck@linux.vnet.ibm.com,
        corbet@lwn.net, mhocko@kernel.org, dhowells@redhat.com,
        torvalds@linux-foundation.org, will.deacon@arm.com,
        Michael Ellerman <mpe@ellerman.id.au>,
        Benjamin Herrenschmidt <benh@kernel.crashing.org>,
        Paul Mackerras <paulus@samba.org>
Subject: Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
Message-ID: <20151112070915.GC6314@fixme-laptop.cn.ibm.com>
References: <20151102132901.157178466@infradead.org>
 <20151102134941.005198372@infradead.org>
 <20151103175958.GA4800@redhat.com>
 <20151111093939.GA6314@fixme-laptop.cn.ibm.com>
 <20151111121232.GN17308@twins.programming.kicks-ass.net>
 <20151111193953.GA23515@redhat.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
	protocol="application/pgp-signature"; boundary="GyRA7555PLgSTuth"
Content-Disposition: inline
In-Reply-To: <20151111193953.GA23515@redhat.com>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


--GyRA7555PLgSTuth
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote:
> On 11/11, Peter Zijlstra wrote:
> >
> > On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote:
> >
> > > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?
> >
> > I did wonder the same thing, it would simplify a number of things if
> > this were so.
>=20
> Yes, me too.
>=20
> Sometimes I even think it should have both ACQUIRE + RELEASE semantics.
> IOW, it should be "equivalent" to spin_lock() + spin_unlock().
>=20
> Consider this code:
>=20
> 	object_t *object;
> 	spinlock_t lock;
>=20
> 	void update(void)
> 	{
> 		object_t *o;
>=20
> 		spin_lock(&lock);
> 		o =3D READ_ONCE(object);
> 		if (o) {
> 			BUG_ON(o->dead);
> 			do_something(o);
> 		}
> 		spin_unlock(&lock);
> 	}
>=20
> 	void destroy(void) // can be called only once, can't race with itself
> 	{
> 		object_t *o;
>=20
> 		o =3D object;
> 		object =3D NULL;
>=20
> 		/*
> 		 * pairs with lock/ACQUIRE. The next update() must see
> 		 * object =3D=3D NULL after spin_lock();
> 		 */
> 		smp_mb();
>=20
> 		spin_unlock_wait(&lock);
>=20
> 		/*
> 		 * pairs with unlock/RELEASE. The previous update() has
> 		 * already passed BUG_ON(o->dead).
> 		 *
> 		 * (Yes, yes, in this particular case it is not needed,
> 		 *  we can rely on the control dependency).
> 		 */
> 		smp_mb();
>=20
> 		o->dead =3D true;
> 	}
>=20
> I believe the code above is correct and it needs the barriers on both sid=
es.
>=20

Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
only guarantees that the memory operations following spin_lock() can't
be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
i.e. the case below can happen(assuming the spin_lock() is implemented
as ll/sc loop)

	spin_lock(&lock):
	  r1 =3D *lock; // LL, r1 =3D=3D 0
	o =3D READ_ONCE(object); // could be reordered here.
	  *lock =3D 1; // SC

This could happen because of the ACQUIRE semantics of spin_lock(), and=20
the current implementation of spin_lock() on PPC allows this happen.

(Cc PPC maintainers for their opinions on this one)

Therefore the case below can happen:

	CPU 1			CPU 2			CPU 3
	=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D	=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D	=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D
							spin_unlock(&lock);
				spin_lock(&lock):
				  r1 =3D *lock; // r1 =3D=3D 0;
				o =3D READ_ONCE(object); // reordered here
	object =3D NULL;
	smp_mb();
	spin_unlock_wait(&lock);
				  *lock =3D 1;
	smp_mb();	=09
	o->dead =3D true;
				if (o) // true
				  BUG_ON(o->dead); // true!!


To show this, I also translate this situation into a PPC litmus for
herd[1]:

	PPC spin-lock-wait
	"
	r1: local variable of lock
	r2: constant 1
	r3: constant 0 or NULL
	r4: local variable of object, i.e. o
	r5: local variable of *o (simulate ->dead as I didn't know=20
	    how to write fields of structure in herd ;-()
	r13: the address of lock, i.e. &lock
	r14: the address of object, i.e. &object
	"
	{
	0:r1=3D0;0:r2=3D1;0:r3=3D0;0:r13=3Dlock;0:r14=3Dobject;
	1:r1=3D0;1:r2=3D1;1:r3=3D0;1:r4=3D0;1:r5=3D0;1:r13=3Dlock;1:r14=3Dobject;
	2:r1=3D0;2:r13=3Dlock;
	lock=3D1; object=3Dold; old=3D0;
	}

	P0              | P1                 | P2 ;
	ld r4,0(r14)    | Lock:              | stw r1,0(r13);
	std r3,0(r14)   | lwarx r1,r3,r13    | ;
	                | cmpwi r1,0         | ;
	sync            | bne Lock           | ;
	                | stwcx. r2,r3,r13   | ;
	Wait:           | bne Lock           | ;
	lwz r1,0(r13)   | lwsync             | ;
	cmpwi r1,0      | ld r4,0(r14)       | ;
	bne Wait        | cmpwi r4,0         | ;
	                | beq Fail           | ;
	sync            | lwz r5, 0(r4)      | ;
	stw r2,0(r4)    | Fail:              | ;
	                | lwsync             | ;
	                | stw r3, 0(r13)     | ;

	exists
	(1:r4=3Dold /\ 1:r5=3D1)

,whose result says that (1:r4=3Dold /\ 1:r5=3D1) can happen:

	Test spin-lock-wait Allowed
	States 3
	1:r4=3D0; 1:r5=3D0;
	1:r4=3Dold; 1:r5=3D0;
	1:r4=3Dold; 1:r5=3D1;
	Loop Ok
	Witnesses
	Positive: 18 Negative: 108
	Condition exists (1:r4=3Dold /\ 1:r5=3D1)
	Observation spin-lock-wait Sometimes 18 108
	Hash=3D244f8c0f91df5a5ed985500ed7230272

Please note that I use backwards jump in this litmus, which is only
supported by herd(not by ppcmem[2]), and it will take a while to get the
result. And I'm not that confident that I'm familiar with this tool,
maybe Paul and Will can help check my translate and usage here ;-)


IIUC, the problem here is that spin_lock_wait() can be implemented with
only LOAD operations, and to have a RELEASE semantics, one primivite
must have a *STORE* part, therefore spin_lock_wait() can not be a
RELEASE.

So.. we probably need to take the lock here.

> If it is wrong, then task_work_run() is buggy: it relies on mb() implied =
by
> cmpxchg() before spin_unlock_wait() the same way: the next task_work_canc=
el()
> should see the result of our cmpxchg(), it must not try to read work->nex=
t or
> work->func.
>=20
> Hmm. Not sure I really understand what I am trying to say... Perhaps in f=
act
> I mean that unlock_wait() should be removed because it is too subtle for =
me ;)
>=20

;-)

I think it's OK for it as an ACQUIRE(with a proper barrier) or even just
a control dependency to pair with spin_unlock(), for example, the
following snippet in do_exit() is OK, except the smp_mb() is redundant,
unless I'm missing something subtle:


	/*
	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
	 * when the following two conditions become true.
	 *   - There is race condition of mmap_sem (It is acquired by
	 *     exit_mm()), and
	 *   - SMI occurs before setting TASK_RUNINNG.
	 *     (or hypervisor of virtual machine switches to other guest)
	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
	 *
	 * To avoid it, we have to wait for releasing tsk->pi_lock which
	 * is held by try_to_wake_up()
	 */
	smp_mb();
	raw_spin_unlock_wait(&tsk->pi_lock);

	/* causes final put_task_struct in finish_task_switch(). */
	tsk->state =3D TASK_DEAD;

Ref:
[1]: https://github.com/herd/herdtools
[2]: http://www.cl.cam.ac.uk/~pes20/ppcmem/index.html

Regards,
Boqun

--GyRA7555PLgSTuth
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQEcBAABCAAGBQJWRDxnAAoJEEl56MO1B/q4Q/gH/jH3w7a8jlXWIzOX9Pr9u92C
VHifU0NJEBco/PX1O5OADgpAjEwnPGLMiWulV1zIVffsmT5qznoh+C0Lvp4nMJZ7
UQ2DrLHVN11INAQPZxnhb7tzheEqNFByUJkobeRqZijsnlwpV2ULIFfRNzj7ORzT
w8zknXY//G01fyRuB2BBzmuX77mpP7JkSycaYQz08dIbVlWhfamrhzZUvwZBjUlD
4lzauTyQOjx5zCsu4I5ZkFOktLC4So+2y0rxv+9p+4kcQA1780+uNdtCf+RfN84M
gUJdWpP30E13ZHR8aYiUgUuD8ARnL/8YcnHoxdw7agSyA0Ahcy58DzY20wpBY/8=
=Zf59
-----END PGP SIGNATURE-----

--GyRA7555PLgSTuth--