From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753509AbbKLHQX (ORCPT ); Thu, 12 Nov 2015 02:16:23 -0500 Received: from mail-ig0-f179.google.com ([209.85.213.179]:38411 "EHLO mail-ig0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752997AbbKLHQV (ORCPT ); Thu, 12 Nov 2015 02:16:21 -0500 Date: Thu, 12 Nov 2015 15:14:51 +0800 From: Boqun Feng To: Oleg Nesterov Cc: Peter Zijlstra , mingo@kernel.org, linux-kernel@vger.kernel.org, paulmck@linux.vnet.ibm.com, corbet@lwn.net, mhocko@kernel.org, dhowells@redhat.com, torvalds@linux-foundation.org, will.deacon@arm.com, Michael Ellerman , Benjamin Herrenschmidt , Paul Mackerras Subject: Re: [PATCH 4/4] locking: Introduce smp_cond_acquire() Message-ID: <20151112070915.GC6314@fixme-laptop.cn.ibm.com> References: <20151102132901.157178466@infradead.org> <20151102134941.005198372@infradead.org> <20151103175958.GA4800@redhat.com> <20151111093939.GA6314@fixme-laptop.cn.ibm.com> <20151111121232.GN17308@twins.programming.kicks-ass.net> <20151111193953.GA23515@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="GyRA7555PLgSTuth" Content-Disposition: inline In-Reply-To: <20151111193953.GA23515@redhat.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org --GyRA7555PLgSTuth Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote: > On 11/11, Peter Zijlstra wrote: > > > > On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote: > > > > > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE? > > > > I did wonder the same thing, it would simplify a number of things if > > this were so. >=20 > Yes, me too. >=20 > Sometimes I even think it should have both ACQUIRE + RELEASE semantics. > IOW, it should be "equivalent" to spin_lock() + spin_unlock(). >=20 > Consider this code: >=20 > object_t *object; > spinlock_t lock; >=20 > void update(void) > { > object_t *o; >=20 > spin_lock(&lock); > o =3D READ_ONCE(object); > if (o) { > BUG_ON(o->dead); > do_something(o); > } > spin_unlock(&lock); > } >=20 > void destroy(void) // can be called only once, can't race with itself > { > object_t *o; >=20 > o =3D object; > object =3D NULL; >=20 > /* > * pairs with lock/ACQUIRE. The next update() must see > * object =3D=3D NULL after spin_lock(); > */ > smp_mb(); >=20 > spin_unlock_wait(&lock); >=20 > /* > * pairs with unlock/RELEASE. The previous update() has > * already passed BUG_ON(o->dead). > * > * (Yes, yes, in this particular case it is not needed, > * we can rely on the control dependency). > */ > smp_mb(); >=20 > o->dead =3D true; > } >=20 > I believe the code above is correct and it needs the barriers on both sid= es. >=20 Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock() only guarantees that the memory operations following spin_lock() can't be reorder before the *LOAD* part of spin_lock() not the *STORE* part, i.e. the case below can happen(assuming the spin_lock() is implemented as ll/sc loop) spin_lock(&lock): r1 =3D *lock; // LL, r1 =3D=3D 0 o =3D READ_ONCE(object); // could be reordered here. *lock =3D 1; // SC This could happen because of the ACQUIRE semantics of spin_lock(), and=20 the current implementation of spin_lock() on PPC allows this happen. (Cc PPC maintainers for their opinions on this one) Therefore the case below can happen: CPU 1 CPU 2 CPU 3 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D spin_unlock(&lock); spin_lock(&lock): r1 =3D *lock; // r1 =3D=3D 0; o =3D READ_ONCE(object); // reordered here object =3D NULL; smp_mb(); spin_unlock_wait(&lock); *lock =3D 1; smp_mb(); =09 o->dead =3D true; if (o) // true BUG_ON(o->dead); // true!! To show this, I also translate this situation into a PPC litmus for herd[1]: PPC spin-lock-wait " r1: local variable of lock r2: constant 1 r3: constant 0 or NULL r4: local variable of object, i.e. o r5: local variable of *o (simulate ->dead as I didn't know=20 how to write fields of structure in herd ;-() r13: the address of lock, i.e. &lock r14: the address of object, i.e. &object " { 0:r1=3D0;0:r2=3D1;0:r3=3D0;0:r13=3Dlock;0:r14=3Dobject; 1:r1=3D0;1:r2=3D1;1:r3=3D0;1:r4=3D0;1:r5=3D0;1:r13=3Dlock;1:r14=3Dobject; 2:r1=3D0;2:r13=3Dlock; lock=3D1; object=3Dold; old=3D0; } P0 | P1 | P2 ; ld r4,0(r14) | Lock: | stw r1,0(r13); std r3,0(r14) | lwarx r1,r3,r13 | ; | cmpwi r1,0 | ; sync | bne Lock | ; | stwcx. r2,r3,r13 | ; Wait: | bne Lock | ; lwz r1,0(r13) | lwsync | ; cmpwi r1,0 | ld r4,0(r14) | ; bne Wait | cmpwi r4,0 | ; | beq Fail | ; sync | lwz r5, 0(r4) | ; stw r2,0(r4) | Fail: | ; | lwsync | ; | stw r3, 0(r13) | ; exists (1:r4=3Dold /\ 1:r5=3D1) ,whose result says that (1:r4=3Dold /\ 1:r5=3D1) can happen: Test spin-lock-wait Allowed States 3 1:r4=3D0; 1:r5=3D0; 1:r4=3Dold; 1:r5=3D0; 1:r4=3Dold; 1:r5=3D1; Loop Ok Witnesses Positive: 18 Negative: 108 Condition exists (1:r4=3Dold /\ 1:r5=3D1) Observation spin-lock-wait Sometimes 18 108 Hash=3D244f8c0f91df5a5ed985500ed7230272 Please note that I use backwards jump in this litmus, which is only supported by herd(not by ppcmem[2]), and it will take a while to get the result. And I'm not that confident that I'm familiar with this tool, maybe Paul and Will can help check my translate and usage here ;-) IIUC, the problem here is that spin_lock_wait() can be implemented with only LOAD operations, and to have a RELEASE semantics, one primivite must have a *STORE* part, therefore spin_lock_wait() can not be a RELEASE. So.. we probably need to take the lock here. > If it is wrong, then task_work_run() is buggy: it relies on mb() implied = by > cmpxchg() before spin_unlock_wait() the same way: the next task_work_canc= el() > should see the result of our cmpxchg(), it must not try to read work->nex= t or > work->func. >=20 > Hmm. Not sure I really understand what I am trying to say... Perhaps in f= act > I mean that unlock_wait() should be removed because it is too subtle for = me ;) >=20 ;-) I think it's OK for it as an ACQUIRE(with a proper barrier) or even just a control dependency to pair with spin_unlock(), for example, the following snippet in do_exit() is OK, except the smp_mb() is redundant, unless I'm missing something subtle: /* * The setting of TASK_RUNNING by try_to_wake_up() may be delayed * when the following two conditions become true. * - There is race condition of mmap_sem (It is acquired by * exit_mm()), and * - SMI occurs before setting TASK_RUNINNG. * (or hypervisor of virtual machine switches to other guest) * As a result, we may become TASK_RUNNING after becoming TASK_DEAD * * To avoid it, we have to wait for releasing tsk->pi_lock which * is held by try_to_wake_up() */ smp_mb(); raw_spin_unlock_wait(&tsk->pi_lock); /* causes final put_task_struct in finish_task_switch(). */ tsk->state =3D TASK_DEAD; Ref: [1]: https://github.com/herd/herdtools [2]: http://www.cl.cam.ac.uk/~pes20/ppcmem/index.html Regards, Boqun --GyRA7555PLgSTuth Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQEcBAABCAAGBQJWRDxnAAoJEEl56MO1B/q4Q/gH/jH3w7a8jlXWIzOX9Pr9u92C VHifU0NJEBco/PX1O5OADgpAjEwnPGLMiWulV1zIVffsmT5qznoh+C0Lvp4nMJZ7 UQ2DrLHVN11INAQPZxnhb7tzheEqNFByUJkobeRqZijsnlwpV2ULIFfRNzj7ORzT w8zknXY//G01fyRuB2BBzmuX77mpP7JkSycaYQz08dIbVlWhfamrhzZUvwZBjUlD 4lzauTyQOjx5zCsu4I5ZkFOktLC4So+2y0rxv+9p+4kcQA1780+uNdtCf+RfN84M gUJdWpP30E13ZHR8aYiUgUuD8ARnL/8YcnHoxdw7agSyA0Ahcy58DzY20wpBY/8= =Zf59 -----END PGP SIGNATURE----- --GyRA7555PLgSTuth--