All of lore.kernel.org
 help / color / mirror / Atom feed
From: Boqun Feng <boqun.feng@gmail.com>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	mingo@kernel.org, linux-kernel@vger.kernel.org,
	paulmck@linux.vnet.ibm.com, corbet@lwn.net, mhocko@kernel.org,
	dhowells@redhat.com, torvalds@linux-foundation.org,
	will.deacon@arm.com, Michael Ellerman <mpe@ellerman.id.au>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Paul Mackerras <paulus@samba.org>
Subject: Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
Date: Thu, 12 Nov 2015 15:14:51 +0800	[thread overview]
Message-ID: <20151112070915.GC6314@fixme-laptop.cn.ibm.com> (raw)
In-Reply-To: <20151111193953.GA23515@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 6173 bytes --]

On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote:
> On 11/11, Peter Zijlstra wrote:
> >
> > On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote:
> >
> > > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?
> >
> > I did wonder the same thing, it would simplify a number of things if
> > this were so.
> 
> Yes, me too.
> 
> Sometimes I even think it should have both ACQUIRE + RELEASE semantics.
> IOW, it should be "equivalent" to spin_lock() + spin_unlock().
> 
> Consider this code:
> 
> 	object_t *object;
> 	spinlock_t lock;
> 
> 	void update(void)
> 	{
> 		object_t *o;
> 
> 		spin_lock(&lock);
> 		o = READ_ONCE(object);
> 		if (o) {
> 			BUG_ON(o->dead);
> 			do_something(o);
> 		}
> 		spin_unlock(&lock);
> 	}
> 
> 	void destroy(void) // can be called only once, can't race with itself
> 	{
> 		object_t *o;
> 
> 		o = object;
> 		object = NULL;
> 
> 		/*
> 		 * pairs with lock/ACQUIRE. The next update() must see
> 		 * object == NULL after spin_lock();
> 		 */
> 		smp_mb();
> 
> 		spin_unlock_wait(&lock);
> 
> 		/*
> 		 * pairs with unlock/RELEASE. The previous update() has
> 		 * already passed BUG_ON(o->dead).
> 		 *
> 		 * (Yes, yes, in this particular case it is not needed,
> 		 *  we can rely on the control dependency).
> 		 */
> 		smp_mb();
> 
> 		o->dead = true;
> 	}
> 
> I believe the code above is correct and it needs the barriers on both sides.
> 

Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
only guarantees that the memory operations following spin_lock() can't
be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
i.e. the case below can happen(assuming the spin_lock() is implemented
as ll/sc loop)

	spin_lock(&lock):
	  r1 = *lock; // LL, r1 == 0
	o = READ_ONCE(object); // could be reordered here.
	  *lock = 1; // SC

This could happen because of the ACQUIRE semantics of spin_lock(), and 
the current implementation of spin_lock() on PPC allows this happen.

(Cc PPC maintainers for their opinions on this one)

Therefore the case below can happen:

	CPU 1			CPU 2			CPU 3
	==================	====================	==============
							spin_unlock(&lock);
				spin_lock(&lock):
				  r1 = *lock; // r1 == 0;
				o = READ_ONCE(object); // reordered here
	object = NULL;
	smp_mb();
	spin_unlock_wait(&lock);
				  *lock = 1;
	smp_mb();		
	o->dead = true;
				if (o) // true
				  BUG_ON(o->dead); // true!!


To show this, I also translate this situation into a PPC litmus for
herd[1]:

	PPC spin-lock-wait
	"
	r1: local variable of lock
	r2: constant 1
	r3: constant 0 or NULL
	r4: local variable of object, i.e. o
	r5: local variable of *o (simulate ->dead as I didn't know 
	    how to write fields of structure in herd ;-()
	r13: the address of lock, i.e. &lock
	r14: the address of object, i.e. &object
	"
	{
	0:r1=0;0:r2=1;0:r3=0;0:r13=lock;0:r14=object;
	1:r1=0;1:r2=1;1:r3=0;1:r4=0;1:r5=0;1:r13=lock;1:r14=object;
	2:r1=0;2:r13=lock;
	lock=1; object=old; old=0;
	}

	P0              | P1                 | P2 ;
	ld r4,0(r14)    | Lock:              | stw r1,0(r13);
	std r3,0(r14)   | lwarx r1,r3,r13    | ;
	                | cmpwi r1,0         | ;
	sync            | bne Lock           | ;
	                | stwcx. r2,r3,r13   | ;
	Wait:           | bne Lock           | ;
	lwz r1,0(r13)   | lwsync             | ;
	cmpwi r1,0      | ld r4,0(r14)       | ;
	bne Wait        | cmpwi r4,0         | ;
	                | beq Fail           | ;
	sync            | lwz r5, 0(r4)      | ;
	stw r2,0(r4)    | Fail:              | ;
	                | lwsync             | ;
	                | stw r3, 0(r13)     | ;

	exists
	(1:r4=old /\ 1:r5=1)

,whose result says that (1:r4=old /\ 1:r5=1) can happen:

	Test spin-lock-wait Allowed
	States 3
	1:r4=0; 1:r5=0;
	1:r4=old; 1:r5=0;
	1:r4=old; 1:r5=1;
	Loop Ok
	Witnesses
	Positive: 18 Negative: 108
	Condition exists (1:r4=old /\ 1:r5=1)
	Observation spin-lock-wait Sometimes 18 108
	Hash=244f8c0f91df5a5ed985500ed7230272

Please note that I use backwards jump in this litmus, which is only
supported by herd(not by ppcmem[2]), and it will take a while to get the
result. And I'm not that confident that I'm familiar with this tool,
maybe Paul and Will can help check my translate and usage here ;-)


IIUC, the problem here is that spin_lock_wait() can be implemented with
only LOAD operations, and to have a RELEASE semantics, one primivite
must have a *STORE* part, therefore spin_lock_wait() can not be a
RELEASE.

So.. we probably need to take the lock here.

> If it is wrong, then task_work_run() is buggy: it relies on mb() implied by
> cmpxchg() before spin_unlock_wait() the same way: the next task_work_cancel()
> should see the result of our cmpxchg(), it must not try to read work->next or
> work->func.
> 
> Hmm. Not sure I really understand what I am trying to say... Perhaps in fact
> I mean that unlock_wait() should be removed because it is too subtle for me ;)
> 

;-)

I think it's OK for it as an ACQUIRE(with a proper barrier) or even just
a control dependency to pair with spin_unlock(), for example, the
following snippet in do_exit() is OK, except the smp_mb() is redundant,
unless I'm missing something subtle:


	/*
	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
	 * when the following two conditions become true.
	 *   - There is race condition of mmap_sem (It is acquired by
	 *     exit_mm()), and
	 *   - SMI occurs before setting TASK_RUNINNG.
	 *     (or hypervisor of virtual machine switches to other guest)
	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
	 *
	 * To avoid it, we have to wait for releasing tsk->pi_lock which
	 * is held by try_to_wake_up()
	 */
	smp_mb();
	raw_spin_unlock_wait(&tsk->pi_lock);

	/* causes final put_task_struct in finish_task_switch(). */
	tsk->state = TASK_DEAD;

Ref:
[1]: https://github.com/herd/herdtools
[2]: http://www.cl.cam.ac.uk/~pes20/ppcmem/index.html

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

  parent reply	other threads:[~2015-11-12  7:16 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-11-02 13:29 [PATCH 0/4] scheduler ordering bits Peter Zijlstra
2015-11-02 13:29 ` [PATCH 1/4] sched: Better document the try_to_wake_up() barriers Peter Zijlstra
2015-12-04  0:09   ` Byungchul Park
2015-12-04  0:58   ` Byungchul Park
2015-11-02 13:29 ` [PATCH 2/4] sched: Document Program-Order guarantees Peter Zijlstra
2015-11-02 20:27   ` Paul Turner
2015-11-02 20:34     ` Peter Zijlstra
2015-11-02 22:09       ` Paul Turner
2015-11-02 22:12         ` Peter Zijlstra
2015-11-20 10:02     ` Peter Zijlstra
2015-11-20 14:08       ` Boqun Feng
2015-11-20 14:18         ` Peter Zijlstra
2015-11-20 14:21           ` Boqun Feng
2015-11-20 19:41             ` Peter Zijlstra
2015-11-02 13:29 ` [PATCH 3/4] sched: Fix a race in try_to_wake_up() vs schedule() Peter Zijlstra
2015-11-02 13:29 ` [PATCH 4/4] locking: Introduce smp_cond_acquire() Peter Zijlstra
2015-11-02 13:57   ` Peter Zijlstra
2015-11-02 17:43     ` Will Deacon
2015-11-03  1:14       ` Paul E. McKenney
2015-11-03  1:25         ` Linus Torvalds
2015-11-02 17:42   ` Will Deacon
2015-11-02 18:08   ` Linus Torvalds
2015-11-02 18:37     ` Will Deacon
2015-11-02 19:17       ` Linus Torvalds
2015-11-02 19:57         ` Will Deacon
2015-11-02 20:23           ` Peter Zijlstra
2015-11-02 21:56         ` Peter Zijlstra
2015-11-03  1:57         ` Paul E. McKenney
2015-11-03 19:40           ` Linus Torvalds
2015-11-04  3:57             ` Paul E. McKenney
2015-11-04  4:43               ` Linus Torvalds
2015-11-04 12:54                 ` Paul E. McKenney
2015-11-02 20:36   ` David Howells
2015-11-02 20:40     ` Peter Zijlstra
2015-11-02 21:11     ` Linus Torvalds
2015-11-03 17:59   ` Oleg Nesterov
2015-11-03 18:23     ` Peter Zijlstra
2015-11-11  9:39     ` Boqun Feng
2015-11-11 10:34       ` Boqun Feng
2015-11-11 19:53         ` Oleg Nesterov
2015-11-12 13:50         ` Paul E. McKenney
2015-11-11 12:12       ` Peter Zijlstra
2015-11-11 19:39         ` Oleg Nesterov
2015-11-11 21:23           ` Linus Torvalds
2015-11-12  7:14           ` Boqun Feng [this message]
2015-11-12 10:28             ` Peter Zijlstra
2015-11-12 15:00             ` Oleg Nesterov
2015-11-12 14:40               ` Paul E. McKenney
2015-11-12 14:49                 ` Boqun Feng
2015-11-12 15:02                   ` Paul E. McKenney
2015-11-12 21:53                     ` Will Deacon
2015-11-12 14:50                 ` Peter Zijlstra
2015-11-12 15:01                   ` Paul E. McKenney
2015-11-12 15:08                     ` Peter Zijlstra
2015-11-12 15:20                       ` Paul E. McKenney
2015-11-12 21:25                         ` Will Deacon
2015-11-12 15:18               ` Boqun Feng
2015-11-12 18:38                 ` Oleg Nesterov
2015-11-12 18:02                   ` Peter Zijlstra
2015-11-12 19:33                     ` Oleg Nesterov
2015-11-12 18:59                       ` Paul E. McKenney
2015-11-12 21:33                         ` Will Deacon
2015-11-12 23:43                           ` Paul E. McKenney
2015-11-16 13:58                             ` Will Deacon
2015-11-12 18:21             ` Linus Torvalds
2015-11-12 22:09               ` Will Deacon
2015-11-16 15:56               ` Peter Zijlstra
2015-11-16 16:04                 ` Peter Zijlstra
2015-11-16 16:24                   ` Will Deacon
2015-11-16 16:44                     ` Paul E. McKenney
2015-11-16 16:46                       ` Will Deacon
2015-11-16 17:15                         ` Paul E. McKenney
2015-11-16 21:58                     ` Linus Torvalds
2015-11-17 11:51                       ` Will Deacon
2015-11-17 21:01                         ` Paul E. McKenney
2015-11-18 11:25                           ` Will Deacon
2015-11-19 18:01                             ` Will Deacon
2015-11-20 10:09                               ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20151112070915.GC6314@fixme-laptop.cn.ibm.com \
    --to=boqun.feng@gmail.com \
    --cc=benh@kernel.crashing.org \
    --cc=corbet@lwn.net \
    --cc=dhowells@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mhocko@kernel.org \
    --cc=mingo@kernel.org \
    --cc=mpe@ellerman.id.au \
    --cc=oleg@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=paulus@samba.org \
    --cc=peterz@infradead.org \
    --cc=torvalds@linux-foundation.org \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.