Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

From: Boqun Feng <boqun.feng@gmail.com>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	mingo@kernel.org, linux-kernel@vger.kernel.org,
	paulmck@linux.vnet.ibm.com, corbet@lwn.net, mhocko@kernel.org,
	dhowells@redhat.com, torvalds@linux-foundation.org,
	will.deacon@arm.com, Michael Ellerman <mpe@ellerman.id.au>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Paul Mackerras <paulus@samba.org>
Subject: Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
Date: Thu, 12 Nov 2015 23:18:39 +0800	[thread overview]
Message-ID: <20151112151839.GE6314@fixme-laptop.cn.ibm.com> (raw)
In-Reply-To: <20151112150058.GA30321@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 4139 bytes --]

On Thu, Nov 12, 2015 at 04:00:58PM +0100, Oleg Nesterov wrote:
> On 11/12, Boqun Feng wrote:
[snip]
> >
> > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> > only guarantees that the memory operations following spin_lock() can't
> > be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
> > i.e. the case below can happen(assuming the spin_lock() is implemented
> > as ll/sc loop)
> >
> > 	spin_lock(&lock):
> > 	  r1 = *lock; // LL, r1 == 0
> > 	o = READ_ONCE(object); // could be reordered here.
> > 	  *lock = 1; // SC
> >
> > This could happen because of the ACQUIRE semantics of spin_lock(), and
> > the current implementation of spin_lock() on PPC allows this happen.
> >
> > (Cc PPC maintainers for their opinions on this one)
> 
> In this case the code above is obviously wrong. And I do not understand
> how we can rely on spin_unlock_wait() then.
> 
> And afaics do_exit() is buggy too then, see below.
> 
> > I think it's OK for it as an ACQUIRE(with a proper barrier) or even just
> > a control dependency to pair with spin_unlock(), for example, the
> > following snippet in do_exit() is OK, except the smp_mb() is redundant,
> > unless I'm missing something subtle:
> >
> > 	/*
> > 	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> > 	 * when the following two conditions become true.
> > 	 *   - There is race condition of mmap_sem (It is acquired by
> > 	 *     exit_mm()), and
> > 	 *   - SMI occurs before setting TASK_RUNINNG.
> > 	 *     (or hypervisor of virtual machine switches to other guest)
> > 	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
> > 	 *
> > 	 * To avoid it, we have to wait for releasing tsk->pi_lock which
> > 	 * is held by try_to_wake_up()
> > 	 */
> > 	smp_mb();
> > 	raw_spin_unlock_wait(&tsk->pi_lock);
> 
> Perhaps it is me who missed something. But I don't think we can remove
> this mb(). And at the same time it can't help on PPC if I understand

You are right, we need this smp_mb() to order the previous STORE of
->state with the LOAD of ->pi_lock. I missed that part because I saw all
the explicit STOREs of ->state in do_exit() are set_current_state()
which has a smp_mb() following the STOREs.

> your explanation above correctly.
> 
> To simplify, lets ignore exit_mm/down_read/etc. The exiting task does
> 
> 
> 	current->state = TASK_UNINTERRUPTIBLE;
> 	// without schedule() in between
> 	current->state = TASK_RUNNING;
> 
> 	smp_mb();
> 	spin_unlock_wait(pi_lock);
> 
> 	current->state = TASK_DEAD;
> 	schedule();
> 
> and we need to ensure that if we race with try_to_wake_up(TASK_UNINTERRUPTIBLE)
> it can't change TASK_DEAD back to RUNNING.
> 
> Without smp_mb() this can be reordered, spin_unlock_wait(pi_locked) can
> read the old "unlocked" state of pi_lock before we set UNINTERRUPTIBLE,
> so in fact we could have
> 
> 	current->state = TASK_UNINTERRUPTIBLE;
> 	
> 	spin_unlock_wait(pi_lock);
> 
> 	current->state = TASK_RUNNING;
> 
> 	current->state = TASK_DEAD;
> 
> and this can obviously race with ttwu() which can take pi_lock and see
> state == TASK_UNINTERRUPTIBLE after spin_unlock_wait().
> 

Yep, my mistake ;-)

> And, if I understand you correctly, this smp_mb() can't help on PPC.
> try_to_wake_up() can read task->state before it writes to *pi_lock.
> To me this doesn't really differ from the code above,
> 
> 	CPU 1 (do_exit)				CPU_2 (ttwu)
> 
> 						spin_lock(pi_lock):
> 						  r1 = *pi_lock; // r1 == 0;
> 	p->state = TASK_UNINTERRUPTIBLE;
> 						state = p->state;
> 	p->state = TASK_RUNNING;
> 	mb();
> 	spin_unlock_wait();
> 						*pi_lock = 1;
> 
> 	p->state = TASK_DEAD;
> 						if (state & TASK_UNINTERRUPTIBLE) // true
> 							p->state = RUNNING;
> 
> No?
> 

do_exit() is surely buggy if spin_lock() could work in this way.

> And smp_mb__before_spinlock() looks wrong too then.
> 

Maybe not? As smp_mb__before_spinlock() is used before a LOCK operation,
which has both LOAD part and STORE part unlike spin_unlock_wait()?

> Oleg.
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]