Re: [PATCH v2] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

From: Pan Xinhui <xinhui@linux.vnet.ibm.com>
To: Boqun Feng <boqun.feng@gmail.com>, Xinhui Pan <mnipxh@gmail.com>
Cc: Waiman Long <longman@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
Date: Wed, 8 Feb 2017 15:09:33 +0800	[thread overview]
Message-ID: <9c22e27d-e80e-81f9-d7a0-992e80425f44@linux.vnet.ibm.com> (raw)
In-Reply-To: <20170208060939.GC9178@tardis.cn.ibm.com>


在 2017/2/8 14:09, Boqun Feng 写道:
> On Wed, Feb 08, 2017 at 12:05:40PM +0800, Boqun Feng wrote:
>> On Wed, Feb 08, 2017 at 11:39:10AM +0800, Xinhui Pan wrote:
>>> 2016-12-26 4:26 GMT+08:00 Waiman Long <longman@redhat.com>:
>>>
>>>> A number of cmpxchg calls in qspinlock_paravirt.h were replaced by more
>>>> relaxed versions to improve performance on architectures that use LL/SC.
>>>>
>>>> All the locking related cmpxchg's are replaced with the _acquire
>>>> variants:
>>>>  - pv_queued_spin_steal_lock()
>>>>  - trylock_clear_pending()
>>>>
>>>> The cmpxchg's related to hashing are replaced by either by the _release
>>>> or the _relaxed variants. See the inline comment for details.
>>>>
>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>>>
>>>>  v1->v2:
>>>>   - Add comments in changelog and code for the rationale of the change.
>>>>
>>>> ---
>>>>  kernel/locking/qspinlock_paravirt.h | 50 ++++++++++++++++++++++++------
>>>> -------
>>>>  1 file changed, 33 insertions(+), 17 deletions(-)
>>>>
>>>>
>>>> @@ -323,8 +329,14 @@ static void pv_wait_node(struct mcs_spinlock *node,
>>>> struct mcs_spinlock *prev)
>>>>                  * If pv_kick_node() changed us to vcpu_hashed, retain that
>>>>                  * value so that pv_wait_head_or_lock() knows to not also
>>>> try
>>>>                  * to hash this lock.
>>>> +                *
>>>> +                * The smp_store_mb() and control dependency above will
>>>> ensure
>>>> +                * that state change won't happen before that.
>>>> Synchronizing
>>>> +                * with pv_kick_node() wrt hashing by this waiter or by the
>>>> +                * lock holder is done solely by the state variable. There
>>>> is
>>>> +                * no other ordering requirement.
>>>>                  */
>>>> -               cmpxchg(&pn->state, vcpu_halted, vcpu_running);
>>>> +               cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_running);
>>>>
>>>>                 /*
>>>>                  * If the locked flag is still not set after wakeup, it is
>>>> a
>>>> @@ -360,9 +372,12 @@ static void pv_kick_node(struct qspinlock *lock,
>>>> struct mcs_spinlock *node)
>>>>          * pv_wait_node(). If OTOH this fails, the vCPU was running and
>>>> will
>>>>          * observe its next->locked value and advance itself.
>>>>          *
>>>> -        * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>>>> +        * Matches with smp_store_mb() and cmpxchg_relaxed() in
>>>> pv_wait_node().
>>>> +        * A release barrier is used here to ensure that node->locked is
>>>> +        * always set before changing the state. See comment in
>>>> pv_wait_node().
>>>>          */
>>>> -       if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>>>> +       if (cmpxchg_release(&pn->state, vcpu_halted, vcpu_hashed)
>>>> +                       != vcpu_halted)
>>>>                 return;
>>>>
>>>> hi, Waiman
>>> We can't use _release here, a full barrier is needed.
>>>
>>> There is pv_kick_node vs pv_wait_head_or_lock
>>>
>>> [w] l->locked = _Q_SLOW_VAL  //reordered here
>>>
>>> if (READ_ONCE(pn->state) == vcpu_hashed) //False.
>>>
>>>                    lp = (struct qspinlock **)1;
>>>
>>> [STORE] pn->state = vcpu_hashed                        lp = pv_hash(lock,
>>> pn);
>>> pv_hash()                                                                if
>>> (xchg(&l->locked, _Q_SLOW_VAL) == 0) // fasle, not unhashed.
>>>
>>
>> This analysis is correct, but..
>>
>
> Hmm.. look at this again, I don't think this analysis is meaningful,
> let's say the reordering didn't happen, we still got(similar to your
> case):
>
but there is
						cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_running);

> 						if (READ_ONCE(pn->state) == vcpu_hashed) // false.
> 						  lp = (struct qspinlock **)1;
>
> cmpxchg(pn->state, vcpu_halted, vcpu_hashed);
this cmpxchg will observe the cmpxchg_relaxed above, so this cmpxchg will fail as pn->state is vcpu_running.
No bug here..

> 						  if(!lp) {
> 						    lp = pv_hash(lock, pn);
> WRITE_ONCE(l->locked, _Q_SLOW_VAL);
> pv_hash();
> 						    if (xchg(&l->locked, _Q_SLOW_VAL) == 0) // fasle, not unhashed.
>
> , right?

>
> Actually, I think this or your case could not happen because we have
>
> 	cmpxchg(pn->state, vcpu_halted, vcpu_running);
>
> in pv_wait_node(), which makes us either observe vcpu_hashed or set
> pn->state to vcpu_running before pv_kick_node() trying to do the hash.
>
> I may miss something subtle, but does switching back to cmpxchg() could
> fix the RCU stall you observed?
>
> Regards,
> Boqun
>
>>> Then the same lock has hashed twice but only unhashed once. So at last as
>>> the hash table grows big, we hit RCU stall.
>>>
>>> I hit RCU stall when I run netperf benchmark
>>>
>>
>> how will a big hash table hit RCU stall? Do you have the call trace for
>> your RCU stall?
>>
>> Regards,
>> Boqun
>>
>>> thanks
>>> xinhui
>>>
>>>
>>>> --
>>>> 1.8.3.1
>>>>
>>>>
>
>