From: Pan Xinhui
Subject: Re: [PATCH v2] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
To: Boqun Feng, Xinhui Pan
Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, linux-kernel@vger.kernel.org
Date: Wed, 8 Feb 2017 14:48:58 +0800
References: <1482697561-23848-1-git-send-email-longman@redhat.com> <20170208040540.GB9178@tardis.cn.ibm.com> <20170208060939.GC9178@tardis.cn.ibm.com>
In-Reply-To: <20170208060939.GC9178@tardis.cn.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2017/2/8 14:09, Boqun Feng wrote:
> On Wed, Feb 08, 2017 at 12:05:40PM +0800, Boqun Feng wrote:
>> On Wed, Feb 08, 2017 at 11:39:10AM +0800, Xinhui Pan wrote:
>>> 2016-12-26 4:26 GMT+08:00 Waiman Long:
>>>
>>>> A number of cmpxchg calls in qspinlock_paravirt.h were replaced by more
>>>> relaxed versions to improve performance on architectures that use LL/SC.
>>>>
>>>> All the locking related cmpxchg's are replaced with the _acquire
>>>> variants:
>>>>  - pv_queued_spin_steal_lock()
>>>>  - trylock_clear_pending()
>>>>
>>>> The cmpxchg's related to hashing are replaced either by the _release
>>>> or the _relaxed variants. See the inline comment for details.
>>>>
>>>> Signed-off-by: Waiman Long
>>>>
>>>> v1->v2:
>>>>  - Add comments in changelog and code for the rationale of the change.
>>>>
>>>> ---
>>>>  kernel/locking/qspinlock_paravirt.h | 50 ++++++++++++++++++++-------------
>>>>  1 file changed, 33 insertions(+), 17 deletions(-)
>>>>
>>>> @@ -323,8 +329,14 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
>>>>  		 * If pv_kick_node() changed us to vcpu_hashed, retain that
>>>>  		 * value so that pv_wait_head_or_lock() knows to not also try
>>>>  		 * to hash this lock.
>>>> +		 *
>>>> +		 * The smp_store_mb() and control dependency above will ensure
>>>> +		 * that state change won't happen before that. Synchronizing
>>>> +		 * with pv_kick_node() wrt hashing by this waiter or by the
>>>> +		 * lock holder is done solely by the state variable. There is
>>>> +		 * no other ordering requirement.
>>>>  		 */
>>>> -		cmpxchg(&pn->state, vcpu_halted, vcpu_running);
>>>> +		cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_running);
>>>>
>>>>  		/*
>>>>  		 * If the locked flag is still not set after wakeup, it is a
>>>> @@ -360,9 +372,12 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>>>>  	 * pv_wait_node(). If OTOH this fails, the vCPU was running and will
>>>>  	 * observe its next->locked value and advance itself.
>>>>  	 *
>>>> -	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>>>> +	 * Matches with smp_store_mb() and cmpxchg_relaxed() in pv_wait_node().
>>>> +	 * A release barrier is used here to ensure that node->locked is
>>>> +	 * always set before changing the state. See comment in pv_wait_node().
>>>>  	 */
>>>> -	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>>>> +	if (cmpxchg_release(&pn->state, vcpu_halted, vcpu_hashed)
>>>> +	    != vcpu_halted)
>>>>  		return;
>>>>
>>> Hi, Waiman,
>>> We can't use _release here, a full barrier is needed.
>>>
>>> There is a race between pv_kick_node() and pv_wait_head_or_lock():
>>>
>>>   pv_kick_node():                          pv_wait_head_or_lock():
>>>
>>>   [W] l->locked = _Q_SLOW_VAL  // reordered here
>>>
>>>                                            if (READ_ONCE(pn->state) == vcpu_hashed)  // False.
>>>                                                    lp = (struct qspinlock **)1;
>>>
>>>   [STORE] pn->state = vcpu_hashed          lp = pv_hash(lock, pn);
>>>   pv_hash()                                if (xchg(&l->locked, _Q_SLOW_VAL) == 0)  // false, not unhashed.
>>>
>>
>> This analysis is correct, but..
>>
>
> Hmm.. looking at this again, I don't think this analysis is meaningful.
> Let's say the reordering didn't happen; we still get (similar to your
> case):
>
>   pv_wait_head_or_lock():                   pv_kick_node():
>
>   if (READ_ONCE(pn->state) == vcpu_hashed)  // false.
>           lp = (struct qspinlock **)1;
>
>                                             cmpxchg(pn->state, vcpu_halted, vcpu_hashed);
>   if (!lp) {
>           lp = pv_hash(lock, pn);
>                                             WRITE_ONCE(l->locked, _Q_SLOW_VAL);
>                                             pv_hash();
>           if (xchg(&l->locked, _Q_SLOW_VAL) == 0)  // false, not unhashed.
>
> , right?
>
> Actually, I think this case or your case could not happen, because we have
>
> 	cmpxchg(pn->state, vcpu_halted, vcpu_running);
>
> in pv_wait_node(), which makes us either observe vcpu_hashed or set
> pn->state to vcpu_running before pv_kick_node() tries to do the hash.
>

Yep, there is still a race. We have to fix it.

So I think we must check the old value:

	old = xchg(&l->locked, _Q_SLOW_VAL);
	if (old == 0)
		/* do something */
	else if (old == _Q_SLOW_VAL)
		/* do something else */

> I may miss something subtle, but could switching back to cmpxchg()
> fix the RCU stall you observed?
>

Yes, just fix this cmpxchg and then there is no RCU stall.

> Regards,
> Boqun
>
>>> Then the same lock has been hashed twice but only unhashed once. So eventually,
>>> as the hash table grows big, we hit an RCU stall.
>>>
>>> I hit the RCU stall when I run the netperf benchmark.
>>>
>>
>> How will a big hash table hit an RCU stall? Do you have the call trace for
>> your RCU stall?
>>

Maybe too much time is spent on hashing? I am not sure.

>> Regards,
>> Boqun
>>
>>> thanks
>>> xinhui
>>>
>>>> --
>>>> 1.8.3.1
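
For concreteness, here is a rough, untested sketch of the "check the old value
of xchg()" idea above, written against the 4.9-era pv_wait_head_or_lock() in
kernel/locking/qspinlock_paravirt.h. The handling of the old == _Q_SLOW_VAL
branch (dropping the just-inserted duplicate hash entry via
WRITE_ONCE(*lp, NULL)) is an assumed interpretation of "do something else",
not code taken from this thread:

	if (!lp) {				/* ONCE */
		u8 old;

		lp = pv_hash(lock, pn);

		/*
		 * Hash before setting _Q_SLOW_VAL, so the unlock path
		 * that observes _Q_SLOW_VAL can find the hash entry.
		 */
		old = xchg(&l->locked, _Q_SLOW_VAL);
		if (old == 0) {
			/*
			 * The lock was free and is now ours: restore
			 * _Q_LOCKED_VAL and drop the hash entry we just
			 * inserted, since the slow unlock path won't run.
			 */
			WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
			WRITE_ONCE(*lp, NULL);
			goto gotlock;
		} else if (old == _Q_SLOW_VAL) {
			/*
			 * pv_kick_node() has already hashed this lock, so
			 * there are now two entries for it but the unlocker
			 * will unhash only one; drop our duplicate here so
			 * the hash table cannot keep growing.
			 */
			WRITE_ONCE(*lp, NULL);
		}
	}

Whether dropping the duplicate entry is sufficient, or whether pn->state also
needs reconciling, is exactly the ordering question discussed above.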