Message-ID: <563B7E98.4030708@hpe.com>
Date: Thu, 05 Nov 2015 11:06:48 -0500
From: Waiman Long
To: Peter Zijlstra
CC: Ingo Molnar, Thomas Gleixner, "H. Peter Anvin", x86@kernel.org,
    linux-kernel@vger.kernel.org, Scott J Norton, Douglas Hatch,
    Davidlohr Bueso
Subject: Re: [PATCH tip/locking/core v9 2/6] locking/qspinlock: prefetch next node cacheline
References: <1446247597-61863-1-git-send-email-Waiman.Long@hpe.com>
    <1446247597-61863-3-git-send-email-Waiman.Long@hpe.com>
    <20151102163626.GU3604@twins.programming.kicks-ass.net>
In-Reply-To: <20151102163626.GU3604@twins.programming.kicks-ass.net>

On 11/02/2015 11:36 AM, Peter Zijlstra wrote:
> On Fri, Oct 30, 2015 at 07:26:33PM -0400, Waiman Long wrote:
>> A queue head CPU, after acquiring the lock, will have to notify
>> the next CPU in the wait queue that it has become the new queue
>> head. This involves loading a new cacheline from the MCS node of the
>> next CPU. That operation can be expensive and adds to the latency of
>> the locking operation.
>>
>> This patch adds code to optimistically prefetch the next MCS node
>> cacheline if the next pointer is defined and the CPU has been spinning
>> on the MCS lock for a while. This reduces the locking latency and
>> improves the system throughput.
>>
>> Using a locking microbenchmark on a Haswell-EX system, this patch
>> can improve throughput by about 5%.
> How does it affect IVB-EX (which you were testing earlier IIRC)?

My testing on IVB-EX indicated that if the critical section is really
short, the change may actually slow things down a bit in some cases.
However, when the critical section is long enough that the prefetch
overhead can be hidden within the lock acquisition loop, there will be
a performance boost.

>> Signed-off-by: Waiman Long
>> ---
>>  kernel/locking/qspinlock.c |   21 +++++++++++++++++++++
>>  1 files changed, 21 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index 7868418..c1c8a1a 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -396,6 +396,7 @@ queue:
>>          * p,*,* -> n,*,*
>>          */
>>         old = xchg_tail(lock, tail);
>> +       next = NULL;
>>
>>         /*
>>          * if there was a previous node; link it and wait until reaching the
>> @@ -407,6 +408,16 @@ queue:
>>
>>                 pv_wait_node(node);
>>                 arch_mcs_spin_lock_contended(&node->locked);
>> +
>> +               /*
>> +                * While waiting for the MCS lock, the next pointer may have
>> +                * been set by another lock waiter. We optimistically load
>> +                * the next pointer & prefetch the cacheline for writing
>> +                * to reduce latency in the upcoming MCS unlock operation.
>> +                */
>> +               next = READ_ONCE(node->next);
>> +               if (next)
>> +                       prefetchw(next);
>>         }
> OK so far I suppose. Since we already read node->locked, which is in the
> same cacheline, also reading node->next isn't extra pressure. And we can
> then prefetch that cacheline.
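For reference, since the cacheline sharing is the whole point here:
below is a userspace rendition of the node layout (the kernel's
definition lives in kernel/locking/mcs_spinlock.h; the size check is
just for illustration). The next and locked fields sit side by side,
so the spin on ->locked already pulls in the line that ->next lives on:

#include <assert.h>

struct mcs_spinlock {
        struct mcs_spinlock *next;      /* successor in the wait queue */
        int locked;                     /* nonzero once the lock is handed over */
        int count;                      /* nesting count, used by qspinlock */
};

/* 8 + 4 + 4 = 16 bytes on 64-bit, well inside one 64-byte line. */
static_assert(sizeof(struct mcs_spinlock) <= 64,
              "MCS node fits in a single cacheline");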
>>         /*
>> @@ -426,6 +437,15 @@ queue:
>>                 cpu_relax();
>>
>>         /*
>> +        * If the next pointer is defined, we are no longer the tail.
>> +        * In this case, claim the spinlock & release the MCS lock.
>> +        */
>> +       if (next) {
>> +               set_locked(lock);
>> +               goto mcs_unlock;
>> +       }
>> +
>> +       /*
>>          * claim the lock:
>>          *
>>          * n,0,0 -> 0,0,1 : lock, uncontended
>> @@ -458,6 +478,7 @@ queue:
>>         while (!(next = READ_ONCE(node->next)))
>>                 cpu_relax();
>>
>> +mcs_unlock:
>>         arch_mcs_spin_unlock_contended(&next->locked);
>>         pv_kick_node(lock, next);
>>
> This however appears an independent optimization. Is it worth it? Would
> we not already have observed a val != tail in this case? At which point
> we're just adding extra code for no gain.
>
> That is, if we observe @next, must we then not also observe val != tail?

Observing next implies val != tail, but the reverse may not be true.
The branch on next is taken before we would otherwise observe val !=
tail, so it lets us skip the re-read of node->next in the unlock path.
Yes, it is an optimization to avoid reading node->next again if we have
already observed next. I have observed a very minor performance boost
with that change even without the prefetch.

Cheers,
Longman
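P.S. For anyone following along outside the kernel tree, here is a
minimal userspace sketch of the idea using C11 atomics and a plain MCS
lock. __builtin_prefetch() stands in for the kernel's prefetchw(), and
all of the names and memory orders below are illustrative; this is not
the kernel's actual qspinlock slow path:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool locked;             /* set to true by our predecessor */
};

struct mcs_lock {
        _Atomic(struct mcs_node *) tail;
};

/* Acquire the lock; returns the successor observed while queued, or NULL. */
static struct mcs_node *mcs_acquire(struct mcs_lock *lock,
                                    struct mcs_node *node)
{
        struct mcs_node *prev, *next = NULL;

        atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
        atomic_store_explicit(&node->locked, false, memory_order_relaxed);

        prev = atomic_exchange_explicit(&lock->tail, node,
                                        memory_order_acq_rel);
        if (prev) {
                atomic_store_explicit(&prev->next, node,
                                      memory_order_release);

                /* Spin until our predecessor hands the lock over. */
                while (!atomic_load_explicit(&node->locked,
                                             memory_order_acquire))
                        ;

                /*
                 * The spin above already pulled in our node's cacheline,
                 * so peeking at node->next is free; if a successor has
                 * linked in, prefetch its node for write so the release
                 * below does not stall on that line.
                 */
                next = atomic_load_explicit(&node->next,
                                            memory_order_acquire);
                if (next)
                        __builtin_prefetch(next, 1, 3); /* 1 == for write */
        }
        return next;
}

static void mcs_release(struct mcs_lock *lock, struct mcs_node *node,
                        struct mcs_node *next)
{
        if (!next) {
                struct mcs_node *n = node;

                /* No known successor: try to swing the tail back to NULL. */
                if (atomic_compare_exchange_strong_explicit(&lock->tail,
                                &n, NULL, memory_order_release,
                                memory_order_relaxed))
                        return;

                /* A successor is mid-enqueue; wait for its link store. */
                while (!(next = atomic_load_explicit(&node->next,
                                                     memory_order_acquire)))
                        ;
        }
        atomic_store_explicit(&next->locked, true, memory_order_release);
}

Handing the observed next from mcs_acquire() to mcs_release() is the
analogue of the mcs_unlock shortcut in the patch: when next has already
been seen, the release path skips both the tail cmpxchg and the re-read
of node->next.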