From: Ling Ma
To: Waiman Long
Cc: Peter Zijlstra, mingo@redhat.com, linux-kernel@vger.kernel.org, Ma Ling
Subject: Re: [RFC PATCH] qspinlock: Improve performance by reducing load instruction rollback
Date: Tue, 20 Oct 2015 11:12:28 +0800
In-Reply-To: <56252600.5000106@hpe.com>
References: <1445221642-15319-1-git-send-email-ling.ma.program@gmail.com> <56252600.5000106@hpe.com>

2015-10-20 1:18 GMT+08:00 Waiman Long:
> On 10/18/2015 10:27 PM, ling.ma.program@gmail.com wrote:
>>
>> From: Ma Ling
>>
>> All load instructions can run speculatively, but they have to follow
>> the memory ordering rule across multiple cores, as below:
>>
>>   _x = _y = 0
>>
>>   Processor 0            Processor 1
>>
>>   mov r1, [_y]  //M1     mov [_x], 1  //M3
>>   mov r2, [_x]  //M2     mov [_y], 1  //M4
>>
>> If r1 = 1, r2 must be 1.
>>
>> In order to guarantee the above rule, although Processor 0 may execute
>> M1 and M2 out of order, they are kept in the ROB. When the load buffer
>> for _x in Processor 0 receives the update message from Processor 1,
>> Processor 0 needs to roll back from the M2 instruction, which flushes
>> the whole pipeline; that latency exceeds the penalty of a branch
>> prediction miss.
>>
>> In this patch we use the lock cmpxchg instruction to force load
>> instructions to be serialized; the destination operand receives a
>> write cycle regardless of the result of the comparison, which helps
>> us reduce the penalty from load instruction rollback.
>>
>> Our experiment indicates the performance can be improved by 10%~15%
>> for the 2- and 3-thread cases, where contention on the lock cache
>> line accounts for most of the time.
>
> What kind of performance test were you running? With the right timing, it
> is possible that you see some performance gain. However, if the lock hold
> time is longer so that a fair number of cmpxchg instructions have to be
> executed before it can get the lock, you may see a performance degradation,
> especially if the lock holder needs to access the lock cacheline.
>
> In general, we try to avoid this kind of cmpxchg loop unless we are sure
> that at most a few iterations of the loop may happen.

Waiman,

The machine is Haswell (2699 V3, COD off, HT on, 2 sockets). We have sent
the test module in a separate email; a rough sketch of that kind of module
is included below.
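Below is a minimal, hypothetical sketch of such a dummy.ko-style timing
module. The names, loop count, CPU binding, and the use of ktime (rather
than, say, RDTSC) are assumptions for illustration; only the general shape
follows the test described here: two kernel threads hammer a
spinlock-protected field and the elapsed time is printed as
"all cost time is ...". The member data_a shares the lock's cache line
(case A below), while data_b is pushed onto a separate line (case B).

/*
 * Hypothetical sketch of a dummy.ko-style timing module; the real test
 * module was posted separately and may differ in its details.
 */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/completion.h>
#include <linux/atomic.h>
#include <linux/ktime.h>
#include <linux/cache.h>
#include <linux/err.h>

#define NR_THREADS      2
#define NR_LOOPS        1000000         /* assumed iteration count */

struct test_data {
        spinlock_t lock;
        unsigned long data_a;           /* shares the lock's cache line (case A) */
        unsigned long data_b ____cacheline_aligned;  /* separate cache line (case B) */
};

static struct test_data td;
static atomic_t nr_done = ATOMIC_INIT(0);
static DECLARE_COMPLETION(bench_done);
static ktime_t start;

static int bench_thread(void *arg)
{
        int i;

        for (i = 0; i < NR_LOOPS; i++) {
                spin_lock(&td.lock);
                td.data_a++;            /* write td.data_b instead for case B */
                spin_unlock(&td.lock);
        }
        if (atomic_inc_return(&nr_done) == NR_THREADS)
                complete(&bench_done);
        return 0;
}

static int __init dummy_init(void)
{
        int i;

        spin_lock_init(&td.lock);
        start = ktime_get();
        for (i = 0; i < NR_THREADS; i++) {
                struct task_struct *t;

                t = kthread_create(bench_thread, NULL, "bench/%d", i);
                if (IS_ERR(t))
                        return PTR_ERR(t);
                kthread_bind(t, i);     /* one worker per CPU */
                wake_up_process(t);
        }

        wait_for_completion(&bench_done);
        /* the real module may report TSC cycles instead of nanoseconds */
        pr_info("all cost time is %lld\n",
                ktime_to_ns(ktime_sub(ktime_get(), start)));
        return 0;
}

static void __exit dummy_exit(void)
{
}

module_init(dummy_init);
module_exit(dummy_exit);
MODULE_LICENSE("GPL");

It would be exercised exactly as in the transcripts that follow:
insmod dummy.ko; rmmod dummy; dmesg -c.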
A. Data is located in the same cache line as the lock.
For the 2-thread case (only writing struct member data_a):

1. Load version, tested 5 times; the cost time is below:

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 103904620
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 104351876
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 118599784
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 103064024
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 103389696

Total cost time is 533310000.

2. Lock cmpxchg version, tested 5 times; the cost time is below:

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 67081220
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 97640708
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 96439612
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 66699296
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 96464800

Total cost time is 424325636.

The above data shows the lock cmpxchg version is better by about 25% on
average (533310000 / 424325636).

B. Data is located in a different cache line from the lock.
For the 2-thread case (only writing struct member data_b):

1. Load version, tested 5 times; the cost time is below:

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 174266128
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 205053924
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 160165124
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 173241552
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 205765008

Total cost time is 918491736.

2. Lock cmpxchg version, tested 5 times; the cost time is below:

[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 113410044
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 116293104
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 116064256
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 189320876
[root@localhost spinlock]# insmod dummy.ko; rmmod dummy; dmesg -c
all cost time is 123735352

Total cost time is 658823632.

The above data shows the lock cmpxchg version is better by about 39% on
average (918491736 / 658823632).
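For reference, the ordering rule in the commit message (if r1 = 1, r2 must
be 1) can be checked from userspace with a small litmus test like the
sketch below. This is illustrative only and is not the kernel test module;
the thread names, iteration count, and use of C11 relaxed atomics plus
compiler-only fences are assumptions. The fences keep the compiler from
reordering M1..M4 while leaving the hardware free to behave normally; on
x86 the forbidden outcome (r1 == 1 && r2 == 0) should never be observed,
even though the C memory model alone would not rule it out.

/* Build with: gcc -O2 -pthread litmus.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define ITERATIONS 100000

static atomic_int x, y;
static int r1, r2;
static pthread_barrier_t start;

static void *processor0(void *arg)
{
        pthread_barrier_wait(&start);
        r1 = atomic_load_explicit(&y, memory_order_relaxed);    /* M1 */
        atomic_signal_fence(memory_order_seq_cst);              /* compiler barrier only */
        r2 = atomic_load_explicit(&x, memory_order_relaxed);    /* M2 */
        return NULL;
}

static void *processor1(void *arg)
{
        pthread_barrier_wait(&start);
        atomic_store_explicit(&x, 1, memory_order_relaxed);     /* M3 */
        atomic_signal_fence(memory_order_seq_cst);              /* compiler barrier only */
        atomic_store_explicit(&y, 1, memory_order_relaxed);     /* M4 */
        return NULL;
}

int main(void)
{
        long forbidden = 0;

        for (long i = 0; i < ITERATIONS; i++) {
                pthread_t t0, t1;

                atomic_store(&x, 0);
                atomic_store(&y, 0);
                pthread_barrier_init(&start, NULL, 2);
                pthread_create(&t0, NULL, processor0, NULL);
                pthread_create(&t1, NULL, processor1, NULL);
                pthread_join(t0, NULL);
                pthread_join(t1, NULL);
                pthread_barrier_destroy(&start);
                if (r1 == 1 && r2 == 0)
                        forbidden++;    /* expected to stay 0 on x86 */
        }
        printf("r1 == 1 && r2 == 0 observed %ld times\n", forbidden);
        return 0;
}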
>
>>
>> Thanks
>> Ling
>>
>> Signed-off-by: Ma Ling
>> ---
>>   kernel/locking/qspinlock.c |   43 ++++++++++++++++++-------------------------
>>   1 files changed, 18 insertions(+), 25 deletions(-)
>>
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index 87e9ce6..16421f2 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -332,25 +332,14 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>>          if (new == _Q_LOCKED_VAL)
>>                  return;
>>
>> -       /*
>> -        * we're pending, wait for the owner to go away.
>> -        *
>> -        * *,1,1 -> *,1,0
>> +       /* we're waiting, and get lock owner
>>           *
>> -        * this wait loop must be a load-acquire such that we match the
>> -        * store-release that clears the locked bit and create lock
>> -        * sequentiality; this is because not all clear_pending_set_locked()
>> -        * implementations imply full barriers.
>> +        * *,1,* -> *,0,1
>>           */
>> -       while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
>> +       while (cmpxchg(&((struct __qspinlock *)lock)->locked_pending,
>> +               _Q_PENDING_VAL, _Q_LOCKED_VAL) != _Q_PENDING_VAL)
>>                  cpu_relax();
>> -
>> -       /*
>> -        * take ownership and clear the pending bit.
>> -        *
>> -        * *,1,0 -> *,0,1
>> -        */
>> -       clear_pending_set_locked(lock);
>> +
>>          return;
>>
>>          /*
>> @@ -399,17 +388,21 @@ queue:
>>           * we're at the head of the waitqueue, wait for the owner & pending to
>>           * go away.
>>           *
>> -        * *,x,y -> *,0,0
>> -        *
>> -        * this wait loop must use a load-acquire such that we match the
>> -        * store-release that clears the locked bit and create lock
>> -        * sequentiality; this is because the set_locked() function below
>> -        * does not imply a full barrier.
>> -        *
>> +        * *,x,y -> *,0,1
>>           */
>>          pv_wait_head(lock, node);
>> -       while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
>> +       next = READ_ONCE(node->next);
>> +       while (cmpxchg(&((struct __qspinlock *)lock)->locked_pending, 0,
>
> The locked_pending field isn't valid if _Q_PENDING_BITS != 8. So it won't
> work if NR_CPUS is 16k or more.
>
>> +               _Q_LOCKED_VAL) != 0) {
>> +               next = READ_ONCE(node->next);
>>                  cpu_relax();
>> +       }
>> +
>> +       if (next)
>> +               goto next_node;
>
> I did notice a slight performance benefit by reading the next pointer early
> in light load cases myself. However, it is a very minor improvement that I
> haven't actively pursued.

We use "next" here to avoid rolling back the
smp_load_acquire(&lock->val.counter) load instruction.

>
>> +
>> +       val = smp_load_acquire(&lock->val.counter);
>> +       tail = tail | _Q_LOCKED_VAL;
>>
>>          /*
>>           * claim the lock:
>> @@ -423,7 +416,6 @@ queue:
>>           */
>>          for (;;) {
>>                  if (val != tail) {
>> -                       set_locked(lock);
>>                          break;
>>                  }
>>                  old = atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL);
>> @@ -439,6 +431,7 @@ queue:
>>          while (!(next = READ_ONCE(node->next)))
>>                  cpu_relax();
>>
>> +next_node:
>>          arch_mcs_spin_unlock_contended(&next->locked);
>>          pv_kick_node(lock, next);
>
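Regarding the _Q_PENDING_BITS concern above: locked_pending is only a valid
16-bit field of struct __qspinlock when _Q_PENDING_BITS == 8 (i.e. NR_CPUS
below 16k). One way the cmpxchg hand-off in the pending-wait path could be
confined to that configuration, falling back to the existing load-acquire
wait otherwise, is sketched below. This fragment is an illustration meant
to slot into the pending-wait section of queued_spin_lock_slowpath(); it is
not part of the posted patch.

#if _Q_PENDING_BITS == 8
        /*
         * locked_pending is a valid 16-bit field here, so a single locked
         * RMW can take the lock and clear the pending bit:
         * *,1,* -> *,0,1
         */
        while (cmpxchg(&((struct __qspinlock *)lock)->locked_pending,
                       _Q_PENDING_VAL, _Q_LOCKED_VAL) != _Q_PENDING_VAL)
                cpu_relax();
#else
        /* Fall back to the original load-acquire wait: *,1,1 -> *,1,0 -> *,0,1 */
        while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
                cpu_relax();
        clear_pending_set_locked(lock);
#endif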