Re: [PATCH] pv-qspinlock: Try to re-hash the lock after spurious_wakeup

From: xinhui <xinhui.pan@linux.vnet.ibm.com>
To: Waiman Long <waiman.long@hpe.com>
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, mingo@redhat.com
Subject: Re: [PATCH] pv-qspinlock: Try to re-hash the lock after spurious_wakeup
Date: Fri, 27 May 2016 18:32:54 +0800	[thread overview]
Message-ID: <201605271033.u4RATRXc042700@mx0a-001b2d01.pphosted.com> (raw)
In-Reply-To: <57474114.1030506@hpe.com>


On 2016年05月27日 02:31, Waiman Long wrote:
> On 05/25/2016 02:09 AM, Pan Xinhui wrote:
>> In pv_wait_head_or_lock, if there is a spurious_wakeup, and it fails to
>> get the lock as there is lock stealing, then after a short spin, we need
>> hash the lock again and enter pv_wait to yield.
>>
>> Currently after a spurious_wakeup, as l->locked is not _Q_SLOW_VAL,
>> pv_wait might do nothing and return directly, that is not
>> paravirt-friendly because pv_wait_head_or_lock will just spin on the
>> lock then.
>>
>> Signed-off-by: Pan Xinhui<xinhui.pan@linux.vnet.ibm.com>
>> ---
>>   kernel/locking/qspinlock_paravirt.h | 39 +++++++++++++++++++++++++++++--------
>>   1 file changed, 31 insertions(+), 8 deletions(-)
>
> Is this a problem you can easily reproduce on PPC? I have not observed this issue when testing on x86.
>
Hi, Waiman
	I notice the spurious_wakeup count is very high when I do benchmark tests and stress tests. So after a simple investigation,
I find pv_wait_head_or_lock() just keep loops.

	Here is my story, in my pv-qspinlcok patchset V1&&v2, pv_wait on ppc ignore the first two parameters of *ptr and val, that makes lock_stealing hit too much.
and when I change SPIN_THRESHOLD to a small value, system is very much unstable because waiter will enter pv_wait quickly and no one will kick waiter's cpu if
we enter pv_wait twice thanks to the lock_stealing.
	So what I do in my pv-qspinlcok patchset V3 is that add if (*ptr == val) in pv_wait. However as I mentioned above, then spurious_wakeup count is too high, that also means our cpu
slice is wasted.
	
thanks
xinhui
>> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
>> index 2bbffe4..3482ce9 100644
>> --- a/kernel/locking/qspinlock_paravirt.h
>> +++ b/kernel/locking/qspinlock_paravirt.h
>> @@ -429,14 +429,15 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>>           return;
>>
>>       /*
>> -     * Put the lock into the hash table and set the _Q_SLOW_VAL.
>> -     *
>> -     * As this is the same vCPU that will check the _Q_SLOW_VAL value and
>> -     * the hash table later on at unlock time, no atomic instruction is
>> -     * needed.
>> +     * Put the lock into the hash table and set the _Q_SLOW_VAL later
>>        */
>> -    WRITE_ONCE(l->locked, _Q_SLOW_VAL);
>>       (void)pv_hash(lock, pn);
>> +
>> +    /*
>> +    * Match the smp_load_acquire in pv_wait_head_or_lock()
>> +    * We mush set the _Q_SLOW_VAL after hash.
>> +    */
>> +    smp_store_release(&l->locked, _Q_SLOW_VAL);
>>   }
>>
>>   /*
>> @@ -454,6 +455,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>>       struct qspinlock **lp = NULL;
>>       int waitcnt = 0;
>>       int loop;
>> +    u8 lock_val;
>>
>>       /*
>>        * If pv_kick_node() already advanced our state, we don't need to
>> @@ -487,7 +489,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>>           clear_pending(lock);
>>
>>
>> -        if (!lp) { /* ONCE */
>> +        if (!lp) {
>>               lp = pv_hash(lock, pn);
>>
>>               /*
>> @@ -517,6 +519,13 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>>           qstat_inc(qstat_pv_wait_again, waitcnt);
>>
>>           /*
>> +        * make sure pv_kick_node has hashed the lock, so after pv_wait
>> +        * if ->locked is not _Q_SLOW_VAL, we can hash the lock again.
>> +        */
>> +        if (lp == (struct qspinlock **)1
>> +                &&  smp_load_acquire(&l->locked) == _Q_SLOW_VAL)
>> +            lp = (struct qspinlock **)2;
>> +        /*
>>            * Pass in the previous node vCPU nmber which is likely to be
>>            * the lock holder vCPU. This additional information may help
>>            * the hypervisor to give more resource to that vCPU so that
>> @@ -525,13 +534,27 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
>>            */
>>           pv_wait(&l->locked, _Q_SLOW_VAL, pn->prev_cpu);
>>
>> +        lock_val = READ_ONCE(l->locked);
>> +
>> +        /* if ->locked is zero, then lock owner has unhashed the lock
>> +         * if ->locked is _Q_LOCKED_VAL,
>> +         * 1) pv_kick_node didnot hash the lock, and lp != 0x1
>> +         * 2) lock stealing, lock owner has unhashed the lock too
>> +         * 3) race with pv_kick_node, if lp == 2, we know it has hashed
>> +         * the lock and the lock is unhashed in unlock()
>> +         * if ->lock is _Q_SLOW_VAL, spurious_wakeup?
>> +        */
>> +        if (lock_val != _Q_SLOW_VAL) {
>> +            if (lock_val == 0 || lp != (struct qspinlock **)1)
>> +                lp = 0;
>> +        }
>
> It is a bit hard to verify the correctness of these checks as many variables are involved. Instead, I posted an alternative way to solve this problem by focusing just on the atomic setting of _Q_SLOW_VAL which should be easier to understand and verified. Would you mind testing that patch to see if it can fix the problem that you see?
>
> Cheers,
> Longman
>