Re: [External] : Re: [PATCH v14 6/6] locking/qspinlock: Introduce the shuffle reduction optimization into CNA

From: Alex Kogan <alex.kogan@oracle.com>
To: Andreas Herrmann <aherrmann@suse.com>
Cc: linux@armlinux.org.uk, Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>, Will Deacon <will.deacon@arm.com>,
	arnd@arndb.de, longman@redhat.com, linux-arch@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, tglx@linutronix.de, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, guohanjun@huawei.com,
	jglauber@marvell.com, steven.sistare@oracle.com,
	daniel.m.jordan@oracle.com, dave.dice@oracle.com
Subject: Re: [External] : Re: [PATCH v14 6/6] locking/qspinlock: Introduce the shuffle reduction optimization into CNA
Date: Wed, 14 Apr 2021 14:18:05 -0400	[thread overview]
Message-ID: <96EB342A-1343-4D71-B687-6EED27159161@oracle.com> (raw)
In-Reply-To: <YHad9S5ckj5IR1l6@suselix>

Hi, Andreas.

Thanks for the great questions.

> On Apr 14, 2021, at 3:47 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
> 
> On Thu, Apr 01, 2021 at 11:31:56AM -0400, Alex Kogan wrote:
>> This performance optimization chooses probabilistically to avoid moving
>> threads from the main queue into the secondary one when the secondary queue
>> is empty.
>> 
>> It is helpful when the lock is only lightly contended. In particular, it
>> makes CNA less eager to create a secondary queue, but does not introduce
>> any extra delays for threads waiting in that queue once it is created.
>> 
>> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
>> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Waiman Long <longman@redhat.com>
>> ---
>> kernel/locking/qspinlock_cna.h | 39 ++++++++++++++++++++++++++++++++++
>> 1 file changed, 39 insertions(+)
>> 
>> diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
>> index 29c3abbd3d94..983c6a47a221 100644
>> --- a/kernel/locking/qspinlock_cna.h
>> +++ b/kernel/locking/qspinlock_cna.h
>> @@ -5,6 +5,7 @@
>> 
>> #include <linux/topology.h>
>> #include <linux/sched/rt.h>
>> +#include <linux/random.h>
>> 
>> /*
>>  * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
>> @@ -86,6 +87,34 @@ static inline bool intra_node_threshold_reached(struct cna_node *cn)
>> 	return current_time - threshold > 0;
>> }
>> 
>> +/*
>> + * Controls the probability for enabling the ordering of the main queue
>> + * when the secondary queue is empty. The chosen value reduces the amount
>> + * of unnecessary shuffling of threads between the two waiting queues
>> + * when the contention is low, while responding fast enough and enabling
>> + * the shuffling when the contention is high.
>> + */
>> +#define SHUFFLE_REDUCTION_PROB_ARG  (7)
> 
> Out of curiosity:
> 
> Have you used other values and done measurements what's an efficient
> value for SHUFFLE_REDUCTION_PROB_ARG?
Yes, we did try other values. Small variations do not change the results much,
but if you bump the value significantly (e.g., 20), you end up with a lock that
hardly does any shuffling and thus performs on-par with the (MCS-based)
baseline.

> Maybe I miscalculated it, but if I understand it correctly this value
> implies that the propability is 0.9921875 that below function returns
> true.
Your analysis is correct. Intuitively, we tried to keep the probability around 1-2%,
so if we do decide to shuffle when we don’t really want to (i.e., when the
contention is low), the overall overhead of such wrong decisions would be small.

> 
> My question is probably answered by following statement from
> referenced paper:
> 
> "In our experiments with the shuffle reduction optimization enabled,
> we set THRESHOLD2 to 0xff." (page with figure 5)
Yeah, just to avoid any confusion — we used a different mechanism to draw
pseudo-random numbers in the paper, so that 0xff number is not directly 
comparable to the range of possible values for SHUFFLE_REDUCTION_PROB_ARG,
but the idea was exactly the same.

> 
>> +
>> +/* Per-CPU pseudo-random number seed */
>> +static DEFINE_PER_CPU(u32, seed);
>> +
>> +/*
>> + * Return false with probability 1 / 2^@num_bits.
>> + * Intuitively, the larger @num_bits the less likely false is to be returned.
>> + * @num_bits must be a number between 0 and 31.
>> + */
>> +static bool probably(unsigned int num_bits)
>> +{
>> +	u32 s;
>> +
>> +	s = this_cpu_read(seed);
>> +	s = next_pseudo_random32(s);
>> +	this_cpu_write(seed, s);
>> +
>> +	return s & ((1 << num_bits) - 1);
>> +}
>> +
>> static void __init cna_init_nodes_per_cpu(unsigned int cpu)
>> {
>> 	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
>> @@ -293,6 +322,16 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
>> {
>> 	struct cna_node *cn = (struct cna_node *)node;
>> 
>> +	if (node->locked <= 1 && probably(SHUFFLE_REDUCTION_PROB_ARG)) {
> 
> Again if I understand it correctly with SHUFFLE_REDUCTION_PROB_ARG==7
> it's roughly 1 out of 100 cases where probably() returns false.
> 
> Why/when is this beneficial?
> 
> I assume it has to do with following statement in the referenced
> paper:
> 
> "The superior performance over MCS at 4 threads is the result of the
> shuffling that does take place once in a while, organizing threads’
> arrivals to the lock in a way that reduces the inter-socket lock
> migration without the need to continuously modify the main queue. ..."
> (page with figure 9; the paper has no page numbers)
This is correct. This performance optimization deals with a small regression
that might occur at the low-medium range of contention. And just to reiterate
what the patch comment says, this optimization has nothing to do with correctness
and does not introduce any extra delays for threads already waiting in the
secondary queue.

> 
> But OTHO why this pseudo randomness?
> 
> How about deterministically treating every 100th execution differently
> (it also matches "once in a while") and thus entirely removing the
> pseudo randomness?
Pseudo-randomness helps to avoid pathological cases where we would do the
same decision, despite having this optimization in place, for every lock acquisition.

Consider an application in which each thread cycles through X locks, e.g., X=10.
If we count deterministically, then for the first 9 locks we will _always avoid doing
any shuffling, missing on any opportunity for the benefit derived by the CNA 
approach. At the same time, for lock #10 we will always attempt to shuffle, and so
the optimization would not have any effect. 

> 
> Have you tried this? If so why was it worse than pseudo randomness?
> 
> (Or maybe I missed something and pseudo randomness is required for
> other reasons there.)

Hope this helps — let me know if you have any further questions!

Best,
— Alex