From: Alex Kogan <alex.kogan@oracle.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org, Hanjun Guo <guohanjun@huawei.com>,
	Arnd Bergmann <arnd@arndb.de>,
	dave.dice@oracle.com, Jan Glauber <jglauber@marvell.com>,
	x86@kernel.org, Will Deacon <will.deacon@arm.com>,
	linux@armlinux.org.uk, Steven Sistare <steven.sistare@oracle.com>,
	linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>,
	Borislav Petkov <bp@alien8.de>,
	hpa@zytor.com, Waiman Long <longman@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Daniel Jordan <daniel.m.jordan@oracle.com>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>
Subject: Re: [PATCH v7 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
Date: Fri, 31 Jan 2020 13:33:09 -0500	[thread overview]
Message-ID: <E32F90E2-F00B-423B-A851-336004EF6593@oracle.com> (raw)
In-Reply-To: <20200131133543.GD14914@hirez.programming.kicks-ass.net>


>>> + *
>>> + * Returns 0 if a matching node is found; otherwise returns the encoded pointer
>>> + * to the last element inspected (such that a subsequent scan can continue there).
>>> + *
>>> + * The worst case complexity of the scan is O(n), where n is the number
>>> + * of current waiters. However, the amortized complexity is close to O(1),
>>> + * as the immediate successor is likely to be running on the same node once
>>> + * threads from other nodes are moved to the secondary queue.
>>> + *
>>> + * XXX does not compute; given equal contention it should average to O(nr_nodes).
>> Let me try to convince you. Under contention, the immediate waiter would be
>> a local one, so the scan would typically take O(1) steps. We also need to
>> account for the extra work/steps we take to move nodes to the secondary
>> queue. The number of such nodes is O(n) (more precisely, O(n-m), where m is
>> nr_cpus_per_node), and we spend a constant amount of work per node to scan
>> those nodes and move them to the secondary queue. So over 2^threshold lock
>> handovers, we scan 2^threshold x 1 + (n-m) nodes in total. Assuming
>> 2^threshold > n, the amortized complexity of this function is O(1).
> 
> There is no threshold in this patch.
This does not change the analysis, though.
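
To put illustrative numbers on it (made up, just to sanity-check the bound):
with n = 64 waiters, m = 16 cpus per node and 2^threshold = 256 handovers, we
scan about 256 x 1 + (64 - 16) = 304 nodes over 256 handovers, i.e. roughly
1.2 scan steps per handover, which is O(1) amortized.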

> That's the next patch, and
> I've been procrastinating on that one, mostly also because of the
> 'reasonable' claim without data.
> 
> But Ah!, I think I see, after nr_cpus tries, all remote waiters are on
> the secondary queue and only local waiters will remain. That will indeed
> depress the average a lot.
Ok, cool.

< snip >

> 
>>> +{
>>> +	struct mcs_spinlock *head_2nd, *tail_2nd;
>>> +
>>> +	tail_2nd = decode_tail(node->locked);
>>> +	head_2nd = tail_2nd->next;
>> I understand that you are trying to avoid repeating those two lines
>> in both places this function is called from, but you do it at the cost
>> of adding the following unnecessary if-statement, and in general this function
>> looks clunky.
> 
> Let me show you the current form:
> 
> /*
> * cna_splice_head -- splice the entire secondary queue onto the head of the
> * primary queue.
> *
> * Returns the new primary head node or NULL on failure.
Maybe explain here why failure can happen? E.g., "The latter can happen
due to a race with a waiter joining an empty primary queue."

> */
> static struct mcs_spinlock *
> cna_splice_head(struct qspinlock *lock, u32 val,
> 		struct mcs_spinlock *node, struct mcs_spinlock *next)
> {
> 	struct mcs_spinlock *head_2nd, *tail_2nd;
> 	u32 new;
> 
> 	tail_2nd = decode_tail(node->locked);
> 	head_2nd = tail_2nd->next;
> 
> 	if (!next) {
> 		/*
> 		 * When the primary queue is empty; our tail becomes the primary tail.
> 		 */
> 
> 		/*
> 		 * Speculatively break the secondary queue's circular link; such that when
> 		 * the secondary tail becomes the primary tail it all works out.
> 		 */
> 		tail_2nd->next = NULL;
> 
> 		/*
> 		 * tail_2nd->next = NULL		xchg_tail(lock, tail)
> 		 *
> 		 * cmpxchg_release(&lock, val, new)	WRITE_ONCE(prev->next, node);
> 		 *
> 		 * Such that if the cmpxchg() succeeds, our stores won't collide.
> 		 */
> 		new = ((struct cna_node *)tail_2nd)->encoded_tail | _Q_LOCKED_VAL;
> 		if (!atomic_try_cmpxchg_release(&lock->val, &val, new)) {
> 			/*
> 			 * Note; when this cmpxchg fails due to concurrent
> 			 * queueing of a new waiter, then we'll try again in
> 			 * cna_pass_lock() if required.
> 			 *
> 			 * Restore the secondary queue's circular link.
> 			 */
> 			tail_2nd->next = head_2nd;
> 			return NULL;
> 		}
> 
> 	} else {
> 		/*
> 		 * If the primary queue is not empty; the primary tail doesn't need
> 		 * to change and we can simply link our tail to the old head.
> 		 */
> 		tail_2nd->next = next;
> 	}
> 
> 	/* That which was the secondary queue head, is now the primary queue head */
Maybe rephrase? E.g., "The old head of the secondary queue becomes the
new head of the primary queue."

> 	return head_2nd;
> }
> 
> The function as a whole is self-contained and consistent, it deals with
> splicing the 2nd queue onto the 1st, for all possible cases. You only have to
> think about the list splice in one place, here, instead of two places.
> 
> I don't think it will actually result in more branches emitted; the
> compiler should be able to use value propagation to eliminate stuff.
Ok, I can see your point.
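
FWIW, a toy (userspace) illustration of the value-propagation point, not the
actual CNA code: once the function is force-inlined, the compiler knows at
each call site whether 'next' is NULL and drops the untaken branch:

#include <stddef.h>

#define __always_inline inline __attribute__((__always_inline__))

/* Stand-in for cna_splice_head(); the NULL test folds per call site. */
static __always_inline int splice_toy(int *next, int fallback)
{
	if (!next)	/* compile-time constant once inlined */
		return fallback;
	return *next;
}

int caller_empty(void)      { return splice_toy(NULL, 42); }	/* => return 42 */
int caller_nonempty(int *p) { return splice_toy(p, 42); }	/* => return *p */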

> 
>>> +static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
>>> +				      struct mcs_spinlock *node)
>>> +{
>>> +	struct mcs_spinlock *next;
>>> +
>>> +	/*
>>> +	 * We're here because the primary queue is empty; check the secondary
>>> +	 * queue for remote waiters.
>>> +	 */
>>> +	if (node->locked > 1) {
>>> +		/*
>>> +		 * When there are waiters on the secondary queue move them back
>>> +		 * onto the primary queue and let them rip.
>>> +		 */
>>> +		next = cna_splice_head(lock, val, node, NULL);
>>> +		if (next) {
>> And, again, this if-statement is here only because you moved the rest of the code
>> into cna_splice_head(). Perhaps, with cna_extract_2dary_head_tail(), you
>> could bring that code back?
> 
> I don't see the objection, you already had a branch there, from the
> cmpxchg(), this is the same branch, the compiler should fold the lot.
Now you have the branch from cmpxchg(), and another one from "if (next)".
But you are right that the compiler is likely to optimize out the latter.

> We can add an __always_inline if you're worried.
Let’s do that.
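
I.e., presumably just annotating the function above (a sketch, nothing else
changes):

static __always_inline struct mcs_spinlock *
cna_splice_head(struct qspinlock *lock, u32 val,
		struct mcs_spinlock *node, struct mcs_spinlock *next)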

< snip >

>>> +static inline void cna_pass_lock(struct mcs_spinlock *node,
>>> +				 struct mcs_spinlock *next)
>>> +{
>>> +	struct cna_node *cn = (struct cna_node *)node;
>>> +	u32 partial_order = cn->partial_order;
>>> +	u32 val = _Q_LOCKED_VAL;
>>> +
>>> +	/* cna_wait_head_or_lock() left work for us. */
>>> +	if (partial_order) {
>>> +		partial_order = cna_order_queue(node, decode_tail(partial_order));
>>> +	}
>>> +
>>> +	if (!partial_order) {
>>> +		/*
>>> +		 * We found a local waiter; reload @next in case we called
>>> +		 * cna_order_queue() above.
>>> +		 */
>>> +		next = node->next;
>>> +		val |= node->locked;	/* preserve secondary queue */
>> This is wrong. node->locked can be 0, 1 or an encoded tail at this point, and
>> the latter case will be handled incorrectly.
>> 
>> Should be 
>> 		  if (node->locked) val = node->locked;
>> instead.
> 
> I'm confused, who cares about the locked bit when it has an encoded tail on?
> 
> The generic code only cares about it being !0, and the cna code always
> checks if it has a tail (>1 , <=1) first.
Ah, that may actually work, but not sure if this was your intention.

The code above sets val to 1 or to encoded tail + 1 (rather than encoded
tail), but decode_tail() does not care about the LSB and will do its
calculation correctly. IOW, decode_tail(tail) is the same as
decode_tail(tail + 1).

I think this is a bit fragile and depends on the implementation of 
decode_tail(), but if you are fine with that, no objections here.
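
For reference, decode_tail() in kernel/locking/qspinlock.c (as of the base
this series applies to, quoting from memory) looks like:

/*
 * Both tail fields live above the locked/pending byte(s)
 * (_Q_TAIL_IDX_OFFSET >= 9), so the locked bit is shifted/masked away and
 * decode_tail(tail | _Q_LOCKED_VAL) == decode_tail(tail).
 */
static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
{
	int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
	int idx = (tail &  _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;

	return per_cpu_ptr(&qnodes[idx].mcs, cpu);
}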

Regards,
— Alex


