From: Alex Kogan <alex.kogan@oracle.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org, guohanjun@huawei.com, arnd@arndb.de,
	dave.dice@oracle.com, jglauber@marvell.com, x86@kernel.org,
	will.deacon@arm.com, linux@armlinux.org.uk,
	steven.sistare@oracle.com, linux-kernel@vger.kernel.org,
	rahul.x.yadav@oracle.com, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, longman@redhat.com, tglx@linutronix.de,
	daniel.m.jordan@oracle.com, linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
Date: Tue, 16 Jul 2019 13:19:16 -0400
Message-ID: <193BBB31-F376-451F-BDE1-D4807140EB51@oracle.com>
In-Reply-To: <20190716155022.GR3419@hirez.programming.kicks-ass.net>

Hi, Peter.

Thanks for the review and all the suggestions!

A couple of comments are inlined below.

> On Jul 16, 2019, at 11:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, Jul 15, 2019 at 03:25:34PM -0400, Alex Kogan wrote:
>> +static struct cna_node *find_successor(struct mcs_spinlock *me)
>> +{
>> +	struct cna_node *me_cna = CNA_NODE(me);
>> +	struct cna_node *head_other, *tail_other, *cur;
>> +	struct cna_node *next = CNA_NODE(READ_ONCE(me->next));
>> +	int my_node;
>> +
>> +	/* @next should be set, else we would not be calling this function. */
>> +	WARN_ON_ONCE(next == NULL);
>> +
>> +	my_node = me_cna->numa_node;
>> +
>> +	/*
>> +	 * Fast path - check whether the immediate successor runs on
>> +	 * the same node.
>> +	 */
>> +	if (next->numa_node == my_node)
>> +		return next;
>> +
>> +	head_other = next;
>> +	tail_other = next;
>> +
>> +	/*
>> +	 * Traverse the main waiting queue starting from the successor of my
>> +	 * successor, and look for a thread running on the same node.
>> +	 */
>> +	cur = CNA_NODE(READ_ONCE(next->mcs.next));
>> +	while (cur) {
>> +		if (cur->numa_node == my_node) {
>> +			/*
>> +			 * Found a thread on the same node. Move threads
>> +			 * between me and that node into the secondary queue.
>> +			 */
>> +			if (me->locked > 1)
>> +				CNA_NODE(me->locked)->tail->mcs.next =
>> +					(struct mcs_spinlock *)head_other;
>> +			else
>> +				me->locked = (uintptr_t)head_other;
>> +			tail_other->mcs.next = NULL;
>> +			CNA_NODE(me->locked)->tail = tail_other;
>> +			return cur;
>> +		}
>> +		tail_other = cur;
>> +		cur = CNA_NODE(READ_ONCE(cur->mcs.next));
>> +	}
>> +	return NULL;
>> +}
> 
> static void cna_move(struct cna_node *cn, struct cna_node *cni)
> {
> 	struct cna_node *head, *tail;
> 
> 	/* remove @cni */
> 	WRITE_ONCE(cn->mcs.next, cni->mcs.next);
> 
> 	/* stick @cni on the 'other' list tail */
> 	cni->mcs.next = NULL;
> 
> 	if (cn->mcs.locked <= 1) {
> 		/* head = tail = cni */
> 		head = cni;
> 		head->tail = cni;
> 		cn->mcs.locked = head->encoded_tail;
> 	} else {
> 		/* add to tail */
> 		head = (struct cna_node *)decode_tail(cn->mcs.locked);
> 		tail = head->tail;
> 		tail->mcs.next = &cni->mcs;
> 		head->tail = cni;
> 	}
> }
> 
> static struct cna_node *cna_find_next(struct mcs_spinlock *node)
> {
> 	struct cna_node *cni, *cn = (struct cna_node *)node;
> 
> 	while ((cni = (struct cna_node *)READ_ONCE(cn->mcs.next))) {
> 		if (likely(cni->node == cn->node))
> 			break;
> 
> 		cna_move(cn, cni);
> 	}
> 
> 	return cni;
> }
But then you move nodes from the main list to the ‘other’ list one by one.
I’m afraid this would be unnecessarily expensive.
Plus, all this extra work is wasted if you do not find a thread on the same
NUMA node (you move everyone to the ‘other’ list only to move them back in
cna_mcs_pass_lock()).
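To illustrate the alternative, here is a rough, untested sketch (the helper
name is made up) of splicing the whole [first, last] span onto the ‘other’
list in one step, which is essentially what find_successor() does today:

static void cna_splice_span(struct mcs_spinlock *me, struct cna_node *first,
			    struct cna_node *last)
{
	/* append the span after the current 'other' tail, or start the list */
	if (me->locked > 1)
		CNA_NODE(me->locked)->tail->mcs.next =
			(struct mcs_spinlock *)first;
	else
		me->locked = (uintptr_t)first;

	/* the last node of the span becomes the new 'other' tail */
	last->mcs.next = NULL;
	CNA_NODE(me->locked)->tail = last;
}

This touches only the two endpoints of the span, no matter how many nodes
sit in between.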

> 
>> +static inline bool cna_set_locked_empty_mcs(struct qspinlock *lock, u32 val,
>> +					struct mcs_spinlock *node)
>> +{
>> +	/* Check whether the secondary queue is empty. */
>> +	if (node->locked <= 1) {
>> +		if (atomic_try_cmpxchg_relaxed(&lock->val, &val,
>> +				_Q_LOCKED_VAL))
>> +			return true; /* No contention */
>> +	} else {
>> +		/*
>> +		 * Pass the lock to the first thread in the secondary
>> +		 * queue, but first try to update the queue's tail to
>> +		 * point to the last node in the secondary queue.
> 
> 
> That comment doesn't make sense; there's at least one conditional
> missing.
In CNA, we cannot just clear the tail when the MCS chain is empty, as
there might still be nodes in the ‘other’ chain. In that case (this is the
“else” part), we want to pass the lock to the first node in the ‘other’ chain,
but first we need to make the last node of that chain the new queue tail.
Perhaps the comment should read “… but first try to update the *primary*
queue's tail …”, if that makes more sense.
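Concretely, the “else” branch with the revised comment would read roughly
like this (wording still tentative):

	} else {
		/*
		 * Pass the lock to the first thread in the secondary
		 * queue, but first try to update the primary queue's
		 * tail to point to the last node in the secondary queue
		 * (which becomes the tail of the whole queue).
		 */
		struct cna_node *succ = CNA_NODE(node->locked);
		u32 new = succ->tail->encoded_tail + _Q_LOCKED_VAL;

		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
			arch_mcs_spin_unlock_contended(&succ->mcs.locked, 1);
			return true;
		}
	}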

> 
>> +		 */
>> +		struct cna_node *succ = CNA_NODE(node->locked);
>> +		u32 new = succ->tail->encoded_tail + _Q_LOCKED_VAL;
>> +
>> +		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
>> +			arch_mcs_spin_unlock_contended(&succ->mcs.locked, 1);
>> +			return true;
>> +		}
>> +	}
>> +
>> +	return false;
>> +}
> 
> static bool cna_try_clear_tail(struct qspinlock *lock, u32 val, struct mcs_spinlock *node)
> {
> 	if (node->locked <= 1)
> 		return __try_clear_tail(lock, val, node);
> 
> 	/* the other case */
> }
Good point, thanks.

> 
>> +static inline void cna_pass_mcs_lock(struct mcs_spinlock *node,
>> +				     struct mcs_spinlock *next)
>> +{
>> +	struct cna_node *succ = NULL;
>> +	u64 *var = &next->locked;
>> +	u64 val = 1;
>> +
>> +	succ = find_successor(node);
> 
> This makes unlock O(n), which is 'funneh' and undocumented.
I will add a comment above the call to find_successor() / cna_find_next().
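Something along these lines, perhaps:

	/*
	 * Try to find a successor running on the same NUMA node as the
	 * current lock holder. Note that this traverses the main queue,
	 * so the cost of passing the lock is O(n) in the worst case,
	 * where n is the number of waiting threads.
	 */
	succ = find_successor(node);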

> 
>> +
>> +	if (succ) {
>> +		var = &succ->mcs.locked;
>> +		/*
>> +		 * We unlock a successor by passing a non-zero value,
>> +		 * so set @val to 1 iff @locked is 0, which will happen
>> +		 * if we acquired the MCS lock when its queue was empty
>> +		 */
>> +		val = node->locked + (node->locked == 0);
>> +	} else if (node->locked > 1) { /* if the secondary queue is not empty */
>> +		/* pass the lock to the first node in that queue */
>> +		succ = CNA_NODE(node->locked);
>> +		succ->tail->mcs.next = next;
>> +		var = &succ->mcs.locked;
> 
>> +	}	/*
>> +		 * Otherwise, pass the lock to the immediate successor
>> +		 * in the main queue.
>> +		 */
> 
> I don't think this mis-indented comment can happen. The call-site
> guarantees @next is non-null.
> 
> Therefore, cna_find_next() will either return it, or place it on the
> secondary list. If it (cna_find_next) returns NULL, we must have a
> non-empty secondary list.
> 
> In no case do I see this tertiary condition being possible.
find_successor() will return NULL if it does not find a thread running on the
same NUMA node, and the secondary queue might well be empty at that point.
That is exactly the case the last branch covers.
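To spell out the three cases the current code distinguishes (control flow
only, not the final code):

	succ = find_successor(node);

	if (succ) {
		/* 1) found a successor on the same NUMA node */
	} else if (node->locked > 1) {
		/* 2) no same-node successor, but the secondary queue is
		 *    not empty: pass the lock to its first node */
	} else {
		/* 3) no same-node successor and the secondary queue is
		 *    empty: pass the lock to the immediate successor
		 *    @next in the main queue */
	}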

> 
>> +
>> +	arch_mcs_spin_unlock_contended(var, val);
>> +}
> 
> This also renders this @next argument superfluous.
> 
> static void cna_mcs_pass_lock(struct mcs_spinlock *node, struct mcs_spinlock *next)
> {
> 	next = (struct mcs_spinlock *)cna_find_next(node);
> 	if (!next) {
> 		BUG_ON(node->locked <= 1);
> 		next = decode_tail(node->locked);
> 		node->locked = 1;
> 	}
> 
> 	arch_mcs_pass_lock(&next->locked, node->locked);
> }

@next is passed to avoid reloading it from @node.
This is probably most important for the native code (__pass_mcs_lock()).
That function should be inlined, however, so the extra load should not matter.
Bottom line, I agree that we can remove the @next argument.

Best regards,
— Alex


