From: Alex Kogan <alex.kogan@oracle.com>
To: Waiman Long <longman@redhat.com>
Cc: linux@armlinux.org.uk, peterz@infradead.org, mingo@redhat.com,
	will.deacon@arm.com, arnd@arndb.de, linux-arch@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, tglx@linutronix.de, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, guohanjun@huawei.com,
	jglauber@marvell.com, steven.sistare@oracle.com,
	daniel.m.jordan@oracle.com, dave.dice@oracle.com,
	rahul.x.yadav@oracle.com
Subject: Re: [PATCH v4 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
Date: Thu, 19 Sep 2019 11:55:21 -0400
Message-ID: <87B87982-670F-4F12-9EE0-DC89A059FAEC@oracle.com>
In-Reply-To: <3ae2b6a2-ffe6-2ca1-e5bf-2292db50e26f@redhat.com>

>> +/*
>> + * cna_try_find_next - scan the main waiting queue looking for the first
>> + * thread running on the same NUMA node as the lock holder. If found (call it
>> + * thread T), move all threads in the main queue between the lock holder and
>> + * T to the end of the secondary queue and return T; otherwise, return NULL.
>> + *
>> + * Schematically, this may look like the following (nn stands for numa_node and
>> + * et stands for encoded_tail).
>> + *
>> + *     when cna_try_find_next() is called (the secondary queue is empty):
>> + *
>> + *  A+------------+   B+--------+   C+--------+   T+--------+
>> + *   |mcs:next    | -> |mcs:next| -> |mcs:next| -> |mcs:next| -> NULL
>> + *   |mcs:locked=1|    |cna:nn=0|    |cna:nn=2|    |cna:nn=1|
>> + *   |cna:nn=1    |    +--------+    +--------+    +--------+
>> + *   +------------+
>> + *
>> + *     when cna_try_find_next() returns (the secondary queue contains B and C):
>> + *
>> + *  A+----------------+    T+--------+
>> + *   |mcs:next        | ->  |mcs:next| -> NULL
>> + *   |mcs:locked=B.et | -+  |cna:nn=1|
>> + *   |cna:nn=1        |  |  +--------+
>> + *   +----------------+  |
>> + *                       |
>> + *                       +->  B+--------+   C+--------+
>> + *                             |mcs:next| -> |mcs:next|
>> + *                             |cna:nn=0|    |cna:nn=2|
>> + *                             |cna:tail| -> +--------+
>> + *                             +--------+
>> + *
>> + * The worst case complexity of the scan is O(n), where n is the number
>> + * of current waiters. However, the fast path, which is expected to be the
>> + * common case, is O(1).
>> + */
>> +static struct mcs_spinlock *cna_try_find_next(struct mcs_spinlock *node,
>> +					      struct mcs_spinlock *next)
>> +{
>> +	struct cna_node *cn = (struct cna_node *)node;
>> +	struct cna_node *cni = (struct cna_node *)next;
>> +	struct cna_node *first, *last = NULL;
>> +	int my_numa_node = cn->numa_node;
>> +
>> +	/* fast path: immediate successor is on the same NUMA node */
>> +	if (cni->numa_node == my_numa_node)
>> +		return next;
>> +
>> +	/* find any next waiter on 'our' NUMA node */
>> +	for (first = cni;
>> +	     cni && cni->numa_node != my_numa_node;
>> +	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
>> +		;
>> +
>> +	/* if found, splice any skipped waiters onto the secondary queue */
>> +	if (cni && last)
>> +		cna_splice_tail(cn, first, last);
>> +
>> +	return (struct mcs_spinlock *)cni;
>> +}
> 
> At the Linux Plumbers Conference last week, Will raised the concern
> about the latency of the O(n) scan in cna_try_find_next(), which will
> add to the lock hold time.
While the worst-case complexity of the scan is O(n), I think it can be
proven that the amortized complexity is O(1). For intuition, consider a
two-node system with N threads in total. In the worst case, the scan goes
over N/2 threads running on the other node. If the scan ultimately “fails”
(no thread from the lock holder’s node is found), the lock is passed to
the first thread from the other node and then circulates among those N/2
threads, with each of the next N/2 - 1 handovers scanning just one queue
node. Otherwise, those N/2 threads are moved to the secondary queue. On
the next lock handover, we pass the lock either to the next thread in the
main queue (which has to be from our node) or to the first waiter in the
secondary queue. In either case we scan just one queue node, and in the
latter case we again get N/2 - 1 handovers that scan one queue node each.
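
For what it's worth, here is a toy user-space sketch of that counting
argument (illustrative only, not kernel code; the waiter count N and the
worst-case handover pattern are assumptions made up for the example):

#include <stdio.h>

#define N 16	/* total waiters, N/2 on each of two NUMA nodes */

int main(void)
{
	long scanned, handovers;

	/*
	 * Worst-case first handover: the lock holder skips past all
	 * N/2 waiters from the other node before finding one of its
	 * own (or failing the scan).
	 */
	scanned = N / 2;
	handovers = 1;

	/*
	 * The skipped waiters now sit in the secondary queue, so each
	 * of the remaining handovers (draining both queues) examines
	 * only the immediate successor: one queue node apiece.
	 */
	scanned += N - 1;
	handovers += N - 1;

	/* (N/2 + N - 1) / N, i.e. about 1.5 nodes per handover */
	printf("avg nodes scanned per handover: %.2f\n",
	       (double)scanned / handovers);
	return 0;
}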

> One way to hide some of the latency is to do
> a pre-scan before acquiring the lock. The CNA code could override the
> pv_wait_head_or_lock() function to call cna_try_find_next() as a
> pre-scan and return 0. What do you think?
This is certainly possible, but I do not think it would completely
eliminate the worst-case scenario. It would probably make it even less
likely, but at the same time it would reduce the chance of actually
finding a thread from the same node, since such a thread may enter the
main queue while we wait for the owner and pending bits to go away.
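
To make the suggestion concrete, a rough sketch of such an override might
look like the following (hypothetical code, not part of this patch; the
cna_pre_scan name and the pre_scan_result field are assumptions for
illustration):

/*
 * Run the O(n) scan while we are still spinning for the owner and
 * pending bits to clear, so its cost is paid outside the lock hold
 * time. Returning 0 tells the slow path to proceed and acquire the
 * lock itself, as usual.
 */
static u32 cna_pre_scan(struct qspinlock *lock, struct mcs_spinlock *node)
{
	struct cna_node *cn = (struct cna_node *)node;
	struct mcs_spinlock *next = READ_ONCE(node->next);

	cn->pre_scan_result = next ? cna_try_find_next(node, next) : NULL;

	return 0;
}

with the hook wired in via something like
#define pv_wait_head_or_lock(l, n)	cna_pre_scan(l, n)
in the CNA variant of the slow path. As noted above, though, any waiter
that arrives after the pre-scan would be missed, so the handover would
still need to re-check (and possibly re-scan) the queue.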

Regards,
— Alex
