From: Waiman Long <longman@redhat.com>
To: Alex Kogan <alex.kogan@oracle.com>
Cc: linux@armlinux.org.uk, peterz@infradead.org, mingo@redhat.com,
	will.deacon@arm.com, arnd@arndb.de, linux-arch@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, tglx@linutronix.de, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, guohanjun@huawei.com,
	jglauber@marvell.com, steven.sistare@oracle.com,
	daniel.m.jordan@oracle.com, dave.dice@oracle.com
Subject: Re: [PATCH v10 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock
Date: Tue, 1 Sep 2020 13:38:29 -0400	[thread overview]
Message-ID: <84fdb30e-387d-7bc1-e1f9-fa57cd9a32c9@redhat.com> (raw)
In-Reply-To: <08E77224-563F-49C7-9E7F-BD98E4FD121D@oracle.com>

On 8/31/20 5:39 PM, Alex Kogan wrote:
>> On Jul 28, 2020, at 4:00 PM, Waiman Long <longman@redhat.com> wrote:
>>
>> On 4/3/20 4:59 PM, Alex Kogan wrote:
>>> In CNA, spinning threads are organized in two queues, a primary queue for
>>> threads running on the same node as the current lock holder, and a
>>> secondary queue for threads running on other nodes. After acquiring the
>>> MCS lock and before acquiring the spinlock, the lock holder scans the
>>> primary queue looking for a thread running on the same node (pre-scan). If
>>> found (call it thread T), all threads in the primary queue between the
>>> current lock holder and T are moved to the end of the secondary queue.
>>> If no such T is found, we make another scan of the primary queue when
>>> unlocking the MCS lock (post-scan), starting at the position where the
>>> pre-scan stopped. If both scans fail to find such T, the MCS lock is
>>> passed to the first thread in the secondary queue. If the secondary queue
>>> is empty, the lock is passed to the next thread in the primary queue.
>>> For more details, see https://arxiv.org/abs/1810.05600 .
>>>
>>> Note that this variant of CNA may introduce starvation by continuously
>>> passing the lock to threads running on the same node. This issue
>>> will be addressed later in the series.
>>>
>>> Enabling CNA is controlled via a new configuration option
>>> (NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at
>>> boot time only if we run on a multi-node machine in a native environment
>>> and the new config is enabled. (For the time being, the patching requires
>>> CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
>>> resolved once static_call() is available.) This default behavior can be
>>> overridden with the new kernel boot command-line option
>>> "numa_spinlock=on/off" (default is "auto").
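For reference, the Kconfig wiring for such an option would look roughly like the following. This is a sketch reconstructed from the cover text above, not the literal hunk from the patch:

```kconfig
config NUMA_AWARE_SPINLOCKS
	bool "NUMA-aware spinlocks"
	depends on NUMA
	depends on QUEUED_SPINLOCKS
	# For now the boot-time patching goes through the paravirt
	# infrastructure; per the text above, this dependency should
	# go away once static_call() is available.
	depends on PARAVIRT_SPINLOCKS
	help
	  Introduce NUMA awareness into the slow path of spinlocks by
	  patching in the CNA variant at boot on multi-node machines.
```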
>>>
>>> Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
>>> Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
>>> Reviewed-by: Waiman Long <longman@redhat.com>
>>> ---
>> There is also a concern that the worst-case latency for a lock transfer can be close to O(n), which can be quite large for large SMP systems. I have a patch on top that modifies the current behavior to limit the number of node scans after the lock is freed.
> I understand the concern. While your patch addresses it, I am afraid it makes
> the code somewhat more complex, and duplicates some of the slow path
> functionality (e.g., the spin loop until the lock value changes to a certain
> value).
>
> Let me suggest a different idea for gradually restructuring the main queue
> that has some similarity to the way you suggested to handle prioritized waiters.
> Basically, instead of scanning the entire chain of main queue waiters,
> we can check only the next waiter and, if present and it runs on a different
> node, move it to the secondary queue. In addition, to maintain the preference
> for a certain numa node ID, we set the numa node of the next-next waiter,
> if present, to that of the current lock holder. This is the part similar to the
> way you suggested to handle prioritized waiters.
>
> This way, the worst-case complexity of cna_order_queue() decreases from O(n)
> down to O(1), as we always "scan" only one waiter. And as before, we change
> the preference (and flush the secondary queue) after X handovers (or after
> Y ms, as in your other patch).
>
> I attach the patch that applies on top of your patch for prioritized nodes
> (0006), but does not include your patch 0007 (time based threshold),
> which I will integrate into the series in the next revision.
>
> Please, let me know what you think.
>
That is an interesting idea. I don't have any fundamental objection to 
that. I just wonder how it will impact the kind of performance test that 
you ran before. It would be nice to see the performance impact with that 
change.

Cheers,
Longman


Thread overview: 27+ messages
2020-04-03 20:59 [PATCH v10 0/5] Add NUMA-awareness to qspinlock Alex Kogan
2020-04-03 20:59 ` [PATCH v10 1/5] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan
2020-04-03 20:59 ` [PATCH v10 2/5] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
2020-04-03 20:59 ` [PATCH v10 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
2020-04-04 23:25   ` kbuild test robot
2020-04-07 21:57     ` Alex Kogan
2020-07-28 20:00   ` Waiman Long
2020-08-31 21:39     ` Alex Kogan
2020-09-01 17:38       ` Waiman Long [this message]
2020-04-03 20:59 ` [PATCH v10 4/5] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
2020-07-28 19:39   ` Waiman Long
2020-04-03 20:59 ` [PATCH v10 5/5] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA Alex Kogan
2020-07-28 19:34   ` Waiman Long
2020-05-04 14:17 ` [PATCH v10 0/5] Add NUMA-awareness to qspinlock Alex Kogan
