From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v10 3/5] locking/qspinlock: Introduce CNA into the slow
 path of qspinlock
From: Waiman Long
Organization: Red Hat
Date: Tue, 28 Jul 2020 16:00:12 -0400
To: Alex Kogan, linux@armlinux.org.uk, peterz@infradead.org,
 mingo@redhat.com, will.deacon@arm.com, arnd@arndb.de,
 linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, tglx@linutronix.de, bp@alien8.de,
 hpa@zytor.com, x86@kernel.org, guohanjun@huawei.com, jglauber@marvell.com
Cc: steven.sistare@oracle.com, daniel.m.jordan@oracle.com, dave.dice@oracle.com
References: <20200403205930.1707-1-alex.kogan@oracle.com>
 <20200403205930.1707-4-alex.kogan@oracle.com>
In-Reply-To: <20200403205930.1707-4-alex.kogan@oracle.com>
On 4/3/20 4:59 PM, Alex Kogan wrote:
> In CNA, spinning threads are organized in two queues, a primary queue for
> threads running on the same node as the current lock holder, and a
> secondary queue for threads running on other nodes. After acquiring the
> MCS lock and before acquiring the spinlock, the lock holder scans the
> primary queue looking for a thread running on the same node (pre-scan). If
> found (call it thread T), all threads in the primary queue between the
> current lock holder and T are moved to the end of the secondary queue.
> If such T is not found, we make another scan of the primary queue when
> unlocking the MCS lock (post-scan), starting at the position where
> pre-scan stopped. If both scans fail to find such T, the MCS lock is
> passed to the first thread in the secondary queue. If the secondary queue
> is empty, the lock is passed to the next thread in the primary queue.
> For more details, see https://arxiv.org/abs/1810.05600.
>
> Note that this variant of CNA may introduce starvation by continuously
> passing the lock to threads running on the same node. This issue
> will be addressed later in the series.
>
> Enabling CNA is controlled via a new configuration option
> (NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at
> boot time only if we run on a multi-node machine in a native environment
> and the new config is enabled. (For the time being, the patching requires
> CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
> resolved once static_call() is available.) This default behavior can be
> overridden with the new kernel boot command-line option
> "numa_spinlock=on/off" (default is "auto").
>
> Signed-off-by: Alex Kogan
> Reviewed-by: Steve Sistare
> Reviewed-by: Waiman Long
> ---

There is also a concern that the worst case latency for a lock transfer
can be close to O(n), which can be quite large for large SMP systems. I
have a patch on top that modifies the current behavior to limit the
number of node scans after the lock is freed.
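Very roughly, the idea can be pictured with the following stand-alone toy
model (illustrative only: the structure and names below are made up for the
example and are not the qspinlock code; the real changes are in
cna_order_queue() in the attached patch):

/*
 * Toy model of a look-ahead-bounded scan (names and structures are made
 * up for illustration; this is not the kernel implementation).
 * Scan the waiter queue for a waiter on 'my_node'.  While the lock is
 * still held the scan is effectively free; once the lock is seen to be
 * free, give up after checking at most 4 * nr_nodes more waiters.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct waiter {
	int		numa_node;
	struct waiter	*next;
};

static struct waiter *find_local_waiter(struct waiter *head, int my_node,
					int nr_nodes,
					bool (*lock_is_free)(void))
{
	int lookahead = 4 * nr_nodes;	/* budget charged only after the lock is free */
	bool locked = true;

	for (struct waiter *w = head->next; w && lookahead; w = w->next) {
		if (locked && lock_is_free())
			locked = false;		/* start charging the look-ahead budget */
		if (w->numa_node == my_node)
			return w;		/* found a same-node waiter */
		if (!locked)
			lookahead--;
	}
	return NULL;	/* no same-node waiter found within the budget */
}

/* Tiny demo: we are on node 0 and the lock is already free. */
static bool lock_free_now(void) { return true; }

int main(void)
{
	struct waiter w2 = { .numa_node = 0, .next = NULL };
	struct waiter w1 = { .numa_node = 2, .next = &w2 };
	struct waiter w0 = { .numa_node = 1, .next = &w1 };
	struct waiter head = { .numa_node = 0, .next = &w0 };

	printf("local waiter %s\n",
	       find_local_waiter(&head, 0, 4, lock_free_now) ? "found" : "not found");
	return 0;
}

The real patch additionally keeps watching the lock word and, when it
reaches the end of the queue while the lock is still held, waits for either
the lock to be freed or a new waiter to show up; the look-ahead budget
works the same way.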
Cheers,
Longman

[Attachment: 0008-locking-qspinlock-Limit-CNA-node-scan-after-the-lock.patch]

>From 1e0d1a7c1d04383367b3d7a086ef3becd2fd81d8 Mon Sep 17 00:00:00 2001
From: Waiman Long
Date: Tue, 28 Jul 2020 15:13:47 -0400
Subject: [PATCH 8/9] locking/qspinlock: Limit CNA node scan after the lock
 is freed

When there is a long list of lock waiters, the worst case scenario for
CNA is that none of them are from the same NUMA node. All the lock
waiters will have to be scanned to figure that out. If the lock becomes
free early in the scanning process, there will be a long delay between
the lock being freed and re-acquired after scanning the full list. The
worst case delay is O(n), where n is the number of lock waiters, which
is bounded by the number of CPUs in the system. One way to limit the
delay is to limit the number of nodes to be scanned after the lock is
freed.

Assuming a random distribution of NUMA nodes among the lock waiters, the
chance of not finding a lock waiter on the same NUMA node is ((n-1)/n)^m,
where n is the number of NUMA nodes and m is the number of lock waiters
scanned. If we limit the number of scans to 4n, for example, the chance
of not finding a same-NUMA-node lock waiter will be 1.4% and 1.6% for
8- and 16-socket servers respectively. Note that the limit applies only
after the lock is freed, so the chance of not finding the right lock
waiter is even lower.

To make this possible, the lock waiter scanning and the lock spinning
have to be done at the same time. Since the lock waiter scanning won't
stop until the lock is freed (with additional look-ahead, if necessary),
the extra scanning done in cna_lock_handoff() isn't needed anymore.

Signed-off-by: Waiman Long
---
 kernel/locking/qspinlock_cna.h | 130 +++++++++++++++++++++------------
 1 file changed, 83 insertions(+), 47 deletions(-)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 03e8ba71a537..89a3905870ea 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -59,7 +59,7 @@ struct cna_node {
 	u16			numa_node;
 	u16			real_numa_node;
 	u32			encoded_tail;	/* self */
-	u32			partial_order;	/* encoded tail or enum val */
+	u32			order;		/* encoded tail or enum val */
 	s32			start_time;
 };
 
@@ -79,6 +79,12 @@ enum {
 	((m) + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ)
 static int intra_node_handoff_threshold __ro_after_init = MSECS_TO_JIFFIES(10);
 
+/*
+ * Controls how many look-aheads for a matching numa node are done after
+ * the lock is freed.
+ */
+static int matching_node_lookahead __ro_after_init;
+
 static inline bool intra_node_threshold_reached(struct cna_node *cn)
 {
 	s32 current_time = (s32)jiffies;
@@ -105,6 +111,11 @@ static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 		 */
 		WARN_ON(cn->encoded_tail < MIN_ENCODED_TAIL);
 	}
+	/*
+	 * With 4x numa-node lookahead and assuming random node distribution,
+	 * the probability of not finding a matching node is less than 2%.
+	 */
+	matching_node_lookahead = 4*nr_node_ids;
 }
 
 static int __init cna_init_nodes(void)
@@ -262,33 +273,67 @@ static void cna_splice_tail(struct mcs_spinlock *node,
  * of current waiters. However, the amortized complexity is close to O(1),
  * as the immediate successor is likely to be running on the same node once
  * threads from other nodes are moved to the secondary queue.
+ *
+ * Since the lock state is being monitored at the same time and the number
+ * of additional scans after the lock is freed is limited, the additional
+ * lock acquisition latency is limited as well.
  */
-static u32 cna_order_queue(struct mcs_spinlock *node,
-			   struct mcs_spinlock *iter)
+static u32 cna_order_queue(struct mcs_spinlock *node, struct qspinlock *lock)
 {
-	struct cna_node *cni = (struct cna_node *)READ_ONCE(iter->next);
 	struct cna_node *cn = (struct cna_node *)node;
-	int nid = cn->numa_node;
-	struct cna_node *last;
+	struct cna_node *cni = cn;
+	struct cna_node *found = NULL;
+	int lookahead = matching_node_lookahead;
+	bool locked = true;
 
-	/* find any next waiter on 'our' NUMA node */
-	for (last = cn;
-	     /*
-	      * iterate as long as the current node is not priorizied and
-	      * does not run on 'our' NUMA node
-	      */
-	     cni && cni->numa_node != CNA_PRIORITY_NODE && cni->numa_node != nid;
-	     last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
-		;
+	/*
+	 * Find any next waiter on 'our' NUMA node while spinning on the lock
+	 * simultaneously. Iterate as long as the current node is not
+	 * prioritized and does not run on 'our' NUMA node, or until up to
+	 * "lookahead" nodes have been checked after the lock was freed.
+	 */
+	while (!found && lookahead) {
+		if (locked) {
+			u32 val = atomic_read(&lock->val);
+			if (!(val & _Q_LOCKED_PENDING_MASK)) {
+				locked = false;
+				continue;
+			}
+			/*
+			 * If we are hitting the end of the queue, use
+			 * atomic_cond_read_relaxed() to wait until the
+			 * lock value changes which means either the lock
+			 * is freed or a new waiter comes.
+			 */
+			if (cni->encoded_tail == (val & _Q_TAIL_MASK))
+				atomic_cond_read_relaxed(&lock->val, VAL != val);
+		}
+		if (!found) {
+			found = (struct cna_node *)READ_ONCE(cni->mcs.next);
+			if (!found && !locked)
+				break;	/* Lock is free with no more nodes to parse */
+
+			if (found && found->numa_node != CNA_PRIORITY_NODE
+				  && found->numa_node != cn->numa_node) {
+				cni = found;
+				found = NULL;
+			}
+			if (!locked)
+				lookahead--;
+		}
+	}
+
+	if (!found)
+		return cni->encoded_tail; /* continue from here */
 
-	if (!cni)
-		return last->encoded_tail; /* continue from here */
+	if (found->numa_node == CNA_PRIORITY_NODE) {
+		int nid = cn->numa_node;
 
-	if (cni->numa_node == CNA_PRIORITY_NODE)
-		cni->numa_node = nid; /* Inherit node id of primary queue */
+		found->numa_node = nid; /* Inherit node id of primary queue */
+	}
 
-	if (last != cn)	/* did we skip any waiters? */
-		cna_splice_tail(node, node->next, (struct mcs_spinlock *)last);
+	if (cni != cn)	/* did we skip any waiters? */
+		cna_splice_tail(node, node->next, (struct mcs_spinlock *)cni);
 
 	/*
 	 * We return LOCAL_WAITER_FOUND here even if we stopped the scan because
@@ -312,23 +357,23 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
 	if (!cn->start_time)
 		cn->start_time = (s32)jiffies;
 
-	if (!intra_node_threshold_reached(cn)) {
-		/*
-		 * We are at the head of the wait queue, no need to use
-		 * the fake NUMA node ID.
-		 */
-		if (cn->numa_node == CNA_PRIORITY_NODE)
-			cn->numa_node = cn->real_numa_node;
-
-		/*
-		 * Try and put the time otherwise spent spin waiting on
-		 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
-		 */
-		cn->partial_order = cna_order_queue(node, node);
-	} else {
-		cn->partial_order = FLUSH_SECONDARY_QUEUE;
+	if (intra_node_threshold_reached(cn)) {
+		cn->order = FLUSH_SECONDARY_QUEUE;
+		goto out;
 	}
+	/*
+	 * We are at the head of the wait queue, no need to use
+	 * the fake NUMA node ID.
+	 */
+	if (cn->numa_node == CNA_PRIORITY_NODE)
+		cn->numa_node = cn->real_numa_node;
+	/*
+	 * Try and put the time otherwise spent spin waiting on
+	 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
+	 */
+	cn->order = cna_order_queue(node, lock);
+out:
 	return 0; /* we lied; we didn't wait, go do so now */
 }
 
@@ -337,18 +382,9 @@ static inline void cna_lock_handoff(struct mcs_spinlock *node,
 {
 	struct cna_node *cn = (struct cna_node *)node;
 	u32 val = _Q_LOCKED_VAL;
-	u32 partial_order = cn->partial_order;
-
-	/*
-	 * check if a successor from the same numa node has not been found in
-	 * pre-scan, and if so, try to find it in post-scan starting from the
-	 * node where pre-scan stopped (stored in @pre_scan_result)
-	 */
-	if (partial_order >= MIN_ENCODED_TAIL)
-		partial_order =
-			cna_order_queue(node, decode_tail(partial_order));
+	u32 order = cn->order;
 
-	if (partial_order == LOCAL_WAITER_FOUND) {
+	if (order == LOCAL_WAITER_FOUND) {
 		/*
 		 * We found a local waiter; reload @next in case we called
 		 * cna_order_queue() above.
-- 
2.18.1
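As a quick sanity check of the 1.4% and 1.6% figures in the commit message
above, here is a small stand-alone user-space program (not part of the
patch) that evaluates ((n-1)/n)^(4n) for a few node counts:

/*
 * Evaluate P(miss) = ((n - 1) / n)^(4n): the probability that none of
 * 4n scanned waiters sits on our NUMA node, assuming the waiters'
 * nodes are independently and uniformly distributed over n nodes.
 *
 * Build with e.g.:  cc -O2 -o cna-miss cna-miss.c -lm   (file name is arbitrary)
 */
#include <math.h>
#include <stddef.h>
#include <stdio.h>

int main(void)
{
	static const int nodes[] = { 2, 4, 8, 16 };

	for (size_t i = 0; i < sizeof(nodes) / sizeof(nodes[0]); i++) {
		int n = nodes[i];
		double miss = pow((double)(n - 1) / n, 4.0 * n);

		printf("%2d NUMA nodes: P(no same-node waiter in %3d scans) = %.2f%%\n",
		       n, 4 * n, 100.0 * miss);
	}
	return 0;
}

For 8 and 16 NUMA nodes this prints about 1.39% and 1.61%, in line with
the figures quoted in the commit message.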