From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.6 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY, SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B67EBC5DF60 for ; Fri, 8 Nov 2019 10:31:10 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 7F33B215EA for ; Fri, 8 Nov 2019 10:31:10 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1573209070; bh=kKLG2JES5cLuDs5ImLZ5AHAKe9atzMaG4BvyezuK1aw=; h=Date:From:To:Cc:Subject:References:In-Reply-To:List-ID:From; b=Bww7oet2TArrchSGTR+yUlyS6FV/GcBeaI8Bh8A5JAJx5jnVSX6IgQvN3dRYrRS8G Tw8o9o5fiyPtTy9TES/1XmdPPPIk5ftMT3jj4Yn1JICTs3HmPw2BR8PopJA6QQ0opi tcbAM5Y+ccvJzglUkiQcE3B1pD2ye1LydmQUNEkg= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731310AbfKHKbJ (ORCPT ); Fri, 8 Nov 2019 05:31:09 -0500 Received: from mx2.suse.de ([195.135.220.15]:49712 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726149AbfKHKbJ (ORCPT ); Fri, 8 Nov 2019 05:31:09 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id D6F3BB178; Fri, 8 Nov 2019 10:31:03 +0000 (UTC) Date: Fri, 8 Nov 2019 11:31:02 +0100 From: Michal Hocko To: Shaokun Zhang Cc: linux-kernel@vger.kernel.org, yuqi jin , Andrew Morton , Mike Rapoport , Paul Burton , Michael Ellerman , Anshuman Khandual , netdev@vger.kernel.org Subject: Re: [PATCH v3] lib: optimize cpumask_local_spread() Message-ID: <20191108103102.GF15658@dhcp22.suse.cz> References: <1573091048-10595-1-git-send-email-zhangshaokun@hisilicon.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1573091048-10595-1-git-send-email-zhangshaokun@hisilicon.com> User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This changelog looks better, thanks! I still have some questions though. Btw. cpumask_local_spread is used by the networking code but I do not see net guys involved (Cc netdev) On Thu 07-11-19 09:44:08, Shaokun Zhang wrote: > From: yuqi jin > > In the multi-processors and NUMA system, I/O driver will find cpu cores > that which shall be bound IRQ. When cpu cores in the local numa have > been used, it is better to find the node closest to the local numa node, > instead of choosing any online cpu immediately. > > On Huawei Kunpeng 920 server, there are 4 NUMA node(0 -3) in the 2-cpu > system(0 - 1). Please send a topology of this server (numactl -H). > We perform PS (parameter server) business test, the > behavior of the service is that the client initiates a request through > the network card, the server responds to the request after calculation. Is the benchmark any ublicly available? > When two PS processes run on node2 and node3 separately and the > network card is located on 'node2' which is in cpu1, the performance > of node2 (26W QPS) and node3 (22W QPS) was different. > It is better that the NIC queues are bound to the cpu1 cores in turn, > then XPS will also be properly initialized, while cpumask_local_spread > only considers the local node. When the number of NIC queues exceeds > the number of cores in the local node, it returns to the online core > directly. So when PS runs on node3 sending a calculated request, > the performance is not as good as the node2. It is considered that > the NIC and other I/O devices shall initialize the interrupt binding, > if the cores of the local node are used up, it is reasonable to return > the node closest to it. Can you post cpu affinities before and after this patch? > Let's optimize it and find the nearest node through NUMA distance for the > non-local NUMA nodes. The performance will be better if it return the > nearest node than the random node. > > After this patch, the performance of the node3 is the same as node2 > that is 26W QPS when the network card is still in 'node2'. Since it will > return the closest non-local NUMA code rather than random node, it is no > harm to others at least. It would be also nice to explain why the current implementation hasn't taken the path your have chosen. Was it a simplicity or is there a more fundamental reason? Is there any risk that existing users would regress? Preferring cpus from the local socket which is what you aim for sounds like a logical thing to do so I am wondering why this hasn't been considered. > Cc: Andrew Morton > Cc: Mike Rapoport > Cc: Paul Burton > Cc: Michal Hocko > Cc: Michael Ellerman > Cc: Anshuman Khandual > Signed-off-by: yuqi jin > Signed-off-by: Shaokun Zhang > --- > ChangeLog from v2: > 1. Change the variables as static and use spinlock to protect; > 2. Give more explantation on test and performance; > lib/cpumask.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++------- > 1 file changed, 90 insertions(+), 12 deletions(-) > > diff --git a/lib/cpumask.c b/lib/cpumask.c > index 0cb672eb107c..b98a2256bc5a 100644 > --- a/lib/cpumask.c > +++ b/lib/cpumask.c > @@ -6,6 +6,7 @@ > #include > #include > #include > +#include > > /** > * cpumask_next - get the next cpu in a cpumask > @@ -192,18 +193,39 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask) > } > #endif > > -/** > - * cpumask_local_spread - select the i'th cpu with local numa cpu's first > - * @i: index number > - * @node: local numa_node > - * > - * This function selects an online CPU according to a numa aware policy; > - * local cpus are returned first, followed by non-local ones, then it > - * wraps around. > - * > - * It's not very efficient, but useful for setup. > - */ > -unsigned int cpumask_local_spread(unsigned int i, int node) > +static void calc_node_distance(int *node_dist, int node) > +{ > + int i; > + > + for (i = 0; i < nr_node_ids; i++) > + node_dist[i] = node_distance(node, i); > +} > + > +static int find_nearest_node(int *node_dist, bool *used) > +{ > + int i, min_dist = node_dist[0], node_id = -1; > + > + /* Choose the first unused node to compare */ > + for (i = 0; i < nr_node_ids; i++) { > + if (used[i] == 0) { > + min_dist = node_dist[i]; > + node_id = i; > + break; > + } > + } > + > + /* Compare and return the nearest node */ > + for (i = 0; i < nr_node_ids; i++) { > + if (node_dist[i] < min_dist && used[i] == 0) { > + min_dist = node_dist[i]; > + node_id = i; > + } > + } > + > + return node_id; > +} > + > +static unsigned int __cpumask_local_spread(unsigned int i, int node) > { > int cpu; > > @@ -231,4 +253,60 @@ unsigned int cpumask_local_spread(unsigned int i, int node) > } > BUG(); > } > + > +static DEFINE_SPINLOCK(spread_lock); > +/** > + * cpumask_local_spread - select the i'th cpu with local numa cpu's first > + * @i: index number > + * @node: local numa_node > + * > + * This function selects an online CPU according to a numa aware policy; > + * local cpus are returned first, followed by the nearest non-local ones, > + * then it wraps around. > + * > + * It's not very efficient, but useful for setup. > + */ > +unsigned int cpumask_local_spread(unsigned int i, int node) > +{ > + static int node_dist[MAX_NUMNODES]; > + static bool used[MAX_NUMNODES]; > + unsigned long flags; > + int cpu, j, id; > + > + /* Wrap: we always want a cpu. */ > + i %= num_online_cpus(); > + > + if (node == NUMA_NO_NODE) { > + for_each_cpu(cpu, cpu_online_mask) > + if (i-- == 0) > + return cpu; > + } else { > + if (nr_node_ids > MAX_NUMNODES) > + return __cpumask_local_spread(i, node); > + > + spin_lock_irqsave(&spread_lock, flags); > + memset(used, 0, nr_node_ids * sizeof(bool)); > + calc_node_distance(node_dist, node); > + for (j = 0; j < nr_node_ids; j++) { > + id = find_nearest_node(node_dist, used); > + if (id < 0) > + break; > + > + for_each_cpu_and(cpu, cpumask_of_node(id), > + cpu_online_mask) > + if (i-- == 0) { > + spin_unlock_irqrestore(&spread_lock, > + flags); > + return cpu; > + } > + used[id] = 1; > + } > + spin_unlock_irqrestore(&spread_lock, flags); > + > + for_each_cpu(cpu, cpu_online_mask) > + if (i-- == 0) > + return cpu; > + } > + BUG(); > +} > EXPORT_SYMBOL(cpumask_local_spread); > -- > 2.7.4 -- Michal Hocko SUSE Labs