* [PATCH v2] lib: optimize cpumask_local_spread()
@ 2019-11-04 10:27 Shaokun Zhang
  2019-11-05  7:01 ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: Shaokun Zhang @ 2019-11-04 10:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: yuqi jin, Andrew Morton, Mike Rapoport, Paul Burton,
	Michal Hocko, Michael Ellerman, Anshuman Khandual, Shaokun Zhang

From: yuqi jin <jinyuqi@huawei.com>

On a multi-processor NUMA system, an I/O device may serve CPUs spanning
multiple NUMA nodes. When spreading past the device's local NUMA node,
it is better to pick a CPU on the node closest to the local one, instead
of choosing any online cpu immediately.

The current code only considers the local NUMA node; it does not compute
the distances between NUMA nodes when falling back to non-local ones.
Let's optimize it to find the nearest node using the NUMA distance
table. Performance is better when the nearest node is returned rather
than an arbitrary one.

When the Parameter Server workload is tested using a NIC on the Huawei
Kunpeng 920 SoC:
Without the patch, the performance is 220,000 QPS;
with this patch, the performance improves to 260,000 QPS.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: yuqi jin <jinyuqi@huawei.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
---
 lib/cpumask.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 81 insertions(+), 12 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index 0cb672eb107c..15d8940f32a8 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -192,18 +192,39 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
 }
 #endif
 
-/**
- * cpumask_local_spread - select the i'th cpu with local numa cpu's first
- * @i: index number
- * @node: local numa_node
- *
- * This function selects an online CPU according to a numa aware policy;
- * local cpus are returned first, followed by non-local ones, then it
- * wraps around.
- *
- * It's not very efficient, but useful for setup.
- */
-unsigned int cpumask_local_spread(unsigned int i, int node)
+static void calc_node_distance(int *node_dist, int node)
+{
+	int i;
+
+	for (i = 0; i < nr_node_ids; i++)
+		node_dist[i] = node_distance(node, i);
+}
+
+static int find_nearest_node(int *node_dist, bool *used)
+{
+	int i, min_dist = node_dist[0], node_id = -1;
+
+	/* Choose the first unused node to compare */
+	for (i = 0; i < nr_node_ids; i++) {
+		if (used[i] == 0) {
+			min_dist = node_dist[i];
+			node_id = i;
+			break;
+		}
+	}
+
+	/* Compare and return the nearest node */
+	for (i = 0; i < nr_node_ids; i++) {
+		if (node_dist[i] < min_dist && used[i] == 0) {
+			min_dist = node_dist[i];
+			node_id = i;
+		}
+	}
+
+	return node_id;
+}
+
+static unsigned int __cpumask_local_spread(unsigned int i, int node)
 {
 	int cpu;
 
@@ -231,4 +252,52 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
 	}
 	BUG();
 }
+
+/**
+ * cpumask_local_spread - select the i'th cpu with local numa cpu's first
+ * @i: index number
+ * @node: local numa_node
+ *
+ * This function selects an online CPU according to a numa aware policy;
+ * local cpus are returned first, followed by the nearest non-local ones,
+ * then it wraps around.
+ *
+ * It's not very efficient, but useful for setup.
+ */
+unsigned int cpumask_local_spread(unsigned int i, int node)
+{
+	int node_dist[MAX_NUMNODES] = {0};
+	bool used[MAX_NUMNODES] = {0};
+	int cpu, j, id;
+
+	/* Wrap: we always want a cpu. */
+	i %= num_online_cpus();
+
+	if (node == NUMA_NO_NODE) {
+		for_each_cpu(cpu, cpu_online_mask)
+			if (i-- == 0)
+				return cpu;
+	} else {
+		if (nr_node_ids > MAX_NUMNODES)
+			return __cpumask_local_spread(i, node);
+
+		calc_node_distance(node_dist, node);
+		for (j = 0; j < nr_node_ids; j++) {
+			id = find_nearest_node(node_dist, used);
+			if (id < 0)
+				break;
+
+			for_each_cpu_and(cpu, cpumask_of_node(id),
+					 cpu_online_mask)
+				if (i-- == 0)
+					return cpu;
+			used[id] = 1;
+		}
+
+		for_each_cpu(cpu, cpu_online_mask)
+			if (i-- == 0)
+				return cpu;
+	}
+	BUG();
+}
 EXPORT_SYMBOL(cpumask_local_spread);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] lib: optimize cpumask_local_spread()
  2019-11-04 10:27 [PATCH v2] lib: optimize cpumask_local_spread() Shaokun Zhang
@ 2019-11-05  7:01 ` Michal Hocko
  2019-11-06  1:33   ` Andrew Morton
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2019-11-05  7:01 UTC (permalink / raw)
  To: Shaokun Zhang
  Cc: linux-kernel, yuqi jin, Andrew Morton, Mike Rapoport,
	Paul Burton, Michael Ellerman, Anshuman Khandual

On Mon 04-11-19 18:27:48, Shaokun Zhang wrote:
> From: yuqi jin <jinyuqi@huawei.com>
> 
> In the multi-processor and NUMA system, I/O device may have many numa
> nodes belonging to multiple cpus. When we get a local numa, it is
> better to find the node closest to the local numa node, instead
> of choosing any online cpu immediately.
> 
> For the current code, it only considers the local NUMA node and it
> doesn't compute the distances between different NUMA nodes for the
> non-local NUMA nodes. Let's optimize it and find the nearest node
> through NUMA distance. The performance will be better if it return
> the nearest node than the random node.

Numbers please

[...]
> +/**
> + * cpumask_local_spread - select the i'th cpu with local numa cpu's first
> + * @i: index number
> + * @node: local numa_node
> + *
> + * This function selects an online CPU according to a numa aware policy;
> + * local cpus are returned first, followed by the nearest non-local ones,
> + * then it wraps around.
> + *
> + * It's not very efficient, but useful for setup.
> + */
> +unsigned int cpumask_local_spread(unsigned int i, int node)
> +{
> +	int node_dist[MAX_NUMNODES] = {0};
> +	bool used[MAX_NUMNODES] = {0};

Ugh. This might be a lot of stack space. Some distro kernels use large
NODE_SHIFT (e.g 10 so this would be 4kB of stack space just for the
node_dist).

> +	int cpu, j, id;
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2] lib: optimize cpumask_local_spread()
  2019-11-05  7:01 ` Michal Hocko
@ 2019-11-06  1:33   ` Andrew Morton
  2019-11-06  2:49     ` Shaokun Zhang
  2019-11-06  7:17     ` Michal Hocko
  0 siblings, 2 replies; 8+ messages in thread
From: Andrew Morton @ 2019-11-06  1:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shaokun Zhang, linux-kernel, yuqi jin, Mike Rapoport,
	Paul Burton, Michael Ellerman, Anshuman Khandual

On Tue, 5 Nov 2019 08:01:41 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> On Mon 04-11-19 18:27:48, Shaokun Zhang wrote:
> > From: yuqi jin <jinyuqi@huawei.com>
> > 
> > In the multi-processor and NUMA system, I/O device may have many numa
> > nodes belonging to multiple cpus. When we get a local numa, it is
> > better to find the node closest to the local numa node, instead
> > of choosing any online cpu immediately.
> > 
> > For the current code, it only considers the local NUMA node and it
> > doesn't compute the distances between different NUMA nodes for the
> > non-local NUMA nodes. Let's optimize it and find the nearest node
> > through NUMA distance. The performance will be better if it return
> > the nearest node than the random node.
> 
> Numbers please

The changelog had

: When Parameter Server workload is tested using NIC device on Huawei
: Kunpeng 920 SoC:
: Without the patch, the performance is 22W QPS;
: Added this patch, the performance become better and it is 26W QPS.

> [...]
> > +/**
> > + * cpumask_local_spread - select the i'th cpu with local numa cpu's first
> > + * @i: index number
> > + * @node: local numa_node
> > + *
> > + * This function selects an online CPU according to a numa aware policy;
> > + * local cpus are returned first, followed by the nearest non-local ones,
> > + * then it wraps around.
> > + *
> > + * It's not very efficient, but useful for setup.
> > + */
> > +unsigned int cpumask_local_spread(unsigned int i, int node)
> > +{
> > +	int node_dist[MAX_NUMNODES] = {0};
> > +	bool used[MAX_NUMNODES] = {0};
> 
> Ugh. This might be a lot of stack space. Some distro kernels use large
> NODE_SHIFT (e.g 10 so this would be 4kB of stack space just for the
> node_dist).

Yes, that's big.  From a quick peek I suspect we could get by using an
array of unsigned shorts here but that might be fragile over time even
if it works now?

Perhaps we could make it a statically allocated array and protect the
entire thing with a spin_lock_irqsave()?  It's not a frequently called
function.
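A rough userspace sketch of that suggestion, with a static scratch
array guarded by a lock; a pthread mutex stands in for
spin_lock_irqsave(), all names are illustrative, and this is not the
eventual kernel fix:

```c
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define NR_NODES 1024	/* stand-in for MAX_NUMNODES */

/* Static scratch space instead of ~5kB of stack; the lock serializes
 * the (infrequent) callers. */
static int node_dist[NR_NODES];
static bool node_used[NR_NODES];
static pthread_mutex_t spread_lock = PTHREAD_MUTEX_INITIALIZER;

/* Scan for the nearest node not yet marked used; -1 if none left. */
static int pick_nearest(int nr_nodes)
{
	int i, best = -1;

	for (i = 0; i < nr_nodes; i++)
		if (!node_used[i] &&
		    (best < 0 || node_dist[i] < node_dist[best]))
			best = i;
	return best;
}

/* Hypothetical wrapper: take the lock, fill the static tables from
 * the caller's distances, scan, release. */
int nearest_unused_node(const int *dist, int nr_nodes)
{
	int id;

	pthread_mutex_lock(&spread_lock);
	memcpy(node_dist, dist, nr_nodes * sizeof(*dist));
	memset(node_used, 0, nr_nodes * sizeof(*node_used));
	id = pick_nearest(nr_nodes);
	pthread_mutex_unlock(&spread_lock);
	return id;
}
```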



* Re: [PATCH v2] lib: optimize cpumask_local_spread()
  2019-11-06  1:33   ` Andrew Morton
@ 2019-11-06  2:49     ` Shaokun Zhang
  2019-11-06  7:17     ` Michal Hocko
  1 sibling, 0 replies; 8+ messages in thread
From: Shaokun Zhang @ 2019-11-06  2:49 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko
  Cc: linux-kernel, yuqi jin, Mike Rapoport, Paul Burton,
	Michael Ellerman, Anshuman Khandual

Hi Andrew,

On 2019/11/6 9:33, Andrew Morton wrote:
> On Tue, 5 Nov 2019 08:01:41 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
>> On Mon 04-11-19 18:27:48, Shaokun Zhang wrote:
>>> From: yuqi jin <jinyuqi@huawei.com>
>>>
>>> In the multi-processor and NUMA system, I/O device may have many numa
>>> nodes belonging to multiple cpus. When we get a local numa, it is
>>> better to find the node closest to the local numa node, instead
>>> of choosing any online cpu immediately.
>>>
>>> For the current code, it only considers the local NUMA node and it
>>> doesn't compute the distances between different NUMA nodes for the
>>> non-local NUMA nodes. Let's optimize it and find the nearest node
>>> through NUMA distance. The performance will be better if it return
>>> the nearest node than the random node.
>>
>> Numbers please
> 
> The changelog had
> 
> : When Parameter Server workload is tested using NIC device on Huawei
> : Kunpeng 920 SoC:
> : Without the patch, the performance is 22W QPS;
> : Added this patch, the performance become better and it is 26W QPS.
> 
>> [...]
>>> +/**
>>> + * cpumask_local_spread - select the i'th cpu with local numa cpu's first
>>> + * @i: index number
>>> + * @node: local numa_node
>>> + *
>>> + * This function selects an online CPU according to a numa aware policy;
>>> + * local cpus are returned first, followed by the nearest non-local ones,
>>> + * then it wraps around.
>>> + *
>>> + * It's not very efficient, but useful for setup.
>>> + */
>>> +unsigned int cpumask_local_spread(unsigned int i, int node)
>>> +{
>>> +	int node_dist[MAX_NUMNODES] = {0};
>>> +	bool used[MAX_NUMNODES] = {0};
>>
>> Ugh. This might be a lot of stack space. Some distro kernels use large
>> NODE_SHIFT (e.g 10 so this would be 4kB of stack space just for the
>> node_dist).
> 
> Yes, that's big.  From a quick peek I suspect we could get by using an
> array of unsigned shorts here but that might be fragile over time even
> if it works now?
> 

Yes, how about we define another macro with a value of 128 (not sure
whether it is big enough for actual needs)?

--->8
 unsigned int cpumask_local_spread(unsigned int i, int node)
 {
-       int node_dist[MAX_NUMNODES] = {0};
-       bool used[MAX_NUMNODES] = {0};
+       #define NUMA_NODE_NR     128
+       int node_dist[NUMA_NODE_NR] = {0};
+       bool used[NUMA_NODE_NR] = {0};
        int cpu, j, id;

        /* Wrap: we always want a cpu. */
@@ -278,7 +279,7 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
                        if (i-- == 0)
                                return cpu;
        } else {
-               if (nr_node_ids > MAX_NUMNODES)
+               if (nr_node_ids > NUMA_NODE_NR)
                        return __cpumask_local_spread(i, node);

                calc_node_distance(node_dist, node);

> Perhaps we could make it a statically allocated array and protect the
> entire thing with a spin_lock_irqsave()?  It's not a frequently called

It's another way to solve this issue. I'm not sure which one you and Michal prefer. ;-)

Thanks,
Shaokun

> function.
> 
> 
> .
> 



* Re: [PATCH v2] lib: optimize cpumask_local_spread()
  2019-11-06  1:33   ` Andrew Morton
  2019-11-06  2:49     ` Shaokun Zhang
@ 2019-11-06  7:17     ` Michal Hocko
  2019-11-06  8:02       ` Shaokun Zhang
  1 sibling, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2019-11-06  7:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shaokun Zhang, linux-kernel, yuqi jin, Mike Rapoport,
	Paul Burton, Michael Ellerman, Anshuman Khandual

On Tue 05-11-19 17:33:59, Andrew Morton wrote:
> On Tue, 5 Nov 2019 08:01:41 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Mon 04-11-19 18:27:48, Shaokun Zhang wrote:
> > > From: yuqi jin <jinyuqi@huawei.com>
> > > 
> > > In the multi-processor and NUMA system, I/O device may have many numa
> > > nodes belonging to multiple cpus. When we get a local numa, it is
> > > better to find the node closest to the local numa node, instead
> > > of choosing any online cpu immediately.
> > > 
> > > For the current code, it only considers the local NUMA node and it
> > > doesn't compute the distances between different NUMA nodes for the
> > > non-local NUMA nodes. Let's optimize it and find the nearest node
> > > through NUMA distance. The performance will be better if it return
> > > the nearest node than the random node.
> > 
> > Numbers please
> 
> The changelog had
> 
> : When Parameter Server workload is tested using NIC device on Huawei
> : Kunpeng 920 SoC:
> : Without the patch, the performance is 22W QPS;
> : Added this patch, the performance become better and it is 26W QPS.

Maybe it is just me but this doesn't really tell me a lot. What is
Parameter Server workload? What do I do to replicate those numbers? Is
this really specific to the Kunpeng 920 server? What is the usual
variance of the performance numbers?

> > [...]
> > > +/**
> > > + * cpumask_local_spread - select the i'th cpu with local numa cpu's first
> > > + * @i: index number
> > > + * @node: local numa_node
> > > + *
> > > + * This function selects an online CPU according to a numa aware policy;
> > > + * local cpus are returned first, followed by the nearest non-local ones,
> > > + * then it wraps around.
> > > + *
> > > + * It's not very efficient, but useful for setup.
> > > + */
> > > +unsigned int cpumask_local_spread(unsigned int i, int node)
> > > +{
> > > +	int node_dist[MAX_NUMNODES] = {0};
> > > +	bool used[MAX_NUMNODES] = {0};
> > 
> > Ugh. This might be a lot of stack space. Some distro kernels use large
> > NODE_SHIFT (e.g 10 so this would be 4kB of stack space just for the
> > node_dist).
> 
> Yes, that's big.  From a quick peek I suspect we could get by using an
> array of unsigned shorts here but that might be fragile over time even
> if it works now?

Whatever data type we use it will be still quite large to be on the
stack.

> Perhaps we could make it a statically allocated array and protect the
> entire thing with a spin_lock_irqsave()?  It's not a frequently called
> function.

This is what I was suggesting in previous review feedback.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2] lib: optimize cpumask_local_spread()
  2019-11-06  7:17     ` Michal Hocko
@ 2019-11-06  8:02       ` Shaokun Zhang
  2019-11-06  9:22         ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: Shaokun Zhang @ 2019-11-06  8:02 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: linux-kernel, yuqi jin, Mike Rapoport, Paul Burton,
	Michael Ellerman, Anshuman Khandual

Hi Michal,

On 2019/11/6 15:17, Michal Hocko wrote:
> On Tue 05-11-19 17:33:59, Andrew Morton wrote:
>> On Tue, 5 Nov 2019 08:01:41 +0100 Michal Hocko <mhocko@kernel.org> wrote:
>>
>>> On Mon 04-11-19 18:27:48, Shaokun Zhang wrote:
>>>> From: yuqi jin <jinyuqi@huawei.com>
>>>>
>>>> In the multi-processor and NUMA system, I/O device may have many numa
>>>> nodes belonging to multiple cpus. When we get a local numa, it is
>>>> better to find the node closest to the local numa node, instead
>>>> of choosing any online cpu immediately.
>>>>
>>>> For the current code, it only considers the local NUMA node and it
>>>> doesn't compute the distances between different NUMA nodes for the
>>>> non-local NUMA nodes. Let's optimize it and find the nearest node
>>>> through NUMA distance. The performance will be better if it return
>>>> the nearest node than the random node.
>>>
>>> Numbers please
>>
>> The changelog had
>>
>> : When Parameter Server workload is tested using NIC device on Huawei
>> : Kunpeng 920 SoC:
>> : Without the patch, the performance is 22W QPS;
>> : Added this patch, the performance become better and it is 26W QPS.
> 
> Maybe it is just me but this doesn't really tell me a lot. What is
> Parameter Server workload? What do I do to replicate those numbers? Is

I will give a better description of it in the next version. Since it
returns the nearest non-local node rather than a random one, it should
be harmless to others, right?

> this really specific to the Kunpeng 920 server? What is the usual
> variance of the performance numbers?
> 
>>> [...]
>>>> +/**
>>>> + * cpumask_local_spread - select the i'th cpu with local numa cpu's first
>>>> + * @i: index number
>>>> + * @node: local numa_node
>>>> + *
>>>> + * This function selects an online CPU according to a numa aware policy;
>>>> + * local cpus are returned first, followed by the nearest non-local ones,
>>>> + * then it wraps around.
>>>> + *
>>>> + * It's not very efficient, but useful for setup.
>>>> + */
>>>> +unsigned int cpumask_local_spread(unsigned int i, int node)
>>>> +{
>>>> +	int node_dist[MAX_NUMNODES] = {0};
>>>> +	bool used[MAX_NUMNODES] = {0};
>>>
>>> Ugh. This might be a lot of stack space. Some distro kernels use large
>>> NODE_SHIFT (e.g 10 so this would be 4kB of stack space just for the
>>> node_dist).
>>
>> Yes, that's big.  From a quick peek I suspect we could get by using an
>> array of unsigned shorts here but that might be fragile over time even
>> if it works now?
> 
> Whatever data type we use it will be still quite large to be on the
> stack.
> 
>> Perhaps we could make it a statically allocated array and protect the
>> entire thing with a spin_lock_irqsave()?  It's not a frequently called
>> function.
> 
> This is what I was suggesting in previous review feedback.

Ok, will do it in next version.

Thanks,
Shaokun

> 



* Re: [PATCH v2] lib: optimize cpumask_local_spread()
  2019-11-06  8:02       ` Shaokun Zhang
@ 2019-11-06  9:22         ` Michal Hocko
  2019-11-07  1:48           ` Shaokun Zhang
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2019-11-06  9:22 UTC (permalink / raw)
  To: Shaokun Zhang
  Cc: Andrew Morton, linux-kernel, yuqi jin, Mike Rapoport,
	Paul Burton, Michael Ellerman, Anshuman Khandual

On Wed 06-11-19 16:02:29, Shaokun Zhang wrote:
> Hi Michal,
> 
> On 2019/11/6 15:17, Michal Hocko wrote:
> > On Tue 05-11-19 17:33:59, Andrew Morton wrote:
> >> On Tue, 5 Nov 2019 08:01:41 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> >>
> >>> On Mon 04-11-19 18:27:48, Shaokun Zhang wrote:
> >>>> From: yuqi jin <jinyuqi@huawei.com>
> >>>>
> >>>> In the multi-processor and NUMA system, I/O device may have many numa
> >>>> nodes belonging to multiple cpus. When we get a local numa, it is
> >>>> better to find the node closest to the local numa node, instead
> >>>> of choosing any online cpu immediately.
> >>>>
> >>>> For the current code, it only considers the local NUMA node and it
> >>>> doesn't compute the distances between different NUMA nodes for the
> >>>> non-local NUMA nodes. Let's optimize it and find the nearest node
> >>>> through NUMA distance. The performance will be better if it return
> >>>> the nearest node than the random node.
> >>>
> >>> Numbers please
> >>
> >> The changelog had
> >>
> >> : When Parameter Server workload is tested using NIC device on Huawei
> >> : Kunpeng 920 SoC:
> >> : Without the patch, the performance is 22W QPS;
> >> : Added this patch, the performance become better and it is 26W QPS.
> > 
> > Maybe it is just me but this doesn't really tell me a lot. What is
> > Parameter Server workload? What do I do to replicate those numbers? Is
> 
> I will give it better description on it in next version. Since it returns
> the nearest node from the non-local node than the random one, no harmless
> to others, Right?

Well, I am not really familiar with consumers of this API to understand
the full consequences and that is why I keep asking. From a very
highlevel POV prefering CPUs on the same NUMA domain sounds like a
reasonable thing to do.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2] lib: optimize cpumask_local_spread()
  2019-11-06  9:22         ` Michal Hocko
@ 2019-11-07  1:48           ` Shaokun Zhang
  0 siblings, 0 replies; 8+ messages in thread
From: Shaokun Zhang @ 2019-11-07  1:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-kernel, yuqi jin, Mike Rapoport,
	Paul Burton, Michael Ellerman, Anshuman Khandual

Hi Michal,

On 2019/11/6 17:22, Michal Hocko wrote:
> On Wed 06-11-19 16:02:29, Shaokun Zhang wrote:
>> Hi Michal,
>>
>> On 2019/11/6 15:17, Michal Hocko wrote:
>>> On Tue 05-11-19 17:33:59, Andrew Morton wrote:
>>>> On Tue, 5 Nov 2019 08:01:41 +0100 Michal Hocko <mhocko@kernel.org> wrote:
>>>>
>>>>> On Mon 04-11-19 18:27:48, Shaokun Zhang wrote:
>>>>>> From: yuqi jin <jinyuqi@huawei.com>
>>>>>>
>>>>>> In the multi-processor and NUMA system, I/O device may have many numa
>>>>>> nodes belonging to multiple cpus. When we get a local numa, it is
>>>>>> better to find the node closest to the local numa node, instead
>>>>>> of choosing any online cpu immediately.
>>>>>>
>>>>>> For the current code, it only considers the local NUMA node and it
>>>>>> doesn't compute the distances between different NUMA nodes for the
>>>>>> non-local NUMA nodes. Let's optimize it and find the nearest node
>>>>>> through NUMA distance. The performance will be better if it return
>>>>>> the nearest node than the random node.
>>>>>
>>>>> Numbers please
>>>>
>>>> The changelog had
>>>>
>>>> : When Parameter Server workload is tested using NIC device on Huawei
>>>> : Kunpeng 920 SoC:
>>>> : Without the patch, the performance is 22W QPS;
>>>> : Added this patch, the performance become better and it is 26W QPS.
>>>
>>> Maybe it is just me but this doesn't really tell me a lot. What is
>>> Parameter Server workload? What do I do to replicate those numbers? Is
>>
>> I will give it better description on it in next version. Since it returns
>> the nearest node from the non-local node than the random one, no harmless
>> to others, Right?
> 
> Well, I am not really familiar with consumers of this API to understand
> the full consequences and that is why I keep asking. From a very

Thanks for your and Andrew's helpful comments; at the beginning I was
not sure how to fix this issue correctly, and it has become better now.

> highlevel POV prefering CPUs on the same NUMA domain sounds like a
> reasonable thing to do.

Thanks :-)

> 



end of thread, other threads:[~2019-11-07  1:48 UTC | newest]

Thread overview: 8+ messages
2019-11-04 10:27 [PATCH v2] lib: optimize cpumask_local_spread() Shaokun Zhang
2019-11-05  7:01 ` Michal Hocko
2019-11-06  1:33   ` Andrew Morton
2019-11-06  2:49     ` Shaokun Zhang
2019-11-06  7:17     ` Michal Hocko
2019-11-06  8:02       ` Shaokun Zhang
2019-11-06  9:22         ` Michal Hocko
2019-11-07  1:48           ` Shaokun Zhang
