[v7] lib: optimize cpumask_local_spread()
diff mbox series

Message ID 1605668072-44780-1-git-send-email-zhangshaokun@hisilicon.com
State In Next
Commit 6b9be2fe5d291f4986e605e7ce1d4d2e79f1a19a
Headers show
Series
  • [v7] lib: optimize cpumask_local_spread()
Related show

Commit Message

Shaokun Zhang Nov. 18, 2020, 2:54 a.m. UTC
From: Yuqi Jin <jinyuqi@huawei.com>

In multi-processor and NUMA system, I/O driver will find cpu cores that
which shall be bound IRQ. When cpu cores in the local numa have been
used up, it is better to find the node closest to the local numa node
for performance, instead of choosing any online cpu immediately.

On arm64 or x86 platform that has 2-sockets and 4-NUMA nodes, if the
network card is located in node2 of socket1, while the number queues
of network card is greater than the number of cores of node2, when all
cores of node2 has been bound to the queues, the remaining queues will
be bound to the cores of node0 which is further than NUMA node3. It is
not friendly for performance or Intel's DDIO (Data Direct I/O Technology)
when if the user enables SNC (sub-NUMA-clustering).
Let's improve it and find the nearest unused node through NUMA distance
for the non-local NUMA nodes.

On Huawei Kunpeng 920 server, there are 4 NUMA node(0 - 3) in the 2-cpu
system(0 - 1). The topology of this server is followed:
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 63379 MB
node 0 free: 61899 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 64509 MB
node 1 free: 63942 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 64509 MB
node 2 free: 63056 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 3 size: 63997 MB
node 3 free: 63420 MB
node distances:
node   0   1   2   3
  0:  10  16  32  33
  1:  16  10  25  32
  2:  32  25  10  16
  3:  33  32  16  10

We perform PS (parameter server) business test, the behavior of the
service is that the client initiates a request through the network card,
the server responds to the request after calculation.  When two PS
processes run on node2 and node3 separately and the network card is
located on 'node2' which is in cpu1, the performance of node2 (26W QPS)
and node3 (22W QPS) is different.

It is better that the NIC queues are bound to the cpu1 cores in turn, then
XPS will also be properly initialized, while cpumask_local_spread only
considers the local node. When the number of NIC queues exceeds the
number of cores in the local node, it returns to the online core directly.
So when PS runs on node3 sending a calculated request, the performance is
not as good as the node2.

The IRQ from 369-392 will be bound from NUMA node0 to NUMA node3 with this
patch, before the patch:

Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
0
Euler:/sys/bus/pci # cat /proc/irq/370/smp_affinity_list
1
...
Euler:/sys/bus/pci # cat /proc/irq/391/smp_affinity_list
22
Euler:/sys/bus/pci # cat /proc/irq/392/smp_affinity_list
23
After the patch:
Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
72
Euler:/sys/bus/pci # cat /proc/irq/370/smp_affinity_list
73
...
Euler:/sys/bus/pci # cat /proc/irq/391/smp_affinity_list
94
Euler:/sys/bus/pci # cat /proc/irq/392/smp_affinity_list
95

So the performance of the node3 is the same as node2 that is 26W QPS when
the network card is still in 'node2' with the patch.

Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Yuqi Jin <jinyuqi@huawei.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
---
ChangeLog from v6:
    1. Addressed Dave comments
    2. Fix the warning from Hulk Robot
    3. Simply the git log.

ChangeLog from v5:
    1. Rebase to 5.10-rc2

ChangeLog from v4:
    1. Rebase to 5.6-rc3 

ChangeLog from v3:
    1. Make spread_lock local to cpumask_local_spread();
    2. Add more descriptions on the affinities change in log;

ChangeLog from v2:
    1. Change the variables as static and use spinlock to protect;
    2. Give more explantation on test and performance;

 lib/cpumask.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 47 insertions(+), 13 deletions(-)

Comments

Dave Hansen Nov. 20, 2020, 5:48 p.m. UTC | #1
On 11/17/20 6:54 PM, Shaokun Zhang wrote:
> From: Yuqi Jin <jinyuqi@huawei.com>
> 
> In multi-processor and NUMA system, I/O driver will find cpu cores that
> which shall be bound IRQ. When cpu cores in the local numa have been
> used up, it is better to find the node closest to the local numa node
> for performance, instead of choosing any online cpu immediately.
> 
> On arm64 or x86 platform that has 2-sockets and 4-NUMA nodes, if the
> network card is located in node2 of socket1, while the number queues
> of network card is greater than the number of cores of node2, when all
> cores of node2 has been bound to the queues, the remaining queues will
> be bound to the cores of node0 which is further than NUMA node3.

That's quite the run-on sentence. :)

> It is
> not friendly for performance or Intel's DDIO (Data Direct I/O Technology)

Could you explain *why* it is not friendly to DDIO specifically?  This
patch affects where the interrupt handler runs.  But, DDIO is based on
memory locations rather than the location of the interrupt handler.

It would be ideal to make that connection: How does the location of the
interrupt handler impact the memory allocation location?

> when if the user enables SNC (sub-NUMA-clustering).

Again, the role that SNC plays here isn't spelled out.  I *believe* it's
because SNC ends up reducing the number of CPUs in each NUMA node.  That
 makes the existing code run out of CPUs to which to bind to the "local"
node sooner.

> +static int find_nearest_node(int node, bool *used)
> +{
> +	int i, min_dist, node_id = -1;
> +
> +	/* Choose the first unused node to compare */
> +	for (i = 0; i < nr_node_ids; i++) {
> +		if (used[i] == false) {
> +			min_dist = node_distance(node, i);
> +			node_id = i;
> +			break;
> +		}
> +	}
> +
> +	/* Compare and return the nearest node */
> +	for (i = 0; i < nr_node_ids; i++) {
> +		if (node_distance(node, i) < min_dist && used[i] == false) {
> +			min_dist = node_distance(node, i);
> +			node_id = i;
> +		}
> +	}
> +
> +	return node_id;
> +}
> +
>  /**
>   * cpumask_local_spread - select the i'th cpu with local numa cpu's first
>   * @i: index number
>   * @node: local numa_node
>   *
>   * This function selects an online CPU according to a numa aware policy;
> - * local cpus are returned first, followed by non-local ones, then it
> - * wraps around.
> + * local cpus are returned first, followed by the next one which is the
> + * nearest unused NUMA node based on NUMA distance, then it wraps around.
>   *
>   * It's not very efficient, but useful for setup.
>   */
>  unsigned int cpumask_local_spread(unsigned int i, int node)

FWIW, I think 'i' is criminally bad naming.  It should be called
nr_cpus_to_skip or something similar.

I also detest the comments that are there today.

	Loop through all the online CPUs on the system.  Start with the
	CPUs on 'node', then fall back to CPUs on NUMA nodes which are
	increasingly far away.

	Skip the first 'nr_cpus_to_skip' CPUs which are found.

	This function is not very efficient, especially for large
	'nr_cpus_to_skip' because it loops over the same CPUs on each
	call and does not remember its state from previous calls.

>  {
> -	int cpu, hk_flags;
> +	static DEFINE_SPINLOCK(spread_lock);
> +	static bool used[MAX_NUMNODES];

I thought I mentioned this last time.  How large is this array?  How
large would it be if it were a nodemask_t?  Would this be less code if
you just dynamically allocated and freed the node mask instead of having
a spinlock and a memset?

> +	unsigned long flags;
> +	int cpu, hk_flags, j, id;
>  	const struct cpumask *mask;
>  
>  	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> @@ -352,20 +379,27 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>  				return cpu;
>  		}
>  	} else {
> -		/* NUMA first. */
> -		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
> -			if (i-- == 0)
> -				return cpu;
> +		spin_lock_irqsave(&spread_lock, flags);
> +		memset(used, 0, nr_node_ids * sizeof(bool));
> +		/* select node according to the distance from local node */
> +		for (j = 0; j < nr_node_ids; j++) {
> +			id = find_nearest_node(node, used);
> +			if (id < 0)
> +				break;

There's presumably an outer loop in a driver which is trying to bind a
bunch of interrupts to a bunch of CPUs.  We know there are on the order
of dozens of these interrupts.

	for_each_interrupt() // in the driver
		for (j=0;j<nr_node_ids;j++) // cpumask_local_spread()
			// find_nearest_node():
			for (i = 0; i < nr_node_ids; i++) {
			for (i = 0; i < nr_node_ids; i++) {

Does this worry anybody else?  It thought our upper limits on the number
of NUMA nodes was 1024.  Doesn't that make our loop O(N^3) where the
worst case is hundreds of millions of loops?

I don't want to prematurely optimize this, but that seems like something
that might just fall over on bigger systems.

This also seems really wasteful if we have a bunch of memory-only nodes.
 Each of those will be found via find_nearest_node(), but then this loop:

> +			for_each_cpu_and(cpu, cpumask_of_node(id), mask)
> +				if (i-- == 0) {
> +					spin_unlock_irqrestore(&spread_lock,
> +							       flags);
> +					return cpu;
> +				}
> +			used[id] = true;
>  		}

Will just exit immediately because cpumask_of_node() is empty.

'used', for instance, should start by setting 'true' for all nodes which
are not in N_CPUS.


> +		spin_unlock_irqrestore(&spread_lock, flags);
>  
> -		for_each_cpu(cpu, mask) {
> -			/* Skip NUMA nodes, done above. */
> -			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
> -				continue;
> -
> +		for_each_cpu(cpu, mask)
>  			if (i-- == 0)
>  				return cpu;
> -		}
>  	}
>  	BUG();
>  }
Shaokun Zhang Nov. 27, 2020, 2:29 a.m. UTC | #2
Hi Dave,

Apologies for later reply.

在 2020/11/21 1:48, Dave Hansen 写道:
> On 11/17/20 6:54 PM, Shaokun Zhang wrote:
>> From: Yuqi Jin <jinyuqi@huawei.com>
>>
>> In multi-processor and NUMA system, I/O driver will find cpu cores that
>> which shall be bound IRQ. When cpu cores in the local numa have been
>> used up, it is better to find the node closest to the local numa node
>> for performance, instead of choosing any online cpu immediately.
>>
>> On arm64 or x86 platform that has 2-sockets and 4-NUMA nodes, if the
>> network card is located in node2 of socket1, while the number queues
>> of network card is greater than the number of cores of node2, when all
>> cores of node2 has been bound to the queues, the remaining queues will
>> be bound to the cores of node0 which is further than NUMA node3.
> 
> That's quite the run-on sentence. :)
> 
>> It is
>> not friendly for performance or Intel's DDIO (Data Direct I/O Technology)
> 
> Could you explain *why* it is not friendly to DDIO specifically?  This
> patch affects where the interrupt handler runs.  But, DDIO is based on
> memory locations rather than the location of the interrupt handler.
> 
> It would be ideal to make that connection: How does the location of the
> interrupt handler impact the memory allocation location?
> 

When the interrupt handler is across chips, the BD, packet header, and even
payload are required for the RX packet interrupt handler. However, the DDIO
cannot transmit data to there.

>> when if the user enables SNC (sub-NUMA-clustering).
> 
> Again, the role that SNC plays here isn't spelled out.  I *believe* it's
> because SNC ends up reducing the number of CPUs in each NUMA node.  That
>  makes the existing code run out of CPUs to which to bind to the "local"
> node sooner.

Yes.

> 
>> +static int find_nearest_node(int node, bool *used)
>> +{
>> +	int i, min_dist, node_id = -1;
>> +
>> +	/* Choose the first unused node to compare */
>> +	for (i = 0; i < nr_node_ids; i++) {
>> +		if (used[i] == false) {
>> +			min_dist = node_distance(node, i);
>> +			node_id = i;
>> +			break;
>> +		}
>> +	}
>> +
>> +	/* Compare and return the nearest node */
>> +	for (i = 0; i < nr_node_ids; i++) {
>> +		if (node_distance(node, i) < min_dist && used[i] == false) {
>> +			min_dist = node_distance(node, i);
>> +			node_id = i;
>> +		}
>> +	}
>> +
>> +	return node_id;
>> +}
>> +
>>  /**
>>   * cpumask_local_spread - select the i'th cpu with local numa cpu's first
>>   * @i: index number
>>   * @node: local numa_node
>>   *
>>   * This function selects an online CPU according to a numa aware policy;
>> - * local cpus are returned first, followed by non-local ones, then it
>> - * wraps around.
>> + * local cpus are returned first, followed by the next one which is the
>> + * nearest unused NUMA node based on NUMA distance, then it wraps around.
>>   *
>>   * It's not very efficient, but useful for setup.
>>   */
>>  unsigned int cpumask_local_spread(unsigned int i, int node)
> 
> FWIW, I think 'i' is criminally bad naming.  It should be called
> nr_cpus_to_skip or something similar.
> 

Ok, I really didn't consider this parameter naming before.

> I also detest the comments that are there today.
> 
> 	Loop through all the online CPUs on the system.  Start with the
> 	CPUs on 'node', then fall back to CPUs on NUMA nodes which are
> 	increasingly far away.
> 
> 	Skip the first 'nr_cpus_to_skip' CPUs which are found.
> 
> 	This function is not very efficient, especially for large
> 	'nr_cpus_to_skip' because it loops over the same CPUs on each
> 	call and does not remember its state from previous calls.
> 

Shame for my bad comment, I will follow it.

>>  {
>> -	int cpu, hk_flags;
>> +	static DEFINE_SPINLOCK(spread_lock);
>> +	static bool used[MAX_NUMNODES];
> 
> I thought I mentioned this last time.  How large is this array?  How
> large would it be if it were a nodemask_t?  Would this be less code if

Apologies that I forgot to do it.

> you just dynamically allocated and freed the node mask instead of having
> a spinlock and a memset?
> 

Ok, but I think the spinlock is also needed, do I miss something?

>> +	unsigned long flags;
>> +	int cpu, hk_flags, j, id;
>>  	const struct cpumask *mask;
>>  
>>  	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>> @@ -352,20 +379,27 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>>  				return cpu;
>>  		}
>>  	} else {
>> -		/* NUMA first. */
>> -		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>> -			if (i-- == 0)
>> -				return cpu;
>> +		spin_lock_irqsave(&spread_lock, flags);
>> +		memset(used, 0, nr_node_ids * sizeof(bool));
>> +		/* select node according to the distance from local node */
>> +		for (j = 0; j < nr_node_ids; j++) {
>> +			id = find_nearest_node(node, used);
>> +			if (id < 0)
>> +				break;
> 
> There's presumably an outer loop in a driver which is trying to bind a
> bunch of interrupts to a bunch of CPUs.  We know there are on the order
> of dozens of these interrupts.
> 
> 	for_each_interrupt() // in the driver
> 		for (j=0;j<nr_node_ids;j++) // cpumask_local_spread()
> 			// find_nearest_node():
> 			for (i = 0; i < nr_node_ids; i++) {
> 			for (i = 0; i < nr_node_ids; i++) {
> 
> Does this worry anybody else?  It thought our upper limits on the number
> of NUMA nodes was 1024.  Doesn't that make our loop O(N^3) where the
> worst case is hundreds of millions of loops?
> 

If the NUMA nodes is 1024 in real system, it is more worthy to find the
earest node, rather than choose a random one, And it is only called in
I/O device initialization. Comments also are given to this interface.

> I don't want to prematurely optimize this, but that seems like something
> that might just fall over on bigger systems.
> 
> This also seems really wasteful if we have a bunch of memory-only nodes.
>  Each of those will be found via find_nearest_node(), but then this loop:
> 

Got it, all effort is used to choose the nearest node for performance. If
we don't it, I think some one will also debug this in future.

>> +			for_each_cpu_and(cpu, cpumask_of_node(id), mask)
>> +				if (i-- == 0) {
>> +					spin_unlock_irqrestore(&spread_lock,
>> +							       flags);
>> +					return cpu;
>> +				}
>> +			used[id] = true;
>>  		}
> 
> Will just exit immediately because cpumask_of_node() is empty.

Yes, and this node used[id] became true.

> 
> 'used', for instance, should start by setting 'true' for all nodes which
> are not in N_CPUS.

No, because I used 'nr_node_ids' which is possible node ids to check.

Thanks,
Shaokun

> 
>> +		spin_unlock_irqrestore(&spread_lock, flags);
>>  
>> -		for_each_cpu(cpu, mask) {
>> -			/* Skip NUMA nodes, done above. */
>> -			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
>> -				continue;
>> -
>> +		for_each_cpu(cpu, mask)
>>  			if (i-- == 0)
>>  				return cpu;
>> -		}
>>  	}
>>  	BUG();
>>  }
> .
>
Dave Hansen Nov. 30, 2020, 5:08 p.m. UTC | #3
>>>  {
>>> -	int cpu, hk_flags;
>>> +	static DEFINE_SPINLOCK(spread_lock);
>>> +	static bool used[MAX_NUMNODES];
>>
>> I thought I mentioned this last time.  How large is this array?  How
>> large would it be if it were a nodemask_t?  Would this be less code if
> 
> Apologies that I forgot to do it.
> 
>> you just dynamically allocated and freed the node mask instead of having
>> a spinlock and a memset?
> 
> Ok, but I think the spinlock is also needed, do I miss something?

There was no spinlock there before your patch.  You just need it to
protect the structures you declared static.  If you didn't have static
structures, you wouldn't need a lock.

>>> +	unsigned long flags;
>>> +	int cpu, hk_flags, j, id;
>>>  	const struct cpumask *mask;
>>>  
>>>  	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>>> @@ -352,20 +379,27 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>>>  				return cpu;
>>>  		}
>>>  	} else {
>>> -		/* NUMA first. */
>>> -		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>>> -			if (i-- == 0)
>>> -				return cpu;
>>> +		spin_lock_irqsave(&spread_lock, flags);
>>> +		memset(used, 0, nr_node_ids * sizeof(bool));
>>> +		/* select node according to the distance from local node */
>>> +		for (j = 0; j < nr_node_ids; j++) {
>>> +			id = find_nearest_node(node, used);
>>> +			if (id < 0)
>>> +				break;
>>
>> There's presumably an outer loop in a driver which is trying to bind a
>> bunch of interrupts to a bunch of CPUs.  We know there are on the order
>> of dozens of these interrupts.
>>
>> 	for_each_interrupt() // in the driver
>> 		for (j=0;j<nr_node_ids;j++) // cpumask_local_spread()
>> 			// find_nearest_node():
>> 			for (i = 0; i < nr_node_ids; i++) {
>> 			for (i = 0; i < nr_node_ids; i++) {
>>
>> Does this worry anybody else?  It thought our upper limits on the number
>> of NUMA nodes was 1024.  Doesn't that make our loop O(N^3) where the
>> worst case is hundreds of millions of loops?
> 
> If the NUMA nodes is 1024 in real system, it is more worthy to find the
> earest node, rather than choose a random one, And it is only called in
> I/O device initialization. Comments also are given to this interface.

This doesn't really make me feel better.  An end user booting this on a
big system with a bunch of cards could see a minutes-long delay.  I can
also see funky stuff happening like if we have a ton of NUMA nodes and
few CPUs.

>> I don't want to prematurely optimize this, but that seems like something
>> that might just fall over on bigger systems.
>>
>> This also seems really wasteful if we have a bunch of memory-only nodes.
>>  Each of those will be found via find_nearest_node(), but then this loop:
> 
> Got it, all effort is used to choose the nearest node for performance. If
> we don't it, I think some one will also debug this in future.

If we're going to kick the can down the road for some poor sod to debug,
can we at least help them out with a warning?

Maybe we WARN_ONCE() after we fall back for more than 2 or 3 nodes.

But, I still don't think you've addressed my main concern: This is
horrifically inefficient searching for CPUs inside nodes that are known
to have no CPUs.

>>> +			for_each_cpu_and(cpu, cpumask_of_node(id), mask)
>>> +				if (i-- == 0) {
>>> +					spin_unlock_irqrestore(&spread_lock,
>>> +							       flags);
>>> +					return cpu;
>>> +				}
>>> +			used[id] = true;
>>>  		}
>>
>> Will just exit immediately because cpumask_of_node() is empty.
> 
> Yes, and this node used[id] became true.
> 
>>
>> 'used', for instance, should start by setting 'true' for all nodes which
>> are not in N_CPUS.
> 
> No, because I used 'nr_node_ids' which is possible node ids to check.

I'm saying that it's wasteful to loop over and search in all the nodes.
Shaokun Zhang Dec. 11, 2020, 2:25 a.m. UTC | #4
Hi Dave,

Apologies for the late reply.

在 2020/12/1 1:08, Dave Hansen 写道:
>>>>  {
>>>> -	int cpu, hk_flags;
>>>> +	static DEFINE_SPINLOCK(spread_lock);
>>>> +	static bool used[MAX_NUMNODES];
>>>
>>> I thought I mentioned this last time.  How large is this array?  How
>>> large would it be if it were a nodemask_t?  Would this be less code if
>>
>> Apologies that I forgot to do it.
>>
>>> you just dynamically allocated and freed the node mask instead of having
>>> a spinlock and a memset?
>>
>> Ok, but I think the spinlock is also needed, do I miss something?
> 
> There was no spinlock there before your patch.  You just need it to
> protect the structures you declared static.  If you didn't have static
> structures, you wouldn't need a lock.

Got it, I will allocate it dynamically.

> 
>>>> +	unsigned long flags;
>>>> +	int cpu, hk_flags, j, id;
>>>>  	const struct cpumask *mask;
>>>>  
>>>>  	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>>>> @@ -352,20 +379,27 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>>>>  				return cpu;
>>>>  		}
>>>>  	} else {
>>>> -		/* NUMA first. */
>>>> -		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>>>> -			if (i-- == 0)
>>>> -				return cpu;
>>>> +		spin_lock_irqsave(&spread_lock, flags);
>>>> +		memset(used, 0, nr_node_ids * sizeof(bool));
>>>> +		/* select node according to the distance from local node */
>>>> +		for (j = 0; j < nr_node_ids; j++) {
>>>> +			id = find_nearest_node(node, used);
>>>> +			if (id < 0)
>>>> +				break;
>>>
>>> There's presumably an outer loop in a driver which is trying to bind a
>>> bunch of interrupts to a bunch of CPUs.  We know there are on the order
>>> of dozens of these interrupts.
>>>
>>> 	for_each_interrupt() // in the driver
>>> 		for (j=0;j<nr_node_ids;j++) // cpumask_local_spread()
>>> 			// find_nearest_node():
>>> 			for (i = 0; i < nr_node_ids; i++) {
>>> 			for (i = 0; i < nr_node_ids; i++) {
>>>
>>> Does this worry anybody else?  It thought our upper limits on the number
>>> of NUMA nodes was 1024.  Doesn't that make our loop O(N^3) where the
>>> worst case is hundreds of millions of loops?
>>
>> If the NUMA nodes is 1024 in real system, it is more worthy to find the
>> earest node, rather than choose a random one, And it is only called in
>> I/O device initialization. Comments also are given to this interface.
> 
> This doesn't really make me feel better.  An end user booting this on a

My bad, I only want to explain the issue.

> big system with a bunch of cards could see a minutes-long delay.  I can

Indeed.

> also see funky stuff happening like if we have a ton of NUMA nodes and
> few CPUs.
> 
>>> I don't want to prematurely optimize this, but that seems like something
>>> that might just fall over on bigger systems.
>>>
>>> This also seems really wasteful if we have a bunch of memory-only nodes.
>>>  Each of those will be found via find_nearest_node(), but then this loop:
>>
>> Got it, all effort is used to choose the nearest node for performance. If
>> we don't it, I think some one will also debug this in future.
> 
> If we're going to kick the can down the road for some poor sod to debug,
> can we at least help them out with a warning?
> 
> Maybe we WARN_ONCE() after we fall back for more than 2 or 3 nodes.
> 

Ok,

> But, I still don't think you've addressed my main concern: This is
> horrifically inefficient searching for CPUs inside nodes that are known
> to have no CPUs.

How about optimizing as follows:
+		for (j = 0; j < nr_node_ids; j++) {
+			id = find_nearest_node(node, nodes);
+			if (id < 0)
+				break;
+			nmask = cpumask_of_node(id);
+			cpumask_and(&node_possible_mask, &mask, & nmask);
+			cpu_of_node = cpumask_weight(node_possible_mask);
+ 			if (cpu_index > cpu_of_node) {
+				cpu_index -= cpu_of_node;
+				node_set(id, nodes);
+				continue;
+			}
+
+			for_each_cpu(cpu, node_possible_mask)
+				if (cpu_index-- == 0)
+					return cpu;
+
+			node_set(id, nodes);
 		}

> 
>>>> +			for_each_cpu_and(cpu, cpumask_of_node(id), mask)
>>>> +				if (i-- == 0) {
>>>> +					spin_unlock_irqrestore(&spread_lock,
>>>> +							       flags);
>>>> +					return cpu;
>>>> +				}
>>>> +			used[id] = true;
>>>>  		}
>>>
>>> Will just exit immediately because cpumask_of_node() is empty.
>>
>> Yes, and this node used[id] became true.
>>
>>>
>>> 'used', for instance, should start by setting 'true' for all nodes which
>>> are not in N_CPUS.
>>
>> No, because I used 'nr_node_ids' which is possible node ids to check.
> 
> I'm saying that it's wasteful to loop over and search in all the nodes.

If you are happy the mentioned code, it also will solve the issue.

Thanks,
Shaokun

> .
>

Patch
diff mbox series

diff --git a/lib/cpumask.c b/lib/cpumask.c
index 97a005ffde31..516d7237e302 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -325,20 +325,47 @@  void __init free_bootmem_cpumask_var(cpumask_var_t mask)
 }
 #endif
 
+static int find_nearest_node(int node, bool *used)
+{
+	int i, min_dist, node_id = -1;
+
+	/* Choose the first unused node to compare */
+	for (i = 0; i < nr_node_ids; i++) {
+		if (used[i] == false) {
+			min_dist = node_distance(node, i);
+			node_id = i;
+			break;
+		}
+	}
+
+	/* Compare and return the nearest node */
+	for (i = 0; i < nr_node_ids; i++) {
+		if (node_distance(node, i) < min_dist && used[i] == false) {
+			min_dist = node_distance(node, i);
+			node_id = i;
+		}
+	}
+
+	return node_id;
+}
+
 /**
  * cpumask_local_spread - select the i'th cpu with local numa cpu's first
  * @i: index number
  * @node: local numa_node
  *
  * This function selects an online CPU according to a numa aware policy;
- * local cpus are returned first, followed by non-local ones, then it
- * wraps around.
+ * local cpus are returned first, followed by the next one which is the
+ * nearest unused NUMA node based on NUMA distance, then it wraps around.
  *
  * It's not very efficient, but useful for setup.
  */
 unsigned int cpumask_local_spread(unsigned int i, int node)
 {
-	int cpu, hk_flags;
+	static DEFINE_SPINLOCK(spread_lock);
+	static bool used[MAX_NUMNODES];
+	unsigned long flags;
+	int cpu, hk_flags, j, id;
 	const struct cpumask *mask;
 
 	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
@@ -352,20 +379,27 @@  unsigned int cpumask_local_spread(unsigned int i, int node)
 				return cpu;
 		}
 	} else {
-		/* NUMA first. */
-		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
-			if (i-- == 0)
-				return cpu;
+		spin_lock_irqsave(&spread_lock, flags);
+		memset(used, 0, nr_node_ids * sizeof(bool));
+		/* select node according to the distance from local node */
+		for (j = 0; j < nr_node_ids; j++) {
+			id = find_nearest_node(node, used);
+			if (id < 0)
+				break;
+
+			for_each_cpu_and(cpu, cpumask_of_node(id), mask)
+				if (i-- == 0) {
+					spin_unlock_irqrestore(&spread_lock,
+							       flags);
+					return cpu;
+				}
+			used[id] = true;
 		}
+		spin_unlock_irqrestore(&spread_lock, flags);
 
-		for_each_cpu(cpu, mask) {
-			/* Skip NUMA nodes, done above. */
-			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
-				continue;
-
+		for_each_cpu(cpu, mask)
 			if (i-- == 0)
 				return cpu;
-		}
 	}
 	BUG();
 }