* [PATCH v3] lib: optimize cpumask_local_spread() @ 2019-11-07 1:44 Shaokun Zhang 2019-11-08 3:49 ` Andrew Morton 2019-11-08 10:31 ` Michal Hocko 0 siblings, 2 replies; 11+ messages in thread From: Shaokun Zhang @ 2019-11-07 1:44 UTC (permalink / raw) To: linux-kernel Cc: yuqi jin, Andrew Morton, Mike Rapoport, Paul Burton, Michal Hocko, Michael Ellerman, Anshuman Khandual, Shaokun Zhang From: yuqi jin <jinyuqi@huawei.com> In the multi-processors and NUMA system, I/O driver will find cpu cores that which shall be bound IRQ. When cpu cores in the local numa have been used, it is better to find the node closest to the local numa node, instead of choosing any online cpu immediately. On Huawei Kunpeng 920 server, there are 4 NUMA node(0 -3) in the 2-cpu system(0 - 1). We perform PS (parameter server) business test, the behavior of the service is that the client initiates a request through the network card, the server responds to the request after calculation. When two PS processes run on node2 and node3 separately and the network card is located on 'node2' which is in cpu1, the performance of node2 (26W QPS) and node3 (22W QPS) was different. It is better that the NIC queues are bound to the cpu1 cores in turn, then XPS will also be properly initialized, while cpumask_local_spread only considers the local node. When the number of NIC queues exceeds the number of cores in the local node, it returns to the online core directly. So when PS runs on node3 sending a calculated request, the performance is not as good as the node2. It is considered that the NIC and other I/O devices shall initialize the interrupt binding, if the cores of the local node are used up, it is reasonable to return the node closest to it. Let's optimize it and find the nearest node through NUMA distance for the non-local NUMA nodes. The performance will be better if it return the nearest node than the random node. After this patch, the performance of the node3 is the same as node2 that is 26W QPS when the network card is still in 'node2'. Since it will return the closest non-local NUMA code rather than random node, it is no harm to others at least. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: yuqi jin <jinyuqi@huawei.com> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> --- ChangeLog from v2: 1. Change the variables as static and use spinlock to protect; 2. Give more explantation on test and performance; lib/cpumask.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 90 insertions(+), 12 deletions(-) diff --git a/lib/cpumask.c b/lib/cpumask.c index 0cb672eb107c..b98a2256bc5a 100644 --- a/lib/cpumask.c +++ b/lib/cpumask.c @@ -6,6 +6,7 @@ #include <linux/export.h> #include <linux/memblock.h> #include <linux/numa.h> +#include <linux/spinlock.h> /** * cpumask_next - get the next cpu in a cpumask @@ -192,18 +193,39 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask) } #endif -/** - * cpumask_local_spread - select the i'th cpu with local numa cpu's first - * @i: index number - * @node: local numa_node - * - * This function selects an online CPU according to a numa aware policy; - * local cpus are returned first, followed by non-local ones, then it - * wraps around. - * - * It's not very efficient, but useful for setup. - */ -unsigned int cpumask_local_spread(unsigned int i, int node) +static void calc_node_distance(int *node_dist, int node) +{ + int i; + + for (i = 0; i < nr_node_ids; i++) + node_dist[i] = node_distance(node, i); +} + +static int find_nearest_node(int *node_dist, bool *used) +{ + int i, min_dist = node_dist[0], node_id = -1; + + /* Choose the first unused node to compare */ + for (i = 0; i < nr_node_ids; i++) { + if (used[i] == 0) { + min_dist = node_dist[i]; + node_id = i; + break; + } + } + + /* Compare and return the nearest node */ + for (i = 0; i < nr_node_ids; i++) { + if (node_dist[i] < min_dist && used[i] == 0) { + min_dist = node_dist[i]; + node_id = i; + } + } + + return node_id; +} + +static unsigned int __cpumask_local_spread(unsigned int i, int node) { int cpu; @@ -231,4 +253,60 @@ unsigned int cpumask_local_spread(unsigned int i, int node) } BUG(); } + +static DEFINE_SPINLOCK(spread_lock); +/** + * cpumask_local_spread - select the i'th cpu with local numa cpu's first + * @i: index number + * @node: local numa_node + * + * This function selects an online CPU according to a numa aware policy; + * local cpus are returned first, followed by the nearest non-local ones, + * then it wraps around. + * + * It's not very efficient, but useful for setup. + */ +unsigned int cpumask_local_spread(unsigned int i, int node) +{ + static int node_dist[MAX_NUMNODES]; + static bool used[MAX_NUMNODES]; + unsigned long flags; + int cpu, j, id; + + /* Wrap: we always want a cpu. */ + i %= num_online_cpus(); + + if (node == NUMA_NO_NODE) { + for_each_cpu(cpu, cpu_online_mask) + if (i-- == 0) + return cpu; + } else { + if (nr_node_ids > MAX_NUMNODES) + return __cpumask_local_spread(i, node); + + spin_lock_irqsave(&spread_lock, flags); + memset(used, 0, nr_node_ids * sizeof(bool)); + calc_node_distance(node_dist, node); + for (j = 0; j < nr_node_ids; j++) { + id = find_nearest_node(node_dist, used); + if (id < 0) + break; + + for_each_cpu_and(cpu, cpumask_of_node(id), + cpu_online_mask) + if (i-- == 0) { + spin_unlock_irqrestore(&spread_lock, + flags); + return cpu; + } + used[id] = 1; + } + spin_unlock_irqrestore(&spread_lock, flags); + + for_each_cpu(cpu, cpu_online_mask) + if (i-- == 0) + return cpu; + } + BUG(); +} EXPORT_SYMBOL(cpumask_local_spread); -- 2.7.4 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v3] lib: optimize cpumask_local_spread() 2019-11-07 1:44 [PATCH v3] lib: optimize cpumask_local_spread() Shaokun Zhang @ 2019-11-08 3:49 ` Andrew Morton 2019-11-08 5:50 ` Shaokun Zhang 2019-11-08 10:31 ` Michal Hocko 1 sibling, 1 reply; 11+ messages in thread From: Andrew Morton @ 2019-11-08 3:49 UTC (permalink / raw) To: Shaokun Zhang Cc: linux-kernel, yuqi jin, Mike Rapoport, Paul Burton, Michal Hocko, Michael Ellerman, Anshuman Khandual On Thu, 7 Nov 2019 09:44:08 +0800 Shaokun Zhang <zhangshaokun@hisilicon.com> wrote: > In the multi-processors and NUMA system, I/O driver will find cpu cores > that which shall be bound IRQ. When cpu cores in the local numa have > been used, it is better to find the node closest to the local numa node, > instead of choosing any online cpu immediately. > > On Huawei Kunpeng 920 server, there are 4 NUMA node(0 -3) in the 2-cpu > system(0 - 1). We perform PS (parameter server) business test, the > behavior of the service is that the client initiates a request through > the network card, the server responds to the request after calculation. > When two PS processes run on node2 and node3 separately and the > network card is located on 'node2' which is in cpu1, the performance > of node2 (26W QPS) and node3 (22W QPS) was different. > It is better that the NIC queues are bound to the cpu1 cores in turn, > then XPS will also be properly initialized, while cpumask_local_spread > only considers the local node. When the number of NIC queues exceeds > the number of cores in the local node, it returns to the online core > directly. So when PS runs on node3 sending a calculated request, > the performance is not as good as the node2. It is considered that > the NIC and other I/O devices shall initialize the interrupt binding, > if the cores of the local node are used up, it is reasonable to return > the node closest to it. > > Let's optimize it and find the nearest node through NUMA distance for the > non-local NUMA nodes. The performance will be better if it return the > nearest node than the random node. > > After this patch, the performance of the node3 is the same as node2 > that is 26W QPS when the network card is still in 'node2'. Since it will > return the closest non-local NUMA code rather than random node, it is no > harm to others at least. This is a little nicer: --- a/lib/cpumask.c~lib-optimize-cpumask_local_spread-v3-fix +++ a/lib/cpumask.c @@ -254,7 +254,6 @@ static unsigned int __cpumask_local_spre BUG(); } -static DEFINE_SPINLOCK(spread_lock); /** * cpumask_local_spread - select the i'th cpu with local numa cpu's first * @i: index number @@ -270,6 +269,7 @@ unsigned int cpumask_local_spread(unsign { static int node_dist[MAX_NUMNODES]; static bool used[MAX_NUMNODES]; + static DEFINE_SPINLOCK(spread_lock); unsigned long flags; int cpu, j, id; _ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3] lib: optimize cpumask_local_spread() 2019-11-08 3:49 ` Andrew Morton @ 2019-11-08 5:50 ` Shaokun Zhang 0 siblings, 0 replies; 11+ messages in thread From: Shaokun Zhang @ 2019-11-08 5:50 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, yuqi jin, Mike Rapoport, Paul Burton, Michal Hocko, Michael Ellerman, Anshuman Khandual Hi Andrew, On 2019/11/8 11:49, Andrew Morton wrote: > On Thu, 7 Nov 2019 09:44:08 +0800 Shaokun Zhang <zhangshaokun@hisilicon.com> wrote: > >> In the multi-processors and NUMA system, I/O driver will find cpu cores >> that which shall be bound IRQ. When cpu cores in the local numa have >> been used, it is better to find the node closest to the local numa node, >> instead of choosing any online cpu immediately. >> >> On Huawei Kunpeng 920 server, there are 4 NUMA node(0 -3) in the 2-cpu >> system(0 - 1). We perform PS (parameter server) business test, the >> behavior of the service is that the client initiates a request through >> the network card, the server responds to the request after calculation. >> When two PS processes run on node2 and node3 separately and the >> network card is located on 'node2' which is in cpu1, the performance >> of node2 (26W QPS) and node3 (22W QPS) was different. >> It is better that the NIC queues are bound to the cpu1 cores in turn, >> then XPS will also be properly initialized, while cpumask_local_spread >> only considers the local node. When the number of NIC queues exceeds >> the number of cores in the local node, it returns to the online core >> directly. So when PS runs on node3 sending a calculated request, >> the performance is not as good as the node2. It is considered that >> the NIC and other I/O devices shall initialize the interrupt binding, >> if the cores of the local node are used up, it is reasonable to return >> the node closest to it. >> >> Let's optimize it and find the nearest node through NUMA distance for the >> non-local NUMA nodes. The performance will be better if it return the >> nearest node than the random node. >> >> After this patch, the performance of the node3 is the same as node2 >> that is 26W QPS when the network card is still in 'node2'. Since it will >> return the closest non-local NUMA code rather than random node, it is no >> harm to others at least. > > This is a little nicer: > > --- a/lib/cpumask.c~lib-optimize-cpumask_local_spread-v3-fix > +++ a/lib/cpumask.c > @@ -254,7 +254,6 @@ static unsigned int __cpumask_local_spre > BUG(); > } > > -static DEFINE_SPINLOCK(spread_lock); > /** > * cpumask_local_spread - select the i'th cpu with local numa cpu's first > * @i: index number > @@ -270,6 +269,7 @@ unsigned int cpumask_local_spread(unsign > { > static int node_dist[MAX_NUMNODES]; > static bool used[MAX_NUMNODES]; > + static DEFINE_SPINLOCK(spread_lock); Good catch, thanks for fixing it. Shaokun. > unsigned long flags; > int cpu, j, id; > > _ > > > . > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3] lib: optimize cpumask_local_spread() 2019-11-07 1:44 [PATCH v3] lib: optimize cpumask_local_spread() Shaokun Zhang 2019-11-08 3:49 ` Andrew Morton @ 2019-11-08 10:31 ` Michal Hocko 2019-11-11 2:02 ` Shaokun Zhang 1 sibling, 1 reply; 11+ messages in thread From: Michal Hocko @ 2019-11-08 10:31 UTC (permalink / raw) To: Shaokun Zhang Cc: linux-kernel, yuqi jin, Andrew Morton, Mike Rapoport, Paul Burton, Michael Ellerman, Anshuman Khandual, netdev This changelog looks better, thanks! I still have some questions though. Btw. cpumask_local_spread is used by the networking code but I do not see net guys involved (Cc netdev) On Thu 07-11-19 09:44:08, Shaokun Zhang wrote: > From: yuqi jin <jinyuqi@huawei.com> > > In the multi-processors and NUMA system, I/O driver will find cpu cores > that which shall be bound IRQ. When cpu cores in the local numa have > been used, it is better to find the node closest to the local numa node, > instead of choosing any online cpu immediately. > > On Huawei Kunpeng 920 server, there are 4 NUMA node(0 -3) in the 2-cpu > system(0 - 1). Please send a topology of this server (numactl -H). > We perform PS (parameter server) business test, the > behavior of the service is that the client initiates a request through > the network card, the server responds to the request after calculation. Is the benchmark any ublicly available? > When two PS processes run on node2 and node3 separately and the > network card is located on 'node2' which is in cpu1, the performance > of node2 (26W QPS) and node3 (22W QPS) was different. > It is better that the NIC queues are bound to the cpu1 cores in turn, > then XPS will also be properly initialized, while cpumask_local_spread > only considers the local node. When the number of NIC queues exceeds > the number of cores in the local node, it returns to the online core > directly. So when PS runs on node3 sending a calculated request, > the performance is not as good as the node2. It is considered that > the NIC and other I/O devices shall initialize the interrupt binding, > if the cores of the local node are used up, it is reasonable to return > the node closest to it. Can you post cpu affinities before and after this patch? > Let's optimize it and find the nearest node through NUMA distance for the > non-local NUMA nodes. The performance will be better if it return the > nearest node than the random node. > > After this patch, the performance of the node3 is the same as node2 > that is 26W QPS when the network card is still in 'node2'. Since it will > return the closest non-local NUMA code rather than random node, it is no > harm to others at least. It would be also nice to explain why the current implementation hasn't taken the path your have chosen. Was it a simplicity or is there a more fundamental reason? Is there any risk that existing users would regress? Preferring cpus from the local socket which is what you aim for sounds like a logical thing to do so I am wondering why this hasn't been considered. > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Mike Rapoport <rppt@linux.ibm.com> > Cc: Paul Burton <paul.burton@mips.com> > Cc: Michal Hocko <mhocko@suse.com> > Cc: Michael Ellerman <mpe@ellerman.id.au> > Cc: Anshuman Khandual <anshuman.khandual@arm.com> > Signed-off-by: yuqi jin <jinyuqi@huawei.com> > Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> > --- > ChangeLog from v2: > 1. Change the variables as static and use spinlock to protect; > 2. Give more explantation on test and performance; > lib/cpumask.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++------- > 1 file changed, 90 insertions(+), 12 deletions(-) > > diff --git a/lib/cpumask.c b/lib/cpumask.c > index 0cb672eb107c..b98a2256bc5a 100644 > --- a/lib/cpumask.c > +++ b/lib/cpumask.c > @@ -6,6 +6,7 @@ > #include <linux/export.h> > #include <linux/memblock.h> > #include <linux/numa.h> > +#include <linux/spinlock.h> > > /** > * cpumask_next - get the next cpu in a cpumask > @@ -192,18 +193,39 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask) > } > #endif > > -/** > - * cpumask_local_spread - select the i'th cpu with local numa cpu's first > - * @i: index number > - * @node: local numa_node > - * > - * This function selects an online CPU according to a numa aware policy; > - * local cpus are returned first, followed by non-local ones, then it > - * wraps around. > - * > - * It's not very efficient, but useful for setup. > - */ > -unsigned int cpumask_local_spread(unsigned int i, int node) > +static void calc_node_distance(int *node_dist, int node) > +{ > + int i; > + > + for (i = 0; i < nr_node_ids; i++) > + node_dist[i] = node_distance(node, i); > +} > + > +static int find_nearest_node(int *node_dist, bool *used) > +{ > + int i, min_dist = node_dist[0], node_id = -1; > + > + /* Choose the first unused node to compare */ > + for (i = 0; i < nr_node_ids; i++) { > + if (used[i] == 0) { > + min_dist = node_dist[i]; > + node_id = i; > + break; > + } > + } > + > + /* Compare and return the nearest node */ > + for (i = 0; i < nr_node_ids; i++) { > + if (node_dist[i] < min_dist && used[i] == 0) { > + min_dist = node_dist[i]; > + node_id = i; > + } > + } > + > + return node_id; > +} > + > +static unsigned int __cpumask_local_spread(unsigned int i, int node) > { > int cpu; > > @@ -231,4 +253,60 @@ unsigned int cpumask_local_spread(unsigned int i, int node) > } > BUG(); > } > + > +static DEFINE_SPINLOCK(spread_lock); > +/** > + * cpumask_local_spread - select the i'th cpu with local numa cpu's first > + * @i: index number > + * @node: local numa_node > + * > + * This function selects an online CPU according to a numa aware policy; > + * local cpus are returned first, followed by the nearest non-local ones, > + * then it wraps around. > + * > + * It's not very efficient, but useful for setup. > + */ > +unsigned int cpumask_local_spread(unsigned int i, int node) > +{ > + static int node_dist[MAX_NUMNODES]; > + static bool used[MAX_NUMNODES]; > + unsigned long flags; > + int cpu, j, id; > + > + /* Wrap: we always want a cpu. */ > + i %= num_online_cpus(); > + > + if (node == NUMA_NO_NODE) { > + for_each_cpu(cpu, cpu_online_mask) > + if (i-- == 0) > + return cpu; > + } else { > + if (nr_node_ids > MAX_NUMNODES) > + return __cpumask_local_spread(i, node); > + > + spin_lock_irqsave(&spread_lock, flags); > + memset(used, 0, nr_node_ids * sizeof(bool)); > + calc_node_distance(node_dist, node); > + for (j = 0; j < nr_node_ids; j++) { > + id = find_nearest_node(node_dist, used); > + if (id < 0) > + break; > + > + for_each_cpu_and(cpu, cpumask_of_node(id), > + cpu_online_mask) > + if (i-- == 0) { > + spin_unlock_irqrestore(&spread_lock, > + flags); > + return cpu; > + } > + used[id] = 1; > + } > + spin_unlock_irqrestore(&spread_lock, flags); > + > + for_each_cpu(cpu, cpu_online_mask) > + if (i-- == 0) > + return cpu; > + } > + BUG(); > +} > EXPORT_SYMBOL(cpumask_local_spread); > -- > 2.7.4 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3] lib: optimize cpumask_local_spread() 2019-11-08 10:31 ` Michal Hocko @ 2019-11-11 2:02 ` Shaokun Zhang 2019-11-12 11:56 ` Michal Hocko 0 siblings, 1 reply; 11+ messages in thread From: Shaokun Zhang @ 2019-11-11 2:02 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, yuqi jin, Andrew Morton, Mike Rapoport, Paul Burton, Michael Ellerman, Anshuman Khandual, netdev Hi Michal, On 2019/11/8 18:31, Michal Hocko wrote: > This changelog looks better, thanks! I still have some questions though. > Btw. cpumask_local_spread is used by the networking code but I do not > see net guys involved (Cc netdev) Oh, I forgot to involve the net guys, sorry. > > On Thu 07-11-19 09:44:08, Shaokun Zhang wrote: >> From: yuqi jin <jinyuqi@huawei.com> >> >> In the multi-processors and NUMA system, I/O driver will find cpu cores >> that which shall be bound IRQ. When cpu cores in the local numa have >> been used, it is better to find the node closest to the local numa node, >> instead of choosing any online cpu immediately. >> >> On Huawei Kunpeng 920 server, there are 4 NUMA node(0 -3) in the 2-cpu >> system(0 - 1). > > Please send a topology of this server (numactl -H). > available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 node 0 size: 63379 MB node 0 free: 61899 MB node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 node 1 size: 64509 MB node 1 free: 63942 MB node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 2 size: 64509 MB node 2 free: 63056 MB node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 3 size: 63997 MB node 3 free: 63420 MB node distances: node 0 1 2 3 0: 10 16 32 33 1: 16 10 25 32 2: 32 25 10 16 3: 33 32 16 10 >> We perform PS (parameter server) business test, the >> behavior of the service is that the client initiates a request through >> the network card, the server responds to the request after calculation. > > Is the benchmark any ublicly available? > Sorry, the PS which we test is not open, but I think redis is the same as PS on the macro level. When there are both 24 redis servers on node2 and node3. if the 24-47 irqs and xps of NIC are not bound to node3, the redis servers on node3 will not performance good. >> When two PS processes run on node2 and node3 separately and the >> network card is located on 'node2' which is in cpu1, the performance >> of node2 (26W QPS) and node3 (22W QPS) was different. >> It is better that the NIC queues are bound to the cpu1 cores in turn, >> then XPS will also be properly initialized, while cpumask_local_spread >> only considers the local node. When the number of NIC queues exceeds >> the number of cores in the local node, it returns to the online core >> directly. So when PS runs on node3 sending a calculated request, >> the performance is not as good as the node2. It is considered that >> the NIC and other I/O devices shall initialize the interrupt binding, >> if the cores of the local node are used up, it is reasonable to return >> the node closest to it. > > Can you post cpu affinities before and after this patch? > Before this patch Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node 2 Euler:~ # cat /proc/irq/345/smp_affinity #IRQ0 00000000,00010000,00000000 Euler:~ # cat /proc/irq/369/smp_affinity #IRQ24 00000000,00000000,00000001 Euler:~ # cat /proc/irq/393/smp_affinity #IRQ48 00000000,00000000,01000000 Euler:~ # cat /sys/class/net/eth7/queues/tx-0/xps_cpus 00000000,00010000,00000000 Euler:~ # cat /sys/class/net/eth7/queues/tx-24/xps_cpus 00000000,00000000,00000001 Euler:~ # cat /sys/class/net/eth7/queues/tx-48/xps_cpus 00000000,00000000,01000000 After this patch Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node 2 Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat /proc/irq/345/smp_affinity 00000000,00010000,00000000 Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat /proc/irq/369/smp_affinity 00000100,00000000,00000000 Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat /proc/irq/393/smp_affinity 00000000,00000000,01000000 Euler:/sys/bus/pci # cat /sys/class/net/eth7/queues/tx-0/xps_cpus 00000000,00010000,00000000 Euler:/sys/bus/pci # cat /sys/class/net/eth7/queues/tx-24/xps_cpus 00000100,00000000,00000000 Euler:/sys/bus/pci # cat /sys/class/net/eth7/queues/tx-48/xps_cpus 00000000,00000000,01000000 >> Let's optimize it and find the nearest node through NUMA distance for the >> non-local NUMA nodes. The performance will be better if it return the >> nearest node than the random node. >> >> After this patch, the performance of the node3 is the same as node2 >> that is 26W QPS when the network card is still in 'node2'. Since it will >> return the closest non-local NUMA code rather than random node, it is no >> harm to others at least. > > It would be also nice to explain why the current implementation hasn't > taken the path your have chosen. Was it a simplicity or is there a more > fundamental reason? Is there any risk that existing users would regress? It seems that we don't see an interface like cpumask_of_socket. So we use NUMA distances to solve this issue. And it can also solve the problem when the irqs of the I/O device are more than the cores number of sockets. On the other hand, after the system started, the user can manually change the irq and xps bindings. But reasonable initialization is a really good experience. > Preferring cpus from the local socket which is what you aim for sounds Right. > like a logical thing to do so I am wondering why this hasn't been > considered. An interesting question, When I met this issue, I tried to find the socket ID in kernel, @physical_package_id is architecture and platform dependent in Documentation/admin-guide/cputopology.rst. In fact, I also check "Core(s) per socket" from lscpu command which is calculated in sys-utils/lscpu.c. It seems that NUMA distance is a general way for this problem. Thanks, Shaokun >> Cc: Andrew Morton <akpm@linux-foundation.org> >> Cc: Mike Rapoport <rppt@linux.ibm.com> >> Cc: Paul Burton <paul.burton@mips.com> >> Cc: Michal Hocko <mhocko@suse.com> >> Cc: Michael Ellerman <mpe@ellerman.id.au> >> Cc: Anshuman Khandual <anshuman.khandual@arm.com> >> Signed-off-by: yuqi jin <jinyuqi@huawei.com> >> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> >> --- >> ChangeLog from v2: >> 1. Change the variables as static and use spinlock to protect; >> 2. Give more explantation on test and performance; >> lib/cpumask.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++------- >> 1 file changed, 90 insertions(+), 12 deletions(-) >> >> diff --git a/lib/cpumask.c b/lib/cpumask.c >> index 0cb672eb107c..b98a2256bc5a 100644 >> --- a/lib/cpumask.c >> +++ b/lib/cpumask.c >> @@ -6,6 +6,7 @@ >> #include <linux/export.h> >> #include <linux/memblock.h> >> #include <linux/numa.h> >> +#include <linux/spinlock.h> >> >> /** >> * cpumask_next - get the next cpu in a cpumask >> @@ -192,18 +193,39 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask) >> } >> #endif >> >> -/** >> - * cpumask_local_spread - select the i'th cpu with local numa cpu's first >> - * @i: index number >> - * @node: local numa_node >> - * >> - * This function selects an online CPU according to a numa aware policy; >> - * local cpus are returned first, followed by non-local ones, then it >> - * wraps around. >> - * >> - * It's not very efficient, but useful for setup. >> - */ >> -unsigned int cpumask_local_spread(unsigned int i, int node) >> +static void calc_node_distance(int *node_dist, int node) >> +{ >> + int i; >> + >> + for (i = 0; i < nr_node_ids; i++) >> + node_dist[i] = node_distance(node, i); >> +} >> + >> +static int find_nearest_node(int *node_dist, bool *used) >> +{ >> + int i, min_dist = node_dist[0], node_id = -1; >> + >> + /* Choose the first unused node to compare */ >> + for (i = 0; i < nr_node_ids; i++) { >> + if (used[i] == 0) { >> + min_dist = node_dist[i]; >> + node_id = i; >> + break; >> + } >> + } >> + >> + /* Compare and return the nearest node */ >> + for (i = 0; i < nr_node_ids; i++) { >> + if (node_dist[i] < min_dist && used[i] == 0) { >> + min_dist = node_dist[i]; >> + node_id = i; >> + } >> + } >> + >> + return node_id; >> +} >> + >> +static unsigned int __cpumask_local_spread(unsigned int i, int node) >> { >> int cpu; >> >> @@ -231,4 +253,60 @@ unsigned int cpumask_local_spread(unsigned int i, int node) >> } >> BUG(); >> } >> + >> +static DEFINE_SPINLOCK(spread_lock); >> +/** >> + * cpumask_local_spread - select the i'th cpu with local numa cpu's first >> + * @i: index number >> + * @node: local numa_node >> + * >> + * This function selects an online CPU according to a numa aware policy; >> + * local cpus are returned first, followed by the nearest non-local ones, >> + * then it wraps around. >> + * >> + * It's not very efficient, but useful for setup. >> + */ >> +unsigned int cpumask_local_spread(unsigned int i, int node) >> +{ >> + static int node_dist[MAX_NUMNODES]; >> + static bool used[MAX_NUMNODES]; >> + unsigned long flags; >> + int cpu, j, id; >> + >> + /* Wrap: we always want a cpu. */ >> + i %= num_online_cpus(); >> + >> + if (node == NUMA_NO_NODE) { >> + for_each_cpu(cpu, cpu_online_mask) >> + if (i-- == 0) >> + return cpu; >> + } else { >> + if (nr_node_ids > MAX_NUMNODES) >> + return __cpumask_local_spread(i, node); >> + >> + spin_lock_irqsave(&spread_lock, flags); >> + memset(used, 0, nr_node_ids * sizeof(bool)); >> + calc_node_distance(node_dist, node); >> + for (j = 0; j < nr_node_ids; j++) { >> + id = find_nearest_node(node_dist, used); >> + if (id < 0) >> + break; >> + >> + for_each_cpu_and(cpu, cpumask_of_node(id), >> + cpu_online_mask) >> + if (i-- == 0) { >> + spin_unlock_irqrestore(&spread_lock, >> + flags); >> + return cpu; >> + } >> + used[id] = 1; >> + } >> + spin_unlock_irqrestore(&spread_lock, flags); >> + >> + for_each_cpu(cpu, cpu_online_mask) >> + if (i-- == 0) >> + return cpu; >> + } >> + BUG(); >> +} >> EXPORT_SYMBOL(cpumask_local_spread); >> -- >> 2.7.4 > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3] lib: optimize cpumask_local_spread() 2019-11-11 2:02 ` Shaokun Zhang @ 2019-11-12 11:56 ` Michal Hocko 2019-11-13 2:46 ` Shaokun Zhang 0 siblings, 1 reply; 11+ messages in thread From: Michal Hocko @ 2019-11-12 11:56 UTC (permalink / raw) To: Shaokun Zhang Cc: linux-kernel, yuqi jin, Andrew Morton, Mike Rapoport, Paul Burton, Michael Ellerman, Anshuman Khandual, netdev On Mon 11-11-19 10:02:37, Shaokun Zhang wrote: > Hi Michal, > > On 2019/11/8 18:31, Michal Hocko wrote: > > This changelog looks better, thanks! I still have some questions though. > > Btw. cpumask_local_spread is used by the networking code but I do not > > see net guys involved (Cc netdev) > > Oh, I forgot to involve the net guys, sorry. > > > > > On Thu 07-11-19 09:44:08, Shaokun Zhang wrote: > >> From: yuqi jin <jinyuqi@huawei.com> > >> > >> In the multi-processors and NUMA system, I/O driver will find cpu cores > >> that which shall be bound IRQ. When cpu cores in the local numa have > >> been used, it is better to find the node closest to the local numa node, > >> instead of choosing any online cpu immediately. > >> > >> On Huawei Kunpeng 920 server, there are 4 NUMA node(0 -3) in the 2-cpu > >> system(0 - 1). > > > > Please send a topology of this server (numactl -H). > > > > available: 4 nodes (0-3) > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 > node 0 size: 63379 MB > node 0 free: 61899 MB > node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 > node 1 size: 64509 MB > node 1 free: 63942 MB > node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 > node 2 size: 64509 MB > node 2 free: 63056 MB > node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 > node 3 size: 63997 MB > node 3 free: 63420 MB > node distances: > node 0 1 2 3 > 0: 10 16 32 33 > 1: 16 10 25 32 > 2: 32 25 10 16 > 3: 33 32 16 10 > > >> We perform PS (parameter server) business test, the > >> behavior of the service is that the client initiates a request through > >> the network card, the server responds to the request after calculation. > > > > Is the benchmark any ublicly available? > > > > Sorry, the PS which we test is not open, but I think redis is the same as PS > on the macro level. When there are both 24 redis servers on node2 and node3. > if the 24-47 irqs and xps of NIC are not bound to node3, the redis servers > on node3 will not performance good. Are there any other benchmarks showing improvements? > >> When two PS processes run on node2 and node3 separately and the > >> network card is located on 'node2' which is in cpu1, the performance > >> of node2 (26W QPS) and node3 (22W QPS) was different. > >> It is better that the NIC queues are bound to the cpu1 cores in turn, > >> then XPS will also be properly initialized, while cpumask_local_spread > >> only considers the local node. When the number of NIC queues exceeds > >> the number of cores in the local node, it returns to the online core > >> directly. So when PS runs on node3 sending a calculated request, > >> the performance is not as good as the node2. It is considered that > >> the NIC and other I/O devices shall initialize the interrupt binding, > >> if the cores of the local node are used up, it is reasonable to return > >> the node closest to it. > > > > Can you post cpu affinities before and after this patch? > > > > Before this patch > Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node > 2 > Euler:~ # cat /proc/irq/345/smp_affinity #IRQ0 > 00000000,00010000,00000000 This representation is awkward to parse. Could you add smp_affinity_list please? It would save quite some head scratching. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3] lib: optimize cpumask_local_spread() 2019-11-12 11:56 ` Michal Hocko @ 2019-11-13 2:46 ` Shaokun Zhang 2019-11-14 14:43 ` Michal Hocko 0 siblings, 1 reply; 11+ messages in thread From: Shaokun Zhang @ 2019-11-13 2:46 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, yuqi jin, Andrew Morton, Mike Rapoport, Paul Burton, Michael Ellerman, Anshuman Khandual, netdev Hi Michal, On 2019/11/12 19:56, Michal Hocko wrote: > On Mon 11-11-19 10:02:37, Shaokun Zhang wrote: >> Hi Michal, >> >> On 2019/11/8 18:31, Michal Hocko wrote: >>> This changelog looks better, thanks! I still have some questions though. >>> Btw. cpumask_local_spread is used by the networking code but I do not >>> see net guys involved (Cc netdev) >> >> Oh, I forgot to involve the net guys, sorry. >> >>> >>> On Thu 07-11-19 09:44:08, Shaokun Zhang wrote: >>>> From: yuqi jin <jinyuqi@huawei.com> >>>> >>>> In the multi-processors and NUMA system, I/O driver will find cpu cores >>>> that which shall be bound IRQ. When cpu cores in the local numa have >>>> been used, it is better to find the node closest to the local numa node, >>>> instead of choosing any online cpu immediately. >>>> >>>> On Huawei Kunpeng 920 server, there are 4 NUMA node(0 -3) in the 2-cpu >>>> system(0 - 1). >>> >>> Please send a topology of this server (numactl -H). >>> >> >> available: 4 nodes (0-3) >> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 >> node 0 size: 63379 MB >> node 0 free: 61899 MB >> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 >> node 1 size: 64509 MB >> node 1 free: 63942 MB >> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 >> node 2 size: 64509 MB >> node 2 free: 63056 MB >> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 >> node 3 size: 63997 MB >> node 3 free: 63420 MB >> node distances: >> node 0 1 2 3 >> 0: 10 16 32 33 >> 1: 16 10 25 32 >> 2: 32 25 10 16 >> 3: 33 32 16 10 >> >>>> We perform PS (parameter server) business test, the >>>> behavior of the service is that the client initiates a request through >>>> the network card, the server responds to the request after calculation. >>> >>> Is the benchmark any ublicly available? >>> >> >> Sorry, the PS which we test is not open, but I think redis is the same as PS >> on the macro level. When there are both 24 redis servers on node2 and node3. >> if the 24-47 irqs and xps of NIC are not bound to node3, the redis servers >> on node3 will not performance good. > > Are there any other benchmarks showing improvements? > Sorry, I don't have it. The issue is clear and the patch is helpful for the actual Parameter Server and Redis test. >>>> When two PS processes run on node2 and node3 separately and the >>>> network card is located on 'node2' which is in cpu1, the performance >>>> of node2 (26W QPS) and node3 (22W QPS) was different. >>>> It is better that the NIC queues are bound to the cpu1 cores in turn, >>>> then XPS will also be properly initialized, while cpumask_local_spread >>>> only considers the local node. When the number of NIC queues exceeds >>>> the number of cores in the local node, it returns to the online core >>>> directly. So when PS runs on node3 sending a calculated request, >>>> the performance is not as good as the node2. It is considered that >>>> the NIC and other I/O devices shall initialize the interrupt binding, >>>> if the cores of the local node are used up, it is reasonable to return >>>> the node closest to it. >>> >>> Can you post cpu affinities before and after this patch? >>> >> >> Before this patch >> Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node >> 2 >> Euler:~ # cat /proc/irq/345/smp_affinity #IRQ0 >> 00000000,00010000,00000000 > > This representation is awkward to parse. Could you add smp_affinity_list > please? It would save quite some head scratching. > before patch Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node 2 Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list 48 Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list 0 Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list 24 Euler:/sys/bus/pci # after patch Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node 2 Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list 48 Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list 72 Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list 24 Euler:/sys/bus/pci # Thanks, Shaokun ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3] lib: optimize cpumask_local_spread() 2019-11-13 2:46 ` Shaokun Zhang @ 2019-11-14 14:43 ` Michal Hocko 2019-11-15 9:09 ` Shaokun Zhang 0 siblings, 1 reply; 11+ messages in thread From: Michal Hocko @ 2019-11-14 14:43 UTC (permalink / raw) To: Shaokun Zhang Cc: linux-kernel, yuqi jin, Andrew Morton, Mike Rapoport, Paul Burton, Michael Ellerman, Anshuman Khandual, netdev On Wed 13-11-19 10:46:05, Shaokun Zhang wrote: [...] > >> available: 4 nodes (0-3) > >> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 > >> node 0 size: 63379 MB > >> node 0 free: 61899 MB > >> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 > >> node 1 size: 64509 MB > >> node 1 free: 63942 MB > >> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 > >> node 2 size: 64509 MB > >> node 2 free: 63056 MB > >> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 > >> node 3 size: 63997 MB > >> node 3 free: 63420 MB > >> node distances: > >> node 0 1 2 3 > >> 0: 10 16 32 33 > >> 1: 16 10 25 32 > >> 2: 32 25 10 16 > >> 3: 33 32 16 10 [...] > before patch > Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node > 2 > Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list > 48 node 2 > Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list > 0 node 0 > Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list > 24 node 1 > Euler:/sys/bus/pci # > > after patch > Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node > 2 > Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list > 48 node 2 > Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list > 72 node 3 > Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list > 24 node 1 So few more questions. The only difference seems to be IRQ369 moving from 0 to 3 and having the device affinity to node 2 makes some sense because node 3 is closer. So far so good. I still have a large gap to get the whole picture. Namely why those other IRQs are not using any of the existing CPUs on the node 2. Could you explain that please? Btw. this all should be in the changelog. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3] lib: optimize cpumask_local_spread() 2019-11-14 14:43 ` Michal Hocko @ 2019-11-15 9:09 ` Shaokun Zhang 2019-11-15 13:36 ` Michal Hocko 0 siblings, 1 reply; 11+ messages in thread From: Shaokun Zhang @ 2019-11-15 9:09 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, yuqi jin, Andrew Morton, Mike Rapoport, Paul Burton, Michael Ellerman, Anshuman Khandual, netdev Hi Michal, On 2019/11/14 22:43, Michal Hocko wrote: > On Wed 13-11-19 10:46:05, Shaokun Zhang wrote: > [...] >>>> available: 4 nodes (0-3) >>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 >>>> node 0 size: 63379 MB >>>> node 0 free: 61899 MB >>>> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 >>>> node 1 size: 64509 MB >>>> node 1 free: 63942 MB >>>> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 >>>> node 2 size: 64509 MB >>>> node 2 free: 63056 MB >>>> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 >>>> node 3 size: 63997 MB >>>> node 3 free: 63420 MB >>>> node distances: >>>> node 0 1 2 3 >>>> 0: 10 16 32 33 >>>> 1: 16 10 25 32 >>>> 2: 32 25 10 16 >>>> 3: 33 32 16 10 > [...] >> before patch >> Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node >> 2 >> Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list >> 48 > > node 2 > >> Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list >> 0 > > node 0 > >> Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list >> 24 > > node 1 > >> Euler:/sys/bus/pci # >> >> after patch >> Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node >> 2 >> Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list >> 48 > > node 2 > >> Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list >> 72 > > node 3 > >> Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list >> 24 > > node 1 > > So few more questions. The only difference seems to be IRQ369 > moving from 0 to 3 and having the device affinity to node 2 > makes some sense because node 3 is closer. So far so good. Right, it is what we want. > I still have a large gap to get the whole picture. Namely why those > other IRQs are not using any of the existing CPUs on the node 2. > Could you explain that please? > Oh, my mistake, for the previous instance, I don't list all IRQs and just choose one IRQ from one NUMA node. You can see that the IRQ number is not consistent :-). IRQ from 345 to 368 will be bound to CPU cores which are in NUMA node2 and each IRQ is corresponding to one core. Euler:/sys/bus/pci # cat /proc/irq/346/smp_affinity_list 49 Others are the similar. > Btw. this all should be in the changelog. Ok, I will follow it in future. Thanks, Shaokun > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3] lib: optimize cpumask_local_spread() 2019-11-15 9:09 ` Shaokun Zhang @ 2019-11-15 13:36 ` Michal Hocko 2019-11-21 8:14 ` Shaokun Zhang 0 siblings, 1 reply; 11+ messages in thread From: Michal Hocko @ 2019-11-15 13:36 UTC (permalink / raw) To: Shaokun Zhang Cc: linux-kernel, yuqi jin, Andrew Morton, Mike Rapoport, Paul Burton, Michael Ellerman, Anshuman Khandual, netdev On Fri 15-11-19 17:09:13, Shaokun Zhang wrote: [...] > Oh, my mistake, for the previous instance, I don't list all IRQs and > just choose one IRQ from one NUMA node. You can see that the IRQ > number is not consistent :-). > IRQ from 345 to 368 will be bound to CPU cores which are in NUMA node2 > and each IRQ is corresponding to one core. > This is quite confusing then. I would suggest providing all IRQ used for the device with the specific node affinity to see the difference in the setup. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3] lib: optimize cpumask_local_spread() 2019-11-15 13:36 ` Michal Hocko @ 2019-11-21 8:14 ` Shaokun Zhang 0 siblings, 0 replies; 11+ messages in thread From: Shaokun Zhang @ 2019-11-21 8:14 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, yuqi jin, Andrew Morton, Mike Rapoport, Paul Burton, Michael Ellerman, Anshuman Khandual, netdev Hi Michal, On 2019/11/15 21:36, Michal Hocko wrote: > On Fri 15-11-19 17:09:13, Shaokun Zhang wrote: > [...] >> Oh, my mistake, for the previous instance, I don't list all IRQs and >> just choose one IRQ from one NUMA node. You can see that the IRQ >> number is not consistent :-). >> IRQ from 345 to 368 will be bound to CPU cores which are in NUMA node2 >> and each IRQ is corresponding to one core. >> > > This is quite confusing then. I would suggest providing all IRQ used for node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 node 0 size: 63379 MB node 0 free: 61899 MB node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 node 1 size: 64509 MB node 1 free: 63942 MB node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 2 size: 64509 MB node 2 free: 63056 MB node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 3 size: 63997 MB node 3 free: 63420 MB node distances: node 0 1 2 3 0: 10 16 32 33 1: 16 10 25 32 2: 32 25 10 16 3: 33 32 16 10 Before the patch: /* I/O device is located on NUMA node2 */ Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node 2 Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list 48 Euler:/sys/bus/pci # cat /proc/irq/346/smp_affinity_list 49 . . . Euler:/sys/bus/pci # cat /proc/irq/367/smp_affinity_list 70 Euler:/sys/bus/pci # cat /proc/irq/368/smp_affinity_list 71 /* there we expect irq form 24 to 47 binding to node3 */ Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list 0 Euler:/sys/bus/pci # cat /proc/irq/370/smp_affinity_list 1 . . . Euler:/sys/bus/pci # cat /proc/irq/391/smp_affinity_list 22 Euler:/sys/bus/pci # cat /proc/irq/392/smp_affinity_list 23 Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list 24 Euler:/sys/bus/pci # cat /proc/irq/394/smp_affinity_list 25 . . . /* There are total 64 irqs on eth IO device. */ Euler:/sys/bus/pci # cat /proc/irq/407/smp_affinity_list 38 Euler:/sys/bus/pci # cat /proc/irq/408/smp_affinity_list 39 Euler:/sys/bus/pci # After the patch: Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node 2 Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list 48 Euler:/sys/bus/pci # cat /proc/irq/346/smp_affinity_list 49 . . . Euler:/sys/bus/pci # cat /proc/irq/367/smp_affinity_list 70 Euler:/sys/bus/pci # cat /proc/irq/368/smp_affinity_list 71 Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list 72 Euler:/sys/bus/pci # cat /proc/irq/370/smp_affinity_list 73 . . . Euler:/sys/bus/pci # cat /proc/irq/391/smp_affinity_list 94 Euler:/sys/bus/pci # cat /proc/irq/392/smp_affinity_list 95 /* when the cores of socket were used up, we choose the node which is closer to node 2 before this patch choose the same node here is a coincident event*/ Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list 24 Euler:/sys/bus/pci # cat /proc/irq/394/smp_affinity_list 25 . . . /* There are total 64 irqs on eth IO device. */ Euler:/sys/bus/pci # cat /proc/irq/407/smp_affinity_list 38 Euler:/sys/bus/pci # cat /proc/irq/408/smp_affinity_list 39 Euler:/sys/bus/pci # Thanks, Shaokun > the device with the specific node affinity to see the difference in the > setup. > ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2019-11-21 8:14 UTC | newest] Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-11-07 1:44 [PATCH v3] lib: optimize cpumask_local_spread() Shaokun Zhang 2019-11-08 3:49 ` Andrew Morton 2019-11-08 5:50 ` Shaokun Zhang 2019-11-08 10:31 ` Michal Hocko 2019-11-11 2:02 ` Shaokun Zhang 2019-11-12 11:56 ` Michal Hocko 2019-11-13 2:46 ` Shaokun Zhang 2019-11-14 14:43 ` Michal Hocko 2019-11-15 9:09 ` Shaokun Zhang 2019-11-15 13:36 ` Michal Hocko 2019-11-21 8:14 ` Shaokun Zhang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).