linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* Advice needed on SMP regression after cpu_core_mask change
@ 2021-03-17 13:00 Daniel Henrique Barboza
  2021-03-17 15:30 ` Cédric Le Goater
  2021-03-18 13:42 ` Srikar Dronamraju
  0 siblings, 2 replies; 6+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-17 13:00 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: aneesh.kumar, Srikar Dronamraju, Cedric Le Goater

Hello,

Patch 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") introduced
a regression in both the upstream and RHEL downstream kernels [1]. The assumption
made in the commit:

"Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be
equal on Power"

doesn't seem to hold. After this commit, QEMU is no longer able to set single-NUMA-node
SMP topologies such as:

-smp 8,maxcpus=8,cores=2,threads=2,sockets=2

lscpu will give the following output in this case:

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-7


This is happening because the macro cpu_cpu_mask(cpu) expands to
cpumask_of_node(cpu_to_node(cpu)), which in turn expands to node_to_cpumask_map[node].
node_to_cpumask_map is a per-node array of cpumasks, i.e. it maps each NUMA node to the
set of CPUs that belong to it (Aneesh is on CC to correct me if I'm wrong). The net
effect is that sockets are now associated directly with NUMA nodes.
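
For reference, this is roughly how the expansion looks in the kernel sources. I'm
paraphrasing from memory, so the exact file locations and guards may differ:

  /* arch/powerpc/include/asm/smp.h (roughly) */
  #define cpu_cpu_mask(cpu)     cpumask_of_node(cpu_to_node(cpu))

  /* arch/powerpc/include/asm/topology.h, CONFIG_NUMA case (roughly) */
  #define cpumask_of_node(node) ((node) == -1 ?                 \
                                 cpu_all_mask :                 \
                                 node_to_cpumask_map[node])

  /* arch/powerpc/mm/numa.c: one cpumask per NUMA node */
  cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];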

If I add a second NUMA node then I can get the intended SMP topology:

-smp 8,maxcpus=8,cores=2,threads=2,sockets=2
-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
-numa node,memdev=mem1,cpus=4-7,nodeid=1 \

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-3
NUMA node1 CPU(s):   4-7


However, if I try a topology with a single socket spanning multiple NUMA nodes, which
is the case on Power10, e.g.:


-smp 8,maxcpus=8,cores=4,threads=2,sockets=1
-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
-numa node,memdev=mem1,cpus=4-7,nodeid=1 \


This is the result:

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-3
NUMA node1 CPU(s):   4-7


This confirms my suspicion that, at the moment, we're making sockets == NUMA nodes.


Cedric, the reason I'm CCing you is that this is related to ibm,chip-id. The commit
right after the one that caused the regression, 4ca234a9cbd7c3a65 ("powerpc/smp: Stop
updating cpu_core_mask"), removes the code that calculated cpu_core_mask. Despite the
shortcomings that led to its removal, cpu_core_mask was giving a precise SMP topology,
and it was using physical_package_id/'ibm,chip-id' for that.
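
For context, the code being removed in arch/powerpc/kernel/smp.c looked roughly like
this (reconstructed from memory, so helper names and details may be off):

  /* add_cpu_to_masks(), before 4ca234a9cbd7: cpu_core_mask was built
   * from the chip id, i.e. from the "ibm,chip-id" DT property. */
  int chipid = cpu_to_chip_id(cpu);       /* reads "ibm,chip-id" */
  int i;

  if (chipid == -1)
          return;

  for_each_cpu(i, cpu_online_mask)
          if (cpu_to_chip_id(i) == chipid)
                  set_cpus_related(cpu, i, cpu_core_mask);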

Checking QEMU, I can say that the ibm,chip-id calculation is the only place in the code
that cares about the cores-per-socket information. The kernel has been ignoring it since
4bce545903fa, so QEMU is now unable to provide this info to the guest.
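
The effect of that calculation is roughly the following (my own paraphrase, not the
literal QEMU code):

  /* Paraphrase of the guest DT logic in hw/ppc/spapr.c: the "ibm,chip-id"
   * of a vcpu is its index divided by the number of vcpus per socket. */
  static int spapr_vcpu_chip_id(int cpu_index, int cores_per_socket,
                                int threads_per_core)
  {
      int vcpus_per_socket = cores_per_socket * threads_per_core;

      return cpu_index / vcpus_per_socket;
  }

So with -smp 8,cores=2,threads=2,sockets=2, vcpus 0-3 get chip-id 0 and vcpus 4-7 get
chip-id 1, which is the socket split the guest used to see.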

If we're not going to use ibm,chip-id any longer, which seems sensible given that PAPR
does not define it, we need another way of letting the guest know how many cores per
socket we want.



[1] https://bugzilla.redhat.com/1934421



Thanks,


DHB

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Advice needed on SMP regression after cpu_core_mask change
  2021-03-17 13:00 Advice needed on SMP regression after cpu_core_mask change Daniel Henrique Barboza
@ 2021-03-17 15:30 ` Cédric Le Goater
  2021-03-17 16:05   ` Daniel Henrique Barboza
  2021-03-18 13:42 ` Srikar Dronamraju
  1 sibling, 1 reply; 6+ messages in thread
From: Cédric Le Goater @ 2021-03-17 15:30 UTC (permalink / raw)
  To: Daniel Henrique Barboza, linuxppc-dev
  Cc: Greg Kurz, aneesh.kumar, Srikar Dronamraju, David Gibson

On 3/17/21 2:00 PM, Daniel Henrique Barboza wrote:
> Hello,
> 
> Patch 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") introduced
> a regression in both upstream and RHEL downstream kernels [1]. The assumption made
> in the commit:
> 
> "Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be
> equal on Power"
> 
> Doesn't seem to be true. After this commit, QEMU is now unable to set single NUMA
> node SMP topologies such as:
> 
> -smp 8,maxcpus=8,cores=2,threads=2,sockets=2
> 
> lscpu will give the following output in this case:
> 
> # lscpu
> Architecture:        ppc64le
> Byte Order:          Little Endian
> CPU(s):              8
> On-line CPU(s) list: 0-7
> Thread(s) per core:  2
> Core(s) per socket:  4
> Socket(s):           1
> NUMA node(s):        1
> Model:               2.2 (pvr 004e 1202)
> Model name:          POWER9 (architected), altivec supported
> Hypervisor vendor:   KVM
> Virtualization type: para
> L1d cache:           32K
> L1i cache:           32K
> NUMA node0 CPU(s):   0-7
> 
> 
> This is happening because the macro cpu_cpu_mask(cpu) expands to
> cpumask_of_node(cpu_to_node(cpu)), which in turn expands to node_to_cpumask_map[node].
> node_to_cpumask_map is a NUMA array that maps CPUs to NUMA nodes (Aneesh is on CC to
> correct me if I'm wrong). We're now associating sockets to NUMA nodes directly.
> 
> If I add a second NUMA node then I can get the intended smp topology:
> 
> -smp 8,maxcpus=8,cores=2,threads=2,sockets=2
> -numa node,memdev=mem0,cpus=0-3,nodeid=0 \
> -numa node,memdev=mem1,cpus=4-7,nodeid=1 \
> 
> # lscpu
> Architecture:        ppc64le
> Byte Order:          Little Endian
> CPU(s):              8
> On-line CPU(s) list: 0-7
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):           2
> NUMA node(s):        2
> Model:               2.2 (pvr 004e 1202)
> Model name:          POWER9 (architected), altivec supported
> Hypervisor vendor:   KVM
> Virtualization type: para
> L1d cache:           32K
> L1i cache:           32K
> NUMA node0 CPU(s):   0-3
> NUMA node1 CPU(s):   4-7
> 
> 
> However, if I try a single socket with multiple NUMA nodes topology, which is the case
> of Power10, e.g.:
> 
> 
> -smp 8,maxcpus=8,cores=4,threads=2,sockets=1
> -numa node,memdev=mem0,cpus=0-3,nodeid=0 \
> -numa node,memdev=mem1,cpus=4-7,nodeid=1 \
> 
> 
> This is the result:
> 
> # lscpu
> Architecture:        ppc64le
> Byte Order:          Little Endian
> CPU(s):              8
> On-line CPU(s) list: 0-7
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):           2
> NUMA node(s):        2
> Model:               2.2 (pvr 004e 1202)
> Model name:          POWER9 (architected), altivec supported
> Hypervisor vendor:   KVM
> Virtualization type: para
> L1d cache:           32K
> L1i cache:           32K
> NUMA node0 CPU(s):   0-3
> NUMA node1 CPU(s):   4-7
> 
> 
> This confirms my suspicions that, at this moment, we're making sockets == NUMA nodes.

Yes. I don't think we can do better on PAPR and the above examples 
seem to confirm that the "sockets" definition is simply ignored.
 
> Cedric, the reason I'm CCing you is because this is related to ibm,chip-id. The commit
> after the one that caused the regression, 4ca234a9cbd7c3a65 ("powerpc/smp: Stop updating
> cpu_core_mask"), is erasing the code that calculated cpu_core_mask. cpu_core_mask, despite
> its shortcomings that caused its removal, was giving a precise SMP topology. And it was
> using physical_package_id/'ibm,chip-id' for that.

ibm,chip-id is a no-no on pSeries. I guess this is inherent to PAPR, which hides a lot
of the underlying HW and topology. Maybe we are trying to reconcile two orthogonal
views of machine virtualization ...

> Checking in QEMU I can say that the ibm,chip-id calculation is the only place in the code
> that cares about cores per socket information. The kernel is now ignoring that, starting
> on 4bce545903fa, and now QEMU is unable to provide this info to the guest.
> 
> If we're not going to use ibm,chip-id any longer, which seems sensible given that PAPR does
> not declare it, we need another way of letting the guest know how much cores per socket
> we want.
The RTAS call "ibm,get-system-parameter" with the "Processor Module 
Information" token returns that kind of information:

  2 byte binary number (N) of module types followed by N module specifiers of the form:
  2 byte binary number (M) of sockets of this module type
  2 byte binary number (L) of chips per this module type
  2 byte binary number (K) of cores per chip in this module type.
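
If it helps, this is how I would sketch that layout in C (the struct and field names
are mine, just to illustrate the description above; all fields are 2-byte values):

  #include <stdint.h>

  struct module_specifier {
          uint16_t sockets;         /* M: sockets of this module type */
          uint16_t chips;           /* L: chips per this module type */
          uint16_t cores_per_chip;  /* K: cores per chip in this module type */
  };

  struct processor_module_info {
          uint16_t num_module_types;          /* N */
          struct module_specifier modules[];  /* N specifiers follow */
  };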

See the values in these sysfs files:

  cat /sys/devices/hv_24x7/interface/{sockets,chipspersocket,coresperchip} 

But I am afraid this is host-level information, not guest/LPAR-level.

I didn't find any LPAR-level properties or hcalls in the PAPR document. 
They would need to be specified.

or

We can add extra properties like ibm,chip-id, making sure they are only 
used under the KVM hypervisor. My understanding is that's something we 
are trying to avoid. 

C.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Advice needed on SMP regression after cpu_core_mask change
  2021-03-17 15:30 ` Cédric Le Goater
@ 2021-03-17 16:05   ` Daniel Henrique Barboza
  2021-03-22  3:16     ` David Gibson
  0 siblings, 1 reply; 6+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-17 16:05 UTC (permalink / raw)
  To: Cédric Le Goater, linuxppc-dev
  Cc: Greg Kurz, aneesh.kumar, Srikar Dronamraju, David Gibson



On 3/17/21 12:30 PM, Cédric Le Goater wrote:
> On 3/17/21 2:00 PM, Daniel Henrique Barboza wrote:
>> Hello,
>>
>> Patch 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") introduced
>> a regression in both upstream and RHEL downstream kernels [1]. The assumption made
>> in the commit:
>>
>> "Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be
>> equal on Power"
>>
>> Doesn't seem to be true. After this commit, QEMU is now unable to set single NUMA
>> node SMP topologies such as:
>>
>> -smp 8,maxcpus=8,cores=2,threads=2,sockets=2
>>
>> lscpu will give the following output in this case:
>>
>> # lscpu
>> Architecture:        ppc64le
>> Byte Order:          Little Endian
>> CPU(s):              8
>> On-line CPU(s) list: 0-7
>> Thread(s) per core:  2
>> Core(s) per socket:  4
>> Socket(s):           1
>> NUMA node(s):        1
>> Model:               2.2 (pvr 004e 1202)
>> Model name:          POWER9 (architected), altivec supported
>> Hypervisor vendor:   KVM
>> Virtualization type: para
>> L1d cache:           32K
>> L1i cache:           32K
>> NUMA node0 CPU(s):   0-7
>>
>>
>> This is happening because the macro cpu_cpu_mask(cpu) expands to
>> cpumask_of_node(cpu_to_node(cpu)), which in turn expands to node_to_cpumask_map[node].
>> node_to_cpumask_map is a NUMA array that maps CPUs to NUMA nodes (Aneesh is on CC to
>> correct me if I'm wrong). We're now associating sockets to NUMA nodes directly.
>>
>> If I add a second NUMA node then I can get the intended smp topology:
>>
>> -smp 8,maxcpus=8,cores=2,threads=2,sockets=2
>> -numa node,memdev=mem0,cpus=0-3,nodeid=0 \
>> -numa node,memdev=mem1,cpus=4-7,nodeid=1 \
>>
>> # lscpu
>> Architecture:        ppc64le
>> Byte Order:          Little Endian
>> CPU(s):              8
>> On-line CPU(s) list: 0-7
>> Thread(s) per core:  2
>> Core(s) per socket:  2
>> Socket(s):           2
>> NUMA node(s):        2
>> Model:               2.2 (pvr 004e 1202)
>> Model name:          POWER9 (architected), altivec supported
>> Hypervisor vendor:   KVM
>> Virtualization type: para
>> L1d cache:           32K
>> L1i cache:           32K
>> NUMA node0 CPU(s):   0-3
>> NUMA node1 CPU(s):   4-7
>>
>>
>> However, if I try a single socket with multiple NUMA nodes topology, which is the case
>> of Power10, e.g.:
>>
>>
>> -smp 8,maxcpus=8,cores=4,threads=2,sockets=1
>> -numa node,memdev=mem0,cpus=0-3,nodeid=0 \
>> -numa node,memdev=mem1,cpus=4-7,nodeid=1 \
>>
>>
>> This is the result:
>>
>> # lscpu
>> Architecture:        ppc64le
>> Byte Order:          Little Endian
>> CPU(s):              8
>> On-line CPU(s) list: 0-7
>> Thread(s) per core:  2
>> Core(s) per socket:  2
>> Socket(s):           2
>> NUMA node(s):        2
>> Model:               2.2 (pvr 004e 1202)
>> Model name:          POWER9 (architected), altivec supported
>> Hypervisor vendor:   KVM
>> Virtualization type: para
>> L1d cache:           32K
>> L1i cache:           32K
>> NUMA node0 CPU(s):   0-3
>> NUMA node1 CPU(s):   4-7
>>
>>
>> This confirms my suspicions that, at this moment, we're making sockets == NUMA nodes.
> 
> Yes. I don't think we can do better on PAPR and the above examples
> seem to confirm that the "sockets" definition is simply ignored.
>   
>> Cedric, the reason I'm CCing you is because this is related to ibm,chip-id. The commit
>> after the one that caused the regression, 4ca234a9cbd7c3a65 ("powerpc/smp: Stop updating
>> cpu_core_mask"), is erasing the code that calculated cpu_core_mask. cpu_core_mask, despite
>> its shortcomings that caused its removal, was giving a precise SMP topology. And it was
>> using physical_package_id/'ibm,chip-id' for that.
> 
> ibm,chip-id is a no-no on pSeries. I guess this is inherent to PAPR which
> is hiding a lot of the underlying HW and topology. May be we are trying
> to reconcile two orthogonal views of machine virtualization ...
> 
>> Checking in QEMU I can say that the ibm,chip-id calculation is the only place in the code
>> that cares about cores per socket information. The kernel is now ignoring that, starting
>> on 4bce545903fa, and now QEMU is unable to provide this info to the guest.
>>
>> If we're not going to use ibm,chip-id any longer, which seems sensible given that PAPR does
>> not declare it, we need another way of letting the guest know how much cores per socket
>> we want.
> The RTAS call "ibm,get-system-parameter" with token "Processor Module
> Information" returns that kind of information :
> 
>    2 byte binary number (N) of module types followed by N module specifiers of the form:
>    2 byte binary number (M) of sockets of this module type
>    2 byte binary number (L) of chips per this module type
>    2 byte binary number (K) of cores per chip in this module type.
> 
> See the values in these sysfs files :
> 
>    cat /sys/devices/hv_24x7/interface/{sockets,chipspersocket,coresperchip}
> 
> But I am afraid these are host level information and not guest/LPAR.


I believe there might be some sort of reasoning behind not having this in
PAPR, but I'll say up front that the virtual machine should behave like the
real hardware, as closely as possible. This is the kind of hcall that could
be used in this situation.


> 
> I didn't find any LPAR level properties or hcalls in the PAPR document.
> They need to be specified.
> 
> or
> 
> We can add extra properties like ibm,chip-id but making sure it's only
> used under the KVM hypervisor. My understanding is that's something we
> are trying to avoid.

We can change PAPR to add ibm,chip-id. The problem is that ibm,chip-id today, with
the current kernel codebase, does not fix the issue because the code is
ignoring it hehehe


If we're going to change PAPR - and I believe we should, since there's a clear
lack of proper support for SMP topologies - we had better make sure that whatever
attribute/hcall we add there fixes it in a robust way for the long term.


Thanks,


DHB


> 
> C.
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Advice needed on SMP regression after cpu_core_mask change
  2021-03-17 13:00 Advice needed on SMP regression after cpu_core_mask change Daniel Henrique Barboza
  2021-03-17 15:30 ` Cédric Le Goater
@ 2021-03-18 13:42 ` Srikar Dronamraju
  2021-03-18 15:36   ` Daniel Henrique Barboza
  1 sibling, 1 reply; 6+ messages in thread
From: Srikar Dronamraju @ 2021-03-18 13:42 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: aneesh.kumar, linuxppc-dev, Cedric Le Goater

* Daniel Henrique Barboza <danielhb413@gmail.com> [2021-03-17 10:00:34]:

> Hello,
> 
> Patch 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") introduced
> a regression in both upstream and RHEL downstream kernels [1]. The assumption made
> in the commit:
> 
> "Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be
> equal on Power"
> 
> Doesn't seem to be true. After this commit, QEMU is now unable to set single NUMA
> node SMP topologies such as:
> 
> -smp 8,maxcpus=8,cores=2,threads=2,sockets=2

What does it mean for a NUMA node to have more than one socket?
If they are all part of the same node, they are at local distance to each
other, and cache is per core. So what resources are shared by the sockets that
are part of the same NUMA node, and how do userspace applications make use of
them?

Please don't mistake this for an attempt to downplay your report; it's an honest
attempt to better understand the situation.

For example, if the socket denotes the hemisphere logic in P10, then can we
see if the coregroup feature can be used? "Coregroup" is supposed to mean a
set of cores within a NUMA node that share some characteristics, and there can be
multiple coregroups within a NUMA node. We added it mostly to mimic the hemispheres
in P10. However, the number of coregroups in a NUMA node is not exported to userspace
at this time.

However, if each socket is associated with its own memory and node distance, then
shouldn't they be separate NUMA nodes?

Can you provide me with the unique ibm,chip-ids in your 2 NUMA, 4 node case?
Does this cause any performance issues with the guest/application?

Until your report, I was under the impression that NUMA nodes == sockets.

> 
> lscpu will give the following output in this case:
> 
> # lscpu
> Architecture:        ppc64le
> Byte Order:          Little Endian
> CPU(s):              8
> On-line CPU(s) list: 0-7
> Thread(s) per core:  2
> Core(s) per socket:  4
> Socket(s):           1
> NUMA node(s):        1
> Model:               2.2 (pvr 004e 1202)
> Model name:          POWER9 (architected), altivec supported
> Hypervisor vendor:   KVM
> Virtualization type: para
> L1d cache:           32K
> L1i cache:           32K
> NUMA node0 CPU(s):   0-7
> 
> 
> This is happening because the macro cpu_cpu_mask(cpu) expands to
> cpumask_of_node(cpu_to_node(cpu)), which in turn expands to node_to_cpumask_map[node].
> node_to_cpumask_map is a NUMA array that maps CPUs to NUMA nodes (Aneesh is on CC to
> correct me if I'm wrong). We're now associating sockets to NUMA nodes directly.
> 
> If I add a second NUMA node then I can get the intended smp topology:
> 
> -smp 8,maxcpus=8,cores=2,threads=2,sockets=2
> -numa node,memdev=mem0,cpus=0-3,nodeid=0 \
> -numa node,memdev=mem1,cpus=4-7,nodeid=1 \
> 
> # lscpu
> Architecture:        ppc64le
> Byte Order:          Little Endian
> CPU(s):              8
> On-line CPU(s) list: 0-7
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):           2
> NUMA node(s):        2
> Model:               2.2 (pvr 004e 1202)
> Model name:          POWER9 (architected), altivec supported
> Hypervisor vendor:   KVM
> Virtualization type: para
> L1d cache:           32K
> L1i cache:           32K
> NUMA node0 CPU(s):   0-3
> NUMA node1 CPU(s):   4-7
> 
> 
> However, if I try a single socket with multiple NUMA nodes topology, which is the case
> of Power10, e.g.:
> 
> 
> -smp 8,maxcpus=8,cores=4,threads=2,sockets=1
> -numa node,memdev=mem0,cpus=0-3,nodeid=0 \
> -numa node,memdev=mem1,cpus=4-7,nodeid=1 \
> 
> 
> This is the result:
> 
> # lscpu
> Architecture:        ppc64le
> Byte Order:          Little Endian
> CPU(s):              8
> On-line CPU(s) list: 0-7
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):           2
> NUMA node(s):        2
> Model:               2.2 (pvr 004e 1202)
> Model name:          POWER9 (architected), altivec supported
> Hypervisor vendor:   KVM
> Virtualization type: para
> L1d cache:           32K
> L1i cache:           32K
> NUMA node0 CPU(s):   0-3
> NUMA node1 CPU(s):   4-7
> 
> 
> This confirms my suspicions that, at this moment, we're making sockets == NUMA nodes.
> 
> 
> Cedric, the reason I'm CCing you is because this is related to ibm,chip-id. The commit
> after the one that caused the regression, 4ca234a9cbd7c3a65 ("powerpc/smp: Stop updating
> cpu_core_mask"), is erasing the code that calculated cpu_core_mask. cpu_core_mask, despite
> its shortcomings that caused its removal, was giving a precise SMP topology. And it was
> using physical_package_id/'ibm,chip-id' for that.
> 
> Checking in QEMU I can say that the ibm,chip-id calculation is the only place in the code
> that cares about cores per socket information. The kernel is now ignoring that, starting
> on 4bce545903fa, and now QEMU is unable to provide this info to the guest.
> 
> If we're not going to use ibm,chip-id any longer, which seems sensible given that PAPR does
> not declare it, we need another way of letting the guest know how much cores per socket
> we want.
> 
> 
> 
> [1] https://bugzilla.redhat.com/1934421
> 
> 
> 
> Thanks,
> 
> 
> DHB

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Advice needed on SMP regression after cpu_core_mask change
  2021-03-18 13:42 ` Srikar Dronamraju
@ 2021-03-18 15:36   ` Daniel Henrique Barboza
  0 siblings, 0 replies; 6+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-18 15:36 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: aneesh.kumar, linuxppc-dev, Cedric Le Goater



On 3/18/21 10:42 AM, Srikar Dronamraju wrote:
> * Daniel Henrique Barboza <danielhb413@gmail.com> [2021-03-17 10:00:34]:
> 
>> Hello,
>>
>> Patch 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") introduced
>> a regression in both upstream and RHEL downstream kernels [1]. The assumption made
>> in the commit:
>>
>> "Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be
>> equal on Power"
>>
>> Doesn't seem to be true. After this commit, QEMU is now unable to set single NUMA
>> node SMP topologies such as:
>>
>> -smp 8,maxcpus=8,cores=2,threads=2,sockets=2
> 
> What does it mean for a NUMA to have more than one sockets?
> If they are all part of the same node, there are at local distance to each
> other. cache is per core. So what resources are shared by the Sockets that
> are part of the same NUMA. And how does Userspace/ application make use of
> the same.

Honestly, I sympathize with the idea that multiple sockets in the same NUMA
node is "weird". QEMU has been accepting this kind of topology forever
because we didn't pay attention to these other details.

I don't see any problem adding more constraints in the virtual layer, as long
as they make sense and are documented. Disallowing multiple sockets in a single
NUMA node seems like a fair restriction.


> 
> Please don't mistake this as attempt to downplay your report but a honest
> attempt to better understand the situation.

It's cool. Ask away.

> 
> For example, if the socket denotes the hemisphere logic in P10, then can we
> see if the coregroup feature can be used. "Coregroup" is suppose to mean a
> set of cores within a NUMA that have some characteristics and there can be
> multiple coregroups within a NUMA. We add that mostly to mimic hemisphere in
> P10. However the number of coregroups in a NUMA is not exported to userspace
> at this time.

I see. I thought that the presence of the hemispheres inside the chip would
justify more than one NUMA node inside the chip, meaning that a chip/socket
would have more than one NUMA node inside it.

If that's not the case, then I guess socket == NUMA node is still valid on
P10 as well. The last 'lscpu' example I gave here, claiming that it would be
a Power10 scenario, doesn't represent P10 after all.

> 
> However if each Socket is associated with a memory and node distance, then
> should they be NUMA?
> 
> Can you provide me with the unique ibm,chip-ids in your 2 NUMA, 4 node case?
> Does this cause an performance issues with the guest/application?

I can fetch some values, but we're trying to move away from it since it's not in the
pseries spec (PAPR). Perhaps with these restrictions we can live without
ibm,chip-id in QEMU.

> 
> Till your report, I was under the impression that NUMAs == Sockets.

After reading and discussing this, I think the sensible thing to do is to
apply this same constraint in QEMU.

In theory it would be nice to let the virtual machine have whatever topology it
wants, multiple sockets in the same NUMA domain and so on, but in the end we're
emulating Power hardware. If Power hardware - and the powerpc kernel - operates
under these assumptions, then I don't see much point in allowing users to
set unrealistic virtual CPU topologies that will be misrepresented in the
kernel.


I'll try this restriction in QEMU and see how the upstream kernel behaves, with
and without ibm,chip-id being advertised in the DT.
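
Something along these lines, as a rough sketch (field names follow my recollection of
the QEMU tree and may be off; this is just the shape of the check, not a patch):

  /* Reject pseries topologies where sockets != NUMA nodes. */
  static void spapr_check_smp_topology(MachineState *ms, Error **errp)
  {
      unsigned int sockets = ms->smp.sockets;
      unsigned int nodes = ms->numa_state ? ms->numa_state->num_nodes : 0;

      /* A machine with no explicit NUMA config behaves as a single node. */
      if (nodes == 0) {
          nodes = 1;
      }

      if (sockets != nodes) {
          error_setg(errp, "pseries expects one socket per NUMA node: "
                     "%u socket(s) vs %u NUMA node(s)", sockets, nodes);
      }
  }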


Thanks,


DHB

> 
>>
>> lscpu will give the following output in this case:
>>
>> # lscpu
>> Architecture:        ppc64le
>> Byte Order:          Little Endian
>> CPU(s):              8
>> On-line CPU(s) list: 0-7
>> Thread(s) per core:  2
>> Core(s) per socket:  4
>> Socket(s):           1
>> NUMA node(s):        1
>> Model:               2.2 (pvr 004e 1202)
>> Model name:          POWER9 (architected), altivec supported
>> Hypervisor vendor:   KVM
>> Virtualization type: para
>> L1d cache:           32K
>> L1i cache:           32K
>> NUMA node0 CPU(s):   0-7
>>
>>
>> This is happening because the macro cpu_cpu_mask(cpu) expands to
>> cpumask_of_node(cpu_to_node(cpu)), which in turn expands to node_to_cpumask_map[node].
>> node_to_cpumask_map is a NUMA array that maps CPUs to NUMA nodes (Aneesh is on CC to
>> correct me if I'm wrong). We're now associating sockets to NUMA nodes directly.
>>
>> If I add a second NUMA node then I can get the intended smp topology:
>>
>> -smp 8,maxcpus=8,cores=2,threads=2,sockets=2
>> -numa node,memdev=mem0,cpus=0-3,nodeid=0 \
>> -numa node,memdev=mem1,cpus=4-7,nodeid=1 \
>>
>> # lscpu
>> Architecture:        ppc64le
>> Byte Order:          Little Endian
>> CPU(s):              8
>> On-line CPU(s) list: 0-7
>> Thread(s) per core:  2
>> Core(s) per socket:  2
>> Socket(s):           2
>> NUMA node(s):        2
>> Model:               2.2 (pvr 004e 1202)
>> Model name:          POWER9 (architected), altivec supported
>> Hypervisor vendor:   KVM
>> Virtualization type: para
>> L1d cache:           32K
>> L1i cache:           32K
>> NUMA node0 CPU(s):   0-3
>> NUMA node1 CPU(s):   4-7
>>
>>
>> However, if I try a single socket with multiple NUMA nodes topology, which is the case
>> of Power10, e.g.:
>>
>>
>> -smp 8,maxcpus=8,cores=4,threads=2,sockets=1
>> -numa node,memdev=mem0,cpus=0-3,nodeid=0 \
>> -numa node,memdev=mem1,cpus=4-7,nodeid=1 \
>>
>>
>> This is the result:
>>
>> # lscpu
>> Architecture:        ppc64le
>> Byte Order:          Little Endian
>> CPU(s):              8
>> On-line CPU(s) list: 0-7
>> Thread(s) per core:  2
>> Core(s) per socket:  2
>> Socket(s):           2
>> NUMA node(s):        2
>> Model:               2.2 (pvr 004e 1202)
>> Model name:          POWER9 (architected), altivec supported
>> Hypervisor vendor:   KVM
>> Virtualization type: para
>> L1d cache:           32K
>> L1i cache:           32K
>> NUMA node0 CPU(s):   0-3
>> NUMA node1 CPU(s):   4-7
>>
>>
>> This confirms my suspicions that, at this moment, we're making sockets == NUMA nodes.
>>
>>
>> Cedric, the reason I'm CCing you is because this is related to ibm,chip-id. The commit
>> after the one that caused the regression, 4ca234a9cbd7c3a65 ("powerpc/smp: Stop updating
>> cpu_core_mask"), is erasing the code that calculated cpu_core_mask. cpu_core_mask, despite
>> its shortcomings that caused its removal, was giving a precise SMP topology. And it was
>> using physical_package_id/'ibm,chip-id' for that.
>>
>> Checking in QEMU I can say that the ibm,chip-id calculation is the only place in the code
>> that cares about cores per socket information. The kernel is now ignoring that, starting
>> on 4bce545903fa, and now QEMU is unable to provide this info to the guest.
>>
>> If we're not going to use ibm,chip-id any longer, which seems sensible given that PAPR does
>> not declare it, we need another way of letting the guest know how much cores per socket
>> we want.
>>
>>
>>
>> [1] https://bugzilla.redhat.com/1934421
>>
>>
>>
>> Thanks,
>>
>>
>> DHB
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Advice needed on SMP regression after cpu_core_mask change
  2021-03-17 16:05   ` Daniel Henrique Barboza
@ 2021-03-22  3:16     ` David Gibson
  0 siblings, 0 replies; 6+ messages in thread
From: David Gibson @ 2021-03-22  3:16 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: aneesh.kumar, Srikar Dronamraju, Greg Kurz,
	Cédric Le Goater, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 7169 bytes --]

On Wed, Mar 17, 2021 at 01:05:21PM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 3/17/21 12:30 PM, Cédric Le Goater wrote:
> > On 3/17/21 2:00 PM, Daniel Henrique Barboza wrote:
> > > Hello,
> > > 
> > > Patch 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") introduced
> > > a regression in both upstream and RHEL downstream kernels [1]. The assumption made
> > > in the commit:
> > > 
> > > "Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be
> > > equal on Power"
> > > 
> > > Doesn't seem to be true. After this commit, QEMU is now unable to set single NUMA
> > > node SMP topologies such as:
> > > 
> > > -smp 8,maxcpus=8,cores=2,threads=2,sockets=2
> > > 
> > > lscpu will give the following output in this case:
> > > 
> > > # lscpu
> > > Architecture:        ppc64le
> > > Byte Order:          Little Endian
> > > CPU(s):              8
> > > On-line CPU(s) list: 0-7
> > > Thread(s) per core:  2
> > > Core(s) per socket:  4
> > > Socket(s):           1
> > > NUMA node(s):        1
> > > Model:               2.2 (pvr 004e 1202)
> > > Model name:          POWER9 (architected), altivec supported
> > > Hypervisor vendor:   KVM
> > > Virtualization type: para
> > > L1d cache:           32K
> > > L1i cache:           32K
> > > NUMA node0 CPU(s):   0-7
> > > 
> > > 
> > > This is happening because the macro cpu_cpu_mask(cpu) expands to
> > > cpumask_of_node(cpu_to_node(cpu)), which in turn expands to node_to_cpumask_map[node].
> > > node_to_cpumask_map is a NUMA array that maps CPUs to NUMA nodes (Aneesh is on CC to
> > > correct me if I'm wrong). We're now associating sockets to NUMA nodes directly.
> > > 
> > > If I add a second NUMA node then I can get the intended smp topology:
> > > 
> > > -smp 8,maxcpus=8,cores=2,threads=2,sockets=2
> > > -numa node,memdev=mem0,cpus=0-3,nodeid=0 \
> > > -numa node,memdev=mem1,cpus=4-7,nodeid=1 \
> > > 
> > > # lscpu
> > > Architecture:        ppc64le
> > > Byte Order:          Little Endian
> > > CPU(s):              8
> > > On-line CPU(s) list: 0-7
> > > Thread(s) per core:  2
> > > Core(s) per socket:  2
> > > Socket(s):           2
> > > NUMA node(s):        2
> > > Model:               2.2 (pvr 004e 1202)
> > > Model name:          POWER9 (architected), altivec supported
> > > Hypervisor vendor:   KVM
> > > Virtualization type: para
> > > L1d cache:           32K
> > > L1i cache:           32K
> > > NUMA node0 CPU(s):   0-3
> > > NUMA node1 CPU(s):   4-7
> > > 
> > > 
> > > However, if I try a single socket with multiple NUMA nodes topology, which is the case
> > > of Power10, e.g.:
> > > 
> > > 
> > > -smp 8,maxcpus=8,cores=4,threads=2,sockets=1
> > > -numa node,memdev=mem0,cpus=0-3,nodeid=0 \
> > > -numa node,memdev=mem1,cpus=4-7,nodeid=1 \
> > > 
> > > 
> > > This is the result:
> > > 
> > > # lscpu
> > > Architecture:        ppc64le
> > > Byte Order:          Little Endian
> > > CPU(s):              8
> > > On-line CPU(s) list: 0-7
> > > Thread(s) per core:  2
> > > Core(s) per socket:  2
> > > Socket(s):           2
> > > NUMA node(s):        2
> > > Model:               2.2 (pvr 004e 1202)
> > > Model name:          POWER9 (architected), altivec supported
> > > Hypervisor vendor:   KVM
> > > Virtualization type: para
> > > L1d cache:           32K
> > > L1i cache:           32K
> > > NUMA node0 CPU(s):   0-3
> > > NUMA node1 CPU(s):   4-7
> > > 
> > > 
> > > This confirms my suspicions that, at this moment, we're making sockets == NUMA nodes.
> > 
> > Yes. I don't think we can do better on PAPR and the above examples
> > seem to confirm that the "sockets" definition is simply ignored.
> > > Cedric, the reason I'm CCing you is because this is related to ibm,chip-id. The commit
> > > after the one that caused the regression, 4ca234a9cbd7c3a65 ("powerpc/smp: Stop updating
> > > cpu_core_mask"), is erasing the code that calculated cpu_core_mask. cpu_core_mask, despite
> > > its shortcomings that caused its removal, was giving a precise SMP topology. And it was
> > > using physical_package_id/'ibm,chip-id' for that.
> > 
> > ibm,chip-id is a no-no on pSeries. I guess this is inherent to PAPR which
> > is hiding a lot of the underlying HW and topology. May be we are trying
> > to reconcile two orthogonal views of machine virtualization ...
> > 
> > > Checking in QEMU I can say that the ibm,chip-id calculation is the only place in the code
> > > that cares about cores per socket information. The kernel is now ignoring that, starting
> > > on 4bce545903fa, and now QEMU is unable to provide this info to the guest.
> > > 
> > > If we're not going to use ibm,chip-id any longer, which seems sensible given that PAPR does
> > > not declare it, we need another way of letting the guest know how much cores per socket
> > > we want.
> > The RTAS call "ibm,get-system-parameter" with token "Processor Module
> > Information" returns that kind of information :
> > 
> >    2 byte binary number (N) of module types followed by N module specifiers of the form:
> >    2 byte binary number (M) of sockets of this module type
> >    2 byte binary number (L) of chips per this module type
> >    2 byte binary number (K) of cores per chip in this module type.
> > 
> > See the values in these sysfs files :
> > 
> >    cat /sys/devices/hv_24x7/interface/{sockets,chipspersocket,coresperchip}
> > 
> > But I am afraid these are host level information and not guest/LPAR.
> 
> 
> I believe there might be some sort of reasoning behind not having this on
> PAPR, but I'll say in advance that the virtual machine should act as the
> real hardware, as close as possible. This is the kind of hcall that could
> be used in this situation.

In the case of POWER, that's pretty much a lost battle.  The
virtualization features of the CPU don't really permit full hardware
virtualization - it has to be a paravirtualized environment.  Once
that's the case, the value of keeping secondary things the same
between the bare metal and paravirt environments isn't that compelling
any more.

> > I didn't find any LPAR level properties or hcalls in the PAPR document.
> > They need to be specified.
> > 
> > or
> > 
> > We can add extra properties like ibm,chip-id but making sure it's only
> > used under the KVM hypervisor. My understanding is that's something we
> > are trying to avoid.
> 
> We can change PAPR to add ibm,chip-id. Problem is that ibm,chip-id today, with
> the current kernel codebase, does not fix the issue because the code is
> ignoring it hehehe
> 
> 
> If we're going to change PAPR -  and I believe we should, there's a clear
> lack of proper support for SMP topologies - we're better make sure that whatever
> attribute/hcall we add there fixes it in a robust way for the long term.
> 
> 
> Thanks,
> 
> 
> DHB
> 
> 
> > 
> > C.
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-03-22  4:12 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-17 13:00 Advice needed on SMP regression after cpu_core_mask change Daniel Henrique Barboza
2021-03-17 15:30 ` Cédric Le Goater
2021-03-17 16:05   ` Daniel Henrique Barboza
2021-03-22  3:16     ` David Gibson
2021-03-18 13:42 ` Srikar Dronamraju
2021-03-18 15:36   ` Daniel Henrique Barboza

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).