qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: "Cédric Le Goater" <clg@kaod.org>
To: Daniel Henrique Barboza <danielhb413@gmail.com>,
	David Gibson <david@gibson.dropbear.id.au>
Cc: Laurent Vivier <lvivier@redhat.com>,
	Thomas Huth <thuth@redhat.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	qemu-devel@nongnu.org, groug@kaod.org, qemu-ppc@nongnu.org,
	Igor Mammedov <imammedo@redhat.com>
Subject: Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
Date: Wed, 31 Mar 2021 19:40:40 +0200	[thread overview]
Message-ID: <05f3120c-84f8-7711-d2a5-9b29b0f3f746@kaod.org> (raw)
In-Reply-To: <61876812-c915-6489-3058-b463967b0679@gmail.com>

On 3/31/21 7:29 PM, Daniel Henrique Barboza wrote:
> 
> On 3/31/21 12:18 PM, Cédric Le Goater wrote:
>> On 3/31/21 2:57 AM, David Gibson wrote:
>>> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
>>>>
>>>>
>>>> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
>>>>> On 3/29/21 6:20 AM, David Gibson wrote:
>>>>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
>>>>>>> On 3/25/21 3:10 AM, David Gibson wrote:
>>>>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>>>>>>>>> topology_core_cpumask") cause a regression in the pseries machine when
>>>>>>>>>>> defining certain SMP topologies [1]. The reasoning behind the change is
>>>>>>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
>>>>>>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with
>>>>>>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as
>>>>>>>>>>> far as the kernel understanding of SMP topologies goes, both masks are
>>>>>>>>>>> equivalent.
>>>>>>>>>>>
>>>>>>>>>>> Further discussions in the kernel mailing list [2] shown that the
>>>>>>>>>>> powerpc kernel always considered that the number of sockets were equal
>>>>>>>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense,
>>>>>>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The
>>>>>>>>>>> immediate conclusion is that all SMP topologies the pseries machine were
>>>>>>>>>>> supplying to the kernel, with more than one socket in the same NUMA node
>>>>>>>>>>> as in [1], happened to be correctly represented in the kernel by
>>>>>>>>>>> accident during all these years.
>>>>>>>>>>>
>>>>>>>>>>> There's a case to be made for virtual topologies being detached from
>>>>>>>>>>> hardware constraints, allowing maximum flexibility to users. At the same
>>>>>>>>>>> time, this freedom can't result in unrealistic hardware representations
>>>>>>>>>>> being emulated. If the real hardware and the pseries kernel don't
>>>>>>>>>>> support multiple chips/sockets in the same NUMA node, neither should we.
>>>>>>>>>>>
>>>>>>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the
>>>>>>>>>>> pseries machine. qtest changes were made to adapt to this new
>>>>>>>>>>> condition.
>>>>>>>>>>
>>>>>>>>>> Oof.  I really don't like this idea.  It means a bunch of fiddly work
>>>>>>>>>> for users to match these up, for no real gain.  I'm also concerned
>>>>>>>>>> that this will require follow on changes in libvirt to not make this a
>>>>>>>>>> really cryptic and irritating point of failure.
>>>>>>>>>
>>>>>>>>> Haven't though about required Libvirt changes, although I can say that there
>>>>>>>>> will be some amount to be mande and it will probably annoy existing users
>>>>>>>>> (everyone that has a multiple socket per NUMA node topology).
>>>>>>>>>
>>>>>>>>> There is not much we can do from the QEMU layer aside from what I've proposed
>>>>>>>>> here. The other alternative is to keep interacting with the kernel folks to
>>>>>>>>> see if there is a way to keep our use case untouched.
>>>>>>>>
>>>>>>>> Right.  Well.. not necessarily untouched, but I'm hoping for more
>>>>>>>> replies from Cédric to my objections and mpe's.  Even with sockets
>>>>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to
>>>>>>>> NUMA nodes makes sense.
>>>>>>>
>>>>>>> I did a couple of replies in different email threads but maybe not
>>>>>>> to all. I felt it was going nowhere :/ Couple of thoughts,
>>>>>>
>>>>>> I think I saw some of those, but maybe not all.
>>>>>>
>>>>>>> Shouldn't we get rid of the socket concept, die also, under pseries
>>>>>>> since they don't exist under PAPR ? We only have numa nodes, cores,
>>>>>>> threads AFAICT.
>>>>>>
>>>>>> Theoretically, yes.  I'm not sure it's really practical, though, since
>>>>>> AFAICT, both qemu and the kernel have the notion of sockets (though
>>>>>> not dies) built into generic code.
>>>>>
>>>>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
>>>>> and PPC Linux only has a NUMA node id, on pseries and powernv.
>>>>>
>>>>>> It does mean that one possible approach here - maybe the best one - is
>>>>>> to simply declare that sockets are meaningless under, so we simply
>>>>>> don't expect what the guest kernel reports to match what's given to
>>>>>> qemu.
>>>>>>
>>>>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
>>>>>> but it is likely to surprise and confuse people.
>>>>>>
>>>>>>> Should we diverged from PAPR and add extra DT properties "qemu,..." ?
>>>>>>> There are a couple of places where Linux checks for the underlying
>>>>>>> hypervisor already.
>>>>>>>
>>>>>>>>> This also means that
>>>>>>>>> 'ibm,chip-id' will probably remain in use since it's the only place where
>>>>>>>>> we inform cores per socket information to the kernel.
>>>>>>>>
>>>>>>>> Well.. unless we can find some other sensible way to convey that
>>>>>>>> information.  I haven't given up hope for that yet.
>>>>>>>
>>>>>>> Well, we could start by fixing the value in QEMU. It is broken
>>>>>>> today.
>>>>>>
>>>>>> Fixing what value, exactly?
>>>>>
>>>>> The value of the "ibm,chip-id" since we are keeping the property under
>>>>> QEMU.
>>>>
>>>> David, I believe this has to do with the discussing we had last Friday.
>>>>
>>>> I mentioned that the ibm,chip-id property is being calculated in a way that
>>>> promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes,
>>>> e.g.:
>>>>
>>>> -smp 4,cores=4,maxcpus=8,threads=1 \
>>>> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
>>>> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>>>
>>>>
>>>> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
>>>>             ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
>>>>             ibm,pft-size = <0x00 0x19>;
>>>>             ibm,chip-id = <0x00>;
>>>> -- 
>>>>                        ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
>>>>                        ibm,pft-size = <0x00 0x19>;
>>>>                        ibm,chip-id = <0x00>;
>>>> -- 
>>>>                        ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
>>>>                        ibm,pft-size = <0x00 0x19>;
>>>>                        ibm,chip-id = <0x00>;
>>>> -- 
>>>>                        ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
>>>>                        ibm,pft-size = <0x00 0x19>;
>>>>                        ibm,chip-id = <0x00>;
>>>
>>>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
>>>> different NUMA node than 0-1. This would mean that the same socket
>>>> would belong to different NUMA nodes at the same time.
>>>
>>> Right... and I'm still not seeing why that's a problem.  AFAICT that's
>>> a possible, if unexpected, situation under real hardware - though
>>> maybe not for POWER9 specifically.
>> The ibm,chip-id property does not exist under PAPR. PAPR only has
>> NUMA nodes, no sockets nor chips.
>>
>> And the property value is simply broken under QEMU. Try this  :
>>
>>     -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>
>> # dmesg | grep numa
>> [    0.013106] numa: Node 0 CPUs: 0-1
>> [    0.013136] numa: Node 1 CPUs: 2-3
>>
>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>>         ibm,chip-id = <0x01>;
>>         ibm,chip-id = <0x02>;
>>         ibm,chip-id = <0x00>;
>>         ibm,chip-id = <0x03>;
> 
> These values are not wrong. When you do:
> 
> -smp 4,cores=1,maxcpus=8 (....)
> 
> You didn't fill threads and sockets. QEMU default is to prioritize sockets
> to fill the missing information, up to the maxcpus value. This means that what
> you did is equivalent to:
> 
> -smp 4,threads=1,cores=1,sockets=8,maxcpus=8 (....)
> 
> 
> It's a 1 thread/core, 1 core/socket with 8 sockets config. Each possible CPU
> will sit in its own core, having its own ibm,chip-id. So:
> 
> "-numa node,nodeid=0,cpus=0-1"
> 
> is in fact allocating sockets 0 and 1 to NUMA node 0.

ok. 
Thanks,

C.


  reply	other threads:[~2021-03-31 17:43 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-19 18:34 [PATCH 0/2] pseries: SMP sockets must match NUMA nodes Daniel Henrique Barboza
2021-03-19 18:34 ` [PATCH 1/2] spapr: number of SMP sockets must be equal to " Daniel Henrique Barboza
2021-03-23  1:03   ` David Gibson
2021-03-23 17:21     ` Daniel Henrique Barboza
2021-03-25  2:10       ` David Gibson
2021-03-25  8:56         ` Cédric Le Goater
2021-03-25 10:15           ` Daniel Henrique Barboza
2021-03-29  4:20           ` David Gibson
2021-03-29 15:32             ` Cédric Le Goater
2021-03-29 18:32               ` Daniel Henrique Barboza
2021-03-29 23:55                 ` Igor Mammedov
2021-03-31  0:57                 ` David Gibson
2021-03-31  4:58                   ` Michael Ellerman
2021-03-31 15:22                     ` Cédric Le Goater
2021-04-01  2:53                       ` David Gibson
2021-03-31 15:18                   ` Cédric Le Goater
2021-03-31 17:29                     ` Daniel Henrique Barboza
2021-03-31 17:40                       ` Cédric Le Goater [this message]
2021-04-01  2:59                     ` David Gibson
2021-04-01  9:21                       ` Cédric Le Goater
2021-03-29 23:51     ` Igor Mammedov
2021-03-30 21:33       ` Daniel Henrique Barboza
2021-03-19 18:34 ` [PATCH 2/2] spapr.c: remove 'ibm,chip-id' from DT Daniel Henrique Barboza

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=05f3120c-84f8-7711-d2a5-9b29b0f3f746@kaod.org \
    --to=clg@kaod.org \
    --cc=danielhb413@gmail.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=groug@kaod.org \
    --cc=imammedo@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=mpe@ellerman.id.au \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-ppc@nongnu.org \
    --cc=srikar@linux.vnet.ibm.com \
    --cc=thuth@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).