qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Daniel Henrique Barboza <danielhb413@gmail.com>
Cc: "Laurent Vivier" <lvivier@redhat.com>,
	"Thomas Huth" <thuth@redhat.com>,
	"Srikar Dronamraju" <srikar@linux.vnet.ibm.com>,
	"Michael Ellerman" <mpe@ellerman.id.au>,
	groug@kaod.org, qemu-devel@nongnu.org, qemu-ppc@nongnu.org,
	"Cédric Le Goater" <clg@kaod.org>,
	"Igor Mammedov" <imammedo@redhat.com>
Subject: Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
Date: Wed, 31 Mar 2021 11:57:10 +1100	[thread overview]
Message-ID: <YGPI5vgoI8JDO1HN@yekko.fritz.box> (raw)
In-Reply-To: <91e406bf-c9c6-0734-1f69-081d3633332b@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 7676 bytes --]

On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
> > On 3/29/21 6:20 AM, David Gibson wrote:
> > > On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
> > > > On 3/25/21 3:10 AM, David Gibson wrote:
> > > > > On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
> > > > > > 
> > > > > > 
> > > > > > On 3/22/21 10:03 PM, David Gibson wrote:
> > > > > > > On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
> > > > > > > > Kernel commit 4bce545903fa ("powerpc/topology: Update
> > > > > > > > topology_core_cpumask") cause a regression in the pseries machine when
> > > > > > > > defining certain SMP topologies [1]. The reasoning behind the change is
> > > > > > > > explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
> > > > > > > > cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with
> > > > > > > > large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as
> > > > > > > > far as the kernel understanding of SMP topologies goes, both masks are
> > > > > > > > equivalent.
> > > > > > > > 
> > > > > > > > Further discussions in the kernel mailing list [2] shown that the
> > > > > > > > powerpc kernel always considered that the number of sockets were equal
> > > > > > > > to the number of NUMA nodes. The claim is that it doesn't make sense,
> > > > > > > > for Power hardware at least, 2+ sockets being in the same NUMA node. The
> > > > > > > > immediate conclusion is that all SMP topologies the pseries machine were
> > > > > > > > supplying to the kernel, with more than one socket in the same NUMA node
> > > > > > > > as in [1], happened to be correctly represented in the kernel by
> > > > > > > > accident during all these years.
> > > > > > > > 
> > > > > > > > There's a case to be made for virtual topologies being detached from
> > > > > > > > hardware constraints, allowing maximum flexibility to users. At the same
> > > > > > > > time, this freedom can't result in unrealistic hardware representations
> > > > > > > > being emulated. If the real hardware and the pseries kernel don't
> > > > > > > > support multiple chips/sockets in the same NUMA node, neither should we.
> > > > > > > > 
> > > > > > > > Starting in 6.0.0, all sockets must match an unique NUMA node in the
> > > > > > > > pseries machine. qtest changes were made to adapt to this new
> > > > > > > > condition.
> > > > > > > 
> > > > > > > Oof.  I really don't like this idea.  It means a bunch of fiddly work
> > > > > > > for users to match these up, for no real gain.  I'm also concerned
> > > > > > > that this will require follow on changes in libvirt to not make this a
> > > > > > > really cryptic and irritating point of failure.
> > > > > > 
> > > > > > Haven't though about required Libvirt changes, although I can say that there
> > > > > > will be some amount to be mande and it will probably annoy existing users
> > > > > > (everyone that has a multiple socket per NUMA node topology).
> > > > > > 
> > > > > > There is not much we can do from the QEMU layer aside from what I've proposed
> > > > > > here. The other alternative is to keep interacting with the kernel folks to
> > > > > > see if there is a way to keep our use case untouched.
> > > > > 
> > > > > Right.  Well.. not necessarily untouched, but I'm hoping for more
> > > > > replies from Cédric to my objections and mpe's.  Even with sockets
> > > > > being a kinda meaningless concept in PAPR, I don't think tying it to
> > > > > NUMA nodes makes sense.
> > > > 
> > > > I did a couple of replies in different email threads but maybe not
> > > > to all. I felt it was going nowhere :/ Couple of thoughts,
> > > 
> > > I think I saw some of those, but maybe not all.
> > > 
> > > > Shouldn't we get rid of the socket concept, die also, under pseries
> > > > since they don't exist under PAPR ? We only have numa nodes, cores,
> > > > threads AFAICT.
> > > 
> > > Theoretically, yes.  I'm not sure it's really practical, though, since
> > > AFAICT, both qemu and the kernel have the notion of sockets (though
> > > not dies) built into generic code.
> > 
> > Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
> > and PPC Linux only has a NUMA node id, on pseries and powernv.
> > 
> > > It does mean that one possible approach here - maybe the best one - is
> > > to simply declare that sockets are meaningless under, so we simply
> > > don't expect what the guest kernel reports to match what's given to
> > > qemu.
> > > 
> > > It'd be nice to avoid that if we can: in a sense it's just cosmetic,
> > > but it is likely to surprise and confuse people.
> > > 
> > > > Should we diverged from PAPR and add extra DT properties "qemu,..." ?
> > > > There are a couple of places where Linux checks for the underlying
> > > > hypervisor already.
> > > > 
> > > > > > This also means that
> > > > > > 'ibm,chip-id' will probably remain in use since it's the only place where
> > > > > > we inform cores per socket information to the kernel.
> > > > > 
> > > > > Well.. unless we can find some other sensible way to convey that
> > > > > information.  I haven't given up hope for that yet.
> > > > 
> > > > Well, we could start by fixing the value in QEMU. It is broken
> > > > today.
> > > 
> > > Fixing what value, exactly?
> > 
> > The value of the "ibm,chip-id" since we are keeping the property under
> > QEMU.
> 
> David, I believe this has to do with the discussing we had last Friday.
> 
> I mentioned that the ibm,chip-id property is being calculated in a way that
> promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes,
> e.g.:
> 
> -smp 4,cores=4,maxcpus=8,threads=1 \
> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> 
> 
> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
> 			ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
> 			ibm,pft-size = <0x00 0x19>;
> 			ibm,chip-id = <0x00>;
> --
>                       ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
>                       ibm,pft-size = <0x00 0x19>;
>                       ibm,chip-id = <0x00>;
> --
>                       ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
>                       ibm,pft-size = <0x00 0x19>;
>                       ibm,chip-id = <0x00>;
> --
>                       ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
>                       ibm,pft-size = <0x00 0x19>;
>                       ibm,chip-id = <0x00>;

> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
> different NUMA node than 0-1. This would mean that the same socket
> would belong to different NUMA nodes at the same time.

Right... and I'm still not seeing why that's a problem.  AFAICT that's
a possible, if unexpected, situation under real hardware - though
maybe not for POWER9 specifically.

> I believe this is what Cedric wants to be addressed. Given that the
> property is called after the OPAL property ibm,chip-id, the kernel
> expects that the property will have the same semantics as in OPAL.

Even on powernv, I'm not clear why chip-id is tied into the NUMA
configuration, rather than getting all the NUMA info from
associativity properties.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  parent reply	other threads:[~2021-03-31  1:05 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-19 18:34 [PATCH 0/2] pseries: SMP sockets must match NUMA nodes Daniel Henrique Barboza
2021-03-19 18:34 ` [PATCH 1/2] spapr: number of SMP sockets must be equal to " Daniel Henrique Barboza
2021-03-23  1:03   ` David Gibson
2021-03-23 17:21     ` Daniel Henrique Barboza
2021-03-25  2:10       ` David Gibson
2021-03-25  8:56         ` Cédric Le Goater
2021-03-25 10:15           ` Daniel Henrique Barboza
2021-03-29  4:20           ` David Gibson
2021-03-29 15:32             ` Cédric Le Goater
2021-03-29 18:32               ` Daniel Henrique Barboza
2021-03-29 23:55                 ` Igor Mammedov
2021-03-31  0:57                 ` David Gibson [this message]
2021-03-31  4:58                   ` Michael Ellerman
2021-03-31 15:22                     ` Cédric Le Goater
2021-04-01  2:53                       ` David Gibson
2021-03-31 15:18                   ` Cédric Le Goater
2021-03-31 17:29                     ` Daniel Henrique Barboza
2021-03-31 17:40                       ` Cédric Le Goater
2021-04-01  2:59                     ` David Gibson
2021-04-01  9:21                       ` Cédric Le Goater
2021-03-29 23:51     ` Igor Mammedov
2021-03-30 21:33       ` Daniel Henrique Barboza
2021-03-19 18:34 ` [PATCH 2/2] spapr.c: remove 'ibm,chip-id' from DT Daniel Henrique Barboza

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YGPI5vgoI8JDO1HN@yekko.fritz.box \
    --to=david@gibson.dropbear.id.au \
    --cc=clg@kaod.org \
    --cc=danielhb413@gmail.com \
    --cc=groug@kaod.org \
    --cc=imammedo@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=mpe@ellerman.id.au \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-ppc@nongnu.org \
    --cc=srikar@linux.vnet.ibm.com \
    --cc=thuth@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).