qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: "Cédric Le Goater" <clg@kaod.org>
To: David Gibson <david@gibson.dropbear.id.au>,
	Daniel Henrique Barboza <danielhb413@gmail.com>
Cc: Laurent Vivier <lvivier@redhat.com>,
	Thomas Huth <thuth@redhat.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	qemu-devel@nongnu.org, groug@kaod.org, qemu-ppc@nongnu.org,
	Igor Mammedov <imammedo@redhat.com>
Subject: Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
Date: Thu, 25 Mar 2021 09:56:04 +0100	[thread overview]
Message-ID: <d13d3c70-6f12-713e-6995-070292cb30c6@kaod.org> (raw)
In-Reply-To: <YFvxAW3l4t+YznEm@yekko.fritz.box>

On 3/25/21 3:10 AM, David Gibson wrote:
> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>
>>
>> On 3/22/21 10:03 PM, David Gibson wrote:
>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>> topology_core_cpumask") cause a regression in the pseries machine when
>>>> defining certain SMP topologies [1]. The reasoning behind the change is
>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with
>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as
>>>> far as the kernel understanding of SMP topologies goes, both masks are
>>>> equivalent.
>>>>
>>>> Further discussions in the kernel mailing list [2] shown that the
>>>> powerpc kernel always considered that the number of sockets were equal
>>>> to the number of NUMA nodes. The claim is that it doesn't make sense,
>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The
>>>> immediate conclusion is that all SMP topologies the pseries machine were
>>>> supplying to the kernel, with more than one socket in the same NUMA node
>>>> as in [1], happened to be correctly represented in the kernel by
>>>> accident during all these years.
>>>>
>>>> There's a case to be made for virtual topologies being detached from
>>>> hardware constraints, allowing maximum flexibility to users. At the same
>>>> time, this freedom can't result in unrealistic hardware representations
>>>> being emulated. If the real hardware and the pseries kernel don't
>>>> support multiple chips/sockets in the same NUMA node, neither should we.
>>>>
>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the
>>>> pseries machine. qtest changes were made to adapt to this new
>>>> condition.
>>>
>>> Oof.  I really don't like this idea.  It means a bunch of fiddly work
>>> for users to match these up, for no real gain.  I'm also concerned
>>> that this will require follow on changes in libvirt to not make this a
>>> really cryptic and irritating point of failure.
>>
>> Haven't though about required Libvirt changes, although I can say that there
>> will be some amount to be mande and it will probably annoy existing users
>> (everyone that has a multiple socket per NUMA node topology).
>>
>> There is not much we can do from the QEMU layer aside from what I've proposed
>> here. The other alternative is to keep interacting with the kernel folks to
>> see if there is a way to keep our use case untouched.
> 
> Right.  Well.. not necessarily untouched, but I'm hoping for more
> replies from Cédric to my objections and mpe's.  Even with sockets
> being a kinda meaningless concept in PAPR, I don't think tying it to
> NUMA nodes makes sense.

I did a couple of replies in different email threads but maybe not 
to all. I felt it was going nowhere :/ Couple of thoughts,

Shouldn't we get rid of the socket concept, die also, under pseries 
since they don't exist under PAPR ? We only have numa nodes, cores, 
threads AFAICT.

Should we diverged from PAPR and add extra DT properties "qemu,..." ?
There are a couple of places where Linux checks for the underlying 
hypervisor already.

>> This also means that
>> 'ibm,chip-id' will probably remain in use since it's the only place where
>> we inform cores per socket information to the kernel.
> 
> Well.. unless we can find some other sensible way to convey that
> information.  I haven't given up hope for that yet.

Well, we could start by fixing the value in QEMU. It is broken today.


This is all coming from some work we did last year to evaluate our HW 
(mostly for XIVE) on 2s, 4s, 16s systems on baremetal, KVM and PowerVM. 
We saw some real problems because Linux did not have a clear view of the 
topology. See the figures here : 

http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210303174857.1760393-9-clg@kaod.org/

The node id is a key parameter for system resource management, memory 
allocation, interrupt affinity, etc. Linux scales much better if used
correctly. 

C.


  reply	other threads:[~2021-03-25  8:57 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-19 18:34 [PATCH 0/2] pseries: SMP sockets must match NUMA nodes Daniel Henrique Barboza
2021-03-19 18:34 ` [PATCH 1/2] spapr: number of SMP sockets must be equal to " Daniel Henrique Barboza
2021-03-23  1:03   ` David Gibson
2021-03-23 17:21     ` Daniel Henrique Barboza
2021-03-25  2:10       ` David Gibson
2021-03-25  8:56         ` Cédric Le Goater [this message]
2021-03-25 10:15           ` Daniel Henrique Barboza
2021-03-29  4:20           ` David Gibson
2021-03-29 15:32             ` Cédric Le Goater
2021-03-29 18:32               ` Daniel Henrique Barboza
2021-03-29 23:55                 ` Igor Mammedov
2021-03-31  0:57                 ` David Gibson
2021-03-31  4:58                   ` Michael Ellerman
2021-03-31 15:22                     ` Cédric Le Goater
2021-04-01  2:53                       ` David Gibson
2021-03-31 15:18                   ` Cédric Le Goater
2021-03-31 17:29                     ` Daniel Henrique Barboza
2021-03-31 17:40                       ` Cédric Le Goater
2021-04-01  2:59                     ` David Gibson
2021-04-01  9:21                       ` Cédric Le Goater
2021-03-29 23:51     ` Igor Mammedov
2021-03-30 21:33       ` Daniel Henrique Barboza
2021-03-19 18:34 ` [PATCH 2/2] spapr.c: remove 'ibm,chip-id' from DT Daniel Henrique Barboza

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d13d3c70-6f12-713e-6995-070292cb30c6@kaod.org \
    --to=clg@kaod.org \
    --cc=danielhb413@gmail.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=groug@kaod.org \
    --cc=imammedo@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=mpe@ellerman.id.au \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-ppc@nongnu.org \
    --cc=srikar@linux.vnet.ibm.com \
    --cc=thuth@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).