qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Daniel Henrique Barboza <danielhb413@gmail.com>
To: "Cédric Le Goater" <clg@kaod.org>,
	"David Gibson" <david@gibson.dropbear.id.au>
Cc: Laurent Vivier <lvivier@redhat.com>,
	Thomas Huth <thuth@redhat.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	qemu-devel@nongnu.org, groug@kaod.org, qemu-ppc@nongnu.org,
	Igor Mammedov <imammedo@redhat.com>
Subject: Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
Date: Thu, 25 Mar 2021 07:15:01 -0300	[thread overview]
Message-ID: <8554397d-9449-bf31-e52c-702eb88b9e6b@gmail.com> (raw)
In-Reply-To: <d13d3c70-6f12-713e-6995-070292cb30c6@kaod.org>



On 3/25/21 5:56 AM, Cédric Le Goater wrote:
> On 3/25/21 3:10 AM, David Gibson wrote:
>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>>
>>>
>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>>> topology_core_cpumask") cause a regression in the pseries machine when
>>>>> defining certain SMP topologies [1]. The reasoning behind the change is
>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with
>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as
>>>>> far as the kernel understanding of SMP topologies goes, both masks are
>>>>> equivalent.
>>>>>
>>>>> Further discussions in the kernel mailing list [2] shown that the
>>>>> powerpc kernel always considered that the number of sockets were equal
>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense,
>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The
>>>>> immediate conclusion is that all SMP topologies the pseries machine were
>>>>> supplying to the kernel, with more than one socket in the same NUMA node
>>>>> as in [1], happened to be correctly represented in the kernel by
>>>>> accident during all these years.
>>>>>
>>>>> There's a case to be made for virtual topologies being detached from
>>>>> hardware constraints, allowing maximum flexibility to users. At the same
>>>>> time, this freedom can't result in unrealistic hardware representations
>>>>> being emulated. If the real hardware and the pseries kernel don't
>>>>> support multiple chips/sockets in the same NUMA node, neither should we.
>>>>>
>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the
>>>>> pseries machine. qtest changes were made to adapt to this new
>>>>> condition.
>>>>
>>>> Oof.  I really don't like this idea.  It means a bunch of fiddly work
>>>> for users to match these up, for no real gain.  I'm also concerned
>>>> that this will require follow on changes in libvirt to not make this a
>>>> really cryptic and irritating point of failure.
>>>
>>> Haven't though about required Libvirt changes, although I can say that there
>>> will be some amount to be mande and it will probably annoy existing users
>>> (everyone that has a multiple socket per NUMA node topology).
>>>
>>> There is not much we can do from the QEMU layer aside from what I've proposed
>>> here. The other alternative is to keep interacting with the kernel folks to
>>> see if there is a way to keep our use case untouched.
>>
>> Right.  Well.. not necessarily untouched, but I'm hoping for more
>> replies from Cédric to my objections and mpe's.  Even with sockets
>> being a kinda meaningless concept in PAPR, I don't think tying it to
>> NUMA nodes makes sense.
> 
> I did a couple of replies in different email threads but maybe not
> to all. I felt it was going nowhere :/ Couple of thoughts,
> 
> Shouldn't we get rid of the socket concept, die also, under pseries
> since they don't exist under PAPR ? We only have numa nodes, cores,
> threads AFAICT.

I don't think we work with 'die'.

Getting rid of the 'socket' representation is sensible regarding PAPR,
but the effect for pseries will be similar to what this patch is already
doing: users could have multiple sockets in the same NUMA node, and then
they won't. Either because we got rid of the 'socket' representation or
because socket == NUMA node.

> 
> Should we diverged from PAPR and add extra DT properties "qemu,..." ?
> There are a couple of places where Linux checks for the underlying
> hypervisor already.
> 
>>> This also means that
>>> 'ibm,chip-id' will probably remain in use since it's the only place where
>>> we inform cores per socket information to the kernel.
>>
>> Well.. unless we can find some other sensible way to convey that
>> information.  I haven't given up hope for that yet.
> 
> Well, we could start by fixing the value in QEMU. It is broken today.


I'll look into it. It makes more sense to talk about keeping it when
it's working properly.



DHB


> 
> 
> This is all coming from some work we did last year to evaluate our HW
> (mostly for XIVE) on 2s, 4s, 16s systems on baremetal, KVM and PowerVM.
> We saw some real problems because Linux did not have a clear view of the
> topology. See the figures here :
> 
> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210303174857.1760393-9-clg@kaod.org/
> 
> The node id is a key parameter for system resource management, memory
> allocation, interrupt affinity, etc. Linux scales much better if used
> correctly.
> 
> C.
> 


  reply	other threads:[~2021-03-25 10:16 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-19 18:34 [PATCH 0/2] pseries: SMP sockets must match NUMA nodes Daniel Henrique Barboza
2021-03-19 18:34 ` [PATCH 1/2] spapr: number of SMP sockets must be equal to " Daniel Henrique Barboza
2021-03-23  1:03   ` David Gibson
2021-03-23 17:21     ` Daniel Henrique Barboza
2021-03-25  2:10       ` David Gibson
2021-03-25  8:56         ` Cédric Le Goater
2021-03-25 10:15           ` Daniel Henrique Barboza [this message]
2021-03-29  4:20           ` David Gibson
2021-03-29 15:32             ` Cédric Le Goater
2021-03-29 18:32               ` Daniel Henrique Barboza
2021-03-29 23:55                 ` Igor Mammedov
2021-03-31  0:57                 ` David Gibson
2021-03-31  4:58                   ` Michael Ellerman
2021-03-31 15:22                     ` Cédric Le Goater
2021-04-01  2:53                       ` David Gibson
2021-03-31 15:18                   ` Cédric Le Goater
2021-03-31 17:29                     ` Daniel Henrique Barboza
2021-03-31 17:40                       ` Cédric Le Goater
2021-04-01  2:59                     ` David Gibson
2021-04-01  9:21                       ` Cédric Le Goater
2021-03-29 23:51     ` Igor Mammedov
2021-03-30 21:33       ` Daniel Henrique Barboza
2021-03-19 18:34 ` [PATCH 2/2] spapr.c: remove 'ibm,chip-id' from DT Daniel Henrique Barboza

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8554397d-9449-bf31-e52c-702eb88b9e6b@gmail.com \
    --to=danielhb413@gmail.com \
    --cc=clg@kaod.org \
    --cc=david@gibson.dropbear.id.au \
    --cc=groug@kaod.org \
    --cc=imammedo@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=mpe@ellerman.id.au \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-ppc@nongnu.org \
    --cc=srikar@linux.vnet.ibm.com \
    --cc=thuth@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).