All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Gibson <dgibson@redhat.com>
To: "Cédric Le Goater" <clg@kaod.org>
Cc: Daniel Henrique Barboza <danielhb@linux.ibm.com>,
	Greg Kurz <groug@kaod.org>,
	QEMU Developers <qemu-devel@nongnu.org>,
	"list@suse.de:PowerPC" <qemu-ppc@nongnu.org>,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
Date: Fri, 12 Mar 2021 12:55:27 +1100	[thread overview]
Message-ID: <20210312125527.61bc269c@yekko.fritz.box> (raw)
In-Reply-To: <92edbc26-4cb5-6e2f-00ff-43a3dca43759@kaod.org>

On Tue, 9 Mar 2021 18:26:35 +0100
Cédric Le Goater <clg@kaod.org> wrote:

> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:
> > 
> > 
> > On 3/9/21 12:33 PM, Cédric Le Goater wrote:  
> >> On 3/8/21 6:13 PM, Greg Kurz wrote:  
> >>> On Wed, 3 Mar 2021 18:48:50 +0100
> >>> Cédric Le Goater <clg@kaod.org> wrote:
> >>>  
> >>>> The 'chip_id' field of the XIVE CPU structure is used to choose a
> >>>> target for a source located on the same chip when possible. This field
> >>>> is assigned on the PowerNV platform using the "ibm,chip-id" property
> >>>> on pSeries under KVM when NUMA nodes are defined but it is undefined  
> >>>
> >>> This sentence seems to have a syntax problem... like it is missing an
> >>> 'and' before 'on pSeries'.  
> >>
> >> ah yes, or simply a comma.
> >>  
> >>>> under PowerVM. The XIVE source structure has a similar field
> >>>> 'src_chip' which is only assigned on the PowerNV platform.
> >>>>
> >>>> cpu_to_node() returns a compatible value on all platforms, 0 being the
> >>>> default node. It will also give us the opportunity to set the affinity
> >>>> of a source on pSeries when we can localize them.
> >>>>  
> >>>
> >>> IIUC this relies on the fact that the NUMA node id is == to chip id
> >>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
> >>> with this change.  
> >>
> >> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
> >> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
> >> in Cc:)  
>  [...]  
> >>
> >> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
> >> the node id. This value is built from the chip id in OPAL, so the
> >> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
> >> property are unlikely to be different.
> >>
> >> cpu_to_node(cpu) is used in many places to allocate the structures
> >> locally to the owning node. XIVE is not an exception (see below in the
> >> same patch), it is better to be consistent and get the same information
> >> (node id) using the same routine.
> >>
> >>
> >> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
> >> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
> >> unifies the controllers of the system to only expose one the OS. This
> >> is problematic and should be changed but it's another topic.
> >>
> >>  
> >>> On the other hand, you have the pSeries case under PowerVM that
> >>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.  
> >>
> >> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
> >> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
> >> chip id.
> >>
> >> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
> >> always correct btw)  
> > 
> > 
> > If you have a way to reliably reproduce this, let me know and I'll fix it
> > up in QEMU.  
> 
> with :
> 
>    -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> 
> # dmesg | grep numa
> [    0.013106] numa: Node 0 CPUs: 0-1
> [    0.013136] numa: Node 1 CPUs: 2-3
> 
> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
> 		ibm,chip-id = <0x01>;
> 		ibm,chip-id = <0x02>;
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x03>;
> 
> with :
> 
>   -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> 
> # dmesg | grep numa
> [    0.013106] numa: Node 0 CPUs: 0-1
> [    0.013136] numa: Node 1 CPUs: 2-3
> 
> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x00>;
> 
> I think we should simply remove "ibm,chip-id" since it's not used and
> not in the PAPR spec.

As I mentioned to Daniel on our call this morning, oddly it *does*
appear to be used in the RHEL kernel, even though that's 4.18 based.
This patch seems to have caused a minor regression; not in the
identification of NUMA nodes, but in the number of sockets shown be
lscpu, etc.  See https://bugzilla.redhat.com/show_bug.cgi?id=1934421
for more information.

Since the value was used by some PAPR kernels - even if they shouldn't
have - I think we should only remove this for newer machine types.  We
also need to check what we're not supplying that the guest kernel is
showing a different number of sockets than specified on the qemu
command line.

> 
> Thanks,
> 
> C.
> 
>  
> 
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
> 


-- 
David Gibson <dgibson@redhat.com>
Principal Software Engineer, Virtualization, Red Hat


WARNING: multiple messages have this Message-ID (diff)
From: David Gibson <dgibson@redhat.com>
To: "Cédric Le Goater" <clg@kaod.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>,
	Daniel Henrique Barboza <danielhb@linux.ibm.com>,
	Greg Kurz <groug@kaod.org>,
	QEMU Developers <qemu-devel@nongnu.org>,
	"list@suse.de:PowerPC" <qemu-ppc@nongnu.org>,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
Date: Fri, 12 Mar 2021 12:55:27 +1100	[thread overview]
Message-ID: <20210312125527.61bc269c@yekko.fritz.box> (raw)
In-Reply-To: <92edbc26-4cb5-6e2f-00ff-43a3dca43759@kaod.org>

On Tue, 9 Mar 2021 18:26:35 +0100
Cédric Le Goater <clg@kaod.org> wrote:

> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:
> > 
> > 
> > On 3/9/21 12:33 PM, Cédric Le Goater wrote:  
> >> On 3/8/21 6:13 PM, Greg Kurz wrote:  
> >>> On Wed, 3 Mar 2021 18:48:50 +0100
> >>> Cédric Le Goater <clg@kaod.org> wrote:
> >>>  
> >>>> The 'chip_id' field of the XIVE CPU structure is used to choose a
> >>>> target for a source located on the same chip when possible. This field
> >>>> is assigned on the PowerNV platform using the "ibm,chip-id" property
> >>>> on pSeries under KVM when NUMA nodes are defined but it is undefined  
> >>>
> >>> This sentence seems to have a syntax problem... like it is missing an
> >>> 'and' before 'on pSeries'.  
> >>
> >> ah yes, or simply a comma.
> >>  
> >>>> under PowerVM. The XIVE source structure has a similar field
> >>>> 'src_chip' which is only assigned on the PowerNV platform.
> >>>>
> >>>> cpu_to_node() returns a compatible value on all platforms, 0 being the
> >>>> default node. It will also give us the opportunity to set the affinity
> >>>> of a source on pSeries when we can localize them.
> >>>>  
> >>>
> >>> IIUC this relies on the fact that the NUMA node id is == to chip id
> >>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
> >>> with this change.  
> >>
> >> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
> >> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
> >> in Cc:)  
>  [...]  
> >>
> >> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
> >> the node id. This value is built from the chip id in OPAL, so the
> >> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
> >> property are unlikely to be different.
> >>
> >> cpu_to_node(cpu) is used in many places to allocate the structures
> >> locally to the owning node. XIVE is not an exception (see below in the
> >> same patch), it is better to be consistent and get the same information
> >> (node id) using the same routine.
> >>
> >>
> >> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
> >> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
> >> unifies the controllers of the system to only expose one the OS. This
> >> is problematic and should be changed but it's another topic.
> >>
> >>  
> >>> On the other hand, you have the pSeries case under PowerVM that
> >>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.  
> >>
> >> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
> >> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
> >> chip id.
> >>
> >> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
> >> always correct btw)  
> > 
> > 
> > If you have a way to reliably reproduce this, let me know and I'll fix it
> > up in QEMU.  
> 
> with :
> 
>    -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> 
> # dmesg | grep numa
> [    0.013106] numa: Node 0 CPUs: 0-1
> [    0.013136] numa: Node 1 CPUs: 2-3
> 
> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
> 		ibm,chip-id = <0x01>;
> 		ibm,chip-id = <0x02>;
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x03>;
> 
> with :
> 
>   -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> 
> # dmesg | grep numa
> [    0.013106] numa: Node 0 CPUs: 0-1
> [    0.013136] numa: Node 1 CPUs: 2-3
> 
> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x00>;
> 
> I think we should simply remove "ibm,chip-id" since it's not used and
> not in the PAPR spec.

As I mentioned to Daniel on our call this morning, oddly it *does*
appear to be used in the RHEL kernel, even though that's 4.18 based.
This patch seems to have caused a minor regression; not in the
identification of NUMA nodes, but in the number of sockets shown be
lscpu, etc.  See https://bugzilla.redhat.com/show_bug.cgi?id=1934421
for more information.

Since the value was used by some PAPR kernels - even if they shouldn't
have - I think we should only remove this for newer machine types.  We
also need to check what we're not supplying that the guest kernel is
showing a different number of sockets than specified on the qemu
command line.

> 
> Thanks,
> 
> C.
> 
>  
> 
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
> 


-- 
David Gibson <dgibson@redhat.com>
Principal Software Engineer, Virtualization, Red Hat



  reply	other threads:[~2021-03-12  1:56 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
2021-03-03 17:48 ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm, chip-id property Cédric Le Goater
2021-03-08 17:13   ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property Greg Kurz
2021-03-09 15:33     ` Cédric Le Goater
2021-03-09 17:08       ` Daniel Henrique Barboza
2021-03-09 17:26         ` Cédric Le Goater
2021-03-09 17:26           ` Cédric Le Goater
2021-03-12  1:55           ` David Gibson [this message]
2021-03-12  1:55             ` David Gibson
2021-03-12  9:53             ` Cédric Le Goater
2021-03-12  9:53               ` Cédric Le Goater
2021-03-12 12:18               ` Daniel Henrique Barboza
2021-03-12 12:18                 ` Daniel Henrique Barboza
2021-03-12 13:03                 ` Greg Kurz
2021-03-12 13:03                   ` Greg Kurz
2021-03-12 13:28                 ` Cédric Le Goater
2021-03-12 13:28                   ` Cédric Le Goater
2021-03-03 17:48 ` [PATCH v2 2/8] powerpc/xive: Introduce an IPI interrupt domain Cédric Le Goater
2021-03-08 17:55   ` Greg Kurz
2021-03-03 17:48 ` [PATCH v2 3/8] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ Cédric Le Goater
2021-03-08 17:56   ` Greg Kurz
2021-03-03 17:48 ` [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show() Cédric Le Goater
2021-03-08 18:07   ` Greg Kurz
2021-03-08 18:11     ` Cédric Le Goater
2021-03-09  9:13       ` Greg Kurz
2021-03-09  9:42         ` Greg Kurz
2021-03-09 15:39           ` Cédric Le Goater
2021-03-03 17:48 ` [PATCH v2 5/8] powerpc/xive: Drop check on irq_data in xive_core_debug_show() Cédric Le Goater
2021-03-09  9:18   ` Greg Kurz
2021-03-03 17:48 ` [PATCH v2 6/8] powerpc/xive: Simplify the dump of XIVE interrupts under xmon Cédric Le Goater
2021-03-09  9:22   ` Greg Kurz
2021-03-03 17:48 ` [PATCH v2 7/8] powerpc/xive: Fix xmon command "dxi" Cédric Le Goater
2021-03-09 10:23   ` Greg Kurz
2021-03-09 15:49     ` Cédric Le Goater
2021-03-03 17:48 ` [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
2021-03-09 13:23   ` Greg Kurz
2021-03-09 15:52     ` Cédric Le Goater
2021-03-30 16:18   ` Cédric Le Goater

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210312125527.61bc269c@yekko.fritz.box \
    --to=dgibson@redhat.com \
    --cc=clg@kaod.org \
    --cc=danielhb@linux.ibm.com \
    --cc=groug@kaod.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-ppc@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.