* [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node @ 2021-03-03 17:48 Cédric Le Goater 2021-03-03 17:48 ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm, chip-id property Cédric Le Goater ` (7 more replies) 0 siblings, 8 replies; 38+ messages in thread From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw) To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater Hello, ipistorm [*] can be used to benchmark the raw interrupt rate of an interrupt controller by measuring the number of IPIs a system can sustain. When applied to the XIVE interrupt controller of POWER9 and POWER10 systems, a significant drop of the interrupt rate can be observed when crossing the second node boundary. This is due to the fact that a single IPI interrupt is used for all CPUs of the system. The structure is shared and the cache line updates impact greatly the traffic between nodes and the overall IPI performance. As a workaround, the impact can be reduced by deactivating the IRQ lockup detector ("noirqdebug") which does a lot of accounting in the Linux IRQ descriptor structure and is responsible for most of the performance penalty. As a fix, this proposal allocates an IPI interrupt per node, to be shared by all CPUs of that node. It solves the scaling issue, the IRQ lockup detector still has an impact but the XIVE interrupt rate scales linearly. It also improves the "noirqdebug" case as showed in the tables below. * P9 DD2.2 - 2s * 64 threads "noirqdebug" Mint/s Mint/s chips cpus IPI/sys IPI/chip IPI/chip IPI/sys -------------------------------------------------------------- 1 0-15 4.984023 4.875405 4.996536 5.048892 0-31 10.879164 10.544040 10.757632 11.037859 0-47 15.345301 14.688764 14.926520 15.310053 0-63 17.064907 17.066812 17.613416 17.874511 2 0-79 11.768764 21.650749 22.689120 22.566508 0-95 10.616812 26.878789 28.434703 28.320324 0-111 10.151693 31.397803 31.771773 32.388122 0-127 9.948502 33.139336 34.875716 35.224548 * P10 DD1 - 4s (not homogeneous) 352 threads "noirqdebug" Mint/s Mint/s chips cpus IPI/sys IPI/chip IPI/chip IPI/sys -------------------------------------------------------------- 1 0-15 2.409402 2.364108 2.383303 2.395091 0-31 6.028325 6.046075 6.089999 6.073750 0-47 8.655178 8.644531 8.712830 8.724702 0-63 11.629652 11.735953 12.088203 12.055979 0-79 14.392321 14.729959 14.986701 14.973073 0-95 12.604158 13.004034 17.528748 17.568095 2 0-111 9.767753 13.719831 19.968606 20.024218 0-127 6.744566 16.418854 22.898066 22.995110 0-143 6.005699 19.174421 25.425622 25.417541 0-159 5.649719 21.938836 27.952662 28.059603 0-175 5.441410 24.109484 31.133915 31.127996 3 0-191 5.318341 24.405322 33.999221 33.775354 0-207 5.191382 26.449769 36.050161 35.867307 0-223 5.102790 29.356943 39.544135 39.508169 0-239 5.035295 31.933051 42.135075 42.071975 0-255 4.969209 34.477367 44.655395 44.757074 4 0-271 4.907652 35.887016 47.080545 47.318537 0-287 4.839581 38.076137 50.464307 50.636219 0-303 4.786031 40.881319 53.478684 53.310759 0-319 4.743750 43.448424 56.388102 55.973969 0-335 4.709936 45.623532 59.400930 58.926857 0-351 4.681413 45.646151 62.035804 61.830057 [*] https://github.com/antonblanchard/ipistorm Thanks, C. Changes in v2: - extra simplification on xmon - fixes on issues reported by the kernel test robot Cédric Le Goater (8): powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property powerpc/xive: Introduce an IPI interrupt domain powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ powerpc/xive: Simplify xive_core_debug_show() powerpc/xive: Drop check on irq_data in xive_core_debug_show() powerpc/xive: Simplify the dump of XIVE interrupts under xmon powerpc/xive: Fix xmon command "dxi" powerpc/xive: Map one IPI interrupt per node arch/powerpc/include/asm/xive.h | 1 + arch/powerpc/sysdev/xive/xive-internal.h | 2 - arch/powerpc/sysdev/xive/common.c | 163 +++++++++++++---------- arch/powerpc/xmon/xmon.c | 28 +--- 4 files changed, 93 insertions(+), 101 deletions(-) -- 2.26.2 ^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm, chip-id property 2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater @ 2021-03-03 17:48 ` Cédric Le Goater 2021-03-08 17:13 ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property Greg Kurz 2021-03-03 17:48 ` [PATCH v2 2/8] powerpc/xive: Introduce an IPI interrupt domain Cédric Le Goater ` (6 subsequent siblings) 7 siblings, 1 reply; 38+ messages in thread From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw) To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater The 'chip_id' field of the XIVE CPU structure is used to choose a target for a source located on the same chip when possible. This field is assigned on the PowerNV platform using the "ibm,chip-id" property on pSeries under KVM when NUMA nodes are defined but it is undefined under PowerVM. The XIVE source structure has a similar field 'src_chip' which is only assigned on the PowerNV platform. cpu_to_node() returns a compatible value on all platforms, 0 being the default node. It will also give us the opportunity to set the affinity of a source on pSeries when we can localize them. Signed-off-by: Cédric Le Goater <clg@kaod.org> --- arch/powerpc/sysdev/xive/common.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index 595310e056f4..b8e456da28aa 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu) xc = per_cpu(xive_cpu, cpu); if (!xc) { - struct device_node *np; - xc = kzalloc_node(sizeof(struct xive_cpu), GFP_KERNEL, cpu_to_node(cpu)); if (!xc) return -ENOMEM; - np = of_get_cpu_node(cpu, NULL); - if (np) - xc->chip_id = of_get_ibm_chip_id(np); - of_node_put(np); + xc->chip_id = cpu_to_node(cpu); xc->hw_ipi = XIVE_BAD_IRQ; per_cpu(xive_cpu, cpu) = xc; -- 2.26.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property 2021-03-03 17:48 ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm, chip-id property Cédric Le Goater @ 2021-03-08 17:13 ` Greg Kurz 2021-03-09 15:33 ` Cédric Le Goater 0 siblings, 1 reply; 38+ messages in thread From: Greg Kurz @ 2021-03-08 17:13 UTC (permalink / raw) To: Cédric Le Goater; +Cc: linuxppc-dev On Wed, 3 Mar 2021 18:48:50 +0100 Cédric Le Goater <clg@kaod.org> wrote: > The 'chip_id' field of the XIVE CPU structure is used to choose a > target for a source located on the same chip when possible. This field > is assigned on the PowerNV platform using the "ibm,chip-id" property > on pSeries under KVM when NUMA nodes are defined but it is undefined This sentence seems to have a syntax problem... like it is missing an 'and' before 'on pSeries'. > under PowerVM. The XIVE source structure has a similar field > 'src_chip' which is only assigned on the PowerNV platform. > > cpu_to_node() returns a compatible value on all platforms, 0 being the > default node. It will also give us the opportunity to set the affinity > of a source on pSeries when we can localize them. > IIUC this relies on the fact that the NUMA node id is == to chip id on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable with this change. On the other hand, you have the pSeries case under PowerVM that doesn't xc->chip_id, which isn't passed to any hcall AFAICT. It looks like the chip id is only used for localization purpose in this case, right ? In this case, what about doing this change for pSeries only, somewhere in spapr.c ? > Signed-off-by: Cédric Le Goater <clg@kaod.org> > --- > arch/powerpc/sysdev/xive/common.c | 7 +------ > 1 file changed, 1 insertion(+), 6 deletions(-) > > diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c > index 595310e056f4..b8e456da28aa 100644 > --- a/arch/powerpc/sysdev/xive/common.c > +++ b/arch/powerpc/sysdev/xive/common.c > @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu) > > xc = per_cpu(xive_cpu, cpu); > if (!xc) { > - struct device_node *np; > - > xc = kzalloc_node(sizeof(struct xive_cpu), > GFP_KERNEL, cpu_to_node(cpu)); > if (!xc) > return -ENOMEM; > - np = of_get_cpu_node(cpu, NULL); > - if (np) > - xc->chip_id = of_get_ibm_chip_id(np); > - of_node_put(np); > + xc->chip_id = cpu_to_node(cpu); > xc->hw_ipi = XIVE_BAD_IRQ; > > per_cpu(xive_cpu, cpu) = xc; ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property 2021-03-08 17:13 ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property Greg Kurz @ 2021-03-09 15:33 ` Cédric Le Goater 2021-03-09 17:08 ` Daniel Henrique Barboza 0 siblings, 1 reply; 38+ messages in thread From: Cédric Le Goater @ 2021-03-09 15:33 UTC (permalink / raw) To: Greg Kurz; +Cc: Daniel Henrique Barboza, linuxppc-dev On 3/8/21 6:13 PM, Greg Kurz wrote: > On Wed, 3 Mar 2021 18:48:50 +0100 > Cédric Le Goater <clg@kaod.org> wrote: > >> The 'chip_id' field of the XIVE CPU structure is used to choose a >> target for a source located on the same chip when possible. This field >> is assigned on the PowerNV platform using the "ibm,chip-id" property >> on pSeries under KVM when NUMA nodes are defined but it is undefined > > This sentence seems to have a syntax problem... like it is missing an > 'and' before 'on pSeries'. ah yes, or simply a comma. >> under PowerVM. The XIVE source structure has a similar field >> 'src_chip' which is only assigned on the PowerNV platform. >> >> cpu_to_node() returns a compatible value on all platforms, 0 being the >> default node. It will also give us the opportunity to set the affinity >> of a source on pSeries when we can localize them. >> > > IIUC this relies on the fact that the NUMA node id is == to chip id > on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable > with this change. Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel in Cc:) On PowerNV, Linux uses "ibm,associativity" property of the CPU to find the node id. This value is built from the chip id in OPAL, so the value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" property are unlikely to be different. cpu_to_node(cpu) is used in many places to allocate the structures locally to the owning node. XIVE is not an exception (see below in the same patch), it is better to be consistent and get the same information (node id) using the same routine. In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot unifies the controllers of the system to only expose one the OS. This is problematic and should be changed but it's another topic. > On the other hand, you have the pSeries case under PowerVM that > doesn't xc->chip_id, which isn't passed to any hcall AFAICT. yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid chip id. QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not always correct btw) > It looks like the chip id is only used for localization purpose in > this case, right ? Yes and PAPR sources are not localized. So it's not used. MSI sources could be if we rewrote the MSI driver. > In this case, what about doing this change for pSeries only, > somewhere in spapr.c ? The IPI code is common to all platforms and all have the same issue. I rather not. Thanks, C. >> Signed-off-by: Cédric Le Goater <clg@kaod.org> >> --- >> arch/powerpc/sysdev/xive/common.c | 7 +------ >> 1 file changed, 1 insertion(+), 6 deletions(-) >> >> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c >> index 595310e056f4..b8e456da28aa 100644 >> --- a/arch/powerpc/sysdev/xive/common.c >> +++ b/arch/powerpc/sysdev/xive/common.c >> @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu) >> >> xc = per_cpu(xive_cpu, cpu); >> if (!xc) { >> - struct device_node *np; >> - >> xc = kzalloc_node(sizeof(struct xive_cpu), >> GFP_KERNEL, cpu_to_node(cpu)); >> if (!xc) >> return -ENOMEM; >> - np = of_get_cpu_node(cpu, NULL); >> - if (np) >> - xc->chip_id = of_get_ibm_chip_id(np); >> - of_node_put(np); >> + xc->chip_id = cpu_to_node(cpu); >> xc->hw_ipi = XIVE_BAD_IRQ; >> >> per_cpu(xive_cpu, cpu) = xc; > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property 2021-03-09 15:33 ` Cédric Le Goater @ 2021-03-09 17:08 ` Daniel Henrique Barboza 2021-03-09 17:26 ` Cédric Le Goater 0 siblings, 1 reply; 38+ messages in thread From: Daniel Henrique Barboza @ 2021-03-09 17:08 UTC (permalink / raw) To: Cédric Le Goater, Greg Kurz; +Cc: linuxppc-dev On 3/9/21 12:33 PM, Cédric Le Goater wrote: > On 3/8/21 6:13 PM, Greg Kurz wrote: >> On Wed, 3 Mar 2021 18:48:50 +0100 >> Cédric Le Goater <clg@kaod.org> wrote: >> >>> The 'chip_id' field of the XIVE CPU structure is used to choose a >>> target for a source located on the same chip when possible. This field >>> is assigned on the PowerNV platform using the "ibm,chip-id" property >>> on pSeries under KVM when NUMA nodes are defined but it is undefined >> >> This sentence seems to have a syntax problem... like it is missing an >> 'and' before 'on pSeries'. > > ah yes, or simply a comma. > >>> under PowerVM. The XIVE source structure has a similar field >>> 'src_chip' which is only assigned on the PowerNV platform. >>> >>> cpu_to_node() returns a compatible value on all platforms, 0 being the >>> default node. It will also give us the opportunity to set the affinity >>> of a source on pSeries when we can localize them. >>> >> >> IIUC this relies on the fact that the NUMA node id is == to chip id >> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable >> with this change. > > Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall > H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel > in Cc:) That's correct. H_HOME_NODE_ASSOCIATIVITY returns not only the node_id, but a list with the ibm,associativity domains of the CPU that "proc-no" (processor identifier) is mapped to inside QEMU. node_id in this case, considering that we're working with a reference-points of size 4, is the 4th element of the returned list. The last element is "procno" itself. > > On PowerNV, Linux uses "ibm,associativity" property of the CPU to find > the node id. This value is built from the chip id in OPAL, so the > value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" > property are unlikely to be different. > > cpu_to_node(cpu) is used in many places to allocate the structures > locally to the owning node. XIVE is not an exception (see below in the > same patch), it is better to be consistent and get the same information > (node id) using the same routine. > > > In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : > LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot > unifies the controllers of the system to only expose one the OS. This > is problematic and should be changed but it's another topic. > > >> On the other hand, you have the pSeries case under PowerVM that >> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. > > yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning > under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid > chip id. > > QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not > always correct btw) If you have a way to reliably reproduce this, let me know and I'll fix it up in QEMU. Thanks, DHB > >> It looks like the chip id is only used for localization purpose in >> this case, right ? > > Yes and PAPR sources are not localized. So it's not used. MSI sources > could be if we rewrote the MSI driver. > >> In this case, what about doing this change for pSeries only, >> somewhere in spapr.c ? > > The IPI code is common to all platforms and all have the same issue. > I rather not. > > Thanks, > > C. > >>> Signed-off-by: Cédric Le Goater <clg@kaod.org> >>> --- >>> arch/powerpc/sysdev/xive/common.c | 7 +------ >>> 1 file changed, 1 insertion(+), 6 deletions(-) >>> >>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c >>> index 595310e056f4..b8e456da28aa 100644 >>> --- a/arch/powerpc/sysdev/xive/common.c >>> +++ b/arch/powerpc/sysdev/xive/common.c >>> @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu) >>> >>> xc = per_cpu(xive_cpu, cpu); >>> if (!xc) { >>> - struct device_node *np; >>> - >>> xc = kzalloc_node(sizeof(struct xive_cpu), >>> GFP_KERNEL, cpu_to_node(cpu)); >>> if (!xc) >>> return -ENOMEM; >>> - np = of_get_cpu_node(cpu, NULL); >>> - if (np) >>> - xc->chip_id = of_get_ibm_chip_id(np); >>> - of_node_put(np); >>> + xc->chip_id = cpu_to_node(cpu); >>> xc->hw_ipi = XIVE_BAD_IRQ; >>> >>> per_cpu(xive_cpu, cpu) = xc; >> > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property 2021-03-09 17:08 ` Daniel Henrique Barboza @ 2021-03-09 17:26 ` Cédric Le Goater 0 siblings, 0 replies; 38+ messages in thread From: Cédric Le Goater @ 2021-03-09 17:26 UTC (permalink / raw) To: Daniel Henrique Barboza, Greg Kurz Cc: list@suse.de:PowerPC, linuxppc-dev, QEMU Developers, David Gibson On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: > > > On 3/9/21 12:33 PM, Cédric Le Goater wrote: >> On 3/8/21 6:13 PM, Greg Kurz wrote: >>> On Wed, 3 Mar 2021 18:48:50 +0100 >>> Cédric Le Goater <clg@kaod.org> wrote: >>> >>>> The 'chip_id' field of the XIVE CPU structure is used to choose a >>>> target for a source located on the same chip when possible. This field >>>> is assigned on the PowerNV platform using the "ibm,chip-id" property >>>> on pSeries under KVM when NUMA nodes are defined but it is undefined >>> >>> This sentence seems to have a syntax problem... like it is missing an >>> 'and' before 'on pSeries'. >> >> ah yes, or simply a comma. >> >>>> under PowerVM. The XIVE source structure has a similar field >>>> 'src_chip' which is only assigned on the PowerNV platform. >>>> >>>> cpu_to_node() returns a compatible value on all platforms, 0 being the >>>> default node. It will also give us the opportunity to set the affinity >>>> of a source on pSeries when we can localize them. >>>> >>> >>> IIUC this relies on the fact that the NUMA node id is == to chip id >>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable >>> with this change. >> >> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall >> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel >> in Cc:) > > That's correct. H_HOME_NODE_ASSOCIATIVITY returns not only the node_id, but > a list with the ibm,associativity domains of the CPU that "proc-no" (processor > identifier) is mapped to inside QEMU. > > node_id in this case, considering that we're working with a reference-points > of size 4, is the 4th element of the returned list. The last element is > "procno" itself. > > >> >> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find >> the node id. This value is built from the chip id in OPAL, so the >> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" >> property are unlikely to be different. >> >> cpu_to_node(cpu) is used in many places to allocate the structures >> locally to the owning node. XIVE is not an exception (see below in the >> same patch), it is better to be consistent and get the same information >> (node id) using the same routine. >> >> >> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : >> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot >> unifies the controllers of the system to only expose one the OS. This >> is problematic and should be changed but it's another topic. >> >> >>> On the other hand, you have the pSeries case under PowerVM that >>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. >> >> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning >> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid >> chip id. >> >> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not >> always correct btw) > > > If you have a way to reliably reproduce this, let me know and I'll fix it > up in QEMU. with : -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 # dmesg | grep numa [ 0.013106] numa: Node 0 CPUs: 0-1 [ 0.013136] numa: Node 1 CPUs: 2-3 # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id ibm,chip-id = <0x01>; ibm,chip-id = <0x02>; ibm,chip-id = <0x00>; ibm,chip-id = <0x03>; with : -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 # dmesg | grep numa [ 0.013106] numa: Node 0 CPUs: 0-1 [ 0.013136] numa: Node 1 CPUs: 2-3 # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id ibm,chip-id = <0x00>; ibm,chip-id = <0x00>; ibm,chip-id = <0x00>; ibm,chip-id = <0x00>; I think we should simply remove "ibm,chip-id" since it's not used and not in the PAPR spec. Thanks, C. > > Thanks, > > > DHB > > >> >>> It looks like the chip id is only used for localization purpose in >>> this case, right ? >> >> Yes and PAPR sources are not localized. So it's not used. MSI sources >> could be if we rewrote the MSI driver. >> >>> In this case, what about doing this change for pSeries only, >>> somewhere in spapr.c ? >> >> The IPI code is common to all platforms and all have the same issue. >> I rather not. >> >> Thanks, >> >> C. >> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org> >>>> --- >>>> arch/powerpc/sysdev/xive/common.c | 7 +------ >>>> 1 file changed, 1 insertion(+), 6 deletions(-) >>>> >>>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c >>>> index 595310e056f4..b8e456da28aa 100644 >>>> --- a/arch/powerpc/sysdev/xive/common.c >>>> +++ b/arch/powerpc/sysdev/xive/common.c >>>> @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu) >>>> xc = per_cpu(xive_cpu, cpu); >>>> if (!xc) { >>>> - struct device_node *np; >>>> - >>>> xc = kzalloc_node(sizeof(struct xive_cpu), >>>> GFP_KERNEL, cpu_to_node(cpu)); >>>> if (!xc) >>>> return -ENOMEM; >>>> - np = of_get_cpu_node(cpu, NULL); >>>> - if (np) >>>> - xc->chip_id = of_get_ibm_chip_id(np); >>>> - of_node_put(np); >>>> + xc->chip_id = cpu_to_node(cpu); >>>> xc->hw_ipi = XIVE_BAD_IRQ; >>>> per_cpu(xive_cpu, cpu) = xc; >>> >> ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property @ 2021-03-09 17:26 ` Cédric Le Goater 0 siblings, 0 replies; 38+ messages in thread From: Cédric Le Goater @ 2021-03-09 17:26 UTC (permalink / raw) To: Daniel Henrique Barboza, Greg Kurz Cc: list@suse.de:PowerPC, Michael Ellerman, linuxppc-dev, QEMU Developers, David Gibson On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: > > > On 3/9/21 12:33 PM, Cédric Le Goater wrote: >> On 3/8/21 6:13 PM, Greg Kurz wrote: >>> On Wed, 3 Mar 2021 18:48:50 +0100 >>> Cédric Le Goater <clg@kaod.org> wrote: >>> >>>> The 'chip_id' field of the XIVE CPU structure is used to choose a >>>> target for a source located on the same chip when possible. This field >>>> is assigned on the PowerNV platform using the "ibm,chip-id" property >>>> on pSeries under KVM when NUMA nodes are defined but it is undefined >>> >>> This sentence seems to have a syntax problem... like it is missing an >>> 'and' before 'on pSeries'. >> >> ah yes, or simply a comma. >> >>>> under PowerVM. The XIVE source structure has a similar field >>>> 'src_chip' which is only assigned on the PowerNV platform. >>>> >>>> cpu_to_node() returns a compatible value on all platforms, 0 being the >>>> default node. It will also give us the opportunity to set the affinity >>>> of a source on pSeries when we can localize them. >>>> >>> >>> IIUC this relies on the fact that the NUMA node id is == to chip id >>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable >>> with this change. >> >> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall >> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel >> in Cc:) > > That's correct. H_HOME_NODE_ASSOCIATIVITY returns not only the node_id, but > a list with the ibm,associativity domains of the CPU that "proc-no" (processor > identifier) is mapped to inside QEMU. > > node_id in this case, considering that we're working with a reference-points > of size 4, is the 4th element of the returned list. The last element is > "procno" itself. > > >> >> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find >> the node id. This value is built from the chip id in OPAL, so the >> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" >> property are unlikely to be different. >> >> cpu_to_node(cpu) is used in many places to allocate the structures >> locally to the owning node. XIVE is not an exception (see below in the >> same patch), it is better to be consistent and get the same information >> (node id) using the same routine. >> >> >> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : >> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot >> unifies the controllers of the system to only expose one the OS. This >> is problematic and should be changed but it's another topic. >> >> >>> On the other hand, you have the pSeries case under PowerVM that >>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. >> >> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning >> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid >> chip id. >> >> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not >> always correct btw) > > > If you have a way to reliably reproduce this, let me know and I'll fix it > up in QEMU. with : -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 # dmesg | grep numa [ 0.013106] numa: Node 0 CPUs: 0-1 [ 0.013136] numa: Node 1 CPUs: 2-3 # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id ibm,chip-id = <0x01>; ibm,chip-id = <0x02>; ibm,chip-id = <0x00>; ibm,chip-id = <0x03>; with : -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 # dmesg | grep numa [ 0.013106] numa: Node 0 CPUs: 0-1 [ 0.013136] numa: Node 1 CPUs: 2-3 # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id ibm,chip-id = <0x00>; ibm,chip-id = <0x00>; ibm,chip-id = <0x00>; ibm,chip-id = <0x00>; I think we should simply remove "ibm,chip-id" since it's not used and not in the PAPR spec. Thanks, C. > > Thanks, > > > DHB > > >> >>> It looks like the chip id is only used for localization purpose in >>> this case, right ? >> >> Yes and PAPR sources are not localized. So it's not used. MSI sources >> could be if we rewrote the MSI driver. >> >>> In this case, what about doing this change for pSeries only, >>> somewhere in spapr.c ? >> >> The IPI code is common to all platforms and all have the same issue. >> I rather not. >> >> Thanks, >> >> C. >> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org> >>>> --- >>>> arch/powerpc/sysdev/xive/common.c | 7 +------ >>>> 1 file changed, 1 insertion(+), 6 deletions(-) >>>> >>>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c >>>> index 595310e056f4..b8e456da28aa 100644 >>>> --- a/arch/powerpc/sysdev/xive/common.c >>>> +++ b/arch/powerpc/sysdev/xive/common.c >>>> @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu) >>>> xc = per_cpu(xive_cpu, cpu); >>>> if (!xc) { >>>> - struct device_node *np; >>>> - >>>> xc = kzalloc_node(sizeof(struct xive_cpu), >>>> GFP_KERNEL, cpu_to_node(cpu)); >>>> if (!xc) >>>> return -ENOMEM; >>>> - np = of_get_cpu_node(cpu, NULL); >>>> - if (np) >>>> - xc->chip_id = of_get_ibm_chip_id(np); >>>> - of_node_put(np); >>>> + xc->chip_id = cpu_to_node(cpu); >>>> xc->hw_ipi = XIVE_BAD_IRQ; >>>> per_cpu(xive_cpu, cpu) = xc; >>> >> ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property 2021-03-09 17:26 ` Cédric Le Goater @ 2021-03-12 1:55 ` David Gibson -1 siblings, 0 replies; 38+ messages in thread From: David Gibson @ 2021-03-12 1:55 UTC (permalink / raw) To: Cédric Le Goater Cc: Daniel Henrique Barboza, Greg Kurz, QEMU Developers, list@suse.de:PowerPC, linuxppc-dev On Tue, 9 Mar 2021 18:26:35 +0100 Cédric Le Goater <clg@kaod.org> wrote: > On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: > > > > > > On 3/9/21 12:33 PM, Cédric Le Goater wrote: > >> On 3/8/21 6:13 PM, Greg Kurz wrote: > >>> On Wed, 3 Mar 2021 18:48:50 +0100 > >>> Cédric Le Goater <clg@kaod.org> wrote: > >>> > >>>> The 'chip_id' field of the XIVE CPU structure is used to choose a > >>>> target for a source located on the same chip when possible. This field > >>>> is assigned on the PowerNV platform using the "ibm,chip-id" property > >>>> on pSeries under KVM when NUMA nodes are defined but it is undefined > >>> > >>> This sentence seems to have a syntax problem... like it is missing an > >>> 'and' before 'on pSeries'. > >> > >> ah yes, or simply a comma. > >> > >>>> under PowerVM. The XIVE source structure has a similar field > >>>> 'src_chip' which is only assigned on the PowerNV platform. > >>>> > >>>> cpu_to_node() returns a compatible value on all platforms, 0 being the > >>>> default node. It will also give us the opportunity to set the affinity > >>>> of a source on pSeries when we can localize them. > >>>> > >>> > >>> IIUC this relies on the fact that the NUMA node id is == to chip id > >>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable > >>> with this change. > >> > >> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall > >> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel > >> in Cc:) > [...] > >> > >> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find > >> the node id. This value is built from the chip id in OPAL, so the > >> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" > >> property are unlikely to be different. > >> > >> cpu_to_node(cpu) is used in many places to allocate the structures > >> locally to the owning node. XIVE is not an exception (see below in the > >> same patch), it is better to be consistent and get the same information > >> (node id) using the same routine. > >> > >> > >> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : > >> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot > >> unifies the controllers of the system to only expose one the OS. This > >> is problematic and should be changed but it's another topic. > >> > >> > >>> On the other hand, you have the pSeries case under PowerVM that > >>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. > >> > >> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning > >> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid > >> chip id. > >> > >> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not > >> always correct btw) > > > > > > If you have a way to reliably reproduce this, let me know and I'll fix it > > up in QEMU. > > with : > > -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 > > # dmesg | grep numa > [ 0.013106] numa: Node 0 CPUs: 0-1 > [ 0.013136] numa: Node 1 CPUs: 2-3 > > # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id > ibm,chip-id = <0x01>; > ibm,chip-id = <0x02>; > ibm,chip-id = <0x00>; > ibm,chip-id = <0x03>; > > with : > > -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 > > # dmesg | grep numa > [ 0.013106] numa: Node 0 CPUs: 0-1 > [ 0.013136] numa: Node 1 CPUs: 2-3 > > # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id > ibm,chip-id = <0x00>; > ibm,chip-id = <0x00>; > ibm,chip-id = <0x00>; > ibm,chip-id = <0x00>; > > I think we should simply remove "ibm,chip-id" since it's not used and > not in the PAPR spec. As I mentioned to Daniel on our call this morning, oddly it *does* appear to be used in the RHEL kernel, even though that's 4.18 based. This patch seems to have caused a minor regression; not in the identification of NUMA nodes, but in the number of sockets shown be lscpu, etc. See https://bugzilla.redhat.com/show_bug.cgi?id=1934421 for more information. Since the value was used by some PAPR kernels - even if they shouldn't have - I think we should only remove this for newer machine types. We also need to check what we're not supplying that the guest kernel is showing a different number of sockets than specified on the qemu command line. > > Thanks, > > C. > > > > [...] > [...] > [...] > [...] > [...] > [...] > [...] > [...] > [...] > -- David Gibson <dgibson@redhat.com> Principal Software Engineer, Virtualization, Red Hat ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property @ 2021-03-12 1:55 ` David Gibson 0 siblings, 0 replies; 38+ messages in thread From: David Gibson @ 2021-03-12 1:55 UTC (permalink / raw) To: Cédric Le Goater Cc: Michael Ellerman, Daniel Henrique Barboza, Greg Kurz, QEMU Developers, list@suse.de:PowerPC, linuxppc-dev On Tue, 9 Mar 2021 18:26:35 +0100 Cédric Le Goater <clg@kaod.org> wrote: > On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: > > > > > > On 3/9/21 12:33 PM, Cédric Le Goater wrote: > >> On 3/8/21 6:13 PM, Greg Kurz wrote: > >>> On Wed, 3 Mar 2021 18:48:50 +0100 > >>> Cédric Le Goater <clg@kaod.org> wrote: > >>> > >>>> The 'chip_id' field of the XIVE CPU structure is used to choose a > >>>> target for a source located on the same chip when possible. This field > >>>> is assigned on the PowerNV platform using the "ibm,chip-id" property > >>>> on pSeries under KVM when NUMA nodes are defined but it is undefined > >>> > >>> This sentence seems to have a syntax problem... like it is missing an > >>> 'and' before 'on pSeries'. > >> > >> ah yes, or simply a comma. > >> > >>>> under PowerVM. The XIVE source structure has a similar field > >>>> 'src_chip' which is only assigned on the PowerNV platform. > >>>> > >>>> cpu_to_node() returns a compatible value on all platforms, 0 being the > >>>> default node. It will also give us the opportunity to set the affinity > >>>> of a source on pSeries when we can localize them. > >>>> > >>> > >>> IIUC this relies on the fact that the NUMA node id is == to chip id > >>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable > >>> with this change. > >> > >> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall > >> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel > >> in Cc:) > [...] > >> > >> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find > >> the node id. This value is built from the chip id in OPAL, so the > >> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" > >> property are unlikely to be different. > >> > >> cpu_to_node(cpu) is used in many places to allocate the structures > >> locally to the owning node. XIVE is not an exception (see below in the > >> same patch), it is better to be consistent and get the same information > >> (node id) using the same routine. > >> > >> > >> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : > >> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot > >> unifies the controllers of the system to only expose one the OS. This > >> is problematic and should be changed but it's another topic. > >> > >> > >>> On the other hand, you have the pSeries case under PowerVM that > >>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. > >> > >> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning > >> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid > >> chip id. > >> > >> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not > >> always correct btw) > > > > > > If you have a way to reliably reproduce this, let me know and I'll fix it > > up in QEMU. > > with : > > -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 > > # dmesg | grep numa > [ 0.013106] numa: Node 0 CPUs: 0-1 > [ 0.013136] numa: Node 1 CPUs: 2-3 > > # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id > ibm,chip-id = <0x01>; > ibm,chip-id = <0x02>; > ibm,chip-id = <0x00>; > ibm,chip-id = <0x03>; > > with : > > -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 > > # dmesg | grep numa > [ 0.013106] numa: Node 0 CPUs: 0-1 > [ 0.013136] numa: Node 1 CPUs: 2-3 > > # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id > ibm,chip-id = <0x00>; > ibm,chip-id = <0x00>; > ibm,chip-id = <0x00>; > ibm,chip-id = <0x00>; > > I think we should simply remove "ibm,chip-id" since it's not used and > not in the PAPR spec. As I mentioned to Daniel on our call this morning, oddly it *does* appear to be used in the RHEL kernel, even though that's 4.18 based. This patch seems to have caused a minor regression; not in the identification of NUMA nodes, but in the number of sockets shown be lscpu, etc. See https://bugzilla.redhat.com/show_bug.cgi?id=1934421 for more information. Since the value was used by some PAPR kernels - even if they shouldn't have - I think we should only remove this for newer machine types. We also need to check what we're not supplying that the guest kernel is showing a different number of sockets than specified on the qemu command line. > > Thanks, > > C. > > > > [...] > [...] > [...] > [...] > [...] > [...] > [...] > [...] > [...] > -- David Gibson <dgibson@redhat.com> Principal Software Engineer, Virtualization, Red Hat ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property 2021-03-12 1:55 ` David Gibson @ 2021-03-12 9:53 ` Cédric Le Goater -1 siblings, 0 replies; 38+ messages in thread From: Cédric Le Goater @ 2021-03-12 9:53 UTC (permalink / raw) To: David Gibson Cc: Daniel Henrique Barboza, Greg Kurz, QEMU Developers, list@suse.de:PowerPC, linuxppc-dev On 3/12/21 2:55 AM, David Gibson wrote: > On Tue, 9 Mar 2021 18:26:35 +0100 > Cédric Le Goater <clg@kaod.org> wrote: > >> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: >>> >>> >>> On 3/9/21 12:33 PM, Cédric Le Goater wrote: >>>> On 3/8/21 6:13 PM, Greg Kurz wrote: >>>>> On Wed, 3 Mar 2021 18:48:50 +0100 >>>>> Cédric Le Goater <clg@kaod.org> wrote: >>>>> >>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a >>>>>> target for a source located on the same chip when possible. This field >>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property >>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined >>>>> >>>>> This sentence seems to have a syntax problem... like it is missing an >>>>> 'and' before 'on pSeries'. >>>> >>>> ah yes, or simply a comma. >>>> >>>>>> under PowerVM. The XIVE source structure has a similar field >>>>>> 'src_chip' which is only assigned on the PowerNV platform. >>>>>> >>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the >>>>>> default node. It will also give us the opportunity to set the affinity >>>>>> of a source on pSeries when we can localize them. >>>>>> >>>>> >>>>> IIUC this relies on the fact that the NUMA node id is == to chip id >>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable >>>>> with this change. >>>> >>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall >>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel >>>> in Cc:) >> [...] >>>> >>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find >>>> the node id. This value is built from the chip id in OPAL, so the >>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" >>>> property are unlikely to be different. >>>> >>>> cpu_to_node(cpu) is used in many places to allocate the structures >>>> locally to the owning node. XIVE is not an exception (see below in the >>>> same patch), it is better to be consistent and get the same information >>>> (node id) using the same routine. >>>> >>>> >>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : >>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot >>>> unifies the controllers of the system to only expose one the OS. This >>>> is problematic and should be changed but it's another topic. >>>> >>>> >>>>> On the other hand, you have the pSeries case under PowerVM that >>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. >>>> >>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning >>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid >>>> chip id. >>>> >>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not >>>> always correct btw) >>> >>> >>> If you have a way to reliably reproduce this, let me know and I'll fix it >>> up in QEMU. >> >> with : >> >> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >> >> # dmesg | grep numa >> [ 0.013106] numa: Node 0 CPUs: 0-1 >> [ 0.013136] numa: Node 1 CPUs: 2-3 >> >> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >> ibm,chip-id = <0x01>; >> ibm,chip-id = <0x02>; >> ibm,chip-id = <0x00>; >> ibm,chip-id = <0x03>; >> >> with : >> >> -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >> >> # dmesg | grep numa >> [ 0.013106] numa: Node 0 CPUs: 0-1 >> [ 0.013136] numa: Node 1 CPUs: 2-3 >> >> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >> ibm,chip-id = <0x00>; >> ibm,chip-id = <0x00>; >> ibm,chip-id = <0x00>; >> ibm,chip-id = <0x00>; >> >> I think we should simply remove "ibm,chip-id" since it's not used and >> not in the PAPR spec. > > As I mentioned to Daniel on our call this morning, oddly it *does* > appear to be used in the RHEL kernel, even though that's 4.18 based. > This patch seems to have caused a minor regression; not in the > identification of NUMA nodes, but in the number of sockets shown be > lscpu, etc. See https://bugzilla.redhat.com/show_bug.cgi?id=1934421 > for more information. Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 : [root@localhost ~]# lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 128 On-line CPU(s) list: 0-127 Thread(s) per core: 4 Core(s) per socket: 16 Socket(s): 2 NUMA node(s): 2 Model: 2.2 (pvr 004e 1202) Model name: POWER9 (architected), altivec supported Hypervisor vendor: KVM Virtualization type: para L1d cache: 32K L1i cache: 32K NUMA node0 CPU(s): 0-63 NUMA node1 CPU(s): 64-127 [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1 /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1 /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1 /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1 /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1 .... "ibm,chip-id" is still being used on some occasion on pSeries machines. This is wrong :/ The problem is : #define topology_physical_package_id(cpu) (cpu_to_chip_id(cpu)) We should be using cpu_to_node(). C. > > Since the value was used by some PAPR kernels - even if they shouldn't > have - I think we should only remove this for newer machine types. We > also need to check what we're not supplying that the guest kernel is > showing a different number of sockets than specified on the qemu > command line. > >> >> Thanks, >> >> C. >> >> >> >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property @ 2021-03-12 9:53 ` Cédric Le Goater 0 siblings, 0 replies; 38+ messages in thread From: Cédric Le Goater @ 2021-03-12 9:53 UTC (permalink / raw) To: David Gibson Cc: Michael Ellerman, Daniel Henrique Barboza, Greg Kurz, QEMU Developers, list@suse.de:PowerPC, linuxppc-dev On 3/12/21 2:55 AM, David Gibson wrote: > On Tue, 9 Mar 2021 18:26:35 +0100 > Cédric Le Goater <clg@kaod.org> wrote: > >> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: >>> >>> >>> On 3/9/21 12:33 PM, Cédric Le Goater wrote: >>>> On 3/8/21 6:13 PM, Greg Kurz wrote: >>>>> On Wed, 3 Mar 2021 18:48:50 +0100 >>>>> Cédric Le Goater <clg@kaod.org> wrote: >>>>> >>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a >>>>>> target for a source located on the same chip when possible. This field >>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property >>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined >>>>> >>>>> This sentence seems to have a syntax problem... like it is missing an >>>>> 'and' before 'on pSeries'. >>>> >>>> ah yes, or simply a comma. >>>> >>>>>> under PowerVM. The XIVE source structure has a similar field >>>>>> 'src_chip' which is only assigned on the PowerNV platform. >>>>>> >>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the >>>>>> default node. It will also give us the opportunity to set the affinity >>>>>> of a source on pSeries when we can localize them. >>>>>> >>>>> >>>>> IIUC this relies on the fact that the NUMA node id is == to chip id >>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable >>>>> with this change. >>>> >>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall >>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel >>>> in Cc:) >> [...] >>>> >>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find >>>> the node id. This value is built from the chip id in OPAL, so the >>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" >>>> property are unlikely to be different. >>>> >>>> cpu_to_node(cpu) is used in many places to allocate the structures >>>> locally to the owning node. XIVE is not an exception (see below in the >>>> same patch), it is better to be consistent and get the same information >>>> (node id) using the same routine. >>>> >>>> >>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : >>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot >>>> unifies the controllers of the system to only expose one the OS. This >>>> is problematic and should be changed but it's another topic. >>>> >>>> >>>>> On the other hand, you have the pSeries case under PowerVM that >>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. >>>> >>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning >>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid >>>> chip id. >>>> >>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not >>>> always correct btw) >>> >>> >>> If you have a way to reliably reproduce this, let me know and I'll fix it >>> up in QEMU. >> >> with : >> >> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >> >> # dmesg | grep numa >> [ 0.013106] numa: Node 0 CPUs: 0-1 >> [ 0.013136] numa: Node 1 CPUs: 2-3 >> >> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >> ibm,chip-id = <0x01>; >> ibm,chip-id = <0x02>; >> ibm,chip-id = <0x00>; >> ibm,chip-id = <0x03>; >> >> with : >> >> -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >> >> # dmesg | grep numa >> [ 0.013106] numa: Node 0 CPUs: 0-1 >> [ 0.013136] numa: Node 1 CPUs: 2-3 >> >> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >> ibm,chip-id = <0x00>; >> ibm,chip-id = <0x00>; >> ibm,chip-id = <0x00>; >> ibm,chip-id = <0x00>; >> >> I think we should simply remove "ibm,chip-id" since it's not used and >> not in the PAPR spec. > > As I mentioned to Daniel on our call this morning, oddly it *does* > appear to be used in the RHEL kernel, even though that's 4.18 based. > This patch seems to have caused a minor regression; not in the > identification of NUMA nodes, but in the number of sockets shown be > lscpu, etc. See https://bugzilla.redhat.com/show_bug.cgi?id=1934421 > for more information. Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 : [root@localhost ~]# lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 128 On-line CPU(s) list: 0-127 Thread(s) per core: 4 Core(s) per socket: 16 Socket(s): 2 NUMA node(s): 2 Model: 2.2 (pvr 004e 1202) Model name: POWER9 (architected), altivec supported Hypervisor vendor: KVM Virtualization type: para L1d cache: 32K L1i cache: 32K NUMA node0 CPU(s): 0-63 NUMA node1 CPU(s): 64-127 [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1 /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1 /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1 /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1 /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1 .... "ibm,chip-id" is still being used on some occasion on pSeries machines. This is wrong :/ The problem is : #define topology_physical_package_id(cpu) (cpu_to_chip_id(cpu)) We should be using cpu_to_node(). C. > > Since the value was used by some PAPR kernels - even if they shouldn't > have - I think we should only remove this for newer machine types. We > also need to check what we're not supplying that the guest kernel is > showing a different number of sockets than specified on the qemu > command line. > >> >> Thanks, >> >> C. >> >> >> >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> [...] >> > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property 2021-03-12 9:53 ` Cédric Le Goater @ 2021-03-12 12:18 ` Daniel Henrique Barboza -1 siblings, 0 replies; 38+ messages in thread From: Daniel Henrique Barboza @ 2021-03-12 12:18 UTC (permalink / raw) To: Cédric Le Goater, David Gibson Cc: list@suse.de:PowerPC, linuxppc-dev, Greg Kurz, QEMU Developers On 3/12/21 6:53 AM, Cédric Le Goater wrote: > On 3/12/21 2:55 AM, David Gibson wrote: >> On Tue, 9 Mar 2021 18:26:35 +0100 >> Cédric Le Goater <clg@kaod.org> wrote: >> >>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: >>>> >>>> >>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote: >>>>> On 3/8/21 6:13 PM, Greg Kurz wrote: >>>>>> On Wed, 3 Mar 2021 18:48:50 +0100 >>>>>> Cédric Le Goater <clg@kaod.org> wrote: >>>>>> >>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a >>>>>>> target for a source located on the same chip when possible. This field >>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property >>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined >>>>>> >>>>>> This sentence seems to have a syntax problem... like it is missing an >>>>>> 'and' before 'on pSeries'. >>>>> >>>>> ah yes, or simply a comma. >>>>> >>>>>>> under PowerVM. The XIVE source structure has a similar field >>>>>>> 'src_chip' which is only assigned on the PowerNV platform. >>>>>>> >>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the >>>>>>> default node. It will also give us the opportunity to set the affinity >>>>>>> of a source on pSeries when we can localize them. >>>>>>> >>>>>> >>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id >>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable >>>>>> with this change. >>>>> >>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall >>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel >>>>> in Cc:) >>> [...] >>>>> >>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find >>>>> the node id. This value is built from the chip id in OPAL, so the >>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" >>>>> property are unlikely to be different. >>>>> >>>>> cpu_to_node(cpu) is used in many places to allocate the structures >>>>> locally to the owning node. XIVE is not an exception (see below in the >>>>> same patch), it is better to be consistent and get the same information >>>>> (node id) using the same routine. >>>>> >>>>> >>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : >>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot >>>>> unifies the controllers of the system to only expose one the OS. This >>>>> is problematic and should be changed but it's another topic. >>>>> >>>>> >>>>>> On the other hand, you have the pSeries case under PowerVM that >>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. >>>>> >>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning >>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid >>>>> chip id. >>>>> >>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not >>>>> always correct btw) >>>> >>>> >>>> If you have a way to reliably reproduce this, let me know and I'll fix it >>>> up in QEMU. >>> >>> with : >>> >>> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >>> >>> # dmesg | grep numa >>> [ 0.013106] numa: Node 0 CPUs: 0-1 >>> [ 0.013136] numa: Node 1 CPUs: 2-3 >>> >>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >>> ibm,chip-id = <0x01>; >>> ibm,chip-id = <0x02>; >>> ibm,chip-id = <0x00>; >>> ibm,chip-id = <0x03>; >>> >>> with : >>> >>> -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >>> >>> # dmesg | grep numa >>> [ 0.013106] numa: Node 0 CPUs: 0-1 >>> [ 0.013136] numa: Node 1 CPUs: 2-3 >>> >>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >>> ibm,chip-id = <0x00>; >>> ibm,chip-id = <0x00>; >>> ibm,chip-id = <0x00>; >>> ibm,chip-id = <0x00>; >>> >>> I think we should simply remove "ibm,chip-id" since it's not used and >>> not in the PAPR spec. >> >> As I mentioned to Daniel on our call this morning, oddly it *does* >> appear to be used in the RHEL kernel, even though that's 4.18 based. >> This patch seems to have caused a minor regression; not in the >> identification of NUMA nodes, but in the number of sockets shown be >> lscpu, etc. See https://bugzilla.redhat.com/show_bug.cgi?id=1934421 >> for more information. > > Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we > remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 : > > [root@localhost ~]# lscpu > Architecture: ppc64le > Byte Order: Little Endian > CPU(s): 128 > On-line CPU(s) list: 0-127 > Thread(s) per core: 4 > Core(s) per socket: 16 > Socket(s): 2 > NUMA node(s): 2 > Model: 2.2 (pvr 004e 1202) > Model name: POWER9 (architected), altivec supported > Hypervisor vendor: KVM > Virtualization type: para > L1d cache: 32K > L1i cache: 32K > NUMA node0 CPU(s): 0-63 > NUMA node1 CPU(s): 64-127 > > [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id > /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1 > /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1 > /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1 > /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1 > /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1 > .... > > "ibm,chip-id" is still being used on some occasion on pSeries machines. > This is wrong :/ The problem is : > > #define topology_physical_package_id(cpu) (cpu_to_chip_id(cpu)) > > We should be using cpu_to_node(). IIUC the "real fix" then is this change you mentioned above, together with this xive patch as well, to stop using ibm,chip-id for good in the pserie kernel. With these changes QEMU can remove 'ibm,chip-id' from the pseries machine without impact. Is this correct? If that's the case, then I believe it's ok to go forward with the QEMU side change (just for 6.0.0 and newer machines). Or should I wait for the kernel changes to be merged upstream first? Thanks, DHB > > C. > >> >> Since the value was used by some PAPR kernels - even if they shouldn't >> have - I think we should only remove this for newer machine types. We >> also need to check what we're not supplying that the guest kernel is >> showing a different number of sockets than specified on the qemu >> command line. >> >>> >>> Thanks, >>> >>> C. >>> >>> >>> >>> [...] >>> [...] >>> [...] >>> [...] >>> [...] >>> [...] >>> [...] >>> [...] >>> [...] >>> >> >> > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property @ 2021-03-12 12:18 ` Daniel Henrique Barboza 0 siblings, 0 replies; 38+ messages in thread From: Daniel Henrique Barboza @ 2021-03-12 12:18 UTC (permalink / raw) To: Cédric Le Goater, David Gibson Cc: list@suse.de:PowerPC, Michael Ellerman, linuxppc-dev, Greg Kurz, QEMU Developers On 3/12/21 6:53 AM, Cédric Le Goater wrote: > On 3/12/21 2:55 AM, David Gibson wrote: >> On Tue, 9 Mar 2021 18:26:35 +0100 >> Cédric Le Goater <clg@kaod.org> wrote: >> >>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: >>>> >>>> >>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote: >>>>> On 3/8/21 6:13 PM, Greg Kurz wrote: >>>>>> On Wed, 3 Mar 2021 18:48:50 +0100 >>>>>> Cédric Le Goater <clg@kaod.org> wrote: >>>>>> >>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a >>>>>>> target for a source located on the same chip when possible. This field >>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property >>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined >>>>>> >>>>>> This sentence seems to have a syntax problem... like it is missing an >>>>>> 'and' before 'on pSeries'. >>>>> >>>>> ah yes, or simply a comma. >>>>> >>>>>>> under PowerVM. The XIVE source structure has a similar field >>>>>>> 'src_chip' which is only assigned on the PowerNV platform. >>>>>>> >>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the >>>>>>> default node. It will also give us the opportunity to set the affinity >>>>>>> of a source on pSeries when we can localize them. >>>>>>> >>>>>> >>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id >>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable >>>>>> with this change. >>>>> >>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall >>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel >>>>> in Cc:) >>> [...] >>>>> >>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find >>>>> the node id. This value is built from the chip id in OPAL, so the >>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" >>>>> property are unlikely to be different. >>>>> >>>>> cpu_to_node(cpu) is used in many places to allocate the structures >>>>> locally to the owning node. XIVE is not an exception (see below in the >>>>> same patch), it is better to be consistent and get the same information >>>>> (node id) using the same routine. >>>>> >>>>> >>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : >>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot >>>>> unifies the controllers of the system to only expose one the OS. This >>>>> is problematic and should be changed but it's another topic. >>>>> >>>>> >>>>>> On the other hand, you have the pSeries case under PowerVM that >>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. >>>>> >>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning >>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid >>>>> chip id. >>>>> >>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not >>>>> always correct btw) >>>> >>>> >>>> If you have a way to reliably reproduce this, let me know and I'll fix it >>>> up in QEMU. >>> >>> with : >>> >>> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >>> >>> # dmesg | grep numa >>> [ 0.013106] numa: Node 0 CPUs: 0-1 >>> [ 0.013136] numa: Node 1 CPUs: 2-3 >>> >>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >>> ibm,chip-id = <0x01>; >>> ibm,chip-id = <0x02>; >>> ibm,chip-id = <0x00>; >>> ibm,chip-id = <0x03>; >>> >>> with : >>> >>> -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >>> >>> # dmesg | grep numa >>> [ 0.013106] numa: Node 0 CPUs: 0-1 >>> [ 0.013136] numa: Node 1 CPUs: 2-3 >>> >>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >>> ibm,chip-id = <0x00>; >>> ibm,chip-id = <0x00>; >>> ibm,chip-id = <0x00>; >>> ibm,chip-id = <0x00>; >>> >>> I think we should simply remove "ibm,chip-id" since it's not used and >>> not in the PAPR spec. >> >> As I mentioned to Daniel on our call this morning, oddly it *does* >> appear to be used in the RHEL kernel, even though that's 4.18 based. >> This patch seems to have caused a minor regression; not in the >> identification of NUMA nodes, but in the number of sockets shown be >> lscpu, etc. See https://bugzilla.redhat.com/show_bug.cgi?id=1934421 >> for more information. > > Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we > remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 : > > [root@localhost ~]# lscpu > Architecture: ppc64le > Byte Order: Little Endian > CPU(s): 128 > On-line CPU(s) list: 0-127 > Thread(s) per core: 4 > Core(s) per socket: 16 > Socket(s): 2 > NUMA node(s): 2 > Model: 2.2 (pvr 004e 1202) > Model name: POWER9 (architected), altivec supported > Hypervisor vendor: KVM > Virtualization type: para > L1d cache: 32K > L1i cache: 32K > NUMA node0 CPU(s): 0-63 > NUMA node1 CPU(s): 64-127 > > [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id > /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1 > /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1 > /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1 > /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1 > /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1 > .... > > "ibm,chip-id" is still being used on some occasion on pSeries machines. > This is wrong :/ The problem is : > > #define topology_physical_package_id(cpu) (cpu_to_chip_id(cpu)) > > We should be using cpu_to_node(). IIUC the "real fix" then is this change you mentioned above, together with this xive patch as well, to stop using ibm,chip-id for good in the pserie kernel. With these changes QEMU can remove 'ibm,chip-id' from the pseries machine without impact. Is this correct? If that's the case, then I believe it's ok to go forward with the QEMU side change (just for 6.0.0 and newer machines). Or should I wait for the kernel changes to be merged upstream first? Thanks, DHB > > C. > >> >> Since the value was used by some PAPR kernels - even if they shouldn't >> have - I think we should only remove this for newer machine types. We >> also need to check what we're not supplying that the guest kernel is >> showing a different number of sockets than specified on the qemu >> command line. >> >>> >>> Thanks, >>> >>> C. >>> >>> >>> >>> [...] >>> [...] >>> [...] >>> [...] >>> [...] >>> [...] >>> [...] >>> [...] >>> [...] >>> >> >> > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property 2021-03-12 12:18 ` Daniel Henrique Barboza @ 2021-03-12 13:03 ` Greg Kurz -1 siblings, 0 replies; 38+ messages in thread From: Greg Kurz @ 2021-03-12 13:03 UTC (permalink / raw) To: Daniel Henrique Barboza Cc: QEMU Developers, list@suse.de:PowerPC, Cédric Le Goater, David Gibson, linuxppc-dev On Fri, 12 Mar 2021 09:18:39 -0300 Daniel Henrique Barboza <danielhb@linux.ibm.com> wrote: > > > On 3/12/21 6:53 AM, Cédric Le Goater wrote: > > On 3/12/21 2:55 AM, David Gibson wrote: > >> On Tue, 9 Mar 2021 18:26:35 +0100 > >> Cédric Le Goater <clg@kaod.org> wrote: > >> > >>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: > >>>> > >>>> > >>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote: > >>>>> On 3/8/21 6:13 PM, Greg Kurz wrote: > >>>>>> On Wed, 3 Mar 2021 18:48:50 +0100 > >>>>>> Cédric Le Goater <clg@kaod.org> wrote: > >>>>>> > >>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a > >>>>>>> target for a source located on the same chip when possible. This field > >>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property > >>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined > >>>>>> > >>>>>> This sentence seems to have a syntax problem... like it is missing an > >>>>>> 'and' before 'on pSeries'. > >>>>> > >>>>> ah yes, or simply a comma. > >>>>> > >>>>>>> under PowerVM. The XIVE source structure has a similar field > >>>>>>> 'src_chip' which is only assigned on the PowerNV platform. > >>>>>>> > >>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the > >>>>>>> default node. It will also give us the opportunity to set the affinity > >>>>>>> of a source on pSeries when we can localize them. > >>>>>>> > >>>>>> > >>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id > >>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable > >>>>>> with this change. > >>>>> > >>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall > >>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel > >>>>> in Cc:) > >>> [...] > >>>>> > >>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find > >>>>> the node id. This value is built from the chip id in OPAL, so the > >>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" > >>>>> property are unlikely to be different. > >>>>> > >>>>> cpu_to_node(cpu) is used in many places to allocate the structures > >>>>> locally to the owning node. XIVE is not an exception (see below in the > >>>>> same patch), it is better to be consistent and get the same information > >>>>> (node id) using the same routine. > >>>>> > >>>>> > >>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : > >>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot > >>>>> unifies the controllers of the system to only expose one the OS. This > >>>>> is problematic and should be changed but it's another topic. > >>>>> > >>>>> > >>>>>> On the other hand, you have the pSeries case under PowerVM that > >>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. > >>>>> > >>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning > >>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid > >>>>> chip id. > >>>>> > >>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not > >>>>> always correct btw) > >>>> > >>>> > >>>> If you have a way to reliably reproduce this, let me know and I'll fix it > >>>> up in QEMU. > >>> > >>> with : > >>> > >>> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 > >>> > >>> # dmesg | grep numa > >>> [ 0.013106] numa: Node 0 CPUs: 0-1 > >>> [ 0.013136] numa: Node 1 CPUs: 2-3 > >>> > >>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id > >>> ibm,chip-id = <0x01>; > >>> ibm,chip-id = <0x02>; > >>> ibm,chip-id = <0x00>; > >>> ibm,chip-id = <0x03>; > >>> > >>> with : > >>> > >>> -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 > >>> > >>> # dmesg | grep numa > >>> [ 0.013106] numa: Node 0 CPUs: 0-1 > >>> [ 0.013136] numa: Node 1 CPUs: 2-3 > >>> > >>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id > >>> ibm,chip-id = <0x00>; > >>> ibm,chip-id = <0x00>; > >>> ibm,chip-id = <0x00>; > >>> ibm,chip-id = <0x00>; > >>> > >>> I think we should simply remove "ibm,chip-id" since it's not used and > >>> not in the PAPR spec. > >> > >> As I mentioned to Daniel on our call this morning, oddly it *does* > >> appear to be used in the RHEL kernel, even though that's 4.18 based. > >> This patch seems to have caused a minor regression; not in the > >> identification of NUMA nodes, but in the number of sockets shown be > >> lscpu, etc. See https://bugzilla.redhat.com/show_bug.cgi?id=1934421 > >> for more information. > > > > Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we > > remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 : > > > > [root@localhost ~]# lscpu > > Architecture: ppc64le > > Byte Order: Little Endian > > CPU(s): 128 > > On-line CPU(s) list: 0-127 > > Thread(s) per core: 4 > > Core(s) per socket: 16 > > Socket(s): 2 > > NUMA node(s): 2 > > Model: 2.2 (pvr 004e 1202) > > Model name: POWER9 (architected), altivec supported > > Hypervisor vendor: KVM > > Virtualization type: para > > L1d cache: 32K > > L1i cache: 32K > > NUMA node0 CPU(s): 0-63 > > NUMA node1 CPU(s): 64-127 > > > > [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id > > /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1 > > /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1 > > /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1 > > /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1 > > /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1 > > .... > > > > "ibm,chip-id" is still being used on some occasion on pSeries machines. > > This is wrong :/ The problem is : > > > > #define topology_physical_package_id(cpu) (cpu_to_chip_id(cpu)) > > > > We should be using cpu_to_node(). > > > IIUC the "real fix" then is this change you mentioned above, together with > this xive patch as well, to stop using ibm,chip-id for good in the pserie > kernel. With these changes QEMU can remove 'ibm,chip-id' from the pseries > machine without impact. Is this correct? > > If that's the case, then I believe it's ok to go forward with the QEMU side > change (just for 6.0.0 and newer machines). Or should I wait for the kernel > changes to be merged upstream first? > I'd say the latter since this is a breaking change and people will want to identify the upstream commits they have to backport to their kernel in order to support the disappearance of "ibm,chip-id". Cheers, -- Greg > > Thanks, > > > DHB > > > > > > C. > > > >> > >> Since the value was used by some PAPR kernels - even if they shouldn't > >> have - I think we should only remove this for newer machine types. We > >> also need to check what we're not supplying that the guest kernel is > >> showing a different number of sockets than specified on the qemu > >> command line. > >> > >>> > >>> Thanks, > >>> > >>> C. > >>> > >>> > >>> > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> > >> > >> > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property @ 2021-03-12 13:03 ` Greg Kurz 0 siblings, 0 replies; 38+ messages in thread From: Greg Kurz @ 2021-03-12 13:03 UTC (permalink / raw) To: Daniel Henrique Barboza Cc: Michael Ellerman, QEMU Developers, list@suse.de:PowerPC, Cédric Le Goater, David Gibson, linuxppc-dev On Fri, 12 Mar 2021 09:18:39 -0300 Daniel Henrique Barboza <danielhb@linux.ibm.com> wrote: > > > On 3/12/21 6:53 AM, Cédric Le Goater wrote: > > On 3/12/21 2:55 AM, David Gibson wrote: > >> On Tue, 9 Mar 2021 18:26:35 +0100 > >> Cédric Le Goater <clg@kaod.org> wrote: > >> > >>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: > >>>> > >>>> > >>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote: > >>>>> On 3/8/21 6:13 PM, Greg Kurz wrote: > >>>>>> On Wed, 3 Mar 2021 18:48:50 +0100 > >>>>>> Cédric Le Goater <clg@kaod.org> wrote: > >>>>>> > >>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a > >>>>>>> target for a source located on the same chip when possible. This field > >>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property > >>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined > >>>>>> > >>>>>> This sentence seems to have a syntax problem... like it is missing an > >>>>>> 'and' before 'on pSeries'. > >>>>> > >>>>> ah yes, or simply a comma. > >>>>> > >>>>>>> under PowerVM. The XIVE source structure has a similar field > >>>>>>> 'src_chip' which is only assigned on the PowerNV platform. > >>>>>>> > >>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the > >>>>>>> default node. It will also give us the opportunity to set the affinity > >>>>>>> of a source on pSeries when we can localize them. > >>>>>>> > >>>>>> > >>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id > >>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable > >>>>>> with this change. > >>>>> > >>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall > >>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel > >>>>> in Cc:) > >>> [...] > >>>>> > >>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find > >>>>> the node id. This value is built from the chip id in OPAL, so the > >>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" > >>>>> property are unlikely to be different. > >>>>> > >>>>> cpu_to_node(cpu) is used in many places to allocate the structures > >>>>> locally to the owning node. XIVE is not an exception (see below in the > >>>>> same patch), it is better to be consistent and get the same information > >>>>> (node id) using the same routine. > >>>>> > >>>>> > >>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : > >>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot > >>>>> unifies the controllers of the system to only expose one the OS. This > >>>>> is problematic and should be changed but it's another topic. > >>>>> > >>>>> > >>>>>> On the other hand, you have the pSeries case under PowerVM that > >>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. > >>>>> > >>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning > >>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid > >>>>> chip id. > >>>>> > >>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not > >>>>> always correct btw) > >>>> > >>>> > >>>> If you have a way to reliably reproduce this, let me know and I'll fix it > >>>> up in QEMU. > >>> > >>> with : > >>> > >>> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 > >>> > >>> # dmesg | grep numa > >>> [ 0.013106] numa: Node 0 CPUs: 0-1 > >>> [ 0.013136] numa: Node 1 CPUs: 2-3 > >>> > >>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id > >>> ibm,chip-id = <0x01>; > >>> ibm,chip-id = <0x02>; > >>> ibm,chip-id = <0x00>; > >>> ibm,chip-id = <0x03>; > >>> > >>> with : > >>> > >>> -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 > >>> > >>> # dmesg | grep numa > >>> [ 0.013106] numa: Node 0 CPUs: 0-1 > >>> [ 0.013136] numa: Node 1 CPUs: 2-3 > >>> > >>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id > >>> ibm,chip-id = <0x00>; > >>> ibm,chip-id = <0x00>; > >>> ibm,chip-id = <0x00>; > >>> ibm,chip-id = <0x00>; > >>> > >>> I think we should simply remove "ibm,chip-id" since it's not used and > >>> not in the PAPR spec. > >> > >> As I mentioned to Daniel on our call this morning, oddly it *does* > >> appear to be used in the RHEL kernel, even though that's 4.18 based. > >> This patch seems to have caused a minor regression; not in the > >> identification of NUMA nodes, but in the number of sockets shown be > >> lscpu, etc. See https://bugzilla.redhat.com/show_bug.cgi?id=1934421 > >> for more information. > > > > Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we > > remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 : > > > > [root@localhost ~]# lscpu > > Architecture: ppc64le > > Byte Order: Little Endian > > CPU(s): 128 > > On-line CPU(s) list: 0-127 > > Thread(s) per core: 4 > > Core(s) per socket: 16 > > Socket(s): 2 > > NUMA node(s): 2 > > Model: 2.2 (pvr 004e 1202) > > Model name: POWER9 (architected), altivec supported > > Hypervisor vendor: KVM > > Virtualization type: para > > L1d cache: 32K > > L1i cache: 32K > > NUMA node0 CPU(s): 0-63 > > NUMA node1 CPU(s): 64-127 > > > > [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id > > /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1 > > /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1 > > /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1 > > /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1 > > /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1 > > .... > > > > "ibm,chip-id" is still being used on some occasion on pSeries machines. > > This is wrong :/ The problem is : > > > > #define topology_physical_package_id(cpu) (cpu_to_chip_id(cpu)) > > > > We should be using cpu_to_node(). > > > IIUC the "real fix" then is this change you mentioned above, together with > this xive patch as well, to stop using ibm,chip-id for good in the pserie > kernel. With these changes QEMU can remove 'ibm,chip-id' from the pseries > machine without impact. Is this correct? > > If that's the case, then I believe it's ok to go forward with the QEMU side > change (just for 6.0.0 and newer machines). Or should I wait for the kernel > changes to be merged upstream first? > I'd say the latter since this is a breaking change and people will want to identify the upstream commits they have to backport to their kernel in order to support the disappearance of "ibm,chip-id". Cheers, -- Greg > > Thanks, > > > DHB > > > > > > C. > > > >> > >> Since the value was used by some PAPR kernels - even if they shouldn't > >> have - I think we should only remove this for newer machine types. We > >> also need to check what we're not supplying that the guest kernel is > >> showing a different number of sockets than specified on the qemu > >> command line. > >> > >>> > >>> Thanks, > >>> > >>> C. > >>> > >>> > >>> > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> [...] > >>> > >> > >> > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property 2021-03-12 12:18 ` Daniel Henrique Barboza @ 2021-03-12 13:28 ` Cédric Le Goater -1 siblings, 0 replies; 38+ messages in thread From: Cédric Le Goater @ 2021-03-12 13:28 UTC (permalink / raw) To: Daniel Henrique Barboza, David Gibson Cc: list@suse.de:PowerPC, linuxppc-dev, Greg Kurz, QEMU Developers On 3/12/21 1:18 PM, Daniel Henrique Barboza wrote: > > > On 3/12/21 6:53 AM, Cédric Le Goater wrote: >> On 3/12/21 2:55 AM, David Gibson wrote: >>> On Tue, 9 Mar 2021 18:26:35 +0100 >>> Cédric Le Goater <clg@kaod.org> wrote: >>> >>>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: >>>>> >>>>> >>>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote: >>>>>> On 3/8/21 6:13 PM, Greg Kurz wrote: >>>>>>> On Wed, 3 Mar 2021 18:48:50 +0100 >>>>>>> Cédric Le Goater <clg@kaod.org> wrote: >>>>>>> >>>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a >>>>>>>> target for a source located on the same chip when possible. This field >>>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property >>>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined >>>>>>> >>>>>>> This sentence seems to have a syntax problem... like it is missing an >>>>>>> 'and' before 'on pSeries'. >>>>>> >>>>>> ah yes, or simply a comma. >>>>>> >>>>>>>> under PowerVM. The XIVE source structure has a similar field >>>>>>>> 'src_chip' which is only assigned on the PowerNV platform. >>>>>>>> >>>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the >>>>>>>> default node. It will also give us the opportunity to set the affinity >>>>>>>> of a source on pSeries when we can localize them. >>>>>>>> >>>>>>> >>>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id >>>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable >>>>>>> with this change. >>>>>> >>>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall >>>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel >>>>>> in Cc:) >>>> [...] >>>>>> >>>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find >>>>>> the node id. This value is built from the chip id in OPAL, so the >>>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" >>>>>> property are unlikely to be different. >>>>>> >>>>>> cpu_to_node(cpu) is used in many places to allocate the structures >>>>>> locally to the owning node. XIVE is not an exception (see below in the >>>>>> same patch), it is better to be consistent and get the same information >>>>>> (node id) using the same routine. >>>>>> >>>>>> >>>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : >>>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot >>>>>> unifies the controllers of the system to only expose one the OS. This >>>>>> is problematic and should be changed but it's another topic. >>>>>> >>>>>> >>>>>>> On the other hand, you have the pSeries case under PowerVM that >>>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. >>>>>> >>>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning >>>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid >>>>>> chip id. >>>>>> >>>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not >>>>>> always correct btw) >>>>> >>>>> >>>>> If you have a way to reliably reproduce this, let me know and I'll fix it >>>>> up in QEMU. >>>> >>>> with : >>>> >>>> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >>>> >>>> # dmesg | grep numa >>>> [ 0.013106] numa: Node 0 CPUs: 0-1 >>>> [ 0.013136] numa: Node 1 CPUs: 2-3 >>>> >>>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >>>> ibm,chip-id = <0x01>; >>>> ibm,chip-id = <0x02>; >>>> ibm,chip-id = <0x00>; >>>> ibm,chip-id = <0x03>; >>>> >>>> with : >>>> >>>> -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >>>> >>>> # dmesg | grep numa >>>> [ 0.013106] numa: Node 0 CPUs: 0-1 >>>> [ 0.013136] numa: Node 1 CPUs: 2-3 >>>> >>>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >>>> ibm,chip-id = <0x00>; >>>> ibm,chip-id = <0x00>; >>>> ibm,chip-id = <0x00>; >>>> ibm,chip-id = <0x00>; >>>> >>>> I think we should simply remove "ibm,chip-id" since it's not used and >>>> not in the PAPR spec. >>> >>> As I mentioned to Daniel on our call this morning, oddly it *does* >>> appear to be used in the RHEL kernel, even though that's 4.18 based. >>> This patch seems to have caused a minor regression; not in the >>> identification of NUMA nodes, but in the number of sockets shown be >>> lscpu, etc. See https://bugzilla.redhat.com/show_bug.cgi?id=1934421 >>> for more information. >> >> Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we >> remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 : >> >> [root@localhost ~]# lscpu >> Architecture: ppc64le >> Byte Order: Little Endian >> CPU(s): 128 >> On-line CPU(s) list: 0-127 >> Thread(s) per core: 4 >> Core(s) per socket: 16 >> Socket(s): 2 >> NUMA node(s): 2 >> Model: 2.2 (pvr 004e 1202) >> Model name: POWER9 (architected), altivec supported >> Hypervisor vendor: KVM >> Virtualization type: para >> L1d cache: 32K >> L1i cache: 32K >> NUMA node0 CPU(s): 0-63 >> NUMA node1 CPU(s): 64-127 >> >> [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id >> /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1 >> /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1 >> /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1 >> /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1 >> /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1 >> .... >> >> "ibm,chip-id" is still being used on some occasion on pSeries machines. >> This is wrong :/ The problem is : >> >> #define topology_physical_package_id(cpu) (cpu_to_chip_id(cpu)) >> >> We should be using cpu_to_node(). > > > IIUC the "real fix" then is this change you mentioned above, together with > this xive patch as well, These are independent. The XIVE patch just raised the issue because it's another usage example of cpu_to_chip_id() or directly "ibm,chip-id" in the XIVE case, on a pseries machine. The use of cpu_to_node(cpu) for topology_physical_package_id(cpu) is a fix for the sysfs issue reported in the redhat BZ. > to stop using ibm,chip-id for good in the pserie > kernel. With these changes QEMU can remove 'ibm,chip-id' from the pseries > machine without impact. Is this correct? Linux is already "broken" on PowerVM today since we don't have the "ibm,chip-id" property. QEMU is just hiding the problem on KVM. But we have to be bug compatible :) if the QEMU fix is under the pseries-6.x machine we should be fine. > If that's the case, then I believe it's ok to go forward with the QEMU side > change (just for 6.0.0 and newer machines). Or should I wait for the kernel > changes to be merged upstream first? Once Linux is fixed, we shouldn't care if QEMU exports 'ibm,chip-id' or not. I don't think the order is very important. These are independent. C. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property @ 2021-03-12 13:28 ` Cédric Le Goater 0 siblings, 0 replies; 38+ messages in thread From: Cédric Le Goater @ 2021-03-12 13:28 UTC (permalink / raw) To: Daniel Henrique Barboza, David Gibson Cc: list@suse.de:PowerPC, Michael Ellerman, linuxppc-dev, Greg Kurz, QEMU Developers On 3/12/21 1:18 PM, Daniel Henrique Barboza wrote: > > > On 3/12/21 6:53 AM, Cédric Le Goater wrote: >> On 3/12/21 2:55 AM, David Gibson wrote: >>> On Tue, 9 Mar 2021 18:26:35 +0100 >>> Cédric Le Goater <clg@kaod.org> wrote: >>> >>>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote: >>>>> >>>>> >>>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote: >>>>>> On 3/8/21 6:13 PM, Greg Kurz wrote: >>>>>>> On Wed, 3 Mar 2021 18:48:50 +0100 >>>>>>> Cédric Le Goater <clg@kaod.org> wrote: >>>>>>> >>>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a >>>>>>>> target for a source located on the same chip when possible. This field >>>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property >>>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined >>>>>>> >>>>>>> This sentence seems to have a syntax problem... like it is missing an >>>>>>> 'and' before 'on pSeries'. >>>>>> >>>>>> ah yes, or simply a comma. >>>>>> >>>>>>>> under PowerVM. The XIVE source structure has a similar field >>>>>>>> 'src_chip' which is only assigned on the PowerNV platform. >>>>>>>> >>>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the >>>>>>>> default node. It will also give us the opportunity to set the affinity >>>>>>>> of a source on pSeries when we can localize them. >>>>>>>> >>>>>>> >>>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id >>>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable >>>>>>> with this change. >>>>>> >>>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall >>>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel >>>>>> in Cc:) >>>> [...] >>>>>> >>>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find >>>>>> the node id. This value is built from the chip id in OPAL, so the >>>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id" >>>>>> property are unlikely to be different. >>>>>> >>>>>> cpu_to_node(cpu) is used in many places to allocate the structures >>>>>> locally to the owning node. XIVE is not an exception (see below in the >>>>>> same patch), it is better to be consistent and get the same information >>>>>> (node id) using the same routine. >>>>>> >>>>>> >>>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : >>>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot >>>>>> unifies the controllers of the system to only expose one the OS. This >>>>>> is problematic and should be changed but it's another topic. >>>>>> >>>>>> >>>>>>> On the other hand, you have the pSeries case under PowerVM that >>>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT. >>>>>> >>>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning >>>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid >>>>>> chip id. >>>>>> >>>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not >>>>>> always correct btw) >>>>> >>>>> >>>>> If you have a way to reliably reproduce this, let me know and I'll fix it >>>>> up in QEMU. >>>> >>>> with : >>>> >>>> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >>>> >>>> # dmesg | grep numa >>>> [ 0.013106] numa: Node 0 CPUs: 0-1 >>>> [ 0.013136] numa: Node 1 CPUs: 2-3 >>>> >>>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >>>> ibm,chip-id = <0x01>; >>>> ibm,chip-id = <0x02>; >>>> ibm,chip-id = <0x00>; >>>> ibm,chip-id = <0x03>; >>>> >>>> with : >>>> >>>> -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >>>> >>>> # dmesg | grep numa >>>> [ 0.013106] numa: Node 0 CPUs: 0-1 >>>> [ 0.013136] numa: Node 1 CPUs: 2-3 >>>> >>>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >>>> ibm,chip-id = <0x00>; >>>> ibm,chip-id = <0x00>; >>>> ibm,chip-id = <0x00>; >>>> ibm,chip-id = <0x00>; >>>> >>>> I think we should simply remove "ibm,chip-id" since it's not used and >>>> not in the PAPR spec. >>> >>> As I mentioned to Daniel on our call this morning, oddly it *does* >>> appear to be used in the RHEL kernel, even though that's 4.18 based. >>> This patch seems to have caused a minor regression; not in the >>> identification of NUMA nodes, but in the number of sockets shown be >>> lscpu, etc. See https://bugzilla.redhat.com/show_bug.cgi?id=1934421 >>> for more information. >> >> Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we >> remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 : >> >> [root@localhost ~]# lscpu >> Architecture: ppc64le >> Byte Order: Little Endian >> CPU(s): 128 >> On-line CPU(s) list: 0-127 >> Thread(s) per core: 4 >> Core(s) per socket: 16 >> Socket(s): 2 >> NUMA node(s): 2 >> Model: 2.2 (pvr 004e 1202) >> Model name: POWER9 (architected), altivec supported >> Hypervisor vendor: KVM >> Virtualization type: para >> L1d cache: 32K >> L1i cache: 32K >> NUMA node0 CPU(s): 0-63 >> NUMA node1 CPU(s): 64-127 >> >> [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id >> /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1 >> /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1 >> /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1 >> /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1 >> /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1 >> .... >> >> "ibm,chip-id" is still being used on some occasion on pSeries machines. >> This is wrong :/ The problem is : >> >> #define topology_physical_package_id(cpu) (cpu_to_chip_id(cpu)) >> >> We should be using cpu_to_node(). > > > IIUC the "real fix" then is this change you mentioned above, together with > this xive patch as well, These are independent. The XIVE patch just raised the issue because it's another usage example of cpu_to_chip_id() or directly "ibm,chip-id" in the XIVE case, on a pseries machine. The use of cpu_to_node(cpu) for topology_physical_package_id(cpu) is a fix for the sysfs issue reported in the redhat BZ. > to stop using ibm,chip-id for good in the pserie > kernel. With these changes QEMU can remove 'ibm,chip-id' from the pseries > machine without impact. Is this correct? Linux is already "broken" on PowerVM today since we don't have the "ibm,chip-id" property. QEMU is just hiding the problem on KVM. But we have to be bug compatible :) if the QEMU fix is under the pseries-6.x machine we should be fine. > If that's the case, then I believe it's ok to go forward with the QEMU side > change (just for 6.0.0 and newer machines). Or should I wait for the kernel > changes to be merged upstream first? Once Linux is fixed, we shouldn't care if QEMU exports 'ibm,chip-id' or not. I don't think the order is very important. These are independent. C. ^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH v2 2/8] powerpc/xive: Introduce an IPI interrupt domain 2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater 2021-03-03 17:48 ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm, chip-id property Cédric Le Goater @ 2021-03-03 17:48 ` Cédric Le Goater 2021-03-08 17:55 ` Greg Kurz 2021-03-03 17:48 ` [PATCH v2 3/8] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ Cédric Le Goater ` (5 subsequent siblings) 7 siblings, 1 reply; 38+ messages in thread From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw) To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater The IPI interrupt is a special case of the XIVE IRQ domain. When mapping and unmapping the interrupts in the Linux interrupt number space, the HW interrupt number 0 (XIVE_IPI_HW_IRQ) is checked to distinguish the IPI interrupt from other interrupts of the system. Simplify the XIVE interrupt domain by introducing a specific domain for the IPI. Signed-off-by: Cédric Le Goater <clg@kaod.org> --- arch/powerpc/sysdev/xive/common.c | 51 +++++++++++++------------------ 1 file changed, 22 insertions(+), 29 deletions(-) diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index b8e456da28aa..e7783760d278 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -63,6 +63,8 @@ static const struct xive_ops *xive_ops; static struct irq_domain *xive_irq_domain; #ifdef CONFIG_SMP +static struct irq_domain *xive_ipi_irq_domain; + /* The IPIs all use the same logical irq number */ static u32 xive_ipi_irq; #endif @@ -1067,20 +1069,32 @@ static struct irq_chip xive_ipi_chip = { .irq_unmask = xive_ipi_do_nothing, }; +/* + * IPIs are marked per-cpu. We use separate HW interrupts under the + * hood but associated with the same "linux" interrupt + */ +static int xive_ipi_irq_domain_map(struct irq_domain *h, unsigned int virq, + irq_hw_number_t hw) +{ + irq_set_chip_and_handler(virq, &xive_ipi_chip, handle_percpu_irq); + return 0; +} + +static const struct irq_domain_ops xive_ipi_irq_domain_ops = { + .map = xive_ipi_irq_domain_map, +}; + static void __init xive_request_ipi(void) { unsigned int virq; - /* - * Initialization failed, move on, we might manage to - * reach the point where we display our errors before - * the system falls appart - */ - if (!xive_irq_domain) + xive_ipi_irq_domain = irq_domain_add_linear(NULL, 1, + &xive_ipi_irq_domain_ops, NULL); + if (WARN_ON(xive_ipi_irq_domain == NULL)) return; /* Initialize it */ - virq = irq_create_mapping(xive_irq_domain, XIVE_IPI_HW_IRQ); + virq = irq_create_mapping(xive_ipi_irq_domain, XIVE_IPI_HW_IRQ); xive_ipi_irq = virq; WARN_ON(request_irq(virq, xive_muxed_ipi_action, @@ -1178,19 +1192,6 @@ static int xive_irq_domain_map(struct irq_domain *h, unsigned int virq, */ irq_clear_status_flags(virq, IRQ_LEVEL); -#ifdef CONFIG_SMP - /* IPIs are special and come up with HW number 0 */ - if (hw == XIVE_IPI_HW_IRQ) { - /* - * IPIs are marked per-cpu. We use separate HW interrupts under - * the hood but associated with the same "linux" interrupt - */ - irq_set_chip_and_handler(virq, &xive_ipi_chip, - handle_percpu_irq); - return 0; - } -#endif - rc = xive_irq_alloc_data(virq, hw); if (rc) return rc; @@ -1202,15 +1203,7 @@ static int xive_irq_domain_map(struct irq_domain *h, unsigned int virq, static void xive_irq_domain_unmap(struct irq_domain *d, unsigned int virq) { - struct irq_data *data = irq_get_irq_data(virq); - unsigned int hw_irq; - - /* XXX Assign BAD number */ - if (!data) - return; - hw_irq = (unsigned int)irqd_to_hwirq(data); - if (hw_irq != XIVE_IPI_HW_IRQ) - xive_irq_free_data(virq); + xive_irq_free_data(virq); } static int xive_irq_domain_xlate(struct irq_domain *h, struct device_node *ct, -- 2.26.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 2/8] powerpc/xive: Introduce an IPI interrupt domain 2021-03-03 17:48 ` [PATCH v2 2/8] powerpc/xive: Introduce an IPI interrupt domain Cédric Le Goater @ 2021-03-08 17:55 ` Greg Kurz 0 siblings, 0 replies; 38+ messages in thread From: Greg Kurz @ 2021-03-08 17:55 UTC (permalink / raw) To: Cédric Le Goater; +Cc: linuxppc-dev On Wed, 3 Mar 2021 18:48:51 +0100 Cédric Le Goater <clg@kaod.org> wrote: > The IPI interrupt is a special case of the XIVE IRQ domain. When > mapping and unmapping the interrupts in the Linux interrupt number > space, the HW interrupt number 0 (XIVE_IPI_HW_IRQ) is checked to > distinguish the IPI interrupt from other interrupts of the system. > > Simplify the XIVE interrupt domain by introducing a specific domain > for the IPI. > > Signed-off-by: Cédric Le Goater <clg@kaod.org> > --- Nice ! Reviewed-by: Greg Kurz <groug@kaod.org> > arch/powerpc/sysdev/xive/common.c | 51 +++++++++++++------------------ > 1 file changed, 22 insertions(+), 29 deletions(-) > > diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c > index b8e456da28aa..e7783760d278 100644 > --- a/arch/powerpc/sysdev/xive/common.c > +++ b/arch/powerpc/sysdev/xive/common.c > @@ -63,6 +63,8 @@ static const struct xive_ops *xive_ops; > static struct irq_domain *xive_irq_domain; > > #ifdef CONFIG_SMP > +static struct irq_domain *xive_ipi_irq_domain; > + > /* The IPIs all use the same logical irq number */ > static u32 xive_ipi_irq; > #endif > @@ -1067,20 +1069,32 @@ static struct irq_chip xive_ipi_chip = { > .irq_unmask = xive_ipi_do_nothing, > }; > > +/* > + * IPIs are marked per-cpu. We use separate HW interrupts under the > + * hood but associated with the same "linux" interrupt > + */ > +static int xive_ipi_irq_domain_map(struct irq_domain *h, unsigned int virq, > + irq_hw_number_t hw) > +{ > + irq_set_chip_and_handler(virq, &xive_ipi_chip, handle_percpu_irq); > + return 0; > +} > + > +static const struct irq_domain_ops xive_ipi_irq_domain_ops = { > + .map = xive_ipi_irq_domain_map, > +}; > + > static void __init xive_request_ipi(void) > { > unsigned int virq; > > - /* > - * Initialization failed, move on, we might manage to > - * reach the point where we display our errors before > - * the system falls appart > - */ > - if (!xive_irq_domain) > + xive_ipi_irq_domain = irq_domain_add_linear(NULL, 1, > + &xive_ipi_irq_domain_ops, NULL); > + if (WARN_ON(xive_ipi_irq_domain == NULL)) > return; > > /* Initialize it */ > - virq = irq_create_mapping(xive_irq_domain, XIVE_IPI_HW_IRQ); > + virq = irq_create_mapping(xive_ipi_irq_domain, XIVE_IPI_HW_IRQ); > xive_ipi_irq = virq; > > WARN_ON(request_irq(virq, xive_muxed_ipi_action, > @@ -1178,19 +1192,6 @@ static int xive_irq_domain_map(struct irq_domain *h, unsigned int virq, > */ > irq_clear_status_flags(virq, IRQ_LEVEL); > > -#ifdef CONFIG_SMP > - /* IPIs are special and come up with HW number 0 */ > - if (hw == XIVE_IPI_HW_IRQ) { > - /* > - * IPIs are marked per-cpu. We use separate HW interrupts under > - * the hood but associated with the same "linux" interrupt > - */ > - irq_set_chip_and_handler(virq, &xive_ipi_chip, > - handle_percpu_irq); > - return 0; > - } > -#endif > - > rc = xive_irq_alloc_data(virq, hw); > if (rc) > return rc; > @@ -1202,15 +1203,7 @@ static int xive_irq_domain_map(struct irq_domain *h, unsigned int virq, > > static void xive_irq_domain_unmap(struct irq_domain *d, unsigned int virq) > { > - struct irq_data *data = irq_get_irq_data(virq); > - unsigned int hw_irq; > - > - /* XXX Assign BAD number */ > - if (!data) > - return; > - hw_irq = (unsigned int)irqd_to_hwirq(data); > - if (hw_irq != XIVE_IPI_HW_IRQ) > - xive_irq_free_data(virq); > + xive_irq_free_data(virq); > } > > static int xive_irq_domain_xlate(struct irq_domain *h, struct device_node *ct, ^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH v2 3/8] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ 2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater 2021-03-03 17:48 ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm, chip-id property Cédric Le Goater 2021-03-03 17:48 ` [PATCH v2 2/8] powerpc/xive: Introduce an IPI interrupt domain Cédric Le Goater @ 2021-03-03 17:48 ` Cédric Le Goater 2021-03-08 17:56 ` Greg Kurz 2021-03-03 17:48 ` [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show() Cédric Le Goater ` (4 subsequent siblings) 7 siblings, 1 reply; 38+ messages in thread From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw) To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater The IPI interrupt has its own domain now. Testing the HW interrupt number is not needed anymore. Signed-off-by: Cédric Le Goater <clg@kaod.org> --- arch/powerpc/sysdev/xive/common.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index e7783760d278..678680531d26 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -1396,13 +1396,12 @@ static void xive_flush_cpu_queue(unsigned int cpu, struct xive_cpu *xc) struct irq_desc *desc = irq_to_desc(irq); struct irq_data *d = irq_desc_get_irq_data(desc); struct xive_irq_data *xd; - unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d); /* * Ignore anything that isn't a XIVE irq and ignore * IPIs, so can just be dropped. */ - if (d->domain != xive_irq_domain || hw_irq == XIVE_IPI_HW_IRQ) + if (d->domain != xive_irq_domain) continue; /* -- 2.26.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 3/8] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ 2021-03-03 17:48 ` [PATCH v2 3/8] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ Cédric Le Goater @ 2021-03-08 17:56 ` Greg Kurz 0 siblings, 0 replies; 38+ messages in thread From: Greg Kurz @ 2021-03-08 17:56 UTC (permalink / raw) To: Cédric Le Goater; +Cc: linuxppc-dev On Wed, 3 Mar 2021 18:48:52 +0100 Cédric Le Goater <clg@kaod.org> wrote: > The IPI interrupt has its own domain now. Testing the HW interrupt > number is not needed anymore. > > Signed-off-by: Cédric Le Goater <clg@kaod.org> > --- Reviewed-by: Greg Kurz <groug@kaod.org> > arch/powerpc/sysdev/xive/common.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c > index e7783760d278..678680531d26 100644 > --- a/arch/powerpc/sysdev/xive/common.c > +++ b/arch/powerpc/sysdev/xive/common.c > @@ -1396,13 +1396,12 @@ static void xive_flush_cpu_queue(unsigned int cpu, struct xive_cpu *xc) > struct irq_desc *desc = irq_to_desc(irq); > struct irq_data *d = irq_desc_get_irq_data(desc); > struct xive_irq_data *xd; > - unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d); > > /* > * Ignore anything that isn't a XIVE irq and ignore > * IPIs, so can just be dropped. > */ > - if (d->domain != xive_irq_domain || hw_irq == XIVE_IPI_HW_IRQ) > + if (d->domain != xive_irq_domain) > continue; > > /* ^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show() 2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater ` (2 preceding siblings ...) 2021-03-03 17:48 ` [PATCH v2 3/8] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ Cédric Le Goater @ 2021-03-03 17:48 ` Cédric Le Goater 2021-03-08 18:07 ` Greg Kurz 2021-03-03 17:48 ` [PATCH v2 5/8] powerpc/xive: Drop check on irq_data in xive_core_debug_show() Cédric Le Goater ` (3 subsequent siblings) 7 siblings, 1 reply; 38+ messages in thread From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw) To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater Now that the IPI interrupt has its own domain, the checks on the HW interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a check on the domain. Signed-off-by: Cédric Le Goater <clg@kaod.org> --- arch/powerpc/sysdev/xive/common.c | 18 ++++-------------- 1 file changed, 4 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index 678680531d26..7581cb12bb53 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -1579,17 +1579,14 @@ static void xive_debug_show_cpu(struct seq_file *m, int cpu) seq_puts(m, "\n"); } -static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct irq_data *d) +static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d) { - struct irq_chip *chip = irq_data_get_irq_chip(d); + unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d); int rc; u32 target; u8 prio; u32 lirq; - if (!is_xive_irq(chip)) - return; - rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq); if (rc) { seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc); @@ -1627,16 +1624,9 @@ static int xive_core_debug_show(struct seq_file *m, void *private) for_each_irq_desc(i, desc) { struct irq_data *d = irq_desc_get_irq_data(desc); - unsigned int hw_irq; - - if (!d) - continue; - - hw_irq = (unsigned int)irqd_to_hwirq(d); - /* IPIs are special (HW number 0) */ - if (hw_irq != XIVE_IPI_HW_IRQ) - xive_debug_show_irq(m, hw_irq, d); + if (d->domain == xive_irq_domain) + xive_debug_show_irq(m, d); } return 0; } -- 2.26.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show() 2021-03-03 17:48 ` [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show() Cédric Le Goater @ 2021-03-08 18:07 ` Greg Kurz 2021-03-08 18:11 ` Cédric Le Goater 0 siblings, 1 reply; 38+ messages in thread From: Greg Kurz @ 2021-03-08 18:07 UTC (permalink / raw) To: Cédric Le Goater; +Cc: linuxppc-dev On Wed, 3 Mar 2021 18:48:53 +0100 Cédric Le Goater <clg@kaod.org> wrote: > Now that the IPI interrupt has its own domain, the checks on the HW > interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a > check on the domain. > > Signed-off-by: Cédric Le Goater <clg@kaod.org> > --- Shouldn't this have the following tags ? Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state") Anyway, Reviewed-by: Greg Kurz <groug@kaod.org> > arch/powerpc/sysdev/xive/common.c | 18 ++++-------------- > 1 file changed, 4 insertions(+), 14 deletions(-) > > diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c > index 678680531d26..7581cb12bb53 100644 > --- a/arch/powerpc/sysdev/xive/common.c > +++ b/arch/powerpc/sysdev/xive/common.c > @@ -1579,17 +1579,14 @@ static void xive_debug_show_cpu(struct seq_file *m, int cpu) > seq_puts(m, "\n"); > } > > -static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct irq_data *d) > +static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d) > { > - struct irq_chip *chip = irq_data_get_irq_chip(d); > + unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d); > int rc; > u32 target; > u8 prio; > u32 lirq; > > - if (!is_xive_irq(chip)) > - return; > - > rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq); > if (rc) { > seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc); > @@ -1627,16 +1624,9 @@ static int xive_core_debug_show(struct seq_file *m, void *private) > > for_each_irq_desc(i, desc) { > struct irq_data *d = irq_desc_get_irq_data(desc); > - unsigned int hw_irq; > - > - if (!d) > - continue; > - > - hw_irq = (unsigned int)irqd_to_hwirq(d); > > - /* IPIs are special (HW number 0) */ > - if (hw_irq != XIVE_IPI_HW_IRQ) > - xive_debug_show_irq(m, hw_irq, d); > + if (d->domain == xive_irq_domain) > + xive_debug_show_irq(m, d); > } > return 0; > } ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show() 2021-03-08 18:07 ` Greg Kurz @ 2021-03-08 18:11 ` Cédric Le Goater 2021-03-09 9:13 ` Greg Kurz 0 siblings, 1 reply; 38+ messages in thread From: Cédric Le Goater @ 2021-03-08 18:11 UTC (permalink / raw) To: Greg Kurz; +Cc: linuxppc-dev On 3/8/21 7:07 PM, Greg Kurz wrote: > On Wed, 3 Mar 2021 18:48:53 +0100 > Cédric Le Goater <clg@kaod.org> wrote: > >> Now that the IPI interrupt has its own domain, the checks on the HW >> interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a >> check on the domain. >> >> Signed-off-by: Cédric Le Goater <clg@kaod.org> >> --- > > Shouldn't this have the following tags ? > > Reported-by: kernel test robot <lkp@intel.com> > Reported-by: Dan Carpenter <dan.carpenter@oracle.com> > Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state") The next patch has because it removes the useless check on irq_data. C. > > Anyway, > > Reviewed-by: Greg Kurz <groug@kaod.org> > >> arch/powerpc/sysdev/xive/common.c | 18 ++++-------------- >> 1 file changed, 4 insertions(+), 14 deletions(-) >> >> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c >> index 678680531d26..7581cb12bb53 100644 >> --- a/arch/powerpc/sysdev/xive/common.c >> +++ b/arch/powerpc/sysdev/xive/common.c >> @@ -1579,17 +1579,14 @@ static void xive_debug_show_cpu(struct seq_file *m, int cpu) >> seq_puts(m, "\n"); >> } >> >> -static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct irq_data *d) >> +static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d) >> { >> - struct irq_chip *chip = irq_data_get_irq_chip(d); >> + unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d); >> int rc; >> u32 target; >> u8 prio; >> u32 lirq; >> >> - if (!is_xive_irq(chip)) >> - return; >> - >> rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq); >> if (rc) { >> seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc); >> @@ -1627,16 +1624,9 @@ static int xive_core_debug_show(struct seq_file *m, void *private) >> >> for_each_irq_desc(i, desc) { >> struct irq_data *d = irq_desc_get_irq_data(desc); >> - unsigned int hw_irq; >> - >> - if (!d) >> - continue; >> - >> - hw_irq = (unsigned int)irqd_to_hwirq(d); >> >> - /* IPIs are special (HW number 0) */ >> - if (hw_irq != XIVE_IPI_HW_IRQ) >> - xive_debug_show_irq(m, hw_irq, d); >> + if (d->domain == xive_irq_domain) >> + xive_debug_show_irq(m, d); >> } >> return 0; >> } > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show() 2021-03-08 18:11 ` Cédric Le Goater @ 2021-03-09 9:13 ` Greg Kurz 2021-03-09 9:42 ` Greg Kurz 0 siblings, 1 reply; 38+ messages in thread From: Greg Kurz @ 2021-03-09 9:13 UTC (permalink / raw) To: Cédric Le Goater; +Cc: linuxppc-dev On Mon, 8 Mar 2021 19:11:11 +0100 Cédric Le Goater <clg@kaod.org> wrote: > On 3/8/21 7:07 PM, Greg Kurz wrote: > > On Wed, 3 Mar 2021 18:48:53 +0100 > > Cédric Le Goater <clg@kaod.org> wrote: > > > >> Now that the IPI interrupt has its own domain, the checks on the HW > >> interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a > >> check on the domain. > >> > >> Signed-off-by: Cédric Le Goater <clg@kaod.org> > >> --- > > > > Shouldn't this have the following tags ? > > > > Reported-by: kernel test robot <lkp@intel.com> > > Reported-by: Dan Carpenter <dan.carpenter@oracle.com> > > Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state") > > The next patch has because it removes the useless check on irq_data. > Ok I get it. This report isn't about an actual crash. Just a false positive because of the not needed check in the caller. > C. > > > > > Anyway, > > > > Reviewed-by: Greg Kurz <groug@kaod.org> > > > >> arch/powerpc/sysdev/xive/common.c | 18 ++++-------------- > >> 1 file changed, 4 insertions(+), 14 deletions(-) > >> > >> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c > >> index 678680531d26..7581cb12bb53 100644 > >> --- a/arch/powerpc/sysdev/xive/common.c > >> +++ b/arch/powerpc/sysdev/xive/common.c > >> @@ -1579,17 +1579,14 @@ static void xive_debug_show_cpu(struct seq_file *m, int cpu) > >> seq_puts(m, "\n"); > >> } > >> > >> -static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct irq_data *d) > >> +static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d) > >> { > >> - struct irq_chip *chip = irq_data_get_irq_chip(d); > >> + unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d); > >> int rc; > >> u32 target; > >> u8 prio; > >> u32 lirq; > >> > >> - if (!is_xive_irq(chip)) > >> - return; > >> - > >> rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq); > >> if (rc) { > >> seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc); > >> @@ -1627,16 +1624,9 @@ static int xive_core_debug_show(struct seq_file *m, void *private) > >> > >> for_each_irq_desc(i, desc) { > >> struct irq_data *d = irq_desc_get_irq_data(desc); > >> - unsigned int hw_irq; > >> - > >> - if (!d) > >> - continue; > >> - > >> - hw_irq = (unsigned int)irqd_to_hwirq(d); > >> > >> - /* IPIs are special (HW number 0) */ > >> - if (hw_irq != XIVE_IPI_HW_IRQ) > >> - xive_debug_show_irq(m, hw_irq, d); > >> + if (d->domain == xive_irq_domain) > >> + xive_debug_show_irq(m, d); > >> } > >> return 0; > >> } > > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show() 2021-03-09 9:13 ` Greg Kurz @ 2021-03-09 9:42 ` Greg Kurz 2021-03-09 15:39 ` Cédric Le Goater 0 siblings, 1 reply; 38+ messages in thread From: Greg Kurz @ 2021-03-09 9:42 UTC (permalink / raw) To: Cédric Le Goater; +Cc: linuxppc-dev On Tue, 9 Mar 2021 10:13:39 +0100 Greg Kurz <groug@kaod.org> wrote: > On Mon, 8 Mar 2021 19:11:11 +0100 > Cédric Le Goater <clg@kaod.org> wrote: > > > On 3/8/21 7:07 PM, Greg Kurz wrote: > > > On Wed, 3 Mar 2021 18:48:53 +0100 > > > Cédric Le Goater <clg@kaod.org> wrote: > > > > > >> Now that the IPI interrupt has its own domain, the checks on the HW > > >> interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a > > >> check on the domain. > > >> > > >> Signed-off-by: Cédric Le Goater <clg@kaod.org> > > >> --- > > > > > > Shouldn't this have the following tags ? > > > > > > Reported-by: kernel test robot <lkp@intel.com> > > > Reported-by: Dan Carpenter <dan.carpenter@oracle.com> > > > Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state") > > > > The next patch has because it removes the useless check on irq_data. > > > > Ok I get it. This report isn't about an actual crash. Just a false > positive because of the not needed check in the caller. > Hrm... I meant because of the check in xive_debug_show_irq(). On the contrary, the check removed by this patch in xive_core_debug_show() was rather an explicit hint that xive_debug_show_irq() couldn't be called with d being NULL. > > C. > > > > > > > > Anyway, > > > > > > Reviewed-by: Greg Kurz <groug@kaod.org> > > > > > >> arch/powerpc/sysdev/xive/common.c | 18 ++++-------------- > > >> 1 file changed, 4 insertions(+), 14 deletions(-) > > >> > > >> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c > > >> index 678680531d26..7581cb12bb53 100644 > > >> --- a/arch/powerpc/sysdev/xive/common.c > > >> +++ b/arch/powerpc/sysdev/xive/common.c > > >> @@ -1579,17 +1579,14 @@ static void xive_debug_show_cpu(struct seq_file *m, int cpu) > > >> seq_puts(m, "\n"); > > >> } > > >> > > >> -static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct irq_data *d) > > >> +static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d) > > >> { > > >> - struct irq_chip *chip = irq_data_get_irq_chip(d); > > >> + unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d); > > >> int rc; > > >> u32 target; > > >> u8 prio; > > >> u32 lirq; > > >> > > >> - if (!is_xive_irq(chip)) > > >> - return; > > >> - > > >> rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq); > > >> if (rc) { > > >> seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc); > > >> @@ -1627,16 +1624,9 @@ static int xive_core_debug_show(struct seq_file *m, void *private) > > >> > > >> for_each_irq_desc(i, desc) { > > >> struct irq_data *d = irq_desc_get_irq_data(desc); > > >> - unsigned int hw_irq; > > >> - > > >> - if (!d) > > >> - continue; > > >> - > > >> - hw_irq = (unsigned int)irqd_to_hwirq(d); > > >> > > >> - /* IPIs are special (HW number 0) */ > > >> - if (hw_irq != XIVE_IPI_HW_IRQ) > > >> - xive_debug_show_irq(m, hw_irq, d); > > >> + if (d->domain == xive_irq_domain) > > >> + xive_debug_show_irq(m, d); > > >> } > > >> return 0; > > >> } > > > > > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show() 2021-03-09 9:42 ` Greg Kurz @ 2021-03-09 15:39 ` Cédric Le Goater 0 siblings, 0 replies; 38+ messages in thread From: Cédric Le Goater @ 2021-03-09 15:39 UTC (permalink / raw) To: Greg Kurz; +Cc: linuxppc-dev On 3/9/21 10:42 AM, Greg Kurz wrote: > On Tue, 9 Mar 2021 10:13:39 +0100 > Greg Kurz <groug@kaod.org> wrote: > >> On Mon, 8 Mar 2021 19:11:11 +0100 >> Cédric Le Goater <clg@kaod.org> wrote: >> >>> On 3/8/21 7:07 PM, Greg Kurz wrote: >>>> On Wed, 3 Mar 2021 18:48:53 +0100 >>>> Cédric Le Goater <clg@kaod.org> wrote: >>>> >>>>> Now that the IPI interrupt has its own domain, the checks on the HW >>>>> interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a >>>>> check on the domain. >>>>> >>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org> >>>>> --- >>>> >>>> Shouldn't this have the following tags ? >>>> >>>> Reported-by: kernel test robot <lkp@intel.com> >>>> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> >>>> Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state") >>> >>> The next patch has because it removes the useless check on irq_data. >>> >> >> Ok I get it. This report isn't about an actual crash. Just a false >> positive because of the not needed check in the caller. >> > > Hrm... I meant because of the check in xive_debug_show_irq(). On the > contrary, the check removed by this patch in xive_core_debug_show() > was rather an explicit hint that xive_debug_show_irq() couldn't be > called with d being NULL. yes. irq_desc_get_irq_data() does not return a NULL value and xive_debug_show_irq() is only called from the for_each_irq_desc() loop. C. > >>> C. >>> >>>> >>>> Anyway, >>>> >>>> Reviewed-by: Greg Kurz <groug@kaod.org> >>>> >>>>> arch/powerpc/sysdev/xive/common.c | 18 ++++-------------- >>>>> 1 file changed, 4 insertions(+), 14 deletions(-) >>>>> >>>>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c >>>>> index 678680531d26..7581cb12bb53 100644 >>>>> --- a/arch/powerpc/sysdev/xive/common.c >>>>> +++ b/arch/powerpc/sysdev/xive/common.c >>>>> @@ -1579,17 +1579,14 @@ static void xive_debug_show_cpu(struct seq_file *m, int cpu) >>>>> seq_puts(m, "\n"); >>>>> } >>>>> >>>>> -static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct irq_data *d) >>>>> +static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d) >>>>> { >>>>> - struct irq_chip *chip = irq_data_get_irq_chip(d); >>>>> + unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d); >>>>> int rc; >>>>> u32 target; >>>>> u8 prio; >>>>> u32 lirq; >>>>> >>>>> - if (!is_xive_irq(chip)) >>>>> - return; >>>>> - >>>>> rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq); >>>>> if (rc) { >>>>> seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc); >>>>> @@ -1627,16 +1624,9 @@ static int xive_core_debug_show(struct seq_file *m, void *private) >>>>> >>>>> for_each_irq_desc(i, desc) { >>>>> struct irq_data *d = irq_desc_get_irq_data(desc); >>>>> - unsigned int hw_irq; >>>>> - >>>>> - if (!d) >>>>> - continue; >>>>> - >>>>> - hw_irq = (unsigned int)irqd_to_hwirq(d); >>>>> >>>>> - /* IPIs are special (HW number 0) */ >>>>> - if (hw_irq != XIVE_IPI_HW_IRQ) >>>>> - xive_debug_show_irq(m, hw_irq, d); >>>>> + if (d->domain == xive_irq_domain) >>>>> + xive_debug_show_irq(m, d); >>>>> } >>>>> return 0; >>>>> } >>>> >>> >> > ^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH v2 5/8] powerpc/xive: Drop check on irq_data in xive_core_debug_show() 2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater ` (3 preceding siblings ...) 2021-03-03 17:48 ` [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show() Cédric Le Goater @ 2021-03-03 17:48 ` Cédric Le Goater 2021-03-09 9:18 ` Greg Kurz 2021-03-03 17:48 ` [PATCH v2 6/8] powerpc/xive: Simplify the dump of XIVE interrupts under xmon Cédric Le Goater ` (2 subsequent siblings) 7 siblings, 1 reply; 38+ messages in thread From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw) To: linuxppc-dev Cc: kernel test robot, Greg Kurz, Dan Carpenter, Cédric Le Goater When looping on IRQ descriptor, irq_data is always valid. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state") Signed-off-by: Cédric Le Goater <clg@kaod.org> --- arch/powerpc/sysdev/xive/common.c | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index 7581cb12bb53..60ebd6f4b31d 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -1586,6 +1586,8 @@ static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d) u32 target; u8 prio; u32 lirq; + struct xive_irq_data *xd; + u64 val; rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq); if (rc) { @@ -1596,17 +1598,14 @@ static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d) seq_printf(m, "IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ", hw_irq, target, prio, lirq); - if (d) { - struct xive_irq_data *xd = irq_data_get_irq_handler_data(d); - u64 val = xive_esb_read(xd, XIVE_ESB_GET); - - seq_printf(m, "flags=%c%c%c PQ=%c%c", - xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ', - xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ', - xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ', - val & XIVE_ESB_VAL_P ? 'P' : '-', - val & XIVE_ESB_VAL_Q ? 'Q' : '-'); - } + xd = irq_data_get_irq_handler_data(d); + val = xive_esb_read(xd, XIVE_ESB_GET); + seq_printf(m, "flags=%c%c%c PQ=%c%c", + xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ', + xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ', + xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ', + val & XIVE_ESB_VAL_P ? 'P' : '-', + val & XIVE_ESB_VAL_Q ? 'Q' : '-'); seq_puts(m, "\n"); } -- 2.26.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 5/8] powerpc/xive: Drop check on irq_data in xive_core_debug_show() 2021-03-03 17:48 ` [PATCH v2 5/8] powerpc/xive: Drop check on irq_data in xive_core_debug_show() Cédric Le Goater @ 2021-03-09 9:18 ` Greg Kurz 0 siblings, 0 replies; 38+ messages in thread From: Greg Kurz @ 2021-03-09 9:18 UTC (permalink / raw) To: Cédric Le Goater; +Cc: linuxppc-dev, kernel test robot, Dan Carpenter On Wed, 3 Mar 2021 18:48:54 +0100 Cédric Le Goater <clg@kaod.org> wrote: > When looping on IRQ descriptor, irq_data is always valid. > > Reported-by: kernel test robot <lkp@intel.com> > Reported-by: Dan Carpenter <dan.carpenter@oracle.com> > Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state") > Signed-off-by: Cédric Le Goater <clg@kaod.org> > --- Reviewed-by: Greg Kurz <groug@kaod.org> > arch/powerpc/sysdev/xive/common.c | 21 ++++++++++----------- > 1 file changed, 10 insertions(+), 11 deletions(-) > > diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c > index 7581cb12bb53..60ebd6f4b31d 100644 > --- a/arch/powerpc/sysdev/xive/common.c > +++ b/arch/powerpc/sysdev/xive/common.c > @@ -1586,6 +1586,8 @@ static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d) > u32 target; > u8 prio; > u32 lirq; > + struct xive_irq_data *xd; > + u64 val; > > rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq); > if (rc) { > @@ -1596,17 +1598,14 @@ static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d) > seq_printf(m, "IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ", > hw_irq, target, prio, lirq); > > - if (d) { > - struct xive_irq_data *xd = irq_data_get_irq_handler_data(d); > - u64 val = xive_esb_read(xd, XIVE_ESB_GET); > - > - seq_printf(m, "flags=%c%c%c PQ=%c%c", > - xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ', > - xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ', > - xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ', > - val & XIVE_ESB_VAL_P ? 'P' : '-', > - val & XIVE_ESB_VAL_Q ? 'Q' : '-'); > - } > + xd = irq_data_get_irq_handler_data(d); > + val = xive_esb_read(xd, XIVE_ESB_GET); > + seq_printf(m, "flags=%c%c%c PQ=%c%c", > + xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ', > + xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ', > + xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ', > + val & XIVE_ESB_VAL_P ? 'P' : '-', > + val & XIVE_ESB_VAL_Q ? 'Q' : '-'); > seq_puts(m, "\n"); > } > ^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH v2 6/8] powerpc/xive: Simplify the dump of XIVE interrupts under xmon 2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater ` (4 preceding siblings ...) 2021-03-03 17:48 ` [PATCH v2 5/8] powerpc/xive: Drop check on irq_data in xive_core_debug_show() Cédric Le Goater @ 2021-03-03 17:48 ` Cédric Le Goater 2021-03-09 9:22 ` Greg Kurz 2021-03-03 17:48 ` [PATCH v2 7/8] powerpc/xive: Fix xmon command "dxi" Cédric Le Goater 2021-03-03 17:48 ` [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater 7 siblings, 1 reply; 38+ messages in thread From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw) To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater Move the xmon routine under XIVE subsystem and rework the loop on the interrupts taking into account the xive_irq_domain to filter out IPIs. Signed-off-by: Cédric Le Goater <clg@kaod.org> --- arch/powerpc/include/asm/xive.h | 1 + arch/powerpc/sysdev/xive/common.c | 14 ++++++++++++++ arch/powerpc/xmon/xmon.c | 28 ++-------------------------- 3 files changed, 17 insertions(+), 26 deletions(-) diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h index 9a312b975ca8..aa094a8655b0 100644 --- a/arch/powerpc/include/asm/xive.h +++ b/arch/powerpc/include/asm/xive.h @@ -102,6 +102,7 @@ void xive_flush_interrupt(void); /* xmon hook */ void xmon_xive_do_dump(int cpu); int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d); +void xmon_xive_get_irq_all(void); /* APIs used by KVM */ u32 xive_native_default_eq_shift(void); diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index 60ebd6f4b31d..f6b7b15bbb3a 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -291,6 +291,20 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d) return 0; } +void xmon_xive_get_irq_all(void) +{ + unsigned int i; + struct irq_desc *desc; + + for_each_irq_desc(i, desc) { + struct irq_data *d = irq_desc_get_irq_data(desc); + unsigned int hwirq = (unsigned int)irqd_to_hwirq(d); + + if (d->domain == xive_irq_domain) + xmon_xive_get_irq_config(hwirq, d); + } +} + #endif /* CONFIG_XMON */ static unsigned int xive_get_irq(void) diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index 3fe37495f63d..80fbf8968f77 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -2727,30 +2727,6 @@ static void dump_all_xives(void) dump_one_xive(cpu); } -static void dump_one_xive_irq(u32 num, struct irq_data *d) -{ - xmon_xive_get_irq_config(num, d); -} - -static void dump_all_xive_irq(void) -{ - unsigned int i; - struct irq_desc *desc; - - for_each_irq_desc(i, desc) { - struct irq_data *d = irq_desc_get_irq_data(desc); - unsigned int hwirq; - - if (!d) - continue; - - hwirq = (unsigned int)irqd_to_hwirq(d); - /* IPIs are special (HW number 0) */ - if (hwirq) - dump_one_xive_irq(hwirq, d); - } -} - static void dump_xives(void) { unsigned long num; @@ -2767,9 +2743,9 @@ static void dump_xives(void) return; } else if (c == 'i') { if (scanhex(&num)) - dump_one_xive_irq(num, NULL); + xmon_xive_get_irq_config(num, NULL); else - dump_all_xive_irq(); + xmon_xive_get_irq_all(); return; } -- 2.26.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 6/8] powerpc/xive: Simplify the dump of XIVE interrupts under xmon 2021-03-03 17:48 ` [PATCH v2 6/8] powerpc/xive: Simplify the dump of XIVE interrupts under xmon Cédric Le Goater @ 2021-03-09 9:22 ` Greg Kurz 0 siblings, 0 replies; 38+ messages in thread From: Greg Kurz @ 2021-03-09 9:22 UTC (permalink / raw) To: Cédric Le Goater; +Cc: linuxppc-dev On Wed, 3 Mar 2021 18:48:55 +0100 Cédric Le Goater <clg@kaod.org> wrote: > Move the xmon routine under XIVE subsystem and rework the loop on the > interrupts taking into account the xive_irq_domain to filter out IPIs. > > Signed-off-by: Cédric Le Goater <clg@kaod.org> > --- Nice again ! :) Reviewed-by: Greg Kurz <groug@kaod.org> > arch/powerpc/include/asm/xive.h | 1 + > arch/powerpc/sysdev/xive/common.c | 14 ++++++++++++++ > arch/powerpc/xmon/xmon.c | 28 ++-------------------------- > 3 files changed, 17 insertions(+), 26 deletions(-) > > diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h > index 9a312b975ca8..aa094a8655b0 100644 > --- a/arch/powerpc/include/asm/xive.h > +++ b/arch/powerpc/include/asm/xive.h > @@ -102,6 +102,7 @@ void xive_flush_interrupt(void); > /* xmon hook */ > void xmon_xive_do_dump(int cpu); > int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d); > +void xmon_xive_get_irq_all(void); > > /* APIs used by KVM */ > u32 xive_native_default_eq_shift(void); > diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c > index 60ebd6f4b31d..f6b7b15bbb3a 100644 > --- a/arch/powerpc/sysdev/xive/common.c > +++ b/arch/powerpc/sysdev/xive/common.c > @@ -291,6 +291,20 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d) > return 0; > } > > +void xmon_xive_get_irq_all(void) > +{ > + unsigned int i; > + struct irq_desc *desc; > + > + for_each_irq_desc(i, desc) { > + struct irq_data *d = irq_desc_get_irq_data(desc); > + unsigned int hwirq = (unsigned int)irqd_to_hwirq(d); > + > + if (d->domain == xive_irq_domain) > + xmon_xive_get_irq_config(hwirq, d); > + } > +} > + > #endif /* CONFIG_XMON */ > > static unsigned int xive_get_irq(void) > diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c > index 3fe37495f63d..80fbf8968f77 100644 > --- a/arch/powerpc/xmon/xmon.c > +++ b/arch/powerpc/xmon/xmon.c > @@ -2727,30 +2727,6 @@ static void dump_all_xives(void) > dump_one_xive(cpu); > } > > -static void dump_one_xive_irq(u32 num, struct irq_data *d) > -{ > - xmon_xive_get_irq_config(num, d); > -} > - > -static void dump_all_xive_irq(void) > -{ > - unsigned int i; > - struct irq_desc *desc; > - > - for_each_irq_desc(i, desc) { > - struct irq_data *d = irq_desc_get_irq_data(desc); > - unsigned int hwirq; > - > - if (!d) > - continue; > - > - hwirq = (unsigned int)irqd_to_hwirq(d); > - /* IPIs are special (HW number 0) */ > - if (hwirq) > - dump_one_xive_irq(hwirq, d); > - } > -} > - > static void dump_xives(void) > { > unsigned long num; > @@ -2767,9 +2743,9 @@ static void dump_xives(void) > return; > } else if (c == 'i') { > if (scanhex(&num)) > - dump_one_xive_irq(num, NULL); > + xmon_xive_get_irq_config(num, NULL); > else > - dump_all_xive_irq(); > + xmon_xive_get_irq_all(); > return; > } > ^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH v2 7/8] powerpc/xive: Fix xmon command "dxi" 2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater ` (5 preceding siblings ...) 2021-03-03 17:48 ` [PATCH v2 6/8] powerpc/xive: Simplify the dump of XIVE interrupts under xmon Cédric Le Goater @ 2021-03-03 17:48 ` Cédric Le Goater 2021-03-09 10:23 ` Greg Kurz 2021-03-03 17:48 ` [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater 7 siblings, 1 reply; 38+ messages in thread From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw) To: linuxppc-dev Cc: kernel test robot, Greg Kurz, Dan Carpenter, Cédric Le Goater When under xmon, the "dxi" command dumps the state of the XIVE interrupts. If an interrupt number is specified, only the state of the associated XIVE interrupt is dumped. This form of the command lacks an irq_data parameter which is nevertheless used by xmon_xive_get_irq_config(), leading to an xmon crash. Fix that by doing a lookup in the system IRQ mapping to query the IRQ descriptor data. Invalid interrupt numbers, or not belonging to the XIVE IRQ domain, OPAL event interrupt number for instance, should be caught by the previous query done at the firmware level. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 97ef27507793 ("powerpc/xive: Fix xmon support on the PowerNV platform") Signed-off-by: Cédric Le Goater <clg@kaod.org> --- arch/powerpc/sysdev/xive/common.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index f6b7b15bbb3a..8eefd152b947 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -255,17 +255,20 @@ notrace void xmon_xive_do_dump(int cpu) xmon_printf("\n"); } +static struct irq_data *xive_get_irq_data(u32 hw_irq) +{ + unsigned int irq = irq_find_mapping(xive_irq_domain, hw_irq); + + return irq ? irq_get_irq_data(irq) : NULL; +} + int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d) { - struct irq_chip *chip = irq_data_get_irq_chip(d); int rc; u32 target; u8 prio; u32 lirq; - if (!is_xive_irq(chip)) - return -EINVAL; - rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq); if (rc) { xmon_printf("IRQ 0x%08x : no config rc=%d\n", hw_irq, rc); @@ -275,6 +278,9 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d) xmon_printf("IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ", hw_irq, target, prio, lirq); + if (!d) + d = xive_get_irq_data(hw_irq); + if (d) { struct xive_irq_data *xd = irq_data_get_irq_handler_data(d); u64 val = xive_esb_read(xd, XIVE_ESB_GET); -- 2.26.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 7/8] powerpc/xive: Fix xmon command "dxi" 2021-03-03 17:48 ` [PATCH v2 7/8] powerpc/xive: Fix xmon command "dxi" Cédric Le Goater @ 2021-03-09 10:23 ` Greg Kurz 2021-03-09 15:49 ` Cédric Le Goater 0 siblings, 1 reply; 38+ messages in thread From: Greg Kurz @ 2021-03-09 10:23 UTC (permalink / raw) To: Cédric Le Goater; +Cc: linuxppc-dev, kernel test robot, Dan Carpenter On Wed, 3 Mar 2021 18:48:56 +0100 Cédric Le Goater <clg@kaod.org> wrote: > When under xmon, the "dxi" command dumps the state of the XIVE > interrupts. If an interrupt number is specified, only the state of > the associated XIVE interrupt is dumped. This form of the command > lacks an irq_data parameter which is nevertheless used by > xmon_xive_get_irq_config(), leading to an xmon crash. > > Fix that by doing a lookup in the system IRQ mapping to query the IRQ > descriptor data. Invalid interrupt numbers, or not belonging to the > XIVE IRQ domain, OPAL event interrupt number for instance, should be > caught by the previous query done at the firmware level. > > Reported-by: kernel test robot <lkp@intel.com> > Reported-by: Dan Carpenter <dan.carpenter@oracle.com> > Fixes: 97ef27507793 ("powerpc/xive: Fix xmon support on the PowerNV platform") > Signed-off-by: Cédric Le Goater <clg@kaod.org> > --- I've tested this in a KVM guest and it seems to do the job. 6:mon> dxi 1201 IRQ 0x00001201 : target=0xfffffc00 prio=ff lirq=0x0 flags= LH PQ=-Q Bad HW irq numbers are filtered by the hypervisor: 6:mon> dxi bad [ 696.390577] xive: H_INT_GET_SOURCE_CONFIG lisn=2989 failed -55 IRQ 0x00000bad : no config rc=-6 Note that this also allows to show IPIs: 6:mon> dxi 0 IRQ 0x00000000 : target=0x0 prio=06 lirq=0x10 This is a bit inconsistent with output of the 0-argument form of "dxi", which filters them out for a reason that isn't obvious to me. No big deal though, this should be addressed in another patch anyway. Reviewed-and-tested-by: Greg Kurz <groug@kaod.org> > arch/powerpc/sysdev/xive/common.c | 14 ++++++++++---- > 1 file changed, 10 insertions(+), 4 deletions(-) > > diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c > index f6b7b15bbb3a..8eefd152b947 100644 > --- a/arch/powerpc/sysdev/xive/common.c > +++ b/arch/powerpc/sysdev/xive/common.c > @@ -255,17 +255,20 @@ notrace void xmon_xive_do_dump(int cpu) > xmon_printf("\n"); > } > > +static struct irq_data *xive_get_irq_data(u32 hw_irq) > +{ > + unsigned int irq = irq_find_mapping(xive_irq_domain, hw_irq); > + > + return irq ? irq_get_irq_data(irq) : NULL; > +} > + > int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d) > { > - struct irq_chip *chip = irq_data_get_irq_chip(d); > int rc; > u32 target; > u8 prio; > u32 lirq; > > - if (!is_xive_irq(chip)) > - return -EINVAL; > - > rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq); > if (rc) { > xmon_printf("IRQ 0x%08x : no config rc=%d\n", hw_irq, rc); > @@ -275,6 +278,9 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d) > xmon_printf("IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ", > hw_irq, target, prio, lirq); > > + if (!d) > + d = xive_get_irq_data(hw_irq); > + > if (d) { > struct xive_irq_data *xd = irq_data_get_irq_handler_data(d); > u64 val = xive_esb_read(xd, XIVE_ESB_GET); ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 7/8] powerpc/xive: Fix xmon command "dxi" 2021-03-09 10:23 ` Greg Kurz @ 2021-03-09 15:49 ` Cédric Le Goater 0 siblings, 0 replies; 38+ messages in thread From: Cédric Le Goater @ 2021-03-09 15:49 UTC (permalink / raw) To: Greg Kurz; +Cc: linuxppc-dev, kernel test robot, Dan Carpenter On 3/9/21 11:23 AM, Greg Kurz wrote: > On Wed, 3 Mar 2021 18:48:56 +0100 > Cédric Le Goater <clg@kaod.org> wrote: > >> When under xmon, the "dxi" command dumps the state of the XIVE >> interrupts. If an interrupt number is specified, only the state of >> the associated XIVE interrupt is dumped. This form of the command >> lacks an irq_data parameter which is nevertheless used by >> xmon_xive_get_irq_config(), leading to an xmon crash. >> >> Fix that by doing a lookup in the system IRQ mapping to query the IRQ >> descriptor data. Invalid interrupt numbers, or not belonging to the >> XIVE IRQ domain, OPAL event interrupt number for instance, should be >> caught by the previous query done at the firmware level. >> >> Reported-by: kernel test robot <lkp@intel.com> >> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> >> Fixes: 97ef27507793 ("powerpc/xive: Fix xmon support on the PowerNV platform") >> Signed-off-by: Cédric Le Goater <clg@kaod.org> >> --- > > I've tested this in a KVM guest and it seems to do the job. > > 6:mon> dxi 1201 > IRQ 0x00001201 : target=0xfffffc00 prio=ff lirq=0x0 flags= LH PQ=-Q > > Bad HW irq numbers are filtered by the hypervisor: > > 6:mon> dxi bad > [ 696.390577] xive: H_INT_GET_SOURCE_CONFIG lisn=2989 failed -55 > IRQ 0x00000bad : no config rc=-6 > > Note that this also allows to show IPIs: > > 6:mon> dxi 0 > IRQ 0x00000000 : target=0x0 prio=06 lirq=0x10 > > This is a bit inconsistent with output of the 0-argument form of "dxi", It's an hidden feature ! :) Yes. You can query at the FW level the configuration of any valid HW interrupt number where as "dxi" without an argument only loops on the XIVE IRQ domain which does not include the XIVE CPU IPIs which are special. You should "dxa" for these. > which filters them out for a reason that isn't obvious to me. For historical reason. XIVE support for PowerNV was the first to reach Linux. If you run the same xmon commands on a PowerNV machine (you could use QEMU), the ouput is different. it has more low level details. > No big deal though, this should be addressed in another patch anyway. We could simplify the xmon helpers to be sync with the debugfs one and the QEMU/KVM "info pic" command. I agree. Thanks, C. > Reviewed-and-tested-by: Greg Kurz <groug@kaod.org> > >> arch/powerpc/sysdev/xive/common.c | 14 ++++++++++---- >> 1 file changed, 10 insertions(+), 4 deletions(-) >> >> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c >> index f6b7b15bbb3a..8eefd152b947 100644 >> --- a/arch/powerpc/sysdev/xive/common.c >> +++ b/arch/powerpc/sysdev/xive/common.c >> @@ -255,17 +255,20 @@ notrace void xmon_xive_do_dump(int cpu) >> xmon_printf("\n"); >> } >> >> +static struct irq_data *xive_get_irq_data(u32 hw_irq) >> +{ >> + unsigned int irq = irq_find_mapping(xive_irq_domain, hw_irq); >> + >> + return irq ? irq_get_irq_data(irq) : NULL; >> +} >> + >> int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d) >> { >> - struct irq_chip *chip = irq_data_get_irq_chip(d); >> int rc; >> u32 target; >> u8 prio; >> u32 lirq; >> >> - if (!is_xive_irq(chip)) >> - return -EINVAL; >> - >> rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq); >> if (rc) { >> xmon_printf("IRQ 0x%08x : no config rc=%d\n", hw_irq, rc); >> @@ -275,6 +278,9 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d) >> xmon_printf("IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ", >> hw_irq, target, prio, lirq); >> >> + if (!d) >> + d = xive_get_irq_data(hw_irq); >> + >> if (d) { >> struct xive_irq_data *xd = irq_data_get_irq_handler_data(d); >> u64 val = xive_esb_read(xd, XIVE_ESB_GET); > ^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node 2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater ` (6 preceding siblings ...) 2021-03-03 17:48 ` [PATCH v2 7/8] powerpc/xive: Fix xmon command "dxi" Cédric Le Goater @ 2021-03-03 17:48 ` Cédric Le Goater 2021-03-09 13:23 ` Greg Kurz 2021-03-30 16:18 ` Cédric Le Goater 7 siblings, 2 replies; 38+ messages in thread From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw) To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater ipistorm [*] can be used to benchmark the raw interrupt rate of an interrupt controller by measuring the number of IPIs a system can sustain. When applied to the XIVE interrupt controller of POWER9 and POWER10 systems, a significant drop of the interrupt rate can be observed when crossing the second node boundary. This is due to the fact that a single IPI interrupt is used for all CPUs of the system. The structure is shared and the cache line updates impact greatly the traffic between nodes and the overall IPI performance. As a workaround, the impact can be reduced by deactivating the IRQ lockup detector ("noirqdebug") which does a lot of accounting in the Linux IRQ descriptor structure and is responsible for most of the performance penalty. As a fix, this proposal allocates an IPI interrupt per node, to be shared by all CPUs of that node. It solves the scaling issue, the IRQ lockup detector still has an impact but the XIVE interrupt rate scales linearly. It also improves the "noirqdebug" case as showed in the tables below. * P9 DD2.2 - 2s * 64 threads "noirqdebug" Mint/s Mint/s chips cpus IPI/sys IPI/chip IPI/chip IPI/sys -------------------------------------------------------------- 1 0-15 4.984023 4.875405 4.996536 5.048892 0-31 10.879164 10.544040 10.757632 11.037859 0-47 15.345301 14.688764 14.926520 15.310053 0-63 17.064907 17.066812 17.613416 17.874511 2 0-79 11.768764 21.650749 22.689120 22.566508 0-95 10.616812 26.878789 28.434703 28.320324 0-111 10.151693 31.397803 31.771773 32.388122 0-127 9.948502 33.139336 34.875716 35.224548 * P10 DD1 - 4s (not homogeneous) 352 threads "noirqdebug" Mint/s Mint/s chips cpus IPI/sys IPI/chip IPI/chip IPI/sys -------------------------------------------------------------- 1 0-15 2.409402 2.364108 2.383303 2.395091 0-31 6.028325 6.046075 6.089999 6.073750 0-47 8.655178 8.644531 8.712830 8.724702 0-63 11.629652 11.735953 12.088203 12.055979 0-79 14.392321 14.729959 14.986701 14.973073 0-95 12.604158 13.004034 17.528748 17.568095 2 0-111 9.767753 13.719831 19.968606 20.024218 0-127 6.744566 16.418854 22.898066 22.995110 0-143 6.005699 19.174421 25.425622 25.417541 0-159 5.649719 21.938836 27.952662 28.059603 0-175 5.441410 24.109484 31.133915 31.127996 3 0-191 5.318341 24.405322 33.999221 33.775354 0-207 5.191382 26.449769 36.050161 35.867307 0-223 5.102790 29.356943 39.544135 39.508169 0-239 5.035295 31.933051 42.135075 42.071975 0-255 4.969209 34.477367 44.655395 44.757074 4 0-271 4.907652 35.887016 47.080545 47.318537 0-287 4.839581 38.076137 50.464307 50.636219 0-303 4.786031 40.881319 53.478684 53.310759 0-319 4.743750 43.448424 56.388102 55.973969 0-335 4.709936 45.623532 59.400930 58.926857 0-351 4.681413 45.646151 62.035804 61.830057 [*] https://github.com/antonblanchard/ipistorm Signed-off-by: Cédric Le Goater <clg@kaod.org> --- arch/powerpc/sysdev/xive/xive-internal.h | 2 -- arch/powerpc/sysdev/xive/common.c | 39 ++++++++++++++++++------ 2 files changed, 30 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/sysdev/xive/xive-internal.h b/arch/powerpc/sysdev/xive/xive-internal.h index 9cf57c722faa..b3a456fdd3a5 100644 --- a/arch/powerpc/sysdev/xive/xive-internal.h +++ b/arch/powerpc/sysdev/xive/xive-internal.h @@ -5,8 +5,6 @@ #ifndef __XIVE_INTERNAL_H #define __XIVE_INTERNAL_H -#define XIVE_IPI_HW_IRQ 0 /* interrupt source # for IPIs */ - /* * A "disabled" interrupt should never fire, to catch problems * we set its logical number to this diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c index 8eefd152b947..c27f7bb0494b 100644 --- a/arch/powerpc/sysdev/xive/common.c +++ b/arch/powerpc/sysdev/xive/common.c @@ -65,8 +65,16 @@ static struct irq_domain *xive_irq_domain; #ifdef CONFIG_SMP static struct irq_domain *xive_ipi_irq_domain; -/* The IPIs all use the same logical irq number */ -static u32 xive_ipi_irq; +/* The IPIs use the same logical irq number when on the same chip */ +static struct xive_ipi_desc { + unsigned int irq; + char name[8]; /* enough bytes to fit IPI-XXX */ +} *xive_ipis; + +static unsigned int xive_ipi_cpu_to_irq(unsigned int cpu) +{ + return xive_ipis[cpu_to_node(cpu)].irq; +} #endif /* Xive state for each CPU */ @@ -1106,25 +1114,36 @@ static const struct irq_domain_ops xive_ipi_irq_domain_ops = { static void __init xive_request_ipi(void) { - unsigned int virq; + unsigned int node; - xive_ipi_irq_domain = irq_domain_add_linear(NULL, 1, + xive_ipi_irq_domain = irq_domain_add_linear(NULL, nr_node_ids, &xive_ipi_irq_domain_ops, NULL); if (WARN_ON(xive_ipi_irq_domain == NULL)) return; - /* Initialize it */ - virq = irq_create_mapping(xive_ipi_irq_domain, XIVE_IPI_HW_IRQ); - xive_ipi_irq = virq; + xive_ipis = kcalloc(nr_node_ids, sizeof(*xive_ipis), GFP_KERNEL | __GFP_NOFAIL); + for_each_node(node) { + struct xive_ipi_desc *xid = &xive_ipis[node]; + irq_hw_number_t node_ipi_hwirq = node; + + /* + * Map one IPI interrupt per node for all cpus of that node. + * Since the HW interrupt number doesn't have any meaning, + * simply use the node number. + */ + xid->irq = irq_create_mapping(xive_ipi_irq_domain, node_ipi_hwirq); + snprintf(xid->name, sizeof(xid->name), "IPI-%d", node); - WARN_ON(request_irq(virq, xive_muxed_ipi_action, - IRQF_PERCPU | IRQF_NO_THREAD, "IPI", NULL)); + WARN_ON(request_irq(xid->irq, xive_muxed_ipi_action, + IRQF_PERCPU | IRQF_NO_THREAD, xid->name, NULL)); + } } static int xive_setup_cpu_ipi(unsigned int cpu) { struct xive_cpu *xc; int rc; + unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu); pr_debug("Setting up IPI for CPU %d\n", cpu); @@ -1165,6 +1184,8 @@ static int xive_setup_cpu_ipi(unsigned int cpu) static void xive_cleanup_cpu_ipi(unsigned int cpu, struct xive_cpu *xc) { + unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu); + /* Disable the IPI and free the IRQ data */ /* Already cleaned up ? */ -- 2.26.2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node 2021-03-03 17:48 ` [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater @ 2021-03-09 13:23 ` Greg Kurz 2021-03-09 15:52 ` Cédric Le Goater 2021-03-30 16:18 ` Cédric Le Goater 1 sibling, 1 reply; 38+ messages in thread From: Greg Kurz @ 2021-03-09 13:23 UTC (permalink / raw) To: Cédric Le Goater; +Cc: linuxppc-dev On Wed, 3 Mar 2021 18:48:57 +0100 Cédric Le Goater <clg@kaod.org> wrote: > ipistorm [*] can be used to benchmark the raw interrupt rate of an > interrupt controller by measuring the number of IPIs a system can > sustain. When applied to the XIVE interrupt controller of POWER9 and > POWER10 systems, a significant drop of the interrupt rate can be > observed when crossing the second node boundary. > > This is due to the fact that a single IPI interrupt is used for all > CPUs of the system. The structure is shared and the cache line updates > impact greatly the traffic between nodes and the overall IPI > performance. > > As a workaround, the impact can be reduced by deactivating the IRQ > lockup detector ("noirqdebug") which does a lot of accounting in the > Linux IRQ descriptor structure and is responsible for most of the > performance penalty. > > As a fix, this proposal allocates an IPI interrupt per node, to be > shared by all CPUs of that node. It solves the scaling issue, the IRQ > lockup detector still has an impact but the XIVE interrupt rate scales > linearly. It also improves the "noirqdebug" case as showed in the > tables below. > > * P9 DD2.2 - 2s * 64 threads > > "noirqdebug" > Mint/s Mint/s > chips cpus IPI/sys IPI/chip IPI/chip IPI/sys > -------------------------------------------------------------- > 1 0-15 4.984023 4.875405 4.996536 5.048892 > 0-31 10.879164 10.544040 10.757632 11.037859 > 0-47 15.345301 14.688764 14.926520 15.310053 > 0-63 17.064907 17.066812 17.613416 17.874511 > 2 0-79 11.768764 21.650749 22.689120 22.566508 > 0-95 10.616812 26.878789 28.434703 28.320324 > 0-111 10.151693 31.397803 31.771773 32.388122 > 0-127 9.948502 33.139336 34.875716 35.224548 > > * P10 DD1 - 4s (not homogeneous) 352 threads > > "noirqdebug" > Mint/s Mint/s > chips cpus IPI/sys IPI/chip IPI/chip IPI/sys > -------------------------------------------------------------- > 1 0-15 2.409402 2.364108 2.383303 2.395091 > 0-31 6.028325 6.046075 6.089999 6.073750 > 0-47 8.655178 8.644531 8.712830 8.724702 > 0-63 11.629652 11.735953 12.088203 12.055979 > 0-79 14.392321 14.729959 14.986701 14.973073 > 0-95 12.604158 13.004034 17.528748 17.568095 > 2 0-111 9.767753 13.719831 19.968606 20.024218 > 0-127 6.744566 16.418854 22.898066 22.995110 > 0-143 6.005699 19.174421 25.425622 25.417541 > 0-159 5.649719 21.938836 27.952662 28.059603 > 0-175 5.441410 24.109484 31.133915 31.127996 > 3 0-191 5.318341 24.405322 33.999221 33.775354 > 0-207 5.191382 26.449769 36.050161 35.867307 > 0-223 5.102790 29.356943 39.544135 39.508169 > 0-239 5.035295 31.933051 42.135075 42.071975 > 0-255 4.969209 34.477367 44.655395 44.757074 > 4 0-271 4.907652 35.887016 47.080545 47.318537 > 0-287 4.839581 38.076137 50.464307 50.636219 > 0-303 4.786031 40.881319 53.478684 53.310759 > 0-319 4.743750 43.448424 56.388102 55.973969 > 0-335 4.709936 45.623532 59.400930 58.926857 > 0-351 4.681413 45.646151 62.035804 61.830057 > > [*] https://github.com/antonblanchard/ipistorm > > Signed-off-by: Cédric Le Goater <clg@kaod.org> > --- > arch/powerpc/sysdev/xive/xive-internal.h | 2 -- > arch/powerpc/sysdev/xive/common.c | 39 ++++++++++++++++++------ > 2 files changed, 30 insertions(+), 11 deletions(-) > > diff --git a/arch/powerpc/sysdev/xive/xive-internal.h b/arch/powerpc/sysdev/xive/xive-internal.h > index 9cf57c722faa..b3a456fdd3a5 100644 > --- a/arch/powerpc/sysdev/xive/xive-internal.h > +++ b/arch/powerpc/sysdev/xive/xive-internal.h > @@ -5,8 +5,6 @@ > #ifndef __XIVE_INTERNAL_H > #define __XIVE_INTERNAL_H > > -#define XIVE_IPI_HW_IRQ 0 /* interrupt source # for IPIs */ > - > /* > * A "disabled" interrupt should never fire, to catch problems > * we set its logical number to this > diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c > index 8eefd152b947..c27f7bb0494b 100644 > --- a/arch/powerpc/sysdev/xive/common.c > +++ b/arch/powerpc/sysdev/xive/common.c > @@ -65,8 +65,16 @@ static struct irq_domain *xive_irq_domain; > #ifdef CONFIG_SMP > static struct irq_domain *xive_ipi_irq_domain; > > -/* The IPIs all use the same logical irq number */ > -static u32 xive_ipi_irq; > +/* The IPIs use the same logical irq number when on the same chip */ > +static struct xive_ipi_desc { > + unsigned int irq; > + char name[8]; /* enough bytes to fit IPI-XXX */ So this assumes that the node number that node is <= 999 ? This is certainly the case for now since CONFIG_NODES_SHIFT is 8 on ppc64 but starting with 10, you'd have truncated names. What about deriving the size of name[] from CONFIG_NODES_SHIFT ? Apart from that, LGTM. Probably not worth to respin just for this. I also could give a try in a KVM guest. Topology passed to QEMU: -smp 8,maxcpus=8,cores=2,threads=2,sockets=2 \ -numa node,nodeid=0,cpus=0-4 \ -numa node,nodeid=1,cpus=4-7 Topology observed in guest with lstopo : Package L#0 NUMANode L#0 (P#0 30GB) L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 PU L#0 (P#0) PU L#1 (P#1) L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 PU L#2 (P#2) PU L#3 (P#3) Package L#1 NUMANode L#1 (P#1 32GB) L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 PU L#4 (P#4) PU L#5 (P#5) L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 PU L#6 (P#6) PU L#7 (P#7) Interrupts in guest: $ cat /proc/interrupts CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 16: 1023 871 1042 749 0 0 0 0 XIVE-IPI 0 Edge IPI-0 17: 0 0 0 0 2123 1019 1263 1288 XIVE-IPI 1 Edge IPI-1 IPIs are mapped to the appropriate nodes, and the numbers indicate that everything is working as expected. Reviewed-and-tested-by: Greg Kurz <groug@kaod.org> > +} *xive_ipis; > + > +static unsigned int xive_ipi_cpu_to_irq(unsigned int cpu) > +{ > + return xive_ipis[cpu_to_node(cpu)].irq; > +} > #endif > > /* Xive state for each CPU */ > @@ -1106,25 +1114,36 @@ static const struct irq_domain_ops xive_ipi_irq_domain_ops = { > > static void __init xive_request_ipi(void) > { > - unsigned int virq; > + unsigned int node; > > - xive_ipi_irq_domain = irq_domain_add_linear(NULL, 1, > + xive_ipi_irq_domain = irq_domain_add_linear(NULL, nr_node_ids, > &xive_ipi_irq_domain_ops, NULL); > if (WARN_ON(xive_ipi_irq_domain == NULL)) > return; > > - /* Initialize it */ > - virq = irq_create_mapping(xive_ipi_irq_domain, XIVE_IPI_HW_IRQ); > - xive_ipi_irq = virq; > + xive_ipis = kcalloc(nr_node_ids, sizeof(*xive_ipis), GFP_KERNEL | __GFP_NOFAIL); > + for_each_node(node) { > + struct xive_ipi_desc *xid = &xive_ipis[node]; > + irq_hw_number_t node_ipi_hwirq = node; > + > + /* > + * Map one IPI interrupt per node for all cpus of that node. > + * Since the HW interrupt number doesn't have any meaning, > + * simply use the node number. > + */ > + xid->irq = irq_create_mapping(xive_ipi_irq_domain, node_ipi_hwirq); > + snprintf(xid->name, sizeof(xid->name), "IPI-%d", node); > > - WARN_ON(request_irq(virq, xive_muxed_ipi_action, > - IRQF_PERCPU | IRQF_NO_THREAD, "IPI", NULL)); > + WARN_ON(request_irq(xid->irq, xive_muxed_ipi_action, > + IRQF_PERCPU | IRQF_NO_THREAD, xid->name, NULL)); > + } > } > > static int xive_setup_cpu_ipi(unsigned int cpu) > { > struct xive_cpu *xc; > int rc; > + unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu); > > pr_debug("Setting up IPI for CPU %d\n", cpu); > > @@ -1165,6 +1184,8 @@ static int xive_setup_cpu_ipi(unsigned int cpu) > > static void xive_cleanup_cpu_ipi(unsigned int cpu, struct xive_cpu *xc) > { > + unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu); > + > /* Disable the IPI and free the IRQ data */ > > /* Already cleaned up ? */ ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node 2021-03-09 13:23 ` Greg Kurz @ 2021-03-09 15:52 ` Cédric Le Goater 0 siblings, 0 replies; 38+ messages in thread From: Cédric Le Goater @ 2021-03-09 15:52 UTC (permalink / raw) To: Greg Kurz; +Cc: linuxppc-dev On 3/9/21 2:23 PM, Greg Kurz wrote: > On Wed, 3 Mar 2021 18:48:57 +0100 > Cédric Le Goater <clg@kaod.org> wrote: > >> ipistorm [*] can be used to benchmark the raw interrupt rate of an >> interrupt controller by measuring the number of IPIs a system can >> sustain. When applied to the XIVE interrupt controller of POWER9 and >> POWER10 systems, a significant drop of the interrupt rate can be >> observed when crossing the second node boundary. >> >> This is due to the fact that a single IPI interrupt is used for all >> CPUs of the system. The structure is shared and the cache line updates >> impact greatly the traffic between nodes and the overall IPI >> performance. >> >> As a workaround, the impact can be reduced by deactivating the IRQ >> lockup detector ("noirqdebug") which does a lot of accounting in the >> Linux IRQ descriptor structure and is responsible for most of the >> performance penalty. >> >> As a fix, this proposal allocates an IPI interrupt per node, to be >> shared by all CPUs of that node. It solves the scaling issue, the IRQ >> lockup detector still has an impact but the XIVE interrupt rate scales >> linearly. It also improves the "noirqdebug" case as showed in the >> tables below. >> >> * P9 DD2.2 - 2s * 64 threads >> >> "noirqdebug" >> Mint/s Mint/s >> chips cpus IPI/sys IPI/chip IPI/chip IPI/sys >> -------------------------------------------------------------- >> 1 0-15 4.984023 4.875405 4.996536 5.048892 >> 0-31 10.879164 10.544040 10.757632 11.037859 >> 0-47 15.345301 14.688764 14.926520 15.310053 >> 0-63 17.064907 17.066812 17.613416 17.874511 >> 2 0-79 11.768764 21.650749 22.689120 22.566508 >> 0-95 10.616812 26.878789 28.434703 28.320324 >> 0-111 10.151693 31.397803 31.771773 32.388122 >> 0-127 9.948502 33.139336 34.875716 35.224548 >> >> * P10 DD1 - 4s (not homogeneous) 352 threads >> >> "noirqdebug" >> Mint/s Mint/s >> chips cpus IPI/sys IPI/chip IPI/chip IPI/sys >> -------------------------------------------------------------- >> 1 0-15 2.409402 2.364108 2.383303 2.395091 >> 0-31 6.028325 6.046075 6.089999 6.073750 >> 0-47 8.655178 8.644531 8.712830 8.724702 >> 0-63 11.629652 11.735953 12.088203 12.055979 >> 0-79 14.392321 14.729959 14.986701 14.973073 >> 0-95 12.604158 13.004034 17.528748 17.568095 >> 2 0-111 9.767753 13.719831 19.968606 20.024218 >> 0-127 6.744566 16.418854 22.898066 22.995110 >> 0-143 6.005699 19.174421 25.425622 25.417541 >> 0-159 5.649719 21.938836 27.952662 28.059603 >> 0-175 5.441410 24.109484 31.133915 31.127996 >> 3 0-191 5.318341 24.405322 33.999221 33.775354 >> 0-207 5.191382 26.449769 36.050161 35.867307 >> 0-223 5.102790 29.356943 39.544135 39.508169 >> 0-239 5.035295 31.933051 42.135075 42.071975 >> 0-255 4.969209 34.477367 44.655395 44.757074 >> 4 0-271 4.907652 35.887016 47.080545 47.318537 >> 0-287 4.839581 38.076137 50.464307 50.636219 >> 0-303 4.786031 40.881319 53.478684 53.310759 >> 0-319 4.743750 43.448424 56.388102 55.973969 >> 0-335 4.709936 45.623532 59.400930 58.926857 >> 0-351 4.681413 45.646151 62.035804 61.830057 >> >> [*] https://github.com/antonblanchard/ipistorm >> >> Signed-off-by: Cédric Le Goater <clg@kaod.org> >> --- >> arch/powerpc/sysdev/xive/xive-internal.h | 2 -- >> arch/powerpc/sysdev/xive/common.c | 39 ++++++++++++++++++------ >> 2 files changed, 30 insertions(+), 11 deletions(-) >> >> diff --git a/arch/powerpc/sysdev/xive/xive-internal.h b/arch/powerpc/sysdev/xive/xive-internal.h >> index 9cf57c722faa..b3a456fdd3a5 100644 >> --- a/arch/powerpc/sysdev/xive/xive-internal.h >> +++ b/arch/powerpc/sysdev/xive/xive-internal.h >> @@ -5,8 +5,6 @@ >> #ifndef __XIVE_INTERNAL_H >> #define __XIVE_INTERNAL_H >> >> -#define XIVE_IPI_HW_IRQ 0 /* interrupt source # for IPIs */ >> - >> /* >> * A "disabled" interrupt should never fire, to catch problems >> * we set its logical number to this >> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c >> index 8eefd152b947..c27f7bb0494b 100644 >> --- a/arch/powerpc/sysdev/xive/common.c >> +++ b/arch/powerpc/sysdev/xive/common.c >> @@ -65,8 +65,16 @@ static struct irq_domain *xive_irq_domain; >> #ifdef CONFIG_SMP >> static struct irq_domain *xive_ipi_irq_domain; >> >> -/* The IPIs all use the same logical irq number */ >> -static u32 xive_ipi_irq; >> +/* The IPIs use the same logical irq number when on the same chip */ >> +static struct xive_ipi_desc { >> + unsigned int irq; >> + char name[8]; /* enough bytes to fit IPI-XXX */ > > So this assumes that the node number that node is <= 999 ? This > is certainly the case for now since CONFIG_NODES_SHIFT is 8 > on ppc64 but starting with 10, you'd have truncated names. It should be harmless though. I agree this is a useless optimization. > What about deriving the size of name[] from CONFIG_NODES_SHIFT ? Yes. > Apart from that, LGTM. Probably not worth to respin just for > this. > > I also could give a try in a KVM guest. > > Topology passed to QEMU: > > -smp 8,maxcpus=8,cores=2,threads=2,sockets=2 \ > -numa node,nodeid=0,cpus=0-4 \ > -numa node,nodeid=1,cpus=4-7 > > Topology observed in guest with lstopo : > > Package L#0 > NUMANode L#0 (P#0 30GB) > L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 > PU L#0 (P#0) > PU L#1 (P#1) > L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 > PU L#2 (P#2) > PU L#3 (P#3) > Package L#1 > NUMANode L#1 (P#1 32GB) > L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 > PU L#4 (P#4) > PU L#5 (P#5) > L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 > PU L#6 (P#6) > PU L#7 (P#7) > > Interrupts in guest: > > $ cat /proc/interrupts > CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 > 16: 1023 871 1042 749 0 0 0 0 XIVE-IPI 0 Edge IPI-0 > 17: 0 0 0 0 2123 1019 1263 1288 XIVE-IPI 1 Edge IPI-1 > > IPIs are mapped to the appropriate nodes, and the numbers indicate > that everything is working as expected. You should see the same on 2 socket PowerNV QEMU machine. > Reviewed-and-tested-by: Greg Kurz <groug@kaod.org> Thanks, C. > >> +} *xive_ipis; >> + >> +static unsigned int xive_ipi_cpu_to_irq(unsigned int cpu) >> +{ >> + return xive_ipis[cpu_to_node(cpu)].irq; >> +} >> #endif >> >> /* Xive state for each CPU */ >> @@ -1106,25 +1114,36 @@ static const struct irq_domain_ops xive_ipi_irq_domain_ops = { >> >> static void __init xive_request_ipi(void) >> { >> - unsigned int virq; >> + unsigned int node; >> >> - xive_ipi_irq_domain = irq_domain_add_linear(NULL, 1, >> + xive_ipi_irq_domain = irq_domain_add_linear(NULL, nr_node_ids, >> &xive_ipi_irq_domain_ops, NULL); >> if (WARN_ON(xive_ipi_irq_domain == NULL)) >> return; >> >> - /* Initialize it */ >> - virq = irq_create_mapping(xive_ipi_irq_domain, XIVE_IPI_HW_IRQ); >> - xive_ipi_irq = virq; >> + xive_ipis = kcalloc(nr_node_ids, sizeof(*xive_ipis), GFP_KERNEL | __GFP_NOFAIL); >> + for_each_node(node) { >> + struct xive_ipi_desc *xid = &xive_ipis[node]; >> + irq_hw_number_t node_ipi_hwirq = node; >> + >> + /* >> + * Map one IPI interrupt per node for all cpus of that node. >> + * Since the HW interrupt number doesn't have any meaning, >> + * simply use the node number. >> + */ >> + xid->irq = irq_create_mapping(xive_ipi_irq_domain, node_ipi_hwirq); >> + snprintf(xid->name, sizeof(xid->name), "IPI-%d", node); >> >> - WARN_ON(request_irq(virq, xive_muxed_ipi_action, >> - IRQF_PERCPU | IRQF_NO_THREAD, "IPI", NULL)); >> + WARN_ON(request_irq(xid->irq, xive_muxed_ipi_action, >> + IRQF_PERCPU | IRQF_NO_THREAD, xid->name, NULL)); >> + } >> } >> >> static int xive_setup_cpu_ipi(unsigned int cpu) >> { >> struct xive_cpu *xc; >> int rc; >> + unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu); >> >> pr_debug("Setting up IPI for CPU %d\n", cpu); >> >> @@ -1165,6 +1184,8 @@ static int xive_setup_cpu_ipi(unsigned int cpu) >> >> static void xive_cleanup_cpu_ipi(unsigned int cpu, struct xive_cpu *xc) >> { >> + unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu); >> + >> /* Disable the IPI and free the IRQ data */ >> >> /* Already cleaned up ? */ > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node 2021-03-03 17:48 ` [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater 2021-03-09 13:23 ` Greg Kurz @ 2021-03-30 16:18 ` Cédric Le Goater 1 sibling, 0 replies; 38+ messages in thread From: Cédric Le Goater @ 2021-03-30 16:18 UTC (permalink / raw) To: linuxppc-dev; +Cc: Greg Kurz On 3/3/21 6:48 PM, Cédric Le Goater wrote: > ipistorm [*] can be used to benchmark the raw interrupt rate of an > interrupt controller by measuring the number of IPIs a system can > sustain. When applied to the XIVE interrupt controller of POWER9 and > POWER10 systems, a significant drop of the interrupt rate can be > observed when crossing the second node boundary. > > This is due to the fact that a single IPI interrupt is used for all > CPUs of the system. The structure is shared and the cache line updates > impact greatly the traffic between nodes and the overall IPI > performance. > > As a workaround, the impact can be reduced by deactivating the IRQ > lockup detector ("noirqdebug") which does a lot of accounting in the > Linux IRQ descriptor structure and is responsible for most of the > performance penalty. > > As a fix, this proposal allocates an IPI interrupt per node, to be > shared by all CPUs of that node. It solves the scaling issue, the IRQ > lockup detector still has an impact but the XIVE interrupt rate scales > linearly. It also improves the "noirqdebug" case as showed in the > tables below. > > * P9 DD2.2 - 2s * 64 threads > > "noirqdebug" > Mint/s Mint/s > chips cpus IPI/sys IPI/chip IPI/chip IPI/sys > -------------------------------------------------------------- > 1 0-15 4.984023 4.875405 4.996536 5.048892 > 0-31 10.879164 10.544040 10.757632 11.037859 > 0-47 15.345301 14.688764 14.926520 15.310053 > 0-63 17.064907 17.066812 17.613416 17.874511 > 2 0-79 11.768764 21.650749 22.689120 22.566508 > 0-95 10.616812 26.878789 28.434703 28.320324 > 0-111 10.151693 31.397803 31.771773 32.388122 > 0-127 9.948502 33.139336 34.875716 35.224548 > > * P10 DD1 - 4s (not homogeneous) 352 threads > > "noirqdebug" > Mint/s Mint/s > chips cpus IPI/sys IPI/chip IPI/chip IPI/sys > -------------------------------------------------------------- > 1 0-15 2.409402 2.364108 2.383303 2.395091 > 0-31 6.028325 6.046075 6.089999 6.073750 > 0-47 8.655178 8.644531 8.712830 8.724702 > 0-63 11.629652 11.735953 12.088203 12.055979 > 0-79 14.392321 14.729959 14.986701 14.973073 > 0-95 12.604158 13.004034 17.528748 17.568095 > 2 0-111 9.767753 13.719831 19.968606 20.024218 > 0-127 6.744566 16.418854 22.898066 22.995110 > 0-143 6.005699 19.174421 25.425622 25.417541 > 0-159 5.649719 21.938836 27.952662 28.059603 > 0-175 5.441410 24.109484 31.133915 31.127996 > 3 0-191 5.318341 24.405322 33.999221 33.775354 > 0-207 5.191382 26.449769 36.050161 35.867307 > 0-223 5.102790 29.356943 39.544135 39.508169 > 0-239 5.035295 31.933051 42.135075 42.071975 > 0-255 4.969209 34.477367 44.655395 44.757074 > 4 0-271 4.907652 35.887016 47.080545 47.318537 > 0-287 4.839581 38.076137 50.464307 50.636219 > 0-303 4.786031 40.881319 53.478684 53.310759 > 0-319 4.743750 43.448424 56.388102 55.973969 > 0-335 4.709936 45.623532 59.400930 58.926857 > 0-351 4.681413 45.646151 62.035804 61.830057 > > [*] https://github.com/antonblanchard/ipistorm > > Signed-off-by: Cédric Le Goater <clg@kaod.org> > --- > arch/powerpc/sysdev/xive/xive-internal.h | 2 -- > arch/powerpc/sysdev/xive/common.c | 39 ++++++++++++++++++------ > 2 files changed, 30 insertions(+), 11 deletions(-) > > diff --git a/arch/powerpc/sysdev/xive/xive-internal.h b/arch/powerpc/sysdev/xive/xive-internal.h > index 9cf57c722faa..b3a456fdd3a5 100644 > --- a/arch/powerpc/sysdev/xive/xive-internal.h > +++ b/arch/powerpc/sysdev/xive/xive-internal.h > @@ -5,8 +5,6 @@ > #ifndef __XIVE_INTERNAL_H > #define __XIVE_INTERNAL_H > > -#define XIVE_IPI_HW_IRQ 0 /* interrupt source # for IPIs */ > - > /* > * A "disabled" interrupt should never fire, to catch problems > * we set its logical number to this > diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c > index 8eefd152b947..c27f7bb0494b 100644 > --- a/arch/powerpc/sysdev/xive/common.c > +++ b/arch/powerpc/sysdev/xive/common.c > @@ -65,8 +65,16 @@ static struct irq_domain *xive_irq_domain; > #ifdef CONFIG_SMP > static struct irq_domain *xive_ipi_irq_domain; > > -/* The IPIs all use the same logical irq number */ > -static u32 xive_ipi_irq; > +/* The IPIs use the same logical irq number when on the same chip */ > +static struct xive_ipi_desc { > + unsigned int irq; > + char name[8]; /* enough bytes to fit IPI-XXX */ > +} *xive_ipis; > + > +static unsigned int xive_ipi_cpu_to_irq(unsigned int cpu) > +{ > + return xive_ipis[cpu_to_node(cpu)].irq; This should be using early_cpu_to_node() for hotplugged CPU, else the CPU IPI will be mapped on default node 0. Still works but this is not what we want. > +} > #endif > > /* Xive state for each CPU */ > @@ -1106,25 +1114,36 @@ static const struct irq_domain_ops xive_ipi_irq_domain_ops = { > > static void __init xive_request_ipi(void) > { > - unsigned int virq; > + unsigned int node; > > - xive_ipi_irq_domain = irq_domain_add_linear(NULL, 1, > + xive_ipi_irq_domain = irq_domain_add_linear(NULL, nr_node_ids, > &xive_ipi_irq_domain_ops, NULL); > if (WARN_ON(xive_ipi_irq_domain == NULL)) > return; > > - /* Initialize it */ > - virq = irq_create_mapping(xive_ipi_irq_domain, XIVE_IPI_HW_IRQ); > - xive_ipi_irq = virq; > + xive_ipis = kcalloc(nr_node_ids, sizeof(*xive_ipis), GFP_KERNEL | __GFP_NOFAIL); > + for_each_node(node) { > + struct xive_ipi_desc *xid = &xive_ipis[node]; > + irq_hw_number_t node_ipi_hwirq = node; There, we need to filter cpu-less nodes. There is no need to allocate IPIs for these. > + /* > + * Map one IPI interrupt per node for all cpus of that node. > + * Since the HW interrupt number doesn't have any meaning, > + * simply use the node number. > + */ > + xid->irq = irq_create_mapping(xive_ipi_irq_domain, node_ipi_hwirq); > + snprintf(xid->name, sizeof(xid->name), "IPI-%d", node); and this mapping needs some modernization. If we use a domain allocator, the IPI irq descriptor will be allocated on the node it is serving which is even better for cache performance. I will send a v3 with these changes. C. > - WARN_ON(request_irq(virq, xive_muxed_ipi_action, > - IRQF_PERCPU | IRQF_NO_THREAD, "IPI", NULL)); > + WARN_ON(request_irq(xid->irq, xive_muxed_ipi_action, > + IRQF_PERCPU | IRQF_NO_THREAD, xid->name, NULL)); > + } > } > > static int xive_setup_cpu_ipi(unsigned int cpu) > { > struct xive_cpu *xc; > int rc; > + unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu); > > pr_debug("Setting up IPI for CPU %d\n", cpu); > > @@ -1165,6 +1184,8 @@ static int xive_setup_cpu_ipi(unsigned int cpu) > > static void xive_cleanup_cpu_ipi(unsigned int cpu, struct xive_cpu *xc) > { > + unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu); > + > /* Disable the IPI and free the IRQ data */ > > /* Already cleaned up ? */ > ^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2021-03-30 16:18 UTC | newest] Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater 2021-03-03 17:48 ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm, chip-id property Cédric Le Goater 2021-03-08 17:13 ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property Greg Kurz 2021-03-09 15:33 ` Cédric Le Goater 2021-03-09 17:08 ` Daniel Henrique Barboza 2021-03-09 17:26 ` Cédric Le Goater 2021-03-09 17:26 ` Cédric Le Goater 2021-03-12 1:55 ` David Gibson 2021-03-12 1:55 ` David Gibson 2021-03-12 9:53 ` Cédric Le Goater 2021-03-12 9:53 ` Cédric Le Goater 2021-03-12 12:18 ` Daniel Henrique Barboza 2021-03-12 12:18 ` Daniel Henrique Barboza 2021-03-12 13:03 ` Greg Kurz 2021-03-12 13:03 ` Greg Kurz 2021-03-12 13:28 ` Cédric Le Goater 2021-03-12 13:28 ` Cédric Le Goater 2021-03-03 17:48 ` [PATCH v2 2/8] powerpc/xive: Introduce an IPI interrupt domain Cédric Le Goater 2021-03-08 17:55 ` Greg Kurz 2021-03-03 17:48 ` [PATCH v2 3/8] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ Cédric Le Goater 2021-03-08 17:56 ` Greg Kurz 2021-03-03 17:48 ` [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show() Cédric Le Goater 2021-03-08 18:07 ` Greg Kurz 2021-03-08 18:11 ` Cédric Le Goater 2021-03-09 9:13 ` Greg Kurz 2021-03-09 9:42 ` Greg Kurz 2021-03-09 15:39 ` Cédric Le Goater 2021-03-03 17:48 ` [PATCH v2 5/8] powerpc/xive: Drop check on irq_data in xive_core_debug_show() Cédric Le Goater 2021-03-09 9:18 ` Greg Kurz 2021-03-03 17:48 ` [PATCH v2 6/8] powerpc/xive: Simplify the dump of XIVE interrupts under xmon Cédric Le Goater 2021-03-09 9:22 ` Greg Kurz 2021-03-03 17:48 ` [PATCH v2 7/8] powerpc/xive: Fix xmon command "dxi" Cédric Le Goater 2021-03-09 10:23 ` Greg Kurz 2021-03-09 15:49 ` Cédric Le Goater 2021-03-03 17:48 ` [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater 2021-03-09 13:23 ` Greg Kurz 2021-03-09 15:52 ` Cédric Le Goater 2021-03-30 16:18 ` Cédric Le Goater
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.