All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node
@ 2021-03-03 17:48 Cédric Le Goater
  2021-03-03 17:48 ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm, chip-id property Cédric Le Goater
                   ` (7 more replies)
  0 siblings, 8 replies; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater

Hello,

ipistorm [*] can be used to benchmark the raw interrupt rate of an
interrupt controller by measuring the number of IPIs a system can
sustain. When applied to the XIVE interrupt controller of POWER9 and
POWER10 systems, a significant drop of the interrupt rate can be
observed when crossing the second node boundary.

This is due to the fact that a single IPI interrupt is used for all
CPUs of the system. The structure is shared and the cache line updates
impact greatly the traffic between nodes and the overall IPI
performance.

As a workaround, the impact can be reduced by deactivating the IRQ
lockup detector ("noirqdebug") which does a lot of accounting in the
Linux IRQ descriptor structure and is responsible for most of the
performance penalty.

As a fix, this proposal allocates an IPI interrupt per node, to be
shared by all CPUs of that node. It solves the scaling issue, the IRQ
lockup detector still has an impact but the XIVE interrupt rate scales
linearly. It also improves the "noirqdebug" case as showed in the
tables below. 

 * P9 DD2.2 - 2s * 64 threads

                                               "noirqdebug"
                        Mint/s                    Mint/s   
 chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys     
 --------------------------------------------------------------
 1      0-15     4.984023   4.875405       4.996536   5.048892
        0-31    10.879164  10.544040      10.757632  11.037859
        0-47    15.345301  14.688764      14.926520  15.310053
        0-63    17.064907  17.066812      17.613416  17.874511
 2      0-79    11.768764  21.650749      22.689120  22.566508
        0-95    10.616812  26.878789      28.434703  28.320324
        0-111   10.151693  31.397803      31.771773  32.388122
        0-127    9.948502  33.139336      34.875716  35.224548


 * P10 DD1 - 4s (not homogeneous) 352 threads

                                               "noirqdebug"
                        Mint/s                    Mint/s   
 chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys     
 --------------------------------------------------------------
 1      0-15     2.409402   2.364108       2.383303   2.395091
        0-31     6.028325   6.046075       6.089999   6.073750
        0-47     8.655178   8.644531       8.712830   8.724702
        0-63    11.629652  11.735953      12.088203  12.055979
        0-79    14.392321  14.729959      14.986701  14.973073
        0-95    12.604158  13.004034      17.528748  17.568095
 2      0-111    9.767753  13.719831      19.968606  20.024218
        0-127    6.744566  16.418854      22.898066  22.995110
        0-143    6.005699  19.174421      25.425622  25.417541
        0-159    5.649719  21.938836      27.952662  28.059603
        0-175    5.441410  24.109484      31.133915  31.127996
 3      0-191    5.318341  24.405322      33.999221  33.775354
        0-207    5.191382  26.449769      36.050161  35.867307
        0-223    5.102790  29.356943      39.544135  39.508169
        0-239    5.035295  31.933051      42.135075  42.071975
        0-255    4.969209  34.477367      44.655395  44.757074
 4      0-271    4.907652  35.887016      47.080545  47.318537
        0-287    4.839581  38.076137      50.464307  50.636219
        0-303    4.786031  40.881319      53.478684  53.310759
        0-319    4.743750  43.448424      56.388102  55.973969
        0-335    4.709936  45.623532      59.400930  58.926857
        0-351    4.681413  45.646151      62.035804  61.830057

[*] https://github.com/antonblanchard/ipistorm

Thanks,

C.

Changes in v2:

  - extra simplification on xmon
  - fixes on issues reported by the kernel test robot

Cédric Le Goater (8):
  powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
  powerpc/xive: Introduce an IPI interrupt domain
  powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ
  powerpc/xive: Simplify xive_core_debug_show()
  powerpc/xive: Drop check on irq_data in xive_core_debug_show()
  powerpc/xive: Simplify the dump of XIVE interrupts under xmon
  powerpc/xive: Fix xmon command "dxi"
  powerpc/xive: Map one IPI interrupt per node

 arch/powerpc/include/asm/xive.h          |   1 +
 arch/powerpc/sysdev/xive/xive-internal.h |   2 -
 arch/powerpc/sysdev/xive/common.c        | 163 +++++++++++++----------
 arch/powerpc/xmon/xmon.c                 |  28 +---
 4 files changed, 93 insertions(+), 101 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm, chip-id property
  2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
@ 2021-03-03 17:48 ` Cédric Le Goater
  2021-03-08 17:13   ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property Greg Kurz
  2021-03-03 17:48 ` [PATCH v2 2/8] powerpc/xive: Introduce an IPI interrupt domain Cédric Le Goater
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater

The 'chip_id' field of the XIVE CPU structure is used to choose a
target for a source located on the same chip when possible. This field
is assigned on the PowerNV platform using the "ibm,chip-id" property
on pSeries under KVM when NUMA nodes are defined but it is undefined
under PowerVM. The XIVE source structure has a similar field
'src_chip' which is only assigned on the PowerNV platform.

cpu_to_node() returns a compatible value on all platforms, 0 being the
default node. It will also give us the opportunity to set the affinity
of a source on pSeries when we can localize them.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/sysdev/xive/common.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 595310e056f4..b8e456da28aa 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu)
 
 	xc = per_cpu(xive_cpu, cpu);
 	if (!xc) {
-		struct device_node *np;
-
 		xc = kzalloc_node(sizeof(struct xive_cpu),
 				  GFP_KERNEL, cpu_to_node(cpu));
 		if (!xc)
 			return -ENOMEM;
-		np = of_get_cpu_node(cpu, NULL);
-		if (np)
-			xc->chip_id = of_get_ibm_chip_id(np);
-		of_node_put(np);
+		xc->chip_id = cpu_to_node(cpu);
 		xc->hw_ipi = XIVE_BAD_IRQ;
 
 		per_cpu(xive_cpu, cpu) = xc;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 2/8] powerpc/xive: Introduce an IPI interrupt domain
  2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
  2021-03-03 17:48 ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm, chip-id property Cédric Le Goater
@ 2021-03-03 17:48 ` Cédric Le Goater
  2021-03-08 17:55   ` Greg Kurz
  2021-03-03 17:48 ` [PATCH v2 3/8] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ Cédric Le Goater
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater

The IPI interrupt is a special case of the XIVE IRQ domain. When
mapping and unmapping the interrupts in the Linux interrupt number
space, the HW interrupt number 0 (XIVE_IPI_HW_IRQ) is checked to
distinguish the IPI interrupt from other interrupts of the system.

Simplify the XIVE interrupt domain by introducing a specific domain
for the IPI.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/sysdev/xive/common.c | 51 +++++++++++++------------------
 1 file changed, 22 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index b8e456da28aa..e7783760d278 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -63,6 +63,8 @@ static const struct xive_ops *xive_ops;
 static struct irq_domain *xive_irq_domain;
 
 #ifdef CONFIG_SMP
+static struct irq_domain *xive_ipi_irq_domain;
+
 /* The IPIs all use the same logical irq number */
 static u32 xive_ipi_irq;
 #endif
@@ -1067,20 +1069,32 @@ static struct irq_chip xive_ipi_chip = {
 	.irq_unmask = xive_ipi_do_nothing,
 };
 
+/*
+ * IPIs are marked per-cpu. We use separate HW interrupts under the
+ * hood but associated with the same "linux" interrupt
+ */
+static int xive_ipi_irq_domain_map(struct irq_domain *h, unsigned int virq,
+				   irq_hw_number_t hw)
+{
+	irq_set_chip_and_handler(virq, &xive_ipi_chip, handle_percpu_irq);
+	return 0;
+}
+
+static const struct irq_domain_ops xive_ipi_irq_domain_ops = {
+	.map = xive_ipi_irq_domain_map,
+};
+
 static void __init xive_request_ipi(void)
 {
 	unsigned int virq;
 
-	/*
-	 * Initialization failed, move on, we might manage to
-	 * reach the point where we display our errors before
-	 * the system falls appart
-	 */
-	if (!xive_irq_domain)
+	xive_ipi_irq_domain = irq_domain_add_linear(NULL, 1,
+						    &xive_ipi_irq_domain_ops, NULL);
+	if (WARN_ON(xive_ipi_irq_domain == NULL))
 		return;
 
 	/* Initialize it */
-	virq = irq_create_mapping(xive_irq_domain, XIVE_IPI_HW_IRQ);
+	virq = irq_create_mapping(xive_ipi_irq_domain, XIVE_IPI_HW_IRQ);
 	xive_ipi_irq = virq;
 
 	WARN_ON(request_irq(virq, xive_muxed_ipi_action,
@@ -1178,19 +1192,6 @@ static int xive_irq_domain_map(struct irq_domain *h, unsigned int virq,
 	 */
 	irq_clear_status_flags(virq, IRQ_LEVEL);
 
-#ifdef CONFIG_SMP
-	/* IPIs are special and come up with HW number 0 */
-	if (hw == XIVE_IPI_HW_IRQ) {
-		/*
-		 * IPIs are marked per-cpu. We use separate HW interrupts under
-		 * the hood but associated with the same "linux" interrupt
-		 */
-		irq_set_chip_and_handler(virq, &xive_ipi_chip,
-					 handle_percpu_irq);
-		return 0;
-	}
-#endif
-
 	rc = xive_irq_alloc_data(virq, hw);
 	if (rc)
 		return rc;
@@ -1202,15 +1203,7 @@ static int xive_irq_domain_map(struct irq_domain *h, unsigned int virq,
 
 static void xive_irq_domain_unmap(struct irq_domain *d, unsigned int virq)
 {
-	struct irq_data *data = irq_get_irq_data(virq);
-	unsigned int hw_irq;
-
-	/* XXX Assign BAD number */
-	if (!data)
-		return;
-	hw_irq = (unsigned int)irqd_to_hwirq(data);
-	if (hw_irq != XIVE_IPI_HW_IRQ)
-		xive_irq_free_data(virq);
+	xive_irq_free_data(virq);
 }
 
 static int xive_irq_domain_xlate(struct irq_domain *h, struct device_node *ct,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 3/8] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ
  2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
  2021-03-03 17:48 ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm, chip-id property Cédric Le Goater
  2021-03-03 17:48 ` [PATCH v2 2/8] powerpc/xive: Introduce an IPI interrupt domain Cédric Le Goater
@ 2021-03-03 17:48 ` Cédric Le Goater
  2021-03-08 17:56   ` Greg Kurz
  2021-03-03 17:48 ` [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show() Cédric Le Goater
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater

The IPI interrupt has its own domain now. Testing the HW interrupt
number is not needed anymore.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/sysdev/xive/common.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index e7783760d278..678680531d26 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1396,13 +1396,12 @@ static void xive_flush_cpu_queue(unsigned int cpu, struct xive_cpu *xc)
 		struct irq_desc *desc = irq_to_desc(irq);
 		struct irq_data *d = irq_desc_get_irq_data(desc);
 		struct xive_irq_data *xd;
-		unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
 
 		/*
 		 * Ignore anything that isn't a XIVE irq and ignore
 		 * IPIs, so can just be dropped.
 		 */
-		if (d->domain != xive_irq_domain || hw_irq == XIVE_IPI_HW_IRQ)
+		if (d->domain != xive_irq_domain)
 			continue;
 
 		/*
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show()
  2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
                   ` (2 preceding siblings ...)
  2021-03-03 17:48 ` [PATCH v2 3/8] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ Cédric Le Goater
@ 2021-03-03 17:48 ` Cédric Le Goater
  2021-03-08 18:07   ` Greg Kurz
  2021-03-03 17:48 ` [PATCH v2 5/8] powerpc/xive: Drop check on irq_data in xive_core_debug_show() Cédric Le Goater
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater

Now that the IPI interrupt has its own domain, the checks on the HW
interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a
check on the domain.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/sysdev/xive/common.c | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 678680531d26..7581cb12bb53 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1579,17 +1579,14 @@ static void xive_debug_show_cpu(struct seq_file *m, int cpu)
 	seq_puts(m, "\n");
 }
 
-static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct irq_data *d)
+static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
 {
-	struct irq_chip *chip = irq_data_get_irq_chip(d);
+	unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
 	int rc;
 	u32 target;
 	u8 prio;
 	u32 lirq;
 
-	if (!is_xive_irq(chip))
-		return;
-
 	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
 	if (rc) {
 		seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
@@ -1627,16 +1624,9 @@ static int xive_core_debug_show(struct seq_file *m, void *private)
 
 	for_each_irq_desc(i, desc) {
 		struct irq_data *d = irq_desc_get_irq_data(desc);
-		unsigned int hw_irq;
-
-		if (!d)
-			continue;
-
-		hw_irq = (unsigned int)irqd_to_hwirq(d);
 
-		/* IPIs are special (HW number 0) */
-		if (hw_irq != XIVE_IPI_HW_IRQ)
-			xive_debug_show_irq(m, hw_irq, d);
+		if (d->domain == xive_irq_domain)
+			xive_debug_show_irq(m, d);
 	}
 	return 0;
 }
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 5/8] powerpc/xive: Drop check on irq_data in xive_core_debug_show()
  2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
                   ` (3 preceding siblings ...)
  2021-03-03 17:48 ` [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show() Cédric Le Goater
@ 2021-03-03 17:48 ` Cédric Le Goater
  2021-03-09  9:18   ` Greg Kurz
  2021-03-03 17:48 ` [PATCH v2 6/8] powerpc/xive: Simplify the dump of XIVE interrupts under xmon Cédric Le Goater
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kernel test robot, Greg Kurz, Dan Carpenter, Cédric Le Goater

When looping on IRQ descriptor, irq_data is always valid.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state")
Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/sysdev/xive/common.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 7581cb12bb53..60ebd6f4b31d 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1586,6 +1586,8 @@ static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
 	u32 target;
 	u8 prio;
 	u32 lirq;
+	struct xive_irq_data *xd;
+	u64 val;
 
 	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
 	if (rc) {
@@ -1596,17 +1598,14 @@ static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
 	seq_printf(m, "IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ",
 		   hw_irq, target, prio, lirq);
 
-	if (d) {
-		struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
-		u64 val = xive_esb_read(xd, XIVE_ESB_GET);
-
-		seq_printf(m, "flags=%c%c%c PQ=%c%c",
-			   xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ',
-			   xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ',
-			   xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ',
-			   val & XIVE_ESB_VAL_P ? 'P' : '-',
-			   val & XIVE_ESB_VAL_Q ? 'Q' : '-');
-	}
+	xd = irq_data_get_irq_handler_data(d);
+	val = xive_esb_read(xd, XIVE_ESB_GET);
+	seq_printf(m, "flags=%c%c%c PQ=%c%c",
+		   xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ',
+		   xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ',
+		   xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ',
+		   val & XIVE_ESB_VAL_P ? 'P' : '-',
+		   val & XIVE_ESB_VAL_Q ? 'Q' : '-');
 	seq_puts(m, "\n");
 }
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 6/8] powerpc/xive: Simplify the dump of XIVE interrupts under xmon
  2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
                   ` (4 preceding siblings ...)
  2021-03-03 17:48 ` [PATCH v2 5/8] powerpc/xive: Drop check on irq_data in xive_core_debug_show() Cédric Le Goater
@ 2021-03-03 17:48 ` Cédric Le Goater
  2021-03-09  9:22   ` Greg Kurz
  2021-03-03 17:48 ` [PATCH v2 7/8] powerpc/xive: Fix xmon command "dxi" Cédric Le Goater
  2021-03-03 17:48 ` [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
  7 siblings, 1 reply; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater

Move the xmon routine under XIVE subsystem and rework the loop on the
interrupts taking into account the xive_irq_domain to filter out IPIs.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/asm/xive.h   |  1 +
 arch/powerpc/sysdev/xive/common.c | 14 ++++++++++++++
 arch/powerpc/xmon/xmon.c          | 28 ++--------------------------
 3 files changed, 17 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 9a312b975ca8..aa094a8655b0 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -102,6 +102,7 @@ void xive_flush_interrupt(void);
 /* xmon hook */
 void xmon_xive_do_dump(int cpu);
 int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d);
+void xmon_xive_get_irq_all(void);
 
 /* APIs used by KVM */
 u32 xive_native_default_eq_shift(void);
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 60ebd6f4b31d..f6b7b15bbb3a 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -291,6 +291,20 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
 	return 0;
 }
 
+void xmon_xive_get_irq_all(void)
+{
+	unsigned int i;
+	struct irq_desc *desc;
+
+	for_each_irq_desc(i, desc) {
+		struct irq_data *d = irq_desc_get_irq_data(desc);
+		unsigned int hwirq = (unsigned int)irqd_to_hwirq(d);
+
+		if (d->domain == xive_irq_domain)
+			xmon_xive_get_irq_config(hwirq, d);
+	}
+}
+
 #endif /* CONFIG_XMON */
 
 static unsigned int xive_get_irq(void)
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 3fe37495f63d..80fbf8968f77 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2727,30 +2727,6 @@ static void dump_all_xives(void)
 		dump_one_xive(cpu);
 }
 
-static void dump_one_xive_irq(u32 num, struct irq_data *d)
-{
-	xmon_xive_get_irq_config(num, d);
-}
-
-static void dump_all_xive_irq(void)
-{
-	unsigned int i;
-	struct irq_desc *desc;
-
-	for_each_irq_desc(i, desc) {
-		struct irq_data *d = irq_desc_get_irq_data(desc);
-		unsigned int hwirq;
-
-		if (!d)
-			continue;
-
-		hwirq = (unsigned int)irqd_to_hwirq(d);
-		/* IPIs are special (HW number 0) */
-		if (hwirq)
-			dump_one_xive_irq(hwirq, d);
-	}
-}
-
 static void dump_xives(void)
 {
 	unsigned long num;
@@ -2767,9 +2743,9 @@ static void dump_xives(void)
 		return;
 	} else if (c == 'i') {
 		if (scanhex(&num))
-			dump_one_xive_irq(num, NULL);
+			xmon_xive_get_irq_config(num, NULL);
 		else
-			dump_all_xive_irq();
+			xmon_xive_get_irq_all();
 		return;
 	}
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 7/8] powerpc/xive: Fix xmon command "dxi"
  2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
                   ` (5 preceding siblings ...)
  2021-03-03 17:48 ` [PATCH v2 6/8] powerpc/xive: Simplify the dump of XIVE interrupts under xmon Cédric Le Goater
@ 2021-03-03 17:48 ` Cédric Le Goater
  2021-03-09 10:23   ` Greg Kurz
  2021-03-03 17:48 ` [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
  7 siblings, 1 reply; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kernel test robot, Greg Kurz, Dan Carpenter, Cédric Le Goater

When under xmon, the "dxi" command dumps the state of the XIVE
interrupts. If an interrupt number is specified, only the state of
the associated XIVE interrupt is dumped. This form of the command
lacks an irq_data parameter which is nevertheless used by
xmon_xive_get_irq_config(), leading to an xmon crash.

Fix that by doing a lookup in the system IRQ mapping to query the IRQ
descriptor data. Invalid interrupt numbers, or not belonging to the
XIVE IRQ domain, OPAL event interrupt number for instance, should be
caught by the previous query done at the firmware level.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 97ef27507793 ("powerpc/xive: Fix xmon support on the PowerNV platform")
Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/sysdev/xive/common.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index f6b7b15bbb3a..8eefd152b947 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -255,17 +255,20 @@ notrace void xmon_xive_do_dump(int cpu)
 	xmon_printf("\n");
 }
 
+static struct irq_data *xive_get_irq_data(u32 hw_irq)
+{
+	unsigned int irq = irq_find_mapping(xive_irq_domain, hw_irq);
+
+	return irq ? irq_get_irq_data(irq) : NULL;
+}
+
 int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
 {
-	struct irq_chip *chip = irq_data_get_irq_chip(d);
 	int rc;
 	u32 target;
 	u8 prio;
 	u32 lirq;
 
-	if (!is_xive_irq(chip))
-		return -EINVAL;
-
 	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
 	if (rc) {
 		xmon_printf("IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
@@ -275,6 +278,9 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
 	xmon_printf("IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ",
 		    hw_irq, target, prio, lirq);
 
+	if (!d)
+		d = xive_get_irq_data(hw_irq);
+
 	if (d) {
 		struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
 		u64 val = xive_esb_read(xd, XIVE_ESB_GET);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node
  2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
                   ` (6 preceding siblings ...)
  2021-03-03 17:48 ` [PATCH v2 7/8] powerpc/xive: Fix xmon command "dxi" Cédric Le Goater
@ 2021-03-03 17:48 ` Cédric Le Goater
  2021-03-09 13:23   ` Greg Kurz
  2021-03-30 16:18   ` Cédric Le Goater
  7 siblings, 2 replies; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-03 17:48 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater

ipistorm [*] can be used to benchmark the raw interrupt rate of an
interrupt controller by measuring the number of IPIs a system can
sustain. When applied to the XIVE interrupt controller of POWER9 and
POWER10 systems, a significant drop of the interrupt rate can be
observed when crossing the second node boundary.

This is due to the fact that a single IPI interrupt is used for all
CPUs of the system. The structure is shared and the cache line updates
impact greatly the traffic between nodes and the overall IPI
performance.

As a workaround, the impact can be reduced by deactivating the IRQ
lockup detector ("noirqdebug") which does a lot of accounting in the
Linux IRQ descriptor structure and is responsible for most of the
performance penalty.

As a fix, this proposal allocates an IPI interrupt per node, to be
shared by all CPUs of that node. It solves the scaling issue, the IRQ
lockup detector still has an impact but the XIVE interrupt rate scales
linearly. It also improves the "noirqdebug" case as showed in the
tables below.

 * P9 DD2.2 - 2s * 64 threads

                                               "noirqdebug"
                        Mint/s                    Mint/s
 chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
 --------------------------------------------------------------
 1      0-15     4.984023   4.875405       4.996536   5.048892
        0-31    10.879164  10.544040      10.757632  11.037859
        0-47    15.345301  14.688764      14.926520  15.310053
        0-63    17.064907  17.066812      17.613416  17.874511
 2      0-79    11.768764  21.650749      22.689120  22.566508
        0-95    10.616812  26.878789      28.434703  28.320324
        0-111   10.151693  31.397803      31.771773  32.388122
        0-127    9.948502  33.139336      34.875716  35.224548

 * P10 DD1 - 4s (not homogeneous) 352 threads

                                               "noirqdebug"
                        Mint/s                    Mint/s
 chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
 --------------------------------------------------------------
 1      0-15     2.409402   2.364108       2.383303   2.395091
        0-31     6.028325   6.046075       6.089999   6.073750
        0-47     8.655178   8.644531       8.712830   8.724702
        0-63    11.629652  11.735953      12.088203  12.055979
        0-79    14.392321  14.729959      14.986701  14.973073
        0-95    12.604158  13.004034      17.528748  17.568095
 2      0-111    9.767753  13.719831      19.968606  20.024218
        0-127    6.744566  16.418854      22.898066  22.995110
        0-143    6.005699  19.174421      25.425622  25.417541
        0-159    5.649719  21.938836      27.952662  28.059603
        0-175    5.441410  24.109484      31.133915  31.127996
 3      0-191    5.318341  24.405322      33.999221  33.775354
        0-207    5.191382  26.449769      36.050161  35.867307
        0-223    5.102790  29.356943      39.544135  39.508169
        0-239    5.035295  31.933051      42.135075  42.071975
        0-255    4.969209  34.477367      44.655395  44.757074
 4      0-271    4.907652  35.887016      47.080545  47.318537
        0-287    4.839581  38.076137      50.464307  50.636219
        0-303    4.786031  40.881319      53.478684  53.310759
        0-319    4.743750  43.448424      56.388102  55.973969
        0-335    4.709936  45.623532      59.400930  58.926857
        0-351    4.681413  45.646151      62.035804  61.830057

[*] https://github.com/antonblanchard/ipistorm

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/sysdev/xive/xive-internal.h |  2 --
 arch/powerpc/sysdev/xive/common.c        | 39 ++++++++++++++++++------
 2 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/xive-internal.h b/arch/powerpc/sysdev/xive/xive-internal.h
index 9cf57c722faa..b3a456fdd3a5 100644
--- a/arch/powerpc/sysdev/xive/xive-internal.h
+++ b/arch/powerpc/sysdev/xive/xive-internal.h
@@ -5,8 +5,6 @@
 #ifndef __XIVE_INTERNAL_H
 #define __XIVE_INTERNAL_H
 
-#define XIVE_IPI_HW_IRQ		0 /* interrupt source # for IPIs */
-
 /*
  * A "disabled" interrupt should never fire, to catch problems
  * we set its logical number to this
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 8eefd152b947..c27f7bb0494b 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -65,8 +65,16 @@ static struct irq_domain *xive_irq_domain;
 #ifdef CONFIG_SMP
 static struct irq_domain *xive_ipi_irq_domain;
 
-/* The IPIs all use the same logical irq number */
-static u32 xive_ipi_irq;
+/* The IPIs use the same logical irq number when on the same chip */
+static struct xive_ipi_desc {
+	unsigned int irq;
+	char name[8]; /* enough bytes to fit IPI-XXX */
+} *xive_ipis;
+
+static unsigned int xive_ipi_cpu_to_irq(unsigned int cpu)
+{
+	return xive_ipis[cpu_to_node(cpu)].irq;
+}
 #endif
 
 /* Xive state for each CPU */
@@ -1106,25 +1114,36 @@ static const struct irq_domain_ops xive_ipi_irq_domain_ops = {
 
 static void __init xive_request_ipi(void)
 {
-	unsigned int virq;
+	unsigned int node;
 
-	xive_ipi_irq_domain = irq_domain_add_linear(NULL, 1,
+	xive_ipi_irq_domain = irq_domain_add_linear(NULL, nr_node_ids,
 						    &xive_ipi_irq_domain_ops, NULL);
 	if (WARN_ON(xive_ipi_irq_domain == NULL))
 		return;
 
-	/* Initialize it */
-	virq = irq_create_mapping(xive_ipi_irq_domain, XIVE_IPI_HW_IRQ);
-	xive_ipi_irq = virq;
+	xive_ipis = kcalloc(nr_node_ids, sizeof(*xive_ipis), GFP_KERNEL | __GFP_NOFAIL);
+	for_each_node(node) {
+		struct xive_ipi_desc *xid = &xive_ipis[node];
+		irq_hw_number_t node_ipi_hwirq = node;
+
+		/*
+		 * Map one IPI interrupt per node for all cpus of that node.
+		 * Since the HW interrupt number doesn't have any meaning,
+		 * simply use the node number.
+		 */
+		xid->irq = irq_create_mapping(xive_ipi_irq_domain, node_ipi_hwirq);
+		snprintf(xid->name, sizeof(xid->name), "IPI-%d", node);
 
-	WARN_ON(request_irq(virq, xive_muxed_ipi_action,
-			    IRQF_PERCPU | IRQF_NO_THREAD, "IPI", NULL));
+		WARN_ON(request_irq(xid->irq, xive_muxed_ipi_action,
+				    IRQF_PERCPU | IRQF_NO_THREAD, xid->name, NULL));
+	}
 }
 
 static int xive_setup_cpu_ipi(unsigned int cpu)
 {
 	struct xive_cpu *xc;
 	int rc;
+	unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu);
 
 	pr_debug("Setting up IPI for CPU %d\n", cpu);
 
@@ -1165,6 +1184,8 @@ static int xive_setup_cpu_ipi(unsigned int cpu)
 
 static void xive_cleanup_cpu_ipi(unsigned int cpu, struct xive_cpu *xc)
 {
+	unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu);
+
 	/* Disable the IPI and free the IRQ data */
 
 	/* Already cleaned up ? */
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
  2021-03-03 17:48 ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm, chip-id property Cédric Le Goater
@ 2021-03-08 17:13   ` Greg Kurz
  2021-03-09 15:33     ` Cédric Le Goater
  0 siblings, 1 reply; 38+ messages in thread
From: Greg Kurz @ 2021-03-08 17:13 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: linuxppc-dev

On Wed, 3 Mar 2021 18:48:50 +0100
Cédric Le Goater <clg@kaod.org> wrote:

> The 'chip_id' field of the XIVE CPU structure is used to choose a
> target for a source located on the same chip when possible. This field
> is assigned on the PowerNV platform using the "ibm,chip-id" property
> on pSeries under KVM when NUMA nodes are defined but it is undefined

This sentence seems to have a syntax problem... like it is missing an
'and' before 'on pSeries'.

> under PowerVM. The XIVE source structure has a similar field
> 'src_chip' which is only assigned on the PowerNV platform.
> 
> cpu_to_node() returns a compatible value on all platforms, 0 being the
> default node. It will also give us the opportunity to set the affinity
> of a source on pSeries when we can localize them.
> 

IIUC this relies on the fact that the NUMA node id is == to chip id
on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
with this change.

On the other hand, you have the pSeries case under PowerVM that
doesn't xc->chip_id, which isn't passed to any hcall AFAICT. It
looks like the chip id is only used for localization purpose in
this case, right ?

In this case, what about doing this change for pSeries only,
somewhere in spapr.c ?

> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/sysdev/xive/common.c | 7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index 595310e056f4..b8e456da28aa 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu)
>  
>  	xc = per_cpu(xive_cpu, cpu);
>  	if (!xc) {
> -		struct device_node *np;
> -
>  		xc = kzalloc_node(sizeof(struct xive_cpu),
>  				  GFP_KERNEL, cpu_to_node(cpu));
>  		if (!xc)
>  			return -ENOMEM;
> -		np = of_get_cpu_node(cpu, NULL);
> -		if (np)
> -			xc->chip_id = of_get_ibm_chip_id(np);
> -		of_node_put(np);
> +		xc->chip_id = cpu_to_node(cpu);
>  		xc->hw_ipi = XIVE_BAD_IRQ;
>  
>  		per_cpu(xive_cpu, cpu) = xc;


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 2/8] powerpc/xive: Introduce an IPI interrupt domain
  2021-03-03 17:48 ` [PATCH v2 2/8] powerpc/xive: Introduce an IPI interrupt domain Cédric Le Goater
@ 2021-03-08 17:55   ` Greg Kurz
  0 siblings, 0 replies; 38+ messages in thread
From: Greg Kurz @ 2021-03-08 17:55 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: linuxppc-dev

On Wed, 3 Mar 2021 18:48:51 +0100
Cédric Le Goater <clg@kaod.org> wrote:

> The IPI interrupt is a special case of the XIVE IRQ domain. When
> mapping and unmapping the interrupts in the Linux interrupt number
> space, the HW interrupt number 0 (XIVE_IPI_HW_IRQ) is checked to
> distinguish the IPI interrupt from other interrupts of the system.
> 
> Simplify the XIVE interrupt domain by introducing a specific domain
> for the IPI.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---

Nice !

Reviewed-by: Greg Kurz <groug@kaod.org>

>  arch/powerpc/sysdev/xive/common.c | 51 +++++++++++++------------------
>  1 file changed, 22 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index b8e456da28aa..e7783760d278 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -63,6 +63,8 @@ static const struct xive_ops *xive_ops;
>  static struct irq_domain *xive_irq_domain;
>  
>  #ifdef CONFIG_SMP
> +static struct irq_domain *xive_ipi_irq_domain;
> +
>  /* The IPIs all use the same logical irq number */
>  static u32 xive_ipi_irq;
>  #endif
> @@ -1067,20 +1069,32 @@ static struct irq_chip xive_ipi_chip = {
>  	.irq_unmask = xive_ipi_do_nothing,
>  };
>  
> +/*
> + * IPIs are marked per-cpu. We use separate HW interrupts under the
> + * hood but associated with the same "linux" interrupt
> + */
> +static int xive_ipi_irq_domain_map(struct irq_domain *h, unsigned int virq,
> +				   irq_hw_number_t hw)
> +{
> +	irq_set_chip_and_handler(virq, &xive_ipi_chip, handle_percpu_irq);
> +	return 0;
> +}
> +
> +static const struct irq_domain_ops xive_ipi_irq_domain_ops = {
> +	.map = xive_ipi_irq_domain_map,
> +};
> +
>  static void __init xive_request_ipi(void)
>  {
>  	unsigned int virq;
>  
> -	/*
> -	 * Initialization failed, move on, we might manage to
> -	 * reach the point where we display our errors before
> -	 * the system falls appart
> -	 */
> -	if (!xive_irq_domain)
> +	xive_ipi_irq_domain = irq_domain_add_linear(NULL, 1,
> +						    &xive_ipi_irq_domain_ops, NULL);
> +	if (WARN_ON(xive_ipi_irq_domain == NULL))
>  		return;
>  
>  	/* Initialize it */
> -	virq = irq_create_mapping(xive_irq_domain, XIVE_IPI_HW_IRQ);
> +	virq = irq_create_mapping(xive_ipi_irq_domain, XIVE_IPI_HW_IRQ);
>  	xive_ipi_irq = virq;
>  
>  	WARN_ON(request_irq(virq, xive_muxed_ipi_action,
> @@ -1178,19 +1192,6 @@ static int xive_irq_domain_map(struct irq_domain *h, unsigned int virq,
>  	 */
>  	irq_clear_status_flags(virq, IRQ_LEVEL);
>  
> -#ifdef CONFIG_SMP
> -	/* IPIs are special and come up with HW number 0 */
> -	if (hw == XIVE_IPI_HW_IRQ) {
> -		/*
> -		 * IPIs are marked per-cpu. We use separate HW interrupts under
> -		 * the hood but associated with the same "linux" interrupt
> -		 */
> -		irq_set_chip_and_handler(virq, &xive_ipi_chip,
> -					 handle_percpu_irq);
> -		return 0;
> -	}
> -#endif
> -
>  	rc = xive_irq_alloc_data(virq, hw);
>  	if (rc)
>  		return rc;
> @@ -1202,15 +1203,7 @@ static int xive_irq_domain_map(struct irq_domain *h, unsigned int virq,
>  
>  static void xive_irq_domain_unmap(struct irq_domain *d, unsigned int virq)
>  {
> -	struct irq_data *data = irq_get_irq_data(virq);
> -	unsigned int hw_irq;
> -
> -	/* XXX Assign BAD number */
> -	if (!data)
> -		return;
> -	hw_irq = (unsigned int)irqd_to_hwirq(data);
> -	if (hw_irq != XIVE_IPI_HW_IRQ)
> -		xive_irq_free_data(virq);
> +	xive_irq_free_data(virq);
>  }
>  
>  static int xive_irq_domain_xlate(struct irq_domain *h, struct device_node *ct,


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 3/8] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ
  2021-03-03 17:48 ` [PATCH v2 3/8] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ Cédric Le Goater
@ 2021-03-08 17:56   ` Greg Kurz
  0 siblings, 0 replies; 38+ messages in thread
From: Greg Kurz @ 2021-03-08 17:56 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: linuxppc-dev

On Wed, 3 Mar 2021 18:48:52 +0100
Cédric Le Goater <clg@kaod.org> wrote:

> The IPI interrupt has its own domain now. Testing the HW interrupt
> number is not needed anymore.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---

Reviewed-by: Greg Kurz <groug@kaod.org>

>  arch/powerpc/sysdev/xive/common.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index e7783760d278..678680531d26 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1396,13 +1396,12 @@ static void xive_flush_cpu_queue(unsigned int cpu, struct xive_cpu *xc)
>  		struct irq_desc *desc = irq_to_desc(irq);
>  		struct irq_data *d = irq_desc_get_irq_data(desc);
>  		struct xive_irq_data *xd;
> -		unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
>  
>  		/*
>  		 * Ignore anything that isn't a XIVE irq and ignore
>  		 * IPIs, so can just be dropped.
>  		 */
> -		if (d->domain != xive_irq_domain || hw_irq == XIVE_IPI_HW_IRQ)
> +		if (d->domain != xive_irq_domain)
>  			continue;
>  
>  		/*


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show()
  2021-03-03 17:48 ` [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show() Cédric Le Goater
@ 2021-03-08 18:07   ` Greg Kurz
  2021-03-08 18:11     ` Cédric Le Goater
  0 siblings, 1 reply; 38+ messages in thread
From: Greg Kurz @ 2021-03-08 18:07 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: linuxppc-dev

On Wed, 3 Mar 2021 18:48:53 +0100
Cédric Le Goater <clg@kaod.org> wrote:

> Now that the IPI interrupt has its own domain, the checks on the HW
> interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a
> check on the domain.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---

Shouldn't this have the following tags ?

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state")


Anyway,

Reviewed-by: Greg Kurz <groug@kaod.org>

>  arch/powerpc/sysdev/xive/common.c | 18 ++++--------------
>  1 file changed, 4 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index 678680531d26..7581cb12bb53 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1579,17 +1579,14 @@ static void xive_debug_show_cpu(struct seq_file *m, int cpu)
>  	seq_puts(m, "\n");
>  }
>  
> -static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct irq_data *d)
> +static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
>  {
> -	struct irq_chip *chip = irq_data_get_irq_chip(d);
> +	unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
>  	int rc;
>  	u32 target;
>  	u8 prio;
>  	u32 lirq;
>  
> -	if (!is_xive_irq(chip))
> -		return;
> -
>  	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
>  	if (rc) {
>  		seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
> @@ -1627,16 +1624,9 @@ static int xive_core_debug_show(struct seq_file *m, void *private)
>  
>  	for_each_irq_desc(i, desc) {
>  		struct irq_data *d = irq_desc_get_irq_data(desc);
> -		unsigned int hw_irq;
> -
> -		if (!d)
> -			continue;
> -
> -		hw_irq = (unsigned int)irqd_to_hwirq(d);
>  
> -		/* IPIs are special (HW number 0) */
> -		if (hw_irq != XIVE_IPI_HW_IRQ)
> -			xive_debug_show_irq(m, hw_irq, d);
> +		if (d->domain == xive_irq_domain)
> +			xive_debug_show_irq(m, d);
>  	}
>  	return 0;
>  }


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show()
  2021-03-08 18:07   ` Greg Kurz
@ 2021-03-08 18:11     ` Cédric Le Goater
  2021-03-09  9:13       ` Greg Kurz
  0 siblings, 1 reply; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-08 18:11 UTC (permalink / raw)
  To: Greg Kurz; +Cc: linuxppc-dev

On 3/8/21 7:07 PM, Greg Kurz wrote:
> On Wed, 3 Mar 2021 18:48:53 +0100
> Cédric Le Goater <clg@kaod.org> wrote:
> 
>> Now that the IPI interrupt has its own domain, the checks on the HW
>> interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a
>> check on the domain.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
> 
> Shouldn't this have the following tags ?
> 
> Reported-by: kernel test robot <lkp@intel.com>
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state")

The next patch has because it removes the useless check on irq_data.
 
C.

> 
> Anyway,
> 
> Reviewed-by: Greg Kurz <groug@kaod.org>
> 
>>  arch/powerpc/sysdev/xive/common.c | 18 ++++--------------
>>  1 file changed, 4 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
>> index 678680531d26..7581cb12bb53 100644
>> --- a/arch/powerpc/sysdev/xive/common.c
>> +++ b/arch/powerpc/sysdev/xive/common.c
>> @@ -1579,17 +1579,14 @@ static void xive_debug_show_cpu(struct seq_file *m, int cpu)
>>  	seq_puts(m, "\n");
>>  }
>>  
>> -static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct irq_data *d)
>> +static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
>>  {
>> -	struct irq_chip *chip = irq_data_get_irq_chip(d);
>> +	unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
>>  	int rc;
>>  	u32 target;
>>  	u8 prio;
>>  	u32 lirq;
>>  
>> -	if (!is_xive_irq(chip))
>> -		return;
>> -
>>  	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
>>  	if (rc) {
>>  		seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
>> @@ -1627,16 +1624,9 @@ static int xive_core_debug_show(struct seq_file *m, void *private)
>>  
>>  	for_each_irq_desc(i, desc) {
>>  		struct irq_data *d = irq_desc_get_irq_data(desc);
>> -		unsigned int hw_irq;
>> -
>> -		if (!d)
>> -			continue;
>> -
>> -		hw_irq = (unsigned int)irqd_to_hwirq(d);
>>  
>> -		/* IPIs are special (HW number 0) */
>> -		if (hw_irq != XIVE_IPI_HW_IRQ)
>> -			xive_debug_show_irq(m, hw_irq, d);
>> +		if (d->domain == xive_irq_domain)
>> +			xive_debug_show_irq(m, d);
>>  	}
>>  	return 0;
>>  }
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show()
  2021-03-08 18:11     ` Cédric Le Goater
@ 2021-03-09  9:13       ` Greg Kurz
  2021-03-09  9:42         ` Greg Kurz
  0 siblings, 1 reply; 38+ messages in thread
From: Greg Kurz @ 2021-03-09  9:13 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: linuxppc-dev

On Mon, 8 Mar 2021 19:11:11 +0100
Cédric Le Goater <clg@kaod.org> wrote:

> On 3/8/21 7:07 PM, Greg Kurz wrote:
> > On Wed, 3 Mar 2021 18:48:53 +0100
> > Cédric Le Goater <clg@kaod.org> wrote:
> > 
> >> Now that the IPI interrupt has its own domain, the checks on the HW
> >> interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a
> >> check on the domain.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> > 
> > Shouldn't this have the following tags ?
> > 
> > Reported-by: kernel test robot <lkp@intel.com>
> > Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> > Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state")
> 
> The next patch has because it removes the useless check on irq_data.
>  

Ok I get it. This report isn't about an actual crash. Just a false
positive because of the not needed check in the caller.

> C.
> 
> > 
> > Anyway,
> > 
> > Reviewed-by: Greg Kurz <groug@kaod.org>
> > 
> >>  arch/powerpc/sysdev/xive/common.c | 18 ++++--------------
> >>  1 file changed, 4 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> >> index 678680531d26..7581cb12bb53 100644
> >> --- a/arch/powerpc/sysdev/xive/common.c
> >> +++ b/arch/powerpc/sysdev/xive/common.c
> >> @@ -1579,17 +1579,14 @@ static void xive_debug_show_cpu(struct seq_file *m, int cpu)
> >>  	seq_puts(m, "\n");
> >>  }
> >>  
> >> -static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct irq_data *d)
> >> +static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
> >>  {
> >> -	struct irq_chip *chip = irq_data_get_irq_chip(d);
> >> +	unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
> >>  	int rc;
> >>  	u32 target;
> >>  	u8 prio;
> >>  	u32 lirq;
> >>  
> >> -	if (!is_xive_irq(chip))
> >> -		return;
> >> -
> >>  	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
> >>  	if (rc) {
> >>  		seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
> >> @@ -1627,16 +1624,9 @@ static int xive_core_debug_show(struct seq_file *m, void *private)
> >>  
> >>  	for_each_irq_desc(i, desc) {
> >>  		struct irq_data *d = irq_desc_get_irq_data(desc);
> >> -		unsigned int hw_irq;
> >> -
> >> -		if (!d)
> >> -			continue;
> >> -
> >> -		hw_irq = (unsigned int)irqd_to_hwirq(d);
> >>  
> >> -		/* IPIs are special (HW number 0) */
> >> -		if (hw_irq != XIVE_IPI_HW_IRQ)
> >> -			xive_debug_show_irq(m, hw_irq, d);
> >> +		if (d->domain == xive_irq_domain)
> >> +			xive_debug_show_irq(m, d);
> >>  	}
> >>  	return 0;
> >>  }
> > 
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 5/8] powerpc/xive: Drop check on irq_data in xive_core_debug_show()
  2021-03-03 17:48 ` [PATCH v2 5/8] powerpc/xive: Drop check on irq_data in xive_core_debug_show() Cédric Le Goater
@ 2021-03-09  9:18   ` Greg Kurz
  0 siblings, 0 replies; 38+ messages in thread
From: Greg Kurz @ 2021-03-09  9:18 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: linuxppc-dev, kernel test robot, Dan Carpenter

On Wed, 3 Mar 2021 18:48:54 +0100
Cédric Le Goater <clg@kaod.org> wrote:

> When looping on IRQ descriptor, irq_data is always valid.
> 
> Reported-by: kernel test robot <lkp@intel.com>
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state")
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---

Reviewed-by: Greg Kurz <groug@kaod.org>

>  arch/powerpc/sysdev/xive/common.c | 21 ++++++++++-----------
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index 7581cb12bb53..60ebd6f4b31d 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1586,6 +1586,8 @@ static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
>  	u32 target;
>  	u8 prio;
>  	u32 lirq;
> +	struct xive_irq_data *xd;
> +	u64 val;
>  
>  	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
>  	if (rc) {
> @@ -1596,17 +1598,14 @@ static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
>  	seq_printf(m, "IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ",
>  		   hw_irq, target, prio, lirq);
>  
> -	if (d) {
> -		struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
> -		u64 val = xive_esb_read(xd, XIVE_ESB_GET);
> -
> -		seq_printf(m, "flags=%c%c%c PQ=%c%c",
> -			   xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ',
> -			   xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ',
> -			   xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ',
> -			   val & XIVE_ESB_VAL_P ? 'P' : '-',
> -			   val & XIVE_ESB_VAL_Q ? 'Q' : '-');
> -	}
> +	xd = irq_data_get_irq_handler_data(d);
> +	val = xive_esb_read(xd, XIVE_ESB_GET);
> +	seq_printf(m, "flags=%c%c%c PQ=%c%c",
> +		   xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ',
> +		   xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ',
> +		   xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ',
> +		   val & XIVE_ESB_VAL_P ? 'P' : '-',
> +		   val & XIVE_ESB_VAL_Q ? 'Q' : '-');
>  	seq_puts(m, "\n");
>  }
>  


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 6/8] powerpc/xive: Simplify the dump of XIVE interrupts under xmon
  2021-03-03 17:48 ` [PATCH v2 6/8] powerpc/xive: Simplify the dump of XIVE interrupts under xmon Cédric Le Goater
@ 2021-03-09  9:22   ` Greg Kurz
  0 siblings, 0 replies; 38+ messages in thread
From: Greg Kurz @ 2021-03-09  9:22 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: linuxppc-dev

On Wed, 3 Mar 2021 18:48:55 +0100
Cédric Le Goater <clg@kaod.org> wrote:

> Move the xmon routine under XIVE subsystem and rework the loop on the
> interrupts taking into account the xive_irq_domain to filter out IPIs.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---

Nice again ! :)

Reviewed-by: Greg Kurz <groug@kaod.org>

>  arch/powerpc/include/asm/xive.h   |  1 +
>  arch/powerpc/sysdev/xive/common.c | 14 ++++++++++++++
>  arch/powerpc/xmon/xmon.c          | 28 ++--------------------------
>  3 files changed, 17 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> index 9a312b975ca8..aa094a8655b0 100644
> --- a/arch/powerpc/include/asm/xive.h
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -102,6 +102,7 @@ void xive_flush_interrupt(void);
>  /* xmon hook */
>  void xmon_xive_do_dump(int cpu);
>  int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d);
> +void xmon_xive_get_irq_all(void);
>  
>  /* APIs used by KVM */
>  u32 xive_native_default_eq_shift(void);
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index 60ebd6f4b31d..f6b7b15bbb3a 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -291,6 +291,20 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
>  	return 0;
>  }
>  
> +void xmon_xive_get_irq_all(void)
> +{
> +	unsigned int i;
> +	struct irq_desc *desc;
> +
> +	for_each_irq_desc(i, desc) {
> +		struct irq_data *d = irq_desc_get_irq_data(desc);
> +		unsigned int hwirq = (unsigned int)irqd_to_hwirq(d);
> +
> +		if (d->domain == xive_irq_domain)
> +			xmon_xive_get_irq_config(hwirq, d);
> +	}
> +}
> +
>  #endif /* CONFIG_XMON */
>  
>  static unsigned int xive_get_irq(void)
> diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
> index 3fe37495f63d..80fbf8968f77 100644
> --- a/arch/powerpc/xmon/xmon.c
> +++ b/arch/powerpc/xmon/xmon.c
> @@ -2727,30 +2727,6 @@ static void dump_all_xives(void)
>  		dump_one_xive(cpu);
>  }
>  
> -static void dump_one_xive_irq(u32 num, struct irq_data *d)
> -{
> -	xmon_xive_get_irq_config(num, d);
> -}
> -
> -static void dump_all_xive_irq(void)
> -{
> -	unsigned int i;
> -	struct irq_desc *desc;
> -
> -	for_each_irq_desc(i, desc) {
> -		struct irq_data *d = irq_desc_get_irq_data(desc);
> -		unsigned int hwirq;
> -
> -		if (!d)
> -			continue;
> -
> -		hwirq = (unsigned int)irqd_to_hwirq(d);
> -		/* IPIs are special (HW number 0) */
> -		if (hwirq)
> -			dump_one_xive_irq(hwirq, d);
> -	}
> -}
> -
>  static void dump_xives(void)
>  {
>  	unsigned long num;
> @@ -2767,9 +2743,9 @@ static void dump_xives(void)
>  		return;
>  	} else if (c == 'i') {
>  		if (scanhex(&num))
> -			dump_one_xive_irq(num, NULL);
> +			xmon_xive_get_irq_config(num, NULL);
>  		else
> -			dump_all_xive_irq();
> +			xmon_xive_get_irq_all();
>  		return;
>  	}
>  


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show()
  2021-03-09  9:13       ` Greg Kurz
@ 2021-03-09  9:42         ` Greg Kurz
  2021-03-09 15:39           ` Cédric Le Goater
  0 siblings, 1 reply; 38+ messages in thread
From: Greg Kurz @ 2021-03-09  9:42 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: linuxppc-dev

On Tue, 9 Mar 2021 10:13:39 +0100
Greg Kurz <groug@kaod.org> wrote:

> On Mon, 8 Mar 2021 19:11:11 +0100
> Cédric Le Goater <clg@kaod.org> wrote:
> 
> > On 3/8/21 7:07 PM, Greg Kurz wrote:
> > > On Wed, 3 Mar 2021 18:48:53 +0100
> > > Cédric Le Goater <clg@kaod.org> wrote:
> > > 
> > >> Now that the IPI interrupt has its own domain, the checks on the HW
> > >> interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a
> > >> check on the domain.
> > >>
> > >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> > >> ---
> > > 
> > > Shouldn't this have the following tags ?
> > > 
> > > Reported-by: kernel test robot <lkp@intel.com>
> > > Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> > > Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state")
> > 
> > The next patch has because it removes the useless check on irq_data.
> >  
> 
> Ok I get it. This report isn't about an actual crash. Just a false
> positive because of the not needed check in the caller.
> 

Hrm... I meant because of the check in xive_debug_show_irq(). On the
contrary, the check removed by this patch in xive_core_debug_show()
was rather an explicit hint that xive_debug_show_irq() couldn't be
called with d being NULL.

> > C.
> > 
> > > 
> > > Anyway,
> > > 
> > > Reviewed-by: Greg Kurz <groug@kaod.org>
> > > 
> > >>  arch/powerpc/sysdev/xive/common.c | 18 ++++--------------
> > >>  1 file changed, 4 insertions(+), 14 deletions(-)
> > >>
> > >> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> > >> index 678680531d26..7581cb12bb53 100644
> > >> --- a/arch/powerpc/sysdev/xive/common.c
> > >> +++ b/arch/powerpc/sysdev/xive/common.c
> > >> @@ -1579,17 +1579,14 @@ static void xive_debug_show_cpu(struct seq_file *m, int cpu)
> > >>  	seq_puts(m, "\n");
> > >>  }
> > >>  
> > >> -static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct irq_data *d)
> > >> +static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
> > >>  {
> > >> -	struct irq_chip *chip = irq_data_get_irq_chip(d);
> > >> +	unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
> > >>  	int rc;
> > >>  	u32 target;
> > >>  	u8 prio;
> > >>  	u32 lirq;
> > >>  
> > >> -	if (!is_xive_irq(chip))
> > >> -		return;
> > >> -
> > >>  	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
> > >>  	if (rc) {
> > >>  		seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
> > >> @@ -1627,16 +1624,9 @@ static int xive_core_debug_show(struct seq_file *m, void *private)
> > >>  
> > >>  	for_each_irq_desc(i, desc) {
> > >>  		struct irq_data *d = irq_desc_get_irq_data(desc);
> > >> -		unsigned int hw_irq;
> > >> -
> > >> -		if (!d)
> > >> -			continue;
> > >> -
> > >> -		hw_irq = (unsigned int)irqd_to_hwirq(d);
> > >>  
> > >> -		/* IPIs are special (HW number 0) */
> > >> -		if (hw_irq != XIVE_IPI_HW_IRQ)
> > >> -			xive_debug_show_irq(m, hw_irq, d);
> > >> +		if (d->domain == xive_irq_domain)
> > >> +			xive_debug_show_irq(m, d);
> > >>  	}
> > >>  	return 0;
> > >>  }
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 7/8] powerpc/xive: Fix xmon command "dxi"
  2021-03-03 17:48 ` [PATCH v2 7/8] powerpc/xive: Fix xmon command "dxi" Cédric Le Goater
@ 2021-03-09 10:23   ` Greg Kurz
  2021-03-09 15:49     ` Cédric Le Goater
  0 siblings, 1 reply; 38+ messages in thread
From: Greg Kurz @ 2021-03-09 10:23 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: linuxppc-dev, kernel test robot, Dan Carpenter

On Wed, 3 Mar 2021 18:48:56 +0100
Cédric Le Goater <clg@kaod.org> wrote:

> When under xmon, the "dxi" command dumps the state of the XIVE
> interrupts. If an interrupt number is specified, only the state of
> the associated XIVE interrupt is dumped. This form of the command
> lacks an irq_data parameter which is nevertheless used by
> xmon_xive_get_irq_config(), leading to an xmon crash.
> 
> Fix that by doing a lookup in the system IRQ mapping to query the IRQ
> descriptor data. Invalid interrupt numbers, or not belonging to the
> XIVE IRQ domain, OPAL event interrupt number for instance, should be
> caught by the previous query done at the firmware level.
> 
> Reported-by: kernel test robot <lkp@intel.com>
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Fixes: 97ef27507793 ("powerpc/xive: Fix xmon support on the PowerNV platform")
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---

I've tested this in a KVM guest and it seems to do the job.

6:mon> dxi 1201
IRQ 0x00001201 : target=0xfffffc00 prio=ff lirq=0x0 flags= LH PQ=-Q

Bad HW irq numbers are filtered by the hypervisor:

6:mon> dxi bad
[  696.390577] xive: H_INT_GET_SOURCE_CONFIG lisn=2989 failed -55
IRQ 0x00000bad : no config rc=-6

Note that this also allows to show IPIs:

6:mon> dxi 0
IRQ 0x00000000 : target=0x0 prio=06 lirq=0x10 

This is a bit inconsistent with output of the 0-argument form of "dxi",
which filters them out for a reason that isn't obvious to me. No big
deal though, this should be addressed in another patch anyway.

Reviewed-and-tested-by: Greg Kurz <groug@kaod.org>

>  arch/powerpc/sysdev/xive/common.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index f6b7b15bbb3a..8eefd152b947 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -255,17 +255,20 @@ notrace void xmon_xive_do_dump(int cpu)
>  	xmon_printf("\n");
>  }
>  
> +static struct irq_data *xive_get_irq_data(u32 hw_irq)
> +{
> +	unsigned int irq = irq_find_mapping(xive_irq_domain, hw_irq);
> +
> +	return irq ? irq_get_irq_data(irq) : NULL;
> +}
> +
>  int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
>  {
> -	struct irq_chip *chip = irq_data_get_irq_chip(d);
>  	int rc;
>  	u32 target;
>  	u8 prio;
>  	u32 lirq;
>  
> -	if (!is_xive_irq(chip))
> -		return -EINVAL;
> -
>  	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
>  	if (rc) {
>  		xmon_printf("IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
> @@ -275,6 +278,9 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
>  	xmon_printf("IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ",
>  		    hw_irq, target, prio, lirq);
>  
> +	if (!d)
> +		d = xive_get_irq_data(hw_irq);
> +
>  	if (d) {
>  		struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
>  		u64 val = xive_esb_read(xd, XIVE_ESB_GET);


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node
  2021-03-03 17:48 ` [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
@ 2021-03-09 13:23   ` Greg Kurz
  2021-03-09 15:52     ` Cédric Le Goater
  2021-03-30 16:18   ` Cédric Le Goater
  1 sibling, 1 reply; 38+ messages in thread
From: Greg Kurz @ 2021-03-09 13:23 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: linuxppc-dev

On Wed, 3 Mar 2021 18:48:57 +0100
Cédric Le Goater <clg@kaod.org> wrote:

> ipistorm [*] can be used to benchmark the raw interrupt rate of an
> interrupt controller by measuring the number of IPIs a system can
> sustain. When applied to the XIVE interrupt controller of POWER9 and
> POWER10 systems, a significant drop of the interrupt rate can be
> observed when crossing the second node boundary.
> 
> This is due to the fact that a single IPI interrupt is used for all
> CPUs of the system. The structure is shared and the cache line updates
> impact greatly the traffic between nodes and the overall IPI
> performance.
> 
> As a workaround, the impact can be reduced by deactivating the IRQ
> lockup detector ("noirqdebug") which does a lot of accounting in the
> Linux IRQ descriptor structure and is responsible for most of the
> performance penalty.
> 
> As a fix, this proposal allocates an IPI interrupt per node, to be
> shared by all CPUs of that node. It solves the scaling issue, the IRQ
> lockup detector still has an impact but the XIVE interrupt rate scales
> linearly. It also improves the "noirqdebug" case as showed in the
> tables below.
> 
>  * P9 DD2.2 - 2s * 64 threads
> 
>                                                "noirqdebug"
>                         Mint/s                    Mint/s
>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
>  --------------------------------------------------------------
>  1      0-15     4.984023   4.875405       4.996536   5.048892
>         0-31    10.879164  10.544040      10.757632  11.037859
>         0-47    15.345301  14.688764      14.926520  15.310053
>         0-63    17.064907  17.066812      17.613416  17.874511
>  2      0-79    11.768764  21.650749      22.689120  22.566508
>         0-95    10.616812  26.878789      28.434703  28.320324
>         0-111   10.151693  31.397803      31.771773  32.388122
>         0-127    9.948502  33.139336      34.875716  35.224548
> 
>  * P10 DD1 - 4s (not homogeneous) 352 threads
> 
>                                                "noirqdebug"
>                         Mint/s                    Mint/s
>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
>  --------------------------------------------------------------
>  1      0-15     2.409402   2.364108       2.383303   2.395091
>         0-31     6.028325   6.046075       6.089999   6.073750
>         0-47     8.655178   8.644531       8.712830   8.724702
>         0-63    11.629652  11.735953      12.088203  12.055979
>         0-79    14.392321  14.729959      14.986701  14.973073
>         0-95    12.604158  13.004034      17.528748  17.568095
>  2      0-111    9.767753  13.719831      19.968606  20.024218
>         0-127    6.744566  16.418854      22.898066  22.995110
>         0-143    6.005699  19.174421      25.425622  25.417541
>         0-159    5.649719  21.938836      27.952662  28.059603
>         0-175    5.441410  24.109484      31.133915  31.127996
>  3      0-191    5.318341  24.405322      33.999221  33.775354
>         0-207    5.191382  26.449769      36.050161  35.867307
>         0-223    5.102790  29.356943      39.544135  39.508169
>         0-239    5.035295  31.933051      42.135075  42.071975
>         0-255    4.969209  34.477367      44.655395  44.757074
>  4      0-271    4.907652  35.887016      47.080545  47.318537
>         0-287    4.839581  38.076137      50.464307  50.636219
>         0-303    4.786031  40.881319      53.478684  53.310759
>         0-319    4.743750  43.448424      56.388102  55.973969
>         0-335    4.709936  45.623532      59.400930  58.926857
>         0-351    4.681413  45.646151      62.035804  61.830057
> 
> [*] https://github.com/antonblanchard/ipistorm
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/sysdev/xive/xive-internal.h |  2 --
>  arch/powerpc/sysdev/xive/common.c        | 39 ++++++++++++++++++------
>  2 files changed, 30 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/xive-internal.h b/arch/powerpc/sysdev/xive/xive-internal.h
> index 9cf57c722faa..b3a456fdd3a5 100644
> --- a/arch/powerpc/sysdev/xive/xive-internal.h
> +++ b/arch/powerpc/sysdev/xive/xive-internal.h
> @@ -5,8 +5,6 @@
>  #ifndef __XIVE_INTERNAL_H
>  #define __XIVE_INTERNAL_H
>  
> -#define XIVE_IPI_HW_IRQ		0 /* interrupt source # for IPIs */
> -
>  /*
>   * A "disabled" interrupt should never fire, to catch problems
>   * we set its logical number to this
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index 8eefd152b947..c27f7bb0494b 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -65,8 +65,16 @@ static struct irq_domain *xive_irq_domain;
>  #ifdef CONFIG_SMP
>  static struct irq_domain *xive_ipi_irq_domain;
>  
> -/* The IPIs all use the same logical irq number */
> -static u32 xive_ipi_irq;
> +/* The IPIs use the same logical irq number when on the same chip */
> +static struct xive_ipi_desc {
> +	unsigned int irq;
> +	char name[8]; /* enough bytes to fit IPI-XXX */

So this assumes that the node number that node is <= 999 ? This
is certainly the case for now since CONFIG_NODES_SHIFT is 8
on ppc64 but starting with 10, you'd have truncated names.
What about deriving the size of name[] from CONFIG_NODES_SHIFT ?

Apart from that, LGTM. Probably not worth to respin just for
this.

I also could give a try in a KVM guest.

Topology passed to QEMU:

  -smp 8,maxcpus=8,cores=2,threads=2,sockets=2 \
  -numa node,nodeid=0,cpus=0-4 \
  -numa node,nodeid=1,cpus=4-7

Topology observed in guest with lstopo :

  Package L#0
    NUMANode L#0 (P#0 30GB)
    L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#1)
    L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#2)
      PU L#3 (P#3)
  Package L#1
    NUMANode L#1 (P#1 32GB)
    L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#4)
      PU L#5 (P#5)
    L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#6)
      PU L#7 (P#7)

Interrupts in guest:

$ cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
 16:       1023        871       1042        749          0          0          0          0  XIVE-IPI   0 Edge      IPI-0
 17:          0          0          0          0       2123       1019       1263       1288  XIVE-IPI   1 Edge      IPI-1

IPIs are mapped to the appropriate nodes, and the numbers indicate
that everything is working as expected.

Reviewed-and-tested-by: Greg Kurz <groug@kaod.org>

> +} *xive_ipis;
> +
> +static unsigned int xive_ipi_cpu_to_irq(unsigned int cpu)
> +{
> +	return xive_ipis[cpu_to_node(cpu)].irq;
> +}
>  #endif
>  
>  /* Xive state for each CPU */
> @@ -1106,25 +1114,36 @@ static const struct irq_domain_ops xive_ipi_irq_domain_ops = {
>  
>  static void __init xive_request_ipi(void)
>  {
> -	unsigned int virq;
> +	unsigned int node;
>  
> -	xive_ipi_irq_domain = irq_domain_add_linear(NULL, 1,
> +	xive_ipi_irq_domain = irq_domain_add_linear(NULL, nr_node_ids,
>  						    &xive_ipi_irq_domain_ops, NULL);
>  	if (WARN_ON(xive_ipi_irq_domain == NULL))
>  		return;
>  
> -	/* Initialize it */
> -	virq = irq_create_mapping(xive_ipi_irq_domain, XIVE_IPI_HW_IRQ);
> -	xive_ipi_irq = virq;
> +	xive_ipis = kcalloc(nr_node_ids, sizeof(*xive_ipis), GFP_KERNEL | __GFP_NOFAIL);
> +	for_each_node(node) {
> +		struct xive_ipi_desc *xid = &xive_ipis[node];
> +		irq_hw_number_t node_ipi_hwirq = node;
> +
> +		/*
> +		 * Map one IPI interrupt per node for all cpus of that node.
> +		 * Since the HW interrupt number doesn't have any meaning,
> +		 * simply use the node number.
> +		 */
> +		xid->irq = irq_create_mapping(xive_ipi_irq_domain, node_ipi_hwirq);
> +		snprintf(xid->name, sizeof(xid->name), "IPI-%d", node);
>  
> -	WARN_ON(request_irq(virq, xive_muxed_ipi_action,
> -			    IRQF_PERCPU | IRQF_NO_THREAD, "IPI", NULL));
> +		WARN_ON(request_irq(xid->irq, xive_muxed_ipi_action,
> +				    IRQF_PERCPU | IRQF_NO_THREAD, xid->name, NULL));
> +	}
>  }
>  
>  static int xive_setup_cpu_ipi(unsigned int cpu)
>  {
>  	struct xive_cpu *xc;
>  	int rc;
> +	unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu);
>  
>  	pr_debug("Setting up IPI for CPU %d\n", cpu);
>  
> @@ -1165,6 +1184,8 @@ static int xive_setup_cpu_ipi(unsigned int cpu)
>  
>  static void xive_cleanup_cpu_ipi(unsigned int cpu, struct xive_cpu *xc)
>  {
> +	unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu);
> +
>  	/* Disable the IPI and free the IRQ data */
>  
>  	/* Already cleaned up ? */


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
  2021-03-08 17:13   ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property Greg Kurz
@ 2021-03-09 15:33     ` Cédric Le Goater
  2021-03-09 17:08       ` Daniel Henrique Barboza
  0 siblings, 1 reply; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-09 15:33 UTC (permalink / raw)
  To: Greg Kurz; +Cc: Daniel Henrique Barboza, linuxppc-dev

On 3/8/21 6:13 PM, Greg Kurz wrote:
> On Wed, 3 Mar 2021 18:48:50 +0100
> Cédric Le Goater <clg@kaod.org> wrote:
> 
>> The 'chip_id' field of the XIVE CPU structure is used to choose a
>> target for a source located on the same chip when possible. This field
>> is assigned on the PowerNV platform using the "ibm,chip-id" property
>> on pSeries under KVM when NUMA nodes are defined but it is undefined
> 
> This sentence seems to have a syntax problem... like it is missing an
> 'and' before 'on pSeries'.

ah yes, or simply a comma.

>> under PowerVM. The XIVE source structure has a similar field
>> 'src_chip' which is only assigned on the PowerNV platform.
>>
>> cpu_to_node() returns a compatible value on all platforms, 0 being the
>> default node. It will also give us the opportunity to set the affinity
>> of a source on pSeries when we can localize them.
>>
> 
> IIUC this relies on the fact that the NUMA node id is == to chip id
> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
> with this change.

Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall 
H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
in Cc:)

On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
the node id. This value is built from the chip id in OPAL, so the 
value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
property are unlikely to be different.

cpu_to_node(cpu) is used in many places to allocate the structures 
locally to the owning node. XIVE is not an exception (see below in the 
same patch), it is better to be consistent and get the same information 
(node id) using the same routine.


In Linux, "ibm,chip-id" is only used in low level PowerNV drivers : 
LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
unifies the controllers of the system to only expose one the OS. This
is problematic and should be changed but it's another topic.


> On the other hand, you have the pSeries case under PowerVM that
> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.

yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning 
under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid 
chip id. 

QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
always correct btw)

> It looks like the chip id is only used for localization purpose in
> this case, right ?

Yes and PAPR sources are not localized. So it's not used. MSI sources 
could be if we rewrote the MSI driver.

> In this case, what about doing this change for pSeries only,
> somewhere in spapr.c ?

The IPI code is common to all platforms and all have the same issue. 
I rather not.

Thanks,

C.
 
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  arch/powerpc/sysdev/xive/common.c | 7 +------
>>  1 file changed, 1 insertion(+), 6 deletions(-)
>>
>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
>> index 595310e056f4..b8e456da28aa 100644
>> --- a/arch/powerpc/sysdev/xive/common.c
>> +++ b/arch/powerpc/sysdev/xive/common.c
>> @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu)
>>  
>>  	xc = per_cpu(xive_cpu, cpu);
>>  	if (!xc) {
>> -		struct device_node *np;
>> -
>>  		xc = kzalloc_node(sizeof(struct xive_cpu),
>>  				  GFP_KERNEL, cpu_to_node(cpu));
>>  		if (!xc)
>>  			return -ENOMEM;
>> -		np = of_get_cpu_node(cpu, NULL);
>> -		if (np)
>> -			xc->chip_id = of_get_ibm_chip_id(np);
>> -		of_node_put(np);
>> +		xc->chip_id = cpu_to_node(cpu);
>>  		xc->hw_ipi = XIVE_BAD_IRQ;
>>  
>>  		per_cpu(xive_cpu, cpu) = xc;
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show()
  2021-03-09  9:42         ` Greg Kurz
@ 2021-03-09 15:39           ` Cédric Le Goater
  0 siblings, 0 replies; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-09 15:39 UTC (permalink / raw)
  To: Greg Kurz; +Cc: linuxppc-dev

On 3/9/21 10:42 AM, Greg Kurz wrote:
> On Tue, 9 Mar 2021 10:13:39 +0100
> Greg Kurz <groug@kaod.org> wrote:
> 
>> On Mon, 8 Mar 2021 19:11:11 +0100
>> Cédric Le Goater <clg@kaod.org> wrote:
>>
>>> On 3/8/21 7:07 PM, Greg Kurz wrote:
>>>> On Wed, 3 Mar 2021 18:48:53 +0100
>>>> Cédric Le Goater <clg@kaod.org> wrote:
>>>>
>>>>> Now that the IPI interrupt has its own domain, the checks on the HW
>>>>> interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a
>>>>> check on the domain.
>>>>>
>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>> ---
>>>>
>>>> Shouldn't this have the following tags ?
>>>>
>>>> Reported-by: kernel test robot <lkp@intel.com>
>>>> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
>>>> Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state")
>>>
>>> The next patch has because it removes the useless check on irq_data.
>>>  
>>
>> Ok I get it. This report isn't about an actual crash. Just a false
>> positive because of the not needed check in the caller.
>>
> 
> Hrm... I meant because of the check in xive_debug_show_irq(). On the
> contrary, the check removed by this patch in xive_core_debug_show()
> was rather an explicit hint that xive_debug_show_irq() couldn't be
> called with d being NULL.

yes. irq_desc_get_irq_data() does not return a NULL value and 
xive_debug_show_irq() is only called from the for_each_irq_desc()
loop. 


C.


> 
>>> C.
>>>
>>>>
>>>> Anyway,
>>>>
>>>> Reviewed-by: Greg Kurz <groug@kaod.org>
>>>>
>>>>>  arch/powerpc/sysdev/xive/common.c | 18 ++++--------------
>>>>>  1 file changed, 4 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
>>>>> index 678680531d26..7581cb12bb53 100644
>>>>> --- a/arch/powerpc/sysdev/xive/common.c
>>>>> +++ b/arch/powerpc/sysdev/xive/common.c
>>>>> @@ -1579,17 +1579,14 @@ static void xive_debug_show_cpu(struct seq_file *m, int cpu)
>>>>>  	seq_puts(m, "\n");
>>>>>  }
>>>>>  
>>>>> -static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct irq_data *d)
>>>>> +static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
>>>>>  {
>>>>> -	struct irq_chip *chip = irq_data_get_irq_chip(d);
>>>>> +	unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
>>>>>  	int rc;
>>>>>  	u32 target;
>>>>>  	u8 prio;
>>>>>  	u32 lirq;
>>>>>  
>>>>> -	if (!is_xive_irq(chip))
>>>>> -		return;
>>>>> -
>>>>>  	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
>>>>>  	if (rc) {
>>>>>  		seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
>>>>> @@ -1627,16 +1624,9 @@ static int xive_core_debug_show(struct seq_file *m, void *private)
>>>>>  
>>>>>  	for_each_irq_desc(i, desc) {
>>>>>  		struct irq_data *d = irq_desc_get_irq_data(desc);
>>>>> -		unsigned int hw_irq;
>>>>> -
>>>>> -		if (!d)
>>>>> -			continue;
>>>>> -
>>>>> -		hw_irq = (unsigned int)irqd_to_hwirq(d);
>>>>>  
>>>>> -		/* IPIs are special (HW number 0) */
>>>>> -		if (hw_irq != XIVE_IPI_HW_IRQ)
>>>>> -			xive_debug_show_irq(m, hw_irq, d);
>>>>> +		if (d->domain == xive_irq_domain)
>>>>> +			xive_debug_show_irq(m, d);
>>>>>  	}
>>>>>  	return 0;
>>>>>  }
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 7/8] powerpc/xive: Fix xmon command "dxi"
  2021-03-09 10:23   ` Greg Kurz
@ 2021-03-09 15:49     ` Cédric Le Goater
  0 siblings, 0 replies; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-09 15:49 UTC (permalink / raw)
  To: Greg Kurz; +Cc: linuxppc-dev, kernel test robot, Dan Carpenter

On 3/9/21 11:23 AM, Greg Kurz wrote:
> On Wed, 3 Mar 2021 18:48:56 +0100
> Cédric Le Goater <clg@kaod.org> wrote:
> 
>> When under xmon, the "dxi" command dumps the state of the XIVE
>> interrupts. If an interrupt number is specified, only the state of
>> the associated XIVE interrupt is dumped. This form of the command
>> lacks an irq_data parameter which is nevertheless used by
>> xmon_xive_get_irq_config(), leading to an xmon crash.
>>
>> Fix that by doing a lookup in the system IRQ mapping to query the IRQ
>> descriptor data. Invalid interrupt numbers, or not belonging to the
>> XIVE IRQ domain, OPAL event interrupt number for instance, should be
>> caught by the previous query done at the firmware level.
>>
>> Reported-by: kernel test robot <lkp@intel.com>
>> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
>> Fixes: 97ef27507793 ("powerpc/xive: Fix xmon support on the PowerNV platform")
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
> 
> I've tested this in a KVM guest and it seems to do the job.
> 
> 6:mon> dxi 1201
> IRQ 0x00001201 : target=0xfffffc00 prio=ff lirq=0x0 flags= LH PQ=-Q
> 
> Bad HW irq numbers are filtered by the hypervisor:
> 
> 6:mon> dxi bad
> [  696.390577] xive: H_INT_GET_SOURCE_CONFIG lisn=2989 failed -55
> IRQ 0x00000bad : no config rc=-6
> 
> Note that this also allows to show IPIs:
> 
> 6:mon> dxi 0
> IRQ 0x00000000 : target=0x0 prio=06 lirq=0x10 
> 
> This is a bit inconsistent with output of the 0-argument form of "dxi",

It's an hidden feature ! :) 

Yes. You can query at the FW level the configuration of any valid HW 
interrupt number where as "dxi" without an argument only loops on the 
XIVE IRQ domain which does not include the XIVE CPU IPIs which are 
special. You should "dxa" for these. 

> which filters them out for a reason that isn't obvious to me. 

For historical reason. XIVE support for PowerNV was the first to reach 
Linux. If you run the same xmon commands on a PowerNV machine (you could 
use QEMU), the ouput is different. it has more low level details.

> No big deal though, this should be addressed in another patch anyway.

We could simplify the xmon helpers to be sync with the debugfs one
and the QEMU/KVM "info pic" command. I agree.

Thanks,

C. 


> Reviewed-and-tested-by: Greg Kurz <groug@kaod.org>
> 
>>  arch/powerpc/sysdev/xive/common.c | 14 ++++++++++----
>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
>> index f6b7b15bbb3a..8eefd152b947 100644
>> --- a/arch/powerpc/sysdev/xive/common.c
>> +++ b/arch/powerpc/sysdev/xive/common.c
>> @@ -255,17 +255,20 @@ notrace void xmon_xive_do_dump(int cpu)
>>  	xmon_printf("\n");
>>  }
>>  
>> +static struct irq_data *xive_get_irq_data(u32 hw_irq)
>> +{
>> +	unsigned int irq = irq_find_mapping(xive_irq_domain, hw_irq);
>> +
>> +	return irq ? irq_get_irq_data(irq) : NULL;
>> +}
>> +
>>  int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
>>  {
>> -	struct irq_chip *chip = irq_data_get_irq_chip(d);
>>  	int rc;
>>  	u32 target;
>>  	u8 prio;
>>  	u32 lirq;
>>  
>> -	if (!is_xive_irq(chip))
>> -		return -EINVAL;
>> -
>>  	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
>>  	if (rc) {
>>  		xmon_printf("IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
>> @@ -275,6 +278,9 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
>>  	xmon_printf("IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ",
>>  		    hw_irq, target, prio, lirq);
>>  
>> +	if (!d)
>> +		d = xive_get_irq_data(hw_irq);
>> +
>>  	if (d) {
>>  		struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
>>  		u64 val = xive_esb_read(xd, XIVE_ESB_GET);
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node
  2021-03-09 13:23   ` Greg Kurz
@ 2021-03-09 15:52     ` Cédric Le Goater
  0 siblings, 0 replies; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-09 15:52 UTC (permalink / raw)
  To: Greg Kurz; +Cc: linuxppc-dev

On 3/9/21 2:23 PM, Greg Kurz wrote:
> On Wed, 3 Mar 2021 18:48:57 +0100
> Cédric Le Goater <clg@kaod.org> wrote:
> 
>> ipistorm [*] can be used to benchmark the raw interrupt rate of an
>> interrupt controller by measuring the number of IPIs a system can
>> sustain. When applied to the XIVE interrupt controller of POWER9 and
>> POWER10 systems, a significant drop of the interrupt rate can be
>> observed when crossing the second node boundary.
>>
>> This is due to the fact that a single IPI interrupt is used for all
>> CPUs of the system. The structure is shared and the cache line updates
>> impact greatly the traffic between nodes and the overall IPI
>> performance.
>>
>> As a workaround, the impact can be reduced by deactivating the IRQ
>> lockup detector ("noirqdebug") which does a lot of accounting in the
>> Linux IRQ descriptor structure and is responsible for most of the
>> performance penalty.
>>
>> As a fix, this proposal allocates an IPI interrupt per node, to be
>> shared by all CPUs of that node. It solves the scaling issue, the IRQ
>> lockup detector still has an impact but the XIVE interrupt rate scales
>> linearly. It also improves the "noirqdebug" case as showed in the
>> tables below.
>>
>>  * P9 DD2.2 - 2s * 64 threads
>>
>>                                                "noirqdebug"
>>                         Mint/s                    Mint/s
>>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
>>  --------------------------------------------------------------
>>  1      0-15     4.984023   4.875405       4.996536   5.048892
>>         0-31    10.879164  10.544040      10.757632  11.037859
>>         0-47    15.345301  14.688764      14.926520  15.310053
>>         0-63    17.064907  17.066812      17.613416  17.874511
>>  2      0-79    11.768764  21.650749      22.689120  22.566508
>>         0-95    10.616812  26.878789      28.434703  28.320324
>>         0-111   10.151693  31.397803      31.771773  32.388122
>>         0-127    9.948502  33.139336      34.875716  35.224548
>>
>>  * P10 DD1 - 4s (not homogeneous) 352 threads
>>
>>                                                "noirqdebug"
>>                         Mint/s                    Mint/s
>>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
>>  --------------------------------------------------------------
>>  1      0-15     2.409402   2.364108       2.383303   2.395091
>>         0-31     6.028325   6.046075       6.089999   6.073750
>>         0-47     8.655178   8.644531       8.712830   8.724702
>>         0-63    11.629652  11.735953      12.088203  12.055979
>>         0-79    14.392321  14.729959      14.986701  14.973073
>>         0-95    12.604158  13.004034      17.528748  17.568095
>>  2      0-111    9.767753  13.719831      19.968606  20.024218
>>         0-127    6.744566  16.418854      22.898066  22.995110
>>         0-143    6.005699  19.174421      25.425622  25.417541
>>         0-159    5.649719  21.938836      27.952662  28.059603
>>         0-175    5.441410  24.109484      31.133915  31.127996
>>  3      0-191    5.318341  24.405322      33.999221  33.775354
>>         0-207    5.191382  26.449769      36.050161  35.867307
>>         0-223    5.102790  29.356943      39.544135  39.508169
>>         0-239    5.035295  31.933051      42.135075  42.071975
>>         0-255    4.969209  34.477367      44.655395  44.757074
>>  4      0-271    4.907652  35.887016      47.080545  47.318537
>>         0-287    4.839581  38.076137      50.464307  50.636219
>>         0-303    4.786031  40.881319      53.478684  53.310759
>>         0-319    4.743750  43.448424      56.388102  55.973969
>>         0-335    4.709936  45.623532      59.400930  58.926857
>>         0-351    4.681413  45.646151      62.035804  61.830057
>>
>> [*] https://github.com/antonblanchard/ipistorm
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  arch/powerpc/sysdev/xive/xive-internal.h |  2 --
>>  arch/powerpc/sysdev/xive/common.c        | 39 ++++++++++++++++++------
>>  2 files changed, 30 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/powerpc/sysdev/xive/xive-internal.h b/arch/powerpc/sysdev/xive/xive-internal.h
>> index 9cf57c722faa..b3a456fdd3a5 100644
>> --- a/arch/powerpc/sysdev/xive/xive-internal.h
>> +++ b/arch/powerpc/sysdev/xive/xive-internal.h
>> @@ -5,8 +5,6 @@
>>  #ifndef __XIVE_INTERNAL_H
>>  #define __XIVE_INTERNAL_H
>>  
>> -#define XIVE_IPI_HW_IRQ		0 /* interrupt source # for IPIs */
>> -
>>  /*
>>   * A "disabled" interrupt should never fire, to catch problems
>>   * we set its logical number to this
>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
>> index 8eefd152b947..c27f7bb0494b 100644
>> --- a/arch/powerpc/sysdev/xive/common.c
>> +++ b/arch/powerpc/sysdev/xive/common.c
>> @@ -65,8 +65,16 @@ static struct irq_domain *xive_irq_domain;
>>  #ifdef CONFIG_SMP
>>  static struct irq_domain *xive_ipi_irq_domain;
>>  
>> -/* The IPIs all use the same logical irq number */
>> -static u32 xive_ipi_irq;
>> +/* The IPIs use the same logical irq number when on the same chip */
>> +static struct xive_ipi_desc {
>> +	unsigned int irq;
>> +	char name[8]; /* enough bytes to fit IPI-XXX */
> 
> So this assumes that the node number that node is <= 999 ? This
> is certainly the case for now since CONFIG_NODES_SHIFT is 8
> on ppc64 but starting with 10, you'd have truncated names.

It should be harmless though. I agree this is a useless optimization.

> What about deriving the size of name[] from CONFIG_NODES_SHIFT ?

Yes.
 
> Apart from that, LGTM. Probably not worth to respin just for
> this.
> 
> I also could give a try in a KVM guest.
> 
> Topology passed to QEMU:
> 
>   -smp 8,maxcpus=8,cores=2,threads=2,sockets=2 \
>   -numa node,nodeid=0,cpus=0-4 \
>   -numa node,nodeid=1,cpus=4-7
> 
> Topology observed in guest with lstopo :
> 
>   Package L#0
>     NUMANode L#0 (P#0 30GB)
>     L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>       PU L#0 (P#0)
>       PU L#1 (P#1)
>     L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>       PU L#2 (P#2)
>       PU L#3 (P#3)
>   Package L#1
>     NUMANode L#1 (P#1 32GB)
>     L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>       PU L#4 (P#4)
>       PU L#5 (P#5)
>     L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>       PU L#6 (P#6)
>       PU L#7 (P#7)
> 
> Interrupts in guest:
> 
> $ cat /proc/interrupts 
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
>  16:       1023        871       1042        749          0          0          0          0  XIVE-IPI   0 Edge      IPI-0
>  17:          0          0          0          0       2123       1019       1263       1288  XIVE-IPI   1 Edge      IPI-1
> 
> IPIs are mapped to the appropriate nodes, and the numbers indicate
> that everything is working as expected.

You should see the same on 2 socket PowerNV QEMU machine.
 
> Reviewed-and-tested-by: Greg Kurz <groug@kaod.org>

Thanks,

C. 

> 
>> +} *xive_ipis;
>> +
>> +static unsigned int xive_ipi_cpu_to_irq(unsigned int cpu)
>> +{
>> +	return xive_ipis[cpu_to_node(cpu)].irq;
>> +}
>>  #endif
>>  
>>  /* Xive state for each CPU */
>> @@ -1106,25 +1114,36 @@ static const struct irq_domain_ops xive_ipi_irq_domain_ops = {
>>  
>>  static void __init xive_request_ipi(void)
>>  {
>> -	unsigned int virq;
>> +	unsigned int node;
>>  
>> -	xive_ipi_irq_domain = irq_domain_add_linear(NULL, 1,
>> +	xive_ipi_irq_domain = irq_domain_add_linear(NULL, nr_node_ids,
>>  						    &xive_ipi_irq_domain_ops, NULL);
>>  	if (WARN_ON(xive_ipi_irq_domain == NULL))
>>  		return;
>>  
>> -	/* Initialize it */
>> -	virq = irq_create_mapping(xive_ipi_irq_domain, XIVE_IPI_HW_IRQ);
>> -	xive_ipi_irq = virq;
>> +	xive_ipis = kcalloc(nr_node_ids, sizeof(*xive_ipis), GFP_KERNEL | __GFP_NOFAIL);
>> +	for_each_node(node) {
>> +		struct xive_ipi_desc *xid = &xive_ipis[node];
>> +		irq_hw_number_t node_ipi_hwirq = node;
>> +
>> +		/*
>> +		 * Map one IPI interrupt per node for all cpus of that node.
>> +		 * Since the HW interrupt number doesn't have any meaning,
>> +		 * simply use the node number.
>> +		 */
>> +		xid->irq = irq_create_mapping(xive_ipi_irq_domain, node_ipi_hwirq);
>> +		snprintf(xid->name, sizeof(xid->name), "IPI-%d", node);
>>  
>> -	WARN_ON(request_irq(virq, xive_muxed_ipi_action,
>> -			    IRQF_PERCPU | IRQF_NO_THREAD, "IPI", NULL));
>> +		WARN_ON(request_irq(xid->irq, xive_muxed_ipi_action,
>> +				    IRQF_PERCPU | IRQF_NO_THREAD, xid->name, NULL));
>> +	}
>>  }
>>  
>>  static int xive_setup_cpu_ipi(unsigned int cpu)
>>  {
>>  	struct xive_cpu *xc;
>>  	int rc;
>> +	unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu);
>>  
>>  	pr_debug("Setting up IPI for CPU %d\n", cpu);
>>  
>> @@ -1165,6 +1184,8 @@ static int xive_setup_cpu_ipi(unsigned int cpu)
>>  
>>  static void xive_cleanup_cpu_ipi(unsigned int cpu, struct xive_cpu *xc)
>>  {
>> +	unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu);
>> +
>>  	/* Disable the IPI and free the IRQ data */
>>  
>>  	/* Already cleaned up ? */
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
  2021-03-09 15:33     ` Cédric Le Goater
@ 2021-03-09 17:08       ` Daniel Henrique Barboza
  2021-03-09 17:26           ` Cédric Le Goater
  0 siblings, 1 reply; 38+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-09 17:08 UTC (permalink / raw)
  To: Cédric Le Goater, Greg Kurz; +Cc: linuxppc-dev



On 3/9/21 12:33 PM, Cédric Le Goater wrote:
> On 3/8/21 6:13 PM, Greg Kurz wrote:
>> On Wed, 3 Mar 2021 18:48:50 +0100
>> Cédric Le Goater <clg@kaod.org> wrote:
>>
>>> The 'chip_id' field of the XIVE CPU structure is used to choose a
>>> target for a source located on the same chip when possible. This field
>>> is assigned on the PowerNV platform using the "ibm,chip-id" property
>>> on pSeries under KVM when NUMA nodes are defined but it is undefined
>>
>> This sentence seems to have a syntax problem... like it is missing an
>> 'and' before 'on pSeries'.
> 
> ah yes, or simply a comma.
> 
>>> under PowerVM. The XIVE source structure has a similar field
>>> 'src_chip' which is only assigned on the PowerNV platform.
>>>
>>> cpu_to_node() returns a compatible value on all platforms, 0 being the
>>> default node. It will also give us the opportunity to set the affinity
>>> of a source on pSeries when we can localize them.
>>>
>>
>> IIUC this relies on the fact that the NUMA node id is == to chip id
>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
>> with this change.
> 
> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
> in Cc:)

That's correct. H_HOME_NODE_ASSOCIATIVITY returns not only the node_id, but
a list with the ibm,associativity domains of the CPU that "proc-no" (processor
identifier) is mapped to inside QEMU.

node_id in this case, considering that we're working with a reference-points
of size 4, is the 4th element of the returned list. The last element is
"procno" itself.


> 
> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
> the node id. This value is built from the chip id in OPAL, so the
> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
> property are unlikely to be different.
> 
> cpu_to_node(cpu) is used in many places to allocate the structures
> locally to the owning node. XIVE is not an exception (see below in the
> same patch), it is better to be consistent and get the same information
> (node id) using the same routine.
> 
> 
> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
> unifies the controllers of the system to only expose one the OS. This
> is problematic and should be changed but it's another topic.
> 
> 
>> On the other hand, you have the pSeries case under PowerVM that
>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.
> 
> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
> chip id.
> 
> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
> always correct btw)


If you have a way to reliably reproduce this, let me know and I'll fix it
up in QEMU.



Thanks,


DHB


> 
>> It looks like the chip id is only used for localization purpose in
>> this case, right ?
> 
> Yes and PAPR sources are not localized. So it's not used. MSI sources
> could be if we rewrote the MSI driver.
> 
>> In this case, what about doing this change for pSeries only,
>> somewhere in spapr.c ?
> 
> The IPI code is common to all platforms and all have the same issue.
> I rather not.
> 
> Thanks,
> 
> C.
>   
>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>> ---
>>>   arch/powerpc/sysdev/xive/common.c | 7 +------
>>>   1 file changed, 1 insertion(+), 6 deletions(-)
>>>
>>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
>>> index 595310e056f4..b8e456da28aa 100644
>>> --- a/arch/powerpc/sysdev/xive/common.c
>>> +++ b/arch/powerpc/sysdev/xive/common.c
>>> @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu)
>>>   
>>>   	xc = per_cpu(xive_cpu, cpu);
>>>   	if (!xc) {
>>> -		struct device_node *np;
>>> -
>>>   		xc = kzalloc_node(sizeof(struct xive_cpu),
>>>   				  GFP_KERNEL, cpu_to_node(cpu));
>>>   		if (!xc)
>>>   			return -ENOMEM;
>>> -		np = of_get_cpu_node(cpu, NULL);
>>> -		if (np)
>>> -			xc->chip_id = of_get_ibm_chip_id(np);
>>> -		of_node_put(np);
>>> +		xc->chip_id = cpu_to_node(cpu);
>>>   		xc->hw_ipi = XIVE_BAD_IRQ;
>>>   
>>>   		per_cpu(xive_cpu, cpu) = xc;
>>
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
  2021-03-09 17:08       ` Daniel Henrique Barboza
@ 2021-03-09 17:26           ` Cédric Le Goater
  0 siblings, 0 replies; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-09 17:26 UTC (permalink / raw)
  To: Daniel Henrique Barboza, Greg Kurz
  Cc: list@suse.de:PowerPC, linuxppc-dev, QEMU Developers, David Gibson

On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:
> 
> 
> On 3/9/21 12:33 PM, Cédric Le Goater wrote:
>> On 3/8/21 6:13 PM, Greg Kurz wrote:
>>> On Wed, 3 Mar 2021 18:48:50 +0100
>>> Cédric Le Goater <clg@kaod.org> wrote:
>>>
>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a
>>>> target for a source located on the same chip when possible. This field
>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property
>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined
>>>
>>> This sentence seems to have a syntax problem... like it is missing an
>>> 'and' before 'on pSeries'.
>>
>> ah yes, or simply a comma.
>>
>>>> under PowerVM. The XIVE source structure has a similar field
>>>> 'src_chip' which is only assigned on the PowerNV platform.
>>>>
>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the
>>>> default node. It will also give us the opportunity to set the affinity
>>>> of a source on pSeries when we can localize them.
>>>>
>>>
>>> IIUC this relies on the fact that the NUMA node id is == to chip id
>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
>>> with this change.
>>
>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
>> in Cc:)
> 
> That's correct. H_HOME_NODE_ASSOCIATIVITY returns not only the node_id, but
> a list with the ibm,associativity domains of the CPU that "proc-no" (processor
> identifier) is mapped to inside QEMU.
> 
> node_id in this case, considering that we're working with a reference-points
> of size 4, is the 4th element of the returned list. The last element is
> "procno" itself.
> 
> 
>>
>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
>> the node id. This value is built from the chip id in OPAL, so the
>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
>> property are unlikely to be different.
>>
>> cpu_to_node(cpu) is used in many places to allocate the structures
>> locally to the owning node. XIVE is not an exception (see below in the
>> same patch), it is better to be consistent and get the same information
>> (node id) using the same routine.
>>
>>
>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
>> unifies the controllers of the system to only expose one the OS. This
>> is problematic and should be changed but it's another topic.
>>
>>
>>> On the other hand, you have the pSeries case under PowerVM that
>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.
>>
>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
>> chip id.
>>
>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
>> always correct btw)
> 
> 
> If you have a way to reliably reproduce this, let me know and I'll fix it
> up in QEMU.

with :

   -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

# dmesg | grep numa
[    0.013106] numa: Node 0 CPUs: 0-1
[    0.013136] numa: Node 1 CPUs: 2-3

# dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
		ibm,chip-id = <0x01>;
		ibm,chip-id = <0x02>;
		ibm,chip-id = <0x00>;
		ibm,chip-id = <0x03>;

with :

  -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

# dmesg | grep numa
[    0.013106] numa: Node 0 CPUs: 0-1
[    0.013136] numa: Node 1 CPUs: 2-3

# dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
		ibm,chip-id = <0x00>;
		ibm,chip-id = <0x00>;
		ibm,chip-id = <0x00>;
		ibm,chip-id = <0x00>;

I think we should simply remove "ibm,chip-id" since it's not used and
not in the PAPR spec.

Thanks,

C.

 

> 
> Thanks,
> 
> 
> DHB
> 
> 
>>
>>> It looks like the chip id is only used for localization purpose in
>>> this case, right ?
>>
>> Yes and PAPR sources are not localized. So it's not used. MSI sources
>> could be if we rewrote the MSI driver.
>>
>>> In this case, what about doing this change for pSeries only,
>>> somewhere in spapr.c ?
>>
>> The IPI code is common to all platforms and all have the same issue.
>> I rather not.
>>
>> Thanks,
>>
>> C.
>>  
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>   arch/powerpc/sysdev/xive/common.c | 7 +------
>>>>   1 file changed, 1 insertion(+), 6 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
>>>> index 595310e056f4..b8e456da28aa 100644
>>>> --- a/arch/powerpc/sysdev/xive/common.c
>>>> +++ b/arch/powerpc/sysdev/xive/common.c
>>>> @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu)
>>>>         xc = per_cpu(xive_cpu, cpu);
>>>>       if (!xc) {
>>>> -        struct device_node *np;
>>>> -
>>>>           xc = kzalloc_node(sizeof(struct xive_cpu),
>>>>                     GFP_KERNEL, cpu_to_node(cpu));
>>>>           if (!xc)
>>>>               return -ENOMEM;
>>>> -        np = of_get_cpu_node(cpu, NULL);
>>>> -        if (np)
>>>> -            xc->chip_id = of_get_ibm_chip_id(np);
>>>> -        of_node_put(np);
>>>> +        xc->chip_id = cpu_to_node(cpu);
>>>>           xc->hw_ipi = XIVE_BAD_IRQ;
>>>>             per_cpu(xive_cpu, cpu) = xc;
>>>
>>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
@ 2021-03-09 17:26           ` Cédric Le Goater
  0 siblings, 0 replies; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-09 17:26 UTC (permalink / raw)
  To: Daniel Henrique Barboza, Greg Kurz
  Cc: list@suse.de:PowerPC, Michael Ellerman, linuxppc-dev,
	QEMU Developers, David Gibson

On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:
> 
> 
> On 3/9/21 12:33 PM, Cédric Le Goater wrote:
>> On 3/8/21 6:13 PM, Greg Kurz wrote:
>>> On Wed, 3 Mar 2021 18:48:50 +0100
>>> Cédric Le Goater <clg@kaod.org> wrote:
>>>
>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a
>>>> target for a source located on the same chip when possible. This field
>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property
>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined
>>>
>>> This sentence seems to have a syntax problem... like it is missing an
>>> 'and' before 'on pSeries'.
>>
>> ah yes, or simply a comma.
>>
>>>> under PowerVM. The XIVE source structure has a similar field
>>>> 'src_chip' which is only assigned on the PowerNV platform.
>>>>
>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the
>>>> default node. It will also give us the opportunity to set the affinity
>>>> of a source on pSeries when we can localize them.
>>>>
>>>
>>> IIUC this relies on the fact that the NUMA node id is == to chip id
>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
>>> with this change.
>>
>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
>> in Cc:)
> 
> That's correct. H_HOME_NODE_ASSOCIATIVITY returns not only the node_id, but
> a list with the ibm,associativity domains of the CPU that "proc-no" (processor
> identifier) is mapped to inside QEMU.
> 
> node_id in this case, considering that we're working with a reference-points
> of size 4, is the 4th element of the returned list. The last element is
> "procno" itself.
> 
> 
>>
>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
>> the node id. This value is built from the chip id in OPAL, so the
>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
>> property are unlikely to be different.
>>
>> cpu_to_node(cpu) is used in many places to allocate the structures
>> locally to the owning node. XIVE is not an exception (see below in the
>> same patch), it is better to be consistent and get the same information
>> (node id) using the same routine.
>>
>>
>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
>> unifies the controllers of the system to only expose one the OS. This
>> is problematic and should be changed but it's another topic.
>>
>>
>>> On the other hand, you have the pSeries case under PowerVM that
>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.
>>
>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
>> chip id.
>>
>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
>> always correct btw)
> 
> 
> If you have a way to reliably reproduce this, let me know and I'll fix it
> up in QEMU.

with :

   -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

# dmesg | grep numa
[    0.013106] numa: Node 0 CPUs: 0-1
[    0.013136] numa: Node 1 CPUs: 2-3

# dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
		ibm,chip-id = <0x01>;
		ibm,chip-id = <0x02>;
		ibm,chip-id = <0x00>;
		ibm,chip-id = <0x03>;

with :

  -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

# dmesg | grep numa
[    0.013106] numa: Node 0 CPUs: 0-1
[    0.013136] numa: Node 1 CPUs: 2-3

# dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
		ibm,chip-id = <0x00>;
		ibm,chip-id = <0x00>;
		ibm,chip-id = <0x00>;
		ibm,chip-id = <0x00>;

I think we should simply remove "ibm,chip-id" since it's not used and
not in the PAPR spec.

Thanks,

C.

 

> 
> Thanks,
> 
> 
> DHB
> 
> 
>>
>>> It looks like the chip id is only used for localization purpose in
>>> this case, right ?
>>
>> Yes and PAPR sources are not localized. So it's not used. MSI sources
>> could be if we rewrote the MSI driver.
>>
>>> In this case, what about doing this change for pSeries only,
>>> somewhere in spapr.c ?
>>
>> The IPI code is common to all platforms and all have the same issue.
>> I rather not.
>>
>> Thanks,
>>
>> C.
>>  
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>   arch/powerpc/sysdev/xive/common.c | 7 +------
>>>>   1 file changed, 1 insertion(+), 6 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
>>>> index 595310e056f4..b8e456da28aa 100644
>>>> --- a/arch/powerpc/sysdev/xive/common.c
>>>> +++ b/arch/powerpc/sysdev/xive/common.c
>>>> @@ -1335,16 +1335,11 @@ static int xive_prepare_cpu(unsigned int cpu)
>>>>         xc = per_cpu(xive_cpu, cpu);
>>>>       if (!xc) {
>>>> -        struct device_node *np;
>>>> -
>>>>           xc = kzalloc_node(sizeof(struct xive_cpu),
>>>>                     GFP_KERNEL, cpu_to_node(cpu));
>>>>           if (!xc)
>>>>               return -ENOMEM;
>>>> -        np = of_get_cpu_node(cpu, NULL);
>>>> -        if (np)
>>>> -            xc->chip_id = of_get_ibm_chip_id(np);
>>>> -        of_node_put(np);
>>>> +        xc->chip_id = cpu_to_node(cpu);
>>>>           xc->hw_ipi = XIVE_BAD_IRQ;
>>>>             per_cpu(xive_cpu, cpu) = xc;
>>>
>>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
  2021-03-09 17:26           ` Cédric Le Goater
@ 2021-03-12  1:55             ` David Gibson
  -1 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-12  1:55 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Daniel Henrique Barboza, Greg Kurz, QEMU Developers,
	list@suse.de:PowerPC, linuxppc-dev

On Tue, 9 Mar 2021 18:26:35 +0100
Cédric Le Goater <clg@kaod.org> wrote:

> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:
> > 
> > 
> > On 3/9/21 12:33 PM, Cédric Le Goater wrote:  
> >> On 3/8/21 6:13 PM, Greg Kurz wrote:  
> >>> On Wed, 3 Mar 2021 18:48:50 +0100
> >>> Cédric Le Goater <clg@kaod.org> wrote:
> >>>  
> >>>> The 'chip_id' field of the XIVE CPU structure is used to choose a
> >>>> target for a source located on the same chip when possible. This field
> >>>> is assigned on the PowerNV platform using the "ibm,chip-id" property
> >>>> on pSeries under KVM when NUMA nodes are defined but it is undefined  
> >>>
> >>> This sentence seems to have a syntax problem... like it is missing an
> >>> 'and' before 'on pSeries'.  
> >>
> >> ah yes, or simply a comma.
> >>  
> >>>> under PowerVM. The XIVE source structure has a similar field
> >>>> 'src_chip' which is only assigned on the PowerNV platform.
> >>>>
> >>>> cpu_to_node() returns a compatible value on all platforms, 0 being the
> >>>> default node. It will also give us the opportunity to set the affinity
> >>>> of a source on pSeries when we can localize them.
> >>>>  
> >>>
> >>> IIUC this relies on the fact that the NUMA node id is == to chip id
> >>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
> >>> with this change.  
> >>
> >> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
> >> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
> >> in Cc:)  
>  [...]  
> >>
> >> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
> >> the node id. This value is built from the chip id in OPAL, so the
> >> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
> >> property are unlikely to be different.
> >>
> >> cpu_to_node(cpu) is used in many places to allocate the structures
> >> locally to the owning node. XIVE is not an exception (see below in the
> >> same patch), it is better to be consistent and get the same information
> >> (node id) using the same routine.
> >>
> >>
> >> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
> >> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
> >> unifies the controllers of the system to only expose one the OS. This
> >> is problematic and should be changed but it's another topic.
> >>
> >>  
> >>> On the other hand, you have the pSeries case under PowerVM that
> >>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.  
> >>
> >> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
> >> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
> >> chip id.
> >>
> >> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
> >> always correct btw)  
> > 
> > 
> > If you have a way to reliably reproduce this, let me know and I'll fix it
> > up in QEMU.  
> 
> with :
> 
>    -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> 
> # dmesg | grep numa
> [    0.013106] numa: Node 0 CPUs: 0-1
> [    0.013136] numa: Node 1 CPUs: 2-3
> 
> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
> 		ibm,chip-id = <0x01>;
> 		ibm,chip-id = <0x02>;
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x03>;
> 
> with :
> 
>   -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> 
> # dmesg | grep numa
> [    0.013106] numa: Node 0 CPUs: 0-1
> [    0.013136] numa: Node 1 CPUs: 2-3
> 
> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x00>;
> 
> I think we should simply remove "ibm,chip-id" since it's not used and
> not in the PAPR spec.

As I mentioned to Daniel on our call this morning, oddly it *does*
appear to be used in the RHEL kernel, even though that's 4.18 based.
This patch seems to have caused a minor regression; not in the
identification of NUMA nodes, but in the number of sockets shown be
lscpu, etc.  See https://bugzilla.redhat.com/show_bug.cgi?id=1934421
for more information.

Since the value was used by some PAPR kernels - even if they shouldn't
have - I think we should only remove this for newer machine types.  We
also need to check what we're not supplying that the guest kernel is
showing a different number of sockets than specified on the qemu
command line.

> 
> Thanks,
> 
> C.
> 
>  
> 
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
> 


-- 
David Gibson <dgibson@redhat.com>
Principal Software Engineer, Virtualization, Red Hat


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
@ 2021-03-12  1:55             ` David Gibson
  0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2021-03-12  1:55 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Michael Ellerman, Daniel Henrique Barboza, Greg Kurz,
	QEMU Developers, list@suse.de:PowerPC, linuxppc-dev

On Tue, 9 Mar 2021 18:26:35 +0100
Cédric Le Goater <clg@kaod.org> wrote:

> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:
> > 
> > 
> > On 3/9/21 12:33 PM, Cédric Le Goater wrote:  
> >> On 3/8/21 6:13 PM, Greg Kurz wrote:  
> >>> On Wed, 3 Mar 2021 18:48:50 +0100
> >>> Cédric Le Goater <clg@kaod.org> wrote:
> >>>  
> >>>> The 'chip_id' field of the XIVE CPU structure is used to choose a
> >>>> target for a source located on the same chip when possible. This field
> >>>> is assigned on the PowerNV platform using the "ibm,chip-id" property
> >>>> on pSeries under KVM when NUMA nodes are defined but it is undefined  
> >>>
> >>> This sentence seems to have a syntax problem... like it is missing an
> >>> 'and' before 'on pSeries'.  
> >>
> >> ah yes, or simply a comma.
> >>  
> >>>> under PowerVM. The XIVE source structure has a similar field
> >>>> 'src_chip' which is only assigned on the PowerNV platform.
> >>>>
> >>>> cpu_to_node() returns a compatible value on all platforms, 0 being the
> >>>> default node. It will also give us the opportunity to set the affinity
> >>>> of a source on pSeries when we can localize them.
> >>>>  
> >>>
> >>> IIUC this relies on the fact that the NUMA node id is == to chip id
> >>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
> >>> with this change.  
> >>
> >> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
> >> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
> >> in Cc:)  
>  [...]  
> >>
> >> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
> >> the node id. This value is built from the chip id in OPAL, so the
> >> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
> >> property are unlikely to be different.
> >>
> >> cpu_to_node(cpu) is used in many places to allocate the structures
> >> locally to the owning node. XIVE is not an exception (see below in the
> >> same patch), it is better to be consistent and get the same information
> >> (node id) using the same routine.
> >>
> >>
> >> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
> >> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
> >> unifies the controllers of the system to only expose one the OS. This
> >> is problematic and should be changed but it's another topic.
> >>
> >>  
> >>> On the other hand, you have the pSeries case under PowerVM that
> >>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.  
> >>
> >> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
> >> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
> >> chip id.
> >>
> >> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
> >> always correct btw)  
> > 
> > 
> > If you have a way to reliably reproduce this, let me know and I'll fix it
> > up in QEMU.  
> 
> with :
> 
>    -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> 
> # dmesg | grep numa
> [    0.013106] numa: Node 0 CPUs: 0-1
> [    0.013136] numa: Node 1 CPUs: 2-3
> 
> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
> 		ibm,chip-id = <0x01>;
> 		ibm,chip-id = <0x02>;
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x03>;
> 
> with :
> 
>   -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> 
> # dmesg | grep numa
> [    0.013106] numa: Node 0 CPUs: 0-1
> [    0.013136] numa: Node 1 CPUs: 2-3
> 
> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x00>;
> 		ibm,chip-id = <0x00>;
> 
> I think we should simply remove "ibm,chip-id" since it's not used and
> not in the PAPR spec.

As I mentioned to Daniel on our call this morning, oddly it *does*
appear to be used in the RHEL kernel, even though that's 4.18 based.
This patch seems to have caused a minor regression; not in the
identification of NUMA nodes, but in the number of sockets shown be
lscpu, etc.  See https://bugzilla.redhat.com/show_bug.cgi?id=1934421
for more information.

Since the value was used by some PAPR kernels - even if they shouldn't
have - I think we should only remove this for newer machine types.  We
also need to check what we're not supplying that the guest kernel is
showing a different number of sockets than specified on the qemu
command line.

> 
> Thanks,
> 
> C.
> 
>  
> 
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
> 


-- 
David Gibson <dgibson@redhat.com>
Principal Software Engineer, Virtualization, Red Hat



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
  2021-03-12  1:55             ` David Gibson
@ 2021-03-12  9:53               ` Cédric Le Goater
  -1 siblings, 0 replies; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-12  9:53 UTC (permalink / raw)
  To: David Gibson
  Cc: Daniel Henrique Barboza, Greg Kurz, QEMU Developers,
	list@suse.de:PowerPC, linuxppc-dev

On 3/12/21 2:55 AM, David Gibson wrote:
> On Tue, 9 Mar 2021 18:26:35 +0100
> Cédric Le Goater <clg@kaod.org> wrote:
> 
>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:
>>>
>>>
>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote:  
>>>> On 3/8/21 6:13 PM, Greg Kurz wrote:  
>>>>> On Wed, 3 Mar 2021 18:48:50 +0100
>>>>> Cédric Le Goater <clg@kaod.org> wrote:
>>>>>  
>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a
>>>>>> target for a source located on the same chip when possible. This field
>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property
>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined  
>>>>>
>>>>> This sentence seems to have a syntax problem... like it is missing an
>>>>> 'and' before 'on pSeries'.  
>>>>
>>>> ah yes, or simply a comma.
>>>>  
>>>>>> under PowerVM. The XIVE source structure has a similar field
>>>>>> 'src_chip' which is only assigned on the PowerNV platform.
>>>>>>
>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the
>>>>>> default node. It will also give us the opportunity to set the affinity
>>>>>> of a source on pSeries when we can localize them.
>>>>>>  
>>>>>
>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id
>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
>>>>> with this change.  
>>>>
>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
>>>> in Cc:)  
>>  [...]  
>>>>
>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
>>>> the node id. This value is built from the chip id in OPAL, so the
>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
>>>> property are unlikely to be different.
>>>>
>>>> cpu_to_node(cpu) is used in many places to allocate the structures
>>>> locally to the owning node. XIVE is not an exception (see below in the
>>>> same patch), it is better to be consistent and get the same information
>>>> (node id) using the same routine.
>>>>
>>>>
>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
>>>> unifies the controllers of the system to only expose one the OS. This
>>>> is problematic and should be changed but it's another topic.
>>>>
>>>>  
>>>>> On the other hand, you have the pSeries case under PowerVM that
>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.  
>>>>
>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
>>>> chip id.
>>>>
>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
>>>> always correct btw)  
>>>
>>>
>>> If you have a way to reliably reproduce this, let me know and I'll fix it
>>> up in QEMU.  
>>
>> with :
>>
>>    -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>
>> # dmesg | grep numa
>> [    0.013106] numa: Node 0 CPUs: 0-1
>> [    0.013136] numa: Node 1 CPUs: 2-3
>>
>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>> 		ibm,chip-id = <0x01>;
>> 		ibm,chip-id = <0x02>;
>> 		ibm,chip-id = <0x00>;
>> 		ibm,chip-id = <0x03>;
>>
>> with :
>>
>>   -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>
>> # dmesg | grep numa
>> [    0.013106] numa: Node 0 CPUs: 0-1
>> [    0.013136] numa: Node 1 CPUs: 2-3
>>
>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>> 		ibm,chip-id = <0x00>;
>> 		ibm,chip-id = <0x00>;
>> 		ibm,chip-id = <0x00>;
>> 		ibm,chip-id = <0x00>;
>>
>> I think we should simply remove "ibm,chip-id" since it's not used and
>> not in the PAPR spec.
> 
> As I mentioned to Daniel on our call this morning, oddly it *does*
> appear to be used in the RHEL kernel, even though that's 4.18 based.
> This patch seems to have caused a minor regression; not in the
> identification of NUMA nodes, but in the number of sockets shown be
> lscpu, etc.  See https://bugzilla.redhat.com/show_bug.cgi?id=1934421
> for more information.

Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we 
remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 :

   [root@localhost ~]# lscpu 
   Architecture:        ppc64le
   Byte Order:          Little Endian
   CPU(s):              128
   On-line CPU(s) list: 0-127
   Thread(s) per core:  4
   Core(s) per socket:  16
   Socket(s):           2
   NUMA node(s):        2
   Model:               2.2 (pvr 004e 1202)
   Model name:          POWER9 (architected), altivec supported
   Hypervisor vendor:   KVM
   Virtualization type: para
   L1d cache:           32K
   L1i cache:           32K
   NUMA node0 CPU(s):   0-63
   NUMA node1 CPU(s):   64-127

   [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id
   /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1
   /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1
   /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1
   /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1
   /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1
   ....

"ibm,chip-id" is still being used on some occasion on pSeries machines.
This is wrong :/ The problem is :

  #define topology_physical_package_id(cpu)      (cpu_to_chip_id(cpu))

We should be using cpu_to_node(). 

C.

> 
> Since the value was used by some PAPR kernels - even if they shouldn't
> have - I think we should only remove this for newer machine types.  We
> also need to check what we're not supplying that the guest kernel is
> showing a different number of sockets than specified on the qemu
> command line.
> 
>>
>> Thanks,
>>
>> C.
>>
>>  
>>
>>  [...]  
>>  [...]  
>>  [...]  
>>  [...]  
>>  [...]  
>>  [...]  
>>  [...]  
>>  [...]  
>>  [...]  
>>
> 
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
@ 2021-03-12  9:53               ` Cédric Le Goater
  0 siblings, 0 replies; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-12  9:53 UTC (permalink / raw)
  To: David Gibson
  Cc: Michael Ellerman, Daniel Henrique Barboza, Greg Kurz,
	QEMU Developers, list@suse.de:PowerPC, linuxppc-dev

On 3/12/21 2:55 AM, David Gibson wrote:
> On Tue, 9 Mar 2021 18:26:35 +0100
> Cédric Le Goater <clg@kaod.org> wrote:
> 
>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:
>>>
>>>
>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote:  
>>>> On 3/8/21 6:13 PM, Greg Kurz wrote:  
>>>>> On Wed, 3 Mar 2021 18:48:50 +0100
>>>>> Cédric Le Goater <clg@kaod.org> wrote:
>>>>>  
>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a
>>>>>> target for a source located on the same chip when possible. This field
>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property
>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined  
>>>>>
>>>>> This sentence seems to have a syntax problem... like it is missing an
>>>>> 'and' before 'on pSeries'.  
>>>>
>>>> ah yes, or simply a comma.
>>>>  
>>>>>> under PowerVM. The XIVE source structure has a similar field
>>>>>> 'src_chip' which is only assigned on the PowerNV platform.
>>>>>>
>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the
>>>>>> default node. It will also give us the opportunity to set the affinity
>>>>>> of a source on pSeries when we can localize them.
>>>>>>  
>>>>>
>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id
>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
>>>>> with this change.  
>>>>
>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
>>>> in Cc:)  
>>  [...]  
>>>>
>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
>>>> the node id. This value is built from the chip id in OPAL, so the
>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
>>>> property are unlikely to be different.
>>>>
>>>> cpu_to_node(cpu) is used in many places to allocate the structures
>>>> locally to the owning node. XIVE is not an exception (see below in the
>>>> same patch), it is better to be consistent and get the same information
>>>> (node id) using the same routine.
>>>>
>>>>
>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
>>>> unifies the controllers of the system to only expose one the OS. This
>>>> is problematic and should be changed but it's another topic.
>>>>
>>>>  
>>>>> On the other hand, you have the pSeries case under PowerVM that
>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.  
>>>>
>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
>>>> chip id.
>>>>
>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
>>>> always correct btw)  
>>>
>>>
>>> If you have a way to reliably reproduce this, let me know and I'll fix it
>>> up in QEMU.  
>>
>> with :
>>
>>    -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>
>> # dmesg | grep numa
>> [    0.013106] numa: Node 0 CPUs: 0-1
>> [    0.013136] numa: Node 1 CPUs: 2-3
>>
>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>> 		ibm,chip-id = <0x01>;
>> 		ibm,chip-id = <0x02>;
>> 		ibm,chip-id = <0x00>;
>> 		ibm,chip-id = <0x03>;
>>
>> with :
>>
>>   -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>
>> # dmesg | grep numa
>> [    0.013106] numa: Node 0 CPUs: 0-1
>> [    0.013136] numa: Node 1 CPUs: 2-3
>>
>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>> 		ibm,chip-id = <0x00>;
>> 		ibm,chip-id = <0x00>;
>> 		ibm,chip-id = <0x00>;
>> 		ibm,chip-id = <0x00>;
>>
>> I think we should simply remove "ibm,chip-id" since it's not used and
>> not in the PAPR spec.
> 
> As I mentioned to Daniel on our call this morning, oddly it *does*
> appear to be used in the RHEL kernel, even though that's 4.18 based.
> This patch seems to have caused a minor regression; not in the
> identification of NUMA nodes, but in the number of sockets shown be
> lscpu, etc.  See https://bugzilla.redhat.com/show_bug.cgi?id=1934421
> for more information.

Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we 
remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 :

   [root@localhost ~]# lscpu 
   Architecture:        ppc64le
   Byte Order:          Little Endian
   CPU(s):              128
   On-line CPU(s) list: 0-127
   Thread(s) per core:  4
   Core(s) per socket:  16
   Socket(s):           2
   NUMA node(s):        2
   Model:               2.2 (pvr 004e 1202)
   Model name:          POWER9 (architected), altivec supported
   Hypervisor vendor:   KVM
   Virtualization type: para
   L1d cache:           32K
   L1i cache:           32K
   NUMA node0 CPU(s):   0-63
   NUMA node1 CPU(s):   64-127

   [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id
   /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1
   /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1
   /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1
   /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1
   /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1
   ....

"ibm,chip-id" is still being used on some occasion on pSeries machines.
This is wrong :/ The problem is :

  #define topology_physical_package_id(cpu)      (cpu_to_chip_id(cpu))

We should be using cpu_to_node(). 

C.

> 
> Since the value was used by some PAPR kernels - even if they shouldn't
> have - I think we should only remove this for newer machine types.  We
> also need to check what we're not supplying that the guest kernel is
> showing a different number of sockets than specified on the qemu
> command line.
> 
>>
>> Thanks,
>>
>> C.
>>
>>  
>>
>>  [...]  
>>  [...]  
>>  [...]  
>>  [...]  
>>  [...]  
>>  [...]  
>>  [...]  
>>  [...]  
>>  [...]  
>>
> 
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
  2021-03-12  9:53               ` Cédric Le Goater
@ 2021-03-12 12:18                 ` Daniel Henrique Barboza
  -1 siblings, 0 replies; 38+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-12 12:18 UTC (permalink / raw)
  To: Cédric Le Goater, David Gibson
  Cc: list@suse.de:PowerPC, linuxppc-dev, Greg Kurz, QEMU Developers



On 3/12/21 6:53 AM, Cédric Le Goater wrote:
> On 3/12/21 2:55 AM, David Gibson wrote:
>> On Tue, 9 Mar 2021 18:26:35 +0100
>> Cédric Le Goater <clg@kaod.org> wrote:
>>
>>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:
>>>>
>>>>
>>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote:
>>>>> On 3/8/21 6:13 PM, Greg Kurz wrote:
>>>>>> On Wed, 3 Mar 2021 18:48:50 +0100
>>>>>> Cédric Le Goater <clg@kaod.org> wrote:
>>>>>>   
>>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a
>>>>>>> target for a source located on the same chip when possible. This field
>>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property
>>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined
>>>>>>
>>>>>> This sentence seems to have a syntax problem... like it is missing an
>>>>>> 'and' before 'on pSeries'.
>>>>>
>>>>> ah yes, or simply a comma.
>>>>>   
>>>>>>> under PowerVM. The XIVE source structure has a similar field
>>>>>>> 'src_chip' which is only assigned on the PowerNV platform.
>>>>>>>
>>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the
>>>>>>> default node. It will also give us the opportunity to set the affinity
>>>>>>> of a source on pSeries when we can localize them.
>>>>>>>   
>>>>>>
>>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id
>>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
>>>>>> with this change.
>>>>>
>>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
>>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
>>>>> in Cc:)
>>>   [...]
>>>>>
>>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
>>>>> the node id. This value is built from the chip id in OPAL, so the
>>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
>>>>> property are unlikely to be different.
>>>>>
>>>>> cpu_to_node(cpu) is used in many places to allocate the structures
>>>>> locally to the owning node. XIVE is not an exception (see below in the
>>>>> same patch), it is better to be consistent and get the same information
>>>>> (node id) using the same routine.
>>>>>
>>>>>
>>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
>>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
>>>>> unifies the controllers of the system to only expose one the OS. This
>>>>> is problematic and should be changed but it's another topic.
>>>>>
>>>>>   
>>>>>> On the other hand, you have the pSeries case under PowerVM that
>>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.
>>>>>
>>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
>>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
>>>>> chip id.
>>>>>
>>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
>>>>> always correct btw)
>>>>
>>>>
>>>> If you have a way to reliably reproduce this, let me know and I'll fix it
>>>> up in QEMU.
>>>
>>> with :
>>>
>>>     -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>>
>>> # dmesg | grep numa
>>> [    0.013106] numa: Node 0 CPUs: 0-1
>>> [    0.013136] numa: Node 1 CPUs: 2-3
>>>
>>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>>> 		ibm,chip-id = <0x01>;
>>> 		ibm,chip-id = <0x02>;
>>> 		ibm,chip-id = <0x00>;
>>> 		ibm,chip-id = <0x03>;
>>>
>>> with :
>>>
>>>    -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>>
>>> # dmesg | grep numa
>>> [    0.013106] numa: Node 0 CPUs: 0-1
>>> [    0.013136] numa: Node 1 CPUs: 2-3
>>>
>>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>>> 		ibm,chip-id = <0x00>;
>>> 		ibm,chip-id = <0x00>;
>>> 		ibm,chip-id = <0x00>;
>>> 		ibm,chip-id = <0x00>;
>>>
>>> I think we should simply remove "ibm,chip-id" since it's not used and
>>> not in the PAPR spec.
>>
>> As I mentioned to Daniel on our call this morning, oddly it *does*
>> appear to be used in the RHEL kernel, even though that's 4.18 based.
>> This patch seems to have caused a minor regression; not in the
>> identification of NUMA nodes, but in the number of sockets shown be
>> lscpu, etc.  See https://bugzilla.redhat.com/show_bug.cgi?id=1934421
>> for more information.
> 
> Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we
> remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 :
> 
>     [root@localhost ~]# lscpu
>     Architecture:        ppc64le
>     Byte Order:          Little Endian
>     CPU(s):              128
>     On-line CPU(s) list: 0-127
>     Thread(s) per core:  4
>     Core(s) per socket:  16
>     Socket(s):           2
>     NUMA node(s):        2
>     Model:               2.2 (pvr 004e 1202)
>     Model name:          POWER9 (architected), altivec supported
>     Hypervisor vendor:   KVM
>     Virtualization type: para
>     L1d cache:           32K
>     L1i cache:           32K
>     NUMA node0 CPU(s):   0-63
>     NUMA node1 CPU(s):   64-127
> 
>     [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id
>     /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1
>     /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1
>     /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1
>     /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1
>     /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1
>     ....
> 
> "ibm,chip-id" is still being used on some occasion on pSeries machines.
> This is wrong :/ The problem is :
> 
>    #define topology_physical_package_id(cpu)      (cpu_to_chip_id(cpu))
> 
> We should be using cpu_to_node().


IIUC the "real fix" then is this change you mentioned above, together with
this xive patch as well, to stop using ibm,chip-id for good in the pserie
  kernel. With these changes QEMU can remove 'ibm,chip-id' from the pseries
machine without impact. Is this correct?

If that's the case, then I believe it's ok to go forward with the QEMU side
change (just for 6.0.0 and newer machines). Or should I wait for the kernel
changes to be merged upstream first?


Thanks,


DHB


> 
> C.
> 
>>
>> Since the value was used by some PAPR kernels - even if they shouldn't
>> have - I think we should only remove this for newer machine types.  We
>> also need to check what we're not supplying that the guest kernel is
>> showing a different number of sockets than specified on the qemu
>> command line.
>>
>>>
>>> Thanks,
>>>
>>> C.
>>>
>>>   
>>>
>>>   [...]
>>>   [...]
>>>   [...]
>>>   [...]
>>>   [...]
>>>   [...]
>>>   [...]
>>>   [...]
>>>   [...]
>>>
>>
>>
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
@ 2021-03-12 12:18                 ` Daniel Henrique Barboza
  0 siblings, 0 replies; 38+ messages in thread
From: Daniel Henrique Barboza @ 2021-03-12 12:18 UTC (permalink / raw)
  To: Cédric Le Goater, David Gibson
  Cc: list@suse.de:PowerPC, Michael Ellerman, linuxppc-dev, Greg Kurz,
	QEMU Developers



On 3/12/21 6:53 AM, Cédric Le Goater wrote:
> On 3/12/21 2:55 AM, David Gibson wrote:
>> On Tue, 9 Mar 2021 18:26:35 +0100
>> Cédric Le Goater <clg@kaod.org> wrote:
>>
>>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:
>>>>
>>>>
>>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote:
>>>>> On 3/8/21 6:13 PM, Greg Kurz wrote:
>>>>>> On Wed, 3 Mar 2021 18:48:50 +0100
>>>>>> Cédric Le Goater <clg@kaod.org> wrote:
>>>>>>   
>>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a
>>>>>>> target for a source located on the same chip when possible. This field
>>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property
>>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined
>>>>>>
>>>>>> This sentence seems to have a syntax problem... like it is missing an
>>>>>> 'and' before 'on pSeries'.
>>>>>
>>>>> ah yes, or simply a comma.
>>>>>   
>>>>>>> under PowerVM. The XIVE source structure has a similar field
>>>>>>> 'src_chip' which is only assigned on the PowerNV platform.
>>>>>>>
>>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the
>>>>>>> default node. It will also give us the opportunity to set the affinity
>>>>>>> of a source on pSeries when we can localize them.
>>>>>>>   
>>>>>>
>>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id
>>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
>>>>>> with this change.
>>>>>
>>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
>>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
>>>>> in Cc:)
>>>   [...]
>>>>>
>>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
>>>>> the node id. This value is built from the chip id in OPAL, so the
>>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
>>>>> property are unlikely to be different.
>>>>>
>>>>> cpu_to_node(cpu) is used in many places to allocate the structures
>>>>> locally to the owning node. XIVE is not an exception (see below in the
>>>>> same patch), it is better to be consistent and get the same information
>>>>> (node id) using the same routine.
>>>>>
>>>>>
>>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
>>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
>>>>> unifies the controllers of the system to only expose one the OS. This
>>>>> is problematic and should be changed but it's another topic.
>>>>>
>>>>>   
>>>>>> On the other hand, you have the pSeries case under PowerVM that
>>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.
>>>>>
>>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
>>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
>>>>> chip id.
>>>>>
>>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
>>>>> always correct btw)
>>>>
>>>>
>>>> If you have a way to reliably reproduce this, let me know and I'll fix it
>>>> up in QEMU.
>>>
>>> with :
>>>
>>>     -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>>
>>> # dmesg | grep numa
>>> [    0.013106] numa: Node 0 CPUs: 0-1
>>> [    0.013136] numa: Node 1 CPUs: 2-3
>>>
>>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>>> 		ibm,chip-id = <0x01>;
>>> 		ibm,chip-id = <0x02>;
>>> 		ibm,chip-id = <0x00>;
>>> 		ibm,chip-id = <0x03>;
>>>
>>> with :
>>>
>>>    -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>>
>>> # dmesg | grep numa
>>> [    0.013106] numa: Node 0 CPUs: 0-1
>>> [    0.013136] numa: Node 1 CPUs: 2-3
>>>
>>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>>> 		ibm,chip-id = <0x00>;
>>> 		ibm,chip-id = <0x00>;
>>> 		ibm,chip-id = <0x00>;
>>> 		ibm,chip-id = <0x00>;
>>>
>>> I think we should simply remove "ibm,chip-id" since it's not used and
>>> not in the PAPR spec.
>>
>> As I mentioned to Daniel on our call this morning, oddly it *does*
>> appear to be used in the RHEL kernel, even though that's 4.18 based.
>> This patch seems to have caused a minor regression; not in the
>> identification of NUMA nodes, but in the number of sockets shown be
>> lscpu, etc.  See https://bugzilla.redhat.com/show_bug.cgi?id=1934421
>> for more information.
> 
> Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we
> remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 :
> 
>     [root@localhost ~]# lscpu
>     Architecture:        ppc64le
>     Byte Order:          Little Endian
>     CPU(s):              128
>     On-line CPU(s) list: 0-127
>     Thread(s) per core:  4
>     Core(s) per socket:  16
>     Socket(s):           2
>     NUMA node(s):        2
>     Model:               2.2 (pvr 004e 1202)
>     Model name:          POWER9 (architected), altivec supported
>     Hypervisor vendor:   KVM
>     Virtualization type: para
>     L1d cache:           32K
>     L1i cache:           32K
>     NUMA node0 CPU(s):   0-63
>     NUMA node1 CPU(s):   64-127
> 
>     [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id
>     /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1
>     /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1
>     /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1
>     /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1
>     /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1
>     ....
> 
> "ibm,chip-id" is still being used on some occasion on pSeries machines.
> This is wrong :/ The problem is :
> 
>    #define topology_physical_package_id(cpu)      (cpu_to_chip_id(cpu))
> 
> We should be using cpu_to_node().


IIUC the "real fix" then is this change you mentioned above, together with
this xive patch as well, to stop using ibm,chip-id for good in the pserie
  kernel. With these changes QEMU can remove 'ibm,chip-id' from the pseries
machine without impact. Is this correct?

If that's the case, then I believe it's ok to go forward with the QEMU side
change (just for 6.0.0 and newer machines). Or should I wait for the kernel
changes to be merged upstream first?


Thanks,


DHB


> 
> C.
> 
>>
>> Since the value was used by some PAPR kernels - even if they shouldn't
>> have - I think we should only remove this for newer machine types.  We
>> also need to check what we're not supplying that the guest kernel is
>> showing a different number of sockets than specified on the qemu
>> command line.
>>
>>>
>>> Thanks,
>>>
>>> C.
>>>
>>>   
>>>
>>>   [...]
>>>   [...]
>>>   [...]
>>>   [...]
>>>   [...]
>>>   [...]
>>>   [...]
>>>   [...]
>>>   [...]
>>>
>>
>>
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
  2021-03-12 12:18                 ` Daniel Henrique Barboza
@ 2021-03-12 13:03                   ` Greg Kurz
  -1 siblings, 0 replies; 38+ messages in thread
From: Greg Kurz @ 2021-03-12 13:03 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: QEMU Developers, list@suse.de:PowerPC, Cédric Le Goater,
	David Gibson, linuxppc-dev

On Fri, 12 Mar 2021 09:18:39 -0300
Daniel Henrique Barboza <danielhb@linux.ibm.com> wrote:

> 
> 
> On 3/12/21 6:53 AM, Cédric Le Goater wrote:
> > On 3/12/21 2:55 AM, David Gibson wrote:
> >> On Tue, 9 Mar 2021 18:26:35 +0100
> >> Cédric Le Goater <clg@kaod.org> wrote:
> >>
> >>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:
> >>>>
> >>>>
> >>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote:
> >>>>> On 3/8/21 6:13 PM, Greg Kurz wrote:
> >>>>>> On Wed, 3 Mar 2021 18:48:50 +0100
> >>>>>> Cédric Le Goater <clg@kaod.org> wrote:
> >>>>>>   
> >>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a
> >>>>>>> target for a source located on the same chip when possible. This field
> >>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property
> >>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined
> >>>>>>
> >>>>>> This sentence seems to have a syntax problem... like it is missing an
> >>>>>> 'and' before 'on pSeries'.
> >>>>>
> >>>>> ah yes, or simply a comma.
> >>>>>   
> >>>>>>> under PowerVM. The XIVE source structure has a similar field
> >>>>>>> 'src_chip' which is only assigned on the PowerNV platform.
> >>>>>>>
> >>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the
> >>>>>>> default node. It will also give us the opportunity to set the affinity
> >>>>>>> of a source on pSeries when we can localize them.
> >>>>>>>   
> >>>>>>
> >>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id
> >>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
> >>>>>> with this change.
> >>>>>
> >>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
> >>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
> >>>>> in Cc:)
> >>>   [...]
> >>>>>
> >>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
> >>>>> the node id. This value is built from the chip id in OPAL, so the
> >>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
> >>>>> property are unlikely to be different.
> >>>>>
> >>>>> cpu_to_node(cpu) is used in many places to allocate the structures
> >>>>> locally to the owning node. XIVE is not an exception (see below in the
> >>>>> same patch), it is better to be consistent and get the same information
> >>>>> (node id) using the same routine.
> >>>>>
> >>>>>
> >>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
> >>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
> >>>>> unifies the controllers of the system to only expose one the OS. This
> >>>>> is problematic and should be changed but it's another topic.
> >>>>>
> >>>>>   
> >>>>>> On the other hand, you have the pSeries case under PowerVM that
> >>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.
> >>>>>
> >>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
> >>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
> >>>>> chip id.
> >>>>>
> >>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
> >>>>> always correct btw)
> >>>>
> >>>>
> >>>> If you have a way to reliably reproduce this, let me know and I'll fix it
> >>>> up in QEMU.
> >>>
> >>> with :
> >>>
> >>>     -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> >>>
> >>> # dmesg | grep numa
> >>> [    0.013106] numa: Node 0 CPUs: 0-1
> >>> [    0.013136] numa: Node 1 CPUs: 2-3
> >>>
> >>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
> >>> 		ibm,chip-id = <0x01>;
> >>> 		ibm,chip-id = <0x02>;
> >>> 		ibm,chip-id = <0x00>;
> >>> 		ibm,chip-id = <0x03>;
> >>>
> >>> with :
> >>>
> >>>    -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> >>>
> >>> # dmesg | grep numa
> >>> [    0.013106] numa: Node 0 CPUs: 0-1
> >>> [    0.013136] numa: Node 1 CPUs: 2-3
> >>>
> >>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
> >>> 		ibm,chip-id = <0x00>;
> >>> 		ibm,chip-id = <0x00>;
> >>> 		ibm,chip-id = <0x00>;
> >>> 		ibm,chip-id = <0x00>;
> >>>
> >>> I think we should simply remove "ibm,chip-id" since it's not used and
> >>> not in the PAPR spec.
> >>
> >> As I mentioned to Daniel on our call this morning, oddly it *does*
> >> appear to be used in the RHEL kernel, even though that's 4.18 based.
> >> This patch seems to have caused a minor regression; not in the
> >> identification of NUMA nodes, but in the number of sockets shown be
> >> lscpu, etc.  See https://bugzilla.redhat.com/show_bug.cgi?id=1934421
> >> for more information.
> > 
> > Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we
> > remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 :
> > 
> >     [root@localhost ~]# lscpu
> >     Architecture:        ppc64le
> >     Byte Order:          Little Endian
> >     CPU(s):              128
> >     On-line CPU(s) list: 0-127
> >     Thread(s) per core:  4
> >     Core(s) per socket:  16
> >     Socket(s):           2
> >     NUMA node(s):        2
> >     Model:               2.2 (pvr 004e 1202)
> >     Model name:          POWER9 (architected), altivec supported
> >     Hypervisor vendor:   KVM
> >     Virtualization type: para
> >     L1d cache:           32K
> >     L1i cache:           32K
> >     NUMA node0 CPU(s):   0-63
> >     NUMA node1 CPU(s):   64-127
> > 
> >     [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id
> >     /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1
> >     /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1
> >     /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1
> >     /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1
> >     /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1
> >     ....
> > 
> > "ibm,chip-id" is still being used on some occasion on pSeries machines.
> > This is wrong :/ The problem is :
> > 
> >    #define topology_physical_package_id(cpu)      (cpu_to_chip_id(cpu))
> > 
> > We should be using cpu_to_node().
> 
> 
> IIUC the "real fix" then is this change you mentioned above, together with
> this xive patch as well, to stop using ibm,chip-id for good in the pserie
>   kernel. With these changes QEMU can remove 'ibm,chip-id' from the pseries
> machine without impact. Is this correct?
> 
> If that's the case, then I believe it's ok to go forward with the QEMU side
> change (just for 6.0.0 and newer machines). Or should I wait for the kernel
> changes to be merged upstream first?
> 

I'd say the latter since this is a breaking change and people will want
to identify the upstream commits they have to backport to their kernel
in order to support the disappearance of "ibm,chip-id".

Cheers,

--
Greg

> 
> Thanks,
> 
> 
> DHB
> 
> 
> > 
> > C.
> > 
> >>
> >> Since the value was used by some PAPR kernels - even if they shouldn't
> >> have - I think we should only remove this for newer machine types.  We
> >> also need to check what we're not supplying that the guest kernel is
> >> showing a different number of sockets than specified on the qemu
> >> command line.
> >>
> >>>
> >>> Thanks,
> >>>
> >>> C.
> >>>
> >>>   
> >>>
> >>>   [...]
> >>>   [...]
> >>>   [...]
> >>>   [...]
> >>>   [...]
> >>>   [...]
> >>>   [...]
> >>>   [...]
> >>>   [...]
> >>>
> >>
> >>
> > 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
@ 2021-03-12 13:03                   ` Greg Kurz
  0 siblings, 0 replies; 38+ messages in thread
From: Greg Kurz @ 2021-03-12 13:03 UTC (permalink / raw)
  To: Daniel Henrique Barboza
  Cc: Michael Ellerman, QEMU Developers, list@suse.de:PowerPC,
	Cédric Le Goater, David Gibson, linuxppc-dev

On Fri, 12 Mar 2021 09:18:39 -0300
Daniel Henrique Barboza <danielhb@linux.ibm.com> wrote:

> 
> 
> On 3/12/21 6:53 AM, Cédric Le Goater wrote:
> > On 3/12/21 2:55 AM, David Gibson wrote:
> >> On Tue, 9 Mar 2021 18:26:35 +0100
> >> Cédric Le Goater <clg@kaod.org> wrote:
> >>
> >>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:
> >>>>
> >>>>
> >>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote:
> >>>>> On 3/8/21 6:13 PM, Greg Kurz wrote:
> >>>>>> On Wed, 3 Mar 2021 18:48:50 +0100
> >>>>>> Cédric Le Goater <clg@kaod.org> wrote:
> >>>>>>   
> >>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a
> >>>>>>> target for a source located on the same chip when possible. This field
> >>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property
> >>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined
> >>>>>>
> >>>>>> This sentence seems to have a syntax problem... like it is missing an
> >>>>>> 'and' before 'on pSeries'.
> >>>>>
> >>>>> ah yes, or simply a comma.
> >>>>>   
> >>>>>>> under PowerVM. The XIVE source structure has a similar field
> >>>>>>> 'src_chip' which is only assigned on the PowerNV platform.
> >>>>>>>
> >>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the
> >>>>>>> default node. It will also give us the opportunity to set the affinity
> >>>>>>> of a source on pSeries when we can localize them.
> >>>>>>>   
> >>>>>>
> >>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id
> >>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
> >>>>>> with this change.
> >>>>>
> >>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
> >>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
> >>>>> in Cc:)
> >>>   [...]
> >>>>>
> >>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
> >>>>> the node id. This value is built from the chip id in OPAL, so the
> >>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
> >>>>> property are unlikely to be different.
> >>>>>
> >>>>> cpu_to_node(cpu) is used in many places to allocate the structures
> >>>>> locally to the owning node. XIVE is not an exception (see below in the
> >>>>> same patch), it is better to be consistent and get the same information
> >>>>> (node id) using the same routine.
> >>>>>
> >>>>>
> >>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
> >>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
> >>>>> unifies the controllers of the system to only expose one the OS. This
> >>>>> is problematic and should be changed but it's another topic.
> >>>>>
> >>>>>   
> >>>>>> On the other hand, you have the pSeries case under PowerVM that
> >>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.
> >>>>>
> >>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
> >>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
> >>>>> chip id.
> >>>>>
> >>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
> >>>>> always correct btw)
> >>>>
> >>>>
> >>>> If you have a way to reliably reproduce this, let me know and I'll fix it
> >>>> up in QEMU.
> >>>
> >>> with :
> >>>
> >>>     -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> >>>
> >>> # dmesg | grep numa
> >>> [    0.013106] numa: Node 0 CPUs: 0-1
> >>> [    0.013136] numa: Node 1 CPUs: 2-3
> >>>
> >>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
> >>> 		ibm,chip-id = <0x01>;
> >>> 		ibm,chip-id = <0x02>;
> >>> 		ibm,chip-id = <0x00>;
> >>> 		ibm,chip-id = <0x03>;
> >>>
> >>> with :
> >>>
> >>>    -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> >>>
> >>> # dmesg | grep numa
> >>> [    0.013106] numa: Node 0 CPUs: 0-1
> >>> [    0.013136] numa: Node 1 CPUs: 2-3
> >>>
> >>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
> >>> 		ibm,chip-id = <0x00>;
> >>> 		ibm,chip-id = <0x00>;
> >>> 		ibm,chip-id = <0x00>;
> >>> 		ibm,chip-id = <0x00>;
> >>>
> >>> I think we should simply remove "ibm,chip-id" since it's not used and
> >>> not in the PAPR spec.
> >>
> >> As I mentioned to Daniel on our call this morning, oddly it *does*
> >> appear to be used in the RHEL kernel, even though that's 4.18 based.
> >> This patch seems to have caused a minor regression; not in the
> >> identification of NUMA nodes, but in the number of sockets shown be
> >> lscpu, etc.  See https://bugzilla.redhat.com/show_bug.cgi?id=1934421
> >> for more information.
> > 
> > Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we
> > remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 :
> > 
> >     [root@localhost ~]# lscpu
> >     Architecture:        ppc64le
> >     Byte Order:          Little Endian
> >     CPU(s):              128
> >     On-line CPU(s) list: 0-127
> >     Thread(s) per core:  4
> >     Core(s) per socket:  16
> >     Socket(s):           2
> >     NUMA node(s):        2
> >     Model:               2.2 (pvr 004e 1202)
> >     Model name:          POWER9 (architected), altivec supported
> >     Hypervisor vendor:   KVM
> >     Virtualization type: para
> >     L1d cache:           32K
> >     L1i cache:           32K
> >     NUMA node0 CPU(s):   0-63
> >     NUMA node1 CPU(s):   64-127
> > 
> >     [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id
> >     /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1
> >     /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1
> >     /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1
> >     /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1
> >     /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1
> >     ....
> > 
> > "ibm,chip-id" is still being used on some occasion on pSeries machines.
> > This is wrong :/ The problem is :
> > 
> >    #define topology_physical_package_id(cpu)      (cpu_to_chip_id(cpu))
> > 
> > We should be using cpu_to_node().
> 
> 
> IIUC the "real fix" then is this change you mentioned above, together with
> this xive patch as well, to stop using ibm,chip-id for good in the pserie
>   kernel. With these changes QEMU can remove 'ibm,chip-id' from the pseries
> machine without impact. Is this correct?
> 
> If that's the case, then I believe it's ok to go forward with the QEMU side
> change (just for 6.0.0 and newer machines). Or should I wait for the kernel
> changes to be merged upstream first?
> 

I'd say the latter since this is a breaking change and people will want
to identify the upstream commits they have to backport to their kernel
in order to support the disappearance of "ibm,chip-id".

Cheers,

--
Greg

> 
> Thanks,
> 
> 
> DHB
> 
> 
> > 
> > C.
> > 
> >>
> >> Since the value was used by some PAPR kernels - even if they shouldn't
> >> have - I think we should only remove this for newer machine types.  We
> >> also need to check what we're not supplying that the guest kernel is
> >> showing a different number of sockets than specified on the qemu
> >> command line.
> >>
> >>>
> >>> Thanks,
> >>>
> >>> C.
> >>>
> >>>   
> >>>
> >>>   [...]
> >>>   [...]
> >>>   [...]
> >>>   [...]
> >>>   [...]
> >>>   [...]
> >>>   [...]
> >>>   [...]
> >>>   [...]
> >>>
> >>
> >>
> > 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
  2021-03-12 12:18                 ` Daniel Henrique Barboza
@ 2021-03-12 13:28                   ` Cédric Le Goater
  -1 siblings, 0 replies; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-12 13:28 UTC (permalink / raw)
  To: Daniel Henrique Barboza, David Gibson
  Cc: list@suse.de:PowerPC, linuxppc-dev, Greg Kurz, QEMU Developers

On 3/12/21 1:18 PM, Daniel Henrique Barboza wrote:
> 
> 
> On 3/12/21 6:53 AM, Cédric Le Goater wrote:
>> On 3/12/21 2:55 AM, David Gibson wrote:
>>> On Tue, 9 Mar 2021 18:26:35 +0100
>>> Cédric Le Goater <clg@kaod.org> wrote:
>>>
>>>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:
>>>>>
>>>>>
>>>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote:
>>>>>> On 3/8/21 6:13 PM, Greg Kurz wrote:
>>>>>>> On Wed, 3 Mar 2021 18:48:50 +0100
>>>>>>> Cédric Le Goater <clg@kaod.org> wrote:
>>>>>>>  
>>>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a
>>>>>>>> target for a source located on the same chip when possible. This field
>>>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property
>>>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined
>>>>>>>
>>>>>>> This sentence seems to have a syntax problem... like it is missing an
>>>>>>> 'and' before 'on pSeries'.
>>>>>>
>>>>>> ah yes, or simply a comma.
>>>>>>  
>>>>>>>> under PowerVM. The XIVE source structure has a similar field
>>>>>>>> 'src_chip' which is only assigned on the PowerNV platform.
>>>>>>>>
>>>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the
>>>>>>>> default node. It will also give us the opportunity to set the affinity
>>>>>>>> of a source on pSeries when we can localize them.
>>>>>>>>   
>>>>>>>
>>>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id
>>>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
>>>>>>> with this change.
>>>>>>
>>>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
>>>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
>>>>>> in Cc:)
>>>>   [...]
>>>>>>
>>>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
>>>>>> the node id. This value is built from the chip id in OPAL, so the
>>>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
>>>>>> property are unlikely to be different.
>>>>>>
>>>>>> cpu_to_node(cpu) is used in many places to allocate the structures
>>>>>> locally to the owning node. XIVE is not an exception (see below in the
>>>>>> same patch), it is better to be consistent and get the same information
>>>>>> (node id) using the same routine.
>>>>>>
>>>>>>
>>>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
>>>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
>>>>>> unifies the controllers of the system to only expose one the OS. This
>>>>>> is problematic and should be changed but it's another topic.
>>>>>>
>>>>>>  
>>>>>>> On the other hand, you have the pSeries case under PowerVM that
>>>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.
>>>>>>
>>>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
>>>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
>>>>>> chip id.
>>>>>>
>>>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
>>>>>> always correct btw)
>>>>>
>>>>>
>>>>> If you have a way to reliably reproduce this, let me know and I'll fix it
>>>>> up in QEMU.
>>>>
>>>> with :
>>>>
>>>>     -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>>>
>>>> # dmesg | grep numa
>>>> [    0.013106] numa: Node 0 CPUs: 0-1
>>>> [    0.013136] numa: Node 1 CPUs: 2-3
>>>>
>>>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>>>>         ibm,chip-id = <0x01>;
>>>>         ibm,chip-id = <0x02>;
>>>>         ibm,chip-id = <0x00>;
>>>>         ibm,chip-id = <0x03>;
>>>>
>>>> with :
>>>>
>>>>    -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>>>
>>>> # dmesg | grep numa
>>>> [    0.013106] numa: Node 0 CPUs: 0-1
>>>> [    0.013136] numa: Node 1 CPUs: 2-3
>>>>
>>>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>>>>         ibm,chip-id = <0x00>;
>>>>         ibm,chip-id = <0x00>;
>>>>         ibm,chip-id = <0x00>;
>>>>         ibm,chip-id = <0x00>;
>>>>
>>>> I think we should simply remove "ibm,chip-id" since it's not used and
>>>> not in the PAPR spec.
>>>
>>> As I mentioned to Daniel on our call this morning, oddly it *does*
>>> appear to be used in the RHEL kernel, even though that's 4.18 based.
>>> This patch seems to have caused a minor regression; not in the
>>> identification of NUMA nodes, but in the number of sockets shown be
>>> lscpu, etc.  See https://bugzilla.redhat.com/show_bug.cgi?id=1934421
>>> for more information.
>>
>> Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we
>> remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 :
>>
>>     [root@localhost ~]# lscpu
>>     Architecture:        ppc64le
>>     Byte Order:          Little Endian
>>     CPU(s):              128
>>     On-line CPU(s) list: 0-127
>>     Thread(s) per core:  4
>>     Core(s) per socket:  16
>>     Socket(s):           2
>>     NUMA node(s):        2
>>     Model:               2.2 (pvr 004e 1202)
>>     Model name:          POWER9 (architected), altivec supported
>>     Hypervisor vendor:   KVM
>>     Virtualization type: para
>>     L1d cache:           32K
>>     L1i cache:           32K
>>     NUMA node0 CPU(s):   0-63
>>     NUMA node1 CPU(s):   64-127
>>
>>     [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id
>>     /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1
>>     /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1
>>     /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1
>>     /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1
>>     /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1
>>     ....
>>
>> "ibm,chip-id" is still being used on some occasion on pSeries machines.
>> This is wrong :/ The problem is :
>>
>>    #define topology_physical_package_id(cpu)      (cpu_to_chip_id(cpu))
>>
>> We should be using cpu_to_node().
> 
> 
> IIUC the "real fix" then is this change you mentioned above, together with
> this xive patch as well, 

These are independent. 

The XIVE patch just raised the issue because it's another usage example of 
cpu_to_chip_id() or directly "ibm,chip-id" in the XIVE case, on a pseries 
machine.  

The use of cpu_to_node(cpu) for topology_physical_package_id(cpu) is a fix 
for the sysfs issue reported in the redhat BZ. 

> to stop using ibm,chip-id for good in the pserie
>  kernel. With these changes QEMU can remove 'ibm,chip-id' from the pseries
> machine without impact. Is this correct?

Linux is already "broken" on PowerVM today since we don't have the "ibm,chip-id" 
property. QEMU is just hiding the problem on KVM. 

But we have to be bug compatible :) if the QEMU fix is under the pseries-6.x 
machine we should be fine.

> If that's the case, then I believe it's ok to go forward with the QEMU side
> change (just for 6.0.0 and newer machines). Or should I wait for the kernel
> changes to be merged upstream first?

Once Linux is fixed, we shouldn't care if QEMU exports 'ibm,chip-id' or not.
I don't think the order is very important. These are independent. 

C.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property
@ 2021-03-12 13:28                   ` Cédric Le Goater
  0 siblings, 0 replies; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-12 13:28 UTC (permalink / raw)
  To: Daniel Henrique Barboza, David Gibson
  Cc: list@suse.de:PowerPC, Michael Ellerman, linuxppc-dev, Greg Kurz,
	QEMU Developers

On 3/12/21 1:18 PM, Daniel Henrique Barboza wrote:
> 
> 
> On 3/12/21 6:53 AM, Cédric Le Goater wrote:
>> On 3/12/21 2:55 AM, David Gibson wrote:
>>> On Tue, 9 Mar 2021 18:26:35 +0100
>>> Cédric Le Goater <clg@kaod.org> wrote:
>>>
>>>> On 3/9/21 6:08 PM, Daniel Henrique Barboza wrote:
>>>>>
>>>>>
>>>>> On 3/9/21 12:33 PM, Cédric Le Goater wrote:
>>>>>> On 3/8/21 6:13 PM, Greg Kurz wrote:
>>>>>>> On Wed, 3 Mar 2021 18:48:50 +0100
>>>>>>> Cédric Le Goater <clg@kaod.org> wrote:
>>>>>>>  
>>>>>>>> The 'chip_id' field of the XIVE CPU structure is used to choose a
>>>>>>>> target for a source located on the same chip when possible. This field
>>>>>>>> is assigned on the PowerNV platform using the "ibm,chip-id" property
>>>>>>>> on pSeries under KVM when NUMA nodes are defined but it is undefined
>>>>>>>
>>>>>>> This sentence seems to have a syntax problem... like it is missing an
>>>>>>> 'and' before 'on pSeries'.
>>>>>>
>>>>>> ah yes, or simply a comma.
>>>>>>  
>>>>>>>> under PowerVM. The XIVE source structure has a similar field
>>>>>>>> 'src_chip' which is only assigned on the PowerNV platform.
>>>>>>>>
>>>>>>>> cpu_to_node() returns a compatible value on all platforms, 0 being the
>>>>>>>> default node. It will also give us the opportunity to set the affinity
>>>>>>>> of a source on pSeries when we can localize them.
>>>>>>>>   
>>>>>>>
>>>>>>> IIUC this relies on the fact that the NUMA node id is == to chip id
>>>>>>> on PowerNV, i.e. xc->chip_id which is passed to OPAL remain stable
>>>>>>> with this change.
>>>>>>
>>>>>> Linux sets the NUMA node in numa_setup_cpu(). On pseries, the hcall
>>>>>> H_HOME_NODE_ASSOCIATIVITY returns the node id if I am correct (Daniel
>>>>>> in Cc:)
>>>>   [...]
>>>>>>
>>>>>> On PowerNV, Linux uses "ibm,associativity" property of the CPU to find
>>>>>> the node id. This value is built from the chip id in OPAL, so the
>>>>>> value returned by cpu_to_node(cpu) and the value of the "ibm,chip-id"
>>>>>> property are unlikely to be different.
>>>>>>
>>>>>> cpu_to_node(cpu) is used in many places to allocate the structures
>>>>>> locally to the owning node. XIVE is not an exception (see below in the
>>>>>> same patch), it is better to be consistent and get the same information
>>>>>> (node id) using the same routine.
>>>>>>
>>>>>>
>>>>>> In Linux, "ibm,chip-id" is only used in low level PowerNV drivers :
>>>>>> LPC, XSCOM, RNG, VAS, NX. XIVE should be in that list also but skiboot
>>>>>> unifies the controllers of the system to only expose one the OS. This
>>>>>> is problematic and should be changed but it's another topic.
>>>>>>
>>>>>>  
>>>>>>> On the other hand, you have the pSeries case under PowerVM that
>>>>>>> doesn't xc->chip_id, which isn't passed to any hcall AFAICT.
>>>>>>
>>>>>> yes "ibm,chip-id" is an OPAL concept unfortunately and it has no meaning
>>>>>> under PAPR. xc->chip_id on pseries (PowerVM) will contains an invalid
>>>>>> chip id.
>>>>>>
>>>>>> QEMU/KVM exposes "ibm,chip-id" but it's not used. (its value is not
>>>>>> always correct btw)
>>>>>
>>>>>
>>>>> If you have a way to reliably reproduce this, let me know and I'll fix it
>>>>> up in QEMU.
>>>>
>>>> with :
>>>>
>>>>     -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>>>
>>>> # dmesg | grep numa
>>>> [    0.013106] numa: Node 0 CPUs: 0-1
>>>> [    0.013136] numa: Node 1 CPUs: 2-3
>>>>
>>>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>>>>         ibm,chip-id = <0x01>;
>>>>         ibm,chip-id = <0x02>;
>>>>         ibm,chip-id = <0x00>;
>>>>         ibm,chip-id = <0x03>;
>>>>
>>>> with :
>>>>
>>>>    -smp 4,cores=4,maxcpus=8,threads=1 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>>>
>>>> # dmesg | grep numa
>>>> [    0.013106] numa: Node 0 CPUs: 0-1
>>>> [    0.013136] numa: Node 1 CPUs: 2-3
>>>>
>>>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>>>>         ibm,chip-id = <0x00>;
>>>>         ibm,chip-id = <0x00>;
>>>>         ibm,chip-id = <0x00>;
>>>>         ibm,chip-id = <0x00>;
>>>>
>>>> I think we should simply remove "ibm,chip-id" since it's not used and
>>>> not in the PAPR spec.
>>>
>>> As I mentioned to Daniel on our call this morning, oddly it *does*
>>> appear to be used in the RHEL kernel, even though that's 4.18 based.
>>> This patch seems to have caused a minor regression; not in the
>>> identification of NUMA nodes, but in the number of sockets shown be
>>> lscpu, etc.  See https://bugzilla.redhat.com/show_bug.cgi?id=1934421
>>> for more information.
>>
>> Yes. The property "ibm,chip-id" is wrongly calculated in QEMU. If we
>> remove it, we get with 4.18.0-295.el8.ppc64le or 5.12.0-rc2 :
>>
>>     [root@localhost ~]# lscpu
>>     Architecture:        ppc64le
>>     Byte Order:          Little Endian
>>     CPU(s):              128
>>     On-line CPU(s) list: 0-127
>>     Thread(s) per core:  4
>>     Core(s) per socket:  16
>>     Socket(s):           2
>>     NUMA node(s):        2
>>     Model:               2.2 (pvr 004e 1202)
>>     Model name:          POWER9 (architected), altivec supported
>>     Hypervisor vendor:   KVM
>>     Virtualization type: para
>>     L1d cache:           32K
>>     L1i cache:           32K
>>     NUMA node0 CPU(s):   0-63
>>     NUMA node1 CPU(s):   64-127
>>
>>     [root@localhost ~]# grep . /sys/devices/system/cpu/*/topology/physical_package_id
>>     /sys/devices/system/cpu/cpu0/topology/physical_package_id:-1
>>     /sys/devices/system/cpu/cpu100/topology/physical_package_id:-1
>>     /sys/devices/system/cpu/cpu101/topology/physical_package_id:-1
>>     /sys/devices/system/cpu/cpu102/topology/physical_package_id:-1
>>     /sys/devices/system/cpu/cpu103/topology/physical_package_id:-1
>>     ....
>>
>> "ibm,chip-id" is still being used on some occasion on pSeries machines.
>> This is wrong :/ The problem is :
>>
>>    #define topology_physical_package_id(cpu)      (cpu_to_chip_id(cpu))
>>
>> We should be using cpu_to_node().
> 
> 
> IIUC the "real fix" then is this change you mentioned above, together with
> this xive patch as well, 

These are independent. 

The XIVE patch just raised the issue because it's another usage example of 
cpu_to_chip_id() or directly "ibm,chip-id" in the XIVE case, on a pseries 
machine.  

The use of cpu_to_node(cpu) for topology_physical_package_id(cpu) is a fix 
for the sysfs issue reported in the redhat BZ. 

> to stop using ibm,chip-id for good in the pserie
>  kernel. With these changes QEMU can remove 'ibm,chip-id' from the pseries
> machine without impact. Is this correct?

Linux is already "broken" on PowerVM today since we don't have the "ibm,chip-id" 
property. QEMU is just hiding the problem on KVM. 

But we have to be bug compatible :) if the QEMU fix is under the pseries-6.x 
machine we should be fine.

> If that's the case, then I believe it's ok to go forward with the QEMU side
> change (just for 6.0.0 and newer machines). Or should I wait for the kernel
> changes to be merged upstream first?

Once Linux is fixed, we shouldn't care if QEMU exports 'ibm,chip-id' or not.
I don't think the order is very important. These are independent. 

C.




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node
  2021-03-03 17:48 ` [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
  2021-03-09 13:23   ` Greg Kurz
@ 2021-03-30 16:18   ` Cédric Le Goater
  1 sibling, 0 replies; 38+ messages in thread
From: Cédric Le Goater @ 2021-03-30 16:18 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Greg Kurz

On 3/3/21 6:48 PM, Cédric Le Goater wrote:
> ipistorm [*] can be used to benchmark the raw interrupt rate of an
> interrupt controller by measuring the number of IPIs a system can
> sustain. When applied to the XIVE interrupt controller of POWER9 and
> POWER10 systems, a significant drop of the interrupt rate can be
> observed when crossing the second node boundary.
> 
> This is due to the fact that a single IPI interrupt is used for all
> CPUs of the system. The structure is shared and the cache line updates
> impact greatly the traffic between nodes and the overall IPI
> performance.
> 
> As a workaround, the impact can be reduced by deactivating the IRQ
> lockup detector ("noirqdebug") which does a lot of accounting in the
> Linux IRQ descriptor structure and is responsible for most of the
> performance penalty.
> 
> As a fix, this proposal allocates an IPI interrupt per node, to be
> shared by all CPUs of that node. It solves the scaling issue, the IRQ
> lockup detector still has an impact but the XIVE interrupt rate scales
> linearly. It also improves the "noirqdebug" case as showed in the
> tables below.
> 
>  * P9 DD2.2 - 2s * 64 threads
> 
>                                                "noirqdebug"
>                         Mint/s                    Mint/s
>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
>  --------------------------------------------------------------
>  1      0-15     4.984023   4.875405       4.996536   5.048892
>         0-31    10.879164  10.544040      10.757632  11.037859
>         0-47    15.345301  14.688764      14.926520  15.310053
>         0-63    17.064907  17.066812      17.613416  17.874511
>  2      0-79    11.768764  21.650749      22.689120  22.566508
>         0-95    10.616812  26.878789      28.434703  28.320324
>         0-111   10.151693  31.397803      31.771773  32.388122
>         0-127    9.948502  33.139336      34.875716  35.224548
> 
>  * P10 DD1 - 4s (not homogeneous) 352 threads
> 
>                                                "noirqdebug"
>                         Mint/s                    Mint/s
>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
>  --------------------------------------------------------------
>  1      0-15     2.409402   2.364108       2.383303   2.395091
>         0-31     6.028325   6.046075       6.089999   6.073750
>         0-47     8.655178   8.644531       8.712830   8.724702
>         0-63    11.629652  11.735953      12.088203  12.055979
>         0-79    14.392321  14.729959      14.986701  14.973073
>         0-95    12.604158  13.004034      17.528748  17.568095
>  2      0-111    9.767753  13.719831      19.968606  20.024218
>         0-127    6.744566  16.418854      22.898066  22.995110
>         0-143    6.005699  19.174421      25.425622  25.417541
>         0-159    5.649719  21.938836      27.952662  28.059603
>         0-175    5.441410  24.109484      31.133915  31.127996
>  3      0-191    5.318341  24.405322      33.999221  33.775354
>         0-207    5.191382  26.449769      36.050161  35.867307
>         0-223    5.102790  29.356943      39.544135  39.508169
>         0-239    5.035295  31.933051      42.135075  42.071975
>         0-255    4.969209  34.477367      44.655395  44.757074
>  4      0-271    4.907652  35.887016      47.080545  47.318537
>         0-287    4.839581  38.076137      50.464307  50.636219
>         0-303    4.786031  40.881319      53.478684  53.310759
>         0-319    4.743750  43.448424      56.388102  55.973969
>         0-335    4.709936  45.623532      59.400930  58.926857
>         0-351    4.681413  45.646151      62.035804  61.830057
> 
> [*] https://github.com/antonblanchard/ipistorm
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/sysdev/xive/xive-internal.h |  2 --
>  arch/powerpc/sysdev/xive/common.c        | 39 ++++++++++++++++++------
>  2 files changed, 30 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/xive-internal.h b/arch/powerpc/sysdev/xive/xive-internal.h
> index 9cf57c722faa..b3a456fdd3a5 100644
> --- a/arch/powerpc/sysdev/xive/xive-internal.h
> +++ b/arch/powerpc/sysdev/xive/xive-internal.h
> @@ -5,8 +5,6 @@
>  #ifndef __XIVE_INTERNAL_H
>  #define __XIVE_INTERNAL_H
>  
> -#define XIVE_IPI_HW_IRQ		0 /* interrupt source # for IPIs */
> -
>  /*
>   * A "disabled" interrupt should never fire, to catch problems
>   * we set its logical number to this
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index 8eefd152b947..c27f7bb0494b 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -65,8 +65,16 @@ static struct irq_domain *xive_irq_domain;
>  #ifdef CONFIG_SMP
>  static struct irq_domain *xive_ipi_irq_domain;
>  
> -/* The IPIs all use the same logical irq number */
> -static u32 xive_ipi_irq;
> +/* The IPIs use the same logical irq number when on the same chip */
> +static struct xive_ipi_desc {
> +	unsigned int irq;
> +	char name[8]; /* enough bytes to fit IPI-XXX */
> +} *xive_ipis;
> +
> +static unsigned int xive_ipi_cpu_to_irq(unsigned int cpu)
> +{
> +	return xive_ipis[cpu_to_node(cpu)].irq;

This should be using early_cpu_to_node() for hotplugged CPU, else the CPU IPI 
will be mapped on default node 0. Still works but this is not what we want. 

> +}
>  #endif
>  
>  /* Xive state for each CPU */
> @@ -1106,25 +1114,36 @@ static const struct irq_domain_ops xive_ipi_irq_domain_ops = {
>  
>  static void __init xive_request_ipi(void)
>  {
> -	unsigned int virq;
> +	unsigned int node;
>  
> -	xive_ipi_irq_domain = irq_domain_add_linear(NULL, 1,
> +	xive_ipi_irq_domain = irq_domain_add_linear(NULL, nr_node_ids,
>  						    &xive_ipi_irq_domain_ops, NULL);
>  	if (WARN_ON(xive_ipi_irq_domain == NULL))
>  		return;
>  
> -	/* Initialize it */
> -	virq = irq_create_mapping(xive_ipi_irq_domain, XIVE_IPI_HW_IRQ);
> -	xive_ipi_irq = virq;
> +	xive_ipis = kcalloc(nr_node_ids, sizeof(*xive_ipis), GFP_KERNEL | __GFP_NOFAIL);
> +	for_each_node(node) {
> +		struct xive_ipi_desc *xid = &xive_ipis[node];
> +		irq_hw_number_t node_ipi_hwirq = node;

There, we need to filter cpu-less nodes. There is no need to allocate IPIs
for these.

> +		/*
> +		 * Map one IPI interrupt per node for all cpus of that node.
> +		 * Since the HW interrupt number doesn't have any meaning,
> +		 * simply use the node number.
> +		 */
> +		xid->irq = irq_create_mapping(xive_ipi_irq_domain, node_ipi_hwirq);
> +		snprintf(xid->name, sizeof(xid->name), "IPI-%d", node);

and this mapping needs some modernization. If we use a domain allocator, the 
IPI irq descriptor will be allocated on the node it is serving which is even 
better for cache performance.

I will send a v3 with these changes.

C.

> -	WARN_ON(request_irq(virq, xive_muxed_ipi_action,
> -			    IRQF_PERCPU | IRQF_NO_THREAD, "IPI", NULL));
> +		WARN_ON(request_irq(xid->irq, xive_muxed_ipi_action,
> +				    IRQF_PERCPU | IRQF_NO_THREAD, xid->name, NULL));
> +	}
>  }
>  
>  static int xive_setup_cpu_ipi(unsigned int cpu)
>  {
>  	struct xive_cpu *xc;
>  	int rc;
> +	unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu);
>  
>  	pr_debug("Setting up IPI for CPU %d\n", cpu);
>  
> @@ -1165,6 +1184,8 @@ static int xive_setup_cpu_ipi(unsigned int cpu)
>  
>  static void xive_cleanup_cpu_ipi(unsigned int cpu, struct xive_cpu *xc)
>  {
> +	unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu);
> +
>  	/* Disable the IPI and free the IRQ data */
>  
>  	/* Already cleaned up ? */
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2021-03-30 16:18 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-03 17:48 [PATCH v2 0/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
2021-03-03 17:48 ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm, chip-id property Cédric Le Goater
2021-03-08 17:13   ` [PATCH v2 1/8] powerpc/xive: Use cpu_to_node() instead of ibm,chip-id property Greg Kurz
2021-03-09 15:33     ` Cédric Le Goater
2021-03-09 17:08       ` Daniel Henrique Barboza
2021-03-09 17:26         ` Cédric Le Goater
2021-03-09 17:26           ` Cédric Le Goater
2021-03-12  1:55           ` David Gibson
2021-03-12  1:55             ` David Gibson
2021-03-12  9:53             ` Cédric Le Goater
2021-03-12  9:53               ` Cédric Le Goater
2021-03-12 12:18               ` Daniel Henrique Barboza
2021-03-12 12:18                 ` Daniel Henrique Barboza
2021-03-12 13:03                 ` Greg Kurz
2021-03-12 13:03                   ` Greg Kurz
2021-03-12 13:28                 ` Cédric Le Goater
2021-03-12 13:28                   ` Cédric Le Goater
2021-03-03 17:48 ` [PATCH v2 2/8] powerpc/xive: Introduce an IPI interrupt domain Cédric Le Goater
2021-03-08 17:55   ` Greg Kurz
2021-03-03 17:48 ` [PATCH v2 3/8] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ Cédric Le Goater
2021-03-08 17:56   ` Greg Kurz
2021-03-03 17:48 ` [PATCH v2 4/8] powerpc/xive: Simplify xive_core_debug_show() Cédric Le Goater
2021-03-08 18:07   ` Greg Kurz
2021-03-08 18:11     ` Cédric Le Goater
2021-03-09  9:13       ` Greg Kurz
2021-03-09  9:42         ` Greg Kurz
2021-03-09 15:39           ` Cédric Le Goater
2021-03-03 17:48 ` [PATCH v2 5/8] powerpc/xive: Drop check on irq_data in xive_core_debug_show() Cédric Le Goater
2021-03-09  9:18   ` Greg Kurz
2021-03-03 17:48 ` [PATCH v2 6/8] powerpc/xive: Simplify the dump of XIVE interrupts under xmon Cédric Le Goater
2021-03-09  9:22   ` Greg Kurz
2021-03-03 17:48 ` [PATCH v2 7/8] powerpc/xive: Fix xmon command "dxi" Cédric Le Goater
2021-03-09 10:23   ` Greg Kurz
2021-03-09 15:49     ` Cédric Le Goater
2021-03-03 17:48 ` [PATCH v2 8/8] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
2021-03-09 13:23   ` Greg Kurz
2021-03-09 15:52     ` Cédric Le Goater
2021-03-30 16:18   ` Cédric Le Goater

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.