linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node
From: Cédric Le Goater @ 2021-03-31 14:45 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater


Hello,

ipistorm [*] can be used to benchmark the raw interrupt rate of an
interrupt controller by measuring the number of IPIs a system can
sustain. When applied to the XIVE interrupt controller of POWER9 and
POWER10 systems, a significant drop of the interrupt rate can be
observed when crossing the second node boundary.

This is due to the fact that a single IPI interrupt is used for all
CPUs of the system. The structure is shared, and the cache line
updates greatly impact the traffic between nodes and the overall IPI
performance.
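
To illustrate the pattern outside the kernel, here is a minimal
user-space sketch of the contention (the thread count, iteration
count and 128-byte line size are made up for the example; this is not
the kernel code):

	/* Build with: gcc -O2 -pthread contention.c */
	#include <pthread.h>
	#include <stdio.h>

	#define NTHREADS 4
	#define ITERS 10000000UL

	struct counter {
		unsigned long count;
	} __attribute__((aligned(128)));	/* one cache line each */

	static struct counter shared;			/* single contended line */
	static struct counter per_node[NTHREADS];	/* one line per "node" */

	static void *hammer(void *arg)
	{
		struct counter *c = arg;
		unsigned long i;

		/* relaxed atomic increments: the cost is dominated by
		 * cache line ownership transfers when the line is shared */
		for (i = 0; i < ITERS; i++)
			__atomic_add_fetch(&c->count, 1, __ATOMIC_RELAXED);
		return NULL;
	}

	int main(void)
	{
		pthread_t t[NTHREADS];
		int i;

		/* slow case: all threads update the same cache line */
		for (i = 0; i < NTHREADS; i++)
			pthread_create(&t[i], NULL, hammer, &shared);
		for (i = 0; i < NTHREADS; i++)
			pthread_join(t[i], NULL);

		/* fast case: each thread updates its own cache line */
		for (i = 0; i < NTHREADS; i++)
			pthread_create(&t[i], NULL, hammer, &per_node[i]);
		for (i = 0; i < NTHREADS; i++)
			pthread_join(t[i], NULL);

		printf("shared=%lu per_node[0]=%lu\n",
		       shared.count, per_node[0].count);
		return 0;
	}

Timing the two halves separately shows the same shape as the tables
below: the shared counter degrades as the threads spread across
nodes, the per-"node" counters do not.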

As a workaround, the impact can be reduced by deactivating the IRQ
lockup detector ("noirqdebug") which does a lot of accounting in the
Linux IRQ descriptor structure and is responsible for most of the
performance penalty.
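
("noirqdebug" is the standard kernel parameter documented in
Documentation/admin-guide/kernel-parameters.txt; it is simply
appended to the kernel command line in the bootloader entry, e.g.:

	linux ... noirqdebug
)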

As a fix, this proposal allocates an IPI interrupt per node, to be
shared by all CPUs of that node. It solves the scaling issue: the IRQ
lockup detector still has an impact, but the XIVE interrupt rate now
scales linearly. It also improves the "noirqdebug" case, as shown in
the tables below.

 * P9 DD2.2 - 2s * 64 threads

                                               "noirqdebug"
                        Mint/s                    Mint/s   
 chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys     
 --------------------------------------------------------------
 1      0-15     4.984023   4.875405       4.996536   5.048892
        0-31    10.879164  10.544040      10.757632  11.037859
        0-47    15.345301  14.688764      14.926520  15.310053
        0-63    17.064907  17.066812      17.613416  17.874511
 2      0-79    11.768764  21.650749      22.689120  22.566508
        0-95    10.616812  26.878789      28.434703  28.320324
        0-111   10.151693  31.397803      31.771773  32.388122
        0-127    9.948502  33.139336      34.875716  35.224548


 * P10 DD1 - 4s (not homogeneous) 352 threads

                                               "noirqdebug"
                        Mint/s                    Mint/s   
 chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys     
 --------------------------------------------------------------
 1      0-15     2.409402   2.364108       2.383303   2.395091
        0-31     6.028325   6.046075       6.089999   6.073750
        0-47     8.655178   8.644531       8.712830   8.724702
        0-63    11.629652  11.735953      12.088203  12.055979
        0-79    14.392321  14.729959      14.986701  14.973073
        0-95    12.604158  13.004034      17.528748  17.568095
 2      0-111    9.767753  13.719831      19.968606  20.024218
        0-127    6.744566  16.418854      22.898066  22.995110
        0-143    6.005699  19.174421      25.425622  25.417541
        0-159    5.649719  21.938836      27.952662  28.059603
        0-175    5.441410  24.109484      31.133915  31.127996
 3      0-191    5.318341  24.405322      33.999221  33.775354
        0-207    5.191382  26.449769      36.050161  35.867307
        0-223    5.102790  29.356943      39.544135  39.508169
        0-239    5.035295  31.933051      42.135075  42.071975
        0-255    4.969209  34.477367      44.655395  44.757074
 4      0-271    4.907652  35.887016      47.080545  47.318537
        0-287    4.839581  38.076137      50.464307  50.636219
        0-303    4.786031  40.881319      53.478684  53.310759
        0-319    4.743750  43.448424      56.388102  55.973969
        0-335    4.709936  45.623532      59.400930  58.926857
        0-351    4.681413  45.646151      62.035804  61.830057

[*] https://github.com/antonblanchard/ipistorm
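
(ipistorm is an out-of-tree kernel module; typical use, as far as I
remember the README, is to build it against the running kernel and
load it with insmod, with module parameters selecting the target
CPUs.)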

Thanks,

C.

Changes in v3:

  - improved commit log for the misuse of "ibm,chip-id"
  - better error handling of xive_request_ipi()
  - use of a fwnode_handle to name the new domain 
  - increased IPI name length
  - use of early_cpu_to_node() for hotplugged CPUs
  - filter CPU-less nodes

Changes in v2:

  - extra simplification on xmon
  - fixes on issues reported by the kernel test robot

Cédric Le Goater (9):
  powerpc/xive: Use cpu_to_node() instead of "ibm,chip-id" property
  powerpc/xive: Introduce an IPI interrupt domain
  powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ
  powerpc/xive: Simplify xive_core_debug_show()
  powerpc/xive: Drop check on irq_data in xive_core_debug_show()
  powerpc/xive: Simplify the dump of XIVE interrupts under xmon
  powerpc/xive: Fix xmon command "dxi"
  powerpc/xive: Map one IPI interrupt per node
  powerpc/xive: Modernize XIVE-IPI domain with an 'alloc' handler

 arch/powerpc/include/asm/xive.h          |   1 +
 arch/powerpc/sysdev/xive/xive-internal.h |   2 -
 arch/powerpc/sysdev/xive/common.c        | 211 +++++++++++++++--------
 arch/powerpc/xmon/xmon.c                 |  28 +--
 4 files changed, 139 insertions(+), 103 deletions(-)

-- 
2.26.3



* [PATCH v3 1/9] powerpc/xive: Use cpu_to_node() instead of "ibm,chip-id" property
From: Cédric Le Goater @ 2021-03-31 14:45 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: David Gibson, Greg Kurz, Cédric Le Goater

The 'chip_id' field of the XIVE CPU structure is used to choose a
target for a source located on the same chip when possible. The XIVE
driver queries the chip id value from the "ibm,chip-id" DT property
but this property is not available on all platforms. It was first
introduced on the PowerNV platform and later, under QEMU for pseries.
However, the property does not exist under PowerVM since it is not
specified in PAPR.

cpu_to_node() is a better alternative. On the PowerNV platform, the
node id is computed from the "ibm,associativity" property of the CPU.
Its value is built in the OPAL firmware from the physical chip id and
is equivalent to "ibm,chip-id". On pSeries, the hcall
H_HOME_NODE_ASSOCIATIVITY returns the node id.

It should also be noted that under QEMU/KVM "ibm,chip-id" is badly
calculated for unusual SMT configurations. This leads to a bogus chip
id value being returned by of_get_ibm_chip_id().

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes in v3:

  - improved commit log for the misuse of "ibm,chip-id"

 arch/powerpc/sysdev/xive/common.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 7e08be5e5e4a..776871274b69 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1336,16 +1336,11 @@ static int xive_prepare_cpu(unsigned int cpu)
 
 	xc = per_cpu(xive_cpu, cpu);
 	if (!xc) {
-		struct device_node *np;
-
 		xc = kzalloc_node(sizeof(struct xive_cpu),
 				  GFP_KERNEL, cpu_to_node(cpu));
 		if (!xc)
 			return -ENOMEM;
-		np = of_get_cpu_node(cpu, NULL);
-		if (np)
-			xc->chip_id = of_get_ibm_chip_id(np);
-		of_node_put(np);
+		xc->chip_id = cpu_to_node(cpu);
 		xc->hw_ipi = XIVE_BAD_IRQ;
 
 		per_cpu(xive_cpu, cpu) = xc;
-- 
2.26.3



* [PATCH v3 2/9] powerpc/xive: Introduce an IPI interrupt domain
From: Cédric Le Goater @ 2021-03-31 14:45 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Thomas Gleixner, Greg Kurz, Cédric Le Goater

The IPI interrupt is a special case of the XIVE IRQ domain. When
mapping and unmapping the interrupts in the Linux interrupt number
space, the HW interrupt number 0 (XIVE_IPI_HW_IRQ) is checked to
distinguish the IPI interrupt from other interrupts of the system.

Simplify the XIVE interrupt domain by introducing a specific domain
for the IPI.

Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes in v3:

  - better error handling of xive_request_ipi()
  - use of a fwnode_handle to name the new domain
  - dropped Greg's Reviewed-by because of the changes

 arch/powerpc/sysdev/xive/common.c | 79 ++++++++++++++++++-------------
 1 file changed, 46 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 776871274b69..98f4dc916fa1 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1068,24 +1068,58 @@ static struct irq_chip xive_ipi_chip = {
 	.irq_unmask = xive_ipi_do_nothing,
 };
 
-static void __init xive_request_ipi(void)
+/*
+ * IPIs are marked per-cpu. We use separate HW interrupts under the
+ * hood but associated with the same "linux" interrupt
+ */
+static int xive_ipi_irq_domain_map(struct irq_domain *h, unsigned int virq,
+				   irq_hw_number_t hw)
 {
+	irq_set_chip_and_handler(virq, &xive_ipi_chip, handle_percpu_irq);
+	return 0;
+}
+
+static const struct irq_domain_ops xive_ipi_irq_domain_ops = {
+	.map = xive_ipi_irq_domain_map,
+};
+
+static int __init xive_request_ipi(void)
+{
+	struct fwnode_handle *fwnode;
+	struct irq_domain *ipi_domain;
 	unsigned int virq;
+	int ret = -ENOMEM;
 
-	/*
-	 * Initialization failed, move on, we might manage to
-	 * reach the point where we display our errors before
-	 * the system falls appart
-	 */
-	if (!xive_irq_domain)
-		return;
+	fwnode = irq_domain_alloc_named_fwnode("XIVE-IPI");
+	if (!fwnode)
+		goto out;
+
+	ipi_domain = irq_domain_create_linear(fwnode, 1,
+					      &xive_ipi_irq_domain_ops, NULL);
+	if (!ipi_domain)
+		goto out_free_fwnode;
 
 	/* Initialize it */
-	virq = irq_create_mapping(xive_irq_domain, XIVE_IPI_HW_IRQ);
+	virq = irq_create_mapping(ipi_domain, XIVE_IPI_HW_IRQ);
+	if (!virq) {
+		ret = -EINVAL;
+		goto out_free_domain;
+	}
+
 	xive_ipi_irq = virq;
 
-	WARN_ON(request_irq(virq, xive_muxed_ipi_action,
-			    IRQF_PERCPU | IRQF_NO_THREAD, "IPI", NULL));
+	ret = request_irq(virq, xive_muxed_ipi_action,
+			  IRQF_PERCPU | IRQF_NO_THREAD, "IPI", NULL);
+
+	WARN(ret < 0, "Failed to request IPI %d: %d\n", virq, ret);
+	return ret;
+
+out_free_domain:
+	irq_domain_remove(ipi_domain);
+out_free_fwnode:
+	irq_domain_free_fwnode(fwnode);
+out:
+	return ret;
 }
 
 static int xive_setup_cpu_ipi(unsigned int cpu)
@@ -1179,19 +1213,6 @@ static int xive_irq_domain_map(struct irq_domain *h, unsigned int virq,
 	 */
 	irq_clear_status_flags(virq, IRQ_LEVEL);
 
-#ifdef CONFIG_SMP
-	/* IPIs are special and come up with HW number 0 */
-	if (hw == XIVE_IPI_HW_IRQ) {
-		/*
-		 * IPIs are marked per-cpu. We use separate HW interrupts under
-		 * the hood but associated with the same "linux" interrupt
-		 */
-		irq_set_chip_and_handler(virq, &xive_ipi_chip,
-					 handle_percpu_irq);
-		return 0;
-	}
-#endif
-
 	rc = xive_irq_alloc_data(virq, hw);
 	if (rc)
 		return rc;
@@ -1203,15 +1224,7 @@ static int xive_irq_domain_map(struct irq_domain *h, unsigned int virq,
 
 static void xive_irq_domain_unmap(struct irq_domain *d, unsigned int virq)
 {
-	struct irq_data *data = irq_get_irq_data(virq);
-	unsigned int hw_irq;
-
-	/* XXX Assign BAD number */
-	if (!data)
-		return;
-	hw_irq = (unsigned int)irqd_to_hwirq(data);
-	if (hw_irq != XIVE_IPI_HW_IRQ)
-		xive_irq_free_data(virq);
+	xive_irq_free_data(virq);
 }
 
 static int xive_irq_domain_xlate(struct irq_domain *h, struct device_node *ct,
-- 
2.26.3



* [PATCH v3 3/9] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ
From: Cédric Le Goater @ 2021-03-31 14:45 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater

The IPI interrupt has its own domain now. Testing the HW interrupt
number is not needed anymore.

Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/sysdev/xive/common.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 98f4dc916fa1..8bca9aca0607 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1417,13 +1417,12 @@ static void xive_flush_cpu_queue(unsigned int cpu, struct xive_cpu *xc)
 		struct irq_desc *desc = irq_to_desc(irq);
 		struct irq_data *d = irq_desc_get_irq_data(desc);
 		struct xive_irq_data *xd;
-		unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
 
 		/*
 		 * Ignore anything that isn't a XIVE irq and ignore
 		 * IPIs, so can just be dropped.
 		 */
-		if (d->domain != xive_irq_domain || hw_irq == XIVE_IPI_HW_IRQ)
+		if (d->domain != xive_irq_domain)
 			continue;
 
 		/*
-- 
2.26.3



* [PATCH v3 4/9] powerpc/xive: Simplify xive_core_debug_show()
From: Cédric Le Goater @ 2021-03-31 14:45 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater

Now that the IPI interrupt has its own domain, the checks on the HW
interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a
check on the domain.

Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/sysdev/xive/common.c | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 8bca9aca0607..4149ca846e7c 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1600,17 +1600,14 @@ static void xive_debug_show_cpu(struct seq_file *m, int cpu)
 	seq_puts(m, "\n");
 }
 
-static void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct irq_data *d)
+static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
 {
-	struct irq_chip *chip = irq_data_get_irq_chip(d);
+	unsigned int hw_irq = (unsigned int)irqd_to_hwirq(d);
 	int rc;
 	u32 target;
 	u8 prio;
 	u32 lirq;
 
-	if (!is_xive_irq(chip))
-		return;
-
 	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
 	if (rc) {
 		seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
@@ -1648,16 +1645,9 @@ static int xive_core_debug_show(struct seq_file *m, void *private)
 
 	for_each_irq_desc(i, desc) {
 		struct irq_data *d = irq_desc_get_irq_data(desc);
-		unsigned int hw_irq;
-
-		if (!d)
-			continue;
-
-		hw_irq = (unsigned int)irqd_to_hwirq(d);
 
-		/* IPIs are special (HW number 0) */
-		if (hw_irq != XIVE_IPI_HW_IRQ)
-			xive_debug_show_irq(m, hw_irq, d);
+		if (d->domain == xive_irq_domain)
+			xive_debug_show_irq(m, d);
 	}
 	return 0;
 }
-- 
2.26.3



* [PATCH v3 5/9] powerpc/xive: Drop check on irq_data in xive_core_debug_show()
From: Cédric Le Goater @ 2021-03-31 14:45 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kernel test robot, Greg Kurz, Dan Carpenter, Cédric Le Goater

When looping over the IRQ descriptors, irq_data is always valid.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 930914b7d528 ("powerpc/xive: Add a debugfs file to dump internal XIVE state")
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/sysdev/xive/common.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 4149ca846e7c..f5fe60c194bc 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1607,6 +1607,8 @@ static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
 	u32 target;
 	u8 prio;
 	u32 lirq;
+	struct xive_irq_data *xd;
+	u64 val;
 
 	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
 	if (rc) {
@@ -1617,17 +1619,14 @@ static void xive_debug_show_irq(struct seq_file *m, struct irq_data *d)
 	seq_printf(m, "IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ",
 		   hw_irq, target, prio, lirq);
 
-	if (d) {
-		struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
-		u64 val = xive_esb_read(xd, XIVE_ESB_GET);
-
-		seq_printf(m, "flags=%c%c%c PQ=%c%c",
-			   xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ',
-			   xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ',
-			   xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ',
-			   val & XIVE_ESB_VAL_P ? 'P' : '-',
-			   val & XIVE_ESB_VAL_Q ? 'Q' : '-');
-	}
+	xd = irq_data_get_irq_handler_data(d);
+	val = xive_esb_read(xd, XIVE_ESB_GET);
+	seq_printf(m, "flags=%c%c%c PQ=%c%c",
+		   xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ',
+		   xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ',
+		   xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ',
+		   val & XIVE_ESB_VAL_P ? 'P' : '-',
+		   val & XIVE_ESB_VAL_Q ? 'Q' : '-');
 	seq_puts(m, "\n");
 }
 
-- 
2.26.3



* [PATCH v3 6/9] powerpc/xive: Simplify the dump of XIVE interrupts under xmon
From: Cédric Le Goater @ 2021-03-31 14:45 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Greg Kurz, Cédric Le Goater

Move the xmon routine under the XIVE subsystem and rework the loop
over the interrupts, taking the xive_irq_domain into account to
filter out IPIs.

Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/asm/xive.h   |  1 +
 arch/powerpc/sysdev/xive/common.c | 14 ++++++++++++++
 arch/powerpc/xmon/xmon.c          | 28 ++--------------------------
 3 files changed, 17 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 9a312b975ca8..aa094a8655b0 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -102,6 +102,7 @@ void xive_flush_interrupt(void);
 /* xmon hook */
 void xmon_xive_do_dump(int cpu);
 int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d);
+void xmon_xive_get_irq_all(void);
 
 /* APIs used by KVM */
 u32 xive_native_default_eq_shift(void);
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index f5fe60c194bc..4c6e2e1289f5 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -289,6 +289,20 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
 	return 0;
 }
 
+void xmon_xive_get_irq_all(void)
+{
+	unsigned int i;
+	struct irq_desc *desc;
+
+	for_each_irq_desc(i, desc) {
+		struct irq_data *d = irq_desc_get_irq_data(desc);
+		unsigned int hwirq = (unsigned int)irqd_to_hwirq(d);
+
+		if (d->domain == xive_irq_domain)
+			xmon_xive_get_irq_config(hwirq, d);
+	}
+}
+
 #endif /* CONFIG_XMON */
 
 static unsigned int xive_get_irq(void)
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 3fe37495f63d..80fbf8968f77 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2727,30 +2727,6 @@ static void dump_all_xives(void)
 		dump_one_xive(cpu);
 }
 
-static void dump_one_xive_irq(u32 num, struct irq_data *d)
-{
-	xmon_xive_get_irq_config(num, d);
-}
-
-static void dump_all_xive_irq(void)
-{
-	unsigned int i;
-	struct irq_desc *desc;
-
-	for_each_irq_desc(i, desc) {
-		struct irq_data *d = irq_desc_get_irq_data(desc);
-		unsigned int hwirq;
-
-		if (!d)
-			continue;
-
-		hwirq = (unsigned int)irqd_to_hwirq(d);
-		/* IPIs are special (HW number 0) */
-		if (hwirq)
-			dump_one_xive_irq(hwirq, d);
-	}
-}
-
 static void dump_xives(void)
 {
 	unsigned long num;
@@ -2767,9 +2743,9 @@ static void dump_xives(void)
 		return;
 	} else if (c == 'i') {
 		if (scanhex(&num))
-			dump_one_xive_irq(num, NULL);
+			xmon_xive_get_irq_config(num, NULL);
 		else
-			dump_all_xive_irq();
+			xmon_xive_get_irq_all();
 		return;
 	}
 
-- 
2.26.3



* [PATCH v3 7/9] powerpc/xive: Fix xmon command "dxi"
From: Cédric Le Goater @ 2021-03-31 14:45 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: kernel test robot, Greg Kurz, Dan Carpenter, Cédric Le Goater

When under xmon, the "dxi" command dumps the state of the XIVE
interrupts. If an interrupt number is specified, only the state of
the associated XIVE interrupt is dumped. This form of the command
lacks an irq_data parameter, which is nevertheless used by
xmon_xive_get_irq_config(), leading to an xmon crash.

Fix that by doing a lookup in the system IRQ mapping to query the IRQ
descriptor data. Invalid interrupt numbers, or numbers not belonging
to the XIVE IRQ domain (the OPAL event interrupt numbers, for
instance), should be caught by the previous query done at the
firmware level.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 97ef27507793 ("powerpc/xive: Fix xmon support on the PowerNV platform")
Tested-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/sysdev/xive/common.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 4c6e2e1289f5..69980b49afb7 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -253,17 +253,20 @@ notrace void xmon_xive_do_dump(int cpu)
 	xmon_printf("\n");
 }
 
+static struct irq_data *xive_get_irq_data(u32 hw_irq)
+{
+	unsigned int irq = irq_find_mapping(xive_irq_domain, hw_irq);
+
+	return irq ? irq_get_irq_data(irq) : NULL;
+}
+
 int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
 {
-	struct irq_chip *chip = irq_data_get_irq_chip(d);
 	int rc;
 	u32 target;
 	u8 prio;
 	u32 lirq;
 
-	if (!is_xive_irq(chip))
-		return -EINVAL;
-
 	rc = xive_ops->get_irq_config(hw_irq, &target, &prio, &lirq);
 	if (rc) {
 		xmon_printf("IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
@@ -273,6 +276,9 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
 	xmon_printf("IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ",
 		    hw_irq, target, prio, lirq);
 
+	if (!d)
+		d = xive_get_irq_data(hw_irq);
+
 	if (d) {
 		struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
 		u64 val = xive_esb_read(xd, XIVE_ESB_GET);
-- 
2.26.3



* [PATCH v3 8/9] powerpc/xive: Map one IPI interrupt per node
From: Cédric Le Goater @ 2021-03-31 14:45 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Thomas Gleixner, Greg Kurz, Cédric Le Goater

ipistorm [*] can be used to benchmark the raw interrupt rate of an
interrupt controller by measuring the number of IPIs a system can
sustain. When applied to the XIVE interrupt controller of POWER9 and
POWER10 systems, a significant drop of the interrupt rate can be
observed when crossing the second node boundary.

This is due to the fact that a single IPI interrupt is used for all
CPUs of the system. The structure is shared, and the cache line
updates greatly impact the traffic between nodes and the overall IPI
performance.

As a workaround, the impact can be reduced by deactivating the IRQ
lockup detector ("noirqdebug") which does a lot of accounting in the
Linux IRQ descriptor structure and is responsible for most of the
performance penalty.

As a fix, this proposal allocates an IPI interrupt per node, to be
shared by all CPUs of that node. It solves the scaling issue: the IRQ
lockup detector still has an impact, but the XIVE interrupt rate now
scales linearly. It also improves the "noirqdebug" case, as shown in
the tables below.

 * P9 DD2.2 - 2s * 64 threads

                                               "noirqdebug"
                        Mint/s                    Mint/s
 chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
 --------------------------------------------------------------
 1      0-15     4.984023   4.875405       4.996536   5.048892
        0-31    10.879164  10.544040      10.757632  11.037859
        0-47    15.345301  14.688764      14.926520  15.310053
        0-63    17.064907  17.066812      17.613416  17.874511
 2      0-79    11.768764  21.650749      22.689120  22.566508
        0-95    10.616812  26.878789      28.434703  28.320324
        0-111   10.151693  31.397803      31.771773  32.388122
        0-127    9.948502  33.139336      34.875716  35.224548

 * P10 DD1 - 4s (not homogeneous) 352 threads

                                               "noirqdebug"
                        Mint/s                    Mint/s
 chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
 --------------------------------------------------------------
 1      0-15     2.409402   2.364108       2.383303   2.395091
        0-31     6.028325   6.046075       6.089999   6.073750
        0-47     8.655178   8.644531       8.712830   8.724702
        0-63    11.629652  11.735953      12.088203  12.055979
        0-79    14.392321  14.729959      14.986701  14.973073
        0-95    12.604158  13.004034      17.528748  17.568095
 2      0-111    9.767753  13.719831      19.968606  20.024218
        0-127    6.744566  16.418854      22.898066  22.995110
        0-143    6.005699  19.174421      25.425622  25.417541
        0-159    5.649719  21.938836      27.952662  28.059603
        0-175    5.441410  24.109484      31.133915  31.127996
 3      0-191    5.318341  24.405322      33.999221  33.775354
        0-207    5.191382  26.449769      36.050161  35.867307
        0-223    5.102790  29.356943      39.544135  39.508169
        0-239    5.035295  31.933051      42.135075  42.071975
        0-255    4.969209  34.477367      44.655395  44.757074
 4      0-271    4.907652  35.887016      47.080545  47.318537
        0-287    4.839581  38.076137      50.464307  50.636219
        0-303    4.786031  40.881319      53.478684  53.310759
        0-319    4.743750  43.448424      56.388102  55.973969
        0-335    4.709936  45.623532      59.400930  58.926857
        0-351    4.681413  45.646151      62.035804  61.830057

[*] https://github.com/antonblanchard/ipistorm

Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes in v3:

  - increased IPI name length
  - use of early_cpu_to_node() for hotplugged CPUs
  - filter CPU-less nodes
  - dropped Greg's Reviewed-by because of the changes
  
 arch/powerpc/sysdev/xive/xive-internal.h |  2 -
 arch/powerpc/sysdev/xive/common.c        | 60 +++++++++++++++++++-----
 2 files changed, 47 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/xive-internal.h b/arch/powerpc/sysdev/xive/xive-internal.h
index 9cf57c722faa..b3a456fdd3a5 100644
--- a/arch/powerpc/sysdev/xive/xive-internal.h
+++ b/arch/powerpc/sysdev/xive/xive-internal.h
@@ -5,8 +5,6 @@
 #ifndef __XIVE_INTERNAL_H
 #define __XIVE_INTERNAL_H
 
-#define XIVE_IPI_HW_IRQ		0 /* interrupt source # for IPIs */
-
 /*
  * A "disabled" interrupt should never fire, to catch problems
  * we set its logical number to this
diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 69980b49afb7..06f29cd07448 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -63,8 +63,19 @@ static const struct xive_ops *xive_ops;
 static struct irq_domain *xive_irq_domain;
 
 #ifdef CONFIG_SMP
-/* The IPIs all use the same logical irq number */
-static u32 xive_ipi_irq;
+/* The IPIs use the same logical irq number when on the same chip */
+static struct xive_ipi_desc {
+	unsigned int irq;
+	char name[16];
+} *xive_ipis;
+
+/*
+ * Use early_cpu_to_node() for hot-plugged CPUs
+ */
+static unsigned int xive_ipi_cpu_to_irq(unsigned int cpu)
+{
+	return xive_ipis[early_cpu_to_node(cpu)].irq;
+}
 #endif
 
 /* Xive state for each CPU */
@@ -1107,33 +1118,53 @@ static int __init xive_request_ipi(void)
 {
 	struct fwnode_handle *fwnode;
 	struct irq_domain *ipi_domain;
-	unsigned int virq;
+	unsigned int node;
 	int ret = -ENOMEM;
 
 	fwnode = irq_domain_alloc_named_fwnode("XIVE-IPI");
 	if (!fwnode)
 		goto out;
 
-	ipi_domain = irq_domain_create_linear(fwnode, 1,
+	ipi_domain = irq_domain_create_linear(fwnode, nr_node_ids,
 					      &xive_ipi_irq_domain_ops, NULL);
 	if (!ipi_domain)
 		goto out_free_fwnode;
 
-	/* Initialize it */
-	virq = irq_create_mapping(ipi_domain, XIVE_IPI_HW_IRQ);
-	if (!virq) {
-		ret = -EINVAL;
+	xive_ipis = kcalloc(nr_node_ids, sizeof(*xive_ipis), GFP_KERNEL | __GFP_NOFAIL);
+	if (!xive_ipis)
 		goto out_free_domain;
-	}
 
-	xive_ipi_irq = virq;
+	for_each_node(node) {
+		struct xive_ipi_desc *xid = &xive_ipis[node];
+		irq_hw_number_t ipi_hwirq = node;
+
+		/* Skip nodes without CPUs */
+		if (cpumask_empty(cpumask_of_node(node)))
+			continue;
+
+		/*
+		 * Map one IPI interrupt per node for all cpus of that node.
+		 * Since the HW interrupt number doesn't have any meaning,
+		 * simply use the node number.
+		 */
+		xid->irq = irq_create_mapping(ipi_domain, ipi_hwirq);
+		if (!xid->irq) {
+			ret = -EINVAL;
+			goto out_free_xive_ipis;
+		}
+
+		snprintf(xid->name, sizeof(xid->name), "IPI-%d", node);
+
+		ret = request_irq(xid->irq, xive_muxed_ipi_action,
+				  IRQF_PERCPU | IRQF_NO_THREAD, xid->name, NULL);
 
-	ret = request_irq(virq, xive_muxed_ipi_action,
-			  IRQF_PERCPU | IRQF_NO_THREAD, "IPI", NULL);
+		WARN(ret < 0, "Failed to request IPI %d: %d\n", xid->irq, ret);
+	}
 
-	WARN(ret < 0, "Failed to request IPI %d: %d\n", virq, ret);
 	return ret;
 
+out_free_xive_ipis:
+	kfree(xive_ipis);
 out_free_domain:
 	irq_domain_remove(ipi_domain);
 out_free_fwnode:
@@ -1144,6 +1175,7 @@ static int __init xive_request_ipi(void)
 
 static int xive_setup_cpu_ipi(unsigned int cpu)
 {
+	unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu);
 	struct xive_cpu *xc;
 	int rc;
 
@@ -1186,6 +1218,8 @@ static int xive_setup_cpu_ipi(unsigned int cpu)
 
 static void xive_cleanup_cpu_ipi(unsigned int cpu, struct xive_cpu *xc)
 {
+	unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu);
+
 	/* Disable the IPI and free the IRQ data */
 
 	/* Already cleaned up ? */
-- 
2.26.3



* [PATCH v3 9/9] powerpc/xive: Modernize XIVE-IPI domain with an 'alloc' handler
From: Cédric Le Goater @ 2021-03-31 14:45 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Thomas Gleixner, Greg Kurz, Cédric Le Goater

Instead of calling irq_create_mapping() to map the IPI for a node,
introduce an 'alloc' handler. This is usually an extension to support
hierarchical irq_domains, which is not exactly the case for the
XIVE-IPI domain. However, we can now use the irq_domain_alloc_irqs()
routine, which allocates the IRQ descriptor on the specified node;
even better for cache performance on multi-node machines.

Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 I didn't rerun the benchmark to check for a difference.
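 
 For context, this is the irq_domain_alloc_irqs() wrapper being used,
 as I read include/linux/irqdomain.h at the time (the 'node' argument
 is what drives the node-local allocation of the IRQ descriptor):
 
	static inline int irq_domain_alloc_irqs(struct irq_domain *domain,
						unsigned int nr_irqs,
						int node, void *arg)
	{
		return __irq_domain_alloc_irqs(domain, -1, nr_irqs, node,
					       arg, false, NULL);
	}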
 
 arch/powerpc/sysdev/xive/common.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 06f29cd07448..bb7435ffe99c 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1103,15 +1103,26 @@ static struct irq_chip xive_ipi_chip = {
  * IPIs are marked per-cpu. We use separate HW interrupts under the
  * hood but associated with the same "linux" interrupt
  */
-static int xive_ipi_irq_domain_map(struct irq_domain *h, unsigned int virq,
-				   irq_hw_number_t hw)
+struct xive_ipi_alloc_info {
+	irq_hw_number_t hwirq;
+};
+
+static int xive_ipi_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
+				     unsigned int nr_irqs, void *arg)
 {
-	irq_set_chip_and_handler(virq, &xive_ipi_chip, handle_percpu_irq);
+	struct xive_ipi_alloc_info *info = arg;
+	int i;
+
+	for (i = 0; i < nr_irqs; i++) {
+		irq_domain_set_info(domain, virq + i, info->hwirq + i, &xive_ipi_chip,
+				    domain->host_data, handle_percpu_irq,
+				    NULL, NULL);
+	}
 	return 0;
 }
 
 static const struct irq_domain_ops xive_ipi_irq_domain_ops = {
-	.map = xive_ipi_irq_domain_map,
+	.alloc  = xive_ipi_irq_domain_alloc,
 };
 
 static int __init xive_request_ipi(void)
@@ -1136,7 +1147,7 @@ static int __init xive_request_ipi(void)
 
 	for_each_node(node) {
 		struct xive_ipi_desc *xid = &xive_ipis[node];
-		irq_hw_number_t ipi_hwirq = node;
+		struct xive_ipi_alloc_info info = { node };
 
 		/* Skip nodes without CPUs */
 		if (cpumask_empty(cpumask_of_node(node)))
@@ -1147,9 +1158,9 @@ static int __init xive_request_ipi(void)
 		 * Since the HW interrupt number doesn't have any meaning,
 		 * simply use the node number.
 		 */
-		xid->irq = irq_create_mapping(ipi_domain, ipi_hwirq);
-		if (!xid->irq) {
-			ret = -EINVAL;
+		xid->irq = irq_domain_alloc_irqs(ipi_domain, 1, node, &info);
+		if (xid->irq < 0) {
+			ret = xid->irq;
 			goto out_free_xive_ipis;
 		}
 
-- 
2.26.3



* Re: [PATCH v3 1/9] powerpc/xive: Use cpu_to_node() instead of "ibm,chip-id" property
From: David Gibson @ 2021-04-01  2:49 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: linuxppc-dev, Greg Kurz


On Wed, Mar 31, 2021 at 04:45:06PM +0200, Cédric Le Goater wrote:
> The 'chip_id' field of the XIVE CPU structure is used to choose a
> target for a source located on the same chip when possible. The XIVE
> driver queries the chip id value from the "ibm,chip-id" DT property
> but this property is not available on all platforms. It was first
> introduced on the PowerNV platform and later, under QEMU for pseries.
> However, the property does not exist under PowerVM since it is not
> specified in PAPR.
> 
> cpu_to_node() is a better alternative. On the PowerNV platform, the
> node id is computed from the "ibm,associativity" property of the CPU.
> Its value is built in the OPAL firmware from the physical chip id and
> is equivalent to "ibm,chip-id".

Hrm... I mean, for powernv this is certainly correct, but seems to be
relying on pretty specific details of the OPAL / chip behaviour,
namely that the NUMA id == chip ID.

> On pSeries, the hcall
> H_HOME_NODE_ASSOCIATIVITY returns the node id.

AFAICT, the chip_id field is never actually used in the PAPR version
of XIVE.  The only access to the field outside native.c is in
xive_pick_irq_target(), and it only looks at chip_id if src_chip is
valid.  But src_chip is initialized to XIVE_INVALID_CHIP_ID in papr.c.

So it would make more sense to me to also initialize chip_id to
XIVE_INVALID_CHIP_ID for PAPR to make it clearer that it's not
relevant.
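
Roughly, as a sketch of the suggestion (untested, in whatever spot
the PAPR setup path ends up owning the assignment):

	/* PAPR has no usable chip id, make that explicit */
	xc->chip_id = XIVE_INVALID_CHIP_ID;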

> It should also be noted that under QEMU/KVM "ibm,chip-id" is badly
> calculated for unusual SMT configurations. This leads to a bogus chip
> id value being returned by of_get_ibm_chip_id().

I *still* don't clearly understand what you think is bogus about the
chip id value that qemu generates.  It's clearly not a problem for
XIVE, since PAPR XIVE never uses it.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node
From: Greg Kurz @ 2021-04-01  8:04 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: linuxppc-dev

On Wed, 31 Mar 2021 16:45:05 +0200
Cédric Le Goater <clg@kaod.org> wrote:

> 
> Hello,
> 
> ipistorm [*] can be used to benchmark the raw interrupt rate of an
> interrupt controller by measuring the number of IPIs a system can
> sustain. When applied to the XIVE interrupt controller of POWER9 and
> POWER10 systems, a significant drop of the interrupt rate can be
> observed when crossing the second node boundary.
> 
> This is due to the fact that a single IPI interrupt is used for all
> CPUs of the system. The structure is shared, and the cache line
> updates greatly impact the traffic between nodes and the overall IPI
> performance.
> 
> As a workaround, the impact can be reduced by deactivating the IRQ
> lockup detector ("noirqdebug") which does a lot of accounting in the
> Linux IRQ descriptor structure and is responsible for most of the
> performance penalty.
> 
> As a fix, this proposal allocates an IPI interrupt per node, to be
> shared by all CPUs of that node. It solves the scaling issue: the IRQ
> lockup detector still has an impact, but the XIVE interrupt rate now
> scales linearly. It also improves the "noirqdebug" case, as shown in
> the tables below.
> 

As explained by David and others, NUMA nodes happen to match sockets
with current POWER CPUs, but these are really different concepts. NUMA
is about CPU memory access latency, while in the case of XIVE you
really need to identify a XIVE chip localized in a given socket.

PAPR doesn't know about sockets, only cores. In other words, a PAPR
compliant guest sees all vCPUs like they all sit in a single socket.
Same for the XIVE. Trying to introduce a concept of socket, either
by hijacking OPAL's ibm,chip-id or NUMA node ids, is a kind of
spec violation in this context. If the user cares for locality of
the vCPUs and XIVE on the same socket, then it should bind vCPU
threads to host CPUs from the same socket in the first place.
Isn't this enough to solve the performance issues this series
wants to fix, without the need for virtual socket ids?
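
For instance with libvirt, something along these lines (cpuset values
made up) pins the vCPUs to host CPUs of a single socket:

    <cputune>
      <vcpupin vcpu='0' cpuset='0-7'/>
      <vcpupin vcpu='1' cpuset='0-7'/>
    </cputune>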

> [...]


* Re: [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node
From: Cédric Le Goater @ 2021-04-01  8:42 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Greg Kurz, David Gibson

Hello,


On 3/31/21 4:45 PM, Cédric Le Goater wrote:
> 
> Hello,
> 
> ipistorm [*] can be used to benchmark the raw interrupt rate of an
> interrupt controller by measuring the number of IPIs a system can
> sustain. When applied to the XIVE interrupt controller of POWER9 and
> POWER10 systems, a significant drop of the interrupt rate can be
> observed when crossing the second node boundary.
> 
> This is due to the fact that a single IPI interrupt is used for all
> CPUs of the system. The structure is shared, and the cache line
> updates greatly impact the traffic between nodes and the overall IPI
> performance.
> 
> As a workaround, the impact can be reduced by deactivating the IRQ
> lockup detector ("noirqdebug") which does a lot of accounting in the
> Linux IRQ descriptor structure and is responsible for most of the
> performance penalty.
> 
> As a fix, this proposal allocates an IPI interrupt per node, to be
> shared by all CPUs of that node. It solves the scaling issue: the IRQ
> lockup detector still has an impact, but the XIVE interrupt rate now
> scales linearly. It also improves the "noirqdebug" case, as shown in
> the tables below.

From the comments I received on different email threads, it seems I
was making some wrong assumptions about the code and concepts. We can
postpone this patchset. It's an optimization, and there are some more
cleanups that can be done first.

Thanks for the time and the shared expertise,

C.

> [...]



* Re: [PATCH v3 1/9] powerpc/xive: Use cpu_to_node() instead of "ibm,chip-id" property
From: Cédric Le Goater @ 2021-04-01  9:10 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, Greg Kurz

On 4/1/21 4:49 AM, David Gibson wrote:
> On Wed, Mar 31, 2021 at 04:45:06PM +0200, Cédric Le Goater wrote:
>> The 'chip_id' field of the XIVE CPU structure is used to choose a
>> target for a source located on the same chip when possible. The XIVE
>> driver queries the chip id value from the "ibm,chip-id" DT property
>> but this property is not available on all platforms. It was first
>> introduced on the PowerNV platform and later, under QEMU for pseries.
>> However, the property does not exist under PowerVM since it is not
>> specified in PAPR.
>>
>> cpu_to_node() is a better alternative. On the PowerNV platform, the
>> node id is computed from the "ibm,associativity" property of the CPU.
>> Its value is built in the OPAL firmware from the physical chip id and
>> is equivalent to "ibm,chip-id".
> 
> Hrm... I mean, for powernv this is certainly correct, but seems to be
> relying on pretty specific specifics of the OPAL / chip behaviour,
> namely that the NUMA id == chip ID.

Yes. It seems so.  

>> On pSeries, the hcall H_HOME_NODE_ASSOCIATIVITY returns the node id.
> 
> AFAICT, the chip_id field is never actually used in the PAPR version
> of XIVE.  The only access to the field outside native.c is in
> xive_pick_irq_target(), and it only looks at chip_id if src_chip is
> valid.  

Yes.

> But src_chip is initialized to XIVE_INVALID_CHIP_ID in papr.c

Yes. The H_INT hcalls do not return any information on the source
location.

> So it would make more sense to me to also initialize chip_id to
> XIVE_INVALID_CHIP_ID for PAPR to make it clearer that it's not
> relevant.

Yes. That would clarify that chip_id is not relevant on PowerVM/KVM.

We can drop this patch; it's not a requirement for patches 2-9, simply
a cleanup. I will move the chip_id assignment to a platform handler
in another patch.

>> It should also be noted that under QEMU/KVM "ibm,chip-id" is badly
>> calculated for unusual SMT configurations. This leads to a bogus chip
>> id value being returned by of_get_ibm_chip_id().
> 
> I *still* don't clearly understand what you think is bogus about the
> chip id value that qemu generates.  It's clearly not a problem for
> XIVE, since PAPR XIVE never uses it.

I am getting confused by socket/node/chip concepts under PPC. 

However, when looking at PHB and MSI, there is definitely a "node" 
concept being used in the core IRQ layer for allocation and affinity. 
We will need to clarify that when we introduce MSI domains.  

Thanks,

C.  


* Re: [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node
From: Cédric Le Goater @ 2021-04-01  9:18 UTC (permalink / raw)
  To: Greg Kurz; +Cc: linuxppc-dev

Hello,

On 4/1/21 10:04 AM, Greg Kurz wrote:
> On Wed, 31 Mar 2021 16:45:05 +0200
> Cédric Le Goater <clg@kaod.org> wrote:
> 
>>
>> Hello,
>>
>> ipistorm [*] can be used to benchmark the raw interrupt rate of an
>> interrupt controller by measuring the number of IPIs a system can
>> sustain. When applied to the XIVE interrupt controller of POWER9 and
>> POWER10 systems, a significant drop of the interrupt rate can be
>> observed when crossing the second node boundary.
>>
>> This is due to the fact that a single IPI interrupt is used for all
>> CPUs of the system. The structure is shared, and the cache line
>> updates greatly impact the traffic between nodes and the overall IPI
>> performance.
>>
>> As a workaround, the impact can be reduced by deactivating the IRQ
>> lockup detector ("noirqdebug") which does a lot of accounting in the
>> Linux IRQ descriptor structure and is responsible for most of the
>> performance penalty.
>>
>> As a fix, this proposal allocates an IPI interrupt per node, to be
>> shared by all CPUs of that node. It solves the scaling issue, the IRQ
>> lockup detector still has an impact but the XIVE interrupt rate scales
>> linearly. It also improves the "noirqdebug" case as showed in the
>> tables below. 
>>
> 
> As explained by David and others, NUMA nodes happen to match sockets
> with current POWER CPUs but these are really different concepts. NUMA
> is about CPU memory accesses latency, 

This is exactly our problem. We have cache issues because HW threads
on different chips are trying to access the same structure in memory.
It happens on virtual platforms and bare-metal platforms alike. This
is not restricted to pseries.

> while in the case of XIVE you
> really need to identify a XIVE chip localized in a given socket.
> 
> PAPR doesn't know about sockets, only cores. In other words, a PAPR
> compliant guest sees all vCPUs like they all sit in a single socket.

There are also NUMA nodes on PAPR.

> Same for the XIVE. Trying to introduce a concept of socket, either
> by hijacking OPAL's ibm,chip-id or NUMA node ids, is a kind of
> spec violation in this context. If the user cares for locality of
> the vCPUs and XIVE on the same socket, then it should bind vCPU
> threads to host CPUs from the same socket in the first place.

Yes, that's a must-have of course. You need to reflect the real HW
topology in the guest or LPAR if you are after performance, or
restrict the virtual machine to a single socket/chip/node.

And this is not only a XIVE problem. XICS has the same problem with
a single shared IPI interrupt descriptor, but XICS doesn't scale well
by design, so it doesn't show.


> Isn't this enough to solve the performance issues this series
> want to fix, without the need for virtual socket ids ?

What are virtual socket ids? A new concept?

Thanks,

C.

> 
>>  * P9 DD2.2 - 2s * 64 threads
>>
>>                                                "noirqdebug"
>>                         Mint/s                    Mint/s   
>>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys     
>>  --------------------------------------------------------------
>>  1      0-15     4.984023   4.875405       4.996536   5.048892
>>         0-31    10.879164  10.544040      10.757632  11.037859
>>         0-47    15.345301  14.688764      14.926520  15.310053
>>         0-63    17.064907  17.066812      17.613416  17.874511
>>  2      0-79    11.768764  21.650749      22.689120  22.566508
>>         0-95    10.616812  26.878789      28.434703  28.320324
>>         0-111   10.151693  31.397803      31.771773  32.388122
>>         0-127    9.948502  33.139336      34.875716  35.224548
>>
>>
>>  * P10 DD1 - 4s (not homogeneous) 352 threads
>>
>>                                                "noirqdebug"
>>                         Mint/s                    Mint/s   
>>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys     
>>  --------------------------------------------------------------
>>  1      0-15     2.409402   2.364108       2.383303   2.395091
>>         0-31     6.028325   6.046075       6.089999   6.073750
>>         0-47     8.655178   8.644531       8.712830   8.724702
>>         0-63    11.629652  11.735953      12.088203  12.055979
>>         0-79    14.392321  14.729959      14.986701  14.973073
>>         0-95    12.604158  13.004034      17.528748  17.568095
>>  2      0-111    9.767753  13.719831      19.968606  20.024218
>>         0-127    6.744566  16.418854      22.898066  22.995110
>>         0-143    6.005699  19.174421      25.425622  25.417541
>>         0-159    5.649719  21.938836      27.952662  28.059603
>>         0-175    5.441410  24.109484      31.133915  31.127996
>>  3      0-191    5.318341  24.405322      33.999221  33.775354
>>         0-207    5.191382  26.449769      36.050161  35.867307
>>         0-223    5.102790  29.356943      39.544135  39.508169
>>         0-239    5.035295  31.933051      42.135075  42.071975
>>         0-255    4.969209  34.477367      44.655395  44.757074
>>  4      0-271    4.907652  35.887016      47.080545  47.318537
>>         0-287    4.839581  38.076137      50.464307  50.636219
>>         0-303    4.786031  40.881319      53.478684  53.310759
>>         0-319    4.743750  43.448424      56.388102  55.973969
>>         0-335    4.709936  45.623532      59.400930  58.926857
>>         0-351    4.681413  45.646151      62.035804  61.830057
>>
>> [*] https://github.com/antonblanchard/ipistorm
>>
>> Thanks,
>>
>> C.
>>
>> Changes in v3:
>>
>>   - improved commit log for the misuse of "ibm,chip-id"
>>   - better error handling of xive_request_ipi()
>>   - use of a fwnode_handle to name the new domain 
>>   - increased IPI name length
>>   - use of early_cpu_to_node() for hotplugged CPUs
>>   - filter CPU-less nodes
>>
>> Changes in v2:
>>
>>   - extra simplification on xmon
>>   - fixes on issues reported by the kernel test robot
>>
>> Cédric Le Goater (9):
>>   powerpc/xive: Use cpu_to_node() instead of "ibm,chip-id" property
>>   powerpc/xive: Introduce an IPI interrupt domain
>>   powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ
>>   powerpc/xive: Simplify xive_core_debug_show()
>>   powerpc/xive: Drop check on irq_data in xive_core_debug_show()
>>   powerpc/xive: Simplify the dump of XIVE interrupts under xmon
>>   powerpc/xive: Fix xmon command "dxi"
>>   powerpc/xive: Map one IPI interrupt per node
>>   powerpc/xive: Modernize XIVE-IPI domain with an 'alloc' handler
>>
>>  arch/powerpc/include/asm/xive.h          |   1 +
>>  arch/powerpc/sysdev/xive/xive-internal.h |   2 -
>>  arch/powerpc/sysdev/xive/common.c        | 211 +++++++++++++++--------
>>  arch/powerpc/xmon/xmon.c                 |  28 +--
>>  4 files changed, 139 insertions(+), 103 deletions(-)
>>
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node
  2021-04-01  9:18   ` Cédric Le Goater
@ 2021-04-01 12:45     ` Greg Kurz
  2021-04-01 17:14       ` Cédric Le Goater
  0 siblings, 1 reply; 21+ messages in thread
From: Greg Kurz @ 2021-04-01 12:45 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: linuxppc-dev

On Thu, 1 Apr 2021 11:18:10 +0200
Cédric Le Goater <clg@kaod.org> wrote:

> Hello,
> 
> On 4/1/21 10:04 AM, Greg Kurz wrote:
> > On Wed, 31 Mar 2021 16:45:05 +0200
> > Cédric Le Goater <clg@kaod.org> wrote:
> > 
> >>
> >> Hello,
> >>
> >> ipistorm [*] can be used to benchmark the raw interrupt rate of an
> >> interrupt controller by measuring the number of IPIs a system can
> >> sustain. When applied to the XIVE interrupt controller of POWER9 and
> >> POWER10 systems, a significant drop of the interrupt rate can be
> >> observed when crossing the second node boundary.
> >>
> >> This is due to the fact that a single IPI interrupt is used for all
> >> CPUs of the system. The structure is shared and the cache line updates
> >> impact greatly the traffic between nodes and the overall IPI
> >> performance.
> >>
> >> As a workaround, the impact can be reduced by deactivating the IRQ
> >> lockup detector ("noirqdebug") which does a lot of accounting in the
> >> Linux IRQ descriptor structure and is responsible for most of the
> >> performance penalty.
> >>
> >> As a fix, this proposal allocates an IPI interrupt per node, to be
> >> shared by all CPUs of that node. It solves the scaling issue, the IRQ
> >> lockup detector still has an impact but the XIVE interrupt rate scales
> >> linearly. It also improves the "noirqdebug" case as showed in the
> >> tables below. 
> >>
> > 
> > As explained by David and others, NUMA nodes happen to match sockets
> > with current POWER CPUs but these are really different concepts. NUMA
> > is about CPU memory accesses latency, 
> 
> This is exactly our problem. we have cache issues because hw threads 
> on different chips are trying to access the same structure in memory.
> It happens on virtual platforms and baremetal platforms. This is not
> restricted to pseries.
> 

Ok, I get it... the XIVE HW accesses structures in RAM, just like HW threads
do, so the closer, the better. This definitely looks NUMA related indeed. So
yes, the idea of having the XIVE HW only access local in-RAM data when
handling IPIs between vCPUs in the same NUMA node makes sense.

What is less clear is the exact role of ibm,chip-id, actually. It is
currently used on PowerNV only to pick a default target on the same
"chip" as the source if possible. What is the detailed motivation behind
this?

> > while in the case of XIVE you
> > really need to identify a XIVE chip localized in a given socket.
> > 
> > PAPR doesn't know about sockets, only cores. In other words, a PAPR
> > compliant guest sees all vCPUs like they all sit in a single socket.
> 
> There are also NUMA nodes on PAPR.
> 

Yes, but nothing prevents a NUMA node from spanning multiple sockets,
or several NUMA nodes from sitting within the same socket, even if this
isn't the case in practice with current POWER hardware.

> > Same for the XIVE. Trying to introduce a concept of socket, either
> > by hijacking OPAL's ibm,chip-id or NUMA node ids, is a kind of
> > spec violation in this context. If the user cares for locality of
> > the vCPUs and XIVE on the same socket, then it should bind vCPU
> > threads to host CPUs from the same socket in the first place.
> 
> Yes. that's a must have of course. You need to reflect the real HW
> topology in the guest or LPAR if you are after performance, or 
> restrict the virtual machine to be on a single socket/chip/node.  
> 
> And this is not only a XIVE problem. XICS has the same problem with
> a shared single IPI interrupt descriptor but XICS doesn't scale well 
> by design, so it doesn't show.
> 
> 
> > Isn't this enough to solve the performance issues this series
> > want to fix, without the need for virtual socket ids ?
> what are virtual socket ids ? A new concept ? 
> 

For now, we have virtual CPUs identified by a virtual CPU id.
It thus seems natural to speak of a virtual socket id, but
anyway, the wording isn't really important here, and you
didn't answer the question ;-)

> Thanks,
> 
> C.
> 
> > 
> >>  * P9 DD2.2 - 2s * 64 threads
> >>
> >>                                                "noirqdebug"
> >>                         Mint/s                    Mint/s   
> >>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys     
> >>  --------------------------------------------------------------
> >>  1      0-15     4.984023   4.875405       4.996536   5.048892
> >>         0-31    10.879164  10.544040      10.757632  11.037859
> >>         0-47    15.345301  14.688764      14.926520  15.310053
> >>         0-63    17.064907  17.066812      17.613416  17.874511
> >>  2      0-79    11.768764  21.650749      22.689120  22.566508
> >>         0-95    10.616812  26.878789      28.434703  28.320324
> >>         0-111   10.151693  31.397803      31.771773  32.388122
> >>         0-127    9.948502  33.139336      34.875716  35.224548
> >>
> >>
> >>  * P10 DD1 - 4s (not homogeneous) 352 threads
> >>
> >>                                                "noirqdebug"
> >>                         Mint/s                    Mint/s   
> >>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys     
> >>  --------------------------------------------------------------
> >>  1      0-15     2.409402   2.364108       2.383303   2.395091
> >>         0-31     6.028325   6.046075       6.089999   6.073750
> >>         0-47     8.655178   8.644531       8.712830   8.724702
> >>         0-63    11.629652  11.735953      12.088203  12.055979
> >>         0-79    14.392321  14.729959      14.986701  14.973073
> >>         0-95    12.604158  13.004034      17.528748  17.568095
> >>  2      0-111    9.767753  13.719831      19.968606  20.024218
> >>         0-127    6.744566  16.418854      22.898066  22.995110
> >>         0-143    6.005699  19.174421      25.425622  25.417541
> >>         0-159    5.649719  21.938836      27.952662  28.059603
> >>         0-175    5.441410  24.109484      31.133915  31.127996
> >>  3      0-191    5.318341  24.405322      33.999221  33.775354
> >>         0-207    5.191382  26.449769      36.050161  35.867307
> >>         0-223    5.102790  29.356943      39.544135  39.508169
> >>         0-239    5.035295  31.933051      42.135075  42.071975
> >>         0-255    4.969209  34.477367      44.655395  44.757074
> >>  4      0-271    4.907652  35.887016      47.080545  47.318537
> >>         0-287    4.839581  38.076137      50.464307  50.636219
> >>         0-303    4.786031  40.881319      53.478684  53.310759
> >>         0-319    4.743750  43.448424      56.388102  55.973969
> >>         0-335    4.709936  45.623532      59.400930  58.926857
> >>         0-351    4.681413  45.646151      62.035804  61.830057
> >>
> >> [*] https://github.com/antonblanchard/ipistorm
> >>
> >> Thanks,
> >>
> >> C.
> >>
> >> Changes in v3:
> >>
> >>   - improved commit log for the misuse of "ibm,chip-id"
> >>   - better error handling of xive_request_ipi()
> >>   - use of a fwnode_handle to name the new domain 
> >>   - increased IPI name length
> >>   - use of early_cpu_to_node() for hotplugged CPUs
> >>   - filter CPU-less nodes
> >>
> >> Changes in v2:
> >>
> >>   - extra simplification on xmon
> >>   - fixes on issues reported by the kernel test robot
> >>
> >> Cédric Le Goater (9):
> >>   powerpc/xive: Use cpu_to_node() instead of "ibm,chip-id" property
> >>   powerpc/xive: Introduce an IPI interrupt domain
> >>   powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ
> >>   powerpc/xive: Simplify xive_core_debug_show()
> >>   powerpc/xive: Drop check on irq_data in xive_core_debug_show()
> >>   powerpc/xive: Simplify the dump of XIVE interrupts under xmon
> >>   powerpc/xive: Fix xmon command "dxi"
> >>   powerpc/xive: Map one IPI interrupt per node
> >>   powerpc/xive: Modernize XIVE-IPI domain with an 'alloc' handler
> >>
> >>  arch/powerpc/include/asm/xive.h          |   1 +
> >>  arch/powerpc/sysdev/xive/xive-internal.h |   2 -
> >>  arch/powerpc/sysdev/xive/common.c        | 211 +++++++++++++++--------
> >>  arch/powerpc/xmon/xmon.c                 |  28 +--
> >>  4 files changed, 139 insertions(+), 103 deletions(-)
> >>
> > 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 8/9] powerpc/xive: Map one IPI interrupt per node
  2021-03-31 14:45 ` [PATCH v3 8/9] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
@ 2021-04-01 12:50   ` Nicholas Piggin
  2021-04-02 11:31     ` Cédric Le Goater
  0 siblings, 1 reply; 21+ messages in thread
From: Nicholas Piggin @ 2021-04-01 12:50 UTC (permalink / raw)
  To: Cédric Le Goater, linuxppc-dev; +Cc: Thomas Gleixner, Greg Kurz

Excerpts from Cédric Le Goater's message of April 1, 2021 12:45 am:
> ipistorm [*] can be used to benchmark the raw interrupt rate of an
> interrupt controller by measuring the number of IPIs a system can
> sustain. When applied to the XIVE interrupt controller of POWER9 and
> POWER10 systems, a significant drop of the interrupt rate can be
> observed when crossing the second node boundary.
> 
> This is due to the fact that a single IPI interrupt is used for all
> CPUs of the system. The structure is shared and the cache line updates
> impact greatly the traffic between nodes and the overall IPI
> performance.
> 
> As a workaround, the impact can be reduced by deactivating the IRQ
> lockup detector ("noirqdebug") which does a lot of accounting in the
> Linux IRQ descriptor structure and is responsible for most of the
> performance penalty.
> 
> As a fix, this proposal allocates an IPI interrupt per node, to be
> shared by all CPUs of that node. It solves the scaling issue, the IRQ
> lockup detector still has an impact but the XIVE interrupt rate scales
> linearly. It also improves the "noirqdebug" case as showed in the
> tables below.
> 
>  * P9 DD2.2 - 2s * 64 threads
> 
>                                                "noirqdebug"
>                         Mint/s                    Mint/s
>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
>  --------------------------------------------------------------
>  1      0-15     4.984023   4.875405       4.996536   5.048892
>         0-31    10.879164  10.544040      10.757632  11.037859
>         0-47    15.345301  14.688764      14.926520  15.310053
>         0-63    17.064907  17.066812      17.613416  17.874511
>  2      0-79    11.768764  21.650749      22.689120  22.566508
>         0-95    10.616812  26.878789      28.434703  28.320324
>         0-111   10.151693  31.397803      31.771773  32.388122
>         0-127    9.948502  33.139336      34.875716  35.224548
> 
>  * P10 DD1 - 4s (not homogeneous) 352 threads
> 
>                                                "noirqdebug"
>                         Mint/s                    Mint/s
>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
>  --------------------------------------------------------------
>  1      0-15     2.409402   2.364108       2.383303   2.395091
>         0-31     6.028325   6.046075       6.089999   6.073750
>         0-47     8.655178   8.644531       8.712830   8.724702
>         0-63    11.629652  11.735953      12.088203  12.055979
>         0-79    14.392321  14.729959      14.986701  14.973073
>         0-95    12.604158  13.004034      17.528748  17.568095
>  2      0-111    9.767753  13.719831      19.968606  20.024218
>         0-127    6.744566  16.418854      22.898066  22.995110
>         0-143    6.005699  19.174421      25.425622  25.417541
>         0-159    5.649719  21.938836      27.952662  28.059603
>         0-175    5.441410  24.109484      31.133915  31.127996
>  3      0-191    5.318341  24.405322      33.999221  33.775354
>         0-207    5.191382  26.449769      36.050161  35.867307
>         0-223    5.102790  29.356943      39.544135  39.508169
>         0-239    5.035295  31.933051      42.135075  42.071975
>         0-255    4.969209  34.477367      44.655395  44.757074
>  4      0-271    4.907652  35.887016      47.080545  47.318537
>         0-287    4.839581  38.076137      50.464307  50.636219
>         0-303    4.786031  40.881319      53.478684  53.310759
>         0-319    4.743750  43.448424      56.388102  55.973969
>         0-335    4.709936  45.623532      59.400930  58.926857
>         0-351    4.681413  45.646151      62.035804  61.830057
> 
> [*] https://github.com/antonblanchard/ipistorm
> 
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

Very nice result, but the default-on irqdebug code is quite a slowdown
even with your improvements.

Is the main cacheline bouncing in the fast path coming from 
desc->irq_count++ of the percpu handler? Can we do something quick and 
dirty like the attached patch?

All this stuff seems totally racy with the percpu handler, but maybe that
doesn't matter too much (and anyway it would be a much bigger change).

Thanks,
Nick

---
 kernel/irq/spurious.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index f865e5f4d382..6b17b737ee6c 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -378,7 +378,8 @@ void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret)
 			 * then we merily delay the spurious detection
 			 * by one hard interrupt. Not a real problem.
 			 */
-			desc->threads_handled_last &= ~SPURIOUS_DEFERRED;
+			if (desc->threads_handled_last & SPURIOUS_DEFERRED)
+				desc->threads_handled_last &= ~SPURIOUS_DEFERRED;
 		}
 	}
 
@@ -403,6 +404,10 @@ void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret)
 			desc->irqs_unhandled -= ok;
 	}
 
+	if (likely(!desc->irqs_unhandled))
+		return;
+
+	/* Now getting into unhandled irq detection */
 	desc->irq_count++;
 	if (likely(desc->irq_count < 100000))
 		return;
-- 
2.23.0




^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node
  2021-04-01 12:45     ` Greg Kurz
@ 2021-04-01 17:14       ` Cédric Le Goater
  0 siblings, 0 replies; 21+ messages in thread
From: Cédric Le Goater @ 2021-04-01 17:14 UTC (permalink / raw)
  To: Greg Kurz; +Cc: linuxppc-dev

On 4/1/21 2:45 PM, Greg Kurz wrote:
> On Thu, 1 Apr 2021 11:18:10 +0200
> Cédric Le Goater <clg@kaod.org> wrote:
> 
>> Hello,
>>
>> On 4/1/21 10:04 AM, Greg Kurz wrote:
>>> On Wed, 31 Mar 2021 16:45:05 +0200
>>> Cédric Le Goater <clg@kaod.org> wrote:
>>>
>>>>
>>>> Hello,
>>>>
>>>> ipistorm [*] can be used to benchmark the raw interrupt rate of an
>>>> interrupt controller by measuring the number of IPIs a system can
>>>> sustain. When applied to the XIVE interrupt controller of POWER9 and
>>>> POWER10 systems, a significant drop of the interrupt rate can be
>>>> observed when crossing the second node boundary.
>>>>
>>>> This is due to the fact that a single IPI interrupt is used for all
>>>> CPUs of the system. The structure is shared and the cache line updates
>>>> impact greatly the traffic between nodes and the overall IPI
>>>> performance.
>>>>
>>>> As a workaround, the impact can be reduced by deactivating the IRQ
>>>> lockup detector ("noirqdebug") which does a lot of accounting in the
>>>> Linux IRQ descriptor structure and is responsible for most of the
>>>> performance penalty.
>>>>
>>>> As a fix, this proposal allocates an IPI interrupt per node, to be
>>>> shared by all CPUs of that node. It solves the scaling issue, the IRQ
>>>> lockup detector still has an impact but the XIVE interrupt rate scales
>>>> linearly. It also improves the "noirqdebug" case as showed in the
>>>> tables below. 
>>>>
>>>
>>> As explained by David and others, NUMA nodes happen to match sockets
>>> with current POWER CPUs but these are really different concepts. NUMA
>>> is about CPU memory accesses latency, 
>>
>> This is exactly our problem. we have cache issues because hw threads 
>> on different chips are trying to access the same structure in memory.
>> It happens on virtual platforms and baremetal platforms. This is not
>> restricted to pseries.
>>
> 
> Ok, I get it... the XIVE HW accesses structures in RAM, just like HW threads
> do, so the closer, the better. 

No. That's another problem, related to the XIVE internal tables, which
should be allocated on the chip where they are mostly used.

The problem is much simpler. As the commit log says:

 This is due to the fact that a single IPI interrupt is used for all
 CPUs of the system. The structure is shared and the cache line updates
 impact greatly the traffic between nodes and the overall IPI
 performance.

So, we have multiple threads competing for the same IRQ descriptor and 
overloading the PowerBUS with cache update synchronization. 
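
Concretely, the fix in patch 8 amounts to something along these lines
(a simplified sketch, not the exact code):

/*
 * Simplified sketch: one IPI descriptor per NUMA node instead of a
 * single one shared by all CPUs of the system.
 */
struct xive_ipi_desc {
	unsigned int irq;	/* Linux IRQ number of the node's IPI */
	char name[16];		/* IRQ name, e.g. "IPI-0" */
};

static struct xive_ipi_desc *xive_ipis;	/* indexed by node id */

static unsigned int xive_ipi_cpu_to_irq(unsigned int cpu)
{
	/* All CPUs of a node share the same descriptor, so the IRQ
	 * accounting cache lines no longer bounce across nodes. */
	return xive_ipis[early_cpu_to_node(cpu)].irq;
}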


> This definitely looks NUMA related indeed. So
> yes, the idea of having the XIVE HW to only access local in-RAM data when
> handling IPIs between vCPUs in the same NUMA node makes sense.

Yes. That's the goal.
 
> What is less clear is the exact role of ibm,chip-id actually. This is
> currently used on PowerNV only to pick up a default target on the same
> "chip" as the source if possible. What is the detailed motivation behind
> this ?

The "ibm,chip-id" issue is extra noise and not a requirement for this 
patchset.

>>> while in the case of XIVE you
>>> really need to identify a XIVE chip localized in a given socket.
>>>
>>> PAPR doesn't know about sockets, only cores. In other words, a PAPR
>>> compliant guest sees all vCPUs like they all sit in a single socket.
>>
>> There are also NUMA nodes on PAPR.
>>
> 
> Yes but nothing prevents a NUMA node to span over multiple sockets
> or having several NUMA nodes within the same socket, even if this
> isn't the case in practice with current POWER hardware.

Yes. A NUMA node could even be a PCI adapter attached to storage.
I don't know what to say; maybe we are missing a concept.

>>> Same for the XIVE. Trying to introduce a concept of socket, either
>>> by hijacking OPAL's ibm,chip-id or NUMA node ids, is a kind of
>>> spec violation in this context. If the user cares for locality of
>>> the vCPUs and XIVE on the same socket, then it should bind vCPU
>>> threads to host CPUs from the same socket in the first place.
>>
>> Yes. that's a must have of course. You need to reflect the real HW
>> topology in the guest or LPAR if you are after performance, or 
>> restrict the virtual machine to be on a single socket/chip/node.  
>>
>> And this is not only a XIVE problem. XICS has the same problem with
>> a shared single IPI interrupt descriptor but XICS doesn't scale well 
>> by design, so it doesn't show.
>>
>>
>>> Isn't this enough to solve the performance issues this series
>>> want to fix, without the need for virtual socket ids ?
>> what are virtual socket ids ? A new concept ? 
>>
> 
> For now, we have virtual CPUs identified by a virtual CPU id.
> It thus seems natural to speak of a virtual socket id, but
> anyway, the wording isn't really important here and you
> don't answer the question ;-)

If, on the hypervisor, you restrict the virtual machine vCPUs to
a single POWER processor/chip, there is no problem. But large
KVM guests or PowerVM LPARs do exist on 16-socket systems.

C.
 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 8/9] powerpc/xive: Map one IPI interrupt per node
  2021-04-01 12:50   ` Nicholas Piggin
@ 2021-04-02 11:31     ` Cédric Le Goater
  2021-04-02 12:19       ` Cédric Le Goater
  0 siblings, 1 reply; 21+ messages in thread
From: Cédric Le Goater @ 2021-04-02 11:31 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Thomas Gleixner, Greg Kurz

On 4/1/21 2:50 PM, Nicholas Piggin wrote:
> Excerpts from Cédric Le Goater's message of April 1, 2021 12:45 am:
>> ipistorm [*] can be used to benchmark the raw interrupt rate of an
>> interrupt controller by measuring the number of IPIs a system can
>> sustain. When applied to the XIVE interrupt controller of POWER9 and
>> POWER10 systems, a significant drop of the interrupt rate can be
>> observed when crossing the second node boundary.
>>
>> This is due to the fact that a single IPI interrupt is used for all
>> CPUs of the system. The structure is shared and the cache line updates
>> impact greatly the traffic between nodes and the overall IPI
>> performance.
>>
>> As a workaround, the impact can be reduced by deactivating the IRQ
>> lockup detector ("noirqdebug") which does a lot of accounting in the
>> Linux IRQ descriptor structure and is responsible for most of the
>> performance penalty.
>>
>> As a fix, this proposal allocates an IPI interrupt per node, to be
>> shared by all CPUs of that node. It solves the scaling issue, the IRQ
>> lockup detector still has an impact but the XIVE interrupt rate scales
>> linearly. It also improves the "noirqdebug" case as showed in the
>> tables below.
>>
>>  * P9 DD2.2 - 2s * 64 threads
>>
>>                                                "noirqdebug"
>>                         Mint/s                    Mint/s
>>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
>>  --------------------------------------------------------------
>>  1      0-15     4.984023   4.875405       4.996536   5.048892
>>         0-31    10.879164  10.544040      10.757632  11.037859
>>         0-47    15.345301  14.688764      14.926520  15.310053
>>         0-63    17.064907  17.066812      17.613416  17.874511
>>  2      0-79    11.768764  21.650749      22.689120  22.566508
>>         0-95    10.616812  26.878789      28.434703  28.320324
>>         0-111   10.151693  31.397803      31.771773  32.388122
>>         0-127    9.948502  33.139336      34.875716  35.224548
>>
>>  * P10 DD1 - 4s (not homogeneous) 352 threads
>>
>>                                                "noirqdebug"
>>                         Mint/s                    Mint/s
>>  chips  cpus      IPI/sys   IPI/chip       IPI/chip    IPI/sys
>>  --------------------------------------------------------------
>>  1      0-15     2.409402   2.364108       2.383303   2.395091
>>         0-31     6.028325   6.046075       6.089999   6.073750
>>         0-47     8.655178   8.644531       8.712830   8.724702
>>         0-63    11.629652  11.735953      12.088203  12.055979
>>         0-79    14.392321  14.729959      14.986701  14.973073
>>         0-95    12.604158  13.004034      17.528748  17.568095
>>  2      0-111    9.767753  13.719831      19.968606  20.024218
>>         0-127    6.744566  16.418854      22.898066  22.995110
>>         0-143    6.005699  19.174421      25.425622  25.417541
>>         0-159    5.649719  21.938836      27.952662  28.059603
>>         0-175    5.441410  24.109484      31.133915  31.127996
>>  3      0-191    5.318341  24.405322      33.999221  33.775354
>>         0-207    5.191382  26.449769      36.050161  35.867307
>>         0-223    5.102790  29.356943      39.544135  39.508169
>>         0-239    5.035295  31.933051      42.135075  42.071975
>>         0-255    4.969209  34.477367      44.655395  44.757074
>>  4      0-271    4.907652  35.887016      47.080545  47.318537
>>         0-287    4.839581  38.076137      50.464307  50.636219
>>         0-303    4.786031  40.881319      53.478684  53.310759
>>         0-319    4.743750  43.448424      56.388102  55.973969
>>         0-335    4.709936  45.623532      59.400930  58.926857
>>         0-351    4.681413  45.646151      62.035804  61.830057
>>
>> [*] https://github.com/antonblanchard/ipistorm
>>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> 
> Very nice result but the default-on irqdebug code is quite a slowdown
> even with your improvements.
> 
> Is the main cacheline bouncing in the fast path coming from 
> desc->irq_count++ of the percpu handler? Can we do something quick and 
> dirty like the attached patch?
> 
> All this stuff seems totally racy with percpu handler but maybe that
> doesn't matter too much (and anyway it would be a much bigger change)

I gave the patch below a try and we are reaching the same results,
if not better. The simplest solution is always the best. Nick, you
should send that single patch.

Thanks,

C. 
 

> Thanks,
> Nick
> 
> ---
>  kernel/irq/spurious.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
> index f865e5f4d382..6b17b737ee6c 100644
> --- a/kernel/irq/spurious.c
> +++ b/kernel/irq/spurious.c
> @@ -378,7 +378,8 @@ void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret)
>  			 * then we merily delay the spurious detection
>  			 * by one hard interrupt. Not a real problem.
>  			 */
> -			desc->threads_handled_last &= ~SPURIOUS_DEFERRED;
> +			if (desc->threads_handled_last & SPURIOUS_DEFERRED)
> +				desc->threads_handled_last &= ~SPURIOUS_DEFERRED;
>  		}
>  	}
>  
> @@ -403,6 +404,10 @@ void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret)
>  			desc->irqs_unhandled -= ok;
>  	}
>  
> +	if (likely(!desc->irqs_unhandled))
> +		return;
> +
> +	/* Now getting into unhandled irq detection */
>  	desc->irq_count++;
>  	if (likely(desc->irq_count < 100000))
>  		return;
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 8/9] powerpc/xive: Map one IPI interrupt per node
  2021-04-02 11:31     ` Cédric Le Goater
@ 2021-04-02 12:19       ` Cédric Le Goater
  0 siblings, 0 replies; 21+ messages in thread
From: Cédric Le Goater @ 2021-04-02 12:19 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Thomas Gleixner, Greg Kurz

> I gave the patch below a try and we are reaching the same results, 
> if not better. The simplest solution is always the best. Nick, you 
> should send that single patch.

FYI, here are results in a KVM guests with pinned vCPUs.

 * P9 DD2.2 - 2s * 64 threads - KVM guest :


                            IPI/sys                           IPI/chip
 -----------   --------------------------------   --------------------------------
                           unhandled                          unhandled               	
 chips  cpus   noirqdebug  detection              noirqdebug  detection 	    
 -----------   --------------------------------   --------------------------------
 1      0-15     4.152813   4.084240   4.061028     4.097700   4.042539   4.008314
        0-31     8.186328   8.157970   7.937127     8.277942   8.019539   7.831189
        0-47    11.391635  11.232960  11.017530    11.278048  10.994501  10.889347
        0-63    13.907476  14.022307  11.460280    13.933946  13.506828  11.369188
 2      0-79    18.105276  18.084463   8.376047    18.069176  17.587916  15.477006
        0-95    22.100683  22.265763   7.338229    22.084006  21.588463  19.502192
        0-111   25.305948  25.473068   6.716235    25.429261  24.607570  22.733253
        0-127   27.814449  28.029029   6.222736    27.960119  27.253432  23.884916

                              

The three "IPI/chip" columns are results with this series; the "IPI/sys"
columns are without. The "unhandled detection" columns are with Nick's
patch applied.

C. 



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node
  2021-03-31 14:45 [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
                   ` (10 preceding siblings ...)
  2021-04-01  8:42 ` Cédric Le Goater
@ 2021-04-19  3:59 ` Michael Ellerman
  11 siblings, 0 replies; 21+ messages in thread
From: Michael Ellerman @ 2021-04-19  3:59 UTC (permalink / raw)
  To: Cédric Le Goater, linuxppc-dev; +Cc: Greg Kurz

On Wed, 31 Mar 2021 16:45:05 +0200, Cédric Le Goater wrote:
> ipistorm [*] can be used to benchmark the raw interrupt rate of an
> interrupt controller by measuring the number of IPIs a system can
> sustain. When applied to the XIVE interrupt controller of POWER9 and
> POWER10 systems, a significant drop of the interrupt rate can be
> observed when crossing the second node boundary.
> 
> This is due to the fact that a single IPI interrupt is used for all
> CPUs of the system. The structure is shared and the cache line updates
> impact greatly the traffic between nodes and the overall IPI
> performance.
> 
> [...]

Patches 2-9 applied to powerpc/next.

[2/9] powerpc/xive: Introduce an IPI interrupt domain
      https://git.kernel.org/powerpc/c/7d348494136c8b47c39d1f7ccba28c47d5094a54
[3/9] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ
      https://git.kernel.org/powerpc/c/1835e72942b5aa779c8ada62aaeba03ab66d92c9
[4/9] powerpc/xive: Simplify xive_core_debug_show()
      https://git.kernel.org/powerpc/c/5159d9872823230669b7949ba3caf18c4c314846
[5/9] powerpc/xive: Drop check on irq_data in xive_core_debug_show()
      https://git.kernel.org/powerpc/c/a74ce5926b20cd0e6d624a9b2527073a96dfed7f
[6/9] powerpc/xive: Simplify the dump of XIVE interrupts under xmon
      https://git.kernel.org/powerpc/c/6bf66eb8f404050030805c65cf39a810892f5f8e
[7/9] powerpc/xive: Fix xmon command "dxi"
      https://git.kernel.org/powerpc/c/33e4bc5946432a4ac173fd08e8e30a13ab94d06d
[8/9] powerpc/xive: Map one IPI interrupt per node
      https://git.kernel.org/powerpc/c/7dcc37b3eff97379b194adb17eb9a8270512dd1d
[9/9] powerpc/xive: Modernize XIVE-IPI domain with an 'alloc' handler
      https://git.kernel.org/powerpc/c/fd6db2892ebaa1383a93b4a609c65b96e615510a

cheers

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-04-19  4:06 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-31 14:45 [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
2021-03-31 14:45 ` [PATCH v3 1/9] powerpc/xive: Use cpu_to_node() instead of "ibm, chip-id" property Cédric Le Goater
2021-04-01  2:49   ` [PATCH v3 1/9] powerpc/xive: Use cpu_to_node() instead of "ibm,chip-id" property David Gibson
2021-04-01  9:10     ` Cédric Le Goater
2021-03-31 14:45 ` [PATCH v3 2/9] powerpc/xive: Introduce an IPI interrupt domain Cédric Le Goater
2021-03-31 14:45 ` [PATCH v3 3/9] powerpc/xive: Remove useless check on XIVE_IPI_HW_IRQ Cédric Le Goater
2021-03-31 14:45 ` [PATCH v3 4/9] powerpc/xive: Simplify xive_core_debug_show() Cédric Le Goater
2021-03-31 14:45 ` [PATCH v3 5/9] powerpc/xive: Drop check on irq_data in xive_core_debug_show() Cédric Le Goater
2021-03-31 14:45 ` [PATCH v3 6/9] powerpc/xive: Simplify the dump of XIVE interrupts under xmon Cédric Le Goater
2021-03-31 14:45 ` [PATCH v3 7/9] powerpc/xive: Fix xmon command "dxi" Cédric Le Goater
2021-03-31 14:45 ` [PATCH v3 8/9] powerpc/xive: Map one IPI interrupt per node Cédric Le Goater
2021-04-01 12:50   ` Nicholas Piggin
2021-04-02 11:31     ` Cédric Le Goater
2021-04-02 12:19       ` Cédric Le Goater
2021-03-31 14:45 ` [PATCH v3 9/9] powerpc/xive: Modernize XIVE-IPI domain with an 'alloc' handler Cédric Le Goater
2021-04-01  8:04 ` [PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node Greg Kurz
2021-04-01  9:18   ` Cédric Le Goater
2021-04-01 12:45     ` Greg Kurz
2021-04-01 17:14       ` Cédric Le Goater
2021-04-01  8:42 ` Cédric Le Goater
2021-04-19  3:59 ` Michael Ellerman
