linux-kernel.vger.kernel.org archive mirror
* [RFC 0/2] genirq: take device NUMA node into account for managed IRQs
@ 2020-06-17  9:37 Stefan Hajnoczi
  2020-06-17  9:37 ` [RFC 1/2] genirq: honor device NUMA node when allocating descs Stefan Hajnoczi
  2020-06-17  9:37 ` [RFC 2/2] genirq/matrix: take NUMA into account for managed IRQs Stefan Hajnoczi
  0 siblings, 2 replies; 3+ messages in thread
From: Stefan Hajnoczi @ 2020-06-17  9:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Marcelo Tosatti, linux-pci, Thomas Gleixner, Bjorn Helgaas,
	Michael S. Tsirkin, Stefan Hajnoczi

Devices with a small number of managed IRQs do not benefit from spreading
across all CPUs. Instead they benefit from NUMA node affinity so that IRQs are
handled on the device's NUMA node.

For example, here is a machine with a virtio-blk PCI device on NUMA node 1:

  # lstopo-no-graphics
  Machine (958MB total)
    Package L#0
      NUMANode L#0 (P#0 491MB)
      L3 L#0 (16MB) + L2 L#0 (4096KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    Package L#1
      NUMANode L#1 (P#1 466MB)
      L3 L#1 (16MB) + L2 L#1 (4096KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      HostBridge
        PCIBridge
          PCI c9:00.0 (SCSI)
            Block "vdb"
    HostBridge
      PCIBridge
        PCI 02:00.0 (Ethernet)
          Net "enp2s0"
      PCIBridge
        PCI 05:00.0 (SCSI)
          Block "vda"
      PCI 00:1f.2 (SATA)

Currently the virtio5-req.0 IRQ for the vdb device gets assigned to CPU 0:

  # cat /proc/interrupts
             CPU0       CPU1
  ...
   36:          0          0   PCI-MSI 105381888-edge      virtio5-config
   37:         81          0   PCI-MSI 105381889-edge      virtio5-req.0

If managed IRQ assignment takes the device's NUMA node into account then CPU 1
will be used instead:

  # cat /proc/interrupts
             CPU0       CPU1
  ...
   36:          0          0   PCI-MSI 105381888-edge      virtio5-config
   37:          0         92   PCI-MSI 105381889-edge      virtio5-req.0

The fio benchmark with 4KB random read running on CPU 1 increases IOPS by 58%:

  Name              IOPS   Error
  Before        26720.59 ± 0.28%
  After         42373.79 ± 0.54%

Most of this improvement is not due to NUMA itself but to the fact that requests
complete on the same CPU where they were submitted. However, if the IRQ is on
CPU 0 and fio also runs on CPU 0, only 39600 IOPS are achieved, not the full
42373 IOPS that we get when NUMA affinity is honored. So it is worth taking
NUMA into account to achieve maximum performance.

The following patches are a hack that uses the device's NUMA node when
assigning managed IRQs. They are not mergeable but I hope they will help start
the discussion. One bug is that they affect all managed IRQs, even for devices
with many IRQs where spreading across all CPUs is a good policy.

Please let me know what you think:

1. Is there a reason why managed IRQs should *not* take NUMA into account that
   I've missed?

2. Is there a better place to implement this logic? For example, in
   pci_alloc_irq_vectors_affinity(), where the cpumasks are calculated?

Any suggestions on how to proceed would be appreciated. Thanks!

Stefan Hajnoczi (2):
  genirq: honor device NUMA node when allocating descs
  genirq/matrix: take NUMA into account for managed IRQs

 include/linux/irq.h           |  2 +-
 arch/x86/kernel/apic/vector.c |  3 ++-
 kernel/irq/irqdesc.c          |  3 ++-
 kernel/irq/matrix.c           | 16 ++++++++++++----
 4 files changed, 17 insertions(+), 7 deletions(-)

-- 
2.26.2



* [RFC 1/2] genirq: honor device NUMA node when allocating descs
  2020-06-17  9:37 [RFC 0/2] genirq: take device NUMA node into account for managed IRQs Stefan Hajnoczi
@ 2020-06-17  9:37 ` Stefan Hajnoczi
  2020-06-17  9:37 ` [RFC 2/2] genirq/matrix: take NUMA into account for managed IRQs Stefan Hajnoczi
  1 sibling, 0 replies; 3+ messages in thread
From: Stefan Hajnoczi @ 2020-06-17  9:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Marcelo Tosatti, linux-pci, Thomas Gleixner, Bjorn Helgaas,
	Michael S. Tsirkin, Stefan Hajnoczi

Use the device's NUMA node instead of the node of the first CPU in the
mask when descs are allocated. The mask may include all CPUs and
therefore may not correspond to the home NUMA node of the device.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 kernel/irq/irqdesc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..b9c4160d72c4 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -488,7 +488,8 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node,
 					IRQD_MANAGED_SHUTDOWN;
 			}
 			mask = &affinity->mask;
-			node = cpu_to_node(cpumask_first(mask));
+			if (node == NUMA_NO_NODE)
+				node = cpu_to_node(cpumask_first(mask));
 			affinity++;
 		}
 
-- 
2.26.2



* [RFC 2/2] genirq/matrix: take NUMA into account for managed IRQs
  2020-06-17  9:37 [RFC 0/2] genirq: take device NUMA node into account for managed IRQs Stefan Hajnoczi
  2020-06-17  9:37 ` [RFC 1/2] genirq: honor device NUMA node when allocating descs Stefan Hajnoczi
@ 2020-06-17  9:37 ` Stefan Hajnoczi
  1 sibling, 0 replies; 3+ messages in thread
From: Stefan Hajnoczi @ 2020-06-17  9:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Marcelo Tosatti, linux-pci, Thomas Gleixner, Bjorn Helgaas,
	Michael S. Tsirkin, Stefan Hajnoczi

Select CPUs from the IRQ's NUMA node in preference over other CPUs. This
ensures that managed IRQs are assigned to the same NUMA node as the
device.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/linux/irq.h           |  2 +-
 arch/x86/kernel/apic/vector.c |  3 ++-
 kernel/irq/matrix.c           | 16 ++++++++++++----
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 8d5bc2c237d7..bdc3faa3c280 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -1202,7 +1202,7 @@ void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit, bool repla
 int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk);
 void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk);
 int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk,
-				unsigned int *mapped_cpu);
+			     int node, unsigned int *mapped_cpu);
 void irq_matrix_reserve(struct irq_matrix *m);
 void irq_matrix_remove_reserved(struct irq_matrix *m);
 int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 67768e54438b..8eb10b0d981d 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -309,6 +309,7 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest)
 {
 	const struct cpumask *affmsk = irq_data_get_affinity_mask(irqd);
 	struct apic_chip_data *apicd = apic_chip_data(irqd);
+	int node = irq_data_get_node(irqd);
 	int vector, cpu;
 
 	cpumask_and(vector_searchmask, dest, affmsk);
@@ -317,7 +318,7 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest)
 	if (apicd->vector && cpumask_test_cpu(apicd->cpu, vector_searchmask))
 		return 0;
 	vector = irq_matrix_alloc_managed(vector_matrix, vector_searchmask,
-					  &cpu);
+					  node, &cpu);
 	trace_vector_alloc_managed(irqd->irq, vector, vector);
 	if (vector < 0)
 		return vector;
diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index 30cc217b8631..ee35b6172b64 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -148,7 +148,8 @@ static unsigned int matrix_find_best_cpu(struct irq_matrix *m,
 
 /* Find the best CPU which has the lowest number of managed IRQs allocated */
 static unsigned int matrix_find_best_cpu_managed(struct irq_matrix *m,
-						const struct cpumask *msk)
+						 const struct cpumask *msk,
+						 int node)
 {
 	unsigned int cpu, best_cpu, allocated = UINT_MAX;
 	struct cpumap *cm;
@@ -156,6 +157,9 @@ static unsigned int matrix_find_best_cpu_managed(struct irq_matrix *m,
 	best_cpu = UINT_MAX;
 
 	for_each_cpu(cpu, msk) {
+		if (node != NUMA_NO_NODE && cpu_to_node(cpu) != node)
+			continue;
+
 		cm = per_cpu_ptr(m->maps, cpu);
 
 		if (!cm->online || cm->managed_allocated > allocated)
@@ -280,10 +284,12 @@ void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk)
 /**
  * irq_matrix_alloc_managed - Allocate a managed interrupt in a CPU map
  * @m:		Matrix pointer
- * @cpu:	On which CPU the interrupt should be allocated
+ * @mask:	The mask of CPUs on which the interrupt can be allocated
+ * @node:	The preferred NUMA node
+ * @mapped_cpu:	The resulting CPU on which the interrupt should be allocated
  */
 int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk,
-			     unsigned int *mapped_cpu)
+			     int node, unsigned int *mapped_cpu)
 {
 	unsigned int bit, cpu, end = m->alloc_end;
 	struct cpumap *cm;
@@ -291,7 +297,9 @@ int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk,
 	if (cpumask_empty(msk))
 		return -EINVAL;
 
-	cpu = matrix_find_best_cpu_managed(m, msk);
+	cpu = matrix_find_best_cpu_managed(m, msk, node);
+	if (cpu == UINT_MAX)
+		cpu = matrix_find_best_cpu_managed(m, msk, NUMA_NO_NODE);
 	if (cpu == UINT_MAX)
 		return -ENOSPC;
 
-- 
2.26.2


