linux-kernel.vger.kernel.org archive mirror
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Long Li <longli@microsoft.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Michael Kelley <mikelley@microsoft.com>,
	Sasha Levin <sashal@kernel.org>
Subject: [PATCH 4.20 01/88] genirq/matrix: Improve target CPU selection for managed interrupts.
Date: Mon,  4 Mar 2019 09:21:44 +0100
Message-ID: <20190304081630.672174148@linuxfoundation.org>
In-Reply-To: <20190304081630.610632175@linuxfoundation.org>

4.20-stable review patch.  If anyone has any objections, please let me know.

------------------

[ Upstream commit e8da8794a7fd9eef1ec9a07f0d4897c68581c72b ]

On large systems with multiple devices of the same class (e.g. NVMe disks,
using managed interrupts), the kernel can affinitize these interrupts to a
small subset of CPUs instead of spreading them out evenly.

irq_matrix_alloc_managed() tries to select the CPU in the supplied cpumask
of possible target CPUs which has the lowest number of interrupt vectors
allocated.

This is done by searching for the CPU with the highest number of available
vectors. While this is correct for non-managed interrupts it can select the
wrong CPU for managed interrupts. Under certain constellations this results
in affinitizing the managed interrupts of several devices to a single CPU
in a set.

The bookkeeping of available vectors works the following way:

 1) Non-managed interrupts:

    available is decremented when the interrupt is actually requested by
    the device driver and a vector is assigned. It's incremented when the
    interrupt and the vector are freed.

 2) Managed interrupts:

    Managed interrupts guarantee vector reservation when the MSI/MSI-X
    functionality of a device is enabled, which is achieved by reserving
    vectors in the bitmaps of the possible target CPUs. This reservation
    decrements the available count on each possible target CPU.

    When the interrupt is requested by the device driver then a vector is
    allocated from the reserved region. The operation is reversed when the
    interrupt is freed by the device driver. Neither of these operations
    affects the available count.

    The reservation persists up to the point where the MSI/MSI-X
    functionality is disabled and only this operation increments the
    available count again.
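
The two bookkeeping schemes above can be sketched as a small userspace
model. This is an illustration only, not the code in kernel/irq/matrix.c:
struct toy_cpumap and all toy_*() helpers are hypothetical names, and the
model ignores bitmaps, locking and CPU hotplug.

```c
#include <assert.h>

/* Toy per-CPU counters named after the rows in the changelog tables. */
struct toy_cpumap {
	unsigned int available;	/* free vectors, neither allocated nor reserved */
	unsigned int allocated;	/* vectors currently assigned to interrupts */
	unsigned int managed;	/* vectors reserved for managed interrupts */
};

/* 1) Non-managed: the request itself consumes an available vector. */
static void toy_alloc(struct toy_cpumap *cm)
{
	assert(cm->available > 0);
	cm->available--;
	cm->allocated++;
}

static void toy_free(struct toy_cpumap *cm)
{
	assert(cm->allocated > 0);
	cm->allocated--;
	cm->available++;
}

/* 2) Managed: enabling MSI/MSI-X reserves a vector on a possible target CPU. */
static void toy_reserve_managed(struct toy_cpumap *cm)
{
	assert(cm->available > 0);
	cm->available--;
	cm->managed++;
}

/* Requesting a managed interrupt allocates from the reserved region. */
static void toy_alloc_managed(struct toy_cpumap *cm)
{
	assert(cm->managed > 0);
	cm->allocated++;	/* available is deliberately left alone */
}

static void toy_free_managed(struct toy_cpumap *cm)
{
	assert(cm->allocated > 0);
	cm->allocated--;	/* still no change to available */
}

/* Disabling MSI/MSI-X drops the reservation and returns the vector. */
static void toy_remove_managed(struct toy_cpumap *cm)
{
	assert(cm->managed > 0);
	cm->managed--;
	cm->available++;
}
```

In this model only toy_reserve_managed() and toy_remove_managed() move the
available count for managed interrupts, which is why available says nothing
about how many managed vectors a CPU is actually serving.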

For non-managed interrupts the available count is the correct selection
criterion because the guaranteed reservations need to be taken into
account. Using the allocated counter could lead to a failing allocation in
the following situation (total vector space of 10 assumed):

		 CPU0	CPU1
 available:	    2	   0
 allocated:	    5	   3   <--- CPU1 is selected, but available space = 0
 managed reserved:  3	   7

 while available yields the correct result.
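
A minimal sketch of the failure case in the table above, assuming a
two-CPU toy model. The struct and the toy_*() helpers are hypothetical
and only mimic the counters described in this changelog.

```c
#include <assert.h>
#include <limits.h>

#define TOY_NR_CPUS 2

struct toy_cpumap {
	unsigned int available;
	unsigned int allocated;
	unsigned int managed;	/* managed reserved */
};

/* Wrong criterion for non-managed interrupts: lowest allocated count. */
static unsigned int toy_pick_min_allocated(const struct toy_cpumap *maps)
{
	unsigned int cpu, best_cpu = UINT_MAX, min = UINT_MAX;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (maps[cpu].allocated >= min)
			continue;
		best_cpu = cpu;
		min = maps[cpu].allocated;
	}
	return best_cpu;
}

/* Correct criterion for non-managed interrupts: highest available count. */
static unsigned int toy_pick_max_available(const struct toy_cpumap *maps)
{
	unsigned int cpu, best_cpu = UINT_MAX, maxavl = 0;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (maps[cpu].available <= maxavl)
			continue;
		best_cpu = cpu;
		maxavl = maps[cpu].available;
	}
	return best_cpu;
}

/* Returns 0 on success, -1 if the chosen CPU has no free vector. */
static int toy_try_alloc(struct toy_cpumap *maps, unsigned int cpu)
{
	assert(cpu == UINT_MAX || cpu < TOY_NR_CPUS);
	if (cpu == UINT_MAX || maps[cpu].available == 0)
		return -1;
	maps[cpu].available--;
	maps[cpu].allocated++;
	return 0;
}
```

With the table's values, toy_pick_min_allocated() returns CPU1 and the
allocation fails, while toy_pick_max_available() returns CPU0 and succeeds.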

For managed interrupts the available count is not the appropriate
selection criterion because as explained above the available count is not
affected by the actual vector allocation.

The following example illustrates that. Total vector space of 10
assumed. The starting point is:

		 CPU0	CPU1
 available:	    5	   4
 allocated:	    2	   3
 managed reserved:  3	   3

 Allocating vectors for three non-managed interrupts will result in
 affinitizing the first two to CPU0 and the third one to CPU1 because the
 available count is adjusted with each allocation:

		  CPU0	CPU1
 available:	     5	   4	<- Select CPU0 for 1st allocation
 --> allocated:	     3	   3

 available:	     4	   4	<- Select CPU0 for 2nd allocation
 --> allocated:	     4	   3

 available:	     3	   4	<- Select CPU1 for 3rd allocation
 --> allocated:	     4	   4

 But the allocation of three managed interrupts starting from the same
 point will affinitize all of them to CPU0 because the available count is
 not affected by the allocation (see above). So the end result is:

		  CPU0	CPU1
 available:	     5	   4
 allocated:	     5	   3
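
Both walkthroughs above can be reproduced with a short simulation. This is
a sketch under stated assumptions: a fixed two-CPU array stands in for the
cpumask, all CPUs are treated as online, and every toy_* name is
hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define TOY_NR_CPUS 2

struct toy_cpumap {
	unsigned int available;
	unsigned int allocated;
};

/* Pick the CPU with the most available vectors, UINT_MAX if none has any. */
static unsigned int toy_best_cpu(const struct toy_cpumap *maps)
{
	unsigned int cpu, best_cpu = UINT_MAX, maxavl = 0;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (maps[cpu].available <= maxavl)
			continue;
		best_cpu = cpu;
		maxavl = maps[cpu].available;
	}
	return best_cpu;
}

/* Non-managed allocation: the chosen CPU loses one available vector. */
static unsigned int toy_alloc(struct toy_cpumap *maps)
{
	unsigned int cpu = toy_best_cpu(maps);

	assert(cpu != UINT_MAX);
	maps[cpu].available--;
	maps[cpu].allocated++;
	return cpu;
}

/*
 * Managed allocation before this patch: same selection, but the vector
 * comes from the reserved region, so available never changes and the
 * same CPU keeps winning.
 */
static unsigned int toy_alloc_managed_old(struct toy_cpumap *maps)
{
	unsigned int cpu = toy_best_cpu(maps);

	assert(cpu != UINT_MAX);
	maps[cpu].allocated++;
	return cpu;
}
```

Starting from available 5/4 and allocated 2/3, three toy_alloc() calls
return CPU0, CPU0, CPU1, whereas three toy_alloc_managed_old() calls all
return CPU0 and leave allocated at 5/3.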

Introduce a "managed_allocated" field in struct cpumap to track the vector
allocation for managed interrupts separately. Use this information to
select the target CPU when a vector is allocated for a managed interrupt,
which results in more evenly distributed vector assignments. The above
example results in the following allocations:

		 CPU0	CPU1
 managed_allocated: 0	   0	<- Select CPU0 for 1st allocation
 --> allocated:	    3	   3

 managed_allocated: 1	   0	<- Select CPU1 for 2nd allocation
 --> allocated:	    3	   4

 managed_allocated: 1	   1	<- Select CPU0 for 3rd allocation
 --> allocated:	    4	   4
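
The table above can be reproduced with the following sketch. It is a toy
model with hypothetical toy_* names, not the patched kernel function. One
assumption is made on purpose: ties go to the first CPU (">=") so that the
result matches the table, whereas the loop added by this patch compares
with ">" and therefore prefers the last tied CPU in the mask. Either way
the managed_allocated counts of the candidate CPUs end up at most one
apart.

```c
#include <assert.h>
#include <limits.h>

#define TOY_NR_CPUS 2

struct toy_cpumap {
	unsigned int available;
	unsigned int allocated;
	unsigned int managed_allocated;
};

/* Pick the CPU with the fewest managed vectors allocated (first on ties). */
static unsigned int toy_best_cpu_managed(const struct toy_cpumap *maps)
{
	unsigned int cpu, best_cpu = UINT_MAX, allocated = UINT_MAX;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (maps[cpu].managed_allocated >= allocated)
			continue;
		best_cpu = cpu;
		allocated = maps[cpu].managed_allocated;
	}
	return best_cpu;
}

/* Managed allocation: both counters go up, available stays untouched. */
static unsigned int toy_alloc_managed(struct toy_cpumap *maps)
{
	unsigned int cpu = toy_best_cpu_managed(maps);

	assert(cpu != UINT_MAX);
	maps[cpu].allocated++;
	maps[cpu].managed_allocated++;
	return cpu;
}
```

Starting from available 5/4, allocated 2/3 and managed_allocated 0/0,
three toy_alloc_managed() calls return CPU0, CPU1, CPU0 and leave
allocated at 4/4.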

The allocation of non-managed interrupts is not affected by this change and
is still evaluating the available count.

The overall distribution of interrupt vectors for both types of interrupts
might still not be perfectly even depending on the number of non-managed
and managed interrupts in a system, but due to the reservation guarantee
for managed interrupts this cannot be avoided.

Expose the new field in debugfs as well.

[ tglx: Clarified the background of the problem in the changelog and
  	described it independent of NVME ]

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michael Kelley <mikelley@microsoft.com>
Link: https://lkml.kernel.org/r/20181106040000.27316-1-longli@linuxonhyperv.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 kernel/irq/matrix.c |   34 ++++++++++++++++++++++++++++++----
 1 file changed, 30 insertions(+), 4 deletions(-)

--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -14,6 +14,7 @@ struct cpumap {
 	unsigned int		available;
 	unsigned int		allocated;
 	unsigned int		managed;
+	unsigned int		managed_allocated;
 	bool			initialized;
 	bool			online;
 	unsigned long		alloc_map[IRQ_MATRIX_SIZE];
@@ -145,6 +146,27 @@ static unsigned int matrix_find_best_cpu
 	return best_cpu;
 }
 
+/* Find the best CPU which has the lowest number of managed IRQs allocated */
+static unsigned int matrix_find_best_cpu_managed(struct irq_matrix *m,
+						const struct cpumask *msk)
+{
+	unsigned int cpu, best_cpu, allocated = UINT_MAX;
+	struct cpumap *cm;
+
+	best_cpu = UINT_MAX;
+
+	for_each_cpu(cpu, msk) {
+		cm = per_cpu_ptr(m->maps, cpu);
+
+		if (!cm->online || cm->managed_allocated > allocated)
+			continue;
+
+		best_cpu = cpu;
+		allocated = cm->managed_allocated;
+	}
+	return best_cpu;
+}
+
 /**
  * irq_matrix_assign_system - Assign system wide entry in the matrix
  * @m:		Matrix pointer
@@ -269,7 +291,7 @@ int irq_matrix_alloc_managed(struct irq_
 	if (cpumask_empty(msk))
 		return -EINVAL;
 
-	cpu = matrix_find_best_cpu(m, msk);
+	cpu = matrix_find_best_cpu_managed(m, msk);
 	if (cpu == UINT_MAX)
 		return -ENOSPC;
 
@@ -282,6 +304,7 @@ int irq_matrix_alloc_managed(struct irq_
 		return -ENOSPC;
 	set_bit(bit, cm->alloc_map);
 	cm->allocated++;
+	cm->managed_allocated++;
 	m->total_allocated++;
 	*mapped_cpu = cpu;
 	trace_irq_matrix_alloc_managed(bit, cpu, m, cm);
@@ -395,6 +418,8 @@ void irq_matrix_free(struct irq_matrix *
 
 	clear_bit(bit, cm->alloc_map);
 	cm->allocated--;
+	if (managed)
+		cm->managed_allocated--;
 
 	if (cm->online)
 		m->total_allocated--;
@@ -464,13 +489,14 @@ void irq_matrix_debug_show(struct seq_fi
 	seq_printf(sf, "Total allocated:  %6u\n", m->total_allocated);
 	seq_printf(sf, "System: %u: %*pbl\n", nsys, m->matrix_bits,
 		   m->system_map);
-	seq_printf(sf, "%*s| CPU | avl | man | act | vectors\n", ind, " ");
+	seq_printf(sf, "%*s| CPU | avl | man | mac | act | vectors\n", ind, " ");
 	cpus_read_lock();
 	for_each_online_cpu(cpu) {
 		struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
 
-		seq_printf(sf, "%*s %4d  %4u  %4u  %4u  %*pbl\n", ind, " ",
-			   cpu, cm->available, cm->managed, cm->allocated,
+		seq_printf(sf, "%*s %4d  %4u  %4u  %4u %4u  %*pbl\n", ind, " ",
+			   cpu, cm->available, cm->managed,
+			   cm->managed_allocated, cm->allocated,
 			   m->matrix_bits, cm->alloc_map);
 	}
 	cpus_read_unlock();



