Linux-PCI Archive on lore.kernel.org
* [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack
@ 2019-11-06 11:40 Jon Derrick
  2019-11-06 11:40 ` [PATCH 1/3] PCI: vmd: Reduce VMD vectors using NVMe calculation Jon Derrick
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Jon Derrick @ 2019-11-06 11:40 UTC (permalink / raw)
  To: Lorenzo Pieralisi; +Cc: Keith Busch, Bjorn Helgaas, linux-pci, Jon Derrick

This patchset optimizes VMD performance through the storage stack by locating
commonly-affined NVMe interrupts on the same VMD interrupt handler lists.

The current strategy of round-robin assignment to VMD IRQ lists can be
suboptimal when vectors with different affinities are assigned to the same VMD
IRQ list. VMD is an NVMe storage domain and this set aligns the vector
allocation and affinity strategy with that of the NVMe driver. This invokes the
kernel to do the right thing when affining NVMe submission cpus to NVMe
completion vectors as serviced through the VMD interrupt handler lists.

This set greatly reduced tail latency when testing 8 threads of random 4k reads
against two drives at queue depth=128. After pinning the tasks to reduce test
variability, the tests also showed a moderate tail latency reduction. A
one-drive configuration also shows improvements due to the alignment of VMD IRQ
list affinities with NVMe affinities.

An example with two NVMe drives and a 33-vector VMD:
VMD irq[42]  Affinity[0-27,56-83]   Effective[10]
VMD irq[43]  Affinity[28-29,84-85]  Effective[85]
VMD irq[44]  Affinity[30-31,86-87]  Effective[87]
VMD irq[45]  Affinity[32-33,88-89]  Effective[89]
VMD irq[46]  Affinity[34-35,90-91]  Effective[91]
VMD irq[47]  Affinity[36-37,92-93]  Effective[93]
VMD irq[48]  Affinity[38-39,94-95]  Effective[95]
VMD irq[49]  Affinity[40-41,96-97]  Effective[97]
VMD irq[50]  Affinity[42-43,98-99]  Effective[99]
VMD irq[51]  Affinity[44-45,100]    Effective[100]
VMD irq[52]  Affinity[46-47,102]    Effective[102]
VMD irq[53]  Affinity[48-49,104]    Effective[104]
VMD irq[54]  Affinity[50-51,106]    Effective[106]
VMD irq[55]  Affinity[52-53,108]    Effective[108]
VMD irq[56]  Affinity[54-55,110]    Effective[110]
VMD irq[57]  Affinity[101,103,105]  Effective[105]
VMD irq[58]  Affinity[107,109,111]  Effective[111]
VMD irq[59]  Affinity[0-1,56-57]    Effective[57]
VMD irq[60]  Affinity[2-3,58-59]    Effective[59]
VMD irq[61]  Affinity[4-5,60-61]    Effective[61]
VMD irq[62]  Affinity[6-7,62-63]    Effective[63]
VMD irq[63]  Affinity[8-9,64-65]    Effective[65]
VMD irq[64]  Affinity[10-11,66-67]  Effective[67]
VMD irq[65]  Affinity[12-13,68-69]  Effective[69]
VMD irq[66]  Affinity[14-15,70-71]  Effective[71]
VMD irq[67]  Affinity[16-17,72]     Effective[72]
VMD irq[68]  Affinity[18-19,74]     Effective[74]
VMD irq[69]  Affinity[20-21,76]     Effective[76]
VMD irq[70]  Affinity[22-23,78]     Effective[78]
VMD irq[71]  Affinity[24-25,80]     Effective[80]
VMD irq[72]  Affinity[26-27,82]     Effective[82]
VMD irq[73]  Affinity[73,75,77]     Effective[77]
VMD irq[74]  Affinity[79,81,83]     Effective[83]

nvme0n1q1   MQ CPUs[28, 29, 84, 85]
nvme0n1q2   MQ CPUs[30, 31, 86, 87]
nvme0n1q3   MQ CPUs[32, 33, 88, 89]
nvme0n1q4   MQ CPUs[34, 35, 90, 91]
nvme0n1q5   MQ CPUs[36, 37, 92, 93]
nvme0n1q6   MQ CPUs[38, 39, 94, 95]
nvme0n1q7   MQ CPUs[40, 41, 96, 97]
nvme0n1q8   MQ CPUs[42, 43, 98, 99]
nvme0n1q9   MQ CPUs[44, 45, 100]
nvme0n1q10  MQ CPUs[46, 47, 102]
nvme0n1q11  MQ CPUs[48, 49, 104]
nvme0n1q12  MQ CPUs[50, 51, 106]
nvme0n1q13  MQ CPUs[52, 53, 108]
nvme0n1q14  MQ CPUs[54, 55, 110]
nvme0n1q15  MQ CPUs[101, 103, 105]
nvme0n1q16  MQ CPUs[107, 109, 111]
nvme0n1q17  MQ CPUs[0, 1, 56, 57]
nvme0n1q18  MQ CPUs[2, 3, 58, 59]
nvme0n1q19  MQ CPUs[4, 5, 60, 61]
nvme0n1q20  MQ CPUs[6, 7, 62, 63]
nvme0n1q21  MQ CPUs[8, 9, 64, 65]
nvme0n1q22  MQ CPUs[10, 11, 66, 67]
nvme0n1q23  MQ CPUs[12, 13, 68, 69]
nvme0n1q24  MQ CPUs[14, 15, 70, 71]
nvme0n1q25  MQ CPUs[16, 17, 72]
nvme0n1q26  MQ CPUs[18, 19, 74]
nvme0n1q27  MQ CPUs[20, 21, 76]
nvme0n1q28  MQ CPUs[22, 23, 78]
nvme0n1q29  MQ CPUs[24, 25, 80]
nvme0n1q30  MQ CPUs[26, 27, 82]
nvme0n1q31  MQ CPUs[73, 75, 77]
nvme0n1q32  MQ CPUs[79, 81, 83]

nvme1n1q1   MQ CPUs[28, 29, 84, 85]
nvme1n1q2   MQ CPUs[30, 31, 86, 87]
nvme1n1q3   MQ CPUs[32, 33, 88, 89]
nvme1n1q4   MQ CPUs[34, 35, 90, 91]
nvme1n1q5   MQ CPUs[36, 37, 92, 93]
nvme1n1q6   MQ CPUs[38, 39, 94, 95]
nvme1n1q7   MQ CPUs[40, 41, 96, 97]
nvme1n1q8   MQ CPUs[42, 43, 98, 99]
nvme1n1q9   MQ CPUs[44, 45, 100]
nvme1n1q10  MQ CPUs[46, 47, 102]
nvme1n1q11  MQ CPUs[48, 49, 104]
nvme1n1q12  MQ CPUs[50, 51, 106]
nvme1n1q13  MQ CPUs[52, 53, 108]
nvme1n1q14  MQ CPUs[54, 55, 110]
nvme1n1q15  MQ CPUs[101, 103, 105]
nvme1n1q16  MQ CPUs[107, 109, 111]
nvme1n1q17  MQ CPUs[0, 1, 56, 57]
nvme1n1q18  MQ CPUs[2, 3, 58, 59]
nvme1n1q19  MQ CPUs[4, 5, 60, 61]
nvme1n1q20  MQ CPUs[6, 7, 62, 63]
nvme1n1q21  MQ CPUs[8, 9, 64, 65]
nvme1n1q22  MQ CPUs[10, 11, 66, 67]
nvme1n1q23  MQ CPUs[12, 13, 68, 69]
nvme1n1q24  MQ CPUs[14, 15, 70, 71]
nvme1n1q25  MQ CPUs[16, 17, 72]
nvme1n1q26  MQ CPUs[18, 19, 74]
nvme1n1q27  MQ CPUs[20, 21, 76]
nvme1n1q28  MQ CPUs[22, 23, 78]
nvme1n1q29  MQ CPUs[24, 25, 80]
nvme1n1q30  MQ CPUs[26, 27, 82]
nvme1n1q31  MQ CPUs[73, 75, 77]
nvme1n1q32  MQ CPUs[79, 81, 83]


This patchset applies after the VMD IRQ List indirection patch:
https://lore.kernel.org/linux-pci/1572527333-6212-1-git-send-email-jonathan.derrick@intel.com/

Jon Derrick (3):
  PCI: vmd: Reduce VMD vectors using NVMe calculation
  PCI: vmd: Align IRQ lists with child device vectors
  PCI: vmd: Use managed irq affinities

 drivers/pci/controller/vmd.c | 90 +++++++++++++++++++-------------------------
 1 file changed, 39 insertions(+), 51 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/3] PCI: vmd: Reduce VMD vectors using NVMe calculation
  2019-11-06 11:40 [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack Jon Derrick
@ 2019-11-06 11:40 ` Jon Derrick
  2019-11-06 18:02   ` Keith Busch
  2019-11-06 11:40 ` [PATCH 2/3] PCI: vmd: Align IRQ lists with child device vectors Jon Derrick
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Jon Derrick @ 2019-11-06 11:40 UTC (permalink / raw)
  To: Lorenzo Pieralisi; +Cc: Keith Busch, Bjorn Helgaas, linux-pci, Jon Derrick

In order to better affine VMD IRQs, VMD IRQ lists, and child NVMe
devices, reduce the number of VMD vectors exposed to the MSI domain
using the same calculation as NVMe. VMD will still retain one vector for
pciehp and other non-NVMe interrupts; the remaining vectors will match
the maximum number of NVMe child device IO vectors.

Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
---
 drivers/pci/controller/vmd.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 8bce647..ebe7ff6 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -260,9 +260,20 @@ static int vmd_msi_prepare(struct irq_domain *domain, struct device *dev,
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct vmd_dev *vmd = vmd_from_bus(pdev->bus);
+	int max_vectors;
 
-	if (nvec > vmd->msix_count)
-		return vmd->msix_count;
+	/*
+	 * VMD exists primarily as an NVMe storage domain. It thus makes sense
+	 * to reduce the number of VMD vectors exposed to child devices using
+	 * the same calculation as the NVMe driver. This allows better affinity
+	 * matching along the entire stack when multiple device vectors share
+	 * VMD IRQ lists. One additional VMD vector is reserved for pciehp and
+	 * non-NVMe interrupts, and NVMe Admin Queue interrupts can also be
+	 * placed on this slow interrupt.
+	 */
+	max_vectors = min_t(int, vmd->msix_count, num_possible_cpus() + 1);
+	if (nvec > max_vectors)
+		return max_vectors;
 
 	memset(arg, 0, sizeof(*arg));
 	return 0;
-- 
1.8.3.1



* [PATCH 2/3] PCI: vmd: Align IRQ lists with child device vectors
  2019-11-06 11:40 [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack Jon Derrick
  2019-11-06 11:40 ` [PATCH 1/3] PCI: vmd: Reduce VMD vectors using NVMe calculation Jon Derrick
@ 2019-11-06 11:40 ` Jon Derrick
  2019-11-06 18:06   ` Keith Busch
  2019-11-06 11:40 ` [PATCH 3/3] PCI: vmd: Use managed irq affinities Jon Derrick
  2019-11-07  9:39 ` [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack Christoph Hellwig
  3 siblings, 1 reply; 21+ messages in thread
From: Jon Derrick @ 2019-11-06 11:40 UTC (permalink / raw)
  To: Lorenzo Pieralisi; +Cc: Keith Busch, Bjorn Helgaas, linux-pci, Jon Derrick

In order to provide better affinity alignment along the entire storage
stack, VMD IRQ lists can be assigned such that the underlying IRQ can be
affinitized the same as the child (NVMe) device.

This patch changes the assignment of child device vectors in IRQ lists
from a round-robin strategy to a matching-entry strategy. NVMe
affinities are deterministic in a VMD domain when these devices have the
same vector count as limited by the VMD MSI domain or cpu count. When
one or more child devices are attached on a VMD domain, this patch
aligns the NVMe submission-side affinity with the VMD completion-side
affinity as it completes through the VMD IRQ list.

Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
---
 drivers/pci/controller/vmd.c | 57 ++++++++++++++++----------------------------
 1 file changed, 21 insertions(+), 36 deletions(-)

diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index ebe7ff6..7aca925 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -75,13 +75,10 @@ struct vmd_irq {
  * struct vmd_irq_list - list of driver requested IRQs mapping to a VMD vector
  * @irq_list:	the list of irq's the VMD one demuxes to.
  * @srcu:	SRCU struct for local synchronization.
- * @count:	number of child IRQs assigned to this vector; used to track
- *		sharing.
  */
 struct vmd_irq_list {
 	struct list_head	irq_list;
 	struct srcu_struct	srcu;
-	unsigned int		count;
 	unsigned int		index;
 };
 
@@ -184,37 +181,32 @@ static irq_hw_number_t vmd_get_hwirq(struct msi_domain_info *info,
 	return 0;
 }
 
-/*
- * XXX: We can be even smarter selecting the best IRQ once we solve the
- * affinity problem.
- */
 static struct vmd_irq_list *vmd_next_irq(struct vmd_dev *vmd, struct msi_desc *desc)
 {
-	int i, best = 1;
-	unsigned long flags;
-
-	if (vmd->msix_count == 1)
-		return vmd->irqs[0];
-
-	/*
-	 * White list for fast-interrupt handlers. All others will share the
-	 * "slow" interrupt vector.
-	 */
-	switch (msi_desc_to_pci_dev(desc)->class) {
-	case PCI_CLASS_STORAGE_EXPRESS:
-		break;
-	default:
-		return vmd->irqs[0];
+	int entry_nr = desc->msi_attrib.entry_nr;
+
+	if (vmd->msix_count == 1) {
+		entry_nr = 0;
+	} else {
+
+		/*
+		 * White list for fast-interrupt handlers. All others will
+		 * share the "slow" interrupt vector.
+		 */
+		switch (msi_desc_to_pci_dev(desc)->class) {
+		case PCI_CLASS_STORAGE_EXPRESS:
+			break;
+		default:
+			entry_nr = 0;
+		}
 	}
 
-	raw_spin_lock_irqsave(&list_lock, flags);
-	for (i = 1; i < vmd->msix_count; i++)
-		if (vmd->irqs[i]->count < vmd->irqs[best]->count)
-			best = i;
-	vmd->irqs[best]->count++;
-	raw_spin_unlock_irqrestore(&list_lock, flags);
+	if (entry_nr > vmd->msix_count)
+		entry_nr = 0;
 
-	return vmd->irqs[best];
+	dev_dbg(desc->dev, "Entry %d using VMD IRQ list %d/%d\n",
+		desc->msi_attrib.entry_nr, entry_nr, vmd->msix_count - 1);
+	return vmd->irqs[entry_nr];
 }
 
 static int vmd_msi_init(struct irq_domain *domain, struct msi_domain_info *info,
@@ -243,15 +235,8 @@ static void vmd_msi_free(struct irq_domain *domain,
 			struct msi_domain_info *info, unsigned int virq)
 {
 	struct vmd_irq *vmdirq = irq_get_chip_data(virq);
-	unsigned long flags;
 
 	synchronize_srcu(&vmdirq->irq->srcu);
-
-	/* XXX: Potential optimization to rebalance */
-	raw_spin_lock_irqsave(&list_lock, flags);
-	vmdirq->irq->count--;
-	raw_spin_unlock_irqrestore(&list_lock, flags);
-
 	kfree(vmdirq);
 }
 
-- 
1.8.3.1



* [PATCH 3/3] PCI: vmd: Use managed irq affinities
  2019-11-06 11:40 [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack Jon Derrick
  2019-11-06 11:40 ` [PATCH 1/3] PCI: vmd: Reduce VMD vectors using NVMe calculation Jon Derrick
  2019-11-06 11:40 ` [PATCH 2/3] PCI: vmd: Align IRQ lists with child device vectors Jon Derrick
@ 2019-11-06 11:40 ` Jon Derrick
  2019-11-06 18:10   ` Keith Busch
  2019-11-07  9:39 ` [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack Christoph Hellwig
  3 siblings, 1 reply; 21+ messages in thread
From: Jon Derrick @ 2019-11-06 11:40 UTC (permalink / raw)
  To: Lorenzo Pieralisi; +Cc: Keith Busch, Bjorn Helgaas, linux-pci, Jon Derrick

Using managed IRQ affinities sets up the VMD affinities identically to
the child devices when those devices' vector counts are limited by VMD.
This promotes better affinity handling as interrupts won't necessarily
need to pass context between non-local CPUs. One pre-vector is reserved
for the slow interrupt and not considered in the affinity algorithm.

Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
---
 drivers/pci/controller/vmd.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index 7aca925..be92076 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -157,22 +157,11 @@ static void vmd_irq_disable(struct irq_data *data)
 	raw_spin_unlock_irqrestore(&list_lock, flags);
 }
 
-/*
- * XXX: Stubbed until we develop acceptable way to not create conflicts with
- * other devices sharing the same vector.
- */
-static int vmd_irq_set_affinity(struct irq_data *data,
-				const struct cpumask *dest, bool force)
-{
-	return -EINVAL;
-}
-
 static struct irq_chip vmd_msi_controller = {
 	.name			= "VMD-MSI",
 	.irq_enable		= vmd_irq_enable,
 	.irq_disable		= vmd_irq_disable,
 	.irq_compose_msi_msg	= vmd_compose_msi_msg,
-	.irq_set_affinity	= vmd_irq_set_affinity,
 };
 
 static irq_hw_number_t vmd_get_hwirq(struct msi_domain_info *info,
@@ -722,6 +711,9 @@ static irqreturn_t vmd_irq(int irq, void *data)
 static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
 {
 	struct vmd_dev *vmd;
+	struct irq_affinity affd = {
+		.pre_vectors = 1,
+	};
 	int i, err;
 
 	if (resource_size(&dev->resource[VMD_CFGBAR]) < (1 << 20))
@@ -749,8 +741,8 @@ static int vmd_probe(struct pci_dev *dev, const struct pci_device_id *id)
 	if (vmd->msix_count < 0)
 		return -ENODEV;
 
-	vmd->msix_count = pci_alloc_irq_vectors(dev, 1, vmd->msix_count,
-					PCI_IRQ_MSIX);
+	vmd->msix_count = pci_alloc_irq_vectors_affinity(dev, 1, vmd->msix_count,
+					PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
 	if (vmd->msix_count < 0)
 		return vmd->msix_count;
 
-- 
1.8.3.1



* Re: [PATCH 1/3] PCI: vmd: Reduce VMD vectors using NVMe calculation
  2019-11-06 11:40 ` [PATCH 1/3] PCI: vmd: Reduce VMD vectors using NVMe calculation Jon Derrick
@ 2019-11-06 18:02   ` Keith Busch
  2019-11-06 19:51     ` Derrick, Jonathan
  0 siblings, 1 reply; 21+ messages in thread
From: Keith Busch @ 2019-11-06 18:02 UTC (permalink / raw)
  To: Jon Derrick; +Cc: Lorenzo Pieralisi, Bjorn Helgaas, linux-pci

On Wed, Nov 06, 2019 at 04:40:06AM -0700, Jon Derrick wrote:
> +	max_vectors = min_t(int, vmd->msix_count, num_possible_cpus() + 1);
> +	if (nvec > max_vectors)
> +		return max_vectors;

If vmd's msix vectors beyond num_possible_cpus() are inaccessible, why
not just limit vmd's msix_count the same way?


* Re: [PATCH 2/3] PCI: vmd: Align IRQ lists with child device vectors
  2019-11-06 11:40 ` [PATCH 2/3] PCI: vmd: Align IRQ lists with child device vectors Jon Derrick
@ 2019-11-06 18:06   ` Keith Busch
  2019-11-06 20:14     ` Derrick, Jonathan
  0 siblings, 1 reply; 21+ messages in thread
From: Keith Busch @ 2019-11-06 18:06 UTC (permalink / raw)
  To: Jon Derrick; +Cc: Lorenzo Pieralisi, Bjorn Helgaas, linux-pci

On Wed, Nov 06, 2019 at 04:40:07AM -0700, Jon Derrick wrote:
> In order to provide better affinity alignment along the entire storage
> stack, VMD IRQ lists can be assigned to in a manner where the underlying
> IRQ can be affinitized the same as the child (NVMe) device.
> 
> This patch changes the assignment of child device vectors in IRQ lists
> from a round-robin strategy to a matching-entry strategy. NVMe
> affinities are deterministic in a VMD domain when these devices have the
> same vector count as limited by the VMD MSI domain or cpu count. When
> one or more child devices are attached on a VMD domain, this patch
> aligns the NVMe submission-side affinity with the VMD completion-side
> affinity as it completes through the VMD IRQ list.

This really only works if the child devices have the same irq count as
the vmd device. If the vmd device has more interrupts than the child
devices, this will end up overloading the lower vmd interrupts without
even using the higher ones.


* Re: [PATCH 3/3] PCI: vmd: Use managed irq affinities
  2019-11-06 11:40 ` [PATCH 3/3] PCI: vmd: Use managed irq affinities Jon Derrick
@ 2019-11-06 18:10   ` Keith Busch
  2019-11-06 20:14     ` Derrick, Jonathan
  0 siblings, 1 reply; 21+ messages in thread
From: Keith Busch @ 2019-11-06 18:10 UTC (permalink / raw)
  To: Jon Derrick; +Cc: Lorenzo Pieralisi, Bjorn Helgaas, linux-pci

On Wed, Nov 06, 2019 at 04:40:08AM -0700, Jon Derrick wrote:
> Using managed IRQ affinities sets up the VMD affinities identically to
> the child devices when those devices' vector counts are limited by VMD.
> This promotes better affinity handling as interrupts won't necessarily
> need to pass context between non-local CPUs. One pre-vector is reserved
> for the slow interrupt and not considered in the affinity algorithm.

This only works if all devices have exactly the same number of interrupts
as the parent VMD host bridge. If a child device has less, the device
will stop working if you offline a cpu: the child device may have a
resource affined to other online cpus, but the VMD device affinity is to
that single offline cpu.


* Re: [PATCH 1/3] PCI: vmd: Reduce VMD vectors using NVMe calculation
  2019-11-06 18:02   ` Keith Busch
@ 2019-11-06 19:51     ` Derrick, Jonathan
  0 siblings, 0 replies; 21+ messages in thread
From: Derrick, Jonathan @ 2019-11-06 19:51 UTC (permalink / raw)
  To: kbusch; +Cc: lorenzo.pieralisi, linux-pci, helgaas

On Thu, 2019-11-07 at 03:02 +0900, Keith Busch wrote:
> On Wed, Nov 06, 2019 at 04:40:06AM -0700, Jon Derrick wrote:
> > +	max_vectors = min_t(int, vmd->msix_count, num_possible_cpus() + 1);
> > +	if (nvec > max_vectors)
> > +		return max_vectors;
> 
> If vmd's msix vectors beyond num_possible_cpus() are inaccessible, why
> not just limit vmd's msix_count the same way?

Yes I could probably do that instead


* Re: [PATCH 2/3] PCI: vmd: Align IRQ lists with child device vectors
  2019-11-06 18:06   ` Keith Busch
@ 2019-11-06 20:14     ` Derrick, Jonathan
  0 siblings, 0 replies; 21+ messages in thread
From: Derrick, Jonathan @ 2019-11-06 20:14 UTC (permalink / raw)
  To: kbusch; +Cc: lorenzo.pieralisi, linux-pci, helgaas

On Thu, 2019-11-07 at 03:06 +0900, Keith Busch wrote:
> On Wed, Nov 06, 2019 at 04:40:07AM -0700, Jon Derrick wrote:
> > In order to provide better affinity alignment along the entire storage
> > stack, VMD IRQ lists can be assigned to in a manner where the underlying
> > IRQ can be affinitized the same as the child (NVMe) device.
> > 
> > This patch changes the assignment of child device vectors in IRQ lists
> > from a round-robin strategy to a matching-entry strategy. NVMe
> > affinities are deterministic in a VMD domain when these devices have the
> > same vector count as limited by the VMD MSI domain or cpu count. When
> > one or more child devices are attached on a VMD domain, this patch
> > aligns the NVMe submission-side affinity with the VMD completion-side
> > affinity as it completes through the VMD IRQ list.
> 
> This really only works if the child devices have the same irq count as
> the vmd device. If the vmd device has more interrupts than the child
> devices, this will end up overloading the lower vmd interrupts without
> even using the higher ones.

Correct. The child NVMe device would need to have the same or more than
the 32 IO vectors VMD offers. We could do something dynamically to
determine when to do matching-affinities vs round-robin, but as this is
a hotpluggable domain it seems fragile to be changing interrupts in
such a way.

I haven't actually seen an NVMe device with fewer than 32 vectors, and
overloading VMD vectors seems to be the least of the concerns of
performance with such a device. This configuration will result in what
is essentially the same issue we're facing today with poorly affined
VMD IRQ lists.

For the future VMD implementation offering 63 IO vectors, yes this will
be a concern and all I can really suggest is to use drives with more
vectors until I can determine a good way to handle this.


* Re: [PATCH 3/3] PCI: vmd: Use managed irq affinities
  2019-11-06 18:10   ` Keith Busch
@ 2019-11-06 20:14     ` Derrick, Jonathan
  2019-11-06 20:27       ` Keith Busch
  0 siblings, 1 reply; 21+ messages in thread
From: Derrick, Jonathan @ 2019-11-06 20:14 UTC (permalink / raw)
  To: kbusch; +Cc: lorenzo.pieralisi, linux-pci, helgaas

On Thu, 2019-11-07 at 03:10 +0900, Keith Busch wrote:
> On Wed, Nov 06, 2019 at 04:40:08AM -0700, Jon Derrick wrote:
> > Using managed IRQ affinities sets up the VMD affinities identically to
> > the child devices when those devices' vector counts are limited by VMD.
> > This promotes better affinity handling as interrupts won't necessarily
> > need to pass context between non-local CPUs. One pre-vector is reserved
> > for the slow interrupt and not considered in the affinity algorithm.
> 
> This only works if all devices have exactly the same number of interrupts
> as the parent VMD host bridge. If a child device has less, the device
> will stop working if you offline a cpu: the child device may have a
> resource affined to other online cpus, but the VMD device affinity is to
> that single offline cpu.

Yes that problem exists today and this set limits the exposure as it's
a rare case where you have a child NVMe device with fewer than 32
vectors.


* Re: [PATCH 3/3] PCI: vmd: Use managed irq affinities
  2019-11-06 20:14     ` Derrick, Jonathan
@ 2019-11-06 20:27       ` Keith Busch
  2019-11-06 20:33         ` Derrick, Jonathan
  0 siblings, 1 reply; 21+ messages in thread
From: Keith Busch @ 2019-11-06 20:27 UTC (permalink / raw)
  To: Derrick, Jonathan; +Cc: lorenzo.pieralisi, linux-pci, helgaas

On Wed, Nov 06, 2019 at 08:14:41PM +0000, Derrick, Jonathan wrote:
> Yes that problem exists today 

Not really, because we're currently using unmanaged interrupts which
migrate to online CPUs. It's only the managed ones that don't migrate
because they have an unchangeable affinity.

> and this set limits the exposure as it's
> a rare case where you have a child NVMe device with fewer than 32
> vectors.

I'm deeply skeptical that is the case. Even the P3700 has only 31 IO
queues, yielding 31 vectors for IO services, so that already won't work
with this scheme.

But assuming you wanted to only use devices that have at least 32 IRQ
vectors, the nvme driver also allows users to carve those vectors up
into fully affinitized sets for different services (read vs. write is
the one the block stack supports), which would also break if alignment
to the parent device's IRQ setup is required.


* Re: [PATCH 3/3] PCI: vmd: Use managed irq affinities
  2019-11-06 20:27       ` Keith Busch
@ 2019-11-06 20:33         ` Derrick, Jonathan
  2019-11-18 10:49           ` Lorenzo Pieralisi
  0 siblings, 1 reply; 21+ messages in thread
From: Derrick, Jonathan @ 2019-11-06 20:33 UTC (permalink / raw)
  To: kbusch; +Cc: lorenzo.pieralisi, linux-pci, helgaas

On Thu, 2019-11-07 at 05:27 +0900, Keith Busch wrote:
> On Wed, Nov 06, 2019 at 08:14:41PM +0000, Derrick, Jonathan wrote:
> > Yes that problem exists today 
> 
> Not really, because we're currently using unamanged interrupts which
> migrate to online CPUs. It's only the managed ones that don't migrate
> because they have a unchangeable affinity.
> 
> > and this set limits the exposure as it's
> > a rare case where you have a child NVMe device with fewer than 32
> > vectors.
> 
> I'm deeply skeptical that is the case. Even the P3700 has only 31 IO
> queues, yielding 31 vectors for IO services, so that already won't work
> with this scheme.
> 
Darn you're right. At one point I had VMD vector 1 using NVMe AQ,
leaving the 31 remaining VMD vectors for NVMe IO queues. This would
have lined up perfectly. Had changed it last minute and that does break
the scheme for P3700....

> But assuming you wanted to only use devices that have at least 32 IRQ
> vectors, the nvme driver also allows users to carve those vectors up
> into fully affinitized sets for different services (read vs. write is
> the one the block stack supports), which would also break if alignment
> to the parent device's IRQ setup is required.

Wasn't aware of this. Fair enough.


* Re: [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack
  2019-11-06 11:40 [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack Jon Derrick
                   ` (2 preceding siblings ...)
  2019-11-06 11:40 ` [PATCH 3/3] PCI: vmd: Use managed irq affinities Jon Derrick
@ 2019-11-07  9:39 ` Christoph Hellwig
  2019-11-07 14:12   ` Derrick, Jonathan
  3 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2019-11-07  9:39 UTC (permalink / raw)
  To: Jon Derrick; +Cc: Lorenzo Pieralisi, Keith Busch, Bjorn Helgaas, linux-pci

On Wed, Nov 06, 2019 at 04:40:05AM -0700, Jon Derrick wrote:
> This patchset optimizes VMD performance through the storage stack by locating
> commonly-affined NVMe interrupts on the same VMD interrupt handler lists.
> 
> The current strategy of round-robin assignment to VMD IRQ lists can be
> suboptimal when vectors with different affinities are assigned to the same VMD
> IRQ list. VMD is an NVMe storage domain and this set aligns the vector
> allocation and affinity strategy with that of the NVMe driver. This invokes the
> kernel to do the right thing when affining NVMe submission cpus to NVMe
> completion vectors as serviced through the VMD interrupt handler lists.
> 
> This set greatly reduced tail latency when testing 8 threads of random 4k reads
> against two drives at queue depth=128. After pinning the tasks to reduce test
> variability, the tests also showed a moderate tail latency reduction. A
> one-drive configuration also shows improvements due to the alignment of VMD IRQ
> list affinities with NVMe affinities.

How does this compare to simply disabling VMD?


* Re: [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack
  2019-11-07  9:39 ` [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack Christoph Hellwig
@ 2019-11-07 14:12   ` Derrick, Jonathan
  2019-11-07 15:37     ` hch
  0 siblings, 1 reply; 21+ messages in thread
From: Derrick, Jonathan @ 2019-11-07 14:12 UTC (permalink / raw)
  To: hch; +Cc: kbusch, lorenzo.pieralisi, linux-pci, helgaas

On Thu, 2019-11-07 at 01:39 -0800, Christoph Hellwig wrote:
> On Wed, Nov 06, 2019 at 04:40:05AM -0700, Jon Derrick wrote:
> > This patchset optimizes VMD performance through the storage stack by locating
> > commonly-affined NVMe interrupts on the same VMD interrupt handler lists.
> > 
> > The current strategy of round-robin assignment to VMD IRQ lists can be
> > suboptimal when vectors with different affinities are assigned to the same VMD
> > IRQ list. VMD is an NVMe storage domain and this set aligns the vector
> > allocation and affinity strategy with that of the NVMe driver. This invokes the
> > kernel to do the right thing when affining NVMe submission cpus to NVMe
> > completion vectors as serviced through the VMD interrupt handler lists.
> > 
> > This set greatly reduced tail latency when testing 8 threads of random 4k reads
> > against two drives at queue depth=128. After pinning the tasks to reduce test
> > variability, the tests also showed a moderate tail latency reduction. A
> > one-drive configuration also shows improvements due to the alignment of VMD IRQ
> > list affinities with NVMe affinities.
> 
> How does this compare to simply disabling VMD?

It's a moot point since Keith pointed out a few flaws with this set;
however, disabling VMD is not an option for users who wish to pass
through VMD.


* Re: [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack
  2019-11-07 14:12   ` Derrick, Jonathan
@ 2019-11-07 15:37     ` hch
  2019-11-07 15:40       ` Derrick, Jonathan
  0 siblings, 1 reply; 21+ messages in thread
From: hch @ 2019-11-07 15:37 UTC (permalink / raw)
  To: Derrick, Jonathan; +Cc: hch, kbusch, lorenzo.pieralisi, linux-pci, helgaas

On Thu, Nov 07, 2019 at 02:12:50PM +0000, Derrick, Jonathan wrote:
> > How does this compare to simply disabling VMD?
> 
> It's a moot point since Keith pointed out a few flaws with this set,
> however disabling VMD is not an option for users who wish to
> passthrough VMD

And why would you ever pass through vmd instead of the actual device?
That just makes things go slower and adds zero value.


* Re: [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack
  2019-11-07 15:37     ` hch
@ 2019-11-07 15:40       ` Derrick, Jonathan
  2019-11-07 15:42         ` hch
  0 siblings, 1 reply; 21+ messages in thread
From: Derrick, Jonathan @ 2019-11-07 15:40 UTC (permalink / raw)
  To: hch; +Cc: kbusch, lorenzo.pieralisi, linux-pci, helgaas

On Thu, 2019-11-07 at 07:37 -0800, hch@infradead.org wrote:
> On Thu, Nov 07, 2019 at 02:12:50PM +0000, Derrick, Jonathan wrote:
> > > How does this compare to simply disabling VMD?
> > 
> > It's a moot point since Keith pointed out a few flaws with this set,
> > however disabling VMD is not an option for users who wish to
> > passthrough VMD
> 
> And why would you ever pass through vmd instead of the actual device?
> That just makes things go slower and adds zero value.

Ability to use physical Root Ports/DSPs/etc in a guest. Slower is
acceptable for many users if it fits within a performance window


* Re: [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack
  2019-11-07 15:40       ` Derrick, Jonathan
@ 2019-11-07 15:42         ` hch
  2019-11-07 15:47           ` Derrick, Jonathan
  0 siblings, 1 reply; 21+ messages in thread
From: hch @ 2019-11-07 15:42 UTC (permalink / raw)
  To: Derrick, Jonathan; +Cc: hch, kbusch, lorenzo.pieralisi, linux-pci, helgaas

On Thu, Nov 07, 2019 at 03:40:15PM +0000, Derrick, Jonathan wrote:
> On Thu, 2019-11-07 at 07:37 -0800, hch@infradead.org wrote:
> > On Thu, Nov 07, 2019 at 02:12:50PM +0000, Derrick, Jonathan wrote:
> > > > How does this compare to simply disabling VMD?
> > > 
> > > It's a moot point since Keith pointed out a few flaws with this set,
> > > however disabling VMD is not an option for users who wish to
> > > pass through VMD
> > 
> > And why would you ever pass through vmd instead of the actual device?
> > That just makes things go slower and adds zero value.
> 
> Ability to use physical Root Ports/DSPs/etc in a guest. Slower is
> acceptable for many users if it fits within a performance window

What is the actual use case?  What does it enable that otherwise doesn't
work and is actually useful?  And real use cases please, not marketing
mumbo jumbo.


* Re: [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack
  2019-11-07 15:42         ` hch
@ 2019-11-07 15:47           ` Derrick, Jonathan
  2019-11-11 17:03             ` hch
  0 siblings, 1 reply; 21+ messages in thread
From: Derrick, Jonathan @ 2019-11-07 15:47 UTC (permalink / raw)
  To: hch; +Cc: kbusch, lorenzo.pieralisi, linux-pci, helgaas

On Thu, 2019-11-07 at 07:42 -0800, hch@infradead.org wrote:
> On Thu, Nov 07, 2019 at 03:40:15PM +0000, Derrick, Jonathan wrote:
> > On Thu, 2019-11-07 at 07:37 -0800, hch@infradead.org wrote:
> > > On Thu, Nov 07, 2019 at 02:12:50PM +0000, Derrick, Jonathan wrote:
> > > > > How does this compare to simply disabling VMD?
> > > > 
> > > > It's a moot point since Keith pointed out a few flaws with this set,
> > > > however disabling VMD is not an option for users who wish to
> > > > pass through VMD
> > > 
> > > And why would you ever pass through vmd instead of the actual device?
> > > That just makes things go slower and adds zero value.
> > 
> > Ability to use physical Root Ports/DSPs/etc in a guest. Slower is
> > acceptable for many users if it fits within a performance window
> 
> What is the actual use case?  What does it enable that otherwise doesn't
> work and is actually useful?  And real use cases please, not marketing
> mumbo jumbo.

A cloud service provider might have several VMs on a single system and
wish to provide surprise hotplug functionality within the guests so
that they don't need to bring the whole server down or migrate VMs in
order to swap disks.


* Re: [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack
  2019-11-07 15:47           ` Derrick, Jonathan
@ 2019-11-11 17:03             ` hch
  0 siblings, 0 replies; 21+ messages in thread
From: hch @ 2019-11-11 17:03 UTC (permalink / raw)
  To: Derrick, Jonathan; +Cc: hch, kbusch, lorenzo.pieralisi, linux-pci, helgaas

On Thu, Nov 07, 2019 at 03:47:09PM +0000, Derrick, Jonathan wrote:
> A cloud service provider might have several VMs on a single system and
> wish to provide surprise hotplug functionality within the guests so
> that they don't need to bring the whole server down or migrate VMs in
> order to swap disks.

And how does the vmd mechanism help with that?  Maybe qemu is missing
a memremap to not access the removed device right now, but adding that
is way simpler than having to deal with a device that makes everyone's
life complicated.



* Re: [PATCH 3/3] PCI: vmd: Use managed irq affinities
  2019-11-06 20:33         ` Derrick, Jonathan
@ 2019-11-18 10:49           ` Lorenzo Pieralisi
  2019-11-18 16:43             ` Derrick, Jonathan
  0 siblings, 1 reply; 21+ messages in thread
From: Lorenzo Pieralisi @ 2019-11-18 10:49 UTC (permalink / raw)
  To: Derrick, Jonathan; +Cc: kbusch, linux-pci, helgaas

On Wed, Nov 06, 2019 at 08:33:01PM +0000, Derrick, Jonathan wrote:
> On Thu, 2019-11-07 at 05:27 +0900, Keith Busch wrote:
> > On Wed, Nov 06, 2019 at 08:14:41PM +0000, Derrick, Jonathan wrote:
> > > Yes, that problem exists today
> > 
> > Not really, because we're currently using unmanaged interrupts which
> > migrate to online CPUs. It's only the managed ones that don't migrate
> > because they have an unchangeable affinity.
> > 
> > > and this set limits the exposure as it's
> > > a rare case where you have a child NVMe device with fewer than 32
> > > vectors.
> > 
> > I'm deeply skeptical that is the case. Even the P3700 has only 31 IO
> > queues, yielding 31 vectors for IO services, so that already won't work
> > with this scheme.
> > 
> Darn you're right. At one point I had VMD vector 1 using NVMe AQ,
> leaving the 31 remaining VMD vectors for NVMe IO queues. This would
> have lined up perfectly. Had changed it last minute and that does break
> the scheme for P3700....
> 
> > But assuming you wanted to only use devices that have at least 32 IRQ
> > vectors, the nvme driver also allows users to carve those vectors up
> > into fully affinitized sets for different services (read vs. write is
> > the one the block stack supports), which would also break if alignment
> > to the parent device's IRQ setup is required.
> 
> Wasn't aware of this. Fair enough.

Marked the series with "Changes Requested", waiting for a new
version.

Lorenzo

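The vector-count mismatch discussed in the exchange above can be shown with a toy model. The numbers come from the thread where quoted (a 33-vector VMD from the cover letter's example, a P3700-class device exposing 31 IO queues); the single reserved VMD vector is an assumption inferred from Jon's remark about vector 1 and the admin queue, and none of this is driver code:

```python
# Toy model of VMD IRQ-list assignment; not driver code.
VMD_VECTORS = 33
VMD_RESERVED = 1                          # vector 0 kept for non-queue events (assumption)
VMD_LISTS = VMD_VECTORS - VMD_RESERVED    # 32 IRQ lists left for child IO vectors

def aligned_mapping(child_io_vectors):
    # 1:1 child-vector -> VMD-list mapping, as the series assumes;
    # impossible when the child exposes fewer IO vectors than lists.
    if child_io_vectors < VMD_LISTS:
        return None                       # Keith's objection: the P3700 lands here
    return {v: v for v in range(VMD_LISTS)}

def round_robin_mapping(child_io_vectors):
    # Current strategy: spread child vectors round-robin, so vectors with
    # different CPU affinities can end up sharing one VMD IRQ list.
    return {v: v % VMD_LISTS for v in range(child_io_vectors)}

assert aligned_mapping(31) is None        # 31 IO queues < 32 lists: scheme breaks
assert aligned_mapping(32) is not None    # needs at least 32 child IO vectors
```

Had VMD vector 1 carried the child's admin queue, as Jon describes, only 31 lists would remain for IO vectors and the P3700's 31 IO queues would have lined up exactly.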

* Re: [PATCH 3/3] PCI: vmd: Use managed irq affinities
  2019-11-18 10:49           ` Lorenzo Pieralisi
@ 2019-11-18 16:43             ` Derrick, Jonathan
  0 siblings, 0 replies; 21+ messages in thread
From: Derrick, Jonathan @ 2019-11-18 16:43 UTC (permalink / raw)
  To: lorenzo.pieralisi; +Cc: kbusch, linux-pci, helgaas

On Mon, 2019-11-18 at 10:49 +0000, Lorenzo Pieralisi wrote:
> On Wed, Nov 06, 2019 at 08:33:01PM +0000, Derrick, Jonathan wrote:
> > On Thu, 2019-11-07 at 05:27 +0900, Keith Busch wrote:
> > > On Wed, Nov 06, 2019 at 08:14:41PM +0000, Derrick, Jonathan wrote:
> > > > Yes, that problem exists today
> > > 
> > > Not really, because we're currently using unmanaged interrupts which
> > > migrate to online CPUs. It's only the managed ones that don't migrate
> > > because they have an unchangeable affinity.
> > > 
> > > > and this set limits the exposure as it's
> > > > a rare case where you have a child NVMe device with fewer than 32
> > > > vectors.
> > > 
> > > I'm deeply skeptical that is the case. Even the P3700 has only 31 IO
> > > queues, yielding 31 vectors for IO services, so that already won't work
> > > with this scheme.
> > > 
> > Darn you're right. At one point I had VMD vector 1 using NVMe AQ,
> > leaving the 31 remaining VMD vectors for NVMe IO queues. This would
> > have lined up perfectly. Had changed it last minute and that does break
> > the scheme for P3700....
> > 
> > > But assuming you wanted to only use devices that have at least 32 IRQ
> > > vectors, the nvme driver also allows users to carve those vectors up
> > > into fully affinitized sets for different services (read vs. write is
> > > the one the block stack supports), which would also break if alignment
> > > to the parent device's IRQ setup is required.
> > 
> > Wasn't aware of this. Fair enough.
> 
> Marked the series with "Changes Requested", waiting for a new
> version.
> 
> Lorenzo

This will need an entirely different strategy to be useful, so it can
be dropped for now.

Thank you,
Jon

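Keith's second objection refers to the nvme driver's ability to carve its IO vectors into separately affinitized queue sets (read vs. write is the split the block stack supports). A rough sketch of the carving arithmetic, with illustrative function and parameter names rather than the driver's own:

```python
# Illustrative split of N IO vectors into two affinitized sets, loosely
# modelled on nvme's read/write queue carving; not driver code.
def carve_sets(total_vectors, read_queues):
    # The first (total - read) vectors serve default/write queues, the
    # remainder serve read queues. In the kernel each set is then
    # affinity-spread across all CPUs independently, so a vector's index
    # alone no longer determines its CPU affinity.
    write = total_vectors - read_queues
    return {"write": list(range(write)),
            "read": list(range(write, total_vectors))}

sets = carve_sets(32, 8)
assert len(sets["write"]) == 24 and len(sets["read"]) == 8
# Because both sets independently cover all CPUs, a fixed index->index
# mapping onto a parent VMD's IRQ lists cannot preserve affinity for
# both sets at once -- which is why alignment to the parent breaks.
```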

end of thread, back to index

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-06 11:40 [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack Jon Derrick
2019-11-06 11:40 ` [PATCH 1/3] PCI: vmd: Reduce VMD vectors using NVMe calculation Jon Derrick
2019-11-06 18:02   ` Keith Busch
2019-11-06 19:51     ` Derrick, Jonathan
2019-11-06 11:40 ` [PATCH 2/3] PCI: vmd: Align IRQ lists with child device vectors Jon Derrick
2019-11-06 18:06   ` Keith Busch
2019-11-06 20:14     ` Derrick, Jonathan
2019-11-06 11:40 ` [PATCH 3/3] PCI: vmd: Use managed irq affinities Jon Derrick
2019-11-06 18:10   ` Keith Busch
2019-11-06 20:14     ` Derrick, Jonathan
2019-11-06 20:27       ` Keith Busch
2019-11-06 20:33         ` Derrick, Jonathan
2019-11-18 10:49           ` Lorenzo Pieralisi
2019-11-18 16:43             ` Derrick, Jonathan
2019-11-07  9:39 ` [PATCH 0/3] PCI: vmd: Reducing tail latency by affining to the storage stack Christoph Hellwig
2019-11-07 14:12   ` Derrick, Jonathan
2019-11-07 15:37     ` hch
2019-11-07 15:40       ` Derrick, Jonathan
2019-11-07 15:42         ` hch
2019-11-07 15:47           ` Derrick, Jonathan
2019-11-11 17:03             ` hch
