* [PATCH 0/3] Add pci_enable_msi_partial() to conserve MSI-related resources
From: Alexander Gordeev @ 2014-06-10 13:10 UTC
  To: linux-kernel
  Cc: Alexander Gordeev, linux-doc, linux-mips, linuxppc-dev,
	linux-s390, x86, xen-devel, iommu, linux-ide, linux-pci

Add a new pci_enable_msi_partial() interface and use it to
conserve otherwise wasted interrupt resources.

The AHCI driver is the first user. On some Intel chipsets it would
save 10 of the 16 MSI vectors, which would otherwise be allocated
but never used.
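To make the 10-of-16 figure concrete, assume (an inference from 16 - 10, not something stated in this series) that such an AHCI function uses 6 vectors while its MME register must be programmed for 16. Without a partial interface, a driver would have to allocate all 16 vectors in the OS just to get MME = 16. A minimal user-space sketch of that arithmetic; the enum and helper names are hypothetical:

```c
/*
 * Assumed figures for illustration only: vectors the AHCI function
 * actually uses, and the count its MME register must be programmed with.
 */
enum { AHCI_NVEC_USED = 6, AHCI_NVEC_MME = 16 };

/*
 * Hypothetical helper: OS interrupt vectors allocated to reach
 * MME = nvec_mme. Without a partial interface the driver must request
 * nvec_mme vectors; with pci_enable_msi_partial() only nvec are set up.
 */
static int vectors_allocated(int nvec, int nvec_mme, int use_partial)
{
	return use_partial ? nvec : nvec_mme;
}

/* Vectors that would be allocated but never used, and are therefore saved. */
static int vectors_saved(int nvec, int nvec_mme)
{
	return vectors_allocated(nvec, nvec_mme, 0) -
	       vectors_allocated(nvec, nvec_mme, 1);
}
```

With the assumed figures, vectors_saved(AHCI_NVEC_USED, AHCI_NVEC_MME) evaluates to 10.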

Cc: linux-doc@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: x86@kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: iommu@lists.linux-foundation.org
Cc: linux-ide@vger.kernel.org
Cc: linux-pci@vger.kernel.org

Alexander Gordeev (3):
  PCI/MSI: Add pci_enable_msi_partial()
  PCI/MSI/x86: Support pci_enable_msi_partial()
  AHCI: Use pci_enable_msi_partial() to conserve on 10/16 MSIs

 Documentation/PCI/MSI-HOWTO.txt |   36 ++++++++++++++--
 arch/mips/pci/msi-octeon.c      |    2 +-
 arch/powerpc/kernel/msi.c       |    4 +-
 arch/s390/pci/pci.c             |    2 +-
 arch/x86/include/asm/pci.h      |    3 +-
 arch/x86/include/asm/x86_init.h |    3 +-
 arch/x86/kernel/apic/io_apic.c  |    2 +-
 arch/x86/kernel/x86_init.c      |    4 +-
 arch/x86/pci/xen.c              |    9 +++-
 drivers/ata/ahci.c              |    4 +-
 drivers/iommu/irq_remapping.c   |   10 ++--
 drivers/pci/msi.c               |   83 ++++++++++++++++++++++++++++++++++-----
 include/linux/msi.h             |    5 +-
 include/linux/pci.h             |    3 +
 14 files changed, 134 insertions(+), 36 deletions(-)

-- 
1.7.7.6

* [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
From: Alexander Gordeev @ 2014-06-10 13:10 UTC
  To: linux-kernel
  Cc: Alexander Gordeev, linux-doc, linux-mips, linuxppc-dev,
	linux-s390, x86, xen-devel, iommu, linux-ide, linux-pci

Some PCI devices require a particular value to be written to the
Multiple Message Enable (MME) register, even though the number of
MSI vectors they actually use, 'nvec', rounded up to a power of 2,
is less than that MME value:

	roundup_pow_of_two(nvec) < 'Multiple Message Enable'

However, the existing pci_enable_msi_block() interface cannot
configure such devices, since the value it writes to the MME
register is derived from the number of requested MSIs 'nvec':

	'Multiple Message Enable' = roundup_pow_of_two(nvec)

In this case the value written to the MME register may not satisfy
the device's requirement, and the PCI function will therefore not
operate in the desired mode.
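The mismatch follows from how the MME field is encoded: bits 6:4 of the MSI Message Control register hold log2 of the number of enabled messages (0 for 1 message up to 5 for 32), so only powers of two can be expressed. The user-space sketch below uses hypothetical helpers standing in for the kernel's roundup_pow_of_two() and ilog2(), and assumes a device that uses 6 vectors but requires MME to be programmed for 16:

```c
#include <stdint.h>

/*
 * Mask of the Multiple Message Enable field (bits 6:4) in the MSI
 * Message Control register; same value as PCI_MSI_FLAGS_QSIZE.
 */
#define MSI_FLAGS_QSIZE 0x70u

/* Stand-in for roundup_pow_of_two(); assumes n >= 1. */
static unsigned int roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* log2 of a power of two, i.e. the MME encoding of 'count' messages. */
static unsigned int mme_encoding(unsigned int count)
{
	unsigned int log = 0;

	while (count > 1) {
		count >>= 1;
		log++;
	}
	return log;
}

/* Return 'control' with its MME field set for 'count' messages. */
static uint16_t set_mme(uint16_t control, unsigned int count)
{
	control = (uint16_t)(control & ~MSI_FLAGS_QSIZE);
	return (uint16_t)(control | (mme_encoding(count) << 4));
}
```

For nvec = 6, set_mme(0, roundup_pow2(6)) yields 0x30 (8 messages), whereas the assumed device needs 0x40 (16 messages).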

This update introduces pci_enable_msi_partial(), an extension of the
pci_enable_msi_block() interface that accepts an extra 'nvec_mme'
argument. That value is written to the MME register, while 'nvec'
is still used to set up as many interrupts as requested.
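The argument checks that pci_enable_msi_partial() performs before touching the hardware can be modelled in user space. In this hedged sketch, check_partial_args() is a hypothetical name, and its msi_vec_count argument stands in for the value the patch obtains from pci_msi_vec_count(dev):

```c
#include <errno.h>
#include <stdbool.h>

/* User-space stand-in for the kernel's is_power_of_2(). */
static bool is_pow2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical model of the argument validation in
 * pci_enable_msi_partial(), in the same order as the patch.
 * 'msi_vec_count' stands in for pci_msi_vec_count(dev): the number of
 * vectors the device advertises, or a negative errno.
 */
static int check_partial_args(int nvec, int nvec_mme, int msi_vec_count)
{
	/* MME can only encode powers of two */
	if (!is_pow2((unsigned long)nvec_mme))
		return -EINVAL;
	/* Cannot set up more interrupts than MME enables */
	if (nvec > nvec_mme)
		return -EINVAL;
	/* Propagate errors from the capability lookup */
	if (msi_vec_count < 0)
		return msi_vec_count;
	/* Device must advertise at least nvec_mme vectors */
	if (nvec_mme > msi_vec_count)
		return -EINVAL;
	return 0;
}
```

For the assumed AHCI case, check_partial_args(6, 16, 16) passes, while check_partial_args(6, 12, 16) is rejected because 12 is not a power of two. The power-state, MSI-X and architecture checks in the real function are not modelled here.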

As a result of this change, the architecture-specific callbacks
arch_msi_check_device() and arch_setup_msi_irqs() also get an extra
'nvec_mme' parameter, which is ignored for now. This update is
therefore a placeholder for architectures that wish to support
pci_enable_msi_partial() in the future.

Cc: linux-doc@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: x86@kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: iommu@lists.linux-foundation.org
Cc: linux-ide@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 Documentation/PCI/MSI-HOWTO.txt |   36 ++++++++++++++--
 arch/mips/pci/msi-octeon.c      |    2 +-
 arch/powerpc/kernel/msi.c       |    4 +-
 arch/s390/pci/pci.c             |    2 +-
 arch/x86/kernel/x86_init.c      |    2 +-
 drivers/pci/msi.c               |   83 ++++++++++++++++++++++++++++++++++-----
 include/linux/msi.h             |    5 +-
 include/linux/pci.h             |    3 +
 8 files changed, 115 insertions(+), 22 deletions(-)

diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
index 10a9369..c8a8503 100644
--- a/Documentation/PCI/MSI-HOWTO.txt
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -195,14 +195,40 @@ By contrast with pci_enable_msi_range() function, pci_enable_msi_exact()
 returns zero in case of success, which indicates MSI interrupts have been
 successfully allocated.
 
-4.2.4 pci_disable_msi
+4.2.4 pci_enable_msi_partial
+
+int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
+
+This variation on pci_enable_msi_exact() allows a device driver to
+configure the PCI function with 'nvec_mme' multiple MSIs, while setting
+up only 'nvec' MSIs (which may be fewer than 'nvec_mme') in the operating
+system. The MSI specification only allows 'nvec_mme' to be allocated in
+powers of two, up to a maximum of 2^5 (32).
+
+This function can be used when a PCI function is known to send only
+'nvec' MSIs but still requires a particular number of MSIs, 'nvec_mme',
+to be configured. As a result, the 'nvec_mme' - 'nvec' unused MSIs do
+not waste system resources.
+
+If this function returns 0, it has succeeded in configuring 'nvec_mme'
+MSIs with the PCI function and setting up 'nvec' interrupts. In this
+case, the function enables MSI on this device and updates dev->irq to be
+the lowest of the new interrupts assigned to it.  The other interrupts
+assigned to the device are in the range dev->irq to dev->irq + nvec - 1.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to request any more MSI interrupts for
+this device.
+
+4.2.5 pci_disable_msi
 
 void pci_disable_msi(struct pci_dev *dev)
 
-This function should be used to undo the effect of pci_enable_msi_range().
-Calling it restores dev->irq to the pin-based interrupt number and frees
-the previously allocated MSIs.  The interrupts may subsequently be assigned
-to another device, so drivers should not cache the value of dev->irq.
+This function should be used to undo the effect of pci_enable_msi_range()
+or pci_enable_msi_partial(). Calling it restores dev->irq to the pin-based
+interrupt number and frees the previously allocated MSIs.  The interrupts
+may subsequently be assigned to another device, so drivers should not cache
+the value of dev->irq.
 
 Before calling this function, a device driver must always call free_irq()
 on any interrupt for which it previously called request_irq().
diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c
index 2b91b0e..2be7979 100644
--- a/arch/mips/pci/msi-octeon.c
+++ b/arch/mips/pci/msi-octeon.c
@@ -178,7 +178,7 @@ msi_irq_allocated:
 	return 0;
 }
 
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	struct msi_desc *entry;
 	int ret;
diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
index 8bbc12d..c60aee3 100644
--- a/arch/powerpc/kernel/msi.c
+++ b/arch/powerpc/kernel/msi.c
@@ -13,7 +13,7 @@
 
 #include <asm/machdep.h>
 
-int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
+int arch_msi_check_device(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
 		pr_debug("msi: Platform doesn't provide MSI callbacks.\n");
@@ -32,7 +32,7 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
         return 0;
 }
 
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	return ppc_md.setup_msi_irqs(dev, nvec, type);
 }
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 9ddc51e..3cf38a8 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -398,7 +398,7 @@ static void zpci_irq_handler(struct airq_struct *airq)
 	}
 }
 
-int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int nvec_mme, int type)
 {
 	struct zpci_dev *zdev = get_zdev(pdev);
 	unsigned int hwirq, msi_vecs;
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index e48b674..b65bf95 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -121,7 +121,7 @@ struct x86_msi_ops x86_msi = {
 };
 
 /* MSI arch specific hooks */
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	return x86_msi.setup_msi_irqs(dev, nvec, type);
 }
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 27a7e67..0410d9b 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -56,7 +56,8 @@ void __weak arch_teardown_msi_irq(unsigned int irq)
 	chip->teardown_irq(chip, irq);
 }
 
-int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
+int __weak arch_msi_check_device(struct pci_dev *dev,
+				 int nvec, int nvec_mme, int type)
 {
 	struct msi_chip *chip = dev->bus->msi;
 
@@ -66,7 +67,8 @@ int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
 	return chip->check_device(chip, dev, nvec, type);
 }
 
-int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int __weak arch_setup_msi_irqs(struct pci_dev *dev,
+			       int nvec, int nvec_mme, int type)
 {
 	struct msi_desc *entry;
 	int ret;
@@ -598,6 +600,7 @@ error_attrs:
  * msi_capability_init - configure device's MSI capability structure
  * @dev: pointer to the pci_dev data structure of MSI device function
  * @nvec: number of interrupts to allocate
+ * @nvec_mme: number of interrupts to write to Multiple Message Enable register
  *
  * Setup the MSI capability structure of the device with the requested
  * number of interrupts.  A return value of zero indicates the successful
@@ -605,7 +608,7 @@ error_attrs:
  * an error, and a positive return value indicates the number of interrupts
  * which could have been allocated.
  */
-static int msi_capability_init(struct pci_dev *dev, int nvec)
+static int msi_capability_init(struct pci_dev *dev, int nvec, int nvec_mme)
 {
 	struct msi_desc *entry;
 	int ret;
@@ -640,7 +643,7 @@ static int msi_capability_init(struct pci_dev *dev, int nvec)
 	list_add_tail(&entry->list, &dev->msi_list);
 
 	/* Configure MSI capability structure */
-	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
+	ret = arch_setup_msi_irqs(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
 	if (ret) {
 		msi_mask_irq(entry, mask, ~mask);
 		free_msi_irqs(dev);
@@ -758,7 +761,8 @@ static int msix_capability_init(struct pci_dev *dev,
 	if (ret)
 		return ret;
 
-	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
+	/* Parameter 'nvec_mme' does not make sense in case of MSI-X */
+	ret = arch_setup_msi_irqs(dev, nvec, 0, PCI_CAP_ID_MSIX);
 	if (ret)
 		goto out_avail;
 
@@ -812,13 +816,15 @@ out_free:
  * pci_msi_check_device - check whether MSI may be enabled on a device
  * @dev: pointer to the pci_dev data structure of MSI device function
  * @nvec: how many MSIs have been requested ?
+ * @nvec_mme: how many MSIs to write to Multiple Message Enable register ?
  * @type: are we checking for MSI or MSI-X ?
  *
  * Look at global flags, the device itself, and its parent buses
  * to determine if MSI/-X are supported for the device. If MSI/-X is
  * supported return 0, else return an error code.
  **/
-static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
+static int pci_msi_check_device(struct pci_dev *dev,
+				int nvec, int nvec_mme, int type)
 {
 	struct pci_bus *bus;
 	int ret;
@@ -846,7 +852,7 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
 		if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
 			return -EINVAL;
 
-	ret = arch_msi_check_device(dev, nvec, type);
+	ret = arch_msi_check_device(dev, nvec, nvec_mme, type);
 	if (ret)
 		return ret;
 
@@ -878,6 +884,62 @@ int pci_msi_vec_count(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pci_msi_vec_count);
 
+/**
+ * pci_enable_msi_partial - configure device's MSI capability structure
+ * @dev: device to configure
+ * @nvec: number of interrupts to configure
+ * @nvec_mme: number of interrupts to write to Multiple Message Enable register
+ *
+ * This function tries to allocate @nvec interrupts while setting up the
+ * device's Multiple Message Enable register with @nvec_mme interrupts.
+ * It returns a negative errno if an error occurs. If it succeeds, it returns
+ * zero and updates the @dev's irq member to the lowest new interrupt number;
+ * the other interrupt numbers allocated to this device are consecutive.
+ */
+int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
+{
+	int maxvec;
+	int rc;
+
+	if (dev->current_state != PCI_D0)
+		return -EINVAL;
+
+	WARN_ON(!!dev->msi_enabled);
+
+	/* Check whether driver already requested MSI-X irqs */
+	if (dev->msix_enabled) {
+		dev_info(&dev->dev, "can't enable MSI "
+			 "(MSI-X already enabled)\n");
+		return -EINVAL;
+	}
+
+	if (!is_power_of_2(nvec_mme))
+		return -EINVAL;
+	if (nvec > nvec_mme)
+		return -EINVAL;
+
+	maxvec = pci_msi_vec_count(dev);
+	if (maxvec < 0)
+		return maxvec;
+	else if (nvec_mme > maxvec)
+		return -EINVAL;
+
+	rc = pci_msi_check_device(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
+	if (rc < 0)
+		return rc;
+	else if (rc > 0)
+		return -ENOSPC;
+
+	rc = msi_capability_init(dev, nvec, nvec_mme);
+	if (rc < 0)
+		return rc;
+	else if (rc > 0)
+		return -ENOSPC;
+
+	return 0;
+}
+EXPORT_SYMBOL(pci_enable_msi_partial);
+
 void pci_msi_shutdown(struct pci_dev *dev)
 {
 	struct msi_desc *desc;
@@ -957,7 +1019,7 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
 	if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
 		return -EINVAL;
 
-	status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
+	status = pci_msi_check_device(dev, nvec, 0, PCI_CAP_ID_MSIX);
 	if (status)
 		return status;
 
@@ -1110,7 +1172,8 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
 		nvec = maxvec;
 
 	do {
-		rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
+		rc = pci_msi_check_device(dev, nvec, roundup_pow_of_two(nvec),
+					  PCI_CAP_ID_MSI);
 		if (rc < 0) {
 			return rc;
 		} else if (rc > 0) {
@@ -1121,7 +1184,7 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
 	} while (rc);
 
 	do {
-		rc = msi_capability_init(dev, nvec);
+		rc = msi_capability_init(dev, nvec, roundup_pow_of_two(nvec));
 		if (rc < 0) {
 			return rc;
 		} else if (rc > 0) {
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 92a2f99..b9f89ee 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -57,9 +57,10 @@ struct msi_desc {
  */
 int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
 void arch_teardown_msi_irq(unsigned int irq);
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type);
 void arch_teardown_msi_irqs(struct pci_dev *dev);
-int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
+int arch_msi_check_device(struct pci_dev *dev,
+			  int nvec, int nvec_mme, int type);
 void arch_restore_msi_irqs(struct pci_dev *dev);
 
 void default_teardown_msi_irqs(struct pci_dev *dev);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 71d9673..7360bd2 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1184,6 +1184,7 @@ void pci_disable_msix(struct pci_dev *dev);
 void msi_remove_pci_irq_vectors(struct pci_dev *dev);
 void pci_restore_msi_state(struct pci_dev *dev);
 int pci_msi_enabled(void);
+int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme);
 int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec);
 static inline int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
 {
@@ -1215,6 +1216,8 @@ static inline void pci_disable_msix(struct pci_dev *dev) { }
 static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) { }
 static inline void pci_restore_msi_state(struct pci_dev *dev) { }
 static inline int pci_msi_enabled(void) { return 0; }
+static inline int pci_enable_msi_partial(struct pci_dev *dev, int nvec,
+					 int nvec_mme) { return -ENOSYS; }
 static inline int pci_enable_msi_range(struct pci_dev *dev, int minvec,
 				       int maxvec)
 { return -ENOSYS; }
-- 
1.7.7.6

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
@ 2014-06-10 13:10   ` Alexander Gordeev
  0 siblings, 0 replies; 76+ messages in thread
From: Alexander Gordeev @ 2014-06-10 13:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mips, linux-s390, linux-pci, x86, linux-doc, linux-ide,
	iommu, Alexander Gordeev, xen-devel, linuxppc-dev

There are PCI devices that require a particular value written
to the Multiple Message Enable (MME) register while aligned on
power of 2 boundary value of actually used MSI vectors 'nvec'
is a lesser of that MME value:

	roundup_pow_of_two(nvec) < 'Multiple Message Enable'

However the existing pci_enable_msi_block() interface is not
able to configure such devices, since the value written to the
MME register is calculated from the number of requested MSIs
'nvec':

	'Multiple Message Enable' = roundup_pow_of_two(nvec)

In this case the result written to the MME register may not
satisfy the aforementioned PCI devices requirement and therefore
the PCI functions will not operate in a desired mode.

This update introduces pci_enable_msi_partial() extension to
pci_enable_msi_block() interface that accepts extra 'nvec_mme'
argument which is then written to MME register while the value
of 'nvec' is still used to setup as many interrupts as requested.

As result of this change, architecture-specific callbacks
arch_msi_check_device() and arch_setup_msi_irqs() get an extra
'nvec_mme' parameter as well, but it is ignored for now.
Therefore, this update is a placeholder for architectures that
wish to support pci_enable_msi_partial() function in the future.

Cc: linux-doc@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: x86@kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: iommu@lists.linux-foundation.org
Cc: linux-ide@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 Documentation/PCI/MSI-HOWTO.txt |   36 ++++++++++++++--
 arch/mips/pci/msi-octeon.c      |    2 +-
 arch/powerpc/kernel/msi.c       |    4 +-
 arch/s390/pci/pci.c             |    2 +-
 arch/x86/kernel/x86_init.c      |    2 +-
 drivers/pci/msi.c               |   83 ++++++++++++++++++++++++++++++++++-----
 include/linux/msi.h             |    5 +-
 include/linux/pci.h             |    3 +
 8 files changed, 115 insertions(+), 22 deletions(-)

diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
index 10a9369..c8a8503 100644
--- a/Documentation/PCI/MSI-HOWTO.txt
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -195,14 +195,40 @@ By contrast with pci_enable_msi_range() function, pci_enable_msi_exact()
 returns zero in case of success, which indicates MSI interrupts have been
 successfully allocated.
 
-4.2.4 pci_disable_msi
+4.2.4 pci_enable_msi_partial
+
+int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
+
+This variation on pci_enable_msi_exact() call allows a device driver to
+setup 'nvec_mme' number of multiple MSIs with the PCI function, while
+setup only 'nvec' (which could be a lesser of 'nvec_mme') number of MSIs
+in operating system. The MSI specification only allows 'nvec_mme' to be
+allocated in powers of two, up to a maximum of 2^5 (32).
+
+This function could be used when a PCI function is known to send 'nvec'
+MSIs, but still requires a particular number of MSIs 'nvec_mme' to be
+initialized with. As result, 'nvec_mme' - 'nvec' number of unused MSIs
+do not waste system resources.
+
+If this function returns 0, it has succeeded in allocating 'nvec_mme'
+interrupts and setting up 'nvec' interrupts. In this case, the function
+enables MSI on this device and updates dev->irq to be the lowest of the
+new interrupts assigned to it.  The other interrupts assigned to the
+device are in the range dev->irq to dev->irq + nvec - 1.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to request any more MSI interrupts for
+this device.
+
+4.2.5 pci_disable_msi
 
 void pci_disable_msi(struct pci_dev *dev)
 
-This function should be used to undo the effect of pci_enable_msi_range().
-Calling it restores dev->irq to the pin-based interrupt number and frees
-the previously allocated MSIs.  The interrupts may subsequently be assigned
-to another device, so drivers should not cache the value of dev->irq.
+This function should be used to undo the effect of pci_enable_msi_range()
+or pci_enable_msi_partial(). Calling it restores dev->irq to the pin-based
+interrupt number and frees the previously allocated MSIs.  The interrupts
+may subsequently be assigned to another device, so drivers should not cache
+the value of dev->irq.
 
 Before calling this function, a device driver must always call free_irq()
 on any interrupt for which it previously called request_irq().
diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c
index 2b91b0e..2be7979 100644
--- a/arch/mips/pci/msi-octeon.c
+++ b/arch/mips/pci/msi-octeon.c
@@ -178,7 +178,7 @@ msi_irq_allocated:
 	return 0;
 }
 
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	struct msi_desc *entry;
 	int ret;
diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
index 8bbc12d..c60aee3 100644
--- a/arch/powerpc/kernel/msi.c
+++ b/arch/powerpc/kernel/msi.c
@@ -13,7 +13,7 @@
 
 #include <asm/machdep.h>
 
-int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
+int arch_msi_check_device(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
 		pr_debug("msi: Platform doesn't provide MSI callbacks.\n");
@@ -32,7 +32,7 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
         return 0;
 }
 
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	return ppc_md.setup_msi_irqs(dev, nvec, type);
 }
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 9ddc51e..3cf38a8 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -398,7 +398,7 @@ static void zpci_irq_handler(struct airq_struct *airq)
 	}
 }
 
-int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int nvec_mme, int type)
 {
 	struct zpci_dev *zdev = get_zdev(pdev);
 	unsigned int hwirq, msi_vecs;
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index e48b674..b65bf95 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -121,7 +121,7 @@ struct x86_msi_ops x86_msi = {
 };
 
 /* MSI arch specific hooks */
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	return x86_msi.setup_msi_irqs(dev, nvec, type);
 }
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 27a7e67..0410d9b 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -56,7 +56,8 @@ void __weak arch_teardown_msi_irq(unsigned int irq)
 	chip->teardown_irq(chip, irq);
 }
 
-int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
+int __weak arch_msi_check_device(struct pci_dev *dev,
+				 int nvec, int nvec_mme, int type)
 {
 	struct msi_chip *chip = dev->bus->msi;
 
@@ -66,7 +67,8 @@ int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
 	return chip->check_device(chip, dev, nvec, type);
 }
 
-int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int __weak arch_setup_msi_irqs(struct pci_dev *dev,
+			       int nvec, int nvec_mme, int type)
 {
 	struct msi_desc *entry;
 	int ret;
@@ -598,6 +600,7 @@ error_attrs:
  * msi_capability_init - configure device's MSI capability structure
  * @dev: pointer to the pci_dev data structure of MSI device function
  * @nvec: number of interrupts to allocate
+ * @nvec_mme: number of interrupts to write to Multiple Message Enable register
  *
  * Setup the MSI capability structure of the device with the requested
  * number of interrupts.  A return value of zero indicates the successful
@@ -605,7 +608,7 @@ error_attrs:
  * an error, and a positive return value indicates the number of interrupts
  * which could have been allocated.
  */
-static int msi_capability_init(struct pci_dev *dev, int nvec)
+static int msi_capability_init(struct pci_dev *dev, int nvec, int nvec_mme)
 {
 	struct msi_desc *entry;
 	int ret;
@@ -640,7 +643,7 @@ static int msi_capability_init(struct pci_dev *dev, int nvec)
 	list_add_tail(&entry->list, &dev->msi_list);
 
 	/* Configure MSI capability structure */
-	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
+	ret = arch_setup_msi_irqs(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
 	if (ret) {
 		msi_mask_irq(entry, mask, ~mask);
 		free_msi_irqs(dev);
@@ -758,7 +761,8 @@ static int msix_capability_init(struct pci_dev *dev,
 	if (ret)
 		return ret;
 
-	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
+	/* Parameter 'nvec_mme' does not make sense in case of MSI-X */
+	ret = arch_setup_msi_irqs(dev, nvec, 0, PCI_CAP_ID_MSIX);
 	if (ret)
 		goto out_avail;
 
@@ -812,13 +816,15 @@ out_free:
  * pci_msi_check_device - check whether MSI may be enabled on a device
  * @dev: pointer to the pci_dev data structure of MSI device function
  * @nvec: how many MSIs have been requested ?
+ * @nvec_mme: how many MSIs write to Multiple Message Enable register ?
  * @type: are we checking for MSI or MSI-X ?
  *
  * Look at global flags, the device itself, and its parent buses
  * to determine if MSI/-X are supported for the device. If MSI/-X is
  * supported return 0, else return an error code.
  **/
-static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
+static int pci_msi_check_device(struct pci_dev *dev,
+				int nvec, int nvec_mme, int type)
 {
 	struct pci_bus *bus;
 	int ret;
@@ -846,7 +852,7 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
 		if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
 			return -EINVAL;
 
-	ret = arch_msi_check_device(dev, nvec, type);
+	ret = arch_msi_check_device(dev, nvec, nvec_mme, type);
 	if (ret)
 		return ret;
 
@@ -878,6 +884,62 @@ int pci_msi_vec_count(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pci_msi_vec_count);
 
+/**
+ * pci_enable_msi_partial - configure device's MSI capability structure
+ * @dev: device to configure
+ * @nvec: number of interrupts to configure
+ * @nvec_mme: number of interrupts to write to Multiple Message Enable register
+ *
+ * This function tries to allocate @nvec number of interrupts while setup
+ * device's Multiple Message Enable register with @nvec_mme interrupts.
+ * It returns a negative errno if an error occurs. If it succeeds, it returns
+ * zero and updates the @dev's irq member to the lowest new interrupt number;
+ * the other interrupt numbers allocated to this device are consecutive.
+ */
+int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
+{
+	int maxvec;
+	int rc;
+
+	if (dev->current_state != PCI_D0)
+		return -EINVAL;
+
+	WARN_ON(!!dev->msi_enabled);
+
+	/* Check whether driver already requested MSI-X irqs */
+	if (dev->msix_enabled) {
+		dev_info(&dev->dev, "can't enable MSI "
+			 "(MSI-X already enabled)\n");
+		return -EINVAL;
+	}
+
+	if (!is_power_of_2(nvec_mme))
+		return -EINVAL;
+	if (nvec > nvec_mme)
+		return -EINVAL;
+
+	maxvec = pci_msi_vec_count(dev);
+	if (maxvec < 0)
+		return maxvec;
+	else if (nvec_mme > maxvec)
+		return -EINVAL;
+
+	rc = pci_msi_check_device(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
+	if (rc < 0)
+		return rc;
+	else if (rc > 0)
+		return -ENOSPC;
+
+	rc = msi_capability_init(dev, nvec, nvec_mme);
+	if (rc < 0)
+		return rc;
+	else if (rc > 0)
+		return -ENOSPC;
+
+	return 0;
+}
+EXPORT_SYMBOL(pci_enable_msi_partial);
+
 void pci_msi_shutdown(struct pci_dev *dev)
 {
 	struct msi_desc *desc;
@@ -957,7 +1019,7 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
 	if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
 		return -EINVAL;
 
-	status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
+	status = pci_msi_check_device(dev, nvec, 0, PCI_CAP_ID_MSIX);
 	if (status)
 		return status;
 
@@ -1110,7 +1172,8 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
 		nvec = maxvec;
 
 	do {
-		rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
+		rc = pci_msi_check_device(dev, nvec, roundup_pow_of_two(nvec),
+					  PCI_CAP_ID_MSI);
 		if (rc < 0) {
 			return rc;
 		} else if (rc > 0) {
@@ -1121,7 +1184,7 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
 	} while (rc);
 
 	do {
-		rc = msi_capability_init(dev, nvec);
+		rc = msi_capability_init(dev, nvec, roundup_pow_of_two(nvec));
 		if (rc < 0) {
 			return rc;
 		} else if (rc > 0) {
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 92a2f99..b9f89ee 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -57,9 +57,10 @@ struct msi_desc {
  */
 int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
 void arch_teardown_msi_irq(unsigned int irq);
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type);
 void arch_teardown_msi_irqs(struct pci_dev *dev);
-int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
+int arch_msi_check_device(struct pci_dev *dev,
+			  int nvec, int nvec_mme, int type);
 void arch_restore_msi_irqs(struct pci_dev *dev);
 
 void default_teardown_msi_irqs(struct pci_dev *dev);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 71d9673..7360bd2 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1184,6 +1184,7 @@ void pci_disable_msix(struct pci_dev *dev);
 void msi_remove_pci_irq_vectors(struct pci_dev *dev);
 void pci_restore_msi_state(struct pci_dev *dev);
 int pci_msi_enabled(void);
+int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme);
 int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec);
 static inline int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
 {
@@ -1215,6 +1216,8 @@ static inline void pci_disable_msix(struct pci_dev *dev) { }
 static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) { }
 static inline void pci_restore_msi_state(struct pci_dev *dev) { }
 static inline int pci_msi_enabled(void) { return 0; }
+static inline int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
+{ return -ENOSYS; }
 static inline int pci_enable_msi_range(struct pci_dev *dev, int minvec,
 				       int maxvec)
 { return -ENOSYS; }
-- 
1.7.7.6

* [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-06-10 13:10 ` Alexander Gordeev
  (?)
  (?)
@ 2014-06-10 13:10 ` Alexander Gordeev
  -1 siblings, 0 replies; 76+ messages in thread
From: Alexander Gordeev @ 2014-06-10 13:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mips, linux-s390, linux-pci, x86, linux-doc, linux-ide,
	iommu, Alexander Gordeev, xen-devel, linuxppc-dev

There are PCI devices that require a particular value to be
written to the Multiple Message Enable (MME) register, while the
number of MSI vectors actually used, 'nvec', rounded up to a
power of two, is less than that MME value:

	roundup_pow_of_two(nvec) < 'Multiple Message Enable'

However, the existing pci_enable_msi_block() interface cannot
configure such devices, because the value written to the MME
register is calculated from the number of requested MSIs
'nvec':

	'Multiple Message Enable' = roundup_pow_of_two(nvec)

In this case the value written to the MME register may not
satisfy the requirement of such devices, and therefore the
PCI functions will not operate in the desired mode.

This update introduces pci_enable_msi_partial(), an extension of
the pci_enable_msi_block() interface that accepts an extra
'nvec_mme' argument. That value is written to the MME register,
while 'nvec' is still used to set up as many interrupts as requested.

As a result of this change, the architecture-specific callbacks
arch_msi_check_device() and arch_setup_msi_irqs() get an extra
'nvec_mme' parameter as well, but it is ignored for now.
Therefore, this update is a placeholder for architectures that
wish to support the pci_enable_msi_partial() function in the future.

Cc: linux-doc@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: x86@kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: iommu@lists.linux-foundation.org
Cc: linux-ide@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 Documentation/PCI/MSI-HOWTO.txt |   36 ++++++++++++++--
 arch/mips/pci/msi-octeon.c      |    2 +-
 arch/powerpc/kernel/msi.c       |    4 +-
 arch/s390/pci/pci.c             |    2 +-
 arch/x86/kernel/x86_init.c      |    2 +-
 drivers/pci/msi.c               |   83 ++++++++++++++++++++++++++++++++++-----
 include/linux/msi.h             |    5 +-
 include/linux/pci.h             |    3 +
 8 files changed, 115 insertions(+), 22 deletions(-)

diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
index 10a9369..c8a8503 100644
--- a/Documentation/PCI/MSI-HOWTO.txt
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -195,14 +195,40 @@ By contrast with pci_enable_msi_range() function, pci_enable_msi_exact()
 returns zero in case of success, which indicates MSI interrupts have been
 successfully allocated.
 
-4.2.4 pci_disable_msi
+4.2.4 pci_enable_msi_partial
+
+int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
+
+This variation on the pci_enable_msi_exact() call allows a device driver
+to set up 'nvec_mme' multiple MSIs with the PCI function, while setting
+up only 'nvec' MSIs (which may be fewer than 'nvec_mme') in the
+operating system. The MSI specification only allows 'nvec_mme' to be a
+power of two, up to a maximum of 2^5 (32).
+
+This function can be used when a PCI function is known to send only
+'nvec' MSIs but must still be initialized with a particular number of
+MSIs, 'nvec_mme'. As a result, the 'nvec_mme' - 'nvec' unused MSIs
+do not waste system resources.
+
+If this function returns 0, it has succeeded in setting up 'nvec'
+interrupts and programming the MME register for 'nvec_mme' vectors. In
+this case, the function enables MSI on this device and updates dev->irq
+to be the lowest of the new interrupts assigned to it.  The other interrupts
+assigned to the device are in the range dev->irq to dev->irq + nvec - 1.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to request any more MSI interrupts for
+this device.
+
+4.2.5 pci_disable_msi
 
 void pci_disable_msi(struct pci_dev *dev)
 
-This function should be used to undo the effect of pci_enable_msi_range().
-Calling it restores dev->irq to the pin-based interrupt number and frees
-the previously allocated MSIs.  The interrupts may subsequently be assigned
-to another device, so drivers should not cache the value of dev->irq.
+This function should be used to undo the effect of pci_enable_msi_range()
+or pci_enable_msi_partial(). Calling it restores dev->irq to the pin-based
+interrupt number and frees the previously allocated MSIs.  The interrupts
+may subsequently be assigned to another device, so drivers should not cache
+the value of dev->irq.
 
 Before calling this function, a device driver must always call free_irq()
 on any interrupt for which it previously called request_irq().
diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c
index 2b91b0e..2be7979 100644
--- a/arch/mips/pci/msi-octeon.c
+++ b/arch/mips/pci/msi-octeon.c
@@ -178,7 +178,7 @@ msi_irq_allocated:
 	return 0;
 }
 
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	struct msi_desc *entry;
 	int ret;
diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
index 8bbc12d..c60aee3 100644
--- a/arch/powerpc/kernel/msi.c
+++ b/arch/powerpc/kernel/msi.c
@@ -13,7 +13,7 @@
 
 #include <asm/machdep.h>
 
-int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
+int arch_msi_check_device(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
 		pr_debug("msi: Platform doesn't provide MSI callbacks.\n");
@@ -32,7 +32,7 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
         return 0;
 }
 
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	return ppc_md.setup_msi_irqs(dev, nvec, type);
 }
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 9ddc51e..3cf38a8 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -398,7 +398,7 @@ static void zpci_irq_handler(struct airq_struct *airq)
 	}
 }
 
-int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int nvec_mme, int type)
 {
 	struct zpci_dev *zdev = get_zdev(pdev);
 	unsigned int hwirq, msi_vecs;
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index e48b674..b65bf95 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -121,7 +121,7 @@ struct x86_msi_ops x86_msi = {
 };
 
 /* MSI arch specific hooks */
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	return x86_msi.setup_msi_irqs(dev, nvec, type);
 }
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 27a7e67..0410d9b 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -56,7 +56,8 @@ void __weak arch_teardown_msi_irq(unsigned int irq)
 	chip->teardown_irq(chip, irq);
 }
 
-int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
+int __weak arch_msi_check_device(struct pci_dev *dev,
+				 int nvec, int nvec_mme, int type)
 {
 	struct msi_chip *chip = dev->bus->msi;
 
@@ -66,7 +67,8 @@ int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
 	return chip->check_device(chip, dev, nvec, type);
 }
 
-int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int __weak arch_setup_msi_irqs(struct pci_dev *dev,
+			       int nvec, int nvec_mme, int type)
 {
 	struct msi_desc *entry;
 	int ret;
@@ -598,6 +600,7 @@ error_attrs:
  * msi_capability_init - configure device's MSI capability structure
  * @dev: pointer to the pci_dev data structure of MSI device function
  * @nvec: number of interrupts to allocate
+ * @nvec_mme: number of interrupts to write to Multiple Message Enable register
  *
  * Setup the MSI capability structure of the device with the requested
  * number of interrupts.  A return value of zero indicates the successful
@@ -605,7 +608,7 @@ error_attrs:
  * an error, and a positive return value indicates the number of interrupts
  * which could have been allocated.
  */
-static int msi_capability_init(struct pci_dev *dev, int nvec)
+static int msi_capability_init(struct pci_dev *dev, int nvec, int nvec_mme)
 {
 	struct msi_desc *entry;
 	int ret;
@@ -640,7 +643,7 @@ static int msi_capability_init(struct pci_dev *dev, int nvec)
 	list_add_tail(&entry->list, &dev->msi_list);
 
 	/* Configure MSI capability structure */
-	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
+	ret = arch_setup_msi_irqs(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
 	if (ret) {
 		msi_mask_irq(entry, mask, ~mask);
 		free_msi_irqs(dev);
@@ -758,7 +761,8 @@ static int msix_capability_init(struct pci_dev *dev,
 	if (ret)
 		return ret;
 
-	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
+	/* Parameter 'nvec_mme' does not make sense in case of MSI-X */
+	ret = arch_setup_msi_irqs(dev, nvec, 0, PCI_CAP_ID_MSIX);
 	if (ret)
 		goto out_avail;
 
@@ -812,13 +816,15 @@ out_free:
  * pci_msi_check_device - check whether MSI may be enabled on a device
  * @dev: pointer to the pci_dev data structure of MSI device function
  * @nvec: how many MSIs have been requested ?
+ * @nvec_mme: how many MSIs to write to Multiple Message Enable register ?
  * @type: are we checking for MSI or MSI-X ?
  *
  * Look at global flags, the device itself, and its parent buses
  * to determine if MSI/-X are supported for the device. If MSI/-X is
  * supported return 0, else return an error code.
  **/
-static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
+static int pci_msi_check_device(struct pci_dev *dev,
+				int nvec, int nvec_mme, int type)
 {
 	struct pci_bus *bus;
 	int ret;
@@ -846,7 +852,7 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
 		if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
 			return -EINVAL;
 
-	ret = arch_msi_check_device(dev, nvec, type);
+	ret = arch_msi_check_device(dev, nvec, nvec_mme, type);
 	if (ret)
 		return ret;
 
@@ -878,6 +884,62 @@ int pci_msi_vec_count(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pci_msi_vec_count);
 
+/**
+ * pci_enable_msi_partial - configure device's MSI capability structure
+ * @dev: device to configure
+ * @nvec: number of interrupts to configure
+ * @nvec_mme: number of interrupts to write to Multiple Message Enable register
+ *
+ * This function tries to allocate @nvec interrupts while setting up the
+ * device's Multiple Message Enable register for @nvec_mme interrupts.
+ * It returns a negative errno if an error occurs. If it succeeds, it returns
+ * zero and updates the @dev's irq member to the lowest new interrupt number;
+ * the other interrupt numbers allocated to this device are consecutive.
+ */
+int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
+{
+	int maxvec;
+	int rc;
+
+	if (dev->current_state != PCI_D0)
+		return -EINVAL;
+
+	WARN_ON(!!dev->msi_enabled);
+
+	/* Check whether driver already requested MSI-X irqs */
+	if (dev->msix_enabled) {
+		dev_info(&dev->dev, "can't enable MSI "
+			 "(MSI-X already enabled)\n");
+		return -EINVAL;
+	}
+
+	if (!is_power_of_2(nvec_mme))
+		return -EINVAL;
+	if (nvec > nvec_mme)
+		return -EINVAL;
+
+	maxvec = pci_msi_vec_count(dev);
+	if (maxvec < 0)
+		return maxvec;
+	else if (nvec_mme > maxvec)
+		return -EINVAL;
+
+	rc = pci_msi_check_device(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
+	if (rc < 0)
+		return rc;
+	else if (rc > 0)
+		return -ENOSPC;
+
+	rc = msi_capability_init(dev, nvec, nvec_mme);
+	if (rc < 0)
+		return rc;
+	else if (rc > 0)
+		return -ENOSPC;
+
+	return 0;
+}
+EXPORT_SYMBOL(pci_enable_msi_partial);
+
 void pci_msi_shutdown(struct pci_dev *dev)
 {
 	struct msi_desc *desc;
@@ -957,7 +1019,7 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
 	if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
 		return -EINVAL;
 
-	status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
+	status = pci_msi_check_device(dev, nvec, 0, PCI_CAP_ID_MSIX);
 	if (status)
 		return status;
 
@@ -1110,7 +1172,8 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
 		nvec = maxvec;
 
 	do {
-		rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
+		rc = pci_msi_check_device(dev, nvec, roundup_pow_of_two(nvec),
+					  PCI_CAP_ID_MSI);
 		if (rc < 0) {
 			return rc;
 		} else if (rc > 0) {
@@ -1121,7 +1184,7 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
 	} while (rc);
 
 	do {
-		rc = msi_capability_init(dev, nvec);
+		rc = msi_capability_init(dev, nvec, roundup_pow_of_two(nvec));
 		if (rc < 0) {
 			return rc;
 		} else if (rc > 0) {
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 92a2f99..b9f89ee 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -57,9 +57,10 @@ struct msi_desc {
  */
 int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
 void arch_teardown_msi_irq(unsigned int irq);
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type);
 void arch_teardown_msi_irqs(struct pci_dev *dev);
-int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
+int arch_msi_check_device(struct pci_dev *dev,
+			  int nvec, int nvec_mme, int type);
 void arch_restore_msi_irqs(struct pci_dev *dev);
 
 void default_teardown_msi_irqs(struct pci_dev *dev);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 71d9673..7360bd2 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1184,6 +1184,7 @@ void pci_disable_msix(struct pci_dev *dev);
 void msi_remove_pci_irq_vectors(struct pci_dev *dev);
 void pci_restore_msi_state(struct pci_dev *dev);
 int pci_msi_enabled(void);
+int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme);
 int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec);
 static inline int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
 {
@@ -1215,6 +1216,8 @@ static inline void pci_disable_msix(struct pci_dev *dev) { }
 static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) { }
 static inline void pci_restore_msi_state(struct pci_dev *dev) { }
 static inline int pci_msi_enabled(void) { return 0; }
+static inline int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
+{ return -ENOSYS; }
 static inline int pci_enable_msi_range(struct pci_dev *dev, int minvec,
 				       int maxvec)
 { return -ENOSYS; }
-- 
1.7.7.6

* [PATCH 2/3] PCI/MSI/x86: Support pci_enable_msi_partial()
  2014-06-10 13:10 ` Alexander Gordeev
                   ` (3 preceding siblings ...)
  (?)
@ 2014-06-10 13:10 ` Alexander Gordeev
  -1 siblings, 0 replies; 76+ messages in thread
From: Alexander Gordeev @ 2014-06-10 13:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, x86, xen-devel, iommu, linux-ide, linux-pci

This change is a prerequisite for the forthcoming update
of the AHCI device driver to conserve 10 of 16 MSIs on Intel
chipsets. It passes the 'nvec_mme' parameter of the
pci_enable_msi_partial() function down to the x86 MSI setup
callbacks, where the interrupt remapping code uses it.

Cc: x86@kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: iommu@lists.linux-foundation.org
Cc: linux-ide@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 arch/x86/include/asm/pci.h      |    3 ++-
 arch/x86/include/asm/x86_init.h |    3 ++-
 arch/x86/kernel/apic/io_apic.c  |    2 +-
 arch/x86/kernel/x86_init.c      |    2 +-
 arch/x86/pci/xen.c              |    9 ++++++---
 drivers/iommu/irq_remapping.c   |   10 +++++-----
 6 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
index 0892ea0..24c9b92 100644
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -96,7 +96,8 @@ extern void pci_iommu_alloc(void);
 #ifdef CONFIG_PCI_MSI
 /* implemented in arch/x86/kernel/apic/io_apic. */
 struct msi_desc;
-int native_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
+int native_setup_msi_irqs(struct pci_dev *dev,
+			  int nvec, int nvec_mme, int type);
 void native_teardown_msi_irq(unsigned int irq);
 void native_restore_msi_irqs(struct pci_dev *dev);
 int setup_msi_irq(struct pci_dev *dev, struct msi_desc *msidesc,
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index e45e4da..0c3ecb6 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -175,7 +175,8 @@ struct msi_msg;
 struct msi_desc;
 
 struct x86_msi_ops {
-	int (*setup_msi_irqs)(struct pci_dev *dev, int nvec, int type);
+	int (*setup_msi_irqs)(struct pci_dev *dev,
+			      int nvec, int nvec_mme, int type);
 	void (*compose_msi_msg)(struct pci_dev *dev, unsigned int irq,
 				unsigned int dest, struct msi_msg *msg,
 			       u8 hpet_id);
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 9d0a979..0c0faa4 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -3060,7 +3060,7 @@ int setup_msi_irq(struct pci_dev *dev, struct msi_desc *msidesc,
 	return 0;
 }
 
-int native_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int native_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
 	struct msi_desc *msidesc;
 	unsigned int irq;
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index b65bf95..939171a 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -123,7 +123,7 @@ struct x86_msi_ops x86_msi = {
 /* MSI arch specific hooks */
 int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
 {
-	return x86_msi.setup_msi_irqs(dev, nvec, type);
+	return x86_msi.setup_msi_irqs(dev, nvec, nvec_mme, type);
 }
 
 void arch_teardown_msi_irqs(struct pci_dev *dev)
diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 905956f..72e4277 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -156,7 +156,8 @@ static int acpi_register_gsi_xen(struct device *dev, u32 gsi,
 struct xen_pci_frontend_ops *xen_pci_frontend;
 EXPORT_SYMBOL_GPL(xen_pci_frontend);
 
-static int xen_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+static int xen_setup_msi_irqs(struct pci_dev *dev,
+			      int nvec, int nvec_mme, int type)
 {
 	int irq, ret, i;
 	struct msi_desc *msidesc;
@@ -218,7 +219,8 @@ static void xen_msi_compose_msg(struct pci_dev *pdev, unsigned int pirq,
 	msg->data = XEN_PIRQ_MSI_DATA;
 }
 
-static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+static int xen_hvm_setup_msi_irqs(struct pci_dev *dev,
+				  int nvec, int nvec_mme, int type)
 {
 	int irq, pirq;
 	struct msi_desc *msidesc;
@@ -266,7 +268,8 @@ error:
 #ifdef CONFIG_XEN_DOM0
 static bool __read_mostly pci_seg_supported = true;
 
-static int xen_initdom_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+static int xen_initdom_setup_msi_irqs(struct pci_dev *dev,
+				      int nvec, int nvec_mme, int type)
 {
 	int ret = 0;
 	struct msi_desc *msidesc;
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 33c4395..6300bfd 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -49,9 +49,9 @@ static void irq_remapping_disable_io_apic(void)
 		disconnect_bsp_APIC(0);
 }
 
-static int do_setup_msi_irqs(struct pci_dev *dev, int nvec)
+static int do_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_pow2)
 {
-	int ret, sub_handle, nvec_pow2, index = 0;
+	int ret, sub_handle, index = 0;
 	unsigned int irq;
 	struct msi_desc *msidesc;
 
@@ -60,12 +60,12 @@ static int do_setup_msi_irqs(struct pci_dev *dev, int nvec)
 	WARN_ON(msidesc->irq);
 	WARN_ON(msidesc->msi_attrib.multiple);
 	WARN_ON(msidesc->nvec_used);
+	BUG_ON(!is_power_of_2(nvec_pow2));
 
 	irq = irq_alloc_hwirqs(nvec, dev_to_node(&dev->dev));
 	if (irq == 0)
 		return -ENOSPC;
 
-	nvec_pow2 = __roundup_pow_of_two(nvec);
 	msidesc->nvec_used = nvec;
 	msidesc->msi_attrib.multiple = ilog2(nvec_pow2);
 	for (sub_handle = 0; sub_handle < nvec; sub_handle++) {
@@ -140,10 +140,10 @@ error:
 }
 
 static int irq_remapping_setup_msi_irqs(struct pci_dev *dev,
-					int nvec, int type)
+					int nvec, int nvec_mme, int type)
 {
 	if (type == PCI_CAP_ID_MSI)
-		return do_setup_msi_irqs(dev, nvec);
+		return do_setup_msi_irqs(dev, nvec, nvec_mme);
 	else
 		return do_setup_msix_irqs(dev, nvec);
 }
-- 
1.7.7.6

* [PATCH 3/3] AHCI: Use pci_enable_msi_partial() to conserve on 10/16 MSIs
  2014-06-10 13:10 ` Alexander Gordeev
                   ` (4 preceding siblings ...)
  (?)
@ 2014-06-10 13:10 ` Alexander Gordeev
       [not found]   ` <dba9f0f8e9cccd7625d0f3fab94457482e1a2bd7.1402405331.git.agordeev-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-06-18 18:54   ` Tejun Heo
  -1 siblings, 2 replies; 76+ messages in thread
From: Alexander Gordeev @ 2014-06-10 13:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Gordeev, x86, xen-devel, iommu, linux-ide, linux-pci

Make use of the new pci_enable_msi_partial() interface to
conserve otherwise wasted interrupt resources: 10 out of 16
MSI vectors are unused on Intel chipsets.

Cc: x86@kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: iommu@lists.linux-foundation.org
Cc: linux-ide@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 drivers/ata/ahci.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 6070781..0c7a0f3 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1194,7 +1194,7 @@ static int ahci_init_interrupts(struct pci_dev *pdev, unsigned int n_ports,
 	if (nvec < n_ports)
 		goto single_msi;
 
-	rc = pci_enable_msi_exact(pdev, nvec);
+	rc = pci_enable_msi_partial(pdev, n_ports, nvec);
 	if (rc == -ENOSPC)
 		goto single_msi;
 	else if (rc < 0)
@@ -1207,7 +1207,7 @@ static int ahci_init_interrupts(struct pci_dev *pdev, unsigned int n_ports,
 		goto single_msi;
 	}
 
-	return nvec;
+	return n_ports;
 
 single_msi:
 	if (pci_enable_msi(pdev))
-- 
1.7.7.6


* Re: [PATCH 3/3] AHCI: Use pci_enable_msi_partial() to conserve on 10/16 MSIs
@ 2014-06-18 18:54       ` Tejun Heo
  0 siblings, 0 replies; 76+ messages in thread
From: Tejun Heo @ 2014-06-18 18:54 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-kernel, x86, xen-devel, iommu, linux-ide, linux-pci

On Tue, Jun 10, 2014 at 03:10:32PM +0200, Alexander Gordeev wrote:
> Make use of the new pci_enable_msi_partial() interface to
> conserve otherwise wasted interrupt resources: on some Intel
> chipsets, 10 of the 16 MSI vectors are unused.
> 
> Cc: x86@kernel.org
> Cc: xen-devel@lists.xenproject.org
> Cc: iommu@lists.linux-foundation.org
> Cc: linux-ide@vger.kernel.org
> Cc: linux-pci@vger.kernel.org
> Signed-off-by: Alexander Gordeev <agordeev@redhat.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
@ 2014-06-23 20:11       ` Alexander Gordeev
  0 siblings, 0 replies; 76+ messages in thread
From: Alexander Gordeev @ 2014-06-23 20:11 UTC (permalink / raw)
  To: linux-kernel, Bjorn Helgaas
  Cc: linux-doc, linux-mips, linuxppc-dev, linux-s390, x86, xen-devel,
	iommu, linux-ide, linux-pci

Hi Bjorn,

Any feedback?

Thanks!

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-06-10 13:10   ` Alexander Gordeev
@ 2014-07-02 20:22     ` Bjorn Helgaas
  -1 siblings, 0 replies; 76+ messages in thread
From: Bjorn Helgaas @ 2014-07-02 20:22 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-kernel, linux-doc, linux-mips, linuxppc-dev, linux-s390,
	x86, xen-devel, iommu, linux-ide, linux-pci

On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
> There are PCI devices that require a particular value to be
> written to the Multiple Message Enable (MME) register, even
> when the number of MSI vectors actually used, 'nvec', rounded
> up to a power of two, is less than that MME value:
> 
> 	roundup_pow_of_two(nvec) < 'Multiple Message Enable'
> 
> However, the existing pci_enable_msi_block() interface is not
> able to configure such devices, since the value written to the
> MME register is calculated from the number of requested MSIs
> 'nvec':
> 
> 	'Multiple Message Enable' = roundup_pow_of_two(nvec)

For MSI, software learns how many vectors a device requests by reading
the Multiple Message Capable (MMC) field.  This field is encoded, so a
device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
for a device to request 3 vectors; it would have to round that up
to a power of two and request 4 vectors.

Software writes similarly encoded values to MME to tell the device how
many vectors have been allocated for its use.  For example, it's
impossible to tell the device that it can use 3 vectors; the OS has to
round that up and tell the device it can use 4 vectors.

So if I understand correctly, the point of this series is to take
advantage of device-specific knowledge, e.g., the device requests 4
vectors via MMC, but we "know" the device is only capable of using 3.
Moreover, we tell the device via MME that 4 vectors are available, but
we've only actually set up 3 of them.

This makes me uneasy because we're lying to the device, and the device
is perfectly within spec to use all 4 of those vectors.  If anything
changes the number of vectors the device uses (new device revision,
firmware upgrade, etc.), this is liable to break.

Can you quantify the benefit of this?  Can't a device already use
MSI-X to request exactly the number of vectors it can use?  (I know
not all devices support MSI-X, but maybe we should just accept MSI for
what it is and encourage the HW guys to use MSI-X if MSI isn't good
enough.)

> In this case the result written to the MME register may not
> satisfy the aforementioned PCI devices' requirement and therefore
> the PCI functions will not operate in a desired mode.

I'm not sure what you mean by "will not operate in a desired mode."
I thought this was an optimization to save vectors and that these
changes would be completely invisible to the hardware.

Bjorn

> This update introduces pci_enable_msi_partial(), an extension of
> the pci_enable_msi_block() interface that accepts an extra
> 'nvec_mme' argument, which is written to the MME register, while
> 'nvec' is still used to set up as many interrupts as requested.
> 
> As result of this change, architecture-specific callbacks
> arch_msi_check_device() and arch_setup_msi_irqs() get an extra
> 'nvec_mme' parameter as well, but it is ignored for now.
> Therefore, this update is a placeholder for architectures that
> wish to support pci_enable_msi_partial() function in the future.
> 
> Cc: linux-doc@vger.kernel.org
> Cc: linux-mips@linux-mips.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s390@vger.kernel.org
> Cc: x86@kernel.org
> Cc: xen-devel@lists.xenproject.org
> Cc: iommu@lists.linux-foundation.org
> Cc: linux-ide@vger.kernel.org
> Cc: linux-pci@vger.kernel.org
> Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
> ---
>  Documentation/PCI/MSI-HOWTO.txt |   36 ++++++++++++++--
>  arch/mips/pci/msi-octeon.c      |    2 +-
>  arch/powerpc/kernel/msi.c       |    4 +-
>  arch/s390/pci/pci.c             |    2 +-
>  arch/x86/kernel/x86_init.c      |    2 +-
>  drivers/pci/msi.c               |   83 ++++++++++++++++++++++++++++++++++-----
>  include/linux/msi.h             |    5 +-
>  include/linux/pci.h             |    3 +
>  8 files changed, 115 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
> index 10a9369..c8a8503 100644
> --- a/Documentation/PCI/MSI-HOWTO.txt
> +++ b/Documentation/PCI/MSI-HOWTO.txt
> @@ -195,14 +195,40 @@ By contrast with pci_enable_msi_range() function, pci_enable_msi_exact()
>  returns zero in case of success, which indicates MSI interrupts have been
>  successfully allocated.
>  
> -4.2.4 pci_disable_msi
> +4.2.4 pci_enable_msi_partial
> +
> +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> +
> +This variation on the pci_enable_msi_exact() call allows a device driver
> +to program the PCI function with 'nvec_mme' multiple MSIs, while setting
> +up only 'nvec' MSIs (where 'nvec' may be less than 'nvec_mme') in the
> +operating system. The MSI specification only allows 'nvec_mme' to be
> +allocated in powers of two, up to a maximum of 2^5 (32).
> +
> +This function can be used when a PCI function is known to send only
> +'nvec' MSIs, but still requires 'nvec_mme' MSIs to be written to its
> +Multiple Message Enable field. As a result, the 'nvec_mme' - 'nvec'
> +unused MSIs do not waste system resources.
> +
> +If this function returns 0, it has succeeded in allocating 'nvec_mme'
> +interrupts and setting up 'nvec' interrupts. In this case, the function
> +enables MSI on this device and updates dev->irq to be the lowest of the
> +new interrupts assigned to it.  The other interrupts assigned to the
> +device are in the range dev->irq to dev->irq + nvec - 1.
> +
> +If this function returns a negative number, it indicates an error and
> +the driver should not attempt to request any more MSI interrupts for
> +this device.
> +
> +4.2.5 pci_disable_msi
>  
>  void pci_disable_msi(struct pci_dev *dev)
>  
> -This function should be used to undo the effect of pci_enable_msi_range().
> -Calling it restores dev->irq to the pin-based interrupt number and frees
> -the previously allocated MSIs.  The interrupts may subsequently be assigned
> -to another device, so drivers should not cache the value of dev->irq.
> +This function should be used to undo the effect of pci_enable_msi_range()
> +or pci_enable_msi_partial(). Calling it restores dev->irq to the pin-based
> +interrupt number and frees the previously allocated MSIs.  The interrupts
> +may subsequently be assigned to another device, so drivers should not cache
> +the value of dev->irq.
>  
>  Before calling this function, a device driver must always call free_irq()
>  on any interrupt for which it previously called request_irq().
> diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c
> index 2b91b0e..2be7979 100644
> --- a/arch/mips/pci/msi-octeon.c
> +++ b/arch/mips/pci/msi-octeon.c
> @@ -178,7 +178,7 @@ msi_irq_allocated:
>  	return 0;
>  }
>  
> -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
>  {
>  	struct msi_desc *entry;
>  	int ret;
> diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
> index 8bbc12d..c60aee3 100644
> --- a/arch/powerpc/kernel/msi.c
> +++ b/arch/powerpc/kernel/msi.c
> @@ -13,7 +13,7 @@
>  
>  #include <asm/machdep.h>
>  
> -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
> +int arch_msi_check_device(struct pci_dev *dev, int nvec, int nvec_mme, int type)
>  {
>  	if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
>  		pr_debug("msi: Platform doesn't provide MSI callbacks.\n");
> @@ -32,7 +32,7 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
>          return 0;
>  }
>  
> -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
>  {
>  	return ppc_md.setup_msi_irqs(dev, nvec, type);
>  }
> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index 9ddc51e..3cf38a8 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -398,7 +398,7 @@ static void zpci_irq_handler(struct airq_struct *airq)
>  	}
>  }
>  
> -int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
> +int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int nvec_mme, int type)
>  {
>  	struct zpci_dev *zdev = get_zdev(pdev);
>  	unsigned int hwirq, msi_vecs;
> diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
> index e48b674..b65bf95 100644
> --- a/arch/x86/kernel/x86_init.c
> +++ b/arch/x86/kernel/x86_init.c
> @@ -121,7 +121,7 @@ struct x86_msi_ops x86_msi = {
>  };
>  
>  /* MSI arch specific hooks */
> -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
>  {
>  	return x86_msi.setup_msi_irqs(dev, nvec, type);
>  }
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 27a7e67..0410d9b 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -56,7 +56,8 @@ void __weak arch_teardown_msi_irq(unsigned int irq)
>  	chip->teardown_irq(chip, irq);
>  }
>  
> -int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
> +int __weak arch_msi_check_device(struct pci_dev *dev,
> +				 int nvec, int nvec_mme, int type)
>  {
>  	struct msi_chip *chip = dev->bus->msi;
>  
> @@ -66,7 +67,8 @@ int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
>  	return chip->check_device(chip, dev, nvec, type);
>  }
>  
> -int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> +int __weak arch_setup_msi_irqs(struct pci_dev *dev,
> +			       int nvec, int nvec_mme, int type)
>  {
>  	struct msi_desc *entry;
>  	int ret;
> @@ -598,6 +600,7 @@ error_attrs:
>   * msi_capability_init - configure device's MSI capability structure
>   * @dev: pointer to the pci_dev data structure of MSI device function
>   * @nvec: number of interrupts to allocate
> + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
>   *
>   * Setup the MSI capability structure of the device with the requested
>   * number of interrupts.  A return value of zero indicates the successful
> @@ -605,7 +608,7 @@ error_attrs:
>   * an error, and a positive return value indicates the number of interrupts
>   * which could have been allocated.
>   */
> -static int msi_capability_init(struct pci_dev *dev, int nvec)
> +static int msi_capability_init(struct pci_dev *dev, int nvec, int nvec_mme)
>  {
>  	struct msi_desc *entry;
>  	int ret;
> @@ -640,7 +643,7 @@ static int msi_capability_init(struct pci_dev *dev, int nvec)
>  	list_add_tail(&entry->list, &dev->msi_list);
>  
>  	/* Configure MSI capability structure */
> -	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
> +	ret = arch_setup_msi_irqs(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
>  	if (ret) {
>  		msi_mask_irq(entry, mask, ~mask);
>  		free_msi_irqs(dev);
> @@ -758,7 +761,8 @@ static int msix_capability_init(struct pci_dev *dev,
>  	if (ret)
>  		return ret;
>  
> -	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
> +	/* Parameter 'nvec_mme' does not make sense in case of MSI-X */
> +	ret = arch_setup_msi_irqs(dev, nvec, 0, PCI_CAP_ID_MSIX);
>  	if (ret)
>  		goto out_avail;
>  
> @@ -812,13 +816,15 @@ out_free:
>   * pci_msi_check_device - check whether MSI may be enabled on a device
>   * @dev: pointer to the pci_dev data structure of MSI device function
>   * @nvec: how many MSIs have been requested ?
> + * @nvec_mme: how many MSIs to write to Multiple Message Enable register ?
>   * @type: are we checking for MSI or MSI-X ?
>   *
>   * Look at global flags, the device itself, and its parent buses
>   * to determine if MSI/-X are supported for the device. If MSI/-X is
>   * supported return 0, else return an error code.
>   **/
> -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
> +static int pci_msi_check_device(struct pci_dev *dev,
> +				int nvec, int nvec_mme, int type)
>  {
>  	struct pci_bus *bus;
>  	int ret;
> @@ -846,7 +852,7 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
>  		if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
>  			return -EINVAL;
>  
> -	ret = arch_msi_check_device(dev, nvec, type);
> +	ret = arch_msi_check_device(dev, nvec, nvec_mme, type);
>  	if (ret)
>  		return ret;
>  
> @@ -878,6 +884,62 @@ int pci_msi_vec_count(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL(pci_msi_vec_count);
>  
> +/**
> + * pci_enable_msi_partial - configure device's MSI capability structure
> + * @dev: device to configure
> + * @nvec: number of interrupts to configure
> + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
> + *
> + * This function tries to allocate @nvec interrupts while setting up the
> + * device's Multiple Message Enable register with @nvec_mme interrupts.
> + * It returns a negative errno if an error occurs. If it succeeds, it returns
> + * zero and updates the @dev's irq member to the lowest new interrupt number;
> + * the other interrupt numbers allocated to this device are consecutive.
> + */
> +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> +{
> +	int maxvec;
> +	int rc;
> +
> +	if (dev->current_state != PCI_D0)
> +		return -EINVAL;
> +
> +	WARN_ON(!!dev->msi_enabled);
> +
> +	/* Check whether driver already requested MSI-X irqs */
> +	if (dev->msix_enabled) {
> +		dev_info(&dev->dev, "can't enable MSI "
> +			 "(MSI-X already enabled)\n");
> +		return -EINVAL;
> +	}
> +
> +	if (!is_power_of_2(nvec_mme))
> +		return -EINVAL;
> +	if (nvec > nvec_mme)
> +		return -EINVAL;
> +
> +	maxvec = pci_msi_vec_count(dev);
> +	if (maxvec < 0)
> +		return maxvec;
> +	else if (nvec_mme > maxvec)
> +		return -EINVAL;
> +
> +	rc = pci_msi_check_device(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
> +	if (rc < 0)
> +		return rc;
> +	else if (rc > 0)
> +		return -ENOSPC;
> +
> +	rc = msi_capability_init(dev, nvec, nvec_mme);
> +	if (rc < 0)
> +		return rc;
> +	else if (rc > 0)
> +		return -ENOSPC;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(pci_enable_msi_partial);
> +
>  void pci_msi_shutdown(struct pci_dev *dev)
>  {
>  	struct msi_desc *desc;
> @@ -957,7 +1019,7 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
>  	if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
>  		return -EINVAL;
>  
> -	status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
> +	status = pci_msi_check_device(dev, nvec, 0, PCI_CAP_ID_MSIX);
>  	if (status)
>  		return status;
>  
> @@ -1110,7 +1172,8 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
>  		nvec = maxvec;
>  
>  	do {
> -		rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
> +		rc = pci_msi_check_device(dev, nvec, roundup_pow_of_two(nvec),
> +					  PCI_CAP_ID_MSI);
>  		if (rc < 0) {
>  			return rc;
>  		} else if (rc > 0) {
> @@ -1121,7 +1184,7 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
>  	} while (rc);
>  
>  	do {
> -		rc = msi_capability_init(dev, nvec);
> +		rc = msi_capability_init(dev, nvec, roundup_pow_of_two(nvec));
>  		if (rc < 0) {
>  			return rc;
>  		} else if (rc > 0) {
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index 92a2f99..b9f89ee 100644
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -57,9 +57,10 @@ struct msi_desc {
>   */
>  int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
>  void arch_teardown_msi_irq(unsigned int irq);
> -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
> +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type);
>  void arch_teardown_msi_irqs(struct pci_dev *dev);
> -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
> +int arch_msi_check_device(struct pci_dev *dev,
> +			  int nvec, int nvec_mme, int type);
>  void arch_restore_msi_irqs(struct pci_dev *dev);
>  
>  void default_teardown_msi_irqs(struct pci_dev *dev);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 71d9673..7360bd2 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1184,6 +1184,7 @@ void pci_disable_msix(struct pci_dev *dev);
>  void msi_remove_pci_irq_vectors(struct pci_dev *dev);
>  void pci_restore_msi_state(struct pci_dev *dev);
>  int pci_msi_enabled(void);
> +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme);
>  int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec);
>  static inline int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
>  {
> @@ -1215,6 +1216,8 @@ static inline void pci_disable_msix(struct pci_dev *dev) { }
>  static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) { }
>  static inline void pci_restore_msi_state(struct pci_dev *dev) { }
>  static inline int pci_msi_enabled(void) { return 0; }
> +static inline int pci_enable_msi_partial(struct pci_dev *dev, int nvec,
> +					 int nvec_mme) { return -ENOSYS; }
>  static inline int pci_enable_msi_range(struct pci_dev *dev, int minvec,
>  				       int maxvec)
>  { return -ENOSYS; }
> -- 
> 1.7.7.6
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

>  	if (ret)
>  		goto out_avail;
>  
> @@ -812,13 +816,15 @@ out_free:
>   * pci_msi_check_device - check whether MSI may be enabled on a device
>   * @dev: pointer to the pci_dev data structure of MSI device function
>   * @nvec: how many MSIs have been requested ?
> + * @nvec_mme: how many MSIs to write to Multiple Message Enable register ?
>   * @type: are we checking for MSI or MSI-X ?
>   *
>   * Look at global flags, the device itself, and its parent buses
>   * to determine if MSI/-X are supported for the device. If MSI/-X is
>   * supported return 0, else return an error code.
>   **/
> -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
> +static int pci_msi_check_device(struct pci_dev *dev,
> +				int nvec, int nvec_mme, int type)
>  {
>  	struct pci_bus *bus;
>  	int ret;
> @@ -846,7 +852,7 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
>  		if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
>  			return -EINVAL;
>  
> -	ret = arch_msi_check_device(dev, nvec, type);
> +	ret = arch_msi_check_device(dev, nvec, nvec_mme, type);
>  	if (ret)
>  		return ret;
>  
> @@ -878,6 +884,62 @@ int pci_msi_vec_count(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL(pci_msi_vec_count);
>  
> +/**
> + * pci_enable_msi_partial - configure device's MSI capability structure
> + * @dev: device to configure
> + * @nvec: number of interrupts to configure
> + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
> + *
> + * This function tries to allocate @nvec interrupts while setting up the
> + * device's Multiple Message Enable register with @nvec_mme interrupts.
> + * It returns a negative errno if an error occurs. If it succeeds, it returns
> + * zero and updates the @dev's irq member to the lowest new interrupt number;
> + * the other interrupt numbers allocated to this device are consecutive.
> + */
> +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> +{
> +	int maxvec;
> +	int rc;
> +
> +	if (dev->current_state != PCI_D0)
> +		return -EINVAL;
> +
> +	WARN_ON(!!dev->msi_enabled);
> +
> +	/* Check whether driver already requested MSI-X irqs */
> +	if (dev->msix_enabled) {
> +		dev_info(&dev->dev, "can't enable MSI "
> +			 "(MSI-X already enabled)\n");
> +		return -EINVAL;
> +	}
> +
> +	if (!is_power_of_2(nvec_mme))
> +		return -EINVAL;
> +	if (nvec > nvec_mme)
> +		return -EINVAL;
> +
> +	maxvec = pci_msi_vec_count(dev);
> +	if (maxvec < 0)
> +		return maxvec;
> +	else if (nvec_mme > maxvec)
> +		return -EINVAL;
> +
> +	rc = pci_msi_check_device(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
> +	if (rc < 0)
> +		return rc;
> +	else if (rc > 0)
> +		return -ENOSPC;
> +
> +	rc = msi_capability_init(dev, nvec, nvec_mme);
> +	if (rc < 0)
> +		return rc;
> +	else if (rc > 0)
> +		return -ENOSPC;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(pci_enable_msi_partial);
> +
>  void pci_msi_shutdown(struct pci_dev *dev)
>  {
>  	struct msi_desc *desc;
> @@ -957,7 +1019,7 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
>  	if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
>  		return -EINVAL;
>  
> -	status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
> +	status = pci_msi_check_device(dev, nvec, 0, PCI_CAP_ID_MSIX);
>  	if (status)
>  		return status;
>  
> @@ -1110,7 +1172,8 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
>  		nvec = maxvec;
>  
>  	do {
> -		rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
> +		rc = pci_msi_check_device(dev, nvec, roundup_pow_of_two(nvec),
> +					  PCI_CAP_ID_MSI);
>  		if (rc < 0) {
>  			return rc;
>  		} else if (rc > 0) {
> @@ -1121,7 +1184,7 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
>  	} while (rc);
>  
>  	do {
> -		rc = msi_capability_init(dev, nvec);
> +		rc = msi_capability_init(dev, nvec, roundup_pow_of_two(nvec));
>  		if (rc < 0) {
>  			return rc;
>  		} else if (rc > 0) {
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index 92a2f99..b9f89ee 100644
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -57,9 +57,10 @@ struct msi_desc {
>   */
>  int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
>  void arch_teardown_msi_irq(unsigned int irq);
> -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
> +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type);
>  void arch_teardown_msi_irqs(struct pci_dev *dev);
> -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
> +int arch_msi_check_device(struct pci_dev *dev,
> +			  int nvec, int nvec_mme, int type);
>  void arch_restore_msi_irqs(struct pci_dev *dev);
>  
>  void default_teardown_msi_irqs(struct pci_dev *dev);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 71d9673..7360bd2 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1184,6 +1184,7 @@ void pci_disable_msix(struct pci_dev *dev);
>  void msi_remove_pci_irq_vectors(struct pci_dev *dev);
>  void pci_restore_msi_state(struct pci_dev *dev);
>  int pci_msi_enabled(void);
> +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme);
>  int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec);
>  static inline int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
>  {
> @@ -1215,6 +1216,8 @@ static inline void pci_disable_msix(struct pci_dev *dev) { }
>  static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) { }
>  static inline void pci_restore_msi_state(struct pci_dev *dev) { }
>  static inline int pci_msi_enabled(void) { return 0; }
> +static inline int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> +{ return -ENOSYS; }
>  static inline int pci_enable_msi_range(struct pci_dev *dev, int minvec,
>  				       int maxvec)
>  { return -ENOSYS; }
> -- 
> 1.7.7.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-06-10 13:10   ` Alexander Gordeev
                     ` (2 preceding siblings ...)
  (?)
@ 2014-07-02 20:22   ` Bjorn Helgaas
  -1 siblings, 0 replies; 76+ messages in thread
From: Bjorn Helgaas @ 2014-07-02 20:22 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-mips, linux-s390, linux-pci, x86, linux-doc, linux-kernel,
	linux-ide, iommu, xen-devel, linuxppc-dev

On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
> There are PCI devices that require a particular value to be
> written to the Multiple Message Enable (MME) register, while the
> number of MSI vectors actually used, 'nvec', rounded up to a
> power of two, is less than that MME value:
> 
> 	roundup_pow_of_two(nvec) < 'Multiple Message Enable'
> 
> However the existing pci_enable_msi_block() interface is not
> able to configure such devices, since the value written to the
> MME register is calculated from the number of requested MSIs
> 'nvec':
> 
> 	'Multiple Message Enable' = roundup_pow_of_two(nvec)

For MSI, software learns how many vectors a device requests by reading
the Multiple Message Capable (MMC) field.  This field is encoded, so a
device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
for a device to request 3 vectors; it would have to round that up
to a power of two and request 4 vectors.

Software writes similarly encoded values to MME to tell the device how
many vectors have been allocated for its use.  For example, it's
impossible to tell the device that it can use 3 vectors; the OS has to
round that up and tell the device it can use 4 vectors.
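
A minimal userspace sketch of that encoding, assuming the Message
Control layout from the PCI spec (MMC in bits 3:1, MME in bits 6:4,
each holding log2 of a vector count). The macro and function names are
hypothetical illustrations, not kernel APIs:

```c
#include <assert.h>
#include <stdint.h>

/* Message Control bit fields from the PCI spec (Linux names them
 * PCI_MSI_FLAGS_QMASK and PCI_MSI_FLAGS_QSIZE). */
#define MSI_MMC_MASK   0x000e	/* Multiple Message Capable, bits 3:1 */
#define MSI_MMC_SHIFT  1
#define MSI_MME_MASK   0x0070	/* Multiple Message Enable, bits 6:4 */
#define MSI_MME_SHIFT  4

/* Vectors requested by the device: MMC holds log2 of the count.
 * Reserved encodings (110b, 111b) are not rejected in this sketch. */
static unsigned int msi_mmc_vectors(uint16_t msgctl)
{
	return 1u << ((msgctl & MSI_MMC_MASK) >> MSI_MMC_SHIFT);
}

/* Smallest exponent e with 2^e >= nvec, e.g. 3 -> 2 (four vectors). */
static unsigned int msi_encode(unsigned int nvec)
{
	unsigned int e = 0;

	assert(nvec >= 1 && nvec <= 32);	/* MSI allows at most 32 */
	while ((1u << e) < nvec)
		e++;
	return e;
}

/* Program MME for nvec vectors; the device is told the rounded-up count. */
static uint16_t msi_set_mme(uint16_t msgctl, unsigned int nvec)
{
	msgctl &= (uint16_t)~MSI_MME_MASK;
	msgctl |= (uint16_t)(msi_encode(nvec) << MSI_MME_SHIFT);
	return msgctl;
}

/* Vectors the device is allowed to use, as decoded from MME. */
static unsigned int msi_mme_vectors(uint16_t msgctl)
{
	return 1u << ((msgctl & MSI_MME_MASK) >> MSI_MME_SHIFT);
}
```

Asking for 3 vectors via msi_set_mme() leaves MME decoding to 4, which
is the rounding described above.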

So if I understand correctly, the point of this series is to take
advantage of device-specific knowledge, e.g., the device requests 4
vectors via MMC, but we "know" the device is only capable of using 3.
Moreover, we tell the device via MME that 4 vectors are available, but
we've only actually set up 3 of them.
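
The argument checks at the top of the pci_enable_msi_partial() in the
quoted patch can be modelled in userspace as below. This is a
hypothetical mirror, not the kernel code: is_pow2() stands in for
is_power_of_2(), and 'mmc_vectors' stands in for what
pci_msi_vec_count() reads from MMC.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's is_power_of_2(). */
static bool is_pow2(int n)
{
	return n > 0 && (n & (n - 1)) == 0;
}

/*
 * Mirrors the patch's validation order.  Like the patch, it does not
 * reject nvec < 1.
 */
static int msi_partial_check(int nvec, int nvec_mme, int mmc_vectors)
{
	if (!is_pow2(nvec_mme))
		return -EINVAL;		/* MME can only encode powers of two */
	if (nvec > nvec_mme)
		return -EINVAL;		/* cannot set up more than MME grants */
	if (nvec_mme > mmc_vectors)
		return -EINVAL;		/* cannot grant more than MMC requests */
	return 0;
}
```

The 3-of-4 case above passes: msi_partial_check(3, 4, 4) returns 0.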

This makes me uneasy because we're lying to the device, and the device
is perfectly within spec to use all 4 of those vectors.  If anything
changes the number of vectors the device uses (new device revision,
firmware upgrade, etc.), this is liable to break.

Can you quantify the benefit of this?  Can't a device already use
MSI-X to request exactly the number of vectors it can use?  (I know
not all devices support MSI-X, but maybe we should just accept MSI for
what it is and encourage the HW guys to use MSI-X if MSI isn't good
enough.)
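
One way to put a number on it is to count the MSI slots that MME
advertises but that no handler backs. A minimal sketch, with
roundup_pow2() as a hypothetical userspace stand-in for the kernel's
roundup_pow_of_two(); with MSI-X the count is always exact, so there is
nothing to round:

```c
#include <assert.h>

/* Userspace stand-in for the kernel's roundup_pow_of_two(). */
static unsigned int roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* MSI slots the device is told about via MME but that nothing uses. */
static unsigned int msi_idle_slots(unsigned int nvec, unsigned int nvec_mme)
{
	assert(nvec <= nvec_mme);
	return nvec_mme - nvec;
}
```

For example, 3 vectors rounded up to an MME of 4 leave 1 idle slot, and
a device that needs MME = 16 but uses only 6 vectors leaves 10, which
matches the "10 out of 16" figure quoted for AHCI in the series.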

> In this case the value written to the MME register may not
> satisfy the aforementioned devices' requirement, and therefore
> the PCI functions will not operate in a desired mode.

I'm not sure what you mean by "will not operate in a desired mode."
I thought this was an optimization to save vectors and that these
changes would be completely invisible to the hardware.

Bjorn

> This update introduces pci_enable_msi_partial(), an extension of
> the pci_enable_msi_block() interface that accepts an extra
> 'nvec_mme' argument, which is written to the MME register while
> 'nvec' is still used to set up as many interrupts as requested.
> 
> As a result of this change, the architecture-specific callbacks
> arch_msi_check_device() and arch_setup_msi_irqs() get an extra
> 'nvec_mme' parameter as well, but it is ignored for now.
> This update is therefore a placeholder for architectures that
> wish to support pci_enable_msi_partial() in the future.
> 
> Cc: linux-doc@vger.kernel.org
> Cc: linux-mips@linux-mips.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s390@vger.kernel.org
> Cc: x86@kernel.org
> Cc: xen-devel@lists.xenproject.org
> Cc: iommu@lists.linux-foundation.org
> Cc: linux-ide@vger.kernel.org
> Cc: linux-pci@vger.kernel.org
> Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
> ---
>  Documentation/PCI/MSI-HOWTO.txt |   36 ++++++++++++++--
>  arch/mips/pci/msi-octeon.c      |    2 +-
>  arch/powerpc/kernel/msi.c       |    4 +-
>  arch/s390/pci/pci.c             |    2 +-
>  arch/x86/kernel/x86_init.c      |    2 +-
>  drivers/pci/msi.c               |   83 ++++++++++++++++++++++++++++++++++-----
>  include/linux/msi.h             |    5 +-
>  include/linux/pci.h             |    3 +
>  8 files changed, 115 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
> index 10a9369..c8a8503 100644
> --- a/Documentation/PCI/MSI-HOWTO.txt
> +++ b/Documentation/PCI/MSI-HOWTO.txt
> @@ -195,14 +195,40 @@ By contrast with pci_enable_msi_range() function, pci_enable_msi_exact()
>  returns zero in case of success, which indicates MSI interrupts have been
>  successfully allocated.
>  
> -4.2.4 pci_disable_msi
> +4.2.4 pci_enable_msi_partial
> +
> +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> +
> +This variation on pci_enable_msi_exact() allows a device driver to
> +program the PCI function's Multiple Message Enable field with 'nvec_mme'
> +MSIs while setting up only 'nvec' (which may be less than 'nvec_mme')
> +MSIs in the operating system. The MSI specification only allows
> +'nvec_mme' to be a power of two, up to a maximum of 2^5 (32).
> +
> +This function can be used when a PCI function is known to send only
> +'nvec' MSIs but still requires 'nvec_mme' to be written to its Multiple
> +Message Enable field. As a result, the 'nvec_mme' - 'nvec' unused MSIs
> +do not waste system resources.
> +
> +If this function returns 0, it has succeeded in allocating 'nvec_mme'
> +interrupts and setting up 'nvec' interrupts. In this case, the function
> +enables MSI on this device and updates dev->irq to be the lowest of the
> +new interrupts assigned to it.  The other interrupts assigned to the
> +device are in the range dev->irq to dev->irq + nvec - 1.
> +
> +If this function returns a negative number, it indicates an error and
> +the driver should not attempt to request any more MSI interrupts for
> +this device.
> +
> +4.2.5 pci_disable_msi
>  
>  void pci_disable_msi(struct pci_dev *dev)
>  
> -This function should be used to undo the effect of pci_enable_msi_range().
> -Calling it restores dev->irq to the pin-based interrupt number and frees
> -the previously allocated MSIs.  The interrupts may subsequently be assigned
> -to another device, so drivers should not cache the value of dev->irq.
> +This function should be used to undo the effect of pci_enable_msi_range()
> +or pci_enable_msi_partial(). Calling it restores dev->irq to the pin-based
> +interrupt number and frees the previously allocated MSIs.  The interrupts
> +may subsequently be assigned to another device, so drivers should not cache
> +the value of dev->irq.
>  
>  Before calling this function, a device driver must always call free_irq()
>  on any interrupt for which it previously called request_irq().
> diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c
> index 2b91b0e..2be7979 100644
> --- a/arch/mips/pci/msi-octeon.c
> +++ b/arch/mips/pci/msi-octeon.c
> @@ -178,7 +178,7 @@ msi_irq_allocated:
>  	return 0;
>  }
>  
> -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
>  {
>  	struct msi_desc *entry;
>  	int ret;
> diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
> index 8bbc12d..c60aee3 100644
> --- a/arch/powerpc/kernel/msi.c
> +++ b/arch/powerpc/kernel/msi.c
> @@ -13,7 +13,7 @@
>  
>  #include <asm/machdep.h>
>  
> -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
> +int arch_msi_check_device(struct pci_dev *dev, int nvec, int nvec_mme, int type)
>  {
>  	if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
>  		pr_debug("msi: Platform doesn't provide MSI callbacks.\n");
> @@ -32,7 +32,7 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
>          return 0;
>  }
>  
> -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
>  {
>  	return ppc_md.setup_msi_irqs(dev, nvec, type);
>  }
> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index 9ddc51e..3cf38a8 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -398,7 +398,7 @@ static void zpci_irq_handler(struct airq_struct *airq)
>  	}
>  }
>  
> -int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
> +int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int nvec_mme, int type)
>  {
>  	struct zpci_dev *zdev = get_zdev(pdev);
>  	unsigned int hwirq, msi_vecs;
> diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
> index e48b674..b65bf95 100644
> --- a/arch/x86/kernel/x86_init.c
> +++ b/arch/x86/kernel/x86_init.c
> @@ -121,7 +121,7 @@ struct x86_msi_ops x86_msi = {
>  };
>  
>  /* MSI arch specific hooks */
> -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
>  {
>  	return x86_msi.setup_msi_irqs(dev, nvec, type);
>  }
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 27a7e67..0410d9b 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -56,7 +56,8 @@ void __weak arch_teardown_msi_irq(unsigned int irq)
>  	chip->teardown_irq(chip, irq);
>  }
>  
> -int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
> +int __weak arch_msi_check_device(struct pci_dev *dev,
> +				 int nvec, int nvec_mme, int type)
>  {
>  	struct msi_chip *chip = dev->bus->msi;
>  
> @@ -66,7 +67,8 @@ int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
>  	return chip->check_device(chip, dev, nvec, type);
>  }
>  
> -int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> +int __weak arch_setup_msi_irqs(struct pci_dev *dev,
> +			       int nvec, int nvec_mme, int type)
>  {
>  	struct msi_desc *entry;
>  	int ret;
> @@ -598,6 +600,7 @@ error_attrs:
>   * msi_capability_init - configure device's MSI capability structure
>   * @dev: pointer to the pci_dev data structure of MSI device function
>   * @nvec: number of interrupts to allocate
> + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
>   *
>   * Setup the MSI capability structure of the device with the requested
>   * number of interrupts.  A return value of zero indicates the successful
> @@ -605,7 +608,7 @@ error_attrs:
>   * an error, and a positive return value indicates the number of interrupts
>   * which could have been allocated.
>   */
> -static int msi_capability_init(struct pci_dev *dev, int nvec)
> +static int msi_capability_init(struct pci_dev *dev, int nvec, int nvec_mme)
>  {
>  	struct msi_desc *entry;
>  	int ret;
> @@ -640,7 +643,7 @@ static int msi_capability_init(struct pci_dev *dev, int nvec)
>  	list_add_tail(&entry->list, &dev->msi_list);
>  
>  	/* Configure MSI capability structure */
> -	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
> +	ret = arch_setup_msi_irqs(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
>  	if (ret) {
>  		msi_mask_irq(entry, mask, ~mask);
>  		free_msi_irqs(dev);
> @@ -758,7 +761,8 @@ static int msix_capability_init(struct pci_dev *dev,
>  	if (ret)
>  		return ret;
>  
> -	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
> +	/* Parameter 'nvec_mme' does not make sense in case of MSI-X */
> +	ret = arch_setup_msi_irqs(dev, nvec, 0, PCI_CAP_ID_MSIX);
>  	if (ret)
>  		goto out_avail;
>  
> @@ -812,13 +816,15 @@ out_free:
>   * pci_msi_check_device - check whether MSI may be enabled on a device
>   * @dev: pointer to the pci_dev data structure of MSI device function
>   * @nvec: how many MSIs have been requested ?
> + * @nvec_mme: how many MSIs to write to Multiple Message Enable register ?
>   * @type: are we checking for MSI or MSI-X ?
>   *
>   * Look at global flags, the device itself, and its parent buses
>   * to determine if MSI/-X are supported for the device. If MSI/-X is
>   * supported return 0, else return an error code.
>   **/
> -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
> +static int pci_msi_check_device(struct pci_dev *dev,
> +				int nvec, int nvec_mme, int type)
>  {
>  	struct pci_bus *bus;
>  	int ret;
> @@ -846,7 +852,7 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
>  		if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
>  			return -EINVAL;
>  
> -	ret = arch_msi_check_device(dev, nvec, type);
> +	ret = arch_msi_check_device(dev, nvec, nvec_mme, type);
>  	if (ret)
>  		return ret;
>  
> @@ -878,6 +884,62 @@ int pci_msi_vec_count(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL(pci_msi_vec_count);
>  
> +/**
> + * pci_enable_msi_partial - configure device's MSI capability structure
> + * @dev: device to configure
> + * @nvec: number of interrupts to configure
> + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
> + *
> + * This function tries to allocate @nvec interrupts while setting up the
> + * device's Multiple Message Enable register with @nvec_mme interrupts.
> + * It returns a negative errno if an error occurs. If it succeeds, it returns
> + * zero and updates the @dev's irq member to the lowest new interrupt number;
> + * the other interrupt numbers allocated to this device are consecutive.
> + */
> +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> +{
> +	int maxvec;
> +	int rc;
> +
> +	if (dev->current_state != PCI_D0)
> +		return -EINVAL;
> +
> +	WARN_ON(!!dev->msi_enabled);
> +
> +	/* Check whether driver already requested MSI-X irqs */
> +	if (dev->msix_enabled) {
> +		dev_info(&dev->dev, "can't enable MSI "
> +			 "(MSI-X already enabled)\n");
> +		return -EINVAL;
> +	}
> +
> +	if (!is_power_of_2(nvec_mme))
> +		return -EINVAL;
> +	if (nvec > nvec_mme)
> +		return -EINVAL;
> +
> +	maxvec = pci_msi_vec_count(dev);
> +	if (maxvec < 0)
> +		return maxvec;
> +	else if (nvec_mme > maxvec)
> +		return -EINVAL;
> +
> +	rc = pci_msi_check_device(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
> +	if (rc < 0)
> +		return rc;
> +	else if (rc > 0)
> +		return -ENOSPC;
> +
> +	rc = msi_capability_init(dev, nvec, nvec_mme);
> +	if (rc < 0)
> +		return rc;
> +	else if (rc > 0)
> +		return -ENOSPC;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(pci_enable_msi_partial);
> +
>  void pci_msi_shutdown(struct pci_dev *dev)
>  {
>  	struct msi_desc *desc;
> @@ -957,7 +1019,7 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
>  	if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
>  		return -EINVAL;
>  
> -	status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
> +	status = pci_msi_check_device(dev, nvec, 0, PCI_CAP_ID_MSIX);
>  	if (status)
>  		return status;
>  
> @@ -1110,7 +1172,8 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
>  		nvec = maxvec;
>  
>  	do {
> -		rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
> +		rc = pci_msi_check_device(dev, nvec, roundup_pow_of_two(nvec),
> +					  PCI_CAP_ID_MSI);
>  		if (rc < 0) {
>  			return rc;
>  		} else if (rc > 0) {
> @@ -1121,7 +1184,7 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
>  	} while (rc);
>  
>  	do {
> -		rc = msi_capability_init(dev, nvec);
> +		rc = msi_capability_init(dev, nvec, roundup_pow_of_two(nvec));
>  		if (rc < 0) {
>  			return rc;
>  		} else if (rc > 0) {
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index 92a2f99..b9f89ee 100644
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -57,9 +57,10 @@ struct msi_desc {
>   */
>  int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
>  void arch_teardown_msi_irq(unsigned int irq);
> -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
> +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type);
>  void arch_teardown_msi_irqs(struct pci_dev *dev);
> -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
> +int arch_msi_check_device(struct pci_dev *dev,
> +			  int nvec, int nvec_mme, int type);
>  void arch_restore_msi_irqs(struct pci_dev *dev);
>  
>  void default_teardown_msi_irqs(struct pci_dev *dev);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 71d9673..7360bd2 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1184,6 +1184,7 @@ void pci_disable_msix(struct pci_dev *dev);
>  void msi_remove_pci_irq_vectors(struct pci_dev *dev);
>  void pci_restore_msi_state(struct pci_dev *dev);
>  int pci_msi_enabled(void);
> +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme);
>  int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec);
>  static inline int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
>  {
> @@ -1215,6 +1216,8 @@ static inline void pci_disable_msix(struct pci_dev *dev) { }
>  static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) { }
>  static inline void pci_restore_msi_state(struct pci_dev *dev) { }
>  static inline int pci_msi_enabled(void) { return 0; }
> +static inline int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> +{ return -ENOSYS; }
>  static inline int pci_enable_msi_range(struct pci_dev *dev, int minvec,
>  				       int maxvec)
>  { return -ENOSYS; }
> -- 
> 1.7.7.6

^ permalink raw reply	[flat|nested] 76+ messages in thread

* RE: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-07-02 20:22     ` Bjorn Helgaas
  (?)
  (?)
@ 2014-07-03  9:20       ` David Laight
  -1 siblings, 0 replies; 76+ messages in thread
From: David Laight @ 2014-07-03  9:20 UTC (permalink / raw)
  To: 'Bjorn Helgaas', Alexander Gordeev
  Cc: linux-mips, linux-s390, linux-pci, x86, linux-doc, linux-kernel,
	linux-ide, iommu, xen-devel, linuxppc-dev

From: Bjorn Helgaas
> On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
> > There are PCI devices that require a particular value written
> > to the Multiple Message Enable (MME) register while aligned on
> > power of 2 boundary value of actually used MSI vectors 'nvec'
> > is a lesser of that MME value:
> >
> > 	roundup_pow_of_two(nvec) < 'Multiple Message Enable'
> >
> > However the existing pci_enable_msi_block() interface is not
> > able to configure such devices, since the value written to the
> > MME register is calculated from the number of requested MSIs
> > 'nvec':
> >
> > 	'Multiple Message Enable' = roundup_pow_of_two(nvec)
> 
> For MSI, software learns how many vectors a device requests by reading
> the Multiple Message Capable (MMC) field.  This field is encoded, so a
> device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
> for a device to request 3 vectors; it would have to round that up
> to a power of two and request 4 vectors.
> 
> Software writes similarly encoded values to MME to tell the device how
> many vectors have been allocated for its use.  For example, it's
> impossible to tell the device that it can use 3 vectors; the OS has to
> round that up and tell the device it can use 4 vectors.
> 
> So if I understand correctly, the point of this series is to take
> advantage of device-specific knowledge, e.g., the device requests 4
> vectors via MMC, but we "know" the device is only capable of using 3.
> Moreover, we tell the device via MME that 4 vectors are available, but
> we've only actually set up 3 of them.
...

Even if you do that, you ought to write valid interrupt information
into the 4th slot (maybe replicating one of the earlier interrupts).
Then, if the device does raise the 'unexpected' interrupt you don't
get a write to a random kernel location.

Plausibly something similar should be done when a smaller number of
interrupts is assigned.
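
A userspace model of that suggestion, assuming a hypothetical host-side
table with one entry per vector the device may raise (for instance,
interrupt-remapping entries). With multiple-message MSI the device picks
a vector by modifying the low-order bits of Message Data, so every
index below the MME count must land on something valid. All names here
are illustrative, not kernel APIs:

```c
#include <assert.h>

#define MSI_MAX_VECTORS 32

/* Hypothetical table; -1 marks an entry with no valid interrupt. */
struct msi_vec_table {
	int irq[MSI_MAX_VECTORS];
};

/*
 * Back the nvec allocated IRQs and replicate them into the unused
 * slots [nvec, nvec_mme), so any vector permitted by MME is valid.
 */
static void msi_fill_table(struct msi_vec_table *t, const int *irqs,
			   int nvec, int nvec_mme)
{
	assert(nvec >= 1 && nvec <= nvec_mme && nvec_mme <= MSI_MAX_VECTORS);
	for (int i = 0; i < MSI_MAX_VECTORS; i++)
		t->irq[i] = -1;
	for (int i = 0; i < nvec_mme; i++)
		t->irq[i] = irqs[i % nvec];
}

/* Only the low log2(nvec_mme) bits of Message Data select the vector. */
static int msi_lookup(const struct msi_vec_table *t, unsigned int data,
		      int nvec_mme)
{
	return t->irq[data & (unsigned int)(nvec_mme - 1)];
}
```

With IRQs {40, 41, 42} and an MME of 4, an "unexpected" 4th vector is
delivered to IRQ 40 instead of an unset entry.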

	David


^ permalink raw reply	[flat|nested] 76+ messages in thread

* RE: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
@ 2014-07-03  9:20       ` David Laight
  0 siblings, 0 replies; 76+ messages in thread
From: David Laight @ 2014-07-03  9:20 UTC (permalink / raw)
  To: 'Bjorn Helgaas', Alexander Gordeev
  Cc: linux-mips, linux-s390, linux-doc, linux-pci, x86, linux-kernel,
	linux-ide, iommu, xen-devel, linuxppc-dev

RnJvbTogQmpvcm4gSGVsZ2Fhcw0KPiBPbiBUdWUsIEp1biAxMCwgMjAxNCBhdCAwMzoxMDozMFBN
ICswMjAwLCBBbGV4YW5kZXIgR29yZGVldiB3cm90ZToNCj4gPiBUaGVyZSBhcmUgUENJIGRldmlj
ZXMgdGhhdCByZXF1aXJlIGEgcGFydGljdWxhciB2YWx1ZSB3cml0dGVuDQo+ID4gdG8gdGhlIE11
bHRpcGxlIE1lc3NhZ2UgRW5hYmxlIChNTUUpIHJlZ2lzdGVyIHdoaWxlIGFsaWduZWQgb24NCj4g
PiBwb3dlciBvZiAyIGJvdW5kYXJ5IHZhbHVlIG9mIGFjdHVhbGx5IHVzZWQgTVNJIHZlY3RvcnMg
J252ZWMnDQo+ID4gaXMgYSBsZXNzZXIgb2YgdGhhdCBNTUUgdmFsdWU6DQo+ID4NCj4gPiAJcm91
bmR1cF9wb3dfb2ZfdHdvKG52ZWMpIDwgJ011bHRpcGxlIE1lc3NhZ2UgRW5hYmxlJw0KPiA+DQo+
ID4gSG93ZXZlciB0aGUgZXhpc3RpbmcgcGNpX2VuYWJsZV9tc2lfYmxvY2soKSBpbnRlcmZhY2Ug
aXMgbm90DQo+ID4gYWJsZSB0byBjb25maWd1cmUgc3VjaCBkZXZpY2VzLCBzaW5jZSB0aGUgdmFs
dWUgd3JpdHRlbiB0byB0aGUNCj4gPiBNTUUgcmVnaXN0ZXIgaXMgY2FsY3VsYXRlZCBmcm9tIHRo
ZSBudW1iZXIgb2YgcmVxdWVzdGVkIE1TSXMNCj4gPiAnbnZlYyc6DQo+ID4NCj4gPiAJJ011bHRp
cGxlIE1lc3NhZ2UgRW5hYmxlJyA9IHJvdW5kdXBfcG93X29mX3R3byhudmVjKQ0KPiANCj4gRm9y
IE1TSSwgc29mdHdhcmUgbGVhcm5zIGhvdyBtYW55IHZlY3RvcnMgYSBkZXZpY2UgcmVxdWVzdHMg
YnkgcmVhZGluZw0KPiB0aGUgTXVsdGlwbGUgTWVzc2FnZSBDYXBhYmxlIChNTUMpIGZpZWxkLiAg
VGhpcyBmaWVsZCBpcyBlbmNvZGVkLCBzbyBhDQo+IGRldmljZSBjYW4gb25seSByZXF1ZXN0IDEs
IDIsIDQsIDgsIGV0Yy4sIHZlY3RvcnMuICBJdCdzIGltcG9zc2libGUNCj4gZm9yIGEgZGV2aWNl
IHRvIHJlcXVlc3QgMyB2ZWN0b3JzOyBpdCB3b3VsZCBoYXZlIHRvIHJvdW5kIHVwIHRoYXQgdXAN
Cj4gdG8gYSBwb3dlciBvZiB0d28gYW5kIHJlcXVlc3QgNCB2ZWN0b3JzLg0KPiANCj4gU29mdHdh
cmUgd3JpdGVzIHNpbWlsYXJseSBlbmNvZGVkIHZhbHVlcyB0byBNTUUgdG8gdGVsbCB0aGUgZGV2
aWNlIGhvdw0KPiBtYW55IHZlY3RvcnMgaGF2ZSBiZWVuIGFsbG9jYXRlZCBmb3IgaXRzIHVzZS4g
IEZvciBleGFtcGxlLCBpdCdzDQo+IGltcG9zc2libGUgdG8gdGVsbCB0aGUgZGV2aWNlIHRoYXQg
aXQgY2FuIHVzZSAzIHZlY3RvcnM7IHRoZSBPUyBoYXMgdG8NCj4gcm91bmQgdGhhdCB1cCBhbmQg
dGVsbCB0aGUgZGV2aWNlIGl0IGNhbiB1c2UgNCB2ZWN0b3JzLg0KPiANCj4gU28gaWYgSSB1bmRl
cnN0YW5kIGNvcnJlY3RseSwgdGhlIHBvaW50IG9mIHRoaXMgc2VyaWVzIGlzIHRvIHRha2UNCj4g
YWR2YW50YWdlIG9mIGRldmljZS1zcGVjaWZpYyBrbm93bGVkZ2UsIGUuZy4sIHRoZSBkZXZpY2Ug
cmVxdWVzdHMgNA0KPiB2ZWN0b3JzIHZpYSBNTUMsIGJ1dCB3ZSAia25vdyIgdGhlIGRldmljZSBp
cyBvbmx5IGNhcGFibGUgb2YgdXNpbmcgMy4NCj4gTW9yZW92ZXIsIHdlIHRlbGwgdGhlIGRldmlj
ZSB2aWEgTU1FIHRoYXQgNCB2ZWN0b3JzIGFyZSBhdmFpbGFibGUsIGJ1dA0KPiB3ZSd2ZSBvbmx5
IGFjdHVhbGx5IHNldCB1cCAzIG9mIHRoZW0uDQouLi4NCg0KRXZlbiBpZiB5b3UgZG8gdGhhdCwg
eW91IG91Z2h0IHRvIHdyaXRlIHZhbGlkIGludGVycnVwdCBpbmZvcm1hdGlvbg0KaW50byB0aGUg
NHRoIHNsb3QgKG1heWJlIHJlcGxpY2F0aW5nIG9uZSBvZiB0aGUgZWFybGllciBpbnRlcnJ1cHRz
KS4NClRoZW4sIGlmIHRoZSBkZXZpY2UgZG9lcyByYWlzZSB0aGUgJ3VuZXhwZWN0ZWQnIGludGVy
cnVwdCB5b3UgZG9uJ3QNCmdldCBhIHdyaXRlIHRvIGEgcmFuZG9tIGtlcm5lbCBsb2NhdGlvbi4N
Cg0KUGxhdXNpYmx5IHNvbWV0aGluZyBzaW1pbGFyIHNob3VsZCBiZSBkb25lIHdoZW4gYSBzbWFs
bGVyIG51bWJlciBvZg0KaW50ZXJydXB0cyBpcyBhc3NpZ25lZC4NCg0KCURhdmlkDQoNCg==

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-07-02 20:22     ` Bjorn Helgaas
  (?)
  (?)
@ 2014-07-03  9:20     ` David Laight
  -1 siblings, 0 replies; 76+ messages in thread
From: David Laight @ 2014-07-03  9:20 UTC (permalink / raw)
  To: 'Bjorn Helgaas', Alexander Gordeev
  Cc: linux-mips, linux-s390, linux-doc, linux-pci, x86, linux-kernel,
	linux-ide, iommu, xen-devel, linuxppc-dev

From: Bjorn Helgaas
> On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
> > There are PCI devices that require a particular value written
> > to the Multiple Message Enable (MME) register while aligned on
> > power of 2 boundary value of actually used MSI vectors 'nvec'
> > is a lesser of that MME value:
> >
> > 	roundup_pow_of_two(nvec) < 'Multiple Message Enable'
> >
> > However the existing pci_enable_msi_block() interface is not
> > able to configure such devices, since the value written to the
> > MME register is calculated from the number of requested MSIs
> > 'nvec':
> >
> > 	'Multiple Message Enable' = roundup_pow_of_two(nvec)
> 
> For MSI, software learns how many vectors a device requests by reading
> the Multiple Message Capable (MMC) field.  This field is encoded, so a
> device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
> for a device to request 3 vectors; it would have to round up that up
> to a power of two and request 4 vectors.
> 
> Software writes similarly encoded values to MME to tell the device how
> many vectors have been allocated for its use.  For example, it's
> impossible to tell the device that it can use 3 vectors; the OS has to
> round that up and tell the device it can use 4 vectors.
> 
> So if I understand correctly, the point of this series is to take
> advantage of device-specific knowledge, e.g., the device requests 4
> vectors via MMC, but we "know" the device is only capable of using 3.
> Moreover, we tell the device via MME that 4 vectors are available, but
> we've only actually set up 3 of them.
...

Even if you do that, you ought to write valid interrupt information
into the 4th slot (maybe replicating one of the earlier interrupts).
Then, if the device does raise the 'unexpected' interrupt you don't
get a write to a random kernel location.

Plausibly something similar should be done when a smaller number of
interrupts is assigned.
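David's suggestion can be sketched as a small user-space model. This is hypothetical: struct msi_slot and msi_fill_slots() are invented for illustration and are not a kernel API. With plain MSI the device selects a vector by modifying the low bits of the message data, so each "slot" here stands for the interrupt-controller vector that one such data value reaches. The idea is that every slot the device was granted via MME gets valid routing, and slots beyond nvec replicate earlier ones.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MSI_VECTORS 32 /* MSI allows at most 32 vectors */

/* Hypothetical per-vector routing entry; a real platform would hold
 * interrupt-controller state here instead. */
struct msi_slot {
	int irq;   /* Linux IRQ number this vector is routed to */
	int valid; /* nonzero once the slot has been programmed */
};

/*
 * Program all nvec_mme slots the device may target, even though only
 * nvec of them have real handlers. Slot i >= nvec replicates slot
 * (i % nvec), so an "unexpected" vector still lands on a valid IRQ
 * instead of an unprogrammed entry.
 */
static void msi_fill_slots(struct msi_slot *slots, const int *irqs,
			   size_t nvec, size_t nvec_mme)
{
	size_t i;

	assert(nvec >= 1 && nvec <= nvec_mme && nvec_mme <= MAX_MSI_VECTORS);
	for (i = 0; i < nvec_mme; i++) {
		slots[i].irq = irqs[i % nvec]; /* i % nvec == i for i < nvec */
		slots[i].valid = 1;
	}
}
```

With three real IRQs and MME set to 4, the fourth slot ends up routed to the first IRQ, so a stray fourth message is handled rather than written somewhere arbitrary.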

	David

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-07-02 20:22     ` Bjorn Helgaas
  (?)
@ 2014-07-04  8:57         ` Alexander Gordeev
  -1 siblings, 0 replies; 76+ messages in thread
From: Alexander Gordeev @ 2014-07-04  8:57 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-mips, linux-s390, linux-pci, x86, linux-doc, linux-kernel,
	linux-ide, iommu, xen-devel, linuxppc-dev

On Wed, Jul 02, 2014 at 02:22:01PM -0600, Bjorn Helgaas wrote:
> On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
> > There are PCI devices that require a particular value written
> > to the Multiple Message Enable (MME) register while aligned on
> > power of 2 boundary value of actually used MSI vectors 'nvec'
> > is a lesser of that MME value:
> > 
> > 	roundup_pow_of_two(nvec) < 'Multiple Message Enable'
> > 
> > However the existing pci_enable_msi_block() interface is not
> > able to configure such devices, since the value written to the
> > MME register is calculated from the number of requested MSIs
> > 'nvec':
> > 
> > 	'Multiple Message Enable' = roundup_pow_of_two(nvec)
> 
> For MSI, software learns how many vectors a device requests by reading
> the Multiple Message Capable (MMC) field.  This field is encoded, so a
> device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
> for a device to request 3 vectors; it would have to round up that up
> to a power of two and request 4 vectors.
> 
> Software writes similarly encoded values to MME to tell the device how
> many vectors have been allocated for its use.  For example, it's
> impossible to tell the device that it can use 3 vectors; the OS has to
> round that up and tell the device it can use 4 vectors.

Nod.
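The MMC/MME encoding Bjorn describes can be modeled in a few lines of C. This is a minimal, hypothetical user-space sketch, not the kernel's implementation: the helper names are made up, and only the field positions (MMC in bits 3:1, MME in bits 6:4 of the MSI Message Control register) and the log2 encoding follow the MSI capability layout.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Message Control fields of the MSI capability. The values match
 * Linux's PCI_MSI_FLAGS_QMASK and PCI_MSI_FLAGS_QSIZE; they are
 * redefined here so the sketch builds outside the kernel.
 */
#define MSI_FLAGS_QMASK 0x000e /* Multiple Message Capable, bits 3:1 */
#define MSI_FLAGS_QSIZE 0x0070 /* Multiple Message Enable, bits 6:4 */

/* Vectors requested by the device: MMC holds log2 of the count. */
static unsigned int msi_mmc_vectors(uint16_t control)
{
	return 1u << ((control & MSI_FLAGS_QMASK) >> 1);
}

/* Vectors granted to the device: MME also holds log2 of the count. */
static unsigned int msi_mme_vectors(uint16_t control)
{
	return 1u << ((control & MSI_FLAGS_QSIZE) >> 4);
}

/*
 * Program MME for nvec vectors. Because only log2 is stored, a count
 * that is not a power of two is rounded up (3 becomes 4).
 */
static uint16_t msi_set_mme(uint16_t control, unsigned int nvec)
{
	unsigned int log2 = 0;

	assert(nvec >= 1 && nvec <= 32);
	while ((1u << log2) < nvec)
		log2++;
	control &= (uint16_t)~MSI_FLAGS_QSIZE;
	return (uint16_t)(control | (log2 << 4));
}
```

For example, a device advertising MMC = 2 (4 vectors) for which the driver asks for 3 ends up with MME = 2 as well (Message Control 0x0024), i.e. it is told it may use 4.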

> So if I understand correctly, the point of this series is to take
> advantage of device-specific knowledge, e.g., the device requests 4
> vectors via MMC, but we "know" the device is only capable of using 3.
> Moreover, we tell the device via MME that 4 vectors are available, but
> we've only actually set up 3 of them.

Exactly.

> This makes me uneasy because we're lying to the device, and the device
> is perfectly within spec to use all 4 of those vectors.  If anything
> changes the number of vectors the device uses (new device revision,
> firmware upgrade, etc.), this is liable to break.

If a device has committed, via non-MSI-specific means, to sending only 3
of the 4 available vectors, why should we expect it to send 4? The
probability of firmware sending vector 4/4 in this case is the same as
the probability of it sending 5/4 or 16/4, and for the very same reason:
a bug in the firmware. Moreover, even vector 4/4 would be unexpected by
the device driver, although it is perfectly within the spec.

As for a new device revision, firmware update, etc., that is just another
case of a device driver vs. firmware mismatch. Leaving this change out
does not help there at all, IMHO.

> Can you quantify the benefit of this?  Can't a device already use
> MSI-X to request exactly the number of vectors it can use?  (I know

An Intel AHCI chipset requires 16 vectors to be written to MME, while it
advertises (via AHCI registers) and uses only 6. Even an attempt to
initialize 8 vectors results in the device falling back to 1 (!).
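To put numbers on the AHCI case: the existing path writes roundup_pow_of_two(nvec) to MME, so asking for the 6 vectors actually used programs 8, which this chipset rejects. The helpers below are hypothetical user-space stand-ins (round_up_pow2() mimics the kernel's roundup_pow_of_two()) that only do the arithmetic, assuming nvec = 6 and nvec_mme = 16 as stated above.

```c
#include <assert.h>

/* Smallest power of two >= n (n >= 1); a user-space stand-in for the
 * kernel's roundup_pow_of_two(). */
static unsigned int round_up_pow2(unsigned int n)
{
	unsigned int p = 1;

	assert(n >= 1);
	while (p < n)
		p <<= 1;
	return p;
}

/* Vectors left unallocated when the OS sets up only nvec of the
 * nvec_mme vectors advertised to the device through MME. */
static unsigned int vectors_conserved(unsigned int nvec,
				      unsigned int nvec_mme)
{
	assert(nvec <= nvec_mme);
	return nvec_mme - nvec;
}
```

Here round_up_pow2(6) is 8 (the value the device falls back on), while writing 16 and setting up 6 leaves vectors_conserved(6, 16) = 10 vectors unallocated.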

> not all devices support MSI-X, but maybe we should just accept MSI for
> what it is and encourage the HW guys to use MSI-X if MSI isn't good
> enough.)
> 
> > In this case the result written to the MME register may not
> > satisfy the aforementioned PCI devices requirement and therefore
> > the PCI functions will not operate in a desired mode.
> 
> I'm not sure what you mean by "will not operate in a desired mode."
> I thought this was an optimization to save vectors and that these
> changes would be completely invisible to the hardware.

Yes, this should be invisible to the hardware. The above is an attempt
to describe the Intel AHCI weirdness in general terms :) I think it
could be omitted.

> Bjorn
> 
> > This update introduces pci_enable_msi_partial() extension to
> > pci_enable_msi_block() interface that accepts extra 'nvec_mme'
> > argument which is then written to MME register while the value
> > of 'nvec' is still used to setup as many interrupts as requested.
> > 
> > As result of this change, architecture-specific callbacks
> > arch_msi_check_device() and arch_setup_msi_irqs() get an extra
> > 'nvec_mme' parameter as well, but it is ignored for now.
> > Therefore, this update is a placeholder for architectures that
> > wish to support pci_enable_msi_partial() function in the future.
> > 
> > Cc: linux-doc@vger.kernel.org
> > Cc: linux-mips@linux-mips.org
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: linux-s390@vger.kernel.org
> > Cc: x86@kernel.org
> > Cc: xen-devel@lists.xenproject.org
> > Cc: iommu@lists.linux-foundation.org
> > Cc: linux-ide@vger.kernel.org
> > Cc: linux-pci@vger.kernel.org
> > Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
> > ---
> >  Documentation/PCI/MSI-HOWTO.txt |   36 ++++++++++++++--
> >  arch/mips/pci/msi-octeon.c      |    2 +-
> >  arch/powerpc/kernel/msi.c       |    4 +-
> >  arch/s390/pci/pci.c             |    2 +-
> >  arch/x86/kernel/x86_init.c      |    2 +-
> >  drivers/pci/msi.c               |   83 ++++++++++++++++++++++++++++++++++-----
> >  include/linux/msi.h             |    5 +-
> >  include/linux/pci.h             |    3 +
> >  8 files changed, 115 insertions(+), 22 deletions(-)
> > 
> > diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
> > index 10a9369..c8a8503 100644
> > --- a/Documentation/PCI/MSI-HOWTO.txt
> > +++ b/Documentation/PCI/MSI-HOWTO.txt
> > @@ -195,14 +195,40 @@ By contrast with pci_enable_msi_range() function, pci_enable_msi_exact()
> >  returns zero in case of success, which indicates MSI interrupts have been
> >  successfully allocated.
> >  
> > -4.2.4 pci_disable_msi
> > +4.2.4 pci_enable_msi_partial
> > +
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +
> > +This variation on pci_enable_msi_exact() allows a device driver to
> > +program the PCI function with 'nvec_mme' MSIs, while setting up only
> > +'nvec' MSIs (which may be fewer than 'nvec_mme') in the operating system.
> > +The MSI specification only allows 'nvec_mme' to be allocated in powers
> > +of two, up to a maximum of 2^5 (32).
> > +
> > +This function can be used when a PCI function is known to send only
> > +'nvec' MSIs but still requires 'nvec_mme' MSIs to be configured in its
> > +Multiple Message Enable field. As a result, the 'nvec_mme' - 'nvec'
> > +unused MSIs do not waste system resources.
> > +
> > +If this function returns 0, it has succeeded in allocating 'nvec_mme'
> > +interrupts and setting up 'nvec' interrupts. In this case, the function
> > +enables MSI on this device and updates dev->irq to be the lowest of the
> > +new interrupts assigned to it.  The other interrupts assigned to the
> > +device are in the range dev->irq to dev->irq + nvec - 1.
> > +
> > +If this function returns a negative number, it indicates an error and
> > +the driver should not attempt to request any more MSI interrupts for
> > +this device.
> > +
> > +4.2.5 pci_disable_msi
> >  
> >  void pci_disable_msi(struct pci_dev *dev)
> >  
> > -This function should be used to undo the effect of pci_enable_msi_range().
> > -Calling it restores dev->irq to the pin-based interrupt number and frees
> > -the previously allocated MSIs.  The interrupts may subsequently be assigned
> > -to another device, so drivers should not cache the value of dev->irq.
> > +This function should be used to undo the effect of pci_enable_msi_range()
> > +or pci_enable_msi_partial(). Calling it restores dev->irq to the pin-based
> > +interrupt number and frees the previously allocated MSIs.  The interrupts
> > +may subsequently be assigned to another device, so drivers should not cache
> > +the value of dev->irq.
> >  
> >  Before calling this function, a device driver must always call free_irq()
> >  on any interrupt for which it previously called request_irq().
> > diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c
> > index 2b91b0e..2be7979 100644
> > --- a/arch/mips/pci/msi-octeon.c
> > +++ b/arch/mips/pci/msi-octeon.c
> > @@ -178,7 +178,7 @@ msi_irq_allocated:
> >  	return 0;
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	struct msi_desc *entry;
> >  	int ret;
> > diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
> > index 8bbc12d..c60aee3 100644
> > --- a/arch/powerpc/kernel/msi.c
> > +++ b/arch/powerpc/kernel/msi.c
> > @@ -13,7 +13,7 @@
> >  
> >  #include <asm/machdep.h>
> >  
> > -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
> > +int arch_msi_check_device(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
> >  		pr_debug("msi: Platform doesn't provide MSI callbacks.\n");
> > @@ -32,7 +32,7 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
> >          return 0;
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	return ppc_md.setup_msi_irqs(dev, nvec, type);
> >  }
> > diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> > index 9ddc51e..3cf38a8 100644
> > --- a/arch/s390/pci/pci.c
> > +++ b/arch/s390/pci/pci.c
> > @@ -398,7 +398,7 @@ static void zpci_irq_handler(struct airq_struct *airq)
> >  	}
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int nvec_mme, int type)
> >  {
> >  	struct zpci_dev *zdev = get_zdev(pdev);
> >  	unsigned int hwirq, msi_vecs;
> > diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
> > index e48b674..b65bf95 100644
> > --- a/arch/x86/kernel/x86_init.c
> > +++ b/arch/x86/kernel/x86_init.c
> > @@ -121,7 +121,7 @@ struct x86_msi_ops x86_msi = {
> >  };
> >  
> >  /* MSI arch specific hooks */
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	return x86_msi.setup_msi_irqs(dev, nvec, type);
> >  }
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index 27a7e67..0410d9b 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -56,7 +56,8 @@ void __weak arch_teardown_msi_irq(unsigned int irq)
> >  	chip->teardown_irq(chip, irq);
> >  }
> >  
> > -int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
> > +int __weak arch_msi_check_device(struct pci_dev *dev,
> > +				 int nvec, int nvec_mme, int type)
> >  {
> >  	struct msi_chip *chip = dev->bus->msi;
> >  
> > @@ -66,7 +67,8 @@ int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
> >  	return chip->check_device(chip, dev, nvec, type);
> >  }
> >  
> > -int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int __weak arch_setup_msi_irqs(struct pci_dev *dev,
> > +			       int nvec, int nvec_mme, int type)
> >  {
> >  	struct msi_desc *entry;
> >  	int ret;
> > @@ -598,6 +600,7 @@ error_attrs:
> >   * msi_capability_init - configure device's MSI capability structure
> >   * @dev: pointer to the pci_dev data structure of MSI device function
> >   * @nvec: number of interrupts to allocate
> > + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
> >   *
> >   * Setup the MSI capability structure of the device with the requested
> >   * number of interrupts.  A return value of zero indicates the successful
> > @@ -605,7 +608,7 @@ error_attrs:
> >   * an error, and a positive return value indicates the number of interrupts
> >   * which could have been allocated.
> >   */
> > -static int msi_capability_init(struct pci_dev *dev, int nvec)
> > +static int msi_capability_init(struct pci_dev *dev, int nvec, int nvec_mme)
> >  {
> >  	struct msi_desc *entry;
> >  	int ret;
> > @@ -640,7 +643,7 @@ static int msi_capability_init(struct pci_dev *dev, int nvec)
> >  	list_add_tail(&entry->list, &dev->msi_list);
> >  
> >  	/* Configure MSI capability structure */
> > -	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
> > +	ret = arch_setup_msi_irqs(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
> >  	if (ret) {
> >  		msi_mask_irq(entry, mask, ~mask);
> >  		free_msi_irqs(dev);
> > @@ -758,7 +761,8 @@ static int msix_capability_init(struct pci_dev *dev,
> >  	if (ret)
> >  		return ret;
> >  
> > -	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
> > +	/* Parameter 'nvec_mme' does not make sense in case of MSI-X */
> > +	ret = arch_setup_msi_irqs(dev, nvec, 0, PCI_CAP_ID_MSIX);
> >  	if (ret)
> >  		goto out_avail;
> >  
> > @@ -812,13 +816,15 @@ out_free:
> >   * pci_msi_check_device - check whether MSI may be enabled on a device
> >   * @dev: pointer to the pci_dev data structure of MSI device function
> >   * @nvec: how many MSIs have been requested ?
> > + * @nvec_mme: how many MSIs to write to the Multiple Message Enable register ?
> >   * @type: are we checking for MSI or MSI-X ?
> >   *
> >   * Look at global flags, the device itself, and its parent buses
> >   * to determine if MSI/-X are supported for the device. If MSI/-X is
> >   * supported return 0, else return an error code.
> >   **/
> > -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
> > +static int pci_msi_check_device(struct pci_dev *dev,
> > +				int nvec, int nvec_mme, int type)
> >  {
> >  	struct pci_bus *bus;
> >  	int ret;
> > @@ -846,7 +852,7 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
> >  		if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
> >  			return -EINVAL;
> >  
> > -	ret = arch_msi_check_device(dev, nvec, type);
> > +	ret = arch_msi_check_device(dev, nvec, nvec_mme, type);
> >  	if (ret)
> >  		return ret;
> >  
> > @@ -878,6 +884,62 @@ int pci_msi_vec_count(struct pci_dev *dev)
> >  }
> >  EXPORT_SYMBOL(pci_msi_vec_count);
> >  
> > +/**
> > + * pci_enable_msi_partial - configure device's MSI capability structure
> > + * @dev: device to configure
> > + * @nvec: number of interrupts to configure
> > + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
> > + *
> > + * This function tries to allocate @nvec interrupts while setting up the
> > + * device's Multiple Message Enable register with @nvec_mme interrupts.
> > + * It returns a negative errno if an error occurs. If it succeeds, it returns
> > + * zero and updates the @dev's irq member to the lowest new interrupt number;
> > + * the other interrupt numbers allocated to this device are consecutive.
> > + */
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +{
> > +	int maxvec;
> > +	int rc;
> > +
> > +	if (dev->current_state != PCI_D0)
> > +		return -EINVAL;
> > +
> > +	WARN_ON(!!dev->msi_enabled);
> > +
> > +	/* Check whether driver already requested MSI-X irqs */
> > +	if (dev->msix_enabled) {
> > +		dev_info(&dev->dev, "can't enable MSI "
> > +			 "(MSI-X already enabled)\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (!is_power_of_2(nvec_mme))
> > +		return -EINVAL;
> > +	if (nvec > nvec_mme)
> > +		return -EINVAL;
> > +
> > +	maxvec = pci_msi_vec_count(dev);
> > +	if (maxvec < 0)
> > +		return maxvec;
> > +	else if (nvec_mme > maxvec)
> > +		return -EINVAL;
> > +
> > +	rc = pci_msi_check_device(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
> > +	if (rc < 0)
> > +		return rc;
> > +	else if (rc > 0)
> > +		return -ENOSPC;
> > +
> > +	rc = msi_capability_init(dev, nvec, nvec_mme);
> > +	if (rc < 0)
> > +		return rc;
> > +	else if (rc > 0)
> > +		return -ENOSPC;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(pci_enable_msi_partial);
> > +
> >  void pci_msi_shutdown(struct pci_dev *dev)
> >  {
> >  	struct msi_desc *desc;
> > @@ -957,7 +1019,7 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
> >  	if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
> >  		return -EINVAL;
> >  
> > -	status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
> > +	status = pci_msi_check_device(dev, nvec, 0, PCI_CAP_ID_MSIX);
> >  	if (status)
> >  		return status;
> >  
> > @@ -1110,7 +1172,8 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
> >  		nvec = maxvec;
> >  
> >  	do {
> > -		rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
> > +		rc = pci_msi_check_device(dev, nvec, roundup_pow_of_two(nvec),
> > +					  PCI_CAP_ID_MSI);
> >  		if (rc < 0) {
> >  			return rc;
> >  		} else if (rc > 0) {
> > @@ -1121,7 +1184,7 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
> >  	} while (rc);
> >  
> >  	do {
> > -		rc = msi_capability_init(dev, nvec);
> > +		rc = msi_capability_init(dev, nvec, roundup_pow_of_two(nvec));
> >  		if (rc < 0) {
> >  			return rc;
> >  		} else if (rc > 0) {
> > diff --git a/include/linux/msi.h b/include/linux/msi.h
> > index 92a2f99..b9f89ee 100644
> > --- a/include/linux/msi.h
> > +++ b/include/linux/msi.h
> > @@ -57,9 +57,10 @@ struct msi_desc {
> >   */
> >  int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
> >  void arch_teardown_msi_irq(unsigned int irq);
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type);
> >  void arch_teardown_msi_irqs(struct pci_dev *dev);
> > -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
> > +int arch_msi_check_device(struct pci_dev *dev,
> > +			  int nvec, int nvec_mme, int type);
> >  void arch_restore_msi_irqs(struct pci_dev *dev);
> >  
> >  void default_teardown_msi_irqs(struct pci_dev *dev);
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 71d9673..7360bd2 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -1184,6 +1184,7 @@ void pci_disable_msix(struct pci_dev *dev);
> >  void msi_remove_pci_irq_vectors(struct pci_dev *dev);
> >  void pci_restore_msi_state(struct pci_dev *dev);
> >  int pci_msi_enabled(void);
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme);
> >  int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec);
> >  static inline int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
> >  {
> > @@ -1215,6 +1216,8 @@ static inline void pci_disable_msix(struct pci_dev *dev) { }
> >  static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) { }
> >  static inline void pci_restore_msi_state(struct pci_dev *dev) { }
> >  static inline int pci_msi_enabled(void) { return 0; }
> > +static inline int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +{ return -ENOSYS; }
> >  static inline int pci_enable_msi_range(struct pci_dev *dev, int minvec,
> >  				       int maxvec)
> >  { return -ENOSYS; }
> > -- 
> > 1.7.7.6
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

^ permalink raw reply	[flat|nested] 76+ messages in thread
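The argument checks that pci_enable_msi_partial() performs in the patch above, before it touches the hardware, can be modeled in isolation. This is a hypothetical user-space sketch: check_partial_args() and is_pow2() are made-up names, maxvec stands in for the value pci_msi_vec_count() would return (negative on error), and the device-state and MSI-X checks are omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Same test as the kernel's is_power_of_2(): nonzero with one bit set. */
static bool is_pow2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Mirrors, in the same order, the parameter checks of the posted
 * pci_enable_msi_partial(): nvec_mme must be a power of two, nvec must
 * not exceed it, and nvec_mme must not exceed what the device advertises.
 */
static int check_partial_args(int nvec, int nvec_mme, int maxvec)
{
	if (!is_pow2((unsigned long)nvec_mme))
		return -EINVAL;
	if (nvec > nvec_mme)
		return -EINVAL;
	if (maxvec < 0)
		return maxvec; /* propagate the pci_msi_vec_count() error */
	if (nvec_mme > maxvec)
		return -EINVAL;

	assert(nvec <= nvec_mme && nvec_mme <= maxvec);
	return 0;
}
```

For the AHCI case, check_partial_args(6, 16, 16) passes, while a non-power-of-two nvec_mme such as 12 is rejected with -EINVAL.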

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
@ 2014-07-04  8:57         ` Alexander Gordeev
  0 siblings, 0 replies; 76+ messages in thread
From: Alexander Gordeev @ 2014-07-04  8:57 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-kernel, linux-doc, linux-mips, linuxppc-dev, linux-s390,
	x86, xen-devel, iommu, linux-ide, linux-pci

On Wed, Jul 02, 2014 at 02:22:01PM -0600, Bjorn Helgaas wrote:
> On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
> > There are PCI devices that require a particular value written
> > to the Multiple Message Enable (MME) register while aligned on
> > power of 2 boundary value of actually used MSI vectors 'nvec'
> > is a lesser of that MME value:
> > 
> > 	roundup_pow_of_two(nvec) < 'Multiple Message Enable'
> > 
> > However the existing pci_enable_msi_block() interface is not
> > able to configure such devices, since the value written to the
> > MME register is calculated from the number of requested MSIs
> > 'nvec':
> > 
> > 	'Multiple Message Enable' = roundup_pow_of_two(nvec)
> 
> For MSI, software learns how many vectors a device requests by reading
> the Multiple Message Capable (MMC) field.  This field is encoded, so a
> device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
> for a device to request 3 vectors; it would have to round up that up
> to a power of two and request 4 vectors.
> 
> Software writes similarly encoded values to MME to tell the device how
> many vectors have been allocated for its use.  For example, it's
> impossible to tell the device that it can use 3 vectors; the OS has to
> round that up and tell the device it can use 4 vectors.

Nod.

> So if I understand correctly, the point of this series is to take
> advantage of device-specific knowledge, e.g., the device requests 4
> vectors via MMC, but we "know" the device is only capable of using 3.
> Moreover, we tell the device via MME that 4 vectors are available, but
> we've only actually set up 3 of them.

Exactly.

> This makes me uneasy because we're lying to the device, and the device
> is perfectly within spec to use all 4 of those vectors.  If anything
> changes the number of vectors the device uses (new device revision,
> firmware upgrade, etc.), this is liable to break.

If a device has committed, via non-MSI-specific means, to sending only 3
of the 4 available vectors, why should we expect it to send 4? The
probability of firmware sending vector 4/4 in this case is the same as
the probability of it sending 5/4 or 16/4, and for the very same reason:
a bug in the firmware. Moreover, even vector 4/4 would be unexpected by
the device driver, although it is perfectly within the spec.

As for a new device revision, firmware update, etc., that is just another
case of a device driver vs. firmware mismatch. Leaving this change out
does not help there at all, IMHO.

> Can you quantify the benefit of this?  Can't a device already use
> MSI-X to request exactly the number of vectors it can use?  (I know

An Intel AHCI chipset requires 16 vectors to be written to MME, while it
advertises (via AHCI registers) and uses only 6. Even an attempt to
initialize 8 vectors results in the device falling back to 1 (!).

> not all devices support MSI-X, but maybe we should just accept MSI for
> what it is and encourage the HW guys to use MSI-X if MSI isn't good
> enough.)
> 
> > In this case the result written to the MME register may not
> > satisfy the aforementioned PCI devices requirement and therefore
> > the PCI functions will not operate in a desired mode.
> 
> I'm not sure what you mean by "will not operate in a desired mode."
> I thought this was an optimization to save vectors and that these
> changes would be completely invisible to the hardware.

Yes, this should be invisible to the hardware. The above is an attempt
to describe the Intel AHCI weirdness in general terms :) I think it
could be omitted.

> Bjorn
> 
> > This update introduces pci_enable_msi_partial() extension to
> > pci_enable_msi_block() interface that accepts extra 'nvec_mme'
> > argument which is then written to MME register while the value
> > of 'nvec' is still used to setup as many interrupts as requested.
> > 
> > As result of this change, architecture-specific callbacks
> > arch_msi_check_device() and arch_setup_msi_irqs() get an extra
> > 'nvec_mme' parameter as well, but it is ignored for now.
> > Therefore, this update is a placeholder for architectures that
> > wish to support pci_enable_msi_partial() function in the future.
> > 
> > Cc: linux-doc@vger.kernel.org
> > Cc: linux-mips@linux-mips.org
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: linux-s390@vger.kernel.org
> > Cc: x86@kernel.org
> > Cc: xen-devel@lists.xenproject.org
> > Cc: iommu@lists.linux-foundation.org
> > Cc: linux-ide@vger.kernel.org
> > Cc: linux-pci@vger.kernel.org
> > Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
> > ---
> >  Documentation/PCI/MSI-HOWTO.txt |   36 ++++++++++++++--
> >  arch/mips/pci/msi-octeon.c      |    2 +-
> >  arch/powerpc/kernel/msi.c       |    4 +-
> >  arch/s390/pci/pci.c             |    2 +-
> >  arch/x86/kernel/x86_init.c      |    2 +-
> >  drivers/pci/msi.c               |   83 ++++++++++++++++++++++++++++++++++-----
> >  include/linux/msi.h             |    5 +-
> >  include/linux/pci.h             |    3 +
> >  8 files changed, 115 insertions(+), 22 deletions(-)
> > 
> > diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
> > index 10a9369..c8a8503 100644
> > --- a/Documentation/PCI/MSI-HOWTO.txt
> > +++ b/Documentation/PCI/MSI-HOWTO.txt
> > @@ -195,14 +195,40 @@ By contrast with pci_enable_msi_range() function, pci_enable_msi_exact()
> >  returns zero in case of success, which indicates MSI interrupts have been
> >  successfully allocated.
> >  
> > -4.2.4 pci_disable_msi
> > +4.2.4 pci_enable_msi_partial
> > +
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +
> > +This variation on pci_enable_msi_exact() call allows a device driver to
> > +setup 'nvec_mme' number of multiple MSIs with the PCI function, while
> > +setup only 'nvec' (which could be a lesser of 'nvec_mme') number of MSIs
> > +in operating system. The MSI specification only allows 'nvec_mme' to be
> > +allocated in powers of two, up to a maximum of 2^5 (32).
> > +
> > +This function could be used when a PCI function is known to send 'nvec'
> > +MSIs, but still requires a particular number of MSIs 'nvec_mme' to be
> > +initialized with. As result, 'nvec_mme' - 'nvec' number of unused MSIs
> > +do not waste system resources.
> > +
> > +If this function returns 0, it has succeeded in allocating 'nvec_mme'
> > +interrupts and setting up 'nvec' interrupts. In this case, the function
> > +enables MSI on this device and updates dev->irq to be the lowest of the
> > +new interrupts assigned to it.  The other interrupts assigned to the
> > +device are in the range dev->irq to dev->irq + nvec - 1.
> > +
> > +If this function returns a negative number, it indicates an error and
> > +the driver should not attempt to request any more MSI interrupts for
> > +this device.
> > +
> > +4.2.5 pci_disable_msi
> >  
> >  void pci_disable_msi(struct pci_dev *dev)
> >  
> > -This function should be used to undo the effect of pci_enable_msi_range().
> > -Calling it restores dev->irq to the pin-based interrupt number and frees
> > -the previously allocated MSIs.  The interrupts may subsequently be assigned
> > -to another device, so drivers should not cache the value of dev->irq.
> > +This function should be used to undo the effect of pci_enable_msi_range()
> > +or pci_enable_msi_partial(). Calling it restores dev->irq to the pin-based
> > +interrupt number and frees the previously allocated MSIs.  The interrupts
> > +may subsequently be assigned to another device, so drivers should not cache
> > +the value of dev->irq.
> >  
> >  Before calling this function, a device driver must always call free_irq()
> >  on any interrupt for which it previously called request_irq().
> > diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c
> > index 2b91b0e..2be7979 100644
> > --- a/arch/mips/pci/msi-octeon.c
> > +++ b/arch/mips/pci/msi-octeon.c
> > @@ -178,7 +178,7 @@ msi_irq_allocated:
> >  	return 0;
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	struct msi_desc *entry;
> >  	int ret;
> > diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
> > index 8bbc12d..c60aee3 100644
> > --- a/arch/powerpc/kernel/msi.c
> > +++ b/arch/powerpc/kernel/msi.c
> > @@ -13,7 +13,7 @@
> >  
> >  #include <asm/machdep.h>
> >  
> > -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
> > +int arch_msi_check_device(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
> >  		pr_debug("msi: Platform doesn't provide MSI callbacks.\n");
> > @@ -32,7 +32,7 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
> >          return 0;
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	return ppc_md.setup_msi_irqs(dev, nvec, type);
> >  }
> > diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> > index 9ddc51e..3cf38a8 100644
> > --- a/arch/s390/pci/pci.c
> > +++ b/arch/s390/pci/pci.c
> > @@ -398,7 +398,7 @@ static void zpci_irq_handler(struct airq_struct *airq)
> >  	}
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int nvec_mme, int type)
> >  {
> >  	struct zpci_dev *zdev = get_zdev(pdev);
> >  	unsigned int hwirq, msi_vecs;
> > diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
> > index e48b674..b65bf95 100644
> > --- a/arch/x86/kernel/x86_init.c
> > +++ b/arch/x86/kernel/x86_init.c
> > @@ -121,7 +121,7 @@ struct x86_msi_ops x86_msi = {
> >  };
> >  
> >  /* MSI arch specific hooks */
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	return x86_msi.setup_msi_irqs(dev, nvec, type);
> >  }
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index 27a7e67..0410d9b 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -56,7 +56,8 @@ void __weak arch_teardown_msi_irq(unsigned int irq)
> >  	chip->teardown_irq(chip, irq);
> >  }
> >  
> > -int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
> > +int __weak arch_msi_check_device(struct pci_dev *dev,
> > +				 int nvec, int nvec_mme, int type)
> >  {
> >  	struct msi_chip *chip = dev->bus->msi;
> >  
> > @@ -66,7 +67,8 @@ int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
> >  	return chip->check_device(chip, dev, nvec, type);
> >  }
> >  
> > -int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int __weak arch_setup_msi_irqs(struct pci_dev *dev,
> > +			       int nvec, int nvec_mme, int type)
> >  {
> >  	struct msi_desc *entry;
> >  	int ret;
> > @@ -598,6 +600,7 @@ error_attrs:
> >   * msi_capability_init - configure device's MSI capability structure
> >   * @dev: pointer to the pci_dev data structure of MSI device function
> >   * @nvec: number of interrupts to allocate
> > + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
> >   *
> >   * Setup the MSI capability structure of the device with the requested
> >   * number of interrupts.  A return value of zero indicates the successful
> > @@ -605,7 +608,7 @@ error_attrs:
> >   * an error, and a positive return value indicates the number of interrupts
> >   * which could have been allocated.
> >   */
> > -static int msi_capability_init(struct pci_dev *dev, int nvec)
> > +static int msi_capability_init(struct pci_dev *dev, int nvec, int nvec_mme)
> >  {
> >  	struct msi_desc *entry;
> >  	int ret;
> > @@ -640,7 +643,7 @@ static int msi_capability_init(struct pci_dev *dev, int nvec)
> >  	list_add_tail(&entry->list, &dev->msi_list);
> >  
> >  	/* Configure MSI capability structure */
> > -	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
> > +	ret = arch_setup_msi_irqs(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
> >  	if (ret) {
> >  		msi_mask_irq(entry, mask, ~mask);
> >  		free_msi_irqs(dev);
> > @@ -758,7 +761,8 @@ static int msix_capability_init(struct pci_dev *dev,
> >  	if (ret)
> >  		return ret;
> >  
> > -	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
> > +	/* Parameter 'nvec_mme' does not make sense in case of MSI-X */
> > +	ret = arch_setup_msi_irqs(dev, nvec, 0, PCI_CAP_ID_MSIX);
> >  	if (ret)
> >  		goto out_avail;
> >  
> > @@ -812,13 +816,15 @@ out_free:
> >   * pci_msi_check_device - check whether MSI may be enabled on a device
> >   * @dev: pointer to the pci_dev data structure of MSI device function
> >   * @nvec: how many MSIs have been requested ?
> > + * @nvec_mme: how many MSIs write to Multiple Message Enable register ?
> >   * @type: are we checking for MSI or MSI-X ?
> >   *
> >   * Look at global flags, the device itself, and its parent buses
> >   * to determine if MSI/-X are supported for the device. If MSI/-X is
> >   * supported return 0, else return an error code.
> >   **/
> > -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
> > +static int pci_msi_check_device(struct pci_dev *dev,
> > +				int nvec, int nvec_mme, int type)
> >  {
> >  	struct pci_bus *bus;
> >  	int ret;
> > @@ -846,7 +852,7 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
> >  		if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
> >  			return -EINVAL;
> >  
> > -	ret = arch_msi_check_device(dev, nvec, type);
> > +	ret = arch_msi_check_device(dev, nvec, nvec_mme, type);
> >  	if (ret)
> >  		return ret;
> >  
> > @@ -878,6 +884,62 @@ int pci_msi_vec_count(struct pci_dev *dev)
> >  }
> >  EXPORT_SYMBOL(pci_msi_vec_count);
> >  
> > +/**
> > + * pci_enable_msi_partial - configure device's MSI capability structure
> > + * @dev: device to configure
> > + * @nvec: number of interrupts to configure
> > + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
> > + *
> > + * This function tries to allocate @nvec number of interrupts while setup
> > + * device's Multiple Message Enable register with @nvec_mme interrupts.
> > + * It returns a negative errno if an error occurs. If it succeeds, it returns
> > + * zero and updates the @dev's irq member to the lowest new interrupt number;
> > + * the other interrupt numbers allocated to this device are consecutive.
> > + */
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +{
> > +	int maxvec;
> > +	int rc;
> > +
> > +	if (dev->current_state != PCI_D0)
> > +		return -EINVAL;
> > +
> > +	WARN_ON(!!dev->msi_enabled);
> > +
> > +	/* Check whether driver already requested MSI-X irqs */
> > +	if (dev->msix_enabled) {
> > +		dev_info(&dev->dev, "can't enable MSI "
> > +			 "(MSI-X already enabled)\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (!is_power_of_2(nvec_mme))
> > +		return -EINVAL;
> > +	if (nvec > nvec_mme)
> > +		return -EINVAL;
> > +
> > +	maxvec = pci_msi_vec_count(dev);
> > +	if (maxvec < 0)
> > +		return maxvec;
> > +	else if (nvec_mme > maxvec)
> > +		return -EINVAL;
> > +
> > +	rc = pci_msi_check_device(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
> > +	if (rc < 0)
> > +		return rc;
> > +	else if (rc > 0)
> > +		return -ENOSPC;
> > +
> > +	rc = msi_capability_init(dev, nvec, nvec_mme);
> > +	if (rc < 0)
> > +		return rc;
> > +	else if (rc > 0)
> > +		return -ENOSPC;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(pci_enable_msi_partial);
> > +
> >  void pci_msi_shutdown(struct pci_dev *dev)
> >  {
> >  	struct msi_desc *desc;
> > @@ -957,7 +1019,7 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
> >  	if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
> >  		return -EINVAL;
> >  
> > -	status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
> > +	status = pci_msi_check_device(dev, nvec, 0, PCI_CAP_ID_MSIX);
> >  	if (status)
> >  		return status;
> >  
> > @@ -1110,7 +1172,8 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
> >  		nvec = maxvec;
> >  
> >  	do {
> > -		rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
> > +		rc = pci_msi_check_device(dev, nvec, roundup_pow_of_two(nvec),
> > +					  PCI_CAP_ID_MSI);
> >  		if (rc < 0) {
> >  			return rc;
> >  		} else if (rc > 0) {
> > @@ -1121,7 +1184,7 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
> >  	} while (rc);
> >  
> >  	do {
> > -		rc = msi_capability_init(dev, nvec);
> > +		rc = msi_capability_init(dev, nvec, roundup_pow_of_two(nvec));
> >  		if (rc < 0) {
> >  			return rc;
> >  		} else if (rc > 0) {
> > diff --git a/include/linux/msi.h b/include/linux/msi.h
> > index 92a2f99..b9f89ee 100644
> > --- a/include/linux/msi.h
> > +++ b/include/linux/msi.h
> > @@ -57,9 +57,10 @@ struct msi_desc {
> >   */
> >  int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
> >  void arch_teardown_msi_irq(unsigned int irq);
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type);
> >  void arch_teardown_msi_irqs(struct pci_dev *dev);
> > -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
> > +int arch_msi_check_device(struct pci_dev *dev,
> > +			  int nvec, int nvec_mme, int type);
> >  void arch_restore_msi_irqs(struct pci_dev *dev);
> >  
> >  void default_teardown_msi_irqs(struct pci_dev *dev);
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 71d9673..7360bd2 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -1184,6 +1184,7 @@ void pci_disable_msix(struct pci_dev *dev);
> >  void msi_remove_pci_irq_vectors(struct pci_dev *dev);
> >  void pci_restore_msi_state(struct pci_dev *dev);
> >  int pci_msi_enabled(void);
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme);
> >  int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec);
> >  static inline int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
> >  {
> > @@ -1215,6 +1216,8 @@ static inline void pci_disable_msix(struct pci_dev *dev) { }
> >  static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) { }
> >  static inline void pci_restore_msi_state(struct pci_dev *dev) { }
> >  static inline int pci_msi_enabled(void) { return 0; }
> > +static inline int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +{ return -ENOSYS; }
> >  static inline int pci_enable_msi_range(struct pci_dev *dev, int minvec,
> >  				       int maxvec)
> >  { return -ENOSYS; }
> > -- 
> > 1.7.7.6
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com


* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
From: Alexander Gordeev @ 2014-07-04  8:57 UTC
  To: Bjorn Helgaas
  Cc: linux-mips, linux-s390, linux-pci, x86, linux-doc, linux-kernel,
	linux-ide, iommu, xen-devel, linuxppc-dev

On Wed, Jul 02, 2014 at 02:22:01PM -0600, Bjorn Helgaas wrote:
> On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
> > There are PCI devices that require a particular value written
> > to the Multiple Message Enable (MME) register while aligned on
> > power of 2 boundary value of actually used MSI vectors 'nvec'
> > is a lesser of that MME value:
> > 
> > 	roundup_pow_of_two(nvec) < 'Multiple Message Enable'
> > 
> > However the existing pci_enable_msi_block() interface is not
> > able to configure such devices, since the value written to the
> > MME register is calculated from the number of requested MSIs
> > 'nvec':
> > 
> > 	'Multiple Message Enable' = roundup_pow_of_two(nvec)
> 
> For MSI, software learns how many vectors a device requests by reading
> the Multiple Message Capable (MMC) field.  This field is encoded, so a
> device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
> for a device to request 3 vectors; it would have to round that up
> to a power of two and request 4 vectors.
> 
> Software writes similarly encoded values to MME to tell the device how
> many vectors have been allocated for its use.  For example, it's
> impossible to tell the device that it can use 3 vectors; the OS has to
> round that up and tell the device it can use 4 vectors.

Nod.

> So if I understand correctly, the point of this series is to take
> advantage of device-specific knowledge, e.g., the device requests 4
> vectors via MMC, but we "know" the device is only capable of using 3.
> Moreover, we tell the device via MME that 4 vectors are available, but
> we've only actually set up 3 of them.

Exactly.

> This makes me uneasy because we're lying to the device, and the device
> is perfectly within spec to use all 4 of those vectors.  If anything
> changes the number of vectors the device uses (new device revision,
> firmware upgrade, etc.), this is liable to break.

If a device has committed, via non-MSI-specific means, to sending only 3
vectors out of the 4 available, why should we expect it to send 4? The
probability of firmware sending vector 4/4 in this case is equal to the
probability of it sending 5/4 or 16/4, for the very same reason: a bug in
the firmware.
Moreover, even vector 4/4 would be unexpected by the device driver, though
it is perfectly within the spec.

As for a new device revision, firmware update, etc.: that is just another
case of a device driver vs. firmware match/mismatch. Leaving this change
out does not help here at all, IMHO.

> Can you quantify the benefit of this?  Can't a device already use
> MSI-X to request exactly the number of vectors it can use?  (I know

An Intel AHCI chipset requires 16 vectors to be written to MME, while it
advertises (via AHCI registers) and uses only 6. Even an attempt to
initialize 8 vectors results in the device falling back to 1 (!).

> not all devices support MSI-X, but maybe we should just accept MSI for
> what it is and encourage the HW guys to use MSI-X if MSI isn't good
> enough.)
> 
> > In this case the result written to the MME register may not
> > satisfy the aforementioned PCI devices requirement and therefore
> > the PCI functions will not operate in a desired mode.
> 
> I'm not sure what you mean by "will not operate in a desired mode."
> I thought this was an optimization to save vectors and that these
> changes would be completely invisible to the hardware.

Yes, this should be invisible to the hardware. The above is an attempt
to describe the Intel AHCI weirdness in general terms :) I think it
could be omitted.

> Bjorn
> 
> > This update introduces pci_enable_msi_partial() extension to
> > pci_enable_msi_block() interface that accepts extra 'nvec_mme'
> > argument which is then written to MME register while the value
> > of 'nvec' is still used to setup as many interrupts as requested.
> > 
> > As result of this change, architecture-specific callbacks
> > arch_msi_check_device() and arch_setup_msi_irqs() get an extra
> > 'nvec_mme' parameter as well, but it is ignored for now.
> > Therefore, this update is a placeholder for architectures that
> > wish to support pci_enable_msi_partial() function in the future.
> > 
> > Cc: linux-doc@vger.kernel.org
> > Cc: linux-mips@linux-mips.org
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: linux-s390@vger.kernel.org
> > Cc: x86@kernel.org
> > Cc: xen-devel@lists.xenproject.org
> > Cc: iommu@lists.linux-foundation.org
> > Cc: linux-ide@vger.kernel.org
> > Cc: linux-pci@vger.kernel.org
> > Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
> > ---
> >  Documentation/PCI/MSI-HOWTO.txt |   36 ++++++++++++++--
> >  arch/mips/pci/msi-octeon.c      |    2 +-
> >  arch/powerpc/kernel/msi.c       |    4 +-
> >  arch/s390/pci/pci.c             |    2 +-
> >  arch/x86/kernel/x86_init.c      |    2 +-
> >  drivers/pci/msi.c               |   83 ++++++++++++++++++++++++++++++++++-----
> >  include/linux/msi.h             |    5 +-
> >  include/linux/pci.h             |    3 +
> >  8 files changed, 115 insertions(+), 22 deletions(-)
> > 
> > diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
> > index 10a9369..c8a8503 100644
> > --- a/Documentation/PCI/MSI-HOWTO.txt
> > +++ b/Documentation/PCI/MSI-HOWTO.txt
> > @@ -195,14 +195,40 @@ By contrast with pci_enable_msi_range() function, pci_enable_msi_exact()
> >  returns zero in case of success, which indicates MSI interrupts have been
> >  successfully allocated.
> >  
> > -4.2.4 pci_disable_msi
> > +4.2.4 pci_enable_msi_partial
> > +
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +
> > +This variation on pci_enable_msi_exact() call allows a device driver to
> > +setup 'nvec_mme' number of multiple MSIs with the PCI function, while
> > +setup only 'nvec' (which could be a lesser of 'nvec_mme') number of MSIs
> > +in operating system. The MSI specification only allows 'nvec_mme' to be
> > +allocated in powers of two, up to a maximum of 2^5 (32).
> > +
> > +This function could be used when a PCI function is known to send 'nvec'
> > +MSIs, but still requires a particular number of MSIs 'nvec_mme' to be
> > +initialized with. As result, 'nvec_mme' - 'nvec' number of unused MSIs
> > +do not waste system resources.
> > +
> > +If this function returns 0, it has succeeded in allocating 'nvec_mme'
> > +interrupts and setting up 'nvec' interrupts. In this case, the function
> > +enables MSI on this device and updates dev->irq to be the lowest of the
> > +new interrupts assigned to it.  The other interrupts assigned to the
> > +device are in the range dev->irq to dev->irq + nvec - 1.
> > +
> > +If this function returns a negative number, it indicates an error and
> > +the driver should not attempt to request any more MSI interrupts for
> > +this device.
> > +
> > +4.2.5 pci_disable_msi
> >  
> >  void pci_disable_msi(struct pci_dev *dev)
> >  
> > -This function should be used to undo the effect of pci_enable_msi_range().
> > -Calling it restores dev->irq to the pin-based interrupt number and frees
> > -the previously allocated MSIs.  The interrupts may subsequently be assigned
> > -to another device, so drivers should not cache the value of dev->irq.
> > +This function should be used to undo the effect of pci_enable_msi_range()
> > +or pci_enable_msi_partial(). Calling it restores dev->irq to the pin-based
> > +interrupt number and frees the previously allocated MSIs.  The interrupts
> > +may subsequently be assigned to another device, so drivers should not cache
> > +the value of dev->irq.
> >  
> >  Before calling this function, a device driver must always call free_irq()
> >  on any interrupt for which it previously called request_irq().
> > diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c
> > index 2b91b0e..2be7979 100644
> > --- a/arch/mips/pci/msi-octeon.c
> > +++ b/arch/mips/pci/msi-octeon.c
> > @@ -178,7 +178,7 @@ msi_irq_allocated:
> >  	return 0;
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	struct msi_desc *entry;
> >  	int ret;
> > diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
> > index 8bbc12d..c60aee3 100644
> > --- a/arch/powerpc/kernel/msi.c
> > +++ b/arch/powerpc/kernel/msi.c
> > @@ -13,7 +13,7 @@
> >  
> >  #include <asm/machdep.h>
> >  
> > -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
> > +int arch_msi_check_device(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
> >  		pr_debug("msi: Platform doesn't provide MSI callbacks.\n");
> > @@ -32,7 +32,7 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
> >          return 0;
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	return ppc_md.setup_msi_irqs(dev, nvec, type);
> >  }
> > diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> > index 9ddc51e..3cf38a8 100644
> > --- a/arch/s390/pci/pci.c
> > +++ b/arch/s390/pci/pci.c
> > @@ -398,7 +398,7 @@ static void zpci_irq_handler(struct airq_struct *airq)
> >  	}
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int nvec_mme, int type)
> >  {
> >  	struct zpci_dev *zdev = get_zdev(pdev);
> >  	unsigned int hwirq, msi_vecs;
> > diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
> > index e48b674..b65bf95 100644
> > --- a/arch/x86/kernel/x86_init.c
> > +++ b/arch/x86/kernel/x86_init.c
> > @@ -121,7 +121,7 @@ struct x86_msi_ops x86_msi = {
> >  };
> >  
> >  /* MSI arch specific hooks */
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	return x86_msi.setup_msi_irqs(dev, nvec, type);
> >  }
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index 27a7e67..0410d9b 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -56,7 +56,8 @@ void __weak arch_teardown_msi_irq(unsigned int irq)
> >  	chip->teardown_irq(chip, irq);
> >  }
> >  
> > -int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
> > +int __weak arch_msi_check_device(struct pci_dev *dev,
> > +				 int nvec, int nvec_mme, int type)
> >  {
> >  	struct msi_chip *chip = dev->bus->msi;
> >  
> > @@ -66,7 +67,8 @@ int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
> >  	return chip->check_device(chip, dev, nvec, type);
> >  }
> >  
> > -int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int __weak arch_setup_msi_irqs(struct pci_dev *dev,
> > +			       int nvec, int nvec_mme, int type)
> >  {
> >  	struct msi_desc *entry;
> >  	int ret;
> > @@ -598,6 +600,7 @@ error_attrs:
> >   * msi_capability_init - configure device's MSI capability structure
> >   * @dev: pointer to the pci_dev data structure of MSI device function
> >   * @nvec: number of interrupts to allocate
> > + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
> >   *
> >   * Setup the MSI capability structure of the device with the requested
> >   * number of interrupts.  A return value of zero indicates the successful
> > @@ -605,7 +608,7 @@ error_attrs:
> >   * an error, and a positive return value indicates the number of interrupts
> >   * which could have been allocated.
> >   */
> > -static int msi_capability_init(struct pci_dev *dev, int nvec)
> > +static int msi_capability_init(struct pci_dev *dev, int nvec, int nvec_mme)
> >  {
> >  	struct msi_desc *entry;
> >  	int ret;
> > @@ -640,7 +643,7 @@ static int msi_capability_init(struct pci_dev *dev, int nvec)
> >  	list_add_tail(&entry->list, &dev->msi_list);
> >  
> >  	/* Configure MSI capability structure */
> > -	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
> > +	ret = arch_setup_msi_irqs(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
> >  	if (ret) {
> >  		msi_mask_irq(entry, mask, ~mask);
> >  		free_msi_irqs(dev);
> > @@ -758,7 +761,8 @@ static int msix_capability_init(struct pci_dev *dev,
> >  	if (ret)
> >  		return ret;
> >  
> > -	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
> > +	/* Parameter 'nvec_mme' does not make sense in case of MSI-X */
> > +	ret = arch_setup_msi_irqs(dev, nvec, 0, PCI_CAP_ID_MSIX);
> >  	if (ret)
> >  		goto out_avail;
> >  
> > @@ -812,13 +816,15 @@ out_free:
> >   * pci_msi_check_device - check whether MSI may be enabled on a device
> >   * @dev: pointer to the pci_dev data structure of MSI device function
> >   * @nvec: how many MSIs have been requested ?
> > + * @nvec_mme: how many MSIs write to Multiple Message Enable register ?
> >   * @type: are we checking for MSI or MSI-X ?
> >   *
> >   * Look at global flags, the device itself, and its parent buses
> >   * to determine if MSI/-X are supported for the device. If MSI/-X is
> >   * supported return 0, else return an error code.
> >   **/
> > -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
> > +static int pci_msi_check_device(struct pci_dev *dev,
> > +				int nvec, int nvec_mme, int type)
> >  {
> >  	struct pci_bus *bus;
> >  	int ret;
> > @@ -846,7 +852,7 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
> >  		if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
> >  			return -EINVAL;
> >  
> > -	ret = arch_msi_check_device(dev, nvec, type);
> > +	ret = arch_msi_check_device(dev, nvec, nvec_mme, type);
> >  	if (ret)
> >  		return ret;
> >  
> > @@ -878,6 +884,62 @@ int pci_msi_vec_count(struct pci_dev *dev)
> >  }
> >  EXPORT_SYMBOL(pci_msi_vec_count);
> >  
> > +/**
> > + * pci_enable_msi_partial - configure device's MSI capability structure
> > + * @dev: device to configure
> > + * @nvec: number of interrupts to configure
> > + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
> > + *
> > + * This function tries to allocate @nvec number of interrupts while setup
> > + * device's Multiple Message Enable register with @nvec_mme interrupts.
> > + * It returns a negative errno if an error occurs. If it succeeds, it returns
> > + * zero and updates the @dev's irq member to the lowest new interrupt number;
> > + * the other interrupt numbers allocated to this device are consecutive.
> > + */
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +{
> > +	int maxvec;
> > +	int rc;
> > +
> > +	if (dev->current_state != PCI_D0)
> > +		return -EINVAL;
> > +
> > +	WARN_ON(!!dev->msi_enabled);
> > +
> > +	/* Check whether driver already requested MSI-X irqs */
> > +	if (dev->msix_enabled) {
> > +		dev_info(&dev->dev, "can't enable MSI "
> > +			 "(MSI-X already enabled)\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (!is_power_of_2(nvec_mme))
> > +		return -EINVAL;
> > +	if (nvec > nvec_mme)
> > +		return -EINVAL;
> > +
> > +	maxvec = pci_msi_vec_count(dev);
> > +	if (maxvec < 0)
> > +		return maxvec;
> > +	else if (nvec_mme > maxvec)
> > +		return -EINVAL;
> > +
> > +	rc = pci_msi_check_device(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
> > +	if (rc < 0)
> > +		return rc;
> > +	else if (rc > 0)
> > +		return -ENOSPC;
> > +
> > +	rc = msi_capability_init(dev, nvec, nvec_mme);
> > +	if (rc < 0)
> > +		return rc;
> > +	else if (rc > 0)
> > +		return -ENOSPC;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(pci_enable_msi_partial);
> > +
> >  void pci_msi_shutdown(struct pci_dev *dev)
> >  {
> >  	struct msi_desc *desc;
> > @@ -957,7 +1019,7 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
> >  	if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
> >  		return -EINVAL;
> >  
> > -	status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
> > +	status = pci_msi_check_device(dev, nvec, 0, PCI_CAP_ID_MSIX);
> >  	if (status)
> >  		return status;
> >  
> > @@ -1110,7 +1172,8 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
> >  		nvec = maxvec;
> >  
> >  	do {
> > -		rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
> > +		rc = pci_msi_check_device(dev, nvec, roundup_pow_of_two(nvec),
> > +					  PCI_CAP_ID_MSI);
> >  		if (rc < 0) {
> >  			return rc;
> >  		} else if (rc > 0) {
> > @@ -1121,7 +1184,7 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
> >  	} while (rc);
> >  
> >  	do {
> > -		rc = msi_capability_init(dev, nvec);
> > +		rc = msi_capability_init(dev, nvec, roundup_pow_of_two(nvec));
> >  		if (rc < 0) {
> >  			return rc;
> >  		} else if (rc > 0) {
> > diff --git a/include/linux/msi.h b/include/linux/msi.h
> > index 92a2f99..b9f89ee 100644
> > --- a/include/linux/msi.h
> > +++ b/include/linux/msi.h
> > @@ -57,9 +57,10 @@ struct msi_desc {
> >   */
> >  int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
> >  void arch_teardown_msi_irq(unsigned int irq);
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type);
> >  void arch_teardown_msi_irqs(struct pci_dev *dev);
> > -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
> > +int arch_msi_check_device(struct pci_dev *dev,
> > +			  int nvec, int nvec_mme, int type);
> >  void arch_restore_msi_irqs(struct pci_dev *dev);
> >  
> >  void default_teardown_msi_irqs(struct pci_dev *dev);
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 71d9673..7360bd2 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -1184,6 +1184,7 @@ void pci_disable_msix(struct pci_dev *dev);
> >  void msi_remove_pci_irq_vectors(struct pci_dev *dev);
> >  void pci_restore_msi_state(struct pci_dev *dev);
> >  int pci_msi_enabled(void);
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme);
> >  int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec);
> >  static inline int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
> >  {
> > @@ -1215,6 +1216,8 @@ static inline void pci_disable_msix(struct pci_dev *dev) { }
> >  static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) { }
> >  static inline void pci_restore_msi_state(struct pci_dev *dev) { }
> >  static inline int pci_msi_enabled(void) { return 0; }
> > +static inline int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +{ return -ENOSYS; }
> >  static inline int pci_enable_msi_range(struct pci_dev *dev, int minvec,
> >  				       int maxvec)
> >  { return -ENOSYS; }
> > -- 
> > 1.7.7.6
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-07-02 20:22     ` Bjorn Helgaas
                       ` (3 preceding siblings ...)
  (?)
@ 2014-07-04  8:57     ` Alexander Gordeev
  -1 siblings, 0 replies; 76+ messages in thread
From: Alexander Gordeev @ 2014-07-04  8:57 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-mips, linux-s390, linux-pci, x86, linux-doc, linux-kernel,
	linux-ide, iommu, xen-devel, linuxppc-dev

On Wed, Jul 02, 2014 at 02:22:01PM -0600, Bjorn Helgaas wrote:
> On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
> > There are PCI devices that require a particular value written
> > to the Multiple Message Enable (MME) register, even though the
> > number of MSI vectors actually used, 'nvec', rounded up to a
> > power of two, is less than that MME value:
> > 
> > 	roundup_pow_of_two(nvec) < 'Multiple Message Enable'
> > 
> > However the existing pci_enable_msi_block() interface is not
> > able to configure such devices, since the value written to the
> > MME register is calculated from the number of requested MSIs
> > 'nvec':
> > 
> > 	'Multiple Message Enable' = roundup_pow_of_two(nvec)
> 
> For MSI, software learns how many vectors a device requests by reading
> the Multiple Message Capable (MMC) field.  This field is encoded, so a
> device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
> for a device to request 3 vectors; it would have to round that up
> to a power of two and request 4 vectors.
> 
> Software writes similarly encoded values to MME to tell the device how
> many vectors have been allocated for its use.  For example, it's
> impossible to tell the device that it can use 3 vectors; the OS has to
> round that up and tell the device it can use 4 vectors.

Nod.
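The MMC/MME encoding Bjorn describes can be illustrated with a small user-space sketch. The mask values are assumed to mirror PCI_MSI_FLAGS_QMASK and PCI_MSI_FLAGS_QSIZE from the kernel's pci_regs.h; the helper names (msi_mmc_count(), msi_mme_count(), msi_set_mme()) are hypothetical and not kernel API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * MMC and MME fields of the MSI Message Control register.  The values
 * mirror PCI_MSI_FLAGS_QMASK and PCI_MSI_FLAGS_QSIZE from the kernel's
 * pci_regs.h; they are redefined here so the sketch builds standalone.
 */
#define MSI_FLAGS_MMC 0x000e	/* Multiple Message Capable, bits 3:1 */
#define MSI_FLAGS_MME 0x0070	/* Multiple Message Enable, bits 6:4 */

/* Vectors the device requests: MMC holds log2 of the count. */
static unsigned int msi_mmc_count(uint16_t ctrl)
{
	return 1u << ((ctrl & MSI_FLAGS_MMC) >> 1);
}

/* Vectors the OS has told the device it may use. */
static unsigned int msi_mme_count(uint16_t ctrl)
{
	return 1u << ((ctrl & MSI_FLAGS_MME) >> 4);
}

/*
 * Encode a vector count into MME.  Only powers of two are expressible,
 * so a count of 3 is rounded up and written as 4.
 */
static uint16_t msi_set_mme(uint16_t ctrl, unsigned int nvec)
{
	unsigned int order = 0;

	assert(nvec >= 1 && nvec <= 32);
	while ((1u << order) < nvec)	/* smallest power of two >= nvec */
		order++;
	return (uint16_t)((ctrl & ~MSI_FLAGS_MME) | (order << 4));
}
```

For example, a Message Control value of 0x0004 (MMC = 2) means the device requests 4 vectors, and msi_set_mme(ctrl, 3) still encodes 4 in MME, which is why the OS cannot tell the device it has exactly 3.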

> So if I understand correctly, the point of this series is to take
> advantage of device-specific knowledge, e.g., the device requests 4
> vectors via MMC, but we "know" the device is only capable of using 3.
> Moreover, we tell the device via MME that 4 vectors are available, but
> we've only actually set up 3 of them.

Exactly.

> This makes me uneasy because we're lying to the device, and the device
> is perfectly within spec to use all 4 of those vectors.  If anything
> changes the number of vectors the device uses (new device revision,
> firmware upgrade, etc.), this is liable to break.

If a device has committed, via non-MSI-specific means, to sending only
3 of the 4 available vectors, why should we expect it to send a 4th?
The probability of the firmware sending vector 4/4 in this case equals
the probability of it sending 5/4 or 16/4, and for the very same reason:
a bug in the firmware. Moreover, vector 4/4 would be unexpected by the
device driver, even though it is perfectly within the spec.

As for a new device revision, firmware update, etc.: that is just another
case of the device driver matching or mismatching the firmware. Leaving
this change out does not help there at all, IMHO.

> Can you quantify the benefit of this?  Can't a device already use
> MSI-X to request exactly the number of vectors it can use?  (I know

An Intel AHCI chipset requires 16 vectors to be written to MME, while it
advertises (via AHCI registers) and uses only 6. Even an attempt to
initialize 8 vectors results in the device falling back to 1 (!).

> not all devices support MSI-X, but maybe we should just accept MSI for
> what it is and encourage the HW guys to use MSI-X if MSI isn't good
> enough.)
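The AHCI numbers above can be put side by side in a small model. It assumes, as described, that the chipset works in multi-vector mode only when MME is 16 and that it raises only 6 vectors; struct msi_plan and the plan_*() helpers are hypothetical names used for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/* What the OS sets up versus what it advertises through MME. */
struct msi_plan {
	unsigned int allocated;	/* vectors actually set up by the OS */
	unsigned int mme;	/* vector count written to MME */
};

static unsigned int roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* pci_enable_msi_range() after the patch: MME is derived from nvec. */
static struct msi_plan plan_range(unsigned int nvec)
{
	struct msi_plan p = { nvec, roundup_pow2(nvec) };

	return p;
}

/* pci_enable_msi_partial(): the driver picks MME independently. */
static struct msi_plan plan_partial(unsigned int nvec, unsigned int nvec_mme)
{
	struct msi_plan p = { nvec, nvec_mme };

	return p;
}

/*
 * Hypothetical model of the chipset described above: it runs in
 * multi-vector mode only if MME grants all 16 vectors.
 */
static bool chipset_multi_vector(struct msi_plan p)
{
	return p.mme == 16;
}

/* Vectors the OS set up but the device never raises. */
static unsigned int wasted(struct msi_plan p, unsigned int used)
{
	return p.allocated - used;
}
```

Under these assumptions, plan_range(6) writes 8 to MME and the device falls back to a single vector, so the only working choice with pci_enable_msi_range() is to set up all 16 vectors, leaving 10 unused. plan_partial(6, 16) keeps MME at 16 while setting up only 6.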
> 
> > In this case the result written to the MME register may not
> > satisfy the aforementioned PCI devices requirement and therefore
> > the PCI functions will not operate in a desired mode.
> 
> I'm not sure what you mean by "will not operate in a desired mode."
> I thought this was an optimization to save vectors and that these
> changes would be completely invisible to the hardware.

Yes, this should be invisible to the hardware. The above is an attempt
to describe the Intel AHCI weirdness in general terms :) I think it
could be omitted.

> Bjorn
> 
> > This update introduces pci_enable_msi_partial(), an extension of
> > the pci_enable_msi_block() interface that accepts an extra 'nvec_mme'
> > argument, which is written to the MME register, while 'nvec' is
> > still used to set up as many interrupts as requested.
> > 
> > As a result of this change, the architecture-specific callbacks
> > arch_msi_check_device() and arch_setup_msi_irqs() get an extra
> > 'nvec_mme' parameter as well, but it is ignored for now.
> > Therefore, this update is a placeholder for architectures that
> > wish to support pci_enable_msi_partial() function in the future.
> > 
> > Cc: linux-doc@vger.kernel.org
> > Cc: linux-mips@linux-mips.org
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: linux-s390@vger.kernel.org
> > Cc: x86@kernel.org
> > Cc: xen-devel@lists.xenproject.org
> > Cc: iommu@lists.linux-foundation.org
> > Cc: linux-ide@vger.kernel.org
> > Cc: linux-pci@vger.kernel.org
> > Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
> > ---
> >  Documentation/PCI/MSI-HOWTO.txt |   36 ++++++++++++++--
> >  arch/mips/pci/msi-octeon.c      |    2 +-
> >  arch/powerpc/kernel/msi.c       |    4 +-
> >  arch/s390/pci/pci.c             |    2 +-
> >  arch/x86/kernel/x86_init.c      |    2 +-
> >  drivers/pci/msi.c               |   83 ++++++++++++++++++++++++++++++++++-----
> >  include/linux/msi.h             |    5 +-
> >  include/linux/pci.h             |    3 +
> >  8 files changed, 115 insertions(+), 22 deletions(-)
> > 
> > diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
> > index 10a9369..c8a8503 100644
> > --- a/Documentation/PCI/MSI-HOWTO.txt
> > +++ b/Documentation/PCI/MSI-HOWTO.txt
> > @@ -195,14 +195,40 @@ By contrast with pci_enable_msi_range() function, pci_enable_msi_exact()
> >  returns zero in case of success, which indicates MSI interrupts have been
> >  successfully allocated.
> >  
> > -4.2.4 pci_disable_msi
> > +4.2.4 pci_enable_msi_partial
> > +
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +
> > +This variation on pci_enable_msi_exact() allows a device driver to
> > +program the PCI function's Multiple Message Enable register with
> > +'nvec_mme' MSIs, while setting up only 'nvec' (which may be less than
> > +'nvec_mme') MSIs in the operating system. The MSI specification only
> > +allows 'nvec_mme' to be a power of two, up to a maximum of 2^5 (32).
> > +
> > +This function can be used when a PCI function is known to send only
> > +'nvec' MSIs, but still requires a particular number of MSIs, 'nvec_mme',
> > +to be initialized. As a result, the 'nvec_mme' - 'nvec' unused MSIs
> > +do not waste system resources.
> > +
> > +If this function returns 0, it has succeeded in allocating 'nvec_mme'
> > +interrupts and setting up 'nvec' interrupts. In this case, the function
> > +enables MSI on this device and updates dev->irq to be the lowest of the
> > +new interrupts assigned to it.  The other interrupts assigned to the
> > +device are in the range dev->irq to dev->irq + nvec - 1.
> > +
> > +If this function returns a negative number, it indicates an error and
> > +the driver should not attempt to request any more MSI interrupts for
> > +this device.
> > +
> > +4.2.5 pci_disable_msi
> >  
> >  void pci_disable_msi(struct pci_dev *dev)
> >  
> > -This function should be used to undo the effect of pci_enable_msi_range().
> > -Calling it restores dev->irq to the pin-based interrupt number and frees
> > -the previously allocated MSIs.  The interrupts may subsequently be assigned
> > -to another device, so drivers should not cache the value of dev->irq.
> > +This function should be used to undo the effect of pci_enable_msi_range()
> > +or pci_enable_msi_partial(). Calling it restores dev->irq to the pin-based
> > +interrupt number and frees the previously allocated MSIs.  The interrupts
> > +may subsequently be assigned to another device, so drivers should not cache
> > +the value of dev->irq.
> >  
> >  Before calling this function, a device driver must always call free_irq()
> >  on any interrupt for which it previously called request_irq().
> > diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c
> > index 2b91b0e..2be7979 100644
> > --- a/arch/mips/pci/msi-octeon.c
> > +++ b/arch/mips/pci/msi-octeon.c
> > @@ -178,7 +178,7 @@ msi_irq_allocated:
> >  	return 0;
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	struct msi_desc *entry;
> >  	int ret;
> > diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
> > index 8bbc12d..c60aee3 100644
> > --- a/arch/powerpc/kernel/msi.c
> > +++ b/arch/powerpc/kernel/msi.c
> > @@ -13,7 +13,7 @@
> >  
> >  #include <asm/machdep.h>
> >  
> > -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
> > +int arch_msi_check_device(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
> >  		pr_debug("msi: Platform doesn't provide MSI callbacks.\n");
> > @@ -32,7 +32,7 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
> >          return 0;
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	return ppc_md.setup_msi_irqs(dev, nvec, type);
> >  }
> > diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> > index 9ddc51e..3cf38a8 100644
> > --- a/arch/s390/pci/pci.c
> > +++ b/arch/s390/pci/pci.c
> > @@ -398,7 +398,7 @@ static void zpci_irq_handler(struct airq_struct *airq)
> >  	}
> >  }
> >  
> > -int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int nvec_mme, int type)
> >  {
> >  	struct zpci_dev *zdev = get_zdev(pdev);
> >  	unsigned int hwirq, msi_vecs;
> > diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
> > index e48b674..b65bf95 100644
> > --- a/arch/x86/kernel/x86_init.c
> > +++ b/arch/x86/kernel/x86_init.c
> > @@ -121,7 +121,7 @@ struct x86_msi_ops x86_msi = {
> >  };
> >  
> >  /* MSI arch specific hooks */
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
> >  {
> >  	return x86_msi.setup_msi_irqs(dev, nvec, type);
> >  }
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index 27a7e67..0410d9b 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -56,7 +56,8 @@ void __weak arch_teardown_msi_irq(unsigned int irq)
> >  	chip->teardown_irq(chip, irq);
> >  }
> >  
> > -int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
> > +int __weak arch_msi_check_device(struct pci_dev *dev,
> > +				 int nvec, int nvec_mme, int type)
> >  {
> >  	struct msi_chip *chip = dev->bus->msi;
> >  
> > @@ -66,7 +67,8 @@ int __weak arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
> >  	return chip->check_device(chip, dev, nvec, type);
> >  }
> >  
> > -int __weak arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> > +int __weak arch_setup_msi_irqs(struct pci_dev *dev,
> > +			       int nvec, int nvec_mme, int type)
> >  {
> >  	struct msi_desc *entry;
> >  	int ret;
> > @@ -598,6 +600,7 @@ error_attrs:
> >   * msi_capability_init - configure device's MSI capability structure
> >   * @dev: pointer to the pci_dev data structure of MSI device function
> >   * @nvec: number of interrupts to allocate
> > + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
> >   *
> >   * Setup the MSI capability structure of the device with the requested
> >   * number of interrupts.  A return value of zero indicates the successful
> > @@ -605,7 +608,7 @@ error_attrs:
> >   * an error, and a positive return value indicates the number of interrupts
> >   * which could have been allocated.
> >   */
> > -static int msi_capability_init(struct pci_dev *dev, int nvec)
> > +static int msi_capability_init(struct pci_dev *dev, int nvec, int nvec_mme)
> >  {
> >  	struct msi_desc *entry;
> >  	int ret;
> > @@ -640,7 +643,7 @@ static int msi_capability_init(struct pci_dev *dev, int nvec)
> >  	list_add_tail(&entry->list, &dev->msi_list);
> >  
> >  	/* Configure MSI capability structure */
> > -	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
> > +	ret = arch_setup_msi_irqs(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
> >  	if (ret) {
> >  		msi_mask_irq(entry, mask, ~mask);
> >  		free_msi_irqs(dev);
> > @@ -758,7 +761,8 @@ static int msix_capability_init(struct pci_dev *dev,
> >  	if (ret)
> >  		return ret;
> >  
> > -	ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
> > +	/* Parameter 'nvec_mme' does not make sense in case of MSI-X */
> > +	ret = arch_setup_msi_irqs(dev, nvec, 0, PCI_CAP_ID_MSIX);
> >  	if (ret)
> >  		goto out_avail;
> >  
> > @@ -812,13 +816,15 @@ out_free:
> >   * pci_msi_check_device - check whether MSI may be enabled on a device
> >   * @dev: pointer to the pci_dev data structure of MSI device function
> >   * @nvec: how many MSIs have been requested ?
> > + * @nvec_mme: how many MSIs to write to Multiple Message Enable register ?
> >   * @type: are we checking for MSI or MSI-X ?
> >   *
> >   * Look at global flags, the device itself, and its parent buses
> >   * to determine if MSI/-X are supported for the device. If MSI/-X is
> >   * supported return 0, else return an error code.
> >   **/
> > -static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
> > +static int pci_msi_check_device(struct pci_dev *dev,
> > +				int nvec, int nvec_mme, int type)
> >  {
> >  	struct pci_bus *bus;
> >  	int ret;
> > @@ -846,7 +852,7 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
> >  		if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
> >  			return -EINVAL;
> >  
> > -	ret = arch_msi_check_device(dev, nvec, type);
> > +	ret = arch_msi_check_device(dev, nvec, nvec_mme, type);
> >  	if (ret)
> >  		return ret;
> >  
> > @@ -878,6 +884,62 @@ int pci_msi_vec_count(struct pci_dev *dev)
> >  }
> >  EXPORT_SYMBOL(pci_msi_vec_count);
> >  
> > +/**
> > + * pci_enable_msi_partial - configure device's MSI capability structure
> > + * @dev: device to configure
> > + * @nvec: number of interrupts to configure
> > + * @nvec_mme: number of interrupts to write to Multiple Message Enable register
> > + *
> > + * This function tries to allocate @nvec interrupts while programming
> > + * the device's Multiple Message Enable register with @nvec_mme.
> > + * It returns a negative errno if an error occurs. If it succeeds, it returns
> > + * zero and updates the @dev's irq member to the lowest new interrupt number;
> > + * the other interrupt numbers allocated to this device are consecutive.
> > + */
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +{
> > +	int maxvec;
> > +	int rc;
> > +
> > +	if (dev->current_state != PCI_D0)
> > +		return -EINVAL;
> > +
> > +	WARN_ON(!!dev->msi_enabled);
> > +
> > +	/* Check whether driver already requested MSI-X irqs */
> > +	if (dev->msix_enabled) {
> > +		dev_info(&dev->dev, "can't enable MSI "
> > +			 "(MSI-X already enabled)\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (!is_power_of_2(nvec_mme))
> > +		return -EINVAL;
> > +	if (nvec > nvec_mme)
> > +		return -EINVAL;
> > +
> > +	maxvec = pci_msi_vec_count(dev);
> > +	if (maxvec < 0)
> > +		return maxvec;
> > +	else if (nvec_mme > maxvec)
> > +		return -EINVAL;
> > +
> > +	rc = pci_msi_check_device(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
> > +	if (rc < 0)
> > +		return rc;
> > +	else if (rc > 0)
> > +		return -ENOSPC;
> > +
> > +	rc = msi_capability_init(dev, nvec, nvec_mme);
> > +	if (rc < 0)
> > +		return rc;
> > +	else if (rc > 0)
> > +		return -ENOSPC;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(pci_enable_msi_partial);
> > +
> >  void pci_msi_shutdown(struct pci_dev *dev)
> >  {
> >  	struct msi_desc *desc;
> > @@ -957,7 +1019,7 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
> >  	if (!entries || !dev->msix_cap || dev->current_state != PCI_D0)
> >  		return -EINVAL;
> >  
> > -	status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
> > +	status = pci_msi_check_device(dev, nvec, 0, PCI_CAP_ID_MSIX);
> >  	if (status)
> >  		return status;
> >  
> > @@ -1110,7 +1172,8 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
> >  		nvec = maxvec;
> >  
> >  	do {
> > -		rc = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
> > +		rc = pci_msi_check_device(dev, nvec, roundup_pow_of_two(nvec),
> > +					  PCI_CAP_ID_MSI);
> >  		if (rc < 0) {
> >  			return rc;
> >  		} else if (rc > 0) {
> > @@ -1121,7 +1184,7 @@ int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)
> >  	} while (rc);
> >  
> >  	do {
> > -		rc = msi_capability_init(dev, nvec);
> > +		rc = msi_capability_init(dev, nvec, roundup_pow_of_two(nvec));
> >  		if (rc < 0) {
> >  			return rc;
> >  		} else if (rc > 0) {
> > diff --git a/include/linux/msi.h b/include/linux/msi.h
> > index 92a2f99..b9f89ee 100644
> > --- a/include/linux/msi.h
> > +++ b/include/linux/msi.h
> > @@ -57,9 +57,10 @@ struct msi_desc {
> >   */
> >  int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
> >  void arch_teardown_msi_irq(unsigned int irq);
> > -int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
> > +int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type);
> >  void arch_teardown_msi_irqs(struct pci_dev *dev);
> > -int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
> > +int arch_msi_check_device(struct pci_dev *dev,
> > +			  int nvec, int nvec_mme, int type);
> >  void arch_restore_msi_irqs(struct pci_dev *dev);
> >  
> >  void default_teardown_msi_irqs(struct pci_dev *dev);
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 71d9673..7360bd2 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -1184,6 +1184,7 @@ void pci_disable_msix(struct pci_dev *dev);
> >  void msi_remove_pci_irq_vectors(struct pci_dev *dev);
> >  void pci_restore_msi_state(struct pci_dev *dev);
> >  int pci_msi_enabled(void);
> > +int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme);
> >  int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec);
> >  static inline int pci_enable_msi_exact(struct pci_dev *dev, int nvec)
> >  {
> > @@ -1215,6 +1216,8 @@ static inline void pci_disable_msix(struct pci_dev *dev) { }
> >  static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) { }
> >  static inline void pci_restore_msi_state(struct pci_dev *dev) { }
> >  static inline int pci_msi_enabled(void) { return 0; }
> > +static inline int pci_enable_msi_partial(struct pci_dev *dev, int nvec, int nvec_mme)
> > +{ return -ENOSYS; }
> >  static inline int pci_enable_msi_range(struct pci_dev *dev, int minvec,
> >  				       int maxvec)
> >  { return -ENOSYS; }
> > -- 
> > 1.7.7.6
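The argument checks in pci_enable_msi_partial() in the patch above can be exercised outside the kernel with a small model. This is a sketch only: check_partial_args() is a hypothetical name, 'maxvec' stands in for the value pci_msi_vec_count() would return, and the device-state checks (current_state, msix_enabled) as well as the calls into pci_msi_check_device() and msi_capability_init() are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

static bool is_pow2(unsigned int n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Mirror of the argument checks in pci_enable_msi_partial().
 * 'maxvec' stands in for pci_msi_vec_count(dev): the count the device
 * advertises through MMC, or a negative errno.  Like the patch, this
 * does not reject nvec < 1.
 */
static int check_partial_args(int nvec, int nvec_mme, int maxvec)
{
	if (!is_pow2((unsigned int)nvec_mme))
		return -EINVAL;	/* MME can only encode powers of two */
	if (nvec > nvec_mme)
		return -EINVAL;	/* cannot set up more than MME grants */
	if (maxvec < 0)
		return maxvec;	/* error from reading the capability */
	if (nvec_mme > maxvec)
		return -EINVAL;	/* more than the device requests via MMC */
	return 0;
}
```

For the AHCI case, (6, 16) against a device advertising 16 vectors passes, while a non-power-of-two nvec_mme such as 12, or an nvec_mme larger than what MMC advertises, is rejected with -EINVAL.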

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-07-03  9:20       ` David Laight
  (?)
@ 2014-07-04  8:58           ` Alexander Gordeev
  -1 siblings, 0 replies; 76+ messages in thread
From: Alexander Gordeev @ 2014-07-04  8:58 UTC (permalink / raw)
  To: David Laight
  Cc: linux-mips, linux-s390, linux-doc, linux-pci, x86, linux-kernel,
	linux-ide, iommu, 'Bjorn Helgaas',
	xen-devel, linuxppc-dev

On Thu, Jul 03, 2014 at 09:20:52AM +0000, David Laight wrote:
> From: Bjorn Helgaas
> > On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
> > > There are PCI devices that require a particular value written
> > > to the Multiple Message Enable (MME) register, even though the
> > > number of MSI vectors actually used, 'nvec', rounded up to a
> > > power of two, is less than that MME value:
> > >
> > > 	roundup_pow_of_two(nvec) < 'Multiple Message Enable'
> > >
> > > However the existing pci_enable_msi_block() interface is not
> > > able to configure such devices, since the value written to the
> > > MME register is calculated from the number of requested MSIs
> > > 'nvec':
> > >
> > > 	'Multiple Message Enable' = roundup_pow_of_two(nvec)
> > 
> > For MSI, software learns how many vectors a device requests by reading
> > the Multiple Message Capable (MMC) field.  This field is encoded, so a
> > device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
> > for a device to request 3 vectors; it would have to round that up
> > to a power of two and request 4 vectors.
> > 
> > Software writes similarly encoded values to MME to tell the device how
> > many vectors have been allocated for its use.  For example, it's
> > impossible to tell the device that it can use 3 vectors; the OS has to
> > round that up and tell the device it can use 4 vectors.
> > 
> > So if I understand correctly, the point of this series is to take
> > advantage of device-specific knowledge, e.g., the device requests 4
> > vectors via MMC, but we "know" the device is only capable of using 3.
> > Moreover, we tell the device via MME that 4 vectors are available, but
> > we've only actually set up 3 of them.
> ...
> 
> Even if you do that, you ought to write valid interrupt information
> into the 4th slot (maybe replicating one of the earlier interrupts).
> Then, if the device does raise the 'unexpected' interrupt you don't
> get a write to a random kernel location.

I might be missing something, but we are talking about the MSI address
space here, aren't we? I do not see how we could end up with a 'write'
to a random kernel location when an unclaimed MSI vector is sent. At
worst we could expect a spurious interrupt, which is handled and reported.

Anyway, as I described in my reply to Bjorn, this is not a concern IMO.
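The reasoning above rests on how multi-message MSI works: per the PCI specification, a function granted more than one vector generates vector i by modifying the low-order bits of Message Data, while the Message Address stays the same. A minimal sketch of that behavior follows; struct msi_write and msi_vector_write() are hypothetical names, and the address and data values used with it are examples only.

```c
#include <assert.h>
#include <stdint.h>

/* One MSI memory write as it appears on the bus. */
struct msi_write {
	uint64_t address;	/* Message Address: identical for every vector */
	uint16_t data;		/* Message Data with the vector index in the low bits */
};

/*
 * Write issued by a multi-message MSI function for vector 'i' when MME
 * grants 'mme' vectors.  The function may only replace the low
 * log2(mme) bits of Message Data; it never alters the address.
 */
static struct msi_write msi_vector_write(uint64_t addr, uint16_t data,
					 unsigned int mme, unsigned int i)
{
	struct msi_write w;
	unsigned int low_mask = mme - 1;

	assert(mme >= 1 && mme <= 32 && (mme & (mme - 1)) == 0);
	assert(i < mme);
	w.address = addr;
	w.data = (uint16_t)((data & ~low_mask) | i);
	return w;
}
```

So the write for vector 3 of 4 goes to the same address as vectors 0 to 2 and differs only in its data. What the platform does with a data value that has no handler set up is platform-specific, and this sketch does not model it.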

> Plausibly something similar should be done when a smaller number of
> interrupts is assigned.
> 
> 	David
> 

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

* RE: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-07-04  8:58           ` Alexander Gordeev
  (?)
@ 2014-07-04  9:11               ` David Laight
  -1 siblings, 0 replies; 76+ messages in thread
From: David Laight @ 2014-07-04  9:11 UTC (permalink / raw)
  To: 'Alexander Gordeev'
  Cc: linux-mips, linux-s390, linux-doc, linux-pci, x86, linux-kernel,
	linux-ide, iommu, 'Bjorn Helgaas',
	xen-devel, linuxppc-dev

From: Alexander Gordeev
...
> > Even if you do that, you ought to write valid interrupt information
> > into the 4th slot (maybe replicating one of the earlier interrupts).
> > Then, if the device does raise the 'unexpected' interrupt you don't
> > get a write to a random kernel location.
> 
> I might be missing something, but we are talking about MSI address space
> here, aren't we? I do not see how we could end up with a 'write'
> to a random kernel location when an unclaimed MSI vector is sent. At worst
> we could expect a spurious interrupt, which is handled and reported.
> 
> Anyway, as I described in my reply to Bjorn, this is not a concern IMO.

I'm thinking of the following, which might be MSI-X?
1) Hardware requests some interrupts and tells the host the BAR (and offset)
   where the 'vectors' should be written.
2) To raise an interrupt the hardware uses the 'vector' as the address
   of a normal PCIe write cycle.

So if the hardware requests 4 interrupts but the driver (believing it
will only use 3) writes only 3 vectors, and the hardware then uses the
4th vector, it can write to a random location.

Debugging that would be hard!

	David

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-07-04  9:11               ` David Laight
  (?)
@ 2014-07-04  9:54                   ` Alexander Gordeev
  -1 siblings, 0 replies; 76+ messages in thread
From: Alexander Gordeev @ 2014-07-04  9:54 UTC (permalink / raw)
  To: David Laight
  Cc: linux-mips, linux-s390, linux-doc, linux-pci, x86, linux-kernel,
	linux-ide, iommu, 'Bjorn Helgaas',
	xen-devel, linuxppc-dev

On Fri, Jul 04, 2014 at 09:11:50AM +0000, David Laight wrote:
> > I might be missing something, but we are talking about MSI address space
> > here, aren't we? I do not see how we could end up with a 'write'
> > to a random kernel location when an unclaimed MSI vector is sent. At worst
> > we could expect a spurious interrupt, which is handled and reported.
> > 
> > Anyway, as I described in my reply to Bjorn, this is not a concern IMO.
> 
> I'm thinking of the following, which might be MSI-X?
> 1) Hardware requests some interrupts and tells the host the BAR (and offset)
>    where the 'vectors' should be written.
> 2) To raise an interrupt the hardware uses the 'vector' as the address
>    of a normal PCIe write cycle.
> 
> So if the hardware requests 4 interrupts but the driver (believing it
> will only use 3) writes only 3 vectors, and the hardware then uses the
> 4th vector, it can write to a random location.
> 
> Debugging that would be hard!

The MSI base address is effectively hardcoded for a platform. The
combination of the MSI base address, the PCI function number and the MSI
vector makes the PCI host raise an interrupt on a CPU. I might be
inaccurate in the details, but the scenario you described is impossible AFAICT.

> 	David
> 
> 
> 

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-07-04  8:58           ` Alexander Gordeev
  (?)
@ 2014-07-07 19:26               ` Bjorn Helgaas
  -1 siblings, 0 replies; 76+ messages in thread
From: Bjorn Helgaas @ 2014-07-07 19:26 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-mips, linux-s390, linux-doc, linux-pci, x86, linux-kernel,
	linux-ide, iommu, David Laight, xen-devel, linuxppc-dev

On Fri, Jul 4, 2014 at 2:58 AM, Alexander Gordeev <agordeev@redhat.com> wrote:
> On Thu, Jul 03, 2014 at 09:20:52AM +0000, David Laight wrote:
>> From: Bjorn Helgaas
>> > On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
>> > > There are PCI devices that require a particular value written
>> > > to the Multiple Message Enable (MME) register, even though the
>> > > number of MSI vectors actually used, 'nvec', rounded up to a
>> > > power of 2, is less than that MME value:
>> > >
>> > >   roundup_pow_of_two(nvec) < 'Multiple Message Enable'
>> > >
>> > > However the existing pci_enable_msi_block() interface is not
>> > > able to configure such devices, since the value written to the
>> > > MME register is calculated from the number of requested MSIs
>> > > 'nvec':
>> > >
>> > >   'Multiple Message Enable' = roundup_pow_of_two(nvec)
>> >
>> > For MSI, software learns how many vectors a device requests by reading
>> > the Multiple Message Capable (MMC) field.  This field is encoded, so a
>> > device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
>> > for a device to request 3 vectors; it would have to round that up
>> > to a power of two and request 4 vectors.
>> >
>> > Software writes similarly encoded values to MME to tell the device how
>> > many vectors have been allocated for its use.  For example, it's
>> > impossible to tell the device that it can use 3 vectors; the OS has to
>> > round that up and tell the device it can use 4 vectors.
>> >
>> > So if I understand correctly, the point of this series is to take
>> > advantage of device-specific knowledge, e.g., the device requests 4
>> > vectors via MMC, but we "know" the device is only capable of using 3.
>> > Moreover, we tell the device via MME that 4 vectors are available, but
>> > we've only actually set up 3 of them.
>> ...
>>
>> Even if you do that, you ought to write valid interrupt information
>> into the 4th slot (maybe replicating one of the earlier interrupts).
>> Then, if the device does raise the 'unexpected' interrupt you don't
>> get a write to a random kernel location.
>
> I might be missing something, but we are talking about MSI address space
> here, aren't we? I do not see how we could end up with a 'write'
> to a random kernel location when an unclaimed MSI vector is sent. At worst
> we could expect a spurious interrupt, which is handled and reported.

Yes, that's how I understand it.  With MSI, the OS specifies a
single Message Address, e.g., a LAPIC address, and a single Message
Data value, e.g., a vector number that will be written to the LAPIC.
The device is permitted to modify some low-order bits of the Message
Data to send one of several vector numbers (the MME value tells the
device how many bits it can modify).

Bottom line, I think a spurious interrupt is the failure we'd expect
if a device used more vectors than the OS expects it to.

Bjorn

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-07-04  8:57         ` Alexander Gordeev
  (?)
@ 2014-07-07 19:40             ` Bjorn Helgaas
  -1 siblings, 0 replies; 76+ messages in thread
From: Bjorn Helgaas @ 2014-07-07 19:40 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-mips, linux-s390, linux-pci, x86, linux-doc, linux-kernel,
	linux-ide, open list:INTEL IOMMU (VT-d), xen-devel, linuxppc-dev

On Fri, Jul 4, 2014 at 2:57 AM, Alexander Gordeev <agordeev@redhat.com> wrote:
> On Wed, Jul 02, 2014 at 02:22:01PM -0600, Bjorn Helgaas wrote:
>> On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
>> > There are PCI devices that require a particular value written
>> > to the Multiple Message Enable (MME) register, even though the
>> > number of MSI vectors actually used, 'nvec', rounded up to a
>> > power of 2, is less than that MME value:
>> >
>> >     roundup_pow_of_two(nvec) < 'Multiple Message Enable'
>> >
>> > However the existing pci_enable_msi_block() interface is not
>> > able to configure such devices, since the value written to the
>> > MME register is calculated from the number of requested MSIs
>> > 'nvec':
>> >
>> >     'Multiple Message Enable' = roundup_pow_of_two(nvec)
>>
>> For MSI, software learns how many vectors a device requests by reading
>> the Multiple Message Capable (MMC) field.  This field is encoded, so a
>> device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
>> for a device to request 3 vectors; it would have to round that up
>> to a power of two and request 4 vectors.
>>
>> Software writes similarly encoded values to MME to tell the device how
>> many vectors have been allocated for its use.  For example, it's
>> impossible to tell the device that it can use 3 vectors; the OS has to
>> round that up and tell the device it can use 4 vectors.
>
> Nod.
>
>> So if I understand correctly, the point of this series is to take
>> advantage of device-specific knowledge, e.g., the device requests 4
>> vectors via MMC, but we "know" the device is only capable of using 3.
>> Moreover, we tell the device via MME that 4 vectors are available, but
>> we've only actually set up 3 of them.
>
> Exactly.
>
>> This makes me uneasy because we're lying to the device, and the device
>> is perfectly within spec to use all 4 of those vectors.  If anything
>> changes the number of vectors the device uses (new device revision,
>> firmware upgrade, etc.), this is liable to break.
>
> If a device has committed, via non-MSI-specific means, to sending only 3
> of the 4 available vectors, why should we expect it to send 4? The
> probability of firmware sending 4/4 vectors in this case is equal to the
> probability of it sending 5/4 or 16/4, for the very same reason: a bug in
> the firmware. Moreover, even vector 4/4 would be unexpected by the device
> driver, though it is perfectly within the spec.
>
> As for a new device revision, firmware update, etc., that is just another
> case of a device driver vs. firmware mismatch. Leaving this change out
> does not help here at all IMHO.
>
>> Can you quantify the benefit of this?  Can't a device already use
>> MSI-X to request exactly the number of vectors it can use?  (I know
>
> An Intel AHCI chipset requires 16 vectors to be written to MME, while it
> advertises (via AHCI registers) and uses only 6. Even an attempt to init
> 8 vectors results in the device falling back to 1 (!).

Is the fact that it uses only 6 vectors documented in the public spec?

Is this a chipset erratum?  Are there newer versions of the chipset
that fix this, e.g., by requesting 8 vectors and using 6, or by also
supporting MSI-X?

I know this conserves vector numbers.  What does that mean in real
user-visible terms?  Are there systems that won't boot because of this
issue, and this patch fixes them?  Does it enable bigger
configurations, e.g., more I/O devices, than before?

Do you know how Windows handles this?  Does it have a similar interface?

As you can tell, I'm a little skeptical about this.  It's a fairly big
change, it affects the arch interface, it seems to be targeted for
only a single chipset (though it's widely used), and we already
support a standard solution (MSI-X, reducing the number of vectors
requested, or even operating with 1 vector).

Bjorn

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-07-07 19:40             ` Bjorn Helgaas
  (?)
@ 2014-07-07 20:42                 ` Alexander Gordeev
  -1 siblings, 0 replies; 76+ messages in thread
From: Alexander Gordeev @ 2014-07-07 20:42 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-mips, linux-s390,
	linux-pci, x86, linux-doc, linux-kernel,
	linux-ide, open list:INTEL IOMMU (VT-d),
	xen-devel, linuxppc-dev

On Mon, Jul 07, 2014 at 01:40:48PM -0600, Bjorn Helgaas wrote:
> As you can tell, I'm a little skeptical about this.  It's a fairly big
> change, it affects the arch interface, it seems to be targeted for
> only a single chipset (though it's widely used), and we already
> support a standard solution (MSI-X, reducing the number of vectors
> requested, or even operating with 1 vector).

Bjorn,

I surely understand your concerns. I am answering this "summary"
question right away.

Even though an extra parameter is introduced, functionally this update
is rather small. Only the new pci_enable_msi_partial() function can
make use of a custom 'nvec_mme' parameter. The existing
pci_enable_msi_range() function (and therefore every device driver) is
unaffected: it just rounds 'nvec' up to the nearest power of two and
continues exactly as before. All archs besides x86 simply ignore the new
parameter, and the x86 change is fairly small as well, since all the
necessary functionality is already in place.

Thus, at the moment only AHCI is of concern. And no, AHCI cannot do MSI-X.

Thanks!

> Bjorn

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-07-02 20:22     ` Bjorn Helgaas
@ 2014-07-08  4:01       ` Michael Ellerman
  -1 siblings, 0 replies; 76+ messages in thread
From: Michael Ellerman @ 2014-07-08  4:01 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alexander Gordeev, linux-mips, linux-s390, linux-pci, x86,
	linux-doc, linux-kernel, linux-ide, iommu, xen-devel,
	linuxppc-dev

On Wed, 2014-07-02 at 14:22 -0600, Bjorn Helgaas wrote:
> On Tue, Jun 10, 2014 at 03:10:30PM +0200, Alexander Gordeev wrote:
> > There are PCI devices that require a particular value written
> > to the Multiple Message Enable (MME) register, while the number
> > of MSI vectors they actually use, 'nvec', rounded up to a power
> > of two, is less than that MME value:
> > 
> > 	roundup_pow_of_two(nvec) < 'Multiple Message Enable'
> > 
> > However the existing pci_enable_msi_block() interface is not
> > able to configure such devices, since the value written to the
> > MME register is calculated from the number of requested MSIs
> > 'nvec':
> > 
> > 	'Multiple Message Enable' = roundup_pow_of_two(nvec)
> 
> For MSI, software learns how many vectors a device requests by reading
> the Multiple Message Capable (MMC) field.  This field is encoded, so a
> device can only request 1, 2, 4, 8, etc., vectors.  It's impossible
> for a device to request 3 vectors; it would have to round that up
> to a power of two and request 4 vectors.
> 
> Software writes similarly encoded values to MME to tell the device how
> many vectors have been allocated for its use.  For example, it's
> impossible to tell the device that it can use 3 vectors; the OS has to
> round that up and tell the device it can use 4 vectors.
> 
> So if I understand correctly, the point of this series is to take
> advantage of device-specific knowledge, e.g., the device requests 4
> vectors via MMC, but we "know" the device is only capable of using 3.
> Moreover, we tell the device via MME that 4 vectors are available, but
> we've only actually set up 3 of them.
> 
> This makes me uneasy because we're lying to the device, and the device
> is perfectly within spec to use all 4 of those vectors.  If anything
> changes the number of vectors the device uses (new device revision,
> firmware upgrade, etc.), this is liable to break.

It also adds more complexity to the already complex MSI API, across all
architectures, all so a single Intel chipset can save a couple of MSIs. That
seems like the wrong trade-off to me.

cheers



^ permalink raw reply	[flat|nested] 76+ messages in thread

* RE: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-07-07 19:26               ` Bjorn Helgaas
                                     ` (2 preceding siblings ...)
  (?)
@ 2014-07-08  8:33                   ` David Laight
  -1 siblings, 0 replies; 76+ messages in thread
From: David Laight @ 2014-07-08  8:33 UTC (permalink / raw)
  To: 'Bjorn Helgaas', Alexander Gordeev
  Cc: linux-mips, linux-s390, linux-doc, linux-pci, x86, linux-kernel,
	linux-ide, iommu, xen-devel, linuxppc-dev

From: Bjorn Helgaas
...
> >> Even if you do that, you ought to write valid interrupt information
> >> into the 4th slot (maybe replicating one of the earlier interrupts).
> >> Then, if the device does raise the 'unexpected' interrupt you don't
> >> get a write to a random kernel location.
> >
> > I might be missing something, but we are talking of MSI address space
> > here, aren't we? I am not getting how we could end up with a 'write'
> > to a random kernel location when an unclaimed MSI vector is sent. We could
> > only expect a spurious interrupt at worst, which is handled and reported.
> 
> Yes, that's how I understand it.  With MSI, the OS specifies a
> single Message Address, e.g., a LAPIC address, and a single Message
> Data value, e.g., a vector number that will be written to the LAPIC.
> The device is permitted to modify some low-order bits of the Message
> Data to send one of several vector numbers (the MME value tells the
> device how many bits it can modify).
> 
> Bottom line, I think a spurious interrupt is the failure we'd expect
> if a device used more vectors than the OS expects it to.

So you need to tell the device where to write in order to raise the
'spurious interrupt'.

	David

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-07-07 19:40             ` Bjorn Helgaas
  (?)
@ 2014-07-08 12:26                 ` Alexander Gordeev
  -1 siblings, 0 replies; 76+ messages in thread
From: Alexander Gordeev @ 2014-07-08 12:26 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-mips, linux-s390, linux-pci, x86, linux-doc, linux-kernel,
	linux-ide, open list:INTEL IOMMU (VT-d),
	xen-devel, linuxppc-dev

On Mon, Jul 07, 2014 at 01:40:48PM -0600, Bjorn Helgaas wrote:
> >> Can you quantify the benefit of this?  Can't a device already use
> >> MSI-X to request exactly the number of vectors it can use?  (I know
> >
> > An Intel AHCI chipset requires 16 vectors written to MME while it advertises
> > (via AHCI registers) and uses only 6. Even an attempt to init 8 vectors results
> > in the device falling back to 1 (!).
> 
> Is the fact that it uses only 6 vectors documented in the public spec?

Yes, it is documented in ICH specs.

> Is this a chipset erratum?  Are there newer versions of the chipset
> that fix this, e.g., by requesting 8 vectors and using 6, or by also
> supporting MSI-X?

No, this is not an erratum. The value of 8 vectors is reserved and could
cause undefined results if used.

> I know this conserves vector numbers.  What does that mean in real
> user-visible terms?  Are there systems that won't boot because of this
> issue, and this patch fixes them?  Does it enable bigger
> configurations, e.g., more I/O devices, than before?

Visibly, it ceases logging messages ('ahci 0000:00:1f.2: irq 107 for
MSI/MSI-X') for IRQs that are not shown in /proc/interrupts later.

No, it does not enable/fix any existing hardware issue I am aware of.
It just saves a couple of interrupt vectors, as Michael put it (10/16
to be precise). However, interrupt vector space is a fairly scarce
resource on x86, and the risk of exhausting the vectors (and of
introducing a quota) has already been raised, AFAIR.

> Do you know how Windows handles this?  Does it have a similar interface?

Have no clue, TBH. Can try to investigate if you see it helpful.

> As you can tell, I'm a little skeptical about this.  It's a fairly big
> change, it affects the arch interface, it seems to be targeted for
> only a single chipset (though it's widely used), and we already
> support a standard solution (MSI-X, reducing the number of vectors
> requested, or even operating with 1 vector).

I also do not like the fact that the arch interface is getting complicated,
so I happily leave it to your judgement ;) Well, it is low-level and
hidden from drivers at least.

Thanks!

> Bjorn

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com
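
The "10/16" figure above is simple arithmetic. The Python below is a hedged illustration of it. The helper names `mme_for` and `vectors_allocated` are hypothetical and are not part of the patch. The sketch computes the log2 value that MME encodes. It then compares setting up all 2**MME vectors against setting up only the vectors actually in use.

```python
# Vector arithmetic behind "10/16" (illustrative sketch only).

def mme_for(nvec):
    """Smallest k with 2**k >= nvec, i.e. the log2 count that MME encodes."""
    if not 1 <= nvec <= 32:
        raise ValueError("MSI supports 1..32 messages")
    return (nvec - 1).bit_length()


def vectors_allocated(mme, nvec_used, partial):
    """Vectors the OS sets up: all 2**mme, or only the nvec_used in use."""
    return nvec_used if partial else 1 << mme


if __name__ == "__main__":
    required_mme, used = 4, 6  # chipset insists on 16 messages, uses 6
    full = vectors_allocated(required_mme, used, partial=False)
    part = vectors_allocated(required_mme, used, partial=True)
    print(f"full={full} partial={part} saved={full - part}")
    print(f"natural MME for {used} vectors: {mme_for(used)}")
```

The thread's ICH AHCI case requires MME = 4 and uses six vectors. Full allocation sets up 16 vectors and partial allocation sets up 6, so 10 are saved. `mme_for(6)` is 3, which is the 8-vector setting that, according to the thread, this chipset treats as reserved.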

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-07-08 12:26                 ` Alexander Gordeev
  (?)
@ 2014-07-09 16:06                     ` Bjorn Helgaas
  -1 siblings, 0 replies; 76+ messages in thread
From: Bjorn Helgaas @ 2014-07-09 16:06 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-mips, linux-s390, linux-pci, x86, linux-doc, linux-kernel,
	linux-ide, open list:INTEL IOMMU (VT-d),
	xen-devel, linuxppc-dev

On Tue, Jul 8, 2014 at 6:26 AM, Alexander Gordeev <agordeev@redhat.com> wrote:
> On Mon, Jul 07, 2014 at 01:40:48PM -0600, Bjorn Helgaas wrote:
>> >> Can you quantify the benefit of this?  Can't a device already use
>> >> MSI-X to request exactly the number of vectors it can use?  (I know
>> >
>> > An Intel AHCI chipset requires 16 vectors written to MME while it advertises
>> > (via AHCI registers) and uses only 6. Even an attempt to init 8 vectors results
>> > in the device falling back to 1 (!).
>>
>> Is the fact that it uses only 6 vectors documented in the public spec?
>
> Yes, it is documented in ICH specs.

Out of curiosity, do you have a pointer to this?  It looks like it
uses one vector per port, and I'm wondering if the reason it requests
16 is because there's some possibility of a part with more than 8
ports.

>> Is this a chipset erratum?  Are there newer versions of the chipset
>> that fix this, e.g., by requesting 8 vectors and using 6, or by also
>> supporting MSI-X?
>
> No, this is not an erratum. The value of 8 vectors is reserved and could
> cause undefined results if used.

As I read the spec (PCI 3.0, sec 6.8.1.3), if MMC contains 0b100
(requesting 16 vectors), the OS is allowed to allocate 1, 2, 4, 8, or
16 vectors.  If allocating 8 vectors and writing 0b011 to MME causes
undefined results, I'd say that's a chipset defect.

>> I know this conserves vector numbers.  What does that mean in real
>> user-visible terms?  Are there systems that won't boot because of this
>> issue, and this patch fixes them?  Does it enable bigger
>> configurations, e.g., more I/O devices, than before?
>
> Visibly, it ceases logging messages ('ahci 0000:00:1f.2: irq 107 for
> MSI/MSI-X') for IRQs that are not shown in /proc/interrupts later.
>
> No, it does not enable/fix any existing hardware issue I am aware of.
> It just saves a couple of interrupt vectors, as Michael put it (10/16
> to be precise). However, interrupt vector space is a fairly scarce
> resource on x86, and the risk of exhausting the vectors (and of
> introducing a quota) has already been raised, AFAIR.

I'm not too concerned about the logging issue.  If necessary, we could
tweak that message somehow.

Interrupt vector space is the issue I would worry about, but I think
I'm going to put this on the back burner until it actually becomes a
problem.

>> Do you know how Windows handles this?  Does it have a similar interface?
>
> Have no clue, TBH. Can try to investigate if you see it helpful.

No, don't worry about investigating.  I was just curious because if
Windows *did* support something like this, that would be an indication
that there's a significant problem here and we might need to solve it,
too.  But it sounds like we can safely ignore it for now.

Bjorn
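
Bjorn's reading of PCI 3.0 sec 6.8.1.3 can be illustrated with a short sketch of the MSI Message Control encoding. The sketch assumes the usual field layout:

- Multiple Message Capable (MMC) occupies bits 3:1.
- Multiple Message Enable (MME) occupies bits 6:4.
- The values 0b000..0b101 mean 1..32 messages.
- The values 0b110 and 0b111 are reserved.

The helper names are hypothetical. For MMC = 0b100, the sketch lists the allocations the OS may choose and shows that granting 8 vectors means writing 0b011 to MME.

```python
# MSI Message Control field helpers (illustrative sketch, not kernel code).
MMC_SHIFT = 1  # Multiple Message Capable: bits 3:1 (what the device requests)
MME_SHIFT = 4  # Multiple Message Enable:  bits 6:4 (what the OS grants)
FIELD_MASK = 0x7


def field_to_count(field):
    """Decode a 3-bit MMC/MME value into a message count (1, 2, 4 ... 32)."""
    if field > 0b101:
        raise ValueError("0b110 and 0b111 are reserved")
    return 1 << field


def legal_mme_values(msg_ctl):
    """Map each MME value the OS may write (0..MMC) to its message count."""
    mmc = (msg_ctl >> MMC_SHIFT) & FIELD_MASK
    return {mme: field_to_count(mme) for mme in range(mmc + 1)}


def set_mme(msg_ctl, mme):
    """Return msg_ctl with the MME field replaced by mme."""
    cleared = msg_ctl & ~(FIELD_MASK << MME_SHIFT)
    return cleared | ((mme & FIELD_MASK) << MME_SHIFT)


if __name__ == "__main__":
    ctl = 0b100 << MMC_SHIFT           # device requests 16 messages
    print(legal_mme_values(ctl))       # -> {0: 1, 1: 2, 2: 4, 3: 8, 4: 16}
    print(hex(set_mme(ctl, 0b011)))    # -> 0x38 (8 messages granted)
```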

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-07-09 16:06                     ` Bjorn Helgaas
  (?)
@ 2014-07-10 10:11                         ` Alexander Gordeev
  -1 siblings, 0 replies; 76+ messages in thread
From: Alexander Gordeev @ 2014-07-10 10:11 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-mips, linux-s390, linux-pci, x86, linux-doc, linux-kernel,
	linux-ide, open list:INTEL IOMMU (VT-d),
	xen-devel, linuxppc-dev

On Wed, Jul 09, 2014 at 10:06:48AM -0600, Bjorn Helgaas wrote:
> Out of curiosity, do you have a pointer to this?  It looks like it

E.g. ICH8 chapter 12.1.30 or ICH10 chapter 14.1.27.

> uses one vector per port, and I'm wondering if the reason it requests
> 16 is because there's some possibility of a part with more than 8
> ports.

I doubt that is the reason. The only allowed MME values are 0b000,
0b001, 0b010 and 0b100 (zero or a power of two). As you can see, at
most one bit is ever set; I would speculate that suits some hardware
logic nicely.
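
To make the one-bit observation concrete, here is a small C sketch. The
helper name is hypothetical; it does not come from the AHCI driver or
the ICH documentation. It tests whether a 3-bit MME value has at most
one bit set, which is true for exactly the four values listed above and
false for 011b, the PCI encoding for 8 vectors.

```c
/*
 * Hypothetical helper: does a 3-bit MME value have at most one bit set?
 * Among 000b..111b this holds only for 000b, 001b, 010b and 100b.
 */
static int mme_is_zero_or_one_bit(unsigned int mme)
{
	mme &= 0x7;			/* MME is a 3-bit field */
	return (mme & (mme - 1)) == 0;	/* x & (x - 1) clears the lowest set bit */
}
```

Under the PCI 3.0 encoding, 100b means 16 vectors, so this allowed set
corresponds to 1, 2, 4 or 16 vectors, with 8 (011b) excluded.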

BTW, apart from AHCI, it seems MSI is not going to disappear (not for
a decade, at least), because it is much cheaper to implement than MSI-X.

> > No, this is not an erratum. The value of 8 vectors is reserved and could
> > cause undefined results if used.
> 
> As I read the spec (PCI 3.0, sec 6.8.1.3), if MMC contains 0b100
> (requesting 16 vectors), the OS is allowed to allocate 1, 2, 4, 8, or
> 16 vectors.  If allocating 8 vectors and writing 0b011 to MME causes
> undefined results, I'd say that's a chipset defect.

Well, the PCI spec does not prevent devices from having their own specs
on top of it. The undefined results here are meant on the device side;
on the MSI side, this behavior is likely perfectly within the PCI spec.
I feel like I'm speaking as a lawyer here ;)

> Interrupt vector space is the issue I would worry about, but I think
> I'm going to put this on the back burner until it actually becomes a
> problem.

I plan to try to get rid of the arch_msi_check_device() hook. Should I
repost this series afterwards?

Thanks!

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial()
  2014-07-10 10:11                         ` Alexander Gordeev
@ 2014-07-10 17:02                           ` Bjorn Helgaas
  -1 siblings, 0 replies; 76+ messages in thread
From: Bjorn Helgaas @ 2014-07-10 17:02 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-kernel, linux-doc, linux-mips, linuxppc-dev, linux-s390,
	x86, xen-devel, open list:INTEL IOMMU (VT-d),
	linux-ide, linux-pci

On Thu, Jul 10, 2014 at 4:11 AM, Alexander Gordeev <agordeev@redhat.com> wrote:
> On Wed, Jul 09, 2014 at 10:06:48AM -0600, Bjorn Helgaas wrote:
>> Out of curiosity, do you have a pointer to this?  It looks like it
>
> E.g. ICH8 chapter 12.1.30 or ICH10 chapter 14.1.27.
>
>> uses one vector per port, and I'm wondering if the reason it requests
>> 16 is because there's some possibility of a part with more than 8
>> ports.
>
> I doubt that is the reason. The only allowed MME values are 0b000,
> 0b001, 0b010 and 0b100 (zero or a power of two). As you can see, at
> most one bit is ever set; I would speculate that suits some hardware
> logic nicely.
>
> BTW, apart from AHCI, it seems MSI is not going to disappear (not for
> a decade, at least), because it is much cheaper to implement than MSI-X.
>
>> > No, this is not an erratum. The value of 8 vectors is reserved and could
>> > cause undefined results if used.
>>
>> As I read the spec (PCI 3.0, sec 6.8.1.3), if MMC contains 0b100
>> (requesting 16 vectors), the OS is allowed to allocate 1, 2, 4, 8, or
>> 16 vectors.  If allocating 8 vectors and writing 0b011 to MME causes
>> undefined results, I'd say that's a chipset defect.
>
> Well, the PCI spec does not prevent devices from having their own specs
> on top of it. The undefined results here are meant on the device side;
> on the MSI side, this behavior is likely perfectly within the PCI spec.
> I feel like I'm speaking as a lawyer here ;)

I disagree about this part.  The reason MSI is in the PCI spec is so
the OS can have generic support for it without having to put
device-specific support in every driver.  The PCI spec is clear that
the OS can allocate any power-of-two number of vectors less than or
equal to the number requested via MMC.  The SATA device requests 16,
so it should be perfectly legal for the OS to give it 8.
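
The allocation rule described above can be sketched in C, assuming the
PCI 3.0 encoding in which the MMC/MME field value is log2 of the vector
count.  The helper name is hypothetical, not a Linux or PCI API: it
returns the smallest MME value whose vector count covers the vectors a
driver actually needs, without exceeding what the device advertises in
MMC.

```c
/*
 * Hypothetical helper: smallest legal MME value covering 'needed'
 * vectors.  Per PCI 3.0, any MME value <= MMC is a legal allocation,
 * so the OS may grant 1, 2, 4, ... up to (1 << mmc) vectors.
 * Returns -1 if 'needed' exceeds what MMC allows.
 */
static int msi_pick_mme(unsigned int mmc, unsigned int needed)
{
	unsigned int mme;

	for (mme = 0; mme <= mmc; mme++)
		if ((1u << mme) >= needed)
			return (int)mme;
	return -1;
}
```

For the ICH SATA case in this thread (MMC = 100b, 6 ports in use),
msi_pick_mme(4, 6) returns 3, i.e. MME = 011b and 8 vectors.  That is
the value the ICH documentation reportedly reserves, which leaves 16
vectors as the only allowed allocation that covers all 6 ports.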

It's interesting that the ICH10 spec (sec 14.1.27, thanks for the
reference) says MMC 100b means "8 MSI Capable".  That smells like a
hardware bug.  The PCI spec says:

  000 => 1 vector
  001 => 2 vectors
  010 => 4 vectors
  011 => 8 vectors
  100 => 16 vectors

The ICH10 spec seems to treat 100b as 8 vectors (not 16, as the PCI
spec says), and that reading would fit the rest of the ICH10 MME
description.  If ICH10 was built assuming this table:

  000 => 1 vector
  001 => 2 vectors
  010 => 4 vectors
  100 => 8 vectors

then everything makes sense: the device requests 8 vectors, and the
behavior is defined in all possible MME cases (1, 2, 4, or 8 vectors
assigned).  The "Values '011b' to '111b' are reserved" part is still
slightly wrong, because 100b falls in that range but is not reserved;
that's a tangent, though.
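
The two tables can be compared side by side in a short C sketch.  Both
function names are hypothetical; pci_vectors() follows the PCI 3.0
table above, and onehot_vectors() follows the alternative table that,
by the speculation here, ICH10 may have been designed against.  The two
agree for 000b through 010b and disagree on 011b, 100b and 101b.

```c
/* PCI 3.0 encoding: field = log2(vectors); 110b and 111b reserved. */
static unsigned int pci_vectors(unsigned int field)
{
	return field <= 5 ? 1u << field : 0;
}

/*
 * Hypothesized one-hot encoding (000b=1, 001b=2, 010b=4, 100b=8).
 * Returns 0 for values this table leaves undefined, including 011b.
 */
static unsigned int onehot_vectors(unsigned int field)
{
	switch (field) {
	case 0x0: return 1;
	case 0x1: return 2;
	case 0x2: return 4;
	case 0x4: return 8;
	default:  return 0;
	}
}
```

Here pci_vectors(4) is 16 while onehot_vectors(4) is 8, which is the
mismatch in question, and 011b, a legal 8-vector MME under PCI, has no
meaning in the alternative table.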

So my guess (speculation, I admit) is that the intent was for ICH SATA
to request only 8 vectors, but because of this error, it requests 16.
Maybe some early MSI proposal used a different encoding for MMC and
MME, and ICH was originally designed using that.

>> Interrupt vector space is the issue I would worry about, but I think
>> I'm going to put this on the back burner until it actually becomes a
>> problem.
>
> I plan to try to get rid of the arch_msi_check_device() hook. Should I
> repost this series afterwards?

Honestly, I'm still not inclined to pursue this because of the added
API complexity and the lack of concrete benefit.

Bjorn

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread

Thread overview: 76+ messages
2014-06-10 13:10 [PATCH 0/3] Add pci_enable_msi_partial() to conserve MSI-related resources Alexander Gordeev
2014-06-10 13:10 ` Alexander Gordeev
2014-06-10 13:10 ` [PATCH 1/3] PCI/MSI: Add pci_enable_msi_partial() Alexander Gordeev
2014-06-10 13:10   ` Alexander Gordeev
2014-06-23 20:11     ` Alexander Gordeev
2014-06-23 20:11       ` Alexander Gordeev
2014-06-23 20:11       ` Alexander Gordeev
2014-06-23 20:11   ` Alexander Gordeev
2014-07-02 20:22   ` Bjorn Helgaas
2014-07-02 20:22   ` Bjorn Helgaas
2014-07-02 20:22     ` Bjorn Helgaas
2014-07-03  9:20     ` David Laight
2014-07-03  9:20       ` David Laight
2014-07-03  9:20       ` David Laight
2014-07-03  9:20       ` David Laight
2014-07-04  8:58       ` Alexander Gordeev
2014-07-04  8:58         ` Alexander Gordeev
2014-07-04  8:58           ` Alexander Gordeev
2014-07-04  8:58           ` Alexander Gordeev
2014-07-04  9:11           ` David Laight
2014-07-04  9:11             ` David Laight
2014-07-04  9:11               ` David Laight
2014-07-04  9:11               ` David Laight
2014-07-04  9:54                 ` Alexander Gordeev
2014-07-04  9:54                   ` Alexander Gordeev
2014-07-04  9:54                   ` Alexander Gordeev
2014-07-04  9:54               ` Alexander Gordeev
2014-07-07 19:26             ` Bjorn Helgaas
2014-07-07 19:26               ` Bjorn Helgaas
2014-07-07 19:26               ` Bjorn Helgaas
2014-07-08  8:33                 ` David Laight
2014-07-08  8:33                   ` David Laight
2014-07-08  8:33                   ` David Laight
2014-07-08  8:33                   ` David Laight
2014-07-08  8:33                   ` David Laight
2014-07-08  8:33               ` David Laight
2014-07-07 19:26           ` Bjorn Helgaas
2014-07-03  9:20     ` David Laight
2014-07-04  8:57       ` Alexander Gordeev
2014-07-04  8:57         ` Alexander Gordeev
2014-07-04  8:57         ` Alexander Gordeev
2014-07-07 19:40           ` Bjorn Helgaas
2014-07-07 19:40             ` Bjorn Helgaas
2014-07-07 19:40             ` Bjorn Helgaas
2014-07-07 20:42             ` Alexander Gordeev
2014-07-08 12:26             ` Alexander Gordeev
2014-07-07 20:42               ` Alexander Gordeev
2014-07-07 20:42                 ` Alexander Gordeev
2014-07-07 20:42                 ` Alexander Gordeev
2014-07-08 12:26               ` Alexander Gordeev
2014-07-08 12:26                 ` Alexander Gordeev
2014-07-08 12:26                 ` Alexander Gordeev
2014-07-09 16:06                 ` Bjorn Helgaas
2014-07-09 16:06                   ` Bjorn Helgaas
2014-07-09 16:06                     ` Bjorn Helgaas
2014-07-09 16:06                     ` Bjorn Helgaas
2014-07-10 10:11                       ` Alexander Gordeev
2014-07-10 10:11                         ` Alexander Gordeev
2014-07-10 10:11                         ` Alexander Gordeev
2014-07-10 17:02                         ` Bjorn Helgaas
2014-07-10 17:02                         ` Bjorn Helgaas
2014-07-10 17:02                           ` Bjorn Helgaas
2014-07-10 10:11                     ` Alexander Gordeev
2014-07-07 19:40         ` Bjorn Helgaas
2014-07-04  8:57     ` Alexander Gordeev
2014-07-08  4:01     ` Michael Ellerman
2014-07-08  4:01     ` Michael Ellerman
2014-07-08  4:01       ` Michael Ellerman
2014-06-10 13:10 ` Alexander Gordeev
2014-06-10 13:10 ` [PATCH 2/3] PCI/MSI/x86: Support pci_enable_msi_partial() Alexander Gordeev
2014-06-10 13:10 ` Alexander Gordeev
2014-06-10 13:10 ` [PATCH 3/3] AHCI: Use pci_enable_msi_partial() to conserve on 10/16 MSIs Alexander Gordeev
2014-06-18 18:54     ` Tejun Heo
2014-06-18 18:54       ` Tejun Heo
2014-06-18 18:54   ` Tejun Heo
2014-06-10 13:10 ` Alexander Gordeev
