Linux-PCI Archive on lore.kernel.org
 help / Atom feed
* [PATCH 0/9] Support using MSI interrupts in ntb_transport
@ 2019-01-31 18:56 Logan Gunthorpe
  2019-01-31 18:56 ` [PATCH 1/9] iommu/vt-d: Allow interrupts from the entire bus for aliased devices Logan Gunthorpe
                   ` (9 more replies)
  0 siblings, 10 replies; 26+ messages in thread
From: Logan Gunthorpe @ 2019-01-31 18:56 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

Hi,

This patch series adds optional support for using MSI interrupts instead
of NTB doorbells in ntb_transport. This is desirable seeing doorbells on
current hardware are quite slow and therefore switching to MSI interrupts
provides a significant performance gain. On switchtec hardware, a simple
apples-to-apples comparison shows ntb_netdev/iperf numbers going from
3.88Gb/s to 14.1Gb/s when switching to MSI interrupts.

To do this, a couple changes are required outside of the NTB tree:

1) The IOMMU must know to accept MSI requests from aliased bused numbers
seeing NTB hardware typically sends proxied request IDs through
additional requester IDs. The first patch in this series adds support
for the Intel IOMMU. A quirk to add these aliases for switchtec hardware
was already accepted. See commit ad281ecf1c7d ("PCI: Add DMA alias quirk
for Microsemi Switchtec NTB") for a description of NTB proxy IDs and why
this is necessary.

2) NTB transport (and other clients) may often need more MSI interrupts
than the NTB hardware actually advertises support for. However, seeing
these interrupts will not be triggered by the hardware but through an
NTB memory window, the hardware does not actually need support or need
to know about them. Therefore we add the concept of Virtual MSI
interrupts which are allocated just like any other MSI interrupt but
are not programmed into the hardware's MSI table. This is done in
Patch 2 and then made use of in Patch 3.

The remaining patches in this series add a library for dealing with MSI
interrupts, a test client and finally support in ntb_transport.

The series is based off of v5.0-rc4 and I've tested it on top of a
of the patches I've already sent to the NTB tree (though they are
independent changes). A git repo is available here:

https://github.com/sbates130272/linux-p2pmem/ ntb_transport_msi_v1

Thanks,

Logan

--

Logan Gunthorpe (9):
  iommu/vt-d: Allow interrupts from the entire bus for aliased devices
  PCI/MSI: Support allocating virtual MSI interrupts
  PCI/switchtec: Add module parameter to request more interrupts
  NTB: Introduce functions to calculate multi-port resource index
  NTB: Rename ntb.c to support multiple source files in the module
  NTB: Introduce MSI library
  NTB: Introduce NTB MSI Test Client
  NTB: Add ntb_msi_test support to ntb_test
  NTB: Add MSI interrupt support to ntb_transport

 drivers/iommu/intel_irq_remapping.c     |  12 +
 drivers/ntb/Kconfig                     |  10 +
 drivers/ntb/Makefile                    |   3 +
 drivers/ntb/{ntb.c => core.c}           |   0
 drivers/ntb/msi.c                       | 313 ++++++++++++++++++
 drivers/ntb/ntb_transport.c             | 134 +++++++-
 drivers/ntb/test/Kconfig                |   9 +
 drivers/ntb/test/Makefile               |   1 +
 drivers/ntb/test/ntb_msi_test.c         | 416 ++++++++++++++++++++++++
 drivers/pci/msi.c                       |  51 ++-
 drivers/pci/switch/switchtec.c          |  12 +-
 include/linux/msi.h                     |   1 +
 include/linux/ntb.h                     | 139 ++++++++
 include/linux/pci.h                     |   9 +
 tools/testing/selftests/ntb/ntb_test.sh |  54 ++-
 15 files changed, 1150 insertions(+), 14 deletions(-)
 rename drivers/ntb/{ntb.c => core.c} (100%)
 create mode 100644 drivers/ntb/msi.c
 create mode 100644 drivers/ntb/test/ntb_msi_test.c

--
2.19.0

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/9] iommu/vt-d: Allow interrupts from the entire bus for aliased devices
  2019-01-31 18:56 [PATCH 0/9] Support using MSI interrupts in ntb_transport Logan Gunthorpe
@ 2019-01-31 18:56 ` Logan Gunthorpe
  2019-02-01 16:44   ` Joerg Roedel
  2019-01-31 18:56 ` [PATCH 2/9] PCI/MSI: Support allocating virtual MSI interrupts Logan Gunthorpe
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Logan Gunthorpe @ 2019-01-31 18:56 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore,
	Logan Gunthorpe, David Woodhouse

When a device has multiple aliases that all are from the same bus,
we program the IRTE to accept requests from any matching device on the
bus.

This is so NTB devices which can have requests from multiple bus-devfns
can pass MSI interrupts through across the bridge.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: David Woodhouse <dwmw2@infradead.org>
---
 drivers/iommu/intel_irq_remapping.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 24d45b07f425..8d3107a6b2b4 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -356,6 +356,8 @@ static int set_hpet_sid(struct irte *irte, u8 id)
 struct set_msi_sid_data {
 	struct pci_dev *pdev;
 	u16 alias;
+	int count;
+	int busmatch_count;
 };
 
 static int set_msi_sid_cb(struct pci_dev *pdev, u16 alias, void *opaque)
@@ -364,6 +366,10 @@ static int set_msi_sid_cb(struct pci_dev *pdev, u16 alias, void *opaque)
 
 	data->pdev = pdev;
 	data->alias = alias;
+	data->count++;
+
+	if (PCI_BUS_NUM(alias) == pdev->bus->number)
+		data->busmatch_count++;
 
 	return 0;
 }
@@ -375,6 +381,8 @@ static int set_msi_sid(struct irte *irte, struct pci_dev *dev)
 	if (!irte || !dev)
 		return -1;
 
+	data.count = 0;
+	data.busmatch_count = 0;
 	pci_for_each_dma_alias(dev, set_msi_sid_cb, &data);
 
 	/*
@@ -394,6 +402,10 @@ static int set_msi_sid(struct irte *irte, struct pci_dev *dev)
 		set_irte_sid(irte, SVT_VERIFY_BUS, SQ_ALL_16,
 			     PCI_DEVID(PCI_BUS_NUM(data.alias),
 				       dev->bus->number));
+	else if (data.count >= 2 && data.busmatch_count == data.count)
+		set_irte_sid(irte, SVT_VERIFY_BUS, SQ_ALL_16,
+			     PCI_DEVID(dev->bus->number,
+				       dev->bus->number));
 	else if (data.pdev->bus->number != dev->bus->number)
 		set_irte_sid(irte, SVT_VERIFY_SID_SQ, SQ_ALL_16, data.alias);
 	else
-- 
2.19.0


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 2/9] PCI/MSI: Support allocating virtual MSI interrupts
  2019-01-31 18:56 [PATCH 0/9] Support using MSI interrupts in ntb_transport Logan Gunthorpe
  2019-01-31 18:56 ` [PATCH 1/9] iommu/vt-d: Allow interrupts from the entire bus for aliased devices Logan Gunthorpe
@ 2019-01-31 18:56 ` Logan Gunthorpe
  2019-01-31 22:39   ` Bjorn Helgaas
  2019-01-31 18:56 ` [PATCH 3/9] PCI/switchtec: Add module parameter to request more interrupts Logan Gunthorpe
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Logan Gunthorpe @ 2019-01-31 18:56 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

For NTB devices, we want to be able to trigger MSI interrupts
through a memory window. In these cases we may want to use
more interrupts than the NTB PCI device has available in its MSI-X
table.

We allow for this by creating a new 'virtual' interrupt. These
interrupts are allocated as usual but are not programmed into the
MSI-X table (as there may not be space for them).

The MSI address and data will then handled through an NTB MSI library
introduced later in this series.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/msi.c   | 51 +++++++++++++++++++++++++++++++++++++--------
 include/linux/msi.h |  1 +
 include/linux/pci.h |  9 ++++++++
 3 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 4c0b47867258..145587da686c 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -192,6 +192,9 @@ static void msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
 
 static void __iomem *pci_msix_desc_addr(struct msi_desc *desc)
 {
+	if (desc->msi_attrib.is_virtual)
+		return NULL;
+
 	return desc->mask_base +
 		desc->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE;
 }
@@ -206,14 +209,19 @@ static void __iomem *pci_msix_desc_addr(struct msi_desc *desc)
 u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag)
 {
 	u32 mask_bits = desc->masked;
+	void __iomem *desc_addr;
 
 	if (pci_msi_ignore_mask)
 		return 0;
+	desc_addr = pci_msix_desc_addr(desc);
+	if (!desc_addr)
+		return 0;
 
 	mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
 	if (flag)
 		mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
-	writel(mask_bits, pci_msix_desc_addr(desc) + PCI_MSIX_ENTRY_VECTOR_CTRL);
+
+	writel(mask_bits, desc_addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
 
 	return mask_bits;
 }
@@ -273,6 +281,11 @@ void __pci_read_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
 	if (entry->msi_attrib.is_msix) {
 		void __iomem *base = pci_msix_desc_addr(entry);
 
+		if (!base) {
+			WARN_ON(1);
+			return;
+		}
+
 		msg->address_lo = readl(base + PCI_MSIX_ENTRY_LOWER_ADDR);
 		msg->address_hi = readl(base + PCI_MSIX_ENTRY_UPPER_ADDR);
 		msg->data = readl(base + PCI_MSIX_ENTRY_DATA);
@@ -303,6 +316,9 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
 	} else if (entry->msi_attrib.is_msix) {
 		void __iomem *base = pci_msix_desc_addr(entry);
 
+		if (!base)
+			goto skip;
+
 		writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
 		writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
 		writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
@@ -327,6 +343,8 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
 					      msg->data);
 		}
 	}
+
+skip:
 	entry->msg = *msg;
 }
 
@@ -550,6 +568,7 @@ msi_setup_entry(struct pci_dev *dev, int nvec, const struct irq_affinity *affd)
 
 	entry->msi_attrib.is_msix	= 0;
 	entry->msi_attrib.is_64		= !!(control & PCI_MSI_FLAGS_64BIT);
+	entry->msi_attrib.is_virtual    = 0;
 	entry->msi_attrib.entry_nr	= 0;
 	entry->msi_attrib.maskbit	= !!(control & PCI_MSI_FLAGS_MASKBIT);
 	entry->msi_attrib.default_irq	= dev->irq;	/* Save IOAPIC IRQ */
@@ -674,6 +693,7 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
 	struct irq_affinity_desc *curmsk, *masks = NULL;
 	struct msi_desc *entry;
 	int ret, i;
+	int vec_count = pci_msix_vec_count(dev);
 
 	if (affd)
 		masks = irq_create_affinity_masks(nvec, affd);
@@ -696,6 +716,10 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
 			entry->msi_attrib.entry_nr = entries[i].entry;
 		else
 			entry->msi_attrib.entry_nr = i;
+
+		entry->msi_attrib.is_virtual =
+			entry->msi_attrib.entry_nr >= vec_count;
+
 		entry->msi_attrib.default_irq	= dev->irq;
 		entry->mask_base		= base;
 
@@ -714,12 +738,19 @@ static void msix_program_entries(struct pci_dev *dev,
 {
 	struct msi_desc *entry;
 	int i = 0;
+	void __iomem *desc_addr;
 
 	for_each_pci_msi_entry(entry, dev) {
 		if (entries)
 			entries[i++].vector = entry->irq;
-		entry->masked = readl(pci_msix_desc_addr(entry) +
-				PCI_MSIX_ENTRY_VECTOR_CTRL);
+
+		desc_addr = pci_msix_desc_addr(entry);
+		if (desc_addr)
+			entry->masked = readl(desc_addr +
+					      PCI_MSIX_ENTRY_VECTOR_CTRL);
+		else
+			entry->masked = 0;
+
 		msix_mask_irq(entry, 1);
 	}
 }
@@ -932,7 +963,8 @@ int pci_msix_vec_count(struct pci_dev *dev)
 EXPORT_SYMBOL(pci_msix_vec_count);
 
 static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
-			     int nvec, const struct irq_affinity *affd)
+			     int nvec, const struct irq_affinity *affd,
+			     int flags)
 {
 	int nr_entries;
 	int i, j;
@@ -943,7 +975,7 @@ static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
 	nr_entries = pci_msix_vec_count(dev);
 	if (nr_entries < 0)
 		return nr_entries;
-	if (nvec > nr_entries)
+	if (nvec > nr_entries && !(flags & PCI_IRQ_VIRTUAL))
 		return nr_entries;
 
 	if (entries) {
@@ -1086,7 +1118,8 @@ EXPORT_SYMBOL(pci_enable_msi);
 
 static int __pci_enable_msix_range(struct pci_dev *dev,
 				   struct msix_entry *entries, int minvec,
-				   int maxvec, const struct irq_affinity *affd)
+				   int maxvec, const struct irq_affinity *affd,
+				   int flags)
 {
 	int rc, nvec = maxvec;
 
@@ -1110,7 +1143,7 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
 				return -ENOSPC;
 		}
 
-		rc = __pci_enable_msix(dev, entries, nvec, affd);
+		rc = __pci_enable_msix(dev, entries, nvec, affd, flags);
 		if (rc == 0)
 			return nvec;
 
@@ -1141,7 +1174,7 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
 int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
 		int minvec, int maxvec)
 {
-	return __pci_enable_msix_range(dev, entries, minvec, maxvec, NULL);
+	return __pci_enable_msix_range(dev, entries, minvec, maxvec, NULL, 0);
 }
 EXPORT_SYMBOL(pci_enable_msix_range);
 
@@ -1181,7 +1214,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 
 	if (flags & PCI_IRQ_MSIX) {
 		msix_vecs = __pci_enable_msix_range(dev, NULL, min_vecs,
-						    max_vecs, affd);
+						    max_vecs, affd, flags);
 		if (msix_vecs > 0)
 			return msix_vecs;
 	}
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 784fb52b9900..6458ab049852 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -88,6 +88,7 @@ struct msi_desc {
 				__u8	multi_cap	: 3;
 				__u8	maskbit		: 1;
 				__u8	is_64		: 1;
+				__u8    is_virtual      : 1;
 				__u16	entry_nr;
 				unsigned default_irq;
 			} msi_attrib;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 65f1d8c2f082..ce0815c2c498 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1352,6 +1352,15 @@ int pci_set_vga_state(struct pci_dev *pdev, bool decode,
 #define PCI_IRQ_MSI		(1 << 1) /* Allow MSI interrupts */
 #define PCI_IRQ_MSIX		(1 << 2) /* Allow MSI-X interrupts */
 #define PCI_IRQ_AFFINITY	(1 << 3) /* Auto-assign affinity */
+
+/*
+ * Virtual interrupts allow for more interrupts to be allocated
+ * than the device has interrupts for. These are not programmed
+ * into the devices MSI-X table and must be handled by some
+ * other driver means.
+ */
+#define PCI_IRQ_VIRTUAL		(1 << 4)
+
 #define PCI_IRQ_ALL_TYPES \
 	(PCI_IRQ_LEGACY | PCI_IRQ_MSI | PCI_IRQ_MSIX)
 
-- 
2.19.0


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 3/9] PCI/switchtec: Add module parameter to request more interrupts
  2019-01-31 18:56 [PATCH 0/9] Support using MSI interrupts in ntb_transport Logan Gunthorpe
  2019-01-31 18:56 ` [PATCH 1/9] iommu/vt-d: Allow interrupts from the entire bus for aliased devices Logan Gunthorpe
  2019-01-31 18:56 ` [PATCH 2/9] PCI/MSI: Support allocating virtual MSI interrupts Logan Gunthorpe
@ 2019-01-31 18:56 ` Logan Gunthorpe
  2019-01-31 18:56 ` [PATCH 4/9] NTB: Introduce functions to calculate multi-port resource index Logan Gunthorpe
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Logan Gunthorpe @ 2019-01-31 18:56 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

Seeing the we want to use more interrupts in the NTB MSI code
we need to be able allocate more (sometimes virtual) interrupts
in the switchtec driver. Therefore add a module parameter to
request to allocate additional interrupts.

This puts virtually no limit on the number of MSI interrupts available
to NTB clients.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/switch/switchtec.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c
index e22766c79fe9..8b1db78197d9 100644
--- a/drivers/pci/switch/switchtec.c
+++ b/drivers/pci/switch/switchtec.c
@@ -30,6 +30,10 @@ module_param(use_dma_mrpc, bool, 0644);
 MODULE_PARM_DESC(use_dma_mrpc,
 		 "Enable the use of the DMA MRPC feature");
 
+static int nirqs = 32;
+module_param(nirqs, int, 0644);
+MODULE_PARM_DESC(nirqs, "number of interrupts to allocate (more may be useful for NTB applications)");
+
 static dev_t switchtec_devt;
 static DEFINE_IDA(switchtec_minor_ida);
 
@@ -1247,8 +1251,12 @@ static int switchtec_init_isr(struct switchtec_dev *stdev)
 	int dma_mrpc_irq;
 	int rc;
 
-	nvecs = pci_alloc_irq_vectors(stdev->pdev, 1, 4,
-				      PCI_IRQ_MSIX | PCI_IRQ_MSI);
+	if (nirqs < 4)
+		nirqs = 4;
+
+	nvecs = pci_alloc_irq_vectors(stdev->pdev, 1, nirqs,
+				      PCI_IRQ_MSIX | PCI_IRQ_MSI |
+				      PCI_IRQ_VIRTUAL);
 	if (nvecs < 0)
 		return nvecs;
 
-- 
2.19.0


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 4/9] NTB: Introduce functions to calculate multi-port resource index
  2019-01-31 18:56 [PATCH 0/9] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (2 preceding siblings ...)
  2019-01-31 18:56 ` [PATCH 3/9] PCI/switchtec: Add module parameter to request more interrupts Logan Gunthorpe
@ 2019-01-31 18:56 ` Logan Gunthorpe
  2019-01-31 18:56 ` [PATCH 5/9] NTB: Rename ntb.c to support multiple source files in the module Logan Gunthorpe
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Logan Gunthorpe @ 2019-01-31 18:56 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

When using multi-ports each port uses resources (dbs, msgs, mws, etc)
on every other port. Creating a mapping for these resources such that
each port has a corresponding resource on every other port is a bit
tricky.

Introduce the ntb_peer_resource_idx() function for this purpose.
It returns the peer resource number that will correspond with the
local peer index on the remote peer.

Also, introduce ntb_peer_highest_mw_idx() which will use
ntb_peer_resource_idx() but return the MW index starting with the
highest index and working down.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
---
 include/linux/ntb.h | 70 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

diff --git a/include/linux/ntb.h b/include/linux/ntb.h
index 181d16601dd9..f5c69d853489 100644
--- a/include/linux/ntb.h
+++ b/include/linux/ntb.h
@@ -1502,4 +1502,74 @@ static inline int ntb_peer_msg_write(struct ntb_dev *ntb, int pidx, int midx,
 	return ntb->ops->peer_msg_write(ntb, pidx, midx, msg);
 }
 
+/**
+ * ntb_peer_resource_idx() - get a resource index for a given peer idx
+ * @ntb:	NTB device context.
+ * @pidx:	Peer port index.
+ *
+ * When constructing a graph of peers, each remote peer must use a different
+ * resource index (mw, doorbell, etc) to communicate with each other
+ * peer.
+ *
+ * In a two peer system, this function should always return 0 such that
+ * resource 0 points to the remote peer on both ports.
+ *
+ * In a 5 peer system, this function will return the following matrix
+ *
+ * pidx \ port    0    1    2    3    4
+ * 0              0    0    1    2    3
+ * 1              0    1    2    3    4
+ * 2              0    1    2    3    4
+ * 3              0    1    2    3    4
+ *
+ * For example, if this function is used to program peer's memory
+ * windows, port 0 will program MW 0 on all it's peers to point to itself.
+ * port 1 will program MW 0 in port 0 to point to itself and MW 1 on all
+ * other ports. etc.
+ *
+ * For the legacy two host case, ntb_port_number() and ntb_peer_port_number()
+ * both return zero and therefore this function will always return zero.
+ * So MW 0 on each host would be programmed to point to the other host.
+ *
+ * Return: the resource index to use for that peer.
+ */
+static inline int ntb_peer_resource_idx(struct ntb_dev *ntb, int pidx)
+{
+	int local_port, peer_port;
+
+	if (pidx >= ntb_peer_port_count(ntb))
+		return -EINVAL;
+
+	local_port = ntb_port_number(ntb);
+	peer_port = ntb_peer_port_number(ntb, pidx);
+
+	if (peer_port < local_port)
+		return local_port - 1;
+	else
+		return local_port;
+}
+
+/**
+ * ntb_peer_highest_mw_idx() - get a memory window index for a given peer idx
+ *	using the highest index memory windows first
+ *
+ * @ntb:	NTB device context.
+ * @pidx:	Peer port index.
+ *
+ * Like ntb_peer_resource_idx(), except it returns indexes starting with
+ * last memory window index.
+ *
+ * Return: the resource index to use for that peer.
+ */
+static inline int ntb_peer_highest_mw_idx(struct ntb_dev *ntb, int pidx)
+{
+	int ret;
+
+	ret = ntb_peer_resource_idx(ntb, pidx);
+	if (ret < 0)
+		return ret;
+
+	return ntb_mw_count(ntb, pidx) - ret - 1;
+}
+
 #endif
-- 
2.19.0


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 5/9] NTB: Rename ntb.c to support multiple source files in the module
  2019-01-31 18:56 [PATCH 0/9] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (3 preceding siblings ...)
  2019-01-31 18:56 ` [PATCH 4/9] NTB: Introduce functions to calculate multi-port resource index Logan Gunthorpe
@ 2019-01-31 18:56 ` Logan Gunthorpe
  2019-01-31 18:56 ` [PATCH 6/9] NTB: Introduce MSI library Logan Gunthorpe
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Logan Gunthorpe @ 2019-01-31 18:56 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

The kbuild system does not support having multiple source files in
a module if one of those source files has the same name as the module.

Therefore, we must rename ntb.c to core.c, while the module remains
ntb.ko.

This is similar to the way the nvme modules are structured.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
---
 drivers/ntb/Makefile          | 2 ++
 drivers/ntb/{ntb.c => core.c} | 0
 2 files changed, 2 insertions(+)
 rename drivers/ntb/{ntb.c => core.c} (100%)

diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
index 1921dec1949d..537226f8e78d 100644
--- a/drivers/ntb/Makefile
+++ b/drivers/ntb/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_NTB) += ntb.o hw/ test/
 obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
+
+ntb-y := core.o
diff --git a/drivers/ntb/ntb.c b/drivers/ntb/core.c
similarity index 100%
rename from drivers/ntb/ntb.c
rename to drivers/ntb/core.c
-- 
2.19.0


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 6/9] NTB: Introduce MSI library
  2019-01-31 18:56 [PATCH 0/9] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (4 preceding siblings ...)
  2019-01-31 18:56 ` [PATCH 5/9] NTB: Rename ntb.c to support multiple source files in the module Logan Gunthorpe
@ 2019-01-31 18:56 ` Logan Gunthorpe
  2019-01-31 18:56 ` [PATCH 7/9] NTB: Introduce NTB MSI Test Client Logan Gunthorpe
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Logan Gunthorpe @ 2019-01-31 18:56 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

The NTB MSI library allows passing MSI interrupts across a memory
window. This offers similar functionality to doorbells or messages
except will often have much better latency and the client can
potentially use significantly more remote interrupts than typical hardware
provides for doorbells. (Which can be important in high-multiport
setups.)

The library utilizes one memory window per peer and uses the highest
index memory windows. Before any ntb_msi function may be used, the user
must call ntb_msi_init(). It may then setup and tear down the memory
windows when the link state changes using ntb_msi_setup_mws() and
ntb_msi_clear_mws().

The peer which receives the interrupt must call ntb_msim_request_irq()
to assign the interrupt handler (this function is functionally
similar to devm_request_irq()) and the returned descriptor must be
transferred to the peer which can use it to trigger the interrupt.
The triggering peer, once having received the descriptor, can
trigger the interrupt by calling ntb_msi_peer_trigger().

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
---
 drivers/ntb/Kconfig  |  10 ++
 drivers/ntb/Makefile |   3 +-
 drivers/ntb/msi.c    | 313 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/ntb.h  |  69 ++++++++++
 4 files changed, 394 insertions(+), 1 deletion(-)
 create mode 100644 drivers/ntb/msi.c

diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
index 95944e52fa36..b60811d91e3d 100644
--- a/drivers/ntb/Kconfig
+++ b/drivers/ntb/Kconfig
@@ -12,6 +12,16 @@ menuconfig NTB
 
 if NTB
 
+config NTB_MSI
+	bool "MSI Interrupt Support"
+	help
+	 Support using MSI interrupt forwarding instead of (or in addition to)
+	 hardware doorbells. MSI interrupts typically offer lower latency
+	 than doorbells and more MSI interrupts can be made available to
+	 clients. However this requires an extra memory window and support
+	 in the hardware driver for creating the MSI interrupts.
+
+	 If unsure, say N.
 source "drivers/ntb/hw/Kconfig"
 
 source "drivers/ntb/test/Kconfig"
diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
index 537226f8e78d..cc27ad2ef150 100644
--- a/drivers/ntb/Makefile
+++ b/drivers/ntb/Makefile
@@ -1,4 +1,5 @@
 obj-$(CONFIG_NTB) += ntb.o hw/ test/
 obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
 
-ntb-y := core.o
+ntb-y			:= core.o
+ntb-$(CONFIG_NTB_MSI)	+= msi.o
diff --git a/drivers/ntb/msi.c b/drivers/ntb/msi.c
new file mode 100644
index 000000000000..b636956b3772
--- /dev/null
+++ b/drivers/ntb/msi.c
@@ -0,0 +1,313 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+
+#include <linux/irq.h>
+#include <linux/module.h>
+#include <linux/ntb.h>
+#include <linux/msi.h>
+#include <linux/pci.h>
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION("0.1");
+MODULE_AUTHOR("Logan Gunthorpe <logang@deltatee.com>");
+MODULE_DESCRIPTION("NTB MSI Interrupt Library");
+
+struct ntb_msi {
+	u64 base_addr;
+	u64 end_addr;
+	u32 *peer_mws[];
+};
+
+/**
+ * ntb_msi_init() - Initialize the MSI context
+ * @ntb:	NTB device context.
+ *
+ * This function must be called before any other ntb_msi function.
+ * It initializes the context for MSI operations and maps
+ * the peer memory windows.
+ *
+ * This function reserves the last N outbound memory windows (where N
+ * is the number of peers).
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_init(struct ntb_dev *ntb)
+{
+	phys_addr_t mw_phys_addr;
+	resource_size_t mw_size;
+	size_t struct_size;
+	int peer_widx;
+	int peers;
+	int ret;
+	int i;
+
+	peers = ntb_peer_port_count(ntb);
+	if (peers <= 0)
+		return -EINVAL;
+
+	struct_size = sizeof(*ntb->msi) + sizeof(*ntb->msi->peer_mws) * peers;
+
+	ntb->msi = devm_kzalloc(&ntb->dev, struct_size, GFP_KERNEL);
+	if (!ntb->msi)
+		return -ENOMEM;
+
+	for (i = 0; i < peers; i++) {
+		peer_widx = ntb_peer_mw_count(ntb) - 1 - i;
+
+		ret = ntb_peer_mw_get_addr(ntb, peer_widx, &mw_phys_addr,
+					   &mw_size);
+		if (ret)
+			goto unroll;
+
+		ntb->msi->peer_mws[i] = devm_ioremap(&ntb->dev, mw_phys_addr,
+						     mw_size);
+		if (!ntb->msi->peer_mws[i]) {
+			ret = -EFAULT;
+			goto unroll;
+		}
+	}
+
+	return 0;
+
+unroll:
+	for (i = 0; i < peers; i++)
+		if (ntb->msi->peer_mws[i])
+			devm_iounmap(&ntb->dev, ntb->msi->peer_mws[i]);
+
+	devm_kfree(&ntb->dev, ntb->msi);
+	ntb->msi = NULL;
+	return ret;
+}
+EXPORT_SYMBOL(ntb_msi_init);
+
+/**
+ * ntb_msi_setup_mws() - Initialize the MSI inbound memory windows
+ * @ntb:	NTB device context.
+ *
+ * This function sets up the required inbound memory windows. It should be
+ * called from a work function after a link up event.
+ *
+ * Over the entire network, this function will reserves the last N
+ * inbound memory windows for each peer (where N is the number of peers).
+ *
+ * ntb_msi_init() must be called before this function.
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_setup_mws(struct ntb_dev *ntb)
+{
+	struct msi_desc *desc;
+	u64 addr;
+	int peer, peer_widx;
+	resource_size_t addr_align, size_align, size_max;
+	resource_size_t mw_size = SZ_32K;
+	resource_size_t mw_min_size = mw_size;
+	int i;
+	int ret;
+
+	if (!ntb->msi)
+		return -EINVAL;
+
+	desc = first_msi_entry(&ntb->pdev->dev);
+	addr = desc->msg.address_lo + ((uint64_t)desc->msg.address_hi << 32);
+
+	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
+		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+		if (peer_widx < 0)
+			return peer_widx;
+
+		ret = ntb_mw_get_align(ntb, peer, peer_widx, &addr_align,
+				       NULL, NULL);
+		if (ret)
+			return ret;
+
+		addr &= ~(addr_align - 1);
+	}
+
+	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
+		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+		if (peer_widx < 0) {
+			ret = peer_widx;
+			goto error_out;
+		}
+
+		ret = ntb_mw_get_align(ntb, peer, peer_widx, NULL,
+				       &size_align, &size_max);
+		if (ret)
+			goto error_out;
+
+		mw_size = round_up(mw_size, size_align);
+		mw_size = max(mw_size, size_max);
+		if (mw_size < mw_min_size)
+			mw_min_size = mw_size;
+
+		ret = ntb_mw_set_trans(ntb, peer, peer_widx,
+				       addr, mw_size);
+		if (ret)
+			goto error_out;
+	}
+
+	ntb->msi->base_addr = addr;
+	ntb->msi->end_addr = addr + mw_min_size;
+
+	return 0;
+
+error_out:
+	for (i = 0; i < peer; i++) {
+		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+		if (peer_widx < 0)
+			continue;
+
+		ntb_mw_clear_trans(ntb, i, peer_widx);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(ntb_msi_setup_mws);
+
+/**
+ * ntb_msi_clear_mws() - Clear all inbound memory windows
+ * @ntb:	NTB device context.
+ *
+ * This function tears down the resources used by ntb_msi_setup_mws().
+ */
+void ntb_msi_clear_mws(struct ntb_dev *ntb)
+{
+	int peer;
+	int peer_widx;
+
+	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
+		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+		if (peer_widx < 0)
+			continue;
+
+		ntb_mw_clear_trans(ntb, peer, peer_widx);
+	}
+}
+EXPORT_SYMBOL(ntb_msi_clear_mws);
+
+/**
+ * ntbm_msi_request_threaded_irq() - allocate an MSI interrupt
+ * @ntb:	NTB device context.
+ * @handler:	Function to be called when the IRQ occurs
+ * @thread_fn:  Function to be called in a threaded interrupt context. NULL
+ *              for clients which handle everything in @handler
+ * @devname:    An ascii name for the claiming device, dev_name(dev) if NULL
+ * @dev_id:     A cookie passed back to the handler function
+ *
+ * This function assigns an interrupt handler to an unused
+ * MSI interrupt and returns the descriptor used to trigger
+ * it. The descriptor can then be sent to a peer to trigger
+ * the interrupt.
+ *
+ * The interrupt resource is managed with devres so it will
+ * be automatically freed when the NTB device is torn down.
+ *
+ * Return: IRQ number assigned on success, otherwise a negative error number.
+ */
+int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb, irq_handler_t handler,
+				  irq_handler_t thread_fn,
+				  const char *name, void *dev_id,
+				  struct ntb_msi_desc *msi_desc)
+{
+	struct msi_desc *entry;
+	struct irq_desc *desc;
+	u64 addr;
+	int ret;
+
+	if (!ntb->msi)
+		return -EINVAL;
+
+	for_each_pci_msi_entry(entry, ntb->pdev) {
+		desc = irq_to_desc(entry->irq);
+		if (desc->action)
+			continue;
+
+		ret = devm_request_threaded_irq(&ntb->dev, entry->irq, handler,
+						thread_fn, 0, name, dev_id);
+		if (ret)
+			continue;
+
+		addr = entry->msg.address_lo +
+			((uint64_t)entry->msg.address_hi << 32);
+
+		if (addr < ntb->msi->base_addr || addr >= ntb->msi->end_addr) {
+			devm_free_irq(&ntb->dev, entry->irq, dev_id);
+			dev_warn_once(&ntb->dev,
+				      "IRQ %d: MSI Address not within the memory window (%llx, [%llx %llx])\n",
+				      entry->irq, addr, ntb->msi->base_addr,
+				      ntb->msi->end_addr);
+			continue;
+		}
+
+		msi_desc->addr_offset = addr - ntb->msi->base_addr;
+		msi_desc->data = entry->msg.data;
+
+		dev_dbg(&ntb->dev, "Assigned IRQ %d\n", entry->irq);
+
+		return entry->irq;
+	}
+
+	return -ENODEV;
+}
+EXPORT_SYMBOL(ntbm_msi_request_threaded_irq);
+
+/**
+ * ntb_msi_peer_trigger() - Trigger an interrupt handler on a peer
+ * @ntb:	NTB device context.
+ * @peer:	Peer index
+ * @desc:	MSI descriptor data which triggers the interrupt
+ *
+ * This function triggers an interrupt on a peer. It requires
+ * the descriptor structure to have been passed from that peer
+ * by some other means.
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
+			 struct ntb_msi_desc *desc)
+{
+	int idx;
+
+	if (!ntb->msi)
+		return -EINVAL;
+
+	idx = desc->addr_offset / sizeof(*ntb->msi->peer_mws[peer]);
+
+	ntb->msi->peer_mws[peer][idx] = desc->data;
+
+	return 0;
+}
+EXPORT_SYMBOL(ntb_msi_peer_trigger);
+
+/**
+ * ntb_msi_peer_addr() - Get the DMA address to trigger a peer's MSI interrupt
+ * @ntb:	NTB device context.
+ * @peer:	Peer index
+ * @desc:	MSI descriptor data which triggers the interrupt
+ * @msi_addr:   Physical address to trigger the interrupt
+ *
+ * This function allows using DMA engines to trigger an interrupt
+ * (for example, trigger an interrupt to process the data after
+ * sending it). To trigger the interrupt, write @desc.data to the address
+ * returned in @msi_addr
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
+		      struct ntb_msi_desc *desc,
+		      phys_addr_t *msi_addr)
+{
+	int peer_widx = ntb_peer_mw_count(ntb) - 1 - peer;
+	phys_addr_t mw_phys_addr;
+	int ret;
+
+	ret = ntb_peer_mw_get_addr(ntb, peer_widx, &mw_phys_addr, NULL);
+	if (ret)
+		return ret;
+
+	if (msi_addr)
+		*msi_addr = mw_phys_addr + desc->addr_offset;
+
+	return 0;
+}
+EXPORT_SYMBOL(ntb_msi_peer_addr);
diff --git a/include/linux/ntb.h b/include/linux/ntb.h
index f5c69d853489..21ad2e3ee13d 100644
--- a/include/linux/ntb.h
+++ b/include/linux/ntb.h
@@ -58,9 +58,11 @@
 
 #include <linux/completion.h>
 #include <linux/device.h>
+#include <linux/interrupt.h>
 
 struct ntb_client;
 struct ntb_dev;
+struct ntb_msi;
 struct pci_dev;
 
 /**
@@ -425,6 +427,10 @@ struct ntb_dev {
 	spinlock_t			ctx_lock;
 	/* block unregister until device is fully released */
 	struct completion		released;
+
+	#ifdef CONFIG_NTB_MSI
+	struct ntb_msi *msi;
+	#endif
 };
 #define dev_ntb(__dev) container_of((__dev), struct ntb_dev, dev)
 
@@ -1572,4 +1578,67 @@ static inline int ntb_peer_highest_mw_idx(struct ntb_dev *ntb, int pidx)
 	return ntb_mw_count(ntb, pidx) - ret - 1;
 }
 
+struct ntb_msi_desc {
+	u32 addr_offset;
+	u32 data;
+};
+
+#ifdef CONFIG_NTB_MSI
+
+int ntb_msi_init(struct ntb_dev *ntb);
+int ntb_msi_setup_mws(struct ntb_dev *ntb);
+void ntb_msi_clear_mws(struct ntb_dev *ntb);
+int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb, irq_handler_t handler,
+				  irq_handler_t thread_fn,
+				  const char *name, void *dev_id,
+				  struct ntb_msi_desc *msi_desc);
+int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
+			 struct ntb_msi_desc *desc);
+int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
+		      struct ntb_msi_desc *desc,
+		      phys_addr_t *msi_addr);
+
+#else /* not CONFIG_NTB_MSI */
+
+static inline int ntb_msi_init(struct ntb_dev *ntb)
+{
+	return -EOPNOTSUPP;
+}
+static inline int ntb_msi_setup_mws(struct ntb_dev *ntb)
+{
+	return -EOPNOTSUPP;
+}
+static inline void ntb_msi_clear_mws(struct ntb_dev *ntb) {}
+static inline int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb,
+						irq_handler_t handler,
+						irq_handler_t thread_fn,
+						const char *name, void *dev_id,
+						struct ntb_msi_desc *msi_desc)
+{
+	return -EOPNOTSUPP;
+}
+static inline int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
+				       struct ntb_msi_desc *desc)
+{
+	return -EOPNOTSUPP;
+}
+static inline int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
+				    struct ntb_msi_desc *desc,
+				    phys_addr_t *msi_addr)
+{
+	return -EOPNOTSUPP;
+
+}
+
+#endif /* CONFIG_NTB_MSI */
+
+static inline int ntbm_msi_request_irq(struct ntb_dev *ntb,
+				       irq_handler_t handler,
+				       const char *name, void *dev_id,
+				       struct ntb_msi_desc *msi_desc)
+{
+	return ntbm_msi_request_threaded_irq(ntb, handler, NULL, name,
+					     dev_id, msi_desc);
+}
+
 #endif
-- 
2.19.0


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 7/9] NTB: Introduce NTB MSI Test Client
  2019-01-31 18:56 [PATCH 0/9] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (5 preceding siblings ...)
  2019-01-31 18:56 ` [PATCH 6/9] NTB: Introduce MSI library Logan Gunthorpe
@ 2019-01-31 18:56 ` Logan Gunthorpe
  2019-01-31 18:56 ` [PATCH 8/9] NTB: Add ntb_msi_test support to ntb_test Logan Gunthorpe
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Logan Gunthorpe @ 2019-01-31 18:56 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

Introduce a tool to test NTB MSI interrupts similar to the other
NTB test tools. This tool creates a debugfs directory for each
NTB device with the following files:

port
irqX_occurrences
peerX/port
peerX/count
peerX/trigger

The 'port' file tells the user the local port number and the
'occurrences' files tell the number of local interrupts that
have been received for each interrupt.

For each peer, the 'port' file and the 'count' file tell you the
peer's port number and number of interrupts respectively. Writing
the interrupt number to the 'trigger' file triggers the interrupt
handler for the peer which should increment their corresponding
'occurrences' file. The 'ready' file indicates if a peer is ready,
writing to this file blocks until it is ready.

The module parameter num_irqs can be used to set the number of
local interrupts. By default this is 4. This is only limited by
the number of unused MSI interrupts registered by the hardware
(this will require support of the hardware driver) and there must
be at least 2*num_irqs + 1 spads registers available.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
---
 drivers/ntb/test/Kconfig        |   9 +
 drivers/ntb/test/Makefile       |   1 +
 drivers/ntb/test/ntb_msi_test.c | 416 ++++++++++++++++++++++++++++++++
 3 files changed, 426 insertions(+)
 create mode 100644 drivers/ntb/test/ntb_msi_test.c

diff --git a/drivers/ntb/test/Kconfig b/drivers/ntb/test/Kconfig
index a5d0eda44438..a3f3e2638935 100644
--- a/drivers/ntb/test/Kconfig
+++ b/drivers/ntb/test/Kconfig
@@ -25,3 +25,12 @@ config NTB_PERF
 	 to and from the window without additional software interaction.
 
 	 If unsure, say N.
+
+config NTB_MSI_TEST
+	tristate "NTB MSI Test Client"
+	depends on NTB_MSI
+	help
+	  This tool demonstrates the use of the NTB MSI library to
+	  send MSI interrupts between peers.
+
+	  If unsure, say N.
diff --git a/drivers/ntb/test/Makefile b/drivers/ntb/test/Makefile
index 9e77e0b761c2..d2895ca995e4 100644
--- a/drivers/ntb/test/Makefile
+++ b/drivers/ntb/test/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_NTB_PINGPONG) += ntb_pingpong.o
 obj-$(CONFIG_NTB_TOOL) += ntb_tool.o
 obj-$(CONFIG_NTB_PERF) += ntb_perf.o
+obj-$(CONFIG_NTB_MSI_TEST) += ntb_msi_test.o
diff --git a/drivers/ntb/test/ntb_msi_test.c b/drivers/ntb/test/ntb_msi_test.c
new file mode 100644
index 000000000000..2d7435591a65
--- /dev/null
+++ b/drivers/ntb/test/ntb_msi_test.c
@@ -0,0 +1,416 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+
+#include <linux/module.h>
+#include <linux/debugfs.h>
+#include <linux/ntb.h>
+#include <linux/pci.h>
+#include <linux/radix-tree.h>
+#include <linux/workqueue.h>
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION("0.1");
+MODULE_AUTHOR("Logan Gunthorpe <logang@deltatee.com>");
+MODULE_DESCRIPTION("Test for sending MSI interrupts over an NTB memory window");
+
+static int num_irqs = 4;
+module_param(num_irqs, int, 0644);
+MODULE_PARM_DESC(num_irqs, "number of irqs to use");
+
+struct ntb_msit_ctx {
+	struct ntb_dev *ntb;
+	struct dentry *dbgfs_dir;
+	struct work_struct setup_work;
+
+	struct ntb_msit_isr_ctx {
+		int irq_idx;
+		int irq_num;
+		int occurrences;
+		struct ntb_msit_ctx *nm;
+		struct ntb_msi_desc desc;
+	} *isr_ctx;
+
+	struct ntb_msit_peer {
+		struct ntb_msit_ctx *nm;
+		int pidx;
+		int num_irqs;
+		struct completion init_comp;
+		struct ntb_msi_desc *msi_desc;
+	} peers[];
+};
+
+static struct dentry *ntb_msit_dbgfs_topdir;
+
+static irqreturn_t ntb_msit_isr(int irq, void *dev)
+{
+	struct ntb_msit_isr_ctx *isr_ctx = dev;
+	struct ntb_msit_ctx *nm = isr_ctx->nm;
+
+	dev_dbg(&nm->ntb->dev, "Interrupt Occurred: %d",
+		isr_ctx->irq_idx);
+
+	isr_ctx->occurrences++;
+
+	return IRQ_HANDLED;
+}
+
+static void ntb_msit_setup_work(struct work_struct *work)
+{
+	struct ntb_msit_ctx *nm = container_of(work, struct ntb_msit_ctx,
+					       setup_work);
+	int irq_count = 0;
+	int irq;
+	int ret;
+	uintptr_t i;
+
+	ret = ntb_msi_setup_mws(nm->ntb);
+	if (ret) {
+		dev_err(&nm->ntb->dev, "Unable to setup MSI windows: %d\n",
+			ret);
+		return;
+	}
+
+	for (i = 0; i < num_irqs; i++) {
+		nm->isr_ctx[i].irq_idx = i;
+		nm->isr_ctx[i].nm = nm;
+
+		if (!nm->isr_ctx[i].irq_num) {
+			irq = ntbm_msi_request_irq(nm->ntb, ntb_msit_isr,
+						   KBUILD_MODNAME,
+						   &nm->isr_ctx[i],
+						   &nm->isr_ctx[i].desc);
+			if (irq < 0)
+				break;
+
+			nm->isr_ctx[i].irq_num = irq;
+		}
+
+		ret = ntb_spad_write(nm->ntb, 2 * i + 1,
+				     nm->isr_ctx[i].desc.addr_offset);
+		if (ret)
+			break;
+
+		ret = ntb_spad_write(nm->ntb, 2 * i + 2,
+				     nm->isr_ctx[i].desc.data);
+		if (ret)
+			break;
+
+		irq_count++;
+	}
+
+	ntb_spad_write(nm->ntb, 0, irq_count);
+	ntb_peer_db_set(nm->ntb, BIT(ntb_port_number(nm->ntb)));
+}
+
+static void ntb_msit_link_event(void *ctx)
+{
+	struct ntb_msit_ctx *nm = ctx;
+
+	if (!ntb_link_is_up(nm->ntb, NULL, NULL))
+		return;
+
+	schedule_work(&nm->setup_work);
+}
+
+static void ntb_msit_copy_peer_desc(struct ntb_msit_ctx *nm, int peer)
+{
+	int i;
+	struct ntb_msi_desc *desc = nm->peers[peer].msi_desc;
+	int irq_count = nm->peers[peer].num_irqs;
+
+	for (i = 0; i < irq_count; i++) {
+		desc[i].addr_offset = ntb_peer_spad_read(nm->ntb, peer,
+							 2 * i + 1);
+		desc[i].data = ntb_peer_spad_read(nm->ntb, peer, 2 * i + 2);
+	}
+
+	dev_info(&nm->ntb->dev, "Found %d interrupts on peer %d\n",
+		 irq_count, peer);
+
+	complete_all(&nm->peers[peer].init_comp);
+}
+
+static void ntb_msit_db_event(void *ctx, int vec)
+{
+	struct ntb_msit_ctx *nm = ctx;
+	struct ntb_msi_desc *desc;
+	u64 peer_mask = ntb_db_read(nm->ntb);
+	u32 irq_count;
+	int peer;
+
+	ntb_db_clear(nm->ntb, peer_mask);
+
+	for (peer = 0; peer < sizeof(peer_mask) * 8; peer++) {
+		if (!(peer_mask & BIT(peer)))
+			continue;
+
+		irq_count = ntb_peer_spad_read(nm->ntb, peer, 0);
+		if (irq_count == -1)
+			continue;
+
+		desc = kcalloc(irq_count, sizeof(*desc), GFP_ATOMIC);
+		if (!desc)
+			continue;
+
+		kfree(nm->peers[peer].msi_desc);
+		nm->peers[peer].msi_desc = desc;
+		nm->peers[peer].num_irqs = irq_count;
+
+		ntb_msit_copy_peer_desc(nm, peer);
+	}
+}
+
+static const struct ntb_ctx_ops ntb_msit_ops = {
+	.link_event = ntb_msit_link_event,
+	.db_event = ntb_msit_db_event,
+};
+
+static int ntb_msit_dbgfs_trigger(void *data, u64 idx)
+{
+	struct ntb_msit_peer *peer = data;
+
+	if (idx >= peer->num_irqs)
+		return -EINVAL;
+
+	dev_dbg(&peer->nm->ntb->dev, "trigger irq %llu on peer %u\n",
+		idx, peer->pidx);
+
+	return ntb_msi_peer_trigger(peer->nm->ntb, peer->pidx,
+				    &peer->msi_desc[idx]);
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_trigger_fops, NULL,
+			 ntb_msit_dbgfs_trigger, "%llu\n");
+
+static int ntb_msit_dbgfs_port_get(void *data, u64 *port)
+{
+	struct ntb_msit_peer *peer = data;
+
+	*port = ntb_peer_port_number(peer->nm->ntb, peer->pidx);
+
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_port_fops, ntb_msit_dbgfs_port_get,
+			 NULL, "%llu\n");
+
+static int ntb_msit_dbgfs_count_get(void *data, u64 *count)
+{
+	struct ntb_msit_peer *peer = data;
+
+	*count = peer->num_irqs;
+
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_count_fops, ntb_msit_dbgfs_count_get,
+			 NULL, "%llu\n");
+
+static int ntb_msit_dbgfs_ready_get(void *data, u64 *ready)
+{
+	struct ntb_msit_peer *peer = data;
+
+	*ready = try_wait_for_completion(&peer->init_comp);
+
+	return 0;
+}
+
+static int ntb_msit_dbgfs_ready_set(void *data, u64 ready)
+{
+	struct ntb_msit_peer *peer = data;
+
+	return wait_for_completion_interruptible(&peer->init_comp);
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_ready_fops, ntb_msit_dbgfs_ready_get,
+			 ntb_msit_dbgfs_ready_set, "%llu\n");
+
+static int ntb_msit_dbgfs_occurrences_get(void *data, u64 *occurrences)
+{
+	struct ntb_msit_isr_ctx *isr_ctx = data;
+
+	*occurrences = isr_ctx->occurrences;
+
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_occurrences_fops,
+			 ntb_msit_dbgfs_occurrences_get,
+			 NULL, "%llu\n");
+
+static int ntb_msit_dbgfs_local_port_get(void *data, u64 *port)
+{
+	struct ntb_msit_ctx *nm = data;
+
+	*port = ntb_port_number(nm->ntb);
+
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_local_port_fops,
+			 ntb_msit_dbgfs_local_port_get,
+			 NULL, "%llu\n");
+
+static void ntb_msit_create_dbgfs(struct ntb_msit_ctx *nm)
+{
+	struct pci_dev *pdev = nm->ntb->pdev;
+	char buf[32];
+	int i;
+	struct dentry *peer_dir;
+
+	nm->dbgfs_dir = debugfs_create_dir(pci_name(pdev),
+					   ntb_msit_dbgfs_topdir);
+	debugfs_create_file("port", 0400, nm->dbgfs_dir, nm,
+			    &ntb_msit_local_port_fops);
+
+	for (i = 0; i < ntb_peer_port_count(nm->ntb); i++) {
+		nm->peers[i].pidx = i;
+		nm->peers[i].nm = nm;
+		init_completion(&nm->peers[i].init_comp);
+
+		snprintf(buf, sizeof(buf), "peer%d", i);
+		peer_dir = debugfs_create_dir(buf, nm->dbgfs_dir);
+
+		debugfs_create_file_unsafe("trigger", 0200, peer_dir,
+					   &nm->peers[i],
+					   &ntb_msit_trigger_fops);
+
+		debugfs_create_file_unsafe("port", 0400, peer_dir,
+					   &nm->peers[i], &ntb_msit_port_fops);
+
+		debugfs_create_file_unsafe("count", 0400, peer_dir,
+					   &nm->peers[i],
+					   &ntb_msit_count_fops);
+
+		debugfs_create_file_unsafe("ready", 0600, peer_dir,
+					   &nm->peers[i],
+					   &ntb_msit_ready_fops);
+	}
+
+	for (i = 0; i < num_irqs; i++) {
+		snprintf(buf, sizeof(buf), "irq%d_occurrences", i);
+		debugfs_create_file_unsafe(buf, 0400, nm->dbgfs_dir,
+					   &nm->isr_ctx[i],
+					   &ntb_msit_occurrences_fops);
+	}
+}
+
+static void ntb_msit_remove_dbgfs(struct ntb_msit_ctx *nm)
+{
+	debugfs_remove_recursive(nm->dbgfs_dir);
+}
+
+static int ntb_msit_probe(struct ntb_client *client, struct ntb_dev *ntb)
+{
+	struct ntb_msit_ctx *nm;
+	size_t struct_size;
+	int peers;
+	int ret;
+
+	peers = ntb_peer_port_count(ntb);
+	if (peers <= 0)
+		return -EINVAL;
+
+	if (ntb_spad_is_unsafe(ntb) || ntb_spad_count(ntb) < 2 * num_irqs + 1) {
+		dev_err(&ntb->dev, "NTB MSI test requires at least %d spads for %d irqs\n",
+			2 * num_irqs + 1, num_irqs);
+		return -EFAULT;
+	}
+
+	ret = ntb_spad_write(ntb, 0, -1);
+	if (ret) {
+		dev_err(&ntb->dev, "Unable to write spads: %d\n", ret);
+		return ret;
+	}
+
+	ret = ntb_db_clear_mask(ntb, GENMASK(peers - 1, 0));
+	if (ret) {
+		dev_err(&ntb->dev, "Unable to clear doorbell mask: %d\n", ret);
+		return ret;
+	}
+
+	ret = ntb_msi_init(ntb);
+	if (ret) {
+		dev_err(&ntb->dev, "Unable to initialize MSI library: %d\n",
+			ret);
+		return ret;
+	}
+
+	struct_size = sizeof(*nm) + sizeof(*nm->peers) * peers;
+
+	nm = devm_kzalloc(&ntb->dev, struct_size, GFP_KERNEL);
+	if (!nm)
+		return -ENOMEM;
+
+	nm->isr_ctx = devm_kcalloc(&ntb->dev, num_irqs, sizeof(*nm->isr_ctx),
+				   GFP_KERNEL);
+	if (!nm->isr_ctx)
+		return -ENOMEM;
+
+	INIT_WORK(&nm->setup_work, ntb_msit_setup_work);
+	nm->ntb = ntb;
+
+	ntb_msit_create_dbgfs(nm);
+
+	ret = ntb_set_ctx(ntb, nm, &ntb_msit_ops);
+	if (ret)
+		goto remove_dbgfs;
+
+	if (!nm->isr_ctx)
+		goto remove_dbgfs;
+
+	ntb_link_enable(ntb, NTB_SPEED_AUTO, NTB_WIDTH_AUTO);
+
+	return 0;
+
+remove_dbgfs:
+	ntb_msit_remove_dbgfs(nm);
+	devm_kfree(&ntb->dev, nm->isr_ctx);
+	devm_kfree(&ntb->dev, nm);
+	return ret;
+}
+
+static void ntb_msit_remove(struct ntb_client *client, struct ntb_dev *ntb)
+{
+	struct ntb_msit_ctx *nm = ntb->ctx;
+	int i;
+
+	ntb_link_disable(ntb);
+	ntb_db_set_mask(ntb, ntb_db_valid_mask(ntb));
+	ntb_msi_clear_mws(ntb);
+
+	for (i = 0; i < ntb_peer_port_count(ntb); i++)
+		kfree(nm->peers[i].msi_desc);
+
+	ntb_clear_ctx(ntb);
+	ntb_msit_remove_dbgfs(nm);
+}
+
+static struct ntb_client ntb_msit_client = {
+	.ops = {
+		.probe = ntb_msit_probe,
+		.remove = ntb_msit_remove
+	}
+};
+
+static int __init ntb_msit_init(void)
+{
+	int ret;
+
+	if (debugfs_initialized())
+		ntb_msit_dbgfs_topdir = debugfs_create_dir(KBUILD_MODNAME,
+							   NULL);
+
+	ret = ntb_register_client(&ntb_msit_client);
+	if (ret)
+		debugfs_remove_recursive(ntb_msit_dbgfs_topdir);
+
+	return ret;
+}
+module_init(ntb_msit_init);
+
+static void __exit ntb_msit_exit(void)
+{
+	ntb_unregister_client(&ntb_msit_client);
+	debugfs_remove_recursive(ntb_msit_dbgfs_topdir);
+}
+module_exit(ntb_msit_exit);
-- 
2.19.0


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 8/9] NTB: Add ntb_msi_test support to ntb_test
  2019-01-31 18:56 [PATCH 0/9] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (6 preceding siblings ...)
  2019-01-31 18:56 ` [PATCH 7/9] NTB: Introduce NTB MSI Test Client Logan Gunthorpe
@ 2019-01-31 18:56 ` Logan Gunthorpe
  2019-01-31 18:56 ` [PATCH 9/9] NTB: Add MSI interrupt support to ntb_transport Logan Gunthorpe
  2019-01-31 20:20 ` [PATCH 0/9] Support using MSI interrupts in ntb_transport Dave Jiang
  9 siblings, 0 replies; 26+ messages in thread
From: Logan Gunthorpe @ 2019-01-31 18:56 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

When the ntb_msi_test module is available, the test code will trigger
each of the interrupts and ensure the corresponding occurrences files
gets incremented.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
---
 tools/testing/selftests/ntb/ntb_test.sh | 54 ++++++++++++++++++++++++-
 1 file changed, 52 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/ntb/ntb_test.sh b/tools/testing/selftests/ntb/ntb_test.sh
index 17ca36403d04..1a10b8f67727 100755
--- a/tools/testing/selftests/ntb/ntb_test.sh
+++ b/tools/testing/selftests/ntb/ntb_test.sh
@@ -87,10 +87,10 @@ set -e
 
 function _modprobe()
 {
-	modprobe "$@"
+	modprobe "$@" || return 1
 
 	if [[ "$REMOTE_HOST" != "" ]]; then
-		ssh "$REMOTE_HOST" modprobe "$@"
+		ssh "$REMOTE_HOST" modprobe "$@" || return 1
 	fi
 }
 
@@ -451,6 +451,30 @@ function pingpong_test()
 	echo "  Passed"
 }
 
+function msi_test()
+{
+	LOC=$1
+	REM=$2
+
+	write_file 1 $LOC/ready
+
+	echo "Running MSI interrupt tests on: $(subdirname $LOC) / $(subdirname $REM)"
+
+	CNT=$(read_file "$LOC/count")
+	for ((i = 0; i < $CNT; i++)); do
+		START=$(read_file $REM/../irq${i}_occurrences)
+		write_file $i $LOC/trigger
+		END=$(read_file $REM/../irq${i}_occurrences)
+
+		if [[ $(($END - $START)) != 1 ]]; then
+			echo "MSI did not trigger the interrupt on the remote side!" >&2
+			exit 1
+		fi
+	done
+
+	echo "  Passed"
+}
+
 function perf_test()
 {
 	USE_DMA=$1
@@ -529,6 +553,29 @@ function ntb_pingpong_tests()
 	_modprobe -r ntb_pingpong
 }
 
+function ntb_msi_tests()
+{
+	LOCAL_MSI="$DEBUGFS/ntb_msi_test/$LOCAL_DEV"
+	REMOTE_MSI="$REMOTE_HOST:$DEBUGFS/ntb_msi_test/$REMOTE_DEV"
+
+	echo "Starting ntb_msi_test tests..."
+
+	if ! _modprobe ntb_msi_test 2> /dev/null; then
+		echo "  Not doing MSI tests seeing the module is not available."
+		return
+	fi
+
+	port_test $LOCAL_MSI $REMOTE_MSI
+
+	LOCAL_PEER="$LOCAL_MSI/peer$LOCAL_PIDX"
+	REMOTE_PEER="$REMOTE_MSI/peer$REMOTE_PIDX"
+
+	msi_test $LOCAL_PEER $REMOTE_PEER
+	msi_test $REMOTE_PEER $LOCAL_PEER
+
+	_modprobe -r ntb_msi_test
+}
+
 function ntb_perf_tests()
 {
 	LOCAL_PERF="$DEBUGFS/ntb_perf/$LOCAL_DEV"
@@ -550,6 +597,7 @@ function cleanup()
 	_modprobe -r ntb_perf 2> /dev/null
 	_modprobe -r ntb_pingpong 2> /dev/null
 	_modprobe -r ntb_transport 2> /dev/null
+	_modprobe -r ntb_msi_test 2> /dev/null
 	set -e
 }
 
@@ -586,5 +634,7 @@ ntb_tool_tests
 echo
 ntb_pingpong_tests
 echo
+ntb_msi_tests
+echo
 ntb_perf_tests
 echo
-- 
2.19.0


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 9/9] NTB: Add MSI interrupt support to ntb_transport
  2019-01-31 18:56 [PATCH 0/9] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (7 preceding siblings ...)
  2019-01-31 18:56 ` [PATCH 8/9] NTB: Add ntb_msi_test support to ntb_test Logan Gunthorpe
@ 2019-01-31 18:56 ` Logan Gunthorpe
  2019-01-31 20:20 ` [PATCH 0/9] Support using MSI interrupts in ntb_transport Dave Jiang
  9 siblings, 0 replies; 26+ messages in thread
From: Logan Gunthorpe @ 2019-01-31 18:56 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

Introduce the module parameter 'use_msi' which, when set uses
MSI interrupts instead of doorbells for each queue pair (QP). T
he parameter is only available if NTB MSI support is configured into
the kernel. We also require there to be more than one memory window
(MW) so that an extra one is available to forward the APIC region.

To use MSIs, we request one interrupt per QP and forward the MSI address
and data to the peer using scratch pad registers (SPADS) above the MW
spads. (If there are not enough SPADS the MSI interrupt will not be used.)

Once registered, we simply use ntb_msi_peer_trigger and the recieving
ISR simply queues up the rxc_db_work for the queue.

This addition can significantly improve performance of ntb_transport.
In a simple, untuned, apples-to-apples comparision using ntb_netdev
and iperf with switchtec hardware, I see 3.88Gb/s without MSI
interrupts and 14.1Gb/s which is a more than 3x improvement.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
---
 drivers/ntb/ntb_transport.c | 134 +++++++++++++++++++++++++++++++++++-
 1 file changed, 133 insertions(+), 1 deletion(-)

diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
index 526b65afc16a..d9e61bf28830 100644
--- a/drivers/ntb/ntb_transport.c
+++ b/drivers/ntb/ntb_transport.c
@@ -93,6 +93,12 @@ static bool use_dma;
 module_param(use_dma, bool, 0644);
 MODULE_PARM_DESC(use_dma, "Use DMA engine to perform large data copy");
 
+static bool use_msi;
+#ifdef CONFIG_NTB_MSI
+module_param(use_msi, bool, 0644);
+MODULE_PARM_DESC(use_msi, "Use MSI interrupts instead of doorbells");
+#endif
+
 static struct dentry *nt_debugfs_dir;
 
 /* Only two-ports NTB devices are supported */
@@ -188,6 +194,11 @@ struct ntb_transport_qp {
 	u64 tx_err_no_buf;
 	u64 tx_memcpy;
 	u64 tx_async;
+
+	bool use_msi;
+	int msi_irq;
+	struct ntb_msi_desc msi_desc;
+	struct ntb_msi_desc peer_msi_desc;
 };
 
 struct ntb_transport_mw {
@@ -221,6 +232,9 @@ struct ntb_transport_ctx {
 	u64 qp_bitmap;
 	u64 qp_bitmap_free;
 
+	bool use_msi;
+	unsigned int msi_spad_offset;
+
 	bool link_is_up;
 	struct delayed_work link_work;
 	struct work_struct link_cleanup;
@@ -667,6 +681,91 @@ static int ntb_transport_setup_qp_mw(struct ntb_transport_ctx *nt,
 	return 0;
 }
 
+static irqreturn_t ntb_transport_isr(int irq, void *dev)
+{
+	struct ntb_transport_qp *qp = dev;
+
+	tasklet_schedule(&qp->rxc_db_work);
+
+	return IRQ_HANDLED;
+}
+
+static void ntb_transport_setup_qp_peer_msi(struct ntb_transport_ctx *nt,
+					    unsigned int qp_num)
+{
+	struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
+	int spad = qp_num * 2 + nt->msi_spad_offset;
+
+	if (!nt->use_msi)
+		return;
+
+	if (spad >= ntb_spad_count(nt->ndev))
+		return;
+
+	qp->peer_msi_desc.addr_offset =
+		ntb_peer_spad_read(qp->ndev, PIDX, spad);
+	qp->peer_msi_desc.data =
+		ntb_peer_spad_read(qp->ndev, PIDX, spad + 1);
+
+	dev_dbg(&qp->ndev->pdev->dev, "QP%d Peer MSI addr=%x data=%x\n",
+		qp_num, qp->peer_msi_desc.addr_offset, qp->peer_msi_desc.data);
+
+	if (qp->peer_msi_desc.addr_offset) {
+		qp->use_msi = true;
+		dev_info(&qp->ndev->pdev->dev,
+			 "Using MSI interrupts for QP%d\n", qp_num);
+	}
+}
+
+static void ntb_transport_setup_qp_msi(struct ntb_transport_ctx *nt,
+				       unsigned int qp_num)
+{
+	struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
+	int spad = qp_num * 2 + nt->msi_spad_offset;
+	int rc;
+
+	if (!nt->use_msi)
+		return;
+
+	if (spad >= ntb_spad_count(nt->ndev)) {
+		dev_warn_once(&qp->ndev->pdev->dev,
+			      "Not enough SPADS to use MSI interrupts\n");
+		return;
+	}
+
+	ntb_spad_write(qp->ndev, spad, 0);
+	ntb_spad_write(qp->ndev, spad + 1, 0);
+
+	if (!qp->msi_irq) {
+		qp->msi_irq = ntbm_msi_request_irq(qp->ndev, ntb_transport_isr,
+						   KBUILD_MODNAME, qp,
+						   &qp->msi_desc);
+		if (qp->msi_irq < 0) {
+			dev_warn(&qp->ndev->pdev->dev,
+				 "Unable to allocate MSI interrupt for qp%d\n",
+				 qp_num);
+			return;
+		}
+	}
+
+	rc = ntb_spad_write(qp->ndev, spad, qp->msi_desc.addr_offset);
+	if (rc)
+		goto err_free_interrupt;
+
+	rc = ntb_spad_write(qp->ndev, spad + 1, qp->msi_desc.data);
+	if (rc)
+		goto err_free_interrupt;
+
+	dev_dbg(&qp->ndev->pdev->dev, "QP%d MSI %d addr=%x data=%x\n",
+		qp_num, qp->msi_irq, qp->msi_desc.addr_offset,
+		qp->msi_desc.data);
+
+	return;
+
+err_free_interrupt:
+	devm_free_irq(&nt->ndev->dev, qp->msi_irq, qp);
+}
+
 static void ntb_free_mw(struct ntb_transport_ctx *nt, int num_mw)
 {
 	struct ntb_transport_mw *mw = &nt->mw_vec[num_mw];
@@ -902,6 +1001,20 @@ static void ntb_transport_link_work(struct work_struct *work)
 	int rc = 0, i, spad;
 
 	/* send the local info, in the opposite order of the way we read it */
+
+	if (nt->use_msi) {
+		rc = ntb_msi_setup_mws(ndev);
+		if (rc) {
+			dev_warn(&pdev->dev,
+				 "Failed to register MSI memory window: %d\n",
+				 rc);
+			nt->use_msi = false;
+		}
+	}
+
+	for (i = 0; i < nt->qp_count; i++)
+		ntb_transport_setup_qp_msi(nt, i);
+
 	for (i = 0; i < nt->mw_count; i++) {
 		size = nt->mw_vec[i].phys_size;
 
@@ -959,6 +1072,7 @@ static void ntb_transport_link_work(struct work_struct *work)
 		struct ntb_transport_qp *qp = &nt->qp_vec[i];
 
 		ntb_transport_setup_qp_mw(nt, i);
+		ntb_transport_setup_qp_peer_msi(nt, i);
 
 		if (qp->client_ready)
 			schedule_delayed_work(&qp->link_work, 0);
@@ -1132,6 +1246,19 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
 		return -ENOMEM;
 
 	nt->ndev = ndev;
+
+	/*
+	 * If we are using MSI, and have at least one extra memory window,
+	 * we will reserve the last MW for the MSI window.
+	 */
+	if (use_msi && mw_count > 1) {
+		rc = ntb_msi_init(ndev);
+		if (!rc) {
+			mw_count -= 1;
+			nt->use_msi = true;
+		}
+	}
+
 	spad_count = ntb_spad_count(ndev);
 
 	/* Limit the MW's based on the availability of scratchpads */
@@ -1145,6 +1272,8 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
 	max_mw_count_for_spads = (spad_count - MW0_SZ_HIGH) / 2;
 	nt->mw_count = min(mw_count, max_mw_count_for_spads);
 
+	nt->msi_spad_offset = nt->mw_count * 2 + MW0_SZ_HIGH;
+
 	nt->mw_vec = kcalloc_node(mw_count, sizeof(*nt->mw_vec),
 				  GFP_KERNEL, node);
 	if (!nt->mw_vec) {
@@ -1598,7 +1727,10 @@ static void ntb_tx_copy_callback(void *data,
 
 	iowrite32(entry->flags | DESC_DONE_FLAG, &hdr->flags);
 
-	ntb_peer_db_set(qp->ndev, BIT_ULL(qp->qp_num));
+	if (qp->use_msi)
+		ntb_msi_peer_trigger(qp->ndev, PIDX, &qp->peer_msi_desc);
+	else
+		ntb_peer_db_set(qp->ndev, BIT_ULL(qp->qp_num));
 
 	/* The entry length can only be zero if the packet is intended to be a
 	 * "link down" or similar.  Since no payload is being sent in these
-- 
2.19.0


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport
  2019-01-31 18:56 [PATCH 0/9] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (8 preceding siblings ...)
  2019-01-31 18:56 ` [PATCH 9/9] NTB: Add MSI interrupt support to ntb_transport Logan Gunthorpe
@ 2019-01-31 20:20 ` Dave Jiang
  2019-01-31 20:48   ` Logan Gunthorpe
  9 siblings, 1 reply; 26+ messages in thread
From: Dave Jiang @ 2019-01-31 20:20 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-ntb, linux-pci, iommu,
	linux-kselftest, Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Serge Semin, Eric Pilmore


On 1/31/2019 11:56 AM, Logan Gunthorpe wrote:
> Hi,
>
> This patch series adds optional support for using MSI interrupts instead
> of NTB doorbells in ntb_transport. This is desirable seeing doorbells on
> current hardware are quite slow and therefore switching to MSI interrupts
> provides a significant performance gain. On switchtec hardware, a simple
> apples-to-apples comparison shows ntb_netdev/iperf numbers going from
> 3.88Gb/s to 14.1Gb/s when switching to MSI interrupts.
>
> To do this, a couple changes are required outside of the NTB tree:
>
> 1) The IOMMU must know to accept MSI requests from aliased bused numbers
> seeing NTB hardware typically sends proxied request IDs through
> additional requester IDs. The first patch in this series adds support
> for the Intel IOMMU. A quirk to add these aliases for switchtec hardware
> was already accepted. See commit ad281ecf1c7d ("PCI: Add DMA alias quirk
> for Microsemi Switchtec NTB") for a description of NTB proxy IDs and why
> this is necessary.
>
> 2) NTB transport (and other clients) may often need more MSI interrupts
> than the NTB hardware actually advertises support for. However, seeing
> these interrupts will not be triggered by the hardware but through an
> NTB memory window, the hardware does not actually need support or need
> to know about them. Therefore we add the concept of Virtual MSI
> interrupts which are allocated just like any other MSI interrupt but
> are not programmed into the hardware's MSI table. This is done in
> Patch 2 and then made use of in Patch 3.

Logan,

Does this work when the system moves the MSI vector either via software 
(irqbalance) or BIOS APIC programming (some modes cause round robin 
behavior)?



>
> The remaining patches in this series add a library for dealing with MSI
> interrupts, a test client and finally support in ntb_transport.
>
> The series is based off of v5.0-rc4 and I've tested it on top of a
> of the patches I've already sent to the NTB tree (though they are
> independent changes). A git repo is available here:
>
> https://github.com/sbates130272/linux-p2pmem/ ntb_transport_msi_v1
>
> Thanks,
>
> Logan
>
> --
>
> Logan Gunthorpe (9):
>    iommu/vt-d: Allow interrupts from the entire bus for aliased devices
>    PCI/MSI: Support allocating virtual MSI interrupts
>    PCI/switchtec: Add module parameter to request more interrupts
>    NTB: Introduce functions to calculate multi-port resource index
>    NTB: Rename ntb.c to support multiple source files in the module
>    NTB: Introduce MSI library
>    NTB: Introduce NTB MSI Test Client
>    NTB: Add ntb_msi_test support to ntb_test
>    NTB: Add MSI interrupt support to ntb_transport
>
>   drivers/iommu/intel_irq_remapping.c     |  12 +
>   drivers/ntb/Kconfig                     |  10 +
>   drivers/ntb/Makefile                    |   3 +
>   drivers/ntb/{ntb.c => core.c}           |   0
>   drivers/ntb/msi.c                       | 313 ++++++++++++++++++
>   drivers/ntb/ntb_transport.c             | 134 +++++++-
>   drivers/ntb/test/Kconfig                |   9 +
>   drivers/ntb/test/Makefile               |   1 +
>   drivers/ntb/test/ntb_msi_test.c         | 416 ++++++++++++++++++++++++
>   drivers/pci/msi.c                       |  51 ++-
>   drivers/pci/switch/switchtec.c          |  12 +-
>   include/linux/msi.h                     |   1 +
>   include/linux/ntb.h                     | 139 ++++++++
>   include/linux/pci.h                     |   9 +
>   tools/testing/selftests/ntb/ntb_test.sh |  54 ++-
>   15 files changed, 1150 insertions(+), 14 deletions(-)
>   rename drivers/ntb/{ntb.c => core.c} (100%)
>   create mode 100644 drivers/ntb/msi.c
>   create mode 100644 drivers/ntb/test/ntb_msi_test.c
>
> --
> 2.19.0
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport
  2019-01-31 20:20 ` [PATCH 0/9] Support using MSI interrupts in ntb_transport Dave Jiang
@ 2019-01-31 20:48   ` Logan Gunthorpe
  2019-01-31 20:58     ` Dave Jiang
  0 siblings, 1 reply; 26+ messages in thread
From: Logan Gunthorpe @ 2019-01-31 20:48 UTC (permalink / raw)
  To: Dave Jiang, linux-kernel, linux-ntb, linux-pci, iommu,
	linux-kselftest, Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Serge Semin, Eric Pilmore



On 2019-01-31 1:20 p.m., Dave Jiang wrote:
> Does this work when the system moves the MSI vector either via software 
> (irqbalance) or BIOS APIC programming (some modes cause round robin 
> behavior)?


I don't know how irqbalance works, and I'm not sure what you are
referring to by BIOS APIC programming, however I would expect these
things would not be a problem.

The MSI code I'm presenting here doesn't do anything crazy with the
interrupts, it allocates and uses them just as any PCI driver would. The
only real difference here is that instead of a piece of hardware sending
the IRQ TLP, it will be sent through the memory window (which, from the
OS's perspective, is just coming from an NTB hardware proxy alias).

Logan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport
  2019-01-31 20:48   ` Logan Gunthorpe
@ 2019-01-31 20:58     ` Dave Jiang
  2019-01-31 22:39       ` Logan Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Jiang @ 2019-01-31 20:58 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-ntb, linux-pci, iommu,
	linux-kselftest, Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Serge Semin, Eric Pilmore


On 1/31/2019 1:48 PM, Logan Gunthorpe wrote:
>
> On 2019-01-31 1:20 p.m., Dave Jiang wrote:
>> Does this work when the system moves the MSI vector either via software
>> (irqbalance) or BIOS APIC programming (some modes cause round robin
>> behavior)?
>
> I don't know how irqbalance works, and I'm not sure what you are
> referring to by BIOS APIC programming, however I would expect these
> things would not be a problem.
>
> The MSI code I'm presenting here doesn't do anything crazy with the
> interrupts, it allocates and uses them just as any PCI driver would. The
> only real difference here is that instead of a piece of hardware sending
> the IRQ TLP, it will be sent through the memory window (which, from the
> OS's perspective, is just coming from an NTB hardware proxy alias).
>
> Logan
Right. I did that as a hack a while back for some silicon errata 
workaround. When the vector moves, the address for the LAPIC changes. So 
unless it gets updated, you end up writing to the old location and lose 
all the new interrupts. irqbalance is a user daemon that rotates the 
system interrupts around to ensure that not all interrupts are pinned on 
a single core. I think it's enabled by default on several distros. 
Although MSIX has nothing to do with the IOAPIC, the mode that the APIC 
is programmed can have an influence on how the interrupts are delivered. 
There are certain Intel platforms (I don't know if AMD does anything 
like that) puts the IOAPIC in a certain configuration that causes the 
interrupts to be moved in a round robin fashion. I think it's physical 
flat mode? I don't quite recall. Normally on the low end Xeons. It's 
probably worth doing a test run with the irqbalance daemon running and 
make sure you traffic stream doesn't all of sudden stop.
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/9] PCI/MSI: Support allocating virtual MSI interrupts
  2019-01-31 18:56 ` [PATCH 2/9] PCI/MSI: Support allocating virtual MSI interrupts Logan Gunthorpe
@ 2019-01-31 22:39   ` Bjorn Helgaas
  2019-01-31 22:52     ` Logan Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Bjorn Helgaas @ 2019-01-31 22:39 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Joerg Roedel, Allen Hubbe, Dave Jiang, Serge Semin,
	Eric Pilmore, Thomas Gleixner, Marc Zyngier

[+cc Thomas, Marc]

On Thu, Jan 31, 2019 at 11:56:49AM -0700, Logan Gunthorpe wrote:
> For NTB devices, we want to be able to trigger MSI interrupts
> through a memory window. In these cases we may want to use
> more interrupts than the NTB PCI device has available in its MSI-X
> table.
> 
> We allow for this by creating a new 'virtual' interrupt. These
> interrupts are allocated as usual but are not programmed into the
> MSI-X table (as there may not be space for them).
> 
> The MSI address and data will then handled through an NTB MSI library
> introduced later in this series.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>

I assume you'll merge this along with the rest of the series, so:

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Minor question and typo below.

> ---
>  drivers/pci/msi.c   | 51 +++++++++++++++++++++++++++++++++++++--------
>  include/linux/msi.h |  1 +
>  include/linux/pci.h |  9 ++++++++
>  3 files changed, 52 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 4c0b47867258..145587da686c 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -192,6 +192,9 @@ static void msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
>  
>  static void __iomem *pci_msix_desc_addr(struct msi_desc *desc)
>  {
> +	if (desc->msi_attrib.is_virtual)
> +		return NULL;
> +
>  	return desc->mask_base +
>  		desc->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE;
>  }
> @@ -206,14 +209,19 @@ static void __iomem *pci_msix_desc_addr(struct msi_desc *desc)
>  u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag)
>  {
>  	u32 mask_bits = desc->masked;
> +	void __iomem *desc_addr;
>  
>  	if (pci_msi_ignore_mask)
>  		return 0;
> +	desc_addr = pci_msix_desc_addr(desc);
> +	if (!desc_addr)
> +		return 0;
>  
>  	mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
>  	if (flag)
>  		mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
> -	writel(mask_bits, pci_msix_desc_addr(desc) + PCI_MSIX_ENTRY_VECTOR_CTRL);
> +
> +	writel(mask_bits, desc_addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
>  
>  	return mask_bits;
>  }
> @@ -273,6 +281,11 @@ void __pci_read_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
>  	if (entry->msi_attrib.is_msix) {
>  		void __iomem *base = pci_msix_desc_addr(entry);
>  
> +		if (!base) {
> +			WARN_ON(1);
> +			return;
> +		}
> +
>  		msg->address_lo = readl(base + PCI_MSIX_ENTRY_LOWER_ADDR);
>  		msg->address_hi = readl(base + PCI_MSIX_ENTRY_UPPER_ADDR);
>  		msg->data = readl(base + PCI_MSIX_ENTRY_DATA);
> @@ -303,6 +316,9 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
>  	} else if (entry->msi_attrib.is_msix) {
>  		void __iomem *base = pci_msix_desc_addr(entry);
>  
> +		if (!base)
> +			goto skip;
> +
>  		writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
>  		writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
>  		writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
> @@ -327,6 +343,8 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
>  					      msg->data);
>  		}
>  	}
> +
> +skip:
>  	entry->msg = *msg;
>  }
>  
> @@ -550,6 +568,7 @@ msi_setup_entry(struct pci_dev *dev, int nvec, const struct irq_affinity *affd)
>  
>  	entry->msi_attrib.is_msix	= 0;
>  	entry->msi_attrib.is_64		= !!(control & PCI_MSI_FLAGS_64BIT);
> +	entry->msi_attrib.is_virtual    = 0;
>  	entry->msi_attrib.entry_nr	= 0;
>  	entry->msi_attrib.maskbit	= !!(control & PCI_MSI_FLAGS_MASKBIT);
>  	entry->msi_attrib.default_irq	= dev->irq;	/* Save IOAPIC IRQ */
> @@ -674,6 +693,7 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
>  	struct irq_affinity_desc *curmsk, *masks = NULL;
>  	struct msi_desc *entry;
>  	int ret, i;
> +	int vec_count = pci_msix_vec_count(dev);
>  
>  	if (affd)
>  		masks = irq_create_affinity_masks(nvec, affd);
> @@ -696,6 +716,10 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
>  			entry->msi_attrib.entry_nr = entries[i].entry;
>  		else
>  			entry->msi_attrib.entry_nr = i;
> +
> +		entry->msi_attrib.is_virtual =
> +			entry->msi_attrib.entry_nr >= vec_count;
> +
>  		entry->msi_attrib.default_irq	= dev->irq;
>  		entry->mask_base		= base;
>  
> @@ -714,12 +738,19 @@ static void msix_program_entries(struct pci_dev *dev,
>  {
>  	struct msi_desc *entry;
>  	int i = 0;
> +	void __iomem *desc_addr;
>  
>  	for_each_pci_msi_entry(entry, dev) {
>  		if (entries)
>  			entries[i++].vector = entry->irq;
> -		entry->masked = readl(pci_msix_desc_addr(entry) +
> -				PCI_MSIX_ENTRY_VECTOR_CTRL);
> +
> +		desc_addr = pci_msix_desc_addr(entry);
> +		if (desc_addr)
> +			entry->masked = readl(desc_addr +
> +					      PCI_MSIX_ENTRY_VECTOR_CTRL);
> +		else
> +			entry->masked = 0;
> +
>  		msix_mask_irq(entry, 1);
>  	}
>  }
> @@ -932,7 +963,8 @@ int pci_msix_vec_count(struct pci_dev *dev)
>  EXPORT_SYMBOL(pci_msix_vec_count);
>  
>  static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
> -			     int nvec, const struct irq_affinity *affd)
> +			     int nvec, const struct irq_affinity *affd,
> +			     int flags)
>  {
>  	int nr_entries;
>  	int i, j;
> @@ -943,7 +975,7 @@ static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
>  	nr_entries = pci_msix_vec_count(dev);
>  	if (nr_entries < 0)
>  		return nr_entries;
> -	if (nvec > nr_entries)
> +	if (nvec > nr_entries && !(flags & PCI_IRQ_VIRTUAL))
>  		return nr_entries;
>  
>  	if (entries) {
> @@ -1086,7 +1118,8 @@ EXPORT_SYMBOL(pci_enable_msi);
>  
>  static int __pci_enable_msix_range(struct pci_dev *dev,
>  				   struct msix_entry *entries, int minvec,
> -				   int maxvec, const struct irq_affinity *affd)
> +				   int maxvec, const struct irq_affinity *affd,
> +				   int flags)
>  {
>  	int rc, nvec = maxvec;
>  
> @@ -1110,7 +1143,7 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
>  				return -ENOSPC;
>  		}
>  
> -		rc = __pci_enable_msix(dev, entries, nvec, affd);
> +		rc = __pci_enable_msix(dev, entries, nvec, affd, flags);
>  		if (rc == 0)
>  			return nvec;
>  
> @@ -1141,7 +1174,7 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
>  int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
>  		int minvec, int maxvec)
>  {
> -	return __pci_enable_msix_range(dev, entries, minvec, maxvec, NULL);
> +	return __pci_enable_msix_range(dev, entries, minvec, maxvec, NULL, 0);
>  }
>  EXPORT_SYMBOL(pci_enable_msix_range);
>  
> @@ -1181,7 +1214,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  
>  	if (flags & PCI_IRQ_MSIX) {
>  		msix_vecs = __pci_enable_msix_range(dev, NULL, min_vecs,
> -						    max_vecs, affd);
> +						    max_vecs, affd, flags);
>  		if (msix_vecs > 0)
>  			return msix_vecs;
>  	}
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index 784fb52b9900..6458ab049852 100644
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -88,6 +88,7 @@ struct msi_desc {
>  				__u8	multi_cap	: 3;
>  				__u8	maskbit		: 1;
>  				__u8	is_64		: 1;
> +				__u8    is_virtual      : 1;

You did the right thing by using the same style as what's already
here, but does anybody know why are we using __u8 and __u16 here?

Those typedefs are in include/uapi/asm-generic/int-l64.h, which
suggests they're for things exported to user space, but I don't think
that's the case here, so I'm wondering if we could someday replace
these with u8 and u16.  Obviously that wouldn't be part of *this*
series.

>  				__u16	entry_nr;
>  				unsigned default_irq;
>  			} msi_attrib;
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 65f1d8c2f082..ce0815c2c498 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1352,6 +1352,15 @@ int pci_set_vga_state(struct pci_dev *pdev, bool decode,
>  #define PCI_IRQ_MSI		(1 << 1) /* Allow MSI interrupts */
>  #define PCI_IRQ_MSIX		(1 << 2) /* Allow MSI-X interrupts */
>  #define PCI_IRQ_AFFINITY	(1 << 3) /* Auto-assign affinity */
> +
> +/*
> + * Virtual interrupts allow for more interrupts to be allocated
> + * than the device has interrupts for. These are not programmed
> + * into the devices MSI-X table and must be handled by some

s/devices/device's/

> + * other driver means.
> + */
> +#define PCI_IRQ_VIRTUAL		(1 << 4)
> +
>  #define PCI_IRQ_ALL_TYPES \
>  	(PCI_IRQ_LEGACY | PCI_IRQ_MSI | PCI_IRQ_MSIX)
>  
> -- 
> 2.19.0
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport
  2019-01-31 20:58     ` Dave Jiang
@ 2019-01-31 22:39       ` Logan Gunthorpe
  2019-01-31 22:46         ` Dave Jiang
  0 siblings, 1 reply; 26+ messages in thread
From: Logan Gunthorpe @ 2019-01-31 22:39 UTC (permalink / raw)
  To: Dave Jiang, linux-kernel, linux-ntb, linux-pci, iommu,
	linux-kselftest, Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Serge Semin, Eric Pilmore



On 2019-01-31 1:58 p.m., Dave Jiang wrote:
> 
> On 1/31/2019 1:48 PM, Logan Gunthorpe wrote:
>>
>> On 2019-01-31 1:20 p.m., Dave Jiang wrote:
>>> Does this work when the system moves the MSI vector either via software
>>> (irqbalance) or BIOS APIC programming (some modes cause round robin
>>> behavior)?
>>
>> I don't know how irqbalance works, and I'm not sure what you are
>> referring to by BIOS APIC programming, however I would expect these
>> things would not be a problem.
>>
>> The MSI code I'm presenting here doesn't do anything crazy with the
>> interrupts, it allocates and uses them just as any PCI driver would. The
>> only real difference here is that instead of a piece of hardware sending
>> the IRQ TLP, it will be sent through the memory window (which, from the
>> OS's perspective, is just coming from an NTB hardware proxy alias).
>>
>> Logan
> Right. I did that as a hack a while back for some silicon errata 
> workaround. When the vector moves, the address for the LAPIC changes. So 
> unless it gets updated, you end up writing to the old location and lose 
> all the new interrupts. irqbalance is a user daemon that rotates the 
> system interrupts around to ensure that not all interrupts are pinned on 
> a single core. 

Yes, that would be a problem if something changes the MSI vectors out
from under us. Seems like that would be a bit difficult to do even with
regular hardware. So far I haven't seen anything that would do that. If
you know of where in the kernel this happens I'd be interested in
getting a pointer to the flow in the code. If that is the case this MSI
stuff will need to get much more complicated...

> I think it's enabled by default on several distros. 

> Although MSIX has nothing to do with the IOAPIC, the mode that the APIC 
> is programmed can have an influence on how the interrupts are delivered. 
> There are certain Intel platforms (I don't know if AMD does anything 
> like that) puts the IOAPIC in a certain configuration that causes the 
> interrupts to be moved in a round robin fashion. I think it's physical 
> flat mode? I don't quite recall. Normally on the low end Xeons. It's 
> probably worth doing a test run with the irqbalance daemon running and 
> make sure you traffic stream doesn't all of sudden stop.

I've tested with irqbalance running and haven't found any noticeable
difference.

Logan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport
  2019-01-31 22:39       ` Logan Gunthorpe
@ 2019-01-31 22:46         ` Dave Jiang
  2019-01-31 23:41           ` Logan Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Jiang @ 2019-01-31 22:46 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-ntb, linux-pci, iommu,
	linux-kselftest, Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Serge Semin, Eric Pilmore


On 1/31/2019 3:39 PM, Logan Gunthorpe wrote:
>
> On 2019-01-31 1:58 p.m., Dave Jiang wrote:
>> On 1/31/2019 1:48 PM, Logan Gunthorpe wrote:
>>> On 2019-01-31 1:20 p.m., Dave Jiang wrote:
>>>> Does this work when the system moves the MSI vector either via software
>>>> (irqbalance) or BIOS APIC programming (some modes cause round robin
>>>> behavior)?
>>> I don't know how irqbalance works, and I'm not sure what you are
>>> referring to by BIOS APIC programming, however I would expect these
>>> things would not be a problem.
>>>
>>> The MSI code I'm presenting here doesn't do anything crazy with the
>>> interrupts, it allocates and uses them just as any PCI driver would. The
>>> only real difference here is that instead of a piece of hardware sending
>>> the IRQ TLP, it will be sent through the memory window (which, from the
>>> OS's perspective, is just coming from an NTB hardware proxy alias).
>>>
>>> Logan
>> Right. I did that as a hack a while back for some silicon errata
>> workaround. When the vector moves, the address for the LAPIC changes. So
>> unless it gets updated, you end up writing to the old location and lose
>> all the new interrupts. irqbalance is a user daemon that rotates the
>> system interrupts around to ensure that not all interrupts are pinned on
>> a single core.
> Yes, that would be a problem if something changes the MSI vectors out
> from under us. Seems like that would be a bit difficult to do even with
> regular hardware. So far I haven't seen anything that would do that. If
> you know of where in the kernel this happens I'd be interested in
> getting a pointer to the flow in the code. If that is the case this MSI
> stuff will need to get much more complicated...

I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So 
maybe take a look at the code that starts from there and see if it would 
have any impact on your stuff.


>
>> I think it's enabled by default on several distros.
>> Although MSIX has nothing to do with the IOAPIC, the mode that the APIC
>> is programmed can have an influence on how the interrupts are delivered.
>> There are certain Intel platforms (I don't know if AMD does anything
>> like that) puts the IOAPIC in a certain configuration that causes the
>> interrupts to be moved in a round robin fashion. I think it's physical
>> flat mode? I don't quite recall. Normally on the low end Xeons. It's
>> probably worth doing a test run with the irqbalance daemon running and
>> make sure you traffic stream doesn't all of sudden stop.
> I've tested with irqbalance running and haven't found any noticeable
> difference.
>
> Logan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/9] PCI/MSI: Support allocating virtual MSI interrupts
  2019-01-31 22:39   ` Bjorn Helgaas
@ 2019-01-31 22:52     ` Logan Gunthorpe
  2019-02-01 19:23       ` Bjorn Helgaas
  0 siblings, 1 reply; 26+ messages in thread
From: Logan Gunthorpe @ 2019-01-31 22:52 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Joerg Roedel, Allen Hubbe, Dave Jiang, Serge Semin,
	Eric Pilmore, Thomas Gleixner, Marc Zyngier



On 2019-01-31 3:39 p.m., Bjorn Helgaas wrote:
> I assume you'll merge this along with the rest of the series, so:
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Thanks!

>> diff --git a/include/linux/msi.h b/include/linux/msi.h
>> index 784fb52b9900..6458ab049852 100644
>> --- a/include/linux/msi.h
>> +++ b/include/linux/msi.h
>> @@ -88,6 +88,7 @@ struct msi_desc {
>>  				__u8	multi_cap	: 3;
>>  				__u8	maskbit		: 1;
>>  				__u8	is_64		: 1;
>> +				__u8    is_virtual      : 1;
> 
> You did the right thing by using the same style as what's already
> here, but does anybody know why are we using __u8 and __u16 here?
> 
> Those typedefs are in include/uapi/asm-generic/int-l64.h, which
> suggests they're for things exported to user space, but I don't think
> that's the case here, so I'm wondering if we could someday replace
> these with u8 and u16.  Obviously that wouldn't be part of *this*
> series.

Yes, I was also confused by this. But I always follow the "when-in-rome"
rule. My understanding is the same as yours is that __u8 should be used
for userspace compatibility which doesn't apply here. If there is
consensus on this being wrong, I'd be happy to write a cleanup patch
that fixes it separate from this series.
>> +/*
>> + * Virtual interrupts allow for more interrupts to be allocated
>> + * than the device has interrupts for. These are not programmed
>> + * into the devices MSI-X table and must be handled by some
> 
> s/devices/device's/

Fixed for when I send v2.

Logan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport
  2019-01-31 22:46         ` Dave Jiang
@ 2019-01-31 23:41           ` Logan Gunthorpe
  2019-01-31 23:48             ` Dave Jiang
  0 siblings, 1 reply; 26+ messages in thread
From: Logan Gunthorpe @ 2019-01-31 23:41 UTC (permalink / raw)
  To: Dave Jiang, linux-kernel, linux-ntb, linux-pci, iommu,
	linux-kselftest, Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Serge Semin, Eric Pilmore



On 2019-01-31 3:46 p.m., Dave Jiang wrote:
> I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So 
> maybe take a look at the code that starts from there and see if it would 
> have any impact on your stuff.

Ok, well on my system I can write to the smp_affinity all day and the
MSI interrupts still work fine.

The MSI code is a bit difficult to trace and audit with all the
different chips and the parent chips which I don't have a good
understanding of. But I can definitely see that it could be possible for
some chips to change the address as smp_affinitiy will eventually
sometimes call msi_domain_set_affinity() which does seem to recompose
the message and write it back to the chip.

So, I could relatively easily add a callback to msi_desc to catch this
and resend the MSI address/data. However, I'm not sure how this is ever
done atomically. It seems like there would be a race while the device
updates its address where old interrupts could be triggered. This race
would be much longer for us when sending this information over the NTB
link. Though, I guess if the only change is that it encodes CPU
information in the address then that would not be an issue. However, I'm
not sure I can say that for certain without a comprehensive
understanding of all the IRQ chips.

Any thoughts on this?

Logan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport
  2019-01-31 23:41           ` Logan Gunthorpe
@ 2019-01-31 23:48             ` Dave Jiang
  2019-01-31 23:52               ` Logan Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Jiang @ 2019-01-31 23:48 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-ntb, linux-pci, iommu,
	linux-kselftest, Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Serge Semin, Eric Pilmore


On 1/31/2019 4:41 PM, Logan Gunthorpe wrote:
>
> On 2019-01-31 3:46 p.m., Dave Jiang wrote:
>> I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So
>> maybe take a look at the code that starts from there and see if it would
>> have any impact on your stuff.
> Ok, well on my system I can write to the smp_affinity all day and the
> MSI interrupts still work fine.

Maybe your code is ok then. If the stats show up in /proc/interrupts 
then you can see it moving to different cores.

> The MSI code is a bit difficult to trace and audit with all the
> different chips and the parent chips which I don't have a good
> understanding of. But I can definitely see that it could be possible for
> some chips to change the address as smp_affinitiy will eventually
> sometimes call msi_domain_set_affinity() which does seem to recompose
> the message and write it back to the chip.
>
> So, I could relatively easily add a callback to msi_desc to catch this
> and resend the MSI address/data. However, I'm not sure how this is ever
> done atomically. It seems like there would be a race while the device
> updates its address where old interrupts could be triggered. This race
> would be much longer for us when sending this information over the NTB
> link. Though, I guess if the only change is that it encodes CPU
> information in the address then that would not be an issue. However, I'm
> not sure I can say that for certain without a comprehensive
> understanding of all the IRQ chips.
>
> Any thoughts on this?

Yeah I'm not sure what to do about it either as I'm not super familiar 
with that area either. Just making note of what I encountered. And you 
are right, the updated info has to go over NTB for the other side to 
write to the updated place. So there's a lot of latency involved.



>
> Logan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport
  2019-01-31 23:48             ` Dave Jiang
@ 2019-01-31 23:52               ` Logan Gunthorpe
  0 siblings, 0 replies; 26+ messages in thread
From: Logan Gunthorpe @ 2019-01-31 23:52 UTC (permalink / raw)
  To: Dave Jiang, linux-kernel, linux-ntb, linux-pci, iommu,
	linux-kselftest, Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Serge Semin, Eric Pilmore



On 2019-01-31 4:48 p.m., Dave Jiang wrote:
> 
> On 1/31/2019 4:41 PM, Logan Gunthorpe wrote:
>>
>> On 2019-01-31 3:46 p.m., Dave Jiang wrote:
>>> I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So
>>> maybe take a look at the code that starts from there and see if it would
>>> have any impact on your stuff.
>> Ok, well on my system I can write to the smp_affinity all day and the
>> MSI interrupts still work fine.
> 
> Maybe your code is ok then. If the stats show up in /proc/interrupts 
> then you can see it moving to different cores.

Yes, I did check that the stats change CPU in proc interrupts.

> Yeah I'm not sure what to do about it either as I'm not super familiar 
> with that area either. Just making note of what I encountered. And you 
> are right, the updated info has to go over NTB for the other side to 
> write to the updated place. So there's a lot of latency involved.

Ok, well I'll implement the callback anyway for v2. Better safe than
sorry. We can operate on the assumption that someone thought of the race
condition and if we ever see reports of lost interrupts we'll know where
to look.

Logan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/9] iommu/vt-d: Allow interrupts from the entire bus for aliased devices
  2019-01-31 18:56 ` [PATCH 1/9] iommu/vt-d: Allow interrupts from the entire bus for aliased devices Logan Gunthorpe
@ 2019-02-01 16:44   ` Joerg Roedel
  2019-02-01 17:27     ` Logan Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Joerg Roedel @ 2019-02-01 16:44 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Allen Hubbe, Dave Jiang, Serge Semin,
	Eric Pilmore, David Woodhouse

On Thu, Jan 31, 2019 at 11:56:48AM -0700, Logan Gunthorpe wrote:
> @@ -394,6 +402,10 @@ static int set_msi_sid(struct irte *irte, struct pci_dev *dev)
>  		set_irte_sid(irte, SVT_VERIFY_BUS, SQ_ALL_16,
>  			     PCI_DEVID(PCI_BUS_NUM(data.alias),
>  				       dev->bus->number));
> +	else if (data.count >= 2 && data.busmatch_count == data.count)
> +		set_irte_sid(irte, SVT_VERIFY_BUS, SQ_ALL_16,
> +			     PCI_DEVID(dev->bus->number,
> +				       dev->bus->number));

The dev->bus->number argument for the devfn parameter of PCI_DEVID is
not needed, right?

Also, this requires at least a comment to document that special case.


	Joerg

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/9] iommu/vt-d: Allow interrupts from the entire bus for aliased devices
  2019-02-01 16:44   ` Joerg Roedel
@ 2019-02-01 17:27     ` Logan Gunthorpe
  2019-02-05 19:19       ` Jacob Pan
  0 siblings, 1 reply; 26+ messages in thread
From: Logan Gunthorpe @ 2019-02-01 17:27 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Allen Hubbe, Dave Jiang, Serge Semin,
	Eric Pilmore, David Woodhouse



On 2019-02-01 9:44 a.m., Joerg Roedel wrote:
> On Thu, Jan 31, 2019 at 11:56:48AM -0700, Logan Gunthorpe wrote:
>> @@ -394,6 +402,10 @@ static int set_msi_sid(struct irte *irte, struct pci_dev *dev)
>>  		set_irte_sid(irte, SVT_VERIFY_BUS, SQ_ALL_16,
>>  			     PCI_DEVID(PCI_BUS_NUM(data.alias),
>>  				       dev->bus->number));
>> +	else if (data.count >= 2 && data.busmatch_count == data.count)
>> +		set_irte_sid(irte, SVT_VERIFY_BUS, SQ_ALL_16,
>> +			     PCI_DEVID(dev->bus->number,
>> +				       dev->bus->number));
> 
> The dev->bus->number argument for the devfn parameter of PCI_DEVID is
> not needed, right?

Oh, yes. I think you are right.

> Also, this requires at least a comment to document that special case.

I'll add a comment for v2.

Thanks for the review!

Logan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/9] PCI/MSI: Support allocating virtual MSI interrupts
  2019-01-31 22:52     ` Logan Gunthorpe
@ 2019-02-01 19:23       ` Bjorn Helgaas
  0 siblings, 0 replies; 26+ messages in thread
From: Bjorn Helgaas @ 2019-02-01 19:23 UTC (permalink / raw)
  To: Logan Gunthorpe, g
  Cc: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Joerg Roedel, Allen Hubbe, Dave Jiang, Serge Semin,
	Eric Pilmore, Thomas Gleixner, Marc Zyngier

On Thu, Jan 31, 2019 at 03:52:09PM -0700, Logan Gunthorpe wrote:
> On 2019-01-31 3:39 p.m., Bjorn Helgaas wrote:
> >> diff --git a/include/linux/msi.h b/include/linux/msi.h
> >> index 784fb52b9900..6458ab049852 100644
> >> --- a/include/linux/msi.h
> >> +++ b/include/linux/msi.h
> >> @@ -88,6 +88,7 @@ struct msi_desc {
> >>  				__u8	multi_cap	: 3;
> >>  				__u8	maskbit		: 1;
> >>  				__u8	is_64		: 1;
> >> +				__u8    is_virtual      : 1;
> > 
> > You did the right thing by using the same style as what's already
> > here, but does anybody know why are we using __u8 and __u16 here?
> > 
> > Those typedefs are in include/uapi/asm-generic/int-l64.h, which
> > suggests they're for things exported to user space, but I don't think
> > that's the case here, so I'm wondering if we could someday replace
> > these with u8 and u16.  Obviously that wouldn't be part of *this*
> > series.
> 
> Yes, I was also confused by this. But I always follow the "when-in-rome"
> rule. 

Thanks for following the "when-in-rome" rule.  That seems so obvious
that it wouldn't even need to be written down, but it is often
ignored.

> My understanding is the same as yours is that __u8 should be used
> for userspace compatibility which doesn't apply here. If there is
> consensus on this being wrong, I'd be happy to write a cleanup patch
> that fixes it separate from this series.

That'd be awesome.  There are also a couple more in pci-driver.c that
could be fixed at the same time.

Bjorn

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/9] iommu/vt-d: Allow interrupts from the entire bus for aliased devices
  2019-02-01 17:27     ` Logan Gunthorpe
@ 2019-02-05 19:19       ` Jacob Pan
  2019-02-05 20:40         ` Logan Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Jacob Pan @ 2019-02-05 19:19 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Joerg Roedel, Allen Hubbe, Dave Jiang, linux-pci, linux-kernel,
	Serge Semin, Eric Pilmore, iommu, linux-kselftest, Bjorn Helgaas,
	linux-ntb, David Woodhouse, jacob.jun.pan

On Fri, 1 Feb 2019 10:27:29 -0700
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 2019-02-01 9:44 a.m., Joerg Roedel wrote:
> > On Thu, Jan 31, 2019 at 11:56:48AM -0700, Logan Gunthorpe wrote:  
> >> @@ -394,6 +402,10 @@ static int set_msi_sid(struct irte *irte,
> >> struct pci_dev *dev) set_irte_sid(irte, SVT_VERIFY_BUS, SQ_ALL_16,
> >>  			     PCI_DEVID(PCI_BUS_NUM(data.alias),
> >>  				       dev->bus->number));
I guess devfn can be removed also. but that is separate cleanup.
 
> >> +	else if (data.count >= 2 && data.busmatch_count ==
> >> data.count)
> >> +		set_irte_sid(irte, SVT_VERIFY_BUS, SQ_ALL_16,
> >> +			     PCI_DEVID(dev->bus->number,
> >> +				       dev->bus->number));  
> > 
> > The dev->bus->number argument for the devfn parameter of PCI_DEVID
> > is not needed, right?  
> 
> Oh, yes. I think you are right.
> 
> > Also, this requires at least a comment to document that special
> > case.  
> 
> I'll add a comment for v2.
> 
> Thanks for the review!
> 
> Logan
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

[Jacob Pan]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/9] iommu/vt-d: Allow interrupts from the entire bus for aliased devices
  2019-02-05 19:19       ` Jacob Pan
@ 2019-02-05 20:40         ` Logan Gunthorpe
  2019-02-05 23:58           ` Jacob Pan
  0 siblings, 1 reply; 26+ messages in thread
From: Logan Gunthorpe @ 2019-02-05 20:40 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Joerg Roedel, Allen Hubbe, Dave Jiang, linux-pci, linux-kernel,
	Serge Semin, Eric Pilmore, iommu, linux-kselftest, Bjorn Helgaas,
	linux-ntb, David Woodhouse



On 2019-02-05 12:19 p.m., Jacob Pan wrote:
> On Fri, 1 Feb 2019 10:27:29 -0700
> Logan Gunthorpe <logang@deltatee.com> wrote:
> 
>> On 2019-02-01 9:44 a.m., Joerg Roedel wrote:
>>> On Thu, Jan 31, 2019 at 11:56:48AM -0700, Logan Gunthorpe wrote:  
>>>> @@ -394,6 +402,10 @@ static int set_msi_sid(struct irte *irte,
>>>> struct pci_dev *dev) set_irte_sid(irte, SVT_VERIFY_BUS, SQ_ALL_16,
>>>>  			     PCI_DEVID(PCI_BUS_NUM(data.alias),
>>>>  				       dev->bus->number));
> I guess devfn can be removed also. but that is separate cleanup.

Actually, no, I've dug into this and we *do* need the devfn here but
it's needlessly confusing. We should not be using PCI_DEVID() as we
aren't actually representing a DEVID in this case...

According to the Intel VT-D spec, when using SVT_VERIFY_BUS, the MSB of
the SID field represents the starting bus number and the LSB represents
the end bus number. The requester id's bus number must then be within
that range. The PCI_DEVID macro matches these semantics if you assume
the devfn is the end bus, but doesn't really make sense here and just
confuses the issue.

So the code was correct, I'll just try to clean it up to make it less
confusing.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/9] iommu/vt-d: Allow interrupts from the entire bus for aliased devices
  2019-02-05 20:40         ` Logan Gunthorpe
@ 2019-02-05 23:58           ` Jacob Pan
  0 siblings, 0 replies; 26+ messages in thread
From: Jacob Pan @ 2019-02-05 23:58 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Joerg Roedel, Allen Hubbe, Dave Jiang, linux-pci, linux-kernel,
	Serge Semin, Eric Pilmore, iommu, linux-kselftest, Bjorn Helgaas,
	linux-ntb, David Woodhouse, jacob.jun.pan

On Tue, 5 Feb 2019 13:40:36 -0700
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 2019-02-05 12:19 p.m., Jacob Pan wrote:
> > On Fri, 1 Feb 2019 10:27:29 -0700
> > Logan Gunthorpe <logang@deltatee.com> wrote:
> >   
> >> On 2019-02-01 9:44 a.m., Joerg Roedel wrote:  
> >>> On Thu, Jan 31, 2019 at 11:56:48AM -0700, Logan Gunthorpe
> >>> wrote:    
> >>>> @@ -394,6 +402,10 @@ static int set_msi_sid(struct irte *irte,
> >>>> struct pci_dev *dev) set_irte_sid(irte, SVT_VERIFY_BUS,
> >>>> SQ_ALL_16, PCI_DEVID(PCI_BUS_NUM(data.alias),
> >>>>  				       dev->bus->number));  
> > I guess devfn can be removed also. but that is separate cleanup.  
> 
> Actually, no, I've dug into this and we *do* need the devfn here but
> it's needlessly confusing. We should not be using PCI_DEVID() as we
> aren't actually representing a DEVID in this case...
> 
> According to the Intel VT-D spec, when using SVT_VERIFY_BUS, the MSB
> of the SID field represents the starting bus number and the LSB
> represents the end bus number. The requester id's bus number must
> then be within that range. The PCI_DEVID macro matches these
> semantics if you assume the devfn is the end bus, but doesn't really
> make sense here and just confuses the issue.
> 
> So the code was correct, I'll just try to clean it up to make it less
> confusing.
> 
you are right, thanks for explaining.
> Thanks,
> 
> Logan

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, back to index

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-31 18:56 [PATCH 0/9] Support using MSI interrupts in ntb_transport Logan Gunthorpe
2019-01-31 18:56 ` [PATCH 1/9] iommu/vt-d: Allow interrupts from the entire bus for aliased devices Logan Gunthorpe
2019-02-01 16:44   ` Joerg Roedel
2019-02-01 17:27     ` Logan Gunthorpe
2019-02-05 19:19       ` Jacob Pan
2019-02-05 20:40         ` Logan Gunthorpe
2019-02-05 23:58           ` Jacob Pan
2019-01-31 18:56 ` [PATCH 2/9] PCI/MSI: Support allocating virtual MSI interrupts Logan Gunthorpe
2019-01-31 22:39   ` Bjorn Helgaas
2019-01-31 22:52     ` Logan Gunthorpe
2019-02-01 19:23       ` Bjorn Helgaas
2019-01-31 18:56 ` [PATCH 3/9] PCI/switchtec: Add module parameter to request more interrupts Logan Gunthorpe
2019-01-31 18:56 ` [PATCH 4/9] NTB: Introduce functions to calculate multi-port resource index Logan Gunthorpe
2019-01-31 18:56 ` [PATCH 5/9] NTB: Rename ntb.c to support multiple source files in the module Logan Gunthorpe
2019-01-31 18:56 ` [PATCH 6/9] NTB: Introduce MSI library Logan Gunthorpe
2019-01-31 18:56 ` [PATCH 7/9] NTB: Introduce NTB MSI Test Client Logan Gunthorpe
2019-01-31 18:56 ` [PATCH 8/9] NTB: Add ntb_msi_test support to ntb_test Logan Gunthorpe
2019-01-31 18:56 ` [PATCH 9/9] NTB: Add MSI interrupt support to ntb_transport Logan Gunthorpe
2019-01-31 20:20 ` [PATCH 0/9] Support using MSI interrupts in ntb_transport Dave Jiang
2019-01-31 20:48   ` Logan Gunthorpe
2019-01-31 20:58     ` Dave Jiang
2019-01-31 22:39       ` Logan Gunthorpe
2019-01-31 22:46         ` Dave Jiang
2019-01-31 23:41           ` Logan Gunthorpe
2019-01-31 23:48             ` Dave Jiang
2019-01-31 23:52               ` Logan Gunthorpe

Linux-PCI Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-pci/0 linux-pci/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-pci linux-pci/ https://lore.kernel.org/linux-pci \
		linux-pci@vger.kernel.org linux-pci@archiver.kernel.org
	public-inbox-index linux-pci


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-pci


AGPL code for this site: git clone https://public-inbox.org/ public-inbox