* [PATCH v2 00/12] Support using MSI interrupts in ntb_transport
@ 2019-02-13 17:54 Logan Gunthorpe
  2019-02-13 17:54 ` [PATCH v2 01/12] iommu/vt-d: Implement dma_[un]map_resource() Logan Gunthorpe
                   ` (12 more replies)
  0 siblings, 13 replies; 27+ messages in thread
From: Logan Gunthorpe @ 2019-02-13 17:54 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

Note: this version will likely conflict trivially with some cleanup
patches I sent to Bjorn, so it is meant for review purposes only.
If there are no objections, I'd like to look at getting it merged in
the next cycle through the NTB tree.

--

Changes in v2:

* Cleaned up the changes in intel_irq_remapping.c to make them
  less confusing and add a comment. (Per discussion with Jacob and
  Joerg)

* Fixed a nit from Bjorn and collected his Ack

* Added a Kconfig dependency on CONFIG_PCI_MSI for CONFIG_NTB_MSI
  as the kbuild robot hit a random config that didn't build
  without it.

* Worked in a callback for when the MSI descriptor changes so that
  the clients can resend the new address and data values to the peer.
  On my test system this was never necessary, but there may be
  other platforms where this can occur. I tested this by hacking
  in a path to rewrite the MSI descriptor when I change the CPU
  affinity of an IRQ. There's a bit of uncertainty over the latency
  of the change, but without hardware on which this can actually
  occur, we can't test it. This was the result of a discussion
  with Dave.

--

This patch series adds optional support for using MSI interrupts instead
of NTB doorbells in ntb_transport. This is desirable because doorbells on
current hardware are quite slow, so switching to MSI interrupts provides
a significant performance gain. On switchtec hardware, a simple
apples-to-apples comparison shows ntb_netdev/iperf numbers going from
3.88Gb/s to 14.1Gb/s when switching to MSI interrupts.

To do this, a couple changes are required outside of the NTB tree:

1) The IOMMU must know to accept MSI requests from aliased bus numbers,
because NTB hardware typically proxies requests through additional
requester IDs. Patches 3 and 4 in this series add support for this to
the Intel IOMMU. A quirk to add these aliases for switchtec hardware
was already accepted. See commit ad281ecf1c7d ("PCI: Add DMA alias quirk
for Microsemi Switchtec NTB") for a description of NTB proxy IDs and why
this is necessary.

2) NTB transport (and other clients) may often need more MSI interrupts
than the NTB hardware actually advertises support for. However, because
these interrupts will be triggered not by the hardware but through an
NTB memory window, the hardware does not actually need to support them
or know about them. Therefore we add the concept of virtual MSI
interrupts, which are allocated just like any other MSI interrupt but
are not programmed into the hardware's MSI-X table. This is done in
patch 5 and then made use of in patch 6 (see the sketch below).
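As a rough sketch of the virtual interrupt allocation (not code from
this series; 'pdev' and 'nirqs' are placeholder names), a driver that
wants more vectors than its MSI-X table holds requests them with the
PCI_IRQ_VIRTUAL flag added in patch 5:

	nvecs = pci_alloc_irq_vectors(pdev, 1, nirqs,
				      PCI_IRQ_MSIX | PCI_IRQ_MSI |
				      PCI_IRQ_VIRTUAL);
	if (nvecs < 0)
		return nvecs;

This is essentially what the switchtec driver does in patch 6; vectors
beyond the hardware table are flagged internally as virtual and never
written to the device.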

The remaining patches in this series add a library for dealing with MSI
interrupts, a test client and finally support in ntb_transport.

The series is based off of v5.0-rc4 and I've tested it on top of some
of the patches I've already sent to the NTB tree (though they are
independent changes). A git repo is available here:

https://github.com/sbates130272/linux-p2pmem/ ntb_transport_msi_v2

Thanks,

Logan

--

Logan Gunthorpe (12):
  iommu/vt-d: Implement dma_[un]map_resource()
  NTB: ntb_transport: Ensure the destination buffer is mapped for TX DMA
  iommu/vt-d: Add helper to set an IRTE to verify only the bus number
  iommu/vt-d: Allow interrupts from the entire bus for aliased devices
  PCI/MSI: Support allocating virtual MSI interrupts
  PCI/switchtec: Add module parameter to request more interrupts
  NTB: Introduce functions to calculate multi-port resource index
  NTB: Rename ntb.c to support multiple source files in the module
  NTB: Introduce MSI library
  NTB: Introduce NTB MSI Test Client
  NTB: Add ntb_msi_test support to ntb_test
  NTB: Add MSI interrupt support to ntb_transport

 drivers/iommu/intel-iommu.c             |  23 +-
 drivers/iommu/intel_irq_remapping.c     |  32 +-
 drivers/ntb/Kconfig                     |  11 +
 drivers/ntb/Makefile                    |   3 +
 drivers/ntb/{ntb.c => core.c}           |   0
 drivers/ntb/msi.c                       | 415 +++++++++++++++++++++++
 drivers/ntb/ntb_transport.c             | 197 ++++++++++-
 drivers/ntb/test/Kconfig                |   9 +
 drivers/ntb/test/Makefile               |   1 +
 drivers/ntb/test/ntb_msi_test.c         | 433 ++++++++++++++++++++++++
 drivers/pci/msi.c                       |  55 ++-
 drivers/pci/switch/switchtec.c          |  12 +-
 include/linux/msi.h                     |   8 +
 include/linux/ntb.h                     | 143 ++++++++
 include/linux/pci.h                     |   9 +
 tools/testing/selftests/ntb/ntb_test.sh |  54 ++-
 16 files changed, 1379 insertions(+), 26 deletions(-)
 rename drivers/ntb/{ntb.c => core.c} (100%)
 create mode 100644 drivers/ntb/msi.c
 create mode 100644 drivers/ntb/test/ntb_msi_test.c

--
2.19.0


* [PATCH v2 01/12] iommu/vt-d: Implement dma_[un]map_resource()
  2019-02-13 17:54 [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Logan Gunthorpe
@ 2019-02-13 17:54 ` Logan Gunthorpe
  2019-02-13 17:57   ` Logan Gunthorpe
  2019-02-13 17:54 ` [PATCH v2 02/12] NTB: ntb_transport: Ensure the destination buffer is mapped for TX DMA Logan Gunthorpe
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 27+ messages in thread
From: Logan Gunthorpe @ 2019-02-13 17:54 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore,
	Logan Gunthorpe, David Woodhouse

Currently the Intel IOMMU uses the default dma_[un]map_resource()
implementations, which do nothing and simply return the physical
address unmodified.

However, this doesn't create the IOVA entries necessary for addresses
mapped this way to work when the IOMMU is enabled. Thus, when the
IOMMU is enabled, drivers relying on dma_map_resource() will trigger
DMAR errors. We see this when running ntb_transport with the IOMMU
enabled, DMA, and switchtec hardware.

The implementation for intel_map_resource() is nearly identical to
intel_map_page(); we just have to re-create __intel_map_single().
dma_unmap_resource() uses intel_unmap_page() directly as the
functions are identical.
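
For reference, a minimal sketch of how a client uses this API (generic
dma_map_resource() usage, not code from this patch; 'dma_dev',
'bar_phys' and 'size' are placeholders):

	dma_addr_t dma;

	dma = dma_map_resource(dma_dev, bar_phys, size,
			       DMA_FROM_DEVICE, 0);
	if (dma_mapping_error(dma_dev, dma))
		return -EIO;

	/* ... hand 'dma' to the DMA engine ... */

	dma_unmap_resource(dma_dev, dma, size, DMA_FROM_DEVICE, 0);

Patch 2 of this series uses this pattern to map the NTB memory window
for TX DMA.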

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
---
 drivers/iommu/intel-iommu.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 78188bf7e90d..ad737e16575b 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3649,11 +3649,9 @@ static int iommu_no_mapping(struct device *dev)
 	return 0;
 }
 
-static dma_addr_t __intel_map_page(struct device *dev, struct page *page,
-				   unsigned long offset, size_t size, int dir,
-				   u64 dma_mask)
+static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
+				     size_t size, int dir, u64 dma_mask)
 {
-	phys_addr_t paddr = page_to_phys(page) + offset;
 	struct dmar_domain *domain;
 	phys_addr_t start_paddr;
 	unsigned long iova_pfn;
@@ -3715,7 +3713,15 @@ static dma_addr_t intel_map_page(struct device *dev, struct page *page,
 				 enum dma_data_direction dir,
 				 unsigned long attrs)
 {
-	return __intel_map_page(dev, page, offset, size, dir, *dev->dma_mask);
+	return __intel_map_single(dev, page_to_phys(page) + offset, size,
+				  dir, *dev->dma_mask);
+}
+
+static dma_addr_t intel_map_resource(struct device *dev, phys_addr_t phys_addr,
+				     size_t size, enum dma_data_direction dir,
+				     unsigned long attrs)
+{
+	return __intel_map_single(dev, phys_addr, size, dir, *dev->dma_mask);
 }
 
 static void intel_unmap(struct device *dev, dma_addr_t dev_addr, size_t size)
@@ -3806,8 +3812,9 @@ static void *intel_alloc_coherent(struct device *dev, size_t size,
 		return NULL;
 	memset(page_address(page), 0, size);
 
-	*dma_handle = __intel_map_page(dev, page, 0, size, DMA_BIDIRECTIONAL,
-				       dev->coherent_dma_mask);
+	*dma_handle = __intel_map_single(dev, page_to_phys(page), size,
+					 DMA_BIDIRECTIONAL,
+					 dev->coherent_dma_mask);
 	if (*dma_handle != DMA_MAPPING_ERROR)
 		return page_address(page);
 	if (!dma_release_from_contiguous(dev, page, size >> PAGE_SHIFT))
@@ -3924,6 +3931,8 @@ static const struct dma_map_ops intel_dma_ops = {
 	.unmap_sg = intel_unmap_sg,
 	.map_page = intel_map_page,
 	.unmap_page = intel_unmap_page,
+	.map_resource = intel_map_resource,
+	.unmap_resource = intel_unmap_page,
 	.dma_supported = dma_direct_supported,
 };
 
-- 
2.19.0



* [PATCH v2 02/12] NTB: ntb_transport: Ensure the destination buffer is mapped for TX DMA
  2019-02-13 17:54 [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Logan Gunthorpe
  2019-02-13 17:54 ` [PATCH v2 01/12] iommu/vt-d: Implement dma_[un]map_resource() Logan Gunthorpe
@ 2019-02-13 17:54 ` Logan Gunthorpe
  2019-02-13 17:54 ` [PATCH v2 03/12] iommu/vt-d: Add helper to set an IRTE to verify only the bus number Logan Gunthorpe
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Logan Gunthorpe @ 2019-02-13 17:54 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

Presently, when ntb_transport is used with DMA and the IOMMU turned on,
it fails with errors from the IOMMU such as:

  DMAR: DRHD: handling fault status reg 202
  DMAR: [DMA Write] Request device [00:04.0] fault addr
	381fc0340000 [fault reason 05] PTE Write access is not set

This is because ntb_transport does not map the BAR space with the IOMMU.

To fix this, we map the entire MW region for each QP after we assign
the DMA channel. This prevents needing an extra DMA map in the fast
path.

Link: https://lore.kernel.org/linux-pci/499934e7-3734-1aee-37dd-b42a5d2a2608@intel.com/
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
---
 drivers/ntb/ntb_transport.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
index 3bfdb4562408..526b65afc16a 100644
--- a/drivers/ntb/ntb_transport.c
+++ b/drivers/ntb/ntb_transport.c
@@ -144,7 +144,9 @@ struct ntb_transport_qp {
 	struct list_head tx_free_q;
 	spinlock_t ntb_tx_free_q_lock;
 	void __iomem *tx_mw;
-	dma_addr_t tx_mw_phys;
+	phys_addr_t tx_mw_phys;
+	size_t tx_mw_size;
+	dma_addr_t tx_mw_dma_addr;
 	unsigned int tx_index;
 	unsigned int tx_max_entry;
 	unsigned int tx_max_frame;
@@ -1049,6 +1051,7 @@ static int ntb_transport_init_queue(struct ntb_transport_ctx *nt,
 	tx_size = (unsigned int)mw_size / num_qps_mw;
 	qp_offset = tx_size * (qp_num / mw_count);
 
+	qp->tx_mw_size = tx_size;
 	qp->tx_mw = nt->mw_vec[mw_num].vbase + qp_offset;
 	if (!qp->tx_mw)
 		return -EINVAL;
@@ -1644,7 +1647,7 @@ static int ntb_async_tx_submit(struct ntb_transport_qp *qp,
 	dma_cookie_t cookie;
 
 	device = chan->device;
-	dest = qp->tx_mw_phys + qp->tx_max_frame * entry->tx_index;
+	dest = qp->tx_mw_dma_addr + qp->tx_max_frame * entry->tx_index;
 	buff_off = (size_t)buf & ~PAGE_MASK;
 	dest_off = (size_t)dest & ~PAGE_MASK;
 
@@ -1863,6 +1866,18 @@ ntb_transport_create_queue(void *data, struct device *client_dev,
 		qp->rx_dma_chan = NULL;
 	}
 
+	if (qp->tx_dma_chan) {
+		qp->tx_mw_dma_addr =
+			dma_map_resource(qp->tx_dma_chan->device->dev,
+					 qp->tx_mw_phys, qp->tx_mw_size,
+					 DMA_FROM_DEVICE, 0);
+		if (dma_mapping_error(qp->tx_dma_chan->device->dev,
+				      qp->tx_mw_dma_addr)) {
+			qp->tx_mw_dma_addr = 0;
+			goto err1;
+		}
+	}
+
 	dev_dbg(&pdev->dev, "Using %s memcpy for TX\n",
 		qp->tx_dma_chan ? "DMA" : "CPU");
 
@@ -1904,6 +1919,10 @@ ntb_transport_create_queue(void *data, struct device *client_dev,
 	qp->rx_alloc_entry = 0;
 	while ((entry = ntb_list_rm(&qp->ntb_rx_q_lock, &qp->rx_free_q)))
 		kfree(entry);
+	if (qp->tx_mw_dma_addr)
+		dma_unmap_resource(qp->tx_dma_chan->device->dev,
+				   qp->tx_mw_dma_addr, qp->tx_mw_size,
+				   DMA_FROM_DEVICE, 0);
 	if (qp->tx_dma_chan)
 		dma_release_channel(qp->tx_dma_chan);
 	if (qp->rx_dma_chan)
@@ -1945,6 +1964,11 @@ void ntb_transport_free_queue(struct ntb_transport_qp *qp)
 		 */
 		dma_sync_wait(chan, qp->last_cookie);
 		dmaengine_terminate_all(chan);
+
+		dma_unmap_resource(chan->device->dev,
+				   qp->tx_mw_dma_addr, qp->tx_mw_size,
+				   DMA_FROM_DEVICE, 0);
+
 		dma_release_channel(chan);
 	}
 
-- 
2.19.0



* [PATCH v2 03/12] iommu/vt-d: Add helper to set an IRTE to verify only the bus number
  2019-02-13 17:54 [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Logan Gunthorpe
  2019-02-13 17:54 ` [PATCH v2 01/12] iommu/vt-d: Implement dma_[un]map_resource() Logan Gunthorpe
  2019-02-13 17:54 ` [PATCH v2 02/12] NTB: ntb_transport: Ensure the destination buffer is mapped for TX DMA Logan Gunthorpe
@ 2019-02-13 17:54 ` Logan Gunthorpe
  2019-02-13 17:54 ` [PATCH v2 04/12] iommu/vt-d: Allow interrupts from the entire bus for aliased devices Logan Gunthorpe
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Logan Gunthorpe @ 2019-02-13 17:54 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore,
	Logan Gunthorpe, David Woodhouse, Jacob Pan

The current code uses set_irte_sid() with SVT_VERIFY_BUS and PCI_DEVID
to set the SID value. However, this is very confusing because, with
SVT_VERIFY_BUS, the SID value is not a PCI devfn address, but the start
and end bus numbers to match against.

According to the Intel Virtualization Technology for Directed I/O
Architecture Specification, Rev. 3.0, page 9-36:

  The most significant 8-bits of the SID field contains the Startbus#,
  and the least significant 8-bits of the SID field contains the Endbus#.
  Interrupt requests that reference this IRTE must have a requester-id
  whose bus# (most significant 8-bits of requester-id) has a value equal
  to or within the Startbus# to Endbus# range.

So to make this more clear, introduce a new set_irte_verify_bus() that
explicitly takes a start bus and end bus so that we can stop abusing
the PCI_DEVID macro.

This helper function will be called a second time in a subsequent patch.
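
For example, with hypothetical alias bus 0x05 and device bus 0x08:

	set_irte_verify_bus(irte, 0x05, 0x08);

sets the SID field to 0x0508, so the IRTE will match interrupt requests
from any requester ID whose bus number is in the range 0x05 through
0x08.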

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel_irq_remapping.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 24d45b07f425..5a55bef8e379 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -294,6 +294,18 @@ static void set_irte_sid(struct irte *irte, unsigned int svt,
 	irte->sid = sid;
 }
 
+/*
+ * Set an IRTE to match only the bus number. Interrupt requests that reference
+ * this IRTE must have a requester-id whose bus number is between or equal
+ * to the start_bus and end_bus arguments.
+ */
+static void set_irte_verify_bus(struct irte *irte, unsigned int start_bus,
+				unsigned int end_bus)
+{
+	set_irte_sid(irte, SVT_VERIFY_BUS, SQ_ALL_16,
+		     (start_bus << 8) | end_bus);
+}
+
 static int set_ioapic_sid(struct irte *irte, int apic)
 {
 	int i;
@@ -391,9 +403,8 @@ static int set_msi_sid(struct irte *irte, struct pci_dev *dev)
 	 * original device.
 	 */
 	if (PCI_BUS_NUM(data.alias) != data.pdev->bus->number)
-		set_irte_sid(irte, SVT_VERIFY_BUS, SQ_ALL_16,
-			     PCI_DEVID(PCI_BUS_NUM(data.alias),
-				       dev->bus->number));
+		set_irte_verify_bus(irte, PCI_BUS_NUM(data.alias),
+				    dev->bus->number);
 	else if (data.pdev->bus->number != dev->bus->number)
 		set_irte_sid(irte, SVT_VERIFY_SID_SQ, SQ_ALL_16, data.alias);
 	else
-- 
2.19.0



* [PATCH v2 04/12] iommu/vt-d: Allow interrupts from the entire bus for aliased devices
  2019-02-13 17:54 [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (2 preceding siblings ...)
  2019-02-13 17:54 ` [PATCH v2 03/12] iommu/vt-d: Add helper to set an IRTE to verify only the bus number Logan Gunthorpe
@ 2019-02-13 17:54 ` Logan Gunthorpe
  2019-02-13 17:54 ` [PATCH v2 05/12] PCI/MSI: Support allocating virtual MSI interrupts Logan Gunthorpe
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Logan Gunthorpe @ 2019-02-13 17:54 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore,
	Logan Gunthorpe, David Woodhouse, Jacob Pan

When a device has multiple aliases that are all on the same bus,
we program the IRTE to accept requests from any matching device on the
bus.

This is so NTB devices, which can send requests from multiple bus-devfns,
can pass MSI interrupts across the bridge.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel_irq_remapping.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 5a55bef8e379..2d74641b7f7b 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -368,6 +368,8 @@ static int set_hpet_sid(struct irte *irte, u8 id)
 struct set_msi_sid_data {
 	struct pci_dev *pdev;
 	u16 alias;
+	int count;
+	int busmatch_count;
 };
 
 static int set_msi_sid_cb(struct pci_dev *pdev, u16 alias, void *opaque)
@@ -376,6 +378,10 @@ static int set_msi_sid_cb(struct pci_dev *pdev, u16 alias, void *opaque)
 
 	data->pdev = pdev;
 	data->alias = alias;
+	data->count++;
+
+	if (PCI_BUS_NUM(alias) == pdev->bus->number)
+		data->busmatch_count++;
 
 	return 0;
 }
@@ -387,6 +393,8 @@ static int set_msi_sid(struct irte *irte, struct pci_dev *dev)
 	if (!irte || !dev)
 		return -1;
 
+	data.count = 0;
+	data.busmatch_count = 0;
 	pci_for_each_dma_alias(dev, set_msi_sid_cb, &data);
 
 	/*
@@ -395,6 +403,11 @@ static int set_msi_sid(struct irte *irte, struct pci_dev *dev)
 	 * device is the case of a PCIe-to-PCI bridge, where the alias is for
 	 * the subordinate bus.  In this case we can only verify the bus.
 	 *
+	 * If there are multiple aliases, all with the same bus number,
+	 * then all we can do is verify the bus. This is typical in NTB
+	 * hardware which use proxy IDs where the device will generate traffic
+	 * from multiple devfn numbers on the same bus.
+	 *
 	 * If the alias device is on a different bus than our source device
 	 * then we have a topology based alias, use it.
 	 *
@@ -405,6 +418,8 @@ static int set_msi_sid(struct irte *irte, struct pci_dev *dev)
 	if (PCI_BUS_NUM(data.alias) != data.pdev->bus->number)
 		set_irte_verify_bus(irte, PCI_BUS_NUM(data.alias),
 				    dev->bus->number);
+	else if (data.count >= 2 && data.busmatch_count == data.count)
+		set_irte_verify_bus(irte, dev->bus->number, dev->bus->number);
 	else if (data.pdev->bus->number != dev->bus->number)
 		set_irte_sid(irte, SVT_VERIFY_SID_SQ, SQ_ALL_16, data.alias);
 	else
-- 
2.19.0



* [PATCH v2 05/12] PCI/MSI: Support allocating virtual MSI interrupts
  2019-02-13 17:54 [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (3 preceding siblings ...)
  2019-02-13 17:54 ` [PATCH v2 04/12] iommu/vt-d: Allow interrupts from the entire bus for aliased devices Logan Gunthorpe
@ 2019-02-13 17:54 ` Logan Gunthorpe
  2019-02-13 17:54 ` [PATCH v2 06/12] PCI/switchtec: Add module parameter to request more interrupts Logan Gunthorpe
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Logan Gunthorpe @ 2019-02-13 17:54 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

For NTB devices, we want to be able to trigger MSI interrupts
through a memory window. In these cases we may want to use
more interrupts than the NTB PCI device has available in its MSI-X
table.

We allow for this by creating a new 'virtual' interrupt. These
interrupts are allocated as usual but are not programmed into the
MSI-X table (as there may not be space for them).

The MSI address and data will then be handled through an NTB MSI
library introduced later in this series.
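
As an illustration of the write_msi_msg hook added below (a sketch with
a hypothetical client, not code from this patch):

	static void my_msi_changed(struct msi_desc *entry, void *data)
	{
		/* Resend entry->msg.address_lo/hi and entry->msg.data
		 * to the peer here.
		 */
	}

	...
	entry->write_msi_msg = my_msi_changed;
	entry->write_msi_msg_data = my_ctx;

__pci_write_msi_msg() will then invoke the callback whenever the MSI
address or data changes; the NTB MSI library in patch 9 uses this to
resend the new values to the peer.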

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/msi.c   | 55 +++++++++++++++++++++++++++++++++++++--------
 include/linux/msi.h |  8 +++++++
 include/linux/pci.h |  9 ++++++++
 3 files changed, 63 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 4c0b47867258..e7810ec45c9d 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -192,6 +192,9 @@ static void msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
 
 static void __iomem *pci_msix_desc_addr(struct msi_desc *desc)
 {
+	if (desc->msi_attrib.is_virtual)
+		return NULL;
+
 	return desc->mask_base +
 		desc->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE;
 }
@@ -206,14 +209,19 @@ static void __iomem *pci_msix_desc_addr(struct msi_desc *desc)
 u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag)
 {
 	u32 mask_bits = desc->masked;
+	void __iomem *desc_addr;
 
 	if (pci_msi_ignore_mask)
 		return 0;
+	desc_addr = pci_msix_desc_addr(desc);
+	if (!desc_addr)
+		return 0;
 
 	mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
 	if (flag)
 		mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
-	writel(mask_bits, pci_msix_desc_addr(desc) + PCI_MSIX_ENTRY_VECTOR_CTRL);
+
+	writel(mask_bits, desc_addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
 
 	return mask_bits;
 }
@@ -273,6 +281,11 @@ void __pci_read_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
 	if (entry->msi_attrib.is_msix) {
 		void __iomem *base = pci_msix_desc_addr(entry);
 
+		if (!base) {
+			WARN_ON(1);
+			return;
+		}
+
 		msg->address_lo = readl(base + PCI_MSIX_ENTRY_LOWER_ADDR);
 		msg->address_hi = readl(base + PCI_MSIX_ENTRY_UPPER_ADDR);
 		msg->data = readl(base + PCI_MSIX_ENTRY_DATA);
@@ -303,6 +316,9 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
 	} else if (entry->msi_attrib.is_msix) {
 		void __iomem *base = pci_msix_desc_addr(entry);
 
+		if (!base)
+			goto skip;
+
 		writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
 		writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
 		writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
@@ -327,7 +343,13 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
 					      msg->data);
 		}
 	}
+
+skip:
 	entry->msg = *msg;
+
+	if (entry->write_msi_msg)
+		entry->write_msi_msg(entry, entry->write_msi_msg_data);
+
 }
 
 void pci_write_msi_msg(unsigned int irq, struct msi_msg *msg)
@@ -550,6 +572,7 @@ msi_setup_entry(struct pci_dev *dev, int nvec, const struct irq_affinity *affd)
 
 	entry->msi_attrib.is_msix	= 0;
 	entry->msi_attrib.is_64		= !!(control & PCI_MSI_FLAGS_64BIT);
+	entry->msi_attrib.is_virtual    = 0;
 	entry->msi_attrib.entry_nr	= 0;
 	entry->msi_attrib.maskbit	= !!(control & PCI_MSI_FLAGS_MASKBIT);
 	entry->msi_attrib.default_irq	= dev->irq;	/* Save IOAPIC IRQ */
@@ -674,6 +697,7 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
 	struct irq_affinity_desc *curmsk, *masks = NULL;
 	struct msi_desc *entry;
 	int ret, i;
+	int vec_count = pci_msix_vec_count(dev);
 
 	if (affd)
 		masks = irq_create_affinity_masks(nvec, affd);
@@ -696,6 +720,10 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
 			entry->msi_attrib.entry_nr = entries[i].entry;
 		else
 			entry->msi_attrib.entry_nr = i;
+
+		entry->msi_attrib.is_virtual =
+			entry->msi_attrib.entry_nr >= vec_count;
+
 		entry->msi_attrib.default_irq	= dev->irq;
 		entry->mask_base		= base;
 
@@ -714,12 +742,19 @@ static void msix_program_entries(struct pci_dev *dev,
 {
 	struct msi_desc *entry;
 	int i = 0;
+	void __iomem *desc_addr;
 
 	for_each_pci_msi_entry(entry, dev) {
 		if (entries)
 			entries[i++].vector = entry->irq;
-		entry->masked = readl(pci_msix_desc_addr(entry) +
-				PCI_MSIX_ENTRY_VECTOR_CTRL);
+
+		desc_addr = pci_msix_desc_addr(entry);
+		if (desc_addr)
+			entry->masked = readl(desc_addr +
+					      PCI_MSIX_ENTRY_VECTOR_CTRL);
+		else
+			entry->masked = 0;
+
 		msix_mask_irq(entry, 1);
 	}
 }
@@ -932,7 +967,8 @@ int pci_msix_vec_count(struct pci_dev *dev)
 EXPORT_SYMBOL(pci_msix_vec_count);
 
 static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
-			     int nvec, const struct irq_affinity *affd)
+			     int nvec, const struct irq_affinity *affd,
+			     int flags)
 {
 	int nr_entries;
 	int i, j;
@@ -943,7 +979,7 @@ static int __pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
 	nr_entries = pci_msix_vec_count(dev);
 	if (nr_entries < 0)
 		return nr_entries;
-	if (nvec > nr_entries)
+	if (nvec > nr_entries && !(flags & PCI_IRQ_VIRTUAL))
 		return nr_entries;
 
 	if (entries) {
@@ -1086,7 +1122,8 @@ EXPORT_SYMBOL(pci_enable_msi);
 
 static int __pci_enable_msix_range(struct pci_dev *dev,
 				   struct msix_entry *entries, int minvec,
-				   int maxvec, const struct irq_affinity *affd)
+				   int maxvec, const struct irq_affinity *affd,
+				   int flags)
 {
 	int rc, nvec = maxvec;
 
@@ -1110,7 +1147,7 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
 				return -ENOSPC;
 		}
 
-		rc = __pci_enable_msix(dev, entries, nvec, affd);
+		rc = __pci_enable_msix(dev, entries, nvec, affd, flags);
 		if (rc == 0)
 			return nvec;
 
@@ -1141,7 +1178,7 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
 int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
 		int minvec, int maxvec)
 {
-	return __pci_enable_msix_range(dev, entries, minvec, maxvec, NULL);
+	return __pci_enable_msix_range(dev, entries, minvec, maxvec, NULL, 0);
 }
 EXPORT_SYMBOL(pci_enable_msix_range);
 
@@ -1181,7 +1218,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 
 	if (flags & PCI_IRQ_MSIX) {
 		msix_vecs = __pci_enable_msix_range(dev, NULL, min_vecs,
-						    max_vecs, affd);
+						    max_vecs, affd, flags);
 		if (msix_vecs > 0)
 			return msix_vecs;
 	}
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 784fb52b9900..5044b4906102 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -56,6 +56,10 @@ struct fsl_mc_msi_desc {
  * @msg:	The last set MSI message cached for reuse
  * @affinity:	Optional pointer to a cpu affinity mask for this descriptor
  *
+ * @write_msi_msg:	Callback that may be called when the MSI message
+ *			address or data changes
+ * @write_msi_msg_data:	Data parameter for the callback.
+ *
  * @masked:	[PCI MSI/X] Mask bits
  * @is_msix:	[PCI MSI/X] True if MSI-X
  * @multiple:	[PCI MSI/X] log2 num of messages allocated
@@ -78,6 +82,9 @@ struct msi_desc {
 	struct msi_msg			msg;
 	struct irq_affinity_desc	*affinity;
 
+	void (*write_msi_msg)(struct msi_desc *entry, void *data);
+	void *write_msi_msg_data;
+
 	union {
 		/* PCI MSI/X specific data */
 		struct {
@@ -88,6 +95,7 @@ struct msi_desc {
 				__u8	multi_cap	: 3;
 				__u8	maskbit		: 1;
 				__u8	is_64		: 1;
+				__u8    is_virtual      : 1;
 				__u16	entry_nr;
 				unsigned default_irq;
 			} msi_attrib;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 65f1d8c2f082..f8df9637f3fc 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1352,6 +1352,15 @@ int pci_set_vga_state(struct pci_dev *pdev, bool decode,
 #define PCI_IRQ_MSI		(1 << 1) /* Allow MSI interrupts */
 #define PCI_IRQ_MSIX		(1 << 2) /* Allow MSI-X interrupts */
 #define PCI_IRQ_AFFINITY	(1 << 3) /* Auto-assign affinity */
+
+/*
+ * Virtual interrupts allow for more interrupts to be allocated
+ * than the device has interrupts for. These are not programmed
+ * into the device's MSI-X table and must be handled by some
+ * other driver means.
+ */
+#define PCI_IRQ_VIRTUAL		(1 << 4)
+
 #define PCI_IRQ_ALL_TYPES \
 	(PCI_IRQ_LEGACY | PCI_IRQ_MSI | PCI_IRQ_MSIX)
 
-- 
2.19.0



* [PATCH v2 06/12] PCI/switchtec: Add module parameter to request more interrupts
  2019-02-13 17:54 [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (4 preceding siblings ...)
  2019-02-13 17:54 ` [PATCH v2 05/12] PCI/MSI: Support allocating virtual MSI interrupts Logan Gunthorpe
@ 2019-02-13 17:54 ` Logan Gunthorpe
  2019-02-13 17:54 ` [PATCH v2 07/12] NTB: Introduce functions to calculate multi-port resource index Logan Gunthorpe
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Logan Gunthorpe @ 2019-02-13 17:54 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

Seeing that we want to use more interrupts in the NTB MSI code,
we need to be able to allocate more (sometimes virtual) interrupts
in the switchtec driver. Therefore, add a module parameter to
request additional interrupts.

This puts virtually no limit on the number of MSI interrupts available
to NTB clients.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/switch/switchtec.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c
index e22766c79fe9..8b1db78197d9 100644
--- a/drivers/pci/switch/switchtec.c
+++ b/drivers/pci/switch/switchtec.c
@@ -30,6 +30,10 @@ module_param(use_dma_mrpc, bool, 0644);
 MODULE_PARM_DESC(use_dma_mrpc,
 		 "Enable the use of the DMA MRPC feature");
 
+static int nirqs = 32;
+module_param(nirqs, int, 0644);
+MODULE_PARM_DESC(nirqs, "number of interrupts to allocate (more may be useful for NTB applications)");
+
 static dev_t switchtec_devt;
 static DEFINE_IDA(switchtec_minor_ida);
 
@@ -1247,8 +1251,12 @@ static int switchtec_init_isr(struct switchtec_dev *stdev)
 	int dma_mrpc_irq;
 	int rc;
 
-	nvecs = pci_alloc_irq_vectors(stdev->pdev, 1, 4,
-				      PCI_IRQ_MSIX | PCI_IRQ_MSI);
+	if (nirqs < 4)
+		nirqs = 4;
+
+	nvecs = pci_alloc_irq_vectors(stdev->pdev, 1, nirqs,
+				      PCI_IRQ_MSIX | PCI_IRQ_MSI |
+				      PCI_IRQ_VIRTUAL);
 	if (nvecs < 0)
 		return nvecs;
 
-- 
2.19.0



* [PATCH v2 07/12] NTB: Introduce functions to calculate multi-port resource index
  2019-02-13 17:54 [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (5 preceding siblings ...)
  2019-02-13 17:54 ` [PATCH v2 06/12] PCI/switchtec: Add module parameter to request more interrupts Logan Gunthorpe
@ 2019-02-13 17:54 ` Logan Gunthorpe
  2019-03-06  1:24   ` Serge Semin
  2019-02-13 17:54 ` [PATCH v2 08/12] NTB: Rename ntb.c to support multiple source files in the module Logan Gunthorpe
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 27+ messages in thread
From: Logan Gunthorpe @ 2019-02-13 17:54 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

When using multiple ports, each port uses resources (dbs, msgs, mws,
etc.) on every other port. Creating a mapping for these resources such
that each port has a corresponding resource on every other port is a
bit tricky.

Introduce the ntb_peer_resource_idx() function for this purpose.
It returns the peer resource number that will correspond with the
local peer index on the remote peer.

Also, introduce ntb_peer_highest_mw_idx() which will use
ntb_peer_resource_idx() but return the MW index starting with the
highest index and working down.
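
As a usage sketch (mirroring what the MSI library in patch 9 does;
'addr' and 'size' are placeholders), a client that reserves one
inbound memory window per peer would do:

	int widx = ntb_peer_highest_mw_idx(ntb, pidx);

	if (widx < 0)
		return widx;

	ret = ntb_mw_set_trans(ntb, pidx, widx, addr, size);

so that the highest-indexed windows are used first and the lower
indexes remain free for other clients.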

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
---
 include/linux/ntb.h | 70 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

diff --git a/include/linux/ntb.h b/include/linux/ntb.h
index 181d16601dd9..f5c69d853489 100644
--- a/include/linux/ntb.h
+++ b/include/linux/ntb.h
@@ -1502,4 +1502,74 @@ static inline int ntb_peer_msg_write(struct ntb_dev *ntb, int pidx, int midx,
 	return ntb->ops->peer_msg_write(ntb, pidx, midx, msg);
 }
 
+/**
+ * ntb_peer_resource_idx() - get a resource index for a given peer idx
+ * @ntb:	NTB device context.
+ * @pidx:	Peer port index.
+ *
+ * When constructing a graph of peers, each remote peer must use a different
+ * resource index (mw, doorbell, etc) to communicate with each other
+ * peer.
+ *
+ * In a two peer system, this function should always return 0 such that
+ * resource 0 points to the remote peer on both ports.
+ *
+ * In a 5 peer system, this function will return the following matrix
+ *
+ * pidx \ port    0    1    2    3    4
+ * 0              0    0    1    2    3
+ * 1              0    1    2    3    4
+ * 2              0    1    2    3    4
+ * 3              0    1    2    3    4
+ *
+ * For example, if this function is used to program the peers' memory
+ * windows, port 0 will program MW 0 on all its peers to point to itself.
+ * Port 1 will program MW 0 in port 0 to point to itself and MW 1 on all
+ * other ports, etc.
+ *
+ * For the legacy two host case, ntb_port_number() and ntb_peer_port_number()
+ * both return zero and therefore this function will always return zero.
+ * So MW 0 on each host would be programmed to point to the other host.
+ *
+ * Return: the resource index to use for that peer.
+ */
+static inline int ntb_peer_resource_idx(struct ntb_dev *ntb, int pidx)
+{
+	int local_port, peer_port;
+
+	if (pidx >= ntb_peer_port_count(ntb))
+		return -EINVAL;
+
+	local_port = ntb_port_number(ntb);
+	peer_port = ntb_peer_port_number(ntb, pidx);
+
+	if (peer_port < local_port)
+		return local_port - 1;
+	else
+		return local_port;
+}
+
+/**
+ * ntb_peer_highest_mw_idx() - get a memory window index for a given peer idx
+ *	using the highest index memory windows first
+ *
+ * @ntb:	NTB device context.
+ * @pidx:	Peer port index.
+ *
+ * Like ntb_peer_resource_idx(), except it returns indexes starting with
+ * last memory window index.
+ *
+ * Return: the resource index to use for that peer.
+ */
+static inline int ntb_peer_highest_mw_idx(struct ntb_dev *ntb, int pidx)
+{
+	int ret;
+
+	ret = ntb_peer_resource_idx(ntb, pidx);
+	if (ret < 0)
+		return ret;
+
+	return ntb_mw_count(ntb, pidx) - ret - 1;
+}
+
 #endif
-- 
2.19.0



* [PATCH v2 08/12] NTB: Rename ntb.c to support multiple source files in the module
  2019-02-13 17:54 [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (6 preceding siblings ...)
  2019-02-13 17:54 ` [PATCH v2 07/12] NTB: Introduce functions to calculate multi-port resource index Logan Gunthorpe
@ 2019-02-13 17:54 ` Logan Gunthorpe
  2019-02-13 17:54 ` [PATCH v2 09/12] NTB: Introduce MSI library Logan Gunthorpe
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Logan Gunthorpe @ 2019-02-13 17:54 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

The kbuild system does not support having multiple source files in
a module if one of those source files has the same name as the module.

Therefore, we must rename ntb.c to core.c, while the module remains
ntb.ko.

This is similar to the way the nvme modules are structured.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
---
 drivers/ntb/Makefile          | 2 ++
 drivers/ntb/{ntb.c => core.c} | 0
 2 files changed, 2 insertions(+)
 rename drivers/ntb/{ntb.c => core.c} (100%)

diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
index 1921dec1949d..537226f8e78d 100644
--- a/drivers/ntb/Makefile
+++ b/drivers/ntb/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_NTB) += ntb.o hw/ test/
 obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
+
+ntb-y := core.o
diff --git a/drivers/ntb/ntb.c b/drivers/ntb/core.c
similarity index 100%
rename from drivers/ntb/ntb.c
rename to drivers/ntb/core.c
-- 
2.19.0



* [PATCH v2 09/12] NTB: Introduce MSI library
  2019-02-13 17:54 [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (7 preceding siblings ...)
  2019-02-13 17:54 ` [PATCH v2 08/12] NTB: Rename ntb.c to support multiple source files in the module Logan Gunthorpe
@ 2019-02-13 17:54 ` Logan Gunthorpe
  2019-03-06 20:26   ` Serge Semin
  2019-02-13 17:54 ` [PATCH v2 10/12] NTB: Introduce NTB MSI Test Client Logan Gunthorpe
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 27+ messages in thread
From: Logan Gunthorpe @ 2019-02-13 17:54 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

The NTB MSI library allows passing MSI interrupts across a memory
window. This offers similar functionality to doorbells or messages,
except it will often have much better latency, and the client can
potentially use significantly more remote interrupts than typical
hardware provides doorbells for (which can be important in
high-multiport setups).

The library utilizes one memory window per peer and uses the highest
index memory windows. Before any ntb_msi function may be used, the
user must call ntb_msi_init(). It may then set up and tear down the
memory windows when the link state changes using ntb_msi_setup_mws()
and ntb_msi_clear_mws().

The peer which receives the interrupt must call ntbm_msi_request_irq()
to assign an interrupt handler (this function is functionally
similar to devm_request_irq()) and the returned descriptor must be
transferred to the peer, which can use it to trigger the interrupt.
The triggering peer, once having received the descriptor, can
trigger the interrupt by calling ntb_msi_peer_trigger().
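
Putting it together, a client roughly follows this flow (a condensed
sketch of the API introduced below; error handling and the link-event
plumbing are elided, and names like 'my_isr' are placeholders):

	/* at probe time */
	ret = ntb_msi_init(ntb, my_desc_changed_cb);

	/* in a work function, after a link up event */
	ret = ntb_msi_setup_mws(ntb);

	/* receiving peer: allocate an IRQ and a descriptor to pass
	 * to the other peer (e.g. through scratchpad registers)
	 */
	irq = ntbm_msi_request_irq(ntb, my_isr, "my_client", ctx,
				   &desc);

	/* triggering peer, once it has received 'desc' */
	ret = ntb_msi_peer_trigger(ntb, peer, &desc);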

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
---
 drivers/ntb/Kconfig  |  11 ++
 drivers/ntb/Makefile |   3 +-
 drivers/ntb/msi.c    | 415 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/ntb.h  |  73 ++++++++
 4 files changed, 501 insertions(+), 1 deletion(-)
 create mode 100644 drivers/ntb/msi.c

diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
index 95944e52fa36..5760764052be 100644
--- a/drivers/ntb/Kconfig
+++ b/drivers/ntb/Kconfig
@@ -12,6 +12,17 @@ menuconfig NTB
 
 if NTB
 
+config NTB_MSI
+	bool "MSI Interrupt Support"
+	depends on PCI_MSI
+	help
+	 Support using MSI interrupt forwarding instead of (or in addition to)
+	 hardware doorbells. MSI interrupts typically offer lower latency
+	 than doorbells and more MSI interrupts can be made available to
+	 clients. However this requires an extra memory window and support
+	 in the hardware driver for creating the MSI interrupts.
+
+	 If unsure, say N.
 source "drivers/ntb/hw/Kconfig"
 
 source "drivers/ntb/test/Kconfig"
diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
index 537226f8e78d..cc27ad2ef150 100644
--- a/drivers/ntb/Makefile
+++ b/drivers/ntb/Makefile
@@ -1,4 +1,5 @@
 obj-$(CONFIG_NTB) += ntb.o hw/ test/
 obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
 
-ntb-y := core.o
+ntb-y			:= core.o
+ntb-$(CONFIG_NTB_MSI)	+= msi.o
diff --git a/drivers/ntb/msi.c b/drivers/ntb/msi.c
new file mode 100644
index 000000000000..5d4bd7a63924
--- /dev/null
+++ b/drivers/ntb/msi.c
@@ -0,0 +1,415 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+
+#include <linux/irq.h>
+#include <linux/module.h>
+#include <linux/ntb.h>
+#include <linux/msi.h>
+#include <linux/pci.h>
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION("0.1");
+MODULE_AUTHOR("Logan Gunthorpe <logang@deltatee.com>");
+MODULE_DESCRIPTION("NTB MSI Interrupt Library");
+
+struct ntb_msi {
+	u64 base_addr;
+	u64 end_addr;
+
+	void (*desc_changed)(void *ctx);
+
+	u32 *peer_mws[];
+};
+
+/**
+ * ntb_msi_init() - Initialize the MSI context
+ * @ntb:	NTB device context
+ *
+ * This function must be called before any other ntb_msi function.
+ * It initializes the context for MSI operations and maps
+ * the peer memory windows.
+ *
+ * This function reserves the last N outbound memory windows (where N
+ * is the number of peers).
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_init(struct ntb_dev *ntb,
+		 void (*desc_changed)(void *ctx))
+{
+	phys_addr_t mw_phys_addr;
+	resource_size_t mw_size;
+	size_t struct_size;
+	int peer_widx;
+	int peers;
+	int ret;
+	int i;
+
+	peers = ntb_peer_port_count(ntb);
+	if (peers <= 0)
+		return -EINVAL;
+
+	struct_size = sizeof(*ntb->msi) + sizeof(*ntb->msi->peer_mws) * peers;
+
+	ntb->msi = devm_kzalloc(&ntb->dev, struct_size, GFP_KERNEL);
+	if (!ntb->msi)
+		return -ENOMEM;
+
+	ntb->msi->desc_changed = desc_changed;
+
+	for (i = 0; i < peers; i++) {
+		peer_widx = ntb_peer_mw_count(ntb) - 1 - i;
+
+		ret = ntb_peer_mw_get_addr(ntb, peer_widx, &mw_phys_addr,
+					   &mw_size);
+		if (ret)
+			goto unroll;
+
+		ntb->msi->peer_mws[i] = devm_ioremap(&ntb->dev, mw_phys_addr,
+						     mw_size);
+		if (!ntb->msi->peer_mws[i]) {
+			ret = -EFAULT;
+			goto unroll;
+		}
+	}
+
+	return 0;
+
+unroll:
+	for (i = 0; i < peers; i++)
+		if (ntb->msi->peer_mws[i])
+			devm_iounmap(&ntb->dev, ntb->msi->peer_mws[i]);
+
+	devm_kfree(&ntb->dev, ntb->msi);
+	ntb->msi = NULL;
+	return ret;
+}
+EXPORT_SYMBOL(ntb_msi_init);
+
+/**
+ * ntb_msi_setup_mws() - Initialize the MSI inbound memory windows
+ * @ntb:	NTB device context
+ *
+ * This function sets up the required inbound memory windows. It should be
+ * called from a work function after a link up event.
+ *
+ * Over the entire network, this function will reserve the last N
+ * inbound memory windows for each peer (where N is the number of peers).
+ *
+ * ntb_msi_init() must be called before this function.
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_setup_mws(struct ntb_dev *ntb)
+{
+	struct msi_desc *desc;
+	u64 addr;
+	int peer, peer_widx;
+	resource_size_t addr_align, size_align, size_max;
+	resource_size_t mw_size = SZ_32K;
+	resource_size_t mw_min_size = mw_size;
+	int i;
+	int ret;
+
+	if (!ntb->msi)
+		return -EINVAL;
+
+	desc = first_msi_entry(&ntb->pdev->dev);
+	addr = desc->msg.address_lo + ((uint64_t)desc->msg.address_hi << 32);
+
+	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
+		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+		if (peer_widx < 0)
+			return peer_widx;
+
+		ret = ntb_mw_get_align(ntb, peer, peer_widx, &addr_align,
+				       NULL, NULL);
+		if (ret)
+			return ret;
+
+		addr &= ~(addr_align - 1);
+	}
+
+	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
+		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+		if (peer_widx < 0) {
+			ret = peer_widx;
+			goto error_out;
+		}
+
+		ret = ntb_mw_get_align(ntb, peer, peer_widx, NULL,
+				       &size_align, &size_max);
+		if (ret)
+			goto error_out;
+
+		mw_size = round_up(mw_size, size_align);
+		mw_size = max(mw_size, size_max);
+		if (mw_size < mw_min_size)
+			mw_min_size = mw_size;
+
+		ret = ntb_mw_set_trans(ntb, peer, peer_widx,
+				       addr, mw_size);
+		if (ret)
+			goto error_out;
+	}
+
+	ntb->msi->base_addr = addr;
+	ntb->msi->end_addr = addr + mw_min_size;
+
+	return 0;
+
+error_out:
+	for (i = 0; i < peer; i++) {
+		peer_widx = ntb_peer_highest_mw_idx(ntb, i);
+		if (peer_widx < 0)
+			continue;
+
+		ntb_mw_clear_trans(ntb, i, peer_widx);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(ntb_msi_setup_mws);
+
+/**
+ * ntb_msi_clear_mws() - Clear all inbound memory windows
+ * @ntb:	NTB device context
+ *
+ * This function tears down the resources used by ntb_msi_setup_mws().
+ */
+void ntb_msi_clear_mws(struct ntb_dev *ntb)
+{
+	int peer;
+	int peer_widx;
+
+	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
+		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+		if (peer_widx < 0)
+			continue;
+
+		ntb_mw_clear_trans(ntb, peer, peer_widx);
+	}
+}
+EXPORT_SYMBOL(ntb_msi_clear_mws);
+
+struct ntb_msi_devres {
+	struct ntb_dev *ntb;
+	struct msi_desc *entry;
+	struct ntb_msi_desc *msi_desc;
+};
+
+static int ntb_msi_set_desc(struct ntb_dev *ntb, struct msi_desc *entry,
+			    struct ntb_msi_desc *msi_desc)
+{
+	u64 addr;
+
+	addr = entry->msg.address_lo +
+		((uint64_t)entry->msg.address_hi << 32);
+
+	if (addr < ntb->msi->base_addr || addr >= ntb->msi->end_addr) {
+		dev_warn_once(&ntb->dev,
+			      "IRQ %d: MSI Address not within the memory window (%llx, [%llx %llx])\n",
+			      entry->irq, addr, ntb->msi->base_addr,
+			      ntb->msi->end_addr);
+		return -EFAULT;
+	}
+
+	msi_desc->addr_offset = addr - ntb->msi->base_addr;
+	msi_desc->data = entry->msg.data;
+
+	return 0;
+}
+
+static void ntb_msi_write_msg(struct msi_desc *entry, void *data)
+{
+	struct ntb_msi_devres *dr = data;
+
+	WARN_ON(ntb_msi_set_desc(dr->ntb, entry, dr->msi_desc));
+
+	if (dr->ntb->msi->desc_changed)
+		dr->ntb->msi->desc_changed(dr->ntb->ctx);
+}
+
+static void ntbm_msi_callback_release(struct device *dev, void *res)
+{
+	struct ntb_msi_devres *dr = res;
+
+	dr->entry->write_msi_msg = NULL;
+	dr->entry->write_msi_msg_data = NULL;
+}
+
+static int ntbm_msi_setup_callback(struct ntb_dev *ntb, struct msi_desc *entry,
+				   struct ntb_msi_desc *msi_desc)
+{
+	struct ntb_msi_devres *dr;
+
+	dr = devres_alloc(ntbm_msi_callback_release,
+			  sizeof(struct ntb_msi_devres), GFP_KERNEL);
+	if (!dr)
+		return -ENOMEM;
+
+	dr->ntb = ntb;
+	dr->entry = entry;
+	dr->msi_desc = msi_desc;
+
+	devres_add(&ntb->dev, dr);
+
+	dr->entry->write_msi_msg = ntb_msi_write_msg;
+	dr->entry->write_msi_msg_data = dr;
+
+	return 0;
+}
+
+/**
+ * ntbm_msi_request_threaded_irq() - allocate an MSI interrupt
+ * @ntb:	NTB device context
+ * @handler:	Function to be called when the IRQ occurs
+ * @thread_fn:  Function to be called in a threaded interrupt context. NULL
+ *              for clients which handle everything in @handler
+ * @name:       An ascii name for the claiming device, dev_name(dev) if NULL
+ * @dev_id:     A cookie passed back to the handler function
+ *
+ * This function assigns an interrupt handler to an unused
+ * MSI interrupt and returns the descriptor used to trigger
+ * it. The descriptor can then be sent to a peer to trigger
+ * the interrupt.
+ *
+ * The interrupt resource is managed with devres so it will
+ * be automatically freed when the NTB device is torn down.
+ *
+ * If an IRQ allocated with this function needs to be freed
+ * separately, ntbm_free_irq() must be used.
+ *
+ * Return: IRQ number assigned on success, otherwise a negative error number.
+ */
+int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb, irq_handler_t handler,
+				  irq_handler_t thread_fn,
+				  const char *name, void *dev_id,
+				  struct ntb_msi_desc *msi_desc)
+{
+	struct msi_desc *entry;
+	struct irq_desc *desc;
+	int ret;
+
+	if (!ntb->msi)
+		return -EINVAL;
+
+	for_each_pci_msi_entry(entry, ntb->pdev) {
+		desc = irq_to_desc(entry->irq);
+		if (desc->action)
+			continue;
+
+		ret = devm_request_threaded_irq(&ntb->dev, entry->irq, handler,
+						thread_fn, 0, name, dev_id);
+		if (ret)
+			continue;
+
+		if (ntb_msi_set_desc(ntb, entry, msi_desc)) {
+			devm_free_irq(&ntb->dev, entry->irq, dev_id);
+			continue;
+		}
+
+		ret = ntbm_msi_setup_callback(ntb, entry, msi_desc);
+		if (ret) {
+			devm_free_irq(&ntb->dev, entry->irq, dev_id);
+			return ret;
+		}
+
+
+		return entry->irq;
+	}
+
+	return -ENODEV;
+}
+EXPORT_SYMBOL(ntbm_msi_request_threaded_irq);
+
+static int ntbm_msi_callback_match(struct device *dev, void *res, void *data)
+{
+	struct ntb_dev *ntb = dev_ntb(dev);
+	struct ntb_msi_devres *dr = res;
+
+	return dr->ntb == ntb && dr->entry == data;
+}
+
+/**
+ * ntbm_msi_free_irq() - free an interrupt
+ * @ntb:	NTB device context
+ * @irq:	Interrupt line to free
+ * @dev_id:	Device identity to free
+ *
+ * This function should be used to manually free IRQs allocated with
+ * ntbm_msi_request_[threaded_]irq().
+ */
+void ntbm_msi_free_irq(struct ntb_dev *ntb, unsigned int irq, void *dev_id)
+{
+	struct msi_desc *entry = irq_get_msi_desc(irq);
+
+	entry->write_msi_msg = NULL;
+	entry->write_msi_msg_data = NULL;
+
+	WARN_ON(devres_destroy(&ntb->dev, ntbm_msi_callback_release,
+			       ntbm_msi_callback_match, entry));
+
+	devm_free_irq(&ntb->dev, irq, dev_id);
+}
+EXPORT_SYMBOL(ntbm_msi_free_irq);
+
+/**
+ * ntb_msi_peer_trigger() - Trigger an interrupt handler on a peer
+ * @ntb:	NTB device context
+ * @peer:	Peer index
+ * @desc:	MSI descriptor data which triggers the interrupt
+ *
+ * This function triggers an interrupt on a peer. It requires
+ * the descriptor structure to have been passed from that peer
+ * by some other means.
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
+			 struct ntb_msi_desc *desc)
+{
+	int idx;
+
+	if (!ntb->msi)
+		return -EINVAL;
+
+	idx = desc->addr_offset / sizeof(*ntb->msi->peer_mws[peer]);
+
+	ntb->msi->peer_mws[peer][idx] = desc->data;
+
+	return 0;
+}
+EXPORT_SYMBOL(ntb_msi_peer_trigger);
+
+/**
+ * ntb_msi_peer_addr() - Get the DMA address to trigger a peer's MSI interrupt
+ * @ntb:	NTB device context
+ * @peer:	Peer index
+ * @desc:	MSI descriptor data which triggers the interrupt
+ * @msi_addr:   Physical address to trigger the interrupt
+ *
+ * This function allows using DMA engines to trigger an interrupt
+ * (for example, trigger an interrupt to process the data after
+ * sending it). To trigger the interrupt, write @desc.data to the address
+ * returned in @msi_addr
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
+		      struct ntb_msi_desc *desc,
+		      phys_addr_t *msi_addr)
+{
+	int peer_widx = ntb_peer_mw_count(ntb) - 1 - peer;
+	phys_addr_t mw_phys_addr;
+	int ret;
+
+	ret = ntb_peer_mw_get_addr(ntb, peer_widx, &mw_phys_addr, NULL);
+	if (ret)
+		return ret;
+
+	if (msi_addr)
+		*msi_addr = mw_phys_addr + desc->addr_offset;
+
+	return 0;
+}
+EXPORT_SYMBOL(ntb_msi_peer_addr);
diff --git a/include/linux/ntb.h b/include/linux/ntb.h
index f5c69d853489..b9c61ee3c734 100644
--- a/include/linux/ntb.h
+++ b/include/linux/ntb.h
@@ -58,9 +58,11 @@
 
 #include <linux/completion.h>
 #include <linux/device.h>
+#include <linux/interrupt.h>
 
 struct ntb_client;
 struct ntb_dev;
+struct ntb_msi;
 struct pci_dev;
 
 /**
@@ -425,6 +427,10 @@ struct ntb_dev {
 	spinlock_t			ctx_lock;
 	/* block unregister until device is fully released */
 	struct completion		released;
+
+	#ifdef CONFIG_NTB_MSI
+	struct ntb_msi *msi;
+	#endif
 };
 #define dev_ntb(__dev) container_of((__dev), struct ntb_dev, dev)
 
@@ -1572,4 +1578,71 @@ static inline int ntb_peer_highest_mw_idx(struct ntb_dev *ntb, int pidx)
 	return ntb_mw_count(ntb, pidx) - ret - 1;
 }
 
+struct ntb_msi_desc {
+	u32 addr_offset;
+	u32 data;
+};
+
+#ifdef CONFIG_NTB_MSI
+
+int ntb_msi_init(struct ntb_dev *ntb, void (*desc_changed)(void *ctx));
+int ntb_msi_setup_mws(struct ntb_dev *ntb);
+void ntb_msi_clear_mws(struct ntb_dev *ntb);
+int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb, irq_handler_t handler,
+				  irq_handler_t thread_fn,
+				  const char *name, void *dev_id,
+				  struct ntb_msi_desc *msi_desc);
+void ntbm_msi_free_irq(struct ntb_dev *ntb, unsigned int irq, void *dev_id);
+int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
+			 struct ntb_msi_desc *desc);
+int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
+		      struct ntb_msi_desc *desc,
+		      phys_addr_t *msi_addr);
+
+#else /* not CONFIG_NTB_MSI */
+
+static inline int ntb_msi_init(struct ntb_dev *ntb,
+			       void (*desc_changed)(void *ctx))
+{
+	return -EOPNOTSUPP;
+}
+static inline int ntb_msi_setup_mws(struct ntb_dev *ntb)
+{
+	return -EOPNOTSUPP;
+}
+static inline void ntb_msi_clear_mws(struct ntb_dev *ntb) {}
+static inline int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb,
+						irq_handler_t handler,
+						irq_handler_t thread_fn,
+						const char *name, void *dev_id,
+						struct ntb_msi_desc *msi_desc)
+{
+	return -EOPNOTSUPP;
+}
+static inline void ntbm_msi_free_irq(struct ntb_dev *ntb, unsigned int irq,
+				     void *dev_id) {}
+static inline int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
+				       struct ntb_msi_desc *desc)
+{
+	return -EOPNOTSUPP;
+}
+static inline int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
+				    struct ntb_msi_desc *desc,
+				    phys_addr_t *msi_addr)
+{
+	return -EOPNOTSUPP;
+
+}
+
+#endif /* CONFIG_NTB_MSI */
+
+static inline int ntbm_msi_request_irq(struct ntb_dev *ntb,
+				       irq_handler_t handler,
+				       const char *name, void *dev_id,
+				       struct ntb_msi_desc *msi_desc)
+{
+	return ntbm_msi_request_threaded_irq(ntb, handler, NULL, name,
+					     dev_id, msi_desc);
+}
+
 #endif
-- 
2.19.0



* [PATCH v2 10/12] NTB: Introduce NTB MSI Test Client
  2019-02-13 17:54 [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (8 preceding siblings ...)
  2019-02-13 17:54 ` [PATCH v2 09/12] NTB: Introduce MSI library Logan Gunthorpe
@ 2019-02-13 17:54 ` Logan Gunthorpe
  2019-03-06 20:44   ` Serge Semin
  2019-02-13 17:54 ` [PATCH v2 11/12] NTB: Add ntb_msi_test support to ntb_test Logan Gunthorpe
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 27+ messages in thread
From: Logan Gunthorpe @ 2019-02-13 17:54 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

Introduce a tool to test NTB MSI interrupts similar to the other
NTB test tools. This tool creates a debugfs directory for each
NTB device with the following files:

port
irqX_occurrences
peerX/port
peerX/count
peerX/trigger
peerX/ready

The 'port' file tells the user the local port number and the
'occurrences' files report the number of local interrupts that
have been received for each interrupt.

For each peer, the 'port' file and the 'count' file tell you the
peer's port number and number of interrupts, respectively. Writing
an interrupt number to the 'trigger' file triggers the interrupt
handler for the peer, which should increment its corresponding
'occurrences' file. The 'ready' file indicates if a peer is ready;
writing to this file blocks until it is ready.

The module parameter num_irqs sets the number of local interrupts
(4 by default). It is limited only by the number of unused MSI
interrupts registered by the hardware (which requires support in the
hardware driver), and there must be at least 2*num_irqs + 1 scratchpad
registers available.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
---
 drivers/ntb/test/Kconfig        |   9 +
 drivers/ntb/test/Makefile       |   1 +
 drivers/ntb/test/ntb_msi_test.c | 433 ++++++++++++++++++++++++++++++++
 3 files changed, 443 insertions(+)
 create mode 100644 drivers/ntb/test/ntb_msi_test.c

diff --git a/drivers/ntb/test/Kconfig b/drivers/ntb/test/Kconfig
index a5d0eda44438..a3f3e2638935 100644
--- a/drivers/ntb/test/Kconfig
+++ b/drivers/ntb/test/Kconfig
@@ -25,3 +25,12 @@ config NTB_PERF
 	 to and from the window without additional software interaction.
 
 	 If unsure, say N.
+
+config NTB_MSI_TEST
+	tristate "NTB MSI Test Client"
+	depends on NTB_MSI
+	help
+	  This tool demonstrates the use of the NTB MSI library to
+	  send MSI interrupts between peers.
+
+	  If unsure, say N.
diff --git a/drivers/ntb/test/Makefile b/drivers/ntb/test/Makefile
index 9e77e0b761c2..d2895ca995e4 100644
--- a/drivers/ntb/test/Makefile
+++ b/drivers/ntb/test/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_NTB_PINGPONG) += ntb_pingpong.o
 obj-$(CONFIG_NTB_TOOL) += ntb_tool.o
 obj-$(CONFIG_NTB_PERF) += ntb_perf.o
+obj-$(CONFIG_NTB_MSI_TEST) += ntb_msi_test.o
diff --git a/drivers/ntb/test/ntb_msi_test.c b/drivers/ntb/test/ntb_msi_test.c
new file mode 100644
index 000000000000..99d826ed9c34
--- /dev/null
+++ b/drivers/ntb/test/ntb_msi_test.c
@@ -0,0 +1,433 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+
+#include <linux/module.h>
+#include <linux/debugfs.h>
+#include <linux/ntb.h>
+#include <linux/pci.h>
+#include <linux/radix-tree.h>
+#include <linux/workqueue.h>
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION("0.1");
+MODULE_AUTHOR("Logan Gunthorpe <logang@deltatee.com>");
+MODULE_DESCRIPTION("Test for sending MSI interrupts over an NTB memory window");
+
+static int num_irqs = 4;
+module_param(num_irqs, int, 0644);
+MODULE_PARM_DESC(num_irqs, "number of irqs to use");
+
+struct ntb_msit_ctx {
+	struct ntb_dev *ntb;
+	struct dentry *dbgfs_dir;
+	struct work_struct setup_work;
+
+	struct ntb_msit_isr_ctx {
+		int irq_idx;
+		int irq_num;
+		int occurrences;
+		struct ntb_msit_ctx *nm;
+		struct ntb_msi_desc desc;
+	} *isr_ctx;
+
+	struct ntb_msit_peer {
+		struct ntb_msit_ctx *nm;
+		int pidx;
+		int num_irqs;
+		struct completion init_comp;
+		struct ntb_msi_desc *msi_desc;
+	} peers[];
+};
+
+static struct dentry *ntb_msit_dbgfs_topdir;
+
+static irqreturn_t ntb_msit_isr(int irq, void *dev)
+{
+	struct ntb_msit_isr_ctx *isr_ctx = dev;
+	struct ntb_msit_ctx *nm = isr_ctx->nm;
+
+	dev_dbg(&nm->ntb->dev, "Interrupt Occurred: %d",
+		isr_ctx->irq_idx);
+
+	isr_ctx->occurrences++;
+
+	return IRQ_HANDLED;
+}
+
+static void ntb_msit_setup_work(struct work_struct *work)
+{
+	struct ntb_msit_ctx *nm = container_of(work, struct ntb_msit_ctx,
+					       setup_work);
+	int irq_count = 0;
+	int irq;
+	int ret;
+	uintptr_t i;
+
+	ret = ntb_msi_setup_mws(nm->ntb);
+	if (ret) {
+		dev_err(&nm->ntb->dev, "Unable to setup MSI windows: %d\n",
+			ret);
+		return;
+	}
+
+	for (i = 0; i < num_irqs; i++) {
+		nm->isr_ctx[i].irq_idx = i;
+		nm->isr_ctx[i].nm = nm;
+
+		if (!nm->isr_ctx[i].irq_num) {
+			irq = ntbm_msi_request_irq(nm->ntb, ntb_msit_isr,
+						   KBUILD_MODNAME,
+						   &nm->isr_ctx[i],
+						   &nm->isr_ctx[i].desc);
+			if (irq < 0)
+				break;
+
+			nm->isr_ctx[i].irq_num = irq;
+		}
+
+		ret = ntb_spad_write(nm->ntb, 2 * i + 1,
+				     nm->isr_ctx[i].desc.addr_offset);
+		if (ret)
+			break;
+
+		ret = ntb_spad_write(nm->ntb, 2 * i + 2,
+				     nm->isr_ctx[i].desc.data);
+		if (ret)
+			break;
+
+		irq_count++;
+	}
+
+	ntb_spad_write(nm->ntb, 0, irq_count);
+	ntb_peer_db_set(nm->ntb, BIT(ntb_port_number(nm->ntb)));
+}
+
+static void ntb_msit_desc_changed(void *ctx)
+{
+	struct ntb_msit_ctx *nm = ctx;
+	int i;
+
+	dev_dbg(&nm->ntb->dev, "MSI Descriptors Changed\n");
+
+	for (i = 0; i < num_irqs; i++) {
+		ntb_spad_write(nm->ntb, 2 * i + 1,
+			       nm->isr_ctx[i].desc.addr_offset);
+		ntb_spad_write(nm->ntb, 2 * i + 2,
+			       nm->isr_ctx[i].desc.data);
+	}
+
+	ntb_peer_db_set(nm->ntb, BIT(ntb_port_number(nm->ntb)));
+}
+
+static void ntb_msit_link_event(void *ctx)
+{
+	struct ntb_msit_ctx *nm = ctx;
+
+	if (!ntb_link_is_up(nm->ntb, NULL, NULL))
+		return;
+
+	schedule_work(&nm->setup_work);
+}
+
+static void ntb_msit_copy_peer_desc(struct ntb_msit_ctx *nm, int peer)
+{
+	int i;
+	struct ntb_msi_desc *desc = nm->peers[peer].msi_desc;
+	int irq_count = nm->peers[peer].num_irqs;
+
+	for (i = 0; i < irq_count; i++) {
+		desc[i].addr_offset = ntb_peer_spad_read(nm->ntb, peer,
+							 2 * i + 1);
+		desc[i].data = ntb_peer_spad_read(nm->ntb, peer, 2 * i + 2);
+	}
+
+	dev_info(&nm->ntb->dev, "Found %d interrupts on peer %d\n",
+		 irq_count, peer);
+
+	complete_all(&nm->peers[peer].init_comp);
+}
+
+static void ntb_msit_db_event(void *ctx, int vec)
+{
+	struct ntb_msit_ctx *nm = ctx;
+	struct ntb_msi_desc *desc;
+	u64 peer_mask = ntb_db_read(nm->ntb);
+	u32 irq_count;
+	int peer;
+
+	ntb_db_clear(nm->ntb, peer_mask);
+
+	for (peer = 0; peer < sizeof(peer_mask) * 8; peer++) {
+		if (!(peer_mask & BIT(peer)))
+			continue;
+
+		irq_count = ntb_peer_spad_read(nm->ntb, peer, 0);
+		if (irq_count == -1)
+			continue;
+
+		desc = kcalloc(irq_count, sizeof(*desc), GFP_ATOMIC);
+		if (!desc)
+			continue;
+
+		kfree(nm->peers[peer].msi_desc);
+		nm->peers[peer].msi_desc = desc;
+		nm->peers[peer].num_irqs = irq_count;
+
+		ntb_msit_copy_peer_desc(nm, peer);
+	}
+}
+
+static const struct ntb_ctx_ops ntb_msit_ops = {
+	.link_event = ntb_msit_link_event,
+	.db_event = ntb_msit_db_event,
+};
+
+static int ntb_msit_dbgfs_trigger(void *data, u64 idx)
+{
+	struct ntb_msit_peer *peer = data;
+
+	if (idx >= peer->num_irqs)
+		return -EINVAL;
+
+	dev_dbg(&peer->nm->ntb->dev, "trigger irq %llu on peer %u\n",
+		idx, peer->pidx);
+
+	return ntb_msi_peer_trigger(peer->nm->ntb, peer->pidx,
+				    &peer->msi_desc[idx]);
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_trigger_fops, NULL,
+			 ntb_msit_dbgfs_trigger, "%llu\n");
+
+static int ntb_msit_dbgfs_port_get(void *data, u64 *port)
+{
+	struct ntb_msit_peer *peer = data;
+
+	*port = ntb_peer_port_number(peer->nm->ntb, peer->pidx);
+
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_port_fops, ntb_msit_dbgfs_port_get,
+			 NULL, "%llu\n");
+
+static int ntb_msit_dbgfs_count_get(void *data, u64 *count)
+{
+	struct ntb_msit_peer *peer = data;
+
+	*count = peer->num_irqs;
+
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_count_fops, ntb_msit_dbgfs_count_get,
+			 NULL, "%llu\n");
+
+static int ntb_msit_dbgfs_ready_get(void *data, u64 *ready)
+{
+	struct ntb_msit_peer *peer = data;
+
+	*ready = try_wait_for_completion(&peer->init_comp);
+
+	return 0;
+}
+
+static int ntb_msit_dbgfs_ready_set(void *data, u64 ready)
+{
+	struct ntb_msit_peer *peer = data;
+
+	return wait_for_completion_interruptible(&peer->init_comp);
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_ready_fops, ntb_msit_dbgfs_ready_get,
+			 ntb_msit_dbgfs_ready_set, "%llu\n");
+
+static int ntb_msit_dbgfs_occurrences_get(void *data, u64 *occurrences)
+{
+	struct ntb_msit_isr_ctx *isr_ctx = data;
+
+	*occurrences = isr_ctx->occurrences;
+
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_occurrences_fops,
+			 ntb_msit_dbgfs_occurrences_get,
+			 NULL, "%llu\n");
+
+static int ntb_msit_dbgfs_local_port_get(void *data, u64 *port)
+{
+	struct ntb_msit_ctx *nm = data;
+
+	*port = ntb_port_number(nm->ntb);
+
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_local_port_fops,
+			 ntb_msit_dbgfs_local_port_get,
+			 NULL, "%llu\n");
+
+static void ntb_msit_create_dbgfs(struct ntb_msit_ctx *nm)
+{
+	struct pci_dev *pdev = nm->ntb->pdev;
+	char buf[32];
+	int i;
+	struct dentry *peer_dir;
+
+	nm->dbgfs_dir = debugfs_create_dir(pci_name(pdev),
+					   ntb_msit_dbgfs_topdir);
+	debugfs_create_file("port", 0400, nm->dbgfs_dir, nm,
+			    &ntb_msit_local_port_fops);
+
+	for (i = 0; i < ntb_peer_port_count(nm->ntb); i++) {
+		nm->peers[i].pidx = i;
+		nm->peers[i].nm = nm;
+		init_completion(&nm->peers[i].init_comp);
+
+		snprintf(buf, sizeof(buf), "peer%d", i);
+		peer_dir = debugfs_create_dir(buf, nm->dbgfs_dir);
+
+		debugfs_create_file_unsafe("trigger", 0200, peer_dir,
+					   &nm->peers[i],
+					   &ntb_msit_trigger_fops);
+
+		debugfs_create_file_unsafe("port", 0400, peer_dir,
+					   &nm->peers[i], &ntb_msit_port_fops);
+
+		debugfs_create_file_unsafe("count", 0400, peer_dir,
+					   &nm->peers[i],
+					   &ntb_msit_count_fops);
+
+		debugfs_create_file_unsafe("ready", 0600, peer_dir,
+					   &nm->peers[i],
+					   &ntb_msit_ready_fops);
+	}
+
+	for (i = 0; i < num_irqs; i++) {
+		snprintf(buf, sizeof(buf), "irq%d_occurrences", i);
+		debugfs_create_file_unsafe(buf, 0400, nm->dbgfs_dir,
+					   &nm->isr_ctx[i],
+					   &ntb_msit_occurrences_fops);
+	}
+}
+
+static void ntb_msit_remove_dbgfs(struct ntb_msit_ctx *nm)
+{
+	debugfs_remove_recursive(nm->dbgfs_dir);
+}
+
+static int ntb_msit_probe(struct ntb_client *client, struct ntb_dev *ntb)
+{
+	struct ntb_msit_ctx *nm;
+	size_t struct_size;
+	int peers;
+	int ret;
+
+	peers = ntb_peer_port_count(ntb);
+	if (peers <= 0)
+		return -EINVAL;
+
+	if (ntb_spad_is_unsafe(ntb) || ntb_spad_count(ntb) < 2 * num_irqs + 1) {
+		dev_err(&ntb->dev, "NTB MSI test requires at least %d spads for %d irqs\n",
+			2 * num_irqs + 1, num_irqs);
+		return -EFAULT;
+	}
+
+	ret = ntb_spad_write(ntb, 0, -1);
+	if (ret) {
+		dev_err(&ntb->dev, "Unable to write spads: %d\n", ret);
+		return ret;
+	}
+
+	ret = ntb_db_clear_mask(ntb, GENMASK(peers - 1, 0));
+	if (ret) {
+		dev_err(&ntb->dev, "Unable to clear doorbell mask: %d\n", ret);
+		return ret;
+	}
+
+	ret = ntb_msi_init(ntb, ntb_msit_desc_changed);
+	if (ret) {
+		dev_err(&ntb->dev, "Unable to initialize MSI library: %d\n",
+			ret);
+		return ret;
+	}
+
+	struct_size = sizeof(*nm) + sizeof(*nm->peers) * peers;
+
+	nm = devm_kzalloc(&ntb->dev, struct_size, GFP_KERNEL);
+	if (!nm)
+		return -ENOMEM;
+
+	nm->isr_ctx = devm_kcalloc(&ntb->dev, num_irqs, sizeof(*nm->isr_ctx),
+				   GFP_KERNEL);
+	if (!nm->isr_ctx)
+		return -ENOMEM;
+
+	INIT_WORK(&nm->setup_work, ntb_msit_setup_work);
+	nm->ntb = ntb;
+
+	ntb_msit_create_dbgfs(nm);
+
+	ret = ntb_set_ctx(ntb, nm, &ntb_msit_ops);
+	if (ret)
+		goto remove_dbgfs;
+
+	ntb_link_enable(ntb, NTB_SPEED_AUTO, NTB_WIDTH_AUTO);
+
+	return 0;
+
+remove_dbgfs:
+	ntb_msit_remove_dbgfs(nm);
+	devm_kfree(&ntb->dev, nm->isr_ctx);
+	devm_kfree(&ntb->dev, nm);
+	return ret;
+}
+
+static void ntb_msit_remove(struct ntb_client *client, struct ntb_dev *ntb)
+{
+	struct ntb_msit_ctx *nm = ntb->ctx;
+	int i;
+
+	ntb_link_disable(ntb);
+	ntb_db_set_mask(ntb, ntb_db_valid_mask(ntb));
+	ntb_msi_clear_mws(ntb);
+
+	for (i = 0; i < ntb_peer_port_count(ntb); i++)
+		kfree(nm->peers[i].msi_desc);
+
+	ntb_clear_ctx(ntb);
+	ntb_msit_remove_dbgfs(nm);
+}
+
+static struct ntb_client ntb_msit_client = {
+	.ops = {
+		.probe = ntb_msit_probe,
+		.remove = ntb_msit_remove
+	}
+};
+
+static int __init ntb_msit_init(void)
+{
+	int ret;
+
+	if (debugfs_initialized())
+		ntb_msit_dbgfs_topdir = debugfs_create_dir(KBUILD_MODNAME,
+							   NULL);
+
+	ret = ntb_register_client(&ntb_msit_client);
+	if (ret)
+		debugfs_remove_recursive(ntb_msit_dbgfs_topdir);
+
+	return ret;
+}
+module_init(ntb_msit_init);
+
+static void __exit ntb_msit_exit(void)
+{
+	ntb_unregister_client(&ntb_msit_client);
+	debugfs_remove_recursive(ntb_msit_dbgfs_topdir);
+}
+module_exit(ntb_msit_exit);
-- 
2.19.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 11/12] NTB: Add ntb_msi_test support to ntb_test
  2019-02-13 17:54 [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (9 preceding siblings ...)
  2019-02-13 17:54 ` [PATCH v2 10/12] NTB: Introduce NTB MSI Test Client Logan Gunthorpe
@ 2019-02-13 17:54 ` Logan Gunthorpe
  2019-02-13 17:54 ` [PATCH v2 12/12] NTB: Add MSI interrupt support to ntb_transport Logan Gunthorpe
  2019-02-26  9:34 ` [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Joerg Roedel
  12 siblings, 0 replies; 27+ messages in thread
From: Logan Gunthorpe @ 2019-02-13 17:54 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

When the ntb_msi_test module is available, the test code will trigger
each of the interrupts and ensure the corresponding occurrences file
gets incremented.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
---
 tools/testing/selftests/ntb/ntb_test.sh | 54 ++++++++++++++++++++++++-
 1 file changed, 52 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/ntb/ntb_test.sh b/tools/testing/selftests/ntb/ntb_test.sh
index 17ca36403d04..1a10b8f67727 100755
--- a/tools/testing/selftests/ntb/ntb_test.sh
+++ b/tools/testing/selftests/ntb/ntb_test.sh
@@ -87,10 +87,10 @@ set -e
 
 function _modprobe()
 {
-	modprobe "$@"
+	modprobe "$@" || return 1
 
 	if [[ "$REMOTE_HOST" != "" ]]; then
-		ssh "$REMOTE_HOST" modprobe "$@"
+		ssh "$REMOTE_HOST" modprobe "$@" || return 1
 	fi
 }
 
@@ -451,6 +451,30 @@ function pingpong_test()
 	echo "  Passed"
 }
 
+function msi_test()
+{
+	LOC=$1
+	REM=$2
+
+	write_file 1 $LOC/ready
+
+	echo "Running MSI interrupt tests on: $(subdirname $LOC) / $(subdirname $REM)"
+
+	CNT=$(read_file "$LOC/count")
+	for ((i = 0; i < $CNT; i++)); do
+		START=$(read_file $REM/../irq${i}_occurrences)
+		write_file $i $LOC/trigger
+		END=$(read_file $REM/../irq${i}_occurrences)
+
+		if [[ $(($END - $START)) != 1 ]]; then
+			echo "MSI did not trigger the interrupt on the remote side!" >&2
+			exit 1
+		fi
+	done
+
+	echo "  Passed"
+}
+
 function perf_test()
 {
 	USE_DMA=$1
@@ -529,6 +553,29 @@ function ntb_pingpong_tests()
 	_modprobe -r ntb_pingpong
 }
 
+function ntb_msi_tests()
+{
+	LOCAL_MSI="$DEBUGFS/ntb_msi_test/$LOCAL_DEV"
+	REMOTE_MSI="$REMOTE_HOST:$DEBUGFS/ntb_msi_test/$REMOTE_DEV"
+
+	echo "Starting ntb_msi_test tests..."
+
+	if ! _modprobe ntb_msi_test 2> /dev/null; then
+		echo "  Not doing MSI tests seeing the module is not available."
+		return
+	fi
+
+	port_test $LOCAL_MSI $REMOTE_MSI
+
+	LOCAL_PEER="$LOCAL_MSI/peer$LOCAL_PIDX"
+	REMOTE_PEER="$REMOTE_MSI/peer$REMOTE_PIDX"
+
+	msi_test $LOCAL_PEER $REMOTE_PEER
+	msi_test $REMOTE_PEER $LOCAL_PEER
+
+	_modprobe -r ntb_msi_test
+}
+
 function ntb_perf_tests()
 {
 	LOCAL_PERF="$DEBUGFS/ntb_perf/$LOCAL_DEV"
@@ -550,6 +597,7 @@ function cleanup()
 	_modprobe -r ntb_perf 2> /dev/null
 	_modprobe -r ntb_pingpong 2> /dev/null
 	_modprobe -r ntb_transport 2> /dev/null
+	_modprobe -r ntb_msi_test 2> /dev/null
 	set -e
 }
 
@@ -586,5 +634,7 @@ ntb_tool_tests
 echo
 ntb_pingpong_tests
 echo
+ntb_msi_tests
+echo
 ntb_perf_tests
 echo
-- 
2.19.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 12/12] NTB: Add MSI interrupt support to ntb_transport
  2019-02-13 17:54 [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (10 preceding siblings ...)
  2019-02-13 17:54 ` [PATCH v2 11/12] NTB: Add ntb_msi_test support to ntb_test Logan Gunthorpe
@ 2019-02-13 17:54 ` Logan Gunthorpe
  2019-02-26  9:34 ` [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Joerg Roedel
  12 siblings, 0 replies; 27+ messages in thread
From: Logan Gunthorpe @ 2019-02-13 17:54 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, Logan Gunthorpe

Introduce the module parameter 'use_msi' which, when set, uses MSI
interrupts instead of doorbells for each queue pair (QP). The
parameter is only available if NTB MSI support is configured into
the kernel. We also require there to be more than one memory window
(MW) so that an extra one is available to forward the APIC region.

To use MSIs, we request one interrupt per QP and forward the MSI address
and data to the peer using scratchpad registers (SPADs) above the MW
spads. (If there are not enough SPADs the MSI interrupt will not be used.)

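As a sketch, the resulting spad layout, derived from the code below
(MW0_SZ_HIGH is the base index of the per-MW size spads ntb_transport
already uses):

	/*
	 *   msi_spad_offset = mw_count * 2 + MW0_SZ_HIGH
	 *   spad[msi_spad_offset + 2*qp]     = MSI address offset for QP qp
	 *   spad[msi_spad_offset + 2*qp + 1] = MSI data for QP qp
	 */
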
Once registered, we simply use ntb_msi_peer_trigger() and the receiving
ISR queues up the rxc_db_work for the queue.

This addition can significantly improve performance of ntb_transport.
In a simple, untuned, apples-to-apples comparison using ntb_netdev
and iperf with switchtec hardware, I see 3.88Gb/s without MSI
interrupts and 14.1Gb/s with them, which is more than a 3x improvement.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
---
 drivers/ntb/ntb_transport.c | 169 +++++++++++++++++++++++++++++++++++-
 1 file changed, 168 insertions(+), 1 deletion(-)

diff --git a/drivers/ntb/ntb_transport.c b/drivers/ntb/ntb_transport.c
index 526b65afc16a..90e3ea67d48a 100644
--- a/drivers/ntb/ntb_transport.c
+++ b/drivers/ntb/ntb_transport.c
@@ -93,6 +93,12 @@ static bool use_dma;
 module_param(use_dma, bool, 0644);
 MODULE_PARM_DESC(use_dma, "Use DMA engine to perform large data copy");
 
+static bool use_msi;
+#ifdef CONFIG_NTB_MSI
+module_param(use_msi, bool, 0644);
+MODULE_PARM_DESC(use_msi, "Use MSI interrupts instead of doorbells");
+#endif
+
 static struct dentry *nt_debugfs_dir;
 
 /* Only two-ports NTB devices are supported */
@@ -188,6 +194,11 @@ struct ntb_transport_qp {
 	u64 tx_err_no_buf;
 	u64 tx_memcpy;
 	u64 tx_async;
+
+	bool use_msi;
+	int msi_irq;
+	struct ntb_msi_desc msi_desc;
+	struct ntb_msi_desc peer_msi_desc;
 };
 
 struct ntb_transport_mw {
@@ -221,6 +232,10 @@ struct ntb_transport_ctx {
 	u64 qp_bitmap;
 	u64 qp_bitmap_free;
 
+	bool use_msi;
+	unsigned int msi_spad_offset;
+	u64 msi_db_mask;
+
 	bool link_is_up;
 	struct delayed_work link_work;
 	struct work_struct link_cleanup;
@@ -667,6 +682,114 @@ static int ntb_transport_setup_qp_mw(struct ntb_transport_ctx *nt,
 	return 0;
 }
 
+static irqreturn_t ntb_transport_isr(int irq, void *dev)
+{
+	struct ntb_transport_qp *qp = dev;
+
+	tasklet_schedule(&qp->rxc_db_work);
+
+	return IRQ_HANDLED;
+}
+
+static void ntb_transport_setup_qp_peer_msi(struct ntb_transport_ctx *nt,
+					    unsigned int qp_num)
+{
+	struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
+	int spad = qp_num * 2 + nt->msi_spad_offset;
+
+	if (!nt->use_msi)
+		return;
+
+	if (spad >= ntb_spad_count(nt->ndev))
+		return;
+
+	qp->peer_msi_desc.addr_offset =
+		ntb_peer_spad_read(qp->ndev, PIDX, spad);
+	qp->peer_msi_desc.data =
+		ntb_peer_spad_read(qp->ndev, PIDX, spad + 1);
+
+	dev_dbg(&qp->ndev->pdev->dev, "QP%d Peer MSI addr=%x data=%x\n",
+		qp_num, qp->peer_msi_desc.addr_offset, qp->peer_msi_desc.data);
+
+	if (qp->peer_msi_desc.addr_offset) {
+		qp->use_msi = true;
+		dev_info(&qp->ndev->pdev->dev,
+			 "Using MSI interrupts for QP%d\n", qp_num);
+	}
+}
+
+static void ntb_transport_setup_qp_msi(struct ntb_transport_ctx *nt,
+				       unsigned int qp_num)
+{
+	struct ntb_transport_qp *qp = &nt->qp_vec[qp_num];
+	int spad = qp_num * 2 + nt->msi_spad_offset;
+	int rc;
+
+	if (!nt->use_msi)
+		return;
+
+	if (spad >= ntb_spad_count(nt->ndev)) {
+		dev_warn_once(&qp->ndev->pdev->dev,
+			      "Not enough SPADS to use MSI interrupts\n");
+		return;
+	}
+
+	ntb_spad_write(qp->ndev, spad, 0);
+	ntb_spad_write(qp->ndev, spad + 1, 0);
+
+	if (!qp->msi_irq) {
+		qp->msi_irq = ntbm_msi_request_irq(qp->ndev, ntb_transport_isr,
+						   KBUILD_MODNAME, qp,
+						   &qp->msi_desc);
+		if (qp->msi_irq < 0) {
+			dev_warn(&qp->ndev->pdev->dev,
+				 "Unable to allocate MSI interrupt for qp%d\n",
+				 qp_num);
+			return;
+		}
+	}
+
+	rc = ntb_spad_write(qp->ndev, spad, qp->msi_desc.addr_offset);
+	if (rc)
+		goto err_free_interrupt;
+
+	rc = ntb_spad_write(qp->ndev, spad + 1, qp->msi_desc.data);
+	if (rc)
+		goto err_free_interrupt;
+
+	dev_dbg(&qp->ndev->pdev->dev, "QP%d MSI %d addr=%x data=%x\n",
+		qp_num, qp->msi_irq, qp->msi_desc.addr_offset,
+		qp->msi_desc.data);
+
+	return;
+
+err_free_interrupt:
+	devm_free_irq(&nt->ndev->dev, qp->msi_irq, qp);
+}
+
+static void ntb_transport_msi_peer_desc_changed(struct ntb_transport_ctx *nt)
+{
+	int i;
+
+	dev_dbg(&nt->ndev->pdev->dev, "Peer MSI descriptors changed");
+
+	for (i = 0; i < nt->qp_count; i++)
+		ntb_transport_setup_qp_peer_msi(nt, i);
+}
+
+static void ntb_transport_msi_desc_changed(void *data)
+{
+	struct ntb_transport_ctx *nt = data;
+	int i;
+
+	dev_dbg(&nt->ndev->pdev->dev, "MSI descriptors changed");
+
+	for (i = 0; i < nt->qp_count; i++)
+		ntb_transport_setup_qp_msi(nt, i);
+
+	ntb_peer_db_set(nt->ndev, nt->msi_db_mask);
+}
+
 static void ntb_free_mw(struct ntb_transport_ctx *nt, int num_mw)
 {
 	struct ntb_transport_mw *mw = &nt->mw_vec[num_mw];
@@ -902,6 +1025,20 @@ static void ntb_transport_link_work(struct work_struct *work)
 	int rc = 0, i, spad;
 
 	/* send the local info, in the opposite order of the way we read it */
+
+	if (nt->use_msi) {
+		rc = ntb_msi_setup_mws(ndev);
+		if (rc) {
+			dev_warn(&pdev->dev,
+				 "Failed to register MSI memory window: %d\n",
+				 rc);
+			nt->use_msi = false;
+		}
+	}
+
+	for (i = 0; i < nt->qp_count; i++)
+		ntb_transport_setup_qp_msi(nt, i);
+
 	for (i = 0; i < nt->mw_count; i++) {
 		size = nt->mw_vec[i].phys_size;
 
@@ -959,6 +1096,7 @@ static void ntb_transport_link_work(struct work_struct *work)
 		struct ntb_transport_qp *qp = &nt->qp_vec[i];
 
 		ntb_transport_setup_qp_mw(nt, i);
+		ntb_transport_setup_qp_peer_msi(nt, i);
 
 		if (qp->client_ready)
 			schedule_delayed_work(&qp->link_work, 0);
@@ -1132,6 +1270,19 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
 		return -ENOMEM;
 
 	nt->ndev = ndev;
+
+	/*
+	 * If we are using MSI, and have at least one extra memory window,
+	 * we will reserve the last MW for the MSI window.
+	 */
+	if (use_msi && mw_count > 1) {
+		rc = ntb_msi_init(ndev, ntb_transport_msi_desc_changed);
+		if (!rc) {
+			mw_count -= 1;
+			nt->use_msi = true;
+		}
+	}
+
 	spad_count = ntb_spad_count(ndev);
 
 	/* Limit the MW's based on the availability of scratchpads */
@@ -1145,6 +1296,8 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
 	max_mw_count_for_spads = (spad_count - MW0_SZ_HIGH) / 2;
 	nt->mw_count = min(mw_count, max_mw_count_for_spads);
 
+	nt->msi_spad_offset = nt->mw_count * 2 + MW0_SZ_HIGH;
+
 	nt->mw_vec = kcalloc_node(mw_count, sizeof(*nt->mw_vec),
 				  GFP_KERNEL, node);
 	if (!nt->mw_vec) {
@@ -1175,6 +1328,12 @@ static int ntb_transport_probe(struct ntb_client *self, struct ntb_dev *ndev)
 	qp_bitmap = ntb_db_valid_mask(ndev);
 
 	qp_count = ilog2(qp_bitmap);
+	if (nt->use_msi) {
+		qp_count -= 1;
+		nt->msi_db_mask = 1 << qp_count;
+		ntb_db_clear_mask(ndev, nt->msi_db_mask);
+	}
+
 	if (max_num_clients && max_num_clients < qp_count)
 		qp_count = max_num_clients;
 	else if (nt->mw_count < qp_count)
@@ -1598,7 +1757,10 @@ static void ntb_tx_copy_callback(void *data,
 
 	iowrite32(entry->flags | DESC_DONE_FLAG, &hdr->flags);
 
-	ntb_peer_db_set(qp->ndev, BIT_ULL(qp->qp_num));
+	if (qp->use_msi)
+		ntb_msi_peer_trigger(qp->ndev, PIDX, &qp->peer_msi_desc);
+	else
+		ntb_peer_db_set(qp->ndev, BIT_ULL(qp->qp_num));
 
 	/* The entry length can only be zero if the packet is intended to be a
 	 * "link down" or similar.  Since no payload is being sent in these
@@ -2265,6 +2427,11 @@ static void ntb_transport_doorbell_callback(void *data, int vector)
 	u64 db_bits;
 	unsigned int qp_num;
 
+	if (ntb_db_read(nt->ndev) & nt->msi_db_mask) {
+		ntb_transport_msi_peer_desc_changed(nt);
+		ntb_db_clear(nt->ndev, nt->msi_db_mask);
+	}
+
 	db_bits = (nt->qp_bitmap & ~nt->qp_bitmap_free &
 		   ntb_db_vector_mask(nt->ndev, vector));
 
-- 
2.19.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 01/12] iommu/vt-d: Implement dma_[un]map_resource()
  2019-02-13 17:54 ` [PATCH v2 01/12] iommu/vt-d: Implement dma_[un]map_resource() Logan Gunthorpe
@ 2019-02-13 17:57   ` Logan Gunthorpe
  0 siblings, 0 replies; 27+ messages in thread
From: Logan Gunthorpe @ 2019-02-13 17:57 UTC (permalink / raw)
  To: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel
  Cc: Allen Hubbe, Dave Jiang, Serge Semin, Eric Pilmore, David Woodhouse

Oops, sorry. Please ignore the first two patches in this series. They
have already been merged independently.

Logan



On 2019-02-13 10:54 a.m., Logan Gunthorpe wrote:
> Currently the Intel IOMMU uses the default dma_[un]map_resource()
> implementations, which do nothing and simply return the physical address
> unmodified.
> 
> However, this doesn't create the IOVA entries necessary for addresses
> mapped this way to work when the IOMMU is enabled. Thus, when the
> IOMMU is enabled, drivers relying on dma_map_resource() will trigger
> DMAR errors. We see this when running ntb_transport with the IOMMU
> enabled, DMA, and switchtec hardware.
> 
> The implementation for intel_map_resource() is nearly identical to
> intel_map_page(); we just have to re-create __intel_map_single().
> dma_unmap_resource() uses intel_unmap_page() directly as the
> functions are identical.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Cc: David Woodhouse <dwmw2@infradead.org>
> Cc: Joerg Roedel <joro@8bytes.org>
> ---
>  drivers/iommu/intel-iommu.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 78188bf7e90d..ad737e16575b 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3649,11 +3649,9 @@ static int iommu_no_mapping(struct device *dev)
>  	return 0;
>  }
>  
> -static dma_addr_t __intel_map_page(struct device *dev, struct page *page,
> -				   unsigned long offset, size_t size, int dir,
> -				   u64 dma_mask)
> +static dma_addr_t __intel_map_single(struct device *dev, phys_addr_t paddr,
> +				     size_t size, int dir, u64 dma_mask)
>  {
> -	phys_addr_t paddr = page_to_phys(page) + offset;
>  	struct dmar_domain *domain;
>  	phys_addr_t start_paddr;
>  	unsigned long iova_pfn;
> @@ -3715,7 +3713,15 @@ static dma_addr_t intel_map_page(struct device *dev, struct page *page,
>  				 enum dma_data_direction dir,
>  				 unsigned long attrs)
>  {
> -	return __intel_map_page(dev, page, offset, size, dir, *dev->dma_mask);
> +	return __intel_map_single(dev, page_to_phys(page) + offset, size,
> +				  dir, *dev->dma_mask);
> +}
> +
> +static dma_addr_t intel_map_resource(struct device *dev, phys_addr_t phys_addr,
> +				     size_t size, enum dma_data_direction dir,
> +				     unsigned long attrs)
> +{
> +	return __intel_map_single(dev, phys_addr, size, dir, *dev->dma_mask);
>  }
>  
>  static void intel_unmap(struct device *dev, dma_addr_t dev_addr, size_t size)
> @@ -3806,8 +3812,9 @@ static void *intel_alloc_coherent(struct device *dev, size_t size,
>  		return NULL;
>  	memset(page_address(page), 0, size);
>  
> -	*dma_handle = __intel_map_page(dev, page, 0, size, DMA_BIDIRECTIONAL,
> -				       dev->coherent_dma_mask);
> +	*dma_handle = __intel_map_single(dev, page_to_phys(page), size,
> +					 DMA_BIDIRECTIONAL,
> +					 dev->coherent_dma_mask);
>  	if (*dma_handle != DMA_MAPPING_ERROR)
>  		return page_address(page);
>  	if (!dma_release_from_contiguous(dev, page, size >> PAGE_SHIFT))
> @@ -3924,6 +3931,8 @@ static const struct dma_map_ops intel_dma_ops = {
>  	.unmap_sg = intel_unmap_sg,
>  	.map_page = intel_map_page,
>  	.unmap_page = intel_unmap_page,
> +	.map_resource = intel_map_resource,
> +	.unmap_resource = intel_unmap_page,
>  	.dma_supported = dma_direct_supported,
>  };
>  
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 00/12] Support using MSI interrupts in ntb_transport
  2019-02-13 17:54 [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Logan Gunthorpe
                   ` (11 preceding siblings ...)
  2019-02-13 17:54 ` [PATCH v2 12/12] NTB: Add MSI interrupt support to ntb_transport Logan Gunthorpe
@ 2019-02-26  9:34 ` Joerg Roedel
  2019-02-26 16:18   ` Logan Gunthorpe
  12 siblings, 1 reply; 27+ messages in thread
From: Joerg Roedel @ 2019-02-26  9:34 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Allen Hubbe, Dave Jiang, Serge Semin,
	Eric Pilmore

On Wed, Feb 13, 2019 at 10:54:42AM -0700, Logan Gunthorpe wrote:
>   iommu/vt-d: Add helper to set an IRTE to verify only the bus number
>   iommu/vt-d: Allow interrupts from the entire bus for aliased devices

Applied these two to the iommu tree, thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 00/12] Support using MSI interrupts in ntb_transport
  2019-02-26  9:34 ` [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Joerg Roedel
@ 2019-02-26 16:18   ` Logan Gunthorpe
  0 siblings, 0 replies; 27+ messages in thread
From: Logan Gunthorpe @ 2019-02-26 16:18 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Allen Hubbe, Dave Jiang, Serge Semin,
	Eric Pilmore



On 2019-02-26 2:34 a.m., Joerg Roedel wrote:
> On Wed, Feb 13, 2019 at 10:54:42AM -0700, Logan Gunthorpe wrote:
>>   iommu/vt-d: Add helper to set an IRTE to verify only the bus number
>>   iommu/vt-d: Allow interrupts from the entire bus for aliased devices
> 
> Applied these two to the iommu tree, thanks.
> 

Thanks!

Logan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 07/12] NTB: Introduce functions to calculate multi-port resource index
  2019-02-13 17:54 ` [PATCH v2 07/12] NTB: Introduce functions to calculate multi-port resource index Logan Gunthorpe
@ 2019-03-06  1:24   ` Serge Semin
  2019-03-06 19:11     ` Logan Gunthorpe
  0 siblings, 1 reply; 27+ messages in thread
From: Serge Semin @ 2019-03-06  1:24 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang,
	Eric Pilmore

On Wed, Feb 13, 2019 at 10:54:49AM -0700, Logan Gunthorpe wrote:

Hi

> When using multiple ports, each port uses resources (dbs, msgs, mws, etc)
> on every other port. Creating a mapping for these resources such that
> each port has a corresponding resource on every other port is a bit
> tricky.
> 
> Introduce the ntb_peer_resource_idx() function for this purpose.
> It returns the peer resource number that will correspond with the
> local peer index on the remote peer.
> 
> Also, introduce ntb_peer_highest_mw_idx() which will use
> ntb_peer_resource_idx() but return the MW index starting with the
> highest index and working down.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Cc: Jon Mason <jdmason@kudzu.us>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Allen Hubbe <allenbh@gmail.com>
> ---
>  include/linux/ntb.h | 70 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 70 insertions(+)
> 
> diff --git a/include/linux/ntb.h b/include/linux/ntb.h
> index 181d16601dd9..f5c69d853489 100644
> --- a/include/linux/ntb.h
> +++ b/include/linux/ntb.h
> @@ -1502,4 +1502,74 @@ static inline int ntb_peer_msg_write(struct ntb_dev *ntb, int pidx, int midx,
>  	return ntb->ops->peer_msg_write(ntb, pidx, midx, msg);
>  }
>  
> +/**
> + * ntb_peer_resource_idx() - get a resource index for a given peer idx
> + * @ntb:	NTB device context.
> + * @pidx:	Peer port index.
> + *
> + * When constructing a graph of peers, each remote peer must use a different
> + * resource index (mw, doorbell, etc) to communicate with each other
> + * peer.
> + *
> + * In a two peer system, this function should always return 0 such that
> + * resource 0 points to the remote peer on both ports.
> + *
> + * In a 5 peer system, this function will return the following matrix
> + *
> + * pidx \ port    0    1    2    3    4
> + * 0              0    0    1    2    3
> + * 1              0    1    2    3    4
> + * 2              0    1    2    3    4
> + * 3              0    1    2    3    4
> + *

This table is too simplified to represent the generic case of a port-index
mapping table. In particular, the IDT PCIe switch has its ports numbered
with unevenly spaced integers like 0 2 4 6 8 12 16 20 or 0 8 16, and so on.
Moreover, some of the ports might be disabled or may have their NTB functions
deactivated, in which case these ports shouldn't be considered by the NTB
subsystem at all. Basically we may have any increasing subset of those port
numbers depending on the current IDT PCIe-switch port setup.

What I am trying to say is that if in real life the NTB MSI library
or some other library used that matrix to calculate the MW index, then most
likely it would either consume nearly all the MWs (leaving holes in the MW
space) or just run out of them, since depending on the configuration there
might be from 0 to 24 MWs enabled in an IDT NTB function. In order to solve
the problem we either need the ntb_peer_resource_idx() method fixed so it
would use the pidx instead of the pure port number, or we should properly
alter the current NTB API port-index table implementation, so that
ntb_pingpong, the NTB MSI library and others wouldn't need to have some
other table to distribute the NTB resources.

A short preamble on why we introduced the port-index NTB API. Since each
NTB MW and NTB message register can be set up to be related to any NTB
device port, we needed to have the port attribute accepted by NTB API
functions. But it turned out the IDT PCIe switch ports are unevenly
numbered, which may cause difficulties in using their numbers for bulk
configurations. So in order to simplify client code utilizing the
multi-port NTB API (for example, for simpler looping over the device
ports), we decided to create the port-index table in the NTB API
internals. The table is defined by a set of ntb_port_*()/ntb_peer_port_*()
functions. So now all the NTB API methods accept a port index instead of
an irregular port number.

Here is the port-index table currently defined for IDT 89HPES32NT8 PCIe-switch
with all 8 ports having NTB functions enabled:

peer port \ local port   0  2  4  6  8  12  16  20
        0                x  0  0  0  0   0   0   0
        2                0  x  1  1  1   1   1   1
        4                1  1  x  2  2   2   2   2
        6                2  2  2  x  3   3   3   3
        8                3  3  3  3  x   4   4   4
       12                4  4  4  4  4   x   5   5
       16                5  5  5  5  5   5   x   6
       20                6  6  6  6  6   6   6   x

As you can see, currently each NTB device side has its own port-index
mapping, so each port can access the other ports, but not itself.
I implemented it this way intentionally for two reasons:
1) I don't think anyone would ever need to have MW or Message registers
set up pointing to the local port.
2) The IDT PCIe switch will report an "Unsupported TLP" error in case there
is an attempt to touch a MW frame initialized with a port number matching
the local port number. (Internally the IDT PCIe-switch works with port
partitions, but I didn't want to complicate the API description, so we just
utilize the port-number naming, which is essentially the same as partitions
for NTB.)

The biggest drawback of the selected table representation is that the
port-index mapping is unique for each local port, which makes the
indexes unsuitable for distributing shared resources like outbound
MWs, Doorbells and Scratchpads (should these ever be implemented for
multi-port NTB devices). For instance, each port can access a port with
index 0, which for local port 0 is port 2, while for the rest of the
local ports it's port 0, and so on.

This drawback already caused complications in the ntb_pingpong driver,
where after your patchset's acceptance we use port numbers to distribute
the doorbell bits. The same problem is popping up here, but this time
it can't be solved by using port numbers due to the limited number
of MWs available on IDT.

As I already said, there can be two ways to get rid of the complication.
One of them is to redefine the port-index table so as to map the port
numbers to global indexes instead of the local ones, like this:

peer port \ local port   0  2  4  6  8  12  16  20
        0                0  0  0  0  0   0   0   0
        2                1  1  1  1  1   1   1   1
        4                2  2  2  2  2   2   2   2
        6                3  3  3  3  3   3   3   3
        8                4  4  4  4  4   4   4   4
       12                5  5  5  5  5   5   5   5
       16                6  6  6  6  6   6   6   6
       20                7  7  7  7  7   7   7   7

By doing so we'll be able to use the port indexes to distribute the
shared resources like MWs, Doorbells and Scratchpads, but we may hit
the "Unsupported TLP" problem I described before for IDT devices if
NTB API client code tries to define a MW pointing to the local port on
IDT NTB hardware. I don't know whether the same problem exists for
Switchtec multi-port NTB devices, but this approach shall definitely
lead to a cleaner and easier NTB API as well as simplify the IDT NTB hw
driver. We also won't need any additional tables like
ntb_peer_resource_idx().

(the second solution is described further in the next comment)
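
To illustrate the first solution: with globally indexed ports, the
resource slot used to reach a given peer would simply be that peer's
global index, which is the same on every side. A rough sketch, assuming
a hypothetical ntb_peer_port_global_idx() accessor that returns the row
value from the table above:

	static inline int ntb_peer_resource_idx(struct ntb_dev *ntb, int pidx)
	{
		/* every side agrees on the slot; no extra mapping needed */
		return ntb_peer_port_global_idx(ntb, pidx);
	}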

> + * For example, if this function is used to program a peer's memory
> + * windows, port 0 will program MW 0 on all its peers to point to itself.
> + * Port 1 will program MW 0 in port 0 to point to itself and MW 1 on all
> + * other ports, etc.
> + *
> + * For the legacy two host case, ntb_port_number() and ntb_peer_port_number()
> + * both return zero and therefore this function will always return zero.
> + * So MW 0 on each host would be programmed to point to the other host.
> + *
> + * Return: the resource index to use for that peer.
> + */
> +static inline int ntb_peer_resource_idx(struct ntb_dev *ntb, int pidx)
> +{
> +	int local_port, peer_port;
> +
> +	if (pidx >= ntb_peer_port_count(ntb))
> +		return -EINVAL;
> +
> +	local_port = ntb_port_number(ntb);
> +	peer_port = ntb_peer_port_number(ntb, pidx);
> +
> +	if (peer_port < local_port)
> +		return local_port - 1;
> +	else
> +		return local_port;
> +}
> +

Instead of redefining the port-index table we can just fix the
ntb_peer_resource_idx() method so it would return a global port index
instead of some number based on the port number. It can be done with
just the following modification:

+     if (peer_port <= local_port)
+             return pidx;
+     else
+             return pidx + 1;


Personally I'd prefer the first solution even though it may lead to the
"Unsupported TLP" errors and require greater code changes. Here is why:
1) the error might be IDT-specific, so we shouldn't limit the API due to
one particular hardware peculiarity;
2) a port-index table implementation with global indexes shall simplify
the IDT NTB hw driver and provide a cleaner NTB API with simpler shared
resource utilization code.

The final decision is up to the NTB subsystem maintainers. If they agree
with solution #1 I'll send a corresponding patchset this week, so you can
alter this patchset to be based on it.

> +/**
> + * ntb_peer_highest_mw_idx() - get a memory window index for a given peer idx
> + *	using the highest index memory windows first
> + *
> + * @ntb:	NTB device context.
> + * @pidx:	Peer port index.
> + *
> + * Like ntb_peer_resource_idx(), except it returns indexes starting with
> + * last memory window index.
> + *
> + * Return: the resource index to use for that peer.
> + */
> +static inline int ntb_peer_highest_mw_idx(struct ntb_dev *ntb, int pidx)
> +{
> +	int ret;
> +
> +	ret = ntb_peer_resource_idx(ntb, pidx);
> +	if (ret < 0)
> +		return ret;
> +
> +	return ntb_mw_count(ntb, pidx) - ret - 1;
> +}
> +
>  #endif
> -- 
> 2.19.0
> 
> -- 
> You received this message because you are subscribed to the Google Groups "linux-ntb" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to linux-ntb+unsubscribe@googlegroups.com.
> To post to this group, send email to linux-ntb@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/linux-ntb/20190213175454.7506-8-logang%40deltatee.com.
> For more options, visit https://groups.google.com/d/optout.

-Sergey

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 07/12] NTB: Introduce functions to calculate multi-port resource index
  2019-03-06  1:24   ` Serge Semin
@ 2019-03-06 19:11     ` Logan Gunthorpe
  2019-03-06 22:45       ` Serge Semin
  0 siblings, 1 reply; 27+ messages in thread
From: Logan Gunthorpe @ 2019-03-06 19:11 UTC (permalink / raw)
  To: Serge Semin, linux-kernel, linux-ntb, linux-pci, iommu,
	linux-kselftest, Jon Mason, Bjorn Helgaas, Joerg Roedel,
	Allen Hubbe, Dave Jiang, Eric Pilmore



On 2019-03-05 6:24 p.m., Serge Semin wrote:
>> + * In a 5 peer system, this function will return the following matrix
>> + *
>> + * pidx \ port    0    1    2    3    4
>> + * 0              0    0    1    2    3
>> + * 1              0    1    2    3    4
>> + * 2              0    1    2    3    4
>> + * 3              0    1    2    3    4
>> + *

Oh, first, oops: looks like I copied this down wrong anyway; the code
was what I had intended, but the documented example should have been:

pidx \ local_port     0    1    2    3    4
 0                    0    0    1    2    3
 1                    0    1    1    2    3
 2                    0    1    2    2    3
 3                    0    1    2    3    3

And this is definitely the correct table we are aiming for.
ntb_peer_resource_idx() is supposed to return the result of
ntb_peer_port_idx(ntb, local_port) when run on the peer specified by pidx.

Note: this table also makes sense because it only uses 4 resources for 5
ports, which is the best-case scenario. (In other words, to communicate
between N ports, N-1 resources are required on each peer.)

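To double-check the formula, here is a tiny stand-alone user-space
sketch (not part of the patch) that reproduces the corrected table
from the same arithmetic as ntb_peer_resource_idx(), assuming gap-free
port numbers:

	#include <stdio.h>

	/* same arithmetic as ntb_peer_resource_idx() */
	static int resource_idx(int local_port, int peer_port)
	{
		return peer_port < local_port ? local_port - 1 : local_port;
	}

	int main(void)
	{
		int pidx, local, peer_port;

		for (pidx = 0; pidx < 4; pidx++) {
			for (local = 0; local < 5; local++) {
				/* the peer index skips the local port itself */
				peer_port = pidx < local ? pidx : pidx + 1;
				printf("%5d", resource_idx(local, peer_port));
			}
			printf("\n");
		}
		return 0;
	}
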
> This table is too simplified to represent the generic case of a port-index
> mapping table. In particular, the IDT PCIe switch has its ports numbered
> with unevenly spaced integers like 0 2 4 6 8 12 16 20 or 0 8 16, and so on.
> Moreover, some of the ports might be disabled or may have their NTB functions
> deactivated, in which case these ports shouldn't be considered by the NTB
> subsystem at all. Basically we may have any increasing subset of those port
> numbers depending on the current IDT PCIe-switch port setup.

Yes, I did not consider situations where there would be gaps in the
"port number" space. It wasn't at all clear from the code that this was
possible. Switchtec hardware could be configured for such an
arrangement, but I don't know why anyone would do that as it just
needlessly complicates everything.

As you point out, with a gap, we end up with something that is wrong:

pidx \ port     0    1    3    4    5
 0              0    0    2    3    4
 1              0    1    2    3    4
 2              0    1    3    3    4
 3              0    1    3    4    4

Here, the relationship between ntb_peer_resource_idx() and
ntb_peer_port_idx() is not maintained and it seems to prescribe 5
resources for 5 ports. If there were more gaps it would be even more wrong.

>> +static inline int ntb_peer_resource_idx(struct ntb_dev *ntb, int pidx)
>> +{
>> +	int local_port, peer_port;
>> +
>> +	if (pidx >= ntb_peer_port_count(ntb))
>> +		return -EINVAL;
>> +
>> +	local_port = ntb_port_number(ntb);
>> +	peer_port = ntb_peer_port_number(ntb, pidx);
>> +
>> +	if (peer_port < local_port)
>> +		return local_port - 1;
>> +	else
>> +		return local_port;
>> +}
>> +
> 
> Instead of redefining the port-index table we can just fix the
> ntb_peer_resource_idx() method so it would return a global port index
> instead of some number based on the port number. It can be done with
> just the following modification:
> 
> +     if (peer_port <= local_port)
> +             return pidx;
> +     else
> +             return pidx + 1;
> 

This creates a table that looks like:

pidx \ port     0    1    2    3    4
 0              1    0    0    0    0
 1              2    2    1    1    1
 2              3    3    3    2    2
 3              4    4    4    4    3

Which is not correct. In fact, it seems to require 5 resources for 5
ports. This appears to be what is done in the current ntb_perf. I think
I figured it out several months ago, but it's way too messy and hard to
understand, and I don't want to spend the time to figure it out again.

IMO, in order to support gaps, we'd need to, on some layer, create an
un-gapped numbering scheme for the ports. I think the easiest thing is
to just have Logical and Physical port numbers; so we would have
something like:

Physical Port Number: 0  2  4  6  8  12  16  20
Logical Port Number:  0  1  2  3  4  5   6   7
Peer Index (Port 0):  x  0  1  2  3  4   5   6
Port Index (Port 8):  0  1  2  3  x  4   5   6
(etc)

Where the Physical Port Number is whatever the hardware uses and the
logical port number is a numbering scheme starting with zero with no
gaps. Then the port indexes are still as we currently have them. If we
say that the port numbers we have now are the Logical Port Number, then
ntb_peer_resource_idx() is correct.

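A rough sketch of how a driver-side helper could derive the Logical
Port Number (the name ntb_logical_port_number() is hypothetical here,
just to show the shape; real drivers may simply keep a lookup table):

	/* Count how many enabled ports have a smaller physical number;
	 * since the peer list excludes the local port, that count is
	 * the local port's gap-free logical number. */
	static int ntb_logical_port_number(struct ntb_dev *ntb)
	{
		int physical = ntb_port_number(ntb);
		int i, logical = 0;

		for (i = 0; i < ntb_peer_port_count(ntb); i++)
			if (ntb_peer_port_number(ntb, i) < physical)
				logical++;

		return logical;
	}
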
I would strongly argue that the clients don't need to know anything
about the Physical Port Number and these should be handled strictly
inside the drivers. If multiple drivers need to do something similar to
map the logical to physical port numbers then we should introduce helper
functions to allow them to do so. If the Physical Numbers are not
contained in the driver than the API would need to be expanded to expose
which numbers are actually used to avoid needing to constantly loop
through all the indexes to find this out.

On a similar vein, I'd  suggest that most clients shouldn't even really
need to do anything with the Logical Port Number and should deal largely
with Port Indexes. Ideally, Logical Port Numbers should only be used by
helper code in the common layer to help assign resources used by the
clients (like ntb_peer_resource_idx()).

This world view isn't far off from what we have now, though you *may*
need to adjust your IDT driver and we will have to eventually clean up
the existing test clients to use the new helper functions.

> Personally I'd prefer the first solution even though it may lead to the
> "Unsupported TLP" errors and cause a greater code changes. Here is why:
> 1) the error might be IDT-specific, so we shouldn't limit the API due to
> one particular hardware peculiarity,
> 2) port-index table with global indexes implementation shall simplify the IDT
> NTB hw driver and provide a cleaner NTB API with simpler shared resources
> utilization code.

> The final decision is after the NTB subsystem maintainers. If they agree with
> solution #1 I'll send a corresponding patchset on this week, so you can
> alter this patchset being based on it.

I think what we have right now is close enough and we just have to clean
up the code and fix things. I don't think we need to do another big
change to the semantics. I *certainly* don't want to risk breaking
everything again to do it.

Logan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 09/12] NTB: Introduce MSI library
  2019-02-13 17:54 ` [PATCH v2 09/12] NTB: Introduce MSI library Logan Gunthorpe
@ 2019-03-06 20:26   ` Serge Semin
  2019-03-06 21:35     ` Logan Gunthorpe
  0 siblings, 1 reply; 27+ messages in thread
From: Serge Semin @ 2019-03-06 20:26 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang,
	Eric Pilmore

On Wed, Feb 13, 2019 at 10:54:51AM -0700, Logan Gunthorpe wrote:

Hi

> The NTB MSI library allows passing MSI interrupts across a memory
> window. This offers similar functionality to doorbells or messages,
> except it will often have much better latency and the client can
> potentially use significantly more remote interrupts than typical
> hardware provides for doorbells. (This can be important in
> high-multiport setups.)
> 
> The library utilizes one memory window per peer and uses the highest
> index memory windows. Before any ntb_msi function may be used, the user
> must call ntb_msi_init(). It may then setup and tear down the memory
> windows when the link state changes using ntb_msi_setup_mws() and
> ntb_msi_clear_mws().
> 
> The peer which receives the interrupt must call ntbm_msi_request_irq()
> to assign the interrupt handler (this function is functionally
> similar to devm_request_irq()) and the returned descriptor must be
> transferred to the peer which can use it to trigger the interrupt.
> The triggering peer, once having received the descriptor, can
> trigger the interrupt by calling ntb_msi_peer_trigger().
> 

The library is very useful, thanks for sharing it with us.

Here are my two general concerns regarding the implementation.
(More specific comments are further in the letter.)

First of all, it might be unsafe to have some resources consumed by the NTB
MSI library (or some other library) without a simple way to warn NTB client
drivers about their attempts to access those resources, since that might
lead to random errors. When I thought about implementing a transport library
based on the Message/Spad+Doorbell registers, I had in mind creating an
internal bit-field array with resource busy-flags. If, for instance, some
message or scratchpad register is occupied by a library (MSI, transport or
something else), then it would be impossible to access these resources
directly through the NTB API methods. So an NTB client driver would get an
error on an attempt to write/read data to/from a busy message or scratchpad
register, or on an attempt to set an occupied doorbell bit. The same thing
can be done for memory windows.

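Something along these lines, as a rough sketch of the idea (all the
names and the NTB_MAX_* bounds below are made up for illustration):

	struct ntb_resource_map {
		DECLARE_BITMAP(spad_busy, NTB_MAX_SPADS);
		DECLARE_BITMAP(msg_busy, NTB_MAX_MSGS);
		DECLARE_BITMAP(mw_busy, NTB_MAX_MWS);
	};

	/* a library would reserve a scratchpad before using it */
	static int ntb_spad_reserve(struct ntb_dev *ntb, int sidx)
	{
		if (test_and_set_bit(sidx, ntb->res_map.spad_busy))
			return -EBUSY;
		return 0;
	}

	/* ntb_spad_write() and friends would then return -EBUSY for
	 * registers reserved by another owner */
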
My second, tiny concern is about documentation. Since there is a special
file for all the NTB-related docs, it would be good to have some
description of the NTB MSI library there as well:
Documentation/ntb.txt

> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Cc: Jon Mason <jdmason@kudzu.us>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Allen Hubbe <allenbh@gmail.com>
> ---
>  drivers/ntb/Kconfig  |  11 ++
>  drivers/ntb/Makefile |   3 +-
>  drivers/ntb/msi.c    | 415 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/ntb.h  |  73 ++++++++
>  4 files changed, 501 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/ntb/msi.c
> 
> diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
> index 95944e52fa36..5760764052be 100644
> --- a/drivers/ntb/Kconfig
> +++ b/drivers/ntb/Kconfig
> @@ -12,6 +12,17 @@ menuconfig NTB
>  
>  if NTB
>  
> +config NTB_MSI
> +	bool "MSI Interrupt Support"
> +	depends on PCI_MSI
> +	help
> +	 Support using MSI interrupt forwarding instead of (or in addition to)
> +	 hardware doorbells. MSI interrupts typically offer lower latency
> +	 than doorbells and more MSI interrupts can be made available to
> +	 clients. However this requires an extra memory window and support
> +	 in the hardware driver for creating the MSI interrupts.
> +
> +	 If unsure, say N.
>  source "drivers/ntb/hw/Kconfig"
>  
>  source "drivers/ntb/test/Kconfig"
> diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
> index 537226f8e78d..cc27ad2ef150 100644
> --- a/drivers/ntb/Makefile
> +++ b/drivers/ntb/Makefile
> @@ -1,4 +1,5 @@
>  obj-$(CONFIG_NTB) += ntb.o hw/ test/
>  obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
>  
> -ntb-y := core.o
> +ntb-y			:= core.o
> +ntb-$(CONFIG_NTB_MSI)	+= msi.o
> diff --git a/drivers/ntb/msi.c b/drivers/ntb/msi.c
> new file mode 100644
> index 000000000000..5d4bd7a63924
> --- /dev/null
> +++ b/drivers/ntb/msi.c
> @@ -0,0 +1,415 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> +
> +#include <linux/irq.h>
> +#include <linux/module.h>
> +#include <linux/ntb.h>
> +#include <linux/msi.h>
> +#include <linux/pci.h>
> +
> +MODULE_LICENSE("Dual BSD/GPL");
> +MODULE_VERSION("0.1");
> +MODULE_AUTHOR("Logan Gunthorpe <logang@deltatee.com>");
> +MODULE_DESCRIPTION("NTB MSI Interrupt Library");
> +
> +struct ntb_msi {
> +	u64 base_addr;
> +	u64 end_addr;
> +
> +	void (*desc_changed)(void *ctx);
> +
> +	u32 *peer_mws[];

Shouldn't we use the __iomem attribute here since later the devm_ioremap() is
used to map MWs at these pointers?

> +};
> +
> +/**
> + * ntb_msi_init() - Initialize the MSI context
> + * @ntb:	NTB device context
> + *
> + * This function must be called before any other ntb_msi function.
> + * It initializes the context for MSI operations and maps
> + * the peer memory windows.
> + *
> + * This function reserves the last N outbound memory windows (where N
> + * is the number of peers).
> + *
> + * Return: Zero on success, otherwise a negative error number.
> + */
> +int ntb_msi_init(struct ntb_dev *ntb,
> +		 void (*desc_changed)(void *ctx))
> +{
> +	phys_addr_t mw_phys_addr;
> +	resource_size_t mw_size;
> +	size_t struct_size;
> +	int peer_widx;
> +	int peers;
> +	int ret;
> +	int i;
> +
> +	peers = ntb_peer_port_count(ntb);
> +	if (peers <= 0)
> +		return -EINVAL;
> +
> +	struct_size = sizeof(*ntb->msi) + sizeof(*ntb->msi->peer_mws) * peers;
> +
> +	ntb->msi = devm_kzalloc(&ntb->dev, struct_size, GFP_KERNEL);
> +	if (!ntb->msi)
> +		return -ENOMEM;
> +
> +	ntb->msi->desc_changed = desc_changed;
> +
> +	for (i = 0; i < peers; i++) {
> +		peer_widx = ntb_peer_mw_count(ntb) - 1 - i;
> +
> +		ret = ntb_peer_mw_get_addr(ntb, peer_widx, &mw_phys_addr,
> +					   &mw_size);
> +		if (ret)
> +			goto unroll;
> +
> +		ntb->msi->peer_mws[i] = devm_ioremap(&ntb->dev, mw_phys_addr,
> +						     mw_size);
> +		if (!ntb->msi->peer_mws[i]) {
> +			ret = -EFAULT;
> +			goto unroll;
> +		}
> +	}
> +
> +	return 0;
> +
> +unroll:
> +	for (i = 0; i < peers; i++)
> +		if (ntb->msi->peer_mws[i])
> +			devm_iounmap(&ntb->dev, ntb->msi->peer_mws[i]);

Simpler and faster cleanup-code would be:

+ unroll:
+ 	for (--i; i >= 0; --i)
+ 		devm_iounmap(&ntb->dev, ntb->msi->peer_mws[i]);

> +
> +	devm_kfree(&ntb->dev, ntb->msi);
> +	ntb->msi = NULL;
> +	return ret;
> +}
> +EXPORT_SYMBOL(ntb_msi_init);
> +
> +/**
> + * ntb_msi_setup_mws() - Initialize the MSI inbound memory windows
> + * @ntb:	NTB device context
> + *
> + * This function sets up the required inbound memory windows. It should be
> + * called from a work function after a link up event.
> + *
> + * Over the entire network, this function will reserves the last N
> + * inbound memory windows for each peer (where N is the number of peers).
> + *
> + * ntb_msi_init() must be called before this function.
> + *
> + * Return: Zero on success, otherwise a negative error number.
> + */
> +int ntb_msi_setup_mws(struct ntb_dev *ntb)
> +{
> +	struct msi_desc *desc;
> +	u64 addr;
> +	int peer, peer_widx;
> +	resource_size_t addr_align, size_align, size_max;
> +	resource_size_t mw_size = SZ_32K;
> +	resource_size_t mw_min_size = mw_size;
> +	int i;
> +	int ret;
> +
> +	if (!ntb->msi)
> +		return -EINVAL;
> +
> +	desc = first_msi_entry(&ntb->pdev->dev);
> +	addr = desc->msg.address_lo + ((uint64_t)desc->msg.address_hi << 32);
> +
> +	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
> +		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
> +		if (peer_widx < 0)
> +			return peer_widx;
> +
> +		ret = ntb_mw_get_align(ntb, peer, peer_widx, &addr_align,
> +				       NULL, NULL);
> +		if (ret)
> +			return ret;
> +
> +		addr &= ~(addr_align - 1);
> +	}
> +
> +	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
> +		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
> +		if (peer_widx < 0) {
> +			ret = peer_widx;
> +			goto error_out;
> +		}
> +
> +		ret = ntb_mw_get_align(ntb, peer, peer_widx, NULL,
> +				       &size_align, &size_max);
> +		if (ret)
> +			goto error_out;
> +
> +		mw_size = round_up(mw_size, size_align);
> +		mw_size = max(mw_size, size_max);
> +		if (mw_size < mw_min_size)
> +			mw_min_size = mw_size;
> +
> +		ret = ntb_mw_set_trans(ntb, peer, peer_widx,
> +				       addr, mw_size);
> +		if (ret)
> +			goto error_out;

Alas, calling the ntb_mw_set_trans() method isn't enough to fully initialize
NTB Memory Windows. Yes, the library will work for Intel/AMD/Switchtec
(the two-port legacy configuration), but it will fail for IDT due to being
based on the outbound MW xlat interface. So at this stage the library isn't
portable across all NTB hardware. In order to make it work, the translation
address is supposed to be transferred to the peer side, where peer code
should call the ntb_peer_mw_set_trans() method with the retrieved xlat
address. See the documentation for details:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/ntb.txt

The ntb_perf driver can also be used as a reference for a portable NTB MWs setup.

So I'd suggest adding some method like ntb_msi_peer_setup_mws() or similar,
which would be called on the peer side with a translation address (or some
common descriptor containing the address) passed as the function argument.
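
Just to illustrate the shape I mean, a rough sketch with the MW index
plumbing glossed over, assuming the xlat address has already been retrieved
from the peer via spad/msg:

+int ntb_msi_peer_setup_mws(struct ntb_dev *ntb, int peer,
+			   u64 addr, resource_size_t size)
+{
+	int peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+
+	if (peer_widx < 0)
+		return peer_widx;
+
+	return ntb_peer_mw_set_trans(ntb, peer, peer_widx, addr, size);
+}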

It seems to me the test driver should also be altered to support this case.

> +	}
> +
> +	ntb->msi->base_addr = addr;
> +	ntb->msi->end_addr = addr + mw_min_size;
> +
> +	return 0;
> +
> +error_out:
> +	for (i = 0; i < peer; i++) {
> +		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
> +		if (peer_widx < 0)
> +			continue;
> +
> +		ntb_mw_clear_trans(ntb, i, peer_widx);
> +	}

The same cleanup pattern can be utilized here:
+error_out:
+	for (--peer; peer >= 0; --peer) {
+		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+		ntb_mw_clear_trans(ntb, peer, peer_widx);
+	}

So you won't need the "i" variable here anymore. You also don't need to check the
return value of ntb_peer_highest_mw_idx() in the cleanup loop because it
was already checked in the main algo code.

> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(ntb_msi_setup_mws);
> +
> +/**
> + * ntb_msi_clear_mws() - Clear all inbound memory windows
> + * @ntb:	NTB device context
> + *
> + * This function tears down the resources used by ntb_msi_setup_mws().
> + */
> +void ntb_msi_clear_mws(struct ntb_dev *ntb)
> +{
> +	int peer;
> +	int peer_widx;
> +
> +	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
> +		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
> +		if (peer_widx < 0)
> +			continue;
> +
> +		ntb_mw_clear_trans(ntb, peer, peer_widx);
> +	}
> +}
> +EXPORT_SYMBOL(ntb_msi_clear_mws);
> +

Similarly, something like ntb_msi_peer_clear_mws() should be added to
unset a translation address on the peer side.
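
Roughly, with the same caveats as the sketch above:

+void ntb_msi_peer_clear_mws(struct ntb_dev *ntb, int peer)
+{
+	int peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+
+	if (peer_widx < 0)
+		return;
+
+	ntb_peer_mw_clear_trans(ntb, peer, peer_widx);
+}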

> +struct ntb_msi_devres {
> +	struct ntb_dev *ntb;
> +	struct msi_desc *entry;
> +	struct ntb_msi_desc *msi_desc;
> +};
> +
> +static int ntb_msi_set_desc(struct ntb_dev *ntb, struct msi_desc *entry,
> +			    struct ntb_msi_desc *msi_desc)
> +{
> +	u64 addr;
> +
> +	addr = entry->msg.address_lo +
> +		((uint64_t)entry->msg.address_hi << 32);
> +
> +	if (addr < ntb->msi->base_addr || addr >= ntb->msi->end_addr) {
> +		dev_warn_once(&ntb->dev,
> +			      "IRQ %d: MSI Address not within the memory window (%llx, [%llx %llx])\n",
> +			      entry->irq, addr, ntb->msi->base_addr,
> +			      ntb->msi->end_addr);
> +		return -EFAULT;
> +	}
> +
> +	msi_desc->addr_offset = addr - ntb->msi->base_addr;
> +	msi_desc->data = entry->msg.data;
> +
> +	return 0;
> +}
> +
> +static void ntb_msi_write_msg(struct msi_desc *entry, void *data)
> +{
> +	struct ntb_msi_devres *dr = data;
> +
> +	WARN_ON(ntb_msi_set_desc(dr->ntb, entry, dr->msi_desc));
> +
> +	if (dr->ntb->msi->desc_changed)
> +		dr->ntb->msi->desc_changed(dr->ntb->ctx);
> +}
> +
> +static void ntbm_msi_callback_release(struct device *dev, void *res)
> +{
> +	struct ntb_msi_devres *dr = res;
> +
> +	dr->entry->write_msi_msg = NULL;
> +	dr->entry->write_msi_msg_data = NULL;
> +}
> +
> +static int ntbm_msi_setup_callback(struct ntb_dev *ntb, struct msi_desc *entry,
> +				   struct ntb_msi_desc *msi_desc)
> +{
> +	struct ntb_msi_devres *dr;
> +
> +	dr = devres_alloc(ntbm_msi_callback_release,
> +			  sizeof(struct ntb_msi_devres), GFP_KERNEL);
> +	if (!dr)
> +		return -ENOMEM;
> +
> +	dr->ntb = ntb;
> +	dr->entry = entry;
> +	dr->msi_desc = msi_desc;
> +
> +	devres_add(&ntb->dev, dr);
> +
> +	dr->entry->write_msi_msg = ntb_msi_write_msg;
> +	dr->entry->write_msi_msg_data = dr;
> +
> +	return 0;
> +}
> +
> +/**
> + * ntbm_msi_request_threaded_irq() - allocate an MSI interrupt
> + * @ntb:	NTB device context
> + * @handler:	Function to be called when the IRQ occurs
> + * @thread_fn:  Function to be called in a threaded interrupt context. NULL
> + *              for clients which handle everything in @handler
> + * @name:       An ascii name for the claiming device, dev_name(dev) if NULL
> + * @dev_id:     A cookie passed back to the handler function
> + * @msi_desc:   MSI descriptor for the allocated interrupt, to be passed to the peer
> + *
> + * This function assigns an interrupt handler to an unused
> + * MSI interrupt and returns the descriptor used to trigger
> + * it. The descriptor can then be sent to a peer to trigger
> + * the interrupt.
> + *
> + * The interrupt resource is managed with devres so it will
> + * be automatically freed when the NTB device is torn down.
> + *
> + * If an IRQ allocated with this function needs to be freed
> + * separately, ntbm_msi_free_irq() must be used.
> + *
> + * Return: IRQ number assigned on success, otherwise a negative error number.
> + */
> +int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb, irq_handler_t handler,
> +				  irq_handler_t thread_fn,
> +				  const char *name, void *dev_id,
> +				  struct ntb_msi_desc *msi_desc)
> +{
> +	struct msi_desc *entry;
> +	struct irq_desc *desc;
> +	int ret;
> +
> +	if (!ntb->msi)
> +		return -EINVAL;
> +
> +	for_each_pci_msi_entry(entry, ntb->pdev) {
> +		desc = irq_to_desc(entry->irq);
> +		if (desc->action)
> +			continue;
> +
> +		ret = devm_request_threaded_irq(&ntb->dev, entry->irq, handler,
> +						thread_fn, 0, name, dev_id);
> +		if (ret)
> +			continue;
> +
> +		if (ntb_msi_set_desc(ntb, entry, msi_desc)) {
> +			devm_free_irq(&ntb->dev, entry->irq, dev_id);
> +			continue;
> +		}
> +
> +		ret = ntbm_msi_setup_callback(ntb, entry, msi_desc);
> +		if (ret) {
> +			devm_free_irq(&ntb->dev, entry->irq, dev_id);
> +			return ret;
> +		}
> +
> +		return entry->irq;
> +	}
> +
> +	return -ENODEV;
> +}
> +EXPORT_SYMBOL(ntbm_msi_request_threaded_irq);
> +
> +static int ntbm_msi_callback_match(struct device *dev, void *res, void *data)
> +{
> +	struct ntb_dev *ntb = dev_ntb(dev);
> +	struct ntb_msi_devres *dr = res;
> +
> +	return dr->ntb == ntb && dr->entry == data;
> +}
> +
> +/**
> + * ntbm_msi_free_irq() - free an interrupt
> + * @ntb:	NTB device context
> + * @irq:	Interrupt line to free
> + * @dev_id:	Device identity to free
> + *
> + * This function should be used to manually free IRQs allocated with
> + * ntbm_msi_request_[threaded_]irq().
> + */
> +void ntbm_msi_free_irq(struct ntb_dev *ntb, unsigned int irq, void *dev_id)
> +{
> +	struct msi_desc *entry = irq_get_msi_desc(irq);
> +
> +	entry->write_msi_msg = NULL;
> +	entry->write_msi_msg_data = NULL;
> +
> +	WARN_ON(devres_destroy(&ntb->dev, ntbm_msi_callback_release,
> +			       ntbm_msi_callback_match, entry));
> +
> +	devm_free_irq(&ntb->dev, irq, dev_id);
> +}
> +EXPORT_SYMBOL(ntbm_msi_free_irq);
> +
> +/**
> + * ntb_msi_peer_trigger() - Trigger an interrupt handler on a peer
> + * @ntb:	NTB device context
> + * @peer:	Peer index
> + * @desc:	MSI descriptor data which triggers the interrupt
> + *
> + * This function triggers an interrupt on a peer. It requires
> + * the descriptor structure to have been passed from that peer
> + * by some other means.
> + *
> + * Return: Zero on success, otherwise a negative error number.
> + */
> +int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
> +			 struct ntb_msi_desc *desc)
> +{
> +	int idx;
> +
> +	if (!ntb->msi)
> +		return -EINVAL;
> +
> +	idx = desc->addr_offset / sizeof(*ntb->msi->peer_mws[peer]);
> +
> +	ntb->msi->peer_mws[peer][idx] = desc->data;
> +

Shouldn't we use iowrite32() here instead of directly accessing the IO memory?
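I.e., together with the __iomem annotation for peer_mws[] mentioned above,
something like:

+	iowrite32(desc->data, &ntb->msi->peer_mws[peer][idx]);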

> +	return 0;
> +}
> +EXPORT_SYMBOL(ntb_msi_peer_trigger);
> +
> +/**
> + * ntb_msi_peer_addr() - Get the DMA address to trigger a peer's MSI interrupt
> + * @ntb:	NTB device context
> + * @peer:	Peer index
> + * @desc:	MSI descriptor data which triggers the interrupt
> + * @msi_addr:   Physical address to trigger the interrupt
> + *
> + * This function allows using DMA engines to trigger an interrupt
> + * (for example, trigger an interrupt to process the data after
> + * sending it). To trigger the interrupt, write @desc.data to the address
> + * returned in @msi_addr
> + *
> + * Return: Zero on success, otherwise a negative error number.
> + */
> +int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
> +		      struct ntb_msi_desc *desc,
> +		      phys_addr_t *msi_addr)
> +{
> +	int peer_widx = ntb_peer_mw_count(ntb) - 1 - peer;
> +	phys_addr_t mw_phys_addr;
> +	int ret;
> +
> +	ret = ntb_peer_mw_get_addr(ntb, peer_widx, &mw_phys_addr, NULL);
> +	if (ret)
> +		return ret;
> +
> +	if (msi_addr)
> +		*msi_addr = mw_phys_addr + desc->addr_offset;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(ntb_msi_peer_addr);
> diff --git a/include/linux/ntb.h b/include/linux/ntb.h
> index f5c69d853489..b9c61ee3c734 100644
> --- a/include/linux/ntb.h
> +++ b/include/linux/ntb.h
> @@ -58,9 +58,11 @@
>  
>  #include <linux/completion.h>
>  #include <linux/device.h>
> +#include <linux/interrupt.h>
>  
>  struct ntb_client;
>  struct ntb_dev;
> +struct ntb_msi;
>  struct pci_dev;
>  
>  /**
> @@ -425,6 +427,10 @@ struct ntb_dev {
>  	spinlock_t			ctx_lock;
>  	/* block unregister until device is fully released */
>  	struct completion		released;
> +
> +	#ifdef CONFIG_NTB_MSI
> +	struct ntb_msi *msi;
> +	#endif

I'd align the macro-condition to the leftmost position:
+#ifdef CONFIG_NTB_MSI
+	struct ntb_msi *msi;
+#endif

>  };
>  #define dev_ntb(__dev) container_of((__dev), struct ntb_dev, dev)
>  
> @@ -1572,4 +1578,71 @@ static inline int ntb_peer_highest_mw_idx(struct ntb_dev *ntb, int pidx)
>  	return ntb_mw_count(ntb, pidx) - ret - 1;
>  }
>  
> +struct ntb_msi_desc {
> +	u32 addr_offset;
> +	u32 data;
> +};
> +
> +#ifdef CONFIG_NTB_MSI
> +
> +int ntb_msi_init(struct ntb_dev *ntb, void (*desc_changed)(void *ctx));
> +int ntb_msi_setup_mws(struct ntb_dev *ntb);
> +void ntb_msi_clear_mws(struct ntb_dev *ntb);
> +int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb, irq_handler_t handler,
> +				  irq_handler_t thread_fn,
> +				  const char *name, void *dev_id,
> +				  struct ntb_msi_desc *msi_desc);
> +void ntbm_msi_free_irq(struct ntb_dev *ntb, unsigned int irq, void *dev_id);
> +int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
> +			 struct ntb_msi_desc *desc);
> +int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
> +		      struct ntb_msi_desc *desc,
> +		      phys_addr_t *msi_addr);
> +
> +#else /* not CONFIG_NTB_MSI */
> +
> +static inline int ntb_msi_init(struct ntb_dev *ntb,
> +			       void (*desc_changed)(void *ctx))
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline int ntb_msi_setup_mws(struct ntb_dev *ntb)
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline void ntb_msi_clear_mws(struct ntb_dev *ntb) {}
> +static inline int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb,
> +						irq_handler_t handler,
> +						irq_handler_t thread_fn,
> +						const char *name, void *dev_id,
> +						struct ntb_msi_desc *msi_desc)
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline void ntbm_msi_free_irq(struct ntb_dev *ntb, unsigned int irq,
> +				     void *dev_id) {}
> +static inline int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
> +				       struct ntb_msi_desc *desc)
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
> +				    struct ntb_msi_desc *desc,
> +				    phys_addr_t *msi_addr)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +#endif /* CONFIG_NTB_MSI */
> +
> +static inline int ntbm_msi_request_irq(struct ntb_dev *ntb,
> +				       irq_handler_t handler,
> +				       const char *name, void *dev_id,
> +				       struct ntb_msi_desc *msi_desc)
> +{
> +	return ntbm_msi_request_threaded_irq(ntb, handler, NULL, name,
> +					     dev_id, msi_desc);
> +}
> +
>  #endif
> -- 
> 2.19.0
> 
> -- 
> You received this message because you are subscribed to the Google Groups "linux-ntb" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to linux-ntb+unsubscribe@googlegroups.com.
> To post to this group, send email to linux-ntb@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/linux-ntb/20190213175454.7506-10-logang%40deltatee.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 10/12] NTB: Introduce NTB MSI Test Client
  2019-02-13 17:54 ` [PATCH v2 10/12] NTB: Introduce NTB MSI Test Client Logan Gunthorpe
@ 2019-03-06 20:44   ` Serge Semin
  2019-03-06 21:39     ` Logan Gunthorpe
  0 siblings, 1 reply; 27+ messages in thread
From: Serge Semin @ 2019-03-06 20:44 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang,
	Eric Pilmore

Hi

On Wed, Feb 13, 2019 at 10:54:52AM -0700, Logan Gunthorpe wrote:
> Introduce a tool to test NTB MSI interrupts similar to the other
> NTB test tools. This tool creates a debugfs directory for each
> NTB device with the following files:
> 
> port
> irqX_occurrences
> peerX/port
> peerX/count
> peerX/trigger
> 
> The 'port' file tells the user the local port number and the
> 'occurrences' files tell the number of local interrupts that
> have been received for each interrupt.
> 
> For each peer, the 'port' file and the 'count' file tell you the
> peer's port number and number of interrupts respectively. Writing
> the interrupt number to the 'trigger' file triggers the interrupt
> handler for the peer which should increment their corresponding
> 'occurrences' file. The 'ready' file indicates if a peer is ready;
> writing to this file blocks until it is ready.
> 
> The module parameter num_irqs can be used to set the number of
> local interrupts. By default this is 4. This is only limited by
> the number of unused MSI interrupts registered by the hardware
> (this will require support from the hardware driver) and there must
> be at least 2*num_irqs + 1 scratchpad registers available.
> 

Alas the test driver is not going to work with IDT NTB hardware,
since it uses scratchpads only, which aren't supported by IDT devices. IDT NTB
PCIe functions provide message registers to communicate with peers. See
the ntb_perf driver code for a reference of portable driver design.

In order to make the test driver portable, the following alterations
are needed:
1) Add NTB Message register API usage together with Scratchpad+Doorbell.
2) When the NTB MSI library is updated with functions like
ntb_msi_peer_setup_mws()/ntb_msi_peer_clear_mws(), they should be
utilized in the test driver as well to set a translation address on the
peer side.
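
For reference, the scratchpad layout the test hardcodes (as set up in
ntb_msit_setup_work() below) is:

  spad 0        - number of local irqs that were successfully set up
  spad 2*i + 1  - addr_offset from the ntb_msi_desc of irq i
  spad 2*i + 2  - data word from the ntb_msi_desc of irq i

None of that maps onto the IDT message registers as-is.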

> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Cc: Jon Mason <jdmason@kudzu.us>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Allen Hubbe <allenbh@gmail.com>
> ---
>  drivers/ntb/test/Kconfig        |   9 +
>  drivers/ntb/test/Makefile       |   1 +
>  drivers/ntb/test/ntb_msi_test.c | 433 ++++++++++++++++++++++++++++++++
>  3 files changed, 443 insertions(+)
>  create mode 100644 drivers/ntb/test/ntb_msi_test.c
> 
> diff --git a/drivers/ntb/test/Kconfig b/drivers/ntb/test/Kconfig
> index a5d0eda44438..a3f3e2638935 100644
> --- a/drivers/ntb/test/Kconfig
> +++ b/drivers/ntb/test/Kconfig
> @@ -25,3 +25,12 @@ config NTB_PERF
>  	 to and from the window without additional software interaction.
>  
>  	 If unsure, say N.
> +
> +config NTB_MSI_TEST
> +	tristate "NTB MSI Test Client"
> +	depends on NTB_MSI
> +	help
> +	  This tool demonstrates the use of the NTB MSI library to
> +	  send MSI interrupts between peers.
> +
> +	  If unsure, say N.
> diff --git a/drivers/ntb/test/Makefile b/drivers/ntb/test/Makefile
> index 9e77e0b761c2..d2895ca995e4 100644
> --- a/drivers/ntb/test/Makefile
> +++ b/drivers/ntb/test/Makefile
> @@ -1,3 +1,4 @@
>  obj-$(CONFIG_NTB_PINGPONG) += ntb_pingpong.o
>  obj-$(CONFIG_NTB_TOOL) += ntb_tool.o
>  obj-$(CONFIG_NTB_PERF) += ntb_perf.o
> +obj-$(CONFIG_NTB_MSI_TEST) += ntb_msi_test.o
> diff --git a/drivers/ntb/test/ntb_msi_test.c b/drivers/ntb/test/ntb_msi_test.c
> new file mode 100644
> index 000000000000..99d826ed9c34
> --- /dev/null
> +++ b/drivers/ntb/test/ntb_msi_test.c
> @@ -0,0 +1,433 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> +
> +#include <linux/module.h>
> +#include <linux/debugfs.h>
> +#include <linux/ntb.h>
> +#include <linux/pci.h>
> +#include <linux/radix-tree.h>
> +#include <linux/workqueue.h>
> +
> +MODULE_LICENSE("Dual BSD/GPL");
> +MODULE_VERSION("0.1");
> +MODULE_AUTHOR("Logan Gunthorpe <logang@deltatee.com>");
> +MODULE_DESCRIPTION("Test for sending MSI interrupts over an NTB memory window");
> +
> +static int num_irqs = 4;
> +module_param(num_irqs, int, 0644);
> +MODULE_PARM_DESC(num_irqs, "number of irqs to use");
> +
> +struct ntb_msit_ctx {
> +	struct ntb_dev *ntb;
> +	struct dentry *dbgfs_dir;
> +	struct work_struct setup_work;
> +
> +	struct ntb_msit_isr_ctx {
> +		int irq_idx;
> +		int irq_num;
> +		int occurrences;
> +		struct ntb_msit_ctx *nm;
> +		struct ntb_msi_desc desc;
> +	} *isr_ctx;
> +
> +	struct ntb_msit_peer {
> +		struct ntb_msit_ctx *nm;
> +		int pidx;
> +		int num_irqs;
> +		struct completion init_comp;
> +		struct ntb_msi_desc *msi_desc;
> +	} peers[];
> +};
> +
> +static struct dentry *ntb_msit_dbgfs_topdir;
> +
> +static irqreturn_t ntb_msit_isr(int irq, void *dev)
> +{
> +	struct ntb_msit_isr_ctx *isr_ctx = dev;
> +	struct ntb_msit_ctx *nm = isr_ctx->nm;
> +
> +	dev_dbg(&nm->ntb->dev, "Interrupt Occurred: %d",
> +		isr_ctx->irq_idx);
> +
> +	isr_ctx->occurrences++;
> +
> +	return IRQ_HANDLED;
> +}
> +
> +static void ntb_msit_setup_work(struct work_struct *work)
> +{
> +	struct ntb_msit_ctx *nm = container_of(work, struct ntb_msit_ctx,
> +					       setup_work);
> +	int irq_count = 0;
> +	int irq;
> +	int ret;
> +	uintptr_t i;
> +
> +	ret = ntb_msi_setup_mws(nm->ntb);
> +	if (ret) {
> +		dev_err(&nm->ntb->dev, "Unable to setup MSI windows: %d\n",
> +			ret);
> +		return;
> +	}
> +
> +	for (i = 0; i < num_irqs; i++) {
> +		nm->isr_ctx[i].irq_idx = i;
> +		nm->isr_ctx[i].nm = nm;
> +
> +		if (!nm->isr_ctx[i].irq_num) {
> +			irq = ntbm_msi_request_irq(nm->ntb, ntb_msit_isr,
> +						   KBUILD_MODNAME,
> +						   &nm->isr_ctx[i],
> +						   &nm->isr_ctx[i].desc);
> +			if (irq < 0)
> +				break;
> +
> +			nm->isr_ctx[i].irq_num = irq;
> +		}
> +
> +		ret = ntb_spad_write(nm->ntb, 2 * i + 1,
> +				     nm->isr_ctx[i].desc.addr_offset);
> +		if (ret)
> +			break;
> +
> +		ret = ntb_spad_write(nm->ntb, 2 * i + 2,
> +				     nm->isr_ctx[i].desc.data);
> +		if (ret)
> +			break;
> +
> +		irq_count++;
> +	}
> +
> +	ntb_spad_write(nm->ntb, 0, irq_count);
> +	ntb_peer_db_set(nm->ntb, BIT(ntb_port_number(nm->ntb)));
> +}
> +
> +static void ntb_msit_desc_changed(void *ctx)
> +{
> +	struct ntb_msit_ctx *nm = ctx;
> +	int i;
> +
> +	dev_dbg(&nm->ntb->dev, "MSI Descriptors Changed\n");
> +
> +	for (i = 0; i < num_irqs; i++) {
> +		ntb_spad_write(nm->ntb, 2 * i + 1,
> +			       nm->isr_ctx[i].desc.addr_offset);
> +		ntb_spad_write(nm->ntb, 2 * i + 2,
> +			       nm->isr_ctx[i].desc.data);
> +	}
> +
> +	ntb_peer_db_set(nm->ntb, BIT(ntb_port_number(nm->ntb)));
> +}
> +
> +static void ntb_msit_link_event(void *ctx)
> +{
> +	struct ntb_msit_ctx *nm = ctx;
> +
> +	if (!ntb_link_is_up(nm->ntb, NULL, NULL))
> +		return;
> +
> +	schedule_work(&nm->setup_work);
> +}
> +
> +static void ntb_msit_copy_peer_desc(struct ntb_msit_ctx *nm, int peer)
> +{
> +	int i;
> +	struct ntb_msi_desc *desc = nm->peers[peer].msi_desc;
> +	int irq_count = nm->peers[peer].num_irqs;
> +
> +	for (i = 0; i < irq_count; i++) {
> +		desc[i].addr_offset = ntb_peer_spad_read(nm->ntb, peer,
> +							 2 * i + 1);
> +		desc[i].data = ntb_peer_spad_read(nm->ntb, peer, 2 * i + 2);
> +	}
> +
> +	dev_info(&nm->ntb->dev, "Found %d interrupts on peer %d\n",
> +		 irq_count, peer);
> +
> +	complete_all(&nm->peers[peer].init_comp);
> +}
> +
> +static void ntb_msit_db_event(void *ctx, int vec)
> +{
> +	struct ntb_msit_ctx *nm = ctx;
> +	struct ntb_msi_desc *desc;
> +	u64 peer_mask = ntb_db_read(nm->ntb);
> +	u32 irq_count;
> +	int peer;
> +
> +	ntb_db_clear(nm->ntb, peer_mask);
> +
> +	for (peer = 0; peer < sizeof(peer_mask) * 8; peer++) {
> +		if (!(peer_mask & BIT(peer)))
> +			continue;
> +
> +		irq_count = ntb_peer_spad_read(nm->ntb, peer, 0);
> +		if (irq_count == -1)
> +			continue;
> +
> +		desc = kcalloc(irq_count, sizeof(*desc), GFP_ATOMIC);
> +		if (!desc)
> +			continue;
> +
> +		kfree(nm->peers[peer].msi_desc);
> +		nm->peers[peer].msi_desc = desc;
> +		nm->peers[peer].num_irqs = irq_count;
> +
> +		ntb_msit_copy_peer_desc(nm, peer);
> +	}
> +}
> +
> +static const struct ntb_ctx_ops ntb_msit_ops = {
> +	.link_event = ntb_msit_link_event,
> +	.db_event = ntb_msit_db_event,
> +};
> +
> +static int ntb_msit_dbgfs_trigger(void *data, u64 idx)
> +{
> +	struct ntb_msit_peer *peer = data;
> +
> +	if (idx >= peer->num_irqs)
> +		return -EINVAL;
> +
> +	dev_dbg(&peer->nm->ntb->dev, "trigger irq %llu on peer %u\n",
> +		idx, peer->pidx);
> +
> +	return ntb_msi_peer_trigger(peer->nm->ntb, peer->pidx,
> +				    &peer->msi_desc[idx]);
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_trigger_fops, NULL,
> +			 ntb_msit_dbgfs_trigger, "%llu\n");
> +
> +static int ntb_msit_dbgfs_port_get(void *data, u64 *port)
> +{
> +	struct ntb_msit_peer *peer = data;
> +
> +	*port = ntb_peer_port_number(peer->nm->ntb, peer->pidx);
> +
> +	return 0;
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_port_fops, ntb_msit_dbgfs_port_get,
> +			 NULL, "%llu\n");
> +
> +static int ntb_msit_dbgfs_count_get(void *data, u64 *count)
> +{
> +	struct ntb_msit_peer *peer = data;
> +
> +	*count = peer->num_irqs;
> +
> +	return 0;
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_count_fops, ntb_msit_dbgfs_count_get,
> +			 NULL, "%llu\n");
> +
> +static int ntb_msit_dbgfs_ready_get(void *data, u64 *ready)
> +{
> +	struct ntb_msit_peer *peer = data;
> +
> +	*ready = try_wait_for_completion(&peer->init_comp);
> +
> +	return 0;
> +}
> +
> +static int ntb_msit_dbgfs_ready_set(void *data, u64 ready)
> +{
> +	struct ntb_msit_peer *peer = data;
> +
> +	return wait_for_completion_interruptible(&peer->init_comp);
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_ready_fops, ntb_msit_dbgfs_ready_get,
> +			 ntb_msit_dbgfs_ready_set, "%llu\n");
> +
> +static int ntb_msit_dbgfs_occurrences_get(void *data, u64 *occurrences)
> +{
> +	struct ntb_msit_isr_ctx *isr_ctx = data;
> +
> +	*occurrences = isr_ctx->occurrences;
> +
> +	return 0;
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_occurrences_fops,
> +			 ntb_msit_dbgfs_occurrences_get,
> +			 NULL, "%llu\n");
> +
> +static int ntb_msit_dbgfs_local_port_get(void *data, u64 *port)
> +{
> +	struct ntb_msit_ctx *nm = data;
> +
> +	*port = ntb_port_number(nm->ntb);
> +
> +	return 0;
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(ntb_msit_local_port_fops,
> +			 ntb_msit_dbgfs_local_port_get,
> +			 NULL, "%llu\n");
> +
> +static void ntb_msit_create_dbgfs(struct ntb_msit_ctx *nm)
> +{
> +	struct pci_dev *pdev = nm->ntb->pdev;
> +	char buf[32];
> +	int i;
> +	struct dentry *peer_dir;
> +
> +	nm->dbgfs_dir = debugfs_create_dir(pci_name(pdev),
> +					   ntb_msit_dbgfs_topdir);
> +	debugfs_create_file("port", 0400, nm->dbgfs_dir, nm,
> +			    &ntb_msit_local_port_fops);
> +
> +	for (i = 0; i < ntb_peer_port_count(nm->ntb); i++) {
> +		nm->peers[i].pidx = i;
> +		nm->peers[i].nm = nm;
> +		init_completion(&nm->peers[i].init_comp);
> +
> +		snprintf(buf, sizeof(buf), "peer%d", i);
> +		peer_dir = debugfs_create_dir(buf, nm->dbgfs_dir);
> +
> +		debugfs_create_file_unsafe("trigger", 0200, peer_dir,
> +					   &nm->peers[i],
> +					   &ntb_msit_trigger_fops);
> +
> +		debugfs_create_file_unsafe("port", 0400, peer_dir,
> +					   &nm->peers[i], &ntb_msit_port_fops);
> +
> +		debugfs_create_file_unsafe("count", 0400, peer_dir,
> +					   &nm->peers[i],
> +					   &ntb_msit_count_fops);
> +
> +		debugfs_create_file_unsafe("ready", 0600, peer_dir,
> +					   &nm->peers[i],
> +					   &ntb_msit_ready_fops);
> +	}
> +
> +	for (i = 0; i < num_irqs; i++) {
> +		snprintf(buf, sizeof(buf), "irq%d_occurrences", i);
> +		debugfs_create_file_unsafe(buf, 0400, nm->dbgfs_dir,
> +					   &nm->isr_ctx[i],
> +					   &ntb_msit_occurrences_fops);
> +	}
> +}
> +
> +static void ntb_msit_remove_dbgfs(struct ntb_msit_ctx *nm)
> +{
> +	debugfs_remove_recursive(nm->dbgfs_dir);
> +}
> +
> +static int ntb_msit_probe(struct ntb_client *client, struct ntb_dev *ntb)
> +{
> +	struct ntb_msit_ctx *nm;
> +	size_t struct_size;
> +	int peers;
> +	int ret;
> +
> +	peers = ntb_peer_port_count(ntb);
> +	if (peers <= 0)
> +		return -EINVAL;
> +
> +	if (ntb_spad_is_unsafe(ntb) || ntb_spad_count(ntb) < 2 * num_irqs + 1) {
> +		dev_err(&ntb->dev, "NTB MSI test requires at least %d spads for %d irqs\n",
> +			2 * num_irqs + 1, num_irqs);
> +		return -EFAULT;
> +	}
> +
> +	ret = ntb_spad_write(ntb, 0, -1);
> +	if (ret) {
> +		dev_err(&ntb->dev, "Unable to write spads: %d\n", ret);
> +		return ret;
> +	}
> +
> +	ret = ntb_db_clear_mask(ntb, GENMASK(peers - 1, 0));
> +	if (ret) {
> +		dev_err(&ntb->dev, "Unable to clear doorbell mask: %d\n", ret);
> +		return ret;
> +	}
> +
> +	ret = ntb_msi_init(ntb, ntb_msit_desc_changed);
> +	if (ret) {
> +		dev_err(&ntb->dev, "Unable to initialize MSI library: %d\n",
> +			ret);
> +		return ret;
> +	}
> +
> +	struct_size = sizeof(*nm) + sizeof(*nm->peers) * peers;
> +
> +	nm = devm_kzalloc(&ntb->dev, struct_size, GFP_KERNEL);
> +	if (!nm)
> +		return -ENOMEM;
> +
> +	nm->isr_ctx = devm_kcalloc(&ntb->dev, num_irqs, sizeof(*nm->isr_ctx),
> +				   GFP_KERNEL);
> +	if (!nm->isr_ctx)
> +		return -ENOMEM;
> +
> +	INIT_WORK(&nm->setup_work, ntb_msit_setup_work);
> +	nm->ntb = ntb;
> +
> +	ntb_msit_create_dbgfs(nm);
> +
> +	ret = ntb_set_ctx(ntb, nm, &ntb_msit_ops);
> +	if (ret)
> +		goto remove_dbgfs;
> +
> +	if (!nm->isr_ctx)
> +		goto remove_dbgfs;
> +
> +	ntb_link_enable(ntb, NTB_SPEED_AUTO, NTB_WIDTH_AUTO);
> +
> +	return 0;
> +
> +remove_dbgfs:
> +	ntb_msit_remove_dbgfs(nm);
> +	devm_kfree(&ntb->dev, nm->isr_ctx);
> +	devm_kfree(&ntb->dev, nm);
> +	return ret;
> +}
> +
> +static void ntb_msit_remove(struct ntb_client *client, struct ntb_dev *ntb)
> +{
> +	struct ntb_msit_ctx *nm = ntb->ctx;
> +	int i;
> +
> +	ntb_link_disable(ntb);
> +	ntb_db_set_mask(ntb, ntb_db_valid_mask(ntb));
> +	ntb_msi_clear_mws(ntb);
> +
> +	for (i = 0; i < ntb_peer_port_count(ntb); i++)
> +		kfree(nm->peers[i].msi_desc);
> +
> +	ntb_clear_ctx(ntb);
> +	ntb_msit_remove_dbgfs(nm);
> +}
> +
> +static struct ntb_client ntb_msit_client = {
> +	.ops = {
> +		.probe = ntb_msit_probe,
> +		.remove = ntb_msit_remove
> +	}
> +};
> +
> +static int __init ntb_msit_init(void)
> +{
> +	int ret;
> +
> +	if (debugfs_initialized())
> +		ntb_msit_dbgfs_topdir = debugfs_create_dir(KBUILD_MODNAME,
> +							   NULL);
> +
> +	ret = ntb_register_client(&ntb_msit_client);
> +	if (ret)
> +		debugfs_remove_recursive(ntb_msit_dbgfs_topdir);
> +
> +	return ret;
> +}
> +module_init(ntb_msit_init);
> +
> +static void __exit ntb_msit_exit(void)
> +{
> +	ntb_unregister_client(&ntb_msit_client);
> +	debugfs_remove_recursive(ntb_msit_dbgfs_topdir);
> +}
> +module_exit(ntb_msit_exit);
> -- 
> 2.19.0
> 
> -- 
> You received this message because you are subscribed to the Google Groups "linux-ntb" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to linux-ntb+unsubscribe@googlegroups.com.
> To post to this group, send email to linux-ntb@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/linux-ntb/20190213175454.7506-11-logang%40deltatee.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 09/12] NTB: Introduce MSI library
  2019-03-06 20:26   ` Serge Semin
@ 2019-03-06 21:35     ` Logan Gunthorpe
  2019-03-06 23:13       ` Serge Semin
  0 siblings, 1 reply; 27+ messages in thread
From: Logan Gunthorpe @ 2019-03-06 21:35 UTC (permalink / raw)
  To: Serge Semin, linux-kernel, linux-ntb, linux-pci, iommu,
	linux-kselftest, Jon Mason, Bjorn Helgaas, Joerg Roedel,
	Allen Hubbe, Dave Jiang, Eric Pilmore



On 2019-03-06 1:26 p.m., Serge Semin wrote:
> First of all, it might be unsafe to have some resources consumed by NTB
> MSI or some other library without a simple way to warn NTB client drivers
> about their attempts to access those resources, since it might lead to random
> errors. When I thought about implementing a transport library based on the
> Message/Spad+Doorbell registers, I had in mind to create an internal bit-field
> array with the resources' busy-flags. If, for instance, some message or
> scratchpad register is occupied by the library (MSI, transport or something else),
> then it would be impossible to access these resources directly through NTB API
> methods. So an NTB client driver would retrieve an error in an attempt to
> write/read data to/from a busy message or scratchpad register, or in an attempt
> to set some occupied doorbell bit. The same thing can be done for memory windows.

Yes, it would be nice to have a generic library to manage all the
resources, but right now we don't and it's unfair to expect us to take
on this work to get the features we care about merged. Right now, it's
not at all unsafe as the client is quite capable of ensuring it has the
resources for the MSI library. The changes for ntb_transport to ensure
this are quite reasonable.

> A second tiny concern is about documentation. Since there is a special file for
> all NTB-related docs, it would be good to have some description of the
> NTB MSI library there as well:
> Documentation/ntb.txt

Sure, I'll add a short blurb for v3. Though I noticed it's gotten quite out of
date since your changes. Especially in the ntb_tool section...

>> +	u32 *peer_mws[];
> 
> Shouldn't we use the __iomem attribute here since later the devm_ioremap() is
> used to map MWs at these pointers?

Yes, will change for v3.


> Simpler and faster cleanup-code would be:

> + unroll:
> + 	for (--i; i >= 0; --i)
> + 		devm_iounmap(&ntb->dev, ntb->msi->peer_mws[i]);

Faster, maybe, but I would not consider this simpler. It's much more
complicated to reason about and ensure it's correct. I prefer my way
because I don't care about speed, but I do care about readability.


> Alas calling the ntb_mw_set_trans() method isn't enough to fully initialize
> NTB Memory Windows. Yes, the library will work for Intel/AMD/Switchtec
> (two-ports legacy configuration), but will fail for IDT due to being based on
> the outbound MW xlat interface. So the library at this stage isn't portable
> across all NTB hardware. In order to make it working the translation address is
> supposed to be transferred to the peer side, where a peer code should call
> ntb_peer_mw_set_trans() method with the retrieved xlat address.
> See documentation for details:

> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/ntb.txt
> 
> ntb_perf driver can be also used as a reference of the portable NTB MWs setup.

Gross. Well, given that ntb_transport doesn't even support this and we
don't really have a sensible library to transfer this information, I'm
going to leave it as is for now. Someone can update ntb_msi when they
update ntb_transport, preferably after we have a nice library to handle
the transfers for us seeing I absolutely do not want to replicate the
mess in ntb_perf.

Actually, if we had a generic spad/msg communication library, it would
probably be better to have a common ntb_mw_set_trans() function that
uses the communications library to send the data and automatically call
ntb_peer_mw_set_trans() on the peer. That way we don't have to push this
mess into the clients.
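
Roughly what I'm picturing, for the record (ntb_comm_send() and the opcode
are entirely made up here; on the receiving side the handler would call
ntb_peer_mw_set_trans() with the message contents):

	int ntb_comm_mw_set_trans(struct ntb_dev *ntb, int pidx, int widx,
				  dma_addr_t addr, resource_size_t size)
	{
		int ret;

		ret = ntb_mw_set_trans(ntb, pidx, widx, addr, size);
		if (ret)
			return ret;

		/* forward the xlat address to the peer via spad/msg */
		return ntb_comm_send(ntb, pidx, NTB_COMM_SET_TRANS,
				     widx, addr, size);
	}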

> The same cleanup pattern can be utilized here:
> +error_out:
> +	for (--peer; peer >= 0; --peer) {
> +		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
> +		ntb_mw_clear_trans(ntb, i, peer_widx);
> +	}
> 
> So you won't need "i" variable here anymore. You also don't need to check the
> return value of ntb_peer_highest_mw_idx() in the cleanup loop because it
> was already checked in the main algo code.

See above.

>> +EXPORT_SYMBOL(ntb_msi_clear_mws);
>> +
> 
> Similarly something like ntb_msi_peer_clear_mws() should be added to
> unset a translation address on the peer side.

Well, we can table that for when ntb_msi supports the peer MW setting
functions.
>> +int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
>> +			 struct ntb_msi_desc *desc)
>> +{
>> +	int idx;
>> +
>> +	if (!ntb->msi)
>> +		return -EINVAL;
>> +
>> +	idx = desc->addr_offset / sizeof(*ntb->msi->peer_mws[peer]);
>> +
>> +	ntb->msi->peer_mws[peer][idx] = desc->data;
>> +
> 
> Shouldn't we use iowrite32() here instead of direct access to the IO-memory?

Yes, as above I'll fix it for v3.

>> @@ -425,6 +427,10 @@ struct ntb_dev {
>>  	spinlock_t			ctx_lock;
>>  	/* block unregister until device is fully released */
>>  	struct completion		released;
>> +
>> +	#ifdef CONFIG_NTB_MSI
>> +	struct ntb_msi *msi;
>> +	#endif
> 
> I'd align the macro-condition to the most left position:
> +#ifdef CONFIG_NTB_MSI
> +	struct ntb_msi *msi;
> +#endif

Fixed for v3.


Logan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 10/12] NTB: Introduce NTB MSI Test Client
  2019-03-06 20:44   ` Serge Semin
@ 2019-03-06 21:39     ` Logan Gunthorpe
  0 siblings, 0 replies; 27+ messages in thread
From: Logan Gunthorpe @ 2019-03-06 21:39 UTC (permalink / raw)
  To: Serge Semin, linux-kernel, linux-ntb, linux-pci, iommu,
	linux-kselftest, Jon Mason, Bjorn Helgaas, Joerg Roedel,
	Allen Hubbe, Dave Jiang, Eric Pilmore



On 2019-03-06 1:44 p.m., Serge Semin wrote:
> Alas the test driver is not going to work with IDT NTB hardware,
> since it uses Spad only, which aren't supported by IDT devices. IDT NTB
> PCIe functions provide message registers to communicate with peers. See
> ntb_perf driver code for reference of portable driver design.
> 
> In order to make the test driver portable the following alterations
> are needed:
> 1) Add NTB Message register API usage together with Scratchpad+Doorbell.
> 2) When NTB MSI library is updated with functions like
> ntb_msi_peer_setup_mws()/ntb_msi_peer_clear_mws() they are needed to be
> utilized in the test driver as well to set a translation address on the
> peer side.

Per, the discussion on the previous email, we can update this later once
we actually have sensible infrastructure for it. The mess in ntb_perf is
not something I want to replicate and we don't have any interest in
creating a large amount of infrastructure to support hardware we don't
care about.

Logan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 07/12] NTB: Introduce functions to calculate multi-port resource index
  2019-03-06 19:11     ` Logan Gunthorpe
@ 2019-03-06 22:45       ` Serge Semin
  2019-03-06 23:22         ` Logan Gunthorpe
  0 siblings, 1 reply; 27+ messages in thread
From: Serge Semin @ 2019-03-06 22:45 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang,
	Eric Pilmore

On Wed, Mar 06, 2019 at 12:11:11PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-03-05 6:24 p.m., Serge Semin wrote:
> >> + * In a 5 peer system, this function will return the following matrix
> >> + *
> >> + * pidx \ port    0    1    2    3    4
> >> + * 0              0    0    1    2    3
> >> + * 1              0    1    2    3    4
> >> + * 2              0    1    2    3    4
> >> + * 3              0    1    2    3    4
> >> + *
> 
> Oh, first, oops: looks like I copied this down wrong anyway; the code
> was what I had intended, but the documented example should have been:
> 
> pidx \ local_port     0    1    2    3    4
>  0                    0    0    1    2    3
>  1                    0    1    1    2    3
>  2                    0    1    2    2    3
>  3                    0    1    2    3    3
> 
> And this is definitely the correct table we are aiming for.
> ntb_peer_resource_idx() is supposed to return the result of
> ntb_peer_port_idx(ntb, local_port) when run on the peer specified by pidx.
> 
> Note: this table also makes sense because it only uses 4 resources for 5
> ports which is the best case scenario. (In other words, to communicate
> between N ports, N-1 resources are required on each peer).
> 

Yes, it does use resources as sparingly and as tightly as possible, but
only for the case of contiguous port numbering. In case there
are gaps in the port numbers space (which is the only case we have
in supported hardware at this moment), it will lead to a failure if there
are ports with higher numbers than there are MWs available (MWs availability
depends on the IDT chip firmware). Additionally, it creates gaps in the MWs
space if physical ports are numbered with gaps. Since the only multi-port
device we've got now is IDT, and it always has its ports numbered with gaps as I
described, the current implementation will definitely produce
problems.

> > This table is too simplified to represent a generic case of port-index
> > mapping table. In particular the IDT PCIe switch got it ports numbered
> > with uneven integers like: 0 2 4 6 8 12 16 20 or 0 8 16, and so on.
> > Moreover some of the ports might be disabled or may have NTB functions
> > deactivated, in which case these ports shouldn't be considered by NTB subsystem
> > at all. Basically we may have any increasing subset of that port
> > numbers depending on the current IDT PCIe-switch ports setup.
> 
> Yes, I did not consider situations where there would be gaps in the
> "port number" space. It wasn't at all clear from the code that this was
> possible. Switchtec hardware could be configured for such an
> arrangement, but I don't know why anyone would do that as it just
> needlessly complicates everything.
> 
> As you point out, with a gap, we end up with something that is wrong:
> 
> pidx \ port     0    1    3    4    5
>  0              0    0    2    3    4
>  1              0    1    2    3    4
>  2              0    1    3    3    4
>  3              0    1    3    4    4
> 
> Here, the relationship between ntb_peer_resource_idx() and
> ntb_peer_port_idx() is not maintained and it seems to prescribe 5
> resources for 5 ports. If there were more gaps it would be even more wrong.
> 

Exactly. The table will look even worse for the port numbers: 0 2 4 6 8 12 16 20.
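
A quick compilable model makes the damage concrete. peer_resource_idx() below
mirrors the logic of ntb_peer_resource_idx() from the patch, assuming the peer
port indexes simply enumerate the remaining physical ports in increasing order:

#include <stdio.h>

static const int port[] = { 0, 2, 4, 6, 8, 12, 16, 20 };
#define NPORTS ((int)(sizeof(port) / sizeof(port[0])))

/* mirrors ntb_peer_resource_idx() from the patch */
static int peer_resource_idx(int local_port, int pidx)
{
	int i, n = 0, peer_port = -1;

	/* the pidx'th port, skipping the local one */
	for (i = 0; i < NPORTS; i++) {
		if (port[i] == local_port)
			continue;
		if (n++ == pidx) {
			peer_port = port[i];
			break;
		}
	}

	return peer_port < local_port ? local_port - 1 : local_port;
}

int main(void)
{
	int pidx, i;

	printf("pidx \\ port");
	for (i = 0; i < NPORTS; i++)
		printf("%4d", port[i]);
	printf("\n");

	for (pidx = 0; pidx < NPORTS - 1; pidx++) {
		printf("%-11d", pidx);
		for (i = 0; i < NPORTS; i++)
			printf("%4d", peer_resource_idx(port[i], pidx));
		printf("\n");
	}

	return 0;
}

It prints resource indexes as high as 19 for only 8 ports, with most of the
index space left unused.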

> >> +static inline int ntb_peer_resource_idx(struct ntb_dev *ntb, int pidx)
> >> +{
> >> +	int local_port, peer_port;
> >> +
> >> +	if (pidx >= ntb_peer_port_count(ntb))
> >> +		return -EINVAL;
> >> +
> >> +	local_port = ntb_port_number(ntb);
> >> +	peer_port = ntb_peer_port_number(ntb, pidx);
> >> +
> >> +	if (peer_port < local_port)
> >> +		return local_port - 1;
> >> +	else
> >> +		return local_port;
> >> +}
> >> +
> > 
> > Instead of redefining the port-index table we can just fix the
> > ntb_peer_resource_idx() method, so it would return a global port index
> > instead of some number based on the port number. It can be done just by
> > the next modification:
> > 
> > +     if (peer_port <= local_port)
> > +             return pidx;
> > +     else
> > +             return pidx + 1;
> > 
> 
> This creates a table that looks like:
> 
> pidx \ port     0    1    2    3    4
>  0              1    0    0    0    0
>  1              2    2    1    1    1
>  2              3    3    3    2    2
>  3              4    4    4    4    3
> 
> Which is not correct. In fact, it seems to require 5 resources for 5
> ports. This appears to be what is done in the current ntb_perf and I
> think I figured it out several months ago but it's way too messy and
> hard to understand and I don't want to spend the time to figure it out
> again.
> 

Yes, this is how it used to be done in ntb_pingpong and is still done in the
ntb_perf driver. And it works correctly. As I already described (and as
you wrote further down), this table provides a Logical Port numbering space:

peer port \ local port   0  2  4  6  8  12  16  20
        0                0  0  0  0  0   0   0   0
        2                1  1  1  1  1   1   1   1
        4                2  2  2  2  2   2   2   2
        6                3  3  3  3  3   3   3   3
        8                4  4  4  4  4   4   4   4
       12                5  5  5  5  5   5   5   5
       16                6  6  6  6  6   6   6   6
       20                7  7  7  7  7   7   7   7

(although I'd call it a Global Port Index space)

Currently the local port indexes don't enumerate the local port itself. So
if you convert that table into the one you provided in the function
comment, it'll look like this (similar to what you called incorrect):

pidx \ port     0  2  4  6  8  12  16  20
 0              1  0  0  0  0   0   0   0
 1              2  2  1  1  1   1   1   1
 2              3  3  3  2  2   2   2   2
 3              4  4  4  4  3   3   3   3
 4              5  5  5  5  5   4   4   4
 5              6  6  6  6  6   6   5   5
 6              7  7  7  7  7   7   7   6

Yes, by using this table we'll waste one resource as an always-existing
gap (is it the only incorrect thing you had in mind?). But that is a smaller
problem than using physical port numbers, which produces much bigger gaps
in your table implementation as well.

Note, in addition, in this case you'd need to reconsider your algorithm of the
resources initialization. Let's for example take a look at Port 0. You'd need
to have its outbound memory windows [1-7] pointing to the peers with ports
[2,4,...,20] (corresponding to pidx [0-6] of Port 0). So in this case Port 2 would
have a Port 0 inbound MW #1 retrieving data from Port 0 outbound MW #1, Port 4
would have a Port 0 inbound MW #2 retrieving data from Port 0 outbound MW #2,
and so on.

So your current approach is inbound-MW-centric, while mine is built
around the outbound MWs.

> IMO, in order to support gaps, we'd need to, on some layer, create an
> un-gapped numbering scheme for the ports. I think the easiest thing is
> to just have Logical and Physical port numbers; so we would have
> something like:
> 
> Physical Port Number: 0  2  4  6  8  12  16  20
> Logical Port Number:  0  1  2  3  4  5   6   7
> Peer Index (Port 0):  x  0  1  2  3  4   5   6
> Port Index (Port 8):  0  1  2  3  x  4   5   6
> (etc)

That's what I suggested in the two possible solutions:
1st solution: replace current pidx with Logical Port Number,
2nd solution: alter ntb_peer_resource_idx() so it would return the Logical Port Number.

IMO, in case of the 2nd solution, I'd also suggest renaming the
ntb_peer_resource_idx() method into ntb_peer_port_global_idx(),
and then considering the current port indexes used in the NTB API
as local port indexes. The resource indexing can be abstracted
by a macro like this:
#define ntb_peer_resource_idx ntb_peer_port_global_idx

Finally, in order to close the space up, we'd also need to define
a method ntb_port_global_idx(), which would return the Logical (global)
index of the local port.
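
Which could be implemented on top of the existing API as simply as (a sketch):

+static inline int ntb_port_global_idx(struct ntb_dev *ntb)
+{
+	int pidx, gidx = 0;
+
+	/* logical index == number of ports with a smaller physical number */
+	for (pidx = 0; pidx < ntb_peer_port_count(ntb); pidx++)
+		if (ntb_peer_port_number(ntb, pidx) < ntb_port_number(ntb))
+			gidx++;
+
+	return gidx;
+}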

> 
> Where the Physical Port Number is whatever the hardware uses and the
> logical port number is a numbering scheme starting with zero with no
> gaps. Then the port indexes are still as we currently have them. If we
> say that the port numbers we have now are the Logical Port Number, then
> ntb_peer_resource_idx() is correct.
> 

Current port numbers are the physical port numbers with gaps. That's why we
introduced the port-index NTB API abstraction in the first place, to have these gaps
eliminated and to provide a simple way of bulk setup. Although that abstraction
turned out not to be that suitable for distributing the shared resources. So
the Logical (Global) indexing is needed to do it (that's what ntb_pingpong used
to do and ntb_perf still does now).

> I would strongly argue that the clients don't need to know anything
> about the Physical Port Number and these should be handled strictly
> inside the drivers. If multiple drivers need to do something similar to
> map the logical to physical port numbers then we should introduce helper
> functions to allow them to do so. If the Physical Numbers are not
> contained in the driver than the API would need to be expanded to expose
> which numbers are actually used to avoid needing to constantly loop
> through all the indexes to find this out.
> 

Absolutely agree with you. The main idea of NTB API was to provide a set
of methods to access the NTB hardware without any abstractions but
with possible useful helpers, like your NTB MSI library, or transport library,
or anything else. So the physical port numbers must be available for
the client drivers.

> On a similar vein, I'd  suggest that most clients shouldn't even really
> need to do anything with the Logical Port Number and should deal largely
> with Port Indexes. Ideally, Logical Port Numbers should only be used by
> helper code in the common layer to help assign resources used by the
> clients (like ntb_peer_resource_idx()).
> 

This is the main question. Do we really need the current port indexes
implementation at all? After all these years of NTB API usage I don't really
see it being useful in any case except to loop over the outbound MW resources
automatically skipping the local port (the usefulness of this is also questionable).
As I already said, I created the port-index table this way due to the IDT NTB
MWs peculiarity, which doesn't seem to me a big problem now compared to all
these additional complications we intend to introduce.

The rest of the drivers' code really needs the Logical (global)
port indexes, at least to distribute the shared resources, and doesn't use
the current pidx that much.

Wouldn't it be better to just redefine the current port-index table
in the following way?
ntb_port_number() - local physical port number,
ntb_port_idx() - local port logical (global) index,
ntb_peer_port_count() - total number of ports the NTB device provides (including the local one),
ntb_peer_port_number() - physical port number of the peer with the passed logical port index,
ntb_peer_port_idx - logical port index of the passed physical port number.

while currently we have:
ntb_port_number() - local physical port number,
ntb_peer_port_count() - total number of ports the NTB device provides (excluding the local one),
ntb_peer_port_number() - physical port number of the peer with the passed port index,
ntb_peer_port_idx - port index of the passed physical port number;

-Sergey

> This world view isn't far off from what we have now, though you *may*
> need to adjust your IDT driver and we will have to eventually clean up
> the existing test clients to use the new helper functions.
> 
> > Personally I'd prefer the first solution even though it may lead to the
> > "Unsupported TLP" errors and cause a greater code changes. Here is why:
> > 1) the error might be IDT-specific, so we shouldn't limit the API due to
> > one particular hardware peculiarity,
> > 2) port-index table with global indexes implementation shall simplify the IDT
> > NTB hw driver and provide a cleaner NTB API with simpler shared resources
> > utilization code.
> 
> > The final decision is after the NTB subsystem maintainers. If they agree with
> > solution #1 I'll send a corresponding patchset on this week, so you can
> > alter this patchset being based on it.
> 
> I think what we have right now is close enough and we just have to clean
> up the code and fix things. I don't think we need to do another big
> change to the semantics. I *certainly* don't want to risk breaking
> everything again to do it.
>  Logan
> 
> -- 
> You received this message because you are subscribed to the Google Groups "linux-ntb" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to linux-ntb+unsubscribe@googlegroups.com.
> To post to this group, send email to linux-ntb@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/linux-ntb/bd72f24f-5982-0fe7-59df-2fbbfe9f798a%40deltatee.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 09/12] NTB: Introduce MSI library
  2019-03-06 21:35     ` Logan Gunthorpe
@ 2019-03-06 23:13       ` Serge Semin
  0 siblings, 0 replies; 27+ messages in thread
From: Serge Semin @ 2019-03-06 23:13 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang,
	Eric Pilmore

On Wed, Mar 06, 2019 at 02:35:53PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-03-06 1:26 p.m., Serge Semin wrote:
> > First of all, It might be unsafe to have some resources consumed by NTB
> > MSI or some other library without a simple way to warn NTB client drivers
> > about their attempts to access that resources, since it might lead to random
> > errors. When I thought about implementing a transport library based on the
> > Message/Spad+Doorbell registers, I had in mind to create an internal bits-field
> > array with the resources busy-flags. If, for instance, some message or
> > scratchpad register is occupied by the library (MSI, transport or some else),
> > then it would be impossible to access these resources directly through NTB API
> > methods. So NTB client driver shall retrieve an error in an attempt to
> > write/read data to/from busy message or scratchpad register, or in an attempt
> > to set some occupied doorbell bit. The same thing can be done for memory windows.
> 
> Yes, it would be nice to have a generic library to manage all the
> resources, but right now we don't and it's unfair to expect us to take
> on this work to get the features we care about merged. Right now, it's
> not at all unsafe as the client is quite capable of ensuring it has the
> resources for the MSI library. The changes for ntb_transport to ensure
> this are quite reasonable.
> 
> > Second tiny concern is about documentation. Since there is a special file for
> > all NTB-related doc, it would be good to have some description about the
> > NTB MSI library there as well:
> > Documentation/ntb.txt
> 
> Sure, I'll add a short blurb for v3. Though, I noticed it's quite out of
> date since your changes. Especially in the ntb_tool section...
> 

Ok. Thanks.
If you want you can add some info to the ntb_tool section as well. If you
don't have time, I'll update it next time I submit anything new to the
subsystem.

-Sergey

> >> +	u32 *peer_mws[];
> > 
> > Shouldn't we use the __iomem attribute here since later the devm_ioremap() is
> > used to map MWs at these pointers?
> 
> Yes, will change for v3.
> 
> 
> > Simpler and faster cleanup-code would be:
> 
> > + unroll:
> > + 	for (--i; i >= 0; --i)
> > + 		devm_iounmap(&ntb->dev, ntb->msi->peer_mws[i]);
> 
> Faster, maybe, but I would not consider this simpler. It's much more
> complicated to reason about and ensure it's correct. I prefer my way
> because I don't care about speed, but I do care about readability.
> 
> 
> > Alas calling the ntb_mw_set_trans() method isn't enough to fully initialize
> > NTB Memory Windows. Yes, the library will work for Intel/AMD/Switchtec
> > (two-ports legacy configuration), but will fail for IDT due to being based on
> > the outbound MW xlat interface. So the library at this stage isn't portable
> > across all NTB hardware. In order to make it working the translation address is
> > supposed to be transferred to the peer side, where a peer code should call
> > ntb_peer_mw_set_trans() method with the retrieved xlat address.
> > See documentation for details:
> 
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/ntb.txt
> > 
> > ntb_perf driver can be also used as a reference of the portable NTB MWs setup.
> 
> Gross. Well, given that ntb_transport doesn't even support this and we
> don't really have a sensible library to transfer this information, I'm
> going to leave it as is for now. Someone can update ntb_msi when they
> update ntb_transport, preferably after we have a nice library to handle
> the transfers for us seeing I absolutely do not want to replicate the
> mess in ntb_perf.
> 
> Actually, if we had a generic spad/msg communication library, it would
> probably be better to have a common ntb_mw_set_trans() function that
> uses the communications library to send the data and automatically call
> ntb_peer_mw_set_trans() on the peer. That way we don't have to push this
> mess into the clients.
> 
> > The same cleanup pattern can be utilized here:
> > +error_out:
> > +	for (--peer; peer >= 0; --peer) {
> > +		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
> > +		ntb_mw_clear_trans(ntb, i, peer_widx);
> > +	}
> > 
> > So you won't need "i" variable here anymore. You also don't need to check the
> > return value of ntb_peer_highest_mw_idx() in the cleanup loop because it
> > was already checked in the main algo code.
> 
> See above.
> 
> >> +EXPORT_SYMBOL(ntb_msi_clear_mws);
> >> +
> > 
> > Similarly something like ntb_msi_peer_clear_mws() should be added to
> > unset a translation address on the peer side.
> 
> Well, we can table that for when ntb_msi supports the peer MW setting
> functions.
> >> +int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
> >> +			 struct ntb_msi_desc *desc)
> >> +{
> >> +	int idx;
> >> +
> >> +	if (!ntb->msi)
> >> +		return -EINVAL;
> >> +
> >> +	idx = desc->addr_offset / sizeof(*ntb->msi->peer_mws[peer]);
> >> +
> >> +	ntb->msi->peer_mws[peer][idx] = desc->data;
> >> +
> > 
> > Shouldn't we use iowrite32() here instead of direct access to the IO-memory?
> 
> Yes, as above I'll fix it for v3.
> 
> >> @@ -425,6 +427,10 @@ struct ntb_dev {
> >>  	spinlock_t			ctx_lock;
> >>  	/* block unregister until device is fully released */
> >>  	struct completion		released;
> >> +
> >> +	#ifdef CONFIG_NTB_MSI
> >> +	struct ntb_msi *msi;
> >> +	#endif
> > 
> > I'd align the macro-condition to the most left position:
> > +#ifdef CONFIG_NTB_MSI
> > +	struct ntb_msi *msi;
> > +#endif
> 
> Fixed for v3.
> 
> 
> Logan
> 
> -- 
> You received this message because you are subscribed to the Google Groups "linux-ntb" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to linux-ntb+unsubscribe@googlegroups.com.
> To post to this group, send email to linux-ntb@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/linux-ntb/5b420eb7-5010-aae3-e9bd-1c612af409ae%40deltatee.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 07/12] NTB: Introduce functions to calculate multi-port resource index
  2019-03-06 22:45       ` Serge Semin
@ 2019-03-06 23:22         ` Logan Gunthorpe
  2019-03-12 20:42           ` Serge Semin
  0 siblings, 1 reply; 27+ messages in thread
From: Logan Gunthorpe @ 2019-03-06 23:22 UTC (permalink / raw)
  To: Serge Semin, linux-kernel, linux-ntb, linux-pci, iommu,
	linux-kselftest, Jon Mason, Bjorn Helgaas, Joerg Roedel,
	Allen Hubbe, Dave Jiang, Eric Pilmore





On 2019-03-06 3:45 p.m., Serge Semin wrote:
[Snip]

Pretty sure everything above is just agreement...

> So your current approach is inbound MW-centralized, while mine is developed
> around the outbound MWs.

I don't think this has anything to do with inbound vs outbound. The
problem is the same no matter from which direction you assign things.

>> Physical Port Number: 0  2  4  6  8  12  16  20
>> Logical Port Number:  0  1  2  3  4  5   6   7
>> Peer Index (Port 0):  x  0  1  2  3  4   5   6
>> Port Index (Port 8):  0  1  2  3  x  4   5   6
>> (etc)
> 
> That's what I suggested in the two possible solutions:
> 1st solution: replace current pidx with Logical Port Number,
> 2nd solution: alter ntb_peer_resource_idx() so it would return the Logical Port Number.

Well, my solution wasn't to change pidx and no, we don't want to change
ntb_peer_resource_idx() to return the logical port number because that's
not how it's being used, and I have no use for an
ntb_peer_port_global_idx() function. Functions are supposed to be added
by code that needs to call them, not by some grand design. Part of the
reason we have this confusing mess is because the API was reviewed and
merged before any users of the API were presented. Usually this is not
accepted in kernel development.

My suggestion is to simply say that the existing port numbers are the
logical port number and have the drivers handle the physical port number
mapping internally. That requires the fewest changes.

> IMO, in case of the 2nd solution I'd also suggest renaming the
> ntb_peer_resource_idx() method to ntb_peer_port_global_idx(),
> and then considering the current port indexes used in the NTB API
> as local port indexes. The resource indexing can be abstracted
> by a macro like this:
> #define ntb_peer_resource_idx ntb_peer_port_global_idx

That define would not be useful.

> Finally, to round out the API we'd also need to define
> a method, ntb_port_global_idx(), which would return the Logical (global)
> index of the local port.

Again, I'd rather not add a bunch of large semantic and infrastructure
changes at this point. It's confusing enough as it is and we don't need
to introduce yet another indexing scheme API to the clients that really
do not need it. What the clients need is a simple API to decide which
resources to use for which peers, and to figure out which peers used
which resources. ntb_peer_port_idx() and ntb_peer_resource_idx() suit
these purposes. Nothing else really needs to exist.
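
To sketch the client-side pattern I have in mind (illustrative only):

static void client_setup_peers(struct ntb_dev *ntb)
{
	int pidx, ridx;

	/* One resource per peer, chosen by ntb_peer_resource_idx();
	 * no other indexing scheme is needed by the client. */
	for (pidx = 0; pidx < ntb_peer_port_count(ntb); pidx++) {
		ridx = ntb_peer_resource_idx(ntb, pidx);
		/* use doorbell bit / memory window 'ridx' for peer 'pidx' */
	}
}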

>> Where the Physical Port Number is whatever the hardware uses and the
>> logical port number is a numbering scheme starting with zero with no
>> gaps. Then the port indexes are still as we currently have them. If we
>> say that the port numbers we have now are the Logical Port Number, then
>> ntb_peer_resource_idx() is correct.
>>
> 
> Current port numbers are the physical port numbers with gaps. 

I think that's up for interpretation as, based on the existing code, I
naturally interpreted it the other way and therefore it's pretty simple
to say that it's the logical port number and fix the one driver that
needs to change.

> That's why we
> introduced the port-index NTB API abstraction in the first place: to have these gaps
> eliminated and to provide a simple way of bulk setup. That abstraction, though,
> turned out not to be that suitable for distributing the shared resources, so
> the Logical (Global) indexing is needed for that (it's what ntb_pingpong used
> to do and ntb_perf still does now).

My interpretation of the port index was simply to match what was done in
the two-port case, seeing that code like ntb_transport simply uses the
default 0 as the port index. There was no reason to believe, based on the
code, that there would be gaps.

>> I would strongly argue that the clients don't need to know anything
>> about the Physical Port Number and these should be handled strictly
>> inside the drivers. If multiple drivers need to do something similar to
>> map the logical to physical port numbers then we should introduce helper
>> functions to allow them to do so. If the Physical Numbers are not
>> contained in the driver, then the API would need to be expanded to expose
>> which numbers are actually used to avoid needing to constantly loop
>> through all the indexes to find this out.
>>
> 
> Absolutely agree with you. The main idea of the NTB API was to provide a set
> of methods to access the NTB hardware without any abstractions, but
> with possibly useful helpers like your NTB MSI library, the transport library,
> or anything else. So the physical port numbers must be available to
> the client drivers.

Huh? How can you say you absolutely agree with me? I said the clients
should not need to know about physical port numbers and you said the
physical port numbers *must* be available to clients. I think this
statement needs to be justified. Why should the clients need to know
about the physical port numbers?

>> On a similar vein, I'd  suggest that most clients shouldn't even really
>> need to do anything with the Logical Port Number and should deal largely
>> with Port Indexes. Ideally, Logical Port Numbers should only be used by
>> helper code in the common layer to help assign resources used by the
>> clients (like ntb_peer_resource_idx()).
>>
> 
> This is the main question. Do we really need the current port-index
> implementation at all? After all these years of NTB API usage I don't really
> see it being useful in any case except looping over the outbound MW resources
> while automatically skipping the local port (the usefulness of which is also
> questionable). As I already said, I created the port-index table this way due
> to the IDT NTB MW peculiarity, which doesn't seem like a big problem to me now
> compared to all these additional complications we intend to introduce.

To me, it seems like the port indexes are used everywhere. A client
almost always wants to deal with every port except itself. That seems
entirely natural to me. Using the port index as the resource index makes
perfect sense (which is why I implemented the inverse
ntb_peer_resource_idx()).
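
To sketch the relationship (taking gap-free logical port numbers as
inputs; not the actual implementation):

/* The resource index a peer sets aside for us is just our port index
 * as seen from that peer's side. */
static int resource_idx(int local_logical, int peer_logical)
{
	return local_logical < peer_logical ?
		local_logical : local_logical - 1;
}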

> The rest of the driver code really needs the Logical (global)
> port indexes, at least to distribute the shared resources, and doesn't use
> the current pidx that much.

Drivers don't care about port indexes or how the resources are
distributed. I would expect that all they'd do is map the port index
(which all existing APIs take) to the physical address and program
hardware as necessary.

> Wouldn't it be better to just redefine the current port-index table
> in the following way?
> ntb_port_number() - local physical port number,
> ntb_port_idx() - logical (global) index of the local port,
> ntb_peer_port_count() - total number of ports the NTB device provides (including the local one),
> ntb_peer_port_number() - physical port number of the peer with the passed logical port index,
> ntb_peer_port_idx() - logical port index of the passed physical port number.

No, from the perspective of the client, the physical port numbers are
useless. I don't want the count to include the local one because that
means I'd always have to subtract one every time I want to loop through
all peers (I never want to loop through all ports including the local
one).  I have no idea what your distinction between logical (global)
index and logical port index is. In fact I find this all confusing.

> while currently we have:
> ntb_port_number() - local physical port number,
> ntb_peer_port_count() - total number of ports the NTB device provides (excluding the local one),
> ntb_peer_port_number() - physical port number of the peer with the passed port index,
> ntb_peer_port_idx() - port index of the passed physical port number;

What we currently have all makes perfect sense to me and is exactly what
I want for the clients, except I think ntb_port_number() needs to be the
logical port number (with the gaps removed), so that we can more easily
implement ntb_peer_resource_idx(). If ntb_port_number() corresponds to
the physical port number, then the correct implementation of
ntb_peer_resource_idx() is going to be crazy, essentially needing to
generate a logical port number for every port, every time it's called.

Logan


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 07/12] NTB: Introduce functions to calculate multi-port resource index
  2019-03-06 23:22         ` Logan Gunthorpe
@ 2019-03-12 20:42           ` Serge Semin
  2019-03-12 21:30             ` Logan Gunthorpe
  0 siblings, 1 reply; 27+ messages in thread
From: Serge Semin @ 2019-03-12 20:42 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-ntb, linux-pci, iommu, linux-kselftest,
	Jon Mason, Bjorn Helgaas, Joerg Roedel, Allen Hubbe, Dave Jiang,
	Eric Pilmore

On Wed, Mar 06, 2019 at 04:22:33PM -0700, Logan Gunthorpe wrote:
> 
> On 2019-03-06 3:45 p.m., Serge Semin wrote:
> [Snip]
> 
> Pretty sure everything above is just agreement...
> 
> > So your current approach is inbound MW-centralized, while mine is developed
> > around the outbound MWs.
> 
> I don't think this has anything to do with inbound vs outbound. The
> problem is the same no matter from which direction you assign things.
> 
> >> Physical Port Number: 0  2  4  6  8  12  16  20
> >> Logical Port Number:  0  1  2  3  4  5   6   7
> >> Peer Index (Port 0):  x  0  1  2  3  4   5   6
> >> Port Index (Port 8):  0  1  2  3  x  4   5   6
> >> (etc)
> > 
> > That's what I suggested in the two possible solutions:
> > 1st solution: replace current pidx with Logical Port Number,
> > 2nd solution: alter ntb_peer_resource_idx() so it would return the Logical Port Number.
> 
> Well my solution wasn't to change pidx and no, we don't want to change
> ntb_peer_resource_idx() to return the logical port number because that's
> not how it's being used and I have no use for a
> ntb_peer_port_global_idx() function. Functions are supposed to be added
> by code that needs to call them, not by some grand design. Part of the
> reason we have this confusing mess is because the API was reviewed and
> merged before any users of the API were presented. Usually this is not
> accepted in kernel development.
> 
> My suggestion is to simply say that the existing port numbers are the
> logical port number and have the drivers handle the physical port number
> mapping internally. That requires the fewest changes.
> 

Now I see what you were suggesting. The solution you propose doesn't require
any changes in your patches, but it requires updating the IDT driver. It also
changes the port numbering semantics, which you said you didn't want to do.
(Read all further comments to get my conclusion regarding it.)

> > IMO, in case of the 2nd solution I'd also suggest renaming the
> > ntb_peer_resource_idx() method to ntb_peer_port_global_idx(),
> > and then considering the current port indexes used in the NTB API
> > as local port indexes. The resource indexing can be abstracted
> > by a macro like this:
> > #define ntb_peer_resource_idx ntb_peer_port_global_idx
> 
> That define would not be useful.
> 
> > Finally, to round out the API we'd also need to define
> > a method, ntb_port_global_idx(), which would return the Logical (global)
> > index of the local port.
> 
> Again, I'd rather not add a bunch of large semantic and infrastructure
> changes at this point. It's confusing enough as it is and we don't need
> to introduce yet another indexing scheme API to the clients that really
> do not need it. What the clients need is a simple API to decide which
> resources to use for which peers, and to figure out which peers used
> which resources. ntb_peer_port_idx() and ntb_peer_resource_idx() suit
> these purposes. Nothing else really needs to exist.
> 

If you don't want to add a large semantic and infrastructure change at
this point, then it would be better to leave the NTB port API the way it
is now and use logical port indexes in the ntb_peer_resource_idx() method.
You'd also need to use this method to access both outbound and inbound
memory windows. In this case we wouldn't need to change anything in the
current API and drivers. Yes, this approach would waste one resource, but
it would lead to simpler semantics for the port numbers API.

Don't get me wrong. I am not completely against the approach you suggest.
It just has its pros and cons, the same way mine does. I'll explain my
opinion later in this email.

> >> Where the Physical Port Number is whatever the hardware uses and the
> >> logical port number is a numbering scheme starting with zero with no
> >> gaps. Then the port indexes are still as we currently have them. If we
> >> say that the port numbers we have now are the Logical Port Number, then
> >> ntb_peer_resource_idx() is correct.
> >>
> > 
> > Current port numbers are the physical port numbers with gaps. 
> 
> I think that's up for interpretation as, based on the existing code, I
> naturally interpreted it the other way and therefore it's pretty simple
> to say that it's the logical port number and fix the one driver that
> needs to change.
> 
> > That's why we
> > introduced the port-index NTB API abstraction in the first place: to have these gaps
> > eliminated and to provide a simple way of bulk setup. That abstraction, though,
> > turned out not to be that suitable for distributing the shared resources, so
> > the Logical (Global) indexing is needed for that (it's what ntb_pingpong used
> > to do and ntb_perf still does now).
> 
> My interpretation of the port index was simply to match what was done in
> the two-port case, seeing that code like ntb_transport simply uses the
> default 0 as the port index. There was no reason to believe, based on the
> code, that there would be gaps.
> 

Sorry man, but how could you base your interpretation on code which didn't
support the multi-port case in the first place and by definition couldn't
give you the full picture? You knew ntb_transport doesn't support multi-port
NTB devices, right? Moreover, as far as I remember we already discussed a
similar problem when reviewing your patches for the ntb_pingpong driver.
You knew the ntb_pingpong and ntb_perf drivers support multi-port devices,
so you could have gotten your interpretation from there.

BTW I didn't figure it out at that time, but you could fix the ntb_pingpong
driver just by replacing the strict inequality in the conditional statement:
--- a/drivers/ntb/test/ntb_pingpong.c
+++ b/drivers/ntb/test/ntb_pingpong.c
@@ -299,7 +299,7 @@ static void pp_init_flds(struct pp_ctx *pp)
        lport = ntb_port_number(pp->ntb);
        pcnt = ntb_peer_port_count(pp->ntb);
        for (pidx = 0; pidx < pcnt; pidx++) {
-               if (lport < ntb_peer_port_number(pp->ntb, pidx))
+               if (lport <= ntb_peer_port_number(pp->ntb, pidx))
                        break;
        }

This loop just finds the logical index (as you named it) of the local port,
which is then used to select a doorbell bit from the shared doorbell register.
A similar algorithm is used in the ntb_perf driver.

> >> I would strongly argue that the clients don't need to know anything
> >> about the Physical Port Number and these should be handled strictly
> >> inside the drivers. If multiple drivers need to do something similar to
> >> map the logical to physical port numbers then we should introduce helper
> >> functions to allow them to do so. If the Physical Numbers are not
> >> contained in the driver, then the API would need to be expanded to expose
> >> which numbers are actually used to avoid needing to constantly loop
> >> through all the indexes to find this out.
> >>
> > 
> > Absolutely agree with you. The main idea of the NTB API was to provide a set
> > of methods to access the NTB hardware without any abstractions, but
> > with possibly useful helpers like your NTB MSI library, the transport library,
> > or anything else. So the physical port numbers must be available to
> > the client drivers.
> 
> Huh? How can you say you absolutely agree with me? I said the clients
> should not need to know about physical port numbers and you said the
> physical port numbers *must* be available to clients. I think this
> statement needs to be justified. Why should the clients need to know
> about the physical port numbers?
> 

Ah, I misunderstood your statement. I thought you meant it the other way around.
I disagree then. Client drivers should somehow be able to retrieve the real
physical port number. Physical port numbers can be connected with port-specific
functionality (and they are in our projects), so they are used to enable/disable
the corresponding code in the client drivers.

Additionally, IDT PCIe NTB functions can be enabled/disabled by the device
firmware, so one device may have, for instance, ports 2 4 6 with NTB while
another has ports 2 6 12, and so on. Both of these port sets would map to the
same logical index space: 0 1 2. With just the logical indexes, the code would
appear to be working with the same hardware when in fact it isn't.

You said: "Part of the reason we have this confusing mess is because the API was
reviewed and merged before any users of the API were presented. Usually this is not
accepted in kernel development." The source code of my project uses the current port
API and is happy with it, so there was at least one user of the API at the time of
its submission. I bet there are others now, since I constantly get private questions
regarding the IDT hardware configurations. So please don't be so judgemental. If you
see things that are confusing from your point of view, it doesn't always mean it is
a mess; you may just be misunderstanding something. I am sure a pro with your
experience doesn't need this explained.

> >> On a similar vein, I'd  suggest that most clients shouldn't even really
> >> need to do anything with the Logical Port Number and should deal largely
> >> with Port Indexes. Ideally, Logical Port Numbers should only be used by
> >> helper code in the common layer to help assign resources used by the
> >> clients (like ntb_peer_resource_idx()).
> >>
> > 
> > This is the main question. Do we really need the current port-index
> > implementation at all? After all these years of NTB API usage I don't really
> > see it being useful in any case except looping over the outbound MW resources
> > while automatically skipping the local port (the usefulness of which is also
> > questionable). As I already said, I created the port-index table this way due
> > to the IDT NTB MW peculiarity, which doesn't seem like a big problem to me now
> > compared to all these additional complications we intend to introduce.
> 
> To me, it seems like the port indexes are used everywhere. A client
> almost always wants to deal with every port except itself. That seems
> entirely natural to me. Using the port index as the resource index makes
> perfect sense (which is why I implemented the inverse
> ntb_peer_resource_idx()).
> 
> > The rest of the driver code really needs the Logical (global)
> > port indexes, at least to distribute the shared resources, and doesn't use
> > the current pidx that much.
> 
> Drivers don't care about port indexes or how the resources are
> distributed. I would expect that all they'd do is map the port index
> (which all existing APIs take) to the physical address and program
> hardware as necessary.
> 
> > Wouldn't it be better to just redefine the current port-index table
> > in the following way?
> > ntb_port_number() - local physical port number,
> > ntb_port_idx() - logical (global) index of the local port,
> > ntb_peer_port_count() - total number of ports the NTB device provides (including the local one),
> > ntb_peer_port_number() - physical port number of the peer with the passed logical port index,
> > ntb_peer_port_idx() - logical port index of the passed physical port number.
> 
> No, from the perspective of the client, the physical port numbers are
> useless. I don't want the count to include the local one because that
> means I'd always have to subtract one every time I want to loop through
> all peers (I never want to loop through all ports including the local
> one).  I have no idea what your distinction between logical (global)
> index and logical port index is. In fact I find this all confusing.
> 
> > while currently we have:
> > ntb_port_number() - local physical port number,
> > ntb_peer_port_count() - total number of ports the NTB device provides (excluding the local one),
> > ntb_peer_port_number() - physical port number of the peer with the passed port index,
> > ntb_peer_port_idx() - port index of the passed physical port number;
> 
> What we currently have all makes perfect sense to me and is exactly what
> I want for the clients, except I think ntb_port_number() needs to be the
> logical port number (with the gaps removed), so that we can more easily
> implement ntb_peer_resource_idx(). If ntb_port_number() corresponds to
> the physical port number, then the correct implementation of
> ntb_peer_resource_idx() is going to be crazy, essentially needing to
> generate a logical port number for every port, every time it's called.
> 
> Logan
> 

To sum the discussion up, we now have two possible solutions to the problem
discovered in this patch:

1) Leave the current NTB port API as is, but fix this patch so the
ntb_peer_resource_idx() method returns a logical port index:
Physical port numbers: 0 2 4 6 8 12 16 20 (ntb_peer_port_number()/ntb_port_number())
Logical port numbers:  0 1 2 3 4  5  6  7 (ntb_peer_resource_idx())
Port indexes (Port 0): x 0 1 2 3  4  5  6 (ntb_peer_port_idx())
Port indexes (Port 8): 0 1 2 3 x  4  5  6 (ntb_peer_port_idx())
+ pros:
  no need to change the current NTB port API and available drivers;
  no need to apply the gap-less limitation on the NTB port numbers sequence;
- cons:
  the ntb_peer_resource_idx() method must be used to distribute a shared resource
on each side of the NTB (although it might be neutral or even a pro from some
point of view);
  one memory window is wasted (no problem with the shared doorbell register).

2) Apply the gap-less limitation to the NTB port numbers sequence, so that
ntb_peer_port_number()/ntb_port_number() would return the logical port index:
Physical port numbers: 0 2 4 6 8 12 16 20 (no NTB API method to get these numbers)
Logical port numbers:  0 1 2 3 4  5  6  7 (ntb_peer_port_number()/ntb_port_number())
Port indexes (Port 0): x 0 1 2 3  4  5  6 (ntb_peer_port_idx())
Port indexes (Port 8): 0 1 2 3 x  4  5  6 (ntb_peer_port_idx())
+ pros:
  no memory window is wasted;
  the port numbers returned from ntb_peer_port_number()/ntb_port_number() can be
used to distribute doorbell bits and outbound MWs;
  ntb_peer_resource_idx() is used just on the peer side to select the proper inbound MW;
- cons:
  the port-index callbacks of the IDT hardware driver need to be changed;
  it introduces a change in the port number semantics;
  the NTB port API becomes an abstraction over the real physical port numbers,
while the latter won't be directly retrievable.


The rest of the solutions would lead to overcomplications in the NTB port API,
which we don't want to introduce.

Personally, after all these considerations I am now more inclined toward
the 2nd solution. Even though it causes more changes and makes the ports
API more abstract, it allows for simpler shared-resource distribution code
and distributes exactly the necessary number of memory windows, while the
physical port number can still be found by client drivers directly from
the pci_dev descriptor.
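
For instance (just a sketch; the devfn encoding here is an assumption for
illustration, not the documented IDT layout):

/* Hypothetical: recover the physical port from the PCI topology
 * rather than from the NTB API. */
int phys_port = PCI_SLOT(ntb->pdev->devfn);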

The final decision regarding the solution is up to the subsystem maintainers.
But although the NTB MSI library provided by this patchset makes some use of
the multi-port API, like the MW distribution, as I said in comments to the
other patches it doesn't really support the only multi-port NTB device
currently available - IDT (the only one I am interested in). So I don't see
a reason why I should bother providing a patch with alterations to the IDT
hardware driver unless this patchset provides a portable NTB MSI library.

-Sergey


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 07/12] NTB: Introduce functions to calculate multi-port resource index
  2019-03-12 20:42           ` Serge Semin
@ 2019-03-12 21:30             ` Logan Gunthorpe
  0 siblings, 0 replies; 27+ messages in thread
From: Logan Gunthorpe @ 2019-03-12 21:30 UTC (permalink / raw)
  To: Serge Semin, linux-kernel, linux-ntb, linux-pci, iommu,
	linux-kselftest, Jon Mason, Bjorn Helgaas, Joerg Roedel,
	Allen Hubbe, Dave Jiang, Eric Pilmore



On 2019-03-12 2:42 p.m., Serge Semin wrote:
> If you don't want to add a large semantic and infrastructure change at
> this point, then it would be better to leave the NTB port API the way it
> is now and use logical port indexes in the ntb_peer_resource_idx() method.
> You'd also need to use this method to access both outbound and inbound
> memory windows. In this case we wouldn't need to change anything in the
> current API and drivers. Yes, this approach would waste one resource, but
> it would lead to simpler semantics for the port numbers API.

Well, my proposal would not require any changes in the API; it would just
require a change to the IDT driver. And the mess in the test tools will
still be a mess.


> Sorry man, but how could you base your interpretation on code which didn't
> support the multi-port case in the first place and by definition couldn't
> give you the full picture? You knew ntb_transport doesn't support multi-port
> NTB devices, right? Moreover, as far as I remember we already discussed a
> similar problem when reviewing your patches for the ntb_pingpong driver.
> You knew the ntb_pingpong and ntb_perf drivers support multi-port devices,
> so you could have gotten your interpretation from there.

Yes, but the test drivers were a mess and difficult to follow. Only now
am I realizing that ntb_perf required one too many resources by design,
and is thus probably very broken for cases that previously worked.

> 
> BTW I didn't figure it out at that time, but you could fix the ntb_pingpong
> driver just by replacing the strict inequality in the conditional statement:
> --- a/drivers/ntb/test/ntb_pingpong.c
> +++ b/drivers/ntb/test/ntb_pingpong.c
> @@ -299,7 +299,7 @@ static void pp_init_flds(struct pp_ctx *pp)
>         lport = ntb_port_number(pp->ntb);
>         pcnt = ntb_peer_port_count(pp->ntb);
>         for (pidx = 0; pidx < pcnt; pidx++) {
> -               if (lport < ntb_peer_port_number(pp->ntb, pidx))
> +               if (lport <= ntb_peer_port_number(pp->ntb, pidx))
>                         break;
>         }
> 
> This loop just finds the logical index (as you named it) of the local port,
> which is then used to select a doorbell bit from the shared doorbell register.
> A similar algorithm is used in the ntb_perf driver.

Yes, it just feels overly complex to have to do that loop every time.


> Ah, I misunderstood your statement. I thought you meant it the other way around.
> I disagree then. Client drivers should somehow be able to retrieve the real
> physical port number. Physical port numbers can be connected with port-specific
> functionality (and they are in our projects), so they are used to enable/disable
> the corresponding code in the client drivers.

Ok, you're saying that the user will need to be able to map these ports
somehow to their physical address. I buy that, but NTB transport, for
example, doesn't really have any method for this. You just get network
interfaces that will likely be numbered similarly to the logical port
number. But that's a whole other problem that will need to be solved
later when there is a multi-port ntb_transport.

> You said: "Part of the reason we have this confusing mess is because the API was
> reviewed and merged before any users of the API were presented. Usually this is not
> accepted in kernel development." The source code of my project uses the current port
> API and is happy with it, so there was at least one user of the API at the time of
> its submission. I bet there are others now, since I constantly get private questions
> regarding the IDT hardware configurations. So please don't be so judgemental. If you
> see things that are confusing from your point of view, it doesn't always mean it is
> a mess; you may just be misunderstanding something. I am sure a pro with your
> experience doesn't need this explained.

Users of code that is not upstream do not count. Developers cannot look
at that code and reason about how the API is supposed to be used. This
isn't being judgemental; it's stating a fact: kernel developers do not
like having incomplete code upstream[1][2]. When it happens, it just
makes the code very hard to maintain and very hard to develop on. Many
developers call this a disaster and commonly call for the code in
question to be removed. I can understand this completely because I'm
facing the exact same issues trying to work with the current upstream
NTB code.

> - cons:
>   the ntb_peer_resource_idx() method must be used to distribute a shared resource
> on each side of the NTB (although it might be neutral or even a pro from some
> point of view);
>   one memory window is wasted (no problem with the shared doorbell register).

I think wasting one memory window is a no-go and should never have been
merged in the first place. The code originally worked fine in the
situation where you have 2 peers, each with one memory window, and that
needs to be maintained.

> The rest of the solutions would lead to overcomplications in the NTB port API,
> which we don't want to introduce.

Well, frankly, it's a mess right now so we just have to deal with it and
try to find a short term solution to start fixing it. Complexity be damned.

> Personally, after all these considerations I am now more inclined toward
> the 2nd solution. Even though it causes more changes and makes the ports
> API more abstract, it allows for simpler shared-resource distribution code
> and distributes exactly the necessary number of memory windows, while the
> physical port number can still be found by client drivers directly from
> the pci_dev descriptor.

Ok, I think for v3 I'll introduce a logical_port_number helper function
which is essentially the loop you proposed. It's needlessly slow (although
speed isn't an issue) and ugly, but at least, I think, it should be as
close to correct as we can get. Someone else will have to eventually
clean up all the test tools because I suspect they are still broken (but
they at least work for me after my fixup set was merged). I personally
have no interest in working on NTB code anymore unless I am being
contracted to do so.
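
Roughly something like this (just a sketch based on your loop; it assumes
ntb_peer_port_number() reports the peers in ascending order, and the
final name may differ):

/* Count how many peers come before the local port, which yields a
 * gap-free logical number for the local port. */
static int ntb_logical_port_number(struct ntb_dev *ntb)
{
	int lport = ntb_port_number(ntb);
	int pidx;

	for (pidx = 0; pidx < ntb_peer_port_count(ntb); pidx++) {
		if (lport <= ntb_peer_port_number(ntb, pidx))
			break;
	}

	return pidx;
}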

> The final decision regarding the solution is up to the subsystem maintainers.
> But although the NTB MSI library provided by this patchset makes some use of
> the multi-port API, like the MW distribution, as I said in comments to the
> other patches it doesn't really support the only multi-port NTB device
> currently available - IDT (the only one I am interested in). So I don't see
> a reason why I should bother providing a patch with alterations to the IDT
> hardware driver unless this patchset provides a portable NTB MSI library.

Yes, well, the reason it won't work is the current mess, which I don't
feel I should be responsible for sorting out. My code works correctly
for the existing ntb_transport and I've made a best effort to get as
much multi-port support as I feel I can, given the available
infrastructure.

Logan

[1] https://lwn.net/Articles/757124/
[2] https://lwn.net/ml/linux-kernel/20190130075256.GA29665@lst.de/

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2019-03-12 21:30 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-13 17:54 [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Logan Gunthorpe
2019-02-13 17:54 ` [PATCH v2 01/12] iommu/vt-d: Implement dma_[un]map_resource() Logan Gunthorpe
2019-02-13 17:57   ` Logan Gunthorpe
2019-02-13 17:54 ` [PATCH v2 02/12] NTB: ntb_transport: Ensure the destination buffer is mapped for TX DMA Logan Gunthorpe
2019-02-13 17:54 ` [PATCH v2 03/12] iommu/vt-d: Add helper to set an IRTE to verify only the bus number Logan Gunthorpe
2019-02-13 17:54 ` [PATCH v2 04/12] iommu/vt-d: Allow interrupts from the entire bus for aliased devices Logan Gunthorpe
2019-02-13 17:54 ` [PATCH v2 05/12] PCI/MSI: Support allocating virtual MSI interrupts Logan Gunthorpe
2019-02-13 17:54 ` [PATCH v2 06/12] PCI/switchtec: Add module parameter to request more interrupts Logan Gunthorpe
2019-02-13 17:54 ` [PATCH v2 07/12] NTB: Introduce functions to calculate multi-port resource index Logan Gunthorpe
2019-03-06  1:24   ` Serge Semin
2019-03-06 19:11     ` Logan Gunthorpe
2019-03-06 22:45       ` Serge Semin
2019-03-06 23:22         ` Logan Gunthorpe
2019-03-12 20:42           ` Serge Semin
2019-03-12 21:30             ` Logan Gunthorpe
2019-02-13 17:54 ` [PATCH v2 08/12] NTB: Rename ntb.c to support multiple source files in the module Logan Gunthorpe
2019-02-13 17:54 ` [PATCH v2 09/12] NTB: Introduce MSI library Logan Gunthorpe
2019-03-06 20:26   ` Serge Semin
2019-03-06 21:35     ` Logan Gunthorpe
2019-03-06 23:13       ` Serge Semin
2019-02-13 17:54 ` [PATCH v2 10/12] NTB: Introduce NTB MSI Test Client Logan Gunthorpe
2019-03-06 20:44   ` Serge Semin
2019-03-06 21:39     ` Logan Gunthorpe
2019-02-13 17:54 ` [PATCH v2 11/12] NTB: Add ntb_msi_test support to ntb_test Logan Gunthorpe
2019-02-13 17:54 ` [PATCH v2 12/12] NTB: Add MSI interrupt support to ntb_transport Logan Gunthorpe
2019-02-26  9:34 ` [PATCH v2 00/12] Support using MSI interrupts in ntb_transport Joerg Roedel
2019-02-26 16:18   ` Logan Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).